AOGC: Anchor-Free Oriented Object Detection Based on Gaussian Centerness

Wang, Zechen; Bao, Chun; Cao, Jie; Hao, Qun

doi:10.3390/rs15194690


¹ Beijing Institute of Technology, School of Optics and Photonics, Beijing 100081, China

² Yangtze River Delta Research Institute (Jiaxing), Beijing Institute of Technology, Jiaxing 314003, China

* Author to whom correspondence should be addressed.

Remote Sens. 2023, 15(19), 4690; https://doi.org/10.3390/rs15194690

Submission received: 10 August 2023 / Revised: 17 September 2023 / Accepted: 18 September 2023 / Published: 25 September 2023

(This article belongs to the Special Issue Object Detection and Information Extraction Based on Remote Sensing Imagery)


Abstract

Oriented object detection is a challenging task in scene text detection and remote sensing image analysis, and it has attracted extensive attention due to the development of deep learning in recent years. Currently, mainstream oriented object detectors are anchor-based methods, which increase the computational load of the network and produce a large number of redundant anchor boxes. To address this issue, we proposed an anchor-free oriented object detection method based on Gaussian centerness (AOGC), which is a single-stage anchor-free detection method. Our method uses a contextual attention FPN (CAFPN) to obtain the contextual information of the target. Then, we designed a label assignment method for oriented objects, which selects higher-quality positive samples and is suitable for targets with large aspect ratios. Finally, we developed a Gaussian kernel-based centerness branch that effectively determines the significance of different anchor points. AOGC achieved an mAP of 74.30% on the DOTA-1.0 dataset and 89.80% on the HRSC2016 dataset. Our experimental results show that AOGC outperforms other single-stage oriented object detection methods and achieves performance similar to that of two-stage methods.

Keywords:

remote sensing images; oriented object detection; one-stage; anchor-free; Gaussian kernel

1. Introduction

Due to the emergence of convolutional neural networks (CNNs) [1], object detection methods have developed rapidly in recent years and have reached a relatively mature stage. Object detection generally uses a horizontal bounding box (HBB) to localize targets. In recent years, with the development of satellite technology, oriented object detection based on oriented bounding boxes (OBB) in remote sensing images has attracted extensive attention from researchers [2]. However, general HBB object detection methods often encounter the following problems in this challenging task: (1) Because targets are densely arranged, horizontal bounding boxes often overlap significantly, as shown in Figure 1a. (2) Most remote sensing images are large and contain numerous tiny targets, which makes detection difficult. (3) The aspect ratio of targets varies greatly, and targets in remote sensing images often have large aspect ratios; with HBB, the target then occupies only a small proportion of the pixels in its bounding box, as shown in Figure 1b. These problems make it difficult for HBB methods to detect targets effectively; as such, methods based on OBB regression were developed and have become the mainstream for object detection in remote sensing images.

According to whether anchor boxes are preset, object detection methods can be divided into anchor-based and anchor-free methods. At present, most mainstream oriented object detectors are anchor-based. These methods first preset numerous dense anchor boxes; the detector then predicts the offset between each preset box and the ground truth. RoI Transformer [3] is a typical two-stage, anchor-based algorithm that borrows the framework of the Faster region-based convolutional neural network (Faster R-CNN) [4]. Its first stage presets dense anchor boxes on the feature map, outputs horizontal proposals, and extracts the features of each horizontal proposal through RoI Align. Unlike Faster R-CNN, it designs an RRoI learner that extracts rotated features from the horizontal features and uses them in the second stage of learning. Remote sensing images are typically large and high-resolution and contain many small targets. Anchor-based methods therefore often need to lay out dense anchor boxes, which causes an imbalance between positive and negative samples as well as anchor box redundancy in the first detection stage, resulting in low detection efficiency.

Therefore, in practical scenarios with requirements on detection efficiency, anchor-free detectors have advantages over anchor-based detectors. Anchor-free detectors [5,6,7] do not preset anchor boxes but regress the box parameters directly on the feature map. Fully convolutional one-stage object detection (FCOS) [5] is a feature point-based detector that only predicts the bounding box vector at each feature point and therefore has a lower computational cost. To distinguish the quality of different feature points, FCOS also designs a centerness branch that represents the distance from the feature point to the target center point. Previous work has shown that FCOS achieves excellent results in HBB object detection. However, to detect objects accurately in remote sensing images, FCOS must be adapted to this field, because its label assignment method and centerness branch were designed for HBB and are not suitable for OBB objects.

Based on the above observations, we proposed a novel anchor-free detector that uses FCOS as the baseline. We used a residual network (ResNet) [8] as the backbone. To better extract the feature information of targets, we designed a novel feature pyramid network (FPN) [9] structure that uses an attention mechanism to extract context information, which effectively improves the accuracy of the subsequent detection and classification tasks. To adapt to the detection of oriented targets, we designed a label assignment method suitable for oriented targets in the detection head. The positive samples selected by this assignment method have higher quality, and the method is suitable for targets with large aspect ratios. Finally, we used a two-dimensional Gaussian kernel function to design the centerness branch, which is better able to determine the significance of different anchor points. We conducted extensive experiments on two public oriented object detection datasets, DOTA and HRSC2016, and obtained competitive results.

Our contributions are as follows:

(1): We proposed a new anchor-free detector, AOGC (anchor-free oriented object detection based on Gaussian centerness), which uses FCOS as the baseline and adds a detection branch for oriented objects, giving the model a solid ability to detect oriented objects.
(2): We designed an FPN structure based on an attention mechanism that effectively extracts the targets’ contextual information and improves the network’s feature expression ability. This method is suitable for object detection in remote sensing images, which contain a large proportion of background pixels.
(3): We designed a label assignment method suitable for rotated boxes, which efficiently divides positive and negative samples and adapts to targets with large aspect ratios. We also designed a centerness branch based on a Gaussian kernel function for oriented detection tasks. The centerness branch is used to determine the significance of different anchor points and improve the detection quality of the network.
(4): Our method achieves an mAP of 74.30% and 89.80% on the DOTA and HRSC2016 datasets, respectively. The experimental results show that our method improves substantially on the baseline method and surpasses most anchor-free and single-stage oriented object detection approaches.

2. Materials and Methods

2.1. Related Works

2.1.1. Horizontal Object Detection

As CNNs continue to develop, the performance of object detectors also improves. Object detection generally refers to horizontal object detection, the process of detecting and locating a desired target with a horizontal bounding box. Mainstream horizontal object detection methods can be broadly classified into two-stage and one-stage methods.

Two-stage object detectors, such as the Faster R-CNN series [4,10,11], first generate RoIs, which are roughly divided into the background class and the objects to be detected. They then extract RoI features in the second stage to perform fine classification and localization. Two-stage detectors offer higher detection accuracy, but their inference is slower. Detectors such as the you only look once (YOLO) series [12,13,14,15], the single shot multibox detector (SSD) [16], and RetinaNet [17] are single-stage detectors that predict the full detection results in one step. Single-stage detectors are faster, but they are less accurate than two-stage detectors. Because targets in remote sensing images are densely arranged and vary sharply in orientation, these horizontal object detectors often struggle with problems such as a large proportion of background pixels and overlapping bounding boxes. Therefore, for object detection in remote sensing images, oriented object detection methods have clear advantages over horizontal object detection methods.

2.1.2. Oriented Object Detection

To handle densely arranged targets and rapid changes in target size, oriented object detection has emerged and has received significant attention in natural scene text and remote sensing images [18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38]. Typically, oriented object detectors use a general object detector as a starting point and then incorporate specialized modules to estimate OBBs from HBBs.

For example, rotation region proposal networks (RRPN) [18] detect oriented objects by directly presetting rotated anchor boxes. Rotational region CNN (R²CNN) [19] uses Faster R-CNN as the baseline, and the RPN is used to generate rotated proposals for subsequent detection. RoI Transformer [3] learns oriented RoIs from horizontal RoIs through an RRoI learner. R³Det [20] adds a feature optimization module, which reconstructs the feature map through bilinear interpolation to solve the feature misalignment caused by changes in the bounding box position. SCRDet (more robust detection for small, cluttered, and rotated objects) [21] mitigates the effect of angular periodicity by designing a novel Intersection over Union (IoU) smooth L1 loss. S²A-Net (align deep features for oriented object detection) [22] developed a new convolution method; unlike the random offsets of deformable convolution, it first predicts a rotated box and then calculates the difference between the rotated box and the preset box to obtain the offsets of the convolution kernel, thus aligning the rotated features. EDA (recalibrating features and regression for oriented object detection) [23] builds on Faster R-CNN, adding an affine transformation-based feature decoupling module and a post-classification regression module to simplify the classification process, and transforms it into an oriented object detection method. Oriented object detection methods designed for dense detection tasks have solved some existing problems to a certain extent, but the redundancy problem of anchor-based methods still exists.

2.1.3. Anchor-Free Detection

The generalized design of anchors is a crucial component of Faster R-CNN. In anchor-based detectors, anchor boxes act as predetermined sliding windows or proposal boxes, each of which must be classified as positive or negative. The network then refines the bounding box location by predicting an additional offset regression. Object detectors based on anchor boxes have become mainstream.

Anchor-based detectors are widely used in both horizontal and oriented object detection. However, the abundance of preset anchor boxes on the feature map leads to redundancy and increases the cost of the subsequent regression and NMS calculations. Therefore, anchor-free detectors were designed to localize objects directly without manually defined anchor boxes. For example, FCOS [5] generates feature points on the feature map; each feature point regresses its distance vector to the target box, and a centerness branch determines the significance of different feature points. CornerNet [6] obtains the positions of the upper left and lower right corners of the target through an hourglass network and then pairs the corner points through an embedding layer. CenterNet [7] directly regresses the center point of the target and then predicts the length and width of the target at the center point to obtain the final HBB. Box boundary-aware vectors (BBAVectors) [24] is an oriented anchor-free detection method based on CenterNet. It first predicts a heat map to indicate the position of the center point, then predicts the distance from the center point to the oriented boxes, and then adds the length and width of the HBB. CenterMap-Net (learning center probability map for detecting objects in aerial images) [25] treats OBB regression as a center map prediction problem and proposes a weighted pseudosegmentation-guided attention network to obtain context information. The interacting embranchment one-stage anchor-free detector (IENet) [26] adds two offsets, w and h, to the horizontal boxes regressed by FCOS to realize the regression of oriented boxes. Feature-enhanced CenterNet [27] designs a feature enhancement module (FEM), which improves the perception of small objects by mining multiscale contextual information.
Anchor-free methods have fast inference speeds and achieve competitive detection results compared to anchor-based object detection methods.

For oriented object detection, the anchor-based method must preset oriented anchor boxes with multiple angles, as shown in Figure 2. Presetting a large number of anchor boxes will lead to a very unbalanced ratio of positive and negative samples, and for densely arranged targets in remote sensing images, anchor boxes are prone to overlap; as such, oriented anchor-free detection methods are becoming increasingly popular.

2.2. Method

2.2.1. Overall Architecture

The overall architecture of our proposed AOGC method is shown in Figure 3. AOGC is a single-stage anchor-free oriented object detector. It consists of a ResNet backbone, a contextual attention FPN, a Gaussian kernel anchor-free detection head, and an oriented bounding box label assignment module. The last three feature maps of ResNet, C₃, C₄, and C₅, are used as the input of the contextual attention FPN. The contextual attention FPN establishes connections between the three feature layers and applies an attention relationship among them to obtain the contextual information of the target to be detected. The implementation of the contextual attention FPN is described in Section 2.2.2. The contextual attention FPN generates three feature layers, P₃, P₄, and P₅, and then P₆ and P₇ are obtained by downsampling from P₅. The head outputs three branches, namely the regression branch, the classification branch, and the Gaussian centerness branch. The classification and regression branches obtain the classification and regression features by applying four 3 × 3 convolution layers to the feature map. The regression feature passes through a 1 × 1 convolution layer to produce an output of size H × W × 5, where H and W are the height and width of the feature map. Each point in the feature map outputs five regression values: the four distances from the point to the left, top, right, and bottom sides of the bounding box (l, t, r, b), and a value α representing the angle of the target predicted by the network. The classification feature passes through a 1 × 1 convolutional layer to output the confidence of each category at each point. The Gaussian centerness branch we designed is generated from the regression branch feature through a convolution layer. This branch determines the significance of each anchor point on the feature map.
Finally, we used an oriented bounding box label assignment module, which divides anchor points into positive and negative samples for oriented targets. The targets generated by this module are used to compute the loss against the boxes predicted by the network.

2.2.2. Contextual Attention FPN (CAFPN)

Remote sensing images usually have the characteristics of a dense distribution of targets, small target size, and a large proportion of background, which make it difficult to detect targets. Background information often contains a large amount of prior knowledge; for example, airplanes generally appear in airports, and ships typically appear in harbors. Such prior knowledge often plays a vital role in object detection. Many research studies have verified the importance of background information in remote sensing object detection [39,40,41,42]. Attention mechanisms have also shown promise by obtaining the contextual information of the target and extracting the association between the pixels in oriented object detection.

In order to better fuse the contextual information of oriented objects, we designed a contextual attention FPN structure. The structure is shown in Figure 4. We obtained the feature maps, C₃, C₄, and C₅, of the last three layers from the backbone as the input of the contextual attention module. Their feature maps can be expressed as $C_{i} \in ℝ^{H \times W \times C}$, $i = 3, 4, 5$, and the module finally outputs the feature layers $P_{i} \in ℝ^{H \times W \times C}$, $i = 3, 4, 5, 6, 7$.

Specifically, for each input feature map layer $C_{i} \in ℝ^{H \times W \times C}$, $i = 3, 4, 5$, we obtained the contextual attention information from the remaining two feature maps $C_{j} \in ℝ^{H \times W \times C}$, $j \neq i$, and multiplied it with the feature map $C_{i}$ to obtain $P_{i}$. This process can be described by the following formulas:

$$P_{4} = C_{4} \cdot (I(C_{5}) + D_{C}(C_{3})) \qquad (1)$$

$$P_{3} = C_{3} \cdot (I(C_{5}) + I(C_{4})) \qquad (2)$$

$$P_{5} = C_{5} \cdot (D_{C}(C_{3}) + D_{C}(C_{4})) \qquad (3)$$

where the upsampling operation $I$ is used to obtain the contextual information of the higher-level (coarser) feature maps, and $D_{C}$ stands for dilated convolution [43]. Dilated convolution is employed because it widens the receptive field of the convolution layer, which allows more comprehensive spatial information to be extracted. We fused the information from the upper and lower levels as a spatial attention map and then multiplied it with the original feature layer to obtain context information. To match the channels of the fused feature maps, we added a 1 × 1 convolution before the upsampling operation to perform channel conversion.

The acquisition of P₃ is shown in Figure 4b. We upsampled the C₄ and C₅ layers, summed the results, and multiplied the sum with C₃. The acquisition of P₄ is shown in Figure 4a. We used dilated convolution to extract large receptive field features from the C₃ layer, fused them with the upsampled C₅, and then multiplied the result with C₄. The acquisition of P₅, which corresponds to the C₅ layer, is shown in Figure 4c. We extracted large receptive field features from the C₃ and C₄ layers by dilated convolution and multiplied them with C₅. In general, for each level we extracted information from the other two feature layers, using dilated convolutions on the higher-resolution layers and upsampling on the lower-resolution layers.
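The fusion in Equations (1)–(3) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: each level is reduced to a single-channel map (so the 1 × 1 channel-matching convolutions are omitted), the operation I is taken to be nearest-neighbour upsampling, D_C is a 3 × 3 dilated convolution whose stride brings the finer map down to the coarser resolution, and the product is element-wise. The function names `upsample`, `dilated_conv`, and `cafpn_fuse` are hypothetical.

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour upsampling, standing in for the operation I (assumption)."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def dilated_conv(x, kernel, dilation=2, stride=1):
    """Single-channel dilated convolution with zero padding, standing in for D_C.

    The stride is an assumption used to reach the coarser level's resolution.
    """
    k = kernel.shape[0]              # square, odd-sized kernel
    pad = dilation * (k // 2)
    xp = np.pad(x, pad)              # zero padding on all sides
    span = 2 * pad + 1               # extent covered by the dilated kernel
    out_h = -(-x.shape[0] // stride) # ceiling division
    out_w = -(-x.shape[1] // stride)
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = xp[r:r + span:dilation, c:c + span:dilation]
            out[i, j] = np.sum(patch * kernel)
    return out

def cafpn_fuse(c3, c4, c5, kernel):
    """Equations (1)-(3): each level is multiplied by context from the other two."""
    p3 = c3 * (upsample(c5, 4) + upsample(c4, 2))
    p4 = c4 * (upsample(c5, 2) + dilated_conv(c3, kernel, stride=2))
    p5 = c5 * (dilated_conv(c3, kernel, stride=4) + dilated_conv(c4, kernel, stride=2))
    return p3, p4, p5

# Toy pyramid: C3 is 8x8, C4 is 4x4, C5 is 2x2.
c3, c4, c5 = np.ones((8, 8)), np.ones((4, 4)), np.ones((2, 2))
identity = np.zeros((3, 3))
identity[1, 1] = 1.0                 # kernel that passes the centre pixel through
p3, p4, p5 = cafpn_fuse(c3, c4, c5, identity)
print(p3.shape, p4.shape, p5.shape)
```

Each output keeps the resolution of its own input level, which is the property the subsequent detection head relies on.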

After extracting features through the contextual attention layer, our contextual attention FPN outputs three feature layers: P₃, P₄, and P₅. The remaining two feature layers, P₆ and P₇, were obtained by downsampling from the P₅ layer following the FCOS method. Finally, our contextual attention FPN outputs the feature layers $P_{i} \in ℝ^{H \times W \times C}$, $i = 3, 4, 5, 6, 7$, for the classification and regression tasks of the subsequent Gaussian detection head.

2.2.3. Oriented Bounding Box Label Assignment (OLA)

Anchor-based detectors usually calculate the IoU between each preset box and the ground truth and then assign the sample as positive or negative according to the IoU value. Anchor-free detection networks instead divide the anchor points into positive and negative samples. For example, CenterNet treats the heatmap point onto which the center of the ground truth falls as a positive sample with a label of 1; other points are negative samples whose labels are obtained from a Gaussian distribution. FCOS treats the anchor points falling inside the ground truth as positive sample points and the remaining points as negative sample points. These label assignment methods based on horizontal boxes are not applicable to oriented object detection. As shown in Figure 5, the positive sample area divided by the FCOS method is misaligned with the ground truth because of the change in angle. Therefore, we designed an oriented label assignment method for oriented object detection.

We defined the ground truth as $(x^{*}, y^{*}, h^{*}, w^{*}, θ^{*})$. $(x^{*}, y^{*})$ represent the coordinates of the center point of the ground truth. $h^{*}$ and $w^{*}$ represent the long and short sides of the ground truth, respectively. $θ^{*}$ is the rotation angle of the ground truth, that is, the angle between the long side $h^{*}$ of the ground truth and the x-axis, and its range is $(-π/2, π/2)$. For an anchor point $(x, y)$ on the feature map, we first calculated the offset $D$ from it to the ground truth center point $(x^{*}, y^{*})$, and then we obtained the rotated offset $D' = (x', y')$ by applying an affine transformation to $D$. The transformation process can be expressed as Equation (4).

$$D' = \begin{pmatrix} \cos θ^{*} & -\sin θ^{*} \\ \sin θ^{*} & \cos θ^{*} \end{pmatrix} D \qquad (4)$$

After obtaining the rotated coordinate offset $(x', y')$ from the anchor point to the ground truth center point, we could divide the positive and negative samples by this offset. To improve the quality of positive samples, we first set the division range of positive and negative samples to $\frac{h^{*}}{2}$ and $\frac{w^{*}}{2}$; that is, points that satisfy $x' < \frac{h^{*}}{2}$ and $y' < \frac{w^{*}}{2}$ are judged as positive samples, and the remaining points are treated as negative samples. However, we observed that targets with large aspect ratios make up a significant proportion of remote sensing images, and the distance from the center point to the box boundary along the short side $w^{*}$ of these targets is particularly small. Therefore, setting the positive sample range to $\frac{w^{*}}{2}$ on the short side leaves large aspect ratio targets with few positive sample points, which leads to an imbalance between positive and negative samples. Figure 6a shows this case, where the green area is the positive sample sampling area. To solve this problem, we defined the division range on the short side as $\frac{\sqrt{h^{*} \cdot w^{*}}}{2}$, which effectively alleviates the shortage of positive sample points for large aspect ratio targets. Therefore, in the end, we judged the points that satisfy $x' < \frac{h^{*}}{2}$ and $y' < \frac{\sqrt{h^{*} \cdot w^{*}}}{2}$ as positive samples, and the rest of the points were treated as negative samples. The resulting sampling area is shown in Figure 6b.
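The assignment rule can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the authors' code: the comparisons are applied to the absolute values of x' and y' (the text writes the inequalities without them), the rotation is applied exactly as written in Equation (4), and the sign convention of θ* is left to the caller. The function name `ola_positive_mask` and the flag `relax_short_side` are hypothetical.

```python
import numpy as np

def ola_positive_mask(points, gt, relax_short_side=True):
    """Mark anchor points as positive (True) or negative (False).

    points: (N, 2) array of anchor point coordinates (x, y).
    gt:     (x*, y*, h*, w*, theta*) with h* the long side, w* the short side.
    """
    cx, cy, h, w, theta = gt
    offsets = points - np.array([cx, cy])           # D for every anchor point
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    rotated = offsets @ rot.T                       # D' = R D, row by row
    # Short-side range: sqrt(h* w*) / 2 in the final rule, w* / 2 in the first one.
    short_range = np.sqrt(h * w) / 2 if relax_short_side else w / 2
    # Assumption: absolute values, so the region is symmetric about the centre.
    return (np.abs(rotated[:, 0]) < h / 2) & (np.abs(rotated[:, 1]) < short_range)

# An 8 x 2 box centred at the origin with zero rotation.
gt = (0.0, 0.0, 8.0, 2.0, 0.0)
points = np.array([[3.0, 0.0], [0.0, 1.5], [5.0, 0.0], [0.0, 2.5]])
relaxed = ola_positive_mask(points, gt)                          # short range = 2
strict = ola_positive_mask(points, gt, relax_short_side=False)   # short range = 1
print(relaxed, strict)
```

In this toy case the point (0, 1.5) is positive only under the relaxed short-side range, which is the effect the $\sqrt{h^{*} \cdot w^{*}}/2$ range is meant to have on elongated targets.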

2.2.4. Gaussian Centerness Branch (GC)

Anchor-based detectors usually filter out low-quality anchor boxes by IoU thresholding. Anchor-free detectors do not produce anchor boxes; as such, IoU-based methods are generally not used to filter out low-quality predictions. In fact, many low-quality predicted bounding boxes are generated at positions far from the center of the target. FCOS designs a centerness branch to filter out these low-quality boxes. Centerness describes the normalized distance from a position to the center of the target that the position is responsible for. This method has been adopted in HBB networks such as FCOS [5] and YOLOX [14], where it has steadily improved detection accuracy, but it no longer applies to OBB detection tasks. Therefore, we proposed a novel centerness branch based on a two-dimensional Gaussian distribution. We used the OBB parameters $(x, y, h, w, θ)$ to define a two-dimensional Gaussian distribution:

$$Σ = R A R^{T} \qquad (5)$$

where $Σ$ is the covariance matrix, $R$ is the rotation transformation matrix formed by the sine and cosine of the target angle (given in Equation (6)), $R^{T}$ is the transpose of the rotation transformation matrix, and $A$ is the diagonal matrix obtained from the eigenvalue decomposition of the covariance matrix. The elements on its diagonal are the eigenvalues arranged from the largest to the smallest, as given in Equation (7).

$$R = \begin{pmatrix} \cos θ & -\sin θ \\ \sin θ & \cos θ \end{pmatrix} \qquad (6)$$

$$A = k \begin{pmatrix} w^{2} & 0 \\ 0 & h^{2} \end{pmatrix} \qquad (7)$$

where $k$ is a hyperparameter (which we set to 0.1 in this paper), and $h$ and $w$ represent the long and short sides of the target. Finally, we used a normalized 2D Gaussian kernel function $g(X)$ to represent the centerness of the OBB task:

$$g(X) = \exp\left(-\frac{1}{2} (X - μ)^{T} Σ^{-1} (X - μ)\right) \qquad (8)$$

where $X$ represents the preset anchor point coordinates, $μ$ represents the center point coordinates of the ground truth, and $\exp(\cdot)$ is the exponential function. The normalized Gaussian centerness satisfies $g(X) \in (0, 1]$ and reaches 1 only at the center point. Gaussian centerness generates an elliptical area with the center point of the ground truth as the core. The closer a point is to the center point, the higher its value; the closer to the boundary, the lower the value. Therefore, two-dimensional Gaussian kernel functions can effectively reflect the significance of an anchor point to the ground truth; furthermore, adding the Gaussian centerness branch can effectively filter out unimportant anchor points. Figure 7 shows a Gaussian ground truth heat map of an oriented object detection target.
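Equations (5)–(8) translate directly into a short NumPy sketch, given here only as an illustration. The diagonal of A follows Equation (7) as printed (w² first, then h²), k defaults to the value 0.1 stated in the text, and the function name `gaussian_centerness` is hypothetical.

```python
import numpy as np

def gaussian_centerness(points, box, k=0.1):
    """Gaussian centerness g(X) of anchor points for one oriented box.

    points: (N, 2) array of anchor point coordinates X.
    box:    (x, y, h, w, theta), where (x, y) is the centre mu.
    """
    x, y, h, w, theta = box
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])        # Equation (6)
    a = k * np.diag([w ** 2, h ** 2])                        # Equation (7), as printed
    sigma = rot @ a @ rot.T                                  # Equation (5)
    diff = points - np.array([x, y])                         # X - mu
    # Quadratic form (X - mu)^T Sigma^{-1} (X - mu) for every point.
    maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(sigma), diff)
    return np.exp(-0.5 * maha)                               # Equation (8)

box = (10.0, 20.0, 8.0, 2.0, 0.3)
points = np.array([[10.0, 20.0], [11.0, 21.0], [9.0, 19.0], [30.0, 40.0]])
g = gaussian_centerness(points, box)
print(g)
```

The value is 1 at the box centre, is symmetric about the centre, and decays towards 0 for distant points, so it can serve both as the regression target of the centerness branch and as the weight $w_{GC}$ used in the loss.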

2.2.5. Loss Function

The final loss function of AOGC is similar to that of FCOS, including regression loss, classification loss, and Gaussian centerness loss. The function for calculating the total loss is displayed in Equation (9).

$$Loss = \frac{1}{N_{pos}} \sum_{i} w_{GC} L_{reg} + \frac{λ_{1}}{N} \sum_{i} L_{cls} + \frac{λ_{2}}{N_{pos}} \sum_{i} L_{GC} \qquad (9)$$

where $N$ represents the number of all samples in the ground truth, $N_{pos}$ represents the number of positive samples in the ground truth, and $i$ indexes the preset anchor points. $λ_{1}$ and $λ_{2}$ are hyperparameters used to adjust the loss ratio, and we set both to 1 in our experiments. $w_{GC}$ is the weight given by the Gaussian centerness, which is used to adjust the size of the regression loss at different positions.

$L_{cls}$ is the classification loss. We used focal loss [17] to address sample imbalance, and its expression is as follows:

$$L_{cls} = -α (1 - p)^{γ} \log(p) \qquad (10)$$

where $α$ and $γ$ are the hyperparameters that balance the samples, which we set to 0.25 and 2, respectively, and $p$ represents the probability that the model predicts for a certain category. For the regression loss, we used the SkewIoU [18] loss, expressed as follows:

$$L_{reg} = -\log(S_{IoU}(R, R^{*})) \qquad (11)$$

In the above formula, $S_{IoU}$ represents SkewIoU, $R$ is the regression parameter $(x, y, h, w, θ)$ predicted by the network, and $R^{*}$ is the regression parameter $(x^{*}, y^{*}, h^{*}, w^{*}, θ^{*})$ of the ground truth.

$L_{GC}$ is our Gaussian centerness loss, which we implemented using BCE loss, where $p$ is the predicted Gaussian centerness and $p'$ is the Gaussian centerness of the ground truth. The expression of $L_{GC}$ is as follows:

$$L_{GC} = -p' \log(p) - (1 - p') \log(1 - p) \qquad (12)$$
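The loss terms can be combined as in the following NumPy sketch, which is illustrative only. The SkewIoU values are taken as given inputs because computing the overlap of rotated boxes is outside the scope of the sketch; the centerness term is written as the standard binary cross-entropy with the ground truth centerness as the target; the clipping constant `eps` and all function names are assumptions introduced for the example.

```python
import numpy as np

def focal_loss(p, alpha=0.25, gamma=2.0):
    """Equation (10); p is the probability predicted for the true class."""
    return -alpha * (1.0 - p) ** gamma * np.log(p)

def regression_loss(skew_iou):
    """Equation (11); skew_iou is a precomputed SkewIoU in (0, 1]."""
    return -np.log(skew_iou)

def centerness_loss(pred, target, eps=1e-7):
    """Binary cross-entropy between predicted and ground truth centerness."""
    pred = np.clip(pred, eps, 1.0 - eps)   # eps avoids log(0); an assumption
    return -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

def total_loss(l_reg, w_gc, l_cls, l_gc, n, n_pos, lam1=1.0, lam2=1.0):
    """Equation (9): centerness-weighted regression + classification + centerness."""
    return (np.sum(w_gc * l_reg) / n_pos
            + lam1 * np.sum(l_cls) / n
            + lam2 * np.sum(l_gc) / n_pos)

# Toy values: 4 samples in total, 2 of them positive.
loss = total_loss(
    l_reg=np.array([1.0, 2.0]),       # per-positive regression losses
    w_gc=np.array([1.0, 0.5]),        # Gaussian centerness weights
    l_cls=np.array([0.1, 0.2, 0.3, 0.4]),
    l_gc=np.array([0.5, 0.5]),
    n=4,
    n_pos=2,
)
print(loss)
```

With $λ_{1} = λ_{2} = 1$, as in the experiments, the three terms contribute on an equal footing, and the weight $w_{GC}$ reduces the influence of regression errors at points far from the target center.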

3. Results

3.1. Datasets

We evaluated our method on the DOTA-1.0 and HRSC2016 datasets. The DOTA-1.0 [44] dataset is a collection of remote sensing images for oriented object detection. It consists of nearly three thousand aerial images with diverse scales, orientations, and object shapes, collected from a range of sensors and platforms. Image sizes vary between 800 × 800 and 4000 × 4000 pixels. The fully annotated images contain 188,282 instances. DOTA-1.0 has 15 categories: plane (PL), baseball diamond (BD), bridge (BR), ground track field (GTF), small vehicle (SV), large vehicle (LV), ship (SH), tennis court (TC), basketball court (BC), storage tank (ST), soccer ball field (SBF), roundabout (RA), harbor (HA), swimming pool (SP), and helicopter (HC). DOTA images involve various large and small objects. The DOTA dataset is divided into a training set, validation set, and test set in proportions of 1/2, 1/6, and 1/3, respectively. We trained our method on the training and validation sets, tested it on the test set, and submitted the final results to the DOTA evaluation server. We cropped the original images into 1024 × 1024 patches with a gap of 200. Only random horizontal flips were used during training. For multi-scale training and testing, we resized the original images at three scales (0.5, 1.0, and 1.5), cropped them into 1024 × 1024 patches with a gap of 500, and trained with random rotations.
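The cropping step can be sketched as follows. The sketch assumes that "gap" denotes the overlap between neighbouring patches, so that the sliding stride is 1024 − 200 = 824 pixels, and that the last patch is shifted back to end at the image border; the text does not spell out either detail, and the function name `patch_origins` is hypothetical.

```python
def patch_origins(length, size=1024, gap=200):
    """Start coordinates of patches along one image axis.

    Assumes `gap` is the overlap between adjacent patches (stride = size - gap)
    and that the final patch is aligned with the image border.
    """
    if length <= size:
        return [0]                      # image fits in a single patch
    step = size - gap
    starts = list(range(0, length - size, step))
    starts.append(length - size)        # last patch ends exactly at the border
    return starts

# A 4000-pixel side yields five patch positions with an 824-pixel stride.
origins = patch_origins(4000)
print(origins)
```

Applying the function to both image axes gives the grid of 1024 × 1024 crops; with a gap of 500, as in the multi-scale setting, the stride becomes 524 pixels.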

HRSC2016 [45] is a dataset focused on ship detection and includes 1061 images of rotated ships with different aspect ratios. The images were gathered from six well-known ports and include ships both at sea and inshore. This dataset has the following detection difficulties: (1) Numerous ships are docked along the shore and are densely arranged, so their annotation boxes overlap to a great extent. (2) The background of the remote sensing images is intricate, and the texture of the ships is similar to that of the nearby shoreline. (3) The scale of ships varies greatly, with different sizes visible in the same image. (4) There are dozens of ship types, which makes detection and classification challenging. (5) Problems such as cloud and fog occlusion make detection difficult. Image sizes range from 300 × 300 to 1500 × 900 pixels, and the ground sample distance varies between 2 m and 0.4 m. Following R²CNN [19], we split the dataset into three sets: training, validation, and test. The training set has 436 images with 1207 instances, the validation set has 181 images with 541 instances, and the test set has 444 images with 1228 instances. The training set and validation set were used for training, and the test set was used for evaluation. We evaluated our results using PASCAL VOC07 and VOC12 [46] metrics. We resized all images to 800 × 800 without changing the aspect ratio.

3.2. Implementation Details

Our AOGC network uses ResNet50 [8] as the backbone, initialized from a model pretrained on ImageNet [47]. Our experiments were performed with a batch size of 2 on a computer equipped with two 3080Ti GPUs. We trained our models with the stochastic gradient descent (SGD) optimizer, using an initial learning rate of 0.005, a momentum of 0.9, and a weight decay of 0.0001. On the DOTA dataset, we trained for 12 epochs and reduced the learning rate to one-tenth at the end of epochs 8 and 11. On HRSC2016, we trained for 36 epochs and reduced the learning rate to one-tenth at the end of epochs 24 and 33. During testing, the confidence threshold was set to 0.1. We implemented our training using MMDetection [48].
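
The learning rate schedule can be written down as a short sketch. It assumes the usual step policy in which the two reductions compound (0.005, then 0.0005, then 0.00005); the text could also be read as a single drop to one-tenth of the initial value, so treat the compounding as an assumption. `step_lr` is a hypothetical helper, not an MMDetection function.

```python
def step_lr(epoch, base_lr=0.005, milestones=(8, 11), gamma=0.1):
    """Learning rate in effect during `epoch` (1-indexed).

    Assumption: the rate is multiplied by `gamma` once each milestone
    epoch has finished, so successive drops compound.
    """
    drops = sum(1 for m in milestones if epoch > m)
    return base_lr * gamma ** drops


# DOTA schedule: 12 epochs with drops after epochs 8 and 11.
for epoch in (1, 8, 9, 11, 12):
    print(epoch, step_lr(epoch))

# HRSC2016 schedule: 36 epochs with drops after epochs 24 and 33.
print(step_lr(25, milestones=(24, 33)))
```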

3.3. Ablation Studies

We conducted a series of ablation experiments on the DOTA-1.0 test set to evaluate the effectiveness of the proposed method. We used FCOS with ResNet50 as the baseline and added an angle prediction branch so that it can detect oriented objects; we refer to this baseline as FCOS-R. The experimental results are shown in Table 1.

As shown in Table 1, the FCOS-R baseline reached a mAP of 69.58%. Adding our CAFPN raised the mAP on DOTA-1.0 to 72.19%. Adding the OLA module, our positive and negative sample division method, increased it to 73.39%. Finally, adding our Gaussian centerness brought the mAP to 74.30%.
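
The contribution of each module follows directly from the cumulative figures in Table 1. The short calculation below only restates those numbers as per-step gains in percentage points; the variable names are illustrative.

```python
# Cumulative mAP (%) from Table 1, in the order the modules were added.
ablation = [
    ("FCOS-R baseline", 69.58),
    ("+ CAFPN", 72.19),
    ("+ OLA", 73.39),
    ("+ GC", 74.30),
]

# Gain of each step over the previous configuration, in percentage points.
gains = [
    (name, round(cur - prev, 2))
    for (_, prev), (name, cur) in zip(ablation, ablation[1:])
]
total_gain = round(ablation[-1][1] - ablation[0][1], 2)

for name, gain in gains:
    print(f"{name}: +{gain:.2f}")
print(f"total: +{total_gain:.2f}")
```

CAFPN accounts for the largest share of the improvement (2.61 points), followed by OLA (1.20) and Gaussian centerness (0.91), for a total of 4.72 points over the baseline.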

To verify the effectiveness of the OLA module, we ran a separate ablation on the DOTA-1.0 dataset; the results are shown in Table 2. We again used FCOS-R as the baseline. Adding the label assignment method with a rotating affine transform raised the mAP from 69.58% to 70.44%. However, when we reduced the positive sample area to half of the original, the mAP dropped to 70.32% because targets with large aspect ratios were left with too few positive anchor points. Finally, after we corrected the range of positive samples, the mAP reached 70.90%.

3.4. Comparison with State-of-the-Art Methods

The test results on the DOTA-1.0 dataset are shown in Table 3, where we compare AOGC with state-of-the-art one-stage, two-stage, and anchor-free oriented object detection methods. Our AOGC achieved 74.30% mAP on the DOTA dataset, surpassing most of the anchor-free and single-stage detection models and reaching an accuracy similar to that of certain two-stage detection models. After adding multi-scale training and random rotation, our accuracy reached 76.55% mAP. Furthermore, our model achieved state-of-the-art results on challenging object categories with large aspect ratios, such as large vehicles, ships, bridges, harbors, and swimming pools. Some of our test results on DOTA are shown in Figure 8.

In addition, we performed a visual comparison between the baseline method FCOS-R and our method AOGC on the DOTA dataset to demonstrate the effectiveness of our method. The visual results are compared in Figure 9. The comparison shows that our method predicts more accurate target angles, with fewer missed detections and false detections.

The HRSC2016 dataset contains many closely packed ship instances with varying orientations and large aspect ratios. The test results on HRSC2016 are shown in Table 4. Compared with state-of-the-art methods, our AOGC achieved 89.80% mAP (07) and 95.20% mAP (12) on the HRSC2016 dataset, surpassing most current anchor-free and single-stage detection models. Some of our test results on HRSC2016 are shown in Figure 10.

4. Discussion

4.1. Effect of the Proposed CAFPN

The backbone network produces a series of feature maps at different scales from the input image, but the information carried by these maps is unbalanced. In general, deep feature maps contain rich semantic features but little positioning information, whereas shallow feature maps carry more positional information but weaker semantic features; as such, the features need to be further enhanced. To achieve multi-scale informative feature fusion, FPN creates high-level semantic feature maps at all scales by using a top-down architecture and lateral connections. This structure integrates deep features with low-level information and strengthens the relationship between features of different scales. Each pyramid feature layer is only responsible for detecting objects within a specific scale range, which improves the efficiency of detection. However, because FPN is a top-down network, it can only transfer deep semantic information to shallow layers. Although this adds multi-scale semantic expression, positioning information still does not circulate effectively between feature levels. In addition, the work in [49] showed that the top-down structure of FPN strengthens the network’s focus on smaller targets but loses feature information of large targets during gradient backpropagation.

The CAFPN structure we designed based on these observations avoids these problems. Each feature layer of CAFPN fuses the shallow semantic information obtained from its adjacent upper-level feature layer with the deep semantic information obtained from its adjacent lower-level feature layer. During gradient backpropagation, this structure achieves sufficient semantic information fusion and does not lose information through excessive attention to a particular category. To reduce the number of parameters in CAFPN, we removed the top-down pathway of the FPN structure. The ablation results for the FPN structure on the DOTA-1.0 dataset are shown in Table 5. The baseline was the FCOS-R network described in Section 3.3. The results show that CAFPN improves the mAP by 2.61 percentage points (from 69.58% to 72.19%), while Params and FLOPs each increase by no more than about 10%.
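
The cost side of this trade-off can be checked with a short calculation on the Table 5 figures. The values are taken from the table with its column labels as printed; `relative_increase` is an illustrative helper.

```python
def relative_increase(baseline, proposed):
    """Relative change from baseline to proposed, in percent."""
    return 100.0 * (proposed - baseline) / baseline


# Values from Table 5 (FPN baseline vs. CAFPN), columns as labelled there.
params_increase = relative_increase(206.91, 220.74)
flops_increase = relative_increase(31.92, 35.07)

print(f"Params: +{params_increase:.2f}%")  # about 6.7%
print(f"FLOPs:  +{flops_increase:.2f}%")   # about 9.9%
```

Both increases stay below 10%, while the mAP gain of 2.61 points is more than half of the total 4.72-point improvement reported in Table 1.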

4.2. Effect of the Proposed Gaussian Kernel Anchor-Free Detection Head

We modified the FCOS method for detecting oriented objects in remote sensing images. Our approach involves adding a Gaussian kernel anchor-free detection head with a specially designed oriented label assignment and a Gaussian centerness branch to the existing FCOS detection head.

Before training an object detector, it is necessary to determine which ground truth (or background) each anchor should be assigned to, and the method used to divide positive and negative samples directly affects the performance of the detector. Anchor-based detectors usually use an IoU threshold as the assignment criterion, while anchor-free detectors define positive and negative samples by directly assigning anchor points. FCOS assigns the anchor points inside the bounding box as positive samples, which gives good performance on horizontal object detection. To give FCOS better performance in oriented object detection, we designed the oriented label assignment method OLA. It maps the original FCOS positive sample area onto the rotated box through an affine transformation, and it improves the quality of the positive samples by shrinking the positive sample area. To accommodate the many large aspect ratio targets in remote sensing images, we also corrected the short side of the sampling area. Testing on the DOTA-1.0 dataset shows that this approach is highly effective for objects with large aspect ratios: among the methods compared in Table 3, our method achieves the best results on large aspect ratio classes such as small vehicles, large vehicles, and ships.
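
The idea behind OLA can be illustrated with a minimal sketch. It is not the exact rule from Section 2.2.3: the box parameterisation `(cx, cy, w, h, theta)` with `theta` in radians, the `shrink` factor of 0.5, and in particular the short-side correction (here simply leaving the short side unshrunk) are assumptions made for illustration, and both function names are hypothetical.

```python
import math


def to_box_frame(px, py, cx, cy, theta):
    """Express a point in the rotated box's local axes (affine transform)."""
    dx, dy = px - cx, py - cy
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return dx * cos_t + dy * sin_t, -dx * sin_t + dy * cos_t


def is_positive(px, py, box, shrink=0.5, keep_short_side=True):
    """Decide whether an anchor point falls in the positive sample area.

    Assumption: the area is the rotated box shrunk by `shrink` around its
    center. With `keep_short_side`, the short side is left unshrunk, a
    placeholder for the correction described in the text.
    """
    cx, cy, w, h, theta = box
    x, y = to_box_frame(px, py, cx, cy, theta)
    half_w, half_h = shrink * w / 2, shrink * h / 2
    if keep_short_side:
        if w < h:
            half_w = w / 2
        elif h < w:
            half_h = h / 2
    return abs(x) <= half_w and abs(y) <= half_h


# A 100 x 10 box: plain halving leaves a strip only 5 pixels wide.
box = (0.0, 0.0, 100.0, 10.0, 0.0)
print(is_positive(20, 4, box, keep_short_side=False))  # False
print(is_positive(20, 4, box, keep_short_side=True))   # True
```

The example shows why plain shrinking hurts elongated targets: halving both sides of a 100 × 10 box rejects points that lie well inside the object, which matches the drop from 70.44% to 70.32% mAP reported in Table 2.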

In the FCOS model, the centerness branch plays a crucial role in determining the significance of anchor points based on their distance from the center point of the target. To allow FCOS to achieve better oriented object detection performance, we built a new centerness branch based on the two-dimensional Gaussian kernel function. Our centerness branch has the following advantages: (1) The weight given by the 2D Gaussian kernel function decreases monotonically as the distance from a point to the center point increases, which meets the design requirements of the centerness branch. (2) The 2D Gaussian kernel function has rotational symmetry, and its smoothness is the same in all directions, which makes it suitable for oriented target detection. (3) The smoothness of the 2D Gaussian kernel function can be adjusted by setting hyperparameters.
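
These three properties can be seen in a minimal sketch of a Gaussian centerness target. The exact formula of Section 2.2.4 is not reproduced here: the sketch assumes a Gaussian evaluated in the box's local frame with standard deviations proportional to the box sides, and the scale factor `k` is a hypothetical hyperparameter.

```python
import math


def gaussian_centerness(px, py, box, k=0.5):
    """Gaussian centerness of a point relative to a rotated box.

    Assumptions: box = (cx, cy, w, h, theta) with theta in radians, and
    standard deviations k * w and k * h along the box axes.
    """
    cx, cy, w, h, theta = box
    dx, dy = px - cx, py - cy
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Coordinates of the point in the box's local frame.
    x = dx * cos_t + dy * sin_t
    y = -dx * sin_t + dy * cos_t
    sx, sy = k * w, k * h
    return math.exp(-(x * x / (2 * sx * sx) + y * y / (2 * sy * sy)))


box = (0.0, 0.0, 40.0, 10.0, 0.0)
# (1) The value is 1 at the center and decays monotonically with distance.
print([round(gaussian_centerness(d, 0, box), 3) for d in (0, 10, 20)])
# (2) Rotating the box and the point together leaves the value unchanged.
rotated = (0.0, 0.0, 40.0, 10.0, math.pi / 2)
print(round(gaussian_centerness(0, 10, rotated), 3))
# (3) A larger k gives a flatter, smoother weighting.
print(round(gaussian_centerness(10, 0, box, k=1.0), 3))
```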

5. Conclusions

In this paper, we proposed AOGC, a novel anchor-free method for oriented object detection in aerial images. We proposed the CAFPN module to obtain contextual attention information about objects and to enhance the network’s feature extraction capability. Then, to address the positive and negative sample division problem in current anchor-free detection networks, we designed a label assignment method suited to oriented object detection. Finally, we developed a Gaussian centerness branch for oriented object detection to select high-quality anchors. Extensive experiments on DOTA-1.0 and HRSC2016 show that our method outperforms most anchor-free and single-stage detection methods, especially on large aspect ratio targets. However, an accuracy gap remains between our method and certain two-stage methods. In future work, we will continue to improve our method and explore the potential of oriented anchor-free detection for aerial images.

Author Contributions

Conceptualization, Z.W., J.C. and Q.H.; methodology, Z.W.; software, J.C.; validation, Z.W., C.B. and J.C.; formal analysis, C.B. and Z.W.; investigation, Z.W.; resources, Q.H.; data curation, Z.W.; writing (original draft preparation), Z.W.; writing (review and editing), C.B. and J.C.; visualization, Z.W.; supervision, J.C. and Q.H.; project administration, Z.W.; funding acquisition, J.C. and Q.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under grant 62275022, the Beijing Nature Science Foundation of China under grant 4222017, and the funding of the Science and Technology Entry program under grant (KJFGS-QTZCHT-2022-008).

Data Availability Statement

The DOTA and HRSC2016 datasets are available at https://captainwhu.github.io/DOTA/dataset.html (accessed on 10 June 2023) and https://sites.google.com/site/hrsc2016/ (accessed on 10 June 2023), respectively.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
2. Wen, L.; Cheng, Y.; Fang, Y.; Li, X.Y. A comprehensive survey of oriented object detection in remote sensing images. Expert Syst. Appl. 2023, 224, 119960.
3. Ding, J.; Xue, N.; Long, Y.; Xia, G.-S.; Lu, Q. Learning RoI Transformer for Detecting Oriented Objects in Aerial Images. arXiv 2018, arXiv:1812.00155.
4. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
5. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
6. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018.
7. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as Points. arXiv 2019, arXiv:1904.07850.
8. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
9. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
10. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
11. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083.
12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2015, arXiv:1506.02640.
13. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
14. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
15. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696.
16. Berg, A.C.; Fu, C.Y.; Szegedy, C.; Anguelov, D.; Erhan, D.; Reed, S.; Liu, W. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016.
17. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
18. Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H. Arbitrary-Oriented Scene Text Detection via Rotation Proposals. IEEE Trans. Multimed. 2017, 20, 3111–3122.
19. Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Fu, P.; Luo, Z. R2CNN: Rotational Region CNN for Orientation Robust Scene Text Detection. arXiv 2017, arXiv:1706.09579.
20. Yang, X.; Liu, Q.; Yan, J.; Li, A.; Zhang, Z.; Yu, G. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object. arXiv 2019, arXiv:1908.05612.
21. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Xian, S.; Fu, K. SCRDet: Towards More Robust Detection for Small, Cluttered and Rotated Objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
22. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align Deep Features for Oriented Object Detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511.
23. Chen, W.; Miao, S.; Wang, G.; Cheng, G. Recalibrating Features and Regression for Oriented Object Detection. Remote Sens. 2023, 15, 2134.
24. Yi, J.; Wu, P.; Liu, B.; Huang, Q.; Qu, H.; Metaxas, D. Oriented Object Detection in Aerial Images with Box Boundary-Aware Vectors. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020.
25. Wang, J.; Yang, W.; Li, H.C.; Zhang, H.; Xia, G.S. Learning Center Probability Map for Detecting Objects in Aerial Images. IEEE Trans. Geosci. Remote Sens. 2020, 59, 4307–4323.
26. Lin, Y.; Feng, P.; Guan, J. IENet: Interacting Embranchment One Stage Anchor Free Detector for Orientation Aerial Object Detection. arXiv 2019, arXiv:1912.00969.
27. Shi, T.; Gong, J.; Hu, J.; Zhi, X.; Zhang, W.; Zhang, Y.; Zhang, P.; Bao, G. Feature-Enhanced CenterNet for Small Object Detection in Remote Sensing Images. Remote Sens. 2022, 14, 5488.
28. Ming, Q.; Zhou, Z.; Miao, L.; Zhang, H.; Li, L. Dynamic Anchor Learning for Arbitrary-Oriented Object Detection. arXiv 2020, arXiv:2012.04150.
29. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S.; Bai, X. Gliding Vertex on the Horizontal Bounding Box for Multi-Oriented Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1452–1459.
30. Xiao, Z.; Qian, L.; Shao, W.; Tan, X.; Wang, K. Axis Learning for Orientated Objects Detection in Aerial Images. Remote Sens. 2020, 12, 908.
31. Pan, X.; Ren, Y.; Sheng, K.; Dong, W.; Yuan, H.; Guo, X.; Ma, C.; Xu, C. Dynamic Refinement Network for Oriented and Densely Packed Object Detection. arXiv 2020, arXiv:2005.09973.
32. Wei, H.; Zhang, Y.; Chang, Z.; Li, H.; Wang, H.; Sun, X. Oriented objects as pairs of middle lines. ISPRS J. Photogramm. Remote Sens. 2020, 169, 268–279.
33. Llerena, J.M.; Zeni, L.F.; Kristen, L.N.; Jung, C. Gaussian Bounding Boxes and Probabilistic Intersection-over-Union for Object Detection. arXiv 2021, arXiv:2106.06072.
34. Chen, Z.; Chen, K.; Lin, W.; See, J.; Yu, H.; Ke, Y.; Yang, C. PIoU Loss: Towards Accurate Oriented Object Detection in Complex Environments. arXiv 2020, arXiv:2007.09584.
35. Liu, L.; Pan, Z.; Lei, B. Learning a Rotation Invariant Detector with Rotatable Bounding Box. arXiv 2017, arXiv:1711.09405.
36. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Xian, S.; Fu, K. R2CNN++: Multi-Dimensional Attention Based Rotation Invariant Detector with Robust Anchor Strategy. arXiv 2018, arXiv:1811.07126.
37. Qian, W.; Yang, X.; Peng, S.; Guo, Y.; Yan, J. Learning Modulated Loss for Rotated Object Detection. arXiv 2019, arXiv:1911.08299.
38. Yang, X.; Hou, L.; Zhou, Y.; Wang, W.; Yan, J. Dense Label Encoding for Boundary Discontinuity Free Rotation Detection. arXiv 2020, arXiv:2011.09670.
39. Ye, X.; Xiong, F.; Lu, J.; Zhou, J.; Qian, Y. F³-Net: Feature Fusion and Filtration Network for Object Detection in Optical Remote Sensing Images. Remote Sens. 2020, 12, 4027.
40. Zhang, G.; Lu, S.; Zhang, W. CAD-Net: A Context-Aware Detection Network for Objects in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2019, 57, 10015–10024.
41. Xu, Z.; Zhang, W.; Zhang, T.; Li, J. HRCNet: High-Resolution Context Extraction Network for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2020, 13, 71.
42. Hu, G.; Du, B.; Wei, G. HG-SMA: Hierarchical guided slime mould algorithm for smooth path planning. Artif. Intell. Rev. 2023, 56, 9267–9327.
43. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. In Proceedings of the ICLR, San Juan, Puerto Rico, 2–4 May 2016.
44. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
45. Liu, Z.; Wang, H.; Weng, L.; Yang, Y. Ship Rotated Bounding Box Space for Ship Extraction From High-Resolution Optical Satellite Images With Complex Backgrounds. IEEE Geosci. Remote Sens. Lett. 2017, 13, 1074–1078.
46. Everingham, M.; Van Gool, L.; Williams, C.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338.
47. Krizhevsky, A.; Sutskever, I.; Hinton, G. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25.
48. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv 2019, arXiv:1906.07155.
49. Jin, Z.; Yu, D.; Song, L.; Yuan, Z.; Yu, L. You Should Look at All Objects. arXiv 2022, arXiv:2207.07889.

Figure 1. Problems that exist in remote sensing object detection when using horizontal bounding boxes. (a) The overlapping phenomenon of anchor boxes when using horizontal bounding boxes to detect objects in aerial images. The green box is the detected horizontal bounding box. (b) The problem where the proportion of target pixels is smaller when using horizontal bounding boxes to detect objects with large aspect ratios in aerial images. The red area in the figure is the target pixel, and the green area is the background pixel.

Figure 2. Anchor-based detector preset anchor box redundancy. The green boxes in the figure are the preset rotation boxes of the anchor-based detector.

Figure 3. The overall architecture of the AOGC network, where C₃, C₄, and C₅ represent the output features of the backbone, and CAM is the contextual attention module. H and W are the height and width of the feature map, and C is the number of categories in the network classification. OLA is the oriented label assignment method we designed, and Gaussian centerness is a centerness branch based on a two-dimensional Gaussian kernel.

Figure 4. CAFPN structure diagram. DConv represents dilated convolution, IP denotes the interpolation operation, and Conv (1 × 1) represents a 1 × 1 convolution for channel conversion. (a–c) show the implementations of the three layers of the CAFPN structure, respectively.

Figure 5. The FCOS positive and negative sample division method does not match the OBB detection. The blue area in the figure is the positive sample point area of FCOS, and the red box is the ground truth.

Figure 6. Oriented label assignment method and its modification for large aspect ratio objects. (a) After reducing the sampling area of the positive sample, the number of positive sample points of the target with a large aspect ratio was noted to be too small. In the figure, the red box is the original sampling area, and the green box is the sampling area reduced by half. (b) The sampling area after correction and the blue area is the sampling area of the positive sample.

Figure 7. The representation heat map of the Gaussian centerness.

Figure 8. Partial visualization results of our method on the DOTA-1.0 dataset.

Figure 9. Partial visualization results of FCOS-R and our method. (a) Part of the visualization results of the FCOS-R method on the DOTA dataset. (b) Part of the visualization results of our method on the DOTA dataset.

Figure 10. Partial visualization results of our method on the HRSC2016 dataset.

Table 1. Ablation experimental results on the DOTA dataset, where CAFPN represents the contextual attention FPN module we designed, OLA represents the oriented label assignment module, and GC represents the Gaussian centerness module.

Methods	Backbone	CAFPN	OLA	GC	mAP (%)
FCOS-R	ResNet50				69.58
		√			72.19
		√	√		73.39
		√	√	√	74.30

Table 2. Ablation experimental results for the OLA module on the DOTA dataset, where AT represents that we have added an affine transformation to the label assignment, HS represents half of the positive sample area, and OLA represents the revised oriented label assignment that we ultimately adopted.

Methods	Backbone	AT	HS	OLA	mAP (%)
FCOS-R	ResNet50				69.58
		√			70.44
		√	√		70.32
		√		√	70.90

Table 3. Comparison with the DOTA-1.0 test results of other state-of-the-art methods. The category names are abbreviated as follows. PL: plane, BD: baseball diamond, BR: bridge, GTF: ground track field, SV: small vehicle, LV: large vehicle, SH: ship, TC: tennis court, BC: basketball court, ST: storage tank, SBF: soccer ball field, RA: roundabout, HA: harbor, SP: swimming pool, and HC: helicopter. * indicates multi-scale training. The highlighted results represent the best results in each category.

Method	Backbone	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	mAP (%)
One-stage
Retina-Net-O	R-50	88.67	77.62	41.81	58.17	74.58	71.64	79.11	90.29	82.18	74.32	54.75	60.60	62.57	69.67	60.64	68.43
S2A-Net [22]	R-50	89.11	82.84	48.37	71.11	78.11	78.39	87.25	90.83	84.90	85.64	60.36	62.60	65.26	69.13	57.94	74.12
DAL [28]	R-50	88.68	76.55	45.08	66.80	67.00	76.76	79.74	90.84	79.54	78.45	57.71	62.27	69.05	73.14	60.11	71.44
R3Det [20]	R-101	88.76	83.09	50.91	67.27	76.23	80.39	86.72	90.78	84.68	83.24	61.98	61.35	66.91	70.63	53.94	73.79
Two-stage
RRPN [18]	R-101	88.52	71.20	31.66	59.30	51.85	56.19	57.25	90.81	72.84	67.38	56.69	52.84	53.08	51.94	53.58	61.01
RoI transformer [3]	R-101	88.64	78.52	43.44	75.92	68.81	73.68	83.59	90.74	77.27	81.46	58.39	53.54	62.83	58.93	47.67	69.56
SCRDet [21]	R-101	89.98	80.65	52.09	68.36	68.36	60.32	72.41	90.85	87.94	86.86	65.02	66.68	66.25	68.24	65.21	72.61
EDA [23]	R-50	89.2	83.5	51.6	69.3	77.6	74.9	86.3	90.9	85.6	85.9	59.5	64.8	68.1	66.4	57.3	74.1
Gliding Vertex [29]	R-101	89.64	85.00	52.26	77.34	73.01	73.14	86.82	90.74	79.02	86.81	59.55	70.91	72.94	70.86	57.32	75.02
Anchor-free
IENet [26]	R-101	88.15	71.38	34.26	51.78	63.78	65.63	71.61	90.11	71.07	73.63	37.62	41.52	48.07	60.53	49.53	61.24
Axis learning [30]	R-101	79.53	77.15	38.59	61.15	67.53	70.49	76.30	89.66	79.07	83.53	47.27	61.01	56.28	66.06	36.05	65.98
BBAVectors [24]	R-101	88.35	79.96	50.69	62.18	78.43	78.98	87.94	90.85	83.58	84.35	54.13	60.24	65.22	64.28	55.70	72.32
DRN [31]	H-104	88.91	80.22	43.52	63.35	73.48	70.69	84.94	90.14	83.85	84.11	50.12	58.41	67.62	68.60	52.50	70.70
O2-DNet [32]	H-104	89.31	82.14	47.33	61.21	71.32	74.03	78.62	90.76	82.23	81.26	60.93	60.17	58.21	66.98	61.03	71.04
ProbIoU [33]	R-50	89.09	72.15	46.92	62.22	75.78	74.70	86.62	89.59	78.35	83.15	55.83	64.01	65.50	65.46	46.32	70.04
AOGC (ours)	R-50	84.04	80.61	52.22	67.23	80.64	81.75	87.81	90.91	82.81	84.91	56.55	65.73	73.50	72.50	53.32	74.30
AOGC * (ours)	R-50	83.44	80.29	54.06	70.90	81.52	83.42	88.24	90.88	83.02	86.84	60.34	64.90	75.36	80.23	64.80	76.55

Table 4. Comparison with the HRSC2016 test results of other state-of-the-art methods. mAP (07) and mAP (12) represent the VOC2007 index and VOC2012 index, respectively. The highlighted results represent the methods with the highest accuracy.

Method	Backbone	mAP (07)	mAP (12)
R²CNN [19]	R-101	73.07	79.73
RRPN [18]	R-101	79.08	85.64
Axis Learning [30]	R-101	78.20	-
BBAVectors [24]	R-101	88.60	-
PIoU [34]	DLA-34	89.20	-
RoI Transformer [3]	R-101	86.20	-
DAL [28]	R-101	88.60	-
EDA [23]	R-50	89.13	-
S²A-Net [22]	R-101	90.17	95.01
AOGC (ours)	R-50	89.80	95.20

Table 5. Ablation experiment table of the CAFPN structure.

Method	Backbone	Neck	mAP (%)	Params (MB)	FLOPs (GB)
Baseline	ResNet50	FPN	69.58	206.91	31.92
Proposed Method	ResNet50	CAFPN	72.19	220.74	35.07

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, Z.; Bao, C.; Cao, J.; Hao, Q. AOGC: Anchor-Free Oriented Object Detection Based on Gaussian Centerness. Remote Sens. 2023, 15, 4690. https://doi.org/10.3390/rs15194690
