1. Introduction
Declining production costs and advancements in flight control technology [1,2] have broadened unmanned aerial vehicle (UAV) applications across diverse fields, including traffic monitoring [3,4,5], power facility inspection [6], security protection [7], and emergency response [8]. Object detection is crucial for these tasks and a prerequisite for advanced image processing [9].
The rapid advancement of deep learning technology has resulted in object-detection algorithms that can efficiently recognize objects in images or videos with robust generalization ability, enabling their use in various scenarios and tasks [10,11,12]. The You Only Look Once (YOLO) network has garnered widespread attention as a fast and accurate object-detection method. It divides an image into a grid of cells and, for each cell, predicts object bounding boxes and category probabilities in a single forward pass. This design enables rapid object detection, making the YOLO network suitable for numerous applications.
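As a minimal illustration of this grid-based formulation, the sketch below decodes boxes and class scores from a YOLOv1-style output tensor. The 7 × 7 grid, two boxes per cell, and 20 classes are illustrative assumptions, not the configuration used in this paper.

```python
import torch

def decode_yolo_grid(pred, num_boxes=2, num_classes=20, conf_thresh=0.25):
    """Decode a YOLOv1-style prediction map of shape (S, S, B*5 + C).

    Each grid cell predicts B boxes (x, y, w, h, confidence) plus C class
    probabilities shared by the cell. (x, y) are relative to the cell,
    (w, h) to the whole image. All sizes here are illustrative.
    """
    S = pred.shape[0]
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[num_boxes * 5: num_boxes * 5 + num_classes]
            for b in range(num_boxes):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                score = float(conf) * float(class_probs.max())
                if score < conf_thresh:
                    continue
                cx = (col + float(x)) / S   # cell-relative -> image-relative
                cy = (row + float(y)) / S
                detections.append((cx, cy, float(w), float(h), score,
                                   int(class_probs.argmax())))
    return detections

# Example: decode a random 7x7 grid with 2 boxes and 20 classes per cell.
boxes = decode_yolo_grid(torch.rand(7, 7, 2 * 5 + 20))
```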
Although the YOLO network has demonstrated excellent performance in object detection with a horizontal camera angle, it encounters significant challenges when applied to images captured by UAVs. As shown in Figure 1a,b, UAVs typically capture images from a high altitude, causing ground objects to appear small with limited information, which complicates the detection of small objects, such as pedestrians and vehicles [13,14]. Additionally, as shown in Figure 1c,d, UAVs can capture images from various angles, including overhead, oblique, and side perspectives, by adjusting their altitude, speed, and orientation. This diversity of perspectives results in different appearances and scales for the same object in different images, increasing the uncertainty in object detection. Furthermore, as shown in Figure 1e, targets in UAV images often exhibit a non-uniform spatial distribution, i.e., dense, sparse, or clustered [15,16,17], increasing the detection difficulty. Moreover, as shown in Figure 1f, the movement of objects, such as vehicles and pedestrians, combined with the movement of UAVs, can result in motion blur, particularly at high speeds or when capturing fast-moving objects. Motion blur degrades image quality, challenging the stability and robustness of detection algorithms [9].
Additionally, drone object-detection algorithms [18,19] are generally designed for two application scenarios. The first scenario involves the real-time transmission of raw data captured by a drone-mounted camera to a ground station, where post-processing occurs on a desktop workstation with substantial computational resources. In this scenario, no limitations exist regarding the size of the detection model; thus, the primary focus is to achieve the highest possible detection accuracy for small objects. The second scenario entails immediate image analysis on the drone’s embedded computer. The primary considerations are a lightweight model with low computational requirements and a small storage footprint suited to the resource-limited conditions of embedded devices.
We introduce a novel small-object-detection model, SOD-YOLO, intended for UAV imagery to overcome the existing challenges. To ensure the model’s adaptability across diverse applications, we have developed multiple variants (SOD-YOLO-n, SOD-YOLO-s, SOD-YOLO-m, and SOD-YOLO-l) based on the SOD-YOLO framework. The experimental findings on the authoritative VisDrone2019 dataset [20] indicate that the proposed SOD-YOLO networks achieve superior detection accuracy with few parameters and low computational complexity, significantly outperforming the baseline YOLOv8 networks.
The main contributions of this paper are as follows:
The receptive field convolutional block attention module (RFCBAM) [21] is employed for downsampling in the backbone network. This approach addresses the limitations of convolutional kernel parameter sharing by focusing on spatial features in the receptive field, enhancing the backbone network’s ability to extract features.
We developed a novel neck module, the balanced spatial and semantic information fusion pyramid network (BSSI-FPN), to facilitate multi-scale feature fusion. Given the importance of large-scale features for small objects, we increased the frequency of feature fusion across scales and incorporated a dynamic upsampling mechanism to achieve a balanced fusion of spatial detail and semantic content.
We propose several object-detection models to address the different requirements encountered in practical applications. These models include a large-scale, high-precision version designed for substantial computing devices, such as ground stations, and a streamlined version with fewer parameters and lower computational complexity optimized for embedded computers on drones. This diversification allows the effective deployment of our method in various complex scenarios.
The following sections of this paper are structured as follows: Section 2 provides an overview of related studies on the YOLO network. Section 3 presents the improved model for detecting small objects in UAV images and elaborates on the architecture and operational principles of the proposed model. Section 4 details the experimental setup, including the environment and parameters, and discusses various experiments, such as ablation studies, comparative analyses, and validations using the VisDrone2019 dataset to ascertain the method’s viability. Section 5 explains how the proposed model addresses several challenges encountered in UAV images. Section 6 summarizes the study’s findings and highlights potential directions for subsequent research.
2. Related Work
Recent advancements in computer technology have significantly enhanced deep learning-based object-detection methods, revolutionizing the field of computer vision. Single-stage approaches, particularly the YOLO network, have gained widespread popularity among these methods. The YOLO network is known for its real-time processing capabilities, enabling efficient and timely detection tasks. Its end-to-end training simplifies the implementation process, making it suitable for various applications. Additionally, the YOLO network strikes an optimal balance between speed and accuracy, providing reliable results rapidly. This balance and ease of deployment underscore its indispensable role in modern object-detection solutions. Consequently, the YOLO network has become a primary choice for professionals seeking robust and efficient object-detection frameworks.
2.1. The YOLO Series Methods
In 2016, Redmon et al. proposed the YOLOv1 network [22], which employs a single-stage approach to object detection. By treating object detection as a regression problem, YOLOv1 enables end-to-end object detection, significantly boosting detection speed. This innovation addresses the shortcomings of conventional two-stage detectors, which often have slower processing speeds, making them unsuitable for real-time applications. Redmon and Farhadi proposed YOLOv2 in 2017 [23] to enhance the detection accuracy for small objects and handle multi-scale variations more effectively. This advancement was achieved by integrating batch normalization [24], anchor boxes, and feature pyramid networks (FPNs), alleviating previous issues while maintaining real-time performance.
The following year, Joseph Redmon’s team introduced YOLOv3 [25], which employs Darknet-53 as its backbone network and improves category prediction accuracy, facilitating object detection across a broader range of categories. However, the model exhibited low performance in scenarios with high congestion or severe occlusion. In 2020, Bochkovskiy et al. introduced YOLOv4 [26], which incorporated various advanced techniques, including the mish activation function, a spatial pyramid pooling (SPP) block, the cross-stage partial network (CSPNet), and the path aggregation network (PANet) for neck fusion. They also implemented training strategies, such as mosaic data augmentation, cross mini-batch normalization (CmBN), DropBlock regularization, and self-adversarial training (SAT), to enhance the model’s robustness and generalization capabilities. Although YOLOv4 had a lower inference speed than YOLOv3, its detection accuracy was significantly higher, enabling it to compete with contemporary state-of-the-art two-stage detection models.
YOLOv5 [27] was developed by the Ultralytics team under the leadership of Glenn Jocher and underwent continuous updates and iterations. It uses the lightweight CSPDarknet as its backbone network and includes various optimization strategies, such as automatic mixed-precision training, multi-scale training, and adaptive anchor box generation, to streamline training. Although YOLOv5 required computational resources comparable to the state-of-the-art EfficientDet series of models, its peak accuracy was marginally inferior. However, YOLOv5 excelled in real-time performance, making it more suitable for applications requiring rapid inference. Li et al. proposed YOLOv6 [28], which replaced CSPDarknet with EfficientRep, maintaining high accuracy while enhancing hardware compatibility. Wang et al. developed YOLOv7 [29], which incorporated model reparameterization into the network architecture and a novel efficient layer aggregation network (ELAN) to improve network efficiency and performance. Additionally, Jocher et al. proposed YOLOv8 [30], which uses the novel C2f structure in the backbone network. YOLOv8 achieved higher performance than YOLOv7 owing to modifications that improve the gradient flow.
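For readers unfamiliar with the C2f block, the following hedged PyTorch sketch captures its general structure: split the stem output, run one part through a chain of bottlenecks, keep every intermediate result, and fuse them with a 1×1 convolution. Layer names and the bottleneck layout are simplified assumptions, not the Ultralytics implementation.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Convolution followed by BatchNorm and SiLU activation."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with a residual connection."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C2fSketch(nn.Module):
    """Simplified C2f: split, chain bottlenecks, concatenate all branches."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.blocks = nn.ModuleList(Bottleneck(self.c) for _ in range(n))
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # two branches
        for block in self.blocks:
            y.append(block(y[-1]))              # keep every intermediate map
        return self.cv2(torch.cat(y, dim=1))    # dense gradient flow

# Example: fuse a 64-channel feature map into 128 channels.
out = C2fSketch(64, 128, n=2)(torch.randn(1, 64, 80, 80))
```

The dense concatenation of intermediate bottleneck outputs is what the text refers to as the richer gradient flow relative to YOLOv7-style blocks.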
2.2. The Improved YOLO Methods
With the advancement of the YOLO algorithm, researchers have introduced numerous improvements, leading to the development of various small-object-detection methods. Xu et al. developed the YOLOv5s-pp [31] model, which enhances the detection performance of small targets in low-resolution images by incorporating a channel attention (CA) module to mitigate interference from complex backgrounds and negative samples. Yue et al. proposed LE-YOLO [32] based on YOLOv8n. This model introduces the LHGNet backbone network, the LGS bottleneck, and the LGSCSP fusion module, achieving real-time small object detection on edge devices such as drones. These modifications reduce computational complexity while maintaining high detection accuracy. Li et al. [33] improved YOLOv8 by integrating the Bi-PAN-FPN structure to enhance multi-scale feature fusion. They also replaced some C2f modules with the GhostblockV2 structure, reducing information loss and the number of model parameters. These enhancements improve the detection of small targets in complex environments. Tahir et al. developed the PVswin-YOLOv8s [14] model, which integrates the Swin Transformer to boost global feature extraction capability. The model also incorporates a CBAM module to improve attention to key features, enhancing its ability to detect small and occluded targets in complex environments. PVswin-YOLOv8s demonstrates superior performance in challenging detection scenarios by utilizing these advanced components.
2.3. Small-Object-Detection Methods for Other Frameworks
In the field of small object detection, several excellent methods have emerged beyond the YOLO framework. Qu et al. introduced the AMMFN [34] network, which enhances the extraction of small target features and reduces feature redundancy through the Detection Head Enhancement Module (DHEM) and the Attention Mechanism-Based Channel Cascade (AMCC) module. These innovations significantly improve the detection accuracy of small targets in remote sensing images. Yuan et al. developed the CFINet [35] framework, which ensures high-quality candidate region generation through the Coarse-to-Fine RPN (CRPN) and enhances small target feature representation using Feature Imitation (FI) learning. This framework demonstrates excellent detection performance on the SODA-D dataset. Jiang et al. designed the MFFSODNet [36] network, which boosts the detection performance of small objects by introducing a small-object-specific prediction head, a multi-scale feature extraction module, and a bidirectional dense feature pyramid network. Yang et al. [37] proposed a method that improves the acquisition of discriminative contextual information for small objects by enhancing convolutional feature transformation. This approach utilizes a wave function representation based on a visual multilayer perceptron (MLP) and a local maximum method to calculate the amplitude and phase of features. This results in more effective global representation and dynamic information aggregation, improving the detection performance of small objects in drone scenes.
2.4. The Classic Object-Detection Methods
Alongside the improvements to the YOLO network, several exemplary detection methods based on other frameworks have been developed. The faster region-based convolutional neural network (Faster R-CNN) proposed by Ren et al. [12] is a popular two-stage approach that combines the strengths of the R-CNN [10] and Fast R-CNN [11], offering improvements in accuracy and efficiency. Its main innovation is the region proposal network (RPN), which reduces redundant computation by sharing features with the backbone network, improving detection speed. However, despite these advancements, the algorithm has high complexity due to its two-stage detection pipeline.
Cascade R-CNN, proposed by Cai and Vasconcelos [38], is a modified version of Faster R-CNN that utilizes a cascade structure to improve detection accuracy. This approach employs multiple detection stages, usually three or more, each with a progressively increasing intersection over union (IoU) threshold. Each subsequent stage performs more precise filtering and refining of the high-quality candidate boxes identified in earlier stages. However, this method requires multiple detection stages, increasing model complexity and inference time. Additionally, two-stage detection methods are less effective at detecting small objects and require specialized optimization strategies.
RetinaNet is an object-detection model developed by Lin et al. [39]. It employs an innovative loss function called focal loss to address class imbalance during training. This loss function reduces the influence of easy-to-classify negative samples, allowing the model to focus on more challenging ones. However, the prediction of numerous anchors results in high computational complexity.
CenterNet was developed by Duan et al. [40]. It is an object-detection method that employs keypoint detection. It generates a heatmap to identify the most probable object locations, with the highest pixel values indicating the object centers. The network also predicts the size and category of the objects, providing a simpler identification process than conventional methods. However, detecting small objects can be challenging due to the limited resolution of the heatmap. Therefore, additional training is recommended to enhance the detection capabilities for smaller objects.
Detecting small objects in UAV images presents several challenges, including small object sizes, significant variations in object size, complex spatial distributions, background clutter, and interference. To address these challenges, we propose SOD-YOLO, based on the YOLOv8 architecture, for UAV image object detection. The proposed model achieves relatively high detection accuracy, and the optimized backbone and neck networks result in a compact network size.
4. Experiment
4.1. Image Datasets for Small Object Detection
Our study utilizes the VisDrone dataset and the SODA-D [46] dataset for validation. The VisDrone dataset is well known in the international UAV vision research community and is recognized as an authoritative source. We used the VisDrone2019 version, which was developed and publicly released by the AISKYEYE team from Tianjin University. It was designed to advance the development of UAV vision perception, especially for object detection, tracking, and semantic segmentation tasks. Due to its large volume and high diversity, the VisDrone2019 dataset serves as a benchmark reference for UAV vision research. It includes 288 video clips captured by UAVs in various complex environments and from different perspectives, such as city streets, rural fields, and construction sites. In addition, the VisDrone2019 dataset contains 10,209 aerial images captured by UAVs. These images cover a wide range of spatial locations and environmental conditions, supporting single-frame object-detection studies. The dataset is meticulously labeled with ten categories of objects. Following the division methodology of the VisDrone2019 Challenge, it is divided into a training set (6471 samples), a validation set (548 samples), and a test set (1610 samples). Some typical data samples are shown in Figure 9.
The SODA-D dataset, developed by Cheng et al. from Northwestern Polytechnical University, is a large-scale dataset specifically designed for small object detection. It comprises 24,828 high-quality traffic scene images, meticulously labeled with nine categories and a total of 278,433 instances. These instances include common traffic elements such as traffic lights, vehicles, and pedestrians. Due to its rich diversity, high-resolution images, and extensive number of small object instances, the SODA-D dataset is ideal for evaluating the performance of object-detection models in small-object processing. Some typical data samples are shown in Figure 10.
The experiments were executed on a Windows 11 operating system, leveraging Python 3.8, PyTorch 1.2.0, and CUDA 11.8 in the desktop computing environment. The hardware utilized for these experiments was an NVIDIA RTX 3090 graphics card. The neural network implementation code was adapted from Ultralytics version 8.0.202. Training was conducted for 150 epochs with input images resized to 640 × 640 pixels. The optimization utilized the stochastic gradient descent (SGD) algorithm, starting with an initial learning rate of 0.01 and progressively decreasing to a final learning rate of 0.0001. Each model was trained without pre-trained weights to ensure an equitable comparison.
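A hedged sketch of the corresponding Ultralytics training call is given below. The model and dataset YAML file names are hypothetical placeholders, and the final learning rate is expressed through the `lrf` fraction (0.01 × 0.01 = 0.0001) as the framework expects.

```python
from ultralytics import YOLO

# Hypothetical configuration files; actual names in the released code may differ.
model = YOLO("sod-yolo-s.yaml")   # build the architecture from scratch

model.train(
    data="VisDrone.yaml",   # dataset definition (train/val paths, 10 classes)
    epochs=150,             # training epochs used in the experiments
    imgsz=640,              # input images resized to 640 x 640
    optimizer="SGD",        # stochastic gradient descent
    lr0=0.01,               # initial learning rate
    lrf=0.01,               # final LR = lr0 * lrf = 0.0001
    pretrained=False,       # no pre-trained weights, for a fair comparison
)
```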
4.2. Experimental Evaluation Indicators
The proposed model was evaluated based on its detection performance and model size. The precision (P), recall (R), average precision (AP), and mean average precision (mAP) were employed to measure the accuracy across all object categories. Additionally, the computational complexity of the networks was assessed using the giga floating-point operations (GFLOPs) metric. The model size was determined by the number of parameters, and frames per second (FPS) was employed to measure the model’s real-time processing speed.
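For reference, the accuracy metrics follow the standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives, and N is the number of object categories:

$$
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i
$$

mAP50 denotes the mAP computed at an IoU threshold of 0.5.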
4.3. Comparative Analysis with YOLOv8
The performance and model size of SOD-YOLO and YOLOv8 on the VisDrone2019 dataset are shown in Table 3, Table 4, and Figure 11.
The results in Table 3 demonstrate that SOD-YOLO outperforms the standard YOLOv8 model in terms of detection accuracy at different scales. Specifically, SOD-YOLO-s achieves a mAP50 score of 42.0%, 3 percentage points higher than that of YOLOv8s. Additionally, the number of model parameters has been significantly reduced from 11.1 million to 1.75 million, and the computational complexity has decreased from 28.8 GFLOPs to 20 GFLOPs. These findings indicate that SOD-YOLO achieves higher detection accuracy with a smaller model size and a lower computational cost. Furthermore, the FPS of SOD-YOLO at various scales is only slightly lower than that of YOLOv8, which is sufficient to meet the real-time processing requirements of UAV applications.
Figure 11 indicates that the proposed detection model consistently outperforms the baseline model, regardless of the scale. Further analysis reveals that the difference in detection performance between the proposed model and the baseline model (YOLOv8, in this case) increases significantly with the model scale. In the “n”-scale configuration, SOD-YOLO-n outperforms YOLOv8n by only 1.0 percentage point on the mAP50 indicator. However, when the scale is increased to “l”, the mAP50 gap widens to 7.7 percentage points. This phenomenon indicates that the detection accuracy of SOD-YOLO improves faster as the number of model parameters and the computational complexity increase. Therefore, the proposed model is expected to provide high object-detection accuracy when deployed on a hardware platform with sufficient computational resources, especially in large-scale configurations.
To accurately assess the detection capabilities of SOD-YOLO for objects of various sizes, we utilized the standard evaluation metrics of the COCO [47] dataset and conducted a comparative analysis with YOLOv8 at different scales. Table 4 indicates that SOD-YOLO and YOLOv8 exhibit competitive performance for large-sized objects. However, SOD-YOLO outperforms YOLOv8 in detecting small-sized targets. Specifically, SOD-YOLO-l surpasses YOLOv8l by 6.4 percentage points on the AP-small indicator. These results suggest that SOD-YOLO has significant advantages in detecting medium and small objects, with particular excellence in small object detection. Given that objects from UAV perspectives are typically medium to small in size, SOD-YOLO demonstrates high compatibility with these application scenarios.
4.4. Ablation Experiment
To verify the effectiveness of the improvement methods used in the SOD-YOLO network, we conducted ablation experiments based on the YOLOv8n model, incrementally adding the improvement methods. We trained all models with the same hyperparameters to ensure a fair comparison. The ablation experiment was designed as follows:
- (1) A: Improve the neck structure by expanding it upwards and adding a small-object-detection head.
- (2) B: Remove the large-object-detection head and increase the frequency of the multi-scale feature fusion by adding additional downsampling.
- (3) C: Add pointwise (PW) convolution.
- (4) D: Use the DySample module for upsampling.
- (5) E: Replace the downsampling operation of the backbone network with RFCBAM.
The experimental results in Table 5 indicate that the object-detection accuracy improves steadily as the improvement methods are added. Notably, the second improvement step results in the most significant performance increase, as shown in Figure 12. During this step, the mAP50 indicator increased by 7.1 percentage points compared to YOLOv8n, and the number of network parameters was reduced by nearly half by removing the horizontal branch responsible for large object detection. These findings demonstrate that large-scale feature maps contain rich spatial detail, and effectively utilizing this information can enhance the model’s performance in detecting small objects. However, a significant increase in computational complexity was observed, attributed to the repeated inclusion of large-scale feature maps in the fusion. Since the computational complexity of the fully improved model increased to 20.1 GFLOPs, we classified the model as “s” scale and compared it to YOLOv8s. The results show that the mAP50 of SOD-YOLO-s is 3 percentage points higher than that of YOLOv8s and 10 percentage points higher than that of YOLOv8n. The ablation experiments on the VisDrone2019 dataset indicate that our improvement strategy effectively enhances object-detection accuracy.
We introduced various feature pyramid network (FPN) methods into the baseline YOLOv8 model for further comparative testing. The results are shown in Table 6. When the BSSI-FPN structure is integrated into the YOLOv8n model, it achieves the highest detection accuracy with the fewest parameters, demonstrating robust feature fusion capabilities. These findings clearly illustrate the superiority of the proposed BSSI-FPN method.
4.5. Comparison with YOLO Series Models
Deep learning object-detection algorithms are typically categorized into single-stage and two-stage methods according to whether a separate region-proposal stage is used. Single-stage methods are generally more suitable for UAV object detection due to their inherent advantages, and the YOLO network is a representative framework owing to its excellent performance and broad applicability. We conducted a multi-level comparative analysis of SOD-YOLO and the YOLO series networks using the VisDrone2019 dataset to validate the detection performance of SOD-YOLO and its adaptability to specific UAV scenarios. The detection accuracy, model size, and computational complexity were analyzed.
4.5.1. Comparison with Lightweight Models
We conducted a comparative analysis of various lightweight models, including YOLOv3 Tiny, YOLOv5s, YOLOv6s, YOLOv7 Tiny, YOLOv8s, and the latest YOLOv10s. We trained all models using the same hyperparameters and without pre-trained weights to ensure a fair comparison. The experimental results are presented in Table 7.
The results indicate that SOD-YOLO-s achieves the highest detection accuracy with the fewest parameters and the lowest computational complexity, confirming its superiority. As shown in Figure 13, the accuracy advantage of the SOD-YOLO-s model is most pronounced for small objects, such as “people” and “pedestrians”, compared to larger objects like “car” and “bus”. SOD-YOLO-s outperforms YOLOv10s by 8.3 percentage points for the “pedestrian” category, demonstrating its superior performance in small object detection. Other lightweight networks exhibit lower detection performance because they were designed for larger objects in natural scenes and lack enhancements for small objects.
4.5.2. Comparison with Large-Scale Models
Similarly, we conducted a detailed comparative analysis of large-scale models on the VisDrone2019 dataset to demonstrate the competitiveness of the proposed network. The results are shown in Table 8 and Figure 14.
We included the YOLOv9e [52] model in this comparison. Table 8 shows that SOD-YOLO-l achieves the highest accuracy, with YOLOv9e ranking second. Specifically, SOD-YOLO-l’s mAP50 is 4.9 percentage points higher than that of YOLOv9e, while it has only 30.6% of YOLOv9e’s parameters. Figure 14 indicates that YOLOv9e surpasses SOD-YOLO-l in detecting “bus” and “truck” objects. However, its accuracy for the other, smaller object categories is significantly lower than that of SOD-YOLO-l. These findings underscore the superior performance of the SOD-YOLO series in small object detection. Notably, SOD-YOLO-l has the fewest parameters among all compared models, highlighting its design efficiency.
In summary, the experiments confirm that the proposed model provides the best performance at different scales. It has higher accuracy and computational efficiency and is more compact than the other models. The findings provide a reliable foundation for selecting the most appropriate model scale based on task requirements and computing platform conditions in practical applications.
4.6. Comparison with Improved YOLO Object-Detection Methods
To comprehensively evaluate the performance of the SOD-YOLO series of models, we selected improved YOLO algorithms of different scales for comparative analysis, including the small-scale LE-YOLO, the large-scale Drone-YOLO, and several medium-scale algorithms such as YOLOv5-pp, Modified YOLOv8, PVswin-YOLOv8, and UAV-YOLOv8. Table 9 reveals that the SOD-YOLO series of models outperforms the others in detection performance. Specifically, SOD-YOLO-s achieves a 2.7 percentage point higher mAP50 than the lightweight LE-YOLO while using only 1.75 million parameters, significantly fewer than LE-YOLO’s 10.5 million. Additionally, SOD-YOLO surpasses Drone-YOLO in detection performance while employing just one-fourth of the parameters. Among the medium-sized models, the SOD-YOLO-m model also proves superior in detection performance and parameter efficiency. Overall, the SOD-YOLO series achieves high detection performance with minimal parameter usage, outperforming both lightweight and larger models.
4.7. Comparison with Classical Object Detection Models
We conducted a comparative analysis of the SOD-YOLO detection model against object-detection models based on other architectures, such as MFFSODNet, CenterNet, RetinaNet, ORIN-YOLOX, Cascade R-CNN, and Faster R-CNN, for UAV applications. The experimental results are listed in Table 10.
The proposed SOD-YOLO-l outperforms the other algorithms. Faster R-CNN uses the RPN to generate candidate boxes. However, this method may overlook regions containing small objects because of their limited size, resulting in insufficient proposals or inaccurate localization, which adversely affects the subsequent detection. Cascade R-CNN enhances accuracy through multi-stage detection, but each additional stage increases the computational load; detecting small objects requires finer division and more iterations, reducing detection speed. RetinaNet uses focal loss to address category imbalance. However, its single prediction mechanism may not adequately learn the features of small objects, especially amid numerous background samples, where the sparse features of small objects are easily overwhelmed. CenterNet is based on keypoint detection; the center points of small objects are difficult to detect and vulnerable to noise interference, complicating localization.
MFFSODNet is a detection network designed for small objects and demonstrates satisfactory performance in object detection. However, its single-scale structure limits its suitability for scenarios requiring high accuracy. Similarly, although ORIN-YOLOX offers higher detection accuracy, its large number of parameters makes it unsuitable for application scenarios with limited computing resources. In contrast, the SOD-YOLO series provides multiple scale configurations, enhancing its adaptability to various scenarios. Additionally, the SOD-YOLO series incorporates specific optimizations for small object detection, making it particularly well suited for UAV images and offering superior detection performance.
4.8. Comparison on the SODA-D Dataset
To evaluate the generalization ability of the proposed model, we conducted comparative experiments on the SODA-D dataset using its evaluation criteria to assess object-detection performance. The results, shown in Table 11, indicate that SOD-YOLO-s exhibits superior performance across all evaluation metrics and surpasses YOLOv8s by 4.1 percentage points. Based on these findings, the proposed model possesses satisfactory generalization ability.
4.9. Visualization Experiment
We compared the detection results of the SOD-YOLO-l and SOD-YOLO-s models with those of the baseline models YOLOv8s and YOLOv8l. Four images were selected from the VisDrone2019 validation dataset, representing different scenes, lighting conditions, shooting positions, and object types. Critical detection areas are highlighted using white boxes to showcase the detection results. Due to the high density of detected objects in some scenes, we have hidden the confidence labels and category labels for clarity, using different colors to represent the corresponding categories.
Figure 15 shows a UAV image of a neighborhood during daylight hours at a moderate flight altitude and an oblique angle. Most objects were successfully identified and localized by the four models. Two pedestrians walking in the parking lot in the upper part of the image are highlighted with a white box. YOLOv8l and YOLOv8s failed to detect these two pedestrians. In contrast, SOD-YOLO-l and SOD-YOLO-s demonstrated higher detection sensitivity and accuracy, recognizing these small and often overlooked objects.
Figure 16 presents an oblique aerial image of a main road in a city captured by a drone at night, flying at a moderate altitude. A group of people gathered in front of a white van in the upper part of the image is outlined by a white box. Due to the drone’s movement during capture, the image of the group is slightly blurred, posing challenges for accurate object detection. The baseline models YOLOv8l and YOLOv8s fail to identify and locate the pedestrian group accurately due to this blurring. In contrast, SOD-YOLO-l and SOD-YOLO-s exhibit superior performance under low-light conditions and motion blur: despite the degradation caused by drone movement, they successfully distinguish three individual pedestrians within the crowd. These results validate the proposed models’ robust detection capability and adaptability to blurring in complex nighttime urban traffic scenarios.
Figure 17 displays a UAV image of an urban intersection during the daytime. The photograph was captured from a high altitude, causing ground objects, such as pedestrians and vehicles, to appear very small. A white box at the center of the image indicates an area containing a group of pedestrians; however, their small size makes them barely recognizable. The detection results show that the YOLOv8s and YOLOv8l models cannot detect the small pedestrians in the crowd, revealing their limitations in such scenes. In contrast, SOD-YOLO-s successfully detects one pedestrian in the highlighted area, and the larger SOD-YOLO-l accurately detects multiple pedestrians. These findings confirm that the proposed model has significant advantages in detecting small objects in UAV images.
Figure 18 shows a one-way street scene captured by a UAV during daylight hours. The UAV was flying at a low altitude and captured the scene horizontally. In the area marked by a white box, two pedestrians are partially obstructed by a moving vehicle, which challenges the object-detection algorithm’s ability to handle occlusion. YOLOv8s and YOLOv8l cannot detect the occluded pedestrians, indicating their limitations in handling complex occlusion scenarios. As also shown in Figure 18, SOD-YOLO-s similarly fails to recognize the occluded pedestrians. However, it is noteworthy that SOD-YOLO-l exhibits an exceptional ability to handle occlusions in this scenario. Benefiting from its larger model capacity, it successfully detects the pedestrians occluded by vehicles, demonstrating superior detection performance in complex occlusion scenarios.
5. Discussion
Object detection in UAV images presents numerous challenges. This section discusses how SOD-YOLO effectively addresses these challenges.
5.1. Significant Variations in the Object Size
In drone images, object sizes vary widely, ranging from small to large. To address this challenge, SOD-YOLO utilizes three detection heads at different scales: Detect1 (40 × 40), Detect2 (80 × 80), and Detect3 (160 × 160). Each detection head is optimized for a specific range of object sizes: Detect3 is designed for small objects, Detect2 for medium-sized objects, and Detect1 for larger objects. This multi-level design ensures that the algorithm can effectively handle variations in object size and detect small objects.
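As a quick consistency check, assuming the 640 × 640 input resolution used in the experiments (the object-size annotations in the comments are illustrative), the three head grids correspond to strides of 16, 8, and 4 pixels:

```python
# Relationship between detection-head grid size and stride for a 640 x 640 input.
input_size = 640
heads = {"Detect1": 40, "Detect2": 80, "Detect3": 160}

for name, grid in heads.items():
    stride = input_size // grid   # input pixels covered by one grid cell
    print(f"{name}: {grid}x{grid} grid, stride {stride}")
# Detect1: 40x40 grid, stride 16   -> larger objects
# Detect2: 80x80 grid, stride 8    -> medium-sized objects
# Detect3: 160x160 grid, stride 4  -> small objects
```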
According to the data presented in Table 4, SOD-YOLO demonstrates notable advantages in detecting small and medium-sized objects, outperforming YOLOv8. Meanwhile, SOD-YOLO also demonstrates satisfactory detection performance for larger objects. This is illustrated in Figure 17, where SOD-YOLO accurately detects both large-sized “cars” nearby and small-sized “people” in the distance. Additionally, Figure 17d confirms SOD-YOLO’s ability to successfully detect very small targets.
5.2. Complex Spatial Distributions, Background Clutter, and Interference
To address complex spatial distributions, background clutter, and various interferences, we incorporated the RFCBAM module into SOD-YOLO. This module enhances the feature extraction capability of the backbone network and mitigates the dilution of spatial information. Additionally, we introduced the BSSI-FPN module, which facilitates a balanced fusion of spatial and semantic information. These two improvements strengthen the algorithm’s ability to extract spatial information and perform semantic analysis. As a result, SOD-YOLO can accurately locate objects even when faced with complex spatial distributions, background clutter, and various interferences.
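The sketch below illustrates the general idea only, not the exact SOD-YOLO modules: an attention-weighted strided convolution stands in for RFCBAM-style downsampling, and a content-aware upsampling step fused with a high-resolution map stands in for the balanced spatial–semantic fusion of the BSSI-FPN (bilinear interpolation is used in place of DySample). All layer names and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDownsample(nn.Module):
    """Strided convolution re-weighted by channel and spatial attention
    (a stand-in for RFCBAM-style downsampling, not the exact module)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.down = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c_out, c_out, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(c_out, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        y = self.down(x)
        y = y * self.channel_gate(y)     # emphasise informative channels
        return y * self.spatial_gate(y)  # emphasise informative locations

def fuse_with_upsampling(low_res, high_res, proj):
    """Upsample the semantically rich low-resolution map and fuse it with the
    spatially rich high-resolution map (bilinear stands in for DySample)."""
    up = F.interpolate(low_res, size=high_res.shape[-2:],
                       mode="bilinear", align_corners=False)
    return proj(torch.cat([up, high_res], dim=1))

# Toy usage with illustrative channel counts and feature-map sizes.
high = torch.randn(1, 64, 160, 160)            # large-scale, spatially detailed
low = AttentionDownsample(64, 128)(high)       # 80 x 80, more semantic
proj = nn.Conv2d(128 + 64, 64, 1)
fused = fuse_with_upsampling(low, high, proj)  # 160 x 160 fused map
```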
Figure 16 illustrates a case in which image quality is degraded by nighttime shooting and slight motion blur. Despite these interference factors, SOD-YOLO still accurately locates the objects, exhibiting robust anti-interference capability. Similarly, Figure 18 shows that variations in distance result in significant differences in object size, complex spatial structures, and potential object occlusion. Nevertheless, as shown in Figure 18d, SOD-YOLO can still accurately locate the occluded pedestrians, demonstrating its anti-occlusion capability.
6. Conclusions
This paper proposes the SOD-YOLO model based on YOLOv8 for detecting small objects in UAV images. The model addresses common challenges in UAV object-detection tasks. It has a novel neck structure (BSSI-FPN) for multi-scale feature fusion. A balanced integration of spatial and semantic information is achieved in the feature map by utilizing large-scale features, increasing the frequency of multi-scale feature fusion, and implementing a dynamic upsampling strategy. Additionally, the backbone network includes the RFCBAM module as the downsampling layer. RFCBAM has higher feature extraction efficiency than a standard convolutional layer and mitigates the sparsity of spatial information caused by downsampling.
The experimental results on the VisDrone2019 dataset indicated that SOD-YOLO achieved higher detection accuracy than YOLOv8 at all scales while using fewer parameters. Further comparisons with other models in the YOLO series revealed that SOD-YOLO-s and SOD-YOLO-l excelled in overall and category-specific detection accuracy and had the fewest parameters among models of the same scale. These results demonstrate the superiority of the SOD-YOLO model for detecting small objects in UAV images.
The current improvements do not involve the detection head and do not incorporate mainstream attention mechanisms, indicating potential for further enhancement in specific tasks. Additionally, the evaluation of the SOD-YOLO model was restricted to the VisDrone2019 and SODA-D datasets. To comprehensively assess the network’s performance in real-world scenarios, further evaluations should be conducted using a wider range of UAV platforms and more diverse UAV datasets.