1. Introduction
Declining production costs and advancements in flight control technology [1,2] have broadened unmanned aerial vehicle (UAV) applications across diverse fields, including traffic monitoring [3,4,5], power facility inspection [6], security protection [7], and emergency response [8]. Object detection is crucial for these tasks and a prerequisite for advanced image processing [9].
The rapid advancement of deep learning technology has resulted in object-detection algorithms that can efficiently recognize objects in images or videos with robust generalization ability, enabling their use in various scenarios and tasks [10,11,12]. The You Only Look Once (YOLO) network has garnered widespread attention as a fast and accurate object-detection method. It divides an image into a grid of cells and, for each cell, predicts object bounding boxes and category probabilities in a single forward pass. This design enables rapid object detection, making the YOLO network suitable for numerous applications.
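As a minimal illustration of this grid-based formulation, the sketch below decodes boxes and class scores from a YOLOv1-style output tensor. The 7 × 7 grid, two boxes per cell, and 20 classes are illustrative assumptions, not the configuration used in this paper.

```python
import torch

def decode_yolo_grid(pred, num_boxes=2, num_classes=20, conf_thresh=0.25):
    """Decode a YOLOv1-style prediction map of shape (S, S, B*5 + C).

    Each grid cell predicts B boxes (x, y, w, h, confidence) plus C class
    probabilities shared by the cell. (x, y) are relative to the cell,
    (w, h) to the whole image. All sizes here are illustrative.
    """
    S = pred.shape[0]
    detections = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[num_boxes * 5: num_boxes * 5 + num_classes]
            for b in range(num_boxes):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                score = float(conf) * float(class_probs.max())
                if score < conf_thresh:
                    continue
                cx = (col + float(x)) / S   # cell-relative -> image-relative
                cy = (row + float(y)) / S
                detections.append((cx, cy, float(w), float(h), score,
                                   int(class_probs.argmax())))
    return detections

# Example: decode a random 7x7 grid with 2 boxes and 20 classes per cell.
boxes = decode_yolo_grid(torch.rand(7, 7, 2 * 5 + 20))
```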
Although the YOLO network has demonstrated excellent performance in object detection with a horizontal camera angle, it encounters significant challenges when applied to images captured by UAVs. As shown in Figure 1a,b, UAVs typically capture images from a high altitude, causing ground objects to appear small with limited information, which complicates the detection of small objects, such as pedestrians and vehicles [13,14]. Additionally, as shown in Figure 1c,d, UAVs can capture images from various angles, including overhead, oblique, and side perspectives, by adjusting their altitude, speed, and orientation. This diversity of perspectives results in different appearances and scales for the same object in different images, increasing the uncertainty in object detection. Furthermore, as shown in Figure 1e, targets in UAV images often exhibit a non-uniform spatial distribution, i.e., dense, sparse, or clustered [15,16,17], increasing the detection difficulty. Moreover, as shown in Figure 1f, the movement of objects, such as vehicles and pedestrians, combined with the movement of UAVs, can result in motion blur, particularly at high speeds or when capturing fast-moving objects. Motion blur degrades image quality, challenging the stability and robustness of detection algorithms [9].
Additionally, drone object-detection algorithms [18,19] are generally designed for two application scenarios. The first scenario involves the real-time transmission of raw data captured by a drone-mounted camera to a ground station, where post-processing occurs on a desktop workstation with substantial computational resources. In this scenario, no limitations exist regarding the size of the detection model; thus, the primary focus is to achieve the highest possible detection accuracy for small objects. The second scenario entails immediate image analysis on the drone’s embedded computer. The primary considerations are a lightweight model with low computational requirements and a small storage footprint suited to the resource-limited conditions of embedded devices.
We introduce a novel small-object-detection model, SOD-YOLO, intended for UAV imagery to overcome the existing challenges. To ensure the model’s adaptability across diverse applications, we have developed multiple variants (SOD-YOLO-n, SOD-YOLO-s, SOD-YOLO-m, and SOD-YOLO-l) based on the SOD-YOLO framework. The experimental findings on the authoritative VisDrone2019 dataset [20] indicate that the proposed SOD-YOLO networks achieve superior detection accuracy with few parameters and low computational complexity, significantly outperforming the baseline YOLOv8 networks.
The main contributions of this paper are as follows:
The receptive field convolutional block attention module (RFCBAM) [21] is employed for downsampling in the backbone network. This approach addresses the limitations of convolutional kernel parameter sharing by focusing on spatial features in the receptive field, enhancing the backbone network’s ability to extract features.
We developed a novel neck module, the balanced spatial and semantic information fusion pyramid network (BSSI-FPN), to facilitate multi-scale feature fusion. Given the importance of large-scale features for small objects, we increased the frequency of feature fusion across scales and incorporated a dynamic upsampling mechanism to achieve a balanced fusion of spatial detail and semantic content.
We propose several object-detection models to address the different requirements encountered in practical applications. These models include a large-scale, high-precision version designed for substantial computing devices, such as ground stations, and a streamlined version with fewer parameters and lower computational complexity optimized for embedded computers on drones. This diversification allows the effective deployment of our method in various complex scenarios.
The following sections of this paper are structured as follows: Section 2 provides an overview of related studies on the YOLO network. Section 3 presents the improved model for detecting small objects in UAV images and elaborates on the architecture and operational principles of the proposed model. Section 4 details the experimental setup, including the environment and parameters, and discusses various experiments, such as ablation studies, comparative analyses, and validations using the VisDrone2019 dataset to ascertain the method’s viability. Section 5 explains how the proposed model addresses several challenges encountered in UAV images. Section 6 summarizes the study’s findings and highlights potential directions for subsequent research.
2. Related Work
Recent advancements in computer technology have significantly enhanced deep learning-based object-detection methods, revolutionizing the field of computer vision. Single-stage approaches, particularly the YOLO network, have gained widespread popularity among these methods. The YOLO network is known for its real-time processing capabilities, enabling efficient and timely detection tasks. Its end-to-end training simplifies the implementation process, making it suitable for various applications. Additionally, the YOLO network strikes an optimal balance between speed and accuracy, providing reliable results rapidly. This balance and ease of deployment underscore its indispensable role in modern object-detection solutions. Consequently, the YOLO network has become a primary choice for professionals seeking robust and efficient object-detection frameworks.
2.1. The YOLO Series Methods
In 2016, Redmon et al. proposed the YOLOv1 network [22], which employs a single-stage approach to object detection. By treating object detection as a regression problem, YOLOv1 enables end-to-end object detection, significantly boosting detection speed. This innovation addresses the shortcomings of conventional two-stage detectors, which often have slower processing speeds, making them unsuitable for real-time applications. Redmon and Farhadi proposed YOLOv2 in 2017 [23] to enhance the detection accuracy for small objects and handle multi-scale variations more effectively. This advancement was achieved by integrating batch normalization [24], anchor boxes, and feature pyramid networks (FPNs), alleviating previous issues while maintaining real-time performance.
The following year, Joseph Redmon’s team introduced YOLOv3 [25], which employs Darknet-53 as its backbone network and improves category prediction accuracy, facilitating object detection across a broader range of categories. However, the model exhibited low performance in scenarios with high congestion or severe occlusion. In 2020, Bochkovskiy et al. introduced YOLOv4 [26], which incorporated various advanced techniques, including the mish activation function, a spatial pyramid pooling (SPP) block, the cross-stage partial network (CSPNet), and the path aggregation network (PANet) for neck fusion. They also implemented training strategies, such as mosaic data augmentation, cross mini-batch normalization (CmBN), DropBlock regularization, and self-adversarial training (SAT), to enhance the model’s robustness and generalization capabilities. Although YOLOv4 had a lower inference speed than YOLOv3, its detection accuracy was significantly higher, enabling it to compete with contemporary state-of-the-art two-stage detection models.
YOLOv5 [27] was developed by the Ultralytics team under the leadership of Glenn Jocher and underwent continuous updates and iterations. It uses the lightweight CSPDarknet as its backbone network and includes various optimization strategies, such as automatic mixed-precision training, multi-scale training, and adaptive anchor box generation, to streamline training. Although YOLOv5 required computational resources comparable to the state-of-the-art EfficientDet series of models, its peak accuracy was marginally inferior. However, YOLOv5 excelled in real-time performance, making it more suitable for applications requiring rapid inference. Li et al. proposed YOLOv6 [28], which replaced CSPDarknet with EfficientRep, maintaining high accuracy while enhancing hardware compatibility. Wang et al. developed YOLOv7 [29], which incorporated model reparameterization into the network architecture and a novel efficient layer aggregation network (ELAN) to improve network efficiency and performance. Additionally, Jocher et al. proposed YOLOv8 [30], which uses the novel C2f structure in the backbone network. YOLOv8 achieved higher performance than YOLOv7 owing to modifications that improve the gradient flow.
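For readers unfamiliar with the C2f block, the following hedged PyTorch sketch captures its general structure: split the stem output, run one part through a chain of bottlenecks, keep every intermediate result, and fuse them with a 1×1 convolution. Layer names and the bottleneck layout are simplified assumptions, not the Ultralytics implementation.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Convolution followed by BatchNorm and SiLU activation."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convolutions with a residual connection."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C2fSketch(nn.Module):
    """Simplified C2f: split, chain bottlenecks, concatenate all branches."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.blocks = nn.ModuleList(Bottleneck(self.c) for _ in range(n))
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # two branches
        for block in self.blocks:
            y.append(block(y[-1]))              # keep every intermediate map
        return self.cv2(torch.cat(y, dim=1))    # dense gradient flow

# Example: fuse a 64-channel feature map into 128 channels.
out = C2fSketch(64, 128, n=2)(torch.randn(1, 64, 80, 80))
```

The dense concatenation of intermediate bottleneck outputs is what the text refers to as the richer gradient flow relative to YOLOv7-style blocks.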
2.2. The Improved YOLO Methods
With the advancement of the YOLO algorithm, researchers have introduced numerous improvements, leading to the development of various small-object-detection methods. Xu et al. developed the YOLOv5s-pp [31] model, which enhances the detection performance of small targets in low-resolution images by incorporating a channel attention (CA) module to mitigate interference from complex backgrounds and negative samples. Yue et al. proposed LE-YOLO [32] based on YOLOv8n. This model introduces the LHGNet backbone network, the LGS bottleneck, and the LGSCSP fusion module, achieving real-time small object detection on edge devices such as drones. These modifications reduce computational complexity while maintaining high detection accuracy. Li et al. [33] improved YOLOv8 by integrating the Bi-PAN-FPN structure to enhance multi-scale feature fusion. They also replaced some C2f modules with the GhostblockV2 structure, reducing information loss and the number of model parameters. These enhancements improve the detection of small targets in complex environments. Tahir et al. developed the PVswin-YOLOv8s [14] model, which integrates the Swin Transformer to boost global feature extraction capability. The model also incorporates a CBAM module to improve attention to key features, enhancing its ability to detect small and occluded targets in complex environments. PVswin-YOLOv8s demonstrates superior performance in challenging detection scenarios by utilizing these advanced components.
2.3. Small-Object-Detection Methods for Other Frameworks
In the field of small object detection, several excellent methods have emerged beyond the YOLO framework. Qu et al. introduced the AMMFN [34] network, which enhances the extraction of small target features and reduces feature redundancy through the Detection Head Enhancement Module (DHEM) and the Attention Mechanism-Based Channel Cascade (AMCC) module. These innovations significantly improve the detection accuracy of small targets in remote sensing images. Yuan et al. developed the CFINet [35] framework, which ensures high-quality candidate region generation through the Coarse-to-Fine RPN (CRPN) and enhances small target feature representation using Feature Imitation (FI) learning. This framework demonstrates excellent detection performance on the SODA-D dataset. Jiang et al. designed the MFFSODNet [36] network, which boosts the detection performance of small objects by introducing a small-object-specific prediction head, a multi-scale feature extraction module, and a bidirectional dense feature pyramid network. Yang et al. [37] proposed a method that improves the acquisition of discriminative contextual information for small objects by enhancing convolutional feature transformation. This approach utilizes a wave function representation based on a visual multilayer perceptron (MLP) and a local maximum method to calculate the amplitude and phase of features. This results in more effective global representation and dynamic information aggregation, improving the detection performance of small objects in drone scenes.
2.4. The Classic Object-Detection Methods
Alongside the improvements to the YOLO network, several exemplary detection methods based on other frameworks have been developed. The faster region-based convolutional neural network (Faster R-CNN) proposed by Ren et al. [12] is a popular two-stage approach that combines the strengths of the R-CNN [10] and Fast R-CNN [11], offering improvements in accuracy and efficiency. Its main innovation is the region proposal network (RPN), which reduces redundant computation by sharing features with the backbone network, improving detection speed. However, despite these advancements, the algorithm has high complexity due to its two-stage detection pipeline.
Cascade R-CNN, proposed by Cai and Vasconcelos [38], is a modified version of Faster R-CNN that utilizes a cascade structure to improve detection accuracy. This approach employs multiple detection stages, usually three or more, each with a progressively increasing intersection over union (IoU) threshold. Each subsequent stage performs more precise filtering and refining of the high-quality candidate boxes identified in earlier stages. However, this method requires multiple detection stages, increasing model complexity and inference time. Additionally, two-stage detection methods are less effective at detecting small objects and require specialized optimization strategies.
RetinaNet is an object-detection model developed by Lin et al. [39]. It employs an innovative loss function called focal loss to address class imbalance during training. This loss function reduces the influence of easy-to-classify negative samples, allowing the model to focus on more challenging ones. However, the prediction of numerous anchors results in high computational complexity.
CenterNet was developed by Duan et al. [40]. It is an object-detection method that employs keypoint detection. It generates a heatmap to identify the most probable object locations, with the highest pixel values indicating the object centers. The network also predicts the size and category of the objects, providing a simpler identification process than conventional methods. However, detecting small objects can be challenging due to the limited resolution of the heatmap. Therefore, additional training is recommended to enhance the detection capabilities for smaller objects.
Detecting small objects in UAV images presents several challenges, including small object sizes, significant variations in object size, complex spatial distributions, background clutter, and interference. To address these challenges, we propose SOD-YOLO, based on the YOLOv8 architecture, for UAV image object detection. The proposed model achieves relatively high detection accuracy, and the optimized backbone and neck networks result in a compact network size.
4. Experiment
4.1. Image Datasets for Small Object Detection
Our study utilizes the VisDrone dataset and the SODA-D [46] dataset for validation. The VisDrone dataset is well known in the international UAV vision research community and is recognized as an authoritative source. We used the VisDrone2019 version, which was developed and publicly released by the AISKYEYE team from Tianjin University. It was designed to advance the development of UAV vision perception, especially for object detection, tracking, and semantic segmentation tasks. Due to its large volume and high diversity, the VisDrone2019 dataset serves as a benchmark reference for UAV vision research. It includes 288 video clips captured by UAVs in various complex environments and from different perspectives, such as city streets, rural fields, and construction sites. In addition, the VisDrone2019 dataset contains 10,209 aerial images captured by UAVs. These images cover a wide range of spatial locations and environmental conditions, supporting single-frame object-detection studies. The dataset is meticulously labeled with ten categories of objects. Following the division methodology of the VisDrone2019 Challenge, it is divided into a training set (6471 samples), a validation set (548 samples), and a test set (1610 samples). Some typical data samples are shown in Figure 9.
The SODA-D dataset, developed by Cheng et al. from Northwestern Polytechnical University, is a large-scale dataset specifically designed for small object detection. It comprises 24,828 high-quality traffic scene images, meticulously labeled with nine categories and a total of 278,433 instances. These instances include common traffic elements such as traffic lights, vehicles, and pedestrians. Due to its rich diversity, high-resolution images, and extensive number of small object instances, the SODA-D dataset is ideal for evaluating the performance of object-detection models in small-object processing. Some typical data samples are shown in Figure 10.
The experiments were executed on a Windows 11 operating system, leveraging Python 3.8, PyTorch 1.2.0, and CUDA 11.8 in the desktop computing environment. The hardware utilized for these experiments was an NVIDIA RTX 3090 graphics card. The neural network implementation code was adapted from Ultralytics version 8.0.202. Training was conducted for 150 epochs with input images resized to 640 × 640 pixels. The optimization utilized the stochastic gradient descent (SGD) algorithm, starting with an initial learning rate of 0.01 and progressively decreasing to a final learning rate of 0.0001. Each model was trained without pre-trained weights to ensure an equitable comparison.
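A hedged sketch of the corresponding Ultralytics training call is given below. The model and dataset YAML file names are hypothetical placeholders, and the final learning rate is expressed through the `lrf` fraction (0.01 × 0.01 = 0.0001) as the framework expects.

```python
from ultralytics import YOLO

# Hypothetical configuration files; actual names in the released code may differ.
model = YOLO("sod-yolo-s.yaml")   # build the architecture from scratch

model.train(
    data="VisDrone.yaml",   # dataset definition (train/val paths, 10 classes)
    epochs=150,             # training epochs used in the experiments
    imgsz=640,              # input images resized to 640 x 640
    optimizer="SGD",        # stochastic gradient descent
    lr0=0.01,               # initial learning rate
    lrf=0.01,               # final LR = lr0 * lrf = 0.0001
    pretrained=False,       # no pre-trained weights, for a fair comparison
)
```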
4.2. Experimental Evaluation Indicators
The proposed model was evaluated based on its detection performance and model size. The precision (P), recall (R), average precision (AP), and mean average precision (mAP) were employed to measure the accuracy across all object categories. Additionally, the computational complexity of the networks was assessed using the giga floating-point operations (GFLOPs) metric. The model size was determined by the number of parameters, and frames per second (FPS) was employed to measure the model’s real-time processing speed.
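For reference, the accuracy metrics follow the standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives, and N is the number of object categories:

$$
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i
$$

mAP50 denotes the mAP computed at an IoU threshold of 0.5.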
4.3. Comparative Analysis with YOLOv8
The performance and model size of SOD-YOLO and YOLOv8 on the VisDrone2019 dataset are shown in Table 3, Table 4, and Figure 11.
The results in Table 3 demonstrate that SOD-YOLO outperforms the standard YOLOv8 model in terms of detection accuracy at different scales. Specifically, SOD-YOLO-s achieves a mAP50 score of 42.0%, 3 percentage points higher than that of YOLOv8s. Additionally, the number of model parameters has been significantly reduced from 11.1 million to 1.75 million, and the computational complexity has decreased from 28.8 GFLOPs to 20 GFLOPs. These findings indicate that SOD-YOLO achieves higher detection accuracy with a smaller model size and a lower computational cost. Furthermore, the FPS of SOD-YOLO at various scales is only slightly lower than that of YOLOv8, which is sufficient to meet the real-time processing requirements of UAV applications.
Figure 11 indicates that the proposed detection model consistently outperforms the baseline model, regardless of the scale. Further analysis reveals that the difference in detection performance between the proposed model and the baseline model (YOLOv8, in this case) increases significantly with the model scale. In the “n”-scale configuration, SOD-YOLO-n outperforms YOLOv8n by only 1.0 percentage point on the mAP50 indicator. However, when the scale is increased to “l”, the mAP50 gap widens to 7.7 percentage points. This phenomenon indicates that the detection accuracy of SOD-YOLO improves faster as the number of model parameters and the computational complexity increase. Therefore, the proposed model is expected to provide high object-detection accuracy when deployed on a hardware platform with sufficient computational resources, especially in large-scale configurations.
To accurately assess the detection capabilities of SOD-YOLO for objects of various sizes, we utilized the standard evaluation metrics of the COCO [47] dataset and conducted a comparative analysis with YOLOv8 at different scales. Table 4 indicates that SOD-YOLO and YOLOv8 exhibit competitive performance for large-sized objects. However, SOD-YOLO outperforms YOLOv8 in detecting small-sized targets. Specifically, SOD-YOLO-l surpasses YOLOv8l by 6.4 percentage points on the AP-small indicator. These results suggest that SOD-YOLO has significant advantages in detecting medium and small objects, with particular excellence in small object detection. Given that objects from UAV perspectives are typically medium to small in size, SOD-YOLO demonstrates high compatibility with these application scenarios.
4.4. Ablation Experiment
To verify the effectiveness of the improvement methods used in the SOD-YOLO network, we conducted ablation experiments based on the YOLOv8n model, incrementally adding the improvement methods. We trained all models with the same hyperparameters to ensure a fair comparison. The ablation experiment was designed as follows:
- (1) A: Improve the neck structure by expanding it upwards and adding a small-object-detection head.
- (2) B: Remove the large-object-detection head and increase the frequency of the multi-scale feature fusion by adding additional downsampling.
- (3) C: Add pointwise (PW) convolution.
- (4) D: Use the DySample module for upsampling.
- (5) E: Replace the downsampling operation of the backbone network with RFCBAM.
The experimental results in Table 5 indicate that the object-detection accuracy improves steadily as the improvement methods are added. Notably, the second improvement step results in the most significant performance increase, as shown in Figure 12. During this step, the mAP50 indicator increased by 7.1 percentage points compared to YOLOv8n, and the number of network parameters was reduced by nearly half by removing the horizontal branch responsible for large object detection. These findings demonstrate that large-scale feature maps contain rich spatial detail, and effectively utilizing this information can enhance the model’s performance in detecting small objects. However, a significant increase in computational complexity was observed, attributed to the repeated inclusion of large-scale feature maps in the fusion. Since the computational complexity of the fully improved model increased to 20.1 GFLOPs, we classified the model as “s” scale and compared it to YOLOv8s. The results show that the mAP50 of SOD-YOLO-s is 3 percentage points higher than that of YOLOv8s and 10 percentage points higher than that of YOLOv8n. The ablation experiments on the VisDrone2019 dataset indicate that our improvement strategy effectively enhances object-detection accuracy.
We introduced various feature pyramid network (FPN) methods into the baseline YOLOv8 model for further comparative testing. The results are shown in Table 6. When the BSSI-FPN structure is integrated into the YOLOv8n model, it achieves the highest detection accuracy with the fewest parameters, demonstrating robust feature fusion capabilities. These findings clearly illustrate the superiority of the proposed BSSI-FPN method.
4.5. Comparison with YOLO Series Models
Deep learning object-detection algorithms are typically categorized into single-stage and two-stage methods according to whether a separate region-proposal stage is used. Single-stage methods are generally more suitable for UAV object detection due to their inherent advantages, and the YOLO network is a representative framework owing to its excellent performance and broad applicability. We conducted a multi-level comparative analysis of SOD-YOLO and the YOLO series networks using the VisDrone2019 dataset to validate the detection performance of SOD-YOLO and its adaptability to specific UAV scenarios. The detection accuracy, model size, and computational complexity were analyzed.
4.5.1. Comparison with Lightweight Models
We conducted a comparative analysis of various lightweight models, including YOLOv3 Tiny, YOLOv5s, YOLOv6s, YOLOv7 Tiny, YOLOv8s, and the latest YOLOv10s. We trained all models using the same hyperparameters and without pre-trained weights to ensure a fair comparison. The experimental results are presented in Table 7.
The results indicate that SOD-YOLO-s achieves the highest detection accuracy with the fewest parameters and the lowest computational complexity, confirming its superiority. As shown in Figure 13, the accuracy advantage of the SOD-YOLO-s model is most pronounced for small objects, such as “people” and “pedestrians”, compared to larger objects like “car” and “bus”. SOD-YOLO-s outperforms YOLOv10s by 8.3 percentage points for the “pedestrian” category, demonstrating its superior performance in small object detection. Other lightweight networks exhibit lower detection performance because they were designed for larger objects in natural scenes and lack enhancements for small objects.
4.5.2. Comparison with Large-Scale Models
Similarly, we conducted a detailed comparative analysis of large-scale models on the VisDrone2019 dataset to demonstrate the competitiveness of the proposed network. The results are shown in Table 8 and Figure 14.
We included the YOLOv9e [52] model in this comparison. Table 8 shows that SOD-YOLO-l achieves the highest accuracy, with YOLOv9e ranking second. Specifically, SOD-YOLO-l’s mAP50 is 4.9 percentage points higher than that of YOLOv9e, while it has only 30.6% of YOLOv9e’s parameters. Figure 14 indicates that YOLOv9e surpasses SOD-YOLO-l in detecting “bus” and “truck” objects. However, its accuracy for the other, smaller object categories is significantly lower than that of SOD-YOLO-l. These findings underscore the superior performance of the SOD-YOLO series in small object detection. Notably, SOD-YOLO-l has the fewest parameters among all compared models, highlighting its design efficiency.
In summary, the experiments confirm that the proposed model provides the best performance at different scales. It has higher accuracy and computational efficiency and is more compact than the other models. The findings provide a reliable foundation for selecting the most appropriate model scale based on task requirements and computing platform conditions in practical applications.
4.6. Comparison with Improved YOLO Object-Detection Methods
To comprehensively evaluate the performance of the SOD-YOLO series of models, we selected improved YOLO algorithms of different scales for comparative analysis, including the small-scale LE-YOLO, the large-scale Drone-YOLO, and several medium-scale algorithms such as YOLOv5-pp, Modified YOLOv8, PVswin-YOLOv8, and UAV-YOLOv8. Table 9 reveals that the SOD-YOLO series of models outperforms the others in detection performance. Specifically, SOD-YOLO-s achieves a 2.7 percentage point higher mAP50 than the lightweight LE-YOLO while using only 1.75 million parameters, significantly fewer than LE-YOLO’s 10.5 million. Additionally, SOD-YOLO surpasses Drone-YOLO in detection performance while employing just one-fourth of the parameters. Among the medium-sized models, the SOD-YOLO-m model also proves superior in detection performance and parameter efficiency. Overall, the SOD-YOLO series achieves high detection performance with minimal parameter usage, outperforming both lightweight and larger models.
4.7. Comparison with Classical Object Detection Models
We conducted a comparative analysis of the SOD-YOLO detection model against object-detection models based on other architectures, such as MFFSODNet, CenterNet, RetinaNet, ORIN-YOLOX, Cascade R-CNN, and Faster R-CNN, for UAV applications. The experimental results are listed in Table 10.
The proposed SOD-YOLO-l outperforms the other algorithms. Faster R-CNN uses the RPN to generate candidate boxes. However, this method may overlook regions containing small objects because of their limited size, resulting in insufficient proposals or inaccurate localization, which adversely affects the subsequent detection. Cascade R-CNN enhances accuracy through multi-stage detection, but each additional stage increases the computational load; detecting small objects requires finer division and more iterations, reducing detection speed. RetinaNet uses focal loss to address category imbalance. However, its single prediction mechanism may not adequately learn the features of small objects, especially amid numerous background samples, where the sparse features of small objects are easily overwhelmed. CenterNet is based on keypoint detection; the center points of small objects are difficult to detect and vulnerable to noise interference, complicating localization.
MFFSODNet is a detection network designed for small objects and demonstrates satisfactory performance in object detection. However, its single-scale structure limits its suitability for scenarios requiring high accuracy. Similarly, although ORIN-YOLOX offers higher detection accuracy, its large number of parameters makes it unsuitable for application scenarios with limited computing resources. In contrast, the SOD-YOLO series provides multiple scale configurations, enhancing its adaptability to various scenarios. Additionally, the SOD-YOLO series incorporates specific optimizations for small object detection, making it particularly well suited for UAV images and offering superior detection performance.
4.8. Comparison on the SODA-D Dataset
To evaluate the generalization ability of the proposed model, we conducted comparative experiments on the SODA-D dataset using its evaluation criteria to assess object-detection performance. The results, shown in Table 11, indicate that SOD-YOLO-s exhibits superior performance across all evaluation metrics and surpasses YOLOv8s by 4.1 percentage points. Based on these findings, the proposed model possesses satisfactory generalization ability.
4.9. Visualization Experiment
We compared the detection results of the SOD-YOLO-l and SOD-YOLO-s models with those of the baseline models YOLOv8s and YOLOv8l. Four images were selected from the VisDrone2019 validation dataset, representing different scenes, lighting conditions, shooting positions, and object types. Critical detection areas are highlighted using white boxes to showcase the detection results. Due to the high density of detected objects in some scenes, we have hidden the confidence labels and category labels for clarity, using different colors to represent the corresponding categories.
Figure 15 shows a UAV image of a neighborhood during daylight hours at a moderate flight altitude and an oblique angle. Most objects were successfully identified and localized by the four models. Two pedestrians walking in the parking lot in the upper part of the image are highlighted with a white box. YOLOv8l and YOLOv8s failed to detect these two pedestrians. In contrast, SOD-YOLO-l and SOD-YOLO-s demonstrated higher detection sensitivity and accuracy, recognizing these small and often overlooked objects.
Figure 16 presents an oblique aerial image of a main road in a city captured by a drone at night, flying at a moderate altitude. A group of people gathered in front of a white van in the upper part of the image is outlined by a white box. Due to the drone’s movement during capture, the image of the group is slightly blurred, posing challenges for accurate object detection. The baseline models YOLOv8l and YOLOv8s fail to identify and locate the pedestrian group accurately due to this blurring. In contrast, SOD-YOLO-l and SOD-YOLO-s exhibit superior performance under low-light conditions and motion blur: despite the degradation caused by drone movement, they successfully distinguish three individual pedestrians within the crowd. These results validate the proposed models’ robust detection capability and adaptability to blurring in complex nighttime urban traffic scenarios.
Figure 17 displays a UAV image of an urban intersection during the daytime. The photograph was captured from a high altitude, causing ground objects, such as pedestrians and vehicles, to appear very small. A white box at the center of the image indicates an area containing a group of pedestrians; however, their small size makes them barely recognizable. The detection results show that the YOLOv8s and YOLOv8l models cannot detect the small pedestrians in the crowd, revealing their limitations in such scenes. In contrast, SOD-YOLO-s successfully detects one pedestrian in the highlighted area, and the larger SOD-YOLO-l accurately detects multiple pedestrians. These findings confirm that the proposed model has significant advantages in detecting small objects in UAV images.
Figure 18 shows a one-way street scene captured by a UAV during daylight hours. The UAV was flying at a low altitude and captured the scene horizontally. In the area marked by a white box, two pedestrians are partially obstructed by a moving vehicle, which challenges the object-detection algorithm’s ability to handle occlusion. YOLOv8s and YOLOv8l cannot detect the occluded pedestrians, indicating their limitations in handling complex occlusion scenarios. As also shown in Figure 18, SOD-YOLO-s similarly fails to recognize the occluded pedestrians. However, it is noteworthy that SOD-YOLO-l exhibits an exceptional ability to handle occlusions in this scenario. Benefiting from its larger model capacity, it successfully detects the pedestrians occluded by vehicles, demonstrating superior detection performance in complex occlusion scenarios.
5. Discussion
Object detection in UAV images presents numerous challenges. This section discusses how SOD-YOLO effectively addresses these challenges.
5.1. Significant Variations in the Object Size
In drone images, object sizes vary widely, ranging from small to large. To address this challenge, SOD-YOLO utilizes three detection heads at different scales: Detect1 (40 × 40), Detect2 (80 × 80), and Detect3 (160 × 160). Each detection head is optimized for a specific range of object sizes: Detect3 is designed for small objects, Detect2 for medium-sized objects, and Detect1 for larger objects. This multi-level design ensures that the algorithm can effectively handle variations in object size and detect small objects.
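As a quick consistency check, assuming the 640 × 640 input resolution used in the experiments (the object-size annotations in the comments are illustrative), the three head grids correspond to strides of 16, 8, and 4 pixels:

```python
# Relationship between detection-head grid size and stride for a 640 x 640 input.
input_size = 640
heads = {"Detect1": 40, "Detect2": 80, "Detect3": 160}

for name, grid in heads.items():
    stride = input_size // grid   # input pixels covered by one grid cell
    print(f"{name}: {grid}x{grid} grid, stride {stride}")
# Detect1: 40x40 grid, stride 16   -> larger objects
# Detect2: 80x80 grid, stride 8    -> medium-sized objects
# Detect3: 160x160 grid, stride 4  -> small objects
```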
According to the data presented in Table 4, SOD-YOLO demonstrates notable advantages in detecting small and medium-sized objects, outperforming YOLOv8. Meanwhile, SOD-YOLO also demonstrates satisfactory detection performance for larger objects. This is illustrated in Figure 17, where SOD-YOLO accurately detects both large-sized “cars” nearby and small-sized “people” in the distance. Additionally, Figure 17d confirms SOD-YOLO’s ability to successfully detect very small targets.
5.2. Complex Spatial Distributions, Background Clutter, and Interference
To address complex spatial distributions, background clutter, and various interferences, we incorporated the RFCBAM module into SOD-YOLO. This module enhances the feature extraction capability of the backbone network and mitigates the dilution of spatial information. Additionally, we introduced the BSSI-FPN module, which facilitates a balanced fusion of spatial and semantic information. These two improvements strengthen the algorithm’s ability to extract spatial information and perform semantic analysis. As a result, SOD-YOLO can accurately locate objects even when faced with complex spatial distributions, background clutter, and various interferences.
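The sketch below illustrates the general idea only, not the exact SOD-YOLO modules: an attention-weighted strided convolution stands in for RFCBAM-style downsampling, and a content-aware upsampling step fused with a high-resolution map stands in for the balanced spatial–semantic fusion of the BSSI-FPN (bilinear interpolation is used in place of DySample). All layer names and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDownsample(nn.Module):
    """Strided convolution re-weighted by channel and spatial attention
    (a stand-in for RFCBAM-style downsampling, not the exact module)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.down = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c_out, c_out, 1), nn.Sigmoid())
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(c_out, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, x):
        y = self.down(x)
        y = y * self.channel_gate(y)     # emphasise informative channels
        return y * self.spatial_gate(y)  # emphasise informative locations

def fuse_with_upsampling(low_res, high_res, proj):
    """Upsample the semantically rich low-resolution map and fuse it with the
    spatially rich high-resolution map (bilinear stands in for DySample)."""
    up = F.interpolate(low_res, size=high_res.shape[-2:],
                       mode="bilinear", align_corners=False)
    return proj(torch.cat([up, high_res], dim=1))

# Toy usage with illustrative channel counts and feature-map sizes.
high = torch.randn(1, 64, 160, 160)            # large-scale, spatially detailed
low = AttentionDownsample(64, 128)(high)       # 80 x 80, more semantic
proj = nn.Conv2d(128 + 64, 64, 1)
fused = fuse_with_upsampling(low, high, proj)  # 160 x 160 fused map
```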
Figure 16 illustrates a case in which image quality is degraded by nighttime shooting and slight motion blur. Despite these interference factors, SOD-YOLO still accurately locates the objects, exhibiting robust anti-interference capability. Similarly, Figure 18 shows that variations in distance result in significant differences in object size, complex spatial structures, and potential object occlusion. Nevertheless, as shown in Figure 18d, SOD-YOLO can still accurately locate the occluded pedestrians, demonstrating its anti-occlusion capability.
6. Conclusions
This paper proposes the SOD-YOLO model based on YOLOv8 for detecting small objects in UAV images. The model addresses common challenges in UAV object-detection tasks. It has a novel neck structure (BSSI-FPN) for multi-scale feature fusion. A balanced integration of spatial and semantic information is achieved in the feature map by utilizing large-scale features, increasing the frequency of multi-scale feature fusion, and implementing a dynamic upsampling strategy. Additionally, the backbone network includes the RFCBAM module as the downsampling layer. RFCBAM has higher feature extraction efficiency than a standard convolutional layer and mitigates the sparsity of spatial information caused by downsampling.
The experimental results on the VisDrone2019 dataset indicated that SOD-YOLO achieved higher detection accuracy than YOLOv8 at all scales while using fewer parameters. Further comparisons with other models in the YOLO series revealed that SOD-YOLO-s and SOD-YOLO-l excelled in overall and category-specific detection accuracy and had the fewest parameters among models of the same scale. These results demonstrate the superiority of the SOD-YOLO model for detecting small objects in UAV images.
The current improvements do not involve the detection head and do not incorporate mainstream attention mechanisms, indicating potential for further enhancement in specific tasks. Additionally, the evaluation of the SOD-YOLO model was restricted to the VisDrone2019 and SODA-D datasets. To comprehensively assess the network’s performance in real-world scenarios, further evaluations should be conducted using a wider range of UAV platforms and more diverse UAV datasets.