Improved YOLOv3 Integrating SENet and Optimized GIoU Loss for Occluded Pedestrian Detection
1. Introduction
- The channel attention mechanism of squeeze-and-excitation networks (SENet) is incorporated between the feature extraction layers of YOLOv3. It assigns larger weights to the features of the non-overlapping parts of pedestrians, which improves feature extraction for the parts that remain uncovered.
- A positional loss function, GIoUIoG, is proposed by replacing IoU in the Generalized Intersection over Union (GIoU) loss function with Intersection over Groundtruth (IoG). This keeps the areas of the predicted pedestrian frames constant, which addresses the inaccurate localization of pedestrians.
2. Related Works
2.1. Loss Function Works for Pedestrian Location
2.2. Network Model Works for Occluded Pedestrian Detection
3. Proposed Method: YOLOv3-Occ
3.1. Preliminary Work
3.1.1. SENet
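As a rough illustration of the channel attention that SENet provides, the sketch below follows the generic squeeze-and-excitation formulation (global average pooling, two fully connected layers with ReLU and sigmoid, then channel-wise rescaling). It is not the authors' integration into YOLOv3: the function name `se_block`, the nested-list feature-map layout, and the weight arguments `w1` and `w2` are assumptions made for this example, and the weights would normally be learned.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # layout assumed here: [channel][row][col]


def se_block(x: FeatureMap, w1: List[List[float]], w2: List[List[float]]) -> FeatureMap:
    """Generic squeeze-and-excitation step on one feature map (illustrative only).

    w1 has shape (C/r, C) and w2 has shape (C, C/r); both are hypothetical
    stand-ins for learned fully connected weights.
    """
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]

    # Excitation: FC -> ReLU -> FC -> sigmoid yields one weight in (0, 1) per channel.
    hidden = [max(0.0, sum(w * zi for w, zi in zip(w_row, z))) for w_row in w1]
    scale = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_row, hidden))))
        for w_row in w2
    ]

    # Scale: every value of a channel is multiplied by that channel's weight.
    return [[[v * s for v in row] for row in ch] for ch, s in zip(x, scale)]


if __name__ == "__main__":
    features = [[[1.0, 3.0]], [[2.0, 2.0]]]  # 2 channels, each 1 x 2
    # Toy weights that emphasize channel 0 and suppress channel 1.
    print(se_block(features, [[1.0, 0.0]], [[1.0], [-1.0]]))
```

Channels whose excitation weight is close to 1 pass through almost unchanged, while channels with a weight near 0 are suppressed; this is the mechanism by which features of visible pedestrian parts can be emphasized.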
3.1.2. GIoU Loss
3.1.3. Soft-NMS
3.2. The Architecture of the Proposed YOLOv3-Occ
3.3. The Proposed Loss Function: GIoUIoG Loss
Algorithm 1: The computation process of GIoUIoG loss
Input: the coordinates of the upper-left and lower-right corners of the prediction ($B^p$) and the ground truth ($B^g$): $B^p = (x_1^p, y_1^p, x_2^p, y_2^p)$, $B^g = (x_1^g, y_1^g, x_2^g, y_2^g)$
Output: GIoUIoG loss
Step 1. Calculate the coordinates of the common area ($I$) of $B^p$ and $B^g$: $x_1^I = \max(x_1^p, x_1^g)$, $y_1^I = \max(y_1^p, y_1^g)$, $x_2^I = \min(x_2^p, x_2^g)$, $y_2^I = \min(y_2^p, y_2^g)$, where $(x_1^I, y_1^I)$ is the upper-left corner and $(x_2^I, y_2^I)$ is the lower-right corner.
Step 2. Calculate the area of $I$: $S^I = (x_2^I - x_1^I) \times (y_2^I - y_1^I)$, where $x_2^I - x_1^I$ is the width of $I$ and $y_2^I - y_1^I$ is the height of $I$.
Step 3. Calculate the area of $B^p$: $S^p = (x_2^p - x_1^p) \times (y_2^p - y_1^p)$, where $x_2^p - x_1^p$ is the width of $B^p$ and $y_2^p - y_1^p$ is the height of $B^p$.
Step 4. Calculate the area of $B^g$: $S^g = (x_2^g - x_1^g) \times (y_2^g - y_1^g)$, where $x_2^g - x_1^g$ is the width of $B^g$ and $y_2^g - y_1^g$ is the height of $B^g$.
Step 5. Calculate the area of the union of $B^p$ and $B^g$: $S^U = S^p + S^g - S^I$, where $S^I$ is subtracted because it is counted twice in $S^p + S^g$.
Step 6. Calculate the IoG: $\mathrm{IoG} = S^I / S^g$, the ratio of the intersection area to the target area.
Step 7. Calculate the coordinates of the smallest external rectangle ($B^s$) surrounding $B^p$ and $B^g$: $x_1^s = \min(x_1^p, x_1^g)$, $y_1^s = \min(y_1^p, y_1^g)$, $x_2^s = \max(x_2^p, x_2^g)$, $y_2^s = \max(y_2^p, y_2^g)$, where $(x_1^s, y_1^s)$ is the upper-left corner and $(x_2^s, y_2^s)$ is the lower-right corner.
Step 8. Calculate the area of $B^s$: $S^s = (x_2^s - x_1^s) \times (y_2^s - y_1^s)$, where $x_2^s - x_1^s$ is the width of $B^s$ and $y_2^s - y_1^s$ is the height of $B^s$.
Step 9. Calculate the GIoUIoG: $\mathrm{GIoUIoG} = \mathrm{IoG} - (S^s - S^U) / S^s$, which is obtained by replacing IoU in GIoU with IoG.
Step 10. Calculate the GIoUIoG loss: $\text{GIoUIoG loss} = 1 - \mathrm{GIoUIoG}$.
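The ten steps of Algorithm 1 can be sketched for a single box pair as follows. This is a minimal scalar illustration rather than the authors' training code: the function name `giou_iog_loss` and the `Box` alias are hypothetical, and clamping the intersection width and height at zero for disjoint boxes is an added assumption that Algorithm 1 does not state.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2): upper-left and lower-right corners


def giou_iog_loss(pred: Box, gt: Box) -> float:
    """Compute the GIoUIoG loss of Algorithm 1 for one prediction/ground-truth pair.

    Assumes the ground-truth box has a positive area.
    """
    x1p, y1p, x2p, y2p = pred
    x1g, y1g, x2g, y2g = gt

    # Steps 1-2: intersection rectangle and its area.
    # Assumption (not in Algorithm 1): clamp at zero so disjoint boxes give area 0.
    inter_w = max(0.0, min(x2p, x2g) - max(x1p, x1g))
    inter_h = max(0.0, min(y2p, y2g) - max(y1p, y1g))
    s_i = inter_w * inter_h

    # Steps 3-5: areas of the prediction, the ground truth, and their union.
    s_p = (x2p - x1p) * (y2p - y1p)
    s_g = (x2g - x1g) * (y2g - y1g)
    s_u = s_p + s_g - s_i

    # Step 6: IoG is the intersection area divided by the ground-truth area.
    iog = s_i / s_g

    # Steps 7-8: area of the smallest rectangle enclosing both boxes.
    s_s = (max(x2p, x2g) - min(x1p, x1g)) * (max(y2p, y2g) - min(y1p, y1g))

    # Steps 9-10: GIoU with IoU replaced by IoG, then the loss.
    giou_iog = iog - (s_s - s_u) / s_s
    return 1.0 - giou_iog


if __name__ == "__main__":
    print(giou_iog_loss((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)))  # partial overlap
    print(giou_iog_loss((0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)))  # identical boxes
```

For the partially overlapping pair, the intermediate values are S^I = 1, S^p = S^g = 4, S^U = 7, IoG = 0.25 and S^s = 9, so the loss is 1 − (0.25 − 2/9) = 35/36. For identical boxes the loss is 0.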
4. Experimental Results and Analyses
4.1. Experiment Settings
4.1.1. Datasets
4.1.2. Evaluation Metrics
- Both P and R are computed for a single category in a single picture. Larger P and R indicate better performance. The formulas for P and R are given in (5) and (6), respectively:
  $P = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$, (5)
  $R = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$. (6)
- AP50 is computed for a single category over all pictures. It is the area enclosed by the P–R curve and the R axis when the iou-threshold is 0.5, and it measures the performance of the model in a given category. The larger the AP50, the better the performance.
- mAP@50 is the mean average precision (mAP) when the iou-threshold is 0.5, and it measures the performance of the model across all categories. The larger the mAP@50, the better the performance. The formula for mAP@50 is given in (7):
  $\mathrm{mAP@50} = \frac{1}{C}\sum_{c=1}^{C} \mathrm{AP50}_c$, (7)
  where $C$ is the number of categories and $\mathrm{AP50}_c$ is the AP50 of category $c$.
- MR$^{-2}$, the area enclosed by the MR–FPPI curve (miss rate against false positives per image) and the FPPI axis, is commonly used in pedestrian detection. A smaller MR$^{-2}$ indicates better performance.
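The definitions of P, R, AP50 and mAP@50 above can be illustrated with a few lines of Python. This is only a sketch: the function names are hypothetical, and the trapezoidal rule used for the area under the P–R curve is an assumption, since the text does not say how the area is integrated (benchmark toolkits often use an interpolated curve instead).

```python
from typing import Sequence


def precision(tp: int, fp: int) -> float:
    """P = true positives / (true positives + false positives)."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """R = true positives / (true positives + false negatives)."""
    return tp / (tp + fn)


def average_precision(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Area between the P-R curve and the R axis.

    Assumption: points are sorted by increasing recall and the area is
    approximated with the trapezoidal rule.
    """
    area = 0.0
    for i in range(1, len(recalls)):
        width = recalls[i] - recalls[i - 1]
        area += width * (precisions[i] + precisions[i - 1]) / 2.0
    return area


def mean_average_precision(ap_per_category: Sequence[float]) -> float:
    """mAP@50 = (1 / C) * sum of AP50 over the C categories."""
    return sum(ap_per_category) / len(ap_per_category)


if __name__ == "__main__":
    print(precision(tp=8, fp=2), recall(tp=8, fn=8))
    ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
    print(ap, mean_average_precision([ap, 0.625]))
```

The detections counted as true positives are those whose overlap with a ground-truth box meets the iou-threshold of 0.5; that matching step is outside the scope of this sketch.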
4.1.3. Detailed Settings
4.2. Experiments on CityPersons
4.2.1. Ablation Study
4.2.2. Comparisons with Previous Works
4.2.3. The Impact of the Hyperparameters on YOLOv3-Occ
4.2.4. Visual Comparison
4.3. Experiments on COCO2014
4.3.1. Ablation Study
4.3.2. Robustness Experiments
4.4. Computation Cost and Limitation
5. Conclusions and Future Directions
Author Contributions
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
[1] Bakalos, N.; Voulodimos, A.; Doulamis, N.; Doulamis, A.; Ostfeld, A.; Salomons, E.; Caubet, J.; Jimenez, V.; Li, P. Protecting water infrastructure from cyber and physical threats: Using multimodal data fusion and adaptive deep learning to monitor critical systems. IEEE Signal Process. Mag. 2019, 36, 36–48.
[2] Othman, N.A.; Aydin, I. A new IoT combined body detection of people by using computer vision for security application. In Proceedings of the 2017 9th International Conference on Computational Intelligence and Communication Networks (CICN), Girne, Northern Cyprus, 16–17 September 2017.
[3] Makantasis, K.; Doulamis, A.; Doulamis, N.; Psychas, K. Deep learning based human behavior recognition in industrial workflows. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016.
[4] Veres, G.; Grabner, H.; Middleton, L.; Van Gool, L. Automatic workflow monitoring in industrial environments. In Proceedings of the Computer Vision—ACCV 2010: 10th Asian Conference on Computer Vision, Queenstown, New Zealand, 8–12 November 2010.
[5] Gao, H.; Cheng, B.; Wang, J.; Li, K.; Zhao, J.; Li, D. Object classification using CNN-based fusion of vision and lidar in autonomous vehicle environment. IEEE Trans. Ind. Inform. 2018, 14, 4224–4231.
[6] Zhao, Y.; Hu, C.; Zhu, Z.; Qiu, S.; Chen, B.; Jiao, P.; Wang, F.Y. Crowd sensing intelligence for ITS: Participants, methods, and stages. IEEE Trans. Intell. Veh. 2023, 8, 3541–3546.
[7] Gao, H.; Lv, C.; Zhang, T.; Zhao, H.; Jiang, L.; Zhou, J.; Liu, Y.; Huang, Y.; Han, C. A structure constraint matrix factorization framework for human behavior segmentation. IEEE Trans. Cybern. 2022, 52, 12978–12988.
[8] Wang, Y.; Chan, P.H.; Donzella, V. Semantic-aware video compression for automotive cameras. IEEE Trans. Intell. Veh. 2023, 8, 3712–3722.
[9] Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005.
[10] Dollár, P.; Tu, Z.; Perona, P.; Belongie, S. Integral channel features. In Proceedings of the British Machine Vision Conference (BMVC), London, UK, 7–10 September 2009.
[11] Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 32, 1627–1645.
[12] Chi, C.; Zhang, S.; Xing, J.; Lei, Z.; Li, S.Z.; Zou, X. Pedhunter: Occlusion robust pedestrian detector in crowded scenes. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 3 April 2020.
[13] Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
[14] Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021.
[15] Gao, H.; Su, H.; Cai, Y.; Wu, R.; Hao, Z.; Xu, Y.; Wu, W.; Wang, J.; Li, Z.; Kan, Z. Trajectory prediction of cyclist based on dynamic Bayesian network and long short-term memory model at unsignalized intersections. Sci. China Inf. Sci. 2021, 64, 172207.
[16] Huang, X.; Ge, Z.; Jie, Z.; Yoshie, O. NMS by representative region: Towards crowded pedestrian detection by proposal pairing. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
[17] Liu, S.; Huang, D.; Wang, Y. Adaptive NMS: Refining pedestrian detection in a crowd. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
[18] Chu, X.; Zheng, A.; Zhang, X.; Sun, J. Detection in crowded scenes: One proposal, multiple predictions. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020.
[19] Lin, M.; Li, C.; Bu, X.; Sun, M.; Lin, C.; Yan, J.; Ouyang, W.; Deng, Z. DETR for crowd pedestrian detection. arXiv 2020, arXiv:2012.06785.
[20] Chi, C.; Zhang, S.; Xing, J.; Lei, Z.; Li, S.Z.; Zou, X. Relational learning for joint head and human detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 3 April 2020.
[21] Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
[22] Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
[23] Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. UnitBox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016.
[24] Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
[25] Gao, H.; Zhu, J.; Zhang, T.; Xie, G.; Kan, Z.; Hao, Z.; Liu, K. Situational assessment for intelligent vehicles based on stochastic model and gaussian distributions in typical traffic scenarios. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 1426–1436.
[26] Wang, X.; Xiao, T.; Jiang, Y.; Shao, S.; Sun, J.; Shen, C. Repulsion loss: Detecting pedestrians in a crowd. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018.
[27] Luo, Z.; Fang, Z.; Zheng, S.; Wang, Y.; Fu, Y. NMS-loss: Learning with non-maximum suppression for crowded pedestrian detection. In Proceedings of the 2021 International Conference on Multimedia Retrieval, Taipei, Taiwan, 21–24 August 2021.
[28] Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Occlusion-aware R-CNN: Detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
[29] Zhou, C.; Yuan, J. Multi-label learning of part detectors for heavily occluded pedestrian detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
[30] Zhang, S.; Chen, D.; Yang, J.; Schiele, B. Guided attention in CNNs for occluded pedestrian detection and re-identification. Int. J. Comput. Vis. 2021, 129, 1875–1892.
[31] Zou, T.; Yang, S.; Zhang, Y.; Ye, M. Attention guided neural network models for occluded pedestrian detection. Pattern Recognit. Lett. 2020, 131, 91–97.
[32] Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023.
[33] Zhang, S.; Benenson, R.; Schiele, B. Citypersons: A diverse dataset for pedestrian detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
[34] Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014.
[35] Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
[36] Liu, W.; Liao, S.; Ren, W.; Hu, W.; Yu, Y. High-level semantic feature detection: A new perspective for pedestrian detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
[37] Pang, Y.; Xie, J.; Khan, M.H.; Anwer, R.M.; Khan, F.S.; Shao, L. Mask-guided attention network for occluded pedestrian detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019.
[38] Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
[39] Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017.
[40] Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016.
[41] Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
[42] The Official Code of Yolov5. Available online: (accessed on 2 June 2020).
Achievements | Effect | Disadvantage |
MSE Loss [21] | Euclidean distance between a prediction and a target | Drastic change in the loss |
SmoothL1 Loss [22] | The l1 and l2 norms of the distance vector between a prediction and a target | Inequivalent to IoU |
IoU Loss [23] | Coordinates of bounding boxes regarded as a whole | Unoptimizable when a prediction and a target are disjoint |
GIoU loss [24] | The normalized area between a prediction and a target supplementing the IoU loss | Changing areas of prediction frames during the optimization of the loss |
Repulsion Loss [26] | Loss of predictions overlapped with other ground truths and predictions | The unevaluated weights of two losses |
NMS Loss [27] | The penalty of false positives and false negatives supplementing the loss | Only suitable for binary classification tasks |
Achievements | Effect | Disadvantage |
OR-CNN [28] | Divide pedestrians into several parts | Noise production |
Multi-label Learning [29] | A set of decision trees shared by the part detectors | / |
Guided Attention [30] | Channel-wise attention to pay attention to the unobstructed parts of the occludee | / |
AGNN [31] | Select features representing the body parts of pedestrians | / |
Dataset | Size of Training Set/Imgs | Size of Validation Set/Imgs | Size of Test Set/Imgs | Overlaps per Img |
CityPersons | 2975 | 500 | 1575 | 0.32 |
COCO2014 | 117,264 | 5000 | — | 0.015 |
Dataset | Full Bbox | Visible Bbox | Head Bbox |
CityPersons | ✓ | ✓ | |
COCO2014 | ✓ | | |
Name of Parameters | Value of Parameters |
device | NVIDIA GeForce RTX 3090 from USA |
GPU memory | 24 GB |
batch size | 32 |
epoch | 85 |
learning rate | $10^{-3}$ (epoch ≤ 65); $10^{-4}$ (epoch > 65) |
momentum | 0.9 |
σ | 0.5 |
iou-threshold | 0.5 |
SE (SENet) | GL (GIoUIoG loss) | P/% | R/% | mAP@50/% |
 | | 49.7 | 50.6 | 48.1 |
✓ | | 50.5 | 51.4 | 49.6 |
 | ✓ | 50.8 | 51.2 | 49.4 |
✓ | ✓ | 51.7 | 52.4 | 50.5 |
Method | Backbone | MR$^{-2}$/% | mAP@50/% |
EMD-RCNN [18] | ResNet-50 | 10.7 | 96.1 |
NMS-Ped [27] | ResNet-50 | 10.1 | — |
CSP [36] | ResNet-50 | 11.0 | — |
Adaptive-NMS [17] | VGG-16 | 11.9 | — |
MGAN [37] | VGG-16 | 11.5 | — |
Ours | Darknet-53 | 10.7 | 50.5 |
SE (SENet) | GL (GIoUIoG loss) | P/% | R/% | mAP@50/% |
 | | 48.3 | 49.6 | 47.5 |
✓ | | 48.8 | 50.3 | 48.1 |
 | ✓ | 49.3 | 50.1 | 48.6 |
✓ | ✓ | 50.5 | 51.9 | 49.7 |
Method | # Parameters/M | Average Training Time per Epoch/s | FPS in the Inference Process/(Imgs/s) |
YOLOv3 | 61 | 1.33 | 3 |
YOLOv3-Occ | 62 | 2.40 | 3 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (
Zhang, Q.; Liu, Y.; Zhang, Y.; Zong, M.; Zhu, J. Improved YOLOv3 Integrating SENet and Optimized GIoU Loss for Occluded Pedestrian Detection. Sensors 2023, 23, 9089.