1. Introduction
Intelligent coal mine video surveillance is an important measure for ensuring production safety. AI (Artificial Intelligence) models detect pedestrians in surveillance video and control equipment or issue alarms according to pedestrian positions, which can effectively prevent operating equipment from injuring workers.
The CNN (Convolutional Neural Network) has achieved remarkable success in the field of intelligent image processing, and the accuracy of CNN-based image classification models has even surpassed that of human beings [1]. Owing to the excellent feature extraction performance of CNNs, various CNN-based object detection models have been proposed and applied in different fields [2,3]. However, traditional object detection models are usually deployed on cloud servers because of their large computing and storage demands. When intelligent analysis of monitoring video is required, the surveillance video must be transmitted to cloud servers through the network; the servers then analyze the video with AI models and return the analysis results through the network. This cloud computing process introduces serious transmission latency because of limited network bandwidth, and transmitting a large amount of surveillance video also causes serious network congestion [4]. Edge computing has been proposed to decentralize intelligent computing close to the data source, avoiding transmission latency and network congestion. Deploying object detection models on embedded platforms therefore not only avoids the problems caused by cloud computing, but also allows equipment or alarm devices to be controlled in real time according to the video analysis results. However, deploying AI models at the edge is difficult because of the constrained computing and storage resources of embedded platforms.
To deploy CNN models on embedded platforms, neural network compression methods have received a lot of attention from researchers. Neural network compression aims to reduce the number of parameters or calculations through model pruning, weight quantization, knowledge distillation, or other methods, so as to greatly improve real-time performance [5]. Model pruning improves inference speed by removing redundant neurons [6,7,8]. Pruning approaches for CNNs can be roughly divided into non-structured pruning and structured pruning. The inference speed of a non-structured-pruned model is difficult to accelerate because of its irregular memory access, unless specialized hardware or libraries are used [9]. Structured pruning preserves the regular structure of the CNN by directly removing whole filters [8]. Both pruning methods, however, require evaluating the importance of the filters/channels or weights to be pruned. We focus only on structured pruning in this paper.
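As a concrete illustration of structured pruning, the sketch below removes whole convolution filters ranked by a simple L1-norm criterion. The L1 norm is a common baseline importance measure used here only for illustration; the criterion proposed in this paper is the DCAM score introduced later, and all names in the sketch are illustrative.

```python
# Structured pruning sketch: remove whole filters with the lowest
# importance scores. Importance here is the L1 norm of each filter's
# weights -- a common baseline criterion (this paper instead uses DCAM
# scores in this role). All names are illustrative.

def l1_norm(filter_weights):
    """Sum of absolute weights of one filter (flattened)."""
    return sum(abs(w) for w in filter_weights)

def prune_filters(filters, prune_ratio):
    """Keep the (1 - prune_ratio) fraction of filters with the largest
    L1 norm; return the kept filters and their original indices."""
    scores = [l1_norm(f) for f in filters]
    n_keep = max(1, round(len(filters) * (1.0 - prune_ratio)))
    # Rank filter indices by importance score, highest first.
    ranked = sorted(range(len(filters)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # preserve the original filter order
    return [filters[i] for i in kept], kept

# Toy convolution layer with 4 filters (weights flattened).
layer = [
    [0.9, -0.8, 0.7],    # strong filter
    [0.01, 0.02, 0.0],   # near-zero -> redundant
    [0.5, 0.4, -0.6],
    [0.03, -0.01, 0.02], # near-zero -> redundant
]
pruned, kept_idx = prune_filters(layer, prune_ratio=0.5)
print(kept_idx)  # -> [0, 2]
```

Because whole filters are removed, the remaining layer is still a dense, regular convolution, which is why structured pruning accelerates inference on ordinary hardware.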
Currently, there are various approaches to evaluating the importance of filters or channels for structured pruning [7,8,10]. The attention mechanism [11,12] is used to enhance important information and suppress unnecessary information [13]. It was first widely used in NLP (Natural Language Processing) and was later introduced into the computer vision field [14]. Attention mechanisms improve computer vision performance by enhancing important features [15]. The output scale value of an attention mechanism represents both the enhancement applied to a feature and its importance level. Therefore, some researchers have designed channel attention modules for model pruning: channel attention mechanisms evaluate the importance level of channels, and the filters corresponding to low-importance channels are pruned [16]. However, applications of attention mechanisms to pruning object detection models are rare. Moreover, the high complexity of object detection models requires an advanced channel attention module to evaluate channel importance levels.
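To make the idea concrete, the sketch below shows an SE-style (squeeze-and-excitation) channel attention computation in plain Python: each channel is squeezed by global average pooling and gated through a sigmoid, and the resulting scale doubles as a channel importance score. This is a generic illustration, not the DCAM design proposed in this paper; the gate weights here are fixed for simplicity rather than learned.

```python
import math

# SE-style channel attention sketch: the per-channel sigmoid scale both
# reweights the feature map and serves as a channel importance score.
# Generic illustration (not the DCAM proposed in this paper); the gate
# weights are fixed here instead of learned.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feature_maps, gate_weights, gate_bias):
    """feature_maps: list of channels, each a 2D list (H x W).
    Returns one attention scale in (0, 1) per channel."""
    scales = []
    for ch, w, b in zip(feature_maps, gate_weights, gate_bias):
        # Squeeze: global average pooling over the spatial dimensions.
        pooled = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        # Excite: a (here fixed) linear gate followed by a sigmoid.
        scales.append(sigmoid(w * pooled + b))
    return scales

# Toy 3-channel feature map with 2x2 spatial size.
fmap = [
    [[1.0, 2.0], [3.0, 2.0]],  # active channel
    [[0.0, 0.1], [0.0, 0.1]],  # weak channel -> low importance
    [[2.0, 2.0], [2.0, 2.0]],
]
scales = channel_attention(fmap, gate_weights=[1.0, 1.0, 1.0],
                           gate_bias=[-1.0, -1.0, -1.0])
low_importance = [i for i, s in enumerate(scales) if s < 0.5]
print(low_importance)  # channels whose filters would be pruning candidates
```

In attention-guided pruning, the filters that produce the low-scoring channels are the ones removed, after which the network is fine-tuned.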
YOLO is a classical one-stage object detection model [2]. Compared with two-stage models, it offers high real-time performance and fewer parameters [17]. To deploy YOLO on embedded platforms, researchers have undertaken a lot of work to reduce its parameters and calculations [18,19,20,21]. However, identifying redundant channels or filters remains a challenge. CLAHE (Contrast Limited Adaptive Histogram Equalization) is usually combined with object detection models [22] to improve detection performance. However, the lighting environments in coal mines are complex and variable, and lighting conditions differ across monitoring areas, so the parameters of CLAHE need to be set according to the monitored field. Unfortunately, the parameters of CLAHE are usually fixed, which makes it difficult to adapt to the various places in coal mines. Moreover, GAN (Generative Adversarial Network)-based image augmentation algorithms require huge computing resources, leading to serious degradation of real-time performance [23]. Meanwhile, datasets for training GANs are difficult to obtain in coal mines. Hence, GAN-based image augmentation algorithms are not suitable for real-time intelligent monitoring in coal mines.
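The adaptive idea can be sketched as follows: estimate the scene brightness from a frame and map it to a CLAHE clip limit and tile-grid size, instead of keeping them fixed. The brightness thresholds and parameter values below are illustrative placeholders, not the ones selected by AEPSM (which perceives the environment with the CAP-YOLO Backbone).

```python
# Sketch of environment-adaptive CLAHE parameter selection: estimate
# the mean scene brightness and map it to a clip limit and tile-grid
# size. The thresholds and parameter values are illustrative
# placeholders, not those chosen by AEPSM.

def mean_brightness(gray_image):
    """Mean pixel value of a grayscale image given as a 2D list (0-255)."""
    total = sum(sum(row) for row in gray_image)
    count = sum(len(row) for row in gray_image)
    return total / count

def select_clahe_params(gray_image):
    """Return (clip_limit, tile_grid) for CLAHE based on brightness."""
    b = mean_brightness(gray_image)
    if b < 60:      # dark scene: stronger local contrast enhancement
        return 4.0, (8, 8)
    elif b < 130:   # moderate lighting
        return 2.0, (8, 8)
    else:           # bright scene: mild enhancement is enough
        return 1.0, (4, 4)

dark_frame = [[20, 35, 30], [25, 40, 28]]
bright_frame = [[200, 180, 190], [210, 205, 195]]
print(select_clahe_params(dark_frame))   # stronger enhancement when dark
print(select_clahe_params(bright_frame))
```

In practice, the selected values would be passed to a CLAHE implementation such as OpenCV's `cv2.createCLAHE(clipLimit=..., tileGridSize=...)` before the frame is fed to the detector.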
To solve the abovementioned problems, we propose CAP-YOLO and AEPSM for coal mine real-time intelligent monitoring. First, the DCAM (Deep Channel Attention Module) is designed to evaluate the importance level of channels. Then, the filters corresponding to low-importance channels in YOLOv3 are removed to form CAP-YOLO, and fine-tuning is used to recover the accuracy of CAP-YOLO. Finally, AEPSM is designed and combined with the Backbone of CAP-YOLO, enabling the parameters of CLAHE to be selected adaptively according to the environment.
The main contributions of this paper are summarized as follows:
- (1) DCAM is designed to evaluate the importance level of channels in feature maps.
- (2) A coal mine pedestrian dataset was established for transfer learning of YOLOv3. YOLOv3 was then pruned under the guidance of DCAM to form CAP-YOLO.
- (3) For the complex lighting environments in coal mines, AEPSM is proposed and combined with the Backbone of CAP-YOLO to perceive the lighting environment and set the parameters of CLAHE, improving the accuracy of object detection.
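Taken together, the contributions above form a single workflow: score channels, prune the corresponding filters, fine-tune, and adapt preprocessing to the scene. The skeleton below summarizes that workflow; every function body is a placeholder standing in for the real components (DCAM, the pruning and fine-tuning procedure, and AEPSM), and all values are illustrative.

```python
# Skeleton of the overall workflow. Every function body is a
# placeholder; the real components are DCAM, the pruning/fine-tuning
# procedure, and AEPSM. All values are illustrative.

def dcam_scores(model):
    """Placeholder: per-channel importance scores produced by DCAM."""
    return {"layer1": [0.9, 0.1, 0.7], "layer2": [0.2, 0.8]}

def prune(model, scores, threshold):
    """Placeholder: drop filters whose channel score is below threshold."""
    return {layer: [s for s in vals if s >= threshold]
            for layer, vals in scores.items()}

def fine_tune(pruned_model):
    """Placeholder: retrain the pruned model to recover accuracy."""
    return pruned_model

def aepsm_params(frame):
    """Placeholder: AEPSM-style CLAHE parameters for the current scene."""
    return {"clip_limit": 2.0, "tile_grid": (8, 8)}

model = "yolov3"  # stand-in for the full network
kept = fine_tune(prune(model, dcam_scores(model), threshold=0.5))
params = aepsm_params(frame=None)
print(kept, params)
```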
The remainder of this paper is organized as follows. Related work on model pruning and attention mechanisms is introduced in Section 2. In Section 3, DCAM, AEPSM, and the pruning approach are proposed. Section 4 provides experiments on and comparisons of the proposed approaches. Finally, we conclude this paper in Section 5.
6. Conclusions
In this paper, the DCAM was proposed to evaluate the channel importance level and identify redundant channels; we then pruned YOLOv3 based on DCAM to form CAP-YOLO. CAP-YOLO reached 86.7% mAP with the pruning ratio set to 93% and achieved an inference speed of 31 FPS on NVIDIA Jetson TX2. Meanwhile, we further proposed AEPSM to perceive the lighting environments of different coal mine fields, which adaptively sets the parameters of CLAHE to improve the accuracy of CAP-YOLO.
In the future, we will further study channel attention mechanisms for evaluating the importance level of channels. In addition, we will design a dedicated loss function or optimization method for DCAM and CAP-YOLO to improve the real-time performance and accuracy of intelligent video monitoring.