1. Introduction
Computer vision-based detection of road cracks plays a vital role in both daily life and engineering. Road cracks are often the earliest manifestation of road distress, so problems need to be detected and addressed as early as possible. Traditional manual inspection is time-consuming and laborious, its results are not always accurate, and it can even endanger the inspector’s safety. In response, using computer vision to detect road defects in time so that maintenance can be carried out early has become a popular topic. Deep learning has advanced at a rapid pace in recent years, and related algorithms continue to achieve breakthroughs; many scholars have applied deep learning algorithms to road crack detection and achieved good outcomes. In 2016, Zhang et al. were the first to propose the use of deep learning for road crack detection, with a method that could directly use manually annotated smartphone images and automatically learn features [1]. Their method greatly reduces the cost of detection, overcomes the drawbacks of handcrafted features, and has better detection performance. However, it leaves much room for improvement in detection speed and accuracy, and many scholars have subsequently continued to improve it. CrackNet is a CNN-based pavement crack detection system proposed by Zhang et al. [2] that is characterized by the absence of any pooling layers, which would otherwise downsize the outputs of previous layers. CrackNet performs significantly better than the earlier three-dimensional (3D) shadow modeling, but it still has shortcomings, such as long analysis times and difficulty identifying hairline cracks. Inspired by CrackNet, Fei et al. proposed CrackNet-V for pixel-level crack detection in pavements, which uses a deeper structure with fewer parameters to improve detection speed and accuracy [3]. Ye et al. put forward a deep learning-based crack detection method called Ci-Net, which is a great improvement over edge detection [4]. Dung et al. suggested a crack semantic segmentation method based on a deep fully convolutional network (FCN), choosing VGG16 as the FCN encoder backbone; the semantic segmentation could determine the crack path [5]. Although this method was able to capture the road crack path, further research is still needed to automatically quantify the crack size. Kalfarisi et al. proposed two crack detection and segmentation methods, faster region-based convolutional neural network forest edge detection (FRCNN-FED) and mask R-CNN, with detection results showing a trade-off between speed and accuracy: mask R-CNN had higher AP values but slower detection [6]. The two methods were applied in a unified framework to achieve crack segmentation, quantitative evaluation, and visualization. Huyan et al. proposed CrackU-net, a method that achieved good results for pixel-level pavement crack detection [7].
Although deep learning has produced excellent performance in intelligent crack detection, which no longer requires manual feature extraction because the network learns features automatically, real road crack pictures often present many complications. When the well-known object detection algorithm Faster R-CNN [8] was tested, as seen in Section 3.3 of this paper, the dataset exhibited road marking interference, shallow cracks, blurring, and multiple cracks. Cracks under such conditions could go undetected, so we first needed to preprocess the dataset. Sparse representation and compressed sensing theory have both been applied to image processing.
Sparse representation has numerous uses in hyperspectral image anomaly object detection. Zhu et al. suggested both a target detection and a binary hypothesis model based on a sparse representation of object dictionary construction for hyperspectral image target detection, and the algorithms showed good performance [9]. Li et al. devised a method for detecting anomalies in hyperspectral images that accurately models the background using structural sparsity, and the experimental results show that it performs considerably better than various current methods [10]. Ling et al. utilized a sparsity-based anomaly identification technique for hyperspectral images based on the fact that background pixels can be approximated and anomalous pixels estimated as sparse linear combinations of their neighborhoods [11]. Huyan et al. put forward a hyperspectral image anomaly detection method based on a dictionary of background and potential anomalies, where the anomalous part uses a sparse representation to model its properties [12]. In addition to anomalous target detection in hyperspectral images, sparse representation is often used, for example, for video anomaly detection, fault diagnosis, and medical applications. Chu et al. combined sparse representation with unsupervised learning for video anomaly event detection, where sparse coding results for handcrafted features generated from the input were used to guide unsupervised feature learning [13]. Sun et al. advanced a sparse representation framework for anomaly detection that learns a dictionary in the latent space of a variational autoencoder and can reduce the dimensionality of high-dimensional data to save storage cost [14]. Yuan et al. developed a dictionary learning algorithm to detect anomalous events in surveillance videos that explored new structural information in the sparse representation framework, and its effectiveness was confirmed with real data [15]. Sparse representation has also been used by several scholars in recent years in the field of intelligent transportation. Cheng et al. used sparse coding to identify the type of weather while driving, providing a key factor for tasks such as road detection [16]. Wang et al. proposed a method of identifying concrete cracks based on L2 sparse representation, and experiments showed that their method had high accuracy and efficiency [17]. Gao et al. performed a sparse representation of taxi activities on spatially divided cells in cities that could perceive the spatiotemporal relationships between traffic flows and detect traffic anomalies in a timely manner [18].
Sparse representation has many applications in the field of anomaly detection. However, the essence of sparse representation is that, in a sufficiently large training sample space, an object of a given class can be approximately represented as a linear combination of similar training samples, so that when the object is represented over the entire sample space, its representation coefficients are sparse. Building the model therefore requires, among other conditions, that the sample space be large enough and that the object be linearly representable, and the method generalizes poorly. In contrast, deep learning object detection methods require less data and offer higher accuracy and better generalization than sparse representation methods.
Compressed sensing theory was first proposed by Donoho [19], Candes and Tao [20], and Candes and Romberg [21]. The theory states that compressible signals can be sampled at a rate far below the Nyquist rate and still be recovered accurately. The theory has been widely used in, for example, the fields of image processing, signal processing, medicine, and pattern recognition. Hu et al. proposed an intelligent fault diagnosis technique based on compressed sensing and improved multiscale networks, in which compressed sensing decreases the amount of data and extracts critical fault information while also providing enough training samples for subsequent learning [22]. Haneche et al. suggested a speech enhancement technique based on compressed sensing that could subtract the effect of noise in speech, thereby achieving a more accurate recovery of speech [23]. Li et al. designed a method based on linear discriminant analysis and compressed sensors to identify defect features on wood surfaces, which could save time by using the principle of compressed sensing to reduce the complex training process [24]. Zhang et al. suggested a wood defect classification approach based on principal component analysis and compressed sensing, with the compressed sensing classifier offering fewer parameters, greater flexibility, faster calculation, and higher classification accuracy [25]. Böttger et al. utilized a compressed sensing-based method for monitoring and locating gray-scale texture defects in real time, which has significant advantages in terms of accuracy and speed [26]. Islam et al. proposed a deep learning framework based on compressed sensing to automatically detect pneumonia on images, and the method is highly robust [27]. Shao et al. [28] proposed a feature learning and fault diagnosis method for rolling bearings based on compressed sensing that could reduce the amount of complex data, exclude background noise, and improve diagnosis efficiency.
Although good results have been achieved using compressed sensing for anomaly detection, finding an appropriate sparse representation matrix is a difficult mathematical problem. Anomaly detection based on compressed sensing theory also generalizes poorly, and its research and training process is not as easy as that of deep learning object detection methods such as Faster R-CNN. Nevertheless, the idea of compressed sensing is still a worthwhile reference.
Inspired by sparse representation and compressed sensing, four sparse feature algorithms were tested to highlight the cracks in the image and emphasize the features needed by denoising and enhancing the contrast between cracks and background. At the same time, data preprocessing methods can remove redundant information well without affecting the accuracy, reduce the data dimensionality and computation, and allow the model to learn the features needed better.
2. Method
Figure 1 shows the pipeline of our method. First, the dataset is preprocessed by four sparse feature methods, whose key operations are denoising, grayscale conversion, thresholding, and contrast enhancement. For each sparse feature method, multiple sets of parameters are compared to find the optimal ones. Then, the preprocessed dataset is used to train Faster R-CNN, and finally the optimal method is determined by comparing metrics.
2.1. Data Preprocessing Methods
During crack image acquisition, limitations in cost, electronic acquisition equipment, external environment, sensors, circuit structure, and so on may introduce noise, overexposure, underexposure, and blurring, which make the crack insufficiently prominent; image transmission is affected by the same factors. The sparse representation and compressed sensing methods that inspired us can solve these problems, but they usually approach them from a mathematical point of view, with a complex computational process and poor automation. In contrast, our method is combined with deep learning: it first processes the dataset with the idea of sparse features and then trains Faster R-CNN, which gives better generalization across different datasets as well as higher speed and accuracy.
Data preprocessing is very important in deep learning tasks. First, real-life tasks often involve high-dimensional data, and filtering out redundant features while keeping the important ones provides a good basis for subsequent learning. Second, sparse feature processing reduces the difficulty of learning, simplifies post-processing, and speeds up training.
In this paper, we propose a deep learning-based crack detection method that first uses the idea of image sparse representation and compressed sensing to preprocess the datasets. Only the pixels that represent crack features remain, while most non-crack pixels become relatively sparse, which can significantly improve the accuracy and efficiency of crack identification. The proposed method achieved good results on a limited dataset of crack images. Four algorithms were tested, namely linear smooth, median filtering, Gaussian smooth, and grayscale threshold; the optimal parameters of each algorithm were analyzed, and the preprocessed data were used to train Faster R-CNN.
2.1.1. Introduction of the 1st Algorithm
Algorithm1 is a kind of linear smooth filter which can eliminate noise. It is very effective in suppressing noise obeying normal distribution, and is used extensively in image preprocessing. Algorithm1 is processed by convolving the image using a mask whose template coefficients obey a two-dimensional Gaussian distribution that drops as the distance from the template’s center grows. Algorithm1 may retain a lot of visual detail. The two-dimensional Gaussian function is shown in Equation (1) and the pipeline of algorithm1 is shown in
Figure 2.
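$$G(x, y) = A \exp\left(-\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2}\right) \tag{1}$$
where $A$ is a constant, $(x, y)$ is the coordinate of any point inside the mask, $(x_0, y_0)$ is the coordinate of the center point of the mask, and $\sigma$ is the standard deviation.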
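As a rough illustration of the Gaussian mask and the convolution described above, the following pure-Python sketch builds a mask from a two-dimensional Gaussian and applies it to a small grayscale image. The function names, the default mask size and standard deviation, the normalization of the mask so that it sums to 1 (which fixes the constant), and the edge-replication border handling are all assumptions made for this example and are not taken from the paper.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a size x size mask sampled from a 2D Gaussian, normalized to sum to 1."""
    if size % 2 == 0:
        raise ValueError("size must be odd so that the mask has a center pixel")
    c = size // 2  # index of the mask center
    kernel = [
        [math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2.0 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]
    total = sum(sum(row) for row in kernel)
    # Assumption: normalizing plays the role of the constant in the Gaussian.
    return [[v / total for v in row] for row in kernel]


def gaussian_smooth(image, size=3, sigma=1.0):
    """Convolve a grayscale image (list of rows) with the Gaussian mask.

    Assumption: borders are handled by clamping coordinates (edge replication).
    """
    h, w = len(image), len(image[0])
    kernel = gaussian_kernel(size, sigma)
    c = size // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-c, c + 1):
                for dx in range(-c, c + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + c][dx + c] * image[yy][xx]
            out[y][x] = acc
    return out
```

Because the weights fall off with distance from the center, an isolated noisy pixel is averaged with its neighbors, which is also why edges become slightly blurred.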
2.1.2. Introduction of the 2nd Algorithm
Algorithm2 can remove the noise in the image while avoiding the blurred image details produced by algorithm1, and it is very effective at suppressing salt-and-pepper noise. Algorithm2 usually uses a sliding sampling window with an odd side length, such as (2n + 1) × (2n + 1), n = 1, 2, 3, …; it sorts the pixel values in the window, takes the median of the sorted values, and assigns it to the pixel at the center of the window. The pipeline of algorithm2 is shown in Figure 3, where a 3 × 3 sampling window is taken as an example to illustrate the principle of algorithm2. First, the values of the 9 pixels are sorted; the median value is 89, so the center pixel of the sampling window is replaced with 89.
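The window-sort-replace procedure can be sketched in a few lines of pure Python. The function name, the edge-replication border handling, and the 3 × 3 patch values are illustrative assumptions; the patch is not taken from Figure 3 and was only chosen so that its median is 89, as in the text.

```python
def median_filter(image, n=1):
    """Apply a (2n+1) x (2n+1) median filter to a grayscale image (list of rows).

    Assumption: borders are handled by clamping coordinates (edge replication).
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            window = []
            for dy in range(-n, n + 1):
                for dx in range(-n, n + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    window.append(image[yy][xx])
            window.sort()
            # The window has an odd number of entries, so the median is the middle one.
            out[y][x] = window[len(window) // 2]
    return out


# Hypothetical 3 x 3 patch with a salt-noise pixel (255) at the center.
patch = [
    [62, 89, 95],
    [74, 255, 101],
    [58, 91, 80],
]
center = median_filter(patch)[1][1]  # the noisy center is replaced by the median, 89
```

Unlike a weighted average, the median ignores the outlier entirely, so the noisy pixel does not bleed into its neighbors.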
2.1.3. Introduction of the 3rd Algorithm
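Algorithm3 is a nonlinear smoothing filter that is essentially a Gaussian filter extended to solve the edge blurring problem that occurs with algorithm1. In algorithm3, the intensity of a pixel is replaced by a weighted average of the intensity values of the surrounding pixels, with weights that take into consideration not only the Euclidean distance between pixels but also their radiometric difference in the range domain. The function of algorithm3 [29] is shown in Equations (2)–(5):
$$\tilde{I}(p) = \frac{1}{W_p} \sum_{q \in S} G_s(p, q)\, G_r(p, q)\, I(q) \tag{2}$$
$$W_p = \sum_{q \in S} G_s(p, q)\, G_r(p, q) \tag{3}$$
$$G_s(p, q) = \exp\left(-\frac{(i - m)^2 + (j - n)^2}{2\sigma_s^2}\right) \tag{4}$$
$$G_r(p, q) = \exp\left(-\frac{\left(I(p) - I(q)\right)^2}{2\sigma_r^2}\right) \tag{5}$$
where $I$ is the input image, $\tilde{I}$ is the filtered image, $G_s$ is the spatial domain kernel, $G_r$ is the image pixel domain kernel, $W_p$ is the normalizing sum of weights, $q$ is the input pixel point, $m$ and $n$ are the horizontal and vertical coordinates of the input pixel, $p$ is the box center pixel point, and $i$ and $j$ are the coordinates of the box center pixel. $S$ denotes the filter window centered on $p$, and $\sigma_s$ and $\sigma_r$ are the standard deviations of the spatial domain and pixel domain kernels.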
As shown in Figure 4, where the crack meets the background, the result is mainly determined by the image pixel domain kernel, so the crack edge is preserved. Within the background, the result is mainly determined by the spatial domain kernel, so the pixels are smoothed.
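The interplay of the two kernels can be illustrated with a minimal pure-Python sketch of a standard bilateral-style filter, in which each neighbor's weight is the product of a spatial Gaussian (distance to the window center) and a range Gaussian (intensity difference to the window center). The function name, the window radius, and the two standard deviations `sigma_s` and `sigma_r` are hypothetical choices for this example, not the parameters used in the paper.

```python
import math


def bilateral_filter(image, radius=1, sigma_s=1.0, sigma_r=25.0):
    """Edge-preserving smoothing of a grayscale image (list of rows).

    Each output pixel is a weighted average of its neighbors, where the weight
    is the product of a spatial kernel and a pixel (range) kernel.
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            weighted_sum = 0.0
            weight_total = 0.0  # normalizing sum of weights
            for m in range(max(i - radius, 0), min(i + radius, h - 1) + 1):
                for n in range(max(j - radius, 0), min(j + radius, w - 1) + 1):
                    g_s = math.exp(-((i - m) ** 2 + (j - n) ** 2) / (2 * sigma_s ** 2))
                    g_r = math.exp(-((image[i][j] - image[m][n]) ** 2) / (2 * sigma_r ** 2))
                    weight = g_s * g_r
                    weighted_sum += weight * image[m][n]
                    weight_total += weight
            out[i][j] = weighted_sum / weight_total
    return out
```

On a sharp dark-to-bright step, neighbors across the edge receive a range weight close to zero and barely contribute, so the edge survives; in a nearly uniform region the range weight is close to one and the filter behaves like ordinary Gaussian smoothing.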
2.1.4. Introduction of the 4th Algorithm
Algorithm4 is the process of adjusting the grayscale values of an image based on set values, eliminating pixels within the image that are above a certain value or below a certain value, which can give the entire image a distinct black and white effect. Algorithm4 enhances the subsequent work and facilitates the next step of model training, allowing the dataset to reduce unnecessary features, remove redundant information, and highlight the contours of objects of interest, such as removing backgrounds and highlighting crack features.
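In this experiment, the crack image is first converted to grayscale, and the crack and background parts are cropped separately and their grayscale histograms plotted. A threshold is then chosen and used to divide the image into two parts, above and below the threshold, which are assigned different pixel values, as shown in Equation (6) and Figure 5:
$$\mathrm{dst}(x, y) = \begin{cases} \mathrm{src}(x, y), & \mathrm{src}(x, y) > \mathrm{thresh} \\ 0, & \text{otherwise} \end{cases} \tag{6}$$
where src(x, y) is the value of the original image at pixel position (x, y), dst(x, y) is the corresponding output value, and thresh is the chosen threshold; the part above the threshold is retained, and the part below or equal to the threshold takes the minimum value of zero.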
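A minimal sketch of the grayscale conversion followed by the threshold rule is given below. The function names are hypothetical, the luminance weights 0.299, 0.587, and 0.114 are the common BT.601 convention and are an assumption here (the paper does not state how the grayscale image is computed), and the threshold value in the usage example is arbitrary.

```python
def to_gray(rgb_image):
    """Convert an RGB image (rows of (r, g, b) tuples) to grayscale.

    Assumption: BT.601 luminance weights; the paper does not specify the conversion.
    """
    return [
        [int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
        for row in rgb_image
    ]


def threshold_to_zero(gray, thresh):
    """Keep pixel values strictly above thresh and set all others to zero."""
    return [[v if v > thresh else 0 for v in row] for row in gray]


# Hypothetical usage with an arbitrary threshold of 90.
gray = [[10, 90], [91, 200]]
result = threshold_to_zero(gray, 90)  # [[0, 0], [91, 200]]
```

Pixels equal to the threshold are set to zero as well, matching the "below or equal" wording of the rule.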
2.2. Faster R-CNN
Faster R-CNN is divided into three critical sections: feature extraction, region proposal networks, and region CNN, as shown in Figure 6. We used Residual Network (ResNet) [30] and feature pyramid network (FPN) [31] for the feature extraction part. After extracting the features, the feature extractor sends the five extracted feature maps of different sizes to the following network.
2.2.1. Network Structure of Faster R-CNN
In the region proposal network (RPN) part, the five feature maps are fed to five identical RPNs, which are used to create region proposals. Each RPN generates anchors of various sizes to obtain a certain number of region proposal feature maps. In the region CNN (R-CNN) part, the region proposal feature maps obtained in the previous step are brought to a uniform size and then fed into the fully connected layers for classification and regression. Our network used ResNet101 as the backbone, which comprises five stages with 100 convolution layers in total and, in its original form, one fully connected layer, for a total of 101 layers. We used batch normalization and employed ReLU as the activation function. The entire Faster R-CNN network adopts a joint training strategy in which all losses are summed for unified gradient descent.
2.2.2. Loss Function of Faster R-CNN
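The loss function of Faster R-CNN includes the Fast R-CNN loss [32] and the RPN loss [8]. It is a multi-task loss, and both parts include a classification loss $L_{cls}$ and a regression loss $L_{reg}$, as shown in Equation (7):
$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \tag{7}$$
where $i$ is the anchor index of the mini-batch, and $p_i$ is the likelihood of the anchor being the target. The ground truth label $p_i^*$ indicates whether the anchor is positive or negative; if the anchor is positive, $p_i^* = 1$, and if the anchor is negative, $p_i^* = 0$. $t_i$ denotes the coordinate vector of the bounding box, and $t_i^*$ is the ground truth bounding box coordinate vector corresponding to the positive anchor. $N_{cls}$ is the mini-batch size and $N_{reg}$ is the number of anchor locations; the two terms are normalized by them and weighted by a parameter $\lambda$. The classification loss $L_{cls}$ is the logarithmic loss over two classes (object or non-object) and is shown in Equation (8); the regression loss is shown in Equation (9), where $R$ is the $\mathrm{smooth}_{L_1}$ function, as shown in Equation (10).
$$L_{cls}(p_i, p_i^*) = -\log\left[p_i^* p_i + (1 - p_i^*)(1 - p_i)\right] \tag{8}$$
$$L_{reg}(t_i, t_i^*) = R(t_i - t_i^*) \tag{9}$$
$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases} \tag{10}$$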
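To make the structure of the multi-task loss concrete, the following sketch evaluates it for a tiny, hand-made list of anchors. It follows the standard Faster R-CNN formulation (log loss for classification, smooth L1 summed over the box coordinates for regression, regression counted only for positive anchors). The function names, the anchor tuple layout, the example numbers, and the default weight `lam=10.0` are assumptions for illustration; the paper does not state the value of the weighting parameter.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def cls_loss(p, p_star):
    """Log loss over two classes; p is the predicted objectness, p_star is 0 or 1."""
    return -math.log(p_star * p + (1 - p_star) * (1 - p))


def reg_loss(t, t_star):
    """Smooth L1 summed over the box coordinate vector."""
    return sum(smooth_l1(a - b) for a, b in zip(t, t_star))


def multitask_loss(anchors, n_cls, n_reg, lam=10.0):
    """anchors: list of (p, p_star, t, t_star) tuples.

    Assumption: lam=10.0 is only a placeholder default for this sketch.
    """
    cls_term = sum(cls_loss(p, ps) for p, ps, _, _ in anchors) / n_cls
    # Multiplying by p_star switches the regression term off for negative anchors.
    reg_term = sum(ps * reg_loss(t, ts) for _, ps, t, ts in anchors) / n_reg
    return cls_term + lam * reg_term


# Hypothetical mini-batch: one positive anchor and one negative anchor.
example_anchors = [
    (0.9, 1, (0.1, 0.2, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)),
    (0.2, 0, (5.0, 5.0, 5.0, 5.0), (0.0, 0.0, 0.0, 0.0)),
]
total = multitask_loss(example_anchors, n_cls=2, n_reg=2)
```

The negative anchor has a large box offset, but it contributes nothing to the regression term, which is the role of the ground truth label in the second sum.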
2.2.3. Mean of Average Precision (mAP)
The mAP is a key indicator of model performance in the field of object detection. The mAP value is the area under the precision-recall curve after smoothing, as shown in Equation (13):
$$\mathrm{mAP} = \int_0^1 P(r)\, \mathrm{d}r \tag{13}$$
where $P$ is Precision and $r$ is Recall.
3. Analysis
Urban roads are usually maintained before or soon after cracks appear, and future traffic volume is predicted at the design stage to avoid overload, so collecting a large number of road crack pictures in a short time is not an easy task. Our method requires only about 150 crack images to achieve good results. The dataset contains 150 representative crack images, of which 90% are training data and 10% are testing data. We used random flipping to expand the dataset, setting the probability of both horizontal and vertical flips to 0.5, which finally expanded the dataset by 50%.
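The random flip augmentation can be sketched as follows. The sketch flips a grayscale image and updates its bounding boxes so that the labels stay aligned with the cracks. The function name and the list-of-rows image representation are hypothetical, and the boxes are assumed to be in the COCO `[x, y, width, height]` convention in pixel units; how the flipped copies are combined with the originals to reach the reported 50% expansion is not modeled here.

```python
import random


def random_flip(image, boxes, p_h=0.5, p_v=0.5, rng=random):
    """Randomly flip an image (list of rows) and its boxes.

    boxes: list of [x, y, width, height] with (x, y) the top-left corner.
    p_h / p_v: probabilities of a horizontal / vertical flip.
    """
    h, w = len(image), len(image[0])
    if rng.random() < p_h:
        # Horizontal flip: mirror each row and the box x-coordinates.
        image = [row[::-1] for row in image]
        boxes = [[w - x - bw, y, bw, bh] for x, y, bw, bh in boxes]
    if rng.random() < p_v:
        # Vertical flip: reverse the row order and mirror the box y-coordinates.
        image = image[::-1]
        boxes = [[x, h - y - bh, bw, bh] for x, y, bw, bh in boxes]
    return image, boxes
```

Passing `p_h=1.0, p_v=0.0` forces a horizontal flip only, which is convenient for checking that a box on the left edge ends up on the right edge.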
3.1. Selection and Processing of Datasets
The dataset was derived partly from a publicly available crack dataset called CFD and partly from actual road collection, with images taken by smartphones or cameras. The road crack dataset CFD used in this study is publicly available (https://github.com/cuilimeng/CrackForest-dataset (accessed on 11 May 2017)). The crack images in the dataset are affected, for example, by broken road markings, shadows, water on the road, different road materials, and blurring. The original image size is 480 × 320 pixels, and all images were resized to 1333 × 800 pixels in our experiment. The optimal threshold is influenced by brightness, and the brightness of the images in the dataset varies somewhat. As can be seen in Figure 5, we determined the range of optimal thresholds statistically, after counting the gray levels of the background and cracks in the dataset, so there will inevitably be individual images with poor results. At the same time, retaining more differences also allows the network to learn the features better for actual detection and can improve the detection effect. If a different dataset is used, the optimal threshold can be redetermined, but in general, the brightness differences within one dataset are not too large, so our method can separate the cracks from the background very well. We marked the cracks in the images by drawing rectangular boxes, and the marked images are shown in Figure 7. The data were converted into the COCO format. The dataset after preprocessing by the four sparse feature methods introduced in Section 2.1 is shown in Figure 8.
Figure 7 shows the labeling of the dataset; both the training data and the testing data are labeled with rectangular boxes. If one image contains multiple cracks in the vertical and horizontal directions, as in Figure 7, multiple rectangular boxes are used to label it.
Figure 8 shows the comparison between the original image and the images after sparse feature processing: (a) is the original image, which has the most detailed information but also much redundant information and noise; (b) is the image after algorithm1, which removes the most redundant information, although the image has the disadvantage of blurred edges; (c) is the image after algorithm2, which removes part of the redundant information while retaining the crack information well; (d) is the image after algorithm3, which is similar to that of algorithm2; and (e) is the image after algorithm4, in which the crack is clearly emphasized.
3.2. Experimental Details
Our experiments were conducted on the Ubuntu 20.04.3 operating system with an NVIDIA RTX 3060 GPU, based on PyTorch 1.8.2 and CUDA 11.1. After several tests, the learning rate was set to 0.005, which gave the best results, and we trained for a total of 24 epochs. We used a backbone pretrained on ImageNet and applied a fine-tuning strategy to train our network. We also used the SGD optimizer, batch normalization, and warm-up in training to improve performance.
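For reference, the settings stated above can be collected in a small configuration dictionary, together with a sketch of a learning rate warm-up. The dictionary keys and the function name are hypothetical. The linear warm-up shape and the values of `warmup_iters` and `warmup_ratio` are assumptions made only for illustration, since the text says that warm-up was used but does not give its schedule.

```python
train_cfg = {
    # Values stated in the text.
    "optimizer": "SGD",
    "base_lr": 0.005,
    "epochs": 24,
    "input_size": (1333, 800),
    "backbone": "ResNet101 + FPN, pretrained on ImageNet",
    # Assumptions for this sketch only; not given in the text.
    "warmup_iters": 500,
    "warmup_ratio": 0.001,
}


def warmup_lr(iteration, cfg=train_cfg):
    """Linearly ramp the learning rate from base_lr * warmup_ratio up to base_lr."""
    if iteration >= cfg["warmup_iters"]:
        return cfg["base_lr"]
    alpha = iteration / cfg["warmup_iters"]
    factor = cfg["warmup_ratio"] * (1 - alpha) + alpha
    return cfg["base_lr"] * factor
```

Starting from a very small learning rate and ramping up is a common way to keep the first updates of a fine-tuned detector from disturbing the pretrained backbone weights.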
3.3. Crack Detection Results and Comparative Analysis
Figure 9 shows the detection results for the testing data. Each group compares the detection results for the same image without processing and after the different sparse feature methods, in the order unprocessed, algorithm1, algorithm2, algorithm3, and algorithm4. All tests used a confidence score threshold of 0.6 or higher.
In group (a), there is road marking interference in the picture, and the crack on the right side is thin and shallow; therefore, the unprocessed method cannot detect the whole crack well, while the whole crack is detected completely after algorithm1 and algorithm2. For images with both horizontal and vertical cracks, such as group (b), the unprocessed method cannot enclose the cracks completely, while the four processed methods detect the cracks completely, and the confidence scores all improve. In group (c), where there are both horizontal and vertical cracks in the same image together with road markings, the unprocessed method cannot enclose the cracks completely, while algorithm3 and algorithm4 detect the cracks efficiently. For simple transverse cracks, such as in group (d), all methods detect the cracks well, but the confidence scores are higher after the four sparse feature methods. Groups (e) and (f) are crack images collected with road condition acquisition equipment in cooperation with road inspection agencies; both are characterized, for example, by blurring and road marking interference. A detection error, i.e., a false positive, can be seen in the upper right corner of the unprocessed result in group (e); there is no false positive at that position after algorithm1, algorithm2, and algorithm3, and the confidence score also increases substantially. The result after algorithm4 is still not satisfactory. The crack at the top left in group (f) is prone to being missed, i.e., a false negative, and algorithm1 and algorithm4 avoid this and detect the crack. In summary, each sparse feature method is effective in improving the detection results.
5. Conclusions
The appearance of cracks is an early warning of road distress, which will cause more serious traffic problems if not treated in time; therefore, it is especially important to detect cracks in a timely and efficient manner. In this paper, the Faster R-CNN algorithm was adopted and improved, yielding an efficient and high-accuracy method of detecting cracks in a timely manner.
- (i) An intelligent crack detection method based on a deep learning algorithm is investigated to improve accuracy and reliability on small sample sets.
- (ii) Combined with sparse representation and compressed sensing, the dataset was preprocessed, and various preprocessing algorithms were compared.
- (iii) The optimal parameters for different algorithms were compared and analyzed, and the corresponding mAP improved significantly. Among them, the mAP of algorithm2 is the largest, reaching 5%.
- (iv) The results show that the crack detection effects in complex situations such as road marking interference, shallow cracks, multiple cracks, and ambiguity were significantly improved.
- (v) Algorithm1 is aimed at shallow cracks, with better results for road marking interference, lateral cracks, blurred pictures, and small-area cracks; algorithm2 is aimed at shallow cracks, with better results for road marking interference, multiple cracks, and blurred pictures. Algorithm3 is aimed at multiple cracks, with better results for road marking interference, lateral cracks, and blurred pictures. Algorithm4 is aimed at multiple cracks, with better results for road marking interference and blurred pictures.
In future work, we will study how to perform better feature selection, remove redundant features to the greatest extent, retain the most genuine crack information, allow the model to be better trained, and obtain better detection results. To further improve the results, we will continue to study other sparse feature methods and try sparse feature methods in more dimensions, such as color saturation and wavelets. Our research is preliminary, and further research is needed on how to identify the types of damage, because different repair methods may be used for different types of damage (cracking), as well as on how to classify detected cracks as harmless or as requiring immediate repair. Since the focus of this paper is on the improvement in detection after adding sparse feature processing before the deep learning method, our dataset only includes images of longitudinal, transverse, and bifurcation cracks in asphalt pavement. Future refinement of the dataset is needed to include other types of damage that are significant and require precise automatic detection in road engineering: block, fatigue, edge, and reflection cracks, as well as potholes and losses of binder and aggregate in the road surface.