With the rapid growth of society and the modern logistics industry, road infrastructure has expanded greatly. There are more than 64,000,000 kilometers of roads in the world, which creates massive operational requirements for pavement maintenance. Pavement inspection is one of the key steps. Cameras are often used as pavement inspection equipment because of their low cost and the strong representational ability of images. The pavement inspection task is therefore often translated into a pavement distress analysis task based on the acquired pavement images, which is then carried out manually by proficient workers. Such an operation consumes considerable time and labor because an enormous number of pavement images are produced daily. Automating pavement distress analysis can therefore play a critical role in improving efficiency, reducing cost, and avoiding labeling errors in manual pavement inspection.
Pavement distress detection and recognition are two fundamental tasks in pavement distress analysis. They aim at identifying the distressed pavement images and classifying the distressed images into specific categories, respectively, as shown in Figure 1. In recent decades, many classical approaches have been proposed to address these two tasks, and they can be roughly categorized into two groups.
The first group uses image processing, hand-crafted features, and conventional classifiers to recognize pavement distress [4, 5, 6, 7, 8]. For example, Zhou et al. developed a two-step method that applies the wavelet transform followed by a Radon transform to classify pavement distress. Sun et al. proposed a crack classification method based on topological properties and chain code. The main drawback of these methods is that they often optimize the feature extraction and classification steps separately, or do not involve any learning process at all, which leads to poor performance. Moreover, they usually need extensive and sophisticated image pre-processing.
The second group consists of deep learning-based methods. Inspired by the advance of deep learning, it has become increasingly popular to apply deep learning-based visual models to pavement distress detection and recognition [9, 10, 11, 12, 13]. For example, Gopalakrishnan et al. leveraged VGG-16 pre-trained on ImageNet to identify whether a given pavement image is "crack" or "non-crack". Laha et al. detected road damage with RetinaNet. Compared with the conventional approaches, deep learning-based approaches often achieve better performance. However, most of them regard pavement distress detection or recognition as a common object detection or image classification problem and directly apply classical deep learning approaches. In the model design phase, they seldom pay attention to the specific characteristics of pavement images, such as the high image resolution, the low distress area ratio, and uneven illumination.
To address this issue, instead of directly classifying the pavement images, IOPLIN applied histogram equalization to suppress the negative effects of illumination, and tackled the pavement distress detection task by inferring the labels of patches from pavement images with a Patch Label Inference Network (PLIN) to fully exploit the high-resolution image information. IOPLIN can be trained iteratively with only image labels via the Expectation-Maximization Inspired Patch Label Distillation (EMIPLD) strategy, and it achieves promising detection performance for various categories of pavement distress. The main drawback of IOPLIN is that its optimization process is quite complex and time-consuming. Moreover, IOPLIN is not end-to-end trainable and cannot be extended to the pavement distress recognition scenario.
To address the aforementioned issues, we present a novel pavement image classification framework named Weakly Supervised Patch Label Inference Network (WSPLIN) for both pavement distress detection and recognition. Similar to IOPLIN, our method also accomplishes pavement image classification by inferring the labels of patches from the pavement images with a Patch Label Inference Network (PLIN). WSPLIN therefore inherits the merits of IOPLIN, such as better image information utilization and result interpretability, but also faces the same obstacle of training PLIN with only image labels. Compared with IOPLIN, WSPLIN solves this training issue by introducing a more concise end-to-end weakly supervised learning framework. Such a framework endows WSPLIN with better efficiency and greater flexibility, enabling the pavement distress recognition application.
In WSPLIN, the pavement image is divided into patches with different patch collection strategies under different scales to exploit both global and local information. Then, a CNN is implemented as PLIN to infer the labels of patches under a sparsity constraint. Finally, the patch label inference results are fed into a Comprehensive Decision Network (CDN) to complete the classification. We integrate PLIN and CDN as an end-to-end deep learning model, so that PLIN can be optimized under the guidance of CDN and the patch label sparsity constraint in a cleaner and more efficient fashion. Three different strategies, namely Sliding Window (SW), Image Pyramid (IP), and Sparse Sampling (SS), are adopted for collecting patches from images. We name the corresponding WSPLIN versions WSPLIN-SW, WSPLIN-IP, and WSPLIN-SS, respectively. Like IOPLIN, WSPLIN-SW does not consider any scale information during patch collection and can be deemed a naive version of WSPLIN. In contrast, WSPLIN-IP incorporates scale information by dividing images into patches from coarse to fine based on an image pyramid. It is the default version of WSPLIN. WSPLIN-SS conducts sparse patch sampling to collect only a few patches from the image pyramid, which improves the efficiency of WSPLIN. It can be seen as the speedy version of WSPLIN. We evaluate WSPLIN on a large-scale pavement image dataset named CQU-BPDD under different settings, including distress detection, one-stage recognition, and two-stage recognition. The experimental results show that WSPLIN outperforms extensive baselines and demonstrates clear advantages over IOPLIN in both efficiency and performance.
The main contributions of our work are summarized as follows:
We propose a novel end-to-end weakly supervised deep learning model named WSPLIN for addressing both pavement distress detection and recognition issues. WSPLIN not only inherits the merits of IOPLIN, but also enjoys faster training speed, better classification performance, and wider application scenarios over IOPLIN.
Different from IOPLIN and the conventional CNN-based image classification methods, we introduce an image pyramid into WSPLIN-IP to exploit scale information. Moreover, we design a sparse patch sampling strategy on the image pyramid to further speed up WSPLIN. The training time of this faster WSPLIN version (WSPLIN-SS) is only about one-fourth of the training time of IOPLIN, while the two achieve similar performance in pavement distress detection.
We design a patch label sparsity constraint based on the prior knowledge of distress distribution and leverage the CDN to guide the training of PLIN in a weakly supervised way. The patch labels produced by PLIN provide interpretable intermediate information, such as the rough location and the type of distress.
We empirically evaluate our model against the current state-of-the-art CNN methods and some classic transformer methods as baselines in both the pavement distress detection and recognition tasks. Extensive results show that WSPLIN outperforms them in both tasks under different settings.
II Related Work
II-A Image-based Pavement Distress Analysis
The traditional pavement distress analysis approaches mainly include filter-based methods and classical classifiers built on hand-crafted features. For example, the wavelet transform has been used to decompose a pavement image into different-frequency subbands. Hu et al. propose a novel Local Binary Pattern (LBP) based operator for pavement crack detection. CrackForest, a random structured forest combined with integral channel features, has been proposed for automatic road crack detection. Kapela et al. propose a crack recognition system based on the Histograms of Oriented Gradients (HOG). Pan et al. use four popular supervised learning algorithms (KNN, SVM, ANN, RF) to discern pavement damage. However, the traditional methods usually perform poorly owing to numerous hand-designed factors and separate optimization procedures, and they do not scale to the large amounts of data now available.
Inspired by the recent remarkable successes of deep learning in extensive applications, simple and efficient convolutional neural network (CNN) based pavement distress analysis methods have gradually become the mainstream in recent years. In general, these methods can be divided into three groups according to the task objective: pavement distress segmentation [12, 22, 23, 24], pavement distress localization [25, 26], and pavement distress classification [13, 27, 28]. Among them, pixel-based pavement distress segmentation is a hot research field. Zhang et al. leverage a CNN to classify image patches for segmenting pavement distress. In another work, a CNN is used to learn the structure of the cracks from raw images, and the segmentation result is then generated from the obtained structure information. Based on the fully convolutional network (FCN), Yang et al. fuse multiscale features in a top-down manner for pavement crack segmentation. In DeepCrack, multiscale deep convolutional features learned at hierarchical convolutional stages are fused together to capture the line structures. For distress localization, Ibragimov et al. propose a method for localizing signs of pavement distress based on the faster region-based convolutional neural network (Faster R-CNN). Zhu et al. compare the performance of three state-of-the-art object detection algorithms on an unmanned aerial vehicle (UAV) pavement image dataset, which includes six types of distress. Because pavement distress annotation requires professional knowledge and a large amount of time, the datasets used in the above methods are low-resolution and small-scale. It therefore remains to be determined whether models derived from small-scale datasets can be applied to real-world practice.
For pavement distress classification, Dong et al. propose a metric-learning based method for multi-target few-shot pavement distress classification on a dataset that includes 10 different kinds of distress. In another work, discriminative super-features constructed from the multi-level context information of a CNN are used to determine whether there is distress in the pavement image and to recognize the type of the distress. All of these methods classify well on the datasets they use, which are small and contain only distressed images, and on which the test accuracy even reaches 100%. Few works have systematically evaluated model performance on a difficult large-scale multi-type dataset. Moreover, these approaches only regard the pavement distress detection or recognition problem as a common image classification problem and directly apply classical deep learning approaches. The patch-based weakly supervised learning model IOPLIN and the large-scale distress dataset CQU-BPDD were proposed to solve these problems. However, the main drawback of IOPLIN is that its pseudo-label-based patch label inference strategy makes it incompatible with pavement distress recognition, and its optimization process is quite complex and time-consuming. Our approach takes inspiration from IOPLIN but operates with a different patch inference strategy and uses more effective, hierarchical patch collection strategies.
II-B Deep Learning-based Image Classification
In recent years, driven by the popularity of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), many computer vision algorithms based on deep learning have emerged. Among them, convolutional neural networks (CNNs) play a leading role in image classification. For example, AlexNet first applies the convolutional neural network structure to large-scale image classification datasets. Simonyan et al. propose the deep and large-scale convolutional neural network VGGNet (e.g., the VGG19 model has 19 layers and more than 130 million parameters, while earlier convolutional neural networks had fewer than 10 layers and millions to tens of millions of parameters). InceptionNet introduces modules that apply convolution kernels of different sizes in parallel. He et al. propose a residual structure and a network extension strategy to construct a network family for the first time. MobileNet brings CNNs to embedded mobile devices and is specially designed for computing platforms with low computing power and little memory. Then, based on MobileNet and neural architecture search (NAS), Tan et al. propose an efficient CNN dubbed EfficientNet. More recently, inspired by the field of Natural Language Processing (NLP), Dosovitskiy et al. propose a Transformer-based visual classification model named ViT.
However, general classification models based on CNNs and Transformers are difficult to apply directly to pavement distress analysis. This is mainly because, compared with object-centric natural images, pavement images have many specific characteristics, such as high image resolution, a low distress area ratio, and uneven illumination. This paper aims to use a patch collection strategy to incorporate both local and multiscale information for pavement distress analysis.
The network architecture of the Weakly Supervised Patch Label Inference Network (WSPLIN) is shown in Figure 2. We first introduce the problem formulation and an overview of WSPLIN in Section III-A. Then, we introduce the patch collection strategies in Section III-B. After that, the core modules of WSPLIN, the Patch Label Inference Network (PLIN) and the Comprehensive Decision Network (CDN), are detailed in Section III-C and Section III-D respectively. Finally, we show how to apply WSPLIN to detect and recognize pavement distress in Section III-E.
III-A Problem Formulation and Overview
Both pavement distress detection and recognition can be deemed image classification tasks from the perspective of computer vision. Let $X = \{x_i\}_{i=1}^{n}$ and $Y = \{y_i\}_{i=1}^{n}$ be the collection of pavement images and their pavement labels respectively. $y_i$ is a $C$-dimensional one-hot vector where $C$ is the number of categories and $y_{ij}$ indicates the $j$-th element of $y_i$. In the detection case, such a classification task is a binary image classification problem (distressed or normal) where $C = 2$. In the recognition case, this classification task is a multi-class image classification problem where $C > 2$. In a pavement label $y_i$, if the $j$-th element $y_{ij}$ is the only nonzero element, the corresponding pavement image belongs to the $j$-th category. Pavement distress detection or recognition is to learn a classifier $f(\cdot)$ that labels the pavement image correctly, $y_i \leftarrow f(x_i)$.
There are two strategies for accomplishing the pavement distress recognition task. One is the two-stage recognition flow path and the other is the one-stage recognition flow path. The two-stage recognition is to identify distressed images first via pavement distress detection, and then apply the pavement distress recognition to further classify each distressed image into a specific type of pavement distress. The one-stage recognition is to directly consider the normal case as an additional category in the recognition procedure, and therefore the pavement distress detection and recognition tasks are jointly tackled with one image classification model.
Similar to the Iteratively Optimized Patch Label Inference Network (IOPLIN), WSPLIN is a patch-based pavement image classification method whose main obstacle is training the Patch Label Inference Network (PLIN) with only image labels. WSPLIN introduces an additional module named Comprehensive Decision Network (CDN) to guide the optimization of PLIN in an end-to-end weakly supervised manner. The flow path of WSPLIN is concise: the pavement image is first divided into several patches, PLIN then infers the labels of these patches, and finally the inferred labels are fed into CDN to yield the final pavement label. WSPLIN thus has only two core modules, PLIN and CDN, whose corresponding mapping functions are $f_{\mathrm{PLIN}}(\cdot)$ and $f_{\mathrm{CDN}}(\cdot)$ respectively.
III-B Patches Collection
We adopt three different patch collection strategies for producing patches: Sliding Window (SW), Image Pyramid (IP), and Sparse Sampling (SS). The first strategy is also adopted by IOPLIN. WSPLIN uses the second strategy to fully exploit image information from different scales. The third strategy is newly designed on top of the IP strategy to speed up WSPLIN by reducing the number of patches used for training.
Sliding Window: The pavement image is simply divided into a series of uniform-scale patches following a non-overlapping strategy. We adopt $300 \times 300$ as the sliding window size with a sliding stride of 300. The patch collection can be denoted as $P_i = \Phi(x_i) = \{p_{ij}\}_{j=1}^{m}$, where $\Phi(\cdot)$ is our patch extraction operation and $m$ is the number of patches.
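The sliding window collection can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `sliding_window_patches`, the 900 x 1200 input size, and the choice to drop incomplete border windows are assumptions, and the 300-pixel window is inferred from the non-overlapping strategy with a stride of 300.

```python
import numpy as np

def sliding_window_patches(image, window=300, stride=300):
    """Split an H x W (x C) image into patches in row-major order.

    With stride == window the patches do not overlap. Border pixels that do
    not fill a whole window are dropped (an assumption of this sketch).
    """
    height, width = image.shape[:2]
    patches = []
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            patches.append(image[top:top + window, left:left + window])
    return patches

# Hypothetical 900 x 1200 RGB image, used only to show the resulting shapes.
image = np.zeros((900, 1200, 3), dtype=np.uint8)
patches = sliding_window_patches(image)
print(len(patches), patches[0].shape)
```

For the assumed input this yields a 3 x 4 grid of patches; each patch is a view into the original array, so no pixel data is copied.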
Image Pyramid: The sliding window strategy does not consider scale information. We therefore resize the pavement image into three resolutions, the finest being the original size, to construct a three-layer image pyramid from top to bottom, and then employ the sliding window method to divide each layer into patches. The patch collection can be denoted as $P_i = \{P_i^{(l)}\}_{l=1}^{3}$, where $P_i^{(l)} = \Phi(x_i^{(l)})$, $x_i^{(l)}$ is the resized image in the $l$-th layer, and $l$ indicates the layer ID. As in the sliding window strategy, we apply a $300 \times 300$ sliding window with a stride of 300 in the image pyramid, so the number of patches $m_l$ in each layer is determined by the resolution of that layer.
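The pyramid-based collection can be sketched as follows. The layer resolutions used here, the nearest-neighbour resize, and the helper names `resize_nearest` and `pyramid_patches` are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def resize_nearest(image, out_h, out_w):
    """Nearest-neighbour resize, kept dependency-free for illustration."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return image[rows][:, cols]

def pyramid_patches(image, layer_sizes, window=300):
    """Return one list of non-overlapping patches per pyramid layer."""
    layers = []
    for out_h, out_w in layer_sizes:
        layer = resize_nearest(image, out_h, out_w)
        patches = [
            layer[top:top + window, left:left + window]
            for top in range(0, out_h - window + 1, window)
            for left in range(0, out_w - window + 1, window)
        ]
        layers.append(patches)
    return layers

# Assumed layer sizes, coarse to fine; the last one is the original size.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(900, 1200), dtype=np.uint8)
layers = pyramid_patches(image, [(300, 300), (600, 600), (900, 1200)])
counts = [len(patches) for patches in layers]
print(counts)
```

Because the window size is fixed, coarser layers produce fewer patches that each cover a larger part of the scene, which is how the pyramid adds scale information to the patch set.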
Sparse Sampling: The patch number determines the scale of the training data, and the patches in the same image pyramid also contain some redundant information in the scale space. We can therefore sample only some patches from each image to reduce the training burden and thereby speed up the model. More specifically, let $\rho$ be the sparse sample ratio that controls the number of sampled patches in each layer, $\lceil \rho\, m_l \rceil$, where $m_l$ is the number of patches in the $l$-th layer and $\lceil \cdot \rceil$ returns the smallest integer that is greater than or equal to the input. We design a simple strategy for sampling patches in each layer: the sampled patches of all three layers should cover all scales while maximizing the spatial coverage. The optimal patch sparse sampling strategy is mathematically denoted as follows,
$$\max_{\Omega_1, \Omega_2, \Omega_3} \; \mathrm{vol}\Big(\bigcup_{l=1}^{3} \bigcup_{j \in \Omega_l} p_{ij}^{(l)}\Big) \quad \text{s.t.} \quad |\Omega_l| = \lceil \rho\, m_l \rceil, \; l = 1, 2, 3,$$
where $\mathrm{vol}(\cdot)$ returns the volume of the given set and $\Omega_l$ denotes an index subset of the patches in the $l$-th layer. Since the number of feasible solutions is limited, we can use enumeration to solve this problem efficiently when $\rho$ is fixed. In this paper, $\rho$ is set empirically.
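The enumeration idea can be illustrated on a toy pyramid. The sketch below assumes that spatial coverage is measured as the fraction of the image covered by the union of the sampled patches, and it uses two hypothetical layers with 2 x 2 and 4 x 4 patch grids and a ratio of 0.25; none of these values come from the paper.

```python
import itertools
import math
import numpy as np

def best_sparse_sample(grids, ratio, unit=4):
    """Enumerate per-layer patch subsets and keep the one covering most area.

    grids: list of (rows, cols) patch grids, one per pyramid layer.
    ratio: sparse sample ratio; each layer keeps ceil(ratio * patches).
    unit:  side of the common cell grid used to measure coverage; it must be
           divisible by every grid dimension.
    """
    per_layer_choices = []
    for rows, cols in grids:
        keep = math.ceil(ratio * rows * cols)
        per_layer_choices.append(
            list(itertools.combinations(range(rows * cols), keep)))

    best_cover, best_pick = -1.0, None
    for pick in itertools.product(*per_layer_choices):
        mask = np.zeros((unit, unit), dtype=bool)
        for (rows, cols), indices in zip(grids, pick):
            h, w = unit // rows, unit // cols
            for index in indices:
                r, c = divmod(index, cols)
                mask[r * h:(r + 1) * h, c * w:(c + 1) * w] = True
        cover = float(mask.mean())
        if cover > best_cover:
            best_cover, best_pick = cover, pick
    return best_cover, best_pick

cover, pick = best_sparse_sample(grids=[(2, 2), (4, 4)], ratio=0.25)
print(cover, pick)
```

In this toy setting the best selection places the four fine patches outside the region already covered by the single coarse patch, so half of the image is covered. Exhaustive search is affordable only because the number of patches per layer is small.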
For distinguishing different versions of WSPLIN, WSPLIN-SW, WSPLIN-IP and WSPLIN-SS indicate the versions that use sliding window, image pyramid and sparse sampling patch collection strategies respectively. The default version of WSPLIN is WSPLIN-IP.
III-C Patch Label Inference Network
Similar to IOPLIN, we adopt EfficientNet-B3 as our Patch Label Inference Network (PLIN) due to its good trade-off between performance and efficiency. We denote $f_{\mathrm{PLIN}}(\cdot)$ as the mapping function of PLIN. The patch label inference procedure is denoted as,
$$S_i = f_{\mathrm{PLIN}}(P_i),$$
where $S_i$ is a $C \times m$-dimensional matrix in which every column encodes the label inference confidences of a patch. Such confidences are expected to be zero if the patch contains no distress, and otherwise to reflect the likelihood of each type of distress in the corresponding patch. Note that there is no supervised information for patches, so these labels are initially produced, essentially at random, by forward propagation alone. We need to leverage the follow-up comprehensive decision network to guide PLIN to generate reasonable patch labels with image-level labels in a weakly supervised manner. We will introduce this procedure later.
Patch Label Sparsity Constraint (PLSC): The distressed area is often a small part of the pavement, and a pavement image seldom contains many different types of distress. The label confidence matrix should therefore have only a few nonzero elements. In other words, $S_i$ should be sparse. Thus, we introduce an $L_1$-norm constraint on the label confidence matrices of the distressed training samples,
$$\mathcal{L}_{s} = \sum_{i:\, y_i \neq \bar{y}} \lVert S_i \rVert_1,$$
where $\bar{y}$ is the label of the normal pavement image. We introduce this constraint only to the distressed samples, since there should be no nonzero element in the label confidence matrices of the normal samples.
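A minimal sketch of such a sparsity term is given below. It assumes that the constraint is the sum of absolute confidences (an L1 norm) over distressed images only, and that the normal class has index 0; the function name `patch_label_sparsity` and the toy confidences are hypothetical.

```python
import numpy as np

def patch_label_sparsity(confidences, labels, normal_index=0):
    """Sum of L1 norms of the confidence matrices of distressed images.

    confidences: array of shape (n_images, n_classes, n_patches).
    labels:      integer class index per image; `normal_index` marks the
                 distress-free class (its position is an assumption).
    """
    distressed = labels != normal_index
    return float(np.abs(confidences[distressed]).sum())

# Two toy images with two classes and two patches each.
confidences = np.array([
    [[0.9, 0.8], [0.1, 0.2]],   # normal image: excluded from the penalty
    [[0.0, 0.1], [0.7, 0.0]],   # distressed image: contributes 0.8
])
labels = np.array([0, 1])
penalty = patch_label_sparsity(confidences, labels)
print(penalty)
```

Minimizing this quantity pushes most patch confidences of a distressed image toward zero, so that only a few patches remain active.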
III-D Comprehensive Decision Network
We establish a Comprehensive Decision Network (CDN) to accomplish the final pavement image classification based on the aforementioned patch label results. CDN consists of four layers: the first two are fully connected layers, each followed by a ReLU and a Dropout layer, the third is also a fully connected layer, and the output fully connected layer has size $C$. Here, $C$ is the number of categories. Let $f_{\mathrm{CDN}}(\cdot)$ be the mapping function of CDN, then the predicted pavement distress label can be obtained by,
$$\hat{y}_i = f_{\mathrm{CDN}}(S_i) = f_{\mathrm{CDN}}\big(f_{\mathrm{PLIN}}(P_i)\big).$$
We use the cross-entropy to measure the discrepancy between the predicted label and the ground truth and denote it as the classification loss $\mathcal{L}_{c}$,
$$\mathcal{L}_{c} = -\sum_{i=1}^{n} \sum_{j=1}^{C} y_{ij} \log \hat{y}_{ij}.$$
Finally, the optimal WSPLIN model is learned by minimizing the following loss,
$$\mathcal{L} = \mathcal{L}_{c} + \lambda \mathcal{L}_{s},$$
where $\mathcal{L}_{s}$ is the patch label sparsity constraint and $\lambda$ is a positive parameter for reconciling the classification loss and the sparsity constraint.
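The decision network and the combined objective can be sketched with plain NumPy. Everything numeric here is an assumption made for illustration: the hidden width of 64, the weight initialization, the trade-off weight `lam`, the L1 form of the sparsity term, the position of the normal class, and the toy sizes (8 classes, 17 patches). Dropout is omitted and no gradient step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_cdn(n_classes, n_patches, hidden=64):
    """Random weights for four fully connected layers (hidden size assumed)."""
    sizes = [n_classes * n_patches, hidden, hidden, hidden, n_classes]
    return [(rng.normal(0.0, 0.1, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def cdn_forward(params, confidences):
    """Map (n_images, n_classes, n_patches) confidences to class logits."""
    x = confidences.reshape(len(confidences), -1)
    for layer, (weight, bias) in enumerate(params):
        x = x @ weight + bias
        if layer < 2:  # ReLU after the first two layers; Dropout is skipped
            x = np.maximum(x, 0.0)
    return x

def total_loss(logits, confidences, labels, lam=0.01, normal_index=0):
    """Softmax cross-entropy plus lam times an L1 patch label sparsity term."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    classification = -log_probs[np.arange(len(labels)), labels].sum()
    sparsity = np.abs(confidences[labels != normal_index]).sum()
    return float(classification + lam * sparsity)

confidences = rng.random((4, 8, 17))   # stand-in for the PLIN outputs
labels = np.array([0, 3, 0, 5])        # class 0 plays the normal class
logits = cdn_forward(init_cdn(8, 17), confidences)
loss = total_loss(logits, confidences, labels)
print(logits.shape, loss)
```

In an actual training loop both terms would be differentiated with respect to the PLIN and CDN parameters, which is what lets the image-level label supervise the patch-level outputs.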
WSPLIN is an end-to-end deep learning framework that uses back-propagation to compute the gradients of the loss and update the parameters layer by layer. In WSPLIN, CDN requires the patch label results produced by PLIN to be useful for the final classification, and the patch label sparsity constraint forces WSPLIN to highlight only a few of the most crucial patches for participating in the final decision. These highlighted patches should be the distressed ones, and their inferred patch label results should be nonzero, since only the distressed patches can provide helpful information for the final detection and recognition. In this way, CDN essentially guides the training of PLIN in a weakly supervised manner.
III-E Pavement Distress Detection and Recognition
Detection and One-Stage Recognition: Pavement distress detection and one-stage pavement distress recognition can each be deemed a one-stage pavement image classification problem. To tackle these tasks, we train our model as a pavement image classifier. Once the model is trained, the pavement image is divided into patches with the chosen patch collection strategy, and the patches are fed into WSPLIN to yield the final classification,
$$\hat{y}_i = f_{\mathrm{CDN}}\big(f_{\mathrm{PLIN}}(\Phi(x_i))\big),$$
where $\hat{y}_i \in \mathbb{R}^{C}$ and the predicted category corresponds to the maximum element of $\hat{y}_i$.
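Two-Stage Recognition: The two-stage recognition accomplishes pavement distress recognition in two stages. The first stage trains our model as a pavement distress detector to filter out the normal samples and find the distressed samples. The second stage trains our model as a multi-class pavement image classifier to complete the final distress recognition,
$$\hat{y}_i = f^{\mathrm{rec}}_{\mathrm{CDN}}\big(f^{\mathrm{rec}}_{\mathrm{PLIN}}(\Phi(x_i))\big), \quad x_i \in X_d,$$
where $X_d$ is the set of images that the first-stage detector labels as distressed, the superscript $\mathrm{rec}$ marks the second-stage recognition model, and the maximum element of $\hat{y}_i$ reflects the specific pavement distress category of the distressed pavement image $x_i$.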
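The two-stage flow can be sketched with two callables standing in for the trained models. The function `two_stage_recognition`, the toy detector and recognizer, and the convention that index 1 of the detector output means "distressed" are all assumptions made for illustration.

```python
import numpy as np

def two_stage_recognition(image, detector, recognizer, distressed_index=1):
    """Return None for images judged normal, else the distress class index."""
    detection_scores = detector(image)
    if int(np.argmax(detection_scores)) != distressed_index:
        return None  # filtered out by the first stage
    return int(np.argmax(recognizer(image)))

def toy_detector(image):
    """Stand-in detector: brighter toy images count as distressed."""
    score = float(image.mean())
    return np.array([1.0 - score, score])

def toy_recognizer(image):
    """Stand-in recognizer with fixed scores over three distress types."""
    return np.array([0.1, 0.7, 0.2])

bright = np.full((2, 2), 0.9)
dark = np.zeros((2, 2))
print(two_stage_recognition(bright, toy_detector, toy_recognizer))
print(two_stage_recognition(dark, toy_detector, toy_recognizer))
```

A distressed image that the detector misses never reaches the recognizer, which is why errors from both stages accumulate in this flow.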
IV Experiments and Results
IV-A Dataset and Setup
We test our method on pavement distress detection and recognition tasks under four application settings. The first one is the one-stage recognition (I-REC), which tackles the pavement distress detection and recognition tasks jointly. In this setting, all samples (including both the distressed and normal ones) and their fine-grained category labels are available for training and testing the model. Moreover, both the detection and recognition performances can be evaluated under this setting. The second one is the one-stage detection (I-DET), which is the conventional detection fashion. In this setting, all samples (including both the distressed and normal ones) are involved, but only the binary coarse-grained category label (distressed or normal) is available. The other two settings both come from the two-stage recognition scenario. One is the ideal second-stage recognition II-REC(i), which assumes that all distressed samples are ideally detected by the first-stage detection. In this setting, the recognition models are evaluated only with distressed pavement images. The last setting is the normal second-stage recognition II-REC(n). The training stages of II-REC(i) and II-REC(n) are identical, but their testing stages are different. In II-REC(n), the recognition models are evaluated only on the images detected by the detection model trained in I-DET. The recognition error under this setting is therefore accumulated from both the first-stage detection and the second-stage recognition, since the detector may incorrectly filter out some distressed images and incorrectly classify some normal images as distressed ones during the recognition testing stage. The results in II-REC(n) thus reflect the comprehensive performance of two-stage recognition.
A large-scale bituminous pavement distress dataset named CQU-BPDD is used to evaluate the approaches under the four application settings. This dataset involves seven different types of distress: alligator crack, crack pouring, longitudinal crack, massive crack, transverse crack, raveling, and mending. For the settings of I-DET, I-REC, and II-REC(n), we simply follow the data split strategy of prior work. For the setting of II-REC(i), 5,140 distressed pavement images are randomly selected as the training set, while the remaining 11,589 distressed pavement images are used for testing. The detailed data split information of the different settings is tabulated in Table I.
Similar to IOPLIN, we adopt EfficientNet-B3 as the Patch Label Inference Network (PLIN). Since the comprehensive decision network (CDN) adopts fully connected layers with fixed dimensions, WSPLIN requires a fixed input size. The optimizer is RangerLars, a combination of RAdam, LookAhead, and LARS. A cosine annealing strategy is adopted to adjust the learning rate: the learning rate remains unchanged during the first 25% of the training process and then gradually decreases following a cosine function. Data augmentations such as rotation, flipping, and brightness balance are applied to the raw images. The dropout rate of the classification layer is 0.5.
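The learning rate schedule described above can be sketched as a small function. The base learning rate of 1e-3, the total of 100 steps, and the decay all the way to zero are assumptions of this sketch, and `learning_rate` is a hypothetical helper rather than part of any optimizer library.

```python
import math

def learning_rate(step, total_steps, base_lr, flat_fraction=0.25):
    """Flat learning rate for the first quarter, cosine decay afterwards."""
    flat_steps = int(total_steps * flat_fraction)
    if step < flat_steps:
        return base_lr
    progress = (step - flat_steps) / max(1, total_steps - flat_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# Assumed values for illustration only.
schedule = [learning_rate(step, 100, 1e-3) for step in range(101)]
print(schedule[0], schedule[25], schedule[62], schedule[100])
```

The flat phase keeps early updates at full size, while the cosine phase shrinks them smoothly toward the end of training.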
IV-B Evaluation Metrics
IV-B1 Evaluation Metrics of Detection
For the pavement distress detection task, we adopt the Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC), which is common in binary classification tasks and is not affected by the classification threshold. It is mathematically defined as follows,
$$\mathrm{AUC} = \frac{\sum_{i \in \mathrm{positive}} \mathrm{rank}_i - \frac{M(M+1)}{2}}{M \times N},$$
where $\mathrm{rank}_i$ is the rank of sample $i$ when all samples are sorted by predicted score in ascending order, $\sum_{i \in \mathrm{positive}} \mathrm{rank}_i$ is the sum of the ranks of all positive samples, and $M$ and $N$ denote the numbers of positive and negative samples. Additionally, the binary $F1$ score is defined as:
$$F1 = \frac{2 \times P \times R}{P + R}, \quad P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN},$$
where $P$ is the precision while $R$ is the recall. $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives respectively. The precision measures how many of the samples predicted as positive are truly positive. Similarly, the recall measures how many of all positive samples are correctly detected. Moreover, in medical or pavement image analysis tasks, it is more meaningful to discuss the precision under a high recall, since missing positive samples (the distressed samples) may have a more serious impact than missing negative ones.
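The detection metrics can be computed directly from scores and labels. The sketch below assumes the standard rank-sum form of the AUC with average ranks for ties and a fixed threshold of 0.5 for the precision, recall, and F1 example; the six toy samples are invented, and empty classes are not handled.

```python
import numpy as np

def rank_auc(scores, labels):
    """AUC from the rank-sum formula; ties receive their average rank."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):          # average the ranks of tied scores
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    m, n = labels.sum(), (~labels).sum()
    return float((ranks[labels].sum() - m * (m + 1) / 2) / (m * n))

def precision_recall_f1(predicted, labels):
    """Precision, recall, and F1 of the positive (distressed) class."""
    predicted = np.asarray(predicted, dtype=bool)
    labels = np.asarray(labels, dtype=bool)
    tp = np.sum(predicted & labels)
    fp = np.sum(predicted & ~labels)
    fn = np.sum(~predicted & labels)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, 2 * precision * recall / (precision + recall)

scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]   # toy detector scores
labels = [1, 1, 1, 0, 0, 0]                # 1 = distressed, 0 = normal
auc = rank_auc(scores, labels)
p, r, f1 = precision_recall_f1([s >= 0.5 for s in scores], labels)
print(auc, p, r, f1)
```

In this toy case 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9, while thresholding at 0.5 gives precision, recall, and F1 of 2/3 each.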
IV-B2 Evaluation Metrics of Recognition
For the pavement distress recognition task, we mainly use the Top-1 accuracy and the Macro $F1$ score to evaluate the performance of models. Top-1 accuracy mainly measures the overall accuracy of the models, while the Macro $F1$ score evaluates the accuracy of the model across different categories. The Macro $F1$ score can be mathematically represented as follows,
$$\mathrm{Macro}\ F1 = \frac{1}{C} \sum_{j=1}^{C} F1_j,$$
where $F1_j$ indicates the binary $F1$ score of the $j$-th category, and $C$ is the total number of categories.
Note: $F1$ represents the binary $F1$ score and the Macro $F1$ score in pavement distress detection and recognition tasks respectively.
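The difference between sample-wise Top-1 accuracy and category-wise Macro F1 is easy to see on a small imbalanced example. The labels below are invented, `macro_f1` is a hypothetical helper, and a category with no true or predicted samples is given a score of 0 by assumption.

```python
import numpy as np

def macro_f1(predicted, labels, n_classes):
    """Unweighted mean of the per-category binary F1 scores."""
    predicted = np.asarray(predicted)
    labels = np.asarray(labels)
    scores = []
    for c in range(n_classes):
        tp = np.sum((predicted == c) & (labels == c))
        fp = np.sum((predicted == c) & (labels != c))
        fn = np.sum((predicted != c) & (labels == c))
        denom = 2 * tp + fp + fn          # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy labels: class 0 dominates, classes 1 and 2 are rare.
labels = [0, 0, 0, 0, 1, 2]
predicted = [0, 0, 0, 0, 2, 2]
top1 = float(np.mean(np.array(labels) == np.array(predicted)))
macro = macro_f1(predicted, labels, n_classes=3)
print(top1, macro)
```

Here five of six samples are correct, so Top-1 accuracy is about 0.83, yet the Macro F1 is only 5/9 because the missed rare category counts as much as the dominant one.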
, Support Vector Machine (SVM), ResNet-50 , Inception-v3 , VGG-19 , ViT-S/16 , ViT-B/16 , EfficientNet-B3 , and Iterative Optimized Patch Label Inference Network (IOPLIN)  are selected as baselines. The first four approaches are the shallow learning-based approaches. ResNet-50, Inception-v3, VGG-19, and EfficientNet-B3 are the classical Convolutional Neural Network (CNN) models. ViT-S/16 and ViT-B/16 are the recently popular transformer models. IOPLIN is a well elaborated pavement distress detection approach.
IV-D Pavement Distress Detection
Table II reports the pavement distress detection performances of different approaches. These approaches include the detectors trained under I-DET and the recognizers trained under I-REC, where recognizers address the detection issue along with the recognition task. WSPLIN-IP outperforms all baselines in all evaluation metrics under both I-DET and I-REC. In I-DET, WSPLIN-IP improves on IOPLIN by 0.1%, 1.5%, 2.5%, and 1.1% in AUC, P@R=90%, P@R=95%, and F1-score respectively. In I-REC, WSPLIN-IP achieves 1.6%, 8.0%, 12.7%, and 4.2% performance gains over EfficientNet-B3 in AUC, P@R=90%, P@R=95%, and F1-score respectively. Moreover, the methods under I-REC consistently perform better than the ones under I-DET. For example, the WSPLIN-IP trained under I-REC achieves 0.1%, 2.1%, 3.1%, and 1.0% performance gains over the WSPLIN-IP trained under I-DET in AUC, P@R=90%, P@R=95%, and F1-score respectively. Similarly, the gains of EfficientNet-B3 are 0.6%, 8.4%, 8.8%, and 1.9%. We attribute this to the fact that recognizers trained under I-REC use fine-grained distress labels instead of binary distress labels to train the pavement image classification models. This indicates that finer-grained supervised information, such as the specific pavement distress type, can benefit pavement distress detection.
IV-E Pavement Distress Recognition
Table III records the pavement distress recognition performances and parameter scales of different approaches under different application settings on the CQU-BPDD dataset. As in pavement distress detection, WSPLIN-IP achieves better recognition performance than the baselines under all settings while having a smaller parameter scale. In I-REC, the performance gains of WSPLIN-IP over Inception-v3, which is the second-best method, are 1.8% and 3.4% in top-1 accuracy and F1-score, respectively. In II-REC(n) and II-REC(i), EfficientNet-B3 achieves the second-best performance. The performance gains of WSPLIN-IP over it under II-REC(n) are 1.1% and 3.3% in top-1 accuracy and F1-score, respectively. The corresponding gains under II-REC(i) are 6.4% and 6.9%. The distribution of the different pavement image categories is imbalanced. Top-1 accuracy is sensitive to this data imbalance, while F1-score is more robust to it. Therefore, F1-score better reflects the comprehensive performance of recognizers. According to these observations, WSPLIN-IP shows larger advantages over the baselines in F1-score.
The test settings of I-REC and II-REC(n) are identical, as seen in Table I. However, the models trained under I-REC outperform the ones of II-REC(n). For example, Inception-v3, ViT-B/16, EfficientNet-B3, and WSPLIN-IP trained under I-REC achieve 3.1%, 1.6%, 2.0%, and 1.8% improvements in F1-score over their counterparts under II-REC(n). This implies that, in real-world applications, the end-to-end pavement distress recognition solution, which addresses detection and recognition jointly (I-REC), has advantages over the conventional two-stage solution, which addresses detection and recognition individually (II-REC(n)). We attribute this to the fact that the end-to-end solution exploits the complementarity of the two tasks and introduces global optimization.
An interesting phenomenon observed in Table III is that the top-1 accuracies under II-REC(i) are lower than those under the other two settings, while its F1-scores are higher. This is because the test setting of II-REC(i) differs from those of I-REC and II-REC(n): it does not involve any normal pavement samples. Top-1 accuracy is measured sample-wise, whereas F1-score is measured category-wise. Normal samples make up a large proportion of the data in I-REC and II-REC(n). Consequently, the overabundant normal samples bias the recognizers trained under I-REC towards the normal class, which leads to a high top-1 accuracy but a low F1-score. With regard to II-REC(n), the large number of normal samples pushes up the top-1 accuracy. However, F1-score is independent of the number of samples per category, and the classification error of II-REC(n) accumulates over both the detection and recognition stages. Therefore, II-REC(n) achieves a lower F1-score than II-REC(i).
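The divergence between the sample-wise and category-wise metrics can be reproduced on a toy example. The sketch below assumes that the category-wise F1-score is the unweighted (macro) average of per-class F1-scores, which the text does not state explicitly; the class names and sample counts are invented for illustration only.

```python
def top1_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1-scores over the classes in y_true."""
    per_class = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical imbalanced test set: 90 normal images and 10 distressed ones.
y_true = ["normal"] * 90 + ["crack"] * 5 + ["pothole"] * 5
# A degenerate recognizer that is fully biased towards the normal class.
y_pred = ["normal"] * 100

accuracy = top1_accuracy(y_true, y_pred)  # dominated by the normal class
f1 = macro_f1(y_true, y_pred)             # every class counts equally
```

Here the biased recognizer reaches a top-1 accuracy of 0.9 although it never identifies a single distressed image, while its macro F1-score stays below 0.32 because two of the three classes score zero.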
IV-F Ablation Study
In this section, we systematically discuss the effects of different components and hyperparameters on our model. Table IV records the performance and efficiency of our approaches under different settings in comparison with IOPLIN.
|Method||P@R=90% (I-DET)||F1-score (I-REC)||Training time (vs. IOPLIN)|
|WSPLIN-IP w/o PLSC||81.2%||64.5%||11.0h (-12%)|
|WSPLIN-SS ()||81.1%||64.1%||3.2h (-74%)|
|WSPLIN-SS ()||81.4%||64.9%||5.7h (-54%)|
|WSPLIN-SS ()||80.0%||63.7%||8.4h (-33%)|
IV-F1 Discussion on Patch Collection Strategies
We adopt three strategies, named Slide Window (SW), Image Pyramid (IP), and Sparse Sampling (SS), to collect patches from pavement images. The corresponding versions are WSPLIN-SW, WSPLIN-IP, and WSPLIN-SS, respectively. Among the three versions, WSPLIN-IP, which is also the default version of WSPLIN, achieves the best performance under both application settings and across evaluation metrics. WSPLIN-IP achieves a 1.5% gain over IOPLIN in P@R=90% for pavement distress detection, while its training time is only 89% of that of IOPLIN. In comparison with WSPLIN-SW, WSPLIN-IP exploits not only the local information but also the scale information of the pavement image. The results indicate that such scale information can further improve the performance of WSPLIN. Although WSPLIN-SW does not outperform IOPLIN, with a 0.8% performance decrease in pavement distress detection, it is much faster than WSPLIN-IP, and its training takes only around 3/4 of the training time of IOPLIN. We attribute this to the efficiency advantage of the end-to-end model optimization strategy. Moreover, unlike the other versions, WSPLIN-SW does not suffer from scale variation, so it produces better patch label inference visualizations and thereby offers better interpretability.
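The difference between the Slide Window and Image Pyramid strategies can be sketched in terms of the patch coordinates each one produces. The code below is a minimal illustration, not the original implementation: the image size, patch size, stride, and pyramid scales are assumed values, and the function names are hypothetical.

```python
def sliding_window_boxes(height, width, patch, stride):
    """Return (top, left, bottom, right) boxes of fully contained patches."""
    boxes = []
    for top in range(0, height - patch + 1, stride):
        for left in range(0, width - patch + 1, stride):
            boxes.append((top, left, top + patch, left + patch))
    return boxes


def image_pyramid_boxes(height, width, patch, stride, scales):
    """Run the same sliding window on every rescaled copy of the image.

    Returns a dict mapping each scale to the boxes collected at that
    pyramid layer (coordinates refer to the rescaled image).
    """
    pyramid = {}
    for scale in scales:
        scaled_h, scaled_w = int(height * scale), int(width * scale)
        pyramid[scale] = sliding_window_boxes(scaled_h, scaled_w, patch, stride)
    return pyramid


# Assumed setup: a 900x1200 image, 300x300 non-overlapping patches,
# and a two-layer pyramid (full resolution and half resolution).
sw_boxes = sliding_window_boxes(900, 1200, patch=300, stride=300)
ip_boxes = image_pyramid_boxes(900, 1200, patch=300, stride=300, scales=[1.0, 0.5])
total_ip_patches = sum(len(boxes) for boxes in ip_boxes.values())
```

Under these assumed sizes, the sliding window yields 12 patches, and the pyramid adds 2 coarser patches from the half-resolution layer. Each coarse patch covers a larger physical area of the pavement, which is one way to read the scale information discussed above.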
WSPLIN-SS also takes the scale information into consideration and can be deemed a speedy version of WSPLIN-IP. It has a hyperparameter that controls the number of sampled patches in each layer of an image pyramid. The best-performing WSPLIN-SS () achieves performance similar to IOPLIN. However, WSPLIN-SS saves around half of the training time compared with IOPLIN, and its training time is only 76% of that of WSPLIN-IP. Clearly, WSPLIN-SS considerably speeds up WSPLIN with only an acceptable performance decrease. Another interesting observation is that WSPLIN-SS with a higher value of this hyperparameter does not always perform better. Generally, a higher value implies collecting more patches, which means that more information is preserved for classification. However, the results indicate that not all of the preserved information is necessary for classification. Moreover, fewer patches per image mean that more images fit into one batch for model optimization, since the memory size is fixed in our case. The higher diversity of pavement images in each batch benefits the model optimization. A good sparse sampling strategy should balance the trade-off between patch preservation and the diversity of samples in the same batch. We believe this is the reason why WSPLIN-SS () performs well in both tasks.
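The trade-off between patch preservation and batch diversity can be illustrated with a short sketch. It assumes that the sampling hyperparameter acts as the fraction of patches kept in each pyramid layer and that sampling is uniform at random; both are assumptions made for illustration, and `keep_ratio`, `sparse_sample`, and `images_per_batch` are hypothetical names.

```python
import random


def sparse_sample(patches_per_layer, keep_ratio, seed=0):
    """Randomly keep a fraction of the patches in every pyramid layer."""
    rng = random.Random(seed)
    sampled = {}
    for layer, patches in patches_per_layer.items():
        # Keep at least one patch per non-empty layer.
        k = min(len(patches), max(1, round(len(patches) * keep_ratio)))
        sampled[layer] = rng.sample(patches, k)
    return sampled


def images_per_batch(patch_budget, patches_per_image):
    """Number of images that fit in a batch with a fixed patch budget."""
    return patch_budget // patches_per_image


# Assumed pyramid: 12 patches at full resolution, 2 at half resolution.
layers = {"full": list(range(12)), "half": list(range(2))}
kept = sparse_sample(layers, keep_ratio=0.5)
kept_count = sum(len(v) for v in kept.values())

# With a fixed memory budget of 56 patches per batch (an assumed figure),
# fewer patches per image leave room for more distinct images.
dense_batch = images_per_batch(56, 14)
sparse_batch = images_per_batch(56, kept_count)
```

In this example, halving the number of patches per image from 14 to 7 doubles the number of images per batch from 4 to 8, which is the diversity effect described above.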
In summary, all WSPLIN approaches show prominent advantages in training efficiency with similar or even better performance. We recommend WSPLIN-IP for application scenarios that prioritize performance over efficiency, and WSPLIN-SS for scenarios that need to take both performance and efficiency into consideration. WSPLIN-SW is recommended for scenarios that focus on the visual analysis of distressed images.
IV-F2 Discussion on Patch Label Sparsity Constraint
The distressed area often occupies only a small proportion of the whole image. We therefore introduce the Patch Label Sparsity Constraint (PLSC) to model and leverage this prior for better pavement image classification. Table IV reports the performance of WSPLIN-IP with and without PLSC. WSPLIN-IP with PLSC achieves a 2.0% higher P@R=90% under I-DET and a 1.8% higher F1-score under I-REC than WSPLIN-IP without PLSC. This implies that PLSC offers a considerable improvement to WSPLIN. We also leverage Grad-CAM to plot the Class Activation Maps (CAM) of the features extracted by the WSPLIN-IP models before and after using PLSC in Figure 3. The CAM visualization results also validate that PLSC benefits the distressed feature extraction.
The weight of the PLSC term is a positive hyperparameter that balances the classification loss and the PLSC. Figure 4 plots the relationship between different values of this weight and the performance of WSPLIN-IP under I-DET and I-REC. We find that WSPLIN-IP is insensitive to its setting. The optimal value is .
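How a sparsity term can be balanced against the classification loss is sketched below. This is only an illustration of the general idea under stated assumptions: the penalty used here (the mean probability mass that patches assign to non-normal classes) is an assumed stand-in and not necessarily the PLSC formulation of the paper, and the names `sparsity_penalty`, `total_loss`, and `weight` are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class for one image."""
    return -math.log(max(probs[target], 1e-12))


def sparsity_penalty(patch_probs, normal_index=0):
    """Assumed penalty: mean probability mass on non-normal classes.

    It is small when only a few patches are predicted as distressed,
    which matches the prior that distress covers a small image area.
    """
    total = 0.0
    for probs in patch_probs:
        total += sum(p for i, p in enumerate(probs) if i != normal_index)
    return total / len(patch_probs)


def total_loss(image_probs, target, patch_probs, weight):
    """Classification loss plus the weighted sparsity term."""
    return cross_entropy(image_probs, target) + weight * sparsity_penalty(patch_probs)


# Two hypothetical patch label distributions for the same distressed image
# (index 0 = normal, index 1 = distressed).
sparse_patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
dense_patches = [[0.0, 1.0]] * 4
image_probs = [0.2, 0.8]

loss_sparse = total_loss(image_probs, 1, sparse_patches, weight=0.1)
loss_dense = total_loss(image_probs, 1, dense_patches, weight=0.1)
```

With the same image-level prediction, the solution that marks every patch as distressed receives the larger loss, so the optimizer is nudged towards explanations in which only a few patches carry the distress.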
IV-F3 The Efficiency of WSPLIN
According to the observations in Table IV, all WSPLIN approaches are more efficient than IOPLIN. Moreover, IOPLIN and WSPLIN have very similar network structures, so they have the same parameter scale.
IV-G User Scenarios
WSPLIN has wider application scenarios than IOPLIN. IOPLIN can only address the pavement distress detection problem, which is a typical binary image classification issue that attempts to find the distressed samples only. WSPLIN can tackle both the pavement distress detection and recognition tasks under the aforementioned application settings shown in Table I. The pavement distress recognition task is a multi-class image classification task, which classifies a pavement image into a specific distress category. The usages of WSPLIN and IOPLIN are similar. Once the corresponding models are trained, we can feed pavement images into the models to acquire their detection or recognition labels. For more details about their usages, please refer to . Similar to IOPLIN, WSPLIN can also roughly localize the distressed area of a pavement image. The main difference between IOPLIN and WSPLIN in this process is that WSPLIN can further recognize the diseases in those distressed areas. Figure 5 gives some examples of this scenario by visualizing the patch labels produced by the trained WSPLIN.
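The rough localization described above can be sketched as a mapping from inferred patch labels back to image coordinates. The example assumes non-overlapping patches laid out row by row on a regular grid; the label names, grid shape, and patch size are illustrative, and `distressed_regions` is a hypothetical helper rather than part of the released code.

```python
def distressed_regions(patch_labels, grid_cols, patch_size, normal_label="normal"):
    """Map non-normal patch labels to (label, (top, left, bottom, right)).

    patch_labels: one predicted label per patch, in row-major order.
    grid_cols: number of patches per row of the patch grid.
    """
    regions = []
    for index, label in enumerate(patch_labels):
        if label == normal_label:
            continue
        row, col = divmod(index, grid_cols)
        box = (
            row * patch_size,
            col * patch_size,
            (row + 1) * patch_size,
            (col + 1) * patch_size,
        )
        regions.append((label, box))
    return regions


# Hypothetical patch labels for a 2x3 grid of 300x300 patches.
labels = ["normal", "crack", "normal",
          "normal", "normal", "pothole"]
regions = distressed_regions(labels, grid_cols=3, patch_size=300)
# regions == [("crack", (0, 300, 300, 600)), ("pothole", (300, 600, 600, 900))]
```

A detection-only model could report just the positions of the two flagged patches, whereas keeping the label in each tuple reflects the additional ability to name the distress found in each area.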
In this paper, we present a novel patch-based deep learning model named WSPLIN for automatic pavement distress detection and recognition in the wild. WSPLIN divides the pavement image into patches with different patch collection strategies and then learns the labels of the patches in a weakly supervised manner. Finally, the inferred patch labels are fed into a comprehensive decision network to yield the final recognition results. Similar to IOPLIN, WSPLIN can sufficiently utilize the resolution and scale information and can also provide interpretable information, such as the location of the distressed area. However, WSPLIN is more efficient than IOPLIN with similar or even better performance. The experiments on a large pavement distress dataset validate the effectiveness of our approach.
-  CIA, “Roadways - the world factbook,” Jun. 2021. [Online]. Available: https://www.cia.gov/the-world-factbook/field/roadways/country-comparison
-  S. M. Piryonesi and T. El-Diraby, “Using data analytics for cost-effective prediction of road conditions: Case of the pavement condition index,” Federal Highway Administration: McLean, VA, USA, 2018.
-  A. Benedetto, F. Tosti, L. Pajewski, F. D’Amico, and W. Kusayanagi, “Fdtd simulation of the gpr signal for effective inspection of pavement damages,” in Proceedings of the 15th International Conference on Ground Penetrating Radar. IEEE, 2014, pp. 513–518.
-  C. Wang, A. Sha, and Z. Sun, “Pavement crack classification based on chain code,” in 2010 Seventh international conference on fuzzy systems and knowledge discovery, vol. 2. IEEE, 2010, pp. 593–597.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 1. IEEE, 2005, pp. 886–893.
-  J. Chou, W. A. O’Neill, and H. Cheng, “Pavement distress classification using neural networks,” in Proceedings of IEEE International Conference on Systems, Man and Cybernetics, vol. 1. IEEE, 1994, pp. 397–401.
-  J. Zhou, P. Huang, and F.-P. Chiang, “Wavelet-based pavement distress classification,” Transportation research record, vol. 1940, no. 1, pp. 89–98, 2005.
-  F. M. Nejad and H. Zakeri, “An expert system based on wavelet transform and radon neural network for pavement distress classification,” Expert Systems with Applications, vol. 38, no. 6, pp. 7088–7101, 2011.
-  K. Gopalakrishnan, S. K. Khaitan, A. Choudhary, and A. Agrawal, “Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection,” Construction and Building Materials, vol. 157, pp. 322–330, 2017.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009, pp. 248–255.
-  B. Li, K. C. Wang, A. Zhang, E. Yang, and G. Wang, “Automatic classification of pavement crack using deep convolutional neural network,” International Journal of Pavement Engineering, vol. 21, no. 4, pp. 457–463, 2020.
-  Z. Fan, Y. Wu, J. Lu, and W. Li, “Automatic pavement crack detection based on structured prediction with the convolutional neural network,” arXiv preprint arXiv:1802.02208, 2018.
-  W. Tang, S. Huang, Q. Zhao, R. Li, and L. Huangfu, “An iteratively optimized patch label inference network for automatic pavement distress detection,” IEEE Transactions on Intelligent Transportation Systems, 2021.
-  L. Ale, N. Zhang, and L. Li, “Road damage detection using retinanet,” in IEEE International Conference on Big Data, 2018, pp. 5197–5200.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
-  G. Huang, S. Huang, L. Huangfu, and D. Yang, “Weakly supervised patch label inference network with image pyramid for pavement diseases recognition in the wild,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 7978–7982.
-  J. Zhou, P. S. Huang, and F.-P. Chiang, “Wavelet-based pavement distress detection and evaluation,” Optical Engineering, vol. 45, no. 2, p. 027007, 2006.
-  Y. Hu and C.-x. Zhao, “A novel lbp based methods for pavement crack detection,” Journal of pattern Recognition research, vol. 5, no. 1, pp. 140–147, 2010.
-  Y. Shi, L. Cui, Z. Qi, F. Meng, and Z. Chen, “Automatic road crack detection using random structured forests,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 12, pp. 3434–3445, 2016.
-  R. Kapela, P. Śniatała, A. Turkot, A. Rybarczyk, A. Pożarycki, P. Rydzewski, M. Wyczałek, and A. Błoch, “Asphalt surfaced pavement cracks detection based on histograms of oriented gradients,” in 2015 22nd International Conference Mixed Design of Integrated Circuits & Systems (MIXDES). IEEE, 2015, pp. 579–584.
-  Y. Pan, X. Zhang, M. Sun, and Q. Zhao, “Object-based and supervised detection of potholes and cracks from the pavement images acquired by uav,” International Archives of the Photogrammetry, Remote Sensing & Spatial Information Sciences, vol. 42, 2017.
-  L. Zhang, F. Yang, Y. D. Zhang, and Y. J. Zhu, “Road crack detection using deep convolutional neural network,” in 2016 IEEE international conference on image processing (ICIP). IEEE, 2016, pp. 3708–3712.
-  F. Yang, L. Zhang, S. Yu, D. Prokhorov, X. Mei, and H. Ling, “Feature pyramid and hierarchical boosting network for pavement crack detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 4, pp. 1525–1535, 2019.
-  Q. Zou, Z. Zhang, Q. Li, X. Qi, Q. Wang, and S. Wang, “Deepcrack: Learning hierarchical convolutional features for crack detection,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1498–1512, 2018.
-  E. Ibragimov, H.-J. Lee, J.-J. Lee, and N. Kim, “Automated pavement distress detection using region based convolutional neural networks,” International Journal of Pavement Engineering, pp. 1–12, 2020.
-  J. Zhu, J. Zhong, T. Ma, X. Huang, W. Zhang, and Y. Zhou, “Pavement distress detection using convolutional neural networks with images captured via uav,” Automation in Construction, vol. 133, p. 103991, 2022.
-  H. Dong, K. Song, Q. Wang, Y. Yan, and P. Jiang, “Deep metric learning-based for multi-target few-shot pavement distress classification,” IEEE Transactions on Industrial Informatics, pp. 1–1, 2021.
-  H. Dong, K. Song, Y. Wang, Y. Yan, and P. Jiang, “Automatic inspection and evaluation system for pavement distress,” IEEE Transactions on Intelligent Transportation Systems, 2021.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, no. 3, pp. 211–252, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, pp. 1097–1105, 2012.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
-  B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
-  M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” arXiv preprint arXiv:1905.11946, 2019.
-  A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
-  L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, and J. Han, “On the variance of the adaptive learning rate and beyond,” arXiv preprint arXiv:1908.03265, 2019.
-  M. Zhang, J. Lucas, J. Ba, and G. E. Hinton, “Lookahead optimizer: k steps forward, 1 step back,” in Advances in Neural Information Processing Systems, 2019, pp. 9597–9608.
-  Y. You, I. Gitman, and B. Ginsburg, “Large batch training of convolutional networks,” arXiv preprint arXiv:1708.03888, 2017.
-  D. J. Hand and R. J. Till, “A simple generalisation of the area under the roc curve for multiple class classification problems,” Machine learning, vol. 45, no. 2, pp. 171–186, 2001.
-  T. Ahonen, A. Hadid, and M. Pietikainen, “Face description with local binary patterns: Application to face recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 28, no. 12, pp. 2037–2041, 2006.
-  F. Perronnin, J. Sánchez, and T. Mensink, “Improving the fisher kernel for large-scale image classification,” in European conference on computer vision. Springer, 2010, pp. 143–156.
-  L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826.
-  R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.