# PicT: A Slim Weakly Supervised Vision Transformer for Pavement Distress Classification

Automatic pavement distress classification improves the efficiency of pavement maintenance and reduces the cost of labor and resources. A recently influential branch of this task divides the pavement image into patches and addresses the problem from the perspective of multi-instance learning. However, these methods neglect the correlation between patches and suffer from low efficiency in model optimization and inference. Swin Transformer is able to address both of these issues. Built upon Swin Transformer, we present a vision Transformer named Pavement Image Classification Transformer (PicT) for pavement distress classification. In order to better exploit the discriminative information of pavement images at the patch level, the *Patch Labeling Teacher* is proposed, which leverages a teacher model to dynamically generate pseudo labels of patches from image labels during each iteration and guides the model to learn the discriminative features of patches. The broad classification head of Swin Transformer may dilute the discriminative features of distressed patches in the feature aggregation step, because the distressed area occupies only a small fraction of the pavement image. To overcome this drawback, we present a *Patch Refiner* that clusters patches into different groups and selects only the highest distress-risk group to yield a slim head for the final image classification. We evaluate our method on CQU-BPDD. Extensive results show that PicT outperforms the second-best model by a large margin of $+2.4\%$ in P@R on the detection task, $+3.9\%$ in $F1$ on the recognition task, and 1.8x throughput, while enjoying 7x faster training speed using the same computing resources. Our codes and models have been released at https://github.com/DearCaat/PicT.


## 1. Introduction

With the rapid growth of transport infrastructures such as airports, bridges, and roads, pavement maintenance is regarded as a crucial element of sustainable pavement today (Zhang and Mohsen, 2018). There is a trend to automate this process via machine learning and pattern recognition techniques, which can significantly reduce the cost of labor and resources. One of the core tasks in pavement maintenance is pavement distress classification (PDC), which aims to detect the damaged pavement and recognize its specific distress category. These two steps are also referred to as pavement distress detection and pavement distress recognition, respectively (Huang et al., 2022).

For decades, many researchers have investigated PDC from the perspective of computer vision. Traditional methods use image processing, hand-crafted features, and conventional classifiers to recognize pavement distress (Wang et al., 2010; Chou et al., 1994; Zhou et al., 2005; Nejad and Zakeri, 2011). The main drawback of these methods is that they rely heavily on expert knowledge and generalize poorly (Dong et al., 2021). Inspired by the recent remarkable successes of deep learning, many deep learning-based approaches have been proposed for addressing the PDC issue (Gopalakrishnan et al., 2017; Li et al., 2020; Fan et al., 2018). However, these methods often regard the PDC problem as a common image classification problem and seldom pay attention, in the model design phase, to the specific characteristics of pavement images, such as the high image resolution, the low distressed area ratio, and uneven illumination (Huang et al., 2021).

To account for these characteristics of pavement images, some of the latest approaches (Tang et al., 2021; Huang et al., 2021, 2022) divide the high-resolution raw pavement image into patches and solve the problem by inferring the labels of patches with CNNs, as shown in Figure 2. Intuitively, these approaches can be regarded as a kind of multi-instance learning approach (Dietterich et al., 1997). Their merit is the ability to better learn the local discriminative information of the high-resolution image, which yields promising performance. For example, IOPLIN (Tang et al., 2021) conducts the feature extraction and classification on each patch independently and finally aggregates the classification results via max-pooling or average-pooling. Similarly, WSPLIN (Huang et al., 2021, 2022) also extracts the feature of each patch and classifies the patch independently. However, instead of aggregating the classification results, it inputs the classification results of all patches into a comprehensive decision network to accomplish the final classification. Generally speaking, the whole patch label inference process, including patch partition, feature extraction, and classification, is quite cumbersome, which leads to the low efficiency of these approaches. Moreover, the patches are naturally correlated, yet all of these methods neglect this correlation.

Recently, the vision Transformer has demonstrated excellent performance in supervised learning (Shao et al., 2021; Zhang et al., 2021b; Xie et al., 2021a; Girdhar et al., 2019), weakly supervised learning (Siméoni et al., 2021; Gao et al., 2021), and self-supervised learning tasks (Xie et al., 2021b; He et al., 2021c). Its success benefits from the self-attention mechanism (Vaswani et al., 2017) and the strong local representation capability. As an influential vision Transformer, Swin Transformer (Liu et al., 2021) shares a similar idea with the aforementioned patch-based PDC approaches in how it exploits visual information. It also partitions the image into different patches and encodes them as visual tokens. Then, self-attention is employed to model their relations. Finally, these tokens are aggregated to yield the final classification. Compared to previous approaches, the vision Transformer provides a more succinct and advanced framework for learning the subtle discriminative features of images, as shown in Figure 2 (a). However, it is still inappropriate to directly apply Swin Transformer to the PDC issue, for two reasons. The first is that the potential discriminative features of patches are still not well explored, as the CAMs in Figure 1 (b) show, since the patch-level supervised information is not sufficiently mined in the Swin Transformer. The second is that its average-pooling aggregation over all patches suppresses the distressed patch features, since the distressed patches are only a small fraction of all patches.

In this paper, we present a novel Transformer named PicT, which stands for Pavement Image Classification Transformer, based on the Swin Transformer framework for Pavement Distress Classification (PDC). To overcome the first drawback of the Swin Transformer in PDC, we develop a Patch Labeling Teacher module, based on the idea of the teacher-student model, that provides a weakly supervised patch label inference scheme for fully mining the discriminative features of patches. To overcome the second drawback, a Patch Refiner module is designed to cluster the patches and select the patches from the highest-risk cluster to yield a slimmer head for the final classification. This strategy can significantly suppress the interference from the plentiful non-distressed (normal) patches in the image label inference phase. A large-scale pavement image dataset named CQU-BPDD is adopted for evaluation. Extensive experiments demonstrate that PicT achieves state-of-the-art performance in two common PDC tasks, namely pavement distress detection and pavement distress recognition. More importantly, PicT outperforms the second-best model by a large margin of +2.4% in P@R on the detection task, +3.9% in $F1$ on the recognition task, and 1.8x throughput, while enjoying 7x faster training.

The main contributions of our work are summarized as follows:

• We propose a novel vision Transformer named PicT for pavement distress classification. PicT not only inherits the merits of Swin Transformer but also takes the characteristics of pavement images into consideration. To the best of our knowledge, it is the first attempt to specifically design a Transformer to address pavement image analysis issues.

• A Patch Labeling Teacher module is elaborated to learn to infer the labels of patches in a weakly supervised fashion. It enables better exploiting the discriminative information of patches under the guidance of the patch-level supervised information, which is generated by a Prior-based Patch Pseudo Label Generator based on image labels. Such an idea can also be flexibly applied to other vision Transformers to infer labels of visual tokens in a weakly supervised manner.

• A Patch Refiner is designed to yield a slimmer head for PDC. It enables filtering out the features from low-risk patches while preserving the ones from high-risk patches to boost the discriminating power of the model further.

• We systematically evaluate the performance of common vision Transformers on two pavement distress classification tasks. Extensive results demonstrate the superiority of PicT over them. Compared to the second-best method, PicT achieves a 2.4% detection performance gain in P@R, a 3.9% recognition performance gain in $F1$, 1.8x higher throughput, and 7x faster training speed.

## 2. Related Work

### 2.1. Pavement Distress Classification

Pavement Distress Classification (PDC) aims to detect distressed pavement and recognize its specific distress category from pavement images. Traditional methods use image processing, hand-crafted features, and conventional classifiers to recognize pavement distress (Wang et al., 2010; Chou et al., 1994; Zhou et al., 2005; Nejad and Zakeri, 2011). The main drawback of these methods is that they rely on expert knowledge and generalize poorly (Dong et al., 2021).

Inspired by the recent remarkable successes of deep learning in extensive applications, simple and efficient PDC methods based on convolutional neural networks (CNNs) have gradually become the mainstream in recent years. There are two major branches: general-CNN methods and patch-based methods. The general-CNN approaches (Gopalakrishnan et al., 2017; Li et al., 2020; Fan et al., 2018) regard PDC as a standard image classification problem and directly apply classical deep learning approaches to solve it. For example, Li et al. (2020) propose a method that uses deep CNNs to automatically classify image patches cropped from 3D pavement images and successfully train four supervised CNNs with different receptive field sizes. However, general-CNN methods seldom pay attention to the specific characteristics of pavement images. Patch-based methods (Tang et al., 2021; Huang et al., 2022, 2021) address this issue by splitting the image into patches and inferring the labels of patches. IOPLIN (Tang et al., 2021) manually partitions the high-resolution pavement image into patches and uses an iteratively optimized CNN model to predict patch labels for detection. This enables learning the discriminative details of local regions and thereby improves performance. WSPLIN (Huang et al., 2021, 2022) follows the same process as IOPLIN but uses a comprehensive decision network to aggregate the classification results of patches for the final image classification. However, the whole process of image partition, patch feature extraction, and classification often imposes a large efficiency burden. Moreover, such methods often neglect the correlation between patches.

### 2.2. Vision Transformer

The Transformer originates from natural language processing and machine translation (Vaswani et al., 2017; Devlin et al., 2018). Recently, Transformers have been successfully applied to many computer vision domains (Li et al., 2021; Chen et al., 2021; Zhang et al., 2021a, b; Zhao and Liu, 2021; He et al., 2021a). Generally speaking, the Transformer models in the computer vision community can be roughly divided into two groups: models that combine self-attention with CNNs (Girdhar et al., 2019; Xie et al., 2021a; Shao et al., 2021; Feng et al., 2021) and vision Transformer models (Dosovitskiy et al., 2020; Liu et al., 2021; Xie et al., 2021b; Xu et al., 2021). The first group focuses on using the powerful self-attention mechanism to model the correlation between features extracted by a CNN. For example, Girdhar et al. (2019) repurpose a Transformer-style architecture to model features from the spatiotemporal context around the person whose actions are being classified. On the other hand, since the release of ViT (Dosovitskiy et al., 2020), more and more works apply vision Transformers to achieve better results in different vision areas (Liu et al., 2021; Xu et al., 2021; Gao et al., 2021; Xie et al., 2021b; He et al., 2021c; Bao et al., 2021; He et al., 2021b; Hu et al., 2021). For example, LOST (Siméoni et al., 2021) achieves state-of-the-art results on object discovery by combining similar tokens and using seed search to complete the object localization. SimMIM (Xie et al., 2021b) predicts raw pixel values of the randomly masked patches with a lightweight one-layer head and performs learning using a simple L1 loss. In particular, these outstanding studies in weakly supervised localization (Siméoni et al., 2021; Gao et al., 2021) and self-supervised learning (Xie et al., 2021b; He et al., 2021c; Bao et al., 2021) demonstrate the powerful local representation capabilities of tokens in vision Transformer models. In this work, we attempt to leverage the merits of the Transformer to address the PDC issue.

## 3. Method

### 3.1. Problem Formulation and Overview

#### 3.1.1. Problem Formulation

Pavement Distress Classification (PDC) can be regarded as an image classification task from the perspective of computer vision. Let $X = \{x_1, \dots, x_n\}$ and $Y = \{y_1, \dots, y_n\}$ be the collection of pavement images and their pavement labels respectively, where $y_i$ is a $C$-dimensional vector and $C$ is the number of categories. Let the subset $Y_{normal}$ be the non-distressed label set and $Y_{dis}$ be the distressed label set, which indicate the absence or presence of distress in a pavement image. Pavement distress detection and recognition are the most common tasks in PDC. The pavement distress detection task is a binary image classification problem ($C = 2$), which judges whether a pavement image contains distress, while the pavement distress recognition task is a multi-class image classification problem ($C > 2$), which classifies the pavement image into a specific distress category. In this paper, we develop a novel vision Transformer model named PicT to address both of these PDC tasks.

#### 3.1.2. Overview

The overview of PicT is shown in Figure 3. The entire framework is based on the well-known Swin Transformer. In order to incorporate the specific characteristics of pavement images, such as the high resolution and the low distressed area ratio, we enhance Swin Transformer with two modifications. First, we borrow an idea from self-supervised learning to develop a Patch Labeling Teacher (PLT), which better exploits the discriminative power of patch information by classifying patches in a weakly supervised manner. The main obstacle to patch classification is that only image-level labels are available for training. We therefore design a Prior-based Patch Pseudo Label Generator (PLG) that dynamically generates the pseudo labels of patches, based on the prior information of image labels and the distressed area ratio, during the optimization of PLT. Second, since only a small fraction of patches contain distress, the original feature aggregation based on global average pooling easily dilutes the discriminative features of the distressed patches with the features of the overabundant non-distressed patches. To solve this problem, the Patch Refiner (PR) clusters patches into different groups, and only the patches from the group with the highest distress risk are aggregated to yield a slim head for image classification. In the following sections, we introduce these modules in detail.

### 3.2. Patch Labeling Teacher

Similar to the Swin Transformer, PicT first splits the resized image into patches and employs the Swin Transformer blocks as the feature extractor to learn the visual features of patches, which are also referred to as tokens. The Swin Transformer lacks patch-level supervision, which leads to insufficient exploration of the discriminative information in patches. We present a Patch Labeling Teacher (PLT) module to address this issue. Similar to the idea of self-supervised learning, PLT adopts the teacher-student scheme, which employs a teacher model to supervise the optimization of the student model (the finally adopted model). More specifically, let $P_{\theta_s}(\cdot)$ and $P_{\theta_t}(\cdot)$ be the mapping functions of the student and teacher models respectively, where $\theta_s$ and $\theta_t$ are their corresponding parameters. The parameters of the teacher model $\theta_t$ are updated by the exponential moving average (EMA) of the student parameters $\theta_s$. The update rule is $\theta_t \leftarrow \lambda\theta_t + (1-\lambda)\theta_s$, where $\lambda$ is the momentum coefficient. The student and teacher models share the same network structure. Both use Swin-S (Liu et al., 2021) as the backbone $f(\cdot)$ and a patch head $h_{pat}(\cdot)$ for classifying the patches, so that $P(\cdot) = h_{pat}(f(\cdot))$. The patch head consists of only a linear layer with dimensions $L \times C$, where $L$ is the dimension of a token. The Swin-S architecture treats a pavement image as a grid of non-overlapping contiguous image patches, and these patches are then passed through Swin Transformer blocks to form a set of features, which are called tokens. The token collection can be denoted as,

$$T_i = f(x_i) = \{t_{i1}, \dots, t_{ij}, \dots, t_{im} \mid t_{ij} \in \mathbb{R}^{N \times N \times L}\}, \tag{1}$$

where $t_{ij}$ is the $j$-th token of the $i$-th image $x_i$, and $m$ is the number of tokens. The patch label predictions can be obtained by inputting the tokens of image $x_i$ to the patch head for classification,

$$P_i = h_{pat}(T_i) = \{p_{i1}, \dots, p_{ij}, \dots, p_{im} \mid p_{ij} \in \mathbb{R}^{C}\}, \tag{2}$$

where $p_{ij}$ is the label prediction of token $t_{ij}$. However, we only have the image-level label $y_i$ and no patch-level labels for training the patch classifier directly. We therefore design a Prior-based Patch Pseudo Label Generator (PLG) to generate the pseudo labels of patches (tokens) based on the image-level label and the label predictions produced by the teacher model in the previous iteration, which will be introduced later. The pseudo patch labels of image $x_i$ are denoted as $\tilde{Y}_i = \{\tilde{y}_{i1}, \dots, \tilde{y}_{im}\}$, where $\tilde{y}_{ij}$ is the pseudo label of patch $t_{ij}$.

Finally, we use the cross-entropy to measure the discrepancy between the patch pseudo label and the student patch label prediction, and denote the patch classification loss as follows,

$$\mathcal{L}_p = -\frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} \tilde{y}_{ij} \log p_{ij}. \tag{3}$$
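As a rough illustration of the patch classification loss in Eq. (3), the following NumPy sketch averages the cross-entropy over all $n \times m$ patches. The function name, the array layout (one-hot pseudo labels and softmax outputs of shape `(n, m, C)`), and the small `eps` added for numerical stability are assumptions made for this example; it also ignores any patch filtering.

```python
import numpy as np

def patch_cross_entropy(pseudo_labels, patch_probs, eps=1e-12):
    """Mean cross-entropy between patch pseudo labels and patch predictions.

    pseudo_labels: (n, m, C) one-hot array of pseudo labels (assumed layout).
    patch_probs:   (n, m, C) array of softmax outputs from the student.
    eps:           hypothetical constant that guards against log(0).
    """
    n, m, _ = patch_probs.shape
    # Summing over the class axis gives one loss value per patch.
    per_patch = -(pseudo_labels * np.log(patch_probs + eps)).sum(axis=-1)
    # Eq. (3) averages over all n * m patches.
    return per_patch.sum() / (n * m)
```

With uniform predictions over two classes, every patch contributes $\log 2$, so the loss equals $\log 2$ regardless of the pseudo labels.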

Moreover, to facilitate the training of the student model, we adopt FixMatch (Sohn et al., 2020), a popular advance in semi-supervised image classification. Strongly augmented data are used as the input of the student model, while the original data are used as the input of the teacher model.
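The exponential-moving-average update of the teacher described in this section can be sketched as below. This is a minimal illustration, not the released implementation: parameters are represented as a plain dictionary of NumPy arrays, and the `momentum` value of 0.99 is a hypothetical default, since the text does not state the coefficient.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.99):
    """Return teacher parameters updated as an EMA of the student parameters.

    teacher, student: dicts mapping parameter names to arrays of equal shape.
    momentum:         hypothetical EMA coefficient (not given in the text).
    """
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }
```

A higher momentum makes the teacher change more slowly, which is the usual way to keep its pseudo labels stable between iterations.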

### 3.3. Prior-based Patch Pseudo Label Generator

The performance of the patch classifier highly relies on the quality of the generated pseudo-label (Xu et al., 2021). So, we propose a Prior-based Patch Pseudo Label Generator (PLG) to accomplish this task with the help of the prior information, such as image labels, distressed area ratio, and the patch label predictions dynamically produced by the teacher model. PLG consists of two important steps: Relative Distress Threshold and Patch Filter.

#### 3.3.1. Relative Distress Threshold (RDT)

A Relative Distress Threshold (RDT) is defined for producing the pseudo labels of patches. In RDT, the patch pseudo labels are always fixed to the normal one, $y_{normal}$, if the pavement image has no distress ($y_i \in Y_{normal}$). If the label of the pavement image is a distressed label $y_i \in Y_{dis}$, the patches with the top-$\Delta_{rel}$ highest predictions for the image distress label $y_i$ are considered as patches of the distressed category $y_i$. The whole strategy can be mathematically denoted as follows,

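$$\tilde{y}_{ij} = \begin{cases} y_i, & p_{ij}^{y_i} > T(\Delta_{rel}) \text{ and } y_i \in Y_{dis} \\ y_{normal}, & \text{others,} \end{cases} \tag{4}$$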

where $p_{ij}^{y_i}$ is the element of the patch label prediction $p_{ij}$ corresponding to the distressed category of $x_i$, and $T(\Delta_{rel})$ returns the top-$\Delta_{rel}$ largest $p_{ij}^{y_i}$ among all patches in image $x_i$. Compared with the common absolute threshold strategy (Jiang et al., 2018), this threshold ensures that a distressed image always has a certain percentage of distressed instances. It also alleviates the pseudo-label bias caused by inconsistent training of the model across different distress categories.
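The relative threshold rule of Eq. (4) can be sketched for a single image as follows. This sketch assumes that the relative threshold is a fraction of the patches (consistent with the values 0.25 and 0.35 reported in the hyperparameter study), implements the rule as a top-k selection, and always keeps at least one distressed patch; the function and argument names are hypothetical.

```python
import numpy as np

def rdt_pseudo_labels(patch_probs, image_label, normal_label, delta_rel):
    """Generate patch pseudo labels for one image with a relative threshold.

    patch_probs:  (m, C) teacher predictions for the m patches of the image.
    image_label:  integer class index of the image-level label.
    normal_label: integer class index of the non-distressed category.
    delta_rel:    assumed to be the fraction of patches labeled as distressed.
    """
    m = patch_probs.shape[0]
    labels = np.full(m, normal_label, dtype=int)
    if image_label == normal_label:
        # A non-distressed image only contains normal patches.
        return labels
    # Rank patches by the predicted probability of the image's distress class.
    scores = patch_probs[:, image_label]
    k = max(1, int(round(delta_rel * m)))
    top = np.argsort(-scores, kind="stable")[:k]
    labels[top] = image_label
    return labels
```

Because the cut-off is relative to the patches of each image, every distressed image receives the same share of distressed pseudo labels, however confident the teacher currently is.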

#### 3.3.2. Patch Filter

Following (Xu et al., 2021), we filter out low prediction confidence patches, and only preserve the high confidence ones for training.

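• For Distressed Image: We preserve all patches whose pseudo-labels are not $y_{normal}$, and also preserve the patches whose pseudo-labels are $y_{normal}$ and whose prediction confidence exceeds a filtering threshold.

• For Non-distressed Image: We only preserve the high-confidence patches whose prediction confidence exceeds a filtering threshold.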

These two optimal patch filtering thresholds are obtained in an empirical way.
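One possible reading of the patch filter is sketched below. The threshold values are not recoverable from the text, so `tau_dis` and `tau_normal` are hypothetical placeholders, and measuring confidence by the predicted probability of the normal category is likewise an assumption of this sketch.

```python
import numpy as np

def patch_filter_mask(patch_probs, pseudo_labels, image_is_distressed,
                      normal_label=0, tau_dis=0.5, tau_normal=0.9):
    """Return a boolean mask of the patches kept for training.

    patch_probs:   (m, C) teacher predictions for one image.
    pseudo_labels: (m,) integer pseudo labels of the patches.
    tau_dis, tau_normal: hypothetical confidence thresholds (assumed values).
    """
    # Assumption: confidence is the probability of the normal category.
    normal_conf = patch_probs[:, normal_label]
    if image_is_distressed:
        # Keep every distressed patch plus the confidently normal ones.
        return (pseudo_labels != normal_label) | (normal_conf > tau_dis)
    # For a normal image, keep only confidently normal patches.
    return normal_conf > tau_normal
```

The returned mask could then be used to exclude the discarded patches from the patch classification loss.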

### 3.4. Patch Refiner

In the Swin Transformer, after the feature extraction of patches, the features are aggregated via average-pooling over all patches and then used for classification. However, the distressed patches are only a small fraction of all patches, so this feature aggregation strategy easily dilutes the discriminative features of distressed patches. To address this issue, we develop a Patch Refiner (PR) to yield a slimmer image head.

Let $R_{\xi}(\cdot)$ be the mapping function of PR, where $\xi$ denotes its parameters. It consists of three steps, namely token clustering, aggregation, and selection. The mapping functions of these steps are denoted as $f_{clu}(\cdot)$, $h_{img}(\cdot)$, and $f_{sel}(\cdot)$ respectively, so $R(\cdot) = f_{sel}(h_{img}(f_{clu}(\cdot)))$. Compared with the image head in Swin Transformer, which aggregates all tokens for classification, our image head is much slimmer, since it only uses the highest-risk group of tokens for the final classification.

In Patch Refiner, the tokens are first clustered into different groups using K-Means (Hartigan and Wong, 1979). Then average-pooling is conducted on each group for feature aggregation, and the aggregated features are input into an image head to assess the distress risk of each group. The whole process can be mathematically represented as follows,

$$R_i = h_{img}(f_{clu}(T_i)) = \{r_{i1}, \cdots, r_{it}, \cdots, r_{ik} \mid r_{it} \in \mathbb{R}^{C}\}, \tag{5}$$

where $R_i$ is the set of label predictions of the groups, and $k$ is the number of groups. $r_{it}$ is the label prediction of the $t$-th group, which can be regarded as the risk assessment of the different distresses for this group. In our approach, $k$ is empirically fixed to 2 for the detection task and 3 for the recognition task.

In the image label classification phase, we intend to suppress the interference from the overabundant non-distressed patches. So, we select only the highest-risk group to represent the pavement image in the final classification. In other words, we preserve only the label predictions of the group that has the lowest-confidence prediction for the normal (non-distressed) category, and use them as the predicted labels of image $x_i$,

$$\hat{y}_i = R(T_i) = f_{sel}(R_i) = \arg\min_{r_{it}} \{ r_{it}^{y_{normal}} \mid r_{it} \in R_i \}, \tag{6}$$

where $r_{it}^{y_{normal}}$ is the element of $r_{it}$ corresponding to the normal category. Once we obtain the predicted labels of images, we can use the cross-entropy to measure the image classification loss,

$$\mathcal{L}_i = -\frac{1}{n} \sum_{i=1}^{n} y_i \log \hat{y}_i. \tag{7}$$

Finally, the optimal PicT model can be learned by minimizing both the patch and image classification losses,

$$(\hat{\theta}_s, \hat{\xi}) \leftarrow \arg\min_{\theta_s, \xi} \; \mathcal{L}_{total} := \mathcal{L}_i + \mathcal{L}_p. \tag{8}$$
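The three steps of the Patch Refiner (clustering, aggregation, selection) can be sketched for a single image as follows. Several parts are assumptions made for illustration: tokens are treated as an `(m, L)` float array, the clustering uses a plain Lloyd-style K-Means written in NumPy rather than the Hartigan-Wong variant cited above, and the image head is a single linear layer followed by softmax with hypothetical `weight` and `bias` arguments.

```python
import numpy as np

def kmeans(tokens, k, n_iter=20, seed=0):
    """Plain Lloyd-style K-Means (a stand-in for the cited algorithm)."""
    rng = np.random.default_rng(seed)
    centers = tokens[rng.choice(len(tokens), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance of every token to every center: shape (m, k).
        dists = ((tokens[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(k):
            if np.any(assign == c):
                centers[c] = tokens[assign == c].mean(axis=0)
    return assign

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def patch_refiner(tokens, weight, bias, k, normal_label=0):
    """Cluster tokens, pool each group, score it, and keep the riskiest group.

    tokens: (m, L) token features; weight: (L, C); bias: (C,).
    Returns the (C,) prediction of the group with the lowest normal score.
    """
    assign = kmeans(tokens, k)
    groups = [c for c in range(k) if np.any(assign == c)]
    # Average-pool the tokens of each non-empty group.
    pooled = np.stack([tokens[assign == c].mean(axis=0) for c in groups])
    # Group-level predictions, playing the role of R_i in Eq. (5).
    risks = softmax(pooled @ weight + bias)
    # Eq. (6): select the group least confident in the normal category.
    riskiest = risks[:, normal_label].argmin()
    return risks[riskiest]
```

Selecting one of $k$ groups means that the image-level prediction is driven by only a subset of the tokens, which is what makes the head slim.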

### 3.5. Pavement Distress Classification

To tackle pavement distress classification, we train our model as a pavement image classifier. Once the model is trained, a pavement image can be fed into PicT to yield the final classification result. Note that although we use the patch-level branch, the Patch Labeling Teacher, to enhance the use of patch information during training, in the testing phase we only use the image-level branch, the Patch Refiner, for model inference due to the instability of patch-level testing. The process of PDC can be denoted as:

$$y_i = R(f(x_i)) = f_{sel}(h_{img}(f_{clu}(f(x_i)))). \tag{9}$$

Note that the selection in the testing phase is slightly different from that in the training phase. More details are given in the supplementary material.

## 4. Experiments

### 4.1. Dataset and Setup

Following (Huang et al., 2022), we evaluate our method on pavement distress detection and recognition tasks under two application settings. The first one is the one-stage detection (I-DET), which is the conventional detection fashion. In this setting, the model only needs to determine whether there are distressed areas in the pavement image. The second one is the one-stage recognition (I-REC), which tackles the pavement distress detection and recognition tasks jointly. In this setting, the model not only detects the presence of distressed areas within the pavement image, but also needs to further classify the distressed category. Compared to I-DET, this task is more challenging and practically applicable. Moreover, both the detection and recognition performances can be evaluated in this setting.

#### 4.1.2. Datasets

Before Tang et al. (2021), most PDC methods were validated on private datasets. Thus, following Tang et al. (2021), a large-scale public bituminous pavement distress dataset named CQU-BPDD (Tang et al., 2021) is used for evaluation. This dataset involves seven different types of distress and a non-distressed category. There are 10137 images in the training set and 49919 images in the test set, with image-level labels only. In the I-DET setting, only the binary coarse-grained category label (distressed or normal) is available. In the I-REC setting, the fine-grained category labels are available for training and testing models.

#### 4.1.3. Evaluation Protocols and Metrics

Following the standard evaluation protocol in (Tang et al., 2021), for the I-DET task we adopt the Area Under Curve (AUC) of the Receiver Operating Characteristic (ROC) and Precision@Recall (P@R) to evaluate the model performance. AUC is the common metric in binary classification tasks. P@R measures the precision under high recall, which is more meaningful in medical or pavement distress classification tasks, since missing a positive sample (a distressed sample) may have a more severe impact than missing a negative one. As for the I-REC task, following (Huang et al., 2022), we use the Top-1 accuracy and the Macro F1 score ($F1$) to evaluate the performance of the models. Please refer to the supplementary material for implementation details.
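To make the two less common metrics concrete, the following NumPy sketch computes precision at a target recall and the Macro F1 score. It is one reasonable reading of these metrics, not the evaluation code of the paper: P@R is taken here as the best precision among score thresholds whose recall reaches the target, tied scores are treated as separate thresholds, and the function names are hypothetical.

```python
import numpy as np

def precision_at_recall(y_true, scores, target_recall):
    """Best precision among thresholds whose recall is >= target_recall.

    y_true: (n,) array of 0/1 labels (1 = distressed).
    scores: (n,) array of predicted distress scores.
    """
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    # Cumulative true/false positives when the top-j samples are flagged.
    tp = np.cumsum(y_sorted)
    fp = np.cumsum(1 - y_sorted)
    recall = tp / y_true.sum()
    precision = tp / (tp + fp)
    return precision[recall >= target_recall].max()

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of the per-class F1 scores."""
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # A class that never appears in labels or predictions scores 0.
        f1s.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(f1s))
```

Because Macro F1 weights every class equally, rare distress categories influence it as much as frequent ones, which Top-1 accuracy does not capture.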

### 4.2. Pavement Distress Detection

Table 1 tabulates the pavement distress detection performance of different methods in the I-DET and I-REC settings. The results demonstrate that PicT outperforms all the compared methods in both settings under all evaluation metrics. Swin-S is adopted as the backbone of PicT. The performance gains of PicT over it are 0.8%, 5.7%, and 9.7% in AUC, P@R=90%, and P@R=95% respectively in the I-DET setting. In the I-REC setting, these numbers are 0.9%, 4.9%, and 8.0% respectively. These results validate the effectiveness of our modifications to Swin-S, which follow the characteristics of pavement images. Moreover, even though Swin-S adopts a more powerful learning framework (Transformer) and can also be regarded as a patch-based approach, it still cannot outperform the conventional patch-based methods, such as WSPLIN and IOPLIN. This reflects that it is inappropriate to directly apply Swin-S to PDC, and further confirms that it is important to incorporate the characteristics of pavement images in the model design phase.

WSPLIN is the second-best approach for pavement distress detection. Our method still gets considerable advantages in performance over it. For example, the performance gains of PicT over WSPLIN are 0.4%, 2.4%, and 5.7% in AUC, P@R=90%, and P@R=95% respectively on I-DET setting. In the I-REC setting, these numbers are 0.5%, 2.3%, and 4.4% respectively. These results verify the advantages of the Swin Transformer framework over the conventional CNN-based learning framework in pavement distress detection.

### 4.3. Pavement Distress Recognition

Table 2 reports the pavement distress recognition performance on the CQU-BPDD dataset. We observe phenomena similar to those in pavement distress detection. PicT still outperforms all approaches in pavement distress recognition, which is considered a more challenging PDC task than pavement distress detection. The results show that our method holds even more prominent performance advantages here than in pavement distress detection. For example, the performance gains of PicT over Swin-S are 2.7% and 5.3% in Top-1 and $F1$ respectively, while the gains of PicT over WSPLIN are 1.1% and 3.9% respectively. These observations support the arguments we raised in Section 4.2.

### 4.4. Efficiency Evaluation

We tabulate the training and inference efficiency of the patch-based approaches and their backbone networks, along with their classification performance, in Table 3. Patch-based approaches include our proposed PicT, WSPLIN, and IOPLIN. We report the efficiency of two WSPLIN variants, namely WSPLIN-IP (the default version) and WSPLIN-SS (the speed-up version). The backbone of PicT is Swin-S, while the backbone of the other patch-based approaches is EfficientNet-B3 (Effi-B3). We can see that Effi-B3 is slightly more efficient than Swin-S: Swin-S needs 14% more time to train the model with the same computing resources. Even so, PicT still shows significant efficiency advantages over the other patch-based approaches. For example, the training speed of PicT is around 7x, 8x, and 4x faster than that of WSPLIN-IP, IOPLIN, and WSPLIN-SS respectively. PicT can infer the labels of 26, 14, and 4 more images per second than WSPLIN-IP, IOPLIN, and WSPLIN-SS respectively. We attribute the good efficiency of PicT to the succinct learning framework of Swin-S. Besides its superiority in efficiency, PicT also holds a significant advantage in performance. For instance, WSPLIN-SS is the fastest patch-based approach except for PicT, and PicT outperforms it by 4.2% and 6.1% in P@R=90% and $F1$ respectively. To sum up, PicT is superior to the other patch-based approaches in both performance and efficiency.

### 4.5. Ablation Study

#### 4.5.1. Impacts of Different Modules

Table 4 reports the performance of PicT under different module settings. Swin-S is the backbone of PicT and serves as the baseline. Patch Refiner and Patch Labeling Teacher denote PicT models that use only our proposed image-level classification branch and only our proposed patch-level classification branch, respectively. Patch Labeling Teacher + Swin-S denotes a model with two classification branches: our proposed patch-level classification branch and the original image head of Swin-S (the broad head). Patch Labeling Teacher + Patch Refiner is the final PicT model, which also has two classification branches: our proposed patch-level classification branch and our proposed image-level classification branch (the slim head). We find that both proposed modules improve on the baseline, whether used independently or in combination. These results verify our argument that further exploiting the discriminative information of patches via patch-level supervision, and highlighting the discriminative features of distressed patches, can help Swin-S better address the PDC issue.

#### 4.5.2. Discussion on Hyperparameters

There are two tunable hyperparameters in our model. One is the cluster number in Patch Refiner, which determines how many patches will be filtered out. The other is the threshold in the Relative Distress Threshold (RDT) of Patch Labeling Teacher, which determines how many high-risk patches should be preserved in a distressed pavement image. Figures 5 and 6 respectively plot the relationships between the performance of PicT and the values of these hyperparameters. We observe that PicT performs much better with a relatively small cluster number. We attribute this to the fact that a larger cluster number tends to cause excessive information loss in image-level classification, since only a single cluster of patches (roughly the reciprocal of the cluster number as a fraction of all patches) is preserved for the final classification. A similar phenomenon can be observed for the RDT threshold. A relatively small threshold often achieves better performance because the distressed area of a pavement image is often very small and most of the patches are actually non-distressed, even in a distressed image. According to Figures 5 and 6, the optimal cluster number is 2 and 3, while the optimal RDT threshold is 0.25 and 0.35, on the detection and recognition tasks respectively.
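To make the role of the cluster number concrete, the sketch below groups patches with a plain k-means and keeps only the highest-risk group. It is an illustration under stated assumptions, not the released implementation: it assumes each patch has a scalar distress-risk score, that clustering is run on those scores, and that the group with the largest centroid is the one selected. The names `kmeans_1d` and `select_high_risk_patches` are hypothetical.

```python
def kmeans_1d(values, k, iters=50):
    """Lloyd's k-means on scalars with deterministic quantile initialization."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    ordered = sorted(values)
    n = len(ordered)
    centroids = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    assign = None
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centroid.
        new_assign = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: recompute each non-empty cluster's mean.
        for c in range(k):
            members = [v for v, a in zip(values, new_assign) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
        if new_assign == assign:
            break
        assign = new_assign
    return assign, centroids


def select_high_risk_patches(risk_scores, k):
    """Return indices of patches in the cluster with the highest mean risk.

    Assumption: the 'highest distress-risk group' is the non-empty cluster
    whose centroid (mean risk score) is largest.
    """
    assign, centroids = kmeans_1d(risk_scores, k)
    non_empty = sorted(set(assign))
    best = max(non_empty, key=lambda c: centroids[c])
    return [i for i, a in enumerate(assign) if a == best]


if __name__ == "__main__":
    scores = [0.05, 0.10, 0.08, 0.12, 0.90, 0.85, 0.07, 0.11]
    kept = select_high_risk_patches(scores, k=2)
    print(kept, f"kept {len(kept)}/{len(scores)} patches")
```

With balanced clusters the kept fraction is about the reciprocal of the cluster number, which is why a large cluster number discards most of the patches before the final classification.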

### 4.6. Visualization Analysis

In this section, we attempt to understand the proposed approach through visualizations. All example images are from the CQU-BPDD test set.

#### 4.6.1. CAM Visualization

We leverage Grad-CAM (Selvaraju et al., 2017) to plot the Class Activation Maps (CAM) (Zhou et al., 2016) of the image features extracted by PicT and its backbone Swin-S in Figure 4. We see that Swin-S does not achieve the desired results on pavement images. On the one hand, Swin-S loses some discriminative information at the patch level, since patch-level supervised information has not been sufficiently mined, as the CAMs in the third and fifth columns show. On the other hand, its broad head makes the model more susceptible to interference from the complex pavement environment (see the first column) or dilutes the discriminative regions (see the fourth column). In contrast, with our proposed modules, PicT makes fuller use of patch-level information and shows stronger discriminating power than Swin-S.
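Grad-CAM weights each feature channel by the spatial average of the gradient of the class score with respect to that channel, sums the weighted channels, and applies a ReLU. The following is a minimal sketch with plain Python lists. It assumes the activations and gradients have already been extracted from the network and, for a Transformer, reshaped from tokens into a channel-by-height-by-width grid. The function name `grad_cam` and the final max-normalization are illustrative choices, not taken from the PicT code.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map from per-channel maps.

    activations, gradients: lists of C channel maps, each an H x W nested list.
    Returns an H x W map scaled to [0, 1] (all zeros if nothing is positive).
    """
    if len(activations) != len(gradients) or not activations:
        raise ValueError("activations and gradients must have the same channels")
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = global average pooling of the gradient map.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for w, channel in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += w * channel[i][j]
    # ReLU keeps only regions that positively support the class.
    cam = [[max(0.0, v) for v in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


if __name__ == "__main__":
    acts = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
    grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
    for row in grad_cam(acts, grads):
        print([round(v, 3) for v in row])
```

In practice the resulting map is upsampled to the input resolution and overlaid on the pavement image, as in the columns of Figure 4.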

#### 4.6.2. Token Visualization

In Figure 7, we visualize the classification results of the tokens produced by Patch Labeling Teacher to investigate whether each token can characterize a pixel patch of the pavement image. We observe that these tokens exhibit strong local discriminability despite the absence of any strong supervised information. We attribute this to the local representational ability of the visual tokens themselves and to the effectiveness of the weakly supervised training module, Patch Labeling Teacher. The token visualization results demonstrate that the weakly supervised Patch Labeling Teacher does label the patches as designed. They also explain how PicT profits from the patch-level classification branch to further improve the final classification performance.

## 5. Conclusion

In this work, we propose a slim weakly supervised vision Transformer (PicT) for pavement distress classification, achieving solid results on detection and recognition tasks. We show that PicT addresses the efficiency and modeling problems of previous approaches. Through quantitative and visualization analysis, we also show that PicT is more suitable for pavement distress classification tasks than the general vision Transformer, owing to its adequate and rational use of patch information.

###### Acknowledgements.
This work was supported in part by the National Natural Science Foundation of China under Grant 62176030, and the Natural Science Foundation of Chongqing under Grant cstc2021jcyj-msxmX0568.

## References

• H. Bao, L. Dong, and F. Wei (2021) Beit: bert pre-training of image transformers. arXiv preprint arXiv:2106.08254. Cited by: §2.2.
• Z. Chen, Y. Zhu, C. Zhao, G. Hu, W. Zeng, J. Wang, and M. Tang (2021) Dpt: deformable patch-based transformer for visual recognition. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 2899–2907. Cited by: §2.2.
• J. Chou, W. A. O’Neill, and H. Cheng (1994) Pavement distress classification using neural networks. In Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Vol. 1, pp. 397–401. Cited by: §1, §2.1.
• J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §2.2.
• T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez (1997) Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence 89 (1-2), pp. 31–71. Cited by: §1.
• H. Dong, K. Song, Y. Wang, Y. Yan, and P. Jiang (2021) Automatic inspection and evaluation system for pavement distress. IEEE Transactions on Intelligent Transportation Systems. Cited by: §1, §2.1.
• A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929. Cited by: §2.2, Table 1, Table 2.
• Z. Fan, Y. Wu, J. Lu, and W. Li (2018) Automatic pavement crack detection based on structured prediction with the convolutional neural network. arXiv preprint arXiv:1802.02208. Cited by: §1, §2.1.
• X. Feng, D. Song, Y. Chen, Z. Chen, J. Ni, and H. Chen (2021) Convolutional transformer based dual discriminator generative adversarial networks for video anomaly detection. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 5546–5554. Cited by: §2.2.
• W. Gao, F. Wan, X. Pan, Z. Peng, Q. Tian, Z. Han, B. Zhou, and Q. Ye (2021) TS-cam: token semantic coupled attention map for weakly supervised object localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2886–2895. Cited by: §1, §2.2.
• R. Girdhar, J. Carreira, C. Doersch, and A. Zisserman (2019) Video action transformer network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 244–253. Cited by: §1, §2.2.
• K. Gopalakrishnan, S. K. Khaitan, A. Choudhary, and A. Agrawal (2017) Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Construction and Building Materials 157, pp. 322–330. Cited by: §1, §2.1.
• J. A. Hartigan and M. A. Wong (1979) Algorithm as 136: a k-means clustering algorithm. Journal of the royal statistical society. series c (applied statistics) 28 (1), pp. 100–108. Cited by: §3.4.
• D. He, Y. Zhao, J. Luo, T. Hui, S. Huang, A. Zhang, and S. Liu (2021a) TransRefer3D: entity-and-relation aware transformer for fine-grained 3d visual grounding. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 2344–2352. Cited by: §2.2.
• J. He, J. Chen, S. Liu, A. Kortylewski, C. Yang, Y. Bai, C. Wang, and A. Yuille (2021b) Transfg: a transformer architecture for fine-grained recognition. arXiv preprint arXiv:2103.07976. Cited by: §2.2.
• K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick (2021c) Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377. Cited by: §1, §2.2.
• K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Table 1, Table 2.
• Y. Hu, X. Jin, Y. Zhang, H. Hong, J. Zhang, Y. He, and H. Xue (2021) RAMS-trans: recurrent attention multi-scale transformer for fine-grained image recognition. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 4239–4248. Cited by: §2.2.
• G. Huang, S. Huang, L. Huangfu, and D. Yang (2021) Weakly supervised patch label inference network with image pyramid for pavement diseases recognition in the wild. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7978–7982. Cited by: §1, §1, §2.1.
• S. Huang, W. Tang, G. Huang, L. Huangfu, and D. Yang (2022) Weakly supervised patch label inference networks for efficient pavement distress detection and recognition in the wild. arXiv preprint arXiv:2203.16782. Cited by: §1, §1, §2.1, §4.1.1, §4.1.3, Table 1, Table 2, Table 3.
• L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2018) Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304–2313. Cited by: §3.3.1.
• B. Li, K. C. Wang, A. Zhang, E. Yang, and G. Wang (2020) Automatic classification of pavement crack using deep convolutional neural network. International Journal of Pavement Engineering 21 (4), pp. 457–463. Cited by: §1, §2.1.
• J. Li, W. Wang, J. Chen, L. Niu, J. Si, C. Qian, and L. Zhang (2021) Video semantic segmentation via sparse temporal transformer. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 59–68. Cited by: §2.2.
• Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030. Cited by: §1, §2.2, §3.2, Table 1, Table 2, Table 3, Table 4.
• F. M. Nejad and H. Zakeri (2011) An expert system based on wavelet transform and radon neural network for pavement distress classification. Expert Systems with Applications 38 (6), pp. 7088–7101. Cited by: §1, §2.1.
• R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §4.6.1.
• Z. Shao, H. Bian, Y. Chen, Y. Wang, J. Zhang, X. Ji, et al. (2021) Transmil: transformer based correlated multiple instance learning for whole slide image classification. Advances in Neural Information Processing Systems 34. Cited by: §1, §2.2.
• O. Siméoni, G. Puy, H. V. Vo, S. Roburin, S. Gidaris, A. Bursuc, P. Pérez, R. Marlet, and J. Ponce (2021) Localizing objects with self-supervised transformers and no labels. arXiv preprint arXiv:2109.14279. Cited by: §1, §2.2.
• K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Table 1, Table 2.
• K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. D. Cubuk, A. Kurakin, and C. Li (2020) Fixmatch: simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems 33, pp. 596–608. Cited by: §3.2.
• C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: Table 1, Table 2.
• M. Tan and Q. V. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946. Cited by: Table 1, Table 2, Table 3.
• W. Tang, S. Huang, Q. Zhao, R. Li, and L. Huangfu (2021) An iteratively optimized patch label inference network for automatic pavement distress detection. IEEE Transactions on Intelligent Transportation Systems. Cited by: §1, §2.1, §4.1.2, §4.1.3, Table 1, Table 3.
• H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou (2021) Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. Cited by: Table 1, Table 2.
• A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §1, §2.2.
• C. Wang, A. Sha, and Z. Sun (2010) Pavement crack classification based on chain code. In 2010 Seventh international conference on fuzzy systems and knowledge discovery, Vol. 2, pp. 593–597. Cited by: §1, §2.1.
• E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo (2021a) SegFormer: simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems 34. Cited by: §1, §2.2.
• Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu (2021b) Simmim: a simple framework for masked image modeling. arXiv preprint arXiv:2111.09886. Cited by: §1, §2.2.
• M. Xu, Z. Zhang, H. Hu, J. Wang, L. Wang, F. Wei, X. Bai, and Z. Liu (2021) End-to-end semi-supervised object detection with soft teacher. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3060–3069. Cited by: §2.2, §3.3.2, §3.3.
• G. Zhang, P. Zhang, J. Qi, and H. Lu (2021a) Hat: hierarchical aggregation transformers for person re-identification. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 516–525. Cited by: §2.2.
• H. Zhang, Y. Hao, and C. Ngo (2021b) Token shift transformer for video classification. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 917–925. Cited by: §1, §2.2.
• Y. Zhang and J.P. Mohsen (2018) A project-based sustainability rating tool for pavement maintenance. Engineering 4 (2), pp. 200–208. Cited by: §1.
• Z. Zhao and Q. Liu (2021) Former-dfer: dynamic facial expression recognition transformer. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 1553–1561. Cited by: §2.2.
• B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: Figure 4, §4.6.1.
• J. Zhou, P. Huang, and F. Chiang (2005) Wavelet-based pavement distress classification. Transportation research record 1940 (1), pp. 89–98. Cited by: §1, §2.1.