
Iterative Self Knowledge Distillation – From Pothole Classification to Fine-Grained and COVID Recognition

by Kuan-Chuan Peng, et al.

Pothole classification has become an important task for road inspection vehicles to save drivers from potential car accidents and repair bills. Given the limited computational power and fixed number of training epochs, we propose iterative self knowledge distillation (ISKD) to train lightweight pothole classifiers. Designed to improve both the teacher and student models over time in knowledge distillation, ISKD outperforms the state-of-the-art self knowledge distillation method on three pothole classification datasets across four lightweight network architectures, which supports that self knowledge distillation should be done iteratively instead of just once. The accuracy relation between the teacher and student models shows that the student model can still benefit from a moderately trained teacher model. Implying that better teacher models generally produce better student models, our results justify the design of ISKD. In addition to pothole classification, we also demonstrate the efficacy of ISKD on six additional datasets associated with generic classification, fine-grained classification, and medical imaging application, which supports that ISKD can serve as a general-purpose performance booster without the need of a given teacher model and extra trainable parameters.



1 Introduction

Detecting potholes is essential for the municipalities and road authorities to repair defective roads. The vehicle repair bills related to pothole damage have cost U.S. drivers $3 billion annually on average [1]. Due to the cost constraint of the edge devices which the pothole classifiers run on, the edge devices installed on the road inspection vehicles may only have limited computational power (e.g., no GPU). In such scenarios, lightweight models are needed if real-time inference speed is required. Motivated by this application and deployment time constraint of pothole classifiers, we focus on the following problem: Given a fixed number of training epochs and a lightweight model to be trained, what can practitioners do to improve the pothole classification accuracy?

Given the problem, we explore self knowledge distillation (KD) [2] to tackle pothole classification. By self KD, we refer to the KD methods which need no teacher model in advance and introduce no extra trainable parameters. Showing that KD is actually a learned label smoothing regularization, Yuan et al. [2] propose Tf-KD, a teacher-free KD method that uses the pre-trained student model itself as the teacher model. Inspired by [2] and the assumption that better teacher models result in better student models, we propose iterative self knowledge distillation (ISKD), which iteratively performs self KD by using the pre-trained student model as the teacher model.

Most KD methods [3, 4, 5, 6, 7] require that the teacher model is available in advance, which is not always true. Even though there are KD methods which need no teacher model [2, 8, 9], these methods typically do not experiment on lightweight models or perform KD multiple times. Utilizing KD iteratively and requiring no teacher model, our proposed ISKD shows its efficacy on lightweight models for pothole classification, generic classification, fine-grained classification, and medical imaging application. Although there exist methods utilizing iterative KD [10, 11, 12], they typically require additional constraints and are validated on only a few datasets. For example, Koutini’s method [10] requires training multiple models and selecting the best trained model for each class to predict pseudo labels for sound event detection. In contrast, our proposed ISKD only needs to train one model at a time, with no need to select models or predict pseudo labels. In [11, 12], the teacher model is trained until convergence for each KD iteration, and the methods are validated on only a few datasets. In addition, the accuracy gain of [11] comes from the ensemble of all the students in the history, which needs more deployment space at testing time. In contrast, we show ISKD’s efficacy on a wide variety of datasets even when the teacher model is only moderately trained, and ISKD does not rely on an ensemble, which is more practical for embedded devices.

To the best of our knowledge, we are the first in pothole classification to use iterative self knowledge distillation when training lightweight neural networks under limited training epochs. We make the following contributions:

(1) We propose iterative self knowledge distillation (ISKD), which outperforms the state-of-the-art self KD method Tf-KD [2] on the road damage dataset [13], the Nienaber potholes simplex [14] and complex [15] datasets, CIFAR-10 [16], CIFAR-100 [16], Oxford 102 Flower [17], Oxford-IIIT Pet Dataset [18], Caltech-UCSD Birds 200 [19], and COVID-19 Radiography [20] datasets, which supports the wide applicability of ISKD from pothole classification to generic, fine-grained, and medical imaging classification.
(2) We provide more evidence showing that even using a teacher model with accuracy lower than the baseline accuracy from a classifier trained with a larger number of epochs, the student model can still possibly outperform the baseline.
(3) ISKD can outperform the baseline under a wide range of values of the weight balancing the objectives of ISKD, which supports that ISKD is flexible with respect to parameter selection.

2 Iterative self knowledge distillation

Figure 1: Our proposed iterative self knowledge distillation.

Inspired by Yuan et al. [2], we propose iterative self knowledge distillation (ISKD) such that both teacher and student models can improve over time. We illustrate ISKD in Fig. 1, where we denote the teacher/student model in the $i$-th KD iteration $KD_i$ as $T_i$/$S_i$. During $KD_0$, since $T_0$ is not given in advance, we train $S_0$ using the softmax cross-entropy loss $L_{CE}$ as the classification loss. During $KD_i$ ($i \geq 1$), we use the trained student model in $KD_{i-1}$ (i.e., $S_{i-1}$) as $T_i$, and train $S_i$ with both $L_{CE}$ and the Kullback-Leibler (KL) divergence loss. Specifically, the total loss function $L_i$ to train $S_i$ during $KD_i$ can be written as $L_i = L_{CE} + \lambda_i D_{KL}(p_{T_i}, p_{S_i})$, where $D_{KL}$ is the KL divergence, $p_{T_i}$/$p_{S_i}$ is the output probability distribution of $T_i$/$S_i$, and $\lambda_i$ is the weight of $D_{KL}$.

During $KD_i$, we freeze the parameters of $T_i$ and only train $S_i$. We pre-train $S_i$ from ImageNet [21], not from $S_{i-1}$, because we hope to decrease the chance that $S_i$ is trapped in a possibly local optimum associated with $S_{i-1}$. ISKD stops at $KD_i$ if $S_i$ shows no obvious accuracy gain over $S_{i-1}$. Since Yuan et al. [2] show that the teacher model need not outperform the student model to benefit it, we directly use the previously trained student model as the current teacher model, waiving the typical KD requirement that the teacher model is needed in advance. We expect that using ISKD improves both $T_i$ and $S_i$ when $i$ increases, under the assumption that using better teacher models in KD generally results in better student models.
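The procedure described in this section can be sketched in a few lines of plain Python. This is a minimal, per-sample illustration rather than the authors' implementation: the function names (`iskd_loss`, `run_iskd`), the `train_student` callable, and the `min_gain` and `max_iterations` parameters are hypothetical, the KL divergence is assumed to be taken from the teacher distribution to the student distribution, and no softmax temperature is modeled.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over the logits of one sample."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q); terms with p_k == 0 contribute nothing."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0.0)


def iskd_loss(
    student_logits: Sequence[float],
    teacher_logits: Optional[Sequence[float]],
    label: int,
    weight: float,
) -> float:
    """Cross-entropy plus a weighted KL term (assumed direction: teacher -> student).

    With no teacher (the first stage), only the cross-entropy is used.
    """
    p_student = softmax(student_logits)
    ce = -math.log(p_student[label])
    if teacher_logits is None:
        return ce
    p_teacher = softmax(teacher_logits)
    return ce + weight * kl_divergence(p_teacher, p_student)


def run_iskd(
    train_student: Callable[[Optional[object]], Tuple[object, float]],
    max_iterations: int = 5,
    min_gain: float = 0.0,
) -> Tuple[Optional[object], List[float]]:
    """Iterate self KD until the new student shows no gain above `min_gain`.

    `train_student(teacher)` is a hypothetical callable that trains a fresh
    student (re-initialized from pre-trained weights, not from the previous
    student) against a frozen teacher and returns (student, accuracy).
    """
    teacher: Optional[object] = None
    history: List[float] = []
    for _ in range(max_iterations + 1):
        student, accuracy = train_student(teacher)
        if history and accuracy - history[-1] <= min_gain:
            break  # no obvious gain: keep the previous student
        history.append(accuracy)
        teacher = student  # the trained student becomes the next teacher
    return teacher, history
```

The threshold for "no obvious accuracy gain" is not specified in the text, so `min_gain` is left as a free parameter here.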

3 Experimental setup

We experiment on road damage dataset (termed as RDD) [13], Nienaber potholes simplex (termed as simplex) [14] and complex (termed as complex) [15] datasets, CIFAR-10 [16], CIFAR-100 [16], Oxford 102 Flower dataset (termed as Oxford-102) [17], Oxford-IIIT Pet dataset (termed as Oxford-37) [18], Caltech-UCSD Birds 200 dataset (termed as CUB-200) [19], and COVID-19 Radiography dataset (termed as COVID) [20]. We choose these datasets to cover a diverse range of task domains from pothole, generic, fine-grained to medical imaging classification. The RDD, simplex, and complex datasets provide annotations of whether each image contains any pothole or not. The Oxford-102, Oxford-37, and CUB-200 datasets provide the images and labels of 102, 37, and 200 different species of flowers, cats and dogs, and birds, respectively. The COVID dataset includes 4 different types of chest x-rays: normal, COVID, lung opacity, and viral pneumonia. For all the datasets except COVID, we use the official training/testing split of each dataset. Since the COVID dataset does not provide the official training/testing split, we randomly generate the split using the ratio of 7:3.
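As an illustration of the 7:3 random split used for the COVID dataset, the following sketch shuffles a list of samples and cuts it at 70%. The function name, the `seed` parameter, and the choice of an unstratified split are assumptions made for this example; the text does not state the seed or whether the split is stratified by class.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(
    items: Sequence[T], train_ratio: float = 0.7, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Randomly split `items` into training and testing subsets.

    The split is not stratified by class (an assumption of this sketch).
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical usage with placeholder file names.
train_files, test_files = split_train_test([f"xray_{i}.png" for i in range(10)])
```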

We use the official PyTorch [22] implementation of ResNet-18 [23], SqueezeNet v1.1 [24], and ShuffleNet v2 x0.5 & x1.0 [25], modify their last layers such that the number of output nodes of the last layer equals the number of classes, and pre-train them from ImageNet [21]. The four network architectures are selected based on the following criteria: (1) For ease of reproducibility, they are officially supported by PyTorch [22], which provides their weights pre-trained from ImageNet [21]. (2) Considering the typically limited computational power of edge devices, we limit the number of network parameters to be fewer than 12M.

We first experiment on the four network architectures mentioned previously using the RDD [13], simplex [14], and complex [15] datasets. For each KD iteration, we use the same network architecture for both teacher and student models. We compare ISKD with the following two baselines with the same network architecture, total number of training epochs, and learning schedule: (1) training a classifier using only the softmax cross-entropy loss without KD (termed the large-epoch baseline), and (2) Tf-KD [2], which only performs KD once without multiple KD iterations. For the extended study involving the other six datasets unrelated to potholes, we use ResNet-18 [23] and ShuffleNet v2 x1.0 [25] as the network architectures of ISKD. We conduct the extended study in the same way as the pothole classification task, using the same two baselines.

| dataset | model | no KD | KD iter. 1 | KD iter. 2 | KD iter. 3 | KD iter. 4 | KD iter. 5 | ISKD (last) | large-epoch baseline | Tf-KD [2] |
|---|---|---|---|---|---|---|---|---|---|---|
| RDD [13] | ResNet-18 [23] | 91.54 | 92.71 | 92.99 | 93.04 | 93.08 | n/a | 93.08 | 92.24 | 92.34 |
| | SqueezeNet v1.1 [24] | 89.67 | 89.91 | 90.28 | 90.51 | 90.70 | 90.70 | 90.70 | 90.47 | 90.28 |
| | ShuffleNet v2 x0.5 [25] | 90.05 | 90.14 | 90.98 | 91.40 | 91.40 | n/a | 91.40 | 92.66 | 91.22 |
| | ShuffleNet v2 x1.0 [25] | 92.01 | 92.15 | 92.66 | 93.22 | 93.22 | n/a | 93.22 | 93.13 | 93.13 |
| simplex [14] | ResNet-18 [23] | 81.85 | 90.46 | 93.69 | 96.92 | 98.62 | 99.08 | 99.08 | 83.38 | 92.00 |
| | SqueezeNet v1.1 [24] | 81.85 | 86.77 | 87.69 | 88.15 | 88.15 | n/a | 88.15 | 84.62 | 87.69 |
| | ShuffleNet v2 x0.5 [25] | 90.00 | 95.69 | 99.38 | 100.00 | n/a | n/a | 100.00 | 92.46 | 97.54 |
| | ShuffleNet v2 x1.0 [25] | 90.31 | 96.31 | 98.77 | 100.00 | n/a | n/a | 100.00 | 93.38 | 97.54 |
| complex [15] | ResNet-18 [23] | 61.92 | 74.14 | 77.65 | 79.80 | 83.77 | 84.93 | 84.93 | 62.58 | 82.95 |
| | SqueezeNet v1.1 [24] | 59.27 | 70.70 | 76.16 | 76.16 | n/a | n/a | 76.16 | 62.25 | 71.03 |
| | ShuffleNet v2 x0.5 [25] | 56.79 | 78.97 | 88.58 | 88.91 | n/a | n/a | 88.91 | 65.89 | 79.80 |
| | ShuffleNet v2 x1.0 [25] | 60.43 | 78.31 | 86.59 | 86.42 | n/a | n/a | 86.42 | 71.85 | 82.45 |
| CIFAR-10 [26] | ResNet-18 [23] | 90.75 | 91.81 | 92.05 | 92.07 | n/a | n/a | 92.07 | 91.53 | 92.01 |
| | ShuffleNet v2 x1.0 [25] | 82.63 | 85.16 | 88.45 | 90.30 | 91.03 | 91.58 | 91.58 | 90.68 | 85.55 |
| CIFAR-100 [26] | ResNet-18 [23] | 80.15 | 81.05 | 81.64 | 82.17 | 82.30 | 82.67 | 82.67 | 81.34 | 81.62 |
| | ShuffleNet v2 x1.0 [25] | 58.95 | 65.60 | 72.16 | 75.59 | 77.27 | 77.85 | 77.85 | 77.61 | 65.78 |
| Oxford-102 [17] | ResNet-18 [23] | 96.58 | 97.31 | 97.68 | 97.80 | n/a | n/a | 97.80 | 96.94 | 97.43 |
| | ShuffleNet v2 x1.0 [25] | 94.74 | 97.19 | 98.17 | 98.41 | n/a | n/a | 98.41 | 98.41 | 97.19 |
| Oxford-37 [18] | ResNet-18 [23] | 90.57 | 90.98 | 91.33 | 91.80 | 91.58 | n/a | 91.80 | 91.20 | 91.31 |
| | ShuffleNet v2 x1.0 [25] | 79.69 | 84.30 | 85.96 | 86.59 | 86.59 | n/a | 86.59 | 86.10 | 84.46 |
| CUB-200 [19] | ResNet-18 [23] | 42.07 | 46.49 | 48.66 | 49.82 | 49.49 | n/a | 49.82 | 46.03 | 47.58 |
| | ShuffleNet v2 x1.0 [25] | 41.54 | 45.40 | 48.53 | 49.69 | 50.28 | 49.95 | 50.28 | 48.99 | 46.95 |
| COVID [20] | ResNet-18 [23] | 94.11 | 94.68 | 94.80 | 95.01 | n/a | n/a | 95.01 | 94.79 | 94.88 |
| | ShuffleNet v2 x1.0 [25] | 90.76 | 92.43 | 93.07 | 93.81 | 93.98 | n/a | 93.98 | 92.72 | 93.10 |

Table 1: Comparing the classification accuracy (%) of the iterative self knowledge distillation (KD) method versus the baselines. Each accuracy is associated with the number of epochs for which the student model is trained; the student model is trained for 50 epochs in each KD iteration.

For all the experiments, the training images are resized to 224×224, and the model is pre-trained from ImageNet [21] and fine-tuned with the training data of each dataset. We use the SGD optimizer with momentum 0.9, weight decay 5e-4, and batch size 128 to train the student model for 50 epochs in each KD iteration. The learning rate is fixed within each KD iteration but is model-specific: for ResNet-18 [23] and SqueezeNet v1.1 [24], we use the initial learning rate 0.001, but for ShuffleNet v2 x0.5 & x1.0 [25], we use the initial learning rate 0.1. Following Tf-KD [2], we obtain the values of the KD loss weight by grid search on validation data sampled from the training set when experimenting on the RDD dataset [13]. Once we determine the weight values from the RDD dataset, we fix them and use the same set of values when experimenting on the other datasets (i.e., the weight values are tuned only on the RDD dataset). We purposely do so to test whether the values searched from one dataset can be transferred and directly applied to other datasets. For all the other parameters, we use the default PyTorch [22] setting unless otherwise specified.
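The training settings listed above can be collected into a small configuration object, as in the sketch below. The class and function names and the dictionary keys are hypothetical identifiers chosen for this example; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Per-KD-iteration training settings stated in the experimental setup."""

    image_size: int = 224              # training images are resized to 224x224
    momentum: float = 0.9              # SGD momentum
    weight_decay: float = 5e-4
    batch_size: int = 128
    epochs_per_kd_iteration: int = 50
    learning_rate: float = 0.001       # overridden per model below


# Model-specific initial learning rates (keys are illustrative names).
LEARNING_RATES = {
    "resnet18": 0.001,
    "squeezenet1_1": 0.001,
    "shufflenet_v2_x0_5": 0.1,
    "shufflenet_v2_x1_0": 0.1,
}


def config_for(model_name: str) -> TrainConfig:
    """Return the training configuration for one of the four architectures."""
    return TrainConfig(learning_rate=LEARNING_RATES[model_name])
```

Keeping the configuration immutable (`frozen=True`) makes it harder to accidentally change settings between KD iterations, which matters here because the comparison with the baselines relies on an identical learning schedule.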

| dataset | ISKD | prior work (backbone) |
|---|---|---|
| CIFAR-100 [26] | 82.67 | 81.60 [27] (Wide-ResNet-28-10) |
| Oxford-102 [17] | 97.80 | 91.10 [28] (ResNet152-SAM) |
| Oxford-37 [18] | 91.80 | 91.60 [28] (ResNet50-SAM) |

Table 2: The comparison of classification accuracy (%) between ISKD (backbone: ResNet-18 [23]) and the prior works using backbones with more parameters.

4 Experimental result

The experimental results are summarized in Table 1, where each accuracy column corresponds to one experiment. Each reported accuracy is associated with the number of epochs for which the student model is trained. The first six accuracy columns list the performance of the student model in each KD iteration, starting with the student trained without KD ($S_0$), and the seventh accuracy column summarizes the last performance after ISKD. There is no teacher model for $S_0$ and the large-epoch baseline, but for the baseline using Tf-KD [2], the teacher model is the trained $S_0$. All the student models are pre-trained from ImageNet [21] for ISKD and the two baselines.

In Table 1, the accuracy of the student model after the first KD iteration is higher than that of the student model trained without KD, which shows that self KD can improve the student model’s accuracy. In most of the later KD iterations, the accuracy of the student model is higher than that of the student model from the previous KD iteration, which validates the assumption that in self KD, better teacher models result in better student models and that self KD can be done iteratively instead of just once. Comparing the last ISKD performance with the large-epoch baseline and Tf-KD [2], we show that given a fixed number of training epochs, ISKD outperforms both the large-epoch baseline and the state-of-the-art self KD method Tf-KD [2] in most cases. The fact that ISKD outperforms Tf-KD also serves as an ablation study supporting that self KD is better done iteratively than just once.
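The "in most cases" claim can be checked by counting rows of Table 1. The sketch below copies the last three accuracy columns of Table 1, read here as the last ISKD performance, the large-epoch baseline, and Tf-KD (this column reading is an interpretation of the flattened table), and counts the rows where ISKD is strictly better. The variable and function names are illustrative.

```python
from typing import List, Tuple

# (dataset, model, last ISKD, large-epoch baseline, Tf-KD), copied from Table 1.
ROWS: List[Tuple[str, str, float, float, float]] = [
    ("RDD", "ResNet-18", 93.08, 92.24, 92.34),
    ("RDD", "SqueezeNet v1.1", 90.70, 90.47, 90.28),
    ("RDD", "ShuffleNet v2 x0.5", 91.40, 92.66, 91.22),
    ("RDD", "ShuffleNet v2 x1.0", 93.22, 93.13, 93.13),
    ("simplex", "ResNet-18", 99.08, 83.38, 92.00),
    ("simplex", "SqueezeNet v1.1", 88.15, 84.62, 87.69),
    ("simplex", "ShuffleNet v2 x0.5", 100.00, 92.46, 97.54),
    ("simplex", "ShuffleNet v2 x1.0", 100.00, 93.38, 97.54),
    ("complex", "ResNet-18", 84.93, 62.58, 82.95),
    ("complex", "SqueezeNet v1.1", 76.16, 62.25, 71.03),
    ("complex", "ShuffleNet v2 x0.5", 88.91, 65.89, 79.80),
    ("complex", "ShuffleNet v2 x1.0", 86.42, 71.85, 82.45),
    ("CIFAR-10", "ResNet-18", 92.07, 91.53, 92.01),
    ("CIFAR-10", "ShuffleNet v2 x1.0", 91.58, 90.68, 85.55),
    ("CIFAR-100", "ResNet-18", 82.67, 81.34, 81.62),
    ("CIFAR-100", "ShuffleNet v2 x1.0", 77.85, 77.61, 65.78),
    ("Oxford-102", "ResNet-18", 97.80, 96.94, 97.43),
    ("Oxford-102", "ShuffleNet v2 x1.0", 98.41, 98.41, 97.19),
    ("Oxford-37", "ResNet-18", 91.80, 91.20, 91.31),
    ("Oxford-37", "ShuffleNet v2 x1.0", 86.59, 86.10, 84.46),
    ("CUB-200", "ResNet-18", 49.82, 46.03, 47.58),
    ("CUB-200", "ShuffleNet v2 x1.0", 50.28, 48.99, 46.95),
    ("COVID", "ResNet-18", 95.01, 94.79, 94.88),
    ("COVID", "ShuffleNet v2 x1.0", 93.98, 92.72, 93.10),
]


def count_wins(rows: List[Tuple[str, str, float, float, float]], baseline_index: int) -> int:
    """Count rows where the last ISKD accuracy strictly exceeds the chosen baseline."""
    return sum(1 for row in rows if row[2] > row[baseline_index])


wins_vs_large_epoch = count_wins(ROWS, 3)  # 22 of 24 rows
wins_vs_tf_kd = count_wins(ROWS, 4)        # 24 of 24 rows
```

The two rows where ISKD does not strictly beat the large-epoch baseline are ShuffleNet v2 x0.5 on RDD (91.40 vs. 92.66) and ShuffleNet v2 x1.0 on Oxford-102 (a tie at 98.41).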

Since Table 1 covers diverse task domains, including pothole, generic, fine-grained, and medical imaging classification, our results support that ISKD can serve as a general-purpose performance booster. In addition, we obtain the results in Table 1 by directly using the KD loss weight values chosen for pothole classification without tuning them for each dataset, which supports that the values we use are transferable across different datasets. We also compare the accuracy of ISKD (backbone: ResNet-18 [23]) with prior methods which use backbones with more parameters in Table 2, where ISKD performs on par with or even outperforms the listed methods. This finding supports that ISKD is more parameter efficient than the listed methods.

Furthermore, we analyze how poorly the teacher model in self KD can perform while still making the student model outperform the baseline with no KD (the no-KD column in Table 1). To gain more insight, we design the following experiment to find the accuracy relation between the teacher and student models. Given that the no-KD student model in Table 1 is trained for 50 epochs, we use the 50 models saved after each epoch is completed as the teacher models. We repeat the first KD iteration in Table 1 50 times (each time with one of the 50 teacher models produced during the no-KD training), and record the teacher-student accuracy relation. We perform this experiment on the simplex [14] and complex [15] datasets using ResNet-18 [23] as the network architecture.

Figure 2: The teacher-student accuracy relation on the simplex [14] and complex [15] datasets using ResNet-18 [23]. The gray lines are obtained from linear regression, and the line equation and Pearson’s correlation coefficient are marked in the legend. Given the baseline performance in the no-KD column of Table 1, the shaded blue areas are the areas where the teacher model performs worse than the baseline but the student model outperforms the baseline.

We show the teacher-student accuracy relation in Fig. 2. Most of the data points are above the red line (where the student’s accuracy equals the teacher’s), which supports that performing self KD can make the student model outperform the teacher model. Fig. 2 suggests that the accuracies of the teacher and student models have a strong positive correlation (the slopes of the gray lines are positive; see the Pearson’s correlation coefficients marked in the legend), which again supports the assumption that using better teacher models in KD generally results in better student models. We find that the number of data points falling into the shaded blue areas is not negligible, which serves as statistically more significant evidence than [2] supporting that even if the teacher model is worse than the baseline, it is still likely that the student model can outperform the baseline after KD.
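The quantities discussed for Fig. 2 (regression line, Pearson's correlation coefficient, and the number of points in the shaded area) can be computed as in the following sketch. The function names are hypothetical, and the accuracy values in the usage lines are made-up placeholders, not results from the paper.

```python
import math
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient between xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def count_shaded(
    teacher_acc: Sequence[float], student_acc: Sequence[float], baseline: float
) -> int:
    """Count points where the teacher is below the baseline but the student is above it."""
    return sum(1 for t, s in zip(teacher_acc, student_acc) if t < baseline and s > baseline)


# Placeholder accuracies (%) for four hypothetical teacher checkpoints.
teacher = [60.0, 70.0, 80.0, 90.0]
student = [65.0, 75.0, 85.0, 95.0]
slope, intercept = fit_line(teacher, student)
r = pearson_r(teacher, student)
shaded = count_shaded(teacher, student, baseline=82.0)
```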

Another experiment which is also not presented in the paper of Yuan et al. [2] is the impact of the KD loss weight values on the accuracy. To study this, we use ResNet-18 [23] as the network architecture of ISKD with the same random seed 1, repeat three KD iterations on the CUB-200 [19] dataset with different weight values, and report the student’s accuracy in Fig. 3, where the red lines mark the baseline accuracy (i.e., the student’s accuracy in the previous KD iteration reported in Table 1). For each sub-figure of Fig. 3, the teacher model is fixed as the corresponding one used in Table 1. The triangular points in Fig. 3 are the accuracies reported in Table 1 using the weight values chosen for pothole classification, so their corresponding accuracy is not necessarily the best. Fig. 3 shows that the student model outperforms the baseline in each KD iteration of ISKD under a wide range of weight values, which supports that ISKD is flexible in terms of parameter selection.

Figure 3: The accuracy of the student model using ResNet-18 under different KD loss weight values when we repeat three KD iterations on the CUB-200 dataset. The triangular red points are the accuracies we report in Table 1 by using the weight values chosen for pothole classification. Shown as the red lines, the baseline for each KD iteration is the accuracy of the student model in the previous KD iteration reported in Table 1.

5 Conclusion

We propose iterative self knowledge distillation (ISKD) to improve pothole classification accuracy when training lightweight models given a fixed number of training epochs. Experimenting on three pothole classification datasets and six other datasets associated with generic classification, fine-grained classification, and medical imaging application, we show that ISKD outperforms the state-of-the-art self KD method Tf-KD in most cases given the same number of training epochs, and that ISKD has wide applicability to various tasks without the need for a given teacher model or extra trainable parameters. In addition, we show more evidence supporting that the performance of the student model can benefit from self KD even when the pre-trained student model (which serves as the teacher model) is only moderately trained. Our study on the impact of the weight balancing the objectives of ISKD shows that even if we choose weights deviating, within a reasonable range, from the weights we initially use, the student model can still improve over KD iterations, which supports that ISKD is flexible in terms of parameter selection.


  • [1] “American Automobile Association (AAA) pothole fact sheet,” 2016.
  • [2] L. Yuan, F. EH Tay, G. Li, T. Wang, and J. Feng, “Revisiting knowledge distillation via label smoothing regularization,” in CVPR, June 2020.
  • [3] N. Passalis, M. Tzelepi, and A. Tefas, “Heterogeneous knowledge distillation using information flow modeling,” in CVPR, June 2020.
  • [4] L. Zhao, X. Peng, Y. Chen, M. Kapadia, and D. N. Metaxas, “Knowledge as priors: Cross-modal knowledge generalization for datasets without superior knowledge,” in CVPR, June 2020.
  • [5] G. Xu, Z. Liu, X. Li, and C. C. Loy, “Knowledge distillation meets self-supervision,” in ECCV, August 2020.
  • [6] B. Peng, X. Jin, J. Liu, D. Li, Y. Wu, Y. Liu, S. Zhou, and Z. Zhang, “Correlation congruence for knowledge distillation,” in ICCV, October 2019.
  • [7] X. Jin, B. Peng, Y. Wu, Y. Liu, J. Liu, D. Liang, J. Yan, and X. Hu, “Knowledge distillation via route constrained optimization,” in ICCV, October 2019.
  • [8] Q. Guo, X. Wang, Y. Wu, Z. Yu, D. Liang, X. Hu, and P. Luo, “Online knowledge distillation via collaborative learning,” in CVPR, June 2020.
  • [9] S. Yun, J. Park, K. Lee, and J. Shin, “Regularizing class-wise predictions via self-knowledge distillation,” in CVPR, June 2020.
  • [10] K. Koutini, H. Eghbal-Zadeh, and G. Widmer, “Iterative knowledge distillation in R-CNNs for weakly-labeled semi-supervised sound event detection,” in Detection and Classification of Acoustic Scenes and Events, 2018.
  • [11] T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar, “Born again neural networks,” in ICML, 2018.
  • [12] L. Zhang, D. Du, C. Li, Y. Wu, and T. Luo, “Iterative knowledge distillation for automatic check-out,” IEEE Transactions on Multimedia, 2020.
  • [13] H. Maeda, Y. Sekimoto, T. Seto, T. Kashiyama, and H. Omata, “Road damage detection and classification using deep neural networks with smartphone images,” Computer-Aided Civil and Infrastructure Engineering, vol. 33, 06 2018.
  • [14] “Kaggle dataset: Nienaber potholes 1 simplex.”
  • [15] “Kaggle dataset: Nienaber potholes 2 complex.”
  • [16] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Technical report, 2009.
  • [17] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Indian Conference on Computer Vision, Graphics and Image Processing, 2008.
  • [18] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, “Cats and dogs,” in CVPR, 2012.
  • [19] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona, “Caltech-UCSD Birds 200,” Tech. Rep. CNS-TR-2010-001, California Institute of Technology, 2010.
  • [20] “COVID-19 radiography database.”
  • [21] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “ImageNet large scale visual recognition challenge,” IJCV, vol. 115, no. 3, pp. 211–252, 2015.
  • [22] “PyTorch.”
  • [23] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  • [24] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and 0.5mb model size,” arXiv preprint arXiv:1602.07360, 2016.
  • [25] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “ShuffleNet V2: Practical guidelines for efficient CNN architecture design,” in ECCV, September 2018.
  • [26] A. Krizhevsky, “Learning multiple layers of features from tiny images,” 2009.
  • [27] I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, “Attention augmented convolutional networks,” in ICCV, October 2019.
  • [28] X. Chen, C.-J. Hsieh, and B. Gong, “When vision transformers outperform ResNets without pretraining or strong data augmentations,” arXiv preprint arXiv:2106.01548, 2021.