Data augmentation, a widely used technique, also benefits knowledge distillation. For example, data augmentation has been used to improve the generalization ability of knowledge distillation, and Mixup, a widely applied data augmentation technique, has been used to improve its efficiency. In this paper, we focus on mixture-based data augmentation (e.g., Mixup and CutMix), arguably one of the most widely used types of augmentation techniques.
Intuitively, we expect soft labels and hard labels to be order-concordant. In Fig. 2, the hard label of an augmented sample is distributed over its two original labels, and we expect the soft labels to be monotonic w.r.t. the hard labels: a label with a larger hard-label weight should also receive a larger soft-label probability.
However, we found critical order violations between hard labels and soft labels in real datasets and teacher models. To verify this, we plot Kendall's rank correlation coefficient between the soft labels and the hard labels for different teacher models and data augmentation techniques on CIFAR-100 in Fig. 0(a). We only count label pairs whose orders are known. In other words, we ignore the order between two “other” labels, since we do not know it. A clear phenomenon is that, although the hard labels and soft labels are almost completely concordant for original samples, they are likely to be discordant for augmented samples. More surprisingly, Fig. 0(b) shows that for a proportion of augmented samples, none of the original labels is among the top 2 soft labels. We attribute this to the insufficient generalization ability of the teacher, which leads to prediction errors on augmented samples. We will show in Sec 5.3 that the order violations harm knowledge distillation. As far as we know, order violations between hard labels and soft labels have not been studied in previous work.
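The concordance measure described above can be sketched as follows. This is a minimal illustration rather than the evaluation code behind the figure: the function name `known_pair_kendall` is hypothetical, pairs whose hard labels tie (two “other” labels) are skipped as unknown, and ties in the soft labels count only toward the denominator, which is one of several possible conventions.

```python
def known_pair_kendall(soft, hard):
    """Kendall-style coefficient over label pairs whose hard-label order is known."""
    concordant = discordant = known = 0
    for i in range(len(soft)):
        for j in range(i + 1, len(soft)):
            dh = hard[i] - hard[j]
            if dh == 0:
                continue  # order unknown (e.g. two "other" labels): ignore the pair
            known += 1
            ds = soft[i] - soft[j]
            if dh * ds > 0:
                concordant += 1
            elif dh * ds < 0:
                discordant += 1
    return (concordant - discordant) / known if known else 0.0


# Hypothetical augmented sample: 70% of label 0, 30% of label 1, two "other" labels.
hard = [0.7, 0.3, 0.0, 0.0]
print(known_pair_kendall([0.5, 0.3, 0.1, 0.1], hard))  # fully concordant soft labels
print(known_pair_kendall([0.2, 0.1, 0.6, 0.1], hard))  # an "other" label dominates
```

A value of 1 means every known pair is ordered consistently, and lower values indicate more order violations.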
A natural direction to tackle the problem is to reduce the order violations in soft labels. To this end, we leverage isotonic regression (IR) – a classic technique from statistics – to introduce order restrictions into the soft labels. IR minimizes the distance from given nodes to a set defined by some order constraints. In Fig. 2, by applying order restrictions to soft labels via IR, we obtain concordant soft labels while keeping as much of the original soft label information as possible. IR has numerous important applications in statistics, operations research, and signal processing. To our knowledge, we are the first to introduce IR into knowledge distillation.
2 Related Work
Knowledge Distillation with Erroneous Soft Labels. In recent years, knowledge distillation, as a model compression and knowledge transfer technique, has attracted extensive research interest. Since the teacher model is not optimal, how to deal with errors in the soft labels has become an important issue. Traditional methods address this problem by optimizing either the teacher model or the student model.
For teacher optimization, it has been found that a larger network is not necessarily a better teacher, because student models may not be able to imitate a large network. Early stopping was therefore proposed for the teacher, so that large networks behave more like small networks, which are easier to imitate. Another important idea for teacher model optimization is “strictness”, which refers to tolerating the teacher’s probability distribution outside of hard labels.
Student-side methods mainly modify the distillation loss function. One approach assigns different temperatures to different samples based on their deceptiveness to teacher models. Another requires the label correlation represented by the student to be consistent with that of the teacher model, and uses residual labels to add this goal to the loss function.
However, none of these studies reveal the phenomenon of rank violations in data augmented knowledge distillation.
Data Mixing is a typical data augmentation method. Mixup first randomly combines a pair of samples via a weighted sum of their data and labels. More recent variants include CutMix and RICAP. The main difference among mixing methods is how they mix the data.
The difference between our isotonic data augmentation and the conventional data augmentation is that we focus on relieving the error transfer of soft labels in knowledge distillation by introducing order restrictions. Therefore, we pay attention to the order restrictions of the soft labels, instead of directly using the mixed data as data augmentation. We verified in the experiment section that our isotonic data augmentation is more effective than directly using mixed data for knowledge distillation.
3 Data Augmentation for Knowledge Distillation
3.1 Standard Knowledge Distillation
In this paper, we consider knowledge distillation for multi-class classification. We define the teacher model as a mapping from the feature space to scores over the label space, and the student model analogously. The final classification probabilities of the two models are computed by applying softmax to their outputs. The training dataset consists of samples paired with one-hot encoded labels, and we refer to the score of an individual label by its index.
The distillation has two objectives: a hard loss and a soft loss. The hard loss encourages the student model to predict the supervised hard label. The soft loss encourages the student model to behave similarly to the teacher model. We use cross entropy (CE) to measure both similarities:
where is a hyper-parameter denoting the temperature of the distillation.
The overall loss of the knowledge distillation is the sum of the hard loss and soft loss:
where is a hyper-parameter.
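The two losses and their combination can be sketched in plain Python. This is a minimal sketch under stated assumptions: the names `alpha` and `tau`, their default values, and the convex weighting `alpha * hard + (1 - alpha) * soft` are illustrative choices, and the exact weighting used in Eq. (2) may differ.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; larger tau gives a flatter distribution."""
    shift = max(z / tau for z in logits)
    exps = [math.exp(z / tau - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, pred):
    """CE(target, pred) = -sum_k target_k * log(pred_k); zero-target terms are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)


def kd_loss(student_logits, teacher_logits, hard, alpha=0.5, tau=4.0):
    # Hard loss: student probabilities against the supervised hard label.
    l_hard = cross_entropy(hard, softmax(student_logits))
    # Soft loss: student against teacher, both softened by temperature tau.
    l_soft = cross_entropy(softmax(teacher_logits, tau), softmax(student_logits, tau))
    # Assumed weighting; the paper's hyper-parameter may enter differently.
    return alpha * l_hard + (1 - alpha) * l_soft


print(kd_loss([2.0, 0.5, 0.1], [3.0, 1.0, 0.2], [1, 0, 0]))
```

With `alpha=0` only the soft loss remains, and for a fixed teacher it is smallest when the student's softened distribution matches the teacher's.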
3.2 Knowledge Distillation with Augmented Samples
In this subsection, we first formulate data augmentation for knowledge distillation. We train the student model on the augmented samples instead of the original training samples. This method serves as a baseline that does not introduce order restrictions. We then formulate the data augmentation techniques used in this paper.
Data Augmentation-based Knowledge Distillation. In this paper, we consider two classic augmentations (i.e., CutMix and Mixup). Our work can be easily extended to other mixture-based data augmentation operations (e.g., FMix, Mosaic). As in Mixup and CutMix, we combine two original samples to form a new augmented sample. The knowledge distillation based on augmented samples has the same loss function as in Eq. (2):
where the augmented sample is obtained by first randomly selecting original samples from the training dataset and then mixing them.
We formulate the process of augmenting samples as:
where denotes the specific data augmentation technique. is distributed in two labels (e.g. ). We will formulate different data augmentation techniques below.
CutMix augments samples by cutting and pasting patches between a pair of original images. It samples the same patch region for both images, removes that region from the first image, and fills it with the patch cropped from the second image. We formulate CutMix as:
where indicates whether the coordinates are inside () or outside () the patch. We follow the settings in  to uniformly sample and and keep the aspect ratio of to be proportional to the original image:
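A minimal sketch of the cut-and-paste step described above, on images stored as nested lists. The function name `cutmix`, the argument names, and the explicit patch centre `(cx, cy)` are assumptions made for illustration. The patch side lengths scale with `sqrt(1 - lam)` so that the patch keeps the image's aspect ratio, and the label weight is recomputed from the clipped patch area, as is common in CutMix implementations.

```python
import math


def cutmix(img_a, img_b, lam, cx, cy):
    """Paste a patch of img_b into a copy of img_a; images are H x W nested lists."""
    h, w = len(img_a), len(img_a[0])
    # Patch size proportional to the image, covering roughly (1 - lam) of its area.
    bw, bh = int(w * math.sqrt(1 - lam)), int(h * math.sqrt(1 - lam))
    x0, x1 = max(cx - bw // 2, 0), min(cx + bw // 2, w)
    y0, y1 = max(cy - bh // 2, 0), min(cy + bh // 2, h)
    mixed = [row[:] for row in img_a]  # copy so the inputs stay untouched
    for y in range(y0, y1):
        mixed[y][x0:x1] = img_b[y][x0:x1]
    # Fraction of img_a that survives; used as the weight of img_a's label.
    lam_adj = 1 - (x1 - x0) * (y1 - y0) / (w * h)
    return mixed, lam_adj


a = [[0] * 4 for _ in range(4)]
b = [[1] * 4 for _ in range(4)]
mixed, lam_adj = cutmix(a, b, lam=0.75, cx=2, cy=2)
print(mixed, lam_adj)
```

In training, `cx` and `cy` would be drawn uniformly over the image width and height, and the hard label of the mixed image would weight the two original labels by `lam_adj` and `1 - lam_adj`.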
Mixup augments a pair of samples by a weighted sum of their input features:
where each for .
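As a rough illustration of the weighted sum, the sketch below mixes two flattened inputs and their one-hot labels with the same weight. The names `mixup` and `sample_lambda` are hypothetical, and drawing the weight from a symmetric Beta distribution follows the usual Mixup practice; it is an assumption here, not a setting taken from this paper.

```python
import random


def mixup(x_a, y_a, x_b, y_b, lam):
    """Convex combination of two samples and of their label vectors."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def sample_lambda(a=1.0, rng=random):
    """Assumed sampling scheme: lam ~ Beta(a, a)."""
    return rng.betavariate(a, a)


x, y = mixup([1.0, 0.0], [1, 0, 0], [0.0, 1.0], [0, 1, 0], lam=0.75)
print(x, y)
```

The mixed hard label places weight `lam` on the first original label and `1 - lam` on the second, so it is distributed over exactly two labels.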
4 Isotonic Data Augmentation
In this section, we introduce order restrictions into data augmentation for knowledge distillation, which we call isotonic data augmentation. In Sec 4.1, we analyze the partial order restrictions of soft labels. In Sec 4.2, we propose a new knowledge distillation objective subject to the partial order restrictions. Since the partial order is a special directed tree, we propose a more efficient Adapted IRT algorithm based on IRT-BIN to calibrate the original soft labels. In Sec 4.3, we directly impose partial order restrictions on the student model and propose a more efficient approximate objective based on penalty methods.
4.1 Analysis of the Partial Order Restrictions
We hope that the soft label distribution by isotonic data augmentation and the hard label distribution have no order violations. We perform isotonic regression on the original soft labels to obtain new soft labels that satisfy the order restrictions. We denote the new soft labels as the order restricted soft labels . For simplicity, we will use to denote . We use to denote the score of the -th label.
To elaborate how we compute , without loss of generality, we assume the indices of the two original labels of the augmented sample are respectively with . So is monotonically decreasing, i.e. .
We consider two types of order restrictions: (1) the order between the two original labels; (2) the order between an original label and a non-original label. For example, in Fig. 2, we expect the probability of panda to be greater than that of cat, and the probability of cat to be greater than that of any other label. We do not consider the order between two non-original labels.
We use to denote the partial order restrictions, where each vertex represents , an edge represents the restriction of . is formulated in Eq. (8). We visualize the partial order restrictions in Fig. 3.
is a directed tree.
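The tree structure can be made concrete with a small sketch. It assumes the two original labels have distinct hard-label weights, and it encodes an edge `(i, j)` as "label `i` should score at least as high as label `j`"; the function names are hypothetical. Restrictions between the top original label and the non-original labels follow by transitivity, so they need no edges of their own.

```python
def partial_order_edges(hard):
    """Edges of the order tree: top label -> second label -> every other label."""
    order = sorted(range(len(hard)), key=lambda k: hard[k], reverse=True)
    top, second = order[0], order[1]
    edges = [(top, second)]
    edges += [(second, k) for k in range(len(hard)) if k not in (top, second)]
    return edges


def violations(soft, edges):
    """Edges (i, j) whose restriction soft[i] >= soft[j] is broken."""
    return [(i, j) for i, j in edges if soft[i] < soft[j]]


edges = partial_order_edges([0.0, 0.3, 0.7, 0.0])
print(edges)
print(violations([0.4, 0.1, 0.3, 0.2], edges))
```

For `d` labels this yields `d - 1` edges, consistent with the restriction graph being a tree.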
4.2 Knowledge Distillation via Order Restricted Soft Labels
For an augmented sample , we first use the teacher model to predict its soft labels. Then, we calibrate the soft labels to meet the order restrictions. We use the order-restricted soft label distribution to supervise the knowledge distillation. We formulate this process below.
Objective with Order Restricted Soft Labels. Given the hard label distribution and soft label distribution of an augmented sample , the objective of knowledge distillation with isotonic data augmentation is:
where denotes the optimal calibrated soft label distribution.
To compute , we calibrate the original soft label to meet the order restrictions. There are multiple choices for to meet the restrictions. Besides order restrictions, we also hope that the distance between the original soft label distribution and the calibrated label distribution is minimized. Intuitively, the original soft labels contain the knowledge of the teacher model. So we want this knowledge to be retained as much as possible. We formulate the calibration below.
We compute which satisfies the order restriction while preserving most knowledge by minimizing the mean square error to the original soft labels:
Eq. (10b) denotes the order restrictions. Eq. (10a) denotes the objective of preserving most original information. The goal of computing can be reduced to the classical isotonic regression in statistics.
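For reference, a tree-structured isotonic regression of this kind is commonly written as below. The symbols are assumptions introduced for illustration and may not match the paper's notation: $p$ is the original soft label vector, $\tilde{m}$ the calibrated one, $c$ the number of labels, and $E$ the edge set of the partial order.

```latex
\tilde{m}^{*} \;=\; \arg\min_{\tilde{m}} \; \frac{1}{c} \sum_{i=1}^{c} \left( \tilde{m}_i - p_i \right)^2
\qquad \text{s.t.} \qquad \tilde{m}_i \ge \tilde{m}_j \quad \forall\, (i, j) \in E
```

The objective plays the role of Eq. (10a) and the constraint that of Eq. (10b); the factor $1/c$ does not change the minimizer.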
Here we analyze the difference between isotonic data augmentation and probability calibration in machine learning, where isotonic regression is also applied. While both isotonic data augmentation and probability calibration try to rectify erroneous model predictions, isotonic data augmentation only happens in the training phase, when the ground-truth distribution (i.e., the hard labels) is known. We use the isotonic soft labels as extra supervision for model training. In contrast, probability calibration needs to learn an isotonic function and uses it to predict the probabilities of unlabeled samples.
Algorithm. To optimize , we need to compute first. According to lemma 1, finding the optimal can be reduced to the tree-structured IR problem, which can be solved by IRT-BIN  with binomial heap in time complexity. We notice that the tree structure in our problem is special: a star (nodes ) and an extra edge . So we give a more efficient implementation compared to IRT-BIN with only one sort in algorithm 1.
The core idea of the algorithm is to iteratively reduce the number of violations by merging node blocks until no order violation exists. Specifically, we divide the nodes into several blocks, and use to denote the block for node . At initialization, each only contains node itself. Since all nodes except and are leaf nodes with a common parent , we first consider the violations between and (line 4-7). Note that nodes are sorted according to their soft probabilities . We enumerate and iteratively determine whether there is a violation between node and node . If so, we absorb node into . This absorption will set all nodes in to their average value. In this way, we ensure that there are no violations among nodes . Then, we consider the order between and . If they are discordant (i.e. ), we similarly absorb into to eliminate this violation (line 9-11). If this absorption causes further violations between and a leaf node, we similarly absorb the violated node as above (line 12-15). Finally, we recover from according to the final block divisions.
 The Adapted IRT algorithm terminates with the optimal solution to .
The correctness of the algorithm is due to the strictly convex function of isotonic data augmentation subject to convex constraints. Therefore it has a unique local minimizer which is also the global minimizer . Its time complexity is .
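The block-merging procedure described above can be sketched as follows. This is an illustrative reimplementation from the prose, not the authors' Algorithm 1: the function name `adapted_irt` and its arguments are hypothetical, `top` and `second` are the indices of the original labels with the larger and smaller hard-label weight, and the mean squared error to the input is what is being minimized.

```python
def adapted_irt(soft, top, second):
    """Least-squares projection of `soft` onto z[top] >= z[second] >= z[k] for all other k."""
    # One sort: leaf labels in decreasing order of their soft probability.
    leaves = sorted((k for k in range(len(soft)) if k not in (top, second)),
                    key=lambda k: soft[k], reverse=True)
    block, total = [second], soft[second]  # block around the centre of the star
    pos = 0

    def absorb_leaves():
        # Merge leaves that exceed the block average; each merge raises the average.
        nonlocal pos, total
        while pos < len(leaves) and soft[leaves[pos]] > total / len(block):
            block.append(leaves[pos])
            total += soft[leaves[pos]]
            pos += 1

    absorb_leaves()
    if soft[top] < total / len(block):  # violation on the extra edge (top, second)
        block.append(top)
        total += soft[top]
        absorb_leaves()  # the lowered average may expose new leaf violations
    out = list(soft)
    for k in block:
        out[k] = total / len(block)  # every node in the block takes the block average
    return out


print(adapted_irt([0.1, 0.2, 0.6, 0.1], top=0, second=1))
```

Because merged nodes are replaced by their average, the calibrated scores still sum to the same total as the input, and the cost is dominated by the single sort.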
4.3 Efficient Approximation via Penalty Methods
We found two drawbacks of the order-restricted data augmentation proposed in Sec 4.2: (1) despite its favorable time complexity, the algorithm is hard to parallelize on a GPU; (2) the order restrictions are too harsh and overly distort the information in the original soft labels. For example, if the probabilities of the original labels are very low, then almost all nodes will be absorbed and averaged, which loses all valid knowledge from the original soft labels. In this subsection, we loosen the order restrictions and propose a more GPU-friendly algorithm.
Note that, the partial order in Eq. (10b) introduces the restrictions to the soft labels, and then uses the isotonic soft labels to limit the student model. If we directly use the partial order to limit the student model instead, the restrictions can be rewritten as:
Note that we can replace with a simpler term without changing the actual restriction. We use because we want to ensure the loss below is equally sensitive to both and .
Objective with Order Restricted Student. We convert the optimization problem subject to Eq. (11) into the unconstrained form in Eq. (12) via penalty methods. The idea is to add the restrictions to the loss function.
where is the penalty coefficients. The penalty-based loss can be computed in time and is GPU-friendly (via the max function).
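A penalty of this kind can be sketched with a hinge (max) term per edge of the order tree. This is a sketch under stated assumptions: the penalty is applied to generic student scores, `sigma` stands in for the penalty coefficient, and the exact quantity penalized in Eq. (12) may differ (for example, it may be defined on log-probabilities).

```python
def order_penalty(student_scores, edges):
    """Sum of hinge terms; an edge (i, j) is penalized when score[j] exceeds score[i]."""
    return sum(max(0.0, student_scores[j] - student_scores[i]) for i, j in edges)


def penalized_loss(base_loss, student_scores, edges, sigma=1.0):
    """Base distillation loss plus the weighted order penalty."""
    return base_loss + sigma * order_penalty(student_scores, edges)


# Hypothetical tree: label 0 >= label 1 >= labels 2 and 3.
edges = [(0, 1), (1, 2), (1, 3)]
print(order_penalty([0.5, 0.3, 0.1, 0.1], edges))
print(order_penalty([0.1, 0.2, 0.6, 0.1], edges))
```

Each term is an elementwise max over a fixed set of index pairs, so on a GPU the whole penalty maps to a few batched tensor operations.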
Models. We use teacher models and the student models of different architectures to test the effect of our proposed isotonic data augmentation algorithms for knowledge distillation. We tested the knowledge transfer of the same architecture (e.g. from ResNet101 to ResNet18), and the knowledge transfer between different architectures (e.g. from GoogLeNet to ResNet).
Competitors. We compare the isotonic data augmentation-based knowledge distillation with standard knowledge distillation . We also compare with the baseline of directly distilling with augmented samples without introducing the order restrictions. We use this baseline to verify the effectiveness of the order restrictions.
Datasets. We use CIFAR-100, which contains 50k training images with 500 images per class and 10k test images. We also use ImageNet, which contains 1.2 million images from 1K classes for training and 50K for validation, to evaluate the scalability of our proposed algorithms.
For CIFAR-100, we train the teacher model for 200 epochs and select the model with the best accuracy on the validation set. The knowledge distillation is also trained for 200 epochs. We use SGD as the optimizer, initialize the learning rate to 0.1, and decay it by 0.2 at epochs 60, 120, and 160. The default hyper-parameter values are derived from grid search or set following common practice. For ImageNet, we train the student model for 100 epochs. We use SGD as the optimizer with an initial learning rate of 0.1, and decay the learning rate by 0.1 at epochs 30, 60, and 90. The remaining ImageNet hyper-parameters follow prior work. Models for ImageNet were trained on 4 Nvidia Tesla V100 GPUs. Models for CIFAR-100 were trained on a single Nvidia Tesla V100 GPU.
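The two learning-rate schedules can be written as one step-decay function. The sketch assumes that "decay by 0.2" means multiplying the learning rate by 0.2 at each milestone and that epochs are counted from 0; the function name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr, milestones, gamma):
    """Learning rate after multiplying by gamma at every milestone already reached."""
    return base_lr * gamma ** sum(epoch >= m for m in milestones)


# CIFAR-100 setting from the text: 0.1, decayed at epochs 60, 120, 160.
print([step_lr(e, 0.1, (60, 120, 160), 0.2) for e in (0, 60, 120, 160)])
# ImageNet setting from the text: 0.1, decayed at epochs 30, 60, 90.
print([step_lr(e, 0.1, (30, 60, 90), 0.1) for e in (0, 30, 60, 90)])
```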
5.2 Main Results
Results on CIFAR-100. We show the classification accuracies of standard knowledge distillation and our proposed isotonic data augmentation in Table 1. Our proposed algorithms effectively improve the accuracies compared to standard knowledge distillation. This finding holds across data augmentation techniques (i.e., CutMix and Mixup) and network structures. In particular, the accuracies of our algorithms even exceed those of the teacher models. This shows that by introducing the order restrictions, our algorithms effectively calibrate the soft labels and reduce the error from the teacher model. As Mixup usually performs better than CutMix, we only use Mixup as the data augmentation in the remaining experiments.
Results on ImageNet. We display the experimental results on ImageNet in Table 2. Following prior settings, we use ResNet-34 as the teacher and ResNet-18 as the student. The results show that the isotonic data augmentation algorithms are more effective than the original data augmentation technique. This validates the scalability of isotonic data augmentation.
We found that KD-p is better on CIFAR-100, while KD-i is better on ImageNet. We attribute this to ImageNet having more categories (i.e., 1000), which makes order violations more likely to appear, so the strict isotonic regression in KD-i is required to eliminate them. In contrast, since CIFAR-100 has fewer categories, the original soft labels are more accurate, and introducing loose restrictions through KD-p is enough. We therefore suggest using KD-i when severe order violations occur.
Ablation. In Table 1, we also compare with the conventional data augmentation without introducing order restrictions (i.e. KD-aug). It can be seen that by introducing the order restriction, our proposed isotonic data augmentation consistently outperforms the conventional data augmentation. This verifies the advantages of our isotonic data augmentation over the original data augmentation.
5.3 Effect of Order Restrictions
The basic intuition of this paper is that order violations in soft labels harm knowledge distillation. To verify this intuition more directly, we evaluate the performance of knowledge distillation under different levels of order violations. Specifically, we use the Adapted IRT algorithm to eliminate the order violations of soft labels for different proportions of augmented samples. Fig. 5 shows the effect of eliminating different proportions of order violations on CIFAR-100. As more violations are calibrated, the accuracy of knowledge distillation continues to increase. This verifies that order violations harm knowledge distillation.
5.4 Efficiency of Isotonic Data Augmentation
We mentioned that KD-p, based on penalty methods, is more efficient and GPU-friendly than KD-i. In this subsection, we verify the efficiency of the different algorithms. We select models from Table 1 and measure their average training time per epoch. In Table 3, taking the time required for standard KD as the unit (1), we show the time of the different data augmentation algorithms. KD-p requires almost no additional time, which shows that it is more suitable for large-scale data in terms of efficiency.
5.5 Effect of the Looseness of Order Restrictions
The coefficient in the Eq. (12) is the key hyper-parameter that controls the looseness of KD-p. It can be found that for most models, the model performs best when . Therefore, is a recommended value for real tasks.
5.6 Effect on NLP Tasks
We also evaluate on NLP datasets, including DBPedia. We use BERT as the teacher and DistilBERT as the student. We leverage the mixup method in Mixup-Transformer, and the results indicate that, compared to KD-aug, KD-i and KD-p improve the student models’ accuracy.
We reveal that conventional data augmentation techniques for knowledge distillation suffer from critical order violations. In this paper, we use isotonic regression (IR) - a classic statistical algorithm - to eliminate these order violations. We adapt the traditional IRT-BIN algorithm into the Adapted IRT algorithm to generate concordant soft labels for augmented samples. We further propose a GPU-friendly penalty-based algorithm. We have conducted a variety of experiments on different datasets with different data augmentation techniques and verified the effectiveness of our proposed isotonic data augmentation algorithms. We also directly verified the effect of introducing order restrictions on data augmentation-based knowledge distillation.
This paper was supported by National Natural Science Foundation of China (No. 61906116), by Shanghai Sailing Program (No. 19YF1414700).
-  Nonlinear image estimation using piecewise and local image models. TIP 7 (7), pp. 979–991. Cited by: §1.
-  (2007) Dbpedia: a nucleus for a web of open data. In The semantic web, pp. 722–735. Cited by: §5.6.
-  (1972) The isotonic regression problem and its dual. JASA 67 (337), pp. 140–147. Cited by: §1.
-  (2013) Nonlinear programming: theory and algorithms. John Wiley & Sons. Cited by: §4.2.
-  (2020) YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934. Cited by: §3.2.
-  (2019) On the efficacy of knowledge distillation. In ICCV, pp. 4794–4802. Cited by: §2.
-  (2020) An empirical analysis of the impact of data augmentation on knowledge distillation. arXiv preprint arXiv:2006.03810. Cited by: §1.
-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. Cited by: §5.6.
-  (2019) Adaptive regularization of labels. arXiv preprint arXiv:1908.05474. Cited by: §1, §2.
-  (2020) Fmix: enhancing mixed sample data augmentation. arXiv preprint arXiv:2002.12047 2 (3), pp. 4. Cited by: §3.2.
-  Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §2, §5.1, Table 1.
-  (1938) A new measure of rank correlation. Biometrika 30 (1/2), pp. 81–93. Cited by: §1.
-  (2009) Learning multiple layers of features from tiny images. Cited by: §5.1.
-  Learning question classifiers. In COLING, Cited by: §5.6.
-  (2017) Early stopping without a validation set. arXiv preprint arXiv:1703.09580. Cited by: §2.
-  Torchdistill: a modular, configuration-driven framework for knowledge distillation. International Workshop on Reproducible Research in Pattern Recognition, pp. 24–44. Cited by: §5.1.
-  (1985) Establishing consistent and realistic reorder intervals in production-distribution systems. OR 33 (6), pp. 1316–1341. Cited by: §1.
-  Predicting good probabilities with supervised learning. In ICML, pp. 625–632. Cited by: §4.2.
-  (1999) Algorithms for a class of isotonic regression problems. Algorithmica 23 (3), pp. 211–222. Cited by: §4.2, §4, Theorem 1.
-  (2019) DistilBERT, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. Cited by: §5.6.
-  (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pp. 1631–1642. Cited by: §5.6.
-  (2020) Mixup-transformer: dynamic data augmentation for nlp tasks. arXiv preprint arXiv:2010.02394. Cited by: §5.6.
-  (2019) Data augmentation using random image cropping and patching for deep cnns. TCSVT. Cited by: §2.
-  (2019) Contrastive representation distillation. In ICLR, Cited by: §1, §5.2, Table 1.
-  (2020) Neural networks are more productive teachers than human raters: active mixup for data-efficient knowledge distillation from a blackbox model. In CVPR, pp. 1498–1507. Cited by: §1.
-  (2020) Knowledge distillation thrives on data augmentation. arXiv preprint arXiv:2012.02909. Cited by: §1.
-  (2019) Preparing lessons: improve knowledge distillation with better supervision. arXiv preprint arXiv:1911.07471. Cited by: §1, §2.
-  (2019) Training deep neural networks in generations: a more tolerant teacher educates better students. In AAAI, Vol. 33, pp. 5628–5635. Cited by: §2.
-  (2019) Cutmix: regularization strategy to train strong classifiers with localizable features. In ICCV, pp. 6023–6032. Cited by: §1, §2, §3.2, §3.2.
-  (2018) Mixup: beyond empirical risk minimization. In ICLR, Cited by: §1, §2, §3.2.