An Empirical Analysis of the Impact of Data Augmentation on Knowledge Distillation

06/06/2020 ∙ by Deepan Das, et al.

The generalization performance of Deep Learning models trained using Empirical Risk Minimization can be improved significantly by using Data Augmentation strategies such as simple transformations, or using Mixed Samples. In this work, we attempt to empirically analyse the impact of such augmentation strategies on the transfer of generalization between teacher and student models in a distillation setup. We observe that if a teacher is trained using any of the mixed sample augmentation strategies, the student model distilled from it is impaired in its generalization capabilities. We hypothesize that such strategies limit a model's capability to learn example-specific features, leading to a loss in quality of the supervision signal during distillation, without impacting its standalone prediction performance. We present a novel KL-Divergence based metric to quantitatively measure the generalization capacity of the different networks.




1 Introduction

Figure 1: Overview of the results of Mix Up, Cut Mix, Cut Out and simple transformations on CIFAR-10 Test Set when distilled into two different student models from a ResNet Teacher Model. Note that Mixed Sample Augmentation strategies help improve teacher performance, but corresponding student performance is impaired.

Deep Neural Networks (DNNs) can now learn sophisticated models that capture the intricacies of information in large amounts of data. This ability has lent spectacular success to Deep Learning models in many domains. However, as a model's capacity to encapsulate the information needed to express this complexity increases, so does its size. Such complex models may be too demanding to run on modest hardware, including personal computers or mobile devices. One solution to this problem is knowledge distillation, which trains a smaller, more efficient model that approximates the performance of the original, more cumbersome model. A knowledge distillation objective enables us to train a smaller, lightweight model without using any external annotations on the given data, instead utilising the predictions generated by a pre-trained cumbersome teacher model. There are plenty of other techniques that achieve high performance in compressed models, including model quantization (Zhou et al., 2017), model pruning (Han et al., 2015), and more recently, lottery tickets (Jonathan Frankle, 2018).

Figure 2: Pictorial representation of the data samples generated using the several data augmentation strategies. Starting from the left, we note that the original image is that of a firetruck. Standard transformations such as flipping and cropping produce the augmented image. Cutout introduces regional dropout in the image, whereas in Mix Up and Cut Mix, images and their labels are combined proportionally. For instance, here we see that a car is combined with the firetruck.

With the recent surge of deep learning research, there is a natural need for knowledge distillation techniques that mimic more complex models. At the same time, however, there is little basis for evaluating the impact of existing training strategies on the latent representation space generated by the teacher model, which is what enables distillation training of the student model. This raises the question: what is the impact of generalization in teacher networks? While many forms of explicit regularization have been added to large, cumbersome models, it is also interesting to analyze the impact of the implicit regularization that arises in such neural networks through the different optimization techniques. Work on implicit regularization has shown that as neural networks increase in size, they are actually able to find solutions with lower complexity (Neyshabur, 2017). Since there is no explicit regularization in the model’s objective function, this ability to generalize is built in through well-studied techniques such as normalization, gradient descent, and the choice of weight kernels, like the convolution filter. Some of these regularizers have become a standard addition to any Deep Learning model optimization.

Regularization performed on neural networks has many different aspects. In a distillation setup, where a smaller model (student) attempts to mimic the softened softmax output of a large, cumbersome model (teacher), student performance is shown to improve, but there still exists a significant generalization gap between the two models. It is important to try to minimize this generalization gap, but it is also important to consider the viability of using one or many of these implicit regularizers in the distillation process. In this work, we create a framework that measures the impact of using such implicit regularizers on the generalization gap between teacher and student models.

One such form of implicit regularization is the Vicinal Risk Minimization (VRM) principle (Vapnik, 2000). In this work, we evaluate the impact that data augmentation inspired by VRM techniques has on the transfer of generalization between teacher and student models. However, this is a general framework, and can also be used to evaluate other types of generalization in the future. In any natural setting, we encounter noise in the presented data that enables us to gain additional information about relatively similar data. Traditional data augmentation, and even Cut Out (DeVries and Taylor, 2017), can be viewed as one such example, where the model is trained on some simple transformations of the data. However, the scope of a vicinity is not explicitly defined in this case. Mixed Sample Data Augmentation techniques like Mix Up (Zhang et al., 2017), Cut Mix (Yun et al., 2019) or FMix (Harris et al., 2020) provide a new outlook on the concept of vicinity. Considered as standalone techniques, they can lead to state-of-the-art results in standard Deep Learning tasks, but it is interesting to note the impact each technique has on the supervision signal from the teacher model to the student model. We hypothesize that even though Data Augmentation techniques provide good regularization, they impair the distillation process because of several implicit qualitative biases in the techniques. This impairment is much more pronounced in the Mixed Sample Data Augmentation techniques.

To test this hypothesis, we use a framework to evaluate four data augmentation techniques inspired by VRM (traditional augmentation, Mix Up, Cut Out and Cut Mix) by training teacher networks based on the ResNet-18 architecture. Student networks are then distilled and evaluated against appropriate data sets to investigate their generalization capabilities. Our contributions are summarized as follows:

  • We demonstrate that popular data augmentation techniques, and especially Mixed Sample techniques such as MixUp and CutMix, can impair the transfer of generalization capabilities onto a student model in a distillation setting when applied to the teacher model.

  • We present a novel similarity-based metric to help explain some qualitative traits inherent in the latent representations of such models. These findings are also backed by a traditional KL-Divergence based metric that operates on the probability distribution of the model predictions.

  • We also analyze the performance of these distilled models under distributional shift, and demonstrate the adversarial impact of Mixed Sample Augmentation strategies on the distillation objective.

  • We present empirical evidence that data augmentation techniques tend to make models increasingly discriminative and regularize the example-specific features pertinent to the image.

2 Related Work

Knowledge Distillation: The idea behind knowledge distillation was first introduced in 2006 (Buciluǎ et al., 2006), but was made popular by (Hinton et al., 2015) as a technique for model compression. It works by teaching a student network to mimic, step by step, the behavior of a larger network. The process starts by training a more cumbersome teacher model on the original data. The complex features thus extracted by the teacher model are then transferred to a simpler model, or a model of similar size (Furlanello et al., 2018), by using softened probability scores from the cumbersome model. It is important to note that the relative probabilities of the incorrect classes, with respect to the correct class and to each other, are useful in understanding how complex models generalize. For example, in the well-known MNIST dataset, it is helpful to understand not only that the correct classification is a 2, but also which of these 2’s look like 3’s and which look like 7’s. Based on this idea, several improved distillation techniques have been proposed (Tarvainen and Valpola, 2017; Li et al., 2017; Xie et al., 2019; Heo et al., 2019) that address different aspects of improving distillation performance. More recent work attempts to find the crucial aspects that determine the quality of a distilled model (Phuong and Lampert, 2019).

Regularization: The ability of the cumbersome model to learn the nuances in probability can be seen as the model’s ability to generalize. Generalization can be interpreted as the capacity of a model to adapt to new, unseen data drawn from the same distribution the model was trained on. Several works analyse the generalization performance of the numerous implicit regularization methods in Deep Learning. Moreover, several other papers propose different techniques for introducing noise to improve the generalization performance of highly parameterized neural networks (Srivastava et al., 2014a; Blundell et al., 2015; Ioffe and Szegedy, 2015; Vapnik, 2000). Some other works also analyse the relationship between certain qualities of a learned model and the measured performance metrics (Neyshabur, 2017; Tsipras et al., 2018; Peterson et al., 2019). Knowledge distillation as a technique for model compression is unique due to its similarity to human learning, and is thus an interesting downstream task for analyzing a neural network's ability to generalize. A lot of research has been done to improve the generalization of the student by re-defining the teacher architecture, such as adding sequence-level techniques and ensemble teachers (Simard et al., 1996). However, this work is more interested in exploring the impact of generalization techniques meant to enhance the existing teacher architecture. Some recent works attempt to explain the relationship between implicit regularization techniques and knowledge distillation. (Müller et al., 2019) used a similar approach to ours while using a novel latent representation strategy to model the adversarial impact of label smoothing on distillation, while (Arani et al., 2019) analyse the beneficial impact of using trial-to-trial variability during distillation. (Cho and Hariharan, 2019) analyse the impact of Early Stopping on Distillation, but none of these works refer to data augmentation strategies.
Specific to Data Augmentation techniques, there have been attempts to explain the latent effect of data augmentation using mathematical formulations, such as in (Chen et al., 2019; Thulasidasan et al., 2019; He et al., 2019), and for Mixed Sample Data Augmentation Techniques in (Harris et al., 2020). In the following sections, we describe the experimental setup used in this paper, and then attempt to explain the interesting results obtained.

3 Experimental Setup

In this section, we will describe in detail the experimental setup used in this paper, ranging from the VRM techniques considered for the experiments, to the generalization measures we used to evaluate the student and teacher networks.

3.1 Comparison Methods

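The principle behind VRM was introduced in (Vapnik, 2000), and has found its application in Machine Learning model training via several different tools. Under the standard Empirical Risk Minimization (ERM) principle, the loss objective is optimized only on the training samples, whereas in VRM, virtual data points are also sampled from the vicinity of the real data points. Thus, whereas ERM can be thought of as minimizing the expectation of the loss objective with respect to an available empirical distribution

$$P_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = x_i, y = y_i),$$

where $\delta(x = x_i, y = y_i)$ is a Dirac mass centered at the training sample $(x_i, y_i)$, VRM can be thought of as a natural improvement where the density estimate on each sample is replaced by some estimate of the density in the neighborhood of the sample. Thus, the optimization objective is now regularized by an uncertainty estimate in the sampling task as follows:

$$P_\nu(\tilde{x}, \tilde{y}) = \frac{1}{n} \sum_{i=1}^{n} \nu(\tilde{x}, \tilde{y} \mid x_i, y_i),$$

where $\nu$ is a vicinity distribution that measures the probability of finding the virtual pair $(\tilde{x}, \tilde{y})$ in the vicinity of the training pair $(x_i, y_i)$.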

The following augmentation strategies can be thought of as extended VRM techniques and are used in the experiments to follow. These strategies are represented pictorially in Figure 2.

            Teacher                  LeNet Student            AlexNet Student
Models      Accuracy  KLD    ECE     Accuracy  KLD    ECE     Accuracy  KLD    ECE
Baseline    0.852     0.656  0.205   0.652     1.002  0.15    0.769     0.710  0.149
Augment     0.940     0.466  0.255   0.631     0.951  0.070   0.777     0.735  0.210
Cutout      0.880     0.220  0.237   0.644     0.987  0.109   0.785     0.726  0.249
Mixup       0.868     0.641  0.170   0.633     0.991  0.033   0.714     0.836  0.130
Cutmix      0.954     0.524  0.252   0.621     0.987  0.28    0.720     0.776  0.062
Table 1: Performance of the different Teacher and Student Models on CIFAR-10 Test Set. The KLD metric is the distance between human-labeled confidence scores and model prediction probabilities. Expected Calibration Error (ECE) measures prediction quality.

Models      Teacher-Accuracy  LeNet-Accuracy  AlexNet-Accuracy
Baseline    0.600             0.457           0.544
Augment     0.687             0.439           0.549
Cutout      0.630             0.451           0.555
Mixup       0.614             0.444           0.498
Cutmix      0.716             0.439           0.499
Table 2: Performance on CINIC-Imagenet Test Set.

Standard Transformations: Data Augmentation techniques such as flipping, splitting, scaling, rotating, cropping, etc. are some of the most common techniques to augment data for image classification tasks and find their use in several state-of-the-art results. While general data augmentation improves generalization by encouraging invariant features, it is important to note that these augmentations are data set dependent and require domain expertise. In this work, a few traditional data augmentation techniques are tested, namely random cropping and random flipping along the horizontal axis.
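As an illustration of the two standard transformations used here, the sketch below applies a padded random crop followed by a random left-right flip to a single image stored as a height x width x channels array. The function name, the padding of 4 pixels and the flip probability of 0.5 are assumptions made for this example and are not taken from the experiments.

```python
import numpy as np

def random_crop_and_flip(image, pad=4, rng=None):
    """Zero-pad the image, take a random crop of the original size,
    then mirror it left-right with probability 0.5.

    `pad` and the flip probability are illustrative assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    # Top-left corner of the crop inside the padded image.
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]  # mirror along the width axis
    return crop
```

The label is left unchanged by both transformations, which is what distinguishes them from the mixed sample strategies.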

Cut Out: Cut Out is a generalization technique inspired by the Dropout regularization technique (Srivastava et al., 2014b). While Dropout has been shown to improve generalization of the model, it blindly reduces the capacity of the model and adds sensitive hyper-parameters (Hernandez-Garcia and Konig, 2019). Cut Out instead works as a data augmentation technique, building on previous works that show that dropping contiguous regions improves generalization (Golnaz Ghiasi, 2018). A selected number of randomly sized contiguous sections are removed from the image to create a modified image for training, as given by (DeVries and Taylor, 2017).
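The regional dropout described above can be sketched as follows. This is a minimal illustration and not the reference implementation: the function name, the default number of holes and the maximum hole size are placeholder assumptions, and holes that cross the image border are simply clipped.

```python
import numpy as np

def cutout(image, n_holes=1, max_size=16, rng=None):
    """Zero out `n_holes` randomly sized rectangular regions of an image.

    `n_holes` and `max_size` are illustrative defaults (assumptions).
    The input array is not modified; a masked copy is returned.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    out = image.copy()
    for _ in range(n_holes):
        hole_h = rng.integers(1, max_size + 1)
        hole_w = rng.integers(1, max_size + 1)
        top = rng.integers(0, h)
        left = rng.integers(0, w)
        # Slices that run past the border are clipped by NumPy.
        out[top:top + hole_h, left:left + hole_w] = 0
    return out
```

As with the standard transformations, the target label of the image is kept as it is.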

Mix Up: A relatively newer strategy to implement data augmentation is to create a convex combination of data samples and their labels. The idea was first introduced in (Zhang et al., 2017) and the virtual data-label pair can be generated as follows:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j,$$

where $(x_i, y_i)$ and $(x_j, y_j)$ are two training samples and $\lambda \in [0, 1]$ is the mixture percentage of each image. This creates a target value that is a mix of the two original target values. By using an implicit bias that linear interpolations of data should lead to predictions that are linearly interpolated in the target space, Mix Up enables generation of well-calibrated models whose generalization performance is slightly better (Thulasidasan et al., 2019).
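The convex combination of samples and labels can be sketched on a batch as below. The function name is hypothetical, and drawing the mixing coefficient `lam` from a Beta(alpha, alpha) distribution and pairing each sample with a random permutation of the batch are assumptions that follow common practice rather than details confirmed by this paper.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=1.0, rng=None):
    """Mix each sample in a batch with a randomly chosen partner.

    x        : array of shape (batch, ...) holding the images
    y_onehot : array of shape (batch, n_classes) holding one-hot labels
    alpha    : Beta distribution parameter (assumed sampling scheme)
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)        # mixing coefficient in [0, 1]
    perm = rng.permutation(len(x))      # partner index for every sample
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam
```

Because the same coefficient is applied to inputs and labels, every mixed label remains a valid probability distribution.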

Cut Mix: Introduced by (Yun et al., 2019), Cut Mix was inspired by augmentation techniques that blend the classes of images (Yuji Tokozume, 2018), like Mix Up, and techniques that cut out regions of images, like Cut Out. The new virtual sample is given as:

$$\tilde{x} = M \odot x_A + (\mathbf{1} - M) \odot x_B, \qquad \tilde{y} = \lambda y_A + (1 - \lambda) y_B,$$

where $M$ is a binary mask that contains the information of where to drop and fill the image, $\mathbf{1}$ is a mask filled with ones, $\odot$ denotes element-wise multiplication and $\lambda \in [0, 1]$ is the combination ratio. The target label is generated similarly to Mix Up, but the authors claim that it improves upon both Cut Out and Mix Up by not removing informative pixels and generalizing over more natural samples.
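A minimal sketch of the mask-based combination for a single pair of images follows. The function name, the Beta(alpha, alpha) sampling of the ratio, the square-root box sizing and the choice to recompute the label ratio from the realised mask area are assumptions borrowed from common Cut Mix implementations, not details confirmed by this paper.

```python
import numpy as np

def cutmix_pair(x_a, y_a, x_b, y_b, alpha=1.0, rng=None):
    """Paste a random box from x_b into x_a and mix the labels by area.

    x_a, x_b : images of shape (height, width, channels)
    y_a, y_b : one-hot label vectors
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = x_a.shape[:2]
    lam = rng.beta(alpha, alpha)               # target share kept from x_a
    cut_h = int(h * np.sqrt(1.0 - lam))
    cut_w = int(w * np.sqrt(1.0 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y0, y1 = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, h)
    x0, x1 = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, w)
    mask = np.ones((h, w, 1), dtype=x_a.dtype)  # 1 keeps x_a, 0 takes x_b
    mask[y0:y1, x0:x1] = 0
    x_mix = mask * x_a + (1 - mask) * x_b
    lam_area = float(mask.mean())              # realised share of x_a pixels
    y_mix = lam_area * y_a + (1.0 - lam_area) * y_b
    return x_mix, y_mix, lam_area
```

Recomputing the ratio from the mask keeps the label weights consistent with the pixels actually shown when the box is clipped at the border.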

We understand that none of these techniques can be considered pure VRM techniques in the absolute sense; they only capture the essence of VRM. However, these techniques are some of the most popular regularization techniques to be found in the state-of-the-art results for most Deep Learning tasks. Moreover, a lot of recent work attempts to understand the qualitative abilities of such techniques (He et al., 2019; Gontijo-Lopes et al., 2020).

3.2 Knowledge Distillation

As described above, we attempt to understand the impact of these augmentation strategies in a simple Knowledge Distillation setup, where a smaller model is trained by forcing it to mimic the tempered probability distributions of a larger, cumbersome model. When neural networks use a softmax output layer to convert the logit $z_i$ for an example into a probability score $p_i$, we can soften it using a temperature parameter $T$ in the following manner:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

One can conveniently use unlabeled data to train a smaller model by depending entirely on the cumbersome model’s softened outputs, or additionally use label information, if available, in a loss formulation with $T$ as the temperature parameter.
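As a rough illustration of the distillation setup described above, the sketch below softens logits with a temperature and combines a soft-target term with a hard-label term. The function names, the use of `gamma` as the weight on the hard-label term and the `T ** 2` scaling of the soft term (a common convention following Hinton et al., 2015) are assumptions for illustration; the exact loss used in the experiments may differ.

```python
import numpy as np

def log_softmax_T(logits, T=1.0):
    """Numerically stable log-softmax of logits divided by temperature T."""
    z = np.asarray(logits, dtype=np.float64) / T
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, labels, T=20.0, gamma=0.5):
    """Weighted sum of a hard-label term and a softened teacher term.

    The weighting scheme is an assumed, commonly used form.
    """
    labels = np.asarray(labels)
    # Soft term: cross-entropy between softened teacher and student outputs.
    p_teacher = np.exp(log_softmax_T(teacher_logits, T))
    soft = -(p_teacher * log_softmax_T(student_logits, T)).sum(axis=-1).mean()
    # Hard term: standard cross-entropy against the ground-truth labels.
    log_p = log_softmax_T(student_logits, 1.0)
    hard = -log_p[np.arange(len(labels)), labels].mean()
    return gamma * hard + (1.0 - gamma) * (T ** 2) * soft
```

Setting `gamma=0` in this sketch corresponds to training on the teacher's softened outputs alone, which is the case where no labels are needed.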

3.3 Evaluation

After having trained a set of teacher and student models using the above mentioned augmentation and distillation strategy, it is important to set up a well-defined and intuitive evaluation strategy that can explain the impairment in transfer of distilled knowledge.

3.3.1 Test Datasets

A key aspect of measuring generalization performance is to analyse performance of the models on not only unseen data, but also test data with some natural variations in it. For this, we use the following data sets to analyse the performance of the models trained on CIFAR-10.

CIFAR-10: This is used to measure the performance of the model on unseen data lying within the seen distribution.

CINIC-10: This is used for the out-of-sample generalization test. This data set is collected by (Darlow et al., 2018) and contains both CIFAR-10 and ImageNet images in its test fold. However, we just use the 70,000 ImageNet images that have been bucketized into CIFAR-10 classes.

Figure 3: Average human confidence distribution across ground truth classes in CIFAR-10. Diagonal elements, which have the highest confidence, have been masked to reveal implicit patterns between different classes.

CIFAR-10H: This is the Human Labeled CIFAR Test Set collected by (Peterson et al., 2019). It contains the exact same images as the CIFAR-10 test set, but instead of one-hot encoded target labels, this dataset provides the probability distributions based on the labelings provided by the human annotators. This enables us to measure the closeness of a model’s predictions to human beliefs about a data sample.

3.4 Generalization Measures

We evaluate each model on the basis of standard confusion metrics, such as Accuracy, F-1 Score, Precision and Recall. However, accuracy ignores the probability assigned to the predicted class label as well as the probabilities assigned to the incorrect classes. This information is key to understanding the degree of generalization that a model has achieved. Since we have access to the human-labeled probability distributions for each CIFAR test sample, we can easily compare a model’s softmax score vector with the human-labeled distribution. To better understand this, note that in Figure 3, which represents the average confidence scores that humans have in each class, certain implicit patterns exist that reveal a more generalized story for a prediction. For instance, humans are more likely to confuse a dog with a cat than with a car. This generalization should also exist in the trained models. The difference between model prediction and human-level confidence can be easily measured by a KL-Divergence between the model prediction and human-labeled scores, as formulated below, where $p$ represents the model output and $q$ represents the human labeled confidence scores.

$$D_{KL}(q \,\|\, p) = \sum_{c} q_c \log \frac{q_c}{p_c}$$

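In essence, this is nothing but the cross-entropy loss between the model predictions and human labels, up to an additive term that does not depend on the model. However, it would also be interesting to note the KL-Divergences between different classes. Since it is not feasible to do that at the sample level, we instead compute the divergences between the averaged probability distributions for each ground truth class, for both the model prediction scores and the human labeled scores. To understand the quality of predictions, we also plot reliability diagrams of the different models and compare the model calibration using the Expected Calibration Error, which is given as follows:

$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$

where the $n$ test predictions are grouped into $M$ equal-width confidence bins $B_m$, and $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ are the accuracy and the mean confidence within bin $B_m$.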
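The two probability-based measures discussed above can be sketched as below. The function names are hypothetical, the direction of the divergence (human distribution first) is an assumption suggested by the cross-entropy interpretation, and the binning uses the usual equal-width scheme with ten bins as an assumed default.

```python
import numpy as np

def kl_to_human(human_probs, model_probs, eps=1e-12):
    """Mean KL divergence from human label distributions to model outputs.

    Both inputs have shape (n_samples, n_classes). Probabilities are
    clipped at `eps` to avoid log(0).
    """
    q = np.clip(np.asarray(human_probs, dtype=np.float64), eps, 1.0)
    p = np.clip(np.asarray(model_probs, dtype=np.float64), eps, 1.0)
    return float((q * (np.log(q) - np.log(p))).sum(axis=-1).mean())

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between accuracy and confidence per bin."""
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    conf = probs.max(axis=-1)                 # confidence of the prediction
    correct = (probs.argmax(axis=-1) == labels).astype(np.float64)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap        # weight by share of samples
    return float(ece)
```

A model whose confidence matches its accuracy in every bin obtains a value of zero under this definition, regardless of how accurate it is.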

We also propose a novel metric to explain the discriminative power of the different models we train. This operates on the penultimate layer embeddings generated by the model and can be thought of as a measure of how well-separated the embedding manifold is. It takes into account both the intra-class and the inter-class similarity and defines the discriminative power of the model as the difference between them. If intra-class similarity is high, class representations are cohesive and more compressed. If inter-class similarity is low, class representations are less adhesive and are far away from each other. An optimal classifier will tend to have high cohesion and low adhesion, and thus higher discriminative power. To compute this metric, we first standardize the embeddings, and define the cohesion and adhesion metrics as the intra- and inter-class similarities, respectively, using a cosine similarity function. Both are averaged using the number of instances in each class. The class-discrimination is then computed from the cohesion and adhesion, taking into account the total number of classes and the dimensionality of the embeddings.
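The cohesion and adhesion idea can be illustrated with the simplified sketch below. It is not the exact metric of this paper: the function name is hypothetical, similarities are pooled over all sample pairs instead of being averaged per class, and no normalization by the number of classes or the embedding dimensionality is applied. These simplifications are assumptions made to keep the example short.

```python
import numpy as np

def class_discrimination(embeddings, labels):
    """Return (cohesion, adhesion, cohesion - adhesion) for a set of embeddings.

    cohesion : mean cosine similarity between distinct samples of one class
    adhesion : mean cosine similarity between samples of different classes
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)
    # Standardize every embedding dimension, then scale rows to unit length
    # so that the dot product of two rows is their cosine similarity.
    z = (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-12)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    sim = z @ z.T
    same_class = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    cohesion = float(sim[same_class & not_self].mean())
    adhesion = float(sim[~same_class].mean())
    return cohesion, adhesion, cohesion - adhesion
```

Tight, well-separated clusters give a cohesion near 1 and a negative adhesion in this sketch, so the difference grows as the embedding space becomes more discriminative.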

(a) Baseline
(b) CutMix
(c) Cutout
(d) Augmentation
Figure 4: Latent Space Visualization from Penultimate Layer, dimensionally reduced using tSNE. Note the well-separated class representations in the augmented models as compared to the baseline model

4 Experiments

Five unique experimental setups were created using the framework and evaluation techniques from above. Teacher models were trained using augmentation techniques and distilled into the student models. This was evaluated with different complexity student models, and an additional data set for comparison. Additionally, we also performed a small ablation study, where we compared the effect of our augmentation techniques on each step of the knowledge distillation process.

The models were implemented in PyTorch, with different parameter values for the different augmentation techniques. Knowledge distillation was performed with a temperature value of 20 and gamma of 0.5. When performing augmentation, the baseline and standard data augmentation did not require explicit parameters. Mix Up was implemented with a beta value of 1, and Cut Out was implemented with 16 holes to be randomly placed in the images. Finally, Cut Mix used alpha and beta parameters of 0.5 and 1, respectively. These values were chosen based on the original implementations of the techniques, and could be fine-tuned in later work.

4.1 CIFAR 10 - Augmented ResNet Distilled to LeNet

4.1.1 Set Up

Five ResNet18 models (He, 2015) were trained using the data augmentation techniques discussed above, on the CIFAR-10 training dataset. The student networks were built from the LeNet architecture (Haffner, 1998) with no data augmentation techniques added.

4.1.2 Results

As can be seen in Table 1, the baseline student model outperformed the other student models in terms of accuracy. The baseline model also outperformed all other student models in terms of Precision, Recall and F1 Score on the CIFAR-10 and CINIC data sets, which can be found in the appendix 6.

4.2 CIFAR 10 - Augmented ResNet Distilled to AlexNet

4.2.1 Set Up

Another experiment distilled the ResNet18 teacher networks from above into a more complex student architecture, to evaluate whether its increased complexity would better capture the nuance of the teacher network's generalization. The architecture chosen was AlexNet (Krizhevsky et al., 2012).

4.2.2 Results

This experiment reinforced the trend from the previous experiment: enhanced generalization in the teacher did not lead to equally enhanced distillation, as can be seen in Table 1 and appendix 6. However, it is important to note that the augmentation techniques based on assumed sampling distance, such as standard augmentation and Cut Out, outperformed the baseline model, while the augmentation techniques that incorporate VRM sampling, Mix Up and Cut Mix, still under-performed the baseline teacher-student model. It is possible that the larger AlexNet model was able to pick up on more nuance. Additionally, we propose that this difference between the more standard augmentation and Mixed Sample augmentation techniques is related to the varying interpolative nature of the techniques. This finding is explored and further evaluated in the next section.

4.3 MNIST - Augmented AlexNet Distilled to LeNet

4.3.1 Set Up

This work also used the MNIST data set to validate these findings on another data set. For this validation, we just compared the baseline ERM technique with our Mixed Sample augmentation techniques - Mix Up and Cut Mix. This was due to the fact that the standard augmentation techniques as we defined them have been shown to inhibit MNIST training (He et al., 2019). We trained three teacher networks using the AlexNet architecture on the MNIST data using the augmentation techniques. Information from these models was then distilled into baseline LeNet models, which were then evaluated.

4.3.2 Results

The results can be seen in Table 3. While the accuracies were very close to one another, the validation loss is noticeably lower for the baseline student (0.039) than for the Mix Up (0.074) and Cut Mix (0.107) students. This validation was done with fairly simple models and a simple data set, and it would be interesting to explore this concept with more complex models and data sets in further work.

            AlexNet Teacher     LeNet Student
Models      Accuracy  Loss      Accuracy  Loss
Baseline    0.9958    0.020     0.9868    0.039
Mixup       0.9960    0.010     0.9886    0.074
Cutmix      0.9938    0.104     0.9861    0.107
Table 3: Performance of the different Teacher and Student Models on MNIST Test Set.

4.4 CIFAR 10 - Ablation Study - ResNet distilled to LeNet

4.4.1 Set Up

To further explore the effect of data augmentation on distillation, we evaluated how augmentation impacted each step of the distillation process. This evaluation was done with ResNet18 teachers and LeNet students, with one of the outcomes (Augmented Teacher - Student) being the first experiment described above. We also evaluated the set ups of a Baseline ResNet teacher distilled into LeNet students trained with augmentation techniques. Finally, we evaluated augmented ResNet teachers distilled into augmented LeNet students.

(a) Teacher Model
(b) AlexNet Student Models
(c) LeNet Student Models
Figure 5: Reliability curves for different models. It is observed that data augmentation generates calibrated predictions

4.4.2 Results

This study showed that no matter the set up, data augmentation techniques hurt the performance of the student models. We suspect that much of this has to do with the interpolation of the latent space by the augmentation techniques. These results can be seen in 6, 8, 7. In our previous experiments, we found that the teacher manifold is the key driver in distillation, so it is no surprise that the set ups with augmented teachers performed poorly. It is interesting, however, that adding augmentation to the student networks also inhibited distillation. More work with more complex student networks may be needed to further understand the impact of augmentation in student networks.

5 Explaining the Dichotomy

It is clear from Table 1 and from Figure 1 that when simpler models are distilled from cumbersome models trained with Data Augmentation, generalization performance is generally impaired. This is more pronounced for mixed sample data augmentation strategies like Mix Up and Cut Mix. This reversal of behavior is quite counter-intuitive, as we expect better generalized teacher models to transfer their capabilities to their respective students. We try to explain this in the following ways, with varying degrees of success.

Latent Space Representations: We visualize the latent representations generated under the influence of these different augmentation strategies. The teacher model representations seen in Figure 4 depict a clearer picture that can help explain this reversal in performance. We plot the tSNE-reduced embeddings (Maaten and Hinton, 2008) from the penultimate layer of the teacher ResNet models for four different classes in the CIFAR-10 Test Set. We chose two distinct pairs of semantically similar classes. Shown in orange and green are two semantically similar classes that represent vehicles (cars and trucks), whereas shown in blue and red is another pair of semantically similar classes (cat and bird), which are different from vehicles. This combination enables us to analyse relationships better, as a well-generalized model should be able to not only form correct clusters, but also cluster similar classes near one another. Plotting all 10 classes might lead to indistinguishable inter-class distances, and so we focus on these representations. We note that for Cut Mix, which uses a linear interpolation of both images and labels, the clusters are much more compressed and the semantic groups lie much farther away from each other than for any other strategy. Moreover, there is little interaction between the classes, as few points lie between the clusters. On the other hand, the baseline model manifold is much more uniform: many points lie on the boundaries of the class clusters and there is a gradual change in the representational capability of the model between the classes. The two semantic groups are closer in the baseline than under any augmentation strategy. This separability between classes and semantic groups also exists in Cutout and transformation-based augmentation, but is not as pronounced as in the interpolative Cut Mix.

Class-Separability: To quantify the behavior observed in the latent space visualizations, we propose the use of a very basic class-separability metric that simply measures the difference in a model’s confidence structure for different classes. In an ideal scenario where the model is supremely confident about each image, each class’ confidence distribution is well separated, and this can theoretically happen if a model is trained for long enough using a gradient descent algorithm. In real scenarios, however, models also assign probabilities to incorrect classes and, as claimed by (Hinton et al., 2015), this relative probability structure is crucial to the generalization performance during distillation. Mathematically, this class-separability metric is formulated in terms of the average model prediction distribution for each class and the number of classes.
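One plausible way to turn this description into a number is sketched below: the predictions are averaged per ground truth class and the KL-Divergence between every ordered pair of class averages is then averaged. This particular form, the function name and the clipping constant are assumptions for illustration and may not match the exact formulation behind Table 4.

```python
import numpy as np

def class_separability(probs, labels, n_classes, eps=1e-12):
    """Mean pairwise KL divergence between class-averaged predictions.

    probs  : array of shape (n_samples, n_classes) with softmax outputs
    labels : ground truth class index per sample (each class must occur)
    """
    probs = np.asarray(probs, dtype=np.float64)
    labels = np.asarray(labels)
    # Average prediction distribution for each ground truth class.
    means = np.stack([probs[labels == c].mean(axis=0) for c in range(n_classes)])
    means = np.clip(means, eps, 1.0)
    log_m = np.log(means)
    # kl[i, j] holds KL(mean_i || mean_j); the diagonal is zero.
    kl = (means[:, None, :] * (log_m[:, None, :] - log_m[None, :, :])).sum(axis=-1)
    return float(kl.sum() / (n_classes * (n_classes - 1)))
```

Under this reading, a model that places almost all probability mass on the correct class scores higher than one that spreads mass over related classes.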

Models      Teacher  LeNet  AlexNet
Baseline    3.73     3.21   3.07
Augment     5.19     3.14   3.23
Cutout      4.30     3.14   3.41
Mixup       5.32     2.92   2.17
Cutmix      5.33     2.89   2.27
Table 4: Class Separability Score

Class Discrimination: Accuracy can be a deceptive metric for evaluating generalization performance, as it depends on the one-hot representation of a prediction probability distribution. Similarly, the class-separability score used above can be made even more robust by considering the latent space embeddings instead of the probability structure. Taking inspiration from Discriminant Analysis techniques, we develop a novel metric that helps explain the model behavior better in the current setup. This metric, called class-discrimination, has been explained in Section 3.4 and involves a joint treatment of both intra-class similarity and inter-class similarity objectives in a latent space. We compute the class-discrimination, intra-class similarity, and inter-class similarity metrics for the different models and find that data augmentation acts as a regularizer by creating more discriminant models. In simpler terms, a model with loose, yet well-separated (low intra-class similarity, low inter-class similarity) class representations will make a better teacher than a model with comparatively tighter and well-separated class representations (high intra-class similarity, low inter-class similarity), while not losing out on original accuracy performance.

Augmentation  Cohesion  Adhesion  Discrimination
Baseline      0.739     -0.041    0.246
Augment       0.886     -0.049    0.296
Cutout        0.783     -0.043    0.261
Mixup         0.793     -0.042    0.255
Cutmix        0.917     -0.050    0.306
Table 5: Class Discrimination Evaluation for Teacher Models

As is evident from the above analysis, we can attempt to explain the adverse impact of such augmentation strategies on distillation in terms of the discriminative power that these models gain during training. This is consistent with the ideas presented in (He et al., 2019), wherein data augmentation strategies are believed to be regularization strategies that focus more on class-specific major features while regularizing the example-specific minor features. This regularization is more pronounced in interpolative techniques like Cut Mix and Mix Up, as we believe they do not really add any new information to the model. By combining class-specific labels and the corresponding images, they regularize finer, example-specific information and produce overall class-specific concepts as features. This is great for the model’s performance on test data and on data from a slightly different distribution, but distillation quality depends largely on the amount of information encoded in the latent features of the teacher model. Thus, if the generated features within any given class have greater variance, they are able to encode more information about the class’ relationship with other classes and are expected to generate more generalized probability distributions. This is a key factor in generating better quality students. It also helps explain the superior performance of Cutout and Augmentation, which add new information by retaining the same label and transforming the image.

Figure 6: Relative KL-Divergence Confusion Matrix. (a) Baseline Teacher. (b) CutMix Teacher.

Figure 7: Expected Calibration Error for the different models. Smaller models like LeNet usually have better calibration performance, and previous studies also show that linear interpolation of data samples always leads to better model calibration.

Prediction Quality: To test this idea of generating high quality probability distributions, we make use of the CIFAR-10 human-labeled dataset and measure the quality of the average model confidence distributions against the human confidence estimates. This can again be represented as a KL-Divergence between a model’s average prediction probabilities for a given class and the average human distribution. We present this information in a confusion-matrix-like visualization in Figure 6. Each cell represents the KL-Divergence between the human estimate on that class and the model’s estimate. The KL-Divergences are all scaled uniformly and color coded, so higher values correspond to brighter pixels. We are more interested in the off-diagonal elements, as they reveal the mutual information between classes, and we note that the Cut Mix matrix is much brighter in those pixels than the Baseline. This indicates that the mutual information between different classes is better encoded in the Baseline than in the Cut Mix-trained model, which enables the creation of a superior representation manifold for distillation to take place.
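As a rough illustration of the comparison described above, the sketch below builds a matrix whose cell (i, j) holds the element-wise KL term between the average human distribution and the average model distribution for true class i and candidate class j. Interpreting each cell as an element-wise term, the function name `relative_kl_matrix`, and the smoothing constant `eps` are assumptions; the text does not specify how the cells of the visualization are computed.

```python
import numpy as np

def relative_kl_matrix(human_avg, model_avg, eps=1e-12):
    """Element-wise KL terms between class-averaged distributions.

    human_avg, model_avg: (num_classes, num_classes) arrays where row i is
    the average probability vector over samples whose true class is i.
    Returns a matrix whose row sums equal KL(human_i || model_i).
    """
    h = np.clip(np.asarray(human_avg, dtype=float), eps, None)
    m = np.clip(np.asarray(model_avg, dtype=float), eps, None)
    # Renormalise rows after clipping so each row is a distribution.
    h /= h.sum(axis=1, keepdims=True)
    m /= m.sum(axis=1, keepdims=True)
    return h * np.log(h / m)
```

Summing a row gives the KL-Divergence for that class, while the individual off-diagonal cells show which incorrect classes contribute most to the mismatch with the human estimates.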

Model Calibration: We also measure calibration performance across the models and note that interpolative techniques consistently generate better calibrated models, for both teachers and students. This is evident from the reliability diagrams in Figure 5 and the Expected Calibration Error metrics in 1 and in Figure 7. However, it has been hypothesized in (Guo et al., 2017) that smaller models like LeNet generally tend to exhibit better calibration performance than overparameterized, modern models like ResNet. This trend is visible in the reliability curves, which hug the ideal straight line more closely as model complexity decreases. Nevertheless, no direct relationship can be found between the calibration performance of a student and the augmentation applied to its teacher model.
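For reference, a minimal sketch of the Expected Calibration Error in its standard binned form, as described in (Guo et al., 2017). The choice of 15 equal-width confidence bins and the function name are assumptions; the text does not state the binning used here.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Binned ECE: weighted average of |accuracy - confidence| per bin.

    probs:  (num_samples, num_classes) predicted probabilities.
    labels: (num_samples,) integer ground-truth classes.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    confidences = probs.max(axis=1)       # confidence of the predicted class
    predictions = probs.argmax(axis=1)
    correct = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap    # weight by fraction of samples in bin
    return ece
```

A perfectly calibrated model has an ECE of zero; a model that is fully confident but right only half the time has an ECE of 0.5.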

Model Generalization

Generalization is often defined in machine learning as a model’s ability to adapt to and classify unseen data. However, our analysis of the latent representation space suggests that, as these techniques increase the intra-class discrimination, they may inhibit the ability to create probability distributions that carry informative generalization for distillation into a student model.

We analyzed this concept further using confusion matrices that compare the predicted label probabilities with the actual classes. Both the baseline and Mix Up models classify well overall, with a strong diagonal in the unmasked confusion matrices. When that diagonal is masked, however, we can see that the probabilities assigned to the classes not chosen are far more spread out in the baseline model. These can be compared to Figure 3, which visualizes the human-generated label probabilities.

It is important to note that these matrices have different color scales, with the Masked Baseline going up to 0.1 and the Masked Cut Mix going up to 0.07. Both matrices pick up on some of the nuances in the data, but the masked baseline shows more generalization between classes, with brighter colors between related classes. This ability to create an informative probability distribution across different classes is the backbone of effective knowledge distillation.
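The masked matrices discussed above can be sketched as follows, assuming each row holds the average predicted probability vector for one true class and that masking simply hides the diagonal. The function name and the use of NaN as the mask value are illustrative assumptions.

```python
import numpy as np

def masked_probability_matrix(probs, labels, num_classes):
    """Average predicted probabilities per true class, with and without
    the diagonal, so the off-diagonal structure can be inspected.

    Assumes every class has at least one sample.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    # Row i: mean prediction over all samples whose true class is i.
    matrix = np.stack([probs[labels == c].mean(axis=0) for c in range(num_classes)])
    masked = matrix.copy()
    np.fill_diagonal(masked, np.nan)  # hide the dominant correct-class mass
    return matrix, masked
```

Because the diagonal dominates the color scale, plotting `masked` with its own scale (for example via `np.nanmax(masked)`) makes the relative probabilities between related classes visible.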

6 Future Work

Moving ahead, a quantification of the regularization effect of mixed sample augmentation strategies like Cut Mix and Mix Up should be formulated. An extension of this work would be to develop a novel augmentation strategy that retains or improves the quality of the latent embedding manifold for the given data set and, consequently, generates more generalized model predictions. A key limitation of this study is that the training data set used to distill student models is the same as the one used to train the teacher models; a set of experiments could be conducted on a held-out distillation data set. Another way to measure true generalization performance would be the relative performance of these models under adversarial attack, which would be an interesting metric to analyse.

References


  • E. Arani, F. Sarfraz, and B. Zonooz (2019) Improving generalization and robustness with noisy collaboration in knowledge distillation. External Links: 1910.05057 Cited by: §2.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. External Links: 1505.05424 Cited by: §2.
  • C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil (2006) Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. Cited by: §2.
  • S. Chen, E. Dobriban, and J. H. Lee (2019) A group-theoretic framework for data augmentation. External Links: 1907.10905 Cited by: §2.
  • J. H. Cho and B. Hariharan (2019) On the efficacy of knowledge distillation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4793–4801. Cited by: §2.
  • L. N. Darlow, E. J. Crowley, A. Antoniou, and A. J. Storkey (2018) CINIC-10 is not imagenet or cifar-10. External Links: 1810.03505 Cited by: §3.3.1.
  • T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. External Links: 1708.04552 Cited by: §1, §3.1.
  • T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar (2018) Born again neural networks. External Links: 1805.04770 Cited by: §2.
  • Q. V. L. Golnaz Ghiasi (2018) DropBlock: a regularization method for convolutional networks. External Links: 1810.12890 Cited by: §3.1.
  • R. Gontijo-Lopes, S. J. Smullin, E. D. Cubuk, and E. Dyer (2020) Affinity and diversity: quantifying mechanisms of data augmentation. External Links: 2002.08973 Cited by: §3.1.
  • C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. External Links: 1706.04599 Cited by: §5.
  • P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE. Cited by: §4.1.1.
  • S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. External Links: 1510.00149 Cited by: §1.
  • E. Harris, A. Marcu, M. Painter, M. Niranjan, A. Prügel-Bennett, and J. Hare (2020) Understanding and enhancing mixed sample data augmentation. External Links: 2002.12047 Cited by: §1, §2.
  • K. He (2015) Deep residual learning for image recognition. External Links: 1512.03385 Cited by: §4.1.1.
  • Z. He, L. Xie, X. Chen, Y. Zhang, Y. Wang, and Q. Tian (2019) Data augmentation revisited: rethinking the distribution gap between clean and augmented data. External Links: 1909.09148 Cited by: §2, §3.1, §4.3.1, §5.
  • B. Heo, M. Lee, S. Yun, and J. Y. Choi (2019) Knowledge distillation with adversarial samples supporting decision boundary. Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 3771–3778. External Links: ISSN 2159-5399 Cited by: §2.
  • A. Hernandez-Garcia and P. Konig (2019) Data augmentation instead of explicit regularization. External Links: 1806.03852 Cited by: §3.1.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. External Links: 1503.02531 Cited by: §2, §5.
  • S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. External Links: 1502.03167 Cited by: §2.
  • M. C. Jonathan Frankle (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. External Links: 1803.03635 Cited by: §1.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. Cited by: §4.2.1.
  • Y. Li, J. Yang, Y. Song, L. Cao, J. Luo, and L. Li (2017) Learning from noisy labels with distillation. 2017 IEEE International Conference on Computer Vision (ICCV). External Links: ISBN 9781538610329 Cited by: §2.
  • L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §5.
  • R. Müller, S. Kornblith, and G. Hinton (2019) When does label smoothing help?. External Links: 1906.02629 Cited by: §2.
  • B. Neyshabur (2017) Implicit regularization in deep learning. arXiv preprint arXiv:1709.01953. Cited by: §1, §2.
  • J. Peterson, R. Battleday, T. Griffiths, and O. Russakovsky (2019) Human uncertainty makes classification more robust. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). External Links: ISBN 9781728148038 Cited by: §2, §3.3.1.
  • M. Phuong and C. Lampert (2019) Towards understanding knowledge distillation. In International Conference on Machine Learning, pp. 5142–5151. Cited by: §2.
  • P. Y. Simard, Y. LeCun, J. S. Denker, and B. Victorri (1996) Transformation invariance in pattern recognition-tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade. Cited by: §2.
  • N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014a) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, pp. 1929–1958. Cited by: §2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014b) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (56), pp. 1929–1958. Cited by: §3.1.
  • A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. External Links: 1703.01780 Cited by: §2.
  • S. Thulasidasan, G. Chennupati, J. Bilmes, T. Bhattacharya, and S. Michalak (2019) On mixup training: improved calibration and predictive uncertainty for deep neural networks. External Links: 1905.11001 Cited by: §2, §3.1.
  • D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2018) Robustness may be at odds with accuracy. External Links: 1805.12152 Cited by: §2.
  • V. N. Vapnik (2000) The vicinal risk minimization principle and the svms. In The nature of statistical learning theory, pp. 267–290. Cited by: §1, §2, §3.1.
  • Q. Xie, M. Luong, E. Hovy, and Q. V. Le (2019) Self-training with noisy student improves imagenet classification. External Links: 1911.04252 Cited by: §2.
  • T. H. Yuji Tokozume (2018) Between-class learning for image classification. External Links: 1711.10284 Cited by: §3.1.
  • S. Yun, D. Han, S. Chun, S. J. Oh, Y. Yoo, and J. Choe (2019) CutMix: regularization strategy to train strong classifiers with localizable features. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). External Links: ISBN 9781728148038 Cited by: §1, §3.1.
  • H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. External Links: 1710.09412 Cited by: §1, §3.1.
  • Z. Zhou, P. Mertikopoulos, N. Bambos, S. Boyd, and P. Glynn (2017) On the convergence of mirror descent beyond stochastic convex programming. External Links: 1706.05681 Cited by: §1.

Appendix A Appendix Figures

Augmentation  Model    Test Set  Accuracy  Precision  Recall  F1-Score
Baseline      LeNet    CIFAR-10  0.652     0.648      0.652   0.650
Baseline      LeNet    CINIC     0.456     0.452      0.457   0.449
Baseline      AlexNet  CIFAR-10  0.768     0.767      0.769   0.768
Baseline      AlexNet  CINIC     0.544     0.550      0.544   0.529
Augmentation  LeNet    CIFAR-10  0.631     0.632      0.631   0.474
Augmentation  LeNet    CINIC     0.439     0.434      0.435   0.435
Augmentation  AlexNet  CIFAR-10  0.777     0.776      0.777   0.766
Augmentation  AlexNet  CINIC     0.549     0.557      0.549   0.535
Cut Out       LeNet    CIFAR-10  0.644     0.643      0.644   0.643
Cut Out       LeNet    CINIC     0.451     0.447      0.451   0.445
Cut Out       AlexNet  CIFAR-10  0.785     0.783      0.785   0.784
Cut Out       AlexNet  CINIC     0.554     0.562      0.554   0.540
Mix Up        LeNet    CIFAR-10  0.633     0.631      0.633   0.632
Mix Up        LeNet    CINIC     0.444     0.435      0.443   0.433
Mix Up        AlexNet  CIFAR-10  0.714     0.712      0.714   0.712
Mix Up        AlexNet  CINIC     0.497     0.501      0.498   0.477
Cut Mix       LeNet    CIFAR-10  0.621     0.617      0.621   0.619
Cut Mix       LeNet    CINIC     0.439     0.434      0.439   0.435
Cut Mix       AlexNet  CIFAR-10  0.720     0.717      0.720   0.718
Cut Mix       AlexNet  CINIC     0.498     0.502      0.498   0.478
Table 6: Appendix Table: All Student Metrics trained from ResNet18 CIFAR Teachers

Augmentation  Model    Test Set  Accuracy  Precision  Recall  F1-Score
Baseline      LeNet    CIFAR-10  0.652     0.648      0.652   0.650
Baseline      LeNet    CINIC     0.456     0.452      0.457   0.449
Augmentation  LeNet    CIFAR-10  0.641     0.639      0.641   0.639
Augmentation  LeNet    CINIC     0.451     0.446      0.451   0.444
Cut Out       LeNet    CIFAR-10  0.635     0.630      0.635   0.632
Cut Out       LeNet    CINIC     0.446     0.446      0.439   0.441
Mix Up        LeNet    CIFAR-10  0.647     0.645      0.647   0.647
Mix Up        LeNet    CINIC     0.457     0.455      0.457   0.452
Cut Mix       LeNet    CIFAR-10  0.629     0.626      0.629   0.627
Cut Mix       LeNet    CINIC     0.439     0.433      0.439   0.434
Table 7: Appendix Table: All Baseline Teacher with Augmented Student Test Metrics trained from CIFAR-10

Augmentation  Model    Test Set  Accuracy  Precision  Recall  F1-Score
Baseline      LeNet    CIFAR-10  0.652     0.648      0.652   0.650
Baseline      LeNet    CINIC     0.456     0.452      0.457   0.449
Augmentation  LeNet    CIFAR-10  0.641     0.637      0.641   0.638
Augmentation  LeNet    CINIC     0.445     0.441      0.445   0.441
Cut Out       LeNet    CIFAR-10  0.636     0.643      0.636   0.638
Cut Out       LeNet    CINIC     0.440     0.451      0.440   0.440
Mix Up        LeNet    CIFAR-10  0.626     0.622      0.626   0.624
Mix Up        LeNet    CINIC     0.440     0.437      0.440   0.431
Cut Mix       LeNet    CIFAR-10  0.635     0.631      0.635   0.633
Cut Mix       LeNet    CINIC     0.446     0.443      0.446   0.441
Table 8: Appendix Table: All ResNet Teacher - Student with same Augmentation Test Metrics trained from CIFAR-10

Augmentation  Model    Test Set  Accuracy  Precision  Recall  F1-Score
Baseline      ResNet   CIFAR-10  0.852     0.851      0.852   0.851
Baseline      ResNet   CINIC     0.600     0.602      0.600   0.593
Augmentation  ResNet   CIFAR-10  0.940     0.940      0.940   0.940
Augmentation  ResNet   CINIC     0.630     0.620      0.628   0.629
Cut Out       ResNet   CIFAR-10  0.879     0.878      0.880   0.879
Cut Out       ResNet   CINIC     0.629     0.620      0.628   0.630
Mix Up        ResNet   CIFAR-10  0.867     0.954      0.867   0.953
Mix Up        ResNet   CINIC     0.603     0.606      0.603   0.598
Cut Mix       ResNet   CIFAR-10  0.954     0.869      0.954   0.868
Cut Mix       ResNet   CINIC     0.710     0.707      0.720   0.710
Table 9: Appendix Table: All Teacher Test Metrics trained from CIFAR-10