1 Introduction
Deep Neural Networks (DNNs) can now learn sophisticated models that capture the intricacies of large amounts of data. This ability has lent spectacular success to Deep Learning models in many domains. However, as the capacity of a model to encode this complexity increases, so does its size. Such complex models may be too demanding to run on modest hardware, including personal computers and mobile devices. One solution to this problem is knowledge-based distillation, which trains a smaller, more efficient model that approximates the performance of the original, more cumbersome model. A knowledge distillation objective enables us to train a smaller, lightweight model without using any external annotations on the given data, utilising instead the prediction labels generated by a pretrained, cumbersome teacher model. Several other techniques also achieve high performance in compressed models, including model quantization (Zhou et al., 2017), model pruning (Han et al., 2015), and, more recently, lottery tickets (Jonathan Frankle, 2018).
With the recent surge of deep learning research, there is a natural need to create knowledge-based distillation techniques that mimic more complex algorithms. At the same time, however, there is little basis for evaluating the impact of existing strategies on the latent representation space generated by the teacher model, which is what enables distillation training of the student model. This raises the question: what is the impact of generalization in teacher networks? While many forms of explicit regularization have been added to large, cumbersome models, it is interesting to analyze the impact of implicit regularization techniques applied to such networks through different optimization techniques. Work on implicit regularization has shown that as neural networks increase in size, they are actually able to find solutions with lower complexity (Neyshabur, 2017). Since there is no explicit regularization in the model’s objective function, this ability to generalize is built in through well-studied techniques such as normalization, gradient descent, and the choice of weight kernels, like the convolution filter. Some of these regularizers have become a standard addition to any Deep Learning model optimization.
Regularization of neural networks has many different aspects. In a distillation setup, where a smaller model (student) attempts to mimic the softened softmax output of a large, cumbersome model (teacher), student performance is shown to improve, but there still exists a significant generalization gap between the two models. It is important to try to minimize this generalization gap, but it is also important to consider the viability of using one or many of these implicit regularizers in the distillation process. In this work, we create a framework that measures the impact of such implicit regularizers on the generalization gap between teacher and student models.
One such form of implicit regularization is the Vicinal Risk Minimization (VRM) principle (Vapnik, 2000). In this work, we evaluate the impact that VRM-inspired data augmentation techniques have on the transfer of generalization between teacher and student models. However, this is a general framework, and can also be used to evaluate other types of generalization in the future. In any natural setting, we encounter noise in the presented data that enables us to gain additional information about relatively similar data. Traditional data augmentation, and even Cut Out (DeVries and Taylor, 2017), can be viewed as one such example, where the model is trained on some simple transformations of the data. However, the scope of a vicinity is not explicitly defined in this case. Mixed Sample Data Augmentation techniques like Mix Up (Zhang et al., 2017), Cut Mix (Yun et al., 2019) and FMix (Harris et al., 2020) provide a new outlook on the concept of vicinity. Considered as standalone techniques, they can lead to state-of-the-art results in standard Deep Learning tasks, but it is interesting to note the impact each technique has on the supervision signal from the teacher model to the student model. We hypothesize that even though Data Augmentation techniques provide good regularization, they impair the distillation process because of several implicit qualitative biases in the techniques. This impairment is much more pronounced in the Mixed Sample Data Augmentation techniques.
To test this hypothesis, we use a framework that evaluates four data augmentation techniques inspired by VRM (traditional augmentation, Mix Up, Cut Out and Cut Mix) by training teacher networks on ResNet18 architectures. Student networks are then distilled and evaluated against appropriate data sets to investigate their generalization capabilities. Our contributions are summarized as follows:

We demonstrate that popular data augmentation techniques, and especially Mixed Sample techniques, such as MixUp and CutMix when applied on a teacher model, can impair the transfer of generalization capabilities onto a student model in a distillation setting.

We present a novel similarity-based metric to help explain some qualitative traits inherent in the latent representations of such models. These findings are also backed by a traditional KL-Divergence based metric that operates on the probability distribution of the model predictions.

We also analyze the performance of these distilled models under distributional shift, and demonstrate the adversarial impact of Mixed Sample Augmentation strategies on the distillation objective.

We present empirical evidence that data augmentation techniques tend to make models increasingly discriminative and regularize example-specific features pertinent to the image.
2 Related Work
Knowledge Distillation: The idea behind knowledge distillation was first introduced by (Buciluǎ et al., 2006), but was made popular by (Hinton et al., 2015) as a novel technique for model compression. It works by teaching a student network to mimic, step by step, the behavior of a larger network. The process begins by training a more cumbersome teacher model on the original data. The complex features extracted by the teacher model are then transferred to a simpler model, or to a model of similar size (Furlanello et al., 2018), by using softened probability scores from the cumbersome model. It is important to note that the probabilities of the incorrect classes, relative to the probability of the correct class and to each other, are useful in understanding how complex models generalize over the data. For example, in the well-known MNIST dataset, it is helpful to understand not only that the correct classification is a 2, but also which of these 2’s look like 3’s and which look like 7’s. Based on this idea, several improved distillation techniques have been proposed (Tarvainen and Valpola, 2017; Li et al., 2017; Xie et al., 2019; Heo et al., 2019) that address different aspects of improving distillation performance. More recent work attempts to find the crucial aspects that determine the quality of a distilled model (Phuong and Lampert, 2019).
Regularization: The ability of the cumbersome model to learn the nuances in probability can be seen as the model’s ability to generalize over the data. Generalization can be interpreted as the capacity of a model to adapt to new, unseen data drawn from the same distribution the model was trained on. Several works analyse the generalization performance of the numerous implicit regularization methods in Deep Learning. Moreover, several other papers propose different techniques for introducing noise to improve the generalization performance of highly parameterized neural networks (Srivastava et al., 2014a; Blundell et al., 2015; Ioffe and Szegedy, 2015; Vapnik, 2000). Some other works also analyse the relationship between certain qualities of a learned model and the measured performance metrics (Neyshabur, 2017; Tsipras et al., 2018; Peterson et al., 2019). Knowledge distillation as a technique for model compression is unique due to its similarity to human learning, and is thus an interesting downstream task for analyzing a neural network’s ability to generalize. A lot of research has been done to improve the generalization of the student by redefining the teacher architecture, such as adding sequence-level techniques and ensemble teachers (Simard et al., 1996). However, this work is more interested in exploring the impact of generalization techniques meant to enhance the existing teacher architecture. Some recent works attempt to explain the relationship between implicit regularization techniques and knowledge distillation. (Müller et al., 2019) used an approach similar to ours, with a novel latent representation strategy, to model the adversarial impact of label smoothing on distillation, while (Arani et al., 2019) analyse the beneficial impact of using trial-to-trial variability during distillation. (Cho and Hariharan, 2019) analyse the impact of Early Stopping on distillation, but none of these works address data augmentation strategies.
Specific to Data Augmentation techniques, there have been attempts to explain the latent effect of data augmentation using mathematical formulations, such as in (Chen et al., 2019; Thulasidasan et al., 2019; He et al., 2019), and for Mixed Sample Data Augmentation Techniques in (Harris et al., 2020). In the following sections, we describe the experimental setup used in this paper, and then attempt to explain the interesting results obtained.
3 Experimental Setup
In this section, we will describe in detail the experimental setup used in this paper, ranging from the VRM techniques considered for the experiments, to the generalization measures we used to evaluate the student and teacher networks.
3.1 Comparison Methods
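The principle behind VRM was introduced in (Vapnik, 2000), and has found its application in Machine Learning model training via several different tools. Under the standard Empirical Risk Minimization (ERM) principle, the loss objective is optimized only on the training samples, whereas in VRM, virtual data points are also sampled from the vicinity of the real data points. Thus, whereas ERM can be thought of as minimizing the expectation of the loss objective with respect to the available empirical distribution $P_\delta(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta(x = x_i, y = y_i)$, VRM can be thought of as a natural improvement where the density estimate on each sample is replaced by some estimate of the density in the neighborhood of the sample, $P_\nu(\tilde{x}, \tilde{y}) = \frac{1}{n} \sum_{i=1}^{n} \nu(\tilde{x}, \tilde{y} \mid x_i, y_i)$. Thus, the optimization objective is now regularized by an uncertainty estimate in the sampling task as follows:
$$R_\nu(f) = \frac{1}{m} \sum_{i=1}^{m} \ell(f(\tilde{x}_i), \tilde{y}_i), \qquad (\tilde{x}_i, \tilde{y}_i) \sim P_\nu$$
where, in the standard formulation, $\nu$ is the vicinity distribution around each training pair $(x_i, y_i)$, $\ell$ is the loss and $f$ is the model.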
The following augmentation strategies can be thought of as extended VRM techniques and are used in the experiments to follow. These strategies are represented pictorially in Figure 2.
Models  Teacher Accuracy  Teacher KLD  Teacher ECE  LeNet Student Accuracy  LeNet Student KLD  LeNet Student ECE  AlexNet Student Accuracy  AlexNet Student KLD  AlexNet Student ECE
Baseline  0.852  0.656  0.205  0.652  1.002  0.15  0.769  0.710  0.149
Augment  0.940  0.466  0.255  0.631  0.951  0.070  0.777  0.735  0.210
Cutout  0.880  0.220  0.237  0.644  0.987  0.109  0.785  0.726  0.249
Mixup  0.868  0.641  0.170  0.633  0.991  0.033  0.714  0.836  0.130
Cutmix  0.954  0.524  0.252  0.621  0.987  0.28  0.720  0.776  0.062

Models  Teacher Accuracy  LeNet Accuracy  AlexNet Accuracy
Baseline  0.600  0.457  0.544
Augment  0.687  0.439  0.549
Cutout  0.630  0.451  0.555
Mixup  0.614  0.444  0.498
Cutmix  0.716  0.439  0.499
Performance on CINIC-ImageNet Test Set.
Standard Transformations: Data Augmentation techniques such as flipping, splitting, scaling, rotating and cropping are some of the most common ways to augment data for image classification tasks, and they feature in several state-of-the-art results. While general data augmentation improves generalization by encouraging invariant features, it is important to note that these augmentations are data set dependent and require domain expertise. In this work, we test a few traditional data augmentation techniques, namely random cropping and random flipping along the horizontal axis.
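The two standard transformations used here can be sketched as follows. This is a minimal NumPy illustration, not the implementation used in the experiments: the function name `random_crop_and_flip`, the zero padding and the padding width of 4 pixels are assumptions made for the example.

```python
import numpy as np

def random_crop_and_flip(image, pad=4, rng=None):
    """Randomly crop (after zero padding) and horizontally flip an H x W x C image.

    `pad` is a hypothetical default; the source does not state the crop settings.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    # Zero-pad the spatial dimensions, then crop back to the original size
    # at a random offset, which shifts the image content by up to `pad` pixels.
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="constant")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    crop = padded[top:top + h, left:left + w]
    # Flip left-right with probability 0.5.
    if rng.random() < 0.5:
        crop = crop[:, ::-1]
    return crop
```

The label is left untouched by both transformations, which is what distinguishes them from the mixed sample techniques described next.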
Cut Out: Cut Out is a generalization technique inspired by the Dropout regularization technique (Srivastava et al., 2014b). While Dropout has been shown to improve generalization of the model, it blindly reduces the capacity of the model and adds sensitive hyperparameters (Hernandez-Garcia and Konig, 2019). Cut Out instead works as a data augmentation technique, drawing on previous work showing that dropping contiguous regions improves generalization (Golnaz Ghiasi, 2018). A selected number of randomly placed contiguous sections are removed from the image to create a modified image for training, as given by (DeVries and Taylor, 2017).
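A minimal NumPy sketch of this masking step is given below. It assumes square holes of a fixed side `length` that are clipped at the image border, in the style of DeVries and Taylor; the function name `cutout` and the default values are illustrative and are not taken from the experiments.

```python
import numpy as np

def cutout(image, n_holes=1, length=8, rng=None):
    """Zero out `n_holes` square regions of an H x W x C image.

    Hole centres are drawn uniformly over the image, so holes near the
    border are clipped and end up smaller than `length` x `length`.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    out = image.copy()  # leave the input image unchanged
    for _ in range(n_holes):
        cy = rng.integers(0, h)
        cx = rng.integers(0, w)
        y1, y2 = max(cy - length // 2, 0), min(cy + length // 2, h)
        x1, x2 = max(cx - length // 2, 0), min(cx + length // 2, w)
        out[y1:y2, x1:x2] = 0
    return out
```

As with the standard transformations, the target label of the image is not modified.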
Mix Up: A relatively newer strategy to implement data augmentation is to create a convex combination of data samples and their labels. The idea was first introduced in (Zhang et al., 2017), and the virtual data-label pair can be generated as follows:
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$$
where $\lambda \in [0, 1]$ is the mixture percentage of each image. This creates a target value that is a mix of the two original target values. By using an implicit bias that linear interpolations of data should lead to predictions that are linearly interpolated in the target space, Mix Up enables generation of well-calibrated models whose generalization performance is slightly better (Thulasidasan et al., 2019).
Cut Mix: Introduced by (Yun et al., 2019), Cut Mix was inspired by augmentation techniques that blend the classes of images (Yuji Tokozume, 2018), like Mix Up, and techniques that cut out regions of images, like Cut Out. The new virtual sample is given as:
$$\tilde{x} = M \odot x_i + (1 - M) \odot x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j$$
where $M$ is a binary mask that contains the information of where to drop and fill the image, $\odot$ denotes element-wise multiplication and $\lambda \sim \mathrm{Beta}(1, 1)$ is the combination ratio. The target label is generated similarly to Mix Up, but the authors claim that it improves upon both Cut Out and Mix Up by not removing informative pixels and by generalizing over more natural samples.
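The two mixed sample techniques can be sketched side by side. This is an illustrative NumPy version under stated assumptions: images are H x W x C arrays, labels are one-hot vectors, the Cut Mix box has an area ratio of roughly $1 - \lambda$ as in Yun et al., and the label weight is recomputed from the actual mask because the box may be clipped at the border. The function names are hypothetical.

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, lam):
    """Convex combination of two images and of their one-hot labels."""
    return lam * x_a + (1 - lam) * x_b, lam * y_a + (1 - lam) * y_b

def cutmix(x_a, y_a, x_b, y_b, lam, rng=None):
    """Paste a random box from x_b into x_a and mix labels by pixel share."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = x_a.shape[:2]
    # Box side lengths chosen so that the box covers about (1 - lam) of the image.
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)
    y1, y2 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x1, x2 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    mask = np.ones((h, w, 1))        # 1 keeps x_a, 0 takes the pixel from x_b
    mask[y1:y2, x1:x2] = 0
    mixed_x = mask * x_a + (1 - mask) * x_b
    lam_adj = mask.mean()            # share of pixels that still come from x_a
    mixed_y = lam_adj * y_a + (1 - lam_adj) * y_b
    return mixed_x, mixed_y
```

In use, `lam` would be drawn per sample or per batch, for example with `np.random.default_rng().beta(1.0, 1.0)`. Note that Mix Up changes every pixel, while Cut Mix keeps each pixel intact and only changes which image it comes from.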
We understand that none of these techniques can be considered pure VRM techniques in the absolute sense; they only capture the essence of VRM. However, these techniques are some of the most popular regularization techniques to be found in the state-of-the-art results for most Deep Learning tasks. Moreover, a lot of recent work attempts to understand the qualitative abilities of such techniques (He et al., 2019; Gontijo-Lopes et al., 2020).
3.2 Knowledge Distillation
As described above, we attempt to understand the impact of these augmentation strategies in a simple knowledge distillation setup, where a smaller model is trained by forcing it to mimic the tempered probability distributions of a larger, cumbersome model. When neural networks use a softmax output layer to convert the logit $z_i$ for an example into a probability score $p_i$, we can soften it using a temperature parameter $T$ in the following manner:
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
One can conveniently use unlabeled data to train a smaller model by depending completely on the cumbersome model’s softened outputs, or, if label information is available, combine both signals in a loss formulation that uses $T$ as the temperature parameter.
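The tempered softmax and a distillation loss of this kind can be sketched as below. The weighting convention is an assumption: the sketch follows the common Hinton-style form, with a weight `gamma` on the soft (teacher) term scaled by $T^2$ and a weight `1 - gamma` on the hard-label cross-entropy, and it is not confirmed to match the exact loss used in the experiments. The function names are hypothetical.

```python
import numpy as np

def softmax_T(logits, T=1.0):
    """Softmax over the last axis with temperature T (numerically stabilised)."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=20.0, gamma=0.5, eps=1e-12):
    """gamma * T^2 * KL(teacher || student) at temperature T
    plus (1 - gamma) * cross-entropy of the student against hard labels."""
    p_t = softmax_T(teacher_logits, T)
    p_s_soft = softmax_T(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + eps) - np.log(p_s_soft + eps)), axis=-1).mean()
    soft = (T ** 2) * kl
    p_s = softmax_T(student_logits, 1.0)
    hard = -np.log(p_s[np.arange(len(labels)), labels] + eps).mean()
    return gamma * soft + (1 - gamma) * hard
```

Setting `gamma=1.0` corresponds to the label-free case described above, where the student depends entirely on the teacher's softened outputs.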
3.3 Evaluation
After having trained a set of teacher and student models using the above-mentioned augmentation and distillation strategies, it is important to set up a well-defined and intuitive evaluation strategy that can explain the impairment in the transfer of distilled knowledge.
3.3.1 Test Datasets
A key aspect of measuring generalization performance is to analyse the performance of the models not only on unseen data, but also on test data with some natural variations in it. For this, we use the following data sets to analyse the performance of the models trained on CIFAR-10.
CIFAR-10: This is used to measure the performance of the model on unseen data lying within the seen distribution.
CINIC-10: This is used for the out-of-sample generalization test. This data set was collected by (Darlow et al., 2018) and contains both CIFAR-10 and ImageNet images in its test fold. However, we use only the 70,000 ImageNet images that have been bucketized into CIFAR-10 classes.
CIFAR-10H: This is the Human Labeled CIFAR Test Set, collected by (Peterson et al., 2019). It contains exactly the same images as the CIFAR-10 test set, but instead of one-hot encoded target labels, it provides probability distributions based on the labelings provided by the human annotators. This enables us to measure the closeness of a model’s predictions to human beliefs about a data sample.
3.4 Generalization Measures
We evaluate each model on the basis of standard confusion metrics, such as Accuracy, F1 Score, Precision and Recall. However, accuracy ignores the probability assigned to the predicted class label as well as the probabilities assigned to the incorrect classes. This information is key to understanding the degree of generalization that a model has achieved. Since we have access to the probability distributions for each CIFAR test sample, we can easily compare a model’s softmax score vector with the human-labeled distribution. To better understand this, note that in Figure 3, which represents the average confidence scores that humans have in each class, certain implicit patterns exist that reveal a more generalized story for a prediction. For instance, humans are more likely to confuse a dog with a cat than with a car. This generalization should also exist in the predicted models. This difference between model prediction and human-level confidence can be easily measured by a KL-Divergence between the model prediction and the human-labeled scores, as formulated below, where $q$ represents the model output and $p$ represents the human-labeled confidence scores.
$$D_{KL}(p \,\|\, q) = \sum_{c} p_c \log \frac{p_c}{q_c}$$
In essence, this is the cross-entropy loss between the model predictions and human labels, up to the entropy of the human labels, which does not depend on the model. However, it would also be interesting to note the KL-Divergences between different classes. Since it is not feasible to do that on a sample level, we compute the divergences between the averaged probability distributions for each ground-truth class, for both the model prediction scores and the human-labeled scores. To understand the quality of predictions, we also plot reliability diagrams of the different models and compare the model calibration using Expected Calibration Error, which is given as follows:
$$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|$$
where the $n$ predictions are grouped into $M$ confidence bins $B_m$, and $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ are the accuracy and the average confidence within bin $B_m$.
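Both measures are straightforward to compute from arrays of predicted probabilities. The sketch below is illustrative: the KL direction (human distribution first, model second) is an assumption chosen to match the cross-entropy reading above, the ECE follows the standard equal-width binning of Guo et al. (2017), and the function names and the choice of 10 bins are hypothetical.

```python
import numpy as np

def kl_divergence(human, model, eps=1e-12):
    """Per-sample KL(human || model) over the last axis."""
    human = np.asarray(human, dtype=float)
    model = np.asarray(model, dtype=float)
    return np.sum(human * (np.log(human + eps) - np.log(model + eps)), axis=-1)

def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between accuracy and confidence over confidence bins."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=-1)                    # confidence of the predicted class
    correct = (probs.argmax(axis=-1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap           # in_bin.mean() is |B_m| / n
    return ece
```

A model that predicts 0.9 confidence on every sample but is right only half of the time has an ECE of 0.4, whereas a model whose confidence matches its accuracy in every bin has an ECE of 0.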
We also propose a novel metric to explain the discriminative power of the different models we train. This operates on the penultimate layer embeddings generated by the model and can be thought of as a measure of how well-separated the embedding manifold is. It takes into account both the intra-class and the inter-class similarity and defines the discriminative power of the model as the difference between them. If intra-class similarity is high, class representations are cohesive and more compressed. If inter-class similarity is low, class representations are less adhesive and lie far away from each other. An optimal classifier will tend to have high cohesion and low adhesion, and thus higher discriminative power. To compute this metric, we first standardize the embeddings, and define the cohesion and adhesion metrics as the intra-class and inter-class similarities, respectively, using a cosine similarity function, where $N_c$ represents the number of instances in class $c$. The class discrimination can then be computed from the difference between the two, where $C$ represents the total number of classes and $d$ represents the dimensionality of the embeddings.
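One way to instantiate such a metric is sketched below. The exact normalization used for the reported numbers is not recoverable from the text, so this version makes explicit assumptions: cohesion is the mean cosine similarity over all pairs of distinct samples from the same class, adhesion is the mean over all pairs from different classes, and discrimination is their plain difference, without any scaling by the number of classes or the embedding dimensionality. The function name is hypothetical.

```python
import numpy as np

def class_discrimination(embeddings, labels, eps=1e-12):
    """Return (cohesion, adhesion, cohesion - adhesion) for penultimate-layer embeddings.

    Assumes every class has at least two samples and at least two classes exist.
    """
    z = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    # Standardize each embedding dimension, then L2-normalise each sample so
    # that dot products are cosine similarities.
    z = (z - z.mean(axis=0)) / (z.std(axis=0) + eps)
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
    sim = z @ z.T
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)   # ignore self-similarity
    cohesion = sim[same & off_diag].mean()        # intra-class similarity
    adhesion = sim[~same].mean()                  # inter-class similarity
    return cohesion, adhesion, cohesion - adhesion
```

Under this reading, tighter class clusters raise cohesion and clusters that point in different directions lower adhesion, so both effects increase the discrimination score.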
4 Experiments
Five unique experimental setups were created using the framework and evaluation techniques from above. Teacher models were trained using augmentation techniques and distilled into the student models. This was evaluated with different complexity student models, and an additional data set for comparison. Additionally, we also performed a small ablation study, where we compared the effect of our augmentation techniques on each step of the knowledge distillation process.
The models were implemented in PyTorch, with different parameter values for the different augmentation techniques. Knowledge distillation was performed with a temperature value of 20 and a gamma of 0.5. The baseline and standard data augmentation did not require explicit parameters. Mix Up was implemented with a beta value of 1, and Cut Out was implemented with 16 holes to be randomly placed in the images. Finally, Cut Mix used alpha and beta parameters of 0.5 and 1, respectively. These values were chosen based on the original paper implementations of the techniques, and could be fine-tuned in later work.
4.1 CIFAR-10 - Augmented ResNet Distilled to LeNet
4.1.1 Set Up
4.1.2 Results
4.2 CIFAR-10 - Augmented ResNet Distilled to AlexNet
4.2.1 Set Up
Another experiment distilled the ResNet18 teacher networks from above into a more complex student architecture, to evaluate whether its increased complexity would better capture the nuance of the teacher network’s generalization. The architecture chosen was AlexNet (Krizhevsky et al., 2012).
4.2.2 Results
This experiment reinforced the trend from the previous experiment: enhanced generalization did not lead to equally enhanced distillation, as can be seen in 1, 6. However, it is important to note that the augmentation techniques based on an assumed sampling distance, such as standard augmentation and Cut Out, outperformed the baseline model, while the augmentation techniques that incorporate VRM sampling, Mix Up and Cut Mix, still underperformed the baseline teacher-student model. It is possible that the larger AlexNet model was able to pick up on more nuance. Additionally, we propose that this difference between standard augmentation and Mixed Sample augmentation techniques is related to the varying interpolative nature of the techniques. This finding will be explored and further evaluated in the next section.
4.3 MNIST - Augmented AlexNet Distilled to LeNet
4.3.1 Set Up
This work also used the MNIST data set to validate these findings on another data set. For this validation, we compared only the baseline ERM technique with our Mixed Sample augmentation techniques, Mix Up and Cut Mix. This was because the standard augmentation techniques, as we defined them, have been shown to inhibit MNIST training (He et al., 2019). We trained three teacher networks using the AlexNet architecture on the MNIST data using these augmentation techniques. Information from these models was then distilled into baseline LeNet models, which were then evaluated.
4.3.2 Results
The results can be seen in the table below. While the accuracies are very close to one another, the student distilled from the baseline teacher reaches a noticeably lower validation loss than the students distilled from the Mix Up and Cut Mix teachers. This validation was done with fairly simple models and a simple data set, and it would be interesting to explore this concept with more complex models and data sets in future work.
Models  AlexNet Teacher Accuracy  AlexNet Teacher Loss  LeNet Student Accuracy  LeNet Student Loss
Baseline  0.9958  0.020  0.9868  0.039
Mixup  0.9960  0.010  0.9886  0.074
Cutmix  0.9938  0.104  0.9861  0.107
4.4 CIFAR-10 - Ablation Study - ResNet Distilled to LeNet
4.4.1 Set Up
To further explore the effect of data augmentation on distillation, we evaluated how augmentation impacted each step of the distillation process. This evaluation was done with ResNet18 teachers and LeNet students, with one of the outcomes (Augmented Teacher  Student) being the first experiment described above. We also evaluated the set ups of a Baseline ResNet teacher distilled into LeNet students trained with augmentation techniques. Finally, we evaluated augmented ResNet teachers distilled into augmented LeNet students.
4.4.2 Results
This study showed that, no matter the setup, data augmentation techniques hurt the performance of the student models. We suspect that much of this has to do with the interpolation of the latent space by the augmentation techniques. These results can be seen in 6, 8, 7. In our previous experiments, we found that the teacher manifold is the key driver in distillation, so it is no surprise that the setups with augmented teachers performed poorly. It is interesting, however, that adding augmentation to the student networks also inhibited distillation. More work with more complex student networks may be needed to further understand the impact of augmentation on student networks.
5 Explaining the Dichotomy
It is clear from Table 1 and from Figure 1 that when simpler models are distilled from cumbersome models trained with Data Augmentation, generalization performance is generally impaired. This is more pronounced for mixed sample data augmentation strategies like Mix Up and Cut Mix. This reversal of behavior is quite counterintuitive, as we expect better generalized teacher models to transfer their capabilities to their respective students. We try to explain this in the following ways, with varying degrees of success.
Latent Space Representations: We visualize the latent representations generated under the influence of these different augmentation strategies. The teacher model representations, as seen in Figure 4, depict a clearer picture that can help explain this reversal in performance. We plot the t-SNE-reduced embeddings (Maaten and Hinton, 2008) from the penultimate layer of the teacher ResNet models for four different classes in the CIFAR-10 Test Set. We chose two pairs of classes such that the classes within each pair are semantically similar, while the two pairs are semantically different from each other. Shown in orange and green are two semantically similar classes that represent vehicles (cars and trucks), whereas shown in blue and red is another pair of semantically similar classes (cat and bird), which are different from vehicles. This combination enables us to analyse relationships better, as a well-generalized model should be able not only to form correct clusters, but also to cluster similar classes near one another. Plotting all 10 classes might lead to indistinguishable inter-class distances, and so we focus on these representations. We note that for Cut Mix, which interpolates both images and labels, the clusters are much more compressed and the semantic groups lie much farther away from each other than under any other strategy. Moreover, there is little interaction between the classes, as few points lie between the clusters. On the other hand, the baseline model manifold is much more uniform: many points lie on the boundaries of the class clusters, and there is a gradual change in the representational capability of the model between the classes. The two semantic groups are closer in the baseline than under any augmentation strategy. This separability between classes and semantic groups also exists in Cutout and transformation-based augmentations, but is not as pronounced as in the interpolative Cut Mix.
Class Separability: To quantify the behavior observed in the latent space visualizations, we propose the use of a very basic class-separability metric that simply measures the difference in a model’s confidence structure for different classes. In an ideal scenario where the model is supremely confident about each image, each class’s confidence distribution is well separated; this can theoretically happen if a model is trained for long enough using gradient descent. In real scenarios, however, models also assign probabilities to incorrect classes and, as claimed by (Hinton et al., 2015), this relative probability structure is crucial to generalization performance during distillation. The class-separability metric is formulated in terms of $\bar{p}_c$, the average model prediction distribution for class $c$, and $C$, the number of classes.
Models  Teacher  LeNet  AlexNet
Baseline  3.73  3.21  3.07
Augment  5.19  3.14  3.23
Cutout  4.30  3.14  3.41
Mixup  5.32  2.92  2.17
Cutmix  5.33  2.89  2.27
Class Discrimination: Accuracy can be a deceptive metric for evaluating generalization performance, as it depends on the one-hot representation of a prediction probability distribution. Similarly, the Similarity Score formulation we use above can be made even more robust by considering the latent space embeddings instead of the probability structure. Taking inspiration from Discriminant Analysis techniques, we develop a novel metric that better explains the model behavior in the current setup. This metric, called class discrimination, has been explained in Section 3.4 and involves a joint optimization of both intra-class similarity and inter-class similarity objectives in a latent space. We compute the class-discrimination, intra-class similarity and inter-class similarity metrics for the different models and find that data augmentation acts as a regularizer by creating more discriminant models. In simpler terms, a model with loose yet well-separated class representations (low intra-class similarity, low inter-class similarity) will make a better teacher than a model with comparatively tighter and well-separated class representations (high intra-class similarity, low inter-class similarity), while not losing out on original accuracy performance.
Models  Cohesion  Adhesion  Discrimination
Baseline  0.739  0.041  0.246
Augment  0.886  0.049  0.296
Cutout  0.783  0.043  0.261
Mixup  0.793  0.042  0.255
Cutmix  0.917  0.050  0.306
As is evident from the above analysis, we can attempt to explain the adversarial impact of such augmentation strategies on distillation in terms of the discriminative power that these models gain during training. This is consistent with the ideas presented in (He et al., 2019), wherein data augmentation strategies are believed to be regularization strategies that focus more on class-specific major features, while regularizing the example-specific minor features. This regularization is more pronounced in interpolative techniques like Cut Mix and Mix Up, as we believe they do not really add any new information to the model. By combining class-specific labels and corresponding images, they regularize finer, example-specific information and generate overall class-specific concepts as features. This is great for the model’s performance on test data and on data from a slightly different distribution, but distillation quality depends largely on the amount of information encoded in all the latent features learned by the teacher model. Thus, if the generated features within any given class have greater variance, they are able to encode more information about the class’s relationship with other classes and are expected to generate more generalized probability distributions. This is a key factor in generating better quality students. It also helps explain the superior performance of Cutout and standard augmentation, which add new information by retaining the same label and transforming the image.
Prediction Quality: To test this idea of generating high quality probability distributions, we make use of the CIFAR10 Human labeled datasets, and measure the quality of the average model confidence distributions against the human confidence estimates. This can again be represented as a KLDivergence between the probability distributions of a model’s average prediction probabilities for a given class against the average human distribution. We present this information in a Confusion Matrix like visualization in 6. Each cell represents the KLDivergence between the human estimate on that class and the model’s estimate. The KLDivergences are all scaled uniformly and are color coded, so higher values correspond to brighter pixel values. We are more interested in the nondiagonal elements as they reveal the mutual information between classes and note that the Cut Mix Matrix is much brighter in those pixels when compared to the Baseline. This points out the fact that the mutual information between different classes is better encoded in the Baseline than Cut Mixtrained model. This enables the creation of a superior representation manifold for distillation to take place.
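The class-averaged comparison can be sketched as a small matrix computation. This is an illustrative reading, not the code behind the figure: it assumes that cell $(i, j)$ holds the KL-Divergence from the average human distribution for ground-truth class $i$ to the average model distribution for ground-truth class $j$, that every class has at least one sample, and the function names are hypothetical.

```python
import numpy as np

def class_average(probs, labels, n_classes):
    """Average probability vector per ground-truth class, shape (C, C)."""
    return np.stack([probs[labels == c].mean(axis=0) for c in range(n_classes)])

def kl_matrix(human_probs, model_probs, labels, n_classes, eps=1e-12):
    """Matrix whose (i, j) entry is KL(human average for class i || model average for class j)."""
    labels = np.asarray(labels)
    h = class_average(np.asarray(human_probs, dtype=float), labels, n_classes)
    m = class_average(np.asarray(model_probs, dtype=float), labels, n_classes)
    # Broadcast to (C, C, C): human class i against model class j, summed over outputs.
    log_ratio = np.log(h[:, None, :] + eps) - np.log(m[None, :, :] + eps)
    return np.sum(h[:, None, :] * log_ratio, axis=-1)
```

With this layout, the diagonal measures how closely the model matches humans on each class, while the off-diagonal entries reflect how the model's class-level confidence structure relates to the human one across classes.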
Model Calibration: We also measure calibration performance across the models and note that interpolative techniques consistently produce better-calibrated models, for both teachers and students. This is evident from the reliability diagrams in Figure 5 and the Expected Calibration Error metrics in 1 and 7. However, it has been hypothesized (Guo et al., 2017) that smaller models like LeNet tend to be better calibrated than overparameterized modern models like ResNet. This trend is visible here: the reliability curves hug the ideal diagonal more closely as model complexity decreases. No direct relationship, however, can be found between the calibration performance of a student and the augmentation applied to its teacher model.
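For reference, the Expected Calibration Error can be sketched as a weighted average, over confidence bins, of the gap between accuracy and mean confidence. This is a minimal version assuming ten equal-width bins and half-open bin edges; the function name and the toy inputs are hypothetical, and the binning used in the experiments may differ.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls in the last bin.
        index = min(int(conf * num_bins), num_bins - 1)
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical predictions: always 90% confident, right 9 times out of 10.
well_calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# Hypothetical predictions: fully confident, right only half the time.
overconfident = expected_calibration_error([1.0, 1.0], [True, False])
```

A perfectly calibrated model yields a value near zero, which corresponds to a reliability curve lying on the diagonal.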
Model Generalization
Generalization is often defined in machine learning as a model's ability to adapt to and classify unseen data. However, in our analysis of the latent representation space, we found that as these techniques increase intraclass discrimination, they may inhibit the model's ability to produce probability distributions that carry informative generalization for distillation into a student model.
We analyzed this concept further using confusion matrices that compare the predicted label probabilities with the actual classes. Both the baseline and Mix Up models classify well overall, with a strong diagonal in the unmasked confusion matrices. When that diagonal is masked, however, the probabilities assigned to the classes not chosen are far more spread out in the baseline model. These can be compared to Figure 3, which visualizes the human-generated predicted label probabilities.
It is important to note that these matrices use different color scales, with the Masked Baseline going up to 0.1 and the Masked Cut Mix up to 0.07. Both matrices pick up on some of the nuances in the data, but the masked baseline shows more generalization between classes, with brighter colors between related classes. This ability to create an informative probability distribution across class types is the backbone of effective knowledge distillation.
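The masked matrices discussed above can be sketched as follows, assuming each row is the average predicted probability vector over all examples of one true class, with the diagonal set to zero so that the off-diagonal mass becomes visible. The function name and the toy predictions are hypothetical.

```python
def masked_probability_matrix(probs, labels, num_classes):
    """Average predicted probabilities per true class, with the diagonal zeroed."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for p, y in zip(probs, labels):
        counts[y] += 1
        for j, value in enumerate(p):
            sums[y][j] += value

    matrix = [[v / max(counts[i], 1) for v in row] for i, row in enumerate(sums)]
    for i in range(num_classes):
        matrix[i][i] = 0.0  # mask the predicted-equals-true entries
    return matrix


# Hypothetical softmax outputs for three examples over three classes.
probs = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
labels = [0, 0, 1]
masked = masked_probability_matrix(probs, labels, 3)
color_scale_max = max(max(row) for row in masked)
```

The largest off-diagonal entry, `color_scale_max` here, is what sets the upper end of the color scale, which is why the two masked matrices have different ranges.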
6 Future Work
Moving ahead, the regularization effect of mixed data augmentation strategies like Cut Mix and Mix Up should be quantified. An extension of this work would be to develop a novel augmentation strategy that retains or improves the qualities of the latent embedding manifold for a given data set and consequently generates more generalized model predictions. A key limitation of this study is that the training data set used to distill student models is the same as the one used to train the teacher models; a set of experiments could be conducted on a held-out distillation data set. Finally, the relative performance of these models under adversarial attack would be another interesting measure of true generalization performance to analyse.
References
 Improving generalization and robustness with noisy collaboration in knowledge distillation. External Links: 1910.05057 Cited by: §2.
 Weight uncertainty in neural networks. External Links: 1505.05424 Cited by: §2.
 Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. Cited by: §2.
 A group-theoretic framework for data augmentation. External Links: 1907.10905 Cited by: §2.

 On the efficacy of knowledge distillation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 4793–4801. Cited by: §2.
 CINIC10 is not imagenet or cifar10. External Links: 1810.03505 Cited by: §3.3.1.

 Improved regularization of convolutional neural networks with cutout. External Links: 1708.04552 Cited by: §1, §3.1.
 Born again neural networks. External Links: 1805.04770 Cited by: §2.
 DropBlock: a regularization method for convolutional networks. External Links: 1810.12890 Cited by: §3.1.
 Affinity and diversity: quantifying mechanisms of data augmentation. External Links: 2002.08973 Cited by: §3.1.
 On calibration of modern neural networks. External Links: 1706.04599 Cited by: §5.
 Gradient-based learning applied to document recognition. Proceedings of the IEEE. Cited by: §4.1.1.
 Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. External Links: 1510.00149 Cited by: §1.
 Understanding and enhancing mixed sample data augmentation. External Links: 2002.12047 Cited by: §1, §2.
 Deep residual learning for image recognition. External Links: 1512.03385 Cited by: §4.1.1.
 Data augmentation revisited: rethinking the distribution gap between clean and augmented data. External Links: 1909.09148 Cited by: §2, §3.1, §4.3.1, §5.

 Knowledge distillation with adversarial samples supporting decision boundary. Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 3771–3778. External Links: ISSN 21595399, Link, Document Cited by: §2.
 Data augmentation instead of explicit regularization. External Links: 1806.03852 Cited by: §3.1.
 Distilling the knowledge in a neural network. External Links: 1503.02531 Cited by: §2, §5.
 Batch normalization: accelerating deep network training by reducing internal covariate shift. External Links: 1502.03167 Cited by: §2.
 The lottery ticket hypothesis: finding sparse, trainable neural networks. External Links: 1803.03635 Cited by: §1.
 ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. External Links: Link Cited by: §4.2.1.
 Learning from noisy labels with distillation. 2017 IEEE International Conference on Computer Vision (ICCV). External Links: ISBN 9781538610329, Link, Document Cited by: §2.
 Visualizing data using t-SNE. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §5.
 When does label smoothing help?. External Links: 1906.02629 Cited by: §2.
 Implicit regularization in deep learning. arXiv preprint arXiv:1709.01953. Cited by: §1, §2.
 Human uncertainty makes classification more robust. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). External Links: ISBN 9781728148038, Link, Document Cited by: §2, §3.3.1.
 Towards understanding knowledge distillation. In International Conference on Machine Learning, pp. 5142–5151. Cited by: §2.

 Transformation invariance in pattern recognition - tangent distance and tangent propagation. In Neural Networks: Tricks of the Trade, Cited by: §2.
 Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, pp. 1929–1958. Cited by: §2.
 Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (56), pp. 1929–1958. External Links: Link Cited by: §3.1.
 Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. External Links: 1703.01780 Cited by: §2.
 On mixup training: improved calibration and predictive uncertainty for deep neural networks. External Links: 1905.11001 Cited by: §2, §3.1.

 Robustness may be at odds with accuracy. External Links: 1805.12152 Cited by: §2.
 The vicinal risk minimization principle and the svms. In The nature of statistical learning theory, pp. 267–290. Cited by: §1, §2, §3.1.
 Self-training with noisy student improves imagenet classification. External Links: 1911.04252 Cited by: §2.
 Between-class learning for image classification. External Links: 1711.10284 Cited by: §3.1.
 CutMix: regularization strategy to train strong classifiers with localizable features. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). External Links: ISBN 9781728148038, Link, Document Cited by: §1, §3.1.
 Mixup: beyond empirical risk minimization. External Links: 1710.09412 Cited by: §1, §3.1.
 On the convergence of mirror descent beyond stochastic convex programming. External Links: 1706.05681 Cited by: §1.
Appendix A Appendix Figures
Augmentation  Model    Test Set  Accuracy  Precision  Recall  F1-Score
Baseline      LeNet    CIFAR10   0.652     0.648      0.652   0.650
Baseline      LeNet    CINIC     0.456     0.452      0.457   0.449
Baseline      AlexNet  CIFAR10   0.768     0.767      0.769   0.768
Baseline      AlexNet  CINIC     0.544     0.550      0.544   0.529
Augmentation  LeNet    CIFAR10   0.631     0.632      0.631   0.474
Augmentation  LeNet    CINIC     0.439     0.434      0.435   0.435
Augmentation  AlexNet  CIFAR10   0.777     0.776      0.777   0.766
Augmentation  AlexNet  CINIC     0.549     0.557      0.549   0.535
Cut Out       LeNet    CIFAR10   0.644     0.643      0.644   0.643
Cut Out       LeNet    CINIC     0.451     0.447      0.451   0.445
Cut Out       AlexNet  CIFAR10   0.785     0.783      0.785   0.784
Cut Out       AlexNet  CINIC     0.554     0.562      0.554   0.540
Mix Up        LeNet    CIFAR10   0.633     0.631      0.633   0.632
Mix Up        LeNet    CINIC     0.444     0.435      0.443   0.433
Mix Up        AlexNet  CIFAR10   0.714     0.712      0.714   0.712
Mix Up        AlexNet  CINIC     0.497     0.501      0.498   0.477
Cut Mix       LeNet    CIFAR10   0.621     0.617      0.621   0.619
Cut Mix       LeNet    CINIC     0.439     0.434      0.439   0.435
Cut Mix       AlexNet  CIFAR10   0.720     0.717      0.720   0.718
Cut Mix       AlexNet  CINIC     0.498     0.502      0.498   0.478

Augmentation  Model  Test Set  Accuracy  Precision  Recall  F1-Score
Baseline      LeNet  CIFAR10   0.652     0.648      0.652   0.650
Baseline      LeNet  CINIC     0.456     0.452      0.457   0.449
Augmentation  LeNet  CIFAR10   0.641     0.639      0.641   0.639
Augmentation  LeNet  CINIC     0.451     0.446      0.451   0.444
Cut Out       LeNet  CIFAR10   0.635     0.630      0.635   0.632
Cut Out       LeNet  CINIC     0.446     0.446      0.439   0.441
Mix Up        LeNet  CIFAR10   0.647     0.645      0.647   0.647
Mix Up        LeNet  CINIC     0.457     0.455      0.457   0.452
Cut Mix       LeNet  CIFAR10   0.629     0.626      0.629   0.627
Cut Mix       LeNet  CINIC     0.439     0.433      0.439   0.434

Augmentation  Model  Test Set  Accuracy  Precision  Recall  F1-Score
Baseline      LeNet  CIFAR10   0.652     0.648      0.652   0.650
Baseline      LeNet  CINIC     0.456     0.452      0.457   0.449
Augmentation  LeNet  CIFAR10   0.641     0.637      0.641   0.638
Augmentation  LeNet  CINIC     0.445     0.441      0.445   0.441
Cut Out       LeNet  CIFAR10   0.636     0.643      0.636   0.638
Cut Out       LeNet  CINIC     0.440     0.451      0.440   0.440
Mix Up        LeNet  CIFAR10   0.626     0.622      0.626   0.624
Mix Up        LeNet  CINIC     0.440     0.437      0.440   0.431
Cut Mix       LeNet  CIFAR10   0.635     0.631      0.635   0.633
Cut Mix       LeNet  CINIC     0.446     0.443      0.446   0.441

Augmentation  Model   Test Set  Accuracy  Precision  Recall  F1-Score
Baseline      ResNet  CIFAR10   0.852     0.851      0.852   0.851
Baseline      ResNet  CINIC     0.600     0.602      0.600   0.593
Augmentation  ResNet  CIFAR10   0.940     0.940      0.940   0.940
Augmentation  ResNet  CINIC     0.630     0.620      0.628   0.629
Cut Out       ResNet  CIFAR10   0.879     0.878      0.880   0.879
Cut Out       ResNet  CINIC     0.629     0.620      0.628   0.630
Mix Up        ResNet  CIFAR10   0.867     0.954      0.867   0.953
Mix Up        ResNet  CINIC     0.603     0.606      0.603   0.598
Cut Mix       ResNet  CIFAR10   0.954     0.869      0.954   0.868
Cut Mix       ResNet  CINIC     0.710     0.707      0.720   0.710