Robust Student Network Learning

07/30/2018 ∙ by Tianyu Guo, et al. ∙ The University of Sydney ∙ Peking University

Deep neural networks achieve impressive accuracy in various applications, but this success often relies on heavy network architectures. Taking well-trained heavy networks as teachers, the classical teacher-student learning paradigm aims to learn a student network that is lightweight yet accurate. In this way, a portable student network with significantly fewer parameters can achieve accuracy comparable to that of the teacher network. However, beyond accuracy, robustness of the learned student network against perturbation is also essential for practical use. Existing teacher-student learning frameworks mainly focus on accuracy and compression ratio, but ignore robustness. In this paper, we make the student network produce more confident predictions with the help of the teacher network, and analyze the lower bound of the perturbation that will destroy the confidence of the student network. Two objectives regarding prediction scores and gradients of examples are developed to maximize this lower bound, so as to enhance the robustness of the student network without sacrificing performance. Experiments on benchmark datasets demonstrate the efficiency of the proposed approach in learning robust student networks with satisfactory accuracy and compact size.




I Introduction

Recent years have witnessed marked progress in deep learning. Since the breakthrough in the 2012 ImageNet competition [1] achieved by AlexNet [2] using five convolutional layers and three fully connected layers, a series of more advanced deep neural networks have been developed to keep rewriting the record, e.g., VGGNet [3], GoogLeNet [4], and ResNet [5]. However, their excellent performance requires a huge amount of computation. For instance, AlexNet [2] contains about 60 million parameters (roughly 232 MB when stored as 32-bit floats) and needs a large number of multiplications to process a single image. Hence, the potential power of deep neural networks can only be fully unlocked on high-performance GPU servers or clusters. In contrast, the majority of mobile devices used in daily life have strict constraints on storage and computational resources, which prevents them from taking full advantage of deep neural networks. As a result, networks with lower hardware demands that still maintain similar accuracy are of great interest to the image processing and computer vision community.

Convolutional neural networks can be compressed by vector quantization [6], decomposition of weight matrices [7], and encoding with hashing tricks [8]. The same goal can be achieved by pruning subtle weights [9, 10], reducing the redundancy between weights in the frequency domain [11], and using binary networks [12, 13]. Another straightforward approach is to design a compact network directly, e.g., ResNeXt [14], the Xception network [15], and MobileNets [16]. These networks are often deep and thin with fewer parameters in each layer, and their non-linearity is strengthened by increasing the number of layers, which helps maintain the performance of the network.

The student-teacher learning framework, introduced in knowledge distillation (KD) [17], is one of the most popular approaches to model compression and acceleration [12, 11]. Taking a heavy neural network, such as GoogLeNet [4] or ResNet [5], that has already been well trained with massive data and computing resources as the teacher network, a student network with a light architecture can be learned better under the teacher's guidance. To inherit the advantages of teacher networks, different methods have been proposed to encourage consistency between the teacher and student networks. For example, Ba and Caruana [18] minimized the Euclidean distance between features extracted from the two networks, Hinton et al. [17] encouraged the student to mimic a softened version of the teacher's output, and FitNet [19] introduced intermediate-level hints from the teacher's hidden layers to guide the training of the student. McClure and Kriegeskorte [20] proposed to preserve the pairwise distances between examples across the student and teacher networks. You et al. [21] utilized multiple teacher networks to guide the training of the student network. Wang et al. [22] introduced a teaching assistant to encourage similarity between the distributions of feature maps extracted from the teacher and student networks.

Fig. 1: Framework of the proposed algorithm. The constraint imposed on the output ensures that the student network has higher confidence in its prediction than the teacher network. The constraint imposed on the gradients encourages the student network to preserve its confidence in the prediction when the data are perturbed. $p^k(x)$ represents the network's prediction for the ground-truth label $k$ of input $x$.

The aforementioned algorithms have achieved impressive experimental results. However, they were mainly developed for ideal scenarios, where all data are implicitly assumed to be clean. In practice, given examples with perturbations, the training process of the network can be seriously influenced, and the resulting network would not be as confident as before in its predictions. The teacher network might also make mistakes, since it is difficult for the teacher to be familiar with all examples fed into the student network. This is consistent with student-teacher learning in the real world: an excellent student is expected to solve practical problems in changing circumstances, where some questions might be unknown even to the teachers.

To solve this problem, we introduce a robust teacher-student learning algorithm. The framework of the proposed method is illustrated in Figure 1. We enable the student network to be more confident in its predictions with the help of the teacher network. Since perturbations on examples might seriously influence the learning of the student network, we derive, through a rigorous theoretical analysis, the lower bound of the perturbation that can make the student more vulnerable than the teacher. New objectives in terms of prediction scores and gradients of examples are then developed to maximize this lower bound. Hence, the overall robustness of the student network against perturbations on examples can be improved. Experimental results on benchmark datasets demonstrate the superiority of the proposed method for learning compact and robust deep neural networks.

The rest of the paper is organized as follows. Section II summarizes related work on learning convolutional neural networks with fewer parameters. Section III introduces the previous work on which our method is based. Section IV formally introduces our robust student network learning method in detail, including the proof of the proposed theorem, the calculation of the loss function, and the training strategy. Section V reports results of our algorithm on various benchmark datasets to demonstrate the effectiveness of the proposed method. Section VI concludes the paper.

II Related Works

In this section, we briefly introduce related work on learning efficient convolutional neural networks with fewer parameters. We group the methods into categories according to their techniques and motivations.

II-A Network Trimming

Network trimming aims to remove redundancy in heavy networks to obtain a compact network with fewer parameters and lower computational complexity, while keeping the accuracy of the portable network close to that of the original large model. Gong et al. [6] utilized vector quantization to compress neural networks, introducing a cluster center of weights as the representation of similar weights. Denton et al. [7] applied singular value decomposition to the weight matrix of a fully-connected layer to reduce the number of parameters. Chen et al. [8] explored hash encoding to improve the compression ratio. Courbariaux et al. [12] and Rastegari et al. [13] implemented binary networks, in which all weights previously stored as 32-bit floating-point values are converted to binary values ($-1$ or $+1$). Moreover, Wang et al. [11] and Han et al. [9] exploited weight pruning to achieve the same goal. In particular, Han et al. [9] focused on removing subtle weights to reduce the number of parameters while minimizing the impact of removing them; over 80% of the subtle weights were dropped without an accuracy drop. Furthermore, Han et al. [10] integrated several neural network compression techniques, i.e., pruning, quantization, and Huffman coding, to further compress the network. Wang et al. [11] showed that redundancy exists not only in subtle weights but also in large weights. They converted convolutional kernels into the frequency domain to reduce the redundancy contained in larger weights and thereby compressed networks with a higher compression ratio. In addition, Wang et al. [23] focused on the redundancy in feature maps instead of network weights, which can also be considered a modification of the network architecture. Although network trimming brings considerable compression and speedup ratios, the actual acceleration often depends heavily on customized hardware, due to the highly sparse parameters and irregular network architectures.

II-B Design Small Networks

Directly designing a new deep neural network of light size is a straightforward approach to efficient deep learning. Most of these methods increase the depth of networks with much lower complexity than simply stacking convolutional layers. For example, ResNet [5] introduced a novel residual block that obtained significant performance with only slightly higher computation costs. ResNeXt [14] introduced group convolutions into the building blocks to boost performance. Flattened networks [24] introduced fully factorized convolutions and designed an extremely factorized network. Almost at the same time, Factorized Networks [25] introduced topological convolution that treats sections of tensors separately. SqueezeNet [26] designed a portable network with a bottleneck architecture. SENet [27] proposed a novel architecture named the SE block, which focuses on the relationship between channels. Moreover, [28] introduced depth-wise separable convolutions to obtain a great gain in the speed and size of networks. With the help of depth-wise separable convolutions, Inception models [29, 30] reduced the complexity of the first few layers of the network. Later, the Xception network [15] outperformed the Inception model by scaling up depth-wise separable convolutional filters. Subsequently, MobileNets [16] combined channel-wise decomposition of convolutional filters with depth-wise separable convolutions and achieved state-of-the-art results among portable models. ShuffleNet [31] introduced a novel form of group convolution and depth-wise separable convolution. Deep fried convnets [32] introduced a novel Adaptive Fastfood transform to reduce the computation of networks. Structured transform networks [33] offered considerable accuracy-compactness-speed tradeoffs based on new notions rooted in the theory of structured matrices.

II-C Teacher-Student Learning

There is another way to train a portable network: regard the trained network as a teacher and a deeper yet thinner network as a student. With the help of the intrinsic information captured by the teacher network, the deeper and thinner student network can be trained well. Ba and Caruana [18] suggested that the student network mimic the features extracted from the last layer of the teacher network to assist the training of the student, thereby making it possible to increase the depth of the student network. Knowledge Distillation (KD) [17] pointed out that for two networks with huge structural differences, it is difficult to mimic features directly. Therefore, KD [17] proposes to minimize the difference between the relaxed outputs of the softmax layers of the two networks. This strategy can further deepen the student network. FitNet [19], based on KD, minimized the difference between the features extracted from the middle layers of the student and teacher networks. They added several MLP layers at the middle layer of the teacher network to match the dimensions of the features of the student network. By establishing a connection between the middle layers of the two networks, the student network can be further deepened with fewer parameters. McClure and Kriegeskorte [20] attempted to minimize the distance between pairs of samples to reduce the difficulty of training student networks. You et al. [21] proposed utilizing multiple teacher networks to provide more guidance for the training of student networks, leveraging a voting strategy to balance the guidance from each teacher network. Wang et al. [22] regarded the student network as the generator of a GAN [34], and utilized a discriminator as an assistant of the teacher to force the student to generate features that are difficult to distinguish from those of the teacher.

Compared to network trimming algorithms, the student-teacher learning framework has more flexibility, no special requirements on hardware, and a more regular network structure. Compared to directly designing a deeper network, guidance from the teacher is beneficial for learning deep networks and improving the performance of the student. However, existing student-teacher algorithms focus on improving the performance of the student network on clean data sets. The performance degradation under perturbation, caused by the instability that follows a large reduction in parameters, has not yet been studied. Therefore, a more robust learning algorithm for improving student network performance under perturbed conditions needs to be developed. This paper proposes a method under the teacher-student learning and knowledge distillation framework that enhances the robustness of the student network.

III Preliminary of Teacher-Student Learning

To make this paper self-contained, we briefly introduce some preliminary knowledge of teacher-student learning here.

The teacher network $\mathcal{N}_T$ has a complicated architecture, and it has already been well trained to achieve sufficiently high performance. We aim to learn a student network $\mathcal{N}_S$, which is deeper yet thinner than the teacher network but has a lower yet satisfying accuracy. Let $\mathcal{X}$ be the example space and $\mathcal{Y}$ be its corresponding $K$-label space. The outputs of these two networks are defined as:

$$p_T = \mathrm{softmax}(a_T), \qquad p_S = \mathrm{softmax}(a_S), \tag{1}$$

where $a_T$ and $a_S$ are the features produced by the pre-softmax layers of the teacher and student networks, respectively.

The teacher network is usually trained on a relatively large dataset and consists of a large number of parameters, so it usually achieves high accuracy in classification tasks. Given significantly fewer parameters and multiplication operations, it is difficult for the student network to achieve high performance if it adopts the same training strategy as the teacher network. It is therefore necessary to improve the student network's performance with the assistance of the teacher network. A straightforward method is to encourage the features of an image extracted by the two networks to be similar [18]. The objective function can be written as

$$\mathcal{L}(\mathcal{N}_S) = \frac{1}{n} \sum_{i=1}^{n} \Big[ \mathcal{H}\big(p_S(x_i), y_i\big) + \lambda \big\| a_T(x_i) - a_S(x_i) \big\|_2^2 \Big], \tag{2}$$

where the second term helps the student network to extract knowledge from the teacher, $\mathcal{H}$ refers to the cross-entropy loss, $n$ is the number of examples, $p_S(x_i)$ indicates the output of the $i$-th example in $\mathcal{X}$ by the student network, $y_i$ refers to the corresponding label, and $\lambda$ is the coefficient that balances the two terms. The teacher and student networks can be significantly different in architecture, and thus it is difficult to expect the features extracted by the two networks for the same example to be the same. Hence, Knowledge Distillation (KD) [17] was proposed as an effective alternative that distills knowledge from classification results by minimizing

$$\mathcal{L}_{KD}(\mathcal{N}_S) = \frac{1}{n} \sum_{i=1}^{n} \Big[ \mathcal{H}\big(p_S(x_i), y_i\big) + \lambda \, \mathcal{H}\big(\tau(p_S(x_i)), \tau(p_T(x_i))\big) \Big], \tag{3}$$

where the second term enforces the student network to learn from the softened output of the teacher network. $\tau(\cdot)$ is a relaxation function defined as follows:

$$\tau(p_T) = \mathrm{softmax}\!\left(\frac{a_T}{t}\right), \qquad \tau(p_S) = \mathrm{softmax}\!\left(\frac{a_S}{t}\right), \tag{4}$$

where $t$ is a temperature parameter, and larger values of $t$ produce softer outputs. $\tau(\cdot)$ is introduced to make sure that the second term in equation (3) plays a different role from the first one. This is because $p_T$ might be extremely similar to the one-hot representation of the ground-truth labels, while a softened version of the output differs from the true labels. Moreover, the softened output also provides more information to guide the learning of the student, since it enhances the influence of classes other than the true label.
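The effect of the relaxation function can be illustrated with a small numeric sketch. The code below is not taken from the paper: it assumes the standard temperature-scaled softmax for the relaxation, and the names `softmax`, `cross_entropy`, `kd_loss`, the logits, and the values of `lam` and `temperature` are all hypothetical choices made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of pre-softmax features, optionally softened by a temperature."""
    scaled = [a / temperature for a in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(predicted, target, eps=1e-12):
    """H(predicted, target) = -sum_j target_j * log(predicted_j)."""
    return -sum(t * math.log(p + eps) for p, t in zip(predicted, target))

def kd_loss(student_logits, teacher_logits, label, lam=1.0, temperature=4.0):
    """Hard-label cross-entropy plus lam times the softened teacher-student term."""
    one_hot = [1.0 if j == label else 0.0 for j in range(len(student_logits))]
    hard = cross_entropy(softmax(student_logits), one_hot)
    soft = cross_entropy(softmax(student_logits, temperature),
                         softmax(teacher_logits, temperature))
    return hard + lam * soft

# Illustrative (made-up) pre-softmax features for a 3-class problem.
teacher_logits = [8.0, 2.0, 1.0]
student_logits = [3.0, 1.5, 0.5]

sharp = softmax(teacher_logits)                   # close to a one-hot vector
soft = softmax(teacher_logits, temperature=4.0)   # mass spread to other classes
loss = kd_loss(student_logits, teacher_logits, label=0)
```

With these made-up logits, the unsoftened teacher output puts almost all of its mass on class 0, while the softened output assigns visibly more probability to the other classes, which is the extra information the second term of the KD loss passes to the student.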

Although the KD loss in equation (3) allows the student network to access the knowledge of the teacher network, the significant reduction in the number of parameters decreases the capability of the student network and makes it more vulnerable to input disturbances. The learned student network might achieve reasonable performance on clean data, but it would suffer a serious performance decline when encountering perturbed data in real-world applications. To solve this issue, it is necessary to enforce the robustness of the student network when it is applied to practical scenarios.

IV Robust Student Network Learning

We take a multi-class classification problem over $K$ classes as an example to introduce our robust student network learning. Given a teacher network $\mathcal{N}_T$ and a student network $\mathcal{N}_S$, an example $x \in \mathbb{R}^d$ can be classified by the two networks as $p_T(x) = \mathcal{N}_T(x)$ and $p_S(x) = \mathcal{N}_S(x)$, respectively. Denote $p_T^j(x)$ and $p_S^j(x)$ as the $j$-th values of the $K$-dimensional vectors $p_T(x)$ and $p_S(x)$, respectively. Then we define $p_T^k(x)$ and $p_S^k(x)$ as the scores produced by the two networks for the ground-truth label $k$ of the example $x$. If a classifier has more confidence in its prediction, the predicted score will be higher. With the help of the teacher network, the student network is supposed to be more confident in its prediction, so that

$$p_S^k(x) > p_T^k(x). \tag{5}$$
IV-A Theoretical Analysis

The above relationship holds in an ideal noise-free scenario. In practical scenarios, perturbations on examples are unavoidable, and the student network is expected to resist their influence and still produce a robust prediction,

$$p_S^k(x + \delta) > p_T^k(x + \delta), \qquad \forall\, x + \delta \in B_q(x, R) \cap C, \tag{6}$$

where $\delta$ is a perturbation added to $x$. We restrict this perturbation to a ball of radius $R$, and $C$ is a constraint set that specifies some requirements for the input, e.g., an image input should be in $[0, 1]^d$, where $d$ is the dimension. We define the ball as $B_q(x, R) = \{ y \in \mathbb{R}^d : \| y - x \|_q \le R \}$.

We aim to discover a student network that stands on the shoulders of the teacher to make confident predictions not only for clean examples but also for examples with perturbations. The perturbation exists on examples without influencing their ground-truth labels. However, as the perturbation intensity increases, the learning process of the student network would be seriously disturbed. Taking equation (6) as an auxiliary constraint in training the student network can help improve the robustness of the network, but it is impossible to enumerate every possible $\delta$ to form the constraint. To make the optimization problem tractable, we seek an alternative and proceed to study the maximum perturbation that the system can withstand. Figure 1 shows the framework of our approach.

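Theorem 1.

Let $x$ be an example in $\mathbb{R}^d$, and let $f_S$ and $f_T$ be functions adapted from the student and the teacher networks to predict the label of example $x$, respectively. Given $f_S(x) \ge f_T(x)$, for any $\delta$ with

$$\|\delta\|_q \le \min\left\{ \frac{f_S(x) - f_T(x)}{\max_{y \in B_q(x, R)} \|\nabla f_T(y) - \nabla f_S(y)\|_p},\; R \right\}, \tag{7}$$

we have $f_S(x + \delta) \ge f_T(x + \delta)$.

Proof. By the fundamental theorem of calculus, we have

$$f_S(x + \delta) = f_S(x) + \int_0^1 \langle \nabla f_S(x + t\delta), \delta \rangle \, dt, \tag{8}$$

$$f_T(x + \delta) = f_T(x) + \int_0^1 \langle \nabla f_T(x + t\delta), \delta \rangle \, dt. \tag{9}$$

If the perturbation is so significant that $f_S(x + \delta) < f_T(x + \delta)$, subtracting equation (9) from equation (8) gives

$$0 \le f_S(x) - f_T(x) < \int_0^1 \langle \nabla f_T(x + t\delta) - \nabla f_S(x + t\delta), \delta \rangle \, dt. \tag{10}$$

Consider the fact that

$$\int_0^1 \langle \nabla f_T(x + t\delta) - \nabla f_S(x + t\delta), \delta \rangle \, dt \le \|\delta\|_q \int_0^1 \|\nabla f_T(x + t\delta) - \nabla f_S(x + t\delta)\|_p \, dt, \tag{11}$$

where Hölder's inequality is applied and the $q$-norm is dual to the $p$-norm with $\frac{1}{p} + \frac{1}{q} = 1$. By combining equation (10) and equation (11), we have

$$\|\delta\|_q > \frac{f_S(x) - f_T(x)}{\int_0^1 \|\nabla f_T(x + t\delta) - \nabla f_S(x + t\delta)\|_p \, dt}, \tag{12}$$

where, for $\|\delta\|_q \le R$, the denominator can be further upper bounded using the following inequality

$$\int_0^1 \|\nabla f_T(x + t\delta) - \nabla f_S(x + t\delta)\|_p \, dt \le \max_{y \in B_q(x, R)} \|\nabla f_T(y) - \nabla f_S(y)\|_p. \tag{13}$$

The lower bound for the $q$-norm of $\delta$ to break the robust prediction of the student network (i.e., equation (6)) is therefore

$$\|\delta\|_q > \frac{f_S(x) - f_T(x)}{\max_{y \in B_q(x, R)} \|\nabla f_T(y) - \nabla f_S(y)\|_p}, \tag{14}$$

which completes the proof. ∎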

Input: A given teacher network $\mathcal{N}_T$; training dataset $\mathcal{X}$ with $n$ instances; the corresponding $K$-label set $\mathcal{Y}$; the hyper-parameters of the loss function.
1:  Initialize a student network $\mathcal{N}_S$, where the number of parameters in $\mathcal{N}_S$ is significantly fewer than that in $\mathcal{N}_T$;
2:  repeat
3:      Select an instance $x$ and its label $y$ randomly;
4:      Employ the teacher network: $p_T = \mathcal{N}_T(x)$;
5:      Employ the student network: $p_S = \mathcal{N}_S(x)$;
6:      Calculate the loss function using equation (17);
7:      Update the weights in the student network $\mathcal{N}_S$;
8:  until the maximum number of training epochs is reached
Output: A robust student network $\mathcal{N}_S$.
Algorithm 1 Robust Student Network Learning

According to Theorem 1, by maximizing the value of $f_S(x) - f_T(x)$ while minimizing the value of $\max_{y \in B_q(x, R)} \|\nabla f_T(y) - \nabla f_S(y)\|_p$, the lower bound on $\|\delta\|_q$ is enlarged, so that the student network can tolerate more severe perturbations and becomes more robust in making confident predictions. Without loss of generality, we fix the choice of norm in the following discussion.
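The bound in Theorem 1 can be checked numerically on a toy case. The sketch below is only an illustration under strong assumptions: the two scoring functions are hypothetical linear maps (so the gradient difference is constant and the max over the ball is trivial), the norm is Euclidean ($p = q = 2$), and all weights, the example `x`, and the radius `R` are made-up values.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_norm(v):
    return math.sqrt(dot(v, v))

def make_linear(w, b):
    """Hypothetical scoring function f(x) = <w, x> + b with constant gradient w."""
    return lambda x: dot(w, x) + b

# Made-up student and teacher scoring functions for the ground-truth label.
w_s, b_s = [0.6, -0.2], 0.5
w_t, b_t = [0.1, 0.3], 0.2
f_s, f_t = make_linear(w_s, b_s), make_linear(w_t, b_t)

x = [1.0, 1.0]   # clean example
R = 2.0          # radius of the ball considered in the theorem

gap = f_s(x) - f_t(x)                             # f_S(x) - f_T(x), must be >= 0
grad_diff = [a - b for a, b in zip(w_s, w_t)]     # grad f_S - grad f_T
bound = min(gap / l2_norm(grad_diff), R)          # tolerated perturbation norm

def margin_after(delta):
    """f_S(x + delta) - f_T(x + delta) for a perturbation delta."""
    x_pert = [xi + di for xi, di in zip(x, delta)]
    return f_s(x_pert) - f_t(x_pert)

# Worst-case direction: opposite to the gradient of the margin f_S - f_T.
unit_worst = [-g / l2_norm(grad_diff) for g in grad_diff]
inside = [bound * u for u in unit_worst]          # on the boundary of the bound
outside = [1.1 * bound * u for u in unit_worst]   # slightly beyond the bound
```

In this linear case the bound is tight: the worst-case perturbation of norm `bound` drives the margin to zero (up to rounding), and a perturbation 10% larger in the same direction makes the student less confident than the teacher. For real networks the gradient difference varies over the ball, so the bound is generally conservative.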

IV-B Method

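Based on the analysis above, two new objectives are introduced into the teacher-student learning paradigm to achieve a robust student network. To encourage $p_S^k(x) > p_T^k(x)$, where $p_S^k(x)$ and $p_T^k(x)$ are the scores of the student and teacher networks for the ground-truth label $k$, we minimize the loss function

$$\mathcal{L}_{conf} = \max\big(0,\; p_T^k(x) + m - p_S^k(x)\big), \tag{15}$$

where $m$ is a constant margin. $p_S^k(x)$ is supposed to be greater than $p_T^k(x)$ by at least the margin; otherwise, there will be a penalty for the student network. It is difficult to explicitly calculate the maximum gradient difference over the ball that appears in Theorem 1, due to the max operation. But by appropriately setting the radius $R$ and considering a sufficiently large training set, the data point in the ball that attains the maximum would often have some close examples in the training set. Hence, to reduce this maximum, we propose to minimize the difference between the gradients of the student and teacher networks with respect to the training examples, denoted $\mathcal{L}_{grad}$ (equation (16)), where the outputs are softened by the relaxation function $\tau(\cdot)$ introduced in equation (3). In addition, we take the KD loss [17] into consideration, and the resulting objective function of our robust student network learning algorithm can be written as:

$$\mathcal{L} = \mathcal{L}_{KD} + C_1 \mathcal{L}_{conf} + C_2 \mathcal{L}_{grad}, \tag{17}$$

where $C_1$ and $C_2$ are the balancing coefficients of $\mathcal{L}_{conf}$ and $\mathcal{L}_{grad}$, respectively.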

The process of training the student network is given in Algorithm 1. After initializing the student network, we train it according to the proposed algorithm. Next, we explain the calculation of the loss in detail. For convenience, we set the batch size to 1; that is, we first select a sample $x$ from the dataset as the input for forward propagation through the teacher network and the student network. Then we calculate the outputs of the two networks, $p_T$ and $p_S$. Combining $p_T$ and $p_S$ with the corresponding label $y$, we can calculate the first term in equation (17) according to equations (3) and (4). $p_T$ and $p_S$ are both $K$-dimensional vectors, which are the networks' prediction scores for the $K$ categories. With the help of the label $y$, we can get the predicted scores $p_T^k$ and $p_S^k$ for the ground-truth label and calculate the second term in equation (17) according to equation (15). In order to get the value of the gradient term, we first calculate the derivatives of $p_S^k$ and $p_T^k$ with respect to the input sample $x$. As in the back-propagation algorithm, we can apply the chain rule to get these results. In experiments, we utilize the automatic differentiation tools integrated in mainstream deep learning platforms for this step. After getting $\nabla_x p_S^k$ and $\nabla_x p_T^k$, the loss can be easily calculated using equation (17). Finally, the weights of the student network are updated with the gradients obtained by the back-propagation algorithm.
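The loss calculation described above can be sketched for a single sample as follows. This is a toy illustration, not the authors' implementation, and it rests on several assumptions: the "networks" are hypothetical one-layer softmax models, the confidence term is a hinge with margin `margin`, the gradient term is taken as the Euclidean norm of the difference between the input gradients of the two ground-truth scores (without the softening step), finite differences stand in for automatic differentiation, and `kd_value`, `c1`, and `c2` are placeholder numbers.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

def make_network(weights, biases):
    """Toy stand-in for a network: one linear layer followed by softmax."""
    def forward(x):
        logits = [sum(w * xi for w, xi in zip(row, x)) + b
                  for row, b in zip(weights, biases)]
        return softmax(logits)
    return forward

def input_gradient(score_fn, x, h=1e-5):
    """Central finite differences; a real implementation would use autodiff."""
    grad = []
    for d in range(len(x)):
        up, down = list(x), list(x)
        up[d] += h
        down[d] -= h
        grad.append((score_fn(up) - score_fn(down)) / (2 * h))
    return grad

def confidence_loss(p_s_k, p_t_k, margin):
    """Hinge penalty when the student score does not exceed the teacher's by the margin."""
    return max(0.0, p_t_k + margin - p_s_k)

def gradient_loss(student, teacher, x, k):
    """Assumed form: L2 norm of the difference between the two input gradients."""
    g_s = input_gradient(lambda z: student(z)[k], x)
    g_t = input_gradient(lambda z: teacher(z)[k], x)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g_s, g_t)))

def robust_loss(kd_value, student, teacher, x, k, c1, c2, margin):
    """KD term plus weighted confidence and gradient terms."""
    conf = confidence_loss(student(x)[k], teacher(x)[k], margin)
    grad = gradient_loss(student, teacher, x, k)
    return kd_value + c1 * conf + c2 * grad

# Made-up two-class example with ground-truth label k = 0.
teacher = make_network([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
student = make_network([[0.5, 0.1], [0.1, 0.5]], [0.0, 0.0])
x, k = [2.0, 0.5], 0
total = robust_loss(kd_value=0.7, student=student, teacher=teacher,
                    x=x, k=k, c1=1.0, c2=0.1, margin=0.05)
```

Here the student is less confident than the teacher on the ground-truth label, so the confidence term is active and adds to the placeholder KD value. In a deep learning framework, the two input gradients would come from the built-in automatic differentiation, and the gradient term itself would need to be differentiable with respect to the student's weights.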

The $\delta$ in equation (7) is the noise in an image $x$. The noise can come from various sources. Some are physical, linked to the nature of light and to optical artifacts, and others are created during the conversion from electrical signal to digital data. As noise degrades the quality of an image, the performance of neural networks in image classification could be seriously affected. The proposed robust student network aims to handle unexpected noise in images and to preserve consistent decisions with or without noise (see equations (5) and (6)). In the literature, the overall noise produced by a digital sensor is usually considered stationary additive white Gaussian noise [35]. We report the robustness of the learned networks against Gaussian noise in the experiments. In addition, we evaluate the performance of the learned networks against combinations of different types of noise on the training and test sets, since it is difficult to know the type of noise before the test stage.

(a) MNIST (b) CIFAR-10 (c) CIFAR-100
Fig. 2: Accuracies obtained by different networks trained on three datasets and under various values of SNR.

V Experiments

In this section, we experimentally investigate the effectiveness of the proposed robust student network learning algorithm. The learned student network is compared with the original teacher network and with the student networks learned through KD [17] and FitNet [19]. The experiments are conducted on three benchmark datasets: MNIST [36], CIFAR-10 [37], and CIFAR-100 [37].

V-A Dataset and Settings

MNIST [36] is a handwritten digit dataset (from 0 to 9) composed of greyscale images from ten categories. The whole dataset of 70,000 images is split into 60,000 and 10,000 images for training and test, respectively. Following the setting in [19], we trained a teacher network of maxout convolutional layers reported in [38], which contains 3 convolutional maxout layers and a fully-connected layer with 48-48-24-10 units, respectively. We then designed a student network that contains 6 convolutional maxout layers and a fully-connected layer, which is twice as deep as the teacher network but has roughly 8% of its parameters. The architectures of the teacher and student networks are shown in detail in the first two columns of Table VIII.

CIFAR-10 [37] is a dataset that consists of RGB color images drawn from 10 categories. There are 60,000 images in the CIFAR-10 dataset, which are split into 50,000 training and 10,000 test images. Following [38] and [19], we preprocessed the data using global contrast normalization (GCN) and ZCA whitening, and augmented the training data via random flipping. We followed the architecture used in Maxout [38] and FitNet [19] to train a teacher network with three maxout convolutional layers of 96-192-192 units. For a fair comparison, we designed a student network with a structure similar to FitNet, which has 17 maxout convolutional layers followed by a maxout fully-connected layer and a top softmax layer, and we also investigated the KD method with the same architecture. The detailed architecture of the teacher is shown in the ‘Teacher(CIFAR-10)’ column of Table VIII, and that of the student in the ‘Student 4’ column.

The CIFAR-100 dataset [37] has images of the same size and format as those in CIFAR-10, except that it has 100 categories with only one tenth as many labeled images per category. More categories and fewer labeled examples per category make the classification task on CIFAR-100 more challenging than that on CIFAR-10. We preprocessed the images in CIFAR-100 using the same methods as for CIFAR-10, and the teacher and student networks share the same hyper-parameters as those used on CIFAR-10. The architecture of the teacher is also the same as that used for CIFAR-10, except that the number of units in the last softmax layer was changed to 100 to match the number of categories.

The hyper-parameters are tuned by minimizing the error on a validation set consisting of the last 10,000 training examples of each dataset. Following the setting in FitNet [19], we set the batch size to 128, the maximum number of training epochs to 500, the learning rate to 0.17 for linear layers and 0.0085 for convolutional layers, and the momentum to 0.35. Following the hint layer proposed in FitNet [19], we pre-trained a classifier using the features of the teacher network's middle layer, and then applied the classifier to the student network's features.

(a) Clean Cat (b) Cat with SNR=10 (c) Clean Ship (d) Ship with SNR=10
Fig. 3: Example images (top row) and their corresponding prediction scores by different networks (bottom row). (a) and (c) are clean images, while (b) and (d) are perturbed images.

V-B Robustness of Student Networks

We evaluated the robustness of student networks learned through different algorithms under different intensities of perturbation. Since it is impossible to know in advance what the test data will look like in practice, augmenting the training data with a particular kind of noise is of limited help in resisting the perturbation. Hence, we trained all networks on the clean training set and introduced white Gaussian noise (WGN) into the test data as the perturbation. The intensity of the introduced noise was measured in terms of the signal-to-noise ratio (SNR). We trained a student network with the proposed algorithm and compared it with the teacher network [38] and with student networks from the KD [17] and FitNet [19] methods.
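Perturbing test data with white Gaussian noise at a chosen SNR can be sketched as follows. The text does not state the unit of the SNR values or how the signal power is measured, so this sketch assumes decibels, with the signal power taken as the mean squared pixel value of the image; the function names and the synthetic "image" are hypothetical.

```python
import math
import random

def add_white_gaussian_noise(signal, snr_db, rng):
    """Add zero-mean Gaussian noise so that the expected SNR equals snr_db (in dB)."""
    signal_power = sum(v * v for v in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in signal]

def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(v * v for v in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)

rng = random.Random(0)                              # fixed seed for repeatability
clean = [rng.random() for _ in range(20000)]        # synthetic pixels in [0, 1)
noisy = add_white_gaussian_noise(clean, snr_db=10.0, rng=rng)
empirical = measured_snr_db(clean, noisy)
```

The empirical SNR of the noisy signal should land close to the 10 dB target. The sketch does not clip the result back to the valid pixel range; whether to clip after adding noise is a separate design choice that slightly changes the effective SNR.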

In Figure 2, we investigate the accuracy of these networks on three datasets under different SNR values. A lower SNR value indicates that more perturbation is added. As the classification task on MNIST is relatively easy, lower SNRs, from 9 down to 1, were chosen for it. Figure 2(a) shows that the accuracy of the proposed robust student network is superior to that of the other three networks under nearly all SNR values. When the SNR equals 2, the two student networks from KD and FitNet perform even worse than the original teacher network, whereas our proposed algorithm achieves a clearly leading accuracy of 98.17%. When the SNR is down to 1, the accuracy drops of the teacher network and the student networks from KD and FitNet are serious, up to 5.65%, 7.25%, and 3.23%, respectively. In contrast, the accuracy of our robust student network drops by only 2.23%. Our method achieves better performance and shows more robustness when there is perturbation in the input.

A similar phenomenon can be observed in Figures 2(b) and 2(c) on the CIFAR-10 and CIFAR-100 datasets. As the SNR decreases, the accuracy of the KD network and FitNet drops faster than that of the teacher network, especially when the SNR drops from 12 to 10. Given the significant reduction in network complexity, the capacity of the student network can be seriously weakened, and the student network would be more vulnerable to perturbations on the data if no appropriate countermeasure is taken. However, the student network learned with the proposed algorithm remains robust to serious perturbations.

In Figure 3, we report the predicted scores of example images by different methods on the CIFAR-10 dataset. Even the clean images look fuzzy, since images from the CIFAR-10 dataset have a resolution of only 32×32. In the first column of Figure 3, all student networks confidently predict the ground-truth class ‘cat’ of the image. However, given the same image with noise at SNR=10 in Figure 3(b), although the student networks from the KD and FitNet methods still barely made the correct prediction, KD also considered the image similar to ‘deer’, and FitNet assigned a higher confidence level to ‘bird’. In contrast, our robust student network confidently kept its correct prediction even though the quality of the image had been seriously degraded by the perturbation. In addition, given the ‘ship’ image, the teacher network can withstand the perturbation, due to the strong capability that comes from its complicated network structure. The KD method mistook it for a ‘deer’ image, while FitNet assigned a higher score to the label ‘deer’ for this ‘ship’ image. By encouraging more confident predictions with the help of the teacher network during the training stage, we derive a robust student network that not only keeps the highest prediction score on the ‘ship’ label, but also suppresses the predictions on wrong categories (see label ‘cat’ in Figures 3(c) and (d)).

Network C/C C/G C/P G/G G/P P/P P/G
Teacher [38] 90.25% 86.30% 86.60% 89.02% 87.36% 89.11% 86.06%
KD [17] 91.07% 80.61% 80.86% 90.48% 82.14% 90.63% 82.27%
FitNet [19] 91.64% 82.41% 82.43% 90.86% 86.11% 91.10% 84.02%
Robust (proposed) 91.93% 90.37% 90.50% 90.37% 90.50% 90.50% 90.37%
TABLE I: Performance comparison on different training and test sets. Dataset was split as training/test. ‘C’ represents clean data, ‘G’ represents data with Gaussian noise, and ‘P’ represents data with Poisson noise.
Dataset Algorithm 2×2 block 4×4 block 6×6 block 8×8 block 10×10 block
CIFAR10 Teacher [38] 89.57% 89.03% 87.69% 85.65% 81.94%
KD [17] 90.53% 89.53% 87.23% 83.37% 77.58%
FitNet [19] 91.08% 89.94% 87.40% 84.62% 79.71%
Robust(proposed) 91.25% 90.79% 88.34% 85.92% 82.15%
CIFAR100 Teacher [38] 63.07% 61.92% 59.95% 57.41% 55.63%
KD [17] 63.48% 61.84% 58.73% 54.38% 51.67%
FitNet [19] 64.11% 62.62% 59.65% 55.51% 52.40%
Robust(proposed) 64.83% 63.16% 60.68% 57.49% 55.11%
TABLE II: Performance comparison on different block sizes.

V-C Comparison under Different Perturbation

A neural network might handle noisy test data if similar noise also exists in the training set. In practice, however, it is difficult to guarantee that the test data have the same kind of perturbation as the training data. We therefore evaluate the performance of different methods under different combinations of noisy training and test sets. The accuracies in the different settings are presented in Table I. The first row of Table I describes the experimental settings: the first capital letter indicates the noise type introduced to the training set, and the second letter indicates that of the test set. Note that the Robust networks listed in this table are all trained on the clean data set, but tested under the type of noise indicated in the first row. If both training and test data are clean, all networks achieve more than 90% accuracy, as shown in the first column of Table I. If networks are trained on clean data and tested on data with Gaussian noise, the teacher network and the student networks from KD and FitNet are seriously affected and achieve at most 86.30% accuracy. However, the proposed robust student still attains more than 90% accuracy. A similar phenomenon can be observed when the networks are trained with Gaussian noise but tested with Poisson noise and vice versa, as shown in the G/P and P/G columns of Table I, respectively. If both training and test data are polluted with the same Gaussian noise, all networks fit the noisy data as far as possible and suffer only a slight performance drop. But this strict constraint on training and test data cannot always be satisfied in real-world applications. These results show that adding a limited number of noise types to the training set does little to improve the robustness of a neural network against the various unexpected kinds of noise encountered in practical applications.
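The train/test noise combinations of Table I can be sketched as follows. This is only an illustrative sketch, not the authors' pipeline: the function names, the noise strength `sigma`, the Poisson `peak` level, and the assumption that images are scaled to [0, 1] are all hypothetical choices made for the example.

```python
import numpy as np

# The seven "train/test" settings used as column headers in Table I.
SETTINGS = ["C/C", "C/G", "C/P", "G/G", "G/P", "P/P", "P/G"]


def add_gaussian_noise(images, sigma, rng):
    """Add zero-mean Gaussian noise to images assumed to lie in [0, 1]."""
    noisy = images + rng.normal(0.0, sigma, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)


def add_poisson_noise(images, peak, rng):
    """Apply Poisson (shot) noise; `peak` is a hypothetical count at intensity 1."""
    noisy = rng.poisson(images * peak) / peak
    return np.clip(noisy, 0.0, 1.0)


def make_variants(clean, sigma=0.1, peak=30.0, seed=0):
    """Return the clean (C), Gaussian (G) and Poisson (P) versions of a data set."""
    rng = np.random.default_rng(seed)
    return {
        "C": clean,
        "G": add_gaussian_noise(clean, sigma, rng),
        "P": add_poisson_noise(clean, peak, rng),
    }


def split_setting(setting):
    """Map a header such as 'G/P' to (training noise, test noise)."""
    train_key, test_key = setting.split("/")
    return train_key, test_key
```

For each entry of `SETTINGS`, a network would be trained on the variant named by the first letter and evaluated on the variant named by the second; as stated above, the Robust network is always trained on the clean variant.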

Moreover, Table I shows that the teacher network’s accuracy is worse than that of the Robust network when the training and test sets both contain Gaussian noise. The distributions of the training data and test data are not significantly different if both are polluted by Gaussian noise. Hence, general teacher networks and student networks can fit the noisy data well and obtain reasonable accuracy. The student network nevertheless achieves some performance improvement, because its architecture is deeper than that of the teacher network. The depth encourages the re-use of features and leads to more abstract and invariant representations at higher layers. The proposed method can successfully train such a deeper student network by exploiting information from the teacher network.

V-D Complex Perturbation

In real-world applications, noise is not the only perturbation that may be encountered. More complex perturbations also challenge the robustness of neural networks. In this section, we investigate the robustness of the student network obtained by our proposed method under two more complex perturbations, i.e. image occlusion and domain adaptation.

V-D1 Image Occlusion

The target object in a real environment is often blocked, and such a perturbation results in the loss of information over a continuous area, so the performance of neural networks is affected more seriously by image occlusion than by noise. To investigate the robustness of our method under this disturbance, we take image occlusion as a more complex perturbation. To simulate the occlusion in real-world applications, we randomly select a small rectangular area in an image and set the pixels covered by the rectangle to zero. Five different block sizes, i.e. 2×2, 4×4, 6×6, 8×8 and 10×10, are used in the experiments. We implemented this experiment on the CIFAR-10 dataset. The results are shown in Table II. Given 4×4 blocks, the teacher, KD, FitNet, and Robust networks have accuracies of 89.03%, 89.53%, 89.94%, and 90.79%, respectively. Given 8×8 blocks, the corresponding accuracies are 85.65%, 83.37%, 84.62%, and 85.92%. According to these results, larger blocks correspond to more serious perturbations of the images, which degrade the performance of neural networks. However, the student network obtained by the proposed method consistently stays ahead on CIFAR-10 because of its robustness.
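The occlusion procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the block is taken to be square (matching the block sizes in Table II), it is placed uniformly at random but fully inside the image, and the function name `occlude` is hypothetical rather than taken from the paper.

```python
import numpy as np


def occlude(image, block_size, rng):
    """Return a copy of an H x W x C image with one random square block set to zero."""
    height, width = image.shape[:2]
    if block_size > height or block_size > width:
        raise ValueError("block does not fit inside the image")
    # Upper-left corner chosen so that the block stays fully inside the image.
    top = rng.integers(0, height - block_size + 1)
    left = rng.integers(0, width - block_size + 1)
    occluded = image.copy()
    occluded[top:top + block_size, left:left + block_size] = 0
    return occluded


rng = np.random.default_rng(0)
image = np.ones((32, 32, 3), dtype=np.float32)  # stand-in for a CIFAR-10 image
for size in (2, 4, 6, 8, 10):
    blocked = occlude(image, size, rng)
    print(size, int((blocked == 0).sum()))  # zeroed values = size * size * 3
```

Whether the occlusion is applied once per test image or resampled at every evaluation is not stated in the text, so the sketch leaves that choice to the caller.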

Teacher [38] 99.45% 96.41% 86.88%
KD [17] 99.35% 93.25% 96.26% 82.74%
FitNet [19] 99.49% 94.12% 96.56% 87.23%
Robust(proposed) 99.55% 95.02% 96.71% 89.14%
TABLE III: Domain adaptation results.
Algorithm params layers CIFAR-10 CIFAR-100
Student-teacher learning paradigm
Teacher [38] 9M 5 90.25% 63.49%
Knowledge Distillation [17] 2.5M 19 91.07% 64.13%
FitNet [19] 2.5M 19 91.64% 64.86%
Robust learning(proposed) 2.5M 19 91.93% 65.28%
Maxout Network [38] 90.62% 61.43%
Network in Network [39] 91.20% 64.32%
Deeply-Supervised Networks [40] 91.78% 65.43%
TABLE IV: Classification accuracies of different networks on CIFAR-10 and CIFAR-100 datasets.
Algorithm plane car bird cat deer dog frog horse ship truck
Teacher [38] 90.1% 93.8% 86.0% 74.6% 93.5% 86.2% 95.2% 92.6% 95.3% 95.2%
KD [17] 90.0% 95.2% 83.2% 84.4% 93.2% 87.1% 95.0% 91.6% 97.3% 93.7%
FitNet [19] 90.7% 97.6% 91.0% 82.7% 93.8% 86.2% 92.7% 93.6% 94.6% 93.5%
Robust(proposed) 91.0% 97.0% 90.3% 83.6% 92.4% 87.2% 95.4% 93.2% 95.1% 94.1%
TABLE V: 10-Class classification accuracies of different networks on CIFAR-10
Algorithm params Misclass
Student-teacher learning paradigm
Teacher 361K 0.55%
Standard back-propagation 30K 1.90%
Knowledge Distillation [17] 30K 0.65%
FitNet [19] 30K 0.51%
Robust(proposed) 30K 0.45%
Maxout Network [38] 0.45%
Network in Network [39] 0.47%
Deeply-Supervised Networks [40] 0.39%
TABLE VI: Classification accuracies on ‘MNIST’ dataset.

V-D2 Domain Adaptation

In practical applications, not only unexpected noise and occlusion but also unexpected distribution shift can challenge the robustness of neural networks. Performance on the domain adaptation task is therefore another important indicator of the adaptability of the algorithm.

In this experiment, we took the USPS dataset, obtained by scanning handwritten digits from envelopes by the U.S. Postal Service. The images in this dataset are all grayscale and their values have been normalized. The whole dataset has 9,298 handwritten digit images, of which 7,291 are for training and the remaining 2,007 are for validation. Similar to MNIST, the USPS dataset has 10 categories, but it has different numbers of samples per category. In addition, since the image size in the MNIST dataset is 28×28, for convenience, we pad the images in the USPS dataset to the same size. Moreover, we preprocess the USPS dataset in the same way as MNIST.
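The padding step can be illustrated with a short sketch. It assumes the standard sizes of the two datasets (16×16 for USPS and 28×28 for MNIST), centred padding, and a constant fill value; none of these details, nor the function name `pad_to_size`, is specified in the text, so they should be read as assumptions.

```python
import numpy as np


def pad_to_size(images, target=28, fill=0.0):
    """Centre-pad a batch of N x H x W images to N x target x target."""
    _, height, width = images.shape
    if height > target or width > target:
        raise ValueError("images are larger than the target size")
    top = (target - height) // 2
    bottom = target - height - top
    left = (target - width) // 2
    right = target - width - left
    # Pad only the two spatial axes; the batch axis is left untouched.
    return np.pad(
        images,
        ((0, 0), (top, bottom), (left, right)),
        mode="constant",
        constant_values=fill,
    )


usps_batch = np.random.default_rng(0).random((5, 16, 16))  # stand-in for USPS digits
padded = pad_to_size(usps_batch)
print(padded.shape)  # (5, 28, 28)
```

The `fill` value should match the background intensity after normalization; for example, if pixel values were normalized to [-1, 1], a background of -1 would be the natural choice rather than 0.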

In this section, we train student networks on the MNIST dataset and test them on the USPS dataset. Similarly, we train networks on the USPS dataset and test them on MNIST. The results are shown in Table III. The first two columns show the results of adapting MNIST to USPS, and the performance of adapting USPS to MNIST is reported in the last two columns of this table. According to the results, the proposed algorithm achieves an accuracy of 95.02% on USPS after training on MNIST, while the comparison methods KD and FitNet only reach 93.25% and 94.12%, respectively. This demonstrates that the proposed robust student network preserves its robustness advantage over the comparison methods when faced with the more complex perturbation of data in the domain adaptation task. A similar phenomenon can be observed in the results of ‘USPS to MNIST’: with similar accuracy on the USPS dataset, the Robust network outperforms the networks obtained by the other algorithms on MNIST. Moreover, the results tested on USPS after training on MNIST are much better than those tested on MNIST after training on USPS. This is because the number of images in the MNIST dataset is much larger than that in USPS. Networks trained on the MNIST dataset can extract more useful information from a larger amount of data, and thus have better generalization capabilities.

Network layers params mult Speed-up Ratio Compression Ratio FitNet Robust
Teacher 5 9M 725M 90.25%
Student 1 11 250K 30M 89.07% 89.62%
Student 2 11 862K 108M 91.02% 91.37%
Student 3 13 1.6M 392M 91.16% 91.50%
Student 4 19 2.5M 382M 91.64% 91.93%
TABLE VII: The performance of the proposed method on student networks with various architectures.
Teacher(MNIST) Student(MNIST) Teacher(CIFAR) Student 1 Student 2 Student 3 Student 4
conv 3x3x48 conv 3x3x16 conv 3x3x96 conv 3x3x16 conv 3x3x16 conv 3x3x32 conv 3x3x32
pool 4x4 conv 3x3x16 pool 4x4 conv 3x3x16 conv 3x3x32 conv 3x3x48 conv 3x3x32
pool 4x4 conv 3x3x16 conv 3x3x32 conv 3x3x64 conv 3x3x32
pool 2x2 pool 2x2 conv 3x3x64 conv 3x3x48
pool 2x2 conv 3x3x48
pool 2x2
conv 3x3x48 conv 3x3x16 conv 3x3x96 conv 3x3x32 conv 3x3x48 conv 3x3x80 conv 3x3x80
pool 4x4 conv 3x3x16 pool 4x4 conv 3x3x32 conv 3x3x64 conv 3x3x80 conv 3x3x80
pool 4x4 conv 3x3x32 conv 3x3x80 conv 3x3x80 conv 3x3x80
pool 2x2 pool 2x2 conv 3x3x80 conv 3x3x80
pool 2x2 conv 3x3x80
conv 3x3x80
pool 2x2
conv 3x3x48 conv 3x3x16 conv 3x3x96 conv 3x3x48 conv 3x3x96 conv 3x3x128 conv 3x3x128
pool 4x4 conv 3x3x16 pool 4x4 conv 3x3x48 conv 3x3x96 conv 3x3x128 conv 3x3x128
pool 4x4 conv 3x3x64 conv 3x3x128 conv 3x3x128 conv 3x3x128
pool 8x8 pool 8x8 pool 8x8 conv 3x3x128
conv 3x3x128
conv 3x3x128
pool 8x8
fc fc fc fc fc fc fc
softmax softmax softmax softmax softmax softmax softmax
TABLE VIII: Model architecture for datasets.

V-E Comparison with State-of-the-Art Methods

Although the main purpose of this paper is to improve the robustness of student networks rather than their performance on clean data, we also compared the proposed approach with state-of-the-art teacher-student learning methods on clean datasets. On clean data, the proposed algorithm still achieves accuracy comparable to that of the other methods, as shown in Tables VI and IV, which together summarize the results on three datasets: MNIST, CIFAR-10 and CIFAR-100. On the MNIST dataset (Table VI), the teacher network reached 99.45% accuracy. With the assistance of KD, the student network achieved 99.35% accuracy. FitNet generated a slightly better student network with 99.49% accuracy, which outperforms the teacher network. Though the proposed algorithm aims to enhance the robustness of the learned student network, it also achieves comparable or even better accuracy than state-of-the-art methods: the accuracy obtained by the proposed method increased to 99.55% on the MNIST dataset. Table IV shows the results on the CIFAR-10 dataset. The baseline teacher network achieved 90.25% accuracy, and the accuracies of the student networks generated by KD and FitNet were 91.07% and 91.64%, respectively. The Robust student network obtained 91.93% accuracy, which outperforms the other student networks and the teacher. This suggests that the proposed method is able to enhance the stability of the student network and thereby improve its performance.

CIFAR-100 is similar to CIFAR-10 but more challenging because of its 100 categories. The accuracy obtained by the teacher network is only 63.49%. For comparison, the accuracy of the teacher on CIFAR-10 is 90.25%, which is much better than that on CIFAR-100. The robust student network achieved 65.28% accuracy, which outperforms the student networks trained by other strategies: the network trained by knowledge distillation obtains a test accuracy of 64.13%, and the accuracy of FitNet is 64.86%. Compared with the other methods, the student network generated by the proposed method provides nearly state-of-the-art performance. This result demonstrates that the proposed method succeeds in learning a student network with competitive performance.

V-F Analysis on Structures of Student Network

We followed the experimental setting in FitNet [19] and designed four student networks with different configurations of parameters and layers. The teacher network had the same structure as that used on the CIFAR-10 dataset. The detailed structures of these networks can be found in Table VIII. From ‘Student 1’ to ‘Student 4’, the size of the network gradually increases, and so does its performance. Table VII reports the performance of the four student networks and the teacher network on the CIFAR-10 dataset. The compression ratio and speed-up ratio relative to the teacher, as well as the numbers of parameters and multiplications, can also be found in Table VII.

From Table VII, we find that the proposed robust student network outperforms FitNet under all four student structures. Though there is no perturbation on the data, the proposed method achieves higher accuracy, which indicates the effectiveness of encouraging the student network to make confident predictions with the help of the teacher network. In addition, the smallest network, Student 1, has the largest compression and speed-up ratios, but it still achieves a test accuracy of 89.62%, which is fairly close to the 90.25% of the teacher and outperforms the 89.07% obtained by FitNet. As Student 1 contains significantly fewer parameters than the teacher, improving the accuracy of such a network with limited capacity is challenging, which in turn suggests the effectiveness of the proposed method.
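To make the size of the reduction concrete, the parameter and multiplication counts listed in Table VII can be turned into ratios relative to the teacher. The sketch below assumes that the compression ratio is the quotient of parameter counts and uses the quotient of multiplication counts as a stand-in for speed-up; the paper's own speed-up ratio may have been measured differently (for example, from running time), so these numbers are illustrative only.

```python
# Parameter and multiplication counts as listed in Table VII.
TEACHER = {"params": 9e6, "mult": 725e6}
STUDENTS = {
    "Student 1": {"params": 250e3, "mult": 30e6},
    "Student 2": {"params": 862e3, "mult": 108e6},
    "Student 3": {"params": 1.6e6, "mult": 392e6},
    "Student 4": {"params": 2.5e6, "mult": 382e6},
}


def ratios(teacher, student):
    """Return teacher/student quotients for parameters and multiplications."""
    return {
        "compression": teacher["params"] / student["params"],
        # Assumption: multiplication count used as a proxy for speed-up.
        "mult_reduction": teacher["mult"] / student["mult"],
    }


for name, student in STUDENTS.items():
    r = ratios(TEACHER, student)
    print(f"{name}: compression {r['compression']:.2f}x, "
          f"multiplications {r['mult_reduction']:.2f}x fewer")
```

Under these assumptions Student 1 has 36 times fewer parameters than the teacher, which is the sense in which it is the most aggressively compressed of the four students.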

Although the student networks learned by the proposed method contain significantly fewer parameters, they are still regular networks that can be further compressed and sped up by existing sparsity-based deep neural network compression techniques, such as deep compression [10] and feature compression [11].

VI Conclusion

We proposed to learn a robust student network with the guidance of the teacher network. The proposed method prevents the student network from being disturbed by perturbations on input examples. Through a rigorous theoretical analysis, we proved a lower bound on the perturbation that will weaken the student network’s confidence in its prediction. We introduced new objectives based on prediction scores and gradients of examples to maximize this lower bound, and thereby improved the robustness of the learned student network against perturbations on examples. Experimental results on several benchmark datasets demonstrate that the proposed method is able to learn a robust student network with satisfying accuracy and compact size.