I Introduction
Recent years have witnessed marked progress in deep learning. Since the breakthrough in the 2012 ImageNet competition [1] achieved by AlexNet [2] using five convolutional layers and three fully connected layers, a series of more advanced deep neural networks have been developed that keep rewriting the record, e.g., VGGNet [3], GoogLeNet [4], and ResNet [5]. However, their excellent performance requires a huge amount of computation. For instance, AlexNet [2] contains about 232 million parameters and needs a very large number of multiplications to process a single image. Hence, the potential power of deep neural networks can only be fully unlocked on high-performance GPU servers or clusters. In contrast, the majority of mobile devices used in daily life have strict constraints on storage and computational resources, which prevents them from taking full advantage of deep neural networks. As a result, networks with lower hardware demands that still maintain similar accuracy are of great interest to the image processing and computer vision community.
Convolutional neural networks can be compressed by vector quantization [6], decomposition of weight matrices [7], and encoding with hashing tricks [8]. The same goal can be achieved by pruning subtle weights [9, 10], reducing the redundancy between weights in the frequency domain [11], and using binary networks [12, 13]. Another straightforward approach is to design a compact network directly, e.g., ResNeXt [14], the Xception network [15], and MobileNets [16]. These networks are often deep and thin with fewer parameters in each layer, and their nonlinearity is strengthened by increasing the number of layers, which helps preserve the performance of the network.
The student-teacher learning framework, introduced in knowledge distillation (KD) [17], is one of the most popular approaches to model compression and acceleration [12, 11]. Taking a heavy neural network that has already been well trained with massive data and computing resources, such as GoogLeNet [4] or ResNet [5], as the teacher network, a student network with a light architecture can be better learned under the teacher's guidance. To inherit the advantages of teacher networks, different methods have been proposed to encourage consistency between the teacher and student networks. For example, Ba and Caruana [18] minimized the Euclidean distance between features extracted from the two networks, Hinton et al. [17] encouraged the student to mimic a softened version of the teacher's output, and FitNet [19] introduced intermediate-level hints from the teacher's hidden layers to guide the training of the student. McClure and Kriegeskorte [20] proposed to preserve the pairwise distances of examples between the student and teacher networks. You et al. [21] utilized multiple teacher networks to guide the training of the student network. Wang et al. [22] introduced a teaching assistant to encourage similarity between the distributions of feature maps extracted from the teacher and student networks.
These algorithms have achieved impressive experimental results; however, they were mainly developed for ideal scenarios in which all data are implicitly assumed to be clean. In practice, given examples with perturbations, the training process can be seriously affected, and the resulting network becomes less confident in its predictions. The teacher network might also make mistakes, since it is difficult for it to be familiar with all examples fed into the student network. This is consistent with student-teacher learning in the real world: an excellent student is expected to solve practical problems in changing circumstances, where some questions may not be known even to the teachers.
To solve this problem, we introduce a robust teacher-student learning algorithm in this paper. The framework of the proposed method is illustrated in Figure 1. We enable the student network to be more confident in its predictions with the help of the teacher network. Since perturbations on examples might seriously influence the learning of the student network, we derive, through a rigorous theoretical analysis, a lower bound on the perturbation required to make the student more vulnerable than the teacher. New objectives in terms of prediction scores and gradients of examples are then developed to maximize this lower bound. Hence, the overall robustness of the student network against perturbations on examples can be improved. Experimental results on benchmark datasets demonstrate the superiority of the proposed method for learning compact and robust deep neural networks.
The rest of the paper is organized as follows. In Section II, we summarize related works on learning convolutional neural networks with fewer parameters. Section III introduces the previous work on which our method is based. In Section IV, we formally introduce our robust student network learning method in detail, including the proof of the proposed theorem, the calculation of the loss function, and the training strategy. Section V reports results of our algorithm on various benchmark datasets to demonstrate the effectiveness of the proposed method. Section VI concludes this paper.
II Related Works
In this section, we briefly introduce related works on learning efficient convolutional neural networks with fewer parameters. These methods fall into three categories according to their techniques and motivations.
II-A Network Trimming
Network trimming aims to remove redundancy in heavy networks to obtain a compact network with fewer parameters and lower computational complexity, while keeping the accuracy of this portable network close to that of the original large model. Gong et al. [6] utilized vector quantization to compress neural networks, introducing a cluster center of weights as the representation of similar weights. Denton et al. [7] applied singular value decomposition to the weight matrix of a fully-connected layer to reduce the number of parameters. Chen et al. [8] explored hash encoding to improve the compression ratio. Courbariaux et al. [12] and Rastegari et al. [13] implemented binary networks, in which all weights previously stored as 32-bit floating-point numbers are converted to binary values (+1 or -1). Moreover, Wang et al. [11] and Han et al. [9] exploited weight pruning to achieve the same goal. In particular, Han et al. [9] focused on removing subtle weights to reduce the number of parameters while minimizing the impact of removing them; over 80% of the subtle weights were dropped without a drop in accuracy. Furthermore, Han et al. [10] integrated several neural network compression techniques, i.e., pruning, quantization, and Huffman coding, to further compress the network. Wang et al. [11] showed that redundancy exists not only in subtle weights but also in large weights. They converted convolutional kernels into the frequency domain to reduce the redundancy contained in larger weights and thereby compress networks with a higher compression ratio. In addition, Wang et al. [23] focused on the redundancy in feature maps instead of network weights, which can also be considered a modification of the network architecture. Although network trimming brings considerable compression and speed-up ratios, the actual acceleration often depends heavily on customized hardware because of the highly sparse parameters and irregular network architectures.
II-B Design Small Networks
Directly designing a new deep neural network of light size is a straightforward approach to efficient deep learning. Most of these methods increase the depth of networks with much lower complexity than simply stacking convolutional layers. For example, ResNet [5] introduced a novel residual block that obtained a significant performance gain at only a slight computational cost. ResNeXt [14] introduced group convolutions into the building blocks to boost performance. Flattened networks [24] introduced fully factorized convolutions and designed an extremely factorized network. At almost the same time, Factorized Networks [25] introduced topological convolution, which treats sections of tensors separately. SqueezeNet [26] designed a portable network with a bottleneck architecture. SENet [27] proposed a novel architecture named the SE block, which focuses on the relationship between channels. Moreover, [28] introduced depthwise separable convolutions to obtain large gains in speed and network size. With the help of depthwise separable convolutions, Inception models [29, 30] reduced the complexity of the first few layers of the network. Later, the Xception network [15] outperformed the Inception model by scaling up depthwise separable convolutional filters. Subsequently, MobileNets [16] combined channel-wise decomposition of convolutional filters with depthwise separable convolutions and achieved state-of-the-art results among portable models. ShuffleNet [31] introduced a novel form of group convolution and depthwise separable convolution. Deep fried convnets [32] introduced a novel Adaptive Fastfood transform to reduce the computation of networks. Structured transform networks [33] offered considerable accuracy-compactness-speed tradeoffs based on new notions rooted in the theory of structured matrices.
II-C Teacher-Student Learning
There is another way to train a portable network: the trained network is regarded as a teacher and a deeper yet thinner network as a student. With the help of the intrinsic information captured by the teacher network, the deeper and thinner student network can be well trained. Ba and Caruana [18] suggested that the student network mimic the features extracted from the last layer of the teacher network to assist its training, thereby making it possible to increase the depth of the student network. Knowledge Distillation (KD) [17] pointed out that it is difficult to directly mimic features for two networks with large structural differences. Therefore, KD [17] proposes to minimize the difference between the relaxed outputs of the softmax layers of the two networks. This strategy can further deepen the student network. FitNet [19], based on KD, minimized the difference between the features extracted from the middle layers of the student and teacher networks. They added several MLP layers at the middle layer of the teacher network to match the dimensions of the student network's features. By establishing a connection between the middle layers of the two networks, the student network can be further deepened with fewer parameters. McClure and Kriegeskorte [20] attempted to minimize the distance between pairs of samples to reduce the difficulty of training student networks. You et al. [21] proposed utilizing multiple teacher networks to provide more guidance for the training of student networks, leveraging a voting strategy to balance the guidance from each teacher network. Wang et al. [22] regarded the student network as the generator of a GAN [34] and utilized a discriminator as an assistant to the teacher, forcing the student to generate features that are difficult to distinguish from those of the teacher.
Compared to network trimming algorithms, the student-teacher learning framework has more flexibility, no special requirements on hardware, and a more regular network structure. Compared to the direct design of a deeper network, guidance from the teacher is beneficial to learning deep networks and improving the performance of the student. However, existing student-teacher algorithms pay more attention to improving the performance of the student network on clean datasets. The performance degradation under perturbation, which stems from the instability caused by the large reduction in parameters, has not yet been studied. Therefore, a more robust learning algorithm for improving student network performance under perturbed conditions needs to be developed. This paper proposes a method under the teacher-student learning and knowledge distillation framework that enhances the robustness of the student network.
III Preliminary of Teacher-Student Learning
To make this paper self-contained, we briefly introduce some preliminary knowledge of teacher-student learning here.
The teacher network $\mathcal{N}_T$ has a complicated architecture and has already been well trained to achieve sufficiently high performance. We aim to learn a student network $\mathcal{N}_S$, which is deeper yet thinner than the teacher network and has a lower yet satisfactory accuracy. Let $\mathcal{X}$ be the example space and $\mathcal{Y}$ be its corresponding label space. The outputs of these two networks are defined as
$$o_T = \mathrm{softmax}(a_T), \qquad o_S = \mathrm{softmax}(a_S), \tag{1}$$
where $a_T$ and $a_S$ are the features produced by the pre-softmax layers of the teacher and student networks, respectively.
The teacher network is usually trained on a relatively large dataset and consists of a large number of parameters, so it usually achieves high accuracy in the classification task. Given significantly fewer parameters and multiplication operations, it is difficult for the student network to achieve high performance if it adopts the same training strategy as the teacher network. It is therefore necessary to improve the performance of the student network with the assistance of the teacher network. A straightforward method is to encourage the features of an image extracted from these two networks to be similar [18]. The objective function can be written as
$$\mathcal{L}(\mathcal{N}_S) = \frac{1}{n}\sum_{i=1}^{n}\Big[\mathcal{H}(o_S^i, y^i) + \lambda\,\|a_S^i - a_T^i\|_2^2\Big], \tag{2}$$
where the second term helps the student network to extract knowledge from the teacher, $\mathcal{H}$ refers to the cross-entropy loss, $n$ is the number of training examples, $o_S^i$ indicates the output of the $i$-th example in $\mathcal{X}$ by the student network, $a_S^i$ and $a_T^i$ are the corresponding pre-softmax features of the two networks, $y^i$ refers to the corresponding label, and $\lambda$ is the coefficient that balances the two terms. The teacher and student networks can be significantly different in architecture, and thus it is difficult to expect the features extracted by these two networks for the same example to be the same. Hence, Knowledge Distillation (KD) [17] was proposed as an effective alternative that distills knowledge from the classification results by minimizing
$$\mathcal{L}_{KD}(\mathcal{N}_S) = \frac{1}{n}\sum_{i=1}^{n}\Big[\mathcal{H}(o_S^i, y^i) + \lambda\,\mathcal{H}\big(\tau(o_S^i), \tau(o_T^i)\big)\Big], \tag{3}$$
where the second term enforces the student network to learn from the softened output of the teacher network. $\tau(\cdot)$ is a relaxation function defined as follows:
$$\tau(o_T) = \mathrm{softmax}\Big(\frac{a_T}{\tau}\Big), \qquad \tau(o_S) = \mathrm{softmax}\Big(\frac{a_S}{\tau}\Big), \tag{4}$$
where the scalar $\tau$ inside the softmax is the temperature. The relaxation $\tau(\cdot)$ is introduced to make sure that the second term in equation (3) plays a different role from the first one. This is because $o_T$ might be extremely similar to the one-hot representation of the ground-truth labels, while a softened version of the output differs from the true labels. Moreover, the softened output provides more information to guide the learning of the student, since in the cross-entropy loss it increases the influence of classes other than the true label.
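The distillation objective described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the default values `lam=0.5` and `temperature=4.0`, and the convention that `cross_entropy` takes the target distribution first are assumptions made here, not settings reported in the paper.

```python
import math

def softmax(a, temperature=1.0):
    """Softmax of pre-softmax features a, optionally softened by a temperature."""
    scaled = [v / temperature for v in a]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """H(target, pred) = -sum_k target_k * log(pred_k)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def kd_loss(a_S, a_T, label, lam=0.5, temperature=4.0):
    """Hard-label cross-entropy plus lam times the softened teacher/student term."""
    one_hot = [1.0 if k == label else 0.0 for k in range(len(a_S))]
    hard_term = cross_entropy(one_hot, softmax(a_S))
    soft_term = cross_entropy(softmax(a_T, temperature), softmax(a_S, temperature))
    return hard_term + lam * soft_term

# Hypothetical pre-softmax features for a 3-class problem.
a_T = [4.0, 1.0, 0.5]
sharp = softmax(a_T)        # close to a one-hot vector
soft = softmax(a_T, 4.0)    # softened: the other classes get more mass
```

With these hypothetical features, the softened distribution assigns visibly more probability to the non-maximal classes than the plain softmax does, which is the extra information the second term passes to the student.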
Although the KD loss in equation (3) allows the student network to access knowledge from the teacher network, the significant reduction in the number of parameters decreases the capacity of the student network and makes it more vulnerable to input disturbances. The learned student network might achieve reasonable performance on clean data, but it would suffer a serious performance decline when encountering perturbed data in real-world applications. It is therefore necessary to enhance the robustness of the student network for practical scenarios.
IV Robust Student Network Learning
We take a multi-class classification problem over $K$ classes as an example to introduce our robust student network learning. Given a teacher network $\mathcal{N}_T$ and a student network $\mathcal{N}_S$, an example $x \in \mathbb{R}^d$ can be classified by the two networks as $o_T = \mathcal{N}_T(x)$ and $o_S = \mathcal{N}_S(x)$, respectively. Denote $f_T^k(x)$ and $f_S^k(x)$ as the $k$-th values of the $K$-dimensional vectors $o_T$ and $o_S$, respectively. Then we define $f_T^c(x)$ and $f_S^c(x)$ as the scores produced by the two networks for the ground-truth label $c$ of the example $x$, respectively. If a classifier has more confidence in its prediction, the predicted score will be higher. With the help of the teacher network, the student network is supposed to be more confident in its prediction, so that
$$f_S^c(x) > f_T^c(x). \tag{5}$$
IV-A Theoretical Analysis
The above relationship holds in an ideal noise-free scenario. In practical scenarios, perturbations on examples are unavoidable, and the student network is expected to resist their influence and still produce a robust prediction,
$$f_S^c(x+\delta) > f_T^c(x+\delta), \quad \forall\, \delta \ \text{with}\ \|\delta\|_q \le R,\ x+\delta \in C, \tag{6}$$
where $\delta$ is a perturbation added to $x$. We restrict this perturbation to a spherical space of radius $R$, and $C$ is a constraint set that specifies some requirements for the input, e.g., an image input should be in $[0,1]^d$, where $d$ is the dimension. We define the ball as $B(x,R) = \{z \in C : \|z - x\|_q \le R\}$.
We aim to discover a student network that stands on the shoulders of the teacher to make confident predictions not only for clean examples but also for examples with perturbations. The perturbation exists on examples without influencing their corresponding ground-truth labels. However, as the perturbation intensity increases, the learning process of the student network would be seriously disturbed. Taking equation (6) as an auxiliary constraint in training the student network can help improve the robustness of the network, but it is impossible to enumerate every possible $\delta$ to form the constraint. To make the optimization problem tractable, we seek an alternative and study the maximum perturbation that can be defended against by the system. Figure 1 shows the framework of our approach.
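Theorem 1.
Let $x$ be an example in $\mathbb{R}^d$, and let $f_S$ and $f_T$ be functions adapted from the student and the teacher networks to predict the label of example $x$, respectively. Given $f_S^c(x) > f_T^c(x)$ for the ground-truth label $c$, for any $\delta$ with
$$\|\delta\|_q < \max_{R>0}\,\min\left\{\frac{f_S^c(x) - f_T^c(x)}{\max_{z\in B(x,R)}\|\nabla f_T^c(z) - \nabla f_S^c(z)\|_p},\; R\right\}, \tag{7}$$
we have $f_S^c(x+\delta) > f_T^c(x+\delta)$.
Proof.
By the main theorem of calculus, we have
$$f_S^c(x+\delta) = f_S^c(x) + \int_0^1 \langle \nabla f_S^c(x+t\delta),\, \delta\rangle\, dt \tag{8}$$
and
$$f_T^c(x+\delta) = f_T^c(x) + \int_0^1 \langle \nabla f_T^c(x+t\delta),\, \delta\rangle\, dt. \tag{9}$$
If the perturbation is so significant that $f_S^c(x+\delta) \le f_T^c(x+\delta)$, we get
$$0 < f_S^c(x) - f_T^c(x) \le \int_0^1 \langle \nabla f_T^c(x+t\delta) - \nabla f_S^c(x+t\delta),\, \delta\rangle\, dt. \tag{10}$$
Consider the fact that
$$\int_0^1 \langle \nabla f_T^c(x+t\delta) - \nabla f_S^c(x+t\delta),\, \delta\rangle\, dt \le \|\delta\|_q \int_0^1 \|\nabla f_T^c(x+t\delta) - \nabla f_S^c(x+t\delta)\|_p\, dt, \tag{11}$$
where Hölder's inequality is applied and the $q$-norm is dual to the $p$-norm with $\frac{1}{p} + \frac{1}{q} = 1$. By combining equation (10) and equation (11), we have
$$\|\delta\|_q \ge \frac{f_S^c(x) - f_T^c(x)}{\int_0^1 \|\nabla f_T^c(x+t\delta) - \nabla f_S^c(x+t\delta)\|_p\, dt}, \tag{12}$$
where, for $\|\delta\|_q \le R$, the denominator can be further upper bounded using the following inequality
$$\int_0^1 \|\nabla f_T^c(x+t\delta) - \nabla f_S^c(x+t\delta)\|_p\, dt \le \max_{z\in B(x,R)} \|\nabla f_T^c(z) - \nabla f_S^c(z)\|_p. \tag{13}$$
The lower bound for the $q$-norm of $\delta$ to break the robust prediction of the student network (i.e., equation (6)) is therefore
$$\|\delta\|_q \ge \frac{f_S^c(x) - f_T^c(x)}{\max_{z\in B(x,R)} \|\nabla f_T^c(z) - \nabla f_S^c(z)\|_p}, \tag{14}$$
which completes the proof. ∎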
According to Theorem 1, maximizing the value of $f_S^c(x) - f_T^c(x)$ while minimizing the value of $\max_{z\in B(x,R)}\|\nabla f_T^c(z) - \nabla f_S^c(z)\|_p$ enlarges the lower bound on $\|\delta\|_q$, so that the student network is able to tolerate more severe perturbations and becomes more robust in making confident predictions. Without loss of generality, we take $p = q = 2$ in the following discussion.
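As a sanity check on this bound, consider a toy case with $p = q = 2$ in which both score functions are linear, so the gradient difference is constant and the maximum over the ball is just its norm. The weights, biases, and input below are hypothetical values chosen only for illustration; real network scores are not linear, and the radius $R$ is assumed large enough not to be the limiting term.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm2(u):
    return math.sqrt(dot(u, u))

# Hypothetical linear scores for the ground-truth class.
w_S, b_S = [0.8, -0.2, 0.5], 0.3   # student
w_T, b_T = [0.5, 0.1, 0.4], 0.1    # teacher
x = [1.0, 0.5, 2.0]

def f_S(z):
    return dot(w_S, z) + b_S

def f_T(z):
    return dot(w_T, z) + b_T

gap = f_S(x) - f_T(x)                          # score gap on the clean input
grad_diff = [s - t for s, t in zip(w_S, w_T)]  # constant gradient difference
bound = gap / norm2(grad_diff)                 # smallest norm that can close the gap

def worst_case_input(scale):
    """Move x by scale * bound along the direction that shrinks the gap fastest."""
    unit = [-g / norm2(grad_diff) for g in grad_diff]
    return [xi + scale * bound * ui for xi, ui in zip(x, unit)]

x_inside = worst_case_input(0.99)   # perturbation norm just below the bound
x_outside = worst_case_input(1.01)  # perturbation norm just above the bound
```

In this linear setting the bound is tight: even the worst-case perturbation with norm slightly below `bound` leaves the student score above the teacher score, while one slightly above it reverses the order. A larger score gap or a smaller gradient difference increases `bound`, which is what the two training objectives aim for.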
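IV-B Method
Based on the analysis above, two new objectives are introduced into the teacher-student learning paradigm to achieve a robust student network. To encourage $f_S^c(x) > f_T^c(x)$, we plan to minimize the loss function
$$\mathcal{L}_{score}(\mathcal{N}_S) = \frac{1}{n}\sum_{i=1}^{n} \max\big(0,\; f_T^c(x^i) + m - f_S^c(x^i)\big), \tag{15}$$
where $m$ is a constant margin. $f_S^c(x^i)$ is supposed to be greater than $f_T^c(x^i) + m$; otherwise, there will be a penalty for the student network. It is difficult to explicitly calculate the value of $\max_{z\in B(x,R)}\|\nabla f_T^c(z) - \nabla f_S^c(z)\|_2$, due to the existence of the max operation. But by appropriately setting the radius $R$ and considering a sufficiently large training set, the data point in the ball $B(x,R)$ that reaches the maximum value of $\|\nabla f_T^c(z) - \nabla f_S^c(z)\|_2$ would often have some close examples in the training set. Hence, to minimize the value of $\max_{z\in B(x,R)}\|\nabla f_T^c(z) - \nabla f_S^c(z)\|_2$, we propose to minimize the difference between the gradients of the student and teacher networks w.r.t. the training examples as
$$\mathcal{L}_{grad}(\mathcal{N}_S) = \frac{1}{n}\sum_{i=1}^{n} \big\|\nabla_x \tau\big(f_S^c(x^i)\big) - \nabla_x \tau\big(f_T^c(x^i)\big)\big\|_2^2, \tag{16}$$
where $\tau(\cdot)$ is the relaxation function explained in equation (3). In addition, we take the KD loss [17] into consideration; the resulting objective function of our robust student network learning algorithm can be written as
$$\mathcal{L}_{robust}(\mathcal{N}_S) = \mathcal{L}_{KD}(\mathcal{N}_S) + C_1\,\mathcal{L}_{score}(\mathcal{N}_S) + C_2\,\mathcal{L}_{grad}(\mathcal{N}_S), \tag{17}$$
where $C_1$ and $C_2$ are the balancing coefficients of $\mathcal{L}_{score}$ and $\mathcal{L}_{grad}$, respectively.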
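The way the three terms of the objective combine for a single example can be sketched as follows. The hinge form of the score term, the squared Euclidean distance in the gradient term, and all names and default values (`margin`, `c1`, `c2`) are assumptions made for this sketch; the KD term is passed in as an already computed number.

```python
def score_margin_loss(f_S_c, f_T_c, margin=0.1):
    """Penalty when the student score does not exceed the teacher score by the margin."""
    return max(0.0, f_T_c + margin - f_S_c)

def gradient_matching_loss(grad_S, grad_T):
    """Squared Euclidean distance between the two input gradients."""
    return sum((gs - gt) ** 2 for gs, gt in zip(grad_S, grad_T))

def robust_objective(kd_term, f_S_c, f_T_c, grad_S, grad_T,
                     c1=1.0, c2=1.0, margin=0.1):
    """KD term plus weighted score-margin and gradient-matching terms."""
    return (kd_term
            + c1 * score_margin_loss(f_S_c, f_T_c, margin)
            + c2 * gradient_matching_loss(grad_S, grad_T))

# Hypothetical values for one training example.
total = robust_objective(kd_term=1.0, f_S_c=0.6, f_T_c=0.7,
                         grad_S=[1.0, 0.0], grad_T=[0.0, 0.0])
```

In this example the student score is below the teacher score plus the margin, so the score term contributes a penalty, and the mismatch in the first gradient component contributes to the gradient term.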
The process of training the student network can be found in Algorithm 1. After the initialization of the student network, we train it according to the proposed algorithm. Next, we explain the calculation of the loss in detail. For convenience, we set the batch size to 1; that is, we first select a sample $x$ from the dataset $\mathcal{X}$ as the input for forward propagation of the teacher network and the student network. Then we calculate the outputs of the two networks, $o_T$ and $o_S$. Combining the outputs $o_T$ and $o_S$ with the corresponding label $y$, we can calculate the first term in equation (17) according to equations (3) and (4). $o_T$ and $o_S$ are both $K$-dimensional vectors, which contain the networks' prediction scores for the $K$ categories. With the help of the label $y$, we can get the predicted scores for the ground-truth label, $f_T^c(x)$ and $f_S^c(x)$, and calculate the second term in equation (17) according to equation (15). To get the value of the third term, defined in equation (16), we first calculate the derivatives of $f_T^c(x)$ and $f_S^c(x)$ with respect to the input sample $x$. As in the backpropagation algorithm, we can apply the chain rule to get these results. In experiments, we utilize the automatic differentiation tools integrated in mainstream deep learning platforms for this step. After getting $\nabla f_T^c(x)$ and $\nabla f_S^c(x)$, the loss can be easily calculated using equation (17). Finally, the weights of the student network are updated by the gradients obtained by the backpropagation algorithm.
The $\delta$ in equation (7) is the noise in an image $x$. The noise can come from various sources. Some are physical, linked to the nature of light and to optical artifacts, and others are created during the conversion from electrical signal to digital data. As noise degrades the quality of an image, the performance of neural networks in the image classification task could be seriously affected. The proposed robust student network aims to handle unexpected noise in images and to preserve consistent decisions with or without noise (see equations (5) and (6)). In the literature, the overall noise produced by a digital sensor is usually considered to be stationary white additive Gaussian noise [35]. We report the robustness of the learned networks against Gaussian noise in the experiments. In addition, we evaluate the performance of the learned networks against combinations of different types of noise on the training and test sets, since it is difficult to know the type of noise before the test stage.
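The input-derivative step of the loss calculation can be illustrated on a tiny model. The sketch below assumes a hypothetical linear-softmax classifier with weight matrix `W` (a stand-in for a real network), derives the gradient of the ground-truth score with respect to the input by the chain rule, and cross-checks it with finite differences. In practice an automatic differentiation tool would replace the hand-written derivative.

```python
import math

def softmax(a):
    m = max(a)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def class_score(W, x, c):
    """Score f^c(x): the c-th softmax output of the linear model W x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return softmax(logits)[c]

def input_gradient(W, x, c):
    """Analytic gradient of f^c(x) with respect to x, via the chain rule."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    o = softmax(logits)
    grad = []
    for j in range(len(x)):
        weighted_mean = sum(o[k] * W[k][j] for k in range(len(W)))
        grad.append(o[c] * (W[c][j] - weighted_mean))
    return grad

def numeric_gradient(W, x, c, h=1e-6):
    """Central finite differences, used only to cross-check the analytic gradient."""
    grad = []
    for j in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[j] += h
        x_minus[j] -= h
        grad.append((class_score(W, x_plus, c) - class_score(W, x_minus, c)) / (2 * h))
    return grad

# Hypothetical 3-class student weights and input.
W_S = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [-0.5, 0.1, 0.1]]
x = [1.0, -2.0, 0.5]
grad_S = input_gradient(W_S, x, c=0)
```

Computing the same quantity for the teacher and taking the difference of the two gradients gives the input to the gradient term of the objective.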
Figure 2: Accuracy of the networks under different SNR values on (a) MNIST, (b) CIFAR-10, and (c) CIFAR-100.
V Experiments
In this section, we experimentally investigate the effectiveness of the proposed robust student network learning algorithm. The learned student network is compared with the original teacher network and with the student networks learned through KD [17] and FitNet [19]. The experiments are conducted on three benchmark datasets: MNIST [36], CIFAR-10 [37], and CIFAR-100 [37].
V-A Dataset and Settings
MNIST [36] is a handwritten digit dataset (from 0 to 9) composed of greyscale images from ten categories. The whole dataset of 70,000 images is split into 60,000 images for training and 10,000 images for testing. Following the setting in [19], we trained a teacher network of maxout convolutional layers as reported in [38], which contains 3 convolutional maxout layers and a fully-connected layer with 48-48-24-10 units, respectively. We then designed a student network that contains 6 convolutional maxout layers and a fully-connected layer, which is twice as deep as the teacher network but has roughly 8% of the parameters. The architectures of the teacher and student networks are detailed in the first two columns of Table VIII.
CIFAR-10 [37] is a dataset that consists of RGB color images drawn from 10 categories. There are 60,000 images in the CIFAR-10 dataset, which are split into 50,000 training and 10,000 testing images. Following [38] and [19], we preprocessed the data using global contrast normalization (GCN) and ZCA whitening, and augmented the training data via random flipping. We followed the architecture used in Maxout [38] and FitNet [19] to train a teacher network with three maxout convolutional layers of 96-192-192 units. For a fair comparison, we designed a student network with a structure similar to FitNet, which has 17 maxout convolutional layers followed by a maxout fully-connected layer and a top softmax layer, and we also investigated the KD method with the same architecture. The detailed architecture of the teacher is shown in the 'Teacher(CIFAR10)' column of Table VIII, and that of the student is shown in the 'Student 4' column.
The CIFAR-100 dataset [37] has images of the same size and format as those in CIFAR-10, except that it has 100 categories with only one tenth as many labeled images per category. More categories and fewer labeled examples per category make the classification task on CIFAR-100 more challenging than that on CIFAR-10. We preprocessed the images in CIFAR-100 using the same methods as for CIFAR-10, and the teacher and student networks share the same hyperparameters as those on the CIFAR-10 dataset. The architecture of the teacher is also the same as that used for CIFAR-10, except that the number of units in the last softmax layer was changed to 100 to match the number of categories.
The hyperparameters are tuned by minimizing the error on a validation set consisting of the last 10,000 training examples of each dataset. Following the setting in FitNet [19], we set the batch size to 128, the maximum number of training epochs to 500, the learning rate to 0.17 for linear layers and 0.0085 for convolutional layers, and the momentum to 0.35. Following the hint layer proposed in FitNet [19], we pretrained a classifier using the features of the teacher network's middle layer and then applied the classifier to the student network's features.
Figure 3: Predicted scores of example images on CIFAR-10: (a) clean cat, (b) cat with SNR=10, (c) clean ship, (d) ship with SNR=10.
V-B Robustness of Student Networks
We evaluated the robustness of student networks learned through different algorithms under different intensities of perturbation. Since it is practically impossible to know in advance what the test data will look like, augmenting the training data with a particular noise is of limited help in resisting perturbations. Hence, we trained all networks on the clean training set and introduced White Gaussian Noise (WGN) into the test data as the perturbation. The intensity of the introduced noise was measured in terms of the Signal-to-Noise Ratio (SNR). We trained a student network with the proposed algorithm and compared it with the teacher network [38] and with student networks obtained by the KD [17] and FitNet [19] methods.
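A minimal sketch of how white Gaussian noise at a target SNR could be added to a test image, assuming the image is a flat list of pixel values and that the SNR is expressed in decibels as the ratio of mean signal power to noise power. The paper does not state its exact SNR convention, so both the decibel scale and the function names are assumptions of this sketch.

```python
import math
import random

def add_white_gaussian_noise(pixels, snr_db, rng=None):
    """Return a noisy copy of pixels whose expected SNR equals snr_db (in dB)."""
    if rng is None:
        rng = random.Random(0)
    signal_power = sum(p * p for p in pixels) / len(pixels)
    noise_power = signal_power / (10 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)  # standard deviation of the zero-mean noise
    return [p + rng.gauss(0.0, sigma) for p in pixels]

def measured_snr_db(clean, noisy):
    """Empirical SNR in dB between a clean signal and its noisy version."""
    signal_power = sum(c * c for c in clean) / len(clean)
    noise_power = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10.0 * math.log10(signal_power / noise_power)

# Hypothetical "image": 20,000 pixel intensities in [0, 1).
source = random.Random(0)
clean = [source.random() for _ in range(20000)]
noisy_10 = add_white_gaussian_noise(clean, 10.0, random.Random(1))
noisy_2 = add_white_gaussian_noise(clean, 2.0, random.Random(1))
```

Under this convention a lower SNR value means a larger noise variance, which matches the observation that lower SNR corresponds to a stronger perturbation.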
In Figure 2, we investigate the accuracy of these networks on three datasets with different SNR values. As the classification task on MNIST is relatively easy, lower SNRs, ranging from 9 down to 1, were chosen for it. A lower SNR value indicates that more perturbation is added. It can be seen from Figure 2(a) that the accuracy of the proposed robust student network is superior to that of the other three networks under nearly all SNR values. When the SNR equals 2, the two student networks from KD and FitNet perform even worse than the original teacher network, whereas our proposed algorithm achieves a clearly leading accuracy of 98.17%. When the SNR goes down to 1, the accuracy drops of the teacher network and of the student networks from KD and FitNet are serious, up to 5.65%, 7.25%, and 3.23%, respectively. In contrast, the accuracy of our robust student network drops by only 2.23%. Our method achieves better performance and shows more robustness when there is perturbation in the input.
A similar phenomenon can be observed in Figures 2(b) and (c) on the CIFAR-10 and CIFAR-100 datasets. As the SNR decreases, the accuracy of the KD network and FitNet drops faster than that of the teacher network, especially when the SNR drops from 12 to 10. Given the significant reduction in network complexity, the capacity of the student network can be seriously weakened, and the student network becomes more vulnerable to perturbations on data if no appropriate countermeasure is taken. However, the student network learned with the proposed algorithm remains robust to serious perturbations.
In Figure 3, we report the predicted scores of example images by different methods on the CIFAR-10 dataset. Even the clean image without noise looks fuzzy, since images from the CIFAR-10 dataset only have a resolution of 32×32. In the first column of Figure 3, all student networks confidently predict the ground-truth class 'cat' of the image. However, given the same image with SNR=10 noise added in Figure 3(b), although the student networks from the KD and FitNet methods barely made the correct prediction, KD also considered the image similar to 'deer', and FitNet assigned a higher confidence level to 'bird'. In contrast, our robust student network confidently kept its correct prediction even though the quality of the image had been seriously degraded by the perturbation. In addition, given the 'ship' image, the teacher network can withstand the perturbation, owing to the strong capacity of its complicated network structure. The KD method mistook it for a 'deer' image, while FitNet assigned a higher score to the label 'deer' for this 'ship' image. By encouraging more confident predictions with the help of the teacher network during the training stage, we obtain a robust student network that can not only keep the highest prediction score on the 'ship' label, but also suppress the predictions on wrong categories (see label 'cat' in Figures 3(c) and (d)).
Table I: Accuracy under different combinations of training/test noise (C: clean, G: Gaussian, P: Poisson).
Network  C/C  C/G  C/P  G/G  G/P  P/P  P/G
Teacher [38]  90.25%  86.30%  86.60%  89.02%  87.36%  89.11%  86.06%
KD [17]  91.07%  80.61%  80.86%  90.48%  82.14%  90.63%  82.27%
FitNet [19]  91.64%  82.41%  82.43%  90.86%  86.11%  91.10%  84.02%
Robust (proposed)  91.93%  90.37%  90.50%  90.37%  90.50%  90.50%  90.37%
Table II: Accuracy under image occlusion with different block sizes.
Dataset  Algorithm  2×2 block  4×4 block  6×6 block  8×8 block  10×10 block
CIFAR-10  Teacher [38]  89.57%  89.03%  87.69%  85.65%  81.94%
CIFAR-10  KD [17]  90.53%  89.53%  87.23%  83.37%  77.58%
CIFAR-10  FitNet [19]  91.08%  89.94%  87.40%  84.62%  79.71%
CIFAR-10  Robust (proposed)  91.25%  90.79%  88.34%  85.92%  82.15%
CIFAR-100  Teacher [38]  63.07%  61.92%  59.95%  57.41%  55.63%
CIFAR-100  KD [17]  63.48%  61.84%  58.73%  54.38%  51.67%
CIFAR-100  FitNet [19]  64.11%  62.62%  59.65%  55.51%  52.40%
CIFAR-100  Robust (proposed)  64.83%  63.16%  60.68%  57.49%  55.11%
V-C Comparison under Different Perturbation
A neural network might handle noisy test data if similar noise also exists in the training set. However, in practice, it is difficult to guarantee that the test data have the same kind of perturbation as the training data. We next evaluate the performance of different methods under different combinations of noisy training and test sets. The accuracies in the different settings are presented in Table I. The first row of Table I describes the experiment settings: the first capital letter indicates the noise type introduced into the training set, and the second letter indicates that of the test set. It should be noted that the Robust network results listed in this table are all obtained by training on the clean dataset and testing under the type of noise indicated by the first row. If both training and test data are clean, all networks achieve more than 90% accuracy, as shown in the first column of Table I. If the networks are trained on clean data and tested on data with Gaussian noise, the teacher network and the student networks from KD and FitNet are seriously affected and achieve at most 86.30% accuracy. However, the proposed robust student still attains more than 90% accuracy. A similar phenomenon can be observed when the networks are trained with Gaussian noise but tested with Poisson noise and vice versa, as shown in the G/P and P/G columns of Table I, respectively. If both training and test data are polluted with the same type of Gaussian noise, all networks fit the noisy data as far as possible and suffer only a slight performance drop. But this strict constraint on training and test data cannot always be satisfied in real-world applications. This shows that adding a limited set of noise types to the training set hardly improves the robustness of a neural network against the various unexpected kinds of noise found in practical applications.
Moreover, Table I shows that the teacher network's accuracy is worse than that of the Robust network when the training and test sets both contain Gaussian noise. The distributions of the training data and test data are not significantly different if they are both polluted by Gaussian noise. Hence, general teacher networks and student networks can fit the noisy data well and reach reasonable accuracy. The student network nevertheless achieves some performance improvement because its architecture is deeper than that of the teacher network. The depth encourages the reuse of features and leads to more abstract and invariant representations at higher layers. The proposed method can successfully train a deeper network by exploiting information from the teacher network.
V-D Complex Perturbation
In real-world applications, noise is not the only perturbation that may be encountered. More complex perturbations also challenge the robustness of neural networks. In this section, we investigate the robustness of the student network obtained by our proposed method under two more complex perturbations, i.e., image occlusion and domain adaptation.
V-D1 Image Occlusion
Because the target object in a real environment is often partially blocked, and such a perturbation removes the information in a contiguous area, image occlusion can affect the performance of neural networks more seriously than noise. To investigate the robustness of our method under this disturbance, we treat image occlusion as a more complex perturbation. To simulate occlusion in real-world applications, we randomly select a small rectangular area in an image and set the pixels covered by the rectangle to zero. Five block sizes, i.e., 2×2, 4×4, 6×6, 8×8, and 10×10, are used in the experiments, which were conducted on the CIFAR-10 dataset. The results are shown in Table II. Given 4×4 blocks, the teacher, KD, FitNet, and Robust networks achieve accuracies of 89.03%, 89.53%, 89.94%, and 90.79%, respectively. Given 8×8 blocks, the corresponding accuracies are 85.65%, 83.37%, 84.62%, and 85.92%. Larger blocks thus correspond to more serious perturbations of the images and degrade the performance of all networks. Nevertheless, the student network obtained by the proposed method consistently stays ahead because of its robustness.
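The occlusion procedure described above (pick a random rectangle, set its pixels to zero) can be sketched as below. The function name `occlude`, the height-width-channel array layout, and the choice of a square block that lies fully inside the image are assumptions made for illustration; the paper does not specify these details.

```python
import numpy as np

def occlude(image, block=4, rng=None):
    """Return a copy of `image` with one random block x block patch zeroed.

    `image` is assumed to be an array of shape (H, W) or (H, W, C).
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    # Sample the top-left corner so that the block stays inside the image.
    top = rng.integers(0, h - block + 1)
    left = rng.integers(0, w - block + 1)
    out = image.copy()
    out[top:top + block, left:left + block] = 0
    return out

# The five block sizes used in the occlusion experiment.
BLOCK_SIZES = (2, 4, 6, 8, 10)
```

Applying `occlude` with each entry of `BLOCK_SIZES` to the test images would produce five increasingly severe occlusion levels for a CIFAR-10-sized input.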
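| Algorithm | MNIST2USPS (MNIST) | MNIST2USPS (USPS) | USPS2MNIST (USPS) | USPS2MNIST (MNIST) |
|---|---|---|---|---|
| Teacher [38] | 99.45% | – | 96.41% | 86.88% |
| KD [17] | 99.35% | 93.25% | 96.26% | 82.74% |
| FitNet [19] | 99.49% | 94.12% | 96.56% | 87.23% |
| Robust (proposed) | 99.55% | 95.02% | 96.71% | 89.14% |
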
| Algorithm | params | layers | CIFAR-10 | CIFAR-100 |
|---|---|---|---|---|
| Student-teacher learning paradigm | | | | |
| Teacher [38] | 9M | 5 | 90.25% | 63.49% |
| Knowledge Distillation [17] | 2.5M | 19 | 91.07% | 64.13% |
| FitNet [19] | 2.5M | 19 | 91.64% | 64.86% |
| Robust learning (proposed) | 2.5M | 19 | 91.93% | 65.28% |
| State-of-the-art methods | | | | |
| Maxout Network [38] | | | 90.62% | 61.43% |
| Network in Network [39] | | | 91.20% | 64.32% |
| Deeply-Supervised Networks [40] | | | 91.78% | 65.43% |

| Algorithm | plane | car | bird | cat | deer | dog | frog | horse | ship | truck |
|---|---|---|---|---|---|---|---|---|---|---|
| Teacher [38] | 90.1% | 93.8% | 86.0% | 74.6% | 93.5% | 86.2% | 95.2% | 92.6% | 95.3% | 95.2% |
| KD [17] | 90.0% | 95.2% | 83.2% | 84.4% | 93.2% | 87.1% | 95.0% | 91.6% | 97.3% | 93.7% |
| FitNet [19] | 90.7% | 97.6% | 91.0% | 82.7% | 93.8% | 86.2% | 92.7% | 93.6% | 94.6% | 93.5% |
| Robust (proposed) | 91.0% | 97.0% | 90.3% | 83.6% | 92.4% | 87.2% | 95.4% | 93.2% | 95.1% | 94.1% |

| Algorithm | params | Misclass |
|---|---|---|
| Student-teacher learning paradigm | | |
| Teacher | 361K | 0.55% |
| Standard backpropagation | 30K | 1.90% |
| Knowledge Distillation [17] | 30K | 0.65% |
| FitNet [19] | 30K | 0.51% |
| Robust (proposed) | 30K | 0.45% |
| State-of-the-art methods | | |
| Maxout Network [38] | | 0.45% |
| Network in Network [39] | | 0.47% |
| Deeply-Supervised Networks [40] | | 0.39% |

V-D2 Domain Adaptation
In practical applications, not only unexpected noise and occlusion but also unexpected distribution shift can challenge the robustness of neural networks. Performance on a domain adaptation task is therefore also an important indicator of the adaptability of the algorithm.
In this experiment, we used the USPS dataset, which was obtained by scanning handwritten digits from envelopes by the U.S. Postal Service. The images in this dataset are all grayscale, and their values have been normalized. The whole dataset has 9,298 handwritten digit images, of which 7,291 are for training and the remaining 2,007 are for validation. Like MNIST, the USPS dataset has 10 categories, but the number of samples differs per category. In addition, since the image size in the MNIST dataset is 28×28, we pad the images in the USPS dataset to the same size for convenience. Moreover, we preprocess the USPS dataset in the same way as MNIST.
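The padding step can be sketched as follows. Several details here are assumptions rather than statements from the paper: USPS images are taken to be 16×16 (the size of the standard distribution of that dataset), the padding value is zero, and the digit is kept centered. The target side length is a parameter whose default, 28, is the standard MNIST image side.

```python
import numpy as np

def pad_to(image, size=28, value=0.0):
    """Pad a 2-D grayscale image symmetrically to size x size.

    Centering and zero fill are assumptions; the paper only says that
    USPS images are padded to the MNIST image size.
    """
    h, w = image.shape
    pad_h, pad_w = size - h, size - w
    if pad_h < 0 or pad_w < 0:
        raise ValueError("image is larger than the target size")
    top, left = pad_h // 2, pad_w // 2
    return np.pad(
        image,
        ((top, pad_h - top), (left, pad_w - left)),
        mode="constant",
        constant_values=value,
    )
```

With a 16×16 input and the default target, six rows and six columns of padding are added on every side, so the original digit occupies the central 16×16 window of the result.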
In this section, we train student networks on the MNIST dataset and test them on the USPS dataset. Similarly, we train networks on the USPS dataset and test them on MNIST. The results are shown in Table III. The first two columns show the results of adapting MNIST to USPS, and the last two columns report the performance of adapting USPS to MNIST. According to the results, the proposed algorithm achieves an accuracy of 95.02% on USPS when trained on MNIST, while the comparison methods KD and FitNet only reach 93.25% and 94.12%, respectively. This demonstrates that the proposed robust student network preserves its robustness advantage over the comparison methods when faced with the more complex perturbation of data in a domain adaptation task. A similar phenomenon can be observed in the results of ‘USPS to MNIST’: with similar accuracy on the USPS dataset, the Robust network outperforms the networks obtained by the other algorithms on MNIST. Moreover, the results obtained by training on MNIST and testing on USPS are much better than those obtained by training on USPS and testing on MNIST. This is because the number of images in the MNIST dataset is much larger than that in USPS. Networks trained on the MNIST dataset can extract more useful information from a larger amount of data and thus have better generalization capabilities.
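| Network | layers | params | mult | Speedup Ratio | Compression Ratio | FitNet | Robust |
|---|---|---|---|---|---|---|---|
| Teacher | 5 | 9M | 725M | | | 90.25% (teacher) | |
| Student 1 | 11 | 250K | 30M | | | 89.07% | 89.62% |
| Student 2 | 11 | 862K | 108M | | | 91.02% | 91.37% |
| Student 3 | 13 | 1.6M | 392M | | | 91.16% | 91.50% |
| Student 4 | 19 | 2.5M | 382M | | | 91.64% | 91.93% |
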
Teacher(MNIST)  Student(MNIST)  Teacher(CIFAR)  Student 1  Student 2  Student 3  Student 4 

conv 3x3x48  conv 3x3x16  conv 3x3x96  conv 3x3x16  conv 3x3x16  conv 3x3x32  conv 3x3x32 
pool 4x4  conv 3x3x16  pool 4x4  conv 3x3x16  conv 3x3x32  conv 3x3x48  conv 3x3x32 
pool 4x4  conv 3x3x16  conv 3x3x32  conv 3x3x64  conv 3x3x32  
pool 2x2  pool 2x2  conv 3x3x64  conv 3x3x48  
pool 2x2  conv 3x3x48  
pool 2x2  
conv 3x3x48  conv 3x3x16  conv 3x3x96  conv 3x3x32  conv 3x3x48  conv 3x3x80  conv 3x3x80 
pool 4x4  conv 3x3x16  pool 4x4  conv 3x3x32  conv 3x3x64  conv 3x3x80  conv 3x3x80 
pool 4x4  conv 3x3x32  conv 3x3x80  conv 3x3x80  conv 3x3x80  
pool 2x2  pool 2x2  conv 3x3x80  conv 3x3x80  
pool 2x2  conv 3x3x80  
conv 3x3x80  
pool 2x2  
conv 3x3x48  conv 3x3x16  conv 3x3x96  conv 3x3x48  conv 3x3x96  conv 3x3x128  conv 3x3x128 
pool 4x4  conv 3x3x16  pool 4x4  conv 3x3x48  conv 3x3x96  conv 3x3x128  conv 3x3x128 
pool 4x4  conv 3x3x64  conv 3x3x128  conv 3x3x128  conv 3x3x128  
pool 8x8  pool 8x8  pool 8x8  conv 3x3x128  
conv 3x3x128  
conv 3x3x128  
pool 8x8  
fc  fc  fc  fc  fc  fc  fc 
softmax  softmax  softmax  softmax  softmax  softmax  softmax 
V-E Comparison with State-of-the-Art Methods
Although the main purpose of this paper is to improve the robustness of student networks rather than their performance on clean data, we also compared the proposed approach with state-of-the-art teacher-student learning methods on clean datasets. On clean data, the proposed algorithm still achieves accuracy comparable to the others, as shown in Tables VI and IV. Table VI summarizes the results on three datasets: MNIST, CIFAR-10, and CIFAR-100. On the MNIST dataset, the teacher network reached 99.45% accuracy. With the assistance of KD, the student network achieved 99.46% accuracy. FitNet generated a slightly better student network with 99.49% accuracy, which outperforms the teacher network. Although the proposed algorithm aims to enhance the robustness of the learned student network, it also achieves accuracy comparable to or even better than that of state-of-the-art methods: the accuracy obtained by the proposed method increased to 99.51% on the MNIST dataset. Table IV shows the results on the CIFAR-10 dataset. The baseline teacher network achieved 90.25% accuracy, and the accuracies of the student networks generated by KD and FitNet were 91.07% and 91.64%, respectively. The Robust student network obtained 91.93% accuracy, which outperforms the other student networks and the teacher. This suggests that the proposed method is able to enhance the stability of the student network and thereby improve its performance.
CIFAR-100 is similar to CIFAR-10 but more challenging because of its 100 categories. The teacher network reaches only 63.49% accuracy on CIFAR-100, compared with 90.25% on CIFAR-10. The robust student network achieved 65.28% accuracy, which outperforms the student networks trained by other strategies: the network trained by knowledge distillation obtains a test accuracy of 64.13%, and FitNet reaches 64.86%. Compared with other methods, the student network generated by the proposed method provides nearly state-of-the-art performance. This result demonstrates that the proposed method succeeds in learning a student network with competitive performance.
V-F Analysis on Structures of Student Network
We followed the experimental setting in FitNet [19] and designed four student networks of different sizes and structures; their detailed structures can be found in Table VIII. The teacher network had the same structure as that used on the CIFAR-10 dataset. From ‘Student 1’ to ‘Student 4’, the size of the network gradually increases, and so does its performance. Table V reports the performance of the four student networks and the teacher network on the CIFAR-10 dataset. The compression and speedup ratios relative to the teacher, together with the numbers of parameters and multiplications, can be found in Table VII.
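As a quick consistency check on the four student structures, the sketch below encodes the convolutional blocks as we read them from the architecture table and counts layers. Two assumptions are made here: the column alignment of the flattened table is our reconstruction, and the layer count is taken to be the number of convolutional layers plus the fc and softmax layers. Under these assumptions the counts reproduce the listed 11, 11, 13 and 19 layers.

```python
# Output channels of the 3x3 convolutions in each of the three blocks,
# as read from the architecture table (pooling layers are not counted).
STUDENTS = {
    "Student 1": [(16, 16, 16), (32, 32, 32), (48, 48, 64)],
    "Student 2": [(16, 32, 32), (48, 64, 80), (96, 96, 128)],
    "Student 3": [(32, 48, 64, 64), (80, 80, 80, 80), (128, 128, 128)],
    "Student 4": [
        (32, 32, 32, 48, 48),
        (80, 80, 80, 80, 80, 80),
        (128, 128, 128, 128, 128, 128),
    ],
}

def count_layers(blocks):
    """Count conv layers plus the final fc and softmax layers (assumed convention)."""
    return sum(len(block) for block in blocks) + 2

for name, blocks in STUDENTS.items():
    print(name, count_layers(blocks))
```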
From Table V, we find that the proposed robust student network outperforms FitNet under all four student structures. Although there is no perturbation on the data, the proposed method achieves higher accuracy, which indicates the effectiveness of encouraging the student network to make confident predictions with the help of the teacher network. In addition, the smallest network, Student 1, has the largest compression and speedup ratios, yet it still achieves a test accuracy of 89.62%, which is fairly close to the 90.25% of the teacher and outperforms the 89.07% obtained by FitNet. As Student 1 contains significantly fewer parameters than the teacher, improving the accuracy of such a network with limited capacity is challenging, which in turn suggests the effectiveness of the proposed method.
Although the student networks learned by the proposed method contain significantly fewer parameters, they are still regular networks, which can be further compressed and sped up by existing sparsity-based deep neural network compression techniques, such as deep compression [10] and feature compression [11].
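To make the size gap concrete, the sketch below derives two ratios from the parameter and multiplication counts listed for each network. Both definitions are assumptions for illustration: the compression ratio is taken as teacher parameters divided by student parameters, and the multiplication ratio is only a theoretical proxy for speedup, so it need not match a speedup ratio measured on hardware.

```python
UNITS = {"K": 1e3, "M": 1e6}

def parse_count(text):
    """Convert a count such as '250K' or '2.5M' to a float."""
    return float(text[:-1]) * UNITS[text[-1]]

# (params, multiplications) as listed for each network.
NETWORKS = {
    "Teacher": ("9M", "725M"),
    "Student 1": ("250K", "30M"),
    "Student 2": ("862K", "108M"),
    "Student 3": ("1.6M", "392M"),
    "Student 4": ("2.5M", "382M"),
}

def ratios(name, reference="Teacher"):
    """Return (parameter ratio, multiplication ratio) relative to the reference."""
    ref_params, ref_mult = (parse_count(v) for v in NETWORKS[reference])
    params, mult = (parse_count(v) for v in NETWORKS[name])
    return ref_params / params, ref_mult / mult

compression, mult_ratio = ratios("Student 1")
print(compression)  # 36.0: the teacher has 36 times as many parameters
```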
VI Conclusion
We proposed to learn a robust student network with the guidance of the teacher network. The proposed method prevented the student network from being disturbed by the perturbations on input examples. Through a rigorous theoretical analysis, we proved a lower bound of perturbations that will weaken the student network’s confidence in its prediction. We introduced new objectives based on prediction score and gradients of examples to maximize this lower bound and then improved the robustness of the learned student network to resist perturbations on examples. Experimental results on several benchmark datasets demonstrate the proposed method is able to learn a robust student network with satisfying accuracy and compact size.
References
 [1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [3] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [4] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
 [5] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [6] Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing deep convolutional networks using vector quantization,” arXiv preprint arXiv:1412.6115, 2014.
 [7] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems, 2014, pp. 1269–1277.
 [8] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in International Conference on Machine Learning, 2015, pp. 2285–2294.
 [9] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, 2015, pp. 1135–1143.
 [10] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
 [11] Y. Wang, C. Xu, S. You, D. Tao, and C. Xu, “Cnnpack: packing convolutional neural networks in the frequency domain,” in Advances in Neural Information Processing Systems, 2016, pp. 253–261.
 [12] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1,” arXiv preprint arXiv:1602.02830, 2016.
 [13] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
 [14] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 5987–5995.
 [15] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” arXiv preprint arXiv:1610.02357, 2017.
 [16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
 [17] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
 [18] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2654–2662. [Online]. Available: http://papers.nips.cc/paper/5484dodeepnetsreallyneedtobedeep.pdf
 [19] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
 [20] P. McClure and N. Kriegeskorte, “Representational distance learning for deep neural networks,” Frontiers in computational neuroscience, vol. 10, 2016.
 [21] S. You, C. Xu, C. Xu, and D. Tao, “Learning from multiple teacher networks,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 1285–1294.
 [22] Y. Wang, C. Xu, C. Xu, and D. Tao, “Adversarial learning of portable student networks,” in AAAI, 2018.
 [23] ——, “Beyond filters: Compact feature map for portable deep model,” in International Conference on Machine Learning, 2017, pp. 3703–3711.
 [24] J. Jin, A. Dundar, and E. Culurciello, “Flattened convolutional neural networks for feedforward acceleration,” arXiv preprint arXiv:1412.5474, 2014.
 [25] M. Wang, B. Liu, and H. Foroosh, “Factorized convolutional neural networks,” CoRR, abs/1608.04337, 2016.
 [26] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv preprint arXiv:1602.07360, 2016.
 [27] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, 2017.
 [28] V. Vanhoucke, “Learning visual representations at scale,” 2014.
 [29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions,” in CVPR, 2015.
 [30] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” 2015.
 [31] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” arXiv preprint arXiv:1707.01083, 2017.
 [32] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, “Deep fried convnets,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1476–1483.
 [33] V. Sindhwani, T. Sainath, and S. Kumar, “Structured transforms for small-footprint deep learning,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 3088–3096. [Online]. Available: http://papers.nips.cc/paper/5869structuredtransformsforsmallfootprintdeeplearning.pdf
 [34] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680. [Online]. Available: http://papers.nips.cc/paper/5423generativeadversarialnets.pdf
 [35] T. Julliand, V. Nozick, and H. Talbot, “Image noise and digital image forensics,” in International Workshop on Digital Watermarking. Springer, 2015, pp. 3–17.
 [36] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [37] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
 [38] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.
 [39] M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
 [40] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Artificial Intelligence and Statistics, 2015, pp. 562–570.