Recent years have witnessed marked progress in deep learning. Since the breakthrough in the 2012 ImageNet competition achieved by AlexNet, with five convolutional layers and three fully connected layers, a series of more advanced deep neural networks have been developed that keep rewriting the record, e.g., VGGNet, GoogLeNet, and ResNet. However, their excellent performance requires the support of a huge amount of computation. For instance, AlexNet contains about 232 million parameters and needs a massive number of multiplications to process a single image. Hence, the potential power of deep neural networks can only be fully unlocked on high-performance GPU servers or clusters. In contrast, the majority of the mobile devices used in our daily life have rigorous constraints on storage and computational resources, which prevents them from fully taking advantage of deep neural networks. As a result, networks with smaller hardware demands that still maintain similar accuracies are of great interest to the image processing and computer vision communities.
The student-teacher learning framework, introduced in knowledge distillation (KD), is one of the most popular approaches to model compression and acceleration [12, 11]. Taking a heavy neural network, such as GoogLeNet or ResNet, that has already been well trained with massive data and computing resources as the teacher network, a student network of light architecture can be better learned under the teacher's guidance. To inherit the advantages of teacher networks, different methods have been proposed to encourage consistency between the teacher and student networks. For example, Ba and Caruana minimized the Euclidean distance between features extracted from these two networks, Hinton et al. encouraged the student to mimic a softened version of the teacher's output, and FitNet introduced intermediate-level hints from the teacher's hidden layers to guide the training process of the student. McClure and Kriegeskorte proposed to keep the pairwise distances of examples consistent between the student and teacher networks. You et al. utilized multiple teacher networks to guide the training process of the student network. Wang et al. introduced a teaching assistant to encourage similarity between the distributions of feature maps extracted from the teacher and student networks.
These aforementioned algorithms have achieved impressive experimental results; however, they were mainly developed for ideal scenarios, where all data are implicitly assumed to be clean. In practice, given examples with perturbations, the training process of the network can be seriously influenced, and the resulting network would not be as confident as before in making predictions. The teacher network might also make mistakes, since it is difficult for the teacher to be familiar with all examples fed into the student network. This is consistent with student-teacher learning in the real world: an excellent student is expected to solve practical problems in changeable circumstances, where there might be questions not even known to teachers.
To solve this problem, we introduce a robust teacher-student learning algorithm in this paper. The framework of the proposed method is illustrated in Figure 1. We enable the student network to be more confident in its predictions with the help of the teacher network. Perturbations on examples might seriously influence the learning of the student network. Through a rigorous theoretical analysis, we derive the lower bound of the perturbation required to make the student more vulnerable than the teacher. New objectives in terms of prediction scores and gradients of examples are further developed to maximize this lower bound on the required perturbation. Hence, the overall robustness of the student network to perturbations on examples can be improved. Experimental results on benchmark datasets demonstrate the superiority of the proposed method for learning compact and robust deep neural networks.
The rest of the paper is organized as follows. In Section II, we summarize related works on learning convolutional neural networks with fewer parameters. Section III introduces the previous work on which we build. In Section IV, we formally introduce our robust student network learning method in detail, including the mathematical proof of the proposed theorem, the calculation of the loss function, and the training strategy. Section V provides results of our algorithm on various benchmark datasets to demonstrate its effectiveness. Section VI concludes this paper.
II Related Works
In this section, we briefly introduce related works on learning efficient convolutional neural networks with fewer parameters. These works fall into two different categories according to their techniques and motivations.
II-A Network Trimming
Network trimming aims to remove redundancy in heavy networks to obtain a compact network with fewer parameters and less computational complexity, while the accuracy of the portable network stays close to that of the original large model. Gong et al. utilized vector quantization to compress neural networks, introducing cluster centers of weights as the representation of similar weights. Denton et al. applied singular value decomposition to the weight matrix of a fully-connected layer to reduce the number of parameters. Chen et al. explored hash encoding to improve the compression ratio. Courbariaux et al. and Rastegari et al. implemented binary networks, in which all weights, previously stored as 32-bit floating-point numbers, are converted to binary values. Moreover, Wang et al. and Han et al. exploited weight pruning to achieve the same goal. In particular, Han et al. focused on removing subtle weights so as to reduce the number of parameters while minimizing the impact of their removal; over 80% of subtle weights were dropped without an accuracy drop. Furthermore, Han et al. integrated several neural network compression techniques, i.e., pruning, quantization, and Huffman coding, to further compress the network. Wang et al. showed that redundancy exists not only in subtle weights but also in large weights; they converted convolutional kernels into the frequency domain to reduce the redundancy contained in larger weights and thereby compress networks with a higher compression ratio. In addition, Wang et al. focused on the redundancy in feature maps instead of network weights, which can also be considered a modification of the network architecture. Although network trimming brings a considerable compression and speedup ratio, due to the highly sparse parameters and irregular network architectures, the actual acceleration effect often depends heavily on customized hardware.
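The magnitude-based pruning described above (dropping subtle weights) can be sketched in a few lines. This is an illustrative NumPy version under assumed settings, not the exact procedure of Han et al.:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.8):
    """Zero out the fraction `sparsity` of weights with the smallest magnitudes."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

np.random.seed(0)
w = np.random.randn(64, 64)            # a stand-in for a layer's weight matrix
w_pruned = magnitude_prune(w, sparsity=0.8)
```

In the cited works the pruned network is then fine-tuned to recover accuracy; the sketch only shows the masking step.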
II-B Designing Small Networks
Directly designing a new deep neural network of light size is a straightforward approach to efficient deep learning. Most of these methods increase the depth of networks while keeping complexity much lower than simply stacking convolutional layers. For example, ResNet introduced a novel residual block that obtained significant performance gains at only slight computation costs. ResNeXt explored group convolutions in its building blocks to boost performance. Flattened networks introduced fully factorized convolutions and designed an extremely factorized network. Almost at the same time, Factorized Networks introduced topological convolution, which treats sections of tensors separately. SqueezeNet designed a portable network with a bottleneck architecture. SENet proposed a novel architecture named the SE block, which focuses on the relationships between channels. Moreover, depth-wise separable convolutions were introduced to obtain great gains in the speed and the size of networks. With the help of depth-wise separable convolutions, Inception models [29, 30] reduced the complexity of the first few layers of the network. Later, the Xception network outperformed Inception models by scaling up depth-wise separable convolutional filters. Subsequently, MobileNets combined channel-wise decomposition of convolutional filters with depth-wise separable convolutions and achieved state-of-the-art results among portable models. ShuffleNet introduced a novel form of group convolution and depth-wise separable convolution. Deep fried convnets introduced a novel Adaptive Fastfood transform to reduce the computation of networks. Structured transform networks offered considerable accuracy-compactness-speed tradeoffs based on new notions rooted in the theory of structured matrices.
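The parameter savings behind depth-wise separable convolutions, which several of the above designs rely on, can be verified with a quick count (biases ignored for simplicity; the layer sizes below are arbitrary examples):

```python
def conv_params(k, c_in, c_out):
    # standard convolution: one k x k filter per (input channel, output channel) pair
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    # one k x k depth-wise filter per input channel, then a 1x1 point-wise convolution
    return k * k * c_in + c_in * c_out

std = conv_params(3, 128, 128)                   # 147456 parameters
sep = depthwise_separable_params(3, 128, 128)    # 17536 parameters
ratio = std / sep                                # roughly 8.4x fewer parameters
```

The ratio grows with the number of output channels, which is why the savings are largest in the wide later layers of a network.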
II-C Teacher-Student Learning
There is another way to train a portable network: regard a well-trained network as the teacher and a deeper yet thinner network as the student. With the help of the intrinsic information captured by the teacher network, the deeper and thinner student network can be well trained. Ba and Caruana suggested that the student network mimic the features extracted from the last layer of the teacher network to assist its training, thereby increasing the depth of the student network. Knowledge Distillation (KD) pointed out that for two networks with huge structural differences, it is difficult to directly mimic features. Therefore, KD proposes to minimize the distance between the relaxed softmax outputs of the two networks. This strategy can further deepen the student network. FitNet, based on KD, minimized the difference between features extracted from the middle layers of the student and teacher networks. It added several MLP layers at the middle layer of the teacher network to match the dimensions of the student network's features. By establishing a connection between the middle layers of the two networks, the student network can be further deepened with fewer parameters. McClure and Kriegeskorte attempted to minimize the distance between pairs of samples to reduce the difficulty of training student networks. You et al. proposed utilizing multiple teacher networks to provide more guidance for the training of student networks, leveraging a voting strategy to balance the guidance from each teacher network. Wang et al. regarded the student network as the generator of a GAN and utilized a discriminator as an assistant of the teacher, forcing the student to generate features that are difficult to distinguish from those of the teacher.
Compared with network trimming algorithms, the student-teacher learning framework has more flexibility, no special hardware requirements, and a more regular network structure. Compared with directly designing a deeper network, guidance from the teacher is beneficial for learning deep networks and improving the performance of the student. However, existing student-teacher algorithms pay more attention to improving the performance of the student network on clean datasets. The performance degradation under perturbed settings, caused by the instability that comes with the large reduction in parameters, has not yet been studied. Therefore, a more robust learning algorithm for improving student network performance under perturbed conditions needs to be developed. This paper proposes such a method under the teacher-student learning and knowledge distillation framework, which enhances the robustness of the student network.
III Preliminary of Teacher-Student Learning
To make this paper self-contained, we briefly introduce some preliminary knowledge of teacher-student learning here.
The teacher network has a complicated architecture, and it has already been well trained to achieve sufficiently high performance. We aim to learn a student network that is deeper yet thinner than the teacher network while achieving a satisfactory accuracy. Let be the example space and be its corresponding label space. The outputs of these two networks are defined as:
where and are the features produced by pre-softmax layers of teacher and student networks, respectively.
The teacher network is usually trained on a relatively large dataset and consists of a large number of parameters, so it usually achieves high accuracy on the classification task. Given significantly fewer parameters and multiplication operations, it is difficult for the student network to achieve high performance if it adopts the same training strategy as the teacher network. It is therefore necessary to improve the student network's performance with the assistance of the teacher network. A straightforward method is to encourage the features of an image extracted from these two networks to be similar. The objective function can be written as
where the second term helps the student network to extract knowledge from the teacher, refers to the cross-entropy loss, indicates the output of the -th example by the student network, refers to the corresponding label, and is the coefficient balancing the two terms. The teacher and student networks can differ significantly in architecture, and it is thus unrealistic to expect the features extracted by the two networks for the same example to be identical. Hence, Knowledge Distillation (KD), as an effective alternative, was proposed to distill knowledge from classification results by minimizing
where the second term aims to enforce the student network to learn from the softened output of the teacher network. is a relaxation function defined as follows:
is introduced to make sure that the second term in equation (3) plays a different role from the first one. This is because the output might be extremely similar to the one-hot code representation of the ground-truth labels, while a softened version of the output differs from the true labels. Moreover, the softened output can also provide more information to guide the learning of the student, as the cross-entropy loss with the softened output enhances the influence of classes other than the true one.
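As a concrete illustration, the relaxation function above can be implemented as a temperature-scaled softmax; the temperature value 4 below is only an assumed setting:

```python
import numpy as np

def softened_softmax(logits, T=1.0):
    """Softmax of logits divided by a temperature T; larger T gives a softer distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()            # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([8.0, 2.0, 1.0])      # hypothetical pre-softmax outputs
hard = softened_softmax(logits, T=1.0)  # nearly one-hot
soft = softened_softmax(logits, T=4.0)  # assigns visible mass to the other classes
```

Matching `soft` rather than `hard` is what lets the student see the teacher's relative rankings of the wrong classes.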
Although the KD loss in equation (3) allows the student network to access knowledge from the teacher network, the significant reduction in the number of parameters decreases the capacity of the student network and makes it more vulnerable to input disturbances. The learned student network might achieve reasonable performance on clean data, but it suffers a serious performance decline when encountering perturbations on the data in real-world applications. It is therefore necessary to enforce the robustness of the student network for practical scenarios.
IV Robust Student Network Learning
We take a multi-class classification problem over classes as an example to introduce our robust student network learning. Given a teacher network and a student network, an example can be classified by the two networks, respectively. Denote and as the -th value of the -dimensional vectors and , respectively. We then define and as the scores produced by the two networks for the ground-truth label of the example. If a classifier has more confidence in its prediction, the predicted score will be higher. With the help of the teacher network, the student network is supposed to be more confident in its prediction, so that
IV-A Theoretical Analysis
The above relationship holds in the ideal noise-free scenario. In practical scenarios, perturbations on examples are unavoidable, and the student network is expected to resist such unexpected influences and deliver a robust prediction,
where is a perturbation added to . We restrict this perturbation to a spherical space of radius , and is a constraint set specifying requirements on the input; e.g., an image input should lie in , where is the dimension. We define the ball as .
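For illustration, restricting the perturbation to a ball of radius while keeping the perturbed input in a valid box can be written as a projection step. The L-infinity ball below is one concrete choice of norm, used here only as an assumption:

```python
import numpy as np

def project_perturbation(delta, x, radius, lo=0.0, hi=1.0):
    """Clip delta into an L-infinity ball of the given radius, then ensure
    the perturbed input x + delta stays inside the valid box [lo, hi]."""
    delta = np.clip(delta, -radius, radius)
    delta = np.clip(x + delta, lo, hi) - x
    return delta

x = np.array([0.05, 0.50, 0.98])       # a tiny "image" with pixel values in [0, 1]
delta = np.array([0.30, -0.30, 0.30])  # a raw perturbation, too large to be admissible
d = project_perturbation(delta, x, radius=0.1)
```

After projection, the perturbed input x + d satisfies both the radius constraint and the box constraint.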
We aim to discover a student network that stands on the shoulders of the teacher to make confident predictions not only for clean examples but also for examples with perturbations. The perturbation exists on examples without influencing their corresponding ground-truth labels. However, as the perturbation intensity increases, the learning process of the student network can be seriously disturbed. Taking equation (6) as an auxiliary constraint in training the student network can help improve the robustness of the network, but it is difficult, if not impossible, to enumerate every possible perturbation to form the constraint. To make the optimization problem tractable, we seek alternatives and proceed to study the maximum perturbation that can be defended against by the system. Figure 1 shows the framework of our approach.
Let be an example in . and are functions adapted from the student and the teacher networks to predict the label of example , respectively. Given , for any with
we have .
By the fundamental theorem of calculus, we have
If the perturbation is so significant that , we get
Consider the fact that
where the denominator can be further upper bounded using the following inequality
The lower bound for the q-norm of to break the robust prediction of the student network (i.e., equation (6)) is therefore
which completes the proof. ∎
According to Theorem 1, by maximizing the value of while minimizing the value of , the lower bound on will be enlarged, so that the student network is able to tolerate more severe perturbations and becomes more robust in making confident predictions. Without loss of generality, we take in the following discussion.
Based on the above analysis, two new objectives are introduced into the teacher-student learning paradigm to achieve a robust student network. To encourage , we minimize the loss function
where is a constant margin. is supposed to be greater than ; otherwise, there will be a penalty for the student network. It is difficult to explicitly calculate the value of due to the max operation. However, by appropriately setting the radius and considering a sufficiently large training set, the data point in the ball that reaches the maximum value would often have some close examples in the training set. Hence, to minimize the value of , we propose to minimize the difference between the gradients of the student and teacher networks w.r.t. the training examples as
where is the relaxation function explained in equation (3). In addition, taking the KD loss into consideration, the resulting objective function of our robust student network learning algorithm can be written as:
where and are the balancing coefficients of and , respectively.
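Since the exact equations above appear in symbolic form, the following sketch only illustrates how the three terms could be combined on precomputed quantities; the margin value, the coefficients, and the squared-L2 form of the gradient term are assumptions for illustration:

```python
import numpy as np

def kd_term(p_student_soft, p_teacher_soft):
    # cross-entropy between the softened teacher and student distributions
    return float(-np.sum(p_teacher_soft * np.log(p_student_soft + 1e-12)))

def margin_term(score_student, score_teacher, margin=0.1):
    # penalize the student unless its ground-truth score exceeds the teacher's by the margin
    return max(0.0, margin - (score_student - score_teacher))

def gradient_term(grad_student, grad_teacher):
    # squared L2 distance between the input gradients of the two networks
    return float(np.sum((grad_student - grad_teacher) ** 2))

def robust_loss(ce, p_s_soft, p_t_soft, s_s, s_t, g_s, g_t,
                lam1=1.0, lam2=0.5, margin=0.1):
    return (ce + kd_term(p_s_soft, p_t_soft)
            + lam1 * margin_term(s_s, s_t, margin)
            + lam2 * gradient_term(g_s, g_t))

loss = robust_loss(ce=0.3,
                   p_s_soft=np.array([0.7, 0.2, 0.1]),
                   p_t_soft=np.array([0.8, 0.1, 0.1]),
                   s_s=0.9, s_t=0.7,
                   g_s=np.array([0.1, 0.2]), g_t=np.array([0.0, 0.2]))
```

The margin term rewards a student score that exceeds the teacher's, while the gradient term pulls the student's input sensitivity toward the teacher's, matching the two objectives derived above.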
The process of training the student network is summarized in Algorithm 1. After initializing the student network, we train it according to the proposed algorithm. Next, we explain the calculation of the loss in detail. For convenience, we set the batch size to 1; that is, we first select a sample from the dataset as input for the forward propagation of the teacher network and the student network. We then calculate the outputs of the two networks. Combining the outputs with the corresponding label, we can calculate the first term in equation (17) according to equations (3) and (4). The outputs are -dimensional vectors containing the networks' prediction scores over the categories. With the help of the label, we obtain the predicted scores for the ground-truth label and calculate the second term in equation (17) according to equation (15). To obtain the value of the gradient term, we first calculate the derivatives of the two networks' outputs with respect to the input sample. As in the back-propagation algorithm, we can apply the chain rule to obtain these results; in experiments, we utilize the automatic differentiation tools integrated in mainstream deep learning platforms. After obtaining the gradients, the loss can be calculated using equation (17). Finally, the weights of the student network are updated by the gradients obtained by the back-propagation algorithm.
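In practice the input gradients come from automatic differentiation, as noted above; the toy check below instead uses central finite differences on two hypothetical linear score functions, just to make the gradient computation concrete:

```python
import numpy as np

def input_gradient(score_fn, x, eps=1e-5):
    """Estimate d(score)/d(x) numerically; autograd plays this role in real training."""
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        e = np.zeros_like(x, dtype=float)
        e[i] = eps
        g[i] = (score_fn(x + e) - score_fn(x - e)) / (2 * eps)
    return g

# hypothetical "networks": linear scores for the ground-truth class
w_teacher = np.array([0.5, -1.0, 2.0])
w_student = np.array([0.4, -0.9, 1.8])
x = np.array([1.0, 2.0, 3.0])

g_t = input_gradient(lambda v: float(w_teacher @ v), x)
g_s = input_gradient(lambda v: float(w_student @ v), x)
grad_gap = float(np.linalg.norm(g_s - g_t))   # the quantity the gradient loss shrinks
```

For linear scores the numerical gradients equal the weight vectors, so the gap reduces to the distance between the two weight vectors.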
The delta in equation (7) is the noise in an image. Noise can come from various sources: some are physical, linked to the nature of light and to optical artifacts, and others are created during the conversion from electrical signal to digital data. As noise degrades the quality of an image, the performance of neural networks on the image classification task can be seriously influenced. The proposed robust student network aims to handle unexpected noise in images and to preserve consistent decisions with or without noise (see equations (5) and (6)). In the literature, the overall noise produced by a digital sensor is usually modeled as stationary additive white Gaussian noise. We report the robustness of the learned networks against Gaussian noise in the experiments. In addition, we also evaluate the performance of the learned networks against combinations of different types of noise on the training and test sets, since it is difficult to know in advance what types of noise may occur at test time.
[Figure 2: accuracy under different SNR values on (a) MNIST, (b) CIFAR-10, and (c) CIFAR-100.]
V Experiments

In this section, we experimentally investigate the effectiveness of the proposed robust student network learning algorithm. The learned student network is compared with the original teacher network and the student networks learned through KD  and FitNet . The experiments are conducted on three benchmark datasets: MNIST , CIFAR-10 , and CIFAR-100 .
V-A Datasets and Settings
MNIST  is a handwritten digit dataset (digits 0 to 9) composed of greyscale images from ten categories. The whole dataset of 70,000 images is split into 60,000 training and 10,000 test images. Following the setting in , we trained a teacher network of maxout convolutional layers as reported in , which contains 3 convolutional maxout layers and a fully-connected layer with 48-48-24-10 units, respectively. We then designed a student network containing 6 convolutional maxout layers and a fully-connected layer, which is twice as deep as the teacher network but has roughly 8% of the parameters. The architectures of the teacher and student networks are shown in detail in the first two columns of Table VIII.
CIFAR-10  is a dataset of RGB color images drawn from 10 categories. Its 60,000 images are split into 50,000 training and 10,000 test images. Following  and , we preprocessed the data using global contrast normalization (GCN) and ZCA whitening, and augmented the training data via random flipping. We followed the architectures used in Maxout  and FitNet  to train a teacher network with three maxout convolutional layers of 96-192-192 units. For a fair comparison, we designed a student network with a structure similar to FitNet, which has 17 maxout convolutional layers followed by a maxout fully-connected layer and a top softmax layer; we also investigated the KD method with the same architecture. The detailed architecture of the teacher is shown in the 'Teacher (CIFAR-10)' column of Table VIII, and that of the student in the 'Student 4' column.
The CIFAR-100 dataset  has images of the same size and format as those in CIFAR-10, except that it has 100 categories with only one tenth as many labeled images per category. More categories and fewer labeled examples per category make the classification task on CIFAR-100 more challenging than that on CIFAR-10. We preprocess images in CIFAR-100 using the same methods as for CIFAR-10, and the teacher and student networks share the same hyper-parameters as on the CIFAR-10 dataset. Besides, the architecture of the teacher is also the same as that used for CIFAR-10, except that the number of units in the last softmax layer is changed to 100 to match the number of categories.
The hyper-parameters are tuned by minimizing the error on a validation set consisting of the last 10,000 training examples of each dataset. Following the setting in FitNet , we set the batch size to 128, the maximum number of training epochs to 500, the learning rate to 0.17 for linear layers and 0.0085 for convolutional layers, and the momentum to 0.35. Following the hint layer proposed in FitNet , we pre-trained a classifier using the features of the teacher network's middle layer and then applied this classifier to the student network's features.
[Figure 3: example images from CIFAR-10: (a) clean cat, (b) cat with SNR=10, (c) clean ship, (d) ship with SNR=10.]
V-B Robustness of Student Networks
We evaluated the robustness of student networks learned through different algorithms under different intensities of perturbation. Since it is difficult, if not impossible, to know in advance what the test data will be in practice, augmenting the training data with a particular noise is not very helpful for resisting perturbations. Hence, we trained all networks on the clean training set and introduced white Gaussian noise (WGN) into the test data as the perturbation. The intensity of the introduced noise is measured in terms of the signal-to-noise ratio (SNR). We trained the proposed algorithm and compared it with the teacher network  and the student networks from the KD  and FitNet  methods.
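White Gaussian noise at a prescribed SNR can be generated as below; we assume here that the SNR is the linear ratio of signal power to noise power, since the text does not state the unit:

```python
import numpy as np

def add_wgn(x, snr):
    """Add white Gaussian noise so that (signal power) / (noise power) == snr."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / snr
    noise = np.random.randn(*x.shape) * np.sqrt(noise_power)
    return x + noise

np.random.seed(0)
x = np.random.randn(10000)          # a stand-in for flattened image data
noisy = add_wgn(x, snr=10.0)
measured_noise_power = float(np.mean((noisy - x) ** 2))
```

A lower SNR value puts more noise power on the signal, which is the sense in which lower SNR means stronger perturbation in the figures.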
Figure 2 shows the accuracy of these networks on the three datasets under different SNR values; a lower SNR value indicates that more perturbation is added. As the classification task on MNIST is relatively easy, lower SNRs were chosen, from 9 down to 1. It can be seen from Figure 2(a) that the accuracy of the proposed robust student network is superior to that of the other three networks under nearly all SNR values. When the SNR equals 2, the two student networks from KD and FitNet perform even worse than the original teacher network, while our algorithm achieves a clearly leading accuracy of 98.17%. When the SNR drops to 1, the accuracy drops of the teacher network and the student networks from KD and FitNet are serious, up to 5.65%, 7.25%, and 3.23%, respectively. In contrast, the accuracy of our robust student network drops only 2.23%. Our method achieves better performance and shows more robustness under input perturbations.
A similar phenomenon can be observed on the CIFAR-10 and CIFAR-100 datasets in Figures 2(b) and (c). As the SNR decreases, the accuracies of the KD network and FitNet drop faster than that of the teacher network, especially when the SNR falls from 12 to 10. Given the significant reduction in network complexity, the capacity of the student network can be seriously weakened, and the student network becomes more vulnerable to perturbations on the data if no appropriate countermeasure is taken. In contrast, the student network learned with the proposed algorithm remains robust to serious perturbations.
Figure 3 reports the predicted scores of example images by different methods on the CIFAR-10 dataset. The clean image without noise looks fuzzy, since images from the CIFAR-10 dataset have a resolution of only 32×32. In the first column of Figure 3, all student networks confidently predict the ground-truth class 'cat' of the image. However, given the same image with SNR=10 noise added in Figure 3(b), although the student networks from the KD and FitNet methods reluctantly made the correct prediction, KD also considered the image similar to 'deer', and FitNet assigned 'bird' a higher confidence level. In contrast, our robust student network confidently stuck to its correct prediction even though the quality of the image had been seriously degraded by the perturbation. In addition, given the 'ship' image, the teacher network can withstand the perturbation, owing to the strong capacity that comes from its complicated network structure. The KD method mistook it for a 'deer' image, while FitNet assigned a higher score to the label 'deer' for this 'ship' image. By encouraging more confident predictions with the help of the teacher network during the training stage, we obtain a robust student network that not only keeps the highest prediction score on the 'ship' label, but also suppresses the predictions of wrong categories (see label 'cat' in Figures 3(c) and (d)).
[Table II: classification accuracy per dataset and algorithm with occlusion blocks of size 2×2, 4×4, 6×6, 8×8, and 10×10.]
V-C Comparison under Different Perturbations
A neural network may handle noisy test data if similar noise also exists in the training set. In practice, however, it is difficult to guarantee that the test data contain the same kind of perturbation as the training data. We therefore evaluate the performance of different methods under different combinations of noisy training and test sets. The accuracies in the different settings are presented in Table I. The first line of Table I describes the experimental settings: the first capital letter indicates the noise type introduced into the training set, and the second letter indicates that of the test set. Note that the results of the robust network listed in this table are all trained on the clean dataset, but tested under the corresponding type of noise indicated by the first line. If both training and test data are clean, all networks achieve more than 90% accuracy, as shown in the first column of Table I. If the networks are trained on clean data and tested on data with Gaussian noise, the teacher network and the student networks from KD and FitNet are seriously influenced and achieve less than 86% accuracy, whereas the proposed robust student still attains more than 90% accuracy. A similar phenomenon can be observed when the networks are trained with Gaussian noise but tested with Poisson noise, and vice versa, as shown in the fourth and last columns of Table I. If both training and test data are polluted with the same type of Gaussian noise, all networks try to fit the noisy data as far as possible and suffer only a slight performance drop; but this rigorous constraint over training and test data cannot always be satisfied in real-world applications. This shows that adding limited kinds of noise to the training set can hardly improve the robustness of a neural network when it faces the various unexpected kinds of noise in practical applications.
Moreover, Table I shows that the teacher network's accuracy is worse than that of the robust network when the training and test sets both contain Gaussian noise. The distributions of training and test data are not significantly different if both are polluted by Gaussian noise, so general teacher and student networks can fit the noisy data well and obtain reasonable accuracy. The student network nevertheless achieves some performance improvement because of its deeper architecture compared with the teacher network: depth encourages the re-use of features and leads to more abstract and invariant representations at higher layers. The proposed robust student network can successfully train such a deeper network by exploiting information from the teacher network.
V-D Complex Perturbations
In real-world applications, noise is not the only perturbation that may be encountered; more complex perturbations also challenge the robustness of neural networks. In this section, we investigate the robustness of the student network obtained by our method under two more complex perturbations, i.e., image occlusion and domain adaptation.
V-D1 Image Occlusion
Since target objects in real environments are often partially blocked, and such perturbation results in the loss of information in a continuous area, the performance of neural networks can be influenced even more seriously by image occlusion. To investigate the robustness of our method under this disturbance, we take image occlusion as a more complex perturbation. To simulate occlusion in real-world applications, we randomly select a small rectangular area in an image and set the pixels covered by the rectangle to zero. Five different block sizes, i.e., 2×2, 4×4, 6×6, 8×8, and 10×10, are used in the experiments, which are conducted on the CIFAR-10 dataset. The results are shown in Table II. Given 4×4 blocks, the teacher, KD, FitNet, and robust networks achieve accuracies of 89.03%, 89.53%, 89.94%, and 90.79%, respectively; given 8×8 blocks, the corresponding accuracies are 85.65%, 83.37%, 84.62%, and 85.92%. Larger blocks indicate more serious perturbations of the images, which degrade the performance of neural networks. However, the student network obtained by the proposed method stably stays ahead because of its robustness.
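The occlusion used in this experiment can be simulated directly; the sketch below zeroes out a randomly placed square block:

```python
import numpy as np

def occlude(image, block=4, rng=None):
    """Zero out a randomly placed block x block square to simulate occlusion."""
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - block + 1))
    left = int(rng.integers(0, w - block + 1))
    out = image.copy()
    out[top:top + block, left:left + block] = 0.0
    return out

img = np.ones((32, 32))                                   # a toy CIFAR-sized image
occ = occlude(img, block=8, rng=np.random.default_rng(0))
```

Unlike additive noise, this perturbation removes all information inside a contiguous region, which is why larger blocks hurt accuracy more.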
|Method|#params|#layers|CIFAR-10 acc.|CIFAR-100 acc.|
|---|---|---|---|---|
|Knowledge Distillation (student-teacher paradigm)|2.5M|19|91.07%|64.13%|
|Maxout Network|-|-|90.62%|61.43%|
|Network in Network|-|-|91.20%|64.32%|
|Deeply-Supervised Networks|-|-|91.78%|65.43%|

|Method|#params|MNIST error|
|---|---|---|
|Knowledge Distillation (student-teacher paradigm)|30K|0.65%|
|Maxout Network|-|0.45%|
|Network in Network|-|0.47%|
|Deeply-Supervised Networks|-|0.39%|
V-D2 Domain Adaptation
In practical applications, not only unexpected noise and occlusion but also unexpected distribution shift can challenge the robustness of neural networks. Performance under such shift is therefore an important indicator of an algorithm's adaptability, which we evaluate through a domain adaptation task.
In this experiment, we adopt the USPS dataset, obtained by scanning handwritten digits from envelopes processed by the U.S. Postal Service. The images in this dataset are grayscale with normalized pixel values. The whole dataset contains 9,298 handwritten digit images, of which 7,291 are used for training and the remaining 2,007 for validation. Similar to MNIST, the USPS dataset has 10 categories, but with different numbers of samples per category. In addition, since the images in the MNIST dataset are of size 28x28, for convenience we pad the images in the USPS dataset to the same size, and preprocess the USPS dataset in the same way as MNIST.
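The padding step can be sketched as below. Centering the digit within the 28x28 canvas is our assumption; the text only says the USPS images are padded to the MNIST size.

```python
import numpy as np

def pad_to_mnist_size(usps_image, target=28):
    """Zero-pad a 16x16 USPS digit to the 28x28 MNIST resolution.

    The digit is centered (an assumption); remaining border pixels are 0.
    """
    h, w = usps_image.shape
    pad_top = (target - h) // 2
    pad_left = (target - w) // 2
    return np.pad(usps_image,
                  ((pad_top, target - h - pad_top),
                   (pad_left, target - w - pad_left)))
```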
In this section, we train student networks on the MNIST dataset and test them on the USPS dataset, and vice versa. The results are shown in Table III: the first two columns report the result of adapting MNIST to USPS, and the last two columns report the performance of adapting USPS to MNIST. On 'MNIST to USPS', the proposed algorithm achieves an accuracy of 95.02%, while the comparison methods KD and FitNet only reach 93.25% and 94.12%, respectively. This demonstrates that the proposed robust student network preserves its robustness advantage over the comparison methods when faced with the more complex data perturbation in the domain adaptation task. A similar phenomenon can be observed in the 'USPS to MNIST' results: with similar accuracy on the USPS dataset, the Robust Network outperforms the networks obtained by the other algorithms. Moreover, the results of training on MNIST and testing on USPS are much better than those of training on USPS and testing on MNIST. This is because the MNIST dataset contains far more images than USPS; networks trained on MNIST can extract more useful information from the larger amount of data and thus have better generalization capability.
|Network|layers|params|mult|Speed-up Ratio|Compression Ratio|FitNet|Robust|
|---|---|---|---|---|---|---|---|

|Teacher (MNIST)|Student (MNIST)|Teacher (CIFAR)|Student 1|Student 2|Student 3|Student 4|
|---|---|---|---|---|---|---|
|conv 3x3x48|conv 3x3x16|conv 3x3x96|conv 3x3x16|conv 3x3x16|conv 3x3x32|conv 3x3x32|
|pool 4x4|conv 3x3x16|pool 4x4|conv 3x3x16|conv 3x3x32|conv 3x3x48|conv 3x3x32|
| |pool 4x4| |conv 3x3x16|conv 3x3x32|conv 3x3x64|conv 3x3x32|
| | | |pool 2x2|pool 2x2|conv 3x3x64|conv 3x3x48|
| | | | | |pool 2x2|conv 3x3x48|
|conv 3x3x48|conv 3x3x16|conv 3x3x96|conv 3x3x32|conv 3x3x48|conv 3x3x80|conv 3x3x80|
|pool 4x4|conv 3x3x16|pool 4x4|conv 3x3x32|conv 3x3x64|conv 3x3x80|conv 3x3x80|
| |pool 4x4| |conv 3x3x32|conv 3x3x80|conv 3x3x80|conv 3x3x80|
| | | |pool 2x2|pool 2x2|conv 3x3x80|conv 3x3x80|
| | | | | |pool 2x2|conv 3x3x80|
|conv 3x3x48|conv 3x3x16|conv 3x3x96|conv 3x3x48|conv 3x3x96|conv 3x3x128|conv 3x3x128|
|pool 4x4|conv 3x3x16|pool 4x4|conv 3x3x48|conv 3x3x96|conv 3x3x128|conv 3x3x128|
| |pool 4x4| |conv 3x3x64|conv 3x3x128|conv 3x3x128|conv 3x3x128|
| | | |pool 8x8|pool 8x8|pool 8x8|conv 3x3x128|
V-E Comparison with State-of-the-Art Methods
Although the main purpose of this paper is to improve the robustness of student networks rather than their performance on clean data, we also compared the proposed approach with state-of-the-art teacher-student learning methods on clean datasets. As shown in Tables VI and IV, the proposed algorithm still achieves comparable accuracy on clean data. Table VI summarizes the results on three datasets: MNIST, CIFAR-10, and CIFAR-100. On MNIST, the teacher network reaches a 99.45% accuracy; with the assistance of KD, the student network achieves 99.46%, and FitNet generates a slightly better student network with 99.49%, outperforming the teacher. Although the proposed algorithm aims to enhance the robustness of the learned student network, its accuracy further increases to 99.51% on MNIST, which is comparable to or better than the state-of-the-art methods. Table IV shows the results on CIFAR-10: the baseline teacher network achieves 90.25%, and the student networks generated by KD and FitNet reach 91.07% and 91.64%, respectively. The Robust student network obtains 91.63%, which is on par with FitNet and outperforms both KD and the teacher. This suggests that the proposed method enhances the stability of the student network without sacrificing its performance.
CIFAR-100 is similar to but more challenging than CIFAR-10 because of its 100 categories: the teacher network only reaches a 63.49% accuracy, compared with 90.25% on CIFAR-10. The robust student network achieves a 65.28% accuracy, which outperforms the student networks trained by other strategies: the network trained by knowledge distillation obtains 64.13%, and FitNet obtains 64.86%. The student network generated by the proposed method thus provides nearly state-of-the-art performance, demonstrating that the proposed method succeeds in learning a student network with considerable accuracy.
V-F Analysis on Structures of Student Network
We followed the experimental setting in FitNet and designed four student networks with different configurations of parameters and layers; the teacher network had the same structure as that used on the CIFAR-10 dataset. The detailed structures of these networks can be found in Table VIII. From 'Student 1' to 'Student 4', the volume of the network gradually increases, and so does its performance. Table V reports the performance of the four student networks and the teacher network on the CIFAR-10 dataset; the compression ratio and speed-up ratio relative to the teacher, together with the numbers of parameters and multiplications, can be found in Table VII.
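How the parameter counts and compression ratios for such conv stacks arise can be sketched as below. The channel widths used here are illustrative placeholders, not the exact networks of Table VIII, and the count covers only the 3x3 convolutions (pooling layers add no parameters; classifier layers are omitted).

```python
def conv_params(c_in, c_out, k=3):
    # Weights (k*k*c_in per output channel) plus one bias per output channel.
    return k * k * c_in * c_out + c_out

def network_params(channels, c_in=3):
    """Parameter count of a plain stack of 3x3 convolutions.

    `channels` lists the output channels of each conv layer in order.
    """
    total = 0
    for c_out in channels:
        total += conv_params(c_in, c_out)
        c_in = c_out
    return total

# Hypothetical widths: a wide "teacher" stack vs. a thin "student" stack.
teacher = [96, 96, 96]
student1 = [16, 16, 16]
compression_ratio = network_params(teacher) / network_params(student1)
```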
From Table V, we find that the proposed robust student network outperforms FitNet under all four student structures. Even though there is no perturbation on the data, the proposed method achieves higher accuracy, which indicates the effectiveness of encouraging the student network to make confident predictions with the help of the teacher network. In addition, the smallest network, Student 1, has the largest compression and speed-up ratios, yet it still achieves a test accuracy of 89.62%, which is fairly close to the teacher's 90.25% and outperforms the 89.07% obtained by FitNet. Since Student 1 contains significantly fewer parameters than the teacher, improving the accuracy of such a network with limited capacity is challenging, which in turn suggests the effectiveness of the proposed method.
Although the student networks learned by the proposed method already contain significantly fewer parameters, they are still regular networks that can be further compressed and accelerated by existing sparsity-based deep neural network compression technologies, such as deep compression and feature compression.
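The sparsity-based compression mentioned above can be illustrated with magnitude pruning, the first stage of deep compression. This is a minimal sketch only: the full deep-compression pipeline also includes retraining, quantization, and Huffman coding, which are omitted here.

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude weights (deep-compression style).

    `sparsity` is the fraction of weights to remove; the surviving
    weights could then be stored in a sparse format.
    """
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    # Threshold = k-th smallest absolute value.
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned
```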
VI Conclusion
In this paper, we proposed to learn a robust student network under the guidance of a teacher network. The proposed method prevents the student network from being disturbed by perturbations on input examples. Through a rigorous theoretical analysis, we proved a lower bound on the perturbations that can weaken the student network's confidence in its prediction. We introduced new objectives based on the prediction scores and gradients of examples to maximize this lower bound, thereby improving the robustness of the learned student network against perturbed examples. Experimental results on several benchmark datasets demonstrate that the proposed method learns a robust student network with satisfying accuracy and compact size.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing deep convolutional networks using vector quantization,” arXiv preprint arXiv:1412.6115, 2014.
-  E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems, 2014, pp. 1269–1277.
-  W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen, “Compressing neural networks with the hashing trick,” in International Conference on Machine Learning, 2015, pp. 2285–2294.
-  S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in neural information processing systems, 2015, pp. 1135–1143.
-  S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
-  Y. Wang, C. Xu, S. You, D. Tao, and C. Xu, “Cnnpack: packing convolutional neural networks in the frequency domain,” in Advances in Neural Information Processing Systems, 2016, pp. 253–261.
-  M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio, “Binarized neural networks: Training deep neural networks with weights and activations constrained to+ 1 or-1,” arXiv preprint arXiv:1602.02830, 2016.
-  M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “Xnor-net: Imagenet classification using binary convolutional neural networks,” in European Conference on Computer Vision. Springer, 2016, pp. 525–542.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017, pp. 5987–5995.
-  F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” arXiv preprint arXiv:1610.02357, 2017.
-  A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
-  G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
-  J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2654–2662. [Online]. Available: http://papers.nips.cc/paper/5484-do-deep-nets-really-need-to-be-deep.pdf
-  A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
-  P. McClure and N. Kriegeskorte, “Representational distance learning for deep neural networks,” Frontiers in computational neuroscience, vol. 10, 2016.
-  S. You, C. Xu, C. Xu, and D. Tao, “Learning from multiple teacher networks,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 1285–1294.
-  Y. Wang, C. Xu, C. Xu, and D. Tao, “Adversarial learning of portable student networks,” in AAAI, 2018.
-  ——, “Beyond filters: Compact feature map for portable deep model,” in International Conference on Machine Learning, 2017, pp. 3703–3711.
-  J. Jin, A. Dundar, and E. Culurciello, “Flattened convolutional neural networks for feedforward acceleration,” arXiv preprint arXiv:1412.5474, 2014.
-  M. Wang, B. Liu, and H. Foroosh, “Factorized convolutional neural networks,” CoRR, abs/1608.04337, 2016.
-  F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
-  J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” arXiv preprint arXiv:1709.01507, 2017.
-  V. Vanhoucke, “Learning visual representations at scale,” 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
-  S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International Conference on Machine Learning, 2015, pp. 448–456.
-  X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” arXiv preprint arXiv:1707.01083, 2017.
-  Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang, “Deep fried convnets,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1476–1483.
-  V. Sindhwani, T. Sainath, and S. Kumar, “Structured transforms for small-footprint deep learning,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 3088–3096. [Online]. Available: http://papers.nips.cc/paper/5869-structured-transforms-for-small-footprint-deep-learning.pdf
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2014, pp. 2672–2680. [Online]. Available: http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
-  T. Julliand, V. Nozick, and H. Talbot, “Image noise and digital image forensics,” in International Workshop on Digital Watermarking. Springer, 2015, pp. 3–17.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” 2009.
-  I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” arXiv preprint arXiv:1302.4389, 2013.
-  M. Lin, Q. Chen, and S. Yan, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
-  C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Artificial Intelligence and Statistics, 2015, pp. 562–570.