I Introduction
Recently, large neural networks have demonstrated extraordinary performance on various computer vision and machine learning tasks. Visual competitions on large datasets such as ImageNet [1] and MS COCO [2] suggest that wide and deep convolutional neural networks tend to achieve better performance, if properly trained on sufficient labeled data with well-tuned hyperparameters, at the cost of extremely high computational complexity. Overparameterization in large networks seems to be beneficial for performance [3, 4]. However, the requirements for large sets of labeled training data and high computational complexity pose significant challenges to developing and deploying deep neural networks in practice.
First, low-power devices such as mobile phones, cloud-based services with high throughput demand, and real-time systems have limited computational resources, which requires network inference (testing) to have low computational complexity. Besides the complexity issue, a large network often consumes massive storage and memory bandwidth. Therefore, smaller and faster networks are often highly desired in real-world applications. Recently, great efforts have been made to address the network speed issue. A variety of model compression approaches [5, 6, 7, 8, 9] were proposed to obtain faster networks that mimic the behavior of large networks. Second, in practical applications, we often have access to very limited labeled samples. It is very expensive to obtain human-labeled ground-truth samples for training. In some application domains, it is simply not feasible to accumulate enough training examples for deep networks [10, 11, 12, 13].
Interestingly, these two problems are actually coupled together. The network capacity is often positively correlated to its task complexity. For instance, we would expect a small network classifier of two classes (e.g., dog and cat) to achieve a similar level of accuracy as a significantly larger network for tens of thousands of object classes. Existing solutions for obtaining a fast network on new tasks are often based on a two-step approach: train the network on a large dataset, then apply model compression or distillation to the network after fine-tuning or transfer learning on the new dataset [9]. Each step is performed separately and they are not jointly optimized. Therefore, how to jointly address the problems of network compression, speed-up, and domain adaptation becomes a very important and intriguing research problem.
A successful line of work [14, 9, 15, 16, 17] suggests that cumbersome large neural networks, despite their redundancy, have a very robust interpretation of training data. By switching learning targets from labels to interpreted features in small networks, we have observed not only speed-ups but also performance improvements. Inspired by this phenomenon, we are interested in exploring whether this interpretation power is still valid across different (at least similar) domains, and what level of performance a newly trained student network can achieve with the help of a large model pretrained on different datasets.
In this paper, we propose a Knowledge Projection Network (KPN) with a two-stage joint optimization method for training small networks under the guidance of a pretrained large teacher network, as illustrated in Figure 1. In KPN, a knowledge projection matrix is learned to extract distinctive representations from the teacher network and is used to regularize the training process of the student network. We carefully design the teacher-student architecture and joint loss function so that the smaller student network can benefit from extra guidance while learning towards specific tasks. Our major observation is that, by learning necessary representations from a teacher network which is fully trained on a large dataset, a student network can disentangle the explanatory factors of variations in the new data and achieve a more precise representation of the new data from a smaller number of examples. Thus, the same level of performance can be achieved using a smaller network. Extensive experimental results on benchmark datasets demonstrate that our proposed knowledge projection approach outperforms existing methods, improving accuracy by up to 4% while reducing network complexity by 4 to 10 times, which is very attractive for practical applications of deep neural networks.
Our contributions in this paper are summarized as follows: (1) We propose a new architecture to transfer the knowledge from a large teacher network pretrained on a large dataset into a thinner and faster student network to guide and facilitate its training on a smaller dataset. Our approach addresses the issues of network adaptation and model compression at the same time. (2) We have developed a method to learn a projection matrix which is able to project the visual features from the teacher network into the student network to guide its training process and improve its overall performance. (3) We have developed an iterative method to select the optimal path for knowledge projection between the teacher and student networks. (4) We have implemented the proposed method in MXNet and conducted extensive experiments on benchmark datasets to demonstrate that our method is able to reduce the network computational complexity by 4-10 times while maintaining, or even significantly improving, the network performance.
II Related Work
Large neural networks have demonstrated extraordinary performance on various computer vision and machine learning tasks. During the past few years, researchers have been investigating how to deploy these deep neural networks in practice. There are two major problems that need to be carefully addressed: the high computational complexity of the deep neural network and the large number of labeled samples required to train the network [18, 19]. Our work is closely related to domain adaptation and model compression, which are reviewed in this section.
To address the problem of inadequate labeled samples for training, methods for network domain adaptation [20, 12, 21] have been developed, which enable learning on new domains with few labeled samples or even unlabeled data. Transfer learning methods have been proposed over the past several years, and we focus on supervised learning where a small amount of labeled data is available. It has been widely recognized that the difference in the distributions of different domains should be carefully measured and reduced [21]. Learning shallow representation models to reduce domain discrepancy is a promising approach. However, without deeply embedding the adaptation in the feature space, the transferability of shallow features will be limited by the task-specific variability. Recent transfer learning methods coupled with deep networks can learn more transferable representations by embedding domain adaptation in the architecture of deep learning [22], and outperform traditional methods by a large margin. Tzeng et al. [13] optimize domain invariance by correcting the marginal distributions during domain adaptation. The performance has been improved, but only within a single layer. Within the context of deep feedforward neural networks, fine-tuning is an effective and overwhelmingly popular method [23, 24]. Feature transferability of deep neural networks has been comprehensively studied in [25]. It should be noted that this method does not apply directly to many real problems due to insufficient labeled samples in the target domain. There are also some shallow architectures [26, 27] in the context of learning domain-invariant features. Limited by the representation capacity of shallow architectures, the performance of shallow networks is often inferior to that of deep networks [21].
With the dramatically increased demand for computational resources by deep neural networks, there have been considerable efforts in the literature to design smaller and thinner networks from larger pretrained networks. A typical approach is to prune unnecessary parameters in trained networks while retaining similar outputs. Instead of removing close-to-zero weights in the network, LeCun et al. proposed Optimal Brain Damage (OBD) [5], which uses the second-order derivatives to find a trade-off between performance and model complexity. Hassibi et al. followed this work and proposed Optimal Brain Surgeon (OBS) [6], which outperforms the original OBD method but is more computationally intensive. Han et al. [28] developed a method to prune state-of-the-art CNN models without loss of accuracy. Based on this work, the method of deep compression [7] achieved a better network compression ratio by combining parameter pruning, trained quantization, and Huffman coding, achieving a 3 to 4 times layer-wise speed-up and reducing the model size of VGG16 [29] by 49 times. This line of work focuses on pruning unnecessary connections and weights in trained models and optimizing for better computation and storage efficiency.
Various factorization methods have also been proposed to speed up the computation-intensive matrix operations which are the major computation in the convolution layers. For example, methods have been developed to use matrix approximation to reduce the redundancy of weights. Jaderberg et al. [8] and Denton et al. [30] use SVD-based low-rank approximation. Gong et al. [31] use clustering-based product quantization to reduce the size of matrices by building an index. Zhang et al. [32] successfully compressed the very deep VGG16 [29] to achieve a 4 times speed-up with 0.3% loss of accuracy based on Generalized Singular Value Decomposition and special treatment of nonlinear layers. This line of approaches can be configured as data-independent processes, but fine-tuning with training data improves the performance significantly. In contrast to offline optimization, Ciresan et al. [33] trained a sparse network with random connections, providing good performance with better computational efficiency than densely connected networks.
Rather than pruning or modifying parameters from existing networks, there has been another line of work in which a smaller network is trained from scratch to mimic the behavior of a much larger network. Starting from the work of Bucila et al. [14] and Knowledge Distillation (KD) by Hinton et al. [9], the design of smaller yet efficient networks has gained a lot of research interest. Smaller networks can be shallower (but much wider) than the original network while performing as well as deep models, as shown by Ba and Caruana in [34]. The key idea of knowledge distillation is to utilize the internal discriminative features, which are implicitly encoded in a way that is not only beneficial to the original training objectives on the source training dataset but also has the side effect of eliminating incorrect mappings in networks. It has been demonstrated in [9] that small networks can be trained to generalize in the same way as large networks with proper guidance. FitNets [15] achieved a better compression rate than knowledge distillation by designing a deeper but much thinner network using trained models. The hint-based training proposed there goes one step beyond knowledge distillation by using a finer network structure. Nevertheless, training deep networks has proven to be challenging [35]. Significant efforts have been devoted to alleviating this problem. Recently, adding supervision to intermediate layers of deep networks has been explored to assist the training process [36, 37]. These methods assume that the source and target domains are consistent. It is still unclear whether guided training is effective when the source and target domains are significantly different.
In this paper, we consider a unique setting of the problem. We use a large network pretrained on a large dataset (e.g., ImageNet) to guide the training of a thinner and faster network on a new smaller dataset with limited labeled samples, involving adaptation over different data domains and model compression at the same time.
III Knowledge Projection Network
In this section, we present the proposed Knowledge Projection Network (KPN). We start with the KPN architecture and then explain the knowledge projection layer design. A multi-path multi-stage training scheme coupled with iterative pruning for projection route selection is developed afterwards.
III-A Overview
An example pipeline of KPN is illustrated in Figure 2. Starting from a large teacher network pretrained on a large dataset, a student network is designed to predict desired outputs for the target problem with guidance from the teacher network. The student network uses similar building blocks as the teacher network, such as residual [38], Inception [39], or stacks of plain layers [29], subsampling and BatchNorm [40] layers. The similarity in baseline structure ensures smooth transferability. Note that the convolution layers consume most of the computational resources. Their complexity can be modeled by the following equation
(1) 
where the computational cost is multiplicatively related to the number of input and output channels , the spatial size of input feature map where and are the height and width of the feature map at the th layer, and kernel size . The student network is designed to be thinner (in terms of filter channels) but deeper to effectively reduce network capacity while preserving enough representation power [34, 15]. We depict the convolutional blocks in Figure 3 that are used to build the thin student networks. In contrast to standard convolutional layers, a squeeze-then-expand [41, 42] structure is effective in reducing the channel-wise redundancy by inserting spatially narrow (1×1) convolutional layers between standard convolutional layers. We denote this structure as bottleneck Type A and extend it to a more compact squeeze-expand-squeeze shape, namely bottleneck Type B. With (1), we can calculate the proportional layer-wise computation cost for the standard convolutional layer, bottleneck Type A and B, respectively. For simplicity, feature map dimensions are denoted in capital letters, and we use identical size for kernel height and width, denoted as , without loss of generality:
(2) 
(3) 
(4) 
(5) 
(6) 
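As a rough illustration of how a cost model of this kind behaves, the sketch below counts multiply-accumulate operations for a standard convolution and for a squeeze-then-expand block. The function names, the ratio parameter `alpha`, and the exact block layout (a 1×1 squeeze followed by a k×k expand) are assumptions made for illustration; they may differ from the Type A and Type B definitions used in this paper.

```python
def conv_cost(c_in, c_out, height, width, kernel):
    """Multiply-accumulate count of one kernel x kernel convolution
    producing a height x width output map (stride and padding ignored)."""
    return c_in * c_out * height * width * kernel * kernel


def squeeze_expand_cost(c_in, c_out, height, width, kernel, alpha):
    """Hypothetical squeeze-then-expand block (an assumed layout): a 1x1
    convolution squeezes c_in channels to alpha * c_in, then a
    kernel x kernel convolution expands to c_out channels."""
    c_mid = max(1, int(alpha * c_in))
    squeeze = conv_cost(c_in, c_mid, height, width, 1)
    expand = conv_cost(c_mid, c_out, height, width, kernel)
    return squeeze + expand


# Toy layer: 64 -> 64 channels on a 32 x 32 map with 3 x 3 kernels.
standard = conv_cost(64, 64, 32, 32, 3)
slim = squeeze_expand_cost(64, 64, 32, 32, 3, alpha=0.5)
ratio = slim / standard
print(f"standard={standard}, squeeze-expand={slim}, ratio={ratio:.3f}")
```

Under these assumptions, halving the squeezed channel count roughly halves the cost of the layer, with a small overhead from the extra 1×1 convolution.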
Bottleneck structures A and B can effectively reduce the computational cost while preserving the dimension of the feature map and the receptive field, and the layer-wise reduction is controlled by . For example, by cutting the bottleneck channels by half, i.e., , we have the approximate reduction rate for Type A, for Type B. In practice, the output channel is equal to or larger than input channel : . We replace standard convolutional layers by bottleneck structures A and B in the teacher network according to the computational budget to constitute the corresponding student network. Layer-wise width multipliers are the major contributor to model reduction. We use smaller in deep layers where the feature is sparse and in computationally expensive layers where the gain is significant. The flexibility of bottleneck structures and the elastic value range of ensure that we have enough degrees of freedom for controlling the student network capacity. In our KPN, the student network is trained by optimizing the following joint loss function:
(7) 
where and are the loss from the knowledge projection layer and the problem-specific loss, respectively. For example, for the problem-specific loss, we can choose the cross-entropy loss in many object recognition tasks. is the weight parameter decaying during training, is the trained teacher network, is a regularization term, and is the trained parameters in the student network. Unlike traditional supervised training, the knowledge projection loss plays an important role in guiding the training direction of KPN, which will be discussed in more detail in the following section.
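The weighting described above can be sketched in a few lines. The linear decay schedule, the L2 form of the regularizer, and all names below are assumptions for illustration; the text only states that the weight decays during training and that a regularization term is present. The initial value 0.6 is the setting reported in the experiments.

```python
def decayed_weight(step, total_steps, initial=0.6):
    """Linearly decay the knowledge-projection weight from `initial` to 0.
    The linear form is an assumption, not taken from the paper."""
    remaining = max(0.0, 1.0 - step / float(total_steps))
    return initial * remaining


def joint_loss(task_loss, projection_loss, weight, params, reg_coeff=1e-4):
    """Problem-specific loss plus weighted projection loss plus an
    (assumed) L2 penalty on the student parameters."""
    l2 = sum(p * p for p in params)
    return task_loss + weight * projection_loss + reg_coeff * l2


# Halfway through training the projection loss counts with weight 0.3.
lam = decayed_weight(step=500, total_steps=1000)
total = joint_loss(task_loss=1.2, projection_loss=0.5, weight=lam,
                   params=[0.1, -0.2, 0.3])
```

As the weight approaches zero, the objective reduces to ordinary supervised training with regularization, so the teacher's guidance matters most early on.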
III-B Knowledge Projection Layer Design
In this work, the pretrained teacher network and the student network analyze the input image simultaneously. To use the teacher network to guide the student network, we propose to map the feature of size learned at one specific layer of the teacher network into a feature vector of size and inject it into the student network to guide its training process. For the mapping, we choose linear projection
(8) 
where is an matrix. In deep convolutional neural networks, this linear projection matrix can be learned by constructing a convolution layer between the teacher and student network. Specifically, we use a convolutional layer to bridge the teacher’s knowledge layer and the student’s injection layer. A knowledge layer is defined as the output of a teacher’s hidden convolutional layer responsible for guiding the student’s learning process by regularizing the output of the student’s injection convolutional layer. Let , and be the spatial height, spatial width, and number of channels of the knowledge layer output in the teacher network, respectively. Let , and be the corresponding sizes of the student’s injection layer output, respectively. Note that there are a number of additional layers in the student network to further analyze the feature information acquired in the injection layer and contribute to the final network output. We define the following loss function:
(9) 
(10) 
where and represent the deep nested functions (stacks of convolutional operations) up to the knowledge and injection layer with network parameters and , respectively. is the knowledge projection function applied on with parameter which is another convolution layer in this work. , and must be comparable in terms of spatial dimensionality.
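To make the role of a convolutional projection concrete, the sketch below applies a channel-wise linear map at every spatial position, which is what a convolution with a 1×1 spatial kernel computes, and compares its parameter count with a fully connected mapping between the flattened feature maps. The 1×1 kernel size, the helper names, and the toy shapes are assumptions for illustration.

```python
def project_channels(features, weight):
    """Apply a 1x1 convolution (no bias).
    features: nested list shaped [c_teacher][height][width]
    weight:   nested list shaped [c_student][c_teacher]
    returns:  nested list shaped [c_student][height][width]"""
    c_teacher = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [[[sum(weight[o][i] * features[i][y][x] for i in range(c_teacher))
              for x in range(width)]
             for y in range(height)]
            for o in range(len(weight))]


def conv_projection_params(c_teacher, c_student, kernel=1):
    """Weights in a kernel x kernel convolutional projection."""
    return c_teacher * c_student * kernel * kernel


def dense_projection_params(c_teacher, c_student, height, width):
    """Weights in a fully connected map between flattened feature maps."""
    return (height * width * c_teacher) * (height * width * c_student)


teacher_feat = [[[1.0, 2.0], [3.0, 4.0]],      # teacher channel 0
                [[0.5, 0.5], [0.5, 0.5]],      # teacher channel 1
                [[-1.0, 0.0], [0.0, 1.0]]]     # teacher channel 2
weight = [[1.0, 2.0, 0.0],                     # student channel 0
          [0.0, 0.0, 1.0]]                     # student channel 1
projected = project_channels(teacher_feat, weight)
```

For a hypothetical 14×14 map with 512 teacher channels and 256 student channels, `conv_projection_params(512, 256)` gives 131072 weights, whereas `dense_projection_params(512, 256, 14, 14)` is larger by a factor of 196², which illustrates why a fully connected adaptation layer quickly becomes impractical.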
The knowledge projection layer is designed as a convolutional operation with a kernel in the spatial domain. As a result, is a tensor. As a comparison, a fully connected adaptation layer will require parameters, which is not feasible in practice, especially when the spatial size of the output is relatively large in the early layers. Using the convolutional adaptation layer not only lowers the computational complexity, but also provides a more natural way to filter distinctive channel-wise features from the knowledge layers while preserving spatial consistency. The output of the knowledge projection layer will guide the training of the student network by generating a strong and explicit gradient applied to the backward path to the injection layer in the following form
(11) 
where is the weight matrix of the injection layer in the student network. Note that in (9), is applied to with respect to the hidden output of the knowledge projection layer as a relaxation term. For negative responses from , is effectively reduced by the slope factor , which is set to by cross-validation. Overall, acts as a relaxed loss. Compared to loss, is more robust to outliers, but still has access to finer level representations in .
III-C Multi-Path Multi-Stage Training
In the student network, layers after the injection layer are responsible for adapting the projected feature to the final network output. This adaptation must be memorized throughout the training process. Those network layers before the injection layer aim to learn distinctive low-level features. Therefore, in our KPN framework, the student network and knowledge projection layer are randomly initialized and trained in two stages: an initialization stage and an end-to-end joint training stage.
In the initialization stage, Path ② in Figure 2 is disconnected, i.e., the knowledge projection layer together with the lower part of the student network is trained to adapt the intermediate output of the teacher’s knowledge layer to the final target by minimizing , which is the loss for the target task, e.g., softmax or linear regression loss. The upper part of the student network is trained solely by minimizing . In this stage, we use the projection matrix as an implicit connection between the upper and lower parts of the student network. The upper student network layers are always optimized towards features interpreted by the projection matrix, and have no direct access to targets. This strategy prevents the student network from overfitting quickly during the early training stage, which is very hard to correct afterwards.
After the initialization stage, we disconnect Path ① and reconnect Path ②; the training now involves jointly minimizing the objective function described in (7). Using the results from stage 1 as the initialization, the joint optimization process aims to establish smooth transitions inside the student network from the input to the final output. The loss injected into the student network continues to regularize the training process. In this way, the student network is trained based on a multi-loss function, which has been used in the literature to regulate deep networks [43].
III-D Iterative Pruning for Projection Route Selection
One important question in knowledge projection between the teacher and student networks is to determine which layer of the teacher network should be chosen as the knowledge layer and which layer of the student network should be chosen as the injection layer. In this work, we propose an iterative pruning and optimization scheme to select the projection route.
Assume that the teacher network and the student network have and layers, respectively. Candidate projection routes are depicted in Figure 4. We use only convolution layers as candidates for the knowledge and injection layers. To satisfy the constraints on spatial size and receptive field, candidate knowledge projection routes are computed and denoted as , where is the index of knowledge layer in the teacher network, is the index of injection layer in the student network, and is the set of all candidate routes. We follow the procedure for computing the center of receptive field in [44] for calculating the size of receptive field in layer :
(12) 
where and are the layer-wise stride and kernel size, assuming they are identical along and directions for simplicity. Routes with constrained receptive field are kept after calculation with a small tolerance :
(13) 
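A minimal sketch of this kind of route filtering is given below. It uses the standard receptive-field recursion for stacked convolutions; the matching rule (equal cumulative stride, taken here as a proxy for equal spatial size under size-preserving padding, plus a receptive-field difference within the tolerance) is an assumed reading of the constraint, and the function names and layer lists are hypothetical.

```python
def receptive_fields(layers):
    """layers: list of (kernel, stride) per convolution, input to output.
    Returns (receptive_field, cumulative_stride) after each layer, using
    r_l = r_{l-1} + (k_l - 1) * (product of the strides before layer l)."""
    result = []
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
        result.append((field, jump))
    return result


def candidate_routes(teacher_layers, student_layers, tolerance):
    """Keep (teacher_index, student_index) pairs whose feature maps share
    the same cumulative stride and whose receptive fields differ by at
    most `tolerance` (an assumed matching rule)."""
    teacher = receptive_fields(teacher_layers)
    student = receptive_fields(student_layers)
    return [(i, j)
            for i, (rf_t, jump_t) in enumerate(teacher)
            for j, (rf_s, jump_s) in enumerate(student)
            if jump_t == jump_s and abs(rf_t - rf_s) <= tolerance]


# Toy networks: every layer is a 3 x 3 convolution with stride 1 or 2.
teacher_layers = [(3, 1), (3, 1), (3, 2), (3, 1)]
student_layers = [(3, 1), (3, 2), (3, 1)]
routes = candidate_routes(teacher_layers, student_layers, tolerance=2)
print(routes)
```

Pairs whose feature maps sit at different resolutions are rejected outright, which mirrors the spatial-shape mismatch that invalidates routes in the figure.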
For example, in Figure 4, we have
(14) 
and the remaining routes in this figure are not valid due to mismatched spatial shapes. The idea of iterative pruning for the projection route selection is to traverse all possible routes with the same training hyperparameters, and determine the best route for the knowledge-injection pair on the fly. Specifically, we randomly initialize KPNs according to each .
Each KPN stores a student network , knowledge projection parameter and routing , while the teacher network is shared across all KPNs to save computation and memory overhead. The target is to find the KPN setting with minimum joint loss
(15) 
We assume that the pretrained teacher network is responsible for guiding the training of a specifically designed student network which satisfies the computational complexity requirement. According to (13), we can generate a list of candidate KPNs. Each KPN is a copy of the designed student network with different projection routing and corresponding parameters . Within a period of epochs, the KPNs are optimized separately using Stochastic Gradient Descent to minimize the joint loss described in (15). Note that even though the optimization target is a joint loss, as depicted in Fig. 2, the upper and bottom layers of the student network are receiving different learning targets from the teacher network and dataset distribution, respectively. At the end of epochs, the joint loss of each KPN computed on the validation dataset is used to determine which KPN to prune. The same procedure is applied on the remaining KPNs in the list iteratively. This iterative pruning procedure is summarized in Algorithm 1:
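The procedure described above can be sketched as the following loop. This is an illustration under stated assumptions, not a reproduction of Algorithm 1: exactly one candidate is pruned per round, and `train_fn` and `val_loss_fn` are hypothetical callbacks standing in for SGD training and validation of a KPN.

```python
def iterative_pruning(candidates, train_fn, val_loss_fn, epochs_per_round):
    """Train every surviving candidate for a fixed period, drop the one with
    the highest validation loss, and repeat until a single candidate remains.
    `train_fn(candidate, epochs)` and `val_loss_fn(candidate)` are
    hypothetical callbacks supplied by the caller."""
    survivors = list(candidates)
    while len(survivors) > 1:
        for candidate in survivors:
            train_fn(candidate, epochs_per_round)
        worst = max(survivors, key=val_loss_fn)
        survivors.remove(worst)
    return survivors[0]


# Toy usage: each "KPN" is a dict whose loss shrinks at its own rate.
toy_kpns = [{"route": (2, 1), "loss": 1.0, "rate": 0.9},
            {"route": (2, 2), "loss": 1.0, "rate": 0.5},
            {"route": (3, 2), "loss": 1.0, "rate": 0.7}]


def toy_train(kpn, epochs):
    kpn["loss"] *= kpn["rate"] ** epochs


def toy_val_loss(kpn):
    return kpn["loss"]


best = iterative_pruning(toy_kpns, toy_train, toy_val_loss, epochs_per_round=2)
print(best["route"])
```

Because pruned candidates are never trained again, the memory they occupied is released after each round, which is what allows the batch size to grow later in training.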
Only one KPN will survive after the iterative pruning process. We continue the multi-stage training with or without adjusting the batch size, depending on the memory released after sweeping out bad KPNs. The stopping criterion can be either a plateau of validation accuracy or a predefined end epoch.
IV Experimental Results
In this section, we provide comprehensive evaluations of our proposed method using three groups of benchmark datasets. Each group consists of two datasets: the large dataset used to train the teacher network and the smaller dataset used to train the student network. The motivation is that, in practical applications, we often need to learn a network to recognize or classify a relatively small number of different objects and the available training dataset is often small. We also wish the trained network to be fast and efficient. The large dataset is often available from existing research efforts, for example, ImageNet. Both the large and the small datasets have the same image dimensions so that pretrained models are compatible with each other in terms of shape. We use the existing teacher network model already trained by other researchers on the public dataset . We compare various algorithms on the benchmark dataset where state-of-the-art results have been reported. Performance reports on small datasets are rare; we therefore choose existing large, well-known benchmark datasets in the following experiments and aggressively reduce the size of the training set to simulate the shortage of labeled data in real-world scenarios.
IV-A Network Training
We have implemented our KPN framework using MXNet [45], a deep learning framework designed for both efficiency and flexibility. The dynamically generated computational graph in MXNet allows us to modify network structures during run time. The KPNs are trained on an NVIDIA Titan X 12GB with cuDNN v5.1 enabled. Batch sizes vary from 16 to 128 depending on the KPN group size. For all experiments, we train using Stochastic Gradient Descent (SGD) with momentum 0.9 and weight decay 0.0001, except for the knowledge projection layers. The weight decay for all knowledge projection layers is 0.001 in the initialization stage and 0 in the joint training stage. 40% of the iterations are used for the initialization stage, and the rest go to the joint training stage. The weight controller parameter for the joint loss is set to 0.6 and gradually decays to 0. The pruning frequency is 10000, and we also randomly re-invoke the initialization stage during the joint training stage to repeatedly adjust the network guidance strength.
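The stage split and per-layer weight decay listed above can be written down as a small schedule. The sketch below is illustrative only: the function names and the total iteration count are hypothetical, and the random re-invocation of the initialization stage is deliberately left out.

```python
def training_stage(iteration, total_iterations, init_fraction=0.4):
    """Return 'init' for the first 40% of iterations and 'joint' afterwards."""
    boundary = int(init_fraction * total_iterations)
    return "init" if iteration < boundary else "joint"


def weight_decay_for(layer_kind, stage):
    """Weight decay per the settings above: 0.0001 for ordinary layers,
    0.001 (initialization stage) or 0 (joint stage) for knowledge
    projection layers. `layer_kind` is a hypothetical label."""
    if layer_kind == "projection":
        return 0.001 if stage == "init" else 0.0
    return 0.0001


# Hypothetical run of 100000 iterations, probed around the stage boundary.
schedule = [(it, training_stage(it, 100000)) for it in (0, 39999, 40000, 99999)]
print(schedule)
```

Dropping the weight decay on the projection layers in the joint stage leaves them constrained only by the joint loss itself.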
For fine-tuning, we test a wide variety of experimental settings. Starting from pretrained networks, we reshape the last layer to fit the new dataset and randomly initialize it. The reshaped network is trained with standard back-propagation with respect to labels on the new dataset, and we unfreeze one more layer from the bottom at a time. The best result from all configurations is recorded. To make sure all networks are trained using the optimal hyperparameter set, we try a wide range of learning rates and repeat experiments on the best parameter set at least 5 times. The average performance of the best 3 runs out of 5 is reported. Data augmentation is limited to random horizontal flip unless otherwise specified.
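The reporting rule (average of the best 3 runs out of 5) is simple enough to state in code. The helper name is hypothetical and the accuracies below are made-up numbers used only to exercise the function; they are not results from this paper.

```python
def best_k_mean(accuracies, k=3):
    """Average of the k highest values among repeated runs."""
    top = sorted(accuracies, reverse=True)[:k]
    return sum(top) / len(top)


runs = [91.0, 90.5, 91.3, 89.8, 91.1]   # made-up accuracies for illustration
reported = best_k_mean(runs)
```

Discarding the two weakest runs reduces the influence of occasional bad initializations on the reported figure, at the cost of a slightly optimistic estimate.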
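Table I. Accuracy (%) for different numbers of training samples, with the number of model parameters and multiply-adds.

| Methods | 50000 | 5000 | 1000 | 500 | Params | Mult-adds |
|---|---|---|---|---|---|---|
| Maxout [46] | 90.18 | - | - | - | 9M | 379M |
| FitNets-11 [15] | 91.06 | - | - | - | 0.86M | 53M |
| FitNets [15] | 91.61 | - | - | - | 2.5M | 107M |
| GP CNN [47] | 93.95 | - | - | - | 3.5M | 362M |
| ALL-CNN-C [48] | 92.7 | - | - | - | 1.0M | 257M |
| Good Init [49] | 94.16 | - | - | - | 2.5M | 166M |
| ResNet-50 slim | 87.53 | 71.92 | 55.86 | 48.17 | 0.27M | 31M |
| ResNet-38 | 90.86 | 75.28 | 61.74 | 51.62 | 3.1M | 113M |
| ResNet-38 fine-tune | 91.15 | 89.61 | 86.26 | 83.45 | 3.1M | 113M |
| Our method | 92.37 | 90.35 | 88.73 | 87.61 | 0.27M | 31M |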
IV-B Results on the CIFAR-10 Dataset
We first evaluate the performance of our method on the CIFAR-10 dataset guided by a teacher network pretrained on the CIFAR-100 dataset. The CIFAR-10 and CIFAR-100 datasets [50] each have 60000 32×32 color images with 10 and 100 classes, respectively. They were both split into 50K/10K sets for training and testing. To validate our approach, we trained a 38-layer ResNet on CIFAR-100 as reported in [38], and use it to guide a 50-layer but significantly slimmer ResNet on CIFAR-10. We augment the data using random horizontal flip and color jittering. Table I summarizes the results, with comparisons against the state-of-the-art results which cover a variety of optimization techniques including layer-sequential unit-variance initialization [49], pooling-less [48], generalized pooling [47] and maxout activation [46]. We choose different sizes of the training set and list the accuracy. For network complexity, we compute its number of model parameters and the number of multiplications and additions needed for the network inference. It should be noted that for methods in the literature we do not have their accuracy results on down-sized training sets.
We do not apply specific optimization techniques used in the state-of-the-art methods because some structures are not reproducible in certain conditions. To compare, we trained a standard 38-layer residual network, a 50-layer slimmer version of ResNet (each convolutional layer has half the capacity of the vanilla ResNet) and a fine-tuned model of the 38-layer ResNet (from CIFAR-100) on CIFAR-10 with different amounts of training samples. With all 50000 training samples, our proposed method outperforms direct training and the best fine-tuning results and still matches the state-of-the-art performance. We believe the performance gain specified in [47, 49] can also be applied to our method, i.e., an ensemble of multiple techniques could achieve better performance. The proposed KPN method has improved the accuracy by up to 1.2% while significantly reducing the network size by about 11 times, from 3.1M network parameters to 273K parameters. It also demonstrated strong robustness against aggressive reduction of labeled training samples.
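The reduction factors quoted above follow directly from the complexity columns of Table I. The short check below recomputes them from the tabulated values; the variable names are illustrative.

```python
# Values from the Table I rows for ResNet-38 and for our method.
teacher_params, student_params = 3.1e6, 0.27e6   # model parameters
teacher_madds, student_madds = 113e6, 31e6       # multiply-adds per inference

param_reduction = teacher_params / student_params   # about 11.5x
madd_reduction = teacher_madds / student_madds      # about 3.6x
print(f"parameters: {param_reduction:.1f}x, multiply-adds: {madd_reduction:.1f}x")
```

The parameter count shrinks more than the multiply-add count because the student network is thinner but deeper than the teacher.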
IV-C Results on the Pascal VOC 07 Dataset
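Table II. Accuracy (%) for different numbers of training samples, with the number of model parameters and multiply-adds.

| Methods | 5011 | 1000 | 200 | Params | Mult-adds |
|---|---|---|---|---|---|
| Chatfield et al. [51] | 82.4 | - | - | 6.5M | 2483M |
| VGG16+SVM [29] | 89.3 | - | - | 14.7M | 15470M |
| VGG19+SVM [29] | 89.3 | - | - | 21.8M | 15470M |
| HCP-VGG [52] | 90.9 | - | - | 14.7M | 15470M |
| FisherNet-VGG16 [53] | 91.7 | - | - | 14.7M | 15470M |
| VGG16 standard BP | 83.5 | 65.2 | <30 | 14.7M | 15470M |
| Fine-tune VGG16 last layer (softmax) | 89.6 | 87.4 | 85.7 | 14.7M | 15470M |
| Fine-tune VGG16 2+ learnable layers | 90.2 | 86.3 | 82.8 | 14.7M | 15470M |
| Our method | 91.2 | 88.4 | 86.5 | 8M | 3361M |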
We evaluate the proposed method on the PASCAL Visual Object Classes Challenge (VOC) dataset [54] with a VGG16 model [29] pretrained on the ILSVRC 2012 dataset [1]. The pretraining usually takes several weeks; thus we downloaded and converted the teacher network from the Caffe model available online. We compare our method with state-of-the-art results obtained on this dataset in the literature, including the VGG16+SVM method [29], the segment-hypotheses-based multi-label HCP-VGG method [52], and the FisherNet-VGG16 method [53], which encodes CNN features with Fisher vectors. These papers have reported results on the original whole dataset with 5011 images. To test the learning capability of the network on smaller datasets with reduced samples, we also implement the fine-tuning method. We try different combinations of network update schemes and learning parameters and use the best result for performance comparison with our method. We conducted our experiments on the entire training set with 5011 images and the test set with 4952 images. In addition, we randomly sample 50 and 10 images from each class, generating two small datasets with 1000 and 200 training images, respectively. The results are summarized in Table II. We list the test accuracy of the network for each configuration. We compute the corresponding complexity of the network, including the number of model parameters and the number of multiplications and additions. It should be noted that for methods in the literature we do not have their accuracy results on down-sized training sets. It can be seen that our proposed method outperforms standard training and fine-tuning by a large margin while reducing the model size by 2 times and improving the inference speed by 4.6 times.
IV-D Results on the Omniglot Dataset
We are interested in how the proposed KPN method works on very small datasets, for example, the Omniglot handwritten character recognition dataset. MNIST [55] is a well-known handwritten digit dataset consisting of 60000 training images and 10000 test images, each 28×28×1 in size, organized into 10 classes. Omniglot [56] is a similar but much smaller dataset, containing 1623 different handwritten characters from 50 alphabets. Each of the 1623 characters was drawn online via Amazon's Mechanical Turk by 20 different people. All images are binarized and resized to 28×28×1 with no further data augmentation. We use all 70000 images from MNIST to train a 5-layer Maxout convolutional model, as proposed in [46], as the teacher network. We report experimental results of various algorithms across a wide range of training set sizes, from 19280 down to merely 1000 examples, in Table III. Note that we use class-dependent shuffling to randomly select training subsets, which is critical to avoid an unbalanced class distribution in Omniglot given the limited number of samples for each class. We can see that the proposed KPN is able to reduce the error rate by 1.1-1.3%. Table III also provides some interesting insights into how models are transferred to different tasks. First, the fine-tuning methods are all affected by the number of learnable parameters and training samples. A smaller training set results in significant overfitting, thus breaking the fragile co-adaptation between layers. If the training set is large enough, the number of learnable parameters is positively related to the performance. This phenomenon is also discussed in [25], where transferring knowledge from a pretrained model to an identical network is extensively tested.
Methods  Error rates at different numbers of training samples
19280  5000  1000  
Deep CNN [56]  13.5%  -  -
Deep Siamese CNN [56]  8.0%  -  -
Large CNN standard BP  9.3%  12.9%  19.4% 
Small CNN standard BP  12.1%  18.5%  23.8% 
Finetuned from MNIST  6.8%  7.4%  9.2% 
Our method  5.9%  6.6%  7.9% 
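The class-dependent shuffling used to build the reduced training subsets is not spelled out in detail, so the following is only a minimal sketch of one plausible reading: shuffle the samples within each class and keep the same number from every class. The function name `class_balanced_subset`, its parameters, and the toy labels are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict


def class_balanced_subset(labels, per_class, seed=0):
    """Return sorted sample indices with at most `per_class` random picks per class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)              # shuffle inside the class only
        chosen.extend(indices[:per_class])
    return sorted(chosen)


# Toy data (hypothetical): 4 classes with 20 samples each.
labels = [c for c in range(4) for _ in range(20)]
subset = class_balanced_subset(labels, per_class=5)
per_class_counts = Counter(labels[i] for i in subset)
print(len(subset), dict(per_class_counts))
```

Sampling per class rather than over the pooled dataset guarantees that no class is left empty, which matters when each class has only a handful of samples.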
# Layers  50  50 (slim)  50 (slimmer)  44 (slim)  44 (slimmer)  38 (slim)  38 (slimmer)  32 (slim)  32 (slimmer)  26 (slim)  26 (slimmer)
Conv /s1  16  16  16  16  16  16  16  16  16  16  16
ResConv /s2  32  32  16  32  16  32  16  32  16  32  16
ResConv /s1  64  32  32  32  32  32  32  32  32  32  32
ResConv /s2  128  64  48  64  48  64  48  64  48  64  48
Conv /s1  256  128  96  128  96  128  96  128  96  128  96
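The channel widths in the table above largely determine the cost of each layer, because the weight count of a convolution grows with the product of its input and output channels. The sketch below illustrates this with the last two rows of the table. It assumes a single 3×3 convolution whose input width is the preceding row's width, which is an illustrative simplification and not the exact layer layout of the networks; the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k=3):
    """Number of weights in a k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def conv_madds(c_in, c_out, out_h, out_w, k=3):
    """Multiply-adds for one forward pass over an out_h x out_w output map."""
    return conv_params(c_in, c_out, k) * out_h * out_w


# Widths taken from the last two rows: widest, slim, and slimmer columns.
full = conv_params(128, 256)
slim = conv_params(64, 128)
slimmer = conv_params(48, 96)

print(f"full: {full}, slim: {slim}, slimmer: {slimmer}")
print(f"full/slim = {full / slim:.2f}, full/slimmer = {full / slimmer:.2f}")
```

Halving both the input and output widths cuts the weights and multiply-adds of such a layer by a factor of four, which is why modest reductions in width translate into large savings.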
IvE Algorithm Parameter Analysis
In this section, we study how the performance of our method is affected by the choice of its major parameters.
(1) Tradeoff between Performance and Efficiency. To evaluate how the size of the network affects performance, we measure the test accuracy, number of parameters, and network speed-up ratio of various student networks on the CIFAR-10 dataset. Figure 6 shows the results. Student networks are designed based on a multilayer ResNet and are identified by their number of layers together with a marker indicating a slim or slimmer version of ResNet. The detailed network configurations are listed in Table IV. As expected, deeper and slimmer networks are more difficult to train with limited training data. However, with the proposed method enabled, depth becomes beneficial and the networks suffer less from performance drops. Impressively, we could obtain a model that is 34 times faster and uses less than 2% of the parameters, with about 3% accuracy loss, compared to the teacher network.
(2) Analysis of Iterative Pruning for Automatic Route Selection. The knowledge projection route is critical for network training and test performance. Intuitively, the projection route should be neither too shallow nor too deep: shallow layers may contain only low-level texture features, while deep layers close to the output may be too task-specific. To study how iterative pruning works during training, we record the pruning results and compare them with manually defined projection routes, as shown in Figure 7. We can see that the statistics of the surviving projection routes are highly correlated with the training accuracy, which is evaluated by manually defining the projection route and disabling iterative pruning during training. The result also indicates that choosing the middle layers for projection is potentially better. Reducing the size of the training data also affects the pruning results. This might relate to the difficulty of fitting the knowledge projection layer to the target domain when very limited data is available. As a result, projection layers tend to appear more often at very deep layers close to the output, so that the penalty from the adaptation loss does not dominate. The bottom line is that, even though iterative pruning is a random optimization process, it reliably produces satisfactory results.
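To make the idea of pruning candidate routes in item (2) more concrete, the toy loop below repeatedly samples candidate routes at random, accumulates a loss for each, and drops the worst one until a single route survives. It is a generic successive-elimination sketch, not the pruning algorithm of this paper: `prune_routes`, `toy_loss`, the number of draws per stage, and the assumption that the loss is lowest for a middle layer are all hypothetical.

```python
import random


def prune_routes(candidates, evaluate_route, draws_per_route=20, seed=0):
    """Eliminate the worst candidate route stage by stage; return the survivor."""
    rng = random.Random(seed)
    alive = list(candidates)
    while len(alive) > 1:
        totals = {r: 0.0 for r in alive}
        counts = {r: 0 for r in alive}
        # Randomly pick a route for each draw, as a stand-in for per-batch sampling.
        for _ in range(draws_per_route * len(alive)):
            route = rng.choice(alive)
            totals[route] += evaluate_route(route, rng)
            counts[route] += 1
        # Routes that were never drawn are treated as the worst.
        mean_loss = {
            r: (totals[r] / counts[r]) if counts[r] else float("inf")
            for r in alive
        }
        alive.remove(max(alive, key=lambda r: mean_loss[r]))
    return alive[0]


def toy_loss(route, rng):
    """Hypothetical noisy loss that is smallest for the middle layer (index 3)."""
    return abs(route - 3) + rng.gauss(0.0, 0.1)


best = prune_routes(range(7), toy_loss)
print("surviving route:", best)
```

In this toy setting the middle route survives because its average loss is lowest; in practice the per-route loss would come from training the student network with that route enabled.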
IvF Discussion and Future Work
Our KPN is designed in a highly modular manner. The projection layers are needed only for training and are removed during actual network testing, and the network capacity is highly configurable for the performance/speed tradeoff. The KPN method can be easily extended to other problems such as object detection, object segmentation, and pose estimation by replacing the softmax loss layer used in classification problems. Since the deployed network is a pure standard network, another research direction is to apply KPN as a building block in traditional model compression techniques to reshape the network from a new perspective. Although we have focused on the advantage of KPN with thinner networks on smaller datasets, there are potential benefits to applying KPN to large networks and relatively large datasets, for example, in performance-oriented situations where speed is not an issue.
V Conclusion
We have developed a novel knowledge projection framework for deep neural networks that addresses the issues of domain adaptation and model compression simultaneously during training. We exploit the distinctive general features produced by a teacher network trained on a large dataset, and use a learned matrix to project them into domain-relevant representations to be used by the student network. A smaller and faster student network is trained to minimize a joint loss designed for domain adaptation and knowledge distillation simultaneously. Extensive experimental results have demonstrated that our unified training framework provides an effective way to obtain fast, high-performance neural networks on small datasets with limited labeled samples.
References
 [1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
 [2] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: common objects in context,” CoRR, vol. abs/1405.0312, 2014. [Online]. Available: http://arxiv.org/abs/1405.0312
 [3] M. Denil, B. Shakibi, L. Dinh, N. de Freitas et al., “Predicting parameters in deep learning,” in Advances in Neural Information Processing Systems, 2013, pp. 2148–2156.
 [4] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” arXiv preprint arXiv:1207.0580, 2012.
 [5] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. Morgan Kaufmann, 1990, pp. 598–605. [Online]. Available: http://papers.nips.cc/paper/250optimalbraindamage.pdf
 [6] B. Hassibi, D. G. Stork et al., “Second order derivatives for network pruning: Optimal brain surgeon,” Advances in neural information processing systems, pp. 164–164, 1993.
 [7] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
 [8] M. Jaderberg, A. Vedaldi, and A. Zisserman, “Speeding up convolutional neural networks with low rank expansions,” arXiv preprint arXiv:1405.3866, 2014.
 [9] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
 [10] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp. 199–210, 2011.
 [11] K. Zhang, B. Schölkopf, K. Muandet, and Z. Wang, “Domain adaptation under target and conditional shift.” in ICML (3), 2013, pp. 819–827.
 [12] X. Wang and J. Schneider, “Flexible transfer learning under support and model shift,” in Advances in Neural Information Processing Systems, 2014, pp. 1898–1906.
 [13] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer across domains and tasks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4068–4076.
 [14] C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006, pp. 535–541.
 [15] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv:1412.6550, 2014.
 [16] Y. Hou, Z. Li, P. Wang, and W. Li, “Skeleton optical spectra based action recognition using convolutional neural networks,” IEEE Transactions on Circuits and Systems for Video Technology, 2016.
 [17] C. Xiong, L. Liu, X. Zhao, S. Yan, and T.-K. Kim, “Convolutional fusion network for face verification in the wild,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 3, pp. 517–528, 2016.
 [18] K. Kim, S. Lee, J.-Y. Kim, M. Kim, and H.-J. Yoo, “A configurable heterogeneous multicore architecture with cellular neural network for real-time object recognition,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 11, pp. 1612–1622, 2009.
 [19] N. Sudha, A. Mohan, and P. K. Meher, “A self-configurable systolic architecture for face recognition system based on principal component neural network,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 21, no. 8, pp. 1071–1084, 2011.
 [20] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
 [21] M. Long, Y. Cao, J. Wang, and M. I. Jordan, “Learning transferable features with deep adaptation networks.” in ICML, 2015, pp. 97–105.
 [22] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” arXiv preprint arXiv:1409.7495, 2014.
 [23] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision. Springer, 2014, pp. 818–833.
 [24] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1717–1724.
 [25] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?” in Advances in Neural Information Processing Systems, 2014, pp. 3320–3328.
 [26] H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, and M. Marchand, “Domain-adversarial neural networks,” arXiv preprint arXiv:1412.4446, 2014.
 [27] M. Ghifary, W. B. Kleijn, and M. Zhang, “Domain adaptive neural networks for object recognition,” in Pacific Rim International Conference on Artificial Intelligence. Springer, 2014, pp. 898–904.
 [28] S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in Advances in Neural Information Processing Systems, 2015, pp. 1135–1143.
 [29] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [30] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, “Exploiting linear structure within convolutional networks for efficient evaluation,” in Advances in Neural Information Processing Systems, 2014, pp. 1269–1277.
 [31] Y. Gong, L. Liu, M. Yang, and L. Bourdev, “Compressing deep convolutional networks using vector quantization,” arXiv preprint arXiv:1412.6115, 2014.
 [32] X. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very deep convolutional networks for classification and detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 1943–1955, 2016.
 [33] D. C. Cireşan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, “High-performance neural networks for visual object classification,” arXiv preprint arXiv:1102.0183, 2011.
 [34] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in neural information processing systems, 2014, pp. 2654–2662.
 [35] D. Erhan, P.-A. Manzagol, Y. Bengio, S. Bengio, and P. Vincent, “The difficulty of training deep architectures and the effect of unsupervised pretraining,” in AISTATS, vol. 5, 2009, pp. 153–160.
 [36] C.-Y. Lee, S. Xie, P. W. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in AISTATS, vol. 2, no. 3, 2015, p. 5.
 [37] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
 [38] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
 [39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
 [40] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [41] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size,” arXiv preprint arXiv:1602.07360, 2016.
 [42] J. Redmon and A. Farhadi, “YOLO9000: Better, faster, stronger,” arXiv preprint arXiv:1612.08242, 2016.
 [43] C. Xu, C. Lu, X. Liang, J. Gao, W. Zheng, T. Wang, and S. Yan, “Multi-loss regularized deep neural network,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 26, no. 12, pp. 2273–2283, 2016.
 [44] K. Lenc and A. Vedaldi, “R-CNN minus R,” arXiv preprint arXiv:1506.06981, 2015.
 [45] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
 [46] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio, “Maxout networks,” ICML (3), vol. 28, pp. 1319–1327, 2013.
 [47] C.-Y. Lee, P. W. Gallagher, and Z. Tu, “Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree,” in International Conference on Artificial Intelligence and Statistics, 2016.
 [48] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
 [49] D. Mishkin and J. Matas, “All you need is a good init,” arXiv preprint arXiv:1511.06422, 2015.
 [50] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Master’s thesis, Department of Computer Science, University of Toronto, 2009.
 [51] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets,” arXiv preprint arXiv:1405.3531, 2014.
 [52] Y. Wei, W. Xia, M. Lin, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, “HCP: A flexible CNN framework for multi-label image classification,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 9, pp. 1901–1907, 2016.
 [53] P. Tang, X. Wang, B. Shi, X. Bai, W. Liu, and Z. Tu, “Deep fishernet for object classification,” arXiv preprint arXiv:1608.00182, 2016.
 [54] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
 [55] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [56] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, vol. 350, no. 6266, pp. 1332–1338, 2015.