Knowledge Projection for Deep Neural Networks

by   Zhi Zhang, et al.

While deeper and wider neural networks are actively pushing the performance limits of various computer vision and machine learning tasks, they often require large sets of labeled data for effective training and suffer from extremely high computational complexity. In this paper, we develop a new framework for training deep neural networks on datasets with limited labeled samples using cross-network knowledge projection, which is able to improve the network performance while reducing the overall computational complexity significantly. Specifically, a large pre-trained teacher network is used to observe samples from the training data. A projection matrix is learned to project this teacher-level knowledge and its visual representations from an intermediate layer of the teacher network to an intermediate layer of a thinner and faster student network to guide and regulate its training process. Both the intermediate layers from the teacher network and the injection layers from the student network are adaptively selected during training by evaluating a joint loss function in an iterative manner. This knowledge projection framework allows us to use crucial knowledge learned by large networks to guide the training of thinner student networks, avoiding over-fitting, achieving better network performance, and significantly reducing the complexity. Extensive experimental results on benchmark datasets have demonstrated that our proposed knowledge projection approach outperforms existing methods, improving accuracy by up to 4% while reducing network complexity by 4 to 10 times, which is very attractive for practical applications of deep neural networks.



I Introduction

Recently, large neural networks have demonstrated extraordinary performance on various computer vision and machine learning tasks. Visual competitions on large datasets such as ImageNet [1] and MS COCO [2] suggest that wide and deep convolutional neural networks tend to achieve better performance, if properly trained on sufficient labeled data with well-tuned hyper-parameters, at the cost of extremely high computational complexity. Over-parameterization in large networks seems to be beneficial for the performance improvement [3, 4]; however, the requirements for large sets of labeled data for training and high computational complexity pose significant challenges for us to develop and deploy deep neural networks in practice.

First, low-power devices such as mobile phones, cloud-based services with high throughput demand, and real-time systems have limited computational resources, which requires that network inference or testing have low computational complexity. Besides the complexity issue, a large network often consumes massive storage and memory bandwidth. Therefore, smaller and faster networks are often highly desired in real-world applications. Recently, great efforts have been made to address the network speed issue. A variety of model compression approaches [5, 6, 7, 8, 9] were proposed to obtain faster networks that mimic the behavior of large networks. Second, in practical applications, we often have access to very limited labeled samples. It is very expensive to obtain human-labeled ground-truth samples for training. In some application domains, it is simply not feasible to accumulate enough training examples for deep networks [10, 11, 12, 13].

Interestingly, these two problems are actually coupled together. The network capacity is often positively correlated to its task complexity. For instance, we would expect a small network classifier of two classes (e.g., dog and cat) to achieve a similar level of accuracy as a significantly larger network for tens of thousands of classes of objects. Existing solutions for obtaining a fast network on new tasks are often based on a two-step approach: train the network on a large dataset, then apply model compression or distillation to the network after fine-tuning or transfer learning on the new dataset [9]. Each step is performed separately and they are not jointly optimized. Therefore, how to jointly address the problems of network compression, speed-up, and domain adaptation becomes a very important and intriguing research problem.

Fig. 1: System overview. We apply learned projection during training to guide a standard thinner and faster network for inference on a smaller domain dataset.

A successful line of work [14, 9, 15, 16, 17] suggests that cumbersome large neural networks, despite their redundancy, have a very robust interpretation of training data. By switching learning targets from labels to interpreted features in small networks, we have observed not only speed-ups but also performance improvements. Inspired by this phenomenon, we are interested in exploring whether this interpretation power is still valid across different (at least similar) domains, and what level of performance a newly trained student network can achieve with the help of a large model pre-trained on different datasets.

In this paper, we propose a Knowledge Projection Network (KPN) with a two-stage joint optimization method for training small networks under the guidance of a pre-trained large teacher network, as illustrated in Figure 1. In KPN, a knowledge projection matrix is learned to extract distinctive representations from the teacher network, and is used to regularize the training process of the student network. We carefully design the teacher-student architecture and joint loss function so that the smaller student network can benefit from extra guidance while learning towards specific tasks. Our major observation is that, by learning necessary representations from a teacher network which is fully trained on a large dataset, a student network can disentangle the explanatory factors of variations in the new data and achieve a more precise representation of the new data from a smaller number of examples. Thus, the same level of performance can be achieved using a smaller network. Extensive experimental results on benchmark datasets have demonstrated that our proposed knowledge projection approach outperforms existing methods, improving accuracy by up to 4% while reducing network complexity by 4 to 10 times, which is very attractive for practical applications of deep neural networks.

Our contributions in this paper are summarized as follows: (1) we propose a new architecture to transfer the knowledge from a large teacher network pre-trained on a large dataset into a thinner and faster student network to guide and facilitate its training on a smaller dataset. Our approach addresses the issues of network adaptation and model compression at the same time. (2) We have developed a method to learn a projection matrix which is able to project the visual features from the teacher network into the student network to guide its training process and improve its overall performance. (3) We have developed an iterative method to select the optimal path for knowledge projection between the teacher and student networks. (4) We have implemented the proposed method in MXNet and conducted extensive experiments on benchmark datasets to demonstrate that our method is able to significantly reduce the network computational complexity by 4-10 times while largely maintaining or even improving the network performance by a significant margin.

The rest of this paper is organized as follows. Related work is reviewed in Section II. We present the proposed Knowledge Projection Network in Section III. Experimental results are presented in Section IV. Finally, Section V concludes this paper.

II Related Work

Large neural networks have demonstrated extraordinary performance on various computer vision and machine learning tasks. During the past few years, researchers have been investigating how to deploy these deep neural networks in practice. There are two major problems that need to be carefully addressed: the high computational complexity of the deep neural network and the large number of labeled samples required to train the network [18, 19]. Our work is closely related to domain adaptation and model compression, which are reviewed in this section.

To address the problem of inadequate labeled samples for training, methods for network domain adaptation [20, 12, 21] have been developed, which enable learning on new domains with few labeled samples or even unlabeled data. Transfer learning methods have been proposed over the past several years, and we focus on supervised learning where a small amount of labeled data is available. It has been widely recognized that the difference in the distributions of different domains should be carefully measured and reduced [21]. Learning shallow representation models to reduce domain discrepancy is a promising approach; however, without deeply embedding the adaptation in the feature space, the transferability of shallow features will be limited by the task-specific variability. Recent transfer learning methods coupled with deep networks can learn more transferable representations by embedding domain adaptations in the architecture of deep learning [22] and outperform traditional methods by a large margin. Tzeng et al. [13] optimize domain invariance by correcting the marginal distributions during domain adaptation. The performance has been improved, but only within a single layer. Within the context of deep feed-forward neural networks, fine-tuning is an effective and overwhelmingly popular method [23, 24]. Feature transferability of deep neural networks has been comprehensively studied in [25]. It should be noted that this method does not apply directly to many real problems due to insufficient labeled samples in the target domain. There are also some shallow architectures [26, 27] in the context of learning domain-invariant features. Limited by the representation capacity of shallow architectures, the performance of shallow networks is often inferior to that of deep networks [21].

With the dramatically increased demand for computational resources by deep neural networks, there have been considerable efforts in the literature to design smaller and thinner networks from larger pre-trained networks. A typical approach is to prune unnecessary parameters in trained networks while retaining similar outputs. Instead of removing close-to-zero weights in the network, LeCun et al. proposed Optimal Brain Damage (OBD) [5], which uses the second-order derivatives to find a trade-off between performance and model complexity. Hassibi et al. followed this work and proposed Optimal Brain Surgeon (OBS) [6], which outperforms the original OBD method but is more computationally intensive. Han et al. [28] developed a method to prune state-of-the-art CNN models without loss of accuracy. Based on this work, the method of deep compression [7] achieved a better network compression ratio using an ensemble of parameter pruning, trained quantization and Huffman coding, achieving a 3 to 4 times layer-wise speed-up and reducing the model size of VGG-16 [29] by 49 times. This line of work focuses on pruning unnecessary connections and weights in trained models and optimizing for better computation and storage efficiency.

Various factorization methods have also been proposed to speed up the computation-intensive matrix operations which are the major computation in the convolution layers. For example, methods have been developed to use matrix approximation to reduce the redundancy of weights. Jaderberg et al. [8] and Denton et al. [30] use SVD-based low-rank approximation. Gong et al. [31] use a clustering-based product quantization to reduce the size of matrices by building an index. Zhang et al. [32] successfully compressed the very deep VGG-16 [29] to achieve a 4 times speed-up with 0.3% loss of accuracy based on Generalized Singular Value Decomposition and special treatment of non-linear layers. This line of approaches can be configured as data-independent processes, but can be fine-tuned with training data to improve the performance significantly. In contrast to off-line optimization, Ciresan et al. [33] trained a sparse network with random connections, providing good performance with better computational efficiency than densely connected networks.

Rather than pruning or modifying parameters from existing networks, there has been another line of work in which a smaller network is trained from scratch to mimic the behavior of a much larger network. Starting from the work of Bucila et al. [14] and Knowledge Distillation (KD) by Hinton et al. [9], the design of smaller yet efficient networks has gained a lot of research interest. Smaller networks can be shallower (but much wider) than the original network, performing as well as deep models, as shown by Ba and Caruana in [34]. The key idea of knowledge distillation is to utilize the internal discriminative feature that is implicitly encoded in a way that is not only beneficial to the original training objectives on the source training dataset, but also has the side effect of eliminating incorrect mappings in networks. It has been demonstrated in [9] that small networks can be trained to generalize in the same way as large networks with proper guidance. FitNets [15] achieved a better compression rate than knowledge distillation by designing a deeper but much thinner network using trained models. The proposed hint-based training goes one step beyond knowledge distillation, which uses a finer network structure. Nevertheless, training deep networks has proven to be challenging [35]. Significant efforts have been devoted to alleviating this problem. Recently, adding supervision to intermediate layers of deep networks has been explored to assist the training process [36, 37]. These methods assume that source and target domains are consistent. It is still unclear whether the guided training is effective when the source and target domains are significantly different.

In this paper, we consider a unique setting of the problem. We use a large network pre-trained on a large dataset (e.g., ImageNet) to guide the training of a thinner and faster network on a new smaller dataset with limited labeled samples, involving adaptation over different data domains and model compression at the same time.

III Knowledge Projection Network

In this section, we present the proposed Knowledge Projection Network (KPN). We start with the KPN architecture and then explain the knowledge projection layer design. A multi-path multi-stage training scheme coupled with iterative pruning for projection route selection is developed afterwards.

III-A Overview

An example pipeline of KPN is illustrated in Figure 2. Starting from a large teacher network pre-trained on a large dataset, a student network is designed to predict desired outputs for the target problem with guidance from the teacher network. The student network uses similar building blocks as the teacher network, such as Residual [38], Inception [39] or stacks of plain layers [29], sub-sampling and BatchNorm [40] layers. The similarity in baseline structure ensures smooth transferability. Note that the convolution layers consume most of the computational resources. Their complexity can be modeled by the following equation

$$\mathcal{C}_i = N_{i-1} \cdot N_i \cdot H_i \cdot W_i \cdot K^h_i \cdot K^w_i \qquad (1)$$

where the computational cost $\mathcal{C}_i$ of the $i$-th layer is multiplicatively related to the number of input channels $N_{i-1}$ and output channels $N_i$, the spatial size $H_i \times W_i$ of the input feature map, where $H_i$ and $W_i$ are the height and width of the feature map at the $i$-th layer, and the kernel size $K^h_i \times K^w_i$.

Fig. 2: KPN architecture. Solid arrows show the forward data flow; dotted arrows show the paths for gradients.

The student network is designed to be thinner (in terms of filter channels) but deeper to effectively reduce network capacity while preserving enough representation power [34, 15]. We depict the convolutional blocks in Figure 3 that are used to build the thin student networks. In contrast to standard convolutional layers, a squeeze-then-expand [41, 42] structure is effective in reducing the channel-wise redundancy by inserting spatially narrow ($1 \times 1$) convolutional layers between standard convolutional layers. We denote this structure as bottleneck Type A and extend it to a more compact squeeze-expand-squeeze shape, namely bottleneck Type B. With (1), we can calculate the proportional layer-wise computation cost for the standard convolutional layer, bottleneck Type A and B, respectively. For simplicity, feature map dimensions are denoted in capital letters, and we use identical size for kernel height and width, denoted as $K$, without loss of generality:

Fig. 3: Left: Standard 3x3 convolutional layer. Middle: Bottleneck Type A. Right: Bottleneck Type B. Each block is annotated with the feature spatial height and width and with its input, reduced and output channels. For simplicity, batch-norm and activation layers are omitted in this figure.
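
The layer-wise costs of the three blocks in Figure 3 can be compared with a short calculation. The sketch below is only an illustration under stated assumptions: Type A is taken to be a 1x1 squeeze followed by a KxK convolution, Type B is taken to be a 1x1, KxK, 1x1 stack working on the reduced channels, and the names `conv_cost`, `type_a_cost`, `type_b_cost`, `c_in`, `c_mid` and `c_out` are hypothetical. The exact block layouts in the paper may differ.

```python
def conv_cost(c_in, c_out, h, w, k):
    """Multiply-accumulate count of one k x k convolution on an h x w map."""
    return c_in * c_out * h * w * k * k


def standard_cost(c_in, c_out, h, w, k=3):
    # Single standard k x k convolution (c_in -> c_out).
    return conv_cost(c_in, c_out, h, w, k)


def type_a_cost(c_in, c_mid, c_out, h, w, k=3):
    # Assumed layout: 1x1 squeeze (c_in -> c_mid), then k x k (c_mid -> c_out).
    return conv_cost(c_in, c_mid, h, w, 1) + conv_cost(c_mid, c_out, h, w, k)


def type_b_cost(c_in, c_mid, c_out, h, w, k=3):
    # Assumed layout: 1x1 (c_in -> c_mid), k x k (c_mid -> c_mid), 1x1 (c_mid -> c_out).
    return (conv_cost(c_in, c_mid, h, w, 1)
            + conv_cost(c_mid, c_mid, h, w, k)
            + conv_cost(c_mid, c_out, h, w, 1))


def reduction(block_cost, c_in, c_mid, c_out, h, w, k=3):
    """Cost of a bottleneck block relative to the standard layer it replaces."""
    return block_cost(c_in, c_mid, c_out, h, w, k) / standard_cost(c_in, c_out, h, w, k)


# Example: 64 input/output channels, bottleneck channels cut to half (32).
ratio_a = reduction(type_a_cost, 64, 32, 64, 32, 32)
ratio_b = reduction(type_b_cost, 64, 32, 64, 32, 32)
```

Under these assumptions, halving the bottleneck channels with equal input and output widths gives cost ratios of 5/9 for Type A and 13/36 for Type B relative to a standard 3x3 layer. The spatial size cancels out of the ratio, so only the channel counts and the kernel size matter.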

Combining (2), (3) and (4), we define the reductions in computation for Type A and B as


Bottleneck structures A and B can effectively reduce the computational cost while preserving the dimension of the feature map and the receptive field, and the layer-wise reduction is controlled by the number of reduced (bottleneck) channels. For example, cutting the bottleneck channels by half yields an approximate reduction rate for each of Type A and Type B. In practice, the number of output channels is equal to or larger than the number of input channels. We replace standard convolutional layers by bottleneck structures A and B in the teacher network according to the computational budget to constitute the corresponding student network. Layer-wise width multipliers are the major contributor to model reduction. We use smaller multipliers in deep layers where the feature is sparse and in computationally expensive layers where the gain is significant. The flexibility of the bottleneck structures and the elastic value range of the width multipliers ensure that we have enough degrees of freedom for controlling the student network capacity. In our KPN, the student network is trained by optimizing the following joint loss function:


Here the two loss terms are the loss from the knowledge projection layer and the problem-specific loss, respectively. For example, for the problem-specific loss, we can choose the cross-entropy loss in many object recognition tasks. The joint loss also involves a weight parameter that decays during training, the trained teacher network, a regularization term, and the trained parameters of the student network. Unlike traditional supervised training, the knowledge projection loss plays an important role in guiding the training direction of KPN, which will be discussed in more detail in the following section.
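
The way these terms combine can be sketched in a few lines. This is a minimal illustration, assuming the joint loss is a weighted sum of the form weight x projection loss + task loss + regularization; the function names and the toy values are hypothetical, and only the initial weight 0.6 and the weight decay 0.0001 are taken from the training settings reported later in the paper.

```python
import math


def cross_entropy(probs, label):
    """Problem-specific loss for one sample: negative log-probability of the label."""
    return -math.log(probs[label])


def l2_penalty(params, weight_decay):
    """Simple regularization term (assumed here to be an L2 penalty)."""
    return weight_decay * sum(p * p for p in params)


def joint_loss(kp_loss, task_loss, reg, lam):
    # Assumed form: decaying weight on the projection loss, plus task loss and regularizer.
    return lam * kp_loss + task_loss + reg


task = cross_entropy([0.1, 0.7, 0.2], label=1)
reg = l2_penalty([0.5, -0.5], weight_decay=1e-4)

# Early in joint training the projection loss carries weight 0.6 ...
total_early = joint_loss(kp_loss=2.0, task_loss=task, reg=reg, lam=0.6)
# ... and once the weight has decayed to 0 only the task loss and regularizer remain.
total_late = joint_loss(kp_loss=2.0, task_loss=task, reg=reg, lam=0.0)
```

As the weight decays, the guidance from the teacher fades out and the student is left optimizing the target task alone.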

III-B Knowledge Projection Layer Design

In this work, the pre-trained teacher network and the student network analyze the input image simultaneously. To use the teacher network to guide the student network, we propose to map the feature learned at one specific layer of the teacher network into a feature vector of the appropriate size and inject it into the student network to guide its training process. For the mapping, we choose linear projection


where the projection is given by a matrix. In deep convolutional neural networks, this linear projection matrix can be learned by constructing a convolution layer between the teacher and student network. Specifically, we use a convolutional layer to bridge the teacher's knowledge layer and the student's injection layer. A knowledge layer is defined as the output of a teacher's hidden convolutional layer responsible for guiding the student's learning process by regularizing the output of the student's injection convolutional layer. The knowledge layer output in the teacher network is characterized by its spatial height, spatial width, and number of channels, and the student's injection layer output by the corresponding sizes. Note that there are a number of additional layers in the student network to further analyze the feature information acquired in the injection layer and contribute to the final network output. We define the following loss function:


Here the teacher and student networks are represented as deep nested functions (stacks of convolutional operations) up to the knowledge layer and the injection layer, each with its own network parameters. The knowledge projection function, which is another convolution layer in this work, is applied to the knowledge layer output with its own parameter. The outputs of the knowledge layer, the projection function, and the injection layer must be comparable in terms of spatial dimensionality.
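
A convolutional projection of this kind can be sketched with NumPy. The example below assumes a 1x1 kernel, so that the same channel-mixing matrix is applied at every spatial position; the kernel size, the function names and the tensor shapes are illustrative assumptions. It also compares the parameter count with that of a fully connected mapping between the same two feature maps.

```python
import numpy as np


def project_1x1(teacher_feat, weight):
    """Apply a 1x1 convolution: one linear map shared by all spatial positions.

    teacher_feat: array of shape (C_t, H, W)
    weight:       array of shape (C_s, C_t)
    returns:      array of shape (C_s, H, W)
    """
    return np.einsum("oc,chw->ohw", weight, teacher_feat)


def conv1x1_params(c_t, c_s):
    # A 1x1 convolution needs one weight per (input channel, output channel) pair.
    return c_t * c_s


def fully_connected_params(c_t, h_t, w_t, c_s, h_s, w_s):
    # A dense layer connects every input element to every output element.
    return (c_t * h_t * w_t) * (c_s * h_s * w_s)


rng = np.random.default_rng(0)
teacher_feat = rng.standard_normal((64, 8, 8))   # hypothetical knowledge layer output
weight = rng.standard_normal((16, 64))           # hypothetical projection parameter
projected = project_1x1(teacher_feat, weight)
```

For these toy shapes the convolutional projection has 1,024 parameters, while the fully connected alternative would need 4,194,304, and the gap widens as the spatial size grows.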

The knowledge projection layer is designed as a convolutional operation with a kernel in the spatial domain. As a result, its parameter is a tensor. By comparison, a fully connected adaptation layer would require a number of parameters that is not feasible in practice, especially when the spatial size of the output is relatively large in the early layers. Using the convolutional adaptation layer is not only beneficial for lower computational complexity, but also provides a more natural way to filter distinctive channel-wise features from the knowledge layers while preserving spatial consistency. The output of the knowledge projection layer will guide the training of the student network by generating a strong and explicit gradient applied to the backward path to the injection layer in the following form


where is the weight matrix of injection layer in student network. Note that in (9), is applied to with respect to the hidden output of knowledge projection layer as a relaxation term. For negative responses from , is effectively reduced by the slope factor , which is set to by cross-validation. Overall, acts as a relaxed loss. Compared to loss,

is more robust to outliers, but still has access to finer level representations in


Fig. 4: Candidate Routes of Knowledge Projection. Candidate routes are paths from teacher’s knowledge layer to student’s injection layer. Only one route will survive after iterative pruning.
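
The relaxation of the knowledge projection loss described in Section III-B can be illustrated as follows. This is a sketch under explicit assumptions, not the paper's exact formulation: the base loss is taken to be an element-wise absolute difference, the relaxation is gated by the sign of the projection output, and the slope value 0.25 is a placeholder. The function name `relaxed_projection_loss` is hypothetical.

```python
import numpy as np


def relaxed_projection_loss(projected, injected, slope=0.25):
    """Absolute-difference loss, scaled down by `slope` where the projection is negative."""
    # Gate is 1 for non-negative projection responses and `slope` otherwise (assumption).
    gate = np.where(projected >= 0, 1.0, slope)
    return float(np.sum(gate * np.abs(projected - injected)))


projected = np.array([1.0, -2.0, 0.5])   # toy output of the projection layer
injected = np.array([0.0, 0.0, 0.5])     # toy output of the student's injection layer

loss = relaxed_projection_loss(projected, injected)
plain = float(np.sum(np.abs(projected - injected)))
```

In this toy case the unrelaxed loss is 3.0 and the relaxed loss is 1.5, because the mismatch at the negative response contributes only a quarter of its magnitude.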

III-C Multi-Path Multi-Stage Training

In the student network, layers after the injection layer are responsible for adapting the projected feature to the final network output. This adaptation must be memorized throughout the training process. The network layers before the injection layer aim to learn distinctive low-level features. Therefore, in our KPN framework, the student network and knowledge projection layer are randomly initialized and trained in two stages: an initialization stage and an end-to-end joint training stage.

In the initialization stage, Path ② in Figure 2 is disconnected, i.e., the knowledge projection layer together with the lower part of the student network is trained to adapt the intermediate output of the teacher's knowledge layer to the final target by minimizing the problem-specific loss, which is the loss for the target task, e.g., softmax or linear regression loss. The upper part of the student network is trained solely by minimizing the knowledge projection loss. In this stage, we use the projection matrix as an implicit connection between the upper and lower parts of the student network. The upper student network layers are always optimized towards features interpreted by the projection matrix, and have no direct access to targets. This strategy prevents the student network from over-fitting quickly during the early training stage, which is very hard to correct afterwards.

After the initialization stage, we disconnect Path ① and reconnect Path ②; the training now involves jointly minimizing the objective function described in (7). Using the results from stage 1 as the initialization, the joint optimization process aims to establish smooth transitions inside the student network from the input to the final output. The loss injected into the student network continues to regularize the training process. In this way, the student network is trained based on a multi-loss function, which has been used in the literature to regulate deep networks [43].

III-D Iterative Pruning for Projection Route Selection

One important question in knowledge projection between the teacher and student networks is to determine which layers from the teacher network should be chosen as the knowledge layer and which layers from the student network should be chosen as the injection layer. In this work, we propose to explore an iterative pruning and optimization scheme to select the projection route.

Assume that the teacher network and the student network each have a fixed number of layers. Candidate projection routes are depicted in Figure 4. We use only convolution layers as candidates for the knowledge and injection layers. To satisfy the constraints on spatial size and receptive field, candidate knowledge projection routes are computed; each route is identified by the index of the knowledge layer in the teacher network and the index of the injection layer in the student network, and together the routes form the set of all candidate routes. We follow the procedure for computing the center of the receptive field in [44] for calculating the size of the receptive field in each layer:


Here the calculation uses the layer-wise stride and kernel size, assuming they are identical along both spatial directions for simplicity. Routes with a constrained receptive field are kept after calculation with a small tolerance:


For example, in Figure 4, we have


and the remaining routes in this figure are not valid due to mismatched spatial shapes. The idea of iterative pruning for projection route selection is to traverse all possible routes with the same training hyper-parameters and determine the best route for the knowledge-injection pair on the fly. Specifically, we randomly initialize one KPN for each candidate route.
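
Candidate route generation can be sketched with the standard receptive-field recursion for stacked convolutions, in which each layer enlarges the receptive field by (kernel - 1) times the accumulated stride. The matching rule used below (equal spatial size, receptive-field difference within a tolerance) is an assumed reading of the constraints, and the layer lists, the `tol` value and all function names are hypothetical.

```python
def receptive_fields(layers):
    """layers: list of (kernel, stride); returns the receptive field after each layer."""
    rf, jump, out = 1, 1, []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # growth is scaled by the accumulated stride
        jump *= stride
        out.append(rf)
    return out


def spatial_sizes(input_size, layers):
    """Output size after each layer, assuming 'same' padding (only strides shrink it)."""
    size, out = input_size, []
    for _, stride in layers:
        size = -(-size // stride)   # ceiling division
        out.append(size)
    return out


def candidate_routes(teacher, student, input_size, tol):
    """Pairs (knowledge layer index, injection layer index) passing both constraints."""
    t_rf, s_rf = receptive_fields(teacher), receptive_fields(student)
    t_sz = spatial_sizes(input_size, teacher)
    s_sz = spatial_sizes(input_size, student)
    return [(i, j)
            for i in range(len(teacher))
            for j in range(len(student))
            if t_sz[i] == s_sz[j] and abs(t_rf[i] - s_rf[j]) <= tol]


# Toy networks given as (kernel, stride) per convolution layer.
teacher = [(3, 1), (3, 2), (3, 1)]
student = [(3, 1), (3, 1), (3, 2), (3, 1)]
routes = candidate_routes(teacher, student, input_size=32, tol=2)
```

For these toy networks the teacher's receptive fields are 3, 5 and 9, the student's are 3, 5, 7 and 11, and five of the twelve possible layer pairs remain as candidate routes.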

Each KPN stores a student network, a knowledge projection parameter and a route, while the teacher network is shared across all KPNs to save computation and memory overhead. The target is to find the KPN setting with minimum joint loss


We assume that the pre-trained teacher network is responsible for guiding the training of a specifically designed student network which satisfies the computational complexity requirement. According to (13), we can generate a list of candidate KPNs. Each KPN is a copy of the designed student network with a different projection route and corresponding parameters. Within a fixed period of epochs, the KPNs are optimized separately using Stochastic Gradient Descent to minimize the joint loss described in (15). Note that even though the optimization target is a joint loss, as depicted in Fig. 2, the upper and bottom layers of the student network receive different learning targets from the teacher network and the dataset distribution, respectively. At the end of each period, the joint loss of each KPN computed on the validation dataset is used to determine which KPN to prune. The same procedure is applied to the remaining KPNs in the list iteratively. This iterative pruning procedure is summarized in Algorithm 1:

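Input: list of KPNs, each consisting of a student network, a knowledge projection parameter and a projection route; teacher network
Output: the surviving student network, its knowledge projection parameter and its projection route
Configure all KPNs for the initialization stage.
while more than one KPN remains in the list do
      for a fixed number of epochs do
            for each batch in the data do
                  Forward the teacher network on the batch;
                  for each KPN in the list do
                        Forward-backward w.r.t. the teacher output and the batch;
                  end for
            end for
      end for
      Select the KPN with the largest joint loss on the validation set;
      Remove the selected KPN from the list;
end while
return the remaining KPN in the list;
Algorithm 1 Iterative pruning algorithm for projection route selection.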

Only one KPN will survive after the iterative pruning process. We continue the multi-stage training with or without adjusting the batch size, depending on the memory released after sweeping out bad KPNs. The stopping criterion can be either a plateau of validation accuracy or a pre-defined end epoch.
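
The control flow of Algorithm 1 can be sketched in plain Python. This is a toy illustration, assuming that the KPN with the largest validation loss is removed after each training period; the callbacks `train_one_period` and `validation_loss`, the dictionary layout and the toy loss dynamics are all hypothetical, and the shared teacher forward pass is left out for brevity.

```python
def iterative_pruning(kpns, train_one_period, validation_loss):
    """Train all candidates for one period, drop the worst, repeat until one remains."""
    alive = list(kpns)
    while len(alive) > 1:
        for kpn in alive:
            train_one_period(kpn)               # stands in for the epoch/batch loops
        worst = max(alive, key=validation_loss)  # largest joint loss on validation data
        alive.remove(worst)
    return alive[0]


# Toy candidates: each "KPN" has a route, a current loss and a per-period decay rate.
candidates = [
    {"route": (0, 0), "loss": 1.0, "rate": 0.9},
    {"route": (1, 2), "loss": 1.2, "rate": 0.5},
    {"route": (2, 3), "loss": 0.9, "rate": 0.95},
]


def toy_train(kpn):
    # Pretend that one training period multiplies the loss by a fixed rate.
    kpn["loss"] *= kpn["rate"]


best = iterative_pruning(candidates, toy_train, lambda kpn: kpn["loss"])
```

In this toy run the route (1, 2) starts with the highest loss but improves fastest, so it survives both pruning rounds. This is the reason for evaluating candidates after a period of training rather than at initialization.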

IV Experimental Results

In this section, we provide comprehensive evaluations of our proposed method using three groups of benchmark datasets. Each group consists of two datasets: the large dataset used to train the teacher network and the smaller dataset used to train the student network. The motivation is that, in practical applications, we often need to learn a network to recognize or classify a relatively small number of different objects, and the available training dataset is often small. We also wish the trained network to be fast and efficient. The large dataset is often available from existing research efforts, for example, ImageNet. Both the large and the small datasets have the same image dimensions so that pre-trained models are compatible with each other in terms of shape. We use the existing teacher network model already trained by other researchers on the public dataset. We compare various algorithms on the benchmark dataset where state-of-the-art results have been reported. Performance reports on small datasets are rare; thus, we choose existing large, well-known benchmark datasets in the following experiments, and aggressively reduce the size of the training set to simulate the shortage of labeled data in real-world scenarios.

IV-A Network Training

We have implemented our KPN framework using MXNet [45], a deep learning framework designed for both efficiency and flexibility. The dynamically generated computational graph in MXNet allows us to modify network structures during run time. The KPNs are trained on an NVIDIA Titan X 12GB with cuDNN v5.1 enabled. Batch sizes vary from 16 to 128 depending on the KPN group size. For all experiments, we train using Stochastic Gradient Descent (SGD) with momentum 0.9 and weight decay 0.0001, except for the knowledge projection layers. The weight decay for all knowledge projection layers is 0.001 in the initialization stage and 0 in the joint training stage. 40% of the iterations are used for the initialization stage, and the rest go to the joint training stage. The weight controller parameter for the joint loss is set to 0.6 and gradually decays to 0. The pruning frequency is 10000, and we also randomly re-invoke the initialization stage during the joint training stage to repeatedly adjust the network guidance strength.
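
The stage split and the decaying loss weight can be written down as a small schedule. The 40% split, the initial weight 0.6 and the projection-layer weight decay values come from the settings above; the linear shape of the decay and all function names are assumptions, since the text only says that the weight gradually decays to 0.

```python
def joint_start(total_iterations, init_fraction=0.4):
    """First iteration of the joint training stage."""
    return round(init_fraction * total_iterations)


def training_stage(iteration, total_iterations, init_fraction=0.4):
    start = joint_start(total_iterations, init_fraction)
    return "initialization" if iteration < start else "joint"


def loss_weight(iteration, total_iterations, init_fraction=0.4, start_value=0.6):
    """Weight on the projection loss; linear decay over the joint stage (assumed shape)."""
    start = joint_start(total_iterations, init_fraction)
    if iteration < start:
        # The joint loss is not used in the initialization stage; report the start value.
        return start_value
    progress = (iteration - start) / (total_iterations - start)
    return start_value * (1.0 - min(progress, 1.0))


def projection_weight_decay(stage):
    # 0.001 in the initialization stage, 0 in the joint training stage.
    return 0.001 if stage == "initialization" else 0.0


# Example with a hypothetical budget of 1000 iterations.
stages = [training_stage(i, 1000) for i in (0, 399, 400, 999)]
weights = [loss_weight(i, 1000) for i in (400, 700, 1000)]
```

With a budget of 1000 iterations, the joint stage begins at iteration 400 and the weight falls from 0.6 through 0.3 at iteration 700 to 0 at the end of training.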

For fine-tuning, we test a wide variety of experimental settings. Starting from pre-trained networks, we adjust the last layer to fit the new dataset and randomly initialize it. The reshaped network is trained with standard back-propagation with respect to labels on the new dataset, unfreezing one more layer from the bottom at a time. The best result from all configurations is recorded. To make sure all networks are trained using the optimal hyper-parameter set, we extensively try a wide range of learning rates, and repeat experiments on the best parameter set at least 5 times. The average performance of the best 3 runs out of 5 is reported. Data augmentation is limited to random horizontal flip if not otherwise specified.

Methods               Accuracy with different training set sizes   Parameters   Multiply-adds
                      50000     5000      1000      500
Maxout [46]           90.18     -         -         -             9M           379M
FitNets-11 [15]       91.06     -         -         -             0.86M        53M
FitNets [15]          91.61     -         -         -             2.5M         107M
GP CNN [47]           93.95     -         -         -             3.5M         362M
ALL-CNN-C [48]        92.7      -         -         -             1.0M         257M
Good Init [49]        94.16     -         -         -             2.5M         166M
ResNet-50 slim        87.53     71.92     55.86     48.17         0.27M        31M
ResNet-38             90.86     75.28     61.74     51.62         3.1M         113M
ResNet-38 fine-tune   91.15     89.61     86.26     83.45         3.1M         113M
Our method            92.37     90.35     88.73     87.61         0.27M        31M
TABLE I: CIFAR-10 accuracy and network capacity comparisons with state-of-the-art methods. Results using randomly sampled subsets from the training data are also reported. The numbers of network parameters are calculated based on reports in related work.

IV-B Results on the CIFAR-10 Dataset

We first evaluate the performance of our method on the CIFAR-10 dataset, guided by a teacher network pre-trained on the CIFAR-100 dataset. The CIFAR-10 and CIFAR-100 datasets [50] each contain 60000 32×32 color images, with 10 and 100 classes, respectively. Both are split into 50K training and 10K test images. To validate our approach, we train a 38-layer ResNet on CIFAR-100 as reported in [38] and use it to guide a 50-layer but significantly slimmer ResNet on CIFAR-10. We augment the data using random horizontal flips and color jittering. Table I summarizes the results, with comparisons against state-of-the-art results that cover a variety of optimization techniques, including layer-sequential unit-variance initialization [49], pooling-less networks [48], generalized pooling [47], and maxout activation [46]. We vary the size of the training set and list the accuracy for each size. For network complexity, we compute the number of model parameters and the number of multiplications and additions needed for network inference. Note that accuracy results on the down-sized training sets are not available for the methods from the literature.
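The two complexity measures used here, parameter count and multiply-add count, can be illustrated for a single convolutional layer. The sketch below uses the common convention of counting one multiply-add per weight per output position. Whether the paper counts biases, batch normalization, or fully connected layers in the same way is not stated, so `conv2d_cost` is a hypothetical helper and not the authors' accounting.

```python
def conv2d_cost(c_in, c_out, kernel, out_h, out_w, bias=False):
    """Return (parameters, multiply-adds) of one 2-D convolution layer.

    ASSUMPTION: square kernel, dense (non-grouped) convolution; the output
    size is passed in directly, so stride and padding are already reflected.
    """
    weights = kernel * kernel * c_in * c_out
    params = weights + (c_out if bias else 0)
    # Each output position of each output channel needs kernel*kernel*c_in
    # multiply-adds, so the total equals weights * number of output positions.
    mult_adds = weights * out_h * out_w
    return params, mult_adds


# Example: a 3x3 convolution from 3 to 16 channels with a 32x32 output map.
params, mult_adds = conv2d_cost(c_in=3, c_out=16, kernel=3, out_h=32, out_w=32)
print(params, mult_adds)  # 432 parameters, 442368 multiply-adds
```

Summing these per-layer values over a network gives totals comparable in kind to the parameter and multiply-add columns of the tables, up to the counting conventions noted above.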

We do not apply the specific optimization techniques used in the state-of-the-art methods because some of their structures are not reproducible under certain conditions. For comparison, we trained a standard 38-layer Residual Network, a 50-layer slimmer version of ResNet (each convolutional layer has half the capacity of the vanilla ResNet), and a fine-tuned 38-layer ResNet (from CIFAR-100) on CIFAR-10 with different numbers of training samples. With all 50000 training samples, our proposed method outperforms direct training and the best fine-tuning results, and still matches state-of-the-art performance. We believe the performance gains reported in [47, 49] can also be applied to our method, i.e., an ensemble of multiple techniques could achieve better performance. The proposed KPN method improves the accuracy by up to 1.2% while reducing the network size by about 11 times, from 3.1M network parameters to 273K parameters. It also demonstrates strong robustness against aggressive reduction of labeled training samples.
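The robustness claim can be checked directly against the numbers in Table I. The short script below computes the accuracy gap between the proposed method and the fine-tuned ResNet-38 at each training-set size. The variable and function names are illustrative; the values are copied from the table.

```python
# Accuracy (%) on CIFAR-10, copied from Table I.
TRAIN_SIZES = [50000, 5000, 1000, 500]
FINE_TUNE_RESNET38 = [91.15, 89.61, 86.26, 83.45]
OUR_METHOD = [92.37, 90.35, 88.73, 87.61]


def accuracy_gains(ours, baseline):
    """Per-setting accuracy difference in percentage points, rounded to 2 decimals."""
    return [round(o - b, 2) for o, b in zip(ours, baseline)]


gains = accuracy_gains(OUR_METHOD, FINE_TUNE_RESNET38)
for size, gain in zip(TRAIN_SIZES, gains):
    print(f"{size:>6} samples: +{gain:.2f} points over fine-tuning")
```

The gap over fine-tuning is 1.22 points with all 50000 samples and grows to 4.16 points with 500 samples, which is the pattern the robustness statement refers to.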

Fig. 5: (1)(2): CIFAR-100/10 sample images; (3): ImageNet 2012; (4): Pascal VOC 2007; (5): MNIST; (6): Omniglot.

IV-C Results on the Pascal VOC 07 Dataset

| Methods | Acc. (%), 5011 samples | Acc. (%), 1000 samples | Acc. (%), 200 samples | # Params | # Mult-Adds |
|---|---|---|---|---|---|
| Chatfield et al. [51] | 82.4 | - | - | 6.5M | 2483M |
| VGG16+SVM [29] | 89.3 | - | - | 14.7M | 15470M |
| VGG19+SVM [29] | 89.3 | - | - | 21.8M | 15470M |
| HCP-VGG [52] | 90.9 | - | - | 14.7M | 15470M |
| FisherNet-VGG16 [53] | 91.7 | - | - | 14.7M | 15470M |
| VGG16 standard BP | 83.5 | 65.2 | <30 | 14.7M | 15470M |
| Fine-tune VGG16 last layer (softmax) | 89.6 | 87.4 | 85.7 | 14.7M | 15470M |
| Fine-tune VGG16 2+ learnable layers | 90.2 | 86.3 | 82.8 | 14.7M | 15470M |
| Our method | 91.2 | 88.4 | 86.5 | 8M | 3361M |

TABLE II: PASCAL VOC 2007 test object classification performance comparison. Results using randomly sampled subsets of the training data are also reported. The numbers of convolution layer parameters are listed for fair comparison, based on reports in related work.

We evaluate the proposed method on the PASCAL Visual Object Classes (VOC) Challenge dataset [54] with a VGG-16 model [29] pre-trained on the ILSVRC 2012 dataset [1]. Since pre-training usually takes several weeks, we downloaded the Caffe model available online and converted it for use as the teacher network. We compare our method with state-of-the-art results obtained on this dataset in the literature, including the VGG16+SVM method [29], the segment-hypotheses-based multi-label HCP-VGG method [52], and the FisherNet-VGG16 method [53], which encodes CNN features with Fisher vectors. These papers report results on the original whole dataset with 5011 images. To test the learning capability of the network on smaller datasets with reduced samples, we also implement the fine-tuning method. We try different combinations of network update schemes and learning parameters and use the best result for the performance comparison with our method. We conducted our experiments on the entire training set with 5011 images and the test set with 4952 images. In addition, we randomly sample 50 and 10 images from each class, generating two small datasets with 1000 and 200 training images, respectively. The results are summarized in Table II. We list the test accuracy of the network for each configuration, together with the corresponding network complexity, including the number of model parameters and the number of multiplications and additions. Note that accuracy results on the down-sized training sets are not available for the methods from the literature. It can be seen that our proposed method outperforms standard training and fine-tuning by a large margin while reducing the model size by roughly 2 times and improving the inference speed by 4.6 times.
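The size and speed factors quoted above follow from the complexity columns of Table II, as the short calculation below shows. It assumes that the speed-up is measured as the ratio of multiply-add counts, which matches the quoted 4.6 but is not stated explicitly in the text. `reduction_factor` is an illustrative helper.

```python
def reduction_factor(baseline, reduced):
    """How many times smaller `reduced` is than `baseline`."""
    if reduced <= 0:
        raise ValueError("reduced must be positive")
    return baseline / reduced


# Values copied from Table II (VGG16 baselines vs. the proposed student network).
param_ratio = reduction_factor(14.7e6, 8e6)      # convolution-layer parameters
madd_ratio = reduction_factor(15470e6, 3361e6)   # multiply-adds per inference

print(f"parameters: {param_ratio:.2f}x fewer")
print(f"multiply-adds: {madd_ratio:.2f}x fewer")
```

The parameter ratio comes out at about 1.84 and the multiply-add ratio at about 4.60.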

IV-D Results on the Omniglot Dataset

We are interested in how the proposed KPN method works on very small datasets, for example, the Omniglot handwritten character recognition dataset. MNIST [55] is a well-known handwritten digit dataset consisting of 60000 training images and 10000 test images, each 28×28×1 in size, organized into 10 classes. Omniglot [56] is a similar but much smaller dataset, containing 1623 different handwritten characters from 50 alphabets. Each of the 1623 characters was drawn online via Amazon's Mechanical Turk by 20 different people. All images are binarized and resized to 28×28×1 with no further data augmentation. We use all 70000 images from MNIST to train a 5-layer Maxout convolutional model as the teacher network, as proposed in [46]. We report experimental results of various algorithms across a wide range of training set sizes, from 19280 down to merely 1000, shown in Table III. Note that we use class-dependent shuffling to randomly select training subsets, which is critical to avoid an unbalanced class distribution in Omniglot given the limited number of samples per class. We can see that the proposed KPN is able to reduce the error rate by 1.1-1.3%. Table III also provides some interesting insights into how models are transferred to different tasks. First, the fine-tuning methods are all affected by the number of learnable parameters and training samples. A smaller training set results in significant over-fitting, thus breaking the fragile co-adaptation between layers. If the training set is large enough, the number of learnable parameters is positively related to the performance. This phenomenon is also discussed in [25], where transferring knowledge from a pre-trained model to an identical network is extensively tested.
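The class-dependent shuffling used to build balanced training subsets is not spelled out in the text. One plausible reading is per-class shuffling followed by taking the same number of samples from every class, sketched below. The function name `class_balanced_subset` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def class_balanced_subset(labels, per_class, seed=0):
    """Pick up to `per_class` sample indices from every class, shuffled within each class.

    ASSUMPTION: this is one interpretation of "class dependent shuffling",
    not necessarily the authors' exact procedure.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)              # shuffle inside the class only
        chosen.extend(indices[:per_class])
    rng.shuffle(chosen)                   # mix classes for training order
    return chosen


# Toy example: 4 classes with 10 samples each, keep 3 per class.
toy_labels = [i % 4 for i in range(40)]
subset = class_balanced_subset(toy_labels, per_class=3)
print(len(subset), sorted(toy_labels[i] for i in subset))
```

Sampling uniformly over the whole dataset instead could leave some of the 1623 characters with no training example at all when only about 1000 images are drawn, which is the imbalance the text warns about.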

| Methods | Error rate, 19280 samples | Error rate, 5000 samples | Error rate, 1000 samples |
|---|---|---|---|
| Deep CNN [56] | 13.5% | - | - |
| Deep Siamese CNN [56] | 8.0% | - | - |
| Large CNN standard BP | 9.3% | 12.9% | 19.4% |
| Small CNN standard BP | 12.1% | 18.5% | 23.8% |
| Fine-tuned from MNIST | 6.8% | 7.4% | 9.2% |
| Our method | 5.9% | 6.6% | 7.9% |

TABLE III: Test error rate comparisons between experimental settings and baseline methods.
Fig. 6: Network capacity and performance analysis. Top: test accuracies with the proposed KPN and with normal training using standard back-propagation; Middle: number of parameters (note that the y-axis is in logarithmic scale); Bottom: actual inference speed-up ratio with respect to ResNet-50. Network notations: a dedicated symbol marks the teacher network; N- denotes a slim network with N layers; similarly, an N-layer slimmer network is denoted as N--.

| # Layers | 50 | 50- | 50-- | 44- | 44-- | 38- | 38-- | 32- | 32-- | 26- | 26-- |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Conv /s1 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
| ResConv /s2 | 32 | 32 | 16 | 32 | 16 | 32 | 16 | 32 | 16 | 32 | 16 |
| ResConv /s1 | 64 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 | 32 |
| ResConv /s2 | 128 | 64 | 48 | 64 | 48 | 64 | 48 | 64 | 48 | 64 | 48 |
| Conv /s1 | 256 | 128 | 96 | 128 | 96 | 128 | 96 | 128 | 96 | 128 | 96 |

TABLE IV: Network configurations for extensive benchmarks on the Omniglot dataset. N- denotes a slim network with N layers; similarly, an N-layer slimmer network is denoted as N--. Note that adaptive convolutions for residual modules are not included in this table.

Fig. 7: Iterative pruning analysis. Top: occurrences of each projection route over 32 standalone tests. Bottom: mean classification error of each projection route with iterative pruning disabled. Each route denotes a network with a knowledge layer from the teacher projected to an injection layer of the student.

IV-E Algorithm Parameter Analysis

In this section, we study how the performance of our method is affected by the selection of major parameters.

(1) Trade-off between Performance and Efficiency. To evaluate how network size affects performance, we measure the test accuracy, number of parameters, and network speed-up ratio of various student networks on the CIFAR-10 dataset. Figure 6 shows the results. Student networks are designed based on a multi-layer ResNet denoted as N- or N--, where N is the number of layers, and - and -- indicate a slim or slimmer version of ResNet. The detailed network configurations are listed in Table IV. As expected, deeper and slimmer networks are more difficult to train with limited training data. However, with the proposed method enabled, depth is beneficial, and the networks suffer less from performance drops. Impressively, we obtain a model that is 34 times faster and uses less than 2% of the parameters, with about 3% accuracy loss, compared to the teacher network.

(2) Analysis of Iterative Pruning for Automatic Route Selection. The knowledge projection route is critical for network training and test performance. Intuitively, the projection route should be neither too shallow nor too deep. Shallow layers may contain only low-level texture features, while deep layers close to the output may be too task-specific. To study how iterative pruning works during training, we record the pruning results and compare them with manually defined projection routes, as shown in Figure 7. We can see that the statistics of the surviving projection routes are highly correlated with the training accuracy, which is evaluated by manually fixing the projection route between a given teacher layer and a given student layer and disabling iterative pruning during training. The result also indicates that choosing the middle layers for projection is potentially better. Reducing the size of the training data also affects the pruning results. This might relate to the difficulty of fitting the knowledge projection layer to the target domain when very limited data is available. As a result, projection layers tend to appear more often on very deep layers close to the output, so that the penalty from the adaptation loss does not dominate. The bottom line is that, even though the iterative pruning method is a random optimization process, it reliably produces satisfactory results.
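To make the idea of iterative pruning concrete, the toy sketch below keeps a pool of candidate projection routes and periodically drops the one with the largest accumulated loss. This is a hypothetical simplification and not the authors' algorithm: the selection criterion, the reset of accumulated losses, the function `prune_routes`, and the toy loss are all assumptions made for illustration, and real training would supply noisy joint-loss values instead of a fixed function.

```python
def prune_routes(routes, route_loss, total_iters, prune_every=10000, keep=1):
    """Toy iterative pruning: every `prune_every` iterations, drop the worst route.

    ASSUMPTION: routes are ranked by loss accumulated since the last pruning
    step; the paper evaluates a joint loss but gives no such detail here.
    """
    active = list(routes)
    accumulated = {route: 0.0 for route in active}
    for iteration in range(1, total_iters + 1):
        for route in active:
            accumulated[route] += route_loss(route, iteration)
        if iteration % prune_every == 0 and len(active) > keep:
            worst = max(active, key=lambda route: accumulated[route])
            active.remove(worst)
            accumulated = {route: 0.0 for route in active}
    return active


def toy_loss(route, iteration):
    """Hypothetical loss that favors the middle teacher and student layers."""
    teacher_layer, student_layer = route
    return abs(teacher_layer - 1) + abs(student_layer - 1)


# Candidate routes: (teacher layer index, student layer index).
candidates = [(t, s) for t in range(3) for s in range(3)]
survivors = prune_routes(candidates, toy_loss, total_iters=100, prune_every=10)
print(survivors)
```

Repeating such a procedure over many independent runs and counting which route survives would give a histogram of the kind shown in the top panel of Figure 7.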

IV-F Discussion and Future Work

Our KPN is designed in a highly modular manner. The projection layers used during training are removed at test time, and the network capacity is highly configurable for the performance/speed trade-off. The KPN method can be easily extended to other problems such as object detection, object segmentation, and pose estimation by replacing the softmax loss layer used in classification problems. Since the deployed network is a pure standard network, another research direction is to apply KPN as a building block in traditional model compression techniques to reshape the network from a new perspective. Although we have focused on the advantages of KPN with thinner networks on smaller datasets, there are potential benefits in applying KPN to large networks and relatively large datasets, for example, in performance-oriented situations where speed is not an issue.

V Conclusion

We have developed a novel knowledge projection framework for deep neural networks that addresses the issues of domain adaptation and model compression simultaneously during training. We exploit the distinctive general features produced by a teacher network trained on a large dataset, and use a learned matrix to project them into domain-relevant representations to be used by the student network. A smaller and faster student network is trained to minimize a joint loss designed for domain adaptation and knowledge distillation simultaneously. Extensive experimental results have demonstrated that our unified training framework provides an effective way to obtain fast, high-performance neural networks on small datasets with limited labeled samples.