In recent years, Deep Neural Networks (DNNs) have attracted much of the attention in visual classification tasks. Indeed, they have led to significant improvements in many problems, including image classification and object detection [19, 27]. This was mainly achieved through the development of complex new architectures based on Convolutional Neural Networks (CNNs) [38, 31, 15]. These architectures typically involve a huge number of parameters and operations at inference time, making them ill-suited to deployment in constrained environments. Several strategies have been developed to circumvent this issue by compressing the models into smaller ones that achieve similar performance. The main methods can be categorized into the following schemes: selective parameter pruning, distillation, quantization, and low rank factorization and sparsity [14, 21, 10, 4, 17, 32]. Another major challenge for DNNs is to remain effective on data coming from a target distribution that is similar, yet not identical, to the training distribution. A popular technique is to take a model pre-trained on a large source dataset and fine-tune its parameters on the desired target dataset. However, labels may not be available for the target data. In that case, training requires unsupervised Domain Adaptation (DA) techniques [3, 9, 30].
In this work, we seek to adapt compression objectives to the specific setting of DA. Concretely, we are provided with both a source distribution and a, possibly unlabeled, target distribution, and we are interested in compressing a deep network while keeping high accuracy on the target distribution. Some works have investigated this specific setting. The work in  is strongly related to ours but only focuses on the case where a model is fine-tuned on a target dataset. Their method yields strong results in this setting because it is data dependent, that is, compression depends on the input data distribution. In this paper, we focus on a compression method that is also data dependent. The novelty of our work is to go further in the analysis of how the data affects compression and to add a new regularization term that is directly related to a DA objective: it favors nodes that are not domain discriminative. This proves useful because the more similar the extracted features are between the source and target distributions, the better the discriminative ability learned on the source distribution applies to the target distribution. Finally, we show that our extended compression method compares favorably to  in various cases of fine-tuning a pre-trained model on a target dataset.
Our contribution is summarized in Section 6.
2 Related work
2.1 Model compression strategies
Here we present a brief review of the main strategies used in model compression. A more in-depth analysis of the different methods can be found in .
Pruning. These methods focus on finding weights that are not crucial to the network’s performance. Early works used magnitude based pruning, which removes the weights with the lowest magnitudes . This method was extended by various works yielding significant results [12, 21]. Pruning may also be conducted during the learning phase to learn compact representations [12, 1, 47]. Other pruning methods use second order derivatives of the loss function to choose which weights to remove [8, 39]. In this work we focus on a pruning method , which is based on spectral analysis of the covariance matrix of layers [35, 37].
Network distillation. These approaches are based on a teacher-student learning scheme [4, 2, 16]. A small, compressed network learns to mimic a large network or an ensemble of large networks. This is efficient because the student model learns from a combination of the true labels, which are hard encoded, and the soft outputs of the teacher model, which contain valuable information about the similarity between classes. Distillation can also train the student network to mimic the behavior of intermediate layers .
Parameter quantization. Network quantization is based on finding an efficient representation of the network’s parameters by reducing the number of bits required to represent each weight. Various strategies can achieve this objective: [10, 43] applied k-means scalar quantization to the weight values. Other works use a fixed (8 or 16) bit representation of the weights while maintaining the accuracy of the original model [41, 13]. In  a network is constrained to use only binary weights while still achieving high accuracy.
Low rank factorization and sparsity. The idea behind low rank factorization and sparse techniques is to decompose layers into smaller ones with fewer parameters. For convolutional layers, the main techniques rely on the fact that a convolution with the original filters can be approximated by a linear combination of convolutions with a set of base filters [32, 7]. In  truncation of the Singular Value Decomposition (SVD) is used to factorize the weight matrix of dense layers.
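As a minimal illustration of the last idea, the sketch below factorizes the weight matrix of a dense layer with a truncated SVD using NumPy. The function name `truncated_svd_factorize` and the toy shapes are assumptions made for this example and are not taken from the cited works.

```python
import numpy as np

def truncated_svd_factorize(W, k):
    """Split W (out_dim x in_dim) into two rank-k factors with W ~= A @ B."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * s[:k]   # (out_dim, k): left singular vectors scaled by singular values
    B = Vt[:k, :]          # (k, in_dim): right singular vectors
    return A, B

# Toy example (hypothetical sizes): a 64 x 32 dense layer reduced to rank 8.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))
A, B = truncated_svd_factorize(W, k=8)
rel_err = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
```

The dense layer is then replaced by two smaller layers applying `B` and then `A`. This factorization depends only on the weights, not on the data, which is why it serves as a data independent baseline.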
2.2 Model compression with Domain Adaptation
Some works investigate pruning methods in the DA (or transfer learning) setting [24, 40, 33, 46, 45]. Unlike our work, most existing methods depend on iterative fine-tuning. For example, Network Adaptation (NwA)  adapts an ImageNet pre-trained model to improve generalization ability and efficiency on target tasks. NwA can learn weights and widths for target tasks by performing pruning and fine-tuning iteratively and gradually. NwA prunes filters based on the cumulative sum of mean activations after feeding target data. This pruning strategy is simple, but it struggles to preserve the information carried by the activations. Although each fine-tuning step may mitigate the information loss caused by each pruning step, the strong dependence on fine-tuning increases the risk of overfitting (see very recent work  for more discussion).
In contrast to the iterative fine-tuning approach, the Domain Adaptive Low Rank Matrix Decomposition (DALR) method  can maintain accuracy without fine-tuning after compression. DALR and our method are both data dependent, which makes them suited to the DA setting, and both are based on spectral analysis, since DALR uses SVD. An important difference between DALR and our method is how they affect the architecture of the network. We describe the difference by considering two uncompressed layers with weight matrices of dimension and as shown in Fig. 1. To compress the first layer, DALR produces two smaller layers that have weight matrices of dimension and ( is the kept rank).
First, note that, to actually reduce the number of parameters in the network, the parameter must verify the following condition: . DALR affects only the layer being compressed since the output shape remains unchanged. Conversely, our method does not create additional layers but also affects the input dimension of the following layer. Second, DALR is only designed to compress dense layers, while our method can be applied to both dense and convolutional layers. Finally, in , the authors only mention using as a given input to compression. Therefore, prior knowledge of how compressible the layer is must be available, or fine-tuning will be required. Our method helps avoid this issue by using an information loss ratio parameter defined in Section 3.2. It should be noted, however, that a similar parametrization can be reproduced in DALR by computing the ratio of the kept singular values to all singular values of the decomposition.
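The condition on the kept rank can be written out as follows. The symbols $m \times n$ for the weight matrix of the layer being compressed and $k$ for the kept rank are notation assumed here for illustration, and biases are ignored.

```latex
% W in R^{m x n} is replaced by two factors of shapes m x k and k x n.
\[
  \underbrace{mk + kn}_{\text{two factorized layers}}
  \;<\;
  \underbrace{mn}_{\text{original layer}}
  \quad\Longleftrightarrow\quad
  k \;<\; \frac{mn}{m + n}.
\]
```

For a square $4096 \times 4096$ layer, for instance, this bound gives $k < 2048$.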
3 Spectral pruning with DA regularization
Our work was built on the layer wise compression method proposed by , which is based on spectral analysis. The method does not require any regularization during training, but is applicable to a pre-trained model. We will first detail this method then introduce our DA regularizer.
The method compresses the network layer by layer and can be applied to both dense and convolutional layers. It aims at reducing the dimension of the output of a layer such that the input of the next layer remains mostly unchanged. To compress a layer, it selects the most informative nodes or filters, depending on the layer type. For brevity, we will avoid this distinction in the remainder of this paper and refer only to nodes in both cases.
More formally, let us denote the input to the network and suppose that we are given training data with size . Let us consider a model that is a sequence of convolutional and fully connected layers. We denote by and respectively the mapping function and activation function of a layer. Finally we denote by the representation obtained at layer where is the number of nodes in the layer . Therefore we have  The objective of the method is to find, for each layer , the subset of nodes that minimizes the following quantity:
where denotes the empirical expectation, denotes the vector composed of the components of indexed by , and is a matrix to recover the whole vector from its sub-vector as well as possible.
For a fixed , it is easy to minimize the objective (1) with respect to as
where and is the full set of indexes. Then can be approximated by which is equivalent to pruning the nodes of layer that do not belong to and reshaping the weights of the next layer. Let us denote the weight matrix if the next layer is a fully connected layer, the weight tensor if it is a convolutional layer. In that case we denote by the four dimensional tuple the shape of the weight tensor where stand respectively for output channels, input channels, width, height.
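For reference, the objective and its closed-form minimizer described above can be written as follows. The notation is assumed here for illustration: $\phi(x) \in \mathbb{R}^{m}$ is the output of the layer, $J$ the kept index subset, $F$ the full index set, $A$ the recovery matrix, and $\hat{\Sigma}$ the (uncentered) empirical covariance matrix. The second line is the standard least-squares solution implied by the text.

```latex
\begin{gather*}
  \min_{J \subseteq F,\; A \in \mathbb{R}^{m \times |J|}}
    \hat{\mathbb{E}}\!\left[ \lVert \phi(x) - A\,\phi_J(x) \rVert^2 \right], \\
  \hat{A}_J = \hat{\Sigma}_{F,J}\,\hat{\Sigma}_{J,J}^{-1},
  \qquad
  \hat{\Sigma} = \hat{\mathbb{E}}\!\left[ \phi(x)\,\phi(x)^{\top} \right], \\
  \phi(x) \approx \hat{A}_J\,\phi_J(x).
\end{gather*}
```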
The reshaping of the weights of the next layer is done as follows. For fully connected layers, we obtain the reshaped weight matrix as  For convolutional layers, it is done by first reshaping to a tensor. Each point of the filter grid is therefore associated with a filter matrix that will be reshaped by the operation:  The tensor is then reshaped back to its original form where has been modified by the process.
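The recovery matrix and the reshaping of the next layer can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the function names, the small `ridge` term added for numerical stability, and the use of uncentered second moments are assumptions.

```python
import numpy as np

def recovery_matrix(Phi, J, ridge=1e-8):
    """Least-squares matrix A such that Phi ~= Phi[:, J] @ A.T.

    Phi: (n_samples, m) outputs of the layer; J: list of kept node indices.
    """
    n = Phi.shape[0]
    Sigma = Phi.T @ Phi / n                      # uncentered second moments, (m, m)
    S_FJ = Sigma[:, J]                           # (m, |J|)
    S_JJ = Sigma[np.ix_(J, J)] + ridge * np.eye(len(J))  # ridge is an assumption
    # A = S_FJ @ inv(S_JJ), computed with a linear solve (S_JJ is symmetric).
    return np.linalg.solve(S_JJ, S_FJ.T).T       # (m, |J|)

def reshape_next_dense(W_next, A):
    """Fold A into the next dense layer: (out, m) @ (m, |J|) -> (out, |J|)."""
    return W_next @ A

def reshape_next_conv(W_next, A):
    """Apply the same map at every point of the filter grid.

    W_next: (out_ch, in_ch, kh, kw) -> (out_ch, |J|, kh, kw).
    """
    return np.einsum("oikl,ij->ojkl", W_next, A)

# Toy check: 5 nodes spanned by 3 of them, so keeping J = [0, 1, 2] loses nothing.
rng = np.random.default_rng(0)
base = rng.standard_normal((200, 3))
mix = np.hstack([np.eye(3), rng.standard_normal((3, 2))])
Phi = base @ mix                                 # (200, 5)
J = [0, 1, 2]
A = recovery_matrix(Phi, J)
W_next = rng.standard_normal((4, 5))
W_tilde = reshape_next_dense(W_next, A)          # (4, 3)
```

In this toy case the pre-activations of the next layer, `Phi @ W_next.T`, are reproduced by `Phi[:, J] @ W_tilde.T` up to numerical error, because the pruned nodes are exact linear combinations of the kept ones.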
3.2 Selecting optimal indices
Once the optimal has been obtained for a fixed as in Eq. (2), we minimize the objective (1) with respect to . Substituting the optimal into the objective function (1), it can be rewritten as  The minimand on the right hand side is zero for and for . Hence we may consider the following “ratio” of residual to measure the relative goodness of :
This can be seen as an information loss ratio: the higher the ratio, the more of the information computed by the layer is retained. It is no greater than 1 since the denominator is the best achievable value with no cardinality constraint. Compression can therefore be parametrized by an information loss ratio parameter and the final optimization problem translates to
It is worth noting that although the method compresses each layer sequentially, using allows more flexibility than methods with a fixed compression rate for each layer. Since we do not impose constraints on the cardinality of , compression will adapt to the inherent compressibility of the layer. We argue that this makes an easy to tune and intuitive hyperparameter and, if necessary, it can easily be combined with cardinality constraints.
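Under assumed notation ($\hat{\Sigma}$ for the empirical covariance matrix of the layer outputs, $F$ for the full index set, $J$ for the kept subset, $r(J)$ for the ratio, and $\alpha$ for the information loss ratio parameter), the quantities discussed in this section can be written as follows. The expressions follow from substituting the least-squares recovery matrix into the reconstruction objective.

```latex
\begin{gather*}
  \min_{A}\; \hat{\mathbb{E}}\!\left[ \lVert \phi(x) - A\,\phi_J(x) \rVert^2 \right]
    = \operatorname{Tr}\!\left[ \hat{\Sigma}_{F,F}
      - \hat{\Sigma}_{F,J}\,\hat{\Sigma}_{J,J}^{-1}\,\hat{\Sigma}_{J,F} \right], \\
  r(J) = \frac{\operatorname{Tr}\!\left[ \hat{\Sigma}_{F,J}\,\hat{\Sigma}_{J,J}^{-1}\,\hat{\Sigma}_{J,F} \right]}
              {\operatorname{Tr}\!\left[ \hat{\Sigma}_{F,F} \right]} \in [0, 1], \\
  \min_{J \subseteq F} |J| \quad \text{subject to} \quad r(J) \ge \alpha .
\end{gather*}
```

The residual is $\operatorname{Tr}[\hat{\Sigma}_{F,F}]$ for $J = \emptyset$ and zero for $J = F$, so $r(\emptyset) = 0$ and $r(F) = 1$.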
In order to solve the optimization problem (4), the authors of  proposed a greedy algorithm where is constructed by sequentially adding the node that maximizes the information loss ratio. This algorithm does not give the optimal solution but additional fine-tuning can be used to restore the deteriorated accuracy.
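A greedy selection of this kind can be sketched as follows. This is an illustrative version under assumptions, not the cited algorithm verbatim: the names `info_ratio` and `greedy_select`, the threshold name `alpha`, and the use of a pseudo-inverse are choices made for this example, and the ratio is recomputed from scratch at each step for clarity rather than speed.

```python
import numpy as np

def info_ratio(Sigma, J):
    """Tr[S_FJ S_JJ^{-1} S_JF] / Tr[S_FF] for the index list J."""
    if not J:
        return 0.0
    S_FJ = Sigma[:, J]
    S_JJ = Sigma[np.ix_(J, J)]
    # pinv guards against a (near-)singular S_JJ when nodes are redundant.
    explained = np.trace(S_FJ @ np.linalg.pinv(S_JJ) @ S_FJ.T)
    return float(explained / np.trace(Sigma))

def greedy_select(Sigma, alpha):
    """Add, one node at a time, the candidate that maximizes the ratio,
    until the ratio reaches alpha or no candidate is left."""
    J, candidates = [], list(range(Sigma.shape[0]))
    while candidates and info_ratio(Sigma, J) < alpha:
        scores = [info_ratio(Sigma, J + [j]) for j in candidates]
        best = candidates[int(np.argmax(scores))]
        J.append(best)
        candidates.remove(best)
    return J

# Toy example: uncorrelated nodes with second moments 5, 3, 1, 1.
Sigma = np.diag([5.0, 3.0, 1.0, 1.0])
kept = greedy_select(Sigma, alpha=0.75)
```

In the toy example the first two nodes carry 80% of the total trace, so they are enough to pass the threshold of 0.75.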
3.3 Domain Adaptation regularization
In the DA setting, the informative nodes can differ between the source and target domains. To account for this difference, we propose to add a regularization term in order to select better nodes. We call our method Moment Matching Spectral Pruning (MoMaSP).
As shown by , one of the main challenges of DA is to find a representation discriminative with respect to the classification task but not discriminative between domains. The general idea is that, by matching the two distributions, the classification knowledge learned on the labeled source data can be transferred to the target data. Therefore, we propose to use a measure of the alignment of the source and target features distributions for selecting nodes during compression.
Previous work involving covariance alignment achieved high accuracy in DA tasks . Following those observations, we define our regularization term as the following
where denotes the source and target distributions, respectively the source and target empirical covariance matrices, a scaling matrix defined by , denotes the element wise multiplication, and means the Frobenius norm of a matrix. This quantity measures the discrepancy between the source and target distributions using statistics up to the second order: the first term measures the discrepancy of the means and the second term measures the discrepancy of the (scaled) second order moments. Hence, this regularization term induces a moment matching effect, and the two distributions become similar on the selected nodes. One could instead use the MMD criterion (a kernel based discrepancy measure)  to capture not only the first and second moments but also all higher moments, but we found that it takes too much computational time.
Instead of the criterion (5), we also consider the following alternative formulation:
The first formulation (5) depends on the whole subset while the second one computes information more specific to each node. The second formulation is more computationally efficient since the ’s can be computed only once for each candidate index. The first one needs to recompute it for every candidate index at each step of the procedure because of its dependence on .
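The two variants can be illustrated with the following NumPy sketch. It is not the exact regularizer of this paper: the scaling used here (normalizing by the target standard deviations) is an assumption, as are the function names. The sketch only shows the structure, namely a mean gap plus a scaled second order gap, computed either on a whole subset or per node.

```python
import numpy as np

def subset_discrepancy(phi_src, phi_tgt, J, eps=1e-8):
    """Subset-level variant: mean gap plus scaled covariance gap on the nodes J."""
    s, t = phi_src[:, J], phi_tgt[:, J]
    mean_gap = np.linalg.norm(s.mean(axis=0) - t.mean(axis=0))
    C_s = np.atleast_2d(np.cov(s, rowvar=False))
    C_t = np.atleast_2d(np.cov(t, rowvar=False))
    # ASSUMED scaling matrix: 1 / (std_i * std_j), computed on the target features.
    std = np.sqrt(np.diag(C_t)) + eps
    scale = 1.0 / np.outer(std, std)
    cov_gap = np.linalg.norm(scale * (C_s - C_t), ord="fro")
    return float(mean_gap + cov_gap)

def per_node_discrepancy(phi_src, phi_tgt, eps=1e-8):
    """Node-level variant: one value per node, computable once before selection."""
    mean_gap = np.abs(phi_src.mean(axis=0) - phi_tgt.mean(axis=0))
    v_s = phi_src.var(axis=0, ddof=1)
    v_t = phi_tgt.var(axis=0, ddof=1)
    return mean_gap + np.abs(v_s - v_t) / (v_t + eps)

# Toy features (hypothetical): node 1 is shifted on the target domain.
rng = np.random.default_rng(0)
src = rng.standard_normal((500, 4))
tgt = src.copy()
tgt[:, 1] += 3.0
node_scores = per_node_discrepancy(src, tgt)
```

The subset-level function has to be called again whenever the subset changes, whereas `node_scores` is computed once, which mirrors the difference in cost between the two formulations.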
The intuition behind our regularization is the following. In the DA setting, where few or no labels are available on the target distribution, the discriminative knowledge is learned mainly on the source distribution. Therefore, as explained previously, the feature distributions on both domains should be aligned for the discriminative knowledge learned on the source domain to apply to the target domain. When comparing the ratios of Eq. (3) for different nodes, we observed that many of them had very close values. This means that many nodes capture approximately the same amount of information, in the sense of total variance. Our intuition was that when these relative differences are too small, they are not significant enough to discriminate between nodes and are more likely the result of noise. Therefore a better criterion should be used to differentiate nodes that capture the same amount of variance. Since our compression method is designed to be applied to models that have been trained using DA methods, it is natural to use a criterion related to DA objectives. We choose to compare nodes based on how well their feature distributions on the source and target domains are aligned. Nodes leading to a better alignment should be favored, as they allow a better transfer of discriminative knowledge from the source to the target domain. Our method realizes this via a moment matching type regularization.
3.5 Practical implementation
We present in Algorithm 1 a pseudo code for a practical implementation of our method using the first formulation of our regularizer. If the second formulation is used, the regularization term for each node is computed only once before entering the while loop, as it does not depend on . At each step of the algorithm, we denote by the set of candidate indexes. To select which node to add to , the ratio from Eq. (3) is computed for each candidate index. We denote by the vector where each coordinate corresponds to the ratio value of a candidate index. Similarly, we denote by the vector where each coordinate corresponds to the regularizing term associated with a candidate index.
Without regularization, the index to add to is simply given by  Using regularization, the index to add to is chosen as
is the standard deviation of . The values of are rescaled to be in the same range as ’s by max normalization. Multiplying by ’s standard deviation, , ensures that the regularization is applied at the right scale. It should only slightly modify the relative ordering of the values in , as the main criterion for selection must remain the information ratio. Indeed, taking into account only the regularization term would favor nodes that output a constant value across both distributions. The hyperparameter allows for more control over the trade-off between those two terms. We use as the default in our experiments. The max normalization and scaling make easier to tune. This compression is applied after training and can therefore be combined with any DA method.
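One way to read this selection rule is sketched below. The exact form of the score is an assumption: the discrepancy values are divided by their maximum, multiplied by the standard deviation of the ratios and by a trade-off weight, and subtracted from the ratios. The names `select_index` and `lam` are hypothetical, and `lam` is left without a default value on purpose.

```python
import numpy as np

def select_index(candidates, ratios, reg, lam):
    """Return the candidate index with the best regularized score.

    candidates: list of node indices still available.
    ratios:     information ratio obtained when adding each candidate.
    reg:        discrepancy (regularization) value of each candidate.
    lam:        trade-off weight; lam = 0 recovers the unregularized choice.
    """
    ratios = np.asarray(ratios, dtype=float)
    reg = np.asarray(reg, dtype=float)
    reg_max = reg.max()
    # Max normalization puts the penalties in [0, 1], like the ratios.
    reg_norm = reg / reg_max if reg_max > 0 else np.zeros_like(reg)
    # Scaling by the std of the ratios keeps the penalty at the scale
    # of the differences between candidates (assumed form of the rule).
    scores = ratios - lam * ratios.std() * reg_norm
    return candidates[int(np.argmax(scores))]

# Toy example: candidates 7 and 8 have almost the same ratio,
# but candidate 8 is much less well aligned across domains.
cands = [7, 8, 9]
ratios = [0.500, 0.501, 0.300]
reg = [0.1, 1.0, 0.0]
plain = select_index(cands, ratios, reg, lam=0.0)
regularized = select_index(cands, ratios, reg, lam=1.0)
```

In this toy case the regularizer only breaks the near tie between candidates 7 and 8, while candidate 9, which has a clearly lower ratio, is still not selected despite its zero penalty.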
|Test data||Nodes specificity|
4 Experiments on digits images
In this section, we conduct experiments with a model trained on digits datasets, using the DA technique introduced in . The source dataset is the SVHN dataset  and the target dataset is the MNIST dataset .
Considering the relative simplicity of the task, we used a custom model composed of a 4-layer feature generator (3 convolutional + 1 dense) and a 2-layer classifier (see Appendix for details). To train the model we used the DA technique presented in , which utilizes “maximum classifier discrepancy.” Briefly, the adaptation is realized by adversarial training of a feature generator network and two classifier networks. The adversarial training forces the feature generator to align the source and target distributions. Contrary to other methods relying on adversarial training with a domain discriminator, this method considers the decision boundary while aligning the distributions, ensuring that the generated features are discriminative with respect to the classification task at hand.
4.2 Data choice for compression
The uncompressed model was trained using both train splits and evaluated on the test split of MNIST. It reached an accuracy of 96.64%, similar to the results obtained in the original paper . We then applied compression to the trained model. During compression, only data from the train splits was used, and the compressed models were evaluated on the MNIST test split. Since the method is data dependent, the data used to compute the empirical covariance matrix must be chosen with care. Three different settings were tested:
Using 14,000 samples from the target distribution
Using 7,000 samples from the target distribution and 7,000 samples from the source distribution
Using 14,000 samples from the target distribution and 7,000 additional samples from the source distribution
The results obtained are presented in Fig. 2. Using only target samples to compute the covariance matrix shows the best performance.
To give a better understanding of this result, we analyzed how the activation pattern of the nodes depends on the data distribution, as follows: we compressed the first layer of a trained network using either only target data or only source data. We then compared the activation of nodes that were selected in only one of the two cases, in other words, nodes specific to target or source data. Those nodes showed significant differences in their activation depending on the distribution. We show the results in Table 1.
As expected, nodes selected using source data had a significantly higher activation rate on the source distribution (0.79 on source, 0.65 on target), and conversely for target specific nodes (0.77 on source, 0.84 on target). As a control case, we added the activation of nodes that were selected in neither of the two settings. Those do not show any significant difference in their activation. Interestingly, this difference no longer appeared when comparing the activation of the last fully connected layer. This is because DA training aligns the distributions of the extracted features, so such differences are not expected in the last layers.
This experiment sheds light on how the input data affects the compression of data dependent methods. In the case of a model trained on different distributions with DA, early layers contain nodes that are specific to each distribution. It is therefore critical for compression to use data coming exclusively from the distribution the model will be applied to. Partially using source data forces compression in the early layers to select source specific nodes at the expense of target specific nodes, leading to poorer results.
4.3 Regularization for compression
Finally we compared the best baseline, which uses only target data, with the same setting augmented by the regularization term introduced in Section 3. The results are presented in Fig. 3 and summarized in Table 2. It appears that the second formulation of our regularizer gives the best performance. This is probably because it focuses on node level information rather than on the whole subset, giving a better ability to discriminate against specific nodes.
Compared to the baseline, our regularizer leads to significant improvements in compression performance, with accuracy gains of up to 9%. Yet, we notice that the baseline performs slightly better for very high compression rates. We conjecture that reducing information loss becomes more important for such compression rates and that our regularization with is too strong. There is room for improvement by tuning the hyperparameter depending on the compression rate or by developing methods for adaptive regularization. Also, for such compression rates, additional fine-tuning can be used to improve accuracy. In fact, fine-tuning after compression in those cases leads to similar results regardless of whether regularization was used. It should be noted that DA training can be unstable, especially when using adversarial training. Because of that, fine-tuning may actually harm compression performance and should be used only if the accuracy of the compressed model has dropped significantly, i.e., if the drop in accuracy is higher than the variance induced by fine-tuning.
5 Experiments on natural images
In this section, we compare the results of our method (MoMaSP) with the factorization based compression method (DALR) in . We reproduce the same setting as in their experiment: a VGG19 network pre-trained on the ImageNet dataset  is then fine-tuned on different target datasets.
5.1 Experimental settings
We establish our comparison based on three datasets used in  experiments.
Oxford 102 Flowers : contains 8,189 images. The train and validation splits each contain 1,020 samples with each class equally represented. The test split contains 6,149 samples with classes not equally represented.
CUB-200-2011 Birds : contains 11,788 images (5,994 train, 5,794 test) of 200 bird species. Although each image is annotated with bounding box, part location (head and body), and attribute labels, those annotations were not used in our experiments.
Stanford 40 Actions : contains 9,532 images (4,000 train, 5,532 test) of 40 categories corresponding to different human actions. Classes are equally represented on the training set and all samples contain at least one human performing the corresponding action.
For all three datasets, we first trained uncompressed models by fine-tuning from an ImageNet pre-trained VGG19 model provided by the torchvision package of PyTorch. In the fine-tuning before compression, we trained the weights of the fully connected layers while keeping the weights of the convolutional layers frozen to their ImageNet pre-trained values, like .
We then compressed the models. The input data to compression was always composed of 4,000 samples of the train split, except for Oxford 102 Flowers, where the whole train split was used. An additional 4,000 randomly sampled images from the ImageNet train split were used for DA regularization. Note that our method does not need the labels of the target datasets in this compression phase. After that, we optionally fine-tuned the fully connected layers of the compressed models with target labels.
We used the Adam optimizer  with a batch size of 50, a learning rate of , and a weight decay of for training models. For fine-tuning after compression, we trained for 2 epochs for VGG19 fc7 compression and 5 epochs for VGG19 full compression. All models are trained using PyTorch. See Appendix for details.
5.2 VGG19 fc7 compression
We first evaluated the compression on the last fully connected layer (fc7) of the model, containing 4,096 nodes, because the fc7 compression is a main evaluation setting in . For each trial we report the results both with and without fine tuning of the classifier layers after compression. We also report the results of compression using basic SVD on fc7 weight matrix as a baseline to further illustrate the advantage of using data dependent methods.
To compare our method with DALR, we need a way to match the numbers of parameters, because the two methods do not modify the architecture of the network in the same way. DALR replaces a fully connected layer by two smaller layers without changing the next layer. Our method keeps the same number of layers but also affects the input dimension of the next layer. We therefore proceeded as follows to compare the two methods objectively. A dimension was set for the compression using DALR; then , the dimension to keep in our method resulting in an equal number of parameters, was determined accordingly. More precisely, the fc7 layer has a weight matrix of dimension (because and in Fig. 1 are the same value for the fc7 layer) and the next layer has a weight matrix of dimension . Taking into account the biases, we get the following equation:
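This kind of parameter matching can be sketched in Python as follows. The bias convention is an assumption made for illustration (only the second DALR factor carries a bias, and the next layer is left untouched by DALR), as are the function names and the example sizes, so the resulting numbers need not coincide with those used in the experiments.

```python
def dalr_params(d_in, d_out, k, n_next):
    """Parameters of a layer factorized at rank k, plus the unchanged next layer.

    ASSUMPTION: the first factor has no bias; the second keeps the original bias.
    """
    factorized = d_in * k + k * d_out + d_out
    next_layer = d_out * n_next + n_next
    return factorized + next_layer

def pruned_params(d_in, k_kept, n_next):
    """Parameters after keeping k_kept nodes: the pruned layer and the
    next layer, whose input dimension shrinks to k_kept."""
    pruned_layer = d_in * k_kept + k_kept
    next_layer = k_kept * n_next + n_next
    return pruned_layer + next_layer

def matching_width(d_in, d_out, k, n_next):
    """Largest number of kept nodes whose parameter count does not exceed DALR's."""
    budget = dalr_params(d_in, d_out, k, n_next)
    return (budget - n_next) // (d_in + 1 + n_next)

# Illustrative sizes: a 4096 x 4096 layer followed by a 102-way classifier.
k_kept = matching_width(d_in=4096, d_out=4096, k=128, n_next=102)
```

Because the pruned network also shrinks the next layer, the matched number of kept nodes is larger than the DALR rank for the same parameter budget.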
The results are presented in Tables 3, 4, and 5. We show the relative numbers of parameters in the fc7 layer compressed by DALR in the “params” rows of the tables, as in the DALR paper . Compression rates for the total parameters of the VGG19 models are limited to 11%–12% and the FLOPs reduction is negligible, because the layers before the fc7 layer are not compressed in this evaluation setting.
In all experiments, our method maintains a high accuracy even for high compression rates, where it outperforms DALR by a large margin. However, for lower compression rates DALR is consistently slightly better than our method, though the difference is arguably small. In most cases, fine-tuning does not substantially improve the performance of either method, except for DALR at high compression rates.
Table 3
|SVD w/o FT.|5.2|12.8|30.2|54.6|65.2|67.6|
|SVD w/ FT.|13.7|36.7|56.6|65.9|70.0|70.8|
|DALR w/o FT.|8.6|31.1|55.1|66.9|71.0|72.8|
|DALR w/ FT.|18.9|48.2|64.4|70.3|70.7|72.3|
|Ours w/o FT.|68.9|69.3|70.0|71.2|72.0|72.5|
|Ours w/ FT.|67.3|68.9|70.3|69.9|70.8|70.6|

Table 4
|SVD w/o FT.|4.1|12.6|30.3|46.9|54.8|60.5|
|SVD w/ FT.|17.3|39.1|51.2|57.2|58.8|58.5|
|DALR w/o FT.|5.6|23.1|49.3|57.8|60.1|60.9|
|DALR w/ FT.|19.0|40.9|54.0|59.8|59.3|59.5|
|Ours w/o FT.|58.3|58.2|58.2|58.8|59.3|60.2|
|Ours w/ FT.|57.9|57.3|57.7|58.2|58.2|59.7|

Table 5
|SVD w/o FT.|17.1|31.0|54.5|67.3|72.2|72.7|
|SVD w/ FT.|45.4|62.0|68.6|72.8|73.1|73.7|
|DALR w/o FT.|20.7|43.3|65.0|73.0|74.3|74.7|
|DALR w/ FT.|46.8|63.3|69.6|73.5|72.9|72.9|
|Ours w/o FT.|69.0|70.2|71.8|73.1|73.9|73.8|
|Ours w/ FT.|70.5|70.4|71.7|72.9|73.7|74.0|

Table 6
|Original Acc. (%)|73.1|61.5|74.6|
|DALR C.R. (%)|84.9|84.8|85.0|
|DALR #params|21.1 M|21.3 M|21.0 M|
|DALR w/o FT. Acc. (%)|48.1|43.0|57.7|
|DALR w/ FT. Acc. (%)|62.2|52.8|67.4|
|Ours C.R. (%)|85.4|85.5|85.4|
|Ours #params|20.4 M|20.4 M|20.4 M|
|Ours w/o FT. Acc. (%)|60.1|50.8|69.8|
|Ours w/ FT. Acc. (%)|49.9|53.1|71.1|
5.3 VGG19 full compression
It is important to note that, contrary to DALR, our method is able to compress both convolutional and dense layers. To further demonstrate this advantage, we compared the two methods on full compression of the VGG19 network. We conducted the experiment on all three datasets. DALR was applied to the fully connected layers. See Appendix for other details.
Results are presented in Table 6. For all three datasets, without fine-tuning, our method consistently achieves a better compression rate (C.R.) while reaching a test accuracy roughly 8 to 12 percentage points higher than DALR. After fine-tuning, since both models have similar complexity, the difference in accuracy is smaller but still favorable to our method.
5.4 Digging into fine-tuning results
In many cases, fine-tuning degrades the performance of our method. To investigate this phenomenon, we evaluated models fine-tuned for various numbers of epochs on Oxford 102 Flowers. The results are presented in Table 7. The accuracy drops in the first epoch and recovers in the succeeding epochs. These results suggest that the learning rate we used is too high for our method and causes the performance drop by pushing the parameters away from their optimal values. Tuning the learning rate and the number of fine-tuning epochs by cross-validation should further improve accuracy, though at a significant cost in time.
Our method achieves 74.2% accuracy in multiple settings in Table 7, which is higher than the original accuracy (73.1%) of the uncompressed model. An accuracy gain from compression in the DA setting is also reported in . A comparison with NwA  is therefore an interesting experiment, although the main advantage of our method is its high accuracy without fine-tuning. We compared error reduction rates because many experimental settings differ between papers (e.g., VGG16 models are trained with data augmentation including rotation perturbations in ). We show the comparison results in Table 8. Our method achieves a higher error reduction rate than NwA on Oxford 102 Flowers. On the other hand, VGG19 fc7 compression by our method does not improve accuracy much on the other two datasets. (The original accuracy is recovered by fine-tuning for 3 epochs on the two datasets in the case that is 128.) We conjecture that compressing fc7 aggressively is better for Oxford 102 Flowers because training images are few (1,020 images).
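As a small worked example of this metric, the sketch below assumes the usual definition of the error reduction rate, that is, the relative decrease of the error (100 minus the accuracy in percent). The function name is hypothetical, and the definition is an assumption, not a quotation of the compared paper.

```python
def error_reduction_rate(acc_before, acc_after):
    """Relative reduction of the error, with accuracies given in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before

# Oxford 102 Flowers figures quoted above: 73.1% before, 74.2% after compression.
rate = error_reduction_rate(73.1, 74.2)
```

With these figures the error goes from 26.9% to 25.8%, a relative reduction of about 4%.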
Table 7
|Ours w/o FT.|68.9|69.3|70.0|71.2|72.0|72.5|
|Ours 1 epoch FT.|64.0|63.9|64.9|67.7|69.5|67.8|
|Ours 2 epochs FT.|67.3|68.9|70.3|69.9|70.8|70.6|
|Ours 5 epochs FT.|71.9|72.9|73.2|73.5|74.2|74.2|

Our method is drastically better for high compression rates, while DALR is slightly better for low compression rates. We conjecture that this is mainly because the two methods modify the network differently (see Fig. 1). DALR affects only one layer. Therefore, if the compression is too aggressive, it reaches a critical point where not enough information can possibly be retained. Our method affects two successive layers, thereby spreading the effect of compression and avoiding this critical point for high compression rates. On the other hand, our method needs to consider the nonlinearity between two layers and uses a greedy algorithm . Affecting only one layer is thus better for maintaining a high accuracy at lower compression rates, because the output of that layer can be optimally approximated.
The optimization problem to solve in DALR admits a closed form solution, making it a very fast compression method. Compared to DALR, a limitation of our method is that it requires an iterative optimization. However, the extra computational time is usually a few hours on 1 GPU (Tesla V100) in our experiments. Furthermore, an iterative process is used in the DALR paper  to determine compression rate pairs for fc6 and fc7, and it needs iterative accuracy evaluation. Pruning methods based on the iterative fine-tuning approach also take time for pruning, fine-tuning, and evaluation. Therefore our method is practical enough.
In this paper, we investigated the compression of DNNs in the DA setting. As shown by , using a data dependent method is crucial in order to achieve good results. In that respect, our work shows that the input data to compression should come only from the distribution to which the model will be applied, the target distribution in our case. This is because adding samples from another distribution forces compression in the early layers to select nodes that are specific to that distribution. However, we show that source data can still be used to improve node selection. This is done by comparing the first and second order statistics of the nodes’ feature distributions on the source and target data. This criterion serves as a measure of the alignment of the two distributions, which directly relates to DA objectives. We therefore refer to this measure as a DA regularizer. We evaluated this regularization on a spectral pruning method introduced in  and obtained significant improvements in its compression results. Finally, we compared our regularized compression method with the factorization based method of  on real world image datasets. Our method compares favorably on all three datasets, leading to significant improvements in retained accuracy for high compression rates.
Although our work focused on one compression method, we argue that using first and second order statistics of feature distributions to measure the alignment between source and target features and using it as a criterion for compression can be applied to other methods (e.g., ). This work can therefore serve as a first example and practical implementation of this idea.
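As a minimal illustration of the alignment criterion summarized above, the following NumPy sketch scores each node by the gap between its source and target statistics. It is an illustrative stand-in and not the exact regularizer of this paper: the function name `node_alignment_scores`, the use of per-node means and variances only (no cross-node covariances), and the unweighted sum of the two gaps are all assumptions.

```python
import numpy as np


def node_alignment_scores(feat_src, feat_tgt):
    """Per-node discrepancy between source and target features.

    feat_src, feat_tgt: arrays of shape (n_samples, n_nodes) holding the
    activations of one layer on source and target samples.
    Returns an array of shape (n_nodes,); lower means better aligned.
    """
    # First order statistics: squared gap between per-node means.
    mean_gap = (feat_src.mean(axis=0) - feat_tgt.mean(axis=0)) ** 2
    # Second order statistics: squared gap between per-node variances
    # (assumption: covariances across nodes are ignored in this sketch).
    var_gap = (feat_src.var(axis=0) - feat_tgt.var(axis=0)) ** 2
    return mean_gap + var_gap


# Hypothetical activations: node 0 behaves alike on both domains,
# node 1 is shifted on the target domain.
rng = np.random.default_rng(0)
src = rng.standard_normal((500, 2))
tgt = rng.standard_normal((500, 2))
tgt[:, 1] += 3.0

scores = node_alignment_scores(src, tgt)
print(scores)  # the second entry should be much larger than the first
```

A compression criterion could then penalize nodes with a high score, so that nodes whose features differ strongly between domains are less likely to be kept.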
-  Jose M. Alvarez and Mathieu Salzmann. Learning the number of neurons in deep networks. In NIPS, 2016.
-  Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In NIPS, 2014.
-  Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 2010.
-  Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil. Model compression. In KDD, 2006.
-  Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine, 2018.
-  Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In NIPS, 2015.
-  Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In NIPS, 2014.
-  Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In NIPS, 2017.
-  Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
-  Yunchao Gong, Liu Liu, Ming Yang, and Lubomir D. Bourdev. Compressing deep convolutional networks using vector quantization. arXiv:1412.6115, 2014.
-  Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. JMLR, 2012.
-  Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In NIPS, 2016.
-  Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In ICML, 2015.
-  Stephen Jose Hanson and Lorien Y. Pratt. Comparing biases for minimal network construction with back-propagation. In NIPS, 1988.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. arXiv:1503.02531, 2015.
-  Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In BMVC, 2014.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
-  Guiying Li, Chao Qian, Chunhui Jiang, Xiaofen Lu, and Ke Tang. Optimization based layer-wise magnitude-based pruning for DNN compression. In IJCAI, 2018.
-  Bingyan Liu, Yao Guo, and Xiangqun Chen. WealthAdapt: A general network adaptation framework for small data tasks. In ACMMM, 2019.
-  Marc Masana, Joost van de Weijer, Luis Herranz, Andrew D. Bagdanov, and Jose M. Alvarez. Domain-adaptive deep network compression. In ICCV, 2017.
-  Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient inference. In ICLR, 2017.
-  Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
-  Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In Indian Conference on Computer Vision, Graphics & Image Processing, 2008.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. In ICLR, 2015.
-  Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
-  Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR, 2018.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  Amos Sironi, Bugra Tekin, Roberto Rigamonti, Vincent Lepetit, and Pascal Fua. Learning separable filters. TPAMI, 2015.
-  Xavier Suau, Luca Zappella, and Nicholas Apostoloff. Network compression using correlation analysis of layer responses. arXiv:1807.10585, 2018.
-  Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
-  Taiji Suzuki. Fast generalization error bound of deep learning from a kernel perspective. In AISTATS, 2018.
-  Taiji Suzuki, Hiroshi Abe, Tomoya Murata, Shingo Horiuchi, Kotaro Ito, Tokuma Wachi, So Hirai, Masatoshi Yukishima, and Tomoaki Nishimura. Spectral-Pruning: Compressing deep neural network via spectral analysis. arXiv:1808.08558, 2018.
-  Taiji Suzuki, Hiroshi Abe, and Tomoaki Nishimura. Compression based bound for non-compressed network: unified generalization error analysis of large compressible deep neural network. In ICLR, 2020.
-  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, 2015.
-  Ming Tu, Visar Berisha, Martin Woolf, Jae-sun Seo, and Yu Cao. Ranking the parameters of deep neural networks using the Fisher information. In ICASSP, 2016.
-  Frederick Tung, Srikanth Muralidharan, and Greg Mori. Fine-Pruning: Joint fine-tuning and compression of a convolutional network with Bayesian optimization. In BMVC, 2017.
-  Vincent Vanhoucke, Andrew Senior, and Mark Z. Mao. Improving the speed of neural networks on CPUs. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
-  Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
-  Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In CVPR, 2016.
-  Bangpeng Yao, Xiaoye Jiang, Aditya Khosla, Andy Lai Lin, Leonidas Guibas, and Li Fei-Fei. Human action recognition by learning bases of action attributes and parts. In ICCV, 2011.
-  Chaohui Yu, Jindong Wang, Yiqiang Chen, and Zijing Wu. Accelerating deep unsupervised domain adaptation with transfer channel pruning. In IJCNN, 2019.
-  Yang Zhong, Vladimir Li, Ryuzo Okada, and Atsuto Maki. Target aware network adaptation for efficient representation learning. In ECCV Workshop on Compact and Efficient Feature Representation and Learning, 2018.
-  Hao Zhou, Jose M. Alvarez, and Fatih Porikli. Less is more: Towards compact CNNs. In ECCV, 2016.
Appendix A Details of experimental settings
A.1 Experiments on digit images
We used a custom model composed of a 4-layer feature generator (3 convolutional layers + 1 dense layer) and a 2-layer classifier. The output widths of the layers are
A.2 Experiments on natural images
Before compression, we fine-tuned for 10 epochs on Oxford-102 Flowers, 5 epochs on CUB-200 Birds, and 5 epochs on Stanford 40 Actions. On each dataset, we trained five models to mitigate randomness and used the one with the median accuracy among the five as the original (uncompressed) model.
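The median-model selection described above can be sketched as follows. The helper name `pick_median_model` and the accuracy values are hypothetical and serve only to illustrate the selection rule.

```python
def pick_median_model(accuracies):
    """Return the index of the run whose accuracy is the median.

    With an odd number of runs (five in our setting), this is the middle
    element after sorting the runs by accuracy.
    """
    order = sorted(range(len(accuracies)), key=lambda i: accuracies[i])
    return order[len(order) // 2]


# Hypothetical test accuracies of five fine-tuning runs.
run_accuracies = [0.71, 0.69, 0.73, 0.70, 0.72]
median_run = pick_median_model(run_accuracies)
print(median_run, run_accuracies[median_run])
```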
The iterative process of the DALR paper  for determining the compression rate of each layer is computationally inefficient. Thus, for full compression of VGG19, we did not determine per-layer compression rates automatically for either method. Specifically, for DALR, the compression rate of fc8 is set to 0.5 (a modest value chosen so as not to break the output values), and the compression rate of the other layers (fc6 and fc7) is set so that the total compression rate reaches 85%. For our method, the compression rate of the fully connected layers is set to 0.96, and the compression rate of the convolutional layers is set so that the total compression rate exceeds 85%. Although the FLOPs reduction achieved by DALR is negligibly small, our method reduces FLOPs by 19% thanks to the compression of the convolutional layers.
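The manual rate setting above amounts to simple bookkeeping, sketched below under explicit assumptions: a compression rate is taken to be the fraction of parameters removed, the remaining layers share a single rate, and the function name `solve_shared_rate` and the parameter counts in the example are hypothetical rather than the actual VGG19 values.

```python
def solve_shared_rate(free_params, fixed_layers, untouched_params, target_rate):
    """Rate to apply to the 'free' layers so the whole network hits target_rate.

    free_params:      total parameter count of the layers sharing one rate
    fixed_layers:     list of (n_params, rate) for layers with a preset rate
    untouched_params: parameter count of layers that are not compressed
    target_rate:      desired fraction of ALL parameters to remove
    """
    total = free_params + sum(n for n, _ in fixed_layers) + untouched_params
    # Parameters still to be removed after accounting for the fixed layers.
    to_remove = target_rate * total - sum(n * r for n, r in fixed_layers)
    rate = to_remove / free_params
    if not 0.0 <= rate <= 1.0:
        raise ValueError("target rate cannot be reached with these layers")
    return rate


# Hypothetical network: 800 free parameters, one layer of 100 parameters
# fixed at rate 0.5, and 100 parameters left uncompressed.
rate = solve_shared_rate(800, [(100, 0.5)], 100, target_rate=0.5)
print(rate)
```

When only a few layers are compressed, a high total target may be unreachable, in which case this sketch raises an error instead of returning a rate above 1.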