# Efficient and Accurate Approximations of Nonlinear Convolutional Networks

This paper aims to accelerate the test-time computation of deep convolutional neural networks (CNNs). Unlike existing methods that are designed for approximating linear filters or linear responses, our method takes the nonlinear units into account. We minimize the reconstruction error of the nonlinear responses, subject to a low-rank constraint which helps to reduce the complexity of filters. We develop an effective solution to this constrained nonlinear optimization problem. An algorithm is also presented for reducing the accumulated error when multiple layers are approximated. A whole-model speedup ratio of 4× is demonstrated on a large network trained for ImageNet, while the top-5 error rate is only increased by 0.9%. Our accelerated model has a comparably fast speed as the "AlexNet", but is 4.7% more accurate.

## 1 Introduction

This paper addresses efficient test-time computation of deep convolutional neural networks (CNNs) [12, 11]. Since the success of CNNs [11] for large-scale image classification, the accuracy of the newly developed CNNs [24, 17, 8, 18, 19] has been continuously improving. However, the computational cost of these networks (especially the more accurate but larger models) also increases significantly. The expensive test-time evaluation of the models can make them impractical in real-world systems. For example, a cloud service needs to process thousands of new requests per second; portable devices such as phones and tablets mostly have CPUs or low-end GPUs only; some recognition tasks like object detection [4, 8, 7] are still time-consuming for processing a single image even on a high-end GPU. For these reasons and others, it is of practical importance to accelerate the test-time computation of CNNs.

There have been a few studies on approximating deep CNNs for accelerating test-time evaluation [22, 3, 10]. A commonly used assumption is that the convolutional filters are approximately low-rank along certain dimensions. So the original filters can be approximately decomposed into a series of smaller filters, and the complexity is reduced. These methods have shown promising speedup ratios on a single [3] or a few layers [10] with some degradation of accuracy.

The algorithms and approximations in the previous work are developed for reconstructing linear filters [3, 10] and linear responses [10]. However, nonlinearities such as the Rectified Linear Unit (ReLU) [14, 11] are not involved in their optimization. Ignoring the nonlinearity will impact the quality of the approximated layers. Consider the case in which the filters are approximated by reconstructing the linear responses. Because the ReLU will follow, the model accuracy is more sensitive to the reconstruction error of the positive responses than to that of the negative responses.

Moreover, accelerating the whole network (instead of just one or a very few layers) is a challenging task. The errors will accumulate if several layers are approximated, especially when the model is deep. In fact, in the recent work [3, 10] the approximations are applied on a single layer of large CNN models, such as those trained on ImageNet [2, 16]. Speeding up only one or a few layers is insufficient for practical usage, especially for the deeper models, which have been shown to be very accurate [18, 19, 8].

In this paper, a method for accelerating nonlinear convolutional networks is proposed. It is based on minimizing the reconstruction error of nonlinear responses, subject to a low-rank constraint that can be used to reduce computation. To solve the challenging constrained optimization problem, we decompose it into two feasible subproblems and iteratively solve them. We further propose to minimize an asymmetric reconstruction error, which effectively reduces the accumulated error of multiple approximated layers.

We evaluate our method on a 7-convolutional-layer model trained on ImageNet. We investigate the cases of accelerating each single layer and the whole model. Experiments show that our method is more accurate than the recent method of Jaderberg et al. [10] under the same speedup ratios. A whole-model speedup ratio of 4× is demonstrated, and its degradation is merely 0.9%. When our model is accelerated to have a comparably fast speed as the “AlexNet” [11], our accuracy is 4.7% higher.

## 2 Approaches

### 2.1 Low-rank Approximation of Responses

Our observation is that the response at a position of a convolutional feature map approximately lies on a low-rank subspace. The low-rank decomposition can reduce the complexity. To find the approximate low-rank subspace, we minimize the reconstruction error of the responses.

More formally, we consider a convolutional layer with a filter size of $k \times k \times c$, where $k$ is the spatial size of the filter and $c$ is the number of input channels of this layer. To compute a response, this filter is applied on a $k \times k \times c$ volume of the layer input. We use $x \in \mathbb{R}^{k^2c+1}$ to denote a vector that reshapes this volume (appending one as the last entry for the bias). A response $y \in \mathbb{R}^{d}$ at a position of a feature map is computed as:

$$y = Wx, \tag{1}$$

where $W$ is a $d$-by-$(k^2c+1)$ matrix, and $d$ is the number of filters. Each row of $W$ denotes the reshaped form of a $k \times k \times c$ filter (appending the bias as the last entry). We will address the nonlinear case later.

If the vector $y$ is on a low-rank subspace, we can write $y = M(y - \bar{y}) + \bar{y}$, where $M$ is a $d$-by-$d$ matrix of rank $d' < d$ and $\bar{y}$ is the mean vector of responses. Expanding this equation, we can compute a response by:

$$y = MWx + b, \tag{2}$$

where $b = \bar{y} - M\bar{y}$ is a new bias. The rank-$d'$ matrix $M$ can be decomposed into two $d$-by-$d'$ matrices $P$ and $Q$ such that $M = PQ^\top$. We denote $W' = Q^\top W$ as a $d'$-by-$(k^2c+1)$ matrix, which is essentially a new set of $d'$ filters. Then we can compute (2) by:

$$y = PW'x + b. \tag{3}$$

The complexity of using Eqn.(3) is $O(d'k^2c) + O(dd')$, while the complexity of using Eqn.(1) is $O(dk^2c)$. For many typical models/layers, we usually have $O(dd') \ll O(d'k^2c)$, so the computation in Eqn.(3) will reduce the complexity to about $d'/d$ of the original.
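As a quick check of these counts, the sketch below tallies multiply-accumulates per output position for one hypothetical layer. The sizes (512 filters reduced to 256, a 3×3 support, 512 input channels) and the function names are illustrative assumptions, not values taken from the paper's tables, and the bias terms are ignored.

```python
def original_cost(d, k, c):
    """Multiply-accumulates per output position for Eqn.(1): d filters of size k*k*c."""
    return d * k * k * c


def low_rank_cost(d, d_prime, k, c):
    """Cost of Eqn.(3): d' filters of size k*k*c, then a d-by-d' (1x1) projection."""
    return d_prime * k * k * c + d * d_prime


# Hypothetical layer sizes, chosen only for illustration.
d, d_prime, k, c = 512, 256, 3, 512
ratio = low_rank_cost(d, d_prime, k, c) / original_cost(d, k, c)
print(f"exact ratio: {ratio:.3f}, approximation d'/d: {d_prime / d:.3f}")
```

The exact ratio works out to $d'/d + d'/(k^2c)$, so the $d'/d$ estimate is tight whenever $k^2c$ is much larger than $d$.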

Fig. 1 illustrates how to use Eqn.(3) in a network. We replace the original layer (given by $W$) by two layers (given by $W'$ and $P$). The matrix $W'$ is actually $d'$ filters whose sizes are $k \times k \times c$. These filters produce a $d'$-dimensional feature map. On this feature map, the $d$-by-$d'$ matrix $P$ can be implemented as $d$ filters whose sizes are $1 \times 1 \times d'$. So $P$ corresponds to a convolutional layer with a $1 \times 1$ spatial support, which maps the $d'$-dimensional feature map to a $d$-dimensional one. The usage of $1 \times 1$ spatial filters to adjust dimensions has been adopted for designing network architectures [13, 19]. But in those papers, the $1 \times 1$ filters are used to reduce dimensions, while in our case they restore dimensions.

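Note that the decomposition $M = PQ^\top$ can be arbitrary. It does not impact the value of $y$ computed in Eqn.(3). A simple decomposition is the Singular Value Decomposition (SVD) [5]: $M = U_{d'} S_{d'} V_{d'}^\top$, where $U_{d'}$ and $V_{d'}$ are $d$-by-$d'$ column-orthogonal matrices and $S_{d'}$ is a $d'$-by-$d'$ diagonal matrix. Then we can obtain $P = U_{d'} S_{d'}^{1/2}$ and $Q = V_{d'} S_{d'}^{1/2}$.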
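The two-layer replacement can be checked numerically. The following is a minimal NumPy sketch, assuming random data in place of real filters; `decompose_projection` and all sizes are hypothetical names and values introduced for illustration. It splits a rank-$d'$ matrix $M$ by SVD and confirms that applying $W'$ followed by $P$ reproduces $MWx + b$.

```python
import numpy as np


def decompose_projection(M, d_prime):
    """Split a rank-d' matrix M (d x d) into P, Q (both d x d') with M = P @ Q.T."""
    U, s, Vt = np.linalg.svd(M)
    sqrt_s = np.sqrt(s[:d_prime])
    P = U[:, :d_prime] * sqrt_s      # scale the first d' left singular vectors
    Q = Vt[:d_prime, :].T * sqrt_s   # scale the first d' right singular vectors
    return P, Q


rng = np.random.default_rng(0)
d, d_prime, k, c = 8, 3, 3, 4                    # illustrative sizes only
W = rng.standard_normal((d, k * k * c + 1))      # original filters (bias appended)
M = rng.standard_normal((d, d_prime)) @ rng.standard_normal((d_prime, d))  # rank d'
b = rng.standard_normal(d)
x = rng.standard_normal(k * k * c + 1)

P, Q = decompose_projection(M, d_prime)
W_new = Q.T @ W                                  # d' new filters of size k*k*c (+ bias)
y_two_layer = P @ (W_new @ x) + b                # Eqn.(3): d' filters, then 1x1 projection
y_direct = M @ (W @ x) + b                       # Eqn.(2)
print(np.allclose(y_two_layer, y_direct))
```

Any other factorization of $M$ into two $d$-by-$d'$ factors would give the same output; the SVD split is simply a convenient choice.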

In practice the low-rank assumption is an approximation, and the computation in Eqn.(3) is approximate. To find an approximate low-rank subspace, we optimize the following problem:

$$\min_{M} \sum_i \left\| (y_i - \bar{y}) - M(y_i - \bar{y}) \right\|_2^2, \quad \text{s.t. } \mathrm{rank}(M) \le d'. \tag{4}$$

Here $y_i$ is a response sampled from the feature maps in the training set. This problem can be solved by SVD [5] or actually Principal Component Analysis (PCA): let $Y$ be the $d$-by-$n$ matrix concatenating $n$ responses with the mean subtracted, compute the eigen-decomposition of the covariance matrix $YY^\top = USU^\top$ where $U$ is an orthogonal matrix and $S$ is diagonal, and set $M = U_{d'}U_{d'}^\top$ where $U_{d'}$ are the first $d'$ eigenvectors. With the matrix $M$ computed, we can find $P = Q = U_{d'}$.
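The PCA solution to (4) can be sketched in a few lines of NumPy. This is an illustration under assumptions: the responses are synthetic (they lie exactly on a 3-dimensional affine subspace), and `pca_projection` is a hypothetical helper name. It also returns the fraction of eigenvalue mass kept, which is the "PCA energy" discussed next.

```python
import numpy as np


def pca_projection(responses, d_prime):
    """Solve (4) by PCA. `responses` is d x n, one sampled response per column."""
    y_bar = responses.mean(axis=1, keepdims=True)
    Y = responses - y_bar                          # mean-subtracted responses
    eigvals, eigvecs = np.linalg.eigh(Y @ Y.T)     # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]              # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    U_d = eigvecs[:, :d_prime]                     # first d' eigenvectors
    M = U_d @ U_d.T                                # rank-d' projection
    energy = eigvals[:d_prime].sum() / eigvals.sum()
    return M, U_d, y_bar, energy


rng = np.random.default_rng(0)
d, d_prime, n = 8, 3, 200                          # illustrative sizes only
basis = rng.standard_normal((d, d_prime))
responses = basis @ rng.standard_normal((d_prime, n)) + rng.standard_normal((d, 1))

M, U_d, y_bar, energy = pca_projection(responses, d_prime)
Y = responses - y_bar
residual = np.linalg.norm(Y - M @ Y)               # reconstruction error of (4)
print(f"energy kept: {energy:.6f}, residual: {residual:.2e}")
```

Because the synthetic data are exactly rank 3, the residual is at the level of floating-point error; on real responses it would be small but nonzero.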

How good is the low-rank assumption of the responses? We sample the responses from a CNN model (with 7 convolutional layers, detailed in Sec. 3) trained on ImageNet [2]. For the responses of a convolutional layer (from 3,000 randomly sampled training images), we compute the eigenvalues of their covariance matrix and then plot the sum of the largest eigenvalues (Fig. 2). We see that substantial energy is in a small portion of the largest eigenvectors. For example, in the Conv2 layer the first 128 eigenvectors contribute over 99.9% energy; in the Conv7 layer, the first 256 eigenvectors contribute over 95% energy. This indicates that we can use a fraction of the filters to precisely approximate the original filters.

The low-rank behavior of the responses $y$ is due to the low-rank behaviors of the filters $W$ and the inputs $x$. While the low-rank assumptions of filters have been adopted in recent work [3, 10], we further adopt the low-rank assumptions of the filter input $x$, which is a local volume and should have correlations. The responses $y$ will have lower rank than $W$ and $x$, so the approximation can be more precise. In our optimization (4), we directly address the low-rank subspace of $y$.

### 2.2 The Nonlinear Case

Next we investigate the case of using nonlinear units. We use $r(\cdot)$ to denote the nonlinear operator. In this paper we focus on the Rectified Linear Unit (ReLU) [14]: $r(\cdot) = \max(\cdot, 0)$. A nonlinear response is given by $r(Wx)$ or simply $r(y)$. We minimize the reconstruction error of the nonlinear responses:

$$\min_{M, b} \sum_i \left\| r(Wx_i) - r(MWx_i + b) \right\|_2^2, \quad \text{s.t. } \mathrm{rank}(M) \le d'. \tag{5}$$

Here $b$ is a new bias to be optimized, and $r(My + b) = r(MWx + b)$ is the nonlinear response computed by the approximated filters.

The above problem is challenging due to the nonlinearity and the low-rank constraint. To find a feasible solution, we relax it as:

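$$\min_{M, b, \{z_i\}} \sum_i \left\| r(y_i) - r(z_i) \right\|_2^2 + \lambda \left\| z_i - (My_i + b) \right\|_2^2, \quad \text{s.t. } \mathrm{rank}(M) \le d'. \tag{6}$$

Here $\{z_i\}$ is a set of auxiliary variables of the same size as $\{y_i\}$, and $\lambda$ is a penalty parameter. If $\lambda \to \infty$, the solution to (6) will converge to the solution to (5) [23]. We adopt an alternating solver, fixing $\{z_i\}$ and solving for $M$, $b$, and vice versa.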

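(i) The subproblem of $M$, $b$. In this case, $\{z_i\}$ are fixed. It is easy to show $b = \bar{z} - M\bar{y}$ where $\bar{z}$ is the sample mean of $\{z_i\}$. Substituting $b$ into the objective function, we obtain the problem involving $M$:

$$\min_{M} \sum_i \left\| (z_i - \bar{z}) - M(y_i - \bar{y}) \right\|_2^2, \quad \text{s.t. } \mathrm{rank}(M) \le d'. \tag{7}$$

Let $Z$ be the $d$-by-$n$ matrix concatenating the vectors of $\{z_i - \bar{z}\}$. We rewrite the above problem as:

$$\min_{M} \left\| Z - MY \right\|_F^2, \quad \text{s.t. } \mathrm{rank}(M) \le d'. \tag{8}$$

Here $\|\cdot\|_F$ is the Frobenius norm. This optimization problem is a Reduced Rank Regression problem [6, 21, 20], and it can be solved by a kind of Generalized Singular Value Decomposition (GSVD) [6, 21, 20]. The solution is as follows. Let $\hat{M} = ZY^\top(YY^\top)^{-1}$. The GSVD is applied on $\hat{M}$ as $\hat{M} = USV^\top$, such that $U$ is a $d$-by-$d$ orthogonal matrix satisfying $U^\top U = I_d$ where $I_d$ is a $d$-by-$d$ identity matrix, and $V$ is a $d$-by-$d$ matrix satisfying $V^\top YY^\top V = I_d$ (called generalized orthogonality). Then the solution $M$ to (8) is given by $M = U_{d'} S_{d'} V_{d'}^\top$, where $U_{d'}$ and $V_{d'}$ are the first $d'$ columns of $U$ and $V$ and $S_{d'}$ are the largest $d'$ singular values. We can further show that if $Z = Y$ (so the problem in (7) becomes (4)), this solution reduces to computing the eigen-decomposition of $YY^\top$.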
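One way to realize this GSVD numerically is to take an ordinary SVD of $\hat{M}(YY^\top)^{1/2}$ and map the right factor back through $(YY^\top)^{-1/2}$, which yields a $V$ with the generalized orthogonality above. The NumPy sketch below follows that route; it is an assumption-laden illustration rather than the authors' implementation. The helper name `reduced_rank_regression`, the small ridge term `eps` (added only for numerical stability), and the synthetic data are all hypothetical.

```python
import numpy as np


def reduced_rank_regression(Z, Y, d_prime, eps=1e-8):
    """Minimize ||Z - M Y||_F subject to rank(M) <= d'. Z, Y: d x n, mean-subtracted."""
    C = Y @ Y.T + eps * np.eye(Y.shape[0])        # Y Y^T plus a tiny ridge (assumption)
    w, E = np.linalg.eigh(C)
    C_half = (E * np.sqrt(w)) @ E.T               # symmetric square root of C
    C_inv_half = (E / np.sqrt(w)) @ E.T           # and its inverse
    M_hat = np.linalg.solve(C, Y @ Z.T).T         # unconstrained solution Z Y^T C^{-1}
    U, s, Vt = np.linalg.svd(M_hat @ C_half)      # ordinary SVD in the whitened space
    # Keep the top d' components; Vt @ C_inv_half plays the role of V^T in the GSVD.
    return (U[:, :d_prime] * s[:d_prime]) @ Vt[:d_prime] @ C_inv_half


rng = np.random.default_rng(0)
d, d_prime, n = 6, 2, 300                         # illustrative sizes only
Y = rng.standard_normal((d, n))
Y -= Y.mean(axis=1, keepdims=True)
Z = rng.standard_normal((d, d)) @ Y + 0.1 * rng.standard_normal((d, n))
Z -= Z.mean(axis=1, keepdims=True)

M = reduced_rank_regression(Z, Y, d_prime)
print(np.linalg.matrix_rank(M), np.linalg.norm(Z - M @ Y))
```

Truncating the plain SVD of $\hat{M}$ would also give a rank-$d'$ matrix, but it is generally not the minimizer of (8), because the error is measured after multiplying by $Y$.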

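(ii) The subproblem of $\{z_i\}$. In this case, $M$ and $b$ are fixed. Then in this subproblem each element $z_{ij}$ of each vector $z_i$ is independent of any other. So we solve a 1-dimensional optimization problem as follows:

$$\min_{z_{ij}} \left( r(y_{ij}) - r(z_{ij}) \right)^2 + \lambda \left( z_{ij} - y'_{ij} \right)^2, \tag{9}$$

where $y'_{ij}$ is the $j$-th entry of $My_i + b$. We can separately consider $z_{ij} \ge 0$ and $z_{ij} < 0$ and remove the ReLU operator. Then we can derive the solution as follows: let

$$z'_{ij} = \min(0,\, y'_{ij}), \tag{10}$$

$$z''_{ij} = \max\!\left(0,\, \frac{\lambda \cdot y'_{ij} + r(y_{ij})}{\lambda + 1}\right), \tag{11}$$

then $z_{ij} = z'_{ij}$ if $z'_{ij}$ gives a smaller value in (9) than $z''_{ij}$, and otherwise $z_{ij} = z''_{ij}$.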
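The element-wise update in (10) and (11) is easy to vectorize. The sketch below is a minimal NumPy illustration; `objective`, `update_z`, the random inputs, and the value of the penalty parameter are assumptions made for the example. It evaluates both candidates and keeps whichever gives the lower value of (9).

```python
import numpy as np


def relu(v):
    return np.maximum(v, 0.0)


def objective(z, y, y_prime, lam):
    """Value of (9) for each element."""
    return (relu(y) - relu(z)) ** 2 + lam * (z - y_prime) ** 2


def update_z(y, y_prime, lam):
    """Element-wise minimizer of (9): pick the better of the candidates (10) and (11)."""
    z_neg = np.minimum(0.0, y_prime)                                  # Eqn.(10)
    z_pos = np.maximum(0.0, (lam * y_prime + relu(y)) / (lam + 1.0))  # Eqn.(11)
    take_neg = objective(z_neg, y, y_prime, lam) < objective(z_pos, y, y_prime, lam)
    return np.where(take_neg, z_neg, z_pos)


rng = np.random.default_rng(0)
y = rng.standard_normal(50)          # original responses (illustrative)
y_prime = rng.standard_normal(50)    # entries of M y + b (illustrative)
lam = 0.5                            # arbitrary penalty value for the demo
z = update_z(y, y_prime, lam)
print(objective(z, y, y_prime, lam).sum())
```

A brute-force grid search over each $z_{ij}$ is a simple way to sanity-check the closed form: no grid point should achieve a lower value of (9).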

Although we focus on the ReLU, our method is applicable to other types of nonlinearities. The subproblem in (9) is a 1-dimensional nonlinear least squares problem, so it can be solved by gradient descent or simply line search. We plan to study this issue in the future.

We solve (i) and (ii) alternately. The initialization is given by the solution to the linear case (4). We warm up the solver with an initial value of the penalty parameter $\lambda$ and run 25 iterations. Then we increase the value of $\lambda$. In theory, $\lambda$ should be gradually increased to infinity [23]. But we find that it is difficult for the iterative solver to make progress if $\lambda$ is too large. So we increase $\lambda$ to 1, run 25 more iterations, and use the resulting $M$ as our solution. Then we compute $P$ and $Q$ by SVD on $M$.

### 2.3 Asymmetric Reconstruction for Multi-Layer

To accelerate a whole network, we apply the above method sequentially on each layer, from the shallow layers to the deeper ones. If a previous layer is approximated, its error can be accumulated when the next layer is approximated. We propose an asymmetric reconstruction method to address this issue.

Let us consider a layer whose input feature map is not precise due to the approximation of the previous layer/layers. We denote the approximate input to the current layer as $\hat{x}$. For the training samples, we can still compute its non-approximate responses as $y = Wx$. So we can optimize an “asymmetric” version of (5):

$$\min_{M, b} \sum_i \left\| r(Wx_i) - r(MW\hat{x}_i + b) \right\|_2^2, \quad \text{s.t. } \mathrm{rank}(M) \le d'. \tag{12}$$

Here in the first term $x_i$ is the non-approximate input, while in the second term $\hat{x}_i$ is the approximate input due to the previous layer. We need not use $\hat{x}_i$ in the first term, because $r(Wx_i)$ is the real outcome of the original network and thus is more precise. On the other hand, we do not use $x_i$ in the second term, because $r(MW\hat{x}_i + b)$ is the actual operation of the approximated layer. This asymmetric version can reduce the accumulative errors when multiple layers are approximated. The optimization problem in (12) can be solved using the same algorithm as for (5).
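To make the last statement concrete, here is one way to write the relaxation for the asymmetric case. It is a sketch that mirrors (6), not a derivation quoted from the paper, and the symbol $\hat{y}_i$ is introduced here only for illustration. Write $y_i = Wx_i$ for the exact response and $\hat{y}_i = W\hat{x}_i$ for the response of the original filters on the approximate input. Then:

$$\min_{M, b, \{z_i\}} \sum_i \left\| r(y_i) - r(z_i) \right\|_2^2 + \lambda \left\| z_i - (M\hat{y}_i + b) \right\|_2^2, \quad \text{s.t. } \mathrm{rank}(M) \le d'.$$

Under this reading, only the penalty term changes. Subproblem (i) becomes a reduced rank regression of the centered $\{z_i\}$ onto the centered $\{\hat{y}_i\}$, with $b$ equal to the mean of $\{z_i\}$ minus $M$ times the mean of $\{\hat{y}_i\}$. Subproblem (ii) keeps the closed form of (10) and (11), with $y'_{ij}$ taken as the $j$-th entry of $M\hat{y}_i + b$ and $r(y_{ij})$ still computed from the exact response.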

### 2.4 Rank Selection for Whole-Model Acceleration

In the above, the optimization is based on a target $d'$ for each layer. $d'$ is the only parameter that determines the complexity of an accelerated layer. But given a desired speedup ratio of the whole model, we need to determine the proper rank $d'$ used for each layer.

Our strategy is based on an empirical observation that the PCA energy is related to the classification accuracy after approximations. To verify this observation, in Fig. 3 we show the classification accuracy (represented as the difference to no approximation) vs. the PCA energy. Each point in this figure is empirically evaluated using a value of $d'$. 100% energy means no approximation and thus no degradation of classification accuracy. Fig. 3 shows that the classification accuracy is roughly linear in the PCA energy.

To simultaneously determine the rank for each layer, we further assume that the whole-model classification accuracy is roughly related to the product of the PCA energy of all layers. More formally, we consider this objective function:

$$E = \prod_{l} \sum_{a=1}^{d'_l} \sigma_{l,a} \tag{13}$$

Here $\sigma_{l,a}$ is the $a$-th largest eigenvalue of the layer $l$, and $\sum_{a=1}^{d'_l} \sigma_{l,a}$ is the PCA energy of the largest $d'_l$ eigenvalues in the layer $l$. The product $\prod_l$ is over all layers to be approximated. The objective $E$ is assumed to be related to the accuracy of the approximated whole network. Then we optimize this problem:

$$\max_{\{d'_l\}} E, \quad \text{s.t. } \sum_{l} \frac{d'_l}{d_l} C_l \le C. \tag{14}$$

Here $d_l$ is the original number of filters in the layer $l$, and $C_l$ is the original time complexity of the layer $l$. So $\frac{d'_l}{d_l} C_l$ is the complexity after the approximation. $C$ is the total complexity after the approximation, which is given by the desired speedup ratio. This problem means that we want to maximize the accumulated accuracy subject to the time complexity constraint.

The problem in (14) is a combinatorial problem [15]. So we adopt a greedy strategy to solve it. We initialize $d'_l$ as $d_l$, and consider the set $\{\sigma_{l,a}\}$. In each step we remove an eigenvalue $\sigma_{l,d'_l}$ from this set, chosen from a certain layer $l$. The relative reduction of the objective is $\Delta E / E = \sigma_{l,d'_l} / \sum_{a=1}^{d'_l} \sigma_{l,a}$, and the reduction of complexity is $\Delta C = \frac{1}{d_l} C_l$. Then we define a measure as $\frac{\Delta E / E}{\Delta C}$. The eigenvalue $\sigma_{l,d'_l}$ that has the smallest value of this measure is removed. Intuitively, this measure favors a small reduction of $\Delta E / E$ and a large reduction of complexity $\Delta C$. This step is greedily iterated until the constraint on the total complexity is satisfied.
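The greedy procedure can be sketched as follows. This is an illustrative implementation under assumptions: the function name `greedy_rank_selection`, the toy eigenvalues and complexities, and the guard that keeps every layer at rank 1 or more are not taken from the paper.

```python
import numpy as np


def greedy_rank_selection(eigvals, complexities, budget):
    """Greedy solver for (14).

    eigvals: list of 1-D arrays, each sorted in descending order (one per layer).
    complexities: original complexity C_l of each layer.
    budget: target total complexity C after approximation.
    Returns the selected rank d'_l for each layer.
    """
    ranks = [len(ev) for ev in eigvals]                 # start from d'_l = d_l
    energy = [float(np.sum(ev)) for ev in eigvals]      # current PCA energy per layer
    total = float(sum(complexities))
    while total > budget:
        best_layer, best_score = None, np.inf
        for l, ev in enumerate(eigvals):
            if ranks[l] <= 1:                           # assumption: never drop below rank 1
                continue
            rel_drop = ev[ranks[l] - 1] / energy[l]     # relative reduction of E
            delta_c = complexities[l] / len(ev)         # reduction of complexity C_l / d_l
            score = rel_drop / delta_c
            if score < best_score:
                best_layer, best_score = l, score
        if best_layer is None:                          # nothing left to remove
            break
        energy[best_layer] -= eigvals[best_layer][ranks[best_layer] - 1]
        ranks[best_layer] -= 1
        total -= complexities[best_layer] / len(eigvals[best_layer])
    return ranks


# Toy example: layer 0 has concentrated energy, layer 1 has flat energy.
toy_eigvals = [np.array([10.0, 1.0, 0.01, 0.01]), np.array([3.0, 3.0, 3.0, 3.0])]
toy_ranks = greedy_rank_selection(toy_eigvals, complexities=[4.0, 4.0], budget=6.0)
print(toy_ranks)
```

In the toy example both removals come from the layer with concentrated energy, which matches the intuition that layers with flatter spectra should keep higher ranks.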

### 2.5 Discussion

In our formulation, we focus on reducing the number of filters (from $d$ to $d'$). There are algorithmic advantages of operating on the “$d$” dimension. Firstly, this dimension can be easily controlled by the rank constraint $\mathrm{rank}(M) \le d'$. This constraint enables closed-form solutions, e.g., PCA to the problem (4) or GSVD to the subproblem (7). Secondly, the optimized low-rank projection $M$ can be exactly decomposed into low-dimensional filters ($P$ and $Q$) by SVD. These simple and closed-form solutions can produce good results using a very small subset of training images (3,000 out of one million).

## 3 Experiments

We evaluate on the “SPPnet (Overfeat-7)” model [8], which is one of the state-of-the-art models for ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2014 [16]. This model (detailed in Table 1) has a similar architecture to the Overfeat model [17], but has 7 convolutional layers. A spatial pyramid pooling layer [8] is used after the last convolutional layer, which improves the classification accuracy. We train the model on the 1000-class dataset of ImageNet 2012 [2, 16], following the details in [8].

We evaluate the “top-5 error” (or simply termed as “error”) using single-view testing. The view is the center region cropped from the resized image whose shorter side is 256. The single-view error rate of the model is 12.51% on the ImageNet validation set, and the increased error rates of the approximated models are all based on this number. For completeness, we report that this model has 11.1% error using 10-view test and 9.3% using 98-view test.

We use this model for the following reasons. First, its architecture is similar to many existing models [11, 24, 17, 1] (such as the first/second layers and the cascade usage of filters), so we believe most observations should be valid on other models. Second, this model is deep (7-conv.) and the computation is more uniformly distributed among the layers (see “complexity” in Table 1). Similar behavior is exhibited by the compelling VGG-16/19 models [18]. The uniformly distributed computation indicates that most layers should be accelerated for an overall speedup.

For the training of the approximations as in (4), (6), and (12), we randomly sample 3,000 images from the ImageNet training set and use their responses as the training samples.

### 3.1 Single-Layer: Linear vs. Nonlinear

In this subsection we evaluate the single-layer performance. When evaluating a single approximated layer, the remaining layers are unchanged and not approximated. The speedup ratio (involving that single layer only) is shown as the theoretical ratio computed by the complexity.

In Fig. 4 we compare the performance of our linear solution (4) and nonlinear solution (6). The performance is displayed as increase of error rates (decrease of accuracy) vs. the speedup ratio of that layer. Fig. 4 shows that the nonlinear solution consistently performs better than the linear solution. In Table 1, we show the sparsity (the portion of zero activations after ReLU) of each layer. A zero activation is due to the truncation of ReLU. The sparsity is over 60% for Conv2-7, indicating that the ReLU takes effect on a substantial portion of activations. This explains the discrepancy between the linear and nonlinear solutions. Especially, the Conv7 layer has a sparsity of 95%, so the advantage of the nonlinear solution is more obvious.

Fig. 4 also shows that when accelerating only a single layer by 2, the increased error rates of our solutions are rather marginal or ignorable. For the Conv2 layer, the error rate is increased by ; for the Conv3-7 layers, the error rate is increased by .

We also notice that for Conv1, the degradation is ignorable on or below speedup ( corresponds to ). This can be explained by Fig. 2(a): the PCA energy has almost no loss when . But the degradation can grow quickly for larger speedup ratios, because in this layer the channel number is small and needs to be reduced drastically to achieve the speedup ratio. So in the following, we will use for Conv1.

### 3.2 Multi-Layer: Symmetric vs. Asymmetric

Next we evaluate the performance of asymmetric reconstruction as in the problem (12). We demonstrate approximating 2 layers or 3 layers. In the case of 2 layers, we show the results of approximating Conv6 and 7; and in the case of 3 layers, we show the results of approximating Conv5-7 or Conv2-4. The comparisons are consistently observed for other cases of multi-layer.

We sequentially approximate the layers involved, from a shallower one to a deeper one. In the asymmetric version (12), $\hat{x}$ is from the output of the previous approximated layer (if any), and $x$ is from the output of the previous non-approximate layer. In the symmetric version (5), the response is $y = Wx$, where $x$ is from the output of the previous non-approximate layer. We have also tried another symmetric version with $y = W\hat{x}$, where $\hat{x}$ is from the output of the previous approximated layer (if any), and found that this symmetric version is even worse.

Fig. 5 shows the comparisons between the symmetric and asymmetric versions. The asymmetric solution has significant improvement over the symmetric solution. For example, when only 3 layers are approximated simultaneously (like Fig. 5 (c)), the improvement is over 1.0% when the speedup is 4×. This indicates that the accumulative error rate due to multi-layer approximation can be effectively reduced by the asymmetric version.

When more or all layers are approximated simultaneously (as below), the error rates increase more drastically without the asymmetric solution.

### 3.3 Whole-Model: with/without Rank Selection

In Table 2 we show the results of whole-model acceleration. The solver is the asymmetric version. For Conv1, we fix $d'$ to the value chosen in Sec. 3.1. For other layers, when the rank selection is not used, we adopt the same speedup ratio on each layer and determine its desired rank $d'$ accordingly. When the rank selection is used, we apply it to select $d'$ for Conv2-7. Table 2 shows that the rank selection consistently outperforms the counterpart without rank selection. The advantage of rank selection is observed in both linear and nonlinear solutions.

In Table 2 we notice that the rank selection often chooses a higher rank (than the no rank selection) in Conv5-7. For example, when the speedup is 3, the rank selection assigns to Conv7, while this layer only requires to achieve 3 single-layer speedup of itself. This can be explained by Fig. 2(c). The energy of Conv5-7 is less concentrated, so these layers require higher ranks to achieve good approximations.

### 3.4 Comparisons with Previous Work

We compare with Jaderberg et al.’s method [10], which is a recent state-of-the-art solution to efficient evaluation. This method mainly operates on the spatial domain. It decomposes a $k \times k$ spatial support into a cascade of $k \times 1$ and $1 \times k$ spatial supports. This method focuses on the linear reconstruction error. The SGD solver is adopted for optimization. In [10], the method is only evaluated on a single layer of a model trained for ImageNet.

Our comparisons are based on our re-implementation of [10]. We use the Scheme 2 decomposition in [10] and its filter reconstruction version, which is the one used for ImageNet as in [10]. Our re-implementation of [10] gives a 2 single-layer speedup on Conv2 and increase of error. As a comparison, in [10] it reports increase of error on Conv2 under a 2 single-layer speedup, evaluated on another Overfeat model [17]. For whole-model speedup, we adopt this method sequentially on Conv2-7 using the same speedup ratio. We do not apply this method on Conv1, because this layer has a small fraction of complexity while the spatial decomposition leads to considerable error on this layer if using a speedup ratio similar to other layers.

In Fig. 6 we compare our method with Jaderberg et al.’s [10] for whole-model speedup. The speedup ratios are the theoretical complexity ratios involving all convolutional layers. Our method is the asymmetric version with rank selection (denoted as “our asymmetric”). Fig. 6 shows that when the speedup ratios are large (4× and 5×), our method outperforms Jaderberg et al.’s method significantly. For example, when the speedup ratio is 4×, the increased error rate of our method is 4.2%, while Jaderberg et al.’s is 6.0%. Jaderberg et al.’s result degrades quickly as the speedup ratio grows, while ours degrades more slowly. This indicates the effect of our method in reducing accumulative error. In our CPU implementation, both methods have similar actual speedup ratios for a given theoretical speedup, for example, 3.55× actual for 4× theoretical speedup. This is because the overhead for both methods mainly comes from the fully-connected and other layers.

Because our asymmetric solution can effectively reduce the accumulated error, we can approximate a layer by the two methods simultaneously, and the asymmetric reconstruction of the next layer can reduce the error accumulated by the two methods. As discussed in Sec. 2.5, our method is based on the channel dimension ($d$), while Jaderberg et al.’s method mainly exploits the decomposition of the two spatial dimensions. These two mechanisms are complementary, so we adopt the following sequential strategy. The Conv1 layer is approximated using our method only. Then for the Conv2 layer, we first apply our method. The approximated layer has $d'$ filters whose sizes are $k \times k \times c$, followed by $1 \times 1$ filters (as in Fig. 1(b)). Next we apply Jaderberg et al.’s method to decompose the spatial support into a cascade of $k \times 1$ and $1 \times k$ filters (Scheme 2 [10]). This gives a 3-dimensional approximation of Conv2. Then we apply our method on Conv3. Now the asymmetric solver will take the responses approximated by the two mechanisms as the input, while the reconstruction target is still the responses of the original network. So while Conv2 has been approximated twice, the asymmetric solver of Conv3 can partially reduce the accumulated error. This process is sequentially adopted in the layers that follow.

In Fig. 6 we show the results of this 3-dimensional decomposition strategy (denoted as “our asymmetric (3d)”). We set the speedup ratios of both mechanisms to be equal: e.g., if the speedup ratio of the whole model is $r\times$, then we use $\sqrt{r}\times$ for both. Fig. 6 shows that this strategy leads to a significantly smaller increase of error. For example, when the speedup is 5×, the error is increased by only 2.5%. This is because the speedup ratio is shared among all three dimensions, so the reduction of each dimension is lower. Our asymmetric solver effectively controls the accumulative error even if multiple layers are decomposed extensively.

Finally, we compare the accelerated whole model with the well-known “AlexNet” [11]. The comparison is based on our re-implementation of AlexNet. The architecture is the same as in [11] except that the GPU splitting is ignored. Besides the standard strategies used in [11], we train this model using the 224×224 views cropped from resized images whose shorter edge is 256 [9]. Our re-implementation of this model has a top-5 single-view error rate of 18.8% (10-view top-5 16.0% and top-1 37.6%). This is better than the one reported in [11] (in [11] the 10-view error is top-5 18.2% and top-1 40.7%).

Table 3 shows the comparisons on the accelerated models and AlexNet. The error rates in this table are absolute values (not the increased numbers). The time is the actual running time per view, on a C++ implementation and an Intel i7 CPU (2.9GHz). The model accelerated by our asymmetric solver (channel-only) has 16.7% error, and the one accelerated by our asymmetric solver (3d) has 14.1% error. This means that the accelerated model is 4.7% more accurate than AlexNet, while its speed is nearly the same as AlexNet.

As a common practice [11], we also evaluate the 10-view score of the models. Our accelerated model achieves 12.0% error, which means only a 0.9% increase of error with 4× speedup (the original one has 11.1% 10-view error).

## 4 Conclusion and Future Work

At the core of our algorithm is the low-rank constraint. While this constraint is designed for speedup in this work, it can be considered as a regularizer on the convolutional filters. We plan to investigate this topic in the future.