# Spectral-Pruning: Compressing deep neural network via spectral analysis

The sizes of deep neural networks are growing rapidly to achieve superior performance on complicated tasks, which makes it difficult to deploy them on small edge-computing devices. To overcome this problem, model compression methods have been attracting much attention. However, there has been little theoretical background explaining what kind of quantity determines the compression ability. To resolve this issue, we develop a new theoretical framework for model compression and propose a new method, Spectral-Pruning, based on the theory. Our theoretical analysis is based on the observation that the eigenvalues of the covariance matrix of the outputs from nodes in the internal layers often show rapid decay. We define a "degree of freedom" to quantify the intrinsic dimensionality of a model via the eigenvalue distribution and show that the compression ability is essentially controlled by this quantity. We also give a generalization error bound for the compressed model. Unlike existing methods, our proposed method is applicable to a wide range of models, e.g., ones that possess complicated branches as in SegNet and ResNet. Our method makes use of both the "input" and the "output" of each layer and is easy to implement. We apply our method to several datasets to justify our theoretical analyses and show that the proposed method achieves state-of-the-art performance.


## 1 Introduction

Currently, deep learning is the most promising approach adopted by various machine learning applications, such as computer vision, natural language processing, and audio processing. Along with the rapid development of deep learning techniques, network structures are becoming increasingly complicated. For example, SegNet Badrinarayanan et al. (2017) has skip connections, and ResNet He et al. (2016) and its variants Huang et al. (2017); Chen et al. (2017) also possess several skip connections. In addition to the model structure, the model size is growing, which prevents us from implementing deep neural networks in edge-computing devices for applications such as smartphone services, autonomous vehicle driving, and drone control.

To overcome this difficulty, model compression techniques have been studied extensively in the literature. One approach is pruning by explicit sparsity-inducing regularization during training Lebedev & Lempitsky (2016); Zhou et al. (2016b); Wen et al. (2016); He et al. (2017). A similar effect can be realized by an implicit randomized regularization, such as DropConnect Wan et al. (2013), which randomly removes connections during the training phase. The factorization approach performs a matrix/tensor decomposition of the weight matrices to reduce the number of parameters Denil et al. (2013b); Denton et al. (2014). Information redundancy can be reduced by a quantization technique that expresses the network with a smaller-bit variable type or hash tables Chen et al. (2015); Han et al. (2015a). More closely related to ours are ThiNet Luo et al. (2017) and Net-Trim Aghasi et al. (2017), which prune the network weights so that the behaviors of the internal layers of the pruned network are as close as possible to those of the original network. Denil et al. (2013a) is quite close to ours, but its theoretical support is not satisfactory; in particular, the suggested way of selecting the best subset is just a random choice. Zhang et al. (2018) proposed a parameter-sharing technique to reduce redundant parameters based on similarity between the weights. A big issue in the literature is that only a few of these methods (e.g., Net-Trim Aghasi et al. (2017)) are supported by statistical learning theory. In particular, it has been unclear what kind of quantity controls the compression ability. Another big issue is that the above-mentioned methods cannot be trivially applied to recently developed networks with complicated structures, such as the skip connections of ResNet and SegNet.

In this paper, we develop a new simple network compression method that is applicable to networks with complicated structures, and give theoretical support to explain what quantity controls the compression ability. The theoretical analysis is applicable not only to our method but also to the existing methods. Almost all of the existing methods try to find a smaller network structure that approximates only the “output” from each layer as well as possible. In contrast, our method also deals with the “input” to each layer. The information of the input is exploited as a covariance matrix, and redundant nodes are discarded on the basis of that information. It can be applied even if the “outputs” are split into several branches. Moreover, by combining the information of both input and output, it achieves better accuracy.

We also develop a theoretical analysis that characterizes the compression error by utilizing the notion of degree of freedom. The degree of freedom represents a kind of intrinsic dimensionality of the model. This quantity is determined by the eigenvalues of the covariance matrix calculated in each layer. Usually, we observe that the eigenvalues drop rapidly (Figs. 1 and 1(a)), which means that the amount of important information processed in each layer is not large. In particular, rapid decay of the eigenvalues leads to a low degree of freedom, so we can compress the network effectively even though the explicit number of parameters of the original network is large. Behind the theory, there is essentially a connection to the kernel quadrature rule Bach (2017). In addition to the model compression analysis, we also develop a generalization error analysis. Our analysis yields a so-called fast learning rate, achieving a convergence rate in the sample size faster than the $O(1/\sqrt{n})$ rates shown by theories such as Sun et al. (2015); Neyshabur et al. (2015); Bartlett et al. (2017); Neyshabur et al. (2017). Fast learning rates have also been studied in deep learning, e.g., Koltchinskii & Panchenko (2002); Barron (1993); McCaffrey & Gallant (1994); Barron (1994); Klusowski & Barron (2016a, b); Schmidt-Hieber (2017); Imaizumi & Fukumizu (2018); Suzuki (2018), but these works are not about model compression. According to our generalization error bound, there appears a bias-variance trade-off, where the bias is induced by the network compression and the variance by the training data variation. Finally, we conduct extensive numerical experiments to show the superiority of our method and give experimental verification of our theory. Our contributions are summarized as follows:

• We propose a new simple method for compressing the trained network that can be executed by simply observing the covariance matrix in the internal layers. Unlike existing methods, the proposed method can easily be implemented and applied to any type of network structure.

• We give a theoretical guarantee of the model compression ability by utilizing the notion of degree of freedom which represents an intrinsic dimensionality of the model. We reveal that the covariance between nodes affects the compression ability. We also give a generalization error analysis and derive the bias-variance trade-off induced by the model compression.

## 2 Model compression problem and its algorithm

Let the input domain be $\mathcal{X} \subset \mathbb{R}^{d_x}$ and the output domain be $\mathcal{Y}$, where $\mathcal{Y}$ could be the set of real numbers for regression or a set of binary labels for binary classification. Suppose that there exists a probability measure $P$ defined on a measurable space $\mathcal{X} \times \mathcal{Y}$, and there is a Borel measurable random variable $(X, Y)$. The training data $D_n = (x_i, y_i)_{i=1}^{n}$ are i.i.d. realizations of $(X, Y)$ obeying the distribution $P$. The marginal distribution of $X$ is denoted by $P_X$. To learn the relationship between $X$ and $Y$, we construct a deep neural network model as

$$f(x) = (W^{(L)}\eta(\cdot) + b^{(L)}) \circ \cdots \circ (W^{(1)}x + b^{(1)}),$$

where $W^{(\ell)} \in \mathbb{R}^{m_{\ell+1} \times m_\ell}$, $b^{(\ell)} \in \mathbb{R}^{m_{\ell+1}}$ ($\ell = 1, \dots, L$), and $\eta$ is an activation function applied in an element-wise manner: for a vector $u$, $\eta(u) = (\eta(u_1), \dots, \eta(u_m))^\top$. Furthermore, $m_\ell$ is the width of the $\ell$-th layer, with $m_{L+1} = 1$ (output) and $m_1 = d_x$ (input). Let $\hat{f}$ be a trained network obtained from the training data $D_n$. Accordingly, its parameters are denoted by $(\hat{W}^{(\ell)}, \hat{b}^{(\ell)})_{\ell=1}^{L}$, and the output of its $\ell$-th internal layer (before activation) is denoted by

$$\hat{F}_\ell(x) = (\hat{W}^{(\ell)}\eta(\cdot) + \hat{b}^{(\ell)}) \circ \cdots \circ (\hat{W}^{(1)}x + \hat{b}^{(1)}).$$

Here, we do not specify how the network $\hat{f}$ is trained; any learning method for training $\hat{f}$ is valid for the following argument. It might be the empirical risk minimizer, the Bayes estimator, or another estimator. We want to compress the trained network $\hat{f}$ into a smaller network $f^\sharp$ whose widths $(m^\sharp_\ell)_\ell$ are as small as possible.
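For concreteness, the layered map above can be sketched in NumPy; the function `forward_layers` and its interface are our own illustration, not code from the paper, and ReLU is used as one admissible activation.

```python
import numpy as np

def forward_layers(x, Ws, bs, eta=lambda u: np.maximum(u, 0.0)):
    """Compute the internal-layer outputs F_1(x), ..., F_L(x) of the model
    f(x) = (W^(L) eta(.) + b^(L)) o ... o (W^(1) x + b^(1)).
    The activation eta (ReLU here) is applied element-wise between layers,
    and each returned vector is the pre-activation output of one layer."""
    h = np.asarray(x, dtype=float)
    outputs = []
    for ell, (W, b) in enumerate(zip(Ws, bs)):
        if ell > 0:
            h = eta(h)          # activation of the previous layer's output
        h = W @ h + b           # affine map of layer ell
        outputs.append(h)
    return outputs
```

The last element of the returned list is $f(x)$ itself; the earlier elements are the internal outputs whose covariance structure the pruning method exploits.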

### 2.1 New model compression algorithm

To compress the trained network $\hat{f}$, we propose a simple strategy called Spectral-Pruning. The method works in a layer-wise manner. Its main idea is to find the most informative subset of nodes, where the amount of information is measured by how well the selected nodes can explain the other nodes in the layer. If some nodes are heavily correlated with each other, then only one of them should be selected. The information redundancy can be computed by solving a simple regression problem and requires only a covariance matrix; we do not need to solve a specific nonlinear optimization problem as in Lebedev & Lempitsky (2016); Zhou et al. (2016b); Wen et al. (2016); Aghasi et al. (2017). Our method can be executed using only the input to the layer; we call such an approach an input aware one. On the other hand, it can also make use of the output from the layer, as in most existing methods; we call such approaches output aware ones. Another important characteristic of our method is that it incorporates the distribution of the data, whereas some existing pruning techniques try to approximate the parameters themselves and are independent of the data distribution.

### 2.2 Algorithm description

Let $\phi = \eta(\hat{F}_{\ell-1}(x)) \in \mathbb{R}^{m_\ell}$ be the input to the $\ell$-th layer, and let $\phi_J$ be the subvector of $\phi$ corresponding to an index set $J \subset \{1, \dots, m_\ell\}$, where $|J| = m^\sharp_\ell$. Basically, the strategy is to recover $\phi$ from $\phi_J$ as accurately as possible. To do so, we solve the following optimization problem:

$$\hat{A}_J = \mathop{\mathrm{argmin}}_{A \in \mathbb{R}^{m_\ell \times |J|}} \hat{\mathrm{E}}\big[\|\phi - A\phi_J\|^2\big] + \|A\|_w^2, \tag{1}$$

where $\hat{\mathrm{E}}$ denotes the expectation with respect to the empirical distribution ($\hat{\mathrm{E}}[g] = \frac{1}{n}\sum_{i=1}^{n} g(x_i)$) and $\|A\|_w^2 = \mathrm{Tr}[A \,\mathrm{diag}(w)\, A^\top]$ for a regularization weight vector $w$ with positive entries. The optimal solution can be explicitly expressed by utilizing the (non-centered) covariance matrix in the $\ell$-th layer of the trained network, which is defined as

$$\hat{\Sigma} := \hat{\Sigma}^{(\ell)} = \frac{1}{n}\sum_{i=1}^{n} \eta(\hat{F}_{\ell-1}(x_i))\,\eta(\hat{F}_{\ell-1}(x_i))^\top,$$

defined on the empirical distribution (here, we omit the layer index $\ell$ for notational simplicity). Accordingly, let $\hat{\Sigma}_{I,J}$ be the submatrix of $\hat{\Sigma}$ for index sets $I, J$ such that $\hat{\Sigma}_{I,J} = (\hat{\Sigma}_{i,j})_{i \in I, j \in J}$. Let $F = \{1, \dots, m_\ell\}$ be the full index set. Writing $I_w = \mathrm{diag}(w)$, we can easily check that

$$\hat{A}_J = \hat{\Sigma}_{F,J}\big(\hat{\Sigma}_{J,J} + I_w\big)^{-1}.$$

Hence, we can approximately decode the full vector $\phi$ from $\phi_J$ as $\phi \approx \hat{A}_J \phi_J$. Another approach is to directly approximate a specific "output" $z^\top\phi$ for a given $z \in \mathbb{R}^{m_\ell}$ instead of approximating the "input" as in Eq. (1). This can be realized by solving the following regression problem, which we call an "output aware" approach:

$$\hat{a}_J = \mathop{\mathrm{argmin}}_{a \in \mathbb{R}^{|J|}} \hat{\mathrm{E}}\big[(z^\top\phi - a^\top\phi_J)^2\big] + \|a^\top\|_w^2.$$

It can be easily checked that the optimal solution is given as $\hat{a}_J^\top = z^\top \hat{A}_J$. Therefore, an output aware compression can be recovered from the input aware method (1). In particular, the output $\hat{W}^{(\ell)}\phi$ to the next layer can be approximated by $\hat{W}^{(\ell)}\hat{A}_J\phi_J$.
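The closed-form solution of problem (1) can be computed and sanity-checked with a few lines of NumPy. This is a sketch under our own naming; the uniform weight vector `w` is an illustrative choice (the theory later uses leverage-score weights).

```python
import numpy as np

def reconstruction_matrix(Sigma, J, w):
    """Closed-form solution A_J = Sigma_{F,J} (Sigma_{J,J} + I_w)^{-1} of
    problem (1), where Sigma is the non-centered empirical covariance of
    the layer inputs and I_w = diag(w)."""
    J = np.asarray(J)
    S_FJ = Sigma[:, J]
    S_JJ = Sigma[np.ix_(J, J)]
    # solve (S_JJ + diag(w))^T X = S_FJ^T, i.e. A_J = S_FJ (S_JJ + diag(w))^{-1}
    return np.linalg.solve((S_JJ + np.diag(w)).T, S_FJ.T).T
```

Because the objective in (1) is strictly convex for $w > 0$, any perturbation of this matrix can only increase the regularized reconstruction error, which gives a direct numerical check of the formula.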

#### Selecting optimal subindices

Next, we aim to optimize the index set $J$. Since the output to the next layer is multivariate and we need to bound the approximation errors of multiple outputs uniformly to control the approximation error of the entire network, we minimize the following quantity with respect to $J$:

$$L^{(A)}_w(J) = \max_{z \in \mathbb{R}^{m_\ell}:\,\|z\| \le 1}\ \min_{a \in \mathbb{R}^{|J|}}\ \hat{\mathrm{E}}\big[(z^\top\phi - a^\top\phi_J)^2\big] + \|a^\top\|_w^2.$$

By considering this, our method works no matter what branches exist. The right-hand side is equal to the operator norm (spectral norm, i.e., the maximum singular value) of a residual matrix; by substituting the explicit formula for $\hat{a}_J$, it is further simplified as

$$L^{(A)}_w(J) = \big\|\hat{\Sigma}_{F,F} - \hat{\Sigma}_{F,J}\big(\hat{\Sigma}_{J,J} + I_w\big)^{-1}\hat{\Sigma}_{J,F}\big\|_{\mathrm{op}}.$$

To obtain the optimal $J$ under the cardinality constraint $|J| = m^\sharp_\ell$ for a pre-specified width $m^\sharp_\ell$ of the compressed network, we propose to solve the following sparse subset selection problem:

$$\min_{J}\ L^{(A)}_w(J) \quad \text{s.t.} \quad J \subset \{1, \dots, m_\ell\},\ |J| = m^\sharp_\ell. \tag{2}$$

Let $\hat{J}$ be the optimal $J$ that minimizes the objective. This optimization problem is NP-hard, but an approximate solution can be obtained by a greedy algorithm, since the problem reduces to monotone submodular function maximization Krause & Golovin (2014). That is, we start from $J = \emptyset$, sequentially choose the element that maximally reduces the objective $L^{(A)}_w(J)$, and add this element to $J$ ($J \leftarrow J \cup \{j\}$) until $|J| = m^\sharp_\ell$ is satisfied.

An advantage of this approach is that it requires only the covariance matrix, and it is accomplished by purely linear algebraic procedures. Moreover, our method can be applied to a complicated network structure in which there are recurrent structures, several branches, or outputs from the internal layers that are widely distributed to several other units (e.g., skip connections).
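The greedy loop under the operator-norm criterion $L^{(A)}_w$ can be sketched as follows. This is an illustrative implementation under our own names, with a single scalar ridge `w_scalar` standing in for the weight vector $w$.

```python
import numpy as np

def criterion(Sigma, J, w_scalar):
    """L_w^(A)(J) = || Sigma_FF - Sigma_FJ (Sigma_JJ + I_w)^{-1} Sigma_JF ||_op."""
    if len(J) == 0:
        return np.linalg.norm(Sigma, 2)          # spectral norm of the full covariance
    Jx = np.asarray(J)
    S_FJ = Sigma[:, Jx]
    S_JJ = Sigma[np.ix_(Jx, Jx)] + w_scalar * np.eye(len(Jx))
    return np.linalg.norm(Sigma - S_FJ @ np.linalg.solve(S_JJ, S_FJ.T), 2)

def greedy_select(Sigma, m_target, w_scalar=1e-3):
    """Greedily grow J, at each step adding the node that most reduces the
    criterion, until the pre-specified width m_target is reached."""
    J = []
    while len(J) < m_target:
        rest = [j for j in range(Sigma.shape[0]) if j not in J]
        J.append(min(rest, key=lambda j: criterion(Sigma, J + [j], w_scalar)))
    return J
```

On a covariance matrix of low effective rank, a few greedy picks already drive the residual spectral norm close to zero, which is exactly the behavior the degree-of-freedom analysis in Section 3 predicts.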

#### Output aware method

Suppose that there is an "important" subset of weight (or direction) vectors $z$ whose corresponding outputs $z^\top\phi$ should be well approximated. Then it would be more effective to focus on approximating those outputs instead of all directions $z$. Suppose that this subset is finite, and let the matrix $Z_\ell$ be the one whose rows correspond to its distinct elements. If the subset is not finite, we may set $Z_\ell$ as a projection matrix onto its span. Then, we consider the objective $L^{(B)}_w(J)$ obtained by restricting $z$ in $L^{(A)}_w(J)$ accordingly, which is equivalent to

$$L^{(B)}_w(J) = \big\|Z_\ell\big[\hat{\Sigma}_{F,F} - \hat{\Sigma}_{F,J}\big(\hat{\Sigma}_{J,J} + I_w\big)^{-1}\hat{\Sigma}_{J,F}\big]Z_\ell^\top\big\|_{\mathrm{op}}.$$

A typical situation is to approximate the output $\hat{W}^{(\ell)}\phi$ to the next layer. In that situation, we may set $Z_\ell = \hat{W}^{(\ell)}$.

#### Combination of input aware and output aware methods

In our numerical experiments, we have found that neither the input aware nor the output aware method alone gives the best performance; rather, their combination achieves the best performance (see Fig. 3). Moreover, if the network has several branches, then it is not trivial which branches should be included in $Z_\ell$ for the output aware method. In that situation, it is preferable to combine the input aware and output aware methods instead of using only the output aware method. Therefore, we propose to take a convex combination of both criteria, for a parameter $\theta \in [0, 1]$, as

$$\text{(Spectral-Pruning)}\quad \min_{J}\ L^{(\theta)}_w(J) = \theta L^{(A)}_w(J) + (1-\theta)L^{(B)}_w(J) \quad \text{s.t.} \quad J \subset \{1, \dots, m_\ell\},\ |J| = m^\sharp_\ell. \tag{3}$$

### 2.3 Practical algorithm

Calculating the exact value of $L^{(\theta)}_w(J)$ is computationally demanding for a large network because we need to compute a spectral norm. However, we do not need to obtain the exact solution of problem (3) in practice: if we obtain a reasonable candidate that approximately achieves the optimum, then additional fine-tuning gives a much better network. Hence, instead of solving (3) directly, we upper bound $L^{(A)}_w$ and $L^{(B)}_w$ by replacing the operator norm in their definitions with the trace, and minimize this bound as a practical variant of our method. By setting $w = 0$, the combined trace objective reduces to the quantity appearing in the constraint below, and the proposed optimization problem can be rearranged into the following form:

$$\text{(Spectral-Pruning-2)}\quad \min_{J \subset \{1,\dots,m_\ell\}}\ |J| \quad \text{s.t.} \quad \frac{\mathrm{Tr}\big[(\theta I + (1-\theta)Z_\ell^\top Z_\ell)\,\hat{\Sigma}_{F,J}\hat{\Sigma}_{J,J}^{-1}\hat{\Sigma}_{J,F}\big]}{\mathrm{Tr}\big[(\theta I + (1-\theta)Z_\ell^\top Z_\ell)\,\hat{\Sigma}_{F,F}\big]} \ge \alpha \tag{4}$$

for a pre-specified $\alpha \in (0, 1]$. Here, since the denominator in the constraint is the best achievable value of the numerator without a cardinality constraint, the ratio represents the proportion of retained information, and $1 - \alpha$ bounds the "information loss ratio." The index set $J$ is restricted to a subset of $\{1, \dots, m_\ell\}$ with no duplication. This problem is not only much simpler but also easier to implement than the original one (3). In our numerical experiments, we employed this simpler problem.
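The constraint of (4) is cheap to evaluate, so a greedy loop that grows $J$ until the retained-information ratio reaches $\alpha$ gives a simple approximation of the minimal-$|J|$ problem. The sketch below takes $\theta = 1$ (purely input aware, weighting matrix $I$); the greedy strategy and the tiny ridge term are our own additions for numerical stability, not part of the stated problem.

```python
import numpy as np

def retained_ratio(Sigma, J, ridge=1e-10):
    """Constraint of (4) with theta = 1:
    Tr[Sigma_FJ Sigma_JJ^{-1} Sigma_JF] / Tr[Sigma_FF]."""
    Jx = np.asarray(J)
    S_FJ = Sigma[:, Jx]
    S_JJ = Sigma[np.ix_(Jx, Jx)] + ridge * np.eye(len(Jx))
    return np.trace(S_FJ @ np.linalg.solve(S_JJ, S_FJ.T)) / np.trace(Sigma)

def spectral_pruning_2(Sigma, alpha):
    """Greedily grow J until the retained-information ratio reaches alpha,
    approximating the minimal-|J| problem (4)."""
    J = []
    while len(J) < Sigma.shape[0] and (not J or retained_ratio(Sigma, J) < alpha):
        rest = [j for j in range(Sigma.shape[0]) if j not in J]
        J.append(max(rest, key=lambda j: retained_ratio(Sigma, J + [j])))
    return J
```

When the layer covariance has low effective rank, the loop stops after very few selections, which is the mechanism by which rapidly decaying spectra allow aggressive pruning.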

An extension of our method to convolutional layers is a bit tricky. There are several options, but to perform channel-wise pruning, we used the following "covariance matrix" between channels in the experiments. Suppose that a channel $c$ receives the input $\phi_c(x; s)$, where $s$ indicates the spatial index; then the "covariance" between channels $c$ and $c'$ can be formulated by averaging the products $\phi_c(x;s)\,\phi_{c'}(x;s)$ over the spatial locations and the data. As for the covariance between an output channel and an input channel (which corresponds to an off-diagonal element of $\hat{\Sigma}$ in the fully connected situation), it can be calculated by averaging over the receptive field of each location in the output channel, normalized by the number of locations that contain each input position in their receptive fields.
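One natural instantiation of the channel-wise "covariance" described above is to average products of activations over samples and spatial positions; the exact normalization used in the paper's experiments may differ, so treat this as an assumption-labeled sketch.

```python
import numpy as np

def channel_covariance(acts):
    """Channel-wise non-centered 'covariance' for a convolutional layer.
    acts has shape (n, C, H, W); entry (c, c') averages the product of the
    two channels' activations over all samples and spatial locations.
    (Illustrative normalization; the paper's exact formula may differ.)"""
    n, C, H, W = acts.shape
    X = acts.transpose(1, 0, 2, 3).reshape(C, n * H * W)
    return X @ X.T / (n * H * W)
```

The resulting $C \times C$ matrix is symmetric positive semidefinite and can be fed to the same subset-selection routines as in the fully connected case, with channels playing the role of nodes.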

## 3 Compression accuracy analysis and generalization error bound

In this section, we give a theoretical guarantee for our model compression method. More specifically, we introduce a quantity called the degree of freedom and show that it determines the approximation accuracy. The degree of freedom is defined by the spectrum of the covariance matrix of the nodes in an internal layer. In practice, we observe that a trained network typically shows a rapid decay of this spectrum (see Fig. 1(a)), which results in a small degree of freedom. For the theoretical analysis, we define a neural network model with norm constraints on the parameters $W^{(\ell)}$ and $b^{(\ell)}$ ($\ell = 1, \dots, L$). Let $R$ and $R_b$ be upper bounds of the parameter norms, and define the norm-constrained model as

$$\mathcal{F} := \Big\{(W^{(L)}\eta(\cdot) + b^{(L)}) \circ \cdots \circ (W^{(1)}x + b^{(1)}) \;\Big|\; \max_j \|W^{(\ell)}_{j,:}\| \le R/\sqrt{m_\ell},\ \|b^{(\ell)}\|_\infty \le R_b \Big\},$$

where $W^{(\ell)}_{j,:}$ denotes the $j$-th row of the matrix $W^{(\ell)}$, $\|\cdot\|$ is the Euclidean norm, and $\|\cdot\|_\infty$ is the $\infty$-norm. Here, we bound the approximation error induced by compressing the trained network $\hat{f}$ into a smaller one $f^\sharp$. First, we make the following assumption.

###### Assumption 1.

We assume the following conditions on the activation function $\eta$.

• $\eta$ is scale invariant: $\eta(ax) = a\,\eta(x)$ for all $a > 0$ and $x \in \mathbb{R}$.

• $\eta$ is 1-Lipschitz continuous: $|\eta(x) - \eta(x')| \le |x - x'|$ for all $x, x' \in \mathbb{R}$.

These conditions are satisfied by the ReLU activation (Nair & Hinton, 2010; Glorot et al., 2011) and leaky ReLU (LReLU) (Maas et al., 2013). The first condition reduces the model complexity because networks that differ only in scale can be identified as the same function. The second condition ensures that the approximation error in each layer is not amplified during signal propagation to the last layer.

### 3.1 Approximation error analysis

Recall that the empirical covariance matrix in the $\ell$-th layer is denoted by $\hat{\Sigma}^{(\ell)}$. Then, the degree of freedom is defined by

$$\hat{N}_\ell(\lambda) := \mathrm{Tr}\big[\hat{\Sigma}^{(\ell)}\big(\hat{\Sigma}^{(\ell)} + \lambda I\big)^{-1}\big] = \sum_{j=1}^{m_\ell} \frac{\hat{\mu}^{(\ell)}_j}{\hat{\mu}^{(\ell)}_j + \lambda},$$

where $\hat{\mu}^{(\ell)}_1 \ge \dots \ge \hat{\mu}^{(\ell)}_{m_\ell} \ge 0$ are the eigenvalues of $\hat{\Sigma}^{(\ell)}$. Let $m^\sharp_\ell$ denote the width of the $\ell$-th layer of $f^\sharp$. The next theorem characterizes the approximation accuracy between $\hat{f}$ and $f^\sharp$ on the basis of the degree of freedom, with respect to the empirical $L_2$-norm $\|g\|_n = (\frac{1}{n}\sum_{i=1}^{n}\|g(x_i)\|^2)^{1/2}$ for a vector-valued function $g$.

###### Theorem 1 (Compression rate via degree of freedom).

Suppose that the trained network $\hat{f}$ is contained in the norm-constrained model $\mathcal{F}$. Let $\lambda_\ell$ be

$$\lambda_\ell = \inf\big\{\lambda \ge 0 \;\big|\; m^\sharp_\ell \ge 5\hat{N}_\ell(\lambda)\log\big(80\hat{N}_\ell(\lambda)\big)\big\}, \tag{5}$$

and let the weight vector $w$ for the regularization be defined by the "leverage score," i.e., the contribution of each node $j$ to the degree of freedom, computed through the orthogonal matrix $U$ that diagonalizes $\hat{\Sigma}^{(\ell)} = U\,\mathrm{diag}(\hat{\mu}^{(\ell)})\,U^\top$. Let

$$\alpha_{j,\ell} = \begin{cases} \theta^{-1}\,\dfrac{\max_{j' \notin \tilde{J}_\ell}\|\hat{W}^{(\ell)}_{j',:}\|^2}{\max_{j' \in [m_{\ell+1}]}\|\hat{W}^{(\ell)}_{j',:}\|^2} & (j \notin \tilde{J}_\ell),\\[1.5ex] \dfrac{\max_{j' \in \tilde{J}_\ell}\|\hat{W}^{(\ell)}_{j',:}\|^2}{\max_{j' \in [m_{\ell+1}]}\|\hat{W}^{(\ell)}_{j',:}\|^2} & (\text{otherwise}), \end{cases}$$

and let $\zeta_{\ell,\theta}$ be the corresponding constant depending on $\theta$. Then, the solution $\hat{J}$ obtained by Spectral-Pruning (3) satisfies

$$\max_{1 \le j \le m_{\ell+1}} \frac{\big\|\hat{W}^{(\ell)}_{j,:}\phi - \hat{W}^{(\ell)}_{j,:}\hat{A}_{\hat{J}}\phi_{\hat{J}}\big\|_n^2}{\alpha_{j,\ell}} \le 4\,\zeta_{\ell,\theta}\,\lambda_\ell\, m_\ell R^2. \tag{6}$$

Moreover, there exists a universal constant $\hat{c}$ such that the parameters of the compressed network satisfy the following norm bound:

$$\big\|\hat{W}^{(\ell)}_{j,:}\hat{A}_{\hat{J}}\,\mathrm{diag}(w)^{1/2}\big\|^2 \le \hat{c}\,\lambda_\ell\,\alpha_{j,\ell}\,m_\ell R^2. \tag{7}$$

Moreover, if we solve the optimization problem (3) with this norm bound as an additional constraint for all layers, then the optimization problem is feasible and the overall approximation error is bounded as

$$\|\hat{f} - f^\sharp\|_n \le \sum_{\ell=2}^{L}\Bigg(\prod_{\ell'=\ell+1}^{L}\sqrt{\max_j\{\alpha_{j,\ell'}\}}\Bigg)\sqrt{\hat{c}}^{\,L-\ell+1} R^{L-\ell+1}\sqrt{\max_j\{\alpha_{j,\ell}\}\,\zeta_{\ell,\theta}\,\lambda_\ell},$$

for the compressed network $f^\sharp$ given by our algorithm; in particular, if we set $\bar{R} = \sqrt{\hat{c}}\,R$ and $\alpha_{\max} = \max_{j,\ell}\alpha_{j,\ell}$, then it holds that

$$\|\hat{f} - f^\sharp\|_n \le \sum_{\ell=2}^{L} \bar{R}^{\,L-\ell+1}\sqrt{\alpha_{\max}\,\zeta_{\ell,\theta}\,\lambda_\ell}. \tag{8}$$

The proof is given in Appendix A. It basically follows the techniques developed by Suzuki (2018), in which theories of the kernel quadrature rule developed by Bach (2017) are used for deep learning analysis. This theorem indicates that the approximation error induced by the compression is directly controlled by the degree of freedom. Since the degree of freedom $\hat{N}_\ell(\lambda)$ is monotonically decreasing in $\lambda$, it becomes large as $\lambda$ is decreased to $0$, and the behavior of the eigenvalues determines how rapidly it increases as $\lambda \to 0$. We can see that if the eigenvalues decrease rapidly, then $\lambda_\ell$ becomes small for a given network size $m^\sharp_\ell$; in other words, $m^\sharp_\ell$ can be much smaller under a given approximation error constraint if the network has many small eigenvalues. Another important aspect of the theorem is that the norms of the parameters are properly bounded (Eq. (7)); this bound is not trivial. Therefore, the effects of the approximation errors in the internal layers on the entire function are well regulated.
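The interplay between eigenvalue decay, $\hat{N}_\ell(\lambda)$, and the critical $\lambda_\ell$ of Eq. (5) is easy to probe numerically. The bisection routine below is an illustrative sketch of our own (it relies on $\hat{N}_\ell$ being decreasing in $\lambda$), not part of the paper.

```python
import numpy as np

def dof(mu, lam):
    """Degree of freedom N(lambda) = sum_j mu_j / (mu_j + lambda),
    from the eigenvalues mu of the layer covariance matrix."""
    mu = np.asarray(mu, dtype=float)
    return float(np.sum(mu / (mu + lam)))

def critical_lambda(mu, m_sharp, lo=1e-12, hi=1e12, iters=200):
    """Smallest lambda with m_sharp >= 5 N(lambda) log(80 N(lambda)),
    as in Eq. (5); bisection works because N(lambda) decreases in lambda."""
    ok = lambda lam: m_sharp >= 5 * dof(mu, lam) * np.log(80 * dof(mu, lam))
    if ok(lo):
        return lo
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ok(mid):
            hi = mid
        else:
            lo = mid
    return hi
```

For a geometrically decaying spectrum, increasing the budget $m^\sharp_\ell$ drives the admissible $\lambda_\ell$ down, which through Eq. (8) tightens the approximation error bound.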

The quantity $\alpha_{j,\ell}$ appearing in the bound represents the effect of misspecification of $Z_\ell$ for the output aware method. For $j \notin \tilde{J}_\ell$, the factor $\theta^{-1}$ appears in the definition of $\alpha_{j,\ell}$, but its effect can be canceled out if the corresponding weight norm is small. This means that, if we want to use the output aware method, it is recommended to include the rows with large norm in the range of $Z_\ell$; if we do so, the effect of misspecification is negligible. Otherwise, the factor $\theta^{-1}$ may inflate the bound.

#### Kernel method perspective

The compression method can also be viewed from the kernel method point of view. Here, we define the kernel function in the $\ell$-th layer as

$$k_\ell(x, x') = \eta(F_{\ell-1}(x))^\top \eta(F_{\ell-1}(x')) \in \mathbb{R}.$$

The kernel function $k_\ell$ has a decomposition in $L_2(P_n)$ (where $P_n$ is the empirical distribution) as

$$k_\ell(x, x') = \sum_{j=1}^{m_\ell} \mu^{(\ell)}_j\,\phi^{(\ell)}_j(x)\,\phi^{(\ell)}_j(x'),$$

where $(\phi^{(\ell)}_j)_j$ is an orthonormal system in $L_2(P_n)$ and $\mu^{(\ell)}_j \ge 0$ is an eigenvalue. The kernel decomposition and the covariance matrix are connected: for a decomposition $\hat{\Sigma} = U\,\mathrm{diag}(\hat{\mu}^{(\ell)})\,U^\top$, where $U$ is an orthogonal matrix and $\mathrm{diag}(\hat{\mu}^{(\ell)})$ is a positive-semidefinite diagonal matrix, the kernel eigenvalues satisfy $\mu^{(\ell)}_j = \hat{\mu}^{(\ell)}_j$, and the eigenfunctions are linear combinations of the coordinates of $\eta(F_{\ell-1}(\cdot))$ given by the columns of $U$. In particular, the eigenvalues of the kernel and of the covariance matrix are identical. This can be checked as follows: each $\phi^{(\ell)}_j$ is included in the RKHS (reproducing kernel Hilbert space) defined by the kernel $k_\ell$ and hence can be represented as a linear map of $\eta(F_{\ell-1}(x))$; matching the expansion of $k_\ell$ with the definition of the covariance matrix then forces the coefficient matrix to be orthogonal, so the two sets of eigenvalues must coincide.
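The claimed identity between the kernel and covariance spectra is the familiar linear-algebra fact that $\frac{1}{n}\Phi^\top\Phi$ and $\frac{1}{n}\Phi\Phi^\top$ share their nonzero eigenvalues; a quick numerical check (with our own variable names):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 4
Phi = rng.standard_normal((n, m))     # rows: eta(F_{l-1}(x_i))
Sigma = Phi.T @ Phi / n               # m x m covariance matrix
K = Phi @ Phi.T / n                   # kernel Gram matrix k_l(x_i, x_j) / n
cov_eigs = np.sort(np.linalg.eigvalsh(Sigma))[::-1]
ker_eigs = np.sort(np.linalg.eigvalsh(K))[::-1][:m]
# nonzero spectra coincide, as claimed in the text
assert np.allclose(cov_eigs, ker_eigs)
```

This is why the degree of freedom, although defined through the covariance matrix, equally measures the spectral complexity of the layer's kernel.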

The RKHS $\mathcal{H}_\ell$ associated with the kernel function $k_\ell$ is uniquely defined, and its unit ball is given by

$$B(\mathcal{H}_\ell) = \Big\{f(x) = \sum_{j=1}^{m_\ell}\alpha'_j\sqrt{\mu^{(\ell)}_j}\,\phi^{(\ell)}_j(x) \;\Big|\; \|\alpha'\| \le 1\Big\} = \Big\{f(x) = \sum_{j=1}^{m_\ell}\alpha_j\,\eta(F_{\ell-1}(x))_j \;\Big|\; \|\alpha\| \le 1\Big\}.$$

In particular, each coordinate of the output to the next layer is an element of the RKHS. Thus, choosing a subset $\hat{J}$ of nodes can be seen as choosing a lower-dimensional subspace of $\mathcal{H}_\ell$ that approximates the original RKHS as accurately as possible. The kernel quadrature rule enables this by choosing $\hat{J}$ so that an alternative kernel defined as

$$\hat{k}_\ell(x, x') = \sum_{j \in \hat{J}} \mu^{(\ell)}_j\,\phi^{(\ell)}_j(x)\,\phi^{(\ell)}_j(x')$$

approximates the original kernel as well as possible.

### 3.2 Generalization error analysis

So far, we have developed an approximation error bound with respect to the "empirical" $L_2$-distance. Here, we derive a generalization error bound for the compressed network, which is defined with respect to the population $L_2$-distance, and we see that a bias-variance trade-off induced by the network compression appears. The bound we derive is a so-called fast learning rate, meaning that it is $O(1/n)$ with respect to the sample size $n$ rather than $O(1/\sqrt{n})$ as in the usual Rademacher complexity analysis. For this purpose, we specify the data generation model. First, we consider a simple regression model:

$$y_i = f^{\mathrm{o}}(x_i) + \xi_i \quad (i = 1, \dots, n),$$

where $f^{\mathrm{o}}$ is the true function that we want to estimate, $x_i$ is independently and identically distributed from $P_X$, and $\xi_i$ is i.i.d. Gaussian noise with mean $0$ and variance $\sigma^2$. A regression problem is considered for theoretical simplicity; nearly the same discussion is applicable to classification problems with margin conditions, such as Tsybakov's noise condition Mammen & Tsybakov (1999). The generalization error of $f^\sharp$ is evaluated by the population $L_2$-norm of $f^\sharp - f^{\mathrm{o}}$; hence, we aim to bound $\|f^\sharp - f^{\mathrm{o}}\|_{L_2}^2$. The training error of a function $f$ is denoted by $\hat{\mathcal{L}}(f) = \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2$. We assume (approximate) optimality of the trained network as follows.

###### Assumption 2 (Optimality).

There exists a constant $\epsilon \ge 0$ such that the following inequality holds almost surely: $\hat{\mathcal{L}}(\hat{f}) \le \inf_{f \in \mathcal{F}}\hat{\mathcal{L}}(f) + \epsilon$.

In practice, it is difficult to assume that the global optimal solution is attained because of the non-convexity of the deep learning problem. This assumption instead ensures that the optimization error is bounded by $\epsilon$; it can be relaxed, but we use Assumption 2 just for simplicity. Next, we assume the following bound on the input data.

###### Assumption 3.

The support of $P_X$ is compact, and the $\infty$-norm of the input is bounded on it: $\|x\|_\infty \le D_x$.

Then, under the same setting as Theorem 1, we define the following constants corresponding to the norm bounds:

$$\hat{R}_\infty := \max\Big\{\bar{R}^L D_x + \sum_{\ell=1}^{L}\bar{R}^{L-\ell}\bar{R}_b,\ \|f^{\mathrm{o}}\|_\infty\Big\}, \qquad \hat{G} := L\bar{R}^{L-1}D_x + \sum_{\ell=1}^{L}\bar{R}^{L-\ell},$$

where $\bar{R} = \sqrt{\hat{c}}\,R$ and $\bar{R}_b$ is the analogous quantity for $R_b$, for the constants introduced in Theorem 1. To bound the generalization error, we define $\delta_1$ and $\delta_2$ for $\lambda = (\lambda_\ell)_\ell$ and $m' = (m'_\ell)_\ell$ as

$$\delta_1 = \delta_1(\lambda) = \sum_{\ell=2}^{L}\bar{R}^{\,L-\ell+1}\sqrt{\alpha_{\max}\,\zeta_{\ell,\theta}\,\lambda_\ell}, \qquad \delta_2^2(m') = \frac{1}{n}\sum_{\ell=1}^{L} m'_\ell\, m'_{\ell+1}\,\log_+\!\Big(1 + \frac{4\sqrt{2}\,\hat{G}\max\{\bar{R}, \bar{R}_b\}\sqrt{n}}{\sigma \wedge \hat{R}_\infty}\Big).$$

Under these notations, we obtain the following generalization error bound for the compressed network with respect to the population $L_2$-norm $\|\cdot\|_{L_2}$.

###### Theorem 2 (Generalization error bound of the compressed network).

Suppose that Assumptions 1, 2, and 3 are satisfied. Let $(\lambda_\ell)_\ell$ be the variables satisfying condition (5):

$$\lambda_\ell = \inf\big\{\lambda \ge 0 \;\big|\; m^\sharp_\ell \ge 5\hat{N}_\ell(\lambda)\log\big(80\hat{N}_\ell(\lambda)\big)\big\},$$

and assume that $f^\sharp$ satisfies the approximation error bound (8) with the norm bound (7), as given in Theorem 1. Then, there exists a constant $C_1 > 0$ such that

$$\|f^\sharp - f^{\mathrm{o}}\|_{L_2}^2 \le C_1\big\{\delta_1^2 + \delta_2^2(m^\sharp) + \cdots\big\}$$