In the context of universal approximation, neural networks can represent functions of arbitrary complexity when the network is equipped with a sufficiently large number of layers and neurons. Such model flexibility has made deep neural networks a prominent machine learning tool over the past decades. Basically, given unlimited training data and computational resources, deep neural networks are able to learn arbitrarily complex data models.
In practice, the capability of collecting huge amounts of data is often restricted, so learning complicated networks with millions of parameters from limited training data can easily lead to overfitting. Over the past years, various methods have been proposed to reduce overfitting via regularization techniques and pruning strategies [18, 13, 20, 23]. However, the complex and non-convex behavior of the underlying model impedes the use of theoretical tools to analyze the performance of such techniques.
In this paper, we present an optimization framework, namely Net-Trim, which is a layer-wise convex scheme to sparsify deep neural networks. The proposed framework can be viewed from both theoretical and computational standpoints. Technically speaking, each layer of a neural network consists of an affine transformation (to be learned from the data) followed by a nonlinear unit. The nested composition of such mappings forms a highly nonlinear model, and learning it requires optimizing a complex, non-convex objective. Net-Trim applies to a network which is already trained. The basic idea is to reduce the network complexity layer by layer, while ensuring that the response of each layer stays close to that of the initially trained network.
More specifically, the training data is transmitted through the learned network layer by layer. Within each layer we propose an optimization scheme which promotes weight sparsity while enforcing consistency between the resulting response and the trained network response. In a sense, if we consider each layer's response to the transmitted data as a checkpoint, Net-Trim ensures that the checkpoints remain roughly the same while a simpler path between the checkpoints is discovered. A favorable feature of Net-Trim is the possibility of a convex formulation when the ReLU is employed as the nonlinear unit across the network.
Figure 1 demonstrates the pruning capability of Net-Trim for a sample network. The neural network used for this example classifies 200 points positioned on the 2D plane into two separate classes based on their labels. The points within each class lie on nested spirals; to classify them we use a neural network with two hidden layers of 200 neurons each (the reader is referred to the Experiments section for more technical details). Figures 1(a) and 1(b) present the weighted adjacency matrix and partial network topology relating the hidden layers before and after retraining. With only a negligible change to the overall network response, Net-Trim is able to prune more than 93% of the links among the neurons, bringing a significant model reduction to the problem. Even when the neural network is trained using sparsifying weight regularizers (here, dropout and an ℓ1 penalty), applying Net-Trim yields a major additional reduction in the model complexity, as illustrated in Figures 1(c) and 1(d).
Net-Trim is particularly useful when the number of training samples is limited. While overfitting is likely to occur in such scenarios, Net-Trim reduces the complexity of the model by setting a significant portion of the weights at each layer to zero, while maintaining a similar relationship between the input data and the network response. This capability can also be viewed from a different perspective: Net-Trim simplifies the process of determining the network size. In other words, the network used at the training phase can be oversized and present more degrees of freedom than the data require; Net-Trim then automatically reduces the network size to an order matching the data.
Finally, a favorable property of Net-Trim is its post-processing nature: it simply processes the layer-wise responses of the network, regardless of the training strategy used to build the model. Hence, the proposed framework can easily be combined with state-of-the-art training techniques for deep neural networks.
1.1 Previous Work
In recent years there has been increasing interest in the mathematical analysis of deep networks. These efforts mainly focus on characterizing the minimizers of the underlying cost function. One line of work shows that some deep forward networks can be learned accurately in polynomial time, assuming that all the edges of the network have random weights; the authors propose a layer-wise algorithm where the weights are recovered sequentially at each layer. Kawaguchi establishes an exciting result showing that, despite being highly non-convex, the square loss function of a deep neural network inherits interesting geometric structures. In particular, under some independence assumptions, all local minima are also global ones. In addition, the saddle points of the loss function possess special properties which guide the optimization algorithms to avoid them.
The geometry of the loss function has also been studied through a connection between spin-glass models in physics and fully connected neural networks. On another front, Giryes et al. recently provide a link between deep neural networks and compressed sensing, showing with tools from compressed sensing that feedforward networks are able to preserve the distances between data points at each layer. There are other works on formulating the training of feedforward networks as an optimization problem [7, 6, 5]. The majority of the cited works attempt to understand neural networks by studying individual layers sequentially, which is also the approach taken in this paper.
On the more practical side, one of the notorious issues with training a complicated deep neural network is overfitting when the amount of training data is limited. There have been several approaches to address this issue, among them regularizations such as ℓ1 and ℓ2 penalties [18, 13]. These methods add a penalty term to the loss function to reduce the complexity of the underlying model. Due to the non-convex nature of the underlying problem, mathematically characterizing the behavior of such regularizers is largely intractable, and most of the literature in this area is based on heuristics. Another approach is early stopping, which halts the training as soon as the performance on a validation set starts to degrade.
More recently, a new form of regularization called Dropout was proposed by Hinton et al. It involves temporarily dropping a portion of the hidden activations during training. In particular, for each sample, roughly half of the activations are randomly removed on the forward pass, and the weights associated with these units are not updated on the backward pass. Combining the subnetworks obtained this way across all the examples yields a representation of a huge ensemble of neural networks, which offers excellent generalization capability. Experimental results on several tasks indicate that Dropout frequently and significantly improves the classification performance of deep architectures. An extension of Dropout named DropConnect was later proposed by LeCun et al. It is essentially similar to Dropout, except that the connections, rather than the activations, are randomly removed. As a result, the procedure introduces a dynamic sparsity on the weights.
The aforementioned regularization techniques (e.g., Dropout and DropConnect) can be seen as methods to sparsify the neural network, thereby reducing the model complexity. Here, sparsity is understood as either reducing the connections between the nodes or decreasing the number of activations. Besides avoiding overfitting, sparse models are also computationally preferable in applications where quick predictions are required.
1.2 Summary of the Technical Contributions
Our post-training scheme applies multiple convex programs with an ℓ1 cost to prune the weight matrices on different layers of the network. Formally, we denote by X and Y the given input and output of a layer of the trained neural network, respectively. The mapping between the input and output of this layer is performed via a weight matrix W and a nonlinear activation unit σ, as Y = σ(WᵀX). To perform the pruning at this layer, Net-Trim focuses on addressing the following optimization:

  minimize ‖Ŵ‖₁ subject to ‖σ(ŴᵀX) − Y‖_F ≤ ε,     (1)

where ε is a user-specified parameter that controls the consistency of the output before and after retraining, and ‖Ŵ‖₁ is the sum of the absolute entries of Ŵ, essentially an ℓ1-type norm to enforce sparsity on Ŵ.
When σ is taken to be the ReLU, we are able to provide a convex relaxation to (1). We will show that Ŵ, the solution to the underlying convex program, is not only sparser than W, but also that the error accumulated over the layers due to the ε-approximation constraint does not significantly explode. In particular, in Theorem 1 we show that for a network with normalized weight matrices, the discrepancy between the retrained response of any layer and its initially trained counterpart is bounded by a fixed multiple of ε. This property suggests that the network constructed by Net-Trim is sparser, while capable of achieving a similar outcome. Another attractive feature of this scheme is its computational distributability: the convex programs associated with different layers can be solved independently.
Also in this paper, we propose a cascade version of Net-Trim, where the retrained output of the previous layer is fed to the next layer as the input of the optimization. In particular, we present a convex relaxation of the per-layer program in which the input is the retrained output of the previous layer, and the tolerance ε_ℓ has a closed-form expression that maintains the feasibility of the resulting program. Again, for a network with normalized weight matrices, in Theorem 2 we bound the overall discrepancy of the retrained network in terms of ε and a constant inflation rate γ, which can be arbitrarily close to 1 and controls the magnitude of ε_ℓ. Because of the more adaptive pruning, cascade Net-Trim may yield sparser solutions, at the expense of not being computationally parallelizable.
Finally, for redundant networks with limited training samples, we will discuss how a simpler network (in terms of sparsity) with identical performance can be found by setting ε = 0 in (1). We will derive general sufficient conditions for the recovery of such a sparse model via the proposed convex program. As an insightful case, we show that when a layer is probed with standard Gaussian samples (e.g., applicable to the first layer), learning the simple model can be performed with far fewer samples than the layer degrees of freedom. More specifically, consider X to be a Gaussian matrix, where each column represents an input sample, and W a sparse matrix, with at most s nonzero terms in each column, from which the layer response is generated, i.e., Y = ReLU(WᵀX). In Theorem 3 we state that when the number of training samples scales with s (modulo a logarithmic factor), with overwhelming probability, W can be accurately learned through the proposed convex program.
As will be detailed, the underlying analysis steps beyond the standard measure-concentration arguments used in the compressed sensing literature. We contribute by establishing concentration inequalities for the sum of dependent random matrices.
1.3 Notations and Organization of the Paper
The remainder of the paper is structured as follows. In Section 2, we formally present the network model used in the paper. The proposed pruning schemes, both the parallel and cascade Net-Trim, are presented and discussed in Section 3. The material includes insights on developing the algorithms and detailed discussions on the consistency of the retraining schemes. Section 4 is devoted to the convex analysis of the proposed framework. We derive the unique optimality conditions for the recovery of a sparse weight matrix through the proposed convex program. We then use this tool to derive the number of samples required for learning a sparse model in a Gaussian sample setup. In Section 5 we report some retraining experiments and the improvement that Net-Trim brings to model reduction and robustness. Finally, Section 6 presents some discussions on extending the Net-Trim framework, future directions, and concluding remarks.
As a summary of the notations, our presentation mainly relies on multidimensional calculus. We use bold characters to denote vectors and matrices. Considering a matrix A and index sets Γ₁ and Γ₂, we use A_{Γ₁,·} to denote the matrix obtained by restricting the rows of A to Γ₁. Similarly, A_{·,Γ₂} denotes the restriction of A to the columns specified by Γ₂, and A_{Γ₁,Γ₂} is the submatrix with the rows and columns restricted to Γ₁ and Γ₂, respectively.
Given a matrix A, we use ‖A‖₁ to denote the sum of its absolute entries (the formal induced ℓ1 norm has a different definition; however, for a simpler formulation we use a similar notation) and ‖A‖_F to denote the Frobenius norm. For a given vector x, ‖x‖₀ denotes the cardinality of its support, supp(x) denotes the set of indices of the non-zero entries of x, and supp(x)ᶜ is the complement set. Some additional shorthand notation is introduced in the proofs where needed. Scalar operations such as max(·, 0) applied to a vector or matrix act on every component individually. Finally, following the MATLAB convention, the vertical concatenation of two vectors a and b (i.e., [aᵀ, bᵀ]ᵀ) is sometimes denoted by [a; b] in the text.
2 Feed Forward Network Model
In this section, we introduce some notational conventions related to the feed forward network model, which will be used frequently in the paper. Considering a feed forward neural network, we assume to have P training samples x_p, p = 1, …, P, where each x_p is an input to the network. We stack up the samples as the columns of a matrix X = [x_1, …, x_P].
The final output of the network is denoted by Y, where each column of Y is the response to the corresponding training column in X. We consider a network with L layers, where the activations are taken to be rectified linear units. Associated with each layer ℓ we have a weight matrix W_ℓ such that

  Y^(ℓ) = ReLU(W_ℓᵀ Y^(ℓ−1)),  ℓ = 1, …, L.     (2)

Basically, the outcome of the ℓ-th layer is Y^(ℓ), which is generated by applying the adjoint of W_ℓ to Y^(ℓ−1) and passing the result through the component-wise max(·, 0) operation. Clearly, in this setup Y^(0) = X and Y^(L) = Y. A trained neural network as outlined in (2) is represented by the collection of weight matrices W_1, …, W_L. Figure 2(a) demonstrates the architecture of the proposed network.
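As a concrete sketch of this recursion, the following minimal example (the shapes and random weights here are ours, not from the paper) propagates a batch of samples through two ReLU layers:

```python
import numpy as np

def relu(z):
    # Component-wise max(0, z)
    return np.maximum(z, 0.0)

def forward(X, weights):
    """Propagate the stacked samples X (one column per sample) through the
    network: Y(0) = X and Y(l) = ReLU(W_l^T Y(l-1))."""
    Y = X
    responses = [Y]
    for W in weights:
        Y = relu(W.T @ Y)   # adjoint of W_l applied to the previous response
        responses.append(Y)
    return responses

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 10))          # 4-dimensional inputs, 10 samples
weights = [rng.standard_normal((4, 6)),   # layer 1: 4 -> 6 neurons
           rng.standard_normal((6, 3))]   # layer 2: 6 -> 3 neurons
responses = forward(X, weights)
print([R.shape for R in responses])       # [(4, 10), (6, 10), (3, 10)]
```

Each intermediate response keeps one column per training sample, matching the column-stacked convention above.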
For the sake of theoretical analysis, throughout the paper we focus on networks with normalized weights, defined as follows.
Definition. A given neural network is link-normalized when ‖W_ℓ‖₁ = 1 for every layer ℓ.
A general network in the form of (2) can be converted to its link-normalized version by replacing each W_ℓ with W_ℓ/‖W_ℓ‖₁ and scaling the layer responses accordingly. Since ReLU(αz) = α ReLU(z) for any α ≥ 0, any weight processing on a network of the form (2) can be applied to the link-normalized version and later transferred to the original domain via a suitable scaling. Subsequently, all the results presented in this paper are stated for a link-normalized network.
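The conversion can be sketched numerically. In this sketch we normalize by the sum of absolute entries, which is one admissible choice of scale (assumed here, since any positive scale commutes with the ReLU), and verify that the original response is recovered by reapplying the accumulated scales:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(X, weights):
    Y = X
    for W in weights:
        Y = relu(W.T @ Y)
    return Y

def link_normalize(weights):
    """Rescale every W_l to unit norm; since ReLU(a*z) = a*ReLU(z) for a >= 0,
    the discarded scales can be re-applied to the final response."""
    scales, normalized = [], []
    for W in weights:
        s = np.abs(W).sum()        # one normalization choice; any positive scale works
        scales.append(s)
        normalized.append(W / s)
    return normalized, scales

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 8))
weights = [rng.standard_normal((3, 5)), rng.standard_normal((5, 2))]

normalized, scales = link_normalize(weights)
Y_orig = forward(X, weights)
Y_norm = forward(X, normalized) * np.prod(scales)   # undo the normalization
print(np.allclose(Y_orig, Y_norm))                  # True
```

Because the ReLU commutes with positive scaling, weight processing on the normalized network transfers back to the original network exactly.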
3 Pruning the Network
Our pruning strategy relies on redesigning the network so that, for the same training data, the outcome of each layer stays more or less close to that of the initially trained model, while the weights associated with each layer are replaced with sparser versions to reduce the model complexity. Figure 2(b) presents the main idea, where the complex paths between the layer outcomes are replaced with simple paths.
Consider the first layer, where X is the layer input, W the layer coefficient matrix, and Y = ReLU(WᵀX) the layer outcome. We require the new coefficient matrix Ŵ to be sparse and the new response ReLU(ŴᵀX) to be close to Y. Using the sum of absolute entries as a proxy to promote sparsity, a natural strategy to retrain the layer is addressing the nonlinear program

  minimize ‖Ŵ‖₁ subject to ‖ReLU(ŴᵀX) − Y‖_F ≤ ε.     (4)

Despite the convex objective, the constraint set in (4) is non-convex. However, we may approximate it with a convex set by imposing ReLU(ŴᵀX) and Y to have similar activation patterns. More specifically, knowing that each entry of Y is either zero or positive, we enforce the corresponding entry of ŴᵀX to be non-positive wherever Y vanishes, and close to the entry of Y elsewhere. To present the convex formulation, for ε ≥ 0 we use the notation

  C(X, Y; ε) = { W : Σ_{(m,p): Y_{m,p} > 0} (w_mᵀx_p − Y_{m,p})² ≤ ε²,  w_mᵀx_p ≤ 0 whenever Y_{m,p} = 0 },     (5)

where w_m denotes the m-th column of W and x_p the p-th column of X. Based on this definition, a convex proxy to (4) is

  minimize ‖Ŵ‖₁ subject to Ŵ ∈ C(X, Y; ε).     (6)

Basically, depending on the value of Y_{m,p}, a different constraint is imposed on Ŵ to emulate the ReLU operation. For a simpler formulation throughout the paper, we use the notation C(X, V; ε) for any constraint set of the form (5) parametrized by given X, V, and ε.
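The activation-pattern splitting can be sketched as follows. The constraint form below is our reconstruction of the idea (fit the active entries, keep the pre-activations non-positive on the inactive ones), and the check confirms that the originally trained weights are feasible for any ε ≥ 0:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 30))            # layer input
W = rng.standard_normal((4, 3))             # trained layer weights
Y = np.maximum(W.T @ X, 0.0)                # layer outcome

def constraint_gaps(W_new, X, Y):
    """Split the pre-activation W_new^T X by the activation pattern of Y:
    where Y > 0 we measure the fitting error (to be kept below eps), and
    where Y = 0 we measure any violation of the non-positivity constraint."""
    Z = W_new.T @ X
    active = Y > 0
    fit_err = np.linalg.norm((Z - Y)[active])
    sign_violation = np.maximum(Z[~active], 0.0).max(initial=0.0)
    return fit_err, sign_violation

# The trained W itself is feasible: it fits the active entries exactly and
# its pre-activations are non-positive wherever the ReLU output was zero.
fit_err, violation = constraint_gaps(W, X, Y)
print(fit_err, violation)   # both (numerically) zero
```

Any other W with small `fit_err` and zero `sign_violation` produces a layer response within ε of Y after the ReLU is applied.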
As a first step towards establishing a retraining framework applicable to the entire network, we show that the solution of (6) satisfies the constraint in (4), so the outcome of the retrained layer stays controllably close to Y.
Proposition 1. Let Ŵ be the solution to (6). Then the retrained layer response Ŷ = ReLU(ŴᵀX) obeys ‖Ŷ − Y‖_F ≤ ε.
Based on the above observation, we propose two schemes to retrain the neural network: one exploits a computationally distributable nature, and the other proposes a cascading scheme to retrain the layers sequentially. The general idea, which originates from the relaxation in (6), is referred to as Net-Trim, specified by its parallel or cascade nature.
3.1 Parallel Net-Trim
The parallel Net-Trim is a straightforward application of the convex program (6) to each layer in the network. Basically, each layer is processed independently based on the initial model input and output, without taking into account the retraining result from the previous layer. Specifically, denoting by Y^(ℓ−1) and Y^(ℓ) the input and output of the ℓ-th layer of the initially trained neural network (see equation (2)), we propose to relearn the coefficient matrix W_ℓ via the convex program

  Ŵ_ℓ = argmin ‖W‖₁ subject to W ∈ C(Y^(ℓ−1), Y^(ℓ); ε_ℓ).     (7)

The optimization (7) can be applied to every layer in the network independently, and hence is computationally distributable. The pseudocode for the parallel Net-Trim is presented as Algorithm 1.
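To give a concrete, runnable flavor of the per-layer step, the sketch below replaces the exact constrained program (7) with a penalized surrogate solved by proximal gradient (ISTA). The function name `net_trim_layer`, the penalty weight `lam`, and the toy data are ours, not the paper's algorithm or tuning:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def net_trim_layer(X, Y, lam=3.0, iters=2000):
    """Penalty-form surrogate of the per-layer program: minimize
    lam*||W||_1 + 1/2*||(W^T X - Y) on active entries||_F^2
              + 1/2*||max(0, W^T X) on inactive entries||_F^2,
    solved by proximal gradient with a soft-threshold step."""
    active = Y > 0
    W = np.zeros((X.shape[0], Y.shape[0]))
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)   # 1 / Lipschitz constant
    for _ in range(iters):
        Z = W.T @ X
        R = np.where(active, Z - Y, relu(Z))   # residual for both groups
        W = W - step * (X @ R.T)               # gradient step
        W = np.sign(W) * np.maximum(np.abs(W) - step * lam, 0.0)  # soft threshold
    return W

# Toy layer: sparse ground-truth weights contaminated with small dense noise.
rng = np.random.default_rng(3)
X = rng.standard_normal((4, 60))
W_true = np.zeros((4, 3))
W_true[[0, 2], 0] = [1.0, -1.5]
W_true[[1, 3], 1] = [2.0, 1.0]
W_true[[0, 3], 2] = [-1.0, 1.2]
W_trained = W_true + 0.02 * rng.standard_normal((4, 3))
Y = relu(W_trained.T @ X)

W_hat = net_trim_layer(X, Y)
n_zeros = int(np.sum(np.abs(W_hat) < 1e-8))
rel_err = np.linalg.norm(relu(W_hat.T @ X) - Y) / np.linalg.norm(Y)
print(n_zeros, rel_err)   # several pruned entries, small response discrepancy
```

The soft-threshold step produces exact zeros, so the small noise-level weights are pruned while the layer response stays close to its trained value.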
With reference to the constraint in (7), if we only retrain the ℓ-th layer, the output of the retrained layer is in the ε_ℓ-neighborhood of the output before retraining. However, when all the layers are retrained through (7), an immediate question is whether the retrained network produces an output which is controllably close to that of the initially trained model. In the following theorem, we show that the retraining error does not blow up across the layers and remains a multiple of ε.
When all the layers are retrained with a fixed parameter ε (as in Algorithm 1), the following corollary simply bounds the overall discrepancy.
Corollary 1. Using Algorithm 1, the ultimate network outcome obeys ‖Ŷ^(L) − Y^(L)‖_F ≤ Lε.
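The additive accumulation of the per-layer tolerances can be checked numerically. The sketch below injects a perturbation of Frobenius norm ε after each layer of a link-normalized network (standing in for the retraining discrepancy) and confirms that the final deviation stays below Lε; the normalization by the sum of absolute entries is our choice of scale, and it makes every layer non-expansive:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(4)
L, eps, dim = 4, 0.1, 6

# Link-normalized layers: dividing by the sum of absolute entries bounds the
# spectral norm by 1, so each ReLU layer is a non-expansive map.
weights = []
for _ in range(L):
    W = rng.standard_normal((dim, dim))
    weights.append(W / np.abs(W).sum())

X = rng.standard_normal((dim, 20))
Y, Y_hat = X, X
for W in weights:
    Y = relu(W.T @ Y)
    # Perturbation of norm eps, standing in for the per-layer discrepancy
    # allowed by the retraining constraint.
    E = rng.standard_normal(Y.shape)
    Y_hat = relu(W.T @ Y_hat) + eps * E / np.linalg.norm(E)

dev = np.linalg.norm(Y_hat - Y)
print(dev, L * eps)   # the deviation stays below L * eps
```

Since the ReLU is 1-Lipschitz and each normalized weight matrix is non-expansive, the per-layer errors can only add, never amplify.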
We would like to note that the conversion to a link-normalized network is only for the sake of presenting the theoretical results in a compact form. In practice such a conversion is not necessary, and to retrain layer ℓ we can simply take ε_ℓ to be a fixed fraction of the layer response energy (e.g., ε_ℓ = ε‖Y^(ℓ)‖_F), where ε plays a role similar to that in a link-normalized network.
3.2 Cascade Net-Trim
Unlike the parallel scheme, where each layer is retrained independently, in the cascade approach the outcome of a retrained layer is used to retrain the next layer. To better explain the mechanics, consider starting the cascade process by retraining the first layer as before, through

  Ŵ_1 = argmin ‖W‖₁ subject to W ∈ C(X, Y^(1); ε_1).     (10)

Setting Ŷ^(1) = ReLU(Ŵ_1ᵀX) to be the outcome of the retrained layer, to retrain the second layer we would ideally like to address a similar program with Ŷ^(1) as the input and Y^(2) as the output reference, i.e.,

  Ŵ_2 = argmin ‖W‖₁ subject to W ∈ C(Ŷ^(1), Y^(2); ε_2).     (11)

However, there is no guarantee that program (11) is feasible, that is, that there exists a matrix W obeying the constraints in (11). If the constraint set in (11) were parameterized by Y^(1) instead of Ŷ^(1), a natural feasible point would have been W_2. Since Ŷ^(1) is a perturbed version of Y^(1), the constraint set needs to be slacked to maintain the feasibility of W_2. In this context, one may easily verify that W_2 remains feasible as long as ε_2 is sufficiently large; it suffices to extend the right-hand-side quantities in the corresponding inequalities by terms controlled by the columns of the perturbed input Ŷ^(1). Basically, the slacked constraint set is a relaxed version of that in (11), extended just enough to maintain the feasibility of W_2.
Following this line of argument, in the cascade Net-Trim we propose to retrain the first layer through (10). For every subsequent layer ℓ > 1, the retrained weight matrix is obtained via

  Ŵ_ℓ = argmin ‖W‖₁ subject to W ∈ C(Ŷ^(ℓ−1), Y^(ℓ); ε_ℓ),

where ε_ℓ is chosen through a closed-form expression, involving an inflation rate γ_ℓ ≥ 1, that is large enough to keep the program feasible. The constants γ_ℓ (referred to as the inflation rates) are free parameters which control the sparsity of the resulting matrices. After retraining the ℓ-th layer we set

  Ŷ^(ℓ) = ReLU(Ŵ_ℓᵀ Ŷ^(ℓ−1)),

and use this outcome to retrain the next layer. Algorithm 2 presents the pseudocode implementing the cascade Net-Trim for a given ε and a constant inflation rate γ across all the layers.
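The sequential mechanics can be sketched as below. The exact per-layer convex program is again replaced here by a penalized proximal-gradient surrogate (`trim_layer`, the weight `lam`, and the toy data are ours), so this illustrates the cascade data flow rather than the paper's exact solver:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def trim_layer(X, Y, lam=3.0, iters=2000):
    # Penalty-form surrogate of the per-layer program, solved by proximal
    # gradient with a soft-threshold (the paper's exact formulation is a
    # constrained convex program handled by a generic solver).
    active = Y > 0
    W = np.zeros((X.shape[0], Y.shape[0]))
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)
    for _ in range(iters):
        R = np.where(active, W.T @ X - Y, relu(W.T @ X))
        W = W - step * (X @ R.T)
        W = np.sign(W) * np.maximum(np.abs(W) - step * lam, 0.0)
    return W

def cascade_net_trim(X, weights, lam=3.0):
    """Retrain the layers sequentially: the response of the already-retrained
    part of the network is the input of the next per-layer program, while the
    original layer responses remain the fitting targets."""
    Y_ref, Y_hat, trimmed = X, X, []
    for W in weights:
        Y_ref = relu(W.T @ Y_ref)            # original response: the target
        W_new = trim_layer(Y_hat, Y_ref, lam)
        trimmed.append(W_new)
        Y_hat = relu(W_new.T @ Y_hat)        # retrained outcome, fed forward
    return trimmed, Y_hat, Y_ref

# Toy network: sparse ground-truth weights plus small dense noise.
rng = np.random.default_rng(5)

def sparse_plus_noise(shape, rng):
    W = 0.02 * rng.standard_normal(shape)
    for j in range(shape[1]):
        idx = rng.choice(shape[0], size=2, replace=False)
        W[idx, j] += rng.choice([-1.5, 1.0, 2.0], size=2)
    return W

X = rng.standard_normal((4, 60))
weights = [sparse_plus_noise((4, 5), rng), sparse_plus_noise((5, 3), rng)]
trimmed, Y_hat, Y_ref = cascade_net_trim(X, weights)

zeros = sum(int(np.sum(np.abs(W) < 1e-8)) for W in trimmed)
rel_err = np.linalg.norm(Y_hat - Y_ref) / np.linalg.norm(Y_ref)
print(zeros, rel_err)   # pruned links, small overall discrepancy
```

Feeding the retrained response forward lets each subsequent program adapt to the perturbations already introduced, which is what the inflation of ε_ℓ accounts for in the exact formulation.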
In the following theorem, we prove that the outcome of the retrained network produced by Algorithm 2 stays close to that of the network before retraining.
When all the layers are retrained with a fixed inflation rate γ (as in Algorithm 2), the following corollary of Theorem 2 bounds the overall discrepancy of the network.
Corollary 2. Using Algorithm 2, the ultimate network outcome obeys a discrepancy bound proportional to ε, with a proportionality constant determined by the inflation rate γ and the number of layers L.
Similar to the parallel Net-Trim, the cascade Net-Trim can also be performed without a link-normalization, by scaling the per-layer tolerances accordingly.
3.3 Retraining the Last Layer
Commonly, the last layer of a neural network is not subject to an activation function, and a standard linear model applies, i.e., Y^(L) = W_Lᵀ Y^(L−1). This linear outcome may be directly exploited for regression purposes, or passed through a soft-max function to produce the scores for a classification task.
In this case, to retrain the layer we simply need to seek a sparse weight matrix under the constraint that the linear outcomes stay close before and after retraining. More specifically,

  Ŵ_L = argmin ‖W‖₁ subject to ‖Wᵀ Y^(L−1) − Y^(L)‖_F ≤ ε_L.

In the case of the cascade Net-Trim, the retrained response Ŷ^(L−1) replaces the original input:

  Ŵ_L = argmin ‖W‖₁ subject to ‖Wᵀ Ŷ^(L−1) − Y^(L)‖_F ≤ ε_L,     (18)

and the feasibility of the program is established for ε_L sufficiently large relative to the perturbation of the last-layer input.
Consider a link-normalized network where a standard linear model applies to the last layer.
While the cascade Net-Trim is designed in a way that infeasibility is never an issue, one can take a slight risk of infeasibility in retraining the last layer to further reduce the overall discrepancy. More specifically, if the value of ε_L in (18) is replaced with ε_L/η for some η > 1, we may reduce the overall discrepancy by the factor η without altering the sparsity pattern of the first L − 1 layers. It is, however, clear that in this case there is no guarantee that program (18) remains feasible, and multiple trials may be needed to tune η. We will refer to η as the risk coefficient and will present some examples in Section 5 which use it as a way to control the final discrepancy in a cascade framework.
4 Convex Analysis and Model Learning
In this section we focus on redundant networks, where the mapping between a layer input and the corresponding output can be established via various weight matrices. As an example, this could be the case when an insufficient number of training samples is used to train a large network. We will show that in this case, if the relation between the layer input and output can be established via a sparse weight matrix, then under some conditions such a matrix can be uniquely identified through the core Net-Trim program in (6).
As noted above, in the case of a redundant layer, for a given input X and output Y, the relation Y = ReLU(WᵀX) can be established via more than one W. In this case we hope to find a sparse W by setting ε = 0 in (6). For this value of ε our central convex program reduces to

  minimize ‖W‖₁ subject to w_mᵀx_p = Y_{m,p} whenever Y_{m,p} > 0, and w_mᵀx_p ≤ 0 whenever Y_{m,p} = 0,     (19)

which decouples into M convex programs, one searching for each column of W.
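The decoupling can be verified directly: the objective and constraints touch each column of W only through the matching row of Y, so solving per output neuron reproduces the joint solution. The sketch below checks this on a penalized surrogate of the program (our runnable stand-in for the exact constrained formulation):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def ista(X, Y, lam=1.0, iters=500):
    # Penalty-form surrogate of the layer program, solved by proximal gradient.
    active = Y > 0
    W = np.zeros((X.shape[0], Y.shape[0]))
    step = 1.0 / (np.linalg.norm(X, 2) ** 2)
    for _ in range(iters):
        R = np.where(active, W.T @ X - Y, relu(W.T @ X))
        W = W - step * (X @ R.T)
        W = np.sign(W) * np.maximum(np.abs(W) - step * lam, 0.0)
    return W

rng = np.random.default_rng(6)
X = rng.standard_normal((5, 40))
W0 = rng.standard_normal((5, 4))
Y = relu(W0.T @ X)

# Joint solve versus per-output-neuron solves: the iterations for column m of
# W only ever read row m of Y, so the two results coincide column by column.
W_joint = ista(X, Y)
W_cols = np.column_stack(
    [ista(X, Y[m:m + 1, :])[:, 0] for m in range(Y.shape[0])])
print(np.allclose(W_joint, W_cols))   # True
```

This separability is what makes the per-neuron programs individually, and cheaply, solvable.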
For a more concise representation, we drop the neuron index and, given a vector y corresponding to one row of Y, focus on the convex program

  minimize ‖w‖₁ subject to x_pᵀw = y_p whenever y_p > 0, and x_pᵀw ≤ 0 whenever y_p = 0.     (20)
In the remainder of this section we analyze the optimality conditions for (20) and show how they can be linked to the identification of a sparse solution. The program can be cast in the standard form (21), with equality constraints on the active samples and inequality constraints on the inactive ones.
For a general y, not necessarily structured as above, the following result states sufficient conditions under which a sparse pair is the unique minimizer of (21).
The proposed optimality result can be related to the unique identification of a sparse w from rectified observations of the form y = ReLU(Xᵀw). Clearly, the structure of the feature matrix X plays the key role here, and the construction of the dual certificate stated in Proposition 3 entirely relies on it. As an insightful case, we show that when X is a Gaussian matrix (that is, the elements of X are i.i.d. values drawn from a standard normal distribution), learning w can be performed with far fewer samples than the layer degrees of freedom.
Theorem 3. Let w be an arbitrary s-sparse vector and X a Gaussian matrix representing the samples. Given sufficiently many observations of the type y = ReLU(Xᵀw), with overwhelming probability the vector w can be learned exactly through (20).
The standard Gaussian assumption for the feature matrix allows us to relate the number of training samples to the number of active links in a layer. Such a feature structure is a realistic assumption mainly for the first layer of the neural network. As shown in the proof of Theorem 3, because the set of active observations depends on the entries of w, the standard concentration-of-measure framework for independent random matrices is not applicable here. Instead, we establish concentration bounds for the sum of dependent random matrices.
Because of the contribution of the weight matrices to the distribution of the inputs of the subsequent layers, a similar type of analysis for those layers seems significantly harder without restrictive assumptions, and is left as possible future work. Yet, Theorem 3 is a good reference for the number of training samples required to learn a sparse model from Gaussian (or approximately Gaussian) samples. While we focused on each decoupled problem individually, for observations of the type Y = ReLU(WᵀX), an exact identification of W can be warranted as a corollary of Theorem 3 by using the union bound.
Corollary 3. Consider an arbitrary matrix W whose columns are each at most s-sparse, and let X be a Gaussian matrix representing the samples. Given sufficiently many observations Y = ReLU(WᵀX), every column of W, and hence W itself, can be accurately learned through (6) with ε = 0, with overwhelming probability.
4.1 Pruning Partially Clustered Neurons
As discussed above, in the case of ε = 0, program (6) decouples into smaller convex problems which can be addressed individually and at a lower computational cost. Clearly, for ε > 0 a similar decoupling does not produce the formal minimizer of (6), but such a suboptimal solution may still contribute significantly to the pruning of the layer.
Basically, in retraining a large layer with a large number of training samples, one may consider partitioning the output nodes into clusters and solving the individual program (23), a version of (6) focusing on the target nodes of one cluster, for each cluster. Solving (23) for each cluster provides the retrained submatrix associated with that cluster. The per-cluster tolerances in (23) are selected so that the overall layer discrepancy is ultimately upper-bounded by ε; since the squared discrepancies of disjoint clusters add up, a natural choice is to pick tolerances whose squares sum to ε².
While the solution acquired through (6) and the clustered solution are only identical in the case of ε = 0, the idea of clustering the output neurons into multiple groups and retraining each sublayer individually can significantly help with breaking down large problems into computationally tractable ones. Some examples of Net-Trim with partially clustered neurons (PCN) will be presented in the experiments section. Clearly, the most distributable case is choosing a single neuron for each partition, which results in the smallest sublayers to retrain.
4.2 Retraining Reformulation as a Quadratically Constrained Program
As discussed earlier, the Net-Trim implementation requires addressing optimizations of the form

  minimize ‖W‖₁ subject to W ∈ C(U, V; ε),     (24)

where U is the layer input, V is the reference output, and ε is the tolerance. By the construction of the problem, all elements of V are non-negative. In this section we represent (24) in a matrix form which can be fed into standard quadratically constrained solvers. For this purpose we rewrite (24) in terms of w = vec(W), where the vec(·) operator converts W into a vector by stacking its columns on top of one another. Also, corresponding to the entries of V that are positive and those that are zero, we define complementary linear index sets within this vectorized representation.
Denoting by I the identity matrix of appropriate size, it is straightforward to verify, using basic properties of the Kronecker product, that

  vec(UᵀW) = (I ⊗ Uᵀ) w.

Basically, each entry of (I ⊗ Uᵀ) w is the inner product between a column u_p of U and a column w_m of W. Subsequently, restricting the rows of (I ⊗ Uᵀ) to the two index sets defined above, we can rewrite (24) in terms of w as the program (25), with a quadratic constraint on the rows associated with the positive entries of V and linear inequality constraints on the remaining rows.
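The vectorization step can be checked numerically. The sketch below (using X for the layer input; the symbols and dimensions are ours) verifies the column-stacking convention and the standard identity vec(XᵀW) = (I ⊗ Xᵀ) vec(W):

```python
import numpy as np

rng = np.random.default_rng(7)
n, P, M = 4, 6, 3
X = rng.standard_normal((n, P))   # layer input (one column per sample)
W = rng.standard_normal((n, M))   # layer weights

w = W.flatten(order="F")          # vec(W): stack the columns of W
Q = np.kron(np.eye(M), X.T)       # (I_M ⊗ X^T), built via the Kronecker product

# Each entry of Q @ w is an inner product x_p^T w_m, i.e. one entry of X^T W,
# so the bilinear constraints on W become linear constraints on vec(W).
lhs = Q @ w
rhs = (X.T @ W).flatten(order="F")
print(np.allclose(lhs, rhs))      # True
```

With this identity, each constraint in (24) becomes a linear or quadratic constraint on the single vector w.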
We can apply an additional change of variable to make (25) adaptable to standard quadratically constrained convex solvers. For this purpose we split w into its positive and negative parts and define a new non-negative vector stacking the two; this variable change naturally yields a linear objective, since the ℓ1 norm becomes the sum of the new variables under non-negativity constraints. The convex program (25) is then cast as a standard quadratically constrained quadratic program.