Large neural network models have become a central component in state-of-the-art solutions to various machine learning and artificial intelligence problems. These include, for example, classification problems involving images, audio or text, or reinforcement learning problems involving game playing or robot manipulation and navigation. This has also resulted in an enormous increase in interest in deep neural net models from researchers in academia and industry, and even from non-experts and the general public, as evidenced by the number of scientific papers, blog entries and mainstream media articles published about them.
These practical successes in difficult problems have occurred thanks to the availability of large-scale datasets and of massive computational power provided by GPUs, which are particularly well suited for the kind of linear algebra operations involved in training a neural net (such as stochastic gradient descent). One notable characteristic of deep neural nets that seems to distinguish them from other machine learning models is their ability to grow with the data. That is, as the size of the training set grows, we can continue to increase the (say) classification accuracy by increasing the size of the neural net (number of hidden units and of layers, and consequently the number of weights). This is unlike, for example, a linear model, whose classification accuracy quickly stagnates as the data keeps growing. Hence, we can continue to improve the accuracy of a neural net by making it bigger and training on more data. This means that we can expect to see ever larger neural nets in future practical applications. Indeed, models reported in the computer vision literature have gone from less than a million weights in the 1990s to millions in the 2000s and, in recent works, to models exceeding billions of weights (each a floating-point value).
The large size of the resulting model causes an important practical problem when one intends to deploy it on a resource-constrained target device such as a mobile phone or another embedded system. That is, the large neural net is trained in a resource-rich setting, e.g. GPUs and a multicore architecture with large memory and disk, where the model designer can explore different model architectures, hyperparameter values, etc. until a model with the best accuracy on a validation set is found. This final, large model is then ready to be deployed in practice. However, the target device will typically have far more restrictive computational constraints in memory size and speed, arithmetic operations, clock rate, energy consumption, etc., which make it impossible to accommodate the large model. In other words, we can only deploy models of a certain size. Download bandwidth is also significantly limited in apps for mobile phones or software for cars.
This problem has attracted considerable attention recently. Two important facts weigh upon it which have been known for some time among researchers. The first is that the large neural nets that we currently know how to train for optimal accuracy contain significant redundancy, which makes it possible to find smaller neural nets with comparable accuracy. The second is that, for reasons not entirely understood, one typically achieves a more accurate model by training a large model first and then somehow transforming it into a smaller one (“compressing” the model), than by training a small model in the first place. This leads us to the problem of compressing a neural net (or other model), which is our focus.
Compressing neural nets has been recognized in recent years as an important problem, and various academic and industrial research groups have shown that one can indeed significantly compress neural nets without appreciable losses in accuracy (see related work below). However, the solutions proposed so far are somewhat ad hoc in two senses. First, they define a specific compression technique (and a specific algorithm to find the compressed model) which may work well with some types of models but not others. Second, some of these solutions are not guaranteed to be optimal in the sense of achieving the highest classification accuracy for the compression technique considered.
In this paper, we provide a general formulation of model compression and a training algorithm to solve it. The formulation can accommodate any compression technique as long as it can be put in a certain mathematical form, which includes most existing techniques. The compression mechanism appears as a black box, so that the algorithm simply iterates between learning a large model and compressing it, but is guaranteed to converge to a locally optimal compressed model under some standard assumptions. In separate papers, we develop specific algorithms for various compression forms, such as quantization (Carreira-Perpiñán and Idelbayev, 2017a) or pruning (Carreira-Perpiñán and Idelbayev, 2017b), and evaluate them experimentally. In the rest of this paper, we discuss related work (section 2), give the generic formulation (section 3) and the generic LC algorithm (section 4), give conditions for convergence (section 5) and discuss the relation of our LC algorithm with other algorithms (section 6) and the relation with generalization and model selection (section 7).
2 Related work: what does it mean to compress a model?
In a general sense, we can understand model compression as replacing a “large” model with a “small” model within the context of a given task, and this can be done in different ways. Let us discuss the setting of the problem that motivates the need for compression.
Assume we define the machine learning task as classification for object recognition from images and consider as large model a deep neural net with inputs x (images), outputs y (object class labels) and real-valued weights w (where the number of weights P is large). In order to train the model we use a loss L(w), e.g. the cross-entropy on a large training set of input-output pairs (x_n, y_n). Also assume we have trained a large, reference neural net with weights w̄, i.e., w̄ = argmin_w L(w), and that we are happy with its performance in terms of accuracy, but its size is too large.
We now want a smaller model h that we can apply to the same task (classifying input images x into labels y). How should we define this smaller model? One possibility is for the small model to be of the same type as the reference model but reparameterized in terms of a low-dimensional parameter vector. That is, we construct the weights w from low-dimensional parameters Θ via some transformation w = Δ(Θ), so that the size Q of Θ is smaller than the size P of w (obviously, this constrains the possible w that can be constructed). In the example before, the small model would be the same deep net but, say, with weight values quantized using a codebook with K entries (so Θ contains the codebook and the assignment of each weight to a codebook entry). A second possibility is for the small model to be a completely different type of model from the reference, e.g. a linear mapping with parameters Θ (so there is no relation between w and Θ).
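To make the quantized reparameterization concrete, here is a minimal sketch (our own illustration, not from the paper; the array names are ours) of the decompression transformation for a codebook of K = 2 entries: the low-dimensional parameters are a small codebook plus one index per weight, and decompression is a table lookup.

```python
import numpy as np

# Hypothetical low-dimensional parameters Theta for a net with P = 6 weights:
# a codebook with K = 2 entries, plus one codebook index per weight.
codebook = np.array([-0.5, 0.5])            # K float entries
assignments = np.array([0, 1, 1, 0, 1, 0])  # P indices, each in {0, ..., K-1}

# Decompression: construct the full weight vector w from Theta by lookup.
w = codebook[assignments]
print(w)  # [-0.5  0.5  0.5 -0.5  0.5 -0.5]
```

Storing the codebook plus the (small-integer) indices takes far less space than storing P full-precision weights.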
Given this, we have the following options to construct the small model:
Direct learning: min_Θ L(Θ): find the small model with the best loss regardless of the reference model. That is, simply train the small model to minimize the loss directly, possibly using the chain rule to compute the gradient wrt Θ. This approximates the reference model indirectly, in that both models are trying to solve the same task. We can have the small model be a reparameterized version of the reference or a completely different model. Direct learning is the best thing to do sometimes but not always, as noted later.
Direct compression (DC): min_Θ ||w̄ − Δ(Θ)||²: find the closest approximation Δ(Θ) to the parameters w̄ of the reference model, using a low-dimensional parameterization Θ. This forces the small and reference models to be of the same type. Direct compression can be done simply with (lossless or lossy) compression of w̄, but it generally will not be optimal wrt the loss, since the latter is ignored in learning Θ. We discuss this later in detail.
Model compression as constrained optimization: this is our proposed approach, which we describe in section 3. It forces the small and reference models to be of the same type, by constraining the weights to be constructed from a low-dimensional parameterization w = Δ(Θ), but it must optimize the loss L(w).
Teacher-student: min_Θ ||f(·; w̄) − h(·; Θ)||: find the closest approximation h to the reference function f, in some norm. The norm over the input domain may be approximated with a sample (e.g. the training set). Here, the reference model f “teaches” the student model h. We can have the small model be a reparameterized version of the reference or a completely different model.
Most existing compression approaches fall into one of these categories. In particular, traditional compression techniques have mostly been applied in ways related to direct learning and direct compression: using low-precision weight representations through some form of rounding (see Gupta et al., 2015; Hubara et al., 2016 and references therein), even single-bit (binary) values (Fiesler et al., 1990; Courbariaux et al., 2015; Rastegari et al., 2016; Hubara et al., 2016; Zhou et al., 2016), ternary values (Hwang and Sung, 2014; Li et al., 2016; Zhu et al., 2017) or powers of two (Marchesi et al., 1993; Tang and Kwan, 1993); quantization of weight values, soft (Nowlan and Hinton, 1992; Ullrich et al., 2017) or hard (Fiesler et al., 1990; Marchesi et al., 1993; Tang and Kwan, 1993; Gong et al., 2015; Han et al., 2015); zeroing weights to achieve a sparse model (Hanson and Pratt, 1989; Weigend et al., 1991; LeCun et al., 1990; Hassibi and Stork, 1993; Reed, 1993; Yu et al., 2012; Han et al., 2015); low-rank factorization of weight matrices (Sainath et al., 2013; Denil et al., 2013; Jaderberg et al., 2014; Denton et al., 2014; Novikov et al., 2015); hashing (Chen et al., 2015); and lossless compression, such as Huffman codes (Han et al., 2016). Some papers combine several such techniques to produce impressive results, e.g. Han et al. (2016) use pruning, trained quantization and Huffman coding. Although we comment on some of these works in this paper (particularly regarding the direct compression approach), we defer detailed discussions to our companion papers on specific compression techniques (quantization in Carreira-Perpiñán and Idelbayev, 2017a and pruning in Carreira-Perpiñán and Idelbayev, 2017b, so far).
The teacher-student approach seems to have arisen in the ensemble learning literature, inspired by the desire to replace a large ensemble model (e.g. a collection of decision trees), which requires large storage and is slow at test time, with a smaller or faster model with similar classification accuracy (Zhou, 2012, section 8.5). The smaller model can be of the same type as the ensemble members, or a different type of model altogether (e.g. a neural net). The basic idea is, having trained the large ensemble on a labeled training set, to use this ensemble (the “teacher”) to label a larger, unlabeled dataset (which may be available in the task at hand, or may be generated by some form of sampling). This larger, labeled dataset is then used to train the smaller model (the “student”). The hope is that the knowledge captured by the teacher is adequately represented in the synthetically created training set (although the teacher’s mistakes will also be represented in it), and that the student can learn it well even though it is a smaller model. More generally, the approach can be applied with any teacher, not necessarily an ensemble model (e.g. a deep net), and the teacher’s labels may be transformed to improve the student’s learning, e.g. log-outputs (Ba and Caruana, 2014) or other manipulations of the output class probabilities (Hinton et al., 2015). However, from a compression point of view the results are unimpressive, with modest compression ratios (Hinton et al., 2015) or even failure to compress (using a single-layer net as student needs many more weights than the teacher deep net; Ba and Caruana, 2014). One practical problem with the teacher-student approach is that all it really does is construct the artificial training set, but it leaves the design of the student to the user, and this is a difficult model selection problem. This is unlike the compression approaches cited above, which use the teacher’s model architecture but compress its parameters.
3 A constrained optimization formulation of model compression
(Figure 1: schematic illustration of the mappings between w-space and Θ-space.)
In this work we understand compression in a mathematically specific sense that involves a learning task that we want to solve and a large model that serves as the reference to meet. It is related to the direct learning and direct compression concepts introduced above. Assume we have a large, reference model with parameters w (e.g. a neural net with inputs x and weights w) that has been trained on a loss L(w) (e.g. cross-entropy on a given training set) to solve a task (e.g. classification). That is, w̄ = argmin_w L(w), where we abuse the notation to write the loss directly over the weights rather than over the model, and the minimizer may be local. We define compression as finding a low-dimensional parameterization Δ(Θ) of w in terms of Q parameters Θ, with Q < P. This will define a compressed model with weights Δ(Θ). We seek a Θ such that its corresponding model has (locally) optimal loss. We denote this “optimal compressed” Θ and write it as Θ* and w* = Δ(Θ*) (see fig. 1).
Ordinarily, one could then solve the problem directly over Θ: min_Θ L(Δ(Θ)). This is the direct learning option in the previous section. Instead, we equivalently write model compression as a constrained optimization problem:

    min_{w,Θ} L(w)   s.t.   w = Δ(Θ).    (1)
The reason, which will be useful later to develop an optimization algorithm, is that we decouple the part of learning the task, in the objective, from the part of compressing the model, in the constraints.
By eliminating w, our formulation (1) is equivalent to direct learning min_Θ L(Δ(Θ)), so why not do that in the first place, rather than training a large model and then compressing it? In fact, direct learning using gradient-based methods (via the chain rule) may sometimes be a good option. But it is not always convenient or possible. First, if the decompression mapping Δ is not differentiable wrt Θ (as happens with quantization), then the chain rule does not apply. Second, using gradient-based methods over Θ may lead to slow convergence or be prone to local optima compared to training a large, uncompressed model (this has been empirically reported with pruning (Reed, 1993) and low-rank compression, e.g. Denil et al., 2013). Third, direct learning does not benefit from the availability of a large, well-trained model in w-space, since it operates exclusively in the low-dimensional Θ-space. Finally, in direct learning the learning-task aspects (loss and training set) are intimately linked to the compression ones (Δ and Θ), so that the design of a direct learning algorithm is specific to the combination of loss and compression technique (in our LC algorithm, both aspects separate and can be solved independently).
3.1 Compression as orthogonal projection on the feasible set
Compression and decompression are usually seen as algorithms, but here we regard them as mathematical mappings in parameter space. If

    Π(w) = argmin_Θ ||w − Δ(Θ)||²

is well defined, we call Π the compression mapping. Π behaves as the “inverse” of the decompression mapping Δ (although it is not a true inverse, because Δ ∘ Π ≠ identity). Since Q < P, there generally will be a unique minimizer Θ = Π(w) for any given w (but not necessarily). Computing Π(w) may need an algorithm (e.g. SVD for low-rank compression, k-means for quantization) or a simple formula (e.g. taking the sign of or rounding a real value). Π will usually satisfy Π(Δ(Θ)) = Θ for any Θ, i.e., decompressing Θ and then compressing the result gives back Θ. The decompression mapping Δ appears explicitly in our problem definition (1), while the compression mapping Π appears in our LC algorithm (in the C step), as we will see.

With lossless compression, Δ is bijective and Π = Δ⁻¹. With lossy compression, Δ need be neither surjective (since Q < P) nor injective (since it can have symmetries, e.g. reordering singular values in the SVD or centroids in k-means). Also, Δ need not be differentiable wrt Θ (for low-rank compression it is, for quantization it is not).

The feasible set F = {w: w = Δ(Θ) for some Θ} contains all high-dimensional models w that can be obtained by decompressing some low-dimensional model Θ. In our framework, compression is equivalent to orthogonal projection on the feasible set. Indeed, Θ = Π(w) = argmin_Θ ||w − Δ(Θ)||² is equivalent to min_{w′} ||w − w′||² s.t. w′ ∈ F, which is the problem of finding the closest feasible point to w in Euclidean distance, i.e., Δ(Π(w)) is the orthogonal projection of w on the feasible set.
3.2 Types of compression
Our framework includes many well-known types of compression:
- Low-rank compression defines W = UVᵀ, where we write the weights in matrix form, with W of m × n, U of m × r and V of n × r, and with r < min(m, n). If we learn both U and V, the compression mapping is given by the (truncated) singular value decomposition (SVD) of W. We can also use a fixed dictionary or basis (e.g. given by wavelets or the discrete cosine transform) and learn either U or V only. The compression mapping is then given by solving a linear system. We study this in a paper in preparation.
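As an illustration of the resulting compression mapping (our own sketch, not code from the paper), the closest rank-r weight matrix in Frobenius norm is given by the truncated SVD, by the Eckart-Young theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 5))   # a weight matrix, m x n
r = 2                             # target rank (the compression level)

# Compression mapping: truncated SVD of W gives the factors (U, V).
U_full, s, Vt = np.linalg.svd(W, full_matrices=False)
U = U_full[:, :r] * s[:r]         # m x r (singular values folded into U)
V = Vt[:r, :].T                   # n x r

# Decompression mapping: W_hat = U V^T is the closest rank-r matrix to W
# in Frobenius norm; the error equals the discarded singular values.
W_hat = U @ V.T
print(np.linalg.norm(W - W_hat))
```

Storing U and V takes r(m + n) numbers instead of mn, a saving whenever r is small relative to min(m, n).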
- Quantization uses a discrete mapping Δ, given by assigning each weight to one of K codebook values. If we learn both the assignments and the codebook, compression can be done by k-means. We can also use a fixed codebook, such as {−1, +1} or {−1, 0, +1}. The compression mapping is then given by a form of rounding. We study this in a separate paper (Carreira-Perpiñán and Idelbayev, 2017a).
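A minimal sketch (ours, not the paper's code) of this compression mapping: with an adaptive codebook, the C step is a k-means problem on the scalar weight values, solvable with Lloyd's algorithm.

```python
import numpy as np

w = np.array([-1.1, -0.9, 0.05, 0.95, 1.05, 1.0])  # current weight values
C = np.array([-1.0, 0.0, 1.0])                     # initial codebook, K = 3

for _ in range(10):  # Lloyd's algorithm: alternate assignments and centroids
    a = np.argmin(np.abs(w[:, None] - C[None, :]), axis=1)  # nearest entry
    for k in range(len(C)):
        if np.any(a == k):
            C[k] = w[a == k].mean()  # centroid = mean of assigned weights

w_quantized = C[a]  # decompressed (quantized) weights
print(w_quantized)
```

Here the learned parameters are the codebook C plus one index per weight, exactly the quantized reparameterization described earlier.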
- Low-precision approximation defines a constraint w_i = θ_i per weight, where w_i is real (or, say, a double-precision float) and θ_i is, say, a single-precision float. The compression mapping sets θ_i to the truncation of w_i. A particular case is binarization, where θ_i ∈ {−1, +1} and the compression mapping sets θ_i = sign(w_i). This can be seen as quantization using a fixed codebook.
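For the fixed codebook {−1, +1}, the compression mapping has a closed form; a one-line sketch (ours), breaking the tie at 0 towards +1:

```python
import numpy as np

w = np.array([0.3, -1.2, 0.0, 2.5])   # current real-valued weights
theta = np.where(w >= 0, 1.0, -1.0)   # nearest point in {-1, +1}^P per weight
print(theta)  # [ 1. -1.  1.  1.]
```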
- Pruning defines w = θ, where w is real and θ is constrained to have few nonzero values. The compression mapping involves some kind of thresholding. We study this in a separate paper (Carreira-Perpiñán and Idelbayev, 2017b).
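Assuming the constraint is "at most κ nonzeros" (the variable names below are ours), the projection keeps the κ largest-magnitude weights and zeroes the rest; a minimal sketch:

```python
import numpy as np

w = np.array([0.1, -2.0, 0.03, 1.5, -0.2])  # current weights
kappa = 2                                   # allowed number of nonzeros

theta = np.zeros_like(w)
keep = np.argsort(np.abs(w))[-kappa:]       # indices of the kappa largest |w_i|
theta[keep] = w[keep]                       # closest kappa-sparse vector to w
print(theta)  # [ 0.  -2.   0.   1.5  0. ]
```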
- Lossless compression takes many forms, such as Huffman coding, arithmetic coding or run-length encoding (Gersho and Gray, 1992), and is special in that Δ is a bijection. Hence, the direct compression Θ = Δ⁻¹(w̄) solves the problem, with no need for our LC algorithm. However, lossless compression affords limited compression power.
It is also possible to combine several compression techniques.
For any lossy compression technique, the user can choose a compression level (e.g. the rank in low-rank approaches or the codebook size in quantization approaches). Obviously, we are interested in the highest compression level that retains acceptable accuracy in the task. Note that in accounting for the size of the compressed model, we need to add two costs: storing the weights of the compressed model, and storing the decompression algorithm data (the dictionary, the codebook, the location of the nonzero values, etc.).
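As a back-of-the-envelope illustration of this accounting (our own numbers, assuming 32-bit floats and quantization of P weights with a K-entry codebook), the stored size is the per-weight indices plus the codebook itself:

```python
import math

P = 1_000_000   # number of weights in the net
K = 16          # codebook size (the compression level)

uncompressed_bits = P * 32                         # one float per weight
compressed_bits = (P * math.ceil(math.log2(K))    # one index per weight
                   + K * 32)                      # plus the codebook entries
ratio = uncompressed_bits / compressed_bits
print(f"compression ratio: {ratio:.2f}x")         # close to 32/log2(K)
```

The codebook overhead is negligible here, so the ratio is essentially 32/log2(K) = 8.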
We can be flexible in the definition of the constraints in problem (1). We need not compress all the weights (e.g. we usually do not compress the biases in a neural net); this simply means there is no constraint for each such weight. We can use different types of compression for different parts of the model (e.g. for different layers of the net); this means we have separate sets of constraints w_k = Δ_k(Θ_k), which decouple and can be run in parallel in the C step (see later). Also, there may be additional constraints in problem (1). For example, in quantization we have binary assignment vectors whose sum must equal one, or variables that must belong to a set such as {−1, +1}. The original minimization of the loss may also be constrained, e.g. if the weights should be nonnegative.
The decompression mapping Δ can take different forms. Each weight w_i may be a function of the entire Θ, which are then “shared” compression parameters. This happens with low-rank compression: W = UVᵀ with Θ = (U, V). Instead, each weight may have “shared” parameters and “private” parameters. This happens in quantization: w_i = c_{κ_i}, where C = {c_1, …, c_K} is a codebook shared by all weights and κ_i ∈ {1, …, K} is the index in the codebook that is assigned to w_i (actually, it is more convenient to express κ_i as a binary assignment vector z_i with Σ_k z_{ik} = 1; see Carreira-Perpiñán and Idelbayev, 2017a). Here, Θ = {C, κ_1, …, κ_P}.
Earlier we defined the feasible set F = {w: w = Δ(Θ) for some Θ}, which contains all high-dimensional models that can be obtained by decompressing some low-dimensional model. Good compression schemes should satisfy two desiderata:
- Achieve low compression error. Obviously, this depends on the compression level (which determines the “size” of the feasible set) and on the optimization, e.g. our LC algorithm (which adapts the parameters Θ to the task at hand as best as possible). But the form of the compression mapping (low-rank, quantization, etc.) also matters. This form should be such that every uncompressed model of interest is near some part of the feasible set, i.e., the decompression mapping Δ is “space-filling” to some extent.
- Have simple compression and decompression algorithms: fast and ideally without local optima.
3.3 Other formulations of model compression
A penalty formulation
One can define the compression problem as

    min_w L(w) + λ C(w)    (3)

where the penalty or cost function C(w) encourages w to be close to a compressed model and λ > 0 is a user parameter. Computationally, this can also be conveniently optimized by duplicating w:

    min_{w,Θ} L(w) + λ ||w − Δ(Θ)||²    (4)

and applying a penalty method and alternating optimization, so the learning part on w separates from the compression part on Θ. However, this formulation is generally less preferable than the constrained formulation (1). The reason is that the penalty does not guarantee that the optimum of (3) is exactly a compressed model, only that it is close to some compressed model, and a final, suboptimal compression step is required. For example, penalizing the deviations of the weights from a given codebook (say, {−1, +1}) will encourage the weights to cluster around codebook values, but not actually to equal them, so upon termination we must round all weights, which makes the result suboptimal.
For some types of compression a penalty formulation does produce exactly compressed models. In pruning (Carreira-Perpiñán and Idelbayev, 2017b), we want the weight vector to contain many zeros (be sparse). Using a penalty C(w) = ||w||, where ||·|| is a sparsity-inducing norm, say the ℓ1 norm, will result in a sparse weight vector. Still, in the penalty form the number of nonzeros in w is implicitly related to the value of λ, while the constraint form allows us to set the number of nonzeros directly, which is more convenient in practice.
Another constrained formulation
It is conceivable to consider the following, alternative formulation of model compression as a constrained optimization:

    min_{w,Θ} L(w)   s.t.   Θ = Π(w)    (5)

directly in terms of a well-defined compression mapping Π, rather than in terms of a decompression mapping Δ as in (1). This has the advantage that problem (5) is simply solved by setting w = w̄ (the reference model) and Θ = Π(w̄), without the need for an iterative algorithm over w and Θ (we call this “direct compression” later). Indeed, the constraint in (5) does not actually constrain w. However, this formulation is rarely useful, because the resulting compressed model may have an arbitrarily large loss. An exception is lossless compression, which satisfies Δ(Π(w)) = w, and here the optimal compressed solution can indeed be achieved by compressing the reference model directly.
4 A “Learning-Compression” (LC) algorithm
Although the constrained problem (1) can be solved with a number of nonconvex optimization algorithms, it is key to profit from parameter separability, which we achieve with penalty methods and alternating optimization, as described next.
Handling the constraints via penalty methods
Two classical penalty methods are the quadratic penalty (QP) and the augmented Lagrangian (AL) (Nocedal and Wright, 2006). In the QP, we optimize the following over the parameters (w, Θ) while driving μ → ∞:

    Q(w, Θ; μ) = L(w) + (μ/2) ||w − Δ(Θ)||².    (6)

This has the effect of gradually enforcing the constraints, and the parameters (w(μ), Θ(μ)) trace a path for μ ≥ 0. A better method is the AL. Here we optimize the following over (w, Θ) while driving μ → ∞:

    L_A(w, Θ, λ; μ) = L(w) − λᵀ(w − Δ(Θ)) + (μ/2) ||w − Δ(Θ)||²    (7)
                    = L(w) + (μ/2) ||w − Δ(Θ) − λ/μ||² − (1/(2μ)) ||λ||²    (8)

and we update the Lagrange multiplier estimates as λ ← λ − μ (w − Δ(Θ)) after optimizing over (w, Θ) for each μ. Optimizing over (w, Θ) for fixed λ is like optimizing the QP but with a shifted parameterization Δ(Θ) + λ/μ, as eq. (8) shows explicitly. The AL is equivalent to the QP if λ = 0.
Optimizing the penalized function with alternating optimization
Applying alternating optimization to the penalized function (quadratic-penalty function Q or augmented Lagrangian L_A) over w and Θ (for fixed μ) gives our “learning-compression” (LC) algorithm. The steps are as follows:

L (learning) step:

    min_w L(w) + (μ/2) ||w − Δ(Θ) − λ/μ||².

This is a regular training of the uncompressed model but with a quadratic regularization term. This step is independent of the compression type.

C (compression) step:

    min_Θ ||w − λ/μ − Δ(Θ)||²   ⟺   Θ = Π(w − λ/μ).

This means finding the best (lossy) compression of w − λ/μ (the current uncompressed model, shifted by λ/μ) in the ℓ2 sense (orthogonal projection on the feasible set), and corresponds to our definition of the compression mapping in section 3.1. This step is independent of the loss, training set and task.
Fig. 2 gives the LC algorithm pseudocode for the augmented Lagrangian (using a single iteration of the L and C steps per μ value). For the quadratic-penalty version, ignore λ or set it to zero everywhere.
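As an illustration only (not the paper's fig. 2 pseudocode), here is a toy quadratic-penalty LC loop that binarizes a least-squares model; the loss, data, schedule and all names are our own choices. The L step has a closed form for this quadratic loss, and the C step is the sign projection onto {−1, +1}^P:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 4))
y = X @ np.array([1.0, -1.0, 0.7, -0.3]) + 0.1 * rng.standard_normal(20)

def loss(v):  # the task loss L(w): least squares
    return np.sum((X @ v - y) ** 2)

def c_step(v):  # compression mapping for the fixed codebook {-1, +1}
    return np.where(v >= 0, 1.0, -1.0)

# Initialize with the reference model and its direct compression.
w = np.linalg.lstsq(X, y, rcond=None)[0]
theta = c_step(w)

mu = 1e-2
for _ in range(30):  # quadratic-penalty LC iterations
    # L step: min_w L(w) + mu/2 ||w - theta||^2  (closed form here).
    w = np.linalg.solve(2 * X.T @ X + mu * np.eye(4), 2 * X.T @ y + mu * theta)
    theta = c_step(w)  # C step: project onto the feasible set {-1, +1}^4
    mu *= 1.5          # multiplicative penalty schedule
print("compressed weights:", theta, "loss:", loss(theta))
```

As μ grows, w is pulled onto the feasible set and θ stabilizes at a validly binarized weight vector.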
Reusing existing code in the L and C steps
The L step gradually pulls w towards a model that can be obtained by decompressing some Θ, and the C step compresses the current w. Both of these steps can be done by reusing existing code rather than writing a new algorithm, which makes the LC algorithm easy to implement. The L step just needs the additive term μ (w − Δ(Θ) − λ/μ) in the gradient of the loss, e.g. in stochastic gradient descent (SGD) for neural nets. The C step depends on the compression type, but will generally correspond to a well-known compression algorithm (e.g. SVD for low-rank compression, k-means for quantization). Different types of compression can be used by simply calling a different compression routine in the C step, with no other change to the LC algorithm. This facilitates the model designer’s job of trying different types of compression to find the most suitable one for the task at hand.
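A sketch of how this additive term slots into an existing gradient-descent loop (the function and its names are ours, not the paper's):

```python
import numpy as np

def l_step_gd(w, delta_theta, lam, mu, grad_loss, lr=0.01, steps=100):
    """Sketch of an L step by plain gradient descent: the only change to an
    existing training loop is the additive term mu*(w - Delta(Theta) - lam/mu)
    in the gradient. `delta_theta` is the decompressed vector Delta(Theta)."""
    for _ in range(steps):
        w = w - lr * (grad_loss(w) + mu * (w - delta_theta - lam / mu))
    return w

# Tiny check with L(w) = 0.5 ||w||^2 (gradient w), Delta(Theta) = 0, lambda = 0:
# the L-step objective is then minimized at w = 0 and the iterates shrink to it.
w = l_step_gd(np.ones(3), np.zeros(3), np.zeros(3), mu=1.0,
              grad_loss=lambda v: v)
print(w)
```

In a real implementation, `grad_loss` would be a minibatch SGD gradient of the task loss; the quadratic term is just added to it.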
The runtime of the C step is typically negligible compared to that of the L step (which involves the actual training set, usually much larger than the number of parameters), although this depends on the type of compression and the loss. Hence, it pays to do the C step as exactly as possible. The overall runtime of the LC algorithm will be dominated by the L steps, as if we were training an uncompressed model for a longer time.
Schedule of the penalty parameter
In practice, as usual with penalty methods, we use a multiplicative schedule: μ_k = μ_0 aᵏ with a slightly larger than 1 and μ_0 small (both set by trial and error); see also section 5. As noted in the pseudocode, we run a single L and C step per μ value because this keeps the algorithm simple (we avoid an extra loop). However, in some cases it may be advantageous to run multiple L and C steps per μ value, e.g. if it is possible to cache matrix factorizations in order to speed up the L or C step.
Initialization and termination
We always initialize (w, Θ) to the reference model and its direct compression, and λ = 0; this is the exact solution for μ → 0⁺, as we show in the next section. We stop the LC algorithm when ||w − Δ(Θ)|| is smaller than a set tolerance, which will happen when μ is large enough. At this point, the final iterate satisfies w ≈ Δ(Θ), so that Δ(Θ) is a compressed model (hence a feasible weight vector), while w is not (although it will be very close to Δ(Θ)). Hence, the solution (feasible and (near-)optimal) is Θ.
In the derivation of the LC algorithm we used a quadratic penalty to penalize violations of the equality constraint w = Δ(Θ). This is convenient because it makes the L step easy with gradient-based optimization (the penalty behaves like a form of weight decay on the loss), and the C step is also easy for some compression forms (quantization can be done by k-means, low-rank approximation by the SVD). However, non-quadratic penalty forms may be convenient in other situations.
Note that the C step can also be seen as trying to predict the weights w from the low-dimensional parameters Θ via the mapping Δ. In this sense, compression is a machine learning problem of modeling “data” (the weights) using a low-dimensional space (the parameters). This was noted by Denil et al. (2013) in the context of low-rank models, but it applies generally in our framework. See also our discussion of parametric embeddings in section 6.3. In fact, model fitting itself in machine learning can be seen as compressing the training set into the model parameters.
4.2 Direct compression (DC) and the beginning of the path
In the LC algorithm, the parameters (w(μ), Θ(μ)) trace a path for μ ≥ 0, and the solution is obtained for μ → ∞, when the constraints are satisfied and we achieve an optimal compressed model. The beginning of the path, for μ → 0⁺, has a special meaning: it corresponds to the direct compression (training the reference model and then compressing its weights), as we show next. (We write μ → 0⁺ rather than μ = 0 because the latter defines a problem without Θ.)
In the limit μ → 0⁺, the L step gives w(μ) → w̄ = argmin_w L(w), since the penalty term (μ/2)||w − Δ(Θ)||² is negligible, and the C step gives Θ(μ) → Θ^DC = Π(w̄) = argmin_Θ ||w̄ − Δ(Θ)||², i.e., the orthogonal projection of w̄ on the feasible set (up to local optima in both w and Θ), recalling the discussion of section 3.1. Hence, the path starts at (w̄, Θ^DC), which corresponds to the direct compression: training the large, reference model and then compressing its weights (note that in DC we discard w̄ and keep only Θ^DC, i.e., the compressed model). This is not optimal in the sense of problem (1) because the compression ignores the learning task; the best compression of the weights need not be the best compressed model for the task.
The constrained optimization view shows that, if an optimal uncompressed model is feasible, i.e., there is a Θ with Δ(Θ) = w̄, then it is optimal compressed, since the compression has zero error, and in this case Θ* = Θ^DC (and there is no need to optimize with the LC algorithm). But, generally, compression will increase the loss, the more so the larger the compression level (so the smaller the feasible set and the larger the distance from w̄ to the DC model). Therefore, we should expect that, with low compression levels, direct compression will be near-optimal, but that as we compress more (which is our goal, and critical for actual deployment on mobile devices) it will become worse and worse in loss wrt the optimal compressed model Θ*. Hence, high compression rates require the better LC optimization. Plot 3 in figure 1 illustrates this. Indeed, the suboptimality of direct compression compared to the result of the LC algorithm becomes practically evident in experiments compressing neural nets as we push the compression level (Carreira-Perpiñán and Idelbayev, 2017a, b). In section 6.2, we discuss existing work related to direct compression.
5 Convergence results for the LC algorithm
The quadratic penalty and augmented Lagrangian methods belong to the family of homotopy (path-following) algorithms, where the minima of Q(w, Θ; μ) or L_A(w, Θ, λ; μ) define a path for μ ≥ 0 and the solution we want is at μ → ∞. We give a theorem based on the QP; similar results are possible for the AL. Assume the loss L(w) and the decompression mapping Δ(Θ) are continuously differentiable wrt their arguments, and that the loss is lower bounded.
Theorem. Consider the constrained problem (1) and its quadratic-penalty function Q(w, Θ; μ) of (6). Given a positive increasing sequence μ_k → ∞, a nonnegative sequence τ_k → 0, and a starting point (w⁰, Θ⁰), suppose the QP method finds an approximate minimizer (wᵏ, Θᵏ) of Q(w, Θ; μ_k) that satisfies ||∇_{w,Θ} Q(wᵏ, Θᵏ; μ_k)|| ≤ τ_k for k = 1, 2, … Then, lim_{k→∞} (wᵏ, Θᵏ) = (w*, Θ*), which is a KKT point for the problem (1), and its Lagrange multiplier vector has elements λ*_i = lim_{k→∞} −μ_k (wᵢᵏ − Δ_i(Θᵏ)), i = 1, …, P.
Proof. It follows by applying theorem 17.2 in Nocedal and Wright (2006), quoted in appendix A, to the constrained problem (1) and by noting 1) that the limit (w*, Θ*) exists and 2) that the constraint gradients are linearly independent. We prove these two statements in turn. First, the limit of the sequence (wᵏ, Θᵏ) exists because the loss, and hence the QP function, is lower bounded and has continuous derivatives. Second, the constraint gradients are linearly independent at any point (w, Θ) and thus, in particular, at the limit (w*, Θ*). To see this, note that the Jacobian of the constraint function w − Δ(Θ) wrt w is the P × P identity matrix, whose rank is obviously P, and so the full Jacobian wrt (w, Θ) is full-rank. ∎
Stated otherwise, the LC algorithm defines a continuous path $(\mathbf{w}(\mu),\boldsymbol{\Theta}(\mu))$ which, under some mild assumptions (essentially, that we minimize the penalty function increasingly accurately as $\mu_k \to \infty$), converges to a stationary point (typically a minimizer) of the constrained problem (1). With convex problems, there is a unique path leading to the solution. With nonconvex problems, there are multiple paths, each leading to a local optimum. As with any nonconvex continuous optimization problem, convergence may occur in pathological cases to a stationary point of the constrained problem that is not a minimizer, but such cases should be rare in practice.
Computationally, it is better to approach the solution by following the path from small $\mu$, because $Q$ (or $\mathcal{L}_A$) become progressively ill-conditioned as $\mu \to \infty$. While ideally we would follow the path closely, by increasing $\mu$ slowly from $0$ to $\infty$, in practice we follow the path loosely to reduce the runtime, typically using a multiplicative schedule for the penalty parameter, $\mu_k = \mu_0 a^k$ for $k = 0, 1, 2, \dots$, where $\mu_0 > 0$ and $a > 1$. If, after the first iteration, the iterates get stuck at the direct compression value, this is usually a sign that we increased $\mu$ too fast. The smaller $\mu_0$ and $a$, the more closely we follow the path.
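This loose path-following can be sketched in a few lines. The following is an illustrative implementation with hypothetical helper signatures (`loss_grad`, `compress`, `decompress`); it is a sketch, not the authors' code:

```python
import numpy as np

def lc_quadratic_penalty(loss_grad, compress, decompress, w0,
                         mu0=1e-3, a=1.4, n_mu=30, lstep_iters=200, lr=0.01):
    """Sketch of the LC outer loop with the multiplicative schedule
    mu_k = mu0 * a^k. The helpers are hypothetical: loss_grad(w) is the
    gradient of the loss; compress/decompress play the roles of the
    compression and decompression mappings."""
    w = w0.copy()
    theta = compress(w)                    # mu -> 0+: direct compression
    for k in range(n_mu):
        mu = mu0 * a ** k
        for _ in range(lstep_iters):       # L step: min_w L(w) + mu/2 ||w - Delta(theta)||^2
            w = w - lr * (loss_grad(w) + mu * (w - decompress(theta)))
        theta = compress(w)                # C step: project w onto the feasible set
    return w, theta
```

For instance, with a quadratic loss and a keep-one-weight compression, the iterates converge to a 1-sparse model that satisfies the constraint.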
This theorem applies to a number of common losses used in machine learning, and to various compression techniques, such as low-rank factorization, which are continuously differentiable. However, it does not apply to some popular compression forms, specifically quantization and pruning, which generally give rise to NP-complete problems. Indeed, consider one of the simplest models: least-squares linear regression, which defines a quadratic loss over the weights, whose solution is given by a linear system. Forcing the weights to be either $-1$ or $+1$ (quantization by binarization) defines a binary quadratic problem over the weights, which is NP-complete (Garey and Johnson, 1979). Forcing a given proportion of the weights to be zero (pruning) is an $\ell_0$-constrained problem, also NP-complete (Natarajan, 1995). While we cannot expect the LC algorithm to find the global optimum of these problems, we can expect reasonably good results in the following sense. 1) The LC algorithm is still guaranteed to converge to a weight vector that satisfies the constraints (having elements in $\{-1,+1\}$ or having the given proportion of elements be zero, in those examples), hence it will always converge to a validly compressed model. 2) Because the L step minimizes (partially) the loss, the convergence will likely be to a low-loss point (even if not necessarily optimal).
5.1 Choice of learning rate in the L step with large-scale optimization

(Footnote: in this section, we use the subindex $t$ to indicate iterates, such as $\mathbf{w}_t$, within the L step. These are different from the iterates $(\mathbf{w}^k,\boldsymbol{\Theta}^k)$ of the LC algorithm in theorem 5.1.)
Theorem 5.1 states that for convergence to occur, the L and C steps must be solved increasingly accurately. This is generally not a problem for the C step, as it is usually solved by an existing compression algorithm. The L step needs some consideration. The objective function over $\mathbf{w}$ in the L step has the form of the original loss plus a very simple term, a separable quadratic function:

$$L(\mathbf{w}) + \frac{\mu}{2} \|\mathbf{w} - \boldsymbol{\theta}\|^2 \qquad (13)$$

where $\boldsymbol{\theta} = \boldsymbol{\Delta}(\boldsymbol{\Theta})$ for the QP (6) and $\boldsymbol{\theta} = \boldsymbol{\Delta}(\boldsymbol{\Theta}) + \frac{1}{\mu}\boldsymbol{\lambda}$ for the AL (7). Intuitively, optimizing (13) should not be very different from optimizing the loss (which we had to do in order to obtain the reference model). Indeed, gradient-based optimization is straightforward, since the gradient of (13) simply adds $\mu(\mathbf{w} - \boldsymbol{\theta})$ to the gradient of the loss. Many optimization algorithms can be used to solve this depending on the form of $L$, such as a modified Newton method with line searches, or even solving a linear system if $L$ is quadratic. However, for large-scale problems (large datasets and/or large dimension of $\mathbf{w}$) we are particularly interested in gradient-based optimization without line searches, such as gradient descent with a fixed step size, or stochastic gradient descent (SGD) with learning rates (step sizes) satisfying a Robbins-Monro schedule. Convergence without line searches requires certain conditions on the step size, and these must be corrected to account for the fact that the quadratic $\mu$-term increases as $\mu$ increases, since then the gradient term $\mu(\mathbf{w} - \boldsymbol{\theta})$
also increases, and this can cause problems (such as large, undesirable jumps in early epochs of each new L step if using SGD). We consider two cases of optimizing the L step: using gradient descent with a fixed step size, and using SGD.
5.1.1 Optimization of a convex loss using gradient descent with a fixed step size
As is well known from convex optimization arguments (see proofs in appendix B.1), if the loss $L$ is convex differentiable with Lipschitz continuous gradient with Lipschitz constant $\bar{L}$, then training the reference model can be done by gradient descent with a fixed step size $\eta \in (0, 2/\bar{L})$. In the L step, the objective function (13) is strictly convex and therefore it has a unique minimizer. Gradient descent on (13) with a fixed step size $\eta \in (0, 2/(\bar{L}+\mu))$ converges to the minimizer linearly from any initial point. Hence, we simply need to adjust the step size from less than $2/\bar{L}$ in the reference model to less than $2/(\bar{L}+\mu)$ in the L step. Although the step size becomes smaller as $\mu$ increases, the convergence becomes faster. The reason is that the objective function becomes closer to a separable quadratic function, whose optimization is easier.
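This step-size adjustment is easy to check numerically on a toy quadratic loss (made-up constants, a sketch under the stated assumptions):

```python
import numpy as np

# L step objective f(w) = 1/2 w^T A w + mu/2 ||w - theta||^2,
# with Lbar = lambda_max(A) = 4. Fixed step size below 2/(Lbar + mu).
A = np.diag([4.0, 1.0])
theta = np.array([1.0, -1.0])
mu = 10.0
step = 1.0 / (4.0 + mu)            # valid, since 1/(Lbar+mu) < 2/(Lbar+mu)
w = np.array([5.0, 5.0])
for _ in range(500):
    w = w - step * (A @ w + mu * (w - theta))
w_exact = np.linalg.solve(A + mu * np.eye(2), mu * theta)   # unique minimizer
```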
5.1.2 Optimization using stochastic gradient descent (SGD)
With neural nets, which involve a nonconvex loss and a large dataset, SGD is the usual optimization procedure. The loss takes the form $L(\mathbf{w}) = \sum_{n=1}^{N}{L_n(\mathbf{w})}$, where $L_n$ is the loss for the $n$th training point and $N$ is large, so it is too costly to evaluate the gradient exactly. It is practically preferable to take approximate gradient steps based on evaluating the gradient at step $t$ on a minibatch $\mathcal{B}_t \subset \{1,\dots,N\}$, hence each step can be seen as a gradient step using a noisy gradient:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \sum_{n \in \mathcal{B}_t}{\nabla L_n(\mathbf{w}_t)}.$$
The convergence theorems for this stochastic setting are very different from those assuming exact gradients, most notably in the requirement that the step sizes (learning rates) $\eta_t$ must tend to zero, at a certain speed, as $t \to \infty$, which we call a Robbins-Monro schedule:

$$\eta_t > 0, \qquad \sum_{t=0}^{\infty}{\eta_t} = \infty, \qquad \sum_{t=0}^{\infty}{\eta_t^2} < \infty.$$
Appendix B.2 gives detailed theorems with sufficient conditions for convergence (to a stationary point of the loss) in the case where the noise is deterministic (theorem B.7), in particular for the incremental gradient algorithm (theorem B.8), and when the noise is stochastic (theorem B.9). These conditions include Lipschitz continuity of the loss gradient, a condition on the noise to be small enough, and that the learning rates satisfy a Robbins-Monro schedule. The convergence rate is sublinear and much slower than with exact gradients (discussed in the previous section). In practice, the schedule typically has the form $\eta_t = \frac{a}{b+t}$, where $a, b > 0$ are determined by trial-and-error on a subset of the data. Unfortunately, the convergence theory for SGD is not very practical: apart from the conditions on the loss and the noise (which can be hard to verify but are mild), all the theory tells us is that using a Robbins-Monro schedule will lead to convergence. However, the performance of SGD is very sensitive to the schedule and selecting it well is of great importance. Indeed, SGD learning of neural nets is notoriously tricky and considerable trial and error in setting its hyperparameters (learning rate, momentum rate, minibatch size, etc.) is unavoidable in practice.
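As a small illustration (a toy least-squares loss with illustrative, made-up constants), SGD with an $\eta_t = a/(b+t)$ schedule behaves as the theory suggests:

```python
import numpy as np

# Toy SGD run with a Robbins-Monro schedule eta_t = a/(b + t).
# Loss: sum_n 1/2 (w - y_n)^2, whose minimizer is the mean of the y_n.
rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=1000)
w = 0.0
a, b = 1.0, 10.0
for t in range(20000):
    n = rng.integers(len(y))
    w = w - (a / (b + t)) * (w - y[n])    # noisy single-sample gradient step
```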
Consider now the objective function (13) of the L step. The SGD updates take the form:

$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \Big( \sum_{n \in \mathcal{B}_t}{\nabla L_n(\mathbf{w}_t)} + \mu (\mathbf{w}_t - \boldsymbol{\theta}) \Big).$$
Our concern is, given a good Robbins-Monro schedule $(\eta_t)$ for the reference model (i.e., for optimizing $L(\mathbf{w})$ alone), should the schedule be modified for optimizing (13), and if so how? In theory, no change is needed, because the only condition that the convergence theorems require of the schedule is that it be Robbins-Monro. In practice, this requires some consideration. The addition of the $\mu$-term to the loss has two effects. First, since its exact gradient is fast to compute, the noise in the gradient of (13) will (usually) be smaller, which is a good thing for convergence purposes. Second, and this is the practical problem, since $\mu$ increases towards infinity, its gradient $\mu(\mathbf{w} - \boldsymbol{\theta})$ becomes progressively large. (This makes the situation different from weight decay, which uses a similar objective function, of the form $L(\mathbf{w}) + \frac{\lambda}{2}\|\mathbf{w}\|^2$, but where $\lambda$ is fixed and usually very small.) Hence, the early weight updates in the L step (which use a larger learning rate) may considerably perturb $\mathbf{w}$, sending it away from the initial $\mathbf{w} = \boldsymbol{\Delta}(\boldsymbol{\Theta})$ (provided by warm-start from the previous iteration’s C step). While convergence to some minimizer (or stationary point in general) of (13) is still guaranteed with a Robbins-Monro schedule, this may be much slower and occur to a different local minimizer. This makes the overall optimization unstable and should be avoided.
We can solve these problems by using a schedule that both satisfies the Robbins-Monro conditions and would lead to convergence of the $\mu$-term alone: the clipped schedule

$$\eta'_t = \min{\Big( \eta_t, \frac{1}{\mu} \Big)}, \qquad t = 0, 1, 2, \dots$$

That is, the first iterations will use a learning rate $\frac{1}{\mu}$, and then switch to the original schedule (when $\eta_t < \frac{1}{\mu}$). The justification is provided by the following two theorems.
Theorem 5.2. Consider minimizing the function $f(\mathbf{w}) = \frac{\mu}{2}\|\mathbf{w} - \boldsymbol{\theta}\|^2$ using gradient descent with a fixed step size $\eta$, i.e., we iterate $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \nabla f(\mathbf{w}_t)$ for $t = 0, 1, 2, \dots$ and some initial $\mathbf{w}_0$. If $\mathbf{w}_0 \ne \boldsymbol{\theta}$, then that sequence converges linearly to the minimizer $\boldsymbol{\theta}$ iff $0 < \eta < \frac{2}{\mu}$. Also, $\eta = \frac{1}{\mu}$ converges to the minimizer in one iteration.
Proof. A gradient descent step with step size $\eta$ is $\mathbf{w}_{t+1} = \mathbf{w}_t - \eta\mu(\mathbf{w}_t - \boldsymbol{\theta})$. So $\eta = \frac{1}{\mu}$ yields $\mathbf{w}_{t+1} = \boldsymbol{\theta}$. And $\mathbf{w}_{t+1} - \boldsymbol{\theta} = (1 - \eta\mu)(\mathbf{w}_t - \boldsymbol{\theta})$, which converges linearly to zero iff $|1 - \eta\mu| < 1$. ∎
Theorem 5.3. If $(\eta_t)$ is a Robbins-Monro schedule and $c > 0$, then the clipped schedule $\eta'_t = \min{(\eta_t, c)}$ is also a Robbins-Monro schedule.

Proof. We have by assumption that a) $\eta_t > 0$, b) $\sum_{t=0}^{\infty}{\eta_t} = \infty$ and c) $\sum_{t=0}^{\infty}{\eta_t^2} < \infty$. Obviously, $\eta'_t > 0$. From c) we have that $\eta_t \to 0$, so there exists a $t_0$ such that $\eta_t < c$, and hence $\eta'_t = \eta_t$, for all $t \ge t_0$. Hence $\sum_{t=0}^{\infty}{\eta'_t} \ge \sum_{t \ge t_0}{\eta_t} = \infty$. Finally, since $0 < \eta'_t \le \eta_t$, i.e., the sequence $(\eta'^2_t)$ is majorized by the sequence $(\eta_t^2)$, then $\sum_{t=0}^{\infty}{\eta'^2_t} < \infty$. ∎
Theorem 5.2 tells us we should use learning rates below $\frac{2}{\mu}$ and suggests using $\frac{1}{\mu}$ (particularly as $\mu$ increases, since then the $\mu$-term becomes dominant). Theorem 5.3 guarantees that clipping a Robbins-Monro schedule remains Robbins-Monro. Hence, the clipped schedule $\eta'_t = \min{(\eta_t, \frac{1}{\mu})}$ makes sure that the initial, larger updates do not exceed $\frac{1}{\mu}$ (the optimal step size for the $\mu$-term), and otherwise leaves the schedule unchanged. This then ensures that the first steps do not unduly perturb the initial $\mathbf{w}$, while convergence to a minimum of (13) is guaranteed since the schedule is Robbins-Monro and (13) has Lipschitz continuous gradient if $L$ does (as long as the noise condition in the convergence theorems holds).
In a nutshell, our practical recommendation is as follows: first, we determine by trial and error a good schedule $(\eta_t)$ for the reference model (i.e., which drives the weight vector close to a local minimizer of the loss as fast as possible). Then, we use the clipped schedule $\eta'_t = \min{(\eta_t, \frac{1}{\mu})}$ in the L step for each value of $\mu$. We have done this in experiments on various compression forms (Carreira-Perpiñán and Idelbayev, 2017a, b) and found it effective.
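A toy sketch of one L step run with SGD and the clipped schedule $\eta'_t = \min(\eta_t, 1/\mu)$, on a one-dimensional least-squares loss (all constants are made up for illustration):

```python
import numpy as np

# One L step with SGD and the clipped schedule eta'_t = min(eta_t, 1/mu).
rng = np.random.default_rng(1)
y = rng.normal(1.0, 1.0, size=500)        # loss: average of 1/2 (w - y_n)^2
theta = 0.8                               # decompressed value from the C step
mu = 50.0
w = theta                                 # warm start at the C-step output
for t in range(5000):
    eta = min(1.0 / (1.0 + t), 1.0 / mu)  # clipped Robbins-Monro schedule
    n = rng.integers(len(y))
    w = w - eta * ((w - y[n]) + mu * (w - theta))
```

The clipping keeps the first steps at $1/\mu$, so the iterate is not thrown far from its warm start while it converges to the minimizer of the L-step objective.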
6 Relation of the LC algorithm with other algorithms
6.1 One algorithm, many compression types
We emphasize that the specific form of our LC algorithm follows necessarily from the definition of the compression technique in the constraints of problem (1). Some work on neural net compression is based on modifying the usual training procedure (typically backpropagation with SGD) by manipulating the weights on the fly in some ad-hoc way, such as binarizing or pruning them, but this usually has no guarantee that it will solve problem (1), or converge at all. In our framework, the LC algorithm (specifically, the C step) follows necessarily from the constraints that define the form of compression in (1). For example, for quantization (Carreira-Perpiñán and Idelbayev, 2017a) and low-rank compression the C step results in $k$-means and the SVD, respectively, because the C step optimization “$\min_{\boldsymbol{\Theta}}{\|\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})\|^2}$” takes the form of a quadratic distortion problem in both cases. For pruning using the $\ell_0$ norm (Carreira-Perpiñán and Idelbayev, 2017b), the optimization in the C step results in a form of weight magnitude thresholding. There is no need for any ad-hoc decisions in the algorithm.
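The three C steps just mentioned are easy to sketch. The following are illustrative numpy implementations (sketches, not the papers' code):

```python
import numpy as np

def c_step_prune(w, kappa):
    """l0 pruning: keep the kappa largest-magnitude weights."""
    theta = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-kappa:]
    theta[idx] = w[idx]
    return theta

def c_step_lowrank(W, r):
    """Low-rank compression: rank-r truncated SVD of a weight matrix."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]

def c_step_quantize(w, K, iters=50):
    """Quantization: 1D k-means (Lloyd's algorithm) on the weights."""
    centers = np.linspace(w.min(), w.max(), K)
    for _ in range(iters):
        assign = np.argmin(np.abs(w[:, None] - centers[None, :]), axis=1)
        for k in range(K):
            if np.any(assign == k):        # skip empty clusters
                centers[k] = w[assign == k].mean()
    return centers[assign]
```

Each one solves the same quadratic distortion problem $\min_{\boldsymbol{\Theta}}{\|\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})\|^2}$, just over a different feasible set.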
6.2 Direct compression and retraining approaches
Our formulation of what an optimal solution to the model compression problem is, together with the form of the LC algorithm, allows us to put some earlier work into context.
6.2.1 Direct compression approaches
Direct compression (DC) consists of training the reference model and then compressing its weights. As shown in section 4.2, DC corresponds to the beginning of the iterates’ path in the LC algorithm, and is suboptimal, that is, it does not produce the compressed model with lowest loss. That said, direct compression is an appealing approach: it is an obvious thing to do, it is simple, and it is fast, because it does not require further training of the reference model (and hence no further access to the training set). Indeed, particular instances of DC corresponding to particular compression techniques have been recently applied to compress neural nets (although presumably much earlier attempts exist in the literature). These include quantizing the weights of the reference net with $k$-means (Gong et al., 2015), pruning the weights of the reference net by zeroing small values (Reed, 1993) or reducing the rank of the weight matrices of the reference net using the SVD (Sainath et al., 2013; Jaderberg et al., 2014; Denton et al., 2014). However, the LC algorithm is nearly as simple as direct compression: it can be seen as iterating the direct compression but with a crucial link between the L and C steps, the $\frac{\mu}{2}\|\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})\|^2$ term. Practically, this is not much slower, given that we have to train the reference model anyway. Since the C steps are usually negligible compared to the L steps, the LC algorithm behaves like training the reference model for a longer time.
6.2.2 Retraining after direct compression
As we mentioned in section 4.2, the result of direct compression can have a large loss, particularly for large compression levels. A way to correct for this partially is to retrain the compressed model, and this has been a standard approach with neural net pruning (Reed, 1993). Here, we first train the reference net and then prune its weights in some way, e.g. by thresholding small-magnitude weights. This gives the direct compression model if using sparsity-inducing norms (see Carreira-Perpiñán and Idelbayev, 2017b). Finally, we optimize the loss again but only over the remaining, unpruned weights. This reduces the loss, often considerably. However, it loses the advantage of DC of not having to retrain the net (which requires access to the training set), and it is still suboptimal, since generally the set of weights that were pruned is not the set that would give the lowest loss. The LC algorithm consistently beats retraining for pruning, particularly for higher compression levels (see Carreira-Perpiñán and Idelbayev, 2017b).
6.2.3 Iterated direct compression (iDC)
Imagine we iterate the direct compression procedure. That is, we optimize $L(\mathbf{w})$ to obtain $\overline{\mathbf{w}}$ and then compress it into $\boldsymbol{\Theta}^{DC}$. Next, we optimize $L(\mathbf{w})$ again but initializing $\mathbf{w}$ from $\boldsymbol{\Delta}(\boldsymbol{\Theta}^{DC})$, and then we compress it; etc. Our argument in section 4.2 implies that nothing would change after the first DC and we would simply cycle between $\overline{\mathbf{w}}$ and $\boldsymbol{\Theta}^{DC}$ forever. In fact, several DC iterations may be needed for this to happen, for two reasons. 1) With local optima of $L(\mathbf{w})$, we might converge to a different optimum after the compression (see fig. 1 plot 2). However, sooner rather than later this will end up cycling between a local optimum of $L(\mathbf{w})$ and its compressed model. Still, this improves over the DC optimum. 2) A more likely reason in practice is inexact compression or learning steps. This implies the iterates never fully reach either $\overline{\mathbf{w}}$ or $\boldsymbol{\Theta}^{DC}$ or both, and keep oscillating forever in between. This is particularly the case when training neural nets with stochastic gradient descent (SGD), for which converging to high accuracy requires long training times.
We call the above procedure “iterated direct compression (iDC)”. A version of this for quantization has been proposed recently (“trained quantization”, Han et al., 2015), although without the context that our constrained optimization framework provides. In our experiments elsewhere (Carreira-Perpiñán and Idelbayev, 2017a), we verify that neither DC nor iDC converge to a local optimum of problem (1), while our LC algorithm does.
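A toy sketch of the cycling argument, with exact L and C steps (hypothetical helpers; the "exact" optimizer here trivially returns the unique minimizer of the toy loss):

```python
import numpy as np

# Toy iDC: with an exact L step (a unique global minimizer wbar) and an exact
# C step, "optimize, then compress" starts cycling immediately.
wbar = np.array([1.0, 0.9])               # unique minimizer of the toy loss

def optimize(w_init):                     # exact L step ignores its start
    return wbar.copy()

def compress(w):                          # keep the largest-magnitude weight
    t = np.zeros_like(w)
    i = np.argmax(np.abs(w))
    t[i] = w[i]
    return t

history = []
w = np.zeros(2)
for _ in range(5):
    w = optimize(w)
    theta = compress(w)
    history.append(theta.copy())
    w = theta                             # warm-start the next optimization
```

Every iteration returns the same compressed model: there is no mechanism that links the loss to the compression, unlike the $\mu$-term of the LC algorithm.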
6.3 Other algorithms beyond model compression
6.3.1 The method of auxiliary coordinates (MAC)
Overall, we derive our LC algorithm by applying the following design pattern to solve the compression problem: 1) introducing auxiliary variables in eq. (1), 2) handling the constraints via penalty methods (QP or AL) in eqs. (6)–(7), and 3) optimizing the penalized function using alternating optimization over the original variables and the auxiliary variables in eqs. (9)–(10). This design pattern is similar to that used in the method of auxiliary coordinates (MAC) for optimizing nested systems such as deep nets (Carreira-Perpiñán and Wang, 2012, 2014), i.e., involving functions of the form $\mathbf{f}(\mathbf{x}) = \mathbf{f}_K(\dots \mathbf{f}_2(\mathbf{f}_1(\mathbf{x};\mathbf{W}_1);\mathbf{W}_2) \dots;\mathbf{W}_K)$, where $\mathbf{x}$ is an input data point and $\mathbf{W}_1,\dots,\mathbf{W}_K$ are trainable parameters (e.g. weights in a deep net). Here, one introduces auxiliary coordinates per data point of the form $\mathbf{z}_{k,n} = \mathbf{f}_k(\mathbf{z}_{k-1,n};\mathbf{W}_k)$ (where $\mathbf{z}_{0,n} = \mathbf{x}_n$), for $k = 1,\dots,K$ and $n = 1,\dots,N$. Then, handling these constraints with a penalty method and applying alternating optimization yields the final algorithm. This alternates a “minimization” step that optimizes single-layer functions independently of each other (over the $\mathbf{W}_k$) with a “coordination” step that optimizes over the auxiliary coordinates (the $\mathbf{z}_{k,n}$) independently for each data point. Hence, in MAC the auxiliary coordinates are per data point and capture intermediate function values within the nested function, while in our LC algorithm the auxiliary variables duplicate the parameters of the model in order to separate learning from compression.
6.3.2 Parametric embeddings
MAC and the LC algorithm become identical in one interesting case: parametric embeddings. A nonlinear embedding algorithm seeks to map a collection of high-dimensional data points $\mathbf{y}_1,\dots,\mathbf{y}_N$ of $\mathbb{R}^D$ to a collection of low-dimensional projections $\mathbf{x}_1,\dots,\mathbf{x}_N$ of $\mathbb{R}^d$ with $d < D$, such that distances or similarities between pairs of points $\mathbf{y}_n$, $\mathbf{y}_m$ are approximately preserved between the corresponding pairs of projections $\mathbf{x}_n$, $\mathbf{x}_m$. Examples of nonlinear embeddings are spectral methods such as multidimensional scaling (Borg and Groenen, 2005) or Laplacian eigenmaps (Belkin and Niyogi, 2003), and truly nonlinear methods such as stochastic neighbor embedding (SNE) (Hinton and Roweis, 2003), $t$-SNE (van der Maaten and Hinton, 2008) or the elastic embedding (Carreira-Perpiñán, 2010). For example, the elastic embedding optimizes:

$$E(\mathbf{X}) = \sum_{n,m=1}^{N}{w_{nm} \|\mathbf{x}_n - \mathbf{x}_m\|^2} + \lambda \sum_{n,m=1}^{N}{e^{-\|\mathbf{x}_n - \mathbf{x}_m\|^2}}$$

where $w_{nm}$ defines the similarity between $\mathbf{y}_n$ and $\mathbf{y}_m$ (the more positive $w_{nm}$ is, the more similar $\mathbf{y}_n$ and $\mathbf{y}_m$ are). Therefore, the first term attracts similar points, the second term repels all points, and the optimal embedding balances both forces (depending on the tradeoff parameter $\lambda > 0$). In a parametric embedding, we wish to learn a projection mapping $\mathbf{F}$ rather than the projections themselves (so we can use $\mathbf{F}$ to project a new point $\mathbf{y}$ as $\mathbf{x} = \mathbf{F}(\mathbf{y})$). For the elastic embedding this means optimizing the following (where $\mathbf{F}(\cdot;\boldsymbol{\Theta})$ is a parametric mapping with parameters $\boldsymbol{\Theta}$, e.g. a linear mapping or a neural net):

$$\min_{\boldsymbol{\Theta}}{\, E\big( \mathbf{F}(\mathbf{y}_1;\boldsymbol{\Theta}),\dots,\mathbf{F}(\mathbf{y}_N;\boldsymbol{\Theta}) \big)}.$$

To optimize this using MAC (Carreira-Perpiñán and Vladymyrov, 2015), we recognize the above as a nested mapping and introduce auxiliary coordinates $\mathbf{x}_n = \mathbf{F}(\mathbf{y}_n;\boldsymbol{\Theta})$ for $n = 1,\dots,N$. The QP function is

$$E(\mathbf{X}) + \frac{\mu}{2} \sum_{n=1}^{N}{\|\mathbf{x}_n - \mathbf{F}(\mathbf{y}_n;\boldsymbol{\Theta})\|^2}$$

and alternating optimization yields the following two steps. Over $\mathbf{X}$, it has the form of a nonlinear embedding with a quadratic regularization:

$$\min_{\mathbf{X}}{\, E(\mathbf{X}) + \frac{\mu}{2} \sum_{n=1}^{N}{\|\mathbf{x}_n - \mathbf{F}(\mathbf{y}_n;\boldsymbol{\Theta})\|^2}}.$$

Over $\boldsymbol{\Theta}$, it has the form of a regression problem with inputs $\mathbf{y}_n$ and outputs $\mathbf{x}_n$:

$$\min_{\boldsymbol{\Theta}}{\, \sum_{n=1}^{N}{\|\mathbf{x}_n - \mathbf{F}(\mathbf{y}_n;\boldsymbol{\Theta})\|^2}}.$$
We can see this as an LC algorithm for model compression if we regard the embedding $\mathbf{X}$ as the uncompressed model (so the optimal embedding $\overline{\mathbf{X}}$ is the reference model) and $\boldsymbol{\Theta}$ (or equivalently the projection mapping $\mathbf{F}$) as the compressed model. MAC and the LC algorithm coincide because in a parametric embedding each data point ($\mathbf{y}_n$) is associated with one parameter vector ($\mathbf{x}_n$). The decompression mapping is $\boldsymbol{\Delta}(\boldsymbol{\Theta}) = \big( \mathbf{F}(\mathbf{y}_1;\boldsymbol{\Theta}),\dots,\mathbf{F}(\mathbf{y}_N;\boldsymbol{\Theta}) \big)$, which (approximately) recovers the uncompressed model by applying the projection mapping to the high-dimensional dataset. The compression step finds optimally the parameters $\boldsymbol{\Theta}$ of $\mathbf{F}$ via a regression fit. The learning step learns the regularized embedding $\mathbf{X}$. “Direct compression” (called “direct fit” in Carreira-Perpiñán and Vladymyrov, 2015) fits $\mathbf{F}$ directly to the reference embedding $\overline{\mathbf{X}}$, which is suboptimal, and corresponds to the beginning of the path in the MAC or LC algorithm. Hence, in this view, parametric embeddings can be seen as compressed nonlinear embeddings.
7 Compression, generalization and model selection
In this paper we focus exclusively on compression as a mechanism to find a model having minimal loss and belonging to a set of compressed models, as precisely formulated in problem (1). However, generalization is an important aspect of compression, and we briefly discuss this.
Compression can also be seen as a way to prevent overfitting, since it aims at obtaining a smaller model with a similar loss to that of a well-trained reference model. This was noted early in the literature of neural nets; in particular, pruning weights or neurons was seen as a way to explore different network architectures (see Reed, 1993, and Bishop, 1995, ch. 9.5). Soft weight-sharing (Nowlan and Hinton, 1992), a form of soft quantization of the weights of a neural net, was proposed as a regularizer to make a network generalize better. More recently, weight binarization schemes have also been seen as regularizers (Courbariaux et al., 2015).
Many recent papers, including our own work (Carreira-Perpiñán and Idelbayev, 2017a, b), report experimentally that the training and/or test error of compressed models is lower than that of the reference (as long as the compression level is not too large). Some papers interpret this as an improvement in the generalization ability of the compressed net. While this is to some extent true, there is a simpler reason for this (which we note in section 6.2.3) that surely accounts for part of this error reduction: the reference model was not trained well enough, so that the continued training that happens while compressing reduces the error. This will generally be unavoidable in practice with large neural nets, because the difficulty in training them accurately will mean the reference model is close to being optimal, but never exactly optimal.
Model selection consists of determining the model type and size that achieves best generalization for a given task. It is a difficult problem in general, but much more so with neural nets because of the many factors that determine their architecture: number of layers, number of hidden units, type of activation function (linear, sigmoidal, rectified linear, maxpool…), type of connectivity (dense, convolutional…), etc. This results in an exponential number of possible architectures. Compression can be seen as a shortcut to model selection, as follows. Instead of finding an approximately optimal architecture for the problem at hand by an expensive trial-and-error search, one can train a reference net that overestimates the necessary size of the architecture (with some care to control overfitting). This gives a good estimate of the best performance that can be achieved in the problem. Then, one compresses this reference using a suitable compression scheme and a desired compression level, say pruning $P\%$ of the weights or quantizing the weights using $b$ bits. Then, what our LC algorithm does is automatically search a subset of models of a given size (corresponding to the compression level). For example, the $\ell_0$-based pruning mechanism of Carreira-Perpiñán and Idelbayev (2017b) uses a single parameter $\kappa$ (the total number of nonzero weights in the entire net) but implicitly considers all possible pruning levels for each layer of the net. This is much easier on the network designer than having to test multiple combinations of the number of hidden units in each layer. By running the LC algorithm at different compression levels $\kappa$, one can determine the smallest model that achieves a target loss that is good enough for the application at hand. In summary, a good approximate strategy for model selection in neural nets is to train a large enough reference model and compress it as much as possible.
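This strategy can be sketched as a loop over compression levels (a toy separable loss, where exact pruning reduces to magnitude thresholding; illustrative values only):

```python
import numpy as np

# Pick the smallest pruned model meeting a target loss by sweeping kappa.
wbar = np.array([2.0, 1.0, 0.1, 0.05])     # toy reference model

def loss(w):
    return 0.5 * np.sum((w - wbar) ** 2)

def best_pruned(kappa):
    t = np.zeros_like(wbar)
    idx = np.argsort(np.abs(wbar))[-kappa:] # optimal support for this separable loss
    t[idx] = wbar[idx]
    return t

target = 0.01
for kappa in range(1, len(wbar) + 1):       # try the smallest models first
    if loss(best_pruned(kappa)) <= target:
        break
```

In a real application `best_pruned` would be one run of the LC algorithm at level $\kappa$, and `loss` a validation error; the search structure stays the same.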
8 Conclusion

We have described a general framework to obtain compressed models with optimal task performance based on casting compression, usually understood as a procedure, as constrained optimization in model parameter space. This accommodates many existing compression techniques and also sheds light on previous approaches that were derived procedurally and do not converge to an optimal compressed model, even if they are effective in practice. We have also given a general “learning-compression” (LC) algorithm, provably convergent under standard assumptions. The LC algorithm reuses two kinds of existing algorithms as a black box, independently of each other: in the L step, a learning algorithm for the task loss (such as SGD on the cross-entropy), which requires the training set, and whose form is independent of the compression technique; and in the C step, a compression algorithm on the model parameters (such as $k$-means or the SVD), whose form depends on the compression technique but is independent of the loss and training set. The L and C steps follow mathematically from the definition of the constrained problem; for example, the C step for quantization and low-rank compression results in $k$-means and the SVD, respectively, because the C step takes the form of a quadratic distortion problem in both cases. A model designer can try different losses or compression techniques by simply calling the appropriate routine in the L or C step.
In companion papers, we develop this framework for specific compression mechanisms and report experimental results that often exceed or are comparable with the published state of the art, but with the additional advantages of generality, simplicity and convergence guarantees. Because of this, we think our framework may be a useful addition to neural net toolboxes. Our framework also opens further research avenues that we are actively exploring.
Acknowledgments

Work supported by NSF award IIS–1423515. I thank Yerlan Idelbayev (UC Merced) for useful discussions.
Appendix A A convergence theorem for the quadratic-penalty method
Consider the equality-constrained problem

$$\min_{\mathbf{x}}{f(\mathbf{x})} \quad \text{s.t.} \quad c_i(\mathbf{x}) = 0, \; i \in \mathcal{E} \qquad (23)$$

where $f: \mathbb{R}^n \to \mathbb{R}$ is continuously differentiable and $c_i: \mathbb{R}^n \to \mathbb{R}$, $i \in \mathcal{E}$, are equality constraints, also continuously differentiable. Define the quadratic-penalty function

$$Q(\mathbf{x};\mu) = f(\mathbf{x}) + \frac{\mu}{2} \sum_{i \in \mathcal{E}}{c_i^2(\mathbf{x})}$$

where $\mu > 0$ is the penalty parameter. Assume we are given an increasing sequence $(\mu_k) \to \infty$, a nonnegative sequence of tolerances $(\tau_k)$ with $\tau_k \to 0$, and a starting point $\mathbf{x}^0$. The quadratic-penalty method works by finding, at each iterate $k = 1, 2, \dots$, an approximate minimizer $\mathbf{x}^k$ of $Q(\mathbf{x};\mu_k)$, starting at $\mathbf{x}^{k-1}$ and terminating when $\|\nabla_{\mathbf{x}}\, Q(\mathbf{x};\mu_k)\| \le \tau_k$.
Theorem A.1. Suppose that the tolerances and penalty parameters satisfy $\tau_k \to 0$ and $\mu_k \to \infty$. Then, if a limit point $\mathbf{x}^*$ of the sequence $(\mathbf{x}^k)$ is infeasible, it is a stationary point of the function $\|\mathbf{c}(\mathbf{x})\|^2$. On the other hand, if a limit point $\mathbf{x}^*$ is feasible and the constraint gradients $\nabla c_i(\mathbf{x}^*)$, $i \in \mathcal{E}$, are linearly independent, then $\mathbf{x}^*$ is a KKT point for the problem (23). For such points, we have for any infinite subsequence $\mathcal{K}$ such that $\lim_{k \in \mathcal{K}}{\mathbf{x}^k} = \mathbf{x}^*$ that $\lim_{k \in \mathcal{K}}{-\mu_k c_i(\mathbf{x}^k)} = \lambda^*_i$, for all $i \in \mathcal{E}$, where $\boldsymbol{\lambda}^*$ is the multiplier vector that satisfies the KKT conditions for the problem (23).
Proof. See Nocedal and Wright (2006, theorem 17.2). ∎
Note that the QP method does not use the Lagrange multipliers in any way; the fact that $-\mu_k c_i(\mathbf{x}^k)$ tends to the Lagrange multiplier $\lambda^*_i$ for constraint $i$ is a byproduct of the fact that the iterates converge to a KKT point. The AL method improves over the QP precisely by capitalizing on the availability of those estimates of the Lagrange multipliers.
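A minimal numpy sketch of the QP method on a toy problem, $\min x_1 + x_2$ s.t. $x_1^2 + x_2^2 - 2 = 0$, whose solution is $(-1,-1)$ (all constants made up):

```python
import numpy as np

# QP method: approximately minimize Q(x; mu_k) for an increasing mu_k,
# warm-starting each inner solve at the previous iterate.
def f_grad(x):
    return np.ones(2)          # gradient of f(x) = x1 + x2

def c(x):
    return x @ x - 2.0         # single equality constraint

def c_grad(x):
    return 2.0 * x

x = np.array([0.5, -0.5])
mu = 1.0
for k in range(40):
    for _ in range(2000):      # inner gradient descent on Q(x; mu)
        x = x - (0.01 / mu) * (f_grad(x) + mu * c(x) * c_grad(x))
    mu *= 1.5                  # increase the penalty parameter
```

Note how the inner step size shrinks with $\mu$, reflecting the growing ill-conditioning of $Q(\cdot;\mu)$; the warm start is what keeps each inner solve cheap.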
Appendix B Learning rates for the L step: theorems and proofs
B.1 Optimization of a convex loss using gradient descent with a fixed step size
First we present a few well-known results about gradient-based optimization for convex functions, with a short proof if possible, and then apply them to our L step objective function (13).
B.1.1 Convergence theorems
A function $f: \mathbb{R}^n \to \mathbb{R}$ is convex iff $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ and $\forall \alpha \in [0, 1]$: $f(\alpha \mathbf{x} + (1-\alpha)\mathbf{y}) \le \alpha f(\mathbf{x}) + (1-\alpha) f(\mathbf{y})$ (and strictly convex if the inequality is strict for $\mathbf{x} \ne \mathbf{y}$ and $\alpha \in (0,1)$). Let $f$ be convex and differentiable. We say $f$ is strongly convex with constant $m > 0$ if $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^n$: $f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x}) + \frac{m}{2}\|\mathbf{y} - \mathbf{x}\|^2$. A function $\mathbf{g}$ is Lipschitz continuous with Lipschitz constant $M > 0$ if $\forall \mathbf{x}, \mathbf{y}$: $\|\mathbf{g}(\mathbf{x}) - \mathbf{g}(\mathbf{y})\| \le M \|\mathbf{x} - \mathbf{y}\|$. All norms are Euclidean in this section. Most of the statements apply if the convex function is defined on a convex subset of $\mathbb{R}^n$. For further details, see a standard reference such as Nesterov (2004).
Let $\mathbf{f}$ and $\mathbf{g}$ be Lipschitz continuous with respective Lipschitz constants $M_f$ and $M_g$. Then $\mathbf{f} + \mathbf{g}$ is Lipschitz continuous with Lipschitz constant $M_f + M_g$.
Proof. $\forall \mathbf{x}, \mathbf{y}$: $\|(\mathbf{f}+\mathbf{g})(\mathbf{x}) - (\mathbf{f}+\mathbf{g})(\mathbf{y})\| \le \|\mathbf{f}(\mathbf{x}) - \mathbf{f}(\mathbf{y})\| + \|\mathbf{g}(\mathbf{x}) - \mathbf{g}(\mathbf{y})\| \le (M_f + M_g)\|\mathbf{x} - \mathbf{y}\|$, by applying the triangle inequality. ∎
Let $f$ be differentiable. Then $f$ is convex if and only if $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^n$: $f(\mathbf{y}) \ge f(\mathbf{x}) + \nabla f(\mathbf{x})^T (\mathbf{y} - \mathbf{x})$.
($\Rightarrow$) Let $\alpha \in (0, 1]$. Since $f$ is convex, we have $\forall \mathbf{x}, \mathbf{y} \in \mathbb{R}^n$: