The widespread application of deep neural nets in recent years has seen explosive growth in the size of training sets, the number of parameters of the nets, and the amount of computing power needed to train them. At present, very deep neural nets with many millions of weights are common in computer vision and speech applications. Many of these applications are particularly useful in small devices, such as mobile phones, cameras or other sensors, which have limited computation, memory and communication bandwidth, and short battery life. It then becomes desirable to compress a neural net so that its memory storage is smaller and/or its runtime is faster and consumes less energy.
Neural net compression was a problem of interest already in the early days of neural nets, driven for example by the desire to implement neural nets in VLSI circuits. However, the current wave of deep learning work has resulted in a flurry of papers by many academic and particularly industrial labs proposing various ways to compress deep nets, some new and some not so new (see related work). Various standard forms of compression have been used in one way or another, such as low-rank decomposition, quantization, binarization, pruning and others. In this paper we focus on quantization, where the ordinarily unconstrained, real-valued weights of the neural net are forced to take values within a codebook with a finite number of entries. This codebook can be adaptive, so that its entries are learned together with the quantized weights, or (partially) fixed, which includes specific approaches such as binarization, ternarization or powers-of-two approaches.
Among compression approaches, quantization is of great interest because even crudely quantizing the weights of a trained net (for example, reducing the precision from double to single) produces considerable compression with little degradation of the loss of the task at hand (say, classification). However, this ignores the fact that the quantization is not independent of the loss, and indeed achieving a really low number of bits per weight (even just 1 bit, i.e., binary weights) would incur a large loss and make the quantized net unsuitable for practical use. Previous work has applied a quantization algorithm to a previously trained, reference net, or incorporated ad-hoc modifications to the basic backpropagation algorithm during training of the net. However, none of these approaches are guaranteed to produce upon convergence (if convergence occurs at all) a net that has quantized weights and has optimal loss among all possible quantized nets.
In this paper, our primary objectives are: 1) to provide a mathematically principled statement of the quantization problem that involves the loss of the resulting net, and 2) to provide an algorithm that can solve that problem up to local optima in an efficient and convenient way. Our starting point is a recently proposed formulation of the general problem of model compression as a constrained optimization problem (Carreira-Perpiñán, 2017). We develop this for the case where the constraints represent the optimal weights as coming from a codebook. This results in a “learning-compression” (LC) algorithm that alternates SGD optimization of the loss over real-valued weights (but with a quadratic regularization term) with quantization of the current real-valued weights. The quantization step takes a form that follows necessarily from the problem definition without ad-hoc decisions: k-means for adaptive codebooks, and an optimal assignment for fixed codebooks such as binarization, ternarization or powers-of-two (with possibly an optimal global scale). We then show experimentally that we can compress deep nets considerably more than previous quantization algorithms—often, all the way to the maximum possible compression, a single bit per weight, without significant error degradation.
2 Related work on quantization of neural nets
Much work exists on compressing neural nets, using quantization, low-rank decomposition, pruning and other techniques, see Carreira-Perpiñán (2017) and references therein. Here we focus exclusively on work based on quantization. Quantization of neural net weights was recognized as an important problem early in the neural net literature, often with the goal of efficient hardware implementation, and has received much attention recently. The main approaches are of two types. The first one consists of using low-precision, fixed-point or other weight representations through some form of rounding, even single-bit (binary) values. This can be seen as quantization using a fixed codebook (i.e., with predetermined values). The second approach learns the codebook itself as a form of soft or hard adaptive quantization. There is also work on using low-precision arithmetic directly during training (see Gupta et al., 2015 and references therein) but we focus here on work whose goal is to quantize a neural net of real-valued, non-quantized weights.
2.1 Quantization with a fixed codebook
Work in the 1980s and 1990s explored binarization, ternarization and general powers-of-two quantization (Fiesler et al., 1990; Marchesi et al., 1993; Tang and Kwan, 1993). These same quantization forms have been revisited in recent years (Hwang and Sung, 2014; Courbariaux et al., 2015; Rastegari et al., 2016; Hubara et al., 2016; Li et al., 2016; Zhou et al., 2016; Zhu et al., 2017), with impressive results on large neural nets trained on GPUs, but not much innovation algorithmically. The basic idea in all these papers is essentially the same: to modify backpropagation so that it encourages binarization, ternarization or some other form of quantization of the neural net weights. The modification involves evaluating the gradient of the loss at the quantized weights (using a specific quantization or “rounding” operator that maps a continuous weight to a quantized one) but applying the update (gradient or SGD step) to the continuous (non-quantized) weights. Specific details vary, such as the quantization operator or the type of codebook; the latter has recently seen a plethora of minor variations (Hwang and Sung, 2014; Courbariaux et al., 2015; Rastegari et al., 2016; Zhou et al., 2016; Li et al., 2016; Zhu et al., 2017).
One important problem with these approaches is that their modification of backpropagation is ad-hoc, without guarantees of converging to a net with quantized weights and low loss, or of converging at all. Consider binarization to {−1, +1} for simplicity. The gradient is computed at a binarized weight vector, of which there are a finite number (2^P, corresponding to the hypercube corners), and none of these will in general have gradient zero. Hence training will never stop, and the iterates will oscillate indefinitely. Practically, this is stopped after a certain number of iterations, at which time the weight distribution is far from binarized (see fig. 2 in Courbariaux et al., 2015), so a drastic binarization must still be done. Given these problems, it is surprising that these techniques do seem to be somewhat effective empirically in quantizing the weights while achieving little loss degradation, as reported in the papers above. Exactly how effective they are, on what type of nets, and why, remains an open research question.
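The modified-backpropagation scheme described above can be sketched on a made-up one-layer problem; the quadratic loss, target vector and learning rate below are illustrative assumptions, not taken from any of the cited papers. The gradient is evaluated at the binarized weights, but the update is applied to the continuous ones, and the continuous weights never approach ±1:

```python
import numpy as np

# Toy loss L(w) = 0.5*||w - t||^2 with a hypothetical target t;
# its gradient is w - t.
t = np.array([0.7, -0.3])

def grad(w):
    return w - t

w = np.array([0.1, 0.1])        # continuous ("real-valued") weights
lr = 0.1
for _ in range(100):
    wb = np.sign(w)             # quantization ("rounding") operator
    g = grad(wb)                # gradient evaluated at the *binarized* weights
    w = w - lr * g              # ...but update applied to the *continuous* weights

# The iterates oscillate: the continuous weights stay far from {-1, +1},
# so a drastic binarization must still be applied at the end.
far_from_binary = np.max(np.abs(np.abs(w) - 1.0))
```

Because no binarized corner has zero gradient, the sign pattern keeps flipping and the continuous weights hover near the decision boundary instead of converging.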
In our LC algorithm, the optimization essentially happens in the continuous weight space by minimizing a well-defined objective (the penalized function in the L step), but this is regularly corrected by a quantization operator (C step), so that the algorithm gradually converges to a truly quantized weight vector while achieving a low loss (up to local optima). The form of both L and C steps, in particular of the quantization operator (our compression mapping Π), follows in a principled, optimal way from the constrained form of the problem (1). That is, given a desired form of quantization (e.g. binarization), the form of the C step is determined, and the overall algorithm is guaranteed to converge to a valid (binary) solution.
Also, we emphasize that there is little practical reason to use certain fixed codebooks, such as {−1, +1} or {−1, 0, +1}, instead of an adaptive codebook such as {c1, c2} with c1, c2 ∈ ℝ. The latter is obviously less restrictive, so it will incur a lower loss. And its hardware implementation is about as efficient: to compute a scalar product of an activation vector with a quantized weight vector, all we require is to sum the activation values for each centroid and to do two floating-point multiplications (with c1 and c2). Indeed, our experiments in section 5.3 show that using an adaptive codebook with K = 2 clearly beats using {−1, +1}.
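The efficiency argument can be made concrete with a small NumPy sketch, assuming a hypothetical adaptive codebook of K = 2 centroids: the scalar product with the quantized weight vector needs only one activation sum per centroid and two multiplications, regardless of the number of weights.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 8
a = rng.standard_normal(P)            # activation vector
c = np.array([-0.37, 0.52])           # hypothetical adaptive codebook, K = 2
kappa = rng.integers(0, 2, size=P)    # assignment of each weight to a centroid

# Naive scalar product with the decompressed (quantized) weights:
w = c[kappa]
ref = a @ w

# Equivalent computation: sum the activations per centroid,
# then do two floating-point multiplications.
s0 = a[kappa == 0].sum()
s1 = a[kappa == 1].sum()
fast = c[0] * s0 + c[1] * s1
```

Both computations give the same result; the second avoids materializing the full weight vector and uses only K multiplications.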
2.2 Quantization with an adaptive codebook
Quantization with an adaptive codebook is, obviously, more powerful than with a fixed codebook, even though it has to store the codebook itself. Quantization using an adaptive codebook has also been explored in the neural nets literature, using approaches based on soft quantization (Nowlan and Hinton, 1992; Ullrich et al., 2017) or hard quantization (Fiesler et al., 1990; Marchesi et al., 1993; Tang and Kwan, 1993; Gong et al., 2015; Han et al., 2015), and we discuss this briefly.
Given a set of real-valued elements (scalars or vectors), in adaptive quantization we represent (“quantize”) each element by exactly one entry in a codebook. The codebook and the assignment of values to codebook entries should minimize a certain distortion measure, such as the squared error. Learning the codebook and assignment is done by an algorithm, possibly approximate (such as k-means for the squared error). Quantization is related to clustering and often one can use the same algorithm for both (e.g. k-means), but the goal is different: quantization seeks to minimize the distortion rather than to model the data as clusters. For example, a set of values uniformly distributed in [0, 1] shows no clusters but may be subject to quantization for compression purposes. In our case of neural net compression, we have an additional peculiarity that complicates the optimization: the quantization and the weight values themselves should be jointly learned to minimize the loss of the net on the task.
Two types of clustering exist, hard and soft clustering. In hard clustering, each data point is assigned to exactly one cluster (e.g. k-means clustering). In soft clustering, we have a probability distribution over points and clusters (e.g. Gaussian mixture clustering). Likewise, two basic approaches exist for neural net quantization, based on hard and soft quantization. We review each next.
In hard quantization, each weight is assigned to exactly one codebook value. This is the usual meaning of quantization. It is a difficult problem because, even if the loss is differentiable over the weights, the assignment makes the problem inherently combinatorial. Previous work (Gong et al., 2015; Han et al., 2015) has run a quantization step (k-means) as a postprocessing step on a reference net (which was trained to minimize the loss). This is suboptimal in that it does not learn the weights, codebook and assignment jointly. We call this “direct compression” and discuss it in more detail in section 3.4. Our LC algorithm does learn the weights, codebook and assignment jointly, and converges to a local optimum of problem (1).
In soft quantization, the assignment of values to codebook entries is based on a probability distribution. This was originally proposed by Nowlan and Hinton (1992) as a way to share weights softly in a neural net with the goal of improving generalization, and has been recently revisited with the goal of compression (Ullrich et al., 2017). The idea is to penalize the loss with the negative log-likelihood of a Gaussian mixture (GM) model on the scalar weights of the net. This has the advantage of being differentiable and of coadapting the weights and the GM parameters (proportions, means, variances). However, it does not uniquely assign each weight to one mean; in fact, the resulting distribution of weights is far from quantized. It simply encourages the creation of Gaussian clusters of weights, and one has to assign weights to means as a postprocessing step, which is suboptimal. The basic problem is that a GM is a good model (better than k-means) for noisy or uncertain data, but that is not what we have here. Quantizing the weights for compression implies a constraint that certain weights must take exactly the same value, without noise or uncertainty, while optimizing the loss. We seek an optimal assignment that is truly hard, not soft. Indeed, a GM prior is to quantization what a quadratic prior (i.e., weight decay) is to sparsity: a quadratic prior encourages all weights to be small but does not encourage some weights to be exactly zero, just as a GM prior encourages weights to form Gaussian clusters but not to become groups of identical weights.
3 Neural net quantization as constrained optimization and the “learning-compression” (LC) algorithm
As noted in the introduction, compressing a neural net optimally means finding the compressed net that has (locally) lowest loss. Our first goal is to formulate this mathematically in a way that is amenable to nonconvex optimization techniques. Following Carreira-Perpiñán (2017), we define the following model compression as constrained optimization problem:

    min_{w,Θ} L(w)   s.t.   w = Δ(Θ),   (1)
where w ∈ ℝ^P are the real-valued weights of the neural net, L(w) is the loss to be minimized (e.g. cross-entropy for a classification task on some training set), and the constraint w = Δ(Θ) indicates that the weights must be the result of decompressing a low-dimensional parameter vector Θ. The decompression mapping Δ corresponds to quantization and will be described in section 4. Problem (1) is equivalent to the unconstrained problem “min_Θ L(Δ(Θ))”, but this is nondifferentiable with quantization (where Δ is a discrete mapping), and introducing the auxiliary variable w will lead to a convenient algorithm.
Our second goal is to solve this problem via an efficient algorithm. Although this might be done in different ways, a particularly simple one was proposed by Carreira-Perpiñán (2017) that achieves separability between the data-dependent part of the problem (the loss) and the data-independent part (the weight quantization). First, we apply a penalty method to solve (1). We consider here the augmented Lagrangian method (Nocedal and Wright, 2006), where λ ∈ ℝ^P are the Lagrange multiplier estimates (all norms are ℓ2 throughout the paper unless indicated otherwise):

    L_A(w, Θ, λ; μ) = L(w) − λᵀ(w − Δ(Θ)) + (μ/2) ‖w − Δ(Θ)‖².   (2)
The augmented Lagrangian method works as follows. For fixed λ, we optimize L_A(w, Θ, λ; μ) over (w, Θ) accurately enough. Then, we update the Lagrange multiplier estimates as λ ← λ − μ(w − Δ(Θ)). Finally, we increase μ. We repeat this process and, in the limit as μ → ∞, the iterates (w, Θ) tend to a local KKT point (typically, a local minimizer) of the constrained problem (1). A simpler but less effective penalty method, the quadratic-penalty method, results from setting λ = 0 throughout; we do not describe it explicitly, see Carreira-Perpiñán (2017).
Finally, in order to optimize L_A over (w, Θ), we use alternating optimization. This gives rise to the following two steps:
L step: learning

    min_w L(w) + (μ/2) ‖w − Δ(Θ) − (1/μ)λ‖².

This involves optimizing a regularized version of the loss, which pulls the optimizer towards the currently quantized weights. For neural nets, it can be solved with stochastic gradient descent (SGD).
C step: compression (here, quantization)

    Θ = Π(w − (1/μ)λ) = argmin_Θ ‖w − (1/μ)λ − Δ(Θ)‖².

We describe this in section 4. Solving this problem is equivalent to optimally quantizing the current real-valued weights w − (1/μ)λ, and can be seen as finding their orthogonal projection on the feasible set of quantized nets.
This algorithm was called the “learning-compression” (LC) algorithm by Carreira-Perpiñán (2017).
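The alternation above can be sketched end-to-end on a toy instance where both steps have closed form. The quadratic loss, target vector and penalty schedule below are made-up illustrative assumptions; the quantization is binarization, so the C step simply rounds each shifted weight to the nearest entry of {−1, +1}:

```python
import numpy as np

# Toy instance: quadratic loss L(w) = 0.5*||w - t||^2 (hypothetical target t)
# and a fixed codebook {-1, +1}, so the C step rounds to the nearest sign.
t = np.array([0.7, -1.4, 0.2])

mu, a = 1e-3, 1.5          # penalty parameter and its multiplicative schedule
lam = np.zeros_like(t)     # Lagrange multiplier estimates
w = t.copy()               # L-step variable, initialized at the reference net
theta = np.sign(w)         # C-step variable, initialized at direct compression

for _ in range(60):
    # L step: min_w L(w) + mu/2 * ||w - theta - lam/mu||^2 (closed form here,
    # since the loss is quadratic; with a neural net this would be SGD).
    w = (t + mu * theta + lam) / (1.0 + mu)
    # C step: project w - lam/mu onto the feasible set {-1, +1}^P.
    theta = np.sign(w - lam / mu)
    # Multiplier update, then increase mu.
    lam = lam - mu * (w - theta)
    mu *= a

# In the limit, the two weight vectors coincide and the net is truly binarized.
```

As μ grows, the real-valued weights w and the quantized weights θ ∈ {−1, +1}^P converge to each other, and the final θ is the binarization with lowest loss for this toy problem.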
We note that, throughout the optimization, there are two weight vectors that evolve simultaneously and coincide in the limit as μ → ∞: w (or, more precisely, w − (1/μ)λ) contains real-valued, non-quantized weights (and this is what the L step optimizes over); and Δ(Θ) contains quantized weights (and Θ is what the C step optimizes over). In the C step, Δ(Θ) is the projection of the current w − (1/μ)λ on the feasible set of quantized vectors. In the L step, w optimizes the loss while being pulled towards the current Δ(Θ).
The formulation (1) and the LC algorithm have two crucial advantages. The first one is that we get a convenient separation between learning and quantization which allows one to solve each step by reusing existing code. The data-dependent part of the optimization is confined within the L step. This part is the more computationally costly, requiring access to the training set and the neural net, and usually implemented in a GPU using SGD. The data-independent part of the optimization, i.e., the compression of the weights (here, quantization), is confined within the C step. This needs access only to the vector of current, real-valued weights (not to the training set or the actual neural net).
The second advantage is that the form of the C step is determined by the choice of quantization form (defined by Δ), and the algorithm designer need not worry about modifying backpropagation or SGD in any way for convergence to a valid solution to occur. For example, if a new form of quantization were discovered and we wished to use it, all we have to do is put it in the decompression mapping form w = Δ(Θ) and solve the compression mapping problem (5) (which depends only on the quantization technique, and for which a known algorithm may exist). This is unlike much work in neural net quantization, where various, somewhat arbitrary quantization or rounding operations are incorporated in the usual backpropagation training (see section 2), which makes it unclear what problem the overall algorithm is optimizing, if it does optimize anything at all.
In section 4, we solve the compression mapping problem (5) for the adaptive and fixed codebook cases. For now, it suffices to know that it will involve running k-means with an adaptive codebook and a form of rounding with a fixed codebook.
3.1 Geometry of the neural net quantization problem
Figure 1:
Plots 1–3 (top row): illustration of the uncompressed model space (w-space ℝ^P), the contour lines of the loss L (green lines), and the set of compressed models (the feasible set F_w, grayed areas), for a generic compression technique Δ. The Θ-space is not shown. w̄ optimizes L but is infeasible (no Θ can decompress into it). The direct compression w^DC = Δ(Π(w̄)) is feasible but not optimally compressed (not optimal in the feasible set). w* is optimally compressed. Plot 2 shows two local optima w̄₁ and w̄₂ of the loss L, and their respective DC points (the contour lines are omitted to avoid clutter). Plot 3 shows several feasible sets, corresponding to different compression levels (smaller codebooks give more compression).
Plots 4–5 (bottom row): illustration when Δ corresponds to quantization, in the particular case of a codebook of size K = 1 and a 2-weight net, so w = (w₁, w₂) and Θ = c ∈ ℝ. Plot 4 is the joint (w, c) space and plot 5 is its projection in w-space (as in plot 1). In plot 4, the black line is the feasible set F, corresponding to the constraints w₁ = c and w₂ = c. In plot 5, the black line is the feasible set F_w, corresponding to the constraint w₁ = w₂. The red line is the quadratic-penalty method path, which for this simple case is a straight line segment from the beginning of the path to the solution. We mark three points: blue represents the reference net w̄ at the DC codebook (the beginning of the path); red is the solution (the end of the path); and white is the direct compression point w^DC.
Problem (1) can be written as min_{w,Θ} L(w) s.t. (w, Θ) ∈ F, where the objective function is the loss on the real-valued weights and the feasible set on w and the low-dimensional parameters Θ is:

    F = {(w, Θ): w = Δ(Θ)}.

We also define the feasible set in w-space:

    F_w = {w ∈ ℝ^P: w = Δ(Θ) for some Θ},

which contains all high-dimensional models w that can be obtained by decompressing some low-dimensional model Θ. Fig. 1 (plots 1–3) illustrates the geometry of the problem in general.
Solving the C step requires minimizing ‖w − Δ(Θ)‖² over Θ (where we write w instead of w − (1/μ)λ for simplicity of notation):

    Π(w) = argmin_Θ ‖w − Δ(Θ)‖².
We call Δ the decompression mapping and Π the compression mapping. In quantization, these have the following meaning:
Θ consists of the codebook (if the codebook is adaptive) and the assignments of weights to codebook entries. The assignments can be encoded as 1-of-K vectors or directly as indices in {1, …, K} for a codebook with K entries.
The decompression mapping Δ uses the codebook and assignments as a lookup table to generate a real-valued but quantized weight vector w = Δ(Θ). This vector is used in the L step as a regularizer.
The compression mapping Π optimally learns a codebook and assignments given a real-valued, non-quantized weight vector (using k-means or a form of rounding, see section 4). All the C step does is apply the compression mapping.
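The decompression mapping is literally a table lookup, which in NumPy is a single indexing operation (the codebook values and assignments below are made-up):

```python
import numpy as np

C = np.array([-0.5, 0.1, 0.9])        # hypothetical codebook, K = 3
kappa = np.array([2, 0, 0, 1, 2])     # assignments, one index per weight (P = 5)

# Decompression mapping Delta(Theta): look each weight up in the codebook.
w = C[kappa]
```

Storing `kappa` (2 bits per weight here) plus the small codebook is what yields the compression.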
As shown by Carreira-Perpiñán (2017), the compression mapping finds the orthogonal projection of w on the feasible set F_w, which we call w^DC = Δ(Π(w)).
For quantization, the geometry of the constrained optimization formulation is as follows. The feasible set F_w can be written as the union of a combinatorial number of linear subspaces (each containing the origin), where w is of the form (c_{κ(1)}, …, c_{κ(P)}) for a fixed assignment κ. Each such subspace defines a particular assignment of the P weights to the K centroids. There are K^P possible assignments. If we knew the optimal assignment, the feasible set would be a single linear subspace, and the weights could be eliminated (using w_i = c_{κ(i)}) to yield an unconstrained objective over the K tunable centroids (“shared weights” in neural net parlance), which would be simple to optimize. What makes the problem hard is that we do not know the optimal assignment. Depending on the dimensions P and K, these subspaces may look like lines, planes, etc., always passing through the origin in ℝ^P. Geometrically, the union of these subspaces is a feasible set with both a continuous structure (within each subspace) and a discrete one (the number of subspaces is finite but very large).
Fig. 1 (plots 4–5) shows the actual geometry for the case of a net with P = 2 weights and a codebook with K = 1 centroid. This can be exactly visualized in 3D because the assignment variables are redundant and can be eliminated: the problem is min L(w₁, w₂) s.t. w₁ = c, w₂ = c. The compression mapping is easily seen to be c = Π(w) = (w₁ + w₂)/2, and Δ(c) = (c, c) is indeed the orthogonal projection of (w₁, w₂) onto the diagonal line w₁ = w₂ in w-space (the feasible set). This particular case is, however, misleading in that the constraints involve a single linear subspace rather than the union of a combinatorial number of subspaces. It can be solved simply and exactly by setting w₁ = w₂ = c and eliminating variables into L(c, c).
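The projection claim in this toy case is easy to verify numerically (the weight values below are made-up):

```python
import numpy as np

# Toy case from the text: P = 2 weights, K = 1 centroid, so the feasible
# set is the diagonal line w1 = w2 = c.
w = np.array([0.8, 0.2])

c = w.mean()                     # compression mapping: c = (w1 + w2)/2
w_quant = np.array([c, c])       # decompression: Delta(c) = (c, c)

# Delta(c) equals the orthogonal projection of w onto the diagonal direction:
diag = np.array([1.0, 1.0]) / np.sqrt(2.0)
proj = (w @ diag) * diag
```

Here `c = 0.5` and both constructions give the point (0.5, 0.5) on the diagonal.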
3.2 Convergence of the LC algorithm
Convergence of the LC algorithm to a local KKT point (theorem 5.1 in Carreira-Perpiñán, 2017) is guaranteed for smooth problems (continuously differentiable loss L and decompression mapping Δ) if μ → ∞ and optimization of the penalty function (2) is done accurately enough for each μ. However, in quantization the decompression mapping Δ is discrete, given by a lookup table, so the theorem does not apply.
In fact, neural net quantization is an NP-complete problem even in simple cases. For example, consider least-squares linear regression with weights in {−1, +1}. This corresponds to binarization of a single-layer, linear neural net. The loss is quadratic, so the optimization problem is a binary quadratic problem over the weights, which is NP-complete (Garey and Johnson, 1979). However, the LC algorithm will still converge to a “local optimum” in the same sense that the k-means algorithm is said to converge to a local optimum: the L step cannot improve given the C step, and vice versa. While this will generally not be the global optimum of problem (1), it will be a good solution in that the loss will be low (because the L step continuously minimizes it in part), and the LC algorithm is guaranteed to converge to a weight vector that satisfies the quantization constraints (e.g. weights in {−1, +1} for binarization). Our experiments confirm the effectiveness of the LC algorithm for quantization, consistently outperforming other approaches over a range of codebook types and sizes.
3.3 Practicalities of the LC algorithm
As usual with path-following algorithms, ideally one would follow the path of iterates closely until μ → ∞, by increasing the penalty parameter μ slowly. In practice, in order to reduce computing time, we increase μ more aggressively by following a multiplicative schedule μ_k = μ₀ a^k with μ₀ > 0 and a > 1. However, it is important to use a small enough μ₀ that allows the algorithm to explore the solution space before committing to specific assignments for the weights.
The L step with a large training set typically uses SGD. As recommended by Carreira-Perpiñán (2017), we use a clipped schedule for the learning rates of the form η′_t = min(η_t, 1/μ_k), where t is the epoch index and {η_t} is a learning-rate schedule for the reference net (i.e., for μ = 0). This ensures convergence and avoids erratic updates as μ becomes large.
We initialize w = w̄ (the reference net) and Θ = Π(w̄) (its direct compression), which is the exact solution for μ → 0⁺, as we show in the next section. We stop the LC algorithm when ‖w − Δ(Θ)‖ is smaller than a set tolerance, i.e., when the real-valued and quantized weights are nearly equal. We take as solution the quantized weights w = Δ(Θ), given by the codebook and assignments in Θ.
The runtime of the C step is negligible compared to that of the L step. With a fixed codebook, the C step is a simple assignment per weight. With an adaptive codebook, the C step runs k-means, each iteration of which is linear on the number of weights P. The number of iterations that k-means runs is a few tens in the first C step (initialized by k-means++ on the reference weights) and just about one in subsequent C steps (because k-means is warm-started), as seen in our experiments. So the runtime is dominated by the L steps, i.e., by optimizing the loss.
3.4 Direct compression and iterated direct compression
The quadratic-penalty and augmented-Lagrangian methods define a path of iterates (w(μ), Θ(μ)) for μ ≥ 0 that converges to a local solution as μ → ∞. The beginning of this path is of special importance, and was called direct compression (DC) by Carreira-Perpiñán (2017). Taking the limit μ → 0⁺ and assuming an initial λ = 0, we find that w = argmin_w L(w) = w̄ and Θ = Π(w̄). Hence, this corresponds to training a reference, non-quantized net and then quantizing it regardless of the loss (or, equivalently, projecting w̄ on the feasible set). As illustrated in fig. 1, this is suboptimal (i.e., it does not produce the compressed net with lowest loss), more so the farther the reference is from the feasible set. This will happen when the feasible set is small, i.e., when the codebook size K is small (so the compression level is high). Indeed, our experiments show that for large K (around 32 bits/weight) DC is practically identical to the result of the LC algorithm, but as K decreases (e.g. 1 to 4 bits/weight) the loss of DC becomes larger and larger than that of the LC algorithm.
A variation of direct compression consists of “iterating” it, as follows. We first optimize the loss L to obtain the reference net w̄ and then quantize it with k-means into Θ^DC = Π(w̄). Next, we optimize L again, initializing from Δ(Θ^DC), and then we compress the result; etc. This was called “iterated direct compression (iDC)” by Carreira-Perpiñán (2017). iDC should not improve at all over DC if the loss optimization were exact and there were a single optimum: it would simply cycle forever between the reference weights w̄ and the DC weights Δ(Θ^DC). However, in practice iDC may improve somewhat over DC, for two reasons. 1) With local optima of L, we might converge to a different optimum after the quantization step (see fig. 1, plot 2). However, at some point this will end up cycling between some reference net (some local optimum of L) and its quantized net. 2) In practice, SGD-based optimization of the loss with large neural nets is approximate; we stop SGD well before it has converged. This implies the iterates never fully reach w̄, and keep oscillating forever somewhere between w̄ and Δ(Θ^DC).
DC and iDC have in fact been proposed recently for quantization, although without the context that our constrained optimization framework provides. Gong et al. (2015) applied k-means to quantize the weights of a reference net, i.e., DC. The “trained quantization” of Han et al. (2015) tries to improve over this by iterating the process, i.e., iDC. In our experiments, we verify that neither DC nor iDC converges to a local optimum of problem (1), while our LC algorithm does.
4 Solving the C step: compression by quantization
The C step consists of solving the optimization problem of eq. (8): Π(w) = argmin_Θ ‖w − Δ(Θ)‖², where w is a vector of real-valued weights. This is a quadratic distortion (or least-squares error) problem, which results from selecting a quadratic penalty in the augmented Lagrangian (2). It is possible to use other penalties (e.g. the ℓ1 norm), but the quadratic penalty gives rise to simpler optimization problems, and we focus on it in this paper. We now describe how to write quantization as a mapping Δ in parameter space and how to solve the optimization problem (8).
Quantization consists of approximating real-valued vectors in a training set by vectors in a codebook. Since in our case the vectors are the weights of a neural net, we will write the training set as w₁, …, w_P. Although in practice with neural nets we quantize scalar weight values directly (not weight vectors), we develop the formulation using vector quantization, for generality. Hence, if we use a codebook with K entries, the number of bits used to store each weight vector is log₂ K.
We consider two types of quantization: using an adaptive codebook, where we learn the optimal codebook for the training set; and using a fixed codebook, which is then not learned (although we will consider learning a global scale).
4.1 Adaptive codebook
The decompression mapping is a table lookup w_i = c_{κ(i)} for each weight vector in the codebook C = {c₁, …, c_K}, where κ: {1, …, P} → {1, …, K} is a discrete mapping that assigns each weight vector to one codebook vector. The compression mapping results from finding the best (in the least-squares sense) codebook and mapping for the “dataset” w₁, …, w_P, i.e., from solving the optimization problem

    min_{C,Z} Σ_{i=1}^P Σ_{k=1}^K z_{ik} ‖w_i − c_k‖²   s.t.   z_{ik} ∈ {0, 1},  Σ_{k=1}^K z_{ik} = 1,  i = 1, …, P,   (9)

which we have rewritten equivalently using binary assignment variables Z = (z_{ik}). This follows by writing z_{ik} = 1 if k = κ(i) and z_{ik} = 0 otherwise, and verifying by substituting these values that Σ_{k=1}^K z_{ik} c_k = c_{κ(i)}.
So in this case the low-dimensional parameters are Θ = (C, Z), the decompression mapping can be written elementwise as w_i = c_{κ(i)} for i = 1, …, P, and the compression mapping results from running the k-means algorithm. The low-dimensional parameters are of two types: the assignments Z are “private” (each weight w_i has its own κ(i)), and the codebook C is “shared” by all weights. In the pseudocode of fig. 2, we write the optimally quantized weights as Δ(Π(w)).
Problem (9) is the well-known quadratic distortion problem (Gersho and Gray, 1992). It is NP-complete and it is typically solved approximately by k-means using a good initialization, such as that of k-means++ (Arthur and Vassilvitskii, 2007). As is well known, k-means is an alternating optimization algorithm that iterates the following two steps: in the assignment step we update the assignments independently given the centroids (codebook); in the centroid step we update the centroids independently by setting them to the mean of their assigned points. Each iteration reduces the distortion or leaves it unchanged. The algorithm converges in a finite number of iterations to a local optimum where Z cannot improve given C and vice versa.
In practice with neural nets we quantize scalar weight values directly, i.e., each w_i is a real value. Computationally, k-means is considerably faster with scalar values than with vectors. If the vectors have dimension D, with P data points and K centroids, each iteration of k-means takes O(PKD) runtime because of the assignment step (the centroid step is O(PD), by scanning through the points and accumulating each mean incrementally). But in dimension D = 1, each iteration can be done exactly in O(P log K), by using a binary search over the sorted centroids in the assignment step, which then takes O(K log K) for sorting and O(P log K) for assigning, for a total of O(P log K) (since K ≤ P).
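The fast scalar assignment step can be sketched with NumPy's binary search, using made-up weights and centroids, and checked against the brute-force O(PK) assignment:

```python
import numpy as np

def assign_scalar(w, centroids):
    """Assignment step of scalar k-means in O(P log K): binary search over
    the sorted centroids, then pick the nearer of the two bracketing ones.
    Returns assignments into the *sorted* codebook, and that codebook."""
    c = np.sort(centroids)
    idx = np.searchsorted(c, w)               # insertion positions, O(P log K)
    idx = np.clip(idx, 1, len(c) - 1)         # keep a valid (left, right) pair
    left, right = c[idx - 1], c[idx]
    nearer_right = np.abs(right - w) < np.abs(w - left)
    return np.where(nearer_right, idx, idx - 1), c

w = np.array([-0.9, -0.1, 0.05, 0.4, 1.2])        # hypothetical scalar weights
assignments, c = assign_scalar(w, np.array([0.5, -1.0, 0.0]))

# Brute-force O(P*K) assignment for comparison:
brute = np.argmin(np.abs(w[:, None] - c[None, :]), axis=1)
```

The centroid step then averages the weights assigned to each centroid, so a full scalar k-means iteration stays O(P log K).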
4.1.1 Why $k$-means?
The fact that we use $k$-means in the C step is not an arbitrary choice of a quantization algorithm (among the many such algorithms we could use instead). It is a necessary consequence of two assumptions: 1) the fact that we want to assign weights to elements of a codebook, which dictates the form of the decompression mapping $\boldsymbol{\Delta}$ (this is not really an assumption, because any form of quantization works like this); and 2) the fact that the penalty used in the augmented Lagrangian is quadratic, so that the C step is a quadratic distortion problem.
We could choose a different penalty instead of the quadratic penalty $\|\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})\|_2^2$, as long as it is zero if the constraint is satisfied and positive otherwise (for example, the $\ell_1$ penalty). In the grand scheme of things, the choice of penalty is not important, because the role of the penalty is to enforce the constraints gradually, so that in the limit $\mu \to \infty$ the constraints are satisfied and the weights are quantized: $\mathbf{w} = \boldsymbol{\Delta}(\boldsymbol{\Theta})$. Any penalty satisfying the positivity condition above will achieve this. The choice of penalty does have two effects: it may change the local optimum we converge to (although it is hard to have control over this); and, more importantly, it has a role in the optimization algorithm used in the L and C steps: the quadratic penalty is easier to optimize. As an example, imagine we used the penalty $\|\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})\|_1$. This means that the L step would have the form:
$$\min_{\mathbf{w}}\; L(\mathbf{w}) + \mu\,\|\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})\|_1$$
that is, an $\ell_1$-regularized loss. This is a nonsmooth problem. One can develop algorithms to optimize it, but it is harder than with the quadratic regularizer. The C step would have the form (again, we write $\mathbf{w}$ instead of $\mathbf{w} - \frac{1}{\mu}\boldsymbol{\lambda}$ for simplicity of notation):
$$\min_{\mathcal{C},\,\kappa}\; \sum_{i=1}^{P} \|\mathbf{w}_i - \mathbf{c}_{\kappa(i)}\|_1.$$
With scalar weights $w_i$, this can be solved by alternating optimization as in $k$-means: the assignment step is identical, but the centroid step uses the median instead of the mean of the points assigned to each centroid (the $k$-medians algorithm). There are a number of other distortion measures developed in the quantization literature (Gersho and Gray, 1992, section 10.3) that might be used as penalties and are perhaps convenient with some losses or applications. With a fixed codebook, as we will see in the next section, the form of the C step is the same regardless of the penalty.
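The $k$-medians variant just described can be sketched as follows (a minimal illustration of ours, not the paper's implementation; only the centroid step differs from $k$-means):

```python
import numpy as np

def kmedians_1d(w, c, iters=100):
    """Alternating optimization for the l1 distortion: the assignment step is
    identical to k-means; the centroid step takes the median, not the mean."""
    c = np.sort(np.asarray(c, dtype=float))
    for _ in range(iters):
        k = np.argmin(np.abs(w[:, None] - c[None, :]), axis=1)   # assignment
        c_new = np.array([np.median(w[k == j]) if np.any(k == j) else c[j]
                          for j in range(len(c))])
        c_new.sort()
        if np.allclose(c_new, c):
            break
        c = c_new
    return c, k
```

With a single centroid and the points $\{0, 1, 10\}$, the centroid becomes the median ($1.0$), insensitive to the outlier, whereas the mean ($\approx 3.67$) would be pulled towards it.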
On the topic of the choice of penalty, a possible concern one could raise is that of outliers in the data. When used for clustering, $k$-means is known to be sensitive to outliers and nonconvexities of the data distribution. Consider the following situations, for simplicity using just $K = 1$ centroid in 1D. First, if the dataset has an outlier, it will pull the centroid towards it, away from the rest of the data (note that this is not a local-optima issue; this is the global optimum). For compression purposes, it may seem a waste of that centroid not to put it where most of the data is. With the $\ell_1$ penalty, the centroid would be insensitive to the outlier. Second, if the dataset consists of two separate groups, the centroid will end up in the middle of both, where there is no data, for both $k$-means and the $\ell_1$ penalty. Again, this may seem a waste of the centroid. Other clustering algorithms have been proposed to ensure the centroids lie where there is distribution mass, such as the $K$-modes algorithm (Carreira-Perpiñán and Wang, 2013; Wang and Carreira-Perpiñán, 2014). However, these concerns are misguided, because neural net compression is not a data modeling problem: one has to consider the overall LC algorithm, not the C step in isolation. While in the C step the centroids approach the data (the weights), in the L step the weights approach the centroids; in the limit both coincide, the distortion is zero and there are no outliers. It is of course possible that the LC algorithm converges to a bad local optimum of the neural net quantization problem, which is NP-complete, but this can happen for various reasons. In section 5.2 of the experiments we run the LC algorithm on a model whose weights contain clear outliers and demonstrate that the solution found makes sense.
4.2 Fixed codebook
Now, we consider quantization using a fixed codebook, i.e., the codebook entries $\mathcal{C} = \{\mathbf{c}_1,\dots,\mathbf{c}_K\}$ are fixed and we do not learn them; we learn only the weight assignments $\mathbf{Z}$. (We can also achieve pruning together with quantization by having one centroid fixed to zero; we study this in more detail in a future paper.) In this way we can derive algorithms for compression of the weights based on approaches such as binarization or ternarization, which have also been explored in the literature of neural net compression, implemented as modifications to backpropagation (see section 2.1).
The compression mapping of eq. (8) now results from solving the optimization problem
$$\min_{\mathbf{Z}}\; \sum_{i=1}^{P} \sum_{k=1}^{K} z_{ik}\,\|\mathbf{w}_i - \mathbf{c}_k\|^2 \quad \text{s.t.}\quad z_{ik} \in \{0,1\},\ \sum_{k=1}^{K} z_{ik} = 1,\ i = 1,\dots,P. \qquad (10)$$
This is no longer NP-complete, unlike the joint optimization over codebook and assignments in (9). It has a closed-form solution for each $\mathbf{w}_i$ separately, where we assign $\mathbf{w}_i$ to its closest codebook entry, $\kappa(i) = \arg\min_{k} \|\mathbf{w}_i - \mathbf{c}_k\|$, with ties broken arbitrarily, for $i = 1,\dots,P$. That is, each weight is compressed as its closest codebook entry (in Euclidean distance). Therefore, we can write the compression mapping explicitly as $\boldsymbol{\Pi}(\mathbf{w}_i) = \mathbf{c}_{\kappa(i)}$ separately for each weight $\mathbf{w}_i$, $i = 1,\dots,P$.
So in this case the low-dimensional parameters are $\mathbf{Z}$ (or $\kappa$), the decompression mapping can be written elementwise as $\mathbf{w}_i = \mathbf{c}_{\kappa(i)}$ for $i = 1,\dots,P$ (as with the adaptive codebook), and the compression mapping can also be written elementwise as $\boldsymbol{\Pi}(\mathbf{w}_i) = \mathbf{c}_{\kappa(i)}$ for $i = 1,\dots,P$. The low-dimensional parameters are all private (the assignments $\mathbf{Z}$ or $\kappa$). The codebook is shared by all weights, but it is not learned. In the pseudocode of fig. 3, we use the notation $\boldsymbol{\Delta}(\boldsymbol{\Theta})$ to write the optimally quantized weights.
This simplifies further in the scalar case, i.e., when the weights to be quantized are scalars. Here, we can write the codebook as an array of scalars sorted increasingly, $c_1 < c_2 < \cdots < c_K$. The elementwise compression mapping can be written generically for $i = 1,\dots,P$ as:
$$\kappa(i) = k \quad \text{such that} \quad \frac{c_{k-1} + c_k}{2} \le w_i < \frac{c_k + c_{k+1}}{2} \qquad (11)$$
since the codebook defines Voronoi cells that are the intervals between midpoints of adjacent centroids. This can be written more compactly as $\kappa(i) = k$, where $k$ satisfies $\frac{1}{2}(c_{k-1} + c_k) \le w_i < \frac{1}{2}(c_k + c_{k+1})$ and we define $c_0 = -\infty$ and $c_{K+1} = +\infty$. Computationally, this can be done in $\mathcal{O}(\log K)$ using a binary search, although in practice $K$ is small enough that a linear search in $\mathcal{O}(K)$ makes little difference. To use the compression mapping in the C step of the LC algorithm given in section 3, $w_i$ equals either a scalar weight, for the quadratic-penalty method, or a shifted scalar weight $w_i - \frac{1}{\mu}\lambda_i$, for the augmented Lagrangian method. The L step of the LC algorithm always takes the form given in eq. (4).
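This scalar C step can be sketched as follows (the function name is ours; the codebook is assumed sorted increasingly, and ties at a midpoint are broken towards the lower cell by the search):

```python
import numpy as np

def quantize_fixed_codebook(w, codebook):
    """Assign each scalar weight to its nearest codebook entry.
    The midpoints of adjacent (sorted) entries delimit the Voronoi cells,
    so each assignment is a binary search: O(log K) per weight."""
    c = np.asarray(codebook, dtype=float)      # sorted increasingly
    mid = (c[:-1] + c[1:]) / 2                 # the K-1 cell boundaries
    k = np.searchsorted(mid, w)                # kappa(i) for every weight
    return k, c[k]
```

For example, with codebook $\{-1, 0, 1\}$ the boundaries are $\pm 0.5$, so a weight of $0.6$ maps to $1$ and a weight of $0.4$ maps to $0$.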
Again, this quantization algorithm in the C step is not an arbitrary choice: it follows necessarily from the way any codebook-based quantization works. Furthermore, and unlike in the adaptive codebook case, with scalar weights the solution (11) is independent of the choice of penalty, because the order of the real numbers is unique (so using a quadratic or an $\ell_1$ penalty will result in the same step).
Application to binarization, ternarization and powers-of-two
Some particular cases of the codebook are of special interest because their implementation is very efficient: binary $\{-1, +1\}$, ternary $\{-1, 0, +1\}$ and general powers of two $\{0, \pm 1, \pm 2^{-1}, \dots, \pm 2^{-C}\}$. These are all well known in digital filter design, where one seeks to avoid floating-point multiplications by using fixed-point binary arithmetic and powers-of-two or sums-of-powers-of-two multipliers (which result in shift or shift-and-add operations instead). This accelerates the computation and requires less hardware.
We give the solution of the C step for these cases in fig. 5 (see proofs in the appendix). Instead of giving the compression mapping $\boldsymbol{\Pi}$, we give directly a quantization operator $Q(w)$ that maps a real-valued weight to its optimal codebook entry. Hence, $Q$ corresponds to compressing and then decompressing the weights, elementwise: $Q(w) = \boldsymbol{\Delta}(\boldsymbol{\Pi}(w))$, where $w$ is a scalar weight. In the expressions for $Q$, we define the floor function as $\lfloor x \rfloor = j$ if $j \le x < j + 1$ and $j$ is an integer, and the sign function as follows:
$$\operatorname{sgn}(w) = \begin{cases} -1, & w < 0 \\ +1, & w \ge 0. \end{cases}$$
Note that the generic $k$-means algorithm (which occurs in the C step of our LC algorithm) solves problem (10), and hence its particular cases, exactly in one iteration: the centroid step does nothing (since the centroids are not learnable) and the assignment step is identical to the expressions for $\kappa$ in eq. (11) or for $Q$ in fig. 5. However, the expressions in fig. 5 are more efficient, especially for the powers-of-two case, which runs in $\mathcal{O}(1)$ per weight (while the generic $k$-means assignment step would run in $\mathcal{O}(\log K)$).
[Figure 5 about here: the C-step solution, i.e., the quantization operator $Q(w)$, for binarization with scale, ternarization with scale, and powers of two; in each case the optimal scale $a$ is computed from the weight magnitudes.]
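As a concrete illustration, for binarization with scale the optimal scale is the average weight magnitude, $a = \frac{1}{P}\sum_i |w_i|$, and $Q(w_i) = a \operatorname{sgn}(w_i)$ (the same formula as in Rastegari et al., 2016, cited below). A minimal sketch of ours, using the $\operatorname{sgn}(0) = +1$ convention above:

```python
import numpy as np

def binarize_with_scale(w):
    """Q(w_i) = a * sgn(w_i) with a = mean(|w_i|) over all weights,
    and sgn(0) taken as +1."""
    w = np.asarray(w, dtype=float)
    a = np.mean(np.abs(w))
    return a * np.where(w >= 0, 1.0, -1.0)
```

For example, the weights $(1, -2, 3)$ give $a = 2$ and quantize to $(2, -2, 2)$.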
4.2.1 Fixed codebook with adaptive scale
Fixed codebook values such as $\{-1, +1\}$ or $\{-1, 0, +1\}$ may produce a large loss, because the good weight values may be much bigger or much smaller than $\pm 1$. One improvement is to rescale the weights, or equivalently rescale the codebook elements, by a scale parameter $a$, which is itself learned. The low-dimensional parameters now are $\boldsymbol{\Theta} = \{a, \mathbf{Z}\}$, where $a$ is a shared parameter and the assignments $\mathbf{Z}$ are private. The decompression mapping can be written elementwise as $w_i = a\,c_{\kappa(i)}$ for $i = 1,\dots,P$. The compression mapping results from solving the optimization problem
$$\min_{a,\,\mathbf{Z}}\; \sum_{i=1}^{P} \sum_{k=1}^{K} z_{ik}\,(w_i - a\,c_k)^2 \quad \text{s.t.}\quad z_{ik} \in \{0,1\},\ \sum_{k=1}^{K} z_{ik} = 1,\ i = 1,\dots,P. \qquad (13)$$
In general, this can be solved by alternating optimization over $\mathbf{Z}$ and $a$:
Assignment step: assign $w_i$ to $\kappa(i) = \arg\min_{k} (w_i - a\,c_k)^2$ for $i = 1,\dots,P$.
Scale step: $a = \sum_{i=1}^{P} c_{\kappa(i)}\,w_i \big/ \sum_{i=1}^{P} c_{\kappa(i)}^2$.
Like $k$-means, this will stop in a finite number of iterations, and may converge to a local optimum. With scalar weights, each iteration is $\mathcal{O}(P \log K)$ by using binary search in the assignment step and incremental accumulation in the scale step.
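The alternating scheme above can be sketched as follows (function name ours; for clarity this uses a brute-force $\mathcal{O}(PK)$ assignment instead of the binary search):

```python
import numpy as np

def quantize_with_scale(w, codebook, iters=100):
    """Alternate the assignment step (nearest scaled codebook entry) with the
    scale step, the least-squares fit a = sum(c_k(i) w_i) / sum(c_k(i)^2)."""
    w = np.asarray(w, dtype=float)
    c = np.asarray(codebook, dtype=float)
    a = 1.0
    k = np.zeros(len(w), dtype=int)
    for _ in range(iters):
        k = np.argmin((w[:, None] - a * c[None, :]) ** 2, axis=1)  # assignment
        ck = c[k]
        denom = np.sum(ck ** 2)
        if denom == 0:                 # all weights assigned to a zero entry
            break
        a_new = np.sum(ck * w) / denom                             # scale step
        if np.isclose(a_new, a):
            break
        a = a_new
    return a, k
```

For example, the weights $(2, -2, 2)$ with codebook $\{-1, +1\}$ converge in one iteration to $a = 2$ and a perfect reconstruction.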
Application to binarization and ternarization with scale
For some special cases we can solve problem (13) exactly, without the need for an iterative algorithm. We give the solution for binarization and ternarization with scale in fig. 5 (see proofs in the appendix). Again, we give directly the scalar quantization operator $Q(w)$. The form of the solution is a rescaled version of the case without scale, where the optimal scale $a$ is the average magnitude of a certain set of weights. Note that, given the scale $a$, the weights can be quantized elementwise by applying $Q$, but solving for the scale involves all the weights $w_1,\dots,w_P$.
Some of our quantization operators are equal to rounding procedures used in previous work on neural net quantization: binarization (without scale) by taking the sign of the weight is well known, and our formula for binarization with scale is the same as in Rastegari et al. (2016). Ternarization with scale was considered by Li et al. (2016), but the solution they give is only approximate; the correct, optimal solution is given in our theorem A.3. As we have mentioned before, those approaches incorporate rounding in the backpropagation algorithm in a heuristic way, and the resulting algorithm does not solve problem (1). In the framework of the LC algorithm, the solution of the C step (the quantization operator) follows necessarily; there is no need for heuristics.
It is possible to consider more variations of the above, such as a codebook $\{-a, +b\}$ or $\{-a, 0, +b\}$ with learnable scales $a, b > 0$, but there is little point to it. We should simply use a learnable codebook $\{c_1, c_2\}$ or $\{c_1, c_2, c_3\}$ without restrictions on $c_1$, $c_2$ or $c_3$ and run $k$-means in the C step.
Computing the optimal scale with $P$ weights has a runtime of $\mathcal{O}(P)$ in the case of binarization with scale and $\mathcal{O}(P \log P)$ in the case of ternarization with scale. In ternarization, the sums can be done cumulatively in $\mathcal{O}(P)$, so the total runtime is dominated by the sort, which is $\mathcal{O}(P \log P)$. It may be possible to avoid the sort using a heap and reduce the total runtime to $\mathcal{O}(P)$.
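The sort-and-cumulative-sums computation can be sketched as follows (a derivation-level sketch of ours, not the paper's code: if the $j$ largest magnitudes are the nonzeros, the optimal scale is their mean and the squared error is $\sum_i w_i^2 - s_j^2/j$ with $s_j$ the sum of the top-$j$ magnitudes, so we maximize $s_j^2/j$ over $j$):

```python
import numpy as np

def ternarize_with_scale(w):
    """Ternarization with scale: quantize each weight to {-a, 0, +a}.
    Sort magnitudes descending (O(P log P)); cumulative sums s_j then give
    the error reduction s_j^2 / j for each count j of nonzeros in O(P)."""
    w = np.asarray(w, dtype=float)
    mags = np.sort(np.abs(w))[::-1]
    s = np.cumsum(mags)                                   # s_j, top-j sums
    j = int(np.argmax(s ** 2 / np.arange(1, len(w) + 1))) + 1
    a = s[j - 1] / j                                      # mean of top-j magnitudes
    thresh = mags[j - 1]                                  # top-j weights -> +/- a
    return np.where(np.abs(w) >= thresh, a * np.sign(w), 0.0)
```

For example, the weights $(3, -3, 0.1)$ give $j = 2$ nonzeros, $a = 3$, and quantize to $(3, -3, 0)$.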
We evaluate our learning-compression (LC) algorithm for quantizing neural nets of different sizes with different compression levels (codebook sizes $K$), in several tasks and datasets: linear regression on MNIST and classification on MNIST and CIFAR10. We compare LC with direct compression (DC) and iterated direct compression (iDC), which correspond to the previous works of Gong et al. (2015) and Han et al. (2015), respectively. By using the codebook values $\{-1, +1\}$, we also compare with BinaryConnect (Courbariaux et al., 2015), which aims at learning binary weights. In summary, our experiments 1) confirm our theoretical arguments about the behavior of (i)DC, and 2) show that LC achieves loss values (in training and test) comparable to those algorithms at low compression levels, but drastically outperforms all of them at high compression levels (which are the most desirable in practice). We reach the maximum possible compression (1 bit/weight) without significant error degradation in all the networks we describe (except in the linear regression case).
We used the Theano (Theano Development Team, 2016) and Lasagne (Dieleman et al., 2015)
libraries. Throughout we use the augmented Lagrangian, because we found it not only faster but far more robust than the quadratic penalty, in particular in setting the SGD hyperparameters. We initialize all algorithms from a reasonably (but not necessarily perfectly) well-trained reference net. The initial iteration ($\mu \to 0^+$) of LC gives the DC solution. The C step (also for iDC) consists of $k$-means run until convergence, initialized from the previous iteration’s centroids (a warm start). For the first compression, we use the $k$-means++ initialization (Arthur and Vassilvitskii, 2007). This first compression may take several tens of $k$-means iterations, but subsequent ones need very few, often just one (figs. 7 and 10).
We report the loss and classification error in training and test. We only quantize the multiplicative weights in the neural net, not the biases. This is because the biases span a larger range than the multiplicative weights, hence requiring higher precision, and anyway there are very few biases in a neural net compared to the number of multiplicative weights.
We calculate compression ratios as
$$\rho(K) = \frac{\#\text{bits(reference)}}{\#\text{bits(quantized)}} = \frac{b\,(P_1 + P_0)}{P_1 \lceil \log_2 K \rceil + b\,P_0 + b\,K}$$
where $K$ is the codebook size; $P_1$ and $P_0$ are the number of multiplicative weights and biases, respectively; and we use 32-bit floats to represent real values (so $b = 32$). Note that it is important to quote the base value of $b$, or otherwise the compression ratio is arbitrary and can be inflated. For example, if we set $b = 64$ (double precision), all the compression ratios in our experiments would double.
Since for our nets $P_1 \gg P_0 + K$, we have $\rho(K) \approx b / \lceil \log_2 K \rceil = 32 / \lceil \log_2 K \rceil$.
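Numerically, with $K$ the codebook size, $P_1$ the number of multiplicative weights, $P_0$ the number of biases and $b = 32$ bits per float (a small helper of ours: the quantized net stores each weight in $\lceil \log_2 K \rceil$ bits, plus the biases and the $K$ codebook entries as $b$-bit floats):

```python
import math

def compression_ratio(K, P1, P0, b=32):
    """rho(K) = b (P1 + P0) / (P1 ceil(log2 K) + b P0 + b K)."""
    return b * (P1 + P0) / (P1 * math.ceil(math.log2(K)) + b * (P0 + K))
```

For example, with $P_1 = 10^6$ weights, $P_0 = 1000$ biases and $K = 2$, the ratio is about 31, close to the asymptotic value $32 / \lceil \log_2 K \rceil = 32$.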
5.1 Interplay between loss, model complexity and compression level
Firstly, we conduct a simple experiment to understand the interplay between loss, model complexity and compression level, here given by classification error, number of hidden units and codebook size, respectively. One important reason why compression is practically useful is that it may be better to train a large, accurate model and compress it than to train a smaller model and not compress it in the first place (there has been some empirical evidence supporting this, e.g. Denil et al., 2013). Also, many papers show that surprisingly large compression levels are possible with some neural nets (in several of our experiments with quantization, we can quantize all the way to one bit per weight with nearly no loss degradation). Should we expect very large compression levels without loss degradation in general?
The answer to these questions depends on the relation between loss, model complexity and compression. Here, we explore this experimentally in a simple setting: a classification neural net with inputs of dimension $D$, outputs of dimension $d$ (the number of classes) and $H$ fully connected hidden tanh units, trained to minimize the average cross-entropy. We use our LC algorithm to quantize the net using a codebook of size $K$. The size in bits of the resulting nets is as follows (assuming floating-point values of $b$ bits). For the reference (non-quantized) net: $b\,P_1$ (multiplicative weights) plus $b\,P_0$ (biases), total $b\,(P_1 + P_0)$. For a quantized net, it is the sum of $P_1 \lceil \log_2 K \rceil$ (for the quantized weights), $b\,P_0$ (for the non-quantized biases) and $b\,K$ (for the codebook), total $P_1 \lceil \log_2 K \rceil + b\,(P_0 + K)$.
We explore the space of optimal nets over $H$ and $K$ in order to determine the best operational point that achieves a target loss with the smallest net, that is, we want to solve the following optimization problem:
$$\min_{H,\,K}\; \#\text{bits}(H, K) \quad \text{s.t.}\quad \text{loss}(H, K) \le L_{\text{target}}.$$
We use the entire MNIST training set of 60 000 handwritten digit images, hence $D = 784$ and $d = 10$. We train a reference net of $H$ units and compress it using a codebook of size $K$, for