Model compression by constrained optimization, using the Learning-Compression (LC) algorithm
We propose a software framework based on the ideas of the Learning-Compression (LC) algorithm, which allows a user to compress a neural network or other machine learning model using different compression schemes with minimal effort. Currently, the supported compressions include pruning, quantization, low-rank methods (including automatically learning the layer ranks), and combinations of those, and the user can choose different compression types for different parts of a neural network. The LC algorithm alternates two types of steps until convergence: a learning (L) step, which trains a model on a dataset (using an algorithm such as SGD); and a compression (C) step, which compresses the model parameters (using a compression scheme such as low-rank or quantization). This decoupling of the "machine learning" aspect from the "signal compression" aspect means that changing the model or the compression type amounts to calling the corresponding subroutine in the L or C step, respectively. The library fully supports this by design, which makes it flexible and extensible. This does not come at the expense of performance: the runtime needed to compress a model is comparable to that of training the model in the first place; and the compressed model is competitive in terms of prediction accuracy and compression ratio with other algorithms (which are often specialized for specific models or compression schemes). The library is written in Python and PyTorch and is available on GitHub.
With the success of neural networks in solving practical problems in various fields, there has been a surge of research on neural network compression techniques that reduce the memory, computation, and power requirements of these large models. At present, many ad-hoc solutions have been proposed that typically solve only one specific type of compression: quantization [16, 6, 28, 41, 42, 11, 4], pruning [23, 14, 12, 26, 32], low-rank decomposition [29, 37, 8, 9, 20, 40, 30, 33, 24, 36] or tensor factorizations [9, 22, 27, 10], and others.
Among the various research strands in neural net compression, in our view, the fundamental problem is that in practice, one does not know what type of compression (or combination of compression types) may be the best for a given network. In principle, it may be possible to try different existing algorithms, assuming one can find an implementation for them, but in practice this is often infeasible. We seek a solution that directly addresses this problem and allows non-expert end-users to compress models easily and efficiently. Our approach is based on a recently proposed compression framework, the LC algorithm [3, 4, 5, 17, 18], that by design separates the “learning” part of the problem, which involves the dataset, neural net model, and loss function, from the “compression” part, which defines how the network parameters will be compressed. This separation has the advantage of modularity: we can change the compression type by simply calling a different compression routine (e.g., $k$-means instead of the SVD), with no other changes to the algorithm.
In this paper, we further develop the ideas of modular compression presented by the LC algorithm and describe our ongoing efforts in building a software library with the philosophy of “single algorithm, multiple compressions”. At present, the library 1) handles various forms of quantization, pruning, low-rank methods, and their combinations, 2) supports different types of deep net models, and 3) allows flexible configuration of compression schemes. Our framework is written in Python and PyTorch. The source code is available online as an open-source project on GitHub.
The field of model compression has grown enormously in recent years, resulting in a plethora of algorithmic approaches, research projects, and software. In this section we limit our attention to the software aspect of neural network compression. We discuss which compression schemes are supported, what code is available, and which compression frameworks have recently been proposed.
The majority of neural network compression code is available as individual projects and recipes tailored for a particular compression and model. Usually it is released as companion code for a published research paper, e.g., the codes of [6, 30, 36, 31] and others. Some repositories combine several compression recipes in a single place: e.g., Tensorpack (https://github.com/tensorpack/tensorpack/tree/master/examples) or the fork of the Caffe library by Wei Wen (https://github.com/wenwei202/caffe).
Out of the many individual compressions proposed in the literature, quantization-aware training has gained popularity and become a standard feature of major deep-learning frameworks. TensorFlow, PyTorch and MXNet natively support training such quantized models and allow efficient inference afterwards.
Relatively mature software is available if the goal is not to compress the model in a lossy way, but to run it unchanged as efficiently as possible on given hardware. Such frameworks convert (compile) an already trained neural network so that it uses hardware-enabled fast computation, for example the edge TPU on the Pixel 4 (Pixel Neural Core) or the Neural Engine on the iPhone 8. Examples include TensorFlow Lite (https://www.tensorflow.org/lite), PyTorch Mobile (https://pytorch.org/mobile/home/), Apple Core ML (https://developer.apple.com/documentation/coreml), Qualcomm Neural Processing SDK (https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk), QNNPACK (https://engineering.fb.com/ml-applications/qnnpack/) and others.
The diversity of compression mechanisms and their limited support in deep learning frameworks led to the development of specialized software libraries such as Distiller and NNCF. Distiller and NNCF gather multiple compression schemes and the corresponding training algorithms into a single framework, and make them easier to apply to new models. Both frameworks allow applying multiple compressions simultaneously to disjoint parts of a single model. However, the underlying compression algorithms do not share the same algorithmic base, and the end user may need a deeper understanding of each one to tune the settings efficiently.
Our approach is based on solid optimization principles, with guarantees of convergence under standard assumptions. It formulates the problem of model compression in a way that is intuitive and amenable to efficient optimization. The form of the actual algorithm is obtained systematically by judiciously applying mathematical transformations to the objective function and constraints. For example, if one wants to optimize the cross-entropy over a certain type of neural net, and represent its weights via a quantized codebook, then the L and C steps necessarily take a specific form. If one wants instead to represent the weights via low-rank matrices, a different C step results, and so on. The resulting algorithm is not based on combining backpropagation training with heuristics, such as pruning weights on the fly, which may result in suboptimal results or even non-convergence. The user does not need to work out the form of individual L or C steps (unless so desired), as we provide a range to choose from.
The LC algorithm is efficient in runtime; it does not take much longer than training the reference, uncompressed model in the first place. The compressed models perform very competitively and allow the user to easily explore the space of prediction accuracy of the model vs compression ratio (which can be defined in terms of memory, inference time, energy or other criteria). Our code has been extensively tested since 2017 through usage in internal research projects, and has resulted in multiple publications that improve the state of the art in several compression schemes [5, 17, 18].
But what truly makes the approach practical is its flexibility and extensibility. If one wants to compress a specific type of model with a specific compression scheme, all that is needed is to pick a corresponding L step and C step. It is not necessary to create a specific algorithm to handle that choice of model and compression. Furthermore, one is not restricted to a single compression scheme; multiple compression schemes (say, low-rank plus pruning plus quantization) can be combined automatically, so they best cooperate to compress the model. The compression schemes that our code already supports make it possible for a user to mix and match them as desired with minimal effort. We expect to include further schemes in the future, as well as a range of model types.
In this section, we briefly introduce the Learning-Compression (LC) framework, which is the backbone of our software. Let us begin by assuming we have a previously trained model with weights $\overline{\mathbf{w}}$, which were obtained by minimizing some loss function $L(\mathbf{w})$. This is our reference model, which represents the best loss we can achieve without compression. We omit the exact definition of the weights $\mathbf{w}$ here; for now, let us assume that $\mathbf{w}$ has $P$ parameters. The “Learning-Compression” paper defines compression as finding a low-dimensional parameterization $\boldsymbol{\Delta}(\boldsymbol{\Theta})$ of the weights $\mathbf{w}$ in terms of a $Q$-sized parameter $\boldsymbol{\Theta}$, with $Q < P$.
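We regard compression and decompression as mappings, while in the signal processing literature they are usually seen as algorithms (e.g., the Lempel-Ziv algorithm). Formally, the decompression mapping $\boldsymbol{\Delta}$ maps low-dimensional parameters $\boldsymbol{\Theta}$ to uncompressed model weights $\mathbf{w}$:
$$\boldsymbol{\Delta}: \boldsymbol{\Theta} \in \mathbb{R}^Q \rightarrow \mathbf{w} \in \mathbb{R}^P$$
and the compression mapping $\boldsymbol{\Pi}$ behaves as its “inverse”:
$$\boldsymbol{\Pi}(\mathbf{w}) = \arg\min_{\boldsymbol{\Theta}} \|\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})\|^2.$$
The goal of model compression is to find a $\boldsymbol{\Theta}$ such that its corresponding decompressed model has (locally) optimal loss. Model compression is therefore defined as the constrained optimization problem:
$$\min_{\mathbf{w}, \boldsymbol{\Theta}} L(\mathbf{w}) \quad \text{s.t.} \quad \mathbf{w} = \boldsymbol{\Delta}(\boldsymbol{\Theta}). \tag{1}$$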
The problem in eq. 1 is constrained, nonlinear, and usually non-differentiable with respect to $\boldsymbol{\Theta}$ (e.g., when the compression is binarization). To solve it efficiently, the LC algorithm converts this problem to an equivalent formulation using penalty methods (quadratic penalty or augmented Lagrangian) and applies alternating optimization. This results in an algorithm that alternates two generic steps while slowly driving the penalty parameter $\mu \rightarrow \infty$:
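L (learning) step: $\min_{\mathbf{w}} L(\mathbf{w}) + \frac{\mu}{2} \|\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})\|^2$. This is a regular training of the uncompressed model but with a quadratic regularization term. This step is independent of the compression type.
C (compression) step: $\min_{\boldsymbol{\Theta}} \|\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})\|^2$. This means finding the best (lossy) compression of $\mathbf{w}$ (the current uncompressed model) in the $\ell_2$ sense (orthogonal projection on the feasible set), and corresponds to our definition of the compression mapping $\boldsymbol{\Pi}$. This step is independent of the loss, training set and task.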
We will use the quadratic penalty (QP) formulation throughout this paper to make derivations easier. In practice, we implement the augmented Lagrangian (AL) version, which has an additional vector of Lagrange multipliers $\boldsymbol{\lambda}$ (see Figure 2). The QP version can be obtained from the AL version by setting $\boldsymbol{\lambda} = \mathbf{0}$ and skipping the multipliers update step. Figure 1 illustrates the idea of model compression as constrained optimization and the LC algorithm.
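The alternation described above can be sketched on a toy problem. The following is a minimal illustration of the quadratic-penalty version, not the library's implementation: the least-squares loss, the closed-form L step, the grid-rounding C step, and the schedule constants are all assumptions chosen so that the example runs with NumPy alone.

```python
import numpy as np

def lc_quadratic_penalty(l_step, c_step, w_ref, mu_schedule):
    """Alternate L and C steps while the penalty parameter mu increases."""
    w = w_ref.copy()
    theta = c_step(w)              # start from direct compression of the reference
    for mu in mu_schedule:
        w = l_step(theta, mu)      # min_w L(w) + mu/2 * ||w - theta||^2
        theta = c_step(w)          # min_theta ||w - theta||^2 over the feasible set
    return w, theta

# Toy setting (assumption): L(w) = 0.5 * mean((X w - y)^2).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.array([0.9, -1.6, 0.1, 2.2, -0.4])
y = X @ w_true + 0.1 * rng.normal(size=200)
A = X.T @ X / len(y)
b = X.T @ y / len(y)

def l_step(theta, mu):
    # Closed-form minimiser of the penalised quadratic loss; a real
    # neural net would run SGD here instead.
    return np.linalg.solve(A + mu * np.eye(len(b)), b + mu * theta)

def c_step(w, step=0.5):
    # Orthogonal projection onto the grid {..., -0.5, 0, 0.5, ...},
    # i.e. quantization with a fixed, evenly spaced codebook.
    return np.round(w / step) * step

w_ref = np.linalg.solve(A, b)                        # uncompressed reference model
mu_schedule = [1e-2 * 1.5 ** j for j in range(30)]   # hypothetical schedule
w, theta = lc_quadratic_penalty(l_step, c_step, w_ref, mu_schedule)
```

As $\mu$ grows, the uncompressed weights `w` are pulled onto the compressed weights `theta`, so the two agree at the end of the run. Swapping in another compression only requires passing a different `c_step`.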
Our software capitalizes on the separation of the L and C steps: to apply a new compression mechanism under the LC formulation, the software requires only a new C step corresponding to this mechanism. Indeed, the compression parameter $\boldsymbol{\Theta}$ enters the L step problem only through the constant $\boldsymbol{\Delta}(\boldsymbol{\Theta})$, regardless of the chosen compression type. Therefore, all L steps for any combination of compressions have the same form. Once the L step has been implemented for a model, any possible compression (C step) can be applied.
More importantly, this separation allows using the best tools available for each of the L and C steps. For modern neural networks, the optimization of the L step means iterating over the dataset, and is typically solved using SGD on hardware accelerators. The C step, on the other hand, is an $\ell_2$ minimization problem, and as we will see in the next section, it can be solved with efficient algorithms. In fact, for certain compression choices, the C step problem is well studied and has a long history of use in its own right in the fields of data and signal compression. From the software engineering perspective, the separation of the L and C steps makes the code more robust and allows us to thoroughly test and debug each component separately.
In this section, we describe some of the compression schemes supported by our library. The complete list of supported compressions is given in Table 1. We expect to add more compressions in the future.
Quantization is the process of reducing the precision of the weights. It is achieved by constraining each weight to belong to a set of values $\mathcal{C}$, the codebook. Depending on the allowed values in the codebook, the quantization schemes are known under different names: binarization, when $\mathcal{C} = \{-1, +1\}$ or $\mathcal{C} = \{-a, +a\}$; ternarization, with $\mathcal{C} = \{-a, 0, +a\}$; the powers-of-two scheme, with $\mathcal{C} = \{0, \pm 1, \pm 2^{-1}, \dots, \pm 2^{-C}\}$; and others.
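For binarization, the C step has a simple closed form. The sketch below assumes the two codebooks are $\{-1, +1\}$ and $\{-a, +a\}$ with a learned scale $a$; the function names are illustrative and not taken from the library. Minimizing $\|\mathbf{w} - a\,\mathbf{z}\|^2$ over signs $\mathbf{z}$ and scale $a$ gives $\mathbf{z} = \text{sign}(\mathbf{w})$ and $a$ equal to the mean absolute weight.

```python
import numpy as np

def binarize(w):
    """Nearest point in {-1, +1} for every weight (zero is mapped to +1)."""
    return np.where(w >= 0, 1.0, -1.0)

def binarize_with_scale(w):
    """Best l2 approximation of w by a * sign(w); the optimal a is mean(|w|)."""
    a = np.mean(np.abs(w))
    return a, a * binarize(w)

w = np.array([0.3, -1.2, 0.7, -0.2])
a, q = binarize_with_scale(w)
```

The scaled variant usually loses far less accuracy than plain $\pm 1$ weights, because the scale absorbs the overall magnitude of the layer.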
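Let us consider the general case when we compress the weights of the model with a learned codebook of size $K$, i.e., $\mathcal{C} = \{c_1, \dots, c_K\}$. We will use the equivalent formulation of the quantization using a binary assignment vector $\mathbf{z}_i \in \{0,1\}^K$ ($\mathbf{1}^T \mathbf{z}_i = 1$) for each weight $w_i$. Then our compression goal is:
$$\min_{\mathbf{w}, \mathcal{C}, \mathbf{Z}} L(\mathbf{w}) \quad \text{s.t.} \quad w_i = \sum_{k=1}^{K} z_{ik} c_k, \quad i = 1, \dots, P.$$
This formulation immediately falls into the Learning-Compression form of eq. 1 with $\boldsymbol{\Theta} = \{\mathcal{C}, \mathbf{Z}\}$. The corresponding C step problem has the form:
$$\min_{\mathcal{C}, \mathbf{Z}} \sum_{i=1}^{P} \Big( w_i - \sum_{k=1}^{K} z_{ik} c_k \Big)^2 \quad \text{s.t.} \quad \mathbf{z}_i \in \{0,1\}^K, \quad \mathbf{1}^T \mathbf{z}_i = 1, \quad i = 1, \dots, P,$$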
which has been thoroughly studied in the signal compression and unsupervised clustering literature, and is known as the $k$-means clustering problem. The general $k$-means problem is NP-hard [7, 1]; however, this is a scalar version, which has an efficient globally optimal solution using dynamic programming [2, 35, 34].
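To make the adaptive quantization C step concrete, here is a minimal sketch of scalar $k$-means using plain Lloyd iterations. It is a swapped-in simplification: it finds a local optimum, whereas the dynamic programming approach mentioned above is globally optimal. The quantile initialisation and the function name are assumptions.

```python
import numpy as np

def kmeans_1d(w, k, iters=100):
    """Lloyd's algorithm on scalars: returns (codebook, assignments)."""
    # Initialise the codebook at evenly spaced quantiles of the weights.
    codebook = np.quantile(w, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        # Assignment step: nearest codebook entry for every weight.
        assign = np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)
        # Centroid step: each entry becomes the mean of its weights.
        new = codebook.copy()
        for j in range(k):
            if np.any(assign == j):
                new[j] = w[assign == j].mean()
        if np.allclose(new, codebook):
            break
        codebook = new
    assign = np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)
    return codebook, assign

rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(m, 0.05, 50) for m in (-1.0, 0.0, 1.0)])
codebook, assign = kmeans_1d(w, k=3)
w_quantized = codebook[assign]   # Delta(Theta): decompressed weights
```

Storing the model then requires only the $K$ codebook values plus $\lceil \log_2 K \rceil$ bits per weight for the assignments.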
Pruning is the process of removing (or sparsifying) the weights of the model. One way of formulating this problem is by using sparsification penalties and constraints, e.g., on the $\ell_0$ or $\ell_1$ norm of the weights, which limit the number of allowed non-zero weights. A particularly useful pruning scheme is $\ell_0$-norm ($\|\cdot\|_0$) constrained pruning, defined as:
$$\min_{\mathbf{w}} L(\mathbf{w}) \quad \text{s.t.} \quad \|\mathbf{w}\|_0 \le \kappa. \tag{3}$$
Since the $\ell_0$-norm measures the number of nonzero items in a vector, the formulation of eq. 3 allows us to specify precisely the number of remaining weights.
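To bring it into the Learning-Compression form (eq. 1), we introduce a copy parameter $\boldsymbol{\theta}$ and obtain an equivalent optimization problem:
$$\min_{\mathbf{w}, \boldsymbol{\theta}} L(\mathbf{w}) \quad \text{s.t.} \quad \mathbf{w} = \boldsymbol{\theta}, \quad \|\boldsymbol{\theta}\|_0 \le \kappa,$$
for which the C step is given by solving:
$$\min_{\boldsymbol{\theta}} \|\mathbf{w} - \boldsymbol{\theta}\|^2 \quad \text{s.t.} \quad \|\boldsymbol{\theta}\|_0 \le \kappa. \tag{4}$$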
The solution of eq. 4 is obtained by keeping the top-$\kappa$ weights (in magnitude) of $\mathbf{w}$ and zeroing the remaining ones.
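The top-$\kappa$ rule takes only a few lines. This is a minimal sketch assuming the weights have already been flattened into one vector; the function name is illustrative.

```python
import numpy as np

def prune_top_kappa(w, kappa):
    """l2 projection of a flat vector w onto {theta : ||theta||_0 <= kappa}."""
    theta = np.zeros_like(w)
    kappa = min(kappa, w.size)
    if kappa > 0:
        # Indices of the kappa largest magnitudes (order among them is irrelevant).
        keep = np.argpartition(np.abs(w), -kappa)[-kappa:]
        theta[keep] = w[keep]
    return theta

w = np.array([0.05, -2.0, 0.3, 1.1, -0.01, 0.7])
theta = prune_top_kappa(w, kappa=3)
```

Because the rule only compares magnitudes, the same function prunes several layers jointly once their weights are concatenated into one vector.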
Using similar steps, we can obtain the C step for the $\ell_1$-constrained formulation of pruning and extend it to penalty-based forms; see the corresponding LC pruning paper for further details. In our framework, we provide implementations for all combinations of $\ell_0$- and $\ell_1$-norms, in both constraint and penalty form.
Our framework supports compressing the weight matrices of each layer to a given (preselected) target rank. This allows parametrizing the resulting compressed weight matrix as a product of two smaller matrices, $\mathbf{U}\mathbf{V}^T$. However, such compression requires knowing the right choice of the ranks; a poor choice will hurt the error-compression tradeoff of the resulting model. To alleviate this issue, we include an implementation of automatic rank selection, which we describe next.
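Assume we have a reference model with $K$ layers and weights $\mathbf{W} = \{\mathbf{W}_1, \dots, \mathbf{W}_K\}$, where $\mathbf{W}_k$ is the weight matrix of layer $k$. We want to optimize the following model selection problem over possible low-rank models:
$$\min_{\mathbf{W}, \mathbf{r}} L(\mathbf{W}) + \lambda C(\mathbf{r}) \quad \text{s.t.} \quad \text{rank}(\mathbf{W}_k) = r_k \le R_k, \quad k = 1, \dots, K,$$
where $R_k$ is the maximum possible rank for matrix $\mathbf{W}_k$. The compression cost $C(\mathbf{r})$ is defined in terms of the ranks of the matrices:
$$C(\mathbf{r}) = C_1(r_1) + \dots + C_K(r_K),$$
and can capture either storage bits (to save space) or total floating point operations (to speed up the model). To put it into Learning-Compression form (eq. 1), we introduce the parameter $\boldsymbol{\Theta}_k$ for each layer, with constraint $\mathbf{W}_k = \boldsymbol{\Theta}_k$. Then, the objective of the C step separates over layers into:
$$\min_{\boldsymbol{\Theta}_k, r_k} \lambda C_k(r_k) + \frac{\mu}{2} \|\mathbf{W}_k - \boldsymbol{\Theta}_k\|^2 \quad \text{s.t.} \quad \text{rank}(\boldsymbol{\Theta}_k) = r_k \le R_k.$$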
The solution of this C step involves an SVD and an enumeration over the ranks for each layer’s weight matrix.
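The SVD-plus-enumeration idea can be sketched for a single layer as follows. This is an illustration under stated assumptions, not the library's routine: the per-layer objective is taken to be $\lambda\,C_k(r) + \frac{\mu}{2}\|\mathbf{W}_k - \boldsymbol{\Theta}_k\|^2$, and the storage cost $r\,(a + b)$ for an $a \times b$ matrix is a hypothetical choice of $C_k$. For a fixed rank $r$, the best $\boldsymbol{\Theta}_k$ is the truncated SVD, whose squared error is the sum of the discarded squared singular values, so all ranks can be scored from one SVD.

```python
import numpy as np

def rank_selection_c_step(W, lam, mu, cost):
    """Return (r, Theta) minimising lam*cost(r) + mu/2*||W - Theta||_F^2."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sq = s ** 2
    # tail[r] = squared error of the best rank-r approximation of W.
    tail = np.concatenate([np.cumsum(sq[::-1])[::-1], [0.0]])
    objective = [lam * cost(r) + 0.5 * mu * tail[r] for r in range(len(s) + 1)]
    r = int(np.argmin(objective))
    theta = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return r, theta

# Build an 8x6 matrix with known singular values: two large, four tiny.
rng = np.random.default_rng(0)
U0, _ = np.linalg.qr(rng.normal(size=(8, 6)))
V0, _ = np.linalg.qr(rng.normal(size=(6, 6)))
s0 = np.array([5.0, 3.0, 0.05, 0.02, 0.01, 0.005])
W = (U0 * s0) @ V0.T

storage = lambda r: r * (W.shape[0] + W.shape[1])   # hypothetical cost C_k(r)
r, theta = rank_selection_c_step(W, lam=0.01, mu=1.0, cost=storage)
```

With these values, each extra rank costs $0.14$ while the third singular value would only save about $0.001$, so the routine keeps the two dominant directions. As $\mu$ increases over the LC iterations, the approximation error is weighted more heavily relative to the fixed cost term.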
Table 1. Compressions currently supported by the library.
|Type||Forms|
|Quantization||Adaptive quantization into $\{c_1, \dots, c_K\}$|
|Binarization into $\{-1, +1\}$ and $\{-a, +a\}$|
|Pruning||$\ell_0$-constraint (s.t., $\|\mathbf{w}\|_0 \le \kappa$)|
|$\ell_1$-constraint (s.t., $\|\mathbf{w}\|_1 \le \kappa$)|
|Low-rank||Low-rank compression to a given rank|
|Low-rank with automatic rank selection for FLOPs reduction|
|Low-rank with automatic rank selection for storage compression|
|Additive Combinations||Quantization + Pruning|
|Quantization + Low-rank|
|Pruning + Low-rank|
|Quantization + Pruning + Low-rank|
Equipped with the Learning-Compression algorithm and some building-block compressions, we now discuss the design of our library. The main goals are to have easy-to-use, efficient, robust, and configurable neural network compression software. In particular, we would like to have the flexibility to apply any available compression (Table 1) to any part of the neural network with per-layer granularity:
a single compression per layer (e.g., low-rank compression for layer 1 with target rank 5)
a single compression per multiple layers (e.g. prune 5% of weights in layer 1 and 3, jointly)
mixing multiple compressions (e.g., quantize layer 1 and prune jointly layers 2 and 3)
To implement such desiderata, we leverage the modularity of the LC algorithm and introduce some additional building blocks in between.
We hand off the model training operations, the L step, to the user through the lambda functions. This gives a fine-grained control on the model’s actual learning, utilization of hardware, pulling the data sources, and other necessary steps required for training. Usually, the implementation of the L step is already available or can be extracted from the training code used for the reference (uncompressed) model. On the left of Figure 5 we give a typical way of implementing the L step in PyTorch.
All provided compressions of Table 1 are implemented as subclasses of CompressionTypeBase class, and the actual C step is exposed through the compress method. This allows a straightforward extension of the library of compressions: if needed, the user simply wraps the custom C-step solution into an object of CompressionTypeBase class. For example, on the right of Figure 5 we show how a new quantization can be implemented.
To instruct the framework on which compression types should be applied to which parts of the model, the user needs to populate a compression tasks structure. This structure is a list of simple mappings of the form (parameters) → (compression view, compression type), which is implemented as a Python dictionary. The parameters are any subset of the model weights, wrapped into an internal Parameter object. The compression view is another internal structure that handles reshaping of the model weights into a form suitable for compression, e.g., reshaping the weight tensor of a convolutional layer into a matrix for low-rank compression.
While our strategy of defining the compression tasks might seem unnecessarily complicated, it brings a considerable amount of flexibility. For instance, it removes the limitations of standard compression approaches with coarse layer-based granularity: we can compress multiple layers with a single compression, or a single layer with multiple compressions, while simultaneously mixing different compressions in a single model. This abstraction disentangles compression from the model structure and allows us to construct complicated schemes of compressions in a mix-and-match way. For example, a user can jointly compress a three-layer neural network so that the first and third layers are quantized with the same codebook while the second layer is compressed to a low-rank matrix; see Figure 6 and other examples in section 6. Such fine-grained control makes it easy to include expert knowledge about the properties of a particular model (e.g., do not quantize the first layer).
|Semantics:||(layer 1, layer 3) → (as a vector, adaptive quantization with a shared codebook),|
|(layer 2) → (as is, low-rank with a given target rank)|
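The mapping from parameters to (view, compression) can be mimicked with plain Python and NumPy. Everything in the sketch below is a hypothetical stand-in written for illustration: the helper names, the string-valued views, and the use of a fixed shared codebook in place of adaptive quantization are assumptions, and none of it is the library's actual API.

```python
import numpy as np

def as_vector(arrays):
    """View: flatten several weight arrays into one vector."""
    return np.concatenate([a.ravel() for a in arrays])

def from_vector(vec, arrays):
    """Inverse view: split a vector back into the original shapes."""
    out, start = [], 0
    for a in arrays:
        out.append(vec[start:start + a.size].reshape(a.shape))
        start += a.size
    return out

def quantize_to(codebook):
    codebook = np.asarray(codebook)
    def compress(vec):
        idx = np.argmin(np.abs(vec[:, None] - codebook[None, :]), axis=1)
        return codebook[idx]
    return compress

def low_rank(rank):
    def compress(matrix):
        U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
        return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    return compress

# (parameters) -> (view, compression), as a dictionary.
tasks = {
    ("layer1", "layer3"): ("as_vector", quantize_to([-0.5, 0.0, 0.5])),
    ("layer2",): ("as_is", low_rank(2)),
}

def run_c_steps(weights, tasks):
    compressed = {}
    for names, (view, compress) in tasks.items():
        arrays = [weights[n] for n in names]
        if view == "as_vector":   # one compression shared by all listed layers
            result = from_vector(compress(as_vector(arrays)), arrays)
        else:                     # "as_is": keep each array's own shape
            result = [compress(a) for a in arrays]
        compressed.update(zip(names, result))
    return compressed

rng = np.random.default_rng(0)
weights = {"layer1": 0.5 * rng.normal(size=(4, 6)),
           "layer2": 0.5 * rng.normal(size=(6, 5)),
           "layer3": 0.5 * rng.normal(size=(5, 3))}
compressed = run_c_steps(weights, tasks)
```

Keeping the view separate from the compression is what lets the same quantizer act on one layer, on several layers jointly, or on a reshaped convolutional tensor without any change to the quantizer itself.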
To compress a model, the user needs to construct an lc.Algorithm object and provide:
a model to be compressed
associated compression tasks
implementation of the L step
a schedule of $\mu$ values, and
an evaluation function to keep track of the loss/error during the compression.
Once the run method is called, the LC algorithm starts executing, and the library proceeds in line-by-line correspondence with the pseudocode on the left of Figure 2. Currently, the compression tasks (and their corresponding C steps) are called in order. However, due to the nature of the LC algorithm, the C steps of different compression tasks can be run in parallel, which would further improve the efficiency of the algorithm.
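Since the C steps of different tasks touch disjoint parameters, running them concurrently is straightforward in principle. The sketch below uses Python's standard `concurrent.futures`; the task layout and the two toy compressions are assumptions for illustration, and the library itself is described above as running tasks in order.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def run_c_steps_parallel(tasks):
    """tasks: dict name -> (compress_fn, weights). Returns dict name -> compressed."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, w) for name, (fn, w) in tasks.items()}
        return {name: f.result() for name, f in futures.items()}

rng = np.random.default_rng(0)
tasks = {
    "layer1": (lambda w: np.round(w), rng.normal(size=100)),              # integer grid
    "layers2+3": (lambda w: np.round(w * 2) / 2, rng.normal(size=200)),   # 0.5 grid
}
parallel = run_c_steps_parallel(tasks)
sequential = {name: fn(w) for name, (fn, w) in tasks.items()}
```

Threads help mainly when the C step spends its time in native code that releases the interpreter lock (as many NumPy and PyTorch routines do); for pure-Python C steps a process pool would be the more suitable choice.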
In this section, we demonstrate the flexibility of our framework by exploring multiple compression schemes with minimal effort. As an example, say we are tasked with compressing the storage bits of the LeNet300 neural network trained on the MNIST dataset (10 classes, $28 \times 28$ gray-scale images). LeNet300 is a three-layer neural network with 300, 100, and 10 neurons in its respective layers; the reference model has an error of 2.13% on the test set.
In order to run the LC algorithm, we need to provide an L step implementation and compression tasks to an instance of the lc.Algorithm class as described in Listing 1. We implement the L step below:
|Compression||Code for compression tasks||Error|
|quantize all layers||
|quantize first and third layers||
|prune all but 5%||
|single codebook quantization with additive pruning of all but 1%||
|prune first layer, low-rank to second, quantize third||
|rank selection with||
Now, having the L step implementation, we can formulate the compression tasks. Say we would like to know the test error if the model is optimally quantized with a separate codebook on each layer. The test error in that case is 2.56%, which is 0.43 percentage points higher than the reference. What would be the performance of the model if we quantized only the first and the third layers, leaving the second layer untouched? The test error in that case is 2.18%. What if we prune all but 5% of the weights? The LC algorithm and our framework can handle all of these combinations and more; see Table 2 for details. We can even apply different compressions to every layer: see the row of Table 2 that prunes the first layer, applies low-rank compression to the second, and quantizes the third. The only change required is to provide a new compression task for the desired compression and possibly a new schedule of $\mu$ values and learning rates (if using SGD). Our framework thus makes it possible to compare different compression techniques within a single library and with a common algorithm.
In the compressions described in Table 2, we used exponential schedules of $\mu$ values indexed by the L step $j$: one schedule for all quantization and pruning experiments, and another for those involving low-rank compression. The learning rate for SGD was decayed by a factor of 0.98 after every L step: the base learning rate was 0.09 for the pure quantization experiments, 0.1 for pruning, and 0.05 for mixtures of schemes. We used a total of 40 L steps, and every L step ran for 20 epochs.
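Schedules of this kind are easy to generate. In the sketch below, the 0.98 decay, the base learning rate of 0.09, and the 40 L steps come from the text, while the $\mu$ constants `mu0` and `a` are placeholders chosen only for illustration.

```python
def mu_schedule(mu0, a, steps):
    """Exponentially increasing penalty values: mu_j = mu0 * a**j."""
    return [mu0 * a ** j for j in range(steps)]

def lr_schedule(base_lr, decay, steps):
    """Learning rate used in the j-th L step: base_lr * decay**j."""
    return [base_lr * decay ** j for j in range(steps)]

mus = mu_schedule(mu0=1e-3, a=1.2, steps=40)            # mu0 and a are placeholders
lrs = lr_schedule(base_lr=0.09, decay=0.98, steps=40)   # quantization setting
```

Each L step $j$ then trains for its 20 epochs with penalty `mus[j]` and initial learning rate `lrs[j]`.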
We implemented the LC algorithm originally in 2017, and we have gone through multiple refinements and code reimplementations. We have applied it to compressing a wide array of relatively large neural nets, such as LeNet, AlexNet, VGG, ResNet, etc., which are themselves tricky to train well in the first place. In the process, we have gathered a considerable amount of practical knowledge on the behavior of the LC algorithm on both small and large models and datasets. We would like to share a list of common pitfalls so future users of our framework would hopefully avoid them:
Monitor the behavior of the algorithm. Specifically, there are two important quantities to keep an eye on:
The loss of the L step: $L(\mathbf{w}) + \frac{\mu}{2} \|\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})\|^2$. The total loss at the end of the L step must be smaller than the total loss at the beginning, preferably much smaller. If some L step has not reduced the loss, the optimization parameters of the step should be tuned.
The loss of the C step, $\|\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})\|^2$, must have a smaller value after each C step. This often fails when a new compression is introduced into the pipeline and its compress method is not fully tested. For the base compressions in the framework, we made sure they always optimize the quadratic distortion loss of the C step.
Some L steps can, and should, be optimized more thoroughly than others. Here, by “more thoroughly” we mean that the reduction in the loss value should be evident. This advice usually applies to the first L steps. When using SGD, it is often helpful to train the first L step for a larger number of iterations than the other steps.
In the case of L step optimization happening by means of SGD, often good initial learning rates can be chosen by inspecting the learning rates used in the training of the original (uncompressed) reference model.
Theoretically, the sequence of $\mu$ values should start at $\mu \rightarrow 0^+$ and grow infinitesimally slowly to $\infty$. In practice, we use an exponentially increasing schedule $\mu_j = \mu_0 a^j$, with a small initial value $\mu_0$ and an appropriately chosen $a > 1$, for the $j$-th step of the LC algorithm. For most compression schemes, we have developed robust estimates of the $\mu_0$ values: see the supplementary material of the corresponding papers on pruning and on rank selection. For the value of $a$, a narrow range just above 1 proved to be a good spot.
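The monitoring advice above can be automated with a small check over logged values. This is a minimal sketch with a hypothetical helper: it assumes the penalized L-step loss and the C-step distortion $\|\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})\|^2$ have been recorded before and after every step.

```python
def find_violations(l_losses, c_distortions):
    """Return the steps that broke the expected monotonic behaviour.

    l_losses:      list of (before, after) penalised losses, one pair per L step
    c_distortions: list of (before, after) distortions, one pair per C step
    """
    bad = []
    for j, (before, after) in enumerate(l_losses):
        if not after < before:       # the L step must reduce its loss
            bad.append(("L", j))
    for j, (before, after) in enumerate(c_distortions):
        if not after <= before:      # equality is tolerated once converged
            bad.append(("C", j))
    return bad

violations = find_violations(
    l_losses=[(1.0, 0.4), (0.9, 0.95)],
    c_distortions=[(0.3, 0.1), (0.2, 0.25)],
)
```

An entry such as `("L", 1)` suggests retuning the optimizer settings of that L step, while a `("C", j)` entry usually points to a bug in a newly added compress method.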
The fields of machine learning and signal compression have developed independently for a long time: machine learning solves the problem of training a deep net to minimize a desired loss on a dataset, while signal compression solves the problem of optimally compressing a given signal. The LC algorithm allows us to seamlessly integrate existing algorithms to train deep nets (L step) and algorithms to compress a signal (C step) by tapping into the abundant literature in the machine learning and signal compression fields. Based on this, we developed an extensible software framework that can easily plug in existing deep net training techniques with existing signal compression techniques and their combinations, in a flexible mix-and-match way. The source code is available online as an open-source project on GitHub.
Speeding up convolutional neural networks with low rank expansions. In M. Valstar, A. French, and T. Pridmore, editors, Proc. of the 25th British Machine Vision Conference (BMVC 2014), Nottingham, UK, Sept. 1–5 2014.
Restructuring of deep neural network acoustic models with singular value decomposition. In F. Bimbot, C. Cerisara, C. Fougeron, G. Gravier, L. Lamel, F. Pellegrino, and P. Perrier, editors, Proc. of Interspeech’13, pages 2365–2369, Lyon, France, Aug. 25–29 2013.