A flexible, extensible software framework for model compression based on the LC algorithm

05/15/2020 · Yerlan Idelbayev et al.

We propose a software framework based on the ideas of the Learning-Compression (LC) algorithm that allows a user to compress a neural network or other machine learning model using different compression schemes with minimal effort. Currently, the supported compressions include pruning, quantization, low-rank methods (including automatically learning the layer ranks), and combinations of those, and the user can choose different compression types for different parts of a neural network. The LC algorithm alternates two types of steps until convergence: a learning (L) step, which trains a model on a dataset (using an algorithm such as SGD); and a compression (C) step, which compresses the model parameters (using a compression scheme such as low-rank or quantization). This decoupling of the "machine learning" aspect from the "signal compression" aspect means that changing the model or the compression type amounts to calling the corresponding subroutine in the L or C step, respectively. The library fully supports this by design, which makes it flexible and extensible. This does not come at the expense of performance: the runtime needed to compress a model is comparable to that of training the model in the first place; and the compressed model is competitive in terms of prediction accuracy and compression ratio with other algorithms (which are often specialized for specific models or compression schemes). The library is written in Python and PyTorch and is available on GitHub.

1 Introduction

With the success of neural networks in solving practical problems in various fields, research has emerged on neural network compression techniques that reduce the memory, computation, and power requirements of these large models. At present, many ad-hoc solutions have been proposed that typically address only one specific type of compression: quantization [16, 6, 28, 41, 42, 11, 4], pruning [23, 14, 12, 26, 32], low-rank decomposition [29, 37, 8, 9, 20, 40, 30, 33, 24, 36] or tensor factorizations [9, 22, 27, 10], and others.

Among the various research strands in neural net compression, in our view, the fundamental problem is that in practice one does not know what type of compression (or combination of compression types) may be the best for a given network. In principle, it may be possible to try different existing algorithms, assuming one can find an implementation for them, but practically this is often impossible. We seek a solution that directly addresses this problem and allows non-expert end users to compress models easily and efficiently. Our approach is based on a recently proposed compression framework, the LC algorithm [3, 4, 5, 17, 18], which by design separates the "learning" part of the problem, involving the dataset, neural net model, and loss function, from the "compression" part, which defines how the network parameters will be compressed. This separation has the advantage of modularity: we can change the compression type by simply calling a different compression routine (e.g., k-means instead of the SVD), with no other changes to the algorithm.

In this paper, we further develop the ideas of modular compression presented by the LC algorithm and describe our ongoing efforts in building a software library with the philosophy of "single algorithm, multiple compressions". At present, the library 1) handles various forms of quantization, pruning, low-rank methods, and their combinations, 2) supports different types of deep net models, and 3) allows flexible configuration of compression schemes. Our framework is written in Python and PyTorch. The source code is available online as an open-source project on GitHub.

2 Related works and comparison

The field of model compression has grown enormously in recent years, resulting in a plethora of algorithmic approaches, research projects, and software. In this section we limit our attention to the software aspect of neural network compression. We discuss which compression schemes are supported, the available codes, and recently proposed compression frameworks.

Individual compressions

The majority of neural network compression code is available as individual projects and recipes tailored to a particular compression and model. Usually it is released as companion code for a published research paper, e.g., the codes of [6, 30, 36, 31] and others. Some repositories combine several compression recipes in a single place: e.g., Tensorpack (https://github.com/tensorpack/tensorpack/tree/master/examples) or the fork of the Caffe library by Wei Wen (https://github.com/wenwei202/caffe).

Out of the many individual compressions proposed in the literature, the quantization-aware training of [19] has gained popularity and become a standard feature of major deep-learning frameworks. TensorFlow, PyTorch, and MXNet natively support training such quantized models and allow efficient inference afterwards.

Efficient inference frameworks

Relatively mature software is available if the goal is not to compress the model in a lossy way, but to run it unchanged as efficiently as possible on given hardware. Such frameworks convert (compile) an already trained neural network to utilize hardware-enabled fast computations, e.g., the edge TPU on the Pixel 4 (Pixel Neural Core) or the Neural Engine on the iPhone 8. Examples include TensorFlow Lite (https://www.tensorflow.org/lite), PyTorch Mobile (https://pytorch.org/mobile/home/), Apple Core ML (https://developer.apple.com/documentation/coreml), the Qualcomm Neural Processing SDK (https://developer.qualcomm.com/software/qualcomm-neural-processing-sdk), QNNPACK (https://engineering.fb.com/ml-applications/qnnpack/), and others.

Compression frameworks

The diversity of compression mechanisms and their limited support in deep learning frameworks have led to the development of specialized software libraries such as Distiller [45] and NNCF [21]. Distiller and NNCF gather multiple compression schemes and corresponding training algorithms into a single framework and make them easier to apply to new models. Both frameworks allow applying multiple compressions simultaneously to disjoint parts of a single model. However, the underlying compression algorithms do not share the same algorithmic base and might require a deeper understanding from the end user to tune the settings efficiently.

What makes our approach special?

Our approach is based on solid optimization principles, with guarantees of convergence under standard assumptions. It formulates the problem of model compression in a way that is intuitive and amenable to efficient optimization. The form of the actual algorithm is obtained systematically by judiciously applying mathematical transformations to the objective function and constraints. For example, if one wants to optimize the cross-entropy over a certain type of neural net and represent its weights via a quantized codebook, then the L and C steps necessarily take a specific form. If one wants instead to represent the weights via low-rank matrices, a different C step results, and so on. The resulting algorithm is not based on combining backpropagation training with heuristics, such as pruning weights on the fly, which may produce suboptimal results or even fail to converge. The user does not need to work out the form of individual L or C steps (unless so desired), as we provide a range to choose from.

The LC algorithm is efficient in runtime; it does not take much longer than training the reference, uncompressed model in the first place. The compressed models perform very competitively and allow the user to easily explore the space of prediction accuracy of the model vs compression ratio (which can be defined in terms of memory, inference time, energy or other criteria). Our code has been extensively tested since 2017 through usage in internal research projects, and has resulted in multiple publications that improve the state of the art in several compression schemes [5, 17, 18].

But what truly makes the approach practical is its flexibility and extensibility. If one wants to compress a specific type of model with a specific compression scheme, all that is needed is to pick the corresponding L step and C step. It is not necessary to create a specific algorithm to handle that choice of model and compression. Furthermore, one is not restricted to a single compression scheme; multiple compression schemes (say, low-rank plus pruning plus quantization) can be combined automatically, so they best cooperate to compress the model. The compression schemes that our code already supports make it possible for a user to mix and match them as desired with minimal effort. We expect to include further schemes in the future, as well as a range of model types.

3 Model compression as a constrained optimization problem

In this section, we briefly introduce the Learning-Compression (LC) framework [3], which is the backbone of our software. Let us begin by assuming we have a previously trained model with weights $\overline{\mathbf{w}}$, obtained by minimizing some loss function $L(\mathbf{w})$. This is our reference model, which represents the best loss we can achieve without compression. We omit the exact definition of the weights $\mathbf{w}$ for now, but assume it has $P$ parameters. The "Learning-Compression" paper [3] defines compression as finding a low-dimensional parameterization $\boldsymbol{\Delta}(\boldsymbol{\Theta})$ of the weights $\mathbf{w}$ in terms of a $Q$-dimensional parameter vector $\boldsymbol{\Theta}$, with $Q < P$.

We regard compression and decompression as mappings, while in the signal processing literature they are usually seen as algorithms (e.g., the Lempel-Ziv algorithm [44]). Formally, the decompression mapping $\boldsymbol{\Delta}\colon \boldsymbol{\Theta} \in \mathbb{R}^Q \to \mathbf{w} \in \mathbb{R}^P$ maps low-dimensional parameters $\boldsymbol{\Theta}$ to uncompressed model weights $\mathbf{w}$, and the compression mapping $\boldsymbol{\Pi}(\mathbf{w}) = \arg\min_{\boldsymbol{\Theta}} \lVert \mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta}) \rVert^2$ behaves as its "inverse".

The goal of model compression is to find a $\boldsymbol{\Theta}$ such that its corresponding decompressed model has (locally) optimal loss. Therefore, model compression as a constrained optimization problem is defined as:

$$\min_{\mathbf{w},\,\boldsymbol{\Theta}} \; L(\mathbf{w}) \quad \text{s.t.} \quad \mathbf{w} = \boldsymbol{\Delta}(\boldsymbol{\Theta}). \qquad (1)$$

The problem in eq. 1 is constrained, nonlinear, and usually non-differentiable with respect to $\boldsymbol{\Theta}$ (e.g., when the compression is binarization). To solve it efficiently, the LC algorithm converts this problem to an equivalent formulation using penalty methods (quadratic penalty or augmented Lagrangian) and employs alternating optimization. This results in an algorithm that alternates two generic steps while slowly driving the penalty parameter $\mu \to \infty$:

  • L (learning) step: $\mathbf{w} \leftarrow \arg\min_{\mathbf{w}} L(\mathbf{w}) + \frac{\mu}{2}\lVert \mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta}) \rVert^2$. This is a regular training of the uncompressed model but with a quadratic regularization term. This step is independent of the compression type.

  • C (compression) step: $\boldsymbol{\Theta} \leftarrow \arg\min_{\boldsymbol{\Theta}} \lVert \mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta}) \rVert^2 = \boldsymbol{\Pi}(\mathbf{w})$. This means finding the best (lossy) compression of $\mathbf{w}$ (the current uncompressed model) in the $\ell_2$ sense (orthogonal projection onto the feasible set), and corresponds to our definition of the compression mapping $\boldsymbol{\Pi}$. This step is independent of the loss, training set and task.

We will be using the quadratic penalty (QP) formulation throughout this paper to make derivations easier. In practice, we implement the augmented Lagrangian (AL) version, which has an additional vector of Lagrange multipliers $\boldsymbol{\lambda}$; see Fig. 2. The QP version can be obtained from the AL version by setting $\boldsymbol{\lambda} = \mathbf{0}$ and skipping the multipliers update step. Fig. 1 illustrates the idea of model compression as constrained optimization and the LC algorithm.
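For concreteness, the penalized objectives that the two steps alternate over can be written as follows; this is a sketch in the notation above, following the quadratic-penalty and augmented-Lagrangian constructions of [3].

\[
\mathcal{L}_{\mathrm{QP}}(\mathbf{w},\boldsymbol{\Theta};\mu) \;=\; L(\mathbf{w}) \;+\; \frac{\mu}{2}\,\lVert \mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta}) \rVert^2
\]
\[
\mathcal{L}_{\mathrm{AL}}(\mathbf{w},\boldsymbol{\Theta},\boldsymbol{\lambda};\mu) \;=\; L(\mathbf{w}) \;-\; \boldsymbol{\lambda}^T(\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})) \;+\; \frac{\mu}{2}\,\lVert \mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta}) \rVert^2,
\qquad
\boldsymbol{\lambda} \;\leftarrow\; \boldsymbol{\lambda} - \mu\,(\mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta})).
\]

The L step minimizes either objective over $\mathbf{w}$ with $\boldsymbol{\Theta}$ (and $\boldsymbol{\lambda}$) fixed; the C step minimizes over $\boldsymbol{\Theta}$ with $\mathbf{w}$ fixed, which reduces to the projection $\boldsymbol{\Pi}$ above.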

Our software capitalizes on the separation of the L and C steps: to apply a new compression mechanism under the LC formulation, the software requires only a new C step corresponding to this mechanism. Indeed, the compression parameter $\boldsymbol{\Theta}$ enters the L step problem as a constant regardless of the chosen compression type. Therefore, all L steps for any combination of compressions have the same form. Once the L step has been implemented for a model, any possible compression (C step) can be applied.

More importantly, this separation allows using the best tools available for each of the L and C steps. For modern neural networks, the optimization of the L step means iterating over the dataset and requires SGD and hardware accelerators. The C step, on the other hand, is an $\ell_2$ minimization and, as we will see in the next section, admits efficient solutions. In fact, for certain compression choices the C step problem is well studied and has long been used on its own merit in the fields of data and signal compression. From a software engineering perspective, the separation of the L and C steps makes the code more robust and allows us to thoroughly test and debug each component separately.


Figure 1: Schematic representation of the idea of model compression by constrained optimization. The plot illustrates the uncompressed model space ($\mathbf{w}$-space), the contour lines of the loss $L(\mathbf{w})$ (green lines), and the set of compressed models (the feasible set $\mathcal{F} = \{\mathbf{w}\colon \mathbf{w} = \boldsymbol{\Delta}(\boldsymbol{\Theta}) \text{ for some } \boldsymbol{\Theta}\}$, grayed areas), for a generic compression technique $\boldsymbol{\Delta}$. The $\boldsymbol{\Theta}$-space is not shown. The reference $\overline{\mathbf{w}}$ optimizes $L(\mathbf{w})$ but is infeasible (no $\boldsymbol{\Theta}$ can decompress into it). The direct compression $\boldsymbol{\Delta}(\boldsymbol{\Pi}(\overline{\mathbf{w}}))$ is feasible but not optimally compressed (not optimal in the feasible set); the optimal compressed model is the feasible point with lowest loss. The red curve is the projection in $\mathbf{w}$-space of the solution path of the LC algorithm as $\mu$ increases from $0$ to $\infty$. See more details in [3].
Pseudocode of the LC algorithm (augmented Lagrangian version):
    input: training data; model with parameters w (pretrained); schedule mu_0 < mu_1 < ...
    init: Theta <- Pi(w) (direct compression), lambda <- 0
    for mu = mu_0 < mu_1 < ... :
        L step:           w <- argmin_w  L(w) + (mu/2) ||w - Delta(Theta) - lambda/mu||^2
        C step:           Theta <- Pi(w - lambda/mu)
        multipliers step: lambda <- lambda - mu (w - Delta(Theta))
        if ||w - Delta(Theta)|| is small enough then exit the loop
    return w, Theta

Corresponding implementation in the LCAlgorithm class:
    class LCAlgorithm():
        # Housekeeping code is skipped; the pretrained model is provided by the user at init.
        def run(self):
            self.mu = 0
            self.c_step(step_number=0)
            for step_n, mu in enumerate(self.mu_schedule):
                self.mu = mu
                self.l_step(step_n)        # call user-provided L step
                self.c_step(step_n)        # resolve compression tasks
                self.multipliers_step()
Figure 2: Pseudocode of the LC algorithm using the augmented Lagrangian (top) and the corresponding implementation in the LCAlgorithm class (bottom); only the main run method is shown.

4 Supported compressions

In this section, we describe some of the compression schemes supported by our library. The complete list of supported compressions is given in Table 1. We expect to add more compressions in the future.

4.1 Quantization

Quantization is the process of reducing the precision of the weights, achieved by constraining each weight to belong to a small set of values, the codebook. Depending on the allowed values in the codebook, the quantization schemes are known under different names: binarization when the codebook is $\{-1, +1\}$ or $\{-a, +a\}$, ternarization with $\{-a, 0, +a\}$, the powers-of-two scheme with $\{0, \pm 1, \pm 2^{-1}, \ldots, \pm 2^{-f}\}$, and others.

Let us consider the general case where we compress the weights of the model with a learned codebook $\mathcal{C} = \{c_1, \ldots, c_K\}$ of size $K$, i.e., each weight $w_i \in \mathcal{C}$. We use the equivalent formulation of quantization with a binary assignment variable $\mathbf{z}_i \in \{0,1\}^K$ (with $\sum_k z_{ik} = 1$) for each weight $w_i$, so that the quantized weight is $\mathbf{z}_i^T \mathbf{c}$ with $\mathbf{c} = (c_1, \ldots, c_K)^T$.

This formulation immediately falls into the Learning-Compression form of eq. 1 with $\boldsymbol{\Theta} = \{\mathbf{c}, \mathbf{Z}\}$ and $\boldsymbol{\Delta}(\boldsymbol{\Theta}) = \mathbf{Z}\mathbf{c}$. The corresponding C step problem has the form:

$$\min_{\mathbf{c},\,\mathbf{Z}} \; \sum_{i=1}^{P} \bigl( w_i - \mathbf{z}_i^T \mathbf{c} \bigr)^2 \quad \text{s.t.} \quad z_{ik} \in \{0,1\},\; \sum_{k=1}^{K} z_{ik} = 1, \qquad (2)$$

which has been thoroughly studied in the signal compression and unsupervised clustering literature and is known as the k-means clustering problem. The general k-means problem is NP-hard [7, 1]; however, ours is a scalar (one-dimensional) version, which has an efficient, globally optimal solution using dynamic programming [2, 35, 34].

Our software provides both the k-means and the dynamic programming solutions for the C step of the adaptive quantization problem (eq. 2). Additionally, we provide solutions for fixed and scaled binarization and ternarization schemes as listed in Table 1 (see [4] for exact details).
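For illustration, a minimal sketch of a Lloyd-style scalar k-means C step is given below. The function name and interface are hypothetical and not the library's API; the library additionally offers the globally optimal dynamic-programming solver mentioned above.

import numpy as np

def adaptive_quantization_c_step(w, k, iters=50):
    # Hypothetical sketch (not the library's API): scalar k-means (Lloyd's
    # algorithm) on the flattened weights w, solving the C step of eq. 2.
    # Returns the learned codebook c and the quantized weights Delta(Theta).
    w = np.asarray(w).ravel()
    c = np.quantile(w, np.linspace(0.0, 1.0, k))   # initialize codebook from weight quantiles
    for _ in range(iters):
        # assignment step: nearest codebook entry for every weight
        assign = np.argmin(np.abs(w[:, None] - c[None, :]), axis=1)
        # update step: each codebook entry becomes the mean of its assigned weights
        for j in range(k):
            if np.any(assign == j):
                c[j] = w[assign == j].mean()
    return c, c[assign]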


Figure 3: Left: error-compression tradeoff when quantizing VGG16 trained on CIFAR10. The results of LC are given by the blue curve and compared to a quantize-then-retrain approach similar to [13] (red curve). Details of the experiments are in [4]. Right: weight pruning of ResNets trained on CIFAR10 using the $\ell_0$-constraint formulation of LC (thick lines), compared to magnitude-based pruning with retraining (thin lines). Horizontal dashed lines correspond to the reference test errors of the respective networks. Full details of the experiments can be found in [5].

4.2 Pruning

Pruning is the process of removing (or sparsifying) the weights of the model. One way of formulating this problem is via sparsification penalties and constraints, e.g., on the $\ell_0$ or $\ell_1$ norm of the weights, which limit the number of allowed non-zero weights. A particularly useful pruning scheme is the $\ell_0$-norm constrained pruning defined as:

$$\min_{\mathbf{w}} \; L(\mathbf{w}) \quad \text{s.t.} \quad \lVert \mathbf{w} \rVert_0 \le \kappa. \qquad (3)$$

Since the $\ell_0$-norm measures the number of non-zero items in a vector, the formulation of eq. 3 allows us to specify precisely the number of remaining weights.

To bring it into the Learning-Compression form (eq. 1) we introduce a copy parameter $\boldsymbol{\theta}$ and obtain the equivalent optimization problem:

$$\min_{\mathbf{w},\,\boldsymbol{\theta}} \; L(\mathbf{w}) \quad \text{s.t.} \quad \mathbf{w} = \boldsymbol{\theta}, \;\; \lVert \boldsymbol{\theta} \rVert_0 \le \kappa,$$

for which the C step is given by solving:

$$\min_{\boldsymbol{\theta}} \; \lVert \mathbf{w} - \boldsymbol{\theta} \rVert^2 \quad \text{s.t.} \quad \lVert \boldsymbol{\theta} \rVert_0 \le \kappa. \qquad (4)$$

The solution of eq. 4 is obtained by keeping the top-$\kappa$ weights of $\mathbf{w}$ (in magnitude) and zeroing the remaining ones.

Using similar steps, we can obtain the C step for the $\ell_1$-constrained formulation of pruning and extend it to penalty-based forms, e.g., $\alpha \lVert \boldsymbol{\theta} \rVert_0$ or $\alpha \lVert \boldsymbol{\theta} \rVert_1$; see [5] for further details. In our framework we provide implementations for all combinations of the $\ell_0$- and $\ell_1$-norms, in both constraint and penalty forms.
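As an illustration, a minimal sketch of the $\ell_0$-constraint C step of eq. 4 might look as follows; the function name and interface are hypothetical and not part of the library's API.

import torch

def l0_constraint_c_step(w, kappa):
    # Hypothetical sketch (not the library's API) of the C step of eq. 4:
    # project w onto the set of vectors with at most kappa non-zeros by
    # keeping the kappa largest-magnitude entries and zeroing the rest.
    theta = torch.zeros_like(w)
    flat = w.reshape(-1)
    idx = torch.topk(flat.abs(), kappa).indices   # indices of the kappa largest |w_i|
    theta.view(-1)[idx] = flat[idx]
    return theta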

4.3 Low-rank compression

Our framework supports compressing the weight matrices of each layer to a given (preselected) target rank. This allows parametrizing the compressed weight matrix of a layer as a product $\mathbf{U}\mathbf{V}^T$ of two thin matrices. However, such compression requires knowing the right choice of ranks; otherwise it will hurt the error-compression tradeoff of the resulting model. To alleviate this issue, we include an implementation of the automatic rank selection of [17], which we describe next.

Assume we have a reference model with $L$ layers and weights $\mathbf{W}_1, \ldots, \mathbf{W}_L$, where $\mathbf{W}_l$ is the weight matrix of layer $l$. We want to optimize the following model selection problem over possible low-rank models:

$$\min_{\{\mathbf{W}_l\},\,\{r_l\}} \; L(\mathbf{W}_1, \ldots, \mathbf{W}_L) + \lambda\, C(r_1, \ldots, r_L) \quad \text{s.t.} \quad \operatorname{rank}(\mathbf{W}_l) \le r_l,\;\; r_l \in \{0, 1, \ldots, R_l\},\;\; l = 1, \ldots, L,$$

where $R_l$ is the maximum possible rank of matrix $\mathbf{W}_l$. The compression cost $C$ is defined in terms of the ranks of the matrices and can capture either the storage bits (to save space) or the total floating point operations (to speed up the model). To put it into the Learning-Compression form (eq. 1), we introduce a low-rank parameter $\boldsymbol{\Theta}_l$ for each layer, with the constraint $\mathbf{W}_l = \boldsymbol{\Theta}_l$. Then, the objective of the C step separates over the layers into:

$$\min_{\boldsymbol{\Theta}_l,\,r_l} \; \frac{\mu}{2} \lVert \mathbf{W}_l - \boldsymbol{\Theta}_l \rVert^2 + \lambda\, C_l(r_l) \quad \text{s.t.} \quad \operatorname{rank}(\boldsymbol{\Theta}_l) \le r_l.$$

The solution of this C step was given in [17], and involves an SVD and an enumeration over the ranks for each layer's weight matrix.
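For the simpler case of a fixed, preselected target rank, the C step is a projection onto the set of rank-$r$ matrices, given by the truncated SVD. A minimal sketch is shown below; the function name and interface are hypothetical and not the library's API. The automatic rank selection C step of [17] additionally enumerates candidate ranks $r$ and adds the $\lambda C_l(r)$ term to the squared error before picking the best one.

import numpy as np

def low_rank_c_step(W, r):
    # Hypothetical sketch (not the library's API) of the fixed-rank C step:
    # the rank-r matrix closest to W in the Frobenius norm is given by the
    # truncated SVD (Eckart-Young). Returns factors U, V with W ~= U @ V.T.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :r] * s[:r]      # absorb the singular values into U
    V_r = Vt[:r, :].T
    return U_r, V_r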


Figure 4: Error-compression space of test error (Y axis), inference FLOPs (X axis) and number of parameters (ball size for each net), for multiple networks trained on CIFAR10 and compressed with low-rank and structured pruning methods. The results of rank selection with the LC algorithm over different values of $\lambda$ for a given network span a curve, shown as connected circles, which starts on the lower right at the reference R ($\lambda = 0$) and then moves left and up. Other published results using low-rank compression are shown as isolated circles labeled with a citation; other published results involving structured filter pruning are shown as isolated squares labeled with a citation (Wen et al. [33], Ye et al. [38], Zhuang et al. [43], Li et al. [25], He et al. [15], Yu et al. [39], Xu et al. [36], Li and Shi [24]). The area of a circle or square is proportional to the number of parameters in the corresponding compressed model. Ideal models are small balls (having few parameters) on the lower left (where both error and FLOPs are smallest). See [17] for full details of the experiments.
Type: Quantization
  Adaptive quantization into a learned codebook $\{c_1, \ldots, c_K\}$
  Binarization into $\{-1, +1\}$ and $\{-a, +a\}$
  Ternarization into $\{-a, 0, +a\}$
Type: Pruning
  $\ell_0$-constraint (s.t. $\lVert\boldsymbol{\theta}\rVert_0 \le \kappa$)
  $\ell_1$-constraint (s.t. $\lVert\boldsymbol{\theta}\rVert_1 \le \kappa$)
  $\ell_0$-penalty ($\alpha\lVert\boldsymbol{\theta}\rVert_0$)
  $\ell_1$-penalty ($\alpha\lVert\boldsymbol{\theta}\rVert_1$)
Type: Low-rank
  Low-rank compression to a given rank
  Low-rank with automatic rank selection for FLOPs reduction
  Low-rank with automatic rank selection for storage compression
Type: Additive combinations
  Quantization + Pruning
  Quantization + Low-rank
  Pruning + Low-rank
  Quantization + Pruning + Low-rank
Table 1: Currently supported compression types, with their exact forms. These compressions can be defined over one or multiple layers, and different compressions can be applied to different parts of the model.

5 Design of the software

Equipped with the Learning-Compression algorithm and some building-block compressions, we now discuss the design of our library. The main goals are easy-to-use, efficient, robust, and configurable neural network compression software. In particular, we would like the flexibility of applying any available compression (Table 1) to any part of the neural network with per-layer granularity:


  • a single compression per layer (e.g., low-rank compression for layer 1 with target rank 5)

  • a single compression over multiple layers (e.g., prune 5% of the weights in layers 1 and 3, jointly)

  • mixing multiple compressions (e.g., quantize layer 1 and jointly prune layers 2 and 3)

  • additive compressions

To implement these desiderata, we leverage the modularity of the LC algorithm and introduce some additional building blocks in between, described next.

L step

We hand off the model training operations, the L step, to the user through lambda (callback) functions. This gives fine-grained control over the model's actual training, hardware utilization, data loading, and other steps required for training. Usually, the implementation of the L step is already available or can be extracted from the training code used for the reference (uncompressed) model. On the left of Figure 5 we give a typical implementation of the L step in PyTorch.

def my_l_step(model, lc_penalty, step, **kwargs):
    # ... optimizer and data loading skipped ...
    loss = model.loss(out_, target_) + lc_penalty()  # task loss + LC penalty
    loss.backward()
    optimizer.step()
    # ... skipped ...
class ScaledBinaryQuantization(CompressionTypeBase):
    # Housekeeping code is skipped
    def compress(self, data):
        # Scaled binarization: map each weight to {-a, +a} with a = mean(|w|).
        a = np.mean(np.abs(data))
        quantized = 2 * a * (data > 0) - a
        return quantized
Figure 5: Left: a typical implementation of the L step in PyTorch; some code (the optimizer and data source configurations) is skipped for brevity. Right: a C step implementation. To add a new compression to the framework, one needs to inherit from the CompressionTypeBase class and implement the compress function.

C step

All compressions of Table 1 are implemented as subclasses of the CompressionTypeBase class, and the actual C step is exposed through the compress method. This allows a straightforward extension of the library of compressions: if needed, the user simply wraps a custom C-step solution in a subclass of CompressionTypeBase. For example, on the right of Figure 5 we show how a new quantization can be implemented.
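In the same spirit, a ternarization scheme analogous to the binarization of Figure 5 could be sketched as follows. This particular class is illustrative only and not one of the library's built-ins; the threshold rule is a heuristic simplification, and the import of CompressionTypeBase is omitted as in Figure 5.

import numpy as np

class SimpleTernaryQuantization(CompressionTypeBase):
    # Illustrative sketch, not a built-in: quantize weights into {-a, 0, +a}
    # using a heuristic threshold. The library's own ternarization solves the
    # corresponding C step exactly (see Table 1 and [4]).
    def compress(self, data):
        t = 0.7 * np.mean(np.abs(data))              # heuristic threshold
        mask = np.abs(data) > t                      # weights kept non-zero
        a = np.mean(np.abs(data[mask])) if mask.any() else 0.0
        return a * np.sign(data) * mask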

Compression tasks

To instruct the framework which compression types should be applied to which parts of the model, the user populates a compression tasks structure. This structure is a list of simple mappings of the form (parameters) → (compression view, compression type), implemented as a Python dictionary. The parameters are any subset of the model weights, wrapped into an internal Parameter object. The compression view is another internal structure that handles reshaping of the model weights into a form suitable for compression, e.g., reshaping the weight tensor of a convolutional layer into a matrix for low-rank compression.

While our strategy of defining the compression tasks might seem unnecessarily complicated, it brings a considerable amount of flexibility. For instance, it removes the limitations of standard compression approaches with coarse layer-based granularity: we can compress multiple layers with a single compression, or a single layer with multiple compressions, while simultaneously mixing different compressions in a single model. This abstraction disentangles compression from the model structure and allows us to construct complicated compression schemes in a mix-and-match way. For example, the user can jointly compress a three-layer neural network so that the first and third layers are quantized with the same codebook while the second layer is simultaneously replaced by a low-rank matrix; see Figure 6 and other examples in section 6. Such fine-grained control allows including expert knowledge about the properties of a particular model (e.g., do not quantize the first layer) without much effort.

Semantics: (layer 1, layer 3) → (as a vector, adaptive quantization with codebook size k=6),
(layer 2) → (as is, low-rank with target rank 3)
Python code:
from lc.torch import ParameterTorch as Param, AsVector, AsIs
compression_tasks = {
    Param([l1.weight, l3.weight]): (AsVector, AdaptiveQuantization(k=6)),
    Param(l2.weight):              (AsIs,     LowRank(target_rank=3))
}
Figure 6: Semantics and the actual Python code for compression tasks that quantize the first and third layers of a NN with a single adaptive codebook of size 6 and compress the second layer with a low-rank matrix of rank 3. Notice how the semantics translates almost verbatim into the Python code.

Running the software

To compress a model, the user needs to construct an lc.Algorithm object and provide:

  1. a model to be compressed,

  2. the associated compression tasks,

  3. an implementation of the L step,

  4. a schedule of μ values, and

  5. an evaluation function to keep track of the loss/error during the compression.

lc_alg = lc.Algorithm(
    model=net,                            # a model to compress
    compression_tasks=compression_tasks,  # specifications of compression
    l_step_optimization=my_l_step,        # implementation of the L step
    mu_schedule=mu_s,                     # schedule of the mu values
    evaluation_func=train_test_acc_eval_f # the evaluation function
)
lc_alg.run()                              # an entry point to the LC algorithm
Listing 1: Running the LC algorithm.

Once the run method is called, the LC algorithm starts execution, at which point the library proceeds in line-by-line correspondence with the pseudocode of Figure 2. Currently, each compression task (and the corresponding C step) is processed in order. Yet, due to the nature of the LC algorithm, the C steps of all compression tasks could be run in parallel, further improving the efficiency of the algorithm.

6 Showcase

In this section, we demonstrate the flexibility of our framework by exploring multiple compression schemes with minimal effort. As an example, say we are tasked with compressing the storage bits of the LeNet300 neural network trained on the MNIST dataset (10 classes, 28×28 gray-scale images). LeNet300 is a three-layer neural network with 300, 100, and 10 neurons in its respective layers; the reference model has an error of 2.13% on the test set.

In order to run the LC algorithm, we need to provide an L step implementation and compression tasks to an instance of the LCAlgorithm class as described in Listing 1. We implement the L step below:

def my_l_step(model, lc_penalty, step):
    params = list(filter(lambda p: p.requires_grad, model.parameters()))
    lr = lr_base*(0.98**step) # we use a fixed learning rate for each L step
    optimizer = optim.SGD(params, lr=lr, momentum=0.9, nesterov=True)
    for epoch in range(epochs_per_step):
        for x, target in train_loader: # loop over the dataset
            optimizer.zero_grad()
            loss = model.loss(model(x), target) + lc_penalty() # loss + LC penalty
            loss.backward()
            optimizer.step()
Listing 2: Complete implementation of the L step for LeNet300 using PyTorch. The code is exactly the regular model training code, with the only difference being the loss computation: the L step minimizes the loss plus the LC penalty.
Compression Code for compression tasks Error
no compression
Train 0.00%
Test 2.13%
quantize all layers
  compression_tasks = {
    Param(l1.weight): (AsVector, AdaptiveQuantization(k=2)),
    Param(l2.weight): (AsVector, AdaptiveQuantization(k=2)),
    Param(l3.weight): (AsVector, AdaptiveQuantization(k=2))
  }
Train 0.02%
Test 2.56%
quantize first and third layers
  compression_tasks = {
    Param(l1.weight): (AsVector, AdaptiveQuantization(k=2)),
    Param(l3.weight): (AsVector, AdaptiveQuantization(k=2))
  }
Train 0.00%
Test 2.26%
prune all but 5%
  compression_tasks = {
    Param([l1.weight, l2.weight, l3.weight]):
      (AsVector, ConstraintL0Pruning(kappa=13310)) # 13310 = 5%
  }
Train 0.00%
Test 2.18%
single codebook quantization with additive pruning of all but 1%
  compression_tasks = {
    Param([l1.weight, l2.weight, l3.weight]): [
      (AsVector, ConstraintL0Pruning(kappa=2662)), # 2662 = 1%
      (AsVector, AdaptiveQuantization(k=2))
    ]
  }
Train 0.00%
Test 2.17%
prune first layer, low-rank to second, quantize third
  compression_tasks = {
    Param(l1.weight): (AsVector, ConstraintL0Pruning(kappa=5000)),
    Param(l2.weight): (AsIs,     LowRank(target_rank=10)),
    Param(l3.weight): (AsVector, AdaptiveQuantization(k=2))
  }
Train 0.04%
Test 2.51%
rank selection with λ = 1e-6
  compression_tasks = {
    Param(l1.weight): (AsIs,     RankSelection(alpha=1e-6)),
    Param(l2.weight): (AsIs,     RankSelection(alpha=1e-6)),
    Param(l3.weight): (AsIs,     RankSelection(alpha=1e-6))
  }
Train 0.00%
Test 1.90%
Table 2: We showcase some of the possible compression tasks on the LeNet300 neural network trained on the MNIST dataset. Notice that trying a new combination of compressions is as simple as writing a new compression tasks structure, with a possible change of learning rates and μ-schedule.

Now, having the L step implementation, we can formulate the compression tasks. Say we would like to know what the test error would be if the model were optimally quantized with a separate codebook on each layer: the test error in that case is 2.56%, which is 0.43% higher than the reference. What would the performance be if one quantized only the first and the third layers, leaving the second layer untouched? The test error is then 2.26%. What if we prune all but 5% of the weights? The LC algorithm and our framework can handle all of these combinations and more; see Table 2 for details. We can even apply different compressions to every layer: see, for example, the row of Table 2 where we apply pruning, low-rank compression, and quantization to different parts of LeNet300. The only change required is to provide a new compression task for the desired compression and possibly a new schedule of μ values and learning rates (if using SGD). Our framework thus allows comparing different compression techniques in a single library with a common algorithm.

For the compressions described in Table 2 we used an exponential schedule of μ values of the form μ_j = μ_0 a^j at the j-th L step, with one setting for the quantization/pruning experiments and another for those involving low-rank compression. The learning rate for SGD decayed by a factor of 0.98 after every L step: the base learning rate was 0.09 for the pure quantization experiments, 0.1 for pruning, and 0.05 for the mixtures of schemes. We used a total of 40 L steps, and every L step ran for 20 epochs.
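For concreteness, such a schedule could be constructed and passed to lc.Algorithm (Listing 1) as in the sketch below. The values of mu0 and a are placeholders, not the settings used in our experiments; the 40 L steps, 20 epochs per step, and lr_base = 0.09 follow the text above.

# Illustrative sketch: an exponential mu schedule and SGD settings for the showcase.
mu0, a = 1e-3, 1.1                          # placeholder initial mu and growth factor
mu_s = [mu0 * (a ** j) for j in range(40)]  # exponential schedule, 40 L steps in total
lr_base = 0.09                              # base learning rate (pure quantization runs)
epochs_per_step = 20                        # every L step runs for 20 epochs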

7 Practical advice

We implemented the LC algorithm originally in 2017, and we have gone through multiple refinements and code reimplementations. We have applied it to compressing a wide array of relatively large neural nets, such as LeNet, AlexNet, VGG, ResNet, etc., which are themselves tricky to train well in the first place. In the process, we have gathered a considerable amount of practical knowledge about the behavior of the LC algorithm on both small and large models and datasets. We would like to share a list of common pitfalls so that future users of our framework can hopefully avoid them:

Monitor the progression

of the algorithm. Specifically, there are two important quantities to keep an eye on:

  • The loss of the L step, $L(\mathbf{w}) + \frac{\mu}{2}\lVert \mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta}) \rVert^2$. The total loss at the end of the L step must be smaller than the total loss at the beginning; preferably, much smaller. If some L step has not reduced the loss, the optimization parameters of that step should be tuned.

  • The quadratic distortion of the C step, $\lVert \mathbf{w} - \boldsymbol{\Delta}(\boldsymbol{\Theta}) \rVert^2$, must have a smaller value after each C step. This often fails when a new compression is introduced into the pipeline and its compress method is not fully tested. For the base compressions in the framework, we made sure they always optimize the quadratic distortion loss of the C step.

L steps are not created equal

Some L steps can, and should, be optimized better than others. Here, by "better" we mean that the reduction in the loss value should be evident. This advice usually applies to the first L steps. When using SGD, it is often helpful to train the first L step for a larger number of iterations than the subsequent ones.
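For example, with the my_l_step of Listing 2 this could be done by making the number of epochs depend on the step index. The sketch below is illustrative only; the factor of 3 is an arbitrary placeholder, not a recommendation from our experiments.

# Illustrative sketch: give the very first L step more epochs than the rest.
# `step` and `epochs_per_step` follow Listing 2; the factor 3 is a placeholder.
def epochs_for_step(step, epochs_per_step=20):
    return 3 * epochs_per_step if step == 0 else epochs_per_step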

On the learning rates of SGD

When the L step is optimized by means of SGD, good initial learning rates can often be chosen by inspecting the learning rates used to train the original (uncompressed) reference model.

On the μ schedule

Theoretically, the sequence of μ values should start at 0 and grow infinitesimally slowly towards ∞. In practice, we use an exponentially increasing schedule μ_j = μ_0 a^j with a small initial μ_0 and an appropriately chosen a for the j-th step of the LC algorithm. For most compression schemes, we have developed robust estimates of the μ values: for pruning see the supplementary material of [5], for rank selection see that of [17]. For the value of a, we found a particular range of values to work well in practice.

8 Conclusion

The fields of machine learning and signal compression have developed independently for a long time: machine learning solves the problem of training a deep net to minimize a desired loss on a dataset, while signal compression solves the problem of optimally compressing a given signal. The LC algorithm allows us to seamlessly integrate existing algorithms for training deep nets (L step) with existing algorithms for compressing a signal (C step), tapping into the abundant literature of both fields. Based on this, we developed an extensible software framework that can easily plug existing deep net training techniques together with existing signal compression techniques and their combinations, in a flexible mix-and-match way. The source code is available online as an open-source project on GitHub.

References

  • Aloise et al. [2009] D. Aloise, A. Deshpande, P. Hansen, and P. Popat. NP-hardness of Euclidean sum-of-squares clustering. Machine Learning, 75(2):245–248, May 2009.
  • Bruce [1965] J. D. Bruce. Optimum quantization. Technical Report 429, Massachussetts Institute of Technology, 1965.
  • Carreira-Perpiñán [2017] M. Á. Carreira-Perpiñán. Model compression as constrained optimization, with application to neural nets. Part I: General framework. arXiv:1707.01209, July 5 2017.
  • Carreira-Perpiñán and Idelbayev [2017] M. Á. Carreira-Perpiñán and Y. Idelbayev. Model compression as constrained optimization, with application to neural nets. Part II: Quantization. arXiv:1707.04319, July 13 2017.
  • Carreira-Perpiñán and Idelbayev [2018] M. Á. Carreira-Perpiñán and Y. Idelbayev. “Learning-compression” algorithms for neural net pruning. In Proc. of the 2018 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’18), pages 8532–8541, Salt Lake City, UT, June 18–22 2018.
  • Courbariaux et al. [2015] M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), volume 28, pages 3105–3113. MIT Press, Cambridge, MA, 2015.
  • Dasgupta and Freund [2009] S. Dasgupta and Y. Freund. Random projection trees for vector quantization. IEEE Trans. Information Theory, 55(7):3229–3242, July 2009.
  • Denil et al. [2013] M. Denil, B. Shakibi, L. Dinh, M. Ranzato, and N. de Freitas. Predicting parameters in deep learning. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems (NIPS), volume 26, pages 2148–2156. MIT Press, Cambridge, MA, 2013.
  • Denton et al. [2014] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems (NIPS), volume 27, pages 1269–1277. MIT Press, Cambridge, MA, 2014.
  • Garipov et al. [2016] T. Garipov, D. Podoprikhin, A. Novikov, and D. Vetrov. Ultimate tensorization: Compressing convolutional and FC layers alike. arXiv:1611.03214, Nov. 10 2016.
  • Gong et al. [2015] Y. Gong, L. Liu, M. Yang, and L. Bourdev. Compressing deep convolutional networks using vector quantization. In Proc. of the 3rd Int. Conf. Learning Representations (ICLR 2015), San Diego, CA, May 7–9 2015.
  • Han et al. [2015] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), volume 28, pages 1135–1143. MIT Press, Cambridge, MA, 2015.
  • Han et al. [2016] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In Proc. of the 4th Int. Conf. Learning Representations (ICLR 2016), San Juan, Puerto Rico, May 2–4 2016.
  • Hassibi and Stork [1993] B. Hassibi and D. G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems (NIPS), volume 5, pages 164–171. Morgan Kaufmann, San Mateo, CA, 1993.
  • He et al. [2017] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In Proc. 17th Int. Conf. Computer Vision (ICCV’17), pages 1398–1406, Venice, Italy, Dec. 11–18 2017.
  • Hwang and Sung [2014] K. Hwang and W. Sung. Fixed-point feedforward deep neural network design using weights +1, 0, and −1. In 2014 IEEE Workshop on Signal Processing Systems (SiPS), pages 1–6, Belfast, UK, Oct. 20–22 2014.
  • Idelbayev and Carreira-Perpiñán [2020a] Y. Idelbayev and M. Á. Carreira-Perpiñán. Low-rank compression of neural nets: Learning the rank of each layer. In Proc. of the 2020 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’20), Seattle, WA, June 14–19 2020a.
  • Idelbayev and Carreira-Perpiñán [2020b] Y. Idelbayev and M. Á. Carreira-Perpiñán. More general and effective model compression via an additive combination of compressions. arXiv, 2020b.
  • Jacob et al. [2018] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proc. of the 2018 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’18), pages 2704–2713, Salt Lake City, UT, June 18–22 2018.
  • Jaderberg et al. [2014] M. Jaderberg, A. Vedaldi, and A. Zisserman. Speeding up convolutional neural networks with low rank expansions. In M. Valstar, A. French, and T. Pridmore, editors, Proc. of the 25th British Machine Vision Conference (BMVC 2014), Nottingham, UK, Sept. 1–5 2014.
  • Kozlov et al. [2020] A. Kozlov, I. Lazarevich, V. Shamporov, N. Lyalyushkin, and Y. Gorbachev. Neural network compression framework for fast model inference. arXiv:2002.08679, Feb. 20 2020.
  • Lebedev et al. [2015] V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In Proc. of the 3rd Int. Conf. Learning Representations (ICLR 2015), San Diego, CA, May 7–9 2015.
  • LeCun et al. [1990] Y. LeCun, J. S. Denker, and S. A. Solla. Optimal brain damage. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems (NIPS), volume 2, pages 598–605. Morgan Kaufmann, San Mateo, CA, 1990.
  • Li and Shi [2018] C. Li and C. J. R. Shi. Constrained optimization based low-rank approximation of deep neural networks. In V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, editors, Proc. 15th European Conf. Computer Vision (ECCV’18), pages 746–761, Munich, Germany, Sept. 8–14 2018.
  • Li et al. [2017] H. Li, A. Kadav, I. Durdanovic, and H. P. Graf. Pruning filters for efficient ConvNets. In Proc. of the 5th Int. Conf. Learning Representations (ICLR 2017), Toulon, France, Apr. 24–26 2017.
  • Liu et al. [2015] B. Liu, M. Wan, H. Foroosh, M. Tappen, and M. Pensky. Sparse convolutional neural networks. In Proc. of the 2015 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’15), pages 806–814, Boston, MA, June 7–12 2015.
  • Novikov et al. [2015] A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov. Tensorizing neural networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), volume 28, pages 442–450. MIT Press, Cambridge, MA, 2015.
  • Rastegari et al. [2016] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-net: ImageNet classification using binary convolutional neural networks. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Proc. 14th European Conf. Computer Vision (ECCV’16), pages 525–542, Amsterdam, The Netherlands, Oct. 11–14 2016.
  • Sainath et al. [2013] T. N. Sainath, B. Kingsbury, V. Sindhwani, E. Arısoy, and B. Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Proc. of the IEEE Int. Conf. Acoustics, Speech and Sig. Proc. (ICASSP’13), pages 6655–6659, Vancouver, Canada, Mar. 26–30 2013.
  • Tai et al. [2016] C. Tai, T. Xiao, Y. Zhang, X. Wang, and W. E. Convolutional neural networks with low-rank regularization. In Proc. of the 4th Int. Conf. Learning Representations (ICLR 2016), San Juan, Puerto Rico, May 2–4 2016.
  • Tung and Mori [2018] F. Tung and G. Mori. CLIP-Q: Deep network compression learning by in-parallel pruning-quantization. In Proc. of the 2018 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’18), pages 7873–7882, Salt Lake City, UT, June 18–22 2018.
  • Wen et al. [2016] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li. Learning structured sparsity in deep neural networks. In D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems (NIPS), volume 29, pages 2074–2082. MIT Press, Cambridge, MA, 2016.
  • Wen et al. [2017] W. Wen, C. Xu, C. Wu, Y. Wang, Y. Chen, and H. Li. Coordinating filters for faster deep neural networks. In Proc. 17th Int. Conf. Computer Vision (ICCV’17), Venice, Italy, Dec. 11–18 2017.
  • Wu [1991] X. Wu. Optimal quantization by matrix searching. J. Algorithms, 12(4):663–673, Dec. 1991.
  • Wu and Rokne [1989] X. Wu and J. Rokne. An algorithm for optimum k-level quantization on histograms of n points. In Proc. 17th ACM Annual Computer Science Conference, pages 339–343, Louisville, KY, Feb. 21–23 1989.
  • Xu et al. [2018] Y. Xu, Y. Li, S. Zhang, W. Wen, B. Wang, Y. Qi, Y. Chen, W. Lin, and H. Xiong. Trained rank pruning for efficient deep neural networks. arXiv:1812.02402, Dec. 8 2018.
  • Xue et al. [2013] J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. In F. Bimbot, C. Cerisara, C. Fougeron, G. Gravier, L. Lamel, F. Pellegrino, and P. Perrier, editors, Proc. of Interspeech’13, pages 2365–2369, Lyon, France, Aug. 25–29 2013.
  • Ye et al. [2018] J. Ye, X. Lu, Z. Lin, and J. Wang. Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers. In Proc. of the 6th Int. Conf. Learning Representations (ICLR 2018), Vancouver, Canada, Apr. 30 – May 3 2018.
  • Yu et al. [2018] R. Yu, A. Li, C.-F. Chen, J.-H. Lai, V. I. Morariu, X. Han, M. Gao, C.-Y. Lin, and L. S. Davis. NISP: Pruning networks using neuron importance score propagation. In Proc. of the 2018 IEEE Computer Society Conf. Computer Vision and Pattern Recognition (CVPR’18), pages 9194–9203, Salt Lake City, UT, June 18–22 2018.
  • Zhang et al. [2016] X. Zhang, J. Zou, K. He, and J. Sun. Accelerating very deep convolutional networks for classification and detection. IEEE Trans. Pattern Analysis and Machine Intelligence, 38(10):1943–1955, Oct. 2016.
  • Zhou et al. [2016] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv:1606.06160, July 17 2016.
  • Zhu et al. [2017] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. In Proc. of the 5th Int. Conf. Learning Representations (ICLR 2017), Toulon, France, Apr. 24–26 2017.
  • Zhuang et al. [2018] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu. Discrimination-aware channel pruning for deep neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems (NEURIPS), volume 31, pages 815–886. MIT Press, Cambridge, MA, 2018.
  • Ziv and Lempel [1977] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE Trans. Information Theory, 23(3):337–343, May 1977.
  • Zmora et al. [2019] N. Zmora, G. Jacob, L. Zlotnik, B. Elharar, and G. Novik. Neural network distiller: a Python package for DNN compression research. arXiv:1910.12232, Oct. 27 2019.