Deep Learning Approximation: Zero-Shot Neural Network Speedup

by   Michele Pratusevich, et al.

Neural networks offer high-accuracy solutions to a range of problems, but are costly to run in production systems because of computational and memory requirements during a forward pass. Given a trained network, we propose a technique called Deep Learning Approximation to build a faster network in a tiny fraction of the time required for training by only manipulating the network structure and coefficients without requiring re-training or access to the training data. Speedup is achieved by applying a sequential series of independent optimizations that reduce the floating-point operations (FLOPs) required to perform a forward pass. First, lossless optimizations are applied, followed by lossy approximations using singular value decomposition (SVD) and low-rank matrix decomposition. The optimal approximation is chosen by weighing the relative accuracy loss and FLOP reduction according to a single parameter specified by the user. On PASCAL VOC 2007 with the YOLO network, we show an end-to-end 2x speedup in a network forward pass with a 5% drop in mAP that can be re-gained by finetuning.




1 Introduction

Deep Learning Approximation (DLA) speeds up the runtime of neural networks, which is especially relevant for pay-per-compute or limited-compute embedded environments. It decouples the task of speeding up neural networks from the task of making them more accurate. It enables tuning off-the-shelf networks for runtime, to complement fine-tuning for accuracy.

At deploy time, the dollar cost of a production pipeline using a neural network is directly proportional to the time required to execute a forward pass of the neural network. This is in contrast to training, which generally happens only once and offline. In turn, for large networks, time is proportional to the number of floating-point operations (FLOPs) required in a forward pass. For example, in the cloud, instances are paid for by the hour, so a pipeline running a ResNet-152 model, which needs more FLOPs per forward pass than a ResNet-50 model, runs proportionally slower and therefore costs proportionally more. This extra dollar cost comes at minimal benefit in error on ImageNet (He2015). Running faster models means a higher throughput, which means better hardware utilization, both in cloud and embedded environments. For embedded environments, faster and smaller models mean lower power, cooling, and CPU speed requirements.

DLA has several other important properties. For networks used as a black-box or where training data is confidential, re-training or fine-tuning the network is not an option or is extremely challenging. In these cases, DLA is a method for speeding up the network forward pass without needing access to the training data. If you can finetune, finetuning can recover accuracy lost during the DLA process.

In DLA, we first apply lossless optimizations to reduce FLOP count and runtime, a step that is always beneficial. Then, we propose a method for automatically selecting an appropriate approximation to apply to each layer in a network based on a single parameter, which represents whether accuracy or speedup is more important. The approximation methods are all based on singular value decompositions (SVD) applied directly to the weight tensors.

On a benchmark set of standard computer vision tasks and neural network architectures, we show between a 1.5x and 2x speedup without additional re-training and with minimal accuracy loss. In each case, finetuning after applying DLA recovers lost accuracy.

2 Related Work

Better hardware and software

To make a large, slow neural network run faster, you can deploy the net to faster hardware (FPGAs, better GPUs) or use better software (such as MXNet (DBLP:journals/corr/ChenLLLWWXXZZ15) and Caffe con Troll (Hadjis:2015:CCT:2799562.2799641)). DLA offers a third option by operating directly on neural network models and can be applied to models targeting all hardware and software.

Training smaller models

Factorization and low-rank approximation approaches such as that of jaderberg14speeding and factorized CNNs (factorized_conv_nets) apply FLOP-reduction methods on convolution weights, but do so to inform neural network architecture decisions before training. Follow-up work on the spatial bottleneck architecture (factorized_conv_nets), SqueezeNet (SqueezeNet), and MobileNet (DBLP:journals/corr/HowardZCKWWAA17) also focuses on training architectures from the start that are faster or have fewer FLOPs. DLA can be applied after these models have been trained to achieve further speedup.

Training-in-the-loop methods

park_pruning explore iteratively pruning and re-training individual convolutional filters to regain accuracy. Incremental quantization from (zhou2017) adds quantization in the training loop as well, decreasing both memory and runtime. Similarly, wen_learning_2016 sparsifies convolutions during training time to take advantage of sparse matrix multiplication methods at deploy time. All these methods require access to the training data and are time-consuming, whereas DLA can be applied on top of a network trained with any of these methods.

Model compression

Model compression makes model definitions smaller, especially when models need to be transferred in a limited-bandwidth or limited-memory environment (Cheng2017ASO). han2015deep_compression combine iterative convolution pruning and Huffman coding to achieve high compression, but require extensive re-training to achieve accuracy parity. Binarized Neural Networks and XNOR-Net (rastegariECCV16) take compression even further, using only 1 bit per weight value, but require access to the original training data to achieve high performance. DLA is complementary and can be applied after these methods have been used.

Data-agnostic approaches

Denton:2014:ELS:2968826.2968968 use SVD and factorization approaches to approximate convolutional and fully-connected layers, but also rely on re-training the network after an approximation has been applied to a single layer. Speedup is measured per-layer rather than end-to-end forward pass runtime, establishing a relationship between FLOPs and runtime for convolutional layers. DLA extends this method by applying approximations holistically to all layers in a network together, without requiring re-training in between applying approximations to individual layers. Additionally, the runtime decrease is measured end-to-end on a forward pass rather than on a per-layer basis.

3 Runtime and FLOPs

According to the roofline computation model (roofline), neural network runtime is dominated by FLOP count and memory movement to and from a specialized processor like a GPU. Convolutional, deconvolutional, and fully-connected layers (called representation layers throughout this paper) have a high arithmetic intensity, requiring many FLOPs to compute one unit of output. Therefore, these layers will operate in the FLOP-dominated regime from the roofline model rather than the memory-limited regime. CPUs and GPUs have a fixed processor throughput, so decreasing the number of FLOPs needed to perform an operation will decrease the amount of time spent in this bottleneck during a forward pass. DLA replaces FLOP-heavy representation layer operations with comparable operations that have a lower FLOP count, speeding up computations, and often pushing the layers’ operations out of the FLOP-limited regime into the memory-limited regime.

We use the FLOP count for a representation layer as a proxy for measuring its runtime, because FLOPs dominate the runtime. We only consider the max of the multiply and add operations, since most modern processors have an FMA or FMAC unit that can perform a simultaneous multiply-add operation. The typical way of applying representation layers to an input is through a matrix multiplication. Because of this, bias computations are ignored, as the computation is dominated by the matrix multiplication.

The FLOP count is dominated by the duplication factor on the input blob size. A convolutional layer is a 4D tensor in $\mathbb{R}^{c_o \times c_i \times k_h \times k_w}$, where $c_o$ is the number of output feature maps, $c_i$ is the number of input channels, and $k_h$ and $k_w$ are the kernel dimensions in $y$ and $x$ respectively. The convolution traverses an input blob in $\mathbb{R}^{c_i \times h \times w}$ with stride $s_h$ in $y$ and $s_w$ in $x$. In a grouped convolution, the input and output channels are broken into $g$ groups and the 4D weight tensor is in $\mathbb{R}^{c_o \times c_i / g \times k_h \times k_w}$. A grouped convolution is time-efficient because a batch process can do a group's worth of computation in a single step. For a deconvolution, the stride is applied on the output rather than the input. See Table 1 for FLOP counts for the forward pass of each representation layer.

Convolution Deconvolution Fully-Connected
Table 1: Representation layers and FLOP counts

Other layers, such as batch normalization or activation layers, have a negligible number of FLOPs in the forward pass of a network, since only one operation per input pixel is typically applied.
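The FLOP-counting conventions of this section can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard multiply-accumulate counts for each layer type, ignores bias and padding, and counts one fused multiply-add as one FLOP. The function names and the example layer sizes are hypothetical and are not taken from the paper.

```python
def conv_flops(c_out, c_in, k_h, k_w, in_h, in_w, s_h=1, s_w=1, groups=1):
    """Multiply-accumulate count for a convolution (bias and padding ignored)."""
    out_h = in_h // s_h  # assumed output height for stride s_h
    out_w = in_w // s_w  # assumed output width for stride s_w
    return (c_out * c_in * k_h * k_w * out_h * out_w) // groups


def deconv_flops(c_out, c_in, k_h, k_w, in_h, in_w, groups=1):
    """For a deconvolution the stride acts on the output, so every input
    position is multiplied against the full kernel exactly once."""
    return (c_out * c_in * k_h * k_w * in_h * in_w) // groups


def fc_flops(c_out, c_in):
    """A fully-connected layer is a single matrix-vector product."""
    return c_out * c_in


# Hypothetical example: a 3x3 convolution from 3 to 64 channels on a 224x224 input.
first_layer = conv_flops(64, 3, 3, 3, 224, 224)
strided = conv_flops(64, 3, 3, 3, 224, 224, s_h=2, s_w=2)
grouped = conv_flops(64, 4, 3, 3, 224, 224, groups=2)
```

Under these assumptions, doubling the stride in both directions divides the count by four, and splitting the channels into two groups divides it by two.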

4 The Approximation Pipeline

Only the model definition, the weights, and a single tunable input parameter $p$, which weights the importance of accuracy relative to speed, are needed to apply DLA.

The pipeline is applied sequentially to each representation layer in the network. No re-training is done between steps; the entire process uses the original trained network. Approximations are applied to all layers at once. The discussion below is illustrated on convolutional layers, but similar extensions can be derived for fully-connected and deconvolutional layers. The pipeline is broken into two parts: (1) lossless optimizations and (2) lossy approximations. The lossless optimizations are applied first, since they have no accuracy impact and will always decrease runtime. The lossy approximations are based on low-rank approximations of representation layers using SVD (jaderberg14speeding), represented in Figure 1. The optimal lossy approximation is chosen based on the accuracy weighting parameter $p$.

Figure 1: A visual representation of the 4 factorization methods used in DLA. $c_i$ is the number of input channels, $c_o$ is the number of output channels, $k_w$ is the kernel width, $k_h$ is the kernel height, and $k$ is the factorization parameter.

4.1 Lossless optimizations

Here, the term lossless optimizations means optimizations applied to the network that have no effect on accuracy. The main lossless optimization is the merging of adjacent linear layers. For example, representation layers are often followed by batch normalization and scale layers, to reduce overfitting during training time (43442). However, at deploy time, these operations are sequential linear operations applied on an input blob and can be combined into one. Table 2 shows how a cascade of linear layers, with $x$ as the input and $y$ as the output, can be combined into a single representation layer.

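Layer | Parameters | Operation | Cascaded operation
Representation | $W$ weight vector, $b$ bias vector | $Wx + b$ | $Wx + b$
Batchnorm | $\mu$ learned mean, $\sigma^2$ learned variance | $(x - \mu) / \sigma$ | $(Wx + b - \mu) / \sigma$
Scale | $\alpha$ learned constant, $\beta$ learned constant | $\alpha x + \beta$ | $\alpha (Wx + b - \mu) / \sigma + \beta$
Table 2: Linear operators in neural networks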

The final cascaded operation has the same form as a representation layer and can therefore be performed with a single multiply-add operation, with the new weight parameter $W' = \alpha W / \sigma$ and new bias parameter $b' = \alpha (b - \mu) / \sigma + \beta$, where $W$ and $b$ are the weight and bias of the representation layer, $\mu$ and $\sigma^2$ are the learned batchnorm mean and variance, and $\alpha$ and $\beta$ are the learned scale constants. Therefore, all three layers can be combined into a single representation layer with these new parameters. A similar derivation can apply for batchnorm / scale layers that appear before representation layers, if only a batchnorm layer (without a scale layer) is present, or if two representation layers are applied sequentially without a non-linearity in between.

While applying this lossless optimization saves a small number of FLOPs in the entire forward pass, it reduces the need for the neural network library to allocate memory to the batchnorm and scale layers, which reduces runtime. Because this optimization is exact (up to floating-point error), it should always be applied to a network before moving it to production.
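The merge of a representation layer with a following batchnorm and scale layer can be checked numerically. The sketch below uses NumPy and a fully-connected layer for brevity. The function name `fold_batchnorm_scale` and the optional `eps` stabilizer are assumptions made for illustration, not part of the paper.

```python
import numpy as np


def fold_batchnorm_scale(W, b, mean, var, alpha, beta, eps=0.0):
    """Fold y = alpha * ((W x + b) - mean) / sqrt(var + eps) + beta
    into a single layer y = W_folded x + b_folded."""
    factor = alpha / np.sqrt(var + eps)  # one factor per output channel
    # Broadcast the per-output-channel factor over the remaining weight axes.
    W_folded = W * factor.reshape(-1, *([1] * (W.ndim - 1)))
    b_folded = (b - mean) * factor + beta
    return W_folded, b_folded


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
b = rng.normal(size=4)
mean = rng.normal(size=4)
var = rng.uniform(0.5, 2.0, size=4)
alpha = rng.normal(size=4)
beta = rng.normal(size=4)
x = rng.normal(size=3)

# Three separate layers versus the single folded layer.
reference = alpha * ((W @ x + b) - mean) / np.sqrt(var) + beta
W_f, b_f = fold_batchnorm_scale(W, b, mean, var, alpha, beta)
folded = W_f @ x + b_f
```

Because the factor is reshaped along the output-channel axis, the same function should also apply to 4D convolution weights whose first axis indexes output channels.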

4.2 Lossy approximations

After the lossless optimizations are applied to a layer, we then apply a series of lossy approximations. Each lossy approximation is a factorization based on SVD along a given axis of the representation layer. An accuracy score $A$ and a runtime score $R$ are computed for each approximation. The overall score $S$ is the weighted sum of the two, weighted by the user-specified parameter $p$. $A$ is the percentage of variation explained by an SVD approximation, and $R$ is the FLOP reduction. Intuitively, using low-rank decomposition to determine which FLOPs to keep and which to approximate means that representation layers that are redundant will have a lower-rank decomposition, and yield more speedup for less accuracy loss when approximated.

4.2.1 Filter-wise factorization

Filter-wise factorization applies SVD on the output-channels axis of the representation layer. The decomposition will be illustrated on the convolutional layer but is extensible for deconvolution and fully-connected layers without loss of generality. Applying SVD along the first axis of a convolution $W \in \mathbb{R}^{c_o \times c_i \times k_h \times k_w}$ (flattened to a $c_o \times c_i k_h k_w$ matrix) yields a factorization $W = U \Sigma V^T$ with $c_o$ singular values, where $U \in \mathbb{R}^{c_o \times c_o}$, $\Sigma$ is a diagonal matrix, and $V^T \in \mathbb{R}^{c_o \times c_i k_h k_w}$. Taking the first $k$ singular values, we can reconstruct a low-rank approximation of $W$, where the inner dimension is changed from $c_o$ to $k$.

This can be expressed as two adjacent convolutions $W_1$ and $W_2$. Table 3 shows a detailed breakdown of FLOP counts and weight shapes for the resulting convolutions $W_1$ and $W_2$. There is a FLOP reduction in all cases where $k$ is sufficiently small. The ratio of original FLOPs to new FLOPs is the runtime score $R$ for this approximation. If $\sigma_i$ is the $i$-th singular value of $W$, then the accuracy score $A$ is the percentage of variation explained by the first $k$ singular values.

Original   Factored
Output channels
Input channels
Kernel size
Table 3: FLOP reduction for filter-wise factorization if $k$ singular values are used

A variant of this approximation, called projection-first factorization, is applied the same way, except the filter is reconstructed into the second convolution rather than the first. A similar calculation can be done for the FLOP reduction.
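As a rough illustration of filter-wise factorization, the NumPy sketch below splits a convolution weight into a convolution with $k$ filters followed by a 1x1 convolution. Several details are assumptions: the layout of the two factors, the use of squared singular values for the percentage of variation explained, and a runtime score equal to the ratio of weight counts (which equals the FLOP ratio only when both factored convolutions produce the same spatial output size as the original). The function name is hypothetical.

```python
import numpy as np


def filterwise_factorize(W, k):
    """Split W of shape (c_o, c_i, k_h, k_w) into W1 (k filters) and W2 (1x1)."""
    c_o, c_i, k_h, k_w = W.shape
    U, s, Vt = np.linalg.svd(W.reshape(c_o, -1), full_matrices=False)
    W1 = Vt[:k].reshape(k, c_i, k_h, k_w)          # c_i -> k channels, k_h x k_w kernel
    W2 = (U[:, :k] * s[:k]).reshape(c_o, k, 1, 1)  # k -> c_o channels, 1x1 kernel
    # Assumed definition: share of squared singular values that is kept.
    accuracy = float(np.sum(s[:k] ** 2) / np.sum(s ** 2))
    # Original FLOPs over new FLOPs, per output position.
    runtime = W.size / (W1.size + W2.size)
    return W1, W2, accuracy, runtime


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4, 3, 3))
W1, W2, accuracy, runtime = filterwise_factorize(W, k=2)
```

Keeping all singular values reproduces the original weights exactly, while a small `k` trades a lower accuracy score for a higher runtime score.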

4.2.2 Separable factorization

Separable factorization applies SVD to the kernel axis of the representation layer. To find a separable factorization, the weight tensor $W$ is flattened into a matrix and SVD is applied along the kernel axes. The resultant approximation tensors $W_1$ and $W_2$ each have kernels that are 1-dimensional in the spatial axes. Table 4 shows the detailed parameters to reconstruct the two separable weight tensors.

Original   Factored
Output channels
Input channels
Kernel size
Table 4: FLOP reduction for separable factorization if $k$ singular values are used

Just like in the filter-wise factorization approximation, if $\sigma_i$ is the $i$-th singular value of $W$, then the accuracy score $A$ is the percentage of the variation of $W$ explained by the first $k$ singular values, and the runtime score $R$ is the FLOP reduction ratio.
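One way to realize a separable factorization is sketched below in NumPy. The flattening used here, a $(c_i k_h) \times (c_o k_w)$ matrix that yields a vertical $k_h \times 1$ convolution followed by a horizontal $1 \times k_w$ convolution, is an assumption chosen for illustration and may differ from the layout used in the paper. The function names and the squared-singular-value accuracy score are also assumptions.

```python
import numpy as np


def separable_factorize(W, k):
    """Split W (c_o, c_i, k_h, k_w) into a k_h x 1 conv W1 and a 1 x k_w conv W2."""
    c_o, c_i, k_h, k_w = W.shape
    # Rows index (input channel, kernel row); columns index (output channel, kernel column).
    M = W.transpose(1, 2, 0, 3).reshape(c_i * k_h, c_o * k_w)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    W1 = (U[:, :k] * s[:k]).T.reshape(k, c_i, k_h, 1)
    W2 = Vt[:k].reshape(k, c_o, k_w).transpose(1, 0, 2).reshape(c_o, k, 1, k_w)
    accuracy = float(np.sum(s[:k] ** 2) / np.sum(s ** 2))  # assumed definition
    runtime = W.size / (W1.size + W2.size)                 # original / new FLOPs
    return W1, W2, accuracy, runtime


def recompose(W1, W2):
    """Effective 4D kernel of applying W1 and then W2."""
    return np.einsum("kiy,okx->oiyx", W1[..., 0], W2[:, :, 0, :])


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4, 3, 3))
W1, W2, accuracy, runtime = separable_factorize(W, k=4)
```

The helper `recompose` only verifies the algebra; in a network the two factors would be applied as consecutive convolutions.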

4.2.3 Per-channel factorization

Per-channel approximation applies SVD to each input channel separately, to approximate the original convolution by a set of 2-dimensional convolutions to apply per-channel. $W$ is approximated by two new convolutions $W_1$ and $W_2$. $W_1$ is applied in grouped fashion to the $c_i$ input channels. The result is $c_i k$ output channels after $W_1$. Then $W_2$ reconstructs the original $c_o$ output channels. See Table 5 for details.

Original   Factored
Output channels
Input channels
Kernel size
Table 5: FLOP reduction for per-channel factorization if $k$ singular values are used

The decomposition is found by applying SVD $c_i$ times, once on each slice of the original weight that corresponds to a single input channel. The accuracy score $A$ is the percentage of variation explained for each slice, averaged across all the input channels. Because the FLOP count is inversely proportional to the group parameter, this approximation results in a large FLOP reduction.
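The per-channel procedure can be sketched in NumPy as follows. This is an illustrative sketch under assumptions: the first factor is stored as a grouped convolution with one group per input channel and $k$ filters per group, the second factor is a 1x1 convolution, and the accuracy score uses squared singular values. The function name is hypothetical.

```python
import numpy as np


def per_channel_factorize(W, k):
    """Factor W (c_o, c_i, k_h, k_w) into a grouped conv W1 and a 1x1 conv W2."""
    c_o, c_i, k_h, k_w = W.shape
    W1 = np.zeros((c_i * k, 1, k_h, k_w))  # grouped conv with groups = c_i
    W2 = np.zeros((c_o, c_i * k, 1, 1))    # 1x1 conv back to c_o channels
    scores = []
    for j in range(c_i):
        # SVD of the slice of W that reads input channel j.
        U, s, Vt = np.linalg.svd(W[:, j].reshape(c_o, -1), full_matrices=False)
        W1[j * k:(j + 1) * k, 0] = Vt[:k].reshape(k, k_h, k_w)
        W2[:, j * k:(j + 1) * k, 0, 0] = U[:, :k] * s[:k]
        scores.append(np.sum(s[:k] ** 2) / np.sum(s ** 2))
    accuracy = float(np.mean(scores))       # averaged over input channels
    runtime = W.size / (W1.size + W2.size)  # original / new FLOPs
    return W1, W2, accuracy, runtime


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4, 3, 3))
W1, W2, accuracy, runtime = per_channel_factorize(W, k=2)
```

Here `k` must not exceed `min(c_o, k_h * k_w)`, the number of singular values available for each slice.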

4.3 Chaining approximations

Approximations can be applied on top of one another. For example, if a projection-first approximation is applied on top of a filter-wise approximation on an input layer $W$, the resulting output will be three convolutions: $W_1$, $W_2$, and $W_3$. The runtime score $R$ for this approximation chain is the ratio of the FLOPs from the original layer to the new FLOPs, consistent with the runtime score of a single approximation. The accuracy score $A$ is the product of the accuracy scores for each constituent decomposition, since the convolutions are applied one after another and any error introduced by the first will be carried over to the next.

4.4 Optimizing the overall approximation

The parameter $p$ specified by the user sets the relative importance of maintaining accuracy or decreasing runtime. The optimal approximation is the approximation (of all possible sequences of approximations) that has the highest score $S$. It is guaranteed to be the best tradeoff between runtime and accuracy loss, since SVD is the optimal low-rank decomposition along the given axis.
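The selection step can be sketched as a small search over candidate approximations. The exact form of the weighted sum is not recoverable from the text, so the convex combination below is an assumption, as are the function names, the dictionary layout, and the example scores.

```python
def overall_score(accuracy, runtime, p):
    """Assumed form of the weighted sum: p weights accuracy, (1 - p) weights runtime."""
    return p * accuracy + (1.0 - p) * runtime


def chain_scores(accuracies, original_flops, new_flops):
    """Scores for a chain: accuracy scores multiply, runtime is original / new FLOPs."""
    accuracy = 1.0
    for a in accuracies:
        accuracy *= a
    return accuracy, original_flops / new_flops


def best_approximation(candidates, p):
    """Return the candidate with the highest overall score."""
    return max(candidates, key=lambda c: overall_score(c["accuracy"], c["runtime"], p))


# Hypothetical candidates for one layer.
candidates = [
    {"name": "filter-wise", "accuracy": 0.95, "runtime": 1.5},
    {"name": "per-channel", "accuracy": 0.70, "runtime": 3.0},
]
accuracy_first = best_approximation(candidates, p=0.9)
speed_first = best_approximation(candidates, p=0.1)
```

With these made-up numbers, a large `p` favors the more accurate filter-wise candidate and a small `p` favors the faster per-channel candidate.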

4.5 Finding approximation groups

Some network architectures like ResNets have more complex structures, where layers in a network share the same input or the same output. In the case of a residual unit in a ResNet, pairs of layers share an output representation, since their outputs get added together as part of the residual operation. In these cases, the joined layers can be considered for approximation together as part of an approximation group. The weights are concatenated into a single weight matrix, and the reconstructed weights after approximation are split into separate layer outputs to match the original output blobs.

4.6 Applying approximations holistically

During the forward pass of a network, errors introduced at layers closer to the input of the network propagate up, accumulating as the execution gets closer to the output, and are therefore more consequential. Additionally, layers closer to the output tend to take a longer time to execute on a forward pass. To prevent high error accumulations, the parameter used for each layer's approximation is based on the user-defined parameter $p$, starting at a higher value at the input and linearly decreasing as you get farther from the input. The start value can be further tuned to decrease the overall error. This also means that layers closer to the output (which typically are slower) have a more aggressive approximation applied, which provides a better trade-off between runtime and accuracy. Furthermore, in practice, $p$ is used as a threshold: no approximation whose accuracy score falls below it is considered a valid approximation.
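The per-layer schedule and the accuracy threshold can be sketched as follows. The start and end values of the schedule are not recoverable from the text, so they are left as explicit arguments. The function names and the example values are hypothetical.

```python
def layer_parameters(num_layers, start, end):
    """Linearly interpolate the per-layer parameter from `start` (input side)
    to `end` (output side)."""
    if num_layers == 1:
        return [start]
    step = (start - end) / (num_layers - 1)
    return [start - i * step for i in range(num_layers)]


def valid_candidates(candidates, threshold):
    """Keep only approximations whose accuracy score reaches the threshold."""
    return [c for c in candidates if c["accuracy"] >= threshold]


# Hypothetical 5-layer network with assumed start and end values.
schedule = layer_parameters(5, start=1.0, end=0.5)
kept = valid_candidates(
    [{"name": "a", "accuracy": 0.95}, {"name": "b", "accuracy": 0.60}],
    threshold=0.75,
)
```

A lower parameter near the output admits more aggressive approximations for the layers that typically dominate runtime.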

5 Experimental results

On 4 network architectures designed specifically for computer vision applications, we show between 1.5x and 2x runtime improvement for a forward pass, a limited loss in accuracy, and nearly full recovery of accuracy after finetuning. Runtime was tested using Caffe (jia2014caffe) compiled with CUDA8 on a g2 cloud instance. Detailed accuracy and runtime results are shown in Table 6. The FLOP reduction (rather than memory reduction) correlates with the absolute speedup, with the exception of the ResNet50 network, as shown in Table 7. The ResNet50 network has many convolutions that are by design memory-limited rather than FLOP-limited, so DLA is less effective.

Network / dataset | Runtime (ms): Baseline, DLA, Speedup | Accuracy (top-1 or mAP): Baseline, DLA, Finetuned
AlexNet (NIPS2012_4824) / CIFAR10 (cifar10)
ResNet50 (He2015) / ImageNet2012 (Russakovsky:2015:ILS:2846547.2846559)
VGG16 (Simonyan14c) / ImageNet2012
YOLO (redmon2016yolo9000) / VOC2007 (pascal-voc-2007)

Table 6: DLA applied to 4 networks and datasets

Network / dataset | Speedup | FLOP Reduction | Memory Reduction
AlexNet (NIPS2012_4824) / CIFAR10 (cifar10)
ResNet50 (He2015) / ImageNet2012 (Russakovsky:2015:ILS:2846547.2846559)
VGG16 (Simonyan14c) / ImageNet2012
YOLO (redmon2016yolo9000) / VOC2007 (pascal-voc-2007)

Table 7: Speedup, FLOP reduction, and memory reduction

Varying the input parameter $p$ yields different points on an accuracy / runtime curve. This means that practitioners can choose what balance of accuracy and runtime is most important for the application. In Table 8, decreasing values of $p$ yield faster runtimes (as measured in milliseconds on a g2 instance) but also higher accuracy losses. If time can be spent on fine-tuning the post-DLA output network, then accuracy can be recovered back to original levels.

Runtime (ms) Accuracy (mAP)
Table 8: Runtime and accuracy when varying $p$ on YOLO

6 Discussion

Here we examine specific characteristics of DLA.


The optimization function for choosing which approximation to apply specifically targets FLOP reduction, since FLOP reduction is directly related to runtime decrease for FLOP-limited layers. However, a consequence of decreasing FLOP count by using fewer output channels in a cascade of convolutions is that less memory is needed for a forward pass as well. Especially in embedded environments, this makes larger networks viable. When networks that have been passed through DLA see diminishing returns from FLOP-reduction methods, memory movement has become the bottleneck in runtime, and different memory-reducing optimizations should be applied. In Table 7 we see memory requirements at runtime decrease as a result of applying DLA.

Upper bound

Because DLA applies FLOP-reducing approximations, any convolutions that are memory-limited will not be sped up significantly with DLA. For example, fully-connected layers and 1x1 convolutions are typically memory-limited. From Table 7 we see that the ResNet50 network is more memory-limited than FLOP-limited, since the runtime speedup is more closely related to the memory reduction than to the FLOP reduction. The FLOP decrease is observed in the non-1x1 convolutions, and the speedup is likely a result of the memory decrease rather than the FLOP decrease. Pushing beyond the 2x speedup observed on YOLO without significant accuracy loss is not possible with the proposed DLA approximations, because once DLA has been applied, layers are moved from the FLOP-limited regime to the memory-limited regime in the roofline model.

GPU-specific optimizations

The runtime improvements here are reported on a g2 cloud instance, but proportional speedup is also observed on a CPU. For GPU targets, we have observed that although the relationship between FLOPs and runtime is linear overall, the single most significant factor in runtime for a given representation layer is the number of output channels. The runtime is a step function, with large jumps at output channels that are powers of 2, and linear in between. So, in practice when targeting networks for GPUs, we only choose between decompositions with powers-of-2 output channels. For a CPU, this effect is not observed and any number of output channels is considered. GPUs have different FLOP and memory throughputs, which are properties inherent to the GPU. This means absolute and relative speedups will be different according to the target GPU architecture. A benefit of DLA is that it is GPU-architecture-agnostic, meaning the FLOP count will always be decreased, which will typically result in some speedup on any target device.
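Restricting a GPU search to power-of-2 output channel counts can be sketched as below. The helper names are hypothetical, and the choice to stop strictly below the original channel count is an assumption.

```python
def gpu_candidate_ranks(c_o):
    """Powers of 2 strictly below the original number of output channels."""
    ranks = []
    k = 1
    while k < c_o:
        ranks.append(k)
        k *= 2
    return ranks


def cpu_candidate_ranks(c_o):
    """On a CPU every rank below the original channel count is a candidate."""
    return list(range(1, c_o))


gpu_ranks = gpu_candidate_ranks(96)
cpu_ranks = cpu_candidate_ranks(96)
```

For a layer with 96 output channels this leaves 7 GPU candidates instead of 95, which also shrinks the search over approximations.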


Results were tested using both Caffe and MXNet frameworks, and DLA optimizations can be applied to either kind of model. Similar speedups are observed with both frameworks. This is because the FLOP reduction optimizations from DLA are model-agnostic: they operate on the weight matrices directly and are not dependent on software implementation. Exact observed runtime speedups will be different depending on framework and implementation, but speedups will be seen using all network formats.

Relationship to Training

Because we have shown that accuracy can be recovered by fine-tuning after applying DLA, a good strategy is to train large networks, then iteratively apply DLA and finetune, if the data and training procedure are available. More aggressive speedup can be achieved in this way, though this will also take more time. On the other hand, DLA does not rely on data availability, and can be taken as a data-agnostic black box. Those who do not have access to the training data or procedure can just apply DLA to get speedup with minimal accuracy loss.

7 Conclusion

Deep Learning Approximation can be applied to an already-trained network to speed it up while incurring only a small amount of accuracy loss. Access to training data is not required and the techniques are framework-agnostic, which means DLA can be used on black-box networks or in environments where the training data cannot be accessed. The combination of approximations that best achieves the desired accuracy loss is chosen for each layer through an exhaustive search. DLA can be combined with other methods for speeding up neural networks. This runtime reduction can generate a multiplier in cost reduction for a production service that uses neural networks. Any accuracy loss that was introduced can be recovered by fine-tuning or re-training the new resultant network.