1 Introduction
Deep Learning Approximation (DLA) speeds up the runtime of neural networks, which is especially relevant for pay-per-compute or limited-compute embedded environments. It decouples the task of speeding up neural networks from the task of making them more accurate, enabling off-the-shelf networks to be tuned for runtime as a complement to fine-tuning for accuracy.
At deploy time, the dollar cost of a production pipeline using a neural network is directly proportional to the time required to execute a forward pass of the network. This is in contrast to training, which generally happens only once and offline. For large networks, forward-pass time is in turn proportional to the number of floating-point operations (FLOPs) required. For example, in the cloud, instances are paid for by the hour, so a pipeline running a ResNet152 model instead of a ResNet50 model requires several times as many FLOPs per forward pass, runs correspondingly slower, and therefore costs correspondingly more, at minimal benefit in ImageNet error (He2015). Running faster models means higher throughput, which means better hardware utilization, both in cloud and embedded environments. For embedded environments, faster and smaller models also mean lower power, cooling, and CPU speed requirements.

DLA has several other important properties. For networks used as a black box or where training data is confidential, retraining or fine-tuning the network is not an option or is extremely challenging. In these cases, DLA is a method for speeding up the network forward pass without needing access to the training data. When fine-tuning is possible, it can recover accuracy lost during the DLA process.
In DLA, we first apply lossless optimizations that reduce FLOP count and runtime at no cost in accuracy, so they should always be applied. We then propose a method for automatically selecting an appropriate approximation to apply to each layer in a network based on a single parameter, which represents whether accuracy or speedup is more important. The approximation methods are all based on singular value decompositions (SVD) applied directly to the weight tensors.
On a benchmark set of standard computer vision tasks and neural network architectures, we show between a 1.5x and 2x speedup without additional retraining and with minimal accuracy loss. In each case, fine-tuning after applying DLA recovers the lost accuracy.
2 Related Work
Better hardware and software
To make a large, slow neural network run faster, you can deploy it to faster hardware (FPGAs, better GPUs) using better software (such as MXNet (DBLP:journals/corr/ChenLLLWWXXZZ15) and Caffe con Troll (Hadjis:2015:CCT:2799562.2799641)). DLA offers a third option: it operates directly on neural network models and can be applied to models targeting any hardware and software.

Training smaller models
Factorization and low-rank approximation approaches such as that of jaderberg14speeding and factorized CNNs (factorized_conv_nets) apply FLOP-reduction methods to convolution weights, but to inform neural network architecture decisions before training. Follow-up work on the spatial bottleneck architecture (factorized_conv_nets), SqueezeNet (SqueezeNet), and MobileNet (DBLP:journals/corr/HowardZCKWWAA17) also focuses on training architectures that are faster or have fewer FLOPs from the start. DLA can be applied after these models have been trained to achieve further speedup.
Training-in-the-loop methods
park_pruning explore iteratively pruning and retraining individual convolutional filters to regain accuracy. Incremental quantization (zhou2017) adds quantization to the training loop as well, decreasing both memory and runtime. Similarly, wen_learning_2016 sparsify convolutions during training to take advantage of sparse matrix multiplication methods at deploy time. All these methods require access to the training data and are time-consuming, whereas DLA can be applied on top of a network trained with any of them.
Model compression
Model compression makes model definitions smaller, especially when models need to be transferred in a limited-bandwidth or limited-memory environment (Cheng2017ASO). han2015deep_compression combine iterative convolution pruning and Huffman coding to achieve high compression, but require extensive retraining to achieve accuracy parity. Binarized Neural Networks (courbariaux_binarized_2016) and XNOR-Net (rastegariECCV16) take compression even further, using only one bit per weight value, but require access to the original training data to achieve high performance. DLA is complementary and can be applied after these methods have been used.

Data-agnostic approaches
Denton:2014:ELS:2968826.2968968 use SVD and factorization approaches to approximate convolutional and fully-connected layers, but also rely on retraining the network after an approximation has been applied to a single layer. Speedup is measured per-layer rather than as end-to-end forward-pass runtime, establishing a relationship between FLOPs and runtime for convolutional layers. DLA extends this method by applying approximations holistically to all layers in a network together, without requiring retraining in between approximating individual layers. Additionally, the runtime decrease is measured end-to-end on a forward pass rather than per layer.
3 Runtime and FLOPs
According to the roofline computation model (roofline), neural network runtime is dominated by FLOP count and memory movement to and from a specialized processor like a GPU. Convolutional, deconvolutional, and fully-connected layers (called representation layers throughout this paper) have a high arithmetic intensity, requiring many FLOPs to compute one unit of output. Therefore, these layers operate in the FLOP-dominated regime of the roofline model rather than the memory-limited regime. CPUs and GPUs have a fixed processor throughput, so decreasing the number of FLOPs needed to perform an operation decreases the time spent in this bottleneck during a forward pass. DLA replaces FLOP-heavy representation layer operations with comparable operations that have a lower FLOP count, speeding up computation and often pushing the layers' operations out of the FLOP-limited regime into the memory-limited regime.
We use the FLOP count of a representation layer as a proxy for its runtime, because FLOPs dominate the runtime. We count only the max of the multiply and add operations, since most modern processors have an FMA (fused multiply-add) unit that can perform a simultaneous multiply-add. The typical way of applying a representation layer to an input is through a matrix multiplication; bias computations are therefore ignored, as the computation is dominated by the matrix multiplication.
The FLOP count is dominated by the duplication factor on the input blob size. A convolutional layer weight is a 4D tensor $W \in \mathbb{R}^{F \times C \times k_h \times k_w}$, where $F$ is the number of output feature maps, $C$ is the number of input channels, and $k_h$ and $k_w$ are the kernel dimensions in $y$ and $x$ respectively. The convolution traverses an input blob in $\mathbb{R}^{C \times H \times W}$ with stride $s_h$ in $y$ and $s_w$ in $x$. In a grouped convolution with $g$ groups, the input and output channels are broken into $g$ groups and the 4D weight tensor is in $\mathbb{R}^{F \times (C/g) \times k_h \times k_w}$. A grouped convolution is time-efficient because a batch process can do a group's worth of computation in a single step. For a deconvolution, the stride is applied on the output rather than the input. See Table LABEL:tab:flopcounts for FLOP counts for the forward pass of each representation layer (convolution, deconvolution, and fully-connected).
Other layers, such as batch normalization or activation layers, have a negligible number of FLOPs in the forward pass of a network, since only one operation per input pixel is typically applied.
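As an illustrative sketch (our own helper, not part of the paper's pipeline), the forward-pass multiply-add count of a convolutional layer with the dimensions above can be computed as:

```python
def conv_flops(F, C, kh, kw, H_out, W_out, groups=1):
    """Fused multiply-add count for one forward pass of a convolution with
    F output channels, C input channels, a kh x kw kernel, and an
    H_out x W_out output grid. Counts one FLOP per multiply-add pair."""
    return F * (C // groups) * kh * kw * H_out * W_out

# A 3x3 convolution with 64 input and 64 output channels on a 56x56 output:
flops = conv_flops(64, 64, 3, 3, 56, 56)  # 115,605,504 multiply-adds
# Grouping by g divides the count by g:
assert conv_flops(64, 64, 3, 3, 56, 56, groups=4) == flops // 4
```

The same helper covers a deconvolution by passing its (larger) output grid, and a fully-connected layer by setting the kernel and output grid to 1x1.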
4 The Approximation Pipeline
Only the model definition, the weights, and a single tunable input parameter that weights the importance of accuracy relative to speed on a scale from 0 to 1 are needed to apply DLA.
The pipeline is applied to each representation layer in the network. There is no retraining done between steps; the entire process uses the original trained network, and approximations are applied to all layers at once. The discussion below is illustrated on convolutional layers, but similar derivations extend to fully-connected and deconvolutional layers. The pipeline is broken into two parts: (1) lossless optimizations and (2) lossy approximations. The lossless optimizations are applied first, since they have no accuracy impact and always decrease runtime. The lossy approximations are based on low-rank approximations of representation layers using SVD (jaderberg14speeding), represented in Figure 1. The optimal lossy approximation is chosen based on the accuracy weighting parameter.
4.1 Lossless optimizations
Here, the term lossless optimization means an optimization applied to the network that has no effect on accuracy. The main lossless optimization is the merging of adjacent linear layers. For example, representation layers are often followed by batch normalization and scale layers to reduce overfitting during training (43442). At deploy time, however, these are sequential linear operations applied to an input blob and can be combined into one. Table LABEL:tab:linearops shows how a cascade of linear layers, with $x$ as the input and $y$ as the output, can be combined into a single representation layer.
Layer  Parameters  Operation  Cascaded operation
Representation  $W$, $b$  $y = Wx + b$  $y = Wx + b$
Batchnorm  $\mu$, $\sigma$  $y = (x - \mu)/\sigma$  $y = (W/\sigma)x + (b - \mu)/\sigma$
Scale  $\gamma$, $\beta$  $y = \gamma x + \beta$  $y = (\gamma W/\sigma)x + \gamma(b - \mu)/\sigma + \beta$
The final cascaded operation, $y = (\gamma W/\sigma)x + \gamma(b - \mu)/\sigma + \beta$, has the same form as a representation layer and can therefore be performed with a single multiply-add, with new weight parameter $W' = \gamma W/\sigma$ and new bias parameter $b' = \gamma(b - \mu)/\sigma + \beta$. Therefore, all three layers can be combined into a single representation layer with these new parameters. A similar derivation applies when batchnorm / scale layers appear before representation layers, when only a batchnorm layer (without a scale layer) is present, or when two representation layers are applied sequentially without a nonlinearity in between.
While this lossless optimization saves only a small number of FLOPs over the entire forward pass, it removes the need for the neural network library to allocate memory for the batchnorm and scale layers, which reduces runtime. Because the optimization is exact (up to floating-point error), it should always be applied to a network before moving it to production.
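The folding of a batchnorm and scale layer into the preceding representation layer can be sketched as follows (a minimal NumPy sketch of standard batchnorm folding; function and variable names are ours):

```python
import numpy as np

def fold_batchnorm(W, b, mu, sigma, gamma, beta):
    """Fold a batchnorm (mu, sigma) and scale (gamma, beta) layer that follows
    a representation layer y = W x + b into new parameters W', b'.
    Shapes: W is (F, D) with the input flattened to D; the rest are (F,)."""
    scale = gamma / sigma                 # per-output-channel multiplier
    W_new = W * scale[:, None]            # W' = (gamma / sigma) W
    b_new = (b - mu) * scale + beta       # b' = gamma (b - mu) / sigma + beta
    return W_new, b_new

# The folded layer gives the same output as the three-layer cascade:
F, D = 4, 9
rng = np.random.default_rng(0)
W, b = rng.normal(size=(F, D)), rng.normal(size=F)
mu, sigma = rng.normal(size=F), rng.uniform(0.5, 2.0, size=F)
gamma, beta = rng.normal(size=F), rng.normal(size=F)
x = rng.normal(size=D)
cascade = gamma * ((W @ x + b - mu) / sigma) + beta
W2, b2 = fold_batchnorm(W, b, mu, sigma, gamma, beta)
assert np.allclose(W2 @ x + b2, cascade)
```

For a convolutional layer, the same per-output-channel scaling applies to each filter slice of the 4D weight tensor.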
4.2 Lossy approximations
After the lossless optimizations are applied, we then apply a series of lossy approximations to each layer. Each lossy approximation is a factorization based on SVD along a given axis of the representation layer. An accuracy score and a runtime score are computed for each approximation, and the overall score is their weighted sum under the user-specified parameter. The accuracy score is the percentage of variation explained by an SVD approximation, and the runtime score is the FLOP reduction. Intuitively, using a low-rank decomposition to determine which FLOPs to keep and which to approximate means that redundant representation layers will have a lower-rank decomposition and yield more speedup for less accuracy loss when approximated.
4.2.1 Filterwise factorization
Filterwise factorization applies SVD on the output-channels axis of the representation layer. The decomposition is illustrated on a convolutional layer but extends to deconvolutional and fully-connected layers without loss of generality. Flattening a convolution $W \in \mathbb{R}^{F \times C \times k_h \times k_w}$ along its first axis and applying SVD yields a factorization $W = U S V^\top$ with $F$ singular values, where $S$ is a diagonal matrix. Taking the first $f$ singular values, we can reconstruct a low-rank approximation of $W$, where the inner dimension is changed from $F$ to $f$.

This can be expressed as two adjacent convolutions $W_1$ and $W_2$. Table LABEL:tab:filterwiseflops shows a detailed breakdown of FLOP counts and weight shapes for the resulting convolutions $W_1$ and $W_2$. There is a FLOP reduction in all cases where $f (C k_h k_w + F) < F C k_h k_w$. The ratio of original FLOPs to new FLOPs is the runtime score for this approximation. If $\sigma_i$ is the $i$th singular value of $W$, then the accuracy score is the percentage of variation explained by the first $f$ singular values, $\sum_{i=1}^{f} \sigma_i^2 / \sum_{i=1}^{F} \sigma_i^2$.
  Original  Factored
Names  $W$  $W_1$, $W_2$
Output channels  $F$  $f$, $F$
Input channels  $C$  $C$, $f$
Kernel size  $k_h \times k_w$  $k_h \times k_w$, $1 \times 1$
FLOPs  $F C k_h k_w \frac{H}{s_h} \frac{W}{s_w}$  $(f C k_h k_w + F f) \frac{H}{s_h} \frac{W}{s_w}$
A variant of this approximation, called projection-first factorization, is applied the same way, except that the full kernel is reconstructed into the second convolution rather than the first. A similar calculation gives its FLOP reduction.
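A minimal NumPy sketch of the filterwise factorization, with our own naming (not the paper's implementation):

```python
import numpy as np

def filterwise_factorize(W, f):
    """Low-rank factorization of a conv weight W with shape (F, C, kh, kw)
    along the output-channel axis, keeping f singular values. Returns the
    two factor convolutions and the accuracy score (explained variation)."""
    F, C, kh, kw = W.shape
    M = W.reshape(F, C * kh * kw)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    W1 = Vt[:f].reshape(f, C, kh, kw)            # f x C x kh x kw conv
    W2 = (U[:, :f] * s[:f]).reshape(F, f, 1, 1)  # F x f x 1 x 1 conv
    accuracy_score = (s[:f] ** 2).sum() / (s ** 2).sum()
    return W1, W2, accuracy_score

# Keeping all F singular values reconstructs W exactly:
W = np.random.default_rng(1).normal(size=(8, 4, 3, 3))
W1, W2, acc = filterwise_factorize(W, 8)
recon = (W2.reshape(8, 8) @ W1.reshape(8, -1)).reshape(W.shape)
assert np.allclose(recon, W) and np.isclose(acc, 1.0)
```

Swapping which factor absorbs $U S$ and which absorbs $V^\top$ gives the projection-first variant.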
4.2.2 Separable factorization
Separable factorization applies SVD to the kernel axes of the representation layer. To find a separable factorization, the weight tensor is flattened into a matrix that collapses the output channels with $k_h$ along the rows and the input channels with $k_w$ along the columns, and SVD is applied to it. The resultant approximation tensors $W_1$ and $W_2$ each have kernels that are one-dimensional in the spatial axes. Table LABEL:tab:separableops shows the detailed parameters to reconstruct the two separable weight tensors.
  Original  Factored
Names  $W$  $W_1$, $W_2$
Output channels  $F$  $f$, $F$
Input channels  $C$  $C$, $f$
Kernel size  $k_h \times k_w$  $1 \times k_w$, $k_h \times 1$
FLOPs  $F C k_h k_w \frac{H}{s_h} \frac{W}{s_w}$  $(f C k_w + F f k_h) \frac{H}{s_h} \frac{W}{s_w}$
Just as in the filterwise factorization, if $\sigma_i$ is the $i$th singular value of the flattened matrix, then the accuracy score is the percentage of variation explained by the first $f$ singular values, and the runtime score is the FLOP reduction ratio.
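A sketch of the kernel-axis SVD under our assumed flattening (output channels paired with $k_h$ on the rows, input channels with $k_w$ on the columns); the layout and names are our own, not necessarily the paper's exact scheme:

```python
import numpy as np

def separable_factorize(W, f):
    """Separable factorization of a conv weight W with shape (F, C, kh, kw):
    flatten so kh goes with the rows and kw with the columns, apply SVD, and
    keep f components. W1 has 1 x kw kernels and W2 has kh x 1 kernels."""
    F, C, kh, kw = W.shape
    M = W.transpose(0, 2, 1, 3).reshape(F * kh, C * kw)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    W1 = Vt[:f].reshape(f, C, 1, kw)                               # f x C x 1 x kw
    W2 = (U * s).T[:f].reshape(f, F, kh, 1).transpose(1, 0, 2, 3)  # F x f x kh x 1
    accuracy_score = (s[:f] ** 2).sum() / (s ** 2).sum()
    return W1, W2, accuracy_score

# Keeping all components reconstructs the flattened matrix (and W) exactly:
rng = np.random.default_rng(2)
W = rng.normal(size=(4, 2, 3, 3))   # rank of M is at most min(4*3, 2*3) = 6
W1, W2, acc = separable_factorize(W, 6)
M_rec = W2.transpose(1, 0, 2, 3).reshape(6, 12).T @ W1.reshape(6, 6)
assert np.allclose(M_rec.reshape(4, 3, 2, 3).transpose(0, 2, 1, 3), W)
```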
4.2.3 Perchannel factorization
Per-channel factorization applies SVD to each input channel separately, approximating the original convolution by a set of 2-dimensional convolutions applied per channel. $W$ is approximated by two new convolutions $W_1$ and $W_2$, where $W_1$ is applied in grouped fashion to the input channels. The result is $C f$ output channels after $W_1$; $W_2$ then reconstructs the original $F$ output channels. See Table LABEL:tab:perchannelops for details.
  Original  Factored
Names  $W$  $W_1$, $W_2$
Output channels  $F$  $C f$, $F$
Input channels  $C$  $C$, $C f$
Kernel size  $k_h \times k_w$  $k_h \times k_w$, $1 \times 1$
Group  $1$  $C$, $1$
FLOPs  $F C k_h k_w \frac{H}{s_h} \frac{W}{s_w}$  $(C f k_h k_w + F C f) \frac{H}{s_h} \frac{W}{s_w}$
The decomposition is found by applying SVD $C$ times, once on the slice of the original weight corresponding to each input channel, and the accuracy score is the percentage of variation explained for each slice, averaged across all input channels. Because the FLOP count is inversely proportional to the group parameter, this approximation results in a large FLOP reduction.
4.3 Chaining approximations
Approximations can be applied on top of one another. For example, if a projection-first approximation is applied on top of a filterwise approximation of a layer, the resulting output is three convolutions. The runtime score for the chain is the ratio of the original layer's FLOPs to the new FLOPs, and the accuracy score is the product of the accuracy scores of each constituent decomposition: since the convolutions are applied one after another, any error introduced by the first is carried into the next.
4.4 Optimizing the overall approximation
The user-specified parameter encodes the relative importance of maintaining accuracy versus decreasing runtime. The optimal approximation is the one, out of all possible sequences of approximations, with the highest overall score. It is guaranteed to give the best tradeoff between runtime and accuracy loss along the chosen axis, since SVD is the optimal low-rank decomposition along that axis.
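One way to read the selection rule is as a weighted sum of the two scores; the helper below is our own sketch (the parameter name `alpha` and the example scores are illustrative, not values from the paper):

```python
def overall_score(accuracy_score, runtime_score, alpha):
    """Weighted tradeoff: alpha in [0, 1] is the user parameter weighting
    accuracy; (1 - alpha) weights the runtime (FLOP-reduction) score."""
    return alpha * accuracy_score + (1 - alpha) * runtime_score

def best_approximation(candidates, alpha):
    """candidates: list of (name, accuracy_score, runtime_score) tuples.
    Returns the candidate with the highest overall score."""
    return max(candidates, key=lambda c: overall_score(c[1], c[2], alpha))

# With a high alpha accuracy dominates; with a low alpha speedup dominates.
cands = [("filterwise", 0.99, 0.3), ("per-channel", 0.90, 0.8)]
assert best_approximation(cands, 0.9)[0] == "filterwise"
assert best_approximation(cands, 0.1)[0] == "per-channel"
```

In the full pipeline the candidate set would enumerate every factorization type, rank, and chain considered for the layer.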
4.5 Finding approximation groups
Some network architectures, like ResNets, have more complex structures in which layers share the same input or the same output. In a residual unit of a ResNet, pairs of layers share an output representation, since their outputs are added together as part of the residual operation. In these cases, the joined layers can be considered for approximation together as an approximation group: their weights are concatenated into a single weight matrix, and the reconstructed weights after approximation are split back into separate layer outputs to match the original output blobs.
4.6 Applying approximations holistically
During the forward pass of a network, errors introduced at layers closer to the input propagate forward, accumulating as execution approaches the output, and are therefore more consequential. Additionally, layers closer to the output tend to take longer to execute in a forward pass. To prevent high error accumulation, the accuracy weight used for each layer's approximation is derived from the user-defined parameter: it starts at a higher value for layers near the input and linearly decreases to the user-defined value for layers farther from the input. The start value can be further tuned to decrease the overall error. This also means that layers closer to the output (which are typically slower) receive a more aggressive approximation, which provides a better tradeoff between runtime and accuracy. Furthermore, in practice the accuracy weight is used as a threshold: no approximation whose accuracy score falls below it is considered valid.
5 Experimental results
On 4 network architectures designed for computer vision applications, we show between 1.5x and 2x runtime improvement for a forward pass, a small loss in accuracy, and nearly full recovery of accuracy after fine-tuning. Runtime was tested using Caffe (jia2014caffe) compiled with CUDA8 on a g2 cloud instance. Detailed accuracy and runtime results are shown in Table LABEL:tab:results_acc. The FLOP reduction (rather than the memory reduction) correlates with the absolute speedup, with the exception of the ResNet50 network, as shown in Table LABEL:tab:results_mults. The ResNet50 network has many convolutions that are by design memory-limited rather than FLOP-limited, so DLA is less effective there.
Varying the input parameter yields different points on an accuracy / runtime curve, so practitioners can choose the balance of accuracy and runtime most important for their application. In Table LABEL:tab:yolo, decreasing values of the accuracy weight yield faster runtimes (measured in milliseconds on a g2 instance) but higher accuracy losses. If time can be spent fine-tuning the post-DLA network, accuracy can be recovered to original levels.
(Table LABEL:tab:yolo: Runtime (ms) and Accuracy (mAP) for the baseline and for decreasing accuracy weights.)
6 Discussion
Here we examine specific characteristics of DLA.
Memory
The optimization function for choosing which convolution to apply specifically targets FLOP reduction, since FLOP reduction is directly related to runtime decrease for FLOP-limited layers. However, a consequence of decreasing FLOP count by using fewer output channels in a cascade of convolutions is that less memory is needed for a forward pass as well. Especially in embedded environments, this makes larger networks viable. When a network that has been passed through DLA sees diminishing returns from FLOP-reduction methods, memory movement has become the runtime bottleneck, and different memory-reducing optimizations should be applied. In Table LABEL:tab:results_mults we see that memory requirements at runtime decrease as a result of applying DLA.
Upper bound
Because DLA applies FLOP-reducing approximations, any convolutions that are memory-limited will not be sped up significantly. For example, fully-connected layers and $1 \times 1$ convolutions are typically memory-limited. From Table LABEL:tab:results_mults we see that the ResNet50 network is more memory-limited than FLOP-limited, since its runtime speedup tracks the memory reduction more closely than the FLOP reduction. The FLOP decrease is observed in the non-$1 \times 1$ convolutions, and the speedup is likely a result of the memory decrease rather than the FLOP decrease. Pushing beyond the speedup observed on YOLO without significant accuracy loss is not possible with the proposed DLA approximations, because once DLA has been applied, layers move from the FLOP-limited regime to the memory-limited regime in the roofline model.
GPUspecific optimizations
The runtime improvements here are reported on a g2 cloud instance, but proportional speedup is also observed on a CPU. For GPU targets, we have observed that although the relationship between FLOPs and runtime is linear overall, the single most significant factor in the runtime of a given representation layer is the number of output channels. The runtime is a step function, with large jumps at output-channel counts that are powers of 2 and linear behavior in between. So, in practice, when targeting networks for GPUs we only choose between decompositions with power-of-2 output channels. For a CPU this effect is not observed, and any number of output channels is considered. Different GPUs have different FLOP and memory throughputs, which are properties inherent to the device, so absolute and relative speedups will differ across target GPU architectures. A benefit of DLA is that it is GPU-architecture-agnostic: the FLOP count will always be decreased, which will typically result in some speedup on any target device.
Frameworks
Results were tested using both the Caffe and MXNet frameworks, and DLA optimizations can be applied to either kind of model. Similar speedups are observed with both frameworks, because the FLOP-reduction optimizations from DLA are framework-agnostic: they operate on the weight matrices directly and do not depend on the software implementation. Exact runtime speedups will differ by framework and implementation, but speedups will be seen with all network formats.
Relationship to Training
Because accuracy can be recovered by fine-tuning after applying DLA, a good strategy when the data and training procedure are available is to train a large network, then iteratively apply DLA and fine-tune. More aggressive speedup can be achieved this way, though it also takes more time. On the other hand, DLA does not rely on data availability and can be used as a data-agnostic black box: those without access to the training data or procedure can still apply DLA to get speedup with minimal accuracy loss.
7 Conclusion
Deep Learning Approximation can be applied to an already-trained network to speed it up while incurring only a small accuracy loss. Access to training data is not required and the techniques are framework-agnostic, so DLA can be used on black-box networks or in environments where the training data cannot be accessed. The combination of approximations that best achieves the desired accuracy / runtime tradeoff is chosen for each layer through an exhaustive search. DLA can be combined with other methods for speeding up neural networks, and the runtime reduction translates into a multiplicative cost reduction for a production service that uses neural networks. Any accuracy loss that was introduced can be recovered by fine-tuning or retraining the resulting network.