Deeplite Neutrino^TM: An End-to-End Framework for Constrained Deep Learning Model Optimization

01/11/2021 ∙ by Anush Sankaran, et al.

Designing deep learning-based solutions is becoming a race to train deeper models with a greater number of layers. While a larger, deeper model can provide competitive accuracy, it creates many logistical challenges and unreasonable resource requirements during development and deployment. This has been one of the key reasons why deep learning models are not extensively used in various production environments, especially on edge devices. There is an immediate requirement to optimize and compress these deep learning models to enable on-device intelligence. In this research, we introduce a black-box framework, Deeplite Neutrino^TM, for production-ready optimization of deep learning models. The framework provides an easy mechanism for end-users to provide constraints, such as a tolerable drop in accuracy or the target size of the optimized model, to guide the whole optimization process. The framework is easy to include in an existing production pipeline and is available as a Python package, supporting the PyTorch and Tensorflow libraries. The optimization performance of the framework is shown across multiple benchmark datasets and popular deep learning models. Further, the framework is currently used in production, and the results and testimonials from several clients are summarized.


Introduction

Deep learning has been one of the most disruptive technologies of the 21st century, having revolutionized businesses across multiple industries. From building better gaming opponents to translating languages in real time to the detailed understanding of large volumes of images and videos, deep learning has enabled us to achieve automation in diverse applications. However, deep learning has now become a race to build deeper and larger models to produce better results. Recent models such as BiT-M from Google Kolesnikov et al. (2019) with 928 million parameters, Megatron-LM from NVIDIA Shoeybi et al. (2019) with 8.3 billion parameters, Turing-NLG from Microsoft Rasley et al. (2020) with 17 billion parameters, and GPT-3 from OpenAI Brown et al. (2020) with 175 billion parameters show the unprecedented growth in the size of deep neural network (DNN) architectures.

This explosive growth has led to the primary challenge of the democratization of deep learning. Training such huge models requires vast computing power on supercomputing infrastructure, which is not accessible to all. For example, the latest GPT-3 model, with over 350GB in memory size, costs over $12 million to train using specialized supercomputers (https://venturebeat.com/2020/06/01/ai-machine-learning-openai-gpt-3-size-isnt-everything/). Such a computing infrastructure is not available to everyone and is not affordable for most deep learning startups and researchers.

There are other implications of training huge DNN models, such as energy and power consumption. Strubell et al. (2019) showed that the compute required to train large-scale DNN models produces carbon-dioxide emissions equivalent to five times the lifetime emissions of an average American car. They also showed that the annual power consumption of cloud computing giants such as Amazon AWS, Google, or Microsoft is equivalent to the annual power consumption of the United States. Additionally, according to a recent Gartner survey, as of 2020 there are more than six billion edge devices (https://www.gartner.com/en/newsroom/press-releases/2019-08-29-gartner-says-5-8-billion-enterprise-and-automotive-io), and current state-of-the-art DNN models are not equipped to be deployed directly on edge devices due to their memory requirements.

Our objective is to optimize such DNN model architectures without a reduction in accuracy, as a step towards enabling them to be directly deployed on edge devices. The idea behind model optimization rests on the presumption that DNN architectures are over-parameterized. Optimization reduces the number of parameters of a large DNN model while improving the model on metrics such as computational cost, inference time, and energy consumed. This leads to the primary and most important research question: “Can smaller models with fewer parameters achieve accuracy equivalent to a deeper model with a larger number of parameters?”

Figure 1: The typical development life cycle of a deep neural network (DNN) model. The proposed black-box framework for model optimization, Neutrino, can be seamlessly integrated into the development life cycle at minimal cost.

Challenges of Model Optimization in Production

In the research community, there are popular approaches such as model pruning, model quantization, and model decomposition to achieve model compression. However, there are many challenges in adopting these research-oriented techniques in production.

  1. Democratization of DNN Optimization: Training and optimizing DNN architectures is currently unaffordable and requires supercomputing infrastructure. How can we make a production-ready optimization framework that is consumable and affordable by everyone?

  2. Multiple Metrics to Optimize: There are multiple metrics to optimize, such as (i) the number of parameters, (ii) model memory size, (iii) inference time, (iv) computational cost in terms of FLOPs/MACs, or (v) energy consumption. Optimizing several of these metrics in parallel is challenging.

  3. Constrained Optimization: Applications may require optimization to focus on certain metrics while trading off on other metrics. For example, real-time systems would require the inference time to be low while low-memory edge devices would focus on model memory size reduction. How would we guide the model optimization to favor certain metrics over others?

  4. Hardware Support: The generic implementations in popular libraries such as PyTorch and Tensorflow do not support certain methods of model compression. Also, model compilation and hardware-specific execution of the optimized model are challenging. While most techniques target GPUs, how can we optimize DNN architectures for specialized hardware?

  5. Black-box Framework: Usability and simplicity for the end-user are key requirements for consuming optimization in production pipelines. There is a strong need for a black-box optimization framework, where the end-user can simply provide the trained model, the dataset, and the constraints for optimization, without being troubled by the nuances of implementation and execution.

  6. Research Papers to Production: Often, research papers aim at finding a highly optimized model which retains the accuracy of the original model, while the cost involved in optimizing or searching for the optimized model is considered secondary. In production systems, however, the cost and time incurred in optimizing the original model are equally important. Unstructured weight optimization, for example, is only realistic on idealized, theoretical hardware. A production-ready framework should generalize the optimization approach across a wide variety of architectures and hardware.

In this research paper, we introduce Neutrino (in this paper, Neutrino refers to Deeplite Neutrino™, https://www.deeplite.ai/index.html#neutrino), a lights-out DNN model optimization framework guided by the end-users’ constraints and requirements. A typical continuous development life cycle of a DNN model is shown in Figure 1. The proposed Neutrino can be seamlessly integrated into any development and deployment pipeline. The framework consumes a pre-trained DNN model with the original train-test split data as input, in addition to the optimization requirements from the end-user. Neutrino produces an optimized model that can be used for inference either in a cloud environment or directly deployed on an edge device. Neutrino builds a symphony of different model optimization and acceleration techniques. This research paper focuses on the constrained optimization techniques used in the framework and the successful results obtained on various public benchmark datasets and popular models. The Neutrino framework is distributed as a Python PyPI package, with support for PyTorch Paszke et al. (2019) and early support for the Tensorflow Abadi et al. (2015) library.

The rest of the paper is organized as follows: Section 2 provides background literature on various model optimization techniques. Section 3 explains the architecture of the proposed optimization framework. Section 4 details the experimental results obtained on various benchmark datasets and popular DNN architectures. Section 5 presents the business impact and use-cases of the proposed framework, along with the development details. Section 6 summarizes our efforts with some short-term and long-term future goals.

Background Literature

The different methods explored in the literature for DNN model optimization aim to reduce the number of parameters in the model. These techniques can be broadly grouped into three schools of thought: (1) weight pruning, (2) architecture search, and (3) weight decomposition.

Weight Pruning

The redundant parameters of the model that do not contribute to the effective output are pruned, resulting in a smaller model with fewer parameters. Column and structured-shape pruning introduce zero-valued weights (sparsity), while channel and layer pruning reduce the size of the model. Weight pruning in DNN architectures is a well-researched topic with a set of comprehensive survey reports Choudhary et al. (2020); Liu et al. (2020a); Cheng et al. (2017). Liu et al. (2020b) proposed AutoCompress, an automated experience-guided heuristic search technique to achieve extreme compression rates. Ren et al. (2020) proposed a density-adaptive regular-block (DARB) pruning technique to perform pruning at a channel row-level. Most of these techniques perform post-training pruning, while Wang et al. (2020) proposed a method for pruning a DNN architecture from scratch. They showed that comparable accuracy is achieved with a computational budget similar to that of the post-training pruning methods.

Architecture Search

Architecture search finds a surrogate model, from the space of all possible DNN architectures, such that the surrogate (or student) model is much smaller with performance similar to the original model. Thus, model optimization is formulated as a learning or heuristic-driven search problem, such as knowledge distillation Luo et al. (2016); Phuong and Lampert (2019); Changyong et al. (2019), guided network architecture search Kang et al. (2020), AutoML He et al. (2018), or meta learning Bai et al. (2019).

One of the recent influential ideas in model compression is the Lottery Ticket Hypothesis Frankle and Carbin (2018). Morcos et al. (2019) showed successful results of model compression by generalizing the lottery ticket hypothesis across different benchmark datasets and popular DNN architectures. Yu and Huang (2019) described a family of possible slimmable architectures by using a variable layer-width switch based on the batch-normalization layer.

Weight Decomposition

The idea of decomposition is to fragment a very large weight matrix (or tensor) into a linear sequence of smaller tensors such that maximum information is retained. Denton et al. (2014) proposed singular value decomposition (SVD) of the original weight tensor to find the orthogonal bases. Jaderberg et al. (2014) built a low-rank filter-bank approximation of the convolutional layer to achieve up to 4.5x speedup and compression. Lebedev et al. (2014) used the popular canonical polyadic (CP) decomposition to achieve layer compression. Yu et al. (2017) proposed an SVD-free greedy alternative for generalized bilateral decomposition (GreBdec) of the convolutional layer. Kim et al. (2015) proposed an iterative method of Tucker-based decomposition and fine-tuning to regain the original accuracy. More recently, Li et al. (2020) proposed a single formulation to easily switch between channel pruning and weight decomposition, by applying group sparsity across the columns or the rows of the weight tensor, respectively.

There are some inherent challenges in directly consuming the existing solutions for model optimization. First, it is very difficult to determine the maximum achievable compression such that the accuracy does not drop below an admissible threshold. Ye et al. (2019) discuss these challenges as a trade-off between model robustness and model compression. Second, the computational and resource requirements for model distillation and architecture search are very high. In particular, Liu et al. (2018) argued that it is more valuable to search for the pruned architecture shape than to prune the unimportant weight values and channels. Third, it is not trivial to identify the rank of the low-rank approximation of the decomposable tensors.

System Architecture and Design

Figure 2: An overview of the architecture design highlighting the key components of the Neutrino framework.

In this section, we describe the high-level solution architecture of the Neutrino framework, which contains four important components: (i) the Neutrino Zoo, (ii) the conductor, (iii) high-level coarse compression by exploration, and (iv) fine-grained aggressive compression by annealing. We focus on the system design from the end-users’ usability perspective. In this paper, we restrict the scope to optimizing convolutional neural network (CNN) models for classification and object detection applications.

Neutrino Zoo

The end-user provides the following inputs to the framework: (i) a pre-trained model, (ii) the actual train-test data split used to train the model, and (iii) a set of constraints or requirements to guide the optimization. The data pre-processing and data preparation steps performed during the original model training have to be reproduced in the provided data loaders. The pre-trained model and data loaders could be borrowed from any public GitHub repository or be any custom variant designed by the end-user. However, to ease the use of the end-user, a collection of popular DNN architectures with trained weights on different benchmark datasets is provided as the Neutrino Zoo. The zoo covers various classification and object detection datasets: MNIST, CIFAR10, CIFAR100, VWW, ImageNet, ImageNet10 (a 10-class subset of ImageNet), ImageNet16 (a 16-class subset of ImageNet), VOC2007, VOC2012, and COCO2017. A wide collection of trained DNN models is also available, including variants of ResNet, VGG, MobileNet, Inception, DenseNet, ShuffleNet, MLP, SSD with VGG/MobileNet backbones, and YOLO-v3. The availability of the Neutrino Zoo allows end-users to easily and quickly use the framework for transfer learning.
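For illustration, a rough PyTorch analogue of what the Zoo supplies is sketched below using torchvision (the Zoo's actual API is not shown here); in the Zoo, the model would already carry CIFAR-100 weights and the data loaders would reproduce the original pre-processing.

```python
import torch
import torchvision
import torchvision.transforms as T

# Rough analogue of a Zoo entry: a model architecture plus train/test dataloaders.
# In the actual Neutrino Zoo, the CIFAR-100 weights would be pre-loaded and the
# transforms would reproduce the original training pre-processing.
model = torchvision.models.resnet18(num_classes=100)

transform = T.ToTensor()  # placeholder for the original pre-processing pipeline
train_set = torchvision.datasets.CIFAR100("./data", train=True, download=True, transform=transform)
test_set = torchvision.datasets.CIFAR100("./data", train=False, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64, shuffle=False)
```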

Conductor

The purpose of the conductor is to collect all the provided inputs, understand the given requirements, and orchestrate the entire optimization pipeline accordingly. The end-user provides the constraints that guide the optimization, and the conductor automatically orchestrates the pipeline by additionally inferring the model and data properties. Some of the common configurable parameters are:

  1. delta: The acceptable tolerance of accuracy drop with respect to the original model, for example, 1%.

  2. stage: Which of the two stages of compression to run; Stage 1 is less intensive compression requiring fewer computational resources, while Stage 2 provides more aggressive compression using more resources and time.

  3. device: Perform the entire optimization and model inference on either a CPU, a single GPU, or multiple GPUs (a distributed GPU environment).

  4. modularity: The end-user can customize multiple parts of the optimization process for Neutrino to adapt to more complex scenarios. Support for customization goes beyond vanilla classification, including specialized dataloaders, custom backpropagation optimizers, and intricate loss functions, to the extent that their native library implementation allows.

Let the pre-trained model have n optimizable layers, L = {l_1, l_2, ..., l_n}. In a typical CNN model, the convolutional layers and the fully connected layers are optimizable, while the rest of the layers are excluded from the optimization process. The conductor analyzes the data size, the number of output classes, the model architecture, and the optimization criterion, delta, and produces a binary composed list C = {c_1, ..., c_n}, where c_i ∈ {0, 1}. The conductor identifies the subset of optimizable layers that need to be optimized, marked as c_i = 1, and the layers that have to be frozen throughout the process, marked as c_i = 0. This information is passed forward to the exploration stage, where the subset marked as c_i = 1 is optimized.
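A simplified sketch of how such a composed list can be derived for a PyTorch model is shown below; the helper and the frozen-layer names are illustrative, and the real conductor additionally weighs the data size, number of classes, and the accuracy delta.

```python
import torch.nn as nn
from torchvision.models import resnet18

def composed_list(model, freeze=()):
    """Mark which layers are optimizable: 1 for conv / fully connected layers that
    will be optimized, 0 for layers frozen throughout the process.
    A simplified sketch of the conductor's output, not the actual Neutrino logic."""
    composition = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            composition[name] = 0 if name in freeze else 1
    return composition

model = resnet18(num_classes=100)
comp = composed_list(model, freeze=("conv1", "fc"))  # e.g. keep first/last layers intact
print(sum(comp.values()), "of", len(comp), "optimizable layers marked for optimization")
```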

Stage 1: Exploration

In a convolutional neural network, every optimizable layer l_i projects the input data into a different dimensional output as follows,

y_i = σ(f(W_i, x_i))    (1)

where W_i is the kernel parameter tensor of the layer, x_i is the input, y_i is the output, σ is usually a non-linear activation function such as ReLU, sigmoid, or tanh, and f is the projection function.

Transforming Layers:

A transformation function is applied to every optimizable layer of the convolutional neural network. The transformation is designed to approximate the original projection while reducing the number of parameters of the layer.

An n-D tensor can be viewed as a linear combination of multiple 1-D vectors using the variable-separable method. For a layer whose parameters W_i form a 4-D tensor of shape [width × height × in_shape × out_shape], the following transformation function is applied,

W_i ≈ Σ_{r=1..s} a_r ⊗ b_r ⊗ c_r ⊗ d_r    (2)

with a canonical small-size s, where a_r, b_r, c_r, and d_r are vectors of length w, h, in, and out, respectively. During the forward pass, the transformation of l_i is performed as follows:

y_i = σ(f(Σ_{r=1..s} a_r ⊗ b_r ⊗ c_r ⊗ d_r, x_i))    (3)

This transformation function reduces the number of layer parameters from (w × h × in × out) to s × (w + h + in + out).
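The following PyTorch sketch illustrates one common way such a rank-s transformation can be realized in practice, as a sequence of four light-weight convolutions; this is an illustrative assumption, not the exact Neutrino implementation.

```python
import torch
import torch.nn as nn

class CPFactorizedConv2d(nn.Module):
    """Sketch of a CP-factorized replacement for nn.Conv2d.

    The dense (out, in, h, w) kernel is approximated by a rank-`small_size`
    canonical polyadic decomposition and executed as four light-weight convs,
    reducing parameters from w*h*in*out to roughly small_size*(w+h+in+out).
    Illustrative only, not the Neutrino implementation."""
    def __init__(self, in_ch, out_ch, kernel_size, small_size, padding=0, bias=True):
        super().__init__()
        kh, kw = (kernel_size, kernel_size) if isinstance(kernel_size, int) else kernel_size
        self.seq = nn.Sequential(
            nn.Conv2d(in_ch, small_size, 1, bias=False),                      # mix input channels
            nn.Conv2d(small_size, small_size, (kh, 1), padding=(padding, 0),
                      groups=small_size, bias=False),                         # vertical 1-D filters
            nn.Conv2d(small_size, small_size, (1, kw), padding=(0, padding),
                      groups=small_size, bias=False),                         # horizontal 1-D filters
            nn.Conv2d(small_size, out_ch, 1, bias=bias),                      # mix output channels
        )

    def forward(self, x):
        return self.seq(x)

# Rough parameter comparison for a 3x3 conv with 256 -> 256 channels.
dense = nn.Conv2d(256, 256, 3, padding=1)
factored = CPFactorizedConv2d(256, 256, 3, small_size=32, padding=1)
print(sum(p.numel() for p in dense.parameters()))     # ~590k parameters
print(sum(p.numel() for p in factored.parameters()))  # ~17k parameters
```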

For a layer for which W_i is a 2-D matrix of shape [in_shape × out_shape], the transformation function is designed as follows,

W_i ≈ U_i V_i^T    (4)

where U_i has shape [in × s], V_i has shape [out × s], and s is the near-optimal small-size approximation of the original matrix. Thus, the layer’s forward pass is replaced as follows,

y_i = σ(f(U_i V_i^T, x_i))    (5)

This reduces the overall number of parameters of the layer from (in × out) to s × (in + out).
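As an illustration, a dense fully connected layer can be replaced by a rank-s factorization initialized with a truncated SVD; the helper below is a sketch under that assumption, not the Neutrino code.

```python
import torch
import torch.nn as nn

def low_rank_linear(layer: nn.Linear, small_size: int) -> nn.Sequential:
    """Replace a dense nn.Linear with a rank-`small_size` factorization W ~= U @ V^T,
    initialized by truncated SVD. Parameters drop from in*out to small_size*(in+out).
    Illustrative sketch only."""
    W = layer.weight.data                                  # shape: (out, in)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    sqrt_s = torch.sqrt(S[:small_size])
    first = nn.Linear(layer.in_features, small_size, bias=False)
    second = nn.Linear(small_size, layer.out_features, bias=layer.bias is not None)
    first.weight.data = sqrt_s[:, None] * Vh[:small_size]        # (s, in)
    second.weight.data = U[:, :small_size] * sqrt_s[None, :]     # (out, s)
    if layer.bias is not None:
        second.bias.data = layer.bias.data.clone()
    return nn.Sequential(first, second)

fc = nn.Linear(4096, 1000)
fc_small = low_rank_linear(fc, small_size=128)
x = torch.randn(8, 4096)
# Relative reconstruction error of the rank-128 approximation on random inputs
print(((fc(x) - fc_small(x)).norm() / fc(x).norm()).item())
```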

The challenge is to find an ideal small-size approximation, s, that produces good compression while retaining the robustness of the model. When the near-optimal small-size s is close to the full size of the weight tensor, the transformation is over-approximated and yields very little compression. A very small s produces high compression, however, with a lossy reconstruction of the transformation. The exploration stage searches for the near-optimal s, a lower-size approximation of the tensor W_i, such that there is minimal loss in the transformation function of the layer l_i.

During the exploration stage, the composed list is updated, where Neutrino selects different transformation functions for different convolutional and fully connected dense layers. The entire model is optimized by the designed composition and the accuracy is regained by performing fine-tuning. The fine-tuning is performed using the same train-test data split used while pre-training the original model. The conductor checks if the optimized model adheres to the termination requirements as provided by the end-user, and if not, the composition list is updated and the next round of optimization is performed.
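A heavily simplified sketch of such an exploration loop is given below; evaluate and replace_with_low_rank are hypothetical helpers, and the actual search strategy used by Neutrino is more involved.

```python
import copy
import torch.nn as nn

def explore(model, evaluate, replace_with_low_rank, baseline_acc, delta,
            ratios=(0.5, 0.25, 0.1)):
    """Greedy per-layer rank search: for every optimizable layer, keep the most
    aggressive factorization whose accuracy drop stays within `delta`.

    Simplified Stage-1 sketch, not the actual Neutrino strategy.
    `evaluate(model)` is assumed to briefly fine-tune and return top-1 accuracy;
    `replace_with_low_rank(model, layer_name, ratio)` is a hypothetical helper
    that swaps the named layer for a factorized counterpart in place."""
    for name, module in list(model.named_modules()):
        if not isinstance(module, (nn.Conv2d, nn.Linear)):
            continue  # only conv and fully connected layers are optimizable
        best = model
        for ratio in ratios:  # from mild to aggressive compression
            trial = copy.deepcopy(model)
            replace_with_low_rank(trial, name, ratio)
            if baseline_acc - evaluate(trial) <= delta:
                best = trial   # more aggressive and still within tolerance
            else:
                break          # further compression of this layer violates delta
        model = best
    return model
```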

Stage 2: Annealing

Stage 2 optimization aims to perform aggressive compression and to obtain the maximum possible compression within the required accuracy tolerance. For example, if the accuracy delta is δ and Stage 1 produces a compression with an accuracy drop δ_1 < δ, the aim of Stage 2 is to push the compression further, with the accuracy drop going as close as possible to δ. In Stage 2, the composed list of the different layers is frozen, while the extent of optimization for each layer is increased. Annealing is a metaheuristic approach to approximate global optimization. By increasing the temperature of each layer, the overall energy of the model is preserved while finding a smaller size s that better approximates the global optimum.
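The sketch below illustrates the general idea with textbook simulated annealing over per-layer sizes; the perturb and evaluate helpers are hypothetical, and the actual Neutrino annealing procedure is not reproduced here.

```python
import copy
import math
import random

def model_size(model):
    """Total number of trainable parameters, used as a proxy for model size."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def anneal(model, evaluate, perturb, baseline_acc, delta,
           steps=200, t_start=1.0, t_end=0.01):
    """Stage-2-style refinement sketched as simulated annealing: random per-layer
    size perturbations are accepted if they improve compression, and occasionally
    even if they do not, with a probability that decays as the temperature cools.
    `perturb(model)` (hypothetical) returns a copy with one layer's small-size
    changed; `evaluate(model)` returns top-1 accuracy after a short fine-tune."""
    def cost(m):
        acc = evaluate(m)
        # candidates whose accuracy drop exceeds delta are treated as infeasible
        return float("inf") if baseline_acc - acc > delta else model_size(m)

    best, best_cost = model, cost(model)
    current, current_cost = best, best_cost
    for step in range(steps):
        # exponential cooling schedule from t_start down to t_end
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = perturb(copy.deepcopy(current))
        candidate_cost = cost(candidate)
        accept = candidate_cost < current_cost or \
            random.random() < math.exp(-(candidate_cost - current_cost) / (t * best_cost))
        if accept:
            current, current_cost = candidate, candidate_cost
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best
```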

The entire pipeline of the Neutrino framework can be executed in a distributed multi-GPU environment to reduce the time required for optimizing the model. To achieve this, Uber’s Horovod Sergeev and Del Balso (2018) (https://eng.uber.com/horovod/), an open-source library, is reused. Horovod supports different backend libraries, including PyTorch and Tensorflow, and is easy to use and integrate.
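A minimal PyTorch + Horovod setup for such distributed fine-tuning looks as follows; build_optimized_model is a hypothetical placeholder for loading the model produced by the optimization stages.

```python
import torch
import horovod.torch as hvd

hvd.init()                                   # one process per GPU
torch.cuda.set_device(hvd.local_rank())      # pin each process to its GPU

model = build_optimized_model().cuda()       # hypothetical constructor for the optimized model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across workers via ring all-reduce
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
# Ensure every worker starts from identical weights and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```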

Architecture Model Accuracy (%) Size (MB) MACs (Billions) #Params (Millions) Memory Footprint (MB) Execution Time (ms)
Resnet18 Original 76.8295 42.8014 0.5567 11.2201 48.4389 0.0594
Stage1 76.7871 7.5261 0.1824 1.9729 15.3928 0.0494
Stage2 75.8008 3.4695 0.0790 0.9095 10.3965 0.0376
Enh -0.9300 12.34x 7.05x 12.34x 4.66x 1.58x
Resnet50 Original 78.0657 90.4284 1.3049 23.7053 123.5033 3.9926
Stage1 78.7402 25.5877 0.6852 6.7077 65.2365 0.2444
Stage2 77.1680 8.4982 0.2067 2.2278 43.7232 0.1772
Enh -0.9400 10.64x 6.31x 10.64x 2.82x 1.49x
VGG19 Original 72.3794 76.6246 0.3995 20.0867 80.2270 1.4238
Stage1 71.5918 3.3216 0.0631 0.8707 7.5440 0.0278
Stage2 71.6602 2.6226 0.0479 0.6875 6.7399 0.0263
Enh -0.8300 29.22x 8.34x 29.22x 11.90x 1.67x
DenseNet121 Original 78.4612 26.8881 0.8982 7.0485 66.1506 10.7240
Stage1 79.0348 15.7624 0.5477 4.132 61.8052 0.2814
Stage2 77.8085 6.4246 0.1917 1.6842 48.3280 0.2372
Enh -0.6500 4.19x 4.69x 4.19x 1.37x 1.17x
GoogleNet Original 79.3513 23.8743 1.5341 6.2585 64.5977 5.7186
Stage1 79.4922 12.6389 0.8606 3.3132 62.1568 0.2856
Stage2 78.8086 6.1083 0.386 1.6013 51.3652 0.2188
Enh -0.4900 3.91x 3.97x 3.91x 1.26x 1.28x
Mobilenet v1 Original 66.8414 12.6246 0.0473 3.3095 16.6215 1.8147
Stage1 66.4355 6.4211 0.0286 1.6833 10.5500 0.0306
Stage2 66.6211 3.2878 0.017 0.8619 7.3447 0.0286
Enh -0.4000 3.84x 2.78x 3.84x 2.26x 1.13x
shufflenet_v2_1_0 Original 69.9805 5.1731 0.0462 1.3561 12.3418 0.0357
Stage1 68.9844 3.2792 0.0285 0.8596 10.8947 0.0361
Stage2 69.3262 1.9315 0.016 0.5063 9.3258 0.0344
Enh -0.6500 2.68x 2.89x 2.68x 1.32x 1.04x
Table 1: Performance on different metrics obtained after multiple stages of optimization on the CIFAR-100 dataset, validating the enhancement (Enh) obtained using the proposed framework. All the results are computed for an input delta accuracy of 1%.
Dataset Model Accuracy (%) Size (MB) MACs (Billions) #Params (Millions) Memory Footprint (MB) Execution Time (ms)
Imagenet16 Original 94.4970 42.6663 1.8217 11.1847 74.6332 0.2158
Stage1 93.8179 3.3724 0.5155 0.8840 41.0819 0.1606
Stage2 93.6220 1.8220 0.3206 0.4776 37.4608 0.1341
Enh -0.8800 23.42x 5.68x 23.42x 1.99x 1.61x
VWW Original 93.5995 42.6389 1.8217 11.1775 74.6057 0.2149
Stage1 93.8179 3.3524 0.4014 0.8788 39.8382 0.1445
Stage2 92.6220 1.8309 0.2672 0.4800 36.6682 0.1296
Enh -0.9800 23.29x 6.82x 23.29x 2.03x 1.66x
Table 2: Performance of the ResNet18 model against multiple large-scale datasets, validating the enhancement (Enh) obtained using the proposed framework. All the results are computed for an input delta accuracy of 1%.

Experimental Results and Analysis

In this section, we experimentally showcase the performance of Neutrino in optimizing different CNN models. The different metrics used to evaluate the extent of optimization are explained, along with the experimental protocol.

Metrics

There are different metrics used to measure the amount of optimization and performance of Neutrino, as follows:

  1. Accuracy: The top-1 accuracy (%) or the equivalent performance objective of the model is measured. Successful optimization retains the accuracy of the original model.

  2. Model Size: The disk size (MB) occupied by the trainable parameters of the model. Lower model size enables models to be deployed into devices with memory constraints.

  3. MACs: The computational complexity of the model is measured by the number (billions) of Multiply-Accumulate Operation (MAC) computed across the layers of the model. The lower the number of MACs, the better optimized is the model.

  4. Number of Parameters: Total number (millions) of trainable parameters (weights and biases) in the model. Optimization aims to reduce the number of parameters.

  5. Memory Footprint: The total memory (MB) required to perform the inference on a batch of data, including the memory required by the trainable parameters and the layer activations. A lower memory footprint is achieved by better optimization.

  6. Execution Time: The time (ms) required to perform forward pass on a batch of data. Optimized models have a lower execution time.
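For illustration, the parameter count, model size, and execution time can be measured with a few lines of PyTorch; the sketch below is illustrative and not the measurement code used inside Neutrino.

```python
import time
import torch
from torchvision.models import resnet18

def count_params(model):
    """Number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

def model_size_mb(model):
    """Disk size of the trainable parameters (MB), assuming 32-bit floats."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 2**20

@torch.no_grad()
def execution_time_ms(model, batch, warmup=5, runs=20):
    """Average forward-pass latency on a batch, in milliseconds."""
    model.eval()
    for _ in range(warmup):
        model(batch)
    start = time.perf_counter()
    for _ in range(runs):
        model(batch)
    return (time.perf_counter() - start) / runs * 1e3

model = resnet18(num_classes=100)
batch = torch.randn(64, 3, 32, 32)
print(f"{count_params(model):.2f} M params, {model_size_mb(model):.1f} MB, "
      f"{execution_time_ms(model, batch):.1f} ms/batch")
# MACs can be estimated with a third-party profiler such as thop or ptflops
# (assumed external packages), e.g. thop.profile(model, inputs=(batch,)).
```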

Experimental Protocol

The results are shown using several popular CNN models against three benchmark datasets: CIFAR-100, ImageNet16, and Visual Wake Words (VWW). All the optimization experiments are run with an end-user requirement of an accuracy delta of 1%. The experiments are executed with a fixed mini-batch size, while the reported metrics are normalized to a single reference mini-batch size. All the experiments are run on four parallel GPUs using Horovod, where each GPU is a Tesla V100 SXM2 with 32GB of memory. The standard train-test split is used for the experiments. The images are normalized with the global mean and variance computed from the training data. To make the training more robust, data augmentation is performed using random cropping with resizing and random horizontal flips.
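A sketch of the corresponding torchvision transforms is shown below; the crop size, padding, and channel statistics are assumed CIFAR-style values, as the exact settings are not reproduced here.

```python
import torchvision.transforms as T

# Assumed CIFAR-style values: 32x32 crops with 4-pixel padding and commonly used
# CIFAR-100 channel statistics; the paper's exact settings are not specified.
mean, std = (0.5071, 0.4865, 0.4409), (0.2673, 0.2564, 0.2762)
train_tf = T.Compose([
    T.RandomCrop(32, padding=4),      # random cropping with resizing/padding
    T.RandomHorizontalFlip(),         # random horizontal flip
    T.ToTensor(),
    T.Normalize(mean, std),           # normalize with training-set mean/variance
])
test_tf = T.Compose([T.ToTensor(), T.Normalize(mean, std)])
```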

Result Analysis

The optimization results obtained using Neutrino across different popular CNN models on the CIFAR-100 dataset are shown in Table 1, and the results of the ResNet18 architecture on different large-scale vision datasets are shown in Table 2. From Table 1, it can be observed that the accuracy difference between the original and the final optimized model is less than 1%, matching the provided delta requirement. Depending on the architecture of the original model, the model size can be compressed anywhere between roughly 3x and 30x. VGG19 is known to be one of the most over-parameterized CNN models and, as expected, achieved a 29.22x reduction in the number of parameters, with almost 12x compression in the overall memory footprint and an 8.34x reduction in computational complexity. The resulting VGG19 model occupies only 2.6MB, compared to the 76.6MB required by the original model. MobileNet architectures are specifically designed to be lightweight with a low computational cost, and even on MobileNet v1, Neutrino achieved a size compression of 3.84x with only a 0.4% reduction in accuracy. In a GPU environment, a speedup of around 1.5x is observed. This could significantly impact the inference time of the model, especially on edge devices, as well as the fine-tuning time required in future production releases. On the large-scale vision datasets, Neutrino produces around 23.5x compression of ResNet18 on the ImageNet16 and VWW datasets. The optimized model requires only 1.8MB, compared to 42.6MB for the original model. There is more than a 1.6x speedup, with around a 6x reduction in the computational complexity of the model. Crucially, it can be observed that Stage 2 compresses the model roughly 1.3x to 3x beyond the Stage 1 result.

Figure 3: The total time taken for optimizing various models and the amount of compression achieved against CIFAR-100 dataset, using Neutrino framework.
Figure 4: The proportion of time taken in optimization and the amount of compression, between Stage 1 and Stage 2 of optimization in the Neutrino framework.

Time Taken for Optimization

The overall time taken for optimization by Neutrino, including Stage 1 and Stage 2, is shown in Figure 3. It can be observed that most of the models can be optimized in less than 2 hours, while complex architectures with longer training times, such as ResNet50 and DenseNet121, take around 6 hours and 13 hours, respectively. The comparison between the time taken for Stage 1 and Stage 2 compression is shown in Figure 4. It can be observed that a large share of the overall compression is achieved in Stage 2, while Stage 1 consumes less than 40% of the overall optimization time. This differentiation is a key feature of Neutrino: end-users who need quick optimization with low resource consumption can stop at Stage 1, while those needing aggressive optimization can choose Stage 2.

It can be experimentally observed that Neutrino generalizes across all kinds of CNN architectures and all scales of datasets with varying numbers of classes. Neutrino uniformly provides strong optimization results across all these datasets.

Client Model Dataset Method Acc. (%) #Params (M) Size (bytes) FLOPS (M) Time (ms)
Andes MobileNetV1 VWW Original 88.1 3.2085 12,836,104 105.7 -
Neutrino 87.6 0.1900 (16.88%) 188,000 (68x) 24.6 -
TFLM 84.0 0.2134 (15.03%) 860,000 (14.9x) - -
Prod#1 MobileNetV2-0.35x Imagenet Small Original 80.9 0.4093 1,637,076 66.50 1.64
Neutrino 80.4 0.1688 (58.76%) 675,200 (2.4x) 50.90 1.87
Intel Distiller 80.4 0.2562 (37.41%) 1,637,076 (1x) 66.50 1.59
Microsoft NNI 77.4 0.2851 (30.35%) 1,140,208 (1.43x) 52.80 2.22
Prod#2 MobileNetV2-1.0x Imagenet Small Original 90.9 2.2367 8,951,804 312.8 4.14
Neutrino 82.0 0.4254 (80.98%) 1,701,864 (5.26x) 134.00 4.2
Intel Distiller 82.0 0.2983 (86.66%) 8,951,804 (1x) 312.86 4.4
Prod#3 MobileNetV2-0.35x Gesture Recognition Original 96.8 2.3630 10,500,000 559.60 706
Neutrino 96.8 0.5525 (76.62%) 2,199,200 (4.77x) 508.20 611
Prod#4 SSD300 (ResNet50) COCO-10 Original 0.438 (mAP) 14.17 56,734,728 15.59 3.98
Neutrino 0.433 (mAP) 4.84 (2.93x) 19,365,488 (2.93x) 5.254 2.76
Table 3: Results from different production applications and business use-cases of Neutrino framework. It can be observed that in many practical real-world applications Neutrino performs better than other competitive optimization frameworks. The results are computed across different hardware deployments. The names of certain clients and production environments are redacted for anonymization.

Business Impact

Our blackbox optimization framework has been deployed into multiple real-world applications and has been consumed by different clients. From different chip manufacturers enabling edge deployment of DNN architectures, to a faster inference of computer vision models on the cloud, the Neutrino framework could cater to a wide variety of use-cases. Some of the key real-world use-cases, where Neutrino is currently deployed in production are:

  • Smart Appliances: More than 100 million home appliances currently use ARM on Raspberry Pi 4 with only 2GB memory. To enable on-device, AI-driven, automated gesture recognition, Neutrino is used to compress MobileNet variant architectures by almost 2.5x.

  • Person Detection: An embedded system with a small camera, built on RISC-V CPU cores Waterman et al. (2011), is used as a home-assistant alarm by performing person detection. To enable very large DNN architectures to be deployed on these CPU cores, the Neutrino framework is used to achieve up to 68x compression.

  • Autonomous Driving: Autonomous self-driving cars need to perform real-time object detection against highly noisy backgrounds. A highly complex DNN architecture, SSD300 with a ResNet50 backbone, is used to accomplish object detection. For this large DNN model to be deployed on an NVIDIA Xavier GPU, the Neutrino framework is used to achieve 3x compression, along with a 3x speedup and a 3x power reduction, with no reduction in accuracy.

The results obtained from the real-world deployments across various use-cases are shown in Table 3. It can be observed that, across different production environments, use-cases, models, and datasets, Neutrino generalizes for the successful compression of models. Depending on the application requirements, Neutrino produces anywhere between 2x and 68x compression, with less than 1% accuracy reduction from the original model. Also, in the same production environments, Neutrino was compared with competitive optimization frameworks such as Microsoft’s Neural Network Intelligence (NNI, https://github.com/microsoft/nni), Intel’s Neural Network Distiller (https://github.com/NervanaSystems/distiller), and Tensorflow Lite Micro (https://www.tensorflow.org/lite/microcontrollers). It can be observed that Neutrino consistently outperforms these competitors by achieving higher compression with better accuracy. As a testimonial to its success and usability, the Neutrino framework has received several accolades and media coverage.

There is a CI/CD-based DevOps pipeline, with a monthly sprint delivering product enhancements, software patches, and bug fixes. A committed core team of eight technical developers (and growing fast) with diverse skills leads and supports new features and new client deployments.

Conclusion and Future Work

In this paper, we proposed Neutrino, an easy-to-use black-box framework for DNN model optimization. The framework is completely automated and can be used to optimize any convolutional neural network based architecture with no human intervention. The end-user provides the optimization requirements, such as the target model size or the tolerable drop in accuracy, and the Neutrino framework produces an optimized model according to those requirements. As experimental validation, the performance of the proposed framework was shown on several benchmark datasets and popular architectures. Neutrino is currently in production and is used by several clients for multiple use-cases such as smart appliances, autonomous driving, and person detection. The success of the framework in production, along with several testimonials, was showcased. With respect to the challenges presented in the first section, Neutrino is a robust early solution that only scratches the surface. Some of the ongoing and future work therefore has much to offer, such as becoming more target-hardware aware and further improving compression and speed-up with additional techniques.

References

  • M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org. Cited by: Introduction.
  • H. Bai, J. Wu, I. King, and M. Lyu (2019) Few shot network compression via cross distillation. arXiv preprint arXiv:1911.09450. Cited by: Architecture Search.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165. Cited by: Introduction.
  • S. Changyong, L. Peng, X. Yuan, Q. Yanyun, D. Longquan, and M. Lizhuang (2019) Knowledge squeezed adversarial network compression. arXiv preprint arXiv:1904.05100. Cited by: Architecture Search.
  • Y. Cheng, D. Wang, P. Zhou, and T. Zhang (2017) A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282. Cited by: Weight Pruning.
  • T. Choudhary, V. Mishra, A. Goswami, and J. Sarangapani (2020) A comprehensive survey on model compression and acceleration. Artificial Intelligence Review, pp. 1–43. Cited by: Weight Pruning.
  • E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus (2014) Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in neural information processing systems, pp. 1269–1277. Cited by: Weight Decomposition.
  • J. Frankle and M. Carbin (2018) The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635. Cited by: Architecture Search.
  • Y. He, J. Lin, Z. Liu, H. Wang, L. Li, and S. Han (2018) Amc: automl for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800. Cited by: Architecture Search.
  • M. Jaderberg, A. Vedaldi, and A. Zisserman (2014) Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866. Cited by: Weight Decomposition.
  • M. Kang, J. Mun, and B. Han (2020) Towards oracle knowledge distillation with neural architecture search.. In AAAI, pp. 4404–4411. Cited by: Architecture Search.
  • Y. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin (2015) Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530. Cited by: Weight Decomposition.
  • A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby (2019) Big transfer (bit): general visual representation learning. arXiv preprint arXiv:1912.11370. Cited by: Introduction.
  • V. Lebedev, Y. Ganin, M. Rakhuba, I. Oseledets, and V. Lempitsky (2014) Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arXiv preprint arXiv:1412.6553. Cited by: Weight Decomposition.
  • Y. Li, S. Gu, C. Mayer, L. V. Gool, and R. Timofte (2020) Group sparsity: the hinge between filter pruning and decomposition for network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8018–8027. Cited by: Weight Decomposition.
  • J. Liu, S. Tripathi, U. Kurup, and M. Shah (2020a) Pruning algorithms to accelerate convolutional neural networks for edge applications: a survey. arXiv preprint arXiv:2005.04275. Cited by: Weight Pruning.
  • N. Liu, X. Ma, Z. Xu, Y. Wang, J. Tang, and J. Ye (2020b) AutoCompress: an automatic dnn structured pruning framework for ultra-high compression rates.. In AAAI, pp. 4876–4883. Cited by: Weight Pruning.
  • Z. Liu, M. Sun, T. Zhou, G. Huang, and T. Darrell (2018) Rethinking the value of network pruning. In International Conference on Learning Representations, Cited by: Weight Decomposition.
  • P. Luo, Z. Zhu, Z. Liu, X. Wang, X. Tang, et al. (2016) Face model compression by distilling knowledge from neurons. In AAAI, pp. 3560–3566. Cited by: Architecture Search.
  • A. Morcos, H. Yu, M. Paganini, and Y. Tian (2019) One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In Advances in Neural Information Processing Systems, pp. 4932–4942. Cited by: Architecture Search.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035. Cited by: Introduction.
  • M. Phuong and C. Lampert (2019) Towards understanding knowledge distillation. In International Conference on Machine Learning, pp. 5142–5151. Cited by: Architecture Search.
  • J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He (2020) DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506. Cited by: Introduction.
  • A. Ren, T. Zhang, Y. Wang, S. Lin, P. Dong, Y. Chen, Y. Xie, and Y. Wang (2020) DARB: a density-adaptive regular-block pruning for deep neural networks.. In AAAI, pp. 5495–5502. Cited by: Weight Pruning.
  • A. Sergeev and M. Del Balso (2018) Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799. Cited by: Stage 2: Annealing.
  • M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019) Megatron-lm: training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053. Cited by: Introduction.
  • E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in nlp. arXiv preprint arXiv:1906.02243. Cited by: Introduction.
  • Y. Wang, X. Zhang, L. Xie, J. Zhou, H. Su, B. Zhang, and X. Hu (2020) Pruning from scratch.. In AAAI, pp. 12273–12280. Cited by: Weight Pruning.
  • A. Waterman, Y. Lee, D. A. Patterson, and K. Asanovic (2011) The risc-v instruction set manual, volume i: base user-level isa. EECS Department, UC Berkeley, Tech. Rep. UCB/EECS-2011-62 116. Cited by: 2nd item.
  • S. Ye, K. Xu, S. Liu, H. Cheng, J. Lambrechts, H. Zhang, A. Zhou, K. Ma, Y. Wang, and X. Lin (2019) Adversarial robustness vs. model compression, or both?. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 111–120. Cited by: Weight Decomposition.
  • J. Yu and T. S. Huang (2019) Universally slimmable networks and improved training techniques. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1803–1811. Cited by: Architecture Search.
  • X. Yu, T. Liu, X. Wang, and D. Tao (2017) On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7370–7379. Cited by: Weight Decomposition.