Deep learning has been one of the most disruptive technologies of the 21st century, having revolutionized businesses across multiple industries. From building better gaming opponents, to translating languages in real time, to extracting detailed understanding from large volumes of images and videos, deep learning has enabled automation across many applications. However, deep learning has now become a race to build ever deeper and larger models that produce better results. Recent models such as BiT-M from Google Kolesnikov et al. (2019) with 928 million parameters, Megatron-LM from NVIDIA Shoeybi et al. (2019) with 8.3 billion parameters, Turing-NLG from Microsoft Rasley et al. (2020) with 17 billion parameters, and GPT-3 from OpenAI Brown et al. (2020) with 175 billion parameters show the unprecedented growth in the size of deep neural network (DNN) architectures.
This explosive growth has made the democratization of deep learning a primary challenge. Training such huge models requires vast computing power on supercomputers, which is not accessible to all. For example, the latest GPT-3 model, with over 350GB in memory size, costs over $12 million to train on specialized supercomputers (https://venturebeat.com/2020/06/01/ai-machine-learning-openai-gpt-3-size-isnt-everything/). Such computing infrastructure is neither available to everyone nor globally affordable for deep learning startups and researchers.
There are other implications of training huge DNN models, such as energy and power consumption. Strubell et al. (2019) showed that the compute required to train large-scale DNN models produces carbon-dioxide emissions equivalent to five times the lifetime emissions of an average American car. They also showed that the annual power consumption of cloud computing giants such as Amazon AWS, Google, or Microsoft is equivalent to the annual power consumption of the United States. Additionally, according to a recent Gartner survey, as of 2020 there are nearly six billion edge devices (https://www.gartner.com/en/newsroom/press-releases/2019-08-29-gartner-says-5-8-billion-enterprise-and-automotive-io), and current state-of-the-art DNN models are not equipped to be deployed directly on edge devices due to their memory requirements.
Our objective is to optimize such DNN model architectures without a reduction in accuracy, as a step towards enabling them to be deployed directly on edge devices. Model optimization rests on the presumption that DNN architectures are over-parameterized. Optimization reduces the number of parameters of a large DNN model while improving metrics such as computational cost, inference time, and energy consumption. This leads to the primary and most important research question: "Can smaller models with fewer parameters achieve accuracy equivalent to a deeper model with a larger number of parameters?"
Challenges of Model Optimization in Production
In the research community, there are some popular approaches such as model pruning, model quantization, and model decomposition to achieve model compression. However, there are many challenges in consuming research-oriented techniques in production.
Democratization of DNN Optimization: Training and optimization of DNN architecture is currently unaffordable and requires super-computing infrastructure. How could we make a production-ready optimization framework that is consumable and affordable by everyone?
Multiple Metrics to Optimize: There are multiple metrics to optimize, such as (i) the number of parameters, (ii) model memory size, (iii) inference time, (iv) computational cost in terms of FLOPs/MACs, or (v) energy consumption. It is challenging to optimize multiple metrics in parallel.
Constrained Optimization: Applications may require optimization to focus on certain metrics while trading off on other metrics. For example, real-time systems would require the inference time to be low while low-memory edge devices would focus on model memory size reduction. How would we guide the model optimization to favor certain metrics over others?
Hardware support: The generic implementation in popular libraries such as PyTorch and Tensorflow does not support certain methods of model compression. Also, the model compilation and the device hardware-specific execution of the optimized model is challenging. While most of the techniques are targeted towards GPU, how could we optimize DNN architectures for specialized hardware?
Black-box Framework: Usability and simplicity are key requirements for end-users consuming optimization in production pipelines. There is a strong need for a black-box optimization framework, where the end-user can simply provide the trained model, the dataset, and the constraints for optimization, without being troubled by the nuances of implementation and execution.
Research Papers to Production: Often, research papers aim at finding a highly optimized model that retains the accuracy of the original model, while the cost involved in optimizing or searching for the optimized model is considered secondary. However, in production systems, the cost and time incurred in optimizing the original model are equally important. Unstructured weight optimization is only realistic on some idealized theoretical hardware. A production-ready framework should generalize the optimization approach across a wide variety of architectures and hardware.
In this research paper, we introduce Neutrino (in this paper, Neutrino refers to Deeplite Neutrino™, https://www.deeplite.ai/index.html#neutrino), a lights-out DNN model optimization framework guided by the end-user's constraints and requirements. A typical continuous development life cycle of a DNN model is shown in Figure 1. The proposed Neutrino can be seamlessly integrated into any development and deployment pipeline. The framework consumes a pre-trained DNN model and the original train-test data split as input, in addition to optimization requirements from the end-user. Neutrino produces an optimized model that can be used for inference in a cloud environment or deployed directly on an edge device. Neutrino orchestrates a symphony of different model optimization and acceleration techniques. This paper focuses on the constrained optimization technique used in the framework and the successful results obtained on various public benchmark datasets and popular models. The Neutrino framework is distributed as a Python PyPI library, with support for PyTorch Paszke et al. (2019) and early support for TensorFlow Abadi et al. (2015).
The rest of the paper is organized as follows; Section 2 provides background literature of various model optimization techniques. Section 3 explains the architecture of the proposed optimization framework. Section 4 details the experimental results obtained on various benchmark datasets and popular DNN architectures. Section 5 presents the business impact and use-cases of the proposed framework, along with the development details. Section 6 summarizes our efforts with some short-term and long-term future goals.
The different methods explored in the literature for DNN model optimization aim to reduce the number of parameters in the model. These techniques can be broadly grouped into three schools of thought: (1) weight pruning, (2) architecture search, and (3) weight decomposition.
The redundant parameters of the model that do not contribute to the effective output are pruned, resulting in a smaller model with fewer parameters. Column and structured-shape pruning introduce sparsity by zeroing individual weight values, while channel and layer pruning reduce the size of the model. Weight pruning in DNN architectures is a well-researched topic with a set of comprehensive survey reports Choudhary et al. (2020); Liu et al. (2020a); Cheng et al. (2017). Liu et al. (2020b) proposed AutoCompress, an automated experience-guided heuristic search technique that achieves extreme compression rates. Ren et al. (2020) proposed a density-adaptive regular-block (DARB) pruning technique that performs pruning at a channel row level. Most of these techniques perform post-training pruning, while Wang et al. (2020) proposed a method for pruning a DNN architecture from scratch, showing that comparable accuracy is achieved with computational budgets similar to those of post-training pruning methods.
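As a minimal illustration of the weight-pruning family discussed above, the sketch below implements unstructured magnitude pruning, the simplest baseline; the cited works use far more sophisticated, structure-aware search than this:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction `sparsity` of weights.

    A toy sketch of unstructured pruning: the pruned weights stay in
    place as zeros, so the tensor shape (and hence the dense compute on
    real hardware) is unchanged -- one reason unstructured pruning alone
    is hard to exploit in production.
    """
    flat = np.abs(weights).ravel()
    k = int(len(flat) * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
pruned = magnitude_prune(w, sparsity=0.9)
print(f"zeros: {np.mean(pruned == 0):.2f}")  # zeros: 0.90
```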
Architecture search finds a surrogate model, from the space of all possible DNN architectures, such that the surrogate (or student) model is much smaller, with performance similar to the original model. Model optimization is thus formulated as a learning- or heuristic-driven search problem, such as knowledge distillation Luo et al. (2016); Phuong and Lampert (2019); Changyong et al. (2019), guided network architecture search Kang et al. (2020), AutoML He et al. (2018), or meta-learning Bai et al. (2019).
One of the recent reforming ideas in model compression is the Lottery Ticket Hypothesis Frankle and Carbin (2018). Morcos et al. (2019) showed successful results of model compression by generalizing the lottery ticket hypothesis across different benchmark datasets and popular DNN architectures. Yu and Huang (2019) described a family of possible slimmable architectures by using a variable layer-width switch based on the batch-normalization layer.
The idea of decomposition is to fragment a very large weight matrix (or tensor) into a linear sequence of smaller tensors, such that maximum information is retained. Denton et al. (2014) proposed singular value decomposition (SVD) of the original weight tensor to find the orthogonal bases. Jaderberg et al. (2014) built a low-rank filter-bank approximation of the convolutional layer, achieving up to 4.5x speedup and compression. Lebedev et al. (2014) used the popular canonical polyadic (CP) decomposition to achieve layer compression. Yu et al. (2017) proposed an SVD-free greedy alternative for generalized bilateral decomposition (GreBdec) of the convolutional layer. Kim et al. (2015) proposed an iterative method of Tucker-based decomposition and fine-tuning to regain the original accuracy. More recently, Li et al. (2020) proposed a single formulation to switch easily between channel pruning and weight decomposition, by applying group sparsity across the columns or rows of the weight tensor, respectively.
There are some inherent challenges in directly consuming existing model optimization solutions. Firstly, it is very difficult to determine the maximum achievable compression such that the accuracy does not drop below an admissible threshold. Ye et al. (2019) discuss these challenges as a trade-off between model robustness and model compression. Secondly, the computational and resource requirements for model distillation and architecture search are very high. Notably, Liu et al. (2018) argued that it is more valuable to search for the pruned architecture shape than to prune unimportant weight values and channels. Thirdly, it is not trivial to identify the rank of the low-rank approximation of the decomposable tensors.
System Architecture and Design
In this section, we describe the high-level solution architecture of Neutrino framework which contains four important components: (i) Neutrino Zoo, (ii) conductor, (iii) high-level coarse compression by exploration, and (iv) fine-grained aggressive compression by annealing. We focus on the system design from the end-users’ usability perspective. In this paper, we restrict the scope to optimizing convolutional neural networks (CNN) models for classification and object detection applications.
The end-user provides the following inputs to the framework: (i) a pre-trained model, (ii) the actual train-test data split used to train the model, and (iii) a set of constraints or requirements to guide the optimization. The data pre-processing and preparation steps performed during the original model training have to be reproduced in the provided data loaders. The pre-trained model and data loaders can be borrowed from any public GitHub repository, or any custom variant designed by the end-user can be used. However, for ease of use, a collection of popular DNN architectures with weights trained on different benchmark datasets is provided as the Neutrino Zoo. The zoo covers various classification and object detection datasets: MNIST, CIFAR10, CIFAR100, VWW, ImageNet, ImageNet10 (a 10-class subset of ImageNet), ImageNet16 (a 16-class subset of ImageNet), VOC2007, VOC2012, and COCO2017. A wide range of trained DNN models is also available, including variants of ResNet, VGG, MobileNet, Inception, DenseNet, ShuffleNet, MLP, SSD with VGG/MobileNet backbones, and YOLO-v3. The availability of the Neutrino Zoo allows end-users to easily and quickly use the framework for transfer learning.
The purpose of the conductor is to collect all the provided inputs, understand the given requirements, and orchestrate the entire optimization pipeline accordingly. The end-user provides the constraints to guide the optimization, and the conductor automatically orchestrates the pipeline by additionally inferring the model and data properties. Some of the common configurable parameters are:
delta: The acceptable tolerance of accuracy drop with respect to the original model, for example, 1%.
stage: The two different stages of compression: stage 1 is less intensive compression requiring fewer computational resources, while stage 2 provides more aggressive compression using more resources and time.
device: Perform the entire optimization and model inference in either CPU, GPU, or multi-GPU (distributed GPU environment).
Customization: The end-user can customize multiple parts of the optimization process so that Neutrino adapts to more complex scenarios. Support for customization goes beyond vanilla classification, including specialized data loaders, custom backpropagation optimizers, and intricate loss functions, as far as their native library implementation allows.
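As an illustration only, the configurable parameters above can be pictured as a plain dictionary consumed by a conductor-like component. The parameter names and validation below are hypothetical stand-ins, not the actual Neutrino API:

```python
# Hypothetical configuration sketch -- names are illustrative only,
# not the real Deeplite Neutrino API.
config = {
    "delta": 1.0,      # acceptable top-1 accuracy drop, in percent
    "stage": 2,        # 1 = light compression, 2 = aggressive compression
    "device": "gpu",   # "cpu", "gpu", or "multi-gpu"
}

def validate_config(cfg: dict) -> dict:
    """Fill defaults and sanity-check the end-user constraints."""
    defaults = {"delta": 1.0, "stage": 1, "device": "cpu"}
    merged = {**defaults, **cfg}
    assert merged["stage"] in (1, 2), "only two compression stages exist"
    assert merged["device"] in ("cpu", "gpu", "multi-gpu")
    assert merged["delta"] >= 0, "tolerance must be non-negative"
    return merged

print(validate_config({"stage": 2}))  # {'delta': 1.0, 'stage': 2, 'device': 'cpu'}
```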
Let the pre-trained model have n optimizable layers, L = {l_1, l_2, ..., l_n}. In a typical CNN model, the convolutional layers and the fully connected layers are optimizable, while the remaining layers are excluded from the optimization process. The conductor analyzes the data size, number of output classes, model architecture, and optimization criterion (delta), and produces a binary composed list B = {b_1, ..., b_n}, where b_i ∈ {0, 1}. The conductor identifies the subset of optimizable layers that needs to be optimized, marked as b_i = 1, and the layers that have to be frozen throughout the process, marked as b_i = 0. This information is passed forward to the exploration stage, where the subset marked b_i = 1 is optimized.
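The conductor's layer selection can be sketched as follows. This is a toy stand-in: the real conductor inspects an actual PyTorch/TensorFlow graph, whereas here layer types are plain strings:

```python
# Toy sketch of building the binary composed list: 1 = optimize, 0 = freeze.
OPTIMIZABLE = {"conv2d", "linear"}  # conv and fully connected layers

def compose_list(layer_types):
    """Return the binary composed list for a sequence of layer types.

    Heuristic mirroring the text: only convolutional and fully connected
    layers are candidates; everything else (batchnorm, pooling, ...) is
    frozen throughout the optimization process.
    """
    return [1 if t in OPTIMIZABLE else 0 for t in layer_types]

model = ["conv2d", "batchnorm", "relu", "conv2d", "maxpool", "linear"]
print(compose_list(model))  # [1, 0, 0, 1, 0, 1]
```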
Stage 1: Exploration
In a convolutional neural network, every optimizable layer l_i projects the input data into a different dimensional output, as follows:

y_i = f_i(x_i; W_i)

where W_i is the kernel parameters of the layer, x_i is the input, y_i is the output, and f_i is the projection function.
A transformation function is applied to every optimizable layer of the convolutional neural network. This transformation function is designed to approximate the original projection while reducing the number of parameters of the layer.
An n-D tensor can be viewed as a linear combination of multiple one-dimensional vectors using the variable-separable method. For a layer whose parameters form a 4-D tensor W of shape [width x height x in_shape x out_shape], the following transformation function is applied:

W[i, j, k, l] ≈ Σ_{r=1}^{small_size} a_r[i] · b_r[j] · c_r[k] · d_r[l]

with a canonical small_size. During the forward pass, the transformed layer is applied as a sequence of separable operations, one per factored dimension, instead of a single dense convolution. This transformation function reduces the number of layer parameters from (w * h * in * out) to small_size * (w + h + in + out).
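The parameter saving of this variable-separable (CP-style) factorization can be checked numerically. The sketch below is an illustration of the idea with NumPy, not Neutrino's actual transformation; the rank and layer shape are arbitrary examples:

```python
import numpy as np

# CP-style rank-R factorization of a conv weight tensor of shape
# (width, height, in_channels, out_channels).
w_, h_, cin, cout, R = 3, 3, 64, 128, 16  # R plays the role of small_size
rng = np.random.default_rng(0)
factors = [rng.normal(size=(d, R)) for d in (w_, h_, cin, cout)]

# Reconstruct W[i,j,k,l] = sum_r a[i,r] * b[j,r] * c[k,r] * d[l,r]
W_approx = np.einsum("ir,jr,kr,lr->ijkl", *factors)

dense_params = w_ * h_ * cin * cout     # w * h * in * out
cp_params = R * (w_ + h_ + cin + cout)  # small_size * (w + h + in + out)
print(dense_params, cp_params)  # 73728 3168
```

For this 3x3, 64-to-128-channel layer, a rank-16 factorization stores roughly 23x fewer parameters than the dense tensor.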
For a layer whose parameters form a 2-D matrix W of shape [in_shape x out_shape], the transformation function is designed as follows:

W ≈ U · V, with U of shape [in_shape x small_size] and V of shape [small_size x out_shape],

where small_size is the near-optimal low-rank approximation of the original matrix. Thus, the layer's forward pass y = x · W is replaced by y = (x · U) · V. This reduces the overall number of parameters of the layer from (in * out) to small_size * (in + out).
The challenge is to find an ideal small-size approximation that produces good compression while retaining the robustness of the model. When the near-optimal small-size equals the actual size of the weight tensor, the transformation is over-approximated, with very low compression. A very small size produces high compression, but with a lossy reconstruction of the transformation. The exploration stage searches for the near-optimal small-size, a lower-rank approximation of the tensor, such that there is minimal loss in the transformation function of the layer.
During the exploration stage, the composed list is updated, where Neutrino selects different transformation functions for different convolutional and fully connected dense layers. The entire model is optimized by the designed composition and the accuracy is regained by performing fine-tuning. The fine-tuning is performed using the same train-test data split used while pre-training the original model. The conductor checks if the optimized model adheres to the termination requirements as provided by the end-user, and if not, the composition list is updated and the next round of optimization is performed.
Stage 2: Annealing
Stage 2 optimization aims to perform aggressive compression and obtain the maximum possible compression within the required accuracy tolerance. For example, if the acceptable accuracy delta is d and stage 1 produces a compression with an accuracy drop smaller than d, the aim of stage 2 is to push the compression further, with the drop going as close as possible to d. In stage 2, the composed list of layers is frozen, while the extent of optimization for each layer is increased. Annealing is a metaheuristic approach to approximate global optimization. By increasing the temperature of each layer, the overall energy of the model is preserved while finding a smaller size that better approximates the global optimum.
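The annealing behaviour can be sketched as a classic simulated-annealing search over per-layer sizes: improvements are always accepted, and regressions are accepted with a temperature-dependent probability. The objective below is a toy stand-in (parameter count plus a penalty mimicking the accuracy-delta constraint), not Neutrino's actual annealing:

```python
import math
import random

def anneal(ranks, cost, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over per-layer ranks (toy sketch).

    `cost` scores a rank assignment (lower is better). Regressions are
    accepted with probability exp(-delta / temperature), so early high
    temperatures allow escaping local optima; cooling makes the search
    increasingly greedy.
    """
    rng = random.Random(seed)
    current, best = list(ranks), list(ranks)
    t = t0
    for _ in range(steps):
        cand = list(current)
        i = rng.randrange(len(cand))
        cand[i] = max(1, cand[i] + rng.choice((-1, 1)))  # perturb one layer
        delta = cost(cand) - cost(current)
        if delta < 0 or rng.random() < math.exp(-delta / t):
            current = cand
            if cost(current) < cost(best):
                best = list(current)
        t *= cooling
    return best

# Toy cost: total parameters plus a heavy penalty for dropping below
# rank 4 (a stand-in for violating the accuracy-delta constraint).
cost = lambda rs: sum(rs) + sum(100 for r in rs if r < 4)
print(anneal([32, 32, 32], cost))
```

With this penalty, the search drives every layer down to the smallest rank that does not trip the constraint.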
The entire pipeline of the Neutrino framework can be executed in a distributed multi-GPU environment, to speed up the time required for optimizing the model. To achieve this, Uber's open-source Horovod library (https://eng.uber.com/horovod/) Sergeev and Del Balso (2018) is reused. Horovod supports different backend libraries, including PyTorch and TensorFlow, and is easy to use and integrate.
Table 1 (columns: Architecture, Model, Accuracy (%), Size (MB), MACs (Billions), #Params (Millions), Memory Footprint (MB), Execution Time (ms)): Performance on different metrics obtained after multiple stages of optimization on the CIFAR-100 dataset, validating the enhancement (Enh) obtained using the proposed framework. All the results are computed for the specified input delta accuracy.
Table 2 (columns: Dataset, Model, Accuracy (%), Size (MB), MACs (Billions), #Params (Millions), Memory Footprint (MB), Execution Time (ms)): Optimization results of the ResNet-18 architecture on different large-scale vision datasets.
Experimental Results and Analysis
In this section, we experimentally showcase the performance of Neutrino in optimizing different CNN models. The different metrics used to evaluate the extent of optimization are explained, along with the experimental protocol.
There are different metrics used to measure the amount of optimization and performance of Neutrino, as follows:
Accuracy: The top-1 accuracy or the equivalent performance objective of the model is measured. Successful optimization retains the accuracy of the original model.
Model Size: The disk size (MB) occupied by the trainable parameters of the model. Lower model size enables models to be deployed into devices with memory constraints.
MACs: The computational complexity of the model is measured by the number (billions) of Multiply-Accumulate Operation (MAC) computed across the layers of the model. The lower the number of MACs, the better optimized is the model.
Number of Parameters: Total number (millions) of trainable parameters (weights and biases) in the model. Optimization aims to reduce the number of parameters.
Memory Footprint: The total memory (MB) required to perform the inference on a batch of data, including the memory required by the trainable parameters and the layer activations. A lower memory footprint is achieved by better optimization.
Execution Time: The time (ms) required to perform forward pass on a batch of data. Optimized models have a lower execution time.
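Several of the metrics above can be computed in closed form from a layer's shape. The sketch below does so for a single convolutional layer, assuming float32 parameters and counting one MAC per kernel element per output position (a common convention; tools differ on whether bias terms and activations are included):

```python
def conv2d_metrics(cin, cout, k, out_h, out_w, bias=True):
    """Parameter count, fp32 model size (MB), and MACs for one k x k
    convolutional layer producing an (out_h x out_w) feature map."""
    params = cin * cout * k * k + (cout if bias else 0)
    size_mb = params * 4 / (1024 ** 2)        # 4 bytes per fp32 weight
    macs = cin * cout * k * k * out_h * out_w  # one MAC per kernel tap per pixel
    return params, size_mb, macs

# Example: first conv of a VGG-style network on a 32x32 input (3 -> 64 channels).
p, s, m = conv2d_metrics(3, 64, 3, 32, 32)
print(p, round(s, 4), m)  # 1792 0.0068 1769472
```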
The results are shown for several popular CNN models against three benchmark datasets: CIFAR-100, ImageNet16, and Visual Wake Words (VWW). All the optimization experiments are run with the end-user's required accuracy delta. The experiments are executed with a fixed mini-batch size, and the metrics are normalized to a common mini-batch size. All experiments are run on four parallel GPUs, using Horovod, where each GPU is a Tesla V100 SXM2 with 32GB memory. The standard train-test split is used for the experiments. The images are normalized with the global mean and variance computed from the training data. To make the training more robust, data augmentation is performed using random cropping with resizing and random horizontal flips.
The optimization results obtained using Neutrino across different popular CNN models on the CIFAR-100 dataset are shown in Table 1, and the results of the ResNet-18 architecture on different large-scale vision datasets are shown in Table 2. From Table 1, it can be observed that the accuracy difference between the original and the final optimized model stays within the provided delta requirement. Depending on the architecture of the original model, the model size can be compressed anywhere between 3x and 30x. VGG19 is known to be one of the most over-parameterized CNN models and, as expected, achieved a large reduction in the number of parameters, with almost 12x compression in the overall memory footprint and an 8.3x reduction in computational complexity. The resulting VGG19 model occupies only 2.6MB, compared to the original model's 76.6MB. MobileNet architectures are specifically designed to be lightweight with low computational cost, and even for MobileNet v1, Neutrino achieved a notable size compression with only a marginal reduction in accuracy. In a GPU environment, a speedup of around 1.5x is observed. This can significantly impact the inference time of the model, especially on edge devices, as well as the fine-tuning time required in future production releases. On large-scale vision datasets, Neutrino produces around 23.5x compression of ResNet18 on the ImageNet16 and VWW datasets. The optimized model requires only 1.8MB, compared to 42.6MB for the original model. There is more than a 1.6x speedup, with a corresponding reduction in the computational complexity of the model. Crucially, it can be observed that Stage 2 compresses the model substantially more than Stage 1.
Time Taken for Optimization
The overall time taken for optimization by Neutrino, including Stage 1 and Stage 2, is shown in Figure 3. It can be observed that most models can be optimized in less than 2 hours, while complex architectures with longer training times, such as ResNet50 and DenseNet121, take around 6 hours and 13 hours, respectively. The comparison between the time taken for Stage 1 and Stage 2 compression is shown in Figure 4. It can be observed that most of the overall compression is achieved in Stage 2, while Stage 1 consumes less than 40% of the overall time required. This differentiation is a key feature of Neutrino: end-users who need quick optimization with low resource consumption can choose Stage 1, while those needing aggressive optimization can choose Stage 2.
It can be experimentally observed that Neutrino generalizes across all kinds of CNN architectures and all scales of datasets with varying numbers of classes, uniformly providing a high degree of optimization across all these datasets.
Table 3: Original-model metrics across different production deployments.

| Client | Model | Dataset | Method | Acc. (%) | #Params (M) | Size (bytes) | FLOPS (M) | Time (ms) |
|---|---|---|---|---|---|---|---|---|
| Prod#1 | MobileNetV2-0.35x | Imagenet Small | Original | 80.9 | 0.4093 | 1,637,076 | 66.50 | 1.64 |
| Prod#2 | MobileNetV2-1.0x | Imagenet Small | Original | 90.9 | 2.2367 | 8,951,804 | 312.8 | 4.14 |
| Prod#3 | MobileNetV2-0.35x | Gesture Recognition | Original | 96.8 | 2.3630 | 10,500,000 | 559.60 | 706 |
| Prod#4 | SSD300 (ResNet50) | COCO-10 | Original | 0.438 (mAP) | 14.17 | 56,734,728 | 15.59 | 3.98 |
Our blackbox optimization framework has been deployed into multiple real-world applications and has been consumed by different clients. From different chip manufacturers enabling edge deployment of DNN architectures, to a faster inference of computer vision models on the cloud, the Neutrino framework could cater to a wide variety of use-cases. Some of the key real-world use-cases, where Neutrino is currently deployed in production are:
Smart Appliances: More than 100 million home appliances currently use ARM-based Raspberry Pi 4 boards with only 2GB of memory. To enable on-device, AI-driven, automated gesture recognition, Neutrino is used to compress MobileNet-variant architectures by almost 2.5x.
Person Detection: An embedded system with a small camera, using RISC-V CPU cores Waterman et al. (2011), serves as a home-assistant alarm by performing person detection. To enable very large DNN architectures to be deployed on these CPU cores, the Neutrino framework is used to achieve up to 68x compression.
Autonomous Driving: Autonomous self-driving cars need to perform real-time object detection against highly noisy backgrounds. A highly complex DNN architecture, SSD-300 with a ResNet50 backbone, is used to accomplish object detection. To deploy this large DNN model on an NVIDIA Xavier GPU, the Neutrino framework is used to achieve 3x compression, along with a 3x speedup and 3x power reduction, with no reduction in accuracy.
The results obtained from real-world deployments across various use-cases are shown in Table 3. It can be observed that across different production environments, use-cases, models, and datasets, Neutrino generalizes to successful compression of models. Depending on the application requirements, Neutrino produces anywhere between 2x and 68x compression, with less than 1% accuracy reduction from the original model. Also, in the same production environments, Neutrino was compared with competitive optimization frameworks such as Microsoft's Neural Network Intelligence (NNI) (https://github.com/microsoft/nni), Intel's Neural Network Distiller (https://github.com/NervanaSystems/distiller), and TensorFlow Lite Micro (https://www.tensorflow.org/lite/microcontrollers). It can be observed that Neutrino consistently outperforms the competitors by achieving higher compression with better accuracy. As a testimonial to its success and usability, the Neutrino framework has received several accolades and media coverage, some of which are listed here:
Neutrino and its parent company Deeplite were named to the AI 100, the top 100 AI companies globally, by CB Insights (https://www.cbinsights.com/research/artificial-intelligence-top-startups/). The CB Insights platform annually chooses this list from a large candidate set of companies, with technical novelty as one of the primary criteria. Also, Intel Capital, in their AI infrastructure stack landscape (https://www.intel.com/content/www/us/en/intel-capital/news/story.html?id=a0F1I00000BNTXPUA5), identified Deeplite and Neutrino as one of the few production-ready optimization frameworks available in the market today.
In a joint partnership, Deeplite and Andes Technologies used Neutrino to deploy optimized DNN models on the first commercial RISC-V cores based on the AndeStar V5 architecture (https://www.prnewswire.com/news-releases/andes-technology-and-deeplite-inc-join-forces-to-deploy-highly-compact-deep-learning-models-into-daily-life-300972366.html). In a specific use-case, a MobileNet-v1 model trained on the Visual Wake Words (VWW) dataset was compressed from 13MB to only 688KB (68 times compression) with less than a 1% drop in accuracy. According to Dr. Charlie Su, CTO and Executive VP of Andes Technology, "Deeplite has provided a solution that can be leveraged both internally within Andes as well as for our customers to bring deep learning on Andes RISC-V CPU cores to resource-limited devices at the edge."
Using the Neutrino framework, large DNN architectures are currently being optimized and deployed on ARM microcontrollers (https://community.arm.com/developer/ip-products/processors/b/ml-ip-blog/posts/unlocking-ai-on-arm-microcontrollers-with-deep-learning-model-optimization). In a specific use-case of a low-power camera, the underlying ARM Cortex-M4 has only 256KB of on-chip memory. Automatically guided by this memory constraint, Neutrino compressed a 13MB DNN architecture to only 144KB (88 times compression) with less than a 1.84% accuracy drop compared to the original model.
There is a CI/CD-based DevOps pipeline, with a monthly sprint delivering product enhancements, software patches, and bug fixes. A committed core team of eight technical developers (and growing fast), with diverse skills, leads and supports new features and new client deployments.
Conclusion and Future Work
In this paper, we proposed an easy-to-use black-box framework for DNN model optimization, Neutrino. The framework is completely automated and can optimize any convolutional neural network-based architecture with no human intervention. The end-user provides the optimization requirements, such as a target model size or a tolerated drop in accuracy, and the Neutrino framework produces an optimized model accordingly. As experimental validation, the performance of the proposed framework was shown on several benchmark datasets and popular architectures. Neutrino is currently in production and is used by several clients for multiple use-cases, such as smart appliances, autonomous driving, and person detection. The success of the framework in production, along with several testimonials, was showcased. Against the challenges presented in the first section, Neutrino is a robust early solution that only scratches the surface. Ongoing and future work therefore has much to offer, such as becoming more target-hardware aware and further improving compression and speedup.
References
- Abadi et al. (2015). TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
- Bai et al. (2019). Few-shot network compression via cross distillation. arXiv preprint arXiv:1911.09450.
- Brown et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Changyong et al. (2019). Knowledge squeezed adversarial network compression. arXiv preprint arXiv:1904.05100.
- Cheng et al. (2017). A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
- Choudhary et al. (2020). A comprehensive survey on model compression and acceleration. Artificial Intelligence Review, pp. 1–43.
- Denton et al. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pp. 1269–1277.
- Frankle and Carbin (2018). The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
- He et al. (2018). AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–800.
- Jaderberg et al. (2014). Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866.
- Kang et al. (2020). Towards oracle knowledge distillation with neural architecture search. In AAAI, pp. 4404–4411.
- Kim et al. (2015). Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530.
- Kolesnikov et al. (2019). Big transfer (BiT): general visual representation learning. arXiv preprint arXiv:1912.11370.
- Lebedev et al. (2014). Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553.
- Li et al. (2020). Group sparsity: the hinge between filter pruning and decomposition for network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8018–8027.
- Liu et al. (2020a). Pruning algorithms to accelerate convolutional neural networks for edge applications: a survey. arXiv preprint arXiv:2005.04275.
- Liu et al. (2020b). AutoCompress: an automatic DNN structured pruning framework for ultra-high compression rates. In AAAI, pp. 4876–4883.
- Liu et al. (2018). Rethinking the value of network pruning. In International Conference on Learning Representations.
- Luo et al. (2016). Face model compression by distilling knowledge from neurons. In AAAI, pp. 3560–3566.
- Morcos et al. (2019). One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. In Advances in Neural Information Processing Systems, pp. 4932–4942.
- Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pp. 8024–8035.
- Phuong and Lampert (2019). Towards understanding knowledge distillation. In International Conference on Machine Learning, pp. 5142–5151.
- Rasley et al. (2020). DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3505–3506.
- Ren et al. (2020). DARB: a density-adaptive regular-block pruning for deep neural networks. In AAAI, pp. 5495–5502.
- Sergeev and Del Balso (2018). Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799.
- Shoeybi et al. (2019). Megatron-LM: training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053.
- Strubell et al. (2019). Energy and policy considerations for deep learning in NLP. arXiv preprint arXiv:1906.02243.
- Wang et al. (2020). Pruning from scratch. In AAAI, pp. 12273–12280.
- Waterman et al. (2011). The RISC-V instruction set manual, volume I: base user-level ISA. EECS Department, UC Berkeley, Tech. Rep. UCB/EECS-2011-62.
- Ye et al. (2019). Adversarial robustness vs. model compression, or both? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 111–120.
- Yu and Huang (2019). Universally slimmable networks and improved training techniques. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1803–1811.
- Yu et al. (2017). On compressing deep models by low rank and sparse decomposition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7370–7379.