Resource-Efficient Neural Networks for Embedded Systems

While machine learning is traditionally a resource-intensive task, embedded systems, autonomous navigation, and the vision of the Internet of Things fuel the interest in resource-efficient approaches. These approaches aim for a carefully chosen trade-off between performance and resource consumption in terms of computation and energy. The development of such approaches is among the major challenges in current machine learning research and key to ensuring a smooth transition of machine learning technology from a scientific environment with virtually unlimited computing resources into everyday applications. In this article, we provide an overview of the current state of the art of machine learning techniques facilitating these real-world requirements. In particular, we focus on deep neural networks (DNNs), the predominant machine learning models of the past decade. We give a comprehensive overview of the vast literature that can be mainly split into three non-mutually exclusive categories: (i) quantized neural networks, (ii) network pruning, and (iii) structural efficiency. These techniques can be applied during training or as post-processing, and they are widely used to reduce the computational demands in terms of memory footprint, inference speed, and energy efficiency. We substantiate our discussion with experiments on well-known benchmark data sets to showcase the difficulty of finding good trade-offs between resource-efficiency and predictive performance.



1 Introduction

Machine learning is a key technology in the 21st century and the main contributing factor for many recent performance boosts in computer vision, natural language processing, speech recognition and signal processing. Today, the main application domain and comfort zone of machine learning applications is the “virtual world”, as found in recommender systems, stock market prediction, and social media services. However, we are currently witnessing a transition of machine learning moving into “the wild”, where the most prominent examples are autonomous navigation for personal transport and delivery services, and the Internet of Things (IoT). Evidently, this trend opens several real-world challenges for machine learning engineers.

Figure 1: Aspects of resource-efficient machine learning models.

Current machine learning approaches prove particularly effective when large amounts of data and ample computing resources are available. However, in real-world applications the computing infrastructure during the operation phase is typically limited, which effectively rules out most of the current resource-hungry machine learning approaches. There are several key challenges, illustrated in Figure 1, which have to be jointly considered to facilitate machine learning in real-world applications:

Representational efficiency

The model complexity, i.e., the number of model parameters, should match the (usually limited) resources in deployed systems, in particular regarding memory footprint.

Computational efficiency

The computational cost of performing inference should match the (usually limited) resources in deployed systems, and exploit the available hardware optimally in terms of time and energy. For instance, power constraints are key for autonomous and embedded systems, as the device lifetime for a given battery charge needs to be maximized, or constraints set by energy harvesters need to be met.

Prediction quality

The focus of classical machine learning is mostly on optimizing the prediction quality of the models. For embedded devices, model complexity versus prediction quality trade-offs must be considered to achieve good prediction performance while simultaneously reducing computational complexity and memory requirements.

In this article, we review the state of the art in machine learning with regard to these real-world requirements. We focus on deep neural networks (DNNs), the currently predominant machine learning models. We formally define DNNs in Section 2 and give a brief introduction to the most prominent building blocks, such as dropout and batch normalization.

While being the driving factor behind many recent success stories, DNNs are notoriously data and resource hungry, a property which has recently renewed significant research interest in resource-efficient approaches. This paper is dedicated to giving an extensive overview of the current directions of research of these approaches, all of which are concerned with reducing the model size and/or improving inference efficiency while at the same time maintaining accuracy levels close to state-of-the-art models. We have identified three major directions of research concerned with enhancing resource-efficiency in DNNs that we present in Section 3. In particular, these directions are

Quantized Neural Networks

Typically, the weights of a DNN are stored as 32-bit floating-point values and during inference millions of floating-point operations are carried out. Quantization approaches reduce the number of bits used to store the weights and the activations of DNNs, respectively. While quantization approaches obviously reduce the memory footprint of a DNN, the selected weight representation potentially also facilitates faster inference using cheaper arithmetic operations. Even reducing precision down to binary or ternary values works reasonably well and essentially reduces DNNs to hardware-friendly logical circuits.

Network Pruning

Starting from a fixed, potentially large DNN architecture, pruning approaches remove parts of the architecture during training or after training as a post-processing step. The parts being removed range from the very local scale of individual weights — which is called unstructured pruning — to a more global scale of neurons, channels, or even entire layers — which is called structured pruning. On the one hand, unstructured pruning is typically less sensitive to accuracy degradation, but special sparse matrix operations are required to obtain a computational benefit. On the other hand, structured pruning is more delicate with respect to accuracy but the resulting data structures remain dense such that common highly optimized dense matrix operations available on most off-the-shelf hardware can be used.

Structural Efficiency

This category comprises a diverse set of approaches that achieve resource-efficiency at the structural level of DNNs. Knowledge distillation is an approach where a small student DNN is trained to mimic the behavior of a larger teacher DNN, which has been shown to yield improved results compared to training the small DNN directly. The idea of weight sharing is to use a small set of weights that is shared among several connections of a DNN to reduce the memory footprint. Several works have investigated special matrix structures that require fewer parameters and allow for faster matrix multiplications — the main workload in fully connected layers. Furthermore, there exist several manually designed architectures that introduced lightweight building blocks or modified existing building blocks to enhance resource-efficiency. Most recently, neural architecture search methods have emerged that discover efficient DNN architectures automatically.

Evidently, many of the presented techniques are not mutually exclusive, and they can potentially be combined to further enhance resource-efficiency. For instance, one can both sparsify a model and reduce arithmetic precision.

In Section 4 we substantiate our discussion with experimental results. First, we exemplify the trade-off between latency, memory footprint, and predictive accuracy in DNNs on the CIFAR-10 data set. Subsequently, we provide a comparison of various quantization approaches for DNNs, using the CIFAR-100 data set in Section 4.2. In particular, this overview shows that sensible trade-offs are achievable with very low numeric precision, even for the extreme case of binarized weights and activations. Finally, we discuss a complete real-world signal processing example using DNNs with binary weights and activations. We develop a complete speech enhancement system employing an efficient DNN-based speech mask estimator, which shows negligible performance degradation while allowing memory savings by a factor of 32 and speed-ups by approximately a factor of 10.

2 Background

Before we present a comprehensive overview of the many different techniques for reducing the complexity of DNNs in Section 3, this section formally introduces DNNs and some fundamentals required in the remainder of the paper.

2.1 Feed-forward Deep Neural Networks

DNNs are typically organized in layers of alternating linear transformations and non-linear activation functions. A vanilla DNN with $L$ layers is a function mapping an input $x^0$ to an output $x^L$ by applying the iterative computation

$$a^l = W^l x^{l-1} + b^l, \qquad (1)$$

$$x^l = \phi(a^l), \qquad (2)$$

where (1) computes a linear transformation with weight tensor $W^l$ and bias vector $b^l$, and (2) computes a non-linear activation function $\phi$ that is typically applied element-wise. Common choices for $\phi$ are the ReLU function $\phi(a) = \max(0, a)$, sigmoid functions, such as $\tanh(a)$ and the logistic function $1 / (1 + \exp(-a))$, and, in the context of resource-efficient models, the sign function $\mathrm{sign}(a) = \mathbb{I}(a \geq 0) - \mathbb{I}(a < 0)$, where $\mathbb{I}$ is the indicator function.

In this paper, we focus on hardware-efficient machine learning in the context of classification, i.e., the task of assigning the input $x^0$ to a class $c \in \{1, \ldots, C\}$. Other predictive tasks, such as regression and multi-label prediction, can be tackled in a similar manner. For classification tasks, the output activation function for computing the output $x^L$ is typically the softmax function $\mathrm{softmax}(a)_i = \exp(a_i) / \sum_j \exp(a_j)$. An input $x^0$ is assigned to class $c = \arg\max_i x^L_i$.
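As a minimal sketch of the layer-wise computation and the softmax classification rule described above, the following pure-Python snippet evaluates a tiny fully connected DNN. The function names and the toy weights are illustrative assumptions and not taken from any particular framework; class indices are zero-based here.

```python
import math

def relu(a):
    """Element-wise ReLU activation."""
    return [max(0.0, v) for v in a]

def softmax(a):
    """Softmax over a vector, shifted by its maximum for numerical stability."""
    m = max(a)
    exps = [math.exp(v - m) for v in a]
    total = sum(exps)
    return [e / total for e in exps]

def linear(W, b, x):
    """Linear transformation a = W x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def forward(layers, x):
    """Apply ReLU layers followed by a softmax output layer.

    layers is a list of (W, b) pairs (hypothetical toy format).
    """
    for W, b in layers[:-1]:
        x = relu(linear(W, b, x))
    W, b = layers[-1]
    return softmax(linear(W, b, x))

def predict(layers, x):
    """Assign the input to the class with the largest softmax output."""
    probs = forward(layers, x)
    return max(range(len(probs)), key=probs.__getitem__)

# Toy network: 2 inputs -> 2 hidden neurons -> 3 classes (made-up weights).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]),
]
probs = forward(layers, [2.0, 1.0])
```

Since the softmax is monotone, the predicted class can equivalently be read off the largest pre-softmax activation, which is why the softmax is often skipped at inference time on constrained hardware.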

The two most common types of layers are (i) fully connected layers (many popular deep learning frameworks refer to them as dense layers) and (ii) convolutional layers. For fully connected layers, the input $x^{l-1}$ is a vector whose individual dimensions, also called neurons, do not exhibit any a-priori known structure; for $0 < l < L$, we speak of hidden layers and hidden neurons. The linear transformation of a fully connected layer is implemented as a matrix-vector multiplication $W^l x^{l-1}$ where $W^l \in \mathbb{R}^{d_l \times d_{l-1}}$ and $d_l$ denotes the number of neurons in layer $l$.

Convolutions are used if the data exhibits spatial or temporal dimensions such as images, in which case the DNN is called a convolutional neural network (CNN). Two-dimensional images can be represented as three-dimensional tensors $x^l \in \mathbb{R}^{C_l \times U_l \times V_l}$, where $C_l$ refers to the number of channels (or, equivalently, feature maps), and $U_l$ and $V_l$ refer to the width and the height of the image, respectively. A convolution using a rank-4 filter weight tensor $W^l \in \mathbb{R}^{C_l \times C_{l-1} \times U \times V}$ mapping $x^{l-1}$ to $a^l$ is computed as

$$a^l_{c,u,v} = b^l_c + \sum_{c'=1}^{C_{l-1}} \sum_{u'=1}^{U} \sum_{v'=1}^{V} W^l_{c,c',u',v'} \, x^{l-1}_{c',\, g_U(u,u'),\, g_V(v,v')}, \qquad (3)$$

where $g_K$ is the auxiliary indexing function

$$g_K(i, i') = i - \lceil K/2 \rceil + i'. \qquad (4)$$

Each spatial location $(u, v)$ of the output feature map $a^l$ is computed from a $U \times V$ region of the input image $x^{l-1}$. By using the same filter to compute the values at different spatial locations, a translation invariant detection of features is obtained. The spatial size of features detected within an image is bounded by the receptive field, i.e., the section of the input image that influences the value of a particular spatial location in some hidden layer. The receptive field is increased by stacking multiple convolutional layers, e.g., performing two consecutive $3 \times 3$ convolutions results in each output spatial location being influenced by a larger $5 \times 5$ region of the input feature maps.
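The growth of the receptive field under stacking can be made concrete with a short calculation. The sketch below tracks the receptive field along one spatial dimension for a sequence of convolution or pooling layers; the `(kernel_size, stride)` layer description is an assumed toy format, and padding is ignored.

```python
def receptive_field(layers):
    """Receptive field size along one spatial dimension.

    layers is a list of (kernel_size, stride) pairs, ordered from the
    input towards the output (hypothetical toy format).
    """
    rf = 1    # a single output location initially sees one input location
    jump = 1  # distance in input pixels between neighboring locations
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Two stacked 3x3 convolutions with stride 1 cover a 5x5 input region.
two_convs = receptive_field([(3, 1), (3, 1)])

# Inserting a 2x2 pooling with stride 2 between them enlarges the field further.
conv_pool_conv = receptive_field([(3, 1), (2, 2), (3, 1)])
```

In this sketch, `two_convs` evaluates to 5 and `conv_pool_conv` to 8, which illustrates why both stacking convolutions and pooling increase the receptive field.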

Another form of translational invariance is achieved by pooling operations that merge spatially neighboring values within a feature map to reduce the feature map’s size. Common choices are max-pooling and average-pooling, which combine neighboring values by computing their maximum or average, respectively; typically, a $2 \times 2$ region is used to halve the feature map size. Furthermore, pooling operations also increase the receptive field.

2.2 Training of Deep Neural Networks

The task of training is concerned with adjusting the weights $W$ such that the DNN reliably predicts correct classes for unseen inputs $x^0$. This is accomplished by minimizing a loss function $\mathcal{L}$ using gradient based optimization (Nocedal and Wright, 2006). Given some labeled training data $\mathcal{D} = \{(x_1, t_1), \ldots, (x_N, t_N)\}$ containing $N$ input-target pairs, a typical loss function has the form

$$\mathcal{L}(W; \mathcal{D}) = \sum_{n=1}^{N} l(x^L_n, t_n) + \lambda \, r(W), \qquad (5)$$

where $l$ is the data term that penalizes the DNN parameters $W$ if the output $x^L_n$ for the input $x_n$ does not match the target value $t_n$, $r$ is a regularizer that prevents the DNN from overfitting, and $\lambda$ is a trade-off hyperparameter. Typical choices for the data term $l$ are the cross-entropy loss or the mean squared error loss, whereas typical choices for the regularizer $r$ are the $\ell_1$-norm or the $\ell_2$-norm of the weights, respectively. The loss is minimized using gradient descent by iteratively computing

$$W \leftarrow W - \eta \, \nabla_W \mathcal{L}(W; \mathcal{D}), \qquad (6)$$

where $\eta$ is a learning rate hyperparameter. In practice, more involved stochastic gradient descent (SGD) schemes, such as ADAM (Kingma and Ba, 2015), are used that randomly select smaller subsets of the data, called mini-batches, to approximate the gradient.
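To illustrate the gradient descent iteration on a regularized loss, the following sketch fits a single weight of a one-dimensional linear model with a squared error data term and a squared $\ell_2$ regularizer. The model, the toy data, and the hyperparameter values are assumptions chosen so that the minimizer is known in closed form; they are not taken from the experiments of this paper.

```python
def loss_gradient(w, data, lam):
    """Gradient of sum_n (w * x_n - t_n)^2 + lam * w^2 with respect to w."""
    data_grad = sum(2.0 * (w * x - t) * x for x, t in data)
    return data_grad + 2.0 * lam * w

def gradient_descent(data, lam=0.1, eta=0.01, steps=200, w=0.0):
    """Plain (full-batch) gradient descent with learning rate eta."""
    for _ in range(steps):
        w = w - eta * loss_gradient(w, data, lam)
    return w

# Toy data generated by t = 2 * x (assumed for illustration).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w_star = gradient_descent(data)

# Closed-form minimizer: sum(x * t) / (sum(x^2) + lam) = 28 / 14.1
w_closed_form = 28.0 / 14.1
```

The regularizer pulls the solution slightly below the unregularized value of 2, which is the shrinkage effect that the trade-off hyperparameter controls.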

Modern deep learning frameworks play an important role in the growing popularity of DNNs as they make gradient based optimization particularly convenient: the user specifies the loss $\mathcal{L}$ as a computation graph and the gradient $\nabla_W \mathcal{L}$ is calculated automatically by the framework using the backpropagation algorithm (Rumelhart et al., 1986).

2.3 Batch Normalization

The literature has established a consensus that using more layers improves the classification performance of DNNs. However, increasing the number of layers also increases the difficulty of training a DNN using gradient based methods as described in Section 2.2. Most modern DNN architectures employ batch normalization (Ioffe and Szegedy, 2015) after the linear transformation of some or all layers by computing

$$\mu = \frac{1}{M} \sum_{m=1}^{M} a^l_m, \qquad \sigma^2 = \frac{1}{M} \sum_{m=1}^{M} (a^l_m - \mu)^2, \qquad a^l_m \leftarrow \gamma \, \frac{a^l_m - \mu}{\sigma} + \beta, \qquad (7)$$

where $a^l_m$ denotes the activations of layer $l$ for the $m$-th sample of the mini-batch, all operations are applied element-wise, $\gamma$ and $\beta$ are trainable parameters, and $M$ is the mini-batch size of SGD.

The idea is to normalize the activation statistics over the data samples in each layer to zero mean and unit variance. This results in similar activation statistics throughout the network which facilitates gradient flow during backpropagation. The linear transformation of the normalized activations with the parameters $\gamma$ and $\beta$ is mainly used to recover the DNN's ability to approximate any desired function, a feature that would be lost if only the normalization step is performed. Most recent DNN architectures have been shown to benefit from batch normalization, and, as reviewed in Section 3.2.2, batch normalization can be targeted to achieve resource-efficiency in DNNs.
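The normalization step can be sketched in a few lines. The snippet below normalizes a mini-batch of activation vectors per dimension and then applies the trainable scale and shift; the small constant `eps` added to the variance is an assumption commonly made for numerical stability, and the function name is hypothetical.

```python
import math

def batch_norm(batch, gamma, beta, eps=1e-5):
    """Batch normalization over a mini-batch of activation vectors.

    batch is a list of M vectors; gamma and beta hold one scale and one
    shift per dimension. eps is an assumed stabilizing constant.
    """
    M = len(batch)
    dim = len(batch[0])
    out = [[0.0] * dim for _ in range(M)]
    for j in range(dim):
        column = [a[j] for a in batch]
        mean = sum(column) / M
        var = sum((v - mean) ** 2 for v in column) / M
        for m in range(M):
            normalized = (column[m] - mean) / math.sqrt(var + eps)
            out[m][j] = gamma[j] * normalized + beta[j]
    return out

# Two samples with very different scales per dimension (toy values).
batch = [[1.0, 10.0], [3.0, 30.0]]
normalized = batch_norm(batch, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With unit scale and zero shift, both dimensions end up with approximately the same statistics despite their different input scales. At inference time, implementations typically replace the mini-batch statistics with running averages collected during training.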

2.4 Dropout

Dropout as introduced by Srivastava et al. (2014) is a way to prevent neural networks from overfitting by injecting multiplicative noise to the inputs of a layer, i.e., $\tilde{x}^{l-1} = x^{l-1} \odot \varepsilon$. Common choices for the injected noise are $\varepsilon_i \sim \mathrm{Bernoulli}(p)$ or $\varepsilon_i \sim \mathcal{N}(1, \alpha)$ with $p$ and $\alpha$ being hyperparameters. Intuitively, the idea is that hidden neurons cannot rely on the presence of features computed by other neurons. Consequently, individual neurons are expected to compute in a sense “meaningful” features on their own. This prevents multiple neurons from jointly computing features in an entangled way. Dropout has been cast into a Bayesian framework which was subsequently exploited to perform network pruning as detailed in Section 3.2.3.
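A minimal sketch of the two noise models, assuming that the Bernoulli parameter is the probability of keeping an input and that the Gaussian noise has mean one and a variance given by a hyperparameter (called `alpha` here). The function names are illustrative.

```python
import math
import random

def bernoulli_dropout(x, p, rng):
    """Multiply each input by Bernoulli(p) noise, where p is the keep probability."""
    return [v * (1.0 if rng.random() < p else 0.0) for v in x]

def gaussian_dropout(x, alpha, rng):
    """Multiply each input by Gaussian noise with mean 1 and variance alpha."""
    std = math.sqrt(alpha)
    return [v * rng.gauss(1.0, std) for v in x]

rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
dropped = bernoulli_dropout(x, p=0.5, rng=rng)
noisy = gaussian_dropout(x, alpha=0.25, rng=rng)
```

Many implementations additionally rescale the retained inputs by `1 / p` during training so that the expected input to the next layer is unchanged and no correction is needed at test time; this rescaling is omitted above to stay close to the multiplicative noise formulation.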

2.5 Modern Architectures

As mentioned in the beginning of this section, most architectures follow the simple scheme of repeating several layers of linear transformation followed by a non-linear activation function. Although most successful architectures follow this scheme, recent architectures have introduced additional components and subtle extensions that have led to new design principles. In the following, we give a brief overview of the most prominent architectures that have emerged over the past years in chronological order.

2.5.1 AlexNet

The AlexNet architecture (Krizhevsky et al., 2012) was the first work to show that DNNs are capable of improving performance over conventional hand-crafted computer vision techniques, achieving 16.4% Top-5 error on the ILSVRC12 challenge. This is an improvement of approximately 10% absolute error over the second-best approach in the challenge, which relied on well-established computer vision techniques. This highly influential work essentially ushered in the era of DNNs, which have since spread to virtually every scientific field and improved on well-established methods in the respective fields.

The architecture consists of eight layers: five convolutional layers followed by three fully connected layers. AlexNet was designed to optimally utilize the available hardware at that time rather than following some clear design principle. This involves the choice of heterogeneous window sizes and seemingly arbitrary numbers of channels per layer. Furthermore, convolutions are performed in two parallel paths to facilitate the training on two GPUs.

2.5.2 VGGNet

The VGGNet architecture (Simonyan and Zisserman, 2015) took second place at the ILSVRC14 challenge with 7.3% Top-5 error. Compared to AlexNet, its structure is more uniform and, with up to 19 layers, much deeper. The design of VGGNet is guided by two main principles. (i) VGGNet uses mostly $3 \times 3$ convolutions and it increases the receptive field by stacking several of them. (ii) After downscaling the spatial dimension with max-pooling, the number of channels should be doubled to avoid information loss. From a hardware perspective, VGGNet is often preferred over other architectures due to its uniform architecture.

2.5.3 InceptionNet

InceptionNet (or, equivalently, GoogLeNet) (Szegedy et al., 2015) won the ILSVRC14 challenge with 6.7% Top-5 error with an even deeper architecture consisting of 22 layers. The main feature of this architecture is the inception module which combines the outputs of $1 \times 1$, $3 \times 3$, and $5 \times 5$ convolutions, respectively, by stacking them. To reduce the computational burden, InceptionNet performs $1 \times 1$ convolutions proposed in (Lin et al., 2014a) to reduce the number of channels immediately before the larger $3 \times 3$ and $5 \times 5$ convolutions, respectively.

2.5.4 ResNet

Motivated by the observation that adding more layers to very deep conventional CNN architectures does not necessarily reduce the training error, residual networks (ResNets) introduced by He et al. (2016) follow a rather different principle. The key idea is that every layer computes a residual that is added to the layer’s input. This is often graphically depicted as a residual path with an attached skip connection.

The authors hypothesize that identity mappings play an important role and that it is easier to model them in ResNets by simply setting all the weights of the residual path to zero instead of simulating an identity mapping by adapting the weights of several consecutive layers in an intertwined way. In any case, the skip connections reduce the vanishing gradient problem during training and enable extremely deep architectures of up to 152 layers on ImageNet and even up to 1000 layers on CIFAR-10. ResNet won the ILSVRC15 challenge with 3.6% Top-5 error.

2.5.5 DenseNet

Inspired by ResNets whose skip connections have been shown to reduce the vanishing gradient problem, densely connected CNNs (DenseNets) introduced by Huang et al. (2017) drive this idea even further by connecting each layer to all previous layers. DenseNets are conceptually very similar to ResNets: instead of adding the output of a layer to its input, DenseNets stack the output and the input of each layer. Since this stacking necessarily increases the number of feature maps with each layer, the number of new feature maps computed by each layer is typically small. Furthermore, it is proposed to use compression layers after downscaling the spatial dimension with pooling, i.e., a $1 \times 1$ convolution is used to reduce the number of feature maps.

Compared to ResNets, DenseNets achieve similar performance, allow for even deeper architectures, and they are more parameter and computation efficient, respectively. However, the DenseNet architecture is highly non-uniform which complicates the hardware mapping and ultimately slows down training.

2.6 The Straight-Through Gradient Estimator

Many recently developed methods for resource-efficiency in DNNs incorporate components in the computation graph of the loss function that are non-differentiable or whose gradient is zero almost everywhere, such as piecewise constant quantizers. These components prevent the use of conventional gradient-based optimization as described in Section 2.2.

The straight-through gradient estimator (STE) is a simple but effective way to approximate the gradient of such components by simply replacing their gradient with a non-zero value. Let $f(w)$ be some non-differentiable operation within the computation graph of the loss $\mathcal{L}$ such that the partial derivative $\partial \mathcal{L} / \partial w$ is not defined. The STE then approximates the gradient by

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial f} \frac{\partial f}{\partial w} \approx \frac{\partial \mathcal{L}}{\partial f} \frac{\partial \tilde{f}}{\partial w}, \qquad (8)$$

where $\tilde{f}$ is an arbitrary differentiable function with a similar functional shape as $f$. For instance, in case of the sign activation function $f(w) = \mathrm{sign}(w)$, whose derivative is zero almost everywhere, one could select $\tilde{f}(w) = \tanh(w)$. Another common choice is the identity function $\tilde{f}(w) = w$ whose derivative is $\tilde{f}'(w) = 1$, which simply passes the gradient on to higher components in the computation graph during backpropagation. Figure 2 illustrates the STE applied to a simplified DNN layer.

Figure 2: A simplified building block of a DNN using the straight-through gradient estimator (STE). $Q$ denotes some arbitrary piecewise constant quantization function and id denotes the identity function which simply passes the gradient on during backpropagation. In the forward pass, the solid red line is followed which passes the two piecewise constant functions $Q$ and sign whose gradient is zero almost everywhere (red boxes). During backpropagation, the dashed green line is followed which avoids these piecewise constant functions and instead only passes differentiable functions (green boxes), in particular the functions id and tanh whose shapes are similar to $Q$ and sign but whose gradient is non-zero. This allows us to obtain an approximate non-zero gradient for the real-valued parameters $W$ (blue circle) which are subsequently updated with SGD.
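The following sketch applies the STE to a single real-valued weight that is binarized with the sign function in the forward pass. The scalar model, the squared error loss, and the function names are toy assumptions; the point is only to show that the surrogate derivative yields a usable update where the true derivative would be zero.

```python
import math

def sign(w):
    """Sign function mapping to {-1, +1}, with sign(0) = +1."""
    return 1.0 if w >= 0 else -1.0

def ste_gradient(w, x, t, surrogate="identity"):
    """Approximate d/dw of (sign(w) * x - t)^2 using the STE."""
    q = sign(w)                    # forward pass: piecewise constant quantizer
    y = q * x                      # output computed with the quantized weight
    dloss_dq = 2.0 * (y - t) * x   # exact gradient with respect to q
    if surrogate == "identity":
        dq_dw = 1.0                         # derivative of the identity
    else:
        dq_dw = 1.0 - math.tanh(w) ** 2     # derivative of tanh
    return dloss_dq * dq_dw

# The real-valued weight starts positive although the target requires sign(w) = -1.
w = 0.3
for _ in range(20):
    w -= 0.1 * ste_gradient(w, x=1.0, t=-1.0)
```

After the first update the real-valued weight becomes negative, so the quantized weight flips to the value that fits the target, and all subsequent gradients are zero.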

2.7 Bayesian Neural Networks

Since there exist several works for resource-efficient DNNs that build on the framework of Bayesian neural networks, we briefly introduce the basic principles here. Given a prior distribution $p(W)$ on the weights and a likelihood $p(\mathcal{D} \mid W)$ defined by the softmax output of a DNN as

$$p(\mathcal{D} \mid W) = \prod_{n=1}^{N} p(t_n \mid x_n, W), \qquad (9)$$

where $p(t_n \mid x_n, W)$ is the softmax output of the DNN for the target class $t_n$ given the input $x_n$, we can use Bayes’ rule to infer a posterior distribution over the weights, i.e.,

$$p(W \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid W) \, p(W)}{p(\mathcal{D})} \propto p(\mathcal{D} \mid W) \, p(W). \qquad (10)$$

From a Bayesian perspective it is desired to compute expected predictions with respect to the posterior distribution $p(W \mid \mathcal{D})$, and not just to reduce the entire distribution to a single point estimate. However, due to the highly non-linear nature of DNNs, most exact inference scenarios involving the full posterior are typically intractable and there exists a range of approximation techniques for these tasks, such as variational inference (Hinton and van Camp, 1993; Graves, 2011; Blundell et al., 2015) and sampling based approaches (Neal, 1992).

Interestingly, training DNNs can often be seen as a very rough Bayesian approximation where we only seek weights $W$ that maximize the posterior $p(W \mid \mathcal{D})$, which is also known as maximum a-posteriori estimation (MAP). In particular, in a typical loss $\mathcal{L}$ as in (5) the data term originates from the negative logarithm of the likelihood $p(\mathcal{D} \mid W)$ whereas the regularizer originates from the negative logarithm of the prior $p(W)$.
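The correspondence between MAP estimation and a regularized loss can be written out in a few lines. The derivation below is a sketch that assumes, for illustration only, a zero-mean Gaussian prior with variance $\sigma^2$ (a symbol introduced here), and it writes the weighted regularizer of the loss as $\lambda \, r(W)$.

```latex
\begin{aligned}
W_{\mathrm{MAP}}
  &= \arg\max_{W} \; p(W \mid \mathcal{D})
   = \arg\max_{W} \; \log p(\mathcal{D} \mid W) + \log p(W) \\
  &= \arg\min_{W} \;
     \underbrace{-\sum_{n=1}^{N} \log p(t_n \mid x_n, W)}_{\text{cross-entropy data term}}
     \;\underbrace{-\,\log p(W)}_{\text{regularizer}}, \\[4pt]
-\log \mathcal{N}(W \mid 0, \sigma^2 I)
  &= \frac{1}{2\sigma^2} \lVert W \rVert_2^2 + \text{const}
  \quad\Longrightarrow\quad
  \lambda \, r(W) = \frac{1}{2\sigma^2} \lVert W \rVert_2^2 .
\end{aligned}
```

Under this assumption, a broader prior (larger $\sigma^2$) corresponds to weaker regularization. A Laplace prior would analogously give rise to an $\ell_1$ regularizer, which is one way in which a sparsity-inducing prior enters the loss.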

A better Bayesian approximation is obtained with variational inference where the aim is to find a variational distribution $q(W \mid \nu)$ governed by distribution parameters $\nu$ that is as close as possible to the posterior $p(W \mid \mathcal{D})$ but still simple enough to allow for efficient inference, e.g., for computing expected predictions by sampling from $q(W \mid \nu)$. This is typically achieved by the so-called mean-field assumption, i.e., by assuming that the weights are independent such that $q(W \mid \nu)$ factorizes into a product of factors $q(w \mid \nu_w)$ for each weight $w \in W$. The most prominent approach to obtain the variational distribution $q(W \mid \nu)$ is by minimizing the KL-divergence $\mathrm{KL}(q(W \mid \nu) \,\|\, p(W \mid \mathcal{D}))$ using gradient based optimization.

The Bayesian approach is appealing as distributions over the parameters directly translate into predictive distributions. In contrast to ordinary DNNs that only provide a point estimate prediction, Bayesian neural networks offer predictive uncertainties which are useful to determine how certain the DNN is about its own prediction. In addition, the Bayesian framework has several other useful properties that can be exploited to obtain resource-efficient DNNs. For instance, the prior allows us to incorporate information about properties, such as sparsity, that we expect to be present in the DNN. In Section 3.2.3, we review pruning approaches based on the Bayesian paradigm, and in Section 3.1.3, we review weight quantization approaches based on the Bayesian paradigm.

3 Resource-efficiency in Deep Neural Networks

In this section, we provide a comprehensive overview of methods that enhance the efficiency of DNNs regarding memory footprint, computation time, and energy requirements. We have identified three different major approaches that aim to reduce the computational complexity of DNNs, i.e., (i) weight and activation quantization, (ii) network pruning, and (iii) structural efficiency. These categories are not mutually exclusive, and we present individual methods in the category where their contribution is most significant.

3.1 Quantized Neural Networks

Quantization in DNNs is concerned with reducing the number of bits used for the representation of the weights and the activations, respectively. The reduction in memory requirements is obvious: using fewer bits for the weights results in less memory overhead to store the corresponding model, and using fewer bits for the activations results in less memory overhead when computing predictions. Furthermore, representations using fewer bits often facilitate faster computation. For instance, when quantization is driven to the extreme with binary weights $w \in \{-1, 1\}$ and binary activations $x \in \{-1, 1\}$, floating-point or fixed-point multiplications are replaced by hardware-friendly logical XNOR and bitcount operations. In this way, a sophisticated DNN is essentially reduced to a logical circuit.
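To see why binary weights and activations reduce a dot product to XNOR and bitcount operations, consider the following sketch. It assumes that the value +1 is encoded as bit 1 and the value -1 as bit 0, and it packs each vector into a Python integer; the helper names are illustrative.

```python
def pack(values):
    """Pack a list of +1/-1 values into an integer, bit i set iff values[i] == +1."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n."""
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask   # bit is 1 where both entries agree
    matches = bin(xnor).count("1")     # bitcount (population count)
    # Each agreement contributes +1 and each disagreement contributes -1.
    return 2 * matches - n

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
result = binary_dot(pack(a), pack(b), len(a))
reference = sum(u * v for u, v in zip(a, b))
```

Both `result` and `reference` evaluate to 0 for this example. On real hardware, a single XNOR and population count instruction processes an entire machine word of weights at once, which is where the speed-up originates.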

However, training such discrete-valued DNNs is delicate as they cannot be directly optimized using gradient based methods. (Due to finite precision, in fact any DNN is discrete-valued; we use this term here to highlight the extremely low number of values.) The challenge is to reduce the number of bits as much as possible while at the same time keeping the classification performance close to that of a well-tuned full-precision DNN. In the sequel, we provide a literature overview of approaches that train reduced-precision DNNs, and, in a broader sense, we also consider methods that use reduced precision computations during backpropagation to facilitate low-resource training.

3.1.1 Early Quantization Approaches

Approaches for reduced-precision computations date back at least to the early 1990s. Höhfeld and Fahlman (1992, 1992) rounded the weights during training to fixed-point format with different numbers of bits. They observed that training eventually stalls as small gradient updates are always rounded to zero. As a remedy, they proposed stochastic rounding, i.e., rounding a value to one of its two neighboring representable values, where the probability of choosing a neighbor decreases linearly with the distance to it. These quantized gradient updates are correct in expectation, do not cause training to stall, and yield good performance with substantially fewer bits than deterministic rounding. More recently, Gupta et al. (2015) have shown that stochastic rounding can also be applied for modern deep architectures, as demonstrated on a hardware prototype.
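Stochastic rounding can be sketched as follows, assuming a uniform grid with spacing `step` (a parameter name chosen for illustration). A value is rounded up with a probability equal to its relative position between the two neighboring grid points, which makes the result unbiased.

```python
import math
import random

def stochastic_round(x, step, rng):
    """Round x to a multiple of step such that the result equals x in expectation."""
    lower = math.floor(x / step) * step
    p_up = (x - lower) / step          # relative position within the grid cell
    return lower + step if rng.random() < p_up else lower

rng = random.Random(0)

# A small update of 0.1 on an integer grid: deterministic rounding always
# yields 0, whereas stochastic rounding yields 1 in roughly 10% of the cases.
samples = [stochastic_round(0.1, 1.0, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
deterministic = round(0.1)
```

The empirical mean is close to 0.1 while the deterministic result is 0, which mirrors the observation that small gradient updates vanish under deterministic rounding but survive on average under stochastic rounding.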

Lin et al. (2015) propose a method to reduce the number of multiplications required during training. At forward propagation, the weights are stochastically quantized to either binary weights $w \in \{-1, 1\}$ or ternary weights $w \in \{-1, 0, 1\}$ to remove the need for multiplications altogether. During backpropagation, inputs and hidden neurons are quantized to powers of two, reducing multiplications to cheaper bit-shift operations, and leaving only a negligible number of floating-point multiplications to be computed. However, the speed-up is limited to training since the full-precision weights are required at test time.

Courbariaux et al. (2015a) empirically studied the effect of different numeric formats — namely floating-point, fixed-point, and dynamic fixed-point — with varying bit widths on the performance of DNNs. Lin et al. (2016) consider fixed-point quantization of pre-trained full-precision DNNs. They formulate a convex optimization problem to minimize the total number of bits required to store the weights and the activations under the constraint that the total output signal-to-quantization noise ratio is larger than a certain prespecified value. A closed-form solution of the convex objective yields layer-specific bit widths.

3.1.2 Quantization-aware Training

Quantization operations, being piecewise constant functions with either undefined or zero gradients, are not applicable to gradient-based learning using backpropagation. In recent years, the STE (Bengio et al., 2013) (see Section 2.6) became the method of choice to compute an approximate gradient for training DNNs with weights that are represented using a very small number of bits. Such methods typically maintain a set of full-precision weights that are quantized during forward propagation. During backpropagation, the gradients are propagated through the quantization functions by assuming that their gradient equals one. In this way, the full-precision weights are updated using gradients computed at the quantized weights. At test time, the full-precision weights are abandoned and only the quantized reduced-precision weights are kept. We term this scheme quantization-aware training since quantization is an essential part during forward-propagation and it is intuitive to think of the real-valued weights becoming robust to quantization. In a similar manner, many methods employ the STE to approximate the quantization of activations.

In (Courbariaux et al., 2015b), binary weight DNNs are trained using the STE to get rid of expensive floating-point multiplications. They consider deterministic rounding using the sign function and stochastic rounding using probabilities determined by the hard-sigmoid function $\max(0, \min(1, (w + 1) / 2))$. During backpropagation, a set of auxiliary full-precision weights is updated based on the gradients of the quantized weights. Hubara et al. (2016) extended this work by also quantizing the activations to a single bit using the sign activation function. This reduces the computational burden dramatically as floating-point multiplications and additions are reduced to hardware-friendly logical XNOR and bitcount operations, respectively.

Li et al. (2016) trained ternary weights $w \in \{-a, 0, a\}$. Their quantizer sets weights whose magnitude is lower than a certain threshold $\delta$ to zero, while the remaining weights are set to $-a$ or $a$ according to their sign. Their approach determines $a$ and $\delta$ during forward propagation by approximately minimizing the squared quantization error of the real-valued weights. Zhu et al. (2017) extended this work to ternary weights $w \in \{-a, 0, b\}$ where $a > 0$ and $b > 0$ are trainable parameters subject to gradient updates. They propose to select $\delta$ based on the maximum full-precision weight magnitude in each layer, i.e., $\delta = t \cdot \max_w |w|$ with $t$ being a hyperparameter. These asymmetric weights considerably improve performance compared to symmetric weights as used in (Li et al., 2016).
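A symmetric ternary quantizer of this kind can be sketched as follows. The threshold rule used here, 0.7 times the mean absolute weight, is a commonly cited heuristic and should be read as an assumption of this sketch rather than as the exact procedure of the cited works; for a fixed threshold, the scale that minimizes the squared quantization error is the mean magnitude of the weights above the threshold.

```python
def ternarize(weights, threshold_factor=0.7):
    """Quantize weights to {-scale, 0, +scale}.

    threshold_factor is an assumed heuristic constant.
    Returns the quantized weights, the scale, and the threshold.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_factor * mean_abs
    large = [abs(w) for w in weights if abs(w) > threshold]
    # Mean magnitude of the surviving weights minimizes the squared error.
    scale = sum(large) / len(large) if large else 0.0
    quantized = [
        0.0 if abs(w) <= threshold else (scale if w > 0 else -scale)
        for w in weights
    ]
    return quantized, scale, threshold

weights = [0.9, -0.8, 0.05, -0.1, 1.0]
quantized, scale, threshold = ternarize(weights)
```

For these toy weights the two small entries are set to zero and the remaining three share a single magnitude of approximately 0.9, so each weight can be stored with two bits plus one scale per layer.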

Rastegari et al. (2016) approximate full-precision weight filters in CNNs as $\mathbf{W} \approx \alpha \mathbf{B}$ where $\alpha$ is a scalar and $\mathbf{B}$ is a binary weight matrix. This reduces the bulk of floating-point multiplications inside the convolutions to either additions or subtractions, and only requires a single multiplication per output neuron with the scalar $\alpha$. In a further step, the layer inputs are quantized in a similar way to perform the convolution with only efficient XNOR operations and bitcount operations, followed by two floating-point multiplications per output neuron. Again, the STE is used during backpropagation. Lin et al. (2017b) generalized the ideas of Rastegari et al. (2016) by approximating the full-precision weights with linear combinations of multiple binary weight filters for improved classification accuracy.

While most activation binarization methods use the sign function which can be seen as an approximation to the tanh function, Cai et al. (2017) proposed a half-wave Gaussian quantization that more closely resembles the predominant ReLU activation function.

Motivated by the fact that weights and activations typically exhibit a non-uniform distribution, Miyashita et al. (2016) proposed to quantize values to powers of two. Their representation allows getting rid of expensive multiplications, and they report higher robustness to quantization than linear rounding schemes using the same number of bits. Zhou et al. (2017) proposed incremental network quantization where the weights of a pre-trained DNN are first partitioned into two sets, one of which is quantized to either zero or powers of two while the weights in the other set are kept at full-precision and retrained to recover the potential loss in accuracy due to quantization. They iterate partitioning, quantization, and retraining until all weights are quantized.
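The following Python sketch illustrates rounding to zero or to a signed power of two in the log domain. The exponent range (`min_exp`, `max_exp`) and the function name `quantize_pow2` are assumptions chosen for illustration; the cited works use their own bit allocations and rounding rules.

```python
import math

def quantize_pow2(w, min_exp=-6, max_exp=0):
    """Round w to zero or to the nearest signed power of two in the log domain."""
    if w == 0.0:
        return 0.0
    e = round(math.log2(abs(w)))     # nearest integer exponent
    if e < min_exp:
        return 0.0                   # too small to represent: flush to zero
    e = min(e, max_exp)              # clip large magnitudes
    return math.copysign(2.0 ** e, w)

examples = [quantize_pow2(v) for v in (0.3, -0.7, 0.001, 3.0)]
```

Since every nonzero quantized value is a power of two, multiplying an integer activation by such a weight can be implemented as a bit shift.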

Jacob et al. (2018) proposed a quantization scheme that accurately approximates floating-point operations using only integer arithmetic to speed up computation. During training, their forward pass simulates the quantization step to keep the performance of the quantized DNN close to the performance when using single-precision. At test time, weights are represented as 8-bit integer values, reducing the memory footprint by a factor of four.
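A minimal Python sketch of 8-bit affine quantization is shown below, assuming the common mapping $r = S\,(q - Z)$ between a real value $r$ and an integer code $q$ with scale $S$ and zero point $Z$. The function names and the way the range is chosen are our own assumptions and do not reproduce the full integer-only inference pipeline.

```python
def choose_qparams(r_min, r_max, n_bits=8):
    """Pick scale and zero point so that [r_min, r_max] maps onto [0, 2^n - 1]."""
    q_max = 2 ** n_bits - 1
    r_min, r_max = min(r_min, 0.0), max(r_max, 0.0)  # keep 0.0 exactly representable
    scale = (r_max - r_min) / q_max
    if scale == 0.0:
        scale = 1.0                                  # degenerate all-zero range
    zero_point = int(round(-r_min / scale))
    return scale, zero_point

def quantize(r, scale, zero_point, n_bits=8):
    """Map a real value to its integer code, saturating at the range limits."""
    q = int(round(r / scale)) + zero_point
    return max(0, min(2 ** n_bits - 1, q))

def dequantize(q, scale, zero_point):
    """Recover the real value represented by the integer code q."""
    return scale * (q - zero_point)

scale, zero_point = choose_qparams(-1.0, 3.0)
code = quantize(1.234, scale, zero_point)
approx = dequantize(code, scale, zero_point)
```

Storing one such 8-bit code per weight instead of a 32-bit float is what yields the fourfold reduction in memory footprint.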

Liu et al. (2018) introduced Bi-real net, a ResNet-inspired architecture where the residual path is implemented with efficient binary convolutions while the shortcut path is kept real-valued to maintain the expressiveness of the DNN. The residual in each layer is computed by first transforming the input with the sign activation, followed by a binary convolution, and a final batch normalization step.

Instead of using a fixed quantizer, in LQ-net (Zhang et al., 2018a) the quantizer is adapted during training. The proposed quantizer is inspired by the representation of integers as linear combinations $\mathbf{v}^\top \mathbf{b}$ with $\mathbf{v} = (2^0, \ldots, 2^{K-1})^\top$ and $\mathbf{b} \in \{0, 1\}^K$. The key idea is to consider a quantizer that assigns values to the nearest value representable as such a linear combination and to treat $\mathbf{v}$ as trainable parameters. It is shown that such a quantizer is compatible with efficient bit-operations. The quantizer is optimized during forward propagation by minimizing the quantization error objective for $\mathbf{v}$ and the binary codes, alternately fixing the codes and minimizing over $\mathbf{v}$ and vice versa. It is proposed to use layer-wise quantizers for the activations, i.e., an individual quantizer for each layer, and channel-wise quantizers for the weights.

Relaxed Quantization (Louizos et al., 2019) introduces a stochastic differentiable soft-rounding scheme. By injecting additive noise to the deterministic weights before rounding, one can compute probabilities of the weights being rounded to specific values in a predefined discrete set. Subsequently, these probabilities are used to differentiably round the weights using the Gumbel softmax approximation (Jang et al., 2017). Since this soft-rounding scheme produces only values that are close to values from the discrete set but which are not exactly from this set, the authors also propose a hard variant using the STE.

Zhou et al. (2016) presented several quantization schemes for the weights and the activations that allow for flexible bit widths. Furthermore, they also propose a quantization scheme for backpropagation to facilitate low-resource training. In agreement with earlier work mentioned above, they note that stochastic quantization is essential for their approach. In (Wu et al., 2018), weights, activations, weight gradients, and activation gradients are subject to customized quantization schemes that allow for variable bit widths and facilitate integer arithmetic during training and testing. In contrast to (Zhou et al., 2016), the work in (Wu et al., 2018) accumulates weight changes to low-precision weights instead of full-precision weights.

While most work on quantization based approaches is empirical, some recent work gained more theoretical insights (Li et al., 2017; Anderson and Berg, 2018).

3.1.3 Bayesian Approaches for Quantization

In this section, we review some quantization approaches, most of which are closely related to the Bayesian variational inference framework (see Section 2.7).

The work of Achterhold et al. (2018) builds on the variational dropout based pruning approach of Louizos et al. (2017) (see Section 3.2.3). They introduce a mixture of log-uniforms prior whose mixtures are centered at predefined quantization values. Consequently, the approximate posterior also concentrates at these values such that weights can be safely quantized without requiring a fine-tuning procedure.

The following works in this section directly operate on discrete weight distributions, and, consequently, do not require a rounding procedure. Soudry et al. (2014) approximate the true posterior over discrete weights using expectation propagation (Minka, 2001) with closed-form online updates. Starting with an uninformative approximation $q(\mathbf{W})$, their approach combines the current approximation $q(\mathbf{W})$ (serving as the prior in (10)) with the likelihood for a single-sample data set to obtain a refined posterior. To obtain a closed-form refinement step, they propose several approximations.

Although deviating from the Bayesian variational inference framework as no similarity measure to the true posterior is optimized, the approach of Shayer et al. (2018) trains a distribution $q(\mathbf{W})$ over either binary weights $w \in \{-1, 1\}$ or ternary weights $w \in \{-1, 0, 1\}$. They propose to minimize an expected loss for the variational parameters of $q$ with gradient-based optimization using the local reparameterization trick (Kingma et al., 2015). After training has finished, the discrete weights are obtained by either sampling or taking a mode from $q$. Since their approach is limited to the ReLU activation function, Peters and Welling (2018) extended their work to the sign activation function. This involves several non-trivial changes since the sign activation, due to its zero derivative, requires that the local reparameterization trick be performed after the sign function, and, consequently, distributions need to be propagated through commonly used building blocks such as batch normalization and pooling operations. Roth et al. (2019) further extended these works beyond three distinct discrete weights, and they introduced some technical improvements.

Havasi et al. (2019) introduced a novel Bayesian compression technique that we present in this section although it is a coding technique rather than a quantization technique. In a nutshell, their approach first computes a variational distribution $q$ over real-valued weights using mean-field variational inference and then encodes a sample from $q$ in a smart way. They construct an approximation $\tilde{q}$ to $q$ by importance sampling using the prior $p$ as

$$\tilde{q}(\mathbf{w}) = \sum_{k=1}^{K} \frac{q(\mathbf{w}_k) / p(\mathbf{w}_k)}{\sum_{k'=1}^{K} q(\mathbf{w}_{k'}) / p(\mathbf{w}_{k'})} \, \delta_{\mathbf{w}_k}(\mathbf{w}), \qquad \mathbf{w}_k \sim p, \tag{11}$$

where $\delta_{\mathbf{w}_k}$ denotes a point mass located at $\mathbf{w}_k$. In the next step, a sample $\mathbf{w}_{k^*}$ from $\tilde{q}$ (or, equivalently, an approximate sample from $q$) is drawn which can be encoded by the corresponding number $k^*$ using $\log_2 K$ bits. Using the same random number generator initialized with the same seed as in (11), the weights can be recovered by sampling $K$ weights from the prior $p$ and selecting $\mathbf{w}_{k^*}$. Since the number of samples $K$ required to obtain a reasonable approximation to $q$ in (11) grows exponentially with the number of weights, this sampling based compression scheme is performed for smaller weight blocks such that each weight block can be encoded with $\log_2 K$ bits.
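The encode and decode steps can be sketched in Python for a single scalar weight as follows. We assume a zero-mean Gaussian prior and a Gaussian variational distribution; the function names, the seeds, and the number of candidates are hypothetical choices for illustration and do not reflect the block-wise procedure of Havasi et al. (2019).

```python
import math
import random

def gauss_logpdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))

def prior_samples(seed, num, p_sigma=1.0):
    """Candidates drawn from the prior with a shared pseudo-random generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, p_sigma) for _ in range(num)]

def encode(q_mu, q_sigma, seed, num, p_sigma=1.0):
    """Return the index of a candidate drawn according to the weights q/p."""
    cand = prior_samples(seed, num, p_sigma)
    log_w = [gauss_logpdf(w, q_mu, q_sigma) - gauss_logpdf(w, 0.0, p_sigma)
             for w in cand]
    m = max(log_w)
    weights = [math.exp(lw - m) for lw in log_w]   # unnormalized, numerically stable
    picker = random.Random(seed + 1)               # encoder-side randomness only
    return picker.choices(range(num), weights=weights)[0]

def decode(k, seed, num, p_sigma=1.0):
    """Regenerate the same candidates from the seed and select candidate k."""
    return prior_samples(seed, num, p_sigma)[k]

k = encode(q_mu=1.5, q_sigma=0.1, seed=0, num=1024)   # index costs log2(1024) = 10 bits
w_hat = decode(k, seed=0, num=1024)
```

Only the index `k` needs to be transmitted; the decoder reconstructs the weight from the shared seed.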

3.2 Network Pruning

Network pruning methods aim to achieve parameter sparsity by setting a substantial number of DNN weights to zero. Subsequently, the sparsity is exploited to enhance resource-efficiency of the DNN. On the one hand, there exist unstructured pruning approaches that set individual weights, regardless of their location in a weight tensor, to zero. Unstructured pruning approaches are typically less sensitive to accuracy degradation, but they require special sparse tensor data structures that in turn yield practical efficiency improvements only for very high sparsity. On the other hand, structured pruning methods aim to set whole weight structures to zero, e.g., by setting all weights of a matrix column to zero we would effectively prune an entire neuron. Conceptually, structured pruning is equivalent to removing tensor dimensions such that the reduced tensor remains compatible with highly optimized dense tensor operations.

In this section, we start with the unstructured case which includes many of the earlier approaches and continue with structured pruning that has been the focus of more recent works. Then we review approaches that relate to Bayesian principles before we discuss approaches that prune structures dynamically during forward propagation.

3.2.1 Unstructured Pruning

One of the earliest approaches to reduce the network size is the optimal brain damage algorithm of LeCun et al. (1989). Their main finding is that pruning based on weight magnitude is suboptimal, and they propose a pruning scheme based on the increase in loss function. Assuming a pre-trained network, a local second-order Taylor expansion with a diagonal Hessian approximation is employed that allows us to estimate the change in loss function caused by weight pruning without re-evaluating the costly network function. Removing parameters is alternated with re-training the pruned network. In this way, the model size can be reduced substantially without deteriorating its performance. Hassibi and Stork (1992) found the diagonal Hessian approximation to be too restrictive, and their optimal brain surgeon algorithm uses an approximated full covariance matrix instead. While their method, similarly to that of LeCun et al. (1989), prunes weights that cause the least increase in loss function, the remaining weights are simultaneously adapted to compensate for the negative effect of weight pruning. This bypasses the need to alternate several times between pruning and re-training the pruned network.
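Under the diagonal Hessian approximation, the estimated increase in loss from removing weight $w_k$ at a local minimum is $\tfrac{1}{2} h_{kk} w_k^2$. The following Python sketch ranks weights by this saliency; the function names and the toy numbers are our own, and the Hessian diagonal is assumed to be given.

```python
def obd_saliencies(weights, hessian_diag):
    """Estimated loss increase 0.5 * h_kk * w_k^2 for deleting each weight."""
    return [0.5 * h * w * w for w, h in zip(weights, hessian_diag)]

def prune_lowest(weights, hessian_diag, num_prune):
    """Set the num_prune weights with the smallest saliency to zero."""
    s = obd_saliencies(weights, hessian_diag)
    order = sorted(range(len(weights)), key=lambda k: s[k])
    pruned = list(weights)
    for k in order[:num_prune]:
        pruned[k] = 0.0
    return pruned

weights = [0.1, 1.0, -0.5]
hessian_diag = [100.0, 0.01, 1.0]        # toy curvature values (assumed)
pruned = prune_lowest(weights, hessian_diag, num_prune=1)
```

In this toy example the weight with the largest magnitude is removed first because the loss is nearly flat in its direction, which illustrates why magnitude alone can be a misleading criterion.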

However, it is not clear whether these approaches scale up to modern DNN architectures since computing the required (diagonal) Hessians is substantially more demanding (if not intractable) for millions of weights. Therefore, many of the more recently proposed techniques still resort to magnitude based pruning. Han et al. (2015) alternate between pruning connections below a certain magnitude threshold and re-training the pruned DNN. The results of this simple strategy are impressive, as the number of parameters in pruned DNNs is an order of magnitude smaller ($9\times$ for AlexNet and $13\times$ for VGG-16) than in the original networks. Hence, this work shows that DNNs are often heavily over-parametrized. In a follow-up paper, Han et al. (2016) proposed deep compression, which extends the work in (Han et al., 2015) by a parameter quantization and parameter sharing step, followed by Huffman coding to exploit the non-uniform weight distribution. This approach yields a reduction in memory footprint by a factor of 35–49 and, consequently, a reduction in energy consumption by a factor of 3–5.

Guo et al. (2016) discovered that irreversible pruning decisions limit the achievable sparsity and that it is useful to reincorporate weights pruned in an earlier stage. In addition to each dense weight matrix $\mathbf{W}$, they maintain a corresponding binary mask matrix $\mathbf{M}$ that determines whether a weight is currently pruned or not. In particular, the actual weights used during forward propagation are obtained as $\mathbf{W} \odot \mathbf{M}$ where $\odot$ denotes element-wise multiplication. Their method alternates between updating the weights based on gradient descent, and updating the weight masks by thresholding the real-valued weights according to

$$M_{ij}^{(t+1)} = \begin{cases} 0 & \text{if } |W_{ij}^{(t+1)}| < a, \\ M_{ij}^{(t)} & \text{if } a \le |W_{ij}^{(t+1)}| < b, \\ 1 & \text{if } |W_{ij}^{(t+1)}| \ge b, \end{cases} \tag{12}$$

where $a$ and $b$ are two thresholds and $t$ refers to the iteration number. Most importantly, weight updates are also applied to the weights that are currently pruned according to $\mathbf{M}$ using the STE, such that pruned weights can reappear in (12). This reduces the number of parameters of AlexNet by a factor of 17.7 without deteriorating performance.
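The mask update can be sketched in Python as follows, assuming a three-way rule with a lower threshold `a` and an upper threshold `b`: weights below `a` in magnitude are pruned, weights at or above `b` are (re)activated, and weights in between keep their previous mask entry. The function name and the toy values are hypothetical.

```python
def update_mask(weights, mask, a, b):
    """Two-threshold mask update with hysteresis between a and b."""
    new_mask = []
    for w, m in zip(weights, mask):
        if abs(w) < a:
            new_mask.append(0)       # prune
        elif abs(w) >= b:
            new_mask.append(1)       # keep, or splice a pruned weight back in
        else:
            new_mask.append(m)       # undecided: retain the previous decision
    return new_mask

weights = [0.05, 0.15, 0.3, 0.15]
mask = [1, 1, 0, 0]
new_mask = update_mask(weights, mask, a=0.1, b=0.2)
effective = [w * m for w, m in zip(weights, new_mask)]   # weights used in the forward pass
```

The third weight was pruned earlier but has grown beyond `b`, so it reappears, which is exactly the reversibility that a one-shot pruning decision lacks.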

3.2.2 Structured Pruning

In (Mariet and Sra, 2016), a determinantal point process (DPP) is used to find a group of neurons that are diverse and exhibit little redundancy. Conceptually, a DPP for a given ground set $\mathcal{S}$ defines a distribution over subsets $\mathcal{S}' \subseteq \mathcal{S}$ where subsets containing diverse elements have high probability. Their approach treats the set of $N$-dimensional vectors that individual neurons compute over the whole data set as $\mathcal{S}$, samples a diverse set of neurons $\mathcal{S}' \subseteq \mathcal{S}$ according to the DPP, and then prunes the other neurons $\mathcal{S} \setminus \mathcal{S}'$. To compensate for the negative effect of pruning, the outgoing weights of the remaining neurons after pruning are adapted so as to minimize the activation change of the next layer.

Wen et al. (2016) incorporated group lasso regularizers in the objective to obtain different kinds of sparsity in the course of training. They were able to remove filters, channels, and even entire layers in architectures containing shortcut connections. Liu et al. (2017) proposed to introduce an $\ell_1$ regularizer on the scale parameters $\gamma$ of batch normalization and to set $\gamma = 0$ by thresholding. Since each batch normalization parameter corresponds to a particular channel in the network, this effectively results in channel pruning with only minimal changes to existing training pipelines. In (Huang and Wang, 2018), the outputs of different structures are scaled with individual trainable scaling factors. By using a sparsity enforcing regularizer on these scaling factors, the outputs of the corresponding structures are driven to zero and can be pruned.

Rather than pruning based on small parameter values, ThiNet (Luo et al., 2017) is a data-driven approach that prunes channels having the least impact on the subsequent layer. When pruning channels in layer $l$, they propose to sample several activations at random spatial locations and random channels of the following layer, and to greedily prune channels whose removal results in the least increase of squared error over these randomly selected activations. After pruning, they adapt the remaining filters to minimize the squared reconstruction error by solving a least squares problem.

Louizos et al. (2018) propose to multiply weights with stochastic binary 0-1 gates associated with trainable probability parameters that effectively determine whether a weight should be pruned or not. They formulate an expected loss with respect to the distribution over the stochastic binary gates, and by incorporating an expected $\ell_0$-regularizer over the weights, the probability parameters associated with these gates are encouraged to be close to zero. To enable the use of the reparameterization trick, a continuous relaxation of the binary gates using a modified binary Gumbel softmax distribution is used (Jang et al., 2017). They show that their approach can be used for structured sparsity by associating the stochastic gates to entire structures such as channels. Li and Ji (2019) extended this work by using the recently proposed unbiased ARM gradient estimator (Yin and Zhou, 2019) instead of using the biased Gumbel softmax approximation.

3.2.3 Bayesian Pruning

In (Graves, 2011; Blundell et al., 2015), mean-field variational inference is employed to obtain a factorized Gaussian approximation $q(\mathbf{W}) = \prod_{w} \mathcal{N}(w \mid \mu_w, \sigma_w^2)$, i.e., in addition to a (mean) weight $\mu_w$ they also train a weight variance $\sigma_w^2$. After training, weights are pruned by thresholding the “signal-to-noise ratio” $|\mu_w| / \sigma_w$.


Molchanov et al. (2017) proposed a method based on variational dropout (Kingma et al., 2015) which interprets dropout as performing variational inference with specific prior and approximate posterior distributions. Within this framework, the otherwise fixed dropout rates of Gaussian dropout appear as free parameters that can be optimized to improve a variational lower bound. In (Molchanov et al., 2017), this freedom is exploited to optimize individual weight dropout rates such that weights can be safely pruned if their dropout rate is close to one. This idea has been extended in (Louizos et al., 2017) by using sparsity enforcing priors and assigning dropout rates to groups of weights that are all connected to the same structure, which in turn allows for structured pruning. Furthermore, they show how their approach can be used to determine an appropriate bit width for each weight by exploiting the well-known connection between Bayesian inference and the minimum description length (MDL) principle (Grünwald, 2007).

3.2.4 Dynamic Network Pruning

So far, we have presented methods that result in a fixed reduced architecture. In the following, we present methods that determine dynamically in the course of forward propagation which structures should be computed, or, equivalently, which structures should be pruned. The intuition behind this idea is to vary the time spent for computing predictions based on the difficulty of the given input samples.

Lin et al. (2017a) proposed to train, in addition to the DNN, a recurrent neural network (RNN) decision network which determines the channels to be computed using reinforcement learning. In each layer, the feature maps are compressed using global pooling and fed into the RNN which aggregates state information over the layers to compute its pruning decisions.

In (Dong et al., 2017), convolutional layers of a DNN are extended by a parallel low-cost convolution whose output after the ReLU function is used to scale the outputs of the potentially high-cost convolution. Due to the ReLU function, several outputs of the low-cost convolution will be exactly zero such that the computation of the corresponding output of the high-cost convolution can be omitted. For the low-cost convolution, they propose to use weight tensors and . However, practical speed-ups are only reported for the convolution where all channels at a given spatial location might get set to zero.

In a similar approach proposed by Gao et al. (2019), the spatial dimensions of a feature map are reduced by global average pooling to a vector which is linearly transformed using a single low-cost fully connected layer. To obtain a sparse vector of channel scaling factors, the transformed vector is fed into the ReLU function, followed by a $k$-winner-takes-all function that sets all entries of a vector to zero that are not among the $k$ largest entries in absolute value. By multiplying this sparse vector in a channel-wise manner to the output of a high-cost convolution, all but at most $k$ channels will be zero and need not be computed. The number $k$ of channels is derived from a predefined minimal pruning ratio hyperparameter.
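The gating computation described above can be sketched in Python as follows. The function names, the use of plain lists for the pooled feature vector and the fully connected layer, and the toy inputs are our own assumptions for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def k_winner_takes_all(v, k):
    """Zero all entries that are not among the k largest in absolute value."""
    order = sorted(range(len(v)), key=lambda i: abs(v[i]), reverse=True)
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]

def channel_gates(pooled, weight, bias, k):
    """Low-cost fully connected layer, ReLU, then k-winner-takes-all."""
    z = [sum(w * p for w, p in zip(row, pooled)) + b
         for row, b in zip(weight, bias)]
    return k_winner_takes_all(relu(z), k)

pooled = [0.2, -1.0, 1.5, 0.7]                      # globally average-pooled channels
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
gates = channel_gates(pooled, identity, [0.0] * 4, k=2)
```

Output channels whose gate is zero can be skipped entirely in the high-cost convolution, so the gate vector acts as an input-dependent channel mask.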

3.3 Structural Efficiency in DNNs

In this section, we review strategies that establish certain structural properties in DNNs to improve computational efficiency. Each of the proposed subcategories in this section follows rather different principles and the individual techniques might not be mutually exclusive.

3.3.1 Weight Sharing

Another technique to reduce the model size is weight sharing. In (Chen et al., 2015), a hashing function is used to randomly group network connections into “buckets”, where the connections in each bucket share the same weight value. This has the advantage that weight assignments need not be stored explicitly since they are given implicitly by the hashing function. The authors show a memory footprint reduction by a factor of 10 while keeping the predictive performance essentially unaffected.
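A minimal Python sketch of hashing-based weight sharing is given below. We use `zlib.crc32` as a stand-in hash function, which is an assumption made for portability and not the hash used by Chen et al. (2015); the function names are likewise hypothetical.

```python
import zlib

def bucket_index(i, j, num_buckets, seed=0):
    """Deterministically map connection (i, j) to one of the shared buckets."""
    key = f"{seed}:{i}:{j}".encode()
    return zlib.crc32(key) % num_buckets

def virtual_weight(i, j, buckets, seed=0):
    """Weight of connection (i, j), looked up on the fly; never stored per connection."""
    return buckets[bucket_index(i, j, len(buckets), seed)]

def hashed_matvec(x, num_out, buckets, seed=0):
    """Matrix-vector product with a virtual (num_out x len(x)) weight matrix."""
    return [sum(virtual_weight(i, j, buckets, seed) * xj for j, xj in enumerate(x))
            for i in range(num_out)]

buckets = [0.5, -1.0, 2.0]               # the only trainable parameters
y = hashed_matvec([1.0, 2.0, 3.0, 4.0], num_out=5, buckets=buckets)
```

Here a virtual 5×4 weight matrix with 20 connections is backed by only three stored values, and the assignment of connections to buckets costs no memory.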

Ullrich et al. (2017) extended the soft weight sharing approach proposed in (Nowlan and Hinton, 1992) to achieve both weight sharing and sparsity. The idea is to select a Gaussian mixture model prior over the weights and to train both the weights as well as the parameters of the mixture components. During training, the mixture components collapse to point measures and each weight gets attracted by a certain weight component. After training, weight sharing is obtained by assigning each weight to the mean of the component that best explains it, and weight pruning is obtained by assigning a relatively high mixture mass to a component with a fixed mean at zero.

Roth and Pernkopf (2018) introduced a Dirichlet process prior over the weight distribution to enforce weight sharing in order to reduce the memory footprint of an ensemble of DNNs. They propose a sampling based inference scheme by alternately sampling weight assignments using Gibbs sampling and sampling weights using hybrid Monte Carlo (Neal, 1992). By using the same weight assignments for multiple weight samples, the memory overhead for the weight assignments becomes negligible and the total memory footprint of an ensemble of DNNs is reduced.

3.3.2 Knowledge Distillation

Knowledge distillation (Hinton et al., 2015) is an indirect approach where first a large DNN (or an ensemble of DNNs) is trained, and subsequently soft-labels obtained from the softmax output of the large DNN are used as training data for a smaller DNN. The smaller DNNs achieve performances almost identical to that of the larger DNNs which is attributed to the valuable information contained in the soft-labels. Inspired by knowledge distillation, Korattikara et al. (2015) reduced a large ensemble of DNNs, used for obtaining Monte-Carlo estimates of a posterior predictive distribution, to a single DNN.
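The role of the soft-labels can be illustrated with the following Python sketch, which computes the cross-entropy between temperature-softened teacher and student distributions. The temperature value and the function names are assumptions for illustration; in practice this term is typically combined with the usual loss on the hard labels.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; larger temperatures give softer distributions."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student w.r.t. the teacher's soft-labels."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(pt * math.log(ps) for pt, ps in zip(p_teacher, p_student))

teacher = [5.0, 1.0, 0.0]
loss_match = distillation_loss(teacher, teacher)
loss_mismatch = distillation_loss([0.0, 1.0, 5.0], teacher)
```

Unlike a one-hot label, the softened teacher distribution also encodes how similar the incorrect classes are to the correct one.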

3.3.3 Special Matrix Structures

In this section we review approaches that aim at reducing the model size by employing efficient matrix representations. There exist several methods using low-rank decompositions which represent a large matrix (or a large tensor) using only a fraction of the parameters. In most cases, the implicitly represented matrix is never computed explicitly such that also a computational speed-up is achieved. Furthermore, there exist approaches using special matrices that are specified by only a few parameters and whose structure allows for extremely efficient matrix multiplications.

Denil et al. (2013) proposed a method that is motivated by training only a subset of the weights and predicting the values of the other weights from this subset. In particular, they represent weight matrices $\mathbf{W} \in \mathbb{R}^{m \times n}$ using a low-rank approximation $\mathbf{W} \approx \mathbf{U}\mathbf{V}$ with $\mathbf{U} \in \mathbb{R}^{m \times k}$, $\mathbf{V} \in \mathbb{R}^{k \times n}$, and $k < \min(m, n)$ to reduce the number of parameters. Instead of learning both factors $\mathbf{U}$ and $\mathbf{V}$, prior knowledge, such as smoothness of pixel intensities in an image, is incorporated to compute a fixed $\mathbf{U}$ using kernel-techniques or auto-encoders, and only the factor $\mathbf{V}$ is learned.

In (Novikov et al., 2015), the tensor train matrix format is employed to substantially reduce the number of parameters required to represent large weight matrices of fully connected layers. Their approach enables the training of very large fully connected layers with relatively few parameters, and they achieve improved performance compared to simple low-rank approximations.

Denton et al. (2014) propose specific low-rank approximations and clustering techniques for individual layers of pre-trained CNNs to both reduce memory-footprint and computational overhead. Their approach yields substantial improvements for both the computational bottleneck in the convolutional layers and the memory bottleneck in the fully connected layers. By fine-tuning after applying their approximations, the performance degradation is kept at a decent level. Jaderberg et al. (2014) propose two different methods to approximate pre-trained CNN filters as combinations of rank-1 basis filters to speed up computation. The rank-1 basis filters are obtained either by minimizing a reconstruction error of the original filters or by minimizing a reconstruction error of the outputs of the convolutional layers. Lebedev et al. (2015) approximate the convolution tensor using the canonical polyadic (CP) decomposition — a generalization of low-rank matrix decompositions to tensors — using non-linear least squares. Subsequently, the convolution using this low-rank approximation is performed by four consecutive convolutions, each with a smaller filter, to reduce the computation time substantially.

In (Cheng et al., 2015), the weight matrices of fully connected layers are restricted to circulant matrices $\mathbf{W} \in \mathbb{R}^{n \times n}$, which are fully specified by only $n$ parameters. While this dramatically reduces the memory footprint of fully connected layers, circulant matrices also facilitate faster computation as matrix-vector multiplication can be efficiently computed using the fast Fourier transform. In a similar vein, Yang et al. (2015) reparameterize matrices $\mathbf{W} \in \mathbb{R}^{n \times n}$ of fully connected layers using the Fastfood transform as $\mathbf{W} = \mathbf{S}\mathbf{H}\mathbf{G}\mathbf{\Pi}\mathbf{H}\mathbf{B}$, where $\mathbf{S}$, $\mathbf{G}$, and $\mathbf{B}$ are diagonal matrices, $\mathbf{\Pi}$ is a random permutation matrix, and $\mathbf{H}$ is the Walsh-Hadamard matrix. This reparameterization requires a number of parameters that is only linear in $n$, and, similarly to (Cheng et al., 2015), the fast Hadamard transform enables an efficient computation of matrix-vector products.
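The efficiency of the Hadamard-based parameterization rests on the fast Walsh-Hadamard transform, which multiplies a vector by the Walsh-Hadamard matrix in $\mathcal{O}(n \log n)$ operations without ever forming the matrix. A minimal, unnormalized Python sketch is given below; the function name `fwht` is our own.

```python
def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        # Butterfly step: combine blocks of size h into blocks of size 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    return y

spectrum = fwht([1, 2, 3, 4])
roundtrip = fwht(spectrum)           # applying the transform twice scales by n
```

Because the unnormalized Hadamard matrix satisfies $\mathbf{H}\mathbf{H} = n\mathbf{I}$, applying the transform twice returns the input scaled by $n$, which serves as a simple sanity check.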

3.3.4 Manual Architecture Design

Instead of modifying existing architectures to make them more efficient, manual architecture design is concerned with the development of new architectures that are inherently resource-efficient. Over the past years, several design principles and building blocks for DNN architectures have emerged that exhibit favorable computational properties and sometimes also improve performance.

CNN architectures are typically designed to have a transition from convolutional layers to fully connected layers. At this transition, activations at all spatial locations of each channel are typically used as individual input features for the following fully connected layer. Since the number of these features is typically large, there is a memory bottleneck for storing the parameters of the weight matrix especially in the first fully connected layer.

Lin et al. (2014a) introduced two concepts that have been widely adopted by subsequent works. The first one, global average pooling, largely solves the above-mentioned memory issue at the transition to fully connected layers. Global average pooling reduces the spatial dimensions of each channel into a single feature by averaging over all values within a channel. This reduces the number of features at the transition drastically, and, by having the same number of channels as there are classes, it can also be used to completely get rid of fully connected layers. Second, they used convolutions with $1 \times 1$ weight kernels that can essentially be seen as performing the operation of a fully connected layer over each spatial location across all channels.

These $1 \times 1$ convolutions have been adopted by several popular architectures (Szegedy et al., 2015; He et al., 2016; Huang et al., 2017) and, due to their favorable computational properties compared to convolutions that take a spatial neighborhood into account, later have also been exploited to improve computational efficiency. For instance, InceptionNet (Szegedy et al., 2015) proposed to split standard $3 \times 3$ convolutions into two cheaper convolutions: (i) a $1 \times 1$ convolution to reduce the number of channels such that (ii) a subsequent $3 \times 3$ convolution is performed faster. Similar ideas are used in SqueezeNet (Iandola et al., 2016) which uses $1 \times 1$ convolutions to reduce the number of channels which is subsequently input to a parallel $1 \times 1$ and $3 \times 3$ convolution, respectively. In addition, SqueezeNet uses the output of a global average pooling of per-class channels directly as input to the softmax in order to avoid fully connected layers which typically consume the most memory. Furthermore, by using deep compression (Han et al., 2016) (see Section 3.2.1), the memory footprint was reduced to less than 0.5 MB.

Szegedy et al. (2016) extended the InceptionNet architecture by spatially separable convolutions to reduce the computational complexity, i.e., a $k \times k$ convolution is split into a $k \times 1$ convolution followed by a $1 \times k$ convolution. In MobileNet (Howard et al., 2017) depthwise separable convolutions are used to split a standard convolution in a different way: (i) a depthwise convolution and (ii) a $1 \times 1$ convolution. The depthwise convolution applies a $k \times k$ filter to each channel separately without taking the other channels into account whereas the $1 \times 1$ convolution then aggregates information across channels. Although these two cheaper convolutions together are less expressive than a standard convolution, they can be used to trade off a small loss in prediction accuracy with a drastic reduction in computational overhead and memory requirements.
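The savings of a depthwise separable convolution can be quantified with a short parameter count. The following Python sketch compares a standard $k \times k$ convolution with its depthwise separable counterpart, ignoring biases; the function names and the example layer sizes are our own.

```python
def standard_conv_params(k, c_in, c_out):
    """One k x k filter over all input channels for every output channel."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """One k x k filter per input channel, followed by a 1 x 1 convolution."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128
standard = standard_conv_params(k, c_in, c_out)          # 73728
separable = depthwise_separable_params(k, c_in, c_out)   # 8768
ratio = separable / standard
```

The ratio equals $1/D + 1/k^2$ for $D$ output channels, so for $3 \times 3$ kernels and many output channels the separable variant needs roughly one ninth of the parameters; the same ratio holds for the number of multiply-accumulate operations per spatial location.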

Sandler et al. (2018) extended these ideas in their MobileNetV2 to an architecture with residual connections. A typical residual block with bottleneck structure in ResNet (He et al., 2016) contains a bottleneck $1 \times 1$ convolution to reduce the number of channels, followed by a $3 \times 3$ convolution, followed by another $1 \times 1$ convolution to restore the original number of channels again. Contrary to that building block, MobileNetV2 introduces an inverted bottleneck structure where the shortcut path contains the bottleneck and the residual path performs computations in a high-dimensional space. In particular, the residual path performs a $1 \times 1$ convolution to increase the number of channels, followed by a cheap depthwise $3 \times 3$ convolution, followed by another $1 \times 1$ convolution to reduce the number of channels again. They show that their inverted structure is more memory efficient since the shortcut path, which needs to be kept in memory during computation of the residual path, is considerably smaller. Furthermore, they show improved performance compared to the standard bottleneck structure.

While it was more of a technical detail than a contribution on its own, AlexNet (Krizhevsky et al., 2012) used grouped convolutions with two groups to facilitate model parallelism for training on two GPUs with relatively little memory capacity. Instead of computing a convolution using a weight tensor $\mathbf{W} \in \mathbb{R}^{k \times k \times C \times D}$, a grouped convolution splits the input into $G$ groups of channels that are independently processed using weight tensors $\mathbf{W}_g \in \mathbb{R}^{k \times k \times C/G \times D/G}$. The outputs of these $G$ convolutions are then stacked again such that the same number of input and output channels are maintained while considerably reducing the computational overhead and memory footprint.

Although this reduces the expressiveness of the convolutional layer since there is no interaction between the different groups, Xie et al. (2017) used grouped convolutions to enlarge the number of channels of a ResNet model which resulted in accuracy gains while keeping the computational complexity of the original ResNet model approximately the same. Zhang et al. (2018b) introduced a ResNet-inspired architecture called ShuffleNet which employs $1 \times 1$ grouped convolutions since $1 \times 1$ convolutions have been identified as computational bottlenecks in previous works, e.g., see (Howard et al., 2017). To combine the computational efficiency of grouped convolutions with the expressiveness of a full convolution, ShuffleNet incorporates channel shuffle operations after grouped convolutions to partly recover the interaction between different groups.
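A channel shuffle can be sketched in Python as a reshape, transpose, and flatten of the channel index. The function name `channel_shuffle` is our own, and channels are represented by a plain list of placeholders rather than feature maps.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so that each output group mixes all input groups.

    Equivalent to viewing the channel axis as (groups, per_group),
    transposing to (per_group, groups), and flattening again.
    """
    n = len(channels)
    assert n % groups == 0, "number of channels must be divisible by groups"
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]

shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
```

After the shuffle, neighboring channels originate from different groups, so a subsequent grouped convolution receives inputs from every group of the previous layer.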

3.3.5 Neural Architecture Search

Neural architecture search (NAS) is a recently emerging field concerned with the automatic discovery of good DNN architectures. This is achieved by designing a discrete space of possible architectures in which we subsequently search for an architecture optimizing some objective – typically the validation error. By incorporating a measure of resource-efficiency into this objective, this technique has recently attracted attention for the automatic discovery of resource-efficient architectures.

The task is very challenging: On the one hand, evaluating the validation error is time-consuming as it requires a full training run and typically only results in a noisy estimate thereof. On the other hand, the space of architectures is typically of exponential size in the number of layers. Hence, the space of architectures needs to be carefully designed in order to facilitate an efficient search within that space.
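To make the two difficulties concrete, the following sketch counts the architectures in a simple layer-wise search space and runs a plain random search with a resource-aware objective. Everything here is an assumption for illustration: the penalty form in `penalized_objective`, the toy stand-in `toy_evaluate` for a full training run, and the budget are hypothetical, and random search is used only as the simplest possible baseline, not as one of the methods reviewed here.

```python
import random


def search_space_size(choices_per_layer, num_layers):
    """Number of architectures when each layer picks one of several options."""
    return choices_per_layer ** num_layers


def penalized_objective(val_error, cost, budget, weight=1.0):
    """Validation error plus a penalty for exceeding the resource budget."""
    return val_error + weight * max(0.0, cost / budget - 1.0)


def toy_evaluate(arch):
    """Hypothetical stand-in for training and validating an architecture."""
    cost = sum(arch)                 # e.g., total number of channels
    val_error = 1.0 / (1.0 + cost)   # wider toy models are assumed to be better
    return penalized_objective(val_error, cost, budget=64)


def random_search(evaluate, choices, num_layers, trials, seed=0):
    """Sample architectures uniformly and keep the best-scoring one."""
    rng = random.Random(seed)
    best_arch, best_score = None, float("inf")
    for _ in range(trials):
        arch = tuple(rng.choice(choices) for _ in range(num_layers))
        score = evaluate(arch)
        if score < best_score:
            best_arch, best_score = arch, score
    return best_arch, best_score


print(search_space_size(5, 20))  # 95367431640625 architectures for 20 layers
print(random_search(toy_evaluate, choices=(8, 16, 32), num_layers=4, trials=50))
```

Even with only five options per layer, a 20-layer space is far too large to enumerate when every evaluation requires a training run, which is why the design of the search space and of the search strategy matters.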

The influential work of Zoph and Le (2017) introduced a scheme to encode DNN architectures of arbitrary depth as sequences of tokens which can be sampled from a controller RNN. This controller RNN is trained with reinforcement learning to generate well-performing architectures using the validation error on a held-out validation set as a reward signal. However, the training effort is enormous as more than 10,000 training runs are required to achieve state-of-the-art performance on CIFAR-10. This would be impractical on larger data sets like ImageNet, a problem that was partly addressed by subsequent NAS approaches, e.g., by Zoph et al. (2018). In this review, we want to highlight methods that also consider resource-efficiency constraints in the NAS.

In MnasNet (Tan et al., 2018), an RNN controller is trained by also considering the latency of the sampled DNN architecture measured on a real mobile device. They achieve performance improvements under predefined latency constraints on a specific device. To run MnasNet on the large-scale ImageNet and COCO data sets (Lin et al., 2014b), their algorithm is run on a proxy task by only training for five epochs, and only the most promising DNN architectures are trained using more epochs.

Instead of generating architectures using a controller, ProxylessNAS (Cai et al., 2019) uses a heavily over-parameterized model where each layer contains several parallel paths, each computing a different architectural block with its individual parameters. After each layer, probability parameters for selecting a particular architectural block are introduced which are trained via backpropagation using the STE. After training, the most probable path determines the selected architecture. To favor resource-efficient architectures, a latency model is built from measurements on a specific real device, and its predicted latencies are used as a differentiable regularizer in the cost function. In their experiments, they show that different target devices prefer individual DNN architectures to obtain a low latency.
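One way to read the latency regularizer is as an expected latency under the block-selection probabilities, which is differentiable with respect to the architecture parameters. The sketch below assumes a softmax parameterization and hypothetical per-block latency values; the names `expected_latency` and `regularized_loss` are illustrative and not taken from the ProxylessNAS implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of architecture parameters."""
    m = max(logits)
    exps = [math.exp(a - m) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_latency(arch_logits, block_latencies_ms):
    """Sum over layers of the probability-weighted latency of each block."""
    total = 0.0
    for logits, latencies in zip(arch_logits, block_latencies_ms):
        probs = softmax(logits)
        total += sum(p * lat for p, lat in zip(probs, latencies))
    return total


def regularized_loss(task_loss, arch_logits, block_latencies_ms, weight):
    """Task loss plus the weighted expected latency of the architecture."""
    return task_loss + weight * expected_latency(arch_logits, block_latencies_ms)


# Two layers with two candidate blocks each; latencies (ms) are hypothetical.
logits = [[0.0, 0.0], [2.0, -2.0]]
latencies = [[1.0, 3.0], [0.5, 4.0]]
print(expected_latency(logits, latencies))
print(regularized_loss(0.5, logits, latencies, weight=0.1))
```

Increasing `weight` shifts probability mass towards cheaper blocks, and since the latencies come from measurements on one device, a different device would in general lead to a different selected architecture.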

Instead of using a different path for different operations in each layer, single-path NAS (Stamoulis et al., 2019) combines all operations in a single shared weight superblock such that each operation uses a subset of this superblock. A weight-magnitude-based decision using trainable threshold parameters determines which operation should be performed, allowing for gradient-based training of both the weight parameters and the architecture. Again, the STE is employed to backpropagate through the threshold function.
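The weight-magnitude-based decision can be sketched for a single 5×5 superkernel whose inner 3×3 weights are shared by both candidate operations. This is a simplified illustration under our own assumptions: the squared-norm test on the outer ring and the name `effective_kernel` are hypothetical, and only the forward decision is shown (during training, the hard threshold would be backpropagated through with the STE).

```python
def effective_kernel(superkernel, threshold):
    """Select the operation encoded in a square superkernel.

    The inner weights are always used (3x3 convolution for a 5x5
    superkernel). The outer ring is only used if its squared norm exceeds
    the threshold, which turns the operation into the full convolution.
    """
    size = len(superkernel)

    def is_inner(r, c):
        return 1 <= r < size - 1 and 1 <= c < size - 1

    outer_norm_sq = sum(
        w * w
        for r, row in enumerate(superkernel)
        for c, w in enumerate(row)
        if not is_inner(r, c)
    )
    use_outer = outer_norm_sq > threshold
    kernel = [
        [w if (use_outer or is_inner(r, c)) else 0.0 for c, w in enumerate(row)]
        for r, row in enumerate(superkernel)
    ]
    return kernel, use_outer


superkernel = [[0.1] * 5 for _ in range(5)]  # hypothetical weights
_, uses_5x5 = effective_kernel(superkernel, threshold=0.5)
print(uses_5x5)  # False: the outer ring is too small, so a 3x3 kernel is used
```

Because both candidate operations read from the same weight tensor, no separate path per operation has to be stored or evaluated.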

Liu et al. (2019) have replicated several experiments of pruning approaches (see Section 3.2) and they observed that the typical workflow of training, pruning, and fine-tuning is often not necessary and only the discovered sparsity structure is important. In particular, they show for several pruning approaches that randomly initializing the weights after pruning and training the pruned structure from scratch results in most cases in a similar performance as performing fine-tuning after pruning. They conclude that network pruning can also be seen as a paradigm for architecture search.

Tan and Le (2019) recently proposed EfficientNet which employs NAS for finding a resource-efficient architecture as a key component. In the first step, they perform NAS to discover a small resource-efficient model which is much cheaper than searching for a large model directly. In the next step, the discovered model is enlarged by a principled compound scaling approach which simultaneously increases the number of layers, the number of channels, and the spatial resolution. Although this approach is not targeting resource-efficiency on its own, EfficientNet achieves state-of-the-art performance on ImageNet using a relatively small model.
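The compound scaling step can be written down in a few lines. The sketch below assumes that depth, width, and resolution are scaled by constants raised to a common exponent `phi`; the default constants are the values we recall from Tan and Le (2019) and should be treated as assumptions, and the base model dimensions are hypothetical.

```python
def compound_scale(base_depth, base_width, base_resolution, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
    """Scale layers, channels, and input resolution with one coefficient phi."""
    depth = round(base_depth * alpha ** phi)
    width = round(base_width * beta ** phi)
    resolution = round(base_resolution * gamma ** phi)
    return depth, width, resolution


def flops_factor(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Approximate growth in FLOPs: linear in depth, quadratic in width
    and resolution for convolutional layers."""
    return (alpha * beta ** 2 * gamma ** 2) ** phi


# Hypothetical base model: 10 layers, 32 channels, 224x224 inputs.
for phi in range(4):
    print(phi, compound_scale(10, 32, 224, phi), round(flops_factor(phi), 2))
```

With these constants, each increment of `phi` roughly doubles the computational cost, so a single coefficient can be used to trade accuracy against resources.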

4 Experimental Results

In this section, we provide experimental results for modern DNN architectures trained on well-known benchmark data sets. We focus our experiments on network quantization approaches as reviewed in Section 3.1, since they are among the earliest and most efficient approaches to enhance the computational efficiency of DNNs.

We first exemplify the trade-off between model performance, memory footprint, and latency on the CIFAR-10 classification task in Section 4.1. This example highlights that finding a suitable balance between these requirements remains challenging due to diverse hardware and implementation issues. Furthermore, we compare several quantization approaches discussed in this paper on the more challenging CIFAR-100 task in Section 4.2. This experiment shows the different bit requirements of weights and activations as well as the need for advanced quantization approaches. In Section 4.3, we finally present a real-world speech enhancement example, where hardware-efficient binarized DNNs have led to dramatic memory and latency reductions.

The main focus of this section is to showcase the difficulty of finding good trade-offs between resource-efficiency and predictive performance. As this paper is mainly dedicated to giving a comprehensive literature overview of the current state of the art, an extensive evaluation of the many presented methods in Section 3 would be infeasible and it is also not within the scope of this paper. We refer the reader to the individual papers, but we want to highlight the work of Liu et al. (2019) which provides an extensive evaluation of several recent pruning methods.

4.1 Prediction Accuracy, Memory Footprint and Latency

To exemplify the trade-off to be made between memory footprint, latency, and prediction accuracy, we implemented general matrix multiply (GEMM) with variable-length fixed-point representation on a mobile CPU (ARM Cortex A15), exploiting its NEON SIMD instructions. Using this implementation, we executed a ResNet consisting of 32 layers with custom quantization for the weights and the activations (Zhou et al., 2016), and compared these results with single-precision floating-point. We use the CIFAR-10 data set for object classification (Krizhevsky, 2009) which consists of 32×32-pixel RGB images containing the ten object classes airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The data set consists of 50,000 training images and 10,000 test images.

Figure 3: Improvement of reduced precision over single-precision floating-point on memory footprint and latency (green) and the respective test error of ResNet-32 on CIFAR-10 (blue).

Figure 3 reports the impact of reduced precision on runtime latency, memory requirements, and test classification error. As can be seen, reducing the bit width to 16, 8, or 4 bits does not improve runtimes. However, on reconfigurable or specialized hardware even 16, 8, and 4 bit representations can be beneficial. Since bit widths of 2 and 1 do not require bit width doubling, we obtain runtimes close to the theoretical linear speed-up. In terms of memory footprint, our implementation evidently reaches the theoretical linear improvement.
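The theoretical linear improvement in memory footprint follows directly from packing several low-precision values into each byte. A minimal sketch, assuming a hypothetical model with one million weights (the function name `weight_memory_bytes` is illustrative):

```python
def weight_memory_bytes(num_weights, bits):
    """Bytes needed to store num_weights values packed at `bits` bits each."""
    return (num_weights * bits + 7) // 8  # round up to whole bytes


num_weights = 1_000_000  # hypothetical parameter count
baseline = weight_memory_bytes(num_weights, 32)
for bits in (32, 16, 8, 4, 2, 1):
    size = weight_memory_bytes(num_weights, bits)
    print(f"{bits:2d} bit: {size:8d} bytes, {baseline / size:4.1f}x smaller")
```

Runtime does not follow the same rule automatically, since the achievable speed-up depends on whether the hardware can operate on the packed representation directly.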

While reducing the bit width of weights and activations to only 1 or 2 bits improves memory footprint and computation time substantially, these settings also show decreased classification performance. In this example, the sweet spot appears to be at 2 bits precision, but also the predictive performance for 1 bit precision might be acceptable depending on the application. This extreme setting is evidently beneficial for highly constrained scenarios and it is easily exploited on today’s hardware, as shown in Section 4.3.

4.2 Comparison of Different Quantization Approaches

In the next experiment, we compare the performance of several quantization approaches. We use a DenseNet architecture (Huang et al., 2017) consisting of 100 layers with bottleneck and compression layers, i.e., a DenseNet-BC-100. We select the growth rate parameter . We perform our experiments on the CIFAR-100 data set which is similar to the CIFAR-10 data set except that it contains 100 object classes, i.e., the image size and the sizes of the training and the test set, respectively, are equal.

We selected some of the most popular quantization approaches (see Section 3.1) for the comparison — namely binary weight networks (BWN) (Courbariaux et al., 2015b), binarized neural networks (BNN) (Hubara et al., 2016), DoReFa-Net (Zhou et al., 2016), trained ternary quantization (TTQ) (Zhu et al., 2017), and LQ-Net (Zhang et al., 2018a). For this experiment, we quantize the DNNs in the three modes (i) weight-only, (ii) activation-only, and (iii) combined weight and activation quantization, respectively. However, note that some quantization approaches are designed for a particular mode, e.g., BWN and TTQ only consider weight quantization whereas BNN only considers combined weight and activation quantization.

Figure 4: Comparison of several popular quantization approaches (see Section 3.1) using the DenseNet-BC-100 architecture trained on the CIFAR-100 data set. The horizontal red line shows the error of the real-valued baseline. Quantization is performed using different bit widths in the three modes activation-only (blue), weight-only (green), and combined weight and activation quantization (purple), respectively.

Figure 4 reports the test error for different bit widths of the selected quantization approaches. The horizontal red line shows the test error of the real-valued baseline DenseNet-BC-100. For combined weight and activation quantization we use the same bit widths for the weights and the activations, respectively.

As expected, the test error decreases gradually with increasing bit widths for all quantization modes and for all quantization approaches. Furthermore, the results indicate that prediction performance is more sensitive to activation quantization than to weight quantization, which is in line with the results reported by many works reviewed in Section 3.1.

The more advanced LQ-Net approach clearly outperforms the rather simple linear quantization of DoReFa-Net and the specialized binary and ternary approaches. However, this performance improvement comes at the cost of longer training times. For instance, the training time per iteration increases for DoReFa-Net by a factor of 1.5 compared to a factor of up to 4.6 (depending on the bit width) for LQ-Net.
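For reference, a uniform (linear) quantizer of the kind underlying DoReFa-Net can be sketched as follows. This is only the rounding step on values in [0, 1], written under our own naming (`quantize_unit`); it omits the weight transformation and the STE-based training of the full method, and it is not the learned quantizer of LQ-Net.

```python
import math


def quantize_unit(x, bits):
    """Uniformly quantize x in [0, 1] to one of 2**bits equally spaced levels."""
    levels = 2 ** bits - 1
    x = min(1.0, max(0.0, x))                   # clip to the valid range
    return math.floor(x * levels + 0.5) / levels  # round to the nearest level


for bits in (1, 2, 4):
    grid = sorted({quantize_unit(i / 100, bits) for i in range(101)})
    print(bits, len(grid), [round(v, 3) for v in grid[:4]])
```

The levels are fixed and equally spaced, which keeps the quantizer cheap to train, whereas approaches that learn the quantization levels add per-iteration work in exchange for lower error.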

4.3 A Real-World Example: Speech Mask Estimation using Reduced-Precision DNNs

We provide a complete example employing hardware-efficient binarized DNNs applied to acoustic beamforming, an important component of various speech enhancement systems. A particularly successful approach employs DNNs to estimate a speech mask, i.e., a speech presence probability for each time frame and frequency bin. This speech mask is used to determine the power spectral density (PSD) matrices of the multi-channel speech and noise signals, which are subsequently used to obtain a beamforming filter such as the minimum variance distortionless response (MVDR) beamformer or the generalized eigenvector (GEV) beamformer (Warsitz and Haeb-Umbach, 2007; Warsitz et al., 2008; Heymann et al., 2016, 2015; Erdogan et al., 2016; Pfeifenberger et al., 2019). An overview of a multi-channel speech enhancement setup is shown in Figure 5.
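How a speech mask turns into PSD matrices can be sketched for a single frequency bin. The estimator below, a mask-weighted average of outer products of the multi-channel observations, is one common variant and an assumption on our part; the normalization and the name `masked_psd` are illustrative, and the subsequent beamformer computation is not shown.

```python
def masked_psd(frames, mask):
    """Mask-weighted PSD matrix estimate for one frequency bin.

    frames: list over time frames, each a list of complex STFT values
            (one per microphone).
    mask:   list of weights in [0, 1], one per time frame.
    Returns sum_t mask[t] * y_t y_t^H / sum_t mask[t] as nested lists.
    """
    num_mics = len(frames[0])
    total = sum(mask)
    if total == 0:
        raise ValueError("mask must have positive total weight")
    psd = [[0j] * num_mics for _ in range(num_mics)]
    for y, m in zip(frames, mask):
        for i in range(num_mics):
            for j in range(num_mics):
                psd[i][j] += m * y[i] * y[j].conjugate()
    return [[value / total for value in row] for row in psd]


# Two microphones, two time frames; values are hypothetical.
frames = [[1 + 0j, 1j], [2 + 0j, 0j]]
speech_mask = [1.0, 0.0]
speech_psd = masked_psd(frames, speech_mask)
noise_psd = masked_psd(frames, [1.0 - m for m in speech_mask])
print(speech_psd)
print(noise_psd)
```

The noise PSD matrix is obtained with the complementary mask, and both matrices are Hermitian by construction, which is what the MVDR and GEV beamformers rely on.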

Figure 5: System overview, showing the microphone signals and the beamformer+postfilter output in the frequency domain.

In this experiment, we compare single-precision DNNs and DNNs with binary quantization for the weights and the activations (BNNs), respectively, for the estimation of the speech mask. BNNs were trained using the STE (Hubara et al., 2016). For both architectures, the dominant eigenvector of the noisy speech PSD matrix (Pfeifenberger et al., 2019) is used as the feature vector. For the BNN, this feature vector is quantized to 8-bit integer values. As the output layer, a linear activation function is used, which reduces to counting the binary output neurons, followed by a normalization to yield the speech presence probability mask. Further details of the experimental setting can be found in (Zöhrer et al., 2018).

4.3.1 Data and Experimental Setup

For evaluation we used the CHiME corpus (Barker et al., 2015) which provides 2-channel and 6-channel recordings of a close-talking speaker corrupted by four different types of ambient noise. Ground truth utterances, i.e., the separated speech and noise signals, are available for all recordings, such that the ground truth speech mask can be computed for every time frame and frequency bin. In the test phase, the DNN is used to predict the speech mask for each utterance, which is subsequently used to estimate the corresponding beamformer. We evaluated three different DNNs, i.e., a single-precision 3-layer DNN with 513 neurons per layer, and two BNNs with 513 and 1024 neurons per layer, respectively. The DNNs were trained using ADAM (Kingma and Ba, 2015) with default parameters and dropout.

4.3.2 Speech Mask Accuracy

Figure 6: Speech presence probability mask: (a) ground truth mask, (b) prediction using DNNs with 513 neurons/layer, and (c) prediction using BNNs with 1024 neurons/layer.

Figure 6 shows the ground truth and predicted speech masks of the DNN and BNNs for an example utterance (F01_22HC010W_BUS). We see that both methods yield very similar results and are in good agreement with the ground truth. Table 1 reports the prediction error in [%]. Although single-precision DNNs achieve the best test error, they do so only by a small margin. Doubling the network size of BNNs slightly improved the test error for the case of 6 channels.

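| model | neurons / layer | channels | train | valid | test |
|-------|-----------------|----------|-------|-------|------|
| DNN   | 513             | 2ch      | 5.8   | 6.2   | 7.7  |
| BNN   | 513             | 2ch      | 6.2   | 6.2   | 7.9  |
| BNN   | 1024            | 2ch      | 6.2   | 6.6   | 7.9  |
| DNN   | 513             | 6ch      | 4.5   | 3.9   | 4.0  |
| BNN   | 513             | 6ch      | 4.7   | 4.1   | 4.4  |
| BNN   | 1024            | 6ch      | 4.9   | 4.2   | 4.1  |

Table 1: Mask prediction error in [%] for DNNs with 513 neurons/layer and BNNs with 513 or 1024 neurons/layer.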

4.3.3 Computation Savings for BNNs

In order to show that the advantages of binary computation translate to other general-purpose processors, we implemented matrix-multiplication operators for NVIDIA GPUs and ARM CPUs. Classification in BNNs can be implemented very efficiently as 1-bit scalar products, i.e., the multiplication of two binary vectors $\mathbf{x}, \mathbf{y} \in \{-1, +1\}^N$ (with $+1$ stored as a set bit) reduces to a bit-wise XNOR operation, followed by counting the number of set bits with popc as

$$\mathbf{x}^\top \mathbf{y} = 2\,\mathrm{popc}\big(\mathrm{xnor}(\mathbf{x}, \mathbf{y})\big) - N. \qquad (13)$$

We use the matrix-multiplication algorithms of the MAGMA and Eigen libraries and replace floating-point multiplications by XNOR operations as in (13). Our CPU implementation uses NEON vectorization in order to fully exploit SIMD instructions on ARM processors. We report execution time of GPUs and ARM CPUs in Table 2. As can be seen, binary arithmetic offers considerable speed-ups over single-precision with manageable implementation effort. This also affects energy consumption since binary values require fewer off-chip accesses and operations. Performance results of x86 architectures are not reported because neither SSE nor AVX ISA extensions support vectorized popc.
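The XNOR and popc formulation of the binary scalar product can be checked with a few lines of plain Python. This is a minimal sketch and not the MAGMA, Eigen, or NEON implementation: it assumes that +1 is stored as a set bit and -1 as a cleared bit, and the helper names `pack_bits` and `binary_dot` are hypothetical.

```python
import random


def pack_bits(values):
    """Pack a sequence of +1/-1 values into an int (bit i set iff values[i] == +1)."""
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("values must be +1 or -1")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Scalar product of two packed +1/-1 vectors of length n."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask        # XNOR restricted to the n valid bits
    return 2 * bin(agree).count("1") - n     # agreements minus disagreements


rng = random.Random(0)
x = [rng.choice((-1, 1)) for _ in range(64)]
y = [rng.choice((-1, 1)) for _ in range(64)]
reference = sum(a * b for a, b in zip(x, y))
print(binary_dot(pack_bits(x), pack_bits(y), 64) == reference)  # True
```

Each pair of machine words replaces as many multiply-accumulate operations as the word has bits, which is the source of the speed-ups when the processor offers a fast population count.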

| arch | matrix size | time float32 [ms] | time binary [ms] | speed-up |
|------|-------------|-------------------|------------------|----------|
| GPU  | 256         | 0.14              | 0.05             | 2.8      |
| GPU  | 513         | 0.34              | 0.06             | 5.7      |
| GPU  | 1024        | 1.71              | 0.16             | 10.7     |
| GPU  | 2048        | 12.87             | 1.01             | 12.7     |
| ARM  | 256         | 3.65              | 0.42             | 8.7      |
| ARM  | 513         | 16.73             | 1.43             | 11.7     |
| ARM  | 1024        | 108.94            | 8.13             | 13.4     |
| ARM  | 2048        | 771.33            | 58.81            | 13.1     |

Table 2: Performance metrics for matrix-matrix multiplications on an NVIDIA Tesla K80 and ARM Cortex-A57, respectively.

While improvements of memory footprint and computation time are independent of the underlying tasks, the prediction accuracy highly depends on the complexity of the data set and the used neural network. On the one hand, simple data sets, such as the previously evaluated speech mask example or MNIST, allow for aggressive quantization without severely affecting the prediction performance. On the other hand, as reported by several works reviewed in Section 3.1, binary or ternary quantization results in severe accuracy degradation on more complex data sets such as CIFAR-10/100 and ImageNet.

5 Conclusion

We presented an overview of the vast literature of the highly active research area concerned with resource-efficiency of DNNs. We have identified three major directions of research, namely (i) network quantization, (ii) network pruning, and (iii) approaches that target efficiency at the structural level. Many of the presented works are orthogonal and can be used in conjunction to potentially further improve the results reported in the respective papers.

We have discovered several patterns in the individual strategies for enhancing resource-efficiency. For quantization approaches, a common pattern in the most successful approaches is to combine real-valued representations, that help in maintaining the expressiveness of DNNs, with quantization to enhance the computationally intensive operations. For pruning methods, we observed that the trend is moving towards structured pruning approaches that obtain smaller models whose data structures are compatible with highly optimized dense tensor operations. On the structural level of DNNs, a lot of progress has been made in the development of specific building blocks that maintain a high expressiveness of the DNN while at the same time reducing the computational overhead substantially. The newly emerging neural architecture search (NAS) approaches are promising candidates to automate the design of application-specific architectures with negligible user interaction. However, it appears unlikely that current NAS approaches will discover new fundamental design principles as the resulting architectures highly depend on a-priori knowledge encoded in the architecture search space.

In experiments, we demonstrated the difficulty of finding a good trade-off between computational complexity and predictive performance on several benchmark data sets. We demonstrated on a real-world speech example that binarized DNNs achieve predictive performance comparable to their real-valued counterpart, and that they can be efficiently implemented on off-the-shelf hardware.

Acknowledgments


This work was supported by the Austrian Science Fund (FWF) under the project number I2706-N31 and the German Research Foundation (DFG). Furthermore, we acknowledge the LEAD Project Dependable Internet of Things funded by Graz University of Technology. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 797223 — HYBSPN. We acknowledge NVIDIA for providing GPU computing resources.

References


  • Achterhold et al. (2018) Jan Achterhold, Jan M. Köhler, Anke Schmeink, and Tim Genewein. Variational network quantization. In International Conference on Learning Representations (ICLR), 2018.
  • Anderson and Berg (2018) Alexander G. Anderson and Cory P. Berg. The high-dimensional geometry of binary neural networks. In International Conference on Learning Representations (ICLR), 2018.
  • Barker et al. (2015) Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe. The third ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines. In IEEE 2015 Automatic Speech Recognition and Understanding Workshop (ASRU), 2015.
  • Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.
  • Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In International Conference on Machine Learning (ICML), pages 1613–1622, 2015.
  • Cai et al. (2019) Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations (ICLR), 2019.
  • Cai et al. (2017) Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave Gaussian quantization. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5406–5414, 2017.
  • Chen et al. (2015) Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning (ICML), pages 2285–2294, 2015.
  • Cheng et al. (2015) Yu Cheng, Felix X. Yu, Rogério Schmidt Feris, Sanjiv Kumar, Alok N. Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In International Conference on Computer Vision (ICCV), pages 2857–2865, 2015.
  • Courbariaux et al. (2015a) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplications. In International Conference on Learning Representations (ICLR) Workshop, volume abs/1412.7024, 2015a.
  • Courbariaux et al. (2015b) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems (NIPS), pages 3123–3131, 2015b.
  • Denil et al. (2013) Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems (NIPS), pages 2148–2156, 2013.
  • Denton et al. (2014) Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems (NIPS), pages 1269–1277, 2014.
  • Dong et al. (2017) Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan. More is less: A more complicated network with less inference complexity. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1895–1903, 2017.
  • Erdogan et al. (2016) H. Erdogan, J. Hershey, S. Watanabe, M. Mandel, and J. Le Roux. Improved MVDR beamforming using single-channel mask prediction networks. In Interspeech, 2016.
  • Gao et al. (2019) Xitong Gao, Yiren Zhao, Lukasz Dudziak, Robert Mullins, and Cheng-Zhong Xu. Dynamic channel pruning: Feature boosting and suppression. In International Conference on Learning Representations (ICLR), 2019.
  • Graves (2011) Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 2348–2356, 2011.
  • Grünwald (2007) P. D. Grünwald. The minimum description length principle. MIT Press, 2007.
  • Guo et al. (2016) Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems (NIPS), pages 1379–1387, 2016.
  • Gupta et al. (2015) Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning (ICML), pages 1737–1746, 2015.
  • Han et al. (2015) Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), pages 1135–1143, 2015.
  • Han et al. (2016) Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016.
  • Hassibi and Stork (1992) Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems (NIPS), pages 164–171, 1992.
  • Havasi et al. (2019) Marton Havasi, Robert Peharz, and José Miguel Hernández-Lobato. Minimal random code learning: Getting bits back from compressed model parameters. In International Conference on Learning Representations (ICLR), 2019.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • Heymann et al. (2015) J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach. BLSTM supported GEV beamformer front-end for the 3RD CHiME challenge. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 444–451, 2015.
  • Heymann et al. (2016) J. Heymann, L. Drude, and R. Haeb-Umbach. Neural network based spectral mask estimation for acoustic beamforming. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 196–200, 2016.
  • Hinton and van Camp (1993) Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Conference on Computational Learning Theory (COLT), pages 5–13, 1993.
  • Hinton et al. (2015) Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In Deep Learning and Representation Learning Workshop @ NIPS, 2015.
  • Höhfeld and Fahlman (1992) Markus Höhfeld and Scott E. Fahlman. Learning with limited numerical precision using the cascade-correlation algorithm. IEEE Trans. Neural Networks, 3(4):602–611, 1992.
  • Höhfeld and Fahlman (1992) Markus Höhfeld and Scott E. Fahlman. Probabilistic rounding in neural network learning with limited precision. Neurocomputing, 4(4):291–299, 1992.
  • Howard et al. (2017) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
  • Huang and Wang (2018) Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In European Conference on Computer Vision (ECCV), pages 317–334, 2018.
  • Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 4107–4115, 2016.
  • Iandola et al. (2016) Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1mb model size. CoRR, abs/1602.07360, 2016.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.
  • Jacob et al. (2018) Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2704–2713, 2018.
  • Jaderberg et al. (2014) Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference (BMVC), 2014.
  • Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations (ICLR), 2017.
  • Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • Kingma et al. (2015) Diederik P. Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems (NIPS), pages 2575–2583, 2015.
  • Korattikara et al. (2015) Anoop Korattikara, Vivek Rathod, Kevin P. Murphy, and Max Welling. Bayesian dark knowledge. In Advances in Neural Information Processing Systems (NIPS), pages 3438–3446, 2015.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1106–1114, 2012.
  • Lebedev et al. (2015) Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan V. Oseledets, and Victor S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations (ICLR), 2015.
  • LeCun et al. (1989) Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems (NIPS), pages 598–605, 1989.
  • Li et al. (2016) Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. CoRR, abs/1605.04711, 2016.
  • Li et al. (2017) Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems (NIPS), pages 5813–5823, 2017.
  • Li and Ji (2019) Yang Li and Shihao Ji. L0-ARM: Network sparsification via stochastic binary optimization. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2019.
  • Lin et al. (2016) Darryl Dexu Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning (ICML), pages 2849–2858, 2016.
  • Lin et al. (2017a) Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In Advances in Neural Information Processing Systems (NIPS), pages 2181–2191, 2017a.
  • Lin et al. (2014a) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In International Conference on Learning Representations (ICLR), 2014a.
  • Lin et al. (2014b) Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014b.
  • Lin et al. (2017b) Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems (NIPS), pages 344–352, 2017b.
  • Lin et al. (2015) Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. CoRR, abs/1510.03009, 2015.
  • Liu et al. (2018) Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-Real net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In European Conference on Computer Vision (ECCV), pages 747–763, 2018.
  • Liu et al. (2017) Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In International Conference on Computer Vision (ICCV), pages 2755–2763, 2017.
  • Liu et al. (2019) Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations (ICLR), 2019.
  • Louizos et al. (2017) Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems (NIPS), pages 3288–3298, 2017.
  • Louizos et al. (2018) Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L0 regularization. In International Conference on Learning Representations (ICLR), 2018.
  • Louizos et al. (2019) Christos Louizos, Matthias Reisser, Tijmen Blankevoort, Efstratios Gavves, and Max Welling. Relaxed quantization for discretized neural networks. In International Conference on Learning Representations (ICLR), 2019.
  • Luo et al. (2017) Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In International Conference on Computer Vision (ICCV), pages 5068–5076, 2017.
  • Mariet and Sra (2016) Zelda Mariet and Suvrit Sra. Diversity networks: Neural network compression using determinantal point processes. In International Conference on Learning Representations (ICLR), 2016.
  • Minka (2001) Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence (UAI), pages 362–369, 2001.
  • Miyashita et al. (2016) Daisuke Miyashita, Edward H. Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. CoRR, abs/1603.01025, 2016.
  • Molchanov et al. (2017) Dmitry Molchanov, Arsenii Ashukha, and Dmitry P. Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning (ICML), pages 2498–2507, 2017.
  • Neal (1992) Radford M. Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, Dept. of Computer Science, University of Toronto, 1992.
  • Nocedal and Wright (2006) Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer New York, 2nd edition, 2006.
  • Novikov et al. (2015) Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry P. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 442–450, 2015.
  • Nowlan and Hinton (1992) Steven J. Nowlan and Geoffrey E. Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473–493, 1992.
  • Peters and Welling (2018) Jorn W. T. Peters and Max Welling. Probabilistic binary neural networks. CoRR, abs/1809.03368, 2018.
  • Pfeifenberger et al. (2019) Lukas Pfeifenberger, Matthias Zöhrer, and Franz Pernkopf. Eigenvector-based speech mask estimation for multi-channel speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 27(12):2162–2172, 2019.
  • Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision (ECCV), pages 525–542, 2016.
  • Roth and Pernkopf (2018) Wolfgang Roth and Franz Pernkopf. Bayesian neural networks with weight sharing using Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
  • Roth et al. (2019) Wolfgang Roth, Günther Schindler, Holger Fröning, and Franz Pernkopf. Training discrete-valued neural networks with sign activations using weight distributions. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2019.
  • Rumelhart et al. (1986) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
  • Sandler et al. (2018) Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018.
  • Shayer et al. (2018) Oran Shayer, Dan Levi, and Ethan Fetaya. Learning discrete weights using the local reparameterization trick. In International Conference on Learning Representations (ICLR), 2018.
  • Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
  • Soudry et al. (2014) Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems (NIPS), pages 963–971, 2014.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
  • Stamoulis et al. (2019) Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. Single-path NAS: Designing hardware-efficient convnets in less than 4 hours. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2019.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.
  • Tan and Le (2019) Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), pages 6105–6114, 2019.
  • Tan et al. (2018) Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. CoRR, abs/1807.11626, 2018.
  • Ullrich et al. (2017) Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. In International Conference on Learning Representations (ICLR), 2017.
  • Warsitz and Haeb-Umbach (2007) E. Warsitz and R. Haeb-Umbach. Blind acoustic beamforming based on generalized eigenvalue decomposition. IEEE Transactions on Audio, Speech, and Language Processing, 15:1529–1539, 2007.
  • Warsitz et al. (2008) E. Warsitz, A. Krueger, and R. Haeb-Umbach. Speech enhancement with a new generalized eigenvector blocking matrix for application in a generalized sidelobe canceller. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 73–76, 2008.
  • Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 2074–2082, 2016.
  • Wu et al. (2018) Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. In International Conference on Learning Representations (ICLR), 2018.
  • Xie et al. (2017) Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995, 2017.
  • Yang et al. (2015) Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alexander J. Smola, Le Song, and Ziyu Wang. Deep fried convnets. In International Conference on Computer Vision (ICCV), pages 1476–1483, 2015.
  • Yin and Zhou (2019) Mingzhang Yin and Mingyuan Zhou. ARM: augment-reinforce-merge gradient for stochastic binary networks. In International Conference on Learning Representations (ICLR), 2019.
  • Zhang et al. (2018a) Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In European Conference on Computer Vision (ECCV), pages 373–390, 2018a.
  • Zhang et al. (2018b) Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 6848–6856, 2018b.
  • Zhou et al. (2017) Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. In International Conference on Learning Representations (ICLR), 2017.
  • Zhou et al. (2016) Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016.
  • Zhu et al. (2017) Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. In International Conference on Learning Representations (ICLR), 2017.
  • Zöhrer et al. (2018) M. Zöhrer, L. Pfeifenberger, G. Schindler, H. Fröning, and F. Pernkopf. Resource efficient deep eigenvector beamforming. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
  • Zoph and Le (2017) Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.
  • Zoph et al. (2018) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 8697–8710, 2018.