1 Introduction
Machine learning is a key technology in the 21st century and the main contributing factor for many recent performance boosts in computer vision, natural language processing, speech recognition and signal processing. Today, the main application domain and comfort zone of machine learning applications is the “virtual world”, as found in recommender systems, stock market prediction, and social media services. However, we are currently witnessing a transition of machine learning moving into “the wild”, where the most prominent examples are autonomous navigation for personal transport and delivery services, and the Internet of Things (IoT). Evidently, this trend opens several real-world challenges for machine learning engineers.
Current machine learning approaches prove particularly effective when large amounts of data and ample computing resources are available. However, in real-world applications the computing infrastructure during the operation phase is typically limited, which effectively rules out most of the current resource-hungry machine learning approaches. There are several key challenges, illustrated in Figure 1, which have to be jointly considered to facilitate machine learning in real-world applications:
 Representational efficiency

The model complexity, i.e., the number of model parameters, should match the (usually limited) resources in deployed systems, in particular regarding memory footprint.
 Computational efficiency

The computational cost of performing inference should match the (usually limited) resources in deployed systems, and exploit the available hardware optimally in terms of time and energy. For instance, power constraints are key for autonomous and embedded systems, as the device lifetime for a given battery charge needs to be maximized, or constraints set by energy harvesters need to be met.
 Prediction quality

The focus of classical machine learning is mostly on optimizing the prediction quality of the models. For embedded devices, trade-offs between model complexity and prediction quality must be considered to achieve good prediction performance while simultaneously reducing computational complexity and memory requirements.
In this article, we review the state of the art in machine learning with regard to these real-world requirements. We focus on deep neural networks (DNNs), the currently predominant machine learning models. We formally define DNNs in Section 2 and give a brief introduction to the most prominent building blocks, such as dropout and batch normalization.
While being the driving factor behind many recent success stories, DNNs are notoriously data and resource hungry, a property which has recently renewed significant research interest in resource-efficient approaches. This paper is dedicated to giving an extensive overview of the current directions of research of these approaches, all of which are concerned with reducing the model size and/or improving inference efficiency while at the same time maintaining accuracy levels close to state-of-the-art models. We have identified three major directions of research concerned with enhancing resource-efficiency in DNNs that we present in Section 3. In particular, these directions are
 Quantized Neural Networks

Typically, the weights of a DNN are stored as 32-bit floating-point values and during inference millions of floating-point operations are carried out. Quantization approaches reduce the number of bits used to store the weights and the activations of DNNs, respectively. While quantization approaches obviously reduce the memory footprint of a DNN, the selected weight representation potentially also facilitates faster inference using cheaper arithmetic operations. Even reducing precision down to binary or ternary values works reasonably well and essentially reduces DNNs to hardware-friendly logical circuits.
 Network Pruning

Starting from a fixed, potentially large DNN architecture, pruning approaches remove parts of the architecture during training or after training as a post-processing step. The parts being removed range from the very local scale of individual weights, which is called unstructured pruning, to a more global scale of neurons, channels, or even entire layers, which is called structured pruning. On the one hand, unstructured pruning typically degrades accuracy less, but special sparse matrix operations are required to obtain a computational benefit. On the other hand, structured pruning is more delicate with respect to accuracy, but the resulting data structures remain dense such that common highly optimized dense matrix operations available on most off-the-shelf hardware can be used.
 Structural Efficiency

This category comprises a diverse set of approaches that achieve resource-efficiency at the structural level of DNNs. Knowledge distillation is an approach where a small student DNN is trained to mimic the behavior of a larger teacher DNN, which has been shown to yield improved results compared to training the small DNN directly. The idea of weight sharing is to use a small set of weights that is shared among several connections of a DNN to reduce the memory footprint. Several works have investigated special matrix structures that require fewer parameters and allow for faster matrix multiplications, the main workload in fully connected layers. Furthermore, there exist several manually designed architectures that introduced lightweight building blocks or modified existing building blocks to enhance resource-efficiency. Most recently, neural architecture search methods have emerged that discover efficient DNN architectures automatically.
Evidently, many of the presented techniques are not mutually exclusive, and they can potentially be combined to further enhance resource-efficiency. For instance, one can both sparsify a model and reduce arithmetic precision.
In Section 4 we substantiate our discussion with experimental results. First, we exemplify the trade-off between latency, memory footprint, and predictive accuracy in DNNs on the CIFAR-10 data set. Subsequently, we provide a comparison of various quantization approaches for DNNs, using the CIFAR-100 data set in Section 4.2. In particular, this overview shows that sensible trade-offs are achievable with very low numeric precision, even for the extreme case of binarized weights and activations, respectively. Finally, a complete real-world signal processing example using DNNs with binary weights and activations is discussed in Section 4.3. We develop a complete speech enhancement system employing an efficient DNN-based speech mask estimator, which shows negligible performance degradation while allowing memory savings by a factor of 32 and speedups by approximately a factor of 10.
2 Background
Before we present a comprehensive overview of the many different techniques for reducing the complexity of DNNs in Section 3, this section formally introduces DNNs and some fundamentals required in the remainder of the paper.
2.1 Feedforward Deep Neural Networks
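DNNs are typically organized in layers of alternating linear transformations and nonlinear activation functions. A vanilla DNN with $L$ layers is a function mapping an input $\mathbf{x}^0$ to an output $\mathbf{x}^L$ by applying the iterative computation
$$\mathbf{a}^l = \mathbf{W}^l \mathbf{x}^{l-1} + \mathbf{b}^l, \tag{1}$$
$$\mathbf{x}^l = \phi^l(\mathbf{a}^l), \tag{2}$$
where (1) computes a linear transformation with weight tensor $\mathbf{W}^l$ and bias vector $\mathbf{b}^l$, and (2) computes a nonlinear activation function $\phi^l$ that is typically applied element-wise. Common choices for $\phi^l$ are the ReLU function $\phi(a) = \max(0, a)$, sigmoid functions, such as $\tanh(a)$ and the logistic function $1 / (1 + \exp(-a))$, and, in the context of resource-efficient models, the sign function $\mathrm{sign}(a) = \mathbb{I}(a \geq 0) - \mathbb{I}(a < 0)$, where $\mathbb{I}$ is the indicator function.
In this paper, we focus on hardware-efficient machine learning in the context of classification, i.e., the task of assigning the input $\mathbf{x}^0$ to a class $c \in \{1, \ldots, C\}$. Other predictive tasks, such as regression and multi-label prediction, can be tackled in a similar manner. For classification tasks, the output activation function $\phi^L$ for computing $\mathbf{x}^L \in \mathbb{R}^C$ is typically the softmax function $\phi(\mathbf{a})_i = \exp(a_i) / \sum_j \exp(a_j)$. An input $\mathbf{x}^0$ is assigned to class $c = \arg\max_i x^L_i$.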
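The iterative computation in (1) and (2) can be illustrated with a minimal NumPy sketch for fully connected layers. The layer sizes, the random weights, and the helper names `relu`, `softmax`, and `forward` are assumptions made for illustration only; ReLU is used in the hidden layers and softmax at the output.

```python
import numpy as np

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    # Subtracting the maximum improves numerical stability without changing the result.
    e = np.exp(a - np.max(a))
    return e / e.sum()

def forward(x0, weights, biases):
    """Apply a linear transformation followed by an activation in every layer."""
    x = x0
    num_layers = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        a = W @ x + b                                   # linear transformation, cf. (1)
        x = softmax(a) if l == num_layers else relu(a)  # activation, cf. (2)
    return x

# Hypothetical network with 4 inputs, 8 hidden neurons, and 3 classes.
rng = np.random.default_rng(0)
sizes = [4, 8, 3]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(2)]
biases = [np.zeros(sizes[i + 1]) for i in range(2)]

probs = forward(rng.standard_normal(4), weights, biases)
predicted_class = int(np.argmax(probs))  # class with the largest softmax output
```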
The two most common types of layers are (i) fully connected layers (many popular deep learning frameworks refer to these as dense layers) and (ii) convolutional layers. For fully connected layers, the input $\mathbf{x}^{l-1} \in \mathbb{R}^{d_{l-1}}$ is a vector whose individual dimensions, also called neurons, do not exhibit any a-priori known structure. For $l < L$, we speak of hidden layers and hidden neurons. The linear transformation of a fully connected layer is implemented as a matrix-vector multiplication $\mathbf{W}^l \mathbf{x}^{l-1}$ where $\mathbf{W}^l \in \mathbb{R}^{d_l \times d_{l-1}}$.
Convolutions are used if the data exhibits spatial or temporal dimensions such as images, in which case the DNN is called a convolutional neural network (CNN). Two-dimensional images can be represented as three-dimensional tensors $\mathbf{x}^l \in \mathbb{R}^{C_l \times W_l \times H_l}$, where $C_l$ refers to the number of channels (or, equivalently, feature maps), and $W_l$ and $H_l$ refer to the width and the height of the image, respectively. A convolution using a rank-4 filter weight tensor $\mathbf{W}^l \in \mathbb{R}^{C_{l-1} \times C_l \times U \times V}$ mapping $\mathbf{x}^{l-1}$ to $\mathbf{a}^l$ is computed as
$$a^l_{c,w,h} = b^l_c + \sum_{c'=1}^{C_{l-1}} \sum_{u=1}^{U} \sum_{v=1}^{V} W^l_{c',c,u,v} \, x^{l-1}_{c',\, g_U(w,u),\, g_V(h,v)}, \tag{3}$$
where $g$ is the auxiliary indexing function
$$g_U(w, u) = w - \lceil U/2 \rceil + u. \tag{4}$$
Each spatial location $(w, h)$ of the output feature map $\mathbf{a}^l$ is computed from a $U \times V$ region of the input image $\mathbf{x}^{l-1}$. By using the same filter to compute the values at different spatial locations, a translation invariant detection of features is obtained. The spatial size of features detected within an image is bounded by the receptive field, i.e., the section of the input image that influences the value of a particular spatial location in some hidden layer. The receptive field is increased by stacking multiple convolutional layers, e.g., performing two consecutive $3 \times 3$ convolutions results in each output spatial location being influenced by a larger $5 \times 5$ region of the input feature maps.
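To make the convolution concrete, the following is a naive loop-based sketch. It assumes stride 1, odd filter sizes, and zero padding so that the output has the same spatial size as the input; these conventions, the tensor layout, and the function name `conv2d` are illustrative assumptions rather than the exact definition used in the text. As in most deep learning frameworks, the filter is not flipped.

```python
import numpy as np

def conv2d(x, W, b):
    """Naive convolution with zero padding and stride 1 (assumed conventions).

    x: input of shape (C_in, width, height)
    W: filters of shape (C_in, C_out, U, V) with odd U and V
    b: bias of shape (C_out,)
    """
    C_in, width, height = x.shape
    _, C_out, U, V = W.shape
    pu, pv = U // 2, V // 2
    xp = np.pad(x, ((0, 0), (pu, pu), (pv, pv)))  # zero padding of the spatial axes
    a = np.empty((C_out, width, height))
    for c in range(C_out):
        for w in range(width):
            for h in range(height):
                patch = xp[:, w:w + U, h:h + V]   # U x V region over all input channels
                a[c, w, h] = np.sum(W[:, c] * patch) + b[c]
    return a

# A 3x3 filter with a single one in the center reproduces the input.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 5))
W = np.zeros((2, 2, 3, 3))
W[0, 0, 1, 1] = 1.0
W[1, 1, 1, 1] = 1.0
out = conv2d(x, W, np.zeros(2))
```

The same filter weights are reused at every spatial location, which is what keeps the number of parameters independent of the image size.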
Another form of translational invariance is achieved by pooling operations that merge spatially neighboring values within a feature map to reduce the feature map’s size. Common choices are max-pooling and average-pooling, which combine neighboring values (typically from a $2 \times 2$ region, which halves the width and the height of the feature map) by computing their maximum or average, respectively. Furthermore, pooling operations also increase the receptive field.
2.2 Training of Deep Neural Networks
The task of training is concerned with adjusting the weights $\mathbf{W}$ such that the DNN reliably predicts correct classes for unseen inputs $\mathbf{x}^0$. This is accomplished by minimizing a loss function $\mathcal{L}$ using gradient-based optimization (Nocedal and Wright, 2006). Given some labeled training data $\mathcal{D} = \{(\mathbf{x}^0_1, t_1), \ldots, (\mathbf{x}^0_N, t_N)\}$ containing $N$ input-target pairs, a typical loss function has the form
$$\mathcal{L}(\mathbf{W}; \mathcal{D}) = \sum_{n=1}^{N} l(\mathbf{x}^L_n, t_n) + \lambda \, r(\mathbf{W}), \tag{5}$$
where $l(\mathbf{x}^L_n, t_n)$ is the data term that penalizes the DNN parameters if the output $\mathbf{x}^L_n$ does not match the target value $t_n$, $r(\mathbf{W})$ is a regularizer that prevents the DNN from overfitting, and $\lambda$ is a trade-off hyperparameter. Typical choices for the data term $l$ are the cross-entropy loss or the mean squared error loss, whereas typical choices for the regularizer $r$ are the $\ell^1$-norm or the $\ell^2$-norm of the weights, respectively. The loss is minimized using gradient descent by iteratively computing
$$\mathbf{W} \leftarrow \mathbf{W} - \eta \, \nabla_{\mathbf{W}} \mathcal{L}(\mathbf{W}; \mathcal{D}), \tag{6}$$
where $\eta$ is a learning rate hyperparameter. In practice, more involved stochastic gradient descent (SGD) schemes, such as ADAM (Kingma and Ba, 2015), are used that randomly select smaller subsets of the data, called mini-batches, to approximate the gradient.
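As an illustration of the update in (6) with mini-batches, the sketch below trains a linear model (standing in for a DNN) with a mean squared error data term and a squared $\ell^2$-norm regularizer. The synthetic data, the hyperparameter values, and the use of a mean instead of a sum in the data term are assumptions chosen to keep the example short.

```python
import numpy as np

def loss_and_grad(w, X, y, lam):
    """Mean squared error data term plus lam * ||w||^2, and its gradient."""
    residual = X @ w - y
    loss = np.mean(residual ** 2) + lam * np.sum(w ** 2)
    grad = 2.0 * X.T @ residual / len(y) + 2.0 * lam * w
    return loss, grad

# Synthetic regression data (hypothetical).
rng = np.random.default_rng(0)
X = rng.standard_normal((256, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true

w = np.zeros(5)
eta, lam, batch_size = 0.05, 1e-4, 32
initial_loss, _ = loss_and_grad(w, X, y, lam)
for step in range(500):
    idx = rng.choice(len(y), size=batch_size, replace=False)  # draw a mini-batch
    _, grad = loss_and_grad(w, X[idx], y[idx], lam)           # approximate gradient
    w = w - eta * grad                                        # update, cf. (6)
final_loss, _ = loss_and_grad(w, X, y, lam)
```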
Modern deep learning frameworks play an important role in the growing popularity of DNNs as they make gradient-based optimization particularly convenient: The user specifies the loss $\mathcal{L}$ as a computation graph and the gradient $\nabla_{\mathbf{W}} \mathcal{L}$ is calculated automatically by the framework using the backpropagation algorithm (Rumelhart et al., 1986).
2.3 Batch Normalization
The literature has established a consensus that using more layers improves the classification performance of DNNs. However, increasing the number of layers also increases the difficulty of training a DNN using gradient-based methods as described in Section 2.2. Most modern DNN architectures employ batch normalization (Ioffe and Szegedy, 2015) after the linear transformation of some or all layers by computing
$$\mu \leftarrow \frac{1}{M} \sum_{m=1}^{M} a_m, \qquad \sigma^2 \leftarrow \frac{1}{M} \sum_{m=1}^{M} (a_m - \mu)^2, \qquad a_m \leftarrow \gamma \, \frac{a_m - \mu}{\sigma} + \beta, \tag{7}$$
where $\gamma$ and $\beta$ are trainable parameters, and $M$ is the mini-batch size of SGD.
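The computation in (7) can be sketched as follows for a mini-batch of pre-activations. The small constant `eps` is an assumption added for numerical stability, as is common in practice, and the function name `batch_norm` is hypothetical. The sketch covers only the training-time statistics and omits the running averages typically used at test time.

```python
import numpy as np

def batch_norm(a, gamma, beta, eps=1e-5):
    """Normalize pre-activations a of shape (M, d) over the mini-batch axis."""
    mu = a.mean(axis=0)                    # per-neuron mean over the mini-batch
    var = a.var(axis=0)                    # per-neuron variance (division by M)
    a_hat = (a - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * a_hat + beta            # trainable scale and shift

rng = np.random.default_rng(0)
a = 5.0 * rng.standard_normal((64, 3)) + 2.0
normalized = batch_norm(a, gamma=np.ones(3), beta=np.zeros(3))
```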
The idea is to normalize the activation statistics over the data samples in each layer to zero mean and unit variance. This results in similar activation statistics throughout the network which facilitates gradient flow during backpropagation. The linear transformation of the normalized activations with the parameters $\gamma$ and $\beta$ is mainly used to recover the DNN’s ability to approximate any desired function, a feature that would be lost if only the normalization step were performed. Most recent DNN architectures have been shown to benefit from batch normalization, and, as reviewed in Section 3.2.2, batch normalization can be targeted to achieve resource-efficiency in DNNs.
2.4 Dropout
Dropout as introduced by Srivastava et al. (2014) is a way to prevent neural networks from overfitting by injecting multiplicative noise to the inputs of a layer, i.e., $\tilde{\mathbf{x}}^{l-1} = \mathbf{x}^{l-1} \odot \boldsymbol{\epsilon}$. Common choices for the injected noise are $\epsilon_i \sim \mathrm{Bernoulli}(p)$ or $\epsilon_i \sim \mathcal{N}(1, \alpha)$ with $p$ and $\alpha$ being hyperparameters. Intuitively, the idea is that hidden neurons cannot rely on the presence of features computed by other neurons. Consequently, individual neurons are expected to compute in a sense “meaningful” features on their own. This prevents multiple neurons from jointly computing features in an entangled way. Dropout has been cast into a Bayesian framework which was subsequently exploited to perform network pruning as detailed in Section 3.2.3.
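A minimal sketch of multiplicative noise injection is given below. The function name `dropout`, the parameter names, and the rescaling of the Bernoulli noise by `1 / keep_prob` (often called inverted dropout, which keeps the expected activation unchanged) are assumptions made for illustration.

```python
import numpy as np

def dropout(x, rng, kind="bernoulli", keep_prob=0.5, alpha=1.0):
    """Multiply the layer inputs x element-wise with random noise."""
    if kind == "bernoulli":
        # Each input is kept with probability keep_prob and rescaled (assumed convention).
        noise = rng.binomial(1, keep_prob, size=x.shape) / keep_prob
    else:
        # Gaussian noise with mean 1 and variance alpha.
        noise = rng.normal(1.0, np.sqrt(alpha), size=x.shape)
    return x * noise

rng = np.random.default_rng(0)
x = np.ones(100_000)
y_bernoulli = dropout(x, rng, kind="bernoulli", keep_prob=0.5)
y_gaussian = dropout(x, rng, kind="gaussian", alpha=1.0)
```

At test time the noise is usually switched off and the layer inputs are used unchanged.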
2.5 Modern Architectures
As mentioned in the beginning of this section, most architectures follow the simple scheme of repeating several layers of linear transformation followed by a nonlinear function $\phi$. Although most successful architectures follow this scheme, recent architectures have introduced additional components and subtle extensions that have led to new design principles. In the following, we give a brief overview of the most prominent architectures that have emerged over the past years in chronological order.
2.5.1 AlexNet
The AlexNet architecture (Krizhevsky et al., 2012) was the first work to show that DNNs are capable of improving performance over conventional hand-crafted computer vision techniques by achieving 16.4% Top-5 error on the ILSVRC12 challenge, an improvement of approximately 10% absolute error compared to the second best approach in the challenge which relied on well-established computer vision techniques. This highly influential work essentially sparked the widespread adoption of DNNs, which have since spread to virtually every scientific field and improved on well-established methods in the respective fields.
The architecture consists of eight layers: five convolutional layers followed by three fully connected layers. AlexNet was designed to optimally utilize the available hardware at that time rather than following some clear design principle. This involves the choice of heterogeneous window sizes and seemingly arbitrary numbers of channels per layer $C_l$. Furthermore, convolutions are performed in two parallel paths to facilitate the training on two GPUs.
2.5.2 VGGNet
The VGGNet architecture (Simonyan and Zisserman, 2015) won second place at the ILSVRC14 challenge with 7.3% Top-5 error. Compared to AlexNet, its structure is more uniform and, with up to 19 layers, much deeper. The design of VGGNet is guided by two main principles. (i) VGGNet uses mostly $3 \times 3$ convolutions and it increases the receptive field by stacking several of them. (ii) After downscaling the spatial dimension with max-pooling, the number of channels should be doubled to avoid information loss. From a hardware perspective, VGGNet is often preferred over other architectures due to its uniform architecture.
2.5.3 InceptionNet
InceptionNet (or, equivalently, GoogLeNet) (Szegedy et al., 2015) won the ILSVRC14 challenge with 6.7% Top-5 error with an even deeper architecture consisting of 22 layers. The main feature of this architecture is the inception module which combines the outputs of $1 \times 1$, $3 \times 3$, and $5 \times 5$ convolutions, respectively, by stacking them. To reduce the computational burden, InceptionNet performs $1 \times 1$ convolutions proposed in (Lin et al., 2014a) to reduce the number of channels immediately before the larger $3 \times 3$ and $5 \times 5$ convolutions, respectively.
2.5.4 ResNet
Motivated by the observation that adding more layers to very deep conventional CNN architectures does not necessarily reduce the training error, residual networks (ResNets) introduced by He et al. (2016) follow a rather different principle. The key idea is that every layer computes a residual that is added to the layer’s input. This is often graphically depicted as a residual path with an attached skip connection.
The authors hypothesize that identity mappings play an important role and that it is easier to model them in ResNets by simply setting all the weights of the residual path to zero instead of simulating an identity mapping by adapting the weights of several consecutive layers in an intertwined way. In any case, the skip connections reduce the vanishing gradient problem during training and enable extremely deep architectures of up to 152 layers on ImageNet and even up to 1000 layers on CIFAR-10. ResNet won the ILSVRC15 challenge with 3.6% Top-5 error.
2.5.5 DenseNet
Inspired by ResNets, whose skip connections have been shown to reduce the vanishing gradient problem, densely connected CNNs (DenseNets) introduced by Huang et al. (2017) drive this idea even further by connecting each layer to all previous layers. DenseNets are conceptually very similar to ResNets: instead of adding the output of a layer to its input, DenseNets stack the output and the input of each layer. Since this stacking necessarily increases the number of feature maps with each layer, the number of new feature maps computed by each layer is typically small. Furthermore, it is proposed to use compression layers after downscaling the spatial dimension with pooling, i.e., a $1 \times 1$ convolution is used to reduce the number of feature maps.
Compared to ResNets, DenseNets achieve similar performance, allow for even deeper architectures, and they are more parameter and computation efficient, respectively. However, the DenseNet architecture is highly non-uniform which complicates the hardware mapping and ultimately slows down training.
2.6 The Straight-Through Gradient Estimator
Many recently developed methods for resource-efficiency in DNNs incorporate components in the computation graph of the loss function $\mathcal{L}$ that are non-differentiable or whose gradient is zero almost everywhere, such as piecewise constant quantizers. These components prevent the use of conventional gradient-based optimization as described in Section 2.2.
The straight-through gradient estimator (STE) is a simple but effective way to approximate the gradient of such components by simply replacing their gradient with a non-zero value. Let $f(w)$ be some non-differentiable operation within the computation graph of $\mathcal{L}$ such that the partial derivative $\partial \mathcal{L} / \partial w$ is not defined. The STE then approximates the gradient by
$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial f} \frac{\partial f}{\partial w} \approx \frac{\partial \mathcal{L}}{\partial f} \frac{\partial \tilde{f}}{\partial w}, \tag{8}$$
where $\tilde{f}(w)$ is an arbitrary differentiable function with a similar functional shape as $f(w)$. For instance, in case of the sign activation function $f(w) = \mathrm{sign}(w)$, whose derivative is zero almost everywhere, one could select $\tilde{f}(w) = \tanh(w)$. Another common choice is the identity function $\tilde{f}(w) = w$ whose derivative is $\tilde{f}'(w) = 1$, which simply passes the gradient on to higher components in the computation graph during backpropagation. Figure 2 illustrates the STE applied to a simplified DNN layer.
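The following sketch shows the forward and backward pass of a sign function with the STE, written by hand rather than with an automatic differentiation framework. The function names and the `surrogate` argument are hypothetical; the two surrogates correspond to the tanh and identity choices mentioned above.

```python
import numpy as np

def sign_forward(a):
    """Sign function with values in {-1, +1}."""
    return np.where(a >= 0, 1.0, -1.0)

def sign_backward_ste(a, grad_out, surrogate="tanh"):
    """Backward pass that uses the derivative of a surrogate instead of sign'."""
    if surrogate == "tanh":
        return grad_out * (1.0 - np.tanh(a) ** 2)  # derivative of tanh
    return grad_out                                # identity: pass the gradient through

a = np.array([-2.0, 0.0, 3.0])
upstream = np.ones_like(a)   # gradient of the loss w.r.t. the output of sign
out = sign_forward(a)
grad_tanh = sign_backward_ste(a, upstream, surrogate="tanh")
grad_identity = sign_backward_ste(a, upstream, surrogate="identity")
```

The forward pass still uses the exact sign function; only the backward pass is replaced, so the gradient is an approximation rather than the true derivative.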
2.7 Bayesian Neural Networks
Since there exist several works for resource-efficient DNNs that build on the framework of Bayesian neural networks, we briefly introduce the basic principles here. Given a prior distribution $p(\mathbf{W})$ on the weights and a likelihood $p(\mathcal{D} \mid \mathbf{W})$ defined by the softmax output of a DNN as
$$p(\mathcal{D} \mid \mathbf{W}) = \prod_{n=1}^{N} p(t_n \mid \mathbf{x}^0_n, \mathbf{W}) = \prod_{n=1}^{N} x^L_{n, t_n}, \tag{9}$$
we can use Bayes’ rule to infer a posterior distribution over the weights, i.e.,
$$p(\mathbf{W} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{W}) \, p(\mathbf{W})}{p(\mathcal{D})} \propto p(\mathcal{D} \mid \mathbf{W}) \, p(\mathbf{W}). \tag{10}$$
From a Bayesian perspective it is desired to compute expected predictions with respect to the posterior distribution $p(\mathbf{W} \mid \mathcal{D})$, and not just to reduce the entire distribution to a single point estimate. However, due to the highly nonlinear nature of DNNs, most exact inference scenarios involving the full posterior are typically intractable and there exist a range of approximation techniques for these tasks, such as variational inference (Hinton and van Camp, 1993; Graves, 2011; Blundell et al., 2015) and sampling-based approaches (Neal, 1992).
Interestingly, training DNNs can often be seen as a very rough Bayesian approximation where we only seek weights $\mathbf{W}$ that maximize the posterior $p(\mathbf{W} \mid \mathcal{D})$, which is also known as maximum a-posteriori estimation (MAP). In particular, in a typical loss $\mathcal{L}$ as in (5) the data term originates from the logarithm of the likelihood $p(\mathcal{D} \mid \mathbf{W})$ whereas the regularizer originates from the logarithm of the prior $p(\mathbf{W})$.
A better Bayesian approximation is obtained with variational inference where the aim is to find a variational distribution $q(\mathbf{W} \mid \boldsymbol{\nu})$ governed by distribution parameters $\boldsymbol{\nu}$ that is as close as possible to the posterior $p(\mathbf{W} \mid \mathcal{D})$ but still simple enough to allow for efficient inference, e.g., for computing expected predictions by sampling from $q(\mathbf{W} \mid \boldsymbol{\nu})$. This is typically achieved by the so-called mean-field assumption, i.e., by assuming that the weights are independent such that $q(\mathbf{W} \mid \boldsymbol{\nu})$ factorizes into a product of factors $q(w \mid \boldsymbol{\nu}_w)$ for each weight $w \in \mathbf{W}$. The most prominent approach to obtain the variational distribution $q(\mathbf{W} \mid \boldsymbol{\nu})$ is by minimizing the KL-divergence $\mathrm{KL}(q(\mathbf{W} \mid \boldsymbol{\nu}) \,\|\, p(\mathbf{W} \mid \mathcal{D}))$ using gradient-based optimization.
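The connection between MAP estimation and a regularized loss mentioned above can be made explicit with a short derivation. The notation $\mathbf{W}_{\mathrm{MAP}}$ and the zero-mean Gaussian prior with variance $\sigma_p^2$ in the last step are assumptions chosen for illustration; other priors lead to other regularizers.

```latex
\begin{align}
\mathbf{W}_{\mathrm{MAP}}
  &= \arg\max_{\mathbf{W}} \; p(\mathbf{W} \mid \mathcal{D})
   = \arg\max_{\mathbf{W}} \; \log p(\mathcal{D} \mid \mathbf{W}) + \log p(\mathbf{W}) \\
  &= \arg\min_{\mathbf{W}} \; \underbrace{-\sum_{n=1}^{N} \log p(t_n \mid \mathbf{x}^0_n, \mathbf{W})}_{\text{cross-entropy data term}}
     \; \underbrace{- \log p(\mathbf{W})}_{\text{regularizer}} \\
  &= \arg\min_{\mathbf{W}} \; -\sum_{n=1}^{N} \log p(t_n \mid \mathbf{x}^0_n, \mathbf{W})
     + \frac{1}{2 \sigma_p^2} \, \|\mathbf{W}\|_2^2
     \qquad \text{if } p(\mathbf{W}) = \mathcal{N}(\mathbf{0}, \sigma_p^2 \mathbf{I}).
\end{align}
```

The normalization constant $p(\mathcal{D})$ does not depend on $\mathbf{W}$ and therefore drops out of the maximization. Under the Gaussian prior, the trade-off hyperparameter of the squared $\ell^2$-norm regularizer corresponds to $1 / (2 \sigma_p^2)$.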
The Bayesian approach is appealing as distributions over the parameters directly translate into predictive distributions. In contrast to ordinary DNNs that only provide a point estimate prediction, Bayesian neural networks offer predictive uncertainties which are useful to determine how certain the DNN is about its own prediction. However, the Bayesian framework has several other useful properties that can be exploited to obtain resource-efficient DNNs. For instance, the prior $p(\mathbf{W})$ allows us to incorporate information about properties, such as sparsity, that we expect to be present in the DNN. In Section 3.2.3, we review pruning approaches based on the Bayesian paradigm, and in Section 3.1.3, we review weight quantization approaches based on the Bayesian paradigm.
3 Resource-efficiency in Deep Neural Networks
In this section, we provide a comprehensive overview of methods that enhance the efficiency of DNNs regarding memory footprint, computation time, and energy requirements. We have identified three different major approaches that aim to reduce the computational complexity of DNNs, i.e., (i) weight and activation quantization, (ii) network pruning, and (iii) structural efficiency. These categories are not mutually exclusive, and we present individual methods in the category where their contribution is most significant.
3.1 Quantized Neural Networks
Quantization in DNNs is concerned with reducing the number of bits used for the representation of the weights and the activations, respectively. The reduction in memory requirements is obvious: Using fewer bits for the weights results in less memory overhead to store the corresponding model, and using fewer bits for the activations results in less memory overhead when computing predictions. Furthermore, representations using fewer bits often facilitate faster computation. For instance, when quantization is driven to the extreme with binary weights $w \in \{-1, 1\}$ and binary activations $x \in \{-1, 1\}$, floating-point or fixed-point multiplications are replaced by hardware-friendly logical XNOR and bitcount operations. In this way, a sophisticated DNN is essentially reduced to a logical circuit.
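The replacement of multiplications by XNOR and bitcount operations can be illustrated for a single dot product. The sketch below assumes the encoding +1 as bit 1 and -1 as bit 0 and uses Python integers as bit masks; the helper names `pack_bits` and `binary_dot` are hypothetical, and dedicated hardware would operate on fixed-width machine words instead.

```python
import numpy as np

def pack_bits(v):
    """Encode a {-1, +1} vector as an integer bit mask (+1 -> 1, -1 -> 0)."""
    mask = 0
    for i, value in enumerate(v):
        if value > 0:
            mask |= 1 << i
    return mask

def binary_dot(w_bits, x_bits, n):
    """Dot product of two {-1, +1} vectors of length n given as bit masks."""
    xnor = ~(w_bits ^ x_bits) & ((1 << n) - 1)  # bit is 1 where the signs agree
    matches = bin(xnor).count("1")              # bitcount
    return 2 * matches - n                      # agreements minus disagreements

rng = np.random.default_rng(0)
n = 37
w = rng.choice([-1, 1], size=n)
x = rng.choice([-1, 1], size=n)
result = binary_dot(pack_bits(w), pack_bits(x), n)
reference = int(w @ x)  # the same dot product using ordinary arithmetic
```

Each agreeing position contributes +1 and each disagreeing position contributes -1, which is why the dot product equals twice the number of matches minus the vector length.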
However, training such discrete-valued DNNs is delicate as they cannot be directly optimized using gradient-based methods. (Due to finite precision, any DNN is in fact discrete-valued; we use this term here to highlight the extremely low number of values.) The challenge is to reduce the number of bits as much as possible while at the same time keeping the classification performance close to that of a well-tuned full-precision DNN. In the sequel, we provide a literature overview of approaches that train reduced-precision DNNs, and, in a broader sense, we also consider methods that use reduced-precision computations during backpropagation to facilitate low-resource training.
3.1.1 Early Quantization Approaches
Approaches for reduced-precision computations date back at least to the early 1990s. Höhfeld and Fahlman (1992a, 1992b) rounded the weights during training to fixed-point format with different numbers of bits. They observed that training eventually stalls as small gradient updates are always rounded to zero. As a remedy, they proposed stochastic rounding, i.e., rounding a value up or down to one of its two neighboring representable values, where the probability of selecting a neighbor increases linearly with its proximity to the value. These quantized gradient updates are correct in expectation, do not cause training to stall, and yield good performance with substantially fewer bits than deterministic rounding. More recently, Gupta et al. (2015) have shown that stochastic rounding can also be applied for modern deep architectures, as demonstrated on a hardware prototype.
Lin et al. (2015) propose a method to reduce the number of multiplications required during training. At forward propagation, the weights are stochastically quantized to either binary weights $w \in \{-1, 1\}$ or ternary weights $w \in \{-1, 0, 1\}$ to remove the need for multiplications altogether. During backpropagation, inputs and hidden neurons are quantized to powers of two, reducing multiplications to cheaper bit-shift operations, and leaving only a negligible number of floating-point multiplications to be computed. However, the speed-up is limited to training since for testing the full-precision weights are required.
Courbariaux et al. (2015a) empirically studied the effect of different numeric formats (namely floating-point, fixed-point, and dynamic fixed-point) with varying bit widths on the performance of DNNs. Lin et al. (2016) consider fixed-point quantization of pre-trained full-precision DNNs. They formulate a convex optimization problem to minimize the total number of bits required to store the weights and the activations under the constraint that the total output signal-to-quantization noise ratio is larger than a certain prespecified value. A closed-form solution of the convex objective yields layer-specific bit widths.
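To illustrate the stochastic rounding scheme discussed at the beginning of this section, the sketch below rounds values to a uniform grid with spacing `step`. The function name and the grid parametrization are assumptions for illustration. A value is rounded up with a probability equal to its fractional position between the two neighboring grid points, so that the result is correct in expectation.

```python
import numpy as np

def stochastic_round(x, step, rng):
    """Round x to a multiple of step such that the expected result equals x."""
    scaled = np.asarray(x, dtype=np.float64) / step
    lower = np.floor(scaled)
    prob_up = scaled - lower                       # fractional position in [0, 1)
    round_up = rng.random(scaled.shape) < prob_up  # round up with probability prob_up
    return (lower + round_up) * step

rng = np.random.default_rng(0)
updates = np.full(200_000, 0.3)                    # small updates on a grid of spacing 1
deterministic = np.round(updates)                  # always rounds 0.3 down to 0
stochastic = stochastic_round(updates, 1.0, rng)   # rounds to 0 or 1 at random
```

Deterministic rounding discards every one of these small updates, which is the stalling effect described above, whereas the average of the stochastically rounded updates stays close to the original value.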
3.1.2 Quantization-aware Training
Quantization operations, being piecewise constant functions with either undefined or zero gradients, are not amenable to gradient-based learning using backpropagation. In recent years, the STE (Bengio et al., 2013) (see Section 2.6) became the method of choice to compute an approximate gradient for training DNNs with weights that are represented using a very small number of bits. Such methods typically maintain a set of full-precision weights that are quantized during forward propagation. During backpropagation, the gradients are propagated through the quantization functions by assuming that their gradient equals one. In this way, the full-precision weights are updated using gradients computed at the quantized weights. At test time, the full-precision weights are abandoned and only the quantized reduced-precision weights are kept. We term this scheme quantization-aware training since quantization is an essential part during forward propagation and it is intuitive to think of the real-valued weights becoming robust to quantization. In a similar manner, many methods employ the STE to approximate the quantization of activations.
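The scheme described above can be sketched on a toy problem: a linear model with binary weights trained by gradient descent on auxiliary full-precision weights. The synthetic data, the learning rate, and the clipping of the auxiliary weights to the interval from -1 to 1 are assumptions for illustration and are not taken from a specific method.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((512, 16))
w_target = rng.choice([-1.0, 1.0], size=16)  # toy ground truth with binary weights
y = X @ w_target

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

w_fp = 0.1 * rng.standard_normal(16)         # auxiliary full-precision weights
initial_loss = mse(np.where(w_fp >= 0, 1.0, -1.0))
eta = 0.01
for step in range(300):
    w_q = np.where(w_fp >= 0, 1.0, -1.0)            # forward pass uses quantized weights
    grad_q = 2.0 * X.T @ (X @ w_q - y) / len(y)     # gradient at the quantized weights
    w_fp -= eta * grad_q                            # STE: quantizer gradient taken as one
    w_fp = np.clip(w_fp, -1.0, 1.0)                 # keep auxiliary weights bounded (assumed)

w_final = np.where(w_fp >= 0, 1.0, -1.0)     # only the quantized weights are kept
final_loss = mse(w_final)
```

Note that the gradient is evaluated at `w_q` but applied to `w_fp`, so many small updates can accumulate in the full-precision weights before a quantized weight changes its sign.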
In (Courbariaux et al., 2015b), binary weight DNNs are trained using the STE to get rid of expensive floating-point multiplications. They consider deterministic rounding using the sign function and stochastic rounding using probabilities determined by the hard-sigmoid function $\max(0, \min(1, (w + 1) / 2))$. During backpropagation, a set of auxiliary full-precision weights is updated based on the gradients of the quantized weights. Hubara et al. (2016) extended this work by also quantizing the activations to a single bit using the sign activation function. This reduces the computational burden dramatically as floating-point multiplications and additions are reduced to hardware-friendly logical XNOR and bitcount operations, respectively.
Li et al. (2016) trained ternary weights $w \in \{-a, 0, a\}$. Their quantizer sets weights whose magnitude is lower than a certain threshold $\Delta$ to zero, while the remaining weights are set to $-a$ or $a$ according to their sign. Their approach determines $a > 0$ and $\Delta$ during forward propagation by approximately minimizing the squared quantization error of the real-valued weights. Zhu et al. (2017) extended this work to ternary weights $w \in \{-a, 0, b\}$ where $a > 0$ and $b > 0$ are trainable parameters subject to gradient updates. They propose to select $\Delta$ based on the maximum full-precision weight magnitude in each layer, i.e., $\Delta = t \cdot \max\{|w| : w \in \mathbf{W}^l\}$ with $t$ being a hyperparameter. These asymmetric weights considerably improve performance compared to symmetric weights as used in (Li et al., 2016).
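A symmetric ternary quantizer of the kind described above can be sketched as follows. The threshold heuristic `delta = 0.7 * mean(|w|)` and the function name `ternarize` are assumptions used here for illustration. Given the threshold, the scale that minimizes the squared quantization error is the mean magnitude of the weights above the threshold.

```python
import numpy as np

def ternarize(w, delta_factor=0.7):
    """Quantize w to three values: -a, 0, and +a."""
    delta = delta_factor * np.mean(np.abs(w))         # threshold (assumed heuristic)
    mask = np.abs(w) > delta                          # weights that are not set to zero
    a = np.abs(w[mask]).mean() if mask.any() else 0.0 # error-minimizing scale for this mask
    return a * np.sign(w) * mask, a, delta

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)
w_ternary, a, delta = ternarize(w)
error = np.sum((w - w_ternary) ** 2)
```

An asymmetric variant would use two separate scales for the negative and the positive weights and treat them as trainable parameters instead of computing them in closed form.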
Rastegari et al. (2016) approximate full-precision weight filters in CNNs as $\mathbf{W} \approx \alpha \mathbf{B}$ where $\alpha$ is a scalar and $\mathbf{B}$ is a binary weight matrix. This reduces the bulk of floating-point multiplications inside the convolutions to either additions or subtractions, and only requires a single multiplication per output neuron with the scalar $\alpha$. In a further step, the layer inputs are quantized in a similar way to perform the convolution with only efficient XNOR operations and bitcount operations, followed by two floating-point multiplications per output neuron. Again, the STE is used during backpropagation. Lin et al. (2017b) generalized the ideas of Rastegari et al. (2016) by approximating the full-precision weights with linear combinations of multiple binary weight filters for improved classification accuracy.
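The approximation of a weight filter by a scaled binary filter can be sketched in a few lines. The sketch assumes that the binary filter is obtained with the sign function, in which case the scalar minimizing the squared approximation error is the mean absolute value of the weights; the function name `binarize_with_scale` is hypothetical.

```python
import numpy as np

def binarize_with_scale(W):
    """Approximate W by alpha * B with a binary B and a scalar alpha."""
    B = np.where(W >= 0, 1.0, -1.0)  # binary filter from the signs of the weights
    alpha = np.mean(np.abs(W))       # minimizes ||W - alpha * B||^2 for this B
    return alpha, B

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3, 3))   # hypothetical filter with 3 channels of size 3x3
alpha, B = binarize_with_scale(W)
error = np.sum((W - alpha * B) ** 2)
```

With this representation, a convolution only needs additions and subtractions inside the sum, followed by one multiplication with `alpha` per output value.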
While most activation binarization methods use the sign function which can be seen as an approximation to the tanh function, Cai et al. (2017) proposed a half-wave Gaussian quantization that more closely resembles the predominant ReLU activation function.
Motivated by the fact that weights and activations typically exhibit a non-uniform distribution, Miyashita et al. (2016) proposed to quantize values to powers of two. Their representation allows getting rid of expensive multiplications, and they report higher robustness to quantization than linear rounding schemes using the same number of bits. Zhou et al. (2017) proposed incremental network quantization where the weights of a pre-trained DNN are first partitioned into two sets, one of which is quantized to either zero or powers of two while the weights in the other set are kept at full-precision and retrained to recover the potential loss in accuracy due to quantization. They iterate partitioning, quantization, and retraining until all weights are quantized.
Jacob et al. (2018) proposed a quantization scheme that accurately approximates floating-point operations using only integer arithmetic to speed up computation. During training, their forward pass simulates the quantization step to keep the performance of the quantized DNN close to the performance when using single-precision. At test time, weights are represented as 8-bit integer values, reducing the memory footprint by a factor of four.
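The idea of representing real values with 8-bit integers can be sketched with a generic affine quantizer of the form `r ≈ scale * (q - zero_point)`. This is a simplified illustration under the assumption of a per-tensor range taken from the minimum and maximum value; it is not claimed to reproduce the exact scheme of any of the works above, and the function names are hypothetical.

```python
import numpy as np

def quantize_uint8(r):
    """Affine quantization of real values r to integers q in [0, 255]."""
    r = np.asarray(r, dtype=np.float64)
    r_min = min(float(r.min()), 0.0)     # extend the range so that zero is representable
    r_max = max(float(r.max()), 0.0)
    scale = (r_max - r_min) / 255.0 or 1.0            # fall back to 1.0 if all values are zero
    zero_point = int(round(-r_min / scale))           # integer that represents the value 0
    q = np.clip(np.round(r / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float64) - zero_point)

rng = np.random.default_rng(0)
weights = rng.standard_normal(1000)
q, scale, zero_point = quantize_uint8(weights)
max_error = np.max(np.abs(dequantize(q, scale, zero_point) - weights))
```

Storing `q` instead of 32-bit floating-point values requires one byte per weight plus the two quantization parameters, which corresponds to the factor of four mentioned above.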
Liu et al. (2018) introduced Bi-real net, a ResNet-inspired architecture where the residual path is implemented with efficient binary convolutions while the shortcut path is kept real-valued to maintain the expressiveness of the DNN. The residual in each layer is computed by first transforming the input with the sign activation, followed by a binary convolution, and a final batch normalization step.
Instead of using a fixed quantizer, in LQ-net (Zhang et al., 2018a) the quantizer is adapted during training. The proposed quantizer is inspired by the representation of integers as linear combinations $\mathbf{v}^\top \mathbf{b}$ with $\mathbf{v} = (2^0, \ldots, 2^{K-1})^\top$ and $\mathbf{b} \in \{0, 1\}^K$. The key idea is to consider a quantizer that assigns values to the nearest value representable as such a linear combination and to treat $\mathbf{v}$ as trainable parameters. It is shown that such a quantizer is compatible with efficient bit-operations. The quantizer is optimized during forward propagation by minimizing the quantization error objective with respect to $\mathbf{v}$ and the binary codes $\mathbf{b}$ by alternately fixing $\mathbf{b}$ and minimizing over $\mathbf{v}$, and vice versa. It is proposed to use layer-wise quantizers for the activations, i.e., an individual quantizer for each layer, and channel-wise quantizers for the weights.
Relaxed Quantization (Louizos et al., 2019) introduces a stochastic differentiable softrounding scheme. By injecting additive noise to the deterministic weights before rounding, one can compute probabilities of the weights being rounded to specific values in a predefined discrete set. Subsequently, these probabilities are used to differentiably round the weights using the Gumbel softmax approximation (Jang et al., 2017). Since this softrounding scheme produces only values that are close to values from the discrete set but which are not exactly from this set, the authors also propose a hard variant using the STE.
Zhou et al. (2016) presented several quantization schemes for the weights and the activations that allow for flexible bit widths. Furthermore, they also propose a quantization scheme for backpropagation to facilitate lowresource training. In agreement with earlier work mentioned above, they note that stochastic quantization is essential for their approach. In (Wu et al., 2018), weights, activations, weight gradients, and activation gradients are subject to customized quantization schemes that allow for variable bit widths and facilitate integer arithmetic during training and testing. In contrast to (Zhou et al., 2016), the work in (Wu et al., 2018) accumulates weight changes to lowprecision weights instead of fullprecision weights.
3.1.3 Bayesian Approaches for Quantization
In this section, we review some quantization approaches, most of which are closely related to the Bayesian variational inference framework (see Section 2.7).
The work of Achterhold et al. (2018) builds on the variational-dropout-based pruning approach of Louizos et al. (2017) (see Section 3.2.3). They introduce a mixture-of-log-uniforms prior whose components are centered at predefined quantization values. Consequently, the approximate posterior also concentrates at these values such that weights can be safely quantized without requiring a fine-tuning procedure.
The following works in this section directly operate on discrete weight distributions and, consequently, do not require a rounding procedure. Soudry et al. (2014) approximate the true posterior over discrete weights using expectation propagation (Minka, 2001) with closed-form online updates. Starting with an uninformative approximation, their approach combines the current approximation (serving as the prior in (10)) with the likelihood of a single-sample data set to obtain a refined posterior. To obtain a closed-form refinement step, they propose several approximations.
Although deviating from the Bayesian variational inference framework, as no similarity measure to the true posterior is optimized, the approach of Shayer et al. (2018) trains a distribution over either binary weights $w \in \{-1, 1\}$ or ternary weights $w \in \{-1, 0, 1\}$. They propose to minimize an expected loss with respect to the variational parameters of this distribution with gradient-based optimization using the local reparameterization trick (Kingma et al., 2015). After training has finished, the discrete weights are obtained by either sampling from the trained distribution or taking its mode. Since their approach is limited to the ReLU activation function, Peters and Welling (2018) extended their work to the sign activation function. This involves several non-trivial changes: because the sign activation has zero derivative, the local reparameterization trick must be performed after the sign function, and, consequently, distributions need to be propagated through commonly used building blocks such as batch normalization and pooling operations. Roth et al. (2019) further extended these works beyond three distinct discrete weights and introduced some technical improvements.
Havasi et al. (2019) introduced a novel Bayesian compression technique that we present in this section, although it is a coding technique rather than a quantization technique. In a nutshell, their approach first computes a variational distribution $q$ over real-valued weights using mean-field variational inference and then encodes a sample from $q$ in a clever way. They construct an approximation $\tilde{q}$ to $q$ by importance sampling using the prior $p$ as
$$\tilde{q}(w) = \sum_{k=1}^{K} \frac{q(w_k) / p(w_k)}{\sum_{k'=1}^{K} q(w_{k'}) / p(w_{k'})} \, \delta_{w_k}(w), \qquad w_k \sim p, \tag{11}$$
where $\delta_{w_k}$ denotes a point mass located at $w_k$. In the next step, a sample from $\tilde{q}$ (or, equivalently, an approximate sample from $q$) is drawn, which can be encoded by the corresponding index $k$ using $\log_2 K$ bits. Using the same random number generator initialized with the same seed as in (11), the weights can be recovered by sampling $K$ weights from the prior $p$ and selecting $w_k$. Since the number of samples $K$ required to obtain a reasonable approximation to $q$ in (11) grows exponentially with the number of weights, this sampling-based compression scheme is performed for smaller weight blocks such that each weight block can be encoded with $\log_2 K$ bits.
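The encode/decode round trip described above can be sketched as follows. This is a toy illustration for a single small weight block, assuming a zero-mean factorized Gaussian prior and a factorized Gaussian variational distribution; the function names, the block size, and the choice of $K = 1024$ candidates are assumptions made for the example and are not taken from Havasi et al. (2019).

```python
import numpy as np

def log_gauss(w, mu, sigma):
    """Log density of a factorized Gaussian, summed over the last axis."""
    return np.sum(-0.5 * ((w - mu) / sigma) ** 2 - np.log(sigma)
                  - 0.5 * np.log(2 * np.pi), axis=-1)

def candidates(seed, K, dim, sigma_p):
    """K samples from the prior; encoder and decoder regenerate them from the shared seed."""
    return np.random.default_rng(seed).normal(0.0, sigma_p, size=(K, dim))

def encode(mu_q, sigma_q, sigma_p, K, seed, select_rng):
    """Return the index k of a candidate drawn from the importance-weighted mixture."""
    w = candidates(seed, K, mu_q.size, sigma_p)
    log_ratio = log_gauss(w, mu_q, sigma_q) - log_gauss(w, 0.0, sigma_p)
    probs = np.exp(log_ratio - log_ratio.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return int(select_rng.choice(K, p=probs))

def decode(k, K, dim, sigma_p, seed):
    """Recover the weights from the index alone by replaying the prior samples."""
    return candidates(seed, K, dim, sigma_p)[k]

mu_q = np.array([0.3, -0.2, 0.1, 0.0])
sigma_q = np.full(4, 0.1)
K, seed = 1024, 123                      # the index costs log2(1024) = 10 bits
k = encode(mu_q, sigma_q, 1.0, K, seed, np.random.default_rng(7))
w_hat = decode(k, K, mu_q.size, 1.0, seed)
```

Only the integer `k` needs to be stored; the decoder reproduces the candidate set from the shared seed. The index selection uses a separate generator so that it does not disturb the shared candidate stream.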
3.2 Network Pruning
Network pruning methods aim to achieve parameter sparsity by setting a substantial number of DNN weights to zero. Subsequently, the sparsity is exploited to enhance the resource efficiency of the DNN. On the one hand, there exist unstructured pruning approaches that set individual weights to zero, regardless of their location in a weight tensor. Unstructured pruning approaches are typically less sensitive to accuracy degradation, but they require special sparse tensor data structures that in turn yield practical efficiency improvements only for very high sparsity. On the other hand, structured pruning methods aim to set whole weight structures to zero, e.g., by setting all weights of a matrix column to zero we would effectively prune an entire neuron. Conceptually, structured pruning is equivalent to removing tensor dimensions such that the reduced tensor remains compatible with highly optimized dense tensor operations.
In this section, we start with the unstructured case which includes many of the earlier approaches and continue with structured pruning that has been the focus of more recent works. Then we review approaches that relate to Bayesian principles before we discuss approaches that prune structures dynamically during forward propagation.
3.2.1 Unstructured Pruning
One of the earliest approaches to reduce the network size is the optimal brain damage algorithm of LeCun et al. (1989). Their main finding is that pruning based on weight magnitude is suboptimal, and they propose a pruning scheme based on the increase in the loss function. Assuming a pre-trained network, a local second-order Taylor expansion with a diagonal Hessian approximation is employed to estimate the change in the loss function caused by weight pruning without re-evaluating the costly network function. Removing parameters is alternated with retraining the pruned network. In this way, the model size can be reduced substantially without deteriorating its performance. Hassibi and Stork (1992) found the diagonal Hessian approximation to be too restrictive, and their optimal brain surgeon algorithm uses an approximated full covariance matrix instead. While their method, similar to (LeCun et al., 1989), prunes weights that cause the least increase in the loss function, the remaining weights are simultaneously adapted to compensate for the negative effect of weight pruning. This bypasses the need to alternate several times between pruning and retraining the pruned network.
However, it is not clear whether these approaches scale up to modern DNN architectures since computing the required (diagonal) Hessians is substantially more demanding (if not intractable) for millions of weights. Therefore, many of the more recently proposed techniques still resort to magnitude-based pruning. Han et al. (2015) alternate between pruning connections below a certain magnitude threshold and retraining the pruned DNN. The results of this simple strategy are impressive, as the number of parameters in pruned DNNs is an order of magnitude smaller ($9\times$ for AlexNet and $13\times$ for VGG-16) than in the original networks. Hence, this work shows that DNNs are often heavily over-parameterized. In a follow-up paper, Han et al. (2016) proposed deep compression, which extends the work in (Han et al., 2015) by a parameter quantization and parameter sharing step, followed by Huffman coding to exploit the non-uniform weight distribution. This approach yields a reduction in memory footprint by a factor of 35–49 and, consequently, a reduction in energy consumption by a factor of 3–5.
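The prune-and-retrain loop of magnitude-based pruning can be sketched in a few lines of NumPy. This is a minimal illustration with a sparsity-based threshold and a plain SGD step; the function names and the way the threshold is chosen are assumptions for the example, not the exact procedure of Han et al. (2015).

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    # The k-th smallest magnitude serves as the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def masked_sgd_step(weights, grad, mask, lr=0.01):
    """Retraining step that keeps pruned connections at zero."""
    return (weights - lr * grad) * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(10, 10))
w_pruned, mask = magnitude_prune(w, sparsity=0.9)
w_retrained = masked_sgd_step(w_pruned, rng.normal(size=w.shape), mask)
```

In practice the two functions would be called alternately, with the sparsity level increased gradually between rounds.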
Guo et al. (2016) discovered that irreversible pruning decisions limit the achievable sparsity and that it is useful to reincorporate weights pruned in an earlier stage. In addition to each dense weight matrix $W$, they maintain a corresponding binary mask matrix $M$ that determines whether a weight is currently pruned or not. In particular, the actual weights used during forward propagation are obtained as $W \odot M$, where $\odot$ denotes element-wise multiplication. Their method alternates between updating the weights based on gradient descent and updating the weight masks by thresholding the real-valued weights according to
$$M_{ij}^{t+1} = \begin{cases} 0 & \text{if } |W_{ij}^{t+1}| < a, \\ M_{ij}^{t} & \text{if } a \le |W_{ij}^{t+1}| < b, \\ 1 & \text{if } |W_{ij}^{t+1}| \ge b, \end{cases} \tag{12}$$
where $a$ and $b$ are two thresholds and $t$ refers to the iteration number. Most importantly, using the STE, weight updates are also applied to the weights that are currently pruned according to $M$, such that pruned weights can reappear in (12). This reduces the number of parameters of AlexNet by a factor of 17.7 without deteriorating performance.
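The two-threshold mask update can be sketched as a small hysteresis rule. The sketch below assumes a lower threshold `a` below which weights are pruned, an upper threshold `b` at or above which weights are (re)activated, and an unchanged mask in between; the function names are hypothetical, and the gradient is taken as given rather than computed by backpropagation.

```python
import numpy as np

def update_mask(weights, mask, a, b):
    """Prune below a, (re)activate at or above b, otherwise keep the previous decision."""
    magnitude = np.abs(weights)
    new_mask = mask.copy()
    new_mask[magnitude < a] = 0
    new_mask[magnitude >= b] = 1
    return new_mask

def dense_update(weights, grad_wrt_masked_weights, lr=0.1):
    """STE-style step: the gradient with respect to the masked weights is applied
    to all dense weights, including currently pruned ones, so they can regrow."""
    return weights - lr * grad_wrt_masked_weights

w = np.array([0.05, 0.3, 0.8, 0.3])
m = np.array([1, 1, 0, 0])
m_next = update_mask(w, m, a=0.1, b=0.5)   # array([0, 1, 1, 0])
effective = w * m_next                     # weights used in the forward pass
```

The third weight was pruned earlier but has grown past `b` and is restored, whereas the two weights of magnitude 0.3 simply keep their previous state.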
3.2.2 Structured Pruning
In (Mariet and Sra, 2016), a determinantal point process (DPP) is used to find a group of neurons that are diverse and exhibit little redundancy. Conceptually, a DPP for a given ground set defines a distribution over subsets of that ground set, where subsets containing diverse elements have high probability. Their approach treats the set of vectors that individual neurons compute over the whole data set (one entry per data sample) as the ground set, samples a diverse set of neurons according to the DPP, and then prunes the other neurons. To compensate for the negative effect of pruning, the outgoing weights of the remaining neurons after pruning are adapted so as to minimize the activation change of the next layer.
Wen et al. (2016) incorporated group lasso regularizers in the objective to obtain different kinds of sparsity in the course of training. They were able to remove filters, channels, and even entire layers in architectures containing shortcut connections. Liu et al. (2017) proposed to introduce an $\ell_1$ regularizer on the scale parameters $\gamma$ of batch normalization and to set $\gamma = 0$ by thresholding. Since each batch normalization parameter corresponds to a particular channel in the network, this effectively results in channel pruning with only minimal changes to existing training pipelines. In (Huang and Wang, 2018), the outputs of different structures are scaled with individual trainable scaling factors. By using a sparsity-enforcing regularizer on these scaling factors, the outputs of the corresponding structures are driven to zero and can be pruned.
Rather than pruning based on small parameter values, ThiNet (Luo et al., 2017) is a data-driven approach that prunes channels having the least impact on the subsequent layer. When pruning channels in a given layer, they propose to sample several activations at random spatial locations and random channels of the following layer, and to greedily prune channels whose removal results in the least increase of squared error over these randomly selected activations. After pruning, they adapt the remaining filters to minimize the squared reconstruction error by solving a least-squares problem.
Louizos et al. (2018) propose to multiply weights with stochastic binary 0-1 gates associated with trainable probability parameters that effectively determine whether a weight should be pruned or not. They formulate an expected loss with respect to the distribution over the stochastic binary gates, and by incorporating an expected $\ell_0$ regularizer over the weights, the probability parameters associated with these gates are encouraged to be close to zero. To enable the use of the reparameterization trick, a continuous relaxation of the binary gates using a modified binary Gumbel softmax distribution is used (Jang et al., 2017). They show that their approach can be used for structured sparsity by associating the stochastic gates with entire structures such as channels. Li and Ji (2019) extended this work by using the recently proposed unbiased ARM gradient estimator (Yin and Zhou, 2019) instead of the biased Gumbel softmax approximation.
3.2.3 Bayesian Pruning
In (Graves, 2011; Blundell et al., 2015), mean-field variational inference is employed to obtain a factorized Gaussian approximation $q(w) = \mathcal{N}(\mu, \sigma^2)$, i.e., in addition to a (mean) weight $\mu$ they also train a weight variance $\sigma^2$. After training, weights are pruned by thresholding the “signal-to-noise ratio” $|\mu| / \sigma$.
Molchanov et al. (2017) proposed a method based on variational dropout (Kingma et al., 2015), which interprets dropout as performing variational inference with specific prior and approximate posterior distributions. Within this framework, the otherwise fixed dropout rates of Gaussian dropout appear as free parameters that can be optimized to improve a variational lower bound. In (Molchanov et al., 2017), this freedom is exploited to optimize individual weight dropout rates such that weights can be safely pruned if their dropout rate is close to one. This idea has been extended in (Louizos et al., 2017) by using sparsity-enforcing priors and assigning dropout rates to groups of weights that are all connected to the same structure, which in turn allows for structured pruning. Furthermore, they show how their approach can be used to determine an appropriate bit width for each weight by exploiting the well-known connection between Bayesian inference and the minimum description length (MDL) principle (Grünwald, 2007).
3.2.4 Dynamic Network Pruning
So far, we have presented methods that result in a fixed reduced architecture. In the following, we present methods that determine dynamically in the course of forward propagation which structures should be computed, or, equivalently, which structures should be pruned. The intuition behind this idea is to vary the time spent for computing predictions based on the difficulty of the given input samples.
Lin et al. (2017a) proposed to train, in addition to the DNN, a recurrent neural network (RNN) decision network which determines the channels to be computed using reinforcement learning. In each layer, the feature maps are compressed using global pooling and fed into the RNN, which aggregates state information over the layers to compute its pruning decisions.
In (Dong et al., 2017), convolutional layers of a DNN are extended by a parallel lowcost convolution whose output after the ReLU function is used to scale the outputs of the potentially highcost convolution. Due to the ReLU function, several outputs of the lowcost convolution will be exactly zero such that the computation of the corresponding output of the highcost convolution can be omitted. For the lowcost convolution, they propose to use weight tensors and . However, practical speedups are only reported for the convolution where all channels at a given spatial location might get set to zero.
In a similar approach proposed by Gao et al. (2019), the spatial dimensions of a feature map are reduced by global average pooling to a vector, which is linearly transformed using a single low-cost fully connected layer. To obtain a sparse vector, the result is fed into the ReLU function, followed by a winner-takes-all function that sets all entries of a vector to zero that are not among the $k$ largest entries in absolute value. By multiplying this sparse vector in a channel-wise manner with the output of a high-cost convolution, all but at most $k$ channels will be zero and need not be computed. The number $k$ of channels is derived from a predefined minimal pruning ratio hyperparameter.
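The gating pipeline of this approach (global average pooling, a cheap fully connected layer, ReLU, and winner-takes-all) can be sketched as follows. To keep the example short, the "high-cost" convolution is stood in for by a $1 \times 1$ convolution written as an `einsum`; all names, shapes, and the value of `k` are assumptions for the illustration.

```python
import numpy as np

def channel_gates(feature_map, fc_weight, fc_bias, k):
    """Predict per-output-channel gates and keep only the k largest (k >= 1)."""
    pooled = feature_map.mean(axis=(1, 2))                     # (C_in,)
    gates = np.maximum(fc_weight @ pooled + fc_bias, 0.0)      # FC layer + ReLU -> (C_out,)
    if k < gates.size:
        smallest = np.argpartition(gates, -k)[:-k]             # indices outside the top k
        gates[smallest] = 0.0
    return gates

def gated_pointwise_conv(feature_map, conv_weight, gates):
    """Compute only the output channels whose gate is non-zero."""
    out = np.zeros((conv_weight.shape[0],) + feature_map.shape[1:])
    active = np.flatnonzero(gates)
    out[active] = gates[active, None, None] * np.einsum(
        "oc,chw->ohw", conv_weight[active], feature_map)
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 5))                                 # (C_in, H, W)
fc_w, fc_b = rng.normal(size=(16, 8)), rng.normal(size=16)
conv_w = rng.normal(size=(16, 8))                              # 1x1 conv with 16 output channels
g = channel_gates(x, fc_w, fc_b, k=4)
y = gated_pointwise_conv(x, conv_w, g)
```

Because the gates are non-negative after the ReLU, selecting the largest entries coincides with selecting the largest entries in absolute value.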
3.3 Structural Efficiency in DNNs
In this section, we review strategies that establish certain structural properties in DNNs to improve computational efficiency. Each of the proposed subcategories in this section follows rather different principles and the individual techniques might not be mutually exclusive.
3.3.1 Weight Sharing
Another technique to reduce the model size is weight sharing. In (Chen et al., 2015), a hashing function is used to randomly group network connections into “buckets”, where the connections in each bucket share the same weight value. This has the advantage that weight assignments need not be stored explicitly since they are given implicitly by the hashing function. The authors show a memory footprint reduction by a factor of 10 while keeping the predictive performance essentially unaffected.
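A minimal sketch of hashing-based weight sharing is given below. It uses `zlib.crc32` from the Python standard library as a stand-in hash function, which is an assumption made for the example and not necessarily the hash used by Chen et al. (2015); the function names are likewise hypothetical.

```python
import zlib
import numpy as np

def bucket_index(i, j, num_buckets, seed=0):
    """Deterministically hash the connection (i, j) into one of num_buckets buckets."""
    return zlib.crc32(f"{seed}:{i}:{j}".encode()) % num_buckets

def virtual_weight_matrix(shared, shape, seed=0):
    """Expand the shared parameter vector into the full (virtual) weight matrix."""
    rows, cols = shape
    idx = np.array([[bucket_index(i, j, shared.size, seed) for j in range(cols)]
                    for i in range(rows)])
    return shared[idx]

rng = np.random.default_rng(0)
shared = rng.normal(size=16)                     # the only stored parameters
W = virtual_weight_matrix(shared, (8, 20))       # 160 connections, 16 distinct values
```

Only `shared` and the seed need to be stored, so the 160 connections in this toy layer cost 16 parameters, a tenfold reduction.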
Ullrich et al. (2017) extended the soft weight sharing approach proposed in (Nowlan and Hinton, 1992) to achieve both weight sharing and sparsity. The idea is to select a Gaussian mixture model prior over the weights and to train both the weights and the parameters of the mixture components. During training, the mixture components collapse to point measures and each weight gets attracted by a certain weight component. After training, weight sharing is obtained by assigning each weight to the mean of the component that best explains it, and weight pruning is obtained by assigning a relatively high mixture mass to a component with a fixed mean at zero.
Roth and Pernkopf (2018) introduced a Dirichlet process prior over the weight distribution to enforce weight sharing in order to reduce the memory footprint of an ensemble of DNNs. They propose a sampling based inference scheme by alternately sampling weight assignments using Gibbs sampling and sampling weights using hybrid Monte Carlo (Neal, 1992). By using the same weight assignments for multiple weight samples, the memory overhead for the weight assignments becomes negligible and the total memory footprint of an ensemble of DNNs is reduced.
3.3.2 Knowledge Distillation
Knowledge distillation (Hinton et al., 2015) is an indirect approach where first a large DNN (or an ensemble of DNNs) is trained, and subsequently soft labels obtained from the softmax output of the large DNN are used as training data for a smaller DNN. The smaller DNNs achieve performance almost identical to that of the larger DNNs, which is attributed to the valuable information contained in the soft labels. Inspired by knowledge distillation, Korattikara et al. (2015) reduced a large ensemble of DNNs, used for obtaining Monte Carlo estimates of a posterior predictive distribution, to a single DNN.
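Training on soft labels amounts to minimizing the cross-entropy between the teacher's and the student's softmax outputs. The sketch below illustrates this loss in NumPy; the temperature parameter follows the common formulation of distillation and is an assumption here, as are the function names and the random logits.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean cross-entropy between the teacher's soft labels and the student's predictions."""
    soft_labels = softmax(teacher_logits, temperature)
    log_student = np.log(softmax(student_logits, temperature))
    return float(-(soft_labels * log_student).sum(axis=-1).mean())

rng = np.random.default_rng(0)
teacher = rng.normal(size=(5, 10))      # logits of the large DNN for 5 samples, 10 classes
student = rng.normal(size=(5, 10))      # logits of an untrained small DNN
loss_untrained = distillation_loss(student, teacher)
loss_matched = distillation_loss(teacher, teacher)
```

The loss is smallest when the student reproduces the teacher's output distribution, so `loss_matched` is a lower bound for this teacher.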
3.3.3 Special Matrix Structures
In this section we review approaches that aim at reducing the model size by employing efficient matrix representations. There exist several methods using lowrank decompositions which represent a large matrix (or a large tensor) using only a fraction of the parameters. In most cases, the implicitly represented matrix is never computed explicitly such that also a computational speedup is achieved. Furthermore, there exist approaches using special matrices that are specified by only a few parameters and whose structure allows for extremely efficient matrix multiplications.
Denil et al. (2013) proposed a method that is motivated by training only a subset of the weights and predicting the values of the other weights from this subset. In particular, they represent weight matrices $W \in \mathbb{R}^{m \times n}$ using a low-rank approximation $W \approx UV$ with $U \in \mathbb{R}^{m \times k}$, $V \in \mathbb{R}^{k \times n}$, and $k \ll \min(m, n)$ to reduce the number of parameters. Instead of learning both factors $U$ and $V$, prior knowledge, such as smoothness of pixel intensities in an image, is incorporated to compute a fixed $U$ using kernel techniques or autoencoders, and only the factor $V$ is learned.
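The parameter and compute savings of a low-rank factorization follow directly from the shapes of the two factors. The following sketch uses random factors and illustrative dimensions (all assumed for the example) to show that the full matrix never has to be formed explicitly.

```python
import numpy as np

m, n, k = 256, 512, 16
rng = np.random.default_rng(0)
U = rng.normal(size=(m, k))       # first factor (fixed in the approach described above)
V = rng.normal(size=(k, n))       # second factor (learned)
x = rng.normal(size=n)

y = U @ (V @ x)                   # two thin products: k * (m + n) multiply-accumulates

params_full = m * n               # 131072 parameters for the dense matrix
params_low_rank = k * (m + n)     # 12288 parameters for both factors
```

Evaluating `U @ (V @ x)` instead of `(U @ V) @ x` gives the same result while avoiding the $m \times n$ product.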
In (Novikov et al., 2015), the tensor train matrix format is employed to substantially reduce the number of parameters required to represent large weight matrices of fully connected layers. Their approach enables the training of very large fully connected layers with relatively few parameters, and they achieve improved performance compared to simple lowrank approximations.
Denton et al. (2014) propose specific low-rank approximations and clustering techniques for individual layers of pre-trained CNNs to reduce both memory footprint and computational overhead. Their approach yields substantial improvements for both the computational bottleneck in the convolutional layers and the memory bottleneck in the fully connected layers. By fine-tuning after applying their approximations, the performance degradation is kept small. Jaderberg et al. (2014) propose two different methods to approximate pre-trained CNN filters as combinations of rank-1 basis filters to speed up computation. The rank-1 basis filters are obtained either by minimizing a reconstruction error of the original filters or by minimizing a reconstruction error of the outputs of the convolutional layers. Lebedev et al. (2015) approximate the convolution tensor using the canonical polyadic (CP) decomposition, a generalization of low-rank matrix decompositions to tensors, which they compute using non-linear least squares. Subsequently, the convolution using this low-rank approximation is performed by four consecutive convolutions, each with a smaller filter, to reduce the computation time substantially.
In (Cheng et al., 2015), the weight matrices of fully connected layers are restricted to circulant matrices $W \in \mathbb{R}^{n \times n}$, which are fully specified by only $n$ parameters. While this dramatically reduces the memory footprint of fully connected layers, circulant matrices also facilitate faster computation as matrix-vector multiplication can be efficiently computed using the fast Fourier transform. In a similar vein, Yang et al. (2015) reparameterize matrices $W \in \mathbb{R}^{n \times n}$ of fully connected layers using the Fastfood transform as $W = SHG\Pi HB$, where $S$, $G$, and $B$ are diagonal matrices, $\Pi$ is a random permutation matrix, and $H$ is the Walsh-Hadamard matrix. This reparameterization requires only a total of $3n$ parameters, and similar to (Cheng et al., 2015), the fast Hadamard transform enables an efficient computation of matrix-vector products.
3.3.4 Manual Architecture Design
Instead of modifying existing architectures to make them more efficient, manual architecture design is concerned with the development of new architectures that are inherently resourceefficient. Over the past years, several design principles and building blocks for DNN architectures have emerged that exhibit favorable computational properties and sometimes also improve performance.
CNN architectures are typically designed to have a transition from convolutional layers to fully connected layers. At this transition, activations at all spatial locations of each channel are typically used as individual input features for the following fully connected layer. Since the number of these features is typically large, there is a memory bottleneck for storing the parameters of the weight matrix especially in the first fully connected layer.
Lin et al. (2014a) introduced two concepts that have been widely adopted by subsequent works. The first one, global average pooling, largely solves the above-mentioned memory issue at the transition to fully connected layers. Global average pooling reduces the spatial dimensions of each channel into a single feature by averaging over all values within a channel. This reduces the number of features at the transition drastically, and, by having the same number of channels as there are classes, it can also be used to get rid of fully connected layers completely. Second, they used $1 \times 1$ convolutions, whose weight kernels can essentially be seen as performing the operation of a fully connected layer over each spatial location across all channels.
These $1 \times 1$ convolutions have been adopted by several popular architectures (Szegedy et al., 2015; He et al., 2016; Huang et al., 2017) and, due to their favorable computational properties compared to convolutions that take a spatial neighborhood into account, have later also been exploited to improve computational efficiency. For instance, InceptionNet (Szegedy et al., 2015) proposed to split standard $k \times k$ convolutions into two cheaper convolutions: (i) a $1 \times 1$ convolution to reduce the number of channels such that (ii) a subsequent $k \times k$ convolution is performed faster. Similar ideas are used in SqueezeNet (Iandola et al., 2016), which uses $1 \times 1$ convolutions to reduce the number of channels, whose output is subsequently fed into a parallel $1 \times 1$ and $3 \times 3$ convolution, respectively. In addition, SqueezeNet uses the output of a global average pooling of per-class channels directly as input to the softmax in order to avoid fully connected layers, which typically consume the most memory. Furthermore, by using deep compression (Han et al., 2016) (see Section 3.2.1), the memory footprint was reduced to less than 0.5 MB.
Szegedy et al. (2016) extended the InceptionNet architecture by spatially separable convolutions to reduce the computational complexity, i.e., a $k \times k$ convolution is split into a $k \times 1$ convolution followed by a $1 \times k$ convolution. In MobileNet (Howard et al., 2017), depthwise separable convolutions are used to split a standard convolution in a different way: (i) a depthwise $k \times k$ convolution and (ii) a $1 \times 1$ convolution. The depthwise convolution applies a $k \times k$ filter to each channel separately without taking the other channels into account, whereas the $1 \times 1$ convolution then aggregates information across channels. Although these two cheaper convolutions together are less expressive than a standard convolution, they can be used to trade off a small loss in prediction accuracy for a drastic reduction in computational overhead and memory requirements.
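The savings of a depthwise separable convolution can be checked with simple arithmetic. The sketch below counts weights (biases ignored) and multiply-accumulate operations for an illustrative layer; the layer sizes and function names are assumptions chosen for the example.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights of a depthwise k x k convolution followed by a 1 x 1 convolution."""
    depthwise = k * k * c_in       # one k x k filter per input channel
    pointwise = c_in * c_out       # 1 x 1 convolution mixing the channels
    return depthwise + pointwise

def macs(params, height, width):
    """Multiply-accumulates for a stride-1 layer with 'same' padding."""
    return params * height * width

k, c_in, c_out, h, w = 3, 64, 128, 32, 32
p_std = standard_conv_params(k, c_in, c_out)          # 73728
p_sep = depthwise_separable_params(k, c_in, c_out)    # 576 + 8192 = 8768
reduction = p_std / p_sep                             # about 8.4
```

For stride-1 layers the same ratio applies to the number of multiply-accumulates, since both counts scale with the spatial size of the output.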
Sandler et al. (2018) extended these ideas in their MobileNetV2 to an architecture with residual connections. A typical residual block with bottleneck structure in ResNet (He et al., 2016) contains a bottleneck $1 \times 1$ convolution to reduce the number of channels, followed by a $3 \times 3$ convolution, followed by another $1 \times 1$ convolution to restore the original number of channels. Contrary to that building block, MobileNetV2 introduces an inverted bottleneck structure where the shortcut path contains the bottleneck and the residual path performs computations in a high-dimensional space. In particular, the residual path performs a $1 \times 1$ convolution to increase the number of channels, followed by a cheap depthwise $3 \times 3$ convolution, followed by another $1 \times 1$ convolution to reduce the number of channels again. They show that their inverted structure is more memory efficient since the shortcut path, which needs to be kept in memory during computation of the residual path, is considerably smaller. Furthermore, they show improved performance compared to the standard bottleneck structure.
While it was more of a technical detail than a contribution on its own, AlexNet (Krizhevsky et al., 2012) used grouped convolutions with two groups to facilitate model parallelism for training on two GPUs with relatively little memory capacity. Instead of computing a convolution using a single weight tensor, a grouped convolution splits the input into $G$ groups of channels that are independently processed using $G$ smaller weight tensors. The outputs of these $G$ convolutions are then stacked again such that the same number of input and output channels is maintained while considerably reducing the computational overhead and memory footprint.
Although this reduces the expressiveness of the convolutional layer since there is no interaction between the different groups, Xie et al. (2017) used grouped convolutions to enlarge the number of channels of a ResNet model, which resulted in accuracy gains while keeping the computational complexity of the original ResNet model approximately the same. Zhang et al. (2018b) introduced a ResNet-inspired architecture called ShuffleNet, which employs grouped $1 \times 1$ convolutions since $1 \times 1$ convolutions have been identified as computational bottlenecks in previous works, e.g., see (Howard et al., 2017). To combine the computational efficiency of grouped convolutions with the expressiveness of a full convolution, ShuffleNet incorporates channel shuffle operations after grouped convolutions to partly recover the interaction between different groups.
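Both the saving of a grouped convolution and the channel shuffle operation are easy to illustrate. The sketch below counts the weights of a grouped convolution and implements the shuffle as a reshape and transpose on a `(channels, height, width)` array; the layout and the function names are assumptions for the example.

```python
import numpy as np

def grouped_conv_params(k, c_in, c_out, groups):
    """Weights of a k x k convolution split into `groups` independent groups."""
    assert c_in % groups == 0 and c_out % groups == 0
    return groups * (k * k * (c_in // groups) * (c_out // groups))

def channel_shuffle(x, groups):
    """Interleave channels so that the next grouped convolution sees every group."""
    c, h, w = x.shape
    assert c % groups == 0
    return x.reshape(groups, c // groups, h, w).transpose(1, 0, 2, 3).reshape(c, h, w)

# Six channels whose values equal their channel index, split into two groups.
x = np.arange(6).reshape(6, 1, 1) * np.ones((6, 2, 2))
shuffled = channel_shuffle(x, groups=2)
order = shuffled[:, 0, 0].astype(int).tolist()   # [0, 3, 1, 4, 2, 5]
```

With `groups` groups the weight count shrinks by a factor of `groups`, and the shuffle ensures that each group of the following layer receives channels from every group of the previous one.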
3.3.5 Neural Architecture Search
Neural architecture search (NAS) is a recently emerging field concerned with the automatic discovery of good DNN architectures. This is achieved by designing a discrete space of possible architectures in which we subsequently search for an architecture optimizing some objective – typically the validation error. By incorporating a measure of resourceefficiency into this objective, this technique has recently attracted attention for the automatic discovery of resourceefficient architectures.
The task is very challenging: On the one hand, evaluating the validation error is timeconsuming as it requires a full training run and typically only results in a noisy estimate thereof. On the other hand, the space of architectures is typically of exponential size in the number of layers. Hence, the space of architectures needs to be carefully designed in order to facilitate an efficient search within that space.
The influential work of Zoph and Le (2017) introduced a scheme to encode DNN architectures of arbitrary depth as sequences of tokens which can be sampled from a controller RNN. This controller RNN is trained with reinforcement learning to generate well-performing architectures using the validation error on a held-out validation set as a reward signal. However, the training effort is enormous, as more than 10,000 training runs are required to achieve state-of-the-art performance on CIFAR-10. This would be impractical on larger data sets like ImageNet, a limitation that was partly addressed by subsequent NAS approaches, e.g., in (Zoph et al., 2018). In this review, we want to highlight methods that also consider resource-efficiency constraints in the NAS.
In MnasNet (Tan et al., 2018), an RNN controller is trained by also considering the latency of the sampled DNN architecture measured on a real mobile device. They achieve performance improvements under predefined latency constraints on a specific device. To run MnasNet on the large-scale ImageNet and COCO data sets (Lin et al., 2014b), their algorithm is run on a proxy task by only training for five epochs, and only the most promising DNN architectures are trained using more epochs.
Instead of generating architectures using a controller, ProxylessNAS (Cai et al., 2019) uses a heavily over-parameterized model where each layer contains several parallel paths, each computing a different architectural block with its individual parameters. After each layer, probability parameters for selecting a particular architectural block are introduced, which are trained via backpropagation using the STE. After training, the most probable path determines the selected architecture. To favor resource-efficient architectures, a latency model is built using measurements on a specific real device, and its predicted latencies are used as a differentiable regularizer in the cost function. In their experiments, they show that different target devices prefer different DNN architectures to obtain a low latency.
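One way to turn predicted latencies into a differentiable regularizer is to take the expectation of the per-block latencies under the path probabilities. The sketch below illustrates this idea only; the latency values, the regularization weight `lam`, and the function names are made up for the example and do not come from measurements or from the ProxylessNAS implementation.

```python
import numpy as np

def softmax(z):
    """Path probabilities from architecture parameters."""
    e = np.exp(z - z.max())
    return e / e.sum()

def expected_latency(arch_logits, latency_ms):
    """Sum over layers of the probability-weighted latency of the candidate blocks."""
    return sum(float(softmax(a) @ l) for a, l in zip(arch_logits, latency_ms))

def regularized_loss(task_loss, arch_logits, latency_ms, lam=0.1):
    """Task loss plus a latency penalty that is smooth in the architecture parameters."""
    return task_loss + lam * expected_latency(arch_logits, latency_ms)

# Two layers with three candidate blocks each; latencies are hypothetical.
arch_logits = [np.zeros(3), np.zeros(3)]
latency_ms = [np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])]
lat = expected_latency(arch_logits, latency_ms)        # uniform paths: 2.0 + 4.0
loss = regularized_loss(1.0, arch_logits, latency_ms)
```

Because the expectation is a smooth function of the architecture parameters, its gradient shifts probability mass towards faster blocks on the target device.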
Instead of using a different path for different operations in each layer, single-path NAS (Stamoulis et al., 2019) combines all operations in a single shared weight superblock such that each operation uses a subset of this superblock. A weight-magnitude-based decision using trainable threshold parameters determines which operation should be performed, allowing for gradient-based training of both the weight parameters and the architecture. Again, the STE is employed to backpropagate through the threshold function.
Liu et al. (2019) have replicated several experiments of pruning approaches (see Section 3.2) and they observed that the typical workflow of training, pruning, and finetuning is often not necessary and only the discovered sparsity structure is important. In particular, they show for several pruning approaches that randomly initializing the weights after pruning and training the pruned structure from scratch results in most cases in a similar performance as performing finetuning after pruning. They conclude that network pruning can also be seen as a paradigm for architecture search.
Tan and Le (2019) recently proposed EfficientNet, which employs NAS as a key component for finding a resource-efficient architecture. In the first step, they perform NAS to discover a small resource-efficient model, which is much cheaper than searching for a large model directly. In the next step, the discovered model is enlarged by a principled compound scaling approach which simultaneously increases the number of layers, the number of channels, and the spatial resolution. Although this approach does not target resource efficiency on its own, EfficientNet achieves state-of-the-art performance on ImageNet using a relatively small model.
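Compound scaling can be written as a small calculation. In the EfficientNet paper, depth, width, and resolution are multiplied by $\alpha^\phi$, $\beta^\phi$, and $\gamma^\phi$ for a user-chosen coefficient $\phi$, with $\alpha \beta^2 \gamma^2 \approx 2$ so that the FLOPs roughly double per unit of $\phi$. The sketch below uses the constants reported there ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$); the base model sizes and the rounding are assumptions for illustration only.

```python
import math

# Scaling constants as reported by Tan and Le (2019).
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(base_layers, base_channels, base_resolution, phi):
    """Scale depth, width, and input resolution jointly with one coefficient phi."""
    return (
        math.ceil(base_layers * ALPHA ** phi),
        math.ceil(base_channels * BETA ** phi),
        math.ceil(base_resolution * GAMMA ** phi),
    )


def flops_factor(phi):
    """Approximate growth in FLOPs: linear in depth, quadratic in width and resolution."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


# Hypothetical base model: 16 layers, 32 channels, 224x224 inputs.
for phi in range(4):
    print(phi, compound_scale(16, 32, 224, phi), round(flops_factor(phi), 2))
```

The single coefficient keeps the three dimensions balanced, so the expensive search only has to be run once on the small base model.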
4 Experimental Results
In this section, we provide experimental results for modern DNN architectures trained on well-known benchmark data sets. We focus our experiments on network quantization approaches as reviewed in Section 3.1, since they are among the earliest and most efficient approaches to enhance the computational efficiency of DNNs.
We first exemplify the tradeoff between model performance, memory footprint, and latency on the CIFAR-10 classification task in Section 4.1. This example highlights that finding a suitable balance between these requirements remains challenging due to diverse hardware and implementation issues. Furthermore, we compare several quantization approaches discussed in this paper on the more challenging CIFAR-100 task in Section 4.2. This experiment shows the different bit requirements of weights and activations as well as the need for advanced quantization approaches. In Section 4.3, we finally present a real-world speech enhancement example, where hardware-efficient binarized DNNs have led to dramatic memory and latency reductions.
The main focus of this section is to showcase the difficulty of finding good tradeoffs between resource efficiency and predictive performance. As this paper is mainly dedicated to giving a comprehensive literature overview of the current state of the art, an extensive evaluation of the many methods presented in Section 3 would be infeasible and is not within the scope of this paper. We refer the reader to the individual papers, and we highlight the work of Liu et al. (2019), which provides an extensive evaluation of several recent pruning methods.
4.1 Prediction Accuracy, Memory Footprint and Latency
To exemplify the tradeoff to be made between memory footprint, latency, and prediction accuracy, we implemented general matrix multiply (GEMM) with variable-length fixed-point representation on a mobile CPU (ARM Cortex A15), exploiting its NEON SIMD instructions. Using this implementation, we executed a ResNet consisting of 32 layers with custom quantization for the weights and the activations (Zhou et al., 2016), and compared the results with single-precision floating-point. We use the CIFAR-10 data set for object classification (Krizhevsky, 2009), which consists of $32 \times 32$ pixel RGB images containing the ten object classes airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The data set consists of 50,000 training images and 10,000 test images.
Figure 3 reports the impact of reduced precision on runtime latency, memory requirements, and test classification error. As can be seen, reducing the bit width to 16, 8, or 4 bits does not improve runtimes on this CPU. However, on reconfigurable or specialized hardware even 16, 8, and 4 bit representations can be beneficial. In contrast, bit widths of 2 and 1 do not require bit width doubling, and here we obtain runtimes close to the theoretical linear speedup. In terms of memory footprint, our implementation reaches the theoretical linear improvement.
While reducing the bit width of weights and activations to only 1 or 2 bits improves memory footprint and computation time substantially, these settings also show decreased classification performance. In this example, the sweet spot appears to be at 2 bits of precision, but the predictive performance at 1 bit might also be acceptable depending on the application. This extreme setting is beneficial for highly constrained scenarios and is easily exploited on today’s hardware, as shown in Section 4.3.
4.2 Comparison of Different Quantization Approaches
In the next experiment, we compare the performance of several quantization approaches. We use a DenseNet architecture (Huang et al., 2017) consisting of 100 layers with bottleneck and compression layers, i.e., a DenseNet-BC-100. We select the growth rate parameter . We perform our experiments on the CIFAR-100 data set, which is similar to the CIFAR-10 data set except that it contains 100 object classes; the image size and the sizes of the training and test sets are the same.
We selected some of the most popular quantization approaches (see Section 3.1) for the comparison, namely binary weight networks (BWN) (Courbariaux et al., 2015b), binarized neural networks (BNN) (Hubara et al., 2016), DoReFa-Net (Zhou et al., 2016), trained ternary quantization (TTQ) (Zhu et al., 2017), and LQ-Net (Zhang et al., 2018a). For this experiment, we quantize the DNNs in three modes: (i) weight-only, (ii) activation-only, and (iii) combined weight and activation quantization. Note, however, that some quantization approaches are designed for a particular mode, e.g., BWN and TTQ only consider weight quantization whereas BNN only considers combined weight and activation quantization.
Figure 4 reports the test error for different bit widths of the selected quantization approaches. The horizontal red line shows the test error of the real-valued baseline DenseNet-BC-100. For combined weight and activation quantization, we use the same bit width for the weights and the activations.
As expected, the test error decreases gradually with increasing bit widths for all quantization modes and for all quantization approaches. Furthermore, the results indicate that prediction performance is more sensitive to activation quantization than to weight quantization, which is in line with the results reported by many works reviewed in Section 3.1.
The more advanced LQ-Net approach clearly outperforms the rather simple linear quantization of DoReFa-Net and the specialized binary and ternary approaches. However, this performance improvement comes at the cost of longer training times. For instance, the training time per iteration increases by a factor of 1.5 for DoReFa-Net, compared to a factor of up to 4.6 (depending on the bit width) for LQ-Net.
4.3 A Real-World Example: Speech Mask Estimation using Reduced-Precision DNNs
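To make the notion of linear quantization concrete, the following sketch implements a uniform $k$-bit quantizer in the style of DoReFa-Net (Zhou et al., 2016): values in $[0, 1]$ are rounded to one of $2^k$ equally spaced levels, and weights are first squashed with tanh and rescaled to that interval. This is a forward-pass illustration only; the helper names are hypothetical, gradients (handled via the STE in practice) are omitted, and the special treatment that binary weights receive in the original method is not reproduced.

```python
import math


def quantize_unit(x, bits):
    """Round x in [0, 1] to the nearest of 2**bits equally spaced levels."""
    steps = 2 ** bits - 1
    return round(x * steps) / steps


def quantize_weights(weights, bits):
    """Map real-valued weights to 2**bits equally spaced levels in [-1, 1].

    Assumes at least one weight is non-zero.
    """
    squashed = [math.tanh(w) for w in weights]
    max_abs = max(abs(v) for v in squashed)
    return [
        2.0 * quantize_unit(v / (2.0 * max_abs) + 0.5, bits) - 1.0
        for v in squashed
    ]


# Hypothetical weights: the rounding error shrinks as the bit width grows.
weights = [-1.3, -0.4, 0.05, 0.2, 0.9]
for bits in (1, 2, 4, 8):
    quantized = quantize_weights(weights, bits)
    reference = quantize_weights(weights, 16)  # near-continuous reference
    error = sum(abs(q - r) for q, r in zip(quantized, reference)) / len(weights)
    print(bits, [round(q, 3) for q in quantized], round(error, 4))
```

The levels are fixed and equally spaced, which keeps the quantizer cheap; learned quantizers such as LQ-Net instead adapt the levels during training, which is one source of their additional training cost.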
We provide a complete example employing hardware-efficient binarized DNNs applied to acoustic beamforming, an important component of various speech enhancement systems. A particularly successful approach employs DNNs to estimate a speech mask, i.e., a speech presence probability for each time frame and frequency bin. This speech mask is used to determine the power spectral density (PSD) matrices of the multichannel speech and noise signals, which are subsequently used to obtain a beamforming filter such as the minimum variance distortionless response (MVDR) beamformer or the generalized eigenvector (GEV) beamformer (Warsitz and Haeb-Umbach, 2007; Warsitz et al., 2008; Heymann et al., 2016, 2015; Erdogan et al., 2016; Pfeifenberger et al., 2019). An overview of a multichannel speech enhancement setup is shown in Figure 5.
In this experiment, we compare single-precision DNNs and DNNs with binary quantization for the weights and the activations (BNNs), respectively, for the estimation of the speech mask. BNNs were trained using the STE (Hubara et al., 2016). For both architectures, the dominant eigenvector of the noisy speech PSD matrix (Pfeifenberger et al., 2019) is used as feature vector. For the BNN, this feature vector is quantized to 8 bit integer values. As output layer, a linear activation function is used, which reduces to counting the binary output neurons, followed by a normalization to yield the speech presence probability mask. Further details of the experimental setting can be found in Zöhrer et al. (2018).
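The step from a speech mask to PSD matrices can be sketched for a single frequency bin. The mask-weighted average of outer products used below is a common formulation in mask-based beamforming and is an assumption here, not necessarily the exact estimator of the cited works; the function name `masked_psd` and the toy data are hypothetical.

```python
def masked_psd(frames, mask):
    """Mask-weighted PSD matrix for one frequency bin.

    frames: list over time of per-channel complex STFT values, e.g. [[y1, y2], ...].
    mask:   list over time of weights in [0, 1]; assumed not to sum to zero.
    """
    n_channels = len(frames[0])
    psd = [[0j] * n_channels for _ in range(n_channels)]
    for y, m in zip(frames, mask):
        for i in range(n_channels):
            for j in range(n_channels):
                # Accumulate the outer product y y^H weighted by the mask value.
                psd[i][j] += m * y[i] * y[j].conjugate()
    norm = sum(mask)
    return [[value / norm for value in row] for row in psd]


# Toy example: two channels, three time frames, one frequency bin.
frames = [[1 + 0j, 1j], [2 + 0j, 0j], [0.5 + 0.5j, 1 + 0j]]
speech_mask = [0.9, 0.1, 0.8]  # assumed mask predicted by the DNN or BNN

speech_psd = masked_psd(frames, speech_mask)
noise_psd = masked_psd(frames, [1.0 - m for m in speech_mask])
print(speech_psd)
print(noise_psd)
```

Both matrices are Hermitian by construction and serve as the inputs from which a beamforming filter such as MVDR or GEV is derived.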
4.3.1 Data and Experimental Setup
For evaluation, we used the CHiME corpus (Barker et al., 2015), which provides 2-channel and 6-channel recordings of a close-talking speaker corrupted by four different types of ambient noise. Ground truth utterances, i.e., the separated speech and noise signals, are available for all recordings, such that the ground truth speech masks can be computed for each time frame and frequency bin. In the test phase, the DNN is used to predict the speech mask for each utterance, which is subsequently used to estimate the corresponding beamformer. We evaluated three different DNNs, i.e., a single-precision 3-layer DNN with 513 neurons per layer, and two BNNs with 513 and 1024 neurons per layer, respectively. The DNNs were trained using ADAM (Kingma and Ba, 2015) with default parameters and dropout rate .
4.3.2 Speech Mask Accuracy
Figure 6 shows the ground truth and predicted speech masks of the DNN and BNNs for an example utterance (F01_22HC010W_BUS). We see that both methods yield very similar results and are in good agreement with the ground truth. Table 1 reports the prediction error in [%]. Although single-precision DNNs achieve the best test error, they do so only by a small margin. Doubling the network size of BNNs slightly improved the test error for the case of 6 channels.
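Table 1: Speech mask prediction error in [%].
| model | neurons / layer | channels | train | valid | test |
| --- | --- | --- | --- | --- | --- |
| DNN | 513 | 2ch | 5.8 | 6.2 | 7.7 |
| BNN | 513 | 2ch | 6.2 | 6.2 | 7.9 |
| BNN | 1024 | 2ch | 6.2 | 6.6 | 7.9 |
| DNN | 513 | 6ch | 4.5 | 3.9 | 4.0 |
| BNN | 513 | 6ch | 4.7 | 4.1 | 4.4 |
| BNN | 1024 | 6ch | 4.9 | 4.2 | 4.1 |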
4.3.3 Computation Savings for BNNs
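In order to show that the advantages of binary computation translate to other general-purpose processors, we implemented matrix-multiplication operators for NVIDIA GPUs and ARM CPUs. Classification in BNNs can be implemented very efficiently as 1-bit scalar products, i.e., the multiplication of two vectors $\mathbf{x}$ and $\mathbf{y}$ reduces to a bitwise XNOR operation, followed by counting the number of set bits with popc as
$$\mathbf{x}^\top \mathbf{y} = 2\,\mathrm{popc}\big(\mathrm{xnor}(\mathbf{x}, \mathbf{y})\big) - N, \qquad x_i, y_i \in \{-1, +1\}, \tag{13}$$
where $N$ denotes the dimension of the vectors.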
We use the matrix-multiplication algorithms of the MAGMA and Eigen libraries and replace floating-point multiplications by XNOR operations as in (13). Our CPU implementation uses NEON vectorization in order to fully exploit SIMD instructions on ARM processors. We report execution time of GPUs and ARM CPUs in Table 2. As can be seen, binary arithmetic offers considerable speedups over single-precision with manageable implementation effort. This also affects energy consumption since binary values require fewer off-chip accesses and operations. Performance results of x86 architectures are not reported because neither SSE nor AVX ISA extensions support vectorized popc.
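The XNOR and popcount formulation can be checked with a few lines of code. For two vectors with entries in $\{-1, +1\}$ of length $N$, encoded as bit strings with $+1 \mapsto 1$ and $-1 \mapsto 0$, the scalar product equals $2\,\mathrm{popc}(\mathrm{xnor}(\cdot,\cdot)) - N$, since each agreeing position contributes $+1$ and each disagreeing position $-1$. The sketch below is plain Python for illustration; the function names are hypothetical and it does not reproduce the MAGMA-, Eigen-, or NEON-based kernels.

```python
def pack_bits(vector):
    """Encode a {-1, +1} vector as an integer: +1 -> bit 1, -1 -> bit 0."""
    word = 0
    for position, value in enumerate(vector):
        if value == 1:
            word |= 1 << position
    return word


def binary_dot(x_bits, y_bits, n):
    """Scalar product of two packed {-1, +1} vectors of length n."""
    mask = (1 << n) - 1                  # keep only the n valid bit positions
    agree = ~(x_bits ^ y_bits) & mask    # XNOR: 1 where the signs agree
    popc = bin(agree).count("1")         # population count
    return 2 * popc - n


x = [1, -1, 1, 1, -1, -1, 1, -1]
y = [1, 1, -1, 1, -1, 1, 1, -1]
reference = sum(a * b for a, b in zip(x, y))
print(binary_dot(pack_bits(x), pack_bits(y), len(x)), reference)
```

On real hardware the same idea is applied to machine words, so that one XNOR and one popcount instruction replace many floating-point multiply-accumulate operations at once.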
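Table 2: Execution time of single-precision (float32) and binary matrix multiplication on GPUs and ARM CPUs.
| arch | matrix size | time float32 (ms) | time binary (ms) | speedup |
| --- | --- | --- | --- | --- |
| GPU | 256 | 0.14 | 0.05 | 2.8 |
| GPU | 513 | 0.34 | 0.06 | 5.7 |
| GPU | 1024 | 1.71 | 0.16 | 10.7 |
| GPU | 2048 | 12.87 | 1.01 | 12.7 |
| ARM | 256 | 3.65 | 0.42 | 8.7 |
| ARM | 513 | 16.73 | 1.43 | 11.7 |
| ARM | 1024 | 108.94 | 8.13 | 13.4 |
| ARM | 2048 | 771.33 | 58.81 | 13.1 |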
While improvements of memory footprint and computation time are independent of the underlying tasks, the prediction accuracy highly depends on the complexity of the data set and the used neural network. On the one hand, simple data sets, such as the previously evaluated speech mask example or MNIST, allow for aggressive quantization without severely affecting the prediction performance. On the other hand, as reported by several works reviewed in Section 3.1, binary or ternary quantization results in severe accuracy degradation on more complex data sets such as CIFAR-10/100 and ImageNet.
5 Conclusion
We presented an overview of the vast literature of the highly active research area concerned with resource efficiency of DNNs. We have identified three major directions of research, namely (i) network quantization, (ii) network pruning, and (iii) approaches that target efficiency at the structural level. Many of the presented works are orthogonal and can be used in conjunction to potentially further improve the results reported in the respective papers.
We have discovered several patterns in the individual strategies for enhancing resource efficiency. For quantization approaches, a common pattern in the most successful approaches is to combine real-valued representations, which help in maintaining the expressiveness of DNNs, with quantization to enhance the computationally intensive operations. For pruning methods, we observed that the trend is moving towards structured pruning approaches that obtain smaller models whose data structures are compatible with highly optimized dense tensor operations. On the structural level of DNNs, a lot of progress has been made in the development of specific building blocks that maintain a high expressiveness of the DNN while at the same time reducing the computational overhead substantially. The newly emerging neural architecture search (NAS) approaches are promising candidates to automate the design of application-specific architectures with negligible user interaction. However, it appears unlikely that current NAS approaches will discover new fundamental design principles as the resulting architectures highly depend on a-priori knowledge encoded in the architecture search space.
In experiments, we demonstrated the difficulty of finding a good tradeoff between computational complexity and predictive performance on several benchmark data sets. We demonstrated on a real-world speech example that binarized DNNs achieve predictive performance comparable to their real-valued counterpart, and that they can be efficiently implemented on off-the-shelf hardware.
Acknowledgments
This work was supported by the Austrian Science Fund (FWF) under the project number I2706-N31 and the German Research Foundation (DFG). Furthermore, we acknowledge the LEAD Project Dependable Internet of Things funded by Graz University of Technology. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie Grant Agreement No. 797223 (HYBSPN). We acknowledge NVIDIA for providing GPU computing resources.
References
 Achterhold et al. (2018) Jan Achterhold, Jan M. Köhler, Anke Schmeink, and Tim Genewein. Variational network quantization. In International Conference on Learning Representations (ICLR), 2018.
 Anderson and Berg (2018) Alexander G. Anderson and Cory P. Berg. The high-dimensional geometry of binary neural networks. In International Conference on Learning Representations (ICLR), 2018.
 Barker et al. (2015) Jon Barker, Ricard Marxer, Emmanuel Vincent, and Shinji Watanabe. The third ’CHiME’ speech separation and recognition challenge: Dataset, task and baselines. In IEEE 2015 Automatic Speech Recognition and Understanding Workshop (ASRU), 2015.
 Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.
 Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. In International Conference on Machine Learning (ICML), pages 1613–1622, 2015.
 Cai et al. (2019) Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations (ICLR), 2019.
 Cai et al. (2017) Zhaowei Cai, Xiaodong He, Jian Sun, and Nuno Vasconcelos. Deep learning with low precision by half-wave Gaussian quantization. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5406–5414, 2017.
 Chen et al. (2015) Wenlin Chen, James T. Wilson, Stephen Tyree, Kilian Q. Weinberger, and Yixin Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning (ICML), pages 2285–2294, 2015.
 Cheng et al. (2015) Yu Cheng, Felix X. Yu, Rogério Schmidt Feris, Sanjiv Kumar, Alok N. Choudhary, and Shih-Fu Chang. An exploration of parameter redundancy in deep networks with circulant projections. In International Conference on Computer Vision (ICCV), pages 2857–2865, 2015.
 Courbariaux et al. (2015a) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Training deep neural networks with low precision multiplications. In International Conference on Learning Representations (ICLR) Workshop, volume abs/1412.7024, 2015a.
 Courbariaux et al. (2015b) Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems (NIPS), pages 3123–3131, 2015b.
 Denil et al. (2013) Misha Denil, Babak Shakibi, Laurent Dinh, Marc’Aurelio Ranzato, and Nando de Freitas. Predicting parameters in deep learning. In Advances in Neural Information Processing Systems (NIPS), pages 2148–2156, 2013.
 Denton et al. (2014) Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems (NIPS), pages 1269–1277, 2014.
 Dong et al. (2017) Xuanyi Dong, Junshi Huang, Yi Yang, and Shuicheng Yan. More is less: A more complicated network with less inference complexity. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1895–1903, 2017.
 Erdogan et al. (2016) H. Erdogan, J. Hershey, S. Watanabe, M. Mandel, and J. L. Roux. Improved MVDR beamforming using single-channel mask prediction networks. In Interspeech, 2016.
 Gao et al. (2019) Xitong Gao, Yiren Zhao, Lukasz Dudziak, Robert Mullins, and Cheng-Zhong Xu. Dynamic channel pruning: Feature boosting and suppression. In International Conference on Learning Representations (ICLR), 2019.
 Graves (2011) Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 2348–2356, 2011.
 Grünwald (2007) P. D. Grünwald. The minimum description length principle. MIT Press, 2007.
 Guo et al. (2016) Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic network surgery for efficient DNNs. In Advances in Neural Information Processing Systems (NIPS), pages 1379–1387, 2016.
 Gupta et al. (2015) Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In International Conference on Machine Learning (ICML), pages 1737–1746, 2015.
 Han et al. (2015) Song Han, Jeff Pool, John Tran, and William J. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems (NIPS), pages 1135–1143, 2015.
 Han et al. (2016) Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations (ICLR), 2016.
 Hassibi and Stork (1992) Babak Hassibi and David G. Stork. Second order derivatives for network pruning: Optimal brain surgeon. In Advances in Neural Information Processing Systems (NIPS), pages 164–171, 1992.
 Havasi et al. (2019) Marton Havasi, Robert Peharz, and José Miguel Hernández-Lobato. Minimal random code learning: Getting bits back from compressed model parameters. In International Conference on Learning Representations (ICLR), 2019.
 He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
 Heymann et al. (2015) J. Heymann, L. Drude, A. Chinaev, and R. Haeb-Umbach. BLSTM supported GEV beamformer front-end for the 3RD CHiME challenge. In 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 444–451, 2015.
 Heymann et al. (2016) J. Heymann, L. Drude, and R. Haeb-Umbach. Neural network based spectral mask estimation for acoustic beamforming. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 196–200, 2016.
 Hinton and van Camp (1993) Geoffrey E. Hinton and Drew van Camp. Keeping the neural networks simple by minimizing the description length of the weights. In Conference on Computational Learning Theory (COLT), pages 5–13, 1993.
 Hinton et al. (2015) Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In Deep Learning and Representation Learning Workshop @ NIPS, 2015.
 Höhfeld and Fahlman (1992) Markus Höhfeld and Scott E. Fahlman. Learning with limited numerical precision using the cascade-correlation algorithm. IEEE Trans. Neural Networks, 3(4):602–611, 1992.
 Höhfeld and Fahlman (1992) Markus Höhfeld and Scott E. Fahlman. Probabilistic rounding in neural network learning with limited precision. Neurocomputing, 4(4):291–299, 1992.
 Howard et al. (2017) Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
 Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2261–2269, 2017.
 Huang and Wang (2018) Zehao Huang and Naiyan Wang. Data-driven sparse structure selection for deep neural networks. In European Conference on Computer Vision (ECCV), pages 317–334, 2018.
 Hubara et al. (2016) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 4107–4115, 2016.
 Iandola et al. (2016) Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1mb model size. CoRR, abs/1602.07360, 2016.
 Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning (ICML), pages 448–456, 2015.
 Jacob et al. (2018) Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2704–2713, 2018.
 Jaderberg et al. (2014) Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with low rank expansions. In British Machine Vision Conference (BMVC), 2014.
 Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations (ICLR), 2017.
 Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
 Kingma et al. (2015) Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems (NIPS), pages 2575–2583, 2015.
 Korattikara et al. (2015) Anoop Korattikara, Vivek Rathod, Kevin P. Murphy, and Max Welling. Bayesian dark knowledge. In Advances in Neural Information Processing Systems (NIPS), pages 3438–3446, 2015.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
 Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1106–1114, 2012.
 Lebedev et al. (2015) Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan V. Oseledets, and Victor S. Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. In International Conference on Learning Representations (ICLR), 2015.
 LeCun et al. (1989) Yann LeCun, John S. Denker, and Sara A. Solla. Optimal brain damage. In Advances in Neural Information Processing Systems (NIPS), pages 598–605, 1989.
 Li et al. (2016) Fengfu Li, Bo Zhang, and Bin Liu. Ternary weight networks. CoRR, abs/1605.04711, 2016.
 Li et al. (2017) Hao Li, Soham De, Zheng Xu, Christoph Studer, Hanan Samet, and Tom Goldstein. Training quantized nets: A deeper understanding. In Advances in Neural Information Processing Systems (NIPS), pages 5813–5823, 2017.
 Li and Ji (2019) Yang Li and Shihao Ji. ARM: Network sparsification via stochastic binary optimization. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2019.
 Lin et al. (2016) Darryl Dexu Lin, Sachin S. Talathi, and V. Sreekanth Annapureddy. Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning (ICML), pages 2849–2858, 2016.
 Lin et al. (2017a) Ji Lin, Yongming Rao, Jiwen Lu, and Jie Zhou. Runtime neural pruning. In Advances in Neural Information Processing Systems (NIPS), pages 2181–2191, 2017a.
 Lin et al. (2014a) Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. In International Conference on Learning Representations (ICLR), 2014a.
 Lin et al. (2014b) Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (ECCV), pages 740–755, 2014b.
 Lin et al. (2017b) Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary convolutional neural network. In Advances in Neural Information Processing Systems (NIPS), pages 344–352, 2017b.
 Lin et al. (2015) Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. CoRR, abs/1510.03009, 2015.
 Liu et al. (2018) Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and Kwang-Ting Cheng. Bi-Real net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In European Conference on Computer Vision (ECCV), pages 747–763, 2018.
 Liu et al. (2017) Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. Learning efficient convolutional networks through network slimming. In International Conference on Computer Vision (ICCV), pages 2755–2763, 2017.
 Liu et al. (2019) Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. In International Conference on Learning Representations (ICLR), 2019.
 Louizos et al. (2017) Christos Louizos, Karen Ullrich, and Max Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems (NIPS), pages 3288–3298, 2017.
 Louizos et al. (2018) Christos Louizos, Max Welling, and Diederik P. Kingma. Learning sparse neural networks through L_{0} regularization. In International Conference on Learning Representations (ICLR), 2018.
 Louizos et al. (2019) Christos Louizos, Matthias Reisser, Tijmen Blankevoort, Efstratios Gavves, and Max Welling. Relaxed quantization for discretized neural networks. In International Conference on Learning Representations (ICLR), 2019.
 Luo et al. (2017) Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. ThiNet: A filter level pruning method for deep neural network compression. In International Conference on Computer Vision (ICCV), pages 5068–5076, 2017.
 Mariet and Sra (2016) Zelda Mariet and Suvrit Sra. Diversity networks: Neural network compression using determinantal point processes. In International Conference on Learning Representations (ICLR), 2016.
 Minka (2001) Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Uncertainty in Artificial Intelligence (UAI), pages 362–369, 2001.
 Miyashita et al. (2016) Daisuke Miyashita, Edward H. Lee, and Boris Murmann. Convolutional neural networks using logarithmic data representation. CoRR, abs/1603.01025, 2016.
 Molchanov et al. (2017) Dmitry Molchanov, Arsenii Ashukha, and Dmitry P. Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning (ICML), pages 2498–2507, 2017.
 Neal (1992) Radford M. Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, Dept. of Computer Science, University of Toronto, 1992.
 Nocedal and Wright (2006) Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer New York, 2nd edition, 2006.
 Novikov et al. (2015) Alexander Novikov, Dmitry Podoprikhin, Anton Osokin, and Dmitry P. Vetrov. Tensorizing neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 442–450, 2015.
 Nowlan and Hinton (1992) Steven J. Nowlan and Geoffrey E. Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473–493, 1992.
 Peters and Welling (2018) Jorn W. T. Peters and Max Welling. Probabilistic binary neural networks. CoRR, abs/1809.03368, 2018.
 Pfeifenberger et al. (2019) Lukas Pfeifenberger, Matthias Zöhrer, and Franz Pernkopf. Eigenvector-based speech mask estimation for multichannel speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 27(12):2162–2172, 2019.
 Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision (ECCV), pages 525–542, 2016.
 Roth and Pernkopf (2018) Wolfgang Roth and Franz Pernkopf. Bayesian neural networks with weight sharing using Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2018.
 Roth et al. (2019) Wolfgang Roth, Günther Schindler, Holger Fröning, and Franz Pernkopf. Training discrete-valued neural networks with sign activations using weight distributions. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2019.
 Rumelhart et al. (1986) David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
 Sandler et al. (2018) Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018.
 Shayer et al. (2018) Oran Shayer, Dan Levi, and Ethan Fetaya. Learning discrete weights using the local reparameterization trick. In International Conference on Learning Representations (ICLR), 2018.
 Simonyan and Zisserman (2015) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
 Soudry et al. (2014) Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems (NIPS), pages 963–971, 2014.
 Srivastava et al. (2014) Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
 Stamoulis et al. (2019) Dimitrios Stamoulis, Ruizhou Ding, Di Wang, Dimitrios Lymberopoulos, Bodhi Priyantha, Jie Liu, and Diana Marculescu. Single-path NAS: Designing hardware-efficient convnets in less than 4 hours. In European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2019.
 Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
 Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 2818–2826, 2016.
 Tan and Le (2019) Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), pages 6105–6114, 2019.
 Tan et al. (2018) Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. MnasNet: Platform-aware neural architecture search for mobile. CoRR, abs/1807.11626, 2018.
 Ullrich et al. (2017) Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression. In International Conference on Learning Representations (ICLR), 2017.

 Warsitz and Haeb-Umbach (2007) E. Warsitz and R. Haeb-Umbach. Blind acoustic beamforming based on generalized eigenvalue decomposition. IEEE Transactions on Audio, Speech, and Language Processing, 15:1529–1539, 2007.
 Warsitz et al. (2008) E. Warsitz, A. Krueger, and R. Haeb-Umbach. Speech enhancement with a new generalized eigenvector blocking matrix for application in a generalized sidelobe canceller. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 73–76, 2008.
 Wen et al. (2016) Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 2074–2082, 2016.
 Wu et al. (2018) Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and inference with integers in deep neural networks. In International Conference on Learning Representations (ICLR), 2018.
 Xie et al. (2017) Saining Xie, Ross B. Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995, 2017.
 Yang et al. (2015) Zichao Yang, Marcin Moczulski, Misha Denil, Nando de Freitas, Alexander J. Smola, Le Song, and Ziyu Wang. Deep fried convnets. In International Conference on Computer Vision (ICCV), pages 1476–1483, 2015.
 Yin and Zhou (2019) Mingzhang Yin and Mingyuan Zhou. ARM: Augment-REINFORCE-merge gradient for stochastic binary networks. In International Conference on Learning Representations (ICLR), 2019.
 Zhang et al. (2018a) Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. LQ-Nets: Learned quantization for highly accurate and compact deep neural networks. In European Conference on Computer Vision (ECCV), pages 373–390, 2018a.
 Zhang et al. (2018b) Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 6848–6856, 2018b.
 Zhou et al. (2017) Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. In International Conference on Learning Representations (ICLR), 2017.
 Zhou et al. (2016) Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016.
 Zhu et al. (2017) Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained ternary quantization. In International Conference on Learning Representations (ICLR), 2017.
 Zöhrer et al. (2018) M. Zöhrer, L. Pfeifenberger, G. Schindler, H. Fröning, and F. Pernkopf. Resource efficient deep eigenvector beamforming. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
 Zoph and Le (2017) Barret Zoph and Quoc V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations (ICLR), 2017.
 Zoph et al. (2018) Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 8697–8710, 2018.