Generalized Ternary Connect: End-to-End Learning and Compression of Multiplication-Free Deep Neural Networks

11/12/2018
by   Samyak Parajuli, et al.

The use of deep neural networks in edge computing devices hinges on the balance between accuracy and complexity of computations. Ternary Connect (TC) [Lin et al.2015] addresses this issue by restricting the parameters to three levels -1, 0, and +1, thus eliminating multiplications in the forward pass of the network during prediction. We propose Generalized Ternary Connect (GTC), which allows an arbitrary number of levels while at the same time eliminating multiplications by restricting the parameters to integer powers of two. The primary contribution is that GTC learns the number of levels and their values for each layer, jointly with the weights of the network in an end-to-end fashion. Experiments on MNIST and CIFAR-10 show that GTC naturally converges to an `almost binary' network for deep classification networks (e.g. VGG-16) and deep variational auto-encoders, with negligible loss of classification accuracy and comparable visual quality of generated samples respectively. We demonstrate superior compression and similar accuracy of GTC in comparison to several state-of-the-art methods for neural network compression. We conclude with simulations showing the potential benefits of GTC in hardware.


1 Motivation and Prior Work

The demonstrated success of deep learning in vision, speech and planning domains can be leveraged in practical applications involving edge computing devices only if the requirements of high power, large memory and high-end computational resources are significantly reduced. A common solution that circumvents all three of these issues is to replace multiplications in neural networks with a cheap operation such as bit shifts [Marchesi et al.1993, Kwan and Tang1993, Courbariaux, Bengio, and David2015, Lin et al.2015, Miyashita, Lee, and Murmann2016] or integer arithmetic [Raghavan et al.2017, Wu et al.2018]. The motivation is twofold: the ever-increasing number of parameters in deep neural networks (projected to reach 1 trillion within the next decade), and the cost of bulky, power-hungry digital multipliers. Convolutional networks are especially expensive, since the number of multiplications grows with the number and size of the filters in addition to the size of the input.

In this paper, we focus on replacing multiplications with logical bit shift operations, which can be implemented efficiently in hardware using shift registers. While prior work [Marchesi et al.1993, Kwan and Tang1993, Courbariaux, Bengio, and David2015, Lin et al.2015, Miyashita, Lee, and Murmann2016] has noted this possibility, all of the above are restricted to a fixed range of shifts. For example, TC [Lin et al.2015] restricts all parameters to $\{-1, 0, +1\}$ (ignoring the scaling factor), which corresponds to shifting by zero and inverting the sign bit, whereas [Kwan and Tang1993] uses a maximum shift of four, and [Miyashita, Lee, and Murmann2016] uses a maximum of five. More generally, the majority of prior work on neural network compression, e.g. [Han, Mao, and Dally2015, Polino, Pascanu, and Alistarh2018, Achterhold et al.2018], chooses the number of bits of precision manually.

The proposed algorithm, Generalized Ternary Connect (GTC), simultaneously learns the number of bits and the values of the quantized weights. The weights are constrained to integer powers of two, which reduces floating point multiplication by the weights to bit shifts. We view the range of bit shifts as the number of bits of precision. Our second contribution is a novel parametrization of such a set of weights with trainable parameters that allow the quantization function to take a variety of shapes, ranging from a hyperbola, to a Heaviside function, to a hyperbolic tangent function, or simply an identity mapping. These additional trainable parameters significantly generalize TC and prior work on learning neural networks with weights that are powers of two. To the best of our knowledge, GTC is a novel end-to-end algorithm that jointly optimizes learning and compression, with the aim of improved performance after training.

We evaluate GTC on two tasks: supervised image classification and unsupervised image generation with a variational auto-encoder (VAE) [Kingma and Welling2013]. We show preliminary results on the MNIST and CIFAR-10 image datasets. While prior work on low precision deep neural networks has focused on classification problems, these experiments extend the demonstration to generative models, which can lead to different insights for future work. In addition, we demonstrate a novel variant of GTC that can be used as a generic optimizer for training neural networks, often outperforming Stochastic Gradient Descent (SGD). On the classification tasks, we compare GTC to several state-of-the-art neural network compression and low-precision approaches in Section 4.3. Among the compared methods, GTC shows the best compression with minimal drop in accuracy while, unlike most of the prior work, producing a neural network whose forward pass does not require any multiplications.

2 Formulation

We use boldface uppercase symbols to denote tensors and boldface lowercase symbols to denote vectors. Let $\mathbf{W} = \{\mathbf{W}_1, \ldots, \mathbf{W}_L\}$ be the set of parameters of a deep neural network with $L$ layers, and let $\hat{\mathbf{W}}_l$ be the quantized tensor corresponding to $\mathbf{W}_l$. The output of a deep neural network can be specified as $\mathbf{x}_{l+1} = f_l(\mathbf{W}_l, \mathbf{x}_l)$ for $l = 1, \ldots, L$, where $\mathbf{x}_l$ is the input to the $l$-th layer and $\mathbf{x}_{l+1}$ is its output (aka activation) and the input to the $(l+1)$-th layer. Deep neural networks consist primarily of fully connected and convolutional layers, whose outputs are weighted sums of their inputs.

It is clear that (scalar) multiplications constitute the majority of the computational cost of the forward pass during inference and prediction. For brevity, we omit the bias parameters throughout the paper. In our experiments, GTC outputs weights and biases in the form of bit shifts.

2.1 Quantization

Ternary Connect (TC) [Lin et al.2015] replaces multiplications with logical bit shift operations by quantizing weights to the three values $-1$, $0$, and $+1$ (up to a constant scaling factor). Our goal is to generalize TC by restricting weights to the feasible set

$\mathcal{Q} = \{0\} \cup \{\pm 2^{k} : k \in \mathbb{Z}\}, \qquad (1)$

where $\mathbb{Z}$ is the set of integers. In other words, the weights can be $0$, $+2^{k}$, or $-2^{k}$ for some integer $k$. The generalization to powers of two allows floating-point multiplications to be implemented as logical bit shifts. For a floating point number $x = m \cdot 2^{e}$ with mantissa $m$ and exponent $e$, its product with a weight $\hat{w} = \pm 2^{k} \in \mathcal{Q}$ is given by

$x \hat{w} = (m \cdot 2^{e}) \cdot (\pm 2^{k}) \qquad (2)$
$\phantom{x \hat{w}} = \pm m \cdot 2^{e+k}, \qquad (3)$

i.e., an addition to the exponent, which corresponds to shifting $x$ by $k$ bits and possibly flipping its sign.

Apparent from the above, in this paper we do not alter the representation of the activation $\mathbf{x}_l$, although the same techniques developed herein can be extended to it. Similarly, we do not consider the multiplications in the backward pass during gradient-based training; however, a technique similar to Quantized Back-Propagation [Lin et al.2015] can be developed. Let $\hat{w}$ denote the quantized weight corresponding to some $w \in \mathbf{W}_l$. The starting point for our quantization scheme is the identity transformation

$\hat{w} = \mathrm{sign}(w) \cdot 2^{\log_2 |w|}, \qquad (4)$

where the $\mathrm{sign}(\cdot)$ operator returns the sign of the operand, or returns zero if the operand is close to zero. Empirically, we found that the leading sign of $\hat{w}$ needs to match the sign of $w$ for good performance.
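To make the bit-shift view of (2)-(3) concrete, the following minimal Python sketch (an illustration, not part of the GTC implementation) checks that multiplying a float by a power-of-two weight only adjusts its exponent, which is exactly what a shift register implements in hardware:

```python
import math

def multiply_by_power_of_two(x: float, k: int, flip_sign: bool = False) -> float:
    """Multiply x by (+/-)2**k without a floating-point multiply.

    math.ldexp(x, k) returns x * 2**k by adjusting the exponent field only,
    mirroring the shift (and optional sign-bit flip) described in the paper.
    """
    y = math.ldexp(x, k)          # exponent addition, i.e. a bit shift
    return -y if flip_sign else y

x = 3.75                          # mantissa m = 0.9375, exponent e = 2
m, e = math.frexp(x)              # x == m * 2**e
assert x == m * 2 ** e

# Weight -2**-3: shift right by 3 bits and flip the sign bit.
assert multiply_by_power_of_two(x, -3, flip_sign=True) == x * (-(2 ** -3))
```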

The constraint that $\hat{w} \in \mathcal{Q}$ is usually enforced through rounding or stochastic rounding [Courbariaux, Bengio, and David2015]. In our case, either rounding operator, denoted $\lfloor \cdot \rceil$, can be applied to the exponent $\log_2 |w|$. We introduce novel parameters $a_l, b_l$ that are specific to each layer $l$, and formulate the quantization function as

$\hat{w} = \mathrm{sign}(w) \cdot 2^{\lfloor g(\log_2 |w|;\, a_l, b_l) \rceil}. \qquad (5)$

Here $g(\cdot\,;\, a_l, b_l)$ can be any transformation governed by the parameters $a_l, b_l$. Whereas prior work has predominantly used fixed quantization schemes, future work may explore schemes where the quantizer can potentially be an entire neural network. Such static quantization schemes are used because they asymptotically minimize the error introduced due to quantization of the parameters [Gish and Pierce1968], and many previous works use quantization error as an objective criterion [Han, Mao, and Dally2015, Raghavan et al.2017]. However, when the numbers and values of quantization levels are learned, the effect of quantization error on the accuracy is not clear. Furthermore, the benefits due to reduced computational and memory overhead may be worth a small loss in accuracy due to ignoring quantization error. In this paper, we explore a linear function as the choice of $g$:

$g(x;\, a_l, b_l) = a_l\, x + b_l \qquad (6)$
$\hat{w} = \mathrm{sign}(w) \cdot 2^{\lfloor a_l \log_2 |w| + b_l \rceil} \qquad (7)$

Despite the simple linear form, these parameters allow a very powerful family of transformations. Figure 1 illustrates four example instantiations of (7) for different settings of $a_l, b_l$. The trivial case $a_l = 1$, $b_l = 0$ gives the identity transform. Setting $a_l = 0$ corresponds to ternary quantization based on the sign of the weight $w$. Negative values of $a_l$ give a family of hyperbolas whose foci and eccentricity are controlled by $a_l$ and $b_l$, while intermediate settings give a family of hyperbolic tangent functions. To the best of our knowledge, this is a novel parameterization of the set (1). These additional trainable parameters significantly generalize TC and prior work on learning neural networks with weights that are powers of two [Miyashita, Lee, and Murmann2016, Marchesi et al.1993, Kwan and Tang1993, Lin et al.2015, Courbariaux, Bengio, and David2015]. Next we describe a novel notion of bit precision stemming from (7).

Figure 1: Plot showing different instantiations of the quantization function for different values of $a_l$ and $b_l$. Dotted lines show the effect of rounding as in (7).
Example 1.

Suppose $\mathbf{W}$ is a small weight matrix; applying (7) with given values of $a_l$ and $b_l$ yields a quantized matrix $\hat{\mathbf{W}}$ whose entries are all signed powers of two or zero. Note that some entries are quantized with large quantization error. In our experiments we find that the quantization error is not very important for high accuracy, and that the learned values of $a_l$ and $b_l$ are non-trivial.
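As a concrete illustration of Example 1, here is a short NumPy sketch of the quantizer in (7) under the linear form (6). The matrix entries and the settings of `a` and `b` are illustrative placeholders, not values from the paper, and the function follows the reconstruction of (7) given above:

```python
import numpy as np

def gtc_quantize(W: np.ndarray, a: float, b: float, eps: float = 1e-8) -> np.ndarray:
    """Quantize weights to signed powers of two: sign(w) * 2**round(a*log2|w| + b), as in (7)."""
    sign = np.sign(W)
    sign[np.abs(W) < eps] = 0.0               # weights close to zero map to zero
    exponent = np.round(a * np.log2(np.abs(W) + eps) + b)
    return sign * np.exp2(exponent)

W = np.array([[0.3, -1.7], [0.05, 0.0]])      # placeholder weights
W_hat = gtc_quantize(W, a=1.0, b=0.0)         # identity-like setting of (a, b)
print(W_hat)                                  # every nonzero entry is +/- a power of two
```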

2.2 Compression

Consider the compression of a deep neural network whose parameters are of the form of (7) using a fixed length code. Apart from the leading sign bit, each exponent can be stored as an integer. The exponent of a standard 32-bit floating point number fits in 8 bits, so one might use an 8-bit integer for a 4X model compression compared to the storage of 32-bit floating point parameters. Typically, the parameters of deep networks should not be allowed to reach values anywhere near as large as that exponent range permits. Thus, the natural source of compression is the range of values taken by the exponent over all the parameters of the network. Alternatively, one might use the cardinality of the set of values taken by the exponent, but we adopt the former due to its simplicity and differentiability. For each layer $l$, we define the number of bits $B_l$ as shown in (10).

$k^{l}_{\max} = \max_{w \in \mathbf{W}_l} \lfloor a_l \log_2 |w| + b_l \rceil \qquad (8)$
$k^{l}_{\min} = \min_{w \in \mathbf{W}_l} \lfloor a_l \log_2 |w| + b_l \rceil \qquad (9)$
$B_l = 1 + \lceil \log_2 ( k^{l}_{\max} - k^{l}_{\min} + 1 ) \rceil \qquad (10)$

The leading term of one accounts for the sign bit, whereas $k^{l}_{\max} - k^{l}_{\min}$ accounts for the range of the exponents, sufficient to represent all exponents between the maximum and the minimum. In some cases, this range may not include zero, whereas $\hat{w}$ may be zero due to the sign function in (7). Therefore, we add one and take the logarithm (base 2) to finally arrive at the number of bits.

Example 2.

Continuing with our previous example, suppose the rounded exponents of a layer span only a few distinct values; then (10) yields a small number of bits for that layer. Note that this is not an optimal code, e.g. one could represent the same set of exponents with even fewer bits (excluding the sign bit) by encoding only the distinct values that actually occur.
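The per-layer bit count can be computed directly from the rounded exponents. The sketch below follows the reconstruction of (8)-(10) above (a simple range-based code: one sign bit plus enough bits for the exponent range including the zero level); the weight values are again placeholders:

```python
import numpy as np

def gtc_num_bits(W: np.ndarray, a: float, b: float, eps: float = 1e-8) -> int:
    """Number of bits for a layer as in (10): 1 sign bit + ceil(log2(exponent range + 1))."""
    nonzero = np.abs(W) > eps
    k = np.round(a * np.log2(np.abs(W[nonzero])) + b)   # rounded exponents, (8)-(9)
    k_range = k.max() - k.min()
    return 1 + int(np.ceil(np.log2(k_range + 1)))        # +1 covers the zero level

W = np.random.randn(64, 64).astype(np.float32)           # placeholder layer weights
print(gtc_num_bits(W, a=1.0, b=0.0))
```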

3 Joint Learning and Compression

The difficulty in directly training GTC stems from the discontinuities of the rounding function in (7), (8), (9) and the ceiling function in (10). One approach to dealing with discrete neurons is Straight Through Estimation (STE) [Bengio, Léonard, and Courville2013]. In Section 4, we show the performance of a variant of GTC trained using the biased straight-through estimator, which presumes that the gradients of the discrete functions are identity functions. We use this as a baseline for the following training methodology.

We use Knowledge Distillation [Hinton, Vinyals, and Dean2015] motivated by the recent success [Polino, Pascanu, and Alistarh2018, Mishra and Marr2018, Mishra and Marr2017] of knowledge distillation techniques in training low-precision networks. Similar to the above prior work, we simultaneously train a standard 32-bit network, herein referred to as the teacher, and a low precision network, herein referred to as the student (see Figure 2).

Figure 2: Illustration of the training scheme, using $\mathbf{W}$ to denote the standard 32-bit floating point weights and $\hat{\mathbf{W}}$ to denote the weights quantized according to (7). The dashed arrows in the figure indicate that the values of $a_l, b_l$ in (7) are learned for each layer. The novelty is that the number of bits and the quantization levels are learned.

GTC differs from this prior work in three ways. First, GTC learns the number of bits as well as the quantized values. Second, while in previous work one goal of distillation is to reduce the number of neural network layers by using a compact student network distilled from a larger teacher network, note that GTC uses identical network architectures for the teacher and the student. Finally, in prior work the teacher is trained first and then fixed during the distillation process, whereas GTC learns both student and teacher models simultaneously from scratch. The GTC optimization objective is

$\mathcal{L}(\mathbf{W}, \{a_l, b_l\}) = \mathcal{L}_{\mathrm{task}}(\mathbf{W}) + \lambda_d\, \mathcal{L}_{\mathrm{distill}}(\mathbf{W}, \hat{\mathbf{W}}) + \lambda_b \sum_{l=1}^{L} B_l \qquad (11)$

The first term $\mathcal{L}_{\mathrm{task}}$ is any standard supervised or unsupervised loss function, for example, cross-entropy (or negative log-likelihood) for supervised classification, or reconstruction error for unsupervised auto-encoders.

The second term $\mathcal{L}_{\mathrm{distill}}$ denotes a distillation loss that is a function of the pairwise predictions of the teacher and the student. Note that in our case, since the student parameters are derived from the teacher parameters and the quantization parameters in (7), the distillation term is a function of $\mathbf{W}$ and $\{a_l, b_l\}$. The difference of logits (before softmax) between the student and the teacher was used in [Hinton, Vinyals, and Dean2015]. Empirically, we found that using the cross-entropy (after softmax) leads to significantly better accuracy. We extended this to the unsupervised setting by applying a sigmoid squashing function and then calculating the cross-entropy.

The last term penalizes values of $a_l, b_l$ that lead to a large variation in the power-of-two weights after quantization. This notion is captured by the number of bits in (10), so the penalty is simply the sum of $B_l$ over all the neural network layers. Roughly speaking, $\lambda_b$ corresponds to the cost of bits and can be used to encode hardware constraints. As $\lambda_b \to 0$, GTC may use up to 32 bits, and as $\lambda_b \to \infty$, GTC produces a binary network. This penalty term is reminiscent of Minimum Description Length [Grünwald2007] formulations. The regularizing effect of penalizing the number of bits was demonstrated in [Raghavan et al.2017]. Note that in the absence of the penalty term, the student and teacher networks can be made arbitrarily close by adjusting the parameters $a_l, b_l$.

One can interpret this objective as the Lagrangian of a constrained optimization problem [Raghavan et al.2017, Carreira-Perpinán2017, Carreira-Perpinán and Idelbayev2017] that primarily optimizes the accuracy of the standard 32-bit model, under the constraints that the pairwise student-teacher predictions are similar and that the student model is maximally compressed in the sense of Section 2.2.

During training, we use the gradients with respect to $\mathbf{W}$ and $\{a_l, b_l\}$ to update them respectively (we use the ADAM optimizer unless otherwise specified); these updates implicitly determine the quantized parameters of the student. Note that in general the gradients with respect to $\mathbf{W}$ are not the same as the corresponding gradients in standard 32-bit training, due to the additional gradients from the distillation loss. During testing, we discard the teacher model and make predictions using $\hat{\mathbf{W}}$ alone. We use the identity function as the STE to propagate gradients through the round and ceiling functions.
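As a rough sketch of one training step under the objective (11), the PyTorch-style code below keeps full-precision teacher weights, derives the student weights through the quantizer with identity straight-through estimators for the round and ceiling functions, and combines task, distillation, and bit-cost terms. The linear quantizer form, the names `lam_d` and `lam_b` (for $\lambda_d$, $\lambda_b$), the `teacher_layers` list of `nn.Linear` modules, and the trainable `a_list`/`b_list` registered with the optimizer are all assumptions of this sketch, not the authors' released code; bias quantization is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def ste_round(x):
    """Round in the forward pass; identity gradient in the backward pass."""
    return x + (torch.round(x) - x).detach()

def ste_ceil(x):
    """Ceiling in the forward pass; identity gradient in the backward pass."""
    return x + (torch.ceil(x) - x).detach()

def quantize(w, a, b, eps=1e-8):
    """Student weights per (7): sign(w) * 2**round(a*log2|w| + b). Returns weights and exponents."""
    k = ste_round(a * torch.log2(w.abs() + eps) + b)
    return torch.sign(w) * torch.exp2(k), k

def gtc_step(teacher_layers, a_list, b_list, x, y, optimizer, lam_d=1.0, lam_b=0.01):
    """One optimization step of the joint objective (11) for a feed-forward teacher."""
    optimizer.zero_grad()
    h_t, h_s, bit_cost = x, x, 0.0
    for i, layer in enumerate(teacher_layers):
        w_hat, k = quantize(layer.weight, a_list[i], b_list[i])
        h_t = F.linear(h_t, layer.weight, layer.bias)   # teacher: 32-bit weights
        h_s = F.linear(h_s, w_hat, layer.bias)          # student: power-of-two weights
        if i < len(teacher_layers) - 1:                 # hidden layers use ReLU, last layer gives logits
            h_t, h_s = F.relu(h_t), F.relu(h_s)
        bit_cost = bit_cost + 1 + ste_ceil(torch.log2(k.max() - k.min() + 1))  # B_l as in (10)
    loss = (F.cross_entropy(h_t, y)                                    # task loss on the teacher
            + lam_d * F.cross_entropy(h_s, h_t.softmax(dim=-1))        # distillation (after softmax)
            + lam_b * bit_cost)                                        # bit penalty
    loss.backward()                                     # gradients reach W, a_list and b_list
    optimizer.step()
    return float(loss)
```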

It should be clear from the discussion that the above training procedure is conducted on a powerful GPU, and that the benefits of the power-of-two weights are not leveraged during training. That is, the training is performed offline before deployment to an edge device, where the compute, memory and power savings can be realized. Future work should extend GTC to quantization of gradients and activations, e.g. using quantized back-propagation [Lin et al.2015] or projection-based algorithms. These are yet to be explored in the setting where the bits and levels are learned.

4 Experiments

We evaluate GTC on two tasks, namely, supervised image classification and unsupervised image generation with a variational auto-encoder (VAE). We show results on two computer vision datasets: MNIST [LeCun et al.1998] and CIFAR-10 [Krizhevsky2009]. We used the ADAM optimizer [Kingma and Ba2014] with a fixed learning rate unless otherwise specified.

Neural Architectures: For the MNIST classification problem, we used a LeNet-style architecture that consists of four layers, namely two convolutional layers (each with max pooling and ReLU) followed by two fully connected layers (with ReLU and soft-max respectively). Note that this is a simpler model than LeNet-5. For CIFAR-10 classification, we used Configuration D of VGG-16 [Simonyan and Zisserman2014], with the addition of batch normalization and ReLU after the convolutional layers. We also added Dropout [Srivastava et al.2014] after alternating convolutional layers to avoid overfitting.

For the MNIST unsupervised problem, we only use the images as input. The encoder part of the VAE uses three fully connected layers, each with tanh activations. The result is fed into two layers that output the mean and variance of the latent code. The decoder consists of three fully connected layers, each with tanh activations. Finally, we use one fully connected layer with sigmoid activations (one output per pixel) to arrive at a grayscale image.

4.1 Learning

The first set of experiments demonstrates the training and convergence of GTC and the performance of the GTC model. In these experiments, we compare with the corresponding standard 32-bit model that is trained independently of GTC (i.e., not the teacher model used in the distillation scheme of Section 3). In Section 4.3, we compare GTC with prior work as well as some natural baseline algorithms.

Figure 3 and Figure 4 show these experiments for supervised classification and the unsupervised VAE respectively. In Figure 3 (left) we see that GTC converges to a training accuracy within a small margin of the standard 32-bit model in roughly the same number of iterations (10K iterations). We note some oscillations in the training accuracy of GTC; however, they stay within a small margin of the standard model once the standard model converges. This shows the benefit of training using distillation, because the 32-bit model guides the learning of the GTC model. Moreover, on the test set, the accuracy of GTC is almost identical to that of the standard model.

On the CIFAR-10 dataset (see Figure 3 (right)), we observed a similar pattern, with the training accuracy of the standard model somewhat higher than that achieved by GTC. The performance on the test set is lower than state-of-the-art results using VGG-16 on CIFAR-10. We believe the lower performance on CIFAR-10 is caused by the use of dropout layers, which tends to slow learning, the limited number of training iterations, and the absence of fine-tuning of the hyperparameters of GTC (see Section 4.4). One can see that the trend in accuracy is upward.

Figure 4 shows a comparison of a VAE trained using GTC and a standard VAE on the MNIST dataset. For both variants, the reconstruction error was defined as the binary cross-entropy between the input image and the 32-bit reconstruction. For GTC, this is the first term in (11). The distillation loss (second term) was defined as the binary cross-entropy between the reconstructions of the teacher and student models. The figure shows that the VAE trained using GTC converges in terms of the reconstruction error. In fact, in some cases the reconstructions from GTC appear less blurry than the reconstructions of the 32-bit standard model (see Figure 6).

Overall, GTC naturally converged to a binary network for MNIST, and to a network that is binary in all but the first layer of VGG-16 on CIFAR-10, with very little loss of accuracy. GTC also converged to a binary VAE whose output has visual quality similar to that of the standard VAE. This is the main takeaway from the learning experiments on GTC. To the best of our knowledge, these are the first experiments demonstrating generative models such as VAEs with low precision parameters. Future work may extend this to more sophisticated generative models and draw potentially different insights than feedforward classification models.

Figure 3: Comparison of accuracy over training iterations of GTC and the corresponding standard (32-bit) deep neural network. Left: LeNet on MNIST. Right: VGG-16 on CIFAR-10.

Figure 4: Comparison of the reconstruction error over iterations of GTC and the corresponding standard (32-bit) variational auto-encoder on MNIST.

Figure 5: Using GTC as an optimizer leads to faster convergence than Stochastic Gradient Descent (SGD) with a fixed learning rate (LeNet on MNIST). The annealed variant divides $\lambda_b$ at regular intervals.

Figure 6: Visual inspection of the reconstruction of MNIST digits by the standard 32-bit variational auto-encoder and the GTC variational auto-encoder (panels: ground truth, 32-bit reconstruction, GTC reconstruction).

Figure 7: Convergence of the bit cost in (11) over training iterations of GTC. Panels: LeNet on MNIST, VGG-16 on CIFAR-10, VAE on MNIST.

4.2 Compression

Table 1: Weights learned by GTC shown as the number of bit shifts. The sign of an entry indicates whether bits are shifted to the right or to the left, and a flag indicates that the sign bit is flipped after shifting; for example, a right shift of $k$ bits with a flip corresponds to a weight of $-2^{-k}$. A weight of zero is represented by a separate symbol. The learned values of $a_l$ and $b_l$ are also shown.

In order to further showcase the compression achieved by GTC, Table 1 shows the weights learned by GTC along with the number of bits and the learned values of $a_l$ and $b_l$. First, we observe that across layers one of the two parameters tends to be negative while the other tends to be positive; that is, GTC learns S-shaped quantization functions. A special case of this is seen in the middle layers of VGG-16, where both values are close to zero, which corresponds to a Heaviside function (see Figure 1).

The learned weights are shown as the number of bit shifts (and flipping of sign bit) required so that the result is equivalent to a floating-point multiplication. For each number of shifts in each layer, GTC uses both the signed (with flip) and unsigned weight. The histograms show that the learned weights are almost always bi-modal with two equal modes, and never uni-modal. The weights of the middle layers of VGG are shown as zero bit shifts.
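To illustrate the representation used in Table 1, the helper below (a hypothetical utility written for this illustration, not from the paper) converts a quantized weight $\pm 2^{k}$ or zero into the shift amount, shift direction, and sign-flip flag that a shift register would need:

```python
import math

def to_shift_repr(w_hat: float):
    """Map a quantized weight (0 or +/-2**k) to (shift_amount, direction, flip_sign)."""
    if w_hat == 0.0:
        return None                       # zero weights get a dedicated symbol
    k = int(math.log2(abs(w_hat)))        # exact for powers of two
    direction = "left" if k >= 0 else "right"
    return abs(k), direction, w_hat < 0   # flip_sign=True means invert the sign bit after shifting

print(to_shift_repr(-0.25))   # (2, 'right', True): shift right by two bits and flip the sign
print(to_shift_repr(8.0))     # (3, 'left', False): shift left by three bits
```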

Figure 7 shows the convergence of the bit cost in (11), i.e. the sum over layers of the number of bits used. We observe some oscillations for LeNet on MNIST, but none for VGG-16 on CIFAR-10 or for the VAE on MNIST. In all three cases, the bit cost converges early, within a small fraction of the total training iterations.

4.3 Experimental Comparison to Prior Work

Table 2: Comparison of the accuracy and compression of GTC to prior work. A superscript marks numbers taken from the corresponding paper. A subscript marks a neural architecture that is slightly different from the one used for GTC. A hyphen (-) indicates that the paper used a completely different architecture or did not evaluate on these datasets.

We compared GTC with the following.

  • Post Mortem (PM) takes the final 32-bit model and snaps each weight to the nearest integer power of two. The result requires a fixed number of bits of storage (an integer for the exponent plus one bit for the sign). In general, there is no guarantee of accuracy.

  • Straight Through Estimator (STE) directly trains the quantized model without using distillation. The objective function is the cross-entropy of the student network. Gradients with respect to the quantized weights are calculated using the identity STE and used to update the underlying 32-bit weights. There are no $a_l, b_l$ parameters. This approach worked very well in terms of accuracy, but requires a relatively large number of bits.

  • STE+Bit uses the same approach as above, with the addition of a bit penalty term as in (11). The bits are calculated using (10), without the $a_l, b_l$ parameters. This approach retained the high accuracy of STE on MNIST while compressing to a binary network, but it did not make any progress on the CIFAR-10 dataset and resulted in poor accuracy. Together, these results indicate the necessity of the distillation approach to training.

  • LogNet [Miyashita, Lee, and Murmann2016, Lee et al.2017] proposed a method to train a network similar to GTC, but weights, gradients and activations of the network are quantized. The number of bits is not learned. We attempted to reproduce their algorithm but restricted the quantization to the weights of the network. We were not able to reproduce the performance reported.

  • Ternary Connect TC [Lin et al.2015]: We show the results from the paper. Their network architecture is slightly different from the one used for the GTC experiments. GTC is more general and shows superior compression to TC.

  • Bayesian Compression (BC) [Louizos, Ullrich, and Welling2017]: We show the performance as reported. BC shows lower accuracy on MNIST and higher accuracy on CIFAR-10. We note that the VGG-16 implementation therein does not use any dropout. GTC shows significantly higher compression than BC.

  • Variational Quantization (VQ) [Achterhold et al.2018] extends BC to ternary quantization. The number of bits is not learned. VQ shows similar accuracy and compression as GTC on MNIST.

  • Deep Compression (DC) [Han, Mao, and Dally2015] does not learn the number of bits, but learns the locations of the centroids. It uses quantization error as the primary criterion. GTC shows superior accuracy and compression.

  • Differentiable Quantization (DQ) [Polino, Pascanu, and Alistarh2018] does not learn the number of bits, but learns the locations of the centroids. The paper used a ResNet-style architecture on CIFAR-10 with a fixed number of bits. The reported accuracy is inferior to that of GTC at a comparable number of bits.

  • BitNet [Raghavan et al.2017] does learn the number of bits on a linear quantization scale, and has a learning scheme similar to GTC. However, the reported accuracy is lower, with lower compression, in comparison to GTC.

  • GTC (this work): Over all the compared methods, GTC shows the best compression with minimal drop in accuracy. We believe the lower performance on CIFAR-10 is caused by the use of dropout layers, the limited number of training iterations, and the absence of fine-tuning of the hyperparameters of GTC. With the exception of PM, STE and TC, none of the above baselines produces a neural network whose forward pass is free of multiplications.

4.4 Hyperparameters

Table 3: Accuracy and average number of bits vs. hyperparameters on MNIST+LeNet: rows show $\lambda_b$ (bit penalty) and columns show $\lambda_d$ (distillation penalty). Trends in the table show the trade-off of precision vs. accuracy in GTC.

In previous sections, we showed the results of GTC with fixed values of $\lambda_b$ and $\lambda_d$. We arrived at these values by using a grid search on the MNIST dataset and picking the setting with the highest accuracy and the lowest number of bits. We did not perform any tuning for the CIFAR-10 dataset, which might be the reason for the somewhat low performance. Table 3 shows the results of this grid search. As expected, for any fixed value of $\lambda_d$, as $\lambda_b$ increases, the number of bits quickly decreases towards one and the accuracy decreases significantly. Similarly, for any fixed value of $\lambda_b$, increasing the coefficient $\lambda_d$ of the distillation loss increases the number of bits, and generally increases the accuracy. Ultimately, hardware constraints dictate the desired number of bits, which in turn can be used to decide appropriate values for $\lambda_b$ and $\lambda_d$. Next we show an interesting use case of GTC as an optimizer by annealing the value of $\lambda_b$.

4.5 GTC for Optimization

Prior work [Raghavan et al.2017, Lin et al.2015] has noted that low precision networks exhibit fast convergence in the early stages of learning in some cases. We leverage this property in a novel variant of GTC that can be used as an off-the-shelf optimizer. We initialize GTC with a high value of $\lambda_b$, the coefficient of the bit cost, and as training progresses we rapidly reduce its value. Figure 5 shows the result of such a GTC variant. We used Stochastic Gradient Descent (SGD) with a fixed learning rate. In the variant denoted Annealed $\lambda_b$, $\lambda_b$ is divided by a constant factor at regular intervals. The figure shows that the training accuracy of both variants of GTC is higher until 40K iterations; after 40K iterations, the 32-bit standard network (optimized by SGD) converges to a higher accuracy than GTC with the fixed $\lambda_b$. Interestingly, the variant with annealed $\lambda_b$ has superior accuracy and dominates the 32-bit standard model throughout. It is to be noted that this effect vanishes when using ADAM instead of SGD, confirming the effect of low precision training on the effective learning rate as noted in [Raghavan et al.2017].
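A minimal sketch of the annealing schedule described above is given below; the initial coefficient, decay factor and interval are placeholders, since the exact values used in the experiments are not reproduced here:

```python
def annealed_lambda_b(iteration: int, lam_b0: float = 1.0,
                      decay: float = 10.0, every: int = 5000) -> float:
    """Start with a high bit-cost coefficient and divide it at regular intervals."""
    return lam_b0 / (decay ** (iteration // every))

# Example: the bit penalty shrinks rapidly as training progresses.
print([annealed_lambda_b(it) for it in (0, 5000, 10000, 20000)])
# [1.0, 0.1, 0.01, 0.0001]
```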

5 Impact

This paper proposed GTC, which generalizes Ternary Connect in a number of ways. GTC was demonstrated to be effective in learning and compression, and in comparison to baselines and prior work. These algorithmic advances translate directly into power and computational efficiency during inference. To quantify the gains in hardware, we offer two key simulations, on an FPGA and on a Raspberry Pi. These gains come from the compressed weights and the elimination of multiplications: instead of performing the standard multiply/add operation, we can do a shift/add operation. On the FPGA, the per-operation power consumption reduced from 49mW for multiplication to 10mW for bit-shifting as in (3) (based on RTL synthesis at 800MHz on a 16nm Zynq UltraScale+ (ZU9EG) using Xilinx Vivado tools). Similarly, on the Raspberry Pi, the per-operation time taken reduced from 93ns to 67ns. These gains are significant when amortized over the millions or billions of parameters in deep networks. Additional gains are anticipated from handling, storing and retrieving compressed weights. Specialized digital hardware can implement bit-shifting directly within memory storage for further gains. Even in analog multipliers, a reduction in circuitry is realized with simpler current/voltage dividers compared to complex op-amp circuitry.

6 Acknowledgements

This material is based upon work supported by the Office of Naval Research (ONR) under contract N00014-17-C-1011, and NSF #1526399. The opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Office of Naval Research, the Department of Defense or the U.S. Government.

References