1 Introduction
Deep Neural Networks (DNNs) have produced state-of-the-art results in applications such as computer vision [16], [4] and object detection [30]. As their size continues to grow to improve prediction capabilities, their memory and computational requirements also scale, making them increasingly difficult to deploy on embedded systems. For example, [16] achieved state-of-the-art results on the ImageNet challenge using AlexNet, which required 240MB of storage and 1.45 billion operations to compute inference per image. Several methods of compression
[11], quantization [3] and dimensionality reduction [24] have been applied to reduce these demands, with promising results. This demonstrates the overparametrization and redundancies in DNNs, and poses an opportunity to use regularization to make their representations more amenable to hardware implementations. In particular, low-precision neural networks reduce both memory and computational requirements whilst achieving accuracies comparable to floating point (Gupta et al., 2015). For extremely low precisions, such as binary and/or ternary weight representations and 1-8 bits for activations, most of the multiply-accumulate (MAC) operations can be replaced by simple bitwise operations. This translates to massive reductions in storage requirements and spatial complexity in hardware. Additionally, large power savings and speed gains are achieved when networks can fit in on-chip memory. The issue is that a large reduction in precision leads to large information loss, which incurs significant accuracy degradation, especially for complex datasets such as ImageNet [25]. Ideally, we can train networks which have both high prediction capabilities and minimal computational complexity.
DNN training is an iterative process which has a feedforward path to compute the output and a backpropagation path to calculate gradients and update its parameters for learning. Low-precision networks maintain a set of full-precision weights which are quantized before computing inference. As the quantization functions are piecewise and constant, the gradients of the quantized weights are calculated and applied to update their corresponding full-precision weights. Similarly, derivatives of quantized activations are calculated by using a non-constant differentiable approximation function. This type of training was first proposed as the Straight Through Estimator (STE)
[1], which suggested the use of a nonzero derivative approximation for functions which are non-differentiable or have zero derivatives everywhere. The problem is that without an accurate estimator for weights and activations, there exists a significant gradient mismatch which impinges on learning. As discussed in [21], activations are more robust to quantization than weights for image classification problems, due to weight reuse in Convolutional (CONV) layers affecting multiple operations. To overcome this, methods such as increasing the weight codebook by applying a scaling coefficient to all weights in a layer provide better approximations of weight distributions and greater model capacity [18]. This is computationally inexpensive and can be represented as multiplying each weight layer's matrix by a diagonal scalar matrix, which only requires storage of one value. Applying fine-grained scaling coefficients has also been shown to improve accuracy by increasing model capacity [20], [23]. The problem with all of these fine-grained approaches is either large storage requirements for the scaling coefficients or high computational complexity due to irregular codebook indices. In this paper we present Learning Symmetric Quantization (SYQ), a method to design binary/ternary networks with fine-grained scaling coefficients which preserve low storage requirements and computational efficiency. We do this by learning a symmetric weight codebook via gradient-based optimizations, which enables a minimally-sized square diagonal scalar matrix representation. To reduce the large information loss from CONV layer quantization, we use a more fine-grained pixel/row-wise scaling approach, rather than the layer-wise scaling used in Fully-Connected (FC) layers. In the process, we significantly close the accuracy gap between low-precision networks and their floating point counterparts, whilst preserving their efficient computational structures. Our work makes the following contributions:
Our approach significantly improves the ability of convolutional weights to learn low-precision representations. This is useful as most layers in modern network architectures consist of convolutions, which are typically the least redundant layers.

The proposed method reduces the computational complexity of traditional fine-grained low-precision scaling and imposes minimal hardware cost over layer-wise scaling.

On state-of-the-art networks such as AlexNet, ResNet and VGG, our method is empirically shown to improve accuracy for 1-2 bit weights and 2-8 bit activations.
2 Related Work
Most methods for training low-precision DNNs maintain a set of full precision weights that are deterministically or stochastically quantized during forward or backward propagation. Gradient updates computed with the quantized weights are then applied to the full precision weights [5], [14], [19]. To produce state-of-the-art results on larger models, [23] proposed scaling the quantized weights by the expectation of real-valued weights to recover the dynamic range of each layer. [18] also implemented a similar technique for ternary networks and optimised a non-zero quantization threshold as a function of the weight expectation. Other gradient-based optimization methods for the scaling coefficient have been introduced [33]. Other methods of quantization have also been implemented, e.g. retraining networks using incremental weight subgrouping to produce no accuracy loss for 5-bit weights [31]. Multiple binarizations and a scaling layer were described in [27] to improve accuracy and binarize the last layer. Logarithmic data representations were used to approximate the non-uniform distribution of the weights, activations and gradients down to 3 bits with negligible accuracy loss [21]. Activation quantization has also been investigated, with frameworks created for varying activation bitwidths [32] and for both weights and activations [22]. Improving network learnability under low-precision weights and activations was analysed in [2]. More fine-grained approaches to quantization have effectively clustered weights or grouped filters together and quantized differently based on their statistical distributions [6], [20]. Increasing model capacity by applying scaling coefficients to positive and negative values separately was proposed in [33]. Furthermore, sparse representations were used as regularization to make networks more amenable to hardware [7]. Also, many low-precision DNN hardware implementations have been published [29], [10]. For example, FINN [8], [28] demonstrated the performance gains of being able to store all network weights in on-chip memory by implementing binarized neural networks on FPGAs.
3 Low-Precision Networks
In this section we discuss the motivations behind our work and fundamentals of low-precision neural networks.
3.1 Motivation
Each layer of a DNN computes dot products between weight parameters and its input values. We can represent the output of each hidden unit i as:

y_i = f(w_i · x)   (1)

where f is an element-wise nonlinear activation function, x is the input vector, and w_i is the weight vector of a linear transformation. This computation is repeated throughout the network; therefore overall model complexity is dependent on its structure. As modern networks continue to get deeper and wider, model complexity becomes problematic for their applicability in constrained hardware environments. A solution is to efficiently quantize both weights and activations to very low precisions (1-8 bits) with negligible or no accuracy loss. In doing so, the arithmetic operations are greatly simplified, reducing both computational and resource complexity. In the binary/ternary weight case, MACs are replaced by bit operations. For example, Figure 1 shows the average resource usage on Field Programmable Gate Array (FPGA) hardware to implement a MAC operation under different precisions, which scales quadratically with the multiplier size, i.e. O(n^2) where n is the number of bits (results obtained from instantiating MAC modules using Vivado). As shown, no high precision multipliers (known as DSPs on an FPGA) are required for precisions less than or equal to ternary weights and 8-bit activations. Furthermore, the logic element (known as LUTs on an FPGA) requirement reduces proportionally with both weight and activation precisions. Additionally, the storage requirements for both weights and activations are reduced proportionally with precision. This significantly improves the network's ability to fit in on-chip memory and constrained hardware environments, and broadens the applicability of DNNs.
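To make the bit-operation claim concrete, here is a small NumPy sketch (our own illustration, not from the paper) showing that a dot product with binary {-1, +1} weights needs no multiplies at all, only additions and subtractions:

```python
import numpy as np

def binary_dot(x, w_sign):
    """Dot product with binary weights in {-1, +1}: no multiplies,
    just add the inputs where the weight is +1 and subtract where -1."""
    pos = w_sign > 0
    return x[pos].sum() - x[~pos].sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(64)
w = np.sign(rng.standard_normal(64))   # binary weight vector
assert np.isclose(binary_dot(x, w), np.dot(x, w))
```

On custom hardware the same idea becomes an adder tree with sign flips, which is what removes the DSP multipliers discussed around Figure 1.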
For a CONV layer, all weights are typically represented as a tensor W ∈ R^{K×K×I×N}, where K is the filter size, I is the number of input feature maps and N the number of output feature maps. In low-precision networks, each weight layer can typically be represented by a diagonal scalar matrix D multiplied by a quantized weight matrix Q. Also, the activation function f can be approximated using a piecewise constant activation function. In our proposed method, we observe that by ensuring the quantization levels for W are symmetric around zero, we can construct efficient square diagonal matrix representations of D, which enable fine-grained quantization whilst having minimal memory requirements (of size K^2 or K). This translates to a reduction in overall model complexity and high prediction capabilities. Although we restrict ourselves to structured matrices and low-precision weights and activations, the network efficiently captures information through our gradient-based symmetric quantizer, which learns the diagonal elements of D during training.
3.2 Weight Quantization
For low-precision DNNs, the distribution of the full precision weight matrix W_l of each layer l is approximated by a function Q(·), resulting in a quantized weight matrix Q_l:

Q_l = Q(W_l)   (2)

for l = 1, ..., L and Q_l^{i,j} ∈ C_l. The codebook C_l is the set of all possible values for Q_l, where c_i and i represent each codebook value and its index respectively. For example, binary and ternary weight spaces have C = {-1, 1} and C = {-1, 0, 1} respectively. Efficient functions for binarizing and ternarizing weight parameters have been proposed as piecewise constant functions in [18], such that:
Q_l^{i,j} = sign(W_l^{i,j}) ⊙ M_l^{i,j}   (3)

with,

M_l^{i,j} = 0 if |W_l^{i,j}| < η,  M_l^{i,j} = 1 otherwise,   (4)

where M represents a masking matrix and η is the quantization threshold hyperparameter. η = 0 for binary networks, and in our work we set η = 0.05 × max(|W_l|) for ternary networks as in [33]. The issue with discretization of the weights is that it leads to the vanishing gradients problem
[1]. To overcome this, an STE is defined to replace the zero derivatives of the piecewise constant function in (3) with a nonzero surrogate derivative [14]. During training, Q_l is used for inference and backpropagation, and the corresponding elements in W_l are updated based on these gradients. Hence the STE is defined as:

∂E/∂W_l^{i,j} = ∂E/∂Q_l^{i,j}   (5)

where E is the error function for a network without scaling coefficients. After training, the full precision weights are discarded and we require only the quantized weights for deployment. Whilst these methods greatly reduce computational complexity by eliminating floating point MACs, they increase the difficulty of learning.
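As an illustration of Eqs. (3)-(5), the following NumPy sketch ternarizes a toy weight matrix and passes the gradient straight through; the threshold choice η = 0.05 × max|W| follows the ternary setting described above, and all variable names are ours:

```python
import numpy as np

def ternarize(W, eta):
    """Piecewise-constant quantizer of Eqs. (3)-(4): weights with magnitude
    below the threshold eta are masked to zero, the rest keep their sign."""
    M = (np.abs(W) >= eta).astype(W.dtype)  # masking matrix M
    return np.sign(W) * M

W = np.array([[0.8, -0.03], [-0.6, 0.10]])
eta = 0.05 * np.abs(W).max()                # ternary threshold as in [33]
Q = ternarize(W, eta)                        # entries in {-1, 0, +1}

# STE of Eq. (5): the gradient w.r.t. Q is applied directly to W,
# so the full-precision weights receive dE/dQ unchanged.
dE_dQ = np.ones_like(Q)
dE_dW = dE_dQ
```

The forward pass sees only Q; the full-precision W is updated with dE_dW and is discarded after training.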
3.3 Scaling
The introduction of scaling coefficients improves learning capabilities by providing greater model capacity and compensating for the large information loss due to binary/ternary quantization. Scaling discrete weight representations requires multiplying all Q_l by positive scaling coefficients α_l. We want to find optimal scaling coefficients for each layer, α_l*, which minimize our error function:

α_l* = argmin_{α_l > 0} E_α(Q_l, α_l)   (6)
with E_α representing the error function with scaling coefficients. Finding the optimal α_l is vital to reducing gradient mismatches in the forward and backward functions. It was proposed in [32] as the mean of absolute weight values for each layer:

α_l = (1/n) Σ_{i,j} |W_l^{i,j}|   (7)
where n is the total number of layer weights. The codebook for each layer after scaling in (7) is symmetric: C_l = {-α_l, α_l}, and the scalars become per-layer learning rate multipliers. Additionally, the STE in (8) reduces the gradient mismatch from (5) by including information from the full precision weights:

∂E_α/∂W_l^{i,j} = α_l · ∂E_α/∂Q_l^{i,j}   (8)
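The symmetric codebook produced by layer-wise scaling can be checked numerically; this toy NumPy sketch (our own illustration of Eq. (7)) scales binary weights by the layer's mean absolute value:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((16, 16))

alpha = np.abs(W).mean()        # Eq. (7): mean absolute weight of the layer
Q_hat = alpha * np.sign(W)      # scaled binary weights

# the scaled codebook {-alpha, +alpha} is symmetric around zero
assert np.allclose(np.sort(np.unique(Q_hat)), [-alpha, alpha])
```

Only the single scalar alpha needs storing alongside the 1-bit signs.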
Gradient-based optimizations for the scaling coefficients were also introduced in [33], which applied different scaling coefficients α_l^p and α_l^n to positive and negative Q_l values respectively to improve model capacity and accuracy. These are updated during backpropagation using the gradients:

∂E/∂α_l^p = Σ_{(i,j) ∈ S_l^p} ∂E/∂Q̂_l^{i,j},   ∂E/∂α_l^n = Σ_{(i,j) ∈ S_l^n} ∂E/∂Q̂_l^{i,j}   (9)

where Q̂_l denotes the scaled quantized weights, initially α_l^p = α_l^n = 1, and S_l^p, S_l^n are the codebook index sets for each layer, i.e. S_l^p = {(i,j) : W_l^{i,j} > η} and S_l^n = {(i,j) : W_l^{i,j} < -η}. This allows each layer's codebook values to be asymmetric around zero, such that α_l^p ≠ α_l^n. The codebook indices are then highly irregular and unordered, which increases computational complexity as the matrices cannot be easily decomposed. Rather, we have to check the sign of every element before computation, leading to extra branching instructions on conventional computing platforms such as CPUs/GPUs and additional logic in custom hardware. The difficulty of designing low-precision networks which have both high learning capabilities and computational efficiency can be solved by learning a symmetric codebook during training and exploiting structured matrix representations.
4 SYQ Structural Representations
We now propose matrix representations of SYQ by partitioning the quantization into weight subgroups. Diagonal matrix representations consist mainly of zeros and have nonzero entries along the main diagonal. A matrix D ∈ R^{m×n} is diagonal if D_{i,j} = 0 for all i ≠ j, and square if m = n. A square diagonal matrix with all main diagonal entries equal is a scalar matrix. A diagonal matrix is defined by the vector d = (d_1, ..., d_m) of its main diagonal entries: D = diag(d).
Diagonal matrix multiplication is very computationally efficient as it can be easily decomposed and only the scalar vector requires storage.
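A brief NumPy illustration (ours) of why this matters: multiplying by a diagonal matrix decomposes into a per-row scale, so only the vector d is ever stored or touched:

```python
import numpy as np

rng = np.random.default_rng(2)
d = rng.random(4) + 0.1                      # diagonal entries: all we store
Q = np.sign(rng.standard_normal((4, 8)))     # quantized weight matrix

explicit = np.diag(d) @ Q    # full diagonal-matrix product
decomposed = d[:, None] * Q  # per-row scale: no matrix ever materialised

assert np.allclose(explicit, decomposed)
```

The decomposed form is what the hardware actually computes: one low-precision sub-dot product per row, followed by one scalar multiply.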
4.1 Layers
CONV and FC layers have differing computational requirements and sensitivities to network redundancies. CONV weights are reused many times across the input feature map, whereas FC weights are used only once per image. Hence, the quantization error of each weight in a CONV layer impacts the dot products across the entire input feature map volume, rather than just once for FC weights. Thus, a fine-grained approach to CONV layers is effective at compensating for this error. Quantized CONV weights are represented as a tensor Q ∈ C^{K×K×I×N}. As typically K^2 ≪ N, it is optimal to have a diagonal scalar matrix of size K^2 or even K, as only small scalar vectors are required for storage. By reshaping the tensor Q, we form a matrix Q' with K^2 or K rows, and represent our scalar matrix multiplication as DQ', with the square diagonal matrix D of size K^2 × K^2 or K × K respectively. FC layers are represented as a matrix of size H × A, where H is the number of hidden nodes and A the number of activation neurons. As FC layers are more robust to quantization, one learnable scaling coefficient (layer-wise) for the FC layer can sufficiently approximate the distribution and can also be represented with scalar matrix computation. All elements in diag(D) are then equal and we only require storage of one value.
4.2 Subgroups
More fine-grained quantization can improve approximations of the statistical distributions of weights. We implement pixel-wise scaling for CONV layers, which involves grouping all spatially equivalent pixels along the I × N dimension. This results in different values for all the main diagonal elements of D. With this representation, we can still decompose the matrix computation along each pixel dimension and exploit the parallel nature of convolutions, as shown in Figure 2. We do this by creating K^2 subgroups, each with its own codebook index set S_i. Other granularities such as row-wise scaling involve grouping all pixels along a row or column, resulting in K subgroups with repeated main diagonal elements (as illustrated in Figure 2), and layer-wise scaling uses a single subgroup covering the whole layer. Different granularities affect both accuracy and computation, as further explored in Sections 6 and 7.
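The pixel-wise partitioning can be sketched in a few NumPy lines; tensor shapes follow the K×K×I×N convention above, with toy sizes of our own choosing:

```python
import numpy as np

K, I, N = 3, 8, 16                       # toy filter size and feature-map counts
rng = np.random.default_rng(3)
W = rng.standard_normal((K, K, I, N))

Wp = W.reshape(K * K, I * N)             # one row per spatial pixel position
alpha = np.abs(Wp).mean(axis=1)          # one scaling coefficient per pixel
Q_hat = alpha[:, None] * np.sign(Wp)     # diag(alpha) applied row-wise

assert alpha.shape == (K * K,)           # only K^2 scalars need storing
```

Row-wise scaling is the same sketch with a (K, K*I*N) reshape and K scalars instead of K^2.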
5 SYQ Training
In this section we describe the methodology used to efficiently train SYQ networks.
5.1 Symmetric Quantizer
When training low-precision inference networks, the aim is to have the smallest possible codebook. Typically, as the codebook size increases, a network will approach full-precision performance but increase hardware cost. However, certain codebook representations are significantly more hardware friendly than others and do not necessarily impose any hardware costs. Given a codebook C, and the nonzero codebooks C^+ = {c ∈ C : c > 0} and C^- = {c ∈ C : c < 0}, a quantizer is denoted as symmetric if:

C^- = -C^+   (10)
Learning this type of codebook requires updating only one scaling coefficient during training for each pair of bipolar codebook values. With Q̂_l^{i,j} = α_l^i Q_l^{i,j} denoting the scaled quantized weights, the gradient of the scaling coefficient α_l^i of each subgroup S_l^i becomes:

∂E/∂α_l^i = Σ_{j ∈ S_l^i} (∂E/∂Q̂_l^{i,j}) · Q_l^{i,j}   (11)
When computing binary/ternary weight representations followed by a scale, it is ideal to have a codebook which is symmetric around zero, as the codebook storage requirements are almost halved. This is because only the absolute value of the two symmetric values needs to be stored. Additionally, the codebook indices become highly regular and ordered for the scalar multiply, which greatly reduces computational complexity. The nature of symmetric quantization enables the opportunity to implement fine-grained quantization (pixel/row-wise) whilst maintaining the scalar matrix multiplication structure used in layer-wise scaling. This is also advantageous as the scaling coefficients become fine-grained adaptive learning rate multipliers for each pixel/row in a CONV layer, i.e. the STE becomes:

∂E/∂W_l^{i,j} = α_l^i · ∂E/∂Q_l^{i,j}   (12)

As the scaling coefficients more accurately approximate each subgroup's weight distribution and are learned via gradients, the gradient mismatch is significantly reduced for weight quantization, which enhances network learning.
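A quick numerical sanity check of the subgroup gradient in Eq. (11), using a toy quadratic loss of our own in place of the network error E:

```python
import numpy as np

rng = np.random.default_rng(4)
Q = np.sign(rng.standard_normal((4, 8)))   # symmetric {-1,+1} codebook entries
T = rng.standard_normal((4, 8))            # toy target for a quadratic loss
alpha = np.abs(rng.standard_normal(4)) + 0.1

def loss(a):
    """E = 0.5 * || diag(a) Q - T ||^2, a stand-in for the network error."""
    return 0.5 * np.sum((a[:, None] * Q - T) ** 2)

# Eq. (11): dE/dalpha_i sums dE/dQhat over subgroup i, weighted by Q
dE_dQhat = alpha[:, None] * Q - T
grad = np.sum(dE_dQhat * Q, axis=1)

# finite-difference check of the analytic gradient for one subgroup
eps, i = 1e-6, 2
step = eps * np.eye(4)[i]
numeric = (loss(alpha + step) - loss(alpha - step)) / (2 * eps)
assert np.isclose(grad[i], numeric, atol=1e-4)
```

Because there is one coefficient per subgroup, the update costs one reduction over the subgroup per step.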
5.2 Initialization
The solution to nonconvex optimizations such as gradient descent depends heavily on parameter initialization to avoid vanishing or exploding activations/gradients and ensure network convergence [9]. For low-precision networks, excessive gradient mismatches between the forward and backward functions must be minimized, otherwise the gradients will not propagate well. To deal with this concern, the scaling coefficients are initialized as the mean of the full precision weights in their corresponding subgroup. For example, the scaling coefficient in pixel-wise scaling is:

α_l^i = (1/(I·N)) Σ_{j ∈ S_l^i} |W_l^{i,j}|   (13)

Layer-wise scaling in FC layers initializes α_l as the mean of the absolute values of all layer weights. By incorporating information from the full precision weights, we aim to reduce the mismatch initially; the scaling coefficients are then optimized during backpropagation.
5.3 Activations Quantization
Our forward path approximation to f in (1) uniformly quantizes a real number x ∈ [0, p] to a k-bit number x_q:

x_q = ⌊x · 2^f⌋ / 2^f   (14)

where ⌊·⌋ represents the round-down operation and p is the upper bound. x itself is bounded by its arbitrary unsigned two's complement fixed point representation, where f is the number of fractional bits and p = 2^{k-f} - 2^{-f}. Uniform quantization translates to a reduction in hardware implementation complexity. To achieve this, we use the following STE for the activations:

∂E/∂x = ∂E/∂x_q   (15)

Differences in the forward and backward activation functions create a gradient mismatch which can result in unstable and inefficient learning. To minimize this issue, we adjust f as a hyperparameter. The overall SYQ training process is summarized in Algorithm 1.
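A hedged sketch of the uniform quantizer in Eq. (14), with f fractional bits and the bound p = 2^{k-f} - 2^{-f} taken from the text (the clipping step is our reading of how x is kept in [0, p]):

```python
import numpy as np

def quantize_act(x, k, f):
    """Sketch of the uniform k-bit quantizer in Eq. (14): clip x to the
    unsigned fixed-point range with f fractional bits, then round down."""
    p = 2.0 ** (k - f) - 2.0 ** (-f)    # upper bound of the representation
    return np.floor(np.clip(x, 0.0, p) * 2.0 ** f) / 2.0 ** f

x = np.array([-0.3, 0.26, 1.7, 9.0])
q = quantize_act(x, k=4, f=2)           # levels spaced 0.25 apart, max 3.75
```

With k = 4 and f = 2 the representable levels are {0, 0.25, ..., 3.75}, so negative inputs clip to 0 and large inputs saturate at p.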
6 Experiments
To demonstrate the versatility of SYQ, we applied it to several state-of-the-art benchmark models, all with different network topologies. We use binary/ternary weights and varying activation bitwidths for classification of the large-scale ImageNet dataset. The ILSVRC-2012 ImageNet dataset is a natural high-resolution visual classification dataset consisting of 1000 classes, 1.28 million training images and 50K validation images. Inputs are resized before being randomly cropped. We report our single-crop evaluation results using Top-1 and Top-5 accuracy.
6.1 Networks
We compare our results to the full precision baseline and benchmark reference model accuracies in Table 1 (our ResNet and AlexNet reference results are obtained from https://github.com/facebook/fb.resnet.torch and https://github.com/BVLC/caffe, respectively), showing that SYQ training achieves similar accuracy to floating point. This suggests the noise induced by replacing floating point weight layers with SYQ versions provides effective regularization during training. An AlexNet [17] variant is implemented which eliminates dropout and includes batch normalization [15]. A mini batch size of 64 is used, with L2 weight decay of 5e-6, and our learning rate is initially 1e-4 with step decays of scale factor 0.2. For ResNet [12], we test on the 18, 34 and 50 layer variations. Our batch size is 128 and the learning rate is initially 1e-3 with step decay of factor 0.2. We also test on a variant of VGG-16 [26], using model-A in [13] with the spp layer replaced by a max pool and only 3 CONV layers rather than 5 for input size blocks of 56, 28 and 14, as in
[2]. Batch sizes are set to 32 and our learning rate is initially 1e-4 with a step decay of factor 0.2. The VGG and ResNet models were initialized from floating point baseline weights. Full-precision weights are used for the first and last layer. All other CONV layers are quantized with SYQ pixel-wise scaling, FC layers with layer-wise scaling, and the activations of all layers using (14).

Table 1: Accuracy (%) with 1-bit (1-8) and 2-bit (2-8) weights and 8-bit activations, versus full-precision baseline and reference models.

Model              1-8    2-8    Baseline  Reference
AlexNet    Top-1   56.6   58.1   56.6      57.1
           Top-5   79.4   80.8   80.2      80.2
VGG        Top-1   66.2   68.7   69.4      -
           Top-5   87.0   88.5   89.1      -
ResNet-18  Top-1   62.9   67.7   69.1      69.6
           Top-5   84.6   87.8   89.0      89.2
ResNet-34  Top-1   67.0   70.8   71.3      73.3
           Top-5   87.6   89.8   89.1      91.3
ResNet-50  Top-1   70.6   72.3   76.0      76.0
           Top-5   89.6   90.9   93.0      93.0
6.2 Changing Granularity Via Weight Subgroups
Weight subgroups can be arbitrarily designed for a given hardware application. Table 2 shows the accuracy differences of row-wise and layer-wise scaling relative to pixel-wise scaling on AlexNet; it suggests pixel-wise and row-wise are only marginally different, especially at higher precisions, while both are considerably more accurate than layer-wise.
Table 2: Accuracy difference (%) of row-wise and layer-wise scaling relative to pixel-wise scaling on AlexNet.

                Row-wise        Layer-wise
Weights  Act.   Top-1  Top-5    Top-1  Top-5
1        2      -0.7   -0.5     -1.4   -2.2
1        8      -0.1   -0.3     -0.4   -2.2
2        2      +0.1    0.0     -1.3   -1.5
2        8      -0.1   -0.1     -1.9   -1.7
This demonstrates the effectiveness of fine-grained quantization of CONV layers over layer-wise scaling and promotes the exploration of efficient representations of scalar computation. It also shows the effectiveness of row-wise quantization, which typically incurs a smaller memory requirement at a small accuracy drop, for a significant gain in the potential parallelism of the network.
6.3 Comparisons To Previous Work
We compare SYQ explicitly using AlexNet, ResNet-18 and ResNet-50 in Tables 3, 4 & 5, as these networks have been extensively studied in the literature. Our ternary results with 8-bit activations (2w-8act) improve on the state-of-the-art for all three networks. Our 2w-4act result for ResNet-50 also improves on the state-of-the-art, FGQ. This is also the case for binary weights, such as 1w-8act ResNet-18 and AlexNet with 1w-2act and 1w-4act. For extremely low 1w-2act representations, SYQ also achieves a 2.7% increase in Top-1 accuracy over the state-of-the-art, HWGQ. This demonstrates SYQ's strength in producing high accuracy. Additionally, it shows that multiple learnable scaling coefficients effectively reduce the gradient mismatch between the forward and backward paths, translating to efficient learning under low-precision constraints.
Table 3: AlexNet Top-1/Top-5 accuracy (%) comparison.

Model            Weights  Act.  Top-1  Top-5
DoReFa-Net [32]  1        2     49.8   -
QNN [14]         1        2     51.0   73.7
HWGQ [2]         1        2     52.7   76.3
SYQ              1        2     55.4   78.6
DoReFa-Net [32]  1        4     53.0   -
SYQ              1        4     56.2   79.4
BWN [23]         1        32    56.8   79.4
SYQ              1        8     56.6   79.4
SYQ              2        2     55.8   79.2
FGQ [20]         2        8     49.04  -
TTQ [33]         2        32    57.5   79.7
SYQ              2        8     58.1   80.8
6.4 Varying Activation Bitwidth
The most important result is that SYQ efficiently quantizes networks with low precisions for both weights and activations. From Figure 3, we can see that lowering the precision of the activations does not severely alter the training curve, suggesting that the gradient information from pixel-wise scaling coefficients in SYQ compensates well for the loss of information. However, when quantizing down to 2 bits, the training error curve does become more volatile, demonstrating instabilities in network learning. We also report the classification accuracies for varying activation bitwidths on AlexNet and ResNet-50 in Tables 3 & 5, which show that there is minimal discrepancy from the full-precision networks with as low as 4-bit activations. These results are extremely promising and have strong implications for specialized hardware implementations of low-power DNNs.
7 Hardware Implications
In this section we discuss the computational implications of different scaling operations and present a design for specialized hardware implementations.
7.1 Computational and Memory Complexity
Consider a CONV layer with K^2·I·N·H^2 operations, where H is the input feature map (IFM) dimension. Layer-wise scaling, as in DoReFa-Net, requires one scaling coefficient per K^2·I·N·H^2 operations.
Table 6: Number of scaling coefficients stored per CONV layer for each scaling method.

Method               Scalars
Layer (DoReFa)       1
Row (SYQ)            K
Pixel (SYQ)          K^2
Asymmetric (TTQ)     2
Grouping (FGQ)       K^2·N/4
Channel (HWGQ/BWN)   N
For channel-wise scaling in HWGQ and BWN, N scaling coefficients are required, as there is one per output feature map, where typically N ≫ K^2. TTQ implements asymmetric layer-wise quantization, which requires two scaling coefficients per layer and additional operations, as we add a branching operation for each weight due to irregular codebook indices, as described in Section 3.3. FGQ uses pixel-wise scaling for every 4 filters, whereas SYQ uses pixel-wise scaling across all N filters; hence FGQ requires K^2·N/4 scaling coefficients. For pixel-wise SYQ scaling, K^2 scaling coefficients are required, where K^2 ≪ N for most CONV layers in modern networks. For row-wise SYQ scaling, K scaling coefficients are required. These results are displayed in Table 6, demonstrating the benefits of maintaining a diagonal representation for the scalar matrix multiplication of each layer, as we improve either computational or memory complexity against all other fine-grained methods. Another key benefit of SYQ is its amenability to highly parallel processors.
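The scalar counts above can be tabulated for a toy layer; the FGQ count assumes the factor-4 filter grouping mentioned in the text, and the exact figures are our reading of Section 7.1 rather than the paper's Table 6:

```python
# Scaling-coefficient storage per CONV layer for each granularity, as read
# from Section 7.1 (K = filter size, N = output feature maps; the factor-4
# filter grouping for FGQ is taken from the text). Toy layer: K=3, N=256.
K, N = 3, 256

scalars = {
    "layer-wise (DoReFa)": 1,
    "asymmetric layer-wise (TTQ)": 2,
    "row-wise (SYQ)": K,
    "pixel-wise (SYQ)": K * K,
    "channel-wise (HWGQ/BWN)": N,
    "pixel-wise per 4 filters (FGQ)": K * K * N // 4,
}
for method, count in sorted(scalars.items(), key=lambda kv: kv[1]):
    print(f"{method:32s}{count:6d}")
```

For a typical 3×3 layer the SYQ pixel-wise store is 9 scalars regardless of N, versus counts that grow with N for the channel-wise and FGQ schemes.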
7.2 Architectural Design
For the CONV layer, the operations are a sum of dot products between the input and kernel filter. In order to reduce compute complexity, we increase the number of operations in each dot product, while significantly decreasing the complexity of each operation. For example, the size of the input vector in the calculation of each dot product is K^2·I. The number of operations is K^2·I for multiplies and K^2·I - 1 for additions. Given that we have a limited codebook for our weights, we can break it into sub-dot products, applying the scaling factor α_i after computing the sub-dot product for that set of symmetrically constrained weights. For pixel-wise quantization, the total multiplies becomes K^2·I + K^2 and the total adds remain K^2·I - 1. However, the first term in each of these calculations can be done at significantly lower precision. For multiplies this means a binary or ternary multiply, which can often be implemented as a bit-flip. To compute this in specialized hardware, for layer-wise scaling, we have a parallel MAC tree which consists of a multiply of an input and a binary/ternary number (represented as a dot) followed by an adder tree to sum up the outputs. These outputs are fed into a multiplier to compute the scale, followed by an accumulator to store the outputs before being fed into the activation function. This architecture is shown in Figure 4. For every hardware block of this type, our per-pixel/row scaling only requires one additional ring counter which stores the scaling coefficients and shifts the input to the scaling multiplier through an index counter as each row/pixel finishes computing, which is computationally inexpensive. As in the equivalent layer-wise scaling architecture, we can still maintain one multiplier in hardware and only increase memory slightly to store the scaling coefficients.
Table 7 shows the resource and performance estimates provided by Vivado HLS of the described hardware architecture for a target Xilinx ZU3 FPGA device at an estimated clock frequency of over 300 MHz. The main design is based on the MVTU described in FINN [28], with an extension to 2-bit activations and pixel-wise and row-wise SYQ. The layer-wise baseline uses no multiplies, as these can be absorbed into the quantization thresholds for activations [28]. The MVTU was configured for a convolution layer while scaling the size of the MAC tree (SIMD) and the number of parallel processors (PE). As shown, the BRAM (18Kb memory blocks on an FPGA) and LUT usage is almost identical, while the DSP usage increases proportionally with the number of parallel output channels being processed. The increase in DSPs is not necessarily costly for the ZU3, as we are able to utilize more of the total available resources. Resource usage is only shown for pixel-wise SYQ, as row-wise differed only in LUT usage, by less than 2%.
Table 7: Resource usage estimates on a Xilinx ZU3 FPGA (ZU3 row shows total available resources).

Config   SIMD  PE   BRAMs  LUTs (k)  DSPs
Layer    32    32   64     29.8      4
Layer    64    32   64     56.5      4
Layer    32    64   64     58.9      4
SYQ(P)   32    32   64     29.4      36
SYQ(P)   64    32   64     56.1      36
SYQ(P)   32    64   64     57.7      68
ZU3      -     -    432    70.6      360
8 Conclusions
The problem of efficiently training large DNNs with low-precision weights and activations was considered. We proposed learning symmetric quantization (SYQ) for DNNs in order to maximize network learning whilst minimizing hardware complexity. This was achieved by constraining the solution to low-precision representations and learning a diagonal scalar matrix using gradient-based optimizations for efficient computation. As a result, we reduce the computational requirements of fine-grained quantization and achieve state-of-the-art accuracies on modern benchmark networks.
Acknowledgements
This research was partly supported under the Australian Research Council's Linkage Projects funding scheme (project number LP130101034) and Zomojo Pty Ltd.
References
 [1] Y. Bengio, N. Léonard, and A. C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.
 [2] Z. Cai, X. He, J. Sun, and N. Vasconcelos. Deep learning with low precision by half-wave Gaussian quantization. CoRR, abs/1702.00953, 2017.
 [3] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. CoRR, abs/1504.04788, 2015.
 [4] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. P. Kuksa. Natural language processing (almost) from scratch. CoRR, abs/1103.0398, 2011.
 [5] M. Courbariaux, Y. Bengio, and J. David. BinaryConnect: Training deep neural networks with binary weights during propagations. CoRR, abs/1511.00363, 2015.

 [6] Y. Duan, J. Lu, Z. Wang, J. Feng, and J. Zhou. Learning deep binary descriptor with multiquantization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [7] J. Faraone, N. J. Fraser, G. Gamberdella, M. Blott, and P. H. W. Leong. Compressing low precision deep neural networks using sparsity-induced regularization in ternary networks. CoRR, abs/1709.06262, 2017.
 [8] N. J. Fraser, Y. Umuroglu, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers. Scaling binarized neural networks on reconfigurable logic. In Proceedings of the 8th Workshop and 6th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM '17, pages 25-30, New York, NY, USA, 2017. ACM.

 [9] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS'10), 2010.
 [10] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. CoRR, abs/1602.01528, 2016.
 [11] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
 [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
 [13] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, abs/1502.01852, 2015.
 [14] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. CoRR, abs/1609.07061, 2016.
 [15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.

 [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, NIPS'12, pages 1097-1105, USA, 2012. Curran Associates Inc.
 [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
 [18] F. Li and B. Liu. Ternary weight networks. CoRR, abs/1605.04711, 2016.
 [19] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. CoRR, abs/1510.03009, 2015.
 [20] N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with fine-grained quantization. CoRR, abs/1705.01462, 2017.
 [21] D. Miyashita, E. H. Lee, and B. Murmann. Convolutional neural networks using logarithmic data representation. CoRR, abs/1603.01025, 2016.
 [22] E. Park, J. Ahn, and S. Yoo. Weighted-entropy-based quantization for deep neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [23] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016.
 [24] S. Ravi. ProjectionNet: Learning efficient on-device deep networks using neural projections. CoRR, abs/1708.00630, 2017.
 [25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision, 115(3):211-252, Dec. 2015.
 [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
 [27] W. Tang, G. Hua, and L. Wang. How to train a compact binary neural network with high accuracy? In AAAI, 2017.
 [28] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. H. W. Leong, M. Jahre, and K. A. Vissers. FINN: A framework for fast, scalable binarized neural network inference. CoRR, abs/1612.07119, 2016.
 [29] G. Venkatesh, E. Nurvitadhi, and D. Marr. Accelerating deep convolutional networks using low-precision and sparsity. CoRR, abs/1610.00324, 2016.
 [30] P. Viola and M. Jones. Robust real-time object detection. In International Journal of Computer Vision, 2001.
 [31] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. Incremental network quantization: Towards lossless CNNs with low-precision weights. CoRR, abs/1702.03044, 2017.
 [32] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016.
 [33] C. Zhu, S. Han, H. Mao, and W. J. Dally. Trained ternary quantization. CoRR, abs/1612.01064, 2016.