1 Introduction
Deep neural networks have achieved significant improvements in various real-world applications. However, their large memory cost, computational burden, and energy consumption prohibit the massive deployment of deep neural networks on resource-limited devices. A number of methods have been proposed to compress and accelerate deep neural networks, including pruning [11], tensor decomposition [30], and quantization [24]. Among these methods, low-bit network quantization is particularly helpful for network acceleration and size reduction. Binary neural networks [5] have attracted a lot of attention. However, binary networks usually suffer from a large drop in accuracy due to their limited expressiveness. To enhance model capacity, various multi-bit quantization methods have been proposed [31], which significantly improve the performance of quantized models but enjoy less size reduction and speed acceleration.
As a compromise between binary networks and N-bit networks, ternary neural networks convert full-precision parameters into merely three values, saving a large amount of memory with acceptable accuracy degradation. Although ternary networks have been widely investigated [17, 32] in recent years, three major issues are mostly overlooked: 1) The squashing behavior of the forward quantization function. Most existing activation quantization methods [3, 24] squash full-precision activation values into a narrow and fixed range, which can limit the expressiveness of the quantized network. 2) The saturating behavior of the backward quantization function.
The clipped Straight-Through Estimator (STE) [2] is widely adopted for training a quantized network. Nevertheless, the gradient becomes zero when entering the saturating zone of the STE estimator. Moreover, as the network depth increases, training can suffer from severe gradient vanishing. 3) Hardware customization for ternary neural networks. For networks with ternary weights and activation values, the computation on most modern hardware can only be performed when the ternary values are 2-bit aligned. Compared with 2-bit quantization, it is yet less explored how to exploit the nice properties of ternary values to design a more efficient computation pattern and save more energy.

In this paper, we propose a reparameterized ternary network (RTN) to resolve these three issues. Specifically, in RTNs both weights and activation values are ternarized and then reparameterized with scale and offset parameters. The reparameterization easily alleviates the first two issues mentioned above. To avoid the squashing behavior of the quantization function during the forward pass, the learnable scale and offset parameters enable dynamic adjustment of the quantization range and thereby enhance the capacity of the ternary network. To tackle the saturating behavior of the clipped STE, the chain rule lets us decompose the gradient of the activation after reparameterization into the gradient of the activation before reparameterization together with the gradients of the scale and offset parameters. Consequently, even when the gradient of an activation before reparameterization saturates, optimization can still proceed by learning the reparameterization parameters.
Finally, to address the third issue, we build a customized hardware prototype on FPGA for the reparameterized ternary network. We design an efficient encoding and computation pattern for the dot product between two ternary vectors, saving energy compared with existing implementations of 2-bit networks.
Experimental results on large-scale tasks such as ImageNet indicate that our proposed method significantly improves the capacity of the ternary network and achieves significant accuracy improvements on ResNet-18 over state-of-the-art binary and low-bit networks. Moreover, our hardware prototype on FPGA achieves considerable savings in power and area compared with traditional implementations of 2-bit networks.
2 Related Work
Recent work on network compression shows that full-precision computation is not necessary for the training and inference of DNNs [10]. To achieve higher compression and acceleration ratios, extremely low-bit representations such as binary weights [24] have been studied. [17, 32] further improve performance by ternarizing weights to achieve higher representation ability. TWN [17] minimizes the Euclidean distance between the ternary weights and the full-precision weights. Instead of symmetric ternarization, TTQ [32] uses asymmetric ternarization to achieve higher performance at the cost of hardware convenience.
Substantial speed-up requires further quantization of activations, which is generally more challenging than weight quantization [3]. [6] uses +1 and -1 to represent both weights and activations, and XNOR-Net [24] further adds scaling factors for binary weights to improve accuracy. Higher-Order Residual Quantization [19] uses two 1-bit tensors to approximate the full-precision activation, but the computation speed is reduced by half. To take advantage of ReLU [22] and introduce sparsity in the quantized activation, [3] uses Half-wave Gaussian Quantization to approximate ReLU. The quantized activation function has the form of a stepwise function, which has zero gradient with respect to its input almost everywhere. To circumvent this problem, the Straight-Through Estimator (STE) [2] is adopted. STE approximates the backward function of an arbitrary function with the (clipped) identity function, and several studies [21, 31] attempt to reduce this mismatch between the forward and backward passes to improve performance. [4, 1] propose to learn the clipping parameters and achieve better results. [9] leverages a differentiable soft function to approximate the gradient of quantization; however, there is still a large accuracy gap between extremely low-bit and full-precision models.

3 Methodology
For a convolution layer, a weight filter is denoted by $\mathbf{W} \in \mathbb{R}^{c \times k \times k}$, where $c$ and $k$ are the number of input channels and the kernel size, respectively. Suppose one instance is fed to the network, and the corresponding feature map is denoted by $\mathbf{A}$. Then the output of one unit in the next layer can be computed by the dot product (for convolutional layers, this can be done by the im2col operation) as $y = \sigma(\langle \mathbf{W}, \mathbf{A} \rangle)$, where $\sigma$ is the Rectified Linear Unit [22].

Our proposed reparameterized ternary network (RTN) consists of linear transformations on both the weights and activation values of the network. The reparameterization allows dynamic adjustment of the quantization range and avoids gradient vanishing during quantized training. Besides, we also customize hardware implementations for RTN by leveraging the nice properties of ternary networks. The overall workflow of RTN is shown in Figure 1.

3.1 Reparameterized Ternarization
Activation Ternarization
Previous work [28, 7] on ternary networks argues that the degradation of a quantized network mainly comes from the limited number of quantization levels. However, it is rarely noticed that the quantization functions they adopt usually squash the input into fixed ranges and therefore harm the network expressiveness significantly. In a ternary neural network, the quantization function is applied to both weights and activations, which highly restricts the capacity of the quantized model. Therefore, in this paper, we propose a reparameterized quantizer to enhance the model expressiveness. First, the ternarization function is given by:

\[
\Phi(x) = \begin{cases} +1, & x > \Delta, \\ 0, & -\Delta \le x \le \Delta, \\ -1, & x < -\Delta, \end{cases} \tag{1}
\]

where $\Delta$ is the quantization threshold. Since the activation function in RTN is ReLU, its output is always non-negative, which means activations could never be quantized to $-1$. We therefore apply Batch Normalization (BN) after ReLU to recreate negative activations, so that all three quantization values can be fully utilized.
After normalizing the inputs of each layer, BN applies an affine transformation to increase the model capacity. Here we use

\[
\hat{x} = \gamma x + \beta \tag{2}
\]

to denote this transformation. With the BN transformation, the quantization function can be formulated as follows:

\[
a^t = \Phi(\gamma x + \beta) \tag{3}
\]
where the learnable BN parameters $\gamma$ and $\beta$ can adaptively adjust the effective quantization threshold of Equation 1 (in terms of the raw input $x$, the thresholds become $(\pm\Delta - \beta)/\gamma$). Although the quantization threshold is learnable, the ternary activation still only contains fixed ternary values (i.e., $\{-1, 0, +1\}$). We consider a ternary activation $a^t$ and further reparameterize it by

\[
\tilde{a}^t = \alpha\, a^t + \delta \tag{4}
\]

where $\alpha$ is the magnitude scale factor and $\delta$ is the offset. With $\alpha$ and $\delta$, the reparameterized ternary activation can dynamically change the squashing range, improving the network capacity with little increase in model size and computation. Here, we refer to $a^t$ as the fixed ternary activation, because its ternary values are fixed and it only controls the direction of the activation vector.
Our method reduces to a number of previous methods for particular choices of $\alpha$ and $\delta$. For example, for one fixed choice of $\alpha$ and $\delta$, the activation is squashed into a fixed range, which is equivalent to HWGQ [3]. When $\alpha$ and $\delta$ are chosen to minimize the Euclidean distance from $\tilde{a}^t$ to the full-precision activation, our approach resembles XNOR-Net [24]. Note that the scale factor in XNOR-Net is different from ours: their scale factor has to be computed from the full-precision activations at each forward pass as a running variable, which is not practical. (As a result, they abandon this scale factor for quantized activations in their officially released implementation.) A similar idea of decoupling the vector magnitude from its direction for full-precision weights can also be found in [26]. Note that these factors are designed in a layer-wise pattern, so value ranges may vary across layers depending on $\alpha$ and $\delta$.
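To make the forward pass concrete, here is a minimal NumPy sketch of Equations (1)-(4). The threshold value 0.5 and all names (`DELTA`, `fixed_ternary_activation`, and so on) are illustrative assumptions, not the paper's notation:

```python
import numpy as np

DELTA = 0.5  # assumed ternarization threshold; the paper keeps it symbolic

def fixed_ternary_activation(x):
    # Eq. (1): squash values into the fixed set {-1, 0, +1}.
    return np.where(x > DELTA, 1.0, np.where(x < -DELTA, -1.0, 0.0))

def reparameterized_ternary_activation(x, gamma, beta, alpha, offset):
    # Eqs. (2)-(4): BN affine transform, ternarize, then rescale and shift.
    a_t = fixed_ternary_activation(gamma * x + beta)  # fixed ternary a^t
    return alpha * a_t + offset                       # learnable range

x = np.array([-2.0, -0.3, 0.1, 0.8, 2.5])
# With scale 1 and offset 0 the activation reduces to the fixed ternary one.
assert np.allclose(reparameterized_ternary_activation(x, 1.0, 0.0, 1.0, 0.0),
                   fixed_ternary_activation(x))
# With learned scale and offset the three levels move to {offset - alpha,
# offset, offset + alpha}, so the squashing range is no longer fixed.
levels = np.unique(reparameterized_ternary_activation(x, 1.0, 0.0, 1.3, -0.4))
```

With scale 1.3 and offset -0.4 the three levels become -1.7, -0.4, and 0.9, illustrating how the reparameterization moves the range away from $\{-1, 0, +1\}$.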
Weight Ternarization
In a similar spirit to activation ternarization, we first apply a linear transformation to the network weights, resembling the BN layer for activations, to obtain learnable quantization thresholds. For each weight filter $\mathbf{W}$, the weight transformer is defined as follows:

\[
\hat{\mathbf{W}} = \gamma_w \mathbf{W} + \beta_w \tag{5}
\]

where $\gamma_w$ and $\beta_w$ are learnable parameters. The transformed weights are then quantized by the same function as in Equation 1, i.e., $\mathbf{W}^t = \Phi(\hat{\mathbf{W}})$. As a consequence, the weights can adjust their quantization threshold. To obtain flexible quantized values, we reparameterize $\mathbf{W}^t$ in a similar way by $\tilde{\mathbf{W}}^t = \phi \mathbf{W}^t$, where $\phi$ is the scale factor. Note that an offset is not included for the weights, in consideration of the additional computation overhead it would incur.
3.2 Backward Update in Reparameterized Ternarization
A typical approach to propagating gradients through the quantized activation is the clipped Straight-Through Estimator (STE), $\partial a^t / \partial \hat{x} \approx \mathbf{1}_{|\hat{x}| \le 1}$, which is exactly the gradient of hard tanh. Despite being successfully used in previous methods [24], hard tanh suffers from a saturating problem: when $|\hat{x}| > 1$, the gradient becomes zero, entering the saturating zone shown in the red part of Figure 2. The saturating behavior of STE can cause gradient vanishing for the weights as the depth of the network increases, which slows down and can even hurt the convergence of the model. Furthermore, once activations fall into the saturating zone, they get stuck and can barely find a way out, because both the activation and its gradient remain unchanged.
Fortunately, our reparameterized ternary activation alleviates this problem easily. Let $\mathcal{L}$ be the loss function. By the chain rule, the derivatives with respect to the fixed ternary activation and the reparameterization parameters can be written as

\[
\frac{\partial \mathcal{L}}{\partial a^t} = \alpha \frac{\partial \mathcal{L}}{\partial \tilde{a}^t}, \qquad \frac{\partial \mathcal{L}}{\partial \alpha} = \sum_i \frac{\partial \mathcal{L}}{\partial \tilde{a}^t_i}\, a^t_i, \qquad \frac{\partial \mathcal{L}}{\partial \delta} = \sum_i \frac{\partial \mathcal{L}}{\partial \tilde{a}^t_i}. \tag{6}
\]
It can be observed that, since we decouple the scale and offset from the fixed ternary activation $a^t$, the reparameterized ternary activation can still be optimized through $\alpha$ and $\delta$ even when the STE gradient of $a^t$ saturates to zero. Consequently, the entire network converges faster and reaches a better optimum in the loss landscape.
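The decomposition in Equation 6 can be checked with a small NumPy sketch. The clipped-STE window $|x| \le 1$ and all names below are assumptions for illustration:

```python
import numpy as np

def rta_backward(x, a_t, alpha, grad_out):
    """Backward pass of a~ = alpha * a^t + delta, following Eq. (6).

    grad_out is dL/da~; the ternarizer's gradient uses the clipped STE
    (hard-tanh surrogate), which is zero wherever |x| > 1.
    """
    ste = (np.abs(x) <= 1.0).astype(float)
    grad_x = grad_out * alpha * ste      # dL/dx: vanishes when the STE saturates
    grad_alpha = np.sum(grad_out * a_t)  # dL/dalpha: independent of the STE
    grad_delta = np.sum(grad_out)        # dL/ddelta: independent of the STE
    return grad_x, grad_alpha, grad_delta

# Every input sits in the saturating zone, yet alpha and delta still learn.
x = np.array([1.8, -2.2, 3.0])
a_t = np.sign(x)  # ternary outputs for these saturated inputs
grad_x, g_a, g_d = rta_backward(x, a_t, alpha=1.3,
                                grad_out=np.array([0.2, -0.5, 0.1]))
```

Here `grad_x` is all zeros (the STE is saturated everywhere), while the gradients for the scale and offset are non-zero, so the reparameterized activation keeps receiving learning signal.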
Furthermore, our reparameterized ternarization has another benefit: it can dynamically adjust the effective learning rate of the network parameters. Consider the gradient with respect to the full-precision input $\hat{x}$ of the ternarizer,

\[
\frac{\partial \mathcal{L}}{\partial \hat{x}} = \frac{\partial \mathcal{L}}{\partial \tilde{a}^t} \frac{\partial \tilde{a}^t}{\partial a^t} \frac{\partial a^t}{\partial \hat{x}} \approx \alpha\, \frac{\partial \mathcal{L}}{\partial \tilde{a}^t}\, \mathbf{1}_{|\hat{x}| \le 1}, \tag{7}
\]

where $\alpha$ can be absorbed into the learning rate as a multiplier, making training robust to the value of the learning rate. A learnable scale factor has also been studied in [26, 32], where a similar effect is claimed.
3.3 The XOR-XNOR Toy Problem
To demonstrate how the reparameterized ternarization improves the capacity of the quantized model, we give a toy example on a two-layer neural network, as shown in Figure 3.
The network is designed to learn two logical functions, XOR and XNOR. Four kinds of activation function are compared: the fixed ternary activation (fta), the reparameterized ternary activation (rta), the hyperbolic tangent activation (tanh), and the reparameterized hyperbolic tangent activation (rtanh). Inputs are sampled from a Bernoulli distribution plus uniform noise. Outputs are either 0 or 1. The network has a hidden layer consisting of 3 neurons without bias terms. To better observe the behavior of the quantized activations, we keep the weights as full-precision numbers. We report the mean squared error (MSE) during training. More implementation details are in the Appendix.
The training curves are shown in Figure 3. Compared with the fixed ternary activation (fta) and reparameterized ternary activation (rta), the hyperbolic tangent (tanh) is a full-precision function with the fixed squashing range $[-1, 1]$, and is therefore supposed to have better representation capability than the ternary activation functions. However, our rta achieves a lower MSE than tanh, because the scale and offset factors alleviate the squashing issue. Similarly, rtanh achieves a lower MSE than both tanh and rta; its learned scale and offset factors substantially change the squashing range away from $[-1, 1]$. From this empirical result we can see that the range of activation values is at least as important as the number of quantization levels.
Table 1: 2-bit representation of ternary and quaternary values.

1st bit  2nd bit  Our Ternary True Value  2-bit Network True Value
0  0  0  0
0  1  0  1
1  0  -1  2
1  1  +1  3
3.4 Efficient Computation Pattern
How to Compute Dot Product between Two Ternary Vectors
To support our ternary network (ternary weights + ternary activations), in this section we propose an efficient way to compute the dot product between the ternary weight and activation vectors, which is the core operation of both convolutional and linear layers. A special bit-encoding scheme is adopted for the ternary weights and activations. We use two bits to represent each ternary weight and activation: the first bit indicates whether the number is zero or not, and the second bit indicates its sign. Table 1 shows the detailed encoding scheme for all ternary values $-1$, $0$, and $+1$. Under this encoding scheme, zero can be represented by either 00 or 01.
Now, given the ternary vectors $\mathbf{w}^t$ and $\mathbf{a}^t$, we encode them into 2-bit vector representations. Suppose $\mathbf{w}_1$ is the vector containing the first bit of every entry of the ternary weights and $\mathbf{w}_2$ contains the second bits; we define $\mathbf{a}_1$ and $\mathbf{a}_2$ similarly for the activations. The dot product can then be computed using bitwise operations:

\[
\mathbf{w}^t \cdot \mathbf{a}^t = \mathrm{popcount}(\mathbf{w}_1 \wedge \mathbf{a}_1) - 2\, \mathrm{popcount}\big((\mathbf{w}_2 \oplus \mathbf{a}_2) \wedge \mathbf{w}_1 \wedge \mathbf{a}_1\big) \tag{8}
\]

where $\wedge$ and $\oplus$ are the bitwise AND and XOR operations, respectively, and $\mathrm{popcount}(\cdot)$ returns the number of 1s (logic high) in a vector. As indicated by Equation 8, the convolution can be computed efficiently via simple Boolean operations.
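As a sanity check, the Table 1 encoding and the popcount formula of Equation 8 can be prototyped in plain Python, with bit vectors as lists of 0/1 rather than packed machine words; `encode` and `ternary_dot` are illustrative names:

```python
def encode(v):
    # Table 1: first bit flags a non-zero value, second bit its sign (1 = +1).
    nonzero = [1 if t != 0 else 0 for t in v]
    sign = [1 if t > 0 else 0 for t in v]
    return nonzero, sign

def ternary_dot(w, a):
    # Eq. (8): popcount(w1 & a1) - 2 * popcount((w2 ^ a2) & w1 & a1).
    w1, w2 = encode(w)
    a1, a2 = encode(a)
    both = [x & y for x, y in zip(w1, a1)]                # non-zero products
    neg = [(s ^ t) & b for s, t, b in zip(w2, a2, both)]  # negative products
    return sum(both) - 2 * sum(neg)

w = [1, -1, 0, 1, -1]
a = [-1, -1, 1, 1, 0]
assert ternary_dot(w, a) == sum(x * y for x, y in zip(w, a))
```

The identity holds because the number of positive products equals the non-zero count minus the negative count, giving (non-zero) minus twice (negative); in hardware, the factor of two is a one-bit left shift.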
Figure 4(a) shows the hardware design for the vector multiplication in Equation 8. Given the two input vectors, the circuit computes the product between each pair of elements; the two partial popcount results are saved in two 32-bit counters. The multiplication by two in Equation 8 is easily achieved by shifting the partial result to the left by 1 bit. A subtractor then performs the subtraction in Equation 8.
For comparison, we compute the dot product of 2-bit quaternary weights and activations, since a quaternary model has the same size as ours. We use the computation pattern introduced in DoReFa-Net [31]. The quaternary vector multiplication is computed by ANDing each bit of the inputs. Figure 4(b) shows the hardware design for the vector multiplication in [31]. The circuit takes the two input vectors and calculates the product between each pair of elements. The multiplications by powers of two are implemented with bitwise shifts, and four adders are employed to sum the partial results. We evaluate the two designs of Figure 4 in terms of power consumption, computation latency, and area in the next section.
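The 2-bit baseline can be sketched the same way. This follows the general bit-plane AND-popcount scheme of DoReFa-Net for unsigned 2-bit values; `bit_planes` and `quaternary_dot` are illustrative names:

```python
def bit_planes(v, bits=2):
    # Decompose unsigned quaternary values {0, 1, 2, 3} into per-bit planes.
    return [[(t >> m) & 1 for t in v] for m in range(bits)]

def quaternary_dot(x, y, bits=2):
    # AND-popcount every pair of bit planes, weighted by 2^(m+k); for 2-bit
    # inputs this yields the four partial sums that Figure 4(b) accumulates.
    xp, yp = bit_planes(x, bits), bit_planes(y, bits)
    total = 0
    for m in range(bits):
        for k in range(bits):
            total += (1 << (m + k)) * sum(a & b for a, b in zip(xp[m], yp[k]))
    return total

x = [3, 1, 2, 0]
y = [2, 3, 1, 3]
assert quaternary_dot(x, y) == sum(a * b for a, b in zip(x, y))
```

The nested loop makes the cost difference visible: 2-bit inputs need four AND-popcount passes plus weighted additions, whereas the ternary pattern of Equation 8 needs only two popcounts, one shift, and one subtraction.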
How to Deal with the Scale and Offset
Our reparameterized ternary activation has two extra parameters beyond the fixed ternary activation: the scale $\alpha$ and the offset $\delta$. We demonstrate that they introduce only negligible computational complexity. With the quantized ternary weights $\tilde{\mathbf{W}}^t = \phi \mathbf{W}^t$ and the reparameterized ternary activation $\tilde{\mathbf{A}}^t = \alpha \mathbf{A}^t + \delta$, the input of the next layer can be computed by

\[
\mathbf{A}_{l+1} = \sigma\big(\alpha\phi \langle \mathbf{W}^t, \mathbf{A}^t \rangle + \delta\phi \langle \mathbf{W}^t, \mathbf{J} \rangle\big) \tag{9}
\]

where $\sigma$ is the ReLU function, $\langle \cdot, \cdot \rangle$ is the dot product between ternary vectors, and $\mathbf{J}$ denotes the matrix with all elements equal to 1. The second term in Equation 9, $\delta\phi \langle \mathbf{W}^t, \mathbf{J} \rangle$, is a constant at inference time, which can be pre-stored in the cache. As shown in Figure 1, when performing the convolution, we first calculate the ternary-valued convolution efficiently with Boolean operations, and then only need one extra multiply-accumulate (MAC) operation to obtain the final result.
Reparameterized Ternary Activation Can Adjust Sparsity Automatically
Interestingly, we can rewrite Equation 9 and fold the second term into the ReLU to make it more hardware-friendly:

\[
\mathbf{A}_{l+1} = \alpha\phi\, \sigma_{\theta}\big(\langle \mathbf{W}^t, \mathbf{A}^t \rangle\big), \qquad \sigma_{\theta}(x) = \max(x - \theta, 0), \qquad \theta = -\frac{\delta}{\alpha} \langle \mathbf{W}^t, \mathbf{J} \rangle \tag{10}
\]

where $\sigma_{\theta}$ is the ReLU parameterized by the sparsity threshold $\theta$ (assuming $\alpha\phi > 0$).
Apparently, $\theta$ controls the sparsity of $\mathbf{A}_{l+1}$. This reveals another effect of our reparameterized ternary activation: it can control the sparsity of the activations. Activation sparsity has been studied in [29], which finds that it has a profound impact on accuracy. However, [29] manually sets the sparsity threshold to increase sparsity, based on the beliefs that this reduces quantization error and that larger activations are more important under the attention mechanism. In our method, the sparsity threshold is determined by the offset factor and can be dynamically tuned during training for every layer. We report the sparsity in Section 4.2 to show that our method concurs with [29].
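The fold from Equation 9 into Equation 10 can be verified numerically. The snippet below is a NumPy sketch under the assumption that the product of the scale factors is positive, with made-up parameter values:

```python
import numpy as np

relu = lambda v: np.maximum(v, 0.0)

rng = np.random.default_rng(0)
w_t = rng.integers(-1, 2, size=(8, 64)).astype(float)  # ternary weight rows
a_t = rng.integers(-1, 2, size=64).astype(float)       # ternary activations
phi, alpha, delta = 0.8, 1.3, -0.4                     # reparameterizers

# Eq. (9): the second term depends only on the weights, so it is a
# per-output-unit constant that can be precomputed and cached.
const = delta * phi * w_t.sum(axis=1)
out9 = relu(alpha * phi * (w_t @ a_t) + const)

# Eq. (10): fold the constant into a shifted ReLU with sparsity threshold
# theta = -const / (alpha * phi); this assumes alpha * phi > 0.
theta = -const / (alpha * phi)
out10 = alpha * phi * relu(w_t @ a_t - theta)

assert np.allclose(out9, out10)
# Outputs are zeroed exactly where the ternary dot product falls below theta,
# which is how the offset ends up controlling activation sparsity.
assert np.array_equal(out10 == 0.0, (w_t @ a_t) <= theta)
```

The equivalence is just positive homogeneity of ReLU: for $c > 0$, $\sigma(cx + b) = c\,\sigma(x + b/c)$, so the cached constant turns into a per-unit threshold inside the ReLU.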
4 Experiments
In this section, we first present empirical evaluations of the reparameterized ternary network (RTN) on two real-world datasets, ImageNet-ILSVRC2012 [25] and CIFAR-10 [14]; we then evaluate the performance of the hardware implementation of RTN in terms of power consumption and area.
We adopt a number of popular neural architectures for evaluation: ResNet [12], AlexNet [15], MobileNet [13], and Network-In-Network (NIN) [20]. Two sets of strong baselines are chosen for comparison: 1) quantizing weights only: BWN [24], TWN [17], and TTQ [32]; 2) quantizing both weights and activations: XNOR-Net [24], Bi-Real [21], TBN [28], HWGQ [3], DoReFa-Net [31], PACT [4], and HORQ [18].
We denote our method with (resp. without) reparameterization on weights and activations as RTNR (resp. RTNF). We also evaluate our method when only weights are quantized.
We highlight a substantial accuracy improvement (up to 13% absolute improvement over XNOR-Net) of our RTN for ResNet-18 on ImageNet. This improvement mainly comes from: 1) zero being introduced into the quantized activation to obtain the fixed ternary activation; 2) dynamic adjustment of the quantization range of weights and activations through Equations (2) and (5); and 3) the learnable scale and offset applied to the fixed ternary activation to obtain the reparameterized ternary activation, which has much better representation capability with negligible computation overhead.
Compared with several 2-bit models, RTN has the lowest degradation from its full-precision counterpart and achieves comparable accuracy. In Section 4.4, we implement our ternary multiplication circuit as well as the 2-bit multiplication circuit used in [31, 4, 18], and show that the ternary-value multiplication circuit significantly outperforms the 2-bit one in terms of power and area.
4.1 Implementation
We follow the implementation settings of other extremely low-bit quantized networks [24] and do not quantize the weights and activations of the first and last layers. See the Appendix for more implementation details.
Initialization can be vitally important for quantized neural networks. We first train a full-precision model from scratch and initialize the RTN by minimizing the Euclidean distance between the quantized and full-precision weights, as in TWN [17]; the initial offset is set to 0.
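A sketch of this initialization, assuming TWN's closed-form approximation (threshold roughly 0.7 times the mean magnitude, scale equal to the mean magnitude above the threshold); `twn_init` is an illustrative name:

```python
import numpy as np

def twn_init(w):
    # TWN's approximate minimizer of || w - scale * ternarize(w) ||^2:
    # threshold ~= 0.7 * E|w|; scale = mean magnitude above the threshold.
    delta = 0.7 * np.abs(w).mean()
    mask = np.abs(w) > delta
    scale = np.abs(w[mask]).mean() if mask.any() else 1.0
    return delta, scale

w = np.array([1.0, -1.0, 0.1, -0.1])  # toy full-precision filter
delta, scale = twn_init(w)
```

For this toy filter the threshold is 0.385 and the scale is 1.0: only the two large-magnitude weights survive ternarization, and the scale matches their mean magnitude.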
4.2 Results on ImageNet
The validation error on ImageNet is plotted in Figure 5. We can see that RTNR has a lower error rate than RTNF. In particular, the AlexNet plot in Figure 5(b) shows that RTNR has a relatively smooth curve and better convergence speed, which may be the result of the automatic adjustment of the learning rate via the scale factor in Equation 7.
The overall results on ImageNet are shown in Table 2, together with several strong extremely low-bit models. Note that we swap the order of BN and ReLU, so we report the full-precision models' accuracy as a reference and compare the degradation from the full-precision models (the last column of the table). We first compare our RTN with models that quantize only the weights, such as BWN, TWN, and TTQ. Our RTN not only achieves state-of-the-art accuracy but also has the smallest gap to the full-precision models. In addition, compared with TTQ's asymmetric quantization, our RTN uses symmetric quantization, which is naturally harder to train but more friendly to hardware implementation.
ResNet18 (ImageNet)  

Methods  # bits(W/A)  FP ref.  Accuracy  Degrad. 
BWN  1 / 32  69.3  60.8  8.5 
2 / 32  69.3  61.8  7.5  
2 / 32  69.3  65.3  4.0  
2 / 32  69.6  66.6  3.0  
2 / 32  69.2  68.5  0.7  
XNOR  1 / 1  69.3  51.2  18.1 
BiReal  1 / 1  68.0  56.4  11.6 
1 / 2  69.3  55.6  13.7  
DoReFa  1 / 2  70.2  53.4  16.8 
HWGQ  1 / 2  69.6  56.1  13.5 
HORQ  2 / 2  69.3  55.9  13.4 
DoReFa  2 / 2  70.2  62.6  7.6 
PACT  2 / 2  70.2  64.4  5.8 
2 / 2  69.2  62.4  6.8  
2 / 2  69.2  64.5  4.7  
AlexNet (ImageNet)  
XNOR  1 / 1  56.6  44.2  12.4 
1 / 2  57.2  49.7  7.5  
DoReFa  1 / 2  55.9  49.8  6.1 
HWGQ  1 / 2  55.7  50.5  5.2 
PACT  2 / 2  57.2  55.0  2.2 
2 / 2  58.7  52.6  6.1  
2 / 2  58.7  53.9  4.8  
MobileNet (ImageNet)  
PACT  2 / 2  69.9  56.1  13.8 
2 / 2  69.9  56.9  13.0  
NIN (CIFAR10)  
XNOR  1 / 1  89.8  86.4  3.4 
2 / 2  89.8  88.2  1.6  
2 / 2  89.8  88.5  1.3  
2 / 2  89.8  89.1  0.7  
2 / 2  89.8  89.6  0.2 
Quantizing the activations is more challenging than quantizing the weights [3], and there is still a large margin between full-precision models and extremely low-bit models. We compare several models that quantize both weights and activations with our proposed model (denoted RTNR). For the ablation study, we also report the performance of our ternary network with the fixed ternary activation (denoted RTNF) to show the effectiveness of the scale and offset.
According to Table 2, we can conclude that: 1) RTNR outperforms almost every model, so although there is a trade-off between the number of bits and accuracy, the ternary network finds a better balance between them. 2) Although PACT has comparable performance, especially on AlexNet, our RTNR also shows a small gap to the full-precision models, and RTN is furthermore better suited for hardware implementation on mobile and embedded devices. 3) With learnable scale factors and offsets, RTNR achieves higher accuracy than RTNF, which validates the improvement in representation ability brought by our reparameterization design.
Sparsity Comparison
According to our analysis in Section 3.4, the offset can automatically adjust the sparsity of the activations. In general, the observed change in sparsity concurs with [29], which argues that the optimal sparsity is slightly higher than 50% based on the attention mechanism. Figure 5(d) shows the sparsity comparison between RTNF, RTNR, and the full-precision models. Our reparameterized ternary activation adjusts the sparsity automatically, and the sparsity of RTNR is close to that of the full-precision model. Compared with [29], our RTN adjusts sparsity automatically without any manual setting.
Analysis of Reparameterization
We report the scale and offset values for the activations and the mean scale value for the weights of each layer of ResNet-18 (see Appendix). The activation distribution changes considerably across layers, which means that each layer learns its own optimal range and magnitude, thus increasing the representational ability. Interestingly, we find that the activations and weights in the downsample residual layers change only slightly. This may result from the special filters in these layers.
4.3 Results on CIFAR10
For CIFAR-10, we mainly compare our method with XNOR-Net on NIN. We use the PyTorch implementation of XNOR-Net [24] (https://github.com/jiecaoyu/XNOR-Net-PyTorch). See the Appendix for more implementation details. Results for NIN on CIFAR-10 can be found in Table 2. Our RTN almost recovers the full-precision accuracy (only a 0.2% absolute gap) without bells and whistles. This performance may result from the scale and offset significantly changing the range of the ternary activations and weights.
Evolution of the Scale and Offset Factors
In Figure 6, we show the evolution of the parameters in the reparameterized ternary activation. Almost all scale factors increase at the beginning of training and are scaled down when the learning rate is decreased. The offsets are more volatile. Nonetheless, these parameters are directly optimized by the training objective, and their evolution is not easy to predict heuristically.
Ablation Study
There are two learnable parameters in the reparameterized ternary activation: the scale factor and the offset factor. We evaluate the effect of these two parameters by applying only one of them in the RTN. We denote by RTNS the activation with the scale factor only and by RTNO the activation with the offset only. The implementation on CIFAR-10 is kept the same as before.
The results are shown in Table 2. Apparently, when we only add the scale factor, the improvement is trivial. This is because ReLU is unaffected by a positive rescaling of the activation (i.e., $\sigma(cx) = c\,\sigma(x)$ for $c > 0$), and BN can eliminate the effect of the scale factor. We refer to this effect as the scale invariance of the activation. However, according to Equation 10, the offset factor changes the sparsity threshold of the ReLU and thus greatly affects the activation. Therefore, RTNO achieves higher performance than RTNS. Note that in our RTNR there is no scale invariance when the scale and offset factors are applied together, as they jointly change the distribution of the activation.
4.4 Hardware Implementation
We compare the hardware performance of the two circuits for the vector multiplication operation shown in Figure 4. We additionally implement the circuit for 32-bit floating-point vector multiplication. We synthesize our design with the Xilinx Vivado Design Suite [27] and use a Xilinx VC707 FPGA evaluation board for power measurement. For the comparison of circuit area and computation latency, we use the Synopsys Design Compiler [8] with the 45nm NanGate Open Cell Library [23].
As shown in Table 3, the circuit for ternary values (Figure 4(a)) outperforms those for 2-bit (quaternary) values (Figure 4(b)) and floating-point values in terms of both power (3.43x and 46.46x savings, respectively) and area. These differences result from the additional adders and bitwise shifters used by the quaternary multiplication circuit. From Table 3, we also note that the quaternary multiplication circuit is larger than the ternary one. That is, for a fixed circuit area and a settled clock frequency, the ternary multiplication circuit has lower latency, since we can make four ternary-value multipliers work in parallel. Moreover, our circuit can easily be deployed as a building block of any large-scale parallel computing framework, such as a systolic array [16], for efficient matrix multiplication.
5 Conclusion
In this paper, we propose the reparameterized ternary network with ternary weights and activations. The learnable reparameterizers are demonstrated to considerably increase the expressiveness of the fixed ternary values. According to our analysis and empirical results, the scale and offset are able to adjust the range of the quantized values, influence the sparsity of the activations, and accelerate training. To support efficient computing in RTN, a novel computation pattern is proposed.
Acknowledgements
This work is supported by the National Research Foundation, Prime Minister's Office, Singapore under its National Cybersecurity R&D Programme (No. NRF2016NCRNCR002020), and an FY2017 SUG Grant.
References
 [1] (2018) Nice: noise injection and clamping estimation for neural network quantization. arXiv preprint arXiv:1810.00162. Cited by: §2.
 [2] (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. In arXiv:1308.3432, Cited by: §1, §2.
 [3] (2017) Deep learning with low precision by halfwave gaussian quantization. In CVPR, Cited by: §1, §2, §3.1, §4.2, §4.
 [4] (2018) Pact: parameterized clipping activation for quantized neural networks. In arXiv:1805.06085, Cited by: §2, §4, §4.
 [5] (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In NeurIPS, Cited by: §1.
 [6] (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. In arXiv:1602.02830, Cited by: §2.
 [7] (2018) GXNOR-Net: training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework. In Neural Networks, Cited by: §3.1.
 [8] Design compiler: rtl synthesis. Note: https://www.synopsys.com/support/training/rtlsynthesis/designcompilerrtlsynthesis.html Cited by: §4.4.

 [9] (2019) Differentiable soft quantization: bridging full-precision and low-bit neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4852–4861. Cited by: §2.
 [10] (2015) Deep learning with limited numerical precision. In ICML, Cited by: §2.
 [11] (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. In arXiv:1510.00149, Cited by: §1.
 [12] (2016) Deep residual learning for image recognition. In CVPR, Cited by: §4.

 [13] (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §4.
 [14] (2009) Learning multiple layers of features from tiny images. Technical report, Citeseer. Cited by: §4.
 [15] (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, Cited by: §4.
 [16] (1982) Why systolic architectures?. In IEEE Computer, Cited by: §4.4.
 [17] (2016) Ternary weight networks. In arXiv:1605.04711, Cited by: §1, §2, §4.1, §4.
 [18] (2017) Performance guaranteed network acceleration via high-order residual quantization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2584–2592. Cited by: §4, §4.
 [19] (2017) Performance guaranteed network acceleration via high-order residual quantization. In ICCV, Cited by: §2.
 [20] (2013) Network in network. In arXiv:1312.4400, Cited by: §4.
 [21] (2018) Bi-Real Net: enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm. In ECCV, Cited by: §2, §4.

 [22] (2010) Rectified linear units improve restricted Boltzmann machines. In ICML, Cited by: §2, §3.
 [23] NanGate FreePDK45 open cell library. Note: http://www.nangate.com/?page_id=2325 Cited by: §4.4.
 [24] (2016) XNOR-Net: ImageNet classification using binary convolutional neural networks. In ECCV, Cited by: §1, §1, §2, §2, §3.1, §3.2, §4.1, §4.3, §4.
 [25] (2015) Imagenet large scale visual recognition challenge. In IJCV, Cited by: §4.
 [26] (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In NeurIPS, Cited by: §3.1, §3.2.
 [27] Vivado design suite  hlx editions productivity. multiplied. Note: https://www.xilinx.com/products/designtools/vivado.html Cited by: §4.4.
 [28] (2018) TBN: convolutional neural network with ternary inputs and binary weights. In ECCV, Cited by: §3.1, §4.
 [29] (2018) Two-step quantization for low-bit neural networks. In CVPR, Cited by: §3.4, §4.2.
 [30] (2015) Accelerating very deep convolutional networks for classification and detection. In PAMI, Cited by: §1.
 [31] (2016) DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. In arXiv:1606.06160, Cited by: §1, §2, §3.4, §3.4, §4, §4.
 [32] (2016) Trained ternary quantization. In arXiv:1612.01064, Cited by: §1, §2, §3.2, §4.
Appendix A Experimental Details
A.1 Layer Order
In RTN, we swap the order of BN and ReLU; the details are shown in Figure 7.
A.2 Implementation Details of the XOR-XNOR Toy Problem
The two-layer network is designed to learn two logical functions, XOR and XNOR. Inputs are sampled from a Bernoulli distribution plus uniform noise. Outputs are either 0 or 1. The network has a hidden layer consisting of 3 neurons without bias terms. To better observe the behavior of the quantized activations, we keep the weights as full-precision numbers.
We use four kinds of activation function for comparison: the fixed ternary activation (fta), the reparameterized ternary activation (rta), the hyperbolic tangent activation (tanh), and the reparameterized hyperbolic tangent activation (i.e., tanh with scale and offset factors; rtanh). Except for the reparameterized activations, the others squash their inputs into the fixed range $[-1, 1]$. Both fta and rta have a limited number (only 3) of quantization levels. The hyperbolic tangent is full-precision, and is therefore supposed to have better representation ability than the ternary activations. We use the MSE loss and stochastic gradient descent to train the network. The learning rate is 0.03 and we train the toy model for 15000 epochs.
A.3 Implementation Details of the Main Experiments
For the ImageNet dataset, training images are resized to 256 on the smaller dimension, and then a random crop of 224x224 is selected for training. Training images are randomly flipped horizontally. Test images are centrally cropped to 224x224 (227x227 for training and test images in AlexNet). We use Stochastic Gradient Descent (SGD) as the optimizer. Weight decay is set to 0.0001 for ResNet-18 and AlexNet. Each network is trained for up to 100 epochs with a batch size of 1024. The learning rate starts from 0.1 and is decayed by a factor of 10 at epochs 30, 60, and 85. For the quantization parameters (e.g., the scale and offset for activations and the scale for weights), we set a lower learning rate, because their gradients are summations over all elements of the weights/activations, which increases their magnitude. In practice, we find that 0.001 is appropriate for the weight quantization parameters and 0.1 for the activation quantization parameters.
For NIN on CIFAR-10, we use Adam as the optimizer and train the network for 320 epochs. The weight decay is set to 0.00001, and the initial learning rate is 0.01 with a decay factor of 0.1. Note that we also do not quantize the first and last layers.
Layer  

layer1.0.conv1  1.0426  0.0308  1.8160 
layer1.0.conv2  0.9729  0.2344  1.0974 
layer1.1.conv1  1.0223  0.1699  2.0325 
layer1.1.conv2  0.7962  0.0956  1.6872 
layer2.0.conv1  1.3083  0.5152  3.0458 
layer2.0.conv2  0.8191  0.6840  1.5639 
layer2.0.downsample.0  1.0000  0.0024  0.8739 
layer2.1.conv1  1.4091  0.3080  1.4284 
layer2.1.conv2  0.7678  0.4921  2.3644 
layer3.0.conv1  1.3986  0.8014  2.7552 
layer3.0.conv2  0.8916  0.7033  1.6015 
layer3.0.downsample.0  0.9996  0.0000  0.9435 
layer3.1.conv1  1.6719  0.4738  2.9345 
layer3.1.conv2  1.0112  0.4731  2.1110 
layer4.0.conv1  2.0472  1.4202  3.0216 
layer4.0.conv2  1.1033  0.9717  1.7116 
layer4.0.downsample.0  1.0037  0.0000  0.8537 
layer4.1.conv1  2.4687  1.4774  1.8244 
layer4.1.conv2  0.8959  0.7186  2.3379 