1 Introduction
Quantizing deep neural networks (DNNs) can reduce memory requirements and energy consumption when deploying inference on edge devices such as mobile phones, ASICs, and FPGAs. Compared with networks quantized by other methods, binary and ternary networks use only 1 or 2 bits to represent the weights of a DNN, and can therefore further improve inference performance on edge devices: they not only remove multiplication operations but also use less memory. As a result, much research focuses on binary and ternary quantization.
BinaryConnect Courbariaux et al. [2015] proposed a sign function to binarize the weights. Binary Weight Network (BWN) Rastegari et al. [2016] introduced the same binarization function but added an extra scaling factor to obtain better results. BinaryNet Hubara et al. [2016] and XNOR-Net Rastegari et al. [2016] extended the previous works so that both weights and activations were binarized. Beyond binarization, ternarization has also been studied; it inherently prunes weights close to zero by setting them to zero during training, making networks sparser. TWN Li et al. [2016] quantizes full-precision weights to ternary weights so that the Euclidean distance (L2 norm) between the full-precision weights and the resulting ternary weights, together with a scaling factor, is minimized. GXNOR-Net Deng et al. [2018] provides a unified discretization framework for both weights and activations. Alemdar et al. [2017] trained ternary neural networks using a teacher-student approach based on a layer-wise greedy method. Mellempudi et al. [2017] proposed fine-grained quantization (FGQ) to ternarize pretrained full-precision models, while also constraining activations to 8 and 4 bits.

The parts of inference computation that consume the most time and energy involve the many weights, which are stored as tensors in every layer. A tensor can be decomposed into a set of vectors, referred to as target vectors, and each target vector is approximated by a binary or ternary vector. To control the approximation error, Euclidean distance is the measure most commonly used in previous works: these quantization methods measure the approximation error, or similarity, between the original target vectors and the approximated ternary or binary vectors as a Euclidean distance. This measure, however, is known to require expensive computation; for example, the ternary method proposed in Mellempudi et al. [2017] is computationally costly. In this paper, we propose a novel ternary method whose time complexity is reduced to $O(n \log n)$, where $n$ is the dimension of a target vector, by replacing the Euclidean distance with cosine similarity. We call our method the cosine-similarity-based target non-retraining ternary (TNT) method. In addition, our method has the following advantages: 1) TNT is a non-retraining optimal quantization method for ternarization, binarization, and low bit-width quantization; 2) we find the theoretical upper limit of the similarity between target vectors and ternary vectors, and TNT is guaranteed to always find the optimal ternary vector with the maximum similarity to the original vector; 3) we find that the similarity is influenced by the distribution of the component values of the target vectors, and in particular, higher similarity is obtained for uniformly distributed components than for normally distributed ones.
2 Method Description
The proposed TNT first divides the tensor-shaped weights of a DNN model into a number of target vectors. It then finds the ternary vector most similar to each target vector with respect to cosine similarity; in other words, the ternary vector is selected so that the angle between the target vector and the ternary vector is minimized. Finally, it uses a scalar-tuning technique to adjust the error between each target vector and its ternary vector to obtain an optimal conversion result.
2.1 Tensor Decomposition and Vectorization
The weights of a DNN layer are normally stored as a fourth-order tensor of shape $c_{out} \times c_{in} \times w \times h$, which contains $c_{out}$ third-order tensors, each with $c_{in}$ channels, width $w$, and height $h$. The purpose of tensor vectorization is to flatten every third-order tensor into a set of target vectors. We expect that decomposing a tensor along the channel direction yields good results, because each channel is an integral unit that acts as a feature extractor in convolution with a feature map. Hence, a third-order tensor can be vectorized into $c_{in}$ target vectors of dimension $w \times h$. This expectation will be verified through experiments in this paper.
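As a concrete illustration, the channel-wise decomposition above can be sketched in NumPy; the function name and the tensor sizes below are hypothetical examples, not taken from the paper:

```python
import numpy as np

def vectorize_conv_tensor(weights):
    """Flatten a fourth-order conv weight tensor of shape
    (c_out, c_in, h, w) into target vectors along the channel
    direction: one vector of length h*w per channel."""
    c_out, c_in, h, w = weights.shape
    return weights.reshape(c_out * c_in, h * w)

# Hypothetical example: 64 filters, 32 input channels, 3x3 kernels
W = np.random.randn(64, 32, 3, 3)
targets = vectorize_conv_tensor(W)
print(targets.shape)  # (2048, 9)
```

Each row of `targets` is then ternarized independently.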
2.2 Target Non-retraining Ternarization
We first introduce our cosine-similarity-based technique, which reduces the search range from $3^n$ candidates to $n$ candidates. Then, a scalar-tuning method is proposed to further optimize the ternary vector. The total computational complexity is $O(n \log n)$.
2.2.1 Cosine Similarity
Given a target vector of layer $l$ of a CNN, denoted by $\mathbf{w}^{(l)} \in \mathbb{R}^n$, the purpose is to find a ternary vector $\mathbf{t} \in \{-1, 0, +1\}^n$ that approximates it. For simplicity of notation, we omit the superscript $l$, since it only indicates the layer, and write $\mathbf{w}$. In TNT, we use the cosine similarity between the two vectors to find the optimal ternary vector $\mathbf{t}^*$. The cosine similarity between the target vector $\mathbf{w}$ and the ternary vector $\mathbf{t}$ can be written as Eq. (1), where $\theta$ is the angle between $\mathbf{w}$ and $\mathbf{t}$. The value of $\cos\theta$ is controlled by the vector $\mathbf{t}$ alone, since every element of the target vector is fixed.

$$\cos\theta = \frac{\mathbf{w}^\top \mathbf{t}}{\|\mathbf{w}\| \, \|\mathbf{t}\|} \quad (1)$$

Since $\|\mathbf{w}\|$ is a constant, maximizing Eq. (1) is equivalent to solving

$$\mathbf{t}^* = \arg\max_{\mathbf{t} \in \{-1,0,+1\}^n} \frac{\mathbf{w}^\top \mathbf{t}}{\|\mathbf{t}\|} \quad (2)$$
Let $\tilde{\mathbf{w}}$ be the vector obtained by sorting the magnitudes $(|w_1|, \ldots, |w_n|)$ in decreasing order. Without loss of generality, we can assume that all $w_i$ are nonzero. First, we solve Eq. (2) under the constraint that $\mathbf{t}$ has exactly $r$ nonzero elements. Evidently,

$$\frac{\mathbf{w}^\top \mathbf{t}}{\|\mathbf{t}\|} \le \frac{\sum_{i=1}^{r} \tilde{w}_i}{\sqrt{r}} \quad (3)$$

holds, and equality is attained if $t_i = \mathrm{sign}(w_i)$ for the $r$ indices $i$ that correspond to $\tilde{w}_1, \ldots, \tilde{w}_r$, and $t_i = 0$ for the others. Therefore, what we need to know is

$$r^* = \arg\max_{1 \le r \le n} \frac{\sum_{i=1}^{r} \tilde{w}_i}{\sqrt{r}},$$

and hence, computing the maximizer in Eq. (1) amounts to finding the maximum among $n$ candidates instead of among $3^n$ candidates. Moreover, the computational cost of finding $\mathbf{t}^*$ is dominated by the time complexity of sorting $(|w_1|, \ldots, |w_n|)$, which is $O(n \log n)$.
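The $n$-candidate search described above can be sketched in a few lines of NumPy (the function name is ours): sort the magnitudes, score each prefix, and sign the winning support.

```python
import numpy as np

def optimal_ternary(w):
    """Find t in {-1,0,+1}^n maximizing cos(w, t).
    Only n candidates need checking: for each r, the best support
    is the r largest-|w_i| entries, signed like w. The cost is
    dominated by the O(n log n) sort."""
    n = len(w)
    order = np.argsort(-np.abs(w))        # indices by decreasing |w_i|
    sorted_abs = np.abs(w)[order]         # the sorted magnitudes
    # score(r) = (sum of top-r magnitudes) / sqrt(r), r = 1..n
    prefix = np.cumsum(sorted_abs)
    scores = prefix / np.sqrt(np.arange(1, n + 1))
    r_star = int(np.argmax(scores)) + 1
    t = np.zeros(n)
    top = order[:r_star]
    t[top] = np.sign(w[top])
    return t

w = np.array([0.9, -0.1, 0.8, 0.05])
print(optimal_ternary(w))  # keeps only the two large entries, signed
```

For this input, the two small components are pruned to zero, exactly the sparsifying behavior ternarization is meant to provide.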
2.2.2 Scalar-Tuning
Thus, we can obtain $\mathbf{t}^*$, whose angle with $\mathbf{w}$ is minimized. In other words, $\mathbf{t}^*$ approximately determines the direction of $\mathbf{w}$. To describe $\mathbf{w}$, we also need to determine its length in the direction of $\mathbf{t}^*$. The principle is to find an optimal scaling factor $\alpha$ that minimizes the error $\|\mathbf{w} - \alpha \mathbf{t}^*\|$. It is well known that the error is minimized if, and only if, $\alpha \mathbf{t}^*$ is the orthogonal projection of $\mathbf{w}$ onto $\mathbf{t}^*$ (Fig. 1), which is given by

$$\alpha = \frac{\mathbf{w}^\top \mathbf{t}^*}{\|\mathbf{t}^*\|^2}.$$

Storing $\alpha$ increases the necessary memory size (footprint) slightly, but it is effective in improving accuracy. Moreover, if $\mathbf{t}^*$ includes both positive and negative elements, we can improve the accuracy further by memorizing one more scalar. We decompose $\mathbf{t}^* = \mathbf{t}^+ + \mathbf{t}^-$ with $\mathbf{t}^+ \ge 0$ and $\mathbf{t}^- \le 0$, where, for a vector $\mathbf{v}$, $\mathbf{v} \ge 0$ ($\mathbf{v} \le 0$) means that all the elements of $\mathbf{v}$ are nonnegative (nonpositive). We should note that $\mathbf{t}^{+\top} \mathbf{t}^- = 0$ and $\|\mathbf{t}^*\|^2 = \|\mathbf{t}^+\|^2 + \|\mathbf{t}^-\|^2$ hold. Therefore, the two sub-problems decouple, and we have

$$\alpha^+ = \frac{\mathbf{w}^\top \mathbf{t}^+}{\|\mathbf{t}^+\|^2}, \qquad \alpha^- = \frac{\mathbf{w}^\top \mathbf{t}^-}{\|\mathbf{t}^-\|^2},$$

so that

$$\min_{\alpha^+, \alpha^-} \|\mathbf{w} - \alpha^+ \mathbf{t}^+ - \alpha^- \mathbf{t}^-\| \;\le\; \min_{\alpha} \|\mathbf{w} - \alpha \mathbf{t}^*\|.$$

Thus, if we memorize $\alpha^+$ and $\alpha^-$ in addition to $\mathbf{t}^*$, we can still save memory while suppressing the loss of accuracy, at the cost of only one extra scalar per target vector.
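The projection and its two-scalar refinement can be sketched as follows, assuming NumPy (the function names are ours):

```python
import numpy as np

def scalar_tune(w, t):
    """Single scaling factor: orthogonal projection of w onto t."""
    return float(w @ t) / float(t @ t)

def two_scalar_tune(w, t):
    """Separate scales for the positive and negative parts of t.
    Their supports are disjoint, so the two one-dimensional
    least-squares problems decouple."""
    t_plus = np.where(t > 0, t, 0.0)
    t_minus = np.where(t < 0, t, 0.0)
    a_plus = float(w @ t_plus) / float(t_plus @ t_plus) if t_plus.any() else 0.0
    a_minus = float(w @ t_minus) / float(t_minus @ t_minus) if t_minus.any() else 0.0
    return a_plus, a_minus

w = np.array([0.9, -0.5, 0.8, -0.4])
t = np.array([1.0, -1.0, 1.0, -1.0])
print(scalar_tune(w, t))      # one shared scale alpha
print(two_scalar_tune(w, t))  # per-sign scales (alpha+, alpha-)
```

On this toy vector the shared scale averages over magnitudes of both signs, while the per-sign scales fit each half separately and leave a visibly smaller residual.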
3 Simulations
In this section, we first show the performance of our TNT method in transforming target vectors into ternary vectors. Then, we show the upper limit of the similarity for ternary and binary vectors when different distributions are used to initialize the target vectors. Finally, we demonstrate an example of using TNT to convert the weights of DNN models to ternary. All experiments are run on a PC with an Intel(R) Core(TM) i7-8700 CPU at 3.2 GHz, 32 GB of RAM, and an NVIDIA GeForce GTX 1080 graphics card, running Windows 10.
3.1 Converting Performance
To investigate the accuracy of our ternarization method, we prepare two target vectors of dimension 1,000,000: one has elements drawn independently from a uniform distribution, and the other from a normal distribution. Figure 2 (a) shows the cosine similarity scores observed as we change the number of nonzero elements of $\mathbf{t}$. The target vector drawn from the uniform distribution attains a higher peak score than the one drawn from the normal distribution, and the two peaks occur at different numbers of nonzero elements of $\mathbf{t}$. The curves of the cosine similarity score are unimodal, and if this always holds true, finding the maximum cosine similarity score becomes easier.
Moreover, we found that the cosine similarity is not strongly affected by the dimension of the ternary or binary vector. We computed the maximum cosine similarity 10,000 times, increasing the dimension of the target vector by one each time. Figure 2 (b) and (c) show the simulation results: 1) whether the target vector follows a normal or a uniform distribution, ternary vectors preserve a higher similarity than binary vectors; 2) the cosine similarities of ternary and binary vectors converge to stable values as the vector dimension increases, and the ternary vector has a smaller variance compared with the binary vector.
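A small Monte-Carlo sketch, assuming NumPy, reproduces the qualitative comparison between the two distributions (the exact scores depend on the random draw):

```python
import numpy as np

def best_similarity(w):
    """Maximum cosine similarity over the n ternary candidates:
    score(r) = (sum of the r largest |w_i|) / sqrt(r), divided by ||w||."""
    s = np.sort(np.abs(w))[::-1]
    scores = np.cumsum(s) / np.sqrt(np.arange(1, len(w) + 1))
    return scores.max() / np.linalg.norm(w)

rng = np.random.default_rng(0)
n = 100_000
uni = best_similarity(rng.uniform(-1, 1, n))
nor = best_similarity(rng.normal(0.0, 1.0, n))
print(uni, nor)  # the uniform target attains the higher similarity
```

At this dimension both scores are tightly concentrated, so a single draw suffices to see the gap between the two distributions.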
3.2 Performance on Neural Networks
We perform our experiments on LeNet-5 Li et al. [2016], VGG-7 Li et al. [2016], and VGG-16 Simonyan and Zisserman [2014] using the MNIST, CIFAR-10, and ImageNet datasets, respectively. We first train a full-precision network model and then replace the floating-point parameters with the ternary parameters obtained by the proposed TNT. A precise comparison between the floating-point model and the ternary model is conducted.

The experimental results are shown in Table 1. Without network retraining, inference with ternary parameters loses only 0.21% of accuracy for LeNet-5 on the MNIST dataset and 2.22% for VGG-7 on the CIFAR-10 dataset. For VGG-16 on the ImageNet dataset, the Top-1 and Top-5 accuracies drop by 8.00% and 5.34%, respectively. Moreover, the memory sizes of LeNet-5 and VGG-7 are reduced 16 times, since each ternary weight requires only 2 bits of memory instead of a 32-bit float. On the other hand, because converting the first and last layers of VGG-16 to ternary without fine-tuning significantly affects accuracy, the same phenomenon mentioned in Mellempudi et al. [2017], we do not convert the first and last layers of VGG-16, and its parameter size is therefore reduced by a somewhat smaller factor.
Table 1: Comparison of full-precision (base line) and TNT-converted models. The VGG-16 row reports Top-1 and Top-5 accuracies.

Network   | Base Line      | TNT Converted  | Parameters | Converting Time
LeNet-5   | 99.18%         | 98.97%         | 1,663,370  | 7.803 s
VGG-7     | 91.31%         | 89.09%         | 7,734,410  | 88.288 s
VGG-16    | 64.26%, 85.59% | 56.26%, 80.25% | 12,976,266 | 115.863 s
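As a quick sanity check on the compression factor, replacing each 32-bit float weight with a 2-bit ternary code shrinks storage 16-fold. The sketch below uses the parameter counts from Table 1 and ignores the stored scaling factors and the unconverted VGG-16 layers, so it is only a rough estimate:

```python
def footprint_mb(n_params, bits):
    """Storage in megabytes for n_params weights at the given bit width."""
    return n_params * bits / 8 / 1e6

# Parameter counts taken from Table 1
models = {"LeNet-5": 1_663_370, "VGG-7": 7_734_410, "VGG-16": 12_976_266}
for name, n in models.items():
    full, tern = footprint_mb(n, 32), footprint_mb(n, 2)
    print(f"{name}: {full:.2f} MB -> {tern:.2f} MB ({full / tern:.0f}x)")
```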
4 Conclusions
In this paper, we proposed a target non-retraining ternary (TNT) method to convert a full-precision parameter model to a ternary parameter model accurately and quickly, without retraining the network. In our approach, first, we reduce the size of the search range from $3^n$ to $n$ by evaluating the cosine similarity between a target vector and a ternary vector. Second, scalar-tuning factors are proposed, coupled with the cosine similarity, to further enable TNT to find the best ternary vector. Owing to these techniques, TNT's computational complexity is only $O(n \log n)$. Third, we showed that the distribution of the parameters has a clear effect on the conversion result, which implies that the initial distribution of the parameters is important. Moreover, we applied TNT to several models and verified that quantization by our TNT method causes only a small loss of accuracy.
References
H. Alemdar, V. Leroy, A. Prost-Boucle, and F. Pétrot. Ternary neural networks for resource-efficient AI applications. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2547–2554, 2017.

M. Courbariaux, Y. Bengio, and J.-P. David. BinaryConnect: training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pp. 3123–3131, 2015.

L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li. GXNOR-Net: training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework. Neural Networks 100, pp. 49–58, 2018.

I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems, pp. 4107–4115, 2016.

F. Li, B. Zhang, and B. Liu. Ternary weight networks. arXiv preprint arXiv:1605.04711, 2016.

N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey. Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462, 2017.

M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542, 2016.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.