Neural Networks Weights Quantization: Target None-retraining Ternary (TNT)

by   Tianyu Zhang, et al.
Yahoo Japan Corporation

Quantization of the weights of deep neural networks (DNNs) has proven to be an effective way to deploy DNNs on edge devices such as mobile phones, ASICs, and FPGAs, because such devices lack the resources to support computation involving millions of high-precision weights and multiply-accumulate operations. This paper proposes a novel method for compressing vectors of high-precision DNN weights to ternary vectors, namely a cosine similarity based target non-retraining ternary (TNT) compression method. Our method uses cosine similarity instead of the Euclidean distance commonly used in the literature, and reduces the size of the search space for optimal ternary vectors from $3^N$ to $N$, where $N$ is the dimension of the target vectors. As a result, the computational complexity for TNT to find theoretically optimal ternary vectors is only $O(N \log N)$. Moreover, our experiments show that when we ternarize DNN models with high-precision parameters, the quantized models retain sufficiently high accuracy that retraining is not necessary.



1 Introduction

Quantizing deep neural networks (DNNs) can reduce memory requirements and energy consumption when deploying inference on edge devices such as mobile phones, ASICs, and FPGAs. Compared with networks quantized by other methods, binary and ternary networks use only 1 or 2 bits to represent each DNN weight, and can therefore further improve inference performance on edge devices: they not only remove multiplication operations but also use less memory. As a result, much research focuses on binary and ternary quantization.

BinaryConnect Courbariaux et al. [2015] proposed a sign function to binarize the weights. Binary Weight Network (BWN) Rastegari et al. [2016] introduced the same binarization function but added an extra scaling factor to obtain better results. BinaryNet Hubara et al. [2016] and XNOR-Net Rastegari et al. [2016] extended the previous works so that both weights and activations were binarized. Beyond binarization, ternarization, which inherently prunes weights close to zero by setting them to zero during training to make networks sparser, has also been studied. TWN Li et al. [2016] quantized full-precision weights to ternary weights so that the Euclidean distance (L2 norm) between the full-precision weights and the resulting ternary weights, together with a scaling factor, is minimized. GXNOR-Net Deng et al. [2018] provided a unified discretization framework for both weights and activations. Alemdar et al. [2017] trained ternary neural networks using a teacher-student approach based on a layer-wise greedy method. Mellempudi et al. [2017] proposed fine-grained quantization (FGQ) to ternarize pre-trained full-precision models, while also constraining activations to 8 and 4 bits.

The most time- and energy-consuming parts of inference involve large numbers of weights, which are stored as tensors in every layer. A tensor can be decomposed into a set of vectors, referred to as target vectors, and each target vector is approximated by a binary or ternary vector. To control the approximation error, most previous works use the Euclidean distance: these quantization methods measure the approximation error, or similarity, between the original target vectors and the approximating ternary or binary vectors as a Euclidean distance. This approach, however, is known to require expensive computation; the ternary method proposed in Mellempudi et al. [2017] is one example. In this paper, we propose a novel ternary method whose time complexity is improved to $O(N \log N)$ by replacing the Euclidean distance with cosine similarity. We call our method a cosine similarity based target non-retraining ternary (TNT) method. In addition, our method has the following advantages: 1) TNT is a non-retraining optimal quantization method for ternarization, binarization, and low bit-width quantization; 2) we find the theoretical upper limit of the similarity between target vectors and ternary vectors, and TNT is guaranteed to always find the optimal ternary vector, the one with maximum similarity to the original vector; 3) we find that the similarity is influenced by the distribution of the component values of the target vectors, and that higher similarity is obtained for uniformly distributed values than for normally distributed values.

2 Method Description

The proposed TNT first divides the tensor-type weights of a DNN model into multiple target vectors. Then, for every target vector, it finds the most similar ternary vector with respect to cosine similarity. In other words, the ternary vector is selected so that the angle between the target vector and the ternary vector is minimized. Finally, it uses a scalar-tuning technique to reduce the error between each target vector and its ternary vector and obtain an optimal conversion result.

2.1 Tensor Decomposition and Vectorization

The weights of a DNN are normally stored as a fourth-order tensor, for example one of shape $K \times C \times W \times H$, which contains $K$ third-order tensors, each with $C$ channels, width $W$, and height $H$. The purpose of tensor vectorization is to flatten every third-order tensor into a set of target vectors. We expect that decomposing a tensor along the channel direction yields good results, because each channel is an integral unit that acts as a feature extractor in the convolution with a feature map. Hence, a third-order tensor can be vectorized into $C$ target vectors of dimension $WH$, one per channel. This expectation will be verified through experiments in this paper.
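The channel-wise vectorization can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' code: the memory layout `(filters, channels, height, width)` and the helper names `to_target_vectors` and `from_target_vectors` are assumptions made for illustration.

```python
import numpy as np

def to_target_vectors(weights: np.ndarray) -> np.ndarray:
    """Flatten a 4-D weight tensor into one target vector per channel slice.

    Assumed layout: (filters, channels, height, width). Each row of the
    result is one height-by-width channel slice, flattened.
    """
    filters, channels, height, width = weights.shape
    return weights.reshape(filters * channels, height * width)

def from_target_vectors(vectors: np.ndarray, shape: tuple) -> np.ndarray:
    """Inverse of to_target_vectors: restore the original tensor shape."""
    return vectors.reshape(shape)

# Hypothetical convolution layer: 8 filters, 3 channels, 5x5 kernels.
rng = np.random.default_rng(0)
conv_weights = rng.normal(size=(8, 3, 5, 5))

targets = to_target_vectors(conv_weights)
print(targets.shape)  # (24, 25)

restored = from_target_vectors(targets, conv_weights.shape)
```

Because the reshape is lossless, each row can be quantized independently and the quantized rows reshaped back into the layer afterwards.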

2.2 Target Non-retrain Ternarization

We first introduce our cosine similarity based technique, TNT, which reduces the search range from $3^N$ to $N$ candidates. Then, a scalar-tuning method is proposed to further optimize the ternary vector. The total computational complexity is $O(N \log N)$.

2.2.1 Cosine Similarity

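Given a target vector $\mathbf{w}^{(l)} = (w_1, \dots, w_N)$ of layer $l$ of a CNN, the goal is to find a ternary vector $\mathbf{t} = (t_1, \dots, t_N)$ that approximates it. To simplify the equations, we drop the superscript $l$, since it only identifies the layer. In TNT, we use the cosine similarity between the two vectors to find the optimal ternary vector $\mathbf{t}$. The cosine similarity between the target vector $\mathbf{w}$ and the ternary vector $\mathbf{t}$ can be written as Eq. (1), where $t_i \in \{-1, 0, 1\}$ and $\theta$ is the angle between $\mathbf{w}$ and $\mathbf{t}$. The value of $\cos\theta$ is controlled by the vector $\mathbf{t}$, since every element of the target vector is fixed.

$$
\cos\theta = \frac{\mathbf{w} \cdot \mathbf{t}}{|\mathbf{w}|\,|\mathbf{t}|} = \frac{\sum_{i=1}^{N} w_i t_i}{\sqrt{\sum_{i=1}^{N} w_i^2}\,\sqrt{\sum_{i=1}^{N} t_i^2}} \tag{1}
$$

When we denote $a_i = w_i / |\mathbf{w}|$, Eq. (1) can be transformed to Eq. (2), where $\sum_{i=1}^{N} a_i^2 = 1$ and the search range of $\mathbf{t}$ has size $3^N$.

$$
\hat{\mathbf{t}} = \arg\max_{\mathbf{t} \in \{-1, 0, 1\}^N} \frac{\sum_{i=1}^{N} a_i t_i}{\sqrt{\sum_{i=1}^{N} t_i^2}} \tag{2}
$$

Let $b_1 \ge b_2 \ge \dots \ge b_N$ be the permutation obtained by sorting $|a_1|, \dots, |a_N|$ in decreasing order. Without loss of generality, we can assume that all $b_j$ are non-zero. First, we solve Eq. (2) under the constraint $\sum_{i=1}^{N} t_i^2 = M$, that is, exactly $M$ elements of $\mathbf{t}$ are non-zero. Evidently,

$$
\frac{\sum_{i=1}^{N} a_i t_i}{\sqrt{M}} \le \frac{\sum_{j=1}^{M} b_j}{\sqrt{M}}
$$

holds, and equality holds if $t_i = \operatorname{sign}(a_i)$ for every $i$ whose $|a_i|$ corresponds to some $b_j$ with $j \le M$, and $t_i = 0$ for the others. Therefore, what we need to know is

$$
\hat{M} = \arg\max_{1 \le M \le N} \frac{1}{\sqrt{M}} \sum_{j=1}^{M} b_j,
$$

and hence solving Eq. (2) amounts to finding the maximum among $N$ candidates instead of among $3^N$ candidates. Moreover, the computational cost of finding $\hat{M}$ equals the time complexity of sorting the $|a_i|$ into the $b_j$, which is $O(N \log N)$.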
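The search described above (sort the magnitudes, then compare the $N$ candidates that keep the $M$ largest of them) can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions, not the authors' implementation: the function names `ternarize`, `cosine`, and `brute_force_best` are hypothetical, and the target vector is assumed to be non-zero. The brute-force routine enumerates all $3^N$ ternary vectors and is included only to check the result on a tiny input.

```python
import itertools
import numpy as np

def ternarize(w: np.ndarray) -> np.ndarray:
    """Ternary vector in {-1, 0, 1}^N with maximum cosine similarity to w."""
    magnitudes = np.abs(w)
    order = np.argsort(-magnitudes, kind="stable")  # decreasing |w_i|, O(N log N)
    prefix = np.cumsum(magnitudes[order])           # sum of the M largest magnitudes
    counts = np.arange(1, w.size + 1)               # candidate values of M
    scores = prefix / np.sqrt(counts)               # proportional to cos(theta)
    best_m = int(np.argmax(scores)) + 1
    t = np.zeros(w.size, dtype=np.int8)
    keep = order[:best_m]
    t[keep] = np.sign(w[keep]).astype(np.int8)      # keep the sign of the M largest
    return t

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def brute_force_best(w: np.ndarray) -> float:
    """Best cosine similarity over all non-zero ternary vectors (tiny N only)."""
    best = -1.0
    for candidate in itertools.product((-1, 0, 1), repeat=w.size):
        c = np.array(candidate)
        if c.any():
            best = max(best, cosine(w, c))
    return best

w_small = np.array([0.9, -0.1, 0.5, 0.05])
t_small = ternarize(w_small)
print(t_small)  # [1 0 1 0]

# Cross-check against exhaustive search on a random 7-dimensional vector.
w_check = np.random.default_rng(1).normal(size=7)
print(cosine(w_check, ternarize(w_check)), brute_force_best(w_check))
```

For `w_small`, the sorted magnitudes are 0.9, 0.5, 0.1, 0.05, and the candidate scores $\sum_{j \le M} b_j / \sqrt{M}$ (up to the constant factor $1/|\mathbf{w}|$) are largest at $M = 2$, so only the two largest entries stay non-zero.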

2.2.2 Scalar-Tuning

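Thus, we can obtain the ternary vector $\mathbf{t}$ whose angle with the target vector $\mathbf{w}$ is minimized. In other words, $\mathbf{t}$ approximately determines the direction of $\mathbf{w}$. To describe $\mathbf{w}$, we also need to determine the length in the direction of $\mathbf{t}$. The principle is to find an optimal scalar $\lambda$ that minimizes the error $|\mathbf{w} - \lambda\mathbf{t}|$. It is well known that the error is minimum if, and only if, $\lambda\mathbf{t}$ is the orthogonal projection of $\mathbf{w}$ onto $\mathbf{t}$ (Fig. 1), which is given by

$$
\lambda = \frac{\mathbf{w} \cdot \mathbf{t}}{|\mathbf{t}|^2}. \tag{3}
$$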

Figure 1: Scalar Constant of a Ternary Vector

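Storing the scalar $\lambda$ alongside each ternary vector increases the necessary memory size (footprint), but is effective in improving accuracy. Moreover, if $\mathbf{t}$ includes both positive and negative elements, we can improve the accuracy further by storing one more scalar. We let $\mathbf{t} = \mathbf{t}^{+} + \mathbf{t}^{-}$ with $\mathbf{t}^{+} \ge 0$ and $\mathbf{t}^{-} \le 0$, and $\mathbf{w} = \mathbf{w}^{+} + \mathbf{w}^{-}$ with $\mathbf{w}^{+} \ge 0$ and $\mathbf{w}^{-} \le 0$; for a vector $\mathbf{v}$, $\mathbf{v} \ge 0$ ($\mathbf{v} \le 0$) means that all the elements of $\mathbf{v}$ are non-negative (non-positive). We should note that $\mathbf{t}^{+} \cdot \mathbf{t}^{-} = 0$, and that $\mathbf{w}^{+} \cdot \mathbf{t}^{-} = \mathbf{w}^{-} \cdot \mathbf{t}^{+} = 0$ holds, because every non-zero element of $\mathbf{t}$ has the same sign as the corresponding element of $\mathbf{w}$. Therefore, we have:

$$
|\mathbf{w} - (\lambda^{+}\mathbf{t}^{+} + \lambda^{-}\mathbf{t}^{-})|^2 = |\mathbf{w}^{+} - \lambda^{+}\mathbf{t}^{+}|^2 + |\mathbf{w}^{-} - \lambda^{-}\mathbf{t}^{-}|^2.
$$

Thus, if we store $\lambda^{+} = \frac{\mathbf{w}^{+} \cdot \mathbf{t}^{+}}{|\mathbf{t}^{+}|^2}$ and $\lambda^{-} = \frac{\mathbf{w}^{-} \cdot \mathbf{t}^{-}}{|\mathbf{t}^{-}|^2}$ in addition to $\mathbf{t}$, we can not only save memory but also suppress the loss of accuracy.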
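The effect of scalar-tuning can be illustrated numerically. The sketch below assumes a target vector `w` and a ternary vector `t` whose non-zero entries carry the sign of the corresponding entries of `w`; the function names and the example values are hypothetical. It compares the approximation error with one scalar (the orthogonal projection of `w` onto `t`) against the error with separate scalars for the positive and negative parts of `t`.

```python
import numpy as np

def single_scale(w: np.ndarray, t: np.ndarray) -> float:
    """Projection coefficient of w onto t: (w . t) / |t|^2."""
    return float(np.dot(w, t) / np.dot(t, t))

def split_scales(w: np.ndarray, t: np.ndarray) -> tuple:
    """Least-squares scales for the positive and negative parts of t."""
    pos, neg = t > 0, t < 0
    lam_pos = float(w[pos].sum() / pos.sum()) if pos.any() else 0.0
    lam_neg = float(-w[neg].sum() / neg.sum()) if neg.any() else 0.0
    return lam_pos, lam_neg

# Hypothetical target vector and a ternary vector with matching signs.
w = np.array([0.9, -0.2, 0.5, -0.6, 0.05])
t = np.array([1, 0, 1, -1, 0])

lam = single_scale(w, t)
lam_pos, lam_neg = split_scales(w, t)

approx_single = lam * t
approx_split = lam_pos * (t > 0) - lam_neg * (t < 0)

err_single = float(np.linalg.norm(w - approx_single))
err_split = float(np.linalg.norm(w - approx_split))
print(lam, lam_pos, lam_neg)
print(err_single, err_split)
```

With two scalars the positive and negative entries are fitted independently, so the error can never exceed that of the single-scalar fit, which is the special case where both scalars are equal.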

3 Simulations

In this section, we first show the performance of our TNT method in transforming target vectors to ternary vectors. Then, we show the upper limit of the similarity achieved by ternary and binary vectors when the target vectors are initialized from different distributions. Finally, we demonstrate an example that uses TNT to convert the weights of DNN models to ternary. All experiments are run on a PC with an Intel(R) Core(TM) i7-8700 CPU at 3.2 GHz, 32 GB of RAM, and an NVIDIA GeForce GTX 1080 graphics card, running Windows 10.

3.1 Converting Performance

To investigate the accuracy of our ternarization method, we prepare two target vectors of dimension 1,000,000: the elements of one are drawn independently from a uniform distribution, and those of the other from a normal distribution. Figure 2 (a) shows the cosine similarity scores observed as we change the number of non-zero elements of the ternary vector, including the highest score for each distribution and the number of non-zero elements at which it is attained. The curves of the cosine similarity score are unimodal; if this always holds, finding the maximum cosine similarity score becomes easier.

Moreover, we found that the cosine similarity is not easily affected by the dimension of a ternary or binary vector. We computed the maximum cosine similarity 10,000 times, increasing the dimension of the target vector by one each time. Figure 2 (b) and (c) show the simulation results: 1) regardless of whether the target vector follows a normal or a uniform distribution, ternary vectors preserve a higher similarity; 2) the cosine similarities of ternary and binary vectors converge to a stable value as the vector dimension increases, and the ternary vector has a smaller variance than the binary vector.
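A simulation in the spirit of this section can be sketched as follows. The settings are assumptions chosen for illustration, not the paper's exact configuration: the uniform samples are drawn from $[-1, 1]$, the normal samples are standard normal, the dimension is 100,000, and the function names are hypothetical. The binary vector is taken to be the element-wise sign of the target vector.

```python
import numpy as np

def max_ternary_cosine(w: np.ndarray) -> tuple:
    """Best cosine similarity over ternary vectors, and the non-zero count."""
    mags = np.sort(np.abs(w))[::-1]                  # decreasing magnitudes
    counts = np.arange(1, w.size + 1)
    scores = np.cumsum(mags) / (np.sqrt(counts) * np.linalg.norm(w))
    best = int(np.argmax(scores))
    return float(scores[best]), best + 1

def binary_cosine(w: np.ndarray) -> float:
    """Cosine similarity between w and its sign vector."""
    return float(np.abs(w).sum() / (np.sqrt(w.size) * np.linalg.norm(w)))

rng = np.random.default_rng(0)
n = 100_000  # assumed dimension, smaller than in the paper to keep the run short
samples = {
    "uniform": rng.uniform(-1.0, 1.0, size=n),
    "normal": rng.normal(0.0, 1.0, size=n),
}

results = {}
for name, w in samples.items():
    ternary, nonzero = max_ternary_cosine(w)
    results[name] = (ternary, binary_cosine(w), nonzero)
    print(f"{name}: ternary={ternary:.4f} binary={results[name][1]:.4f} "
          f"non-zero={nonzero}")
```

Since the all-non-zero sign vector is itself one of the ternary candidates, the ternary score can never fall below the binary score for the same target vector.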

Figure 2: Simulation Result of Cosine Similarity by TNT method

3.2 Performance on Neural Networks

We perform our experiments on LeNet-5 Li et al. [2016], VGG-7 Li et al. [2016], and VGG-16 Simonyan and Zisserman [2014], using the MNIST, CIFAR-10, and ImageNet datasets respectively. We first train a full-precision network model and then replace the floating-point parameters with the ternary parameters obtained by the proposed TNT. A precise comparison between the floating-point model and the ternary model is conducted.

The experimental results are shown in Table 1. Without network retraining, inference with ternary parameters loses only 0.21 percentage points of accuracy with LeNet-5 on MNIST and 2.22 percentage points with VGG-7 on CIFAR-10. For VGG-16 on ImageNet, the Top-1 and Top-5 accuracies drop by 8.00 and 5.34 percentage points, respectively. Moreover, the memory size of LeNet-5 and VGG-7 is reduced by a factor of 16 relative to 32-bit floating-point weights, since each ternary weight requires only 2 bits of memory. On the other hand, converting the first and last layers of VGG-16 to ternary without fine-tuning has a significant effect on accuracy, the same phenomenon mentioned in Mellempudi et al. [2017]; we therefore do not convert the first and last layers of VGG-16, so its parameter size is reduced by a smaller factor.

Network (dataset)                  Baseline          TNT               Converted parameters   Conversion time
LeNet-5 (MNIST)                    99.18%            98.97%            1,663,370              7.803 s
VGG-7 (CIFAR-10)                   91.31%            89.09%            7,734,410              88.288 s
VGG-16 (ImageNet; Top-1, Top-5)    64.26%, 85.59%    56.26%, 80.25%    12,976,266             115.863 s
Table 1: TNT Performance on Neural Networks
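The 2-bits-per-weight figure can be made concrete with a small packing sketch. The paper does not describe a storage format, so the encoding below (codes 0, 1, 2 for the values -1, 0, 1, four codes per byte) and the function names are assumptions made purely for illustration.

```python
import numpy as np

def pack_ternary(t: np.ndarray) -> np.ndarray:
    """Pack values in {-1, 0, 1} into bytes, four 2-bit codes per byte."""
    codes = (t.astype(np.int8) + 1).astype(np.uint8)   # -1, 0, 1 -> 0, 1, 2
    pad = (-codes.size) % 4                            # pad to a multiple of 4
    codes = np.concatenate([codes, np.ones(pad, dtype=np.uint8)])
    codes = codes.reshape(-1, 4)
    packed = (codes[:, 0] | (codes[:, 1] << 2)
              | (codes[:, 2] << 4) | (codes[:, 3] << 6))
    return packed.astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Recover the first n ternary values from the packed bytes."""
    shifts = np.array([0, 2, 4, 6], dtype=np.uint8)
    codes = (packed[:, None] >> shifts) & 3
    return codes.reshape(-1)[:n].astype(np.int8) - 1

rng = np.random.default_rng(0)
ternary = rng.integers(-1, 2, size=1000).astype(np.int8)  # hypothetical weights

packed = pack_ternary(ternary)
restored = unpack_ternary(packed, ternary.size)

float32_bytes = ternary.size * 4       # size if stored as 32-bit floats
ratio = float32_bytes / packed.nbytes
print(packed.nbytes, ratio)
```

Under this encoding, a layer of ternary weights occupies one sixteenth of the space of the same layer in 32-bit floating point, excluding the few scalars stored per target vector.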

4 Conclusions

In this paper, we proposed a target non-retraining ternary (TNT) method that converts a model with full-precision parameters to a model with ternary parameters accurately and quickly, without retraining the network. In our approach, we first succeeded in reducing the size of the search range from $3^N$ to $N$ by evaluating the cosine similarity between a target vector and a ternary vector. Second, scalar-tuning factors are coupled with the cosine similarity to further enable TNT to find the best ternary vector. Thanks to these techniques, TNT's computational complexity is only $O(N \log N)$. Third, we showed that the distribution of the parameters has a clear effect on the result of weight conversion, which implies that the initial distributions of the parameters are important. Moreover, we applied TNT to several models and verified that quantization by our TNT method causes only a small loss of accuracy.


  • H. Alemdar, V. Leroy, A. Prost-Boucle, and F. Pétrot (2017) Ternary neural networks for resource-efficient ai applications. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2547–2554. Cited by: §1.
  • M. Courbariaux, Y. Bengio, and J. David (2015) Binaryconnect: training deep neural networks with binary weights during propagations. In Advances in neural information processing systems, pp. 3123–3131. Cited by: §1.
  • L. Deng, P. Jiao, J. Pei, Z. Wu, and G. Li (2018) GXNOR-net: training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework. Neural Networks 100, pp. 49–58. Cited by: §1.
  • I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio (2016) Binarized neural networks. In Advances in neural information processing systems, pp. 4107–4115. Cited by: §1.
  • F. Li, B. Zhang, and B. Liu (2016) Ternary weight networks. arXiv preprint arXiv:1605.04711. Cited by: §1, §3.2.
  • N. Mellempudi, A. Kundu, D. Mudigere, D. Das, B. Kaul, and P. Dubey (2017) Ternary neural networks with fine-grained quantization. arXiv preprint arXiv:1705.01462. Cited by: §1, §1, §3.2.
  • M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi (2016) Xnor-net: imagenet classification using binary convolutional neural networks. In European Conference on Computer Vision, pp. 525–542. Cited by: §1.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.2.