Training Deep Neural Networks Using Posit Number System

09/06/2019 ∙ by Jinming Lu, et al. ∙ Nanjing University

With the increasing size of Deep Neural Network (DNN) models, high memory requirements and computational complexity have become an obstacle to efficient DNN implementations. To ease this problem, using reduced-precision representations for DNN training and inference has attracted much interest from researchers. This paper first proposes a methodology for training DNNs with posit arithmetic, a type-3 universal number (Unum) format that is similar to floating point (FP) but has reduced precision. A warm-up training strategy and layer-wise scaling factors are adopted to stabilize training and fit the dynamic range of DNN parameters. With the proposed training methodology, we demonstrate the first successful training of DNN models on the ImageNet image classification task in 16-bit posit with no accuracy loss. Then, an efficient hardware architecture for the posit multiply-and-accumulate operation is proposed, which achieves a significant improvement in energy efficiency over traditional floating-point implementations. The proposed design is helpful for future low-power DNN training accelerators.




I Introduction

Recently, deep neural networks (DNNs) have achieved great success in many real-world applications, such as image classification [6], speech recognition [1], and natural language processing [13]. As DNNs grow in size, the models achieve state-of-the-art performance. However, the high memory requirements and computational complexity have become a serious problem for efficient implementations, especially on mobile devices.

To alleviate the extremely high demand for computational resources, many compression methods have been proposed, which aim to generate compact DNN models. At present, the reduced-precision representation of numbers, also known as quantization, is one of the most attractive topics [9]. However, these methods mainly focus on the inference phase of DNNs. Training with limited-precision numbers remains to be explored.

Because training involves more information flows, including gradient backpropagation and parameter updates, it requires a higher representation ability for data. In other words, a suitable number format for DNN training should have enough dynamic range for large numbers and high precision for numbers in the center of the data distribution.

Posit, a type-3 universal number, was introduced by Gustafson et al. [4]. An $n$-bit posit number is defined as $(n, es)$, where $es$ (the number of exponent bits) controls the dynamic range. Compared with the standard floating-point (FP) format, posit offers a better trade-off between dynamic range and precision, which meets the needs of low-bit number formats for DNN training. Some researchers have pointed out the prospect of posit in DNNs, but practical implementations and verifications are absent [4][15]. In this paper, we first propose an effective strategy for DNN training using the posit number system. Once posit has been shown to be useful in DNN training, a processing element supporting posit arithmetic is required to make full use of its efficiency in DNN accelerators. Our contributions are summarized as follows:

  • With an operation that transforms a real number to posit format, we illustrate how to apply posit in the DNN training process.

  • We analyze the advantages and disadvantages of applying posit in DNN training, and propose corresponding solutions to the problems. Firstly, to deal with the high sensitivity of models in the early training stage and ensure convergence, a warm-up training with FP32 is carried out. Secondly, to take advantage of posit, we design layer-wise scaling factors based on the center of the data distribution in the log domain, making the data distribution of models match the way the precision of posit numbers changes. Thirdly, to meet the different data ranges of different layers, we come up with a qualitative criterion to select a proper $es$ to achieve a better trade-off between the dynamic range and precision of posit numbers.

  • To verify the effectiveness of our methods, ResNet-18 models are trained on the ImageNet and Cifar-10 datasets, where 8-bit or 16-bit posit numbers are applied in the forward and backward computation. The experiments show no accuracy loss compared with the baseline model.

  • We propose a hardware architecture for a posit multiply-and-accumulate (MAC) unit, which is coded in Verilog HDL and synthesized by Design Compiler under TSMC 28nm technology. Compared with a standard floating-point MAC unit, the posit MAC can reduce power by up to 83% and area by up to 76%. This demonstrates that our design will benefit future low-power DNN training accelerators.

II Background

II-A Reduced-Precision for DNN Training

Training DNNs with reduced precision is an appealing topic. Gupta et al. trained DNNs with fixed-point numbers and introduced a stochastic rounding procedure to prevent accuracy degradation [3]. In [11], a binary logarithmic data representation for both inference and training is explored, so that multiplication operations can be replaced by simpler shift operations. However, these works usually cannot provide the expected model accuracy on complex tasks because the aggressive approximation loses too much information.

To deal with this problem, some recent works use reduced-precision floating point, including FP8 or FP16, in training. Micikevicius et al. [10] used FP16 for forward and backward computation, and kept FP32 for weight update and accumulation. They also proposed a loss-scaling method to keep gradient propagation effective. Furthermore, by applying a chunk-based accumulation technique, Wang et al. [12] reduced the precision of the computation to FP8, and the precision of the weight update and accumulation to FP16.

II-B Posit Number System

Fig. 1: The basic structure of an $n$-bit posit number
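| Binary Code | Regime | Exponent | Mantissa | Real Value |
|---|---|---|---|---|
| 00000 | x | x | x | 0 |
| 00001 | -3 | 0 | 0 | 1/64 |
| 00010 | -2 | 0 | 0 | 1/16 |
| 00011 | -2 | 1 | 0 | 1/8 |
| 00100 | -1 | 0 | 0 | 1/4 |
| 00101 | -1 | 0 | 1/2 | 3/8 |
| 00110 | -1 | 1 | 0 | 1/2 |
| 00111 | -1 | 1 | 1/2 | 3/4 |
| 01000 | 0 | 0 | 0 | 1 |
| 01001 | 0 | 0 | 1/2 | 3/2 |
| 01010 | 0 | 1 | 0 | 2 |
| 01011 | 0 | 1 | 1/2 | 3 |
| 01100 | 1 | 0 | 0 | 4 |
| 01101 | 1 | 1 | 0 | 8 |
| 01110 | 2 | 0 | 0 | 16 |
| 01111 | 3 | 0 | 0 | 64 |

TABLE I: The Detailed Structures of Positive Values of a Posit Number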

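An $n$-bit posit number, whose detailed structure is shown in Fig. 1, includes four parts: a sign bit, regime bits, exponent bits, and a mantissa part. The boundaries between the last three parts are not fixed, as the regime part is run-length encoded. As for the numerical meaning of the regime bits, $k$ consecutive 0s ended by a 1 mean a regime value of $-k$, and $k$ consecutive 1s ended by a 0 mean a regime value of $k-1$. As an example, the construction of a posit with $n = 5$ and $es = 1$ is described in Table I. The value $x$ of a non-zero posit number (binary code) is given by Eq. (1).

$$x = (-1)^{s} \times useed^{\,r} \times 2^{e} \times (1 + m), \qquad useed = 2^{2^{es}} \tag{1}$$

where $s$ is the sign bit, $r$ is the regime value, $e$ is the exponent value, $m$ is the mantissa fraction in $[0, 1)$ as listed in Table I, and $useed$ determines the dynamic range.

The maximum and the minimum positive values that an $n$-bit posit can represent are $useed^{\,n-2}$ and $useed^{-(n-2)}$, respectively.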
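The decoding rule can be checked against Table I with a short sketch. The Python function below is illustrative only: `decode_posit` is a hypothetical name, and the handling of negative codes by two's complement and of the single not-a-real pattern follows the standard posit definition in [4] rather than anything stated in this section.

```python
def decode_posit(code: int, n: int, es: int) -> float:
    """Decode an n-bit posit bit pattern (given as an integer) into a float."""
    mask = (1 << n) - 1
    code &= mask
    if code == 0:
        return 0.0
    if code == 1 << (n - 1):
        return float("nan")  # assumption: the single "not a real" pattern
    sign = 1.0
    if code >> (n - 1):  # assumption: negative codes are two's complement
        code = (-code) & mask
        sign = -1.0
    # Bits after the sign bit, most significant first.
    bits = [(code >> i) & 1 for i in range(n - 2, -1, -1)]
    # Regime: a run of identical bits ended by the opposite bit or the word end.
    first = bits[0]
    run = 1
    while run < len(bits) and bits[run] == first:
        run += 1
    regime = run - 1 if first == 1 else -run
    rest = bits[run + 1:]  # skip the terminating bit when it exists
    # Exponent: up to es bits; bits cut off by the word end count as zero.
    exp_bits = rest[:es]
    exponent = 0
    for b in exp_bits:
        exponent = (exponent << 1) | b
    exponent <<= es - len(exp_bits)
    # Mantissa fraction in [0, 1) from whatever bits remain.
    mantissa = sum(b / (1 << (i + 1)) for i, b in enumerate(rest[es:]))
    useed = 2.0 ** (2 ** es)
    return sign * useed ** regime * 2.0 ** exponent * (1.0 + mantissa)
```

For the 5-bit, $es = 1$ configuration of Table I, `decode_posit(0b00101, 5, 1)` evaluates to 0.375 (3/8) and `decode_posit(0b01111, 5, 1)` to 64.0.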

Some groups have worked on the design of hardware architecture generators for posit arithmetic. Jaiswal et al. [7] proposed a parameterized posit arithmetic architecture generator, supporting basic operations such as FP-posit conversion, addition/subtraction, and multiplication. Recently, an efficient posit MAC unit generator that can be combined with a reasonable pipeline strategy was put forward by Zhang et al. [15]. Besides, the applications of low-bit posit in deep learning have also attracted some attention. Deep Positron [2], a DNN architecture that employs exact multiply-and-accumulates (EMACs) for 8-bit posit, shows better accuracy than 8-bit fixed-point and FP for some small datasets. Johnson [8] proposed a log-float format inspired by posit and used it for DNN inference, with a small accuracy loss on the ImageNet dataset with the ResNet-50 model.

III Posit Training Strategy and Experiment Results

| Description |
|---|
| posit word size |
| posit exponent field size |
| sign of the number |
| the effective exponent value |
| the regime value |
| the exponent value before rounding |
| the mantissa value before rounding |
| the regime width |
| the exponent width |
| the mantissa width |
| the exponent value after rounding |
| the mantissa value after rounding |

TABLE II: Notations for Posit Transformation

III-A Posit Transformation

In this work, all data and computations are represented in posit format in the training process. Therefore, we have to transform a real number, which is represented in FP32 format on current computers, to posit format. Here we define a transformation operator to achieve this task. The detailed process is shown in Algorithm 1, and the notations involved are listed in Table II.

Given the total word size $n$ and the exponent field size $es$, we can determine the dynamic range of a posit number. To convert a non-zero number to the corresponding posit number, we first limit its magnitude based on the dynamic range and then extract the sign, regime, exponent, and mantissa parts.

Algorithm 1: Transform a Number to Posit Format
Input: real number, posit word size, and exponent field size
Output: posit number
Fig. 2: The histograms and distributions of the weights of a CONV layer and a BN layer during training. Panels (a) and (c) show histograms; panels (b) and (d) show distributions.
Fig. 3: DNN training computation flow graph with posit transformation: (a) forward propagation, (b) backward propagation, (c) weight update. The graph marks each transformation operation, with its subscript omitted for simplicity, and shows the weight, activation, weight gradient, and error of a layer; symbols with a subscript are in posit format.

Next, because of the restriction of the word size, the width of each part is adjusted, and rounding operations are applied to the value of each part to fit the adjusted width. Here we choose the rounding-to-zero method (see the rounding operator in Algorithm 1, Lines 16 and 17). Compared with rounding-to-nearest and stochastic rounding, rounding-to-zero is more hardware-friendly. Finally, the posit result is obtained by combining these parts based on Eq. (1).
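The steps described in this subsection (limit the magnitude, split it into regime, exponent, and mantissa, shrink the fields to the available width, truncate, and recombine) can be sketched in Python as follows. This is a minimal reconstruction from the prose and not Algorithm 1 itself: the function name `to_posit_value`, the use of `math.frexp`, and the field-width rules taken from the standard posit layout are all assumptions. The function returns the real value of the resulting posit rather than its bit pattern, which is enough to simulate posit arithmetic inside an FP32 framework.

```python
import math


def to_posit_value(x: float, n: int, es: int) -> float:
    """Return the real value of the (n, es) posit that x truncates to."""
    if x == 0.0:
        return 0.0
    useed_log2 = 2 ** es
    max_exp = useed_log2 * (n - 2)  # log2 of the largest positive posit
    sign = -1.0 if x < 0 else 1.0
    # Limit the magnitude to the representable range [minpos, maxpos].
    mag = min(max(abs(x), 2.0 ** -max_exp), 2.0 ** max_exp)
    # mag = frac * 2**exp with frac in [0.5, 1), so mag = (1 + mantissa) * 2**eff_exp.
    frac, exp = math.frexp(mag)
    eff_exp = exp - 1
    mantissa = frac * 2.0 - 1.0
    # Effective exponent = regime * 2**es + exponent.
    regime = eff_exp // useed_log2
    exponent = eff_exp % useed_log2
    # Field widths left over once the regime has taken its share of n - 1 bits.
    regime_width = regime + 2 if regime >= 0 else -regime + 1
    exp_width = max(0, min(es, n - 1 - regime_width))
    man_width = max(0, n - 1 - regime_width - es)
    # Round to zero: drop the low bits that no longer fit.
    dropped = es - exp_width
    exponent = (exponent >> dropped) << dropped
    mantissa = math.floor(mantissa * (1 << man_width)) / (1 << man_width)
    return sign * 2.0 ** (regime * useed_log2 + exponent) * (1.0 + mantissa)
```

With a 5-bit word and one exponent bit, `to_posit_value(3.7, 5, 1)` gives 3.0 and `to_posit_value(50.0, 5, 1)` gives 16.0, the next lower magnitudes listed in Table I.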

With the transformation algorithm in place, we insert it into the DNN training computation flow as depicted in Fig. 3, which includes the forward process, the backward process, and the weight update process.

III-B Training a DNN Model with Posit

Although posit has many benefits when used in DNN training, it cannot deliver the expected performance if we directly replace FP32 with reduced-precision posit. There are several key reasons:

  • In the early training stage, the model is more sensitive to the precision of data, and the distributions of some layers are unstable, so reduced-precision representations cause a bad initialization and make the model hard to converge.

  • The precision of the posit number system is basically symmetrical about 1, but the data distributions in DNN models are concentrated in a limited range. To some extent, this results in a mismatch between the data distributions and the number representation format, leading to larger approximation errors.

  • Different layers have different data ranges, which means some data distributions are more concentrated and others are relatively spread out. Therefore, it is sub-optimal to use the same data precision (e.g. the same $es$ of posit) to represent them.

In this section, we propose corresponding methods for dealing with the above problems.

Warm-up Training: By observing the distributions of data in the training process, we find that most of them are approximately normal. As shown in Fig. 2, the distributions of the weights in Convolution (CONV) layers are basically stable in the training process. However, because of the initialization method, the distributions of the weights in Batch Normalization (BN) layers change steeply in the first several epochs, which may be an important reason for the high model sensitivity in the early training process. Therefore, a higher numerical precision is required in this phase. For this reason, a warm-up training using FP32 is carried out for several epochs (1-5 epochs). It helps to determine the data distribution effectively and ensures the convergence of the networks.

Distribution-based Shifting: When transforming a real number to a reduced-precision format, the most common approach is to approximate it by the nearest reduced-precision value and clip it to the dynamic range of the reduced-precision format. As a result, numerical errors are inevitable. To overcome the second issue, a scaling factor is introduced to shift the data distribution to a more appropriate range, whose upper bound is usually the maximum value that the reduced-precision number can represent [14]. For the posit number system, the dynamic range is large enough to meet the demand. However, to make full use of the code space of posit, inspired by the shift-based mapping method [14], we also propose a layer-wise scaling factor. The calculation of the scaling factor is shown in Eq. (2).


In Eq. (2), the input is the tensor to be converted, and the approximate distribution center of the input tensor in the log domain indicates the magnitude that the majority of values are close to. The equation also uses a predefined positive integer constant, which is set to 2 in our experiments. As mentioned in previous work [5], large values are more important than small values, so we add this constant to the distribution center to shift values a little further towards small magnitudes. Based on the warm-up trained model, the scaling factor of each layer can be calculated. Finally, by applying the scaling factor before and after the transformation operation as in Eq. (3), the more important values are shifted to the order of magnitude that has higher precision.
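A minimal sketch of this layer-wise scaling is given below. Because the exact forms of Eq. (2) and Eq. (3) are not reproduced here, everything in the sketch is an assumption: the distribution center is taken as the rounded mean of $\log_2$ of the absolute values, the scaling factor as 2 raised to that center plus the constant, and the scaling is applied by dividing before the transformation and multiplying afterwards. The names `log_center`, `scaled_transform`, and `sigma` are hypothetical, and `transform` stands for any real-to-posit conversion.

```python
import math
from typing import Callable, List, Tuple


def log_center(values: List[float]) -> int:
    """Assumed distribution center: rounded mean of log2|v| over non-zero values."""
    logs = [math.log2(abs(v)) for v in values if v != 0.0]
    return round(sum(logs) / len(logs)) if logs else 0


def scaled_transform(
    values: List[float],
    transform: Callable[[float], float],
    sigma: int = 2,
) -> Tuple[List[float], float]:
    """Divide by the scale, apply the transformation, then multiply back."""
    # Adding sigma moves the center to about 2**-sigma, so larger values
    # land closer to 1, where posit precision is highest.
    scale = 2.0 ** (log_center(values) + sigma)
    return [transform(v / scale) * scale for v in values], scale
```

Since the scale is a power of two, dividing and multiplying by it changes only the exponent of each value, so the scaling itself introduces no rounding error as long as no overflow or underflow occurs.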


Adjust Dynamic Range: During the DNN training process, different layers have different distribution ranges, which are measured approximately by the difference between the maximum and minimum values in the log domain. For example, in the first few layers, the ranges of gradients are relatively larger than the ranges of other values. In this case, the posit number should have a larger dynamic range, which means a bigger $es$ value. In this work, for simplicity, we set $es$ to 1 for all weights and activations, and to 2 for all gradients and errors.
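As a worked example of this trade-off, the standard posit range formula from [4] can be evaluated for the configurations used in this work. The formula is assumed from the posit definition rather than stated in this paragraph: the largest positive value is $useed^{\,n-2}$ with $useed = 2^{2^{es}}$, and the smallest positive value is its reciprocal.

$$
\begin{aligned}
\text{posit}(8,1):&\quad maxpos = 4^{6} = 2^{12}, &\quad minpos &= 2^{-12} \\
\text{posit}(8,2):&\quad maxpos = 16^{6} = 2^{24}, &\quad minpos &= 2^{-24} \\
\text{posit}(16,1):&\quad maxpos = 4^{14} = 2^{28}, &\quad minpos &= 2^{-28} \\
\text{posit}(16,2):&\quad maxpos = 16^{14} = 2^{56}, &\quad minpos &= 2^{-56}
\end{aligned}
$$

Moving from $es = 1$ to $es = 2$ therefore doubles the range in the log domain, at the cost of one bit that would otherwise be available to the mantissa.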

III-C Experiment Results

To validate our posit training strategy, we perform experiments with ResNet-18 [6] on the ImageNet and Cifar-10 datasets using the PyTorch framework on NVIDIA P100 GPUs. The validation top-1 accuracy and related configurations are summarized in Table III, which demonstrates that training with reduced-precision posit numbers can achieve the FP32 baseline accuracy without tuning hyperparameters. The training details are as follows:


Cifar-10: The model uses stochastic gradient descent with momentum 0.9 as the optimizer. The initial learning rate is set to 0.1 and divided by 10 at epochs 60, 150, and 250. The network is trained for 300 epochs with a mini-batch size of 512. The warm-up training runs for 1 epoch.

ImageNet: The model uses stochastic gradient descent with momentum 0.9 as the optimizer. The initial learning rate is set to 0.1 and divided by 10 every 30 epochs. The model is trained for 90 epochs with a mini-batch size of 512. The warm-up training runs for 5 epochs.
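The two schedules can be written down as a small, framework-independent sketch. The helper names `step_lr` and `number_format`, the dictionary layout, and the zero-based epoch counting are assumptions made for illustration; they are not taken from the authors' training scripts.

```python
from typing import Sequence


def step_lr(epoch: int, base_lr: float, milestones: Sequence[int]) -> float:
    """Learning rate after dividing by 10 at every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (10 ** drops)


def number_format(epoch: int, warmup_epochs: int) -> str:
    """FP32 during the warm-up epochs, posit afterwards."""
    return "fp32" if epoch < warmup_epochs else "posit"


# Settings from the text; epochs are counted from 0 (an assumption).
CIFAR10 = {"base_lr": 0.1, "milestones": (60, 150, 250), "epochs": 300, "warmup_epochs": 1}
IMAGENET = {"base_lr": 0.1, "milestones": (30, 60), "epochs": 90, "warmup_epochs": 5}

if __name__ == "__main__":
    for epoch in (0, 1, 60, 150, 250):
        lr = step_lr(epoch, CIFAR10["base_lr"], CIFAR10["milestones"])
        print(epoch, number_format(epoch, CIFAR10["warmup_epochs"]), lr)
```

In a PyTorch setup the same milestone schedule would normally be delegated to the framework's learning-rate scheduler; the plain functions here only make the switch from FP32 warm-up to posit training explicit.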

| Dataset | Cifar-10 | ImageNet |
|---|---|---|
| Model | Cifar-ResNet-18 | ResNet-18 |
| Batch size | 512 | 512 |
| Epochs | 300 | 120 |
| Optimizer | SGD with momentum | SGD with momentum |
| FP32 baseline (top-1, %) | 93.40 | 71.02 |
| Posit (top-1, %) | 92.87 (a) | 71.09 (b) |

  • (a) posit(8,1) for the forward pass and weight update of CONV layers, posit(8,2) for the backward pass of CONV layers; posit(16,1) for the forward pass and weight update of BN layers, posit(16,2) for the backward pass of BN layers.

  • (b) posit(16,1) for the forward pass and weight update, posit(16,2) for the backward pass.

TABLE III: Training Configurations and Validation Accuracy Results

IV Energy-Efficient Posit MAC Architecture

By using 8-bit or 16-bit posit numbers for training, the model size can be reduced to 25% or 50% of the FP32 size, and energy consumption can be reduced significantly because the memory requirements and communication bandwidth are reduced. In the computational process, energy consumption mainly comes from the large number of MAC operations. Since posit arithmetic operations differ from traditional floating-point arithmetic operations, a dedicated MAC unit is required to take full advantage of reduced-precision posit.

Fig. 4: The overall architecture for the posit MAC

As shown in Fig. 4, the posit MAC unit proposed in [15] is mainly composed of three units: a decoder converting posit to FP, an FP MAC unit, and an encoder converting FP to posit. In this design, the sum of the encoder delay and the decoder delay accounts for about 40% of the total posit MAC delay.

Based on this result, improved architectures with lower latency are proposed for the encoder and decoder, which are shown in Fig. 6 and Fig. 5, respectively.

IV-A The Optimized Decoder and Encoder Architectures

The decoder extracts the different parts of a posit and outputs the effective exponent value and the mantissa value. Firstly, the absolute regime value of the input posit number is calculated by a leading-one detector (LOD) if the real regime value is negative, or by a leading-zero detector (LZD) if the real regime value is positive. Secondly, the input is left-shifted by the width of the regime bits, which takes one of two values depending on the absolute regime value. The output of the left shifter is composed of the posit exponent value and the mantissa value. Finally, the regime value and the posit exponent value are packed into the effective exponent value. The critical path of the original decoder is determined by the add-one operation. As shown in Fig. 5, we remove the adder and split the left-shift path by duplicating the left shifter. To preserve the function of the adder, a left-shift-by-one operation is inserted after the shifter.
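The idea behind removing the adder can be illustrated with a small behavioural model. The sketch below is not the authors' Verilog design: it assumes that the two candidate shift amounts differ by exactly one, models the datapath as a fixed-width integer, and uses hypothetical function names. It only shows that shifting by the detected amount and then by a fixed extra position yields the same word as adding one to the shift amount first.

```python
def shift_with_adder(word: int, amount: int, plus_one: bool, width: int) -> int:
    """Original path: compute the shift amount with an adder, then shift."""
    total = amount + 1 if plus_one else amount
    return (word << total) & ((1 << width) - 1)


def shift_without_adder(word: int, amount: int, plus_one: bool, width: int) -> int:
    """Optimized path: shift by the raw amount, then a fixed shift by one."""
    mask = (1 << width) - 1
    shifted = (word << amount) & mask      # shifter driven directly by the detector
    shifted_more = (shifted << 1) & mask   # duplicated path with a fixed << 1
    return shifted_more if plus_one else shifted


if __name__ == "__main__":
    # Exhaustive check on an 8-bit datapath.
    for word in range(256):
        for amount in range(8):
            for plus_one in (False, True):
                assert shift_with_adder(word, amount, plus_one, 8) == \
                    shift_without_adder(word, amount, plus_one, 8)
    print("both paths agree")
```

In hardware, a fixed shift by one is only wiring, so the adder disappears from the critical path at the cost of the duplicated shift path and a selection between the two results.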

(a) The original decoder
(b) The optimized decoder
Fig. 5: The decoder architectures before and after optimization

The encoder converts FP to posit format. Firstly, a 2n-bit variable is constructed from the mantissa and the least significant bits (LSBs) of the exponent, and the remaining bits are filled with the regime sequence. Then this variable is right-shifted by the width of the regime bits, which again takes one of two values depending on the absolute regime value. Therefore, an optimization similar to that used in the optimized decoder is applied to the encoder architecture.

(a) The original encoder
(b) The optimized encoder
Fig. 6: The encoder architectures before and after optimization

IV-B Hardware Implementation Results

The architectures are coded in Verilog HDL and synthesized by Design Compiler under TSMC 28nm technology. To demonstrate the efficiency of the proposed encoder and decoder, the same parameterized architectures as in [15] are evaluated.

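| Design | Metric | Module | posit(8,0) | posit(16,1) | posit(32,3) |
|---|---|---|---|---|---|
| [15] | delay (ns) | encoder | 0.2 | 0.29 | 0.35 |
| [15] | delay (ns) | decoder | 0.2 | 0.28 | 0.34 |
| Ours | delay (ns) | encoder | 0.13 | 0.18 | 0.23 |
| Ours | delay (ns) | decoder | 0.14 | 0.21 | 0.29 |
| Ours | power (mW) | encoder | 0.21 | 0.44 | 0.59 |
| Ours | power (mW) | decoder | 0.27 | 0.45 | 0.66 |
| Ours | area () | encoder | 137 | 295 | 540 |
| Ours | area () | decoder | 201 | 504 | 960 |

TABLE IV: Delay Comparison of Encoder and Decoder with [15]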

The comparison results in Table IV show that our encoder is 25%-35% faster and our decoder is 15%-30% faster, thereby reducing the impact of these two units on the total delay.

By combining the proposed encoder and decoder with the FP MAC unit, an energy-efficient posit MAC architecture is obtained. To meet the requirements of DNN training with posit, different posit MAC units supporting all the posit formats listed in Table III are implemented. The implementation results are summarized in Table V. For a fair comparison of energy consumption between the posit MAC and the FP32 MAC, all these units are synthesized with a timing constraint of 750MHz. Compared with the FP32 MAC, the posit MAC can reduce power by 22%-83% and area by 6%-76%.

| Format | Power (mW) | Area () |
|---|---|---|
| FP32 | 2.52 | 4322 |
| posit(8,1) | 0.45 | 1208 |
| posit(8,2) | 0.35 | 1032 |
| posit(16,1) | 1.77 | 4079 |
| posit(16,2) | 1.60 | 3897 |

TABLE V: Comparison of Posit MAC with FP32

V Conclusion and Future Work

In this paper, with several useful methods proposed, the posit number system is successfully applied to DNN training. The experimental results show that reduced-precision posit can achieve accuracy similar to FP32 on different datasets. If posit is applied in DNN accelerators, the overhead caused by data communication can be reduced by 2-4×. To take full advantage of posit, an energy-efficient posit MAC unit is designed. Compared with the FP32 MAC, the posit MAC can reduce power by 22%-83% and area by 6%-76%.

In future work, we will implement a hardware accelerator for DNN training with posit. In addition, the architecture for posit arithmetic built around the encoder and decoder may not be optimal. We will carefully design a new architecture for the posit MAC to further improve its performance.


  • [1] D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al. (2016) Deep Speech 2: end-to-end speech recognition in English and Mandarin. In International Conference on Machine Learning, pp. 173–182.
  • [2] Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L. Gustafson, and D. Kudithipudi (2018) Deep Positron: a deep neural network using the posit number system. arXiv preprint arXiv:1812.01762.
  • [3] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan (2015) Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746.
  • [4] J. L. Gustafson and I. T. Yonemoto (2017) Beating floating point at its own game: posit arithmetic. Supercomputing Frontiers and Innovations 4 (2), pp. 71–86.
  • [5] S. Han, H. Mao, and W. J. Dally (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [7] M. K. Jaiswal and H. K. So (2018) Universal number posit arithmetic generator on FPGA. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 1159–1162.
  • [8] J. Johnson (2018) Rethinking floating point for deep learning. arXiv preprint arXiv:1811.01721.
  • [9] R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv preprint arXiv:1806.08342.
  • [10] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al. (2017) Mixed precision training. arXiv preprint arXiv:1710.03740.
  • [11] D. Miyashita, E. H. Lee, and B. Murmann (2016) Convolutional neural networks using logarithmic data representation. arXiv preprint arXiv:1603.01025.
  • [12] N. Wang, J. Choi, D. Brand, C. Chen, and K. Gopalakrishnan (2018) Training deep neural networks with 8-bit floating point numbers. In Advances in Neural Information Processing Systems, pp. 7675–7684.
  • [13] Y. Wang, M. Huang, L. Zhao, et al. (2016) Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 606–615.
  • [14] S. Wu, G. Li, F. Chen, and L. Shi (2018) Training and inference with integers in deep neural networks. arXiv preprint arXiv:1802.04680.
  • [15] H. Zhang, J. He, and S. Ko (2019) Efficient posit multiply-accumulate unit generator for deep learning applications. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5.