I Introduction
Deep neural networks are highly parallel workloads that require massive computational resources for training and often utilize customized accelerators such as Google’s Tensor Processing Unit (TPU) to improve latency, reconfigurable devices such as FPGAs to mitigate power bottlenecks, or targeted ASICs such as Intel’s Nervana to optimize overall performance. The training cost of DNNs is attributed to the massive number of primitives known as multiply-and-accumulate operations, which compute the weighted sums of the neurons’ inputs. To alleviate this challenge, techniques such as sparse connectivity and low-precision arithmetic
[1, 2, 3] are extensively studied. For example, performing AlexNet inference on the CIFAR-10 dataset with an 8-bit fixed-point format has shown a 6× improvement in energy consumption [4] over 32-bit fixed-point. On the other hand, using 32-bit precision for an outrageously large neural network, such as an LSTM with a mixture of experts [5], requires approximately 137 billion parameters. When performing a machine translation task with this network, that translates to an untenable DRAM memory access power of 128 W (estimated as 20 Hz × 10 G × 640 pJ for a 32-bit DRAM access [1], at a 45 nm technology node). For deploying DNN algorithms on end-devices (e.g., AI at the edge, IoT), these resource constraints are prohibitive.

Researchers have offset these constraints to some degree by using low-precision techniques. Linear and nonlinear quantization have been successfully applied during DNN inference on 8-bit fixed-point or 8-bit floating point accelerators, with performance on par with 32-bit floating point [6, 7, 3]. However, when using quantization to perform DNN inference at ultra-low bit precision (≤8-bit), the network either needs to be retrained or the number of hyperparameters must be significantly increased [8], leading to a surge in computational complexity. One solution is to utilize a low-precision numerical format (fixed-point, floating point, or posit [9]) for both DNN training and inference instead of quantization. Earlier studies have compared DNN inference with low precision (e.g., 8-bit) to high-precision floating point (e.g., 32-bit) [4]. The utility of these studies is limited: the comparisons are across numerical formats with different bit widths and do not provide a fair understanding of the overall system efficiency.
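The power estimate in the parenthetical above is simple enough to verify mechanically. The sketch below multiplies the three figures from the estimate, interpreting 10 G as DRAM accesses per inference (our assumption about the units):

```python
# Back-of-the-envelope check of the DRAM access power estimate:
# power = inference rate * accesses per inference * energy per access.
# All three figures come straight from the footnoted estimate in the text:
# 20 Hz, ~10 G 32-bit DRAM accesses, 640 pJ per access at 45 nm [1].
inference_rate_hz = 20
accesses_per_inference = 10e9    # ~10 G DRAM accesses (assumed units)
energy_per_access_j = 640e-12    # 640 pJ per 32-bit DRAM access

power_w = inference_rate_hz * accesses_per_inference * energy_per_access_j
print(power_w)  # ~128 W, matching the figure quoted in the text
```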
More recently, the posit format has shown promise over floating point, with a larger dynamic range, higher accuracy, and better closure [10]. The goal of this work is to study the efficacy of the posit numerical format for DNN inference. An analysis of the histograms of weight distributions in an AlexNet DNN and of a 7-bit posit (Fig. 2) shows that posits can be an optimal representation of weights and activations. We compare the proposed designs on multiple metrics related to performance and resource utilization: accuracy, LUT utilization, dynamic range of the numerical formats, maximum operating frequency, inference time, power consumption, and energy-delay-product.
This paper makes the following contributions:

We propose an exact multiply-and-accumulate (EMAC) algorithm for accelerating ultra-low precision (≤8-bit) DNNs with the posit numerical format. We compare EMACs for three numerical formats, posit, fixed-point, and floating point, at sub-8-bit precision.

We propose the Deep Positron architecture that employs the EMACs and study its resource utilization and energy-delay-product.

We show preliminary results that posit is a natural fit for sub-8-bit precision DNN inference.

We conduct experiments on the Deep Positron architecture for multiple low-dimensional datasets and show that 8-bit posits achieve better performance than 8-bit fixed- or floating point and accuracies similar to the 32-bit floating point counterparts.
II Background
II-A Deep Neural Networks
Deep neural networks are biologically inspired predictive models that learn features and relationships from a corpus of examples. The topology of these networks is a sequence of layers, each containing a set of simulated neurons. A neuron computes a weighted sum of its inputs and applies a nonlinear activation function to that sum. The connectivity between layers can vary, but in general consists of feedforward connections. Each connection has an associated numerical value, known as a weight, that indicates the connection strength. To discern the correctness of a given network’s predictions in a supervised setting, a cost function
computes how wrong a prediction is compared to the ground truth. The partial derivatives of the cost with respect to each weight are used to update the network parameters through backpropagation, ultimately minimizing the cost function.
Traditionally, 32-bit floating point arithmetic is used for DNN inference. However, the IEEE standard floating point representation is designed for a very broad dynamic range; even 32-bit floating point numbers span over 80 orders of magnitude, far more than DNNs need. While very small values can be important, very large values are not; the design of the format therefore yields low information-per-bit in the sense of Shannon maximum entropy
[11]. Attempts to address this by crafting a fixed-point representation for DNN weights quickly run up against quantization error.

The 16-bit (half-precision) form of IEEE floating point, used by Nvidia’s accelerators for DNNs, reveals the shortcomings of the format: complicated exception cases, gradual underflow, prolific NaN bit patterns, and redundant representations of zero. It is not the representation one would design from first principles for a DNN workload. A more recent format, posit arithmetic, provides a natural fit to the demands of DNNs for both training and inference.
II-B Posit Number System
The posit number system, a Type III unum, was proposed to improve upon many of the shortcomings of IEEE-754 floating-point arithmetic and to address complaints about the costs of managing the variable size of Type I unums [10]. (Type II unums are also of fixed size, but require lookup tables that limit their precision [12].) The posit format provides better dynamic range, accuracy, and consistency between machines than floating point. A posit number is parametrized by $n$, the total number of bits, and $es$, the number of exponent bits. The primary difference between a posit and a floating-point number representation is the posit regime field, which has a dynamic width like that of a unary number; the regime is a run-length encoded signed value that can be interpreted as in Table I.
Binary        0001   001   01   10   110   1110
Regime ($k$)    -3    -2   -1    0     1      2
Two values in the system are reserved: $10\cdots0$ represents “Not a Real,” which includes infinity and all other exception cases like $0/0$ and $\sqrt{-1}$, and $00\cdots0$ represents zero. The full binary representation of a posit number is shown in (1).
$\underbrace{s}_{\text{sign}}\;\underbrace{r\,r\,\cdots\,r\,\bar{r}}_{\text{regime}}\;\underbrace{e_1\,e_2\,\cdots\,e_{es}}_{\text{exponent}}\;\underbrace{f_1\,f_2\,\cdots}_{\text{fraction}}$  (1)
For a positive posit in such format, the numerical value it represents is given by (2)
$x = 2^{2^{es}\cdot k} \times 2^{e} \times 1.f$  (2)
where $k$ is the regime value, $e$ is the unsigned exponent (if $es > 0$), and $f$ comprises the remaining fraction bits of the number. If the posit is negative, the 2’s complement is taken before applying the above decoding. See [10] for more detailed and complete information on the posit format.
III Methodology
III-A Exact Multiply-and-Accumulate (EMAC)
The fundamental computation within a DNN is the multiply-and-accumulate (MAC) operation. Each neuron within a network is equivalent to a MAC unit in that it performs a weighted sum of its inputs. This operation is ubiquitous across DNN implementations; however, it is usually inexact, i.e.
limited precision, truncation, or premature rounding in the underlying hardware yields inaccurate results. The EMAC performs the same computation but allocates sufficient padding for digital signals to emulate arbitrary precision. Rounding or truncation within an EMAC unit is delayed until every product has been accumulated, thus producing a result with minimal local error. This minimization of error is especially important when EMAC units are coupled with low-precision data.
In all the EMAC units we implement, a number’s format is arbitrary, as its representation is ultimately converted to fixed-point, which allows for natural accumulation. Given the constraint of low-precision data, we propose to use a variant of the Kulisch accumulator [13]. In this architecture, a wide register accumulates fixed-point values shifted by an exponential parameter, if applicable, and delays rounding to a post-summation stage. The width of such an accumulator for $k$ multiplications can be computed using (3)
$w_{acc} = \lceil \log_2(k) \rceil + 2\left\lceil \log_2\!\left(\frac{\max}{\min}\right)\right\rceil + 2$  (3)
where $\max$ and $\min$ are the maximum and minimum value magnitudes of the number format, respectively. To improve the maximum operating frequency via pipelining, a D flip-flop separates the multiplication and accumulation stages. The architecture easily allows for the incorporation of a bias term: the accumulator D flip-flop can be reset to the fixed-point representation of the bias so that products accumulate on top of it. To further improve accuracy, the round-to-nearest, round-half-to-even scheme is employed for the floating point and posit formats. This is the rounding method recommended by IEEE-754 and required by the posit standard.
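The delayed-rounding idea can be sketched in software: products of quantized operands accumulate exactly as scaled integers, and a single round-half-to-even is applied after the last sum. This is an illustrative model of the principle only (the function name is ours, and Python's unbounded integers stand in for a register sized by (3)):

```python
from fractions import Fraction

def kulisch_mac(pairs, frac_bits):
    """Exact MAC in the style of a Kulisch accumulator.

    Each operand is quantized to a fixed-point format with `frac_bits`
    fraction bits and held as a scaled integer. Products, which carry
    2*frac_bits fraction bits, are summed exactly; rounding back to the
    operand format happens exactly once, after the final accumulation,
    using round-half-to-even (Python's round() on a Fraction).
    """
    acc = 0  # wide accumulator with 2*frac_bits fraction bits
    for a, b in pairs:
        ia = round(a * (1 << frac_bits))  # quantize inputs to the format
        ib = round(b * (1 << frac_bits))
        acc += ia * ib                    # exact product, no rounding
    # Single post-summation rounding back to frac_bits fraction bits.
    exact = Fraction(acc, 1 << (2 * frac_bits))
    return round(exact * (1 << frac_bits)) / (1 << frac_bits)
```

With 4 fraction bits, twelve products of 0.0625 × 0.0625 each round to zero individually, yet the exact sum 12/256 correctly rounds to 0.0625 — the local-error minimization the text describes.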
III-B Fixed-Point EMAC
The fixed-point EMAC, shown in Fig. 3, accumulates the products of $k$ multiplications and allocates a sufficient range of bits to compute the exact result before truncation. A weight, bias, and activation, each with the same number of fraction and integer bits, are the unit inputs. The unnormalized multiplicative result is kept at twice the input width to preserve exact precision. The products are accumulated over $k$ clock cycles with the integer adder and D flip-flop combination. The sum of products is then shifted right by the number of fraction bits and truncated back to the input width, clipping at the maximum magnitude if applicable.
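A minimal behavioral sketch of the fixed-point EMAC described above, with assumed parameter names `k_int` and `q_frac` standing in for the integer and fraction bit counts (the paper's own symbols were lost in extraction):

```python
def fixed_emac(weights, activations, k_int, q_frac):
    """Behavioral sketch of the fixed-point EMAC.

    Inputs are signed fixed-point values with q_frac fraction bits and
    k_int integer bits, held as scaled integers. Products occupy the
    full double width and are summed exactly; the sum is then shifted
    right by q_frac bits and clipped at the maximum representable
    magnitude instead of wrapping around.
    """
    total = k_int + q_frac
    acc = 0
    for w, a in zip(weights, activations):
        acc += w * a                 # exact 2*(k_int+q_frac)-bit product
    out = acc >> q_frac              # drop the extra fraction bits
    max_mag = (1 << total) - 1       # clip rather than overflow
    return max(-max_mag - 1, min(max_mag, out))
```

For example, in a Q3.4 format (`k_int=3`, `q_frac=4`), 1.0 × 1.0 + 1.0 × 2.0 yields the scaled integer 48 (= 3.0), while an overflowing sum clips to 127.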
III-C Floating Point EMAC
The floating point EMAC, shown in Fig. 4, also computes the EMAC operation for $k$ pairs of inputs. We do not consider “Not a Number” or “± Infinity,” as the inputs never take these values and the EMAC does not overflow to infinity. Notably, it uses a fixed-point conversion before accumulation to preserve the precision of the result. Inputs to the EMAC have a single sign bit, $w_e$ exponent bits, and $w_f$ fraction bits. Subnormal detection at the inputs appropriately sets the hidden bits and adjusts the exponent. The size of this EMAC scales exponentially with $w_e$, as it is the dominant parameter in computing the accumulator width. To convert floating point products to a fixed-point representation, mantissas are converted to 2’s complement based on the sign of the product and shifted to the appropriate location in the register based on the product exponent. After accumulation, the inverse 2’s complement is applied based on the sign of the sum. If the result is detected to be subnormal, the exponent is accordingly set to ‘0’. The value extracted from the accumulator is clipped at the maximum magnitude if applicable.
The relevant characteristics of a floating point number are computed as follows.
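The formulas themselves did not survive extraction. Assuming IEEE-style conventions (a sign bit, $w_e$ exponent bits with bias $2^{w_e-1}-1$, and $w_f$ fraction bits, with the symbol names $w_e$ and $w_f$ being our own), the extreme magnitudes would be:

```latex
\max = \left(2 - 2^{-w_f}\right) \cdot 2^{\,2^{w_e-1}-1},
\qquad
\min_{\text{subnormal}} = 2^{\,2 - 2^{w_e-1} - w_f}
```

These are the standard largest normal and smallest subnormal magnitudes for such a format; the paper's exact expressions may differ if exponent codes reserved for NaN/Infinity are repurposed.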
Algorithm 1: Posit data extraction of an $n$-bit input with $es$ exponent bits
III-D Posit EMAC
The posit EMAC, detailed in Fig. 5, computes the MAC operation for $k$ pairs of inputs. We do not consider “Not a Real” in this implementation, as all inputs are expected to be real numbers and posits never overflow to infinity. Inputs to the EMAC are decoded to extract the sign, regime, exponent, and fraction. As the regime bit field has dynamic size, this process is nontrivial. Algorithm 1 describes the data extraction process. To avoid needing both a leading ones detector (LOD) and a leading zeros detector (LZD), we invert the two’s complement of the input (line 5) so that the regime always begins with a ‘0’. The regime is accordingly adjusted using the regime check bit (line 11). After decoding the inputs, multiplication and conversion to fixed-point proceed similarly to floating point. Products are accumulated in a register, or quire in the posit literature, of the width given by (4).
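The extraction step can be mirrored in software. In this sketch (our own, with a single leading-zero count standing in for the hardware LZD), the conditional inversion plays the role of lines 5 and 11 of Algorithm 1:

```python
def extract_posit_fields(bits, n, es):
    """Sketch of Algorithm-1-style posit data extraction.

    Returns (sign, regime value k, exponent, fraction bits left-aligned)
    for an n-bit posit with es exponent bits. The two's complement of a
    negative input is taken first; then, whenever the bit after the sign
    is '1', the word is inverted so the regime run always starts with
    '0' and a single leading-zeros count suffices. The regime check bit
    restores the regime's sign afterwards.
    """
    mask = (1 << n) - 1
    sign = (bits >> (n - 1)) & 1
    if sign:
        bits = -bits & mask              # 2's complement of negative input
    body = (bits << 1) & mask            # drop the sign bit
    rc = (body >> (n - 1)) & 1           # regime check bit
    inv = (~body & mask) if rc else body # regime run is now leading zeros
    run = 0                              # one LZD replaces LOD + LZD
    while run < n and not (inv >> (n - 1 - run)) & 1:
        run += 1
    k = run - 1 if rc else -run          # adjust regime via the check bit
    rest = (body << (run + 1)) & mask    # skip regime run + terminator
    e = rest >> (n - es) if es else 0    # exponent bits, if any
    f = rest & ((1 << (n - es)) - 1)     # fraction bits, left-aligned
    return sign, k, e, f
```

For instance, the $n=8$, $es=1$ pattern 01101000 extracts as regime $k=1$ (run "11"), exponent 1, and zero fraction.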
$w_{quire} = \lceil \log_2(k) \rceil + 2^{es+2}\,(n-2) + 2$  (4)
To avoid using multiple shifters in the fixed-point conversion, the scale factor is biased so that its minimum value becomes 0. After accumulation, the bias is removed from the scale factor before entering the convergent rounding and encoding stage. Algorithm 2 gives the procedure for carrying out these operations.
Algorithm 2: Posit EMAC operation for $n$-bit inputs, each with $es$ exponent bits
Multiplication
Accumulation
Fraction & SF Extraction
Convergent Rounding & Encoding
The relevant characteristics of a posit number are computed as follows.
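The formulas themselves did not survive extraction; from [10], the standard characteristics of an $(n, es)$ posit are:

```latex
useed = 2^{2^{es}},
\qquad
\max = useed^{\,n-2} = 2^{(n-2)\,2^{es}},
\qquad
\min = useed^{-(n-2)} = 2^{-(n-2)\,2^{es}}
```

Note the symmetry $\min = 1/\max$, which is what makes the quire width in (4) a clean multiple of $(n-2)\,2^{es}$.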
III-E Deep Positron
We assemble a custom DNN architecture that is parametrized by data width, data type, and DNN hyperparameters (e.g., number of layers, neurons per layer), as shown in Fig. 1. Each layer contains dedicated EMAC units with local memory blocks for weights and biases. Storing DNN parameters in this manner minimizes latency by avoiding off-chip memory accesses. The compute cycle of each layer is triggered when its immediately preceding layer has finished computation for an input, so inference proceeds in a parallel, streaming fashion. The ReLU activation is used throughout the network, except for the affine readout layer. A main control unit directs the flow of input data and activations through the network using a finite state machine.
IV Experimental Results
IV-A EMAC Analysis and Comparison
We compare the hardware implications across the three numerical formats, parametrized per EMAC, on a Virtex-7 FPGA (xc7vx485t-2ffg1761-c). Synthesis results are obtained with Vivado 2017.2 and optimized for latency by targeting the on-chip DSP48 slices. Our preliminary results indicate that the posit EMAC is competitive with the floating point EMAC in terms of energy and latency. At lower values of $n$, the posit number system has higher dynamic range, as emphasized by [10]. We compute dynamic range as $\log_{10}(\max/\min)$. While neither the floating point nor the posit EMAC can compete with the energy-delay-product (EDP) of fixed-point, both offer significantly higher dynamic range for the same values of $n$. Furthermore, the EDPs of the floating point and posit EMACs are similar.
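The dynamic-range comparison can be reproduced with a few lines. The posit extremes follow [10]; the fixed-point and floating point expressions below are our assumptions (largest over smallest representable magnitude, with IEEE-style subnormals for float):

```python
import math

def dynamic_range_posit(n, es):
    """log10(max/min) in decades for an (n, es) posit.

    Uses maxpos = 2^((n-2)*2^es) and minpos = 1/maxpos from [10].
    """
    max_exp = (n - 2) * 2 ** es
    return math.log10(2.0 ** max_exp / 2.0 ** -max_exp)

def dynamic_range_fixed(n):
    """log10(max/min) for n-bit signed fixed point: the ratio of the
    largest magnitude to the smallest nonzero step is 2^(n-1)-1."""
    return math.log10(2 ** (n - 1) - 1)

def dynamic_range_float(w_e, w_f):
    """log10(max/min) for an IEEE-style float with w_e exponent and
    w_f fraction bits, counting the smallest subnormal (assumed model)."""
    largest = (2 - 2.0 ** -w_f) * 2.0 ** (2 ** (w_e - 1) - 1)
    smallest = 2.0 ** (2 - 2 ** (w_e - 1) - w_f)
    return math.log10(largest / smallest)
```

At 8 bits, an es = 2 posit spans roughly 14 decades versus about 5 for a float with a 4-bit exponent and about 2 for fixed-point, matching the ordering discussed above.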
Fig. 6 shows the synthesis results for the dynamic range of each format against the maximum operating frequency. As expected, the fixed-point EMAC achieves the lowest datapath latencies, as it has no exponential parameter and thus a narrower accumulator. In general, the posit EMAC can operate at a higher frequency than the floating point EMAC for a given dynamic range. Fig. 7 shows the EDP across different bit widths; as expected, fixed-point outperforms the other formats at all bit widths.
The LUT utilization results against numerical precision are shown in Fig. 8, where posit generally consumes more resources. This overhead can be attributed to the more involved decoding and encoding of inputs and outputs.
IV-B Deep Positron Performance
We compare the performance of Deep Positron on three datasets and all combinations of bit widths in [5, 8] for the three numerical formats. Posit performs uniformly well across all three datasets at 8-bit precision, as shown in Table II, and has accuracy similar to fixed-point and float at sub-8-bit precision. The best results are obtained with task-dependent choices of the posit exponent size $es$ and the floating point exponent size. As expected, the best sub-8-bit performance drops by [0, 4.21]% compared to 32-bit floating point. In all experiments, the posit format either outperforms or matches the performance of floating- and fixed-point. Additionally, with 24 fewer bits, posit matches the performance of 32-bit floating point on the Iris classification task.
Fig. 9 shows the lowest accuracy degradation per bit width against EDP. The results indicate that posits achieve better performance than the floating- and fixed-point formats at a moderate cost.
V Related Work
Research on low-precision arithmetic for neural networks dates to circa 1990 [17, 18], using fixed-point and floating point. Recently, several groups have shown that it is possible to perform DNN inference with 16-bit fixed-point representations [19, 20]. However, most of these studies compare DNN inference across different bit widths. Few research teams have performed comparisons at the same bit width across different number systems coupled with FPGA soft processors. For example, Hashemi et al. demonstrate DNN inference with 32-bit fixed-point and 32-bit floating point on the LeNet, ConvNet, and AlexNet DNNs, where fixed-point reduces energy consumption by 12% at the cost of a 1% accuracy loss [4]. Most recently, Chung et al. proposed an accelerator, Brainwave, with a spatial 8-bit floating point format, called ms-fp8, which improves throughput by 3× over 8-bit fixed-point on a Stratix-10 FPGA [3].
This paper also relates to three previous works that use posits in DNNs. The first DNN architecture using the posit number system was proposed by Langroudi et al. [21]. That work demonstrates that, with 1% accuracy degradation, DNN parameters can be represented using 7-bit posits for AlexNet on the ImageNet corpus, and that posits require 30% less memory than the fixed-point format for the LeNet, ConvNet, and AlexNet neural networks. Secondly, Cococcioni et al. [22] discuss the effectiveness of posit arithmetic for autonomous driving applications. They consider an implementation of a Posit Processing Unit (PPU) as an alternative to a Floating point Processing Unit (FPU), since self-driving car standards require 16-bit floating point representations for safety-critical applications. Recently, Jeff Johnson proposed a log float format that combines the posit format with a logarithmic version of the EMAC operation, called the exact log-linear multiply-add (ELMA). That work shows that ImageNet classification using the ResNet-50 DNN architecture can be performed with 1% accuracy degradation [23], and that power consumption reductions of 4% and 41% can be achieved by using an 8/38-bit ELMA in place of an 8/32-bit integer multiply-add and an IEEE-754 float16 fused multiply-add, respectively.

This paper is inspired by these earlier studies and demonstrates that posit arithmetic with ultra-low precision (8-bit) is a natural choice for DNNs performing low-dimensional tasks. A precision-adaptable, parameterized FPGA soft core is used for a comprehensive analysis of the Deep Positron architecture at the same bit width for the fixed-point, floating point, and posit formats.
Vi Conclusions
In this paper, we show that the posit format is well suited to deep neural networks at ultra-low precision (≤8-bit). We show that precision-adaptable, reconfigurable exact multiply-and-accumulate designs embedded in a DNN are efficient for inference. Accuracy-sensitivity studies of Deep Positron show robustness at 7-bit and 8-bit widths. In the future, the success of DNNs in real-world applications will rely as much on the underlying platforms and architectures as on the algorithms and data. Full-scale DNN accelerators with low-precision posit arithmetic will play an important role in this domain.
References
 [1] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” ICLR, pp. 1–13, 2016.
 [2] S. Wu, G. Li, F. Chen, and L. Shi, “Training and inference with integers in deep neural networks,” arXiv preprint arXiv:1802.04680, 2018.
 [3] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield et al., “Serving dnns in real time at datacenter scale with project brainwave,” IEEE Micro, vol. 38, no. 2, pp. 8–20, 2018.
 [4] S. Hashemi, N. Anthony, H. Tann, R. Bahar, and S. Reda, “Understanding the impact of precision quantization on the accuracy and energy of neural networks,” in Proceedings of the Conference on Design, Automation & Test in Europe. European Design and Automation Association, 2017, pp. 1478–1483.
 [5] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le et al., “Outrageously large neural networks: The sparselygated mixtureofexperts layer,” arXiv preprint arXiv:1701.06538, 2017.
 [6] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal et al., “In-datacenter performance analysis of a Tensor Processing Unit,” pp. 1–17, 2017.
 [7] B. Reagen, P. Whatmough, R. Adolf, S. Rama, H. Lee et al., “Minerva: Enabling lowpower, highlyaccurate deep neural network accelerators,” in Proceedings of the 43rd International Symposium on Computer Architecture. IEEE Press, 2016, pp. 267–278.
 [8] A. Mishra and D. Marr, “WRPN & Apprentice: Methods for training and inference using low-precision numerics,” arXiv preprint arXiv:1803.00227, 2018.
 [9] P. Gysel, “Ristretto: Hardwareoriented approximation of convolutional neural networks,” CoRR, vol. abs/1605.06402, 2016. [Online]. Available: http://arxiv.org/abs/1605.06402
 [10] J. L. Gustafson and I. T. Yonemoto, “Beating floating point at its own game: Posit arithmetic,” Supercomputing Frontiers and Innovations, vol. 4, no. 2, pp. 71–86, 2017.
 [11] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 623–656, 1948.
 [12] W. Tichy, “Unums 2.0: An interview with John L. Gustafson,” Ubiquity, vol. 2016, no. September, p. 1, 2016.
 [13] U. Kulisch, Computer arithmetic and validity: theory, implementation, and applications. Walter de Gruyter, 2013, vol. 33.
 [14] W. N. Street, W. H. Wolberg, and O. L. Mangasarian, “Nuclear feature extraction for breast tumor diagnosis,” in Biomedical Image Processing and Biomedical Visualization, vol. 1905. International Society for Optics and Photonics, 1993, pp. 861–871.
 [15] R. A. Fisher, “The use of multiple measurements in taxonomic problems,” Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.
 [16] J. C. Schlimmer, “Concept acquisition through representational adjustment,” 1987.
 [17] A. Iwata, Y. Yoshida, S. Matsuda, Y. Sato, and N. Suzumura, “An artificial neural network accelerator using general purpose 24 bits floating point digital signal processors,” in IJCNN, vol. 2, 1989, pp. 171–182.
 [18] D. Hammerstrom, “A VLSI architecture for high-performance, low-cost, on-chip learning,” in 1990 IJCNN International Joint Conference on Neural Networks. IEEE, 1990, pp. 537–544.
 [19] M. Courbariaux, Y. Bengio, and J. David, “Low precision arithmetic for deep learning,” CoRR, vol. abs/1412.7024, 2014. [Online]. Available: http://arxiv.org/abs/1412.7024
 [20] Y. Bengio, “Deep learning of representations: Looking forward,” in International Conference on Statistical Language and Speech Processing. Springer, 2013, pp. 1–37.
 [21] S. H. F. Langroudi, T. Pandit, and D. Kudithipudi, “Deep learning inference on embedded devices: Fixed-point vs posit,” in 2018 1st Workshop on Energy Efficient Machine Learning and Cognitive Computing for Embedded Applications (EMC2), March 2018, pp. 19–23.
 [22] M. Cococcioni, E. Ruffaldi, and S. Saponara, “Exploiting posit arithmetic for deep neural networks in autonomous driving applications,” in 2018 International Conference of Electrical and Electronic Technologies for Automotive. IEEE, 2018, pp. 1–6.
 [23] J. Johnson, “Rethinking floating point for deep learning,” arXiv preprint arXiv:1811.01721, 2018.