Deep Positron: A Deep Neural Network Using the Posit Number System

12/05/2018 ∙ by Zachariah Carmichael, et al. ∙ 0

The recent surge of interest in Deep Neural Networks (DNNs) has led to increasingly complex networks that tax computational and memory resources. Many DNNs presently use 16-bit or 32-bit floating point operations. Significant performance and power gains can be obtained when DNN accelerators support low-precision numerical formats. Despite considerable research, there is still a knowledge gap on how low-precision operations can be realized for both DNN training and inference. In this work, we propose a DNN architecture, Deep Positron, with posit numerical format operating successfully at ≤8 bits for inference. We propose a precision-adaptable FPGA soft core for exact multiply-and-accumulate for uniform comparison across three numerical formats: fixed-point, floating-point, and posit. Preliminary results demonstrate that 8-bit posit has better accuracy than 8-bit fixed or floating-point for three different low-dimensional datasets. Moreover, the accuracy is comparable to 32-bit floating-point on a Xilinx Virtex-7 FPGA device. The trade-offs between DNN performance and hardware resources, i.e. latency, power, and resource utilization, show that posit outperforms in accuracy and latency at 8-bit and below.




I Introduction

Deep neural networks are highly parallel workloads that require massive computational resources for training. They often rely on customized accelerators such as Google’s Tensor Processing Unit (TPU) to improve latency, reconfigurable devices like FPGAs to mitigate power bottlenecks, or targeted ASICs such as Intel’s Nervana to optimize overall performance. The training cost of DNNs is attributed to the massive number of primitives known as multiply-and-accumulate operations that compute the weighted sums of the neurons’ inputs. To alleviate this challenge, techniques such as sparse connectivity and low-precision arithmetic [1, 2, 3] are extensively studied. For example, performing AlexNet inference on the Cifar-10 dataset using the 8-bit fixed-point format has shown a 6× improvement in energy consumption [4] over 32-bit fixed-point. On the other hand, an outrageously large neural network such as an LSTM with a mixture of experts [5] requires approximately 137 billion parameters, each stored at 32-bit precision. When performing a machine translation task with this network, this translates to an untenable DRAM memory access power of 128 W (estimated as 20 Hz × 10 G × 640 pJ per 32-bit DRAM access [1], at the 45 nm technology node). For deploying DNN algorithms on the end-device (e.g. AI on the edge, IoT), these resource constraints are prohibitive.

Figure 1: An overview of a simple Deep Positron architecture embedded with the exact multiply-and-accumulate blocks (EMACs).

Researchers have offset these constraints to some degree by using low-precision techniques. Linear and nonlinear quantization have been successfully applied during DNN inference on 8-bit fixed-point or 8-bit floating point accelerators, and the performance is on par with 32-bit floating point [6, 7, 3]. However, when using quantization to perform DNN inference with ultra-low bit precision (≤8 bits), the network needs to be retrained or the number of hyper-parameters must be significantly increased [8], leading to a surge in computational complexity. One solution is to utilize a low-precision numerical format (fixed-point, floating point, or posit [9]) for both DNN training and inference instead of quantization. Earlier studies have compared DNN inference with low precision (e.g. 8-bit) to high-precision floating point (e.g. 32-bit) [4]. The utility of these studies is limited: the comparisons are across numerical formats with different bit widths and do not provide a fair understanding of the overall system efficiency.

More recently, the posit format has shown promise over floating point with larger dynamic range, higher accuracy, and better closure [10]. The goal of this work is to study the efficacy of the posit numerical format for DNN inference. An analysis of the histogram of weight distributions in an AlexNet DNN and a 7-bit posit (Fig. 2) shows that posits can be an optimal representation of weights and activations. We compare the proposed designs with multiple metrics related to performance and resource utilization: accuracy, LUT utilization, dynamic range of the numerical formats, maximum operating frequency, inference time, power consumption, and energy-delay-product.

This paper makes the following contributions:

  • We propose an exact multiply and accumulate (EMAC) algorithm for accelerating ultra-low precision (≤8-bit) DNNs with the posit numerical format. We compare EMACs for three numerical formats, posit, fixed-point, and floating point, in sub 8-bit precision.

  • We propose the Deep Positron architecture that employs the EMACs and study the resource utilization and energy-delay-product.

  • We show preliminary results that posit is a natural fit for sub 8-bit precision DNN inference.

  • We conduct experiments on the Deep Positron architecture for multiple low-dimensional datasets and show that 8-bit posits achieve better performance than 8-bit fixed or floating point and accuracies similar to those of their 32-bit floating point counterparts.

Figure 2: (a) 7-bit posit and (b) AlexNet weight distributions. Both show heavy clustering in the [-1, 1] range.

II Background

II-A Deep Neural Networks

Deep neural networks are biologically-inspired predictive models that learn features and relationships from a corpus of examples. The topology of these networks is a sequence of layers, each containing a set of simulated neurons. A neuron computes a weighted sum of its inputs and applies a nonlinear activation function to that sum. The connectivity between layers can vary, but in general it consists of feed-forward connections between layers. These connections each have an associated numerical value, known as a weight, that indicates the connection strength. To discern correctness of a given network’s predictions in a supervised environment, a cost function computes how wrong a prediction is compared to the truth. The partial derivatives of the cost with respect to each weight are used to update network parameters through backpropagation, ultimately minimizing the cost function.

Traditionally, 32-bit floating point arithmetic is used for DNN inference. However, the IEEE standard floating point representation is designed for a very broad dynamic range; even 32-bit floating point numbers have a huge dynamic range of over 80 orders of magnitude, far larger than needed for DNNs. While very small values can be important, very large values are not; the design of the numbers therefore yields low information-per-bit based on Shannon maximum entropy [11]. Attempts to address this by crafting a fixed-point representation for DNN weights quickly run up against quantization error.

The 16-bit (half-precision) form of IEEE floating point, used by Nvidia’s accelerators for DNNs, reveals the shortcomings of the format: complicated exception cases, gradual underflow, prolific NaN bit patterns, and redundant representations of zero. It is not the representation one would design from first principles for a DNN workload. A more recent format, posit arithmetic, provides a natural fit to the demands of DNNs both for training and inference.

II-B Posit Number System

The posit number system, a Type III unum, was proposed to improve upon many of the shortcomings of IEEE-754 floating-point arithmetic and to address complaints about the costs of managing the variable size of Type I unums [10]. (Type II unums are also of fixed size, but require look-up tables that limit their precision [12].) The posit format provides better dynamic range, accuracy, and consistency between machines than floating point. A posit number is parametrized by $n$, the total number of bits, and $es$, the number of exponent bits. The primary difference between a posit and a floating-point number representation is the posit regime field, which has a dynamic width like that of a unary number; the regime is a run-length encoded signed value that can be interpreted as in Table I.

| Binary | 0001 | 001 | 01 | 10 | 110 | 1110 |
|---|---|---|---|---|---|---|
| Regime ($k$) | −3 | −2 | −1 | 0 | 1 | 2 |

Table I: Regime Interpretation

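Two values in the system are reserved: $10\ldots0$ represents “Not a Real”, which includes infinity and all other exception cases like $0/0$ and $\sqrt{-1}$, and $00\ldots0$ represents zero. The full binary representation of a posit number is shown in (1).

$$\underbrace{s}_{\text{sign}}\ \underbrace{r\,r \cdots r\,\bar{r}}_{\text{regime}}\ \underbrace{e_1\,e_2 \cdots e_{es}}_{\text{exponent}}\ \underbrace{f_1\,f_2 \cdots}_{\text{fraction}} \tag{1}$$

For a positive posit in such format, the numerical value it represents is given by (2)

$$x = 2^{2^{es} \times k} \times 2^{e} \times 1.f \tag{2}$$

where $k$ is the regime value, $e$ is the unsigned exponent (if $es > 0$), and $f$ comprises the remaining bits of the number. If the posit is negative, the 2’s complement is taken before using the above decoding. See [10] for more detailed and complete information on the posit format.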
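As a worked illustration of this decoding rule, the following sketch assumes the standard posit value formula from [10], $x = 2^{2^{es} \times k} \times 2^{e} \times 1.f$, and uses a bit pattern chosen only for this example (an 8-bit posit with $es = 1$); it is not taken from the paper's experiments.

```latex
\begin{aligned}
\text{bits} &= \underbrace{0}_{\text{sign}}\;
               \underbrace{110}_{\text{regime}}\;
               \underbrace{1}_{\text{exponent}}\;
               \underbrace{010}_{\text{fraction}} \\
k &= 1 \ (\text{run of two 1s, see Table I}), \quad e = 1, \quad 1.f = 1 + \tfrac{2}{8} = 1.25 \\
x &= 2^{2^{1} \times 1} \times 2^{1} \times 1.25 = 4 \times 2 \times 1.25 = 10
\end{aligned}
```

The negative counterpart, $-10$, would be stored as the 2’s complement of the same pattern, 10010110.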

III Methodology

III-A Exact Multiply-and-Accumulate (EMAC)

The fundamental computation within a DNN is the multiply-and-accumulate (MAC) operation. Each neuron within a network is equivalent to a MAC unit in that it performs a weighted sum of its inputs. This operation is ubiquitous across many DNN implementations; however, it is usually inexact, i.e. limited precision, truncation, or premature rounding in the underlying hardware yields inaccurate results. The EMAC performs the same computation but allocates sufficient padding for digital signals to emulate arbitrary precision. Rounding or truncation within an EMAC unit is delayed until every product has been accumulated, thus producing a result with minimal local error. This minimization of error is especially important when EMAC units are coupled with low-precision data.

In all EMAC units we implement, a number’s format is arbitrary as its representation is ultimately converted to fixed-point, which allows for natural accumulation. Given the constraint of low-precision data, we propose to use a variant of the Kulisch accumulator [13]. In this architecture, a wide register accumulates fixed-point values shifted by an exponential parameter, if applicable, and delays rounding to a post-summation stage. The width $w_a$ of such an accumulator for $k$ multiplications can be computed using (3)

$$w_a = \lceil \log_2(k) \rceil + 2 \times \left\lceil \log_2\!\left(\frac{max}{min}\right) \right\rceil + 2 \tag{3}$$

where $max$ and $min$ are the maximum and minimum value magnitudes for a number format, respectively. To improve the maximum operating frequency via pipelining, a D flip-flop separates the multiplication and accumulation stages. The architecture easily allows for the incorporation of a bias term: the accumulator D flip-flop can be reset to the fixed-point representation of the bias so products accumulate on top of it. To further improve accuracy, round-to-nearest with round-half-to-even is employed for the floating point and posit formats. This is the recommended IEEE-754 rounding method and the posit standard.
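To make the accumulator sizing concrete, here is a minimal sketch. It assumes the width takes the Kulisch-style form $w_a = \lceil \log_2 k \rceil + 2\lceil \log_2(max/min) \rceil + 2$ (carry bits for $k$ products, twice the format's range for a product, plus two extra bits); the function name and the example formats are hypothetical and chosen only for illustration.

```python
import math


def accumulator_width(k: int, max_value: float, min_value: float) -> int:
    """Bits needed to sum k products exactly (assumed Kulisch-style form)."""
    carry_bits = math.ceil(math.log2(k))            # growth from k additions
    range_bits = 2 * math.ceil(math.log2(max_value / min_value))  # product range
    return carry_bits + range_bits + 2


# 8-bit posit with es = 0: max = 2**6, min = 2**-6
posit_width = accumulator_width(16, 2.0 ** 6, 2.0 ** -6)
# 8-bit fixed-point with 4 fraction bits: max = 127/16, min = 1/16
fixed_width = accumulator_width(16, 127 / 16, 1 / 16)
print(posit_width, fixed_width)
```

Under these assumptions, the wider range of the posit format translates directly into a wider accumulator than fixed-point needs for the same number of inputs.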

III-B Fixed-point EMAC

The fixed-point EMAC, shown in Fig. 3, accumulates the products of $k$ multiplications and allocates a sufficient range of bits to compute the exact result before truncation. A weight, bias, and activation, each with $q$ fraction bits and $n-q$ integer bits, are the unit inputs. The unnormalized multiplicative result is kept as $2n$ bits to preserve exact precision. The products are accumulated over $k$ clock cycles with the integer adder and D flip-flop combination. The sum of products is then shifted right by $q$ bits and truncated to $n$ bits, clipping at the maximum magnitude if applicable.
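The fixed-point datapath described above can be sketched in software as follows. This is an illustrative model, not the authors' hardware description: it assumes signed inputs given as raw integer codes (value = code / 2^q), and the function name and default widths are hypothetical.

```python
def fixed_point_emac(weights, activations, bias, n=8, q=4):
    """Exact multiply-and-accumulate on signed n-bit fixed-point codes.

    Each product carries 2*q fraction bits, so the bias code is pre-shifted
    by q to line up with the products before accumulation starts.
    """
    acc = bias << q                          # accumulator reset to the bias
    for w, a in zip(weights, activations):
        acc += w * a                         # full-width product, no rounding
    result = acc >> q                        # drop q fraction bits (truncate)
    hi = (1 << (n - 1)) - 1                  # largest n-bit signed code
    lo = -(1 << (n - 1))                     # smallest n-bit signed code
    return max(lo, min(hi, result))          # clip if the sum is out of range


# 0.5 + 1.5 * 1.0 + (-0.5) * 2.0 = 1.0, i.e. code 16 with q = 4
print(fixed_point_emac([24, -8], [16, 32], bias=8))
```

Python integers never overflow, so the wide accumulator is implicit here; in hardware its width would have to be sized explicitly.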

Figure 3: A precision-adaptable (DNN weights and activation) FPGA soft core for fixed-point exact multiply-and-accumulate operation.
Figure 4: A precision-adaptable (DNN weights and activation) FPGA soft core for the floating-point exact multiply-and-accumulate operation.

III-C Floating Point EMAC

The floating point EMAC, shown in Fig. 4, also computes the EMAC operation for $k$ pairs of inputs. We do not consider “Not a Number” or “± Infinity”, as the inputs never take these values and the EMAC does not overflow to infinity. Notably, it uses a fixed-point conversion before accumulation to preserve precision of the result. Inputs to the EMAC have a single sign bit, $w_e$ exponent bits, and $w_f$ fraction bits. Subnormal detection at the inputs appropriately sets the hidden bits and adjusts the exponent. This EMAC scales exponentially with $w_e$, as it is the dominant parameter in computing the accumulator width $w_a$. To convert floating point products to a fixed-point representation, mantissas are converted to 2’s complement based on the sign of the product and shifted to the appropriate location in the register based on the product exponent. After accumulation, inverse 2’s complement is applied based on the sign of the sum. If the result is detected to be subnormal, the exponent is accordingly set to ‘0’. The extracted value from the accumulator is clipped at the maximum magnitude if applicable.

The relevant characteristics of a float number are computed as follows.
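The following sketch shows one way such characteristics could be computed. It assumes IEEE-754-style conventions (an exponent bias of $2^{w_e-1}-1$, the all-ones exponent reserved, and subnormals supported); these conventions, the symbols $w_e$ and $w_f$ for the exponent and fraction widths, and the function name are assumptions for illustration, not necessarily the paper's definitions.

```python
def float_characteristics(w_e: int, w_f: int):
    """Bias, largest value and smallest positive value of a small float format.

    Assumes IEEE-754-style encoding with subnormals and a reserved
    all-ones exponent field.
    """
    bias = 2 ** (w_e - 1) - 1
    exp_max = 2 ** w_e - 2                        # largest usable exponent field
    max_value = 2.0 ** (exp_max - bias) * (2 - 2.0 ** -w_f)
    min_value = 2.0 ** (1 - bias) * 2.0 ** -w_f   # smallest subnormal
    return bias, max_value, min_value


# An 8-bit float with 4 exponent bits and 3 fraction bits
print(float_characteristics(4, 3))
```

With $w_e = 5$ and $w_f = 10$ the same function reproduces the familiar half-precision limits (65504 and $2^{-24}$), which serves as a sanity check on the assumed conventions.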

Figure 5: A precision-adaptable (DNN weights and activation) FPGA soft core for the posit exact multiply-and-accumulate operation.

  Algorithm 1 Posit data extraction of an $n$-bit input with $es$ exponent bits

 1: procedure Decode    ▷ Data extraction of the input
 2:     ▷ ‘1’ if the input is nonzero
 3:     ▷ Extract sign
 4:     ▷ Two’s complement
 5:     ▷ Regime check
 6:     ▷ Invert two’s complement
 7:     ▷ Count leading zeros
 8:     ▷ Shift out regime
 9:     ▷ Extract fraction
10:     ▷ Extract exponent
11:     ▷ Select regime
12:     return sign, regime, exponent, fraction
13: end procedure
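The extraction steps listed in Algorithm 1 can be sketched in software as follows. This is an illustrative model that follows the step descriptions (sign, two's complement, regime check, inversion, leading-zero count, field extraction) under the standard posit encoding; the function names, the extra `frac_bits` return value, and the decision to leave zero and “Not a Real” to the caller are assumptions of this sketch.

```python
def decode_posit(bits: int, n: int = 8, es: int = 1):
    """Return (sign, regime, exponent, fraction, frac_bits) of an n-bit posit.

    Zero and "Not a Real" are assumed to be filtered out by the caller.
    """
    sign = (bits >> (n - 1)) & 1                  # extract sign
    if sign:
        bits = (-bits) & ((1 << n) - 1)           # two's complement
    body_mask = (1 << (n - 1)) - 1
    body = bits & body_mask                       # the n-1 bits after the sign
    rc = (body >> (n - 2)) & 1                    # regime check bit
    run = body ^ body_mask if rc else body        # invert so the run is zeros
    zc = (n - 1) - run.bit_length()               # count leading zeros
    regime = zc - 1 if rc else -zc                # select regime
    rest_len = max(n - 1 - zc - 1, 0)             # bits after run + terminator
    rest = body & ((1 << rest_len) - 1)           # shift out regime
    exp_len = min(es, rest_len)
    exponent = (rest >> (rest_len - exp_len)) << (es - exp_len)  # zero-padded
    frac_bits = rest_len - exp_len
    fraction = rest & ((1 << frac_bits) - 1)
    return sign, regime, exponent, fraction, frac_bits


def posit_value(bits: int, n: int = 8, es: int = 1) -> float:
    """Numerical value of a decoded posit (nonzero, real inputs only)."""
    sign, k, e, f, fb = decode_posit(bits, n, es)
    magnitude = 2.0 ** ((1 << es) * k + e) * (1 + f / (1 << fb))
    return -magnitude if sign else magnitude


print(posit_value(0b01101010))   # sign 0, regime 110, exponent 1, fraction 010
```

Inverting the body when the regime check bit is set means a single leading-zero count serves both positive and negative regimes, which mirrors the motivation given for the hardware design.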


III-D Posit EMAC

The posit EMAC, detailed in Fig. 5, computes the operation of $k$ pairs of inputs. We do not consider “Not a Real” in this implementation as all inputs are expected to be real numbers and posits never overflow to infinity. Inputs to the EMAC are decoded in order to extract the sign, regime, exponent, and fraction. As the regime bit field is of dynamic size, this process is nontrivial. Algorithm 1 describes the data extraction process. To avoid needing both a leading ones detector (LOD) and a leading zeros detector (LZD), we invert the two’s complement of the input (lines 5 and 6) so that the regime always begins with a ‘0’. The regime is accordingly adjusted using the regime check bit (line 11). After decoding inputs, multiplication and conversion to fixed-point are performed similarly to floating point. Products are accumulated in a register, or quire in the posit literature, of width $w_a$ as given by (4).

$$w_a = \lceil \log_2(k) \rceil + 2^{es+2} \times (n-2) + 2 \tag{4}$$

To avoid using multiple shifters in fixed-point conversion, the scale factor is biased by $2^{es+1} \times (n-2)$ such that its minimum value becomes 0. After accumulation, the scale factor is unbiased by the same amount before entering the convergent rounding and encoding stage. Algorithm 2 gives the procedure for carrying out these operations.

  Algorithm 2 Posit EMAC operation for $n$-bit inputs each with $es$ exponent bits

procedure PositEMAC
    ▷ Gather scale factors
    ▷ Adjust for overflow
    ▷ Bias the scale factor
    ▷ Shift to fixed-point
    ▷ Accumulate
    ▷ Fraction & SF extraction
    ▷ Convergent rounding & encoding
    ▷ Unpack scale factor
    ▷ Check for overflow
    if  then
    else
    end if
    return
end procedure
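The multiply, bias, shift, and accumulate stages named in Algorithm 2 can be sketched as follows. This is a simplified software model under stated assumptions: operands are already decoded into hypothetical `(sign, scale_factor, fraction)` tuples with scale_factor = $2^{es} \times$ regime + exponent, all fractions are padded to a common `frac_bits` width, and the scale-factor bias is taken to be $2^{es+1} \times (n-2)$ so that the smallest possible product lands at shift 0. The rounding and encoding stage is omitted.

```python
from fractions import Fraction


def quire_accumulate(pairs, n=8, es=1, frac_bits=4):
    """Accumulate products of decoded posits exactly in an integer quire."""
    bias = (1 << (es + 1)) * (n - 2)              # smallest product scale -> 0
    hidden = 1 << frac_bits                       # implicit leading 1 of 1.f
    quire = 0
    for (sw, sfw, fw), (sa, sfa, fa) in pairs:
        significand = (hidden | fw) * (hidden | fa)       # 1.f_w * 1.f_a
        shifted = significand << (sfw + sfa + bias)       # shift to fixed-point
        quire += -shifted if sw ^ sa else shifted         # signed accumulate
    # Undo the bias and the 2 * frac_bits fraction bits to read the exact sum
    return Fraction(quire, 1 << (bias + 2 * frac_bits))


# 10 * 1 + (-0.25) * 3 = 9.25
pairs = [((0, 3, 4), (0, 0, 0)), ((1, -2, 0), (0, 1, 8))]
print(quire_accumulate(pairs))
```

Because every product is shifted by a non-negative amount, a single left shifter suffices, which is the design motivation the text gives for biasing the scale factor.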


The relevant characteristics of a posit number are computed as follows.
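For reference, the standard posit definitions from the posit literature [10] give the following characteristics; they are stated here as a sketch of what such a computation involves, with $n = 8$, $es = 1$ used as a purely illustrative example.

```latex
\begin{aligned}
useed &= 2^{2^{es}} \\
max   &= useed^{\,n-2} \\
min   &= useed^{-(n-2)} \\[4pt]
\text{Example } (n = 8,\ es = 1):\quad
useed &= 4, \qquad max = 4^{6} = 4096, \qquad min = 4^{-6} = 2^{-12}
\end{aligned}
```

Substituting these values of $max$ and $min$ into a Kulisch-style width of the form $\lceil \log_2 k \rceil + 2\lceil \log_2(max/min) \rceil + 2$ yields a range term of $2^{es+2} \times (n-2)$ bits.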

III-E Deep Positron

We assemble a custom DNN architecture that is parametrized by data width, data type, and DNN hyperparameters (e.g. number of layers, neurons per layer, etc.), as shown in Fig. 1. Each layer contains dedicated EMAC units with local memory blocks for weights and biases. Storing DNN parameters in this manner minimizes latency by avoiding off-chip memory accesses. The compute cycle of each layer is triggered when its directly preceding layer has terminated computation for an input. This flow performs inference in a parallel streaming fashion. The ReLU activation is used throughout the network, except for the affine readout layer. A main control unit controls the flow of input data and activations throughout the network using a finite state machine.
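The layer-by-layer flow can be sketched in software as below. This is a behavioral model only, not the streaming hardware: exact rational arithmetic stands in for the wide accumulator, and `round_to_grid` is a hypothetical stand-in for the rounding of whichever number format is in use (here, the nearest multiple of $2^{-4}$ with ties to even).

```python
from fractions import Fraction


def emac(weights, inputs, bias, quantize):
    """Accumulate all products exactly, then round a single time."""
    acc = Fraction(bias)
    for w, x in zip(weights, inputs):
        acc += Fraction(w) * Fraction(x)
    return quantize(acc)


def infer(layers, x, quantize):
    """layers: list of (weight_rows, biases); ReLU on all but the last layer."""
    for i, (weight_rows, biases) in enumerate(layers):
        x = [emac(row, x, b, quantize) for row, b in zip(weight_rows, biases)]
        if i < len(layers) - 1:
            x = [max(v, 0) for v in x]            # ReLU on hidden layers only
    return x


def round_to_grid(value, frac_bits=4):
    """Hypothetical rounding: nearest multiple of 2**-frac_bits, ties to even."""
    scale = 1 << frac_bits
    return Fraction(round(value * scale), scale)


layers = [([[0.5, -0.25], [1.0, 1.0]], [0.0, -1.0]),   # hidden layer
          ([[1.0, -1.0]], [0.25])]                     # affine readout
print(infer(layers, [1.0, 2.0], round_to_grid))
```

The benefit of delayed rounding shows up with small products: two terms of 1/32 each sum to exactly 1/16 in the accumulator, whereas rounding each product to the 1/16 grid first would give 0.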

IV Experimental Results

IV-A EMAC Analysis and Comparison

We compare the hardware implications of each EMAC across the three numerical formats and their parameters on a Virtex-7 FPGA (xc7vx485t-2ffg1761c). Synthesis results are obtained through Vivado 2017.2 and optimized for latency by targeting the on-chip DSP48 slices. Our preliminary results indicate that the posit EMAC is competitive with the floating point EMAC in terms of energy and latency. At lower values of $n$, the posit number system has higher dynamic range, as emphasized by [10]. We compute dynamic range as $\log_{10}(max/min)$. While neither the floating point nor the posit EMAC can compete with the energy-delay-product (EDP) of fixed-point, both offer significantly higher dynamic range for the same values of $n$. Furthermore, the EDPs of the floating point and posit EMACs are similar.
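To illustrate the dynamic range comparison, the sketch below assumes the common definition $\log_{10}(max/min)$, the standard posit limits $max = useed^{n-2}$ and $min = useed^{-(n-2)}$ with $useed = 2^{2^{es}}$, and a signed fixed-point format whose smallest nonzero magnitude is one least-significant bit. The function names are hypothetical.

```python
import math


def dynamic_range(max_value: float, min_value: float) -> float:
    """Decades spanned between the smallest and largest magnitudes."""
    return math.log10(max_value / min_value)


def posit_range(n: int, es: int) -> float:
    useed = float(2 ** (2 ** es))
    return dynamic_range(useed ** (n - 2), useed ** -(n - 2))


def fixed_range(n: int) -> float:
    # Largest magnitude is (2**(n-1) - 1) LSBs; smallest nonzero is 1 LSB
    return dynamic_range(2 ** (n - 1) - 1, 1)


for label, value in [("fixed n=8", fixed_range(8)),
                     ("posit n=8 es=0", posit_range(8, 0)),
                     ("posit n=8 es=1", posit_range(8, 1))]:
    print(f"{label}: {value:.2f} decades")
```

Under these assumptions an 8-bit posit spans more decades than 8-bit fixed-point even with $es = 0$, and each additional exponent bit doubles the span.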

Figure 6: Dynamic Range vs. Maximum Operating Frequency (Hz) for the EMACs implemented on Xilinx Virtex-7 FPGA.
Figure 7: $n$ vs. energy-delay-product for the EMACs implemented on Xilinx Virtex-7 FPGA.

Fig. 6 shows the synthesis results for the dynamic range of each format against maximum operating frequency. As expected, the fixed-point EMAC achieves the lowest datapath latencies as it has no exponential parameter, thus a narrower accumulator. In general, the posit EMAC can operate at a higher frequency for a given dynamic range than the floating point EMAC. Fig. 7 shows the EDP across different bit-widths and as expected fixed-point outperforms for all bit-widths.

The LUT utilization results against numerical precision are shown in Fig. 8, where posit generally consumes a higher amount of resources. This limitation can be attributed to the more involved decoding and encoding of inputs and outputs.

Figure 8: $n$ vs. LUT Utilization for the EMACs implemented on Xilinx Virtex-7 FPGA.

IV-B Deep Positron Performance

We compare the performance of Deep Positron on three datasets and all possible combinations of bit-widths in [5, 8] for the three numerical formats. Posit performs uniformly well across all three datasets at 8-bit precision, as shown in Table II, and has accuracy similar to fixed-point and floating point at sub-8-bit precision. The best results occur at specific settings of the posit exponent size $es$ and the floating-point exponent width $w_e$. As expected, the best sub-8-bit performance drops by [0-4.21]% compared to 32-bit floating-point. In all experiments, the posit format either outperforms or matches the performance of floating and fixed-point. Additionally, with 24 fewer bits, posit matches the performance of 32-bit floating point on the Iris classification task.

Fig. 9 shows the lowest accuracy degradation per bit width against EDP. The results indicate posits achieve better performance compared to the floating and fixed-point formats at a moderate cost.

Figure 9: Average accuracy degradation vs. energy-delay-product for the EMACs implemented on Xilinx Virtex-7 FPGA. Numbers correspond with the bit-width of a numerical format.

| Dataset | Inference size | Posit | Floating-point | Fixed-point | 32-bit Float |
|---|---|---|---|---|---|
| Wisconsin Breast Cancer [14] | 190 | 85.89% | 77.4% | 57.8% | 90.1% |
| Iris [15] | 50 | 98% | 96% | 92% | 98% |
| Mushroom [16] | 2708 | 96.4% | 96.4% | 95.9% | 96.8% |

Table II: Deep Positron performance on low-dimensional datasets with 8-bit EMACs.

V Related Work

Research on low-precision arithmetic for neural networks dates to circa 1990 [17, 18] using fixed-point and floating point. Recently, several groups have shown that it is possible to perform inference in DNNs with 16-bit fixed-point representations [19, 20]. However, most of these studies compare DNN inference for different bit-widths. Few research teams have compared different number systems at the same bit-width, coupled with FPGA soft processors. For example, Hashemi et al. demonstrate DNN inference with 32-bit fixed-point and 32-bit floating point on the LeNet, ConvNet, and AlexNet DNNs, where the energy consumption is reduced by 12% with a 1% accuracy loss using fixed-point [4]. Most recently, Chung et al. proposed an accelerator, Brainwave, with a spatial 8-bit floating point format, called ms-fp8. The ms-fp8 format improves the throughput by 3× over 8-bit fixed-point on a Stratix-10 FPGA [3].

This paper also relates to three previous works that use posits in DNNs. The first DNN architecture using the posit number system was proposed by Langroudi et al. [21]. The work demonstrates that, with 1% accuracy degradation, DNN parameters can be represented using 7-bit posits for AlexNet on the ImageNet corpus and that posits require 30% less memory utilization for the LeNet, ConvNet, and AlexNet neural networks in comparison to the fixed-point format. Secondly, Cococcioni et al. [22] discuss the effectiveness of posit arithmetic for application to autonomous driving. They consider an implementation of the Posit Processing Unit (PPU) as an alternative to the Floating point Processing Unit (FPU) since the self-driving car standards require 16-bit floating point representations for the safety-critical application. Recently, Jeff Johnson proposed a log float format as a combination of the posit format and the logarithmic version of the EMAC operation called the exact log-linear multiply-add (ELMA). This work shows that ImageNet classification using the ResNet-50 DNN architecture can be performed with 1% accuracy degradation [23]. It also shows that 4% and 41% power consumption reduction can be achieved by using an 8/38-bit ELMA in place of an 8/32-bit integer multiply-add and an IEEE-754 float16 fused multiply-add, respectively.

This paper is inspired by the earlier studies and demonstrates that posit arithmetic with ultra-low precision (≤8-bit) is a natural choice for DNNs performing low-dimensional tasks. A precision-adaptable, parameterized FPGA soft core is used for comprehensive analysis of the Deep Positron architecture with the same bit-width for fixed-point, floating-point, and posit formats.

VI Conclusions

In this paper, we show that the posit format is well suited for deep neural networks at ultra-low precision (≤8-bit). We show that precision-adaptable, reconfigurable exact multiply-and-accumulate designs embedded in a DNN are efficient for inference. Accuracy-sensitivity studies for Deep Positron show robustness at 7-bit and 8-bit widths. In the future, the success of DNNs in real-world applications will rely on the underlying platforms and architectures as much as on the algorithms and data. Full-scale DNN accelerators with low-precision posit arithmetic will play an important role in this domain.