
Efficient, arbitrarily high precision hardware logarithmic arithmetic for linear algebra

04/17/2020, by Jeff Johnson, et al. (Facebook)

The logarithmic number system (LNS) is arguably not broadly used due to exponential circuit overheads for summation tables relative to arithmetic precision. Methods to reduce this overhead have been proposed, yet still yield designs with high chip area and power requirements. Use remains limited to lower precision or high multiply/add ratio cases, while much of linear algebra (near 1:1 multiply/add ratio) does not qualify. We present a dual-base approximate logarithmic arithmetic comparable to floating point in use, yet unlike LNS it is easily fully pipelined, extendable to arbitrary precision with O(n^2) overhead, and energy efficient at a 1:1 multiply/add ratio. Compared to float32 or float64 vector inner product with FMA, our design is respectively 2.3x and 4.6x more energy efficient in 7 nm CMOS. It depends on exp and log evaluation using shift-and-add and approximate ODE integration in the style of Revol and Yakoubsohn, which is 5.4x and 3.2x more energy efficient, at 0.23x and 0.37x the chip area, than standard hyperbolic CORDIC at equivalent accuracy. This technique is a useful design alternative for low power, high precision hardened linear algebra in computer vision, graphics, computational photography and machine learning applications.



I Introduction

Energy efficiency is typically the most important challenge in advanced CMOS technology nodes. With the dark silicon problem [34], the vast majority of a large scale design is clock or power gated at any given point in time. Chip area becomes exponentially more available relative to power consumption, preferring "a new class of architectural techniques that 'spend' area to 'buy' energy efficiency" [32]. Memory architecture is often the most important concern, with 170-6400x greater DRAM access energy versus arithmetic at 45 nm [15]. This changes with the rise of machine learning, as heavily employed linear algebra primitives such as matrix/matrix product offer substantial local reuse of data by algorithmic tiling [37]: O(n^3) arithmetic operations versus O(n^2) DRAM accesses. This is a reason for the rise of dedicated neural network accelerators, as memory overheads can be substantially amortized over many arithmetic operations in a fixed function design, making arithmetic efficiency matter again.

Many hardware efforts for linear algebra and machine learning tend towards low precision implementations, but here we concern ourselves with the opposite: enabling (arbitrarily) high precision yet energy efficient substitutes for floating point or long word length fixed point arithmetic. There are a variety of ML, computer vision and other algorithms where accelerators cannot easily apply precision reduction, such as hyperbolic embedding generation [24] or structure from motion via matrix factorization [33], yet which still offer high local data reuse potential.

The logarithmic number system (LNS) [30] can provide energy efficiency by eliminating hardware multipliers and dividers, yet maintains significant computational overhead with the Gaussian logarithm functions needed for addition and subtraction. While reduced precision cases can limit themselves to relatively small LUTs/ROMs, high precision LNS requires massive ROMs, linear interpolators and substantial MUXes. Pipelining is difficult, requiring resource duplication or handling variable latency corner cases as seen in [7]. The ROMs are also exponential in LNS word size, so they become impractical beyond a float32 equivalent. Chen et al. [5] provide an alternative fully pipelined LNS add/sub with ROM size a polynomial function of LNS word size, extended to float64 equivalent precision. However, in their words, "[our] design of [a] large word-length LNS processor becomes impractical since the hardware cost and the pipeline latency of the proposed LNS unit are much larger." Their float64 equivalent requires 471 Kbits ROM and at least 22,479 full adder (FA) cells, and 53.5 Kbits ROM and 5,550 FA cells for float32, versus a traditional LNS implementation they cite [21] with 91 Kbits of ROM and only 586 FA cells.

While there are energy benefits with LNS [26], we believe a better bargain can be had. Our main contribution is a trivially pipelined logarithmic arithmetic extendable to arbitrary precision, using no LUTs/ROMs, with an O(n^2) precision-to-FA-cell dependency. Unlike LNS, it is substantially more energy efficient than floating point at a 1:1 multiply/add ratio for linear algebra use cases. It is approximate in ways that an accurately designed LNS is not, though with parameters for tuning accuracy to match LNS as needed. It is based on the ELMA technique [18], extended to arbitrary precision with an energy efficient implementation of exp/log using restoring shift-and-add [23] and an ordinary differential equation integration step from Revol and Yakoubsohn [27], but with approximate multipliers and dividers. It is tailored for vector inner product, a foundation of much of linear algebra, but remains a general purpose arithmetic. We will first describe our hardware exp/log implementations, then detail how they are used as a foundation for our arithmetic, and provide an accuracy analysis. Finally, hardware synthesis results are presented and compared with floating point.

II Notes on hardware synthesis

All designs considered in this paper are on a commercially available 7 nm CMOS technology constrained to only SVT cells. They are generated using Mentor Catapult high level synthesis (HLS), biased towards min latency rather than min area, with ICG (clock gate) insertion where appropriate. Area is reported via Synopsys Design Compiler, and power/energy is from Synopsys PrimeTime PX from realistic switching activity. Energy accounts for combinational, register, clock tree and leakage power, normalized with respect to module throughput in cycles, so this is a per-operation energy. We consider pipelining acceptable for arithmetic problems in linear algebra with sufficient regularity such as matrix multiplication (Section VI-B), reducing the need for purely combinational latency reduction. Power analysis is at the TT@25C corner at nominal voltage. Design clocks from 250-750 MHz were considered, with 375 MHz chosen for reporting, being close to minimum energy for many of the designs. Changing frequency does change pipeline depth and required register/clock tree power, as well as choice of inferred adder or other designs needed to meet timing closure by synthesis.

III exp/log evaluation

Our arithmetic requires efficient hardware implementation of the exponential and logarithm for a base b, which are useful in their own right. Typical algorithms are power series evaluation, polynomial approximation/table-based methods, and shift-and-add methods such as hyperbolic CORDIC [35] or the simpler method by De Lugish [8]. Hardware implementations have been considered for CORDIC [10], ROM/table-based implementations [29], approximation using shift-and-add [1] with the Mitchell logarithm approximation [22], and digit recurrence/shift-and-add [25]. CORDIC requires three state variables and three additions per iteration, plus a final multiplication by a scaling factor. BKM [3] avoids the CORDIC scaling factor but introduces complexity in the iteration step.

Much of the hardware elementary function literature is concerned with latency reduction rather than energy optimization. Variants of these algorithms such as high radix formulations [4][11] [25] or parallel iterations [10] increase switching activity via additional active area, iteration complexity, or adding sizable MUXes in the radix case. In lieu of decreasing combinational delay via parallelism, pipelining is a worthwhile strategy to reduce energy-delay product [12], but only with high pipeline utilization and where register power increases are not substantial. Ripple-carry adders, the simplest and most energy efficient adders, remain useful in the pipelined regime, and variants like variable block adders improve latency for minimal additional energy [36]. Fully parallel adders like carry-save can improve on both latency and switching activity for elementary functions [25], but only where the redundant number system can be maintained with low computational overhead. For example, in shift-and-add style algorithms, adding a shifted version of a carry-save number to itself requires twice the number of adder cells as a simple ripple-carry adder (one to add each of the shifted components), resulting in near double the energy. Eliminating registers via combinational multicycle paths (MCPs) is another strategy, but as the design is no longer pipelined, throughput will suffer, requiring an introduction of more functional units or accepting the decrease in throughput. There is then a tradeoff between clock frequency, combinational latency reduction, pipelining for timing closure, MCP introduction, and functional unit duplication versus energy per operation.

IV exp shift-and-add with integration

We consider De Lugish-style restoring shift-and-add, which will provide ways to reduce power or recover precision with fewer iterations (Sections IV-A and IV-B). The procedure for exponentials is described in Muller [23] as follows: with residual t_0 = x and running product E_0 = 1, each iteration i >= 1 compares the residual against the constant a_i = log_b(1 + 2^-i); if t_{i-1} >= a_i, set t_i = t_{i-1} - a_i and E_i = E_{i-1} + 2^-i E_{i-1}, otherwise t_i = t_{i-1} and E_i = E_{i-1}. After n iterations, E_n approximates b^x.

The acceptable range of x is [0, sum_{i >= 0} log_b(1 + 2^-i)), or roughly [0, 1.56) for b = e (Euler's number). Range reduction techniques considered in [23] can be used to reduce arbitrary x to this range. This paper will only consider b = e and will limit x to [0, ln 2) as a fixed point fraction, restrictions discussed in Sections IV-A and VI-C.
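As a concrete reference, the following is a minimal Python sketch of this restoring shift-and-add recurrence as stated above. The variable names, the iteration count n, and the use of floating point in place of the fixed point widths discussed next are our own choices, not the paper's synthesized implementation.

```python
import math

def shift_add_exp(x, n=30):
    """Restoring shift-and-add: approximate e**x for x in [0, ln 2) using only
    comparisons, adds and shifts (a shift is modeled here as * 2.0**-i)."""
    t = x        # residual exponent still to be consumed
    E = 1.0      # running product; invariant: E = e**(x - t)
    for i in range(1, n + 1):
        a_i = math.log(1.0 + 2.0 ** -i)   # per-iteration hard-wired constant
        if t >= a_i:                      # restoring decision
            t -= a_i
            E += E * 2.0 ** -i            # E *= (1 + 2**-i): shift and add
    return E                              # relative error roughly 2**-n

if __name__ == "__main__":
    x = 0.625
    print(shift_add_exp(x), math.exp(x))  # agree to ~9 digits at n = 30
```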

We must consider rounding error and the precision of x, the constants a_i and the running product E_i. Our range-limited x can be specified purely as a fixed point fraction with m fractional bits. The i = 0 iteration is skipped, as t_0 = x < a_0. All subsequent a_i are < 1 and can be similarly represented as fixed point fractions, using correctly rounded representations of log_b(1 + 2^-i). E_i is in [1, 2) in our restricted domain, and is maintained as a fixed point fraction with an implicit, omitted leading integer 1. Multiplication by 2^-i is a shift by i bits, so we append this leading 1 to the post-shifted E_{i-1} before addition. We use m fractional bits to represent E_i. At the final n-th iteration, E_n is rounded to the output fractional width. Ignoring rounding error, the relative error of the algorithm is roughly 2^-i at iteration i, so for 23 fractional bits of output, n somewhat above 23 is desired.

All adders need not be of full size, either. The residual t_i reduces in magnitude at each step; with t_i < a_i < 2^-i, t_i only needs the LSB fractional bits below bit position i. The a_i have a related bit size progression, as each a_i lies just below 2^-i (their leading fractional bits cannot all be 1). While the shifted E_{i-1} successively loses precision as i grows, we limit E_i to m fractional bits via truncation (bits shifted beyond position m are ignored). As with [25], we can deal with truncation error by setting m larger than the output width, using the extra bits as guard bits.

Each iteration requires an adder and a MUX for t_i (selecting t_{i-1} - a_i if t_{i-1} >= a_i, or t_{i-1} otherwise). The constants a_i are hard-wired into the adders when iterations are fully unrolled (a separate adder for each iteration). The E_i updates do not use a full adder of size m in the general case; only shifted bits that overlap with previously available bits need full adder cells. The t_i comparisons can also be performed first, with the decision bits stored in flops (to reduce glitches) for data gating the E_i additions, reducing switching activity at the expense of higher latency, as 25% of the decision bits on average will remain zero across iterations.

One can use redundant number systems for t_i and a_i and avoid full evaluation of the comparator [23], but E_i is problematic. In the non-redundant case, only a subset of the shifted E_{i-1} bits require a full adder, and the remainder only a half adder. With a carry-save representation for E_i, two full adders are required for the entire length of the word, one to add each portion of the shifted carry-save representation. While the carry-save addition is constant latency, it requires more full adder cells. In our evaluation, carry-save for E_i prohibitively increases power over the synthesis-inferred adder choice. At high clock frequencies (low latency) this tradeoff is acceptable, but low power designs will generally avoid this regime.

Fig. 1: Our exp accuracy relative to a correctly rounded reference, over a sweep of design parameters. All configurations have at most 1 ulp error.

IV-A Euler method integration

This algorithm is simple but has high latency from the sequential dependency of many adders and the high iteration count n needed for accurate results. For significant latency and energy reduction, Revol and Yakoubsohn [27] show that about half the iterations can be omitted by treating the problem as an ordinary differential equation with a single numerical integration step. y = e^x satisfies the ODE y' = y with y(0) = 1. They consider in software both an explicit Euler method and 4th order Runge-Kutta (RK4). RK4 involves several multipliers and is not a good energy tradeoff to avoid more iterations. The explicit Euler method step has a single multiplication:

E_out = E_n + E_n * t_n

at the n-th terminal iteration, with the residual t_n used as the step size. They give a formula for the iteration count at a desired accuracy, ignoring truncation error; the single Euler step roughly squares the residual error, so about half the iterations of the plain recurrence suffice for single precision, with proportionally more for double and quad precision. Implementing 2^x this way would require a pre-multiplication by a fixed point rounding of ln 2, a significant energy overhead.
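To illustrate (again in plain Python floats, with an iteration count that is ours rather than the paper's synthesized configuration), terminating the recurrence early and applying the single Euler step recovers roughly twice the bits per iteration:

```python
import math

def shift_add_exp_euler(x, n=13):
    """Shift-and-add exp terminated early, finished with one explicit Euler
    step: since y' = y for y = e**x, e**x ~= E_n + E_n * t_n (step size t_n)."""
    t, E = x, 1.0
    for i in range(1, n + 1):
        a_i = math.log(1.0 + 2.0 ** -i)
        if t >= a_i:
            t -= a_i
            E += E * 2.0 ** -i
    return E + E * t                      # the single Euler-step multiplication

if __name__ == "__main__":
    x = 0.625
    y = shift_add_exp_euler(x, n=13)
    print(abs(y - math.exp(x)) / math.exp(x))   # ~1e-8 despite only 13 iterations
```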

IV-B Integration via approximate multiplication

We have E_n in [1, 2) and t_n < 2^-n. The Euler method step multiplication E_n * t_n would be a massive (m + 1) x m bits, with the +1 for the leading integer 1 bit of E_n. Let E_f denote the fractional portion of E_n; the step can then be expressed as:

E_out = E_n + t_n + E_f * t_n

The multiplication E_f * t_n now solely involves fractional bits, of which we only care about the MSBs produced. t_n has at least n zero fractional MSBs, so there are that many ignorable zero MSBs in the resulting product, still an exact step calculation. Given these zero MSBs, we only need the MSBs of the result that align with E_n, so we truncate both E_f and t_n to limit the multiplier to this size (truncation ignores carries from the multiplication of the truncated LSBs). We do this symmetrically, taking fractional MSBs from each operand, with an option to remove further bits. This may not produce enough bits to align properly with E_n, so we append zeros to the LSBs as needed to match the size of E_n. For example, in one of our configurations we have a 14 x 14 multiplier, of which we only need the 16 MSBs (based on alignment with E_n), plus the ultimate carry from the 12 LSBs. One can consider other approximate multipliers [17], but truncation seems to work well and provides a significant reduction in energy.
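A small Python model of a truncated step multiply under assumptions of our own (the bit widths here are illustrative, not the synthesized configuration): both operands keep only their top significant fractional bits, and carries from the discarded LSB partial products are never computed.

```python
def truncated_step_mul(e_f, t_n, n, keep=14):
    """Approximate E_f * t_n for the Euler step.  E_f is a fraction in [0, 1);
    t_n < 2**-n, so its top n fractional bits are known zeros.  Keep only the
    top `keep` fractional bits of E_f and the top `keep` significant bits of
    t_n, multiply those, and ignore everything below (no carries propagate
    up from the dropped LSBs)."""
    e_f_t = int(e_f * (1 << keep)) / float(1 << keep)              # truncate E_f
    t_n_t = int(t_n * (1 << (n + keep))) / float(1 << (n + keep))  # skip n zeros
    return e_f_t * t_n_t                                           # keep x keep

if __name__ == "__main__":
    e_f, t_n, n = 0.7182818, 9.9182e-05, 13
    exact = e_f * t_n
    approx = truncated_step_mul(e_f, t_n, n)
    print(exact, approx, abs(exact - approx))   # error on the order of 2**-(n+keep)
```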

             (0.5, 1] ulp err   Cycles   Area            Energy pJ
exp CORDIC   9.98%              4        1738            2.749
exp Ours     9.90%              2        407.2 (0.23x)   0.512 (0.19x)
log CORDIC   14.4%              4        2084            3.573
log Ours     14.8%              4        769.4 (0.37x)   1.107 (0.31x)
TABLE I: Fully pipelined exp/log synthesis results

IV-C Error analysis and synthesis results

The table maker's dilemma is unavoidable for transcendental functions [20]. For correctly rounded results over our fixed point input domain, we would need to evaluate to at least 42 bits. In lieu of exact evaluation, we demand function monotonicity and at most 1 ulp error, and consider the occurrence of incorrectly rounded results (error greater than 0.5 ulp). Figure 1 considers error in this regime with a sweep of the iteration count and the fractional bit widths of the recurrence and truncated multiplier; the chosen configuration attains maximum error below 1 ulp.

Table I shows fully-pipelined (iterations unrolled), near iso-accuracy synthesis results for our method and for standard hyperbolic CORDIC (28 iterations and 29 fractional bit variables). All implementations have at most 1 ulp error, with the fraction of results in the (0.5, 1] ulp range shown. We are 5.4x more energy efficient, at 0.23x the area, and half the latency in cycles; as discussed earlier, most CORDIC modifications reduce latency at the expense of increased energy.

V log shift-and-add with integration

The logarithm is similar to the exponential with the roles of the linear and log domain variables reversed, and with a division for the integration step [27]: starting from E_0 = 1 and L_0 = 0, iteration i takes the factor when E_{i-1} (1 + 2^-i) <= x, setting E_i = E_{i-1} + 2^-i E_{i-1} and L_i = L_{i-1} + a_i, and otherwise leaves both unchanged. The terminal integration step is

L_out = L_n + (x - E_n) / E_n

We restrict ourselves to x in [1, 2). As with the exponential, the single division-based integration step roughly halves the number of iterations required for a given target error (ignoring truncation error), whether for single, double or quad precision targets. The prior discussion concerning the a_i and E_i sequences and data gating carries over to this algorithm. It is also the case that the running sum L_i is not needed until the very end, so a carry-save adder postponing full evaluation of carries is appropriate. It is possible to use a redundant number system for E_i and avoid full evaluation of the comparison [23], but the required shift with add increases switching activity significantly.
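A minimal Python sketch of this log recurrence and its division-based Euler step as described above (floating point stands in for the fixed point paths, and the divider here is exact rather than the truncated divider of Section V-A):

```python
import math

def shift_add_log(x, n=13):
    """Restoring shift-and-add for ln(x), x in [1, 2): drive E from 1 toward x
    by factors (1 + 2**-i) while accumulating L = ln(E), then finish with one
    Euler step using a division, since d(ln y)/dy = 1/y."""
    E, L = 1.0, 0.0
    for i in range(1, n + 1):
        cand = E + E * 2.0 ** -i                # E * (1 + 2**-i): shift and add
        if cand <= x:
            E = cand
            L += math.log(1.0 + 2.0 ** -i)      # hard-wired constant a_i
    return L + (x - E) / E                      # division-based integration step

if __name__ == "__main__":
    x = 1.640625
    print(shift_add_log(x), math.log(x))        # agree to ~8 digits at n = 13
```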

V-A Integration via approximate division

We approximate the integration division by truncating the dividend and divisor. The dividend x - E_n has at least n zero fractional MSBs, and the divisor E_n is in [1, 2), so the result is a fraction that we must align with the bits in L_n for the sum. We skip the known zero MSBs, and some number of the LSBs of the dividend. For the divisor E_n, we need not use the entire fractional portion but choose only some number of its fractional MSBs, giving a small fixed point divider (plus one bit for the leading integer 1 of E_n). A modest divisor width is sufficient in our experiments. This is higher area and latency than the truncated multiplier (we only evaluated truncated division with digit recurrence), but the increase in resources of log versus exp is acceptable for linear algebra use cases (Section VIII).

Fig. 2: Our log accuracy relative to a correctly rounded reference, over a sweep of design parameters. All configurations have at most 1 ulp error.

V-B Error analysis and synthesis results

As before, we only consider monotonic implementations with at most 1 ulp error, and consider the frequency of incorrectly rounded results. Figure 2 shows such error occurrence versus a sweep of the design parameters; the divisor precision has a larger accuracy effect than the dividend truncation, and a moderate setting yields reasonable results, so all configurations in the figure use it. Table I shows near iso-accuracy synthesis results for our method and standard hyperbolic CORDIC (28 iterations and 30 fractional bit variables). Our implementation is 3.2x more energy efficient at 0.37x the area versus CORDIC, with much of the latency and energy coming from the truncated divider. The higher resource consumption of log over exp CORDIC comes from initializing the x and y CORDIC variables from the argument rather than to constants, with the required adder lengths inferred and propagated throughout the design by HLS.

VI Approximate logarithmic arithmetic

We show how the preceding designs are used to build an arbitrarily high precision logarithmic arithmetic with some (tunably) approximate aspects.

VI-A LNS arithmetic

The sign/magnitude logarithmic number system (LNS) [30] represents a value x as a rounded fixed point representation of log_b |x| to some number of integer and fractional bits, plus a sign and zero flag. The base b is typically 2. We refer to this as a representation of x in the log domain. We refer to rounding and encoding as integer, fixed or floating point as a linear domain representation, though note that floating point itself is a combination of log and linear representations for the exponent and significand.

The benefit of LNS is simplifying multiplication, division and power/root. For linear domain x and y with log domain representations log_b |x| and log_b |y|, multiplication or division corresponds to log_b |x| + log_b |y| or log_b |x| - log_b |y|, the k-th power of x to k log_b |x|, and the k-th root to (log_b |x|) / k, with sign, zero (and infinity/NaN flags if desired) handled in the obvious manner. Addition and subtraction, on the other hand, require Gaussian logarithm computation. For linear domain x, y with |x| >= |y| and z = log_b |y| - log_b |x| <= 0, the log domain add/sub is:

log_b(|x| + |y|) = log_b |x| + sigma_b(z), with sigma_b(z) = log_b(1 + b^z)
log_b(|x| - |y|) = log_b |x| + delta_b(z), with delta_b(z) = log_b(1 - b^z)

Without loss of generality we restrict |x| >= |y|, so we only consider z <= 0. These functions are usually implemented with ROM/LUT tables (possibly with interpolation) rather than direct function evaluation, ideally realized to 0.5 log domain ulp relative error. The subtraction function has a singularity at z = 0, corresponding to exact cancellation x - y = 0, with the region very near the singularity corresponding to near-exact cancellation. Realizing this critical region to 0.5 log ulp error without massive ROMs (241 Kbits in [6]) is a motivation for subtraction co-transformation to avoid the singularity, which can reduce the requirement to at least 65 Kbits [26]. Some designs are proposed as being ROM-less [16], but in practice the switching power and leakage of the tables' combinational cells would still be huge. Interpolation with reduced table sizes can also be used, but the formulation in [31] only considers log addition without the singularity. An ultimate limit on the technique not far above float32 equivalent is still faced, as accurate versions of these tables scale exponentially with word precision [5].
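For reference, a direct double precision Python evaluation of these Gaussian logarithm functions (a ROM/interpolator-free stand-in, not an LNS hardware design), which also shows the subtraction singularity that co-transformation works around:

```python
import math

def lns_add(xl, yl, b=2.0):
    """Log domain add: returns log_b(|x| + |y|) given xl = log_b|x| >= yl."""
    z = yl - xl                                       # z <= 0
    return xl + math.log1p(b ** z) / math.log(b)      # sigma_b(z)

def lns_sub(xl, yl, b=2.0):
    """Log domain subtract: returns log_b(|x| - |y|); blows up as z -> 0."""
    z = yl - xl
    return xl + math.log1p(-(b ** z)) / math.log(b)   # delta_b(z)

if __name__ == "__main__":
    xl, yl = 3.0, 2.5                  # x = 8, y = 2**2.5 ~= 5.657
    print(2.0 ** lns_add(xl, yl))      # ~13.657
    print(2.0 ** lns_sub(xl, yl))      # ~2.343
    print(lns_sub(3.0, 3.0 - 1e-6))    # near-exact cancellation: large negative
```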

Pipelined LNS add/sub is another concern. As mentioned in Section I, Chen et al. [5] have an impractical fully pipelined implementation. Coleman et al. [7] have add/sub taking 3 cycles to complete, but chose to duplicate rather than pipeline the unit, and mention that the latency is dominated by memory (ROM) access. Arnold [2] provides a fully pipelined add/sub unit, but with a “quick” instruction version that allows the instruction to complete in either 4 or 6 cycles if it avoids the subtraction critical region. On the other hand, uniformity may increase latency, as different pipe stages are restricted to different ROM segments.

When combining an efficient LNS multiply with the penalty of addition for linear algebra, recent work by Popoff et al. [26] shows an energy penalty of 1.84x over IEEE 754 float32 (using a naive sum of add and mul energies), a 4.5x area penalty for the entire LNS ALU, and mentions 25% reduced performance for linear algebra kernels such as GEMM. Good LNS use cases likely remain workloads with high multiply-to-add ratios.

VI-B ELMA/FLMA logarithmic arithmetic

The ELMA (exact log-linear multiply-add) technique [18] is a logarithmic arithmetic that avoids Gaussian logarithms. It was shown that an 8-bit ELMA implementation with extended dynamic range from posit-type encodings [13] is more energy efficient in 28 nm CMOS than 8/32-bit integer multiply-add (as used in neural network accelerators). It achieved similar accuracy as integer quantization on ResNet-50 CNN [14] inference on the ImageNet validation set [28], simply with float32 parameters converted via round-to-nearest only and all arithmetic in the ELMA form. Significant energy efficiency gains over IEEE 754 float16 multiply-add were also shown, though much higher precision was then impractical.

We describe ELMA and its extension to FLMA (floating point log-linear multiply-add). In ELMA, mul/div/root/power is performed in the log domain, while add/sub is performed in the linear domain with fixed point arithmetic. A log-to-linear conversion maps a log domain value (with its integer and fractional log bits) to the linear domain, and a linear-to-log conversion maps back; both are necessarily approximate (LNS values are irrational). The log-to-linear conversion produces fixed point (ELMA) or floating point (FLMA) values; in base-2 FLMA, the integer portion of the log becomes the linear domain floating point exponent and the exponential of the fractional portion becomes the significand. The conversion can increase precision by evaluating the exponential to more fractional bits than the log domain fraction. Unique conversion for base 2 requires at least one such additional bit, as the minimum derivative of 2^f over the fraction domain, ln 2, is less than 1.

FLMA approximates a linear domain sum in the log domain by converting both log domain operands to linear floating point, adding, and converting the result back to the log domain. The linear-to-log conversion uses the floating point exponent as the log domain integer portion and evaluates the logarithm of the significand back to the required log domain fractional bits. The fixed or floating point accumulator can use a different fractional precision than the conversions, in which case the reverse conversion considers only the accumulator's MSB fractional bits with rounding; it is similarly unique only with sufficient extra fractional bits. Typically the conversion and accumulator precisions exceed the log domain precision, and as they increase we converge to exact LNS add/sub. As with LNS, if add/sub is the only operation, ELMA/FLMA does not make sense. It is tailored for linear algebra sums-of-products; conversion errors are likely to be uncorrelated in use cases of interest (Sections VII-B and VII-C), and it is substantially more efficient than floating point at a 1:1 multiply-to-add ratio (Section VIII).
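A behavioral Python sketch of this flow for a sum of products, with double precision floats standing in for the hardware's fixed point conversions and accumulator; the 23 fractional log bits are an assumption matching the float32-like configuration discussed later, and the conversion precisions are not modeled separately here.

```python
import math

def to_log2(x, frac_bits=23):
    """Linear-to-log: round a positive value to a fixed point base-2 log."""
    return round(math.log2(x) * (1 << frac_bits)) / (1 << frac_bits)

def flma_dot(xs, ys, frac_bits=23):
    """FLMA-style inner product: multiply in the log domain (a fixed point add
    of the two log encodings), convert each product to the linear domain via
    exp2, and accumulate with an ordinary (here float) adder."""
    acc = 0.0
    for x, y in zip(xs, ys):
        p_log = to_log2(x, frac_bits) + to_log2(y, frac_bits)  # log-domain mul
        acc += 2.0 ** p_log                                    # log->linear, add
    return to_log2(acc, frac_bits)          # one final linear->log conversion

if __name__ == "__main__":
    xs, ys = [0.5, 1.25, 3.0, 0.75], [2.0, 0.8, 1.5, 4.0]
    print(2.0 ** flma_dot(xs, ys), sum(a * b for a, b in zip(xs, ys)))
```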

Unlike LNS, an ELMA design (and FLMA, depending upon floating point adder latency) can be easily pipelined and accept a new summand every cycle for accumulation without resource duplication (e.g., LNS ROMs). Furthermore, accumulator precision can be (much) greater than the log domain precision; in LNS this requires increasing Gaussian logarithm precision to the accumulator width. These properties make ELMA/FLMA excellent for inner product, where many sums of differing magnitudes may be accumulated. FLMA is related to [19], except that architecture is oriented around a linear domain floating point representation such that all mul/div/root is done with a log conversion to LNS, the log domain operation, and an exp conversion back to linear domain. Their log/exp conversions were further approximated with linear interpolation. Every mul/div/root operation thus included the error introduced by both conversions.

VI-C Dual-base logarithmic arithmetic

ELMA/FLMA requires accurate calculation of the fractional portions of the exponential and logarithm. Sections IV and V show that e^x and ln(x) can be calculated more accurately for the same resources than 2^x and log2(x), which would require an extra multiplication by a rounding of ln 2. While Gaussian logarithms can be computed irrespective of base, FLMA requires an accessible base-2 exponent to carry over as a floating point exponent, which a pure base-e representation does not easily yield.

An alternative is a variation of multiple base arithmetic by Dimitrov et al. [9], allowing for more than one base (one of which is usually 2 and the others are any positive real number), with exponents as small integers. We instead use a representation (-1)^s * 2^j * e^f (or zero), with the base-2 exponent j an integer encoded in some number of bits, and the base-e exponent f encoded as a fixed point fraction. When f is in [0, ln 2), e^f evaluates to a FLMA floating point significand in the range [1, 2), which we will refer to as the Euler significand. The product of any two of these values has base-2 exponent j1 + j2 and base-e exponent f1 + f2; for division, j1 - j2 and f1 - f2. We no longer have a unique representation if we do not limit the base-e exponent: for example, 2^1 * e^0 equals 2^0 * e^(ln 2).

We call a base-e exponent in the range [0, ln 2) a normalized Euler significand. Normalization subtracts (or adds) ln 2 from the base-e exponent and increments (or decrements) the base-2 exponent as necessary to obtain a normalized significand. There are two immediate downsides to this. First, we do not use the full encoding range; our base-e exponent is encoded as a fixed point fraction, but we only use ln 2, or about 69.3%, of the values. Encoding a precision/dynamic range tradeoff with the unused portion as in [13] could be considered. The second downside is considered in Section VII-A.
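A small Python sketch of dual-base encoding, multiplication and renormalization as described above (sign/zero handling omitted, and the base-e fraction kept as a float rather than a rounded fixed point fraction, so the ln 2 rounding error discussed in Section VII-A does not appear here):

```python
import math

LN2 = math.log(2.0)

def db_encode(x):
    """Encode positive x as (j, f) with x = 2**j * exp(f) and f in [0, ln 2)."""
    j = math.floor(math.log2(x))
    f = math.log(x / 2.0 ** j)      # normalized Euler significand exponent
    return j, f

def db_mul(a, b):
    """Dual-base multiply: add both exponents, then renormalize f back into
    [0, ln 2) by moving one ln 2 into the base-2 exponent if needed."""
    j, f = a[0] + b[0], a[1] + b[1]
    if f >= LN2:
        j, f = j + 1, f - LN2
    return j, f

def db_decode(v):
    j, f = v
    return 2.0 ** j * math.exp(f)

if __name__ == "__main__":
    a, b = db_encode(3.5), db_encode(0.3)
    print(db_decode(db_mul(a, b)), 3.5 * 0.3)   # ~1.05 both ways
```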

Fig. 3: FLMA sum error in the log domain as a function of the log-to-linear conversion precision. All configurations have at most 1 log ulp error, except for the lowest conversion precision.

VII FLMA analysis

We investigate dual-base arithmetic with FLMA log domain parameters (8, 23) (roughly IEEE 754 binary32 equivalent without subnormals), with the exp, log and accumulator fractional precisions swept as described below. For relative error we use units in the last place of the fractional log domain representation, which we call log ulp. For instance, at 4 fractional bits, base-e exponents b0.0110 and b0.1011 (the binary fixed point fractions 0.375 and 0.6875) differ by 5 log ulps.

VII-A Multiply/divide accuracy

LNS and single-base FLMA have 0 log ulp mul/div error, but dual-base FLMA can produce a non-normalized significand, requiring add/sub of a rounding of ln 2 for normalization and introducing slight error (about 0.016 log ulp for our parameters). The extended exp algorithm can avoid this for multiply-add with an additional iteration and integer bit, since the sum of two normalized base-e exponents is below 2 ln 2. We would still require additional normalization of the result to a floating point significand in [1, 2); the dropped bit is kept by enlarging the accumulator, or is rounded away. Normalization is still required if more than two successive mul/div operations are performed.

VII-B Add/subtract accuracy

For a log domain sum where both operands have the same sign (i.e., not strict subtraction), the error is bounded by twice the maximum log-to-linear conversion error, plus the maximum floating point addition error and the maximum linear-to-log conversion error. In practice the worst case error is hard to determine without exhaustive search. Limiting ourselves to values in a restricted range, we evaluate log domain FLMA addition for all values of one operand and a choice of 64 random values of the other, versus correctly rounded LNS addition, in Figure 3. All configurations have small maximum log ulp error, with the lowest conversion precision the worst. With increased conversion precision there are exponentially fewer incorrectly rounded sums, but the table maker's dilemma is a limiting factor: at the highest precision considered, about 0.0005% of these sums remain incorrectly rounded beyond 0.5 log ulp.

Fig. 4: FLMA catastrophic cancellation: relative (log ulp) error of the near-cancelling subtraction versus the exact answer in the log domain, as a function of the log-to-linear conversion precision.

For subtraction, catastrophic cancellation (a motivation for LNS co-transformation) still realizes itself. As with LNS, there is also a means of correction. The issue appears with pairs of values very close in magnitude: consider a linear domain value x and the value y whose base-e exponent is 1 log ulp below that of x, i.e., our next lowest representable value, and evaluate x - y with FLMA subtraction. Both operands are converted to the linear domain, subtracted in the accumulator, and the difference converted back to the log domain. The absolute error of the result versus a calculation carried to 0.5 log ulp error is tiny, yet the result is off by 135,111 log ulp (its distance from the correctly rounded log domain answer). In floating point, the rounded result of such a subtraction would have error of 0.5 ulp. However, as (almost) all of our log domain values have an infinite fractional expansion in the linear domain, in near cancellation with a limited number of linear fractional bits, FLMA misses the extended expansion of the subtraction residual.

If reducing relative error is a concern, we can increase the precision of the log-to-linear conversion. This provides more of the linear domain infinite fractional expansion, reducing relative error to well under a log ulp almost everywhere if necessary (Figure 4). Absolute error remains bounded throughout the cancellation regime and shrinks as conversion precision grows. We are not increasing the log domain precision itself, but increasing the distinction in the linear domain between adjacent log domain values. The accumulator can maintain a reduced precision, with any remainder bits rounded off.
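A toy Python illustration of this recovery, entirely under assumptions of our own (double precision floats model the conversions, and the bit counts and input value are arbitrary): the same near-cancelling log domain subtraction is repeated with progressively more fractional bits in the log-to-linear conversion, and its log ulp error shrinks accordingly.

```python
import math

def log2lin(f_log, exp_frac_bits):
    """Log->linear: evaluate 2**f_log, rounding the significand fraction to
    exp_frac_bits bits (a stand-in for the hardware exp unit's precision)."""
    y = 2.0 ** f_log
    e = math.floor(math.log2(y))
    sig = y / 2.0 ** e
    sig = 1.0 + round((sig - 1.0) * (1 << exp_frac_bits)) / (1 << exp_frac_bits)
    return sig * 2.0 ** e

def flma_sub_log(a_log, b_log, exp_frac_bits):
    """Log domain subtraction via the linear domain, then back to log."""
    diff = log2lin(a_log, exp_frac_bits) - log2lin(b_log, exp_frac_bits)
    return math.log2(diff)

if __name__ == "__main__":
    frac = 23                                 # log domain fractional bits
    a_log = 0.40625
    b_log = a_log - 2.0 ** -frac              # 1 log ulp below: near cancellation
    exact = math.log2(2.0 ** a_log - 2.0 ** b_log)
    for exp_bits in (frac + 4, frac + 12, frac + 20):
        err = (flma_sub_log(a_log, b_log, exp_bits) - exact) * 2.0 ** frac
        print(exp_bits, err)                  # log ulp error drops as bits grow
```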

Fig. 5: FLMA versus float32 fused multiply-add (FMA): absolute error of random inner product trials, relative to a float64 reference.

VII-C Multiple sum and inner product accuracy

Many processes seen in ML and computer vision yield quasi-normally distributed data. Consider the sum of n independent variates from N(0, 1), which is distributed as N(0, n). The likelihood of a specific sum lying in a small critical region around zero is given by the PDF there, which scales as 1/sqrt(n) and converges to 0 as n grows. The chance of overall catastrophic cancellation is thus frequently reduced, so we could use reduced conversion precision for efficiency. Intermediate sums could be subject to cancellation issues, but barring degenerate cases (each pair of successive sums nearly cancel), this is unlikely to matter in practice.

For inner product, while the product of normal distributions is not normal, there is a similar diminishing cancellation region behavior with sums of independent product-normal variates. In practice (Figure 5), short sums of products have higher error with greater variance versus floating point, but the conversion and summation errors cancel for larger sums, as the additional conversion and accumulator width, combined with lower multiplication error versus floating point (only non-normalized products have error, typically much less than 0.5 ulp), result in greater average accuracy versus floating point fused multiply-add (FMA).

VIII Arithmetic synthesis

We compare 7 nm area, latency and energy against IEEE 754 floating point without subnormal support. A throughput of c means a module accepts a new operation every c clock cycles (c = 1 is fully pipelined), while latency is the number of cycles to first result, or pipeline length. Table II shows basic arithmetic operations with FLMA parameters the same as Section VII. Note that the general LNS pattern of multiply energy being significantly lower but add/sub significantly higher still holds. Add/sub is two-operand, so this implementation includes two log-to-linear converters and one linear-to-log converter, none of which can be gated in a fully utilized pipeline (they are all constantly switching). A naive sum of multiply and add energies is therefore higher than floating point. However, as mentioned earlier, it is easier to efficiently pipeline FLMA add/sub compared to LNS add/sub.

The situation changes when we consider multiply-accumulate, perhaps the most important primitive for linear algebra. Table III shows FLMA modules for 128-dim vector inner product with a fully pipelined inner loop, comparing against floating point FMA. The float64 comparison is against an FLMA configuration with (11, 52) log domain parameters and correspondingly larger exp, log and accumulator precisions. The benefit of the FLMA design can be seen in this case: log domain multiplication, log-to-linear conversion and a floating point add is much lower energy than a floating point FMA. As with LNS or FLMA addition, a single multiply-add with a log domain result would be inefficient, but in running sum cases (multiply-accumulate) the final linear-to-log conversion is deferred and amortized over all the work, and this conversion (unlike the inner loop) need not be fully pipelined. Using a combinational MCP for it with data gating when inactive saves power and area, at the cost of 2 additional cycles of throughput. Increased accumulator precision (independent of the log domain precision) is also possible at minimal cost, as this only affects the floating point adder.

Type                                   Latency   Area            Energy/op pJ
float32 add/sub                        1         138.4           0.274
FLMA add/sub                           7         1577 (11.4x)    1.768 (6.45x)
float32 mul                            1         248.4           0.802
FLMA mul                               1         40.2 (0.16x)    0.080 (0.10x)
float32 FMA                            1         481.2           1.443
FLMA mul-add core, no linear-to-log    3         706.5 (1.47x)   0.586 (0.41x)
TABLE II: Fully pipelined (throughput 1) arithmetic synthesis
Type            Throughput (cycles)   Area            Energy/op pJ
float32 FMA     130                   591.0           1.542
(8, 23) FLMA    135                   1271 (2.15x)    0.668 (0.43x)
float64 FMA     131                   1787.3          5.032
(11, 52) FLMA   144                   6651 (3.72x)    1.104 (0.22x)
TABLE III: 128-dim inner product multiply-add synthesis results

IX Conclusion

Modern applications of computer vision, graphics (Figure 6) and machine learning often need energy efficient, high precision arithmetic in hardware. We present a novel dual-base logarithmic arithmetic applicable to the linear algebra kernels found in these applications. It is built on efficient implementations of exp and log, useful in their own right, leveraging numerical integration with truncated mul/div. While the arithmetic is approximate and, unlike LNS or floating point arithmetic, without strong guarantees on relative error, it retains moderate to low relative error and low absolute error, is extendable to arbitrary precision and easily pipelined, providing an alternative to high precision floating or fixed point arithmetic when aggressive quantization is impractical.

Acknowledgments We thank Synopsys for their permission to publish results on our research obtained by using their tools with a popular 7 nm semiconductor technology node.

Fig. 6: 2048x2048 raytracing done entirely in dual-base FLMA arithmetic (Section VII parameters). Pixel values clamped and rounded to nearest even integers for RGB output.

References

  • [1] K. H. Abed and R. E. Siferd (2003) CMOS VLSI implementation of a low-power logarithmic converter. IEEE Trans. Comput. 52(11), pp. 1421–1433.
  • [2] M. G. Arnold (2003) A VLIW architecture for logarithmic arithmetic. In Euromicro Symposium on Digital System Design, pp. 294–302.
  • [3] J.-C. Bajard, S. Kla, and J.-M. Muller (1994) BKM: a new hardware algorithm for complex elementary functions. IEEE Transactions on Computers 43(8), pp. 955–963.
  • [4] P. W. Baker (1975) Parallel multiplicative algorithms for some elementary functions. IEEE Trans. Comput. 24(3), pp. 322–325.
  • [5] C. Chen, R.-L. Chen, and C.-H. Yang (2000) Pipelined computation of very large word-length LNS addition/subtraction with polynomial hardware cost. IEEE Transactions on Computers 49(7), pp. 716–726.
  • [6] J. N. Coleman and R. Che Ismail (2016) LNS with co-transformation competes with floating-point. IEEE Transactions on Computers 65(1), pp. 136–146.
  • [7] J. N. Coleman, C. I. Softley, J. Kadlec, R. Matousek, M. Tichy, Z. Pohl, A. Hermanek, and N. F. Benschop (2008) The European logarithmic microprocessor. IEEE Transactions on Computers 57(4), pp. 532–546.
  • [8] B. G. De Lugish (1970) A class of algorithms for automatic evaluation of certain elementary functions in a binary computer. Ph.D. thesis, University of Illinois at Urbana-Champaign.
  • [9] V. S. Dimitrov, G. A. Jullien, and W. C. Miller (1999) Theory and applications of the double-base number system. IEEE Transactions on Computers 48(10), pp. 1098–1106.
  • [10] J. Duprat and J.-M. Muller (1993) The CORDIC algorithm: new results for fast VLSI implementation. IEEE Trans. Comput. 42(2), pp. 168–178.
  • [11] M. D. Ercegovac, T. Lang, and P. Montuschi (1993) Very high radix division with selection by rounding and prescaling. In Proceedings of the IEEE 11th Symposium on Computer Arithmetic, pp. 112–119.
  • [12] R. Gonzalez and M. Horowitz (1996) Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits 31(9), pp. 1277–1284.
  • [13] J. L. Gustafson and I. T. Yonemoto (2017) Beating floating point at its own game: posit arithmetic. Supercomputing Frontiers and Innovations 4(2), pp. 71–86.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [15] M. Horowitz (2014) 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14.
  • [16] R. C. Ismail and J. N. Coleman (2011) ROM-less LNS. In 2011 IEEE 20th Symposium on Computer Arithmetic, pp. 43–51.
  • [17] H. Jiang, C. Liu, N. Maheshwari, F. Lombardi, and J. Han (2016) A comparative evaluation of approximate multipliers. In 2016 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), pp. 191–196.
  • [18] J. Johnson (2018) Rethinking floating point for deep learning. NeurIPS Workshop on Systems for ML. arXiv:1811.01721.
  • [19] F.-S. Lai and C.-F. E. Wu (1991) A hybrid number system processor with geometric and complex arithmetic capabilities. IEEE Transactions on Computers 40(8), pp. 952–962.
  • [20] V. Lefevre, J.-M. Muller, and A. Tisserand (1998) Toward correctly rounded transcendentals. IEEE Transactions on Computers 47(11), pp. 1235–1243.
  • [21] D. M. Lewis (1994) Interleaved memory function interpolators with application to an accurate LNS arithmetic unit. IEEE Transactions on Computers 43(8), pp. 974–982.
  • [22] J. N. Mitchell (1962) Computer multiplication and division using binary logarithms. IRE Transactions on Electronic Computers EC-11(4), pp. 512–517.
  • [23] J. Muller (1985) Discrete basis and computation of elementary functions. IEEE Trans. Comput. 34(9), pp. 857–862.
  • [24] M. Nickel and D. Kiela (2017) Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6338–6347.
  • [25] J.-A. Piñeiro, M. D. Ercegovac, and J. D. Bruguera (2005) High-radix logarithm with selection by rounding: algorithm and implementation. J. VLSI Signal Process. Syst. 40(1), pp. 109–123.
  • [26] Y. Popoff, F. Scheidegger, M. Schaffner, M. Gautschi, F. K. Gürkaynak, and L. Benini (2016) High-efficiency logarithmic number unit design based on an improved cotransformation scheme. In Proceedings of the 2016 Conference on Design, Automation & Test in Europe (DATE), pp. 1387–1392.
  • [27] N. Revol and J. Yakoubsohn (2000) Accelerated shift-and-add algorithms. Reliable Computing 6(2), pp. 193–205.
  • [28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115(3), pp. 211–252.
  • [29] M. J. Schulte and E. E. Swartzlander, Jr. (1994) Hardware designs for exactly rounded elementary functions. IEEE Trans. Comput. 43(8), pp. 964–973.
  • [30] E. E. Swartzlander and A. G. Alexopoulos (1975) The sign/logarithm number system. IEEE Transactions on Computers C-24(12), pp. 1238–1242.
  • [31] F. Taylor (1983) An extended precision logarithmic number system. IEEE Transactions on Acoustics, Speech, and Signal Processing 31(1), pp. 232–234.
  • [32] M. B. Taylor (2012) Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, pp. 1131–1136.
  • [33] C. Tomasi and T. Kanade (1992) Shape and motion from image streams under orthography: a factorization method. Int. J. Comput. Vision 9(2), pp. 137–154.
  • [34] G. Venkatesh, J. Sampson, N. Goulding, S. Garcia, V. Bryksin, J. Lugo-Martinez, S. Swanson, and M. B. Taylor (2010) Conservation cores: reducing the energy of mature computations. In Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XV), pp. 205–218.
  • [35] J. Volder (1959) The CORDIC computing technique. In Papers Presented at the March 3–5, 1959, Western Joint Computer Conference (IRE-AIEE-ACM '59), pp. 257–261.
  • [36] M. Vratonjic, B. R. Zeydel, and V. G. Oklobdzija (2005) Low- and ultra low-power arithmetic units: design and comparison. In 2005 International Conference on Computer Design, pp. 249–252.
  • [37] M. Wolfe (1989) More iteration space tiling. In Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pp. 655–664.