I Introduction
Energy efficiency is typically the most important challenge in advanced CMOS technology nodes. With the dark silicon problem [34], the vast majority of a large scale design is clock or power gated at any given point in time. Chip area becomes exponentially more available relative to power consumption, favoring "a new class of architectural techniques that 'spend' area to 'buy' energy efficiency" [32]. Memory architecture is often the most important concern, with 170-6400x greater DRAM access energy versus arithmetic at 45 nm [15]. This changes with the rise of machine learning, as heavily employed linear algebra primitives such as matrix/matrix product offer substantial local reuse of data by algorithmic tiling [37]: O(n^3) arithmetic operations versus O(n^2) DRAM accesses. This is a reason for the rise of dedicated neural network accelerators, as memory overheads can be substantially amortized over many arithmetic operations in a fixed function design, making arithmetic efficiency matter again.
Many hardware efforts for linear algebra and machine learning tend towards low precision implementations, but here we concern ourselves with the opposite: enabling (arbitrarily) high precision yet energy efficient substitutes for floating point or long word length fixed point arithmetic. There are a variety of ML, computer vision and other algorithms where accelerators cannot easily apply precision reduction, such as hyperbolic embedding generation [24] or structure from motion via matrix factorization [33], yet provide high local data reuse potential.
The logarithmic number system (LNS) [30] can provide energy efficiency by eliminating hardware multipliers and dividers, yet carries significant computational overhead in the Gaussian logarithm functions needed for addition and subtraction. While reduced precision cases can limit themselves to relatively small LUTs/ROMs, high precision LNS requires massive ROMs, linear interpolators and substantial MUXes. Pipelining is difficult, requiring resource duplication or handling variable latency corner cases as seen in [7]. The ROMs are also exponential in LNS word size, so become impractical beyond a float32 equivalent. Chen et al. [5] provide an alternative fully pipelined LNS add/sub with ROM size a polynomial function of LNS word size, extended to float64 equivalent precision. However, in their words, "[our] design of [a] large wordlength LNS processor becomes impractical since the hardware cost and the pipeline latency of the proposed LNS unit are much larger." Their float64 equivalent requires 471 Kbits ROM and at least 22,479 full adder (FA) cells, and 53.5 Kbits ROM and 5,550 FA cells for float32, versus a traditional LNS implementation they cite [21] with 91 Kbits of ROM and only 586 FA cells. While there are energy benefits with LNS [26], we believe a better bargain can be had.

Our main contribution is a trivially pipelined logarithmic arithmetic extendable to arbitrary precision, using no LUTs/ROMs, with FA cell count a direct function of desired precision. Unlike LNS, it is substantially more energy efficient than floating point at a 1:1 multiply/add ratio for linear algebra use cases. It is approximate in ways that an accurately designed LNS is not, though with parameters for tuning accuracy to match LNS as needed. It is based on the ELMA technique [18], extended to arbitrary precision with an energy efficient implementation of exp/log using restoring shift-and-add [23] and an ordinary differential equation integration step from Revol and Yakoubsohn [27], but with approximate multipliers and dividers. It is tailored for vector inner product, a foundation of much of linear algebra, but remains a general purpose arithmetic. We will first describe our hardware exp/log implementations, then detail how they are used as a foundation for our arithmetic, and provide an accuracy analysis. Finally, hardware synthesis results are presented and compared with floating point.

II Notes on hardware synthesis
All designs considered in this paper are on a commercially available 7 nm CMOS technology constrained to only SVT cells. They are generated using Mentor Catapult high level synthesis (HLS), biased towards minimum latency rather than minimum area, with ICG (clock gate) insertion where appropriate. Area is reported via Synopsys Design Compiler, and power/energy via Synopsys PrimeTime PX using realistic switching activity. Energy accounts for combinational, register, clock tree and leakage power, normalized with respect to module throughput in cycles, so this is a per-operation energy. We consider pipelining acceptable for arithmetic problems in linear algebra with sufficient regularity such as matrix multiplication (Section VI-B), reducing the need for purely combinational latency reduction. Power analysis is at the TT corner at 25 C and nominal voltage. Design clocks from 250-750 MHz were considered, with 375 MHz chosen for reporting, being close to minimum energy for many of the designs. Changing frequency changes pipeline depth and required register/clock tree power, as well as the choice of inferred adder or other designs needed by synthesis to meet timing closure.
III exp/log evaluation
Our arithmetic requires efficient hardware implementation of exponential and logarithm for a base b, which are useful in their own right. Typical algorithms are power series evaluation, polynomial approximation/table-based methods, and shift-and-add methods such as hyperbolic CORDIC [35] or the simpler method by De Lugish [8]. Hardware implementations have been considered for CORDIC [10], ROM/table-based implementations [29], approximation using shift-and-add [1] with the Mitchell logarithm approximation [22], and digit recurrence/shift-and-add [25]. CORDIC requires three state variables and additions per iteration, plus a final multiplication by a scaling factor. BKM [3] avoids the CORDIC scaling factor but introduces complexity in the iteration step.
Much of the hardware elementary function literature is concerned with latency reduction rather than energy optimization. Variants of these algorithms such as high radix formulations [4][11][25] or parallel iterations [10] increase switching activity via additional active area, iteration complexity, or sizable MUXes in the radix case. In lieu of decreasing combinational delay via parallelism, pipelining is a worthwhile strategy to reduce energy-delay product [12], but only with high pipeline utilization and where register power increases are not substantial. Ripple-carry adders, the simplest and most energy efficient adders, remain useful in the pipelined regime, and variants like variable block adders improve latency for minimal additional energy [36]. Fully parallel adders like carry-save can improve on both latency and switching activity for elementary functions [25], but only where the redundant number system can be maintained with low computational overhead. For example, in shift-and-add style algorithms, adding a shifted version of a carry-save number to itself requires twice the number of adder cells as a simple ripple-carry adder (one to add each of the shifted components), resulting in nearly double the energy. Eliminating registers via combinational multi-cycle paths (MCPs) is another strategy, but as the design is no longer pipelined, throughput will suffer, requiring the introduction of more functional units or accepting the decrease in throughput. There is then a tradeoff between clock frequency, combinational latency reduction, pipelining for timing closure, MCP introduction, and functional unit duplication versus energy per operation.
IV exp shift-and-add with integration
We consider De Lugish-style restoring shift-and-add, which will provide ways to reduce power or recover precision with fewer iterations (Sections IV-A and IV-B). The procedure for exponentials is described in Muller [23]: with constants c_n = log_b(1 + 2^-n), maintain a log accumulator L_n and a product E_n, starting at L_0 = 0 and E_0 = 1. At each iteration, if L_n + c_n <= x then L_{n+1} = L_n + c_n and E_{n+1} = E_n + E_n 2^-n = E_n (1 + 2^-n); otherwise both are carried forward unchanged. As L_n approaches x, E_n approaches b^x, with the multiplication by 2^-n being a simple shift.

The acceptable range of x is [0, sum of all c_n], or approximately [0, 1.5619] for b = e (Euler's number). Range reduction techniques considered in [23] can be used to reduce arbitrary x to this range. This paper will only consider b = e, with x limited to a fixed point fraction in [0, ln 2), restrictions discussed in Sections IV-A and VI-C.

We must consider rounding error and the precision of x, L_n and E_n. Our range-limited x can be specified purely as a fixed point fraction with f fractional bits. The n = 0 iteration is skipped as c_0 = ln 2 > x. All subsequent c_n are < 0.5 and can be similarly represented as fixed point fractions. These use f_c fractional bits (f_c >= f) with correctly rounded representations of ln(1 + 2^-n). E_n is in [1, 2) in our restricted domain, and is maintained as a fixed point fraction with an implicit, omitted leading integer 1. Multiplication by 2^-n is a shift by n bits, so we append this leading 1 to the post-shifted E_n before addition. We use f_E fractional bits to represent E_n. At the final N-th iteration, E_N is rounded to the output precision for e^x. Ignoring rounding error, the relative error of the algorithm is on the order of 2^-n at iteration n, so for 23 fractional bits, N around 24 is desired.
All adders need not be of full size, either. The residual x - L_n reduces in magnitude at each step, so its comparison and update only need the LSB fractional bits. The shifted E_n has a related bit size progression, as the shift by n bits discards precision beyond the maintained width. We limit the shifted E_n to its fractional bits via truncation (bits shifted beyond the last maintained position are ignored). As with [25], we can deal with truncation error by allocating extra LSBs as guard bits.
Each iteration requires an adder and MUX. The constants c_n are hardwired into adders when iterations are fully unrolled (a separate adder for each iteration). The E_n additions do not use a full adder of the entire word size in the general case; only shifted bits that overlap with previously available bits need full adder cells. The comparisons can also be performed first, with the iteration decisions stored in flops (to reduce glitches) for data gating additions, reducing switching activity at the expense of higher latency, as about 25% of the iteration decisions on average will remain zero across iterations.
One can use redundant number systems and avoid full evaluation of the comparator [23], but this is problematic here. In the non-redundant case, only a subset of the shifted bits require a full adder, and the remainder only a half adder. With a carry-save representation, two full adders are required for the entire length of the word, one to add each portion of the shifted carry-save representation. While the carry-save addition is constant latency, it requires more full adder cells. In our evaluation, carry-save prohibitively increases power over the synthesis-inferred adder choice. At high clock frequencies (low latency) this tradeoff is acceptable, but low power designs will generally avoid this regime.
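As a concrete reference for the restoring iteration, the following is a minimal software sketch in floating point; the names L, E and c are illustrative, and a hardware datapath would use the fixed point formats described above rather than IEEE doubles.

```python
import math

def shift_add_exp(x, iters=30):
    """Restoring shift-and-add for e^x, x in [0, ln 2).

    L accumulates selected constants ln(1 + 2^-n); E accumulates the
    matching product of (1 + 2^-n) factors, so E -> e^x as L -> x.
    The n = 0 iteration is skipped since ln 2 > x in this range.
    """
    L, E = 0.0, 1.0
    for n in range(1, iters + 1):
        c = math.log1p(2.0 ** -n)   # ln(1 + 2^-n), hardwired in hardware
        if L + c <= x:              # restoring test: never overshoot x
            L += c
            E += E * 2.0 ** -n      # E *= (1 + 2^-n): shift and add
    return E

print(abs(shift_add_exp(0.5) - math.exp(0.5)))  # error shrinks ~2x per iteration
```

Each selected step is one shifted addition of E to itself, which is what makes the hardware multiplier-free.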
IV-A Euler method integration
This algorithm is simple but has high latency from the sequential dependency of many adders, with many iterations needed for accurate results. For significant latency and energy reduction, Revol and Yakoubsohn [27] show that about half the iterations can be omitted by treating the problem as an ordinary differential equation with a single numerical integration step: y = e^x satisfies the ODE y' = y. They consider in software both an explicit Euler method and 4th order Runge-Kutta (RK4). RK4 involves several multipliers and is not a good energy tradeoff to avoid more iterations. The explicit Euler method step has a single multiplication:

e^x ~ E_m + E_m (x - L_m)

at the m-th terminal iteration, with the residual w_m = x - L_m used as the step size. They give a formula for the number of iterations m at a desired accuracy, ignoring truncation error; the step error is on the order of w_m^2 / 2, so roughly half the iterations of the plain recurrence suffice. Thus, for single precision we need roughly m = 13; double precision has m = 27, and quad precision has m = 57. Implementation of b^x for b != e from this requires pre-multiplication of x by a fixed point rounding of ln b, a significant energy overhead.
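A hedged sketch of the iteration-halving idea: stop the shift-and-add recurrence at iteration m, then apply the single explicit Euler step E + E*(x - L). Variable names are illustrative, and IEEE doubles stand in for the fixed point datapath.

```python
import math

def shift_add_exp_euler(x, m=13):
    """Shift-and-add for e^x with an explicit Euler step at iteration m.

    Stopping at iteration m leaves residual w = x - L; one Euler step
    E + E*w recovers roughly 2m bits of accuracy instead of m.
    """
    L, E = 0.0, 1.0
    for n in range(1, m + 1):
        c = math.log1p(2.0 ** -n)
        if L + c <= x:
            L += c
            E += E * 2.0 ** -n
    w = x - L          # residual: the step size for the Euler step
    return E + E * w   # e^x ~ E*(1 + w), step error O(w^2 / 2)

print(abs(shift_add_exp_euler(0.4) - math.exp(0.4)))  # error O(2^-2m)
```

The single multiplication E*w is the only cost added over the truncated recurrence; Section IV-B makes this multiplication approximate.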
IV-B Integration via approximate multiplication
We have residual w_m = x - L_m < 2^-(m-1) when stopping at the m-th iteration. The Euler method step multiplication E_m w_m would be a massive (1 + f_E) x f bits, with the 1 for the leading integer 1 bit of E_m. Let E'_m denote the fractional portion of E_m, so that E_m = 1 + E'_m. The step can then be expressed as:

E_m + E_m w_m = 1 + E'_m + w_m + E'_m w_m

The multiplication E'_m w_m now solely involves fractional bits, of which we only care about the MSBs produced. w_m has m - 1 zero MSBs, so there are m - 1 ignorable zero MSBs in the resulting product, yielding a smaller multiplier, still an exact step calculation. Given these zero MSBs, we only need the MSBs of the result, so we truncate both E'_m and w_m to limit the result to this size (truncation ignores carries from the multiplication of the truncated LSBs). We do this symmetrically, and since usually f_E > f, we take fractional MSBs from E'_m, with an option to remove another few bits as a tunable truncation. This may not produce enough bits to align properly with E_m, so we append zeros to the LSBs as needed to match the size of E_m. For example, at the parameter settings used later we have a 14 x 14 multiplier, of which we only need the 16 MSBs (based on alignment with E_m), and the ultimate carry from the 12 LSBs. One can consider other approximate multipliers [17], but truncation works well and provides a significant reduction in energy.
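The effect of operand truncation on the rewritten step 1 + E' + w + E'*w can be modeled in a few lines; the parameter choices here (m = 13, 14 kept bits) are illustrative, not the synthesized configuration.

```python
import math

def truncate(v, frac_bits):
    """Keep frac_bits fractional bits of v, dropping lower bits (no rounding)."""
    s = 2.0 ** frac_bits
    return math.floor(v * s) / s

def euler_step_truncated(E, w, m=13, keep=14):
    """Euler step E*(1 + w) via a truncated multiplier.

    Rewriting with fractional parts (E = 1 + Ef) gives
    E*(1 + w) = 1 + Ef + w + Ef*w, so only the small product Ef*w needs
    a multiplier. w has about m-1 leading zero fractional bits, so Ef is
    truncated to `keep` fractional bits and w to m-1+keep bits; carries
    out of the dropped LSBs are ignored, as in the hardware truncation.
    """
    Ef = E - 1.0
    return 1.0 + Ef + w + truncate(Ef, keep) * truncate(w, m - 1 + keep)

# exact vs truncated step for a typical terminal residual w ~ 2^-13
E, w = 1.37, 2.0 ** -13 * 0.7
print(euler_step_truncated(E, w) - E * (1.0 + w))  # well below the step error
```

The truncation error lands below the Euler step's own O(w^2) error, which is why it is essentially free in accuracy terms.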
Table I: near iso-accuracy synthesis results, ours versus hyperbolic CORDIC

Function | Type   | (0.5, 1] ulp err | Cycles | Area          | Energy (pJ)
exp      | CORDIC | 9.98%            | 4      | 1738          | 2.749
exp      | Ours   | 9.90%            | 2      | 407.2 (0.23x) | 0.512 (0.19x)
log      | CORDIC | 14.4%            | 4      | 2084          | 3.573
log      | Ours   | 14.8%            | 4      | 769.4 (0.37x) | 1.107 (0.31x)
IV-C Error analysis and synthesis results
The table maker's dilemma is unavoidable for transcendental functions [20]. For our single precision case, we need to evaluate e^x to at least 42 bits to provide correctly rounded results for fixed point x. In lieu of exact evaluation, we demand function monotonicity and at most 1 ulp error, and consider the occurrence of incorrectly rounded results ((0.5, 1] ulp error). Figure 1 considers error in this regime with a sweep of the iteration and truncation parameters; the configurations shown have at most 1 ulp maximum error.
Table I shows fully pipelined (iterations unrolled), near iso-accuracy synthesis results for our method and standard hyperbolic CORDIC (28 iterations and 29 fractional bit variables). All implementations have at most 1 ulp error, with the fraction at (0.5, 1] ulp error shown. We are 5.4x more energy efficient, at 0.23x the area and half the latency in cycles; as discussed earlier, most CORDIC modifications reduce latency at the expense of increased energy.
V log shift-and-add with integration
ln(x) is similar to e^x with the roles of L_n and E_n reversed, and with a division for the integration step [27]:

ln(x) ~ L_m + (x - E_m) / E_m

We restrict ourselves to x in [1, 2). The error of the integration step is again quadratic in the residual, with the target number of iterations m (ignoring truncation error) roughly half that of the plain recurrence: about m = 13 for single precision and m = 27 for double precision. Prior discussion concerning the c_n and E_n sequences and data gating carries over to this algorithm. It is also the case that the running sum L_n is not needed until the very end, so a carry-save adder postponing full evaluation of carries is appropriate. It is possible to use a redundant number system for E_n and avoid full evaluation of the comparison [23], but the required shift with add increases switching activity significantly.
V-A Integration via approximate division
We approximate the integration division by truncating the dividend and divisor. The dividend x - E_m has at least m - 1 zero fractional MSBs, and the divisor E_m is in [1, 2), so the result is a fraction that we must align with the bits in L_m for the sum. We skip known zero MSBs, and some number of the LSBs of the dividend. For the divisor E_m, we need not use the entire fractional portion but choose only some number of its fractional bits, with a leading 1 appended for the integer part of E_m, giving a small fixed point divider. A modest number of divisor bits is reasonable in our experiments. This is higher area and latency than the truncated multiplier (we only evaluated truncated division with digit recurrence), but the increase in resources of log versus exp is acceptable for linear algebra use cases (Section VIII).
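The log recurrence with its terminal division step can be sketched as follows; as before this is an illustrative floating point model (the hardware uses the fixed point formats and the truncated divider described above).

```python
import math

def shift_add_log(x, m=13):
    """Restoring shift-and-add for ln(x), x in [1, 2).

    Roles of E and L are swapped versus exp: E is grown toward x by
    factors (1 + 2^-n) while L accumulates the matching ln(1 + 2^-n)
    constants. The terminal correction is a division, (x - E) / E,
    rather than a multiplication.
    """
    L, E = 0.0, 1.0
    for n in range(1, m + 1):
        if E + E * 2.0 ** -n <= x:   # grow E toward x without overshoot
            E += E * 2.0 ** -n       # E *= (1 + 2^-n): shift and add
            L += math.log1p(2.0 ** -n)
    return L + (x - E) / E           # integration step via division

print(abs(shift_add_log(1.7) - math.log(1.7)))  # error O(2^-2m)
```

The division replaces the multiplication of the exp case, which is why the log unit is the more expensive of the two.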
V-B Error analysis and synthesis results
As before, we only consider monotonic implementations with at most 1 ulp error, and consider the frequency of incorrectly rounded results. Figure 2 shows such error occurrence versus a sweep of the truncation and iteration parameters. The divisor width has a larger accuracy effect than the dividend truncation, and a modest divisor width yields reasonable results. Table I shows near iso-accuracy synthesis results for our method and standard hyperbolic CORDIC (28 iterations and 30 fractional bit variables). Our implementation is 3.2x more energy efficient at 0.37x the area versus CORDIC, with much of the latency and energy coming from the truncated divider. The higher resource consumption of log over exp CORDIC comes from the initialization of the x and y CORDIC variables to values other than 1, with the inferred required adder lengths propagated by HLS throughout the synthesized design.
VI Approximate logarithmic arithmetic
We show how the preceding designs are used to build an arbitrarily high precision logarithmic arithmetic with some (tunably) approximate aspects.
VI-A LNS arithmetic
The sign/magnitude logarithmic number system (LNS) [30] represents a value x as a rounded fixed point representation of log_b |x| to some number of integer and fractional bits, plus a sign and zero flag. The base b is typically 2. We refer to this as a representation of x in the log domain. We refer to rounding and encoding as integer, fixed or floating point as a linear domain representation, though note that floating point is itself a combination of log and linear representations for the exponent and significand.
The benefit of LNS is simplifying multiplication, division and power/root. For log domain X = log_b |x| and Y = log_b |y|, multiplication or division of the corresponding linear domain x and y is X + Y or X - Y, the k-th power of x is kX and the k-th root of x is X/k, with sign, zero (and infinity/NaN flags if desired) handled in the obvious manner. Addition and subtraction, on the other hand, require Gaussian logarithm computation. For linear domain x, y, log domain add/sub of the corresponding X, Y is:

log_b(x + y) = X + log_b(1 + b^z)
log_b(x - y) = X + log_b(1 - b^z)

where z = Y - X. Without loss of generality, we restrict x >= y > 0, so we only consider z <= 0. These functions are usually implemented with ROM/LUT tables (possibly with interpolation) rather than direct function evaluation, ideally realized to 0.5 log domain ulp relative error. The subtraction function has a singularity at z = 0, corresponding to exact cancellation x = y, with the region very near the singularity corresponding to near-exact cancellation. Realizing this critical region to 0.5 log ulp error without massive ROMs (241 Kbits in [6]) is a motivation for subtraction cotransformation to avoid the singularity, which can reduce the requirement to at least 65 Kbits [26]. Some designs are proposed as being ROM-less [16], but in practice the switching power and leakage of the tables' combinational cells would still be huge. Interpolation with reduced table sizes can also be used, but the formulation in [31] only considers log addition without the singularity. An ultimate limit on the technique not far above float32 equivalent is still faced, as accurate versions of these tables scale exponentially with word precision [5].
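The Gaussian logarithm add/sub above can be evaluated directly in a few lines; this is a software sketch only, since hardware realizes the two functions with ROM/LUT tables as noted.

```python
import math

def lns_add(X, Y):
    """LNS addition via the Gaussian logarithm s(z) = log2(1 + 2^z).

    X, Y are log2 of two positive values with X >= Y, so z = Y - X <= 0.
    """
    z = Y - X
    return X + math.log2(1.0 + 2.0 ** z)

def lns_sub(X, Y):
    """LNS subtraction: d(z) = log2(1 - 2^z), singular as z -> 0 (x ~= y)."""
    z = Y - X
    return X + math.log2(1.0 - 2.0 ** z)

x, y = 6.0, 2.5
print(lns_add(math.log2(x), math.log2(y)))  # log2(x + y) = log2(8.5)
print(lns_sub(math.log2(x), math.log2(y)))  # log2(x - y) = log2(3.5)
```

Evaluating lns_sub for z near 0 makes the singularity, and hence the table-size problem, easy to see numerically.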
Pipelined LNS add/sub is another concern. As mentioned in Section I, Chen et al. [5] have an impractical fully pipelined implementation. Coleman et al. [7] have add/sub taking 3 cycles to complete, but chose to duplicate rather than pipeline the unit, and mention that the latency is dominated by memory (ROM) access. Arnold [2] provides a fully pipelined add/sub unit, but with a “quick” instruction version that allows the instruction to complete in either 4 or 6 cycles if it avoids the subtraction critical region. On the other hand, uniformity may increase latency, as different pipe stages are restricted to different ROM segments.
When combining an efficient LNS multiply with the penalty of addition for linear algebra, recent work by Popoff et al. [26] shows an energy penalty of 1.84x over IEEE 754 float32 (using a naive sum of add and mul energies), a 4.5x area penalty for the entire LNS ALU, and mentions 25% reduced performance for linear algebra kernels such as GEMM. Good LNS use cases likely remain workloads with high multiply-to-add ratios.
VI-B ELMA/FLMA logarithmic arithmetic
The ELMA (exact log-linear multiply-add) technique [18] is a logarithmic arithmetic that avoids Gaussian logarithms. It was shown that an 8-bit ELMA implementation with extended dynamic range from posit-type encodings [13] is more energy efficient in 28 nm CMOS than 8/32-bit integer multiply-add (as used in neural network accelerators). It achieved similar accuracy as integer quantization on ResNet-50 CNN [14] inference on the ImageNet validation set [28], simply with float32 parameters converted via round-to-nearest only and all arithmetic in the ELMA form. Significant energy efficiency gains over IEEE 754 float16 multiply-add were also shown, though much higher precision was then impractical.

We describe ELMA and its extension to FLMA (floating point log-linear multiply-add). In ELMA, mul/div/root/power is in log domain, while add/sub is in linear domain with fixed point arithmetic. Let an exp conversion take log domain (with integer and fractional log bits) to linear domain, and a log conversion take linear domain to log domain. Both are approximate conversions (LNS values are irrational). The exp conversion produces fixed point (ELMA) or floating point (FLMA); in base-2 FLMA, the integer and fractional log bits yield a linear domain floating point exponent and significand. The conversion can increase precision by additional fractional bits, with the exponential evaluated to that many fractional bits. Unique conversion for base 2 requires extra linear domain fractional bits, as the minimum derivative of 2^x on the significand range, ln 2, is less than 1.
FLMA approximates the linear domain sum of two log domain values X and Y as log(exp(X) + exp(Y)). The exp conversion uses the floating point exponent as the log domain integer portion and evaluates the exponential on the significand; the log conversion takes the result back to the required log domain fractional bits. The fixed or floating point accumulator can use a different fractional precision than the conversions, in which case the log conversion can consider the linear domain MSB fractional bits of the accumulator with rounding for the reverse conversion. The log conversion is similarly unique only with sufficient fractional bits. As the conversion precisions increase, we converge to exact LNS add/sub. As with LNS, if add/sub is the only operation, ELMA/FLMA does not make sense. It is tailored for linear algebra sums-of-products; conversion errors are likely to be uncorrelated in use cases of interest (Sections VII-B and VII-C), and it is substantially more efficient than floating point at a 1:1 multiply-to-add ratio (Section VIII).
Unlike LNS, an ELMA design (and FLMA, depending upon floating point adder latency) can be easily pipelined and accept a new summand every cycle for accumulation without resource duplication (e.g., LNS ROMs). Furthermore, accumulator precision can be (much) greater than the log domain precision; in LNS this requires increasing Gaussian logarithm precision to the accumulator width. These properties make ELMA/FLMA excellent for inner product, where many sums of differing magnitudes may be accumulated. FLMA is related to [19], except that architecture is oriented around a linear domain floating point representation such that all mul/div/root is done with a log conversion to LNS, the log domain operation, and an exp conversion back to linear domain. Their log/exp conversions were further approximated with linear interpolation. Every mul/div/root operation thus included the error introduced by both conversions.
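The log-domain-multiply, linear-domain-accumulate flow can be sketched end to end; the precisions f_log and f_lin here are illustrative, and Python floats stand in for the exp conversion and the fixed point accumulator.

```python
import math

def elma_dot(xs, ys, f_log=10, f_lin=12):
    """ELMA-style inner product sketch (positive inputs, base 2).

    Multiplies in the log domain (fixed point add of rounded log2
    values), converts each product to the linear domain with a 2^x
    evaluation, and accumulates in fixed point.
    f_log: fractional bits of the log domain encoding;
    f_lin: fractional bits of the linear domain significand.
    """
    def to_log(v):                       # round-to-nearest log2 encoding
        return round(math.log2(v) * 2 ** f_log) / 2 ** f_log

    acc = 0.0
    for x, y in zip(xs, ys):
        p = to_log(x) + to_log(y)        # exact log domain multiply
        e = math.floor(p)                # integer part -> float exponent
        sig = round(2.0 ** (p - e) * 2 ** f_lin)  # exp conversion, rounded
        acc += sig * 2.0 ** (e - f_lin)  # accumulate in the linear domain
    return acc

xs, ys = [1.5, 2.25, 3.0], [0.75, 1.1, 0.4]
exact = sum(x * y for x, y in zip(xs, ys))
print(abs(elma_dot(xs, ys) - exact) / exact)  # small relative error
```

Note that only one log conversion of the final accumulator value is needed to return to the log domain, which is the amortization the text relies on.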
VI-C Dual-base logarithmic arithmetic
ELMA/FLMA requires accurate calculation of the fractional portions of the exp and log conversions. Section IV shows that e^x and ln(x) can be calculated more accurately for the same resources than 2^x and log_2(x), which require an additional multiplication by a rounding of ln 2. While Gaussian logarithms can be computed irrespective of base, FLMA requires an accessible base-2 exponent to carry over as a floating point exponent. A base-e representation does not easily yield this.
An alternative is a variation of multiple base arithmetic by Dimitrov et al. [9], allowing for more than one base (one of which is usually 2 and the others are any positive real number), with exponents as small integers. We instead use a representation of the form 2^i e^f with sign (or zero), with i an integer (encoded in some number of bits) and f in [0, 1) (encoded as a fixed point fraction). e^f when evaluated yields a FLMA floating point significand, which we will refer to as the Euler significand. The product of any two of these values has i = i_1 + i_2 and f = f_1 + f_2 in [0, 2). For division, f = f_1 - f_2 in (-1, 1). We no longer have a unique representation when we do not limit the base-e exponent to [0, ln 2); for example, 2^1 e^0 = 2^0 e^(ln 2).

We call a base-e exponent in the range [0, ln 2) a normalized Euler significand; e^f then lies in [1, 2). Normalization subtracts (or adds) ln 2 from the base-e exponent and increments (or decrements) the base-2 exponent as necessary to obtain a normalized significand. There are two immediate downsides to this. First, we do not use the full encoding range; our base-e exponent is encoded as a fixed point fraction, but we only use ln 2, or 69.3%, of the values. Encoding a precision/dynamic range tradeoff with the unused portion as in [13] could be considered. The second downside is considered in Section VII-A.
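The normalization rule can be written down directly; this is a behavioral sketch with real-valued exponents, whereas hardware would add or subtract a fixed point rounding of ln 2.

```python
import math

LN2 = math.log(2.0)

def normalize(i, f):
    """Normalize a dual-base value 2^i * e^f so that f is in [0, ln 2).

    After a multiply, f = f1 + f2 may land in [0, 2); each subtraction
    of ln 2 from the base-e exponent is paid back by incrementing the
    base-2 exponent, since e^(ln 2) = 2. Division can drive f negative,
    handled symmetrically.
    """
    while f >= LN2:
        f -= LN2
        i += 1
    while f < 0.0:
        f += LN2
        i -= 1
    return i, f

# product of 2^0 e^0.6 and 2^1 e^0.5: exponents add, then normalize
i, f = normalize(0 + 1, 0.6 + 0.5)
```

With exponents limited to [0, 2) after a single multiply, at most one subtraction of ln 2 is ever needed, so the hardware loop degenerates to one conditional add.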
VII FLMA analysis
We investigate dual-base arithmetic with (8, 23) FLMA parameters (roughly IEEE 754 binary32 equivalent without subnormals), with exp, log and accumulator precisions as discussed above. For relative error, we use units in the last place of the fractional log domain representation, which we call log ulp. For instance, 5 (base-e) log ulps separate the values with base-e exponents b0.0110 and b0.1011, where b0.0110 is the binary fixed point fraction 0.375.
VII-A Multiply/divide accuracy
LNS and single base FLMA have 0 log ulp mul/div error, but dual-base FLMA can produce a non-normalized significand, requiring add/sub of a rounding of ln 2 for normalization, introducing slight error (about 0.016 log ulp in our configuration). The extended exp algorithm can avoid this for multiply-add with an additional iteration and integer bits for L_n and E_n, since the non-normalized base-e exponent lies in [0, 2). We would still require additional normalization to a floating point significand in [1, 2). The dropped bit is kept by enlarging the accumulator, or is rounded away. Normalization is still required if more than two successive mul/div operations are performed.
VII-B Add/subtract accuracy
Given a sum where both operands are the same sign (i.e., not strict subtraction), the error is bounded by twice the maximum exp conversion error, plus maximum floating point addition error and maximum log conversion error. In practice the worst case error is hard to determine without exhaustive search. Limiting ourselves to a restricted range of values, we evaluate log domain FLMA addition for all values in the range against a choice of 64 random second operands versus correct rounding in Figure 3. With increased conversion precision there are exponentially fewer incorrectly rounded sums, but the table maker's dilemma is a limiting factor. At the highest precision considered, about 0.0005% of these sums remain incorrectly rounded to max 0.5 log ulp.
For subtraction, catastrophic cancellation (a motivation for LNS cotransformation) still realizes itself. As with LNS, there is also a means of correction. While the issue appears with pairs of values very close in magnitude, consider linear domain , and evaluate with FLMA subtraction:
The base exponent of here is 1 ulp below rounded to , and is thus our next lowest representable value from . With FLMA subtraction at :
Then back to log domain at :
If the calculation were done to 0.5 log ulp error, we get:
or an absolute error between the two of , but off by 135,111 log ulp (distance from the bit rounding of ). In floating point, the rounded result would have error 0.5 ulp. However, as (almost) all of our log domain values have a linear domain infinite fractional expansion, in near cancellation with a limited number of bits, FLMA misses the extended expansion of the subtraction residual.
If reducing relative error is a concern, we can increase the linear domain fractional precision of the log-to-linear conversion. This provides more of the linear domain infinite fractional expansion, reducing relative error to under a log ulp almost everywhere if necessary (Figure 4). Absolute error remains bounded throughout the cancellation regime. We are not increasing the log precision, but increasing the distinction in the linear domain between the converted operands. The accumulator can maintain a reduced precision, with any remainder accumulator bits rounded off.
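The effect of extra linear domain fractional bits on near-cancellation can be demonstrated directly; the function name and precisions below are illustrative, modeling the exp conversion with a rounded fixed point 2^x.

```python
import math

def log_sub_via_linear(X, Y, f_lin):
    """FLMA-style subtraction: convert log2 values to fixed point with
    f_lin fractional bits, subtract in the linear domain, convert back.
    Returns None on exact cancellation of the fixed point values.
    """
    xl = round(2.0 ** X * 2 ** f_lin)   # exp conversion with rounding
    yl = round(2.0 ** Y * 2 ** f_lin)
    d = xl - yl
    return math.log2(d / 2 ** f_lin) if d > 0 else None

# near cancellation: two log domain values one ulp apart (8 fractional bits)
X, Y = 1.0 + 1.0 / 256, 1.0
exact = math.log2(2.0 ** X - 2.0 ** Y)
for f_lin in (12, 24, 36):
    print(f_lin, abs(log_sub_via_linear(X, Y, f_lin) - exact))
```

The printed error drops as f_lin grows: the subtraction residual picks up more of the infinite fractional expansion, exactly the mechanism the text describes.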
VII-C Multiple sum and inner product accuracy
Many processes seen in ML and computer vision yield quasi-normally distributed data. Consider sums of k independent variates N(0, 1), which is N(0, k). The likelihood of a specific sum lying in a critical result region near zero is given by the PDF, which shrinks as k grows. The chance of overall catastrophic cancellation is thus frequently reduced, so we could use a smaller linear domain precision for efficiency. Intermediate sums could be subject to cancellation issues, but barring degenerate cases (each pair of successive sums nearly cancel), this is unlikely to matter in practice.

For inner product, while the product of normal distributions is not normal, there is a similar diminishing cancellation region behavior with sums of independent product normal distributions. In practice (Figure 5), short sums of products have higher error with greater variance versus floating point, but the conversion and summation errors cancel for larger sums, as the additional conversion and accumulator width, combined with lower multiplication error versus floating point (only non-normalized products have error, typically much less than 0.5 ulp), result in greater average accuracy versus floating point fused multiply-add (FMA).

VIII Arithmetic synthesis
We compare 7 nm area, latency and energy against IEEE 754 floating point without subnormal support. A throughput of t refers to a module accepting a new operation every t clock cycles (t = 1 is fully pipelined), while latency is cycles to first result, or pipeline length. Table II shows basic arithmetic operations with FLMA parameters the same as Section VII. Note that the general LNS pattern of multiply energy being significantly lower but add/sub significantly higher still holds. Add/sub are two-operand, so this implementation includes two exp and one log converters, and none will be actively gated in a fully utilized pipeline (they are all constantly switching). A naive sum of multiply with add energy leads to a higher result as compared to floating point. However, as mentioned earlier, it is easier to efficiently pipeline FLMA add/sub compared to LNS add/sub.
The situation changes when we consider a multiply-accumulate, perhaps the most important primitive for linear algebra. Table III shows FLMA modules for 128-dim vector inner product with a throughput 1 inner loop, comparing against floating point FMA. The float64 comparison is against an FLMA configuration with correspondingly enlarged exp, log and accumulator precisions. The benefit of the FLMA design can be seen in this case; log domain multiplication, exp conversion and floating point add is much lower energy than a floating point FMA. As with LNS or FLMA addition, a single multiply-add with a log domain result would be inefficient, but in running sum cases (multiply-accumulate), the overhead is deferred and amortized over all work, and this conversion (unlike the inner loop) need not be fully pipelined. Using a combinational MCP for this with data gating when inactive saves power and area, at the computational cost of 2 additional cycles of throughput. Increased accumulator precision (independent of the log precision) is also possible at minimal computational cost, as this only affects the floating point adder.
Table II: basic arithmetic operations

Type                                    | Latency | Area          | Energy/op (pJ)
float32 add/sub                         | 1       | 138.4         | 0.274
FLMA add/sub                            | 7       | 1577 (11.4x)  | 1.768 (6.45x)
float32 mul                             | 1       | 248.4         | 0.802
FLMA mul                                | 1       | 40.2 (0.16x)  | 0.080 (0.10x)
float32 FMA                             | 1       | 481.2         | 1.443
FLMA mul-add core (no final log conv.)  | 3       | 706.5 (1.47x) | 0.586 (0.41x)
Table III: 128-dim vector inner product

Type           | Throughput | Area         | Energy/op (pJ)
float32 FMA    | 130        | 591.0        | 1.542
(8, 23) FLMA   | 135        | 1271 (2.15x) | 0.668 (0.43x)
float64 FMA    | 131        | 1787.3       | 5.032
(11, 52) FLMA  | 144        | 6651 (3.72x) | 1.104 (0.22x)
IX Conclusion
Modern applications of computer vision, graphics (Figure 6) and machine learning often need energy efficient high precision arithmetic in hardware. We present a novel dual-base logarithmic arithmetic applicable to linear algebra kernels found in these applications. It is built on efficient implementations of e^x and ln(x), useful in their own right, leveraging numerical integration with truncated mul/div. While the arithmetic is approximate and without strong guarantees on relative error, unlike LNS or floating point arithmetic, it retains moderate to low relative error and low absolute error, is extendable to arbitrary precision and easily pipelined, providing an alternative to high precision floating or fixed point arithmetic when aggressive quantization is impractical.
Acknowledgments We thank Synopsys for their permission to publish results on our research obtained by using their tools with a popular 7 nm semiconductor technology node.
References
 [1] (2003) CMOS VLSI implementation of a low-power logarithmic converter. IEEE Trans. Comput. 52 (11), pp. 1421–1433.
 [2] (2003) A VLIW architecture for logarithmic arithmetic. In Euromicro Symposium on Digital System Design, pp. 294–302.
 [3] (1994) BKM: a new hardware algorithm for complex elementary functions. IEEE Trans. Comput. 43 (8), pp. 955–963.
 [4] (1975) Parallel multiplicative algorithms for some elementary functions. IEEE Trans. Comput. 24 (3), pp. 322–325.
 [5] (2000) Pipelined computation of very large word-length LNS addition/subtraction with polynomial hardware cost. IEEE Trans. Comput. 49 (7), pp. 716–726.
 [6] (2016) LNS with co-transformation competes with floating-point. IEEE Trans. Comput. 65 (1), pp. 136–146.
 [7] (2008) The European logarithmic microprocessor. IEEE Trans. Comput. 57 (4), pp. 532–546.
 [8] (1970) A class of algorithms for automatic evaluation of certain elementary functions in a binary computer. Ph.D. Thesis, University of Illinois at Urbana-Champaign, Champaign, IL, USA.
 [9] (1999) Theory and applications of the double-base number system. IEEE Trans. Comput. 48 (10), pp. 1098–1106.
 [10] (1993) The CORDIC algorithm: new results for fast VLSI implementation. IEEE Trans. Comput. 42 (2), pp. 168–178.
 [11] (1993) Very high radix division with selection by rounding and prescaling. In Proceedings of the IEEE 11th Symposium on Computer Arithmetic, pp. 112–119.
 [12] (1996) Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits 31 (9), pp. 1277–1284.
 [13] (2017) Beating floating point at its own game: posit arithmetic. Supercomputing Frontiers and Innovations 4 (2), pp. 71–86.
 [14] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
 [15] (2014) 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14.
 [16] (2011) ROM-less LNS. In 2011 IEEE 20th Symposium on Computer Arithmetic, pp. 43–51.
 [17] (2016) A comparative evaluation of approximate multipliers. In 2016 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), pp. 191–196.
 [18] (2018) Rethinking floating point for deep learning. NeurIPS Workshop on Systems for ML, arXiv:1811.01721.
 [19] (1991) A hybrid number system processor with geometric and complex arithmetic capabilities. IEEE Trans. Comput. 40 (8), pp. 952–962.
 [20] (1998) Toward correctly rounded transcendentals. IEEE Trans. Comput. 47 (11), pp. 1235–1243.
 [21] (1994) Interleaved memory function interpolators with application to an accurate LNS arithmetic unit. IEEE Trans. Comput. 43 (8), pp. 974–982.
 [22] (1962) Computer multiplication and division using binary logarithms. IRE Transactions on Electronic Computers EC-11 (4), pp. 512–517.
 [23] (1985) Discrete basis and computation of elementary functions. IEEE Trans. Comput. 34 (9), pp. 857–862.
 [24] (2017) Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pp. 6338–6347.
 [25] (2005) High-radix logarithm with selection by rounding: algorithm and implementation. J. VLSI Signal Process. Syst. 40 (1), pp. 109–123.
 [26] (2016) High-efficiency logarithmic number unit design based on an improved co-transformation scheme. In Proceedings of the 2016 Conference on Design, Automation & Test in Europe (DATE '16), pp. 1387–1392.
 [27] (2000) Accelerated shift-and-add algorithms. Reliable Computing 6 (2), pp. 193–205.
 [28] (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
 [29] (1994) Hardware designs for exactly rounded elementary functions. IEEE Trans. Comput. 43 (8), pp. 964–973.
 [30] (1975) The sign/logarithm number system. IEEE Trans. Comput. 100 (12), pp. 1238–1242.
 [31] (1983) An extended precision logarithmic number system. IEEE Transactions on Acoustics, Speech, and Signal Processing 31 (1), pp. 232–234.
 [32] (2012) Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, pp. 1131–1136.
 [33] (1992) Shape and motion from image streams under orthography: a factorization method. Int. J. Comput. Vision 9 (2), pp. 137–154.
 [34] (2010) Conservation cores: reducing the energy of mature computations. In Proceedings of the Fifteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XV), pp. 205–218.
 [35] (1959) The CORDIC computing technique. In Papers Presented at the March 3–5, 1959, Western Joint Computer Conference, pp. 257–261.
 [36] (2005) Low- and ultra low-power arithmetic units: design and comparison. In 2005 International Conference on Computer Design, pp. 249–252.
 [37] (1989) More iteration space tiling. In Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, pp. 655–664.