High-precision arithmetic is necessary for several scientific and engineering applications, such as Monte Carlo simulations, Kalman filters, and circuit simulations (Kapre and DeHon, 2012; Liao et al., 2019). These applications require high-precision arithmetic hardware for efficient execution (Chow et al., 2012). Floating-point arithmetic hardware is challenging to design due to the intricacies involved in staying within the desired area and power budget (Leeser et al., 2014; Volkova et al., 2019). In particular, designing a portable floating-point unit (FPU) is a complex task due to the precise requirements posed by the IEEE 754-2008 floating-point standard. Recently, researchers and computer architects have either compromised on compliance with the standard or devised their own formats to overcome the design challenges (Gustafson and Yonemoto, 2017; Burgess et al., 2019). Posit is one such data representation, proposed by John L. Gustafson in 2017, which aims to overcome shortcomings of the floating-point format (Gustafson and Yonemoto, 2017).
The posit data representation and arithmetic have several clear advantages over the floating-point format and arithmetic: simpler hardware, smaller area and energy footprints, and higher dynamic range and numerical accuracy. In general, an n-bit posit has a better dynamic range than an n-bit float (Guntoro et al., 2020). In the past, researchers have shown that n-bit floating-point arithmetic units can be replaced by m-bit posit arithmetic units with m < n. It has also been shown empirically that the replacement does not cause loss of accuracy, yet improves the area and energy footprints (Chaurasiya et al., 2018). The posit representation is a superset of the floating-point format and can serve as a drop-in replacement for floating-point arithmetic.
Due to these advantages of the posit number system, several academic and industrial research labs have started exploring and studying applications that can benefit from posits. The SoftPosit library supports early-stage investigation of posits for different applications in software (Leong, 2018). However, no such framework exists for hardware exploration; there is a dire need for an easily reconfigurable hardware platform for early-stage design-space exploration of posit arithmetic for various applications. With its ever-increasing popularity and a conducive open-source ecosystem, we believe that RISC-V (1) is an excellent vehicle for a quintessential framework supporting posit arithmetic empiricism. We chose the BSV high-level HDL (Bluespec Inc., 2020a) as the implementation language to enable rapid design-space exploration through easy reconfiguration of the hardware platform. The position of the proposed framework, called Clarinet, in the platform design cycle is delineated in Fig. 1, along with the posit arithmetic core, Melodica. The major contributions of this paper are:
We present Clarinet, a floating-point-arithmetic-enabled, CPU-based framework for posit arithmetic empiricism. Clarinet is based on the RISC-V ISA (with custom instructions for posit arithmetic), and is derived from the open-source Flute core developed by Bluespec Inc. (Bluespec Inc., 2020b). The Clarinet framework also features a customized RISC-V gcc tool-chain to support the new instructions.
We present Melodica, a reconfigurable posit arithmetic core that supports fused-multiply-accumulate (FMA) with quire functionality, and type-converters between floating-point, posit and quire data representations.
Through Clarinet, we also present a new usage model where posits and floating-point can coexist as independent types cleanly, allowing applications to be ported more easily to posits when they offer an advantage.
Finally, we investigate applications in the domains of linear algebra and computer vision to show the effectiveness of Clarinet as an experimental platform. For five different applications, we demonstrate that Clarinet supports trade-off analyses between performance, power, area, and accuracy. We also outline the ease-of-use aspects of Clarinet.
We prefer Melodica as an add-on feature in Flute rather than a replacement for the floating-point arithmetic hardware. With Clarinet, we aim to enable researchers to study the advantages and disadvantages of posit arithmetic. Posit arithmetic empiricism, that is, reasoning based on empirical data for posit arithmetic, is needed to quantify the benefits. Furthermore, we see an opportunity for floats and posits to coexist on a single platform to trade off power, performance, area, and accuracy.
As of now, we support a limited number of operations and the quire, enough to carry out experimental studies for our applications. We plan to extend Melodica with more functionality; the core is extensible to support operations demanded by applications.
To the best of our knowledge, this is the first-ever quire-enabled RISC-V CPU. The organization of the paper is as follows. In Section 2, we discuss the posit, quire and float formats, the Flute core, and some of the recent implementations of posit arithmetic. Clarinet is described in Section 3, and Melodica in Section 4. Application analyses and benchmarking are presented in Section 5. Experimental setup and results are discussed in Section 6. We conclude our work in Section 7.
2. Background and Related Work
A posit number is defined by two parameters: the width of the posit number, N, and the maximum width of the exponent field, es. One important advantage of the posit format is that es can be varied to trade off between greater dynamic range (larger es) and greater precision (smaller es).
The posit format has four fields: a sign bit indicating positive or negative numbers, a regime and exponent field that together represent the scale, and finally, a fraction.
Sign (s): The MSB of the number. If the bit is set, the posit value is negative. In this case all remaining fields are represented in two’s complement notation.
Regime Field (r): The regime is used to compute the scale factor, k. In a posit number, this field starts just after the sign bit and is terminated by a bit opposite to its leading bits. k is computed as per equation 1, where r is the number of identical leading bits in the regime (k = r - 1 when the leading bits are 1, and k = -r when they are 0).
Exponent Field (exp): The exponent begins after the regime field and the maximum width of the exponent field is es.
Fraction Field (f): The remaining number of bits after the exponent make up the fraction. The fractional field is preceded by an implied hidden bit which is always 1.
For a number represented in the posit format, its value is as per the equation 2.
Posits do not have a representation for NaNs, or separate representations for +∞ and −∞. Posits recognize only two special cases, zero and not-a-real (NaR), and support one rounding mode, round-to-nearest-even (RNE). The posit number system shows better accuracy around 1.0 than floating-point of the same size (de Dinechin et al., 2019). Table 1 summarizes the different bit representations with posits, using 8-bit posits as an example. The posit and floating-point formats are depicted in Fig. 2.
The quire is a fixed-point register that serves the purpose of accumulation, like a Kulisch accumulator (Kulisch, 2002). The quire for a given posit width is sized to represent both the smallest posit squared and the largest posit squared without any overflow. When the quire is used as an accumulator over a series of steps, it allows computation without intermediate rounding. The size of an N-bit quire is determined by qw = N²/2, where N is the posit width (Fig. 2).
Numerical examples of pi calculation and dot product are shown in Fig. 3. For the pi calculation, the 32-bit quire (q32) converges better than the 32-bit posit (p32) or 32-bit floating-point (f32). In all our experiments, we use 64-bit floating-point as the reference. At iteration 11 there is a dramatic increase in normalized error (compared to 64-bit floating-point) for p32 and f32, but only a marginal increase for q32.
For the dot product, we use randomly generated vectors in two value ranges, chosen so that the numbers are representable in all the formats. q32 outperforms p32, f32 and q24. For dot products of 10000-element vectors, the absolute error is observed to be 9.9534475E-08 for f32, 1.0127508E-08 for p32, 8.14790212E-07 for q24, and 2.676927E-09 for q32. The loss of precision in q24 is close to only one digit, while the total bit-width is reduced by 8 bits compared to f32. Detailed dot-product experiments are presented in Section 5.2. These preliminary experiments on pi calculation and dot product outline the superiority of the quire over 32-bit floating-point arithmetic.
2.1.3. Flute - A RISC-V CPU
Flute is an in-order, open-source CPU based on the RISC-V ISA, implemented in BSV HL-HDL. The Flute pipeline is nominally five stages, but longer for instructions like memory loads and stores, integer multiplies, or floating-point operations. The core is parameterized, can be configured for 32-bit or 64-bit operation, and supports the RV64GC variant of the RISC-V ISA (1). Flute also supports a memory management unit (MMU) and is capable of booting the Linux operating system. The pipeline stages in Flute are:
Fetch: issues fetch requests to the instruction memory. The fetch stage can also handle compressed instructions.
Decode: decodes the fetched instruction and checks for illegal instructions.
Execute (E1): the first execution stage. Reads the register files or accepts forwarded values from earlier instructions, executes all single-cycle opcodes meant for the integer ALU, resolves branches, and discards speculative instructions.
Execute (E2): executes multi-cycle operations, including floating-point operations. Multi-cycle operations are dispatched to their individual pipelines from this stage. If the instruction was already executed in E1, this stage is just a pass-through.
Write-back: collects responses from the various multi-cycle pipelines, handles exceptions and asynchronous events like interrupts, and commits the instruction.
2.2. Related work
Since the inception of the posit data representation and arithmetic, there have been several implementations of posit arithmetic in the literature. Early open-source hardware implementations of posit adders and multipliers were presented in (Jaiswal and So, 2018) and (Jaiswal and So, 2018). In (Jaiswal and So, 2018), the authors covered the design of a parametric adder/subtractor, while in (Jaiswal and So, 2018), the authors presented parametric designs of float-to-posit and posit-to-float converters and a multiplier, along with the design of an adder/subtractor. The PACoGen open-source framework, which can generate a pipelined adder/subtractor, multiplier, and divider, is presented in (Jaiswal and So, 2019). PACoGen can generate hardware units that adapt precision at run-time. A more reliable implementation of a parametric posit adder and multiplier generator is presented in (Chaurasiya et al., 2018). A major drawback of that generator is that it is a non-pipelined design, resulting in low operating frequencies for large bit-width adders and multipliers.
Cheetah, presented in (Langroudi et al., 2019), discusses the training of deep neural networks (DNNs) using posits. We believe that the architecture presented in (Langroudi et al., 2019) is promising, and some of its features can be incorporated into Melodica in the future.
Apart from the mentioned efforts, there have been several other implementations of posit hardware units (Zhang et al., 2019)(Lu et al., 2019). More recently, (Tiwari et al., 2019) integrated a posit numeric unit as a functional unit with the Shakti C-Class RISC-V processor. The implementation does not support quire and reuses the floating-point infrastructure (including register file) to implement posit arithmetic. This limits the system to using 32-bit or 64-bit posits.
Unfortunately, none of the previous efforts are directed toward the consolidation of posit research. Further, they do not include an easy-to-use software framework which allows floating-point and posit types to cohabit in an application cleanly. We see here a need and an opportunity to consolidate the research in the domain of computer arithmetic by providing an open-source test-bed, Clarinet. We also address the need for a software framework by introducing a programming model that allows floating-point and posit types to coexist as independent types in an application.
The system comprises two main components: Melodica, a parameterizable posit numeric unit that implements the quire, described in Section 4, and Clarinet, a RISC-V CPU that is enhanced with special instructions for posit arithmetic and a dedicated posit register file (PRF).
3.1. Clarinet organization
Clarinet integrates Melodica as a functional execution unit parallel to the existing floating-point unit. A new module hierarchy, F-pipe, encapsulates both the existing floating-point core and the new Melodica core. A thin layer of logic in F-pipe directs the five new instructions to Melodica, while all other floating-point instructions continue to be serviced by the FPU. F-pipe also routes responses from Melodica back to the Clarinet pipeline. Except for instructions that update the quire, all instructions produce outputs from Melodica destined for the FPR, PRF or CSR RF.
3.2. Custom Instructions
In order to use the integrated Melodica execution unit, we added five new instructions to the existing instruction set implemented in Flute. As shown in their bit representations in Fig. 4(b), all the instructions belong to the R-format of the RISC-V ISA. All five instructions use the FP-OP value defined in (1) as their seven-bit opcode. To handle posit types, a new binary encoding (10) was introduced for the fmt field; in R-format instructions, these bits occupy the LSBs of the funct7 instruction field. New Rs2 binary encodings were also introduced for the posit (10000) and quire (10001) types.
FMA.P: Multiplies the two posit operands in the PRF at Rs1 and Rs2, and accumulates the result into the quire. Does not update FCSR.FFLAGS.
FCVT.S.P: Converts the posit value in PRF at Rs1 to a floating-point value which is written to the FPR at Rd. This instruction may update FCSR.FFLAGS.
FCVT.P.S: Converts the floating-point value in the FPR at Rs1 to a posit value which is written to the PRF at Rd. This instruction may update FCSR.FFLAGS.
FCVT.R.P: Converts the posit value in the PRF at Rs1 to a quire value which is written to the quire. Does not update FCSR.FFLAGS.
FCVT.P.R: Converts the value in the quire to a posit value which is written to the PRF at Rd. This instruction may update FCSR.FFLAGS.
The decision to add new instructions instead of reusing existing opcodes belonging to the F subset of the RISC-V ISA was driven by two requirements – integrating quire functionality (which does not exist in floating-point), and type-converter instructions that would allow posits and floating-point to coexist in an application as independent types.
The new type-converter instructions allow existing programs to run on Clarinet without modifying their original data segments, as demonstrated in Section 5.1. From our experiments illustrated in Fig. 3, we realised that applications can see significant reductions in normalized error through the introduction of quire-based accumulation, even when most of the computation remains in floating-point. When an application can benefit from the use of posits (be it greater dynamic range or accuracy), the type-converter instructions allow the user to convert a part of the computation to posits and accumulate into the quire register. To do so, the user first converts the intermediate floating-point data to posits using the type-converter instructions, then executes the FMA.P instruction, which accumulates into the quire. Eventually, the results are converted back to the floating-point format before being written out to memory.
3.3. Integrating the quire
As indicated in Fig. 2, the recommended size of the quire grows very rapidly with increasing posit width. This implies that treating the quire register like an entry in one of the register files would be quite expensive in hardware resources. For instance, using 32-bit posits would mean making a 512-bit quire value available on the forwarding paths and from the register files. Further, providing a path from quire to memory (via modified load and store instructions) would require extensive modifications to the memory pipeline.
Clarinet takes a novel approach to integrating the quire. The quire can be updated directly using the new instructions (FCVT.R.P and FMA.P). However, in order to save hardware resources, there are no instructions to directly access the quire or read and write the quire to memory. To read the quire’s value it has to be first converted to a posit type using the instruction FCVT.P.R which would bring the converted value into the PRF. These decisions allow us to contain the cost of integrating the quire to just the actual storage for the quire register.
3.4. The posit register file
A key advantage of posits (especially with the quire) is that non-standard posit widths can be profitable while still retaining most of the precision advantages of operating with posits and the quire. To this end, we introduced a new PRF into Clarinet, sized to the width of the posit variables handled by the Melodica core. While it would have been possible to reuse the floating-point register file for posit operations, this would not have permitted applications to benefit from narrower posit widths. The registers in the PRF may only be accessed by instructions that directly take posits as inputs or produce a posit output. A new register file implies the creation of a new bypass path to forward in-flight posit operands from the output of the F-pipe to the input of E1. This new path is marked as pbypass in Fig. 4(a) and handles only posit values.
Melodica is a posit arithmetic unit implemented in BSV HL-HDL. Melodica accepts three high-level parameters: the posit width (N), the maximum width of the exponent field (es), and the float width (FW). Melodica supports any float input size, but for Clarinet the float width is set to 32. For an N-bit Melodica, a quire of width qw is integrated with the operation pipelines as a special-purpose register; depending on N, the quire width may not be a multiple of a byte. Melodica delivers accumulator functionality using posit fused-multiply-accumulate (FMA) into the quire, and is meant to be used alongside a single-precision floating-point implementation for all other compute operations. In addition to the FMA computation, Melodica implements a complete set of type-converters between the floating-point and posit formats, and between the quire and posit formats.
Melodica’s organization is illustrated in Fig. 5. There are three computational steps involved in Melodica’s operation: i) extract: interpret the posit operands to extract the sign, regime, exponent, fraction fields and infinite/zero flag, ii) operate: perform the appropriate mathematical operation using one or more of the extracted posits or float operand, and iii) normalize: convert the output posit-fields back into an N-bit posit word.
The extractor unpacks a posit operand into sign, scale and fraction bit fields, essentially converting from a format with variable-width fields to one with fixed-width fields. This conversion is essential for the subsequent pipelines to compute efficiently on posit fields. The scaling factor, scale, is determined from the r and exp fields as given by equation 3, where the maximum posit scale width is psw and the maximum posit fraction width is pfw.
Extraction operates on the N-bit input posit word and generates four outputs, as illustrated in Fig. 6(a). In Fig. 6 and Fig. 7, detection is denoted by the det block. The steps involved are: i) check for the special cases 0 and NaR to determine the zero-infinity flag (zif); if the sign of the posit number is negative, take the two's complement of the remaining N-1 bits, ii) compute k using equation 1. The r field generally ends with a flipped bit, but when the number of exponent and fraction bits is zero, there may be no flipped bit, iii) determine the value of the exponent. The exp field may have up to es bits; we multiplex between the case where its size is exactly es and the case where it is variable (less than es). In the latter case, the exp field continues until the end of the posit, iv) calculate scale using equation 3, and form the f field from the remaining bits (if any) after the exp.
The normalization, illustrated in Fig. 6(b), is the reverse of extraction. It constructs a posit value from the constituent fields available after computation on the operands. There may be a loss of accuracy due to the rounding of fraction bits. The four steps involved in normalization are: i) computation of the k and exp bits from the scale value based on equation 3, ii) construction and concatenation of the r and exp fields, where the regime bits are generated as a run of 0s or 1s based on k, iii) shifting of the f field past the r and exp fields; the concatenated value is rounded to nearest-even depending on the f bits truncated in the previous stage and the truncation flag (tf), iv) a check for the special cases (zero and NaR). If the sign bit is set, the final value is the two's complement of the remaining N-1 bits.
This stage in Melodica performs computations on the input operands. The particular operation performed is based on the opcode dispatched to Melodica. The five operations that are supported by Melodica are divided into two categories – type converters and compute.
Compute Operation – Fused-Multiply-Accumulate (FMA)
The FMA, illustrated in Fig. 7(a), computes the product of two input posit numbers and adds the result to the quire. Using the quire as an accumulator preserves the overflow and underflow bits without the need to round intermediate results.
The FMA is performed as follows: i) the hidden bit is prepended to the input f fields depending on the posit value, and the corner cases (0 and NaR) are checked, ii) the f fields and signs are multiplied using integer multipliers, and the scales are added to form the scale of the output, iii) the product fraction is shifted according to the new scale value to align with the quire's integer and fraction fields; if the product is negative, the appropriate two's complement is taken, iv) the quire is added to the product of the operands using signed addition, and if there is overflow or underflow, the sum is rounded to nearest-even.
Type-converter – Float-to-Posit (F-to-P)
The F-to-P block converts a float input to the posit format, as depicted in Fig. 7(b). The conversion may result in a loss of precision when using narrower posit types. For a number represented in the floating-point format, its value is as per equation 4, where fsw is the exponent width and ffw is the fraction width of the float.
The fixed-length f field is interpreted directly from the input operand and mapped to the corresponding field in the posit format using equations 2 and 4. The scale field, after subtraction of the bias, is bounded to (-p, p), where p equals (N-2)·2^es. A truncation flag (tf) is asserted by the block depending on the conversion between different-size float and posit values. These flags are retained to perform rounding in later stages. The output of the F-to-P block needs normalization before write-back to the PRF.
Type-converter – Posit-to-Float (P-to-F)
The P-to-F block converts a posit input to the float format, as illustrated in Fig. 7(c). The P-to-F block receives its input after the extract stage, and its output is sent directly to the FPR in Clarinet. Depending on the configuration parameters for Melodica, this operation may result in a change in width between the source and target types.
The main operations involved in the conversion are bounding the scale to (-bias, bias) for the target float format, and truncating the f field if the field width of the target format is narrower than that of the source, using equations 2 and 4. In addition, the converter checks for the special cases of the target format (zero, NaN, and ±∞).
Type-converter – Quire-to-Posit (Q-to-P)
The Q-to-P block converts a value in the quire format to the posit format so that it can be written to the PRF after normalization. After adjusting for a negative value, the scale and f fields are extracted from the quire, as illustrated in Fig. 7(d). The truncation flag (tf), generated from truncating the fraction value, is sent to the Normalize block to control rounding to nearest-even (RNE).
Type-converter – Posit-to-Quire (P-to-Q)
The P-to-Q block converts an input posit number after extraction to quire format, thereby initializing the quire. The fraction from the extractor block is shifted (based on the scale) and extended to occupy the corresponding field in quire as shown in Fig. 7e.
5. Case Studies
We cover case studies on some of the linear algebra kernels and on optical flow in computer vision. We look into application kernels that are rich in floating-point arithmetic operations. For matrix operations, we develop a subset of the basic linear algebra subprograms (BLAS) and the linear algebra package (LAPACK) using SoftPosit for the analyses (Leong, 2018). Based on this investigation, we arrive at a suitable arithmetic size for each of the kernels in BLAS and LAPACK, and for optical-flow estimation using the Lucas-Kanade method. We use this information to tweak parameters in Clarinet to arrive at a customized Clarinet instance.
5.1. Using Clarinet - A Simple Example
| Sl. No. | Instruction | Disassembly | RF/Quire Updates | Comments |
|---------|-------------|-------------|------------------|----------|
| 1 | 00052007 | flw ft0, 0(a0) | | Load 2.50 to FPR ft0 from memory |
| 2 | 00452087 | flw ft1, 4(a0) | | Load 4.00 to FPR ft1 from memory |
| 3 | 44000053 | fcvt.p.s p0, ft0 | | Execute F-to-P on ft0. Result in PRF p0 |
| 4 | 440080d3 | fcvt.p.s p1, ft1 | | Execute F-to-P on ft1. Result in PRF p1 |
| 5 | | fcvt.r.p p2 | | Execute P-to-Q on p2. Result in quire |
| 6 | 34100053 | fma.p p0, p1 | | Accumulate (p0*p1) into quire |
| 7 | 34100053 | fma.p p0, p1 | | Accumulate (p0*p1) into quire |
| 8 | | fcvt.p.r p2 | | Execute Q-to-P on quire. Result in PRF p2 |
| 9 | 41010153 | fcvt.s.p ft2, p2 | | Execute P-to-F on p2. Result in FPR ft2 |
Clarinet-Melodica introduces a new usage model for the posit programmer by focusing on quire functionality. While Clarinet-Melodica does not offer dedicated instructions for posit addition, subtraction, and multiplication, these operations can be realized via the FMA.P instruction. The example presented in Table 2 is a simple case where a user loads two 32-bit floating-point numbers from memory and performs a series of operations on them using the quire. In this example, Clarinet-Melodica is configured to use 16-bit posits. In particular, instruction number 6 illustrates how a user can use the FMA.P instruction to multiply two posit operands (by first initializing the quire to zero). Furthermore, this form of multiplication does not suffer rounding error, as the result accumulates into the quire. Similarly, substituting 1.0 for the first or second operand of FMA.P allows the user to add posits to, or subtract posits from, the quire.
5.2. BLAS and LAPACK
BLAS and LAPACK routines are encountered in a wide range of engineering and scientific applications. In BLAS, we consider the dot product (xDot), matrix-vector (xGemv), and matrix-matrix (xGemm) operations, and in LAPACK we consider the Givens rotation (xGivens), where x denotes the data type used for the implementation. For all the matrix operations, we implement nine different versions using different data types for comparison, and use a 64-bit floating-point implementation as the reference. We randomly generate numbers using the rand() function. Since Clarinet supports the quire and FMA, we emphasize quire-based implementations, using SoftPosit for our analyses. To calculate the error in xDot, we average the relative error over 100K runs. To calculate the error in xGemv and xGemm, we compute the relative error of each result with respect to the corresponding operation computed in 64-bit floating-point.
The accurate digits of the different implementations are shown in Fig. 8. In the dot product, the 32-bit quire (q32Dot) yields 8.8 accurate digits for small (10-element) input vectors in the range 0 to 1 (Fig. 8a). For large vectors (10000 elements) in the same range, the number of accurate digits drops to 8.2, a drop of 6.8%. In the same input range, we observe a drop of 12.3% in fDot, 17.4% in p32Dot, and 9.3% in q24Dot. For the input vector range 0 to 10 and sizes from 10 to 10000, we observe a similar trend (Fig. 8b). Varying the range of the input vectors impacts accuracy heavily, especially for large vectors: we observe a drop in the number of accurate digits of 55.6% in q32Dot, 53.94% in p32Dot, and 36.3% in q24Dot, while in fDot it is 18.63% (Fig. 8c). The drop in accuracy is due to the fact that posits and the quire are most accurate for values around 1.0; as the input range shifts away from 1.0, accuracy deteriorates.
A similar trend is observed in the xGemv, xGemm, and xGivens routines for increasing matrix sizes and varying ranges (Fig. 8d). A key observation is that in the p32Givens and q32Givens routines the number of accurate digits is significantly higher (8.2 and 8.8, respectively) than in fGivens (6.79). The shaded region in Fig. 8 represents the routines that can be executed on the current version of Clarinet, given the absence of posit addition, multiplication, and division hardware. For the implementation of routines in software, we have used floating-point in conjunction with the quire. For example, q32-f32Givens is an implementation of the Givens rotation using a combination of 32-bit quire and 32-bit floating-point arithmetic. This implementation yields accuracy similar to q32Givens, since the majority of the operations are dominated by the quire. In the BLAS routines, 100% of the arithmetic operations can be implemented using the quire alone. Based on the accurate digits and the arithmetic supported in Clarinet, we can assess the quality of a Clarinet configuration, which is further discussed in Section 6.2.
5.3. Lucas-Kanade Optical Flow
Lucas-Kanade is a differential method of tracking features given a sequence of frames. Given I as the brightness per pixel at (x, y), the local optical-flow (velocity) vector (u, v) is given by equation 5.
The Lucas-Kanade method is used to calculate the optical flow for consecutive frames of the rotating objects shown in Fig. 9. We compare different combinations of posit and single-precision floating-point configurations against 64-bit floating-point using SoftPosit, and generate heat maps of the absolute error for both u and v. The three configurations compared are: i) 32-bit single-precision floating-point arithmetic (f32), ii) 32-bit single-precision float arithmetic combined with N-bit quire arithmetic (f32-qN), and iii) N-bit posit arithmetic with N-bit quire arithmetic (pN-qN). Furthermore, owing to the better accuracy of posits around 1.0, we normalize (norm) grey-scale pixel values (0 to 255) to the range 0.0 to 16.0.
From the heat maps in Fig. 10(a), the effects of normalization and q32 on error become obvious. When working with normalized data, the configuration p32-q32 clearly outperforms all other configurations. For data which is not normalized, the performance of p32-q32 depends on whether the data naturally falls around 1.0. However, even in the non-normalized case, f32-q32 performs consistently better than f32. The general trend in maximum and RMS error for the Rubik's cube and sphere object frames for different configurations is shown in Fig. 11. The y-axis value for RMS error in Fig. 11 gives the number of accurate digits for each configuration compared to 64-bit floating-point. Allowing one decimal place of tolerance to error, the p24-q24-norm configuration gives results close in accuracy to f32. With a penalty of two more decimal places, p16-q16-norm can be a feasible alternative. When optical flow is computed with posit configurations for values not around 1.0, accuracy falls.
As summarised in Table 3, the p32-q32-norm configuration yields an order-of-magnitude improvement in accuracy over f32 for the sphere dataset. For grey-scale pixel values (0-255), the f32-q32 configuration improves the accuracy by 23% and 32% for the Rubik’s cube and sphere datasets respectively.
6. Experimental Results
6.1. Implementation Setup
Different configurations of Clarinet-Melodica were synthesized using Synopsys Design Compiler. All designs were synthesized at a clock frequency of 200 MHz on a Faraday 90 nm CMOS process. No special memory cells were used to synthesize the register files or branch target buffers.
Melodica is not a complete posit implementation. It delivers accumulator functionality using Quire, and is meant to be used alongside a 32-bit floating-point implementation. The baseline for comparisons is a 32-bit RISC-V Clarinet processor with support for 32-bit floating-point arithmetic, and no Melodica. This is the minimum functionality required in a RISC-V CPU to integrate Melodica.
For the purpose of comparison, the following five implementations were evaluated:
Clarinet-Base: This is the baseline implementation, which features a 32-entry, 32-bit wide floating-point register file (FPR) and bypass logic, and an FPU which is capable of single-precision arithmetic. Clarinet-Base does not integrate a Melodica core, but does support the new posit-related custom instructions as described in Section 3.2.
Clarinet-Double: Support for 64-bit floating-point arithmetic is added to the Clarinet-Base implementation. The FPR is doubled to be 64-bit wide and the bypass paths for floating-point values are suitably widened. The FPU arithmetic unit is now capable of processing 64-bit floating-point operands.
Clarinet-P-16-1: Melodica, configured with N=16 and es=1, is integrated into the Clarinet-Base configuration. In this configuration, Melodica features a 128-bit Quire. The PRF has thirty-two 16-bit registers, along with bypass logic.
Clarinet-P-24-2: Melodica, configured with N=24 and es=2, is integrated into the Clarinet-Base configuration. In this configuration, Melodica features a 288-bit Quire. The PRF and bypass logic are widened to 24 bits.
Clarinet-P-32-2: Melodica, configured with N=32 and es=2, is integrated into the Clarinet-Base configuration. In this configuration, Melodica features a 512-bit Quire. The PRF and bypass logic are widened to 32 bits.
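The quire widths quoted for the three configurations (128, 288, and 512 bits for N = 16, 24, and 32) all follow the pattern N²/2, which the helper below reproduces. This is an observation about the numbers listed above, not a definition taken from the posit standard.

```python
def quire_width(n: int) -> int:
    """Quire width in bits for an N-bit posit, as configured in Melodica (N^2 / 2)."""
    return n * n // 2

# Reproduce the three Melodica configurations listed above.
for n in (16, 24, 32):
    print(f"N = {n:2d} -> quire = {quire_width(n)} bits")
```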
As indicated in Table 4, adding support for 64-bit floating-point leads to a nearly 72% increase in area over Clarinet-Base. In comparison, adding Melodica configured with N=16, es=1 adds approximately 9% area. Interestingly, the area overhead of moving to wider values of N (24 and 32) is marginal (around 3%). The reason lies in Clarinet’s organization: moving from Clarinet-Base to Clarinet-P-16-1 introduces the new PRF and bypass logic for posit types in addition to the Melodica pipes themselves, whereas moving to wider posit widths introduces no new structures but simply widens existing ones.
Table 4 also notes the cell switching power. Cell switching power does not include the power dissipated due to net switching. Net switching, in particular in the clock network, dominates overall power dissipation: between 70% and 80% of the power is dissipated in the clock tree alone, and this is largely unchanged across the different configurations. For this reason, we found it more instructive to highlight the cell switching power, which amplifies the effect each configuration has on dynamic power dissipation.
6.2. Quality of Clarinet
Quality of Clarinet (QoC) is defined using equation 6.
where A_base is the area of Clarinet-Base, A_inst is the area of the instance under consideration, and d_inst is the number of accurate digits in the instance under consideration. The QoC is a metric that incorporates the platform configuration and application accuracy to measure the quality of a Clarinet instance. A high-quality implementation is one with a low area footprint that supports high-precision computations. An implementation supporting high-precision computations can still be of low quality if it incurs a high area footprint. A similar equation can be formulated for the power footprint of Clarinet instances.
We segregate the QoC into three zones: green, blue, and yellow. The green zone is superior to the rest, with a platform quality of more than 0.85 (85%); the blue zone covers QoC between 0.75 and 0.85, and the yellow zone covers QoC below 0.75. The QoC for different BLAS and LAPACK routines is shown in Fig. 12. The routines marked with a red star cannot be executed on the current implementation of Clarinet due to the presence of posit square-root and division in those routines. The QoC is not a constant for any instance; it is a function of the accuracy of the software application being executed. The QoC for the Rubik’s cube and sphere frames is depicted in Fig. 13. The routines that involve addition and multiplication of posits are implemented using the FMA operation of Clarinet as described in Section 5.1. The QoC for Clarinet-f32 is better than that of the other Clarinet instances since it uses less area. However, the Clarinet-p32-q32-norm implementations for the Rubik’s cube and sphere frames, with QoC of 87.2% and 86% respectively, deliver an order-of-magnitude improvement in accuracy. As the accuracy varies, the QoC varies; depending on the accuracy requirements of an application, a suitable instance of Clarinet can be chosen.
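The zoning described above can be captured in a few lines. Only the green/blue/yellow thresholds are taken from the text; the QoC values themselves come from equation 6 (not reproduced here), so `qoc_zone` below takes an already-computed QoC in [0, 1], and the function name and the third example instance are our own illustrative choices.

```python
def qoc_zone(qoc: float) -> str:
    """Classify a Quality-of-Clarinet value into the three zones from the text."""
    if qoc > 0.85:
        return "green"    # high quality: QoC above 0.85
    if qoc >= 0.75:
        return "blue"     # medium quality: QoC between 0.75 and 0.85
    return "yellow"       # low quality: QoC below 0.75

# The first two values are the Clarinet-p32-q32-norm figures quoted above;
# the last is a hypothetical low-quality instance for contrast.
for name, qoc in [("p32-q32-norm (Rubik's)", 0.872),
                  ("p32-q32-norm (sphere)", 0.860),
                  ("hypothetical low-quality instance", 0.700)]:
    print(f"{name}: QoC = {qoc:.3f} -> {qoc_zone(qoc)} zone")
```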
6.3. Disclaimer and Limitations
Disclaimer: We notice that the implementation of qX2_fdp_add/sub in SoftPosit uses 32-bit posit storage underneath while performing 24-bit posit accumulations, which results in a 512-bit quire register. In hardware, we provide a 288-bit quire register for the accumulation of 24-bit posits. Based on our interactions with the developers of the SoftPosit library, we consider it fair to compare the accuracy of the 24-bit quire (accumulation of 24-bit posits) in software and hardware.
Limitation 1: In its present form, Clarinet can execute applications that contain multiply, add, and multiply-accumulate operations on posits, while division and square-root are not supported. The absence of these operations limits application analyses and execution unless square-root and division are implemented in software.
Limitation 2: While Melodica has seen extensive unit-level verification, further system-level tests are in progress on Clarinet.
7. Conclusion
We presented Clarinet, an open-source hardware-software framework that allows posit and floating-point arithmetic to coexist in experiments. A posit arithmetic core called Melodica was presented, and the design components of the core were described. Melodica is the first posit arithmetic core supporting quire to be integrated into a RISC-V CPU. We delved into case studies on basic matrix operations and the Lucas-Kanade optical flow method. The analyses of these kernels and applications helped us quantify the quality of different Clarinet instances, and the advantages of the different arithmetic formats were identified based on the accuracy of the numerical results. Finally, we presented synthesis results for Clarinet and outlined some limitations of the current implementation. We provide researchers with a consolidated framework for experimental studies on posit arithmetic. We demonstrated high-quality, medium-quality, and low-quality implementations on Clarinet by segregating them into green, blue, and yellow zones: implementations in the green zone are of high quality, while those in the yellow zone are of poor quality. Our quality metric incorporates the area footprint of the platform. In the future, we plan to extend Melodica to support more operations, and to explore its use as a posit-enabled accelerator.
- RISC-V Instruction Set Architecture. Cited by: §1, §2.1.3, §3.1, §3.2.
- BSV HL-HDL. GitHub. Note: https://github.com/B-Lang-org/bsc Cited by: §1.
- FLUTE RISC-V Core. GitHub. Note: https://github.com/bluespec/Flute Cited by: 1st item.
- Bfloat16 processing for neural networks. In 2019 IEEE 26th ARITH, Vol. , pp. 88–91. External Links: Cited by: §1.
- Parameterized posit arithmetic hardware generator. In 2018 IEEE 36th ICCD, Vol. , pp. 334–341. External Links: Cited by: §1, §2.2.
- A mixed precision Monte Carlo methodology for reconfigurable accelerator systems. In Proceedings of the ACM/SIGDA ISFPGA, FPGA ’12, New York, NY, USA, pp. 57–66. External Links: Cited by: §1.
- Posits: the good, the bad and the ugly. In Proceedings of the CoNGA 2019, CoNGA’19, New York, NY, USA. External Links: Cited by: §2.1.1.
- Next generation arithmetic for edge computing. In 2020 Design, Automation Test in Europe Conference Exhibition (DATE), Vol. , pp. 1357–1365. Cited by: §1.
- Beating floating point at its own game: posit arithmetic. Supercomput. Front. Innov.: Int. J. 4 (2), pp. 71–86. External Links: Cited by: §1.
- Universal number posit arithmetic generator on fpga. In DATE 2018, Vol. , pp. 1159–1162. External Links: Cited by: §2.2.
- Architecture generator for type-3 unum posit adder/subtractor. In ISCAS 2018, Vol. , pp. 1–5. External Links: Cited by: §2.2.
- PACoGen: a hardware posit arithmetic core generator. IEEE Access 7 (), pp. 74586–74601. External Links: Cited by: §2.2.
- SPICE²: Spatial processors interconnected for concurrent execution for accelerating the SPICE circuit simulator using an FPGA. IEEE TCAD 31 (1), pp. 9–22. External Links: Cited by: §1.
- Advanced arithmetic for the digital computer: design of arithmetic units. Springer-Verlag, Berlin, Heidelberg. External Links: Cited by: §2.1.2.
- Cheetah: mixed low-precision hardware & software co-design framework for dnns on the edge. External Links: Cited by: §2.2.
- Make it real: effective floating-point reasoning via exact arithmetic. In DATE 2014, Vol. , pp. 1–4. External Links: Cited by: §1.
- SoftPosit. GitLab. Note: https://gitlab.com/cerlane/SoftPosit Cited by: §1, §5.
- FPGA implementation of a kalman-based motion estimator for levitated nanoparticles. IEEE TIM 68 (7), pp. 2374–2386. External Links: Cited by: §1.
- Training deep neural networks using posit number system. External Links: Cited by: §2.2.
- PERI: A Posit Enabled RISC-V Core. pp. 1–14. External Links: Cited by: §2.2.
- Towards hardware iir filters computing just right: direct form i case study. IEEE Transactions on Computers 68 (4), pp. 597–608. External Links: Cited by: §1.
- Efficient posit multiply-accumulate unit generator for deep learning applications. In ISCAS 2019, Vol. , pp. 1–5. External Links: Cited by: §2.2.