PERCIVAL: Open-Source Posit RISC-V Core with Quire Capability

The posit representation for real numbers is an alternative to the ubiquitous IEEE 754 floating-point standard. In this work, we present PERCIVAL, an application-level posit-capable RISC-V core based on CVA6 that can execute all posit instructions, including the quire fused operations. This solves the obstacle encountered by previous works, which only included partial posit support or had to emulate posits in software, thus limiting the scope or the scalability of their applications. In addition, Xposit, a RISC-V extension for posit instructions, is incorporated into LLVM. Therefore, PERCIVAL is the first work that integrates the complete posit instruction set in hardware. These elements allow for the native execution of posit instructions as well as the standard floating-point ones, further permitting the comparison of these representations. FPGA and ASIC synthesis show the hardware cost of implementing 32-bit posits and highlight the significant overhead of including a quire accumulator. However, results comparing posits and IEEE floats show that the quire enables a more accurate execution of dot products. In general matrix multiplications, the accuracy error is reduced by up to 4 orders of magnitude when compared with single-precision floats. Furthermore, performance comparisons show that these accuracy improvements do not hinder execution, as posits run as fast as single-precision floats and exhibit better timing than double-precision floats, thus potentially providing an alternative representation.

1 Introduction

Representing real numbers and executing arithmetic operations on them in a microprocessor presents unique challenges. Compared with the simpler set of integers, working with reals introduces notions such as their precision. The representation of real numbers in virtually all computers for decades has followed the IEEE 754 standard for floating-point arithmetic [10]. However, this standard has some flaws, such as rounding and reproducibility issues, signed zero, and an excess of NaN representations.

To face these challenges, alternative real number representations are proposed in the literature. Posits [8] are a promising substitute proposed in 2017 that provide compelling benefits. They deliver a good trade-off between dynamic range and accuracy, encounter fewer exceptions when operating, and have tapered precision. This means that numbers near $\pm 1$ have more precision, while very big and very small numbers have less. The posit standard includes fused operations, which can be used to compute a series of multiplications and accumulations without intermediate rounding. Furthermore, posits are consistent between implementations, as they use a single rounding scheme and include only two special cases: a single 0 and $\pm\infty$. Therefore, they potentially simplify the hardware implementation [7]. Nonetheless, posits are still under development, and it is still not clear whether they could completely replace IEEE floats [5].

Including posit arithmetic units (PAUs) in hardware cores is a crucial step to study the efficiency of this representation further. When designing such a core and its arithmetic operations, an important decision is which instruction set architecture (ISA) to implement. RISC-V [25] is a promising open-source ISA that is gaining significant traction both in academia and in industry. Thanks to its openness and flexibility, multiple RISC-V cores have been developed targeting diverse purposes in recent years. In the case of studying the performance of posits, a core that can run application-level software is needed.

Some works have studied the use of posits by emulating their execution in software [16, 21, 12]. However, this approach has the significant drawback of requiring excessive execution times, thus limiting the scalability of the applications.

To overcome these limitations, we propose to include native posit and quire support in hardware by leveraging a high-performance RISC-V core. A comparison of four of the leading open-source application-class RISC-V cores is studied in [6], CVA6 among them. In this work we have extended the datapath of the CVA6 [26] RISC-V core with a 32-bit PAU with quire and a posit register file. Together with the Xposit compiler extension, this core allows the native hardware execution of high-level applications that leverage the posit number system.

Therefore, the main contributions of this paper are the following:

  • We present PERCIVAL (https://github.com/artecs-group/PERCIVAL), an oPEn-souRCe posIt risc-V core with quire cApabiLity based on the CVA6 that can execute all 32-bit posit instructions, including the quire fused operations.

  • Compiler support for the Xposit RISC-V extension in LLVM. This makes it easy to embed posit instructions into a C program that can be run natively on PERCIVAL or any other core that implements these opcodes.

  • To the best of our knowledge, the PERCIVAL core together with the Xposit extension is the first work that integrates the complete posit ISA and quire in hardware.

  • FPGA and ASIC synthesis results showcasing the resource usage of posit arithmetic and quire capabilities on a RISC-V CPU. These results are compared with the native IEEE 754 floating-point unit (FPU) available in the CVA6 and with previous works.

  • Accuracy and timing performance of posit numbers and IEEE 754 floats are compared on PERCIVAL using general matrix multiplication (GEMM) and max-pooling benchmarks. Results show that 32-bit posits can be up to 4 orders of magnitude more accurate than 32-bit floats thanks to the quire register. Furthermore, this improvement does not imply a trade-off in execution time, as they can perform as fast as 32-bit floats, and thus execute faster than 64-bit floats.

The rest of the paper is organized as follows: Section 2 introduces the necessary background about the posit format, the RISC-V ISA and the CVA6 RISC-V core. Related works from the literature are surveyed in Section 3, both as standalone PAU and at the core level. In Section 4 the PERCIVAL posit core is described and in Section 5 the necessary compiler support for the Xposit RISC-V extension is introduced. The FPGA and ASIC synthesis results of the core are presented, as well as compared with other implementations, in Section 6. Subsequently, in Section 7 posits and IEEE 754 floats are compared regarding accuracy and timing performance. Finally, Section 8 concludes this work.

2 Background

2.1 Posit Format

Posit numbers [8] were introduced in 2017 as an alternative to the predominant IEEE 754 floating-point standard to represent and operate with real numbers. Posits provide reproducible results across platforms and few special cases. Furthermore, they do not support overflow or underflow, which reduces the complexity of exception handling.

A posit number configuration is defined using two parameters as Posit$\langle n, es \rangle$, where $n$ is the total bit-width and $es$ is the maximum bit-width of the exponent. Although in the literature [5, 18, 16] the most widespread posit formats have been Posit$\langle 8,0 \rangle$, Posit$\langle 16,1 \rangle$ and Posit$\langle 32,2 \rangle$, in the latest Posit Standard 4.12 Draft [20] the value of $es$ is fixed to 2. This has the advantage of simplifying the hardware design and facilitates the conversion between different posit sizes.

Posits only distinguish two special cases: zero and NaR, which are represented as $0\cdots0$ and $10\cdots0$ respectively. The rest of the representations are composed of four fields as shown in Figure 1:

  • The sign bit S;

  • The variable-length regime field R, consisting of $k$ bits equal to $R_0$ followed by $\overline{R_0}$ or the end of the posit. This field encodes a scaling factor $r$ given by Equation (1);

  • The exponent E, consisting of at most $es$ bits, which encodes an integer unbiased value $e$. If any of its bits are located after the least significant bit of the posit, that bit will have value 0;

  • The variable-length fraction field F, formed by the $m$ remaining bits. Its value $0 \le f < 1$ is given by dividing the unsigned integer F by $2^m$.

Fig. 1: Posit format with sign, regime, exponent and fraction fields.

$$ r = \begin{cases} -k & \text{if } R_0 = 0 \\ k - 1 & \text{if } R_0 = 1 \end{cases} \tag{1} $$

The real value of a generic posit is given by Equation (2). The main differences with the IEEE 754 floating-point format are the existence of the regime field, the use of an unbiased exponent, and the value of the fraction hidden bit. Usually, in floating-point arithmetic, the hidden bit is considered to be 1. However, in the latest representation of posits, it is considered to be 1 if the number is positive, or -2 if the number is negative. This simplifies the decoding stage of the posit representation [7].

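$$ X = \left(1 - 3s + f\right) \times 2^{(1 - 2s) \times (4r + e + s)} \tag{2} $$

Here $s$ is the value of the sign bit S, $r$ is the regime scaling factor from Equation (1), $e$ is the exponent value and $f$ is the fraction value.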

In posit arithmetic, NaR has a unique representation that maps to the most negative 2’s complement signed integer. Consequently, if used in comparison operations, it results in less than all other posits and equal to itself. Moreover, the rest of the posit values follow the same ordering as their corresponding bit representations. These characteristics allow posit numbers to be compared as if they were 2’s complement signed integers, eliminating additional hardware for posit comparison operations.
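The ordering property described above can be sketched in a few lines of C. This is only an illustration, assuming that a Posit32 bit pattern is carried in a `uint32_t`; the function names are hypothetical and are not part of PERCIVAL or Xposit.

```c
#include <stdbool.h>
#include <stdint.h>

/* NaR is the pattern 10...0, i.e. the most negative 32-bit signed integer. */
#define POSIT32_NAR 0x80000000u

/* Posit comparisons reduce to signed integer comparisons on the raw bits.
   The casts assume the usual two's complement conversion to int32_t. */
static bool posit32_lt(uint32_t a, uint32_t b) { return (int32_t)a < (int32_t)b; }
static bool posit32_le(uint32_t a, uint32_t b) { return (int32_t)a <= (int32_t)b; }
static bool posit32_eq(uint32_t a, uint32_t b) { return a == b; }
```

For instance, with the Posit32 encodings 0x40000000 (1.0), 0x48000000 (2.0) and 0xC0000000 (-1.0), `posit32_lt(0xC0000000u, 0x40000000u)` holds, and NaR compares as less than every other pattern and equal to itself.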

The variable-length regime field acts as a long-range dynamic exponent, as can be seen in Equation (2), where it is multiplied by 4 or, equivalently, shifted left by the two exponent bits. Since it is a dynamic field, it can occupy more bits to represent larger numbers or leave more bits to the fraction field when looking for accuracy in the neighborhood of $\pm 1$. However, detecting these variable-sized fields adds some hardware overhead.

As an example, let 11101010 be the binary encoding of a Posit8, i.e. a Posit$\langle 8,2 \rangle$ according to the latest Posit Standard 4.12 Draft [20]. The first bit, $s = 1$, indicates a negative number. The regime field 110 gives $k = 2$ and therefore $r = 1$. The next two bits 10 represent the exponent $e = 2$. Finally, the remaining bits, 10, encode a fraction value of $f = 2/2^2 = 0.5$. Hence, from (2) we conclude that $11101010 \equiv (1 - 3 + 0.5) \times 2^{-(4 + 2 + 1)} = -1.5 \times 2^{-7} = -0.01171875$.
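The field extraction used in this example can be written as a short software decoder. The following C function is a minimal sketch, not the PERCIVAL hardware decoder: it assumes es = 2, a posit of 3 to 32 bits stored in the low bits of a `uint32_t`, and it maps NaR to a floating-point NaN purely for convenience. The name `posit_es2_to_double` is hypothetical.

```c
#include <assert.h>
#include <math.h>    /* only for the NAN macro */
#include <stdint.h>

/* Decode an n-bit posit with es = 2 into a double. */
static double posit_es2_to_double(uint32_t bits, int n)
{
    assert(n >= 3 && n <= 32);
    if (n < 32) {
        bits &= (1u << n) - 1u;              /* keep only the low n bits */
    }
    if (bits == 0u) return 0.0;              /* special case: zero */
    if (bits == (1u << (n - 1))) return NAN; /* special case: NaR (assumed mapping) */

    int s = (int)((bits >> (n - 1)) & 1u);   /* sign bit */

    /* Regime: run of k bits equal to R0, ended by the opposite bit or the LSB. */
    int pos = n - 2;
    int r0 = (int)((bits >> pos) & 1u);
    int k = 0;
    while (pos >= 0 && (int)((bits >> pos) & 1u) == r0) {
        k++;
        pos--;
    }
    pos--;                                   /* skip the terminating bit, if any */
    int r = r0 ? k - 1 : -k;                 /* Equation (1) */

    /* Exponent: two bits; bits beyond the LSB read as 0. */
    int e = 0;
    for (int i = 0; i < 2; i++) {
        e <<= 1;
        if (pos >= 0) {
            e |= (int)((bits >> pos) & 1u);
            pos--;
        }
    }

    /* Fraction: the m remaining bits, scaled by 2^m. */
    int m = (pos >= 0) ? pos + 1 : 0;
    double f = 0.0;
    if (m > 0) {
        f = (double)(bits & ((1u << m) - 1u)) / (double)(1u << m);
    }

    /* Equation (2): (1 - 3s + f) * 2^((1 - 2s) * (4r + e + s)). */
    int scale = (1 - 2 * s) * (4 * r + e + s);
    double pow2 = 1.0;
    for (int i = 0; i < (scale < 0 ? -scale : scale); i++) {
        pow2 *= (scale < 0) ? 0.5 : 2.0;     /* exact in binary floating point */
    }
    return (1.0 - 3.0 * s + f) * pow2;
}
```

Calling `posit_es2_to_double(0xEA, 8)` on the encoding 11101010 walks through the same steps as the example: s = 1, k = 2, r = 1, e = 2, f = 0.5, and a final value of -0.01171875.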

In addition to the standard representation, posits include fused operations using the quire, a $16n$-bit fixed-point 2’s complement register, where $n$ is the posit bit-width. This makes it possible to execute up to $2^{31}-1$ multiply-accumulate (MAC) operations without intermediate rounding or accuracy loss. The quire can represent either NaR, similarly to regular posits, or the value given by $2^{16-8n}$ times the 2’s complement signed integer represented by the concatenated bits.
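The effect of deferring rounding until the end of an accumulation can be illustrated without posit hardware. The sketch below is an analogy only: it uses IEEE floats, and a `double` accumulator stands in for the quire, which is in reality an exact fixed-point register. The function names are hypothetical, and the sketch assumes a platform that evaluates `float` arithmetic in single precision.

```c
#include <stddef.h>

/* Rounds after every multiply-accumulate: the accumulator is a 32-bit float. */
static float dot_round_each_step(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc = acc + a[i] * b[i];
    }
    return acc;
}

/* Defers rounding: partial products are accumulated in a wider format and
   rounded to 32 bits once, at the end. The double accumulator is only a
   stand-in for the quire, which would hold the sum exactly. */
static float dot_round_once(const float *a, const float *b, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += (double)a[i] * (double)b[i];
    }
    return (float)acc;
}
```

With a = {1e8, 1, -1e8} and b = {1, 1, 1}, the first version loses the middle term when 1e8 + 1 is rounded to single precision and returns 0, whereas the second returns the exact result 1.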

2.2 RISC-V ISA

The open-source RISC-V ISA [25] emanates from the ideas of RISC. It is structured as a base integer ISA plus a set of optional standard and non-standard extensions to customize and specialize the final set of instructions. There are two main base integer ISAs, RV32I and RV64I, that establish the address spaces as 32-bit or 64-bit respectively.

The RISC-V general standard extensions include, among others, functionality for integer multiply/divide (M), atomic memory operations (A) and single (F) and double-precision (D) floating-point arithmetic following the IEEE 754 standard. This set of general-purpose standard extensions MAFD is abbreviated as G. In general, following the RISC principles, all extensions have fixed-length 32-bit instructions. However, there is also a compressed instruction extension (C) that provides 16-bit instructions.

Expanding the RISC-V ISA with specialized extensions is supported by the standard to allow for customized accelerators. Non-standard extensions can be added to the encoding space leveraging the four major opcodes reserved for custom extensions. A proposal of the changes that should be made to the F standard extension in order to have a 32-bit posit RISC-V extension is described in [9].

2.3 CVA6

The CVA6 [26] (formerly known as Ariane) is a 6-stage, in-order, single-issue CPU which implements the RV64GC RISC-V standard. The core implements three privilege levels and can run a Linux operating system. The primary goal of its micro-architecture is to reduce the critical path length. It was developed initially as part of the PULP ecosystem, but it is currently maintained by the OpenHW Group, which is developing a complete, industrial-grade pre-silicon verification. CVA6 is written in SystemVerilog and is licensed under an open-source Solderpad Hardware License.

As execution units in the datapath it includes an integer ALU, a multiply/divide unit and an IEEE 754 FPU [15]. This FPU claims to be IEEE 754-2008 compliant, except for some issues in the division and square root operations. For the sake of comparison, it is important that the FPU is IEEE 754 compliant instead of being limited to normal floats only, since in theory, posit hardware is slightly more expensive than floating-point hardware that does not take into account subnormal numbers [7].

3 Related Work

There has been a great deal of interest in hardware implementations of posit arithmetic since its first appearance. Standalone PAUs with different degrees of capabilities, or basic posit functional units, have been described in the literature [2, 11, 17, 18]. These units provide the building blocks to execute posit arithmetic. However, on their own they cannot execute complete posit algorithms.

Recently, some works adding partial posit support to RISC-V cores have been presented. CLARINET [22] incorporates the quire into a RV64GC 5-stage in-order core. However, not all posit capabilities are included in this work. Most operations are performed in IEEE floating-point format, and the values are converted to posit when using the quire. The only posit functionalities added to the core are fused MAC with quire, fused divide and accumulate with quire and conversion instructions.

PERC [1] integrates a PAU into the Rocket Chip generator, replacing the 32 and 64-bit FPU. However, this work does not include quire support, as it is constrained by the F and D RISC-V extensions for IEEE 754 floating-point numbers. More recently, PERI [23] added a tightly coupled PAU into the SHAKTI C-class core, a 5-stage in-order RV32IMAFC core. This proposal also does not include quire support, as it reuses the F extension instructions. Nonetheless, it allows dynamic switching between es=2 and es=3 posits. In [3], the authors include a PAU named POSAR into a RISC-V Rocket Chip core. Again, this proposal does not include quire support and replaces the FPU present in Rocket Chip to reuse the floating-point instructions.

A different approach is taken in [4], where the authors use the posit representation as a way to store IEEE floats in memory with a lower bit-width, while performing the computations using the IEEE FPU. For this purpose they include a light posit processing unit into the CVA6 core that converts between 8 or 16-bit posits and 32-bit IEEE floats. They also develop an extension of the RISC-V ISA to include these conversion instructions.

4 PERCIVAL Posit Core

In this work we have integrated full posit capabilities, including quire and fused operations, into an application-level RISC-V core. In addition to the design of the functional units that execute the posit and quire operations, the novelty of our design is that it is fully compatible both at the software and hardware level with the F and D RISC-V extensions. Therefore, both posit and IEEE floating-point numbers can be used simultaneously on the same core. This is the first work that integrates the full posit ISA with quire into a core, to the best of our knowledge.

4.1 PAU Design

The PAU is in charge of executing most posit operations and also contains the quire register, as shown in Figure 2. Posit comparisons are executed in the integer ALU. As mentioned above, this is one of the benefits of the posit representation.

Fig. 2: Internal structure of the proposed PAU.

Depending on the operation, the input operands are directed to the corresponding posit unit and the result is forwarded as an output of the PAU. There are three main blocks: computational operations (COMP), conversion operations (CONV) and operations that make use of the quire register (FUSED) (Figure 2).

Regarding COMP, the ADD unit is used both for addition and subtraction, calculating the two’s complement of the second operand when subtracting. It must be noted that the posit division and square root units are approximate, as this type of arithmetic simplifies the designs and thus reduces the hardware cost of the system [18]. On the other hand, exact division and square root algorithms could be implemented in software leveraging the MAC unit, thus eliminating the need for dedicated hardware. However, this is out of the scope of this work. In this group, all the modules use both operands except the square root, which uses only operand A. In addition, the operands and the result correspond to the posit register file.
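The reuse of the adder for subtraction relies on the fact that negating a posit is the same as taking the 2's complement of its bit pattern. A minimal C sketch of that negation step follows, assuming a Posit32 pattern held in a `uint32_t`; the helper name is hypothetical and the posit adder itself is not modelled.

```c
#include <stdint.h>

/* Negate a Posit32 by taking the two's complement of its bit pattern.
   Zero (00...0) and NaR (10...0) map to themselves, so neither special
   case needs separate handling. */
static uint32_t posit32_negate(uint32_t p)
{
    return 0u - p;   /* unsigned arithmetic wraps modulo 2^32 */
}
```

A subtraction a - b can then be issued to the adder as a + posit32_negate(b).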

In the CONV group, only operand A is used for conversions. Depending on the operation, the input data and the result belong to the posit or the integer register file.

The quire register is the most singular addition to this number format. According to the posit standard, it must be an architectural register accessible by the programmer that is also allowed to be dumped into memory. However, being so wide, the cost of including this functionality into the core’s datapath could be too high for the benefits it would add. In the vast majority of cases, the quire is used as an accumulator to avoid overflows in the MAC operations, and this does not require quire load and store operations. Instead, we can initialize the quire to zero (QCLR.S), negate it if needed (QNEG.S), accumulate the partial products in it without rounding or storing in memory (QMADD.S and QMSUB.S), and, when the whole operation is finished, round and output the result (QROUND.S). The necessary support for all of these operations related to the quire is included in our proposal (see Table II below). The hardware cost of including the quire as an internal register in the PAU is studied in Section 6.

4.2 Core Integration

The proposed PAU has been integrated into the CVA6 RV64GC core while maintaining the compatibility with all existing extensions, including single and double-precision floating point. Moreover, since we work with Posit32 numbers, i.e. Posit$\langle 32,2 \rangle$, the core adds a 32-bit posit register file in addition to the integer and floating-point registers.

The instruction decoder has been extended to support posit instructions. The inner workings of the decoder are described in Figure 3. As part of the decoding process, each posit instruction selects from which register file it must obtain its operands and to which register file it must forward its result.

Input:  Instruction to decode instr.
Output:  Scoreboard entry sc_instr which contains the operation op and the destination functional unit fu.
  switch (instr.opcode)
  
  case POSIT:
     switch (instr.func3)
     case 000: {Computational posit instruction}
        switch (instr.func5)
        case 00000: {PAU instruction}
           sc_instr.fu = PAU
           sc_instr.op = PADD
        
        case 00100: {ALU instruction}
           sc_instr.fu = ALU
           sc_instr.op = PMIN
        
        end switch
     case 001: {Posit load instruction}
        sc_instr.fu = LOAD
        sc_instr.op = PLW
     case 011: {Posit store instruction}
        sc_instr.fu = STORE
        sc_instr.op = PSW
     end switch
  
  default: {Instruction not decoded in any switch/case}
     illegal_instr = true
  end switch
Fig. 3: Pseudocode describing the decoding of posit instructions.
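The decoding flow of Figure 3 can be mirrored in software. The C sketch below is only an illustration of the nested switch structure, not the SystemVerilog decoder of the core: the type and function names are hypothetical, only the four cases shown in the figure are handled, and every non-POSIT opcode is simply reported as illegal. The field positions (opcode in bits 6-0, funct3 in bits 14-12, the 5-bit function field in bits 31-27) follow Table II.

```c
#include <stdbool.h>
#include <stdint.h>

#define OPCODE_POSIT 0x0Bu   /* 0001011, custom-0 */

typedef enum { POSIT_FU_NONE, POSIT_FU_ALU, POSIT_FU_PAU,
               POSIT_FU_LOAD, POSIT_FU_STORE } posit_fu;
typedef enum { POSIT_OP_NONE, POSIT_OP_PADD, POSIT_OP_PMIN,
               POSIT_OP_PLW, POSIT_OP_PSW } posit_op;

/* Simplified scoreboard entry: target functional unit and operation. */
typedef struct {
    posit_fu fu;
    posit_op op;
    bool illegal_instr;
} sb_entry;

static sb_entry decode_posit(uint32_t instr)
{
    sb_entry e = { POSIT_FU_NONE, POSIT_OP_NONE, false };
    uint32_t opcode = instr & 0x7Fu;
    uint32_t funct3 = (instr >> 12) & 0x7u;
    uint32_t funct5 = (instr >> 27) & 0x1Fu;

    if (opcode != OPCODE_POSIT) {        /* other opcodes are out of scope here */
        e.illegal_instr = true;
        return e;
    }
    switch (funct3) {
    case 0x0:                            /* computational posit instruction */
        switch (funct5) {
        case 0x00: e.fu = POSIT_FU_PAU; e.op = POSIT_OP_PADD; break;
        case 0x04: e.fu = POSIT_FU_ALU; e.op = POSIT_OP_PMIN; break;
        default:   e.illegal_instr = true; break;  /* cases elided in Fig. 3 */
        }
        break;
    case 0x1: e.fu = POSIT_FU_LOAD;  e.op = POSIT_OP_PLW; break;
    case 0x3: e.fu = POSIT_FU_STORE; e.op = POSIT_OP_PSW; break;
    default:  e.illegal_instr = true; break;
    }
    return e;
}
```

The split between `POSIT_FU_PAU` and `POSIT_FU_ALU` reflects the point made above: comparison-like operations such as PMIN are routed to the integer ALU, while arithmetic such as PADD goes to the PAU.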

The CVA6 core uses scoreboarding for dynamically scheduled instructions and allows out-of-order write-back of each functional unit. The scoreboard tracks which instructions are issued, their functional unit and in which register they will write back to. Our design has enlarged the scoreboard to include posit registers and instructions. In this manner, we can discern whether the input data of posit operations are retrieved from a register or forwarded directly as a result of a previous operation.

As mentioned in Section 2.1, posit numbers have the benefit of being able to reuse the comparison hardware of 2’s complement signed integers. Therefore, the integer ALU has also been extended to accept posit operands and to be able to forward the result of these instructions with minimal hardware overhead. Furthermore, the PAU has been integrated into the execution phase of the processor in parallel to the ALU and the FPU, connecting the issue module with the aforementioned scoreboard. Finally, the complete datapath has been adapted to include the posit signals and all necessary additional interconnections.

5 Compiler Support: Xposit extension

The assembly output of a RISC-V compiler when processing programs that use floating-point arithmetic includes instructions from the corresponding F and D extensions. To produce a similar output but targeting posit numbers, a new extension must be introduced that translates posit instructions and posit operators to binary code. Therefore, in this section the Xposit RISC-V extension targeting posit arithmetic is presented. As part of this work, Xposit has been integrated into LLVM 12 backend [13] to allow the compilation of high-level applications, an example of which is shown in Section 7.

The posit instruction set follows the structure of the F RISC-V standard extension for single-precision floating point [24]. This Xposit extension mostly follows the adaptation to the posit format proposed in [9]. The differences with this proposal are the following:

  • We include 32 posit registers p0-31 as in the F standard extension.

  • Similarly to the integer operations in CVA6, there is no flag signaling division by zero.

  • We do not include the possibility of loading and storing the quire in memory.

The Xposit extension uses the 0001011 opcode (custom-0), occupying the space indicated in Table I as POSIT. If more operations were needed in the future, especially posit load and store instructions of other word-lengths, the 0101011, 1011011 and 1111011 opcodes (custom-1,2,3) could be leveraged. In this way, a similar approach as the F and D RISC-V extensions could be followed, which utilize the OP-FP, LOAD-FP and STORE-FP opcodes.

| inst[6:5] \ inst[4:2] | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 (> 32b) |
|---|---|---|---|---|---|---|---|---|
| 00 | LOAD | LOAD-FP | POSIT | MISC-MEM | OP-IMM | AUIPC | OP-IMM-32 | 48b |
| 01 | STORE | STORE-FP | custom-1 | AMO | OP | LUI | OP-32 | 64b |
| 10 | MADD | MSUB | NMSUB | NMADD | OP-FP | reserved | custom-2/rv128 | 48b |
| 11 | BRANCH | JALR | reserved | JAL | SYSTEM | reserved | custom-3/rv128 | ≥ 80b |

TABLE I: RISC-V base opcode map + POSIT extension; inst[1:0]=11
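The 7-bit opcodes quoted in the text can be derived from the row and column coordinates of the opcode map. The helper below is a small sketch of that composition; the function name is hypothetical.

```c
#include <stdint.h>

/* Build a 7-bit major opcode from its opcode-map coordinates:
   row = inst[6:5], column = inst[4:2], and inst[1:0] fixed to 11. */
static uint8_t major_opcode(unsigned row_6_5, unsigned col_4_2)
{
    return (uint8_t)(((row_6_5 & 0x3u) << 5) | ((col_4_2 & 0x7u) << 2) | 0x3u);
}
```

For example, `major_opcode(0, 2)` gives 0x0B (0001011, the POSIT slot), and the remaining custom slots at (1, 2), (2, 6) and (3, 6) give 0101011, 1011011 and 1111011.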

The format and fields of the Xposit instructions are described in Figure 4. Posit load and store use the same base+offset addressing as the corresponding floating-point instructions, with the base address in register rs1 and a signed 12-bit byte offset. Thus, the PLW instruction loads a posit value from memory to the rd posit register and the PSW instruction stores a posit value from the rs2 posit register to memory. The rest of the Xposit operations keep the POSIT opcode and differ from the previous instructions by the funct3 field. Finally, it must be noted that the fmt field is fixed to 01 indicating that the instructions are for single-precision (32-bit) posits. The complete instruction set of the proposed Xposit RISC-V extension is detailed in Table II.

Fig. 4: Internal structure and fields of Xposit instructions.
| 31-27 | 26-25 | 24-20 | 19-15 | 14-12 | 11-7 | 6-0 | Instruction |
|---|---|---|---|---|---|---|---|
| imm[11:0] (bits 31-20) | | | rs1 | 001 | rd | 0001011 | PLW |
| imm[11:5] (bits 31-25) | | rs2 | rs1 | 011 | imm[4:0] | 0001011 | PSW |
| 00000 | 10 | rs2 | rs1 | 000 | rd | 0001011 | PADD.S |
| 00001 | 10 | rs2 | rs1 | 000 | rd | 0001011 | PSUB.S |
| 00010 | 10 | rs2 | rs1 | 000 | rd | 0001011 | PMUL.S |
| 00011 | 10 | rs2 | rs1 | 000 | rd | 0001011 | PDIV.S |
| 00100 | 10 | rs2 | rs1 | 000 | rd | 0001011 | PMIN.S |
| 00101 | 10 | rs2 | rs1 | 000 | rd | 0001011 | PMAX.S |
| 00110 | 10 | 00000 | rs1 | 000 | rd | 0001011 | PSQRT.S |
| 00111 | 10 | rs2 | rs1 | 000 | 00000 | 0001011 | QMADD.S |
| 01000 | 10 | rs2 | rs1 | 000 | 00000 | 0001011 | QMSUB.S |
| 01001 | 10 | 00000 | 00000 | 000 | 00000 | 0001011 | QCLR.S |
| 01010 | 10 | 00000 | 00000 | 000 | 00000 | 0001011 | QNEG.S |
| 01011 | 10 | 00000 | 00000 | 000 | rd | 0001011 | QROUND.S |
| 01100 | 10 | 00000 | rs1 | 000 | rd | 0001011 | PCVT.W.S |
| 01101 | 10 | 00000 | rs1 | 000 | rd | 0001011 | PCVT.WU.S |
| 01110 | 10 | 00000 | rs1 | 000 | rd | 0001011 | PCVT.L.S |
| 01111 | 10 | 00000 | rs1 | 000 | rd | 0001011 | PCVT.LU.S |
| 10000 | 10 | 00000 | rs1 | 000 | rd | 0001011 | PCVT.S.W |
| 10001 | 10 | 00000 | rs1 | 000 | rd | 0001011 | PCVT.S.WU |
| 10010 | 10 | 00000 | rs1 | 000 | rd | 0001011 | PCVT.S.L |
| 10011 | 10 | 00000 | rs1 | 000 | rd | 0001011 | PCVT.S.LU |
| 10100 | 10 | rs2 | rs1 | 000 | rd | 0001011 | PSGNJ.S |
| 10101 | 10 | rs2 | rs1 | 000 | rd | 0001011 | PSGNJN.S |
| 10110 | 10 | rs2 | rs1 | 000 | rd | 0001011 | PSGNJX.S |
| 10111 | 10 | 00000 | rs1 | 000 | rd | 0001011 | PMV.X.W |
| 11000 | 10 | 00000 | rs1 | 000 | rd | 0001011 | PMV.W.X |
| 11001 | 10 | rs2 | rs1 | 000 | rd | 0001011 | PEQ.S |
| 11010 | 10 | rs2 | rs1 | 000 | rd | 0001011 | PLT.S |
| 11011 | 10 | rs2 | rs1 | 000 | rd | 0001011 | PLE.S |

TABLE II: Instruction set of the proposed Xposit RISC-V extension.
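To make the bit layout of Table II concrete, the following C helpers assemble 32-bit instruction words from their fields. They are a sketch under stated assumptions: the function names are hypothetical, register numbers are passed as plain integers, and the 2-bit field in bits 26-25 is left as a parameter rather than hard-coded.

```c
#include <stdint.h>

#define XPOSIT_OPCODE 0x0Bu   /* 0001011, custom-0 */

/* Register-register format: bits 31-27 function, 26-25 fmt, then rs2, rs1,
   funct3 = 000, rd and the opcode. */
static uint32_t xposit_rtype(uint32_t funct5, uint32_t fmt,
                             uint32_t rs2, uint32_t rs1, uint32_t rd)
{
    return ((funct5 & 0x1Fu) << 27) | ((fmt & 0x3u) << 25) |
           ((rs2 & 0x1Fu) << 20) | ((rs1 & 0x1Fu) << 15) |
           (0x0u << 12) | ((rd & 0x1Fu) << 7) | XPOSIT_OPCODE;
}

/* PLW rd, imm(rs1): 12-bit signed byte offset in bits 31-20, funct3 = 001. */
static uint32_t xposit_plw(uint32_t rd, uint32_t rs1, int32_t imm)
{
    uint32_t imm12 = (uint32_t)imm & 0xFFFu;
    return (imm12 << 20) | ((rs1 & 0x1Fu) << 15) | (0x1u << 12) |
           ((rd & 0x1Fu) << 7) | XPOSIT_OPCODE;
}

/* PSW rs2, imm(rs1): offset split into imm[11:5] (bits 31-25) and
   imm[4:0] (bits 11-7), funct3 = 011. */
static uint32_t xposit_psw(uint32_t rs2, uint32_t rs1, int32_t imm)
{
    uint32_t imm12 = (uint32_t)imm & 0xFFFu;
    return ((imm12 >> 5) << 25) | ((rs2 & 0x1Fu) << 20) |
           ((rs1 & 0x1Fu) << 15) | (0x3u << 12) |
           ((imm12 & 0x1Fu) << 7) | XPOSIT_OPCODE;
}
```

For example, a PLW into posit register 1 from offset 4 of integer register 2, `xposit_plw(1, 2, 4)`, yields the word 0x0041108B, and the PADD.S row of the table with rs2 = 3, rs1 = 2 and rd = 1, `xposit_rtype(0x00, 0x2, 3, 2, 1)`, yields 0x0431008B.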

6 Synthesis Results

In this section, we present the FPGA and ASIC synthesis results of PERCIVAL. The details of its PAU and the IEEE 754 FPU using 32 and 64-bit formats are also included. In this manner, the hardware cost of posit numbers and the quire are highlighted and compared with other implementations.

6.1 FPGA Synthesis

The FPGA synthesis was performed using Vivado v.2020.2 targeting a Genesys II (Xilinx Kintex-7 XC7K325T-2FFG900C) FPGA. Different configurations of FPU and PAU were tested, the results of which are shown in Table III. Since the critical path does not traverse the arithmetic units of the core, in all of the cases the timing constraint of 20ns was met and the timing slack was +0.177ns.

| PAU | FPU | Total core (LUT, FF) | FPU area (LUT, FF) | PAU area (LUT, FF) |
|---|---|---|---|---|
| Yes | F | (50318, 25727) | (3726, 1008) | (11796, 2979) |
| Yes | D | (55900, 27652) | (6352, 1905) | (11810, 2979) |
| Yes | FD | (57129, 27996) | (7612, 2245) | (11803, 2979) |
| Yes | - | (44693, 23636) | - | (11879, 2985) |
| No | F | (35402, 21618) | (4046, 973) | - |
| No | D | (40740, 23599) | (6626, 1905) | - |
| No | FD | (41260, 23945) | (8163, 2244) | - |
| No | - | (28950, 19579) | - | - |

TABLE III: Comparison of FPGA synthesis results with different configurations of FPU, marked as F and D for 32 and 64-bit numbers respectively, and 32-bit PAU with quire.

The bare CVA6 without a FPU or PAU requires 28950 LUT and 19579 FF. Including support for 32-bit floating-point numbers increases the number of LUT and FF by 6452 and 2039 respectively. This difference grows to 12310 LUT and 4366 FF when using also the double-precision D extension. Note that these values are larger than simply the FPU area, since they also include other elements such as the floating-point register file, instruction decoding and interconnections. These other non-FPU elements require 2406 LUT and 1066 FF in the 32-bit case and 4147 LUT and 2122 FF in the 64-bit case.

Comparing the overall cost of including posit support with the cost of including IEEE floating-point support, a significant difference can be seen. Adding 32-bit posit operations and quire support to the CVA6 requires 15743 LUT and 4057 FF, which is comparable to the FD floating-point configuration. Out of this area, 3864 LUT and 1072 FF are occupied by the non-PAU elements mentioned in the previous floating-point analysis.

The synthesis results reveal that the PAU requires significantly more resources than the FPU available in the CVA6. In particular, the 32-bit PAU with quire occupies 2.94 times as many LUT and 3.07 times as many FF as the 32-bit FPU. To better understand these results, in Table IV the area requirements of the different modules inside the PAU are presented. The most interesting value shown in this table is the area occupied by the posit MAC unit, which corresponds to almost half of the total area of the PAU.

When compared with the floating-point units, which do not include an accumulation register, the area requirements of the quire could be separated. Thus, the posit MAC and the quire rounding to posit can be subtracted from the total PAU area to obtain a value of 5326 LUT and 1312 FF. This outcome is now much closer to the synthesis results of the FPU, as the PAU without quire occupies 1.32 times as many LUT and 1.35 times as many FF. These results match previous works [3], where authors also report an increase of around 30% in FPGA resources when comparing their 32-bit PAU without quire with a 32-bit FPU.
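As a check on the ratios quoted in this paragraph, the PAU figures without quire can be divided by the 32-bit FPU area from Table III (4046 LUT and 973 FF, in the configuration without PAU):

```latex
\[
\frac{5326\ \text{LUT}}{4046\ \text{LUT}} \approx 1.32,
\qquad
\frac{1312\ \text{FF}}{973\ \text{FF}} \approx 1.35
\]
```

The same denominators applied to the full PAU with quire (11879 LUT, 2985 FF) give the factors of 2.94 and 3.07 reported earlier.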

In our case, the actual value of not including a quire would be even smaller, as the cost of allocating the 512-bit quire in the PAU and computing its 2’s complement, which are included in the PAU top, should also be subtracted. However, the synthesis tool does not include these details.

Name LUTs FFs
PAU top
Posit Add
Posit Mult
Posit ADiv
Posit ASqrt
Posit MAC
Quire to Posit
Int to Posit
Long to Posit
ULong to Posit
Posit to Int
Posit to Long
Posit to UInt
Posit to ULong
PAU total
PAU w/o quire
TABLE IV: FPGA synthesis area results of the PAU broken down into its individual components.

6.2 ASIC Synthesis

The 32-bit PAU with quire and the 32-bit FPU configuration present in PERCIVAL were synthesised targeting TSMC’s 45nm standard-cell library to further study their hardware cost in ASIC. The synthesis was performed using Synopsys Design Compiler with a timing constraint of 5ns, which was met in both cases, and a toggle rate of 0.1.

The 32-bit FPU within CVA6 requires an area of 30691 µm² and consumes 27.26 mW of power. On the other hand, the 32-bit PAU with quire requires an area of 76970 µm² and consumes 67.73 mW of power. This follows the same trend shown in the FPGA synthesis, as the PAU with quire is significantly larger, 2.51x, and consumes more power, 2.48x, than the FPU.
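The two factors follow directly from the reported area and power values:

```latex
\[
\frac{76970}{30691} \approx 2.51 \quad \text{(area)},
\qquad
\frac{67.73\ \text{mW}}{27.26\ \text{mW}} \approx 2.48 \quad \text{(power)}
\]
```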

In addition, to better assess these values in comparison with other proposals, the PAU available in CLARINET [22] was also synthesized with the same parameters. We have chosen to evaluate this work because it integrates, to the best of our knowledge, the only other PAU that contains a quire. In this case, the 32-bit PAU with quire requires an area of 69920 µm² and consumes 68.31 mW of power. This is a decrease of around 10% in area and a slight increase in power compared to our proposal. It must be noted that the PAU available in CLARINET does not include full posit support, as it can only execute fused MAC and fused divide and accumulate with quire, as well as conversions from posits to IEEE floats and vice versa. The rest of the operations are performed in IEEE floating-point format, and the values are converted to posit when using the quire.

As in Section 6.1, the area and power results of the different elements inside the PAU are presented in Table V. As can be seen, when subtracting the cost of the quire in the PAU, the outcome is still higher than the 32-bit FPU, but it becomes much closer. The 32-bit PAU occupies 1.32 times as much area and consumes 1.38 times as much power as the 32-bit IEEE FPU. However, it is noteworthy that, while posit arithmetic and the design of its functional units are relatively new, floating-point units have been enhanced and optimized for decades.

Name Area (µm²) Power (mW)
PAU top
Posit Add
Posit Mult
Posit ADiv
Posit ASqrt
Posit MAC
Quire to Posit
Int to Posit
Long to Posit
UInt to Posit
ULong to Posit
Posit to Int
Posit to Long
Posit to UInt
Posit to ULong
PAU total
PAU w/o quire
CLARINET PAU
TABLE V: ASIC synthesis area and power results of the 32-bit PAU with quire disaggregated into its individual components.

7 Posit vs IEEE-754 Benchmarks

One of the benefits of PERCIVAL is that an accurate and fair comparison can be made between posit and IEEE floating point. The main advantage of having support for native posit and IEEE floating point simultaneously on the same core is that identical benchmarks can be run on both number representations to compare them. In this work we have chosen to benchmark the GEMM and the max-pooling layer, used to down-sample the representation of neural networks. These examples showcase the use of the quire and posits both in the PAU and in the ALU, loading and storing from memory and leveraging the posit register file.

The GEMM and max-pooling codes for posits and IEEE floats have been written in C, including inline assembly for the required posit and float instructions. The floating-point code has also been written in inline assembly to provide exactly the same sequence of instructions to the core. The GEMM code for floats is shown in Figure 5 and the analogous version for posits is shown in Figure 6. These codes have been compiled using the modified version of LLVM with the Xposit RISC-V extension as specified in Section 5, and serve as an example of how this extension can be leveraged. Therefore, the final target architecture is RV64GCXposit. The -O2 optimization flag has been used to obtain an optimized code in every case.

Input:  Float matrices a and b of size n×n.
Output: Float matrix c = ab.
  for i = 0 to n-1 do
     for j = 0 to n-1 do
        asm("fmv.w.x ft0,zero":::); {Set ft0 to 0}
        for k = 0 to n-1 do
           asm
            "flw    ft1,0(%0)" {Load float a and b}
            "flw    ft2,0(%1)"
            "fmadd.s ft0,ft1,ft2,ft0" {Accumulate on ft0}
            :: "r" (&a[i * n + k]), "r" (&b[k * n + j]):
           end asm
        end for
        asm
         "fsw ft0,0(%1)" {Store the result in c}
         : "=rm" (c[i * n + j]) : "r" (&c[i * n + j]):
        end asm
     end for
  end for
Fig. 5: 32-bit floating-point GEMM using the F RISC-V extension.

Input:  Posit matrices a and b of size n×n.
Output: Posit matrix c = ab.
  for i = 0 to n-1 do
     for j = 0 to n-1 do
        asm("qclr.s":::); {Clear the quire}
        for k = 0 to n-1 do
           asm
            "plw   pt0,0(%0)" {Load posit a and b}
            "plw   pt1,0(%1)"
            "qmadd.s pt0,pt1" {Accumulate on the quire}
            :: "r" (&a[i * n + k]), "r" (&b[k * n + j]):
           end asm
        end for
        asm
         "qround.s pt2" {Round the quire to a posit}
         "psw   pt2,0(%1)" {Store the result in c}
         : "=rm" (c[i * n + j]) : "r" (&c[i * n + j]) :
        end asm
     end for
  end for
Fig. 6: Posit GEMM using the Xposit RISC-V extension with the quire accumulator.
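The difference between the two listings lies in where rounding happens: Figure 5 rounds the accumulator at every step, whereas Figure 6 rounds only once, when the quire is converted back to a posit. The portable C sketch below illustrates this idea on ordinary floats and is not the code run on PERCIVAL. As an assumption made purely for illustration, a `double` accumulator stands in for the quire; a real quire is an exact wide fixed-point register, which a `double` only approximates. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulator kept in 32-bit precision: rounding error enters at every step,
   similarly to the per-iteration fmadd.s of Figure 5. */
void gemm_round_each_step(const float *a, const float *b, float *c, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; k++) {
                acc += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* Deferred rounding: accumulate in a wider type and round to 32 bits once per
   output element. The double accumulator is only a stand-in for the quire. */
void gemm_round_once(const float *a, const float *b, float *c, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = 0.0;
            for (size_t k = 0; k < n; k++) {
                acc += (double)a[i * n + k] * (double)b[k * n + j];
            }
            c[i * n + j] = (float)acc; /* single rounding, like qround.s */
        }
    }
}
```

For instance, on a platform with IEEE 754 single-precision floats and round-to-nearest-even, the dot product 2^24·1 + 1·1 + 1·1 evaluates to 16777216 with the first function and to the exact value 16777218 with the second.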

7.1 Accuracy

The accuracy differences between posits and floats are studied for the GEMM benchmark. The results obtained using 64-bit IEEE 754 format are considered as the golden solution and used to compute the MSE of the 32-bit posit and the 32-bit IEEE 754 floating point. In all cases the inputs are square matrices with the same random values, which are generated from a uniform distribution between -2 and 2. This interval is chosen because it is part of the “golden zone” [5], where posits are more accurate than floats thanks to their tapered precision. These random values are generated as 64-bit IEEE 754 numbers and then converted to the two other formats with the aid of the SoftPosit [14] library.
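A minimal sketch of this evaluation procedure is given below, under stated assumptions: inputs are drawn with the C standard `rand()` (the generator actually used is not specified here), and the 32-bit result is passed as `float`. For posits, the results would first be converted to a type that can be compared against the 64-bit reference, for example with SoftPosit. The names `uniform_sample` and `mse_vs_golden` are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Draw one value uniformly from [lo, hi]; rand() is an assumed generator. */
double uniform_sample(double lo, double hi) {
    assert(lo < hi);
    return lo + (hi - lo) * ((double)rand() / (double)RAND_MAX);
}

/* Mean squared error of a 32-bit result against the 64-bit golden solution. */
double mse_vs_golden(const double *golden, const float *result, size_t count) {
    assert(golden != NULL && result != NULL && count > 0);
    double sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        double diff = (double)result[i] - golden[i];
        sum += diff * diff;
    }
    return sum / (double)count;
}
```

Calling `uniform_sample(-2.0, 2.0)` for every matrix entry reproduces the input interval described above, and `mse_vs_golden` is then evaluated over all n×n entries of the output matrix.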

The MSE results are shown in Table VI for different matrix sizes. As can be seen from each row, the calculations are more accurate when using posit numbers, and the difference between the MSE values reaches around four orders of magnitude. Furthermore, if we compare how this error scales when increasing the size of the operands, it can be seen that posit numbers present a better behavior thanks to the quire register.

This is in line with our previous work [19], where a similar benchmark was performed using hardware simulations. The MSE results for 32-bit floats and posits match the results given in Table VI, albeit with small deviations due to the randomness of the input values.

Matrix size IEEE 754 Posit32
TABLE VI: GEMM MSE comparison between IEEE 754 floating-point and posit numbers.

7.2 Performance

Besides the synthesis data presented in Section 6, the execution time is a critical metric to study the hardware performance of posits and floats. The test has been performed executing the same GEMM and max-pooling described previously on PERCIVAL, avoiding cold misses and averaging over 10 executions to obtain more accurate measurements.
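The measurement methodology can be sketched as follows. This is a hedged illustration rather than the harness used on PERCIVAL: it assumes that an untimed warm-up run is what avoids cold misses, and it uses the C standard `clock()` as the timer, whereas the actual measurements may rely on hardware cycle counters. The names `kernel_fn`, `count_calls` and `average_seconds` are hypothetical.

```c
#include <assert.h>
#include <time.h>

/* A benchmark kernel taking an opaque context pointer. */
typedef void (*kernel_fn)(void *ctx);

/* Trivial example kernel: counts how many times it has been invoked. */
void count_calls(void *ctx) {
    int *calls = (int *)ctx;
    (*calls)++;
}

/* Average wall time per run in seconds, after one untimed warm-up run. */
double average_seconds(kernel_fn kernel, void *ctx, int runs) {
    assert(kernel != NULL && runs > 0);
    kernel(ctx); /* warm-up: brings code and data into the caches */
    clock_t start = clock();
    for (int r = 0; r < runs; r++) {
        kernel(ctx);
    }
    clock_t end = clock();
    return ((double)(end - start) / (double)CLOCKS_PER_SEC) / (double)runs;
}
```

With `runs` set to 10, the kernel is invoked 11 times in total and only the last 10 invocations contribute to the reported average.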

Matrix size   32-bit float   64-bit float   Posit32    VividSparks Posit32 (no quire)
              0.552 ms       0.559 ms       0.651 ms   7.95 ms
              6.49 ms        6.51 ms        7.18 ms    48.9 ms
              51.7 ms        72.1 ms        57.5 ms    345 ms
              1.60 s         1.89 s         1.60 s     2.63 s
              15.2 s         16.2 s         15.1 s     21.1 s
TABLE VII: GEMM timing comparison between IEEE 754 floating-point and posit numbers.

As shown in Table VII, the execution time of 32-bit posits is practically the same as that of single-precision floats for the larger matrix sizes, where the overhead of the extra qround.s instruction becomes negligible (see Figure 6). This instruction is executed on the order of n^2 times, compared with the O(n^3) running time of the algorithm. This cost is noticeable for smaller values of n, where 32-bit posits are slightly slower than 32-bit and 64-bit floats. However, for larger matrix sizes, which are common in scientific applications and in DNNs, 32-bit posits perform on par with 32-bit floats and outperform 64-bit floats, since the 64-bit float instructions require more clock cycles to compute: the 64-bit float fused MAC unit has a latency of 3 cycles, whereas the 32-bit float and Posit32 units have a latency of 2 cycles. Furthermore, as seen in the previous accuracy benchmark, 32-bit posits are orders of magnitude more accurate than 32-bit floats when performing this calculation. Therefore, they provide an alternative solution for the execution of kernels that make use of the dot product.
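The relative weight of the rounding instruction can be checked by counting instructions over the loop nest of Figure 6: qmadd.s sits in the innermost loop, while qround.s is issued once per output element. The sketch below only counts iterations and executes no posit instructions; `gemm_counts` and `count_quire_ops` are illustrative names.

```c
#include <assert.h>
#include <stdint.h>

/* Number of quire instructions issued by the loop nest of Figure 6. */
typedef struct {
    uint64_t qmadd;  /* fused multiply-accumulate on the quire */
    uint64_t qround; /* rounding of the quire to a posit */
} gemm_counts;

gemm_counts count_quire_ops(uint64_t n) {
    assert(n > 0);
    gemm_counts counts = {0, 0};
    for (uint64_t i = 0; i < n; i++) {
        for (uint64_t j = 0; j < n; j++) {
            for (uint64_t k = 0; k < n; k++) {
                counts.qmadd++;  /* innermost loop: n^3 executions */
            }
            counts.qround++;     /* once per output element: n^2 executions */
        }
    }
    return counts;
}
```

For n = 16 this gives 4096 qmadd.s and 256 qround.s instructions, so the rounding step accounts for one instruction in every n accumulations and its share shrinks as the matrices grow.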

Additionally, for the sake of completeness, we have performed the same GEMM timing test on a commercial core with support for posit arithmetic. RacEr is a GPGPU FPGA provided by VividSparks that supports computation with Posit32, but does not include quire support. It has 512 CPUs running at 300MHz with 32GB of DDR4 RAM. Table VII also includes the results of the GEMM benchmark on this platform. As can be seen, our proposal provides significantly faster results than this commercial accelerator.

Regarding the max-pooling layers, three different configurations have been tested, following common DNNs. In LeNet-5, the input of this layer is 28x28x6, the pooling kernel is 2x2 and is applied with a stride of 2, creating a 14x14x6 output representation. In AlexNet, the input size is 54x54x96, the kernel size is 3x3 and is applied with a stride of 2, generating an output of size 26x26x96. Finally, ResNet-50 is the largest configuration we have tested, as its input is 112x112x64, the pooling kernel is 3x3 and again is applied with a stride of 2, creating a 55x55x64 output representation.
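The output sizes listed above are consistent with the usual formula for pooling without padding, floor((input - kernel) / stride) + 1 along each spatial dimension. The absence of padding is an assumption inferred from the stated sizes, and `pooled_size` is an illustrative helper.

```c
#include <assert.h>
#include <stddef.h>

/* Output length along one spatial dimension of a pooling layer,
   assuming no padding: floor((input - kernel) / stride) + 1. */
size_t pooled_size(size_t input, size_t kernel, size_t stride) {
    assert(kernel >= 1 && stride >= 1 && input >= kernel);
    return (input - kernel) / stride + 1; /* integer division floors */
}
```

This yields 14 for the LeNet-5 configuration (28, 2, 2), 26 for AlexNet (54, 3, 2) and 55 for ResNet-50 (112, 3, 2); the number of channels is unchanged by pooling.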

The results of executing these layers on PERCIVAL using the 32-bit and 64-bit IEEE floating-point and Posit32 representations are shown in Table VIII. Results show that 32-bit posits perform as fast as 32-bit floats but without the need for extra hardware, as the posit maximum operation is carried out reusing the integer ALU. Double-precision floats are slower than 32-bit posits and floats by a factor of 1.4-1.7 due to the latency difference in the units, as seen in the GEMM benchmark.
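The integer ALU can be reused because posit bit patterns are ordered like two's-complement signed integers, so the maximum of two posits is the maximum of their encodings compared as signed integers. The following sketch illustrates this property in portable C on raw 32-bit encodings; it is not the PERCIVAL implementation, and the names `posit32_bits`, `posit32_max` and `posit32_max_pool_window` are hypothetical. Note that under this ordering NaR (0x80000000) compares below every other value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Raw 32-bit posit encoding. */
typedef uint32_t posit32_bits;

/* Maximum of two posits via a signed integer comparison. The cast to int32_t
   relies on two's-complement conversion, as on all mainstream compilers. */
posit32_bits posit32_max(posit32_bits x, posit32_bits y) {
    return ((int32_t)x > (int32_t)y) ? x : y;
}

/* Maximum over a kernel x kernel window with top-left corner (row, col)
   in a row-major channel that is `width` elements wide. */
posit32_bits posit32_max_pool_window(const posit32_bits *in, size_t width,
                                     size_t row, size_t col, size_t kernel) {
    assert(in != NULL && kernel >= 1 && col + kernel <= width);
    posit32_bits best = in[row * width + col];
    for (size_t r = 0; r < kernel; r++) {
        for (size_t c = 0; c < kernel; c++) {
            best = posit32_max(best, in[(row + r) * width + (col + c)]);
        }
    }
    return best;
}
```

As an example, with the 32-bit posit encodings 0x40000000 (1.0), 0x48000000 (2.0), 0xC0000000 (-1.0) and 0xB8000000 (-2.0), `posit32_max` selects 2.0 over 1.0 and -1.0 over -2.0 without decoding any field of the posit.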

Max-pooling layer        32-bit float   64-bit float   Posit32
LeNet-5 (28x28x6)        0.715 ms       1.211 ms       0.688 ms
AlexNet (54x54x96)       0.115 ms       0.160 ms       0.116 ms
ResNet-50 (112x112x64)   0.337 ms       0.470 ms       0.340 ms
TABLE VIII: Max-pooling timing comparison between IEEE 754 floating-point and posit numbers.

8 Conclusions

This paper has presented PERCIVAL, an extension of the application-level CVA6 RISC-V core, including all 32-bit posit instructions as well as the quire fused operations. These capabilities, integrated into a PAU together with a posit register file, are natively incorporated while preserving IEEE 754 single and double-precision floats.

Furthermore, the RISC-V ISA has been extended with Xposit, which includes support for all posit and quire instructions. This allows the compilation and execution on PERCIVAL of application-level programs that make use of posits and floats simultaneously. To the best of our knowledge, this is the first work that enables complete posit and quire capabilities in hardware.

Synthesis results show that half the area dedicated to the PAU is occupied by the quire and its operations. When comparing with the only previous work which includes quire capabilities [22], our proposal consumes slightly less power and only 10% more area, while also providing full support for posit operations. When focusing on the 32-bit PAU without the quire, our proposal requires 32% more area and 38% more power than the 32-bit FPU. This is in line with the results of recent works which reuse the F RISC-V extension [3], where the authors obtain a 30% increase in FPGA resources when comparing their PAU to the FPU.

The Posit vs IEEE-754 comparison benchmark results show that 32-bit posits are up to 4 orders of magnitude more accurate than 32-bit floats when calculating the GEMM, thanks to the quire. Moreover, they do not show a performance degradation compared with floats, thus providing a potential alternative when operating with real numbers. In addition, our proposal performs significantly better than available commercial solutions, obtaining up to a 10x speedup when multiplying small matrices.

As future work, we plan to implement and evaluate on PERCIVAL large-scale scientific applications which make use of dot products, leveraging the accuracy gains of fused operations.

Acknowledgments

This work was supported by a 2020 Leonardo Grant for Researchers and Cultural Creators, from BBVA Foundation, whose id is PR2003_20/01, by the EU(FEDER) and the Spanish MINECO under grant RTI2018-093684-B-I00, and by the CM under grant S2018/TCS-4423.

References

  • [1] M. V. Arunkumar, S. G. Bhairathi, and H. G. Hayatnagarkar (2020) PERC: posit Enhanced Rocket Chip. In 4th Workshop on Computer Architecture Research with RISC-V (CARRV’20), pp. 8. Cited by: §3.
  • [2] R. Chaurasiya, J. Gustafson, R. Shrestha, J. Neudorfer, S. Nambiar, K. Niyogi, F. Merchant, and R. Leupers (2018-10) Parameterized Posit Arithmetic Hardware Generator. In 2018 IEEE 36th International Conference on Computer Design (ICCD), pp. 334–341. External Links: ISSN 2576-6996, Document Cited by: §3.
  • [3] S. D. Ciocirlan, D. Loghin, L. Ramapantulu, N. Tapus, and Y. M. Teo (2021-09) The Accuracy and Efficiency of Posit Arithmetic. arXiv:2109.08225 [cs]. External Links: 2109.08225 Cited by: §3, §6.1, §8.
  • [4] M. Cococcioni, F. Rossi, E. Ruffaldi, and S. Saponara (2021-10) A Lightweight Posit Processing Unit for RISC-V Processors in Deep Neural Network Applications. IEEE Transactions on Emerging Topics in Computing (01), pp. 1–1. External Links: ISSN 2168-6750, Document Cited by: §3.
  • [5] F. de Dinechin, L. Forget, J. Muller, and Y. Uguen (2019) Posits: the good, the bad and the ugly. In Proceedings of the Conference for next Generation Arithmetic 2019, CoNGA’19, New York, NY, USA. External Links: Document, ISBN 978-1-4503-7139-1 Cited by: §1, §2.1, §7.1.
  • [6] A. Dörflinger, M. Albers, B. Kleinbeck, Y. Guan, H. Michalik, R. Klink, C. Blochwitz, A. Nechi, and M. Berekovic (2021) A comparative survey of open-source application-class RISC-V processor implementations. In Proceedings of the 18th ACM International Conference on Computing Frontiers, CF ’21, New York, NY, USA, pp. 12–20. External Links: Document, ISBN 978-1-4503-8404-9 Cited by: §1.
  • [7] A. Guntoro, C. De La Parra, F. Merchant, F. De Dinechin, J. L. Gustafson, M. Langhammer, R. Leupers, and S. Nambiar (2020-03) Next Generation Arithmetic for Edge Computing. In 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, pp. 1357–1365. External Links: Document, ISBN 978-3-9819263-4-7 Cited by: §1, §2.1, §2.3.
  • [8] J. L. Gustafson and I. T. Yonemoto (2017-04) Beating floating point at its own game: posit arithmetic. Supercomputing Frontiers and Innovations 4 (2), pp. 71–86. External Links: Document Cited by: §1, §2.1.
  • [9] J. L. Gustafson (2018-06) RISC-V Proposed Extension for 32-bit Posits. Note: https://posithub.org/docs/RISC-V/RISC-V.htm Cited by: §2.2, §5.
  • [10] IEEE Computer Society (2019-07) IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008), pp. 1–84. External Links: Document Cited by: §1.
  • [11] M. K. Jaiswal and H. K.-H. So (2019) PACoGen: a Hardware Posit Arithmetic Core Generator. IEEE Access 7, pp. 74586–74601. External Links: ISSN 2169-3536, Document Cited by: §3.
  • [12] H. F. Langroudi, V. Karia, Z. Carmichael, A. Zyarah, T. Pandit, J. L. Gustafson, and D. Kudithipudi (2021-06) Alps: adaptive Quantization of Deep Neural Networks with GeneraLized PositS. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, pp. 3094–3103. External Links: Document, ISBN 978-1-66544-899-4 Cited by: §1.
  • [13] C. Lattner and V. Adve (2004-03) LLVM: a compilation framework for lifelong program analysis & transformation. In International Symposium on Code Generation and Optimization, 2004. CGO 2004., pp. 75–86. External Links: Document Cited by: §5.
  • [14] S. H. Leong (2020-03) SoftPosit. External Links: Document, Link Cited by: §7.1.
  • [15] S. Mach, F. Schuiki, F. Zaruba, and L. Benini (2021-04) FPnew: an Open-Source Multiformat Floating-Point Unit Architecture for Energy-Proportional Transprecision Computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 29 (4), pp. 774–787. External Links: ISSN 1063-8210, 1557-9999, Document Cited by: §2.3.
  • [16] R. Murillo, A. A. Del Barrio, and G. Botella (2020-07) Deep PeNSieve: a deep learning framework based on the posit number system. Digital Signal Processing 102, pp. 102762. External Links: ISSN 10512004, Document Cited by: §1, §2.1.
  • [17] R. Murillo, A. A. Del Barrio, and G. Botella (2020-10) Customized Posit Adders and Multipliers using the FloPoCo Core Generator. In 2020 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. External Links: ISSN 2158-1525, Document Cited by: §3.
  • [18] R. Murillo, A. A. Del Barrio Garcia, G. Botella, M. S. Kim, H. Kim, and N. Bagherzadeh (2021) PLAM: a Posit Logarithm-Approximate Multiplier. IEEE Transactions on Emerging Topics in Computing, pp. 1–1. External Links: ISSN 2168-6750, Document Cited by: §2.1, §3, §4.1.
  • [19] R. Murillo, D. Mallasén, A. A. Del Barrio, and G. Botella (2021-10) Energy-Efficient MAC Units for Fused Posit Arithmetic. In 2021 IEEE 39th International Conference on Computer Design (ICCD). External Links: Document Cited by: §7.1.
  • [20] Posit Working Group (2021-07) Posit Standard Documentation Release 4.12-draft. Cited by: §2.1, §2.1.
  • [21] G. Raposo, P. Tomás, and N. Roma (2021-06) Positnn: training Deep Neural Networks with Mixed Low-Precision Posit. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7908–7912. External Links: ISSN 2379-190X, Document Cited by: §1.
  • [22] N. Sharma, R. Jain, M. Mohan, S. Patkar, R. Leupers, N. Rishiyur, and F. Merchant (2021-10) CLARINET: a RISC-V Based Framework for Posit Arithmetic Empiricism. arXiv:2006.00364 [cs]. External Links: 2006.00364 Cited by: §3, §6.2, §8.
  • [23] S. Tiwari, N. Gala, C. Rebeiro, and V. Kamakoti (2021-06) PERI: a Configurable Posit Enabled RISC-V Core. ACM Transactions on Architecture and Code Optimization 18 (3), pp. 1–26. External Links: ISSN 1544-3566, 1544-3973, Document Cited by: §3.
  • [24] A. Waterman and K. Asanović (2019-12) The RISC-V Instruction Set Manual, Volume I: user-Level ISA, Document Version 20191213. Technical report RISC-V Foundation. Cited by: §5.
  • [25] A. Waterman, Y. Lee, D. A. Patterson, and K. Asanović (2014-05) The RISC-V instruction set manual, volume I: user-level ISA, version 2.0. Technical report Technical Report UCB/EECS-2014-54, EECS Department, University of California, Berkeley. Cited by: §1, §2.2.
  • [26] F. Zaruba and L. Benini (2019-11) The Cost of Application-Class Processing: energy and Performance Analysis of a Linux-Ready 1.7-GHz 64-Bit RISC-V Core in 22-nm FDSOI Technology. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 27 (11), pp. 2629–2640. External Links: ISSN 1063-8210, 1557-9999, Document Cited by: §1, §2.3.