One of the most important developments over the last decade has been the move from desktop computing to battery-powered computing in hand-held, wearable and mobile devices. This move from the desktop to the wider world is also reflected in the growth of applications that operate on real world data such as images, video, sound and motion. These applications are highly computationally intensive, and pose huge challenges both for mobile devices and for cloud-based services that receive and process large amounts of such data.
Approximate computing can be an effective technique both for accelerating these types of applications and for reducing the required energy. Approximate computing is based on the observation that the inputs and outputs of these algorithms are approximations. Introducing additional imprecision in the computation may have little or no effect on the final result.
One popular approach to approximate computing is to reduce data precision, for example from single precision binary-32 floating point to half precision binary-16. When designing custom hardware, data precision can be customized precisely to the needs of an application. For example, it has been found that some applications can make good use of as little as 8-bit floating point. When designing a custom FPGA or ASIC solution, the hardware can implement the exact level of required precision. Reducing precision can reduce the size of the hardware, but crucially it can also allow less data to be transferred between the processor and memory.
On general-purpose processors it is much more difficult to customize the precision of data to the application. Most general-purpose processors provide only two floating point sizes — single and double precision — and a limited range of integer data sizes, typically 8, 16, 32, and 64-bit. For example, if 9 bits of integer precision are required for an application, the programmer will normally use a 16-bit type. Similarly, if one needs 13-bit floating point (FP), one might use a half precision binary-16 FP type if it is available, or single precision binary-32 if not.
When the data consists of a large array of values, the cost of using more precision than necessary can become large. The obvious problem is that the larger data size requires more space in memory. But the larger data size also requires more memory bandwidth when transferring between processor and memory, and more energy to drive external pins, wires and buses when transferring unnecessarily large data.
In this paper we propose an entirely novel approach to supporting arrays of irregular precision floating point data types. We adopt the idea of software bitslice data representations that are used in the implementation of cryptography algorithms and some image representation formats. We use these software bitslice formats to represent arrays of data, and perform vector-style SIMD operations constructed from simple bitwise logical operations.
We make several contributions:
- We propose using software bitslice data representations to create a new approach to SIMD vector computation for customizable precision floating point data types.
- We present our customizable precision bitslice floating point operations as intrinsic functions similar to SIMD intrinsics and implement the operations as a reconfigurable library.
- Experimental results show that our bitwise vector approach is efficient for large vectors of customized floating point types with low precision.
2 Software Bitslice Representations
In the standard representation of simple types, such as integer and floating point values, a single value fits inside an 8, 16, 32 or 64-bit word. In a bitslice representation, the different bits of a single number are spread across multiple machine words.
Figure 1 shows an example of standard and bitslice representations of arrays. Both the standard and bitslice representations show an array of sixteen 8-bit floating point numbers. Each number has 1 sign bit, 4 exponent bits and 3 mantissa bits. However, the physical representation of the data in memory is quite different. Instead of using sixteen 8-bit words, the bitslice representation uses eight 16-bit machine words. The first bit of each of these eight 16-bit words corresponds to one of the eight bits of the first array element. Similarly, the other 8-bit values are represented by a bit from each of the eight 16-bit words.
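The conversion between the two layouts described above can be sketched as a bit-matrix transpose. The following is a minimal illustration rather than the library's own helper: the function names `to_bitslice` and `from_bitslice` are hypothetical, and the bit ordering (plane 0 holds the least significant bit, element `i` occupies bit `i` of every plane) is an assumption, since the exact ordering used in Figure 1 is not stated in the text.

```c
#include <assert.h>
#include <stdint.h>

enum { ELEM_BITS = 8 };  /* bits per element: 1 sign + 4 exponent + 3 mantissa */

/* Standard -> bitslice: bit i of planes[b] receives bit b of in[i].
   At most 16 elements fit, one per bit of a uint16_t plane.
   (Assumed ordering: plane 0 is the least significant bit.) */
void to_bitslice(const uint8_t *in, int n, uint16_t planes[ELEM_BITS])
{
    assert(n >= 0 && n <= 16);
    for (int b = 0; b < ELEM_BITS; b++) {
        uint16_t plane = 0;
        for (int i = 0; i < n; i++)
            plane |= (uint16_t)(((in[i] >> b) & 1u) << i);
        planes[b] = plane;
    }
}

/* Bitslice -> standard: gather bit i of every plane into out[i]. */
void from_bitslice(const uint16_t planes[ELEM_BITS], uint8_t *out, int n)
{
    assert(n >= 0 && n <= 16);
    for (int i = 0; i < n; i++) {
        uint8_t v = 0;
        for (int b = 0; b < ELEM_BITS; b++)
            v |= (uint8_t)(((planes[b] >> i) & 1u) << b);
        out[i] = v;
    }
}
```

Both layouts occupy the same 128 bits here; the difference is only in which bits share a machine word.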
Bitslice representations are sometimes used for highly efficient implementations of symmetric cryptography algorithms such as the Data Encryption Standard (DES) or Advanced Encryption Standard (AES). These cryptography algorithms perform large numbers of bit-level operations, which can be very fast on bitslice representations. But to our knowledge they have not been applied to more general purpose computation.
3 Bitslice vector computing
In this paper we propose using bitslice representations of arrays as the basis of a new approach to vector SIMD computing. We show how to construct vector SIMD operations in software that operate on these types. This allows us to construct SIMD vector types of a fixed number of elements but with arbitrary length per element. For example, we can construct a vector of thirty-two 8-bit elements, but equally we can construct a vector of thirty-two 17-bit elements.
To operate on the elements of bitslice vector types, we propose building arithmetic and other operators from native integer bitwise instructions. Figure 2 shows a simple integer adder routine for bitslice vectors with thirty-two elements, each of 13 bits. Note that the addition is performed by a sequence of bitwise operations that are the software equivalent of a hardware adder. Thus the sum of two bits $a$ and $b$ is $a \oplus b$ and the carry from the addition is $a \wedge b$. By applying a sequence of these bitwise operations, an entire $n$-bit addition can be performed.
Note that in a hardware adder, each logic gate operates on one binary value. However, the bitwise logical operators in the adder in Fig. 2 operate on an entire 32-bit register of values at once. Thus, the addition of any one element is performed sequentially by a sequence of bitwise operations, but each bitwise instruction operates on 32 separate 13-bit values. So our adder operates in vector SIMD style: the number of steps is proportional to the number of bits in each value, and the number of values processed in parallel equals the word size of the underlying type supported by the machine.
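The adder described above can be sketched as a ripple-carry loop over bit planes. This is a minimal illustration and not a reproduction of Fig. 2: the type name `bs13x32` and the helpers `bs_set`, `bs_get` and `bs_add` are hypothetical, and the plane ordering (index 0 is the least significant bit) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

enum { WIDTH = 13, LANES = 32 };

/* Hypothetical type: bit[i] holds bit i of all 32 elements, one per lane. */
typedef struct { uint32_t bit[WIDTH]; } bs13x32;

/* Store a 13-bit value into one lane (helper for building test data). */
void bs_set(bs13x32 *v, int lane, uint32_t value)
{
    assert(lane >= 0 && lane < LANES);
    for (int i = 0; i < WIDTH; i++) {
        v->bit[i] &= ~(UINT32_C(1) << lane);
        v->bit[i] |= ((value >> i) & 1u) << lane;
    }
}

/* Read the 13-bit value held in one lane. */
uint32_t bs_get(const bs13x32 *v, int lane)
{
    assert(lane >= 0 && lane < LANES);
    uint32_t value = 0;
    for (int i = 0; i < WIDTH; i++)
        value |= ((v->bit[i] >> lane) & 1u) << i;
    return value;
}

/* Add all 32 lanes at once: each iteration is one full-adder stage,
   evaluated for every lane by plain 32-bit bitwise instructions. */
bs13x32 bs_add(const bs13x32 *a, const bs13x32 *b)
{
    bs13x32 sum;
    uint32_t carry = 0;
    for (int i = 0; i < WIDTH; i++) {
        uint32_t x = a->bit[i], y = b->bit[i];
        uint32_t half = x ^ y;               /* sum without carry-in */
        sum.bit[i] = half ^ carry;           /* sum bit */
        carry = (x & y) | (half & carry);    /* carry into next stage */
    }
    return sum;  /* carry out of the top bit is dropped: result is modulo 2^13 */
}
```

The loop body runs 13 times regardless of how many lanes are in use, so the cost per element falls as the machine word gets wider.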
The big advantage of our proposal for bitslice vector types is that they allow vectors of values with an arbitrary number of bits. One can easily support vectors of numbers with 5, 9, or 13 bits. Operating on bitslice vector types is laborious from a sequential point of view, but exploits large amounts of bit-level parallelism within the conventional machine word. The major downside of operating on bitslice vector types is that each operation requires large numbers of bitwise operations. As the number of bits in each value grows, the execution time of the arithmetic operators increases rapidly. However, as we show in the following sections, it can work well for arrays of small, irregularly-sized types.
It has been demonstrated that not all programs need the precision provided by generic FP hardware, and that different sections of a program can benefit from different bitwidths when trading accuracy against power consumption. This ability to balance accuracy against performance makes our solution well suited to the needs of approximate computing.
4 Bitslice Floating Point Vector Operations
Bitslice floating point (BFP) vector operations perform arithmetic computations on the bitslice vector types. Bitslice vector types are essentially arrays of unsigned integers (e.g., uint8_t, uint16_t, or __m128i in Intel SSE instructions), where each integer represents one bit position of the data in the standard representation. Figure 3 shows the BFP vector type for FP32. The width of a bitslice vector type, that is, the number of elements it holds, is determined by the size of the underlying integer type. For example, for uint16_t, the width is 16. As discussed in Sec. 3, the arithmetic operations on bitslice vector types are carried out one bit at a time rather than on whole values. Therefore, we need to use integer bitwise operations to reproduce the hardware logic of each arithmetic operation. Our implementation follows the classic hardware implementation of floating point arithmetic operations, but with the aim of minimizing the number of gates rather than the overall latency.
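As a rough illustration of such a type, the sketch below lays out an FP8 vector (1 sign, 4 exponent, 3 mantissa bits) over 32-bit words. The struct `bfp8x32`, its field names, and the functions are all hypothetical and are not taken from the library's generated header; the field ordering is likewise an assumption.

```c
#include <assert.h>
#include <stdint.h>

enum { EXP_BITS = 4, MAN_BITS = 3, LANES = 32 };

/* Hypothetical layout: one 32-bit plane per bit of the FP8 format. */
typedef struct {
    uint32_t man[MAN_BITS];  /* man[0] is the least significant mantissa bit */
    uint32_t exp[EXP_BITS];  /* exp[0] is the least significant exponent bit */
    uint32_t sign;           /* sign bits of all 32 elements */
} bfp8x32;

/* Scatter the 8 bits of one standard-format value into a lane. */
void bfp8_set_lane(bfp8x32 *v, int lane, uint8_t bits)
{
    assert(lane >= 0 && lane < LANES);
    uint32_t mask = UINT32_C(1) << lane;
    for (int i = 0; i < MAN_BITS; i++)
        v->man[i] = (v->man[i] & ~mask)
                  | (((uint32_t)(bits >> i) & 1u) << lane);
    for (int i = 0; i < EXP_BITS; i++)
        v->exp[i] = (v->exp[i] & ~mask)
                  | (((uint32_t)(bits >> (MAN_BITS + i)) & 1u) << lane);
    v->sign = (v->sign & ~mask) | (((uint32_t)(bits >> 7) & 1u) << lane);
}

/* Gather one lane back into the standard 8-bit encoding. */
uint8_t bfp8_get_lane(const bfp8x32 *v, int lane)
{
    assert(lane >= 0 && lane < LANES);
    uint32_t bits = 0;
    for (int i = 0; i < MAN_BITS; i++)
        bits |= ((v->man[i] >> lane) & 1u) << i;
    for (int i = 0; i < EXP_BITS; i++)
        bits |= ((v->exp[i] >> lane) & 1u) << (MAN_BITS + i);
    bits |= ((v->sign >> lane) & 1u) << 7;
    return (uint8_t)bits;
}

/* Negate all 32 elements by complementing the sign plane. */
void bfp8_negate(bfp8x32 *v) { v->sign = ~v->sign; }

/* Absolute value of all 32 elements: clear the sign plane. */
void bfp8_abs(bfp8x32 *v) { v->sign = 0; }
```

Operations that touch only one field, such as negation, cost a single bitwise instruction for the whole vector, which is one reason to keep the sign, exponent and mantissa planes separate.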
In addition to the helper operations for transforming data in the standard format to our BFP vector types, we provide three basic arithmetic operations: addition/subtraction, multiplication, and division. For each operation, two rounding modes are available: round towards zero and round to nearest (ties to even). The computation steps and associated complexities for each BFP operation are listed in Table I. For division, we adopt the restoring division algorithm, which is the simplest digit-recurrence algorithm.
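To make the restoring division recurrence concrete, the following sketch applies it to unsigned integers held in bitslice form. It is an illustration under stated assumptions, not the library's significand divider: the function name `bs_divide` is hypothetical, plane 0 is assumed to be the least significant bit, and lanes with a zero divisor are not handled specially.

```c
#include <assert.h>
#include <stdint.h>

/* Per-lane unsigned restoring division on bitslice planes.
   n is the number of bits per value (1..32); plane i holds bit i of
   every lane. Lanes whose divisor is zero are not treated specially. */
void bs_divide(const uint32_t *num, const uint32_t *den, int n,
               uint32_t *quo, uint32_t *rem)
{
    assert(n >= 1 && n <= 32);
    uint32_t r[33] = {0};   /* partial remainder, uses n + 1 planes */
    uint32_t t[33];         /* trial difference r - den */

    for (int step = n - 1; step >= 0; step--) {
        /* r = (r << 1) | next dividend bit: a shift only renames planes. */
        for (int i = n; i > 0; i--)
            r[i] = r[i - 1];
        r[0] = num[step];

        /* Trial subtraction with a ripple borrow (bitwise full subtractor). */
        uint32_t borrow = 0;
        for (int i = 0; i <= n; i++) {
            uint32_t d = (i < n) ? den[i] : 0;
            uint32_t x = r[i] ^ d;
            t[i] = x ^ borrow;
            borrow = (~r[i] & d) | (~x & borrow);
        }

        /* A final borrow marks lanes where r < den: keep (restore) r there,
           and accept the difference everywhere else. */
        for (int i = 0; i <= n; i++)
            r[i] = (borrow & r[i]) | (~borrow & t[i]);
        quo[step] = ~borrow;   /* quotient bit is 1 where the trial succeeded */
    }
    for (int i = 0; i < n; i++)
        rem[i] = r[i];
}
```

Each quotient bit costs one full trial subtraction plus a per-bit select, so the work grows quadratically with the number of bits per value, in line with the observation that wide formats become expensive.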
Bit shifting is required in the alignment shift of the add operation and in the normalization step of all the operations. In BFP vector types, the bits of a vector item are spread over different integers. Shifting bits one position at a time is thus prohibitive, because the number of memory accesses grows in proportion to the number of bits to be shifted. We adopt a log shifter, which significantly reduces the total number of bits to be moved and in turn eliminates some unnecessary memory accesses. Log shifting has been demonstrated to be an effective way of saving power in hardware.
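A log shifter for bitslice data can be sketched as one conditional move stage per bit of the shift amount, where stage `s` moves planes by `2^s` positions in exactly those lanes whose amount has bit `s` set. The code below is a minimal sketch under assumptions: the function name `bs_shift_right` is hypothetical, the shift amounts are themselves assumed to be stored in bitslice form, and plane 0 is taken as the least significant bit.

```c
#include <assert.h>
#include <stdint.h>

/* Shift every lane's width-bit value right by that lane's own amount.
   val[i]  : plane holding bit i of the values (modified in place)
   amt[s]  : plane holding bit s of the per-lane shift amounts
   Vacated high bits are filled with zeros. */
void bs_shift_right(uint32_t *val, int width,
                    const uint32_t *amt, int amt_bits)
{
    assert(width >= 1 && amt_bits >= 1 && amt_bits <= 30);
    for (int s = 0; s < amt_bits; s++) {
        int dist = 1 << s;        /* this stage shifts by 2^s or not at all */
        uint32_t sel = amt[s];    /* lanes that take the shifted value */
        /* Ascending order is safe in place: val[i + dist] has not yet
           been overwritten during this stage when val[i] is computed. */
        for (int i = 0; i < width; i++) {
            uint32_t moved = (i + dist < width) ? val[i + dist] : 0;
            val[i] = (sel & moved) | (~sel & val[i]);
        }
    }
}
```

With this structure the number of passes over the planes is the number of bits in the shift amount rather than the largest shift distance, and every lane can shift by a different amount in the same pass.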
Our BFP vector types are represented as arrays, so it is important to exploit data locality both within each computation step (e.g., loops) and across steps. For example, software pipelining with loop unrolling is applied to the multiplication of significands in order to keep data in registers and reuse them as much as possible. Similar adjacent operations, such as two successive additions on the exponent, can be merged to avoid storing intermediate results.
5 Experimental evaluation
Our proposed customizable precision floating-point arithmetic is implemented as a reconfigurable bitslice FP library. When programmers know the bitwidth requirements of their applications in advance, they can simply put the number of exponent bits, the number of mantissa bits, and the rounding mode into a configuration file and feed it to our library generator. The generator produces a header file containing the BFP data structures and related C intrinsic functions for the basic BFP operations, and a library file (.so) that implements the custom FP operations. Programmers can either manually modify their code to use BFP vector types and operations, or annotate the source code and let compilers vectorize their code and automatically generate BFP operations. The compiler support is beyond the scope of this paper.
5.2 Experimental Evaluation
We evaluated the performance of our BFP vector operations on a Linux platform with an Intel(R) Core(TM) i7-4770 CPU, which supports AVX2 SIMD instructions. We used BFP vector types to represent three floating point formats: FP8 (1 sign bit, 4 exponent bits, 3 significand bits), FP16 (1 sign bit, 5 exponent bits, 10 significand bits), and FP32 (1 sign bit, 8 exponent bits, 23 significand bits). For each format, we measured the performance with different integer types supported by the CPU, from 32-bit integer (uint32_t) to 256-bit integer (__m256i). As our implementation supports two widely used rounding modes, round towards zero and round to nearest, performance is given for both for comparison.
Figure 4 shows the performance of BFP addition. The performance of BFP multiplication and division is shown in Fig. 5 and Fig. 6, respectively. All performance comparisons are against floating-point computations on data in the standard representation without using SIMD instructions. On processors without SIMD floating point units, our work shows another way to achieve SIMD floating point operations, using integer computation units and with flexible precision. The performance results demonstrate that 1) with larger integer types, for example __m256i, which allows more elements (256) to be processed in parallel, the performance improvement is proportional to the width of the underlying integer type; and 2) applying round towards zero (RZ) rather than round to nearest (RN) significantly improves performance by reducing the width of the significand in the intermediate results. In particular, for small floating point formats, our BFP multiplication and division can greatly outperform their hardware counterparts. As these formats are common in image processing and other fields where approximate computing is applicable, the performance of our BFP vector operations with flexible precision makes our solution a feasible approach for approximate computing.
6 Related Work
SWAR (SIMD Within A Register) is the work most closely related to our bitslice floating point vector computation. SWAR uses logic operations to implement integer operations. Partitioned operations are the main focus of most of the hardware support for SWAR. Instead of making the field size equal to the size of the desired data type, we use a single bit of each word to hold one bit of a data element, so that the number of elements processed in parallel is determined not by the width of the data type but by the maximum width of the registers (including SIMD registers). Meanwhile, we use logic operations to implement the actual hardware logic for a single bit of the data element.
Lowering energy consumption is one of the major benefits of approximate computing. Disciplined approximate programming asks programmers to specify which parts of a program can be computed approximately. The approximate computation thus reduces the energy cost. An ISA extension is put forward to provide approximate operations and storage. With this extension, hardware has freedom to save energy at the cost of accuracy. Our customizable precision BFP vector types and related operations can serve as a software ISA for approximate computation.
Some programs may not need the dynamic range or the precision of FP arithmetic. For these programs, it is a general design practice to translate the floating-point arithmetic into a suitable finite fixed point representation. However, some programs may still require 6 bits or more in the exponent to preserve a reasonable degree of accuracy. In other words, these applications need more than the typical 32 bits of precision that fixed point arithmetic offers. Therefore, support for small, irregularly sized floating point makes our bitslice vector types a good fit for this kind of application.
We propose an entirely novel approach to vector computing based on bitslice vector formats and building arithmetic operators from bitwise instructions. This approach allows us to support a vector processing model that can operate on data with an arbitrary number of bits. Thus, we can create vectors of integer or floating point types of five, nine, eleven or any number of bits. This ability to customize the precision of vector data exactly to the application creates new opportunities for optimization. In particular, it allows data precision optimizations on general-purpose processors that were previously available primarily on custom hardware. In addition, matching precision to the application may reduce the memory footprint of applications, which may in turn reduce memory traffic and the energy required for data movement.
The complexity of the arithmetic operators is related to the number of bits of precision in the data types. Our experiments show that for larger precision, the cost of the arithmetic operators becomes prohibitive. However, for smaller data types the benefits of exploiting bitwise parallelism across a vector of values can outweigh the costs of bitwise arithmetic. To our knowledge we are the first to propose and evaluate general-purpose bitslice vector representations. We believe that it is a promising approach for approximate computing using just enough precision.
This work was supported by Science Foundation Ireland grant 12/IA/1381 and 10/CE/I1855 to Lero – the Irish Software Research Centre (www.lero.ie).
-  E. Biham, “A fast new DES implementation in software,” in Fast Software Encryption, ser. Lecture Notes in Computer Science, E. Biham, Ed. Springer Berlin Heidelberg, 1997, vol. 1267, pp. 260–272.
-  J. Y. F. Tong, D. Nagle, and R. A. Rutenbar, “Reducing Power by Optimizing the Necessary Precision/Range of Floating-point Arithmetic,” IEEE Trans. Very Large Scale Integr. Syst., vol. 8, no. 3, pp. 273–285, Jun. 2000.
-  M. D. Ercegovac and T. Lang, Digital Arithmetic. San Francisco (Calif.): Morgan Kaufmann Oxford, 2004.
-  J.-M. Muller, N. Brisebarre, F. de Dinechin, C.-P. Jeannerod, V. Lefèvre, G. Melquiond, N. Revol, D. Stehlé, and S. Torres, Handbook of Floating-Point Arithmetic. Birkhäuser Boston, 2010.
-  R. V. K. Pillai, D. Al-Khalili, and A. J. Al-Khalili, “Energy Delay Measures of Barrel Switch Architectures for Pre-alignment of Floating Point Operands for Addition,” in Proceedings of the 1997 International Symposium on Low Power Electronics and Design, ser. ISLPED ’97, 1997, pp. 235–238.
-  K. Acken, M. Irwin, and R. Owens, “Power Comparisons for Barrel Shifters,” in Proceedings of the 1996 International Symposium on Low Power Electronics and Design, ser. ISLPED ’96, 1996, pp. 209–212.
-  R. J. Fisher and H. G. Dietz, “Compiling for SIMD Within a Register,” in 11th International Languages and Compilers for Parallel Computing Workshop (LCPC’98), 1998.
-  H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Architecture support for disciplined approximate programming,” in Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2012, London, UK, March 3-7, 2012, 2012.