1 Introduction
The instruction set architecture of most modern processors provides Single Instruction Multiple Data (SIMD) instructions that process multiple instances of data concurrently [14]. The programming model that utilizes these instructions is a key technique for many computing systems to reach their peak performance. Most software SIMD optimizations are introduced manually by programmers. However, this approach creates a portability problem, because the code must be rewritten when targeting a new vector extension. In order to improve the portability of code with SIMD optimizations, recent compilers have introduced autovectorization capabilities [40]. To fully exploit the SIMD capabilities of a system, the autovectorization transformation in a compiler must be able to invoke a version of each function that operates on concurrent iterations, i.e., a vector function. This applies in particular to the C mathematical functions defined in math.h, which are frequently called in hot loops.
In this paper, we describe our implementation of a vectorized library of C standard math functions, called the SLEEF library. SLEEF stands for SIMD Library for Evaluating Elementary Functions, and it implements vectorized versions of all C99 real floating-point math functions. Our library provides a 1-ULP accuracy version and a 3.5-ULP accuracy version for most of the functions. We confirmed that our library satisfies these accuracy requirements on an empirical basis. Our library achieves both good performance and portability by abstracting intrinsic functions. This abstraction enables subfeatures of vector extensions, such as mask registers, to be utilized while the source code of our library is shared among different vector extensions. We also implemented a version of the functions that returns bitwise consistent results across all platforms. Our library is designed to be used in conjunction with vectorizing compilers. In order to help the development of vectorizing compilers, we collaborated with compiler developers in designing a Vector Function Application Binary Interface (ABI). The main difficulty in vectorizing math functions is that conditional branches are expensive. We implemented many of the functions in our library without conditional branches. We devised reduction methods and adjusted the domains of polynomials so that a single polynomial covers the entire input domain. As the vector size increases, a value requiring a slow path is more likely to be contained in a vector. Therefore, we vectorized all the code paths in order to speed up the computation in such cases. We devised a variation of the Payne-Hanek range reduction and a remainder calculation method that are both suitable for vectorized implementation.
We compare the implementation of several selected functions in our library to those in other open-source libraries. We also compare the reciprocal throughput of functions in our library, Intel SVML [29], FDLIBM [50], and Vectorlibm [33]. We show that the performance of our library is comparable to that of Intel SVML.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 discusses how portability is improved by abstracting vector extensions. Section 4 explains the development of a Vector ABI and a vectorized mathematical library. Section 5 shows an overview of the implementation of SLEEF, while comparing our library with FDLIBM and Vectorlibm. Section 6 explains how our library is tested. Section 7 compares our work with prior art. In Section 8, the conclusions are presented.
2 Related Work
2.1 C Standard Math Library
The C standard library (libc) includes the standard mathematical library (libm) [30]. There have been many implementations of libm. Among them, FDLIBM [50] and the libm included in the GNU C Library [25] are the most widely used. FDLIBM is a freely distributable libm developed by Sun Microsystems, Inc., and there are many derived versions of this library. Gal et al. described the algorithms used in the elementary mathematical library of the IBM Israel Scientific Center [22]. Their algorithms are based on the accurate tables method developed by Gal; it achieves high performance and produces very accurate results. Crlibm is a project to build a correctly rounded mathematical library [15].
There are several existing vectorized implementations of libm. Intel Short Vector Math Library (SVML) is a highly regarded commercial library [29]. This library provides highly optimized subroutines for evaluating elementary functions, which can use several kinds of vector extensions available in Intel's processors. However, this library is proprietary and only optimized for Intel's processors. There are also a few other commercial and open-source implementations of vectorized libm. AMD provides a vectorized libm called AMD Core Math Library (ACML) [3].
Some of the code from SVML has been published under a free software license and is now available as Libmvec [24], which is part of Glibc. This library provides functions with a 4-ULP error bound. It is coded in assembly language, and therefore it does not have good portability. C. K. Anand et al. reported their C implementation of 32 single-precision libm functions tuned for the Cell BE SPU compute engine [4]. They used an environment called Coconut, which enables rapid prototyping of patterns and rapid unit testing of assembly language fragments and patterns, to develop their library. M. Dukhan published an open-source and portable SIMD vector libm library named Yeppp! [17, 18]. Most vectorized implementations of libm utilize assembly coding or intrinsic functions to specify which vector instruction is used for each operator. On the other hand, there are also implementations of vector versions of libm that are written in a scalar fashion but rely on a vectorizing compiler to generate vector instructions and produce a vectorized binary. Christoph Lauter published an open-source Vectorlibm library implemented in plain C [33]. The VDT Mathematical Library [45] is a math library written for the compiler's autovectorization feature.
2.2 Translation of SIMD Instructions
Manilov et al. propose a C source code translator for substituting calls to platform-specific intrinsic functions in a source code with those available on the target machine [37]. This technique utilizes graph-based pattern matching to substitute intrinsics. It can translate SIMD intrinsics between extensions with different vector lengths. This rewriting is carried out through loop unrolling.
N. Gross proposes specialized C++ templates for making source code easily portable among different vector extensions without sacrificing performance [27]. With these templates, some parts of the source code can be written in a way that resembles scalar code. In order to vectorize algorithms that have a lot of control flow, this scheme requires applying the bucketing technique: computing all the paths and choosing the relevant results at the end.
Clark et al. propose a method that combines static analysis at compile time with binary translation by a JIT compiler in order to translate SIMD instructions into those available on the target machine [11]. In this method, SIMD instructions in the code are first converted into an equivalent scalar representation. Then, a dynamic translation phase turns the scalar representation back into architecture-specific SIMD equivalents.
Leißa et al. propose a C-like language for portable and efficient SIMD programming [35]. With their extension, writing vectorized code is almost as easy as writing traditional scalar code. There is no strict separation between host code and kernels, and scalar and vector programming can be mixed; switching between them is triggered by the type system. The authors present a formal semantics of their extension and prove the soundness of the type system.
Most of the existing methods aim at translating SIMD intrinsics or instructions into those provided by a different vector extension in order to port code. Intrinsics that are unique to a specific extension are not easy to handle, and translation only works if the source and target architectures have equivalent SIMD instructions. Automatic vectorizers in compilers have a similar weakness. Whenever possible, we have specialized the implementation of the math functions to exploit the SIMD instructions that are specific to a target vector extension. We also want to handle FMA, rounding, and a few other kinds of instructions specially, because these are critical for both execution speed and accuracy. We want to implement a library that is statically optimized and usable with Link Time Optimization (LTO). The users of our library do not appreciate the use of a JIT compiler. In order to minimize dependency on external libraries, we want to write our library in C. In order to fulfill these requirements, we take a cross-layer approach: we have been co-designing our abstraction layer of intrinsics, the library implementation, and the algorithms in order to make our library run fast with any vector extension.
3 Abstraction of Vector Extensions
Modern processors supporting SIMD instructions have SIMD registers that can contain multiple data elements [14]. For example, a 128-bit wide SIMD register may contain four 32-bit single-precision FP numbers. A SIMD add instruction might take two of these registers as operands, add the four pairs of numbers, and overwrite one of the registers with the resulting four numbers. We call an array of FP numbers contained in a SIMD register a vector.
SIMD registers and instructions can be exposed in a C program through intrinsic functions and types [6]. An intrinsic function is a kind of inline function that exposes the architectural features of an instruction set at the C level. By calling an intrinsic function, a programmer can make a compiler generate a specific instruction without hand-coded assembly. Nevertheless, the compiler can still reorder instructions and allocate registers, and therefore optimize the code. When intrinsic functions corresponding to SIMD instructions are defined inside a compiler, C data types for representing vectors are also defined.
In SLEEF, we use intrinsic functions to specify which assembly instruction to use for each operator. We abstract intrinsic functions for each vector extension by a set of inline functions or preprocessor macros. We implement the functions exported from the library to call abstract intrinsic functions instead of directly calling intrinsic functions. In this way, it is easy to swap the vector extension to use. We call our set of inline functions for abstracting architecturespecific intrinsics Vector Extension Abstraction Layer (VEAL).
In some of the existing vector math libraries, functions are implemented with hand-coded assembly [24]. This approach can improve the absolute performance because it is possible to provide the optimal implementation for each microarchitecture. However, processors with new microarchitectures are released every few years, and the library needs to be revised accordingly in order to maintain optimal performance.
In other vector math libraries, the source code is written in a scalar fashion that is easy for compilers to autovectorize [33, 45]. Although such libraries have good portability, it is not easy for compilers to generate well-optimized code from them. In order for each transformation rule in an optimizer to kick in, the source code must satisfy many conditions that guarantee the optimized code runs correctly and faster. In order to control the level of optimization, a programmer must specify special attributes and compiler options.
3.1 Using Subfeatures of the Vector Extensions
There are differences in the features provided by different vector extensions, and we must change the function implementation according to the available features. Thanks to the level of abstraction provided by the VEALs, we implemented the functions so that all the different versions of functions can be built from the same source files with different macros enabled. For example, the availability of FMA instructions is important when implementing doubledouble (DD) operators [39]. We implemented DD operators both with and without FMA by manually specifying if the compiler can convert each combination of multiplication and addition instructions to an FMA instruction, utilizing VEALs.
Generally, bit masks are used in vectorized code in order to conditionally choose elements from two vectors. In some vector extensions, a vector register whose width matches that of a vector register for storing FP values is used to store a bit mask. Other vector extensions provide narrower registers dedicated to this purpose, called opmask registers. SLEEF makes use of these opmask registers by providing a dedicated data type in the VEALs. If a vector extension does not support an opmask, the usual bit mask is used instead. It is also useful to have an opmask as an argument of a whole math function and make that function compute only the elements specified by the opmask. By utilizing a VEAL, it is also easy to implement such functionality.
3.2 Details of VEALs
Fig. 1 shows some definitions in the VEAL for SVE [7]. We abstract vector data types and intrinsic functions with typedef statements and inline functions, respectively.
The vdouble data type is for storing vectors of double precision FP numbers. vopmask is the data type for the opmask described in 3.1.
vcast_vd_d returns a vector in which the given scalar value is copied to all elements. vsub_vd_vd_vd performs vector subtraction between two vdouble data. veq_vo_vd_vd compares the elements of two vectors of vdouble type. The result of the comparison can be used, for example, by vsel_vd_vo_vd_vd to choose a value for each element between two vector registers. Fig. 2 shows an implementation of a vectorized positive difference function using a VEAL. This function is a vectorized implementation of the fdim function in the C standard math library.
3.3 Making Results Bitwise Consistent across All Platforms
The methods of implementing math functions described so far can deliver computation results that slightly differ depending on the architecture and other conditions, although they all satisfy the accuracy requirements and other specifications. However, some applications require bitwise consistent results.
To this end, the SLEEF project has been working closely with Unity Technologies (https://unity3d.com/), which specializes in developing frameworks for video gaming, and we discovered that they have unique requirements for the functionality of math libraries. Networked video games run on many gaming consoles with different architectures, sharing the same virtual environment. Consistent simulation results at each terminal and server are required to ensure fairness among all players. For this purpose, fast computation is more important than accurate computation, but the results of computation have to agree perfectly between many computing nodes, which are not guaranteed to rely on the same architecture. Usually, fixed-point arithmetic is used for a purpose like this; however, there is a demand for adapting existing code with FP computation to support networking.
There are also other kinds of simulation in which bitwise identical reproducibility is important. In [36], the authors show that modeled mean climate states, variability, and trends at different scales may be significantly changed, or even lead to opposing results, due to round-off errors in climate system simulations. Since reproducibility is a fundamental principle of scientific research, they propose promoting bitwise identical reproducibility as a worldwide standard.
One way to obtain bitwise consistent values from math functions is to compute correctly rounded values. However, for applications like networked video games, this can be too expensive. SLEEF provides vectorized math functions that return bitwise consistent results across all platforms and other settings, and this is also achieved by utilizing VEALs. The basic idea is to always apply the same sequence of operations to the arguments. The IEEE 754 standard guarantees that the basic arithmetic operators give correctly rounded results [38], and therefore the results from these operators are bitwise consistent. Because most of the functions in our library, except the trigonometric functions, do not have conditional branches, producing bitwise consistent results is fairly straightforward with VEALs. The availability of FMA instructions is another key to making results bitwise consistent. Since FMA instructions are critical for performance, we cannot simply give up using them. In SLEEF, the bitwise consistent functions therefore come in two versions, with and without FMA instructions. We provide the non-FMA version to guarantee bitwise consistency with extensions such as Intel SSE2 that do not have FMA instructions. Another issue is that the compiler might introduce inconsistency through FP contraction, which combines a pair of multiplication and addition operations into an FMA. By disabling FP contraction, the compiler strictly preserves the order and the type of FP operations during optimization. It is also important to make the values returned by the scalar functions bitwise consistent with those of the vector functions. In order to achieve this, we also made a VEAL that only uses scalar operators and data types. The bitwise consistent and non-consistent versions of the vector and scalar functions are all built from the same source files, with different VEALs and macros enabled.
As described in Section 5, trigonometric functions in SLEEF choose a reduction method according to the maximum argument among all elements in the argument vector. In order to make the returned values bitwise consistent, the bitwise consistent version of the functions first applies the reduction method for small arguments to the elements covered by that method. It then applies the second method only to the elements with larger arguments that the first method does not cover.
4 The Development of a Vector Function ABI and SLEEF
Recent compilers implement optimization techniques to automatically vectorize code written in standard programming languages that do not support parallelization [31, 41]. Although the first SIMD and vector computing systems [47] appeared a few decades ago, compilers with autovectorization capabilities have not been widely used until recently, because of several difficulties in implementing such functionality for modern SIMD architectures. Such difficulties include verifying whether the compiler can vectorize a loop, by determining the data access patterns of the operations in the loop [40, 42]. For languages like C and C++, it is also difficult to determine the data dependencies through the iteration space of the loop, because it is hard to determine the aliasing conditions of the arrays processed in the loop.
4.1 Vector Function Application Binary Interface
Vectorizing compilers convert calls to scalar versions of math functions such as sine and exponential into calls to SIMD versions of those functions. The most recent versions of the Intel Compiler [52], the GNU Compiler [23], and Arm Compiler for HPC [5], which is based on Clang/LLVM [51, 32], are capable of this transformation, and they rely on the availability of vector math libraries such as SVML [29], Libmvec [24], and SLEEF, respectively, to provide implementations of the vector function calls that they generate. In order to develop this kind of transformation, a target-dependent Application Binary Interface (ABI) for calling vectorized functions had to be designed.
The Vector Function ABI for AArch64 architecture [8] was designed in close relationship with the development of SLEEF. This type of ABI must standardize the mapping between scalar functions and vector functions. The existence of a standard enables interoperability across different compilers, linkers and libraries, thanks to the use of standard names defined by the specification.
The ABI includes a name mangling function, a map that converts a scalar signature to the vector one, and the calling conventions that the vector functions must obey. In particular, the name mangling function, which maps the name of a scalar function to that of a vector function, must encode all the information necessary to reverse the transformation back to the original scalar function. A linker can use this reverse mapping to enable more optimizations (Link Time Optimization) that operate on object files without access to the scalar and vector function prototypes. There is also user demand for choosing a different vector math library according to the use case, and the reverse mapping is handy for this purpose. A vector math library implements a function for each combination of a vector extension, a vector length, and a math function to evaluate. As a result, the library exports a large number of functions. Some vector math libraries implement only part of all the combinations. By using the reverse mapping mechanism, the compiler can check the availability of functions by scanning the symbols exported by a library.
The Vector Function ABI is also used with OpenMP [43]. From version 4.0 onwards, OpenMP provides the directive declare simd. A user can decorate a function with this directive to inform the compiler that the function can be safely invoked concurrently on multiple instances of its arguments [34]. This means that the compiler can vectorize the function safely. This is particularly useful when the function is provided via a separate module, or an external library, for example in situations where the compiler is not able to examine the behavior of the function in the call site. The scalartovector function mapping rules stipulated in the Vector Function ABI are based on the classification of vector functions associated with the declare simd directive of OpenMP. Currently, work for implementing these OpenMP directives on LLVM is ongoing.
The Vector Function ABI specifications are provided for the Intel x86 and the Armv8 (AArch64) families of vector extensions [53, 8]. The compiler generates SIMD function calls according to the compiler flags. For example, when targeting AArch64 SVE autovectorization, the compiler will transform a call to the standard sin function to a call to the symbol _ZGVsMxv_sin. When targeting Intel AVX512 [49] autovectorization, the compiler would generate a call to the symbol _ZGVeNe8v_sin.
4.2 SLEEF and the Vector Function ABI
SLEEF is provided as two separate libraries. The first library exposes the functions of SLEEF to programmers for inclusion in their C/C++ code. The second library exposes the functions with names mangled according to the Vector Function ABI. This makes SLEEF a viable alternative to libm and its SIMD counterpart libmvec, in glibc. This also enables a user workflow that relies on the autovectorization capabilities of a compiler. The compatibility with libmvec enables users to swap from libmvec to libsleef by simply changing compiler options, without changing the code that generated the vector call. The two SLEEF libraries are built from the same source code, which is configured to target the different versions via auto-generative programs that transparently rename the functions according to the rules of the target library.
5 Overview of Library Implementation
One of the objectives of the SLEEF project is to provide a library of vectorized math functions that can be used in conjunction with vectorizing compilers. When non-vectorized code is automatically vectorized, the compiler converts calls to scalar math functions into calls to SIMD versions of the math functions. In order to make this conversion safe and applicable to a wide variety of codes, we need functions with a 1-ULP error bound that conform to the ANSI C standard. On the other hand, there are users who need better performance. Our library provides a 1-ULP accuracy version and a 3.5-ULP accuracy version for most of the functions. We confirmed that our library satisfies the accuracy requirements on an empirical basis. For non-finite inputs and outputs, we implemented the functions to return the same results as libm, as specified in the ANSI C standard. They do not set errno nor raise exceptions.
In order to optimize a program with SIMD instructions, it is important to eliminate conditional branches as much as possible and to execute the same sequence of instructions regardless of the argument. If an algorithm requires conditional branches according to the argument, it must prepare for the case in which the elements in the input vector contain both values that would take a branch and values that would not. Recent processors have long pipelines, and the branch misprediction penalty can exceed 10 cycles [20]. Making the decision for a conditional branch also requires non-negligible computation, within the scope of our tests. A conditional move is an operator for choosing one value from two given values according to a condition. It is equivalent to a ternary operator and can be used in vectorized code to replace a conditional branch. Some other operations are also expensive in a vectorized implementation. A table lookup is expensive: although in-register table lookup is reported to be fast on the Cell BE SPU [4], it is substantially slower than polynomial evaluation without any table lookup, within the scope of our tests. Most vector extensions do not provide 64-bit integer multiplication or a vector shift operator that allows each element of a vector to be shifted by a different number of bits. On the other hand, FMA and round-to-integer instructions are supported by most vector extensions. Due to the nature of the evaluation methods, dependency between operations cannot be completely eliminated. Latencies of operations become an issue when a series of dependent operations is executed. FP division and square root are not too expensive in this respect. (The latencies of 256-bit DP add, divide, and sqrt instructions are 4, 14, and 18 cycles, respectively, on Intel Skylake processors [28].)
The actual structure of the pipeline in a processor is complex, and such details are not well documented for most CPUs. Therefore, it is not easy to optimize code for a specific hardware implementation. In this paper, we define the latency and throughput of an instruction or a subroutine as follows [21]: the latency is the delay that it generates in a dependency chain, and the throughput is the maximum number of instructions or subroutines of the same kind that can be executed per unit time when the inputs are independent of the preceding instructions or subroutines. Several tools and methods have been proposed for automatically constructing models of the latency, throughput, and port usage of instructions [1, 26]. Within the scope of our tests, the instruction latency in the critical path of evaluating a vector math function tends to be dominated by FMA operations. In many processors, FMA units are pipelined. Some powerful processors have multiple FMA units with out-of-order execution, so the throughput of FMA instructions is high while the latency is long. In SLEEF, we try to maximize the throughput of computation in a versatile way by taking into account only the dependencies among FMA operations. We regard each FMA operation as a job that can be executed in parallel and try to reduce the length of the critical path.
In order to evaluate a double-precision (DP) function to 1-ULP accuracy, internal computation with accuracy better than 1 ULP is sometimes required. Double-double (DD) arithmetic, in which a single value is expressed as the sum of two double-precision FP values [16, 46], is used for this purpose. All the basic operators for DD arithmetic can be implemented without conditional branches, and therefore DD arithmetic is suitable for vectorized implementation. Because we only need 1-ULP overall accuracy for DP functions, we use simplified DD operators with less than full DD accuracy. In SLEEF, we omit renormalization of DD values by default, allowing overlap between the two numbers, and carry out renormalization only when necessary.
Evaluation of an elementary function often consists of three steps: range reduction, approximation, and reconstruction [39]. The approximation step computes the elementary function using a polynomial. Since this approximation is only valid on a small domain, the range reduction step computes, from the argument, a number within that domain. The reconstruction step combines the results of the first two steps to obtain the final result.
An argument reduction method that finds an FP remainder of dividing the argument by π/2 is used in the evaluation of trigonometric functions. The range reduction method suggested by Cody and Waite [13, 12] is used for small arguments. Payne and Hanek's method [44] provides accurate range reduction for large arguments of trigonometric functions, but it is expensive in terms of operations.
There are tools available for generating the coefficients of the polynomials, such as Maple [9] and Sollya [10]. In order to fine-tune the generated coefficients, we created a tool for generating coefficients that minimizes the maximum relative error. When a SLEEF function evaluates a polynomial, it evaluates a few of the lowest-degree terms in DD precision while the other terms are computed in double precision, in order to achieve 1-ULP overall accuracy. Accordingly, coefficients in DD precision, or coefficients that can be represented by FP numbers with only a few significant bits in the mantissa, are used in the last few terms. We designed our tool to generate such coefficients. We use Estrin's scheme [19] to evaluate polynomials in order to reduce the dependency between FMA operations. This scheme reduces bubbles in the pipeline and allows more FMA operations to be executed in parallel. Reducing latency can improve the throughput of evaluating a function, because the latency and the reciprocal throughput of the entire function are close to each other.
Below, we describe and compare the implementations of selected functions in SLEEF, FDLIBM [50], and Christoph Lauter's Vectorlibm [33]. We describe the 1-ULP accuracy versions of the functions in SLEEF. The error bound specification of FDLIBM is 1 ULP.
5.1 Implementation of sin and cos
FDLIBM uses Cody-Waite range reduction if the argument is below a fixed threshold; otherwise, it uses Payne-Hanek range reduction. Then, it switches between polynomial approximations of the sine and cosine functions on [−π/4, π/4]. Each polynomial has 6 nonzero terms.
sin and cos in Vectorlibm have a 4-ULP error bound. They use a vectorized path if all arguments are greater than 3.05e−151 and less than 5.147 for sine, or 2.574 for cosine. In the vectorized paths, polynomials with 8 and 9 nonzero terms are used to approximate the sine function, following Cody-Waite range reduction. In the scalar paths, Vectorlibm uses a polynomial with 10 nonzero terms.
SLEEF switches between two Cody-Waite range reduction methods with different sets of constants and the Payne-Hanek reduction. The first version of the algorithm operates for small arguments, and the second version for arguments of intermediate magnitude. Otherwise, SLEEF uses a vectorized Payne-Hanek reduction, which is described in 5.8. SLEEF only uses conditional branches for choosing a reduction method among Cody-Waite and Payne-Hanek. SLEEF uses a polynomial approximation of the sine function with 9 nonzero terms on the reduced domain. The sign is set in the reconstruction step.
5.2 Implementation of tan
After Cody-Waite or Payne-Hanek reduction, FDLIBM reduces the argument to [−π/4, π/4] and uses a polynomial approximation with 13 nonzero terms. It has 10 if statements after the Cody-Waite reduction.
tan in Vectorlibm has an 8-ULP error bound. A vectorized path is used if all arguments are less than 2.574 and greater than 3.05e-151. After Cody-Waite range reduction, a polynomial with 9 nonzero terms approximating the sine function is used twice to approximate the sine and cosine of the reduced argument. The result is obtained by dividing these values. In the scalar path, Vectorlibm evaluates a polynomial with 10 nonzero terms twice.
In SLEEF, the argument is reduced in three levels. It first reduces the argument with Cody-Waite or Payne-Hanek range reduction. At the second level, it reduces the argument further with another identity, and at the third level it halves the argument so that the double-angle formula can be used in the reconstruction. SLEEF then uses a polynomial approximation of the tangent function with 9 nonzero terms on the reduced argument, and obtains the final value with the double-angle formula in the reconstruction step. SLEEF only uses conditional branches for choosing a reduction method among Cody-Waite and Payne-Hanek. Annotated source code of tan is shown in Appendix A.
5.3 Implementation of asin and acos
FDLIBM and SLEEF first reduce the argument using standard identities between arcsine and arccosine.
Then, SLEEF uses a polynomial approximation of arcsine with 12 nonzero terms.
FDLIBM uses a rational approximation with 11 terms (plus one division). For computing arcsine, FDLIBM switches the approximation method if the original argument is over 0.975. For computing arccosine, it has three paths that are taken depending on the magnitude of the argument. It has 7 and 6 if statements in asin and acos, respectively.
asin and acos in Vectorlibm have a 6-ULP error bound. asin and acos in Vectorlibm use vectorized paths if the arguments are all greater than 3.05e-151 and 2.77e-17, respectively. Vectorlibm evaluates polynomials with 3, 8, 8, and 5 terms to compute arcsine. It evaluates a polynomial with 21 terms for arccosine.
5.4 Implementation of atan
FDLIBM reduces the argument to a small domain. It uses a polynomial approximation of the arctangent function with 11 nonzero terms. It has 9 if statements.
atan in Vectorlibm has a 6-ULP error bound. Vectorlibm uses vectorized paths if the arguments are all greater than 1.86e-151 and less than 2853. It evaluates four polynomials with 7, 9, 9 and 4 terms in the vectorized path.
SLEEF reduces the argument using the relation between the arctangent of an argument and that of its reciprocal. It then uses a polynomial approximation of the arctangent function with 20 nonzero terms. In the reconstruction step, the result is adjusted depending on whether the reciprocal was taken.
5.5 Implementation of log
FDLIBM reduces the argument to a domain close to 1. It then approximates the logarithm of the reduced argument with a polynomial that contains 7 nonzero terms, in a similar way to SLEEF. It has 9 if statements.
log in Vectorlibm has a 4-ULP error bound. It uses a vectorized path if the input is a normalized number. It uses a polynomial with 20 nonzero terms to approximate the logarithm function. It does not use division.
SLEEF first multiplies the argument by a large power of two if the argument is a denormal number, and compensates for this scaling when the exponent is extracted. SLEEF uses a polynomial with 7 nonzero terms to evaluate the logarithm of the reduced mantissa. As a reconstruction, it combines this value with the contribution of the extracted exponent.
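A reduction of this style can be sketched as follows (illustrative constants and polynomial, labeled as assumptions; this is not SLEEF's exact code or coefficient set):

```c
#include <math.h>

/* Sketch of a log reduction in this style: write x = m * 2^e with m near 1,
 * substitute t = (m-1)/(m+1), and use log(m) = 2*atanh(t) with a short odd
 * polynomial.  Constants and the truncated series are illustrative only. */
double log_sketch(double x) {
    int e;
    double m = frexp(x, &e);              /* x = m * 2^e, m in [0.5, 1) */
    if (m < 0.70710678118654752) {        /* keep m in [sqrt(1/2), sqrt(2)) */
        m *= 2.0;
        e -= 1;
    }
    double t = (m - 1.0) / (m + 1.0);     /* |t| <= 0.172 on this range */
    double t2 = t * t;
    /* 2*atanh(t) = 2t + (2/3)t^3 + (2/5)t^5 + ..., truncated after t^9 */
    double p = 2.0 + t2 * (2.0 / 3 + t2 * (2.0 / 5 + t2 * (2.0 / 7 + t2 * (2.0 / 9))));
    return e * 0.6931471805599453 + t * p;  /* reconstruction: e*log(2) + log(m) */
}
```

Because t shrinks linearly as m approaches 1, the truncation error of the series vanishes much faster than the function value, so the relative error stays small over the whole reduced range.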
5.6 Implementation of exp
All libraries reduce the argument range by finding a small reduced argument and an integer such that the input equals the reduced argument plus the integer times log 2.
SLEEF then uses a polynomial approximation with 13 nonzero terms to directly approximate the exponential function on this domain. It achieves a 1-ULP error bound without using DD operations.
FDLIBM uses a polynomial with 5 nonzero terms to approximate an intermediate value, from which the exponential of the reduced argument is computed. It has 11 if statements.
The reconstruction step adds the integer to the exponent of the FP number resulting from the above computation.
exp in Vectorlibm has a 4-ULP error bound. A vectorized path covers almost all of the input domain. It uses a polynomial with 11 terms to approximate the exponential function.
5.7 Implementation of pow
FDLIBM computes the logarithm of the first argument in DD precision. Then, it obtains pow by multiplying this logarithm by the second argument and applying the exponential. It has 44 if statements.
Vectorlibm does not implement pow.
SLEEF computes pow by combining an internal logarithm and an internal exponential. The internal computation is carried out in DD precision. In order to compute the logarithm internally, it uses a polynomial with 11 nonzero terms. The accuracy of the internal logarithm function is around 0.008 ULP. The internal exponential function in pow uses a polynomial with 13 nonzero terms.
5.8 The PayneHanek Range Reduction
Our method computes , where . The argument is an FP number, and therefore it can be represented as , where is an integer mantissa and is an integer exponent . We now denote the integral part and the fractional part of as and , respectively. Then,
The value only depends on the exponent of the argument, and therefore it can be calculated and stored in a table in advance. In order to compute the product in DD precision, the tabulated value must be in quad-double precision. We now denote it as the sum of four DP numbers. Then,
(1) 
In this method, we compute (1) in DD precision in order to avoid overflow. The table retaining the precomputed values occupies 32 Kbytes. Our method is included in the source code of tan shown in Appendix A.
FDLIBM seems to implement the original Payne-Hanek algorithm with more than 100 lines of C code, which include 13 if statements, 18 for loops, 1 switch statement and 1 goto statement. The number of iterations of most of the for loops depends on the argument.
Vectorlibm implements a non-vectorized variation of the Payne-Hanek algorithm which has some similarity to our method. In order to reduce the argument, it first decomposes the argument into two parts. A triple-double (TD) approximation to the relevant constant is looked up from a table, and the multiplication is then carried out in TD precision. The reduced argument is obtained as a product involving the fractional part of this result. In Table I, we compare the numbers of FP operators in the implementations. Note that the method in Vectorlibm is used for trigonometric functions with a 4-ULP error bound, while our method is used for functions with a 1-ULP error bound.
Operator  SLEEF (1 ULP)  Vectorlibm (4 ULP)
add/sub  36  71
mul  5  18
FMA  11  0
round  8  0
5.9 FP Remainder
We devised an exact remainder calculation method suitable for vectorized implementation. The method is based on the long division method, where an FP number is regarded as a digit. Fig. 3 shows an example process for calculating the FP remainder of 1e+40 / 0.75. Like a typical long division, we first find the integer quotient 1.333e+40 so that the product of 1.333e+40 and 0.75 does not exceed 1e+40. We multiply the found quotient by 0.75, and then subtract the product from 1e+40 to find the dividend 4.836e+24 for the second iteration.
Our basic algorithm is shown in Algorithm 1. If the dividend, the divisor and the quotient are FP numbers of the same precision, then the partial remainder at each step is representable with an FP number of that precision. In this case, the number of iterations can be minimized by choosing the largest FP number of that precision within the range specified at line 3. However, the algorithm still works if the quotient is any FP number within the range. By utilizing this property, an implementer can use a division operator that does not return a correctly rounded result. The source code of an implementation of this algorithm is shown in Fig. 7 in Appendix B. A part of the proof of correctness is shown in Appendix C.
FDLIBM uses a shift-and-subtract method. It first converts the mantissas of the two given arguments into 64-bit integers, and calculates a remainder on a bit-by-bit basis. The main loop iterates a number of times proportional to the difference between the exponents of the two arguments of fmod. This loop includes 10 integer additions and 3 if statements. The number of iterations of the main loop can exceed 1000.
Vectorlibm does not implement FP remainder.
5.10 Handling of Special Numbers, Exception and Flags
Our implementation gives a value within the specified error bound without special handling of denormal numbers, unless otherwise noted.
When a function has to return a specific value when a special argument (such as a NaN or a negative zero) is given, such a condition is checked at the end of each function, and the return value is substituted with the special value if the condition is met. This process is complicated in functions like pow, because they have many conditions for returning special values.
SLEEF functions do not give correct results if the computation mode is different from round-to-nearest. They do not set errno nor raise exceptions. This is common behavior among vectorized math libraries, including Libmvec [24] and SVML [29]. Because of SIMD processing, functions can raise spurious exceptions if they try to raise an exception.
5.11 Summary
FDLIBM extensively uses conditional branches in order to switch the polynomial according to the argument (sin, cos, tan, log, etc.), to return a special value if the arguments are special values (pow, etc.), and to control the number of iterations (the Payne-Hanek reduction).
Vectorlibm switches between a few polynomials in most of the functions. It does not provide functions with a 1-ULP error bound; nevertheless, the numbers of nonzero terms in its polynomials are larger than in the other two libraries for some of the functions. A vectorized path is used only if the argument is smaller than 2.574 in cos and tan, although these functions are frequently evaluated with much larger arguments. In most of the functions, Vectorlibm uses a non-vectorized path if the argument is very small or a non-finite number. For example, it processes 0 with non-vectorized paths in many functions, although 0 is a frequently evaluated argument in normal situations. Once non-finite numbers are contained in the data being processed, the whole processing can become significantly slower afterward. Such variation in execution time can be exploited for a side-channel attack in cryptographic applications.
SLEEF uses the fastest paths if all the arguments are under 15 for trigonometric functions, and the same vectorized path is used regardless of the argument in most of the non-trigonometric functions. SLEEF always uses the same polynomial regardless of the argument in all functions.
Although reducing the number of conditional branches has several advantages in implementing vector math libraries, it does not seem to be given a high priority in other libraries.
6 Testing
SLEEF includes three kinds of testers. The first two kinds test the accuracy of all functions against high-precision evaluation using the MPFR library. In these tests, the computation error in ULPs is calculated by comparing the values output by each SLEEF function with the values output by the corresponding function in the MPFR library, and it is checked that the error is within the specified bounds.
6.1 Perfunctory Test
The first kind of tester carries out a perfunctory set of tests to check if the build is correct. These tests include standards compliance tests, accuracy tests and regression tests.
In the standards compliance tests, we test whether the functions return the correct values when values that require special handling are given as the argument. These argument values include Inf, NaN and 0. atan2 and pow are binary functions, so there are many combinations of these special argument values; all of these combinations are also tested.
In the accuracy test, we check that the error of the values returned by the functions is within the specified range when a predefined set of argument values is given. These argument values are basically chosen at regular intervals between a few combinations of two endpoints. The trigonometric functions are also tested against argument values close to integral multiples of the reduction constant. Each function is tested against tens of thousands of argument values in total.
In the regression test, the functions are tested with argument values that triggered bugs in the previous library release, in order to prevent reemergence of the same bug.
The executables are separated into a tester and IUTs (Implementations Under Test). The tests are carried out by making these two executables communicate via an input/output pipeline, in order to enable testing of libraries for architectures which the MPFR library does not support.
6.2 Randomized Test
The second kind of tester is designed to run continuously. This tester generates random arguments and compares the output from each function with the output calculated with the corresponding function in the MPFR library. This tester is expected to find bugs if it is run for a sufficiently long time.
In order to randomly generate an argument, the tester generates random bits of the size of an FP value and reinterprets the bits as an FP value. The tester executes the randomized test for all the functions in the library at several thousand arguments per second for each function on a computer with a Core i7-6700 CPU.
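The bit-reinterpretation step can be sketched as follows (rand() is a stand-in for the tester's actual random number generator, which is an assumption made for this example):

```c
#include <stdint.h>
#include <string.h>
#include <stdlib.h>

/* Sketch of the argument generation: draw a uniformly random 64-bit pattern
 * and reinterpret it as a double, so that every encoding (normals, denormals,
 * infinities, NaNs) can appear as a test argument. */
double random_double_bits(void) {
    uint64_t bits = 0;
    for (int i = 0; i < 4; i++)
        bits = (bits << 16) | (uint64_t)(rand() & 0xffff);
    double d;
    memcpy(&d, &bits, sizeof d);   /* bit reinterpretation, not a conversion */
    return d;
}
```

Sampling bit patterns rather than values gives every representable FP number equal probability, which exercises corner cases such as denormals and NaN payloads far more often than sampling uniformly over an interval would.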
In the SLEEF project, we use randomized testing rather than formal verification in order to check the correctness of the functions. Proving the correctness of an implementation certainly contributes to its reliability. However, it incurs a performance overhead, because the implementation is restricted to forms whose correctness is easy to prove. There would also be an increased cost of maintaining the library, because the proof would need to be updated each time the implementation is modified.
6.3 BitIdentity Test
The third kind of tester tests whether bit-identical results are returned from the functions that are supposed to return such results. This test is designed to compare the results among the binaries compiled with different vector extensions. For each predetermined list of arguments, we calculate an MD5 hash value of all the outputs from each function. Then, we check that the hash values match among functions for different architectures.
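The check can be sketched as follows, with a 64-bit FNV-1a hash standing in for MD5 (an assumption made to keep the example self-contained):

```c
#include <stdint.h>
#include <string.h>

/* Sketch of the bit-identity check: hash the raw bit patterns of all outputs
 * from a function, then compare the digests across builds compiled for
 * different vector extensions.  FNV-1a stands in for MD5 here. */
uint64_t hash_outputs(const double *out, int n) {
    uint64_t h = 1469598103934665603ULL;           /* FNV-1a offset basis */
    for (int i = 0; i < n; i++) {
        uint64_t bits;
        memcpy(&bits, &out[i], sizeof bits);       /* hash bits, not values */
        for (int b = 0; b < 8; b++) {
            h ^= (bits >> (8 * b)) & 0xff;
            h *= 1099511628211ULL;                 /* FNV-1a prime */
        }
    }
    return h;
}
```

Hashing the bit patterns rather than comparing values catches differences that compare equal numerically, such as a NaN with a different payload or a zero with a different sign.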
7 Performance Comparison
In this section, we present results of a performance comparison between FDLIBM Version 5.3 [50], Vectorlibm [33], SLEEF 3.4, and Intel SVML [29] included in Intel C Compiler 19.
We measured the reciprocal throughput of each function by measuring the execution time of a tight loop that repeatedly calls the function in a single-threaded process. In order to obtain useful results, we turned off optimization flags when compiling the source code of this tight loop, while the libraries are compiled with their default optimization options. (If we turn on the optimizer, there is a concern that the compiler optimizes away the call to the function. In order to prevent this, we would have to introduce extra operations, which would themselves introduce overhead. After trying several configurations of the loop and optimizer settings, we decided to turn off the optimizer in favor of reproducibility, simplicity and fairness. We checked the assembly output from the compiler and confirmed that the unoptimized loop simply calls the target function and increments a counter, so the operations inside the loop are minimal.) We did not use LTO. We confirmed that the calls to the function are not compiled out or inlined by checking the assembly output from the compiler. The number of function calls by each loop is fixed, and the execution time of this loop is measured with the clock_gettime function.
We compiled SLEEF and FDLIBM using gcc-7.3.0 with the “-O3 -mavx2 -mfma” optimization options. We compiled Vectorlibm using gcc-7.3.0 with its default “-O3 -march=native -ftree-vectorize -ftree-vectorizer-verbose=1 -fno-math-errno” options. We changed VECTOR_LENGTH in vector.h to 4 and compiled the source code on a computer with an Intel Core i7-6700 CPU. (Vectorlibm evaluates functions with 512 bits of vector length by default. Because SLEEF and SVML are 256-bit wide, this setting was changed.) The accuracy of functions in SVML can be chosen by compiler options. We specified the “-fimf-max-error=1.0” and “-fimf-max-error=4.0” options for icc to obtain the 1-ULP and 4-ULP accuracy results, respectively.
We carried out all the measurements on a physical PC with an Intel Core i7-6700 CPU @ 3.40GHz without any virtual machine. In order to make sure that the CPU always runs at the same 3.4GHz clock speed during the measurements, we turned off Turbo Boost. With this setting, 10 nanoseconds correspond to 34 clock cycles.
The following results compare the reciprocal throughput of each function. If the implementation is vectorized, a single execution evaluates the corresponding mathematical function once per vector element. We generated arguments in advance and stored them in arrays. Each time a function is executed, we set a randomly generated argument in each element of the argument vector (each element is set with a different value). The measurement results do not include the delay for generating random numbers.
7.1 Execution Time of Floating Point Remainder
We compared the reciprocal throughput of the double-precision fmod functions in the libm included in Intel C Compiler 19, FDLIBM and SLEEF. All the FP remainder functions always return a correctly-rounded result. We generated random denominators and numerators uniformly distributed within ranges whose ratio is varied over several orders of magnitude. Fig. 4 shows the reciprocal throughput of the fmod function in each library. Please note that SVML does not contain a vectorized fmod function. The graph of reciprocal throughput looks like a step function, because the number of iterations increases in this way.
7.2 Comparison of Overall Execution Time
Func, error bound, domain  Vectorlibm  FDLIBM  SLEEF  SVML
sin, 1 ulp,  4.927  11.43  13.68  
sin, 4 ulps,  9.601  7.504  6.679  
sin, 1 ulp,  18.96  11.41  13.86  
sin, 4 ulps,  12.48  7.507  6.723  
sin, 1 ulp,  162.3  48.79  41.72  
sin, 4 ulps,  288.6  43.82  34.96  
cos, 1 ulp,  11.42  13.75  12.99  
cos, 4 ulps,  9.557  7.850  7.917  
cos, 1 ulp,  18.45  13.74  13.18  
cos, 4 ulps,  13.97  7.850  7.838  
cos, 1 ulp,  162.1  50.38  38.38  
cos, 4 ulps,  360.3  46.09  36.57  
tan, 1 ulp,  7.819  17.30  15.71  
tan, 4+ ulps,  15.58  9.367  7.570  
tan, 1 ulp,  22.24  17.28  15.78  
tan, 4+ ulps,  20.16  9.367  7.595  
tan, 1 ulp,  177.0  48.82  43.31  
tan, 4+ ulps,  399.4  36.54  40.50  
asin, 1 ulp,  14.87  12.99  12.10  
asin, 4+ ulps,  20.75  5.552  9.627  
acos, 1 ulp,  12.07  16.09  12.11  
acos, 4+ ulps,  23.62  7.572  10.23  
atan, 1 ulp,  10.16  22.12  19.97  
atan, 4+ ulps,  35.54  9.251  12.09  
log, 1 ulp,  31.66  15.46  12.05  
log, 4 ulps,  39.64  9.636  8.842  
exp, 1 ulp,  12.19  7.663  7.968  
exp, 4 ulps,  17.35  6.756  
pow, 1 ulp,  69.40  55.53  75.18  
We compared the reciprocal throughput of 256-bit wide vectorized double-precision functions in Vectorlibm, SLEEF and SVML, and scalar functions in FDLIBM. We generated random arguments that were uniformly distributed within the indicated intervals for each function. In order to check the execution speed of the fast paths in the trigonometric functions, we also measured the reciprocal throughput with small arguments. The results are shown in Table II.
The reciprocal throughput of functions in SLEEF is comparable to that of SVML in all cases. This is because the latency of FP operations is generally dominant in the execution time of math functions. Because there are two levels of scheduling mechanisms, the optimizer in a compiler and the out-of-order execution hardware, there is little room for making a difference in throughput or latency.
The execution speed of FDLIBM is not very slow despite its many conditional branches. This seems to be because of a smaller number of FP operations, and the faster execution speed of scalar instructions compared to the equivalent SIMD instructions.
Vectorlibm is slow even if only the vectorized path is used. This seems to be because Vectorlibm evaluates polynomials with a large number of terms. Autovectorizers are still maturing, and the compiled binary code might not be well optimized. When a slow path has to be used, Vectorlibm is even slower, since a scalar evaluation has to be carried out for each of the elements in the vector.
Vectorlibm uses Horner's method to evaluate polynomials, which incurs the long latency of a chain of dependent FP operations. In FDLIBM, this latency is reduced by splitting polynomials into even and odd terms, which can be evaluated in parallel. SLEEF uses Estrin's scheme. In our experiments, there was only a small difference in execution speed between Estrin's scheme and splitting polynomials into even and odd terms.
8 Conclusion
In this paper, we showed that our SLEEF library achieves performance comparable to commercial libraries while maintaining good portability. We have been continuously developing SLEEF since 2010 (https://sleef.org/) [48]. We distribute SLEEF under the Boost Software License [2], which is a permissive open source license. We actively communicate with developers of compilers and members of other projects in order to understand the needs of real-world users. The Vector Function ABI is important in developing vectorizing compilers. The functions that return bit-identical results were added to our library to reflect requests from multiple partners. We thoroughly tested these functionalities, and SLEEF is already adopted in multiple commercial products.
Appendix A Annotated source code of tan


Figs. 5 and 6 show a C source code of our implementation of the tangent function with a 1-ULP error bound. We omitted the second Cody-Waite reduction for the sake of simplicity. The definitions of the DD operators are provided in Table III. In the implementation of our library, we wrote all the operators with VEAL, as described in Sec. 3. The source code contains only one conditional branch (an if statement); the other if statements are implemented with conditional move operators. The for loop in the Payne-Hanek reduction is unrolled. The round functions are implemented with a single instruction for most of the vector extensions. The ESTRIN macro evaluates a polynomial in double-precision with Estrin's scheme.
In the loop in the Payne-Hanek reduction, Eq. (1) is computed, and the result is then multiplied by the appropriate constant. The numbers of FP operators shown in Table I are the numbers of operators in this part of the code. The path with the Payne-Hanek reduction is also taken if the argument is non-finite. In this case, the corresponding variable is set to NaN, and this propagates to the final result.
Appendix B Annotated source code of the FP remainder
Function name  Output
dd  DD number
ddadd2_d2_d2_d2  Sum of two DD numbers
ddadd_d2_d2_d2  Addition of two DD numbers (with a magnitude constraint on the operands)
ddadd_d2_d_d  Addition of two DP numbers (with a magnitude constraint on the operands)
ddadd_d2_d_d2  Addition of a DP number and a DD number (with a magnitude constraint on the operands)
ddmul_d2_d2_d2  Product of two DD numbers
ddmul_d2_d2_d  Product of a DD number and a DP number
ddmul_d2_d_d  Product of two DP numbers
dddiv_d2_d2_d2  Quotient of two DD numbers
ddsqu_d2_d2  Square of a DD number
ddscale_d2_d2_d  Product of a DD number and a DP number (used for scaling)
ddnormalize_d2_d2  Renormalizes a DD number
Fig. 7 shows a sample C source code of our FP remainder. In this implementation, both the dividend and the divisor are DP numbers. It correctly handles signs and denormal numbers. It executes a division instruction only once throughout the computation of the remainder. The implementation also supports the case where the division operator does not give a correctly rounded result; in this case, the nextafter function must be applied multiple times according to the maximum error. We implemented the if statements with conditional move operators. In the main loop, the algorithm finds the remainder of the absolute values, and the correct sign is assigned to the resulting remainder at the end.
The stopping condition of the for loop is not strictly required, and in a vectorized implementation this loop can be terminated after all the values in the vectors satisfy the stopping condition. Since we assume that all operators return a round-to-nearest FP number, we use the nextafter function to find a value close to the quotient but not exceeding it. This method is applicable only if the value is smaller than DBL_MAX. In order to find the correct quotient when it is between 1 and 3, we use a few comparisons to detect this case.
Appendix C Correctness of FP Remainder
It is obvious that Algorithm 1 returns a correct result. We show a partial proof that the result is representable as a radix-2 FP number of the given precision if the dividend, the divisor and the quotient are radix-2 FP numbers of that precision.
We now define property and function . holds iff there are integer and integer that satisfy . If and are finite DP numbers, and holds. denotes the maximum number that does not exceed and holds.
We now show that and hold if and hold. By Lemma 1, integer exists that satisfies . Then, and hold. holds because . Then, holds.
Lemma 1 Given and integer . There exists an integer that satisfies and .
Proof: If , holds. In this case, can be since . Otherwise, is an integer. , and therefore . Because , , and therefore can be .
We now discuss the number of iterations. We suppose is set to . If , and the loop terminates after th iteration. Otherwise, , and . Then, . Therefore the number of iterations is .
In order to show that the number that substitutes at line 7 in Fig. 7 satisfies the condition at line 3 in Algorithm 1, we prove assuming , and . is the floatingpoint number that is the closest to . We assume no overflow or underflow.
It is obvious that is substituted with a number that is smaller or equal to . , where is precision. Therefore, . Similarly, . Therefore, . Let . . . holds since and . Therefore, .
Acknowledgments
We wish to acknowledge Will Lovett and Srinath Vadlamani for their valuable input and suggestions. The authors are particularly grateful to Robert Werner, who reviewed the final manuscript. The authors would like to thank Prof. Leigh McDowell for his suggestions. The authors would like to thank all the developers that contributed to the SLEEF project, in particular Diana Bite and Alexandre Mutel.
References
 [1] (2019) Uops.info: characterizing latency, throughput, and port usage of instructions on Intel microarchitectures. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '19, New York, NY, USA, pp. 673–686. ISBN 978-1-4503-6240-5.
 [2] (2003-08) (Website).
 [3] (2013) (Website).
 [4] (2009-08) An optimized Cell BE special function library generated by Coconut. IEEE Transactions on Computers 58 (8), pp. 1126–1138. ISSN 0018-9340.
 [5] (Website).
 [6] (2014-09) ARM NEON Intrinsics Reference.
 [7] (2017) (Website).
 [8] (2018) (Website).
 [9] (1983) The design of Maple: a compact, portable, and powerful computer algebra system. In Computer Algebra, J. A. van Hulzen (Ed.), Berlin, Heidelberg, pp. 101–115. ISBN 978-3-540-38756-5.
 [10] (2010-09) Sollya: an environment for the development of numerical codes. In Mathematical Software - ICMS 2010, K. Fukuda, J. van der Hoeven, M. Joswig, and N. Takayama (Eds.), Lecture Notes in Computer Science, Vol. 6327, Heidelberg, Germany, pp. 28–31.
 [11] (2007-02) Liquid SIMD: abstracting SIMD hardware using lightweight dynamic mapping. In 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pp. 216–227. ISSN 1530-0897.
 [12] (1980-11-03) Implementation and testing of function software. In Problems and Methodologies in Mathematical Software Production, pp. 24–47.
 [13] (1980) (Website) Englewood Cliffs, N.J.
 [14] (2000-09) The long and winding road to high-performance image processing with MMX/SSE. In Proceedings Fifth IEEE International Workshop on Computer Architectures for Machine Perception, pp. 302–310.
 [15] (2003-12-24) CRLIBM: a correctly rounded elementary function library. In Proc. SPIE 5205, Advanced Signal Processing Algorithms, Architectures, and Implementations XIII, Vol. 5205, pp. 458–464.
 [16] (1971-06) A floating-point technique for extending the available precision. Numer. Math. 18 (3), pp. 224–242.
 [17] (2013) (Website).
 [18] (2014-05-06) Methods for high-throughput computation of elementary functions. In Parallel Processing and Applied Mathematics: 10th International Conference, pp. 86–95.
 [19] (1960) Organization of computer systems: the fixed plus variable structure computer. In Papers Presented at the May 3-5, 1960, Western Joint IRE-AIEE-ACM Computer Conference, New York, NY, USA, pp. 33–40.
 [20] (2006-03) Characterizing the branch misprediction penalty. In 2006 IEEE International Symposium on Performance Analysis of Systems and Software, pp. 48–58.
 [21] (Website).
 [22] (1991-03) An accurate elementary mathematical library for the IEEE floating point standard. ACM Trans. Math. Softw. 17 (1), pp. 26–45.
 [23] (1984) (Website).
 [24] (2015) (Website).
 [25] (2018) (Website).
 [26] (2018) (Website).
 [27] (2016-07) Neat SIMD: elegant vectorization in C++ by using specialized templates. In 2016 International Conference on High Performance Computing & Simulation (HPCS), pp. 848–857.
 [28] (2016-06) Intel 64 and IA-32 Architectures Optimization Reference Manual.
 [29] (2019) (Website).
 [30] (2011-12-08) ISO/IEC 9899:2011 Information technology — Programming languages — C. International Organization for Standardization.
 [31] (2012-01) Auto-vectorization techniques for modern SIMD architectures. In Proc. of the 16th Workshop on Compilers for Parallel Computing.
 [32] (2004-03) LLVM: a compilation framework for lifelong program analysis & transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO'04), Palo Alto, California.
 [33] (2016-11) A new open-source SIMD vector libm fully implemented with high-level scalar C. In 2016 50th Asilomar Conference on Signals, Systems and Computers, pp. 407–411.
 [34] (2017-08-17) Extending OpenMP SIMD support for target specific code and application to ARM SVE. In Scaling OpenMP for Exascale Performance and Portability, pp. 62–74.
 [35] (2012-02) Extending a C-like language for portable SIMD programming. SIGPLAN Not. 47 (8), pp. 65–74. ISSN 0362-1340.
 [36] (2015-06) Importance of bitwise identical reproducibility in earth system modeling and status report. Geoscientific Model Development Discussions 8, pp. 4375–4400.
 [37] (2016-12) Free Rider: a source-level transformation tool for retargeting platform-specific intrinsic functions. ACM Trans. Embed. Comput. Syst. 16 (2), pp. 38:1–38:24. ISSN 1539-9087.
 [38] (2009) Handbook of Floating-Point Arithmetic. 1st edition, Birkhäuser Boston, Inc. ISBN 081764704X.
 [39] (1997) Elementary Functions: Algorithms and Implementation. Birkhäuser Boston, Inc. ISBN 081763990X.
 [40] (2004) Auto-vectorization in GCC. In Proceedings of the 2004 GCC Developers Summit, pp. 105–118.
 [41] (2011-04) Vapor SIMD: auto-vectorize once, run everywhere. In International Symposium on Code Generation and Optimization, CGO 2011, pp. 151–160.
 [42] (2006) Auto-vectorization of interleaved data for SIMD. In Proceedings of the 27th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '06, pp. 132–143.
 [43] (2013) (Website).
 [44] (1983-01) Radian reduction for trigonometric functions. SIGNUM Newsl. 18 (1), pp. 19–24.
 [45] (2014) Speeding up HEP experiment software with a library of fast and auto-vectorisable mathematical functions. Journal of Physics: Conference Series 513 (5), pp. 052027.
 [46] (1997-10-01) Adaptive precision floating-point arithmetic and fast robust geometric predicates. Discrete & Computational Geometry 18 (3), pp. 305–363.
 [47] (1978-01) The CRAY-1 computer system. Commun. ACM 21 (1), pp. 63–72.
 [48] (2010-05-01) Efficient evaluation methods of elementary functions suitable for SIMD computation. Computer Science - Research and Development 25 (1), pp. 25–32.
 [49] (2016-03) Knights Landing: second-generation Intel Xeon Phi product. IEEE Micro 36 (2), pp. 34–46.
 [50] (2010) (Website).
 [51] (2004) (Website).
 [52] (2002-02) Intel OpenMP C++/Fortran compiler for Hyper-Threading technology: implementation and performance. In Intel Technology Journal, Vol. 6, pp. 36–46.
 [53] (2015-11-20) (Website) Intel Corporation.