The vectorized (AVX-512) batched singular value decomposition algorithm for matrices of order two.
In this paper a vectorized algorithm for simultaneously computing up to eight singular value decompositions (SVDs, each of the form A=UΣ V^∗) of real or complex matrices of order two is proposed. The algorithm extends to a batch of matrices of an arbitrary length n, that arises, for example, in the annihilation part of the parallel Kogbetliantz algorithm for the SVD of a square matrix of order 2n. The SVD algorithm for a single matrix of order two is derived first. It scales, in most instances error-free, the input matrix A such that its singular values Σ_ii cannot overflow whenever its elements are finite, and then computes the URV factorization of the scaled matrix, followed by the SVD of a non-negative upper-triangular middle factor. A vector-friendly data layout for the batch is then introduced, where the same-indexed elements of each of the input and the output matrices form vectors, and the algorithm's steps over such vectors are described. The vectorized approach is then shown to be about three times faster than processing each matrix in isolation, while slightly improving accuracy over the straightforward method for the 2× 2 SVD.
Let a finite sequence (A^[k])_k, where 1 ≤ k ≤ n, of complex 2× 2 matrices be given, and let the corresponding sequences (U^[k])_k and (V^[k])_k of unitary matrices be sought, as well as a sequence (Σ^[k])_k of diagonal matrices with real and non-negative diagonal elements, such that A^[k] = U^[k] Σ^[k] (V^[k])^∗, i.e., for each k, the right hand side of the equation is the singular value decomposition (SVD) of the left hand side. This batch of SVD computational tasks arises naturally in, e.g., parallelization of the Kogbetliantz algorithm for the SVD [2, 3, 4]. A parallel step of the algorithm, repeated until convergence, amounts to forming and processing such a batch, with each A^[k] assembled column by column from the elements of the iteration matrix at the suitably chosen pivot positions (p_k, p_k), (q_k, p_k), (p_k, q_k), and (q_k, q_k). The iteration matrix is then updated from the left by (U^[k])^∗ and from the right by V^[k], transforming the p_k th and the q_k th rows and columns, respectively, while annihilating the off-diagonal pivot positions.
For each , the matrices , , , and have the following elements,
where . When its actual index is either implied or irrelevant, is denoted by . Similarly, , , and denote , , and , respectively, in such a case, and the bracketed indices of the particular elements are also omitted.
When computing in the machine’s floating-point arithmetic, the real and the imaginary parts of the input elements are assumed to be rounded to finite (i.e., excluding ±∞ and NaN) double precision quantities, but the SVD computations can similarly be vectorized in single precision (the float datatype in the C language).
Let C and W denote the CPU’s cache line size and the maximal SIMD width, respectively, both expressed in bytes. For an Intel CPU with the 512-bit Advanced Vector Extensions Foundation (AVX-512F) instruction set, C = W = 64. Let B be the size in bytes of the chosen underlying datatype T (here, T = double in the real and the complex case alike, so B = 8), and let S = W / B. If , let , else let .
This paper aims to show how to compute, within a single thread, as many SVDs at the same time as there are SIMD/vector lanes available (S), one SVD per lane. Furthermore, these vectorized computations can execute concurrently on the non-overlapping batch chunks assigned to multiple CPU cores.
Techniques similar to the ones proposed in this paper have already been applied in  for vectorization of the Hari–Zimmermann joint diagonalizations of a complex positive definite pair of matrices  of order two, and could be applied, as future work, to the real variant of the Hari–Zimmermann algorithm for the generalized eigendecomposition  and the generalized SVD . Those efforts do not use the C compiler intrinsics, but rely instead on the vectorization capabilities of the Intel Fortran compiler over data laid out in a vector-friendly fashion similar to the one described in section 3. Simple as it may seem, it is also a more fragile way of expressing the vector operations, should the compiler ever renege on the present behavior of its autovectorizer. The intrinsics approach has been tried in  with 256-bit-wide vectors of the AVX2+FMA  instruction set, alongside AVX-512F, for vectorization of the eigendecompositions of symmetric matrices of order two by the Jacobi rotations computed similarly to . This way the one-sided Jacobi SVD (and, similarly, the hyperbolic SVD) of real matrices can be significantly sped up when is small enough to make the eigendecompositions’ execution time comparable to the Grammian formations and the column updates, e.g., when the targeted matrices are the block pivots formed in a block-Jacobi algorithm [13, 14].
In numerical linear algebra the term “batched computation” is well-established, signifying a simultaneous processing of a large quantity of relatively small problems, e.g., the LU and the Cholesky factorizations  and the corresponding linear system solving  on the GPUs, with appropriate data layouts. It is therefore both justifiable and convenient to reuse the term in the present context.
This paper is organized as follows. A non-vectorized Kogbetliantz method for the SVD of a matrix of order two is presented in section 2. In section 3 a vector-friendly data layout is proposed, followed by a summary of the vectorized algorithm for the batched SVDs in section 4. The algorithm comprises the following phases:
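1. scaling of the input matrices A, such that the scaled singular values cannot overflow,
2. the URV factorization of the scaled matrices,
3. the SVD of the non-negative upper-triangular middle factors, and
4. assembling of the left (U) and the right (V) singular vectors of the matrices A.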
The pointwise, non-vectorized Kogbetliantz algorithm for the SVD of a matrix of order two has been an active subject of research [18, 19, 20], and has been implemented for real matrices in LAPACK’s xLASV2 (for the full SVD) and xLAS2 (for the singular values only) routines, where x ∈ {S, D}. Here a simplified version of the algorithm from [4, trigonometric case] is described, with an early reduction of a complex matrix to the real one that is partly influenced by, but improves on, .
It is assumed in the paper that the floating-point arithmetic  is nonstop, i.e., does not trap on exceptions, and has the gradual underflow, i.e., Flush-denormals-To-Zero (FTZ) and Denormals-Are-Zero (DAZ) processor flags  are disabled.
To compute and for a complex with both components finite, including , while avoiding the complex arithmetic operations, use the function . With DBL_TRUE_MIN being the smallest positive non-zero (and thus subnormal, or denormal in the old parlance) double precision value, let
Here and in the following, min and max are the functions similar to the ones in the C language , but with slightly relaxed semantics: they return the minimal (respectively, maximal) of their two non-NaN arguments, or the second argument if the first is a NaN, as is the case with the vector minimum and maximum [6, VMINPD and VMAXPD]. See also [7, subsection 6.2] for a similar exploitation of the NaN handling of the min and max operations. It now follows that, when , and so ,
The signs of and are thus preserved in and , respectively.
A vectorized implementation is accessible from the Intel Short Vector Math Library (SVML) via a compiler intrinsic, as is a vectorized reciprocal square root () routine, helpful for the cosine calculations in (3), (11), and (12), though neither is always correctly rounded to at most half ulp (consult the reports on the High Accuracy functions at https://software.intel.com/content/www/us/en/develop/documentation/mkl-vmperfdata/top/real-functions/root.html).
Even if both components of are finite, from ( ‣ 2) can overflow, but cannot. Scaling a floating-point number by an integer power of two is exact, except when the significand of a subnormal result loses a trailing non-zero part due to shifting of the original significand to the right, or when the result overflows. Therefore, such scaling [6, VSCALEFPD] is the best remedy for the absolute value overflow problem.
Let the exponent of a floating-point value (assuming the radix two) be defined as and for a finite non-zero (see [6, VGETEXPPD]). Let be two less than the largest exponent of a finite double precision number. To find a scaling factor for , take as
where and are computed as
Note that , due to the definition of . If is real, is not used. The upper bound on is required to be finite, since would result in a NaN.
If there is a value of a huge magnitude (i.e., with its exponent greater than ) in , from (-1) will be negative and the huge values will decrease, either twofold or fourfold. Else, will be the maximal non-negative amount by which the exponents of the values in can jointly be increased, thus taking the very small values out of the subnormal range if possible, without any of the new exponents going over .
Let , and let denote the th column of . The Frobenius norm of ,
This scaling is both a simplification and an improvement of [4, subsection 2.3.2], which guarantees that the computed scaled singular values are finite, while avoiding any branching, lane masking, or recomputing when vectorized, with the only adverse effect being a potential sacrifice of the tiniest subnormal values in the presence of a huge one (i.e., with its exponent strictly greater than ) in .
If , let , else let . Denote the column-pivoted by . If , let , else let . Denote the row-sorted by . To make real and non-negative, let
Complex multiplication, required in ( ‣ 2.2), (4), (5), and (7), is performed using the fused multiply-add operations with a single rounding , , as in [24, cuComplex.h] and [25, subsection 3.2.1], i.e., for a complex holds
To annihilate , compute the Givens rotation , where , as
Since the column norms of a well-scaled are finite, its column-pivoted, row-sorted QR factorization in (2) cannot result in an infinite element in .
If then as a special case. Handling special cases in a vectorized way is difficult as it implies branching or using the instructions with a lane mask. However, function aids in avoiding both of these approaches similarly to in ( ‣ 2), since and from (2) can be computed as
Similarly, to make from (4) real and non-negative, take and obtain as
due to the column pivoting. Specifically, if is already real, then
Here the plane rotations and are computed, such that , where
with from (5) and .
Let, as in [4, subsection 2.2.1], where the following formulas have been derived,
With and from (9), , compute
as justified in the next paragraph.
Since the quotient in (10) is non-negative (when defined), is non-positive, and thus . From compute
with and . Assume that was not bounded in magnitude. If in floating-point (this occurs rarely, when , , and ), then instead of the correct result, . Else, if , adding one to its square would have made little difference before the rounding (and the sum would have overflowed after it), so the square root in (11) could be approximated by . Again, with so obtained, adding one to it in the denominator in (11) would have been irrelevant, and would then have equaled . Bounding from above as in (10) therefore prevents the argument of the square root from overflowing (so using instead of is not required), and ensures for all that would otherwise be greater than the bound.
Having thus computed , the right plane rotation is constructed from
where and .
The following Theorem 2.3 shows that the special form of contributes to an important property of the computed scaled singular values ; namely, they are already sorted non-ascendingly, and thus never have to be swapped in a postprocessing step. Also, the scaled singular values are always finite in floating-point arithmetic. For it holds , where
Multiplying the matrices on the left hand side of (14) and equating the elements of the result with the corresponding elements of the right hand side, one obtains
after an algebraic simplification using the relation (12) for . The equations for and from (13) then follow by multiplying the equations for and from (15), respectively, by . In particular, , since would imply an obvious contradiction with the assumption.
where the argument of the square root can be expressed as
after substitution of (16) for and and a subsequent algebraic simplification. For a fixed but arbitrary , (21), and thus the numerator of (20), decrease monotonically as . Substituting zero for in (20) and (21), the former becomes
which proves the inequality between the tangents.
The inequality between the scaled singular values follows easily from (15) as
has to be bounded from above, to be able to bound from below, and thus (20) from above. From (5) and the column pivoting goal, , which gives after dividing by , i.e., and are contained in the intersection of the first quadrant and the unit disc. On this domain, attains the maximal value of for , so , as claimed, and thus . Substituting this lower bound for in (20), it follows .
it can be concluded that
so , where the right hand side is the immediate successor (that represents ) of the largest finite floating-point number, as claimed.
where has to be backscaled to obtain the singular values of the original input matrix. However, the backscaling should be skipped if it would cause the singular values to overflow or (a less catastrophic but still inaccurate outcome) underflow, while informing the user of such an event by preserving the value of .
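A guarded backscaling of a single scaled singular value might look as follows in scalar C. The function `backscale` and its safety test are assumptions made for illustration; in particular, the sketch conservatively treats any non-zero value that becomes zero or subnormal as an underflow, and it leaves the output untouched in the unsafe case so that the caller can keep the scaled value together with its scaling exponent.

```c
#include <math.h>
#include <stdbool.h>

/* Hypothetical helper: store 2^(-s) * sigma_scaled in *sigma if that is
   safe, and report whether the backscaling has been performed. */
static bool backscale(double sigma_scaled, double s, double *sigma)
{
    const double r = scalbn(sigma_scaled, -(int)s);
    /* unsafe: overflow, or a non-zero value turned zero or subnormal */
    const bool bad = !isfinite(r) ||
        ((sigma_scaled != 0.0) && (fpclassify(r) != FP_NORMAL));
    if (!bad)
        *sigma = r;
    return !bad;
}
```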
Computing each element of a complex requires only one complex multiplication.
Writing a complex number as , noting that , , and are the complex conjugates of , , and , respectively, and precomputing and , (20) can be expanded as, e.g.,
but any other mathematically equivalent computation that, like this one, minimizes the number of roundings required for forming the elements of is also valid.
Vectors replace scalars in the SIMD arithmetic operations. A vector should hold S elements from the same matrix sequence, with the same row and column indices, and the consecutive bracketed indices. When computing with complex numbers, however, it is more efficient to keep the real and the imaginary parts of the elements in separate vectors, since there are no hardware vector instructions for, e.g., the complex multiplication and division, which thus have to be implemented manually. Also, a vector should be aligned in memory to a multiple of W bytes to employ the most efficient versions of the vector load/store operations. It is therefore essential to establish a vector-friendly layout for the matrix sequences , , , and in the linear memory space. One such layout, inspired by splitting the real and the imaginary parts of the matrix elements into separate vectors [7, subsection 6.2], is
where and (similarly, , , and , ) for are the sequences of the real and the imaginary components, respectively, of the elements in the th row and the th column of the matrices in (similarly, in and in ).
Each train of boxes represents a contiguous region of memory aligned to W bytes. In the real case, no -boxes exist, but the layout otherwise stays the same. The scaled singular values and from (8) are stored as
respectively, while the scaling parameters from (-1) are laid out as
The virtual elements, with their bracketed indices ranging from to , serve if present as a (e.g., zero) padding, which ensures that all vectors, including the last one, formed from the consecutive elements of a box train, hold the same maximal number (S) of defined values and can thus be processed in a uniform manner.
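The allocation of one aligned, zero-padded box train can be sketched as below, assuming W = 64 bytes and double precision elements (so eight lanes per vector). The names `W_BYTES`, `S_LANES`, `padded_length`, and `alloc_train` are hypothetical; one such array would be allocated for each real or imaginary component of each matrix element position.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define W_BYTES 64u                        /* SIMD width in bytes */
#define S_LANES (W_BYTES / sizeof(double)) /* 8 lanes per vector  */

/* Batch length rounded up to a multiple of the lane count */
static size_t padded_length(size_t n)
{
    return ((n + S_LANES - 1u) / S_LANES) * S_LANES;
}

/* One zero-filled, W-aligned box train for n elements; the caller
   releases it with free(). Returns NULL for n == 0 or on failure. */
static double *alloc_train(size_t n)
{
    const size_t bytes = padded_length(n) * sizeof(double);
    double *p = (bytes ? (double *)aligned_alloc(W_BYTES, bytes) : NULL);
    if (p)
        memset(p, 0, bytes); /* the padding elements become zeros */
    return p;
}
```

Since the padded length is a multiple of the lane count, the byte size passed to `aligned_alloc` is always a multiple of the alignment, as C11 requires.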
The input sequence may initially be in another layout and has to be repacked before any further computation. Also, the output sequences , , , and may have to be repacked for further processing. Such reshufflings should be avoided, as they incur a substantial overhead in both time and memory requirements.
Layout of data, including the intermediate results, in vector registers during the computation is the same as it is for the box trains, but with S elements instead of . The th vector, for , encompasses the consecutive indices ,
A vector is loaded into, kept in, and stored from, a variable of the C type __m512d.
In the following, a bold lowercase letter stands for a vector, and the uppercase one for a (logical, not necessarily in-memory) matrix sequence. For example, is a sequence of matrices , of which is a subsequence of length S, and is a vector containing elements of , for some and its corresponding indices from (17). A bold constant denotes a vector with all its values being equal to the given constant. An arithmetic operation on vectors (or collections thereof) or matrix sequences is a shorthand for a sequence of the elementwise operations; e.g.,
where and are any two matrix sequences, is a product of matrices of order two, and is a collection of vectors with all their values equal to two. All bracketed indices are one-based, as it is customary in linear algebra, but in the C code they are zero-based, being thus one less than they are in the paper’s text.
When there are two cases, the real and the complex one, all code-presenting figures cover the latter with a mixture of the actual statements and a mathematically oriented pseudocode. The real-case differences are described in them in the comments starting with . A function name in uppercase, NAME, is a shorthand for the _mm512_name_pd C compiler intrinsic, if the operation is available in the machine’s instruction set, or for an equivalent sequence of the bit-pattern preserving casts to and from the integer vectors and an equivalent integer NAME operation. More precisely, for bitwise operations, if the AVX-512DQ instruction set extensions are not supported, an exception to the naming rule holds:
All other required operations have a _pd variant (with double precision vector lanes) in the core AVX-512F instruction set, so it suffices for implementing the entire algorithm. Additionally, let CMPLT_MASK stand for the _mm512_cmplt_pd_mask intrinsic, i.e., for the less-than lane-wise comparison of two vectors.
The four phases of the algorithm for the batched SVDs of order two, as listed in section 1, can be succinctly depicted by the following logical execution pipeline,
where the first row shows the transformations of , the second row contains the various matrix sequences that are the “by-products” of the computation, described in section -1, ending with the sequences of the left and the right singular vectors, that are formed as indicated in the third row. As the singular values can overflow due to the backscaling (see subsection 2.3) of the scaled ones (), computing them unconditionally is unsafe, and such postprocessing is left to the user’s discretion. In certain use-cases it might be known in advance that the singular values cannot overflow/underflow, e.g., if the initial matrices have already been well-scaled at their formation. The backscaling, performed as in Fig. 1, is then unconditionally safe.
The pipeline is executed independently on each non-overlapping subsequence of S consecutive matrices. If there are more such sequences than the active threads, at a point in time some sequences might have already been processed, while the others are still waiting, either for the start or the completion of the processing. A conceptual core of a driver routine implementing such a pass over the data is shown in Fig. 2, where xSsvd2, , are the main (real or complex, respectively), single-threaded routines that are responsible for all vectorized computations on each particular sequence of size S. The OpenMP  parallel for directive in Fig. 2 assumes a user-defined maximal number and placement/affinity of threads.
The input arguments of xSsvd2 are (the pointers to) the arrays, each aligned to W bytes, of S double values, e.g., const double A12r[static S] for . The output arguments are similar, e.g., double U21i[static S] for . Note that the same interface, up to replacing S by 1, would be applicable to the pointwise x1svd2 routine for a single SVD, but without the implied alignment restriction.
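The chunked pass over the data can be sketched as follows. This is not the driver from Fig. 2: the kernel type `chunk_kernel`, the placeholder `demo_kernel`, and the function `drive` are hypothetical, and a single input and a single output array stand in for the full argument lists. The sketch only shows how a kernel with the `[static S]` array interface is applied to non-overlapping chunks, optionally in parallel when OpenMP is enabled.

```c
#include <stddef.h>

#define S 8 /* doubles per 512-bit vector */

/* Hypothetical per-chunk kernel: reads S inputs, writes S outputs */
typedef void (*chunk_kernel)(const double in[static S],
                             double out[static S]);

/* Placeholder kernel standing in for the actual SVD routine */
static void demo_kernel(const double in[static S], double out[static S])
{
    for (int l = 0; l < S; ++l)
        out[l] = 2.0 * in[l];
}

/* Apply the kernel to each chunk; n_padded must be a multiple of S */
static void drive(size_t n_padded, const double *in, double *out,
                  chunk_kernel k)
{
    const ptrdiff_t chunks = (ptrdiff_t)(n_padded / S);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (ptrdiff_t i = 0; i < chunks; ++i)
        k(in + i * S, out + i * S);
}
```

Each iteration touches only its own chunk, so the iterations are independent and can be distributed among the threads in any order.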
No branching is involved explicitly in the xSsvd2 routines. They are therefore fully branch-free if the SVML routines used are. All data, once loaded from memory or computed, is intended to be held in the zmm vector registers until the output has been formed and written back to RAM. This goal is almost achieved in the test setting: the optimizer reports two vector register spills, with a total of only four extra memory accesses (two writes and two reads). A hand-tuned self-contained assembly might do away with these as well.
The first three phases of the algorithm are vectorized as described in sections 5, 6, and 7, respectively, since each of the phases can be viewed as an algorithm on its own. They are, however, chained by the dataflow, each having as its input the output of the previous one. Should the output of a phase be made available alongside the final results, it could be written to an additional memory buffer in the same layout as presented in section 3. Otherwise, the intermediate results are not preserved.
Vectorization of the last, fourth phase of the algorithm from (20) is as tedious and uninformative as it is straightforward, and so it is omitted for brevity. It suffices to say that (and ) and (and ), for , are computed from (20), using the operation where possible, and (1) for the complex multiplications. The final row permutations by or are performed in the same way as the row swaps in the URV factorization phase, described in Fig. 6 in section 6. An interested reader is referred to the actual code in the supplementary material (available in the https://github.com/venovako/VecKog repository).
Computation of the scaling parameters is remarkably simple, as shown in Fig. 3.