A parallel, blocked, one-sided Hari–Zimmermann algorithm for the generalized singular value decomposition (GSVD) of a real or a complex matrix pair (F, G) is here proposed, where F and G have the same number of columns and are both of full column rank. The algorithm targets either a single graphics processing unit (GPU) or a cluster of those, performs all non-trivial computation exclusively on the GPUs, requires the minimal amount of memory that can reasonably be expected, scales acceptably as the number of available GPUs increases, and guarantees reproducible, bitwise-identical output across runs repeated on the same input with the same number of GPUs.
The Hari–Zimmermann generalized SVD for CUDA.
The Hari–Zimmermann algorithm is a Jacobi-type method for computing the generalized eigenvalue decomposition (GEVD) of a matrix pair (A, B), where both matrices are Hermitian of the same order and B is positive definite.
If A and B are instead given implicitly by their factors F and G (not necessarily square nor with the same number of rows), respectively, such that A = F^*F and B = G^*G, then the GEVD of (A, B) can be computed implicitly, i.e., without assembling A and B in entirety from the factors, by a modification of the Hari–Zimmermann algorithm Novakovic-Singer-Singer-2015 . However, pivot submatrices of A and of B of a certain, usually small order are formed explicitly throughout the computation.
The modified algorithm is a method that jointly orthogonalizes the pairs of columns of F and G by a sequence of transformations that are applied from the right side of the factors only. Such a one-sided algorithm computes U, Σ_F, V, Σ_G, and Z, where F Z = U Σ_F, G Z = V Σ_G, and U and V have orthonormal columns. The matrix Z is square and nonsingular, while Σ_F and Σ_G are non-negative, diagonal, and scaled such that Σ_F² + Σ_G² = I. The method thus implicitly computes the GEVD of (A, B), but explicitly the generalized singular value decomposition (GSVD; see, e.g., VanLoan-76 ; Paige-Saunders-81 ) of (F, G), with the generalized singular values forming the diagonal of Σ = Σ_F Σ_G^{-1} (all of them finite, since Σ_G has a positive diagonal). Furthermore, the generalized singular values can be considered to be sorted descendingly by a symmetric permutation, i.e., Σ' = P^T Σ P, and thus U' = U P, V' = V P, and Z' = Z P, where U', V', and Z' constitute a decomposition of (F, G) possessing all other aforementioned properties.
The GEVD of (A, B) can be recovered by letting Λ = Σ² and noting that Z^* A Z = Σ_F² and Z^* B Z = Σ_G², i.e., the columns of Z are the generalized eigenvectors, and the diagonal of Λ contains the generalized eigenvalues of (A, B).
If needed, the right generalized singular vectors X can either be computed from Z, or can be obtained simultaneously with Z by accumulating the inverses of the transformations that have been multiplied to form Z Singer-DiNapoli-Novakovic-Caklovic-2019 (i.e., if Z = Z_1 Z_2 ⋯ Z_T, when T transformations have been applied, then X = Z_1^{-*} Z_2^{-*} ⋯ Z_T^{-*} = Z^{-*}).
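As a small illustration of the relations above, the following sketch (pure Python, real diagonals only; the function name is illustrative) maps the scaled diagonals of Σ_F and Σ_G to the generalized singular values σ_k = σ_{F,k}/σ_{G,k} and the generalized eigenvalues λ_k = σ_k²:

```python
import math

def gen_eigenvalues(sigma_f, sigma_g):
    """Map the GSVD diagonals to generalized singular values and
    generalized eigenvalues of (A, B) = (F^*F, G^*G).

    Assumes sigma_f[k]**2 + sigma_g[k]**2 == 1 and sigma_g[k] > 0,
    as guaranteed by the scaling described in the text."""
    sigmas = [sf / sg for sf, sg in zip(sigma_f, sigma_g)]
    lambdas = [s * s for s in sigmas]  # lambda_k = sigma_k^2
    return sigmas, lambdas

# a toy pair of scaled diagonals: sf[k]^2 + sg[k]^2 = 1 for each k
sf = [math.sqrt(0.8), math.sqrt(0.5), math.sqrt(0.2)]
sg = [math.sqrt(0.2), math.sqrt(0.5), math.sqrt(0.8)]
sigmas, lambdas = gen_eigenvalues(sf, sg)
```

Note that the eigenvalues come out sorted descendingly exactly when the generalized singular values do, which motivates the sorting discussed later.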
The recent work Novakovic-Singer-Singer-2015 has shown that such a method can be successfully blocked and parallelized for shared-memory CPUs, and for clusters of those, albeit only real matrix pairs were considered therein. Even the sequential but blocked version outperformed the GSVD algorithm in LAPACK Anderson-et-al-99 , and the parallel ones exhibited decent scalability.
On the other hand, an efficient, parallel, and blocked one-sided Jacobi-type algorithm for the “ordinary” and the hyperbolic SVD Novakovic-2015 ; Novakovic-2017 of a single real matrix has been developed for the GPUs; it utilizes the GPUs almost fully, with the CPU serving only a controlling purpose in the single-GPU case.
This work aims to merge the experience of those two approaches, and present a parallel and blocked one-sided (also called “implicit”) Hari–Zimmermann algorithm for the GSVD on the GPU(s) as an extension of the latter, but for the complex matrix pairs as well as for the real ones.
Even though the research in parallelization of the GSVD has a long history Luk-85 ; Bai-94 , three novel and major differences from the earlier, Kogbetliantz-based procedures aim to ensure both the high performance and the high relative accuracy of this one: the implicit Hari–Zimmermann algorithm is used as the basic method; it is blocked to exploit the GPU memory hierarchy; and the massive parallelism of the GPUs suits the algorithm (and vice versa) perfectly.
This paper continues with section 2, where the complex and the real one-sided Hari–Zimmermann algorithms are introduced, together with the general, architecturally agnostic principles of their blocking and parallelization. In section 3 the single-GPU implementation is described in detail, while in section 4 the most straightforward multi-GPU implementation approach is suggested. The numerical testing results are summarized in section 5, and the paper concludes with some directions for future research in section 6. In Appendix A a non-essential but computationally cheap way of enhancing the accuracy of the real and the complex dot-products on the GPUs is proposed. The full testing results and some details of the chosen Jacobi strategies are provided as the supplementary material.
In this section the complex and the real one-sided Hari–Zimmermann algorithms are briefly described. Please see Hari-84 ; Hari-2018 for a more thorough overview of the two-sided algorithms, and Novakovic-Singer-Singer-2015 for a detailed explanation of the real implicit Hari–Zimmermann algorithm. This paper closely follows the terminology and the implementation decisions of Singer-DiNapoli-Novakovic-Caklovic-2019 , where the complex generalized hyperbolic SVD based on the implicit Hari–Zimmermann approach has been introduced, but without the hyperbolic scalar products (i.e., the signature matrix is taken to be the identity here) and without forming the right generalized singular vectors X.
Let the matrices F and G be of size m_F × n and m_G × n, respectively, with m_F, m_G ≥ n. Then, B = G^*G is square of order n, and assume that m_F = m_G = n. Otherwise, for m_F > n or m_G > n, the GSVD of (F, G) is obtained by taking the decomposition of the shortened pair described next.
Even though the algorithm works on the rectangular matrices, it might be beneficial performance-wise to avoid transforming very tall and skinny (block) columns by working on the square matrices instead. To shorten F and G, the problem is transformed by computing the QR factorization of G with the column pivoting, G P_G = Q_G R_G, and then F, with its columns prepermuted by P_G, is shortened by the column-pivoted QR factorization, F P_G P_F = Q_F R_F. The square matrices R_F and R_G P_F, both of order n, take the place of F and G in the algorithm, respectively. With Z_R in the decompositions of R_F and of R_G P_F, the matrix Z from the former, sought-for decomposition can be recovered by using P_G, P_F, and the computed Z_R from the latter as Z = P_G P_F Z_R.
It is assumed that diag(B) = I, i.e., the column norms of G are unity. Should it not be the case, F and G are then prescaled by a nonsingular, diagonal matrix D, where d_jj = 1/‖g_j‖₂, g_j is the jth column of G, and 1 ≤ j ≤ n; otherwise, D = I. The iterative transformation phase starts with the matrix pair (F_0, G_0), where F_0 = F D and G_0 = G D. Implicitly, A and B have been transformed by a congruence with D as A_0 = D^* A D and B_0 = D^* B D.
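The prescaling step can be sketched as follows (pure Python, real case, matrices stored as lists of rows; the helper name is illustrative):

```python
import math

def prescale(F, G):
    """Rescale the columns of F and G by a diagonal D so that each
    column of G gets unit Euclidean norm -- a sketch of the prescaling
    step that makes diag(G^T G) = I (real case only)."""
    n = len(G[0])
    # d[j] = 1 / ||g_j||_2, the jth diagonal entry of D
    d = [1.0 / math.sqrt(sum(G[i][j] ** 2 for i in range(len(G))))
         for j in range(n)]
    def scale(M):
        return [[M[i][j] * d[j] for j in range(n)] for i in range(len(M))]
    return scale(F), scale(G), d

# toy diagonal factors: columns of G have norms 2 and 4
F = [[2.0, 0.0], [0.0, 3.0]]
G = [[2.0, 0.0], [0.0, 4.0]]
F0, G0, d = prescale(F, G)
```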
An iteration (or “step”) of the sequential non-blocked Hari–Zimmermann algorithm consists of selecting a pair of indices (p, q), 1 ≤ p < q ≤ n, and thus two 2 × 2 pivot submatrices, one of A, Â = [a_pp, a_pq; a_qp, a_qq], and one of B, B̂ = [b_pp, b_pq; b_qp, b_qq].
If Ẑ is embedded into an n × n matrix Z̃ such that z̃_pp = ẑ_11, z̃_pq = ẑ_12, z̃_qp = ẑ_21, z̃_qq = ẑ_22, while letting Z̃ be the identity matrix elsewhere, then, looking two-sidedly, the congruence with Z̃ transforms the pair (A, B) into a pair (A′, B′), where A′ = Z̃^* A Z̃ and B′ = Z̃^* B Z̃. One-sidedly, the transformation by Z̃ orthogonalizes the pth and the qth pivot columns of F and G to obtain F′ = F Z̃ and G′ = G Z̃. Also, Z̃ is accumulated into the product Z. In a one-sided sequential step only the pth and the qth columns of F, G, and Z are effectively transformed, in-place (i.e., overwritten), by postmultiplying them with the matrix Ẑ, while the other columns of these matrices remain intact: [f′_p f′_q] = [f_p f_q] Ẑ, [g′_p g′_q] = [g_p g_q] Ẑ, [z′_p z′_q] = [z_p z_q] Ẑ.
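A minimal sketch of the one-sided in-place column update follows (pure Python, real case, row-major storage; only the two pivot columns are touched, matching the description above):

```python
def update_columns(M, p, q, Zh):
    """Postmultiply the p-th and q-th columns of M (a list of rows) by a
    2x2 matrix Zh in place: [m_p m_q] <- [m_p m_q] @ Zh.
    All other columns of M remain intact."""
    for row in M:
        mp, mq = row[p], row[q]
        row[p] = mp * Zh[0][0] + mq * Zh[1][0]
        row[q] = mp * Zh[0][1] + mq * Zh[1][1]

# rotating a pair of columns by 90 degrees swaps them up to a sign
M = [[1.0, 0.0], [0.0, 1.0]]
update_columns(M, 0, 1, [[0.0, -1.0], [1.0, 0.0]])
```

In the algorithm the same update is applied, with the same Ẑ, to the pivot columns of F, G, and Z.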
As Ẑ is chosen to preserve the unit diagonal of B̂, it follows that b′_pp = b′_qq = 1. However, due to the floating-point rounding errors, these equations might not hold. To prevent diag(B) from drifting too far away from the identity as the algorithm progresses, the squared Frobenius norms of f′_j and g′_j could be recomputed for each j ∈ {p, q} as φ_j = ‖f′_j‖² and γ_j = ‖g′_j‖². Then, a rescaling of f′_j and g′_j as f″_j = f′_j/√γ_j and g″_j = g′_j/√γ_j, by a diagonal matrix D′ such that d′_jj = 1/√γ_j for j ∈ {p, q} and d′_jj = 1 otherwise, should bring diag(B) back close to the identity. From φ_j and γ_j it is then possible to compute σ_j = √(φ_j/γ_j), with the final Σ = diag(σ_1, …, σ_n). In this version of the algorithm it is not necessary to rescale the columns of F and G by D at the start, since such rescaling happens at each step, so D = I. If the arithmetic were exact, this version would be equivalent to the standard (previously described) one, for which it can be formally set γ_p = γ_q = 1 and D′ = I.
Suppose that Ẑ has been computed (by either version) such that it diagonalizes Â and B̂, but σ_p < σ_q. To keep Σ sorted descendingly, swap the columns of Ẑ by a permutation P̂ to obtain Ẑ′ = Ẑ P̂. Such Ẑ′ will swap the pth and the qth columns of F, G, and Z as it orthogonalizes them. Sorting in each step is a heuristic that notably speeds up the algorithm in practice (see section 5), but it makes reasoning about the convergence harder and is not strictly necessary.
Computing Ẑ from Â and B̂ is more involved in the complex case than in the real one. However, in both cases, it is first established whether the pth and the qth columns of F and G are numerically relatively orthogonal with respect to a threshold proportional to ε, where ε is the precision of the chosen floating-point datatype. (This relation relies on the expected, as opposed to the worst-case, rounding error for the dot-products Drmac-97 that form the elements of Â and B̂, and while sensible in the real case, it is probably too tight in the complex case, where a more careful analysis of the complex dot-products might be employed in the future work and a handful of transformations subsequently might be skipped.) If they are, no non-trivial transformation is to take place, and Ẑ is at most a column swap, since the swap may still be warranted. Rescaling by D′ is thus not performed in that case, since it might perturb the columns enough for them to cease to be numerically orthogonal.
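One common form of such a relative-orthogonality test is sketched below; the √m·ε threshold reflects the expected-case rounding bound for length-m dot products, but the exact constant used in the algorithm may differ (pure Python, illustrative names):

```python
import math
import sys

def relatively_orthogonal(apq, app, aqq, m, eps=sys.float_info.epsilon):
    """Test whether the off-diagonal element a_pq of a pivot matrix is
    negligible relative to the diagonal elements a_pp and a_qq.
    The sqrt(m)*eps factor models the expected (not worst-case) rounding
    error of a length-m dot product; it is one common choice of
    threshold, not necessarily the paper's exact constant."""
    return abs(apq) < math.sqrt(m) * eps * math.sqrt(app * aqq)

# a tiny off-diagonal element passes the test for m = 1024
ok = relatively_orthogonal(1e-18, 1.0, 1.0, m=1024)
```

When the test succeeds for both Â and B̂, the transformation degenerates to (at most) a column swap, as described above.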
To that end, let , , or if , , and define to be with the sign of for and real. Then, let , set
and, noting that since is positive definite, with these quantities compute
In these ranges of the angles, for the trigonometric identities and hold when . Otherwise, , , and . Then, compute , , , and , and with them finally obtain
where and .
If , i.e., if and , then is undefined, and might also be. In that case, it can be shown that and are diagonalized by
To that end, let and . Then, set
Note that the sines and the cosines (and the corresponding tangents) have the same sign in the given range of the angles. Assuming that the floating-point arithmetic unit does not trap on divisions by zero and infinities, obtain the tangent as
and from it and using the same trigonometric identities as in the complex case. Finally, compute
Since the real case is in fact a simplification of the complex case, when is undefined, being , i.e., when and (or, in other words, when and are proportional), define
The sequential one-sided algorithm in each step chooses a single pivot index pair, according to some criterion that is called a sequential Jacobi strategy. However, at most ⌊n/2⌋ pivot column pairs of each matrix can be transformed concurrently, provided that all indices in the chosen index pairs are distinct.
In a parallel step a sequence of t pivot index pairs, where t ≤ ⌊n/2⌋, such that each index in the range from 1 to n appears at most (and, for even n, exactly) once in the sequence, addresses t pivot column pairs of F and G to be transformed—each pair by a separate, concurrent task. All permutations of a given sequence are equivalent from the numerical point of view, since the resulting F′, G′, and Z′ are the same for every reordering of the sequence, and therefore any reordering represents the entire equivalence class.
For simplicity, a barrier is assumed between the successive parallel steps, i.e., all tasks of a step have to be completed before those of the following step are started.
A criterion to choose a pivot index pair sequence for each parallel step is called a parallel Jacobi strategy. Among the strategies that are simplest to compute are the ones that prescribe a pivot sequence for each step, until all possible index pairs are selected at least once. Then, the choice of the steps is periodically repeated. Let P be the shortest such period. The first P steps constitute the first sweep, the following P steps the second sweep, and so on.
If in any sweep exactly n(n − 1)/2 different index pairs are chosen, such a strategy is called cyclic; otherwise, some index pairs are repeated within a sweep, and the strategy is called quasi-cyclic. For even n, P ≥ n − 1, and the equality holds if and only if the strategy is cyclic.
A strategy is defined for a fixed ; however, by a slight abuse of the usual terminology, a single principle by which the particular strategies are generated for some given matrix orders will simply be called a strategy kind, or even a strategy for short.
Based on the previous experience with the one-sided Jacobi-like algorithms, two parallel Jacobi strategy kinds have been selected for testing: the modified modulus one (mm; see, e.g., Novakovic-SingerSanja-2011 ; Novakovic-Singer-Singer-2015 ), which is quasi-cyclic, and the generalized Mantharam–Eberlein one (me; see Mantharam-Eberlein-93 ; Novakovic-2015 ), which is cyclic. Please see Figures 1 and 2 in the supplementary material, where a sweep of me and of mm, respectively, is shown two-sidedly on a matrix of order 32.
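For even n, a cyclic parallel strategy attaining the shortest period P = n − 1 can be generated by the classic round-robin tournament rotation; the sketch below is a generic example of such a strategy, in the spirit of (but not identical to) the me strategy used here:

```python
def round_robin_sweep(n):
    """Generate one sweep of a cyclic parallel Jacobi strategy for even n
    by the round-robin tournament rotation: n - 1 parallel steps, each
    pairing all n indices into n/2 disjoint pivot pairs, so that every
    index pair occurs exactly once per sweep."""
    assert n % 2 == 0
    players = list(range(n))
    sweep = []
    for _ in range(n - 1):
        step = [(min(players[i], players[n - 1 - i]),
                 max(players[i], players[n - 1 - i]))
                for i in range(n // 2)]
        sweep.append(step)
        # keep the first index fixed, rotate the rest by one position
        players = [players[0]] + [players[-1]] + players[1:-1]
    return sweep

sweep = round_robin_sweep(6)
```

For n = 6 this yields 5 parallel steps of 3 disjoint pairs each, covering all 15 index pairs once, i.e., a cyclic strategy with P = n − 1.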
Parallelization alone is not sufficient for achieving decent performance of the algorithm on modern architectures with multiple levels of the memory hierarchy.
The pointwise algorithm just described is therefore modified to work on the block columns of the matrices, instead of the columns proper. Each block column comprises an arbitrary but fixed number w of consecutive matrix columns. Instead of the 2 × 2 pivot submatrices of A and B, in the blocked algorithm the 2w × 2w pivot submatrices Â and B̂ are formed in each (parallel or sequential) step by the matrix multiplications Â = [F_p F_q]^* [F_p F_q] and B̂ = [G_p G_q]^* [G_p G_q], where F_p, F_q, G_p, G_q, Z_p, and Z_q are the pth and the qth block columns of F, G, and Z of width w.
Now, Â and B̂ can either be jointly diagonalized by a matrix Ẑ, which leads to the full block (fb) algorithm Hari-SingerSanja-SingerSasa-2014 , as called in the context of the Jacobi methods, or their off-diagonal norms can be reduced by a sequence of congruences accumulated into Ẑ, which is called the block-oriented (bo) algorithm Hari-SingerSanja-SingerSasa-2010 . The idea behind blocking is that Â, B̂, and Ẑ fit, by choosing w appropriately, into the small but fast cache memory (e.g., the shared memory of a GPU), to speed up the computation with them, as well as to employ BLAS 3 (matrix multiply) operations for the block column updates by Ẑ afterwards: [F_p F_q] ← [F_p F_q] Ẑ, [G_p G_q] ← [G_p G_q] Ẑ, [Z_p Z_q] ← [Z_p Z_q] Ẑ.
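Forming a pivot submatrix as the Gram matrix of a pair of block columns can be sketched as follows (pure Python, real case, row-major matrices; the function name is illustrative):

```python
def gram_pair(Fp, Fq):
    """Form the 2w x 2w pivot matrix A_hat = [Fp Fq]^T [Fp Fq] from two
    block columns of width w, each given as a list of rows (real case).
    On a GPU this is the BLAS 3 multiplication mentioned in the text."""
    # gather all 2w columns of the concatenated block [Fp Fq]
    cols = ([[row[j] for row in Fp] for j in range(len(Fp[0]))] +
            [[row[j] for row in Fq] for j in range(len(Fq[0]))])
    # Gram matrix: pairwise dot products of the columns
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols]
            for ci in cols]

# two width-1 block columns of a 2 x 2 matrix
Ahat = gram_pair([[1.0], [0.0]], [[1.0], [1.0]])
```

The same routine, applied to [G_p G_q], yields B̂; by symmetry, only the lower (or upper) triangle would need to be computed in practice.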
The computation of Ẑ in either fb or bo can be done by any convergent method; a two-sided method can be applied straightforwardly, but for the one-sided approach Â and B̂ have to be factorized first by, e.g., the Cholesky factorizations Â = F̂^* F̂ and B̂ = Ĝ^* Ĝ, and then the same implicit Hari–Zimmermann method, either pointwise or blocked, and in both cases, either parallel or sequential, can be recursively applied to F̂ and Ĝ.
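The shortening of a small pivot submatrix via the (unpivoted) Cholesky factorization can be sketched as follows (pure Python, real case; the lower-triangular convention A = L Lᵀ is used here, so Lᵀ plays the role of the shortened factor F̂):

```python
import math

def cholesky(A):
    """Unpivoted Cholesky factorization A = L L^T of a small symmetric
    positive definite matrix given as a list of rows (real case).
    A sketch of the shortening step that yields the inner factors."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        s = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(s)  # s > 0 for a positive definite A
        for i in range(j + 1, n):
            L[i][j] = (A[i][j]
                       - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

L = cholesky([[4.0, 2.0], [2.0, 5.0]])
```

Applied to Â and B̂, this produces the square factors to which the pointwise method is then applied recursively.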
In the single-GPU algorithm, there is only one level of such a recursion, i.e., one level of blocking. The block, outer level of the algorithm and the pointwise, inner level do not need to employ the same strategy kind. Both levels, however, are parallel. The sweeps of the outer level are called the block (or outer) sweeps, and those of the inner level are called the pointwise (or inner) sweeps, which for fb are limited to 30 (Â and B̂ are usually fully diagonalized in fewer sweeps than that), while bo is limited to only one inner sweep. Apart from that, there is no other substantial difference between fb and bo.
The Cholesky factorization is not the only way to form F̂ and Ĝ. One numerical stability improvement would be to use a diagonally pivoted version of the factorization instead SingerSanja-SingerSasa-Novakovic-Uscumlic-Dunjko-2012 , or to refrain from forming Â and B̂ explicitly altogether, by shortening the pivot block columns directly by the column-pivoted QR factorization Singer-DiNapoli-Novakovic-Caklovic-2019 .
In both cases, let the obtained triangular factors take the roles of F̂ and Ĝ, where P_F and P_G are the permutation matrices involved, while Q_F and Q_G are unitary and are not required to be stored, implicitly or explicitly, for any further computation.
However, the QR factorization (even without the column pivoting) of a pair of tall and skinny block columns comes with a significant performance penalty on a GPU compared to the Cholesky factorization of a small, square pivot submatrix Novakovic-2015 , and the pivoted Cholesky factorization does not avoid the possibility of getting a severely ill-conditioned Â or B̂ when an ill-conditioned pair of block columns is multiplied by itself. Therefore, both of these enhancements are only briefly described here and have not been tested.
In the following, the blocked algorithm is assumed to form the pivot submatrices as