The reduced QR factorization $A = QR$, with orthonormal columns in $Q$ and upper-triangular $R$, is a ubiquitous subproblem in numerical algorithms. It can be used to solve linear systems, least squares problems, and eigenvalue problems with dense and sparse matrices. In many application contexts, these problems involve highly overdetermined systems of equations in a large number of variables, driving a need for scalable parallel QR factorization algorithms. We study communication-efficient algorithms for QR factorizations of rectangular matrices.
Our work builds on a recently developed algorithm, CholeskyQR2, a Cholesky-based algorithm designed for tall and skinny matrices. The well-known CholeskyQR algorithm computes $R$ as the upper-triangular factor in the Cholesky factorization of the Gram matrix $W = A^T A = R^T R$, then solves the triangular linear systems of equations $QR = A$ for $Q$. The forward error in computing $Q$ in CholeskyQR can be as high as $O(\kappa(A)^2\epsilon)$, where $\kappa(A)$ is the condition number of $A$ and $\epsilon$ is the relative error in the machine floating-point representation. On the other hand, the triangular solves can be done backward stably, so $\|A - QR\| = O(\epsilon\|A\|)$. Therefore, under the effect of round-off error, CholeskyQR computes an accurate factorization of $A$ into a product of a nearly orthogonal matrix $Q$ and a triangular matrix $R$.
CholeskyQR2 provides a correction to the orthogonal factor by running CholeskyQR once more on $Q$ itself to obtain $Q = \hat{Q}\hat{R}$, where $\hat{R}$ is upper-triangular and near the identity. The upper-triangular factor can then be updated as $R \leftarrow \hat{R}R$. If $Q$ was computed to within a few digits of accuracy, which is guaranteed if $\kappa(A) \lesssim \epsilon^{-1/2}$, it will be close to orthogonal and therefore well-conditioned. In this case, the CholeskyQR of $Q$ will not lose much precision from round-off error, and the QR factorization given by CholeskyQR2 will retain all or nearly all digits of accuracy. Improving the stability of Cholesky-based QR factorizations is an active research area [3, 4].
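As a concrete illustration, the two passes can be sketched sequentially in NumPy (an illustrative sketch for intuition, not the paper's parallel implementation; the function names are ours):

```python
import numpy as np

def cholesky_qr(a):
    """One CholeskyQR pass: W = A^T A, Cholesky W = R^T R, then Q = A R^{-1}."""
    w = a.T @ a
    r = np.linalg.cholesky(w).T       # upper-triangular factor R
    q = np.linalg.solve(r.T, a.T).T   # triangular solve for Q in Q R = A
    return q, r

def cholesky_qr2(a):
    """CholeskyQR2: a second pass corrects the orthogonality of Q."""
    q, r = cholesky_qr(a)
    q, r_hat = cholesky_qr(q)         # r_hat is close to the identity
    return q, r_hat @ r               # update R <- r_hat R
```

Note that the second pass fails only if the first-pass $Q$ is so far from orthogonal that its Gram matrix loses positive definiteness, which is the ill-conditioned regime discussed above.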
A key advantage of CholeskyQR2 is its practicality. It requires only matrix multiplications and Cholesky factorizations. In previous work, a 1D parallelization of the algorithm was shown to provide minimal communication cost and synchronization (a logarithmic factor less than other communication-avoiding algorithms). Further, the same authors have recently shown that a Cholesky-based algorithm can be made unconditionally stable while maintaining asymptotically equivalent costs (SIAM PP18, Tokyo, Japan; "Shifted Cholesky QR for Computing the QR Factorization for Ill-conditioned Matrices"). Consequently, the algorithm is of high practical interest for a wide class of sufficiently well-conditioned least squares problems and as a competitor to the unconditionally stable Householder algorithm. In this work, we extend the algorithm to a 3D parallelization suitable for matrices with arbitrary dimensions.
To construct a scalable CholeskyQR2 algorithm, we must utilize efficient parallel algorithms for Cholesky factorization and matrix multiplication. Much focus has been given to theoretical analysis of the communication cost of matrix multiplication and Gaussian elimination algorithms. We present a simple adaptation of such an algorithm to Cholesky factorization. Tradeoffs between synchronization and bandwidth are investigated using a tunable parameter representing the depth of recursion; this parameter determines the size of the diagonal submatrices. We focus on minimizing bandwidth cost given unbounded memory.
Of the QR algorithms that achieve asymptotically minimal communication, none have been implemented in practice. The use of recursive slanted panels provides a 3D algorithm that is communication-efficient for square matrices. Recent work has extended the algorithm to rectangular matrices by using it as a subroutine in the TSQR algorithm for tall and skinny matrices. This algorithm achieves as little communication for rectangular matrices as a matrix multiplication of corresponding dimensions. However, neither slanted-panel-based approach is known to have been evaluated in practice.
As such, no current QR factorization algorithm that achieves theoretically optimal communication cost has been demonstrated to be practical. Existing algorithms face challenges in their complexity, high constants, and incompatibility with BLAS-like routines. Previously studied parallel CholeskyQR2 (CQR2) algorithms do not scale for matrices of arbitrary size. The novelty of our communication-avoiding CholeskyQR2 (CA-CQR2) algorithm lies in the combination of its simplicity and asymptotic efficiency. It relies upon a tunable processor grid to ensure minimal communication for matrices of any size while being relatively easy to implement. Our contributions are to provide a detailed specification, cost analysis, implementation, and performance evaluation of the newly proposed algorithm.
II Foundation and Previous Work
We will study the scalability of parallel QR factorization algorithms through the lens of interprocessor communication cost analysis. In this section, we will introduce results necessary for our work, and provide brief summaries of related work. We first consider collective communication and the parallel algorithms we use for matrix multiplication and Cholesky factorization.
In presenting these algorithms and analyzing their costs, we use a simple $\alpha$-$\beta$-$\gamma$ model as defined below, where $\alpha$ is the cost per message, $\beta$ the cost per word communicated, and $\gamma$ the cost per floating-point operation.
Our analysis assumes $\alpha \geq \beta \geq \gamma$, which is reflective of current parallel computer architectures.
We define a few sequential routines and give their asymptotic costs.
We also define a unit-step function, equal to one when its argument exceeds a given threshold and zero otherwise, to express cost terms that vanish in degenerate cases.
II-B Processor Grids and Collective Communication
Collective communication serves as an efficient way to move data among processors over some subset of a processor grid. We define a 3D processor grid $\Pi$ containing $P$ processors. $\Pi[x,y,z]$ uniquely identifies every processor in the grid, where each of the corresponding dimensions is of size $P^{1/3}$. $\Pi$ can be split into 2D slices such as $\Pi[:,:,z]$, row communicators such as $\Pi[:,y,z]$, column communicators such as $\Pi[x,:,z]$, and depth communicators such as $\Pi[x,y,:]$.
Allgather, Allreduce, Broadcast, and Reduce are collectives used heavily in the multiplication and factorization algorithms explored below. As these are well known [11, 12, 13, 14, 15], we give only the function signature and a brief description of each. We assume butterfly-network collectives for analysis.
Transpose — each processor $\Pi[x,y,z]$ swaps its local array with processor $\Pi[y,x,z]$ via point-to-point communication.
Bcast — the root processor distributes its local array to every processor in the communicator.
Reduce — all processors in the communicator contribute local arrays to an element-wise reduction onto the root processor.
Allreduce — all processors in the communicator contribute local arrays to an element-wise reduction onto some root. The reduced array is then broadcast to every processor.
Allgather — all processors in the communicator contribute local arrays to a concatenation onto some root. The concatenated array is then broadcast to every processor.
The costs of these collective routines can be obtained from a butterfly schedule, where $w$ words of data are communicated across $p$ processors.
We disregard the computational cost in reductions by the assumption $\gamma \leq \beta$.
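For intuition, the butterfly-schedule costs discussed above can be tabulated in a short helper. The closed forms below are the standard ones for butterfly/recursive-doubling collectives and are our assumption, not the paper's exact expressions:

```python
from math import log2

def collective_cost(kind, w, p, alpha, beta):
    """Approximate cost of a butterfly-scheduled collective moving w words
    across p processors: alpha counts per-message latency, beta counts
    per-word bandwidth. Illustrative closed forms, not exact constants."""
    lat = alpha * log2(p)            # log2(p) communication rounds
    band = beta * w * (p - 1) / p    # each processor sends ~w(p-1)/p words
    if kind in ("allgather", "reduce-scatter"):
        return lat + band
    if kind in ("bcast", "reduce", "allreduce"):
        # bcast = scatter + allgather; allreduce = reduce-scatter + allgather
        return 2 * lat + 2 * band
    raise ValueError("unknown collective: " + kind)
```

This makes explicit why an Allreduce is modeled as roughly twice the cost of an Allgather of the same volume.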
II-C Matrix Multiplication
Matrix multiplication over $\Pi$ and other cubic partitions of $\Pi$ is an important building block for the 3D-CQR2 and CA-CQR2 algorithms presented below. We use a variant of 3D matrix multiplication (which we refer to as 3D SUMMA) that achieves asymptotically optimal communication cost over a 3D processor grid [16, 17, 18, 19, 20, 21]. Our algorithm MM3D is a customization with respect to known algorithms. First, the inputs are distributed across each 2D slice of $\Pi$ rather than across the full 3D grid. Second, instead of leaving the product distributed across $\Pi$, we Allreduce along the depth dimension so that each 2D slice holds a distributed copy of the result. These differences are motivated by the need for the result to be replicated over each 2D slice of $\Pi$ in our new algorithms. A cyclic distribution is used, as it is required by our 3D Cholesky factorization algorithm detailed below. See Algorithm 1 for specific details.
We analyze the cost of MM3D for multiplying an $m \times k$ matrix by a $k \times n$ matrix in Table I.
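The blockwise structure of 3D matrix multiplication can be illustrated with a serial simulation. The grid layout and naming below are our simplification; the interprocessor reduction along the depth dimension becomes an explicit sum:

```python
import numpy as np

def mm3d_simulated(a, b, grid_dim=2):
    """Serial simulation of 3D matrix multiplication on a grid_dim^3
    processor grid (an illustrative sketch, not the paper's MM3D layout).
    Processor (i,j,k) multiplies block A[i,k] by block B[k,j]; summing
    over k (the depth dimension) stands in for the reduction collective."""
    g = grid_dim
    a_blocks = [np.array_split(r, g, axis=1)
                for r in np.array_split(a, g, axis=0)]
    b_blocks = [np.array_split(r, g, axis=1)
                for r in np.array_split(b, g, axis=0)]
    c_rows = []
    for i in range(g):
        row = []
        for j in range(g):
            # depth-dimension summation replaces the interprocessor reduction
            row.append(sum(a_blocks[i][k] @ b_blocks[k][j] for k in range(g)))
        c_rows.append(np.hstack(row))
    return np.vstack(c_rows)
```

Each (i, j) pair of the loop corresponds to one depth communicator of the 3D grid.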
II-D Cholesky Factorization
This recursive algorithm yields a family of parallel algorithm variants. Writing the factorization $A = LL^T$ in $2 \times 2$ blocked form gives the equations $A_{11} = L_{11}L_{11}^T$, $A_{21} = L_{21}L_{11}^T$, and $A_{22} = L_{21}L_{21}^T + L_{22}L_{22}^T$. Rewriting these equations gives a recursive definition for $L$: $L_{11} = \mathrm{chol}(A_{11})$, $L_{21} = A_{21}L_{11}^{-T}$, and $L_{22} = \mathrm{chol}(A_{22} - L_{21}L_{21}^T)$.
We augment the recursive definition of the Cholesky factorization with a recursive definition for the triangular inverse $L^{-1}$, as motivated by other parallel algorithms leveraging the triangular inverse [24, 7]. Like before, the factorization is expanded into matrix multiplications of submatrices of dimension $n/2$. Rewriting these equations gives a recursive definition for $L^{-1}$,
$$L^{-1} = \begin{bmatrix} L_{11}^{-1} & 0 \\ -L_{22}^{-1}L_{21}L_{11}^{-1} & L_{22}^{-1} \end{bmatrix}.$$
We embed the two recursive definitions and arrive at an algorithm that solves for both the Cholesky factorization of $A$ and the triangular inverse of $L$. Note that solving for $L^{-1}$ adds only two extra matrix multiplications at each recursive level to the recursive definition for $L$, thus achieving the same asymptotic cost. If $L^{-1}$ were instead solved by a separate recursion at each level, the synchronization cost would incur an extra logarithmic factor. We address the missing base case in the cost analysis derivation below.
We incorporate two matrix transposes at each recursive level to account for the transposed factors (such as $L_{11}^{-T}$) needed in the equations above. Processor $\Pi[x,y,z]$ must send its local data to $\Pi[y,x,z]$ to transpose the matrix globally; a local transpose without horizontal communication would yield an incorrect distributed transpose.
A cyclic distribution of the matrices among processors is chosen so that every processor is utilized in the recursive calls performed on submatrices. Upon reaching the base case, where the matrix dimension is sufficiently small, the square region of the matrix is scattered over the processors, and an Allgather must be performed to load the entire region onto each. Then all processors perform a Cholesky factorization and triangular inversion in an embarrassingly parallel fashion. See Algorithm 3 for full details.
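A sequential sketch of the combined recursion, returning both the Cholesky factor and its inverse, may clarify the structure (the base-case size and the explicit base-case inverse are illustrative assumptions, not the distributed layout of Algorithm 3):

```python
import numpy as np

def chol_inv_recursive(a, base=64):
    """Recursive blocked Cholesky A = L L^T that also returns L^{-1},
    in the style of the combined recursion above, as a serial sketch."""
    n = a.shape[0]
    if n <= base:
        l = np.linalg.cholesky(a)
        return l, np.linalg.inv(l)     # explicit base-case solve
    h = n // 2
    a11, a21, a22 = a[:h, :h], a[h:, :h], a[h:, h:]
    l11, l11_inv = chol_inv_recursive(a11, base)
    l21 = a21 @ l11_inv.T              # L21 = A21 L11^{-T}
    s = a22 - l21 @ l21.T              # Schur complement
    l22, l22_inv = chol_inv_recursive(s, base)
    l = np.block([[l11, np.zeros((h, n - h))],
                  [l21, l22]])
    l_inv = np.block([[l11_inv, np.zeros((h, n - h))],
                      [-l22_inv @ l21 @ l11_inv, l22_inv]])
    return l, l_inv
```

The two extra multiplications per level mentioned above appear here as the products forming the bottom-left block of $L^{-1}$.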
The cost of this recursive Cholesky factorization algorithm (CFR3D) is determined by the cost of matrix multiplication at each recursive level and the total cost of the base cases. In order to utilize the entire 3D processor grid $\Pi$ for each instance of MM3D, we obtain globally distributed submatrices by abstracting submatrices locally. This motivates our use of a cyclic distribution, because each processor then owns an active part of the shrinking submatrix. Upon reaching a submatrix of base-case size, $L$ and $L^{-1}$ are solved explicitly using sequential Cholesky factorization and triangular inversion. An Allgather routine is needed to load the data scattered across the submatrix onto each processor. The combined communication cost of the base cases must not dominate the cost of the MM3D calls.
We analyze the cost of CFR3D in Table II.
The choice of the recursion depth depends on the non-recursive communication cost: the algorithm must compute one Allgather per base case, so the total cost of these Allgathers grows with the number of base cases and must be accounted for alongside the recursive costs.
The choice of recursion depth thus creates a tradeoff between the synchronization cost and the communication cost. We elect to match the communication cost of the recursive part at the expense of an increase in synchronization, which fixes the relation between the base-case size and the grid dimensions and yields the final cost of the 3D algorithm.
II-E QR Factorization
QR factorization decomposes an $m \times n$ matrix $A$ into matrices $Q$ and $R$ such that $A = QR$. We focus on the case when $m \geq n$ and $Q$ and $R$ are the results of a reduced QR factorization. In this case, $Q$ is $m \times n$ with orthonormal columns and $R$ is $n \times n$ and upper-triangular. Parallel QR algorithms have received much study [25, 26, 27, 5, 28, 29, 30], but focus has predominantly been on 2D blocked algorithms. 3D algorithms for QR have been proposed [8, 9].
Algorithms 4 and 5 give pseudocode for the sequential CholeskyQR2 (CQR2) algorithm. It is composed of matrix multiplications and Cholesky factorizations and, unlike other QR factorization algorithms, does not require explicit QR factorizations of subproblems. One small detail we exploit is that, by avoiding the calculation of the triangular inverse at the first recursive level in Algorithm 3, we can reduce the computational cost by a factor of two for near-square matrices. Such an improvement incurs a tradeoff with synchronization cost, discussed in the next section. Using the building-block algorithms explored above, we seek to extend the existing parallel 1D-CQR2 algorithm, given in Algorithms 6 and 7, to efficiently handle an arbitrary number of rows and columns.
The existing parallel 1D-CQR2 algorithm is solved over a 1D processor grid of $P$ processors. It partitions the matrix $A$ into rectangular chunks of size $(m/P) \times n$. The motivation behind this parallelization strategy lies in the minimal required communication. Each processor can perform a sequential symmetric rank-$k$ update (syrk) with its partition $A_p$ of $A$, resulting in the matrix $A_p^T A_p$. The 1D parallelization allows each processor to perform this local matrix multiplication with its initial partition of $A$ and contribute to the summation $W = A^T A = \sum_p A_p^T A_p$ across rows using an Allreduce routine. Note that each processor need only contribute the upper triangle of each symmetric matrix $A_p^T A_p$.
Matrix $W$ is now replicated across all processors. Each processor performs a sequential Cholesky factorization with its copy of $W$ and solves for $R$. Finally, because $Q$ is distributed in the same manner as $A$, horizontal communication is not required and each processor can solve for its rows of $Q$ with $A_p$ and its copy of $R$. See Figure 1 and Algorithm 6 for finer details.
1D-CQR2 calls 1D-CQR twice, as shown in Algorithms 5 and 7. Each processor can then solve for the updated upper-triangular factor sequentially. This algorithm ensures that $Q$ is distributed the same as $A$ and that $R$ is stored on every processor.
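The data flow of 1D-CQR and 1D-CQR2 can be simulated serially, with each row block standing in for a processor's local partition (the block sizes and names are our assumptions; the explicit sum stands in for the Allreduce):

```python
import numpy as np

def cqr_1d_simulated(a, p=4):
    """Serial simulation of 1D-CholeskyQR over p row-block 'processors'."""
    blocks = np.array_split(a, p, axis=0)
    # local syrk, then Allreduce: W = sum_p A_p^T A_p = A^T A
    w = sum(ap.T @ ap for ap in blocks)
    r = np.linalg.cholesky(w).T            # replicated Cholesky factor R
    # each processor solves its block of Q independently: Q_p = A_p R^{-1}
    q_blocks = [np.linalg.solve(r.T, ap.T).T for ap in blocks]
    return np.vstack(q_blocks), r

def cqr2_1d_simulated(a, p=4):
    """1D-CQR2: two passes; Q stays block-distributed, R is replicated."""
    q, r1 = cqr_1d_simulated(a, p)
    q, r2 = cqr_1d_simulated(q, p)
    return q, r2 @ r1
```

Notice that the only step requiring communication in the parallel version is the Allreduce forming $W$; every other step is local to a row block.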
To be efficient in practice, $n$ should be small enough to make the $n \times n$ Allreduce feasible under given memory constraints. The algorithm is most efficient when $m/P \geq n$. For any given matrix, the size and shape of the local rectangular blocks can be tuned. In general, 1D-CQR2 can only be applied to extremely overdetermined matrices. The 3D-CQR2 and CA-CQR2 algorithms extend 1D-CQR2 to handle matrices of any size.
See Table III for the costs attained in the 1D-CQR algorithm. Note that the Allreduce in line 2 need only send the upper triangle of the symmetric matrix $W$, so a factor of $\frac{1}{2}$ is applied to the horizontal bandwidth cost.
The costs in Table IV are attained in the 1D-CQR2 algorithm.
This algorithm achieves poor scalability in communication, computation, and memory footprint. Regardless of $P$, the Allreduce replicates an $n \times n$ matrix onto each processor, and as $n$ grows this matrix will not fit into a reasonably sized memory. Results from previous work show that this algorithm performs well only when the matrix is extremely overdetermined, so it can only be used when $m \gg n$. Below, we show that CA-CQR2 can scale efficiently on a tunable processor grid.
III Communication-avoiding CholeskyQR2
Our main goal is to develop a parallelization of CholeskyQR2 that scales efficiently for rectangular matrices of arbitrary dimensions. We start with a 3D algorithm best suited for nearly square matrices ($m \approx n$).
Matrix $A$ is initially distributed across the 3D grid $\Pi$ and is partitioned into rectangular chunks. We compute $W = A^T A$ in a similar way to MM3D: each processor broadcasts its chunk of $A$ along the appropriate subcommunicator, and local matrix multiplication of the received chunks forms a partial contribution to $W$. After a reduction and broadcast, $W$ is replicated on each 2D slice of $\Pi$. Similar to how the Allreduce in 1D-CQR2 sums the rows efficiently in parallel, a reduction along each depth communicator $\Pi[x,y,:]$ onto a root sums the contributions along the corresponding subgrid, resulting in a unique partition of $W$ scattered across each slice. A broadcast along $\Pi[x,y,:]$ fills in the missing partitions and ensures that each slice owns a copy of $W$.
Now that $W$ is distributed in the manner required by CFR3D, CFR3D can solve for the triangular factor and its inverse over $\Pi$. Finally, MM3D computes $Q = AR^{-1}$, which is again distributed the same as $A$. An alternate strategy involves computing two triangular inverted blocks of dimension $n/2$ and solving for $Q$ with three smaller instances of MM3D. This strategy lowers the computational cost by nearly a factor of 2. Utilizing inverted blocks of size less than $n/2$ does not decrease the computational cost further and increases the synchronization cost by a logarithmic factor. The same strategy can be employed by CA-CQR2, but it is most effective for near-square matrices. See Algorithm 8.
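The alternate strategy, inverting only the two diagonal blocks of $R$ and forming $Q$ with three smaller multiplications, can be sketched serially (block size $n/2$ and names are our assumptions):

```python
import numpy as np

def apply_inv_via_blocks(a, r):
    """Compute Q = A R^{-1} using only the two inverted diagonal blocks
    R11^{-1}, R22^{-1} of upper-triangular R and three multiplications,
    a serial sketch of the alternate strategy described above."""
    n = r.shape[0]
    h = n // 2
    r11, r12, r22 = r[:h, :h], r[:h, h:], r[h:, h:]
    inv11 = np.linalg.inv(r11)
    inv22 = np.linalg.inv(r22)
    a1, a2 = a[:, :h], a[:, h:]
    q1 = a1 @ inv11                  # first multiplication
    q2 = (a2 - q1 @ r12) @ inv22     # second and third multiplications
    return np.hstack([q1, q2])
```

This follows from the block form of the triangular inverse: the off-diagonal block $-R_{11}^{-1}R_{12}R_{22}^{-1}$ is never formed explicitly.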
The costs in Table V are attained in the 3D-CQR algorithm.
The costs in Table VI are attained in the 3D-CQR2 algorithm.
This algorithm is most communication-efficient when the matrix is nearly square.
Tunable processor grids have several advantages over static grids when factoring rectangular matrices. They can match the shape of the matrix and be tuned to optimize parameters such as memory footprint and horizontal communication. Tunable algorithms solved over tunable grids can achieve a wide range of asymptotic costs, which can be shown to interpolate between known algorithms on specific grids. Finding the optimal processor grid for a specific algorithm is nontrivial, however. Skinny matrices cannot take full advantage of the resources provided by a 3D grid, while square matrices exceed the resource capacity that skinny rectangular grids provide.
The CA-CQR2 algorithm can be seen as a generalization of 1D-CQR2 and 3D-CQR2. We define a rectangular processor grid that partitions the matrix into rectangular blocks. CA-CQR2 effectively utilizes its tunable grid by performing simultaneous instances of CFR3D on cubic partitions of the grid. This allows each grid partition to avoid further communication with the other partitions, because each has the necessary data to compute the final step $Q = AR^{-1}$. In order to allow these simultaneous Cholesky factorizations, a few extra steps are needed to get the required data onto the correct processors.
As in 3D-CQR, each processor broadcasts its chunk of $A$ along the appropriate subcommunicator and performs a local matrix multiplication to form a contribution to $W = A^T A$. In order for every cubic partition of the grid to own the same matrix $W$, we perform a few extra steps. First, we subdivide the grid along its longest dimension into contiguous groups whose size matches the cubic partitions. Each group participates in a reduction in order to simulate the horizontal summing of the rows in a linear combination. The processor whose index matches its cubic partition is the only one that must retain the communicated data and is thus the root. To complete the linear combination, we subdivide along the same dimension into groups whose members are a fixed stride apart. An Allreduce is performed on this subcommunicator, resulting in every processor owning the one correct region of $W$. A final broadcast from the root along the subdivided dimension ensures that $W$ is replicated among the subcommunicators.
Now that $W$ is distributed as described above, simultaneous instances of CFR3D and MM3D are performed over the cubic grid partitions to obtain the triangular factor and its inverse. CA-CQR2 requires calling CA-CQR twice and performing a final MM3D to solve for the final $R$.
As evidenced by Figure 2 and the pseudocode presented in Algorithms 10 and 11, CA-CQR2 combines elements of both the 1D-CQR2 and 3D-CQR2 algorithms in order to efficiently span the range of processor-grid shapes between those two extremes. To show that this algorithm achieves the same costs as 1D-CQR2 and 3D-CQR2 at the corresponding extreme grid shapes, we provide a cost analysis below.