Communication-avoiding Cholesky-QR2 for rectangular matrices

The need for scalable algorithms to solve least squares and eigenvalue problems is becoming increasingly important given the rising complexity of modern machines. We address this concern by presenting a new scalable QR factorization algorithm intended to accelerate these problems for rectangular matrices. Our contribution is a communication-avoiding distributed-memory parallelization of an existing Cholesky-based QR factorization algorithm called CholeskyQR2. Our algorithm exploits a tunable processor grid able to interpolate between one and three dimensions, resulting in tradeoffs in the asymptotic costs of synchronization, horizontal bandwidth, flop count, and memory footprint. It improves the communication cost complexity with respect to state-of-the-art parallel QR implementations by $\Theta(P^{1/6})$. Further, we provide implementation details and performance results on the Blue Waters supercomputer. We show that the costs attained are asymptotically equivalent to other communication-avoiding QR factorization algorithms and demonstrate that our algorithm is efficient in practice.


I Introduction

The reduced QR factorization $A = QR$, with $Q$ having orthonormal columns and $R$ upper-triangular, is a ubiquitous subproblem in numerical algorithms. It can be used to solve linear systems, least squares problems, and eigenvalue problems with dense and sparse matrices. In many application contexts, these problems involve very overdetermined systems of equations in a large number of variables, driving a need for scalable parallel QR factorization algorithms. We study communication-efficient algorithms for QR factorizations of rectangular matrices $A \in \mathbb{R}^{m \times n}$ with $m \geq n$.

Our work builds on a recently developed algorithm, CholeskyQR2 [1], a Cholesky-based algorithm designed for tall and skinny matrices. The well-known CholeskyQR algorithm computes $R$ as the upper-triangular factor in the Cholesky factorization of the Gram matrix $A^T A$, then solves a triangular system of equations, $Q = AR^{-1}$. The forward error in computing $Q$ in CholeskyQR can be as high as $O(\kappa(A)^2\epsilon)$, where $\kappa(A)$ is the condition number of $A$. On the other hand, the triangular solves can be done backward stably, so $\hat{Q}\hat{R} = A + \delta A$ with $\|\delta A\| = O(\epsilon\|A\|)$, where $\epsilon$ is the relative error in the machine floating-point representation. Therefore, under the effect of round-off error, CholeskyQR computes an accurate factorization of $A$ into a product of a nearly orthogonal matrix $\hat{Q}$ and a triangular matrix $\hat{R}$.

CholeskyQR2 provides a correction to the orthogonal factor by running CholeskyQR once more on $\hat{Q}$ itself to obtain $\hat{Q} = QR'$, where $R'$ is upper-triangular and near the identity. The upper-triangular factor can then be updated as $R = R'\hat{R}$. If $\hat{Q}$ was computed to within a few digits of accuracy, which is guaranteed if $\kappa(A)^2\epsilon \ll 1$, it will be close to orthogonal and therefore well-conditioned. In this case, the CholeskyQR of $\hat{Q}$ will not lose much precision from round-off error, and the QR factorization given by CholeskyQR2 will retain all or nearly all digits of accuracy [2]. Improving the stability of Cholesky-based QR factorizations is an active research area [3, 4].
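For concreteness, the two passes and the update of the triangular factor can be summarized as follows (a restatement of the steps just described, in the notation above; the final identity holds in exact arithmetic):

\begin{align*}
  \hat{R}^T\hat{R} &= A^T A, & \hat{Q} &= A\hat{R}^{-1}, \\
  R'^{\,T} R' &= \hat{Q}^T\hat{Q}, & Q &= \hat{Q} R'^{-1}, \\
  R &= R'\hat{R}, & A &= QR .
\end{align*}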

A key advantage of CholeskyQR2 is its practicality. It requires only matrix multiplications and Cholesky factorizations. In previous work [5], a 1D parallelization of the algorithm was shown to provide minimal communication cost and synchronization (a logarithmic factor less than other communication-avoiding algorithms [6]). Further, the same authors have recently shown that a Cholesky-based algorithm can be made unconditionally stable while maintaining asymptotically equivalent costs (SIAM PP18, Tokyo, Japan: "Shifted Cholesky QR for Computing the QR Factorization for Ill-conditioned Matrices"). Consequently, the algorithm is of high practical interest for a wide class of sufficiently well-conditioned least squares problems and as a competitor to the unconditionally stable Householder algorithm. In this work, we extend the algorithm to a 3D parallelization suitable for matrices with arbitrary dimensions.

To construct a scalable CholeskyQR2 algorithm, we must utilize efficient parallel algorithms for Cholesky factorization and matrix multiplication. Much focus has been given to theoretical analysis of the communication cost of matrix multiplication and Gaussian elimination algorithms. We present a simple adaptation of the algorithm in [7] to Cholesky factorization. Tradeoffs between synchronization and bandwidth are investigated using a tunable parameter representing the depth of recursion; this parameter determines the size of the diagonal submatrices solved at the base case. We focus on minimizing bandwidth cost given unbounded memory.

Of the QR algorithms that achieve asymptotically minimal communication, none have been implemented in practice. The use of recursive slanted panels [8] provides a 3D algorithm that is communication efficient for square matrices. Recent work has extended the algorithm to rectangular matrices by using it as a subroutine in the TSQR algorithm for tall and skinny matrices [9]. This algorithm achieves as little communication for rectangular matrices as a matrix multiplication of corresponding dimensions. However, neither slanted-panel-based approach is known to have been evaluated in terms of practical performance.

As such, no current QR factorization algorithm that achieves theoretically optimal communication cost has been demonstrated to be practical. Existing algorithms face challenges in their complexity, high constants, and incompatibility with BLAS-like routines. Previously studied parallel CholeskyQR2 (CQR2) algorithms do not scale for matrices of arbitrary size. The novelty of our communication-avoiding CholeskyQR2 (CA-CQR2) algorithm lies in the combination of its simplicity and asymptotic efficiency. It relies upon a tunable processor grid to ensure asymptotically minimal communication for matrices of any size while being relatively easy to implement. Our contributions are to provide a detailed specification, cost analysis, implementation, and performance evaluation of the newly proposed algorithm.

II Foundation and Previous Work

We will study the scalability of parallel QR factorization algorithms through the lens of interprocessor communication cost analysis. In this section, we will introduce results necessary for our work, and provide brief summaries of related work. We first consider collective communication and the parallel algorithms we use for matrix multiplication and Cholesky factorization.

II-A Preliminaries

In presenting these algorithms and analyzing their costs, we use a simple $\alpha$-$\beta$-$\gamma$ model, as defined below.
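In the standard formulation of this model (our phrasing; $S$, $W$, and $F$ are our names for the synchronization, word, and flop counts along the critical path), an algorithm's execution time is charged as
\[
  T \;=\; \alpha \cdot S \;+\; \beta \cdot W \;+\; \gamma \cdot F ,
\]
where $\alpha$ is the per-message latency cost, $\beta$ the per-word horizontal bandwidth cost, and $\gamma$ the cost of a floating-point operation.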

Our analysis assumes $\alpha \geq \beta \geq \gamma$, which is reflective of current parallel computer architectures.

We define a few sequential routines (matrix multiplication, Cholesky factorization, and triangular solve/inversion) and use their standard asymptotic costs [10].

We also define a unit-step (indicator) function for use in the cost expressions below.

II-B Processor Grids and Collective Communication

Collective communication serves as an efficient way to move data among processors over some subset of a processor grid. We define a 3D processor grid $\Pi$ containing $P$ processors. A triplet of coordinates $\Pi[x,y,z]$ uniquely identifies every processor in the grid, where each of the corresponding dimensions is of size $P^{1/3}$. $\Pi$ can be split into 2D slices such as $\Pi[:,:,z]$, row communicators such as $\Pi[x,:,z]$, column communicators such as $\Pi[:,y,z]$, and depth communicators such as $\Pi[x,y,:]$.
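As an illustration (not the paper's implementation), such a grid and its sub-communicators can be built with MPI Cartesian topologies; the variable names below are ours, and we assume $P = c^3$:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
P = comm.Get_size()
c = round(P ** (1.0 / 3.0))                 # assume a cubic process count
assert c ** 3 == P, "run with a cubic number of processes"

# 3D Cartesian grid Pi[x, y, z]
grid = comm.Create_cart(dims=[c, c, c], periods=[False] * 3, reorder=True)
x, y, z = grid.Get_coords(grid.Get_rank())

# Sub-communicators obtained by keeping a subset of the dimensions:
slice_xy = grid.Sub([True, True, False])    # 2D slice Pi[:, :, z]
row      = grid.Sub([False, True, False])   # row    Pi[x, :, z] (varies y)
col      = grid.Sub([True, False, False])   # column Pi[:, y, z] (varies x)
depth    = grid.Sub([False, False, True])   # depth  Pi[x, y, :] (varies z)
```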

Allgather, Allreduce, and Broadcast are collectives used heavily in the multiplication and factorization algorithms explored below. As these are well known [11, 12, 13, 14, 15], we give only a brief description of each. We assume butterfly-network collectives for analysis [12].

  • Transpose — each processor exchanges its local array with a partner processor via point-to-point communication, producing a globally transposed distribution.

  • Bcast — the root processor distributes its local array to every processor in the communicator.

  • Reduce — all processors in the communicator contribute local arrays to an element-wise reduction onto the root processor.

  • Allreduce — all processors in the communicator contribute local arrays to an element-wise reduction onto some root; the reduced array is then broadcast to every processor.

  • Allgather — all processors in the communicator contribute local arrays to a concatenation onto some root; the concatenated array is then broadcast to every processor.

The cost of each of these collective routines under a butterfly schedule, with $n$ words of data communicated across $p$ processors, is $O(\alpha \log p + \beta n)$; reductions additionally incur $O(\gamma n)$ computation.
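For reference, the familiar butterfly-schedule costs for these collectives take the form below (standard results, e.g. [12]; the constants are indicative only and are not taken from this paper's tables):

\begin{align*}
  T_{\mathrm{Allgather}}(n,p) &\approx \alpha \log_2 p + \beta\,\tfrac{p-1}{p}\,n, \\
  T_{\mathrm{Bcast}}(n,p) &\approx 2\alpha \log_2 p + 2\beta\,\tfrac{p-1}{p}\,n, \\
  T_{\mathrm{Reduce}}(n,p) \approx T_{\mathrm{Allreduce}}(n,p) &\approx 2\alpha \log_2 p + 2\beta\,\tfrac{p-1}{p}\,n + \gamma\,\tfrac{p-1}{p}\,n .
\end{align*}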

We disregard the computational cost in reductions by the assumption $\beta \geq \gamma$.

II-C Matrix Multiplication

Matrix multiplication over $\Pi$ and other cubic partitions of $\Pi$ is an important building block for the 3D-CQR2 and CA-CQR2 algorithms presented below. We use a variant of 3D matrix multiplication (which we refer to as 3D SUMMA) that achieves asymptotically optimal communication cost over a 3D processor grid [16, 17, 18, 19, 20, 21]. Our algorithm MM3D is a customization of these known algorithms. First, the inputs are not distributed across the entire 3D grid; instead, each operand is distributed across every 2D slice $\Pi[:,:,z]$. Second, instead of distributing the output matrix $C$ across $\Pi$, we Allreduce $C$ onto the 2D slices so that each slice holds a distributed copy. These differences are motivated by the need for $C$ to be replicated over each 2D slice of $\Pi$ in our new algorithms. A cyclic distribution is used, as it is required by our 3D Cholesky factorization algorithm detailed below. See Algorithm 1 for specific details.

1: $\Pi$ has $P$ processors arranged in a $P^{1/3} \times P^{1/3} \times P^{1/3}$ grid. Matrices $A$ and $B$ are replicated on each 2D slice $\Pi[:,:,z]$. Each processor owns a cyclic partition of matrix $A$ and matrix $B$, referred to as local matrices $A_{loc}$ and $B_{loc}$, respectively. Let $\hat{A}$, $\hat{B}$, and $\hat{C}$ be temporary arrays with the same distribution as $A_{loc}$ and $B_{loc}$.
2: Bcast the needed piece of $A_{loc}$ within each row communicator into $\hat{A}$
3: Bcast the needed piece of $B_{loc}$ within each column communicator into $\hat{B}$
4: MM: $\hat{C} \leftarrow \hat{A}\hat{B}$ locally
5: Allreduce $\hat{C}$ along the depth communicator into $C_{loc}$
6: Output: $[C] = \mathrm{MM3D}(A, B, \Pi)$, where $C = AB$ is distributed the same way as $A$ and $B$.
Algorithm 1 MM3D
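A concrete, simplified illustration of this communication pattern is sketched below (our own sketch, not the paper's implementation: a block distribution replaces the cyclic one, blocks are square, and $P = c^3$ is assumed). Slice $z$ handles inner-dimension chunk $z$, and the Allreduce over the depth communicator both sums the partial products and replicates $C$ on every slice, as described above.

```python
from mpi4py import MPI
import numpy as np

def mm3d(A_blk, B_blk, grid, c, bs):
    """Sketch of the MM3D pattern: inputs block-distributed identically over
    each 2D slice Pi[:,:,z]; output C replicated the same way on every slice.
    A_blk, B_blk are the (bs x bs) local blocks of processor (x, y, z)."""
    x, y, z = grid.Get_coords(grid.Get_rank())
    row   = grid.Sub([False, True,  False])   # Pi[x, :, z], varies over y
    col   = grid.Sub([True,  False, False])   # Pi[:, y, z], varies over x
    depth = grid.Sub([False, False, True])    # Pi[x, y, :], varies over z

    # Slice z is responsible for inner-dimension chunk k = z:
    # processor (x,y,z) needs block A[x, z] (owned within the slice by y == z)
    # and block B[z, y] (owned within the slice by x == z).
    A_piece = A_blk.copy() if y == z else np.empty((bs, bs))
    row.Bcast(A_piece, root=z)
    B_piece = B_blk.copy() if x == z else np.empty((bs, bs))
    col.Bcast(B_piece, root=z)

    C_partial = A_piece @ B_piece                  # local contribution
    C_blk = np.empty_like(C_partial)
    depth.Allreduce(C_partial, C_blk, op=MPI.SUM)  # sum over z, replicate
    return C_blk

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    c = round(comm.Get_size() ** (1.0 / 3.0))
    assert c ** 3 == comm.Get_size()
    grid = comm.Create_cart([c, c, c])
    x, y, z = grid.Get_coords(grid.Get_rank())

    bs = 4                                         # toy block size
    n = bs * c
    rng = np.random.default_rng(0)                 # same seed on every rank,
    A = rng.standard_normal((n, n))                # so the global A, B match
    B = rng.standard_normal((n, n))
    A_blk = A[x*bs:(x+1)*bs, y*bs:(y+1)*bs].copy()
    B_blk = B[x*bs:(x+1)*bs, y*bs:(y+1)*bs].copy()

    C_blk = mm3d(A_blk, B_blk, grid, c, bs)
    assert np.allclose(C_blk, (A @ B)[x*bs:(x+1)*bs, y*bs:(y+1)*bs])
```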

We analyze the cost of MM3D for multiplying an $m \times k$ matrix by a $k \times n$ matrix in Table I.

Table I: Per-line costs of Algorithm 1 (lines 2–5).

II-D Cholesky Factorization

Assuming $A$ is a dense, symmetric positive definite matrix of dimension $n \times n$, the factorization $A = LL^T$ can be expanded into matrix multiplications of submatrices of dimension $n/2 \times n/2$ [22, 7],
\[
A = \begin{bmatrix} A_{11} & A_{21}^T \\ A_{21} & A_{22} \end{bmatrix}
  = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix}
    \begin{bmatrix} L_{11}^T & L_{21}^T \\ 0 & L_{22}^T \end{bmatrix}.
\]

This recursive algorithm yields a family of parallel algorithm variants [23]. Rewriting these equations gives a recursive definition for $L$,
\[
L_{11} = \mathrm{Cholesky}(A_{11}), \qquad
L_{21} = A_{21} L_{11}^{-T}, \qquad
L_{22} = \mathrm{Cholesky}\!\left(A_{22} - L_{21} L_{21}^T\right).
\]

We augment the recursive definition of the Cholesky factorization with a recursive definition for $L^{-1}$, as motivated by other parallel algorithms leveraging the triangular inverse [24, 7]. Like before, the inverse gets expanded into matrix multiplications of submatrices of dimension $n/2 \times n/2$,
\[
L^{-1} = \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix}^{-1}
       = \begin{bmatrix} L_{11}^{-1} & 0 \\ -L_{22}^{-1} L_{21} L_{11}^{-1} & L_{22}^{-1} \end{bmatrix}.
\]

Rewriting these equations gives a recursive definition for $L^{-1}$: the diagonal blocks $L_{11}^{-1}$ and $L_{22}^{-1}$ are obtained recursively, and the off-diagonal block is formed by the two matrix multiplications in $-L_{22}^{-1}\left(L_{21} L_{11}^{-1}\right)$.

We embed the two recursive definitions and arrive at an algorithm that solves for both the Cholesky factorization of $A$ and the triangular inverse $L^{-1}$. Note that solving for $L^{-1}$ adds only two extra matrix multiplications at each recursive level to the recursive definition for $L$, thus achieving the same asymptotic cost. If $L^{-1}$ were instead solved by a separate recursion at each level, the synchronization cost would incur an extra logarithmic factor. We address the missing base case in the cost analysis derivation below.

1: $A$ is a symmetric positive definite matrix of dimension $n \times n$.
2: if base case, return $[L, L^{-1}]$ computed directly by sequential Cholesky factorization and triangular inversion
3: $[L_{11}, L_{11}^{-1}] \leftarrow \mathrm{CholInv}(A_{11})$
4: $L_{21} \leftarrow A_{21} L_{11}^{-T}$
5: $[L_{22}, L_{22}^{-1}] \leftarrow \mathrm{CholInv}(A_{22} - L_{21}L_{21}^T)$
6: $L \leftarrow \begin{bmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{bmatrix}$
7: $L^{-1} \leftarrow \begin{bmatrix} L_{11}^{-1} & 0 \\ -L_{22}^{-1}L_{21}L_{11}^{-1} & L_{22}^{-1} \end{bmatrix}$
8: Output: $[L, L^{-1}] = \mathrm{CholInv}(A)$, where $L$ is lower triangular and $A = LL^T$.
Algorithm 2 CholInv
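A sequential version of this recursion is short enough to state directly; the following sketch (our own, using a trivial $1 \times 1$ base case rather than the blocked one discussed below) mirrors the equations above:

```python
import numpy as np

def chol_inv(A):
    """Return (L, Linv) with A = L @ L.T and Linv = inv(L), using the
    2x2 block recursion described in the text.  Base case: 1x1 matrix."""
    n = A.shape[0]
    if n == 1:
        l = np.sqrt(A[0, 0])
        return np.array([[l]]), np.array([[1.0 / l]])
    k = n // 2
    A11, A21, A22 = A[:k, :k], A[k:, :k], A[k:, k:]
    L11, L11inv = chol_inv(A11)
    L21 = A21 @ L11inv.T                       # L21 = A21 * L11^{-T}
    L22, L22inv = chol_inv(A22 - L21 @ L21.T)  # Schur complement
    L = np.block([[L11, np.zeros((k, n - k))], [L21, L22]])
    Linv = np.block([[L11inv, np.zeros((k, n - k))],
                     [-L22inv @ L21 @ L11inv, L22inv]])
    return L, Linv

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.standard_normal((8, 8))
    A = X @ X.T + 8 * np.eye(8)                # symmetric positive definite
    L, Linv = chol_inv(A)
    assert np.allclose(L @ L.T, A)
    assert np.allclose(L @ Linv, np.eye(8))
```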

We incorporate two matrix transposes at each recursive level to account for the transposed terms $L_{11}^{-T}$ and $L_{21}^T$ needed in the equations above. Processor $\Pi[x,y,z]$ must send its local data to $\Pi[y,x,z]$ to transpose the matrix globally; a local transpose without horizontal communication would yield an incorrect distributed transpose.

A cyclic distribution of the matrices among processors is chosen so that every processor remains active in the recursive calls performed on submatrices. Upon reaching the base case, where the matrix dimension has shrunk to the base-case size, the square region of the matrix is scattered over the processors of a 2D slice and an Allgather must be performed to load the entire region onto each. Then all processors perform a Cholesky factorization and triangular inversion of the base-case block in an embarrassingly parallel (redundant) fashion. See Algorithm 3 for full details.
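The base-case step can be sketched as follows (a simplification of what is described above: we assume the base-case block is distributed by contiguous block rows over the processors of one 2D slice, rather than cyclically, and we use SciPy for the local triangular inversion):

```python
from mpi4py import MPI
import numpy as np
from scipy.linalg import solve_triangular

def base_case_chol_inv(A_local_rows, slice_comm):
    """Each processor in the slice owns a contiguous block of rows of the
    (small) base-case matrix; gather the full block everywhere, then every
    processor redundantly computes L and L^{-1}."""
    pieces = slice_comm.allgather(A_local_rows)      # list of row blocks
    A_base = np.vstack(pieces)                       # full base-case block
    L = np.linalg.cholesky(A_base)                   # lower-triangular factor
    Linv = solve_triangular(L, np.eye(L.shape[0]), lower=True)
    return L, Linv
```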

1: $\Pi$ has $P$ processors arranged in a 3D grid. Matrix $A$ is of dimension $n \times n$, symmetric, and positive definite. $A$ is replicated on each 2D slice of $\Pi$. Each processor owns a cyclic partition of $A$ given by its local array $A_{loc}$. Let the temporary arrays used below be distributed the same way as $A_{loc}$.
2: if the matrix dimension has reached the base-case size then
3:     Allgather the base-case block onto every processor of the 2D slice
4:     compute $L$ and $L^{-1}$ of the block locally (redundantly on every processor)
5: else
6:     $[L_{11}, L_{11}^{-1}] \leftarrow \mathrm{CFR3D}(A_{11}, \Pi)$
7:     transpose $L_{11}^{-1}$ across the grid to form $L_{11}^{-T}$
8:     $L_{21} \leftarrow \mathrm{MM3D}(A_{21}, L_{11}^{-T}, \Pi)$
9:     transpose $L_{21}$ across the grid to form $L_{21}^{T}$
10:     $S \leftarrow A_{22} - \mathrm{MM3D}(L_{21}, L_{21}^{T}, \Pi)$
11:     $[L_{22}, L_{22}^{-1}] \leftarrow \mathrm{CFR3D}(S, \Pi)$
12:     $W \leftarrow \mathrm{MM3D}(L_{21}, L_{11}^{-1}, \Pi)$
13:     $(L^{-1})_{21} \leftarrow -\mathrm{MM3D}(L_{22}^{-1}, W, \Pi)$
14:     assemble $L$ from $L_{11}$, $L_{21}$, and $L_{22}$
15:     assemble $L^{-1}$ from $L_{11}^{-1}$, $(L^{-1})_{21}$, and $L_{22}^{-1}$
16: Output: $[L, L^{-1}] = \mathrm{CFR3D}(A, \Pi)$, where matrices $L$ and $L^{-1}$ are distributed the same way as $A$.
Algorithm 3 CFR3D

The cost of this recursive Cholesky factorization algorithm (CFR3D) is determined by the cost of matrix multiplication at each recursive level and the total cost of the base cases. In order to utilize the entire 3D processor grid $\Pi$ for each instance of MM3D, we obtain globally distributed submatrices by abstracting them locally. This motivates our choice of a cyclic distribution, because each processor then owns an active part of the shrinking submatrix. Upon reaching a submatrix of base-case size, matrices $L$ and $L^{-1}$ are solved for explicitly using sequential Cholesky factorization and triangular inversion. An Allgather routine is needed to load the data scattered across the submatrix onto each processor. The combined communication cost of the base cases must not dominate the cost of the MM3D invocations.

We analyze the cost of CFR3D in Table II.

Table II: Per-line costs of Algorithm 3 (lines 3–15).

The choice of base-case dimension depends on the non-recursive communication cost. Because an Allgather is performed at every base case reached, the total cost of these Allgathers grows with the number of base cases, and hence with the recursion depth.

The choice of base-case dimension therefore creates a tradeoff between the synchronization cost and the communication cost. We elect to match the communication cost of the recursive MM3D invocations at the expense of an increase in synchronization; this relation determines the base-case dimension and yields the final cost of the 3D algorithm.

II-E QR Factorization

QR factorization decomposes an $m \times n$ matrix $A$ into matrices $Q$ and $R$ such that $A = QR$. We focus on the case $m \geq n$, where $Q$ and $R$ are the results of a reduced QR factorization. In this case, $Q$ is $m \times n$ with orthonormal columns and $R$ is $n \times n$ and upper-triangular. Parallel QR algorithms have received much study [25, 26, 27, 5, 28, 29, 30], but the focus has predominantly been on 2D blocked algorithms. 3D algorithms for QR have been proposed [8, 9].

Algorithms 4 and 5 give pseudocode for the sequential CholeskyQR2 (CQR2) algorithm. It is composed of matrix multiplications and Cholesky factorizations and, unlike other QR factorization algorithms, does not require explicit orthogonalization kernels such as Householder transformations [1]. One small detail we exploit is that by avoiding the calculation of the triangular inverse at the first recursive level in Algorithm 3, we can reduce the computational cost by a factor of two for near-square matrices. Such an improvement incurs a tradeoff with synchronization cost, discussed in the next section. Using the building-block algorithms explored above, we seek to extend the existing parallel 1D-CQR2 algorithm, given in Algorithms 6 and 7, to efficiently handle an arbitrary number of rows and columns.

1: $A$ is $m \times n$
2: $B \leftarrow A^T A$
3: $R \leftarrow \mathrm{Cholesky}(B)$ (upper-triangular factor, $B = R^T R$)
4: $Q \leftarrow A R^{-1}$
5: Output: $[Q, R] = \mathrm{CQR}(A)$, where $Q$ is $m \times n$ with orthonormal columns and $R$ is $n \times n$ upper triangular
Algorithm 4 CQR
1: $A$ is $m \times n$
2: $[Q_1, R_1] \leftarrow \mathrm{CQR}(A)$
3: $[Q, R_2] \leftarrow \mathrm{CQR}(Q_1)$
4: $R \leftarrow R_2 R_1$
5: Output: $[Q, R] = \mathrm{CQR2}(A)$, where $Q$ is $m \times n$ with orthonormal columns and $R$ is $n \times n$ upper triangular
Algorithm 5 CQR2
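A direct transcription of Algorithms 4 and 5 into NumPy (our own illustrative version; np.linalg.cholesky returns the lower-triangular factor, so its transpose is the $R$ above) looks like this:

```python
import numpy as np
from scipy.linalg import solve_triangular

def cqr(A):
    """CholeskyQR (Algorithm 4): one Cholesky-based orthogonalization pass."""
    B = A.T @ A                                   # Gram matrix (n x n)
    L = np.linalg.cholesky(B)                     # B = L L^T
    R = L.T                                       # upper-triangular factor of A
    Q = solve_triangular(L, A.T, lower=True).T    # Q^T = L^{-1} A^T, i.e. Q = A R^{-1}
    return Q, R

def cqr2(A):
    """CholeskyQR2 (Algorithm 5): a second pass corrects the orthogonality."""
    Q1, R1 = cqr(A)
    Q, R2 = cqr(Q1)
    return Q, R2 @ R1

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    A = rng.standard_normal((1000, 50))
    Q, R = cqr2(A)
    print(np.linalg.norm(Q.T @ Q - np.eye(50)))   # deviation from orthogonality
    print(np.linalg.norm(Q @ R - A) / np.linalg.norm(A))
```

Running cqr alone on an ill-conditioned input and comparing with cqr2 reproduces the behavior discussed in Section I: the first pass loses orthogonality roughly with the square of the condition number, while the second pass restores it, provided the first Cholesky factorization does not break down.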

II-F 1D-CholeskyQR2

The existing parallel 1D-CQR2 algorithm is solved over a 1D processor grid [1]. It partitions the matrix $A$ into rectangular chunks of size $m/P \times n$. The motivation behind this parallelization strategy lies in the minimal required communication. Each processor can perform a sequential symmetric rank-$k$ update (syrk) with its partition of $A$, resulting in a local contribution to the $n \times n$ matrix $B = A^T A$. The 1D parallelization allows each processor to perform local matrix multiplication with its initial partition of $A$ and contribute to the summation across rows using an Allreduce routine. Note that each processor needs only contribute the upper triangle of the symmetric matrix $B$.

Matrix $B$ is now replicated across all processors. Each processor performs a sequential Cholesky factorization of its copy of $B$ and solves for $R$. Finally, because $Q$ is distributed in the same manner as $A$, horizontal communication is not required and each processor can solve for its portion of $Q$ with its partition of $A$ and its copy of $R$. See Figure 1 and Algorithm 6 for finer details.

1D-CQR2 calls 1D-CQR twice to solve for $Q$ and $R$, as shown in Algorithms 5 and 7. Each processor can solve for $R = R_2 R_1$ sequentially. This algorithm ensures that $Q$ is distributed the same as $A$ and that $R$ is stored on every processor.

To be efficient in practice, $n$ should be small enough to make the Allreduce of the $n \times n$ Gram matrix feasible under given memory constraints. The algorithm is most efficient when $m \gg n$. For any given matrix $A$, the size and shape of the local rectangular blocks can be tuned. In general, 1D-CQR2 can only be applied to extremely overdetermined matrices. The 3D-CQR2 and CA-CQR2 algorithms expand 1D-CQR2 to handle matrices of any size.

1: $\Pi$ has $P$ processors arranged in a 1D grid. Each processor owns a (cyclic) blocked partition of the input matrix $A$ given by its local $m/P \times n$ block $A_{loc}$. Let $Q_{loc}$ be distributed the same as $A_{loc}$, and let $B$ and $R$ be replicated on every processor.
2: $B \leftarrow \mathrm{Allreduce}(A_{loc}^T A_{loc}, \Pi)$
3: $R \leftarrow \mathrm{Cholesky}(B)$
4: compute $R^{-1}$ by triangular inversion
5: $Q_{loc} \leftarrow A_{loc} R^{-1}$
6: Output: $[Q, R] = \text{1D-CQR}(A, \Pi)$, where $Q$ is distributed the same as $A$ and $R$ is an upper-triangular matrix of dimension $n \times n$ owned locally by every processor
Algorithm 6 1D-CQR
1: Same requirements as Algorithm 6.
2: $[Q_1, R_1] \leftarrow \text{1D-CQR}(A, \Pi)$
3: $[Q, R_2] \leftarrow \text{1D-CQR}(Q_1, \Pi)$
4: $R \leftarrow R_2 R_1$
5: Output: same guarantees as Algorithm 6.
Algorithm 7 1D-CQR2
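A minimal distributed sketch of the 1D algorithm (our own, using mpi4py with a contiguous row-block partition rather than a cyclic one) is shown below. Note that, as described above, only the $n \times n$ Gram matrix is communicated, independent of $m$.

```python
from mpi4py import MPI
import numpy as np
from scipy.linalg import solve_triangular

def cqr_1d(A_loc, comm):
    """1D CholeskyQR: A is partitioned by rows, A_loc is this rank's block.
    Returns (Q_loc, R) with Q distributed like A and R replicated."""
    n = A_loc.shape[1]
    B = np.empty((n, n))
    comm.Allreduce(A_loc.T @ A_loc, B, op=MPI.SUM)      # B = A^T A, replicated
    L = np.linalg.cholesky(B)                           # B = L L^T
    R = L.T
    Q_loc = solve_triangular(L, A_loc.T, lower=True).T  # Q_loc = A_loc R^{-1}
    return Q_loc, R

def cqr2_1d(A_loc, comm):
    """1D CholeskyQR2: two passes of 1D CholeskyQR."""
    Q1_loc, R1 = cqr_1d(A_loc, comm)
    Q_loc, R2 = cqr_1d(Q1_loc, comm)
    return Q_loc, R2 @ R1

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    rng = np.random.default_rng(comm.Get_rank())
    A_loc = rng.standard_normal((512, 32))              # local m/P x n row block
    Q_loc, R = cqr2_1d(A_loc, comm)
    gram = np.empty((32, 32))
    comm.Allreduce(Q_loc.T @ Q_loc, gram, op=MPI.SUM)
    if comm.Get_rank() == 0:
        print(np.linalg.norm(gram - np.eye(32)))        # global orthogonality check
```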

See Table III for the costs attained in the 1D-CQR algorithm. Note that the Allreduce in line 2 needs to send only the upper triangle of the symmetric matrix $B$, so roughly a factor of one half is applied to the horizontal bandwidth cost.

Table III: Per-line costs of Algorithm 6 (lines 2–5).

The costs in Table IV are attained in the 1D-CQR2 algorithm.

Table IV: Per-line costs of Algorithm 7 (lines 2–4).

This algorithm achieves poor scalability in communication, computation, and memory footprint. Regardless of $P$, the Allreduce replicates an $n \times n$ matrix on each processor, and as $n$ grows this matrix will not fit into a reasonably sized memory. Results from [1] show that this algorithm performs well only when the matrix is extremely overdetermined; therefore, it cannot be applied efficiently to general rectangular matrices. Below, we show that CA-CQR2 can scale efficiently on a tunable processor grid.

Figure 1: Illustration of each step in the existing parallel 1D-CQR algorithm.

III Communication-avoiding CholeskyQR2

Our main goal is to develop a parallelization of CholeskyQR2 that scales efficiently for rectangular matrices of arbitrary dimensions $m \times n$. We start with a 3D algorithm best suited to nearly square matrices.

III-A 3D-CQR2

Matrix $A$ is initially distributed across each 2D slice of $\Pi$ and is partitioned into rectangular chunks. We compute $B = A^T A$ in a similar way to MM3D. Each processor broadcasts its partition of $A$ within its slice, and local matrix multiplication of the received pieces forms a partial contribution to $B$. After a reduction and broadcast, $B$ is replicated on each 2D slice. Similar to how the Allreduce in 1D-CQR2 sums the row contributions efficiently in parallel, a reduction along the depth dimension onto a root sums each partial contribution along the corresponding subgrid, resulting in a unique partition of $B$ scattered across the grid. A broadcast along the depth dimension fills in the missing partitions and ensures that each 2D slice owns a copy of $B$.

Now that $B$ is distributed the same way as the output of MM3D, CFR3D can solve for $R$ and $R^{-1}$ over $\Pi$. Finally, MM3D computes $Q = AR^{-1}$, which is again distributed the same as $A$. An alternate strategy involves computing two triangular inverted blocks of dimension $n/2$ and solving for $Q$ with three smaller instances of MM3D. This strategy lowers the computational cost by nearly a factor of 2. Utilizing inverted blocks of smaller dimension does not decrease the computational cost further and increases the synchronization cost by a logarithmic factor. The same strategy can be employed by CA-CQR2, but it is most effective for near-square matrices. See Algorithm 8.

3D-CQR2 calls 3D-CQR twice to solve for $Q$ and $R$, as shown in Algorithms 5 and 7. MM3D is invoked over $\Pi$ to solve $R = R_2 R_1$. This algorithm ensures that $Q$ and $R$ are distributed consistently with $A$ and $B$. See Algorithm 9.

1: $\Pi$ has $P$ processors arranged in a 3D grid. $A$ is $m \times n$ and is replicated on each 2D slice of $\Pi$. Each processor owns a (cyclic) blocked partition of $A$ given by its local array $A_{loc}$. Let the temporary arrays used below be distributed the same as $A_{loc}$.
2: Bcast the local partition of $A$ within its 2D slice
3: multiply the received pieces locally to form a partial contribution to $A^T A$
4: Reduce the partial contributions along the depth dimension onto the root
5: Bcast along the depth dimension so that $B = A^T A$ is replicated on each 2D slice
6: $[R, R^{-1}] \leftarrow \mathrm{CFR3D}(B, \Pi)$
7: $Q \leftarrow \mathrm{MM3D}(A, R^{-1}, \Pi)$
8: Output: $[Q, R] = \text{3D-CQR}(A, \Pi)$, where $Q$ and $R$ are distributed the same as $A$. $Q$ has orthonormal columns and $R$ is an upper-triangular matrix of dimension $n \times n$.
Algorithm 8 3D-CQR
1: Same requirements as Algorithm 8.
2: $[Q_1, R_1] \leftarrow \text{3D-CQR}(A, \Pi)$
3: $[Q, R_2] \leftarrow \text{3D-CQR}(Q_1, \Pi)$
4: $R \leftarrow \mathrm{MM3D}(R_2, R_1, \Pi)$
5: Output: same guarantees as Algorithm 8.
Algorithm 9 3D-CQR2

The costs in Table V are attained in the 3D-CQR algorithm.

Table V: Per-line costs of Algorithm 8 (lines 2–7).

The costs in Table VI are attained in the 3D-CQR2 algorithm.

Table VI: Per-line costs of Algorithm 9 (lines 2–4).

This algorithm is most communication efficient when the matrix is nearly square.

III-B CA-CQR2

Tunable processor grids have several advantages over static grids when factoring rectangular matrices. They can match the shape of the matrix and be tuned to optimize parameters such as memory footprint and horizontal communication. Tunable algorithms solved over tunable grids can achieve a wide range of asymptotic costs, which can be shown to interpolate between those of known algorithms on specific grids. Finding the optimal processor grid for a specific algorithm is nontrivial, however. Skinny matrices cannot take full advantage of the resources provided by a 3D grid, while square matrices overload the resource capacity that skinny rectangular grids provide.

The CA-CQR2 algorithm can be seen as a generalization of 1D-CQR2 and 3D-CQR2. We define a rectangular processor grid that partitions the matrix into rectangular blocks. CA-CQR2 effectively utilizes its tunable grid by performing simultaneous instances of CFR3D on cubic grid partitions. This allows each grid partition to avoid further communication with other partitions, because each has the necessary data to compute the final step $Q = AR^{-1}$. In order to allow these simultaneous Cholesky factorizations, a few extra steps are needed to get the required data onto the correct processors.

As in 3D-CQR, each processor broadcasts its partition of $A$ within its slice and performs a local matrix multiplication to form a partial contribution to $A^T A$. In order for the cubic partitions of the grid to own the same matrix $B = A^T A$, we perform a few extra steps. First, we subdivide the longer grid dimension into contiguous groups, one per cubic partition. Each group participates in a reduction in order to simulate the horizontal summing of rows in a linear combination; only one designated processor per group must retain the communicated data and is thus the root. To complete the linear combination, we subdivide the same dimension into groups whose members are one cubic-partition stride apart. An Allreduce is performed on each such subcommunicator, resulting in every processor owning the one correct region of $B$. A final broadcast from the root within each cubic partition ensures that $B$ is replicated among its subcommunicators.

Now that $B$ is distributed as described above, simultaneous instances of CFR3D and MM3D are performed over the cubic partitions to obtain $R$, $R^{-1}$, and $Q = AR^{-1}$. CA-CQR2 requires calling CA-CQR twice and performing a final MM3D to solve for $R = R_2 R_1$.

As evidenced by Figure 2 and the pseudocode presented in Algorithms 10 and 11, CA-CQR2 combines elements of both the 1D-CQR2 and 3D-CQR2 algorithms in order to efficiently span the range of grid shapes between the two. To show that this algorithm matches the costs of 1D-CQR2 and 3D-CQR2 at the corresponding extreme grid sizes, we provide a cost analysis below.
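The interpolation can be summarized as follows (the grid notation here is ours, chosen to match the description above: a tunable grid with $P = d \cdot c^2$ processors, where $c$ is the dimension of each cubic partition and $d$ counts the partitions stacked along the matrix's longer dimension):
\[
  \underbrace{c = 1,\; d = P}_{\text{1D grid: 1D-CQR2}}
  \;\longleftarrow\;
  \underbrace{P = d \cdot c^2,\quad 1 \le c \le P^{1/3}}_{\text{tunable grid: CA-CQR2}}
  \;\longrightarrow\;
  \underbrace{c = P^{1/3},\; d = c}_{\text{3D cubic grid: 3D-CQR2}}
\]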

1: $\Pi$ has $P$ processors arranged in a tunable grid whose shape is set by an integer parameter ranging between the 1D and 3D extremes. $A$ is $m \times n$ and is replicated on each 2D slice of $\Pi$. Each processor owns a (cyclic) blocked partition of $A$ given by its local array $A_{loc}$. Let the temporary arrays used below be distributed the same as $A_{loc}$.
2: Bcast the local partition of $A$ within its 2D slice
3: multiply the received pieces locally to form a partial contribution to $A^T A$
4: Reduce the partial contributions within each contiguous group along the longer grid dimension
5: Allreduce among the strided groups so that every processor owns its correct region of $B = A^T A$
6: Bcast from the roots so that $B$ is replicated on each cubic grid partition
7: Define the cubic grid partition to which this processor belongs
8: $[R, R^{-1}] \leftarrow \mathrm{CFR3D}(B)$ simultaneously on each cubic partition
9: $Q \leftarrow \mathrm{MM3D}(A, R^{-1})$ on each cubic partition
10: Output: $[Q, R] = \text{CA-CQR}(A, \Pi)$, where $Q$ and $R$ are distributed the same as $A$. $Q$ has orthonormal columns and $R$ is an upper-triangular matrix of dimension $n \times n$.
Algorithm 10 CA-CQR
1:Same requirements as Algorithm 10.
2: $[Q_1, R_1] \leftarrow \text{CA-CQR}(A, \Pi)$
3: $[Q, R_2] \leftarrow \text{CA-CQR}(Q_1, \Pi)$
4: Define