# A 3D Parallel Algorithm for QR Decomposition

Interprocessor communication often dominates the runtime of large matrix computations. We present a parallel algorithm for computing QR decompositions whose bandwidth cost (communication volume) can be decreased at the cost of increasing its latency cost (number of messages). By varying a parameter to navigate the bandwidth/latency tradeoff, we can tune this algorithm for machines with different communication costs.


## 1 Introduction

A common task in numerical linear algebra, especially when solving least-squares and eigenvalue problems, is QR-decomposing a matrix into a unitary Q-factor times an upper trapezoidal R-factor. We present a QR decomposition algorithm, 3d-caqr-eg, whose bandwidth and latency costs demonstrate a tradeoff.

We model the cost of a parallel algorithm in terms of the number of arithmetic operations, the number of words moved between processors, and the number of messages in which these words are moved. These three quantities, measured along critical paths in a parallel schedule, characterize the algorithm’s arithmetic cost, bandwidth cost, and latency cost, respectively.

###### Theorem 1

An $m \times n$ matrix $A$ can be QR-decomposed on $P$ processors with these asymptotic costs:

$$\#\,\text{operations} = \frac{mn^2}{P}, \qquad \#\,\text{words} = \frac{n^2}{(nP/m)^{\delta}}, \qquad \#\,\text{messages} = \left(\frac{nP}{m}\right)^{\delta} (\log P)^2, \tag{1}$$

where the tradeoff parameter $\delta$ can be chosen from a bounded interval, assuming

$$P/(\log P)^4 = \Omega(m/n), \quad\text{and}\quad P \cdot (\log P)^2 = O\!\left(m^{\frac{\delta}{1+\delta}} \cdot n^{\frac{1-\delta}{1+\delta}}\right). \tag{2}$$

This arithmetic cost is optimal [DGHL12]. For the smallest $\delta$, the latency cost is optimal, and for the largest $\delta$, the bandwidth cost is optimal [BCD14]. However, these bandwidth and latency lower bounds are not attained simultaneously: for every choice of $\delta$, the bandwidth-latency product is $n^2 (\log P)^2$. We conjecture that a product of this order is necessary, meaning the tradeoff is inevitable.

Our main contribution is the presentation and analysis of 3d-caqr-eg, which extends Elmroth and Gustavson's recursive algorithm [EG00] to the distributed-memory setting and uses communication-efficient subroutines. The inductive cases feature 3D matrix multiplication (3dmm) [ABG95], which incurs a smaller bandwidth cost than conventional (2D) approaches. The base cases feature a new variant of communication-avoiding QR (caqr) [DGHL12]. caqr incurs a smaller latency cost than conventional (Householder) QR; our variant further improves the bandwidth cost. We chose the name '3d-caqr-eg' to reflect this lineage.

For tall-and-skinny matrices with sufficiently large aspect ratio $m/n$, it is best to directly invoke 3d-caqr-eg's base-case subroutine, 1d-caqr-eg. 1d-caqr-eg also demonstrates a bandwidth/latency tradeoff, albeit a less drastic one, which we can navigate to derive the following bounds.

###### Theorem 2

An $m \times n$ matrix can be QR-decomposed on $P$ processors with these asymptotic costs:

$$\#\,\text{operations} = \frac{mn^2}{P}, \qquad \#\,\text{words} = n^2, \qquad \#\,\text{messages} = (\log P)^2, \tag{3}$$

assuming $P = O(m/n)$.

The rest of this work is organized as follows. We start by summarizing relevant mathematical background on computing QR decompositions (Section 2). We then introduce our parallel machine model, formalizing how we quantify the costs of communication and computation (Section 3). Next we review the communication-efficient subroutines mentioned above, 3dmm (Section 4) and caqr (Section 5). With this background in place, we present and analyze the new algorithms, 1d-caqr-eg (Section 6) and 3d-caqr-eg (Section 7), proving Theorems 2 and 1, respectively. We conclude by discussing limitations and extensions and comparing with related work (Section 8).

## 2 QR Decomposition

This section summarizes the relevant background on computing QR decompositions.

After formalizing the problem in Section 2.1, we present a recursive template algorithm, called rec-qr (Algorithm 1 in Section 2.2), which includes many well-known algorithms as special cases. We then specialize rec-qr to utilize compact matrix representations (Section 2.3) and a simpler recursive splitting strategy (Section 2.4). The result of these specializations, called qr-eg (Algorithm 2), serves as a template for our two new algorithms, 1d-caqr-eg and 3d-caqr-eg.

### 2.1 QR Preliminaries

A QR decomposition of a matrix $A$ is a matrix pair $(Q, R)$ such that $A = QR$, the Q-factor is unitary, meaning $Q^H Q = Q Q^H = I$, and the R-factor is upper trapezoidal, meaning all entries below its main diagonal equal zero. To specialize for real-valued $A$, simply substitute $Q^T$ for $Q^H$ and 'orthogonal' for 'unitary'.

We will always assume that $A$ has at least as many rows as columns, $m \geq n$. (An upper trapezoidal R-factor is then zero below its leading $n \times n$ upper triangle.) When $A$ has more columns than rows, we can obtain a QR decomposition by splitting $A = [A_1 \; A_2]$ with square $A_1$, decomposing $A_1 = Q R_1$, and computing $R_2 = Q^H A_2$.
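As a concrete illustration of the wide-matrix reduction just described, here is a minimal numpy sketch; the function name `qr_wide` and the use of numpy's built-in `qr` are our own choices for illustration, not part of the algorithms in this paper:

```python
import numpy as np

def qr_wide(A):
    """QR-decompose a wide matrix A (m <= n): factor the leading square
    block A1 = Q R1, then form R2 = Q^H A2, so that A = Q [R1 R2]."""
    m, n = A.shape
    assert m <= n
    Q, R1 = np.linalg.qr(A[:, :m])       # decompose the square left block
    R2 = Q.conj().T @ A[:, m:]           # apply Q^H to the remaining columns
    return Q, np.hstack([R1, R2])        # R is upper trapezoidal (m x n)
```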

### 2.2 Recursive QR Decomposition

We consider QR decomposition algorithms based on rec-qr (Algorithm 1), which split $A$ vertically (line 4), QR-decompose the left panel (line 5), update the right panel (line 6), QR-decompose the lower part of the (updated) right panel (line 7), and then assemble a QR decomposition of $A$ from the smaller ones (lines 8 and 9).

We call rec-qr a 'template' because it leaves several details unspecified. To instantiate this template and obtain an algorithm, we must pick a base-case condition (base-condition, line 1), a base-case QR decomposition subroutine (base-QR, line 2), and a splitting strategy (Split, line 4). Additionally, we must specify how the operations are scheduled and how the data are distributed.
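To make the template concrete, the following numpy sketch instantiates rec-qr with explicit factors, a halving split, and numpy's `qr` as base-QR; the compact representations of Section 2.3 are deliberately omitted here, and all names are ours:

```python
import numpy as np

def rec_qr(A, b=8):
    """Sketch of the rec-qr template: halving split, numpy qr as base-QR.
    Returns explicit thin Q (m x n) and R (n x n)."""
    m, n = A.shape
    if n <= b:                               # base-case condition
        return np.linalg.qr(A)               # base-QR
    nl = n // 2
    QL, RL = rec_qr(A[:, :nl], b)            # decompose the left panel
    B = QL.conj().T @ A[:, nl:]              # update the right panel
    C = A[:, nl:] - QL @ B                   # project out span(QL)
    QR_, RR = rec_qr(C, b)                   # decompose the remainder
    Q = np.hstack([QL, QR_])                 # assemble the Q-factor
    R = np.block([[RL, B],
                  [np.zeros((n - nl, nl)), RR]])
    return Q, R
```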

### 2.3 Compact Representations

In practice, a QR decomposition of an $m \times n$ matrix ($m \geq n$) is typically not represented as a pair of explicit matrices. In this section we specialize rec-qr to represent $Q$ and $R$ more compactly.

Since an R-factor is upper trapezoidal, it is determined by the entries of its upper triangle. Subsequently, the symbol $R$ may either denote (1) the actual R-factor, an upper trapezoidal $m \times n$ matrix; (2) the leading $n$ rows of the R-factor, an upper triangular $n \times n$ matrix; or (3) the upper triangle of the R-factor, a data structure of size $n(n+1)/2$. When presenting algorithms, we will prefer convention (2); to obtain an upper triangular $R$ from rec-qr, we agree that base-QR (line 2) returns such an $R$, and we rewrite the R-factor assembly (line 9) as

$$R = \begin{bmatrix} R_L & B_{12} \\ 0 & R_R \end{bmatrix}.$$

$Q$ can be written as $Q = I - V T V^H$: the matrix pair $(V, T)$ is called a basis-kernel representation [SB95] of $Q$. If $Q$ is the Q-factor of a QR decomposition of an $m \times n$ matrix ($m \geq n$), then there exists such a representation where the basis $V$ is $m \times n$ and the kernel $T$ is $n \times n$.

Modifying rec-qr to use basis-kernel representations, line 6 becomes

$$\begin{bmatrix} B_{12} \\ B_{22} \end{bmatrix} = \begin{bmatrix} A_{12} \\ A_{22} \end{bmatrix} - V_L T_L^H V_L^H \begin{bmatrix} A_{12} \\ A_{22} \end{bmatrix}, \tag{4}$$

and line 8 becomes

$$V = \begin{bmatrix} V_L & \begin{bmatrix} 0 \\ V_R \end{bmatrix} \end{bmatrix}, \qquad T = \begin{bmatrix} T_L & -T_L V_L^H \begin{bmatrix} 0 \\ V_R \end{bmatrix} T_R \\ 0 & T_R \end{bmatrix}, \tag{5}$$

where $(V_L, T_L)$ and $(V_R, T_R)$ represent $Q_L$ and $Q_R$.

To simplify the presentation without affecting our asymptotic conclusions, we will not exploit the block-lower-trapezoidal and block-upper-triangular structures of the bases and kernels. With this understanding, it minimizes arithmetic to evaluate the quadruple matrix product in Equation 4 from right to left, and the quadruple product in Equation 5 from the inside out (two possibilities).

When $n$ is close to $m$, a general basis-kernel representation may require more storage than the explicit Q-factor. The QR decomposition algorithms in (Sca)LAPACK use a variant [Pug92] of the compact WY representation [SVL89], which we call Householder representation in this work. In Householder representation, the basis $V$ is unit lower trapezoidal and the kernel $T$ is upper triangular. These properties enable an in-place implementation, where $V$'s strict lower trapezoid and $R$'s upper triangle overwrite $A$ and where $T$ need not be stored, since in this case

$$T = \left(\operatorname{triu}(V^H V, 1) + \operatorname{diag}(\operatorname{diag}(V^H V))/2\right)^{-1},$$

using the MATLAB operations 'triu' and 'diag'.
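The kernel formula can be checked numerically; the following sketch (real arithmetic, a random unit lower trapezoidal basis, all names ours) verifies that the resulting $Q = I - V T V^H$ is unitary:

```python
import numpy as np

# Build a random unit lower trapezoidal basis V (m x n, m >= n).
rng = np.random.default_rng(2)
m, n = 8, 3
V = np.tril(rng.standard_normal((m, n)))
for j in range(n):
    V[j, j] = 1.0                          # unit diagonal

# Kernel formula: inv(T) = triu(V^H V, 1) + diag(diag(V^H V)) / 2.
G = V.T @ V                                # V^H V (real case)
Tinv = np.triu(G, 1) + np.diag(np.diag(G)) / 2
T = np.linalg.inv(Tinv)

Q = np.eye(m) - V @ T @ V.T                # should be unitary
```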

Our algorithms will construct, store, and apply Q-factors in Householder representation. This choice is motivated by our practical goal of integration into the ScaLAPACK library [BCC97]; from a theoretical standpoint, any basis-kernel representation (with an $m \times n$ basis) would yield the same asymptotic costs.

### 2.4 Elmroth-Gustavson’s Approach

The recursive framework of rec-qr is quite general. We will obtain our desired algorithmic costs by following an approach of Elmroth and Gustavson [EG00] (implemented in LAPACK's _geqrt3), in which we split vertically (roughly) in half, until the number of columns drops below a given threshold $b$. The recursive calls define a binary tree whose levels are complete except possibly the last. (We always suppose $b \geq 1$; when $n \leq b$, the tree has just one node.) We call this specialized template qr-eg (Algorithm 2); qr-eg utilizes the compact representations as explained in Section 2.3.

We mention that [EG00] actually proposes a hybrid of the stated approach and an iterative approach, switching between the two for a constant-factor improvement in the arithmetic cost. (It still fits in the rec-qr framework.) While our algorithms can also benefit from this optimization, we omit further discussion in the present work since it does not affect our asymptotic conclusions.

## 3 Computation Model

We model a parallel machine as a set of $P$ interconnected processors, each with unbounded local memory. Processors operate on local data and communicate with other processors by sending and receiving messages. A processor can perform at most one task (operation/send/receive) at a time. A (data) word means a complex number; operations are the usual field operations, plus complex conjugation and real square roots. Messages are point-to-point and asynchronous. Each operation takes time $\gamma$, while sending or receiving a message of $w$ words takes time $\alpha + \beta w$, $\alpha$ being the latency and $\beta$ the inverse of the bandwidth.

We model an execution as a DAG whose vertices are tasks and whose edges define paths, one for each processor's task sequence, plus an inter-path edge for each send/receive pair. Weighting vertices by their tasks' durations, we define runtime as the maximum weight of any path. Therefore, if every path includes at most $F$ operations and at most $S$ messages, containing at most $W$ words in total, the runtime is bounded:

$$\text{runtime} \leq \gamma \cdot F + \beta \cdot W + \alpha \cdot S.$$
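The bound above can be read as a simple three-parameter cost model. In this hypothetical sketch the machine parameters are made-up values, chosen only to illustrate that message count dominates when the latency $\alpha$ far exceeds the per-word cost $\beta$:

```python
def runtime_bound(F, W, S, gamma=1e-10, beta=1e-9, alpha=1e-6):
    """Upper bound on runtime: each of F operations costs gamma, each of
    W words costs beta, and each of S messages costs alpha, measured
    along a critical path. Parameter defaults are illustrative only."""
    return gamma * F + beta * W + alpha * S

# Moving the same volume in fewer, larger messages lowers the bound:
few_large = runtime_bound(F=0, W=10**6, S=10)
many_small = runtime_bound(F=0, W=10**6, S=10**4)
```

With these (made-up) parameters, `few_large < many_small`, which is the tradeoff the algorithms in this paper navigate in the opposite direction: spending messages to save words.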

When multiple processors send/receive messages simultaneously, it can be more efficient to split the messages into more messages of smaller sizes, to coalesce them into fewer, larger messages, or to route them through intermediate processors. These lower-level implementation details are often irrelevant to our asymptotic analyses, motivating us to express our algorithms’ communication patterns abstractly, in terms of collectives.

In the rest of Section 3 we define the eight different collectives appearing in this work, collecting their costs in Table 1.

A general scenario, called an all-to-all, is when every processor $p$ initially owns a block of data destined for every processor $q$, including itself. All other collectives we use can be interpreted as special cases of an all-to-all. Four of these distinguish a 'root' processor $r$: scatter, where only $r$'s outgoing blocks are nonempty; gather, where only $r$'s incoming blocks are nonempty; broadcast, a scatter where $r$'s outgoing blocks are identical; and reduce, a gather where $r$'s incoming blocks have the same size and are added entrywise. Four others, including all-to-all, can be constructed from the first four: all-gather, gathers with the same outgoing blocks but different roots; all-reduce, reduces with the same outgoing blocks but different roots; all-to-all, gathers with different outgoing blocks and different roots; and reduce-scatter, reduces with different outgoing blocks and different roots. (Since the three 'reduce' collectives perform arithmetic, they are technically not all-to-alls.)

###### Lemma 1

There exist algorithms for the eight collectives satisfying the upper bounds in Table 1.

For all but all-to-all we use a binomial-tree and possibly a bidirectional-exchange algorithm: see, e.g., [TRG05, CHPvdG07]. In particular, for (all-)gather and (reduce-)scatter we use binary tree algorithms, and for broadcast and (all-)reduce we use whichever of the two minimizes all three costs, asymptotically. For all-to-all we use the (radix-2) index algorithm [BHK97], possibly performed twice using the load-balancing approach of [HBJ96]. (For a more detailed proof, see Section A.)

Note that when the number of processors is not too large relative to the block size, the bidirectional-exchange algorithm for broadcast, built from scatter + all-gather, is asymptotically cheaper than the corresponding binary tree algorithm in terms of bandwidth. Similarly, the bidirectional-exchange algorithm for (all-)reduce, built from reduce-scatter + (all-)gather, improves both arithmetic and bandwidth.

## 4 3D Matrix Multiplication

The key to reducing 3d-caqr-eg's bandwidth cost below that of previous approaches is 3D matrix multiplication (3dmm) [ABG95]. Here, 3D refers to the parallelization of the operations and distribution of the data over a three-dimensional (logical) processor grid. 1d-caqr-eg will also exploit two special cases: 1dmm, performed on a one-dimensional grid, and mm, performed locally on one processor.

For concreteness, consider multiplying an $I \times K$ matrix $A$ with a $K \times J$ matrix $B$ to obtain an $I \times J$ matrix $C$, via the usual (entrywise) formula:

$$\text{for all } (i,j) \in [I] \times [J], \qquad C_{ij} = \sum_{k \in [K]} A_{ik} B_{kj}. \tag{6}$$

We identify each of the $IJK$ (scalar) multiplications with a point in three-dimensional Euclidean space, so the set of multiplications defines a discrete $I \times J \times K$ brick. We point the reader to [BDKS16, SVdGP16] for further discussions of this geometric terminology.

###### Lemma 2

Suppose input matrices $A$ and $B$ are initially owned by processor $p$, and output matrix $C$ is to be finally owned also by processor $p$. $C$ can be computed with runtime

$$\gamma \cdot O(IJK). \tag{7}$$

Directly evaluating the sums of products in Equation 6 on processor $p$ involves $IJK$ multiplications and $IJ(K-1)$ additions; no communication is necessary.

###### Lemma 3

Suppose $P$, $I$, $J$, and $K$ satisfy

$$P = O\!\left(\frac{IJK}{\max(I,J,K)}\right) \quad\text{and}\quad P = O(\max(I,J,K)).$$

If $K = \max(I,J,K)$, suppose that the input matrices (the left factor stored transposed, as a $K \times I$ matrix) are initially distributed in matching row-wise layouts where each processor owns $O(K/P)$ rows, and that matrix $C$ is to be finally owned by a single processor $r$. Alternatively, if $I = \max(I,J,K)$, suppose that matrices $A$ and $C$ are initially/finally distributed in matching row-wise layouts where each processor owns $O(I/P)$ rows, and that matrix $B$ is initially owned by a single processor $r$. $C$ can be computed with runtime

$$\gamma \cdot O\!\left(\frac{IJK}{P}\right) + \beta \cdot O\!\left(\frac{IJK}{\max(I,J,K)}\right) + \alpha \cdot O(\log P). \tag{8}$$

In the first case, each processor performs a local mm and then all processors reduce $C$ to processor $r$. In the second case, processor $r$ broadcasts $B$ to all processors and then each processor performs a local mm. The hypotheses guarantee that $P$ is not so large that these collectives cannot leverage the bidirectional-exchange algorithms. (For a more detailed proof, see Section B.) The bound of Equation 8 also holds in a third case, when $J = \max(I,J,K)$ and the distributions are symmetric to the second case, but we will not need this result.

###### Lemma 4

Suppose $P$, $I$, $J$, and $K$ satisfy

$$c \cdot \frac{IJK}{\min(I,J,K)^3} \leq P \leq IJK$$

for a certain constant $c$. There exists a data distribution of $A$, $B$, and $C$ such that each processor initially owns at most $O(IK/P)$ entries of $A$ and $O(KJ/P)$ entries of $B$, and finally at most $O(IJ/P)$ entries of $C$, where $C$ can be computed with runtime

$$\gamma \cdot O\!\left(\frac{IJK}{P}\right) + \beta \cdot O\!\left(\left(\frac{IJK}{P}\right)^{2/3}\right) + \alpha \cdot O(\log P). \tag{9}$$

Pick grid dimensions $P_I$, $P_J$, and $P_K$ proportional to $I/d$, $J/d$, and $K/d$, where $d = (IJK/P)^{1/3}$. Under the hypotheses, $P_I$, $P_J$, and $P_K$ are positive integers with $P_I P_J P_K = P$, thus defining a valid processor grid. Pick partitions of $[I]$, $[J]$, and $[K]$ which are balanced, meaning their parts differ in size by at most one. Crossing these partitions blockwise, distribute $O(IK/P)$ entries of $A$ to each grid processor, and similarly for $B$ and $C$: this distribution satisfies the balance constraint in the lemma statement. The algorithm proceeds with all-gathers of blocks of $A$ and $B$ along processor grid fibers in the $J$- and $I$-directions, then local mms, then finally reduce-scatters of blocks of $C$ along processor grid fibers in the $K$-direction. (For a more detailed proof, see Section B.)
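A serial simulation of this blocking can check that summing the per-grid-point local mms along the $K$-direction reproduces the product; the helper names below are ours:

```python
import numpy as np

def balanced_parts(n, p):
    """Balanced partition of range(n) into p parts (sizes differ by <= 1)."""
    return np.array_split(np.arange(n), p)

def mm_3d_sim(A, B, PI, PJ, PK):
    """Serial simulation of 3dmm on a PI x PJ x PK grid: each grid point
    (i, j, k) multiplies one block of the I x J x K iteration brick; the
    blockwise products are summed along the k-direction, standing in for
    the reduce-scatter over grid fibers."""
    I, K = A.shape
    K2, J = B.shape
    assert K == K2
    C = np.zeros((I, J))
    for ib in balanced_parts(I, PI):
        for jb in balanced_parts(J, PJ):
            for kb in balanced_parts(K, PK):
                C[np.ix_(ib, jb)] += A[np.ix_(ib, kb)] @ B[np.ix_(kb, jb)]
    return C
```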

We denote by mm, 1dmm, or 3dmm an algorithm that satisfies Lemma 2, Lemma 3, or Lemma 4, resp.

## 5 Communication-Avoiding QR

Our new algorithms 1d-caqr-eg and 3d-caqr-eg are closely related to communication-avoiding QR (caqr) and tall-skinny QR (tsqr) [DGHL12]: we explore this relationship in Section 8. For now we remark that 3d-caqr-eg specializes qr-eg to use 1d-caqr-eg as a base case, and that 1d-caqr-eg specializes qr-eg to use tsqr (the variant in [BDG15]) as a base case.

On input, the $m \times n$ matrix $A$ is partitioned across the $P$ processors so that each processor $p$ owns $m_p \geq n$ rows, not necessarily contiguous. Thus we require $A$ be sufficiently tall and skinny: $m \geq nP$. A single processor, which owns $A$'s leading rows, is designated as the root processor.

On output, the Q-factor is stored in Householder representation $(V, T)$, where $V$ has the same distribution as $A$. Both $T$ and the R-factor are returned only on the root processor.

###### Lemma 5

tsqr's runtime is

$$\gamma \cdot O\!\left(\max_p m_p n^2 + n^3 \log P\right) + \beta \cdot O\!\left(n^2 \log P\right) + \alpha \cdot O(\log P).$$

It is crucial to use the tsqr variant in [BDG15]. (See also Section C for a proof.)

Recall from Section 3 that when the block size is sufficiently large, reduce and broadcast can be performed more efficiently, by reduce-scatter + gather and scatter + all-gather, respectively. Unfortunately, tsqr's reduce- and broadcast-like collectives preclude these optimizations. In Section 6, we will show how similar savings are nevertheless achievable.
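The structure of tsqr's reduction tree can be sketched serially as follows; this toy version (our own naming) tracks only the R factors, ignoring the Householder vectors that the real algorithm accumulates at each tree level:

```python
import numpy as np

def tsqr_r(blocks):
    """Serial simulation of tsqr's binary reduction tree: QR-factor each
    local block, then repeatedly stack pairs of R factors and re-factor,
    until a single n x n R factor remains."""
    rs = [np.linalg.qr(b)[1] for b in blocks]        # local QRs
    while len(rs) > 1:                               # one tree level per pass
        nxt = [np.linalg.qr(np.vstack(rs[i:i + 2]))[1]
               for i in range(0, len(rs) - 1, 2)]
        if len(rs) % 2:                              # odd block carries over
            nxt.append(rs[-1])
        rs = nxt
    return rs[0]
```

The number of tree levels is $\lceil \log_2 P \rceil$ for $P$ blocks, which is where the $\log P$ factors in Lemma 5 come from.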

## 6 1d-Caqr-Eg

We now present a new algorithm, 1d-caqr-eg, an instantiation of the template qr-eg (Algorithm 2). 1d-caqr-eg effectively reduces tsqr's bandwidth cost by a logarithmic factor, at the expense of increasing its latency cost by a comparable factor.

The input/output data distributions are the same as for tsqr, so we continue the notation from Section 5. We specify 1d-caqr-eg by stepping line-by-line through qr-eg — base case in Section 6.1 and inductive case in Section 6.2 — then prove Theorem 2 in Section 6.3.

### 6.1 Base Case

1d-caqr/eg’s base-case QR decomposition subroutine (creftype 2) is tsqr (Section 5), using the same root processor. Note that ’s distribution satisfies tsqr’s requirements, and , , and are returned distributed as required by 1d-caqr/eg.

### 6.2 Inductive Case

Let us walk through the inductive case line-by-line. All algorithmic costs are incurred in the two recursive calls (lines 5 and 9) and the six matrix multiplications (lines 6, 7, 8, 11, 12, and 13).

(line 4): the splitting involves no computation or communication.

(line 5): the left recursive call is valid since the left panel still satisfies the data distribution requirements (only $n$ decreases).

(line 6): this is a 3dmm on a (1D) processor grid: the partitions along the two small dimensions are trivial, and the partition along the large dimension matches the distribution of $A$'s rows. The output is reduced to the root processor, per the first case of Lemma 3.

(line 7): this is an mm on the root processor.

(line 8): this is a 3dmm on a (1D) processor grid, followed by a matrix subtraction. We choose the grid and partitions as in line 6, except that now the small factor is broadcast from the root processor, per the second case of Lemma 3. The row-wise distribution enables the subsequent matrix subtraction to be performed without further communication.

(line 9): the second recursive call is valid since the remainder still satisfies the data distribution requirements: the number of rows owned by the root processor decreases by the same amount that $m$ does, while all other processors keep the same number of rows.

(line 10): each processor assembles its local rows of $V$: no computation or communication is required.

(line 11): this is a 3dmm; we choose a processor grid and partitions as in line 6.

(line 12): this is an mm on the root processor with the same dimensions as line 7.

(line 13): this is an mm on the root processor with the same dimensions as lines 7 and 12.

(line 14): each processor assembles its local rows: no computation or communication is required.

Verify that $V$, $T$, and $R$ are distributed as desired.

### 6.3 Concluding the Analysis

1d-caqr-eg is valid for any threshold $b \geq 1$, and there is no loss of generality in supposing $b \leq n$. When $b = n$, 1d-caqr-eg reduces to tsqr. As we will see, picking $b < n$ allows us to reduce 1d-caqr-eg's arithmetic and bandwidth costs — while increasing its latency cost — to appear as if we had used bidirectional-exchange reduce and broadcast algorithms (Section 3) within tsqr, despite the fact that these algorithms are inapplicable, as we lamented at the end of Section 5. We will navigate the tradeoff with a nonnegative parameter $\epsilon$, taking

$$b = \Theta\!\left(\frac{n}{(\log P)^{\epsilon}}\right). \tag{10}$$

We will show that taking $\epsilon = 1$ yields Theorem 2.

###### Lemma 6

If $P = O(m/n)$, 1d-caqr-eg has runtime

$$\gamma \cdot O\!\left(\frac{mn^2}{P} + n b^2 \log P\right) + \beta \cdot O\!\left(n^2 + n b \log P\right) + \alpha \cdot O\!\left(\frac{n}{b} \log P\right). \tag{11}$$

Here we give an expanded proof of Lemma 6.

Let us derive an upper bound $T(m,n)$ on the runtime of a 1d-caqr-eg invocation. (The unchanging parameters are implicit.) We will now assume a balanced data distribution, meaning the numbers of rows any two processors own differ by at most one.

When $P = 1$, the runtime of 1d-caqr-eg for any $b$ is just $\gamma \cdot O(mn^2)$, which satisfies the conclusion, so we may assume $P > 1$ hereafter.

In the base case ($n \leq b$), the algorithmic cost is the tsqr call, so by Lemma 5 we conclude that

$$T(m,n) = \gamma \cdot O\!\left(\frac{mn^2}{P} + n^3 \log P\right) + \beta \cdot O\!\left(n^2 \log P\right) + \alpha \cdot O(\log P).$$

In the inductive case, the algorithmic cost is due to the two recursive calls, the three 1dmm calls (lines 6, 8, and 11), performed on 1D processor grids, and the three (local) mm calls (lines 7, 12, and 13), performed by the root.

The local mms have runtime $\gamma \cdot O(n^3)$.

To apply Lemma 3 to the 1dmm calls, let us now suppose that $P = O(m/n)$. This assumption persists in every recursive call: the current $m$ is within a factor of two of the initial $m$, while $n$ only decreases. Confirming that the data distributions chosen in Section 6.2 match those in the proof of Lemma 3, the 1dmms' runtime is

$$\gamma \cdot O\!\left(\frac{mn^2}{P}\right) + \beta \cdot O\!\left(n^2\right) + \alpha \cdot O(\log P).$$

Overall, we have found that

$$T(m,n) = T\!\left(m, \lfloor n/2 \rfloor\right) + T\!\left(m - \lfloor n/2 \rfloor, \lceil n/2 \rceil\right) + \gamma \cdot O\!\left(\frac{mn^2}{P}\right) + \beta \cdot O\!\left(n^2\right) + \alpha \cdot O(\log P).$$
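As a sanity check on the recurrence's shape, the following sketch (our own naming; it ignores $m$ and $P$) tallies how many base cases and inductive rounds the halving recursion generates:

```python
def recursion_counts(n, b):
    """Return (#base cases, #inductive cases) for the halving recursion
    T(n) = T(floor(n/2)) + T(ceil(n/2)) with base case n <= b."""
    if n <= b:
        return (1, 0)
    bl, il = recursion_counts(n // 2, b)         # left panel, floor(n/2)
    br, ir = recursion_counts(n - n // 2, b)     # right panel, ceil(n/2)
    return (bl + br, il + ir + 1)
```

With $n = 2^k b$ the tally gives $n/b$ base cases and $n/b - 1$ inductive cases; this $n/b$ factor is what converts tsqr's per-call costs into the base-case contributions to Equation 11.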

By induction we may take $T$ to be nondecreasing in both $m$ and $n$. The former property justifies replacing $m - \lfloor n/2 \rfloor$ by $m$ in the second recursive call. Hence, we will take $m$ to be its initial value in the analysis of every recursive call.

Supposing $n = 2^k b$ for a nonnegative integer $k$, unrolling the recurrence shows that $T(m,n)$ is bounded by Equation 11.

Now observe that replacing $b$ by $2b$ reproduces the asymptotic bound of Equation 11, and since $T$ is nondecreasing in $n$, this bound also holds when $n$ is not of the form $2^k b$.

Requiring $P = O(m/n)$ suffices to ensure that Lemma 3's hypotheses hold at every inductive case.

[of Theorem 2] Substituting Equation 10 into Equation 11 gives runtime

$$\gamma \cdot O\!\left(\frac{mn^2}{P}\left(1 + \frac{nP}{m}(\log P)^{1-2\epsilon}\right)\right) + \beta \cdot O\!\left(n^2\left(1 + (\log P)^{1-\epsilon}\right)\right) + \alpha \cdot O\!\left((\log P)^{1+\epsilon}\right),$$

whence the hypothesis $P = O(m/n)$ ensures the arithmetic cost is $O(mn^2/P)$. We conclude by taking $\epsilon = 1$.

This argument extends to any $\epsilon \geq 0$, assuming $P = O(m/n)$, but the asymptotic tradeoff involving the arithmetic cost vanishes for larger $\epsilon$, beyond which the tradeoff is only between bandwidth and latency. A sensible interpretation of the case $\epsilon = 0$ is $b = n$, meaning tsqr is invoked immediately. In this case, the costs are given directly by Lemma 5.

## 7 3d-Caqr-Eg

We now present our second new algorithm, 3d-caqr-eg, another instantiation of the template qr-eg (Algorithm 2).

On input, the $m \times n$ matrix $A$ ($m \geq n$) is partitioned across the $P$ processors row-cyclically: thus, each processor owns at most $\lceil m/P \rceil$ rows.

On output, the Q-factor is stored in Householder representation $(V, T)$, where $V$ has the same distribution as $A$. Both $T$ and the R-factor have the same distribution, matching the top $n \times n$ submatrix of $A$.

After walking through 3d-caqr-eg in a similar fashion as we did for 1d-caqr-eg in Section 6, we collect the results to prove Theorem 1.

In the following, $T(m,n)$ denotes an upper bound on the runtime of 3d-caqr-eg, a function of $m$ and $n$.

### 7.1 Base Case

Recall that $b$ denotes the recursive threshold for 3d-caqr-eg. 3d-caqr-eg's base-case QR decomposition subroutine (line 2) is 1d-caqr-eg, with a fixed recursive threshold $b^*$.

To satisfy 1d-caqr-eg's data distribution requirements, we convert $A$ from row-cyclic to block-row layout, distributed over

$$P^* = \min\!\left(P, \lfloor m/n \rfloor\right)$$

processors, ensuring each owns at least $n$ rows and one owns the topmost rows (and perhaps others).

Initially, $P' = \min(P, m)$ processors own rows of $A$. (Clearly $P' \leq P$, with equality just in case $m \geq P$; note further that $P^* \leq P'$.) Number these processors from $0$ to $P' - 1$ according to the cyclic layout of $A$, so that processor $0$ owns the top row of $A$. Deal these processors among $P^*$ groups, so processor $0$ goes into group $0$, processor $1$ goes into group $1$, and so on, wrapping around modulo $P^*$. Represent each group by its lowest-numbered processor, and within each group, gather $A$'s rows to the representative. Since each group contains at most $\lceil P'/P^* \rceil$ processors and each processor initially owns at most $\lceil m/P' \rceil$ rows of $A$, the largest block-size in any gather is at most $\lceil m/P' \rceil n$ words.

Each of the $P^*$ representatives (including processor $0$) now owns at least $n$ rows of $A$, satisfying the first part of 1d-caqr-eg's data distribution requirements: it remains to ensure processor $0$ owns the top rows of $A$. These rows are currently owned by the first $P''$ representatives. (Clearly $P'' \leq P^*$.) We next perform a gather over the representatives of groups $0$ through $P'' - 1$, taking processor $0$ to be the root so that afterwards it owns the top rows of $A$ (and perhaps others). We also perform a scatter with the opposite communication pattern so that the overall number of rows per representative is unchanged. The largest block-size in both the gather and the scatter is at most $\lceil n/P'' \rceil n$ words.
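The dealing pattern described above can be sketched as follows, with hypothetical names of our own; it only computes the group memberships and representatives, not the gathers themselves:

```python
def deal_into_groups(P_prime, P_star):
    """Deal processors 0 .. P_prime-1 among P_star groups round-robin
    (processor q joins group q mod P_star); each group is represented
    by its lowest-numbered member. Assumes P_star <= P_prime, as in
    the text (P* <= P')."""
    assert 1 <= P_star <= P_prime
    groups = [[] for _ in range(P_star)]
    for q in range(P_prime):
        groups[q % P_star].append(q)
    representatives = [g[0] for g in groups]
    return groups, representatives
```

Note that each group has at most $\lceil P'/P^* \rceil$ members, which is the group-size bound used in the gather cost above.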

We can now invoke 1d-caqr-eg, with $P^*$ processors and threshold $b^*$. After it returns, we redistribute $V$, $T$, and $R$ by reversing the preceding gathers/scatters, so that $V$ is (resp., $T$ and $R$ are) distributed over all processors like $A$ was (resp., $A$'s first rows were) initially.

### 7.2 Inductive Case

Let us walk through the inductive case line-by-line, as we did for 1d-caqr-eg. All algorithmic costs are incurred in the two recursive calls (lines 5 and 9) and the six matrix multiplications (lines 6, 7, 8, 11, 12, and 13).

(line 4): the splitting involves no computation or communication.

(line 5): the left recursive call is valid since the left panel still satisfies the data distribution requirements (only $n$ decreases).

(line 6): this is a 3dmm. We do not yet specify the processor grid, but we do suppose that 3d-caqr-eg uses a balanced parallelization and data distribution as in the proof of Lemma 4: this is possible for any processor grid.

To match this data distribution, we perform an all-to-all before and after the 3dmm invocation, each time using the two-phase approach [BHK97]. The first all-to-all redistributes the input matrices from column- and row-cyclic layouts to the 3dmm layout (the left factor is row-cyclic, transposed); the maximum number of input matrix entries any processor owns before or after this collective is bounded in terms of the matrix dimensions and the processor grid dimensions. The second all-to-all converts the output matrix from the 3dmm layout to row-cyclic layout, with a similar bound on the maximum number of output matrix entries any processor owns before or after.

(line 7): this is a 3dmm. We pick a 3D processor grid and partitions satisfying the same constraints as for line 6, and perform similar all-to-alls before and after, so we can reuse the preceding analysis, substituting the appropriate matrix dimensions.

(line 8): this is a 3dmm, followed by a matrix subtraction. We proceed similarly to lines 6 and 7, except that the left factor is initially in row-cyclic layout (not transposed), so the corresponding summand in the entry-count bound changes accordingly.

(line 9): the second recursive call is valid since the remainder still satisfies the data distribution requirements: in particular, unlike 1d-caqr-eg, there is no requirement that a fixed processor own the first rows or that every processor own at least $n$ rows.

(line 10): each processor assembles its local rows of $V$: no computation or communication is required.

(line 11): this is a 3dmm; we choose a processor grid, partitions, and all-to-alls as in line 6.

(line 12): this is a 3dmm with the same dimensions, processor grid, partitions, and all-to-alls as line 7.

(line 13): this is a 3dmm with the same dimensions, processor grid, partitions, and all-to-alls as lines 7 and 12.

(line 14): each processor assembles its local rows: no computation or communication is required.

Verify that $V$, $T$, and $R$ are distributed as desired.

### 7.3 Concluding the Analysis

3d-caqr-eg is valid for any thresholds $b \geq b^* \geq 1$, and there is no loss of generality in supposing $b \leq n$. Taking $b = n$ simplifies 3d-caqr-eg to 1d-caqr-eg with additional data redistributions. As in the case of 1d-caqr-eg, picking $b < n$ allows us to reduce 3d-caqr-eg's arithmetic and bandwidth costs, while increasing its latency cost.

We will navigate this tradeoff with two nonnegative parameters $\delta$ and $\epsilon$, taking

$$b = \Theta\!\left(\frac{n}{(nP/m)^{\delta}}\right), \qquad b^* = \Theta\!\left(\frac{b}{(\log P)^{\epsilon}}\right). \tag{12}$$

We prove Theorem 1 with $\epsilon = 1$ and $\delta$ as in the theorem statement.

###### Lemma 7

If $n \geq b \geq b^*$ and the hypotheses identified in the proof hold, 3d-caqr-eg has runtime

$$T(m,n) = \gamma \cdot O\!\left(\frac{mn^2}{P} + n {b^*}^2 \log P\right) + \beta \cdot O\!\left(\frac{mn}{P} + nb + n b^* \log P + \left(\frac{mn^2}{P}\right)^{2/3} + \left(\left(\frac{mn}{P} + n\right)\log\frac{n}{b} + \frac{n P^2}{b}\right)\log P\right) + \alpha \cdot O\!\left(\frac{n}{b^*}\log P\right). \tag{13}$$

Here we give an expanded proof of Lemma 7.

Let us derive an upper bound $T(m,n)$ on the runtime of a 3d-caqr-eg invocation. (The unchanging parameters are implicit.)

When $P = 1$, the runtime of 3d-caqr-eg is just $\gamma \cdot O(mn^2)$, which satisfies the conclusion, so we may assume $P > 1$ hereafter.

In a base case ($n \leq b$), the algorithmic costs are due to the 1d-caqr-eg invocation and the four communication phases.

The first communication phase, involving $P^*$ independent gathers, has runtime bounded by

$$\beta \cdot O\!\left(\left(\lceil P'/P^* \rceil - 1\right)\lceil m/P' \rceil n\right) + \alpha \cdot O\!\left(\log\lceil P'/P^* \rceil\right) = \beta \cdot O\!\left(\frac{mn}{P} + n^2\right) + \alpha \cdot O(\log P),$$

and the same bound applies for the last phase (matching scatters). (Recall that $P' = \min(P, m)$ and $P^* = \min(P, \lfloor m/n \rfloor)$.) The fact that no communication happens when $P^* = P'$ is evident in the first bound but not the second.

The second communication phase, a simultaneous gather/scatter, has runtime bounded by

$$\beta \cdot O\!\left(\left(P'' - 1\right)\lceil n/P'' \rceil n\right) + \alpha \cdot O\!\left(\log P''\right) = \beta \cdot O\!\left(n^2\right) + \alpha \cdot O(\log P),$$

and the same bound applies for the third phase (matching scatter/gather). (Recall that $P'' \leq P^* \leq P$.)

A runtime bound for the 1d-caqr-eg invocation is given by Theorem 2, supposing now that $P^* = O(m/n)$. Actually we use the more refined bound of Equation 11, substituting $b^*$ for $b$.

Altogether, in a base case,

$$T(m,n) = \gamma \cdot O\!\left(\frac{mn^2}{P} + n {b^*}^2 \log P\right) + \beta \cdot O\!\left(\frac{mn}{P} + n^2 + n b^* \log P\right) + \alpha \cdot O\!\left(\frac{n}{b^*}\log P\right).$$

In the inductive case ($n > b$), the algorithmic cost is due to the two recursive calls, the six 3dmms, performed on 3D processor grids, and the twelve all-to-alls, performed before and after each 3dmm.

To apply Lemma 4 to the 3dmms, let us now suppose that its hypotheses hold for the initial $m$ and $n$: since the current $m$ is within a factor of two of the initial $m$ at every recursive call and since $n > b$ in the inductive case, the hypotheses then hold for each 3dmm. Thus the 3dmms' overall runtime is

$$\gamma \cdot O\!\left(\frac{mn^2}{P}\right) + \beta \cdot O\!\left(\left(\frac{mn^2}{P}\right)^{2/3}\right) + \alpha \cdot O(\log P).$$

Moreover, the inequalities derived at the beginning of Lemma 4's proof also yield the following upper bound on the overall runtime of the all-to-alls:

$$\beta \cdot O\!\left(\left(\frac{mn}{P} + n + P^2\right)\log P\right) + \alpha \cdot O(\log P).$$

Overall, we have found that

$$T(m,n) = T\!\left(m, \lfloor n/2 \rfloor\right) + T\!\left(m - \lfloor n/2 \rfloor, \lceil n/2 \rceil\right) + \gamma \cdot O\!\left(\frac{mn^2}{P}\right) + \beta \cdot O\!\left(\left(\frac{mn^2}{P}\right)^{2/3} + \left(\frac{mn}{P} + n + P^2\right)\log P\right) + \alpha \cdot O(\log P).$$

By induction we may take $T$ to be nondecreasing in both $m$ and $n$. The former property justifies replacing $m - \lfloor n/2 \rfloor$ by $m$ in the second recursive call. Hence, we will take $m$ to be its initial value in the analysis of every recursive call.

Supposing $n = 2^k b$ for a nonnegative integer $k$, unrolling the recurrence shows that $T(m,n)$ is bounded by Equation 13.

Now observe that replacing $b$ by $2b$ reproduces the asymptotic bound of Equation 13, which thus holds for all $n$ since $T$ is nondecreasing in $n$.

In conclusion, 3d-caqr-eg's runtime satisfies Equation 13 when the hypotheses collected above hold.

[of Theorem 1] Consider picking $b$ and $b^*$ as in Equation 12. The constraints on $b$ are satisfiable if

$$P = O\!\left(m^{\frac{2\delta}{1+2\delta}} \cdot n^{\frac{2-2\delta}{1+2\delta}}\right), \tag{14}$$

and the constraints relating $b^*$ to $b$ are satisfiable if

$$P \cdot (\log P)^{\frac{2\epsilon}{1+2\delta}} = O\!\left(m^{\frac{2\delta}{1+2\delta}} \cdot n^{\frac{2-2\delta}{1+2\delta}}\right), \tag{15}$$

a stronger condition than Equation 14.

Substituting Equation 12 into Equation 13 yields the costs stated in Theorem 1, up to a residual bandwidth term $B$, where $B$ denotes the sum of three terms associated with the all-to-alls,

$$\frac{mn}{P}\log\!\left(\frac{nP}{m}\right)\log P, \qquad n\log\!\left(\frac{nP}{m}\right)\log P, \qquad P^2\left(\frac{nP}{m}\right)^{\delta}\log P,$$

plus a term associated with the 3dmms. We obtain the stated arithmetic and latency costs by taking $\epsilon = 1$. To suppress the bandwidth term $B$, it suffices to require that there exists $\delta' > 0$ such that

$$P/(\log P)^{\frac{\epsilon}{1-\delta-\delta'}} = \Omega(m/n), \qquad P \cdot (\log P)^{\frac{\epsilon}{\delta+\delta'}} = O\!\left(m \cdot n^{\frac{1-\delta-\delta'}{\delta+\delta'}}\right), \qquad P \cdot (\log P)^{\frac{\epsilon}{2+2\delta}} = O\!\left(m^{\frac{\delta}{1+\delta}} \cdot n^{\frac{1-\delta}{1+\delta}}\right). \tag{16}$$

The 3dmms' bandwidth cost cannot be reduced, but it is lower order for the relevant choices of $\delta$.

The hypotheses of Theorem 1, namely $m \geq n$ (tacitly) and Equation 2, imply Equations 14, 15, and 16.

This argument extends to a larger range of nonnegative $\delta$ and $\epsilon$. Assuming fixed $\epsilon$, for larger $\delta$ the 3dmm invocations dominate the bandwidth cost, whose bound then no longer improves with $\delta$: the tradeoff vanishes, at least asymptotically. For smaller $\delta$, the additive term in the arithmetic cost, due to the small (mostly) triangular matrix operations on tsqr's critical path, possibly dominates. (A sensible interpretation of the case $\delta = 0$ is $b = n$, in which case 1d-caqr-eg is invoked immediately.) Assuming fixed $\delta$, the tradeoffs due to varying $\epsilon$ are just as in the proof of Theorem 2, except now the logarithmic factor in the arithmetic cost is suppressed by increasing $\epsilon$.

## 8 Discussion

We have presented two new algorithms, 1d-caqr-eg and 3d-caqr-eg (Sections 6 and 7), for computing QR decompositions on distributed-memory parallel machines. Our analysis (e.g., Equations 11 and 13) demonstrates tradeoffs between arithmetic, bandwidth, and latency costs, governed by the choice of one (1d-caqr-eg) or two (3d-caqr-eg) block sizes. We navigated these tradeoffs in Theorems 1 and 2 by asymptotically minimizing arithmetic, as well as bandwidth in Theorem 2.

### 8.1 Comparison With Similar Algorithms

Here we compare the two new algorithms with four other instances of the rec-qr framework, deriving Tables 2 and 3. Let us review the other algorithms.