Bidiagonalization with Parallel Tiled Algorithms

11/18/2016
by Mathieu Faverge, et al.

We consider algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations. We use the framework of "algorithms by tiles". Within this framework, we study: (i) the tiled bidiagonalization algorithm BiDiag, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BiDiag, which is a tiled version of the algorithm which consists in first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R-factor. For both bidiagonalization algorithms BiDiag and R-BiDiag, we use four main types of reduction trees, namely FlatTS, FlatTT, Greedy, and a newly introduced auto-adaptive tree, Auto. We provide a study of critical path lengths for these tiled algorithms, which shows that (i) R-BiDiag has a shorter critical path length than BiDiag for tall and skinny matrices, and (ii) Greedy based schemes are much better than earlier proposed variants with unbounded resources. We provide experiments on a single multicore node, and on a few multicore nodes of a parallel distributed shared-memory system, to show the superiority of the new algorithms on a variety of matrix sizes, matrix shapes and core counts.


1 Introduction

This work is devoted to the design and comparison of tiled algorithms for the bidiagonalization of large matrices. Bidiagonalization is a widely used kernel that transforms a full matrix into bidiagonal form using orthogonal transformations. The importance of bidiagonalization stems from the fact that, in many algorithms, the bidiagonal form is a critical step to compute the singular value decomposition (SVD) of a matrix. The need to compute the SVD arises in many computational science and engineering areas. Based on the Eckart–Young theorem [EckartYoung36], we know that the singular vectors associated with the largest singular values represent the best way (in the 2-norm sense) to approximate the matrix. This approximation result leads to many applications, since it means that the SVD can be used to extract the “most important” information of a matrix. We can use the SVD for compressing data or making sense of data. In this era of Big Data, we are interested in very large matrices. To cite one application out of many, the SVD is needed for principal component analysis (PCA) in statistics, a widely used method in applied multivariate data analysis.
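As a small illustration of this approximation property (this example is ours, not the paper's; the matrix size and target rank are arbitrary), a truncated SVD in NumPy gives the best rank-k approximation in the 2-norm, with error equal to the (k+1)-st singular value:

import numpy as np

# Illustrative only: random matrix, arbitrary size and target rank.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 50))
k = 5

# Keep the k largest singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the 2-norm error of the best rank-k approximation
# equals the (k+1)-st singular value.
print(np.linalg.norm(A - A_k, 2), s[k])  # the two printed values coincide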

We consider algorithms for going from a “full” matrix to a condensed “band bidiagonal” form using orthogonal transformations. We use the framework of “algorithms by tiles”. Within this framework, we study: (i) the tiled bidiagonalization algorithm BiDiag, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BiDiag, which is a tiled version of the algorithm which consists in first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R-factor. For both bidiagonalization algorithms BiDiag and R-BiDiag, we use HQR-based reduction trees, where HQR stands for the Hierarchical QR factorization of a tiled matrix [j125]. Considering various reduction trees gives us the flexibility to adapt to matrix shape and machine architecture. In this work, we consider many types of reduction trees. In shared memory, they are named FlatTS, FlatTT, Greedy, and a newly introduced auto-adaptive tree, Auto. In distributed memory, they are somewhat more complex and take into account the topology of the machine. The main contributions are the following:

  • The design and comparison of the BiDiag and R-BiDiag tiled algorithms with many types of reduction trees. There is considerable novelty in this. Previous work [4967575, Ltaief:2013:HBR:2450153.2450154, 6267821, Haidar:2013:IPS:2503210.2503292] on tiled bidiagonalization has only considered one type of tree (the FlatTS tree) and no R-BiDiag. Previous work [Ltaief2012] considered Greedy trees only for the QR steps of BiDiag (so half of the steps), used FlatTS trees for the LQ steps (the other half), and did not consider R-BiDiag. This paper is the first to study R-BiDiag for tiled bidiagonalization, and the first to study Greedy trees for both the QR and LQ steps of the tiled bidiagonalization algorithm.

  • A detailed study of critical path lengths for FlatTS, FlatTT, Greedy with BiDiag and R-BiDiag (so six different algorithms in total), which shows that:

    • The newly introduced Greedy-based schemes (BiDiag and R-BiDiag) are much better than earlier proposed variants with unbounded resources and no communication: their critical paths are asymptotically shorter than those of FlatTS and FlatTT.

    • On the one hand, BiDiagGreedy has a shorter critical path length than R-BiDiagGreedy for square matrices; on the other hand, R-BiDiagGreedy has a shorter critical path length than BiDiagGreedy for “tall and skinny” matrices. For tall-and-skinny tiled matrices whose dimensions go to infinity with a fixed aspect ratio, the asymptotic ratio between the two critical path lengths can be made precise (see Section 4).

  • Implementation of our algorithms within the DPLASMA framework [dplasma-6008998], which runs on top of the PaRSEC runtime system [dague-engine-Bosilca201237], and which enables parallel distributed experiments on multicore nodes. All previous tiled bidiagonalization studies [4967575, Ltaief:2013:HBR:2450153.2450154, 6267821, Haidar:2013:IPS:2503210.2503292, Ltaief2012] were limited to shared-memory implementations.

  • A practical auto-adaptive tree (Auto) that self-tunes for increased performance. This tree is especially useful in many practical situations. For example, it is appropriate (1) when the critical path length is not a major consideration due to a limited number of resources, or (2) when intra-node “communication” is expensive, in which case TS kernels are much faster than TT kernels, or (3) both.

  • Experiments on a single multicore node, and on a few multicore nodes of a parallel distributed shared-memory system, show the superiority of the new algorithms on a variety of matrix sizes, matrix shapes and core counts. The new Auto algorithm outperforms its competitors in almost every test case, hence standing as the best algorithmic choice for most users.

The rest of the paper is organized as follows. Section 2 provides a detailed overview of related work. Section 3 describes the BiDiag and R-BiDiag algorithms with the FlatTS, FlatTT and Greedy trees. Section 4 is devoted to the analysis of the critical paths of all variants. Section 5 outlines our implementation, and introduces the new Auto reduction tree. Experimental results are reported in Section 6. Finally, conclusion and hints for future work are given in Section 7.

2 Related Work

This section provides an overview of the various approaches to compute the singular values of a matrix, and positions our new algorithm with respect to existing numerical software kernels.

Computing the SVD. The SVD of a matrix is a fundamental matrix factorization and is a basic tool used in many applications. Computing the SVD of large matrices in an efficient and scalable way is an important problem that has gathered much attention. The matrices considered here are rectangular, of size m-by-n. Without loss of generality, we consider m >= n. We call GE2VAL the problem of computing (only) the singular values of a matrix, and GESVD the problem of computing the singular values and the associated singular vectors.

From full to bidiagonal form. There are a few algorithms to compute the singular value decomposition. A large class of these algorithms consists in first reducing the matrix to bidiagonal form with orthogonal transformations (GE2BD step), then processing the bidiagonal matrix to obtain the sought singular values (BD2VAL step). These two steps (GE2BD and BD2VAL) are very different in nature. GE2BD can be done in a known number of operations and has no numerical difficulties. On the other hand, BD2VAL requires the convergence of an iterative process and is prone to numerical difficulties. This paper mostly focuses on GE2BD: reduction from full to bidiagonal form. Clearly, GE2BD+BD2VAL solves GE2VAL: computing (only) the singular values of a matrix. If the singular vectors are desired (GESVD), one can also compute them by accumulating the “backward” transformations; here, this would consist of a VAL2BD step followed by a BD2GE step.

In 1965, Golub and Kahan [doi:10.1137/0702016] provide a singular value solver based on an initial reduction to bidiagonal form. In [doi:10.1137/0702016, Th. 1], the GE2BD step is done using a QR step on the first column, then an LQ step on the first row, then a QR step on the second column, etc. The steps are done one column (or row) at a time using Householder transformations. This algorithm is implemented as a Level-2 BLAS algorithm in LAPACK as xGEBD2. For an m-by-n matrix, the cost of this algorithm is (approximately) 4mn^2 - (4/3)n^3 flops.
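To make the alternating column/row elimination concrete, here is a minimal NumPy sketch of a Householder-based bidiagonalization in the spirit of xGEBD2 (our own illustrative code, not the LAPACK routine; the helper names are ours):

import numpy as np

def householder(x):
    # Return (v, beta) such that (I - beta v v^T) x has zeros below its first entry.
    v = np.array(x, dtype=float)
    alpha = -np.sign(v[0] if v[0] != 0 else 1.0) * np.linalg.norm(v)
    v[0] -= alpha
    nrm2 = v @ v
    beta = 0.0 if nrm2 == 0.0 else 2.0 / nrm2
    return v, beta

def ge2bd(A):
    # Reduce A (m >= n) to upper bidiagonal form: a QR step on column k,
    # then an LQ step on row k, for k = 0, 1, ...
    B = np.array(A, dtype=float)
    m, n = B.shape
    for k in range(n):
        v, beta = householder(B[k:, k])                 # QR step: zero B[k+1:, k]
        B[k:, k:] -= beta * np.outer(v, v @ B[k:, k:])
        if k < n - 2:
            v, beta = householder(B[k, k+1:])           # LQ step: zero B[k, k+2:]
            B[k:, k+1:] -= beta * np.outer(B[k:, k+1:] @ v, v)
    return B

On a random tall matrix, ge2bd leaves only the main diagonal and the first superdiagonal (numerically) nonzero, and the singular values of the result match those of the input.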

Level 3 BLAS for GE2BD. In 1989, Dongarra, Sorensen and Hammarling [DONGARRA1989215] explain how to incorporate Level-3 BLAS in LAPACK xGEBD2. The idea is to compute a few Householder transformations in advance, and then to accumulate and apply them in blocked fashion using the WY transform [doi:10.1137/0908009]. This algorithm is available in LAPACK (using the compact WY transform [doi:10.1137/0910005]) as xGEBRD. Großer and Lang [GroBer:1999:EPR:330506.330509, Table 1] explain that this algorithm performs (approximately) 50% of the flops in Level-2 BLAS (computing and accumulating Householder vectors) and 50% in Level-3 BLAS (applying Householder vectors). In 1995, Choi, Dongarra and Walker [Choi1995] present the ScaLAPACK version, PxGEBRD, of the LAPACK xGEBRD algorithm of [DONGARRA1989215].

Multi-step approach. Further improvements for GE2BD (detailed hereafter) are possible. These improvements rely on combining multiple steps. In general, these multi-step methods perform much better for GE2VAL (when only singular values are sought) than for GESVD (when singular values and singular vectors are sought). When singular values and singular vectors are sought, all the “multi” steps have to be performed in “reverse” on the singular vectors, adding a non-negligible overhead to the singular vector computation.

Preprocessing the bidiagonalization with a QR factorization (preQR step). In 1982, Chan [Chan:1982:IAC:355984.355990] explains that, for tall-and-skinny matrices, in order to perform fewer flops, one can pre-process the bidiagonalization step (GE2BD) with a QR factorization. In other words, Chan proposes to do preQR(m,n)+GE2BD(n,n) instead of GE2BD(m,n). A curiosity of this algorithm is that it introduces nonzeros where zeros were previously introduced; yet, there is a gain in terms of flops. Chan proves that the crossover point at which preQR(m,n)+GE2BD(n,n) performs fewer flops than GE2BD(m,n) is when m is sufficiently larger than n, and that, asymptotically, preQR(m,n)+GE2BD(n,n) performs half as many flops as GE2BD(m,n) for a fixed n and m going to infinity. If the singular vectors are sought, preQR has more overhead: (1) the crossover point is moved to more tall-and-skinny matrices, and there is less gain; and (2) there are some complications as far as storage goes.
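A hedged sketch of Chan's idea (again ours, not the paper's code): factor the tall matrix first, then bidiagonalize only the small n-by-n triangular factor, for instance with the ge2bd sketch above.

import numpy as np

def r_bidiagonalize(A, bidiagonalize):
    # preQR + GE2BD: QR-factor the m-by-n matrix, then bidiagonalize the
    # n-by-n factor R with the routine passed as `bidiagonalize`.
    m, n = A.shape
    assert m >= n
    Q, R = np.linalg.qr(A)      # preQR(m, n)
    return Q, bidiagonalize(R)  # GE2BD(n, n)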

The preQR step was already somewhat known to the community; an earlier reference is, for example, the 1974 book of Lawson and Hanson [doi:10.1137/1.9781611971217]. For this reason, the fact of pre-processing the bidiagonalization by a QR factorization (preQR) is referred to as the LHC (Lawson, Hanson and Chan) algorithm in [Trefethen:1997:NLA].

Three-step approach: partialGE2BD+preQR+GE2BD. In 1997, Trefethen and Bau [Trefethen:1997:NLA] present a “three-step” approach. The idea is to first start a bidiagonalization; then, whenever the aspect ratio of the trailing matrix passes a certain mark, switch to a QR factorization; and then finish up with GE2BD.

Two-step approach: GE2BND+BND2BD. In 1999, Großer and Lang [GroBer:1999:EPR:330506.330509] studied a two-step approach for GE2BD: (1) go from full to band (GE2BND), (2) then go from band to bidiagonal (BND2BD). In this scenario, GE2BND contains most of the flops and is performed using Level-3 BLAS kernels; BND2BD does not use Level-3 BLAS, but it executes far fewer flops and operates on a smaller data footprint that might fit better in cache. There is a trade-off in the choice of the bandwidth. If the bandwidth is too small, then the first step (GE2BND) will have the same issues as GE2BD. If the bandwidth is too large, then the second step (BND2BD) will have many flops and dominate the run time.

As mentioned earlier, when the singular vectors are sought (GESVD), the overhead of the method in terms of flops is quite large. In 2013, Haidar, Kurzak, and Luszczek [Haidar:2013:IPS:2503210.2503292] reported that, despite the extra flops, a two-step approach leads to speedups even when singular vectors are sought (GESVD).

Tiled Algorithms for the SVD. In the context of massive parallelism, and of reducing data movement, many dense linear algebra algorithms have been moved to so-called “tiled algorithms”. In the tiled algorithm framework, algorithms operate on tiles of the matrix, and tasks are scheduled by a runtime system. In the context of the SVD, tiled algorithms naturally lead to band bidiagonal form. In 2010, Ltaief, Kurzak and Dongarra [4967575] present a tiled algorithm for GE2BND (to go from full to band bidiagonal form). In 2013 (technical report in 2011), Ltaief, Luszczek and Dongarra [Ltaief:2013:HBR:2450153.2450154] add the second step (BND2BD) and present a tiled algorithm for GE2VAL using GE2BND+BND2BD+BD2VAL. In 2012, Ltaief, Luszczek, and Dongarra [Ltaief2012] improve the algorithm for tall and skinny matrices by using “any” tree instead of flat trees in the QR steps. In 2012, Haidar, Ltaief, Luszczek and Dongarra [6267821] improve the BND2BD step of [Ltaief:2013:HBR:2450153.2450154]. Finally, in 2013, Haidar, Kurzak, and Luszczek [Haidar:2013:IPS:2503210.2503292] consider the problem of computing singular vectors (GESVD) by performing

GE2BND+BND2BD+BD2VAL+VAL2BD+BD2BND+BND2GE.


They show that the two-step approach (from full to band, then band to bidiagonal) can be successfully used not only for computing singular values, but also for computing singular vectors.

BND2BD step. The algorithm in LAPACK for BND2BD is xGBBRD. In 1996, Lang [LANG19961] improved the sequential version of the algorithm and developed a parallel distributed algorithm. Recently, PLASMA released an efficient multi-threaded implementation [6267821, Ltaief:2013:HBR:2450153.2450154]. We also note that Rajamanickam [piroband] recently worked on this step.

BD2VAL step. Much research has been done, and continues to be done, on this kernel, and much software exists. In LAPACK, to compute the singular values and optionally the singular vectors of a bidiagonal matrix, the routine xBDSQR uses the Golub-Kahan QR algorithm [doi:10.1137/0702016]; the routine xBDSDC uses the divide-and-conquer algorithm [doi:10.1137/S0895479892242232]; and the routine xBDSVX uses a bisection and inverse iteration algorithm. Recent research has sought to apply the MRRR (Multiple Relatively Robust Representations) method [Willems2006] to this problem.

BND2BD+BD2VAL steps in this paper. This paper focuses neither on BND2BD nor on BD2VAL. As far as we are concerned, we can use any of the methods mentioned above; the faster these two steps are, the better for us. In the experimental section, for BND2BD, we use the PLASMA multi-threaded implementation [6267821, Ltaief:2013:HBR:2450153.2450154] and, for BD2VAL, we use LAPACK xBDSQR.

Intel MKL. We note that, since 2014 and Intel MKL 11.2, the Intel MKL library is much faster at computing GESVD [intel]. Much of the literature on tiled bidiagonalization algorithms predates 2014, and so it compared against a much slower version of MKL than is available now. The speedups of the tiled algorithms over MKL reported in this manuscript are therefore much less impressive than previously published ones.

We also note that, while there was a great performance improvement for computing GESVD [intel] from Intel MKL 11.1 to Intel MKL 11.2, there was no performance improvement for GEBRD. The reason is that the interface for GEBRD prevents a two-step approach, while the interface of GESVD allows for one. Similarly, our two-step algorithm does not have the same interface as LAPACK GEBRD. Consequently, it would not be fair to compare our algorithm with LAPACK GEBRD or Intel MKL GEBRD, because our algorithm does not have a native LAPACK GEBRD interface. (I.e., it will not produce the Householder vectors column by column as required by the LAPACK interface.) For this reason, we will always compare with GESVD, and not GEBRD.

Approaches to compute the SVD without going through bidiagonal form. We note that going to bidiagonal form is not a necessary step to compute the singular value decomposition of a matrix. There are approaches to compute the SVD of a matrix which avoid the reduction to bidiagonal form. In the “direct method” category, we can quote Jacobi SVD [doi:10.1137/050639193, doi:10.1137/05063920X] and QDWHSVD [doi:10.1137/120876605]; one could also consider computing the singular value decomposition directly from the band bidiagonal form, although not much work has been done in this direction so far. We can also refer to all the “iterative methods” for computing the singular values. The goal of these iterative methods is, in general, to compute a few singular triplets. As far as software packages go, we can quote SVDPACK [SVDPACK], PROPACK [PROPACK], and PRIMME SVD [PRIMME_SVDS]. As far as methods go, we can quote [doi:10.1137/140979381, doi:10.1137/S1064827500372973, doi:10.1137/S0895479802404192].

Connection with Symmetric Eigenvalue Problem.

Many ideas (but not all) can be applied, and have been applied, to symmetric tridiagonalization. This paper focuses only on the bidiagonalization algorithm, but we believe that some of the techniques we present (though not all) could be used to design an efficient symmetric tridiagonalization.

3 Tiled Bidiagonalization Algorithms

In this section, we provide background on tiled algorithms for QR factorization, and then explain how to use them for the BiDiag and R-BiDiag algorithms.

3.1 Tiled algorithms

Tiled algorithms are expressed in terms of tile operations rather than elementary operations. Each tile is of size b × b, where b is a parameter tuned to squeeze the most out of arithmetic units and the memory hierarchy. Typically, b is of the order of a few hundred on state-of-the-art machines [sc09-agullo].

Consider a rectangular tiled matrix of size p × q. The actual size of the matrix is thus pb × qb, where b is the tile size. We number rows from 1 to p, columns from 1 to q, and factorization steps from 1 to min(p, q). Algorithm 1 outlines a tiled QR algorithm, where loop indices represent tiles:

for k = 1 to min(p, q) do
        Step k, denoted as QR(k):
        for i = k+1 to p do
               elim(i, piv(i, k), k)

Algorithm 1: QR algorithm for a tiled matrix of size p × q.

In Algorithm 1, k is the step, and also the panel index, and elim(i, piv(i, k), k) is an orthogonal transformation that combines rows i and piv(i, k) to zero out the tile in position (i, k). We explain below how such orthogonal transformations can be implemented.
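The loop nest of Algorithm 1 can be sketched as follows (illustrative Python, not DPLASMA code; elim and piv are placeholders for the elimination kernel and for the pivot choice made by the reduction tree):

def tiled_qr(p, q, piv, elim):
    # Enumerate the elimination tasks of a tiled QR factorization.
    # At step k, every tile (i, k) below the diagonal is zeroed out by
    # combining row i with the pivot row piv(i, k) chosen by the tree;
    # the actual execution order is driven by the task dependencies.
    for k in range(1, min(p, q) + 1):
        for i in range(k + 1, p + 1):
            elim(i, piv(i, k), k)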

3.2 Kernels

To implement a given orthogonal transformation elim(i, piv(i, k), k), one can use six different kernels, whose costs are given in Table 1. In this table, the unit of time is the time to perform b^3/3 floating-point operations, where b is the blocksize.

Operation                            | Panel (Name, Cost) | Update (Name, Cost)
Factor square into triangle          | GEQRT, 4           | UNMQR, 6
Zero square with triangle on top     | TSQRT, 6           | TSMQR, 12
Zero triangle with triangle on top   | TTQRT, 2           | TTMQR, 6
Table 1: Kernels for tiled QR. The unit of time is b^3/3, where b is the blocksize.

There are two main possibilities to implement an orthogonal transformation elim(i, piv(i, k), k): the first version eliminates tile (i, k) with the TS (Triangle on top of square) kernels, as shown in Algorithm 2:

GEQRT(piv(i, k), k)
TSQRT(i, piv(i, k), k)
for j = k+1 to q do
        UNMQR(piv(i, k), k, j)
        TSMQR(i, piv(i, k), k, j)

Algorithm 2: Elimination elim(i, piv(i, k), k) via TS (Triangle on top of square) kernels.

Here the panel tile (piv(i, k), k) is factored into a triangle (with GEQRT). The transformation is applied to subsequent tiles (piv(i, k), j), j > k, in row piv(i, k) (with UNMQR). Tile (i, k) is zeroed out (with TSQRT), and subsequent tiles (i, j), j > k, in row i are updated (with TSMQR). The flop count is 4 + 6 + (q - k)(6 + 12) = 18(q - k) + 10 (expressed in the same time unit as in Table 1). Dependencies are the following: GEQRT must precede UNMQR and TSQRT, and both UNMQR (on column j) and TSQRT must precede TSMQR (on column j).

UNMQR and TSQRT can be executed in parallel, as well as operations on different columns j. With an unbounded number of processors, the parallel time is thus 4 + 6 + 12 = 22 time-units.

GEQRT(piv(i, k), k)
GEQRT(i, k)
for j = k+1 to q do
        UNMQR(piv(i, k), k, j)
        UNMQR(i, k, j)
TTQRT(i, piv(i, k), k)
for j = k+1 to q do
        TTMQR(i, piv(i, k), k, j)

Algorithm 3: Elimination elim(i, piv(i, k), k) via TT (Triangle on top of triangle) kernels.

The second approach to implement the orthogonal transformation is with the TT (Triangle on top of triangle) kernels, as shown in Algorithm 3. Here both tiles (piv(i, k), k) and (i, k) are factored into a triangle (with GEQRT). The corresponding transformations are applied to subsequent tiles (piv(i, k), j) and (i, j), j > k, in both rows piv(i, k) and i (with UNMQR). Tile (i, k) is zeroed out (with TTQRT), and subsequent tiles (piv(i, k), j), j > k, in row piv(i, k) are updated (with TTMQR). The flop count is 2 × 4 + (q - k)(2 × 6) + 2 + (q - k) × 6 = 18(q - k) + 10, just as before. Dependencies are the following: each GEQRT must precede the UNMQR kernels of its row, both GEQRT kernels must precede TTQRT, and both UNMQR (on column j) and TTQRT must precede TTMQR (on column j).

Now the factor operations in rows piv(i, k) and i can be executed in parallel. Moreover, the updates can be run in parallel with the factorization. Thus, with an unbounded number of processors, the parallel time is 4 + 6 + 6 = 16 time-units.

In Algorithms 2 and 3, it is understood that if a tile is already in triangle form, then the associated GEQRT and UNMQR kernels do not need to be applied.
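To make the kernel weights of Table 1 concrete, here is a small back-of-the-envelope calculation based on our reading of the dependencies described above (the per-elimination values are derived from the table, not quoted from the original text):

# Kernel weights from Table 1, in units of b^3/3 flops.
GEQRT, UNMQR, TSQRT, TSMQR, TTQRT, TTMQR = 4, 6, 6, 12, 2, 6

# TS elimination: GEQRT, then UNMQR and TSQRT in parallel, then TSMQR
# (updates on different columns run concurrently).
ts_path = GEQRT + max(UNMQR, TSQRT) + TSMQR   # 22 time-units

# TT elimination: the two GEQRT run in parallel, then UNMQR and TTQRT,
# then TTMQR, which waits for both the update and the factorization.
tt_path = GEQRT + max(UNMQR, TTQRT) + TTMQR   # 16 time-units

print(ts_path, tt_path)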

3.3 QR factorization

Consider a rectangular tiled matrix of size p × q, with p ≥ q. There are many algorithms to compute the QR factorization of such a matrix, and we refer to [c177] for a survey. We use the three following variants:

  • FlatTS: This simple algorithm is the reference algorithm used in [Buttari2008, tileplasma]. At step k, the pivot row is always row k, and we perform the eliminations elim(i, k, k) for all rows i below the panel, one after the other. In this algorithm, TS kernels are used.

  • FlatTT: This algorithm is the counterpart of the FlatTS algorithm with TT kernels. It uses exactly the same elimination operations, but with different kernels.

  • Greedy: This algorithm is asymptotically optimal, and turns out to be the most efficient on a variety of platforms [henc, j125]. An optimal algorithm is Grasap [henc, j125]; the difference between Grasap and Greedy is a small constant term, and we work with Greedy in this paper. The main idea is to eliminate many tiles in parallel at each step, using a reduction tree (see [c177] for a detailed description, and the sketch below).
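The sketch below (hypothetical helper code, not the actual implementation) contrasts the two families of trees: a flat tree always eliminates against the panel row, while a greedy (binomial-like) tree pairs the surviving rows round by round.

def flat_tree(rows, k):
    # Flat tree: every row below the panel is eliminated against row k.
    return [(i, k) for i in rows if i != k]

def greedy_tree(rows):
    # Greedy/binomial tree: pair surviving rows round by round; the
    # lower-indexed row of each pair survives (so the panel row, having
    # the smallest index, ends up holding the final result).
    schedule, alive = [], sorted(rows)
    while len(alive) > 1:
        survivors = []
        for a, b in zip(alive[0::2], alive[1::2]):
            schedule.append((b, a))     # eliminate row b against pivot a
            survivors.append(a)
        if len(alive) % 2 == 1:
            survivors.append(alive[-1])
        alive = survivors
    return schedule

# Step k = 1 of a matrix with 8 tile rows: both trees perform 7 eliminations,
# but the greedy tree organises them in ceil(log2 8) = 3 parallel rounds.
print(flat_tree(range(1, 9), 1))
print(greedy_tree(range(1, 9)))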

Figure 1: Snapshots of the bidiagonalization algorithm BiDiag.

3.4 Bidiagonalization

Consider a rectangular tiled matrix of size p × q, with p ≥ q. The bidiagonalization algorithm BiDiag proceeds as the QR factorization, but interleaves one step of LQ factorization between two steps of QR factorization (see Figure 1). More precisely, BiDiag executes the sequence

QR(1); LQ(1); QR(2); LQ(2); QR(3); LQ(3); …

where QR(k) is the k-th step of the QR algorithm (see Algorithm 1), and LQ(k) is the k-th step of the LQ algorithm. The latter is a right factorization step that executes the column-oriented eliminations shown in Algorithm 4.

Step k, denoted as LQ(k):
for j = k+2 to q do
        elim(j, piv(j, k), k)

Algorithm 4: Step LQ(k) for a tiled matrix of size p × q.

In Algorithm 4, elim(j, piv(j, k), k) is an orthogonal transformation that combines columns j and piv(j, k) to zero out the tile in position (k, j). It is the exact counterpart of the row-oriented eliminations and can be implemented with the very same kernels, either TS or TT.
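Putting the pieces together, the structure of BiDiag can be sketched as follows (illustrative Python; qr_step and lq_step stand for the row- and column-oriented eliminations of Algorithms 1 and 4, and the bound k ≤ q − 2 for the LQ steps is our reading of the band bidiagonal structure):

def bidiag(p, q, qr_step, lq_step):
    # Band bidiagonalization of a p x q tile matrix (p >= q): alternate one
    # QR step (zero tiles (i, k), i > k) with one LQ step (zero tiles (k, j),
    # j > k + 1), until only the band bidiagonal part remains.
    for k in range(1, q + 1):
        qr_step(k)
        if k <= q - 2:
            lq_step(k)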

3.5 R-Bidiagonalization

When m is much larger than n, R-bidiagonalization should be preferred, if minimizing the operation count is the objective. This R-BiDiag algorithm does a QR factorization of the matrix, followed by a bidiagonalization of the upper square factor. In other words, given a rectangular tiled matrix of size p × q, with p ≥ q, R-BiDiag first executes the QR factorization steps QR(1), …, QR(q), and then applies BiDiag to the resulting q × q upper block of tiles.

Let m and n be the actual size of the matrix (element-wise). The number of arithmetic operations is 4mn^2 - (4/3)n^3 for BiDiag and 2mn^2 + 2n^3 for R-BiDiag [Golub, p.284]. These numbers show that R-BiDiag is less costly than BiDiag whenever 2mn^2 - (10/3)n^3 > 0, or equivalently, whenever m > 5n/3. One major contribution of this paper is to provide a comparison of BiDiag and R-BiDiag in terms of parallel execution time, instead of operation count.
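A quick numerical check of this crossover, using the operation counts quoted above (the helper names are ours):

def flops_bidiag(m, n):
    return 4 * m * n**2 - 4 * n**3 / 3

def flops_r_bidiag(m, n):
    return 2 * m * n**2 + 2 * n**3

n = 1000
for m in (1000, 1500, 1700, 2000, 10000):
    cheaper = "BiDiag" if flops_bidiag(m, n) < flops_r_bidiag(m, n) else "R-BiDiag"
    print(m, cheaper)   # the switch happens at m = 5n/3, i.e. around m = 1667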

3.6 Comments

Case m < n.

The paper focuses on the case m ≥ n, but everything holds in the m < n case: one can simply transpose the initial matrix and swap LQ steps for QR steps and vice versa.

Real/Complex arithmetic.

There is no restriction of our work on whether the matrices are real or complex. Actually, our codes have been generated to allow for both real and complex arithmetic, in single and double precision.

Memory usage of our algorithm.

In the standard LAPACK case, GEBRD takes a matrix as input and overwrites it with the Householder reflectors and the bidiagonal matrix. The algorithm is therefore done in place and no extra storage is needed. In the case of R-BiDiag, it is important to note that, as for the LHC algorithm [Chan:1982:IAC:355984.355990], extra storage is needed if one wants to recompute the singular vectors. (If one wants only the singular values, all that matters is the bidiagonal form, and any Householder vectors generated can safely be discarded.)

Any tree is possible.

We note that, whether we choose to perform BiDiag or R-BiDiag, any tree is possible at each step of the algorithm. One restriction of our study hereafter is that we fix the trees in the QR step and the BiDiag step of R-BiDiag to be the same. Clearly there is no such requirement, and one could consider for R-BiDiag (for example) a combination like (QRFlatTS + BiDiagGreedy). We do not study such combinations; this also simplifies notations. Since we consider the same tree for the QR step and the BiDiag step of R-BiDiag, when we write R-BiDiagFlatTT (for example), we mean (QRFlatTT + BiDiagFlatTT).

BiDiagGreedy and BiDiagBinomial are the same.

Another name for BiDiagGreedy would be BiDiagBinomial: since the BiDiag algorithm repeats the same tree (in the Greedy case, a binomial tree) over and over again, BiDiagGreedy and BiDiagBinomial are the exact same algorithm. (Whereas QRGreedy is not the same as QRBinomial.)

4 Critical paths

In this section, we compute exact or estimated values of the critical paths of the BiDiag and R-BiDiag algorithms with the FlatTS, FlatTT, and Greedy trees.

4.1 Bidiagonalization

Given a rectangular tiled matrix of size p × q, with p ≥ q, the bidiagonalization algorithm BiDiag executes the sequence QR(1); LQ(1); QR(2); LQ(2); … introduced in Section 3.4.

To compute the critical path, we first observe that there is no overlap between two consecutive steps QR(k) and LQ(k). To see why, consider w.l.o.g. the first two steps QR(1) and LQ(1) in Figure 1. Tile (1, 2) is used at the end of the QR(1) step to update the last row of the trailing matrix (whichever it is). In passing, note that all columns in this last row are updated in parallel, because we assume unlimited resources when computing critical paths. But this tile is the first tile modified by the LQ(1) step, hence there is no possible overlap. Similarly, there is no overlap between two consecutive steps LQ(k) and QR(k+1). Consider steps LQ(1) and QR(2) in Figure 1. Tile (2, 2) is used at the end of the LQ(1) step to update the last column of the trailing matrix (whichever it is), and it is the first tile modified by the QR(2) step.

As a consequence, the critical path length of BiDiag is the sum of the critical path lengths of each QR and LQ step. Hence an optimal BiDiag algorithm is one that is optimal at each QR and LQ step independently. We know that an optimal QR step is obtained with a binomial tree, and likewise for an optimal LQ step. Therefore the scheme alternating binomial QR and LQ trees, that is, BiDiagGreedy, is optimal. This is stated in the following theorem.

Theorem 1.

BiDiagGreedy is optimal, in the sense that it has the shortest critical path length over all BiDiag algorithms, i.e., over all tiled algorithms that alternate a QR step and an LQ step (with any QR trees and any LQ trees) to obtain a band bidiagonal matrix.

We now compute critical path lengths. From [henc, c177, j125] we have the following values for the critical path of one QR step applied to a tiled matrix of size p × q:




Figure 2: Critical paths: ratio of the different methods over BiDiag with FlatTS, on three configurations: (a) square matrices, (b) and (c) tall and skinny matrices. Solid lines are for BiDiag and dashed lines for R-BiDiag.

The critical path of one LQ step applied to a tiled matrix of size p × q is that of one QR step applied to a tiled matrix of size q × p. Finally, in the BiDiag algorithm, the size of the matrix for step QR(k) is (p - k + 1) × (q - k + 1), and the size of the matrix for step LQ(k) is (p - k + 1) × (q - k). Altogether, we derive the following values:



We devote the whole next subsection to the study of the critical path of BiDiagGreedy, because this case is more involved.

4.2 Study of the critical path of BiDiagGreedy

4.2.1 A formula with a summation

By following the reasoning of the previous subsection, we get


(1)

The latter formula is exact and is easy enough to compute with a small computer code. However, it does not provide us with much insight. To get a better grasp on it, we expand Equation (1) further.

4.2.2 An exact formula when and are powers of two

First we note that:

(2)

One way to understand the previous relation is to remember that “the” antiderivative of ln x is x ln x − x; the previous relation is the discrete analogue of this result. The general case (when the dimension is not a power of 2) of Equation (2) is

(3)
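Purely as an illustration of the kind of closed form involved here (this is a standard identity and may differ from the paper's exact Equations (2) and (3)), one has sum_{i=1}^{N} ceil(log2 i) = N*ceil(log2 N) - 2^ceil(log2 N) + 1, which reduces to N log2 N - N + 1 when N is a power of two. A quick numerical check:

import math

def lhs(N):
    # Direct summation of the ceiled logarithms.
    return sum(math.ceil(math.log2(i)) for i in range(1, N + 1))

def rhs(N):
    # Candidate closed form.
    c = math.ceil(math.log2(N))
    return N * c - 2**c + 1

assert all(lhs(N) == rhs(N) for N in range(1, 200))
print(lhs(16), rhs(16))   # 49 = 16*4 - 16 + 1, matching the power-of-two form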

From Equations (1) and (2), we derive that if is a power of two:

From Equations (1), (2) and (3), we derive that, if both and are powers of two, with :

Theorem 2.

Let p and q be powers of 2,

4.2.3 Using Stirling’s formula

We can also consider Equation (1) again and obtain simpler bounds by rounding the ceiling function in the logarithms down and up, and then using asymptotic analysis. We consider the bounds:

Applied to Equation (1), this leads to

We note that the difference between the lower bound (left side) and the upper bound (right side) is .

We now consider “large” p and q, and we use Stirling’s formula as follows:
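For reference, the standard form of Stirling's formula is recalled below (the exact variant and the subsequent expansion terms used in the original derivation are not reproduced here):

\ln(n!) = n \ln n - n + \tfrac{1}{2}\ln(2\pi n) + O\!\left(\tfrac{1}{n}\right)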

And with one more term we get:

In base 2,

We obtain that

We stop the expansion a little earlier.

We now study the term: , this gives:

We now study the term: . We can set with . We get

We have that

We are therefore drawn to study . We see that this is a decreasing function of . We are interested in the limit value when goes to . For large values of , we have

So that

So all in all, we have that

And so

Thus

So we get

(4)

We note that the difference between the lower bound (left side) and the upper bound (right side) is .

We now get the following theorem

Theorem 3.

This theorem matches the power-of-2 case. We also see that we perform QR steps, each done with a binomial tree; this explains the first term. We perform LQ steps, each done with a binomial tree; this explains the second term. The coefficient 6 in front is due to the weights used for the kernels. This theorem holds for any p and q.

In the case, we get that

In the case, we get that