# A Novel Randomized XR-Based Preconditioned CholeskyQR Algorithm

CholeskyQR is a simple and fast QR decomposition via Cholesky decomposition, while it has been considered highly sensitive to the condition number. In this paper, we provide a randomized preconditioner framework for CholeskyQR algorithm. Under this framework, two methods (randomized LU-CholeskyQR and randomized QR-CholeskyQR) are proposed and discussed. We prove the proposed preconditioners can effectively reduce the condition number, which is also demonstrated by numerical tests. Abundant numerical tests indicate our methods are more stable and faster than all the existing algorithms and have good scalability.

## Authors

• 11 publications
• 2 publications
• 3 publications
06/11/2020

### Randomized Fast Subspace Descent Methods

Randomized Fast Subspace Descent (RFASD) Methods are developed and analy...
05/30/2021

### A Note On The Randomized Kaczmarz Method With A Partially Weighted Selection Step

In this note we reconsider two known algorithms which both usually conve...
07/06/2019

### A Randomized Algorithm for Edge-Colouring Graphs in O(m√(n)) Time

We present a simple randomized algorithm to edge-colour arbitrary simple...
08/25/2021

### On the approximation of a matrix

Let F^* be an approximation of a given (a × b) matrix F derived by metho...
06/22/2020

### Randomized Runge-Kutta method – stability and convergence under inexact information

We deal with optimal approximation of solutions of ODEs under local Lips...
10/06/2021

### Randomized Nyström Preconditioning

This paper introduces the Nyström PCG algorithm for solving a symmetric ...
09/17/2019

### A Note on a Simple and Practical Randomized Response Framework for Eliciting Sensitive Dichotomous Quantitative Information

Many issues of interest to social scientists and policymakers are of a s...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

QR decomposition has laid the foundation of matrix computation, including (weighted) least square problem, block orthogonalization, Krylov subspace methods, randomized singular value decomposition. Based on its essential role in numerical linear algebra, it has shown great power in other areas like MIMO detection, GPS positioning, and so on. The (reduced) QR decomposition reads as

 A=QR,A∈Rm×n,Q∈Rm×n,R∈Rn×n,  (m>n) (1)

where

is an orthogonal matrix, and

is an upper-triangular matrix.

Therefore, how to perform QR decomposition quickly and decently has been a great challenge for computer scientists and mathematicians over decades. On a modest scale, Householder reflection [householder1958unitary] and Givens rotation are preferred, as they are proven stable, see [golub2013matrix] and [higham2002accuracy, Chapter 18] for details. However, several drawbacks of both methods (mainly Householder QR) have been mentioned in the literature. First, these methods are BLAS1 and BLAS2 intensive, which will be inferior for the machine supporting fast BLAS3 operation. Second, such communication-frequent methods might be poorly performed for distributed architectures. Many studies focus on how to improve Householder QR via BLAS3 and distributed systems, such as [bischof1987wy, bischof1991parallel, granat2010novel, o1990parallel, schreiber1989storage]. In contrast, communication-avoiding algorithms [ballard2011minimizing, demmel2013communication] have attracted much attention for their asymptotically optimal communication cost. We will not expand this topic, and the interested readers might refer to the references.

In this paper, we focus on the QR decomposition of tall-and-skinny matrices, i.e., the case when in (1

). Such requests could come from randomized singular value decomposition (RSVD)

[halko2011finding], Krylov subspace methods (KSM) [hoemmen2010communication] and local optimal block preconditioned conjugate gradient method (LOBPCG) [duersch2018robust], as well as block HouseholderQR algorithms [schreiber1989storage]. Due to its importance, many efforts have been put in past decades. For direct methods, Tall and Skinny QR (TSQR) was proposed and well studied in the past ten years, see [anderson2011communication, constantine2011tall]. However, these improvements are far from satisfactory, mainly due to the complexity of implementation.

On contrast, the CholeskyQR method is famous as an efficient and elegant method for tall-and-skinny matrices, whose history can date back to a century ago. It uses the fact that factor of is just the Cholesky factor of . Different from Givens and Householder method, the CholeskyQR method requires less communication and computation cost, especially for tall-and-skinny matrices. However, the condition number of is , which makes the CholeskyQR method suffer from numerical instability, as commented in [fukaya2014choleskyqr2]. Hence it is necessary to precondition the CholeskyQR algorithm when is ill-conditioned. Regarding this, how to choose a suitable preconditioner becomes the main problem of this type of algorithms. Here we present a short review of existing work on preconditioning CholeskyQR, and Section 2 expands our discussions. The authors in [fukaya2014choleskyqr2] provide a first and clever way of preconditioning the CholeskyQR algorithm by just performing CholeskyQR again, named CholeskyQR2. In their consequent paper [yamamoto2015roundoff], rigorous round-off error has been established. However, when 111In this paper, u denotes the machine precision., the first Cholesky factorization might break down, leading to an unwanted cessation of CholeskyQR2. To remedy this issue, a shifted CholeskyQR [fukaya2020shifted] is adopted as the preconditioner. More recently, the full LU decomposition is considered as the preconditioner in [terao2020lu].

Unlike existing deterministic preconditioners, we explore the possibility of randomized ones. It is consensus that modern large-scale computation is eager for randomized algorithms, which have reached tremendous success in linear solvers, range finder, model reduction, etc. See [kannan2017randomized, martinsson2020randomized] for more details.

In this paper, we propose a novel framework for preconditioning the CholeskyQR algorithm. The proposed framework has the following advantages. First, the method requires less computation and communication costs, comparing to existing preconditioners like the CholeskyQR2 family. Second, the matrix concentration techniques allow us to predict whether the whole system is rank deficient with high probability. If the system is rank deficient, then only direct QR methods could work. Thus in application aspect, our proposed methods are more flexible. Finally, it shares the same merit as other CholeskyQR algorithms: easy to implement. We believe these advantages benefit from randomization techniques, which are powerful and unexplored.

The rest of this paper is organized as follows. In Section 2 we recall the definition and historical progress of the CholeskyQR algorithm. Some principles on how to precondition CholeskyQR algorithms are pointed out. Based on these, we propose randomized XR based CholeskyQR method, including randomized LU-CholeskyQR and randomized QR-CholeskyQR. The theoretical analysis is displayed in Section 3. To further corroborate our methods, numerous numerical tests are carried out in Section 4. Two applications are shown in Section 5, describing how our method can accelerate related algorithms like randomized SVD.

## 2 CholeskyQR and its preconditioners

This section introduces the family of CholeskyQR algorithms. Hereafter, we assume that is full rank and tall-and-skinny, which means . We also assume that is an integer.

### 2.1 Existing methods review

The original CholeskyQR is shown in Algorithm 1. It first computes the Grammian matrix of , denoted as , then selects as the (upper) Cholesky factor. Notice that if no numerical issue happens, the is exactly the factor of .222Since if is the (reduced) QR decomposition, then the orthogonality of yields that . It follows from the definition of Cholesky decomposition that . Finally, to recover , we need to solve a triangular system additionally.

Compared to the classical HouseholderQR and TSQR, the CholeskyQR algorithm has several highlight points.

• First, its computational cost is about half of TSQR and HouseholderQR.

• Second, for the parallel environment, CholeskyQR only needs simple reduction operations, while TSQR and other versions of parallelized QR need a huge amount of reduction operations.

• The CholeskyQR algorithm uses BLAS3 operations, which Housholder QR and TSQR can hardly apply.

As a result, CholeskyQR always runs faster than classical QR and TSQR. However, CholeskyQR is prohibitive in real scenarios, and there are several reasons displayed in the literature.

• First, the numerical result seems rather sensitive to the condition number of the matrix. It was shown in [yamamoto2015roundoff] that the round-off error of CholeskyQR is about .

• Second, the CholeskyQR method needs to compute the Cholesky factorization of , while the condition number of the latter is . When has a large condition number such that is larger than , then the CholeskyQR algorithm will break down at the second step.

CholeskyQR2, proposed in [fukaya2014choleskyqr2], aimed to improve numerical stability by repeating twice CholeskyQR, see Algorithm 2. In the paper and its sequential work [yamamoto2015roundoff, yamamoto2016roundoff], CholeskyQR2 has been established in both theory and application. An obvious advantage is that CholeskyQR2 is simple and easy to implement, with acceptable numerical error, see [yamamoto2016roundoff]: Suppose is the numerical result of CholeskyQR2, then we have

 ∥^ZT^Z−I∥F≤6(m+n+1)nu, (2)

under mild condition. However, when , the CholeskyQR2 also breaks down like the original one.

To this end, shifted CholeskyQR3 (Algorithm 3) was proposed to remedy the problem that the first Cholesky step might break down. In [fukaya2020shifted], the authors proved that the condition number after first Cholesky step can be reduced to

 cond(^Q)≤2√1+ε∥X∥22(cond(X)2)√3, (3)

under mild condition and suggested that is enough for performance. Often, the numerical behavior of shifted CholeskyQR is poor and additional CholeskyQR step needs to be performed twicely. That’s why the algorithm is called shifted CholeskyQR3.

We comment on these existing methods as a summary, which also inspire our proposed algorithms. Both CholeskyQR2 and shifted CholeskyQR3 can be regarded as a preconditioner of CholeskyQR, as pointed out in [fukaya2020shifted]. However, there are also some departures from the classical preconditioners. First, the preconditioning techniques are by their nature inexact, like diagonal scaling or incomplete LU decomposition. The might not be true for the CholeskyQR2 algorithm: the first Cholesky step will recover exact QR decomposition if no numerical round-off error appears. Second, the preconditioning seems costly, due to the numerical instability of the original CholeskyQR: The algorithm needs repeating twice or even more to get a satisfactory result. For these two reasons, those preconditioners are not so familiar with commonly discussed ones.

### 2.2 XR-Based preconditioned CholeskyQR

The purpose of this subsection is to develop new methods for preconditioning the CholeskyQR algorithm. Although CholeskyQR2 and shifted CholeskyQR3 are not so efficient, they still provide an alternative way of viewing the CholeskyQR algorithm. Usually, the preconditioned CholeskyQR algorithm can be regarded as a two-step framework.

For the CholeskyQR2 algorithm, the preconditioning step is just CholeskyQR itself, which is accurate but less efficient, as mentioned in the previous subsection. The core of such an efficient and stable algorithm is finding a suitable and practical rough QR decomposition. It should meet the following requirements. First, a fast rough QR decomposition is required in this framework. Second, the factor of rough QR decomposition must be close to the exact factor and must be upper-triangular. Equivalently, the factor of rough QR decomposition must enjoy a mild condition number. Notice that -less QR decomposition can be used in the preconditioning step since only the factor is required.

It is necessary to achieve good performance to make smaller. Ideally, the best rough QR decomposition must be the exact QR decomposition, like what we encounter in the CholeskyQR2 family, regardless of round-off error. However, such a step is unaffordable since it achieves our ultimate goal, and no stable QR decomposition can be applied for large-scale matrices.

Therefore, we attempt to perform XR decomposition, in a submatrix . The submatrix extracts the core information of the original matrix. Denote by the projection operator, such that . Then we assume that is the factor of , and . Therefore, XR decomposition fits into the framework in Algorithm 4. In this paper, XR might be LU or QR, to make sure that is preserved upper-triangular. Plus, is assumed to keep the correct size of . For this purpose, we propose the following two algorithms, attempting to use randomized strategy to reduce the condition number of the CholeksyQR algorithm.

More precisely, our proposed method includes two steps: We first choose a submatrix , then perform LU or QR decomposition to get , the upper-triangular factor of . If is invertible, then serves as the preconditioner: we calculate , and the final is our desired factor. These two algorithms are called randomized LU-CholeskyQR and randomized QR-CholeskyQR algorithms, respectively. Detailed algorithms are displayed in Algorithms 6 and 5.

We make some remarks on the choice of XR decomposition. We always prefer QR decomposition, in the sense that when , the rough QR decomposition becomes exact. However, even the exact LU decomposition cannot ensure that either or factor is better-conditioned than itself, and such attempt is considered in [terao2020lu].

The proposed framework has several advantages. First, the partial XR factorization is cheap and easy to implement, and the computational cost is always . Later we will show that it is sufficient to choose slightly larger than , yielding the computational cost of rough QR decomposition approximately . Second, in general, the partial factor is a suitable preconditioner, and the corresponding factor behaves well-conditioned, similarly to CholeskyQR2 and the shifted CholeskyQR3.

Compared to the CholeskyQR2 family, our proposed preconditioners are more similar to the classical ones in the following two senses:

• The preconditioning step is inexact, even if no round-off error is involved.

• The preconditioning step is much cheaper than the CholeskyQR step.

It is essential in our methods that sacrificing the accuracy in the preconditioning step gains much flexibility. That’s why we prefer randomized algorithms to deterministic ones.

However, it is still a CholeskyQR method, and some common drawbacks seem inevitable currently. The success of our algorithm requires the full rank assumption of the original matrix. Moreover, the algorithm might break down when is degenerate even if is of full rank due to bad sampling. Luckily, the most scenario we have considered will not reach the worst case, for example, randomized SVD, see Sections 5 and 4.

### 2.3 Strategies of choosing submatrix

This subsection deals with how to choose the submatrix . We describe several strategies, which are all prevailing and well-studied in the past years. This choice is also called random projection in the literature, see [gittens2013topics] and references therein for further study. We here summarize two common techniques. We introduce the unified expression , where will be explained case by case.

##### Row Extraction

Row extraction generates the submatrices from randomly choosing rows from original rows directly. Either equi-probability or a priori weighted sampling can be used. In this case, is a row-chosen matrix. The advantage of row extraction is easy to implement and requires less communication time and cost in parallel. The disadvantage is that it will provide poor preconditioning in some extreme cases, as pointed out in the later analysis. However, these extreme cases encounter rarely in most practical cases. Hence we adopt this strategy in Section 4 for simplicity.

##### Gaussian Ensemble

Gaussian Ensemble generates the submatrices from , where is a Gaussian matrices (that means each entry of is drawn from i.i.d.) This method enjoys the optimal theoretical result, but the multiplication operation is unaffordable. Hence we only treat it as a theoretical supplement for our proposed methods.

### 2.4 Parallelism

This subsection discusses how to perform the proposed algorithms in parallel, typically in an MPI environment. For convenience, we use terminology in collective communication from MPI functions directly, such as scatter, broadcast, reduce, gather. Since the Cholesky decomposition is always assumed to be performed in a single processor, it suffices to discuss computing , i.e. forming the Grammian matrix, and compute in parallel environment.

• Forming global Grammian matrices is divided into two steps: 1) Each processor computes its local Grammian matrix . 2) Use reduce to summing all local Grammian matrix up, in root processor.

• Solving the upper-triangular systems also includes two steps: 1) The root processor sends to each processor using broadcast. 2) Each processor solves the upper-triangular systems individually.

### 2.5 Complexity analysis

In this subsection, we study the computation and communication costs of our algorithms. We compare our algorithms with other existing methods, like HouseholderQR, CholeskyQR2 family, and TSQR. We suppose that , , and be the number of processes. In complexity analysis, the low order term will be omitted.

#### 2.5.1 Computational costs

The analysis consists of two parts, the analysis of CholeskyQR and randomized QR.

##### CholeskyQR

We first evaluate the complexity of computing Grammian matrix in parallel. Each processor first computes the local Grammian matrix, which needs operations per processor. Then we perform Cholesky decomposition in root node, which needs ops (only root node). Finally, solving a triangular system needs ops per processor. Hence .

##### Precondition Step [golub2013matrix]

For LU factorization, the cost is , and for QR factorization, we have .

##### Summary

The analysis result is summarized in the table below. We should keep in mind that , and terms involving will be dominant. Notice that this table only discusses the cost about computing factor, and we do not discuss forming factor by matrix multiplication.

#### 2.5.2 Communication costs

In this subsection, we compare the communication costs simply by two indices: the communication times and the total size of communicated data. We omit the comparison with TSQR since we only consider collaborative communication.

## 3 Theoretical analysis

In this section, we provide some theoretical analysis on randomized QR-CholeskyQR algorithm. The main tool used in this section is concentration inequalities and related topics, and the readers not familiar with those might refer to [vershynin2018high, wainwright2019high]

for a comprehensive study. For ease of understanding, we omit the round-off error in this section. Essentially, there seems no need to estimate the round-off error provided both small QR and triangular solver are stable. However, for completeness, we give an analysis in

Appendix A.

We first simplify the problem via a matrix analysis approach. We first recall some basic definitions and properties of orthogonal matrices and the condition number. Here, we denote by the

norm of a vector. Throughout this section,

is called an orthogonal matrix if , where

is the identity matrix. Notice that this implies

. The condition number is under norm, that is

 cond(M)=σmax(M)σmin(M), (4)

where and are the largest and smallest singular value of matrix . A consequent result is that if is an orthogonal matrix, then , since multiplying an orthogonal matrix does not change the singular value. For QR decomposition , we have .

Suppose is full rank, with QR decomposition , where is orthogonal and is upper-triangular. Suppose is the row-chosen matrix such that the submatrix can be expressed as , see Section 2.3. Our main result starts from the following proposition,

###### Proposition 1

Suppose that is the factor of , and . Moreover, suppose that is its QR decomposition, then we have

 cond(X)=cond(PQ), (5)

provided and are full rank.

###### Proof

Since , is full rank is equivalent to is full rank. Define is -factor of , by definition it is an orthogonal matrix. Suppose is the QR decomposition of . Since , we have

 ^Q1R1=A1=PA=PQR=Q2R2R. (6)

Observe that by assumption we have are all invertible. It follows from the uniqueness of QR decomposition that . Hence

 X=AR−11=QRR−11=QR−12. (7)

Since is orthogonal matrix, we have

 cond(X)=cond(R−12)=cond(R2)=cond(PQ). (8)

Let and

be the largest and smallest eigenvalue of a matrix

, notice that they coincide with and when is symmetric and semi-positive definite. Recall the matrix Chernoff inequality from [tropp2015introduction, Theorem 5.1.1]:

###### Proposition 2 (Matrix Chernoff Inequality)

Suppose are independent, symmetric, random matrices with dimension .555The original statement is for Hermitian matrices. Suppose that holds for all .

Define be their sum, and denote by . Then for any we have

 P[λmin(Y)≤(1−ε)μmin]≤n[e−ε(1−ε)1−ε]μmin/L (9)

and

 P[λmax(Y)≥(1+ε)μmax]≤n[eε(1+ε)1+ε]μmax/L. (10)

Based on this proposition, we show our first result, giving an estimate on the condition number of when each row is chosen with equal probability. Before we go to the details of the statement and proof of this theoretical result, let us consider some extreme cases: suppose , then the submatrix will be considerably ill-conditioned, since indeed we need to choose each of the first rows of at least one time, in our row choosing strategy. This only occurs in a probability about

 (m−nl−n)/(ml)=(l)(l−1)⋯(l−n)(m)(m−1)⋯(m−n)≈(lm)n≪1.

Therefore, the behavior of our method is strongly related to the structure of the original matrix . To characterize it more precisely, we introduce 666Here we adopt the MATLAB notation. For example, denotes the -th row of ., aiming to measure how the elements of a matrix concentrate.

###### Theorem 1 (Equal Probability Case)

Suppose with QR decomposition , the submatrix is generated by choosing row independently, while each row of submatrix is drawn from the original matrix with equal probability . Then for any , we have

 P[cond(X)≥√1+δ1−ε]≤n⎡⎢⎣[e−ε(1−ε)1−ε]l/(mθ(Q))+[eδ(1+δ)1+δ]l/(mθ(Q))⎤⎥⎦, (11)

where is defined in Algorithm 6.

###### Proof

We first construct a matrix-valued random variable

, to adapt the result in Proposition 2. Separate into row vectors , where is the th row vector. Define

 M=qTkqk∈Rn×n with probability 1m for each k=1,2,⋯,m.

Then we have

 EM=1m(qT1q1+⋯+qTmqm)=1mQTQ=1mIn, (12)

and it follows from that the following inequality holds,

 λmax(M)≤maxk∥qk∥2=θ(Q), (13)

from the definition of .

Set independently, and . , therefore . Here we point out that , which will be proved later. Hence it suffices to estimate . By applying Proposition 2 we obtain

 P[λmin(Y)≤(1−ε)lm]≤n[e−ε(1−ε)1−ε]lm1θ(Q) (14)

and

 P[λmax(Y)≥(1+ε)lm]≤n[eε(1+ε)1+ε]lm1θ(Q). (15)

Combining both inequalities, we conclude that

 P[cond(Y)≥1+δ1−ε]≤n⎡⎢⎣[e−ε(1−ε)1−ε]l/(mθ(Q))+[eδ(1+δ)1+δ]l/(mθ(Q))⎤⎥⎦.

Finally, we show . It follows from Proposition 1 that if the submatrix , where is the th row vector of , then , where Therefore

 QT1Q1=l∑s=1=qTisqis=Y,

hence .

An observation is that the concentration result works if and only if is small. It is believable that in general we have , hence if can ensure probability to obtain such a preconditioner. However, this might not hold true for some case. Rethinking that case we have mentioned before, suppose where is identity matrix. In this case, , and the concentration works if and only if , which will eventually impractical when get larger.

Now consider the case when each row is chosen with unequal probability. The following theorem shows under some strategy, the can be sharpened to , which is the optimal rate. Though impractical, this theoretical result might shed light on the mechanism of the row choosing strategy, which seems promising in the future.

###### Theorem 2 (Unequal Probability Case)

Suppose with QR decomposition . The submatrix is generated by choosing row independently, while each row of submatrix is drawn as follows: for each , we select with probability . Then for any , we have

 P{cond(X)≥√1+δ1−ε}≤n⎡⎢⎣[e−ε(1−ε)1−ε]l/n+[eδ(1+δ)1+δ]l/n⎤⎥⎦, (16)

where is defined in Algorithm 6.

###### Proof

We modify the construction of in Theorem 1. Notice that

 m∑k=1∥qk∥2=∥Q∥2=n.

Define

 M=1∥qk∥2qTkqk∈Rn×n with probability 1n∥qk∥2 for each k=1,2,⋯,m.

Then we have

 EM=1n(qT1q1+⋯+qTmqm)=1nQTQ=In, (17)

and the following inequality holds,

 L:=λmax(M)=maxk∥qk∥2=1, (18)

from the definition of . The rest of the proof is similar to that in Theorem 1.

Next, we show the result of the Gaussian ensemble. In actual computation, one might not hope to use Gaussian ensemble for its expansive arithmetic cost. However, what we believe is that this result reveals the numerical behavior of our proposed methods appropriately.

###### Theorem 3 (Gaussian Ensemble Case)

Given with QR decomposition , the submatrix is generated by left-multiplying Gaussian matrices

, where each component is drawn from standard normal distribution

iid. Suppose , then we have

 P[cond(X)≥3+√nl1−√nl]≤2e−(√l−√n)28, (19)

where is defined in Algorithm 6.

###### Proof

Since , it follows from [wainwright2019high, Theorem 6.1] that

 P[σmax(ΩQ)/√l>(1+ε)+√nl]≤e−lε22 (20)

and

 P[σmin(ΩQ)/√l<(1−ε)−√nl]≤e−lε22 (21)

Choose , yielding

 P[σmax(ΩQ)/√l>32+12√nl]≤e−(√l−√n)28 (22)

and

 P[σmin(ΩQ)/√l<12(1−√nl)]≤e−(√l−√n)28 (23)

It follows from Proposition 1 that , we conclude that

 P[cond(X)≥3+√nl1−√nl]≤2e−(√l−√n)28.

###### Remark 1

If , then simple estimation shows that

 P[cond(X)≥100]≤2e−(√l−√n)28.

##### Summarization

In summary, we provide the theoretical result, including both submatrix choosing strategies and Gaussian ensemble. These results shows that exponential decay might happen when the size of the submatrix becomes larger and larger. Hence, for a variety of matrices, our framework might produce a practical preconditioner. Unfortunately, the exponential decay of row choosing strategy does not hold for some matrices. We believe that this is due to large , i.e., the matrix entries concentrate in a lower-dimensional matrix.

## 4 Numerical results

In this section, we present plenty of numerical experiments to illustrate our results. We compare the performance among HouseholderQR, CholeskyQR, CholeskyQR2 (from [fukaya2014choleskyqr2]), shifted CholeskyQR3 (from [fukaya2020shifted]), and two newly proposed algorithms: randomized LU-CholeskyQR and randomized QR-CholeskyQR. Section 4.1 introduces the setup of our experiments, including computing configurations and the strategy of generating test matrices. The next several sections discuss the stability, runtime, strong and weak scalability in Sections 4.4, 4.3, and 4.2 respectively. In addition, Section 4.5 concerned about the relationship between the condition number and the number of sampling rows.

### 4.1 Setup

The experiments carried out in this section are all implemented in C++ language, with the help of OpenMPI version 3.1.4, and BLAS and LAPACK libraries implemented in Intel MKL version 2017.1 on High-performance Computing Platform of Peking University, see Table 3 for specifications. We use the double-precision, hence , and the choice of shift in shifted CholeskyQR3 is .888This choice is different as that in [fukaya2020shifted], since the original choice will lead to a break-down when the matrix has a condition number .

We generate test matrices in the following manner, forming , where both and

are orthogonal matrices from taking a full SVD of a random matrix, and

Here is a constant. Clearly we have and . We denote the number of processes involved by .

### 4.2 Accuracy test

First, we examine the numerical stability of randomized LU-CholeskyQR and randomized QR-CholeskyQR (shortened as rLU-CholeskyQR and rQR-CholeskyQR respectively), when fixing . Throughout this section, two measurements will be investigated: orthogonality and residual , where denotes Frobenius norm.

Figures 2 and 1 display the numerical stability, where we take , , and the condition number varies from to . We see in Fig. 1, the orthogonality error of all methods except CholeskyQR are acceptable, in the sense that those are almost independent to the condition number. The figure indicates that the orthogonality error of CholeskyQR is about , only valid when . The breakdown of CholeskyQR and CholeskyQR2 when has been already observed [fukaya2014choleskyqr2]. For the residual aspect, see Fig. 2, all methods listed here are stable, expect a slight difference is detected in our tests (HouseholderQR performs a little worse while CholeskyQR performs a little better).

Next, we show the dependence of numerical stability on the size of matrices. We take and fix and respectively, see Figs. 6, 5, 4, and 3. It follows from the numerical results that CholeskyQR is unstable due to the large condition number. The rLU-CholeskyQR algorithm performs a little worse, while the orthogonality error of other methods increases with and only mildly, and all stable. All tests support that the residual is negligible for most practical cases.

### 4.3 Runtime test

Now we turn to evaluate the runtime performance of our proposed methods compared to others. For single process, we take and test several methods. As shown in Fig. 7, CholeskyQR outperforms other methods. Among all stable methods (i.e., exclude CholeskyQR), our proposed methods enjoy the best runtime performance, and the cost time of randomized LU/QR-CholeskyQR is about the same. It is worth noting that our methods are even faster than CholeskyQR2, which is proven limited in the previous subsection. All results in Fig. 7 indicate that the computational cost is proportional to , the size of the matrix, which corroborates the theoretical counting.

To see whether similar results hold for multi-processes, we carry out the experiments in 32 processes. Here for convenience, we only choose methods that are easy to parallelize. We use the CholeskyQR2 algorithm instead of CholeskyQR for comparison, omit rLU-CholeskyQR since it has a similar performance as rQR-CholeskyQR. Fig. 8 shows the result when taking while varies from to . The runtime performance of those methods coincident with those in a single process. For completeness, we provide Fig. 9, illustrating the result when taking while varies from to . It indicates that CholeskyQR2 and sCholeskyQR3 take and more time than rQR-CholeskyQR, respectively. And CholeskyQR2 will collapse when the condition number is bigger than .

### 4.4 Scalability test

In this subsection, we examine the scalability of rQR-CholeskyQR, comparing to the CholeskyQR2 algorithm and the shifted CholeskyQR3 algorithm. We consider both strong scalability and weak scalability. The former focuses on how running time varies with the number of processors for a fixed total problem size. And the latter focuses on how the running time varies with the number of processors when fixing problem size tackled by each process.

As Table 4 shows, they all have good strong scalability. As the number of processes increases, the speedup ratio also increases, proportional to the number of processes. When using 512 processes, they speed up around 36 times than 8 processes, which indicates these methods can be used in the multi-processes configuration.

The weak scalability can be easily derived from above results. The running time is proportional to the scale of matrix, which says if we double the row of matrix , the running time will also double. But if we also double the number of processes at the same time, it will cost almost the same time as before or slightly more to compute. In short, fixing , and , the runtime would only increase mildly when grows. As Fig. 10 shows, when varies from to , the runtime of CholeskyQR2 and sCholeskyQR3 only become 2.26 and 2.25 times larger. Among these, rQR-CholeskyQR performs the best since it only consumes 2.21 times more time than before.

### 4.5 Dependence on numbers of sampling rows

Finally, we concentrate on the following problem: how many rows do we need to sample to ensure the condition numbers of the preconditioned matrices are small enough. Notice that in previous subsections, the sampling rate is selected as 2, a fixed number. In this section, we reveal the relation between the condition number (on expectation) and the sampling rate . It seems essential in the sense that it has a strong impact on the final performance on preconditioning, but we will show our method is not so sensitive to this parameter, as soon as the sampling rate departs away from .

As Fig. 11 shows, a slight enlargement in sampling rate might lead to exponential decay in the condition number of . When the sampling rate equals to 1, the condition number will possibly reach 3000 at most. However, if the sampling rate becomes a little larger, such as 1.2, the condition number of will drop down to 20, which is helpful in subsequent CholeskyQR. When the sampling rate continues to increase, the condition number after preconditioning will continue to drop, but not so significantly as before. For practical usage, we recommend choosing the sampling rate from 1.5 to 2.

## 5 Applications

In this section, two examples are provided to explain the motivation of studying QR decomposition of tall-and-skinny matrices, which also serve as a natural application for our proposed methods.

Nevertheless, this short section cannot encompass all cases using the proposed method. For example, the computation of tall and skinny QR in LOBPCG takes up lots of time. Also, the block QR decomposition relies on tall-and-skinny matrices heavily. Hence it is worth mentioning that our algorithms can accelerate these numerical methods, and many of them are within the scope of future work.

### 5.1 Randomized singular value decomposition

Randomized SVD, proposed by [granat2010novel], aims to calculate several largest singular vector approximately for large scale matrices. It has been widely used and analyzed, see [halko2011finding]. The basic version of RSVD is introduced as follows:

This algorithm might suffer from loosing accuracy. Instead of using vanilla version, a power method iteration is usually used to improve its accuracy, see Algorithm 8. For further discussion about how to improve RSVD, see [musco2015randomized].

Often, the orthogonal step uses QR decomposition, for example, MATLAB built-in qr function. Therefore, we can use our proposed method instead. To test our proposed QR decomposition, we select several sparse matrices and report the time cost of each iteration. Here all the experiments are performed in MATLAB R2020b, on a personal laptop. The matrices we test are all selected from SuiteSparse Matrix Collection, here we list the brief information of them, see Table 5.

Table 6 shows the cost time in each iteration in Algorithm 8. Notice that in each iteration, we simply replace MATLAB built-in QR (denoted as QR) and our proposed randomized QR-CholeskyQR (denoted as rQR), and the acceleration becomes more powerful, as the matrices becomes more sparse.

### 5.2 Least square

In this section, we extend our interest in the factor of randomized QR. Consider the Least Square problem of , which is usually solved by the normal equation , provided is of the full rank. However, the solver of the normal equation also suffers from ill-conditioned . Hence a suitable preconditioner is required. As we remarked before, the factor of the submatrix, denoted as , can be regarded as a suitable preconditioner since is small with high probability. The algorithm using the randomized factor can be summarized as follows.

## Appendix A Estimations on rQR-CholeskyQR under inexact procedure

In this section, we focus on the estimation in Algorithm 6 when the sub-routine is not exactly performed. The inexactness often comes from round-off error, which will be mainly discussed in this section. The main spirit of this section is that, under a stable choice of sub-routine, the result is also stable, in the sense that the condition number of computed (denoted by for clarity) enjoys a mild increase than . We first assume that our subroutine (QR and triangular solver) are forward stable under norm.

###### Definition 1

The QR decomposition is -stable for inexact QR

 A≈(Q+ΔQ)(R+ΔR),

we have

 ∥ΔRR−1∥2≤αcond(R).

###### Definition 2

is -stable with respect to triangular system if the following estimation holds

 ∥ΔY∥2/∥Y∥2≤βcond(A), (24)

where .

We now give a framework on inexact version of Algorithm 6.

###### Proposition 3

Suppose that for a submatrix , is an -stable factor of via QR decomposition. is -stable with respect to . We assume that , then we have

 ∥X−^X∥2≤2(βcond(A)+αcond(A1))∥X∥2. (25)

Moreover, denote by , we assume , then

 cond(^X)≤1+γ1−γcond(X). (26)

###### Proof

We denote by the exact result of triangular system. Then by the definition of -stability, we have

 ∥Y−^X∥2≤βcond(A)∥Y∥2 (27)

Next, we estimate the term , it follows that

 ∥X−Y∥2=∥AR−1−A^R−1∥2≤∥AR−1∥2∥I−^R1R−11∥≤αcond(R1)∥Y∥2 (28)

Since , then by triangle inequality we have

 ∥Y∥2≤11−αcond(A1)∥X∥2 (29)

Therefore

 ∥X−^X∥2≤∥X−Y∥2+∥Y−^X∥2≤βcond(A)∥Y∥2+αcond(A1)∥Y∥2≤βcond(A)+αcond(A1)1−αcond(A1)∥X∥2≤2(βcond(A)+αcond(A1))∥X∥2. (30)

Hence, by Weyl’s theorem [golub2013matrix, Chapter 8.6]

 cond(^X)≤σ1(X)+γ∥X∥2σn(X)−γ∥X∥2≤cond(X)1+γ1−γ. (31)

We close this section by discussing the aforementioned stability. Under mild assumptions, the backward stability for triangular systems (see [higham2002accuracy, Theorem 8.5]) can tell us that , and thus it is -stable where since .

The -stability is more involved. We first recall the backward stability of Householder and Givens QR, (see [higham2002accuracy, Chapter 18]). That means, there exists , such that , such that is the factor of . Here

• For Householder QR, , where is a small constant.

• For Givens QR, , where is a small constant.

By expanding the definition of backward stability, we have

 QR+ΔA=Q––^R, (32)

or equivalently,

 I+Q−1ΔAR−1=Q−1Q––^RR−1=(Q−1Q––)(I+ΔRR−1). (33)

By perturbation result in [chang1997perturbation, Theorem 3.1], we then have

 ∥ΔRR−1∥F∥R∥2≤c∥Q−1ΔAR−1∥F≤∥ΔA∥F∥R−1∥2. (34)

Combining backward stability we have

 ∥ΔRR−1∥2≤ε∥A∥Fcond(R),

and the common result on has been discussed before.