1 Introduction
QR decomposition lays the foundation of matrix computation, including (weighted) least squares problems, block orthogonalization, Krylov subspace methods, and randomized singular value decomposition. Beyond its essential role in numerical linear algebra, it has also shown great power in other areas such as MIMO detection and GPS positioning. The (reduced) QR decomposition reads as

(1) $A = QR$,

where $Q \in \mathbb{R}^{m\times n}$ is an orthogonal matrix (i.e., $Q^{\mathsf T}Q = I_n$) and $R \in \mathbb{R}^{n\times n}$ is an upper-triangular matrix. How to perform QR decomposition quickly and stably has therefore been a great challenge for computer scientists and mathematicians over the decades. At a modest scale, Householder reflection [householder1958unitary] and Givens rotation are preferred, as they are provably stable; see [golub2013matrix] and [higham2002accuracy, Chapter 18] for details. However, several drawbacks of both methods (mainly Householder QR) have been noted in the literature. First, these methods are BLAS-1 and BLAS-2 intensive, which is inferior on machines supporting fast BLAS-3 operations. Second, such communication-frequent methods may perform poorly on distributed architectures. Many studies focus on improving Householder QR via BLAS-3 operations and distributed systems, such as [bischof1987wy, bischof1991parallel, granat2010novel, o1990parallel, schreiber1989storage]. In contrast, communication-avoiding algorithms [ballard2011minimizing, demmel2013communication] have attracted much attention for their asymptotically optimal communication cost. We will not expand on this topic; interested readers may refer to the references.
In this paper, we focus on the QR decomposition of tall-and-skinny matrices, i.e., the case $m \gg n$ in (1). Such requests arise from the randomized singular value decomposition (RSVD) [halko2011finding], Krylov subspace methods (KSM) [hoemmen2010communication], the locally optimal block preconditioned conjugate gradient method (LOBPCG) [duersch2018robust], as well as block Householder-QR algorithms [schreiber1989storage]. Due to its importance, much effort has been devoted to this problem in past decades. Among direct methods, Tall and Skinny QR (TSQR) was proposed and well studied in the past ten years; see [anderson2011communication, constantine2011tall]. However, these improvements remain unsatisfactory, mainly due to the complexity of implementation. In contrast, the CholeskyQR method is famous as an efficient and elegant method for tall-and-skinny matrices, whose history dates back a century. It uses the fact that the $R$ factor of $A$ is just the (upper) Cholesky factor of $A^{\mathsf T}A$. Different from the Givens and Householder methods, the CholeskyQR method requires less communication and computation, especially for tall-and-skinny matrices. However, the condition number of $A^{\mathsf T}A$ is $\kappa_2(A)^2$, which makes the CholeskyQR method suffer from numerical instability, as commented in [fukaya2014choleskyqr2]. Hence it is necessary to precondition the CholeskyQR algorithm when $A$ is ill-conditioned. Regarding this, how to choose a suitable preconditioner becomes the main question for this type of algorithm. Here we present a short review of existing work on preconditioning CholeskyQR; Section 2 expands our discussion. The authors of [fukaya2014choleskyqr2] provide a first and clever way of preconditioning the CholeskyQR algorithm by simply performing CholeskyQR again, named CholeskyQR2. In their subsequent paper [yamamoto2015roundoff], rigorous roundoff error bounds were established. However, when $\kappa_2(A) \gtrsim u^{-1/2}$ (throughout this paper, $u$ denotes the machine precision), the first Cholesky factorization may break down, leading to an unwanted cessation of CholeskyQR2. To remedy this issue, a shifted CholeskyQR [fukaya2020shifted] is adopted as the preconditioner. More recently, a full LU decomposition was considered as the preconditioner in [terao2020lu].
Unlike existing deterministic preconditioners, we explore the possibility of randomized ones. There is a consensus that modern large-scale computation is eager for randomized algorithms, which have achieved tremendous success in linear solvers, range finders, model reduction, etc. See [kannan2017randomized, martinsson2020randomized] for more details.
In this paper, we propose a novel framework for preconditioning the CholeskyQR algorithm. The proposed framework has the following advantages. First, the method requires less computation and communication cost compared to existing preconditioners such as the CholeskyQR2 family. Second, matrix concentration techniques allow us to predict, with high probability, whether the whole system is rank deficient; if it is, only direct QR methods can work. Thus, from the application viewpoint, our proposed methods are more flexible. Finally, the framework shares the same merit as other CholeskyQR algorithms: it is easy to implement. We believe these advantages stem from randomization techniques, which are powerful and still underexplored.
The rest of this paper is organized as follows. In Section 2 we recall the definition and historical development of the CholeskyQR algorithm, and point out some principles for preconditioning it. Based on these, we propose the randomized XR-based CholeskyQR method, including randomized LU-CholeskyQR and randomized QR-CholeskyQR. The theoretical analysis is presented in Section 3. To further corroborate our methods, numerous numerical tests are carried out in Section 4. Two applications are shown in Section 5, describing how our method can accelerate related algorithms such as randomized SVD.
2 CholeskyQR and its preconditioners
This section introduces the family of CholeskyQR algorithms. Hereafter, we assume that $A \in \mathbb{R}^{m\times n}$ is of full rank and tall-and-skinny, which means $m \gg n$. We also assume that $m/p$ is an integer, where $p$ denotes the number of processes.
2.1 Existing methods review
The original CholeskyQR is shown in Algorithm 1. It first computes the Grammian matrix of $A$, denoted $B = A^{\mathsf T}A$, then selects $R$ as the (upper) Cholesky factor of $B$. Notice that if no numerical issue occurs, this $R$ is exactly the $R$ factor of $A$: if $A = QR$ is the (reduced) QR decomposition, then the orthogonality of $Q$ yields $A^{\mathsf T}A = R^{\mathsf T}Q^{\mathsf T}QR = R^{\mathsf T}R$, and the claim follows from the uniqueness of the Cholesky decomposition. Finally, to recover $Q$, we additionally solve the triangular system $Q = AR^{-1}$.
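To make the three steps concrete, here is a minimal NumPy sketch of the procedure; the function name and interface are illustrative, not the paper's.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def cholesky_qr(A):
    """One CholeskyQR pass: A = Q R with R upper-triangular."""
    B = A.T @ A                       # Grammian matrix (a BLAS-3 syrk in practice)
    R = cholesky(B, lower=False)      # upper Cholesky factor: B = R^T R
    Q = solve_triangular(R, A.T, trans='T', lower=False).T   # Q = A R^{-1}
    return Q, R
```

In an optimized implementation, the Grammian is formed by a symmetric rank-$k$ update and the last step by a triangular solve, both BLAS-3 kernels.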
Compared to the classical Householder QR and TSQR, the CholeskyQR algorithm has several highlights.

- First, its computational cost is about half that of TSQR and Householder QR.
- Second, in a parallel environment, CholeskyQR needs only simple reduction operations, while TSQR and other parallel QR variants need a large number of reduction operations.
- Third, the CholeskyQR algorithm uses BLAS-3 operations, which Householder QR and TSQR can hardly exploit.
As a result, CholeskyQR always runs faster than classical QR and TSQR. Nevertheless, CholeskyQR is rarely usable in real scenarios, for several reasons documented in the literature.

- First, the numerical result is rather sensitive to the condition number of the matrix: it was shown in [yamamoto2015roundoff] that the orthogonality error of CholeskyQR is about $\kappa_2(A)^2\,u$.
- Second, the CholeskyQR method computes the Cholesky factorization of $A^{\mathsf T}A$, whose condition number is $\kappa_2(A)^2$. When $A$ has a condition number so large that $\kappa_2(A)^2$ exceeds $u^{-1}$, the CholeskyQR algorithm breaks down at the second step.
CholeskyQR2, proposed in [fukaya2014choleskyqr2], aims to improve numerical stability by repeating CholeskyQR twice; see Algorithm 2. In that paper and its sequels [yamamoto2015roundoff, yamamoto2016roundoff], CholeskyQR2 was established in both theory and application. An obvious advantage is that CholeskyQR2 is simple and easy to implement, with acceptable numerical error; see [yamamoto2016roundoff]: suppose $\hat Q$ is the numerical result of CholeskyQR2, then we have

(2) $\|\hat Q^{\mathsf T}\hat Q - I\|_F = O(u)$

under mild conditions. However, when $\kappa_2(A) \gtrsim u^{-1/2}$, CholeskyQR2 also breaks down, like the original.
To this end, shifted CholeskyQR3 (Algorithm 3) was proposed to remedy the possible breakdown of the first Cholesky step. In [fukaya2020shifted], the authors proved that the condition number after the first shifted Cholesky step is reduced to roughly

(3) $\kappa_2(\hat Q_1) \lesssim \sqrt{c\,u}\,\kappa_2(A)$

under mild conditions, and suggested that a shift of size $s = 11(mn + n(n+1))\,u\,\|A\|_2^2$ is enough in practice. However, the numerical behavior of shifted CholeskyQR alone is poor, and the additional CholeskyQR step needs to be performed twice; that is why the algorithm is called shifted CholeskyQR3.
We close this review with some comments, which also inspire our proposed algorithms. Both CholeskyQR2 and shifted CholeskyQR3 can be regarded as preconditioners for CholeskyQR, as pointed out in [fukaya2020shifted]. However, there are also some departures from classical preconditioners. First, classical preconditioning techniques are by nature inexact, like diagonal scaling or incomplete LU decomposition. This is not true for the CholeskyQR2 algorithm: the first Cholesky step recovers the exact QR decomposition if no roundoff error appears. Second, the preconditioning is costly, due to the numerical instability of the original CholeskyQR: the algorithm needs to be repeated twice or even more to get a satisfactory result. For these two reasons, these preconditioners differ from commonly discussed ones.
2.2 XR-based preconditioned CholeskyQR
The purpose of this subsection is to develop new methods for preconditioning the CholeskyQR algorithm. Although CholeskyQR2 and shifted CholeskyQR3 are not so efficient, they provide an alternative way of viewing the CholeskyQR algorithm: the preconditioned CholeskyQR algorithm can be regarded as a two-step framework.
For the CholeskyQR2 algorithm, the preconditioning step is just CholeskyQR itself, which is accurate but less efficient, as mentioned in the previous subsection. The core of an efficient and stable algorithm is finding a suitable and practical rough QR decomposition. It should meet the following requirements. First, the rough QR decomposition must be fast. Second, its $R$ factor must be close to the exact $R$ factor and must be upper-triangular; equivalently, the preconditioned matrix $AR^{-1}$ must enjoy a mild condition number. Notice that a Q-less QR decomposition can be used in the preconditioning step, since only the $R$ factor is required.
To achieve good performance, it is necessary to make $\kappa_2(AR^{-1})$ small. Ideally, the best rough QR decomposition would be the exact QR decomposition, as in the CholeskyQR2 family (roundoff error aside). However, such a step is unaffordable: it already achieves our ultimate goal, and no stable QR decomposition can be applied to large-scale matrices at this cost.
Therefore, we attempt to perform an XR decomposition of a submatrix $A_s$, which extracts the core information of the original matrix. Denote by $\Omega$ the projection operator, such that $A_s = \Omega A$. We assume that $R_s$ is the $R$ factor of $A_s$, and precondition with $\hat A = AR_s^{-1}$. XR decomposition then fits into the framework of Algorithm 4. In this paper, XR may be LU or QR, so that $R_s$ is guaranteed to be upper-triangular. In addition, $\Omega$ is assumed to keep $A_s$ of the correct size, i.e., $A_s \in \mathbb{R}^{s\times n}$ with $s \ge n$. For this purpose, we propose the following two algorithms, which use randomized strategies to reduce the condition number within the CholeskyQR algorithm.
More precisely, our proposed method consists of two steps: we first choose a submatrix $A_s$, then perform LU or QR decomposition to get $R_s$, the upper-triangular factor of $A_s$. If $R_s$ is invertible, it serves as the preconditioner: we run CholeskyQR on $\hat A = AR_s^{-1}$ to obtain $Q$ and $R_1$, and the final $R = R_1 R_s$ is the desired factor. These two algorithms are called the randomized LU-CholeskyQR and randomized QR-CholeskyQR algorithms, respectively; detailed listings are given in Algorithms 5 and 6.
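The following sketch assembles the whole pipeline with equal-probability row extraction (Section 2.3); all names and defaults are illustrative, and the LU branch simply keeps the U factor.

```python
import numpy as np
from scipy.linalg import cholesky, lu, solve_triangular

def randomized_xr_cholesky_qr(A, c=2.0, xr='qr', rng=None):
    """Sketch of Algorithms 5/6: sample rows, take the sample's R (or U) factor
    as preconditioner, then run a single CholeskyQR pass."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    s = int(c * n)                                 # sampling rate c > 1
    rows = rng.choice(m, size=s, replace=True)     # equal-probability row extraction
    As = A[rows, :]                                # s-by-n submatrix A_s
    if xr == 'qr':
        Rp = np.linalg.qr(As, mode='r')            # Q-less QR: only R is needed
    else:
        _, Rp = lu(As, permute_l=True)             # U factor of the sample's LU
    # May fail here if the sample is rank deficient (see the discussion below).
    Ahat = solve_triangular(Rp, A.T, trans='T', lower=False).T   # A Rp^{-1}
    R1 = cholesky(Ahat.T @ Ahat, lower=False)      # CholeskyQR on the preconditioned matrix
    Q = solve_triangular(R1, Ahat.T, trans='T', lower=False).T
    return Q, R1 @ Rp                              # A = Q (R1 Rp)
```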
We make some remarks on the choice of XR decomposition. We always prefer QR decomposition, in the sense that when $A_s = A$, the rough QR decomposition becomes exact. In contrast, even an exact LU decomposition cannot ensure that either the $L$ or the $U$ factor is better conditioned than $A$ itself; such an attempt is considered in [terao2020lu].
The proposed framework has several advantages. First, the partial XR factorization is cheap and easy to implement, with computational cost $O(sn^2)$. Later we will show that it suffices to choose $s$ slightly larger than $n$, yielding a computational cost for the rough QR decomposition of approximately $O(n^3)$. Second, in general the partial factor $R_s$ is a suitable preconditioner, and the preconditioned matrix is well-conditioned, similar to CholeskyQR2 and shifted CholeskyQR3.
Compared to the CholeskyQR2 family, our proposed preconditioners are closer to classical ones in the following two senses:

- The preconditioning step is inexact, even if no roundoff error is involved.
- The preconditioning step is much cheaper than the CholeskyQR step.
It is essential to our methods that sacrificing accuracy in the preconditioning step buys much flexibility. That is why we prefer randomized algorithms over deterministic ones.
However, ours is still a CholeskyQR method, and some common drawbacks currently seem inevitable. The success of our algorithm requires the full-rank assumption on the original matrix. Moreover, the algorithm may break down when $A_s$ is degenerate due to bad sampling, even if $A$ is of full rank. Fortunately, most scenarios we consider, for example randomized SVD, do not reach this worst case; see Sections 4 and 5.
2.3 Strategies for choosing the submatrix
This subsection deals with how to choose the submatrix $A_s$. We describe several strategies, all prevailing and well studied in past years. This choice is also called random projection in the literature; see [gittens2013topics] and the references therein for further study. We summarize two common techniques here, using the unified expression $A_s = \Omega A$, where $\Omega \in \mathbb{R}^{s\times m}$ will be explained case by case.
Row Extraction
Row extraction generates the submatrix by randomly choosing $s$ rows from the original $m$ rows directly; either equal-probability or a priori weighted sampling can be used. In this case, $\Omega$ is a row-selection matrix. The advantage of row extraction is that it is easy to implement and requires little communication in parallel. The disadvantage is that it provides poor preconditioning in some extreme cases, as pointed out in the later analysis. However, these extreme cases are rarely encountered in practice, so we adopt this strategy in Section 4 for simplicity.
Gaussian Ensemble
The Gaussian ensemble generates the submatrix as $A_s = \Omega A$, where $\Omega \in \mathbb{R}^{s\times m}$ is a Gaussian matrix (each entry of $\Omega$ is drawn i.i.d. from the standard normal distribution). This choice enjoys the best theoretical guarantees, but the multiplication $\Omega A$ is unaffordable in our setting. Hence we treat it only as a theoretical supplement to our proposed methods.
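Both choices of $\Omega$ reduce to a few lines; a hedged sketch:

```python
import numpy as np

def sample_rows(A, s, rng):
    """Row extraction: Omega consists of s rows of the identity (cheap)."""
    return A[rng.choice(A.shape[0], size=s, replace=True), :]

def sample_gaussian(A, s, rng):
    """Gaussian ensemble: Omega has i.i.d. N(0,1) entries (O(smn) cost)."""
    return rng.standard_normal((s, A.shape[0])) @ A
```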
2.4 Parallelism
This subsection discusses how to perform the proposed algorithms in parallel, typically in an MPI environment. For convenience, we borrow the collective-communication terminology of the MPI functions directly: scatter, broadcast, reduce, gather. Since the Cholesky decomposition is always performed on a single process, it suffices to discuss forming the Grammian matrix $A^{\mathsf T}A$ and computing $Q = AR^{-1}$ in a parallel environment (see the sketch after this list).

- Forming the global Grammian matrix consists of two steps: 1) each process computes its local Grammian matrix $A_i^{\mathsf T}A_i$; 2) a reduce sums all local Grammian matrices on the root process.
- Solving the upper-triangular systems also consists of two steps: 1) the root process sends $R$ to every process using broadcast; 2) each process solves its upper-triangular system individually.
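A hedged mpi4py sketch of this communication pattern (the experiments in Section 4 use a C++/MPI implementation; this Python rendering and its data layout are our illustrative assumptions):

```python
import numpy as np
from mpi4py import MPI
from scipy.linalg import cholesky, solve_triangular

def parallel_cholesky_qr(A_local, comm=MPI.COMM_WORLD):
    """Each rank holds a row block A_local of A; returns its block of Q, and R."""
    n = A_local.shape[1]
    B_local = A_local.T @ A_local                  # local Grammian
    B = np.empty((n, n)) if comm.Get_rank() == 0 else None
    comm.Reduce(B_local, B, op=MPI.SUM, root=0)    # step 1: reduce to root
    if comm.Get_rank() == 0:
        R = np.ascontiguousarray(cholesky(B, lower=False))  # Cholesky on root only
    else:
        R = np.empty((n, n))
    comm.Bcast(R, root=0)                          # step 2: broadcast R
    Q_local = solve_triangular(R, A_local.T, trans='T', lower=False).T
    return Q_local, R
```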
2.5 Complexity analysis
In this subsection, we study the computation and communication costs of our algorithms, comparing them with existing methods such as Householder QR, the CholeskyQR2 family, and TSQR. We suppose that $A\in\mathbb{R}^{m\times n}$ with $m \gg n$, and let $p$ be the number of processes. Lower-order terms are omitted in the complexity analysis.
2.5.1 Computational costs
The analysis consists of two parts: the CholeskyQR step and the randomized preconditioning step.
CholeskyQR
We first evaluate the complexity of computing the Grammian matrix in parallel. Each process computes its local Grammian matrix, which needs about $mn^2/p$ operations per process (exploiting symmetry). Then we perform the Cholesky decomposition on the root node, which needs $n^3/3$ flops (root node only). Finally, solving the triangular systems needs about $mn^2/p$ flops per process. Hence the cost along the critical path is about $2mn^2/p + n^3/3$.
Preconditioning step [golub2013matrix]
For LU factorization of the $s\times n$ sample, the cost is about $sn^2 - n^3/3$ flops; for QR factorization, it is about $2sn^2 - 2n^3/3$ flops.
Summary
The analysis result is summarized in the table below. We should keep in mind that , and terms involving will be dominant. Notice that this table only discusses the cost about computing factor, and we do not discuss forming factor by matrix multiplication.
Total  Along Critical Path  

HouseholderQR  –  
CholeskyQR  
CholeskyQR2  
sCholeskyQR3  
rLUCholeskyQR  
rQRCholeskyQR 
2.5.2 Communication costs
In this subsection, we compare communication costs using two simple indices: the number of communication rounds and the total size of communicated data. Each CholeskyQR pass costs one reduce and one broadcast of an $n\times n$ matrix; the randomized preconditioning adds one gather of the $s$ sampled rows and one broadcast of $R_s$. We omit the comparison with TSQR since we only consider collective communication.

Method | # Times | # Data
CholeskyQR | 2 | $2n^2$
CholeskyQR2 | 4 | $4n^2$
sCholeskyQR3 | 6 | $6n^2$
rLU-CholeskyQR | 4 | $3n^2 + sn$
rQR-CholeskyQR | 4 | $3n^2 + sn$
3 Theoretical analysis
In this section, we provide theoretical analysis of the randomized QR-CholeskyQR algorithm. The main tools are concentration inequalities and related topics; readers unfamiliar with them may refer to [vershynin2018high, wainwright2019high] for a comprehensive study. For ease of understanding, we neglect roundoff error in this section; essentially, there is no need to estimate roundoff error provided both the small QR and the triangular solver are stable. For completeness, however, we give such an analysis in Appendix A.

We first simplify the problem via a matrix-analysis approach, recalling some basic definitions and properties of orthogonal matrices and the condition number. We denote by $\|x\|_2$ the Euclidean norm of a vector $x$. Throughout this section, $Q \in \mathbb{R}^{m\times n}$ is called an orthogonal matrix if $Q^{\mathsf T}Q = I_n$, where $I_n$ is the identity matrix; notice that this implies $\|Qx\|_2 = \|x\|_2$. The condition number is taken in the $2$-norm, that is,

(4) $\kappa_2(A) = \dfrac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$,

where $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ are the largest and smallest singular values of the matrix $A$. A consequence is that if $Q$ is an orthogonal matrix, then $\kappa_2(QA) = \kappa_2(A)$, since multiplying by an orthogonal matrix does not change the singular values. For a QR decomposition $A = QR$, we therefore have $\kappa_2(A) = \kappa_2(R)$.
Suppose $A \in \mathbb{R}^{m\times n}$ is of full rank, with QR decomposition $A = QR$, where $Q$ is orthogonal and $R$ is upper-triangular. Suppose $\Omega$ is the row-selection matrix such that the submatrix can be expressed as $A_s = \Omega A$; see Section 2.3. Our main result starts from the following proposition.
Proposition 1

Suppose that $R_s$ is the $R$ factor of $A_s = \Omega A$, and $\hat A = AR_s^{-1}$. Moreover, suppose that $\Omega Q = Q_1 R_1$ is a QR decomposition. Then we have

(5) $\kappa_2(\hat A) = \kappa_2(\Omega Q)$,

provided $A$ and $A_s$ are of full rank.
Proof

Since $A_s = \Omega A = \Omega Q R$ and $R$ is invertible, $A_s$ being of full rank is equivalent to $\Omega Q$ being of full rank. Let $Q_1 R_1$ be the QR decomposition of $\Omega Q$; then

(6) $A_s = \Omega Q R = Q_1 (R_1 R)$.

Observe that by assumption $R$, $R_1$, and $R_s$ are all invertible. It follows from the uniqueness of the QR decomposition that $R_s = R_1 R$. Hence

(7) $\hat A = AR_s^{-1} = QR(R_1 R)^{-1} = QR_1^{-1}$.

Since $Q$ is an orthogonal matrix, we have

(8) $\kappa_2(\hat A) = \kappa_2(R_1^{-1}) = \kappa_2(R_1) = \kappa_2(\Omega Q)$.
Let $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ be the largest and smallest eigenvalues of a matrix; notice that they coincide with $\sigma_{\max}$ and $\sigma_{\min}$ when the matrix is symmetric positive semidefinite. Recall the matrix Chernoff inequality from [tropp2015introduction, Theorem 5.1.1]:

Proposition 2 (Matrix Chernoff Inequality)

Suppose $X_1,\dots,X_s$ are independent, symmetric, positive-semidefinite random matrices of dimension $n$ (the original statement is for Hermitian matrices). Suppose that $\lambda_{\max}(X_k) \le L$ holds for all $k$. Define $Y = \sum_{k=1}^{s} X_k$, and denote $\mu_{\min} = \lambda_{\min}(\mathbb{E}Y)$ and $\mu_{\max} = \lambda_{\max}(\mathbb{E}Y)$. Then for any $\theta \in [0,1)$ we have

(9) $\mathbb{P}\left[\lambda_{\min}(Y) \le (1-\theta)\mu_{\min}\right] \le n \left[\dfrac{e^{-\theta}}{(1-\theta)^{1-\theta}}\right]^{\mu_{\min}/L}$

and

(10) $\mathbb{P}\left[\lambda_{\max}(Y) \ge (1+\theta)\mu_{\max}\right] \le n \left[\dfrac{e^{\theta}}{(1+\theta)^{1+\theta}}\right]^{\mu_{\max}/L}$.
Based on this proposition, we show our first result: an estimate of the condition number of $\Omega Q$ when each row is chosen with equal probability. Before the statement and proof, let us consider an extreme case: suppose $A = \begin{pmatrix} I_n \\ 0 \end{pmatrix}$. Then the submatrix is considerably ill-conditioned (indeed, singular) unless each of the first $n$ rows of $A$ is chosen at least once by our row-selection strategy, which happens with probability only about $(s/m)^n$ when $s \ll m$.

Therefore, the behavior of our method is strongly related to the structure of the original matrix $A$. To characterize this more precisely, we introduce $\nu := \max_i \|Q(i,:)\|_2$ (here we adopt the MATLAB notation: $Q(i,:)$ denotes the $i$-th row of $Q$), aiming to measure how the mass of an orthogonal matrix concentrates on individual rows.
Theorem 1 (Equal Probability Case)

Suppose $A \in \mathbb{R}^{m\times n}$ has QR decomposition $A = QR$, and the submatrix $A_s$ is generated by choosing $s$ rows independently, each drawn from the original matrix with equal probability $1/m$. Then for any $\theta \in (0,1)$, we have

(11) $\mathbb{P}\left[\kappa_2(AR_s^{-1}) \ge \sqrt{\dfrac{1+\theta}{1-\theta}}\right] \le n\left[\dfrac{e^{-\theta}}{(1-\theta)^{1-\theta}}\right]^{s/(m\nu^2)} + n\left[\dfrac{e^{\theta}}{(1+\theta)^{1+\theta}}\right]^{s/(m\nu^2)}$,

where $s$ is the sampling size defined in Algorithm 6.
Proof

We first construct a matrix-valued random variable to adapt the result of Proposition 2. Separate $Q$ into row vectors $q_1,\dots,q_m$, where $q_i = Q(i,:)$ is the $i$-th row. For an index $I$ drawn uniformly from $\{1,\dots,m\}$, define

$X = \dfrac{m}{s}\, q_I^{\mathsf T} q_I.$

Then we have

(12) $\mathbb{E}X = \dfrac{m}{s} \sum_{i=1}^{m} \dfrac{1}{m}\, q_i^{\mathsf T} q_i = \dfrac{1}{s}\, Q^{\mathsf T} Q = \dfrac{1}{s}\, I_n,$

and it follows that

(13) $\lambda_{\max}(X) = \dfrac{m}{s}\,\|q_I\|_2^2 \le \dfrac{m\nu^2}{s},$

from the definition of $\nu$.

Set $X_1,\dots,X_s$ to be independent copies of $X$, and $Y = \sum_{k=1}^{s} X_k$. Then $\mathbb{E}Y = I_n$, and therefore $\mu_{\min} = \mu_{\max} = 1$. Here we point out that $\kappa_2(\Omega Q)^2 = \lambda_{\max}(Y)/\lambda_{\min}(Y)$, which will be proved later; hence it suffices to estimate the eigenvalues of $Y$. By applying Proposition 2 with $L = m\nu^2/s$ we obtain

(14) $\mathbb{P}\left[\lambda_{\min}(Y) \le 1-\theta\right] \le n\left[\dfrac{e^{-\theta}}{(1-\theta)^{1-\theta}}\right]^{s/(m\nu^2)}$

and

(15) $\mathbb{P}\left[\lambda_{\max}(Y) \ge 1+\theta\right] \le n\left[\dfrac{e^{\theta}}{(1+\theta)^{1+\theta}}\right]^{s/(m\nu^2)}.$

Combining both inequalities, we conclude that

$\mathbb{P}\left[\kappa_2(\Omega Q) \ge \sqrt{\dfrac{1+\theta}{1-\theta}}\right] \le n\left[\dfrac{e^{-\theta}}{(1-\theta)^{1-\theta}}\right]^{s/(m\nu^2)} + n\left[\dfrac{e^{\theta}}{(1+\theta)^{1+\theta}}\right]^{s/(m\nu^2)}.$

Finally, we show $\kappa_2(\Omega Q)^2 = \lambda_{\max}(Y)/\lambda_{\min}(Y)$. If the sampled row indices are $I_1,\dots,I_s$, then $\Omega Q$ has rows $q_{I_1},\dots,q_{I_s}$, and

$Y = \dfrac{m}{s} \sum_{k=1}^{s} q_{I_k}^{\mathsf T} q_{I_k} = \dfrac{m}{s}\,(\Omega Q)^{\mathsf T}(\Omega Q),$

hence $\lambda_{\max}(Y)/\lambda_{\min}(Y) = \sigma_{\max}(\Omega Q)^2/\sigma_{\min}(\Omega Q)^2 = \kappa_2(\Omega Q)^2$. The claim for $\kappa_2(AR_s^{-1})$ follows from Proposition 1.
An observation is that the concentration result is effective if and only if $m\nu^2$ is not much larger than $s$. It is believable that in general $\nu^2 \approx n/m$, in which case $s = O(n)$ samples already yield such a preconditioner with high probability. However, this does not hold in some cases. Recall the case mentioned before: $A = \begin{pmatrix} I_n \\ 0 \end{pmatrix}$, for which $\nu = 1$, and the concentration is effective only if $s \gtrsim m$, which eventually becomes impractical as $m$ grows.
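The dichotomy is easy to reproduce; the following experiment (sizes are arbitrary choices of ours) contrasts a generic $Q$ with the identity-block example above.

```python
# Contrast of Theorem 1's two regimes: a generic (incoherent) Q versus the
# identity-block example, where nu = 1 and uniform row sampling fails.
import numpy as np

rng = np.random.default_rng(1)
m, n, s = 4000, 40, 120

Q_generic, _ = np.linalg.qr(rng.standard_normal((m, n)))  # nu^2 close to n/m
Q_spiky = np.zeros((m, n))
Q_spiky[:n, :] = np.eye(n)                                # nu = 1: mass on n rows

for Q in (Q_generic, Q_spiky):
    rows = rng.choice(m, size=s, replace=True)            # equal-probability sampling
    print(np.linalg.cond(Q[rows, :]))   # modest for generic Q; huge for spiky Q
```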
Now consider the case when rows are chosen with unequal probabilities. The following theorem shows that, under a suitable strategy, the exponent $s/(m\nu^2)$ can be sharpened to $s/n$, which is the optimal rate. Though impractical (the sampling probabilities below require knowing the rows of $Q$), this theoretical result sheds light on the mechanism of the row-selection strategy, which seems promising for future work.

Theorem 2 (Unequal Probability Case)

Suppose $A \in \mathbb{R}^{m\times n}$ has QR decomposition $A = QR$. The submatrix $A_s$ is generated by choosing $s$ rows independently, where each draw selects row $i$ with probability $p_i = \|Q(i,:)\|_2^2/n$. Then for any $\theta \in (0,1)$, we have

(16) $\mathbb{P}\left[\kappa_2(AR_s^{-1}) \ge \sqrt{\dfrac{1+\theta}{1-\theta}}\right] \le n\left[\dfrac{e^{-\theta}}{(1-\theta)^{1-\theta}}\right]^{s/n} + n\left[\dfrac{e^{\theta}}{(1+\theta)^{1+\theta}}\right]^{s/n}$,

where $s$ is the sampling size defined in Algorithm 6.
Proof

The proof parallels that of Theorem 1: define $X = \dfrac{1}{s\,p_I}\, q_I^{\mathsf T} q_I$, so that $\mathbb{E}X = \dfrac{1}{s}I_n$ and $\lambda_{\max}(X) = \dfrac{\|q_I\|_2^2}{s\,p_I} = \dfrac{n}{s}$, and apply Proposition 2 with $L = n/s$.
Next, we show the result for the Gaussian ensemble. In actual computation, one may not wish to use the Gaussian ensemble due to its expensive arithmetic cost. Nevertheless, we believe this result appropriately reveals the numerical behavior of our proposed methods.

Theorem 3 (Gaussian Ensemble Case)

Given $A \in \mathbb{R}^{m\times n}$ with QR decomposition $A = QR$, the submatrix is generated as $A_s = \Omega A$ by left-multiplying a Gaussian matrix $\Omega \in \mathbb{R}^{s\times m}$, where each entry is drawn i.i.d. from the standard normal distribution. Suppose $s > n$; then for any $\delta \in \left(0,\, 1 - \sqrt{n/s}\right)$ we have

(19) $\mathbb{P}\left[\kappa_2(AR_s^{-1}) \ge \dfrac{1 + \sqrt{n/s} + \delta}{1 - \sqrt{n/s} - \delta}\right] \le 2e^{-s\delta^2/2}$,

where $s$ is the sampling size defined in Algorithm 6.
Proof

Since $Q$ has orthonormal columns, $\Omega Q \in \mathbb{R}^{s\times n}$ is again a standard Gaussian matrix by rotational invariance, and $\kappa_2(AR_s^{-1}) = \kappa_2(\Omega Q)$ by Proposition 1. It follows from [wainwright2019high, Theorem 6.1] that

(20) $\mathbb{P}\left[\dfrac{\sigma_{\max}(\Omega Q)}{\sqrt{s}} \ge 1 + \sqrt{n/s} + \delta\right] \le e^{-s\delta^2/2}$

and

(21) $\mathbb{P}\left[\dfrac{\sigma_{\min}(\Omega Q)}{\sqrt{s}} \le 1 - \sqrt{n/s} - \delta\right] \le e^{-s\delta^2/2}.$

Excluding both events, with probability at least $1 - 2e^{-s\delta^2/2}$ we have

(22) $\sigma_{\max}(\Omega Q) \le \sqrt{s}\left(1 + \sqrt{n/s} + \delta\right)$

and

(23) $\sigma_{\min}(\Omega Q) \ge \sqrt{s}\left(1 - \sqrt{n/s} - \delta\right),$

which together yield the claimed bound on $\kappa_2(\Omega Q)$.
Remark 1

If $s = 4n$, then a simple estimation (take $\delta = 1/4$ in Theorem 3, with $\sqrt{n/s} = 1/2$) shows that $\kappa_2(AR_s^{-1}) \le 7$ with probability at least $1 - 2e^{-n/8}$.
Summary

In summary, we have provided theoretical results for both the row-selection strategies and the Gaussian ensemble. These results show that exponential concentration kicks in as the size of the submatrix grows. Hence, for a wide variety of matrices, our framework produces a practical preconditioner. Unfortunately, the exponential decay of the row-selection strategy does not hold for some matrices; we attribute this to a large $\nu$, i.e., the matrix mass concentrating on a few rows.
4 Numerical results
In this section, we present numerical experiments to illustrate our results. We compare the performance of Householder QR, CholeskyQR, CholeskyQR2 (from [fukaya2014choleskyqr2]), shifted CholeskyQR3 (from [fukaya2020shifted]), and the two newly proposed algorithms: randomized LU-CholeskyQR and randomized QR-CholeskyQR. Section 4.1 introduces the setup of our experiments, including computing configurations and the strategy for generating test matrices. The next several subsections discuss stability, runtime, and strong and weak scalability in Sections 4.2, 4.3, and 4.4, respectively. In addition, Section 4.5 concerns the relationship between the condition number and the number of sampled rows.
4.1 Setup
The experiments in this section are all implemented in C++, with OpenMPI (https://www.open-mpi.org/) version 3.1.4 and the BLAS and LAPACK implementations in Intel MKL version 2017.1, on the High-performance Computing Platform of Peking University; see Table 3 for specifications. We use double precision, hence $u = 2^{-53}$, and the shift in shifted CholeskyQR3 is chosen differently from that in [fukaya2020shifted], since the original choice leads to a breakdown when the matrix has a condition number close to $u^{-1}$.
Item | Specification
CPU | Intel Xeon E5-2697A V4 (2.60 GHz, 16 cores)
Number of CPUs / node | 2
Number of threads / node | 32
Peak FLOPS / node | 1.33 TFlops
Memory size / node | 256 GB
We generate test matrices in the following manner: $A = U\Sigma V^{\mathsf T}$, where $U \in \mathbb{R}^{m\times n}$ and $V \in \mathbb{R}^{n\times n}$ are orthogonal matrices obtained by taking the full SVD of a random matrix, and

$\Sigma = \mathrm{diag}\left(1,\, \sigma^{1/(n-1)},\, \dots,\, \sigma^{(n-2)/(n-1)},\, \sigma\right).$

Here $\sigma \in (0,1]$ is a constant. Clearly we have $\|A\|_2 = 1$ and $\kappa_2(A) = 1/\sigma$. We denote the number of processes involved by $p$.

4.2 Accuracy test
First, we examine the numerical stability of randomized LU-CholeskyQR and randomized QR-CholeskyQR (abbreviated rLU-CholeskyQR and rQR-CholeskyQR, respectively), with the sampling rate fixed at $2$. Throughout this section, two measurements are investigated: the orthogonality $\|Q^{\mathsf T}Q - I\|_F$ and the residual $\|A - QR\|_F/\|A\|_F$, where $\|\cdot\|_F$ denotes the Frobenius norm.
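A sketch of the Section 4.1 test-matrix generator together with the two metrics; the explicit form of $\Sigma$ follows our reconstruction above and is an assumption.

```python
import numpy as np

def test_matrix(m, n, kappa, rng):
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))   # random orthonormal columns
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    sigma = kappa ** (-np.arange(n) / (n - 1))         # singular values 1, ..., 1/kappa
    return (U * sigma) @ V.T                           # ||A||_2 = 1, cond_2(A) = kappa

def orthogonality(Q):
    return np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1]), 'fro')

def residual(A, Q, R):
    return np.linalg.norm(A - Q @ R, 'fro') / np.linalg.norm(A, 'fro')
```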
Figures 1 and 2 display the numerical stability, where the condition number of the test matrices varies over a wide range. As shown in Fig. 1, the orthogonality errors of all methods except CholeskyQR are acceptable, in the sense that they are almost independent of the condition number. The figure indicates that the orthogonality error of CholeskyQR is about $\kappa_2(A)^2 u$, so the method is only valid when $\kappa_2(A) \lesssim u^{-1/2}$. The breakdown of CholeskyQR and CholeskyQR2 when $\kappa_2(A) \gtrsim u^{-1/2}$ has already been observed in [fukaya2014choleskyqr2]. For the residual, see Fig. 2: all methods listed here are stable, except that a slight difference is detected in our tests (Householder QR performs a little worse, while CholeskyQR performs a little better).
Next, we show the dependence of numerical stability on the matrix size, varying $m$ and $n$ with the condition number fixed; see Figs. 3-6. The numerical results show that CholeskyQR is unstable due to the large condition number. The rLU-CholeskyQR algorithm performs slightly worse, while the orthogonality errors of the other methods increase only mildly with $m$ and $n$, all remaining stable. All tests support that the residual is negligible in most practical cases.
4.3 Runtime test
Now we evaluate the runtime of our proposed methods against the others. For a single process, as shown in Fig. 7, CholeskyQR outperforms all other methods. Among the stable methods (i.e., excluding CholeskyQR), our proposed methods enjoy the best runtime, and randomized LU-/QR-CholeskyQR cost about the same. It is worth noting that our methods are even faster than CholeskyQR2, which was shown to be of limited robustness in the previous subsection. All results in Fig. 7 indicate that the computational cost is proportional to the size of the matrix, which corroborates the theoretical counts.
To see whether similar results hold for multiple processes, we repeat the experiments with 32 processes. For convenience, we only include methods that are easy to parallelize: we use CholeskyQR2 instead of CholeskyQR for comparison, and omit rLU-CholeskyQR since it performs similarly to rQR-CholeskyQR. Fig. 8 shows the results as the number of rows varies; the runtime behavior coincides with the single-process case. For completeness, Fig. 9 illustrates the results as the condition number varies. It indicates that CholeskyQR2 and sCholeskyQR3 take noticeably more time than rQR-CholeskyQR, and that CholeskyQR2 collapses when the condition number exceeds $u^{-1/2}$.
4.4 Scalability test
In this subsection, we examine the scalability of rQR-CholeskyQR, compared with the CholeskyQR2 and shifted CholeskyQR3 algorithms. We consider both strong and weak scalability. The former focuses on how the running time varies with the number of processes for a fixed total problem size; the latter focuses on how the running time varies with the number of processes when the problem size per process is fixed.
As Table 4 shows, all three methods have good strong scalability: as the number of processes increases, the speedup ratio grows roughly in proportion. With 512 processes, they run around 36 times faster than with 8 processes, which indicates that these methods are well suited to multi-process configurations.
processes | CholeskyQR2 | sCholeskyQR3 | rQR-CholeskyQR
8 | 2.3181 | 3.6855 | 1.8643
16 | 1.4032 (1.65) | 2.2035 (1.67) | 1.1706 (1.59)
32 | 0.7886 (2.93) | 1.1413 (3.22) | 0.6152 (3.03)
64 | 0.4187 (5.53) | 0.5993 (6.14) | 0.3341 (5.58)
128 | 0.2386 (9.71) | 0.3301 (11.16) | 0.1925 (9.68)
256 | 0.1251 (18.52) | 0.1911 (19.28) | 0.1026 (18.17)
512 | 0.0631 (36.74) | 0.0958 (38.47) | 0.0520 (35.85)

(Runtime in seconds; speedup relative to 8 processes in parentheses.)
Weak scalability can be read off from the above results. The running time is proportional to the scale of the matrix: if we double the number of rows of $A$, the running time roughly doubles; but if we also double the number of processes, the computation takes almost the same time as before, or only slightly more. In short, fixing $n$ and the per-process block size $m/p$, the runtime increases only mildly as $p$ grows. As Fig. 10 shows, over the tested range of $p$ the runtimes of CholeskyQR2 and sCholeskyQR3 grow by factors of 2.26 and 2.25, respectively; rQR-CholeskyQR performs best, growing by a factor of only 2.21.
4.5 Dependence on numbers of sampling rows
Finally, we address the following question: how many rows must be sampled to ensure that the condition number of the preconditioned matrix is small enough? Notice that in the previous subsections the sampling rate was fixed at 2. Here we reveal the relation between the condition number (in expectation) and the sampling rate $s/n$. This parameter is essential in the sense that it strongly impacts the final preconditioning quality, but we will show that our method is not very sensitive to it once the sampling rate departs from 1.
As Fig. 11 shows, a slight enlargement of the sampling rate leads to a sharp decay of the condition number of the preconditioned matrix. When the sampling rate equals 1, the condition number can reach about 3000. However, once the sampling rate becomes slightly larger, say 1.2, the condition number drops to about 20, which is sufficient for the subsequent CholeskyQR. As the sampling rate increases further, the condition number after preconditioning continues to drop, but not as significantly as before. For practical usage, we recommend choosing the sampling rate between 1.5 and 2.
5 Applications
In this section, two examples are provided to explain the motivation for studying the QR decomposition of tall-and-skinny matrices; they also serve as natural applications of our proposed methods.

Nevertheless, this short section cannot encompass all use cases of the proposed method. For example, the tall-and-skinny QR computation in LOBPCG takes up substantial time, and block QR decompositions rely heavily on tall-and-skinny QR. Our algorithms can accelerate these numerical methods as well; many of them are within the scope of future work.
5.1 Randomized singular value decomposition
Randomized SVD aims to approximate the largest few singular values and vectors of large-scale matrices. It has been widely used and analyzed; see [halko2011finding]. The basic version of RSVD is introduced as follows:
This vanilla algorithm may suffer from a loss of accuracy. Instead, a power-iteration variant is usually used to improve accuracy; see Algorithm 8. For further discussion on improving RSVD, see [musco2015randomized].
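A compact sketch of RSVD with power iteration, in the spirit of Algorithm 8; the `orth` hook (our illustrative parameter) is where rQR-CholeskyQR would replace plain QR.

```python
import numpy as np

def rsvd(A, k, q=2, oversample=10, orth=lambda M: np.linalg.qr(M)[0], rng=None):
    """Rank-k randomized SVD with q power iterations."""
    rng = np.random.default_rng() if rng is None else rng
    Y = A @ rng.standard_normal((A.shape[1], k + oversample))
    Q = orth(Y)                          # orthogonalization of a tall-and-skinny block
    for _ in range(q):                   # power iterations sharpen the captured range
        Q = orth(A.T @ Q)
        Q = orth(A @ Q)
    U, S, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U)[:, :k], S[:k], Vt[:k, :]
```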
Often, the orthogonalization step uses a QR decomposition, for example MATLAB's built-in qr function; we can therefore substitute our proposed method. To test it, we select several sparse matrices and report the time cost of each iteration. All experiments here are performed in MATLAB R2020b on a personal laptop. The test matrices are selected from the SuiteSparse Matrix Collection (https://sparse.tamu.edu/); brief information is listed in Table 5.
matrix | rows | columns | nnz | density | nnz/row
Ge87H76 | 112,985 | 112,985 | 7,892,195 | 6.18e-4 | 69.85
Si87H76 | 240,369 | 240,369 | 10,661,631 | 1.85e-4 | 44.36
Hamrle3 | 1,447,360 | 1,447,360 | 5,514,242 | 2.63e-6 | 3.81
G3_circuit | 1,585,478 | 1,585,478 | 7,660,826 | 3.04e-6 | 4.83
Table 6 shows the time cost of each iteration of Algorithm 8, where in each iteration we simply swap MATLAB's built-in QR (denoted QR) for our proposed randomized QR-CholeskyQR (denoted rQR). The acceleration becomes more pronounced as the matrices become sparser.
time(s) | QR | rQR | ratio | QR | rQR | ratio | QR | rQR | ratio
Ge87H76 | 0.308 | 0.213 | 1.45 | 0.741 | 0.557 | 1.33 | 2.087 | 1.509 | 1.38
Si87H76 | 0.540 | 0.321 | 1.68 | 1.466 | 0.912 | 1.61 | 3.341 | 2.021 | 1.65
Hamrle3 | 2.528 | 0.762 | 3.27 | 6.790 | 2.653 | 2.56 | 12.392 | 4.559 | 2.72
G3_circuit | 2.904 | 0.903 | 3.21 | 5.964 | 2.406 | 2.48 | 15.128 | 5.135 | 2.95
5.2 Least squares

In this section, we extend our interest to the $R$ factor produced by randomized QR. Consider the least squares problem $\min_x \|Ax - b\|_2$, which is usually solved via the normal equation $A^{\mathsf T}Ax = A^{\mathsf T}b$, provided $A$ has full rank. However, the normal-equation solver also suffers when $A$ is ill-conditioned, so a suitable preconditioner is required. As remarked before, the $R$ factor of the submatrix, denoted $R_s$, can be regarded as a suitable preconditioner, since $\kappa_2(AR_s^{-1})$ is small with high probability. The algorithm using the randomized $R$ factor can be summarized as follows.
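One hedged realization: instead of forming the normal equations explicitly, apply an iterative least-squares solver (here SciPy's LSQR, our choice, not necessarily the paper's) to the preconditioned operator $AR_s^{-1}$.

```python
import numpy as np
from scipy.linalg import solve_triangular
from scipy.sparse.linalg import lsqr, LinearOperator

def rand_precond_lsqr(A, b, c=2.0, rng=None):
    """Solve min ||A x - b||_2 with the sampled R factor as right preconditioner."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    rows = rng.choice(m, size=int(c * n), replace=True)
    Rs = np.linalg.qr(A[rows, :], mode='r')        # Q-less QR of the sample
    # Work with M = A Rs^{-1}, which is nearly orthonormal with high probability.
    Aop = LinearOperator((m, n),
        matvec=lambda y: A @ solve_triangular(Rs, y, lower=False),
        rmatvec=lambda r: solve_triangular(Rs, A.T @ r, trans='T', lower=False))
    y = lsqr(Aop, b)[0]                            # solve min ||M y - b||
    return solve_triangular(Rs, y, lower=False)    # x = Rs^{-1} y
```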
Appendix A Estimates for rQR-CholeskyQR under inexact subroutines

In this appendix, we focus on the estimates for Algorithm 6 when the subroutines are performed inexactly. The inexactness typically comes from roundoff error, which is the main concern of this section. The spirit of this section is that, under a stable choice of subroutines, the overall result is also stable, in the sense that the condition number of the computed preconditioned matrix (denoted $\hat X$ for clarity) enjoys only a mild increase over that of the exact one, $\hat A = AR_s^{-1}$. We first assume that our subroutines (the small QR and the triangular solver) are forward stable in the $2$-norm.
Definition 1

The QR subroutine is $\epsilon_1$-stable if, for the inexact QR factor $\hat R_s$ of $A_s$, we have

$\|\hat R_s - R_s\|_2 \le \epsilon_1 \|A_s\|_2.$

Definition 2

$\hat X$ is $\epsilon_2$-stable with respect to the triangular system $X\hat R_s = A$ if the following estimate holds:

(24) $\|\hat X - A\hat R_s^{-1}\|_2 \le \epsilon_2 \|A\|_2 \|\hat R_s^{-1}\|_2,$

where $A\hat R_s^{-1}$ is the exact solution.
We now give a framework for the inexact version of Algorithm 6.

Proposition 3

Suppose that, for a submatrix $A_s$, $\hat R_s$ is an $\epsilon_1$-stable $R$ factor of $A_s$ computed via QR decomposition, and $\hat X$ is $\epsilon_2$-stable with respect to $X\hat R_s = A$. We assume that $\hat R_s$ is invertible; then we have

(25) $\|\hat X - \hat A\|_2 \le \delta := \left(\epsilon_2\|A\|_2 + \epsilon_1\|A_s\|_2\|\hat A\|_2\right)\|\hat R_s^{-1}\|_2.$

Moreover, assuming $\delta < \sigma_{\min}(\hat A)$, we have

(26) $\kappa_2(\hat X) \le \dfrac{\sigma_{\max}(\hat A) + \delta}{\sigma_{\min}(\hat A) - \delta}.$
Proof

Denote by $X^\ast = A\hat R_s^{-1}$ the exact solution of the triangular system. By the definition of $\epsilon_2$-stability, we have

(27) $\|\hat X - X^\ast\|_2 \le \epsilon_2 \|A\|_2 \|\hat R_s^{-1}\|_2.$

Next, we estimate the term $\|X^\ast - \hat A\|_2$. It follows that

(28) $X^\ast - \hat A = A\hat R_s^{-1} - AR_s^{-1} = AR_s^{-1}\left(R_s - \hat R_s\right)\hat R_s^{-1} = \hat A\left(R_s - \hat R_s\right)\hat R_s^{-1}.$

Since $\|\hat R_s - R_s\|_2 \le \epsilon_1\|A_s\|_2$, by submultiplicativity we have

(29) $\|X^\ast - \hat A\|_2 \le \epsilon_1 \|A_s\|_2 \|\hat A\|_2 \|\hat R_s^{-1}\|_2.$

Therefore, by the triangle inequality,

(30) $\|\hat X - \hat A\|_2 \le \left(\epsilon_2\|A\|_2 + \epsilon_1\|A_s\|_2\|\hat A\|_2\right)\|\hat R_s^{-1}\|_2 = \delta.$

Hence, by Weyl's theorem [golub2013matrix, Chapter 8.6], $|\sigma_i(\hat X) - \sigma_i(\hat A)| \le \delta$ for all $i$, and

(31) $\kappa_2(\hat X) = \dfrac{\sigma_{\max}(\hat X)}{\sigma_{\min}(\hat X)} \le \dfrac{\sigma_{\max}(\hat A) + \delta}{\sigma_{\min}(\hat A) - \delta}.$
We close this section by discussing the aforementioned stability parameters. Under mild assumptions, the backward stability of triangular solvers (see [higham2002accuracy, Theorem 8.5]) implies a forward bound of the form $\|\hat X - A\hat R_s^{-1}\|_2 \lesssim n\,u\,\kappa_2(\hat R_s)\,\|A\hat R_s^{-1}\|_2$, and thus the solver is $\epsilon_2$-stable with $\epsilon_2 = O(n\,u\,\kappa_2(\hat R_s))$, since $\hat R_s$ is well-conditioned after a successful preconditioning step.
The $\epsilon_1$-stability is more involved. We first recall the backward stability of Householder and Givens QR (see [higham2002accuracy, Chapter 18]): there exists a perturbation $\Delta A_s$ with $\|\Delta A_s\|_2 \le \epsilon \|A_s\|_2$ such that $\hat R_s$ is the exact $R$ factor of $A_s + \Delta A_s$. Here

- for Householder QR, $\epsilon = c_1 mn\,u$, where $c_1$ is a small constant;
- for Givens QR, $\epsilon = c_2 (m+n)\,u$, where $c_2$ is a small constant.

By expanding the definition of backward stability, we have

(32) $A_s + \Delta A_s = \hat Q \hat R_s,$

or equivalently,

(33) $\hat R_s = \hat Q^{\mathsf T}\left(A_s + \Delta A_s\right).$

By the perturbation result in [chang1997perturbation, Theorem 3.1], we then have

(34) $\|\hat R_s - R_s\|_2 \le c\,\kappa_2(A_s)\,\|\Delta A_s\|_2.$

Combining this with backward stability, we obtain $\epsilon_1 = O\!\left(\kappa_2(A_s)\,\epsilon\right)$, and the corresponding bound on $\epsilon_2$ has been discussed above.