# A fast spectral divide-and-conquer method for banded matrices

Based on the spectral divide-and-conquer algorithm by Nakatsukasa and Higham [SIAM J. Sci. Comput., 35(3): A1325-A1349, 2013], we propose a new algorithm for computing all the eigenvalues and eigenvectors of a symmetric banded matrix. For this purpose, we combine our previous work on the fast computation of spectral projectors in the so-called HODLR format with a novel technique for extracting a basis for the range of such a HODLR matrix. Numerical experiments demonstrate that our algorithm exhibits quasilinear complexity and allows for conveniently dealing with large-scale matrices.


## 1 Introduction

Given a large symmetric banded matrix $A \in \mathbb{R}^{n \times n}$, we consider the computation of its complete spectral decomposition

$$A = Q \Lambda Q^T, \qquad \Lambda = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n), \tag{1}$$

where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of $A$ and the columns of the orthogonal matrix $Q$ are the corresponding eigenvectors. This problem has attracted quite some attention from the early days of numerical linear algebra until today, particularly when $A$ is a tridiagonal matrix.

A number of applications give rise to banded eigenvalue problems. For example, they constitute a critical step in solvers for general dense symmetric eigenvalue problems. Nearly all existing approaches, with the notable exception of [NakaHigh2013], first reduce a given dense symmetric matrix to tridiagonal form. This is followed by a method for determining the spectral decomposition of a tridiagonal matrix, such as the QR algorithm, the classical divide-and-conquer (D&C) method, or the algorithm of multiple relatively robust representations (MRRR). All these methods have complexity $O(n^2)$ or higher, simply because all eigenvectors are computed and stored explicitly.

On a modern computing architecture with a memory hierarchy, it turns out to be advantageous to perform the tridiagonalization based on successive band reduction [BiscLangSun2000], with a symmetric banded matrix as an intermediate step [AuckBlumBungHuck2011, BienIgualKressPet2011, HaidLtaiDong2011, HaidSolcGates2013, SoloBallDemmHoef016]. In this context, it would be preferable to design an eigenvalue solver that works directly with banded matrices, thereby avoiding the reduction from banded to tridiagonal form. Such a possibility has been explored for classical D&C in [Arbenz1992, HaidLtaiDong2012]. However, the proposed methods seem to suffer from numerical instability or from an unsatisfactory growth of the complexity as the bandwidth increases.

In this paper we propose a new and fast approach to computing the spectral decomposition of a symmetric banded matrix. It is based on the spectral D&C method from [NakaHigh2013], which recursively splits the spectrum using invariant subspaces extracted from spectral projectors, each associated with roughly half of the spectrum. In previous work [KressnerSus2017], we developed a fast method for approximating such spectral projectors in a hierarchical low-rank format, the so-called HODLR (hierarchically off-diagonal low-rank) format [Ambikasaran2013]. However, the extraction of the invariant subspace requires determining a basis for the range of the spectral projector, which represents a major challenge. We present an efficient algorithm for computing an orthonormal basis of an invariant subspace in the HODLR format, which heavily exploits the properties of spectral projectors. The matrix of eigenvectors is stored implicitly, via orthonormal factors, where each factor is an orthonormal basis for an invariant subspace. Our approach extends to general symmetric HODLR matrices.

Several existing approaches that use hierarchical low-rank formats for the fast solution of eigenvalue problems are based on computing (inexact) LDL decompositions in such a format; see [Hackbusch2015, sec. 13.5] for an overview. These decompositions allow slicing the spectrum of a symmetric matrix into smaller chunks and are particularly well suited when only the eigenvalues and a few eigenvectors are needed.

To the best of our knowledge, the only existing fast methods suitable for the complete spectral decomposition of a large symmetric matrix are based on variations of the classical D&C method by Cuppen for a symmetric tridiagonal matrix [Cuppen1980/81]. One recursion of the method divides, after a rank-one perturbation, the matrix into a block diagonal matrix. In the conquer phase the rank-one perturbation is incorporated by solving a secular equation for the eigenvalues and applying a Cauchy-like matrix to the matrix of eigenvectors. Gu and Eisenstat [Gu1995] not only stabilized Cuppen's method but also observed that the use of the fast multipole method for the Cauchy-like matrix multiplication reduces the complexity to $O(n^2)$ for computing all eigenvectors. Vogel et al. [VogelXiaCauBal2016] extended these ideas beyond tridiagonal matrices, to general symmetric HSS (hierarchically semiseparable) matrices. Moreover, by representing the matrix of eigenvectors in factored form, the overall cost becomes quasilinear in $n$. While our work bears similarities with [VogelXiaCauBal2016], such as the storage of the eigenvectors in factored form, it differs in several key aspects. First, our developments use the HODLR format while [VogelXiaCauBal2016] uses the HSS format. The latter format stores the low-rank factors of the off-diagonal blocks in a nested manner and thus reduces the memory requirements by a logarithmic factor if the involved ranks stay the same on all levels. However, one may need to work with rather large values of $n$ in order to gain significant computational savings from working with HSS instead of HODLR. A second major difference is that the spectral D&C method used in this paper has, despite the similarity in name, little in common with Cuppen's D&C. One advantage of using spectral D&C is that it conveniently allows computing only parts of the spectrum.
A third major difference is that [VogelXiaCauBal2016] incorporates perturbations of rank one only; matrices of bandwidth larger than one are processed by sequentially splitting them up into rank-one perturbations. The method presented in this paper processes higher ranks directly, avoiding the need for splitting and leveraging the performance of level 3 BLAS operations. While the timings reported in [VogelXiaCauBal2016] appear to be comparable with the timings presented in this paper, our experiments additionally demonstrate that our newly proposed method allows for conveniently dealing with large-scale matrices.

The rest of the paper is organized as follows. In section 2, we recall the spectral divide-and-conquer algorithm for computing the spectral decomposition of a symmetric matrix. Section 3 gives a brief overview of the HODLR format and of a fast method for computing spectral projectors of HODLR matrices. In section 4 we discuss the fast extraction of invariant subspaces from a spectral projector given in the HODLR format. Section 5 presents the overall spectral D&C algorithm in the HODLR format for computing the spectral decomposition of a banded matrix. Numerical experiments are presented in section 6.

## 2 Spectral divide-and-conquer

In this section we recall the spectral D&C method by Nakatsukasa and Higham [NakaHigh2013] for a symmetric matrix $A \in \mathbb{R}^{n \times n}$ with spectral decomposition (1). We assume that the eigenvalues are sorted in ascending order and choose a shift $\mu \in \mathbb{R}$ such that

$$\lambda_1 \le \cdots \le \lambda_\nu < \mu < \lambda_{\nu+1} \le \cdots \le \lambda_n, \qquad \nu \approx n/2.$$

The relative spectral gap associated with this splitting of eigenvalues is defined as

$$\mathrm{gap} = \frac{\lambda_{\nu+1} - \lambda_\nu}{\lambda_n - \lambda_1}.$$

The spectral projector $\Pi_{<\mu}$ associated with the first $\nu$ eigenvalues is the orthogonal projector onto the subspace spanned by the corresponding eigenvectors. Given (1), it takes the form

$$\Pi_{<\mu} = Q \begin{bmatrix} I_\nu & 0 \\ 0 & 0 \end{bmatrix} Q^T.$$

Note that

$$\Pi_{<\mu}^T = \Pi_{<\mu}^2 = \Pi_{<\mu}, \qquad \operatorname{trace}(\Pi_{<\mu}) = \operatorname{rank}(\Pi_{<\mu}) = \nu.$$

The spectral projector $\Pi_{>\mu}$ associated with the other $n - \nu$ eigenvalues is given by

$$\Pi_{>\mu} = Q \begin{bmatrix} 0 & 0 \\ 0 & I_{n-\nu} \end{bmatrix} Q^T$$

and satisfies analogous properties.

The method from [NakaHigh2013] first computes the matrix sign function

$$\operatorname{sign}(A - \mu I) = Q \begin{bmatrix} -I_\nu & 0 \\ 0 & I_{n-\nu} \end{bmatrix} Q^T$$

and then extracts the spectral projectors via the relations

$$\Pi_{<\mu} = \tfrac{1}{2}\bigl(I - \operatorname{sign}(A - \mu I)\bigr), \qquad \Pi_{>\mu} = I - \Pi_{<\mu}.$$

The ranges of these spectral projectors are invariant subspaces of $\operatorname{sign}(A - \mu I)$ and, in turn, of $A$. Letting $Q_{<\mu}$ and $Q_{>\mu}$ denote arbitrary orthonormal bases for $\operatorname{range}(\Pi_{<\mu})$ and $\operatorname{range}(\Pi_{>\mu})$, respectively, we therefore obtain

$$\begin{bmatrix} Q_{<\mu} & Q_{>\mu} \end{bmatrix}^T A \begin{bmatrix} Q_{<\mu} & Q_{>\mu} \end{bmatrix} = \begin{bmatrix} A_{<\mu} & 0 \\ 0 & A_{>\mu} \end{bmatrix}, \tag{2}$$

where the eigenvalues of $A_{<\mu}$ are $\lambda_1, \ldots, \lambda_\nu$ and the eigenvalues of $A_{>\mu}$ are $\lambda_{\nu+1}, \ldots, \lambda_n$. Applying the described procedure recursively to $A_{<\mu}$ and $A_{>\mu}$ leads to Algorithm 1. When the size of the matrix is below a user-prescribed minimal size $n_{\min}$, the recursion is stopped and a standard method for computing spectral decompositions is used, denoted by eig.

In the following sections, we discuss how Algorithm 1 can be implemented efficiently in the HODLR format.

## 3 Computation of spectral projectors in HODLR format

In this section, we briefly recall the HODLR format and the algorithm from [KressnerSus2017] for computing spectral projectors in the HODLR format.

### 3.1 HODLR format

Given an $n \times m$ matrix $M$, let us consider a block matrix partitioning of the form

$$M = \begin{bmatrix} M^{(1)}_{11} & M^{(1)}_{12} \\ M^{(1)}_{21} & M^{(1)}_{22} \end{bmatrix}. \tag{3}$$

This partitioning is applied recursively, $p$ times, to the diagonal blocks $M^{(1)}_{11}$, $M^{(1)}_{22}$, leading to the hierarchical partitioning shown in Figure 1. We say that $M$ is a HODLR matrix of level $p$ and HODLR rank $k$ if all off-diagonal blocks seen during this process have rank at most $k$. In the HODLR format, these blocks are stored, more efficiently, in terms of their low-rank factors; for $p = 2$, for example, each of the six off-diagonal blocks is stored as a product of two rank-$k$ factors.

The definition of a HODLR matrix of course depends on how the partitioning (3) is chosen on each level of the recursion. This choice is completely determined by the integer partitions

$$n = n_1 + n_2 + \cdots + n_{2^p}, \qquad m = m_1 + m_2 + \cdots + m_{2^p}, \tag{4}$$

corresponding to the sizes $n_i \times m_i$, $i = 1, \ldots, 2^p$, of the diagonal blocks on the lowest level of the recursion. Given specific integer partitions (4), we denote the set of HODLR matrices of HODLR rank $k$ by $\mathcal{H}(k)$.

### 3.2 Operations in the HODLR format

Assuming that the integer partitions (4) are balanced, with $n = m$ and $n_i = m_i \approx n / 2^p$, the storage of $M \in \mathcal{H}(k)$ in the HODLR format requires $O(kn \log n)$ memory. Various matrix operations with HODLR matrices can be performed with linear-polylogarithmic complexity. Table 1 summarizes the operations needed in this work; we refer to, e.g., [Ballani2016, Hackbusch2015] for more details. In order to perform operations involving two HODLR matrices, the corresponding partitions ought to be compatible.
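
As a minimal illustration of the format (not the implementation used in the paper), the following sketch compresses a dense symmetric matrix into a HODLR-like tree via truncated SVDs of the off-diagonal blocks and performs a matrix-vector product; the class and parameter names are illustrative, and adaptive recompression is not shown.

```python
import numpy as np

class HODLR:
    """Minimal HODLR-like representation built from a dense matrix:
    diagonal blocks are recursed on, off-diagonal blocks are replaced
    by rank-k truncated SVD factors. Illustrative only."""
    def __init__(self, M, k=8, min_size=32):
        n = M.shape[0]
        if n <= min_size:
            self.dense = M.copy()               # lowest-level diagonal block
            return
        self.dense = None
        self.h = n // 2
        self.A11 = HODLR(M[:self.h, :self.h], k, min_size)
        self.A22 = HODLR(M[self.h:, self.h:], k, min_size)
        # low-rank factors of the two off-diagonal blocks on this level
        self.U12, self.V12 = self._lowrank(M[:self.h, self.h:], k)
        self.U21, self.V21 = self._lowrank(M[self.h:, :self.h], k)

    @staticmethod
    def _lowrank(B, k):
        U, s, Vt = np.linalg.svd(B, full_matrices=False)
        return U[:, :k] * s[:k], Vt[:k].T       # B ~= U @ V.T

    def matvec(self, x):
        if self.dense is not None:
            return self.dense @ x
        x1, x2 = x[:self.h], x[self.h:]
        y1 = self.A11.matvec(x1) + self.U12 @ (self.V12.T @ x2)
        y2 = self.A22.matvec(x2) + self.U21 @ (self.V21.T @ x1)
        return np.concatenate([y1, y2])
```

For a banded matrix of bandwidth $b$, every off-diagonal block has rank at most $b$, so the compression above is exact whenever $k \ge b$.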

The operations listed in Table 1 with subscript $\mathcal{H}$ employ recompression in order to limit the increase of the off-diagonal ranks. In this paper recompression is done adaptively, such that the 2-norm approximation error in each off-diagonal block is bounded by a prescribed truncation tolerance $\epsilon$. For matrix addition, recompression is done only after adding two off-diagonal blocks, whereas multiplying HODLR matrices and computing the Cholesky decomposition require recompression in intermediate steps.

In this work we also need to extract submatrices of HODLR matrices. Let $M \in \mathcal{H}(k)$ be associated with an integer partition (4) with $n = m$, and consider a subset of indices $C \subseteq \{1, \ldots, n\}$. Then the submatrix $M(C, C)$ is again a HODLR matrix. To see this, consider the partitioning (3), write the off-diagonal blocks as $M^{(1)}_{12} = U_1 V_2^T$ and $M^{(1)}_{21} = U_2 V_1^T$, and let $C = C_1 \cup C_2$ with $C_1 = C \cap \{1, \ldots, n'\}$ and $C_2 = C \setminus C_1$, where $n'$ is the size of $M^{(1)}_{11}$. Then

$$M(C,C) = \begin{bmatrix} M^{(1)}_{11}(C_1,C_1) & M^{(1)}_{12}(C_1,C_2) \\ M^{(1)}_{21}(C_2,C_1) & M^{(1)}_{22}(C_2,C_2) \end{bmatrix} = \begin{bmatrix} M^{(1)}_{11}(C_1,C_1) & U_1(C_1,:)\bigl(V_2(C_2,:)\bigr)^T \\ U_2(C_2,:)\bigl(V_1(C_1,:)\bigr)^T & M^{(1)}_{22}(C_2,C_2) \end{bmatrix}.$$

Hence, the off-diagonal blocks again have rank at most $k$. Applying this argument recursively to $M^{(1)}_{11}(C_1,C_1)$ and $M^{(1)}_{22}(C_2,C_2)$ establishes $M(C,C) \in \mathcal{H}(k)$, associated with the integer partition

$$|C| =: m = m_1 + m_2 + \cdots + m_{2^p},$$

where $m_1$ is the cardinality of $C \cap \{1, \ldots, n_1\}$, $m_2$ is the cardinality of $C \cap \{n_1 + 1, \ldots, n_1 + n_2\}$, and so on. Note that it may happen that some $m_i = 0$, in which case the corresponding blocks in the HODLR format vanish. Formally, this poses no problem in the definition of and the operations with HODLR matrices. In practice, these blocks are removed to reduce overhead.

### 3.3 Computation of spectral projectors in the HODLR format

The method presented in [KressnerSus2017] for computing spectral projectors of banded matrices is based on the dynamically weighted Halley iteration from [NakaBaiGygi2010, NakaHigh2013] for computing the matrix sign function. In this work, we also need a slight variation of that method for dealing with HODLR matrices.

Given a symmetric nonsingular matrix $A$, the method from [NakaBaiGygi2010] uses the iteration

$$X_{k+1} = \frac{b_k}{c_k} X_k + \Bigl(a_k - \frac{b_k}{c_k}\Bigr) X_k \bigl(I + c_k X_k^T X_k\bigr)^{-1}, \qquad X_0 = A/\alpha, \tag{5}$$

which converges globally and cubically to $\operatorname{sign}(A)$. The parameter $\alpha$ is chosen such that $\alpha \ge \|A\|_2$. The parameters $a_k, b_k, c_k$ are computed by

$$a_k = h(l_k), \qquad b_k = (a_k - 1)^2/4, \qquad c_k = a_k + b_k - 1, \tag{6}$$

where $l_k$ is determined by the recurrence

$$l_k = l_{k-1}\bigl(a_{k-1} + b_{k-1} l_{k-1}^2\bigr) / \bigl(1 + c_{k-1} l_{k-1}^2\bigr), \qquad k \ge 1, \tag{7}$$

with $l_0$ a lower bound for $\sigma_{\min}(X_0)$, and the function $h$ is given by

$$h(l) = \sqrt{1+\gamma} + \frac{1}{2}\sqrt{8 - 4\gamma + \frac{8(2-l^2)}{l^2\sqrt{1+\gamma}}}, \qquad \gamma = \sqrt[3]{\frac{4(1-l^2)}{l^4}}.$$

In summary, except for $\alpha$ and $l_0$, the parameters determining (5) are simple and cheap to compute.
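
A dense sketch of iteration (5) with the parameter choices (6)-(7) follows. It is illustrative only: $\alpha$ and $l_0$ are computed exactly from the spectrum (in practice they are estimated cheaply), the linear systems are solved densely rather than via Cholesky or QR factorizations in the HODLR format, and the stopping criterion is an assumption of this sketch.

```python
import numpy as np

def qdwh_sign(A, tol=1e-12, max_iter=50):
    """Dense sketch of the dynamically weighted Halley iteration for
    sign(A), with A symmetric and nonsingular."""
    alpha = np.linalg.norm(A, 2)                       # alpha >= ||A||_2
    X = A / alpha
    l = np.min(np.abs(np.linalg.eigvalsh(A))) / alpha  # l0 = sigma_min(X0)
    I = np.eye(A.shape[0])
    for _ in range(max_iter):
        gamma = (4.0 * (1.0 - l**2) / l**4) ** (1.0 / 3.0)
        a = np.sqrt(1.0 + gamma) + 0.5 * np.sqrt(
            8.0 - 4.0 * gamma + 8.0 * (2.0 - l**2) / (l**2 * np.sqrt(1.0 + gamma)))
        b = (a - 1.0) ** 2 / 4.0
        c = a + b - 1.0
        X_new = (b / c) * X + (a - b / c) * X @ np.linalg.inv(I + c * (X.T @ X))
        # recurrence (7); clipped at 1 to guard against rounding
        l = min(l * (a + b * l**2) / (1.0 + c * l**2), 1.0)
        converged = (np.linalg.norm(X_new - X, 'fro')
                     <= tol * np.linalg.norm(X_new, 'fro'))
        X = X_new
        if converged:
            break
    return X
```

Each eigenvalue $x$ of $X_k$ is mapped to $x(a_k + b_k x^2)/(1 + c_k x^2)$, which is exactly the scalar recurrence (7) driving the eigenvalues toward $\pm 1$.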

The algorithm hQDWH presented in [KressnerSus2017] for banded $A$ is essentially an implementation of (5) in the HODLR matrix arithmetic, with one major difference. Following [NakaHigh2013], hQDWH avoids the computation of a Cholesky factorization for evaluating the matrix inverse in (5) in the first iteration. Instead, a QR decomposition of a $2n \times n$ matrix is computed. This improves numerical stability and allows us to safely determine spectral projectors even for tiny relative gaps. For reasons explained in [KressnerSus2017, Remark 3.1], existing algorithms for performing QR decompositions of HODLR matrices come with various drawbacks. When $A$ is a HODLR matrix, Algorithm 2 therefore uses a Cholesky decomposition (instead of a QR decomposition) in the first iteration as well. In turn, as will be demonstrated by the numerical experiments in section 6, this restricts the application of the algorithm to matrices whose relative spectral gap is not too small. We do not see this as a major disadvantage in the setting under consideration: the relative gap is controlled by the choice of the shift $\mu$ in our D&C method, and tiny relative spectral gaps can easily be avoided by adjusting $\mu$.

The inexpensive estimation of $\alpha$ and $l_0$ for banded $A$ is discussed in [KressnerSus2017]. For a HODLR matrix $A$, we determine $\alpha \approx \|A\|_2$ and $l_0 \approx \sigma_{\min}(X_0)$ by applying a few steps of the power method and of the inverse power method, respectively.
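
In a dense setting, such estimates can be mimicked as follows; the function names, iteration counts, and safety factor are illustrative choices of this sketch, not those of [KressnerSus2017].

```python
import numpy as np

def estimate_alpha(A, iters=10, safety=1.1):
    """Estimate alpha >= ||A||_2 by the power method on A^T A.
    The Rayleigh quotient approaches ||A||_2^2 from below, so a small
    safety factor (an illustrative choice) is applied on top."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(A.shape[1])
    for _ in range(iters):
        v = A.T @ (A @ v)
        v /= np.linalg.norm(v)
    return safety * np.sqrt(v @ (A.T @ (A @ v)))

def estimate_sigma_min(X, iters=10):
    """Estimate sigma_min(X) by inverse iteration on X^T X,
    here implemented with dense solves for simplicity."""
    rng = np.random.default_rng(1)
    B = X.T @ X
    v = rng.standard_normal(X.shape[1])
    for _ in range(iters):
        v = np.linalg.solve(B, v)
        v /= np.linalg.norm(v)
    return np.sqrt(v @ (B @ v))
```

In the HODLR setting, the matrix-vector products and solves above would be carried out in the HODLR arithmetic at linear-polylogarithmic cost.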

Assuming that a constant number of iterations is needed and that the HODLR ranks of all intermediate quantities are bounded by $k$, the complexity of Algorithm 2 is linear-polylogarithmic in $n$.

## 4 Computation of invariant subspace basis in the HODLR format

This section addresses the efficient extraction of a basis for the range of a spectral projector given in the HODLR format.

Assuming that $\operatorname{rank}(\Pi_{<\mu}) = \nu$, the most straightforward approach to obtaining a basis for $\operatorname{range}(\Pi_{<\mu})$ is to simply take its first $\nu$ columns. Numerically, this turns out to be a terrible idea, especially when $A$ is banded.

###### Example 1.

Let $n$ be even and let $A$ be a symmetric banded matrix with bandwidth $b$ and eigenvalues distributed uniformly in $[-1, 1]$. In particular, $\operatorname{rank}(\Pi_{<0}) = n/2$ for the shift $\mu = 0$. Figure 2 shows that the condition number of the first $n/2$ columns of $\Pi_{<0}$ grows dramatically as $n$ increases. By computing a QR decomposition of these columns, we obtain an orthonormal basis. This basis has perfect condition number but, as Table 2 shows, it represents a rather poor approximation of $\operatorname{range}(\Pi_{<0})$.
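
The effect can be reproduced with a concrete banded matrix; the tridiagonal 1D Laplacian below is an illustrative stand-in for the matrices used in the example. The projector onto the lower half of the spectrum has a perfectly conditioned range, yet its first $n/2$ columns are numerically almost rank-deficient.

```python
import numpy as np

def projector_condition(n):
    """Condition number of the first n/2 columns of the spectral projector
    onto the lower half of the spectrum of the tridiagonal 1D Laplacian."""
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    lam, Q = np.linalg.eigh(A)                 # eigenvalues in ascending order
    V = Q[:, : n // 2]                         # eigenvectors below the median
    P = V @ V.T                                # spectral projector of rank n/2
    s = np.linalg.svd(P[:, : n // 2], compute_uv=False)
    return s[0] / s[-1]                        # condition of the first columns

print(projector_condition(80))                 # grows dramatically with n
```

The smallest singular values of this column block decay exponentially, so in double precision the computed condition number quickly saturates near the reciprocal of the unit roundoff.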

There exist a number of approaches that potentially avoid the problems observed in Example 1, such as applying a QR factorization with pivoting [GolVanL2013, Chapter 5.4] to $\Pi_{<\mu}$. None of these approaches has been realized in the HODLR format. In fact, techniques like pivoting across blocks appear to be incompatible with the format.

In the following, we develop a new algorithm for computing a basis for $\operatorname{range}(\Pi_{<\mu})$ in the HODLR format, which consists of two steps: (1) we first determine a set $C$ of well-conditioned columns of $\Pi_{<\mu}$ by performing a Cholesky factorization with local pivoting; as we will see below, the number of obtained columns is generally less, but not much less, than $\nu$. (2) A randomized algorithm is applied to complete the columns to a basis of $\operatorname{range}(\Pi_{<\mu})$.

### 4.1 Column selection by block Cholesky with local pivoting

The spectral projector $\Pi_{<\mu}$ is not only symmetric positive semidefinite but also idempotent. The (pivoted) Cholesky factorization of such matrices has particular properties.

###### Theorem 2 ([Higham1996, Theorem 10.9]).

Let $B \in \mathbb{R}^{n \times n}$ be a symmetric positive semidefinite matrix of rank $r$. Then there is a permutation matrix $P$ such that $P^T B P$ admits a Cholesky factorization:

$$P^T B P = R^T R, \qquad R = \begin{bmatrix} R_1 & R_2 \\ 0 & 0 \end{bmatrix},$$

where $R_1$ is an $r \times r$ upper triangular matrix with positive diagonal elements.

Note that, by the invertibility of $R_1$, the first $r$ columns of $P^T B P$ as well as the columns of $[R_1 \; R_2]^T$ form a basis for $\operatorname{range}(P^T B P)$. The latter turns out to be orthonormal if $B$ is idempotent.

###### Lemma 3 ([MolerStew1978, Corollary 1.2.]).

Suppose, in addition to the hypotheses of Theorem 2, that $B^2 = B$. Then

$$\begin{bmatrix} R_1 & R_2 \end{bmatrix} \begin{bmatrix} R_1^T \\ R_2^T \end{bmatrix} = I_r.$$

The algorithm described in [Higham1996, Chapter 10] for realizing Theorem 2 chooses the maximal diagonal element as the pivot in every step of the standard Cholesky factorization algorithm. In turn, the diagonal elements of the Cholesky factor are monotonically decreasing, and it is safe to decide which ones are considered zero numerically. Unfortunately, this algorithm, which will be denoted by cholp in the following, cannot be applied to $\Pi_{<\mu}$ because the diagonal pivoting strategy destroys the HODLR format. Instead, we use cholp only for the (dense) diagonal blocks of $\Pi_{<\mu}$.

To illustrate the idea of our algorithm, we first consider a general symmetric positive semidefinite HODLR matrix $M$ of level $p = 1$, which takes the form

$$M = \begin{pmatrix} M_{11} & U_1 V_2^T \\ V_2 U_1^T & M_{22} \end{pmatrix}$$

with dense diagonal blocks $M_{11}$, $M_{22}$. Applying cholp to $M_{11}$ gives a decomposition $P_1^T M_{11} P_1 = R_{11}^T R_{11}$, with the diagonal elements of $R_{11}$ decreasing monotonically. As $M$, and in turn also $M_{11}$, will be chosen as a principal submatrix of $\Pi_{<\mu}$, Lemma 3 implies that $\|M_{11}\|_2 \le 1$. In particular, the diagonal elements of $R_{11}$ are bounded by one. Let $s_1$ denote the number of diagonal elements not smaller than a prescribed threshold $\delta$. As will be shown in Lemma 4 below, choosing $\delta$ sufficiently close to one ensures that the selected columns are well-conditioned. Letting $\pi_1$ denote the permutation associated with $P_1$ and setting $C_1 = \{\pi_1(1), \ldots, \pi_1(s_1)\}$, we have

$$M_{11}(C_1, C_1) = R_{11}(1{:}s_1, 1{:}s_1)^T\, R_{11}(1{:}s_1, 1{:}s_1).$$

The Schur complement of this matrix in $M$ (without considering the rows and columns neglected in the first block) is given by

$$S = M_{22} - V_2 U_1(C_1,:)^T\, M_{11}(C_1,C_1)^{-1}\, U_1(C_1,:) V_2^T = M_{22} - \widetilde R_{12}^T \widetilde R_{12}, \tag{8}$$

where the rank of $\widetilde R_{12}$ is not larger than the rank of $U_1 V_2^T$. We again apply cholp to $S$ and retain only the diagonal elements of the Cholesky factor $R_{22}$ that are larger than or equal to $\delta$. Letting $C_2$ denote the corresponding $s_2$ indices and setting $C = C_1 \cup (n_1 + C_2)$, where $n_1$ is the size of $M_{11}$, we obtain the factorization

$$M(C, C) = \widetilde R^T \widetilde R \quad \text{with} \quad \widetilde R = \begin{bmatrix} R_{11}(1{:}s_1, 1{:}s_1) & \widetilde R_{12}(:, C_2) \\ 0 & R_{22}(1{:}s_2, 1{:}s_2) \end{bmatrix}.$$

For a general HODLR matrix, we proceed recursively in an analogous fashion, with the difference that we now form submatrices of HODLR matrices (see section 3.2) and the operations in (8) are executed in the HODLR arithmetic.

The procedure described above leads to Algorithm 3. Based on the complexity of the operations stated in Table 1, the cost of the algorithm applied to an $n \times n$ spectral projector is linear-polylogarithmic in $n$. In Line 10 we update a HODLR matrix with a matrix given by its low-rank representation. This operation essentially corresponds to the addition of two HODLR matrices: recompression is performed when computing all off-diagonal blocks, while the dense diagonal blocks are updated by a dense matrix of the corresponding size. In Line 10 we also enforce symmetry in the Schur complement.
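
A dense analogue of the selection strategy can be sketched as follows. Note that it uses global diagonal pivoting via cholp and then thresholds at $\delta$, rather than the block-local pivoting of Algorithm 3, so it only illustrates the underlying idea; names and defaults are illustrative.

```python
import numpy as np

def cholp(B, tol=1e-12):
    """Pivoted Cholesky for symmetric positive semidefinite B: returns a
    permutation p and upper triangular R with B[np.ix_(p, p)] ~= R.T @ R
    and monotonically decreasing diagonal of R."""
    n = B.shape[0]
    p = np.arange(n)
    R = np.zeros((n, n))
    d = np.diag(B).astype(float).copy()        # remaining diagonal entries
    for j in range(n):
        i = j + np.argmax(d[p[j:]])            # largest remaining diagonal
        p[[j, i]] = p[[i, j]]
        R[:j, [j, i]] = R[:j, [i, j]]          # keep R aligned with pivots
        if d[p[j]] <= tol:                     # numerical rank reached
            break
        R[j, j] = np.sqrt(d[p[j]])
        row = (B[p[j], p[j + 1:]] - R[:j, j] @ R[:j, j + 1:]) / R[j, j]
        R[j, j + 1:] = row
        d[p[j + 1:]] -= row ** 2
    return p, R

def select_columns(Pi, delta=0.5):
    """Keep the columns of the projector Pi whose pivoted-Cholesky
    diagonal element is at least delta."""
    p, R = cholp(Pi)
    return np.sort(p[np.diag(R) >= delta])
```

On an exactly idempotent projector, the first $r$ rows of the resulting factor are orthonormal, as predicted by Lemma 3.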

#### 4.1.1 Analysis of Algorithm 3

The indices $C$ selected by Algorithm 3 applied to $\Pi_{<\mu}$ need to attain two goals: (1) $\Pi_{<\mu}(:, C)$ has moderate condition number, (2) $|C|$ is not much smaller than the rank of $\Pi_{<\mu}$. In the following analysis, we show that the first goal is met when choosing $\delta$ sufficiently close to one. The attainment of the second goal is demonstrated by the numerical experiments in section 6.

Our analysis needs to take into account that Algorithm 3 is affected by errors due to truncation in the HODLR arithmetic. On the one hand, the input matrix, that is, the spectral projector computed by Algorithm 2, is not exactly idempotent:

$$\Pi_{<\mu}^2 = \Pi_{<\mu} + F, \tag{9}$$

with a symmetric perturbation matrix $F$ of small norm. On the other hand, the incomplete Cholesky factor $\widetilde R$ returned by Algorithm 3 is inexact as well:

$$\Pi_{<\mu}(C,C) = \widetilde R^T \widetilde R + E, \tag{10}$$

with another symmetric perturbation matrix $E$ of small norm. For a symmetric matrix $\Pi_{<\mu}$ satisfying (9), Theorem 2.1 in [MolerStew1978] shows that

$$\|\Pi_{<\mu}\|_2 \le 1 + \|F\|_2. \tag{11}$$

The following lemma establishes a bound on the norm of the inverse of $\Pi_{<\mu}(C,C)$.

###### Lemma 4.

With the notation introduced above, set $r = |C|$ and $\varepsilon_H = \|E\|_2 + \|F\|_2$, and suppose that $\delta^2 > 1 - 1/r + \varepsilon_H$. Then
$$\|\Pi_{<\mu}(C,C)^{-1}\|_2 \le \frac{1}{r} \cdot \frac{1}{\delta^2 - 1 + 1/r - \varepsilon_H}.$$

###### Proof.

Using (10) and (11), we obtain

$$\|\widetilde R^T \widetilde R\|_2 \le \|\Pi_{<\mu}(C,C)\|_2 + \|E\|_2 \le \|\Pi_{<\mu}\|_2 + \|E\|_2 \le 1 + \varepsilon_H.$$

We now decompose $\widetilde R = D + T$, such that $D$ is diagonal with $\delta \le d_{ii} \le 1$ and $T$ is strictly upper triangular. Then

$$\|D^2 + T^T D + D T + T^T T\|_2 \le 1 + \varepsilon_H.$$

Because the matrix on the left is symmetric, this implies

$$\lambda_{\max}(D^2 + T^T D + D T + T^T T) \le 1 + \varepsilon_H \;\Rightarrow\; \lambda_{\max}(T^T D + D T + T^T T) \le 1 - \delta^2 + \varepsilon_H.$$

On the other hand,

$$\lambda_{\min}(T^T D + D T + T^T T) \ge \lambda_{\min}(T^T D + D T) \ge -(r-1)\,\lambda_{\max}(T^T D + D T) \ge -(r-1)(1 - \delta^2 + \varepsilon_H),$$

where the second inequality uses that the trace of $T^T D + D T$ is zero and hence its eigenvalues sum up to zero. In summary,

$$\|T^T D + D T + T^T T\|_2 \le (r-1)(1 - \delta^2 + \varepsilon_H),$$

and, with $\widetilde E := T^T D + D T + T^T T + E$, we have $\|\widetilde E\|_2 \le (r-1)(1 - \delta^2 + \varepsilon_H) + \|E\|_2$. This completes the proof because

$$\|\Pi_{<\mu}(C,C)^{-1}\|_2 \le \|D^{-2}\|_2\, \|(I + D^{-2}\widetilde E)^{-1}\|_2 \le \frac{1}{\delta^2 - (r-1)(1-\delta^2) - r\varepsilon_H} = \frac{1}{r} \cdot \frac{1}{\delta^2 - 1 + 1/r - \varepsilon_H},$$

where the inverse exists under the conditions of the lemma. ∎

The following theorem shows that the columns selected by Algorithm 3 have an excellent condition number if $\delta$ is sufficiently close to one and the perturbations introduced by the HODLR arithmetic remain small.

###### Theorem 5.

Let $C$ denote the set of indices returned by Algorithm 3 and suppose that the conditions (9) and (10) as well as the condition of Lemma 4 are satisfied. Then it holds for the 2-norm condition number of $\Pi_{<\mu}(:, C)$ that

$$\kappa(\Pi_{<\mu}(:,C)) \le \frac{1}{r} \cdot \frac{1 + \varepsilon_H}{\delta^2 - 1 + 1/r - \varepsilon_H} = 1 + 2r(1-\delta) + O\bigl((1-\delta)^2 + \varepsilon_H\bigr).$$
###### Proof.

By definition, $\kappa(\Pi_{<\mu}(:,C)) = \|\Pi_{<\mu}(:,C)\|_2\, \|\Pi_{<\mu}(:,C)^\dagger\|_2$. From (11) we get

$$\|\Pi_{<\mu}(:,C)\|_2 \le \|\Pi_{<\mu}\|_2 \le 1 + \|F\|_2. \tag{12}$$

To bound the second factor, we note that $\Pi_{<\mu}(:,C)^T\, \Pi_{<\mu}(:,C) = (\Pi_{<\mu}^2)(C,C) = \Pi_{<\mu}(C,C) + F(C,C)$ and apply Lemma 4. Using the two bounds, the statement follows. ∎

The condition of Lemma 4, $\delta^2 > 1 - 1/r + \varepsilon_H$, requires $\delta$ to be very close to one. We conjecture that this condition can be weakened substantially, possibly even to a constant distance $1 - \delta$ independent of $r$. The latter is what we observe in the numerical experiments; choosing $\delta$ constant and letting $r$ grow does not lead to a deterioration of the condition number.

### 4.2 Range correction

As earlier, let $C$ denote a set of indices obtained by Algorithm 3 for a threshold $\delta$, and set $r = |C|$. We recall that the dimension of the column space of $\Pi_{<\mu}$ can be computed easily, knowing that $\nu = \operatorname{trace}(\Pi_{<\mu}) = \operatorname{rank}(\Pi_{<\mu})$. If $r = \nu$, then it only remains to perform an orthogonalization to get an orthonormal basis of $\operatorname{range}(\Pi_{<\mu})$. However, depending on the choice of $\delta$, in practice it can occur that the cardinality of $C$ is smaller than $\nu$, which implies that the selected columns cannot span the column space of $\Pi_{<\mu}$. In this case additional vectors need to be computed to obtain a complete orthonormal basis for $\operatorname{range}(\Pi_{<\mu})$.

An orthonormal basis for $\operatorname{range}(\Pi_{<\mu}(:,C))$ in the HODLR format can be computed using a method suggested in [Lintner2002], and it is given as

$$\Pi_{<\mu}(:,C) *_{\mathcal H} \widetilde R^{-1}. \tag{13}$$

The biggest disadvantage of the method in [Lintner2002] is the loss of orthogonality in badly conditioned problems, caused by the squaring of the condition number when computing the Cholesky factorization $\Pi_{<\mu}(C,C) = \widetilde R^T \widetilde R$. However, choosing only a well-conditioned subset of columns of $\Pi_{<\mu}$ allows us to avoid badly conditioned problems, and thus prevents a potential loss of orthogonality in (13).

In case $r < \nu$, we complete the basis (13) to an orthonormal basis for $\operatorname{range}(\Pi_{<\mu})$ by computing an orthonormal basis of the orthogonal complement of $\operatorname{range}(\Pi_{<\mu}(:,C))$ in $\operatorname{range}(\Pi_{<\mu})$. First we identify this orthogonal complement.

###### Lemma 6.

If $\mathcal{V}^\perp$ denotes the orthogonal complement of a subspace $\mathcal{V} \subseteq \mathbb{R}^n$, then

$$\mathcal{R}^\perp_{\Pi_{<\mu},C} := \bigl(\operatorname{Range}(\Pi_{<\mu}(:,C))\bigr)^\perp \cap \operatorname{Range}(\Pi_{<\mu})$$

is the orthogonal complement of $\operatorname{Range}(\Pi_{<\mu}(:,C))$ in the vector space $\operatorname{Range}(\Pi_{<\mu})$. Moreover, $\dim \mathcal{R}^\perp_{\Pi_{<\mu},C} = \nu - r$.

###### Proof.

The statements follow directly from the definition of $\mathcal{R}^\perp_{\Pi_{<\mu},C}$. ∎

Using (13) we construct an orthogonal projector

$$P_{C^\perp} = I - \bigl(\Pi_{<\mu}(:,C) *_{\mathcal H} \widetilde R^{-1}\bigr) *_{\mathcal H} \bigl(\Pi_{<\mu}(:,C) *_{\mathcal H} \widetilde R^{-1}\bigr)^T \tag{14}$$

onto $\bigl(\operatorname{Range}(\Pi_{<\mu}(:,C))\bigr)^\perp$. From (13) it readily follows that $\mathcal{R}^\perp_{\Pi_{<\mu},C} = \operatorname{Range}(P_{C^\perp}\Pi_{<\mu})$. Thus computing an orthonormal basis for $\operatorname{Range}(P_{C^\perp}\Pi_{<\mu})$ will allow us to obtain a complete orthonormal basis for $\operatorname{range}(\Pi_{<\mu})$.

To this end, we employ a randomized algorithm [HalkoMartinsson2011] to compute an orthonormal basis of dimension $\nu - r$ for $\mathcal{R}^\perp_{\Pi_{<\mu},C}$: (1) we first multiply $P_{C^\perp}\Pi_{<\mu}$ with a random matrix $\Omega \in \mathbb{R}^{n \times (\nu - r + q)}$, where $q$ is an oversampling parameter; (2) we compute a QR decomposition of the result. As the singular values of $\Pi_{<\mu}$ are either unity or zero, multiplication with the orthogonal projector $P_{C^\perp}$, generated by the linearly independent columns of $\Pi_{<\mu}(:,C)$, gives a matrix whose singular values are well separated as well. In particular, the resulting matrix has $\nu - r$ singular values equal to one, and the others equal to zero. Indeed, in exact arithmetic $P_{C^\perp}\Pi_{<\mu}$ has exact rank $\nu - r$, and then oversampling is not required [HalkoMartinsson2011]. However, due to the formatted arithmetic, we use a small oversampling parameter $q$ to improve accuracy. As we require only $\nu - r$ columns to complete the basis for $\operatorname{range}(\Pi_{<\mu})$, we finally keep only the first $\nu - r$ columns of the orthonormal factor.

A pseudo-code for computing a complete orthonormal basis for $\operatorname{range}(\Pi_{<\mu})$ is given in Algorithm 4. Note that $\Pi_{<\mu}(:,C)$ is a rectangular HODLR matrix, obtained by extracting the columns with indices $C$ from a HODLR matrix, as explained in section 3.2. This implies that the complexity of the operations stated in Table 1 carries over to the operations involving HODLR matrices in Algorithm 4. The complexity of the algorithm also depends on the number $\nu - r$ of missing basis vectors. However, in our experiments we observe that $\nu - r$ is very small relative to $\nu$ for the choice of $\delta$ we use, which makes the cost of the operations in Line 3 and Line 4 negligible. In this setting, the overall complexity of Algorithm 4 is governed by solving a triangular system in Line 5 or Line 7, i.e., it is linear-polylogarithmic in $n$.