LU factorization with errors *

We present new algorithms to detect and correct errors in the lower-upper factorization of a matrix, or the triangular linear system solution, over an arbitrary field. Our main algorithms do not require any additional information or encoding other than the original inputs and the erroneous output. Their running time is softly linear in the dimension times the number of errors when there are few errors, smoothly growing to the cost of fast matrix multiplication as the number of errors increases. We also present applications to general linear system solving.



page 1

page 2

page 3

page 4


Error correction in fast matrix multiplication and inverse

We present new algorithms to detect and correct errors in the product of...

Fast and Exact Matrix Factorization Updates for Nonlinear Programming

LU and Cholesky matrix factorization algorithms are core subroutines use...

Solving Tall Dense Linear Programs in Nearly Linear Time

In this paper we provide an Õ(nd+d^3) time randomized algorithm for solv...

Parsing Linear Context-Free Rewriting Systems with Fast Matrix Multiplication

We describe a matrix multiplication recognition algorithm for a subset o...

Polynomial Linear System Solving with Errors by Simultaneous Polynomial Reconstruction of Interleaved Reed-Solomon Codes

In this paper we present a new algorithm for Polynomial Linear System So...

On Nondeterministic Derandomization of Freivalds' Algorithm: Consequences, Avenues and Algorithmic Progress

Motivated by studying the power of randomness, certifying algorithms and...

Algorithms for Interference Minimization in Future Wireless Network Decomposition

We propose a simple and fast method for providing a high quality solutio...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The efficient detection and correction of computational errors is an increasingly important issue in modern computing. Such errors can result from hardware failures, communication noise, buggy software, or even malicious servers or communication channels.

The first goal for fault-tolerant computing is verification. Freivalds presented a linear-time algorithm to verify the correctness of a single matrix product [18]. Recently, efficient verification algorithms for a wide range of computational linear algebra problems have been developed [24, 14, 15, 16].

Here we go further and try to correct any errors that arise. This approach is motived by the following scenarios:

Large scale distributed computing

In high-performance computing, the failure of some computing nodes (fail stop) or the corruption of some bits in main memory by cosmic radiation (soft errors) become relevant. The latter type can be handled by introducing redundancy and applying classical error correction, either at the hardware level (e.g., with ECC RAM), or integrated within the computation algorithm, as in many instances of Algorithm Based Fault Tolerance (ABFT) [22, 11, 9, 4].

Outsourcing computation

When running some computation on one or several third-party servers, incorrect results may originate from either failure or malicious corruption of some or all of the third parties. The result obtained is then considered as an approximation of the correct result.

Intermediate expression swell

It often happens that symbolic computations suffer from requiring large temporary values, even when both the input and output of a problem are small or sparse. It can then be more efficient to use special methods that are able to determine or guess the sparsity pattern and exploit this in the computation. Sparse polynomial and rational function interpolation has been developed

[31, 2, 23, 33]

as such a technique. The connection with error correction comes from the fact that we may regard a sparse output as the perturbation of some trivial approximate solution, such as zero or the identity matrix.

Fault-tolerant computer algebra

Some recent progress on fault tolerant algorithms has been made for Chinese remaindering [21, 28, 3], system solving [5, 25], matrix multiplication and inversion [30, 20, 32], and function recovery [8, 26].

1.1 Our setting

In this paper, we focus on LU-factorization and system solving. We will assume the input (such as a matrix to be factored) is known, as well as an approximate output (such as a candidate LU-factorization) which may contain errors. We seek efficient algorithms to recover the correct outputs from the approximate ones.

Our work can be seen as part of a wider effort to understand how techniques for fault tolerance without extra redundancy extend beyond basic operations such as matrix multiplication. It turns out that the complexities for other linear algebra problems are extremely sensitive to the precise way they are stated. For instance, in the case of system solving, the complexities for error-correction are quite different if we assume the LU-factorization to be returned along with the output, or not. The LU-factorization process itself is very sensitive to pivoting errors; for this reason we will assume our input matrix to admit a generic rank profile (GRP).

From an information theoretic point of view, it should be noted that all necessary input information in order to compute the output is known to the client. The approximate output is merely provided by the server in order to make the problem computationally cheaper for the client. In theory, if the client has sufficient resources, he could perform the computation entirely by himself, while ignoring the approximate output. Contrary to what happens in the traditional theory of error correcting codes, it is therefore not required to transmit the output data in a redundant manner (in fact, we might even not transmit anything at all).

In our setting, unlike classical coding theory, it is therefore more appropriate to measure the correction capacity in terms of the amount of computational work for the decoding algorithm rather than the amount of redundant information.

We also put no a priori restriction on the number of errors: ideally, the complexity should increase with the number of errors and approach the cost of the computation without any approximate output when the number of errors is maximal.

As a consequence, the error-correcting algorithms we envision can even be used in an extreme case of the second scenario: a malicious third party might introduce errors according to patterns that are impossible to correct using traditional bounded distance decoding approaches. Concerning the third scenario, the complexity of our error-correcting algorithms is typically sensitive to the sparsity of a problem, both in the input and output.

Another general approach for error correction is discussed in Section 2: if we allow the server to communicate some of the intermediate results, then many linear algebra operations can be reduced in a simple way to matrix multiplication with error correction.

1.2 General outline and main results

General notations. Throughout our paper, stands for an effective field with elements. We write for the set of matrices with entries in and for the number of non-zero entries of a matrix . The soft-Oh notation is the same as the usual big-Oh notation but ignoring sub-logarithmic factors: if and only if for cost functions  and .

Let be a constant between and such that two matrices can be multiplied using . In practice, we have for small dimensions  and for larger dimensions, using Strassen multiplication (the precise threshold depends on the field ; Strassen multiplication typically becomes interesting for of the order of a few hundred). The best asymptotic algorithm currently gives  [19]. From a practical point of view, the notation indicates whether an algorithm is able to take advantage of fast low-level matrix multiplication routines.

Generic solution. In Section 2, we first describe a generic strategy for fault-tolerant computations in linear algebra. However, this strategy may require the server to communicate certain results of intermediate computations. In the remainder of the paper, we only consider a stronger form of error correction, for which only the final results need to be communicated.

LU-factorization. Let

be an invertible matrix with generic rank profile (GRP). Let

be (resp.) lower- and upper-triangular matrices that are “close to” the LU-factorization of : . Specifically, suppose there exist sparse upper- and lower-triangular error matrices with

In the following we write the (possibly) faulty matrices in calligraphic fonts. We specify that has 1’s on the diagonal, so that is strictly lower-triangular whereas may have corrections on the diagonal or above it.

Given , the goal is to determine the true and as efficiently as possible. Our main result is a probabilistic Monte Carlo algorithm for doing this in time

for any constant probability of failure, where

is the number of nonzero terms in all input matrices, is the number of errors in or , and is the matrix dimension (see Theorem 3). Here we measure the complexity in terms of he number of operations in , while assuming that random elements in can be generated with unit cost. Note that because the number of errors cannot exceed the total dense size , this complexity is never larger than , which is the cost of re-computing the LU-factorization with no approximate inputs, up to logarithmic factors. In Section 3.4, we also show how to generalize our algorithm to rectangular, possibly rank deficient matrices (still under the GRP assumption).

System solving. In Section 4, we turn to the problem of solving a linear system , where is as above and is also given. Given only possibly-erroneous solution to such a system, we do not know of any efficient method to correct the potential errors. Still, we will show how errors can efficiently be corrected if we require the server to return a few extra intermediate results.

More precisely, if has few rows, then we require an approximate LU-factorization and an approximate solution to the triangular system . If has many rows (with respect to the number of columns), then we require an approximate LU-factorization and an approximate inverse of . Given these data, we give algorithms to correct all errors in similar running time as above, depending on the number of errors in all provided inputs.

2 Generic solution

In [24, § 5] a generic strategy is described for verifying the result of a linear algebra computation. The complexity of the verification is the same as the complexity of the actual computation under the assumption that matrix products can be computed in time . In this section, we show that a similar strategy can be used to correct errors.

Let be any algorithm which uses matrix multiplication. We assume that is deterministic; any random bits needed in the computation should be pre-generated by the client and included with the input.

The server runs algorithm and, each time it performs any matrix multiplication, it adds that product to a list. This list of all intermediate products is sent back to the client. We assume that there may be a small number of errors in any of the intermediate products, but that these errors do not propagate; that is, the total number of erroneous entries in any intermediate product is bounded by some .

Next, to correct errors, the client also runs algorithm , except that every time it needs to multiply matrices, it performs error correction instead, using the intermediate result sent by the server. All other operations (additions, comparisons, etc.) are performed directly by the client.

The total cost for the server is the same as the normal computation of without error correction. The cost for the client, as well as the communication, is the cost that the algorithm would have if matrix multiplication could be performed in time.

In particular, typical block matrix algorithms such as LU factorization admit a worst case complexity of for the server and only for the client (including communications).

The goal of this paper is to perform better, for instance when is a bound on the number of errors only in and and not of all intermediate computations.

3 Block recursive algorithm

We will now describe an error cleaning algorithm emulating the steps of an LU decomposition algorithm, where each computation task is replaced by an error cleaning task.

The cleaning algorithm to be used needs to satisfy the following properties:

  1. it has to be a block algorithm, gathering most operations into matrix multiplications where error cleaning can be efficiently performed by means of sparse interpolation, as in [30, 32];

  2. it has to be recursive in order to make an efficient usage of fast matrix multiplication.

  3. each block operation must be between operands that are submatrices of either the input matrix or the approximate and factors. Indeed, the only data available for the error correction are these three matrices: the computation of any intermediate results that cannot directly be extracted from these matrices would be too expensive for achieving the intended complexity bounds.

In the large variety of LU decomposition algorithms, iterative and block iterative algorithms range in three main categories depending on the scheduling of the update operation (see [10, § 5.4] and [12] for a description): right-looking, left-looking and Crout.

The right-looking variant updates the trailing matrix immediately after a pivot or a block pivot is found, hence generating blocks containing intermediate results, not satisfying (3). Differently, the left-looking and the Crout variants proceed by computing directly the output coefficients in and following a column shape (left-looking) or arrow-head (Crout) frontier.

Figure 1: Access pattern of the left-looking (left), Crout (center) and right-looking variants of an LU factorization. Diagonal stripes represent read-only accesses, crossed stripes read-write accesses.

Figure 1 summarizes these 3 variants by exposing the memory access patterns of one iteration.

The left-looking and the Crout schedules consist in delaying the computation of the Gauss updates until the time where the elimination front deals with the location under consideration. This is precisely satisfying Condition (3). However these two schedules are usually described in an iterative or block iterative setting. To the best of our knowledge, no recursive variant has been proposed so far. We introduce in Section 3.1 a recursive Crout LU decomposition algorithm on which we will base the error-correction algorithm of section 3.3. Interestingly, we could not succeed in writing a recursive version of a left-looking algorithm preserving Condition (3).

We emphasize that, although our eventual error correction algorithm will follow the recursive Crout variant, no assumption whatsoever is made on the algorithm used (e.g., by a remote server) to produce the approximate LU decomposition used as input.

3.1 Recursive Crout LU decomposition

Algorithm 1 is a presentation of a Crout recursive variant of Gaussian elimination, for a generic rank profile (GRP) matrix. It incorporates the delayed update schedule of the classical block iterative Crout algorithm [10, 12] into a recursive algorithm dividing both row and column dimensions. Note that due to the delayed updates, the recursive algorithm need to be aware of the coefficients in and previously computed, hence the entirety of the working matrix has to be passed as input, contrarily to most recursive algorithms [6, 17]. In a normal context, this algorithm could work in-place, overwriting the input matrix with both factors and as the algorithm proceeds.

In our error-correcting context, the input will eventually consist not only of but also of the approximate and . Therefore, in Algorithm 1 below, we treat the matrix as an unmodified input and fill in the resulting and into a separate matrix . In Algorithm 3, this matrix will initially contain the approximate and which will be overwritten by the correct and .

The main work of the Crout decomposition consists of dot products (in the base case), matrix multiplications and triangular solves. We define TRSM as a triangular system solve with matrix right-hand side, with some variants depending whether the triangular matrix is lower (’L’) or upper (’U’) triangular, and whether the triangular matrix is on the left (’L’, for ) or on the right (’R’, for for ). These always work in-place, overwriting the right-hand side with the solution. For instance:

  • is a right solve with upper-triangular which transforms to such that .

  • is a left solve with lower-triangular which transforms to such that .

In the algorithms, we use the subscript to denote “indices 2 and 3”, so for example and

1: is
2: (resp. ) is unit lower (resp. upper) triangular
3: is
4: has GRP
5: such that
6:if  then
7:      dot product
is implicitly
9:     Decompose with ,
10:     Split
11:     Split
13:     Here
18:     Here
20:     Here
21:end if
Algorithm 1 Crout
Theorem 1.

overwrites with a complete LU factorization of in operations.


Correctness is proven by induction on . For we have

For , we have

After the first recursive call in step 12, we have . In step 15, we ensured that . Similarly, from step 17. Finally, after the second recursive call, we have . This concludes the proof that at the end of the algorithm.

The complexity bound stems from the fact that the dot products in step 7 cost overall, and that the matrix multiplications and system solves require  [13]. ∎

3.2 Error-correcting triangular solves

The cost of Algorithm 1 is dominated by matrix-matrix products of off-diagonal blocks of and and triangular solving in steps 1417. The first task in adapting this algorithm for error correction is to perform error correction in this triangular solving step, treating the right-hand side as an unevaluated black box matrix in order to avoid the matrix-matrix multiplication.

We explain the process to correct errors in (an equivalent process works in the transpose for ): recall steps 14 and 15 of Algorithm 1: and . Mathematically, these steps perform the computation: . At this point in the error correction, is part of the original input matrix while , , and have already been corrected by the recursive calls.

The idea is to be able to correct the next parts of an approximate , namely , without recomputing it. Algorithm 2 below does this following the approach of [32]: for total errors within erroneous columns, less than columns can have more than errors. Therefore, we start with a candidate for , then try to correct errors per erroneous column. If is correct, then they will be corrected in fewer than iterations. If fewer than columns are corrected on some step, this indicates that the guess for was too low, so we double the guess for and continue. This is shown in Algorithm 2, where as previously mentioned, a normal font is an already correct matrix block and a calligraphic font denotes a matrix block to be corrected. We also recall that stands for the number of nonzero elements in , and define ColSupport of a matrix to be the indices of columns of with any nonzero entries.

Due to the crucial use of sparse interpolation, we require a high-order element in the underlying field . If no such exists, we simply replace by an extension field. The cost of computing in such an extension field will only induce a logarithmic overhead.

1: is
2: is presented as an unevaluated blackbox where the inner dimension between and is
3: is invertible upper triangular
4:Failure bound .
5: is updated in-place s.t.
6: how many errors exist
7: how many errors have been corrected
8:if then   () end if
9:Pick of order in , with precomputed
14:     Pick uniformly at random.
16:      []1[]Upper[]1[]Right[]1[]Trsm
columns of with errors
18:     Clear any entries of from columns
19:     Update , 
20:     if then end if
too many errors; must be wrong
23:      unevaluated
24:      selects erroneous cols. of
26:      []1[]Upper[]1[]Right[]1[]Trsm
27:     Find s.t.  by sparse interpolation
Algorithm 2 []1[]Upper[]1[]Right[]1[]TrsmEC
Theorem 2.

For a failure bound , , , , , , with total non-zero entries

and errors in , Algorithm 2 is correct and runs in time


Without loss of generality, we may assume that in step 8. Indeed, in the contrary case, arithmetic in is only times more expensive than arithmetic in , which is absorbed by the soft-Oh of the claimed complexity bound.

Define as the correct output matrix such that and consider the beginning of some iteration through the loop. creftypecap 17 computes with high probability the column support of the remaining errors , viewing this as a blackbox

. We project this blackbox on the left with a block of vectors

using a Freivalds check [18].

Hence is a submatrix of a permutation matrix selecting the erroneous columns of . Selecting the same rows and columns in yields a matrix that is still triangular and invertible. With this, one can form a new blackbox

using the fact that . The columns of this blackbox are viewed as sparse polynomials whose evaluation at powers of are used to recover them via sparse interpolation.

Note that only the columns with at most nonzero entries are correctly recovered by the sparse interpolation, so some columns of may still be incorrect. However, any incorrect ones are discovered by the Freivalds check in the next round and never incorporated into . From the definition of , and the fact that sparse interpolation works correctly for all -sparse columns of , we know that every iteration results in either reducing by half, or doubling. Therefore the total number of iterations is at most .

According to [32, Lemma 4.1], the probability of failure in each Frievalds check in step 17 is at most . By the union bound, the probability of failure at any of the iterations is therefore at most , as required.

Now for the complexity, the calls to compute the ColSupport on creftypeplural 15 to 17 are all performed using sparse matrix-vector operations, taking . From [32, Lemma 6.1], the multiplication by the Vandermonde matrix in , , and all take operations.

The recovery of by batched multi-sparse interpolation takes operations [32, Theorem 5.2].

What remains are the cost of computing the following:

  • The product , which costs using fast matrix multiplication.

  • The product , which costs

  • The subroutine []1[]Upper[]1[]Right[]1[]Trsm, which costs the same as it would be to multiply times , .

From the definition of we have . Let . Since and , all three costs are


Until the algorithm terminates, we always have , so (1) is at most , proving the first part of the in the complexity.

For the second part, observe that the number of erroneous columns must satisfy , which means that by the definition of . Then the cost in (1) is bounded above by . ∎

Remark 1.

Since the input matrix is not modified by the algorithm, some entries of may be defined implicitly — in particular, if the matrix is unit diagonal and the 1’s are not explicitly stored.

Remark 2.

Transposing the algorithm, we may also correct , where is lower triangular:

Remark 3.

There are two more triangular variants to consider. Computing []1[]Lower[]1[]Right[]1[]TrsmEC (or, with Remark 2, []1[]Upper[]1[]Left[]1[]TrsmEC) could be done exactly the same as in Algorithm 2, except that the two subroutine calls, lines 16 and 26, would be to []1[]Lower[]1[]Right[]1[]Trsm.

3.3 Correcting an invertible LU decomposition

We now have all the tools to correct an LU decomposition. We suppose that the matrix is non-singular and has generic rank profile (thus there exist unique and such that ). We are given possibly faulty candidate matrices , and want to correct them: for this we run Algorithm 1, but replace lines 14-17 by two calls to Algorithm 2. The point is to be able to have explicit submatrices of , , , and for the two recursive calls (no blackbox, nothing unevaluated there), and that all the base cases represent only a negligible part of the overall computations (the base case is the part of the algorithm that is recomputed explicitly when correcting). This is presented in Algorithm 3.

1: is
2: (resp. ) is unit lower (resp. upper) triangular
3: are unit lower/upper triangular
4: is
5: has GRP
6:Failure bound
7: such that
8:if  then
9:      dot product
is implicitly , must be correct
11:     Decompose with ,
12:     Split
13:     Split
15:     Here
is left unevaluated
is left unevaluated
18:     Here
20:     Now .
21:end if
Algorithm 3 CroutEC
Theorem 3.

For which has GRP, unit lower triangular and upper triangular, with total non-zero entries , failure bound , and errors in and , Algorithm CroutEC is correct and runs in time


Correctness follows from Theorems 1 and 2: we rewrite Algorithm 1, but replace the intermediate two matrix multiplications and two []1[][]1[][]1[]Trsms by two calls, with blackboxes, to Algorithm 2. Passing to all subroutines ensures that the total probability of failure in any Frievalds check in any []1[][]1[][]1[]TrsmEC is at most . Note that the shrinking does not affect the soft-oh complexity, since at the bottom level we will have , and is .

For the complexity, note that since has GRP, . The stated cost bound depends crucially on the following fact: in each level of recursive calls to CroutEC, each call is correcting a single diagonal block and uses only the parts of and above and left of that block.

This claim is true by inspection of the algorithm: the blocks and being corrected in the two recursive calls to CroutEC on creftypeplural 19 and 14 are clearly disjoint diagonal blocks. And we see also that the algorithm never uses the top-left part of , namely ; all arguments to the calls to []1[][]1[][]1[]TrsmEC on creftypeplural 17 and 16 are (disjoint) submatrices above and left of , as shown, e.g., in Figure 2.

Figure 2: All updates at the third recursion level at step 16. The figure indicates the locations of , , and .

With this understanding, we can perform the analysis. The work of the algorithm is entirely in the dot products in the base case, and the calls to []1[][]1[][]1[]TrsmEC in the recursive case.

For the base case dot products, these are to correct the diagonal entries of , using the diagonal of and disjoint, already-corrected rows of and columns of . Therefore the total cost of the dot products is .

For the rest, consider the th recursive level of calls to CroutEC, where . There will be exactly calls to []1[][]1[][]1[]TrsmEC on this level, whose inputs are all disjoint, from the claim above. Write etc. for the parameters in the th call to []1[][]1[][]1[]TrsmEC on level , for .

Every call to []1[][]1[][]1[]TrsmEC on the th level satisfies:

  • ;

  • ;

  • , so that is ;

  • because the submatrices are disjoint at the same recursive level; and

  • because each error is only corrected once.

Now we wish to compute the total cost of all calls to []1[][]1[][]1[]TrsmEC, which by Theorem 2 and the notation just introduced, is soft-oh of

Taking the first term of the sum, this is

which gives the first term in our stated complexity.

The second term of the sum simplifies to


Now observe that, by definition, each individual summand is less than or equal to both parts of the expression. Therefore we bound the sum of minima by the minimum of sums; that is, the previous summation is at most

Because each error is only corrected once, . Using Hölder’s inequality and , this yields

for all . Applying this to the second summation above gives

Then the entirety of (2) becomes just which gives the second term in the stated complexity. ∎

3.4 Correcting a rectangular, rank-deficient LU

For , assume first that is a rectangular matrix such that is square with GRP. Assume also that is an approximate LU decomposition. Then we may correct potential errors as follows:

  1. ; corrects

  2. . corrects

Second, if is rank deficient, but still GRP, then the first occurrence of a zero pivot , line 9 in Algorithm 3, reveals the correct rank: at this point , , and the upper (resp. left) part of (resp. ) are correct. It is thus sufficient to stop the elimination there, and recover the remaining parts of

using (corrected) triangular system solves, as follows:

  1. Let and , where is .

  2. CroutEC

  3. . corrects

  4. . corrects

4 System solving

One application of LU factorization is linear system solving. Say is invertible, and is . Ideally, correcting a solution to the linear system , should depend only on the errors in . Indeed, our algorithm []1[]Upper[]1[]Right[]1[]TrsmEC does exactly this in the special case that is upper-triangular.

Unfortunately, we do not know how to do this for a general non-singular using the previous techniques, as we for instance do not know of an efficient way to compute the nonzero columns of the erroneous entries in .

By adding some data to the solution (and therefore, unfortunately, potentially more errors), there are several possibilities:

  • Generically, a first solution is to proceed as in Section 2. One can use [24], but then the complexity depends not only on errors in , but on errors in all the intermediate matrix products.

  • A second solution is to invert the matrix using [32, Algorithm 6] (InverseEC) and then multiply the right-hand side using [32, Algorithm 5] (MultiplyEC). This requires some extra data to correct, namely a candidate inverse matrix, and the complexity depends on errors appearing now both in the system solution and in that inverse candidate matrix as follows:

    1. ;

    2. .

Now, computing an inverse, as well as correcting it, is more expensive than using an LU factorization: for the computation itself, the inverse is more expensive by a constant factor of (assuming classic matrix arithmetic), and for InverseEC the complexity of [32, Theorem 8.3] requires the fast selection of linearly independent rows using [7], which might be prohibitive in practice.

For these reasons, we prefer to solve systems using an LU factorization. The goal of the remainder of this section is to do this with a similar complexity as for Algorithm 3 and while avoiding to rely on the sophisticated algorithm from [7].

4.1 Small right-hand side

An intermediate solution, requiring the same amount of extra data as the version with the inverse matrix, but using only fast routines, can be as follows. Use as extra data a candidate factorization and a candidate intermediate right-hand side of dimension , such that are approximations to the true with . We simultaneously correct errors in , , , and as follows:

  1. CroutEC; and with

  2. ; with

  3. . with

Note, of course, that if the number of rows in is very small, say only , then it is faster to recover and only, by running CroutEC, and then compute and directly from the corrected and with classical TRSMs.

4.2 Large right-hand side

If the row dimension of is large with respect to the column dimension , then the matrix from above will be larger than . The client can instead ask the server to provide as extra data as a candidate for , to correct it with , and then to use and to correct directly:

  1. ; and with

  2. ;

  3. []1[]Lower[]1[]Right[]1[]TrsmEC,

with TrInvEC a variant of InverseEC sketched below as Algorithm 4 (where TRSV is a triangular system solve with a column vector and TRMV is a matrix-vector multiplication). This does not require the expensive algorithm of [7] to select independent columns, as the matrix is triangular.

Note that, in the call to []1[]Lower[]1[]Right[]1[]TrsmEC in the last step, the right-hand side is left unevaluated, just as in the calls to []1[][]1[][]1[]TrsmEC from Algorithm 3.

1:; via TRSV with , via TRMV with candidate