Low Rank Approximation at Sublinear Cost by Means of Subspace Sampling

06/11/2019
by   Victor Y. Pan, et al.
CUNY Law School

Low Rank Approximation (LRA) of a matrix is a hot research subject, fundamental for Matrix and Tensor Computations and Big Data Mining and Analysis. Computations with LRA can be performed at sub-linear cost, that is, by using far fewer arithmetic operations and memory cells than an input matrix has entries. Although every sub-linear cost algorithm for LRA fails to approximate the worst case inputs, we prove that our sub-linear cost variations of a popular subspace sampling algorithm output accurate LRA for a large class of inputs. Namely, they do so with high probability for a random input matrix that admits LRA. In other papers we proposed and analyzed sub-linear cost algorithms for other important matrix computations. Our numerical tests are in good accordance with our formal results.

1 Introduction

LRA Background. Low rank approximation (LRA) of a matrix is a hot research area of Numerical Linear Algebra (NLA) and Computer Science (CS) with applications to fundamental matrix and tensor computations and data mining and analysis (see the surveys [HMT11], [M11], [KS16], and [CLO16]). Matrices defining Big Data (e.g., unfolding matrices of multidimensional tensors) are frequently so immense that realistically one can access and process only a tiny fraction of their entries, although quite typically these matrices admit LRA, that is, are close to low rank matrices or, equivalently, have low numerical rank. One can operate with low rank matrices at sub-linear computational cost, that is, by using far fewer arithmetic operations and memory cells than an input matrix has entries, but can we compute LRA at sub-linear cost? Yes and no. No, because every sub-linear cost LRA algorithm fails even on the small input families of Appendix C. Yes, because our sub-linear cost variations of a popular subspace sampling algorithm output accurate LRA for a large class of inputs. Let us provide some details.

Subspace sampling algorithms compute LRA of a matrix by multiplying it by one or two random matrices of smaller size, commonly called test matrices or multipliers. Their output LRAs are nearly optimal whp provided that the multipliers are Gaussian, Rademacher's, SRHT, or SRFT matrices;1

1 Here and hereafter "SRHT" and "SRFT" are the acronyms for "Subsampled Random Hadamard and Fourier Transforms"; "Gaussian" stands for "standard Gaussian (normal) random"; Rademacher's matrices are filled with iid variables, each equal to 1 or -1 with probability 1/2.

Furthermore, these algorithms consistently output accurate LRA in their worldwide applications with these and some other random multipliers, all of which are multiplied by the input matrix at super-linear cost (see [TYUC17, Section 3.9], [HMT11, Section 7.4], and the bibliography therein).

Our modifications of these algorithms use sparse orthogonal (e.g., sub-permutation) multipliers,2 run at sub-linear cost, and, as we prove, whp output reasonably accurate dual LRA, that is, LRA of a random input admitting LRA; we deduce our error estimates under three distinct models of random matrix computations in Sections 4.1–4.3.

2 We define sub-permutation matrices as full-rank submatrices of permutation matrices.

How meaningful is our result? Our definitions of the three classes of random matrices of low numerical rank are quite natural for various real world applications of LRA but may look odd for some others. This, however, applies to any definition of that kind.

Our approach enables new insight into the subject, and our formal study is in good accordance with our numerical tests for both synthetic and real world inputs, some from [HMT11].

Our upper bounds on the output error of LRA of a matrix of low numerical rank exceed the optimal error bound by a bounded factor, but if this optimal bound is small enough we can apply iterative refinement of LRA running at sub-linear cost (see [PLa]).

As we have pointed out, any sub-linear cost LRA algorithm (and ours are no exception) fails on some families of hard inputs, but our analysis and tests show that the class of such inputs is narrow. We conjecture that it shrinks fast if we recursively apply the same algorithm with new multipliers; we propose some heuristic recipes for these recursive processes, and our numerical tests confirm their efficiency.

Impact of our study, its extensions and by-products:

(i) Our duality approach enables new insight into some fundamental matrix computations besides LRA: [PQY15], [PZ17a], and [PZ17b] provide formal support for the empirical efficiency of dual Gaussian elimination with no pivoting, while [PLb] proposes a sub-linear cost modification of Sarlós' algorithm of 2006 and then proves that whp it outputs a nearly optimal solution of the highly important problem of Linear Least Squares Regression (LLSR) provided that its input is random. Once again, this formal proof is in good accordance with the test results.

(ii) In [PLSZa] we proved that popular Cross-Approximation LRA algorithms running at sub-linear cost, as well as our simplified sub-linear cost variations of these algorithms, output accurate solutions of dual LRA whp; we also devised a sub-linear cost algorithm for the transformation of any LRA into its special form of CUR LRA, which is particularly memory efficient.

(iii) The paper [PLa] has proposed and elaborated upon sub-linear cost refinement of a crude but reasonably close LRA.

Related Works. The huge bibliography on LRA can be partly accessed via [M11], [HMT11], [KS16], [PLSZ17], [TYUC17], [OZ18], and the references therein. [PLSZ16] and [PLSZ17] were the first papers that provided formal support for accurate dual randomized LRA computations performed at sub-linear cost (in these papers such computations are called superfast). The earlier papers [PQY15], [PLSZ16], [PZ17a], and [PZ17b] studied duality for other fundamental matrix computations besides LRA, while the paper [PLb] has extended our study to a sub-linear cost dual algorithm for the popular problem of Linear Least Squares Regression and confirmed the accuracy of its solution by the results of numerical experiments.

Organization of the paper. In Section 2 we recall subspace sampling algorithms for LRA. In Sections 3 and 4 we prove deterministic and randomized error bounds, respectively, for our dual LRA algorithms running at sub-linear cost. In Section 5 we discuss multiplicative pre-processing and the generation of multipliers for both pre-processing and sampling. In Section 6 we cover our numerical tests. Appendix A is devoted to background on matrix computations. In Appendix B we prove our error bounds for dual LRA. In Appendix C we specify some small families of hard inputs for sub-linear cost LRA.

Some definitions. The concepts "large", "small", "ill-" and "well-conditioned", "near", "close", and "approximate" are usually quantified in context, as are the relationships "much less than" and "much greater than". "Flop" stands for "floating point arithmetic operation"; "iid" for "independent identically distributed"; "whp" for "with high probability". In context a "perturbation of a matrix" can mean a perturbation having a small relative norm. We deal with real matrices throughout, so that the Hermitian transpose of a matrix turns into its transpose, but most of our study can be readily extended to complex matrices; see some relevant results about complex Gaussian matrices in [E88], [CD05], [ES05], and [TYUC17].

2 LRA by means of subspace sampling

2.1 Four subspace sampling algorithms

Hereafter we use the spectral and the Frobenius matrix norms; a norm bound stated without specifying the norm applies to either of them. M^+ denotes the Moore–Penrose pseudo inverse of a matrix M.

Algorithm 2.1.

Column Subspace Sampling or Range Finder (see Remark 2.1).

Input:

An m×n matrix M and a target rank r.

Output:

Two matrices Q and B defining an LRA QB of M.

Initialization:

Fix an integer l, r ≤ l ≤ n, and an n×l matrix H of full rank l.

Computations:
  1. Compute the m×l matrix MH.

  2. Fix a nonsingular l×l matrix T and output the m×l matrix Q := MHT.

  3. Output an l×n matrix B := Q^+M.

Remark 2.1.

Let the matrix MH have full rank l. Then the output LRA QB does not depend on the choice of the matrix T, but its proper choice numerically stabilizes the computations of the algorithm. For l exceeding nrank(M) the matrix MH is ill-conditioned,3 but Q is orthogonal for T equal to the inverse of the R factor in the thin QR factorization of MH (cf. [HMT11, Algorithm 4.1]). It is also orthogonal if T is defined similarly by the factors of a rank-revealing QR factorization of MH.

3 nrank(M) denotes the numerical rank of M (see Appendix A.1).

Column Subspace Sampling turns into Column Subset Selection in the case of a sub-permutation multiplier H and turns into Row Subspace Sampling if it is applied to the transpose of M.
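For concreteness, here is a minimal NumPy sketch of Algorithm 2.1 under the choices discussed in Remark 2.1: the multiplier is a sub-permutation matrix, so that forming the sample amounts to selecting l columns, and the stabilizer comes from a thin QR factorization. The function and parameter names are illustrative (the paper's own tests used MATLAB), and the computation of the second factor by projection is super-linear; only the sampling step is sub-linear.

```python
import numpy as np

def range_finder(M, l, rng=None):
    """Sketch of Algorithm 2.1: column subspace sampling with a sub-permutation
    multiplier H and QR-based stabilization (cf. Remark 2.1).
    Returns Q (m x l, orthonormal columns) and B (l x n) with M ~ Q @ B."""
    rng = np.random.default_rng(rng)
    m, n = M.shape
    cols = rng.choice(n, size=l, replace=False)   # H: sub-permutation multiplier
    MH = M[:, cols]                               # sampling l columns costs O(m*l)
    Q, _ = np.linalg.qr(MH)                       # T from thin QR, so Q is orthogonal
    B = Q.T @ M                                   # projection; this step is super-linear
    return Q, B

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n, r = 300, 200, 10
    M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r input
    Q, B = range_finder(M, l=2 * r, rng=rng)
    print("relative error:", np.linalg.norm(M - Q @ B) / np.linalg.norm(M))
```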

Algorithm 2.2.

Row Subspace Sampling or Transposed Range Finder. See Remark 2.2.

Input:

As in Algorithm 2.1.

Output:

Two matrices A and R defining an LRA AR of M.

Initialization:

Fix an integer k, r ≤ k ≤ m, and a k×m matrix F of full numerical rank k.

Computations:
  1. Compute the k×n matrix FM.

  2. Fix a nonsingular k×k matrix S; then output the k×n matrix R := SFM.

  3. Output an m×k matrix A := MR^+.

Row Subspace Sampling turns into Row Subset Selection in the case of a sub-permutation multiplier F.

Remark 2.2.

Let the matrix FM have full rank k. Then the output LRA AR does not depend on the choice of the matrix S, but a proper choice of S numerically stabilizes the computations of the algorithm. For k exceeding nrank(M) the matrix FM is ill-conditioned, but R is orthogonal if S is defined by the factors of the thin LQ factorization of FM or by the factors of a rank-revealing LQ factorization of FM.

The following algorithm combines row and column subspace sampling. In the case where the stabilizers T and S are identity matrices it turns into the algorithm of [CW09, Theorems 4.7 and 4.8] and [TYUC17, Section 1.4], whose origin can be traced back to [WLRT08].

Algorithm 2.3.

Row and Column Subspace Sampling. See Remark 2.3.

Input:

As in Algorithm 2.1.

Output:

Two matrices Q and B defining an LRA QB of M.

Initialization:

Fix two integers k and l, r ≤ l ≤ n and r ≤ k ≤ m; fix two multipliers H of size n×l and F of size k×m of full numerical ranks and two nonsingular matrices T of size l×l and S of size k×k.

Computations:

1. Output the m×l matrix Q := MHT.

2. Compute the matrices SFM and SFQ.

3. Output the l×n matrix B := (SFQ)^+SFM.

Remark 2.3.

If the matrix FMH has full rank, then the output LRA QB does not depend on the choice of the matrices T and S, but their proper choice numerically stabilizes the computations of the algorithm. For k exceeding nrank(M) the matrix FM is ill-conditioned, but its rescaling SFM is orthogonal if S is defined by the factors of the thin LQ factorization of FM or by the factors of a rank-revealing LQ factorization of FM.

Remark 2.4.

By applying Algorithm 2.3 to the transpose of M we obtain Algorithm 2.4, which begins with column subspace sampling followed by row subspace sampling. We only study Algorithms 2.1 and 2.3 for the input M, but they turn into Algorithms 2.2 and 2.4 for the input given by the transpose of M.
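The following NumPy sketch gives one natural sub-linear cost reading of Algorithms 2.3 and 2.4 with sub-permutation multipliers: the sketches are then just submatrices of the input, and the output is the sketch of [CW09] and [TYUC17, Section 1.4] written in CUR form. Treat it as an illustration under these assumptions rather than as the authors' exact pseudocode.

```python
import numpy as np

def two_sided_sampling(M, k, l, rng=None):
    """Row and column subspace sampling (cf. Algorithm 2.3) with
    sub-permutation multipliers F (k x m) and H (n x l).
    Only l columns, k rows, and their k x l intersection of M are read,
    so the cost is sub-linear in m*n.  Returns C, U, R with M ~ C @ U @ R."""
    rng = np.random.default_rng(rng)
    m, n = M.shape
    rows = rng.choice(m, size=k, replace=False)   # F: selects k rows
    cols = rng.choice(n, size=l, replace=False)   # H: selects l columns
    C = M[:, cols]                                # M*H
    R = M[rows, :]                                # F*M
    core = M[np.ix_(rows, cols)]                  # F*M*H
    U = np.linalg.pinv(core)                      # nucleus (F*M*H)^+
    return C, U, R

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    m, n, r = 400, 300, 8
    M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
    C, U, R = two_sided_sampling(M, k=2 * r, l=2 * r, rng=rng)
    print("relative error:", np.linalg.norm(M - C @ U @ R) / np.linalg.norm(M))
```

For an input of exact rank r the output reproduces the input whp, since the k×l core then has the same rank r as the input.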

2.2 The known error bounds

Theorem 2.1.

(i) Let l − r ≥ 2 and apply Algorithm 2.1 with a Gaussian multiplier H. Then the expected output error norms exceed the optimal error norms of rank-r approximation of M only by moderate factors specified in [HMT11, Theorems 10.5 and 10.6].4

4 [HMT11, Theorems 10.7 and 10.8] estimate the output error norm in probability rather than in expectation.

(ii) Let l have order r log(r) and apply Algorithm 2.1 with an SRHT or SRFT multiplier H. Then whp the output error norms exceed the optimal error norms of rank-r approximation only by moderate factors (cf. [T11], [HMT11, Theorem 11.2]).

Clarkson and Woodruff prove in [CW09] that Algorithm 2.3 reaches the optimal error bound within a factor of 1 + ε whp if the multipliers F and H are Rademacher's matrices and if the sample sizes k and l are sufficiently large, having order r/ε and k/ε for a small positive ε, respectively.

Tropp et al. argue in [TYUC17, Section 1.7.3] that LRA is not practical if the numbers k and l of row and column samples are large; iterative refinement of LRA at sub-linear cost in [PLa] can be a partial remedy. [TYUC17, Theorem 4.3] shows that the output LRA of Algorithm 2.3 applied with Gaussian multipliers F and H satisfies the bound5

5 In words, the expected output error norm is within a specified factor of its minimum value, attained by the rank-r truncation of SVD; this factor is just 2 for an appropriate choice of k and l.

(2.1)
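The flavor of bound (2.1) is easy to check numerically: for Gaussian multipliers the error of the two-sided sketch stays within a modest factor of the optimal rank-r error given by the tail of the singular spectrum. The NumPy experiment below is only an illustration of this comparison; the sample sizes and the test matrix are arbitrary choices, not those of [TYUC17].

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, r = 200, 150, 10

# A test matrix with rapidly decaying singular values beyond index r.
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.concatenate([np.ones(r), 1e-4 / (1 + np.arange(n - r))])
M = U[:, :n] @ np.diag(s) @ V.T

# Optimal rank-r Frobenius error: the tail of the singular spectrum.
opt = np.sqrt(np.sum(s[r:] ** 2))

# Two-sided sketching (cf. Algorithm 2.3) with Gaussian multipliers F and H.
k, l = 4 * r, 2 * r
F = rng.standard_normal((k, m))
H = rng.standard_normal((n, l))
MH, FM, FMH = M @ H, F @ M, F @ M @ H
err = np.linalg.norm(M - MH @ np.linalg.pinv(FMH) @ FM)

print(f"optimal rank-{r} error: {opt:.3e}")
print(f"sketching error:        {err:.3e}  (ratio {err / opt:.2f})")
```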

3 Deterministic output error bounds for sampling algorithms

3.1 Deterministic error bounds of Range Finder

Theorem 3.1.

[HMT11, Theorem 9.1]. Suppose that Algorithm 2.1 has been applied to a matrix M with a multiplier H and let

(3.1)
(3.2)

be SVDs of the matrix M and of its rank-r truncation, respectively, and let them induce the splitting of the multiplier H into the blocks associated with the top and bottom right singular spaces of M. (The columns of the top block of the right singular matrix span the top right singular space of M.) Then the output error norm of Algorithm 2.1 satisfies

(3.3)

Combine bound (3.3) with the norm relationships among the blocks of these SVDs and obtain

(3.4)

It follows that the output LRA is optimal up to a bounded factor.

Next we deduce an upper bound on the output error norm in terms of quantities that can be computed or estimated at sub-linear cost: given a sufficiently sparse multiplier H we can compute one of these quantities at sub-linear cost, and in some applications reasonable upper estimates for the others are available.

Corollary 3.1.

Under the assumptions of Theorem 3.1 let the matrix H have full rank l. Then the output error norm of Algorithm 2.1 admits an upper bound expressed through the norms of the blocks defined in (3.1) and (3.2).

Proof.

Deduce the claimed matrix identity from (3.1) and (3.2). Recall that the matrix H has full rank l, apply Lemma A.1, recall that the singular factor involved is an orthogonal matrix, and obtain the norm bound. Substitute the definitions of the blocks of (3.1) and (3.2) and obtain the corollary. ∎

Corollary 3.2.

Under the assumptions of Corollary 3.1 suppose, in addition, that the multiplier H deviates little from an orthogonal matrix and that the tail of the singular spectrum of M is small. Then the bound of Corollary 3.1 simplifies accordingly.

Proof.

Lemma A.3 implies the first auxiliary norm bound. Consequently the second bound follows, by virtue of Lemma A.2, under the first additional assumption. Under the second additional assumption Lemma A.2 implies one more norm bound. Combine these bounds and obtain the claimed estimate; together with Corollary 3.1 this implies Corollary 3.2. ∎

For a given multiplier H we can compute the norms involved at sub-linear cost if H is sufficiently small or sparse. If also some reasonable upper bounds on the remaining quantities are known, then Corollary 3.2 implies a posteriori estimates for the output errors of Algorithm 2.1.

3.2 Impact of pre-multiplication on the errors of LRA (deterministic estimates)

Lemma 3.1.

[The impact of pre-multiplication on LRA errors.] Suppose that Algorithm 2.3 outputs the matrix QB and that the pre-multiplied matrix has full rank. Then bounds (3.5) and (3.6) relate the output error norm of Algorithm 2.3 to that of the column subspace sampling of Algorithm 2.1:

(3.5)
(3.6)
Proof.

Recall the definition of the output matrix QB of Algorithm 2.3 and notice that the stated rank assumption implies the key matrix identity. Therefore the claimed norm bound follows, and consequently (3.5) and (3.6) hold. ∎

We bounded the output error norm of Algorithm 2.1 in the previous subsection; next we bound the factors appearing in (3.5) and (3.6) at sub-linear cost for a fixed orthogonal matrix Q and a proper choice of a sparse multiplier F.

Theorem 3.2.

Applied to an orthogonal matrix, [P00, Algorithm 1] runs at sub-linear cost and outputs a sub-permutation matrix for which the norm appearing in (3.5) is bounded by a specified factor that grows slowly with the matrix size.

One can strengthen the deterministic bounds on this norm by computing sub-permutation matrices of somewhat larger sizes.

Theorem 3.3.

For sample sizes of at least a prescribed order and a fixed orthogonal multiplier, compute a sub-permutation multiplier by means of the deterministic algorithms of Osinsky, which run at sub-linear cost and support [O18, equation (1)]. Then the norm appearing in (3.5) is bounded accordingly.
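The algorithms of [P00] and of Osinsky are not reproduced here, but the role of a sub-permutation multiplier is easy to illustrate with a common stand-in, pivoted QR: applied to a short orthogonal matrix it selects l of its rows, which defines a sub-permutation matrix, and one can then inspect the norm of the inverse of the selected block, a quantity of the kind entering the bounds of this subsection. Pivoted QR is only a hedged substitute here and does not attain the bounds of Theorems 3.2 and 3.3.

```python
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(3)
n, l = 500, 20

# A fixed tall orthogonal matrix: Q factor of an n x l Gaussian matrix.
Q, _ = np.linalg.qr(rng.standard_normal((n, l)))

# Column selection on Q.T via pivoted QR; the first l pivots define
# a sub-permutation matrix P (n x l) that keeps l rows of Q.
_, _, piv = qr(Q.T, pivoting=True)
keep = piv[:l]
P = np.zeros((n, l))
P[keep, np.arange(l)] = 1.0

# The selected l x l block Q.T @ P should be well conditioned; the norm of
# its inverse is a quantity of the kind that appears in bounds like (3.5).
block = Q.T @ P
print("||(Q^T P)^{-1}|| =", np.linalg.norm(np.linalg.inv(block), 2))
```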

4 Accuracy of sub-linear cost dual LRA algorithms

Next we estimate the output errors of Algorithm 2.1 for a fixed orthogonal multiplier and two classes of random inputs of low numerical rank, in particular for the perturbed factor-Gaussian inputs of Definition A.1. These estimates formally support the observed accuracy of Range Finder with various dense multipliers (see [HMT11, Section 7.4] and the bibliography therein), but also with sparse multipliers, with which Algorithms 2.3 and 2.4 run at sub-linear cost. By applying the results of the previous section we extend these upper estimates on the output accuracy to variations of Algorithm 2.3 that run at sub-linear cost; then we extend them to Algorithm 2.4 by means of transposition of an input matrix.

Our estimates involve the norms of a Gaussian matrix and its pseudo inverse (cf. Appendix A.4).

Definition 4.1.

A matrix is Gaussian if its entries are iid standard Gaussian variables. Our estimates involve the expected values of the spectral and Frobenius norms of a Gaussian matrix and of its pseudo inverse; we recall the known bounds on these expected values in Appendix A.4 and use them for all relevant pairs of matrix dimensions.
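These norms are easy to examine empirically. The Monte Carlo sketch below estimates the expected spectral norm of a Gaussian matrix and the norms of its pseudo inverse; the bound E||G|| <= sqrt(m) + sqrt(n) and the identity E||G^+||_F^2 = n/(m-n-1) for m > n + 1 quoted in the comments are well-known facts (used, e.g., in [HMT11]) rather than restatements of Appendix A.4.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, trials = 200, 50, 200

spec, pinv_spec, pinv_fro2 = [], [], []
for _ in range(trials):
    G = rng.standard_normal((m, n))       # Gaussian m x n matrix
    Gp = np.linalg.pinv(G)                # its Moore-Penrose pseudo inverse
    spec.append(np.linalg.norm(G, 2))
    pinv_spec.append(np.linalg.norm(Gp, 2))
    pinv_fro2.append(np.linalg.norm(Gp, "fro") ** 2)

print("E||G||       ~", np.mean(spec), " (<= sqrt(m)+sqrt(n) =", np.sqrt(m) + np.sqrt(n), ")")
print("E||G^+||     ~", np.mean(pinv_spec))
print("E||G^+||_F^2 ~", np.mean(pinv_fro2), " (= n/(m-n-1) =", n / (m - n - 1), ")")
```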

Theorem 4.1.

[Non-degeneration of a Gaussian Matrix.] Let A and B be fixed matrices of full rank and let G be a Gaussian matrix of compatible size. Then the matrices G, AG, GB, and AGB have full rank with probability 1.

Assumption 4.1.

We simplify the statements of our results by assuming that a Gaussian matrix has full rank and ignoring the probability 0 of its degeneration.

4.1 Output errors of Range Finder for a perturbed factor-Gaussian input

Theorem 4.2.

[Errors of Range Finder for a perturbed factor-Gaussian matrix.] Apply Algorithm 2.1 to a small-norm perturbation of a right factor-Gaussian matrix of rank r satisfying

(4.1)

(i) Then the output error norm is bounded in terms of the perturbation norm and the norms of the Gaussian factor and of its pseudo inverse.

(ii) Let the bound of claim (i) hold with a probability close to 1 and let the integer l be at least moderately large. Then with a probability close to 1 the output error norm exceeds the optimal error norm of rank-r approximation only by a moderate factor, which is close to 1 if the perturbation is small. Here and hereafter we rely on the expected values of the norms of a Gaussian matrix and of its pseudo inverse (cf. Appendix A.4).

Proof.

We prove claim (i) in Appendix B. Let us deduce claim (ii). Recall from Theorems A.4 and A.5 that the norms of a Gaussian matrix and of its pseudo inverse are strongly concentrated about their expected values. Substitute these expected values into the bound of claim (i), apply Jensen's inequality, and deduce claim (ii) of the theorem. ∎

4.2 Output errors of Range Finder near a matrix with a random singular space

Next we prove similar estimates under an alternative randomization model for dual LRA.

Theorem 4.3.

[Errors of Range Finder for an input with a perturbed random singular space.] Let the right singular factor of the matrix M in Theorem 3.1 be the Q factor in a QR factorization of a normalized Gaussian matrix and let the multiplier H be any matrix of full rank l.

(i) Then the output error norm is bounded in terms of the norms of a Gaussian matrix and of its pseudo inverse of Definition 4.1.

(ii) For a large or even reasonably large integer l the random bound of claim (i) is strongly concentrated about its expected value, which is close to the optimal error norm of rank-r approximation of M if the tail of its singular spectrum is small. The bound simplifies further if the multiplier H is orthogonal.

Proof.

Write the auxiliary product of the random right singular factor and the multiplier H as in

(4.2)

and express the output error norm of Algorithm 2.1 through its blocks.

Consider the SVD of this auxiliary matrix. By virtue of Lemma A.4 multiplication by a fixed orthogonal matrix preserves the Gaussian distribution, and hence the relevant norms are those of a Gaussian matrix and of its pseudo inverse. Therefore the intermediate bound follows (apply Lemma A.1 and recall that the singular factor involved is an orthogonal matrix), and substitution of the definitions yields the estimate.

Deduce from (4.2) that the two matrices involved share all their singular values, and therefore so do their pseudo inverses. By combining this observation with (3.4) prove claim (i) of the theorem.

Already for a reasonably large ratio of the relevant dimensions the random variables involved are strongly concentrated about their expected values (cf. Theorems A.4 and A.5), which approach their limits as that ratio grows. Apply Jensen's inequality and deduce claim (ii) of the theorem. ∎

We can bound the output errors of Algorithms 2.3 and 2.4 by combining the estimates of this section with those of Section 3.2 and by transposing the input matrix M.

4.3 Impact of pre-multiplication in the case of Gaussian noise

Next we deduce randomized estimates for the impact of pre-multiplication in the case where an input matrix includes considerable additive white Gaussian noise,6 which is a classical representation of natural noise in information theory, is widely adopted in signal and image processing, and in many cases properly represents the errors of measurement and rounding (cf. [SST06]).

6 Additive white Gaussian noise is statistical noise having a probability density function (PDF) equal to that of the Gaussian (normal) distribution.

Theorem 4.4.

Suppose that the multipliers F and H are orthogonal and let the input matrix M be the sum of a fixed matrix and a scaled Gaussian matrix,

(4.3)

for a scaling constant proportional to the norm of the fixed matrix. Then for any such pair of orthogonal multipliers F and H it holds that

(4.4)
(4.5)
Proof.

Assumption (4.3) and Lemma A.4 together imply that the pre-multiplied noise term is again a scaled Gaussian matrix. Hence the norm of the relevant pseudo inverse is that of the pseudo inverse of a scaled Gaussian matrix. Apply Theorem A.3 and obtain the resulting bound. Recall from (3.5) that the factor introduced by pre-multiplication is controlled since the multipliers F and H are orthogonal. Substitute the above bound and obtain (4.4). Substitute the expected values of the norms involved together with claim (iii) of Theorem A.5 and obtain (4.5). ∎

Remark 4.1.

For sub-permutation multipliers F and H and a nonsingular nucleus, Algorithms 2.3 and 2.4 output LRA in the form of CUR LRA, whose left and right factors are submatrices made up of columns and rows of the input matrix, respectively. [PLSZa] extends our current study to devising and analyzing algorithms for the computation of such CUR LRA in the case where the numbers of these columns and rows are arbitrary integers not exceeded by the target rank r.

5 Multiplicative pre-processing and generation of multipliers

5.1 Multiplicative pre-processing for LRA

We proved that sub-linear cost variations of Algorithms 2.3 and 2.4 tend to output accurate LRA of a random input whp. In real world computations input matrices are not random, but we can randomize them by multiplying them by random matrices.

Algorithms 2.1–2.4 output accurate LRA whp if the multipliers are Gaussian, SRHT, SRFT, or Rademacher's (cf. [HMT11, Sections 10 and 11], [T11], [CW09]), but multiplication by these matrices runs at super-linear cost. Our heuristic recipe is to apply these algorithms with a small variety of sparse multipliers, with which the computational cost becomes sub-linear, and then to monitor the accuracy of the output LRA by applying the criteria of the previous section, [PLa], and/or [PLSZa].

Various families of sparse multipliers have been proposed in [PLSZ16] and [PLSZ17]. One can readily complement these families with sub-permutation matrices and, say, sparse quasi Rademacher's multipliers (see [PLSZa]) and then combine these basic multipliers by using orthogonalized sums, products, or other low degree polynomials of these matrices as multipliers (cf. [HMT11, Remark 4.6]).
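As an illustration of such basic multipliers and of their combination, the sketch below builds a sub-permutation matrix and a sparse random sign (quasi Rademacher's) matrix and orthogonalizes their sum; the density parameter and the particular way of combining them are arbitrary choices made for illustration, not prescriptions of [PLSZa].

```python
import numpy as np

rng = np.random.default_rng(5)
n, l, density = 1000, 16, 0.02

# Sub-permutation multiplier: l distinct columns of the n x n identity matrix.
P = np.zeros((n, l))
P[rng.choice(n, size=l, replace=False), np.arange(l)] = 1.0

# Sparse quasi-Rademacher multiplier: +/-1 entries on a small random support.
S = np.where(rng.random((n, l)) < density, rng.choice([-1.0, 1.0], (n, l)), 0.0)

# Combine the basic multipliers and orthogonalize the sum (cf. [HMT11, Remark 4.6]).
H, _ = np.linalg.qr(P + S)
print("multiplier shape:", H.shape, " nonzeros in P+S:", np.count_nonzero(P + S))
```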

Next we specify a particular family of sparse multipliers, which was highly efficient in our tests, both when applied by itself and in combination with other sparse multipliers.

5.2 Generation of abridged Hadamard and Fourier multipliers

We define the multipliers of this family by abridging the classical recursive processes that generate the matrices behind SRHT and SRFT. These matrices are obtained from the dense matrices of the Walsh-Hadamard transform (cf. [M11, Section 3.1]) and of the discrete Fourier transform (DFT) (cf. [P01, Section 2.3]), respectively. The recursive representation of an n×n Hadamard or DFT matrix enables its multiplication by a vector in O(n log(n)) additions and subtractions or flops, respectively.

We end these processes in d recursive steps for a fixed recursion depth d and obtain the d-abridged Hadamard (AH) and Fourier (AF) matrices, which turn into the matrices of the Walsh-Hadamard transform and of DFT when d reaches its maximal value log2(n). Namely, we specify the two recursive processes (5.1) and (5.2), for the Hadamard and the Fourier case, respectively:

(5.1)
(5.2)

where the Fourier recursion involves a primitive root of 1 and the matrix of odd/even permutations.7

7 For the maximal recursion depth this is a decimation in frequency (DIF) radix-2 representation of FFT. Transposition turns it into the decimation in time (DIT) radix-2 representation of FFT.

For any fixed pair of n and d, each d-abridged Hadamard (resp. Fourier) matrix is orthogonal (resp. unitary) up to scaling and has 2^d nonzero entries in every row and column. Now make up multipliers of submatrices of these matrices. Then, in view of their sparseness, we can compute their products with the input matrix at sub-linear cost for small d, by using only additions and subtractions in the case of submatrices of the abridged Hadamard matrices.

By combining random permutation with either Rademacher's diagonal scaling for AH matrices or random unitary diagonal scaling for AF matrices, we obtain the Abridged Scaled and Permuted Hadamard (ASPH) and Abridged Scaled and Permuted Fourier (ASPF) matrices. Likewise we define the families of ASH, ASF, APH, and APF matrices, obtained by applying only scaling or only permutation. Each random permutation or scaling contributes additional random parameters, and we can involve even more random parameters by applying random permutation and scaling to the intermediate matrices of the recursive processes.

Now the first k rows or the first l columns of such a matrix form a d-abridged Hadamard or Fourier multiplier, which turns into an SRHT or SRFT matrix for the maximal recursion depth. For a sample size of order r log(r), Algorithm 2.1 with an SRHT or SRFT multiplier outputs whp accurate LRA of any matrix admitting LRA (see [HMT11, Section 11]), but in our tests the output was consistently accurate even with sparse abridged multipliers computed in just three recursive steps.
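The sketch below gives one natural reading of the abridged Hadamard construction: perform only d of the log2(n) recursive doubling steps, starting from an identity block, so that every row and column of the result has 2^d nonzero entries, and keep the first l columns after optional random permutation and Rademacher's diagonal scaling (an ASPH multiplier). The recursion shown is a hedged reconstruction of (5.1), since the displayed formulas are not reproduced above.

```python
import numpy as np

def abridged_hadamard(n, d):
    """d-abridged Hadamard matrix: d recursive doubling steps applied to an
    identity block, so each row/column has 2**d nonzeros (d = log2(n)
    recovers the full Walsh-Hadamard matrix).  Requires n divisible by 2**d."""
    H = np.eye(n // 2 ** d)
    for _ in range(d):
        H = np.block([[H, H], [H, -H]])   # one doubling step of the recursion
    return H

def asph_multiplier(n, l, d, rng=None):
    """Abridged, scaled, and permuted Hadamard (ASPH) multiplier, n x l:
    random Rademacher diagonal scaling and a random row permutation applied
    to the d-abridged Hadamard matrix, after which the first l columns are kept."""
    rng = np.random.default_rng(rng)
    H = abridged_hadamard(n, d)
    D = rng.choice([-1.0, 1.0], size=n)     # Rademacher's diagonal scaling
    perm = rng.permutation(n)               # random permutation of the rows
    return (D[:, None] * H)[perm][:, :l]

if __name__ == "__main__":
    H3 = asph_multiplier(n=1024, l=32, d=3, rng=6)
    print("shape:", H3.shape, " nonzeros per column:", np.count_nonzero(H3[:, 0]))
```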

6 Numerical tests

In this section we cover our tests of dual sub-linear cost variants of Algorithm 2.1. The tests for Tables 6.1–6.4 have been performed by using MATLAB on a Dell computer with an Intel Core 2 2.50 GHz processor and 4 GB memory running Windows 7; the standard normal distribution function randn of MATLAB has been applied in order to generate Gaussian matrices. The MATLAB function svd() has been applied in order to calculate the numerical rank, i.e., the number of singular values exceeding a fixed tolerance, chosen separately for Sections 6.2 and 6.3 and for Section 6.4. The tests for Tables 6.5–6.7 have been performed on a 64-bit Windows machine with an Intel i5 dual-core 1.70 GHz processor by using custom programmed software compiled with LAPACK version 3.6.0 libraries.

6.1 Input matrices for LRA

We generated the following classes of input matrices for testing LRA algorithms.

Class I: Perturbed factor-Gaussian matrices with expected rank r, defined by means of three Gaussian matrices (cf. Definition A.1).
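Since Definition A.1 is not reproduced in this text, the generator below is a hedged reading of Class I: the product of two Gaussian factors of sizes m×r and r×n perturbed by a scaled Gaussian matrix, with the perturbation scale 1e-10 an arbitrary illustrative choice.

```python
import numpy as np

def perturbed_factor_gaussian(m, n, r, noise=1e-10, rng=None):
    """Class I test input (hedged reading of Definition A.1): a rank-r
    factor-Gaussian matrix plus a small scaled Gaussian perturbation."""
    rng = np.random.default_rng(rng)
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((r, n))
    E = rng.standard_normal((m, n))
    return A @ B + noise * E

M = perturbed_factor_gaussian(1000, 1000, 32, rng=7)
s = np.linalg.svd(M, compute_uv=False)
print("numerical rank (tol 1e-6):", int(np.sum(s > 1e-6 * s[0])))
```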

Class II: Matrices defined by their SVDs, with the left and right orthogonal factors being the Q factors of the thin QR orthogonalization of Gaussian matrices and with a fixed diagonal matrix of singular values (cf. [H02, Section 28.3]); hence the norm and the numerical rank of these matrices are fixed by the choice of the singular values.

Class III: (i) The matrices of the discretized single-layer Laplacian operator of [HMT11, Section 7.1], defined by an integral over two circles on the complex plane. We arrived at a matrix,