1 Introduction
LRA Background. Low rank approximation (LRA) of a matrix is a hot research area of Numerical Linear Algebra (NLA) and Computer Science (CS) with applications to fundamental matrix and tensor computations as well as data mining and analysis (see the surveys [HMT11], [M11], [KS16], and [CLO16]). Matrices defining Big Data (e.g., unfolding matrices of multidimensional tensors) are frequently so immense that realistically one can access and process only a tiny fraction of their entries, although quite typically these matrices admit LRA, that is, are close to low rank matrices or, equivalently, have low numerical rank. One can operate with low rank matrices at sub-linear computational cost, that is, by using far fewer arithmetic operations and memory cells than an input matrix has entries, but can we compute LRA at sub-linear cost? Yes and no. No, because every sub-linear cost LRA algorithm fails even on the small input families of Appendix C. Yes, because our sub-linear cost variations of a popular subspace sampling algorithm output accurate LRA for a large class of inputs. Let us provide some details.
Subspace sampling algorithms compute LRA of a matrix $M$ by using auxiliary matrices $MH$ and/or $FM$ for random multipliers $H$ and $F$, commonly called test matrices and having smaller sizes. Their output LRAs are nearly optimal whp provided that $H$ and/or $F$ are Gaussian, Rademacher's, SRHT or SRFT matrices (here and hereafter “SRHT” and “SRFT” are the acronyms for “subsampled random Hadamard and Fourier transforms”).
Our modifications of these algorithms use sparse orthogonal (e.g., sub-permutation) multipliers $F$ and $H$ (we define sub-permutation matrices as full-rank submatrices of permutation matrices), run at sub-linear cost, and, as we prove, whp output reasonably accurate dual LRA, that is, LRA of a random input admitting LRA; we deduce our error estimates under three distinct models of random matrix computations in Sections 4.1–4.3.

How meaningful is our result? Our definitions of three classes of random matrices of low numerical rank are quite natural for various real world applications of LRA but are odd for some other ones. This, however, applies to any definition of that kind.
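To make the sub-linear cost claim concrete, here is a minimal NumPy sketch (ours, not the paper's code, and all names are our own): multiplying an input matrix by an $n\times l$ sub-permutation test matrix amounts to selecting $l$ of its columns, so the product can be formed by reading only an $m\times l$ submatrix of the input.

```python
# Minimal sketch (ours): multiplying M by an n x l sub-permutation test matrix H
# selects l columns of M, so M @ H can be formed by reading only m*l entries.
import numpy as np

def subpermutation(n, cols):
    """Return the n x len(cols) sub-permutation matrix selecting the given columns."""
    H = np.zeros((n, len(cols)))
    H[cols, np.arange(len(cols))] = 1.0
    return H

rng = np.random.default_rng(0)
m, n, l = 1000, 1000, 20
M = rng.standard_normal((m, n))
cols = rng.choice(n, size=l, replace=False)
H = subpermutation(n, cols)

MH_dense = M @ H            # touches all m*n entries of M
MH_sub = M[:, cols]         # touches only the m*l sampled entries
assert np.allclose(MH_dense, MH_sub)
```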
Our approach enables new insight into the subject, and our formal study is in good accordance with our numerical tests for both synthetic and real world inputs, some from [HMT11].
Our upper bounds on the output error of LRA of an $m\times n$ matrix of numerical rank $r$ exceed the optimal error bound by a certain factor, but if this optimal bound is small enough we can apply iterative refinement of LRA running at sub-linear cost (see [PLa]).
As we have pointed out, any sub-linear cost LRA algorithm (and ours are no exception) fails on some families of hard inputs, but our analysis and tests show that the class of such inputs is narrow. We conjecture that it shrinks fast if we recursively apply the same algorithm with new multipliers; we propose some heuristic recipes for these recursive processes, and our numerical tests confirm their efficiency.
Impact of our study, its extensions and by-products:
(i) Our duality approach enables new insight into some fundamental matrix computations besides LRA: [PQY15], [PZ17a], and [PZ17b] provide formal support for the empirical efficiency of dual Gaussian elimination with no pivoting, while [PLb] proposes a sub-linear cost modification of Sarlós' algorithm of 2006 and then proves that whp it outputs a nearly optimal solution of the highly important problem of Linear Least Squares Regression (LLSR) provided that its input is random. Once again, this formal proof is in good accordance with the test results.
(ii) In [PLSZa] we proved that popular Cross-Approximation LRA algorithms running at sub-linear cost, as well as our simplified sub-linear cost variations of these algorithms, output accurate solutions of dual LRA whp; we also devised a sub-linear cost algorithm for the transformation of any LRA into its special form of CUR LRA, which is particularly memory efficient.
(iii) The paper [PLa] has proposed and elaborated upon a sub-linear cost refinement of a crude but reasonably close LRA.
Related Works. The huge bibliography on LRA can be partly accessed via [M11], [HMT11], [KS16], [PLSZ17], [TYUC17], [OZ18], and the references therein. [PLSZ16] and [PLSZ17] were the first papers that provided formal support for accurate dual randomized LRA computations performed at sub-linear cost (in these papers such computations are called superfast). The earlier papers [PQY15], [PLSZ16], [PZ17a], and [PZ17b] studied duality for other fundamental matrix computations besides LRA, while the paper [PLb] has extended our study to a sub-linear cost dual algorithm for the popular problem of Linear Least Squares Regression and confirmed the accuracy of its solution by the results of numerical experiments.
Organization of the paper. In Section 2 we recall random sampling for LRA. In Sections 3 and 4 we prove deterministic and randomized error bounds, respectively, for our dual LRA algorithms running at sub-linear cost. In Section 5 we discuss multiplicative pre-processing and generation of multipliers for both pre-processing and sampling. In Section 6 we cover our numerical tests. Appendix A is devoted to background on matrix computations. In Appendix B we prove our error bounds for dual LRA. In Appendix C we specify some small families of hard inputs for sub-linear cost LRA.
Some definitions. The concepts “large”, “small”, “ill-” and “well-conditioned”, “near”, “close”, and “approximate” are usually quantified in context. “$\ll$” and “$\gg$” mean “much less than” and “much greater than”, respectively. “Flop” stands for “floating point arithmetic operation”; “iid” stands for “independent identically distributed”. In context a “perturbation of a matrix” can mean a perturbation having a small relative norm. $\mathbb{R}^{m\times n}$ denotes the class of real $m\times n$ matrices. We deal with real matrices throughout, and so the Hermitian transpose $M^H$ of a matrix $M$ turns into the transpose $M^T$, but most of our study can be readily extended to complex matrices; see some relevant results on complex Gaussian matrices in [E88], [CD05], [ES05], and [TYUC17].

2 LRA by means of subspace sampling
2.1 Four subspace sampling algorithms
Hereafter $\|\cdot\|_2$ and $\|\cdot\|_F$ denote the spectral and the Frobenius matrix norms, respectively; $\|\cdot\|$ can denote either of them. $M^{+}$ denotes the Moore–Penrose pseudo inverse of a matrix $M$.
Algorithm 2.1.
Column Subspace Sampling or Range Finder (see Remark 2.1).
- Input: An $m\times n$ matrix $M$ and a target rank $r$.
- Output: Two matrices $X\in\mathbb{R}^{m\times l}$ and $Y\in\mathbb{R}^{l\times n}$ defining an LRA $XY$ of $M$.
- Initialization: Fix an integer $l$, $r\le l\le n$, and an $n\times l$ matrix $H$ of full rank $l$.
- Computations:
  - Compute the $m\times l$ matrix $MH$.
  - Fix a nonsingular $l\times l$ matrix $T$ and output the $m\times l$ matrix $X:=MHT$.
  - Output an $l\times n$ matrix $Y:=\arg\min_V\|XV-M\|$.
Remark 2.1.
Let $Y:=X^{+}M$. Then $XY=MHT\,(MHT)^{+}M$, and the error norm $\|M-XY\|$ does not depend on the choice of the nonsingular matrix $T$, but its proper choice numerically stabilizes the computations of the algorithm. For $T=I_l$ the matrix $X=MH$ is ill-conditioned if $l>\mathrm{nrank}(MH)$ ($\mathrm{nrank}(W)$ denotes the numerical rank of a matrix $W$; see Appendix A.1), but $X$ is orthogonal for $T=R^{-1}$ and $X=Q$, where $Q$ and $R$ are the factors of the thin QR factorization $MH=QR$ (cf. [HMT11, Algorithm 4.1]). It is also orthogonal for $T$ defined by the factors $R$ and $P$ in a rank-revealing QR factorization $MH=QRP$.
Column Subspace Sampling turns into Column Subset Selection in the case of a sub-permutation multiplier $H$ and turns into Row Subspace Sampling if it is applied to the transpose $M^T$.
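The following Python sketch shows our reading of Algorithm 2.1 with the QR-based stabilization of Remark 2.1; the names M, H, X, Y are ours, and with a dense Gaussian test matrix the algorithm runs at super-linear cost, so the sketch only illustrates its structure and accuracy.

```python
# Sketch of Algorithm 2.1 with the QR stabilization of Remark 2.1 (our notation).
import numpy as np

def range_finder(M, H):
    """Return X (orthogonal columns) and Y with M ~ X @ Y."""
    Q, _ = np.linalg.qr(M @ H)                 # T = R^{-1}, so X = M H T = Q
    Y = np.linalg.lstsq(Q, M, rcond=None)[0]   # Y = X^+ M minimizes ||X Y - M||_F
    return Q, Y

rng = np.random.default_rng(1)
m, n, r, l = 500, 400, 10, 14
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank r
M += 1e-8 * rng.standard_normal((m, n))                        # small perturbation

H = rng.standard_normal((n, l))        # Gaussian test matrix (dense, super-linear)
X, Y = range_finder(M, H)
print(np.linalg.norm(M - X @ Y) / np.linalg.norm(M))   # tiny (of the order of the relative perturbation)
```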
Algorithm 2.2.
Row Subspace Sampling or Transposed Range Finder. See Remark 2.2.
- Input: As in Algorithm 2.1.
- Output: Two matrices $X\in\mathbb{R}^{m\times k}$ and $Y\in\mathbb{R}^{k\times n}$ defining an LRA $XY$ of $M$.
- Initialization: Fix an integer $k$, $r\le k\le m$, and a $k\times m$ matrix $F$ of full numerical rank $k$.
- Computations:
  - Compute the $k\times n$ matrix $FM$.
  - Fix a nonsingular $k\times k$ matrix $S$; then output the $k\times n$ matrix $Y:=SFM$.
  - Output an $m\times k$ matrix $X:=\arg\min_U\|UY-M\|$.
Row Subspace Sampling turns into Row Subset Selection in the case of a sub-permutation matrix $F$.
Remark 2.2.
Let $X:=MY^{+}$. Then $XY=M\,(SFM)^{+}SFM$, and the error norm $\|M-XY\|$ does not depend on the choice of the nonsingular matrix $S$, but a proper choice of $S$ numerically stabilizes the computations of the algorithm. For $S=I_k$ the matrix $Y=FM$ is ill-conditioned if $k>\mathrm{nrank}(FM)$, but $Y$ is orthogonal if $S=L^{-1}$, $Y=Q$, and $X=MQ^{+}$, where $L$ and $Q$ are the factors of the thin LQ factorization $FM=LQ$, or if $S$ is defined by the factors in a rank-revealing LQ factorization of $FM$.
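A transposed counterpart can be sketched under the same assumptions (our notation, with F denoting the $k\times m$ test matrix): sample rows instead of columns, orthogonalize them, and recover the left factor by least squares.

```python
# Sketch of Algorithm 2.2 (our notation): sample k rows with F, orthogonalize them,
# and recover the left factor by least squares.
import numpy as np

def transposed_range_finder(M, F):
    """Return X (m x k) and Y (k x n, orthonormal rows) with M ~ X @ Y."""
    Q, _ = np.linalg.qr((F @ M).T)   # thin LQ of F M via QR of its transpose
    Y = Q.T                          # orthonormal rows
    X = M @ Y.T                      # X = M Y^+ minimizes ||X Y - M||_F
    return X, Y

rng = np.random.default_rng(2)
m, n, r, k = 400, 500, 8, 12
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
X, Y = transposed_range_finder(M, rng.standard_normal((k, m)))
print(np.linalg.norm(M - X @ Y) / np.linalg.norm(M))   # tiny for an exactly rank-r input
```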
The following algorithm combines row and column subspace sampling. In the case of identity matrices $S$ and $T$ it turns into the algorithm of [CW09, Theorems 4.7 and 4.8] and [TYUC17, Section 1.4], whose origin can be traced back to [WLRT08].

Algorithm 2.3.
Row and Column Subspace Sampling. See Remark 2.3.
- Input: As in Algorithm 2.1.
- Output: Two matrices $X\in\mathbb{R}^{m\times l}$ and $Y\in\mathbb{R}^{l\times n}$ defining an LRA $XY$ of $M$.
- Initialization: Fix two integers $k$ and $l$, $r\le l\le n$ and $r\le k\le m$; fix two matrices $H\in\mathbb{R}^{n\times l}$ and $F\in\mathbb{R}^{k\times m}$ of full numerical ranks and two nonsingular matrices $T\in\mathbb{R}^{l\times l}$ and $S\in\mathbb{R}^{k\times k}$.
- Computations:
  1. Output the matrix $X:=MHT$.
  2. Compute the matrices $SFM$ and $SFMHT$.
  3. Output the matrix $Y:=(SFMHT)^{+}SFM$.
Remark 2.3.
If the matrix $FMH$ has full rank, then $XY=MH(FMH)^{+}FM$ independently of the choice of the matrices $S$ and $T$, but a proper choice of these matrices numerically stabilizes the computations of the algorithm. For $S=I_k$ the matrix $FM$ is ill-conditioned if $k>\mathrm{nrank}(FM)$, but it is orthogonalized if $S=L^{-1}$, where $L$ and $Q$ are the factors of the thin LQ factorization $FM=LQ$, or if $S$ is defined by the factors in a rank-revealing LQ factorization of $FM$.
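The next sketch (ours) implements two-sided sampling in the [CW09]/[TYUC17] form $MH(FMH)^{+}FM$, to which Algorithm 2.3 reduces for identity stabilizer matrices; with sub-permutation multipliers only $k$ rows and $l$ columns of the input are read.

```python
# Sketch (ours) of two-sided sampling in the form  M H (F M H)^+ F M.
import numpy as np

def two_sided_lra(M, F, H):
    MH, FM = M @ H, F @ M
    return MH, np.linalg.pinv(F @ MH), FM      # factors C, U, R of  C U R ~ M

rng = np.random.default_rng(3)
m, n, r, k, l = 600, 500, 8, 24, 12
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))

rows = rng.choice(m, k, replace=False)
cols = rng.choice(n, l, replace=False)
F = np.eye(m)[rows, :]                         # k x m sub-permutation: selects rows
H = np.eye(n)[:, cols]                         # n x l sub-permutation: selects columns

C, U, R = two_sided_lra(M, F, H)               # reads only k rows and l columns of M
print(np.linalg.norm(M - C @ U @ R) / np.linalg.norm(M))   # tiny for an exactly rank-r input
```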
2.2 The known error bounds
Theorem 2.1.
Clarkson and Woodruff prove in [CW09] that whp Algorithm 2.3 reaches the optimal Frobenius error bound of rank-$r$ approximation within a factor of $1+\epsilon$ if the multipliers $F$ and $H$ are Rademacher's matrices and if $k$ and $l$ are sufficiently large, with orders depending on $r$ and on a small positive tolerance $\epsilon$.
Tropp et al. argue in [TYUC17, Section 1.7.3] that LRA is not practical if the numbers $k$ and $l$ of row and column samples are large; iterative refinement of LRA at sub-linear cost in [PLa] can be a partial remedy. [TYUC17, Theorem 4.3] shows that the output LRA of Algorithm 2.3 applied with Gaussian multipliers $F$ and $H$ satisfies bound (2.1) below; in words, the expected output error norm is within a bounded factor of its minimum value, and this factor is just 2 for a suitable choice of $k$ and $l$.
(2.1)
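A small empirical illustration (ours, not a restatement of [TYUC17, Theorem 4.3]): for a synthetic matrix of numerical rank $r$ and Gaussian multipliers, the error of the two-sided sample stays within a modest factor of the optimal rank-$r$ Frobenius error.

```python
# Empirical comparison (ours) of the two-sided sample error with the optimal
# rank-r Frobenius error for a matrix of numerical rank r and Gaussian multipliers.
import numpy as np

rng = np.random.default_rng(4)
n, r, k, l = 300, 10, 40, 20
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.concatenate([np.linspace(1.0, 0.5, r), 1e-4 * rng.random(n - r)])
M = U @ np.diag(s) @ V.T                       # singular values s, numerical rank r

F = rng.standard_normal((k, n))                # Gaussian multipliers
H = rng.standard_normal((n, l))
approx = M @ H @ np.linalg.pinv(F @ M @ H) @ (F @ M)

optimal = np.linalg.norm(s[r:])                # best possible rank-r Frobenius error
print(np.linalg.norm(M - approx) / optimal)    # a modest factor, in line with (2.1)
```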
3 Deterministic output error bounds for sampling algorithms
3.1 Deterministic error bounds of Range Finder
Theorem 3.1.
Notice that , , and and obtain
(3.4)
It follows that the output LRA is optimal up to a bounded factor.
Next we deduce an upper bound on the norm in terms of , , and . Given we can compute the norm at sub-linear cost if , and in some applications reasonable upper estimates for and are available.
Corollary 3.1.
Under the assumptions of Theorem 3.1 let the matrix have full rank . Then
Proof.
Recall that the matrix has full rank , apply Lemma A.1, recall that
is an orthogonal matrix, and obtain
. Substitute these expressions and obtain the corollary. ∎
Corollary 3.2.
Under the assumptions of Corollary 3.1 let and Then
3.2 Impact of pre-multiplication on the errors of LRA (deterministic estimates)
Lemma 3.1.
[The impact of pre-multiplication on LRA errors.] Suppose that Algorithm 2.3 outputs a matrix for and that . Then
(3.5)

(3.6)
We bounded the norm in the previous subsection; next we bound the norms and at sub-linear cost for , a fixed orthogonal , and proper choice of sparse .
Theorem 3.2.
[P00, Algorithm 1] outputs matrix . One can strengthen deterministic bounds on the norm by computing sub-permutation matrices for of at least order .
4 Accuracy of sub-linear cost dual LRA algorithms
Next we estimate the output errors of Algorithm 2.1 for a fixed orthogonal matrix and two classes of random inputs of low numerical rank, in particular for perturbed factor-Gaussian inputs of Definition A.1. These estimates formally support the observed accuracy of Range Finder with various dense multipliers (see [HMT11, Section 7.4], and the bibliography therein), but also with sparse multipliers, with which Algorithms 2.3 and 2.4 run at sub-linear cost. By applying the results of the previous section we extend these upper estimates for output accuracy to variations of Algorithm 2.3 that run at sub-linear cost; then we extend them to Algorithm 2.4 by means of transposition of an input matrix.
Our estimates involve the norms of a Gaussian matrix and its pseudo inverse (cf. Appendix A.4).
Definition 4.1.
A matrix is Gaussian if its entries are iid Gaussian variables. is the class of Gaussian matrices. , , , , , and for a matrix . [ and , for all pairs of and .]
Theorem 4.1.
[Non-degeneration of a Gaussian Matrix.] Let , , and . Then the matrices , , , and have full rank with probability 1.
Assumption 4.1.
We simplify the statements of our results by assuming that a Gaussian matrix has full rank and ignoring the probability 0 of its degeneration.
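A quick Monte Carlo check (ours) of Theorem 4.1 and of the concentration results invoked below, written in NumPy with our own parameter choices: a Gaussian matrix has full rank in every trial, and the spectral norms of $G$ and $G^{+}$ concentrate sharply.

```python
# Monte Carlo check (ours): Gaussian matrices have full rank, and the spectral
# norms of G and of its pseudo inverse G^+ concentrate about their means.
import numpy as np

rng = np.random.default_rng(5)
m, n, trials = 200, 50, 100
norms, pinv_norms = [], []
for _ in range(trials):
    G = rng.standard_normal((m, n))
    assert np.linalg.matrix_rank(G) == n                 # full rank in every trial
    norms.append(np.linalg.norm(G, 2))
    pinv_norms.append(np.linalg.norm(np.linalg.pinv(G), 2))

print(np.mean(norms), np.std(norms))            # mean roughly sqrt(m) + sqrt(n), small spread
print(np.mean(pinv_norms), np.std(pinv_norms))  # mean roughly 1/(sqrt(m) - sqrt(n)), small spread
```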
4.1 Output errors of Range Finder for a perturbed factor-Gaussian input
Theorem 4.2.
[Errors of Range Finder for a perturbed factor-Gaussian matrix.] Apply Algorithm 2.1 to a perturbation of a right factor-Gaussian matrix of rank such that
(4.1) |
(i) Then
(ii) Let with a probability close to 1 and let the integer be at least moderately large. Then with a probability close to 1
which is close to if . Here and hereafter
Proof.
We prove claim (i) in Appendix B. Let us deduce claim (ii). Recall from Theorems A.4 and A.5 that the random variables involved are strongly concentrated about their expected values. Substitute these expected values into the bound of claim (i), apply Jensen's inequality, and deduce claim (ii) of the theorem. ∎
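An illustrative experiment (ours) in the setting of Theorem 4.2: we generate a perturbed right factor-Gaussian input and apply Algorithm 2.1 with a sub-permutation multiplier, so that the sampling stage reads only an $m\times l$ submatrix; the observed error is of the order of the perturbation norm.

```python
# Experiment (ours) in the setting of Theorem 4.2: perturbed right factor-Gaussian
# input, Algorithm 2.1 with a sub-permutation multiplier H (column selection).
import numpy as np

rng = np.random.default_rng(6)
m, n, r, l = 800, 600, 10, 20
A = rng.standard_normal((m, r))                 # left factor
G = rng.standard_normal((r, n))                 # Gaussian right factor
E = 1e-6 * rng.standard_normal((m, n))          # perturbation
M = A @ G + E

cols = rng.choice(n, l, replace=False)          # H selects l columns of M
X, _ = np.linalg.qr(M[:, cols])                 # sampling stage reads m*l entries
Y = X.T @ M                                     # X^+ = X^T for orthonormal columns
print(np.linalg.norm(M - X @ Y), np.linalg.norm(E))   # error ~ ||E|| up to a modest factor
```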
4.2 Output errors of Range Finder near a matrix with a random singular space

Next we prove similar estimates under an alternative randomization model for dual LRA.
Theorem 4.3.
[Errors of Range Finder for an input with a perturbed random singular space.] Let the matrix in Theorem 3.1 be the Q factor in a QR factorization of a normalized Gaussian matrix and let the multiplier be any matrix of full rank .
(i) Then for and of Definition 4.1 it holds that
(ii) For a large or reasonably large integer , the random variable is strongly concentrated about its expected values
which turn into
if . Here and if the matrix is orthogonal.
4.3 Impact of pre-multiplication in the case of Gaussian noise
Next we deduce randomized estimates for the impact of pre-multiplication in the case where an input matrix includes considerable additive white Gaussian noise. (Additive white Gaussian noise is statistical noise having a probability density function (PDF) equal to that of the Gaussian (normal) distribution. It is widely adopted in information theory and used in signal and image processing; in many cases it properly represents the errors of measurement and rounding.)
Theorem 4.4.
Suppose that the multipliers and are orthogonal and let the input matrix be the sum of a fixed matrix and a scaled Gaussian matrix ,
(4.3)
for a constant proportional to the norm . Then for any pair of orthogonal multipliers and it holds that
(4.4)

(4.5)
Proof.
Assumption (4.3) and Lemma A.4 together imply that is a scaled Gaussian matrix: . Hence . Apply Theorem A.3 and obtain that . Recall from (3.5) that since the multipliers and are orthogonal. Substitute the above bound on and obtain (4.4). Substitute equations and and claim (iii) of Theorem A.5 and obtain (4.5). ∎
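A small numerical check (ours, with our own parameter choices) of the orthogonal invariance used in this proof: pre- and post-multiplication of a Gaussian matrix by matrices with orthonormal rows and columns, respectively, outputs a smaller Gaussian matrix.

```python
# Check (ours): F W H is again Gaussian when W is Gaussian, F has orthonormal rows
# and H has orthonormal columns (the invariance used in the proof above).
import numpy as np

rng = np.random.default_rng(7)
m, n, k, l = 500, 400, 30, 20
W = rng.standard_normal((m, n))
F = np.linalg.qr(rng.standard_normal((m, k)))[0].T   # k x m, orthonormal rows
H = np.linalg.qr(rng.standard_normal((n, l)))[0]     # n x l, orthonormal columns
S = F @ W @ H                                        # k x l

print(S.mean(), S.std())                             # close to 0 and 1
print(np.linalg.norm(S, 'fro') ** 2, k * l)          # E ||S||_F^2 = k * l
```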
Remark 4.1.
For , , sub-permutation matrices and , and a nonsingular matrix , Algorithms 2.3 and 2.4 output LRA in the form where and are two submatrices made up of columns and rows of and . [PLSZa] extends our current study to devising and analyzing algorithms for the computation of such CUR LRA in the case where and are arbitrary integers not exceeded by .
5 Multiplicative pre-processing and generation of multipliers
5.1 Multiplicative pre-processing for LRA
We proved that sub-linear cost variations of Algorithms 2.3 and 2.4 tend to output accurate LRA of a random input whp. In real world computations input matrices are not random, but we can randomize them by multiplying them by random matrices.
Algorithms 2.1 – 2.4 output accurate LRA whp if the multipliers are Gaussian, SRHT, SRFT or Rademacher's (cf. [HMT11, Sections 10 and 11], [T11], [CW09]), but multiplication by these matrices runs at super-linear cost. Our heuristic recipe is to apply these algorithms with a small variety of sparse multipliers, with which the computational cost becomes sub-linear, and then to monitor the accuracy of the output LRA by applying the criteria of the previous section, [PLa], and/or [PLSZa].
Various families of sparse multipliers have been proposed in [PLSZ16] and [PLSZ17]. One can readily complement these families with sub-permutation matrices and, say, sparse quasi Rademacher's multipliers (see [PLSZa]) and then combine these basic multipliers by using orthogonalized sums, products, or other low degree polynomials of these matrices as multipliers (cf. [HMT11, Remark 4.6]).
Next we specify a particular family of sparse multipliers, which was highly efficient in our tests both when we applied its members by themselves and when we combined them with other sparse multipliers.
5.2 Generation of abridged Hadamard and Fourier multipliers
We define multipliers of this family by means of abridging the classical recursive processes of the generation of SRHT and SRFT matrices for $n=2^t$. These matrices are obtained from the $n\times n$ dense matrices of the Walsh-Hadamard transform (cf. [M11, Section 3.1]) and of the discrete Fourier transform (DFT) at $n$ points (cf. [P01, Section 2.3]), respectively. Their recursive representation in $t$ recursive steps enables multiplication of these matrices by a vector in $nt$ additions and subtractions and $O(nt)$ flops, respectively. We end these processes in $d$ recursive steps for a fixed recursion depth $d$, $1\le d\le t$, and obtain the $d$-abridged Hadamard (AH) and Fourier (AF) matrices, respectively, which turn into the matrices of the WHT and DFT for $d=t$. Namely, write $H_1=F_1=(1)$, let $\omega_s$ denote a primitive $s$-th root of 1, and then specify two recursive processes:
(5.1)

(5.2)
where $P$ denotes the matrix of odd/even permutations. (For the full recursion depth $d=t$, process (5.2) is a decimation in frequency (DIF) radix-2 representation of FFT; transposition turns it into the decimation in time (DIT) radix-2 representation of FFT.)
For any fixed pair of $d$ and $n$, each of the AH matrices (resp. AF matrices) is orthogonal (resp. unitary) up to scaling and has $2^d$ nonzero entries in every row and column. Now make up multipliers of submatrices of these AH and AF matrices. Then in view of their sparseness we can compute the products of an input matrix with these multipliers by using $O(2^d)$ flops per output entry, and these flops are just additions or subtractions in the case of submatrices of the AH matrices.
By combining random permutation with either Rademacher's diagonal scaling for AH matrices or random unitary diagonal scaling for AF matrices, we obtain the $d$-Abridged Scaled and Permuted Hadamard (ASPH) and the $d$-Abridged Scaled and Permuted Fourier (ASPF) matrices, which are the products of AH or AF matrices with matrices of permutation and diagonal scaling. Likewise we define the families of ASH, ASF, APH, and APF matrices. Each random permutation or scaling contributes up to $n$ random parameters. We can involve more random parameters by applying random permutation and scaling also to the intermediate matrices of the recursive processes (5.1) and (5.2).
Now the first $l$ columns of an $n\times n$ AH or AF matrix (or its first $k$ rows, for pre-multiplication) form a $d$-abridged Hadamard or Fourier multiplier, which turns into an SRHT or SRFT matrix, respectively, for $d=t$. For $k$ and $l$ of order $r\log(r)$, Algorithm 2.1 with an SRHT or SRFT multiplier whp outputs accurate LRA of any matrix admitting LRA (see [HMT11, Section 11]), but in our tests the output was consistently accurate even with sparse abridged SRHT or SRFT multipliers computed in just three recursive steps.
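The sketch below is our reconstruction of the $d$-abridged Hadamard family under the assumption that abridging stops the recursion (5.1) after $d$ of its $t$ steps, which leaves the Kronecker product of a $2^d\times 2^d$ Walsh-Hadamard matrix with an identity matrix; the scaling and permutation of the ASPH family are added at the end.

```python
# Our reconstruction of d-abridged Hadamard (AH) multipliers: stop the
# Walsh-Hadamard recursion after d of its t = log2(n) steps, leaving a sparse
# orthogonal (up to scaling) matrix with 2^d nonzeros per row and column.
import numpy as np

def abridged_hadamard(n, d):
    """n = 2^t; returns WH_{2^d} kron I_{n/2^d} (assumed form of the AH matrix)."""
    assert n & (n - 1) == 0 and 2 ** d <= n
    W = np.array([[1.0]])
    for _ in range(d):                            # one step: W <- [[W, W], [W, -W]]
        W = np.block([[W, W], [W, -W]])
    return np.kron(W, np.eye(n // 2 ** d))

n, d, l = 256, 3, 16
AH = abridged_hadamard(n, d)
print(np.count_nonzero(AH[0]))                    # 2^d = 8 nonzeros per row

# an n x l ASPH-style multiplier: Rademacher diagonal scaling plus random
# column selection/permutation applied to the AH matrix
rng = np.random.default_rng(8)
D = np.diag(rng.choice([-1.0, 1.0], n))
H_mult = (AH @ D)[:, rng.permutation(n)[:l]]
```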
6 Numerical tests
In this section we cover our tests of dual sub-linear cost variants of Algorithm 2.1. The tests for Tables 6.1–6.4 have been performed by using MATLAB on a Dell computer with an Intel Core 2 2.50 GHz processor and 4 GB memory running Windows 7; the standard normal distribution function randn of MATLAB has been applied in order to generate Gaussian matrices. The MATLAB function "svd()" has been applied in order to calculate the numerical rank, i.e., the number of singular values exceeding a fixed tolerance (one value in Sections 6.2 and 6.3 and another in Section 6.4). The tests for Tables 6.5–6.7 have been performed on a 64-bit Windows machine with an Intel i5 dual-core 1.70 GHz processor by using our custom software compiled with LAPACK version 3.6.0 libraries.
6.1 Input matrices for LRA
We generated the following classes of input matrices for testing LRA algorithms.
Class I: Perturbed factor-Gaussian matrices with expected rank , for three Gaussian matrices , , and .
Class II: , for and being the Q factors of the thin QR orthogonalization of Gaussian matrices and ; , (cf. [H02, Section 28.3]), and . (Hence and .)
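For reference, here are NumPy generators (ours; the noise level $10^{-10}$ and the prescribed spectrum are placeholders, not necessarily the parameters used in our tests) for inputs in the spirit of Classes I and II.

```python
# Generators (ours; parameters are placeholders) for inputs in the spirit of
# Classes I and II of this subsection.
import numpy as np

rng = np.random.default_rng(9)

def class_I(m, n, r, noise=1e-10):
    """Perturbed factor-Gaussian matrix of expected rank r (three Gaussian factors)."""
    G1, G2, G3 = (rng.standard_normal(s) for s in [(m, r), (r, n), (m, n)])
    return G1 @ G2 + noise * G3

def class_II(n, r, tail=1e-10):
    """U diag(s) V^T with U, V the Q factors of Gaussian matrices; numerical rank r."""
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.concatenate([np.ones(r), tail * np.ones(n - r)])
    return U @ np.diag(s) @ V.T

M1, M2 = class_I(1000, 1000, 32), class_II(1000, 32)
print(np.linalg.matrix_rank(M1, tol=1e-6), np.linalg.matrix_rank(M2, tol=1e-6))  # 32 32
```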
Class III: (i) The matrices of the discretized single-layer Laplacian operator of [HMT11, Section 7.1]: , for two circles and on the complex plane. We arrived at a matrix ,