Given a real matrix $A$ with columns $a_1,\dots,a_n$, can one approximate the Gram matrix $AA^T$ from just a few columns? We answer this question by presenting deterministic conditions for the exact computation of $AA^T$ from a few columns (we assume infinite precision and no round-off errors), and probabilistic error bounds for approximations.
Our motivation (Section 1.1) is followed by an overview of the results (Section 1.2), and a literature survey (Section 1.3). Those not familiar with established notation can find a review in Section 1.4.
The objective is the analysis of a randomized algorithm for approximating $AA^T$. Specifically, it is a Monte Carlo algorithm that samples outer products, and it represents a special case of the groundbreaking work on randomized matrix multiplication by Drineas, Kannan, and Mahoney [16, 17].
The basic idea is to represent $AA^T$ as a sum of outer products of columns,
$$AA^T = \sum_{j=1}^{n} a_j a_j^T,$$
and to approximate this sum by sampling $c$ outer products, $X = \sum_{t=1}^{c} w_t\, a_{k_t} a_{k_t}^T$. The weights are set to $w_t = 1/(c\,p_{k_t})$, where $p_{k_t}$ is the probability of sampling column $k_t$, so that $X$ is an unbiased estimator, $\mathbb{E}[X] = AA^T$. Intuitively, one would expect the algorithm to do well for matrices of low rank.
The intuition is based on the singular value decomposition. Given the left singular vectors $u_1,\dots,u_\rho$ associated with the non-zero singular values $\sigma_1\ge\cdots\ge\sigma_\rho>0$ of $A$, one can represent $AA^T$ as a sum of outer products,
$$AA^T = \sum_{j=1}^{\rho} \sigma_j^2\, u_j u_j^T.$$
Hence for matrices of low rank, a few left singular vectors and singular values suffice to reproduce $AA^T$ exactly. Thus, if $A$ has columns that “resemble” its left singular vectors, the Monte Carlo algorithm should have a chance to perform well.
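Both representations are easy to verify numerically. The sketch below (NumPy; the matrix sizes, rank, and seed are illustrative choices, not taken from the paper) checks that $AA^T$ equals the sum of column outer products, and that for a rank-2 matrix only two SVD terms reproduce it.

```python
import numpy as np

rng = np.random.default_rng(0)

# A 6 x 4 matrix of rank 2 (illustrative sizes).
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))

# Gram matrix as a sum of column outer products: A A^T = sum_j a_j a_j^T.
G_outer = sum(np.outer(A[:, j], A[:, j]) for j in range(A.shape[1]))
assert np.allclose(G_outer, A @ A.T)

# Gram matrix from the SVD: A A^T = sum_j sigma_j^2 u_j u_j^T, where only
# the two non-zero singular values contribute for a rank-2 matrix.
U, s, _ = np.linalg.svd(A, full_matrices=False)
G_svd = sum(s[j] ** 2 * np.outer(U[:, j], U[:, j]) for j in range(2))
assert np.allclose(G_svd, A @ A.T)
```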
1.2 Contributions and Overview
We sketch the main contributions of this paper. All proofs are relegated to Section 7.
1.2.1 Deterministic conditions for exact computation (Section 2)
To calibrate the potential of the Monte Carlo algorithm [16, 17] and establish connections to existing work in linear algebra, we first derive deterministic conditions that characterize when $AA^T$ can be computed exactly from a few columns of $A$. Specifically:
We present necessary and sufficient conditions (Theorem 2.2) for computing $AA^T$ exactly from a weighted sum of outer products of $c$ columns of $A$.
The conditions and weights depend on the right singular vector matrix $V$ associated with the non-zero singular values of $A$.
For matrices with $\operatorname{rank}(A) = 1$, this is always possible (Corollary 2.2).
In the special case where $c = \operatorname{rank}(A)$ (Theorem 2.2.2), the weights are equal to inverse leverage scores, $w_t = 1/\ell_{k_t}$. However, the selected columns do not necessarily correspond to the largest leverage scores.
1.2.2 Sampling probabilities for the Monte-Carlo algorithm (Section 3)
Given an approximation $X$ from the Monte Carlo algorithm [16, 17], we are interested in the two-norm relative error due to randomization, $\|X - AA^T\|_2/\|A\|_2^2$. Numerical experiments compare two types of sampling probabilities:
“Optimal” probabilities $p_j^{opt} = \|a_j\|_2^2/\|A\|_F^2$, and
Leverage score probabilities $p_j^{lev} = \ell_j/\operatorname{rank}(A)$.
The experiments illustrate that sampling columns of $A$ with the “optimal” probabilities produces a smaller error than sampling with leverage score probabilities. This was not obvious a priori, because the “optimal” probabilities are designed to minimize the expected value of the Frobenius norm absolute error, $\mathbb{E}\,\|X - AA^T\|_F^2$. Furthermore, corresponding probabilities $p_j^{opt}$ and $p_j^{lev}$ can differ by orders of magnitude.
For matrices of rank one, though, we show (Theorem 3.2.2) that the probabilities are identical, $p_j^{opt} = p_j^{lev}$ for $1\le j\le n$, and that the Monte Carlo algorithm always produces the exact result, $X = AA^T$, when it samples with these probabilities.
1.2.3 Probabilistic error bounds (Sections 4 and 5)
We present probabilistic bounds for the two-norm relative error $\|X - AA^T\|_2/\|A\|_2^2$ when the Monte Carlo algorithm samples with two types of sampling probabilities. With probability at least $1-\delta$, the error is at most $\epsilon$, provided the number of sampled columns is sufficiently large; the required amount depends on either $\operatorname{rank}(A)$ or the stable rank $\operatorname{sr}(A) = \|A\|_F^2/\|A\|_2^2$. The bound containing $\operatorname{sr}(A)$ is tighter for matrices with $\operatorname{sr}(A) \ll \operatorname{rank}(A)$.
Note that the amount of sampling depends on the rank or the stable rank, but not on the dimensions of $A$. Numerical experiments (Section 4.4) illustrate that the bounds are informative, even for stringent success probabilities and matrices of small dimension.
1.2.4 Singular value bounds (Section 6)
Given a $k\times n$ matrix $Q$ with orthonormal rows, $QQ^T = I_k$, the Monte Carlo algorithm approximates $QQ^T$ by sampling $c$ columns from $Q$ with the “optimal” probabilities. The goal is to derive a positive lower bound for the smallest singular value of the matrix of sampled columns, as well as an upper bound for its two-norm condition number with respect to left inversion.
Surprisingly, Theorem 4.1 leads to bounds (Theorems 6.1 and 6.2) that are not always as tight as the bounds below, which are based on a Chernoff inequality and represent a slight improvement over existing results.
1.3 Literature Review
We review bounds for the relative error due to randomization of general Gram matrix approximations, and also for the smallest singular value and condition number of matrices whose columns are sampled from a matrix with orthonormal rows.
In addition to [16, 17], several other randomized matrix multiplication algorithms have been proposed [5, 13, 14, 40, 46, 48]. Sarlós’s algorithms  are based on matrix transformations. Cohen and Lewis [13, 14] approximate large elements of a matrix product with a random walk algorithm. The algorithm by Belabbas and Wolfe  is related to the Monte Carlo algorithm [16, 17], but with different sampling methods and weights. A second algorithm by Drineas et al.  relies on matrix sparsification, and a third algorithm  estimates each matrix element independently. Pagh  targets sparse matrices, while Liberty  estimates the Gram matrix by iteratively removing “unimportant” columns from .
Eriksson-Bique et al. derive an importance sampling strategy that minimizes the variance of the inner products computed by the Monte Carlo method. Madrid, Guerra, and Rojas present experimental comparisons of different sampling strategies for specific classes of matrices.
1.3.1 Gram matrix approximations
We review existing bounds for the error due to randomization of the Monte Carlo algorithm [16, 17] for approximating $AA^T$, where $A$ is a real $m\times n$ matrix. Relative error bounds in the Frobenius norm and the two-norm are summarized in Tables 1 and 2.
Table 1 shows probabilistic lower bounds for the number of sampled columns so that the Frobenius norm relative error is at most a tolerance $\epsilon$. Not listed are a bound for uniform sampling without replacement [38, Corollary 1], because it cannot easily be converted to the format of the other bounds, and a bound for a greedy sampling strategy [5, p. 5].
|Bound for # samples|Sampling|Reference|
| |opt|[17, Theorem 2]|
| |opt|[24, Lemma 1], [25, Lemma 2]|
| |u-wor|[16, Lemma 7]|
| |u-wor|[8, Lemma 4.13], [27, Lemma 4.3]|
Table 2 shows probabilistic lower bounds for the number of sampled columns so that the two-norm relative error is at most $\epsilon$. These bounds imply, roughly, that the number of sampled columns should be at least proportional to the rank or to the stable rank of $A$.
|Bound for # samples|Reference|
| |[47, Theorems 1.1 and 3.1, and their proofs]|
| |[43, Theorem 17], [42, Theorem 20]|
| |[21, Theorem 4]|
| |[44, Theorem 3.1], [55, Theorem 2.1]|
| |[36, Example 4.3]|
| |[49, Theorem 3.9]|
|Bound for # samples|Sampling|Reference|
| |u-wor|[8, Lemma 4.3]|
| |opt|[6, Lemma 13]|
| |u-wr, u-wor|[37, Corollary 4.2]|
| |u-wr|[8, Lemma 4.4]|
| |u-wor|[26, Lemma 1]|
1.3.2 Singular value bounds
We review existing bounds for the smallest singular value of a matrix whose columns are sampled from a $k\times n$ matrix $Q$ with orthonormal rows.
1.3.3 Condition number bounds
We are aware of only two existing bounds for the two-norm condition number of a matrix whose columns are sampled from a matrix with orthonormal rows. The first bound [1, Theorem 3.2] lacks explicit constants, while the second one [37, Corollary 4.2] applies to uniform sampling with and without replacement. It ensures with probability at least , provided the number of sampled columns in is at least .
1.3.4 Relation to subset selection
The Monte Carlo algorithm selects outer products $a_j a_j^T$, which is equivalent to selecting columns from $A$; hence it can be viewed as a form of randomized column subset selection.
The traditional deterministic subset selection methods select exactly the required number of columns, by means of rank-revealing QR decompositions or SVDs [11, 28, 29, 31, 34]. In contrast, more recent methods are motivated by applications to graph sparsification [3, 4, 49]. They oversample columns from a matrix with orthonormal rows, by relying on a barrier sampling strategy. (The name comes about as follows: adding a column amounts to a rank-one update of the Gram matrix. The eigenvalues of the original Gram matrix, due to interlacing, form “barriers” for the eigenvalues of the updated matrix.) The accuracy of the selected columns is determined by bounding the reconstruction error, which views the Gram matrix of the selected columns as an approximation to the original Gram matrix [3, Theorem 3.1], [4, Theorem 3.1], [49, Theorem 3.2].
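The interlacing fact behind the “barrier” terminology can be checked directly. A minimal sketch (NumPy; the matrix size and seed are illustrative): after a rank-one update $G + vv^T$, each eigenvalue moves up, but not past the next eigenvalue of $G$.

```python
import numpy as np

rng = np.random.default_rng(1)

# A symmetric positive semi-definite Gram matrix G and a rank-one update.
B = rng.standard_normal((5, 5))
G = B @ B.T
v = rng.standard_normal(5)

lam = np.linalg.eigvalsh(G)                  # eigenvalues of G, ascending
mu = np.linalg.eigvalsh(G + np.outer(v, v))  # eigenvalues after the update

# Each updated eigenvalue moves up, but is "barred" by the next
# eigenvalue of G: lam[i] <= mu[i] <= lam[i+1].
assert np.all(lam <= mu + 1e-10)
assert np.all(mu[:-1] <= lam[1:] + 1e-10)
```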
Boutsidis  extends this work to general Gram matrices . Following , he selects columns from the right singular vector matrix of , and applies barrier sampling simultaneously to the dominant and subdominant subspaces of .
In terms of randomized algorithms for subset selection, the two-stage algorithm by Boutsidis et al.  samples columns in the first stage, and performs a deterministic subset selection on the sampled columns in the second stage. Other approaches include volume sampling [24, 25], and CUR decompositions .
1.3.5 Leverage scores
In the late seventies, statisticians introduced leverage scores for outlier detection in regression problems [12, 33, 53]. More recently, Drineas, Mahoney et al. have pioneered the use of leverage scores for importance sampling in randomized algorithms, such as CUR decompositions, least squares problems, and column subset selection; see also the perspectives on statistical leverage [45, §6]. Fast approximation algorithms are being designed to make the computation of leverage scores more affordable [18, 39, 42].
All matrices are real. Matrices that can have more than one column are indicated in bold face, and column vectors and scalars in italics. The columns of the $m\times n$ matrix $A$ are denoted by $a_1,\dots,a_n$. The $n\times n$ identity matrix is $I_n = \begin{pmatrix} e_1 & \cdots & e_n \end{pmatrix}$, whose columns are the canonical vectors $e_j$.
The thin Singular Value Decomposition (SVD) of an $m\times n$ matrix $A$ with $\operatorname{rank}(A) = \rho$ is $A = U\Sigma V^T$, where the $m\times\rho$ matrix $U$ and the $n\times\rho$ matrix $V$ have orthonormal columns, $U^TU = I_\rho = V^TV$, and the diagonal matrix of singular values is $\Sigma = \operatorname{diag}\big(\sigma_1\ \cdots\ \sigma_\rho\big)$, with $\sigma_1 \ge \cdots \ge \sigma_\rho > 0$. The Moore-Penrose inverse of $A$ is $A^\dagger = V\Sigma^{-1}U^T$. The unique symmetric positive semi-definite square root of a symmetric positive semi-definite matrix $W$ is denoted by $W^{1/2}$.
The norms in this paper are the two-norm $\|\cdot\|_2$ and the Frobenius norm $\|\cdot\|_F$. The stable rank of a non-zero matrix $A$ is $\operatorname{sr}(A) = \|A\|_F^2/\|A\|_2^2$, where $1 \le \operatorname{sr}(A) \le \operatorname{rank}(A)$.
Given a $k\times n$ matrix $Q$ with orthonormal rows, $QQ^T = I_k$, the two-norm condition number with regard to left inversion is $\kappa(Q) = \|Q\|_2\,\|Q^\dagger\|_2$; the leverage scores [19, 20, 45] are the squared column norms $\ell_j = \|Qe_j\|_2^2$, $1\le j\le n$; and the coherence [1, 10] is the largest leverage score, $\mu = \max_{1\le j\le n}\ell_j$.
The expected value of a scalar- or matrix-valued random variable $Z$ is $\mathbb{E}[Z]$, and the probability of an event $\mathcal{E}$ is $\mathbb{P}[\mathcal{E}]$.
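The quantities above are straightforward to compute. A minimal sketch (NumPy; the matrix dimensions and seed are illustrative) verifies the basic relations: the stable rank lies between 1 and the rank, the leverage scores sum to the rank, and the coherence is at most 1.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 8))  # illustrative 5 x 8 matrix

# Stable rank: sr(A) = ||A||_F^2 / ||A||_2^2, with 1 <= sr(A) <= rank(A).
sr = np.linalg.norm(A, 'fro') ** 2 / np.linalg.norm(A, 2) ** 2
assert 1.0 <= sr <= np.linalg.matrix_rank(A)

# Leverage scores: squared column norms of V^T (thin SVD A = U S V^T).
# V^T has orthonormal rows, and the leverage scores sum to rank(A).
_, _, Vt = np.linalg.svd(A, full_matrices=False)
lev = np.sum(Vt ** 2, axis=0)
assert np.isclose(lev.sum(), np.linalg.matrix_rank(A))

# Coherence: the largest leverage score; it never exceeds 1.
coherence = lev.max()
assert coherence <= 1 + 1e-12
```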
2 Deterministic conditions for exact computation
To gauge the potential of the Monte Carlo algorithm, and to establish a connection to existing work in linear algebra, we first consider the best case: the exact computation of $AA^T$ from a few columns. That is: given $c$ not necessarily distinct columns $a_{k_1},\dots,a_{k_c}$, under which conditions is $AA^T = \sum_{t=1}^{c} w_t\, a_{k_t}a_{k_t}^T$?
Since a column can be selected more than once, and therefore the selected columns may not form a submatrix of $A$, we express the selected columns as $AS$, where $S$ is an $n\times c$ sampling matrix with
$$S = \begin{pmatrix} e_{k_1} & \cdots & e_{k_c} \end{pmatrix}.$$
Then one can write
$$\sum_{t=1}^{c} w_t\, a_{k_t}a_{k_t}^T = (AS)\,W\,(AS)^T,$$
where $W = \operatorname{diag}\big(w_1\ \cdots\ w_c\big)$ is a diagonal weighting matrix. We answer two questions in this section:
Given a set of $c$ columns of $A$, when is $AA^T = (AS)\,W\,(AS)^T$ without any constraints on $W$? The answer is an expression for a matrix $W$ with minimal Frobenius norm (Section 2.1).
Given a set of $c$ columns of $A$, what are necessary and sufficient conditions under which $AA^T = (AS)\,W\,(AS)^T$ for a diagonal matrix $W$? The answer depends on the right singular vector matrix $V$ of $A$ (Section 2.2).
2.1 Optimal approximation (no constraints on $W$)
For a given set of $c$ columns of $A$, we determine a matrix $W$ of minimal Frobenius norm that minimizes the absolute error of $(AS)\,W\,(AS)^T$ in the Frobenius norm.
The following is a special case of [23, Theorem 2.1], without any constraints on the number of columns in $AS$. The idea is to represent $AA^T$ in terms of the thin SVD of $A$ as $AA^T = U\Sigma^2U^T$.
Given $c$ columns $AS$ of $A$, not necessarily distinct, the unique solution of
$$\min_{W}\ \big\|AA^T - (AS)\,W\,(AS)^T\big\|_F$$
with minimal Frobenius norm is $W_0 = (AS)^\dagger\, AA^T\, \big((AS)^\dagger\big)^T$.
If, in addition, $\operatorname{rank}(AS) = \operatorname{rank}(A)$, then
$$AA^T = (AS)\,W_0\,(AS)^T.$$
If also $c = \operatorname{rank}(A)$, then
$$W_0 = (V^TS)^{-1}\,(V^TS)^{-T}.$$
See Section 7.1.
Theorem 2.1 implies that if $AS$ has maximal rank, then the solution of minimal Frobenius norm depends only on the right singular vector matrix $V$ of $A$, and in particular only on those columns of $V^T$ that correspond to the columns in $AS$.
2.2 Exact computation with outer products (diagonal )
We present necessary and sufficient conditions under which $AA^T = (AS)\,W\,(AS)^T$ for a non-negative diagonal matrix $W$, that is, $AA^T = \sum_{t=1}^{c} w_t\, a_{k_t}a_{k_t}^T$ with $w_t \ge 0$.
Let $A$ be an $m\times n$ matrix with $\operatorname{rank}(A) = \rho$, and let $a_{k_1},\dots,a_{k_c}$ be columns of $A$. Then
$$AA^T = \sum_{t=1}^{c} w_t\, a_{k_t}a_{k_t}^T$$
for weights $w_t \ge 0$, if and only if the $\rho\times c$ matrix
$$\begin{pmatrix} \sqrt{w_1}\,V^Te_{k_1} & \cdots & \sqrt{w_c}\,V^Te_{k_c} \end{pmatrix}$$
has orthonormal rows.
See Section 7.2.
If $A$ has rank one, then any non-zero columns of $A$ will do for representing $AA^T$, and explicit expressions for the weights can be derived.
If $\operatorname{rank}(A) = 1$, then for any $c$ non-zero columns $a_{k_1},\dots,a_{k_c}$,
$$AA^T = \sum_{t=1}^{c} w_t\, a_{k_t}a_{k_t}^T, \qquad w_t = \frac{1}{c\,\ell_{k_t}} = \frac{\|A\|_F^2}{c\,\|a_{k_t}\|_2^2}, \quad 1\le t\le c.$$
See Section 7.3.
Hence, in the special case of rank-one matrices, the weights are inverse leverage scores of $V^T$, as well as inverse normalized column norms of $A$. Furthermore, in the special case $c = 1$, Corollary 2.2 implies that any single non-zero column of $A$ can be chosen. In particular, choosing the column of largest norm yields a weight of minimal value $1/\mu$, where $\mu$ is the coherence of $A$.
In the following, we look at Theorem 2.2 in more detail, and distinguish the two cases where the number of selected columns is greater than $\operatorname{rank}(A)$, and where it is equal to $\operatorname{rank}(A)$.
2.2.1 Number of selected columns greater than the rank
We illustrate the conditions of Theorem 2.2 when $c > \operatorname{rank}(A)$. In this case, the indices $k_1,\dots,k_c$ do not necessarily have to be distinct, and a column can occur repeatedly.
so that . Also let , and select the first column twice, and , so that
The weights and give a matrix
with orthonormal rows. Thus, an exact representation does not require distinct indices.
However, although the above weights yield an exact representation, the corresponding weight matrix does not have minimal Frobenius norm.
Remark (Connection to Theorem 2.1)
To see this, note that for $c > \operatorname{rank}(A)$, the columns of $AS$ are linearly dependent. Hence the minimal Frobenius norm solution $W_0$ has rank equal to $\operatorname{rank}(AS) < c$. If $W_0$ were to be diagonal, it could have only $\operatorname{rank}(AS)$ non-zero diagonal elements, hence the number of outer products would be less than $c$, a contradiction.
To illustrate this, let
so that . Also, let , and select columns , and , so that
Theorem 2.1 implies that the solution with minimal Frobenius norm is
which is not diagonal.
However is also a solution since has orthonormal rows. But does not have minimal Frobenius norm since , while .
2.2.2 Number of selected columns equal to the rank
If $c = \operatorname{rank}(A)$, then no column of $A$ can be selected more than once, hence the selected columns form a submatrix of $A$. In this case Theorem 2.2 can be strengthened: as for the rank-one case in Corollary 2.2, an explicit expression for the weights in terms of leverage scores can be derived.
Let $A$ be an $m\times n$ matrix with $\operatorname{rank}(A) = \rho$, and let $c = \rho$. In addition to the conclusions of Theorem 2.2 the following also holds: If
$$\begin{pmatrix} \sqrt{w_1}\,V^Te_{k_1} & \cdots & \sqrt{w_\rho}\,V^Te_{k_\rho} \end{pmatrix}$$
has orthonormal rows, then it is an orthogonal matrix, and $w_t = 1/\ell_{k_t}$, $1\le t\le \rho$.
See Section 7.4.
Note that the columns selected from $A$ do not necessarily correspond to the largest leverage scores. The following example illustrates that the conditions in Theorem 2.2.2 are non-trivial.
In Theorem 2.2.2 it is not always possible to find $\operatorname{rank}(A)$ columns from $A$ that yield an orthogonal matrix.
For instance, let
and $\operatorname{rank}(A) = 2$. Since no two columns of $V^T$ are orthogonal, no two columns can be scaled to be orthonormal. Thus no $2\times 2$ submatrix of $V^T$ can give rise to an orthogonal matrix.
However, for $c > \operatorname{rank}(A)$ it is possible to construct a matrix with orthonormal rows. Selecting columns , and from , and weights , and yields a matrix
that has orthonormal rows.
3 Monte Carlo algorithm for Gram Matrix Approximation
3.1 The algorithm
The randomized algorithm for approximating $AA^T$ is a special case of the BasicMatrixMultiplication algorithm [17, Figure 2], which samples according to the Exactly(c) algorithm [21, Algorithm 3], that is, independently and with replacement; a column can therefore be sampled more than once.
A conceptual version is presented as Algorithm 1. Given a user-specified number of samples $c$ and a set of probabilities $p_1,\dots,p_n$, it assembles the $n\times c$ sampling matrix $S$, applies it to $A$, and finally computes the product
$$X = (AS)\,W\,(AS)^T = \sum_{t=1}^{c} \frac{1}{c\,p_{k_t}}\, a_{k_t}a_{k_t}^T.$$
The choice of weights $w_t = 1/(c\,p_{k_t})$ makes $X$ an unbiased estimator, $\mathbb{E}[X] = AA^T$ [17, Lemma 3].
Discounting the cost of sampling, Algorithm 1 requires $O(m^2c)$ flops to compute an approximation to $AA^T$. Note that Algorithm 1 allows zero probabilities: since an index corresponding to $p_j = 0$ can never be selected, division by zero does not occur in the computation of the weights. Implementations of sampling with replacement are discussed in [22, Section 2.1]. For matrices of small dimension, one can simply use the Matlab function randsample.
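The sampling scheme above can be sketched in a few lines (NumPy; the function name `gram_monte_carlo` and the matrix sizes are illustrative, not from the paper). Unbiasedness is checked by enumeration rather than simulation: for $c = 1$, the probability of drawing column $k$ cancels against the weight $1/p_k$.

```python
import numpy as np

def gram_monte_carlo(A, p, c, rng):
    """Approximate A @ A.T from c column outer products, drawn
    independently with replacement according to probabilities p.
    The weight 1/(c*p[k]) makes the estimator unbiased."""
    m, n = A.shape
    idx = rng.choice(n, size=c, replace=True, p=p)
    X = np.zeros((m, m))
    for k in idx:
        X += np.outer(A[:, k], A[:, k]) / (c * p[k])
    return X

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 7))

# "Optimal" probabilities: proportional to squared column norms.
p = np.sum(A ** 2, axis=0) / np.linalg.norm(A, 'fro') ** 2

X = gram_monte_carlo(A, p, 50, rng)
assert X.shape == (4, 4) and np.allclose(X, X.T)

# Unbiasedness, checked by enumeration for c = 1: the probability of
# drawing column k cancels against the weight 1/p[k], so the expected
# value is exactly A @ A.T.
E = sum(p[k] * np.outer(A[:, k], A[:, k]) / p[k] for k in range(7))
assert np.allclose(E, A @ A.T)
```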
3.2 Sampling probabilities
We consider two types of probabilities: the “optimal” probabilities (Section 3.2.1), and leverage score probabilities (Section 3.2.2), motivated by Corollary 2.2 and Theorem 2.2.2, and by their use in other randomized algorithms [9, 19, 20]. We show (Theorem 3.2.2) that for rank-one matrices, Algorithm 1 with “optimal” probabilities produces the exact result with a single sample. Numerical experiments (Section 3.2.3) illustrate that sampling with “optimal” probabilities results in smaller two-norm relative errors than sampling with leverage score probabilities, and that the two types of probabilities can differ significantly.
3.2.1 “Optimal” probabilities 
They are defined by
$$p_j^{opt} = \frac{\|a_j\|_2^2}{\|A\|_F^2}, \qquad 1\le j\le n,$$
and are called “optimal” because they minimize the expected value of the Frobenius norm absolute error, $\mathbb{E}\,\|X - AA^T\|_F^2$ [17, Lemma 4]. The “optimal” probabilities can be computed in $O(mn)$ flops.
The analyses in [17, Section 4.4] apply to the more general “nearly optimal” probabilities $p_j$, which satisfy $\sum_{j=1}^{n} p_j = 1$ and are constrained by
$$p_j \ge \beta\, p_j^{opt}, \qquad 1\le j\le n, \qquad (2)$$
where $0 < \beta \le 1$ is a scalar. In the special case $\beta = 1$, they revert to the optimal probabilities, $p_j = p_j^{opt}$, $1\le j\le n$. Hence $\beta$ can be viewed as quantifying the deviation of the probabilities $p_j$ from the “optimal” probabilities $p_j^{opt}$.
3.2.2 Leverage score probabilities
The exact representation in Theorem 2.2.2 suggests probabilities based on the leverage scores of $A$,
$$p_j^{lev} = \frac{\ell_j}{\rho}, \qquad 1\le j\le n, \qquad \rho = \operatorname{rank}(A).$$
Since the leverage score probabilities are proportional to the squared column norms of $V^T$, they are the “optimal” probabilities for approximating $V^TV = I_\rho$. Exact computation of leverage score probabilities, via an SVD or QR decomposition, requires $O(mn\min\{m,n\})$ flops; thus, it is more expensive than the computation of the “optimal” probabilities.
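The two kinds of probabilities can behave very differently. A small sketch (NumPy; the column scaling and seed are illustrative): for a matrix of full column rank, $V$ is square orthogonal, so all leverage scores equal 1 and the leverage score probabilities are uniform, while the “optimal” probabilities track the column norms.

```python
import numpy as np

rng = np.random.default_rng(4)
# Columns with strongly graded norms (illustrative 6 x 5 example).
A = rng.standard_normal((6, 5)) * np.array([100.0, 10.0, 1.0, 0.1, 0.01])

# "Optimal" probabilities: normalized squared column norms of A.
p_opt = np.sum(A ** 2, axis=0) / np.linalg.norm(A, 'fro') ** 2

# Leverage score probabilities: leverage scores divided by rank(A).
_, _, Vt = np.linalg.svd(A, full_matrices=False)
lev = np.sum(Vt ** 2, axis=0)
p_lev = lev / lev.sum()

assert np.isclose(p_opt.sum(), 1.0) and np.isclose(p_lev.sum(), 1.0)

# Here rank(A) equals the number of columns, so V is square orthogonal,
# every leverage score equals 1, and p_lev is uniform, while p_opt
# spans orders of magnitude.
assert p_opt.max() / p_opt.min() > p_lev.max() / p_lev.min()
```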
In the special case of rank-one matrices, the “optimal” and leverage score probabilities are identical; and Algorithm 1 with “optimal” probabilities computes the exact result with any number of samples, in particular with a single sample. This follows directly from Corollary 2.2.
If $\operatorname{rank}(A) = 1$, then $p_j^{opt} = p_j^{lev}$, $1\le j\le n$.
If $X$ is computed by Algorithm 1 with any $c \ge 1$ and probabilities $p_j^{opt}$, then $X = AA^T$.
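The rank-one exactness is easy to confirm numerically. A minimal sketch (NumPy; the vectors and seed are illustrative): for $A = xy^T$, a single weighted outer product reproduces $AA^T$, whichever column is drawn.

```python
import numpy as np

rng = np.random.default_rng(5)
# Rank-one matrix A = x y^T: every column is a multiple of x.
x = rng.standard_normal(4)
y = rng.standard_normal(6)
A = np.outer(x, y)

# "Optimal" probabilities p_j = ||a_j||^2 / ||A||_F^2.
p = np.sum(A ** 2, axis=0) / np.linalg.norm(A, 'fro') ** 2

# A single sample (c = 1) already reproduces A A^T exactly,
# whichever column k happens to be drawn.
for k in range(6):
    X = np.outer(A[:, k], A[:, k]) / p[k]
    assert np.allclose(X, A @ A.T)
```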
|EEG Eye State|15|1.31|
|Wine Quality - Red|12|1.03|
|Wine Quality - White|12|1.01|
3.2.3 Comparison of sampling probabilities
We compare the norm-wise relative errors due to randomization of Algorithm 1 when it samples with “optimal” probabilities and leverage score probabilities.
Experimental set up
We present experiments with eight representative matrices, described in Table 4, from the UCI Machine Learning Repository.
For each matrix, we ran Algorithm 1 twice: once sampling with “optimal” probabilities $p_j^{opt}$, and once sampling with leverage score probabilities $p_j^{lev}$. The sampling amounts $c$ range from 1 to the total number of columns, with 100 runs for each value of $c$.
Figure 1 contains two plots for each matrix: the left plot shows the two-norm relative errors due to randomization, $\|X - AA^T\|_2/\|A\|_2^2$, averaged over 100 runs, versus the sampling amount $c$. The right plot shows the ratios $p_j^{lev}/p_j^{opt}$ of leverage score over “optimal” probabilities, $1\le j\le n$.
Sampling with “optimal” probabilities produces average errors that are lower, by as much as a factor of 10, than those from sampling with leverage score probabilities, for all sampling amounts $c$. Furthermore, corresponding leverage score and “optimal” probabilities tend to differ by several orders of magnitude.
4 Error due to randomization, for sampling with “nearly optimal” probabilities
We present two new probabilistic bounds (Sections 4.1 and 4.2) for the two-norm relative error due to randomization, $\|X - AA^T\|_2/\|A\|_2^2$, when Algorithm 1 samples with the “nearly optimal” probabilities in (2). The bounds depend on the stable rank or the rank of $A$, but not on the matrix dimensions. Neither bound is always better than the other (Section 4.3). The numerical experiments (Section 4.4) illustrate that the bounds are informative, even for stringent success probabilities and matrices of small dimension.
4.1 First bound
The first bound depends on the stable rank of $A$ and also, weakly, on the rank.
Given $0 < \epsilon \le 1$ and $0 < \delta < 1$, if the number of columns sampled by Algorithm 1 is at least
then with probability at least $1-\delta$,
$$\frac{\|X - AA^T\|_2}{\|A\|_2^2} \le \epsilon.$$
See Section 7.5.
As the required error $\epsilon$ becomes smaller, so does the constant in the lower bound for the number of samples.
4.2 Second bound
This bound depends only on the stable rank of $A$.
Given $0 < \epsilon \le 1$ and $0 < \delta < 1$, if the number of columns sampled by Algorithm 1 is at least
then with probability at least $1-\delta$,
$$\frac{\|X - AA^T\|_2}{\|A\|_2^2} \le \epsilon.$$
See Section 7.6.