 # Randomized Approximation of the Gram Matrix: Exact Computation and Probabilistic Bounds

Given a real matrix A with n columns, the problem is to approximate the Gram product AA^T by c << n weighted outer products of columns of A. Necessary and sufficient conditions for the exact computation of AA^T (in exact arithmetic) from c >= rank(A) columns depend on the right singular vector matrix of A. For a Monte-Carlo matrix multiplication algorithm by Drineas et al. that samples outer products, we present probabilistic bounds for the 2-norm relative error due to randomization. The bounds depend on the stable rank or the rank of A, but not on the matrix dimensions. Numerical experiments illustrate that the bounds are informative, even for stringent success probabilities and matrices of small dimension. We also derive bounds for the smallest singular value and the condition number of matrices obtained by sampling rows from orthonormal matrices.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Given a real matrix with columns , can one approximate the Gram matrix from just a few columns? We answer this question by presenting deterministic conditions for the exact111We assume infinite precision, and no round off errors. computation of from a few columns, and probabilistic error bounds for approximations.

Our motivation (Section 1.1) is followed by an overview of the results (Section 1.2), and a literature survey (Section 1.3). Those not familiar with established notation can find a review in Section 1.4.

### 1.1 Motivation

The objective is the analysis of a randomized algorithm for approximating

. Specifically, it is a Monte Carlo algorithm for sampling outer products and represents a special case of the ground breaking work on randomized matrix multiplication by Drineas, Kannan, and Mahoney

[16, 17].

The basic idea is to represent as a sum of outer products of columns,

 AAT=A1AT1+⋯+AnATn.

The Monte Carlo algorithm [16, 17], when provided with a user-specified positive integer , samples columns , , according to probabilities , , and then approximates by a weighted sum of outer products

 X=w1At1ATt1+⋯+wcAtcATtc.

The weights are set to so that is an unbiased estimator, . Intuitively, one would expect the algorithm to do well for matrices of low rank.

The intuition is based on the singular value decomposition. Given left singular vectors associated with the non-zero singular values of , one can represent as a sum of outer products,

 AAT=σ21U1UT1+⋯+σ2kUkUTk.

Hence for matrices of low rank, a few left singular vectors and singular values suffice to reproduce exactly. Thus, if has columns that “resemble” its left singular vectors, the Monte Carlo algorithm should have a chance to perform well.

### 1.2 Contributions and Overview

We sketch the main contributions of this paper. All proofs are relegated to Section 7.

#### 1.2.1 Deterministic conditions for exact computation (Section 2)

To calibrate the potential of the Monte-Carlo algorithm [16, 17] and establish connections to existing work in linear algebra, we first derive deterministic conditions that characterize when can be computed exactly from a few columns of . Specifically:

• We present necessary and sufficient conditions (Theorem 2.2) for computing exactly from columns of ,

 AAT=w1At1ATt1+⋯+wcAtcATtc.

The conditions and weights depend on the right singular vector matrix associated with the non-zero singular values of .

• For matrices with , this is always possible (Corollary 2.2).

• In the special case where (Theorem 2.2.2), the weights are equal to inverse leverage scores, . However, they do not necessarily correspond to the largest leverage scores.

#### 1.2.2 Sampling probabilities for the Monte-Carlo algorithm (Section 3)

Given an approximation from the Monte-Carlo algorithm [16, 17], we are interested in the two-norm relative error due to randomization, . Numerical experiments compare two types of sampling probabilities:

• “Optimal” probabilities , and

• Leverage score probabilities [7, 9].

The experiments illustrate that sampling columns of with the “optimal” probabilities produces a smaller error than sampling with leverage score probabilities. This was not obvious a priori, because the “optimal” probabilities are designed to minimize the expected value of the Frobenius norm absolute error, . Furthermore, corresponding probabilites and can differ by orders of magnitude.

For matrices of rank one though, we show (Theorem 3.2.2) that the probabilities are identical, for , and that the Monte Carlo algorithm always produces the exact result, , when it samples with these probabilities.

#### 1.2.3 Probabilistic bounds (Sections 4 and 5)

We present probabilistic bounds for when the Monte-Carlo algorithm samples with two types of sampling probabilities.

• Sampling with “nearly optimal” probabilities , where (Theorems 4.1 and 4.2). We show that

 ∥X−AAT∥2/∥AAT∥2≤ϵwith probability at least~{}1−δ,

provided the number of sampled columns is at least

 c≥c0(ϵ)ln(ρ(A)/δ)βϵ2sr(A),where2≤c0(ϵ)≤2.7.

Here or , where is the stable rank of . The bound containing is tighter for matrices with .

Note that the amount of sampling depends on the rank or the stable rank, but not on the dimensions of . Numerical experiments (Section 4.4) illustrate that the bounds are informative, even for stringent success probabilities and matrices of small dimension.

• Sampling with leverage score probabilities (Theorem 5). The bound corroborates the numerical experiments in Section 3.2.3, but is not as tight as the bounds for “nearly optimal” probabilities, since it depends only on , and .

#### 1.2.4 Singular value bounds (Section 6)

Given a matrix with orthonormal rows, , the Monte-Carlo algorithm computes by sampling columns from with the “optimal” probabilities. The goal is to derive a positive lower bound for the smallest singular value , as well as an upper bound for the two-norm condition number with respect to left inversion .

Surprisingly, Theorem 4.1 leads to bounds (Theorems 6.1 and 6.2) that are not always as tight as the ones below. These bounds are based on a Chernoff inequality and represent a slight improvement over existing results.

• Bound for the smallest singular value (Theorem 6.1). We show that

 σm(QS)≥√1−ϵ% with probability at least~{}1−δ,

provided the number of sampled columns is at least

 c≥c1(ϵ)mln(m/δ)ϵ2,where% 1≤c1(ϵ)≤2.
• Condition number bound (Theorem 6.2). We show that

 κ(QS)≤√1+ϵ√1−ϵwith probability at least~{}1−δ,

provided the number of sampled columns is at least

 c≥c2(ϵ)mln(2m/δ)ϵ2,% where2≤c2(ϵ)≤2.6.

In addition, we derive corresponding bounds for uniform sampling with and without replacement (Theorems 6.1 and 6.2).

### 1.3 Literature Review

We review bounds for the relative error due to randomization of general Gram matrix approximations , and also for the smallest singular value and condition number of sampled matrices when has orthonormal rows.

In addition to [16, 17], several other randomized matrix multiplication algorithms have been proposed [5, 13, 14, 40, 46, 48]. Sarlós’s algorithms  are based on matrix transformations. Cohen and Lewis [13, 14] approximate large elements of a matrix product with a random walk algorithm. The algorithm by Belabbas and Wolfe  is related to the Monte Carlo algorithm [16, 17], but with different sampling methods and weights. A second algorithm by Drineas et al.  relies on matrix sparsification, and a third algorithm  estimates each matrix element independently. Pagh  targets sparse matrices, while Liberty  estimates the Gram matrix by iteratively removing “unimportant” columns from .

Eriksson-Bique et al. 

derive an importance sampling strategy that minimizes the variance of the inner products computed by the Monte Carlo method. Madrid, Guerra, and Rojas

 present experimental comparisons of different sampling strategies for specific classes of matrices.

Excellent surveys of randomized matrix algorithms in general are given by Halko, Martinsson, and Tropp , and by Mahoney .

#### 1.3.1 Gram matrix approximations

We review existing bounds for the error due to randomization of the Monte Carlo algorithm [16, 17] for approximating , where is a real matrix. Relative error bounds in the Frobenius norm and the two-norm are summarized in Tables 1 and 2.

Table 1 shows probabilistic lower bounds for the number of sampled columns so that the Frobenius norm relative error . Not listed is a bound for uniform sampling without replacement [38, Corollary 1], because it cannot easily be converted to the format of the other bounds, and a bound for a greedy sampling strategy [5, p. 5].

Table 2 shows probabilistic lower bounds for the number of sampled columns so that the two-norm relative error . These bounds imply, roughly, that the number of sampled columns should be at least or .

#### 1.3.2 Singular value bounds

We review existing bounds for the smallest singular value of a sampled matrix , where is with orthonormal rows.

Table 3 shows probabilistic lower bounds for the number of sampled columns so that the smallest singular value . All bounds but one contain the coherence . Not listed is a bound [21, Lemma 4] that requires specific choices of , , and .

#### 1.3.3 Condition number bounds

We are aware of only two existing bounds for the two-norm condition number of a matrix whose columns are sampled from a matrix  with orthonormal rows. The first bound [1, Theorem 3.2] lacks explicit constants, while the second one [37, Corollary 4.2] applies to uniform sampling with and without replacement. It ensures with probability at least , provided the number of sampled columns in is at least .

#### 1.3.4 Relation to subset selection

The Monte Carlo algorithm selects outer products from , which is equivalent to selecting columns from , hence it can be viewed as a form of randomized column subset selection.

The traditional deterministic subset selection methods select exactly the required number of columns, by means of rank-revealing QR decompositions or SVDs

[11, 28, 29, 31, 34]. In contrast, more recent methods are motivated by applications to graph sparsification [3, 4, 49]. They oversample columns from a matrix with orthonormal rows, by relying on a barrier sampling strategy222The name comes about as follows: Adding a column to amounts to a rank-one update for the Gram matrix

. The eigenvalues of this matrix, due to interlacing, form “barriers” for the eigenvalues of the updated matrix

.
. The accuracy of the selected columns is determined by bounding the reconstruction error, which views as an approximation to [3, Theorem 3.1], [4, Theorem 3.1], [49, Theorem 3.2].

Boutsidis  extends this work to general Gram matrices . Following , he selects columns from the right singular vector matrix of , and applies barrier sampling simultaneously to the dominant and subdominant subspaces of .

In terms of randomized algorithms for subset selection, the two-stage algorithm by Boutsidis et al.  samples columns in the first stage, and performs a deterministic subset selection on the sampled columns in the second stage. Other approaches include volume sampling [24, 25], and CUR decompositions .

#### 1.3.5 Leverage scores

In the late seventies, statisticians introduced leverage scores for outlier detection in regression problems

[12, 33, 53]. More recently, Drineas, Mahoney et al. have pioneered the use of leverage scores for importance sampling in randomized algorithms, such as CUR decompositions , least squares problems , and column subset selection , see also the perspectives on statistical leverage [45, §6]. Fast approximation algorithms are being designed to make the computation of leverage scores more affordable [18, 39, 42].

### 1.4 Notation

All matrices are real. Matrices that can have more than one column are indicated in bold face, and column vectors and scalars in italics. The columns of the matrix are denoted by . The identity matrix is , whose columns are the canonical vectors .

The thin Singular Value Decomposition (SVD) of a matrix with is , where the matrix and the matrix have orthonormal columns, , and the diagonal matrix of singular values is , with . The Moore-Penrose inverse of  is . The unique symmetric positive semi-definite square root of a symmetric positive semi-definite matrix is denoted by .

The norms in this paper are the two-norm , and the Frobenius norm

 ∥A∥F≡ ⎷n∑j=1∥Aj∥22=√σ21+⋯+σ2k.

The stable rank

of a non-zero matrix

is , where .

Given a matrix with orthonormal rows, , the two-norm condition number with regard to left inversion is ; the leverage scores [19, 20, 45] are the squared columns norms , ; and the coherence [1, 10] is the largest leverage score,

 μ≡max1≤j≤m∥Qj∥22.

The expected value of a scalar or a matrix-valued random random variable

is ; and the probability of an event is .

## 2 Deterministic conditions for exact computation

To gauge the potential of the Monte Carlo algorithm, and to establish a connection to existing work in linear algebra, we first consider the best case: The exact computation of from a few columns. That is: Given not necessarily distinct columns , under which conditions is ?

Since a column can be selected more than once, and therefore the selected columns may not form a submatrix of , we express the selected columns as , where is a sampling matrix with

 S=(et1…etc),1≤t1≤…≤tc≤n.

Then one can write

 w1At1ATt1+⋯+wcAtcATtc=(AS)W(AS)T,

where is diagonal weighting matrix. We answer two questions in this section:

1. Given a set of columns of , when is without any constraints on ? The answer is an expression for a matrix with minimal Frobenius norm (Section 2.1).

2. Given a set of columns of , what are necessary and sufficient conditions under which for a diagonal matrix ? The answer depends on the right singular vector matrix of (Section 2.2).

### 2.1 Optimal approximation (no constraints on W)

For a given set of columns of , we determine a matrix of minimal Frobenius norm that minimizes the absolute error of in the Frobenius norm.

The following is a special case of [23, Theorem 2.1], without any constraints on the number of columns in . The idea is to represent in terms of the thin SVD of as .

Given columns of , not necessarily distinct, the unique solution of

 minW∥AAT−(AS)W(AS)T∥F

with minimal Frobenius norm is .

 (AS)Wopt(AS)T=AATandWopt=(VTS)†((VTS)†)T.

If also , then

 (AS)Wopt(AS)T=AATandWopt=(VTS)−1(VTS)−T.

See Section 7.1.

Theorem 2.1 implies that if has maximal rank, then the solution of minimal Frobenius norm depends only on the right singular vector matrix of and in particular only on those columns that correspond to the columns in .

### 2.2 Exact computation with outer products (diagonal W)

We present necessary and sufficient conditions under which for a non-negative diagonal matrix , that is .

Let be a matrix, and let . Then

 c∑j=1wjAtjATtj=AAT

for weights , if and only if the matrix has orthonormal rows.

See Section 7.2.

If has rank one, then any non-zero columns of will do for representing , and explicit expressions for the weights can be derived.

If then for any columns ,

 c∑j=1wjAtjATtj=AATwherewj=1c∥VTetj∥22=∥A∥2F∥Atj∥22,1≤j≤c.

See Section 7.3.

Hence, in the special case of rank-one matrices, the weights are inverse leverage scores of as well as inverse normalized column norms of . Furthermore, in the special case , Corollary 2.2 implies that any non-zero column of can be chosen. In particular, choosing the column of largest norm yields a weight of minimal value, where is the coherence of .

In the following, we look at Theorem 2.2 in more detail, and distinguish the two cases when the number of selected columns is greater than , and when it is equal to .

#### 2.2.1 Number of selected columns greater than \rank(A)

We illustrate the conditions of Theorem 2.2 when . In this case, indices do not necessarily have to be distinct, and a column can occur repeatedly.

###### Example

Let

 VT=(10000100)

so that . Also let , and select the first column twice, and , so that

 VT(e1e1e2)=(110001).

The weights and give a matrix

 VT(2−1/2e12−1/2e1e2)=(2−1/22−1/20001)

with orthonormal rows. Thus, an exact representation does not require distinct indices.

However, although the above weights yield an exact representation, the corresponding weight matrix does not have minimal Frobenius norm.

###### Remark (Connection to Theorem 2.1)

If in Theorem 2.2, then no diagonal weight matrix can be a minimal norm solution in Theorem 2.1.

To see this, note that for , the columns are linearly dependent. Hence the minimal Frobenius norm solution has rank equal to . If were to be diagonal, it could have only non-zero diagonal elements, hence the number of outer products would be , a contradiction.

To illustrate this, let

 VT=1√2(10100101)

so that . Also, let , and select columns , and , so that

 VTS≡VT(e1e2e3)=1√2(101010).

Theorem 2.1 implies that the solution with minimal Frobenius norm is

 Wopt=(VTS)†((VST)†)=⎛⎜⎝1/201/20201/201/2⎞⎟⎠,

which is not diagonal.

However is also a solution since has orthonormal rows. But does not have minimal Frobenius norm since , while .

#### 2.2.2 Number of selected columns equal to \rank(A)

If , then no column of can be selected more than once, hence the selected columns form a submatrix of . In this case Theorem 2.2 can be strengthened: As for the rank-one case in Corollary 2.2, an explicit expression for the weights in terms of leverage scores can be derived.

Let be a matrix with . In addition to the conclusions of Theorem 2.2 the following also holds: If

 VT(√w1et1⋯√wketk)

has orthonormal rows, then it is an orthogonal matrix, and

, .

See Section 7.4.

Note that the columns selected from do not necessarily correspond to the largest leverage scores. The following example illustrates that the conditions in Theorem 2.2.2 are non-trivial.

###### Example

In Theorem 2.2.2 it is not always possible to find columns from that yield an orthogonal matrix.

For instance, let

 VT=(1/21/21/21/2−1/√14−2/√143/√140),

and . Since no two columns of are orthogonal, no two columns can be scaled to be orthonormal. Thus no matrix submatrix of can give rise to an orthogonal matrix.

However, for it is possible to construct a matrix with orthonormal rows. Selecting columns , and from , and weights , and yields a matrix

that has orthonormal rows.

###### Remark (Connection to Theorem 2.1)

In Theorem 2.2.2 the condition implies that the matrix

 VT(et1…etk)=VTS

is non-singular. From Theorem 2.1 follows that is the unique minimal Frobenius norm solution for .

If, in addition, the rows of are orthonormal, then the minimal norm solution is a diagonal matrix,

 Wopt=(VTS)−1(VTS)−T=\diag(1∥VTet1∥22⋯1∥VTetk∥22).

## 3 Monte Carlo algorithm for Gram Matrix Approximation

We review the randomized algorithm to approximate the Gram matrix (Section 3.1); and discuss and compare two different types of sampling probabilities (Section 3.2).

### 3.1 The algorithm

The randomized algorithm for approximating , presented as Algorithm 1, is a special case of the BasicMatrixMultiplication Algorithm [17, Figure 2] which samples according to the Exactly(c) algorithm [21, Algorithm 3], that is, independently and with replacement. This means a column can be sampled more than once.

A conceptual version of the randomized algorithm is presented as Algorithm 1. Given a user-specified number of samples , and a set of probabilities , this version assembles columns of the sampling matrix , then applies to , and finally computes the product

 X=(AS)(AS)T=c∑j=11cptjAtjATtj.

The choice of weights makes an unbiased estimator, [17, Lemma 3].

Discounting the cost of sampling, Algorithm 1 requires flops to compute an approximation to . Note that Algorithm 1 allows zero probabilities. Since an index corresponding to can never be selected, division by zero does not occur in the computation of . Implementations of sampling with replacement are discussed in [22, Section 2.1]. For matrices of small dimension, one can simply use the Matlab function randsample.

### 3.2 Sampling probabilities

We consider two types of probabilities, the “optimal” probabilities from  (Section 3.2.1), and leverage score probabilities (Section 3.2.2) motivated by Corollary 2.2 and Theorem 2.2.2, and their use in other randomized algorithms [9, 19, 20]. We show (Theorem 3.2.2) that for rank-one matrices, Algorithm 1 with “optimal” probabilities produces the exact result with a single sample. Numerical experiments (Section 3.2.3) illustrate that sampling with “optimal” probabilities results in smaller two-norm relative errors than sampling with leverage score probabilities, and that the two types of probabilities can differ significantly.

#### 3.2.1 “Optimal” probabilities 

They are defined by

 poptj=∥∥Aj∥∥22∥A∥2F,1≤j≤n (1)

and are called “optimal” because they minimize [17, Lemma 4]. The “optimal” probabilities can be computed in flops.

The analyses in [17, Section 4.4] apply to the more general “nearly optimal” probabilities , which satisfy and are constrained by

 pβj≥βpoptj,1≤j≤n, (2)

where is a scalar. In the special case , they revert to the optimal probabilites, , . Hence can be viewed as the deviation of the probabilities from the “optimal” probabilities .

#### 3.2.2 Leverage score probabilities [7, 9]

The exact representation in Theorem 2.2.2 suggests probabilities based on the leverage scores of ,

 plevj=∥∥VTej∥∥22∥V∥2F=∥VTej∥22k,1≤j≤n, (3)

where .

Since the leverage score probabilities are proportional to the squared column norms of , they are the “optimal” probabilities for approximating . Exact computation of leverage score probabilities, via SVD or QR decomposition, requires flops; thus, it is more expensive than the computation of the “optimal” probabilities.

In the special case of rank-one matrices, the “optimal” and leverage score probabilities are identical; and Algorithm 1 with “optimal” probabilities computes the exact result with any number of samples, and in particular a single sample. This follows directly from Corollary 2.2.

If , then , .

If is computed by Algorithm 1 with any and probabilities , then .

#### 3.2.3 Comparison of sampling probabilities

We compare the norm-wise relative errors due to randomization of Algorithm 1 when it samples with “optimal” probabilites and leverage score probabilities.

##### Experimental set up

We present experiments with eight representative matrices, described in Table 4

, from the UCI Machine Learning Repository

.

For each matrix, we ran Algorithm 1 twice: once sampling with “optimal” probabilities , and once sampling with leverage score probabilities . The sampling amounts range from 1 to , with 100 runs for each value of .

Figure 1 contains two plots for each matrix: The left plot shows the two-norm relative errors due to randomization, , averaged over 100 runs, versus the sampling amount . The right plot shows the ratios of leverage score over “optimal” probabilities , .

##### Conclusions

Sampling with “optimal” probabilities produces average errors that are lower, by as much as a factor of 10, than those from sampling with leverage score probabilities, for all sampling amounts . Furthermore, corresponding leverage score and “optimal” probabilities tend to differ by several orders of magnitude.

## 4 Error due to randomization, for sampling with “nearly optimal” probabilities

We present two new probabilistic bounds (Sections 4.1 and 4.2) for the two-norm relative error due to randomization, when Algorithm 1 samples with the “nearly optimal” probabilities in (2). The bounds depend on the stable rank or the rank of , but not on the matrix dimensions. Neither bound is always better than the other (Section 4.3). The numerical experiments (Section 4.4) illustrate that the bounds are informative, even for stringent success probabilities and matrices of small dimension.

### 4.1 First bound

The first bound depends on the stable rank of and also, weakly, on the rank.

Let be an matrix, and let be computed by Algorithm 1 with the “nearly optimal” probabilities in (2).

Given and , if the number of columns sampled by Algorithm 1 is at least

 c≥c0(ϵ)sr(A)ln(\rank(A)/δ)βϵ2,wherec0(ϵ)≡2+2ϵ3,

then with probability at least ,

 ∥∥X−AAT∥∥2∥AAT∥2≤ϵ.

See Section 7.5.

As the required error becomes smaller, so does the constant in the lower bound for the number of samples, that is, as .

### 4.2 Second bound

This bound depends only on the stable rank of .

Let be an matrix, and let be computed by Algorithm 1 with the “nearly optimal” probabilities in (2).

Given and , if the number of columns sampled by Algorithm 1 is at least

 c≥c0(ϵ)sr(A)ln(4sr(A)/δ)βϵ2,wherec0(ϵ)≡2+2ϵ3,

then with probability at least ,

 ∥∥X−AAT∥∥2∥AAT∥2≤ϵ.

See Section 7.6.

### 4.3 Comparison

The bounds in Theorems