Lectures on Randomized Numerical Linear Algebra

December 24, 2017 ∙ Petros Drineas et al. ∙ Purdue University

This chapter is based on lectures on Randomized Numerical Linear Algebra from the 2016 Park City Mathematics Institute summer school on The Mathematics of Data.


1 Introduction

Matrices are ubiquitous in computer science, statistics, and applied mathematics. An $m \times n$ matrix can encode information about $m$ objects (each described by $n$ features), or the behavior of a discretized differential operator on a finite element mesh; an $n \times n$ positive-definite matrix can encode the correlations between all pairs of $n$ objects, or the edge-connectivity between all pairs of $n$ nodes in a social network; and so on. Motivated largely by technological developments that generate extremely large scientific and Internet data sets, recent years have witnessed exciting developments in the theory and practice of matrix algorithms. Particularly remarkable is the use of randomization—typically assumed to be a property of the input data due to, e.g., noise in the data generation mechanisms—as an algorithmic or computational resource for the development of improved algorithms for fundamental matrix problems such as matrix multiplication, least-squares (LS) approximation, low-rank matrix approximation, etc.

Randomized Numerical Linear Algebra (RandNLA) is an interdisciplinary research area that exploits randomization as a computational resource to develop improved algorithms for large-scale linear algebra problems. From a foundational perspective, RandNLA has its roots in theoretical computer science (TCS), with deep connections to mathematics (convex analysis, probability theory, metric embedding theory) and applied mathematics (scientific computing, signal processing, numerical linear algebra). From an applied perspective, RandNLA is a vital new tool for machine learning, statistics, and data analysis. Well-engineered implementations have already outperformed highly-optimized software libraries for ubiquitous problems such as least-squares regression, with good scalability in parallel and distributed environments. Moreover, RandNLA promises a sound algorithmic and statistical foundation for modern large-scale data analysis.

This chapter serves as a self-contained, gentle introduction to three fundamental RandNLA algorithms: randomized matrix multiplication, randomized least-squares solvers, and a randomized algorithm to compute a low-rank approximation to a matrix. As such, this chapter has strong connections with many areas of applied mathematics, and in particular it has strong connections with several other chapters in this volume. Most notably, this includes that of G. Martinsson, who uses these methods to develop improved low-rank matrix approximation solvers [pcmi-chapter-martinsson]; R. Vershynin, who develops probabilistic tools that are used in the analysis of RandNLA algorithms [pcmi-chapter-vershynin]; J. Duchi, who uses stochastic and randomized methods in a complementary manner for more general optimization problems [pcmi-chapter-duchi]; and M. Maggioni, who uses these methods as building blocks for more complex multiscale methods [pcmi-chapter-maggioni].

We start this chapter with a review of basic linear algebraic facts in Section 2; we review basic facts from discrete probability in Section 3; we present a randomized algorithm for matrix multiplication in Section 4; we present a randomized algorithm for least-squares regression problems in Section 5; and finally we present a randomized algorithm for low-rank approximation in Section 6. We conclude this introduction by noting that [Drineas2016, Mah-mat-rev_BOOK] might also be of interest to a reader who wants to go through other introductory texts on RandNLA.

2 Linear Algebra

In this section, we present a brief overview of basic linear algebraic facts and notation that will be useful in this chapter. We assume basic familiarity with linear algebra (e.g., inner/outer products of vectors, basic matrix operations such as addition, scalar multiplication, transposition, upper/lower triangular matrices, matrix-vector products, matrix multiplication, matrix trace, etc.).

2.1 Basics.

We will entirely focus on matrices and vectors over the reals. We will use the notation $\mathbf{x} \in \mathbb{R}^n$ to denote an $n$-dimensional vector: notice the use of bold latin lowercase letters for vectors. Vectors will always be assumed to be column vectors, unless explicitly noted otherwise. The vector of all zeros will be denoted as $\mathbf{0}$, while the vector of all ones will be denoted as $\mathbf{1}$; dimensions will be implied from context or explicitly included as a subscript.

We will use bold latin uppercase letters for matrices, e.g., $\mathbf{A} \in \mathbb{R}^{m \times n}$ denotes an $m \times n$ matrix $\mathbf{A}$. We will use the notation $\mathbf{A}_{i*}$ to denote the $i$-th row of $\mathbf{A}$ as a row vector and $\mathbf{A}_{*i}$ to denote the $i$-th column of $\mathbf{A}$ as a column vector. The (square) identity matrix will be denoted as $\mathbf{I}_n$, where $n$ denotes the number of rows and columns. Finally, we use $\mathbf{e}_i$ to denote the $i$-th column of $\mathbf{I}_n$, i.e., the $i$-th canonical vector.

Matrix Inverse. A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is nonsingular or invertible if there exists a matrix $\mathbf{A}^{-1} \in \mathbb{R}^{n \times n}$ such that
$$\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n.$$
The inverse exists when all the columns (or all the rows) of $\mathbf{A}$ are linearly independent. In other words, there does not exist a non-zero vector $\mathbf{x}$ such that $\mathbf{A}\mathbf{x} = \mathbf{0}$. Standard properties of the inverse include: $\left(\mathbf{A}^{-1}\right)^\top = \left(\mathbf{A}^\top\right)^{-1}$ and $\left(\mathbf{A}\mathbf{B}\right)^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$.

Orthogonal matrix. A matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ is orthogonal if $\mathbf{A}^\top\mathbf{A} = \mathbf{A}\mathbf{A}^\top = \mathbf{I}_n$. Equivalently, for all $i$ and $j$ between one and $n$,
$$\mathbf{A}_{*i}^\top\mathbf{A}_{*j} = \begin{cases} 1, & \text{if } i = j, \\ 0, & \text{if } i \neq j. \end{cases}$$
The same property holds for the rows of $\mathbf{A}$. In words, the columns (rows) of $\mathbf{A}$ are pairwise orthogonal and normal vectors.

QR Decomposition. Any matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ can be decomposed into the product of an orthogonal matrix and an upper triangular matrix as:
$$\mathbf{A} = \mathbf{Q}\mathbf{R},$$
where $\mathbf{Q} \in \mathbb{R}^{n \times n}$ is an orthogonal matrix and $\mathbf{R} \in \mathbb{R}^{n \times n}$ is an upper triangular matrix. The QR decomposition is useful in solving systems of linear equations, has computational complexity $O(n^3)$, and is numerically stable. To solve the linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$ using the QR decomposition, we first premultiply both sides by $\mathbf{Q}^\top$, thus getting $\mathbf{Q}^\top\mathbf{Q}\mathbf{R}\mathbf{x} = \mathbf{R}\mathbf{x} = \mathbf{Q}^\top\mathbf{b}$. Then, we solve $\mathbf{R}\mathbf{x} = \mathbf{Q}^\top\mathbf{b}$ using backward substitution [GVL96].
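To make the procedure concrete, here is a minimal numerical sketch (added for illustration, not part of the original text); it assumes NumPy and SciPy are available, and the matrix and right-hand side are arbitrary random choices.

    import numpy as np
    from scipy.linalg import solve_triangular

    rng = np.random.default_rng(0)
    n = 5
    A = rng.standard_normal((n, n))   # a generic (almost surely nonsingular) square matrix
    b = rng.standard_normal(n)

    Q, R = np.linalg.qr(A)            # A = QR with Q orthogonal, R upper triangular
    x = solve_triangular(R, Q.T @ b)  # backward substitution for R x = Q^T b

    print(np.allclose(A @ x, b))      # True, up to floating-point error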

2.2 Norms.

Norms are used to measure the size or mass of a matrix or, relatedly, the length of a vector. They are functions that map an object from $\mathbb{R}^{m \times n}$ (or $\mathbb{R}^n$) to $\mathbb{R}$. Formally:

Definition 1.

Any function $\|\cdot\| : \mathbb{R}^{m \times n} \to \mathbb{R}$ that satisfies the following properties is called a norm:

  1. Non-negativity: $\|\mathbf{A}\| \geq 0$; $\|\mathbf{A}\| = 0$ if and only if $\mathbf{A} = \mathbf{0}$.

  2. Triangle inequality: $\|\mathbf{A} + \mathbf{B}\| \leq \|\mathbf{A}\| + \|\mathbf{B}\|$.

  3. Scalar multiplication: $\|\alpha\mathbf{A}\| = |\alpha|\,\|\mathbf{A}\|$, for all $\alpha \in \mathbb{R}$.

The following properties are easy to prove for any norm: $\|-\mathbf{A}\| = \|\mathbf{A}\|$ and
$$\big|\,\|\mathbf{A}\| - \|\mathbf{B}\|\,\big| \leq \|\mathbf{A} - \mathbf{B}\|.$$

The latter property is known as the reverse triangle inequality.

2.3 Vector norms.

Given $\mathbf{x} \in \mathbb{R}^n$ and an integer $p \geq 1$, we define the vector $p$-norm as:
$$\|\mathbf{x}\|_p = \left(\sum_{i=1}^{n} |x_i|^p\right)^{1/p}.$$

The most common vector $p$-norms are:

  • One norm: $\|\mathbf{x}\|_1 = \sum_{i=1}^{n} |x_i|$.

  • Euclidean (two) norm: $\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{\mathbf{x}^\top\mathbf{x}}$.

  • Infinity (max) norm: $\|\mathbf{x}\|_\infty = \max_{1 \leq i \leq n} |x_i|$.

Given $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ we can bound the inner product $\mathbf{x}^\top\mathbf{y}$ using $p$-norms. The Cauchy-Schwarz inequality states that:
$$|\mathbf{x}^\top\mathbf{y}| \leq \|\mathbf{x}\|_2\,\|\mathbf{y}\|_2.$$
In words, it gives an upper bound for the inner product of two vectors in terms of the Euclidean norms of the two vectors. Hölder's inequality states that
$$|\mathbf{x}^\top\mathbf{y}| \leq \|\mathbf{x}\|_1\,\|\mathbf{y}\|_\infty.$$

The following inequalities between common vector $p$-norms are easy to prove:
$$\|\mathbf{x}\|_\infty \leq \|\mathbf{x}\|_2 \leq \sqrt{n}\,\|\mathbf{x}\|_\infty \qquad \text{and} \qquad \|\mathbf{x}\|_2 \leq \|\mathbf{x}\|_1 \leq \sqrt{n}\,\|\mathbf{x}\|_2.$$
Also, $\|\mathbf{x}\|_2^2 = \mathbf{x}^\top\mathbf{x}$. We can now define the notion of orthogonality for a pair of vectors and state the Pythagorean theorem.

Theorem 2.

Two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ are orthogonal, i.e., $\mathbf{x}^\top\mathbf{y} = 0$, if and only if
$$\|\mathbf{x} + \mathbf{y}\|_2^2 = \|\mathbf{x}\|_2^2 + \|\mathbf{y}\|_2^2.$$

Theorem 2 is also known as the Pythagorean Theorem. Another interesting property of the Euclidean norm is that it does not change after pre(post)-multiplication by a matrix with orthonormal columns (rows).

Theorem 3.

Given a vector $\mathbf{x} \in \mathbb{R}^n$ and a matrix $\mathbf{V} \in \mathbb{R}^{m \times n}$ with $m \geq n$ and $\mathbf{V}^\top\mathbf{V} = \mathbf{I}_n$:
$$\|\mathbf{V}\mathbf{x}\|_2 = \|\mathbf{x}\|_2.$$

2.4 Induced matrix norms.

Given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ and an integer $p \geq 1$, we define the matrix $p$-norm as:
$$\|\mathbf{A}\|_p = \max_{\mathbf{x} \neq \mathbf{0}} \frac{\|\mathbf{A}\mathbf{x}\|_p}{\|\mathbf{x}\|_p} = \max_{\|\mathbf{x}\|_p = 1} \|\mathbf{A}\mathbf{x}\|_p.$$

The most frequent matrix $p$-norms are:

  • One norm: the maximum absolute column sum,
$$\|\mathbf{A}\|_1 = \max_{1 \leq j \leq n} \sum_{i=1}^{m} |A_{ij}| = \max_{1 \leq j \leq n} \|\mathbf{A}\mathbf{e}_j\|_1.$$

  • Infinity norm: the maximum absolute row sum,
$$\|\mathbf{A}\|_\infty = \max_{1 \leq i \leq m} \sum_{j=1}^{n} |A_{ij}| = \max_{1 \leq i \leq m} \|\mathbf{A}^\top\mathbf{e}_i\|_1.$$

  • Two (or spectral) norm:
$$\|\mathbf{A}\|_2 = \max_{\|\mathbf{x}\|_2 = 1} \|\mathbf{A}\mathbf{x}\|_2.$$

This family of norms is named “induced” because they are realized by a non-zero vector $\mathbf{x}$ that varies depending on $\mathbf{A}$ and $p$. Thus, there exists a unit norm vector (unit norm in the $p$-norm) $\mathbf{x}$ such that $\|\mathbf{A}\|_p = \|\mathbf{A}\mathbf{x}\|_p$. The induced matrix $p$-norms follow the submultiplicativity laws:
$$\|\mathbf{A}\mathbf{x}\|_p \leq \|\mathbf{A}\|_p\,\|\mathbf{x}\|_p \qquad \text{and} \qquad \|\mathbf{A}\mathbf{B}\|_p \leq \|\mathbf{A}\|_p\,\|\mathbf{B}\|_p.$$

Furthermore, matrix $p$-norms are invariant to permutations: $\|\mathbf{P}\mathbf{A}\mathbf{Q}\|_p = \|\mathbf{A}\|_p$, where $\mathbf{P}$ and $\mathbf{Q}$ are permutation matrices of appropriate dimensions. Also, if we consider the matrix with permuted rows and columns
$$\mathbf{P}\mathbf{A}\mathbf{Q} = \begin{pmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{pmatrix},$$
then the norm of the submatrix is related to the norm of the full unpermuted matrix as follows: $\|\mathbf{A}_{11}\|_p \leq \|\mathbf{A}\|_p$. The following relationships between matrix $p$-norms are relatively easy to prove. Given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$,
$$\|\mathbf{A}\|_\infty \leq \sqrt{n}\,\|\mathbf{A}\|_2 \qquad \text{and} \qquad \|\mathbf{A}\|_1 \leq \sqrt{m}\,\|\mathbf{A}\|_2.$$
It is also the case that $\|\mathbf{A}\|_2 \leq \sqrt{m}\,\|\mathbf{A}\|_\infty$ and $\|\mathbf{A}\|_2 \leq \sqrt{n}\,\|\mathbf{A}\|_1$. While transposition affects the infinity and one norm of a matrix, it does not affect the two norm, i.e., $\|\mathbf{A}^\top\|_2 = \|\mathbf{A}\|_2$. Also, the matrix two-norm is not affected by pre(post)-multiplication with matrices whose columns (rows) are orthonormal vectors:
$$\|\mathbf{U}\mathbf{A}\mathbf{V}^\top\|_2 = \|\mathbf{A}\|_2,$$
where $\mathbf{U}$ and $\mathbf{V}$ are orthonormal matrices ($\mathbf{U}^\top\mathbf{U} = \mathbf{I}$ and $\mathbf{V}^\top\mathbf{V} = \mathbf{I}$) of appropriate dimensions.
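As a quick sanity check of these definitions, the following NumPy sketch (added for illustration; the matrix is arbitrary) computes the maximum absolute column and row sums, compares them with the induced one and infinity norms, and verifies that transposition preserves the two norm.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 6))

    one_norm = np.max(np.sum(np.abs(A), axis=0))   # maximum absolute column sum
    inf_norm = np.max(np.sum(np.abs(A), axis=1))   # maximum absolute row sum

    print(np.isclose(one_norm, np.linalg.norm(A, 1)))                 # induced one norm
    print(np.isclose(inf_norm, np.linalg.norm(A, np.inf)))            # induced infinity norm
    print(np.isclose(np.linalg.norm(A.T, 2), np.linalg.norm(A, 2)))   # ||A^T||_2 = ||A||_2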

2.5 The Frobenius norm.

The Frobenius norm is not an induced norm, as it belongs to the family of Schatten norms (to be discussed in Section 2.8).

Definition 4.

Given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, we define the Frobenius norm as:
$$\|\mathbf{A}\|_F = \sqrt{\sum_{i=1}^{m}\sum_{j=1}^{n} A_{ij}^2} = \sqrt{\mathrm{Tr}(\mathbf{A}^\top\mathbf{A})} = \sqrt{\mathrm{Tr}(\mathbf{A}\mathbf{A}^\top)},$$
where $\mathrm{Tr}(\cdot)$ denotes the matrix trace (where, recall, the trace of a square matrix is defined to be the sum of the elements on the main diagonal).

Informally, the Frobenius norm measures the variance or variability (which can be given an interpretation of size or mass) of a matrix. Given a vector $\mathbf{x} \in \mathbb{R}^n$, its Frobenius norm is equal to its Euclidean norm, i.e., $\|\mathbf{x}\|_F = \|\mathbf{x}\|_2$. Transposition of a matrix does not affect its Frobenius norm, i.e., $\|\mathbf{A}^\top\|_F = \|\mathbf{A}\|_F$. Similar to the two norm, the Frobenius norm does not change under permutations or pre(post)-multiplication with a matrix with orthonormal columns (rows):
$$\|\mathbf{U}\mathbf{A}\mathbf{V}^\top\|_F = \|\mathbf{A}\|_F,$$
where $\mathbf{U}$ and $\mathbf{V}$ are orthonormal matrices ($\mathbf{U}^\top\mathbf{U} = \mathbf{I}$ and $\mathbf{V}^\top\mathbf{V} = \mathbf{I}$) of appropriate dimensions. The two and the Frobenius norm can be related by:
$$\|\mathbf{A}\|_2 \leq \|\mathbf{A}\|_F \leq \sqrt{\min\{m, n\}}\,\|\mathbf{A}\|_2.$$

The Frobenius norm satisfies the so-called strong sub-multiplicativity property, namely:
$$\|\mathbf{A}\mathbf{B}\|_F \leq \min\big\{\|\mathbf{A}\|_2\,\|\mathbf{B}\|_F,\ \|\mathbf{A}\|_F\,\|\mathbf{B}\|_2\big\}.$$

Given $\mathbf{x} \in \mathbb{R}^m$ and $\mathbf{y} \in \mathbb{R}^n$, the Frobenius norm of their outer product is equal to the product of the Euclidean norms of the two vectors forming the outer product:
$$\|\mathbf{x}\mathbf{y}^\top\|_F = \|\mathbf{x}\|_2\,\|\mathbf{y}\|_2.$$

Finally, we state a matrix version of the Pythagorean theorem.

Lemma 5 (Matrix Pythagoras).

Let $\mathbf{X}, \mathbf{Y} \in \mathbb{R}^{m \times n}$. If $\mathbf{X}^\top\mathbf{Y} = \mathbf{0}$, then
$$\|\mathbf{X} + \mathbf{Y}\|_F^2 = \|\mathbf{X}\|_F^2 + \|\mathbf{Y}\|_F^2.$$

2.6 The Singular Value Decomposition.

The Singular Value Decomposition (SVD) is the most important matrix decomposition and exists for every matrix.

Definition 6.

Given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, we define its full SVD as:
$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top = \sum_{i=1}^{\min\{m,n\}} \sigma_i\,\mathbf{u}_i\mathbf{v}_i^\top,$$
where $\mathbf{U} \in \mathbb{R}^{m \times m}$ and $\mathbf{V} \in \mathbb{R}^{n \times n}$ are orthogonal matrices that contain the left and right singular vectors of $\mathbf{A}$, respectively, and $\boldsymbol{\Sigma} \in \mathbb{R}^{m \times n}$ is a diagonal matrix, with the singular values of $\mathbf{A}$ in decreasing order on the diagonal.

We will often use $\mathbf{u}_i$ (respectively, $\mathbf{v}_i$), $i = 1, \ldots, \min\{m, n\}$, to denote the columns of the matrix $\mathbf{U}$ (respectively, $\mathbf{V}$). Similarly, we will use $\sigma_i$, $i = 1, \ldots, \min\{m, n\}$, to denote the singular values:
$$\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_{\min\{m,n\}} \geq 0.$$

The singular values of $\mathbf{A}$ are non-negative and their number is equal to $\min\{m, n\}$. The number of non-zero singular values of $\mathbf{A}$ is equal to the rank of $\mathbf{A}$. Due to orthonormal invariance, we get:
$$\sigma_i(\mathbf{P}\mathbf{A}\mathbf{Q}^\top) = \sigma_i(\mathbf{A}) \quad \text{for all } i,$$
where $\mathbf{P}$ and $\mathbf{Q}$ are orthonormal matrices ($\mathbf{P}^\top\mathbf{P} = \mathbf{I}$ and $\mathbf{Q}^\top\mathbf{Q} = \mathbf{I}$) of appropriate dimensions. In words, the singular values of $\mathbf{P}\mathbf{A}\mathbf{Q}^\top$ are the same as the singular values of $\mathbf{A}$.

The following inequalities involving the singular values of the matrices $\mathbf{A}$ and $\mathbf{B}$ are important. First, if both $\mathbf{A}$ and $\mathbf{B}$ are in $\mathbb{R}^{m \times n}$, for all $i = 1, \ldots, \min\{m, n\}$,

(7)  $|\sigma_i(\mathbf{A}) - \sigma_i(\mathbf{B})| \leq \|\mathbf{A} - \mathbf{B}\|_2.$

Second, if $\mathbf{A} \in \mathbb{R}^{m \times k}$ and $\mathbf{B} \in \mathbb{R}^{k \times n}$, for all $i = 1, \ldots, \min\{m, k, n\}$,

(8)  $\sigma_i(\mathbf{A}\mathbf{B}) \leq \sigma_1(\mathbf{A})\,\sigma_i(\mathbf{B}),$

where, recall, $\sigma_1(\mathbf{A}) = \|\mathbf{A}\|_2$. We are often interested in keeping only the non-zero singular values and the corresponding left and right singular vectors of a matrix $\mathbf{A}$. Given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ with $\mathrm{rank}(\mathbf{A}) = \rho$, its thin SVD can be defined as follows.

Definition 9.

Given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ of rank $\rho \leq \min\{m, n\}$, we define its thin SVD as:
$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top = \sum_{i=1}^{\rho} \sigma_i\,\mathbf{u}_i\mathbf{v}_i^\top,$$
where $\mathbf{U} \in \mathbb{R}^{m \times \rho}$ and $\mathbf{V} \in \mathbb{R}^{n \times \rho}$ are matrices with pairwise orthonormal columns (i.e., $\mathbf{U}^\top\mathbf{U} = \mathbf{I}_\rho$ and $\mathbf{V}^\top\mathbf{V} = \mathbf{I}_\rho$) that contain the left and right singular vectors of $\mathbf{A}$ corresponding to the non-zero singular values; $\boldsymbol{\Sigma} \in \mathbb{R}^{\rho \times \rho}$ is a diagonal matrix with the non-zero singular values of $\mathbf{A}$ in decreasing order on the diagonal.

If $\mathbf{A}$ is a nonsingular matrix, we can compute its inverse using the SVD:
$$\mathbf{A}^{-1} = \left(\mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top\right)^{-1} = \mathbf{V}\boldsymbol{\Sigma}^{-1}\mathbf{U}^\top.$$
(If $\mathbf{A}$ is nonsingular, then it is square and full rank, in which case the thin SVD is the same as the full SVD.) The SVD is so important since, as is well-known, the best rank-$k$ approximation to any matrix can be computed via the SVD.

Theorem 10.

Let $\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top \in \mathbb{R}^{m \times n}$ be the thin SVD of $\mathbf{A}$; let $k < \rho = \mathrm{rank}(\mathbf{A})$ be an integer; and let $\mathbf{A}_k = \sum_{i=1}^{k} \sigma_i\,\mathbf{u}_i\mathbf{v}_i^\top$. Then,
$$\min_{\mathbf{B} \in \mathbb{R}^{m \times n},\ \mathrm{rank}(\mathbf{B}) \leq k} \|\mathbf{A} - \mathbf{B}\|_2 = \|\mathbf{A} - \mathbf{A}_k\|_2 = \sigma_{k+1}$$
and
$$\min_{\mathbf{B} \in \mathbb{R}^{m \times n},\ \mathrm{rank}(\mathbf{B}) \leq k} \|\mathbf{A} - \mathbf{B}\|_F = \|\mathbf{A} - \mathbf{A}_k\|_F = \sqrt{\sum_{i=k+1}^{\rho} \sigma_i^2}.$$

In words, the above theorem states that if we seek a rank-$k$ approximation to a matrix $\mathbf{A}$ that minimizes the two or the Frobenius norm of the “error” matrix, i.e., of the difference between $\mathbf{A}$ and its approximation, then it suffices to keep the top $k$ singular values of $\mathbf{A}$ and the corresponding left and right singular vectors.
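The following NumPy sketch (an illustrative check, not part of the original lectures; the test matrix is arbitrary) forms $\mathbf{A}_k$ from the top $k$ singular triplets and verifies the two error formulas of Theorem 10.

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((50, 30))
    k = 5

    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U diag(s) V^T
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # best rank-k approximation

    print(np.isclose(np.linalg.norm(A - A_k, 2), s[k]))    # ||A - A_k||_2 = sigma_{k+1}
    print(np.isclose(np.linalg.norm(A - A_k, 'fro'),
                     np.sqrt(np.sum(s[k:] ** 2))))         # ||A - A_k||_F = sqrt(sum_{i>k} sigma_i^2)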

We will often use the following notation: let $\mathbf{U}_k \in \mathbb{R}^{m \times k}$ (respectively, $\mathbf{V}_k \in \mathbb{R}^{n \times k}$) denote the matrix of the top $k$ left (respectively, right) singular vectors of $\mathbf{A}$; and let $\boldsymbol{\Sigma}_k \in \mathbb{R}^{k \times k}$ denote the diagonal matrix containing the top $k$ singular values of $\mathbf{A}$. Similarly, let $\mathbf{U}_{k,\perp} \in \mathbb{R}^{m \times (\rho - k)}$ (respectively, $\mathbf{V}_{k,\perp} \in \mathbb{R}^{n \times (\rho - k)}$) denote the matrix of the bottom $\rho - k$ nonzero left (respectively, right) singular vectors of $\mathbf{A}$; and let $\boldsymbol{\Sigma}_{k,\perp} \in \mathbb{R}^{(\rho - k) \times (\rho - k)}$ denote the diagonal matrix containing the bottom $\rho - k$ singular values of $\mathbf{A}$. Then,

(11)  $\mathbf{A}_k = \mathbf{U}_k\boldsymbol{\Sigma}_k\mathbf{V}_k^\top \qquad \text{and} \qquad \mathbf{A}_{k,\perp} = \mathbf{A} - \mathbf{A}_k = \mathbf{U}_{k,\perp}\boldsymbol{\Sigma}_{k,\perp}\mathbf{V}_{k,\perp}^\top.$

2.7 SVD and Fundamental Matrix Spaces.

Any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ defines four fundamental spaces:

The Column Space of $\mathbf{A}$

This space is spanned by the columns of $\mathbf{A}$:
$$\mathrm{range}(\mathbf{A}) = \{\mathbf{b} : \mathbf{b} = \mathbf{A}\mathbf{x},\ \mathbf{x} \in \mathbb{R}^n\} \subseteq \mathbb{R}^m.$$

The Null Space of $\mathbf{A}$

This space is spanned by all vectors $\mathbf{x} \in \mathbb{R}^n$ such that $\mathbf{A}\mathbf{x} = \mathbf{0}$:
$$\mathrm{null}(\mathbf{A}) = \{\mathbf{x} : \mathbf{A}\mathbf{x} = \mathbf{0}\} \subseteq \mathbb{R}^n.$$

The Row Space of $\mathbf{A}$

This space is spanned by the rows of $\mathbf{A}$:
$$\mathrm{range}(\mathbf{A}^\top) = \{\mathbf{d} : \mathbf{d} = \mathbf{A}^\top\mathbf{y},\ \mathbf{y} \in \mathbb{R}^m\} \subseteq \mathbb{R}^n.$$

The Left Null Space of $\mathbf{A}$

This space is spanned by all vectors $\mathbf{y} \in \mathbb{R}^m$ such that $\mathbf{A}^\top\mathbf{y} = \mathbf{0}$:
$$\mathrm{null}(\mathbf{A}^\top) = \{\mathbf{y} : \mathbf{A}^\top\mathbf{y} = \mathbf{0}\} \subseteq \mathbb{R}^m.$$

The SVD reveals orthogonal bases for all these spaces. Given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, with $\mathrm{rank}(\mathbf{A}) = \rho$, its SVD can be written as:
$$\mathbf{A} = \begin{pmatrix} \mathbf{U}_\rho & \mathbf{U}_{\rho,\perp} \end{pmatrix} \begin{pmatrix} \boldsymbol{\Sigma}_\rho & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{pmatrix} \begin{pmatrix} \mathbf{V}_\rho^\top \\ \mathbf{V}_{\rho,\perp}^\top \end{pmatrix}.$$

It is easy to prove that:

Theorem 12 (Basic Theorem of Linear Algebra).

The column space of $\mathbf{A}$ is orthogonal to the null space of $\mathbf{A}^\top$ and their union is $\mathbb{R}^m$. The column space of $\mathbf{A}^\top$ is orthogonal to the null space of $\mathbf{A}$ and their union is $\mathbb{R}^n$.

2.8 Matrix Schatten norms.

The matrix Schatten norms are a special family of norms that are defined on the vector containing the singular values of a matrix. Given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ with singular values $\sigma_1 \geq \cdots \geq \sigma_\rho > 0$ (where $\rho = \mathrm{rank}(\mathbf{A})$), we define the Schatten $p$-norm as:
$$\|\mathbf{A}\|_p = \left(\sum_{i=1}^{\rho} \sigma_i^p\right)^{1/p}.$$

Common Schatten norms of a matrix are:

Schatten one-norm

The nuclear norm, i.e., the sum of the singular values.

Schatten two-norm

The Frobenius norm, i.e., the square root of the sum of the squares of the singular values.

Schatten infinity-norm

The two norm, i.e., the largest singular value.

Schatten norms are orthogonally invariant, submultiplicative, and satisfy Hölder’s inequality.
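To make the correspondence concrete, this small NumPy sketch (added for illustration; the matrix is arbitrary) computes the Schatten one, two, and infinity norms directly from the singular values and compares them with the nuclear, Frobenius, and two norms.

    import numpy as np

    rng = np.random.default_rng(3)
    A = rng.standard_normal((20, 12))
    s = np.linalg.svd(A, compute_uv=False)   # singular values of A

    print(np.isclose(np.sum(s), np.linalg.norm(A, 'nuc')))                # Schatten one-norm = nuclear norm
    print(np.isclose(np.sqrt(np.sum(s ** 2)), np.linalg.norm(A, 'fro')))  # Schatten two-norm = Frobenius norm
    print(np.isclose(np.max(s), np.linalg.norm(A, 2)))                    # Schatten infinity-norm = two norm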

2.9 The Moore-Penrose pseudoinverse.

A generalized notion of the well-known matrix inverse is the Moore-Penrose pseudoinverse. Formally, given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$, a matrix $\mathbf{A}^\dagger \in \mathbb{R}^{n \times m}$ is the Moore-Penrose pseudoinverse of $\mathbf{A}$ if it satisfies the following properties:

  1. $\mathbf{A}\mathbf{A}^\dagger\mathbf{A} = \mathbf{A}$.

  2. $\mathbf{A}^\dagger\mathbf{A}\mathbf{A}^\dagger = \mathbf{A}^\dagger$.

  3. $\left(\mathbf{A}\mathbf{A}^\dagger\right)^\top = \mathbf{A}\mathbf{A}^\dagger$.

  4. $\left(\mathbf{A}^\dagger\mathbf{A}\right)^\top = \mathbf{A}^\dagger\mathbf{A}$.

Given a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ of rank $\rho$ and its thin SVD
$$\mathbf{A} = \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top = \sum_{i=1}^{\rho} \sigma_i\,\mathbf{u}_i\mathbf{v}_i^\top,$$
its Moore-Penrose pseudoinverse $\mathbf{A}^\dagger$ is
$$\mathbf{A}^\dagger = \mathbf{V}\boldsymbol{\Sigma}^{-1}\mathbf{U}^\top = \sum_{i=1}^{\rho} \frac{1}{\sigma_i}\,\mathbf{v}_i\mathbf{u}_i^\top.$$
If a matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ has full rank, then $\mathbf{A}^\dagger = \mathbf{A}^{-1}$. If a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ has full column rank, then $\mathbf{A}^\dagger\mathbf{A} = \mathbf{I}_n$, and $\mathbf{A}\mathbf{A}^\dagger$ is a projection matrix onto the column span of $\mathbf{A}$; while if it has full row rank, then $\mathbf{A}\mathbf{A}^\dagger = \mathbf{I}_m$, and $\mathbf{A}^\dagger\mathbf{A}$ is a projection matrix onto the row span of $\mathbf{A}$.
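A short NumPy sketch (illustrative only; the matrix is an arbitrary full-column-rank example) builds $\mathbf{A}^\dagger$ from the thin SVD, compares it with np.linalg.pinv, and checks the four defining properties.

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((8, 5))                  # almost surely full column rank

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T           # A^dagger = V Sigma^{-1} U^T

    print(np.allclose(A_pinv, np.linalg.pinv(A)))
    print(np.allclose(A @ A_pinv @ A, A))            # property 1
    print(np.allclose(A_pinv @ A @ A_pinv, A_pinv))  # property 2
    print(np.allclose((A @ A_pinv).T, A @ A_pinv))   # property 3
    print(np.allclose((A_pinv @ A).T, A_pinv @ A))   # property 4
    print(np.allclose(A_pinv @ A, np.eye(5)))        # full column rank: A^dagger A = I_n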

A particularly important property regarding the pseudoinverse of the product of two matrices is the following: for matrices $\mathbf{Y}_1 \in \mathbb{R}^{m \times p}$ and $\mathbf{Y}_2 \in \mathbb{R}^{p \times n}$ that both have rank $p$ (i.e., $\mathbf{Y}_1$ has full column rank and $\mathbf{Y}_2$ has full row rank), [Bjo15, Theorem 2.2.3] states that

(13)  $\left(\mathbf{Y}_1\mathbf{Y}_2\right)^\dagger = \mathbf{Y}_2^\dagger\mathbf{Y}_1^\dagger.$

(We emphasize that the condition on the ranks is crucial: while the inverse of the product of two nonsingular matrices always equals the product of the inverses of those matrices in reverse order, the analogous statement is not true in full generality for the Moore-Penrose pseudoinverse [Bjo15].) The fundamental spaces of the Moore-Penrose pseudoinverse are connected with those of the actual matrix. Given a matrix $\mathbf{A}$ and its Moore-Penrose pseudoinverse $\mathbf{A}^\dagger$, the column space of $\mathbf{A}^\dagger$ can be defined as:
$$\mathrm{range}(\mathbf{A}^\dagger) = \mathrm{range}(\mathbf{A}^\top\mathbf{A}) = \mathrm{range}(\mathbf{A}^\top),$$
and it is orthogonal to the null space of $\mathbf{A}$. The null space of $\mathbf{A}^\dagger$ can be defined as:
$$\mathrm{null}(\mathbf{A}^\dagger) = \mathrm{null}(\mathbf{A}\mathbf{A}^\top) = \mathrm{null}(\mathbf{A}^\top),$$
and it is orthogonal to the column space of $\mathbf{A}$.

2.10 References.

We refer the interested reader to [Strang88, GVL96, TrefethenBau97, Bjo15] for additional background on linear algebra and matrix computations, as well as to [Stewart90, Bhatia97] for additional background on matrix perturbation theory.

3 Discrete Probability

In this section, we present a brief overview of discrete probability. More advanced results (in particular, Bernstein-type inequalities for real-valued and matrix-valued random variables) will be introduced in the appropriate context later in the chapter. It is worth noting that most of RandNLA builds upon simple, fundamental principles of discrete (instead of continuous) probability.

3.1 Random experiments: basics.

A random experiment is any procedure that can be infinitely repeated and has a well-defined set of possible outcomes. Typical examples are the roll of a die or the toss of a coin. The sample space $\Omega$ of a random experiment is the set of all possible outcomes of the random experiment. If the random experiment only has two possible outcomes (e.g., success and failure) then it is often called a Bernoulli trial. In discrete probability, the sample space $\Omega$ is finite. (We will not cover countably or uncountably infinite sample spaces in this chapter.)

An event is any subset of the sample space $\Omega$. Clearly, the set of all possible events is the powerset (the set of all possible subsets) of $\Omega$, often denoted as $2^\Omega$. As an example, consider the following random experiment: toss a coin three times. Then, the sample space $\Omega$ is
$$\Omega = \{HHH,\ HHT,\ HTH,\ HTT,\ THH,\ THT,\ TTH,\ TTT\},$$
and an event $\mathcal{E}$ could be described in words as “the output of the random experiment was either all heads or all tails”. Then, $\mathcal{E} = \{HHH,\ TTT\}$. The probability measure or probability function maps the (finite) sample space to the interval $[0, 1]$. Formally, let the function $\mathrm{Pr}[\omega]$ for all $\omega \in \Omega$ be a function whose domain is $\Omega$ and whose range is the interval $[0, 1]$. This function has the so-called normalization property, namely
$$\sum_{\omega \in \Omega} \mathrm{Pr}[\omega] = 1.$$

If $\mathcal{E}$ is an event, then

(14)  $\mathrm{Pr}[\mathcal{E}] = \sum_{\omega \in \mathcal{E}} \mathrm{Pr}[\omega],$

namely the probability of an event is the sum of the probabilities of its elements. It follows that the probability of the empty event (the event that corresponds to the empty set) is equal to zero, whereas the probability of the event $\Omega$ (clearly $\Omega$ itself is an event) is equal to one. Finally, the uniform probability function is defined as $\mathrm{Pr}[\omega] = 1/|\Omega|$, for all $\omega \in \Omega$.

3.2 Properties of events.

Recall that events are sets and thus set operations (union, intersection, complementation) are applicable. Assuming finite sample spaces and using Eqn. (14), it is easy to prove the following property for the union of two events $\mathcal{E}_1$ and $\mathcal{E}_2$:
$$\mathrm{Pr}[\mathcal{E}_1 \cup \mathcal{E}_2] = \mathrm{Pr}[\mathcal{E}_1] + \mathrm{Pr}[\mathcal{E}_2] - \mathrm{Pr}[\mathcal{E}_1 \cap \mathcal{E}_2].$$
This property follows from the well-known inclusion-exclusion principle for set union and can be generalized to more than two sets and thus to more than two events. Similarly, one can prove that $\mathrm{Pr}[\bar{\mathcal{E}}] = 1 - \mathrm{Pr}[\mathcal{E}]$. In the above, $\bar{\mathcal{E}}$ denotes the complement of the event $\mathcal{E}$. Finally, it is trivial to see that if $\mathcal{E}_1$ is a subset of $\mathcal{E}_2$ then $\mathrm{Pr}[\mathcal{E}_1] \leq \mathrm{Pr}[\mathcal{E}_2]$.

3.3 The union bound.

The union bound is a fundamental result in discrete probability and can be used to bound the probability of a union of events without any special assumptions on the relationships between the events. Indeed, let $\mathcal{E}_i$ for all $i = 1, \ldots, n$ be events defined over a finite sample space $\Omega$. Then, the union bound states that
$$\mathrm{Pr}\left[\bigcup_{i=1}^{n} \mathcal{E}_i\right] \leq \sum_{i=1}^{n} \mathrm{Pr}[\mathcal{E}_i].$$

The proof of the union bound is quite simple and can be done by induction, using the inclusion-exclusion principle for two sets that was discussed in the previous section.

3.4 Disjoint events and independent events.

Two events $\mathcal{E}_1$ and $\mathcal{E}_2$ are called disjoint or mutually exclusive if their intersection is the empty set, i.e., if
$$\mathcal{E}_1 \cap \mathcal{E}_2 = \emptyset.$$
This can be generalized to any number of events by requiring that the events are all pairwise disjoint. Two events $\mathcal{E}_1$ and $\mathcal{E}_2$ are called independent if the occurrence of one does not affect the probability of the other. Formally, they must satisfy
$$\mathrm{Pr}[\mathcal{E}_1 \cap \mathcal{E}_2] = \mathrm{Pr}[\mathcal{E}_1] \cdot \mathrm{Pr}[\mathcal{E}_2].$$
Again, this can be generalized to more than two events by requiring that the events are all pairwise independent.

3.5 Conditional probability.

For any two events $\mathcal{E}_1$ and $\mathcal{E}_2$, the conditional probability $\mathrm{Pr}[\mathcal{E}_1 \mid \mathcal{E}_2]$ is the probability that $\mathcal{E}_1$ occurs given that $\mathcal{E}_2$ occurs. Formally,
$$\mathrm{Pr}[\mathcal{E}_1 \mid \mathcal{E}_2] = \frac{\mathrm{Pr}[\mathcal{E}_1 \cap \mathcal{E}_2]}{\mathrm{Pr}[\mathcal{E}_2]}.$$
Obviously, the probability of the event $\mathcal{E}_2$ in the denominator must be non-zero for this to be well-defined. The well-known Bayes rule states that for any two events $\mathcal{E}_1$ and $\mathcal{E}_2$ such that $\mathrm{Pr}[\mathcal{E}_1] > 0$ and $\mathrm{Pr}[\mathcal{E}_2] > 0$,
$$\mathrm{Pr}[\mathcal{E}_2 \mid \mathcal{E}_1] = \frac{\mathrm{Pr}[\mathcal{E}_1 \mid \mathcal{E}_2]\,\mathrm{Pr}[\mathcal{E}_2]}{\mathrm{Pr}[\mathcal{E}_1]}.$$
Using the Bayes rule and the fact that the sample space $\Omega$ can be partitioned as $\Omega = \mathcal{E}_2 \cup \bar{\mathcal{E}}_2$, it follows that
$$\mathrm{Pr}[\mathcal{E}_1] = \mathrm{Pr}[\mathcal{E}_1 \mid \mathcal{E}_2]\,\mathrm{Pr}[\mathcal{E}_2] + \mathrm{Pr}[\mathcal{E}_1 \mid \bar{\mathcal{E}}_2]\,\mathrm{Pr}[\bar{\mathcal{E}}_2].$$
We note that the probabilities of both events $\mathcal{E}_1$ and $\mathcal{E}_2$ must be in the open interval $(0, 1)$. We can now revisit the notion of independent events. Indeed, for any two events $\mathcal{E}_1$ and $\mathcal{E}_2$ such that $\mathrm{Pr}[\mathcal{E}_1] > 0$ and $\mathrm{Pr}[\mathcal{E}_2] > 0$, the following statements are equivalent:

  1. $\mathrm{Pr}[\mathcal{E}_1 \mid \mathcal{E}_2] = \mathrm{Pr}[\mathcal{E}_1]$,

  2. $\mathrm{Pr}[\mathcal{E}_2 \mid \mathcal{E}_1] = \mathrm{Pr}[\mathcal{E}_2]$, and

  3. $\mathrm{Pr}[\mathcal{E}_1 \cap \mathcal{E}_2] = \mathrm{Pr}[\mathcal{E}_1] \cdot \mathrm{Pr}[\mathcal{E}_2]$.

Recall that the last statement was the definition of independence in the previous section.

3.6 Random variables.

Random variables are functions mapping the sample space $\Omega$ to the real numbers $\mathbb{R}$. Note that even though they are called variables, in reality they are functions. Let $\Omega$ be the sample space of a random experiment. A formal definition for the random variable $X$ would be as follows: let $\alpha \in \mathbb{R}$ be a real number (not necessarily positive) and note that the function
$$X^{-1}(\alpha) = \{\omega \in \Omega : X(\omega) = \alpha\}$$
returns a subset of $\Omega$ and thus is an event. Therefore, the event $X^{-1}(\alpha)$ has a probability. We will abuse notation and write
$$\mathrm{Pr}[X = \alpha]$$
instead of the more proper notation $\mathrm{Pr}[X^{-1}(\alpha)]$. This function of $\alpha$ is of great interest and it is easy to generalize as follows:
$$\mathrm{Pr}[X \leq \alpha] = \mathrm{Pr}\big[\{\omega \in \Omega : X(\omega) \leq \alpha\}\big].$$

3.7 Probability mass function and cumulative distribution function.

Two common functions associated with random variables are the probability mass function (PMF) and the cumulative distribution function (CDF). The first measures the probability that a random variable takes a particular value $\alpha \in \mathbb{R}$, and the second measures the probability that a random variable takes any value at most $\alpha$.

Definition 15 (Probability Mass Function (PMF)).

Given a random variable $X$ and a real number $\alpha$, the probability mass function (PMF) is the function
$$f(\alpha) = \mathrm{Pr}[X = \alpha].$$

Definition 16 (Cumulative Distribution Function (CDF)).

Given a random variable $X$ and a real number $\alpha$, the cumulative distribution function (CDF) is the function
$$F(\alpha) = \mathrm{Pr}[X \leq \alpha].$$

It is obvious from the above definitions that
$$F(\alpha) = \sum_{x \leq \alpha} f(x).$$

3.8 Independent random variables.

Following the notion of independence for events, we can now define the notion of independence for random variables. Indeed, two random variables $X$ and $Y$ are independent if for all reals $a$ and $b$,
$$\mathrm{Pr}[X = a \text{ and } Y = b] = \mathrm{Pr}[X = a] \cdot \mathrm{Pr}[Y = b].$$

3.9 Expectation of a random variable.

Given a random variable $X$, its expectation $\mathbb{E}[X]$ is defined as
$$\mathbb{E}[X] = \sum_{x \in X(\Omega)} x \cdot \mathrm{Pr}[X = x].$$
In the above, $X(\Omega)$ is the image of the random variable $X$ over the sample space $\Omega$; recall that $X$ is a function. That is, the sum is over the range of the random variable $X$. Alternatively, $\mathbb{E}[X]$ can be expressed in terms of a sum over the domain of $X$, i.e., over $\Omega$. For finite sample spaces $\Omega$, such as those that arise in discrete probability, we get
$$\mathbb{E}[X] = \sum_{\omega \in \Omega} X(\omega) \cdot \mathrm{Pr}[\omega].$$

We now discuss fundamental properties of the expectation. The most important property is linearity of expectation: for any random variables $X$ and $Y$ and real number $\lambda$,
$$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y] \qquad \text{and} \qquad \mathbb{E}[\lambda X] = \lambda\,\mathbb{E}[X].$$
The first property generalizes to any finite sum of random variables and does not need any assumptions on the random variables involved in the summation. If two random variables $X$ and $Y$ are independent, then we can manipulate the expectation of their product as follows:
$$\mathbb{E}[XY] = \mathbb{E}[X] \cdot \mathbb{E}[Y].$$

3.10 Variance of a random variable.

Given a random variable $X$, its variance $\mathrm{Var}[X]$ is defined as
$$\mathrm{Var}[X] = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big].$$
In words, the variance measures the average of the square of the difference $X - \mathbb{E}[X]$. The standard deviation is the square root of the variance and is often denoted by $\sigma$. It is easy to prove that
$$\mathrm{Var}[X] = \mathbb{E}[X^2] - (\mathbb{E}[X])^2.$$
This obviously implies
$$\mathrm{Var}[X] \leq \mathbb{E}[X^2],$$
which is often all we need in order to get an upper bound for the variance. Unlike the expectation, the variance does not have a linearity property, unless the random variables involved are independent. Indeed, if the random variables $X$ and $Y$ are independent, then
$$\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y].$$
The above property generalizes to sums of more than two random variables, assuming that all involved random variables are pairwise independent. Also, for any real $\lambda$,
$$\mathrm{Var}[\lambda X] = \lambda^2\,\mathrm{Var}[X].$$

3.11 Markov’s inequality.

Let $X$ be a non-negative random variable; for any $a > 0$,
$$\mathrm{Pr}[X \geq a] \leq \frac{\mathbb{E}[X]}{a}.$$
This is a very simple inequality to apply and only needs an upper bound for the expectation of $X$. An equivalent formulation is the following: let $X$ be a non-negative random variable; for any $k > 1$,
$$\mathrm{Pr}\big[X \geq k\,\mathbb{E}[X]\big] \leq \frac{1}{k},$$
or, equivalently,
$$\mathrm{Pr}\big[X < k\,\mathbb{E}[X]\big] \geq 1 - \frac{1}{k}.$$
In words, the probability that a random variable exceeds $k$ times its expectation is at most $1/k$. In order to prove Markov's inequality, we will show that
$$\mathrm{Pr}[X \geq a] \leq \frac{\mathbb{E}[X]}{a},$$
assuming that $X$ is non-negative, for any $a > 0$. In order to prove the above inequality, we define the following indicator function:
$$I_a = \begin{cases} 1, & \text{if } X \geq a, \\ 0, & \text{otherwise,} \end{cases}$$
with expectation
$$\mathbb{E}[I_a] = \mathrm{Pr}[X \geq a].$$
Clearly, from the function definition, $I_a \leq X/a$, since $X$ is non-negative and $a > 0$. Taking expectations on both sides:
$$\mathbb{E}[I_a] \leq \frac{\mathbb{E}[X]}{a}.$$
Thus,
$$\mathrm{Pr}[X \geq a] = \mathbb{E}[I_a] \leq \frac{\mathbb{E}[X]}{a}.$$
Hence, we conclude the proof of Markov's inequality.
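As a quick empirical illustration (a simulation sketch added here; the choice of a fair die is arbitrary), we can compare the empirical tail probability $\mathrm{Pr}[X \geq k\,\mathbb{E}[X]]$ with the Markov bound $1/k$.

    import numpy as np

    rng = np.random.default_rng(5)
    X = rng.integers(1, 7, size=1_000_000)   # rolls of a fair six-sided die (non-negative)
    k = 1.5

    empirical = np.mean(X >= k * X.mean())   # empirical Pr[X >= k E[X]]
    print(empirical, "<=", 1.0 / k)          # Markov's inequality guarantees the bound on the right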

3.12 The Coupon Collector Problem.

Suppose there are $n$ types of coupons and we seek to collect them in independent trials, where in each trial the probability of obtaining any one coupon is $1/n$ (uniform). Let $X$ denote the number of trials that we need in order to collect at least one coupon of each type. Then, one can prove that [MotwaniRaghavan95, Section 3.6]:
$$\mathbb{E}[X] = n \ln n + \Theta(n).$$
The occurrence of the additional $\ln n$ factor in the expectation is common in sampling-based approaches that attempt to recover $n$ different types of objects using sampling in independent trials. Such factors will appear in many RandNLA sampling-based algorithms.
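The simulation sketch below (added for illustration; the values of $n$ and the number of repetitions are arbitrary) estimates $\mathbb{E}[X]$ for the coupon collector problem and compares it with $n \ln n$.

    import numpy as np

    rng = np.random.default_rng(6)
    n, num_repetitions = 100, 2000
    trial_counts = []
    for _ in range(num_repetitions):
        seen, draws = set(), 0
        while len(seen) < n:                 # draw until all n coupon types have been collected
            seen.add(int(rng.integers(n)))
            draws += 1
        trial_counts.append(draws)

    print(np.mean(trial_counts), n * np.log(n))   # empirical E[X] vs. n ln(n); they differ by Theta(n)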

3.13 References.

There are numerous texts covering discrete probability; most of the material in this chapter was adapted from [MotwaniRaghavan95].

4 Randomized Matrix Multiplication

Our first randomized algorithm for a numerical linear algebra problem is a simple, sampling-based approach to approximate the product of two matrices $\mathbf{A} \in \mathbb{R}^{m \times n}$ and $\mathbf{B} \in \mathbb{R}^{n \times p}$. This randomized matrix multiplication algorithm is at the heart of all of the RandNLA algorithms that we will discuss in this chapter, and indeed all of RandNLA more generally. It is of interest both pedagogically and in and of itself, and it is also used in an essential way in the analysis of the least squares approximation and low-rank approximation algorithms discussed below.

We start by noting that the product $\mathbf{A}\mathbf{B}$ may be written as the sum of $n$ rank-one matrices:

(17)  $\mathbf{A}\mathbf{B} = \sum_{k=1}^{n} \mathbf{A}_{*k}\mathbf{B}_{k*},$

where each of the summands is the outer product of a column of $\mathbf{A}$ and the corresponding row of $\mathbf{B}$. Recall that the standard definition of matrix multiplication states that the $(i, j)$-th entry of the matrix product $\mathbf{A}\mathbf{B}$ is equal to the inner product of the $i$-th row of $\mathbf{A}$ and the $j$-th column of $\mathbf{B}$, namely
$$(\mathbf{A}\mathbf{B})_{ij} = \mathbf{A}_{i*}\mathbf{B}_{*j} \in \mathbb{R}.$$
It is easy to see that the two definitions are equivalent. However, when matrix multiplication is formulated as in Eqn. (17), a simple randomized algorithm to approximate the product $\mathbf{A}\mathbf{B}$ suggests itself: in independent identically distributed (i.i.d.) trials, randomly sample (and appropriately rescale) a few rank-one matrices from the $n$ terms in the summation of Eqn. (17); and then output the sum of the (rescaled) terms as an estimator for $\mathbf{A}\mathbf{B}$.

Input: $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{B} \in \mathbb{R}^{n \times p}$, integer $c$ ($1 \leq c \leq n$), and probabilities $\{p_k\}_{k=1}^{n}$ s.t. $p_k \geq 0$ and $\sum_{k=1}^{n} p_k = 1$.
Output: $\mathbf{C} \in \mathbb{R}^{m \times c}$ and $\mathbf{R} \in \mathbb{R}^{c \times p}$.
  1. For $t = 1$ to $c$,
     • Pick $i_t \in \{1, \ldots, n\}$ with $\mathrm{Pr}[i_t = k] = p_k$, independently and with replacement.
     • Set $\mathbf{C}_{*t} = \frac{1}{\sqrt{c\,p_{i_t}}}\,\mathbf{A}_{*i_t}$ and $\mathbf{R}_{t*} = \frac{1}{\sqrt{c\,p_{i_t}}}\,\mathbf{B}_{i_t*}$.
  2. Return $\mathbf{C}$ and $\mathbf{R}$.
Algorithm 1 The RandMatrixMultiply algorithm

Consider the RandMatrixMultiply algorithm (Algorithm 1), which makes this simple idea precise. When this algorithm is given as input two matrices $\mathbf{A}$ and $\mathbf{B}$, a probability distribution $\{p_k\}_{k=1}^{n}$, and a number $c$ of column-row pairs to choose, it returns as output an estimator for the product $\mathbf{A}\mathbf{B}$ of the form
$$\sum_{t=1}^{c} \frac{1}{c\,p_{i_t}}\,\mathbf{A}_{*i_t}\mathbf{B}_{i_t*}.$$
Equivalently, the above estimator can be thought of as the product of the two matrices $\mathbf{C}$ and $\mathbf{R}$ formed by the RandMatrixMultiply algorithm, where $\mathbf{C}$ consists of $c$ (rescaled) columns of $\mathbf{A}$ and $\mathbf{R}$ consists of the corresponding (rescaled) rows of $\mathbf{B}$. Observe that
$$\mathbf{C}\mathbf{R} = \sum_{t=1}^{c} \mathbf{C}_{*t}\mathbf{R}_{t*} = \sum_{t=1}^{c} \frac{1}{c\,p_{i_t}}\,\mathbf{A}_{*i_t}\mathbf{B}_{i_t*}.$$

Therefore, the procedure used for sampling and scaling column-row pairs in the RandMatrixMultiply algorithm corresponds to sampling and rescaling terms in Eqn. (17).
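The following NumPy sketch (an illustrative implementation consistent with the pseudocode of Algorithm 1; the variable names and test matrices are ours) samples $c$ column-row pairs with probabilities proportional to the products of the column and row Euclidean norms, as discussed in Remark 19 below, and compares $\mathbf{C}\mathbf{R}$ with $\mathbf{A}\mathbf{B}$.

    import numpy as np

    def rand_matrix_multiply(A, B, c, p, rng):
        """Sample c column-row pairs of (A, B) i.i.d. with probabilities p and rescale them."""
        n = A.shape[1]
        idx = rng.choice(n, size=c, replace=True, p=p)   # i.i.d. sampling with replacement
        scale = 1.0 / np.sqrt(c * p[idx])
        C = A[:, idx] * scale                            # rescaled columns of A
        R = B[idx, :] * scale[:, None]                   # corresponding rescaled rows of B
        return C, R

    rng = np.random.default_rng(7)
    A = rng.standard_normal((100, 200))
    B = rng.standard_normal((200, 80))

    # probabilities proportional to ||A_{*k}||_2 ||B_{k*}||_2 (see Remark 19 below)
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()

    C, R = rand_matrix_multiply(A, B, c=50, p=p, rng=rng)
    rel_err = np.linalg.norm(A @ B - C @ R, 'fro') / np.linalg.norm(A @ B, 'fro')
    print(rel_err)   # a crude approximation whose error decreases as c grows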

Remark 18.

The analysis of RandNLA algorithms has benefited enormously from formulating algorithms using the so-called sampling-and-rescaling matrix formalism. Let's define the sampling-and-rescaling matrix $\mathbf{S} \in \mathbb{R}^{n \times c}$ to be a matrix with $S_{i_t t} = 1/\sqrt{c\,p_{i_t}}$ if the $i_t$-th column of $\mathbf{A}$ is chosen in the $t$-th trial (all other entries of $\mathbf{S}$ are set to zero). Then
$$\mathbf{C} = \mathbf{A}\mathbf{S} \qquad \text{and} \qquad \mathbf{R} = \mathbf{S}^\top\mathbf{B},$$
so that $\mathbf{C}\mathbf{R} = \mathbf{A}\mathbf{S}\mathbf{S}^\top\mathbf{B} \approx \mathbf{A}\mathbf{B}$. Obviously, the matrix $\mathbf{S}$ is very sparse, having a single non-zero entry per column, for a total of $c$ non-zero entries, and so it is not explicitly constructed and stored by the algorithm.
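For completeness, a tiny sketch (illustrative only; the dimensions and uniform probabilities are arbitrary choices) forms the sampling-and-rescaling matrix $\mathbf{S}$ explicitly for a toy example and checks that $\mathbf{C} = \mathbf{A}\mathbf{S}$ and $\mathbf{R} = \mathbf{S}^\top\mathbf{B}$ reproduce the sampled and rescaled column-row pairs; in practice, as noted above, $\mathbf{S}$ is never materialized.

    import numpy as np

    rng = np.random.default_rng(8)
    m, n, q, c = 6, 10, 4, 3
    A = rng.standard_normal((m, n))
    B = rng.standard_normal((n, q))

    probs = np.full(n, 1.0 / n)                  # uniform sampling probabilities, for simplicity
    idx = rng.choice(n, size=c, replace=True, p=probs)

    S = np.zeros((n, c))
    S[idx, np.arange(c)] = 1.0 / np.sqrt(c * probs[idx])   # one non-zero entry per column of S

    C, R = A @ S, S.T @ B
    print(np.allclose(C, A[:, idx] / np.sqrt(c * probs[idx])))           # rescaled columns of A
    print(np.allclose(R, B[idx, :] / np.sqrt(c * probs[idx])[:, None]))  # rescaled rows of B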

Remark 19.

The choice of the sampling probabilities $\{p_k\}_{k=1}^{n}$ in the RandMatrixMultiply algorithm is very important. As we will prove in Lemma 21, the estimator returned by the RandMatrixMultiply algorithm is (in an element-wise sense) unbiased, regardless of our choice of the sampling probabilities. However, a natural notion of the variance of our estimator (see Theorem 22 for a precise definition) is minimized when the sampling probabilities are set to
$$p_k = \frac{\|\mathbf{A}_{*k}\|_2\,\|\mathbf{B}_{k*}\|_2}{\sum_{k'=1}^{n} \|\mathbf{A}_{*k'}\|_2\,\|\mathbf{B}_{k'*}\|_2}.$$
In words, the best choice when sampling rank-one matrices from the summation of Eqn. (17) is to select rank-one matrices that have larger Frobenius norms with higher probabilities. This is equivalent to selecting column-row pairs that have larger (products of) Euclidean norms with higher probability.

Remark 20.

This approach for approximating matrix multiplication has several advantages. First, it is conceptually simple. Second, since the heart of the algorithm involves matrix multiplication of smaller matrices, it can use any algorithms that exist in the literature for performing the desired matrix multiplication. Third, this approach does not tamper with the sparsity of the input matrices. Finally, the algorithm can be easily implemented in one pass over the input matrices $\mathbf{A}$ and $\mathbf{B}$, given the sampling probabilities $\{p_k\}_{k=1}^{n}$. See [dkm_matrix1, Section 4.2] for a detailed discussion regarding the implementation of the RandMatrixMultiply algorithm in the pass-efficient and streaming models of computation.

4.1 Analysis of the RandMatrixMultiply algorithm.

This section provides upper bounds for the error matrix $\mathbf{A}\mathbf{B} - \mathbf{C}\mathbf{R}$, where $\mathbf{C}$ and $\mathbf{R}$ are the outputs of the RandMatrixMultiply algorithm.

Our first lemma proves that the expectation of the