1. Motivation and Main Results
Randomized block Krylov methods have emerged as a powerful tool for spectral computation and matrix approximation [RST09, HMT11, MRT11, HMST11, MM15, WZZ15, DIKMI18, DI19, YGL18, MT20]. At present, our understanding of these methods is more rudimentary than our understanding of simple Krylov subspace methods in either the deterministic or random setting. Our aim is to develop a priori bounds that better explain the remarkable performance of randomized block Krylov methods.
This paper focuses on the most basic questions. How well can we estimate the maximum (or minimum) eigenvalue of a symmetric matrix using a randomized block Krylov method? How well can we estimate the maximum (or minimum) singular value of a general matrix? We present the first detailed analysis that addresses these questions. The argument streamlines and extends the influential work [KW92]
of Kucziński & Woźniakowski on the simple Krylov method with one randomized starting vector. We discover that increasing the block size offers some clear advantages over the simple method.
Randomized block Krylov methods have also been promoted as a tool for low-rank matrix approximation. Theoretical analysis of these algorithms appears in several recent papers [MM15, DIKMI18, DI19, YGL18]. In an unpublished companion paper [Tro18], we complement these works with a more detailed analysis that offers several new insights. The mathematical approach is different from the case of extreme eigenvalues, and it involves additional ideas from the analysis of the randomized SVD algorithm [HMT11]. In collaboration with Robert J. Webber, we are currently preparing a review article [TW21] with a definitive treatment of randomized block Krylov methods for low-rank matrix approximation.
1.1. Block Krylov Methods for Computing the Maximum Eigenvalue
Let us begin with a mathematical description of a block Krylov method for estimating the maximum eigenvalue of a symmetric matrix. See Section 1.1.6 for a brief discussion about implementations.
1.1.1. Block Krylov Subspaces
Suppose that we are given a symmetric matrix . Choose a test matrix , where the number is called the block size. Select a parameter that controls the depth of the Krylov subspace. In concept, the block Krylov method constructs the matrix
The range of the matrix is called a block Krylov subspace:
The block Krylov subspace admits an alternative representation in terms of polynomials:
where is the linear space of real polynomials with degree at most .
Warning 1.1 (Notation).
Be aware that our notation differs slightly from the traditional usage. In and in , the highest matrix power is , rather than the usual .
1.1.2. Invariance Properties of Krylov Subspaces
Krylov subspaces have remarkable invariance properties that help explain their computational value.
The block Krylov subspace only depends on the range of the test matrix:
The block Krylov subspace co-varies with the orientation of the matrices:
The block Krylov subspace is invariant under “affine” transformations of the input matrix:
1.1.3. Computing Maximum Eigenvalues
Block Krylov subspaces support a wide range of matrix computations. The core idea is to compress the input matrix to the Krylov subspace and to perform calculations on the (small) compressed matrix. In other words, Krylov methods belong to the class of Ritz–Galerkin methods; see [Lan50, Pai71, Saa80, Par98, Saa11].
In particular, we can obtain an estimate for the maximum eigenvalue of the input matrix by maximizing the Rayleigh quotient of over the block Krylov subspace :
The symbol denotes the transpose of a matrix or vector, and we instate the convention that . We may suppress the dependence of on , , or when they are clear from context.
The goal of our analysis is to understand how well approximates the maximum eigenvalue as a function of the block size and the depth of the Krylov space.
Provided that , any vector that maximizes the Rayleigh quotient in (1.4) has a component in the invariant subspace of spanned by the large eigenvalues. More precisely,
We have written for the norm and for the spectral projector onto the invariant subspace of associated with the eigenvalues that strictly exceed .
1.1.4. Invariance Properties of the Eigenvalue Estimate
The eigenvalue estimate inherits some invariance properties from the block Krylov subspace. These facts can help us develop effective implementations of the algorithm and to analyze their performance.
For fixed depth , the estimate only depends on the range of the test matrix :
For fixed depth , the estimate does not depend on the orientation of and in the sense that
For fixed depth , the estimate covaries with increasing “affine” transformations of :
1.1.5. A Random Test Matrix
To ensure that we can estimate the maximum eigenvalue of an arbitrary symmetric input matrix , we draw the test matrix that generates the Krylov subspace at random.
How should we select the distribution? Observe that the eigenvalue estimate only depends on the range of the test matrix because of the property (1.6). Furthermore, the property (1.7) shows that the eigenvalue estimate is invariant under rotations. Therefore, we can choose any random test matrix whose range has a rotationally invariant distribution. This idea is extracted from [KW92]; see Section 4.1 for the justification.
We will consider a standard normal test matrix . That is, the entries of
are statistically independent Gaussian random variables, each with mean zero and variance one. It is well known that the range of this random matrix has a rotationally invariant distribution. The goal of this paper is to study the behavior of the random eigenvalue estimate.
Remark 1.2 (Other Test Matrices).
Our analysis and detailed results depend heavily on the choice of a standard normal test matrix . In practice, we can achieve similar empirical behavior from test matrices with “structured” distributions that require less storage or that have fast matrix–vector multiplies. For randomized Krylov subspace methods, the computational benefit of using a structured test matrix is limited because we need to perform repeated multiplications with the input matrix to generate the Krylov matrix; cf. (1.1). See [HMT11, Secs. 4.6 and 7.4] or [MT20, Secs. 9 and 11.5] for some discussion about other random test matrices.
For completeness, we describe the simplest stable implementation of a block Krylov method for computing the largest eigenvalue of a symmetric matrix. See Algorithm 1 for pseudocode; this approach is adapted from [RST09, HMST11]. Here is a summary of the costs:
A total of matrix–matrix multiplies between and an matrix, plus another multiplication between and an matrix. The arithmetic cost depends on whether the matrix supports a fast multiplication operation. For example, the algorithm is more efficient when is sparse.
Repeated orthogonalization of vectors of length with total cost operations.
Solution of a (block-tridiagonal) maximum eigenvalue problem at a cost of operations.
Maintenance of the matrix , which requires units of storage.
As noted, Krylov subspace methods are particularly valuable when we have an efficient procedure for computing matrix–vector products with . On many contemporary computer architectures, the cost of performing a product with several vectors is similar to the cost of a product with a single vector. In this setting, we can take advantage of the improved behavior of a block method almost for free.
1.2. The Role of the Spectrum
Owing to invariance, the theoretical behavior of the eigenvalue estimate depends only on the spectrum of the input matrix . In this section, we develop this idea further and introduce some spectral features that affect the performance of the eigenvalue estimate.
Warning 1.3 (Numerical Behavior).
1.2.1. Invariance Properties of the Random Eigenvalue Estimate
The random estimate of the maximum eigenvalue has several invariance properties that allow us to simplify the analysis.
First, the rotation invariance (1.7) of the eigenvalue estimate and the rotational invariance of the range of imply that
signifies equality of distribution for two random variables. In other words, the maximum eigenvalue estimate depends only on the eigenvalues of the input matrix—but not the eigenvectors.
Second, owing to the affine covariance property (1.8), the eigenvalue estimate only depends on the “shape” of the spectrum of , but not its location or scale. As a consequence, we must assess the behavior of the eigenvalue estimate in terms of spectral features that are affine invariant.
1.2.2. Spectral Features of the Input Matrix
To express the results of our analysis, we introduce terminology for some spectral features of the symmetric matrix . First, let us instate compact notation for the eigenvalues of :
The map returns the th largest eigenvalue of a symmetric matrix.
Next, we define some functions of the eigenvalue spectrum:
The spectral range of the input matrix is the distance between the extreme eigenvalues. That is, .
The spectral gap is the relative difference between the maximum eigenvalue and the next distinct eigenvalue:
If is a multiple of the identity, then . Note that .
Let be a nonnegative number. The -stable rank is a continuous measure of the “dimension” of the range of that reflects how quickly the spectrum decays. It is defined as
If is a multiple of the identity, we define . Otherwise, . When the eigenvalues of decay at a polynomial rate, the stable rank can be much smaller than the rank for an appropriate choice of . See Section 2.1 for some illustrations.
Let be any estimate for the largest eigenvalue of the input matrix . We measure the error in the estimate relative to the spectral range:
The relative error in the Krylov estimate falls in the interval because of (1.5).
The spectral gap, the stable rank, and the error measure are all invariant under increasing affine transformations of the spectrum of . We suppress the dependence of these quantities on the input matrix , unless emphasis is required.
1.2.3. An Example
Let us illustrate these concepts with a canonical example. A common task in computational physics is to compute the minimum eigenvalue of (the discretization of) an elliptic operator.
Consider the standard -point second-order finite-difference discretization of the one-dimensional Laplacian on with homogeneous boundary conditions. Writing , the eigenvalues of the matrix are
The spectral gap between the smallest and second smallest eigenvalue is proportional to . The spectrum has essentially no decay, which is visible in the fact that is proportional to . The tiny spectral gap and large stable rank suggest that the maximum eigenvalue problem for will be challenging.
A natural remedy is to attempt to compute the maximum eigenvalue of the inverse . Independent of , the spectral gap between the largest and second largest eigenvalue satisfies . The stable rank . The large spectral gap and small stable rank suggest that the maximum eigenvalue problem for will be quite easy, regardless of the dimension.
1.3. Matrices with Few Distinct Eigenvalues
Before continuing, we must address an important special case. When the input matrix has few distinct eigenvalues, the block Krylov method computes the maximum eigenvalue of the matrix perfectly.
Proposition 1.5 (Randomized Block Krylov: Matrices with Few Eigenvalues).
Let be a symmetric matrix. Fix
the block size and the depth
of the block Krylov subspace. Draw a standard normal matrix .
If has or fewer distinct eigenvalues, then with probability one.
with probability one.
1.4. Matrices without a Spectral Gap
Our first result gives probabilistic bounds for the maximum eigenvalue estimate without any additional assumptions. In particular, it does not require a lower bound on the spectral gap .
Theorem 1.6 (Randomized Block Krylov: Maximum Eigenvalue Estimate).
Instate the following hypotheses.
Let be a symmetric input matrix.
Draw a standard normal test matrix with block size .
Fix the depth parameter , and let be an arbitrary nonnegative integer partition.
We have the following probability bounds for the estimate , defined in (1.4), of the maximum eigenvalue of the input matrix.
Let us take a moment to explain the content of this result. We begin with a discussion about the role of the second depth parameter, and then we explain the role of the first depth parameter . We emphasize that the user does not choose a partition ; the bounds are valid for all partitions.
For now, fix . The key message of Theorem 1.6 is that the relative error satisfies
once the depth parameter exceeds
Once the depth attains this level, the probability of error drops off exponentially fast:
In fact, we need just to ensure that the probability bound is nontrivial.
The most important aspect of this result is that the depth scales with , so it is possible to achieve moderate relative error using a block Krylov space with limited depth. In contrast, the randomized power method [KW92] requires the depth to be proportional to to achieve a relative error of .
The second thing to notice is that the depth scales with . The stable rank is never larger than the ambient dimension , but it can be significantly smaller—even constant—when the spectrum of the matrix has polynomial decay.
Here is another way to look at these facts. As we increase the depth parameter , the block Krylov method exhibits a burn-in period whose length depends on . While the depth , the algorithm does not make much progress in estimating the maximum eigenvalue. Once the depth satisfies , the expected relative error decreases in proportion to . In contrast, the power method [KW92] reduces the expected relative error in proportion to .
We can now appreciate the role of the first depth parameter . When the spectrum of the input matrix exhibits polynomial decay, is constant for an appropriate value of that depends on the decay rate. In this case, the analysis shows that the total burn-in period is just steps. For example, when the th eigenvalue decays like , this situation occurs.
The block size may not play a significant role in determining the average error. But changing the block size has a large effect on the probability of failure (i.e., the event that the relative error exceeds ). For example, suppose that we increase the block size from one to three. For each increment of in the depth , the failure probability with block size is a factor of smaller than the failure probability with block size !
Remark 1.7 (Prior Work).
The simple case in Theorem 1.6 has been studied in the paper [KW92]. Our work introduces two major innovations. First, we obtain bounds in terms of the stable rank, which allows us to mitigate the dimensional dependence that appears in [KW92]. Second, we have obtained precise results for larger block sizes , which indicate potential benefits of using block Krylov methods. Our proof strategy is motivated by the work in [KW92], but we have been able to streamline and extend their arguments by using a more transparent random model for the test matrix.
1.5. Matrices with a Spectral Gap
Our second result gives probabilistic bounds for the maximum eigenvalue estimate when we have a lower bound for the spectral gap of the input matrix.
Theorem 1.8 (Randomized Block Krylov: Maximum Eigenvalue Estimate with Spectral Gap).
Instate the following hypotheses.
Let be a symmetric input matrix with spectral gap , defined in (1.10).
Draw a standard normal test matrix with block size .
Fix the depth parameter , and let be an arbitrary nonnegative integer partition.
We have the following probability bounds for the estimate , defined in (1.4), of the maximum eigenvalue of the input matrix.
Let us give a verbal summary of what this result means. First of all, we anticipate that Theorem 1.8 will give better bounds than Theorem 1.6 when the spectral gap exceeds the target level for the error. But both results are valid for all choices of and .
Now, fix the first depth parameter . One implication of the spectral gap result is that the relative error satisfies
when the second depth parameter exceeds
In this case, the depth only scales with , so the block Krylov method can achieve very small relative error for a matrix with a spectral gap. As before, the depth also scales with , so the dimensional dependence is weak—or even nonexistent if the spectrum has polynomial decay and is sufficiently large.
When the depth , the error probability drops off quickly:
This bound indicates that is the scale on which the depth needs to increase to reduce the failure probability by a constant multiple (which depends on the block size).
We discover a new phenomenon when we examine the expectation of the error. On average, to achieve a relative error of , it is sufficient that the depth , where
|for all ;|
In other words, the depth of the block Krylov space needs to be about before we obtain an average relative error less than one; we can reduce this requirement slightly by increasing the block size . But once the depth reaches this level, Theorem 1.8 suggests that the block Krylov method with reduces the average error twice as fast as the block Krylov method with .
1.6. Estimating Minimum Eigenvalues
We can also use Krylov subspace methods to obtain an estimate for the minimum eigenvalue of a symmetric matrix . Conceptually, the simplest way to accomplish this task is to apply the Krylov subspace method to the negation . The minimum eigenvalue estimate takes the form
Owing to (1.5), this estimate is never smaller than .
It is straightforward to adapt Theorems 1.6 to obtain bounds for the minimum eigenvalue estimate with a random test matrix . In particular, we always have the bound
In this context, the stable rank takes the form
See Section 1.4 for discussion of this type of bound.
We can also use Theorem 1.8 to obtain results in terms of the spectral gap. The spectral gap is the magnitude of the difference between the smallest two eigenvalues of , relative to the spectral range. For example, when the block size , we have the bound
See Section 1.5 for discussion of this type of bound.
Remark 1.10 (Inverse Iterations).
Suppose that we have a routine for applying the inverse of a positive-semidefinite matrix
to a vector. In this case, we can apply the Krylov method to the inverted matrixto estimate the minimum eigenvalue. This approach is often more powerful than applying the Krylov method to . For example, inverse iteration is commonly used for discretized elliptic operators; the discussion in Section 1.2.3 supports this approach.
1.7. Estimating Singular Values
We now arrive at the problem of estimating the spectral norm of a general matrix using Krylov subspace methods. Assuming that , we can apply the block Krylov method to the square . This yields an estimate for the square of the spectral norm of . If , we can just as well work with the other square .
We also have a bound in terms of the spectral gap , which is the difference between the squares of the largest two distinct singular values, relative to the spectral range. For block size , we have
In this case, it is natural to bound the stable rank as
We have written for the spectral norm and for the Schatten -norm for .
Remark 1.11 (Other Approaches).
It is also possible to work with “odd” Krylov subspacesor , but the analysis requires some modifications.
Remark 1.12 (Minimum Singular Values).
The quantity gives an estimate for the th largest squared singular value of . When , this is the smallest singular value, which may be zero. It is straightforward to derive results for the estimate using the principles outlined above. We omit the details.
With minor modifications, the analysis in this paper can be extended to cover some related situations. First, when the maximum eigenvalue has multiplicity greater than one, the (block) Krylov method converges more quickly. Second, we can extend the algorithm and the analysis to the problem of estimating the largest eigenvalue of an Hermitian matrix (using a complex standard normal test matrix). Third, we can analyze the behavior of the randomized (block) power method. For brevity, we omit the details.
2. Numerical Experiments
Krylov subspace methods exhibit complicated behavior because they are implicitly solving a polynomial optimization problem. Therefore, we do not expect a reductive theoretical analysis to capture all the nuances of their behavior. Nevertheless, by carefully choosing input matrices, we can witness the phenomena that the theoretical analysis predicts.
2.1. Experimental Setup
We implemented the randomized block Krylov method in Matlab 2019a. The code uses full orthogonalization, as described in Algorithm 1, and the maximum eigenvalue of the Rayleigh matrix is computed with the command eig.
We consider several types of input matrices for which accurate maximum eigenvalue estimates are particularly difficult. The randomized block Krylov method is rotationally invariant and affine invariant, so there is no (theoretical) loss in working with a diagonal matrix. Of course, the Krylov method does not exploit the fact that the input matrix is diagonal.
Recall that an matrix from the Gaussian Orthogonal Ensemble (GOE) is obtained by extracting the symmetric part of a Gaussian matrix:
The large eigenvalues of a GOE matrix cluster together, and the spectral gap becomes increasingly small as the dimension increases. The spectrum has essentially no decay.
To study the behavior of the Krylov method without a spectral gap, we draw a realization of a GOE matrix, diagonalize it using eig, and make an affine transformation so that its extreme eigenvalues are and . Abusing terminology, we also refer to this diagonal input matrix as a GOE matrix.
To study the behavior of the Krylov method with a spectral gap , we take the diagonalized GOE matrix and increase the largest eigenvalue until the gap is . We refer to this model as a gapped GOE matrix.
To understand how tail decay affects the convergence of the Krylov method, we consider gapped power law matrices. For a dimension , power , and gap , the matrix is diagonal with entries
See Figure 1 for a depiction of the stable rank function for these types of input matrices. For the random matrix models, we draw a single realization of the input matrix and then fix it once and for all. The variability in the experiments derives from the random draw of the test matrix.
2.2. Sample Paths: The Role of Block Size
First, we explore the empirical probability that the randomized block Krylov method can achieve a sufficiently small error.
The first set of experiments focuses on the behavior of the block Krylov method for a matrix with a substantial spectral gap. For a fixed gapped GOE matrix with , Figure 2 illustrates sample paths of the relative error in estimating the maximum eigenvalue as a function of the total depth of the Krylov space. We compare the performance as the block size varies. Here is a summary of our observations:
After a burn in period of about five steps, the error begins to decay exponentially, as described in Theorem 1.8. For , we can graphically estimate that the decay rate is about , while the theory predicts a decay rate at least .
For block size , the average error decays at roughly half the rate achieved for block size . For , the average error has a distinctive profile; for , the error is qualitatively similar. These phenomena match the bounds in Theorem 1.8(2).
The sample paths give a clear picture of how the errors typically evolve. Independent of the block size, most of the sample paths decay at the same rate. As the block size increases, there is a slight reduction in the error (seen as a shift to the left), but this improvement is both small and diminishing.
The impact of the block size becomes evident when we look at the spread of the sample paths. As the block size increases, the error varies much less, and this effect is sharpened by increasing the block size. The apparent reason is that the Krylov method can misconverge: it locks onto the second largest eigenvalue (dashed blue line), and it may not find a component of the maximum eigenvector for a significant number of iterations. For block size , this tendency is strong enough to destroy the rate of average convergence. For larger block sizes, the likelihood and duration of misconvergence both decrease.
The second set of experiments addresses the performance of the block Krylov method for a matrix with a small spectral gap. For a fixed GOE matrix (with ), Figure 3 shows sample paths of the relative error in estimating the maximum eigenvalue as a function of the total depth of the Krylov space and the block size . Let us highlight a few observations.
In the polynomial decay regime , the block size affects the average error weakly, as suggested by Theorem 1.6. In the exponential decay regime, the block size plays a much more visible role, but the theory does not fully capture this effect.
Regardless of the block size, the bulk of the sample paths decay more quickly than the average error. The block size has less of an effect on the typical error than on the average error. Nevertheless, for larger block size, we quickly achieve more digits of accuracy.
As in the first experiment, the block size has a major impact on the variability of the error. As we increase the block size, the sample paths start to cluster sharply, so the typical error and the average error align with each other. Misconvergence is also visible here, and it can be mitigated by increasing the block size.
2.3. Burn-in: The Role of Tail Content
Next, we examine how the tail content affects the burn-in period for the randomized block Krylov method.
Let us consider how the burn-in increases with the dimension of a matrix with limited spectral decay. For gapped GOE matrices with , Figure 3(a) shows how the average error evolves as a function of the depth of the Krylov space with block size . The main point is that the error initially stagnates before starting to decay exponentially. The stagnation period increases in proportion to the logarithm of the dimension . As we saw in Figure 0(a), this same behavior is visible in the size of for small values of .
Now, we look at how the burn-in depends on the rate of decay of the eigenvalues of a matrix. For an gapped power law matrix with , Figure 3(b) charts how the average error decays as a function of the depth for block size . We observe that the error initially decays exponentially at a slow rate. After a burn-in period, the error begins to decline at a faster exponential rate. The length of the initial trajectory increases in proportion with the logarithm of the order of the power law. From Figure 0(b), we detect that this speciation reflects the spread of for small values of .
The numerical experiments presented in this section confirm many of our theoretical predictions about the behavior of randomized block Krylov methods for estimating the maximum eigenvalue of a symmetric matrix. Increasing the block size results in a dramatic reduction in the probability of committing a large error in the estimate. The underlying mechanism is that simple Krylov method () is much more likely to misconverge than a block method (even with ). The effect is so significant that block Krylov methods can converge, on average, at twice the rate of a simple Krylov method. These benefits counterbalance the increased computational cost of a block method.
These facts have implications for design and implementation of practical algorithms. If the arithmetic cost is the driving concern, then the simple randomized Krylov method typically uses matrix–vector multiplications more efficiently than the block methods. But we have also seen that block methods can achieve a specific target for the error (averaged over test matrices) using a similar number of matrix–vector multiplies as the simple method. In fact, in modern computing environments, the actual cost (time, communication, energy) of multiplying a matrix by several vectors may be equivalent to a single matrix–vector multiplication, in which case the block methods are the obvious choice. Finally, when it is critical to limit the probability of failure, then the block methods are clearly superior.
3. History, Related Work, and Contributions
Krylov subspace methods are a wide class of algorithms that use matrix–vector products (“Krylov information”) to compute eigenvalues and eigenvectors and to solve linear systems. These methods are especially appropriate in situations where we can only interact with a matrix through its action on vectors. In this treatment, we only discuss Krylov methods for spectral computations. Some of the basic algorithms in this area are the power method, the inverse power method, subspace iteration, the Lanczos method, the block Lanczos method, and the band Lanczos method. See the books [Par98, BDD00, Saa11, GVL13] for more background and details.
3.1. Simple Krylov Methods
Simple Krylov methods are algorithms based on a Krylov subspace constructed from a single starting vector . That is, the block size .
The power method, which dates to the 19th century, is probably the earliest algorithm that relies on Krylov information to compute eigenvalues and eigenvectors of a symmetric matrix. The power method is degenerate in the sense that it only keeps the highest-order term in the Krylov subspace.
In the late 1940s, Lanczos [Lan50] developed a sophisticated Krylov subspace method for solving the symmetric eigenvalue problem. (More precisely, the Lanczos method uses a three-term recurrence to compute a basis for the Krylov subspace so that the compression of the input matrix to the Krylov subspace is tridiagonal.) In exact arithmetic, the Lanczos estimate of the maximum eigenvalue of a symmetric matrix coincides with for a fixed vector . On the other hand, the Lanczos method has complicated behavior in finite-precision arithmetic; for example, see Meurant’s book [Meu06].
The first analysis of the Lanczos method with a deterministic starting vector dates to the work of Lanczos [Lan50]. Kaniel, Paige, and Saad also made major theoretical contributions in the 1970s and 1980s; see [Par98, Saa11] for details and references. In the 1980s, Nemirovsky, Yudin, and Chou showed that Krylov subspace methods are the optimal deterministic algorithms for solving the symmetric eigenvalue problem, assuming we only have access to the matrix via matrix–vector multiplication; see [NY83, Cho87, Nem91].
The burn-in period for Krylov methods has been observed in many previous works, including [vdSvdV86, LS05, BES05]. The length of the burn-in period depends on the proportion of the energy in the test vector that is captured by the true invariant subspace.
The main contribution of the paper [vdSvdV86] is the observation that Krylov methods may exhibit superlinear convergence. The explanation for this phenomenon is that the optimal polynomial can create an artificial spectral gap by annihilating undesirable eigenvalues. This kind of behavior is visible in Figure 3.
3.2. Random Starting Vectors
Practitioners have often suggested using randomized variants of Krylov subspace methods. That is, the starting vector is chosen at random. Historically, randomness was just used to avoid the situation where the starting vector is orthogonal to the leading invariant subspace of the matrix.
Later, deeper justifications for random starting vectors appeared. The first substantive theoretical analysis of a randomized Krylov method appears in Dixon’s paper [Dix83] on the power method with a random starting vector. We believe that this is the first paper to recognize that Krylov methods can be successful without the presence of a spectral gap.
In 1992, Kuczyński & Woźniakowski published an analysis [KW92] of the Lanczos method with a random starting vector. Their work highlighted the benefits of randomization, and it provided a clear explanation of the advantages of using full Krylov information instead of the power method. See the papers [KW94, DCM97, LW98] for further work in this direction.
3.3. Block Krylov Methods
Block Krylov subspace methods use multiple starting vectors to generate the Krylov subspace, instead of just one. In other words, the algorithms form a Krylov subspace , where is a matrix. These methods were developed in the late 1960s and 1970s in an effort to resolve multiple eigenvalues more reliably. The block analog of the power method is called subspace iteration; see the books [Par98, Saa11] for discussion.
There are also block versions of the Lanczos method, which were developed by Cullum & Donath [CD74] and Golub & Underwood [GU77]. (More precisely, the block Lanczos method uses a recurrence to compute a basis for the block Krylov subspace so that the compression of the input matrix to the block Krylov subspace is block tridiagonal.) In exact arithmetic, the block Lanczos estimate of the maximum eigenvalue of a symmetric matrix coincides with for a fixed matrix .
Most of the early work on block Krylov subspace methods focuses on the case where the block size is small, while the depth of the Krylov space is moderately large. This can lead to significant problems with numerical stability, especially in the case where we use a recurrence to perform orthogonalization. As a consequence, full orthogonalization is usually recommended. Furthermore, most of the existing analysis of block Krylov methods is deterministic; for example, see [Saa80, LZ15].
3.4. Randomized Block Krylov Methods
Over the last decade, randomized block Krylov subspace methods have emerged as a powerful tool for spectral computations on large matrices. These algorithms use a Krylov subspace generated by a random test matrix .
In contrast with earlier block Krylov algorithms, contemporary methods use a much larger block size and a much smaller depth . Furthermore, the randomness plays a central role in supporting performance guarantees for the algorithms.
Most of the recent literature concerns the problem of computing a low-rank approximation to a large matrix, rather than estimating eigenvalues or invariant subspaces. Some of the initial work on randomized algorithms for matrix approximation appears in [PRTV98, FKV98, DKM06, MRT11]. Randomized subspace iteration was proposed in [RST09] and further developed in [HMT11]. Randomized block Krylov methods that use the full block Krylov subspace were proposed in the papers [RST09, HMST11]; see also [DIKMI18]. See [HMT11] for more background and history.
3.5. Analysis of Randomized Block Krylov Methods
There is a growing body of literature that develops theoretical performance guarantees for randomized block Krylov methods. The papers [RST09, HMT11, Gu15, MM15] contain theoretical analyses of randomized subspace iteration. The papers [MM15, WZZ15, DIKMI18] contain theoretical analysis of randomized methods that use the full block Krylov space. These works all focus on low-rank matrix approximation.
3.6. Contemporary Work
After the initial draft of this paper and its companion [Tro18] was completed, several other research groups released new work on randomized (block) Krylov methods. The paper [DI19] shows that randomized block Lanczos can produce excellent low-rank matrix approximations, even without the presence of a spectral gap. The paper [YGL18] demonstrates that randomized block Lanczos gives excellent estimates for the singular values of a general matrix with spectral decay.
We set out to develop highly refined bounds for the behavior of randomized block Krylov methods that use the full Krylov subspace. Our aim is to present useful and informative results in the spirit of Saad [Saa80], Kucziński & Woźniakowski [KW92], and Halko et al. [HMT11]. Our research has a number of specific benefits over prior work.
We have shown that randomized block Krylov methods have exceptional performance for matrices with spectral decay. In fact, for matrices with polynomial spectral decay, we can often obtain accurate estimates even when the block Krylov subspace has constant depth.
We have obtained detailed information about the role of the block size . In particular, increasing the block size drives down failure probabilities exponentially fast.
The companion paper [Tro18] gives the first results on the performance of randomized block Krylov methods for the symmetric eigenvalue problem.
Our bounds have explicit and modest constants, which gives the bounds some predictive power.
We hope that these results help clarify the benefits of randomized block Krylov methods. We also hope that our work encourages researchers to develop new implementations of these algorithms that fully exploit contemporary computer architectures.
Before we begin the proofs of the main results, Theorems 1.6 and 1.8, we present some background information. In Section 4.1, we justify the claim that the test matrix should have a uniformly random range. In Section 4.2, we introduce the Chebyshev polynomials of the first and second type, and we develop the properties that we need to support our arguments. Expert readers may wish to skip to Section 5 for proofs of the main results.
4.1. Rotationally Invariant Distributions
We wish to estimate spectral properties of a matrix using randomized block Krylov information. In particular, we aim to establish probabilistic upper bounds on the error in these spectral estimates for any input matrix. This section contains a general argument that clarifies why we ought to use a random test matrix with a rotationally invariant range in these applications.
Proposition 4.1 (Uniformly Random Range).
Consider any bivariate function that is orthogonally invariant:
Fix a symmetric matrix , and consider the orthogonal orbit
Let be a random matrix.
Let be a uniformly random orthogonal matrix,
drawn independently from
be a uniformly random orthogonal matrix, drawn independently from. Then
By orthogonal invariance of ,
The inequality is Jensen’s, and the last identity follows from the definition of the class . ∎
Now, consider the problem of estimating the maximum eigenvalue of the worst matrix with eigenvalue spectrum using a block Krylov method. Define the orthogonally invariant111The relative error (1.12) is orthogonally invariant because the eigenvalues of are orthogonally invariant and (1.7) states that the eigenvalue estimate (1.4) is also orthogonally invariant. function . For any distribution on and for a uniformly random orthogonal matrix , Proposition 4.1 states that
That is, the random test matrix is better than the test matrix if we want to minimize the worst-case expectation of the error; the same kind of bound holds for tail probabilities. We surmise that the test matrix should have a uniformly random range. Moreover, because of the co-range invariance (1.6), we can select any distribution with uniformly random range, such as the standard normal matrix .
4.2. Chebyshev Polynomials
In view of (1.3), we can interpret Krylov subspace methods as algorithms that compute best polynomial approximations. The analysis of Krylov methods often involves a careful choice of a specific polynomial that witnesses the behavior of the algorithm. Chebyshev polynomials arise naturally in this connection because they are the solutions to minimax approximation problems.
This section contains the definitions of the Chebyshev polynomials of the first and second type, and it derives some key properties of these polynomials. We also construct the specific polynomials that arise in our analysis. Some of this material is drawn from the paper [KW92] by Kuczyński & Woźniakowski. For general information about Chebyshev polynomials, we refer the reader to [AS64, OLBC10].
4.2.1. Chebyshev Polynomials of the First Kind
We can define the Chebyshev polynomials of the first kind via the formula
Using the binomial theorem, it is easy to check that this expression coincides with a polynomial of degree with real coefficients.
We require two properties of the Chebyshev polynomial . First, it satisfies a uniform bound on the unit interval:
Indeed, it is well-known that is the unique monic polynomial of degree with the least maximum value on the interval . The simpler result (4.2) is an immediate consequence of the representation
The latter formula follows from (4.1) after we apply de Moivre’s theorem for complex exponentiation.
Second, the Chebyshev polynomial grows quickly outside of the unit interval:
This estimate is a direct consequence of the definition (4.1).
4.2.2. The Attenuation Factor
Let be a parameter, and define the quantity
This definition is closely connected with the growth properties of . We can bound the attenuation factor in two ways:
These numerical inequalities can be justified using basic calculus. The first is very accurate for , while the second is better across the full range .
4.2.3. First Polynomial Construction
Choose a nonnegative integer partition . Consider the polynomial
The polynomial has degree , and it is normalized so that it takes the value one at . It holds that
Remark 4.2 (The Monomial).
In contrast to [KW92] and other prior work, we use products of Chebyshev polynomials with low-degree monomials. This seemingly minor change leads to results phrased in terms of the stable rank, rather than the ambient dimension. This is equivalent to a bound on iterations of the subspace iteration method, followed by iterations of the block Krylov method.
4.2.4. Chebyshev Polynomials of the Second Kind
We can define the Chebyshev polynomials of the second kind via the formula
Using the binomial theorem, it is easy to check that this expression coincides with a polynomial of degree with real coefficients. Moreover, when is an even number, the polynomial is an even function.
We require two properties of the Chebyshev polynomial . First, it satisfies a weighted uniform bound on the unit interval:
In fact, is the unique monic polynomial that minimizes the maximum value of the left-hand side of (4.9) over the interval ; see [KW92, Eqn. (23) et seq.]. The simpler result (4.9) is an immediate consequence of the representation
The latter formula follows from (4.8) after we apply de Moivre’s theorem for complex exponentiation.
4.2.5. Second Polynomial Construction
As before, introduce a parameter . Choose a nonnegative integer partition . Consider the polynomial
Since is an even polynomial, this expression defines a polynomial with degree and normalized to take the value one at . We have the bound
5. The Error in the Block Krylov Subspace Method
In this section, we initiate the proof of Theorems 1.6 and 1.8. Along the way, we establish Proposition 1.5. First, we show how to replace the block Krylov subspace by a simple Krylov subspace (with block size one). Afterward, we develop an explicit representation for the error in the eigenvalue estimate derived from the simple Krylov subspace. Finally, we explain how to construct the simple Krylov subspace so that we preserve the benefits of computing a block Krylov subspace.
The ideas in this section are drawn from several sources. The strategy of reducing a block Krylov subspace to a simple Krylov subspace already appears in [Saa80], but we use a different technique that is adapted from [HMT11]. The kind of analysis we perform for the simple Krylov method is standard; we have closely followed the presentation in [KW92].