Greed Works: An Improved Analysis of Sampling Kaczmarz-Motzkin

12/07/2019 ∙ by Jamie Haddock, et al.

Stochastic iterative algorithms have gained recent interest in machine learning and signal processing for solving large-scale systems of equations, Ax = b. One such example is the Randomized Kaczmarz (RK) algorithm, which acts only on single rows of the matrix A at a time. While RK randomly selects a row of A to work with, Motzkin's Method (MM) employs a greedy row selection. Connections between the two algorithms resulted in the Sampling Kaczmarz-Motzkin (SKM) algorithm, which samples a random subset of β rows of A and then greedily selects the best row of the subset. Despite their differing computational costs, all three algorithms have been proven to have the same theoretical upper bound on the convergence rate. In this work, an improved analysis of the range of random (RK) to greedy (MM) methods is presented. This analysis improves upon previously known convergence bounds for SKM, capturing the benefit of partially greedy selection schemes. This work also further generalizes previously known results, removing the theoretical assumptions that β must be fixed at every iteration and that A must have normalized rows.


1 Introduction

Large-scale systems of equations arise in many areas of data science, including in machine learning and as subroutines of several optimization methods [9]. We consider solving these large systems of linear equations, $Ax = b$, where $A \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, and $b \in \mathbb{R}^m$. Iterative methods which use a small portion of the data in each iteration are typically employed in this domain. These methods offer a small memory footprint and good convergence guarantees. The Kaczmarz method [26] is one such iterative method; it consists of sequential orthogonal projections towards the solution set of a single equation (or subsystem). Given the system $Ax = b$, the method computes iterates by projecting onto the hyperplane defined by the equation $a_i^\top x = b_i$, where $a_i^\top$ is a selected row of the matrix $A$ and $b_i$ is the corresponding entry of $b$. The iterates are recursively defined as
$$x_{k+1} = x_k + \frac{b_i - a_i^\top x_k}{\|a_i\|^2}\, a_i.$$
We assume that $A$ is full rank and that the system has a unique solution $x^*$. We will use $r_k = Ax_k - b$ to represent the $k$th residual and $e_k = x_k - x^*$ to represent the $k$th error term. Additionally, we let $\sigma_{\min}(A)$ denote the smallest singular value of $A$ and, unless otherwise noted, we let $\|\cdot\|$ represent the Euclidean norm. We let $\|\cdot\|_F$ denote the Frobenius norm and $\|\cdot\|_\infty$ denote the $\infty$-norm. A visualization of several iterations of a Kaczmarz method is shown in Figure 1.

Figure 1: Several iterations of a Kaczmarz method. The iterate $x_{k+1}$ is the orthogonal projection of $x_k$ onto the solution set of the selected equation (represented by a line).
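As a concrete illustration of the projection step above, the following NumPy sketch (our own code; the name kaczmarz_step is not from the paper) performs one orthogonal projection onto the hyperplane of a selected equation and checks that the resulting iterate satisfies that equation.

```python
import numpy as np

def kaczmarz_step(x, a_i, b_i):
    """Orthogonally project x onto the hyperplane {z : a_i @ z = b_i}."""
    return x + (b_i - a_i @ x) / (a_i @ a_i) * a_i

rng = np.random.default_rng(0)
a, b_i, x = rng.standard_normal(5), 1.0, np.zeros(5)
x_new = kaczmarz_step(x, a, b_i)
print(np.isclose(a @ x_new, b_i))  # True: the new iterate lies on the selected hyperplane
```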

The Kaczmarz method was originally proposed in the late 1930s [27] and rediscovered in the 1970s under the name algebraic reconstruction technique (ART) as an iterative method for reconstructing an image from a series of angular projections in computed tomography [20, 25]. This method has seen popularity among practitioners and researchers alike since the beginning of the digital age [11, 24], but saw a renewed surge of interest after the elegant convergence analysis of the Randomized Kaczmarz (RK) method in [44]. In [44], the authors showed that for a consistent system with unique solution, RK (with a specified sampling distribution) converges at least linearly in expectation with the guarantee

$$\mathbb{E}\,\|x_k - x^*\|^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{\|A\|_F^2}\right)^{k} \|x_0 - x^*\|^2. \qquad (1)$$

Many variants and extensions followed, including convergence analyses for inconsistent and random linear systems [33, 12], connections to other popular iterative algorithms [30, 35, 40, 41, 16], block approaches [36, 42], acceleration and parallelization strategies [17, 28, 31], and techniques for reducing noise and corruption [48, 23].

Another popular extension of the Kaczmarz method is greedy (rather than randomized) row selection, which has been rediscovered several times in the literature as "most violated constraint control" or "maximal-residual control" [10, 38, 39]. This method was proposed in the 1950s as an iterative relaxation method for linear programming by Agmon, Motzkin, and Schoenberg under the name Motzkin's relaxation method for linear inequalities (MM) [32, 1]. In [1], the author showed that MM converges at least linearly (deterministically) with the convergence rate of (1). The bodies of literature studying this greedy strategy have remained somewhat disjoint, with analyses for linear systems of equations in the numerical linear algebra community and analyses for linear systems of inequalities in the operations research and linear programming community [18, 19, 45, 3, 6, 7, 13]. There has been recent work analyzing variants of this greedy strategy [15, 4, 5, 43]. In [43], the authors analyze MM on a system to which a Gaussian sketch has been applied. In [4, 5], the authors analyze variants of MM in which the equation selected in each iteration is chosen randomly amongst the set whose residual values are sufficiently near the maximal residual value. In [15], the authors provide a convergence analysis for a generalized version of MM in which the equation chosen in each iteration is the one with the maximal weighted residual value, where the weighted residual values are the residual entries divided by the norms of the corresponding rows of the measurement matrix. In [14], the authors illustrated the connection between MM and RK and proposed a family of algorithms that interpolate between the two, known as the Sampling Kaczmarz-Motzkin (SKM) methods.

The SKM methods operate by randomly sampling a subset of the system of equations, computing the residual of this subset, and projecting onto the equation corresponding to the largest magnitude entry of this sub-residual. The family of methods (parameterized by the size of the random sample of equations, $\beta$) interpolates between MM, which is SKM with $\beta = m$, and RK, which is SKM with $\beta = 1$. In [14], the authors prove that the SKM methods converge at least linearly in expectation with the convergence rate specified in (1). Meanwhile, the empirical convergence of this method is seen to depend upon $\beta$; however, increasing $\beta$ also increases the computational cost of each iteration, so the per-iteration gain from a larger sample size may be outweighed by the in-iteration cost. This is reminiscent of other methods which use sampled subsets of data in each iteration, such as the block projection methods [2, 36, 37].

Like SKM, the randomized block Kaczmarz (RBK) methods use a subset of rows to produce the next iterate; rather than forcing the next iterate to satisfy the single sampled equation as in RK, block iterates satisfy all the equations in the randomly sampled block. The $(k+1)$st RBK iterate is given by
$$x_{k+1} = x_k + A_{\tau_k}^\dagger\,(b_{\tau_k} - A_{\tau_k} x_k),$$
where $A_{\tau_k}$ and $b_{\tau_k}$ represent the restrictions of $A$ and $b$ onto the row indices in $\tau_k$ and $A_{\tau_k}^\dagger$ denotes the Moore-Penrose inverse of the matrix $A_{\tau_k}$. In [36], the authors prove that, on a system with a row-normalized measurement matrix with a well-conditioned row paving, RBK converges at least linearly in expectation with the guarantee

$$\mathbb{E}\,\|x_k - x^*\|^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{C \log(1+m)\,\|A\|^2}\right)^{k} \|x_0 - x^*\|^2, \qquad (2)$$

where $C$ is an absolute constant and $\|A\|$ denotes the operator norm of the matrix. This can be a significant improvement over the convergence rate of (1) when $C \log(1+m)\,\|A\|^2 \ll \|A\|_F^2$. However, the cost per iteration scales with the size of the blocks. In [37], the authors generalize this result to inconsistent systems and show that, up to a convergence horizon, RBK converges to the least-squares solution. In [34], the authors, inspired by the sketching framework in [21], construct a block-type method which iterates by projecting onto a Gaussian sketch of the equations. They show that this method converges at least linearly in expectation with the guarantee

(3)

where the contraction term involves an absolute constant and the number of rows in the resulting sketched system. This result requires a Gaussian sketch, which is a costly operation; however, the authors suggest using a Gaussian sketch of only a subset of the equations. This result is the most closely related to SKM and to our main result.
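For comparison with the single-row projection, here is a minimal sketch of the RBK update described above (our own code, not the authors'); the action of the Moore-Penrose inverse is computed through a least-squares solve rather than by forming the pseudoinverse explicitly.

```python
import numpy as np

def block_kaczmarz_step(x, A_block, b_block):
    """One RBK update: x + pinv(A_block) @ (b_block - A_block @ x).

    np.linalg.lstsq returns the minimum-norm least-squares solution, which is
    exactly the action of the Moore-Penrose inverse on the block residual.
    """
    update, *_ = np.linalg.lstsq(A_block, b_block - A_block @ x, rcond=None)
    return x + update
```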

2 Previous Results

This section focuses on the convergence behavior of the RK, MM, and SKM methods. Each of these projection methods is a special case of Algorithm 1 with a different selection rule (Line 4). In iteration $k$, RK uses the randomized selection rule that chooses $i_k = i$ with probability $\|a_i\|^2/\|A\|_F^2$, MM uses the greedy selection rule $i_k = \operatorname{argmax}_{i} |a_i^\top x_k - b_i|$, and SKM uses the hybrid selection rule that first samples a subset of rows, $\tau_k$, uniformly at random from all subsets of size $\beta$ and then chooses $i_k = \operatorname{argmax}_{i \in \tau_k} |a_i^\top x_k - b_i|$. As previously mentioned, RK and MM are special cases of the SKM method when the sample size is $\beta = 1$ and $\beta = m$, respectively. Each of the methods converges linearly when the system is consistent with a unique solution (RK and SKM converge linearly in expectation, MM converges linearly deterministically). In Table 1, we present the selection rules and convergence rates for RK, MM, and SKM. Note that under the assumption that $A$ has been normalized so that $\|a_i\| = 1$ for all rows $i$, each of these upper bounds on the convergence rate is the same, since $\|A\|_F^2 = m$. Thus, these results do not reveal any advantage that the more computationally expensive methods (MM, SKM with $\beta > 1$) enjoy over RK. There are, in fact, pathological examples on which RK, MM, and SKM exhibit nearly the same behavior (e.g., consider the system defining two lines that intersect at one point in $\mathbb{R}^2$), so it is not possible to prove significantly different convergence rates without leveraging additional properties of the system.

1: procedure Kacz($A$, $b$, $x_0$)
2:     $k = 0$
3:     repeat
4:         Choose $i_k$ according to selection rule.
5:         $x_{k+1} = x_k + \frac{b_{i_k} - a_{i_k}^\top x_k}{\|a_{i_k}\|^2}\, a_{i_k}$.
6:         $k = k + 1$
7:     until stopping criterion reached
8:     return $x_k$
9: end procedure
Algorithm 1 Generic Kaczmarz Method
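The selection rules compared in this section can be instantiated directly in code. The sketch below (our own implementation) realizes Algorithm 1 with the SKM selection rule; $\beta = 1$ and $\beta = m$ recover the RK-style and MM rules as special cases. Rows are sampled uniformly here, which matches the stated distributions only when the rows of $A$ are normalized.

```python
import numpy as np

def kaczmarz(A, b, beta, num_iters=1000, rng=None):
    """Algorithm 1 with the SKM selection rule (uniform sampling of beta rows)."""
    rng = np.random.default_rng() if rng is None else rng
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(num_iters):
        tau = rng.choice(m, size=beta, replace=False)     # sampled row indices
        sub_residual = A[tau] @ x - b[tau]
        i = tau[np.argmax(np.abs(sub_residual))]          # greediest row in the sample
        x = x + (b[i] - A[i] @ x) / (A[i] @ A[i]) * A[i]  # project onto its hyperplane
    return x

# Usage on a small consistent Gaussian system.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 20))
x_star = rng.standard_normal(20)
b = A @ x_star
for beta in (1, 20, 200):  # RK-style, SKM, MM
    x_hat = kaczmarz(A, b, beta, num_iters=2000, rng=rng)
    print(beta, np.linalg.norm(x_hat - x_star))
```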
RK [44]: choose $i_k = i$ with probability $\|a_i\|^2/\|A\|_F^2$; rate $\mathbb{E}\,\|x_{k+1} - x^*\|^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{\|A\|_F^2}\right)\|x_k - x^*\|^2$.
SKM [14]: sample $\tau_k$ uniformly from the subsets of $\beta$ rows and choose $i_k = \operatorname{argmax}_{i \in \tau_k} |a_i^\top x_k - b_i|$; rate $\mathbb{E}\,\|x_{k+1} - x^*\|^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{m}\right)\|x_k - x^*\|^2$.
MM [1]: choose $i_k = \operatorname{argmax}_{i} |a_i^\top x_k - b_i|$; rate $\|x_{k+1} - x^*\|^2 \le \left(1 - \frac{\sigma_{\min}^2(A)}{m}\right)\|x_k - x^*\|^2$.
Table 1: The selection rules and convergence rates of RK, SKM, and MM. The presented results for MM and SKM assume that $A$ has been normalized so that $\|a_i\| = 1$ for all rows $i$.

In [22], the authors demonstrate that MM can converge faster than RK or SKM and that the convergence rate depends on the structure of the residual terms of the iterations, $Ax_k - b$. In particular, they prove a per-iteration contraction whose rate depends on the dynamic range of the $k$th residual, $\gamma_k = \|Ax_k - b\|^2 / \|Ax_k - b\|_\infty^2$. Our main contribution in this paper is to prove that the SKM methods can exhibit a similarly accelerated convergence rate and that the advantage scales with the size of the sample, $\beta$. Again, this advantage depends upon the structure of the residuals of the iterations. We define here a generalization of the dynamic range used in [22]; our dynamic range is defined as

(4)

Now, we let $\mathbb{E}_k$ denote expectation with respect to the random sample $\tau_k$, conditioned upon the sampled $\tau_j$ for $j < k$, and we let $\mathbb{E}$ denote expectation with respect to all random samples $\tau_j$ for $j \le k$, where $k$ is understood to be the last iteration in the context in which $\mathbb{E}$ is applied. We state our main result below in Corollary 1; this is a corollary of our generalized result, which will be discussed and proven later.

Corollary 1.

Let $A$ be normalized so that $\|a_i\| = 1$ for all rows $i$. Let $x^*$ denote the unique solution to the system of equations $Ax = b$. Then SKM converges at least linearly in expectation, and the bound on the rate depends on the dynamic range, $\gamma_k$, of the random sample of $\beta$ rows of $A$, $A_{\tau_k}$. Precisely, in the $k$th iteration of SKM, we have

so, applying expectation with respect to all iterations, we have

Corollary 1 shows that SKM experiences at least linear convergence where the contraction term is a product of terms that are less than one and dependent on the sub-sample size $\beta$. When $\beta = 1$, as in RK, we have $\gamma_k = 1$, so Corollary 1 recovers the upper bound for RK shown in [44]. However, when $\beta = m$, as in MM, Corollary 1 offers an improved upper bound on the error over [22].

Our result illustrates that the progress made by an iteration of the SKM algorithm depends upon the dynamic range of the residual of that iteration. The dynamic range of each iteration, $\gamma_k$, satisfies

Note that the upper bound is achieved by a constant residual, in which all entries of the sampled sub-residual are equal in magnitude, while the lower bound is achieved by a residual with a single nonzero entry. As a smaller $\gamma_k$ provides a smaller upper bound on the new error $\|x_{k+1} - x^*\|^2$, we consider the situation with one nonzero entry in the residual as the "best case" and the situation with a constant residual as the "worst case." We now compare our single-iteration result in the best and worst cases to the previously known single-iteration results of [1, 14, 22, 44]. These are summarized in Table 2; we present only the contraction terms $c$ such that $\|x_{k+1} - x^*\|^2 \le c\,\|x_k - x^*\|^2$ for each upper bound, in the case that $A$ is normalized so that $\|a_i\| = 1$ for all rows $i$. In particular, note that the worst-case residual provides the same upper bound rate as those of [44, 14, 1].

Table 2: Contraction terms $c$, with $\|x_{k+1} - x^*\|^2 \le c\,\|x_k - x^*\|^2$, for the best- and worst-case bounds of MM, SKM, and RK, alongside the previously known best- and worst-case bounds; the previous best-case bound for MM is that of [22], and the previous worst-case bounds are those of [1, 14, 44].
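The best- and worst-case residual structure can be checked directly for the per-sample ratio that underlies the dynamic range: for a sub-residual with β entries, the squared 2-norm over squared ∞-norm ratio equals β when all entries have equal magnitude and 1 when only one entry is nonzero. A quick check (our own code):

```python
import numpy as np

def ratio(r):
    """Squared 2-norm over squared infinity-norm of a (sub-)residual."""
    return np.linalg.norm(r) ** 2 / np.linalg.norm(r, np.inf) ** 2

beta = 10
print(ratio(np.ones(beta)))    # worst case (constant residual): 10.0
print(ratio(np.eye(beta)[0]))  # best case (single nonzero entry): 1.0
```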

3 Main Results

Corollary 1 is a specialization of our general result to SKM with a fixed sample size and systems that are row-normalized. Our main result requires neither row-normalization nor a static sample size. However, we must additionally generalize the SKM sampling distribution for systems that are not row-normalized. We now consider the general SKM method which samples $\beta_k$ many rows of $A$ in the $k$th iteration (according to the probability distribution defined in (5)) and projects onto the hyperplane associated to the largest magnitude entry of the sampled sub-residual. The generalized probability distribution over the subsets of $\beta_k$ rows of $A$ is given by

(5)

Thus, our generalized SKM method is Algorithm 1 with the selection rule that samples a subset $\tau_k$ of $\beta_k$ rows according to (5) and then chooses $i_k = \operatorname{argmax}_{i \in \tau_k} |a_i^\top x_k - b_i|$. Similar to the RK probability distribution of [44], the computation of (5) is utilized here simply to theoretically analyze the SKM algorithm without requiring normalized rows; we do not suggest that this probability distribution be implemented in a real-world setting.

Our main result shows that the generalized SKM method converges at least linearly in expectation with a bound that depends on the dynamic range of the sampled sub-residual, the size of the sample, and the minimum squared singular value of $A$, $\sigma_{\min}^2(A)$. In the event that there are multiple rows within the sub-residual which achieve the maximal magnitude, an arbitrary choice can be made amongst those rows and the main result will not be affected by this choice.

Theorem 1 provides theoretical convergence guarantees for the generalized SKM method. Whereas previous guarantees for SKM required normalized rows or fixed sample sizes [14, 22], the guarantees presented here require neither assumption. In addition, the contraction term of the generalized SKM method shows dependence on the dynamic range, a feature lacking in previous works. Following the statement of the theorem, we use standard techniques from the Kaczmarz literature to prove our main result. We also remark on the special case in which rows have equal norm (and thus subsets are selected uniformly at random), the case where $\beta$ is fixed, and the case in which $\beta = m$, in order to make connections to previous results.

Theorem 1.

Let $x^*$ denote the unique solution to the system of equations $Ax = b$. Then generalized SKM converges at least linearly in expectation, and the bound on the rate depends on the dynamic range, $\gamma_k$, of the random sample of rows of $A$, $A_{\tau_k}$. Precisely, in the $k$th iteration of generalized SKM, we have

Proof.

We begin by rewriting the iterate and simplifying the resulting expression, which yields

Now, we take expectation of both sides (with respect to the sampled $\tau_k$, chosen according to the distribution (5)). This gives

where the last line follows from standard properties of singular values. This completes our proof. ∎

One may be assured that this contraction term is always strictly positive. We prove this simple fact in Lemma 1.

Lemma 1.

For any matrix $A$ defining a consistent system with unique solution $x^*$, we have

Proof.

Beginning with the definition of $\gamma_k$, we have

where the inequality follows from properties of singular values and the Cauchy-Schwarz inequality. ∎

The next remarks make simplifying assumptions on the generalized SKM algorithm and our main result to provide better context for comparison with previous works.

Remark 1.

(Recovery of RK guarantees) If all of the rows of $A$ have equal norm (not necessarily unit norm), then our result specializes to

(6)

Additionally, when $\beta = 1$, the sampling distribution (5) and the theoretical error upper bound (6) simplify to the probability distribution and error guarantee of [44].

Remark 2.

(Improvement of MM guarantees) Corollary 1 is obtained from Theorem 1 when the rows of $A$ have unit norm and the sample size $\beta$ is fixed. When $\beta = m$, an improved convergence rate of

for MM over that shown in [22],

is obtained.

Remark 3.

(Connection to Block RK) Note that this bound on the convergence rate of SKM additionally provides a bound on the convergence rate of a block Kaczmarz (BK) variant. This variant is distinct from the block Kaczmarz method considered in [36]: the analysis of [36] requires a pre-partitioned row paving, while the variant considered here allows the blocks to be sampled randomly rather than pre-partitioned. Consider the block Kaczmarz variant which in each iteration selects a block of rows of $A$, $A_{\tau_k}$, and projects the previous iterate onto the solution space of the entire block of equations. This variant necessarily converges faster than SKM, as it makes more progress in each iteration. Given iterate $x_k$ and sample of rows $\tau_k$, consider the iterate produced by SKM and the iterate produced by this block Kaczmarz variant. The SKM iterate is the closest point to $x_k$ on the hyperplane associated to the selected equation, so, since the BK iterate also lies on this hyperplane, we have

Now, we note that by orthogonality of the projections, we have

so by the above inequality, we have

A visualization of this situation is presented in Figure 2. Thus, the progress made by BK in any fixed iteration is at least as large as the progress made by SKM, so it must converge at least as quickly.

Figure 2: The SKM and BK iterates generated by one iteration starting from the same point $x_k$; the BK iterate is at least as close to the solution $x^*$ as the SKM iterate.
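The inequality asserted in Remark 3 can also be checked numerically. The sketch below (our own code) applies one SKM step and one block step from the same iterate and the same sampled rows and compares their distances to the solution.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 10))
x_star = rng.standard_normal(10)
b = A @ x_star
x_k = rng.standard_normal(10)

tau = rng.choice(50, size=5, replace=False)        # shared sample of rows
i = tau[np.argmax(np.abs(A[tau] @ x_k - b[tau]))]  # most violated equation in the sample

# SKM: project onto the single selected hyperplane.
x_skm = x_k + (b[i] - A[i] @ x_k) / (A[i] @ A[i]) * A[i]

# BK: project onto the solution set of the whole sampled block.
update, *_ = np.linalg.lstsq(A[tau], b[tau] - A[tau] @ x_k, rcond=None)
x_bk = x_k + update

print(np.linalg.norm(x_bk - x_star) <= np.linalg.norm(x_skm - x_star))  # True
```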

Because Theorem 1 shows that the contraction coefficient for generalized SKM depends upon the dynamic range, the following section discusses bounds on the dynamic range for special types of linear systems.

4 Analysis of the Dynamic Range

Since the dynamic range plays an integral part in the convergence behavior of generalized SKM, we analyze it here for several types of specialized linear systems. Note that the dynamic range has also appeared in other works, although not under the guise of "dynamic range". For example, in [4] the authors proposed a Greedy Randomized Kaczmarz (GRK) algorithm that finds a subset of indices from which to randomly select the next row to project onto. The operation of finding this subset relies on a ratio between the $\ell_2$ and $\ell_\infty$ norms of the residual at the current iteration, essentially using a proxy of the dynamic range. In this section, we analyze the dynamic range for random Gaussian linear systems and remark on the extension to other random linear systems.

4.1 Gaussian Matrices

When the entries of the measurement matrix $A$ are drawn i.i.d. from a standard Gaussian distribution, it can be shown that the dynamic range is upper bounded as in Lemma 2 below. The proof of the upper bound is similar to that of Lemma 2 of [22], where the authors analyze the dynamic range for $\beta = m$. Here, we generalize the bound to varying sample sizes $\beta$.

Lemma 2.

Let $A$ be a random Gaussian matrix with i.i.d. standard Gaussian entries. For each subset $\tau$, let $\tau'$ denote the set of rows in $\tau$ that are independent of the current iterate. Assuming there are sufficiently many such independent rows, the dynamic range can be upper bounded as:

(7)
Remark 4.

Note that the factor is as since

Thus, we conclude that the expected dynamic range for any iteration is

Proof.

Without loss of generality, we let the solution to the system be $x^* = 0$, so that $b = 0$. We are then interested in finding an upper bound on the dynamic range (4) in expectation. Here, the expectation is taken with respect to the random i.i.d. draws of the entries of $A$. To that end, we derive upper and lower bounds on the numerator and denominator of (4). Starting with the upper bound on the numerator, we have

where the first inequality follows from the Cauchy-Schwarz inequality and the remaining computation simplifies the expression. The lower bound follows from

where the second-to-last inequality uses standard facts about i.i.d. Gaussian random variables. Therefore, we have

Dividing these bounds attains the desired result (7). ∎

We conjecture that the true bound is tighter and that the extra factor is an artifact of our proof technique. We have plotted the empirical dynamic range and the corresponding conjectured bound in the left panel of Figure 3 for a Gaussian matrix.
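A rough empirical proxy for the quantity plotted in Figure 3 can be computed by Monte Carlo: for a Gaussian system, draw random samples of β rows and average the squared 2-norm over squared ∞-norm ratio of the sub-residual. The sketch below is our own and only approximates the averaging used in the paper's definition (4).

```python
import numpy as np

def empirical_dynamic_range(r, beta, trials, rng):
    """Average ||r_tau||_2^2 / ||r_tau||_inf^2 over uniform samples tau of size beta."""
    vals = []
    for _ in range(trials):
        tau = rng.choice(len(r), size=beta, replace=False)
        vals.append(np.linalg.norm(r[tau]) ** 2 / np.linalg.norm(r[tau], np.inf) ** 2)
    return np.mean(vals)

rng = np.random.default_rng(3)
m, n = 500, 20
A = rng.standard_normal((m, n))
x_k = rng.standard_normal(n)
r = A @ x_k  # residual of iterate x_k when b = 0 and x* = 0; entries are Gaussian
for beta in (2, 10, 50, 250):
    print(beta, empirical_dynamic_range(r, beta, trials=2000, rng=rng))
```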

Remark 5.

To extend to other distributions, one can simply note that, as the signal dimension $n$ gets large, the Law of Large Numbers can be invoked and a similar computation can be used to show an upper bound on the dynamic range of the system.

Figure 3: Dynamic ranges for various sample sizes $\beta$. Left: Gaussian matrix with a Gaussian error (red); the conjectured bound is plotted in blue. Right: incidence matrix of the complete graph with a Gaussian error (red) and a sparse error with Bernoulli random variable entries (blue); the corresponding bounds are plotted in the same colors with different line styles.

4.2 Incidence Matrices

In the previous subsection, we analyzed the dynamic range for systems whose measurement matrices are randomly generated. Deterministically generated measurement matrices are additionally of interest. In this subsection, we analyze the dynamic range associated to incidence matrices of directed graphs. The incidence matrix associated to a directed graph on a vertex set $V$ with arc set $E$ is of size $|E| \times |V|$. For each arc, which connects vertex $u$ to vertex $v$, the associated row of the incidence matrix is all zeros except for a one and a negative one in the $u$th and $v$th entries. These types of matrices arise in the average consensus problem. The gossip methods that solve this problem are generalized by the Kaczmarz methods [29]. RK specializes to the randomized gossip method, in which the pair of nodes which update is selected at random [47, 8]. SKM specializes to a variant of greedy gossip with eavesdropping (GGE), in which the nodes are selected from amongst a random sample to maximize the update [46]. Thus, our analysis provides an alternate convergence rate for GGE.

Now, we consider the dynamic range for an incidence matrix. We can derive a simple bound on the dynamic range in each iteration that depends only upon the entries of the current error vector, $x_k - x^*$. In particular,

(8)

where the pairs of vertices appearing in the bound are those connected by the arcs with the smallest magnitude differences across them. This bound improves for iterates with a sufficient amount of variation in the coordinates. We have plotted the dynamic range and the corresponding bounds in the right panel of Figure 3. We calculate these values for the incidence matrix of the complete graph in the cases when the error is a Gaussian vector (red) and a Bernoulli vector (blue).
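The construction above is easy to reproduce, and the gossip interpretation can be seen directly: a Kaczmarz projection onto the equation of a single arc (with zero right-hand side) replaces the two incident node values by their average. The sketch below is our own illustration.

```python
import numpy as np
from itertools import combinations

def incidence_matrix(n):
    """Incidence matrix of the complete graph on n vertices: one row per arc
    (u, v), with a +1 in column u and a -1 in column v."""
    arcs = list(combinations(range(n), 2))
    Q = np.zeros((len(arcs), n))
    for row, (u, v) in enumerate(arcs):
        Q[row, u], Q[row, v] = 1.0, -1.0
    return Q

n = 5
Q = incidence_matrix(n)
x = np.arange(n, dtype=float)            # node values
a = Q[0]                                 # row for the arc (0, 1)
x_new = x + (0.0 - a @ x) / (a @ a) * a  # Kaczmarz projection with b = 0
print(x_new[:2])                         # both entries equal (x[0] + x[1]) / 2 = 0.5
```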

Remark 6.

If the right-hand-side vector associated to the system