Preconditioning Kaczmarz method by sketching

03/05/2019 · Alexandr Katrutsa, et al. · Moscow Institute of Physics and Technology, Skoltech

We propose a new method for preconditioning the Kaczmarz method by sketching. The Kaczmarz method is a stochastic method for solving overdetermined linear systems based on sampling of matrix rows. The standard approach to speeding up the convergence of iterative methods is to use a preconditioner. As we show, the best possible preconditioner for this method can be constructed from the QR decomposition of the system matrix, but the complexity of this procedure is too high. Therefore, to reduce this complexity, we use random sketching and compare the result with the Kaczmarz method without preconditioning. The developed method is applicable to the different modifications of the classical Kaczmarz method that have been proposed recently. We provide numerical experiments showing the performance of the developed methods on both random and real overdetermined linear systems.


1 Introduction

This paper presents a new method of preconditioning the Kaczmarz method [9] based on the sketching technique [23]. The Kaczmarz method is a stochastic method for solving overdetermined linear systems $Ax = f$, where $A \in \mathbb{R}^{m \times n}$ and $m \gg n$. We assume that we can only sample rows of $A$ and cannot use the full matrix. Existing convergence results [22] show that the convergence rate is determined by $\kappa = \|A\|_F \|A^{\dagger}\|_2$. We show by extensive numerical simulations that this value can be decreased by using a preconditioner. To reduce the time required for the preconditioner computation we approximate the matrix $A$ by sketching and compute the preconditioner for the sketched matrix. According to our assumptions, we can sketch the matrix only by random sampling of a fixed number of rows. The number of sketched rows is a parameter of the proposed method, and we test different numbers proportional to the number of columns $n$.

Numerical experiments demonstrate the performance of the proposed method on random consistent linear systems, test tomography problems, and a non-linear function approximation problem based on the random Fourier features technique. Also, we study random linear systems with noisy right-hand sides and show how the variance of the noise affects the convergence.

The main contributions of this paper are the following.

  • We study the preconditioning of the Kaczmarz method by sketching, analyze its complexity and discuss the limitations of this approach (Sections 2 and 3).

  • We propose to use preconditioning as a fine-tuning technique to obtain highly accurate solutions of overdetermined linear systems (Section 4).

  • We evaluate the proposed approach on synthetic data, test tomography data, and a function approximation problem using an explicit feature map (Section 4).

1.1 Related work

The Kaczmarz method for solving overdetermined linear systems is a well-known randomized iterative method, which has numerous interpretations and generalizations [6, 14]. For example, it can be rewritten as a special case of stochastic gradient descent or coordinate descent applied to the linear least-squares problem corresponding to the considered system [16, 18]. The idea of preconditioning such methods is mentioned in [24], but in the context of using a stochastic gradient method for solving a regression problem. In contrast, we focus on the Kaczmarz method and speeding up its convergence with preconditioning. The convergence of the Kaczmarz method for a noisy linear system is presented in [13]. Our study shows that in the case of small noise the preconditioned Kaczmarz method gives a smaller relative error than the method without preconditioning. Also, there is an extension [25] of the Kaczmarz method for inconsistent linear systems that uses a random orthogonal projection technique to address the inconsistency. One of the main issues in the Kaczmarz method is the correct sampling of rows: if two sequentially sampled rows are strongly correlated, then the progress in convergence is very small. The paper [15] addresses this issue and proposes a two-subspace projection method.

2 Kaczmarz method and its preconditioning

Let $A \in \mathbb{R}^{m \times n}$ be a rectangular matrix with $m \gg n$. We consider the problem of solving an overdetermined linear system

$$Ax = f \qquad (1)$$

under the following assumptions:

  1. the least-squares solution of the problem using the full matrix $A$ is not available;

  2. we can sample rows of the matrix $A$ according to some probability distribution.

To fulfill these assumptions, we consider the Kaczmarz method [9] for solving (1). This method is based on an iterative update of the current approximation $x_k$ using the solution of the following optimization problem:

$$x_{k+1} = \arg\min_{x} \|x - x_k\|_2 \quad \text{subject to} \quad a_i^{\top} x = f_i, \qquad (2)$$

where $a_i^{\top}$ is the $i$-th row of the matrix $A$ and $f_i$ is the $i$-th element of the vector $f$. Problem (2) has an analytical solution

$$x_{k+1} = x_k + \frac{f_i - a_i^{\top} x_k}{\|a_i\|_2^2}\, a_i. \qquad (3)$$

The rows of the matrix $A$ can be sampled in different ways. The original paper [9] proposed a sequential selection of rows one by one. Later, the study [22] proposed the Randomized Kaczmarz method (RKM), which samples rows with probabilities $p_i = \|a_i\|_2^2 / \|A\|_F^2$. The convergence of the RKM is characterized by the following theorem.
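For concreteness, a minimal NumPy sketch of the RKM iteration (3) with the row-norm sampling of [22] is given below; it assumes, for illustration only, that the full matrix fits in memory, although the method itself touches a single row per iteration.

```python
import numpy as np

def randomized_kaczmarz(A, f, n_iters, seed=0):
    """Minimal RKM sketch: sample row i with probability
    ||a_i||^2 / ||A||_F^2 and project onto the hyperplane a_i^T x = f_i."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_norms2 = np.einsum("ij,ij->i", A, A)   # squared row norms
    probs = row_norms2 / row_norms2.sum()      # sampling distribution from [22]
    x = np.zeros(n)
    for _ in range(n_iters):
        i = rng.choice(m, p=probs)
        # projection step (3)
        x += (f[i] - A[i] @ x) / row_norms2[i] * A[i]
    return x
```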

Theorem 1 ([22])

Let $x^*$ be the solution of (1). Then RKM converges in expectation with the average error

$$\mathbb{E}\|x_k - x^*\|_2^2 \le \left(1 - \kappa^{-2}\right)^{k} \|x_0 - x^*\|_2^2,$$

where $\kappa = \|A\|_F \|A^{\dagger}\|_2$.

From this theorem, it follows that decreasing $\kappa$ leads to faster convergence of RKM. Also, let $\kappa_2 = \|A\|_2 \|A^{\dagger}\|_2$ be the condition number of the matrix $A$. The following well-known inequality between $\kappa$ and $\kappa_2$ holds:

$$\kappa \le \sqrt{n}\,\kappa_2. \qquad (4)$$

Thus, $\kappa_2$ is an upper bound of $\kappa$ up to the factor $\sqrt{n}$, and therefore instead of minimizing $\kappa$ we can minimize $\kappa_2$. The standard approach to reducing the condition number is to use preconditioners [20]. We denote the left and right preconditioners by $C$ and $T$, respectively. In our case, the matrix $A$ is rectangular of size $m \times n$ with $m \gg n$. Therefore, the right preconditioner is preferable, since it requires $O(n^2)$ memory in contrast to the left preconditioner, which requires $O(m^2)$. An ideal right preconditioner $T$ satisfies

$$\kappa_2(AT) = 1,$$

which means that the columns of $AT$ are orthonormal. Therefore, we can obtain $T$ from the QR decomposition of the matrix $A$. In particular, if

$$A = QR, \qquad (5)$$

where $Q \in \mathbb{R}^{m \times n}$ has orthonormal columns and $R \in \mathbb{R}^{n \times n}$ is an upper triangular matrix, then

$$T = R^{-1}. \qquad (6)$$

The complexity of every step in this procedure for deriving the optimal right preconditioner is the following:

  • QR decomposition of the matrix $A$ costs $O(mn^2)$,

  • inversion of the upper triangular matrix $R$ costs $O(n^3)$.

Thus, the total complexity is $O(mn^2)$, which is too high in the case $m \gg n$. Section 3 shows how to reduce the complexity of the preconditioner computation down to $O(n^3)$.
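For reference, a direct computation of the ideal right preconditioner (5)–(6) can be sketched as follows; it assumes access to the full matrix $A$, which our setting forbids, and serves only as a baseline.

```python
import numpy as np
from scipy.linalg import solve_triangular

def ideal_right_preconditioner(A):
    """T = R^{-1} from the thin QR of A, so that A @ T has
    orthonormal columns and kappa_2(A @ T) = 1."""
    _, R = np.linalg.qr(A, mode="reduced")   # O(m n^2)
    n = R.shape[0]
    T = solve_triangular(R, np.eye(n))       # O(n^3) triangular inversion
    return T
```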

After introducing the right preconditioner we obtain the following modification of problem (1):

$$(AT)\,y = f, \qquad x = Ty. \qquad (7)$$

The corresponding modification of the Kaczmarz method is

$$y_{k+1} = y_k + \frac{f_i - \tilde{a}_i^{\top} y_k}{\|\tilde{a}_i\|_2^2}\, \tilde{a}_i, \qquad (8)$$

where $\tilde{a}_i^{\top} = a_i^{\top} T$ is the $i$-th row of the matrix $AT$, sampled according to some distribution.

Note that the cost of one iteration of method (8) is $O(n^2)$, since the row $\tilde{a}_i^{\top} = a_i^{\top} T$ is formed on the fly; this is higher than the $O(n)$ cost of an iteration of the original Kaczmarz method (3). Also, we need preprocessing to compute $T$. However, the expected speedup asymptotically dominates the increased cost of every iteration and gives a smaller relative error for the same runtime. Examples of such behavior and a runtime comparison are provided in Section 4.
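A sketch of iteration (8) is shown below; the row $\tilde{a}_i$ is formed on the fly at $O(n^2)$ cost, and the uniform row sampling is an assumption of this sketch, since the row norms of $AT$ are not known in advance.

```python
import numpy as np

def preconditioned_kaczmarz(A, f, T, n_iters, seed=0):
    """Kaczmarz iteration (8) on the preconditioned system A T y = f.
    Returns x = T y. Rows are sampled uniformly (an assumption)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    y = np.zeros(n)
    for _ in range(n_iters):
        i = rng.integers(m)
        a_tilde = T.T @ A[i]          # i-th row of A T, formed at O(n^2) cost
        y += (f[i] - a_tilde @ y) / (a_tilde @ a_tilde) * a_tilde
    return T @ y                      # map back to the original variables
```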

Although we show in (5) and (6) how to compute the optimal right preconditioner, this computation is intractable because we cannot use the full matrix $A$. To overcome this limitation we propose to use sketching [23, 5, 7] of the matrix $A$ to compute an approximation of $T$. The next section discusses some ways to sketch the matrix and the resulting sketching-based preconditioners.

3 Sketching-based preconditioners

This section gives an overview of the sketching idea, shows how it helps in constructing a preconditioner for the Kaczmarz method, and discusses the limitations of this approach.

Sketching is the procedure of replacing the original matrix $A$ by a sketched matrix $\hat{A} = SA$, where $S \in \mathbb{R}^{s \times m}$ is some given matrix and $s \ll m$ [23, 6]. Since the complexity of computing the product $SA$ for arbitrary dense matrices $S$ and $A$ is $O(smn)$, which is intractable, we consider matrices $S$ from the set $\mathcal{S}$ of row-sampling matrices: $S \in \mathcal{S}$ if and only if every entry $S_{ij} \in \{0, 1\}$ and every row of $S$ contains exactly one nonzero entry. For such matrices $S$, the product $SA$ is equivalent to the selection of the $s$ rows of the matrix $A$ with indices $j$ such that $S_{ij} = 1$ for some $i$. Details of the row selection are discussed later in this section. Now assume we have a sketched matrix $\hat{A} = SA$. Then we compute the QR decomposition of this matrix

$$\hat{A} = \hat{Q}\hat{R},$$

where $\hat{Q}$ is orthogonal and $\hat{R}$ is an upper triangular matrix, and use $\hat{T} = \hat{R}^{-1}$ as an approximation of $T$. Compared with (5) and (6), we reduce the complexity of the QR decomposition from $O(mn^2)$ to $O(sn^2)$, which can be rewritten in the form $O(\gamma n^3)$ if $s = \gamma n$, where $\gamma$ is some given factor. Therefore, the total complexity of computing the preconditioner is $O(n^3)$ and does not depend on the number of rows $m$ of the matrix $A$.
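Combining uniform row sampling with the QR-based construction above, a sketch of the preconditioner computation under our access model might look as follows; the sample size $s = \gamma n$ and the uniform choice of rows are the assumptions discussed in the next subsection.

```python
import numpy as np
from scipy.linalg import solve_triangular

def sketched_preconditioner(A, gamma=4, seed=0):
    """Approximate T = R^{-1} from the QR of s = gamma * n uniformly
    sampled rows of A; total cost O(gamma * n^3), independent of m."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    s = min(m, gamma * n)
    idx = rng.choice(m, size=s, replace=False)        # uniform row sampling (S A)
    _, R_hat = np.linalg.qr(A[idx], mode="reduced")   # O(s n^2) = O(gamma n^3)
    # if R_hat is singular, a pseudoinverse can be used instead (cf. Section 4.2)
    return solve_triangular(R_hat, np.eye(n))         # T_hat = R_hat^{-1}
```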

3.1 Selection of rows for sketched matrix

The remaining question in the procedure of constructing $\hat{T}$ is how to obtain the sketched matrix $\hat{A}$, or in particular, how to select rows of the matrix $A$ in such a way that the matrix $\hat{A}$ approximates the matrix $A$ well.

Since we can only sample rows of the matrix $A$ and cannot use the full matrix, we use a simple uniform random sampling strategy. Unfortunately, this strategy is not universal and can give a very poor approximation of $A$ or even a singular matrix $\hat{R}$. This phenomenon relates to the notion of the coherence of a matrix.

Definition 1 ([3])

Let $\mu(A)$ be the coherence of the matrix $A$:

$$\mu(A) = \max_{1 \le i \le m} \|q_i\|_2^2, \qquad (9)$$

where $q_i^{\top}$ is the $i$-th row of the matrix $Q$ from the QR decomposition of the matrix $A$.

According to [3], the coherence lies between $n/m$ and $1$, and the larger the coherence of the matrix, the poorer the approximation given by a uniformly sampled sketched matrix $\hat{A}$. To address this issue, the authors of [3] propose several row-mixing methods that reduce the coherence of the matrix with unitary transformations, e.g., the Discrete Cosine Transform. Unfortunately, such methods operate on the full matrix $A$ and are too costly in our setting, since $m \gg n$ and the full matrix is unavailable. Thus, the considered approach to constructing the sketched matrix $\hat{A}$ is applicable only to matrices that have low coherence.
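Coherence (9) requires a full QR decomposition, so it serves as an offline diagnostic rather than a part of the method; a small sketch for checking test matrices:

```python
import numpy as np

def coherence(A):
    """Coherence (9): the maximum squared row norm of Q from the thin QR of A.
    Values close to n/m are benign for uniform sampling; values close to 1 are not."""
    Q, _ = np.linalg.qr(A, mode="reduced")
    return np.max(np.einsum("ij,ij->i", Q, Q))
```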

4 Numerical experiments

In this section, we provide numerical experiments to demonstrate the performance of the proposed approach to speeding up the convergence of the Kaczmarz method. The source code can be found at GitHub: https://github.com/amkatrutsa/preckacz. We use random matrices, standard tomography data from AIR Tools II [8], and matrices taken from a non-linear function approximation problem. We reduce the latter problem to solving an overdetermined linear system with the random Fourier features technique [1]. In all experiments, we initialize $x_0$ with the zero vector. The considered algorithms are implemented in Python with the NumPy library [17]. Since we assume that the full matrix is not available, we use uniform random sampling of rows in the Kaczmarz method and its preconditioned modifications. Also, we test the proposed approach for a range of factors $\gamma$ required to compute $\hat{T}$. We take into account the preprocessing time required to compute $\hat{T}$ as the starting point of all convergence plots shown below.

4.1 Synthetic data

We generate a random matrix $A$ of size $m \times n$, $m \gg n$, with a fixed condition number. Entries of the matrix are generated from the standard normal distribution $\mathcal{N}(0, 1)$. After that, we generate a ground truth solution $x^*$ from $\mathcal{N}(0, I)$. We compare the proposed preconditioning method in two settings: a consistent system, where $f = Ax^*$, and a consistent system with a noisy right-hand side: $f = Ax^* + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. In both settings, we test different values of $\gamma$. On the one hand, the larger $\gamma$, the better the approximation quality of $\hat{T}$. On the other hand, the larger $\gamma$, the more time we need to compute $\hat{T}$ in the preprocessing step. Therefore, we identify the minimal value of $\gamma$ among the tested values that gives faster convergence than the Kaczmarz method without preconditioning (the "No preconditioning" label in the figures below).

4.1.1 Consistent system setting

In this experimental setting we generate the right-hand side $f = Ax^*$. Figure 1 shows the dependence of the relative error on the runtime for the compared methods. In the case of the Kaczmarz method, labeled "No preconditioning", $T = I$. This plot demonstrates that if $\gamma$ is not large enough, the convergence of the preconditioned Kaczmarz method can be significantly slower than the convergence without preconditioning. At the same time, already for moderate $\gamma$ we observe a significant speedup compared with the Kaczmarz method without preconditioning. Larger values of $\gamma$ show that such a sketch size is already sufficient and a further increase is not reasonable.

Figure 1: Comparison of convergence speed of the considered methods for consistent linear system.

4.1.2 Consistent system with noise setting

In this setting, we generate the right-hand side $f = Ax^* + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. We test the dependence of the relative error on the runtime for different values of $\gamma$ and $\sigma$. Figure 6 shows the same behavior with respect to the considered values of $\gamma$ as in the consistent setting, see Figure 1. In particular, the smallest tested $\gamma$ is not large enough to give a good approximation of $T$, but moderately larger values already yield faster convergence. Also, this plot shows the dependence of the performance of the considered methods on the noise level. The higher the level of noise, the smaller the gain from the preconditioned Kaczmarz method. That is because preconditioning itself yields a very small relative error, but in the high-noise case we cannot obtain a very accurate solution due to the noisy right-hand side.

Figure 6: Comparison of convergence speed of the considered methods for a consistent linear system with a noisy right-hand side, for different levels of noise (panels (a)–(d)).

Thus, we have experimentally shown the effectiveness of the preconditioned Kaczmarz method for randomly generated consistent linear systems and for randomly generated linear systems with a small level of noise in the right-hand side. In the next section, we use matrices coming from different types of tomography to show the performance of the proposed method in a real application.

4.2 Tomography data

To generate the tomography data we use the package AIR Tools II [8]. In particular, to generate matrices we use the functions paralleltomo, fancurvedtomo and fanlineartomo with default parameters and $N$ discretization intervals. The number of intervals corresponds to an $N \times N$ image that we reconstruct through solving an overdetermined system. With default parameters, these functions give overdetermined matrices with $N^2$ columns. Denote by $A_{p}$, $A_{fc}$ and $A_{fl}$ the matrices given by the functions paralleltomo, fancurvedtomo and fanlineartomo, respectively. As the ground truth solution we use the standard test image shown in Figure 7, denoted by $X^*$. This image is represented as an $N \times N$ matrix and is reconstructed by the considered methods in the reshaped form $x^* \in \mathbb{R}^{N^2}$.

Figure 7: Original test image

Figure 11(a) shows the convergence of the same methods that were studied in Section 4.1. Here we again observe that the smallest tested $\gamma$ is not enough for faster convergence. Moreover, for the matrix $A_p$ this value of $\gamma$ gives a singular matrix $\hat{R}$, and we use its pseudoinverse to compute $\hat{T}$. For larger values of $\gamma$ we get nonsingular matrices $\hat{R}$ and observe asymptotically faster convergence compared with the no-preconditioning case. However, we see two phases in the convergence of the preconditioned methods: in the first phase they give a less accurate approximation than the Kaczmarz method without preconditioning, but at some moment the Kaczmarz method starts to give a higher relative error than the preconditioned Kaczmarz methods. For example, Figure 11(a) shows that after 300 seconds the preconditioned Kaczmarz method gives a lower relative error than the Kaczmarz method without preconditioning. We observe the same convergence behavior in the case of the matrices $A_{fc}$ and $A_{fl}$, see Figures 11(b) and 11(c). To make the convergence plots clearer, we skip the smallest-$\gamma$ case there, since it gives a non-singular but very ill-conditioned matrix $\hat{R}$.

Figure 11: Convergence comparison of the considered methods for the test image (see Figure 7); panels: (a) matrix $A_p$, (b) matrix $A_{fc}$, (c) matrix $A_{fl}$.

In addition to the convergence plots in Figure 11, we provide the normalized elementwise error between the ground truth image and the images reconstructed with the considered methods, see Figure 15. These plots show regions of the reconstructed images with high and low reconstruction quality, measured elementwise as the absolute value of the difference $x^* - \hat{x}$ normalized by the norm of $x^*$, where $x^*$ is the ground truth solution and $\hat{x}$ is the approximation given by the corresponding method.

Figure 15: Relative error for different tomography types ((a) matrix $A_p$, (b) matrix $A_{fc}$, (c) matrix $A_{fl}$), computed as the elementwise absolute value of the difference between the ground truth solution and the approximation given by the considered methods, normalized by the norm of the ground truth solution.

Thus, the proposed preconditioned Kaczmarz method can be used as a fine-tuning technique that improves the accuracy achieved by the Kaczmarz method without preconditioning. Details of this approach are given in the next section.

4.2.1 Fine-tuning with the preconditioned Kaczmarz method

As we show in the previous section, the preconditioned Kaczmarz method asymptotically gives a smaller relative error. However, the Kaczmarz method without preconditioning gives a smaller relative error during the first iterations. Therefore, we propose to start solving a given system without a preconditioner and then switch to the preconditioned modification of the Kaczmarz method; a sketch of this strategy follows the figure below. This approach gives a lower error for the same runtime, see Figure 19, where we denote by $t_s$ the time of the switch to the preconditioned Kaczmarz method. We use the same $t_s$ (in seconds) in all experiments. In the case of the tomography given by $A_p$ we do not consider fine-tuning for the smallest $\gamma$, since it gives the poor convergence seen in Figure 11(a).

Figure 19: Convergence comparison of the considered methods with the fine-tuning strategy for the test image (see Figure 7); panels: (a) matrix $A_p$, (b) matrix $A_{fc}$, (c) matrix $A_{fl}$.
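A sketch of the fine-tuning strategy, reusing the sketched_preconditioner routine from Section 3 above; the deadline-based loop structure and the warm start are illustrative assumptions.

```python
import time
import numpy as np

def fine_tuned_kaczmarz(A, f, t_switch, t_total, gamma=4, seed=0):
    """Plain Kaczmarz until t_switch seconds, then preconditioned
    Kaczmarz (8) until t_total seconds. Illustrative sketch."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_norms2 = np.einsum("ij,ij->i", A, A)
    x = np.zeros(n)
    start = time.perf_counter()
    # phase 1: no preconditioning, fast initial progress
    while time.perf_counter() - start < t_switch:
        i = rng.integers(m)
        x += (f[i] - A[i] @ x) / row_norms2[i] * A[i]
    # phase 2: switch to the preconditioned system A T y = f, warm-started
    T = sketched_preconditioner(A, gamma=gamma, seed=seed)  # Section 3 sketch
    y = np.linalg.solve(T, x)     # warm start: y such that T y = x
    while time.perf_counter() - start < t_total:
        i = rng.integers(m)
        a_tilde = T.T @ A[i]
        y += (f[i] - a_tilde @ y) / (a_tilde @ a_tilde) * a_tilde
    return T @ y
```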

4.3 Kernel regression problem

One more source of consistent overdetermined linear systems is the kernel regression problem with explicit feature maps, e.g., random Fourier features [2, 12]. These maps reduce a non-linear regression problem to a linear regression problem with a modified matrix. We use this approach for the problem of non-linear function approximation in the setting where a parametric model of the function is not known. Also, it is known that introducing the Gaussian kernel turns the linear regression problem into a non-linear one and can give a very accurate approximation of smooth functions [11].

Consider the following function, which is used as a test function in the package George [1]:

(10)

with fixed parameter values. To compose an overdetermined linear system we sample $m$ points uniformly from an interval and compute the function values at these points. So, we have a matrix $X \in \mathbb{R}^{m \times d}$, where every row is a sampled point, and the right-hand side $f$ is the vector of the function values at the corresponding points. Since the test function (10) is non-linear, the composed linear system $X\theta = f$ is not consistent, but the implicit kernel trick with random Fourier features helps in constructing a consistent linear system $\tilde{A}\theta = f$. This explicit feature map approximates the map induced by the Gaussian kernel, and the accuracy of the approximation grows exponentially with the dimension $D$ [19]. To construct the matrix $\tilde{A}$, we generate a matrix $W \in \mathbb{R}^{d \times D}$ with entries from the normal distribution $\mathcal{N}(0, \sigma^2)$, where $D$ and $\sigma$ are hyper-parameters that have to be tuned. After that, we compute $V = XW$, $C = \cos(V)$ and $S = \sin(V)$, where sine and cosine are computed elementwise. Now we are ready to construct the matrix $\tilde{A}$ that gives a consistent linear system. We use the columns of the matrices $C$ and $S$ in the following way (see the sketch after the list):

  1. the $(2i-1)$-th column of the matrix $\tilde{A}$ is the $i$-th column of the matrix $C$ for $i = 1, \dots, D$;

  2. the $2i$-th column of the matrix $\tilde{A}$ is the $i$-th column of the matrix $S$ for $i = 1, \dots, D$.

Thus, for a sufficiently large $D$ and a small approximation error we have a consistent linear system $\tilde{A}\theta = f$ [2].
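A sketch of this feature-matrix construction; the function and variable names are ours, and all sizes are placeholders.

```python
import numpy as np

def rff_matrix(X, D, sigma, seed=0):
    """Random Fourier features: interleave cos/sin columns of X @ W,
    with W drawn from N(0, sigma^2). Returns an (m x 2D) matrix."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    W = sigma * rng.standard_normal((d, D))
    V = X @ W
    A_rff = np.empty((m, 2 * D))
    A_rff[:, 0::2] = np.cos(V)   # columns 1, 3, ... (item 1 of the list)
    A_rff[:, 1::2] = np.sin(V)   # columns 2, 4, ... (item 2 of the list)
    return A_rff
```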

Figure 24 shows the convergence of the considered methods for different values of $\gamma$. Experiments show that a moderate value of $\sigma$ gives a sufficiently accurate solution, therefore we fix it and show the dependence only on the dimension $D$. Figure 24(b) shows that already for a moderate $D$ the preconditioned Kaczmarz method converges much faster for all considered values of $\gamma$.

Figure 24: Convergence comparison of the considered methods for the kernel regression problem via random Fourier features for approximating function (10); panels (a)–(d) correspond to different values of $D$.

5 Conclusion

This paper presents a preconditioned Kaczmarz method based on the sketching technique and demonstrates its performance in different applications. However, the proposed approach requires a number of sketched rows, which is an unknown hyperparameter; therefore a theoretical investigation of the minimal sufficient number of sketched rows is an interesting direction for further study. Another ingredient of the Kaczmarz method is the distribution used for sampling rows. To sample rows in the right way, methods for sampling from unnormalized distributions can be used. Also, the preconditioning approach can be generalized to non-linear systems and inspire modifications of the corresponding non-linear stochastic least-squares solvers. One more interesting extension of this work is to consider state-of-the-art adaptive stochastic gradient methods [4, 10] from the preconditioning perspective, as in [21]. This point of view can lead to more efficient and theoretically motivated optimization methods.

References

  • [1] S. Ambikasaran, D. Foreman-Mackey, L. Greengard, D. W. Hogg, and M. O’Neil, Fast direct methods for Gaussian processes, IEEE transactions on pattern analysis and machine intelligence, 38 (2016), pp. 252–265.
  • [2] H. Avron, M. Kapralov, C. Musco, C. Musco, A. Velingker, and A. Zandieh, Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees, in Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, 2017, pp. 253–262.

  • [3] H. Avron, P. Maymounkov, and S. Toledo, Blendenpik: Supercharging LAPACK’s least-squares solver, SIAM Journal on Scientific Computing, 32 (2010), pp. 1217–1236.
  • [4] J. Duchi, E. Hazan, and Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, 12 (2011), pp. 2121–2159.
  • [5] M. Ghashami, E. Liberty, J. M. Phillips, and D. P. Woodruff, Frequent directions: Simple and deterministic matrix sketching, SIAM Journal on Computing, 45 (2016), pp. 1762–1792.
  • [6] R. M. Gower and P. Richtárik, Randomized iterative methods for linear systems, SIAM Journal on Matrix Analysis and Applications, 36 (2015), pp. 1660–1690.
  • [7] R. M. Gower, P. Richtárik, and F. Bach, Stochastic quasi-gradient methods: Variance reduction via jacobian sketching, arXiv preprint arXiv:1805.02632, (2018).
  • [8] P. C. Hansen and J. S. Jørgensen, AIR Tools II: algebraic iterative reconstruction methods, improved implementation, Numerical Algorithms, 79 (2018), pp. 107–137.
  • [9] S. Kaczmarz, Angenäherte Auflösung von Systemen linearer Gleichungen, Bull. Int. Acad. Pol. Sci. Lett. Class. Sci. Math. Nat., (1937), pp. 355–357.
  • [10] D. P. Kingma and J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, (2014).
  • [11] C. A. Micchelli, Y. Xu, and H. Zhang, Universal kernels, Journal of Machine Learning Research, 7 (2006), pp. 2651–2667.
  • [12] M. Munkhoeva, Y. Kapushev, E. Burnaev, and I. Oseledets, Quadrature-based features for kernel approximation, in Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, eds., Curran Associates, Inc., 2018, pp. 9165–9174.
  • [13] D. Needell, Randomized Kaczmarz solver for noisy linear systems, BIT Numerical Mathematics, 50 (2010), pp. 395–403.
  • [14] D. Needell and J. A. Tropp, Paved with good intentions: analysis of a randomized block Kaczmarz method, Linear Algebra and its Applications, 441 (2014), pp. 199–221.
  • [15] D. Needell and R. Ward, Two-subspace projection method for coherent overdetermined systems, Journal of Fourier Analysis and Applications, 19 (2013), pp. 256–269.
  • [16] D. Needell, R. Ward, and N. Srebro, Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, in Advances in Neural Information Processing Systems, 2014, pp. 1017–1025.
  • [17] T. E. Oliphant, A guide to NumPy, vol. 1, Trelgol Publishing USA, 2006.
  • [18] Z. Qu and P. Richtárik, Coordinate descent with arbitrary sampling I: Algorithms and complexity, Optimization Methods and Software, 31 (2016), pp. 829–857.
  • [19] A. Rahimi and B. Recht, Random features for large-scale kernel machines, in Advances in neural information processing systems, 2008, pp. 1177–1184.
  • [20] Y. Saad, Iterative methods for sparse linear systems, vol. 82, SIAM, 2003.
  • [21] M. Staib, S. J. Reddi, S. Kale, S. Kumar, and S. Sra, Escaping saddle points with adaptive gradient methods, arXiv preprint arXiv:1901.09149, (2019).
  • [22] T. Strohmer and R. Vershynin, A randomized Kaczmarz algorithm with exponential convergence, Journal of Fourier Analysis and Applications, 15 (2009), p. 262.
  • [23] D. P. Woodruff et al., Sketching as a tool for numerical linear algebra, Foundations and Trends® in Theoretical Computer Science, 10 (2014), pp. 1–157.
  • [24] J. Yang, Y.-L. Chow, C. Ré, and M. W. Mahoney, Weighted SGD for regression with randomized preconditioning, The Journal of Machine Learning Research, 18 (2017), pp. 7811–7853.
  • [25] A. Zouzias and N. M. Freris, Randomized extended Kaczmarz for solving least squares, SIAM Journal on Matrix Analysis and Applications, 34 (2013), pp. 773–793.