In the era of big data where data sets become continuously larger, randomized iterative methods become very popular and they are now playing major role in areas like numerical linear algebra, scientific computing and optimization. They are preferred mainly because of their cheap per iteration cost which leads to the improvement in terms of complexity upon classical results by orders of magnitude and to the fact that they can easily scale to extreme dimensions. However, a common feature of these methods is that in their update rule a particular subproblem needs to be solved exactly. In the case that the size of this problem is large, this step can be computationally very expensive. The purpose of this work is to reduce the cost of this step by incorporating inexact updates in the stochastic methods under study.
1.1 The Setting
In this paper we are interested to solve three closely related problems:
Stochastic Quadratic Optimization Problem
Best Approximation Problem
Concave Quadratic Maximization Problem
We start by presenting the main connections and key relationships between these problems as well as popular randomized iterative methods (with exact updates) for solving each one of them.
Stochastic Optimization Problem:
We study the stochastic quadratic optimization problem
first proposed in  for reformulating consistent linear systems
In particular, problem (1) is defined by setting:
where is a random symmetric positive semi-definite matrix that depends on three different matrices: the data matrix of the linear system (2
), a random matrixand on an positive definite matrix which defines the geometry of the space. Throughout the paper, is used to define a inner product in via and an induced norm, . By we denote the Moore-Penrose pseudoinverse.
The expectation in (1) is over random matrices with rows (and arbitrary number of columns , e.g., ) drawn from an arbitrary (user defined) distribution . The authors of  give necessary and sufficient conditions that distribution needs to be satisfied for the set of solutions of (1) to be equal to the set of solutions of the linear system (2); a property for which the term exactness was coined in (see Section 3 for more details on exactness).
and a linear rate of convergence was proved despite the fact that is not necessarily strongly convex, (1) is not a finite-sum problem and a fixed stepsize is used.
The stochastic optimization problem (1) has many unique characteristics mainly because it has constructed in a particular way in order to capture all the information of the linear system (2). For example it holds that
and it can be proved that all eigenvalues of its Hessian matrixare upper bounded by 1. Due to these specific characteristics, the update rules of seemingly different randomized iterative methods are identical. In particular the following methods for solving (1) have exactly the same behavior with SGD :
In all methods is a fixed stepsize and is sampled afresh in each iteration from distribution . See  for more insights into the reformulation (1), its properties and other equivalent reformulations (e.g., stochastic fixed point problem, probabilistic intersection problem, and stochastic linear system).
Best Approximation Problem and Sketch and Project Method:
In [46, 29], it has been shown that for the case of consistent linear systems with multiple solutions, SGD (and as a result SNM (5) and SPPM (6)) converges linearly to one particular minimizer of function , the projection of the initial iterate onto the solution set of the linear system (2). This naturally leads to the best approximation problem:
Unlike, the linear system (2) which is allowed to have multiple solutions, the best approximation problem has always (from its construction) a unique solution. For solving problem (7), the Sketch and Project Method (SPM):
first proposed in . The name Sketch and Project method is justified by the iteration structure which follows two steps: (i) Choose the sketched system , (ii) Project the last iterate onto . The Sketch and Project viewpoint will be useful later in explaining the natural interpretation of the proposed inexact update rules. (see Section 4.2).
Dual Problem and SDSA:
The Fenchel dual of (7) is the (bounded) unconstrained concave quadratic maximization problem
Boundedness follows from consistency. It turns out that by varying and (but keeping consistency of the linear system), the dual problem in fact captures all bounded unconstrained concave quadratic maximization problems .
— updates the dual vectorsas follows:
where the random matrix is sampled afresh in each iteration from distribution , and is chosen in such a way to maximize the dual objective : . More specifically, SDSA is defined by picking the with the smallest (standard Euclidean) norm. This leads to the formula:
In  the dual method was analyzed for the case of unit stepsize (). Later in  the analysis extended to capture the cases of . Momentum variants of the dual method that provide further speed up have been also studied in .
An interesting property that holds between the suboptimalities of the Sketch and Project method and SDSA is that the dual suboptimality of in terms of the dual function values is equal to the primal suboptimality of in terms of distance [19, 29]. That is,
This simple to derive result (by combining the expression of the dual function (10) and the equation (13)) gives for free the convergence analysis of SDSA, in terms of dual function suboptimality once the analysis of Sketch and Project is available (see Section 5).
In this work we propose and analyze inexact variants of all previously mentioned randomized iterative algorithms for solving the stochastic optimization problem, the best approximation problem and the dual problem. In all of these methods, a certain potentially expensive calculation/operation needs to be performed in each step; it is this operation that we propose to be performed inexactly. For instance, in the case of SGD, it is the computation of the stochastic gradient , in the case of SPM is the computation of the projection , and in the case of SDSA it is the computation of the dual update .
We perform an iteration complexity analysis under an abstract notion of inexactness and also under a more structured form of inexactness appearing in practical scenarios. An inexact solution of these subproblems can be obtained much more quickly than the exact solution. Since in practical applications the savings thus obtained are larger than the increase in the number of iterations needed for convergence, our inexact methods can be dramatically faster.
Let us now briefly outline the rest of the paper:
In Section 2 we describe the subproblems and introduce two notions of inexactness (abstract and structured) that will be used in the rest of the paper. The Inexact Basic Method (iBasic) is also presented. iBasic is a method that simultaneously captures inexact variants of the algorithms (4), (5), (6) for solving the stochastic optimization problem (1) and algorithm (8) for solving the best approximation problem (7). It is an inexact variant of the Basic Method, first presented in , where the inexactness is introduced by the addition of an inexactness error in the original update rule. We illustrate the generality of iBasic by presenting popular algorithms that can be cast as special cases.
In Section 3 we establish convergence results of iBasic under general assumptions on the inexactness error of its update rule (see Algorithm 1). In this part we do not focus on any specific mechanisms which lead to inexactness; we treat the problem abstractly. However, such errors appear often in practical scenarios and can be associated with inaccurate numerical solvers, quantization, sparsification and compression mechanisms. In particular, we introduce several abstract assumptions on the inexactness level and describe our generic convergence results. For all assumptions we establish linear rate of decay of the quantity (i.e. L2 convergence)444As we explain later, a convergence of the expected function values of problem 1 can be easily obtained as a corollary of L2 convergence..
Subsequently, in Section 4 we apply our general convergence results to a more structured notion of inexactness error and propose a concrete mechanisms leading to such errors. We provide theoretical guarantees for this method in situations when a linearly convergent iterative method (e.g., Conjugate Gradient) is used to solve the subproblem inexactly. We also highlight the importance of the dual viewpoint through a sketch-and-project interpretation.
In Section 5 we study an inexact variant of SDSA, which we called iSDSA, for directly solving the dual problem (10). We provide a correspondence between iBasic and iSDSA and we show that the random iterates of iBasic arise as affine images of iSDSA. We consider both abstract and structured inexactness errors and provide linearly convergent rates in terms of the dual function suboptimality .
Finally, in Section 6 we evaluate the performance of the proposed inexact methods through numerical experiments and show the benefits of our approach on both synthetic and real datasets. Concluding remarks are given in Section 7.
A summary of the convergence results of iBasic under several assumptions on the inexactness error with pointers to the relevant theorems is available in Table 1. We highlight that similar convergence results can be also obtained for iSDSA in terms of the dual function suboptimality (check Section 5 for more details on iSDSA).
For convenience, a table of the most frequently used notation is included in the Appendix C. In particular, with boldface upper-case letters we denote matrices and
is the identity matrix. Bywe denote the solution set of the linear system . By , where is a random matrix, we denote the solution set of the sketched linear system . In general, we use to express the exact solution of a sub-problem and to indicate its inexact variant. Unless stated otherwise, throughout the paper, is the projection of onto in the -norm: . An explicit formula for the projection of point onto set is given by
A formula for the projection onto is obtained by replacing and with and respectively into the above equation. We denote this projection by . We also write .
In order to keep the expression brief throughout the paper we define555In the iterate the expression becomes .:
Using this matrix we can easily express important quantities related to the problems under study. For example the stochastic functions of problem (1) can be expressed as
In addition the gradient and the Hessian of with respect to the inner product are equal to
A key matrix in our analysis is
which has the same spectrum with the matrix but at the same time is symmetric and positive semi-definite666Note that matrix is not symmetric but it is self-adjoint with respect to the -inner product.. We denote with the eigenvalues of . With we indicate the smallest nonzero eigenvalue, and with the largest eigenvalue. It was shown in  that for all .
2 Inexact update rules
In this section we start by explaining the key sub-problems that need to be solved exactly in the update rules of the previously described methods. We present iBasic, a method that solves problems (1) and (7) and we show how by varying the main parameters of the method we recover inexact variants of popular algorithms as special cases. Finally closely related work on inexact algorithms for solving different problems is also presented.
2.1 Expensive Sub-problems in Update Rules
Let us devote this subsection on explaining how the inexactness can be introduced in the current exact update rules of SGD777Note that SGD has identical updates to the Stochastic Newton and Stochastic proximal point method. Thus the inexactness can be added to these updates in similar way. (4), Sketch and Project (8) and SDSA (11) for solving the stochastic optimization, best approximation and the dual problem respectively. As we have shown these methods solve closely related problems and the key subproblems in their update rule are similar. However the introduction of inexactness in the update rule of each one of them can have different interpretation.
For example for the case of SGD for solving the stochastic optimization problem (1) (see also Section 4.1 and 4.2 for more details), if we define then the stochastic gradient of function becomes and the update rule of SGD takes the form: . Clearly in this update the expensive part is the computation of the quantity that can be equivalently computed to be the least norm solution of the smaller (in comparison to ) linear system . In our work we are suggesting to use an approximation of the exact solution and with this way avoid executing the possibly expensive step of the update rule. Thus the inexact update is taking the following form:
Here denotes a more abstract notion of inexactness and it is not necessary to be always equivalent to the quantity . It can be interpreted as an expression that acts as an perturbation of the exact update. In the case that has the above form we say that the notion of inexactness is structured. In our work we are interested in both the abstract and more structured notions of inexactness. We first present general convergence results where we require the error to satisfy general assumptions (without caring how this error is generated) and later we analyze the concept of structured inexactness by presenting algorithms where .
In similar way, the expensive operation of SPM (8) is the exact computation of the projection . Thus we are suggesting to replace this step with an inexact variant and compute an approximation of this projection. The inexactness here can be also interpreted using both, the abstract error and its more structured version . At this point, observe that, by using the expression (15) the structure of the in SPM and SGD has the same form.
In the SDSA the expensive subproblem in the update rule is the computation of the that satisfy . Using the definition of the dual function (10) this value can be also computed by evaluating the least norm solution of the linear system . Later in Section 5 we analyze both notions of inexactness (abstract and more structured) for inexact variants of SDSA.
Table 2 presents the key sub-problem that needs to be solved in each algorithm as well as the part where the inexact error is appeared in the update rule.
2.2 The Inexact Basic Method
In each iteration of the all aforementioned exact methods a sketch matrix is drawn from a given distribution and then a certain subproblem is solved exactly to obtain the next iterate. The sketch matrix requires to have rows but no assumption on the number of columns is made which means that the number of columns allows to vary through the iterations and it can be very large. The setting that we are interested in is precisely that of having such large random matrices . In these cases we expect that having approximate solutions of the subproblems will be beneficial.
Recently randomized iterative algorithms that requires to solve large subproblems in each iteration have been extensively studied and it was shown that are really beneficial when they compared to their single coordinates variants () [34, 35, 44, 27]. However, in theses cases the evaluation of an exact solution for the suproblem in the update rule can be computationally very expensive. In this work we propose and analyze inexact variants by allowing to solve the subproblem that appear in the update rules of the stochastic methods, inexactly. In particular, following the convention established in  of naming the main algorithm of the paper Basic method we propose the inexact Basic method (iBasic) (Algorithm 1).
The in the update rule of the method represents the abstract inexactness error described in Subsection 2.1. Note that, iBasic can have several equivalent interpretations. This allow as to study the methods (4),(5),(6) for solving the stochastic optimization problem and the sketch and project method (8) for the best approximation problem in a single algorithm only. In particular iBasic can be seen as inexact stochastic gradient descent (iSGD) with fixed stepsize applied to (1). From (17), and as a result the update rule of iBasic can be equivalently written as: In the case of the best approximation problem (7), iBasic can be interpreted as inexact Sketch and Project method (iSPM) as follows:
2.3 General Framework and Further Special Cases
The proposed inexact methods, iBasic (Algorithm 1) and iSDSA (Section 5), belong in the general sketch and project framework, first proposed from Gower and Richtarik in  for solving consistent linear systems and where a unified analysis of several randomized methods was studied. This interpretation of the algorithms allow us to recover a comprehensive array of well-known methods as special cases by choosing carefully the combination of the main parameters of the algorithms.
In particular, the iBasic has two main parameters (besides the stepsize of the update rule). These are the distribution from which we draw random matrices and the positive definite matrix . By choosing carefully combinations of the parameters and we can recover several existing popular algorithms as special cases of the general method. For example, special cases of the exact Basic method are the Randomized Kaczmarz, Randomized Gaussian Kaczmarz888Special case of the iBasic, when the random matrix is chosen to be a Gaussian vector with mean and a positive definite covariance matrix . That is [18, 29]., Randomized Coordinate Descent and their block variants. For more details about the generality of the sketch and project framework and further algorithms that can be cast as special cases of the analysis we refer the interested reader to Section 3 of  and Section 7 of . Here we present only the inexact update rules of two special cases that we will later use in the numerical evaluation.
Special Cases: Let us define with the column concatenation of the identity matrix indexed by a random subset of .
Inexact Randomized Block Kaczmarz (iRBK): Let and let pick in each iteration the random matrix . In this setup the update rule of the iBasic simplifies to
Inexact Randomized Block Coordinate Descent (iRBCD)999In the setting of solving linear systems Randomized Coordinate Descent is known also as Gauss-Seidel method. Its block variant can be also interpret as randomized coordinate Newton method (see ).: If the matrix of the linear system is positive definite then we can choose . Let also pick in each iteration the random matrix . In this setup the update rule of the iBasic simplifies to
For more papers related to Kaczmarz method (randomized, greedy, cyclic update rules) we refer the interested reader to [23, 28, 38, 5, 37, 39, 9, 33, 34, 13, 31, 59, 35, 50]. For the coordinate descent method (a.k.a Gauss-Seidel for linear systems) and its block variant, Randomized Block Coordinate Descent we suggest [25, 36, 44, 45, 40, 41, 43, 7, 24, 14, 1, 54].
2.4 Other Related Work on Inexact Methods
One of the current trends in the large scale optimization problems is the introduction of inexactness in the update rules of popular deterministic and stochastic methods. The rational behind this is that an approximate/inexact step can often computed very efficiently and can have significant computational gains compare to its exact variants.
In the area of deterministic algorithms, the inexact variant of the full gradient descent method, , has received a lot of attention [49, 11, 51, 16, 32]. It has been analyzed for the cases of convex and strongly convex functions under several meaningful assumptions on the inexactness error and its practical benefit compared to the exact gradient descent is apparent. For further deterministic inexact methods check  for Inexact Newton methods, [52, 47] for Inexact Proximal Point methods and  for Inexact Fixed point methods.
In the recent years, with the explosion that happens in areas like machine learning and data science inexactness enters also the updating rules of several stochastic optimization algorithms and many new methods have been proposed and analyzed.
In the large scale setting, stochastic optimization methods are preferred mainly because of their cheap per iteration cost (compared to their deterministic variants), their property to scale to extreme dimensions and their improved theoretical complexity bounds. In areas like machine learning and data science, where the datasets become larger rapidly, the development of faster and efficient stochastic algorithms is crucial. For this reason, inexactness has recently introduced to the update rules of several stochastic optimization algorithms and new methods have been proposed and analyzed. One of the most interesting work on inexact stochastic algorithms appears in the area of second order methods. In particular on inexact variants of the Sketch-Newton method and subsampled Newton Method for minimize convex and non-convex functions [48, 2, 4, 56, 57, 58]. Note that our results are related also with this literature since our algorithm can be seen as inexact stochastic Newton method (see equation (5)). To the best or our knowledge our work is the first that provide convergence analysis of inexact stochastic proximal point methods (equation (6)) in any setting. From numerical linear algebra viewpoint inexact sketch and project methods for solving the best approximation problem and its dual problem where also never analyzed before.
As we already mentioned our framework is quite general and many algorithms, like iRBK (21) and iRBCD (22) can be cast as special cases. As a result, our general convergence analysis includes the analysis of inexact variants of all of these more specific algorithms as special cases. In  an analysis of the exact randomized block Kacmzarz method has been proposed and in the experiments an inexact variant was used to speedup the method. However, no iteration complexity results were presented for the inexact variant and both the analysis and numerical evaluation have been made for linear systems with full rank matrices that come with natural partition of the rows (this is a much more restricted case than the one analyzed in our setting). For inexact variants of the randomized block coordinate descent algorithm in different settings than ours we suggest [53, 15, 6, 12].
Finally an analysis of approximate stochastic gradient descent for solving the empirical risk minimization problem using quadratic constraints and sequential semi-definite programs has been presented in .
3 Convergence Results Under General Assumptions
In this section we consider scenarios in which the inexactness error can be controlled, by specifying a per iteration bound on the norm of the error. In particular, by making different assumptions on the bound we derive general convergence rate results. Our focus is on the abstract notion of inexactness described in Section 2.1 and we make no assumptions on how this error is generated.
An important assumption that needs to be hold in all of our results is exactness. A formal presentation is presented below. We state it here and we highlight that is a requirement for all of our convergence results (exactness is also required in the analysis of the exact variants ).
Note that is a convex quadratic, and that whenever . However, can be zero also for points outside of . Clearly, is nonnegative, and for . However, without further assumptions, the set of minimizers of can be larger than . The exactness assumption ensures that this does not happen. For necessary and sufficient conditions for exactness, we refer the reader to . Here it suffices to remark that a sufficient condition for exactness is to require to be positive definite. This is easy to see by observing that . In other words, if is the solution set of the stochastic optimization problem (1) and the solution set of the linear system (2) then the notion of exactness is captured by:
3.1 Assumptions on Inexactness Error
In the convergence analysis of iBasic the following assumptions on the inexactness error are used. We note that Assumptions 3.1, 3.1 and 3.1 are special cases of Assumption 3.1. Moreover Assumption 3.1 is algorithmic dependent and can hold in addition of any of the other four assumptions. In our analysis, depending on the result we aim at, we will require either one of the first four Assumptions to hold by itself, or to hold together with Assumption 3.1. We will always assume exactness.
In all assumptions the expectation on the norm of error () is conditioned on the value of the current iterate and the random matrix . Moreover it is worth to mention that for the convergence analysis we never assume that the inexactness error has zero mean, that is .
where the upper bound
is a sequence of random variables (that can possibly depends on both the value of the current iterateand the choice of the random at the iteration).
The following three assumptions on the sequence of upper bounds are more restricted however as we will later see allow us to obtain stronger and more controlled results.
where the upper bound is a sequence of real numbers.
where the upper bound is a special sequence that depends on a non-negative inexactness parameter and the distance to the optimal value .
where the upper bound is a special sequence that depends on a non-negative inexactness parameter and the value of the stochastic function computed at the iterate .
Finally the next assumption is more algorithmic oriented. It holds in cases where the inexactness error in the update rule is chosen to be orthogonal with respect to the -inner product to the vector
. This statement may seem odd at this point but its usefulness will become more apparent in the next section where inexact algorithms with structured inexactness error will be analyzed. As it turns out, in the case of structured inexactness error (Algorithm2) this assumption is satisfied.
3.2 Convergence Results
In this section we present the analysis of the convergence rates of iBasic by assuming several combination of the previous presented assumptions.
All convergence results are described only in terms of convergence of the iterates , that is , and not the objective function values . This is sufficient, because by (see Lemma 10) we can directly deduce a convergence rate for the function values.
The exact Basic method (Algorithm 1 with ), has been analyzed in  and it was shown to converge with where . Our analysis of iBasic is more general and includes the convergence of the exact Basic method as special case when we assume that the upper bound is . For brevity, in he convergence analysis results of this manuscript we also use
Let us start by presenting the convergence of iBasic when only Assumption 3.1 holds for the inexactness error.
Let assume exactness and let be the iterates produced by iBasic with . Set and consider the error be such that it satisfies Assumption 3.1. Then,
See Appendix B.1. ∎
This means that we obtain a linear convergence rate up to a solution level that is proportional to the upper bound 101010Several similar more specific assumptions can be made for the upper bound . For example if the upper bound satisfies with for all then it can be shown that exist such that inequality (28) of Theorem 1 takes the form: (see [51, 16] for similar results). .
See Appendix B.2. ∎
Inspired from , let us now analyze iBasic using the sequence of upper bounds that described in Assumption 3.1. This construction of the upper bounds allows us to obtain stronger and more controlled results. In particular using the upper bound of Assumption 3.1 the sequence of expected errors converge linearly to the exact (not in a potential neighborhood like the previous result). In addition Assumption 3.1 guarantees that the distance to the optimal solution reduces with the increasing of the number of iterations. However for this stronger convergence a bound for is required, a quantity that in many problems is unknown to the user or intractable to compute. Nevertheless, there are cases that this value has a close form expression and can be computed before hand without any further cost. See for example [27, 30, 26, 21] where methods for solving the average consensus were presented and the value of corresponds to the algebraic connectivity of the network under study.
Assume exactness. Let be the iterates produced by iBasic with . Set and consider the inexactness error be such that it satisfies Assumption 3.1, with . Then
See Appendix B.3. ∎
At Theorem 2, to guarantee linear convergence the inexact parameter should live in the interval . In particular, is the parameter that controls the level of inexactness of Algorithm 1. Not surprisingly the fastest convergence rate is obtained when ; in such case the method becomes equivalent with its exact variant and the convergence rate simplifies to . Note also that similar to the exact case the optimal convergence rate is obtained for .
Moreover, the upper bound of Assumption 3.1 depends on two important quantities, the (through the upper bound of the inexactness parameter ) and the distance to the optimal solution . Thus, it can have natural interpretation. In particular the inexactness error is allowed to be large either when the current iterate is far from the optimal solution ( large) or when the problem is well conditioned and is large. In the opposite scenario, when we have ill conditioned problem or we are already close enough to the optimum we should be more careful and allow less errors to the updates of the method.
In the next theorem we provide the complexity results of iBasic in the case that the Assumption 3.1 is satisfied combined with one of the previous assumptions.
Let assume exactness and let be the iterates produced by iBasic with . Set . Let also assume that the inexactness error be such that it satisfies Assumption 3.1. Then:
See Appendix B.4. ∎
In the case that Assumptions 3.1 and 3.1 hold simultaneously, the convergence of iBasic is similar to (31) but in this case (due to Assumption 3.1, is a sequence of real numbers). In addition, note that for having Assumption 3.1 on top of Assumption 3.1 leads to improvement of the convergence rate. In particular, from Theorem 2, iBasic converges with rate while having both assumptions this is simplified to the faster (32).
4 iBasic with Structured Inexactness Error
Up to this point, the analysis of iBasic was focused in more general abstract cases where the inexactness error of the update rule satisfies several general assumptions. In this section we are focusing on a more structured form of inexactness error and we provide convergence analysis in the case that a linearly convergent algorithm is used for the computation of the expensive key subproblem of the method.
4.1 Linear System in the Update Rule
Using this expression the exact Basic method can be equivalently interpreted as the following two step procedure:
Find the least norm solution111111We are precisely looking for the least norm solution of the linear system because this solution can be written down in a compact way using the Moore-Penrose pseudoinverse. This is equivalent with the expression that appears in our update: . However it can be easily shown that the method will still converge with the same rate of convergence even if we choose any other solution of the linear system . of . That is find where