We consider incremental methods for addressing the following finite sum optimization problem
where each component function is weakly convex (see Definition 1), lower semicontinuous, and lower bounded. We restrict ourselves to the case where the global minimum set of (1.1) is nonempty and closed, and we denote by its minimal function value. The formulation in (1.1) is quite general: training a neural network, phase retrieval, low-rank matrix optimization, etc., all fit within this framework. It is worth mentioning that in (1.1) can be nonconvex and nondifferentiable under the weakly convex setting, covering rich applications in practice; see Section 2.2 for detailed examples.
Incremental methods play an important role in large-scale problems. In this paper, we discuss incremental (sub)-gradient descent1Strictly speaking, when is nondifferentiable, it is more appropriate to call the algorithm the incremental subgradient method, since it is not necessarily a descent method. For convenience, we simply call it incremental subgradient descent, in accordance with the name incremental gradient descent., the incremental proximal point algorithm, and the incremental prox-linear algorithm. At each step, incremental methods update the iterate using only one component function, selected according to a cyclic order, i.e., sequentially from to and repeating this process cyclically. More specifically, in the -th iteration, incremental algorithms start with and then update using with a certain method for all , giving The three incremental algorithms considered in this paper differ from one another in the update of .
Incremental (sub)-gradient descent:
where is any subgradient belonging to the Fréchet subdifferential (see (2.1)).
Incremental proximal point algorithm:
Incremental prox-linear algorithm: We now consider a specific class of weakly convex functions in (1.1) that is of the following composite form,
where each is a (possibly nonsmooth) Lipschitz convex mapping and is a smooth function with Lipschitz continuous Jacobian; see Section 2.2 for a list of concrete examples that fit into this setting. We denote by
the local convex relaxation of at .
The incremental prox-linear algorithm updates as
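To make the cyclic update pattern concrete, here is a minimal Python sketch of incremental (sub)-gradient descent on a toy finite-sum problem. The function names and the toy instance are our own illustration, not from the paper.

```python
import numpy as np

def incremental_subgradient(subgrads, x0, stepsize, n_cycles):
    """Cyclic incremental subgradient descent (a generic sketch).

    subgrads : list of callables; subgrads[i](x) returns a subgradient
               of the i-th component function at x.
    stepsize : callable k -> stepsize used throughout cycle k.
    """
    x = float(x0)
    n = len(subgrads)
    for k in range(n_cycles):
        alpha = stepsize(k)
        # One cycle: visit the components in the fixed order 1, ..., n.
        for i in range(n):
            x = x - alpha * subgrads[i](x)
    return x

# Toy instance: f(x) = (1/3) * sum_i |x - b_i| with b = (0, 1, 2);
# the median b_2 = 1 is the global minimizer.
b = [0.0, 1.0, 2.0]
subgrads = [lambda x, bi=bi: float(np.sign(x - bi)) for bi in b]
x_hat = incremental_subgradient(subgrads, x0=5.0,
                                stepsize=lambda k: 1.0 / (k + 1),
                                n_cycles=200)
```

Replacing the inner subgradient step with a proximal step, or with a prox-linear step on the composite form (1.4), gives sketches of the other two incremental methods.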
A counterpart scheme to incremental methods for addressing (1.1) is the class of stochastic algorithms, which at each iteration sample one component function independently and uniformly from
to update. Such a sampling scheme plays a key role in the analysis of stochastic algorithms. For example, uniform random sampling yields an unbiased estimate of the full (sub)-gradient at each iteration, which makes stochastic algorithms, in expectation, behave like methods using all components. Compared with stochastic algorithms, incremental methods are much easier to implement in large-scale settings, since stochastic algorithms require independent uniform sampling at every iteration. Furthermore, we will show in the experimental section that incremental algorithms usually outperform their stochastic counterparts, which is also observed in practice [3, 4]. There are two possible explanations for this phenomenon: 1) the convergence results of stochastic algorithms are stated in expectation, which can hide large variance, while incremental algorithms have deterministic convergence properties; 2) incremental methods visit each component exactly once per cycle, whereas in SGD some component functions may be selected more often than others in each epoch.
Incremental methods are widely utilized in practice. Compared to nonincremental methods such as subgradient descent, the proximal point method, and the prox-linear algorithm, incremental methods are more suitable for large-scale problems due to their low per-iteration computational cost. For instance, incremental (sub)-gradient descent and its random shuffling version2This is an alternative component selection rule; the resulting algorithm is often called “random shuffling” in the neural network training literature. Instead of choosing sequentially from to for updating, at the start of the -th iteration it draws independently and uniformly one permutation from the set of all possible permutations of , and then sequentially visits each component function indexed by to update.
are broadly employed in practice for large-scale machine learning problems such as training deep neural networks; see, e.g., [3, 4, 5, 6, 7, 8, 9, 10]. Moreover, in the regression and linear system literature, the well-known and extensively studied Kaczmarz method is in fact a special case of incremental (sub)-gradient descent.
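For contrast with the fixed cyclic order, the random shuffling rule of Footnote 2 can be sketched as follows; the generator name is our own illustration.

```python
import numpy as np

def reshuffled_order(n, n_cycles, seed=0):
    """Index stream for the random shuffling rule of Footnote 2:
    each cycle draws a fresh uniform permutation of {0, ..., n-1}
    and visits every component exactly once."""
    rng = np.random.default_rng(seed)
    for _ in range(n_cycles):
        for i in rng.permutation(n):
            yield int(i)

# Three cycles over four components: twelve indices in total, and
# every cycle of length four touches each index exactly once.
order = list(reshuffled_order(n=4, n_cycles=3))
```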
Though incremental algorithms are broadly used, their theoretical underpinnings are far from well understood. The purely deterministic component selection in each iteration makes them challenging to analyze. The main prior achievements in analyzing incremental methods rely on convexity assumptions. To the best of our knowledge, there is no convergence result for incremental (sub)-gradient descent, the incremental proximal point algorithm, or the incremental prox-linear algorithm when the function in (1.1) is nonconvex and nondifferentiable; indeed, no explicit convergence rate is currently known even in the smooth nonconvex setting. Thus, it is fundamentally important to ask:
In this paper, we answer this question affirmatively under the assumption that each is weakly convex. This setting includes a wide class of nonconvex and nondifferentiable concrete examples, some of which are described in Section 2.2.
1.1 Our contribution
We summarize our main contributions as follows:
We extend the convergence theory for incremental algorithms (including incremental (sub)-gradient descent (1.2) and the incremental proximal point algorithm (1.3)) from convex optimization to weakly convex optimization. Moreover, we establish the same convergence result for the incremental prox-linear algorithm under the same weakly convex setting3The stochastic prox-linear algorithm was proposed and analyzed very recently in [11, 12]. However, to our knowledge, the incremental prox-linear algorithm has not appeared in the literature, so introducing this algorithm may itself be one of our contributions. This specific algorithm is often the method of choice in practice due to its high efficiency; see our experiment section.. In particular, with a constant stepsize scheme, we show that all three incremental methods converge at a rate of to a critical point of the weakly convex optimization problem (1.1) (for the incremental prox-linear method, in (1.1) should have the specific composite form (1.4)); see Theorem 1 for the hidden constant. That is, achieving an accurate critical point requires at most iterations for all three algorithms.
Suppose the weakly convex function in (1.1) satisfies an additional error bound property called the sharpness condition (see Definition 2). Then we design a geometrically diminishing stepsize rule that depends only on intrinsic problem parameters, and we show that all three incremental algorithms converge locally linearly to the optimal solution set of (1.1) (again, for the incremental prox-linear method, in (1.1) should have the composite form (1.4)). This is surprising since, compared to the incremental aggregated algorithms [13, 14], which achieve linear convergence for smooth strongly convex optimization, the three incremental methods analyzed in this paper do not use any aggregation technique other than a diminishing stepsize.
It is worth mentioning that our bounds for incremental (sub)-gradient descent also apply to its random shuffling version (see Footnote 2), which is also very popular in the large-scale learning literature. Since our result is deterministic, it can be regarded as a worst-case bound for the random shuffling version of incremental (sub)-gradient descent.
Our argument is simple and transparent. For instance, the key steps in showing the linear convergence of incremental (sub)-gradient descent are to establish the recursion (4.8) and then conduct a tailored analysis of our geometrically diminishing stepsize rule. Establishing the recursion (4.8) is similar to previous work on convex optimization; see, e.g., [15, 9]. Then, with a geometrically diminishing stepsize, the derivation of linear convergence from recursion (4.8) shares the same spirit as prior analyses of full subgradient descent; see, e.g., [16, 17]. Nonetheless, our conclusions are powerful in the sense that incremental methods can achieve a linear rate of convergence for a class of nondifferentiable nonconvex optimization problems. We believe our proof technique for handling weakly convex functions will be useful for the further development of the analysis of other incremental-type algorithms (such as the aggregated versions [14, 13]) for solving weakly convex optimization.
1.2 Related work
Research on incremental (sub)-gradient descent has a long history, but it mainly focuses on convex optimization. Most of these works can be divided into two categories based on differentiability. The earliest work may date back to  for solving linear least squares problems where each is differentiable and convex. Various works [20, 21, 22, 23] then extended incremental gradient descent to learning shallow linear neural networks and other convex smooth problems of the form (1.1), with various choices of stepsize and convergence analyses. A more recent work  shows that for a strongly convex with each component convex, incremental gradient descent with a diminishing stepsize has a convergence rate of in terms of the distance between the iterates and the optimal solution set. We note that a popular variant named incremental aggregated gradient descent [14, 13, 24, 25], which utilizes a single component function at a time as in incremental methods but keeps in memory the most recent gradients of all component functions to approximate the full gradient, has been proved to converge linearly for smooth and strongly convex functions.
When in (1.1) is convex but not differentiable, incremental (sub)-gradient descent was proposed in  for solving nondifferentiable finite-sum optimization problems. Later, Nedic and Bertsekas  provided an asymptotic convergence result for incremental (sub)-gradient descent using either constant or diminishing stepsizes when the function is convex. This result was further improved in , where the authors proved that the algorithm converges at a rate of in terms of the function suboptimality gap using a constant stepsize, and at a linear rate to the optimal solution set using Polyak's dynamic stepsize, which requires knowledge of . Similar asymptotic convergence results were also established in  for incremental -subgradient descent. Besides, it is worth mentioning that Solodov  analyzed the global convergence of incremental gradient descent for smooth nonconvex problems, but only asymptotically, without any explicit rate. We refer the reader to  for a comprehensive review.
The incremental proximal point algorithm was proposed in  for large-scale convex optimization and is proved to converge at a rate of in terms of the function suboptimality gap using a constant stepsize. To our knowledge, the convergence of the incremental proximal point algorithm has not yet been studied for nonconvex problems. As for the incremental prox-linear algorithm, to our knowledge it has not yet been utilized or analyzed in the literature, except for its stochastic counterpart, which was proposed and analyzed very recently in [11, 12].
For a clearer and more transparent comparison, we list the representative historical results together with ours in Table 1.
Since can be nondifferentiable, we utilize tools from generalized differentiation. The (Fréchet) subdifferential of a function at is defined as
where each is called a subgradient of at .
2.1 Function regularity
The following two properties serve as the foundation of our paper.
To begin, we give the definition of weak convexity which characterizes the extent of nonconvexity of a function.
Definition 1 (weak convexity; see, e.g., ).
We say that is weakly convex with parameter if is convex.
Note that the weak convexity of with parameter is equivalent to
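As a concrete illustration of our own (not from the paper), the function f(x) = |x² - 1| is 2-weakly convex, since f(x) + x² = max(2x² - 1, 1) is convex. The sketch below spot-checks the corresponding interpolation inequality on a grid of points.

```python
import numpy as np

# f(x) = |x^2 - 1| is 2-weakly convex: adding (rho/2) x^2 with rho = 2
# gives max(2x^2 - 1, 1), which is convex.  We spot-check the
# equivalent inequality
#   f(t x + (1-t) y) <= t f(x) + (1-t) f(y) + (rho/2) t (1-t) (x - y)^2
# over a grid of x, y, and t.
f = lambda x: abs(x * x - 1.0)
rho = 2.0
ok = True
for x in np.linspace(-2, 2, 41):
    for y in np.linspace(-2, 2, 41):
        for t in np.linspace(0.0, 1.0, 21):
            lhs = f(t * x + (1 - t) * y)
            rhs = (t * f(x) + (1 - t) * f(y)
                   + 0.5 * rho * t * (1 - t) * (x - y) ** 2)
            ok = ok and (lhs <= rhs + 1e-12)
```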
We now introduce sharpness, a regularity condition that characterizes how fast the function grows when is away from the set of global minima.
Definition 2 (sharpness; see, e.g., ).
We say that a mapping is -sharp where if
for all . Here denotes the set of global minimizers of , represents the minimal value of , and is the distance of to , i.e., .
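As an illustration of our own, the function f(x) = |x² - 1| has minimizer set X* = {-1, +1}, minimal value 0, and is sharp with parameter μ = 1, since |x² - 1| = |x - 1| |x + 1| and max(|x - 1|, |x + 1|) ≥ 1 for every real x. A numerical spot-check:

```python
import numpy as np

# f(x) = |x^2 - 1| is 1-sharp: f(x) - f* >= mu * dist(x, X*)
# with f* = 0, X* = {-1, +1}, and mu = 1.
f = lambda x: abs(x * x - 1.0)
dist = lambda x: min(abs(x - 1.0), abs(x + 1.0))  # distance to X*
mu = 1.0
sharp = all(f(x) >= mu * dist(x) - 1e-12 for x in np.linspace(-3, 3, 601))
```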
Informally speaking, the sharpness condition is tailored to nondifferentiable functions. This regularity condition plays a fundamental role in establishing linear and quadratic convergence rates for (sub)-gradient descent and the prox-linear algorithm, respectively [16, 17, 31]. In Section 4, we will show that, under the sharpness condition, the convergence rate of incremental methods improves from sublinear to linear.
2.2 Concrete examples
where is a Lipschitz convex function and is a smooth mapping with Lipschitz continuous Jacobian. We now list several popular nondifferentiable nonconvex optimization problems of the form (2.4).
Robust Matrix Sensing 
Low-rank matrices are ubiquitous in computer vision, machine learning, and data science applications. One fundamental computational task is to recover a positive semidefinite (PSD) low-rank matrix with
from a small number of linear measurements arbitrarily corrupted with outliers
where is a linear measurement operator consisting of a set of sensing matrices and
is a sparse outlier vector. An effective approach to recovering the low-rank matrix is to use a factored representation of the matrix variable  (i.e., with ) and to employ an
-loss function to robustify the solution against outliers:
Direct calculation shows that the weak convexity parameter of each component function in (2.6) is at most . Furthermore, it is shown in  that, under certain statistical assumptions ( has i.i.d. Gaussian ensembles and the fraction of outliers in is less than and ), is exactly the set of global minimizers of (2.5) and the objective function in (2.5) is sharp with parameter , where is a constant depending on the fraction of outliers in .
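Under this factored formulation, the ℓ1-loss objective and a per-component subgradient can be sketched as follows. The names and the planted toy instance are our own illustration, and we assume dense Gaussian sensing matrices for the demo.

```python
import numpy as np

def sensing_objective(U, A_list, y):
    """l1-loss robust matrix sensing objective, a sketch of the form
    f(U) = (1/m) * sum_i |<A_i, U U^T> - y_i|."""
    X = U @ U.T
    return np.mean([abs(np.sum(A * X) - yi) for A, yi in zip(A_list, y)])

def component_subgrad(U, A, yi):
    """A subgradient of U -> |<A, U U^T> - yi| via the chain rule:
    sign(residual) * (A + A^T) @ U."""
    r = np.sum(A * (U @ U.T)) - yi
    return np.sign(r) * (A + A.T) @ U

# Planted instance: with clean (outlier-free) measurements, the
# ground-truth factor attains objective value 0.
rng = np.random.default_rng(0)
n, r, m = 5, 2, 30
U_star = rng.standard_normal((n, r))
A_list = [rng.standard_normal((n, n)) for _ in range(m)]
y = np.array([np.sum(A * (U_star @ U_star.T)) for A in A_list])
```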
Robust phase retrieval aims to recover a signal from its magnitude-wise measurements arbitrarily corrupted with outliers:
where the operator in (2.7) means coordinate-wise taking modulus and then squaring. Here, the matrix is a measurement matrix and is sparse outliers.  formulated the following problem for recovering both the sign and magnitude information of :
It is straightforward to verify that each component function in (2.8) is weakly convex with parameter . When the vector obeys an i.i.d. Gaussian distribution and the fraction of outliers is no more than and , it is proved in  that is exactly the set of minimizers of (2.8) and the objective function in (2.8) is sharp with parameter , where is a constant depending on the outlier ratio.
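A minimal sketch of this objective, with an illustrative planted instance of our own; note that on clean measurements both the planted signal and its negation attain the value zero, reflecting the unrecoverable global sign.

```python
import numpy as np

def phase_retrieval_objective(x, A, y):
    """Robust phase retrieval objective, a sketch of the form
    f(x) = (1/m) * sum_i |(a_i^T x)^2 - y_i|, where the rows a_i
    of A are the measurement vectors."""
    return np.mean(np.abs((A @ x) ** 2 - y))

# Clean measurements from a planted signal.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
x_true = rng.standard_normal(5)
y = (A @ x_true) ** 2
```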
Robust Blind Deconvolution 
In the image processing community, blind deconvolution is a central technique for recovering real images from convolved measurements. Mathematically, the task is to recover a ground-truth pair from their convolution corrupted by outliers:
where are measurement operators and contains outliers. Robust blind deconvolution amounts to addressing
Similarly, the above objective function is sharp (with a sharpness parameter closely related to the energy of the ground-truth signal and the outlier ratio) when are drawn i.i.d. from a standard Gaussian distribution. Again, it is easy to verify that each component function of the objective in (2.10) is weakly convex with parameter .
The task of robust PCA is to separate a low-rank matrix with and a sparse matrix from their sum . Robust PCA has been successfully applied in many real applications, such as background extraction from surveillance video  and image specularity removal . Considering the simple case where is PSD with dimension , and using the same factorization approach as in (2.6), we estimate the low-rank matrix by solving
The weak convexity parameter of each component function in the above is . However, whether the objective function in (2.11) has sharpness property remains open.
3 Convergence Result for Weakly Convex Optimization
In this section, we study the convergence behavior of incremental (sub)-gradient descent, incremental proximal point method, and incremental prox-linear algorithm with constant stepsize under the assumption that each component function is weakly convex.
3.1 Assumptions and Moreau envelope
We make the following assumptions throughout this section. Note that the sharpness condition is not assumed in this section.
(bounded subgradients): For any , there exists a constant , such that , for all and bounded .
(weak convexity) Each component function in (1.1) is weakly convex with parameter . Set .
Assumption 1 (A1) is standard in analyzing incremental and stochastic algorithms; see, e.g., [18, 9, 15, 12, 41]. For the concrete applications listed in Section 2.2, Assumption 1 (A1) is satisfied on any bounded subset. This assumption concerns the Lipschitz continuity of . To see this, we refer the reader to [42, Theorem 9.1], which establishes that the function is Lipschitz continuous with parameter whenever Assumption 1 (A1) holds and is finite.
The incremental prox-linear algorithm (1.6) is designed for solving the weakly convex optimization problems (1.1) where has the composite form (1.4). Thus, instead of using Assumption 1, we make the following tailored assumptions for the incremental prox-linear method. The parameter notation remains the same as in Assumption 1, which we explain immediately after the statement of Assumption 2.
(bounded subgradient) For any , there exists a constant , such that for all and bounded .
(quadratic approximation) There exists a constant such that each component function in (1.1) satisfies
A similar assumption has been used in  for analyzing the nonincremental prox-linear method. Note that Assumption 2 is similar to Assumption 1. For the concrete examples listed in Section 2.2, the bounded subgradient parameter in Assumption 2 (A3) coincides with that of Assumption 1 (A1) when is evaluated at . More generally, the two constants in Assumption 2 (A3) and Assumption 1 (A1) coincide when in (1.4) is the norm or the max function. Furthermore, Assumption 2 (A4) is slightly stronger than Assumption 1 (A2), since it implies the weak convexity of through a direct calculation utilizing the convexity of . It can be verified that the examples listed in Section 2.2 satisfy Assumption 2 (A4), with coinciding with the weak convexity parameter. Thus, to avoid overloading notation, we keep the same notation as in Assumption 1.
For weakly convex optimization, checking the norm of the subgradient is in general not an appropriate way to characterize optimality, due to nondifferentiability. The seminal paper  found a reliable surrogate optimality measure for weakly convex optimization, where the problem can be nondifferentiable and even nonconvex. For completeness, we briefly introduce the related notions of the Moreau envelope and the proximal mapping.
(see [42, Definition 1.22]) For any , the Moreau envelope of is defined as
The corresponding proximal mapping is defined as
The Moreau envelope approximates from below (i.e., ) and the approximation error is controlled by the penalty parameter . More importantly, if the original function is -weakly convex, then the Moreau envelope is smooth for any with  [44, Theorem 3.4], even when is nondifferentiable.
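As a classical illustration of our own (not from the paper): the Moreau envelope of the nondifferentiable function |x| is the smooth Huber function. The following sketch checks this by approximating the envelope with a grid minimization and comparing it with the closed form.

```python
import numpy as np

# The Moreau envelope of f(x) = |x| with parameter lam is the Huber
# function:
#   env(x) = x^2 / (2*lam)  if |x| <= lam,   |x| - lam/2  otherwise,
# which is smooth even though |x| is not.
lam = 0.5
grid = np.linspace(-4.0, 4.0, 80001)  # fine grid over the y variable

def moreau_env(x):
    # env_lam f(x) = min_y { |y| + (y - x)^2 / (2*lam) }, approximated
    # by minimizing over the grid.
    return float(np.min(np.abs(grid) + (grid - x) ** 2 / (2.0 * lam)))

def huber(x):
    return x * x / (2.0 * lam) if abs(x) <= lam else abs(x) - lam / 2.0
```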
The following result quantifies how close is to a critical point of when is small:
where . Intuitively, (3.3) implies that if is small, then is close to , which is nearly stationary since is small.
In this section, we will utilize the gradient of the Moreau envelope of the weakly convex function in (1.1) as surrogate optimality measure to analyze the convergence rate of the three incremental methods. For any and , we denote by
3.2 Sublinear convergence result
This result can be proved using [42, Exercise 8.8] and the first-order optimality condition of . The relations (3.6) and (3.7) have been commonly utilized in the analysis of proximal-type algorithms [9, 45]. It is interesting to see that (3.6) and (3.7) are very similar to the updates in incremental (sub)-gradient descent. The only difference is that the subgradients in (3.6) and (3.7) are evaluated at , while in incremental (sub)-gradient descent they are evaluated at . This observation indicates that we can follow a unified proof strategy for all three incremental methods with slight modifications. Therefore, we present the results for all three incremental methods together, i.e., in Theorems 1 and 2.
In the following theorem, we give the global sublinear convergence rate of the three incremental methods for solving general weakly convex optimization problems4The sublinear convergence results in this section remain valid when (1.1) is further constrained to a closed convex set. The definitions of the weak convexity parameter and the upper bound on the norm of the subgradients over a bounded subset can be found in Assumptions 1 and 2.
Suppose that Assumption 1 holds for incremental (sub)-gradient descent (1.2) and the incremental proximal point algorithm (1.3), and that Assumption 2 holds for the incremental prox-linear method (1.6). Suppose further that the stepsize satisfies for all , where the integer is the total number of iterations. Then for any , if the sequence is generated by one of the three incremental methods for solving (1.1), we have
where is defined in (3.4).
Proof of Theorem 1.
The proofs for the incremental proximal point and incremental prox-linear methods are very similar to that for incremental (sub)-gradient descent. Thus, we first present the complete proof for incremental (sub)-gradient descent and then point out the necessary modifications for the other two incremental proximal methods.
Part I: Proof of incremental (sub)-gradient descent. Let be generated by the incremental (sub)-gradient descent. From the optimality of in (3.5) and the update in incremental (sub)-gradient descent, we have the following inequality
Summing up the inequality in (3.9) for from to gives
where in the last inequality we have used Assumption 1 (A1).
Due to , (3.10) reduces to
which together with (3.11) gives
We now provide a lower bound for and an upper bound for . For , we have
where the first inequality utilizes Assumption 1 (A1) and [42, Theorem 9.1], which implies that is Lipschitz continuous with parameter , and the last inequality follows from [42, Proposition 12.19], which states that for any weakly convex function with parameter , is Lipschitz continuous with constant if .
According to the update in incremental (sub)-gradient descent, we have , which together with Assumption 1 (A1) yields
where the inequality follows directly from the definition of the Moreau envelope of at . Hence, plugging the above inequality into (3.17) and rearranging terms yields the following recursion:
Finally, dividing both sides of (3.19) by yields