There are many applications in scientific research and engineering involving the following low-rank stochastic convex semidefinite optimization problem:
where is some convex, smooth cost function associated with the -th sample, is the positive semidefinite (PSD) constraint, and represents the set of real symmetric matrices of size . In many related applications, certain low-rank assumption is usually imposed on . Typical applications include the non-metric multidimensional scaling (Agarwal2007; Borg-NMDS2005), matrix sensing (Sanghavi2013; Recht-2016), phase retrieval (Candes-ACHA2015), synchronization (Bandeira-Synchronization2016), and community detection (Montanari-CommunityDetection2016).
The classical algorithms for solving problem (1) mainly include first-order methods such as the well-known projected gradient descent method (Nestrov-1989), the interior point method (Alizadeh-IPM1995), and more specialized path-following interior point methods (for more detail, see the survey paper (Monteiro-SDP2003) and references therein). However, most of these methods do not scale well due to the PSD constraint, i.e., . To overcome the hurdle of the PSD constraint, the idea of low-rank factorization was adopted in the literature (Monteiro2003; Monteiro2005) and became very popular in the past five years due to its empirical success (Sanghavi-FGD2016; Jin2016; Ma2018; Ma2019; Wang2017-ICML; Zeng-svrg2018; Zeng-SGD2019).
Since the PSD constraint has been eliminated, the recast problem (2) has a significant advantage over (1), but this benefit has a corresponding cost: the objective function is no longer convex but instead nonconvex in general. This brings a great challenge to the analysis. Even for simple first-order methods like factored gradient descent (FGD) and gradient descent (GD), local linear convergence remained unestablished until the recent works (Sanghavi-FGD2016; Wang2017-GD), respectively. Moreover, facing the challenge of large scale applications with a big , stochastic algorithms (Robbins-SGD1951) have been widely adopted, that is, at each iteration, we only use the gradient information of one sample or a small batch of samples instead of the full gradient over samples. However, due to the variance of such stochastic gradients, the stochastic gradient descent (SGD) method either converges to an neighborhood of a global optimum at a linear rate when adopting a fixed step size , or has a local sublinear convergence rate when adopting a diminishing step size, in the restricted strongly convex (RSC) case (Zeng-SGD2019).
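To make the factored reformulation concrete, the following is a minimal sketch of one FGD step on a toy objective. It assumes the standard factorization X = U U^T, so that the chain rule gives the gradient with respect to U as (G + G^T) U with G the gradient of f at U U^T. The names `grad_factored` and `fgd_step`, the toy loss, and all parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def grad_factored(grad_f, U):
    """Gradient of g(U) = f(U U^T) with respect to U (chain rule)."""
    G = grad_f(U @ U.T)
    return (G + G.T) @ U  # equals 2 * G @ U when G is symmetric

def fgd_step(grad_f, U, eta):
    """One factored gradient descent step with step size eta."""
    return U - eta * grad_factored(grad_f, U)

# Toy problem (assumption): f(X) = 0.5 * ||X - M||_F^2 with M PSD of rank 2.
rng = np.random.default_rng(0)
U_star = rng.standard_normal((6, 2))
M = U_star @ U_star.T
f = lambda X: 0.5 * np.linalg.norm(X - M) ** 2
grad_f = lambda X: X - M

# Start near a global optimum and run a few hundred FGD steps.
U = U_star + 0.1 * rng.standard_normal((6, 2))
loss_before = f(U @ U.T)
for _ in range(200):
    U = fgd_step(grad_f, U, eta=0.01)
loss_after = f(U @ U.T)
```

No PSD projection is needed here because U U^T is PSD by construction; the price is that the objective in U is nonconvex.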
To accelerate SGD, various variance reduction techniques have been proposed in the literature, e.g., the stochastic variance reduced gradient (SVRG) method in (Johnson-Zhang-svrg2013) and the stochastic average gradient (SAG) method in (Bach-SAG), which recover linear convergence for strongly convex problems in Euclidean space. One distinction of the SVRG proposed by Johnson-Zhang-svrg2013 is that it does not need to store the gradients over the training sample set; instead it involves two loops of iterations for updating full-sample gradients and stochastic variations. Its algorithmic implementations give rise to two choices: Option I, where one naturally sets the output of the inner loop as its last iterate; and Option II, where one selects the output of the inner loop uniformly at random from all the iterates in the inner loop, which is more expensive in memory because the inner loop iterates must be recorded. Although the paper (Johnson-Zhang-svrg2013) established a linear convergence guarantee for Option II, it leaves open the convergence of Option I, even though Option I is the more natural and efficient scheme in practice. This gap was later filled by (Tan-Ma-bbsvrg2016) with the introduction of the Barzilai-Borwein (BB) step size (BB-stepsize1988).
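The difference between the two options can be sketched in Euclidean space as follows. This is only an illustration on a noiseless least-squares problem; the function name `svrg`, the problem sizes, and the step size are assumptions chosen for the example, and Option II is implemented in the straightforward way that records the inner-loop iterates.

```python
import numpy as np

def svrg(grad_i, n, x0, eta, m, epochs, option="I", seed=0):
    """SVRG sketch: outer loop recomputes a full gradient at the snapshot,
    inner loop takes m variance-reduced stochastic steps."""
    rng = np.random.default_rng(seed)
    x_tilde = x0.copy()
    for _ in range(epochs):
        full_grad = sum(grad_i(x_tilde, i) for i in range(n)) / n
        x = x_tilde.copy()
        history = []  # only needed for Option II
        for _ in range(m):
            if option == "II":
                history.append(x.copy())
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(x_tilde, i) + full_grad
            x = x - eta * v
        if option == "I":
            x_tilde = x  # last inner iterate
        else:
            x_tilde = history[rng.integers(m)]  # uniformly random inner iterate
    return x_tilde

# Toy least-squares problem (assumption): f_i(x) = 0.5 * (a_i^T x - b_i)^2.
rng = np.random.default_rng(1)
n, d = 50, 5
A = rng.standard_normal((n, d))
x_true = rng.standard_normal(d)
b = A @ x_true
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])

x0 = np.zeros(d)
err_0 = np.linalg.norm(x0 - x_true)
err_I = np.linalg.norm(svrg(grad_i, n, x0, 0.01, 100, 30, "I") - x_true)
err_II = np.linalg.norm(svrg(grad_i, n, x0, 0.01, 100, 30, "II") - x_true)
```

Option I keeps a single iterate in memory, whereas this version of Option II keeps all m inner iterates until the snapshot is drawn.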
When adapted to the nonconvex matrix problem, linear convergence results are only known for Option II. Zhang2017-SVRG studied SVRG with Option II for the matrix sensing problem with the square loss and established its linear submanifold convergence (i.e., exponentially fast convergence to a submanifold given by the orbit of a global optimum under the orthogonal group action) under a certain restricted isometry property (RIP). This work was generalized by Wang2017-ICML with a variant of SVRG with Option II called Stochastic Variance-Reduced Gradient for Low-Rank Matrix Recovery (dubbed SVRG-LR for short) for the low-rank matrix recovery problem with general loss and the Restricted Strongly Convex (RSC) condition.
Yet, for Option I scheme in nonconvex low-rank matrix recovery, convergence theories remain largely open. In fact, Option I is more natural and usually enjoys a faster empirical loss reduction than Option II used in SVRG (see Johnson-Zhang-svrg2013; Tan-Ma-bbsvrg2016; Sebbouh-Bach-svrg-opt1-19 for example in Euclidean space and also Figure 1 when adapted to the matrix sensing case considered in our later numerical experiments in Section 5). Moreover, the memory storage required in the implementation of SVRG with Option I (called SVRG-I henceforth) is generally much less than that of SVRG with Option II (called SVRG-II henceforth), since it only needs to store one iterate of the inner loop when implementing SVRG-I, instead of all iterates of the inner loop required in the implementation of SVRG-II. Therefore, it is desired to establish the convergence of Option I and its variants due to these advantages.
Particularly when the matrix variable is positive semi-definite, several recent studies provided partial answers to the convergence problem of Option I. First, Ma2018; Ma2019 adapted the original SVRG with Option I to the application of ordinal embedding (Agarwal2007) based on the low-rank factorization reformulation. In these works, a generalization of the well-known Barzilai-Borwein (BB) step size (BB-stepsize1988) called the stabilized Barzilai-Borwein (SBB) step size was also introduced for SVRG to alleviate the challenge of tuning the step size parameter, where the BB step size is a special case of SBB step size. Under certain smoothness assumption, the convergence to a critical point of the SVRG with Option I and SBB step size was firstly established in Ma2018; Ma2019; then its local linear point convergence (i.e., exponentially fast convergence to a global minimum in Euclidean distance starting from a neighborhood of this global minimum) was further established in Zeng-svrg2018 under the RSC condition. However, the submanifold convergence of SVRG with Option I is still left open, an important problem as a global optimum is invariant under the orthogonal group action.
In this paper, we fill this gap by providing a global linear submanifold convergence theory of the Stochastic Variance-Reduced Gradient method for Semi-definite optimization (2), using a variant of Option I, under general loss and RSC with the fixed step size and general stabilized Barzilai-Borwein step size. To achieve this, motivated by Wang2017-universal-SVRG, we adopt the following new semi-stochastic gradient (i.e., ) in place of the original one in the inner loop of SVRG (i.e., , called SVRG-I in this paper); see Algorithm 1 for detail. Thus, our algorithm is called Stochastic Variance-Reduced Gradient for SDP (dubbed SVRG-SDP for short). In contrast to SVRG-LR, we adopt Option I instead of Option II by selecting the output of the inner loop as the last iterate (), as well as the mini-batch strategy. Under the regular Lipschitz gradient and restricted strongly convex assumptions, we first establish a local linear submanifold convergence of the proposed SVRG method (see Theorem 3.2), where the radius of the initial ball permitted to guarantee the linear submanifold convergence is larger than existing results established in the literature (Sanghavi-FGD2016; Zhang2017-SVRG; Wang2017-universal-SVRG; Zeng-svrg2018; Zeng-SGD2019), as listed in Table 1. Then, we boost such local linear submanifold convergence to global linear submanifold convergence (see Theorem 3.3), that is, exponentially fast convergence to a submanifold of the orthogonal group action orbit of a global minimum starting from an initial point that is not necessarily close to the global minimum submanifold, with an appropriate initialization scheme based on the projected gradient descent method (see Algorithm 2). Moreover, we establish the same global linear submanifold convergence results for the proposed SVRG methods equipped with two practical step size schemes, i.e., the fixed step size and a certain stabilized Barzilai-Borwein step size introduced in Ma2018; Ma2019.
A series of experiments in matrix sensing are conducted to show the effectiveness of the proposed method. The numerical results show that the proposed method performs similarly to original SVRG with Option I yet with provable convergence guarantee, and works remarkably better than their counterparts with Option II, i.e., SVRG-LR and SVRG-II, as well as the factored gradient descent and stochastic gradient descent methods. The effects of the mini-batch size, update frequency of the inner loop, and step size schemes are also discussed and studied in both theory and experiment.
In contrast to the existing analysis for SVRG with Option II, the main difficulty in dealing with Option I is to yield the contraction between two successive iterates in the outer loop, since the iterates in the inner loop may change largely during the update of Option I. Such possibly dramatic changes among iterates in the inner loop can be alleviated in Option II where the output of the inner loop is a uniformly random sample of iterates, by taking expectation or average over the iterates of the inner loop in convergence analysis. Hence it is relatively easy to tackle the case of Option II in analysis. However when we switch to Option I where the output of the inner loop is set as the last iterate, one has to carefully develop some new technique to handle the change in the inner loop to avoid being out of control.
Our key development in the convergence analysis is a novel second-order descent lemma (see Lemma 4) for SVRG-SDP about the progress made by a single iterate of the inner loop, characterized by the submanifold metric (i.e., , where is a global minimum with rank , is the set of orthogonal matrices of size ). This improves the second-order descent lemma in (Zeng-svrg2018, Lemma 1) with Euclidean metric (i.e., ) and leads to the desired linear submanifold convergence of SVRG-SDP.
The rest of this paper is organized as follows. In Section 2, we introduce the proposed SVRG method, together with an initialization scheme and some different step size strategies. In Section 3, we establish the linear submanifold convergence of the proposed method. In Section 4, we present a key lemma as well as the proof of our main theorem. A series of experiments is provided in Section 5 to demonstrate the effectiveness of the proposed method as well as the effects of algorithmic parameters. We conclude this paper in Section 6. Some of the proofs are presented in the Appendix.
For any two matrices , their inner product is defined as . For any matrix , and denote its Frobenius and spectral norms, respectively, and and denote the smallest and largest strictly positive singular values of , denote , with a slight abuse of notation, we also use .
denotes the identity matrix with the size. We will omit the subscript of if there is no confusion in the context.
2 A Stochastic Variance Reduced Gradient Scheme for SDP
The SVRG method was first proposed by Johnson-Zhang-svrg2013 for minimizing a finite sum of convex functions with a vector argument. The main idea of SVRG is to adopt the variance reduction technique to accelerate SGD and achieve a faster convergence rate. SVRG is an inner-outer loop based method. The main purpose of the inner loop is to reduce the variance introduced by the stochastic sampling, and thus accelerate the convergence of the outer loop iterates. In this section, we propose a new version of SVRG with Option I adapted to semidefinite programming which enjoys provable convergence guarantees.
2.1 SVRG-SDP with Option I: Algorithm 1
Inspired by Ma2018; Ma2019 and Wang2017-ICML, we propose a new variant of SVRG method with Option I to solve the stochastic SDP problem (1) based on its nonconvex reformulation (2), as described in Algorithm 1. Compared to SVRG-I suggested in (Ma2018; Ma2019) for problem (2), a new semi-stochastic gradient
is exploited in the inner loop to replace the original for the variance reduced estimate of the current gradient, where is the estimate at the -th outer loop, is the associated factorization of (i.e., ), and is the -th iterate in the inner loop. Intuitively, due to the use of the latest iterate in the inner loop, the new semi-stochastic gradient should be more accurate than the old one that mixed and , resulting in a possible better performance than SVRG-I as demonstrated by our later numerical experiments. More importantly, the use of such a new semi-stochastic gradient is essential for deriving the linear convergence in the submanifold metric , as shown in the following convergence analysis. Besides the Option I such that , we also adopt the mini-batch strategy to SVRG-SDP, which (though obtainable) is missing in (Wang2017-universal-SVRG; Zeng-svrg2018).
When or , SVRG-SDP reduces to the known factored gradient descent (FGD) method studied in Sanghavi-FGD2016. This shows that our proposed algorithmic framework provides more flexible choices for the users.
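The structure described above can be sketched as follows. Since the exact formula of the new semi-stochastic gradient is not reproduced in this text, the sketch encodes one plausible reading of the description: the variance-reduced estimate of the gradient in X is formed first and is then multiplied entirely by the current inner iterate, rather than mixing the current factor and the snapshot factor. The function name `svrg_sdp_sketch`, the matrix sensing toy problem, the absorption of constant factors into the step size, and all parameter values are assumptions for illustration and should not be read as Algorithm 1 itself.

```python
import numpy as np

rng = np.random.default_rng(0)
p, r, n = 8, 2, 200

# Toy matrix sensing data (assumption): y_k = <A_k, X_star>, A_k symmetric.
B = rng.standard_normal((n, p, p))
A = (B + B.transpose(0, 2, 1)) / 2
U_star, _ = np.linalg.qr(rng.standard_normal((p, r)))
X_star = U_star @ U_star.T
y = np.einsum("kij,ij->k", A, X_star)

def batch_grad(X, idx):
    """Average over idx of the gradients of f_k(X) = 0.5 * (<A_k, X> - y_k)^2."""
    res = np.einsum("kij,ij->k", A[idx], X) - y[idx]
    return np.einsum("k,kij->ij", res, A[idx]) / len(idx)

def svrg_sdp_sketch(U0, eta, m, b, epochs, seed=1):
    gen = np.random.default_rng(seed)
    U_tilde = U0.copy()
    for _ in range(epochs):
        X_tilde = U_tilde @ U_tilde.T
        full_grad = batch_grad(X_tilde, np.arange(n))
        U = U_tilde.copy()
        for _ in range(m):
            idx = gen.choice(n, size=b, replace=False)  # mini-batch
            X = U @ U.T
            # Variance-reduced estimate of the gradient in X ...
            G = batch_grad(X, idx) - batch_grad(X_tilde, idx) + full_grad
            # ... applied to the current factor only (assumed reading).
            U = U - eta * G @ U
        U_tilde = U  # Option I: keep the last inner iterate
    return U_tilde

U0 = U_star + 0.1 * rng.standard_normal((p, r))
err_init = np.linalg.norm(U0 @ U0.T - X_star)
U_hat = svrg_sdp_sketch(U0, eta=0.02, m=50, b=20, epochs=40)
err_final = np.linalg.norm(U_hat @ U_hat.T - X_star)
```

In this sketch, setting the mini-batch size `b` equal to `n` makes every inner step use the full gradient, so the loop becomes plain factored gradient descent.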
2.2 Initialization Procedure: Algorithm 2
Note that the recast problem (2) is no longer convex, thus the choice of initialization is very important in the implementations of these algorithms for low-rank matrix estimation, as demonstrated in (Sanghavi-FGD2016; Wang2017-universal-SVRG; Wang2017-GD; Zhang2017-SVRG; Ma2018; Ma2019; Zeng-svrg2018; Zeng-SGD2019). One commonly used strategy is to construct the initialization directly from the observed data, as in the applications of matrix sensing, matrix completion and phase retrieval (say, Candes-TIT2015; Sanghavi2013; Netrapalli-2013; Zheng-Lafferty2015). Such a strategy is generally effective when the objective function has a small “condition number”, while for the general objective functions considered in this paper, another common strategy is to use one of the standard convex algorithms (say, projected gradient descent (ProjGD) (Nestrov-1989)). Some specific implementations of this idea have been used in (Recht-2016; Sanghavi-FGD2016; Wang2017-universal-SVRG; Wang2017-GD; Zeng-SGD2019).
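A ProjGD-based initialization of this kind can be sketched as below. The details of Algorithm 2 are not shown in this text, so the sketch makes its own assumptions: it starts from the zero matrix, projects onto the PSD cone by clipping negative eigenvalues, and extracts a rank-r factor from the leading eigenpairs at the end. The names `project_psd`, `top_r_factor`, and `projgd_init`, as well as the toy objective and parameters, are hypothetical.

```python
import numpy as np

def project_psd(X):
    """Euclidean projection of a symmetric matrix onto the PSD cone."""
    w, V = np.linalg.eigh((X + X.T) / 2)
    return (V * np.clip(w, 0.0, None)) @ V.T

def top_r_factor(X, r):
    """Factor U with U U^T equal to the best rank-r PSD approximation of X."""
    w, V = np.linalg.eigh((X + X.T) / 2)
    idx = np.argsort(w)[::-1][:r]
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

def projgd_init(grad_f, p, r, eta, T):
    """Run T projected gradient steps from the origin, then factorize."""
    X = np.zeros((p, p))
    for _ in range(T):
        X = project_psd(X - eta * grad_f(X))
    return top_r_factor(X, r)

# Toy objective (assumption): f(X) = 0.5 * ||X - M||_F^2 with M PSD of rank 2.
rng = np.random.default_rng(0)
U_star = rng.standard_normal((6, 2))
M = U_star @ U_star.T
U0 = projgd_init(lambda X: X - M, p=6, r=2, eta=0.5, T=30)

S = rng.standard_normal((6, 6))
min_eig = np.linalg.eigvalsh(project_psd(S)).min()
```

The eigendecomposition in each projection is the step that limits scalability, which is why ProjGD is used here only to reach a suitable starting point.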
2.3 Fixed and Stabilized Barzilai-Borwein Step Sizes
Besides the issue of the initial choice, another important implementation issue of SVRG is the tuning of the step size. There are mainly two classes of step sizes: deterministic or data adaptive. Here we discuss three particular choices.
Fixed step size (Johnson-Zhang-svrg2013):
Barzilai-Borwein (BB) step size (BB-stepsize1988; Tan-Ma-bbsvrg2016): given an initial and for , let ,
Note that such a BB step size was originally studied for strongly convex objective functions (Tan-Ma-bbsvrg2016), and it may break down if there is no guarantee on the curvature of , as in nonconvex cases (Ma2018). In order to avoid such possible instability of (5) in our studies, a variant of the BB step size, called the stabilized BB step size, was suggested by Ma2018; Ma2019, as follows.
Stabilized BB (SBB) step size (Ma2018; Ma2019): given an initial and an , for ,
Note that the BB step size is a special case of SBB step size with , thus we call the BB step size as SBB.
Besides the above three step sizes, there are some other schemes like the diminishing step size and the use of a smoothing technique in the BB step size, as discussed in Tan-Ma-bbsvrg2016. However, we mainly focus on the two listed step sizes (as BB is a special case of SBB) in this paper due to their established effectiveness in wide applications.
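As an illustration of the stabilized step size, the sketch below computes a step from two successive outer-loop iterates and their full gradients. The exact formulas (5) and (6) are not reproduced in this text, so the form used here is an assumption based on the usual BB construction for SVRG: the squared iterate difference divided by m times the absolute inner product of iterate and gradient differences, with an extra eps-weighted term in the denominator for stability. The function name `sbb_step_size` and the fallback value are hypothetical.

```python
import numpy as np

def sbb_step_size(U_cur, U_prev, g_cur, g_prev, m, eps, fallback=1e-3):
    """Stabilized BB step size (assumed form); eps = 0 gives a BB-type step."""
    dU = U_cur - U_prev
    dg = g_cur - g_prev
    num = np.sum(dU * dU)
    den = m * (abs(np.sum(dU * dg)) + eps * num)
    if den == 0.0:
        return fallback  # iterates coincide: no curvature information
    return num / den

# Sanity check on a quadratic with gradient g(U) = a * U (assumption).
rng = np.random.default_rng(0)
a, m, eps = 2.0, 10, 0.5
U_prev = rng.standard_normal((5, 2))
U_cur = rng.standard_normal((5, 2))
eta = sbb_step_size(U_cur, U_prev, a * U_cur, a * U_prev, m, eps)
```

For this quadratic the step equals 1 / (m * (a + eps)), so the eps term caps the step at 1 / (m * eps) even when the measured curvature is close to zero.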
3 Convergence Theory of SVRG-SDP
In this section, we present the linear submanifold convergence of the proposed SVRG-SDP.
3.1 Assumptions and Convergence Metric
To present our main convergence results, we need the following assumptions.
each () is -Lipschitz differentiable for some constant , i.e., is smooth and is Lipschitz continuous satisfying
is -restricted strongly convex (RSC) for some constants and , i.e., for any with rank ,
The above assumptions are regular and commonly used in the literature (see, Sanghavi-FGD2016; Wang2017-universal-SVRG; Wang2017-ICML; Zeng-SGD2019). Assumption 1(a) implies that is also -Lipschitz differentiable. For any -Lipschitz differentiable and -restricted strongly convex function , the following hold (Nestrov-2004):
where the first inequality holds for any , and the second inequality holds for any with rank , the first inequality and the right-hand side of the second inequality hold for the Lipschitz continuity of , and the left-hand side of the second inequality is due to the -restricted strong convexity of . Under Assumption 1, let
where is generally called the condition number of the objective function.
In order to characterize the submanifold convergence of SVRG-SDP, we use the following orthogonally invariant metric to measure the gap between and ,
where is the set of orthogonal matrices of size . Such metric has been widely used in the convergence analysis of low-rank factorization based algorithms in the literature (e.g., Sanghavi-FGD2016; Recht-2016; Wang2017-universal-SVRG; Wang2017-ICML; Zeng-SGD2019). Compared to the Euclidean metric used in the convergence analysis of SVRG-I in Zeng-svrg2018, such an orthogonally invariant metric is more desired since a global minimum of the low-rank stochastic semidefinite programming problem (1) is naturally invariant in the loss after an orthogonal transform on its factorizations.
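An orthogonally invariant distance of this type can be computed in closed form. The sketch below assumes the metric is the minimum over orthogonal R of the Frobenius norm of U minus U_star R, which is the classical orthogonal Procrustes problem and is solved by one small SVD; the function name `dist_orth` and the test matrices are illustrative.

```python
import numpy as np

def dist_orth(U, U_star):
    """min over orthogonal R of ||U - U_star @ R||_F (orthogonal Procrustes)."""
    W, _, Vt = np.linalg.svd(U_star.T @ U)
    R = W @ Vt  # optimal rotation aligning U_star with U
    return np.linalg.norm(U - U_star @ R)

rng = np.random.default_rng(0)
U_star = rng.standard_normal((7, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # random orthogonal matrix
U_rot = U_star @ Q                                # same X = U U^T as U_star
U_pert = U_star + 0.1 * rng.standard_normal((7, 3))

d_rot = dist_orth(U_rot, U_star)
d_pert = dist_orth(U_pert, U_star)
```

Here `U_rot` and `U_star` represent the same matrix U U^T, so their invariant distance is zero up to rounding while their plain Euclidean distance is not; this is the motivation for measuring convergence to the orbit rather than to a single factor.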
3.2 Local Linear Submanifold Convergence
Let be a global optimum of problem (1) with rank , and be a rank- decomposition of (i.e., ). Let , and we define the following constants:
Let be a sequence satisfying for some mini-batch size satisfying
where . Given a positive integer , define
where the second inequality holds by the definition of (implying ). The above inequality shows that . Based on the above defined constants, we present our main theorem on the local linear submanifold convergence of the proposed SVRG-SDP as follows, whose proof is postponed to Section 4.
[Local linear submanifold convergence] Let be a sequence generated by Algorithm 1. Suppose that Assumption 1 holds with . Let , satisfy (10), and for . If the initialization satisfies , then for any positive integer , there holds
Particularly, if a fixed step size is used, then the above inequality implies the following linear convergence,
Theorem 3.2 establishes the local linear submanifold convergence of the proposed SVRG method for problem (2) under the smoothness and RSC assumptions. According to Theorem 3.2, the radius of the initial ball permitted to guarantee the linear convergence is , which is slightly better than the existing results as presented in Table 1 and in some sense tight by (Zeng-SGD2019, Proposition 2), where a counter example is provided such that FGD cannot converge to the global optimum once the initialization radius is not smaller than . In the following, we provide some detailed comparisons between them.
Under the same smoothness and RSC assumptions, a similar local linear submanifold convergence of SVRG-LR was established in (Wang2017-universal-SVRG, Theorem 4.7) from both statistical and optimization perspectives. We provide some remarks on comparisons between these two results in the following. First, according to (Wang2017-universal-SVRG, Theorem 4.7) and the discussion in (Wang2017-universal-SVRG, Remark 4.8), the provable radius of the initialization ball is for SVRG-LR, which is smaller than that required for the proposed SVRG-SDP in this paper. In this sense, the convergence conditions in this paper are weaker than those used in Wang2017-universal-SVRG. Secondly, from the discussion in (Wang2017-universal-SVRG, Remark 4.8), the contraction parameter associated with SVRG-LR lies in the interval when , while by (11), the contraction parameter associated with our proposed SVRG-SDP (approaching ) shall be smaller than if a moderately large is adopted. This implies in some sense that the proposed version of SVRG with Option I generally converges faster than SVRG-LR, the Option II counterpart for problem (2). Moreover, note that for any positive integer , defined in (11) is always less than . This means we have no requirement on to guarantee the linear convergence of SVRG-SDP, while a sufficiently large (at least in the order of ) is required to guarantee the linear convergence of SVRG-LR studied in Wang2017-universal-SVRG. Also, the convergence analysis of the mini-batch version of SVRG-LR (though obtainable) is missing in the literature (Wang2017-universal-SVRG).
In Algorithm 1, when , SVRG-SDP reduces to FGD studied in Sanghavi-FGD2016. Under the similar smoothness and RSC assumptions, the local linear convergence of FGD was firstly established in Sanghavi-FGD2016 if the initialization lies in the ball , and later the radius of the initialization ball was improved to in Zeng-SGD2019. Notice that Theorem 3.2 above also holds for the case of . This shows that the provable radius of initialization ball for FGD is further improved to .
Compared to the stochastic version of FGD, i.e., stochastic gradient descent (SGD) method studied in Zeng-SGD2019, the convergence rate of SVRG-SDP is linear while that of SGD is sublinear when adopting a diminishing step size. As shown in Table 1, the initialization ball for SVRG-SDP is also slightly larger than that of SGD with a fixed step size in Zeng-SGD2019, while in this case, SGD only converges to a -neighborhood of a global minimum, where is the used fixed step size.
Compared to SVRG-I studied in Zeng-svrg2018, we establish linear submanifold convergence, a more precise characterization in the low-rank matrix factorization setting, instead of point convergence for SVRG-SDP. The convergence conditions used in this paper are also weaker than those in Zeng-svrg2018 in the following two aspects. The first one is the weaker assumption on the objective function . From (Zeng-svrg2018, Assumption 1(b)), each is required to be -restricted strongly convex; while in this paper, we only require the -restricted strong convexity of the average function as shown in Assumption 1(b), which is a more realistic condition used in the literature (Sanghavi-FGD2016; Wang2017-universal-SVRG; Wang2017-GD; Wang2017-ICML; Zeng-SGD2019). The second one is that the initialization radius is improved from in Zeng-svrg2018 to in this paper, as shown in Table 1. Moreover, there is a lack of theoretical guarantees on the initialization schemes (though obtainable) in Zeng-svrg2018, while in this paper, we fill this gap and provide theoretical guarantees for the suggested initialization scheme, as shown in the following Proposition 3.3.
[Influence of batch size and update frequency] Note that the range of the mini-batch size is very flexible, since its upper bound is , where and are generally far more than . If the particular fixed step size is adopted in Algorithm 1, then (15) implies
Regardless of the memory storage, the above inequality shows that a larger mini-batch size adopted implies a faster convergence speed of the iterates yielded in the outer loop of SVRG-SDP, as long as is smaller than the upper bound specified in (10). Similar claim also holds for the choice of update frequency in the inner loop by the above inequality since the base is positive and less than under the choice of in (10). In order to yield a fast convergence speed, a moderately large is usually required in the implementation of SVRG-SDP, as also implied by (16). They are reasonable since more gradient information is exploited when a large mini-batch size or update frequency is adopted in the inner loop with other fixed parameters.
Besides the concerned linear convergence of the iterates yielded in the outer loop, it is also important to estimate the computational complexity in terms of the amount of gradient information used to achieve a prescribed precision. Specifically, given a precision , the above inequality gives an estimate of the computational complexity of SVRG-SDP as follows:
where . On one hand, for some moderately large , the above computational complexity approximates to , which shows that a smaller mini-batch size generally implies a lower computational complexity of SVRG-SDP to achieve a prescribed precision. On the other hand, for some fixed and considering some moderately large , the above estimate of computational complexity also implies that a smaller leads to a lower computational complexity. These are also verified by our later numerical experiments in Section 5.
Next, we give a corollary to show the linear convergence of SVRG-SDP when adopting the considered SBB step size (6), where the BB step size is its special case with .
[Convergence of SVRG-SDP with SBB step size] Suppose that the assumptions of Theorem 3.2 hold and that for any . Then the convergence claim in Theorem 3.2 also holds for SVRG-SDP equipped with the SBB step size in (6).
3.3 Global Linear Convergence with Provable Initial Scheme
By Theorem 3.2, the radius of the initialization ball plays a central role in the establishment of local linear convergence of SVRG-SDP. Thus, it is crucial to show whether such a proper initial point can be easily achieved in practice. Next, we show that such a desired initial point can indeed be achieved by the suggested initialization scheme (see Algorithm 2) with logarithmic computational complexity.
Similar results to Proposition 3.3 have been shown in Wang2017-GD for the gradient descent method and in Zeng-SGD2019 for the stochastic gradient descent method. According to (Wang2017-GD, Theorem 5.7), the condition on is , while the requirement in Proposition 3.3 is , which significantly relaxes the condition in Wang2017-GD. According to (Zeng-SGD2019, Proposition 1), the requirements on of both papers are the same. However, the requirement on the radius of the initialization ball in this paper is slightly weaker than that of Zeng-SGD2019, where the radius is improved from in Zeng-SGD2019 to in this paper. The proof of Proposition 3.3 is provided in Appendix C.
[Global linear convergence] Suppose that Assumption 1 holds with . Let be a sequence generated by Algorithm 1 with via the initial scheme in Algorithm 2 (where satisfies (17)), satisfying (10), , and for . Then converges to a global minimum exponentially fast.
The proof of this theorem is summarized as follows. By Theorem 3.2, we show that the proposed SVRG method converges to a global minimum exponentially fast starting from an initial guess close to this global minimum, and then according to Proposition 3.3, we show that the suggested initialization algorithm (see Algorithm 2) can find the desired initial guess permitted to the linear convergence with an order of logarithmic computational complexity, starting from the trivial origin point. In other words, the convergence speed of the suggested initial scheme in Algorithm 2 is also linear to reach the desired initial precision. Therefore, combining Proposition 3.3 with the local linear convergence in Theorem 3.2, the whole convergence speed of SVRG-SDP equipped with such initial scheme is linear starting from the origin point. This implies the global linear convergence of SVRG-SDP in Theorem 3.3.
4 Second-Order Descent Lemma and Proof of Theorem 3.2
In this section, we present the key proofs of our main theorem (i.e., Theorem 3.2). The proof idea is motivated by Zeng-svrg2018 with an extension. Specifically, to prove Theorem 3.2, we need the following second-order descent lemma which estimates the progress made by a single iterate of the inner loop and is characterized in the manifold metric instead of the Euclidean metric as in (Zeng-svrg2018, Lemma 1).
We call this lemma the second-order descent lemma since both linear and quadratic terms of are involved in the upper bound on the right-hand side of (18). This is in general different from the approach used in the literature (say, Johnson-Zhang-svrg2013; Tan-Ma-bbsvrg2016) to yield the linear convergence of SVRG methods.
[Proof of Lemma 4] Let , , and Note that
The bounds of both and can be estimated respectively by Lemma B.1 and Lemma B.2, whose proofs are given in Appendix B. Combining these bounds ((21) and (25) in Appendix B) for (19) yields
[Proof of Theorem 3.2] We prove this theorem by induction. We first develop the contraction between two iterates of the outer loop based on Lemma 4, and then establish the local linear convergence recursively.
We assume that at the -th inner loop, and for and , then (18) still holds due to . This implies
where the first inequality holds for , and the second inequality holds for (implying ). Adding the term to both sides of the above inequality yields