# Convergence diagnostics for stochastic gradient descent with constant step size

Iterative procedures in stochastic optimization are typically comprised of a transient phase and a stationary phase. During the transient phase the procedure converges towards a region of interest, and during the stationary phase the procedure oscillates in a convergence region, commonly around a single point. In this paper, we develop a statistical diagnostic test to detect such phase transition in the context of stochastic gradient descent with constant step size. We present theoretical and experimental results suggesting that the diagnostic behaves as intended, and the region where the diagnostic is activated coincides with the convergence region. For a class of loss functions, we derive a closed-form solution describing such region, and support this theoretical result with simulated experiments. Finally, we suggest an application to speed up convergence of stochastic gradient descent by halving the learning rate each time convergence is detected. This leads to remarkable speed gains that are empirically comparable to state-of-art procedures.

## Authors

• 2 publications
• 14 publications
• ### Understanding and Detecting Convergence for Stochastic Gradient Descent with Momentum

Convergence detection of iterative stochastic optimization methods is of...
08/27/2020 ∙ by Jerry Chee, et al. ∙ 38

• ### On Convergence-Diagnostic based Step Sizes for Stochastic Gradient Descent

Constant step-size Stochastic Gradient Descent exhibits two phases: a tr...
07/01/2020 ∙ by Scott Pesme, et al. ∙ 0

We propose a statistical adaptive procedure called SALSA for automatical...
02/25/2020 ∙ by Pengchuan Zhang, et al. ∙ 0

• ### Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

Batch Normalization (BN) has become a cornerstone of deep learning acros...
12/10/2018 ∙ by Sanjeev Arora, et al. ∙ 20

• ### Online Stochastic Gradient Descent with Arbitrary Initialization Solves Non-smooth, Non-convex Phase Retrieval

In recent literature, a general two step procedure has been formulated f...
10/28/2019 ∙ by Yan Shuo Tan, et al. ∙ 0

• ### Theoretical Interpretation of Learned Step Size in Deep-Unfolded Gradient Descent

Deep unfolding is a promising deep-learning technique in which an iterat...
01/15/2020 ∙ by Satoshi Takabe, et al. ∙ 0

• ### Fluctuation-dissipation relations for stochastic gradient descent

The notion of the stationary equilibrium ensemble has played a central r...
09/28/2018 ∙ by Sho Yaida, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider a classical problem in stochastic optimization stated as

 θ⋆=argminθ∈ΘE(ℓ(y,x⊤θ)), (1)

where is the loss function, denotes the response, are the features, and are parameters in . For example, the quadratic loss function is defined as

. In estimation problems we typically have a finite data set

, , from which we wish to estimate by solving the empirical version of Equation (1):

 ^θ=argminθ∈Θ1NN∑i=1ℓ(yi,x⊤iθ).

When data size, , and parameter size, , are large classical methods for computing fail. Stochastic gradient descent (SGD) is a powerful alternative (Bottou, 2010, 2012; Toulis and Airoldi, 2015; Zhang, 2004) because it solves the problem in an iterative fashion through the procedure:

 θn=θn−1−γ∇ℓ(yn,x⊤nθn−1). (2)

Here, is the estimate of prior to the th iteration, is a random sample from the data, and is the gradient of the loss with respect to . Classical stochastic approximation theory (Benveniste et al., 1990; Borkar, 2008; Robbins and Monro, 1951) suggests that SGD converges to a value such that , which under typical regularity conditions is equal to when is infinite (streaming setting), or is equal to when is finite. Going forward we assume the streaming setting for simplicity, but our results hold for finite as well.

Typically, stochastic iterative procedures start from some starting point and then move through a transient phase and towards a stationary phase (Murata, 1998). In stochastic gradient descent this behavior is largely governed by parameter , which is known as the learning rate, and can either be decreasing over (e.g., ), or constant. In the decreasing rate case, the transient phase is usually long, and can be impractically so if the rate is slightly misspecified (Nemirovski et al., 2009; Toulis et al., 2017), whereas the stationary phase involves SGD converging in quadratic mean to . When is constant the transient phase is much shorter and less sensitive to the learning rate, whereas the stationary phase involves SGD oscillating within a region that contains . In this paper, we focus on statistical convergence diagnostics for constant rate SGD because in this setting a convergence diagnostic can be utilized to identify when there is no benefit in running the procedure longer.

### 1.1 Related work and contributions

The idea that SGD methods are composed of a transient phase and a stationary phase (also known as search phase and convergence phase, respectively), has been expressed before (Murata, 1998)

. However, no principled statistical methods have been developed to address stationarity issues, and thereby guide empirical practice of SGD. Currently, guidance is based on heuristics originating from optimization theory that aim to evaluate the magnitude of SGD updates. For example, a popular method is to stop when

is small according to some threshold, or when updates of the loss function have reached machine precision (Bottou et al., 2016; Ermoliev and Wets, 1988). These methods, however, do not take into account the sampling variation in SGD estimates, and are suited for deterministic procedures but not stochastic ones.

A more statistically motivated approach is to monitor the test error of SGD iterates on a hold-out validation test, concurrently with the main SGD iteration (Blum et al., 1999; Bottou, 2012). One idea here is to stop the procedure when the validation error starts increasing. An important problem with this approach is that the validation error is also a stochastic process, and estimating when it actually starts increasing presents similar, if not greater, challenges to the original problem of detecting convergence to stationary phase. Furthermore, cross validation can be computationally costly in large data sets.

In stochastic approximation, methods to detect stationarity can be traced back to classical theory of stopping times (Pflug, 1990; Yin, 1989). One important method, which forms the basis of this paper, is Pflug’s procedure (Pflug, 1990) that keeps a running average of the inner product of successive gradients , where we defined . The underlying idea is that in the transient phase the stochastic gradients point roughly to the same direction, and thus their inner product is positive. In the stationary phase, SGD with constant rate moves haphazardly around , and so the gradients point to different directions making the inner product negative.

The intuition that a negative inner product of successive gradients indicates convergence underlies accelerated methods in stochastic approximation (Delyon and Juditsky, 1993; Kesten, 1958; Roux et al., 2012). The accelerated methods, however, take this intuition as a given, whereas we develop theory for it to define a formal convergence testing procedure. Recently, another related idea is that of gradient diversity (Yin et al., 2018), which is used to understand why speedup gains in batch SGD saturate with increasing batch size. An important difference is that gradient diversity calculates the inner products at a fixed parameter value , whereas stochastic approximation methods, including this paper, use successive parameter values.

#### 1.1.1 Overview of results and contributions

Our contributions in this paper can be summarized as follows. In Section 2, we develop a formal convergence diagnostic test for SGD, which combines Pflug’s stopping time procedure (Pflug, 1990) with SGD in Equation (2) to detect when SGD exits the transient phase and enters the stationary phase. We note that by convergence of SGD with constant rate we do not mean convergence to a single point but convergence to the stationarity region. We prove a general result that the diagnostic indeed is activated almost surely. We illustrate through an example, where conditional on the diagnostic being activated, the distance is uncorrelated with the starting distance , implying that the diagnostic captures the transition from transient to stationary phase. In Section 3, we develop theory for quadratic loss, and derive a closed-form solution describing the region where the diagnostic is activated. In Section 4.2, we present extensions beyond the quadratic loss. In Section 4.3 we suggest an application of the diagnostic in speeding up SGD by halving the learning rate each time convergence is detected. This leads to a new SGD procedure, named SGD

, which is comparable to state-of-art procedures, such as variance-reduced SGD

(Johnson and Zhang, 2013) and averaged SGD (Bottou, 2010; Xu, 2011), in Sections 4.4 and 4.5.

## 2 Convergence diagnostic

Before we develop the formal diagnostic, we present theory that supports the existence of a transient and stationary phase of SGD. The theory suggests that the mean squared error of SGD has a bias term from distance to the starting point, and a variance term from noise in stochastic gradients.

###### Theorem 1

[(Moulines and Bach, 2011; Needell et al., 2014)] Under certain assumptions on the loss function, there are positive constants such that, for every , it holds that

 E(||θn−θ⋆||2)≤E(||θ0−θ⋆||2)e−Aγn+Bγ.

Remarks. The constants differ depending on the analysis. For example, Bach and Moulines (Moulines and Bach, 2011) use , where is the strong convexity constant of expected loss , and is the Lipschitz constant of ; and , where is an upper bound for the variance of . Needell and Srebro (Needell et al., 2014) use and .

Despite such differences, all analyses suggest that the SGD procedure with constant rate goes through a transient phase exponentially fast during which it forgets the initial conditions , and then enters a stationary phase during which it oscillates around , roughly at a region of radius . A trade-off exists here: reducing will make the oscillation radius, , smaller but escaping the transient phase becomes much slower; for instance, in the extreme case where the procedure will never exit the transient phase.

Despite the theoretical insights it offers, Theorem 1 has limited practical utility for estimating the phase transition in SGD. One approach could be to find the value of for which , that is, the initial conditions have been discounted to 1% of the stationary variance. That, however, requires estimating , , and , which is challenging. In the following section, we develop a concrete statistical diagnostic to estimate the phase transition and detect convergence of SGD in a much simpler way.

### 2.1 Pflug diagnostic

In this section, we develop a convergence diagnostic for SGD procedures that relies on Pflug’s procedure (Pflug, 1992) in stochastic approximation. The diagnostic is presented as Algorithm 1 and concrete instances under quadratic loss along with theoretical analysis are presented in Section 3, with extensions in Section 4.1.

The diagnostic is defined by a random variable

that keeps the running sum of the inner product of successive stochastic gradients, as shown in Line 7. The idea is that in the transient phase SGD moves towards by discarding initial conditions, and so the stochastic gradients point to the same direction, on average. This implies that the inner product of successive stochastic gradients is likely positive in the transient phase. In the stationary phase, however, SGD is oscillating around at a distance bounded by Theorem 1, and so the gradients point to different directions. This implies a negative inner product on average during the stationary phase. When the statistic changes sign from positive to negative, this is a good signal that the procedure has exited the transient phase.

Since our convergence diagnostic is iterative we need to show that it eventually terminates with an answer. In Theorem 2 that follows we prove that as , and so Algorithm 1 indeed terminates almost surely. For brevity, we state the theorem without technical details. The full assumptions and proof can be found in the supplementary material.

###### Theorem 2

Under certain assumptions, the convergence diagnostic in Algorithm 1 for constant rate SGD procedure in Equation (2) satisfies as , and so the algorithm terminates almost surely.

Remarks. Theorem 2 shows that the inner product of successive gradients is negative in expectation as the iteration number increases. Roughly speaking, when is very close to the dominant force is the variance in the stochastic gradient pulling the next iterates away from ; when is far from the dominant force is the bias in the stochastic gradient, which instead pulls the next iterates towards

. This implies that the running sum of successive gradients will eventually become negative at a finite iteration number, and so by the law of large numbers the diagnostic returns a value almost surely.

In this section, we attempt to gain analytical insight into our convergence diagnostic of Algorithm 1 by assuming simple quadratic loss function, i.e., and . Consider the case where , i.e., the procedure starts in the stationary region. Let , where are zero-mean random variables, . Then,

 θ1 =θ⋆+γ(y1−x⊤1θ⋆)x1=θ⋆+γε1x1,

from which it follows that

 S2−S1 =(y2−x⊤2θ1)(y1−x⊤1θ0)x⊤2x1=(ε2−γε1x⊤2x1)ε1x⊤2x1. E(S2−S1) =−γE(ε21)E((x⊤2x1)2)<0. (3)

Thus, when the procedure starts at true parameter value, , the diagnostic is decreased in expectation, and eventually at some iteration the statistic becomes negative and the diagnostic is activated at . We generalize this result in the following theorem.

###### Theorem 3

Suppose that the loss is quadratic, . Let and

be two iid vectors from the distribution of

, and define: ; ; ; , and suppose that all such constants are finite. Then, for ,

 Δn(θ) =E(Sn+2−Sn+1|θn=θ) =(θ−θ⋆)⊤(C−γD)(θ−θ⋆)−γc2σ2.

Remarks. Theorem 3

shows that the boundary surface that separates the two regions where the test statistic

increases or decreases in expectation looks like an ellipse, for large enough . Regardless of the choice of , when is close enough to , the diagnostic is guaranteed to decrease in expectation since the only remaining term is .

The result also shows the various competing forces between bias and variance in the stochastic gradients as they relate to how the diagnostic behaves. For instance, when is very close , larger (noise in stochastic gradient) contributes to a faster decrease of the diagnostic in expectation, but at the cost of higher variance. The contribution of the other term, , is less clear. For instance, is large when there is strong collinearity in features , which may contribute to decreasing . But strong collinearity also implies that is almost positive definite which contributes positive values to , thus counteracting the contribution of . Note that is a positive definite matrix but may not be. This implies that careful selection of may be necessary for the diagnostic to work well. For example, when is very small and is positive definite, then will converge to a negative number slowly. One way to alleviate this sensitivity to the learning rate is through implicit updates (Toulis et al., 2014), which we explore in the following section.

### 3.1 Implicit update

As mentioned above the Pflug diagnostic is sensitive to the choice of learning rate . When is small and is positive definite, will be mostly increasing during the transient phase, which makes convergence slower. But choosing a large learning rate can easily lead to numerical instability. One way to alleviate such sensitivity to the learning rate is to use the SGD procedure with an implicit update as follows:

 θn=θn−1−γ∇ℓ(yn,x⊤nθn). (4)

Note that appears on both sides of the equation. In the quadratic loss model we can solve exactly the implicit equation as follows:

 θn=(I+γxnx⊤n)−1(θn−1+γynxn). (5)

Implementing the procedure in Eq. (5) is fast since it is equivalent to . More generally, the implicit update in Equation 14 can be computed efficiently in many settings through a one-dimensional root-finding procedure (Toulis et al., 2014). Previous work on implicit SGD (ISGD) has shown that implicit procedures have similar asymptotic properties with standard SGD procedures with numerical stability as an added benefit. Since most related work on ISGD methods is with respect to decreasing learning rate procedures (Bertsekas, 2011; Kulis and Bartlett, 2010; Toulis et al., 2017, 2014), we provide an analysis for constant rate ISGD as in Equation (4) in the supplementary material. We note that ISGD procedures are related to proximal updates in stochastic optimization (Parikh and Boyd, 2013; Rosasco et al., 2014; Xiao and Zhang, 2014), but these methods differ from ISGD methods in that they employ a combination of classical SGD with deterministic proximal operators, whereas ISGD’s proximal operator is purely stochastic.

The following theorem shows that the implicit update in the linear model mitigates the sensitivity of the Pflug diagnostic to the choice of the learning rate.

###### Theorem 4

Let . Under the assumptions of Theorem 3 applied on the implicit procedure in Equation (14), it holds that

 Δimn(θ) =E(Sn+2−Sn+1|θn=θ) =aγΔn(θ)+bγ[(θ−θ⋆)⊤D(θ−θ⋆)+σ2c2],

where , .

Remarks. Theorem 4 shows that the diagnostic is more stable with the ISGD procedure than with the classical SGD procedure. By stability we mean two things. First, even when classical SGD diverges the convergence diagnostic may still declare convergence. Consider, for example, the simple model , where . If the classical SGD procedure will diverge. However, the diagnostic will declare convergence almost immediately because by Theorem 3 it decreases, in expectation, for every . Such inconsistency due to instability of classical SGD cannot happen with implicit SGD.

Second, generally speaking, empirical performance of the diagnostic under implicit SGD matches theory better than under classical SGD. This is illustrated in the following section, where the region of diagnostic convergence is smooth and elliptical under implicit SGD, as predicted by Theorem 4; under classical SGD, the corresponding region does not follow Theorem 3 as closely due to sensitivity to learning rate specification.

### 3.2 Illustration

Here, we illustrate the main results of Theorem 4 through Figure 1, which can be described as follows. The shaded areas in the figure show how the Pflug diagnostic changes in expectation when the SGD iterate falls in the region. In other words, every point in the figure is shaded by the value , as defined in Theorem 4.

Various shades of grey indicate the magnitude of change. The darkest-shaded region corresponds to the region where the diagnostic decreases in expectation, that is, for all in that region. We call this the Pflug region. Note that the Pflug region is centered roughly around , the true parameter value. Inside the Pflug region the diagnostic is decreased in expectation, and outside of the region it is increased. Furthermore, the expected change in the diagnostic is uniform in distance to the center of the Pflug region, which is roughly : the farther we move away from the center the larger the expected increase of the diagnostic becomes.

The blue polygon shaded with diagonal lines corresponds to empirical estimations of the convergence region of SGD, defined as the region where SGD iterates have oscillated around for 95% of the time calculated over 1000 simulations. The polygon shows that the Pflug region approximates very well the actual convergence region of SGD. This is remarkable because the Pflug region can be calculated from data using the convergence diagnostic, whereas by Theorem 1 the SGD convergence region cannot be calculated without knowledge of and other unknown parameters.

### 3.3 Simulated example

Next, we test the Pflug diagnostic through a simulated experiment. The experimental setup is as follows. We set as the parameter dimension, and also set as the data set size and fix with ; this ensures some variation and sparsity in the parameter values. We sample features as , where denotes a

-variate normal distribution,

is the identity matrix, and

. We sample outcomes as , where .

For given we run Algorithm 1 with burnin = , for various values of the starting point sampled as , where . Let , then for each run we store the tuple

 (γ,τ,E0,Eτ/2,E2τ),

where is the output of Algorithm 1, i.e., the iteration at which the Pflug diagnostic detected convergence. The idea in this experimental evaluation is that if the convergence diagnostic detects convergence accurately, iterates earlier than convergence, say, , will depend on the initial conditions more than iterates later than convergence, say, . Thus, for given and , we should expect a much higher correlation between and than between and . To test this hypothesis, for a given value of we draw 100 independent samples of . With these samples we regress on and on

in two normal linear regression models. Table

1 summarizes the regression results from this experiment. In the second and third column of Table 1 we report the regression coefficients of in the two model fits, respectively, and also report statistical significance.

From the table, we see that the regression coefficient corresponding to is always positive and statistically significant, whereas the coefficient is mostly not significant for . This suggests that depends on initial conditions , and thus stationarity has not yet been reached at iteration . In contrast, does not depend on initial conditions , and thus stationarity has likely occurred after iteration . This is evidence indicating that the Pflug diagnostic performs reasonably well in estimating the switch of SGD from its transient phase to its stationary phase.

We note that in the regression evaluation we had to control for (by using it as a regressor) because the iteration number is correlated with mean-squared error (larger values for are correlated with smaller error).

## 4 Extensions and applications

In this section we consider extensions of the Pflug diagnostic to a more broad family of loss functions inspired by generalized linear models (GLMs). We also consider an application of the diagnostic to speed up convergence of SGD with constant rate.

### 4.1 Generalized linear loss

In this section we consider extensions of the Pflug diagnostic to a more broad family of loss functions inspired by generalized linear models (GLMs). We also consider an application of the diagnostic to speed up convergence of SGD with constant learning rate.

### 4.2 Generalized linear loss

Here, we consider the loss based on the GLM formulation (McCullagh, 1984; Toulis et al., 2014) where . For example, the quadratic loss is equivalent to . The logistic loss is when is binary and . In general, cannot be chosen arbitrarily—one standard choice is to define such that is a proper density, i.e., it integrates to one. The following theorem generalizes the results in Section 3 on the quadratic loss.

###### Theorem 5

Define the loss . Let and suppose that , almost surely for all . Let be two iid vectors from the distribution of . Define ; ; ; . Then, for small enough ,

 Δglmn(θ) =E(Sn+2−Sn+1|θn=θ)≤||C(θ,θ⋆)||2−γk[σ2c2+D2(θ,θ⋆)].

Remarks. The result in Theorem 5 has the same structure as in Theorem 3 so a direct analogy can be helpful. The terms in the two theorems are identical, if we consider that for the quadratic loss it holds that . The term in Theorem 5 corresponds to the term in Theorem 3, and corresponds to . The terms are equal when we set , in which case . Thus, the diagnostic with the more general GLM loss has familiar properties. For example, when , i.e., when SGD is near the truth, and , in which case the negative constant term dominates, and the test statistic decreases in expectation leading to activation of the diagnostic. One difference with the quadratic loss, however, is that as we move farther from the statistic may change in a nonlinear way. Therefore the boundary separating the positive and negative regions of the diagnostic will generally not have the familiar smooth elliptical shape as in the quadratic loss (see Figure 1). This may lead to more complex behavior for the diagnostic, which is open to future work.

Regarding the assumptions of Theorem 5, we note that the constraint on derivative is not particularly strict because in the GLM formulation is guaranteed to be positive. The assumption is made to simplify the analysis, but can be improved by analyzing the quantity through existing analyses of .

### 4.3 Sgd1/2 for fast convergence

We now switch gears from analyzing the behavior of the Pflug diagnostic to using it in a practical application. Our suggested application is to use the diagnostic within a SGD loop where the learning rate is halved and the procedure restarted each time convergence is detected. We emphasize that our goal here is to illustrate the utility of our convergence diagnostic and not to exhaustively demonstrate the performance of the new procedure. A full analysis of the proposed procedure is open to future work.

More specifically, the SGD procedure with constant rate has linear convergence to a stationary distance from of , as suggested by Theorem 1. It would therefore be beneficial to reduce the learning rate when we know that SGD iterates are oscillating around in a ball of radius , so that the procedure moves to a ball with a smaller radius. To implement such a procedure, however, would require knowing , and also knowing all parameters required to calculate . Our solution employs the Pflug convergence diagnostic to detect stationarity. Algorithm 2 describes such a procedure, named SGD, where the learning rate is halved upon detection of convergence (Line 10).

Note that implicit updates can be used in this algorithm as well; we call this modified algorithm ISGD. In our experiments in the following section, we employ ISGD because of the benefits in numerical stability from using implicit updates, as described earlier.

### 4.4 Simulated data experiments

To evaluate the effectiveness of ISGD, we compare to other classical and state-of-the-art SGD methods. We first experiment on simulated data to better understand the performance of ISGD and its competition under various parameter settings. In particular, we compare the performance of procedure ISGD in Algorithm 2 against SVRG and classical ISGD on simulated data. The classical ISGD uses a learning rate of , which is optimized through pre-processing. The basic experimental setup is as follows.

We consider settings of high and low signal to noise ration (SNR), and high and low dimension and test under the four combinations of these settings. For the high SNR case, we set , where , and for the low SNR case we set . For the high dimension case we set as the parameter dimension, and for the low dimension case we set . Given , we fix such that . We set as the size of the data set. We sample features as , where . We sample outcomes as for the normal model, and for the logistic model, where denotes the binomial random variable with mean . The learning parameters for each SGD method were tuned to provide best performance through pre-processing.

From simulations with the normal model in the left half of Figure 2 we see that ISGD

attains a comparable performance to SVRG. In general, SVRG attains an overall better performance for these experiments, which we believe is related to our convergence diagnostic being aggressive in a couple of cases, which are essentially cases of Type-I error.

From simulations with the logistic model in the right half of Figure 2 we see that ISGD attains an even better performance than before as there are fewer cases of Type-I error. With high SNR and low dimension parameter settings, ISGD achieves consistently better performance than SVRG. We note that such comparisons do not take into account the sensitivity of SVRG to misspecifications of the learning rate (large enough learning rates can easily make the procedure diverge); or that SVRG requires periodic calculations over the entire data set, which here is easy because we are using only 5,000 data points, but may be a problem in more realistic settings. We also note that there are several improvements available for ISGD by allowing a larger burnin period or by discounting the learning rate less aggressively. An interesting direction for future work is to understand the performance of our diagnostic test in terms of statistical validity and power, and thus address some of the aforementioned tuning issues in a principled manner.

### 4.5 Benchmark data sets

In addition to simulated experiments we conduct experiments on benchmark data sets MNIST (binary) and COVERTYPE (binary) to evaluate real world performance.111Data sets can be found at https://archive.ics.uci.edu/ml/databases/mnist/ and https://archive.ics.uci.edu/ml/datasets/covertype, respectively. In particular, we perform binary logistic regression using ISGD, SVRG, classical ISGD, and averaged ISGD (Toulis et al., 2016). We plot the prediction error on a held-out test set in Figure 3 relative to the number of passes over the data.

Overall, we see that ISGD convergences very quickly, after going over less than a quarter of the data, and achieves best performance in the COVERTYPE data set. We currently do not have a theoretical justification for this, but we have verified that the aforementioned result is consistently observed across multiple experiments. ISGD was also very stable to specifications of the learning rate parameter, as expected from the analysis of Theorem 4. In contrast, even though SVRG performed comparably to ISGD, its performance was unstable, especially in the COVERTYPE data set, and required careful fine tuning of the learning rate through trial and error. Averaged SGD performed well on the MNIST data set, but flattened out very fast in the COVERTYPE data, possibly due to non-strong convexity of the objective function.

## 5 Conclusion

In this paper we focused on detecting convergence of SGD with constant learning rate to its convergence phase. This is an important practical task because statistical properties of iterative stochastic procedures are better understood under stationarity. We borrowed from the theory of stopping times in stochastic approximation to develop a simple diagnostic that uses the inner product of successive gradients to detect convergence. Theoretical and empirical results suggest that the diagnostic reliably detects the phase transition, which can speed up classical procedures.

Future work needs to focus on analysis of errors conditional on the diagnostic being activated. This could show that the error is uncorrelated with the initial starting point conditional on the test being activated, and so provide theoretical support to the empirical results in Table 1. It would also be interesting to analyzse ISGD. Another idea is to use aggregation among parallel ISGD

chains. At stationarity we expect iterates from different chains to be uncorrelated with each other, and so averaging may help. It would also be interesting to use the diagnostic in problems with non-convex loss, such as neural networks.

## References

• Benveniste et al. (1990) Benveniste, A., P. Priouret, and M. Métivier (1990). Adaptive algorithms and stochastic approximations. Springer-Verlag New York, Inc.
• Bertsekas (2011) Bertsekas, D. P. (2011). Incremental proximal methods for large scale convex optimization. Mathematical programming 129(2), 163–195.
• Blum et al. (1999) Blum, A., A. Kalai, and J. Langford (1999). Beating the hold-out: Bounds for k-fold and progressive cross-validation. In

Proceedings of the twelfth annual conference on Computational learning theory

, pp. 203–208. ACM.
• Borkar (2008) Borkar, V. S. (2008). Stochastic approximation. Cambridge Books.
• Bottou (2010) Bottou, L. (2010).

Large-scale machine learning with stochastic gradient descent.

In Proceedings of COMPSTAT’2010, pp. 177–186. Springer.
• Bottou (2012) Bottou, L. (2012). Stochastic Gradient Descent Tricks. In Neural Networks: Tricks of the Trade, Volume 1, pp. 421–436.
• Bottou et al. (2016) Bottou, L., F. E. Curtis, and J. Nocedal (2016). Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838.
• Delyon and Juditsky (1993) Delyon, B. and A. Juditsky (1993). Accelerated stochastic approximation. SIAM J. Optimization 3(4), 868–881.
• Ermoliev and Wets (1988) Ermoliev, Y. M. and R.-B. Wets (1988). Numerical techniques for stochastic optimization. Springer-Verlag.
• Johnson and Zhang (2013) Johnson, R. and T. Zhang (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pp. 315–323.
• Kesten (1958) Kesten, H. (1958). Accelerated stochastic approximation. The Annals of Mathematical Statistics 29(1), 41–59.
• Kulis and Bartlett (2010) Kulis, B. and P. L. Bartlett (2010). Implicit online learning. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 575–582.
• McCullagh (1984) McCullagh, P. (1984). Generalized linear models. European Journal of Operational Research 16(3), 285–292.
• Moulines and Bach (2011) Moulines, E. and F. R. Bach (2011).

Non-asymptotic analysis of stochastic approximation algorithms for machine learning.

In Advances in Neural Information Processing Systems, pp. 451–459.
• Murata (1998) Murata, N. (1998). A statistical study of on-line learning. Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 63–92.
• Needell et al. (2014) Needell, D., R. Ward, and N. Srebro (2014). Stochastic gradient descent, weighted sampling, and the randomized kaczmarz algorithm. In Advances in Neural Information Processing Systems, pp. 1017–1025.
• Nemirovski et al. (2009) Nemirovski, A., A. Juditsky, G. Lan, and A. Shapiro (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization 19(4), 1574–1609.
• Parikh and Boyd (2013) Parikh, N. and S. Boyd (2013). Proximal algorithms. Foundations and Trends in optimization 1(3), 123–231.
• Pflug (1990) Pflug, G. C. (1990). Non-asymptotic confidence bounds for stochastic approximation algorithms with constant step size. Monatshefte für Mathematik 110(3), 297–314.
• Pflug (1992) Pflug, G. C. (1992).

Gradient estimates for the performance of markov chains and discrete event processes.

Annals of Operations Research 39(1), 173–194.
• Robbins and Monro (1951) Robbins, H. and S. Monro (1951). A stochastic approximation method. The annals of mathematical statistics, 400–407.
• Rosasco et al. (2014) Rosasco, L., S. Villa, and B. C. Vũ (2014). Convergence of stochastic proximal gradient algorithm. arXiv preprint arXiv:1403.5074.
• Roux et al. (2012) Roux, N. L., M. Schmidt, and F. Bach (2012). A stochastic gradient method with an exponential convergence _rate for finite training sets. In Advances in Neural Information Processing Systems, pp. 2663–2671.
• Toulis and Airoldi (2015) Toulis, P. and E. M. Airoldi (2015). Scalable estimation strategies based on stochastic approximations: classical results and new insights. Statistics and computing 25(4), 781–795.
• Toulis et al. (2017) Toulis, P., E. M. Airoldi, et al. (2017). Asymptotic and finite-sample properties of estimators based on stochastic gradients. The Annals of Statistics 45(4), 1694–1727.
• Toulis et al. (2014) Toulis, P., J. Rennie, and E. Airoldi (2014). Statistical analysis of stochastic gradient methods for generalized linear models. In 31st International Conference on Machine Learning.
• Toulis et al. (2016) Toulis, P., D. Tran, and E. Airoldi (2016). Towards stability and optimality in stochastic gradient descent. In Artificial Intelligence and Statistics, pp. 1290–1298.
• Xiao and Zhang (2014) Xiao, L. and T. Zhang (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization 24, 2057–2075.
• Xu (2011) Xu, W. (2011). Towards optimal one pass large scale learning with averaged stochastic gradient descent. arXiv preprint arXiv:1107.2490.
• Yin et al. (2018) Yin, D., A. Pananjady, M. Lam, D. Papailiopoulos, K. Ramchandran, and P. Bartlett (2018). Gradient diversity: a key ingredient for scalable distributed learning. In Proceedings of 21st International Conference on Artificial Intelligence and Statistics (AISTATS’18).
• Yin (1989) Yin, G. (1989). Stopping times for stochastic approximation. In Modern Optimal Control: A Conference in Honor of Solomon Lefschetz and Joseph P. LaSalle, pp. 409–420.
• Zhang (2004) Zhang, T. (2004). Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, pp. 116. ACM.

## Appendix A Proofs of theorems

###### Theorem 1 ((Moulines and Bach, 2011; Needell et al., 2014))

Under certain assumptions on the loss function, there are positive constants such that, for every , it holds that

 E(||θn−θ⋆||2)≤E(||θ0−θ⋆||2)e−Aγn+Bγ.
###### Theorem 2

Consider SGD with constant rate,

 θn=θn−1−γ∇ℓ(yn,x⊤nθn−1).

Suppose that Theorem 1 holds, so that that , for some positive and large enough . We make the following additional assumptions:

1. , where is -Lipschitz, and .

2. It holds , for any , for some positive constant , and some positive definite matrix

with minimum eigenvalue

.

3. It holds that .

Then,

 E(∇ℓ(yn,x⊤nθn−1)⊤∇ℓ(yn+1,x⊤n+1θn))<0.
###### Proof 2

For brevity let be the stochastic gradient at iteration .

 E(~ℓ⊤i−1~ℓi) =E[(fi−1+ei−1)⊤(fi+ei)]=E[(fi−1+ei−1)⊤fi] %[because$ei$arezero−mean] =E[(fi−1+ei−1)⊤f(θi−1−γfi−1−γei−1)] [\small{% by SGD step for θi }] ≤E(||fi−1||2)−γK⋅E[(fi−1+ei−1)⊤C(fi−1+ei−1)] [% \small{ by Assumption~{}(b) }] ≤(1−γμK)E(||fi−1)||2)−γK⋅E(||ei−1||2C) ≤(1−γμK)L2E(||θi−1−θ⋆||2)−γμKτ2 [\small{ by Lipschitz % Assumption~{}(a) }] ≤γ[(1−γμK)L2M−μKτ2] <0. [\small{ by Assumption~{}(c) and % small enough γ }] (6)

Remarks. Assumption (b) is a form of strong convexity. For example, suppose that , then and . In this case is the Fisher information matrix and Assumption (b) holds for . When is small enough and a Taylor approximation of is possible, the above result still holds for when the Fisher information exists. Assumption (c) shows that there is a threshold value for below which the diagnostic cannot terminate. For example, suppose that error noise is small so that and , as argued before. Then, , that is, the learning rate has to exceed the reciprocal of the minimum eigenvalue of the Fisher information matrix.

###### Theorem 3

Suppose that the loss is quadratic, . Let and be two iid vectors from the distribution of , and define: ; ; ; , and suppose that all such constants are finite. Then, for ,

 Δn(θ) =E(Sn+2−Sn+1|θn=θ) =(θ−θ⋆)⊤(C−γD)(θ−θ⋆)−γc2σ2.
###### Proof 3

For notational brevity we make the following definitions:

 θ+ =θ+γ(y1−x⊤1θ)x1 θ++ =θ++γ(y2−x⊤2θ+)x2, (7)

where is the current iterate, and and are the next two using iid data and . For a fixed we understand the Pflug diagnostic through the function

 H(θ) =S++−S+|θ=∇++ℓ⊤∇+ℓ=(θ+−θ)⊤(θ++−θ+)/γ2 (8) and Δn(θ) =E(H(θ))=E((θ+−θ)⊤(θ++−θ+)/γ2). (9)

We use Equation (3) to derive an expression for :

 H(θ) =(y1−x⊤1θ)(y2−x⊤2θ+)x⊤1x2 =(y1−x⊤1θ)[y2−x⊤2θ−γ(y1−x⊤1θ)x⊤1x2]x⊤1x2 =(y1−x⊤1θ)(y2−x⊤2θ)x⊤1x2−γ(y1−x⊤1θ)2(x⊤1x2)2. (10)

Let ; we know that . Now, we analyze each term individually:

 (y1−x⊤1θ)(y2−x⊤2θ)x⊤1x2 =[x⊤1(θ⋆−θ)+ε1][x⊤2(θ⋆−θ)+ε2]x⊤1x2 =(θ−θ⋆)⊺x1x⊤2(x⊤1x2)(θ−θ⋆)+ε1W(1)+ε2W(2)+ε1ε2W(3).

The variables are conditionally independent of and so using the law of iterated expectations these terms vanish.

 E((y1−x⊤1θ)(y2−x⊤2θ)x⊤1x2)=(θ−θ⋆)⊺E(x1x⊤2(x⊤1x2))(θ−θ⋆)=(θ−θ⋆)⊺C(θ−θ⋆).

Using a similar reasoning, for the second term we have:

 (y1−x⊤1θ)2(x⊤1x2)2= [(x⊤1(θ⋆−θ)+ε1]2(x⊤1x2)2 = (θ−θ⋆)⊺x1x⊤1(x⊤1x2)2(θ−θ⋆)+ε1W(4)+ε21(x⊤1x2)2. (11)

In expectation of Equation (3),

 E((y1−x⊤1θ)2(x⊤1x2)2) =(θ−θ⋆)⊺E(x1x⊤1(x⊤1x2)2)(θ−θ⋆)+ε21(x⊤1x2)2 =(θ−θ⋆)⊺D(θ−θ⋆)+σ2c2. (12)

By combining all results we finally get:

 Δn(θ)=(θ−θ⋆)⊺(C−γD)(θ−θ⋆)−γσ2c2.
###### Theorem 4

Let . Under the assumptions of Theorem 3 applied on the implicit procedure in Equation (14), it holds that

 Δimn(θ) =E(Sn+2−Sn+1|θn=θ) =aγΔn(θ)+bγ[(θ−θ⋆)⊤D(θ−θ⋆)+σ2c2],

where , .

###### Proof 4

We derive similar theoretical results for under the linear normal model for implicit updates. We have the implicit updates

 θ+ =θ+γ(y