For modern industrial-scale machine learning problems with massive amounts of data, stochastic first-order methods have almost become the default choice. Moreover, the datasets are not only extremely large, but often stored or even collected in a distributed manner. Stochastic versions of the alternating direction method of multipliers (ADMM) are popular approaches for handling this distributed setting, especially for regularized empirical risk minimization problems.
Consider the following stochastic optimization problem:
where with as the loss incurred on a sample , , , , and both and are convex and differentiable. The stochastic version of the alternating direction method of multipliers (ADMM) (Boyd et al., 2011) rewrites (1) as a constrained optimization problem
Here and throughout the rest of the paper, we use the same for both the stochastic instance and the expectation to ease the notation. In the batch learning setting, is approximated by the empirical risk function . However, minimizing over a large number of samples is computationally inefficient under time and resource constraints. In the stochastic setting, in each iteration is updated based on one noisy sample instead of the full training set.
Note that the classical setting of linear constraint can be reformulated as
by a simple linear transformation when is invertible.
One of the main ideas in the stochastic ADMM parallels stochastic gradient descent (SGD). At iteration , an iid sample is drawn from the distribution of . A straightforward application of this SGD idea to the ADMM for solving (2) leads to the following stochastic ADMM (sADMM)
Here is introduced as a relaxation parameter (Eckstein & Bertsekas, 1992; Boyd et al., 2011). When , the relaxation scheme becomes the standard ADMM. In the over-relaxation case , the scheme can accelerate convergence toward the optimal solution (Yuan et al., 2019).
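To make the relaxed iteration concrete, the following is a minimal sketch on a toy scalar problem chosen purely for illustration; it is not the experiment of this paper. It assumes the scaled-multiplier form of relaxed ADMM described in (Boyd et al., 2011), a constraint of the form a*x - z = 0, the losses f(x, xi) = (x - xi)^2 / 2 and g(z) = z^2 / 2, and the hypothetical names `relaxed_sadmm`, `rho` and `alpha`. Each subproblem is solved in closed form, and the x-subproblem sees one fresh sample per iteration.

```python
import random


def relaxed_sadmm(rho, alpha, n_iter, a=1.0, mu=1.0, seed=0):
    """Toy stochastic ADMM with relaxation (all modelling choices are assumptions).

    Problem: min_x E[0.5 * (x - xi)^2] + 0.5 * z^2  subject to  a * x - z = 0,
    where xi = mu + s and s is +1 or -1 with equal probability.
    u is the scaled multiplier of the scaled-form ADMM.
    """
    rng = random.Random(seed)
    x, z, u = 0.0, 0.0, 0.0
    xs = []
    for _ in range(n_iter):
        xi = mu + rng.choice((-1.0, 1.0))              # one noisy sample
        # x-update: argmin_x 0.5*(x - xi)^2 + (rho/2)*(a*x - z + u)^2
        x = (xi + rho * a * (z - u)) / (1.0 + rho * a * a)
        # relaxation step: blend a*x with the previous z
        h = alpha * a * x + (1.0 - alpha) * z
        # z-update: argmin_z 0.5*z^2 + (rho/2)*(h - z + u)^2
        z = rho * (h + u) / (1.0 + rho)
        # scaled multiplier update
        u = u + h - z
        xs.append(x)
    return xs


xs = relaxed_sadmm(rho=50.0, alpha=1.5, n_iter=20000)
tail = xs[5000:]                       # discard the transient
x_mean = sum(tail) / len(tail)
x_star = 1.0 / (1.0 + 1.0)             # minimizer mu / (1 + a^2) of the averaged problem
```

Because this toy iteration is linear in the sample, the expected iterates follow the deterministic ADMM, so the tail average `x_mean` should lie close to `x_star` while individual iterates keep fluctuating around it.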
1.1 Variants of ADMM and Stochastic ADMM
Many variants of the classical ADMM have been recently developed. There are two types of common modifications in many variants of ADMM, introduced to cater to the requirements of different applications.
The gradient-based ADMM solves (3a) inexactly by applying only one gradient descent step to all -nonlinear terms in with the step size :
To accommodate all these variants in one stochastic setting, we formulate a very general scheme that unifies all the above cases in the form of a stochastic version of ADMM:
General stochastic ADMM (G-sADMM)
where the approximate objective function for -subproblem is
The explicitness parameters and the proximal parameter . This scheme (5) is very general and includes existing variants as follows.
: deterministic version of ADMM;
: the standard stochastic ADMM (sADMM);
and : this scheme is the stochastic version of the linearized ADMM;
and : this scheme is the stochastic version of the gradient-based ADMM.
, , and : the stochastic ADMM considered in (Ouyang et al., 2013).
1.2 Main Results
Define . Let , and . Let . denote the sequence of stochastic ADMM (5) with the initial choice . Define as a stochastic process satisfying the SDE
where the matrix
Then we have with a weak convergence of order one.
1.3 Review and Related Work
Stochastic and online ADMM
The use of stochastic and online techniques for ADMM has recently drawn a lot of interest. (Wang & Banerjee, 2012) first proposed the online ADMM in the standard form, which learns from only one sample (or a small mini-batch) at a time. (Ouyang et al., 2013; Suzuki, 2013) proposed variants of stochastic ADMM to attack the difficult nonlinear optimization problem inherent in by linearization. Very recently, further accelerated algorithms for the stochastic ADMM have been developed in (Zhong & Kwok, 2014; Huang et al., 2019).
Continuous models for optimization algorithms
Assume the proximal parameter is linked to by with a constant . Our interest here is not about the numerical convergence of from the ADMM towards the optimal point of the objective function as for a fixed , but the proposal of an appropriate continuous model whose (continuous-time) solution is a good approximation to the sequence as .
The work in (Su et al., 2016) is a seminal work from this perspective: it uses continuous-time dynamical system tools to analyze existing discrete algorithms for optimization problems, modeling Nesterov’s accelerated gradient method. For applications to the ADMM, the recent work in (França et al., 2018) establishes the first deterministic continuous-time models in the form of ordinary differential equations (ODEs) for the smooth ADMM, and (Yuan et al., 2019) extends this to the non-smooth case via the differential inclusion model.
In this setting of continuous limit theory, a time duration is fixed first so that the continuous-time model is mainly considered in this time interval . Usually a small parameter (such as the step size) is identified with a correct scaling from the discrete algorithm, and used to partition the interval into windows. The iteration index in the discrete algorithm is labelled from to . The convergence of the discrete scheme to the continuous model means that, with the same initial , for any , as , the error between and , measured in a certain sense, converges to zero for any .
This continuous viewpoint and formulation has been successful for both deterministic and stochastic optimization algorithms in machine learning (E et al., 2019). The works in (Li et al., 2017, 2019) rigorously present the mathematical connection between the Ito stochastic differential equation (SDE) and stochastic gradient descent (SGD) with a step size . More precisely, for any small but finite , the corresponding stochastic differential equation carries a small parameter in its diffusion terms and is called the stochastic modified equation (SME), a name inherited from the numerical analysis of differential equations. The convergence between and is then formulated in the weak sense. This SME technique, originally arising from the numerical analysis of SDEs (Kloeden & Platen, 2011), is the major mathematical tool for most stochastic or online algorithms.
We demonstrate how to use mathematical tools like the stochastic modified equation (SME) and asymptotic expansion to study the dynamics of stochastic ADMM in the small step-size regime (the step size for ADMM is ).
We present a unified framework for variants of the stochastic version of ADMM, linearized ADMM, and gradient-based ADMM, and present a unified stochastic differential equation as their continuous-time limit under weak convergence.
We are the first to show that the drift term of the stochastic differential equation is the same as in the previous ordinary differential equation models.
We are the first to show that the standard deviation of the solution paths has the scaling. Moreover, we can even accurately compute the continuous limit of the time evolution of , and for the residual . The joint fluctuation of is a new phenomenon that has not been studied in previous works on continuous-time analysis of stochastic gradient descent type algorithms.
From our stochastic differential equation analysis, we could derive useful insights for practical improvements that are not clear without the continuous-time model. For example, we are able to precisely compute the diffusion-fluctuation trade-off, which would enable us to decide when to decrease step-size and increase batch size to accelerate convergence of stochastic ADMM.
1.5 Notations and Assumptions
We use to denote the Euclidean two-norm if the subscript is not specified, and all vectors are regarded as column vectors. , and , refer to the first (gradient) and second (Hessian) derivatives w.r.t. .
The first assumption is Assumption I: , and for each , , , are closed proper convex functions; has full column rank.
Let denote the set of functions of at most polynomial growth, i.e., if there exist constants , > 0 such that
, and are differentiable and the second order derivatives are uniformly bounded in , and almost surely in for . is uniformly bounded in .
, , and the partial derivatives up to order belong to and for , this is understood almost surely in , i.e., the constants , in (7) do not depend on .
and satisfy a uniform growth condition: for a constant independent of .
2 Weak Approximation to Stochastic ADMM
In this section, we show the weak approximation to the stochastic ADMM (3) and the general family of stochastic ADMM variants (5). Appendix A summarizes the background of the weak approximation and the stochastic modified equation for interested readers.
Given the noisy gradient and its expectation , we define the following matrix by
Theorem 1 (SME for sADMM).
Consider the standard stochastic ADMM without relaxation (3) with . Let . denote the sequence of stochastic ADMM with the initial choice .
Define as a stochastic process satisfying the SDE
where and the diffusion matrix is defined by (8). Then we have with the weak convergence of order .
Sketch of proof.
The ADMM scheme is in the form of an iteration of the triplet where . But by the first order optimality condition for the -subproblem and the -subproblem, we have for whatever input triplet . Thus, the variable is faithfully replaced by . The remaining goal is to further replace the variable by the variable , so that the ADMM iteration is approximately reduced to an iteration in the variable only. This is indeed possible because of the critical observation (Proposition 7) that the residual has second order smallness, belonging to , if . Thus, ADMM is transformed into the one-step iteration form (20) in the variable only, with . The conclusion then follows by directly checking the conditions (23) in Theorem 5. ∎
Our main theorem is for the G-sADMM scheme which contains the relaxation parameter , the proximal parameter and the implicitness parameters .
Theorem 2 (SME for G-sADMM).
Let , and . Let . denote the sequence of stochastic ADMM (5) with the initial choice .
Define as a stochastic process satisfying the SDE
where the matrix
Then we have in weak convergence of order , with the following precise meaning.
For any time interval and for any test function such that and its partial derivatives up to order belong to , there exists a constant such that
Sketch of proof.
The idea of this proof is similar to that in Theorem 1 even with the introduction of parameters. But for the relaxation parameter when , we need to overcome a substantial challenge. If , then the residual is now only at order , not . In the proof, we propose a new -residual and show that it is indeed as small as (Proposition 9) to solve this challenge. The difference between and the -residual thus induces the extra -term in the new coefficient matrix in (11). ∎
The rigorous proof is in Appendix B.
We do not present a simple form of the SME as a second order weak approximation, as is done for the SGD scheme, due to the complicated issue of the residuals. In addition, the proof requires a regularity condition on the functions and ; at least needs to have third order derivatives of . So, our theorems cannot cover the non-smooth function . Our numerical tests suggest that the conclusion also holds for the regularization function .
In general applications, it is very difficult to get the expression of the variance matrix as a function of , except in a few simplified cases. In applications of empirical risk minimization, the function is the empirical average of the loss on each sample : . The diffusion matrix in (8) then takes the following form
It is clear that if with iid samples , then as .
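As an illustration of how such an empirical diffusion matrix can be evaluated in practice, the sketch below computes the sample covariance of per-sample gradients at a fixed point. It assumes the 1/N-normalized covariance of the per-sample gradients around their mean; the least-squares loss, the four data points and the helper names (`empirical_gradient_covariance`, `least_squares_grad`) are hypothetical and not taken from this paper.

```python
def empirical_gradient_covariance(grads):
    """Return (mean gradient, covariance matrix) of a list of gradient vectors.

    The covariance uses the 1/N normalization (an assumption):
    (1/N) * sum_i (g_i - g_bar)(g_i - g_bar)^T.
    """
    n = len(grads)
    d = len(grads[0])
    mean = [sum(g[j] for g in grads) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for g in grads:
        dev = [g[j] - mean[j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += dev[i] * dev[j] / n
    return mean, cov


def least_squares_grad(x, a, y):
    """Gradient in x of the per-sample loss 0.5 * (a^T x - y)^2 (illustrative loss)."""
    r = sum(ai * xi for ai, xi in zip(a, x)) - y
    return [r * ai for ai in a]


# Hypothetical data set of four (feature vector, label) pairs.
samples = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 0.0), ([1.0, -1.0], 2.0)]
x = [0.0, 0.0]
grads = [least_squares_grad(x, a, y) for a, y in samples]
mean_grad, sigma_n = empirical_gradient_covariance(grads)
```

The resulting matrix is symmetric and positive semi-definite by construction. In higher dimensions the cost of forming it and of taking its matrix square root grows quickly, which is one practical reason to keep such experiments small.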
The stochastic scheme (5) is the simplest form, using only one instance of the gradient in each iteration. If a batch size larger than one is used, then the one-instance gradient is replaced by the average , where is the batch size and are iid samples. Under these settings, should be multiplied by a factor , where the continuous-time function is the linear interpolation of at times . The stochastic modified equation (10) is then in the following form
Based on the SME above, we can find the stochastic asymptotic expansion of
See Chapter 2 in (Freidlin & Wentzell, 2012) for rigorous justification. is deterministic as the gradient flow of the deterministic problem: , and are stochastic and satisfy certain SDEs independent of . The useful conclusion is that the standard deviation of , mainly coming from the term , is . Hence, the standard deviation of the stochastic ADMM is and, more importantly, the two rescaled standard deviations and are close as functions of the time .
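The scaling of the standard deviation can be illustrated on a toy linear SDE. The sketch below assumes the usual small-noise form in which the diffusion term carries a factor sqrt(eps), as in the SME for SGD in (Li et al., 2017), and uses the hypothetical model dX = -kappa * X dt + sqrt(eps) * sigma dW with X(0) = 0, simulated by the Euler-Maruyama method. It is not the SME of this paper; the function name `rescaled_std` and all parameter values are illustrative.

```python
import math
import random


def rescaled_std(eps, kappa=1.0, sigma=1.0, T=1.0, dt=0.01, n_paths=2000, seed=1):
    """Monte Carlo estimate of eps**(-1/2) * std(X_T) for the toy linear SDE."""
    rng = random.Random(seed)
    n_steps = int(round(T / dt))
    noise_scale = math.sqrt(eps) * sigma * math.sqrt(dt)
    finals = []
    for _ in range(n_paths):
        x = 0.0
        for _ in range(n_steps):
            # Euler-Maruyama step: drift * dt + diffusion * sqrt(dt) * N(0, 1)
            x += -kappa * x * dt + noise_scale * rng.gauss(0.0, 1.0)
        finals.append(x)
    mean = sum(finals) / n_paths
    var = sum((v - mean) ** 2 for v in finals) / (n_paths - 1)
    return math.sqrt(var) / math.sqrt(eps)


# Exact rescaled std at T = 1 for kappa = sigma = 1: sqrt((1 - exp(-2)) / 2).
exact = math.sqrt((1.0 - math.exp(-2.0)) / 2.0)
r_coarse = rescaled_std(eps=0.04, seed=1)
r_fine = rescaled_std(eps=0.01, seed=2)
```

For this linear model the variance solves v' = -2 * kappa * v + eps * sigma^2, so the rescaled standard deviation does not depend on eps. The two Monte Carlo estimates should therefore agree with each other and with `exact` up to sampling and time-discretization error.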
We can investigate the fluctuation of the sequence generated by the stochastic ADMM. The approach is to study the modified equation of its continuous version first. Since the residual is on the order shown in the appendix (Propositions 6 and 7), we have the following result.
There exists a deterministic function such that
where is the solution to the SME in Theorem 2 and is a weak approximation to with the order 1.
In addition, we have the following asymptotic for :
where satisfies .
The standard deviation of is on the order .
Recall the residual and in view of Corollary 10 in the appendix, we have the following result that there exists a function such that
and the residual is a weak approximation to with the order 1. If in the G-sADMM (5), then the expectation and standard deviation of and are both at order . If in the G-sADMM (5), then the expectation and standard deviation of and are only at order .
3 Numerical Examples
Example 1: one dimensional example. In this simple example, the dimension . Consider , where is a Bernoulli random variable taking values or with equal probability. We test and . The matrix . These settings satisfy the assumptions in our main theorem. We choose such that . The SME when is . The choice of the initial guess is and . The terminal time is fixed.
Figure 2 shows the match of the expectation and the standard deviation of the sequence of stochastic ADMM and of the SME with . Furthermore, we plot random trajectories from both models in Figure 3, which shows that the fluctuation in the sADMM is well captured by the SME model.
The acceleration effect of for the deterministic ADMM has been shown in (Yuan et al., 2019). Figure 1 confirms the same effect both for smooth and non-smooth for the expectation of the solution sequence .
The SME not only provides the expectation of the solution, but also the fluctuation of the numerical solution for any given . Figure 2 compares the mean and standard deviation (“std”) between and at . The right vertical axis is the value of the standard deviation, and the two curves are very close. In addition, with the same setting, a few hundred trajectory samples are shown together in Figure 3, which illustrates the match both in the mean and in the std between the stochastic ADMM and the SME.
To verify our theorem on the convergence order, a test function is used for the test of the weak convergence error:
For each , set , so and . Figure 4 shows the error versus in the semi-log plot for three values of relaxation parameter . The first order convergence rate is verified.
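The same kind of convergence check can be reproduced exactly on a simpler stand-in. The sketch below does not use the ADMM scheme of this paper: it replaces it with plain SGD on the hypothetical loss f(x, xi) = (x - xi)^2 / 2, where xi is +1 or -1 with equal probability, and compares it with the first order SME dX = -X dt + sqrt(eta) dW of (Li et al., 2017). For the assumed test function phi(x) = x + x^2, both expectations follow from closed-form moment recursions, so no sampling is needed.

```python
import math


def weak_error(eta, T=1.0, x0=1.0):
    """|E phi(x_N) - E phi(X_T)| for phi(x) = x + x^2, with N = T / eta.

    Discrete side: SGD x_{k+1} = (1 - eta) * x_k + eta * xi_k, E xi = 0, E xi^2 = 1.
    Continuous side: dX = -X dt + sqrt(eta) dW (toy first order SME).
    """
    n = int(round(T / eta))
    m, s = x0, x0 * x0                 # first and second moments of the iterate
    for _ in range(n):
        s = (1.0 - eta) ** 2 * s + eta ** 2
        m = (1.0 - eta) * m
    t = n * eta
    m_sde = x0 * math.exp(-t)          # E X_t
    s_sde = x0 * x0 * math.exp(-2.0 * t) + 0.5 * eta * (1.0 - math.exp(-2.0 * t))  # E X_t^2
    return abs((m + s) - (m_sde + s_sde))


err_coarse = weak_error(0.01)
err_fine = weak_error(0.005)
ratio = err_coarse / err_fine          # close to 2 for a first order weak approximation
```

Halving the step size should roughly halve the error, which is the signature of first order weak convergence.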
We also numerically investigated the convergence rate for the non-smooth penalty , even though this regularization function does not satisfy our assumptions. The diffusion term is still the same as in the case since is deterministic. For the corresponding SDE, at least formally, we can write , by using the sign function as . The rigorous meaning needs the concept of stochastic differential inclusion, which is out of the scope of this work. The numerical results in Figure 4 show that the weak convergence order also holds in this case.
Finally, we test the orders for the standard deviation of and . The consistency of with the SME’s has been shown in Figure 2. The theoretical prediction is that both are at order . We plot the sequences of and for various . These two quantities should be the same regardless of and depend only on , which is confirmed by Figure 5.
For the residual, the theoretic prediction is that both and are on the order if . We plot , , against the time in Figure 6 and Figure 8, respectively. For the stochastic ADMM scheme with , the numerical test shows that and are on the order .
Example 2: generalized ridge and lasso regression
We perform experiments on the generalized ridge regression.
where (ridge regression) or (lasso regression), with a constant . is a penalty matrix specifying the desired structured pattern of . Among the random , is the zero-mean random (column) vector with uniform distribution in the hypercube with independent components. The labelled data , where is a given vector and is the zero-mean measurement noise, independent of . The analytic expression of the matrix-valued function is available based on the fourth-order moments of .
We use a batch size for the stochastic ADMM ( is used in experiments). Then the corresponding SME for the ridge regression problem is
The SME for the lasso regression (formally) is
The direct simulation of these stochastic equations has a high computational burden because of the complexity of matrix square root for . So, our tests are only restricted to the dimension .
Set is the Hilbert matrix multiplied by . . . The vector is set as . The initial is the zero vector. .
In algorithms, set . We choose the test function . Denote where are the sequence computed from the (unified) stochastic ADMM with the batch size . Denote where is the solution of the SME.
In this paper, we have used the stochastic modified equation (SME) to analyze the dynamics of stochastic ADMM in the large limit (i.e., the small step-size limit). It is a first order weak approximation to a general family of stochastic ADMM algorithms, including the standard, linearized and gradient-based ADMM with relaxation .
Our new continuous-time analysis is the first continuous-time analysis of the stochastic version of ADMM. It faithfully captures the fluctuation of the stochastic ADMM solution and provides a mathematically clear and insightful way to understand the dynamics of stochastic ADMM algorithms.
It is a substantial complement to the existing ODE-based continuous-time analysis (França et al., 2018; Yuan et al., 2019) for the deterministic ADMM. It is also an important milestone for understanding the continuous-time limit of stochastic algorithms other than stochastic gradient descent (SGD), as we observed new phenomena like the joint fluctuation of , and . We provide solid numerical experiments verifying our theory on several examples, including smooth functions like quadratic functions and non-smooth functions like the norm.
5 Future Work
There are a few natural directions to explore further in the future.
First, on the theoretical side, for simplicity of analysis, we derived our mathematical proof based on the smoothness of and . As we observed empirically, for non-smooth functions like the norm, our continuous-time limit framework would lead to a stochastic differential inclusion. A natural follow-up of this work would be to develop formal mathematical tools of stochastic differential inclusion to extend our proof to non-smooth functions.
Second, from our stochastic differential equation, we could develop practical rules to choose adaptive step-size and batch size by precisely computing the optimal diffusion-fluctuation trade-off to accelerate convergence of stochastic ADMM.
- Boyd et al. (2011) Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J., et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning, 3(1):1–122, 2011.
- E et al. (2019) E, W., Ma, C., and Wu, L. Machine learning from a continuous viewpoint. 2019.
- Eckstein & Bertsekas (1992) Eckstein, J. and Bertsekas, D. P. On the Douglas–Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.
- França et al. (2018) França, G., Robinson, D. P., and Vidal, R. ADMM and accelerated ADMM as continuous dynamical systems. In Proceedings of the 35th International Conference on Machine Learning, pp. 1559–1567, 2018.
- Freidlin & Wentzell (2012) Freidlin, M. I. and Wentzell, A. D. Random Perturbations of Dynamical Systems. Grundlehren der mathematischen Wissenschaften. Springer-Verlag, New York, 3 edition, 2012.
- Goldfarb et al. (2013) Goldfarb, D., Ma, S., and Scheinberg, K. Fast alternating linearization methods for minimizing the sum of two convex functions. Mathematical Programming, 141(1-2):349–382, 2013.
- Huang et al. (2019) Huang, F., Chen, S., and Huang, H. Faster stochastic alternating direction method of multipliers for nonconvex optimization. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 2839–2848, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/huang19a.html.
- Kloeden & Platen (2011) Kloeden, P. and Platen, E. Numerical Solution of Stochastic Differential Equations. Stochastic Modelling and Applied Probability. Springer, New York, corrected edition, 2011. ISBN 9783662126165. URL https://books.google.com.hk/books?id=r9r6CAAAQBAJ.
- Li et al. (2017) Li, Q., Tai, C., and E, W. Stochastic modified equations and adaptive stochastic gradient algorithms. In 34th International Conference on Machine Learning, ICML 2017, pp. 3306–3340. International Machine Learning Society (IMLS), 1 2017.
- Li et al. (2019) Li, Q., Tai, C., and E, W. Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations. Journal of Machine Learning Research, 20(40):1–47, 2019.
- Milstein (1995) Milstein, G. Numerical Integration of Stochastic Differential Equations, volume 313 of Mathematics and Its Applications. Springer, 1995. ISBN 9780792332138. URL https://books.google.com.hk/books?id=o2y8Or_a4W0C.
- Milstein (1986) Milstein, G. N. Weak approximation of solutions of systems of stochastic differential equations. Theory of Probability & Its Applications, 30(4):750–766, 1986. doi: 10.1137/1130095. URL https://doi.org/10.1137/1130095.
- Ouyang et al. (2013) Ouyang, H., He, N., Tran, L., and Gray, A. Stochastic alternating direction method of multipliers. In Dasgupta, S. and McAllester, D. (eds.), Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pp. 80–88, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR. URL http://proceedings.mlr.press/v28/ouyang13.html.
- Su et al. (2016) Su, W., Boyd, S., and Candes, E. J. A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights. Journal of Machine Learning Research, 17(153):1–43, 2016.
- Suzuki (2013) Suzuki, T. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In International Conference on Machine Learning, pp. 392–400, 2013.
- Wang & Banerjee (2012) Wang, H. and Banerjee, A. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, pp. 1699–1706, 10 2012. ISBN 9781450312851.
- Yuan et al. (2019) Yuan, H., Zhou, Y., Li, C. J., and Sun, Q. Differential inclusions for modeling nonsmooth ADMM variants: A continuous limit theory. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 7232–7241, Long Beach, California, USA, 09–15 Jun 2019. PMLR. URL http://proceedings.mlr.press/v97/yuan19c.html.
- Zhong & Kwok (2014) Zhong, W. and Kwok, J. Fast stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pp. 46–54, 2014.
Appendix A Weak Approximation and Stochastic Modified Equations
We introduce and review the concepts for the weak approximation and the stochastic modified equation.
Definition 4 (weak convergence).
We say the family (parametrized by ) of the stochastic sequence , , weakly converges to (or is a weak approximation to), a family of continuous-time Ito processes with the order if they satisfy the following conditions: For any time interval and for any test function such that and its partial derivatives up to order belong to , there exists a constant and such that for any ,
The constant in the above inequality and , independent of , may depend on and . For the conventional applications to numerical methods for SDEs (Milstein, 1995), may not depend on ; for the stochastic modified equation in our problem, does depend on . We drop the subscript in and for notational ease whenever there is no ambiguity.
The idea of using the weak approximation and the stochastic modified equation was originally proposed by (Li et al., 2017), and is based on an important theorem due to (Milstein, 1986). In brief, Milstein’s theorem links the one step difference, which has been detailed above, to the global approximation in the weak sense, by checking three conditions on the moments of the one step difference. Since we only consider the first order weak approximation, Milstein’s theorem is introduced in a simplified form below for only . The more general situations can be found in Theorem 5 in (Milstein, 1986), Theorem 9.1 in (Milstein, 1995) and Theorem 14.5.2 in (Kloeden & Platen, 2011).
Let the stochastic sequence be recursively defined by the iteration written in the form associated with a function :
where are iid random variables. . Define the one step difference . We use the parenthetical subscript to denote the dimensional components of a vector like .
Assume that there exists a function such that satisfies the bounds of the fourth moment
for any component indices and any ,
For any arbitrary , consider the family of the Ito processes defined by a stochastic differential equation whose noise depends on the parameter ,
Theorem 5 (Milstein’s weak convergence theorem).
If there exist a constant and a function such that the following conditions on the first three moments of the error
hold for any and any , then weakly converges to with the order 1.
For the SDE (22) at the small noise , by the Ito-Taylor expansion, it is well-known that and and for all integers and the component index . Refer to (Kloeden & Platen, 2011) and Lemma 1 in (Li et al., 2017). So, the main recipe for applying Milstein’s theorem is to examine the conditions on the moments of the discrete sequence .
One prominent work (Li et al., 2017) uses the SME as a weak approximation to understand the dynamical behaviour of stochastic gradient descent (SGD). The key advantage of this technique is that the fluctuation in the SGD iteration can be well captured by the fluctuation in the SME. Here is the brief result. For the composite minimization problem
the SGD iteration is with the step size , then by Theorem 5, the corresponding SME of first order approximation is
with . Details can be found in (Li et al., 2017). The SGD here is analogous to the forward-time Euler-Maruyama approximation since .
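The analogy with the Euler-Maruyama scheme can be checked at the level of a single step. The sketch below assumes the SGD update x - eta * grad f(x, xi), the first order SME with drift equal to the negative mean gradient and diffusion sqrt(eta) times the gradient standard deviation, and an Euler-Maruyama step of size dt = eta. The quadratic loss f(x, xi) = (x - xi)^2 / 2, the two-point sample distribution and the function names are hypothetical.

```python
def sgd_step_moments(x, eta, xis):
    """Exact mean and variance of one SGD step when xi is uniform on `xis`."""
    steps = [x - eta * (x - xi) for xi in xis]
    mean = sum(steps) / len(steps)
    var = sum((s - mean) ** 2 for s in steps) / len(steps)
    return mean, var


def em_step_moments(x, eta, xis):
    """Mean and variance of one Euler-Maruyama step (dt = eta) of the toy SME."""
    grads = [x - xi for xi in xis]
    drift = sum(grads) / len(grads)                           # mean gradient at x
    sigma2 = sum((g - drift) ** 2 for g in grads) / len(grads)  # gradient variance at x
    mean = x - drift * eta
    # Noise increment sqrt(eta) * sigma * sqrt(dt) * N(0, 1) has variance eta * sigma^2 * dt.
    var = eta * sigma2 * eta
    return mean, var


sgd_mean, sgd_var = sgd_step_moments(0.3, 0.1, (0.0, 2.0))
em_mean, em_var = em_step_moments(0.3, 0.1, (0.0, 2.0))
```

The first two moments of the two one-step updates coincide, which is the sense in which the SGD step mirrors a forward Euler-Maruyama step of its SME. Higher moments generally differ, since the SGD noise need not be Gaussian.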
Appendix B Proof of main theorems
The one step difference is important for establishing the weak convergence of the discrete scheme (5). The question is, for one single iteration from step to step , what is the order of the change of the states . For notational ease, we drop the random variable in the scheme (5); readers should bear in mind that and its derivatives involve .
for any integer . The optimality conditions (25) now can be written only in the variables :