1 Introduction and problem statement
First introduced in [4], the Douglas-Rachford splitting technique has become popular in recent years due to its fast theoretical convergence rates and strong practical performance. The method was first proposed to address the minimization of the sum of two functions, and was extended in [11] to handle problems involving the sum of two nonlinear monotone operators. For further developments, see [7, 2, 3]. However, most of these variants implicitly assume full accessibility of all data values, while in reality the size of data is rapidly increasing in various domains; batch-mode learning procedures cannot cope with huge training sets, since the data may not fit into memory simultaneously. Furthermore, batch learning cannot start until the training data are fully prepared, and hence cannot effectively deal with training data that arrive in sequence, such as in audio and video processing [15]. In such situations, sequential learning becomes a powerful tool. Online learning is one of the most promising approaches to large-scale machine learning tasks these days [20, 18]. Important advances have been made on sequential learning in the recent literature on similar problems. Composite objective mirror descent (COMID) [5] generalizes mirror descent [1] to the online setting. Regularized dual averaging (RDA) [19] generalizes dual averaging [13] to online and composite optimization, and can be used for distributed optimization [6]. Online alternating direction method of multipliers (ADMM) [16], RDA-ADMM [16] and online proximal gradient (OPG) ADMM [18] generalize the classical ADMM [8] to online and stochastic settings.
Our focus in this paper is to generalize the Douglas-Rachford splitting to online settings. In this work, we consider problems of the following form:

(1.1)  $\min_x \; f(x) + g(x)$

where $f$ is a convex loss function associated with the samples in a training set, and $g$ is a nonsmooth convex penalty function or regularizer. Many problems of relevance in signal processing and machine learning can be formulated as the above optimization problem, including ridge regression, the lasso [17], and logistic regression.
For Problem (1.1), the Douglas-Rachford splitting algorithm approximates a minimizer of (1.1) with the help of the following sequence:

(1.2)  $z^{k+1} = z^k + \operatorname{prox}_{\beta g}\big(2\operatorname{prox}_{\beta f}(z^k) - z^k\big) - \operatorname{prox}_{\beta f}(z^k)$

where $\beta$ satisfies $\beta > 0$, and the proximal mapping of a convex function $h$ at $a$ is

(1.3)  $\operatorname{prox}_{\beta h}(a) = \operatorname*{arg\,min}_x \big\{ h(x) + \tfrac{1}{2\beta}\|x - a\|^2 \big\}$
Thus the iterative scheme of Douglas-Rachford splitting for the problem (1.1) is as follows:

(1.4)  $x^{k+1} = \operatorname{prox}_{\beta f}(z^k)$
(1.5)  $y^{k+1} = \operatorname{prox}_{\beta g}(2x^{k+1} - z^k)$
(1.6)  $z^{k+1} = z^k + y^{k+1} - x^{k+1}$

where $z^k$ converges weakly to some point $z^\star$ (in a finite-dimensional Hilbert space, weak convergence is equivalent to strong convergence), thus also $x^k \to x^\star = \operatorname{prox}_{\beta f}(z^\star)$, for $\operatorname{prox}_{\beta f}$ is continuous, and $x^\star$ is a solution to Problem (1.1).
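As a minimal runnable sketch (not part of the paper), the scheme (1.4)–(1.6) can be checked on the scalar toy problem $\min_x \tfrac12(x-a)^2 + \lambda|x|$, where both proximal mappings have closed forms; all names and parameter values below are illustrative:

```python
import math

def drs_1d(a=3.0, lam=1.0, beta=1.0, iters=100):
    """Douglas-Rachford splitting for min_x 0.5*(x - a)^2 + lam*|x| (scalar x)."""
    def prox_f(z):
        # prox of the quadratic: argmin_x 0.5*(x - a)^2 + (1/(2*beta))*(x - z)^2
        return (z + beta * a) / (1.0 + beta)

    def prox_g(v):
        # prox of beta*lam*|x|: soft-thresholding
        return math.copysign(max(abs(v) - beta * lam, 0.0), v)

    z = 0.0
    x = 0.0
    for _ in range(iters):
        x = prox_f(z)          # x update, cf. (1.4)
        y = prox_g(2 * x - z)  # y update, cf. (1.5)
        z = z + y - x          # z update, cf. (1.6)
    return x
```

For this toy problem the minimizer is the soft-thresholded point $\operatorname{sign}(a)\max(|a|-\lambda,0)$, so the iterate can be compared against it directly.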
The only modification of the splitting that we propose for online processing is simple:

(1.7)  $x^{t+1} = \operatorname{prox}_{\beta f_t}(z^t) = \operatorname*{arg\,min}_x \big\{ f_t(x) + \tfrac{1}{2\beta}\|x - z^t\|^2 \big\}$

We call this method online DRs (oDRs). Due to the possibly complex loss function $f_t$, the update (1.7) is in general difficult to solve efficiently. A common way is to linearize the objective such that

(1.8)  $x^{t+1} = \operatorname*{arg\,min}_x \big\{ \langle \nabla f_t(x^t), x \rangle + \tfrac{1}{2\beta}\|x - z^t\|^2 \big\} = z^t - \beta \nabla f_t(x^t)$

which is called inexact oDRs (ioDRs).
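To make the distinction concrete, here is a hedged sketch (not from the paper) of the two $x$ updates for the simple quadratic loss $f_t(x) = \tfrac12(x - a_t)^2$, for which the exact prox is available in closed form and the linearized step is a single gradient step; the function names are illustrative:

```python
def odrs_x_update(z, a_t, beta):
    """Exact prox step, cf. (1.7), for f_t(x) = 0.5*(x - a_t)^2."""
    # argmin_x 0.5*(x - a_t)^2 + (1/(2*beta))*(x - z)^2
    return (z + beta * a_t) / (1.0 + beta)

def iodrs_x_update(z, x_prev, a_t, beta):
    """Linearized (inexact) step, cf. (1.8): a single gradient step from z."""
    grad = x_prev - a_t  # gradient of f_t at the previous iterate
    return z - beta * grad
```

With `z = 0`, `a_t = 2` and `beta = 1`, the exact prox returns 1.0 while the linearized step from `x_prev = 0` returns 2.0; the gap is the price of linearization in a single round.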
ioDRs can also be derived from another point of view, based on the proximal gradient method [12]. The proximal gradient method uses the proximal mapping of the nonsmooth part to minimize composite functions of the form (1.1) [10]:

(1.9)   $y^{k+1} = x^k - \eta_k \nabla f(x^k)$
(1.10)  $x^{k+1} = \operatorname{prox}_{\eta_k g}(y^{k+1})$

where $\eta_k$ denotes the $k$th step length. The online PG (OPG) is straightforward: at round $t$ it solves the following optimization problem, with only the $t$th loss function linearized [19, 6, 16]:

(1.11)  $x^{t+1} = \operatorname*{arg\,min}_x \big\{ \langle \nabla f_t(x^t), x \rangle + g(x) + \tfrac{1}{2\eta_t}\|x - x^t\|^2 \big\}$

Then ioDRs can be seen as a combination of OPG with DRs.
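Under the same illustrative assumptions as before (scalar iterates, quadratic per-round loss, absolute-value regularizer; not the paper's code), one OPG round can be sketched as a gradient step on the current loss followed by the prox of $g$:

```python
import math

def soft(v, kappa):
    """Soft-thresholding, the prox of kappa*|.|."""
    return math.copysign(max(abs(v) - kappa, 0.0), v)

def opg_round(x, a_t, lam=1.0, eta=0.5):
    """One OPG round for f_t(x) = 0.5*(x - a_t)^2 and g(x) = lam*|x|."""
    grad = x - a_t                          # gradient of the t-th loss at x^t
    return soft(x - eta * grad, eta * lam)  # gradient step, then prox of g
```

Starting from `x = 0` with `a_t = 3`, the gradient step gives 1.5 and the prox shrinks it to 1.0.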
2 Convergence Analysis for DRs
It is clear that $x$ is a solution of Problem (1.1) if and only if $0 \in \partial f(x) + \partial g(x)$, which is equivalent to a fixed-point condition in terms of the proximal mappings. It is clear that $\partial f$ and $\partial g$ are two monotone set-valued operators [14], and the resolvent operators $(I + \beta\partial f)^{-1}$ and $(I + \beta\partial g)^{-1}$ are both single-valued. Thus if the loss functions are smooth, this immediately gives an accuracy measure, proposed in [9], of a vector to a solution of Problem (1.1) by

(2.1) 
It is shown in Theorem 3.1 of [9] that after iterations of (1.4), we have
(2.2) 
On the other hand, from , we have . Since and are convex functions, using their (sub)gradients, we have
(2.3)  
(2.4)  
(2.5) 
where is the subgradient of at satisfying , and is any subgradient of at .
Adding (2.3), (2.4), and (2.5) together yields
(2.6)  
(2.7) 
If and are both Lipschitz continuous, with Lipschitz constants and respectively, then we have and , where is any subgradient of at . Furthermore, is bounded, owing to its convergence. Thus we have and , and further we have (2.2), so the following convergence result holds.
Theorem 2.1.
Suppose is differentiable and both and are Lipschitz continuous. Let the sequence be generated by DRs. Then we have
(2.8) 
Remark 1.
Suppose is differentiable and is Lipschitz continuous with . Then we can obtain an explicit bound by using Lemma 3.1 of [9], which shows that is monotonically decreasing. Thus we have from the following inference:
(2.9)  
(2.10) 
Further from Theorem 3.1 in [9], we have
(2.11)  
(2.12) 
Then according to (2.7), we have
(2.13)  
(2.14) 
Thus we have the following corollary.
Corollary 2.2.
Suppose and are differentiable, and both and are Lipschitz continuous, with constants and respectively. Let the sequence be generated by DRs. Then we have
(2.15)  
(2.16) 
Remark 2.
Corollary 2.3.
Suppose both and are Lipschitz continuous, with constants and respectively. Let the sequence be generated by DRs. Then we have
(2.19)  
(2.20) 
3 Online Douglas-Rachford splitting method
The procedures of batch DRs, oDRs, and ioDRs are summarized in Algorithms 1, 2 and 3, respectively, where .
3.1 Regret Analysis for oDRs
The goal of oDRs is to achieve low regret w.r.t. a static predictor $x^\star$ on a sequence of functions

(3.1)  $f_1(x) + g(x),\; f_2(x) + g(x),\; \ldots,\; f_T(x) + g(x)$

Formally, at every round $t$ of the algorithm we make a prediction $x^t$ and then receive the function $f_t$. We seek bounds on the standard regret in the online learning setting with respect to $x^\star$, defined as

(3.2)  $R(T) = \sum_{t=1}^{T} \big[ f_t(x^t) + g(x^t) \big] - \sum_{t=1}^{T} \big[ f_t(x^\star) + g(x^\star) \big]$

In batch optimization we set $f_t = f$ for all $t$, while in stochastic optimization we choose $f_t$ to be the average of the losses over some random subset of the training samples.
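The regret definition can be checked numerically. The sketch below (illustrative, not from the paper) plays a plain online gradient method on a scalar stream and compares its cumulative loss against the best fixed predictor in hindsight, found here by a crude grid search:

```python
def regret_demo():
    """Regret of a plain online gradient method on a scalar stream."""
    stream = [1.0, 2.0, 0.5, 1.5, 3.0]
    lam = 0.1
    obj = lambda x, a: 0.5 * (x - a) ** 2 + lam * abs(x)  # f_t(x) + g(x)

    # Online play: predict x^t, pay f_t(x^t) + g(x^t), then step on f_t's gradient.
    x, eta, online_loss = 0.0, 0.5, 0.0
    for a in stream:
        online_loss += obj(x, a)
        x -= eta * (x - a)

    # Best fixed predictor in hindsight, via a crude grid search over [-5, 5].
    best = min(sum(obj(v, a) for a in stream)
               for v in [i / 1000.0 for i in range(-5000, 5001)])
    return online_loss - best  # nonnegative here: the comparator minimizes in hindsight
```

On this particular stream the online player pays about 4.125 in total while the best fixed point in hindsight (x = 1.5) pays 2.625, so the regret is about 1.5.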
As pointed out by Lemma 3.2 and Theorem 3.1 in [9], and with the notation in mind, we have in each iteration that
(3.3) 
(3.4) 
and
(3.5) 
which means that . Following the same procedure as (2.7), we have
(3.6) 
Adding the above formulas together for , we obtain the following result.
Theorem 3.1.
Suppose and are both Lipschitz continuous. Let the sequence be generated by oDRs. Then we have
(3.7) 
4 Computational experiments
In this section, we demonstrate the performance of oDRs and ioDRs in solving lasso and logistic regression problems. We present simulation results showing the convergence of the objective in oDRs and ioDRs, and we also compare them with batch DRs and OADM [18]. We set for all the updates of , and .
4.1 Lasso
The lasso problem is formulated as follows:

(4.1)  $\min_x \; \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1$

where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $\lambda > 0$ is a scalar. The three updates of DRs are:

(4.2)  $x^{k+1} = (A^\top A + \beta^{-1} I)^{-1} (A^\top b + \beta^{-1} z^k)$
(4.3)  $y^{k+1} = S_{\beta\lambda}(2x^{k+1} - z^k)$
(4.4)  $z^{k+1} = z^k + y^{k+1} - x^{k+1}$
(4.5)  $S_{\kappa}(a) = \operatorname{sign}(a)\max(|a| - \kappa,\, 0)$

where $S_{\kappa}$ denotes the soft-thresholding operator, applied elementwise. oDRs and ioDRs differ from DRs only in the update of $x$, which becomes:
(4.6) 
and
(4.7) 
respectively.
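As an illustrative special case (not from the paper), when the design matrix is the identity the least-squares $x$ update reduces to an elementwise average, and the whole DRs loop for the lasso can be sketched in a few lines; all names and parameters are assumptions:

```python
import math

def soft(v, kappa):
    """Soft-thresholding, the prox of kappa*|.|."""
    return math.copysign(max(abs(v) - kappa, 0.0), v)

def lasso_drs_identity(b, lam=1.0, beta=1.0, iters=100):
    """DRs for min_x 0.5*||x - b||^2 + lam*||x||_1 (lasso with identity design)."""
    z = [0.0] * len(b)
    x = list(z)
    for _ in range(iters):
        x = [(zi + beta * bi) / (1.0 + beta) for zi, bi in zip(z, b)]  # x update
        y = [soft(2 * xi - zi, beta * lam) for xi, zi in zip(x, z)]    # y update
        z = [zi + yi - xi for zi, yi, xi in zip(z, y, x)]              # z update
    return x
```

For this special case the known minimizer is the elementwise soft-threshold of the data: for `b = [3.0, -0.5, 2.0]` and `lam = 1` that is `[2.0, 0.0, 1.0]`, which the iterates approach.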
Our experiments mainly follow the lasso example in [18]. We first randomly generate a data matrix with 1000 examples of dimensionality 100, and normalize it along the columns. Then a true solution with a certain sparsity pattern is randomly generated for the lasso, and we set the number of nonzeros to 10. The response vector is calculated by adding Gaussian noise to the product of the data matrix and the true solution. We set and in OADM [18]. All experiments are implemented in Matlab.
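The data-generation recipe above can be sketched as follows (pure Python for illustration; the noise scale and seed are assumptions, and the paper's Matlab code may differ):

```python
import random

def generate_lasso_data(m=1000, n=100, nnz=10, noise=0.1, seed=0):
    """Synthetic lasso data: normalized design, sparse truth, noisy response."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
    # normalize each column to unit Euclidean norm
    for j in range(n):
        norm = sum(A[i][j] ** 2 for i in range(m)) ** 0.5
        for i in range(m):
            A[i][j] /= norm
    # sparse ground truth with nnz nonzero entries
    x_true = [0.0] * n
    for j in rng.sample(range(n), nnz):
        x_true[j] = rng.gauss(0.0, 1.0)
    # response = A x_true + Gaussian noise
    b = [sum(A[i][j] * x_true[j] for j in range(n)) + rng.gauss(0.0, noise)
         for i in range(m)]
    return A, x_true, b
```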
In Figure 1, the objective value of the problem is plotted against the number of iterations. In this example, oDRs and DRs show faster convergence than OADM and ioDRs. The main reason for the slow convergence of ioDRs is the linearization of the objective in each iteration (1.8). We observe that OADM requires even more iterations to achieve a given precision, although its regret bound is tighter than the bound obtained in this work for oDRs ( in OADM [18], while in oDRs). Thus we believe and conjecture that the regret in Theorem 3.1 is indeed .
4.2 Logistic regression
The logistic regression problem is formulated as follows:

(4.8)  $\min_x \; \sum_{i=1}^{m} \log\big(1 + \exp(-b_i a_i^\top x)\big) + \lambda \|x\|_1$

where the $a_i$ are samples with labels $b_i \in \{-1, +1\}$, the $\ell_1$ regularization term promotes sparse solutions, and $\lambda$ balances goodness-of-fit and sparsity.
The three updates of DRs are:

(4.9)   $x^{k+1} = \operatorname*{arg\,min}_x \big\{ \sum_{i=1}^{m} \log\big(1 + \exp(-b_i a_i^\top x)\big) + \tfrac{1}{2\beta}\|x - z^k\|^2 \big\}$
(4.10)  $y^{k+1} = S_{\beta\lambda}(2x^{k+1} - z^k)$
(4.11)  $z^{k+1} = z^k + y^{k+1} - x^{k+1}$
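Since the logistic $x$ update has no closed form, a natural workaround is an inner iterative solver. The sketch below (scalar unknown, plain inner gradient descent, illustrative names and parameters) is one such assumption-laden realization, not the paper's implementation:

```python
import math

def soft(v, kappa):
    """Soft-thresholding, the prox of kappa*|.|."""
    return math.copysign(max(abs(v) - kappa, 0.0), v)

def logistic_drs_1d(a, b, lam=0.1, beta=1.0, outer=200, inner=50, step=0.1):
    """DRs for min_x sum_i log(1 + exp(-b_i*a_i*x)) + lam*|x| (scalar x)."""
    def grad_f(x):
        # derivative of the logistic loss: sum_i -b_i*a_i / (1 + exp(b_i*a_i*x))
        return sum(-bi * ai / (1.0 + math.exp(bi * ai * x)) for ai, bi in zip(a, b))

    x, z = 0.0, 0.0
    for _ in range(outer):
        # x update: inner gradient descent on f(x) + (1/(2*beta))*(x - z)^2
        for _ in range(inner):
            x -= step * (grad_f(x) + (x - z) / beta)
        y = soft(2 * x - z, beta * lam)  # y update: soft-thresholding
        z = z + y - x                    # z update
    return y
```

For instance, with `a = [1.0, 1.0, 1.0]` and `b = [1, 1, -1]` the data favor a positive coefficient, and the iterates settle at a small positive value.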
While the $x$ update (4.9) has no closed-form solution in this case, it can be solved approximately by an inner iterative method.
References
 [1] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
 [2] Patrick L Combettes. Iterative construction of the resolvent of a sum of maximal monotone operators. J. Convex Anal, 16(4):727–748, 2009.
 [3] Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.
 [4] Jim Douglas and H. H. Rachford. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American Mathematical Society, 82(2):421–439, 1956.
 [5] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. In COLT, 2010.
 [6] John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 1564–1565. IEEE, 2012.
 [7] Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.
 [8] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
 [9] B. S. He and X. M. Yuan. On convergence rate of the Douglas-Rachford operator splitting method. Mathematical Programming, under revision, 2011.
 [10] Jason D Lee, Yuekai Sun, and Michael A Saunders. Proximal newtontype methods for minimizing composite functions. 2013.
 [11] Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
 [12] Yurii Nesterov. Gradient methods for minimizing composite objective function, 2007.
 [13] Yurii Nesterov. Primaldual subgradient methods for convex problems. Mathematical programming, 120(1):221–259, 2009.
 [14] R. Tyrrell Rockafellar. Convex Analysis, volume 28. Princeton University Press, 1997.

 [15] Ziqiang Shi, Jiqing Han, Tieran Zheng, and Shiwen Deng. Audio segment classification using online learning based tensor representation feature discrimination. IEEE Transactions on Audio, Speech, and Language Processing, 21(12):186–196, 2013.
 [16] Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 392–400, 2013.
 [17] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
 [18] Huahua Wang and Arindam Banerjee. Online alternating direction method. arXiv preprint arXiv:1206.6448, 2012.
 [19] Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. The Journal of Machine Learning Research, 11:2543–2596, 2010.
 [20] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 2003.