1 Introduction and problem statement
First introduced by Douglas and Rachford for the numerical solution of heat conduction problems, the Douglas-Rachford splitting technique has become popular in recent years owing to its fast theoretical convergence rates and strong practical performance. The method was originally proposed to address the minimization of the sum of two functions, and was later extended by Lions and Mercier to handle problems involving the sum of two nonlinear monotone operators. For further developments, see [7, 2, 3]. However, most of these variants implicitly assume full access to all data values. In reality, the size of data is growing rapidly in many domains, so a batch-mode learning procedure cannot handle a huge training set: the data often cannot be loaded into memory all at once. Moreover, batch learning cannot start until the training data are fully prepared, and hence cannot effectively handle training data that arrive in sequence, such as audio and video streams. In such situations, sequential learning becomes a powerful tool.
Online learning is one of the most promising approaches to large-scale machine learning tasks these days [20, 18]. Important advances have been made on sequential learning in the recent literature on similar problems. Composite objective mirror descent (COMID) generalizes mirror descent to the online setting. Regularized dual averaging (RDA) generalizes dual averaging to online and composite optimization, and can be used for distributed optimization. Online alternating direction method of multipliers (ADMM), RDA-ADMM, and online proximal gradient (OPG) ADMM generalize classical ADMM to online and stochastic settings.
Our focus in this paper is to generalize Douglas-Rachford splitting to the online setting. In this work, we consider problems of the form
$$\min_{w}\; f(w) + g(w), \tag{1.1}$$
where $f$ is a convex loss function associated with a sample in a training set, and $g$ is a non-smooth convex penalty function or regularizer. Many problems of relevance in signal processing and machine learning can be formulated as the above optimization problem; examples include ridge regression, the lasso, and logistic regression.
where $\lambda > 0$, and the proximal mapping of a convex function $h$ at $v$ is defined as
$$\operatorname{prox}_{\lambda h}(v) = \operatorname*{arg\,min}_{x}\Big\{ h(x) + \tfrac{1}{2\lambda}\|x - v\|^2 \Big\}.$$
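As a concrete illustration (a minimal sketch, not code from this paper), the proximal mappings of two common regularizers have simple closed forms: soft thresholding for the $\ell_1$ norm and shrinkage for the squared $\ell_2$ norm.

```python
import numpy as np

def prox_l1(v, lam):
    """prox of lam * ||x||_1 at v: componentwise soft thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_sq_l2(v, lam):
    """prox of (lam/2) * ||x||^2 at v: shrinkage toward the origin."""
    return v / (1.0 + lam)
```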
Thus the iterative scheme of Douglas-Rachford splitting for problem (1.1) is as follows:
$$x^{k+1} = \operatorname{prox}_{\lambda f}(z^{k}), \qquad y^{k+1} = \operatorname{prox}_{\lambda g}(2x^{k+1} - z^{k}), \qquad z^{k+1} = z^{k} + y^{k+1} - x^{k+1},$$
where $z^{k}$ converges weakly to some point $z^{\star}$ (in a finite-dimensional Hilbert space $\mathcal{H}$, weak convergence is equivalent to strong convergence; thus $x^{k} \to x^{\star} := \operatorname{prox}_{\lambda f}(z^{\star})$ as well, for $\operatorname{prox}_{\lambda f}$ is continuous), and $x^{\star}$ is a solution to problem (1.1).
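The splitting iteration can be sketched in a few lines (a hedged sketch with generic prox arguments; any step size $\lambda$ is assumed to be folded into the prox mappings, and this is not the paper's implementation):

```python
import numpy as np

def douglas_rachford(prox_f, prox_g, z0, n_iter=100):
    """Douglas-Rachford splitting for min_x f(x) + g(x)."""
    z = np.asarray(z0, dtype=float)
    x = z
    for _ in range(n_iter):
        x = prox_f(z)                # x-update
        y = prox_g(2.0 * x - z)      # reflected g-update
        z = z + y - x                # averaging correction
    return x

# Toy check: f(x) = (1/2)(x - a)^2 and g(x) = (1/2)(x - b)^2
# have prox_f(v) = (v + a)/2 and prox_g(v) = (v + b)/2 (lambda = 1);
# the minimizer of f + g is the midpoint (a + b)/2.
a, b = 0.0, 4.0
x_star = douglas_rachford(lambda v: (v + a) / 2, lambda v: (v + b) / 2,
                          np.array([0.0]))
```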
The only modification of the splitting that we propose for online processing is simple: at round $t$, replace $f$ by the current loss $f_t$,
$$x^{t+1} = \operatorname{prox}_{\lambda f_t}(z^{t}), \qquad y^{t+1} = \operatorname{prox}_{\lambda g}(2x^{t+1} - z^{t}), \qquad z^{t+1} = z^{t} + y^{t+1} - x^{t+1}.$$
We call this method online DRs (oDRs). Since the loss function $f_t$ may be complex, the $x$-update is in general difficult to solve efficiently. A common remedy is to linearize the loss around the current iterate, replacing $f_t(x)$ in the $x$-update by its first-order approximation plus a proximal term; the resulting method is called inexact oDRs (ioDRs).
ioDRs can also be derived from another point of view, based on the proximal gradient method. The proximal gradient method uses the proximal mapping of the nonsmooth part to minimize the composite function (1.1):
$$x^{k+1} = \operatorname{prox}_{\lambda g}\big(x^{k} - \lambda \nabla f(x^{k})\big).$$
Thus ioDRs can be seen as a combination of the online proximal gradient (OPG) step with DRs.
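A minimal sketch of the proximal gradient iteration (generic step size and callables; the helper names are ours):

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, step=1.0, n_iter=200):
    """Proximal gradient: x <- prox_{step*g}(x - step * grad_f(x))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = prox_g(x - step * grad_f(x), step)
    return x

# Toy check: min (1/2)(x - 3)^2 + |x|; the solution is soft(3, 1) = 2.
soft = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t, 0.0)
x_star = proximal_gradient(lambda x: x - 3.0, soft, np.array([0.0]))
```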
2 Convergence Analysis for DRs
It is clear that $x$ is a solution of problem (1.1) if and only if $0 \in \partial f(x) + \partial g(x)$, which is equivalent to $0 \in (A + B)(x)$, where $A = \partial f$ and $B = \partial g$. It is clear that $A$ and $B$ are two monotone set-valued operators, and the resolvent operators $(I + \lambda A)^{-1} = \operatorname{prox}_{\lambda f}$ and $(I + \lambda B)^{-1} = \operatorname{prox}_{\lambda g}$ are both single-valued. Thus, if the loss function is smooth, this immediately gives an accuracy measure, proposed in , of a vector with respect to the solutions of problem (1.1), by
On the other hand, since $f$ and $g$ are convex functions, using their (sub)gradients, we have
where is the subgradient of at satisfying , and is any subgradient of at .
If and are both Lipschitz continuous, with Lipschitz constants and respectively, then we have and , where is any subgradient of at . Furthermore, is bounded, since it converges. Thus we have , , and further we obtain (2.2); hence the following convergence result holds.
Suppose is differentiable and both and are Lipschitz continuous. Let the sequence be generated by DRs. Then we have
Suppose is differentiable and is Lipschitz continuous with . Then we can obtain an explicit bound by using Lemma 3.1 of , which shows that is monotonically decreasing. Thus we obtain the bound from the following inference:
Further from Theorem 3.1 in , we have
Then according to (2.7), we have
Thus we have the following corollary.
Suppose and are differentiable, and both and are Lipschitz continuous, with constants and respectively. Let the sequence be generated by DRs. Then we have
Suppose both and are Lipschitz continuous, with constants and respectively. Let the sequence be generated by DRs. Then we have
3 Online Douglas-Rachford splitting method
3.1 Regret Analysis for oDRs
The goal of oDRs is to achieve low regret with respect to a static predictor $x^{\star}$ on a sequence of functions $f_1 + g, \ldots, f_T + g$. Formally, at every round $t$ of the algorithm we make a prediction $x^{t}$ and then receive the function $f_t$. We seek bounds on the standard regret in the online learning setting with respect to $x^{\star}$, defined as
$$R(T) := \sum_{t=1}^{T}\big[f_t(x^{t}) + g(x^{t})\big] - \sum_{t=1}^{T}\big[f_t(x^{\star}) + g(x^{\star})\big].$$
In batch optimization we set $f_t = f$ for all $t$, while in stochastic optimization we choose $f_t$ to be the average of some random subset of the losses.
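For concreteness, the standard regret of a composite online method can be evaluated directly (a sketch with generic callables; the names are illustrative, not from the paper):

```python
def regret(losses, g, iterates, w_star):
    """Regret of iterates w_1..w_T against a fixed comparator w_star
    on the composite objectives f_t + g."""
    return sum(f_t(w_t) + g(w_t) - f_t(w_star) - g(w_star)
               for f_t, w_t in zip(losses, iterates))

# Two rounds with quadratic losses and no regularizer.
losses = [lambda w: w ** 2, lambda w: (w - 1.0) ** 2]
r = regret(losses, lambda w: 0.0, [0.0, 0.0], 0.5)
```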
As pointed out in Lemma 3.2 and Theorem 3.1 of , and with the notation in mind, we have in each iteration that
which means that . Following the same procedure as (2.7), we have
Adding the above formulas together for , we obtain the following result.
We conjecture that
Suppose and are both Lipschitz continuous. Let the sequence be generated by oDRs. Then we have
4 Computational experiments
In this section, we demonstrate the performance of oDRs and ioDRs in solving lasso and logistic regression problems. We present simulation results showing the convergence of the objective in oDRs and ioDRs, and we also compare them with batch DRs and OADM . We set for all the updates of , and .
The lasso problem is formulated as follows:
$$\min_{x}\; \tfrac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1,$$
where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^{m}$, and $\lambda > 0$ is a scalar. The three updates of DRs are:
where and . oDRs and ioDRs differ from DRs only in the update of , which becomes:
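The batch DRs updates for the lasso can be sketched as follows (a hedged sketch: the quadratic prox is written as a linear solve and the $\ell_1$ prox as soft thresholding; the variable names are ours, not necessarily the paper's):

```python
import numpy as np

def dr_lasso(A, b, lam, eta=1.0, n_iter=500):
    """Batch Douglas-Rachford for min_x (1/2)||Ax - b||^2 + lam * ||x||_1."""
    n = A.shape[1]
    M = np.eye(n) + eta * (A.T @ A)   # matrix of the quadratic prox
    Atb = A.T @ b
    z = np.zeros(n)
    x = np.zeros(n)
    for _ in range(n_iter):
        x = np.linalg.solve(M, z + eta * Atb)                     # prox of eta*f
        v = 2.0 * x - z
        y = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)   # prox of eta*g
        z = z + y - x
    return x
```

In practice one would cache a factorization of the matrix $M$ (e.g. a Cholesky factor), since it does not change across iterations.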
Our experiments mainly follow the lasso example in . We first randomly generate a data matrix with 1000 examples of dimensionality 100, and then normalize it along the columns. Next, a ground-truth solution with a certain sparsity pattern is randomly generated for the lasso; we set the number of nonzeros to 10. The response vector is calculated by adding Gaussian noise to , where is the number of examples. We set and in OADM . All experiments are implemented in Matlab.
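The data generation described above can be sketched as follows (a sketch; the noise level and random seed are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 1000, 100, 10            # examples, dimension, nonzeros

A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)     # normalize along the columns
x_true = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x_true[support] = rng.standard_normal(k)
b = A @ x_true + 0.01 * rng.standard_normal(m)   # add Gaussian noise
```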
In Figure 1, the objective value of the problem is depicted against the number of iterations. In this example, oDRs and DRs show faster convergence than OADM and ioDRs. The main reason for the slower convergence of ioDRs is the linearization of the objective in each iteration (1.8). We observe that OADM requires even more iterations to achieve a given precision, although its regret bound is tighter than the bound obtained in this work for oDRs ( in OADM , while in oDRs). Thus we conjecture that the regret in Theorem 3.1 is indeed .
4.2 Logistic regression
The logistic regression problem is formulated as follows:
$$\min_{w}\; \sum_{i=1}^{m} \log\big(1 + \exp(-y_i\, a_i^{\top} w)\big) + \lambda\|w\|_1,$$
where $a_i$ are samples with labels $y_i \in \{-1, +1\}$; the regularization term $\lambda\|w\|_1$ promotes sparse solutions, and $\lambda$ balances goodness-of-fit and sparsity.
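As an illustration (our own sketch, assuming labels in $\{-1,+1\}$; not verbatim from the paper), the per-sample logistic loss and the gradient that a linearized ioDRs-style update would call can be written as:

```python
import numpy as np

def logistic_loss(w, a, y):
    """log(1 + exp(-y * <a, w>)) for one sample (a, y), y in {-1, +1}."""
    return np.log1p(np.exp(-y * a.dot(w)))

def logistic_grad(w, a, y):
    """Gradient of the per-sample logistic loss above."""
    return -y * a / (1.0 + np.exp(y * a.dot(w)))
```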
The three updates of DRs are:
while for oDRs and ioDRs it is:
-  Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
-  Patrick L Combettes. Iterative construction of the resolvent of a sum of maximal monotone operators. J. Convex Anal, 16(4):727–748, 2009.
-  Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.
-  Jim Douglas and H. H. Rachford. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American Mathematical Society, 82(2):421–439, 1956.
-  John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. 2010.
-  John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 1564–1565. IEEE, 2012.
-  Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.
-  Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
-  BS He and XM Yuan. On convergence rate of the Douglas-Rachford operator splitting method. Mathematical Programming, under revision, 2011.
-  Jason D Lee, Yuekai Sun, and Michael A Saunders. Proximal Newton-type methods for minimizing composite functions. 2013.
-  Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
-  Yurii Nesterov. Gradient methods for minimizing composite objective function, 2007.
-  Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120(1):221–259, 2009.
-  R Tyrell Rockafellar. Convex analysis, volume 28. Princeton University Press, 1997.
-  Ziqiang Shi, Jiqing Han, Tieran Zheng, and Shiwen Deng. Audio segment classification using online learning based tensor representation feature discrimination. IEEE Transactions on Audio, Speech, and Language Processing, 21(1-2):186–196, 2013.
-  Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 392–400, 2013.
-  Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
-  Huahua Wang and Arindam Banerjee. Online alternating direction method. arXiv preprint arXiv:1206.6448, 2012.
-  Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. The Journal of Machine Learning Research, 11:2543–2596, 2010.
-  Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 2003.