Online and stochastic Douglas-Rachford splitting method for large scale machine learning

08/22/2013, by Ziqiang Shi, et al. (FUJITSU)

Online and stochastic learning have emerged as powerful tools for large-scale optimization. In this work, we generalize the Douglas-Rachford splitting (DRs) method for minimizing composite functions to online and stochastic settings (to the best of our knowledge, this is the first time DRs has been extended to a sequential setting). We first establish an O(1/√(T)) regret bound for the batch DRs method. We then prove that the online DRs method enjoys an O(1) regret bound and that stochastic DRs has a convergence rate of O(1/√(T)). The proofs are simple and intuitive, and the results and techniques can serve as a starting point for research on large-scale machine learning with the DRs method. Numerical experiments demonstrate the effectiveness of the online and stochastic update rules and further confirm our regret and convergence analysis.


1 Introduction and problem statement

First introduced in [4], the Douglas-Rachford splitting technique has become popular in recent years due to its fast theoretical convergence rates and strong practical performance. The method was originally proposed to address the minimization of the sum of two functions, and was extended in [11] to handle problems involving the sum of two nonlinear monotone operators. For further developments, see [7, 2, 3]. However, most of these variants implicitly assume full access to all data values, while in reality the size of data is rapidly increasing in various domains; batch-mode learning procedures therefore cannot cope with huge training sets, since the data may not fit into memory simultaneously. Furthermore, batch learning cannot start until the training data are fully prepared, and hence cannot effectively deal with training data that arrive sequentially, such as in audio and video processing [15]. In such situations, sequential learning becomes a powerful tool.

Online learning is one of the most promising approaches to large-scale machine learning tasks today [20, 18]. Important advances have been made on sequential learning in the recent literature on similar problems. Composite objective mirror descent (COMID) [5] generalizes mirror descent [1] to the online setting. Regularized dual averaging (RDA) [19] generalizes dual averaging [13] to online and composite optimization, and can be used for distributed optimization [6]. Online alternating direction multiplier method (ADMM) [16], RDA-ADMM [16], and online proximal gradient (OPG) ADMM [18] generalize the classical ADMM [8] to online and stochastic settings.

Our focus in this paper is to generalize Douglas-Rachford splitting to online settings. In this work, we consider problems of the following form:

(1.1)   \min_{x \in \mathbb{R}^{n}} \; f(x) + g(x),

where f is a convex loss function defined by the samples in a training set, and g is a non-smooth convex penalty function or regularizer. Many problems of relevance in signal processing and machine learning can be formulated as the above optimization problem, including ridge regression, the lasso [17], and logistic regression.

Let γ > 0 be given in Problem (1.1). The Douglas-Rachford splitting algorithm then approximates a minimizer of (1.1) with the help of the following sequence:

(1.2)   z^{k+1} = z^{k} + \mathrm{prox}_{\gamma f}\big(2\,\mathrm{prox}_{\gamma g}(z^{k}) - z^{k}\big) - \mathrm{prox}_{\gamma g}(z^{k}),

where the proximal mapping of a convex function h at a point v is

(1.3)   \mathrm{prox}_{\gamma h}(v) = \arg\min_{u} \Big\{ h(u) + \frac{1}{2\gamma}\|u - v\|^{2} \Big\}.

Thus the iterative scheme of Douglas-Rachford splitting for Problem (1.1) is as follows:

(1.4)   x^{k+1} = \mathrm{prox}_{\gamma g}(z^{k}),
(1.5)   y^{k+1} = \mathrm{prox}_{\gamma f}(2x^{k+1} - z^{k}),
(1.6)   z^{k+1} = z^{k} + y^{k+1} - x^{k+1},

where z^{k} converges weakly to some point z^{*} (in the Hilbert space \mathbb{R}^{n}, weak convergence is equivalent to strong convergence); thus also x^{k} = \mathrm{prox}_{\gamma g}(z^{k}) \to \mathrm{prox}_{\gamma g}(z^{*}), since \mathrm{prox}_{\gamma g} is continuous, and \mathrm{prox}_{\gamma g}(z^{*}) is a solution to Problem (1.1).
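As a concrete illustration, the following is a minimal Python sketch of the iteration (1.4)-(1.6) with generic proximal operators; the names prox_f, prox_g, and gamma are our own, and the sketch is not the authors' implementation.

```python
import numpy as np

def douglas_rachford(prox_f, prox_g, z0, gamma=1.0, num_iters=100):
    """Generic Douglas-Rachford splitting for min_x f(x) + g(x).

    prox_f, prox_g: callables mapping (v, gamma) to the proximal point
                    argmin_u h(u) + (1/(2*gamma)) * ||u - v||^2.
    z0:             starting point of the auxiliary sequence z^k (NumPy array).
    """
    z = z0.copy()
    for _ in range(num_iters):
        x = prox_g(z, gamma)          # cf. (1.4): x = prox_{gamma g}(z)
        y = prox_f(2 * x - z, gamma)  # cf. (1.5): y = prox_{gamma f}(2x - z)
        z = z + y - x                 # cf. (1.6): update the auxiliary variable
    return x                          # x converges to a minimizer of f + g
```

With prox_g taken to be soft-thresholding and prox_f the proximal mapping of a quadratic, this routine specializes to the lasso example of Section 4.1.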

The only modification of the splitting that we propose for online processing is simple: at round t, the proximal step (1.5) on the full loss f is replaced by a proximal step on the current loss f_t,

(1.7)   y^{t+1} = \arg\min_{y} \Big\{ f_t(y) + \frac{1}{2\gamma}\big\|y - (2x^{t+1} - z^{t})\big\|^{2} \Big\}.

We call this method online DRs (oDRs). Due to the possibly complex loss function f_t, the update (1.7) is in general difficult to solve efficiently. A common way is to linearize the loss in the objective, so that

(1.8)   y^{t+1} = \arg\min_{y} \Big\{ \big\langle \nabla f_t(x^{t+1}),\, y \big\rangle + \frac{1}{2\gamma}\big\|y - (2x^{t+1} - z^{t})\big\|^{2} \Big\},

which is called inexact oDRs (ioDRs).
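Assuming the quadratic form written in (1.8), the linearized subproblem admits a closed-form solution (a short derivation in our notation):

```latex
y^{t+1}
= \arg\min_{y}\Big\{ \langle \nabla f_t(x^{t+1}),\, y\rangle
  + \tfrac{1}{2\gamma}\|y - (2x^{t+1} - z^{t})\|^{2} \Big\}
= 2x^{t+1} - z^{t} - \gamma\,\nabla f_t(x^{t+1}),
```

so each ioDRs round costs only one gradient evaluation of the current loss, in contrast with the exact oDRs update (1.7), which requires solving a full proximal subproblem.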

ioDRs can also be derived from another point of view, based on the proximal gradient method [12]. The proximal gradient method uses the proximal mapping of the nonsmooth part to minimize the composite function (1.1) [10]:

(1.9)    v^{k} = x^{k} - \lambda_{k} \nabla f(x^{k}),
(1.10)   x^{k+1} = \mathrm{prox}_{\lambda_{k} g}(v^{k}),

where \lambda_{k} denotes the k-th step length. The online PG (OPG) method is straightforward: at round t it solves the following optimization problem, in which only the t-th loss function f_t is linearized [19, 6, 16]:

(1.11)   x^{t+1} = \arg\min_{x} \Big\{ \big\langle \nabla f_t(x^{t}),\, x \big\rangle + g(x) + \frac{1}{2\lambda_{t}}\|x - x^{t}\|^{2} \Big\}.

Then ioDRs can be seen as a combination of the OPG idea with DRs.
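For reference, a minimal Python sketch of one proximal-gradient step (1.9)-(1.10) and one OPG step (1.11) might look as follows; grad_f, grad_ft, prox_g, and step are placeholder names of our own.

```python
import numpy as np

def proximal_gradient_step(x, grad_f, prox_g, step):
    """One proximal-gradient step: a forward gradient step on the smooth
    part f followed by the proximal mapping of the nonsmooth part g."""
    v = x - step * grad_f(x)   # cf. (1.9): gradient step on f
    return prox_g(v, step)     # cf. (1.10): prox step on g

def online_proximal_gradient_step(x, grad_ft, prox_g, step):
    """One OPG step: identical, except that only the gradient of the
    current loss f_t (round t) is used, cf. (1.11)."""
    return prox_g(x - step * grad_ft(x), step)
```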

2 Convergence Analysis for DRs

It is clear that x is a solution of Problem (1.1) if and only if 0 ∈ ∂f(x) + ∂g(x), which is equivalent to a fixed-point condition on the DRs iteration. It is clear that ∂f and ∂g are two monotone set-valued operators [14], and the resolvent operators (I + γ∂f)^{-1} and (I + γ∂g)^{-1} are both single valued. Thus, if the loss function is smooth, we immediately obtain the accuracy measure proposed in [9] of a vector with respect to a solution of Problem (1.1):

(2.1)
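For completeness, the standard first-order characterization underlying this argument is, in our notation and under the usual constraint qualifications:

```latex
x^{\star} \ \text{solves (1.1)}
\;\Longleftrightarrow\;
0 \in \partial f(x^{\star}) + \partial g(x^{\star})
\;\Longleftrightarrow\;
x^{\star} = \mathrm{prox}_{\gamma g}(z^{\star})
\ \text{for some } z^{\star} \text{ with }\
\mathrm{prox}_{\gamma f}\!\big(2\,\mathrm{prox}_{\gamma g}(z^{\star}) - z^{\star}\big)
= \mathrm{prox}_{\gamma g}(z^{\star}).
```

Accordingly, the distance between successive DRs iterates is a natural measure of how close the current point is to such a fixed point.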

It is shown in Theorem 3.1 of [9] that after k iterations of (1.4), we have

(2.2)

On the other hand, from the optimality conditions of the proximal updates, and since f and g are convex functions, using their (sub)gradients we have

(2.3)
(2.4)
(2.5)

where the subgradient of f is the one determined by the optimality condition of the corresponding proximal update, and the subgradient of g is an arbitrary element of its subdifferential at the corresponding point.

Adding (2.3),  (2.4), and  (2.5) together yields

(2.6)
(2.7)

If f and g are both Lipschitz continuous with constants L_f and L_g respectively, then every subgradient of f is bounded in norm by L_f and every subgradient of g is bounded in norm by L_g. Furthermore, the sequence generated by DRs is bounded, since it converges. Thus the terms in (2.7) are bounded and, combined with (2.2), the following convergence result holds.

Theorem 2.1.

Suppose f is differentiable and both f and g are Lipschitz continuous, and let the sequence be generated by DRs. Then we have

(2.8)
Remark 1.

If f is differentiable and ∇f is Lipschitz continuous, then we can obtain an explicit bound by using Lemma 3.1 of [9], which shows that the norm of the difference between successive DRs iterates is monotonically decreasing. Thus we obtain the following inference:

(2.9)
(2.10)

Further from Theorem 3.1 in [9], we have

(2.11)
(2.12)

Then according to (2.7), we have

(2.13)
(2.14)

Thus we have the following corollary.

Corollary 2.2.

Suppose f and g are differentiable and both are Lipschitz continuous, with constants L_f and L_g respectively, and let the sequence be generated by DRs. Then we have

(2.15)
(2.16)
Remark 2.

If, furthermore, the gradient of g is Lipschitz continuous, then from (2.10) we obtain the following derivation:

(2.17)
(2.18)

Then, by a similar argument, an analogous inequality holds. Substituting these new inequalities into (2.14), we obtain the following rate.

Corollary 2.3.

Suppose both gradients ∇f and ∇g are Lipschitz continuous, with respective constants, and let the sequence be generated by DRs. Then we have

(2.19)
(2.20)

3 Online Douglas-Rachford splitting method

The procedures of batch DRs, oDRs, and ioDRs are summarized in Algorithms 1, 2, and 3, respectively.

Input: starting point z^0 and parameter γ > 0.

1: for k = 0, 1, 2, … do

2: Update x via (1.4).

3: Update y via (1.5).

4: Update z via (1.6).

5: end for

Output: the last iterate x.

Algorithm 1 A generic DRs

Input: starting point z^0 and parameter γ > 0.

1: for t = 1, 2, …, T do

2: Update x via (1.4).

3: Update y via (1.7).

4: Update z via (1.6).

5: end for

Output: the sequence of predictions {x^t}.

Algorithm 2 A generic oDRs

Input: starting point z^0 and parameter γ > 0.

1: for t = 1, 2, …, T do

2: Update x via (1.4).

3: Update y via (1.8).

4: Update z via (1.6).

5: end for

Output: the sequence of predictions {x^t}.

Algorithm 3 A generic ioDRs
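To make the online procedure concrete, the following is a minimal Python sketch of an ioDRs loop under the reconstruction used in Section 1; the callable stream loss_grads and the names prox_g and gamma are our own illustrative choices, not the paper's notation.

```python
import numpy as np

def inexact_online_drs(loss_grads, prox_g, z0, gamma=1.0):
    """Inexact online Douglas-Rachford splitting (ioDRs) sketch.

    loss_grads: iterable of callables, one per round t, each returning grad f_t(x).
    prox_g:     proximal mapping of the regularizer g, prox_g(v, gamma).
    Returns the sequence of predictions x^1, ..., x^T.
    """
    z = z0.copy()
    predictions = []
    for grad_ft in loss_grads:
        x = prox_g(z, gamma)                  # prediction for round t, cf. (1.4)
        predictions.append(x)
        y = 2 * x - z - gamma * grad_ft(x)    # linearized loss step, cf. (1.8)
        z = z + y - x                         # auxiliary update, cf. (1.6)
    return predictions
```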

3.1 Regret Analysis for oDRs

The goal of oDRs is to achieve low regret with respect to a static predictor x^* on a sequence of functions

(3.1)   F_t(x) := f_t(x) + g(x), \qquad t = 1, \dots, T.

Formally, at every round t of the algorithm we make a prediction x^t and then receive the function F_t. We seek bounds on the standard regret in the online learning setting with respect to x^*, defined as

(3.2)   R(T) := \sum_{t=1}^{T} \big( f_t(x^{t}) + g(x^{t}) \big) \;-\; \sum_{t=1}^{T} \big( f_t(x^{*}) + g(x^{*}) \big).

In batch optimization we set f_t = f for all t, while in stochastic optimization we choose f_t to be the average of the losses over some random subset of the training samples.
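In our notation, the three settings differ only in the choice of f_t; for instance, one may take

```latex
f_t \equiv f \quad \text{(batch)},
\qquad
f_t(x) \;=\; \frac{1}{|S_t|} \sum_{i \in S_t} \ell_i(x) \quad \text{(stochastic)},
```

where \ell_i denotes the loss on the i-th training sample and S_t is a randomly drawn index set, while in the genuinely online setting f_t is revealed by the environment at round t.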

As pointed out by Lemma 3.2 and Theorem 3.1 in [9], and with the above notation in mind, we have in each iteration that

(3.3)
(3.4)

and

(3.5)

which in particular yields a per-round bound. Following the same procedure as in (2.7), we have

(3.6)

Adding the above inequalities together for t = 1, …, T, we obtain the following result.

We conjecture that this bound is not tight and can be improved; see the discussion in Section 4.1.

Theorem 3.1.

Suppose the loss functions f_t and the regularizer g are Lipschitz continuous, and let the sequence be generated by oDRs. Then we have

(3.7)

4 Computational experiments

In this section, we demonstrate the performance of oDRs and ioDRs on lasso and logistic regression problems. We present simulation results showing the convergence of the objective value for oDRs and ioDRs, and we compare them with batch DRs and OADM [18]. The same parameter settings are used for all updates of the iterates.

4.1 Lasso

The lasso problem is formulated as follows:

(4.1)   \min_{x \in \mathbb{R}^{n}} \; \frac{1}{2}\|Ax - b\|_{2}^{2} + \lambda \|x\|_{1},

where A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^{m}, and \lambda > 0 is a scalar. The three updates of DRs are:

(4.2)
(4.3)
(4.4)
(4.5)

The difference of oDRs and ioDRs from DRs lies in the update involving the loss term, which is given by

(4.6)

and

(4.7)

respectively.
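As a concrete reference, here is a minimal Python sketch of the standard DRs updates one would expect for (4.1): the proximal mapping of the quadratic term reduces to a linear-system solve, and the proximal mapping of the ℓ1 term is soft-thresholding. The variable names A, b, lam, and gamma are ours, and this is a sketch rather than the paper's exact implementation.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def drs_lasso(A, b, lam, gamma=1.0, num_iters=500):
    """Douglas-Rachford splitting for (1/2)||Ax - b||^2 + lam * ||x||_1."""
    m, n = A.shape
    # prox of f(u) = (1/2)||Au - b||^2 solves (A^T A + I/gamma) u = A^T b + v/gamma
    M = A.T @ A + np.eye(n) / gamma
    Atb = A.T @ b
    z = np.zeros(n)
    for _ in range(num_iters):
        x = soft_threshold(z, lam * gamma)                   # prox of gamma*lam*||.||_1
        y = np.linalg.solve(M, Atb + (2 * x - z) / gamma)    # prox of gamma*f at 2x - z
        z = z + y - x
    return x
```

In practice one would factor M once outside the loop (e.g., by a Cholesky factorization); the plain solve is kept here for clarity.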

Our experiments mainly follow the lasso example in [18]. We first randomly generate the data matrix A with 1000 examples of dimensionality 100; A is then normalized along its columns. A ground-truth vector is then randomly generated with a certain sparsity pattern for the lasso, with the number of nonzeros set to 10. The observation vector b is computed by adding Gaussian noise to the product of A and the ground-truth vector (the number of examples here is 1000). The parameters of OADM are set as in [18]. All experiments are implemented in Matlab.
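A minimal Python sketch of this data-generation procedure follows; the noise level 0.01 and the random seed are our own choices, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, nnz = 1000, 100, 10                   # 1000 examples, dimensionality 100, 10 nonzeros

A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)              # normalize A along its columns

x_true = np.zeros(n)
support = rng.choice(n, size=nnz, replace=False)
x_true[support] = rng.standard_normal(nnz)  # sparse ground-truth vector

b = A @ x_true + 0.01 * rng.standard_normal(m)  # observations with Gaussian noise (level assumed)
```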

Figure 1: The convergence of the objective value in DRs, oDRs, ioDRs, and OADM, together with the real objective value, for the lasso problem (4.1). Panels (a), (b), and (c) correspond to parameter values 0.1, 1, and 10, respectively.

In Figure 1, the objective value is plotted against the number of iterations. In this example, oDRs and DRs show faster convergence than OADM and ioDRs. The main reason for the slower convergence of ioDRs is the linearization of the objective at each iteration in (1.8). We observe that OADM takes even more iterations to achieve a given precision, although its regret bound is tighter than the bound obtained in this work for oDRs (see [18]). Thus we believe and conjecture that the regret bound in Theorem 3.1 can in fact be tightened.

4.2 Logistic regression

The logistic regression problem is formulated as follows:

(4.8)   \min_{x \in \mathbb{R}^{n}} \; \frac{1}{m}\sum_{i=1}^{m} \log\big(1 + \exp(-b_i\, a_i^{\top} x)\big) \;+\; \lambda \|x\|_{1},

where a_i \in \mathbb{R}^{n} are samples with labels b_i \in \{-1, +1\}, the \ell_1 regularization term promotes sparse solutions, and \lambda balances goodness of fit and sparsity.

The three updates of DRs are:

(4.9)
(4.10)
(4.11)

The oDRs and ioDRs variants again differ from DRs only in the update involving the loss term, analogously to the lasso case.
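Since the logistic loss has no closed-form proximal mapping, the loss step must be solved iteratively or, as in ioDRs, simply linearized. A minimal sketch of one linearized round for (4.8) is given below; the names A, b, lam, and gamma are ours, and the averaged form of the loss in (4.8) is an assumption.

```python
import numpy as np

def logistic_grad(x, A, b):
    """Gradient of (1/m) * sum_i log(1 + exp(-b_i * a_i^T x))."""
    m = A.shape[0]
    margins = b * (A @ x)
    return -(A.T @ (b / (1.0 + np.exp(margins)))) / m

def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def iodrs_logistic_step(z, A_t, b_t, lam, gamma):
    """One ioDRs round for l1-regularized logistic regression (our reconstruction)."""
    x = soft_threshold(z, lam * gamma)                    # prox of the l1 regularizer
    y = 2 * x - z - gamma * logistic_grad(x, A_t, b_t)    # linearized logistic-loss step
    z = z + y - x
    return x, z
```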

References

  • [1] Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
  • [2] Patrick L Combettes. Iterative construction of the resolvent of a sum of maximal monotone operators. J. Convex Anal, 16(4):727–748, 2009.
  • [3] Patrick L Combettes and Jean-Christophe Pesquet. Proximal splitting methods in signal processing. In Fixed-Point Algorithms for Inverse Problems in Science and Engineering, pages 185–212. Springer, 2011.
  • [4] Jim Douglas and HH Rachford. On the numerical solution of heat conduction problems in two and three space variables. Transactions of the American mathematical Society, 82(2):421–439, 1956.
  • [5] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Ambuj Tewari. Composite objective mirror descent. 2010.
  • [6] John C Duchi, Alekh Agarwal, and Martin J Wainwright. Dual averaging for distributed optimization. In Communication, Control, and Computing (Allerton), 2012 50th Annual Allerton Conference on, pages 1564–1565. IEEE, 2012.
  • [7] Jonathan Eckstein and Dimitri P Bertsekas. On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1-3):293–318, 1992.
  • [8] Daniel Gabay and Bertrand Mercier. A dual algorithm for the solution of nonlinear variational problems via finite element approximation. Computers & Mathematics with Applications, 2(1):17–40, 1976.
  • [9] BS He and XM Yuan. On convergence rate of the Douglas-Rachford operator splitting method. Mathematical Programming, under revision, 2011.
  • [10] Jason D Lee, Yuekai Sun, and Michael A Saunders. Proximal newton-type methods for minimizing composite functions. 2013.
  • [11] Pierre-Louis Lions and Bertrand Mercier. Splitting algorithms for the sum of two nonlinear operators. SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.
  • [12] Yurii Nesterov. Gradient methods for minimizing composite objective function, 2007.
  • [13] Yurii Nesterov. Primal-dual subgradient methods for convex problems. Mathematical programming, 120(1):221–259, 2009.
  • [14] R Tyrell Rockafellar. Convex analysis, volume 28. Princeton university press, 1997.
  • [15] Ziqiang Shi, Jiqing Han, Tieran Zheng, and Shiwen Deng. Audio segment classification using online learning based tensor representation feature discrimination. IEEE Transactions on Audio, Speech, and Language Processing, 21(1-2):186–196, 2013.
  • [16] Taiji Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 392–400, 2013.
  • [17] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
  • [18] Huahua Wang and Arindam Banerjee. Online alternating direction method. arXiv preprint arXiv:1206.6448, 2012.
  • [19] Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. The Journal of Machine Learning Research, 11:2543–2596, 2010.
  • [20] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. 2003.