A Fully Stochastic Primal-Dual Algorithm

01/23/2019
by   Adil Salim, et al.
Télécom ParisTech

A new stochastic primal-dual algorithm for solving a composite optimization problem is proposed. It is assumed that all the functions and the matrix used to define the optimization problem are given as statistical expectations. These expectations are unknown but revealed across time through i.i.d. realizations. This covers the case of convex optimization under stochastic linear constraints. The proposed algorithm is proven to converge to a saddle point of the Lagrangian function. In the framework of monotone operator theory, the convergence proof relies on recent results on the stochastic Forward-Backward algorithm involving random monotone operators.


1 Introduction

Many applications in machine learning, statistics or signal processing require the solution of the following optimization problem chambolle2011first ; vu2013splitting ; con-jota13 . Given two Euclidean spaces $\mathsf X$ and $\mathsf V$, solve

$$\min_{x \in \mathsf X} F(x) + G(x) + H(Lx) \tag{1}$$

where $F, G : \mathsf X \to (-\infty, +\infty]$ and $H : \mathsf V \to (-\infty, +\infty]$ are lower semicontinuous convex functions such that $F(x) < \infty$ for every $x \in \mathsf X$, and $L$ belongs to the set $\mathcal L(\mathsf X, \mathsf V)$ of linear operators. Consider $F = 0$ and $H = \iota_{\{c\}}$, where $\iota_C$ is the indicator function of the set $C$, i.e., the function equal to $0$ on $C$ and to $+\infty$ elsewhere. In this particular case, Problem (1) boils down to the linearly constrained problem

$$\min_{x \in \mathsf X} G(x) \quad \text{subject to} \quad Lx = c. \tag{2}$$

In order to solve Problem (1), primal-dual methods generate a sequence of primal estimates $(x_n)$ and a sequence of dual estimates $(\lambda_n)$ jointly converging to a saddle point of the Lagrangian function. As is well known, the qualification condition

$$0 \in \operatorname{ri}(\operatorname{dom} H - L \operatorname{dom} G),$$

where $\operatorname{ri}$ is the relative interior of a set, ensures the existence of such a point bau-com-livre11 . There is a rich literature on such algorithms which cannot be exhaustively listed chambolle2011first ; vu2013splitting ; con-jota13 .

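In this paper, it is assumed that the quantities that enter the minimization problem are likely to be unavailable or difficult to compute numerically. More precisely, it is assumed that the functions $F$ and $G$ are defined as expectations of random functions. Given a probability space $(\Xi, \mathscr G, \mu)$, consider two convex normal integrands (see below) $f : \Xi \times \mathsf X \to \mathbb R$ and $g : \Xi \times \mathsf X \to (-\infty, +\infty]$. Then, we consider that $F(x) = \mathbb E_\mu(f(\xi, x))$ and $G(x) = \mathbb E_\mu(g(\xi, x))$. In addition, let $L(\cdot)$ be a measurable function from $\Xi$ to $\mathcal L(\mathsf X, \mathsf V)$ (i.e., a random matrix), then it is assumed that $L = \mathbb E_\mu(L(\xi))$. Finally, the Fenchel conjugate $H^\star$ of $H$ takes the form $H^\star(\lambda) = \mathbb E_\mu(p(\xi, \lambda))$, where $p$ is a normal convex integrand.

In the particular case of Problem (2), let us assume that $c = \mathbb E_\mu(c(\xi))$ where $c(\xi)$ is a random vector. Then, since $H^\star(\lambda) = \langle \lambda, c \rangle$, we simply put $p(\xi, \lambda) = \langle \lambda, c(\xi) \rangle$.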
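The conjugate used in this particular case can be checked directly. The short derivation below is a sketch under assumed notation: $c$ stands for the right-hand side of the constraint in Problem (2), $\lambda$ for the dual variable, and $\gamma > 0$ for a step size. It also records the proximity operator of the resulting linear function, which reduces to a plain shift.

```latex
H^\star(\lambda)
  = \sup_{v} \big\{ \langle \lambda, v \rangle - \iota_{\{c\}}(v) \big\}
  = \langle \lambda, c \rangle ,
\qquad
\operatorname{prox}_{\gamma \langle \cdot , c \rangle}(\lambda)
  = \arg\min_{u} \Big\{ \gamma \langle u, c \rangle + \tfrac{1}{2} \| u - \lambda \|^2 \Big\}
  = \lambda - \gamma c .
```

The supremum is attained at $v = c$ because the indicator is $+\infty$ everywhere else, and the proximity operator follows from setting the gradient $\gamma c + u - \lambda$ to zero.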

In order to solve Problem (1), the observer is given the functions $f$, $g$, $p$, and $L(\cdot)$, along with a sequence of independent and identically distributed (i.i.d.) random variables $(\xi_n)$ with the probability distribution $\mu$. In this paper, a new stochastic primal-dual algorithm based on this data is proposed to solve this problem.

The convergence proof for this algorithm relies on monotone operator theory. The algorithm is built around an instantiation of the stochastic Forward-Backward algorithm involving random monotone operators that was introduced in bia-hac-16 . It is proven that the weighted means of the iterates of the algorithm, where the weights are given by the step sizes of the algorithm, converge almost surely to a saddle point of the Lagrangian function. To the authors' knowledge, the proposed algorithm is the first method that solves Problem (1) in a fully stochastic setting. Existing methods typically handle subproblems of Problem (1) in which some quantities used to define (1) are assumed to be available or set equal to zero ouyang2013stochastic ; rosasco2015stochastic ; yu2017online ; combettes2016stochastic ; toulis2015stable . In particular, the new algorithm generalizes the stochastic gradient algorithm (in the case where only the term $F$, handled through its (sub)gradient, is non zero), the stochastic proximal point algorithm patrascu2017nonasymptotic ; toulis2015stable ; bia-16 (only the term $G$, handled through its proximity operator, is non zero), and the stochastic proximal gradient algorithm atchade2017perturbed ; bia-hac-sal-(sub)jca17 (only $F + G$ is non zero).
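As an illustration of the stochastic proximal gradient special case mentioned above, here is a minimal sketch on a one-dimensional toy problem. Everything in it is an assumption made for the example and not taken from the paper: the objective $\mathbb E[(x - \xi)^2/2] + |x|$ with Gaussian $\xi$, the step size $1/n$, and the function names.

```python
import random

def soft_threshold(v, t):
    """Proximity operator of t * |.| evaluated at v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def stochastic_proximal_gradient(mean, n_iter, seed=0):
    """Minimize E[(x - xi)^2 / 2] + |x| with xi ~ N(mean, 1).

    The expectation term is only seen through noisy gradients, while the
    term |x| is handled through its proximity operator.
    """
    rng = random.Random(seed)
    x = 0.0
    for n in range(1, n_iter + 1):
        gamma = 1.0 / n                      # square-summable, non-summable steps
        xi = rng.gauss(mean, 1.0)            # one i.i.d. realization
        x = soft_threshold(x - gamma * (x - xi), gamma)
    return x

x_hat = stochastic_proximal_gradient(mean=3.0, n_iter=20000)
print(x_hat)  # should be close to the minimizer soft_threshold(3.0, 1.0) = 2.0
```

The exact minimizer is available in closed form here, which is the only reason this toy problem was chosen: it makes the behaviour of the iteration easy to check.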

To our knowledge, the proposed algorithm is also one of the first methods that can handle stochastic linear constraints. The paper yu2017online studies stochastic inequality constraints for optimization over a compact set and provides regret bounds. Handling stochastic constraints online is suitable in various fields of machine learning like Neyman-Pearson classification or online portfolio optimization. For example, the Markowitz portfolio optimization problem is an instance of Problem (2) where is a random variable with values in , , where is the probability simplex, and is some real positive number. In this case, authors usually assume that is fully known or estimated.

The paper is organized as follows. The next section is devoted to a rigorous statement of the main problem and the main algorithm. In Section 3, the convergence proof of the algorithm is given.

Some notations.

The notation $\mathscr B(\mathsf X)$ will refer to the Borel $\sigma$-field of $\mathsf X$. Both the operator norm and the Euclidean norm will be denoted as $\|\cdot\|$. The distance of a point $x$ to a set $S$ is denoted as $d(x, S)$. As mentioned above, we denote as $\mathcal L(\mathsf X, \mathsf V)$ the set of linear operators, identified with matrices, from $\mathsf X$ to $\mathsf V$. The set of proper, lower semicontinuous convex functions on $\mathsf X$ is $\Gamma_0(\mathsf X)$.

2 Problem description

Before entering our subject, we recall some definitions regarding set-valued functions and integrals. Let be a probability space where the $\sigma$-field is complete with respect to the probability measure. Given a Euclidean space , let be a set-valued function such that is a closed set for each . The function is said to be measurable if for any set . An equivalent definition for the measurability of requires that the domain of belongs to , and that there exists a sequence of measurable functions such that for all , where is the closure of a set. Such functions are called measurable selections of . Assume now that is measurable and that . Given , let be the space of the measurable functions such that , and let

If , the function is said to be integrable. The selection integral of is the set

(3)

In the remainder of the paper, given a single-valued or a set-valued function , the notation will refer to the integral of with respect to . The meaning of this integral will be clear from the context.

We now state our problem. A function is said to be a convex normal integrand if is convex, and if the set-valued mapping is closed-valued and measurable, where is the epigraph of a function. Let be a convex normal integrand, and assume that for all . Consider the convex function defined on as the Lebesgue integral . Denoting as the subdifferential of with respect to , it is known that the set-valued function is measurable, , and for each , where the integral is the selection integral defined above att-79 ; roc-wet-82 .

Let be another convex normal integrand, and let , where the integral is defined as the sum

and

and where the convention is used. The function is a lower semicontinuous convex function if for all , which we assume. We shall also assume that is proper. Note that this implies that for almost all . It is also known that is measurable for each  att-79 . We assume that , where the right hand member is set to for the values of for which . Before proceeding with the problem statement, it is useful to provide sufficient conditions under which this interchange of the expectation and the subdifferentiation is possible. By roc-wet-82 , this will be the case if the following conditions hold: i) the set-valued mapping is constant a.e., where is the domain of , ii) whenever a.e., iii) there exists at which is finite and continuous. Another case where this interchange is permitted is the following. Let be a positive integer, and let be a collection of closed and convex subsets of . Let , and assume that the normal cone of at satisfies the identity for each , where the summation is the usual set summation. As is well known, this identity holds true under a qualification condition of the type (see also bauschke1999strong for other conditions). Now, assume that and that is an arbitrary probability measure putting a positive weight on each . Let be the indicator function

(4)

Then it is obvious that is a convex normal integrand, , and . We can also combine these two types of conditions: let be a probability space whose $\sigma$-field is complete, and let be a convex normal integrand satisfying the conditions i)–iii) above. Consider the closed and convex sets introduced above, and let be a probability measure on the set such that for each . Now, set , , and define as

where . Then it is clear that

and

To proceed with our problem statement, we introduce another convex normal integrand and assume that the function has verbatim the same properties as , after replacing the space with . We also denote the Fenchel conjugate of , so that .

Finally, let be an operator-valued measurable function. Let us assume that is -integrable, and let us introduce the Lebesgue integral .

Having introduced these functions, our purpose is to find a solution of Problem (1), where the set of such points is assumed non empty. To solve this problem, the observer is given the functions , and a sequence of i.i.d. random variables from a probability space to with the probability distribution .

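Denote as the Moreau proximity operator of a function . We also denote as the least norm element of the set , which is known to exist and to be unique bau-com-livre11 . Similarly, will refer to the least norm element of which was introduced above. We shall also denote as a measurable subgradient of at . More precisely, is a measurable function such that for each , (recall that this set is non empty). A possible choice for is (see (bia-hac-16, §2.3 and §3.1) for the measurability issues). Turning back to Problem (1), our purpose will be to find a saddle point of the Lagrangian $\mathcal L(x, \lambda) = F(x) + G(x) + \langle \lambda, L x \rangle - H^\star(\lambda)$, where $F$, $G$, $H$ and the linear operator $L$ are the data of Problem (1) and $H^\star$ is the Fenchel conjugate of $H$. Denoting as $\mathcal S$ the set of these saddle points, an element $(x_\star, \lambda_\star)$ of $\mathcal S$ is characterized by the inclusions

$$0 \in \partial F(x_\star) + \partial G(x_\star) + L^T \lambda_\star, \qquad 0 \in -L x_\star + \partial H^\star(\lambda_\star). \tag{5}$$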

Consider a sequence of positive weights . The algorithm proposed here consists in the following iterations applied to the random vector .

Algorithm 1 The Main Algorithm : Solving Problem (1)
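The body of Algorithm 1 did not survive in this copy of the text. Based on the description given in the proof of Section 3 (a random Forward-Backward step whose backward part is a pair of proximity operators and whose forward part combines a subgradient with a skew-symmetric coupling), one iteration can plausibly be sketched as follows. This is a reconstruction under assumptions, not a transcription of the authors' algorithm. All names (`primal_dual_step`, `subgrad_f`, `prox_g`, `prox_p`) are hypothetical, and each callable is meant to be the realization observed at the current random sample.

```python
def matvec(M, v):
    """Return M v for a matrix stored as a list of rows."""
    return [sum(m * vi for m, vi in zip(row, v)) for row in M]

def rmatvec(M, v):
    """Return M^T v for a matrix stored as a list of rows."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

def primal_dual_step(x, lam, gamma, subgrad_f, prox_g, prox_p, L):
    """One assumed iteration on the pair (x, lam) with step size gamma.

    subgrad_f(x)     : a subgradient of the sampled function f at x
    prox_g(v, gamma) : proximity operator of gamma * (sampled g) at v
    prox_p(v, gamma) : proximity operator of gamma * (sampled conjugate term) at v
    L                : sampled matrix, as a list of rows
    """
    # Forward step: subgradient of f plus the coupling term L^T lam.
    coupling = rmatvec(L, lam)
    grad = subgrad_f(x)
    x_half = [xi - gamma * (gi + ci) for xi, gi, ci in zip(x, grad, coupling)]
    # Forward step on the dual variable, using the previous primal iterate.
    Lx = matvec(L, x)
    lam_half = [li + gamma * yi for li, yi in zip(lam, Lx)]
    # Backward step: one proximity operator per block.
    return prox_g(x_half, gamma), prox_p(lam_half, gamma)
```

Both blocks use the previous iterate in the forward step, which is what a single Forward-Backward step on the product space gives. Other primal-dual schemes update the dual variable with an extrapolated primal point, and that variant is not what is sketched here.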

We also give the instance of the main algorithm that solves Problem (2) (which is an instance of Problem (1)).

Algorithm 2 Stochastic Linear Constraints : Solving Problem (2)
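The body of Algorithm 2 is also missing here. For a linear constraint, the proximity operator acting on the dual variable reduces to a shift by the sampled right-hand side, so the dual update becomes a gradient ascent step on the sampled constraint residual. The sketch below runs such an iteration on a toy instance. The problem (minimize $\|x\|^2/2$ subject to $x_1 + x_2 = 2$, observed through noisy rows and right-hand sides), the step size $0.3/n^{0.6}$ and all names are assumptions chosen for illustration.

```python
import random

def run_stochastic_linear_constraints(n_iter, seed=0):
    """Toy instance: minimize ||x||^2 / 2 subject to E[a(xi)]^T x = E[c(xi)].

    Here E[a(xi)] = (1, 1) and E[c(xi)] = 2, so the solution is x = (1, 1).
    Returns the step-size-weighted average of the primal iterates and the
    last dual iterate.
    """
    rng = random.Random(seed)
    x = [0.0, 0.0]
    lam = 0.0
    avg = [0.0, 0.0]
    weight = 0.0
    for n in range(1, n_iter + 1):
        gamma = 0.3 / n ** 0.6
        # Noisy observations of the constraint row and right-hand side.
        a = [1.0 + 0.1 * rng.gauss(0.0, 1.0), 1.0 + 0.1 * rng.gauss(0.0, 1.0)]
        c = 2.0 + 0.1 * rng.gauss(0.0, 1.0)
        residual = a[0] * x[0] + a[1] * x[1] - c
        # Primal step: forward step on the coupling term, then the proximity
        # operator of gamma * ||.||^2 / 2, which maps v to v / (1 + gamma).
        x_new = [(xi - gamma * ai * lam) / (1.0 + gamma) for xi, ai in zip(x, a)]
        # Dual step: ascent on the sampled residual at the previous primal iterate.
        lam = lam + gamma * residual
        x = x_new
        # Running average weighted by the step sizes.
        weight += gamma
        avg = [m + (gamma / weight) * (xi - m) for m, xi in zip(avg, x)]
    return avg, lam

avg, lam = run_stochastic_linear_constraints(20000)
print(avg, lam)
```

Neither the constraint row nor the right-hand side is ever observed exactly in this run. Only their noisy realizations enter the updates, which is the setting the section describes.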

The convergence of Algorithm 1 is stated by the following theorem.

Theorem 2.1

Consider Problem (1), and let the following assumptions hold true.

  1. The step size sequence satisfies , and as .

  2. There exists an integer that satisfies the following conditions:

    • The function is in .

    • There exists a point , and three functions , , and which

      (6)

      The last assumption is verified for and for each point .

  3. For any compact set of , there exist and such that

  4. Writing , there exists such that for all ,

  5. There exists such that for any and any ,

    where is the projection operator onto , and where is the integer provided by Assumption 2.

  6. Assumptions similar to 3–5 are made on the function and .

  7. There exists a measurable function such that is -integrable, where is the integer provided by Assumption 2, and such that for all ,

    Moreover, there exists a constant such that .

Consider the sequence of iterates produced by Algorithm 1, and define the averaged estimates

Then, the sequence is bounded in and the sequence converges almost surely (a.s.) to a random variable supported by .
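The averaged estimates are weighted means of the iterates, with weights given by the step sizes, as stated in the introduction. A minimal sketch of how such an average can be maintained recursively is given below. The polynomial step size is an assumed example of a decreasing sequence that is square-summable but not summable, since the exact conditions of Assumption 1 are not legible in this copy. The function names are hypothetical.

```python
def step_size(n, scale=1.0, exponent=0.75):
    """gamma_n = scale / n**exponent.

    For 1/2 < exponent <= 1 the sequence is square-summable but not summable.
    """
    return scale / n ** exponent

def weighted_average(iterates, gammas):
    """Return sum_k gamma_k * z_k / sum_k gamma_k, computed recursively."""
    avg = 0.0
    total = 0.0
    for z, gamma in zip(iterates, gammas):
        total += gamma
        avg += (gamma / total) * (z - avg)
    return avg
```

The recursive form avoids storing the whole trajectory, which matters when the average is updated online alongside the iterates.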

Let us now discuss our assumptions. Assumption 1 is standard in the decreasing step case. Assumption 2 is a moment assumption that is generally easy to check. Note that this assumption requires the set of saddle points to be non empty. Notice the relation between Equations (6) and the two inclusions in (5). Focusing on the first inclusion, there exist and such that . Then, Assumption 2 states that there are two measurable selections and of and respectively which are both in and which satisfy and . Note also that the larger is , the weaker is Assumption 5.

Assumption 3 is relatively weak and easy to check. This assumption on the functions and is much weaker than the last assumption of Theorem 2.1, which assumes that the growth of is not faster than linear. This is due to the fact that and enter Algorithm 1 through the proximity operator while the function is used explicitly in this algorithm (through its (sub)gradient). This use of the functions is reminiscent of the well-known Robbins-Monro algorithm, where a linear growth is needed to ensure the algorithm stability. Note that the last assumption of Theorem 2.1 is satisfied under the more restrictive assumption that is -Lipschitz continuous without any bounded gradient assumption.

Assumption 4 is quite weak, and is studied e.g. in necoara2018randomized . This assumption is easy to illustrate in the case where as in (4). Following bauschke1999strong , we say that the subsets are linearly regular if there exists such that for every ,

Sufficient conditions for a collection of sets to satisfy the above condition can be found in bauschke1999strong and the references therein. Note that this condition implies that . Let us finally discuss Assumption 5. As , it is known that converges to for every . Assumption 5 provides a control on the convergence rate. This assumption holds under the sufficient condition that for -almost every and for every ,

where is a positive random variable with a finite fourth moment bia-16 .

3 Proof of Theorem 2.1

The proof of Theorem 2.1 relies on monotone operator theory. We begin by recalling some basic facts on monotone operators. All the results below can be found in bre-livre73 ; bau-com-livre11 without further mention.

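A set-valued mapping $A : \mathsf E \rightrightarrows \mathsf E$ on the Euclidean space $\mathsf E$ will be called herein an operator. An operator with singleton values is identified with a function. As above, the domain of $A$ is $\operatorname{dom}(A) = \{x \in \mathsf E : A(x) \neq \emptyset\}$. The graph of $A$ is $\operatorname{gr}(A) = \{(x, y) \in \mathsf E \times \mathsf E : y \in A(x)\}$. The operator $A$ is said to be monotone if $\langle y - y', x - x' \rangle \geq 0$ for all $(x, y), (x', y') \in \operatorname{gr}(A)$. A monotone operator with non empty domain is said to be maximal if its graph is a maximal element for the inclusion ordering in the family of the monotone operator graphs. Let $I$ be the identity operator, and let $A^{-1}$ be the inverse of $A$, which is defined by the fact that $(x, y) \in \operatorname{gr}(A^{-1}) \Leftrightarrow (y, x) \in \operatorname{gr}(A)$. An operator $A$ belongs to the set $\mathscr M(\mathsf E)$ of the maximal monotone operators on $\mathsf E$ if and only if for each $\gamma > 0$, the so-called resolvent $(I + \gamma A)^{-1}$ is a contraction defined on the whole space $\mathsf E$. In particular, it is single-valued. A typical element of $\mathscr M(\mathsf E)$ is the subdifferential $\partial q$ of a function $q \in \Gamma_0(\mathsf E)$. In this case, the resolvent $(I + \gamma \partial q)^{-1}$ for $\gamma > 0$ coincides with the proximity operator $\operatorname{prox}_{\gamma q}$. A skew-symmetric element of $\mathcal L(\mathsf E, \mathsf E)$ can also be checked to be an element of $\mathscr M(\mathsf E)$.

The set of zeros of an operator $A$ on $\mathsf E$ is the set $Z(A) = \{x \in \mathsf E : 0 \in A(x)\}$. The sum of two operators $A$ and $B$ is the operator $A + B$ whose image at $x$ is the set sum of $A(x)$ and $B(x)$. Given two operators $A, B \in \mathscr M(\mathsf E)$, where $B$ is single-valued with domain $\mathsf E$, the so-called Forward-Backward algorithm is an iterative algorithm for finding a point in $Z(A + B)$. It reads

$$x_{n+1} = (I + \gamma A)^{-1}(x_n - \gamma B(x_n)),$$

where $\gamma$ is a positive step.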
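To make the resolvent and the Forward-Backward iteration concrete, here is a small deterministic sketch on the real line. The choice of operators is an assumption made for illustration: the set-valued part is the subdifferential of the absolute value, whose resolvent is soft-thresholding, and the single-valued part is $x \mapsto x - b$. All function names are hypothetical.

```python
def prox_abs(v, gamma):
    """Resolvent of gamma * d|.|, i.e. soft-thresholding at level gamma."""
    return max(abs(v) - gamma, 0.0) * (1.0 if v >= 0 else -1.0)

def in_resolvent_inclusion(v, p, gamma, tol=1e-12):
    """Check that v belongs to p + gamma * d|.|(p)."""
    u = (v - p) / gamma                  # candidate subgradient of |.| at p
    if p > 0:
        return abs(u - 1.0) <= tol
    if p < 0:
        return abs(u + 1.0) <= tol
    return -1.0 - tol <= u <= 1.0 + tol

def forward_backward(b, gamma, n_iter, x0=0.0):
    """Iterate x <- prox_abs(x - gamma * (x - b), gamma)."""
    x = x0
    for _ in range(n_iter):
        x = prox_abs(x - gamma * (x - b), gamma)
    return x

zero = forward_backward(b=3.0, gamma=0.5, n_iter=60)
print(zero)  # a zero of d|.|(x) + (x - 3), namely x = 2
```

With a constant step the iteration converges here because the single-valued part is cocoercive. The stochastic setting of the paper uses decreasing steps instead.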

In the sequel, we shall be interested in random elements of as used in bia-16 ; bia-hac-16 ; bia-hac-sal-(sub)jca17 . Consider a function , where is the probability space introduced at the beginning of Section 2. By the maximality of , the graph is known to be a closed subset of . By saying that is a -valued random variable, we mean that the function is measurable according to the definition of Section 2. When , where is a convex normal integrand such that is proper a.e., is a random element of . Finally, when is a skew-symmetric element of which is measurable in the usual sense (as a function), then it is also a random element of .

We now enter the proof of Theorem 2.1. Let us set , and endow this Euclidean space with the standard scalar product. By writing , it will be understood that and .

For each , define the set-valued operator on as

where is the image of by . Fixing , the operator coincides with the subdifferential of the convex normal integrand with respect to . Thus, the map is a measurable function. Let us also define the operator as

We can write , where

( is a linear skew-symmetric operator written in a matrix form in ). For each , both these operators belong to , and . Thus, by (bau-com-livre11, Cor. 24.4). Moreover, since both and are measurable, is a -valued random variable.

Now, from the assumptions on the functions , and , we see that the operators and , where is the selection integral (3), are written as

For the same reasons as for the operators and , it holds that , , and belong to . Moreover, recalling the system of inclusions (5), we also obtain that .

Defining the function

(obviously, -a.e.), let us consider the following version of the Forward-Backward algorithm

On the one hand, one can easily check that this is exactly Algorithm 1. On the other hand, this algorithm is an instance of the random Forward-Backward algorithm studied in bia-hac-16 . By checking the assumptions of Theorem 2.1 one by one, one sees that the assumptions of (bia-hac-16, Th. 3.1 and Cor. 3.1) are verified. Theorem 2.1 follows.

Remark 1

The convergence stated by Theorem 2.1 concerns the averaged sequence . One can ask whether the sequence itself converges to . A counterexample is provided by the particular case , , and (proof omitted). A pointwise convergence would have been possible if were so-called demipositive bia-hac-16 . Note that in the previous counterexample, is not demipositive.
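The particular case behind this counterexample is not legible in this copy, so the following is only an illustration of the phenomenon under an assumed setting. Suppose there is no (sub)gradient term and both proximity operators are the identity, so that each iteration is a plain forward step on a skew-symmetric operator. With a scalar coupling equal to 1, the squared norm of the iterate is multiplied by $1 + \gamma_n^2$ at every step, so the iterates cannot converge to the unique saddle point $(0, 0)$, while the weighted average does approach it. The step size $1/n^{0.75}$ and the function name are assumptions.

```python
def run_skew_iteration(n_iter):
    """Iterate z_{n+1} = z_n - gamma_n * D z_n with D = [[0, 1], [-1, 0]].

    Returns the norm of the last iterate and the norm of the average of the
    iterates weighted by the step sizes.
    """
    x, lam = 1.0, 0.0
    avg_x, avg_lam, weight = 0.0, 0.0, 0.0
    for n in range(1, n_iter + 1):
        gamma = 1.0 / n ** 0.75
        # Both coordinates are updated from the previous pair (x, lam).
        x, lam = x - gamma * lam, lam + gamma * x
        weight += gamma
        avg_x += (gamma / weight) * (x - avg_x)
        avg_lam += (gamma / weight) * (lam - avg_lam)
    last_norm = (x * x + lam * lam) ** 0.5
    avg_norm = (avg_x * avg_x + avg_lam * avg_lam) ** 0.5
    return last_norm, avg_norm

last_norm, avg_norm = run_skew_iteration(20000)
print(last_norm, avg_norm)  # the last iterate stays away from 0, the average shrinks
```

The iterates rotate around the origin without approaching it, and the averaging cancels the rotation. This is consistent with the remark that convergence is stated for the averaged sequence only.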

Remark 2

Constant step Forward-Backward algorithms usually require the operator to be so-called cocoercive. This property is not needed if a decreasing step size is used pey-sor-10 ; bia-hac-16 .

References

  • (1) A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of mathematical imaging and vision, 40(1):120–145, 2011.
  • (2) B. C. Vũ. A splitting algorithm for dual monotone inclusions involving cocoercive operators. Advances in Computational Mathematics, 38(3):667–681, 2013.
  • (3) L. Condat. A primal-dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. Journal of Optimization Theory and Applications, 158(2):460–479, 2013.
  • (4) H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, New York, 2011.
  • (5) P. Bianchi and W. Hachem. Dynamical behavior of a stochastic forward-backward algorithm using random monotone operators. Journal of Optimization Theory and Applications, 171(1):90–120, 2016.
  • (6) H. Ouyang, N. He, L. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In International Conference on Machine Learning, pages 80–88, 2013.
  • (7) L. Rosasco, S. Villa, and B. C. Vũ. Stochastic inertial primal-dual algorithms. arXiv preprint arXiv:1507.00852, 2015.
  • (8) H. Yu, M. Neely, and X. Wei. Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems, pages 1427–1437, 2017.
  • (9) P. L. Combettes and J. C. Pesquet. Stochastic forward-backward and primal-dual approximation algorithms with application to online image restoration. In Signal Processing Conference (EUSIPCO), 2016 24th European, pages 1813–1817. IEEE, 2016.
  • (10) P. Toulis, T. Horel, and E. M. Airoldi. Stable robbins-monro approximations through stochastic proximal updates. arXiv preprint arXiv:1510.00967, 2015.
  • (11) A. Patrascu and I. Necoara. Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization. Journal of Machine Learning Research, May 2017.
  • (12) P. Bianchi. Ergodic convergence of a stochastic proximal point algorithm. SIAM Journal on Optimization, 26(4):2235–2260, 2016.
  • (13) Y. F. Atchadé, G. Fort, and E. Moulines. On perturbed proximal gradient algorithms. Journal of Machine Learning Research, 18(1):310–342, 2017.
  • (14) P. Bianchi, W. Hachem, and A. Salim. A constant step Forward-Backward algorithm involving random maximal monotone operators. To appear in Journal of Convex Analysis, 2019.
  • (15) H. Attouch. Familles d’opérateurs maximaux monotones et mesurabilité. Annali di Matematica Pura ed Applicata, 120(1):35–111, 1979.
  • (16) R. T. Rockafellar and R. J.-B. Wets. On the interchange of subdifferentiation and conditional expectations for convex functionals. Stochastics, 7(3):173–182, 1982.
  • (17) H. H. Bauschke, J. M. Borwein, and W. Li. Strong conical hull intersection property, bounded linear regularity, Jameson’s property (G), and error bounds in convex optimization. Mathematical Programming, 86(1):135–160, 1999.
  • (18) I. Necoara, P. Richtarik, and A. Patrascu. Randomized projection methods for convex feasibility problems: conditioning and convergence rates. arXiv preprint arXiv:1801.04873, 2018.
  • (19) H. Brézis. Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert. North-Holland mathematics studies. Elsevier Science, Burlington, MA, 1973.
  • (20) J. Peypouquet and S. Sorin. Evolution equations for maximal monotone operators: asymptotic analysis in continuous and discrete time. Journal of Convex Analysis, 17(3-4):1113–1163, 2010.