Stochastic Conditional Gradient Methods: From Convex Minimization to Submodular Maximization

This paper considers stochastic optimization problems for a large class of objective functions, including convex and continuous submodular functions. Stochastic proximal gradient methods have been widely used to solve such problems; however, their applicability remains limited when the problem dimension is large and the projection onto a convex set is costly. Instead, stochastic conditional gradient methods are proposed as an alternative solution relying on (i) approximating gradients via a simple averaging technique requiring a single stochastic gradient evaluation per iteration; (ii) solving a linear program to compute the descent/ascent direction. The averaging technique reduces the noise of gradient approximations as time progresses, and replacing the projection step in proximal methods by a linear program lowers the computational complexity of each iteration. We show that under convexity and smoothness assumptions, our proposed method converges to the optimal objective function value at a sublinear rate of $O(1/t^{1/3})$. Further, for a monotone and continuous DR-submodular function and subject to a general convex body constraint, we prove that our proposed method achieves a $((1-1/e)\mathrm{OPT}-\epsilon)$ guarantee with $O(1/\epsilon^3)$ stochastic gradient computations. This guarantee matches the known hardness results and closes the gap between deterministic and stochastic continuous submodular maximization. Additionally, we obtain a $((1/e)\mathrm{OPT}-\epsilon)$ guarantee after using $O(1/\epsilon^3)$ stochastic gradients for the case that the objective function is continuous DR-submodular but non-monotone and the constraint set is down-closed. By using stochastic continuous optimization as an interface, we provide the first $(1-1/e)$ tight approximation guarantee for maximizing a monotone but stochastic submodular set function subject to a matroid constraint and a $(1/e)$ approximation guarantee for the non-monotone case.


1 Introduction

Stochastic optimization arises in many procedures including wireless communications (Ribeiro, 2010), learning theory (Vapnik, 2013), machine learning (Bottou, 2010), adaptive filters (Haykin, 2008), and portfolio selection (Shapiro et al., 2009), to name a few. In this class of problems the goal is to optimize an objective function defined as an expectation over a set of random functions subject to a general convex constraint. In particular, consider an optimization variable $\mathbf{x}\in\mathcal{X}\subseteq\mathbb{R}^n$ and a random variable $\mathbf{Z}\in\mathcal{Z}$ that together determine the choice of a stochastic function $\tilde{F}:\mathcal{X}\times\mathcal{Z}\to\mathbb{R}$. The goal is to solve the program

$$\max_{\mathbf{x}\in\mathcal{C}} F(\mathbf{x}) := \max_{\mathbf{x}\in\mathcal{C}} \mathbb{E}_{\mathbf{z}\sim P}\big[\tilde{F}(\mathbf{x},\mathbf{z})\big], \tag{1}$$

where $\mathcal{C}\subseteq\mathcal{X}$ is a convex compact set, $\mathbf{z}$ is the realization of the random variable $\mathbf{Z}$ drawn from a distribution $P$, and $F(\mathbf{x})$ is the expected value of the random functions $\tilde{F}(\mathbf{x},\mathbf{z})$ with respect to the random variable $\mathbf{Z}$. In this paper, our focus is on the cases where the function $F$ is either concave or continuous submodular. The first case, which considers the problem of maximizing a concave function, is equivalent to stochastic convex minimization, and the second case, in which the goal is to maximize a continuous submodular function, is called stochastic continuous submodular maximization. Note that the main challenge here is to solve Problem (1) without computing the objective function $F(\mathbf{x})$ or its gradient $\nabla F(\mathbf{x})$ explicitly, since we assume that either the probability distribution $P$ is unknown or the cost of computing the expectation is prohibitive.

In this regime, stochastic algorithms are the method of choice since they operate on computationally cheap stochastic estimates of the objective function gradients. Stochastic variants of the proximal gradient method are perhaps the most popular algorithms to tackle this category of problems both in convex minimization and submodular maximization. However, implementation of proximal methods requires projection onto a convex set at each iteration to enforce feasibility, which could be computationally expensive or intractable in many settings. To avoid the cost of projection, recourse to conditional gradient methods arises as a natural alternative. Unlike proximal algorithms, conditional gradient methods do not suffer from computationally costly projection steps and only require solving linear programs which can be implemented at a significantly lower complexity in many practical applications.

In deterministic convex minimization, in which the exact gradient information is given, the conditional gradient method, a.k.a. the Frank-Wolfe algorithm (Frank and Wolfe, 1956; Jaggi, 2013), succeeds in lowering the computational complexity of proximal algorithms due to its projection-free property. However, in the stochastic regime, where only an estimate of the gradient is available, stochastic conditional gradient methods either diverge or underperform the proximal gradient methods. In particular, in stochastic convex minimization it is known that stochastic conditional gradient methods may not converge to the optimal solution without an increasing batch size, whereas stochastic proximal gradient methods converge to the optimal solution at a sublinear rate. Hence, the question of whether one can design a convergent stochastic conditional gradient method with a small batch size remains unanswered.

In deterministic continuous submodular maximization, where the objective function is submodular (neither convex nor concave), a variant of the conditional gradient method called continuous greedy (Calinescu et al., 2011; Bian et al., 2017b) obtains a $(1-1/e)$ approximation guarantee for monotone functions; in contrast, the best-known result for proximal gradient methods is a $1/2$ approximation guarantee (Hassani et al., 2017). However, in the stochastic submodular maximization setting, stochastic variants of the continuous greedy algorithm with a small batch size fail to achieve a constant factor approximation (Hassani et al., 2017), whereas the stochastic proximal gradient method recovers the $1/2$ approximation obtained by the proximal gradient method in deterministic settings (Hassani et al., 2017). It is unknown if one can design a stochastic conditional gradient method that obtains a constant approximation guarantee, ideally $(1-1/e)$, for Problem (1).

In this paper, we introduce a stochastic conditional gradient method for solving the generic stochastic optimization Problem (1). The proposed method lowers the noise of gradient approximations through a simple gradient averaging technique which only requires a single stochastic gradient computation per iteration, i.e., the batch size can be as small as $1$. The proposed stochastic conditional gradient method improves the best-known convergence guarantees for stochastic conditional gradient methods in both convex minimization and submodular maximization settings. A summary of our theoretical contributions follows.

1. In stochastic convex minimization, we propose a stochastic variant of the Frank-Wolfe algorithm which converges to the optimal objective function value at the sublinear rate of $O(1/t^{1/3})$. In other words, the proposed SFW algorithm achieves an $\epsilon$-suboptimal objective function value after $O(1/\epsilon^3)$ stochastic gradient evaluations.

2. In stochastic continuous submodular maximization, we propose a stochastic conditional gradient method, which can be interpreted as a stochastic variant of the continuous greedy algorithm, that obtains the first tight $(1-1/e)$ approximation guarantee when the function is monotone. For the non-monotone case, the proposed method obtains a $(1/e)$ approximation guarantee. Moreover, for the more general case of $\gamma$-weakly DR-submodular monotone maximization, the proposed stochastic conditional gradient method achieves a $(1-e^{-\gamma})$-approximation guarantee.

3. In stochastic discrete submodular maximization, the proposed stochastic conditional gradient method achieves $(1-1/e)$ and $(1/e)$ approximation guarantees for monotone and non-monotone settings, respectively, by maximizing the multilinear extension of the stochastic discrete objective function. Further, if the objective function is monotone and has curvature $c$, the proposed stochastic conditional gradient method achieves a $(1/c)(1-e^{-c})$ approximation guarantee.

We begin the paper by studying the related work on stochastic methods in convex minimization and submodular maximization (Section 2). Then, we proceed by stating the stochastic convex minimization problem (Section 3). We introduce a stochastic conditional gradient method which can be interpreted as a stochastic variant of the Frank-Wolfe (FW) algorithm for solving stochastic convex minimization problems (Section 3.1). The proposed Stochastic Frank-Wolfe method (SFW) differs from the vanilla FW method in replacing gradients by their stochastic approximations, evaluated by averaging over previously observed stochastic gradients. We further analyze the convergence properties of the proposed SFW method (Section 3.2). In particular, we show that the averaging technique in SFW lowers the noise of gradient approximation (Lemma 3.2) and consequently the sequence of objective function values generated by SFW converges to the optimal objective function value at a sublinear rate of $O(1/t^{1/3})$ in expectation (Theorem 3.2). We complete this result by proving that the sequence of objective function values almost surely converges to the optimal objective function value (Theorem 3.2).

We then focus on the application of the proposed stochastic conditional gradient method in continuous submodular maximization (Section 4). After defining the notions of submodularity (Section 4.1), we introduce the Stochastic Continuous Greedy (SCG) algorithm for solving continuous submodular maximization problems (Section 4.2). The proposed SCG algorithm achieves the optimal $(1-1/e)$-approximation when the expected objective function is DR-submodular, monotone, and smooth (Theorem 4.2). To be more precise, the expected objective value of the iterates generated by SCG is not smaller than $(1-1/e)\mathrm{OPT}-\epsilon$ after $O(1/\epsilon^3)$ stochastic gradient evaluations. Moreover, for the case that the expected function is not DR-submodular but $\gamma$-weakly DR-submodular, the SCG algorithm obtains a $(1-e^{-\gamma})$-approximation guarantee (Theorem 4.3). We further extend our results to the non-monotone setting by introducing the Non-monotone Stochastic Continuous Greedy (NMSCG) method. We show that under the assumptions that the expected function is only DR-submodular and smooth, NMSCG reaches a $(1/e)$-approximation guarantee (Theorem 4.4).

The continuous multilinear extension of discrete submodular functions implies that the results for stochastic continuous DR-submodular maximization can be extended to stochastic discrete submodular maximization. We formalize this connection by introducing the stochastic discrete submodular maximization problem and defining its continuous multilinear extension (Section 5). By leveraging this connection, one can relax the discrete problem to a stochastic continuous submodular maximization, use SCG to solve the relaxed continuous problem within a $(1-1/e)$ approximation to the optimum value (Theorem 5), and use a proper rounding scheme (such as the contention resolution method (Chekuri et al., 2014)) to obtain a feasible set whose value is a $(1-1/e)$ approximation to the optimum set in expectation. In summary, we show that SCG achieves a $(1-1/e)$ approximation for a generic discrete monotone stochastic submodular maximization problem after a number of iterations that depends on the size of the ground set (Corollary 5.1). We further prove a $(1/e)$ approximation guarantee for NMSCG when we maximize a non-monotone stochastic submodular set function (Theorem 5). Moreover, if the expected set function has curvature $c$, SCG reaches a $(1/c)(1-e^{-c})$ approximation guarantee (Theorem 5.1). We close the paper with concluding remarks (Section 7).

Notation. Lowercase boldface $\mathbf{v}$ denotes a vector and uppercase boldface $\mathbf{A}$ denotes a matrix. We use $\|\mathbf{v}\|$ to denote the Euclidean norm of vector $\mathbf{v}$. The $i$-th element of the vector $\mathbf{v}$ is written as $v_i$ and the element on the $i$-th row and $j$-th column of the matrix $\mathbf{A}$ is denoted by $A_{i,j}$.

2 Related Work

In this section we overview the literature on conditional gradient methods in convex minimization as well as submodular maximization and compare our novel theoretical guarantees with the existing results.

Convex minimization. The problem of minimizing a stochastic convex function subject to a convex constraint has been tackled by many researchers and many approaches have been reported in the literature. Projected stochastic gradient descent (SGD) and stochastic variants of the Frank-Wolfe algorithm are among the most popular approaches. SGD updates the iterates by descending along the negative direction of the stochastic gradient with a proper stepsize and projecting the resulting point onto the feasible set (Robbins and Monro, 1951; Nemirovski and Yudin, 1978; Nemirovskii et al., 1983). Although stochastic gradient computation is inexpensive, the cost of the projection step can be high (Fujishige and Isotani, 2011) or intractable (Collins et al., 2008). In such cases projection-free conditional gradient methods, a.k.a. Frank-Wolfe algorithms, are more practical (Frank and Wolfe, 1956; Jaggi, 2013). The online Frank-Wolfe algorithm proposed by Hazan and Kale (2012) requires $O(1/\epsilon^4)$ stochastic gradient evaluations to reach an $\epsilon$-suboptimal objective function value, i.e., $\mathbb{E}[F(\mathbf{x})-F(\mathbf{x}^*)]\le\epsilon$, under the assumption that the objective function is convex and has bounded gradients. The stochastic Frank-Wolfe method studied by Hazan and Luo (2016) obtains an improved complexity of $O(1/\epsilon^3)$ under the assumptions that the expected objective function is smooth (has Lipschitz gradients) and Lipschitz continuous (the gradients are bounded). More importantly, the stochastic Frank-Wolfe algorithm in (Hazan and Luo, 2016) requires an increasing batch size as time progresses. In this paper, we propose a stochastic variant of the conditional gradient method which achieves the complexity of $O(1/\epsilon^3)$ under milder assumptions (it only requires smoothness of the expected function) and operates with a fixed batch size, e.g., a batch size of $1$.

Submodular maximization. Maximizing a deterministic submodular set function has been extensively studied. The celebrated result of Nemhauser et al. (1978) shows that a greedy algorithm achieves a $(1-1/e)$ approximation guarantee for a monotone function subject to a cardinality constraint. It is also known that this result is tight under reasonable complexity-theoretic assumptions (Feige, 1998). Recently, variants of the greedy algorithm have been proposed to extend the above result to non-monotone functions and more general constraints (Feige et al., 2011; Buchbinder et al., 2015, 2014; Mirzasoleiman et al., 2016; Feldman et al., 2017). While discrete greedy algorithms are fast, they usually do not provide the tightest guarantees for many classes of feasibility constraints. This is why continuous relaxations of submodular functions, e.g., the multilinear extension, have gained a lot of interest (Vondrák, 2008; Calinescu et al., 2011; Chekuri et al., 2014; Feldman et al., 2011; Gharan and Vondrák, 2011; Sviridenko et al., 2015). In particular, it is known that the continuous greedy algorithm achieves a $(1-1/e)$ approximation guarantee for monotone submodular functions under a general matroid constraint (Calinescu et al., 2011). An improved $((1-e^{-c})/c)$-approximation guarantee can be obtained if the objective function has curvature $c$ (Vondrák, 2010). The problem of maximizing submodular functions has also been studied for the non-monotone case, and constant approximation guarantees have been established (Feldman et al., 2011; Buchbinder et al., 2015; Ene and Nguyen, 2016; Buchbinder and Feldman, 2016).

Continuous submodularity naturally arises in many learning applications such as robust budget allocation (Staib and Jegelka, 2017; Soma et al., 2014), online resource allocation (Eghbali and Fazel, 2016), learning assignments (Golovin et al., 2014), as well as Adwords for e-commerce and advertising (Devanur and Jain, 2012; Mehta et al., 2007). Maximizing a deterministic continuous submodular function dates back to the work of Wolsey (1982). More recently, Chekuri et al. (2015) proposed a multiplicative weight update algorithm that achieves a $(1-1/e-\epsilon)$ approximation guarantee using oracle calls to gradients of a monotone smooth submodular function (i.e., twice differentiable DR-submodular) subject to a polytope constraint. A similar approximation factor can be obtained using oracle calls to gradients of monotone DR-submodular functions subject to a down-closed convex body with the continuous greedy method (Bian et al., 2017b). However, such results require exact computation of the gradients, which is not feasible in Problem (14). An alternative approach is then to modify the current algorithms by replacing the gradients $\nabla F(\mathbf{x}_t)$ by their stochastic estimates $\nabla\tilde{F}(\mathbf{x}_t,\mathbf{z}_t)$; however, this modification may lead to arbitrarily poor solutions as demonstrated in (Hassani et al., 2017). Another alternative is to estimate the gradient by averaging over a (large) mini-batch of samples at each iteration. While this approach can potentially reduce the noise variance, it increases the computational complexity of each iteration and is not favorable. The work by Hassani et al. (2017) is perhaps the first attempt to solve Problem (14) using only stochastic estimates of gradients (without using a large batch). They showed that the stochastic gradient ascent method achieves a $1/2$ approximation guarantee. Although this work opens the door for maximizing stochastic continuous submodular functions using computationally cheap stochastic gradients, it fails to achieve the optimal $(1-1/e)$ approximation. To close the gap, we propose in this paper Stochastic Continuous Greedy, which outputs a solution with function value at least $(1-1/e)\mathrm{OPT}-\epsilon$ after $O(1/\epsilon^3)$ iterations. Notably, our result only requires the expected function $F$ to be monotone and DR-submodular, and the stochastic functions $\tilde{F}$ need be neither monotone nor DR-submodular. Moreover, in contrast to the result in (Bian et al., 2017b), which holds for down-closed convex constraints, our result holds for any convex constraints. For non-monotone DR-submodular functions, we also propose the non-monotone stochastic continuous greedy (NMSCG) method that achieves a solution with function value at least $(1/e)\mathrm{OPT}-\epsilon$ after at most $O(1/\epsilon^3)$ iterations. Crucially, the feasible set in this case should be down-closed, since otherwise no constant approximation guarantee is possible (Chekuri et al., 2014).

Our result also has important implications for the problem of maximizing a stochastic discrete submodular function subject to a matroid constraint. Since the proposed SCG method works in stochastic settings, we can relax the discrete objective function $f$ to a continuous function $F$ through the multilinear extension (note that expectation is a linear operator). Then we can maximize $F$ within a $(1-1/e)$ approximation to the optimum value by using only oracle calls to the stochastic gradients of $F$ when the functions are monotone. Finally, a proper rounding scheme (such as the contention resolution method (Chekuri et al., 2014)) results in a feasible set whose value is a $(1-1/e)$ approximation to the optimum set in expectation (see Footnote 1). Using the same procedure we can also prove a $(1/e)$ approximation guarantee in expectation for the non-monotone case. Additionally, when the set function is monotone and has curvature $c$ (see (41) for the definition of the curvature), we show that the approximation factor can be improved from $(1-1/e)$ to $(1/c)(1-e^{-c})$.

Footnote 1: For ease of presentation, in the discrete setting we only present our results for the matroid constraint. However, our stochastic continuous algorithms can provide constant factor approximations for any constrained submodular maximization setting where an efficient and lossless rounding scheme exists. This includes, for example, knapsack constraints among many others.

3 Stochastic Convex Minimization

Many problems in machine learning can be reduced to the minimization of a convex objective function defined as an expectation over a set of random functions. In this section, our focus is on developing a stochastic variant of the conditional gradient method, a.k.a. Frank-Wolfe, which can be applied to solve general stochastic convex minimization problems. To make the notation consistent with other works on stochastic convex minimization, instead of solving Problem (1) for the case that the objective function $F$ is concave, we assume that $F$ is convex and intend to minimize it subject to a convex set $\mathcal{C}$. Therefore, the goal is to solve the program

$$\min_{\mathbf{x}\in\mathcal{C}} F(\mathbf{x}) := \min_{\mathbf{x}\in\mathcal{C}} \mathbb{E}_{\mathbf{z}\sim P}\big[\tilde{F}(\mathbf{x},\mathbf{z})\big]. \tag{2}$$

Canonical examples of problems having this form are support vector machines, least mean squares, and logistic regression.

In this section we assume that only the expected (average) function $F$ is convex and smooth, and the stochastic functions $\tilde{F}$ may be neither convex nor smooth. Since the expected function $F$ is convex, descent methods can be used to solve the program in (2). However, computing the gradients (or Hessians) of the function $F$ precisely requires access to the distribution $P$, which may not be feasible in many applications. To be more specific, we are interested in settings where the distribution $P$ is either unknown or evaluation of the expected value is computationally prohibitive. In this regime, stochastic gradient descent methods, which operate on stochastic approximations of the gradients, are the most widely used alternatives. In the following section, we aim to develop a stochastic variant of the Frank-Wolfe method which converges to an optimal solution of (2).

3.1 Stochastic Frank-Wolfe Method

In this section, we introduce a stochastic variant of the Frank-Wolfe method to solve Problem (2). Assume that at each iteration we have access to the stochastic gradient $\nabla\tilde{F}(\mathbf{x}_t,\mathbf{z}_t)$, which is an unbiased estimate of the gradient $\nabla F(\mathbf{x}_t)$. It is known that a naive stochastic implementation of Frank-Wolfe (replacing the gradient $\nabla F(\mathbf{x}_t)$ by $\nabla\tilde{F}(\mathbf{x}_t,\mathbf{z}_t)$) might diverge, due to the non-vanishing variance of gradient approximations. To resolve this issue, we introduce a stochastic version of the Frank-Wolfe algorithm which reduces the noise of gradient approximations via a common averaging technique in stochastic optimization (Ruszczyński, 1980, 2008; Yang et al., 2016; Mokhtari et al., 2017).

By letting $t\in\mathbb{N}$ be a discrete time index and $\rho_t$ a given stepsize which approaches zero as $t$ grows, the proposed biased gradient estimate $\mathbf{d}_t$ is defined by the recursion

$$\mathbf{d}_t = (1-\rho_t)\,\mathbf{d}_{t-1} + \rho_t\,\nabla\tilde{F}(\mathbf{x}_t,\mathbf{z}_t), \tag{3}$$

where the initial vector is given as $\mathbf{d}_0=\mathbf{0}$. We will show that the averaging technique in (3) reduces the noise of gradient approximation as time increases. More formally, the expected noise of the gradient estimate, $\mathbb{E}[\|\mathbf{d}_t-\nabla F(\mathbf{x}_t)\|^2]$, approaches zero asymptotically. This property implies that the biased gradient estimate $\mathbf{d}_t$ is a better candidate for approximating the gradient $\nabla F(\mathbf{x}_t)$ than the unbiased gradient estimate $\nabla\tilde{F}(\mathbf{x}_t,\mathbf{z}_t)$, which suffers from high variance. We therefore define the descent direction of our proposed Stochastic Frank-Wolfe (SFW) method as the solution of the linear program

$$\mathbf{v}_t = \operatorname*{argmin}_{\mathbf{v}\in\mathcal{C}} \{\mathbf{d}_t^T\mathbf{v}\}. \tag{4}$$

As in the traditional FW method, the updated variable $\mathbf{x}_{t+1}$ is a convex combination of $\mathbf{v}_t$ and the iterate $\mathbf{x}_t$,

$$\mathbf{x}_{t+1} = (1-\gamma_{t+1})\,\mathbf{x}_t + \gamma_{t+1}\,\mathbf{v}_t, \tag{5}$$

where $\gamma_t$ is a proper positive stepsize. Note that each iteration of the proposed SFW method only requires a single stochastic gradient computation, unlike the methods in (Hazan and Luo, 2016; Reddi et al., 2016), which require an increasing number of stochastic gradient evaluations as the number of iterations grows.

The proposed SFW is summarized in Algorithm 1. The core steps are Steps 2-5, which follow the updates in (3)-(5). The initial variable $\mathbf{x}_0$ can be any feasible vector in the convex set $\mathcal{C}$ and the initial gradient estimate is set to be the null vector $\mathbf{d}_0=\mathbf{0}$. The sequences of positive parameters $\rho_t$ and $\gamma_t$ should be diminishing at proper rates, as we describe in the following convergence analysis section.
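The updates (3)-(5) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' Algorithm 1: the toy least-squares problem, the choice of an $\ell_1$ ball as the feasible set, the helper names (`lmo_l1_ball`, `sfw`, `stoch_grad`, `suboptimality`), and the particular diminishing schedules for $\rho_t$ and $\gamma_{t+1}$ are all assumptions made for the example.

```python
import random

def lmo_l1_ball(d, radius=1.0):
    """Linear program (4) over the l1 ball: argmin_{||v||_1 <= radius} <d, v>."""
    i = max(range(len(d)), key=lambda k: abs(d[k]))
    v = [0.0] * len(d)
    if d[i] != 0.0:
        v[i] = -radius if d[i] > 0 else radius
    return v

def sfw(stoch_grad, lmo, x0, num_iters, rng):
    """Stochastic Frank-Wolfe sketch: one stochastic gradient per iteration."""
    x = list(x0)
    d = [0.0] * len(x)                        # d_0 = 0
    for t in range(1, num_iters + 1):
        rho = 4.0 / (t + 8) ** (2.0 / 3.0)    # assumed schedule with rho_t -> 0
        gamma = 1.0 / (t + 9)                 # assumed schedule for gamma_{t+1}
        g = stoch_grad(x, rng)                # single sample gradient
        d = [(1 - rho) * di + rho * gi for di, gi in zip(d, g)]      # update (3)
        v = lmo(d)                                                   # update (4)
        x = [(1 - gamma) * xi + gamma * vi for xi, vi in zip(x, v)]  # update (5)
    return x

# Hypothetical toy problem: F(x) = 0.5 * E[(a^T x - b)^2] with a ~ N(0, I) and
# b = a^T x_star + noise, minimized over the unit l1 ball (x_star lies in it).
X_STAR = [0.6, -0.4, 0.0, 0.0, 0.0]

def stoch_grad(x, rng):
    a = [rng.gauss(0.0, 1.0) for _ in x]
    b = sum(ai * si for ai, si in zip(a, X_STAR)) + 0.1 * rng.gauss(0.0, 1.0)
    residual = sum(ai * xi for ai, xi in zip(a, x)) - b
    return [residual * ai for ai in a]

def suboptimality(x):
    # F(x) - F(x_star) = 0.5 * ||x - x_star||^2 because E[a a^T] = I.
    return 0.5 * sum((xi - si) ** 2 for xi, si in zip(x, X_STAR))

rng = random.Random(0)
x0 = [0.0] * 5
x_final = sfw(stoch_grad, lmo_l1_ball, x0, 3000, rng)
```

Because every iterate is a convex combination of feasible points, `x_final` stays inside the $\ell_1$ ball without any projection step; the only per-iteration work beyond the sample gradient is the coordinate search in `lmo_l1_ball`.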

3.2 Convergence Analysis

In this section we study the convergence rate of the proposed SFW method for solving the constrained convex program in (2). To do so, we first assume that the following conditions hold.

Assumption 1

The convex set $\mathcal{C}$ is bounded with diameter $D$, i.e., for all $\mathbf{x},\mathbf{y}\in\mathcal{C}$ we can write

$$\|\mathbf{x}-\mathbf{y}\| \le D. \tag{6}$$

Assumption 2

The expected function $F$ is convex. Moreover, its gradients are $L$-Lipschitz continuous over the set $\mathcal{C}$, i.e., for all $\mathbf{x},\mathbf{y}\in\mathcal{C}$

$$\|\nabla F(\mathbf{x})-\nabla F(\mathbf{y})\| \le L\,\|\mathbf{x}-\mathbf{y}\|. \tag{7}$$

Assumption 3

The variance of the unbiased stochastic gradients $\nabla\tilde{F}(\mathbf{x},\mathbf{z})$ is bounded above by $\sigma^2$, i.e., for all random variables $\mathbf{z}$ and vectors $\mathbf{x}\in\mathcal{C}$ we can write

$$\mathbb{E}\big[\|\nabla\tilde{F}(\mathbf{x},\mathbf{z})-\nabla F(\mathbf{x})\|^2\big] \le \sigma^2. \tag{8}$$

Assumption 1 is standard in constrained convex optimization and is implied by the fact that the set $\mathcal{C}$ is convex and compact. The condition in Assumption 2 ensures that the objective function $F$ is smooth on the set $\mathcal{C}$. Note that here we only assume that the (average) function $F$ has Lipschitz continuous gradients, and the stochastic gradients $\nabla\tilde{F}(\mathbf{x},\mathbf{z})$ may not be Lipschitz continuous. Finally, the condition in Assumption 3 is customary in stochastic optimization and guarantees that the variance of stochastic gradients is bounded by a finite constant $\sigma^2$.

To study the convergence rate of SFW, we first derive an upper bound on the error of gradient approximation in the following lemma.

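Consider the proposed Stochastic Frank-Wolfe (SFW) method outlined in Algorithm 1. If the conditions in Assumptions 1-3 hold, the sequence of squared gradient errors $\|\nabla F(\mathbf{x}_t)-\mathbf{d}_t\|^2$ satisfies

$$\mathbb{E}\big[\|\nabla F(\mathbf{x}_t)-\mathbf{d}_t\|^2 \mid \mathcal{F}_t\big] \le \Big(1-\frac{\rho_t}{2}\Big)\|\nabla F(\mathbf{x}_{t-1})-\mathbf{d}_{t-1}\|^2 + \rho_t^2\sigma^2 + \frac{2L^2D^2\gamma_t^2}{\rho_t}, \tag{9}$$

where $\mathcal{F}_t$ is a sigma-algebra measuring all sources of randomness up to step $t$.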

Check Appendix A.

The result in Lemma 3.2 shows that the squared error of gradient approximation $\|\nabla F(\mathbf{x}_t)-\mathbf{d}_t\|^2$ decreases in expectation at each iteration by the factor $(1-\rho_t/2)$ if the remaining terms on the right hand side of (9) are negligible relative to the first term. This condition can be satisfied if the parameters $\rho_t$ and $\gamma_t$ are properly chosen. This observation verifies our intuition that the noise of the stochastic gradient approximation diminishes as the number of iterations increases.
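The noise-reduction effect of the averaging recursion can be checked numerically with a small simulation. The sketch below is only an illustration under simplifying assumptions: the iterate is held fixed (so the term in (9) involving $\gamma_t$ vanishes), the exact gradient is a hypothetical all-ones vector, the noise is Gaussian, and the stepsize $\rho_t=4/(t+8)^{2/3}$ is one assumed diminishing schedule.

```python
import random

def squared_errors(num_iters, sigma, dim, rng):
    """Run the averaging recursion at a fixed point.

    Returns the squared error of the averaged estimate d_t and of the last
    raw stochastic gradient sample, both measured against the exact gradient.
    """
    true_grad = [1.0] * dim          # hypothetical exact gradient at a fixed iterate
    d = [0.0] * dim                  # d_0 = 0
    sample = list(true_grad)
    for t in range(1, num_iters + 1):
        rho = 4.0 / (t + 8) ** (2.0 / 3.0)   # assumed diminishing stepsize
        sample = [g + sigma * rng.gauss(0.0, 1.0) for g in true_grad]
        d = [(1.0 - rho) * di + rho * si for di, si in zip(d, sample)]
    err_avg = sum((di - g) ** 2 for di, g in zip(d, true_grad))
    err_raw = sum((si - g) ** 2 for si, g in zip(sample, true_grad))
    return err_avg, err_raw

rng = random.Random(1)
trials = [squared_errors(1000, 1.0, 5, rng) for _ in range(50)]
mean_avg = sum(a for a, _ in trials) / len(trials)
mean_raw = sum(r for _, r in trials) / len(trials)
```

In this setup `mean_raw` stays near `dim * sigma**2` regardless of the number of iterations, while `mean_avg` shrinks as $\rho_t$ decays, which is the behavior the lemma formalizes.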

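In the following lemma, we derive an upper bound on the suboptimality $F(\mathbf{x}_{t+1})-F(\mathbf{x}^*)$ which depends on the norm of the gradient error $\|\nabla F(\mathbf{x}_t)-\mathbf{d}_t\|$.

Consider the proposed Stochastic Frank-Wolfe (SFW) method outlined in Algorithm 1. If the conditions in Assumptions 1-3 are satisfied, the suboptimality $F(\mathbf{x}_{t+1})-F(\mathbf{x}^*)$ satisfies

$$F(\mathbf{x}_{t+1})-F(\mathbf{x}^*) \le (1-\gamma_{t+1})\big(F(\mathbf{x}_t)-F(\mathbf{x}^*)\big) + \gamma_{t+1}D\,\|\nabla F(\mathbf{x}_t)-\mathbf{d}_t\| + \frac{LD^2\gamma_{t+1}^2}{2}. \tag{10}$$

Check Appendix B.

Based on Lemma 3.2, the suboptimality $F(\mathbf{x}_{t+1})-F(\mathbf{x}^*)$ approaches zero if the error of gradient approximation $\|\nabla F(\mathbf{x}_t)-\mathbf{d}_t\|$ converges to zero sufficiently fast and the last term $LD^2\gamma_{t+1}^2/2$ is summable, i.e., $\sum_{t=1}^{\infty}\gamma_t^2<\infty$. In the special case of zero approximation error, which is equivalent to the case that we have access to the expected gradient $\nabla F(\mathbf{x}_t)$, by setting $\gamma_t=O(1/t)$ it can be shown that the suboptimality converges to zero at the sublinear rate of $O(1/t)$. Therefore, the result in Lemma 3.2 is consistent with the analysis of the Frank-Wolfe method (Jaggi, 2013).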

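In the following theorem, by using the results in Lemmas 3.2 and 3.2, we establish an upper bound on the expected suboptimality $\mathbb{E}[F(\mathbf{x}_t)-F(\mathbf{x}^*)]$.

Consider the proposed Stochastic Frank-Wolfe (SFW) method outlined in Algorithm 1. Suppose the conditions in Assumptions 1-3 are satisfied. If we set $\gamma_t=1/(t+8)$ and $\rho_t=4/(t+8)^{2/3}$, then the expected suboptimality $\mathbb{E}[F(\mathbf{x}_t)-F(\mathbf{x}^*)]$ is bounded above by

$$\mathbb{E}\big[F(\mathbf{x}_t)-F(\mathbf{x}^*)\big] \le \frac{Q}{(t+9)^{1/3}}, \tag{11}$$

where the constant $Q$ is given by

$$Q := \max\Big\{9^{1/3}\big(F(\mathbf{x}_0)-F(\mathbf{x}^*)\big),\ \frac{LD^2}{2} + 2D\max\big\{2\|\nabla F(\mathbf{x}_0)-\mathbf{d}_0\|,\ (16\sigma^2+2L^2D^2)^{1/2}\big\}\Big\}. \tag{12}$$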

Check Appendix C.

The result in Theorem 3.2 indicates that the expected suboptimality of the iterates generated by the SFW method converges to zero at least at a sublinear rate of $O(1/t^{1/3})$. In other words, to achieve the expected suboptimality $\mathbb{E}[F(\mathbf{x}_t)-F(\mathbf{x}^*)]\le\epsilon$, the number of required stochastic gradients (sample gradients) is $O(1/\epsilon^3)$.
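The step from the rate to the sample complexity can be made explicit. Reading the bound (11) as $Q/(t+9)^{1/3}$ and recalling that SFW uses one stochastic gradient per iteration, requiring the right hand side to be at most $\epsilon$ gives:

```latex
\frac{Q}{(t+9)^{1/3}} \le \epsilon
\quad\Longleftrightarrow\quad
t \ge \left(\frac{Q}{\epsilon}\right)^{3} - 9
\quad\Longrightarrow\quad
t = O\!\left(\frac{1}{\epsilon^{3}}\right).
```

Since $Q$ does not depend on $t$ or $\epsilon$, the number of iterations, and hence the number of stochastic gradient evaluations, scales as $1/\epsilon^3$.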

To complete the convergence analysis of SFW we also prove that the sequence of the objective function values $F(\mathbf{x}_t)$ converges to the optimal value $F(\mathbf{x}^*)$ almost surely. This result is formalized in the following theorem.

Consider the proposed Stochastic Frank-Wolfe (SFW) method outlined in Algorithm 1. Suppose that the conditions in Assumptions 1-3 are satisfied. If we choose $\rho_t$ and $\gamma_t$ such that (i) $\sum_{t=0}^{\infty}\rho_t=\infty$, (ii) $\sum_{t=0}^{\infty}\rho_t^2<\infty$, (iii) $\sum_{t=0}^{\infty}\gamma_t=\infty$, and (iv) $\sum_{t=0}^{\infty}\gamma_t^2/\rho_t<\infty$, then the suboptimality $F(\mathbf{x}_t)-F(\mathbf{x}^*)$ converges to zero almost surely, i.e.,

$$\lim_{t\to\infty} F(\mathbf{x}_t)-F(\mathbf{x}^*) \overset{\text{a.s.}}{=} 0. \tag{13}$$

Check Appendix D.

Theorem 3.2 provides almost sure convergence of the sequence of objective function values $F(\mathbf{x}_t)$ to the optimal value $F(\mathbf{x}^*)$. In other words, it shows that the sequence of the objective function values $F(\mathbf{x}_t)$ converges to $F(\mathbf{x}^*)$ with probability 1. Indeed, a valid set of choices for $\rho_t$ and $\gamma_t$ to satisfy the required conditions in Theorem 3.2 is $\rho_t=O(1/t^{2/3})$ and $\gamma_t=O(1/t)$.

4 Stochastic Continuous Submodular Maximization

In the previous section, we focused on the convex setting, but what if the objective function is not convex? In this section, we consider a broad class of non-convex optimization problems that possess special combinatorial structures. More specifically, we focus on constrained maximization of stochastic continuous submodular functions that demonstrate diminishing returns, i.e., continuous DR-submodular functions,

$$\max_{\mathbf{x}\in\mathcal{C}} F(\mathbf{x}) \doteq \max_{\mathbf{x}\in\mathcal{C}} \mathbb{E}_{\mathbf{z}\sim P}\big[\tilde{F}(\mathbf{x},\mathbf{z})\big]. \tag{14}$$

As before, the functions $\tilde{F}:\mathcal{X}\times\mathcal{Z}\to\mathbb{R}_+$ are stochastic, where $\mathbf{x}\in\mathcal{X}$ is the optimization variable, $\mathbf{z}\in\mathcal{Z}$ is a realization of the random variable $\mathbf{Z}$ drawn from a distribution $P$, and $\mathcal{X}\subseteq\mathbb{R}^n_+$ is a compact set. Our goal is to maximize the expected value of the random functions $\tilde{F}(\mathbf{x},\mathbf{z})$ over the convex body $\mathcal{C}\subseteq\mathcal{X}$. Note that we only assume that $F(\mathbf{x})$ is DR-submodular, and not necessarily the stochastic functions $\tilde{F}(\mathbf{x},\mathbf{z})$. We also consider situations where the distribution $P$ is either unknown (e.g., when the objective is given as an implicit stochastic model) or the domain of the random variable $\mathbf{Z}$ is very large (e.g., when the objective is defined in terms of an empirical risk), which makes the cost of computing the expectation very high. In these regimes, stochastic optimization methods, which operate on computationally cheap estimates of gradients, arise as natural solutions. In fact, very recently, it was shown in (Hassani et al., 2017) that stochastic gradient methods achieve a $1/2$ approximation guarantee for Problem (14). In Section 3 of (Hassani et al., 2017), the authors also showed that if we simply substitute gradients by stochastic gradients in the update of conditional gradient methods (a.k.a. Frank-Wolfe), such as continuous greedy (Vondrák, 2008) or its close variant (Bian et al., 2017b), the resulting method can perform arbitrarily poorly in stochastic continuous submodular maximization settings. Our goal in this section is to design a stable stochastic variant of the conditional gradient method to solve Problem (14) up to a constant factor.

4.1 Preliminaries

We begin by recalling the definition of a submodular set function: A function $f:2^V\to\mathbb{R}_+$, defined on the ground set $V$, is called submodular if for all subsets $A,B\subseteq V$, we have

$$f(A)+f(B) \ge f(A\cap B)+f(A\cup B).$$
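For small ground sets the defining inequality can be verified by brute force. The sketch below is an illustrative check, not part of the paper: the coverage instance `COVERED` and the helper names are hypothetical, and the enumeration over all pairs of subsets is only practical for tiny ground sets.

```python
from itertools import combinations

def powerset(ground):
    """All subsets of the ground set, as frozensets."""
    return [frozenset(c) for r in range(len(ground) + 1)
            for c in combinations(ground, r)]

def is_submodular(f, ground):
    """Check f(A) + f(B) >= f(A & B) + f(A | B) for every pair of subsets."""
    subsets = powerset(ground)
    return all(f(a) + f(b) >= f(a & b) + f(a | b)
               for a in subsets for b in subsets)

# Hypothetical coverage instance: item i covers the elements in COVERED[i].
COVERED = {0: {"p", "q"}, 1: {"q", "r"}, 2: {"r", "s", "t"}}

def coverage(selection):
    """Number of elements covered by the selected items (a submodular function)."""
    covered = set()
    for i in selection:
        covered |= COVERED[i]
    return len(covered)

def squared_size(selection):
    """|S|^2 has increasing marginal gains, so it is not submodular."""
    return len(selection) ** 2

coverage_ok = is_submodular(coverage, [0, 1, 2])
squared_ok = is_submodular(squared_size, [0, 1, 2])
```

The counterexample for `squared_size` is already visible with two singletons: $f(\{0\})+f(\{1\})=2$ while $f(\emptyset)+f(\{0,1\})=4$.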

The notion of submodularity goes beyond the discrete domain (Wolsey, 1982; Vondrák, 2007; Bach, 2015). Consider a continuous function where the set is of the form and each is a compact subset of . We call the continuous function submodular if for all we have

 F(x)+F(y)≥F(x∨y)+F(x∧y), (15)

where $x \vee y \doteq \max(x, y)$ (component-wise) and $x \wedge y \doteq \min(x, y)$ (component-wise). Further, a submodular function $F$ is monotone (on the set $\mathcal{X}$) if

 x≤y⟹F(x)≤F(y), (16)

for all $x, y \in \mathcal{X}$. Note that $x \le y$ in (16) means that $x_i \le y_i$ for all $i = 1, \dots, n$. Furthermore, a differentiable submodular function $F$ is called DR-submodular (i.e., shows diminishing returns) if the gradients are antitone, namely, for all $x, y \in \mathcal{X}$ we have

 x≤y⟹∇F(x)≥∇F(y). (17)

When the function is twice differentiable, submodularity implies that all cross-second-derivatives are non-positive (Bach, 2015), i.e.,

 $$\text{for all } i \neq j, \ \text{for all } x \in \mathcal{X}, \quad \frac{\partial^2 F(x)}{\partial x_i \partial x_j} \le 0, \tag{18}$$

and DR-submodularity implies that all second-derivatives are non-positive (Bian et al., 2017b), i.e.,

 $$\text{for all } i, j, \ \text{for all } x \in \mathcal{X}, \quad \frac{\partial^2 F(x)}{\partial x_i \partial x_j} \le 0. \tag{19}$$
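To make conditions (17) and (19) concrete, the following sketch builds a quadratic $F(x) = h^T x + \frac{1}{2} x^T H x$ whose Hessian $H$ is entrywise non-positive, and numerically checks that its gradient is antitone on random ordered pairs $x \le y$. The quadratic instance, the dimension, and the function names are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Hypothetical instance: symmetric Hessian with all entries <= 0, as in (19).
A = -rng.uniform(0.0, 1.0, size=(n, n))
H = 0.5 * (A + A.T)
h = rng.uniform(0.5, 1.0, size=n)


def grad_F(x):
    """Gradient of F(x) = h^T x + 0.5 * x^T H x."""
    return h + H @ x


def gradients_antitone(num_trials=1000):
    """Check x <= y  =>  grad F(x) >= grad F(y) on random pairs in [0, 1]^n."""
    for _ in range(num_trials):
        x = rng.uniform(0.0, 1.0, size=n)
        y = x + rng.uniform(0.0, 1.0, size=n) * (1.0 - x)  # x <= y <= 1
        if not np.all(grad_F(x) >= grad_F(y) - 1e-12):
            return False
    return True


print(gradients_antitone())
```

Here $\nabla F(x) - \nabla F(y) = H(x - y)$, which is non-negative whenever $x \le y$ and $H \le 0$ entrywise, so the check succeeds for every pair.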

4.2 Stochastic Continuous Greedy

We proceed to introduce Stochastic Continuous Greedy (SCG), a stochastic variant of the continuous greedy method (Vondrák, 2008) for solving Problem (14). We only assume that the expected objective function $F$ is monotone and DR-submodular; the stochastic functions $\tilde{F}(x, z)$ need not be monotone or submodular. Since the objective function $F$ is monotone and DR-submodular, the continuous greedy algorithm (Calinescu et al., 2011; Bian et al., 2017b) can be used in principle to solve Problem (14). Note that each update of continuous greedy requires computing the gradient of $F$, i.e., $\nabla F(x) = \mathbb{E}[\nabla \tilde{F}(x, z)]$. However, if we only have access to the (computationally cheap) stochastic gradients $\nabla \tilde{F}(x, z)$, then the continuous greedy method is not directly usable (Hassani et al., 2017). This limitation is due to the non-vanishing variance of the gradient approximations. To resolve this issue, we use the gradient averaging technique of Section 3.1. As in SFW, we define the estimated gradient $d_t$ by the recursion

 $$d_t = (1 - \rho_t)\, d_{t-1} + \rho_t \nabla \tilde{F}(x_t, z_t), \tag{20}$$

where $\rho_t$ is a positive stepsize and the initial vector is defined as $d_0 = \mathbf{0}$. We therefore define the ascent direction of our proposed SCG method as follows

 $$v_t = \operatorname*{argmax}_{v \in \mathcal{C}} \{ d_t^T v \}, \tag{21}$$

which is a linear objective maximization over the convex set $\mathcal{C}$. Indeed, if instead of the gradient estimate $d_t$ we use the exact gradient $\nabla F(x_t)$ for the updates in (21), the continuous greedy update is recovered. Here, as in continuous greedy, the initial decision vector is the null vector, $x_0 = \mathbf{0}$. Further, the stepsize for updating the iterates is equal to $1/T$, and the variable $x_t$ is updated as

 $$x_{t+1} = x_t + \frac{1}{T} v_t. \tag{22}$$

The stepsize $1/T$ and the initialization $x_0 = \mathbf{0}$ ensure that after $T$ iterations the variable $x_T$ ends up in the convex set $\mathcal{C}$, since $x_T$ is then the average of the $T$ feasible directions $v_t \in \mathcal{C}$. We would like to highlight that the convex body $\mathcal{C}$ may not be down-closed or contain $\mathbf{0}$. Nonetheless, the final iterate $x_T$ returned by SCG is a feasible point in $\mathcal{C}$. The steps of the proposed SCG method are outlined in Algorithm 2. Note that the major difference between SFW in Algorithm 1 and SCG in Algorithm 2 is in Step 4, where the variable $x_{t+1}$ is computed. In SFW, $x_{t+1}$ is a convex combination of $x_t$ and $v_t$, while in SCG $x_{t+1}$ is computed by moving from $x_t$ in the direction $v_t$ with the stepsize $1/T$.
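The updates (20)-(22) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the constraint set is taken to be the budget polytope $\{v \in [0,1]^n : \sum_i v_i \le k\}$ (for which the linear program has a closed-form solution), the toy quadratic objective and the names `lmo_budget`, `scg`, and `make_toy_problem` are hypothetical, and the averaging schedule $\rho_t = 4/(t+8)^{2/3}$ is an assumed choice.

```python
import numpy as np


def lmo_budget(d, k):
    """Solve max_v d^T v over C = {v in [0,1]^n : sum(v) <= k}."""
    v = np.zeros(len(d))
    top = np.argsort(-d)[:k]          # indices of the k largest entries of d
    v[top[d[top] > 0]] = 1.0          # only strictly positive entries help
    return v


def scg(stoch_grad, n, k, T, rng):
    """Sketch of SCG: gradient averaging (20), linear program (21), update (22)."""
    x = np.zeros(n)                   # initial iterate is the null vector
    d = np.zeros(n)                   # initial gradient estimate
    for t in range(1, T + 1):
        rho = 4.0 / (t + 8) ** (2.0 / 3.0)   # assumed averaging schedule
        d = (1.0 - rho) * d + rho * stoch_grad(x, rng)
        v = lmo_budget(d, k)
        x = x + v / T                 # stepsize 1/T keeps the final iterate in C
    return x


def make_toy_problem(n, rng, noise=0.1):
    """Hypothetical monotone DR-submodular quadratic with noisy gradients."""
    A = -rng.uniform(0.0, 0.1, size=(n, n))
    H = 0.5 * (A + A.T)               # symmetric, entrywise non-positive
    h = -H @ np.ones(n) + 0.1         # keeps every gradient entry >= 0.1 on [0,1]^n

    def F(x):
        return h @ x + 0.5 * x @ H @ x

    def stoch_grad(x, rng):
        return h + H @ x + noise * rng.standard_normal(n)

    return F, stoch_grad


rng = np.random.default_rng(0)
F, stoch_grad = make_toy_problem(n=10, rng=rng)
x_T = scg(stoch_grad, n=10, k=3, T=200, rng=rng)
print("F(x_T) =", F(x_T), "sum(x_T) =", x_T.sum())
```

Because the final iterate is the average of $T$ points of $\mathcal{C}$, it is feasible by convexity even though the intermediate steps are not convex combinations.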

We proceed to study the convergence properties of our proposed SCG method for solving Problem (14). To do so, we first assume that the following conditions hold.

Assumption 4

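The Euclidean norms of the elements in the constraint set $\mathcal{C}$ are uniformly bounded, i.e., for all $x \in \mathcal{C}$ we can write

 $$\|x\| \le D. \tag{23}$$

Assumption 5

The function $F$ is DR-submodular and monotone. Further, its gradients are $L$-Lipschitz continuous over the set $\mathcal{X}$, i.e., for all $x, y \in \mathcal{X}$

 $$\|\nabla F(x) - \nabla F(y)\| \le L \|x - y\|. \tag{24}$$

Assumption 6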

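The variance of the unbiased stochastic gradients $\nabla \tilde{F}(x, z)$ is bounded above by $\sigma^2$, i.e., for any vector $x \in \mathcal{X}$ we can write

 $$\mathbb{E}\left[ \|\nabla \tilde{F}(x, z) - \nabla F(x)\|^2 \right] \le \sigma^2, \tag{25}$$

where the expectation is with respect to the randomness of $z \sim P$.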

Due to the initialization step of SCG (i.e., starting from $x_0 = \mathbf{0}$) we need a bound on the furthest feasible solution from $\mathbf{0}$ that we can end up with; such a bound is guaranteed by Assumption 4. The condition in Assumption 5 ensures that the objective function $F$ is smooth. Note again that $\nabla \tilde{F}(x, z)$ may or may not be Lipschitz continuous. Finally, the condition in Assumption 6 guarantees that the variance of the stochastic gradients $\nabla \tilde{F}(x, z)$ is bounded by a finite constant $\sigma^2$. Note that Assumptions 5-6 are stronger than Assumptions 2-3, since they ensure smoothness and bounded gradient variance for all points $x \in \mathcal{X}$ and not only for the feasible points $x \in \mathcal{C}$.

To study the convergence of SCG, we first derive an upper bound for the expected error of gradient approximation (i.e., $\mathbb{E}[\|\nabla F(x_t) - d_t\|^2]$) in the following lemma.

Consider Stochastic Continuous Greedy (SCG) outlined in Algorithm 2. If Assumptions 4-6 are satisfied and , then for we have

 $$\mathbb{E}\left[ \|\nabla F(x_t) - d_t\|^2 \right] \le \frac{Q}{(t+9)^{2/3}}, \tag{26}$$

where

See Appendix E for the proof.

Let us now use the result of Lemma 4.2 to show that the sequence of iterates generated by SCG reaches a $(1 - 1/e)$ approximation guarantee for Problem (14).

Consider Stochastic Continuous Greedy (SCG) outlined in Algorithm 2. If Assumptions 4-6 are satisfied and , then the expected objective function value for the iterates generated by SCG satisfies the inequality

 $$\mathbb{E}[F(x_T)] \ge (1 - 1/e)\,\mathrm{OPT} - \frac{15 D Q^{1/2}}{T^{1/3}} - \frac{L D^2}{2T}, \tag{27}$$

where and .

See Appendix F for the proof.

The result in Theorem 4.2 shows that the sequence of iterates generated by SCG, which only has access to a noisy unbiased estimate of the gradient at each iteration, achieves the optimal approximation bound $(1 - 1/e)$, while the error term vanishes at a sublinear rate of $O(1/T^{1/3})$.
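To see how this rate translates into the number of stochastic gradients stated in the abstract, the following short derivation bounds the number of iterations needed for an additive error of at most $\epsilon$. It is a sketch that uses only the constants appearing in (27); the target accuracy $\epsilon > 0$ is a symbol introduced here, and splitting the error evenly between the two terms is one convenient choice among many.

```latex
% Require each of the two error terms in (27) to be at most \epsilon / 2.
\begin{align*}
\frac{15 D Q^{1/2}}{T^{1/3}} \le \frac{\epsilon}{2}
  &\iff T \ge \left( \frac{30 D Q^{1/2}}{\epsilon} \right)^{3}
        = \frac{27000\, D^3 Q^{3/2}}{\epsilon^3}, \\
\frac{L D^2}{2T} \le \frac{\epsilon}{2}
  &\iff T \ge \frac{L D^2}{\epsilon}.
\end{align*}
% Hence it suffices to take
\[
T \ge \max\left\{ \frac{27000\, D^3 Q^{3/2}}{\epsilon^3},\ \frac{L D^2}{\epsilon} \right\}
\quad \Longrightarrow \quad
\mathbb{E}[F(x_T)] \ge (1 - 1/e)\,\mathrm{OPT} - \epsilon .
\]
```

Since SCG evaluates a single stochastic gradient per iteration, this corresponds to $O(1/\epsilon^3)$ stochastic gradient computations.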

4.3 Weak Submodularity

In this section, we extend our results to the more general case where the expected objective function $F$ is weakly submodular. A continuous function $F$ is $\gamma$-weakly DR-submodular if

 $$\gamma = \inf_{x, y \in \mathcal{X},\ x \le y} \ \inf_{i \in [n]} \frac{[\nabla F(x)]_i}{[\nabla F(y)]_i}, \tag{28}$$

where $[a]_i$ denotes the $i$-th element of vector $a$. See (Eghbali and Fazel, 2016) for related definitions. In the following theorem, we prove that the proposed SCG method achieves a $(1 - e^{-\gamma})$ approximation guarantee when the expected function $F$ is monotone and weakly DR-submodular with parameter $\gamma$.

Consider Stochastic Continuous Greedy (SCG) outlined in Algorithm 2. If Assumptions 4-6 are satisfied and the function is -weakly DR-submodular, then for the expected objective function value of the iterates generated by SCG satisfies the inequality

 $$\mathbb{E}[F(x_T)] \ge (1 - e^{-\gamma})\,\mathrm{OPT} - \frac{15 D Q^{1/2}}{T^{1/3}} - \frac{L D^2}{2T}, \tag{29}$$

where and

See Appendix G for the proof.

4.4 Non-monotone Continuous Submodular Maximization

In this section, we extend the results for the proposed Stochastic Continuous Greedy algorithm to the maximization of non-monotone stochastic DR-submodular functions. The problem formulation of interest is similar to Problem (14), except that the objective function $F$ may not be monotone and the set $\mathcal{X}$ is a bounded box. To be more precise, we aim to solve the program

 $$\max_{x \in \mathcal{C}} F(x) \doteq \max_{x \in \mathcal{C}} \mathbb{E}_{z \sim P}[\tilde{F}(x, z)], \tag{30}$$

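where $F: \mathcal{X} \to \mathbb{R}_+$ is continuous DR-submodular, $\mathcal{X} = \prod_{i=1}^n \mathcal{X}_i$, each $\mathcal{X}_i = [\underline{u}_i, \bar{u}_i]$ is a bounded interval, and the convex set $\mathcal{C}$ is a subset of $\mathcal{X} = [\underline{u}, \bar{u}]$, where $\underline{u} = [\underline{u}_1; \dots; \underline{u}_n]$ and $\bar{u} = [\bar{u}_1; \dots; \bar{u}_n]$. We propose the first stochastic conditional gradient method for solving the stochastic non-monotone maximization problem in (30). Throughout this section, we further assume that the convex set $\mathcal{C}$ is down-closed and $\mathbf{0} \in \mathcal{C}$.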

We introduce a variant of the Stochastic Continuous Greedy method that achieves a $(1/e)$-approximation guarantee for Problem (30). The proposed Non-monotone Stochastic Continuous Greedy (NMSCG) method is inspired by the unified measured continuous greedy algorithm in (Feldman et al., 2011) and the Frank-Wolfe method in (Bian et al., 2017a) for non-monotone deterministic continuous submodular maximization. The steps of NMSCG are summarized in Algorithm 3. The stochastic gradient update (Step 2) and the update of the variable $x_t$ (Step 4) are identical to the ones for SCG in Section 4.2. The main difference between NMSCG and SCG is in the computation of the ascent direction $v_t$. In particular, in NMSCG the ascent direction vector $v_t$ is obtained by solving the linear program

 $$v_t = \operatorname*{argmax}_{v \in \mathcal{C},\ v \le \bar{u} - x_t} \{ d_t^T v \}, \tag{31}$$

which differs from (21) by having the extra constraint $v \le \bar{u} - x_t$. This extra condition is added to ensure that the solution does not grow aggressively, since in the non-monotone case dramatic growth of the solution may lead to poor performance. In NMSCG, the initial variable is $x_0 = \mathbf{0}$, which is a legitimate initialization as we assume that the convex set $\mathcal{C}$ is down-closed. In the following theorem, we establish a $(1/e)$-guarantee for NMSCG.
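The effect of the extra constraint in (31) can be illustrated on a simple down-closed set. The sketch below assumes $\mathcal{C} = \{v \ge 0 : \sum_i v_i \le k\}$ inside the box $[0, \bar{u}]$, for which the capped linear program is solved by a greedy fill; the function names, the toy gradient oracle, and the averaging schedule $\rho_t = 4/(t+8)^{2/3}$ are assumptions for illustration rather than the paper's prescribed choices.

```python
import numpy as np


def lmo_capped_budget(d, cap, k):
    """Solve max_v d^T v  s.t.  0 <= v <= cap and sum(v) <= k (greedy fill)."""
    v = np.zeros(len(d))
    budget = float(k)
    for i in np.argsort(-d):          # visit coordinates by decreasing d_i
        if d[i] <= 0 or budget <= 0:
            break                     # remaining coordinates cannot increase d^T v
        v[i] = min(cap[i], budget)
        budget -= v[i]
    return v


def nmscg(stoch_grad, u_bar, k, T, rng):
    """Sketch of NMSCG: same averaging and update as SCG, capped linear program."""
    n = len(u_bar)
    x = np.zeros(n)                   # feasible start since C is down-closed
    d = np.zeros(n)
    for t in range(1, T + 1):
        rho = 4.0 / (t + 8) ** (2.0 / 3.0)          # assumed averaging schedule
        d = (1.0 - rho) * d + rho * stoch_grad(x, rng)
        v = lmo_capped_budget(d, u_bar - x, k)      # extra constraint v <= u_bar - x_t
        x = x + v / T
    return x


# Toy non-monotone DR-submodular objective F(x) = sum_i x_i (1 - x_i),
# observed through noisy gradients (hypothetical oracle).
def toy_grad(x, rng):
    return 1.0 - 2.0 * x + 0.05 * rng.standard_normal(x.shape)


rng = np.random.default_rng(0)
x_T = nmscg(toy_grad, u_bar=np.ones(6), k=2, T=100, rng=rng)
print(x_T)
```

Since $v_t \le \bar{u} - x_t$, each step satisfies $x_{t+1} \le x_t + (\bar{u} - x_t)/T \le \bar{u}$, so the iterates never leave the box.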

Consider Non-monotone Stochastic Continuous Greedy (NMSCG) outlined in Algorithm 3 with the averaging parameter . If Assumptions 4 and 6 hold and the gradients are -Lipschitz continuous and the convex set is down-closed, then the iterate generated by NMSCG satisfies the inequality

 (32)

where

See Section H.

The result in Theorem 4.4 states that the sequence of iterates generated by NMSCG achieves a $((1/e)\mathrm{OPT} - \epsilon)$ guarantee after $O(1/\epsilon^3)$ stochastic gradient computations.

5 Stochastic Discrete Submodular Maximization

Even though submodularity has been mainly studied in discrete domains (Fujishige, 2005), many efficient methods for optimizing such submodular set functions rely on continuous relaxations, either through a multilinear extension (Vondrák, 2008) (for maximization) or the Lovász extension (Lovász, 1983) (for minimization). In fact, Problem (14) has a discrete counterpart, recently considered in (Hassani et al., 2017; Karimi et al., 2017):

 $$\max_{S \in \mathcal{I}} f(S) \doteq \max_{S \in \mathcal{I}} \mathbb{E}_{z \sim P}[\tilde{f}(S, z)], \tag{33}$$

where the function $f: 2^V \to \mathbb{R}_+$ is submodular, the functions $\tilde{f}$ are stochastic, $S$ is the optimization set variable defined over a ground set $V$, $z$ is the realization of a random variable $Z$ drawn from the distribution $P$, and $\mathcal{I}$ is a general matroid constraint. Since $P$ is unknown, problem (33) cannot be directly solved using the current state-of-the-art techniques. Instead, Hassani et al. (2017) showed that by lifting the problem to the continuous domain (via the multilinear relaxation) and using stochastic gradient methods on the continuous relaxation, one can reach a solution that is within a factor $(1/2)$ of the optimum. Contemporaneously, Karimi et al. (2017) used a concave relaxation technique to provide a $(1 - 1/e)$ approximation for the class of submodular coverage functions. Our work closes the gap for stochastic submodular set maximization, namely, Problem (33), by providing the first tight $(1 - 1/e)$ approximation guarantee for general monotone submodular set functions subject to a matroid constraint. For the non-monotone case, we obtain a $(1/e)$ approximation guarantee. We further show that a -approximation guarantee can be achieved when the function has curvature .

According to the results in Section 4, SCG achieves in expectation a $(1 - 1/e)$-optimal solution for Problem (14) when the function $F$ is monotone and DR-submodular, and NMSCG obtains a $(1/e)$-optimal solution for the non-monotone case. The focus of this section is on extending these results to the discrete domain and showing that SCG and NMSCG can be used to maximize a stochastic submodular set function $f$, namely Problem (33), through the multilinear extension of the function $f$. To be more precise, in lieu of solving the program in (33) one can solve the continuous optimization problem

 $$\max_{x \in \mathcal{C}} F(x) = \max_{x \in \mathcal{C}} \sum_{S \subset V} f(S) \prod_{i \in S} x_i \prod_{j \notin S} (1 - x_j), \tag{34}$$

where $F$ is the multilinear extension of the function $f$ and the convex set $\mathcal{C}$ is the matroid polytope (Calinescu et al., 2011), which is down-closed (note that in (34), $x_i$ denotes the $i$-th element of the vector $x$). The fractional solution of the program (34) can then be rounded into a feasible discrete solution without any loss (in expectation) in objective value by methods such as randomized PIPAGE ROUNDING (Calinescu et al., 2011). Note that randomized PIPAGE ROUNDING requires computational complexity (Karimi et al., 2017) for the uniform matroid and complexity for general matroids.

Indeed, the conventional continuous greedy algorithm is able to solve the program in (34); however, each iteration of the method is computationally costly due to gradient evaluations. Instead, Feldman et al. (2011) and Calinescu et al. (2011) suggested approximating the gradient using a sufficient number of samples from . This mechanism still requires access to the set function multiple times at each iteration, and hence is not efficient for solving Problem (33). The idea is then to use a stochastic (unbiased) estimate for the gradient . In the following remark, we provide a method to compute an unbiased estimate of the gradient using samples from , where and ’s, , are carefully chosen sets.

(Constructing an Unbiased Estimator of the Gradient in Multilinear Extensions) Recall that $f(S) = \mathbb{E}_{z \sim P}[\tilde{f}(S, z)]$. In terms of the multilinear extensions, we obtain $F(x) = \mathbb{E}_{z \sim P}[\tilde{F}(x, z)]$, where $F$ and $\tilde{F}$ denote the multilinear extensions of $f$ and $\tilde{f}$, respectively. So $\nabla \tilde{F}(x, z)$ is an unbiased estimator of $\nabla F(x)$ when $z \sim P$. Note that finding the gradient of $\tilde{F}$ may not be easy, as it contains exponentially many terms. Instead, we can provide computationally cheap unbiased estimators for $\nabla \tilde{F}(x, z)$. It can easily be shown that

 $$\frac{\partial \tilde{F}}{\partial x_i} = \tilde{F}(x, z; x_i \leftarrow 1) - \tilde{F}(x, z; x_i \leftarrow 0), \tag{35}$$

where, for example, $(x; x_i \leftarrow 1)$ denotes the vector that has value $1$ on its $i$-th coordinate and is equal to $x$ elsewhere. To create an unbiased estimator for $\partial \tilde{F} / \partial x_i$ at a point $x$ with realization $z$, we can simply sample a set $S$ by including each element $j$ in it independently with probability $x_j$ and use $\tilde{f}(S \cup \{i\}, z) - \tilde{f}(S \setminus \{i\}, z)$ as an unbiased estimator for the $i$-th partial derivative of $\tilde{F}$. We can sample one single set $S$ and use the above trick for all the coordinates. This involves function computations for .
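The sampling construction in this remark can be sketched as follows. The code draws one random set $S$ with $\Pr[j \in S] = x_j$ and uses the marginal differences as the gradient estimate, then compares the average of many such estimates against the exact partial derivatives computed by enumeration. The toy coverage function (standing in for $\tilde{f}(\cdot, z)$ at a fixed realization $z$) and all function names are assumptions made for illustration; the exact computation is exponential in $n$ and is only meant as a sanity check on small instances.

```python
from itertools import combinations
from math import prod

import numpy as np

# Hypothetical coverage function playing the role of f~(., z) for one fixed z.
COVERS = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}, 3: {"a", "d", "e"}}


def coverage(S):
    covered = set()
    for i in S:
        covered |= COVERS[i]
    return len(covered)


def sampled_gradient(f, x, rng):
    """One-sample estimate: draw S with P(j in S) = x_j, then use
    f(S + {i}) - f(S - {i}) as the estimate of the i-th partial derivative."""
    n = len(x)
    S = {j for j in range(n) if rng.random() < x[j]}
    return np.array([f(S | {i}) - f(S - {i}) for i in range(n)], dtype=float)


def exact_gradient(f, x):
    """Exact partial derivatives of the multilinear extension by enumeration."""
    n = len(x)
    grad = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for r in range(len(others) + 1):
            for combo in combinations(others, r):
                S = set(combo)
                p = prod(x[j] if j in S else 1.0 - x[j] for j in others)
                grad[i] += p * (f(S | {i}) - f(S))
    return grad


rng = np.random.default_rng(0)
x = np.array([0.2, 0.5, 0.7, 0.4])
average = np.mean([sampled_gradient(coverage, x, rng) for _ in range(20000)], axis=0)
print("sample average:", average)
print("exact gradient:", exact_gradient(coverage, x))
```

A single shared sample $S$ serves all coordinates, which is what keeps the per-iteration cost of the estimator linear in the number of function evaluations.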

Indeed, the stochastic gradient ascent method proposed by Hassani et al. (2017) can be used to solve the multilinear extension problem in (34) using unbiased estimates of the gradient at each iteration. However, the stochastic gradient ascent method fails to achieve the optimal $(1 - 1/e)$ approximation. Further, the work of Karimi et al. (2017) achieves a $(1 - 1/e)$ approximation solution only when each $\tilde{f}(\cdot, z)$ is a coverage function. Here, we show that SCG achieves the first tight $(1 - 1/e)$ approximation guarantee for the discrete stochastic submodular Problem (33). More precisely, we show that SCG finds a solution for (34), with an expected function value that is at least $(1 - 1/e)\mathrm{OPT} - \epsilon$, in iterations. To do so, we first show in the following lemma that the difference between any coordinate of the gradients at two consecutive iterates generated by SCG, i.e., $\nabla_j F(x_{t+1}) - \nabla_j F(x_t)$ for $j \in \{1, \dots, n\}$, is bounded by $\|x_{t+1} - x_t\|$ multiplied by a factor which is independent of the problem dimension $n$.

Consider Stochastic Continuous Greedy (SCG) outlined in Algorithm 2 with iterates , and recall the definition of the multilinear extension function in (34). If we define as the rank of the matroid and , then the following

 $$\left| \nabla_j F(x_{t+1}) - \nabla_j F(x_t) \right| \le m_f \sqrt{r}\, \|x_{t+1} - x_t\|, \tag{36}$$

holds for .

See Appendix I.

The result in Lemma 5 states that in an ascent direction of SCG, the gradient is