In this paper, we consider the following non-oblivious stochastic optimization problem:
$$\max_{x \in \mathcal{C}} F(x) := \mathbb{E}_{z \sim p(z;x)}\big[\tilde{F}(x;z)\big], \tag{1}$$
where $x$ is the decision variable, $\mathcal{C}$ is a convex feasible set, $z$ is a random variable with distribution $p(z;x)$, and the objective function $F$ is defined as the expectation of a set of stochastic functions $\tilde{F}(\cdot\,;z)$. Problem (1) is called non-oblivious as the underlying distribution $p(z;x)$ depends on the variable $x$ and may change during the optimization procedure. One should note that the usual oblivious stochastic (convex/non-convex) optimization, in which the probability distribution $p$ is independent of $x$, is a special case of problem (1). We focus on providing efficient solvers for (1) in terms of the sample complexity of $z$ (a.k.a. calls to the stochastic oracle), for the settings in which the objective function is either (non-convex) continuous submodular or concave. A canonical example of problem (1) is the multilinear extension of a discrete submodular function, where the stochasticity crucially depends on the decision variable $x$ at which we evaluate (the definition is deferred to Section 5).
When the objective function is monotone and continuous DR-submodular, Hassani et al. (2017) showed that the projected Stochastic Gradient Ascent (SGA) method finds a solution to problem (1) with a function value no less than $[\mathrm{OPT}/2 - \epsilon]$ after computing at most $O(1/\epsilon^2)$ stochastic gradients. Here, and throughout the paper, OPT denotes the optimal value of problem (1). Hassani et al. (2017) also provided examples for which SGA cannot achieve better than a $1/2$ approximation ratio, in general. Later, Mokhtari et al. (2018a) proposed Stochastic Continuous Greedy (SCG), a conditional gradient method that achieves the tight $[(1-1/e)\mathrm{OPT} - \epsilon]$ solution by $O(1/\epsilon^3)$ calls to the linear optimization oracle while using $O(1/\epsilon^3)$ stochastic gradients. While both SCG and SGA are first-order methods, meaning that they rely on stochastic gradients, SCG provably achieves a better result at the price of being slower. The first contribution of this paper is to answer the following question:
“Can we achieve the best of both worlds? That is, can we find a $[(1-1/e)\mathrm{OPT} - \epsilon]$ solution after at most $O(1/\epsilon^2)$ calls to the stochastic oracle?”
A very similar question arises in the case of stochastic convex minimization (concave maximization) using conditional gradient methods (a.k.a. Frank-Wolfe). It is well known that such methods, because they solve a linear optimization program at each step, are highly sensitive to stochasticity and may easily diverge (unlike projected gradient methods). To overcome this issue, Hazan and Luo (2016) proposed a variance reduced method, using increasing mini-batch sizes, that achieves an $\epsilon$-approximate optimum by using $O(1/\epsilon^3)$ stochastic gradients. Recently, Mokhtari et al. (2018b) proposed a momentum method that achieves the same rate while fixing the size of mini-batches to 1. Our next contribution is to answer the following question:
“Can we achieve an $\epsilon$-approximate optimum with $O(1/\epsilon^2)$ stochastic oracle calls?”
Our contributions. We answer the above questions affirmatively. More precisely, we develop Stochastic Continuous Greedy++ (SCG++), the first algorithm that achieves the tight $[(1-1/e)\mathrm{OPT} - \epsilon]$ solution for problem (1) with $O(1/\epsilon)$ calls to the linear optimization program while using $O(1/\epsilon^2)$ stochastic gradients in total. Our technique relies on a novel variance reduction method that estimates the difference of gradients in the non-oblivious stochastic setting without introducing extra bias. This is crucial in our analysis, as all the existing variance reduction methods fail to correct for this bias and can only operate in the oblivious/classical stochastic setting. We further show that our result is optimal in all aspects. In particular, in Theorem 5, we provide an information-theoretic lower bound to showcase the necessity of $O(1/\epsilon^2)$ stochastic oracle queries in order to achieve $[(1-1/e)\mathrm{OPT} - \epsilon]$. Note that under natural complexity assumptions, one cannot achieve an approximation ratio better than $(1-1/e)$ for monotone submodular functions (Feige, 1998). Finally, we extend our result to stochastic concave maximization (i.e., $F$ is concave) and propose Stochastic Frank-Wolfe++ (SFW++), which finds an $\epsilon$-suboptimal solution using $O(1/\epsilon^2)$ stochastic gradients with $O(1/\epsilon)$ calls to the linear optimization oracle. To the best of our knowledge, SFW++ is the first efficient conditional gradient method that achieves the optimal stochastic oracle complexity for smooth concave maximization (resp., smooth convex minimization).
2 Related Work
Submodular Maximization. Submodular set functions (Nemhauser et al., 1978) capture the intuitive notion of diminishing returns and have become increasingly important in various machine learning applications. Examples include data summarization (Lin and Bilmes, 2011a, b), crowd teaching (Singla et al., 2014), neural network interpretation (Elenberg et al., 2017), dictionary learning (Das and Kempe, 2011), and variational inference (Djolonga and Krause, 2014), to name a few. The celebrated result of Nemhauser et al. (1978) shows that for a monotone submodular function and subject to a cardinality constraint, a simple greedy algorithm achieves the tight $(1-1/e)$ approximation guarantee. However, the vanilla greedy method does not provide the tightest guarantees for many classes of feasibility constraints. To circumvent this issue, the continuous relaxation of submodular functions, through the multilinear extension, has been extensively studied (Vondrák, 2008; Calinescu et al., 2011; Chekuri et al., 2014; Feldman et al., 2011; Gharan and Vondrák, 2011; Sviridenko et al., 2015). In particular, it is known that the Continuous Greedy algorithm achieves the tight $(1-1/e)$ approximation guarantee for monotone submodular functions under a general matroid constraint (Calinescu et al., 2011).
Continuous DR-submodular functions, an important subclass of non-convex functions, generalize the notion of diminishing returns to continuous domains (Bian et al., 2017). Such functions naturally arise in machine learning applications such as optimum experimental design (Chen et al., 2018), MAP inference for Determinantal Point Processes (Kulesza and Taskar, 2012), and revenue maximization (Niazadeh et al., 2018). It has been recently shown that monotone continuous DR-submodular functions can be (approximately) maximized over convex bodies using first-order methods (Bian et al., 2017; Hassani et al., 2017; Mokhtari et al., 2018a). When exact gradient information is available, Bian et al. (2017) showed that the Continuous Greedy algorithm, a variant of the conditional gradient method, achieves $[(1-1/e)\mathrm{OPT} - \epsilon]$ with $O(1/\epsilon)$ gradient evaluations. However, the problem becomes considerably more challenging when we only have access to a stochastic first-order oracle. In particular, Hassani et al. (2017) showed that stochastic gradient ascent achieves $[\mathrm{OPT}/2 - \epsilon]$ by using $O(1/\epsilon^2)$ stochastic gradients. In contrast, Mokhtari et al. (2018a) proposed the stochastic variant of the continuous greedy algorithm that achieves $[(1-1/e)\mathrm{OPT} - \epsilon]$ by using $O(1/\epsilon^3)$ stochastic gradients. This paper shows how achieving $[(1-1/e)\mathrm{OPT} - \epsilon]$ is possible with $O(1/\epsilon^2)$ stochastic gradient evaluations.
Convex minimization. The problem of minimizing a stochastic convex function subject to a convex constraint using stochastic projected gradient descent-type methods has been studied extensively in the past (Robbins and Monro, 1951; Nemirovski and Yudin, 1978; Nemirovskii et al., 1983). Although stochastic gradient computation is inexpensive, the cost of the projection step can be prohibitive (Fujishige and Isotani, 2011) or intractable (Collins et al., 2008). In such cases, projection-free algorithms, a.k.a. Frank-Wolfe or conditional gradient methods, are the method of choice (Frank and Wolfe, 1956; Jaggi, 2013). In the stochastic setting, the online Frank-Wolfe algorithm proposed by Hazan and Kale (2012) requires $O(1/\epsilon^4)$ stochastic gradient evaluations to reach an $\epsilon$-approximate optimum, i.e., $F(x) - \mathrm{OPT} \leq \epsilon$, under the assumption that the objective function is convex and has bounded gradients. The stochastic variant of Frank-Wolfe studied by Hazan and Luo (2016) uses an increasing batch size of $O(t^2)$ (at iteration $t$) to obtain an improved stochastic oracle complexity of $O(1/\epsilon^3)$ under the assumptions that the expected objective function is smooth and Lipschitz continuous. Recently, Mokhtari et al. (2018b) proposed a momentum gradient estimator through which they achieve a similar number of stochastic gradient evaluations while fixing the batch size to 1. Our work improves these results by introducing the first stochastic conditional gradient method that obtains an $\epsilon$-suboptimal solution after computing at most $O(1/\epsilon^2)$ stochastic gradients.
3 Preliminaries
We first recap some standard definitions for both discrete and continuous submodular maximization, and then review variance reduced methods for solving stochastic optimization problems.
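Submodularity. A set function $f: 2^V \to \mathbb{R}_+$, defined on the ground set $V$, is submodular if
$$f(A) + f(B) \geq f(A \cap B) + f(A \cup B)$$
for all subsets $A, B \subseteq V$. Even though submodularity is mostly considered in the discrete domain, the notion can be naturally extended to arbitrary lattices (Fujishige, 2005). To this aim, let us consider a subset of $\mathbb{R}^d_+$ of the form $\mathcal{X} = \prod_{i=1}^{d} \mathcal{X}_i$ where each $\mathcal{X}_i$ is a compact subset of $\mathbb{R}_+$. A function $F: \mathcal{X} \to \mathbb{R}_+$ is continuous submodular (Wolsey, 1982) if for all $x, y \in \mathcal{X}$, we have
$$F(x) + F(y) \geq F(x \vee y) + F(x \wedge y),$$
where $x \vee y := \max(x, y)$ (component-wise) and $x \wedge y := \min(x, y)$ (component-wise). A submodular function is monotone if for any $x, y \in \mathcal{X}$ such that $x \leq y$, we have $F(x) \leq F(y)$ (here, by $x \leq y$ we mean that every element of $x$ is no greater than the corresponding element of $y$). When twice differentiable, $F$ is submodular if and only if all cross-second-derivatives are non-positive (Bach, 2015), i.e.,
$$\forall i \neq j,\ \forall x \in \mathcal{X}: \quad \frac{\partial^2 F(x)}{\partial x_i \partial x_j} \leq 0.$$
The above expression makes it clear that continuous submodular functions are neither convex nor concave in general, as concavity (resp. convexity) implies that $\nabla^2 F \preceq 0$ (resp. $\nabla^2 F \succeq 0$). A proper subclass of submodular functions are called DR-submodular (Bian et al., 2017; Soma and Yoshida, 2015) if for all $x, y \in \mathcal{X}$ such that $x \leq y$, any standard basis vector $e_i$, and any non-negative number $k$ such that $k e_i + x \in \mathcal{X}$ and $k e_i + y \in \mathcal{X}$, we have
$$F(k e_i + x) - F(x) \geq F(k e_i + y) - F(y).$$
One can easily verify that for a differentiable DR-submodular function the gradient is an antitone mapping, i.e., for all $x, y \in \mathcal{X}$ such that $x \leq y$ we have $\nabla F(x) \geq \nabla F(y)$ (Bian et al., 2017). An important example of a DR-submodular function is the multilinear extension (Calinescu et al., 2011), which will be studied in Section 5.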
Variance Reduction. Beyond the vanilla stochastic gradient, variance reduced algorithms (Schmidt et al., 2017; Johnson and Zhang, 2013; Defazio et al., 2014; Nguyen et al., 2017; Reddi et al., 2016; Allen-Zhu, 2018) have been successful in reducing stochastic first-order oracle complexity in oblivious stochastic optimization
$$\max_{x \in \mathcal{C}} F(x) := \mathbb{E}_{z \sim p(z)}\big[\tilde{F}(x;z)\big], \tag{4}$$
where each component function $\tilde{F}(\cdot\,;z)$ is $L$-smooth. In contrast to (1), the underlying distribution $p$ of (4) is invariant to the variable $x$ and is hence called oblivious. We will now explain a recent variance reduction technique for solving (4) using stochastic gradient information. Consider the following unbiased estimate of the gradient at the current iterate $x_t$:
$$g_t := g_{t-1} + \nabla \tilde{F}(x_t; \mathcal{M}) - \nabla \tilde{F}(x_{t-1}; \mathcal{M}), \tag{5}$$
where $\nabla \tilde{F}(y; \mathcal{M}) := \frac{1}{|\mathcal{M}|} \sum_{z \in \mathcal{M}} \nabla \tilde{F}(y; z)$ for some $y \in \mathbb{R}^d$, $g_{t-1}$ is an unbiased gradient estimator at $x_{t-1}$, and $\mathcal{M}$ is a mini-batch of random samples drawn from $p(z)$. Fang et al. (2018) showed that, with the gradient estimator (5), $O(1/\epsilon^3)$ stochastic gradient evaluations are sufficient to find an $\epsilon$-first-order stationary point of problem (4), improving upon the $O(1/\epsilon^4)$ complexity of SGD. A crucial property leading to the success of the variance reduction method given in (5) is that $\nabla \tilde{F}(x_t; \mathcal{M})$ and $\nabla \tilde{F}(x_{t-1}; \mathcal{M})$ use the same minibatch sample $\mathcal{M}$ in order to exploit the $L$-smoothness of the component functions $\tilde{F}(\cdot\,;z)$. Such a construction is only possible in the oblivious setting where $p(z)$ is invariant to the choice of $x$. In fact, (5) would introduce bias in the more general non-oblivious case (1). To see this, let $\mathcal{M}$ be the minibatch of the random variable $z$ sampled according to the distribution $p(z; x_t)$. We have $\mathbb{E}[\nabla \tilde{F}(x_t; \mathcal{M})] = \nabla F(x_t)$ but $\mathbb{E}[\nabla \tilde{F}(x_{t-1}; \mathcal{M})] \neq \nabla F(x_{t-1})$, since the distribution $p(z; x_t)$ is not the same as $p(z; x_{t-1})$. The same argument renders all the existing variance reduction techniques inapplicable to the non-oblivious setting of problem (1).
4 Stochastic Continuous Greedy++
In this section, we present the Stochastic Continuous Greedy++ (SCG++) algorithm, the first method that obtains a $[(1-1/e)\mathrm{OPT} - \epsilon]$ solution with $O(1/\epsilon^2)$ stochastic oracle complexity. The SCG++ algorithm is a variant of a conditional gradient method. To be more precise, at each iteration $t$, given a gradient estimator $g_t$, SCG++ solves the subproblem
$$v_t := \arg\max_{v \in \mathcal{C}} \langle v, g_t \rangle \tag{6}$$
to obtain an element $v_t$ in $\mathcal{C}$ as an ascent direction, which is then added to the current iterate $x_t$ with a scaling factor $1/T$, i.e., the new iterate $x_{t+1}$ is computed by following the update
$$x_{t+1} := x_t + \frac{1}{T} v_t. \tag{7}$$
Here, and throughout the paper, $T$ is the total number of iterations of the algorithm. The iterates are assumed to be initialized at the origin, which may not belong to the feasible set $\mathcal{C}$. Even though each iterate $x_t$ may not necessarily be in $\mathcal{C}$, the feasibility of the final iterate $x_T$ is guaranteed by the convexity of $\mathcal{C}$, since $x_T$ is the average of the $T$ feasible points $v_0, \dots, v_{T-1}$. Note that the sequence of iterates can be regarded as a path from the origin (as we manually force $x_0 = 0$) to some feasible point in $\mathcal{C}$. The key idea in SCG++ is to exploit the high correlation between consecutive iterates to maintain a highly accurate estimate $g_t$, which is the focus of the rest of this section. Note that by replacing the gradient approximation vector $g_t$ in the update of SCG++ by the exact gradient of the objective function, we recover the update of the continuous greedy method (Calinescu et al., 2011; Bian et al., 2017).
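The conditional gradient update described above can be sketched in a few lines. The following is only an illustration under stated assumptions, not the authors' implementation: the exact gradient stands in for the gradient estimator (which, as noted in the text, recovers continuous greedy), the toy objective `F(x) = 1 - prod_i (1 - w_i x_i)` and the feasible set `{x in [0,1]^d : sum(x) <= k}` are hypothetical choices, and the names `F`, `grad_F`, `linear_max`, and `continuous_greedy` are made up for this example.

```python
import math


def F(x, w):
    """Toy monotone DR-submodular objective on [0,1]^d (an assumption)."""
    return 1.0 - math.prod(1.0 - wi * xi for wi, xi in zip(w, x))


def grad_F(x, w):
    """Exact gradient: dF/dx_i = w_i * prod_{j != i} (1 - w_j x_j)."""
    d = len(x)
    return [
        w[i] * math.prod(1.0 - w[j] * x[j] for j in range(d) if j != i)
        for i in range(d)
    ]


def linear_max(g, k):
    """argmax_{v in C} <v, g> for C = {v in [0,1]^d : sum(v) <= k}."""
    top = sorted(range(len(g)), key=lambda i: -g[i])[:k]
    v = [0.0] * len(g)
    for i in top:
        if g[i] > 0:
            v[i] = 1.0
    return v


def continuous_greedy(w, k, T):
    """Start at the origin and take T steps of size 1/T along v_t."""
    x = [0.0] * len(w)
    for _ in range(T):
        g = grad_F(x, w)       # exact gradient in place of the estimator
        v = linear_max(g, k)   # linear optimization oracle
        x = [xi + vi / T for xi, vi in zip(x, v)]
    return x
```

Because the final point is an average of `T` feasible directions, it stays in the feasible set up to floating-point rounding, which mirrors the convexity argument in the text.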
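We now proceed to describe our approach for evaluating the gradient approximation $g_t$ when we face a non-oblivious problem (1). Given a sequence of iterates $\{x_s\}_{s=0}^{t}$, the gradient of the objective function at iterate $x_t$ can be written in a path-integral form as follows
$$\nabla F(x_t) = \nabla F(x_0) + \sum_{s=1}^{t} \Delta_s, \qquad \Delta_s := \nabla F(x_s) - \nabla F(x_{s-1}). \tag{8}$$
By obtaining an unbiased estimate of $\Delta_t$ and reusing the previous unbiased estimates for $\Delta_s$, $s < t$, we obtain recursively an unbiased estimator of $\nabla F(x_t)$ which has a reduced variance. Estimating $\nabla F(x_t)$ and $\nabla F(x_{t-1})$ separately as suggested in (5) would cause the bias issue in the non-oblivious case (see the discussion at the end of Section 3). Therefore, we propose an approach for directly estimating the difference $\Delta_t$ in an unbiased manner.
We construct an unbiased estimator $g_t$ of the gradient vector $\nabla F(x_t)$ by adding an unbiased estimate $\tilde{\Delta}_t$ of the gradient difference $\Delta_t$ to $g_{t-1}$, where $g_{t-1}$ is an unbiased estimator of the gradient $\nabla F(x_{t-1})$. Note that the vector $\Delta_t$ can be written as
$$\Delta_t = \int_0^1 \nabla^2 F(x(a)) (x_t - x_{t-1}) \, da = \Big[ \int_0^1 \nabla^2 F(x(a)) \, da \Big] (x_t - x_{t-1}), \tag{9}$$
where $x(a) := a \, x_t + (1-a) \, x_{t-1}$ for $a \in [0,1]$. Therefore, if we sample the parameter $a$ uniformly at random from the interval $[0,1]$, it can be easily verified that $\nabla^2 F(x(a)) (x_t - x_{t-1})$ is an unbiased estimator of the gradient difference $\Delta_t$ since
$$\mathbb{E}_a\big[ \nabla^2 F(x(a)) (x_t - x_{t-1}) \big] = \nabla F(x_t) - \nabla F(x_{t-1}). \tag{10}$$
Hence, all we need is an unbiased estimator of the Hessian-vector product for the non-oblivious objective $F$ at an arbitrary $y \in \mathcal{C}$. In the following lemma, we present such an unbiased estimator of the Hessian that can be evaluated efficiently.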
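As a sanity check of the path-integral identity, the sketch below averages Hessian-vector products at points sampled uniformly on the segment between two iterates and compares the result with the true gradient difference. It is a minimal illustration under assumptions: the smooth test function `F(x) = -(x_0^3 + x_1^3)/3 - x_0 x_1` is deterministic and chosen only because its gradient and Hessian are easy to write down, and the function names are hypothetical.

```python
import random


def grad(x):
    """Gradient of the assumed test function F(x) = -(x0^3 + x1^3)/3 - x0*x1."""
    return [-x[0] ** 2 - x[1], -x[1] ** 2 - x[0]]


def hvp(x, u):
    """Exact Hessian-vector product of the same test function."""
    return [-2 * x[0] * u[0] - u[1], -u[0] - 2 * x[1] * u[1]]


def estimate_gradient_difference(x_prev, x_cur, num_samples, rng):
    """Average H(x(a)) (x_cur - x_prev) over a drawn uniformly from [0, 1]."""
    u = [c - p for c, p in zip(x_cur, x_prev)]
    total = [0.0] * len(u)
    for _ in range(num_samples):
        a = rng.random()
        x_a = [a * c + (1 - a) * p for c, p in zip(x_cur, x_prev)]
        total = [t + h for t, h in zip(total, hvp(x_a, u))]
    return [t / num_samples for t in total]
```

With enough samples the average approaches `grad(x_cur) - grad(x_prev)`, without ever evaluating the two gradients separately.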
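For any $y \in \mathcal{C}$, let $z$ be a random variable with distribution $p(z;y)$ and define
$$\tilde{\nabla}^2 F(y;z) := \tilde{F}(y;z) [\nabla \log p(z;y)] [\nabla \log p(z;y)]^\top + [\nabla \tilde{F}(y;z)] [\nabla \log p(z;y)]^\top + [\nabla \log p(z;y)] [\nabla \tilde{F}(y;z)]^\top + \nabla^2 \tilde{F}(y;z) + \tilde{F}(y;z) \nabla^2 \log p(z;y). \tag{11}$$
Then, $\tilde{\nabla}^2 F(y;z)$ is an unbiased estimator of $\nabla^2 F(y)$, i.e., $\mathbb{E}_{z \sim p(z;y)}[\tilde{\nabla}^2 F(y;z)] = \nabla^2 F(y)$.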
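The unbiasedness claim can be checked exactly on a one-dimensional toy problem. The sketch below assumes the estimator has the five-term form obtained by differentiating the expectation twice (function value times the squared score, two gradient-score cross terms, the Hessian of the stochastic function, and function value times the Hessian of the log-probability). The two-outcome distribution with `p(z=1; y) = y^2` and the stochastic functions `1 + y^3` and `2y` are hypothetical choices made only so that the expectation can be computed in closed form.

```python
def hessian_estimate(y, z):
    """Scalar instance of the five-term estimator for a two-outcome z."""
    if z == 1:
        f, df, d2f = 1 + y ** 3, 3 * y ** 2, 6 * y      # F~(y; 1) and derivatives
        s, ds = 2 / y, -2 / y ** 2                      # d/dy and d2/dy2 of log(y^2)
    else:
        f, df, d2f = 2 * y, 2.0, 0.0                    # F~(y; 0) and derivatives
        q = 1 - y ** 2
        s, ds = -2 * y / q, -(2 + 2 * y ** 2) / q ** 2  # derivatives of log(1 - y^2)
    return f * s * s + df * s + s * df + d2f + f * ds


def expected_estimate(y):
    """Exact expectation over z in {0, 1} with p(z=1; y) = y^2."""
    p1 = y ** 2
    return p1 * hessian_estimate(y, 1) + (1 - p1) * hessian_estimate(y, 0)


def naive_expected_hessian(y):
    """Expectation of the stochastic Hessian alone, ignoring the score terms."""
    return (y ** 2) * (6 * y) + (1 - y ** 2) * 0.0


def true_second_derivative(y):
    """F(y) = y^2 (1 + y^3) + (1 - y^2) 2y, so F''(y) = 2 + 20 y^3 - 12 y."""
    return 2 + 20 * y ** 3 - 12 * y
```

At `y = 0.5` the full estimator averages to the true second derivative, while the naive average of the stochastic Hessian does not, which is the bias the score terms correct.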
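The result in Lemma 4 shows how to evaluate an unbiased estimator of the Hessian $\nabla^2 F(y)$. If we consider $a$ to be a random variable with a uniform distribution over the interval $[0,1]$, then we can define a random variable $z(a)$ with the probability distribution $p(z; x(a))$ where $x(a) := a \, x_t + (1-a) \, x_{t-1}$. Considering these two random variables, combined with Lemma 4, we can construct an unbiased estimator of the integral in (9) by
$$\tilde{\nabla}^2_t := \frac{1}{|\mathcal{M}|} \sum_{(a, z(a)) \in \mathcal{M}} \tilde{\nabla}^2 F(x(a); z(a)), \tag{12}$$
where $\mathcal{M}$ is a minibatch containing $|\mathcal{M}|$ samples of the random pair $(a, z(a))$.
Once we have access to $\tilde{\nabla}^2_t$, which is an unbiased estimator of $\int_0^1 \nabla^2 F(x(a)) \, da$, we can approximate the gradient difference $\Delta_t$ by its unbiased estimator $\tilde{\Delta}_t$ defined as
$$\tilde{\Delta}_t := \tilde{\nabla}^2_t (x_t - x_{t-1}). \tag{13}$$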
Note that for the general objective $F$, the matrix-vector product $\tilde{\nabla}^2_t (x_t - x_{t-1})$ requires $O(d^2)$ computation and memory. To resolve this issue, in Section 4.1 we provide an implementation of (13) using only first-order information which has a computational and memory complexity of $O(d)$.
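Using $\tilde{\Delta}_t$ as an unbiased estimator for the gradient difference $\Delta_t$, we can define our objective function gradient estimator as
$$g_t := \nabla \tilde{F}(x_0; \mathcal{M}_0) + \sum_{s=1}^{t} \tilde{\Delta}_s. \tag{14}$$
Indeed, this update can also be written in a recursive way as
$$g_t := g_{t-1} + \tilde{\Delta}_t \tag{15}$$
if we set $g_0 = \nabla \tilde{F}(x_0; \mathcal{M}_0)$. Note that the proposed approach for gradient approximation in (14) has an inherent variance reduction mechanism which leads to the optimal complexity of $O(1/\epsilon^2)$ in terms of the number of calls to the stochastic oracle. We further highlight this point in Section 4.2.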
4.1 Implementation of the Hessian-Vector Product
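In this section, we focus on the computation of the gradient difference approximation $\tilde{\Delta}_t$ introduced in (13). We aim to come up with a scheme that avoids explicitly computing the matrix estimator $\tilde{\nabla}^2_t$, which has a complexity of $O(d^2)$, and present an approach for directly approximating $\tilde{\Delta}_t$ that only uses finite differences of gradients, with a complexity of $O(d)$. Recall the definition of the Hessian approximation $\tilde{\nabla}^2_t$ in (12). Computing $\tilde{\Delta}_t$ is equivalent to computing $|\mathcal{M}|$ instances of $\tilde{\nabla}^2 F(y;z) \cdot u$ for some $y \in \mathcal{C}$ and $z$. Denote $u := x_t - x_{t-1}$ and use the expression in (11) to write
$$\tilde{\nabla}^2 F(y;z) \cdot u = \tilde{F}(y;z) [\nabla \log p(z;y)^\top u] \nabla \log p(z;y) + [\nabla \log p(z;y)^\top u] \nabla \tilde{F}(y;z) + [\nabla \tilde{F}(y;z)^\top u] \nabla \log p(z;y) + \nabla^2 \tilde{F}(y;z) \cdot u + \tilde{F}(y;z) \nabla^2 \log p(z;y) \cdot u. \tag{16}$$
Note that the first three terms can be computed in $O(d)$ time and only the last two terms on the right hand side of (16) involve $O(d^2)$ operations, which can be approximated by the following finite gradient difference scheme. For any twice differentiable function $\psi: \mathbb{R}^d \to \mathbb{R}$ and arbitrary $u \in \mathbb{R}^d$ with bounded Euclidean norm $\|u\| \leq D$, we compute, for some small $\delta > 0$,
$$\phi(\delta; \psi) := \frac{\nabla \psi(y + \delta u) - \nabla \psi(y - \delta u)}{2\delta} \simeq \nabla^2 \psi(y) \cdot u. \tag{17}$$
By considering the second-order smoothness of the function $\psi$ with constant $L_2$, we can show that for arbitrary $x, y \in \mathbb{R}^d$ it holds that $\|\nabla^2 \psi(x) - \nabla^2 \psi(y)\| \leq L_2 \|x - y\|$. Therefore, the error of the above approximation can be bounded by
$$\|\nabla^2 \psi(y) \cdot u - \phi(\delta; \psi)\| = \|[\nabla^2 \psi(y) - \nabla^2 \psi(\tilde{x})] \cdot u\| \leq D^2 L_2 \delta, \tag{18}$$
where $\tilde{x}$ is obtained from the mean value theorem. This quantity can be made arbitrarily small by decreasing $\delta$. Later, we show how small $\delta$ needs to be as a function of the target accuracy $\epsilon$. By applying the technique of (17) to the two functions $\psi(y) = \tilde{F}(y;z)$ and $\psi(y) = \log p(z;y)$, we can approximate (16) in time $O(d)$:
$$\tilde{\nabla}^2 F(y;z) \cdot u \approx \xi_\delta(y;z) := \tilde{F}(y;z) [\nabla \log p(z;y)^\top u] \nabla \log p(z;y) + [\nabla \log p(z;y)^\top u] \nabla \tilde{F}(y;z) + [\nabla \tilde{F}(y;z)^\top u] \nabla \log p(z;y) + \phi(\delta; \tilde{F}(y;z)) + \tilde{F}(y;z) \, \phi(\delta; \log p(z;y)). \tag{19}$$
We can further define a minibatch version of this implementation as
$$\tilde{\Delta}_t := \frac{1}{|\mathcal{M}|} \sum_{(a, z(a)) \in \mathcal{M}} \xi_\delta(x(a); z(a)), \tag{20}$$
which is used in Option II of Step 8 in Algorithm 1.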
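The finite gradient difference scheme can be illustrated with a short, self-contained sketch. It assumes the central-difference form (gradient at `y + delta*u` minus gradient at `y - delta*u`, divided by `2*delta`); the test function `psi(y) = (y_0^4 + y_1^4)/4 + y_0 y_1` and all function names are hypothetical and serve only to compare against a Hessian-vector product that is known in closed form.

```python
def grad_psi(y):
    """Gradient of the assumed test function psi(y) = (y0^4 + y1^4)/4 + y0*y1."""
    return [y[0] ** 3 + y[1], y[1] ** 3 + y[0]]


def hvp_exact(y, u):
    """Closed-form Hessian-vector product of psi, used only for comparison."""
    return [3 * y[0] ** 2 * u[0] + u[1], u[0] + 3 * y[1] ** 2 * u[1]]


def hvp_finite_difference(grad, y, u, delta=1e-4):
    """Approximate the Hessian-vector product with two gradient calls.

    Only vectors of length d are formed, so the d-by-d Hessian is never built.
    """
    y_plus = [yi + delta * ui for yi, ui in zip(y, u)]
    y_minus = [yi - delta * ui for yi, ui in zip(y, u)]
    g_plus, g_minus = grad(y_plus), grad(y_minus)
    return [(a - b) / (2 * delta) for a, b in zip(g_plus, g_minus)]
```

For this test function the approximation error shrinks as `delta` decreases, until floating-point cancellation in the numerator starts to dominate, so `delta` should not be made arbitrarily small in practice.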
4.2 Convergence Analysis
In this section, we analyze the convergence property of Algorithm 1 using (20) as the gradient-difference estimation. The result for (13) can be obtained accordingly. We note that (13) is a special case of (20) by taking (e.g., by letting ). We first specify the assumptions required for the analysis of SCG++.
Assumption 4.1 (monotonicity and DR-submodularity)
The function is non-negative, monotone, and continuous DR-submodular.
Assumption 4.2 (function value at the origin)
The function at the origin is .
Assumption 4.3 (bounded stochastic function value)
The stochastic function has bounded function value for all and : .
Assumption 4.4 (compactness of feasible domain)
The set is compact with diameter .
Assumption 4.5 (bounded gradient norm)
For all , the stochastic gradient has bounded norm: ,
and the norm of the gradient of the log-probability function has bounded fourth-order moment, i.e., . Furthermore, we define
Assumption 4.6 (bounded second-order differentials)
For all , the stochastic Hessian has bounded spectral norm and the spectral norm of the Hessian of the log-probability function has bounded second order moment: . Furthermore, we define
Assumption 4.7 (continuity of the Hessian)
The stochastic Hessian is -Lipschitz continuous, i.e., for all and all , we have . The Hessian of the log-probability is -Lipschitz continuous, i.e., for all and all we have . Furthermore, we define
As we mentioned in the previous section, the update for the stochastic gradient vector in SCG++ is designed to reduce the noise of the gradient approximation. In the following lemma, we formally characterize the variance of the gradient approximation for SCG++. To this end, we also need to properly choose the minibatch sizes $|\mathcal{M}_0|$ and $|\mathcal{M}|$.
Consider the method outlined in Algorithm 1 and assume that in Step 8 we follow the update in (20) to construct the gradient difference approximation (Option II). If Assumptions (4.3), (4.4), (4.5), (4.6), and (4.7) hold, then
where we set , the minibatch sizes to and , and the error of Hessian-vector product approximation , which is defined in (18), is sufficiently small (in fact is sufficient). Here is a constant defined by .
The result in Lemma 21 shows that by calls to the stochastic oracle at each iteration, the variance of gradient approximation in after iterations is of . In the following theorem, we incorporate this bound on the noise of gradient approximation to characterize the convergence guarantee of .
Consider the method outlined in Algorithm 1 and assume that in Step 8 we follow the update in (20) to construct the gradient difference approximation (Option II). If Assumptions 4.1-4.7 hold, then the output of , denoted by , satisfies
by setting , , , and is sufficiently small as discussed in Lemma 21. Here is a constant defined by .
The result in Theorem 4.2 shows that after at most $T = O(1/\epsilon)$ iterations the objective function value for the output of SCG++ is at least $(1-1/e)\mathrm{OPT} - \epsilon$. As the number of calls to the stochastic oracle per iteration is of $O(1/\epsilon)$, to reach a $(1-1/e)\mathrm{OPT} - \epsilon$ approximation guarantee, SCG++ has an overall stochastic first-order oracle complexity of $O(1/\epsilon^2)$. We formally characterize this result in the following corollary.
[oracle complexities] To find a $[(1-1/e)\mathrm{OPT} - \epsilon]$ solution to problem (1) using Algorithm 1 with Option II, the overall stochastic first-order oracle complexity is $O(1/\epsilon^2)$ and the overall linear optimization oracle complexity is $O(1/\epsilon)$.
5 Discrete Stochastic Submodular Maximization
In this section, we focus on extending our result in the previous section to the case where $F$ is the multilinear extension of a (stochastic) discrete submodular set function $f$. This is an important instance of the non-oblivious stochastic optimization problem introduced in (1). Indeed, once such a result is achieved, with a proper lossless rounding scheme, such as pipage rounding (Calinescu et al., 2007) or the contention resolution method (Vondrák et al., 2011), we can extend our results to the discrete setting.
Let $V$ denote a finite set of $d$ elements, i.e., $V = \{1, \dots, d\}$. Consider a discrete and monotone submodular function $f: 2^V \to \mathbb{R}_+$, which is defined as an expectation over a set of functions $\tilde{f}_\gamma: 2^V \to \mathbb{R}_+$. Our goal is to maximize $f$ subject to some constraint $\mathcal{I}$, where $\mathcal{I} \subseteq 2^V$ encodes the collection of feasible solutions. In other words, we aim to solve the following discrete and stochastic submodular function maximization problem
$$\max_{S \in \mathcal{I}} f(S) := \max_{S \in \mathcal{I}} \mathbb{E}_{\gamma \sim p(\gamma)}\big[\tilde{f}_\gamma(S)\big], \tag{22}$$
where $p(\gamma)$ is an arbitrary distribution. In particular, we assume that the pair $\{V, \mathcal{I}\}$ forms a matroid with rank $r$. The simplest example is maximization under the cardinality constraint, i.e., for a given integer $r$, find a set $S \subseteq V$, of size $r$, that maximizes $f$. The challenge here is to find a near-optimal solution for the problem (22) without explicitly computing the expectation. That is, we assume access to an oracle that, given a set $S$, outputs an independently chosen sample $\tilde{f}_\gamma(S)$ where $\gamma \sim p(\gamma)$. The focus of this section is to show how SCG++ can solve the discrete optimization problem (22). To do so, we rely on the multilinear extension of a discrete submodular function defined as
$$F(x) := \sum_{S \subseteq V} f(S) \prod_{i \in S} x_i \prod_{j \notin S} (1 - x_j), \qquad x \in \mathcal{C}, \tag{23}$$
where $\mathcal{C} := \mathrm{conv}\{1_I : I \in \mathcal{I}\}$ is the matroid polytope (Calinescu et al., 2007). Note that here $x_i$ denotes the $i$-th component of the vector $x$. In other words, $F(x)$ is the expected value of $f$ over sets wherein each element $i$ is included independently with probability $x_i$. Specifically, in lieu of solving (22) we can maximize its multilinear extension, i.e.,
$$\max_{x \in \mathcal{C}} F(x). \tag{24}$$
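To this end, we need access to unbiased estimators of the gradient and the Hessian. In the following lemma, we recall the structure of the Hessian of the objective function (23).
For $i \neq j$,
$$[\nabla^2 F(x)]_{i,j} = F(x; x_i \leftarrow 1, x_j \leftarrow 1) - F(x; x_i \leftarrow 1, x_j \leftarrow 0) - F(x; x_i \leftarrow 0, x_j \leftarrow 1) + F(x; x_i \leftarrow 0, x_j \leftarrow 0), \tag{25}$$
and $[\nabla^2 F(x)]_{i,i} = 0$, where for the vector $(x; x_i \leftarrow c_i, x_j \leftarrow c_j)$ the $i$-th and $j$-th entries of $x$ are set to $c_i$ and $c_j$, respectively.
Note that every entry in (25) can be estimated by direct sampling without introducing any bias. We will now construct the Hessian approximation using Lemma 5. Let $a$ be a uniform random variable on $[0,1]$ and let $e = (e_1, \dots, e_d)$ be a random vector in which the $e_i$'s are generated i.i.d. according to the uniform distribution over the unit interval $[0,1]$. In each iteration, a minibatch $\mathcal{M}$ of $|\mathcal{M}|$ samples of $\{a, e, \gamma\}$ (recall that $\gamma$ is the random variable that parameterizes the component function $\tilde{f}_\gamma$), i.e., $\mathcal{M} = \{a_k, e_k, \gamma_k\}_{k=1}^{|\mathcal{M}|}$, is generated. Then for all $k \in [|\mathcal{M}|]$, we let $x_{a_k} = a_k x_t + (1 - a_k) x_{t-1}$ and construct the random set $S(x_{a_k}, e_k)$ using $x_{a_k}$ and $e_k$ in the following way: $s \in S(x_{a_k}, e_k)$ if and only if $[e_k]_s \leq [x_{a_k}]_s$ for $s \in [d]$. Having $S = S(x_{a_k}, e_k)$ and $\gamma_k$, each entry of the Hessian estimator $\tilde{\nabla}^2_k$ is
$$[\tilde{\nabla}^2_k]_{i,j} = \tilde{f}_{\gamma_k}(S \cup \{i,j\}) - \tilde{f}_{\gamma_k}(S \cup \{i\} \setminus \{j\}) - \tilde{f}_{\gamma_k}(S \cup \{j\} \setminus \{i\}) + \tilde{f}_{\gamma_k}(S \setminus \{i,j\}), \tag{26}$$
where $i \neq j$, and if $i = j$ then $[\tilde{\nabla}^2_k]_{i,j} = 0$. As linear optimization over the rank-$r$ matroid polytope always returns $v_t$ with at most $r$ nonzero entries, the complexity of computing $\tilde{\nabla}^2_k \cdot (x_t - x_{t-1})$ is $O(rd)$.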
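The sampling construction for one off-diagonal Hessian entry can be sketched on a tiny ground set where everything is checkable by enumeration. This is an illustration under assumptions, not the paper's implementation: the set function is a deterministic coverage function over three hypothetical sets (so there is no randomness from the component index), the point `x` is fixed rather than sampled along the path between iterates, and all names are made up for the example.

```python
import itertools
import random

# Hypothetical coverage instance: element i covers the items in COVER[i].
COVER = [{1, 2}, {2, 3}, {2, 4}]


def f(S):
    """Monotone submodular coverage function: number of items covered by S."""
    covered = set()
    for i in S:
        covered |= COVER[i]
    return len(covered)


def multilinear(x):
    """Exact multilinear extension by enumerating all subsets (tiny d only)."""
    total = 0.0
    for bits in itertools.product([0, 1], repeat=len(x)):
        prob = 1.0
        for xi, b in zip(x, bits):
            prob *= xi if b else 1 - xi
        total += prob * f({i for i, b in enumerate(bits) if b})
    return total


def exact_hessian_entry(x, i, j):
    """Second-order difference of the multilinear extension in coordinates i, j."""
    def at(vi, vj):
        y = list(x)
        y[i], y[j] = vi, vj
        return multilinear(y)
    return at(1, 1) - at(1, 0) - at(0, 1) + at(0, 0)


def sampled_hessian_entry(x, i, j, rng):
    """One-sample estimate: s is in S iff a uniform draw is at most x[s]."""
    S = {s for s in range(len(x)) if rng.random() <= x[s]}
    return f(S | {i, j}) - f((S | {i}) - {j}) - f((S | {j}) - {i}) + f(S - {i, j})
```

Averaging `sampled_hessian_entry` over many draws approaches `exact_hessian_entry`, and the entry is non-positive, as submodularity requires.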
Now we use the above approximation of the Hessian to solve the multilinear extension as a special case of problem (1) using SCG++. To do so, we first introduce the following assumption.
Let denote maximum marginal value of , i.e.,
and further define .
Under Assumption 5.1, the Hessian estimator has bounded norm:
We now analyze the convergence of SCG++ for solving the problem in (24). Compared to Theorem 4.2, Theorem 5 has an explicit dependence on the problem dimension $d$ and exploits the sparsity of $v_t$ to tighten the dimension-dependent terms in the bound.
Consider the multilinear extension problem (24) under Assumption 5.1. By using the minibatch size and , Algorithm 1 finds a solution to problem (24) with at most iterations. Moreover, the overall stochastic oracle cost is .
Note that, in the multilinear extension case, the smoothness property required for the results in Section 4 is absent, and that is why we need to develop a more sophisticated gradient-difference estimator to achieve a similar theoretical guarantee (more details are available in Section 8.5 of the appendix). Theorem 5 shows that, by using a proper rounding scheme, SCG++ finds a $[(1-1/e)\mathrm{OPT} - \epsilon]$ approximate solution of the discrete submodular maximization problem (22) after at most calls to the stochastic oracle. In the following corollary, we show that this complexity bound is optimal.
There exists a distribution and a monotone submodular function , given as , such that the following holds. In order to find a solution for (22) with a -cardinality constraint, any algorithm requires at least stochastic samples where are positive constants.
The result in Theorem 5 shows that finding a $[(1-1/e)\mathrm{OPT} - \epsilon]$ solution for the problem in (22) could require at least $O(1/\epsilon^2)$ calls to the stochastic function $\tilde{f}_\gamma$. Therefore, SCG++ is optimal in terms of the number of calls to the stochastic oracle, as highlighted in the following remark.
[optimality of stochastic oracle complexity] To achieve the tight $[(1-1/e)\mathrm{OPT} - \epsilon]$ approximate solution, the stochastic oracle complexity in Theorem 5 is optimal in terms of its dependency on $\epsilon$.
6 Stochastic Concave Maximization
In this section, we focus on another case of problem (1), in which the objective function $F$ is concave. By following the variance reduction technique developed for the gradient approximation of SCG++, we introduce the Stochastic Frank-Wolfe++ method (SFW++) for concave maximization, which achieves an $\epsilon$-accurate solution in expectation after at most $O(1/\epsilon^2)$ stochastic gradient computations. We study the general case of non-oblivious stochastic optimization, but, indeed, the results also hold for the oblivious stochastic problem as a special case.
The steps of SFW++ are summarized in Algorithm 2. Unlike the update of SCG++, in SFW++ we restart the gradient estimation for the iterates that are powers of $2$. This modification is necessary to ensure that the noise of the gradient approximation stays bounded by a proper constant when the diminishing step size is used. The rest of the gradient approximation scheme is similar to that of SCG++. Once the gradient approximation $g_t$ is evaluated, we find the ascent direction by solving the linear optimization program $v_t := \arg\max_{v \in \mathcal{C}} \langle v, g_t \rangle$. Then, we compute the updated variable $x_{t+1}$ by performing the update $x_{t+1} := x_t + \eta_t (v_t - x_t)$, where $\eta_t$ is a chosen step size.
In the following theorem, we characterize the convergence properties of the proposed method for solving stochastic concave maximization problems. For simplicity, we analyze the convergence using gradient-difference estimator (13). Similar results can be obtained when using (20).