1 Introduction
Submodular optimization is a sweet spot between tractability and expressiveness, with numerous applications in machine learning (e.g. Krause and Golovin (2014), and see below), while permitting many algorithms that are both practical and backed by rigorous guarantees (e.g. Buchbinder et al. (2015); Feige et al. (2011); Calinescu et al. (2011)). In general, a real-valued function $f$ defined on a lattice $\mathcal{X}$ is submodular if and only if
$$f(x \vee y) + f(x \wedge y) \le f(x) + f(y)$$
for all $x, y \in \mathcal{X}$, where $x \vee y$ and $x \wedge y$ denote the join and meet, respectively, of $x$ and $y$ in the lattice $\mathcal{X}$. Such functions are generally neither convex nor concave. In one of the most commonly studied examples, $\mathcal{X}$ is the lattice of subsets of a fixed ground set (or a sublattice thereof), with union and intersection playing the roles of join and meet, respectively.
This paper concerns a different well-studied setting, where $\mathcal{X}$ is a hypercube (i.e., $\mathcal{X} = [0,1]^n$), with componentwise maximum and minimum serving as the join and meet, respectively (our results also extend easily to arbitrary axis-aligned boxes, i.e., “box constraints”). We consider the fundamental problem of (approximately) maximizing a continuous and nonnegative submodular function over the hypercube (more generally, the function only has to be nonnegative at the points $\mathbf{0}$ and $\mathbf{1}$). The function is given as a “black box”: accessible only via querying its value at a point. We are interested in algorithms that use at most a polynomial (in $n$) number of queries. We do not assume that the function is monotone (otherwise the problem is trivial).
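To make the setting concrete, the following snippet numerically spot-checks the submodularity inequality under componentwise max/min for an illustrative quadratic objective (our example, not from the paper): a quadratic whose Hessian has non-positive off-diagonal entries satisfies $f(x \vee y) + f(x \wedge y) \le f(x) + f(y)$ on $[0,1]^n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.random((n, n))
H = -(A + A.T)                  # symmetric, all entries non-positive
b = rng.random(n)

def f(x):
    # quadratic with non-positive off-diagonal Hessian entries -> submodular
    return b @ x + 0.5 * x @ H @ x

for _ in range(1000):
    x, y = rng.random(n), rng.random(n)
    join, meet = np.maximum(x, y), np.minimum(x, y)   # componentwise ∨ and ∧
    assert f(join) + f(meet) <= f(x) + f(y) + 1e-9
```

The inequality holds exactly here: the linear terms cancel because $x \vee y + x \wedge y = x + y$, and each cross term $\max(x_i,y_i)\max(x_j,y_j) + \min(x_i,y_i)\min(x_j,y_j) - x_i x_j - y_i y_j$ is nonnegative and multiplied by a non-positive $H_{ij}$.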
We next briefly mention four applications of maximizing a nonmonotone submodular function over a hypercube that are germane to machine learning and other related application domains (see the supplement for more details on these applications).
Nonconcave quadratic programming. In this problem, the goal is to maximize $f(x) = \tfrac{1}{2}x^\top H x + h^\top x$, where the off-diagonal entries of $H$ are nonpositive. One application of this problem is large-scale price optimization on the basis of demand forecasting models (Ito and Fujimaki, 2016).
MAP inference for determinantal point processes (DPPs). DPPs are elegant probabilistic models that arise in statistical physics and random matrix theory. They can be used as generative models in applications such as text summarization, human pose estimation, and news threading (Kulesza et al., 2012). The approach of Gillenwater et al. (2012) to this problem boils down to maximizing a suitable submodular function over the hypercube, followed by an appropriate rounding (see also (Bian et al., 2017a)). One can also regularize this objective function in order to avoid overfitting; even with a regularizer, the function remains submodular.

Log-submodularity and mean-field inference. Another probabilistic model that generalizes DPPs and all other strong Rayleigh measures (Li et al., 2016; Zhang et al., 2015) is the class of log-submodular distributions over sets, i.e. $p(S) \propto \exp(F(S))$, where $F$ is a submodular set function. MAP inference over such a distribution has applications in machine learning (Djolonga and Krause, 2014). One variational approach to this MAP inference task is to use mean-field inference to approximate the distribution $p$ with a product distribution parameterized by a point of the hypercube, which again boils down to submodular function maximization over the hypercube (see (Bian et al., 2017a)).
Revenue maximization over social networks. In this problem, a seller wants to sell a product over a social network of buyers. To do so, the seller gives away trial products, and fractions thereof, to the buyers in the network (Bian et al., 2017b; Hartline et al., 2008). The objective function in (Bian et al., 2017b) accounts for two parts: the revenue gained from buyers who did not get a free product, whose revenue function is nonnegative, nondecreasing, and submodular; and the revenue lost from buyers who received a free product, whose revenue function is nonpositive, nonincreasing, and submodular. The combination over all buyers is a nonmonotone submodular function. It is also nonnegative at $\mathbf{0}$ and $\mathbf{1}$, after extending the model to account for extra revenue gains from buyers with free trials.
Our results.
Maximizing a submodular function over the hypercube is at least as difficult as maximizing one over the subsets of a ground set. (An instance of the latter problem can be converted to one of the former by extending the given set function, with its domain viewed as $\{0,1\}^n$, to its multilinear extension $F$ defined on the hypercube, where $F(x)$ is the expected value of the set function on a random set that includes each element $i$ independently with probability $x_i$. Sampling based on an approximate solution for the multilinear extension yields an equally good approximate solution to the original problem.) For the latter problem, the best approximation ratio achievable by an algorithm making a polynomial number of queries is $\tfrac{1}{2}$; the (information-theoretic) lower bound is due to (Feige et al., 2011), the optimal algorithm to (Buchbinder et al., 2015). Thus, the best-case scenario for maximizing a submodular function over the hypercube (using polynomially many queries) is a $\tfrac{1}{2}$-approximation. The main result of this paper achieves this best-case scenario:
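The reduction via the multilinear extension can be sketched as follows; the toy set function and the plain Monte Carlo estimator below are our illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_set(S, n=4):
    # toy non-monotone submodular set function: |S| * |complement|
    # (a concave function of |S|, hence submodular)
    return len(S) * (n - len(S))

def multilinear(x, samples=20000):
    # Monte Carlo estimate of F(x) = E[f(S)], where S includes each
    # element i independently with probability x_i
    n = len(x)
    total = 0.0
    for _ in range(samples):
        S = {i for i in range(n) if rng.random() < x[i]}
        total += f_set(S, n)
    return total / samples

est = multilinear(np.array([0.5, 0.5, 0.5, 0.5]))  # exact value is 3.0
```

At $x = (\tfrac12,\dots,\tfrac12)$, $|S|$ is Binomial$(4, \tfrac12)$ and $\mathbb{E}[|S|(4-|S|)] = 3$, so the estimate concentrates around 3.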
There is an algorithm for maximizing a continuous submodular function over the hypercube that guarantees a $\tfrac{1}{2}$-approximation, under mild continuity assumptions, while using only a polynomial number of queries to the function.
Our algorithm is inspired by the bi-greedy algorithm of Buchbinder et al. (2015) for maximizing a submodular set function: it maintains two solutions initialized at $\mathbf{0}$ and $\mathbf{1}$, goes over the coordinates sequentially, and makes the two solutions agree on each coordinate. The algorithmic question is how to choose the new coordinate value for the two solutions so that the algorithm gains enough value relative to the optimum in each iteration. Prior to our work, the best-known result was a $\tfrac{1}{3}$-approximation (Bian et al., 2017b), which is also inspired by the bi-greedy. Our algorithm requires a number of new ideas, including a reduction to the analysis of a zero-sum game for each coordinate, and the use of the special geometry of this game to bound its value.
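For intuition, here is a minimal sketch of the discrete randomized double-greedy of Buchbinder et al. (2015) for set functions, which inspires the continuous bi-greedy (function and variable names are ours; this is the set-function algorithm, not our continuous one):

```python
import random

def double_greedy(f, n, seed=0):
    """1/2-approximation (in expectation) for unconstrained nonnegative
    submodular set maximization over subsets of {0, ..., n-1}."""
    rnd = random.Random(seed)
    A, B = set(), set(range(n))        # start at the empty set and the ground set
    for i in range(n):
        a = f(A | {i}) - f(A)          # marginal gain of adding i to A
        b = f(B - {i}) - f(B)          # marginal gain of removing i from B
        ap, bp = max(a, 0), max(b, 0)
        # make the two solutions agree on i, randomizing in proportion
        # to the positive parts of the two marginal gains
        if ap + bp == 0 or rnd.random() < ap / (ap + bp):
            A.add(i)
        else:
            B.discard(i)
    return A                           # A == B at this point
```

For example, on the cut function of the path graph 0–1–2, the algorithm recovers the maximum cut $\{1\}$ of value 2.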
The second and third applications above induce objective functions that, in addition to being submodular, are concave in each coordinate (called DR-submodular in (Soma and Yoshida, 2015), based on the diminishing-returns property defined in (Kapralov et al., 2013)); note, however, that after regularization the function remains submodular but can lose coordinatewise concavity. Here, an optimal $\tfrac{1}{2}$-approximation algorithm was already known for integer lattices (Soma and Yoshida, 2017), and it generalizes easily to our continuous setting as well; our contribution is a significantly faster such bi-greedy algorithm. The main idea is to identify a monotone equilibrium condition sufficient for the required approximation guarantee, which enables a binary-search-type solution.
We also run experiments to verify the performance of our proposed algorithms in practical machine learning applications. We observe that our algorithms match the performance of the prior work, while providing either a better guaranteed approximation or a better running time.
Further related work.
Buchbinder and Feldman (2016) derandomize the bi-greedy algorithm. Staib and Jegelka (2017) apply continuous submodular optimization to budget allocation, and develop a new submodular optimization algorithm to this end. Hassani et al. (2017) give a $\tfrac{1}{2}$-approximation for monotone continuous submodular maximization under convex constraints. Gotovos et al. (2015) consider (adaptive) submodular maximization when feedback is given after an element is chosen. Chen et al. (2018); Roughgarden and Wang (2018) consider submodular maximization in the context of online no-regret learning. Mirzasoleiman et al. (2013) show how to perform submodular maximization with distributed computation. Submodular minimization has been studied in Schrijver (2000); Iwata et al. (2001). See Bach et al. (2013) for a survey of more applications in machine learning.
Variations of continuous submodularity.
We consider nonmonotone nonnegative continuous submodular functions, i.e. $f : [0,1]^n \to \mathbb{R}_{\ge 0}$ s.t. $f(x \vee y) + f(x \wedge y) \le f(x) + f(y)$ for all $x, y \in [0,1]^n$, where $\vee$ and $\wedge$ are the coordinatewise max and min operations. Two related properties are weak Diminishing-Returns Submodularity (weak DR-SM) and strong Diminishing-Returns Submodularity (strong DR-SM) (Bian et al., 2017b), formally defined below. Indeed, weak DR-SM is equivalent to submodularity (see Proposition 4 in the supplement), and hence we use these terms interchangeably.
Definition 1 (Weak/Strong DR-SM).
Consider a continuous function $f : [0,1]^n \to \mathbb{R}$:

- Weak DR-SM (continuous submodular): for all $i \in [n]$ and all $x \le y$ in $[0,1]^n$ with $x_i = y_i$, and for all $\delta \ge 0$ such that the points below remain in $[0,1]^n$:
$$f(x + \delta e_i) - f(x) \ge f(y + \delta e_i) - f(y).$$

- Strong DR-SM (DR-submodular): for all $i \in [n]$ and all $x \le y$ in $[0,1]^n$, and for all $\delta \ge 0$ such that the points below remain in $[0,1]^n$:
$$f(x + \delta e_i) - f(x) \ge f(y + \delta e_i) - f(y).$$
As simple corollaries, a twice-differentiable $f$ is strong DR-SM if and only if all the entries of its Hessian are nonpositive, and weak DR-SM if and only if all of the off-diagonal entries of its Hessian are nonpositive. Also, weak DR-SM together with concavity along each coordinate is equivalent to strong DR-SM (see Proposition 4 in the supplementary materials for more details).
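The Hessian characterization is easy to check numerically. The function $f(x, y) = (x - y)^2$ below is an illustrative example of a weak DR-SM function that is not strong DR-SM: its Hessian is $\begin{pmatrix} 2 & -2 \\ -2 & 2 \end{pmatrix}$, so the off-diagonal entries are nonpositive but the diagonal is positive (this sketch uses central finite differences; names are ours).

```python
import numpy as np

def hessian(f, x, eps=1e-4):
    # central finite-difference approximation of the Hessian of f at x
    n = len(x)
    H = np.zeros((n, n))
    I = np.eye(n)
    for i in range(n):
        for j in range(n):
            ei, ej = eps * I[i], eps * I[j]
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * eps ** 2)
    return H

f = lambda v: (v[0] - v[1]) ** 2
H = hessian(f, np.array([0.3, 0.7]))
weak_dr = H[0, 1] <= 1e-6 and H[1, 0] <= 1e-6        # off-diagonals <= 0
strong_dr = weak_dr and H[0, 0] <= 1e-6 and H[1, 1] <= 1e-6  # all entries <= 0
```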
Coordinatewise Lipschitz continuity.
Consider the univariate functions generated by fixing all but one of the coordinates of the original function $f$. In later sections, we sometimes require mild technical assumptions on the Lipschitz continuity of these single-dimensional functions.
Definition 2 (Coordinatewise Lipschitz).
A function $f : [0,1]^n \to \mathbb{R}$ is coordinatewise Lipschitz continuous if there exists a constant $C > 0$ such that for all $i \in [n]$ and all $x \in [0,1]^n$, the single-variate function $z \mapsto f(x_1, \ldots, x_{i-1}, z, x_{i+1}, \ldots, x_n)$ is $C$-Lipschitz continuous, i.e., its value changes by at most $C\,|z - z'|$ between any two points $z, z' \in [0,1]$.
2 Weak DR-SM Maximization: Continuous Randomized Bi-Greedy
Our first main result is a $\tfrac{1}{2}$-approximation algorithm (up to an additive error $\epsilon$) for maximizing a continuous submodular function, a.k.a. weak DR-SM, which is information-theoretically optimal (Feige et al., 2011). This result assumes that the function is coordinatewise Lipschitz continuous. (Such an assumption is necessary, since otherwise the single-dimensional problem amounts to optimizing an arbitrary function and is hence intractable; prior work, e.g. Bian et al. (2017b) and Bian et al. (2017a), implicitly requires such an assumption to perform single-dimensional optimization.) Before describing our algorithm, we introduce the notion of the positive-orthant concave envelope of a two-dimensional curve, which is useful for understanding our algorithm.
Definition 3.
Consider a curve $c = \{(g(z), h(z)) : z \in [0,1]\}$ such that:

- $g$ and $h$ are both continuous,

- $g(0) = 0$, and $h(1) = 0$.

Then the positive-orthant concave envelope of $c$, denoted by $\mathrm{concenv}(c)$, is the smallest concave curve in the positive orthant upper-bounding all the points of $c$ (see Figure 0(a)).
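For intuition, the concave envelope of a sampled planar curve can be approximated by the upper portion of the convex hull of its points. The following monotone-chain sketch is ours and, for simplicity, ignores the positive-orthant restriction in Definition 3:

```python
import numpy as np

def upper_concave_envelope(points):
    """Vertices (left to right) of the smallest concave piecewise-linear
    curve upper-bounding the given 2-D points (Andrew's monotone chain,
    upper hull only)."""
    pts = sorted(set(map(tuple, points)))          # sort by x, then y
    def cross(o, a, b):
        return (a[0]-o[0])*(b[1]-o[1]) - (a[1]-o[1])*(b[0]-o[0])
    hull = []
    for p in pts:
        # pop while the last turn is not a strict right turn,
        # so the retained chain stays concave
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

z = np.linspace(0, 1, 201)
g, h = 4 * z * (1 - z), np.cos(3 * z)              # hypothetical curve (g(z), h(z))
env = upper_concave_envelope(np.column_stack([g, h]))
```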
We start by describing a vanilla version of our algorithm for maximizing $f$ over the unit hypercube, termed the continuous randomized bi-greedy (Algorithm 1). This version assumes black-box oracle access to algorithms for a few computations involving univariate restrictions of $f$ (e.g. maximization over $[0,1]$, computing $\mathrm{concenv}(\cdot)$, etc.). We first prove that the vanilla algorithm finds a solution with an objective value of at least $\tfrac{1}{2}$ of the optimum. In Section 2.2, we show how to approximately implement these oracles in polynomial time when $f$ is coordinatewise Lipschitz.
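As a preview of the oracle implementations, a $C$-Lipschitz univariate function can be approximately maximized with $O(C/\epsilon)$ queries by plain grid search; this sketch (names ours) illustrates the idea, not the paper's exact procedure:

```python
import numpy as np

def approx_argmax(u, C, eps):
    """Approximately maximize a C-Lipschitz function u over [0, 1].
    With grid spacing <= eps / C, the best grid point is within an
    additive eps of the true maximum value."""
    grid = np.linspace(0.0, 1.0, int(np.ceil(C / eps)) + 1)
    vals = [u(z) for z in grid]
    return grid[int(np.argmax(vals))]

u = lambda z: 1 - abs(z - 0.37)        # 1-Lipschitz, maximized at z = 0.37
z_star = approx_argmax(u, C=1.0, eps=0.01)
```

Since the true maximizer is within half a grid step of some grid point, the returned value is within $C \cdot \epsilon / (2C) = \epsilon/2$ of optimal here.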
Theorem 1.
If $f$ is nonnegative and continuous submodular (equivalently, weak DR-SM), then Algorithm 1 is a randomized $\tfrac{1}{2}$-approximation algorithm, i.e. it returns a point $x^*$ s.t. $\mathbb{E}[f(x^*)] \ge \tfrac{1}{2}\,\mathrm{OPT} - \epsilon$.
2.1 Analysis of the Continuous Randomized Bi-Greedy (Proof of Theorem 1)
We start by defining the vectors used in our analysis, in the same spirit as Buchbinder et al. (2015): $a^i$ and $b^i$ denote the two solutions maintained by the algorithm after iteration $i$, and $o^i \triangleq (\mathrm{OPT} \vee a^i) \wedge b^i$ tracks the optimal solution, modified to agree with the algorithm's choices on the first $i$ coordinates. Note that $a^i$ and $b^i$ (or $a^{i-1}$ and $b^{i-1}$) are the values of the two solutions at the end (or at the beginning) of the $i$-th iteration of Algorithm 1. In the remainder of this section, we give the high-level proof ideas and present some proof sketches. See the supplementary materials for the formal proofs.
2.1.1 Reduction to coordinatewise zero-sum games.
For each coordinate $i$, we consider a subproblem: a two-player zero-sum game played between the algorithm player (denoted by ALG) and the adversary player (denoted by ADV). ALG selects a (randomized) strategy in $[0,1]$, and ADV selects a (randomized) strategy in $[0,1]$. Recall the descriptions of the univariate functions $g$ and $h$ at iteration $i$ of Algorithm 1:
We now define the utility of ALG (negative of the utility of ADV) in our zerosum game as follows:
(1) 
Suppose the expected utility of ALG is nonnegative at the equilibrium of this game; in particular, suppose ALG's randomized strategy (in Algorithm 1) guarantees that for every strategy of ADV the expected utility of ALG is nonnegative. If this statement holds for all of the zero-sum games corresponding to the different iterations, then Algorithm 1 is a $\tfrac{1}{2}$-approximation of the optimum.
Lemma 1.
If the expected utility of ALG in each of the $n$ games is at least $-\epsilon/n$ for a constant $\epsilon > 0$, then $\mathbb{E}[f(x^*)] \ge \tfrac{1}{2}\,\mathrm{OPT} - \epsilon$.
Proof sketch.
Our bi-greedy approach, à la Buchbinder et al. (2015), revolves around analyzing the evolving values of three points: $a^i$, $b^i$, and $o^i$. These three points begin at all-zeroes, all-ones, and the optimal solution, respectively, and all converge to the algorithm's final point. In each iteration, we aim to relate the total increase in value of the first two points to the decrease in value of the third point. If we can show that the former quantity is at least twice the latter, then a telescoping sum proves that the algorithm's final point scores at least half of the optimum.

The utility of our game is engineered precisely to compare the total increase in value of the first two points with the decrease in value of the third point. The positive term of the utility is half of this increase in value, and the negative term is a bound on how large in magnitude the decrease in value may be. As a result, an overall nonnegative utility implies that the increase beats the decrease by a factor of two, exactly the requirement for our bi-greedy approach to work. Finally, an additive slack of $\epsilon/n$ in the utility of each game sums over the $n$ iterations to a total slack of $\epsilon$. ∎
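Concretely, the telescoping step can be written out as follows (a sketch in our notation, with expectations suppressed; $x^*$ denotes the common final point of the three tracked sequences, and the constants in the slack terms are illustrative):

```latex
% Per-iteration guarantee (nonnegative utility up to slack; the positive
% utility term is half the increase, so a slack of \epsilon/n in the
% utility becomes 2\epsilon/n here):
\bigl[f(a^i)-f(a^{i-1})\bigr]+\bigl[f(b^i)-f(b^{i-1})\bigr]
  \;\ge\; 2\bigl[f(o^{i-1})-f(o^i)\bigr]-\tfrac{2\epsilon}{n}.
% Summing over i=1,\dots,n telescopes, with a^0=\mathbf{0},\; b^0=\mathbf{1},\;
% o^0=\mathrm{OPT}, and a^n=b^n=o^n=x^*:
2f(x^*)-f(\mathbf{0})-f(\mathbf{1})\;\ge\;2f(\mathrm{OPT})-2f(x^*)-2\epsilon.
% Since f(\mathbf{0}), f(\mathbf{1}) \ge 0, rearranging gives
f(x^*)\;\ge\;\tfrac{1}{2}\,f(\mathrm{OPT})-\tfrac{\epsilon}{2}.
```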
Proof of Lemma 1.
Consider a realization of where . We have:
(2) 
where the inequality holds due to weak DR-SM. Similarly, for a realization of where :
(3) 
Putting eq. 2 and eq. 3 together, for every realization we have:
(4) 
Moreover, consider the term . We have:
(5) 
where the first inequality holds due to the weak DR-SM property and the second inequality holds as . Similarly, consider the term . We have:
(6) 
where the first inequality holds due to weak DR-SM and the second inequality holds as . By eq. 4, eq. 5, eq. 6, and the fact that , we have:
2.1.2 Analyzing the zero-sum games.
Fix an iteration of Algorithm 1. We then have the following.
Proposition 1.
If ALG plays the (randomized) strategy described in Algorithm 1, then the expected utility of ALG is nonnegative against any strategy of ADV.
Proof of Proposition 1.
We prove this by case analysis over the following two cases:
Case (easy):
In this case, the algorithm plays a deterministic strategy . We therefore have:
where the inequality holds because , and also and so:
To complete the proof for this case, it only remains to show . As , for any given either or (or both). If then:
where the first inequality uses the weak DR-SM property and the second inequality uses the fact . If , we then have:
where the first inequality uses the weak DR-SM property and the second inequality holds because both terms are nonnegative, following the fact that:
This completes the proof of the easy case.
Case (hard):
In this case, ALG plays a mixed strategy over two points. To determine the two-point support, it considers the curve $c = \{(g(z), h(z)) : z \in [0,1]\}$ and finds a point $p$ on $\mathrm{concenv}(c)$ (see Definition 3) that lies on a particular line determined by $\hat g$ and $\hat h$, where $\hat g$ and $\hat h$ are the maximum values of $g$ and $h$, respectively. Because this point is on the concave envelope, it is a convex combination of two points on the curve $c$: say $p = \lambda_1 p_1 + \lambda_2 p_2$, where $p_1 = (g(z_1), h(z_1))$, $p_2 = (g(z_2), h(z_2))$, and $\lambda_1 + \lambda_2 = 1$. The final strategy of ALG is a mixed strategy over $\{z_1, z_2\}$ with probabilities $(\lambda_1, \lambda_2)$. Fixing any mixed strategy of ALG over two points $z_1$ and $z_2$ with probabilities $(\lambda_1, \lambda_2)$, define ADV's positive region as the set of points in the $g$-$h$ plane at which ADV's play leaves ALG with nonnegative expected utility, i.e.
Now, suppose ALG plays a mixed strategy with the property that the corresponding positive region of ADV covers the entire curve $c$. Then, for any strategy of ADV, the expected utility of ALG is nonnegative. In the rest of the proof, we geometrically characterize ADV's positive region against a mixed strategy of ALG over a 2-point support, and then show that for the particular choice of support points and probabilities in Algorithm 1, the positive region covers the entire curve $c$.
Lemma 2.
Suppose ALG plays a 2point mixed strategy over and with probabilities , and w.l.o.g. . Then ADV’s positive region is the pentagon , where and (see Figure 0(b)):

,

,

is the intersection of the lines leaving with slope and leaving along the $g$-axis,

is the intersection of the lines leaving with slope and leaving along the $h$-axis.
Proof of Lemma 2.
We start with a technical lemma showing a single-crossing property of the $g$-$h$ curve of a weak DR-SM function, and then characterize the region using this lemma.
Lemma 3.
The univariate function is monotone nonincreasing.
Proof.
Using the weak DR-SM property of $f$, the proof is immediate, as for any ,
where the inequality holds due to the fact that and . ∎
Equipped with Lemma 3, the positive region is the set of all points such that
The above inequality defines a polytope; our goal is to find its vertices and faces. To this end, we need only consider three cases: 1) , 2) and 3) (note that ). The first and third cases yield the halfspaces and , respectively, which form two of the faces of the positive-region polytope. The second case yields another halfspace; the key observation is that the transition from the first case to the second happens when , i.e. on a line with slope one leaving , and the transition from the second case to the third happens when , i.e. on a line with slope one leaving . Therefore, the second halfspace is the region under the line connecting the two points and , where is the intersection of and the line leaving with slope one (point ), and is the intersection of and the line leaving with slope one (point ). The line segment defines another face of the positive-region polytope, and