Submodular optimization is a sweet spot between tractability and expressiveness, with numerous applications in machine learning (e.g. Krause and Golovin (2014), and see below) while permitting many algorithms that are both practical and backed by rigorous guarantees (e.g. Buchbinder et al. (2015); Feige et al. (2011); Calinescu et al. (2011)). In general, a real-valued function defined on a lattice is submodular if and only if
for all , where and denote the join and meet, respectively, of and in the lattice . Such functions are generally neither convex nor concave. In one of the most commonly studied examples, is the lattice of subsets of a fixed ground set (or a sublattice thereof), with union and intersection playing the roles of join and meet, respectively.
This paper concerns a different well-studied setting, where is a hypercube (i.e., ), with componentwise maximum and minimum serving as the join and meet, respectively.111Our results also extend easily to arbitrary axis-aligned boxes (i.e., “box constraints”). We consider the fundamental problem of (approximately) maximizing a continuous and nonnegative submodular function over the hypercube.222More generally, the function only has to be nonnegative at the points and . The function is given as a “black box”: accessible only via querying its value at a point. We are interested in algorithms that use at most a polynomial (in ) number of queries. We do not assume that is monotone (otherwise the problem is trivial).
We next briefly mention four applications of maximizing a non-monotone submodular function over a hypercube that are germane to machine learning and other related application domains.333See the supplement for more details on these applications.
Non-concave quadratic programming. In this problem, the goal is to maximize , where the off-diagonal entries of are non-positive. One application of this problem is to large-scale price optimization on the basis of demand forecasting models (Ito and Fujimaki, 2016).
Map inference for Determinantal Point Processes (DPP).
DPPs are elegant probabilistic models that arise in statistical physics and random matrix theory. DPPs can be used as generative models in applications such as text summarization, human pose estimation, and news threading tasks(Kulesza et al., 2012). The approach in Gillenwater et al. (2012) to the problem boils down to maximize a suitable submodular function over the hypercube, accompanied with an appropriate rounding (see also (Bian et al., 2017a)). One can also think of regularizing this objective function with regularizer, in order to avoid overfitting. Even with a regularizer, the function remains submodular.
Log-submodularity and mean-field inference. Another probabilistic model that generalizes DPPs and all other strong Rayleigh measures (Li et al., 2016; Zhang et al., 2015) is the class of log-submodular distributions over sets, i.e. where is a set submodular function. MAP inference over this distribution has applications in machine learning (Djolonga and Krause, 2014). One variational approach towards this MAP inference task is to use mean-field inference to approximate the distribution with a product distribution , which again boils down to submodular function maximization over the hypercube (see (Bian et al., 2017a)).
Revenue maximization over social networks. In this problem, there is a seller who wants to sell a product over a social network of buyers. To do so, the seller gives away trial products and fractions thereof to the buyers in the network (Bian et al., 2017b; Hartline et al., 2008). In (Bian et al., 2017b), there is an objective function that takes into account two parts: the revenue gain from those who did not get a free product, where the revenue function for any such buyer is a non-negative non-decreasing and submodular function ; and the revenue loss from those who received the free product, where the revenue function for any such buyer is a non-positive non-increasing and submodular function . The combination for all buyers is a non-monotone submodular function. It also is non-negative at and , by extending the model and accounting for extra revenue gains from buyers with free trials.
Maximizing a submodular function over the hypercube is at least as difficult as over the subsets of a ground set.444An instance of the latter problem can be converted to one of the former by extending the given set function (with domain viewed as ) to its multilinear extension defined on the hypercube (where ). Sampling based on an -approximate solution for the multilinear extension yields an equally good approximate solution to the original problem. For the latter problem, the best approximation ratio achievable by an algorithm making a polynomial number of queries is ; the (information-theoretic) lower bound is due to (Feige et al., 2011), the optimal algorithm to (Buchbinder et al., 2015). Thus, the best-case scenario for maximizing a submodular function over the hypercube (using polynomially many queries) is a -approximation. The main result of this paper achieves this best-case scenario:
There is an algorithm for maximizing a continuous submodular function over the hypercube that guarantees a -approximation while using only a polynomial number of queries to the function under mild continuity assumptions.
Our algorithm is inspired by the bi-greedy algorithm of Buchbinder et al. (2015), which maximizes a submodular set function; it maintains two solutions initialized at and , go over coordinates sequentially, and make the two solutions agree on each coordinate. The algorithmic question here is how to choose the new coordinate value for the two solutions, so that the algorithm gains enough value relative to the optimum in each iteration. Prior to our work, the best-known result was a -approximation (Bian et al., 2017b), which is also inspired by the bi-greedy. Our algorithm requires a number of new ideas, including a reduction to the analysis of a zero-sum game for each coordinate, and the use of the special geometry of this game to bound the value of the game.
The second and third applications above induce objective functions that, in addition to being submodular, are concave in each coordinate555However, after regularzation the function still remains submodular, but can lose coordinate-wise concavity. (called DR-submodular in (Soma and Yoshida, 2015) based on diminishing returns defined in (Kapralov et al., 2013)). Here, an optimal -approximation algorithm was recently already known on integer lattices (Soma and Yoshida, 2017), that can easily be generalized to our continuous setting as well; our contribution is a significantly faster such bi-greedy algorithm. The main idea here is to identify a monotone equilibrium condition sufficient for getting the required approximation guarantee, which enables a binary search-type solution.
We also run experiments to verify the performance of our proposed algorithms in practical machine learning applications. We observe that our algorithms match the performance of the prior work, while providing either a better guaranteed approximation or a better running time.
Further related work.
Buchbinder and Feldman (2016) derandomize the bi-greedy algorithm. Staib and Jegelka (2017) apply continuous submodular optimization to budget allocation, and develop a new submodular optimization algorithm to this end. Hassani et al. (2017) give a -approximation for monotone continuous submodular functions under convex constraints. Gotovos et al. (2015) consider (adaptive) submodular maximization when feedback is given after an element is chosen. Chen et al. (2018); Roughgarden and Wang (2018) consider submodular maximization in the context of online no-regret learning. Mirzasoleiman et al. (2013) show how to perform submodular maximization with distributed computation. Submodular minimization has been studied in Schrijver (2000); Iwata et al. (2001). See Bach et al. (2013) for a survey on more applications in machine learning.
Variations of continuous submodularity.
We consider non-monotone non-negative continuous submodular functions, i.e. s.t. , , where and are coordinate-wise max and min operations. Two related properties are weak Diminishing Returns Submodularity (weak DR-SM) and strong Diminishing Returns Submodularity (strong DR-SM) (Bian et al., 2017b), formally defined below. Indeed, weak DR-SM is equivalent to submodularity (see Proposition 4 in the supplement), and hence we use these terms interchangeably.
Definition 1 (Weak/Strong DR-SM).
Consider a continuous function :
Weak DR-SM (continuous submodular): , and
Strong DR-SM (DR-submodular ): , and :
As simple corollaries, a twice-differentiable is strong DR-SM if and only if all the entries of its Hessian are non-positive, and weak DR-SM if and only if all of the off-diagonal entries of its Hessian are non-positive. Also, weak DR-SM together with concavity along each coordinate is equivalent to strong DR-SM (see Proposition 4 in the supplementary materials for more details).
Coordinate-wise Lipschitz continuity.
Consider univariate functions generated by fixing all but one of the coordinates of the original function . In future sections, we sometimes require mild technical assumptions on the Lipschitz continuity of these single dimensional functions.
Definition 2 (Coordinate-wise Lipschitz).
A function is coordinate-wise Lipschitz continuous if there exists a constant such that , , the single variate function is -Lipschitz continuous, i.e.,
2 Weak DR-SM Maximization: Continuous Randomized Bi-Greedy
Our first main result is a -approximation algorithm (up to additive error ) for maximizing a continuous submodular function , a.k.a. weak DR-SM, which is information-theoretically optimal (Feige et al., 2011). This result assumes that is coordinate-wise Lipschitz continuous.666Such an assumption is necessary, since otherwise the single-dimensional problem amounts to optimizing an arbitrary function and is hence intractable. Prior work, e.g. Bian et al. (2017b) and Bian et al. (2017a), implicitly requires such an assumption to perform single-dimensional optimization. Before describing our algorithm, we introduce the notion of the positive-orthant concave envelope of a two-dimensional curve, which is useful for understanding our algorithm.
Consider a curve over the interval such that:
and are both continuous,
, and .
Then the positive-orthant concave envelope of , denoted by , is the smallest concave curve in the positive-orthant upper-bounding all the points (see Figure 0(a)), i.e.,
We start by describing a vanilla version of our algorithm for maximizing over the unit hypercube, termed as continuous randomized bi-greedy (Algorithm 1). This version assumes blackbox oracle access to algorithms for a few computations involving univariate functions of the form (e.g. maximization over , computing conc-env(.), etc.). We first prove that the vanilla algorithm finds a solution with an objective value of at least of the optimum. In Section 2.2, we show how to approximately implement these oracles in polynomial time when is coordinate-wise Lipschitz.
If is non-negative and continuous submodular (or equivalently is weak DR-SM), then Algorithm 1 is a randomized -approximation algorithm, i.e. returns s.t.
2.1 Analysis of the Continuous Randomized Bi-Greedy (proof of Theorem 1)
We start by defining these vectors, used in our analysis in the same spirit as Buchbinder et al. (2015):
Note that and (or and ) are the values of and at the end of (or at the beginning of) the iteration of Algorithm 1. In the remainder of this section, we give the high-level proof ideas and present some proof sketches. See the supplementary materials for the formal proofs.
2.1.1 Reduction to coordinate-wise zero-sum games.
For each coordinate , we consider a sub-problem. In particular, define a two-player zero-sum game played between the algorithm player (denoted by ALG) and the adversary player (denoted by ADV). ALG selects a (randomized) strategy , and ADV selects a (randomized) strategy . Recall the descriptions of and at iteration of Algorithm 1,:
We now define the utility of ALG (negative of the utility of ADV) in our zero-sum game as follows:
Suppose the expected utility of ALG is non-negative at the equilibrium of this game. In particular, suppose ALG’s randomized strategy (in Algorithm 1) guarantees that for every strategy of ADV the expected utility of ALG is non-negative. If this statement holds for all of the zero-sum games corresponding to different iterations , then Algorithm 1 is a -approximation of the optimum.
If for constant , then .
Our bi-greedy approach, á la Buchbinder et al. (2015), revolves around analyzing the evolving values of three points: , , and . These three points begin at all-zeroes, all-ones, and the optimum solution, respectively, and converge to the algorithm’s final point. In each iteration, we aim to relate the total increase in value of the first two points with the decrease in value of the third point. If we can show that the former quantity is at least twice the latter quantity, then a telescoping sum proves that the algorithm’s final choice of point scores at least half that of optimum.
The utility of our game is specifically engineered to compare the total increase in value of the first two points with the decrease in value of the third point. The positive term of the utility is half of this increase in value, and the negative term is a bound on how large in magnitude the decrease in value may be. As a result, an overall nonnegative utility implies that the increase beats the decrease by a factor of two, exactly the requirement for our bi-greedy approach to work. Finally, an additive slack of in the utility of each game sums over iterations for a total slack of . ∎
Proof of Lemma 1..
Consider a realization of where . We have:
where the inequality holds due to weak DR-SM. Similarly, for a a realization of where :
Moreover, consider the term . We have:
where the first inequality holds due to weak DR-SM property and the second inequity holds as . Similarly, consider the term . We have:
2.1.2 Analyzing the zero-sum games.
Fix an iteration of Algorithm 1. We then have the following.
If ALG plays the (randomized) strategy as described in Algorithm 1, then we have against any strategy of ADV.
Proof of Proposition 1.
We do the proof by case analysis over two cases:
In this case, the algorithm plays a deterministic strategy . We therefore have:
where the inequality holds because , and also and so:
To complete the proof for this case, it is only remained to show . As , for any given either or (or both). If then:
where the first inequality uses weak DR-SM property and the second inequality uses the fact . If , we then have:
where the first inequality uses weak DR-SM property and the second inequality holds because both terms are non-negative, following the fact that:
Therefore, we finish the proof of the easy case.
In this case, ALG plays a mixed strategy over two points. To determine the two-point support, it considers the curve and finds a point on (i.e., Definition 3) that lies on the line , where recall that and (as and are the maximizers of and respectively). Because this point is on the concave envelope it should be a convex combination of two points on the curve . Lets say , where and , and . The final strategy of ALG is a mixed strategy over with probabilities . Fixing any mixed strategy of ALG over two points and with probabilities (denoted by ), define the ADV’s positive region, i.e.
Now, suppose ALG plays a mixed strategy with the property that its corresponding ADV’s positive region covers the entire curve . Then, for any strategy of ADV the expected utility of ALG is non-negative. In the rest of the proof, we geometrically characterize the ADV’s positive region against a mixed strategy of ALG over a 2-point support, and then we show for the particular choice of , and in Algorithm 1 the positive region covers the entire curve .
Suppose ALG plays a 2-point mixed strategy over and with probabilities , and w.l.o.g. . Then ADV’s positive region is the pentagon , where and (see Figure 0(b)):
is the intersection of the lines leaving with slope and leaving along the g-axis,
is the intersection of the lines leaving with slope and leaving along the h-axis.
Proof of Lemma 2..
We start by a technical lemma, showing a single-crossing property of the g-h curve of a weak DR submodular function , and we then characterize the region using this lemma.
The univariate function is monotone non-increasing.
By using weak DR-SM property of the proof is immediate, as for any ,
where the inequality holds due to the fact that and . ∎
Being equipped with Lemma 3, the positive region is the set of all points such that
The above inequality defines a polytope. Our goal is to find the vertices and faces of this polytope. Now, to this end, we only need to consider three cases: 1) , 2) and 3) (note that ). From the first and third case we get the half-spaces and respectively, that form two of the faces of the positive-region polytope. From the second case, we get another half-space, but the observation is that the transition from first case to second case happens when , i.e. on a line with slope one leaving , and transition from second case to the third case happens when , i.e. on a line with slope one leaving . Therefore, the second half-space is the region under the line connecting two points and , where is the intersection of and the line leaving with slope one (point ), and is the intersection of and the line leaving with slope one (point ). The line segment defines another face of the positive region polytope, and