# On the Reducibility of Submodular Functions

The scalability of submodular optimization methods is critical for their usability in practice. In this paper, we study the reducibility of submodular functions, a property that enables us to reduce the solution space of submodular optimization problems without performance loss. We introduce the concept of reducibility using marginal gains. We then show that by adding perturbation, we can endow irreducible functions with reducibility, and on this basis we propose the perturbation-reduction optimization framework. Our theoretical analysis proves that, given the perturbation scales, the reducibility gain can be computed and the performance loss has additive upper bounds. We further conduct empirical studies, and the results demonstrate that our proposed framework significantly accelerates existing optimization methods for irreducible submodular functions at the cost of only small performance losses.


## 1 Introduction

Submodularity naturally arises in a number of machine learning problems, such as active learning [10], clustering [22], and dictionary selection [5]. The scalability of submodular optimization methods is critical in practice and has thus drawn much attention from the research community. For example, Iyer et al. [12] propose general optimization methods based on the semidifferential. Wei et al. [25] combine approximation with pruning to accelerate the greedy algorithm for uniform matroid constrained submodular maximization. Mirzasoleiman et al. [21] and Pan et al. [24] use distributed implementations to accelerate existing optimization methods. Other techniques, including stochastic sampling [20] and decomposability assumptions [14], are also applied to scale up submodular optimization methods. In this paper, by contrast, we focus on the reducibility of submodular functions, a favourable property that can substantially improve the scalability of submodular optimization methods. Reducibility directly shrinks the solution space of a submodular optimization problem while preserving all the optima in the reduced space, which enables us to accelerate the optimization process without incurring performance loss.

Recent research shows that for some submodular functions, by evaluating marginal gains, reduction can be applied for unconstrained maximization [8], unconstrained minimization [6, 12], and uniform matroid constrained maximization [25]. By leveraging reducibility, a variety of methods have been developed to scale up the optimization of reducible submodular functions [6, 8, 12, 19]. While existing works mainly focus on reducible functions, a number of irreducible submodular functions are widely applied in practice, and for these functions existing methods can only provide vacuous reduction.

In this paper, we investigate whether irreducible functions can also exploit this favorable property. We first introduce the concept of reducibility using marginal gains over the endpoint sets of a given lattice. Then, for irreducible functions, we transform them into reducible functions by adding random noise to perturb the marginal gains, after which we perform lattice reduction on the perturbed functions and solve the original functions on the reduced lattice. Theoretical results show that, given the perturbation scales, the reducibility gain is lower bounded and the performance loss has additive upper bounds. The empirical results demonstrate that there exist useful perturbation scale intervals in practice, which enable us to significantly accelerate existing optimization methods with small performance losses.

In summary, this paper makes the following contributions. First, we introduce the concept of reducibility and propose the perturbation-reduction framework. Second, we theoretically analyze our proposed method. In particular, for the reducibility gain, we give a lower bound in terms of the perturbation scale. For the performance loss, we give both deterministic and probabilistic upper bounds. The deterministic bound clarifies the relationship between the reducibility gain and the performance loss, while the probabilistic bound can explain the experimental results. Finally, we empirically show that the proposed method is applicable to a variety of commonly used irreducible submodular functions.

The rest of the paper is organized as follows. In Section 2, we introduce the definitions and the existing reduction algorithms. In Section 3, we propose our perturbation-based method. Theoretical analysis and empirical results are presented in Section 4 and Section 5, respectively. In Section 6, we review related work. Section 7 concludes the paper.

## 2 Reducibility

### 2.1 Notations and Definitions

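Given a finite set $N = \{1, 2, \dots, n\}$ and a set function $f: 2^N \to \mathbb{R}$, $f$ is said to be submodular [6] if $\forall X, Y \subseteq N$, $f(X) + f(Y) \ge f(X \cap Y) + f(X \cup Y)$. An equivalent definition of submodularity is the diminishing return property: $\forall A \subseteq B \subseteq N$, $\forall i \in N \setminus B$, $f(i|A) \ge f(i|B)$, where $f(i|A) \triangleq f(A + i) - f(A)$ is called the marginal gain of element $i$ with respect to set $A$. To simplify the notation, we denote $X \cup \{i\}$ by $X + i$, and $X \setminus \{i\}$ by $X - i$. Given $S \subseteq T \subseteq N$, the (set interval) lattice is defined as $[S, T] = \{X \mid S \subseteq X \subseteq T\}$.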
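To make the notation concrete, the following is a minimal Python sketch of marginal gains, a brute-force diminishing-returns check, and the set-interval lattice. The weighted graph cut function and all helper names (`cut_value`, `marginal_gain`, `is_submodular`, `lattice`) are illustrative assumptions, not part of the original paper, and the exhaustive check is only practical for very small ground sets.

```python
from itertools import combinations

def cut_value(edges, X):
    """Cut function: total weight of edges with exactly one endpoint in X."""
    X = set(X)
    return sum(w for u, v, w in edges if (u in X) != (v in X))

def marginal_gain(f, i, X):
    """Marginal gain f(i | X) = f(X + i) - f(X)."""
    X = set(X)
    return f(X | {i}) - f(X)

def subsets(ground):
    """Yield every subset of `ground` as a Python set."""
    items = sorted(ground)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield set(combo)

def is_submodular(f, ground):
    """Brute-force check of f(i|A) >= f(i|B) for all A <= B and i not in B."""
    ground = set(ground)
    all_sets = list(subsets(ground))
    for B in all_sets:
        for A in all_sets:
            if not A <= B:
                continue
            for i in ground - B:
                if marginal_gain(f, i, A) < marginal_gain(f, i, B):
                    return False
    return True

def lattice(S, T):
    """Yield every X in the lattice [S, T], i.e. every X with S <= X <= T."""
    S = set(S)
    for extra in subsets(set(T) - S):
        yield S | extra

# Hypothetical toy instance: a weighted graph on four vertices.
EDGES = [(0, 1, 1), (1, 2, 2), (0, 2, 3), (2, 3, 1)]
GROUND = {0, 1, 2, 3}

def f(X):
    return cut_value(EDGES, X)
```

On this toy graph, `marginal_gain(f, 0, set())` is the weighted degree of vertex 0, while `marginal_gain(f, 0, {1, 2, 3})` is its negative, which illustrates diminishing returns at the two endpoint sets of the lattice.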

Suppose , and is a submodular function. In this paper, we focus on unconstrained submodular optimization problems,

$$\text{Problem 1: } \min_{X \subseteq N} f(X), \qquad \text{Problem 2: } \max_{X \subseteq N} f(X).$$

Problem 1 can be exactly solved in polynomial time [23], while Problem 2 is NP-hard since some of its special cases (e.g., Max Cut) are NP-hard.

For convenience of the presentation, we will use P1 and P2 to refer to Problem 1 and Problem 2, respectively. Similarly, for the following algorithms, we will use A1 and A2 to refer to Algorithm 1 and Algorithm 2. The reference holds for P3, P4, A3 and A4.

Define as the optima set of P1. Similarly, define . Obviously, we have and .

We now give the definition of reducibility. For P1 (P2), we say the objective function is reducible for minimization (maximization) if , where can be obtained in function evaluations, such that (). Note that if we can only find in time, the reduction is meaningless since time is enough for us to find all the optima. The ratio is called the reduction rate.

### 2.2 Algorithms

Existing works on reduction for unconstrained submodular optimization can be summarized by the following two algorithms, both of which terminate in time. The brief review of existing works can be found in Section 6.

###### Proposition 1.

Suppose is submodular. After each iteration of A1 (A2), we have ().

We prove Proposition 1 in the supplementary material. According to Proposition 1, if the output of A1 (A2) satisfies , then is reducible.

According to A1 (A2), if , then we have and . The algorithm will terminate after the first iteration and the output is , which provides a vacuous reduction. In this case, we say that is irreducible with respect to A1 (A2). For convenience, we directly say is irreducible.

Thereby, we conclude two points from the above algorithms. First, by the definition of and , the reducibility of can be determined by the signs of marginal gains with respect to the endpoint sets of the current working lattice. Second, the reducibility of for minimization and maximization are actually the same property. Specially, suppose in a certain iteration, A1 and A2 have the same working lattice . According to the algorithms, they also have the same and , which determine whether is reducible after the current iteration.
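The sign-based reduction described above can be sketched as follows. The algorithm listings for A1 and A2 are not reproduced in this text, so the update rule here is an assumption reconstructed from the sign conditions used later in the analysis (adding an element when its gain at the upper endpoint is positive, dropping it when its gain at the lower endpoint is negative, and the mirrored rule for minimization). The names `gain` and `reduce_lattice` are hypothetical.

```python
def gain(f, i, X):
    """Marginal gain f(i | X) = f(X + i) - f(X)."""
    return f(X | {i}) - f(X)

def reduce_lattice(f, S, T, mode="max"):
    """Shrink the lattice [S, T] using signs of gains at its endpoint sets.

    Assumed rule (not taken verbatim from the paper's listings):
      mode="max": add i if f(i | Y - i) > 0, drop i if f(i | X) < 0.
      mode="min": add i if f(i | X) < 0,     drop i if f(i | Y - i) > 0.
    For submodular f the add and drop sets are disjoint, because
    f(i | X) >= f(i | Y - i) whenever X is a subset of Y - i.
    """
    X, Y = set(S), set(T)
    while True:
        free = Y - X
        if mode == "max":
            add = {i for i in free if gain(f, i, Y - {i}) > 0}
            drop = {i for i in free if gain(f, i, X) < 0}
        else:
            add = {i for i in free if gain(f, i, X) < 0}
            drop = {i for i in free if gain(f, i, Y - {i}) > 0}
        if not add and not drop:
            return X, Y  # no sign is decisive: the working lattice is final
        X, Y = X | add, Y - drop

# Hypothetical examples: a modular function (fully reducible) and an
# unweighted triangle cut function (irreducible under this rule).
WEIGHTS = {0: 1, 1: -2, 2: 3}

def modular(X):
    return sum(WEIGHTS[i] for i in X)

TRIANGLE = [(0, 1), (1, 2), (0, 2)]

def triangle_cut(X):
    return sum(1 for u, v in TRIANGLE if (u in X) != (v in X))
```

For the modular function the loop collapses the lattice to a single set in one iteration, whereas for the triangle cut every gain at the lower endpoint is non-negative and every gain at the upper endpoint is non-positive, so the first iteration already returns the original lattice.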

###### Proposition 2.

Given a submodular function , and a lattice . , Define . Then is reducible on with respect to A1 (A2) if and only if

$$K = \max_{i \in T \setminus S} K_i > 0. \tag{1}$$
###### Proof.

Suppose in A1 (A2). Then is reducible if and only if the algorithm does not terminate after its first iteration, i.e., or . Suppose happens, i.e., , . According to submodularity, . We have . Suppose happens, i.e., , . According to submodularity, . We have . ∎

According to Proposition 2, the reducibility of for minimization (maximization) can be obtained by (1). Thus we say is reducible with respect to A1 (A2) if (1) holds. Similarly, without ambiguity in this paper, we directly say is reducible if (1) holds.

## 3 Perturbation Reduction

Given a reducible submodular function, we can use A1 and A2 to provide useful reduction. Unfortunately, there still exist many irreducible submodular functions, some of which are listed in the experimental section. Given a submodular function , which is irreducible on . According to Proposition 2, , we have and .

If we expect A1 and A2 to provide nontrivial reduction, we need to guarantee that (1) holds for some elements without changing the submodularity of the objective function. A natural way is to add random noise $r$, where $r$ is a modular function, to perturb the original function as follows,

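$$\text{Problem 3: } \min_{X \subseteq N} g(X) \triangleq \min_{X \subseteq N} f(X) + r(X), \qquad \text{Problem 4: } \max_{X \subseteq N} g(X) \triangleq \max_{X \subseteq N} f(X) + r(X),$$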

where $r(X) = \sum_{i \in X} r(i)$, and each $r(i)$ is generated uniformly at random in $[-t, t]$ for some $t > 0$. By appropriately choosing the value of $t$, we can ensure that $g(i|S) < 0$ or $g(i|T-i) > 0$ holds for some $i \in T \setminus S$. Thus (1) holds, indicating that $g$ is reducible. At the same time, as $r$ is a modular function, the submodularity of $g$ still holds.

We propose our perturbation based method for minimization and maximization in A3 and A4, respectively. For an irreducible submodular function on a given lattice , we first perturb the objective function to make it reducible, i.e., . A1 or A2 are then employed to obtain the reduced lattice of . Finally we solve the original problems of on the reduced lattice exactly or approximately using existing methods.

It is worth mentioning that, though we mainly focus on irreducible functions, our methods also work for reducible ones, as they are special cases of irreducible functions. Particularly, given a reducible function on , of which the reduction rate is less than , after A1 (A2) terminates, we can get a sublattice so that is irreducible on .
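The perturb, reduce, then solve pipeline for maximization can be sketched as below. This is an illustration under stated assumptions rather than the authors' A4 listing: the noise is drawn uniformly from `[-t, t]` per element, the reduction rule is the sign rule assumed for A2, and the final solver is a brute-force search that only suits toy instances. All function names are hypothetical.

```python
import random
from itertools import combinations

def gain(f, i, X):
    """Marginal gain f(i | X) = f(X + i) - f(X)."""
    return f(X | {i}) - f(X)

def reduce_max(f, S, T):
    """Assumed A2-style rule: add i if f(i|Y-i) > 0, drop i if f(i|X) < 0."""
    X, Y = set(S), set(T)
    while True:
        free = Y - X
        add = {i for i in free if gain(f, i, Y - {i}) > 0}
        drop = {i for i in free if gain(f, i, X) < 0}
        if not add and not drop:
            return X, Y
        X, Y = X | add, Y - drop

def perturb(f, ground, t, rng):
    """Return g = f + r with r modular and r(i) ~ Uniform[-t, t]."""
    r = {i: rng.uniform(-t, t) for i in sorted(ground)}

    def g(X):
        return f(X) + sum(r[i] for i in X)

    return g, r

def brute_force_max(f, S, T):
    """Exhaustively maximize f over the lattice [S, T] (toy sizes only)."""
    free = sorted(T - S)
    candidates = (S | set(c)
                  for k in range(len(free) + 1)
                  for c in combinations(free, k))
    return max(candidates, key=f)

def perturbation_reduction_max(f, ground, t, rng):
    """Reduce the lattice with the perturbed g, then solve the ORIGINAL f on it."""
    g, r = perturb(f, ground, t, rng)
    X, Y = reduce_max(g, set(), set(ground))
    return brute_force_max(f, X, Y), (X, Y), r

# Hypothetical toy instance: a weighted graph cut function.
EDGES = [(0, 1, 1), (1, 2, 2), (0, 2, 3), (2, 3, 1)]
GROUND = {0, 1, 2, 3}

def f(X):
    return sum(w for u, v, w in EDGES if (u in X) != (v in X))
```

A call such as `perturbation_reduction_max(f, GROUND, 3.0, random.Random(0))` returns the solution found on the reduced lattice, the reduced lattice itself, and the sampled noise. With `t = 0` the noise vanishes and the cut function stays irreducible, so the search runs over the full lattice.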

## 4 Theoretical Analysis

By perturbing the irreducible submodular function, we transform P1 (P2) into P3 (P4). This makes the objective reducible but renders the solution inexact. Correspondingly, our theoretical analysis of the method focuses on two main aspects: the reducibility gain and the performance loss incurred by perturbation.

### 4.1 Reducibility Gain

Suppose is an irreducible submodular function on , and as defined in P3 and P4. Since is irreducible, , we have and .

###### Proposition 3.

Given a submodular function , which is irreducible on . Define . If , then is irreducible on .

###### Proof.

Since , we suppose . , we have , and , which implies that is also irreducible on . ∎

Proposition 3 indicates that if the perturbation scale is small enough, there is no reducibility gain. This is intuitively reasonable since we have when .

To lower bound the reducibility gain of adding perturbation, we generalize the concept of curvature [4, 13] for non-monotone irreducible submodular functions.

###### Definition 1.

Given a submodular function , the curvature of on is defined as,

$$c_{\{f,[S,T]\}} = \max_{i \in T \setminus S,\; f(i|S) > 0} \frac{f(i|S) - f(i|T-i)}{f(i|S)}.$$

Note that for any irreducible submodular function on , we have .

###### Theorem 1.

Suppose , denote , , . The reduction rate in expectation of is at least .

###### Proof.

Suppose , , we define a random variable as,

$$H_i = \begin{cases} 1 & \text{if } K_i > 0, \\ 0 & \text{otherwise.} \end{cases}$$

indicates whether can be reduced from or not. Define as the total number of the reduced elements. We firstly lower bound by the total number of the reduced elements after the first iteration round of A1(A2),

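$$
\begin{aligned}
E(H) &= \sum_{i \in T \setminus S} E(H_i) = \sum_{i \in T \setminus S} \Pr\{H_i = 1\} \\
&\ge \frac{1}{2t} \cdot \sum_{i \in T \setminus S} \Big( \max\{0,\, t - f(i|S)\} + \max\{0,\, t + f(i|T-i)\} \Big) \\
&\ge s - \frac{c}{2t} \cdot \sum_{i \in T \setminus S} f(i|S) = s - \frac{ck}{2t}.
\end{aligned}
$$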

Consequently, the reduction rate in expectation is . ∎

Theorem 1 implies that the reduction rate in expectation approaches as the perturbation scale increases. This is also consistent with our intuition since when . Note that is a modular function, which always has the highest reduction rate .
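As a numerical illustration of the chain of inequalities in the proof of Theorem 1, the sketch below evaluates the first-iteration expression, the curvature-based lower bound, and a Monte Carlo estimate. It assumes that each perturbation is uniform on `[-t, t]`, that `s` is the number of free elements, and that `k` is the sum of the lower-endpoint gains, as the displayed derivation suggests. The input lists hold `f(i|S)` and `f(i|T-i)` for each free element, and all names are hypothetical.

```python
import random

def expected_first_round(low_gains, high_gains, t):
    """(1 / 2t) * sum_i [max(0, t - f(i|S)) + max(0, t + f(i|T-i))]."""
    total = 0.0
    for a, b in zip(low_gains, high_gains):
        total += max(0.0, t - a) + max(0.0, t + b)
    return total / (2.0 * t)

def curvature_lower_bound(low_gains, high_gains, t):
    """s - c * k / (2t), with c the curvature and k the sum of f(i|S).

    Requires at least one element with f(i|S) > 0.
    """
    s = len(low_gains)
    k = sum(low_gains)
    c = max((a - b) / a for a, b in zip(low_gains, high_gains) if a > 0)
    return s - c * k / (2.0 * t)

def simulate_first_round(low_gains, high_gains, t, trials, rng):
    """Monte Carlo estimate of the number of elements decided in round one."""
    reduced = 0
    for _ in range(trials):
        for a, b in zip(low_gains, high_gains):
            r = rng.uniform(-t, t)
            # Element is decided if its perturbed gain changes sign at an endpoint.
            if a + r < 0 or b + r > 0:
                reduced += 1
    return reduced / trials
```

For an unweighted triangle cut function every element has `f(i|S) = 2` and `f(i|T-i) = -2`; with `t = 4` the first expression and the curvature bound both evaluate to `1.5`, and with `t = 1` no element can be decided in the first iteration.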

### 4.2 Performance Loss

Suppose (), i.e., () is an optimum of P1 (P2). Recall that () is the output of A3 (A4), for P1 (P2), we define () as the performance loss incurred by perturbation. For P1 (P2), the following result shows that the performance loss is upper bounded by the total perturbation of the “mistakenly” reduced elements, which will be explained later on.

###### Theorem 2.

Given an irreducible submodular function . Suppose is the perturbation scale in A3 (A4), and is the reduction rate. We have,

$$f(X_p^*) - f(X^*) < -r(X_t \setminus X^*) + r(X^* \setminus Y_t)$$
###### Proof.

We prove the maximization case. In general, we have , otherwise the loss is zero. Note that according to A4. So we firstly introduce an intermediate set , i.e., the contraction of in for our analysis. Given the fact that , if we can upper bound , then the total performance loss is also upper bounded. In A2, we have . By definition, , , and . We have,

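$$
\begin{aligned}
& f(X^* \cup X_t) - f(X^*) && (2) \\
&= \sum_{s=1,\; x_s \in X_t \setminus X^*}^{|X_t \setminus X^*|} f(x_s | X^* + x_1 + \cdots + x_{s-1}) && (3) \\
&\ge \sum_{i=0}^{t-1} \sum_{d \in D_i \setminus X^*} f(d | Y_i - d) && (4) \\
&> -\sum_{i=0}^{t-1} \sum_{d \in D_i \setminus X^*} r(d) && (5) \\
&= -r(X_t \setminus X^*), && (6)
\end{aligned}
$$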

where (3) is the telescopic version of (2). According to submodularity, we have (4) holds, and (5) comes from the third step of A2. Similarly, we have,

$$f(X^* \cup X_t) - f(X^* \cup X_t \cap Y_t) < -r(X^* \setminus Y_t). \tag{7}$$

Combining (6) with (7), and noting and , we have,

 f(X∗)−f(X∗p)

We note in (8), is actually the set of all the elements which are not in but added by A2. Similarly, is the set of all the elements which are in but eliminated by A2. Consequently, the performance loss is upper bounded by the total perturbation value of all the mistakenly reduced elements. Since the number of all the mistakenly reduced elements is no more than the number of all the reduced elements , and the perturbation is generated in , we have .

For the minimization case, the proof is similar. ∎
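The idea that the loss is controlled by the perturbation on mistakenly reduced elements can be checked numerically on a toy instance. The sketch below is deliberately restricted to a single reduction round starting from the full lattice, and it compares the loss of the intermediate set $(X^* \cup X_1) \cap Y_1$ against $r(X_1 \setminus X^*) - r(X^* \setminus Y_1)$, the quantity obtained by subtracting (6) from (7). The one-round restriction, the A2-style sign rule, the cut function, and all names are assumptions made for illustration.

```python
import random
from itertools import combinations

def gain(f, i, X):
    """Marginal gain f(i | X) = f(X + i) - f(X)."""
    return f(X | {i}) - f(X)

def all_subsets(ground):
    """Return every subset of `ground` as a list of sets."""
    items = sorted(ground)
    return [set(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

def one_round_loss_and_bound(f, ground, t, rng):
    """One perturbed reduction round for maximization on the lattice [{}, N].

    Returns (loss, bound), where
      loss  = f(X*) - f((X* | X1) & Y1)  for a brute-force maximizer X* of f,
      bound = r(X1 - X*) - r(X* - Y1).
    """
    N = set(ground)
    r = {i: rng.uniform(-t, t) for i in sorted(N)}

    def g(X):
        return f(X) + sum(r[i] for i in X)

    added = {i for i in N if gain(g, i, N - {i}) > 0}   # forced into the solution
    dropped = {i for i in N if gain(g, i, set()) < 0}   # forced out of the solution
    X1, Y1 = added, N - dropped

    X_opt = max(all_subsets(N), key=f)
    X_c = (X_opt | X1) & Y1  # contraction of the optimum into [X1, Y1]
    loss = f(X_opt) - f(X_c)
    bound = sum(r[i] for i in X1 - X_opt) - sum(r[i] for i in X_opt - Y1)
    return loss, bound

# Hypothetical toy instance: a weighted graph cut function.
EDGES = [(0, 1, 1), (1, 2, 2), (0, 2, 3), (2, 3, 1)]
GROUND = {0, 1, 2, 3}

def f(X):
    return sum(w for u, v, w in EDGES if (u in X) != (v in X))
```

Since the contracted set lies inside the reduced lattice, the best solution on that lattice loses no more than `loss`, so a small total perturbation on the wrongly fixed elements implies a small performance loss in this one-round setting.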

Note that in Theorem 2, the performance loss is upper bounded by a sum of random variables, which means we can obtain high-probability bounds using concentration inequalities, such as Hoeffding's inequality [11].

###### Theorem 3.

(Hoeffding) Let be independent real-valued random variables such that , . Then with probability ,

 n∑i=1Xi−E[n∑i=1Xi]
###### Theorem 4.

Define , and . Denote , and , where is the symmetric difference between the two sets and . Then with probability at least ,

 f(Xp∗)−f(X∗)
###### Proof.

We prove (10). Since the perturbation vector has zero expectation and each element of is independently generated, for any fixed , according to Theorem 3, with probability at least ,

$$r(X) - r(X^*) \le t \sqrt{2 |X \triangle X^*| \log(1/\delta)}. \tag{11}$$

Suppose , and define . Obviously, we have . Hence,

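$$
\begin{aligned}
& \Pr\left[ r(X_c^*) - r(X^*) \ge t \sqrt{2 N_r \log(m/\delta)} \right] \\
&= \sum_{X \in [S,T]} \Pr\left[ r(X) - r(X^*) \ge t \sqrt{2 |X \triangle X^*| \log\left(\tfrac{m}{\delta}\right)},\; X_c^* = X \right] \\
&\le \sum_{X \in [S,T]} \Pr\left[ r(X) - r(X^*) \ge t \sqrt{2 |X \triangle X^*| \log\left(\tfrac{m}{\delta}\right)} \right] \\
&\le \sum_{X \in [S,T]} \frac{\delta}{m} = m \cdot \frac{\delta}{m} = \delta,
\end{aligned}
$$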

where the first equality holds by the law of total probability. The second equality holds because replacing with in the first expression does not change the event. The first inequality holds because dropping the event $X_c^* = X$ can only increase the probability. The last line results from (11) and the definition of . Combining the above result with Theorem 2, and noting , we have, with probability at least ,

 f(X∗)−f(X∗p)

Using a similar method, (9) can also be proved. ∎
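The deviation term in (11) and its union-bound variant are simple to evaluate. The sketch below computes $t\sqrt{2 d \log(m/\delta)}$ for a symmetric difference of size `d`, and empirically estimates how often a sum of `d` independent `Uniform[-t, t]` perturbations exceeds the `m = 1` threshold. The function names are hypothetical, and the simulation assumes the perturbation model of uniform noise on `[-t, t]`.

```python
import math
import random

def deviation_bound(t, d, delta, m=1):
    """t * sqrt(2 * d * log(m / delta)).

    With m = 1 this is the right-hand side of (11); with m > 1 it is the
    threshold used after the union bound over m candidate sets.
    """
    return t * math.sqrt(2.0 * d * math.log(m / delta))

def empirical_violation_rate(t, d, delta, trials, rng):
    """Fraction of trials in which the perturbation difference exceeds the bound.

    r(X) - r(X*) involves only the d elements of the symmetric difference,
    each contributing +r(i) or -r(i); by symmetry of the uniform noise this
    has the same distribution as a plain sum of d draws.
    """
    threshold = deviation_bound(t, d, delta)
    hits = 0
    for _ in range(trials):
        total = sum(rng.uniform(-t, t) for _ in range(d))
        if total > threshold:
            hits += 1
    return hits / trials
```

Hoeffding's inequality guarantees that the violation rate is at most `delta`. In practice the empirical rate is usually far smaller, which is consistent with the bound being conservative.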

Theorem 4 has an intuitive interpretation. Take P2 and P4 as examples, , if is large, then it is unlikely that is an optimum of P4. Suppose , then we have,

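$$
\begin{aligned}
\Pr\left[ f(Y) + r(Y) \ge f(X) + r(X),\; \forall X \subseteq N \right] &\le \Pr\left[ f(Y) + r(Y) \ge f(X^*) + r(X^*) \right] \\
&= \Pr\left[ r(Y) - r(X^*) \ge \sigma \right].
\end{aligned}
$$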

Overall, the probability of being an optimum of P4 is upper bounded by the probability that the perturbation difference can compensate for the function value difference , where the latter probability is small when is large.

Finally, we show that the and in Theorem 4, which are the numbers of mistakenly reduced elements, can also be upper bounded by functions of and .

###### Theorem 5.

Denote the total number of the mistakenly reduced elements in the first iteration of A1 and A2 as and , respectively. We have,

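$$E_r[M_r^1] \le \frac{n}{2} - \frac{F - f(X^*)}{t}, \tag{12}$$

$$E_r[N_r^1] \le \frac{n}{2} - \frac{f(X^*) - F}{t}, \tag{13}$$

where $F = \frac{f(\emptyset) + f(N)}{2}$.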

###### Proof.

For (13), we calculate the total mistakenly reduced element number in expectation in the first iteration of A2. According to the definition of symmetric difference, .

, . And iff . Similarly, , . And iff . Thus we have,

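$$
\begin{aligned}
E_r[N_r^1] &= E_r|X^* \setminus Y_1| + E_r|X_1 \setminus X^*| \\
&= \sum_{i \in X^*} \frac{t - f(i|\emptyset)}{2t} + \sum_{j \notin X^*} \frac{t + f(j|N-j)}{2t} \\
&= \frac{n}{2} - \frac{1}{2t} \left[ \sum_{i \in X^*} f(i|\emptyset) - \sum_{j \notin X^*} f(j|N-j) \right] \\
&\le \frac{n}{2} - \frac{f(X^*) - f(\emptyset) + f(X^*) - f(N)}{2t}.
\end{aligned}
$$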

For (12), the proof is similar. ∎
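The derivation of (13) can be traced numerically on a small example. The sketch below evaluates the exact first-iteration expectation from the second line of the derivation and the closed-form bound, assuming `t` is at least as large as every marginal gain in absolute value (so that each term is a valid probability) and taking `F` as the average of `f` on the empty set and on the ground set, as the last line of the derivation implies. The unweighted triangle cut function and the function names are illustrative assumptions.

```python
def gain(f, i, X):
    """Marginal gain f(i | X) = f(X + i) - f(X)."""
    return f(X | {i}) - f(X)

def expected_mistakes_first_round(f, ground, X_opt, t):
    """Expected number of mistakenly reduced elements in round one (maximization).

    Elements of X_opt are wrongly dropped with probability (t - f(i|{})) / 2t;
    elements outside X_opt are wrongly added with probability (t + f(j|N-j)) / 2t.
    Assumes t >= |gain| for every element so each ratio lies in [0, 1].
    """
    N = set(ground)
    dropped = sum((t - gain(f, i, set())) / (2.0 * t) for i in X_opt)
    added = sum((t + gain(f, j, N - {j})) / (2.0 * t) for j in N - X_opt)
    return dropped + added

def bound_13(f, ground, X_opt, t):
    """Right-hand side of (13): n/2 - (f(X*) - F) / t with F = (f({}) + f(N)) / 2."""
    N = set(ground)
    F = (f(set()) + f(N)) / 2.0
    return len(N) / 2.0 - (f(X_opt) - F) / t

# Hypothetical toy instance: unweighted triangle, whose max cut is any single vertex.
TRIANGLE = [(0, 1), (1, 2), (0, 2)]
GROUND = {0, 1, 2}

def f(X):
    return sum(1 for u, v in TRIANGLE if (u in X) != (v in X))
```

With `X_opt = {0}` and `t = 4`, the exact expectation is `0.75` and the bound is `1.0`. As `t` grows, both values approach `n / 2 = 1.5`, matching the intuition that large perturbations decide each element essentially at random.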

Using similar methods we can obtain the following results, which recover Theorem 5 as a special case.

###### Theorem 6.

Denote the total number of the mistakenly reduced elements in the th iteration as , and , respectively. We have,

$$E_r[M_r^k] \le \frac{n_{k-1}}{2} - \frac{F_{k-1} - f(X^*)}{t}, \qquad E_r[N_r^k] \le \frac{n_{k-1}}{2} - \frac{f(X^*) - F_{k-1}}{t},$$

where , and .

Theorem 5 implies that the expected number of mistakenly reduced elements in the first iteration will approach as the perturbation scale increases. This is consistent with the intuition. Let , then . Each element will be randomly selected to be added or eliminated with probability , so the expected number of mistakenly reduced elements is .

###### Remark 1.

When is large enough, most elements will be reduced in the first iteration, i.e., . Let , where , by (13) we have , which indicates that if is a large enough perturbation scale, then the number of mistakenly reduced elements can be desirably upper bounded.

###### Remark 2.

Suppose is large enough and . With the result of Theorem 2, we have

 f(X∗)−f(X∗p)

Let where , we have from above. This means if there is some relationship between the optimum and the perturbation ( is a large perturbation scale), then the previous performance loss results can be transformed into approximation ratios.

## 5 Experimental Results

For reducible submodular functions, by incorporating reduction into optimization methods, favorable performance has been achieved [6, 8, 12, 19]. In our experiments, we mainly focus on (nearly) irreducible submodular functions, as listed below.

##### Subset Selection Function.

The objective function [18, 12] is irreducible. Given , , where . We set , , and randomly generate symmetric matrix in , and set , .

##### Mutual Information Function.

Given random vectors , define as the entropy of random variables , which is a highly reducible submodular function. The symmetrization [1] of leads to the mutual information , which is irreducible. We set , and randomly generate .

##### Log-Determinant Function.

Given a positive definite matrix , the determinant [17] is log-submodular. The symmetrization of log-determinant is , where , . We set . We randomly generate data points and compute the similarity matrix as the positive definite matrix .

##### Negative Half-Products Function.

The objective [2] is , where are non-negative vectors. When is not non-negative, can be highly reducible [19]. Here is non-negative, and is nearly irreducible. The reduction rate of A1 (A2) is about . We set , and randomly generate in and in .

### 5.1 Perturbation Scale

In Theorem 1, we lower bound the expectation of reduction rate using the expectation of reduction rate after the first iteration of A1 (A2). It is recently reported that for reducible submodular minimization, this bound is often not tight in practice [12]. Given that our method is actually transforming irreducible functions to reducible ones, it is reasonable to borrow experience from reducible cases. We conjecture that relatively small reduction rates after the first iteration would be sufficient for high reduction rates after the last iteration, thereby we only need to choose small perturbation scales to obtain desirable reducibility gains. We empirically verify the conjecture as shown in Figure 1. Appropriate perturbation scales are chosen so that the reduction rates after the last iteration are changing from to nearly . Given a certain perturbation scale, we repeatedly generate for times and record the average reduction rates of A2 after iteration - and the last iteration. We observe that A2 terminates within iterations for all objective functions.

We defer similar results for minimization to the supplementary material. As conjectured, we learn from Figure 1 that the gap between the average reduction rates after the first iteration and the last iteration is always large in practice. Hence, we can choose to get appropriate reduction rates in expectation (e.g., ) after the first iteration, so as to obtain potentially high final reduction rates. Although we can empirically exploit this gap in reducibility gain to choose relatively small perturbation scales, we would like to point out that theoretically determining the expected reduction rate after the last iteration for a given perturbation scale is still an open problem.

### 5.2 Optimization Results

We implement our method using SFO toolbox [15].

For maximization, we compare A4 with both exact and approximate methods, as exact methods usually cannot terminate in acceptable time with larger input scales. Denote the outputs of the proposed A4 and the existing method as and , respectively. Also denote the running time as and . We measure the performance loss using relative error, which is defined as . When is exact, is the approximation ratio. We measure the reducibility gain using both the reduction rate and the time ratio . Small time ratios and relative errors indicate large reducibility gains and small performance losses, respectively.

We employ the branch-and-bound method [9] as the exact solver. Since it has exponential time complexity, we reset so that it terminates within acceptable time. The results are shown in Figure 2. For comparison, we normalize the perturbation scale as follows. We define , and define the perturbation scale ratio as . We change the perturbation scale in by varying in . We then randomly generate cases for each objective function and record the average relative errors, average reduction rates, and average time ratios for each perturbation scale ratio. Figure 3 shows the results compared with the random bi-directional greedy method [3], which is used as the approximate solver. Note that is set to 100. For each case, we firstly run A2 once, and then run the random method times on both the original and the reduced lattice, and record the best solutions.

According to Figure 2 and Figure 3, when the perturbation scale ratio is smaller than , the time ratio is larger than . This is because the small reducibility gain cannot make the combination methods more efficient than before. As the perturbation scale ratio increases, the reduction rate increases and the time ratio decreases as expected. Meanwhile, the relative error increases gently when increases, indicating that there exist useful intervals, in which the perturbation scales can lead to large reducibility gains and small performance losses.

For minimization, since the subset selection and the mutual information functions have trivial zero optimal values, i.e., , we use the latter two as objective functions. We employ the Fujishige-Wolfe minimum-norm point algorithm [7] as the exact solver. All the settings are the same as those for maximization. The results for minimization are shown in Figure 4. We note that for the negative half-products function, the useful interval of perturbation scales is smaller than those of the other functions. According to Remark 2, as the marginal gains are relatively large compared to the optimal value, it is inappropriate to choose large perturbation scales in this case.

## 6 Related Work

In this section, we review some existing works related to solution space reduction for submodular optimization. For P1, Fujishige [6] first proves , where and . Note that actually , which is the working lattice of A1 after its first iteration. Recently, Iyer et al. [12] propose the discrete Majorization-Minimization (MMin) framework for P1. They prove that by choosing appropriate supergradients, MMin is identical to A1. For P2, Goldengorin [8] proposes the Preliminary Preservation Algorithm (PPA), which is identical to A2. For general cases, Mei et al. [19] prove that the two algorithms work for quasi-submodular functions. Beyond unconstrained problems, for uniform matroid constrained monotone submodular function optimization, Wei et al. [25] propose a similar pruning method in which the reduced ground set contains all the original solutions of the greedy algorithm.

## 7 Conclusions

In this paper, we introduce the reducibility of submodular functions, which can improve the efficiency of submodular optimization methods. We then propose the perturbation-reduction framework and demonstrate its advantages theoretically and empirically. We analyze the reducibility gain and performance loss for given perturbation scales. Experimental results show that there exist practically useful intervals, and choosing perturbation scales from them enables us to significantly accelerate the existing methods with only small performance loss. In future work, we would like to study the reducibility of submodular functions in constrained problems.

#### Acknowledgements

Jincheng Mei would like to thank Csaba Szepesvári for fixing the proof of Theorem 4. Bao-Liang Lu was supported by the National Basic Research Program of China (No. 2013CB329401), the National Natural Science Foundation of China (No. 61272248) and the Science and Technology Commission of Shanghai Municipality (No. 13511500200). Asterisk indicates the corresponding author.

## References

• [1] Francis Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends Machine Learning, 6(2-3):145–373, 2013.
• [2] Endre Boros and Peter L Hammer. Pseudo-boolean optimization. Discrete Applied Mathematics, 123(1):155–225, 2002.
• [3] Niv Buchbinder, Michael Feldman, Joseph Naor, and Roy Schwartz. A tight linear time (1/2)-approximation for unconstrained submodular maximization. In IEEE Annual Symposium on Foundations of Computer Science, pages 649–658, 2012.
• [4] Michele Conforti and Gérard Cornuéjols. Submodular set functions, matroids and the greedy algorithm tight worst-case bounds and some generalizations of the rado-edmonds theorem. Discrete Applied Mathematics, 7(3):251–274, 1984.
• [5] Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In International Conference on Machine Learning, pages 1057–1064, 2011.
• [6] Satoru Fujishige. Submodular Functions and Optimization, volume 58. Elsevier, 2005.
• [7] Satoru Fujishige and Shigueo Isotani. A submodular function minimization algorithm based on the minimum-norm base. Pacific Journal of Optimization, 7(1):3–17, 2011.
• [8] Boris Goldengorin. Maximization of submodular functions: Theory and enumeration algorithms. European Journal of Operational Research, 198(1):102–112, 2009.
• [9] Boris Goldengorin, Gerard Sierksma, Gert A Tijssen, and Michael Tso. The data-correcting algorithm for the minimization of supermodular functions. Management Science, 45(11):1539–1551, 1999.
• [10] Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.
• [11] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963.
• [12] Rishabh Iyer, Stefanie Jegelka, and Jeff Bilmes. Fast semidifferential-based submodular function optimization. In International Conference on Machine Learning, pages 855–863, 2013.
• [13] Rishabh Iyer, Stefanie Jegelka, and Jeff A Bilmes. Curvature and optimal algorithms for learning and minimizing submodular functions. In Advances in Neural Information Processing Systems, pages 2742–2750, 2013.
• [14] Stefanie Jegelka, Francis Bach, and Suvrit Sra. Reflection methods for user-friendly submodular optimization. In Advances in Neural Information Processing Systems, pages 1313–1321, 2013.
• [15] Andreas Krause. SFO: A toolbox for submodular function optimization. The Journal of Machine Learning Research, 11:1141–1144, 2010.
• [16] Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009.
• [17] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123–286, 2012.
• [18] Hui Lin and Jeff Bilmes. How to select a good training-data subset for transcription submodular active selection for sequences. In Conference of the International Speech, pages 2859–2862, 2009.
• [19] Jincheng Mei, Kang Zhao, and Bao-Liang Lu. On unconstrained quasi-submodular function optimization. In AAAI Conference on Artificial Intelligence, pages 1191–1197, 2015.
• [20] Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. Lazier than lazy greedy. In AAAI Conference on Artificial Intelligence, pages 1812–1818, 2015.
• [21] Baharan Mirzasoleiman, Amin Karbasi, Rik Sarkar, and Andreas Krause. Distributed submodular maximization: Identifying representative elements in massive data. In Advances in Neural Information Processing Systems, pages 2049–2057, 2013.
• [22] Mukund Narasimhan, Nebojsa Jojic, and Jeff A Bilmes. Q-clustering. In Advances in Neural Information Processing Systems, pages 979–986, 2005.
• [23] James B Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming, 118(2):237–251, 2009.
• [24] Xinghao Pan, Stefanie Jegelka, Joseph E Gonzalez, Joseph K Bradley, and Michael I Jordan. Parallel double greedy submodular maximization. In Advances in Neural Information Processing Systems, pages 118–126, 2014.
• [25] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Fast multi-stage submodular maximization. In International Conference on Machine Learning, pages 1494–1502, 2014.

## Appendix A Proof of Proposition 1

For Algorithm 1, the proof can be found in [12]. For Algorithm 2, the proof can be found in [8]. A proof using weaker assumption of quasi-submodular function can be found in [19]. We prove Proposition 1 here for completeness.

###### Proof.

Algorithm 1. Obviously . Suppose , we now prove . Suppose is a minimum of , then we have . For , if , by submodularity, we have , i.e., , which contradicts with the optimality of . So we have , and . , if , by submodularity, we have , i.e., , which also contradicts with the optimality of . Therefore we have , and