Success in today’s machine learning and artificial intelligence algorithms relies largely on big data. Often, however, there exists a small data subset that can act as a surrogate for the whole. Various summarization methods have therefore been designed to select such representative subsets and reduce redundancy. The problem is usually formulated as maximizing a score function f that assigns importance scores to subsets of an underlying ground set V of all elements. Submodular functions are a useful class of functions for this purpose: a function f is submodular (Fujishige, 2005) if for any subsets A ⊆ B ⊆ V and any element v ∉ B, f(A ∪ {v}) - f(A) ≥ f(B ∪ {v}) - f(B).
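The diminishing-returns property above can be checked numerically. The following sketch (our own illustration, not from the paper) uses a set-cover function, a classic monotone submodular function:

```python
# A minimal numerical check of diminishing returns, using a hypothetical
# set-cover function f(S) = |union of the sets indexed by S|, a classic
# monotone submodular function.

def coverage(S, sets):
    """f(S): number of distinct items covered by the subsets indexed by S."""
    covered = set()
    for i in S:
        covered |= sets[i]
    return len(covered)

def gain(v, S, f, sets):
    """Marginal gain f(S ∪ {v}) - f(S)."""
    return f(S | {v}, sets) - f(S, sets)

sets = {0: {1, 2}, 1: {2, 3}, 2: {3, 4, 5}, 3: {1, 5}}
A, B = {0}, {0, 1}           # A is a subset of B
v = 2                        # v is not in B
# diminishing returns: the gain of v shrinks as the context grows
assert gain(v, A, coverage, sets) >= gain(v, B, coverage, sets)
```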
Since the above diminishing-returns property naturally captures the redundancy among elements in terms of their importance to a summary, submodular functions are commonly used as objectives in summarization and other machine learning applications. The quantity f(S ∪ {v}) - f(S), the importance of v’s contribution to S, is called the “marginal gain” of v conditioned on S.
The objective can be chosen from a rich class of submodular functions, e.g., facility location, saturated coverage, feature-based, and entropy functions. We focus on the most commonly used form: normalized (f(∅) = 0) and monotone non-decreasing (f(A) ≤ f(B) for all A ⊆ B) submodular functions. So that a summary has limited size, a cardinality constraint is often applied, as we focus on in this paper; we also address knapsack and matroid constraints in (Zhou and Bilmes, 2018). Under a cardinality constraint, the problem becomes maximizing f(S) subject to S ⊆ V and |S| ≤ k, which we refer to as (2).
Submodular maximization is usually NP-hard. However, (2) can be solved near-optimally by a greedy algorithm with approximation factor 1 - 1/e (Nemhauser et al., 1978). Starting from S = ∅, the greedy algorithm repeatedly selects the element with the largest marginal gain and adds it to S, until |S| = k. To accelerate the greedy algorithm without loss in objective value, the lazy greedy approach (Minoux, 1978; Leskovec et al., 2007) updates only the top element of a priority queue of marginal gains over all remaining elements at each step. Recent approximate greedy algorithms (Iyer et al., 2013; Wei et al., 2014a; Mirzasoleiman et al., 2015) develop piece-wise, multi-stage, or random-sampling strategies to trade off approximate optimality and speed.
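The lazy greedy idea described above can be sketched as follows. This is an illustrative implementation under our own conventions (`f` takes a list of elements and returns a score), not the paper’s code:

```python
import heapq

def lazy_greedy(V, f, k):
    """Sketch of lazy greedy (Minoux, 1978): keep an upper bound on each
    element's marginal gain in a max-heap. By submodularity, gains only
    shrink as S grows, so only the stale top entry needs re-evaluation
    rather than every element in the ground set."""
    S, fS = [], f([])
    heap = [(-(f([v]) - fS), v) for v in V]   # (negated gain bound, element)
    heapq.heapify(heap)
    while len(S) < k and heap:
        bound, v = heapq.heappop(heap)
        fresh = f(S + [v]) - fS               # re-evaluate the stale bound
        if not heap or fresh >= -heap[0][0]:
            S.append(v)                       # still the best: select it
            fS += fresh
        else:
            heapq.heappush(heap, (-fresh, v)) # push back with updated gain
    return S
```

On a coverage objective this returns the same solution as naive greedy, but with far fewer function evaluations on large ground sets.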
In various applications such as news digests, video summarization (Mirzasoleiman et al., 2017a), music recommendation, and photo sharing, data is fed into a system as a stream in a particular order. At any time, the user can request a summary of the elements seen so far. The greedy algorithm and its variants are not appropriate for the streaming setting for both memory and computational reasons: they require storing all elements in advance and computing their marginal gains at each step. In this paper, we study how to solve (2) at any time in the streaming setting, in one pass, using a memory whose size is only the number of buffered elements plus the number of elements in the solution set.
1.1. Related Work
Various strategies have been proposed in previous work to solve (2) in the streaming setting. A thresholding algorithm in (Badanidiyuru et al., 2014) adds element v to a summary if its marginal gain exceeds a threshold derived from an estimate of the global optimum; one function evaluation per step is required to compute the gain. However, the optimum is not known in advance for a stream, so the proposed sieve-streaming algorithm starts by running multiple instances of the thresholding algorithm with different estimates of the optimum, and dynamically removes the instances whose estimates lie outside an interval updated by the maximal singleton gain. At the end, the instance achieving the largest objective value provides the solution. It has a guarantee of 1/2 - ε with O((k log k)/ε) memory. A sliding-window method based on thresholding (Chen et al., 2016) has also been proposed that emphasizes recent data.
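As an illustration of the sieve-streaming idea, the sketch below maintains one thresholding instance per geometric guess of the optimum, discarding guesses outside the interval implied by the largest singleton value seen so far. The guess grid and the admission rule follow Badanidiyuru et al.’s description, but the concrete code is our own simplification:

```python
import math

def sieve_streaming(stream, f, k, eps=0.5):
    """One-pass sketch of sieve-streaming (Badanidiyuru et al., 2014).
    Each instance guesses OPT and admits v when its marginal gain clears
    (guess/2 - f(S)) / (k - |S|); guesses outside [m, 2*k*m], where m is
    the largest singleton value seen so far, are discarded."""
    sieves = {}          # guess -> partial solution
    m = 0.0              # max singleton value seen so far
    for v in stream:
        m = max(m, f([v]))
        # lazily instantiate guesses (1+eps)^i lying within [m, 2*k*m]
        lo = math.ceil(math.log(m, 1 + eps)) if m > 0 else 0
        hi = math.floor(math.log(2 * k * m, 1 + eps)) if m > 0 else -1
        guesses = {(1 + eps) ** i for i in range(lo, hi + 1)}
        sieves = {g: S for g, S in sieves.items() if g in guesses}
        for g in guesses:
            S = sieves.setdefault(g, [])
            if len(S) < k:
                gain = f(S + [v]) - f(S)
                if gain >= (g / 2 - f(S)) / (k - len(S)):
                    S.append(v)
    return max(sieves.values(), key=f) if sieves else []
```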
Swapping between new elements and the ones in the current solution is a natural yet more computationally expensive strategy (Buchbinder et al., 2015; Chekuri et al., 2015; Gomes and Krause, 2010). The algorithm initializes the solution with the first k elements from the stream, and thereafter swaps a new element with an element of the solution whenever the swap improves the objective by a sufficient margin, with the required margin defined differently in (Buchbinder et al., 2015) and (Chekuri et al., 2015) in terms of nonnegative constants. Both cases have a 1/4 approximation guarantee (for a particular constant choice in the former) with memory size k. The latter requires less computation, i.e., one function evaluation per element, compared to the O(k) evaluations required by the former.
A mini-batch based strategy splits the whole stream evenly into k segments, and sequentially adds to the solution the element with the largest marginal gain in each segment. It was introduced via the submodular secretary problem and its extensions (Bateni et al., 2013). This algorithm has a constant approximation bound in expectation with memory size k, provided the data arrives in a uniformly random order. This method requires only one function evaluation per element, but it needs to know the length of the stream in advance, which is impossible when the stream is unboundedly large and a summary can be requested at any time.
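A minimal sketch of the mini-batch strategy (our own illustration): the stream is split into k contiguous segments and the element of each segment with the largest marginal gain is added:

```python
def minibatch_streaming(stream, f, k):
    """Sketch of the mini-batch strategy: split the (known-length) stream
    into k contiguous segments and pick the single element with the
    largest marginal gain from each segment. Note len(stream) must be
    known up front, which is the method's main limitation."""
    n = len(stream)
    S = []
    bounds = [round(i * n / k) for i in range(k + 1)]
    for i in range(k):
        segment = stream[bounds[i]:bounds[i + 1]]
        if not segment:
            continue
        # best element of this segment by marginal gain w.r.t. current S
        best = max(segment, key=lambda v: f(S + [v]) - f(S))
        S.append(best)
    return S
```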
A hardness result is given in Theorem 1.6 of (Buchbinder et al., 2015): for solving (2) in the online setting, no deterministic algorithm can be competitive beyond a constant factor. By Lemma 4.7 of (Buchbinder et al., 2015) (Lemma 4.11 in its arXiv version), the worst-case approximation factor cannot exceed 1/2 unless essentially all the elements seen before a summary request are stored in memory. Note the online setting in (Buchbinder et al., 2015) is slightly different from our streaming setting in that it does not allow buffering of unselected elements; quoting that paper: “The elements of N arrive one by one in an online fashion. Upon arrival, the online algorithm must decide whether to accept each revealed element into its solution and this decision is irrevocable.” However, it is straightforward to generalize the hardness to algorithms with a bounded buffer. In particular, consider the submodular function used in the proof of Lemma 4.7 in (Buchbinder et al., 2015): the hardness persists unless the algorithm buffers at least one of the critical elements, but since the algorithm cannot distinguish the critical elements from the others until seeing the last element of the stream, it must buffer nearly all preceding elements to ensure that one is stored.
Different settings for streaming submodular maximization have also been studied recently. A robust streaming algorithm (Mirzasoleiman et al., 2017b) has been proposed for the case where the data provider has the right to delete a bounded number of elements due to privacy concerns. Given any single-pass streaming algorithm with a constant-factor approximation guarantee, it runs a cascading chain of instances of that algorithm with non-overlapping solutions, ensuring that only one solution is affected by each deletion; the resulting solution still satisfies a constant-factor approximation guarantee when deletions are allowed. Another popular setting is submodular maximization with sliding windows (Epasto et al., 2017), which aims to maintain a solution that takes only the most recent items into account.
In the present paper, we focus mainly on the classical streaming setting, where neither deletions nor sliding windows are considered. Our method, however, can be applied as a streaming-algorithm subroutine in the deletion-robust setting of (Mirzasoleiman et al., 2017b).
1.2. Our Approach
In practice, the thresholding algorithm must try a large number of thresholds (associated with different estimates of the optimum) to obtain a sufficiently good solution, because the solution set is sensitive to tiny changes in the threshold; this results in a high memory load. Though the swapping and mini-batch strategies require only a memory of size k, the former requires O(k) function evaluations per step, while the latter needs to know the stream length in advance and requires the elements to arrive in uniformly random order, which cannot be guaranteed in a streaming setting. Although the worst-case approximation factors of the three algorithms are constants, in practice they perform much worse than the offline greedy algorithm, which has worst-case approximation factor 1 - 1/e but usually performs far better than this bound.
The main contribution of this paper is a novel streaming algorithm (which we call the “stream clipper”) that can achieve empirical performance similar to the offline greedy algorithm, and we analyze when this is the case. It is given in Algorithm 1 and illustrated in the left plot of Figure 1. It uses two thresholds to process each element: it adds the element to the solution set if its marginal gain is at least the upper threshold; rejects it if the gain is below the lower threshold; and otherwise places it in a buffer. The final solution is generated by a greedy algorithm that starts from the obtained solution set and adds more elements from the buffer until the solution reaches the budget size k. Since the elements with marginal gains slightly below the upper threshold are saved in the buffer and given a second chance to be selected, the two-threshold scheme mitigates the instability of single-threshold methods without requiring a large number of different thresholds to be tested simultaneously.
According to the hardness analysis in (Buchbinder et al., 2015), the worst-case approximation factor of stream clipper cannot exceed 1/2 for sublinear memory size. However, we explicitly show that in some cases, when the two thresholds fulfill certain data-dependent conditions, its approximation factor lies strictly above 1/2. In addition, given the thresholds and a data stream to process, we give simple conditions certifying when stream clipper guarantees an approximation factor better than 1/2.
An advanced version of stream clipper is given in Algorithm 2, with an illustration in the right plot of Figure 1. It allows an element in the buffer to replace an element in the solution if such a swap improves the objective, and it applies swapping lazily, avoiding extra computation for every new element. In addition, the advanced version adapts its thresholds to remove elements from the buffer once the buffer size exceeds a user-defined limit. This guarantees memory efficiency even for a poor initialization of the thresholds. In Section 3, experiments on news and video summarization show that stream clipper consistently and significantly outperforms other streaming algorithms (Figures 2-5, Figure 10). In most experiments, it achieves an objective value as large as that of the offline greedy algorithm and produces a summary of similar quality, but costs much less memory and computation due to its streaming setting.
2. Stream Clipper
In the following, we first introduce a naïve stream clipper and then its advanced version with swapping, threshold-adaptation, and buffer-cleaning procedures. A detailed analysis of the approximation bound in different cases (rather than only the worst case) for the naïve version follows; we further show the analysis extends to the advanced version. Throughout, we use the letters “A” for Algorithm and “L” for line; for example, A1.L2-5 refers to Lines 2-5 of Algorithm 1.
2.1. Naïve Stream Clipper
We first give a naïve version of stream clipper in Algorithm 1. It selects an element if its marginal gain is at least the upper threshold and the solution is not yet full, storing it in the solution set (A1.L2-3), while it rejects the element if the gain is below the lower threshold (A1.L7). It places elements whose marginal gain is between the two thresholds (A1.L4) into the buffer (A1.L5). Once a summary is requested, a greedy algorithm (A1.L8-10) adds more elements from the buffer to the solution until its size reaches k.
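The two-threshold scheme of Algorithm 1 can be sketched as follows. Here `tau_hi` and `tau_lo` are hypothetical names for the paper’s upper and lower thresholds, and the sketch omits the advanced swapping and buffer-cleaning steps of Algorithm 2:

```python
def stream_clipper_naive(stream, f, k, tau_hi, tau_lo):
    """Sketch of the naive two-threshold scheme (Algorithm 1): gains at
    or above tau_hi are selected, gains below tau_lo are rejected, and
    in-between elements are buffered for a final greedy pass."""
    S, B = [], []
    for v in stream:
        gain = f(S + [v]) - f(S)
        if gain >= tau_hi and len(S) < k:
            S.append(v)                  # select immediately
        elif gain >= tau_lo:
            B.append(v)                  # borderline: keep for later
        # else: reject v outright
    # final greedy pass over the buffer until |S| = k
    while len(S) < k and B:
        best = max(B, key=lambda v: f(S + [v]) - f(S))
        if f(S + [best]) - f(S) <= 0:
            break
        S.append(best)
        B.remove(best)
    return S
```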
In the following, we use subscripts to denote the solution set and buffer at the end of each iteration of the for-loop in Algorithm 1, i.e., the solution and buffer after passing that many elements but before running the greedy procedure in A1.L8-10. We use S to denote the final solution of Algorithm 1 and k for the size of S, and we refer to the element selected by A1.L3 at each step. In the above algorithm, the two thresholds are fixed, so tuning them is important for obtaining a good solution. In the advanced version introduced below, however, they are updated adaptively with the incoming data stream and are thus more robust to their initialization values.
2.2. Advanced Stream Clipper
In practice, we develop two additional strategies that (1) achieve further improvement by occasional swapping between a buffered element and an element of the solution, and (2) keep the buffer size bounded by removing unimportant elements from the buffer. The advanced version of stream clipper incorporating these two strategies is given in Algorithm 2, where A2.L5-10 implements the first strategy and A2.L15-17 implements the second. Algorithm 2 is the same as Algorithm 1 if these steps are ignored.
The swapping procedure in A2.L5-10 is applied only to new elements whose marginal gain lies between the two thresholds. A2.L5 computes the objective for all possible swaps between the new element and each element of the solution, and finds the swap achieving the maximal objective. A2.L6 computes the average swapping gain of the objective over all elements in the solution. If the swap brings a positive improvement to the objective, it is committed in A2.L10. Compared to previous swapping methods (Buchbinder et al., 2015), which evaluate swaps for every new element, stream clipper computes A2.L5 only for elements falling between the two thresholds. This improves efficiency, since finding the best swap requires O(k) function evaluations.
When the buffer size reaches the user-defined limit, stream clipper increases the lower threshold by a step size, as shown in A2.L16. Since the lower threshold increases, the buffered elements whose marginal gain falls below it can be removed from the buffer (A2.L17). We repeat this buffer-cleaning procedure until the buffer size is within the limit. Note the lower threshold remains bounded by the upper threshold after these increases, because every buffered element has marginal gain below the upper threshold.
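The buffer-cleaning step can be sketched as below; `delta` (step size) and `limit` (buffer limit) are hypothetical names for the paper’s parameters:

```python
def clean_buffer(B, S, f, tau_lo, delta, limit):
    """Sketch of the buffer-cleaning step (A2.L15-17): whenever the buffer
    B exceeds the user limit, raise the lower threshold by the step size
    and evict buffered elements whose current marginal gain w.r.t. the
    solution S falls below the new threshold."""
    while len(B) > limit:
        tau_lo += delta
        B = [v for v in B if f(S + [v]) - f(S) >= tau_lo]
    return B, tau_lo
```

With a positive step size the loop always terminates, since the threshold eventually exceeds every buffered element’s gain.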
In Algorithm 2, a parameter serves as an estimate of the optimum. In practice, it can be initialized from the first elements seen and increased according to the solution set achieved in later steps. We initialize the “step size” to a value that works well empirically. The two thresholds are initialized as shown in Algorithm 2. Note that we can start with a sufficiently small lower threshold so that borderline elements are buffered rather than rejected, and adaptively increase it later as in A2.L16.
2.3. Approximation Bound
We study the approximation bound of Algorithm 1 in different cases rather than only the worst case. First, we assume the lower threshold is chosen small enough that the solution and buffer together contain at least k elements. This guarantees that k elements are selected by the greedy procedure in A1.L8-10, and thus there are k elements in the final output. A trivial choice is a lower threshold of 0.
Lemma 1.
Under the threshold conditions above, |S ∪ B| ≥ k holds before A1.L8.
When the lower threshold is 0, all the elements whose marginal gain is below the upper threshold will be stored in the buffer, which may lead to a large buffer. Note that the advanced version, Algorithm 2, can start from a lower threshold of 0, adaptively increase it, and clean the buffer whenever the buffer size exceeds the limit. By following a proof technique similar to (Nemhauser et al., 1978), we have the theorem below; please refer to (Zhou and Bilmes, 2018) for its proof.
Theorem 2.
If the submodular function f is monotone non-decreasing and normalized, then the final output of Algorithm 1 satisfies the bound stated in (3).
The bound in (3) is a convex combination of two terms. It depends on k, the two thresholds, the optimum, and the number of elements of the optimal set rejected by A1.L7: k is known once a summary is requested; the thresholds are pre-defined parameters; the optimum is what we compare against; but the number of rejected optimal elements depends on the data and may not be known. In order to remove this dependency, we take the minimum of the right-hand side of (3) over all possible values of this number. Since the right-hand side of (3), viewed as a function of this number, has a complex shape, we first study its first- and second-order derivatives.
Lemma 3.
The first- and second-order derivatives of this function can be computed in closed form.
Proposition 4.
Under the conditions of Lemma 3, the minimum of the bound given in (3), taken with respect to the number of rejected optimal elements, is attained either at a boundary point or at the stationary point of the function.
Using Proposition 4, we can derive the minimum value of the bound in three different cases, corresponding to three ranges of the optimum determined by k and the two thresholds. This leads to the following theorem.
Theorem 5.
Under the assumptions of Theorem 2, the minimized bound takes one of three forms, according to which of three ranges (determined by k and the two thresholds) contains the optimum; the three cases and their bounds are stated explicitly in (Zhou and Bilmes, 2018).
Remarks: In case 1, when the two thresholds coincide, the buffer is empty and Algorithm 1 reduces to sieve-streaming (Badanidiyuru et al., 2014), so the bound is the sieve-streaming bound. In the following corollary, we further show that in cases 2 and 3, better bounds (i.e., exceeding 1/2) can be achieved when the buffer is nonempty, since the greedy procedure at the end of Algorithm 1 further takes advantage of elements from the buffer.
Corollary 6.
Under the assumptions of Theorem 2, in case 1 the approximation factor matches that of sieve-streaming, while in cases 2 and 3 it can exceed 1/2 under the stated threshold conditions.
According to Corollary 6, although the approximation factor can exceed 1/2 in cases 2 and 3, the worst-case bound is still 1/2. This is consistent with the hardness result of (Buchbinder et al., 2015): it is impossible to improve the worst-case bound beyond 1/2. However, the bound can be strictly better than 1/2 on specific orders of the same set of elements. Given the two thresholds, for a data stream with a specific order, we give conditions certifying whether stream clipper achieves an approximation factor better than 1/2.
In the following analysis, we use a sequence of distinct integers from 1 to n to denote the order in which the n elements arrive in the stream, and consider the set of all such orders. By analyzing the three cases in Theorem 5, we can restrict the two thresholds to specific ranges. In each range, we characterize the orders on which the improved approximation factor holds and the buffer size is bounded.
Proposition 7.
For appropriate choices of the two thresholds, characterized by the conditions in (10) (see (Zhou and Bilmes, 2018) for the precise statements), there exist sets of stream orders on which stream clipper achieves an approximation factor better than 1/2 while keeping the buffer size bounded.
The detailed proof is given in (Zhou and Bilmes, 2018). In the advanced version of stream clipper, we can adjust the two thresholds to guarantee that the set of such orders is nonempty. The conditions in (10) provide clues for how to adjust them based on the updated estimate of the optimum. According to Proposition 7, on orders satisfying the stated conditions, stream clipper achieves the improved approximation factor with a bounded buffer.
3. Experiments
In this section, we compare summaries generated by stream clipper and other algorithms on several news and video datasets. We use the feature-based submodular function (Wei et al., 2014b) as our objective: a sum, over a set of features, of a concave function of a modular score, where the modular score of a feature is the total affinity of the selected elements to that feature. This function typically achieves good performance on summarization tasks. Our baseline algorithms are the lazy greedy approach (Minoux, 1978) (which has output identical to greedy but is faster) and the sieve-streaming approach (Badanidiyuru et al., 2014) for streaming submodular maximization, which has low memory requirements as it takes one pass over the data. Note that in summarization experiments, a small difference in utility usually leads to a large gap in ROUGE-2 and F1-score.
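A minimal sketch of such a feature-based objective, using the square root as the concave function (an assumption on our part; the concave function is not specified here) and with `w[v][u]` a hypothetical affinity of element v to feature u:

```python
import math

def feature_based(S, w):
    """Sketch of a feature-based submodular objective: a concave function
    (here sqrt) of the accumulated per-feature affinities. Concavity of
    sqrt yields diminishing returns, so f is monotone submodular."""
    if not S:
        return 0.0
    n_features = len(next(iter(w.values())))
    return sum(math.sqrt(sum(w[v][u] for v in S))
               for u in range(n_features))
```

For example, with two elements sharing a feature, adding the second contributes less than the first did, reflecting redundancy.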
3.1. Empirical Study on News
An empirical study is conducted on a ground set containing the sentences from all NYT articles on a randomly selected date between 1996 and 2007, drawn from the NYT annotated corpus 1996-2007 (https://catalog.ldc.upenn.edu/LDC2008T19). Figure 2 shows how the utility and time cost vary as we change the budget size, which we set to the number of sentences in a human-generated summary. The buffer size of stream clipper is fixed, while sieve-streaming runs a large number of threshold instances, leading to a memory requirement much larger than that of stream clipper. In order to test how performance varies with the order of the stream, for each budget we run the same experiment on different random orders of the same data.
The utility and time cost of both streaming algorithms do not change much when the order changes. The utility curve of stream clipper overlaps that of lazy greedy, while its time cost is much smaller and increases more slowly than that of lazy greedy. Sieve-streaming performs much worse than stream clipper in terms of utility, and its time cost is only slightly smaller, even decreasing slightly as the budget increases (because it quickly fills its candidate solutions and stops well before seeing all elements).
Figure 4 shows how the relative utility (relative to the objective value achieved by the offline greedy algorithm) and the time cost of the two streaming algorithms vary with memory size. Stream clipper quickly reaches a relative utility close to that of the greedy algorithm once the memory budget exceeds a small threshold, while sieve-streaming achieves a much smaller relative utility that does not increase until the memory is much larger. Note the time cost of stream clipper exceeds that of sieve-streaming for very small memory, but quickly drops below it: the buffer-cleaning procedure in A2.L15-17 must be executed frequently when the buffer limit is small, and a slight increase in memory size effectively reduces this cost.
Figure 3 shows the robustness of the two streaming algorithms to the estimate of the optimum used to set their thresholds. Over a wide range of this estimate, stream clipper maintains a high relative utility, while sieve-streaming’s utility decreases dramatically away from its peak. Hence, sieve-streaming is more sensitive to this parameter, and a fine-grained search over it is necessary, resulting in a high memory burden. By contrast, our approach adaptively adjusts its two thresholds via swapping and buffer cleaning even when the estimate used to initialize them is inaccurate.
3.2. NYT News Summarization
In this section, we conduct summarization experiments on two news corpora, The New York Times annotated corpus 1996-2007 and the DUC 2001 corpus (http://www-nlpir.nist.gov/projects/duc).
The first dataset includes all the articles published in The New York Times on each day from 1996-2007. For each day, we collect the sentences of the articles associated with human-generated summaries as the ground set (with sizes varying across days), and extract their TF-IDF features to build the objective. We concatenate the sentences from all human-generated summaries of the same date into a reference summary. We compare the machine-generated summaries produced by different methods against the reference summary by ROUGE-2 (Lin, 2004) (recall on 2-grams) and ROUGE-2 F1-score (the F1-measure based on recall and precision on 2-grams). We also compare their relative utility. As before, sieve-streaming uses a much larger memory than stream clipper. Figure 5 shows the statistics over all days.
Stream clipper maintains a high relative utility on most days, while sieve-streaming dominates the low-utility region. The ROUGE-2 score of stream clipper is usually better than that of sieve-streaming, but slightly worse than that of lazy greedy. However, its F1-score is very close to that of lazy greedy, while sieve-streaming’s is much worse.
Figure 6 shows the number of collected sentences for each day and the corresponding time cost of each algorithm. The area of each circle is proportional to the relative utility. We use a log-scale time axis for better visualization. Stream clipper is many times faster than lazy greedy. Their time costs grow at similar rates, because as the summary size increases, the greedy stage of stream clipper tends to dominate the computation. The time cost of sieve-streaming decreases on the largest ground sets, but its relative utility also falls quickly; this is caused by the aforementioned early stopping.
3.3. DUC2001 News Summarization
We observe similar results on the DUC 2001 corpus, which is composed of two datasets. The first includes sets of documents, each set selected by a NIST assessor because its documents relate to the same topic; the assessor also provides four human-generated summaries of fixed word counts for each set. In Figures 7 and 8, we report the ROUGE-2 and F1-score statistics of equal-size summaries generated by the different algorithms. The second dataset is composed of four document sets associated with four topics; we report detailed results in Table 1. Both show that stream clipper achieves performance similar to the offline greedy algorithm while outperforming sieve-streaming.
3.4. Video Summarization
We apply lazy greedy, sieve-streaming, and stream clipper to videos from the video summarization dataset SumMe (Gygli et al., 2014) (http://www.vision.ee.ethz.ch/gyglim/vsum/). The number of frames in each video is given in Table 2 (Zhou and Bilmes, 2018). We resize each frame and extract features from two standard image descriptors: a pyramid of HoG (pHoG) (Bosch et al., 2007) to delineate local and global shape, and GIST (Oliva and Torralba, 2001) to capture the global scene. The pHoG features are computed over a four-level pyramid, and the GIST features are obtained using several blocks and orientations per scale. We concatenate them to form a feature vector for each frame, from which the objective is built. Each algorithm selects a fixed fraction of all frames as the summary set. Sieve-streaming uses a memory of many more frames than the much smaller memory used by stream clipper.
We compare the summaries generated by the three algorithms with those derived from the ground truth and from individual users. Each user was asked to select a subset of frames as a summary, and the ground-truth score of each frame is given by voting over all users. For each video, we compare each algorithm-generated summary with reference summaries composed of the top frames with the largest ground-truth scores (for several reference sizes), and with the summaries from different users. In particular, we report F1-scores against the ground-truth-score summaries in Figure 10 (the recall comparison is given in Figure 11 (Zhou and Bilmes, 2018)), and F1-scores against user summaries in Figure 9 (the recall comparison is in Figure 12 (Zhou and Bilmes, 2018)). In each plot for each video, we also report the average F1-score and average recall over all users.
Stream clipper approaches or outperforms lazy greedy and shows high F1-scores on most videos, while its time cost is small according to Table 2. Although sieve-streaming achieves the best F1-score on a few videos, in most of these cases its summaries are trivially dominated by the first frames of the video, as shown in Figures 9-12 (Zhou and Bilmes, 2018). On these videos, neither lazy greedy nor stream clipper performs well, though they achieve high objective values in optimization. This indicates that the features underlying the submodular function should be improved.
4. Conclusion
In this paper, we introduced stream clipper, a fast and memory-efficient streaming submodular maximization algorithm that can achieve performance similar to that of the commonly used greedy algorithm. It uses two thresholds to either select an important element into the summary, reject it, or place it into a buffer. The final summary is generated by greedily selecting more elements from the buffer. Swapping and buffer-reduction procedures are triggered lazily for further improvement and bounded memory, and the thresholds are adjusted adaptively to avoid a search for optimal threshold values.
- Badanidiyuru et al. (2014) Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. 2014. Streaming Submodular Maximization: Massive Data Summarization on the Fly. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 671–680.
- Bateni et al. (2013) Mohammadhossein Bateni, Mohammadtaghi Hajiaghayi, and Morteza Zadimoghaddam. 2013. Submodular Secretary Problem and Extensions. ACM Trans. Algorithms 9, 4 (2013), 32:1–32:23.
- Bosch et al. (2007) Anna Bosch, Andrew Zisserman, and Xavier Munoz. 2007. Representing shape with a spatial pyramid kernel. In ACM International Conference on Image and Video Retrieval. 401–408.
- Buchbinder et al. (2015) Niv Buchbinder, Moran Feldman, and Roy Schwartz. 2015. Online Submodular Maximization with Preemption. In Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms. 1202–1216.
- Chekuri et al. (2015) Chandra Chekuri, Shalmoli Gupta, and Kent Quanrud. 2015. Streaming Algorithms for Submodular Function Maximization. In 42nd International Colloquium of Automata, Languages, and Programming (ICALP) Part I. 318–330.
- Chen et al. (2016) Jiecao Chen, Huy L. Nguyen, and Qin Zhang. 2016. Submodular Maximization over Sliding Windows. arXiv (2016). http://arxiv.org/abs/1611.00129
- Epasto et al. (2017) Alessandro Epasto, Silvio Lattanzi, Sergei Vassilvitskii, and Morteza Zadimoghaddam. 2017. Submodular Optimization Over Sliding Windows. In International Conference on World Wide Web (WWW). 421–430.
- Fujishige (2005) Satoru Fujishige. 2005. Submodular functions and optimization. Elsevier.
- Gomes and Krause (2010) Ryan Gomes and Andreas Krause. 2010. Budgeted Nonparametric Learning from Data Streams. In Proc. International Conference on Machine Learning (ICML).
- Gygli et al. (2014) Michael Gygli, Helmut Grabner, Hayko Riemenschneider, and Luc Van Gool. 2014. Creating Summaries from User Videos. In European Conference on Computer Vision (ECCV).
- Iyer et al. (2013) Rishabh Iyer, Stefanie Jegelka, and Jeff A. Bilmes. 2013. Fast Semidifferential-based Submodular Function Optimization. In International Conference on Machine Learning (ICML).
- Leskovec et al. (2007) Jure Leskovec, Andreas Krause, Carlos Guestrin, Christos Faloutsos, Jeanne VanBriesen, and Natalie Glance. 2007. Cost-effective Outbreak Detection in Networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 420–429.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. 74–81.
- Minoux (1978) Michel Minoux. 1978. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques. Lecture Notes in Control and Information Sciences, Vol. 7. Chapter 27, 234–243.
- Mirzasoleiman et al. (2015) Baharan Mirzasoleiman, Ashwinkumar Badanidiyuru, Amin Karbasi, Jan Vondrák, and Andreas Krause. 2015. Lazier Than Lazy Greedy. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. 1812–1818.
- Mirzasoleiman et al. (2017a) Baharan Mirzasoleiman, Stefanie Jegelka, and Andreas Krause. 2017a. Streaming Non-monotone Submodular Maximization: Personalized Video Summarization on the Fly. arXiv (2017). http://arxiv.org/abs/1706.03583
- Mirzasoleiman et al. (2017b) Baharan Mirzasoleiman, Amin Karbasi, and Andreas Krause. 2017b. Deletion-Robust Submodular Maximization: Data Summarization with “the Right to be Forgotten”. In International Conference on Machine Learning (ICML), Vol. 70. 2449–2458.
- Nemhauser et al. (1978) G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. 1978. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming 14, 1 (1978), 265–294.
- Oliva and Torralba (2001) Aude Oliva and Antonio Torralba. 2001. Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope. International Journal of Computer Vision 42, 3 (2001), 145–175.
- Wei et al. (2014a) Kai Wei, Rishabh Iyer, and Jeff Bilmes. 2014a. Fast Multi-stage Submodular Maximization. In International Conference on Machine Learning (ICML).
- Wei et al. (2014b) Kai Wei, Yuzong Liu, Katrin Kirchhoff, Chris D. Bartels, and Jeff A. Bilmes. 2014b. Submodular subset selection for large-scale speech training data. In IEEE International Conference on Acoustics, Speech and Signal Processing, (ICASSP) 2014. 3311–3315.
- Zhou and Bilmes (2018) Tianyi Zhou and Jeff Bilmes. 2018. Appendix for Stream Clipper. In Submitted.
Appendix A Proof of Theorem 2
We use one index for the steps of the greedy procedure in A1.L8-10, and another for variables after passing a given number of elements but before the greedy procedure; the final value of the former indexes the final step of the greedy procedure. We have
The first inequality uses monotonicity of , while the second one is due to submodularity.
The third inequality follows from elementary set manipulations along with the fact that f is non-negative and monotone non-decreasing. The fourth inequality results from applying the rejection rule to the rejected elements, together with the max greedy selection rule in A1.L9. Rearranging (13) yields
and the rearranged inequality, when the two thresholds take the stated values, is exactly (17).
Since in total k elements are selected by the greedy procedure, applying (17) from the first step to the last yields
which is equivalent to the claimed bound after applying the definitions of the thresholds. The last inequality follows from the selection rule used in A1.L2: for each selected element, the term in (20) involves the solution set at the beginning of the corresponding step, and we use a telescoping-sum representation of f(S) to obtain the equality in (20).