# Subspace Selection via DR-Submodular Maximization on Lattices

The subspace selection problem seeks a subspace that maximizes an objective function under some constraint. This problem includes several important machine learning problems such as principal component analysis and the sparse dictionary selection problem. Often, these problems can be solved by greedy algorithms. Here, we are interested in why these problems can be solved by greedy algorithms, and what classes of objective functions and constraints admit this property. To answer this question, we formulate the problems as optimization problems on lattices. Then, we introduce a new class of functions, directional DR-submodular functions, to characterize the approximability of the problems. We show that principal component analysis, the sparse dictionary selection problem, and their generalizations are directional DR-submodular. We also show that, under several constraints, the directional DR-submodular function maximization problem can be solved efficiently with provable approximation factors.

## 1 Introduction

#### Background and motivation

The subspace selection problem involves seeking a good subspace from data. Mathematically, the problem is formulated as follows. Let $\mathcal{L}$ be a family of subspaces of $\mathbb{R}^d$, $\mathcal{F} \subseteq \mathcal{L}$ be a set of feasible subspaces, and $f \colon \mathcal{L} \to \mathbb{R}$ be an objective function. Then, the task is to solve the following optimization problem.

$$\text{maximize} \quad f(X) \quad \text{subject to} \quad X \in \mathcal{F}. \tag{1.1}$$

This problem is a kind of feature selection problem, and contains several important machine learning problems such as the principal component analysis and sparse dictionary selection problem.

In general, the subspace selection problem is a non-convex continuous optimization problem; hence one cannot in general expect to obtain a solution with a provable approximation guarantee. On the other hand, such a solution can be obtained efficiently in some special cases. The most important example is principal component analysis. Let $\mathcal{L}(\mathbb{R}^d)$ be the set of all subspaces of $\mathbb{R}^d$, $\mathcal{F}$ be the subspaces of dimension at most $k$, and $f \colon \mathcal{L}(\mathbb{R}^d) \to \mathbb{R}$ be the function defined by

$$f(X) = \sum_{i \in I} \| \Pi_X u_i \|^2 \tag{1.2}$$

where $\{u_i\}_{i \in I} \subset \mathbb{R}^d$ is the given data and $\Pi_X$ is the orthogonal projection onto the subspace $X$. Then, problem (1.1) with these $\mathcal{L}(\mathbb{R}^d)$, $\mathcal{F}$, and $f$ defines the principal component analysis problem. As is well known, the greedy algorithm, which iteratively selects a new direction that maximizes the objective function, gives the optimal solution to problem (1.1).

Another important problem is the sparse dictionary selection problem. Let $V \subseteq \mathbb{R}^d$ be a set of vectors, called a dictionary. For a subset $S \subseteq V$, we denote by $\mathrm{span}(S)$ the subspace spanned by $S$. Let $\mathcal{L}(V) = \{\mathrm{span}(S) : S \subseteq V\}$ be the subspaces spanned by subsets of $V$, and $\mathcal{F}$ be the subspaces spanned by at most $k$ vectors of $V$. Then, problem (1.1) with these $\mathcal{L}(V)$, $\mathcal{F}$, and $f$ in (1.2) defines the sparse dictionary selection problem. The problem is in general difficult to solve natarajan1995sparse ; however, greedy-type algorithms, e.g., orthogonal matching pursuit, yield provable approximation guarantees depending on the mutual coherence of $V$.
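The objective (1.2) and the greedy scheme for principal component analysis can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function names `orthonormal_basis`, `pca_objective`, and `greedy_pca` are hypothetical, data points are assumed to be stored as rows of a NumPy array, and a subspace is represented by a matrix whose columns span it.

```python
import numpy as np

def orthonormal_basis(cols, tol=1e-10):
    """Orthonormal basis (as columns) of the column space of `cols`."""
    cols = np.asarray(cols, dtype=float)
    if cols.shape[1] == 0:
        return cols
    u, s, _ = np.linalg.svd(cols, full_matrices=False)
    return u[:, s > tol]

def pca_objective(cols, data):
    """f(X) = sum_i ||Pi_X u_i||^2, where X is spanned by the columns of `cols`
    and the rows of `data` are the data points u_i."""
    q = orthonormal_basis(cols)
    return float(np.sum((data @ q) ** 2))

def greedy_pca(data, k):
    """Add, k times, the single direction with the largest marginal gain."""
    d = data.shape[1]
    basis = np.zeros((d, 0))
    for _ in range(k):
        # Remove the part of the data already captured by the current subspace.
        residual = data - data @ basis @ basis.T
        # The best new direction is the top eigenvector of residual^T residual.
        _, vecs = np.linalg.eigh(residual.T @ residual)
        basis = np.column_stack([basis, vecs[:, -1]])
    return basis

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 5))
subspace = greedy_pca(data, 2)
best_possible = np.sort(np.linalg.eigvalsh(data.T @ data))[-2:].sum()
print(pca_objective(subspace, data), best_possible)
```

On generic data the two printed values agree up to floating-point error, which is the exactness of the greedy algorithm for principal component analysis mentioned above.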

Here, we are interested in the following research question: Why can principal component analysis and the sparse dictionary selection problem be solved by greedy algorithms, and what classes of objective functions and constraints have the same property?

#### Existing approach

Several researchers have considered this research question (see Related Work below). One successful approach is to employ submodularity. Let $V \subseteq \mathbb{R}^d$ be a (possibly infinite) set of vectors. We define a set function $F \colon 2^V \to \mathbb{R}$ by $F(S) = f(\mathrm{span}(S))$. If this function satisfies submodularity, $F(S) + F(T) \ge F(S \cup T) + F(S \cap T)$, or one of its approximate variants, we obtain a provable approximation guarantee for the greedy algorithm krause2010submodular ; das2011submodular ; elenberg2016restricted ; khanna2017approximation .

However, this approach has a crucial issue: it cannot capture the structure of vector spaces. Consider three pairwise linearly independent vectors $a$, $b$, and $c$ in $\mathbb{R}^2$. Then, we have $\mathrm{span}\{a, b\} = \mathrm{span}\{b, c\} = \mathrm{span}\{c, a\}$; therefore, $F(\{a, b\}) = F(\{b, c\}) = F(\{c, a\})$. However, this property (a single subspace is spanned by different bases) is overlooked in the existing approach, which leads to underestimating the approximation factors of the greedy algorithms (see Section 4.2).
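The point that a set function cannot distinguish different bases of the same subspace can be checked numerically. The sketch below is illustrative only: the three vectors, the data, and the helper name `span_objective` are assumptions chosen for the example, and the objective is (1.2) evaluated on the span of the given generators.

```python
import numpy as np

def span_objective(generators, data, tol=1e-10):
    """F(S) = f(span(S)): energy of the data rows projected onto the span
    of the generator columns."""
    u, s, _ = np.linalg.svd(np.asarray(generators, dtype=float), full_matrices=False)
    q = u[:, s > tol]
    return float(np.sum((data @ q) ** 2))

# Three pairwise linearly independent vectors in R^2 (illustrative choice).
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0]) / np.sqrt(2.0)
c = np.array([0.0, 1.0])
data = np.array([[2.0, 1.0], [0.5, -1.0]])

values = [span_objective(np.column_stack(pair), data)
          for pair in [(a, b), (b, c), (c, a)]]
print(values)
```

All three pairs span the whole plane, so the three values coincide with the total energy of the data, even though the three sets are distinct as subsets of the dictionary.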

#### Our approach

In this study, we employ lattice theory to capture the structure of vector spaces. A lattice $\mathcal{L}$ is a partially ordered set closed under the greatest lower bound (a.k.a. meet, $\wedge$) and the least upper bound (a.k.a. join, $\vee$).

The family of all subspaces of $\mathbb{R}^d$ is called the vector lattice $\mathcal{L}(\mathbb{R}^d)$, which forms a lattice whose meet and join operators correspond to the intersection and the sum of subspaces, respectively. This lattice can capture the structure of vector spaces as mentioned above. Also, the family $\mathcal{L}(V)$ of subspaces spanned by subsets of $V \subseteq \mathbb{R}^d$ forms a lattice.

We want to establish a submodular maximization theory on lattices. Here, the main difficulty is finding a “nice” definition of submodularity. Usually, lattice submodularity is defined by the following inequality topkis1978minimizing , which is a natural generalization of set submodularity.

$$f(X) + f(Y) \ge f(X \wedge Y) + f(X \vee Y). \tag{1.3}$$

However, this condition is so strong that it cannot capture principal component analysis, as shown below.

###### Example 1.

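Consider the vector lattice $\mathcal{L}(\mathbb{R}^2)$. Let $X = \mathrm{span}\{(1, 0)\}$ and $Y = \mathrm{span}\{(1, \epsilon)\}$ be subspaces of $\mathbb{R}^2$ where $\epsilon > 0$ is sufficiently small. Let $\{(0, 1)\}$ be the given data. Then, function (1.2) satisfies $f(X) = 0$, $f(Y) = \epsilon^2 / (1 + \epsilon^2)$, $f(X \wedge Y) = 0$, and $f(X \vee Y) = 1$. Therefore, it does not satisfy lattice submodularity. A more important point is that, since we can take $\epsilon \to 0$, there is no constant $\alpha > 0$ such that $f(X) + f(Y) \ge \alpha \, (f(X \wedge Y) + f(X \vee Y))$ holds on this lattice. This means that it is very difficult to formulate this function as an approximate version of a lattice submodular function.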
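The failure of inequality (1.3) for the objective (1.2) can be reproduced on a concrete instance. The sketch below assumes a specific configuration chosen for illustration: $X$ is the horizontal axis of the plane, $Y$ is the line through $(1, \epsilon)$, and the data is the single point $(0, 1)$. The helper names `energy` and `submodularity_sides` are hypothetical.

```python
import numpy as np

def energy(cols, u):
    """||Pi_X u||^2 for the subspace X spanned by the (independent) columns of `cols`."""
    if cols.shape[1] == 0:
        return 0.0
    q, _ = np.linalg.qr(cols)
    return float(np.sum((q.T @ u) ** 2))

def submodularity_sides(eps):
    """Return (f(X) + f(Y), f(X meet Y) + f(X join Y)) for the instance above."""
    u = np.array([0.0, 1.0])
    x = np.array([[1.0], [0.0]])
    y = np.array([[1.0], [eps]])
    meet = np.zeros((2, 0))   # two distinct lines intersect only in the origin
    join = np.eye(2)          # two distinct lines span the whole plane
    return energy(x, u) + energy(y, u), energy(meet, u) + energy(join, u)

for eps in (1e-1, 1e-2, 1e-3):
    print(eps, submodularity_sides(eps))
```

The left-hand side shrinks like $\epsilon^2$ while the right-hand side stays at one, so no fixed multiplicative relaxation of (1.3) can hold for all $\epsilon$.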

Another commonly used notion is diminishing return (DR) submodularity soma2015generalization ; bian2016guaranteed ; soma2017non , which was originally introduced on the integer lattice $\mathbb{Z}_{\ge 0}^n$. A function $f \colon \mathbb{Z}_{\ge 0}^n \to \mathbb{R}$ is DR-submodular if

$$f(X + e_i) - f(X) \ge f(Y + e_i) - f(Y) \tag{1.4}$$

for all $X \le Y$ (componentwise inequality) and $i \in \{1, \dots, n\}$, where $e_i$ is the $i$-th unit vector. This definition was later extended to distributive lattices gottschalk2015submodular and can be extended to general lattices (see Section 3). However, Example 1 above still applies, and therefore the objective function of principal component analysis cannot be an approximate version of a DR-submodular function.
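Definition (1.4) can be verified by brute force on a small box of the integer lattice. The checker below is a sketch for illustration; the name `is_dr_submodular`, the restriction to a finite box, and the two test functions are assumptions, not part of the original text.

```python
import itertools
import math

def is_dr_submodular(f, upper, tol=1e-12):
    """Check f(X + e_i) - f(X) >= f(Y + e_i) - f(Y) for all X <= Y in the box
    {0..upper[0]} x ... x {0..upper[n-1]} such that Y + e_i stays in the box."""
    n = len(upper)
    points = list(itertools.product(*(range(u + 1) for u in upper)))
    for x in points:
        for y in points:
            if not all(a <= b for a, b in zip(x, y)):
                continue
            for i in range(n):
                if y[i] + 1 > upper[i]:
                    continue
                x_plus = x[:i] + (x[i] + 1,) + x[i + 1:]
                y_plus = y[:i] + (y[i] + 1,) + y[i + 1:]
                if f(x_plus) - f(x) < f(y_plus) - f(y) - tol:
                    return False
    return True

# A concave function of the coordinate sum has diminishing returns ...
print(is_dr_submodular(lambda x: math.sqrt(sum(x)), (3, 3)))
# ... while a convex function of the coordinate sum does not.
print(is_dr_submodular(lambda x: sum(x) ** 2, (3, 3)))
```

The check is exponential in the dimension and is only meant to make the quantifiers in (1.4) concrete.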

To summarize the above discussion, our main task is to define submodularity on lattices that should satisfy the following two properties:

1. It captures some important practical problems such as the principal component analysis.

2. It admits efficient approximation algorithms under some constraints.

#### Our contributions

In this study, in response to the above two requirements, we make the following contributions:

1. We define downward DR-submodularity and upward DR-submodularity on lattices, which generalize the DR-submodularity (Section 3). Our directional DR-submodularities are capable of representing important machine learning problems such as the principal component analysis and sparse dictionary selection problem (Section 4).

2. We propose approximation algorithms for maximizing (1) a monotone downward DR-submodular function under a height constraint, (2) a monotone downward DR-submodular function under a knapsack constraint, and (3) a non-monotone bidirectional DR-submodular function without constraints (Section 5). These are obtained by generalizing existing algorithms for maximizing submodular set functions. Thus, even though our directional DR-submodularities are strictly weaker than strong DR-submodularity, they are sufficient to admit approximation algorithms.

All the proofs of propositions and theorems are given in the Appendix in the supplementary material.

#### Related Work

For principal component analysis, the greedy algorithm, which iteratively selects the eigenvector of the correlation matrix with the largest remaining eigenvalue, solves the problem exactly abdi2010principal .

With regard to the sparse dictionary selection problem, several studies gilbert2003approximation ; tropp2003improved ; tropp2004greed ; das2008algorithms have analyzed greedy algorithms. In general, the objective function for the sparse dictionary selection problem is not submodular. Therefore, researchers introduced approximated versions of the submodularity and analyzed the approximation guarantee of algorithms with respect to the parameter.

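Krause and Cevher krause2010submodular showed that function (1.2) is an approximately submodular function whose additive gap depends on the mutual coherence. They also showed that the greedy algorithm gives an $(\alpha, \delta)$-approximate solution¹ whose additive error $\delta$ depends on the mutual coherence.

¹ A solution $X$ is an $(\alpha, \delta)$-approximate solution if it satisfies $f(X) \ge \alpha \max_{X' \in \mathcal{F}} f(X') - \delta$. If $\delta = 0$, then we simply say that it is an $\alpha$-approximate solution.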

Das and Kempe das2011submodular introduced the submodularity ratio, which is another measure of submodularity. For the set function maximization problem, the greedy algorithm attains a provable approximation guarantee depending on the submodularity ratio. The approximation ratio of the greedy algorithm is further improved by combining the submodularity ratio with the curvature bian2017guarantees . Elenberg et al. elenberg2016restricted showed that, if a function has bounded restricted strong convexity and bounded smoothness, the corresponding set function has a bounded submodularity ratio. Khanna et al. khanna2017approximation applied the submodularity ratio to the low-rank approximation problem.

It should be emphasized that all the existing studies analyzed the greedy algorithm as a function of a set of vectors (the basis of the subspace), instead of as a function of a subspace. This overlooks the structure of the subspaces causing difficulties as described above.

## 2 Preliminaries

A lattice $(\mathcal{L}, \le)$ is a partially ordered set (poset) such that, for any $X, Y \in \mathcal{L}$, the least upper bound $X \vee Y$ and the greatest lower bound $X \wedge Y$ uniquely exist. We often say “$\mathcal{L}$ is a lattice” by omitting $\le$ if the order is clear from the context. In this paper, we assume that the lattice has the smallest element $\bot$.

A subset $I \subseteq \mathcal{L}$ is a lower set if $Y \in I$ implies that any $X \in \mathcal{L}$ with $X \le Y$ is also in $I$. For $Y \in \mathcal{L}$, the set $\{X \in \mathcal{L} : X \le Y\}$ is called the lower set of $Y$.

A sequence $X_1 < X_2 < \cdots < X_k$ of elements of $\mathcal{L}$ is a composition series if there is no $Y \in \mathcal{L}$ such that $X_i < Y < X_{i+1}$ for any $i$. The length of the longest composition series from $\bot$ to $X$ is referred to as the height of $X$ and is denoted by $h(X)$. The height of a lattice is defined by $\sup_{X \in \mathcal{L}} h(X)$. If this value is finite, the lattice has the largest element $\top$. Note that the height of a lattice can be finite even if the lattice has infinitely many elements. For example, the height of the vector lattice $\mathcal{L}(\mathbb{R}^d)$ is $d$.

A lattice $\mathcal{L}$ is distributive if it satisfies the distributive law: $(X \vee Y) \wedge Z = (X \wedge Z) \vee (Y \wedge Z)$ for all $X, Y, Z \in \mathcal{L}$. A lattice is modular if it satisfies the modular law: $X \le B \Rightarrow X \vee (A \wedge B) = (X \vee A) \wedge B$. Every distributive lattice is modular. On a modular lattice $\mathcal{L}$, all the composition series between $X$ and $Y$ have the same length. The lattice is modular if and only if its height function satisfies the modular equality: $h(X) + h(Y) = h(X \wedge Y) + h(X \vee Y)$. Modular lattices often appear with algebraic structures. For example, the set of all subspaces of a vector space forms a modular lattice. Similarly, the set of all normal subgroups of a group forms a modular lattice.
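For the vector lattice, meet and join can be computed with standard linear algebra, which makes it possible to check the modular equality and to see that distributivity fails. The following is a sketch under stated assumptions: subspaces are represented by matrices whose columns span them, the height is taken to be the dimension, and the helper names `basis`, `dim`, `join`, and `meet` are hypothetical.

```python
import numpy as np

TOL = 1e-10

def basis(cols):
    """Orthonormal basis (as columns) of the column space of `cols`."""
    if cols.shape[1] == 0:
        return cols
    u, s, _ = np.linalg.svd(cols, full_matrices=False)
    return u[:, s > TOL]

def dim(cols):
    return basis(cols).shape[1]

def join(a, b):
    """X join Y: the sum of the two subspaces."""
    return basis(np.column_stack([a, b]))

def meet(a, b):
    """X meet Y: the intersection, read off from the null space of [A, -B]."""
    a, b = basis(a), basis(b)
    m = np.column_stack([a, -b])
    if m.shape[1] == 0:
        return a
    _, s, vt = np.linalg.svd(m)
    rank = int(np.sum(s > TOL))
    null = vt[rank:].T              # columns (x; y) with A x = B y
    return a @ null[: a.shape[1]]

e = np.eye(3)
X, Y = e[:, [0, 1]], e[:, [1, 2]]
print(dim(X) + dim(Y), dim(meet(X, Y)) + dim(join(X, Y)))   # modular equality

# Three distinct lines in the plane violate the distributive law.
P = np.array([[1.0], [0.0]])
Q = np.array([[0.0], [1.0]])
R = np.array([[1.0], [1.0]])
print(dim(meet(join(P, Q), R)), dim(join(meet(P, R), meet(Q, R))))
```

The second comparison gives different dimensions on the two sides of the distributive law, which is the non-distributivity that the directional definitions in Section 3 are designed to handle.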

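For a lattice $\mathcal{L}$, an element $a \in \mathcal{L}$ is join-irreducible if there are no $X, Y \in \mathcal{L}$ with $X \neq a$ and $Y \neq a$ such that $a = X \vee Y$.² We denote by $J(\mathcal{L}) \subseteq \mathcal{L}$ the set of all join-irreducible elements. Any element $X \in \mathcal{L}$ is represented by a join of join-irreducible elements; therefore the structure of $\mathcal{L}$ is specified by the structure of $J(\mathcal{L})$. A join-irreducible element $a \in J(\mathcal{L})$ is admissible with respect to an element $X \in \mathcal{L}$ if $a \not\le X$ and any $a' \in \mathcal{L}$ with $a' < a$ satisfies $a' \le X$. We denote by $\mathrm{adm}(X)$ the set of all admissible elements with respect to $X$. The set $\mathrm{cl}(a \mid X) = \{a' \in \mathrm{adm}(X) : X \vee a' = X \vee a\}$ is called the closure of $a$ at $X$. See Figures 2.2 and 2.2 for the definition of admissible elements and closure. Note that $a$ is admissible with respect to $X$ if and only if the distance from the lower set of $X$ to $a$ is one.

² For the set lattice $2^V$ of a set $V$, the join-irreducible elements correspond to the singleton sets, $\{a\}$ for $a \in V$. Thus, for clarity, we use upper case letters for general lattice elements (e.g., $X$ or $Y$) and lower case letters for join-irreducible elements (e.g., $a$ or $b$).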

###### Example 2.

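In the vector lattice $\mathcal{L}(\mathbb{R}^d)$, each element $X$ corresponds to a subspace. An element $a$ is join-irreducible if and only if it has dimension one. A join-irreducible element $a$ is admissible with respect to $X$ if they are linearly independent, i.e., $a$ is not contained in $X$. The closure $\mathrm{cl}(a \mid X)$ consists of the one-dimensional subspaces contained in $X \vee a$ that are independent of $X$.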

## 3 Directional DR-submodular functions on modular lattices

We introduce new submodularities on lattices. As described in Section 1, our task is to find useful definitions of “submodularities” on lattices; thus, this section is the most important part of this paper.

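Recall definition (1.4) of DR-submodularity on the integer lattice. We can see that $X + e_i = X \vee a$ and $Y + e_i = Y \vee b$ for $a = (X(i) + 1) e_i$ and $b = (Y(i) + 1) e_i$, where $X(i)$ and $Y(i)$ are the $i$-th components of $X$ and $Y$, respectively. Here, $a$ and $b$ are join-irreducibles in the integer lattice, $a \in \mathrm{adm}(X)$, $b \in \mathrm{adm}(Y)$, and $a \le b$. Thus, a natural definition of DR-submodularity on lattices is as follows.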

###### Definition 3 (Strong DR-submodularity).

A function $f \colon \mathcal{L} \to \mathbb{R}$ is strong DR-submodular if, for all $X, Y \in \mathcal{L}$ with $X \le Y$ and all $a \in \mathrm{adm}(X)$, $b \in \mathrm{adm}(Y)$ with $a \le b$, the following holds.

$$f(X \vee a) - f(X) \ge f(Y \vee b) - f(Y) \tag{3.1}$$

The same definition was introduced by Gottschalk and Peis gottschalk2015submodular for distributive lattices. However, it is too strong for our purpose because it cannot capture principal component analysis, as Example 1 shows. Therefore, we need a weaker concept of DR-submodularity.

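Recall that $Y \vee b' = Y \vee b$, and hence $f(Y \vee b') = f(Y \vee b)$, for all $b' \in \mathrm{cl}(b \mid Y)$. Thus, the strong DR-submodularity (3.1) is equivalent to the following: for all $X \le Y$ and $b \in \mathrm{adm}(Y)$,

$$f(Y \vee b) - f(Y) \le \min_{b' \in \mathrm{cl}(b \mid Y)} \; \min_{a \in \mathrm{adm}(X),\, a \le b'} \bigl( f(X \vee a) - f(X) \bigr). \tag{3.2}$$

By relaxing the outer $\min$ to $\max$, we obtain the following definition.

###### Definition 4 (Downward DR-submodularity).

Let $\mathcal{L}$ be a lattice. A function $f \colon \mathcal{L} \to \mathbb{R}$ is downward DR-submodular with additive gap $\delta$, if for all $X, Y \in \mathcal{L}$ with $X \le Y$ and $b \in \mathrm{adm}(Y)$, the following holds.

$$f(Y \vee b) - f(Y) \le \max_{b' \in \mathrm{cl}(b \mid Y)} \; \min_{a \in \mathrm{adm}(X),\, a \le b'} \bigl( f(X \vee a) - f(X) \bigr) + \delta. \tag{3.3}$$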

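Similarly, the strong DR-submodularity (3.1) is equivalent to the following: for all $X, Y \in \mathcal{L}$ and $a \in \mathrm{adm}(X)$ with $X \vee a \le Y$,

$$f(X \vee a) - f(X) \ge \max_{b \ge a} \; \max_{\mathring{Y} :\, X \le \mathring{Y},\; b \in \mathrm{adm}(\mathring{Y}),\; \mathring{Y} \vee b = Y} \bigl( f(Y) - f(\mathring{Y}) \bigr). \tag{3.4}$$

By relaxing the inner $\max$ to $\min$, we obtain the following definition.

###### Definition 5 (Upward DR-submodularity).

Let $\mathcal{L}$ be a lattice. A function $f \colon \mathcal{L} \to \mathbb{R}$ is upward DR-submodular with additive gap $\delta$, if for all $X, Y \in \mathcal{L}$ and $a \in \mathrm{adm}(X)$ with $X \vee a \le Y$, the following holds.

$$f(X \vee a) - f(X) \ge \max_{b \ge a} \; \min_{\mathring{Y} :\, X \le \mathring{Y},\; b \in \mathrm{adm}(\mathring{Y}),\; \mathring{Y} \vee b = Y} \bigl( f(Y) - f(\mathring{Y}) \bigr) - \delta. \tag{3.5}$$

If a function $f$ is both downward DR-submodular with additive gap $\delta$ and upward DR-submodular with additive gap $\delta$, then we say that $f$ is bidirectional DR-submodular with additive gap $\delta$. We use the term directional DR-submodularity to refer to these new DR-submodularities.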

The strong DR-submodularity implies the bidirectional DR-submodularity, because both downward and upward DR-submodularities are relaxations of the strong DR-submodularity. Interestingly, the converse also holds in distributive lattices.

###### Proposition 6.

On a distributive lattice, the strong DR-submodularity, downward DR-submodularity, and upward DR-submodularity are equivalent. ∎

Therefore, we can say that directional DR-submodularities are required to capture the special structure of non-distributive lattices such as the vector lattice.

At the cost of generalization, in contrast to the lattice submodularity (1.3) and the strong DR-submodularity (3.1), the downward and upward DR-submodularity are not closed under addition, because the elements attained in the min/max in the above definitions can depend on the objective function.

## 4 Examples

In this section, we present several examples of directional DR-submodular functions to show that our concepts can capture several machine learning problems.

### 4.1 Principal component analysis

Let $\{u_i\}_{i \in I} \subset \mathbb{R}^d$ be the given data. We consider the vector lattice $\mathcal{L}(\mathbb{R}^d)$ of all the subspaces of $\mathbb{R}^d$, and the objective function $f$ defined by (1.2). Then, the following holds.

###### Proposition 7.

The function $f$ defined by (1.2) is a monotone bidirectional DR-submodular function. ∎

This provides a reason why the principal component analysis is solved by the greedy algorithm from the viewpoint of submodularity.

The objective function can be generalized further. Let $\rho_i \colon \mathbb{R}_{\ge 0} \to \mathbb{R}$ be a monotone non-decreasing concave function with $\rho_i(0) = 0$ for each $i \in I$. Let

$$f_\rho(X) = \sum_{i \in I} \rho_i\bigl( \| \Pi_X u_i \|^2 \bigr). \tag{4.1}$$

Then, the following holds.

###### Proposition 8.

The function $f_\rho$ defined by (4.1) is a monotone bidirectional DR-submodular function. ∎

If we use this function instead of the standard function (1.2), we can suppress the contributions from very large vectors: if $u_i$ is already well approximated in $X$, there is less incentive to seek a larger subspace for $u_i$ due to the concavity of $\rho_i$. See the experiment in the Appendix.

### 4.2 Sparse dictionary selection

Let $V \subseteq \mathbb{R}^d$ be a set of vectors called a dictionary. We consider the family $\mathcal{L}(V)$ of all subspaces spanned by subsets of $V$, which forms a (not necessarily modular) lattice. The height of $X \in \mathcal{L}(V)$ coincides with the dimension of $X$. Let $\{u_i\}_{i \in I} \subset \mathbb{R}^d$ be the given data. Then the sparse dictionary selection problem is formulated as the maximization of $f$ defined by (1.2) on this lattice under the height constraint.

In general, the function $f$ is not a directional DR-submodular function on this lattice. However, we can prove that $f$ is a downward DR-submodular function with a provable additive gap. We introduce the following definition.

###### Definition 9 (Mutual coherence of lattice).

Let $\mathcal{L}$ be a lattice of subspaces. For $\epsilon \ge 0$, the lattice has mutual coherence $\epsilon$, if for any $X \in \mathcal{L}$, there exists $X' \in \mathcal{L}$ such that $X \wedge X' = \bot$, $X \vee X' = \top$, and for all unit vectors $u \in X$ and $u' \in X'$, $|\langle u, u' \rangle| \le \epsilon$. The infimum of such $\epsilon$ is called the mutual coherence of $\mathcal{L}$, and is denoted by $\mu(\mathcal{L})$.

Our mutual coherence of a lattice is a generalization of the mutual coherence of a set of vectors donoho2003optimally . For a set of unit vectors $V = \{u_1, \dots, u_N\}$, its mutual coherence is defined by $\mu(V) = \max_{i \neq j} |\langle u_i, u_j \rangle|$. The mutual coherence of a set of vectors is extensively used in compressed sensing to prove the uniqueness of the solution in a sparse recovery problem eldar2012compressed . Here, we have the following relation between the mutual coherence of a lattice and that of a set of vectors, which is the reason why we named our quantity mutual coherence.
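The mutual coherence of a finite set of vectors, the largest absolute inner product between two distinct normalized vectors, is straightforward to compute. The sketch below covers only this classical set-of-vectors quantity, not the lattice version of Definition 9; the function name `mutual_coherence` is hypothetical and vectors are assumed to be given as rows.

```python
import numpy as np

def mutual_coherence(vectors):
    """Largest |<u_i, u_j>| over distinct rows of `vectors`, after normalizing
    each row to unit length."""
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    gram = np.abs(v @ v.T)
    np.fill_diagonal(gram, 0.0)   # ignore the trivial pairs i == j
    return float(gram.max())

print(mutual_coherence(np.eye(3)))                    # orthonormal dictionary
print(mutual_coherence([[1.0, 0.0], [1.0, 1.0]]))     # vectors 45 degrees apart
```

An orthonormal dictionary has coherence zero, and the coherence approaches one as two dictionary vectors become nearly parallel.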

###### Lemma 10.

Let be a set of unit vectors whose mutual coherence is . Then, the lattice generated by the vectors has mutual coherence . ∎

This means that if a set of vectors has a small mutual coherence, then the lattice generated by the vectors has a small mutual coherence. Note that the converse does not hold. Consider where , , and for sufficiently small . Then the mutual coherence of the vectors is ; however, the mutual coherence of the lattice generated by is . This shows that the mutual coherence of a lattice is a more robust concept than that of a set of vectors, which is a strong advantage of considering a lattice instead of a set of vectors.

If a lattice has a small mutual coherence, we can prove that the function is a monotone downward DR-submodular function with a small additive gap.

###### Proposition 11.

Let be normalized vectors and be a lattice generated by . Suppose that forms a modular lattice. Let . Then, the function defined in (4.1) is a downward DR-submodular function with additive gap at most where . ∎

### 4.3 Quantum cut

Finally, we present an example of a non-monotone bidirectional DR-submodular function. Let $G = (V, E)$ be a directed graph, and $c \colon E \to \mathbb{R}_{\ge 0}$ be a weight function. The cut function is then defined by $g(S) = \sum_{(i, j) \in E} c(i, j) \, 1_S(i) \, 1_{\bar{S}}(j)$ where $1_S$ is the indicator function of $S$ and $\bar{S}$ is the complement of $S$. This is a non-monotone submodular function. Maximizing the cut function has applications in feature selection problems with diversity lin2009graph .

We extend the cut function to the “quantum” setting. We say that a lattice $\mathcal{L}$ of vector spaces is ortho-complementable if $X \in \mathcal{L}$ implies $\bar{X} \in \mathcal{L}$, where $\bar{X}$ is the orthogonal complement of $X$. Let $\{u_i\}_{i \in V}$ be vectors assigned to the vertices. For an ortho-complementable lattice $\mathcal{L}$, the quantum cut function $f \colon \mathcal{L} \to \mathbb{R}$ is defined by

$$f(X) = \sum_{(i, j) \in E} c(i, j) \, \| \Pi_X(u_i) \|^2 \, \| \Pi_{\bar{X}}(u_j) \|^2. \tag{4.2}$$

If $u_i = e_i$ for all $i \in V$, where $e_i$ is the $i$-th unit vector, and $\mathcal{L}$ is the lattice of axis-parallel subspaces of $\mathbb{R}^V$, function (4.2) coincides with the original cut function. Moreover, it retains submodularity in the following sense.
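The claim that (4.2) reduces to the ordinary cut function on axis-parallel subspaces can be checked numerically. The sketch below is illustrative: edge weights are assumed to be stored in a dense matrix with `weights[i, j]` equal to the weight of edge $(i, j)$ and zero for non-edges, the same vertex vectors are assumed to be used for both endpoints of an edge, and the names `cut_value` and `quantum_cut` are hypothetical.

```python
import numpy as np

def cut_value(weights, subset):
    """Directed cut: total weight of edges leaving `subset`."""
    n = weights.shape[0]
    inside = np.array([1.0 if i in subset else 0.0 for i in range(n)])
    return float(inside @ weights @ (1.0 - inside))

def quantum_cut(weights, vectors, basis):
    """Quantum cut (4.2) for the subspace X spanned by the orthonormal columns
    of `basis`; row i of `vectors` is the vector attached to vertex i."""
    proj = basis @ basis.T                                         # Pi_X
    in_x = np.sum((vectors @ proj) ** 2, axis=1)                   # ||Pi_X u_i||^2
    in_comp = np.sum((vectors - vectors @ proj) ** 2, axis=1)      # ||Pi_Xbar u_i||^2
    return float(in_x @ weights @ in_comp)

rng = np.random.default_rng(0)
weights = rng.uniform(size=(4, 4))
np.fill_diagonal(weights, 0.0)
subset = {0, 2}
axis_basis = np.eye(4)[:, sorted(subset)]   # axis-parallel subspace for `subset`
print(cut_value(weights, subset), quantum_cut(weights, np.eye(4), axis_basis))
```

With unit vectors on the vertices and an axis-parallel subspace, the squared projection norms become the 0/1 indicators of the cut function, so the two printed values agree.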

###### Proposition 12.

The function defined by (4.2) is a bidirectional DR-submodular function. ∎

The quantum cut function could be used for subspace selection problem with diversity. For example, in a natural language processing problem, the words are usually embedded into a latent vector space mikolov2013distributed . Usually, we select a subset of words to summarize documents; however, if we want to select a “meaning”, which is encoded in the vector space as a subspace kim2013deriving , it would be promising to select a subspace. In such an example, the quantum cut function (4.2) can be used to incorporate the diversity represented by the graph of words.

## 5 Algorithms

We provide algorithms for maximizing (1) a monotone downward DR-submodular function under the height constraint, which generalizes the cardinality constraint (Section 5.1), (2) a monotone downward DR-submodular function under a knapsack constraint (Section 5.2), and (3) a non-monotone bidirectional DR-submodular function (Section 5.3). Basically, these algorithms are extensions of the algorithms for the set lattice. This indicates that our definitions of directional DR-submodularities are natural and useful.

Below, we always assume that $f$ is normalized, i.e., $f(\bot) = 0$.

### 5.1 Height constraint

We first consider the height constraint, i.e., $\mathcal{F} = \{X \in \mathcal{L} : h(X) \le k\}$. This coincides with the cardinality constraint if $\mathcal{L}$ is the set lattice $2^V$. In general, this constraint is very difficult to analyze because $h(X \vee a) - h(X)$ can be arbitrarily large. Thus, we assume that the height function is $p$-incremental, i.e., $h(X \vee a) - h(X) \le p$ for all $X \in \mathcal{L}$ and $a \in \mathrm{adm}(X)$. Note that $p = 1$ if and only if $\mathcal{L}$ is modular.

We show that, similarly to the set lattice case, the greedy algorithm (Algorithm 1) admits a provable approximation guarantee for the downward DR-submodular maximization problem under the height constraint.

###### Theorem 13.

Let be a lattice whose height function is -incremental, and be a downward DR-submodular function with additive gap . Then, Algorithm 1 finds -approximate solution of the height constrained monotone submodular maximization problem.³ In particular, on modular lattice with , it gives approximation. ∎

³ Algorithm 1 requires solving the non-convex optimization problem in Step 3. If we can only obtain an -approximate solution in Step 3, the approximation ratio of the algorithm reduces to .
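Algorithm 1 itself is not reproduced in this text. As a rough illustration of a greedy scheme under a height constraint, the sketch below works on the lattice of subspaces spanned by dictionary vectors and repeatedly adds the dictionary vector with the largest marginal gain of objective (1.2). It is an assumption-laden sketch rather than the authors' algorithm: the names `energy` and `greedy_dictionary_selection` are hypothetical, the dictionary is stored column-wise, and data points are rows.

```python
import numpy as np

def energy(cols, data, tol=1e-10):
    """Objective (1.2) for the subspace spanned by the columns of `cols`."""
    if cols.shape[1] == 0:
        return 0.0
    u, s, _ = np.linalg.svd(cols, full_matrices=False)
    q = u[:, s > tol]
    return float(np.sum((data @ q) ** 2))

def greedy_dictionary_selection(dictionary, data, k):
    """Pick at most k dictionary columns, each time the one with the largest
    strictly positive marginal gain; the spanned subspace has height <= k."""
    d, n = dictionary.shape
    chosen = []
    current = np.zeros((d, 0))
    for _ in range(k):
        best_gain, best_j = 0.0, None
        base = energy(current, data)
        for j in range(n):
            if j in chosen:
                continue
            candidate = np.column_stack([current, dictionary[:, j]])
            gain = energy(candidate, data) - base
            if gain > best_gain + 1e-12:
                best_gain, best_j = gain, j
        if best_j is None:        # no vector improves the objective
            break
        chosen.append(best_j)
        current = dictionary[:, chosen]
    return chosen

dictionary = np.eye(3)
data = np.diag([3.0, 2.0, 1.0])
print(greedy_dictionary_selection(dictionary, data, 2))
```

Because a vector with positive gain cannot already lie in the current subspace, each accepted step raises the dimension by exactly one, so the returned subspace satisfies the height constraint.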

### 5.2 Knapsack constraint

Next, we consider the knapsack constrained problem. A knapsack constraint on a lattice is specified by a nonnegative modular function (cost function) $c \colon \mathcal{L} \to \mathbb{R}_{\ge 0}$ and a nonnegative number (budget) $B$ such that the feasible region is given by $\mathcal{F} = \{X \in \mathcal{L} : c(X) \le B\}$.

In general, it is NP-hard to obtain a constant factor approximation for a knapsack constrained problem even for a distributive lattice gottschalk2015submodular . Therefore, we need additional assumptions on the cost function.

We say that a modular function is order-consistent if for all , , , and . The height function of a modular lattice is order-consistent, because for all and ; therefore it generalizes the height function. Moreover, on the set lattice , any modular function is order-consistent because there is no join-irreducible such that holds; therefore it generalizes the standard knapsack constraint on sets.

For a knapsack constraint with an order-consistent nonnegative modular function, we obtain a provable approximation ratio.

###### Theorem 14.

Let be a lattice, be a knapsack constraint where be an order-consistent modular function, , and be a monotone downward DR-submodular function with additive gap . Then, Algorithm 2 gives approximation of the knapsack constrained monotone submodular maximization problem. ∎

### 5.3 Non-monotone unconstrained maximization

Finally, we consider the unconstrained non-monotone maximization problem.

The double greedy algorithm buchbinder2015tight achieves the optimal approximation ratio for the unconstrained non-monotone submodular set function maximization problem. To extend the double greedy algorithm to lattices, we have to assume that the lattice has finite height. This is needed for the algorithm to terminate in finitely many steps. We also assume both downward DR-submodularity and upward DR-submodularity, i.e., bidirectional DR-submodularity. Finally, we assume that the lattice is modular. This is needed to analyze the approximation guarantee.
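For reference, the set-lattice version of double greedy can be sketched as follows. This is the deterministic variant for set functions (which carries a 1/3 guarantee for nonnegative submodular functions, while the randomized variant attains 1/2), not the lattice extension of Algorithm 3, which is not reproduced in this text. The names `directed_cut`, `double_greedy`, and `brute_force_max`, as well as the example graph, are illustrative assumptions.

```python
from itertools import combinations

def directed_cut(edges):
    """Return the cut function of a weighted directed graph given as
    a dict {(i, j): weight}."""
    def f(s):
        return sum(w for (i, j), w in edges.items() if i in s and j not in s)
    return f

def double_greedy(f, ground):
    """Deterministic double greedy: grow X from the empty set and shrink Y
    from the ground set until they meet."""
    x, y = set(), set(ground)
    for e in ground:
        gain_add = f(x | {e}) - f(x)
        gain_remove = f(y - {e}) - f(y)
        if gain_add >= gain_remove:
            x.add(e)
        else:
            y.remove(e)
    return x

def brute_force_max(f, ground):
    """Exact optimum by enumeration, for comparison on tiny instances."""
    return max(f(set(c)) for r in range(len(ground) + 1)
               for c in combinations(ground, r))

edges = {(0, 1): 2.0, (1, 2): 1.0, (2, 0): 3.0, (0, 3): 1.0}
f = directed_cut(edges)
ground = [0, 1, 2, 3]
solution = double_greedy(f, ground)
print(solution, f(solution), brute_force_max(f, ground))
```

The lattice version replaces adding and removing single elements by joins with admissible elements from the bottom and corresponding steps down from the top, which is why finite height is needed for termination.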

###### Theorem 15.

Let be a modular lattice of finite height, , and be non-monotone bidirectional DR-submodular function with additive gap . Then, Algorithm 3 gives approximate solution of the unconstrained non-monotone submodular maximization problem.

## 6 Conclusion

In this paper, we formulated the subspace selection problem as an optimization problem over lattices. By introducing new “DR-submodularities” on lattices, named directional DR-submodularities, we characterized the solvable subspace selection problems in terms of submodularity. In particular, our definitions capture the solvability of principal component analysis and the sparse dictionary selection problem. We proposed algorithms with provable approximation guarantees for directional DR-submodular functions under several constraints.

There are several interesting future directions. Developing an algorithm for the matroid constraint over lattices is important since it is a fundamental constraint in the submodular set function maximization problem. Related to this direction, extending the continuous relaxation type algorithms to lattices is very interesting. Such algorithms have been used to obtain the optimal approximation factors for the matroid constrained submodular set function maximization problem.

It is also an interesting direction to look for machine learning applications of the directional DR-submodular maximization other than the subspace selection problem. The possible candidates include the subgroup selection problem and the subpartition selection problem.

## References

• [1] Hervé Abdi and Lynne J Williams. Principal component analysis. Wiley interdisciplinary reviews: computational statistics, 2(4):433–459, 2010.
• [2] Andrew An Bian, Joachim M Buhmann, Andreas Krause, and Sebastian Tschiatschek. Guarantees for greedy maximization of non-submodular functions with applications. In International Conference on Machine Learning (ICML’17), 2017.
• [3] Andrew An Bian, Baharan Mirzasoleiman, Joachim Buhmann, and Andreas Krause. Guaranteed non-convex optimization: Submodular maximization over continuous domains. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS’17), pages 111–120, 2017.
• [4] Niv Buchbinder, Moran Feldman, Joseph (Seffi) Naor, and Roy Schwartz. A tight linear time (1/2)-approximation for unconstrained submodular maximization. SIAM Journal on Computing, 44(5):1384–1402, 2015.
• [5] Abhimanyu Das and David Kempe. Algorithms for subset selection in linear regression. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC’08), pages 45–54, 2008.
• [6] Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. Proceedings of the 28th International Conference on Machine Learning (ICML’11), pages 1057–1064, 2011.
• [7] David L Donoho and Michael Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003.
• [8] Yonina C Eldar and Gitta Kutyniok. Compressed sensing: theory and applications. Cambridge University Press, 2012.
• [9] Ethan R Elenberg, Rajiv Khanna, Alexandros G Dimakis, and Sahand Negahban. Restricted strong convexity implies weak submodularity. arXiv preprint arXiv:1612.00804, 2016.
• [10] Anna C Gilbert, S Muthukrishnan, and Martin J Strauss. Approximation of functions over redundant dictionaries using coherence. In Proceedings of the 14th ACM-SIAM Symposium on Discrete algorithms (SODA’03), pages 243–252, 2003.
• [11] Corinna Gottschalk and Britta Peis. Submodular function maximization over distributive and integer lattices. arXiv preprint arXiv:1505.05423, 2015.
• [12] George Grätzer. General lattice theory. Springer Science & Business Media, 2002.
• [13] Rajiv Khanna, Ethan R. Elenberg, Alexandros G. Dimakis, Joydeep Ghosh, and Sahand Negahban. On approximation guarantees for greedy low rank optimization. In Proceedings of the 34th International Conference on Machine Learning (ICML’17), pages 1837–1846, 2017.
• [14] Joo-Kyung Kim and Marie-Catherine de Marneffe. Deriving adjectival scales from continuous space word representations. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP’13), pages 1625–1630, 2013.
• [15] Andreas Krause and Volkan Cevher. Submodular dictionary selection for sparse representation. In Proceedings of the 27th International Conference on Machine Learning (ICML’10), pages 567–574, 2010.
• [16] Hui Lin, Jeff Bilmes, and Shasha Xie. Graph-based submodular selection for extractive summarization. In Proceedings of the IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU’09), pages 381–386. IEEE, 2009.
• [17] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (NIPS’13), pages 3111–3119, 2013.
• [18] Balas Kausik Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
• [19] Tasuku Soma and Yuichi Yoshida. A generalization of submodular cover via the diminishing return property on the integer lattice. In Advances in Neural Information Processing Systems (NIPS’15), pages 847–855, 2015.
• [20] Tasuku Soma and Yuichi Yoshida. Non-monotone dr-submodular function maximization. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’17), volume 17, pages 898–904, 2017.
• [21] Gilbert Strang. Introduction to linear algebra, volume 3. Wellesley-Cambridge Press, Wellesley, MA, 1993.
• [22] Donald M Topkis. Minimizing a submodular function on a lattice. Operations Research, 26(2):305–321, 1978.
• [23] Joel A Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information theory, 50(10):2231–2242, 2004.
• [24] Joel A Tropp, Anna C Gilbert, Sambavi Muthukrishnan, and Martin J Strauss. Improved sparse approximation over quasiincoherent dictionaries. In Proceedings of the International Conference on Image Processing (ICIP’03), volume 1, pages I–37. IEEE, 2003.

## Appendix A Proofs

In this section, we provide proofs omitted in the main body.

###### Proof of Proposition 6.

We use Birkhoff’s representation theorem for distributive lattices. A set $I \subseteq J(\mathcal{L})$ is a lower set if $a \in I$ implies $a' \in I$ for all $a' \in J(\mathcal{L})$ with $a' \le a$. The lower sets form a lattice under the inclusion order. We call this lattice the lower set lattice of $J(\mathcal{L})$.

###### Theorem 16 (Birkhoff’s representation theorem; see [12]).

Any finite distributive lattice $\mathcal{L}$ is isomorphic to the lower set lattice of $J(\mathcal{L})$. The isomorphism is given by $X \mapsto \{a \in J(\mathcal{L}) : a \le X\}$. ∎

This theorem implies that, for any , the corresponding lower set of is uniquely determined. Therefore, for any , we have for all .

(Downward Strong) By Birkhoff’s representation theorem, we have . Thus the replaced maximum in (3.3) coincides with the minimum.

(Upward Strong) By Birkhoff’s representation theorem, for any and with , the element such that is uniquely determined (i.e., represent as a lower set of and remove from the lower set). Thus, the replaced minimum in (3.5) coincides with the maximum. ∎
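The lower-set construction used in this proof can be illustrated on a toy example. The following is a minimal sketch, not part of the original argument: the four-element poset `below` and all helper names are hypothetical. It enumerates the lower sets, checks that they are closed under union and intersection, and counts the join-irreducible lower sets, which correspond one-to-one to the poset elements, as Birkhoff's theorem predicts.

```python
from itertools import combinations

# Hypothetical toy poset on {0, 1, 2, 3}: below[x] holds every element strictly below x.
below = {0: set(), 1: set(), 2: {0, 1}, 3: {1}}

def is_lower_set(s):
    """A set is a lower set if it contains everything below each of its members."""
    return all(below[x] <= s for x in s)

def lower_sets(elements):
    """Enumerate all lower sets by brute force (only sensible for tiny posets)."""
    elems = sorted(elements)
    found = []
    for r in range(len(elems) + 1):
        for combo in combinations(elems, r):
            s = frozenset(combo)
            if is_lower_set(s):
                found.append(s)
    return found

lattice = lower_sets(below)

# Join is union and meet is intersection; both stay inside the family of lower sets.
closed = all((a | b) in lattice and (a & b) in lattice
             for a in lattice for b in lattice)

def is_join_irreducible(s):
    """Non-empty lower set that is not the union of two strictly smaller lower sets."""
    smaller = [t for t in lattice if t < s]
    return bool(s) and all((a | b) != s for a in smaller for b in smaller)

irreducibles = [s for s in lattice if is_join_irreducible(s)]
```

In this example the join-irreducible lower sets are exactly the principal ones (the set of all elements below or equal to a given element), so recovering the poset from the lattice amounts to listing them.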

###### Proofs of Propositions 7, 8.

The downward DR-submodularity follows from Proposition 11, which is proved below, since the mutual coherence of is zero. Thus, we here prove the upward DR-submodularity. To simplify the notation, we prove the case that . Extension to the general case is easy.

Let and with . Since the height of join-irreducible elements is one in the vector lattice, the outer max in (3.5) can be ignored. Let , where is the orthogonal complement of . By the modularity of the height, is a 1-dimensional subspace. In particular, it is join-irreducible. Notice that . Since , we have

$$f(X \vee a^{\perp}) - f(X) = \rho\left(\|\Pi_X v\|^2 + \langle a', v\rangle^2\right) - \rho\left(\|\Pi_X v\|^2\right). \tag{A.1}$$

Here, we identify a 1-dimensional subspace with a unit vector in it. Let . By using the modularity of the height again, we have . Since , we have

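$$f(Y) - f(\mathring{Y}) = \rho\left(\|\Pi_{\mathring{Y}} v\|^2 + \langle a', v\rangle^2\right) - \rho\left(\|\Pi_{\mathring{Y}} v\|^2\right). \tag{A.2}$$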

By the concavity of and , we obtain

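$$\rho\left(\|\Pi_X v\|^2 + \langle a', v\rangle^2\right) - \rho\left(\|\Pi_X v\|^2\right) \ge \rho\left(\|\Pi_{\mathring{Y}} v\|^2 + \langle a', v\rangle^2\right) - \rho\left(\|\Pi_{\mathring{Y}} v\|^2\right). \tag{A.3}$$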

This shows the upward DR-submodularity. ∎
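The mechanism behind (A.1) to (A.3) can be checked numerically. The sketch below is a simplified illustration under stated assumptions, not the construction in the proof: it takes `rho = math.sqrt` as a hypothetical concave, monotone choice, uses a unit vector `a` that is orthogonal to both subspaces (so the same vector plays the role of the added direction on both sides), and uses the name `Y_ring` for the larger subspace. It verifies that adding an orthogonal unit direction increases the squared projection norm by exactly the squared inner product, and that the resulting gain is larger on the smaller subspace.

```python
import math

def dot(u, w):
    return sum(x * y for x, y in zip(u, w))

def orthonormalize(vectors):
    """Gram-Schmidt: return an orthonormal basis of the span of `vectors`."""
    basis = []
    for vec in vectors:
        r = list(vec)
        for q in basis:
            c = dot(r, q)
            r = [x - c * y for x, y in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > 1e-12:  # skip vectors already in the span
            basis.append([x / norm for x in r])
    return basis

def proj_sq_norm(vectors, v):
    """Squared norm of the orthogonal projection of v onto span(vectors)."""
    return sum(dot(q, v) ** 2 for q in orthonormalize(vectors))

rho = math.sqrt  # hypothetical concave, monotone rho (an assumption for this sketch)

v = [1.0, 2.0, 2.0, 1.0]
X = [[1.0, 0.0, 0.0, 0.0]]                             # smaller subspace
Y_ring = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # larger subspace containing X
a = [0.0, 0.0, 1.0, 0.0]                               # unit vector orthogonal to both

# Marginal gains of adding the direction a to each subspace.
gain_X = rho(proj_sq_norm(X + [a], v)) - rho(proj_sq_norm(X, v))
gain_Y = rho(proj_sq_norm(Y_ring + [a], v)) - rho(proj_sq_norm(Y_ring, v))
```

Because the projection norm onto the larger subspace is at least that onto the smaller one, concavity of `rho` makes the same additive increment worth less on the larger subspace, which is the content of (A.3).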

###### Proof of Proposition 11.

To simplify the notation, we prove the case that . Extension to the general case is easy.

Let . Since the join-irreducible elements have height one in this lattice, the additive gap is given by

Let and arbitrary. By the definition of mutual coherence, there exists that has low coherence with . Let . Then, by the modularity of the height function, we have and it is a join-irreducible element. Since , we have . By comparing the height of and , we have .

We use at the RHS and evaluate

$$\left(f(Y \vee b) - f(Y)\right) - \left(f(X \vee b') - f(X)\right). \tag{A.5}$$

Let be a unit vector in orthogonal to . Note that may not be an element of .

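\begin{align}
f(Y \vee b^{\perp}) - f(Y) &= \rho\left(\|\Pi_Y v\|^2 + \langle b^{\perp}, v\rangle^2\right) - \rho\left(\|\Pi_Y v\|^2\right) \tag{A.6} \\
&\le \rho\left(\|\Pi_X v\|^2 + \langle b^{\perp}, v\rangle^2\right) - \rho\left(\|\Pi_X v\|^2\right). \tag{A.7}
\end{align}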

where the inequality follows from the concavity of with the monotonicity of the mapping . Thus,

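$$\delta \le \rho\left(\|\Pi_X v\|^2 + \langle b^{\perp}, v\rangle^2\right) - \rho\left(\|\Pi_X v\|^2 + \langle b'', v\rangle^2\right). \tag{A.8}$$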

where is the unit vector proportional to . If then, by the monotonicity of , we have . Therefore, we only have to consider the reverse case. In that case, by the concavity, we have

$$\delta \le \rho'(0)\left(\langle b^{\perp}, v\rangle^2 - \langle b'', v\rangle^2\right). \tag{A.9}$$

Here, is the derivative of at .

Let us denote where is a unit vector in orthogonal to . Then, by the definition of the mutual coherence, we have . Also, we have . By the construction, we have where . Thus, we have

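\begin{align}
\left(\alpha^2 + \beta^2\|\tilde{t}\|^2\right)\langle b'', v\rangle^2 &= \left(\alpha\langle b^{\perp}, v\rangle + \beta\langle \tilde{t}, v\rangle\right)^2 \tag{A.10} \\
&= \alpha^2\langle b^{\perp}, v\rangle^2 + 2\alpha\beta\langle b^{\perp}, v\rangle\langle \tilde{t}, v\rangle + \beta^2\langle \tilde{t}, v\rangle^2. \tag{A.11}
\end{align}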

Therefore, by using , we have

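\begin{align}
\langle b^{\perp}, v\rangle^2 - \langle b'', v\rangle^2 &\le \frac{2\alpha\beta\langle b^{\perp}, v\rangle\langle \tilde{t}, v\rangle + \beta^2\left(\|\tilde{t}\|^2\langle b'', v\rangle^2 - \langle \tilde{t}, v\rangle^2\right)}{\alpha^2} \tag{A.12} \\
&\le \frac{2\alpha\beta + \beta^2}{\alpha^2}\|v\|^2 \le \frac{2\epsilon\sqrt{1-\epsilon^2} + \epsilon^2}{1-\epsilon^2}\|v\|^2 \le \frac{3\epsilon}{1-\epsilon^2}\|v\|^2. \tag{A.13}
\end{align}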
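The last inequality in (A.13) can be checked directly. A short derivation, assuming $0 \le \epsilon < 1$ (this range is an assumption made here so that the denominators are positive; it is not stated in the surviving text): since $\sqrt{1-\epsilon^2} \le 1$ and $\epsilon^2 \le \epsilon$ on this range,

$$2\epsilon\sqrt{1-\epsilon^2} + \epsilon^2 \le 2\epsilon + \epsilon = 3\epsilon, \qquad\text{hence}\qquad \frac{2\epsilon\sqrt{1-\epsilon^2} + \epsilon^2}{1-\epsilon^2}\|v\|^2 \le \frac{3\epsilon}{1-\epsilon^2}\|v\|^2.$$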

###### Proof of Lemma 10.

Suppose that has dimension . Let . Then, there exist such that any vector is represented by a linear combination of them. We construct by selecting maximally independent vectors to and let , where . By the dimension theorem for vector spaces and the fact that is a subspace of the intersection of and , we have . Here, the left-hand side is and the right-hand side is . Therefore, . This shows .

We check the condition of the mutual coherence. Let and be normalized vectors in and . Then we have

$$\langle u, u'\rangle = \alpha^{\top} M \beta, \tag{A.14}$$

where , , and . Here, . Therefore, it suffices to show that and are small. Since is normalized, we have

$$1 = \|u\|^2 = \alpha^{\top} G \alpha \ge \|\alpha\|^2 \lambda_{\min}(G) \tag{A.15}$$

where and $\lambda_{\min}(G)$ is the smallest eigenvalue of $G$. Since the diagonal elements of $G$ are one, and the absolute values of the off-diagonal elements are at most , the Gerschgorin circle theorem [21] implies that . Therefore, . Similarly, . Therefore, . ∎
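The eigenvalue step can be illustrated with a small numerical sketch. It assumes the standard consequence of the Gerschgorin circle theorem for a symmetric $d \times d$ matrix with unit diagonal and off-diagonal entries bounded by `eps` in absolute value, namely a lower bound of `1 - (d - 1) * eps` on the smallest eigenvalue. The exact constant used in the proof is not recoverable from the text here, and the matrix `G`, the dimension `d`, and the value of `eps` below are hypothetical.

```python
import random

def quadratic_form(G, alpha):
    """Compute alpha^T G alpha for a square matrix G given as nested lists."""
    n = len(alpha)
    return sum(alpha[i] * G[i][j] * alpha[j] for i in range(n) for j in range(n))

def gershgorin_lower_bound(G):
    """Smallest left endpoint of the Gerschgorin discs of a real symmetric matrix."""
    n = len(G)
    return min(G[i][i] - sum(abs(G[i][j]) for j in range(n) if j != i)
               for i in range(n))

rng = random.Random(0)
d, eps = 4, 0.1  # hypothetical dimension and coherence level

# Symmetric matrix with unit diagonal and off-diagonal entries in [-eps, eps].
G = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
for i in range(d):
    for j in range(i + 1, d):
        G[i][j] = G[j][i] = rng.uniform(-eps, eps)

bound = gershgorin_lower_bound(G)  # lower bound on the smallest eigenvalue of G

# For symmetric G: alpha^T G alpha >= lambda_min(G) * ||alpha||^2 >= bound * ||alpha||^2.
holds = True
for _ in range(1000):
    alpha = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    sq_norm = sum(x * x for x in alpha)
    holds = holds and quadratic_form(G, alpha) >= bound * sq_norm - 1e-12
```

Combined with (A.15), a positive lower bound on the smallest eigenvalue gives an upper bound on the norm of the coefficient vector, which is how the coefficients are shown to be small.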

###### Proof of Proposition 12.

We first check the downward DR-submodularity. Take arbitrary subspaces and with and . Without loss of generality, we can suppose . To simplify the notation, we use the same symbol to represent the unit vector in the subspace . By a direct calculation,

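$$f(Y \vee b) = \sum_{(i,j) \in E} c(i,j)\left(\|\Pi_Y v_i\|^2 + \langle b, v_i\rangle^2\right)\left(\|\Pi_{\bar{Y}} v_j\|^2 - \langle b, v_j\rangle^2\right). \tag{A.16}$$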

Hence,

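$$f(Y \vee b) - f(Y) = \sum_{(i,j) \in E} c(i,j)\left(\|\Pi_{\bar{Y}} v_j\|^2\langle b, v_i\rangle^2 - \|\Pi_Y v_i\|^2\langle b, v_j\rangle^2 - \langle b, v_i\rangle^2\langle b, v_j\rangle^2\right). \tag{A.17}$$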

Since , we have

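$$f(X \vee b) - f(X) = \sum_{(i,j) \in E} c(i,j)\left(\|\Pi_{\bar{X}} v_j\|^2\langle b, v_i\rangle^2 - \|\Pi_X v_i\|^2\langle b, v_j\rangle^2 - \langle b, v_i\rangle^2\langle b, v_j\rangle^2\right). \tag{A.18}$$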

Since , we have and . Hence, each summand in (A.17) is at most the corresponding summand in (A.18). This shows the downward DR-submodularity.
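The comparison of summands can be spelled out as follows. This is a sketch with shorthand introduced only for this illustration: for a fixed pair $(i,j)$ and a subspace $Z$, write $A_Z = \|\Pi_Z v_i\|^2$, $B_Z = \|\Pi_{\bar{Z}} v_j\|^2$, $x = \langle b, v_i\rangle^2$, and $y = \langle b, v_j\rangle^2$. It assumes that $\bar{Z}$ denotes the orthogonal complement of $Z$, that $c(i,j) \ge 0$, and that $f(Z) = \sum_{(i,j) \in E} c(i,j)\, A_Z B_Z$, as the form of (A.16) suggests. Expanding the product in (A.16) gives

$$(A_Z + x)(B_Z - y) - A_Z B_Z = B_Z\, x - A_Z\, y - x y.$$

Since $x, y \ge 0$, the right-hand side is nondecreasing in $B_Z$ and nonincreasing in $A_Z$. Passing from $X$ to a larger subspace $Y$ can only increase $A_Z$ and decrease $B_Z$, so the summand for $Y$ in (A.17) is at most the summand for $X$ in (A.18).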

Next, we check the upward DR-submodularity. Take arbitrary subspaces and a vector with . Let . To simplify the notation, we use the same symbol to represent the unit vector in the subspace . Notice that . Then, we can show the following equalities by the same argument as the downward DR-submodular case.

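$$f(X \vee a) - f(X) = \sum_{(i,j) \in E} c(i,j)\left(\|\Pi_{\bar{X}} v_j\|^2\langle a, v_i\rangle^2 - \|\Pi_X v_i\|^2\langle a, v_j\rangle^2 - \langle a, v_i\rangle^2\langle a, v_j\rangle^2\right), \tag{A.19}$$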
 f(Y)−f(˚Y)=∑(i,j)∈E(