1 Introduction
A good distance metric is often the key to an effective machine learning algorithm. For instance, when clustering, the distance metric largely defines which points end up in which clusters. Similarly, in large-margin learning, the distance between different labelings can contribute as much to the definition of the margin as the objective function itself. Likewise, when constructing diverse $k$-best lists, the measure of diversity is key to ensuring meaningful differences between list elements.

We consider distance metrics over binary vectors, $x \in \{0,1\}^n$. If we define the set $V = \{1, \dots, n\}$, then each $x$ can be seen as the characteristic vector of a set $A_x \subseteq V$, where $i \in A_x$ if $x_i = 1$, and $i \notin A_x$ otherwise. For sets $A, B \subseteq V$, with $\triangle$ representing the symmetric difference, $A \triangle B = (A \setminus B) \cup (B \setminus A)$, the Hamming distance is then:
$\mathrm{dist}_H(x, y) = |A_x \triangle A_y|$ (1)
A Hamming distance between two vectors assumes that each entry difference contributes value one. A weighted Hamming distance generalizes this slightly, allowing each entry a unique weight. Mahalanobis distance generalizes further, allowing weighted pairwise interactions of the following form:

$\mathrm{dist}_W(x, y) = (x - y)^\top W (x - y)$ (2)
When $W$ is a positive semidefinite matrix, this type of distance is a metric. For many practical applications, however, it is desirable to have entries interact with each other in more complex and higher-order ways than Hamming or Mahalanobis allow. Yet, arbitrary interactions would result in non-metric functions whose optimization would be intractable. In this work, therefore, we consider an alternative class of functions that goes beyond pairwise interactions, yet is computationally feasible, is natural for many applications, and preserves metricity.
Given a set function $f : 2^V \to \mathbb{R}$, we can define a distortion between two binary vectors as follows: $\mathrm{dist}_f(x, y) = f(A_x \triangle A_y)$. By asking $f$ to satisfy certain properties, we will arrive at a class of discrete metrics that is feasible to optimize and preserves metricity. We say that $f$ is positive if $f(A) > 0$ whenever $A \neq \emptyset$; $f$ is normalized if $f(\emptyset) = 0$; $f$ is monotone if $f(A) \leq f(B)$ for all $A \subseteq B \subseteq V$; $f$ is subadditive if $f(A) + f(B) \geq f(A \cup B)$ for all $A, B \subseteq V$; $f$ is modular if $f(A) + f(B) = f(A \cup B) + f(A \cap B)$ for all $A, B \subseteq V$; and $f$ is submodular if $f(A) + f(B) \geq f(A \cup B) + f(A \cap B)$ for all $A, B \subseteq V$. If we assume that $f$ is positive, normalized, monotone, and subadditive, then $\mathrm{dist}_f$ is a metric (see Theorem 3.1), but without useful computational properties. If $f$ is positive, normalized, monotone, and modular, then we recover the weighted Hamming distance. In this paper, we assume that $f$ is positive, normalized, monotone, and submodular (and hence also subadditive). These conditions are sufficient to ensure the metricity of $\mathrm{dist}_f$, but allow for a significant generalization over the weighted Hamming distance. Also, thanks to the properties of submodularity, this class yields efficient optimization algorithms with guarantees for practical machine learning problems. In what follows, we will refer to normalized monotone submodular functions as polymatroid functions; all of our results will be concerned with positive polymatroids. We note here that despite the restrictions described above, the polymatroid class is in fact quite broad; it contains a number of natural choices of diversity and coverage functions, such as set cover, facility location, saturated coverage, and concave-over-modular functions.
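To make these definitions concrete, the following sketch brute-force checks the positive-polymatroid properties of a set function over a small ground set (the function names and the two test functions are ours, for illustration only):

```python
from itertools import chain, combinations

def subsets(V):
    """All subsets of the ground set V, as frozensets."""
    return [frozenset(c) for c in chain.from_iterable(
        combinations(sorted(V), r) for r in range(len(V) + 1))]

def is_positive_polymatroid(f, V):
    """Check: f(empty) = 0, f(A) > 0 for nonempty A, monotone, submodular."""
    S = subsets(V)
    if f(frozenset()) != 0:
        return False  # not normalized
    if any(f(A) <= 0 for A in S if A):
        return False  # not positive
    if any(f(A) > f(B) for A in S for B in S if A <= B):
        return False  # not monotone
    # Submodularity: f(A) + f(B) >= f(A | B) + f(A & B) for all A, B.
    if any(f(A) + f(B) < f(A | B) + f(A & B) for A in S for B in S):
        return False
    return True

V = {1, 2, 3}
f = lambda A: min(len(A), 2)          # truncated cardinality: a positive polymatroid
g = lambda A: len(A) ** 2             # supermodular, hence not a polymatroid
print(is_positive_polymatroid(f, V))  # True
print(is_positive_polymatroid(g, V))  # False
```

Truncated cardinality is a concave function of $|A|$ and hence submodular, while $|A|^2$ is convex in $|A|$ and fails the diminishing-returns check.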
Given a positive polymatroid function $f$, we refer to $\mathrm{dist}_f(x, y) = f(A_x \triangle A_y)$ as a submodular Hamming (SH) distance. We study two optimization problems involving these metrics (each $f_i$ is a positive polymatroid, each $B_i \subseteq V$, and $\mathcal{C}$ denotes a combinatorial constraint):

SHmin: $\min_{A \in \mathcal{C}} \sum_{i=1}^m f_i(A \triangle B_i)$, SHmax: $\max_{A \in \mathcal{C}} \sum_{i=1}^m f_i(A \triangle B_i)$ (3)

We will use $\mathcal{B}$ as shorthand for the sequence $B_1, \dots, B_m$, $\mathcal{F}$ for the sequence $f_1, \dots, f_m$, and $F$ for the objective function $F(A) = \sum_{i=1}^m f_i(A \triangle B_i)$. We will also make a distinction between the homogeneous case where all $f_i$ are the same function, and the more general heterogeneous case where each $f_i$ may be distinct. In terms of constraints, in this paper's theory we consider only the unconstrained ($\mathcal{C} = 2^V$) and the cardinality-constrained (e.g., a lower or upper bound on $|A|$) settings. In general though, $\mathcal{C}$ could express more complex concepts such as knapsack constraints, or that solutions must be an independent set of a matroid, or a cut (or spanning tree, path, or matching) in a graph.
Intuitively, the SHmin problem can be thought of as a centroid-finding problem; the minimizing $A$ should be as similar to the $B_i$'s as possible, since a penalty of $f_i(A \triangle B_i)$ is paid for each difference. Analogously, the SHmax problem can be thought of as a diversification problem; the maximizing $A$ should be as distinct from all $B_i$'s as possible, as $f_i(A \triangle B_i)$ is awarded for each difference. Given modular $f_i$ (the weighted Hamming distance case), these optimization problems can be solved exactly and efficiently for many constraint types. For the more general case of submodular $f_i$, we establish several hardness results and offer new approximation algorithms, as summarized in Tables 1 and 2. Our main contribution is to provide (to our knowledge) the first systematic study of the properties of submodular Hamming (SH) metrics, by showing metricity, describing potential machine learning applications, and providing optimization algorithms for SHmin and SHmax.
[Table 1: Hardness results for SHmin and SHmax, in the homogeneous and heterogeneous cases, under the unconstrained (UC) and cardinality-constrained (Card) settings; the homogeneous unconstrained SHmin case remains open.]
[Table 2: Approximation guarantees of the UnionSplit, BestB, MajorMin, and RandSet algorithms for SHmin and SHmax in the UC and Card settings.]
The outline of this paper is as follows. In Section 2, we offer further motivation by describing several applications of SHmin and SHmax to machine learning. In Section 3, we prove that for a positive polymatroid function $f$, the distance $\mathrm{dist}_f$ is a metric. Then, in Sections 4 and 5 we give hardness results and approximation algorithms, and in Section 6 we demonstrate the practical advantage that submodular metrics have over modular metrics for several real-world applications.
2 Applications
We motivate SHmin and SHmax by showing how they occur naturally in several applications.
Clustering: Many clustering algorithms, including for example $k$-means [1], use distance functions in their optimization. If each item $i$ to be clustered is represented by a binary feature vector $x_i \in \{0,1\}^n$, then counting the disagreements between $x_i$ and $x_j$ is one natural distance function. Defining the sets $A_i$ as the characteristic sets of the $x_i$, this count is equivalent to the Hamming distance $|A_i \triangle A_j|$. Consider a document clustering application where $V$ is the set of all features (e.g., $n$-grams) and $A_i$ is the set of features for document $i$. Hamming distance has the same value both when the two differing features represent two distinct concepts (e.g., ``submodular'' and ``synapse'') and when they represent closely related terms from a single concept. Intuitively, however, a smaller distance seems warranted in the latter case since the difference is only in one rather than two distinct concepts. The submodular Hamming distances we propose in this work can easily capture this type of behavior. Given feature clusters $W_1, \dots, W_L$ that partition $V$, one can define a submodular function as:

$f(Y) = \sum_{\ell=1}^{L} \sqrt{|Y \cap W_\ell|}$ (4)

Applying this with $\mathrm{dist}_f(x_i, x_j) = f(A_i \triangle A_j)$, if the documents' differences are confined to one cluster, the distance is smaller than if the differences occur across several word clusters. In the case discussed above, the distances are $\sqrt{2}$ and $2$. If this submodular Hamming distance is used for $k$-means clustering, then the mean-finding step becomes an instance of the SHmin problem. That is, if cluster $j$ contains documents $\{A_i\}_{i \in C_j}$, then its mean takes exactly the following SHmin form:
$\min_{A \subseteq V} \sum_{i \in C_j} f(A \triangle A_i)$ (5)
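A minimal sketch of this distance, using the square-root-over-clusters $f$ from Equation (4); the word clusters and documents here are invented for illustration:

```python
import math

# Hypothetical feature clusters W_1, W_2 (a partial partition of the vocabulary).
clusters = [{"submodular", "subadditive"}, {"synapse", "neuron"}]

def f(Y):
    """Concave (square-root) over cluster intersections, as in Eq. (4)."""
    return sum(math.sqrt(len(Y & W)) for W in clusters)

def sh_dist(A, B):
    """Submodular Hamming distance f(A symmetric-difference B)."""
    return f(A ^ B)

X  = {"submodular", "synapse"}
Y1 = {"subadditive", "synapse"}                          # differs from X in 2 words, same cluster
Y2 = {"submodular", "synapse", "subadditive", "neuron"}  # differs in 2 words, different clusters

# Plain Hamming distance is 2 in both cases, but the SH distance is smaller
# when the differences stay within a single concept cluster:
print(sh_dist(X, Y1))  # sqrt(2) ~ 1.414
print(sh_dist(X, Y2))  # 1 + 1 = 2.0
```

The symmetric difference $X \triangle Y_1$ lies entirely in the first cluster, giving $\sqrt{2}$, whereas $X \triangle Y_2$ hits both clusters, giving $2$.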
Structured prediction: Structured support vector machines (SVMs) typically rely on Hamming distance to compare candidate structures to the true one. The margin required between the correct structure score and a candidate score is then proportional to their Hamming distance. Consider the problem of segmenting an image into foreground and background. Let $B$ be an image's true set of foreground pixels. Then the Hamming distance between $B$ and a candidate segmentation with foreground pixels $A$ counts the number of mislabeled pixels. However, both [2] and [3] observe poor performance with Hamming distance, and recent work by [4] shows improved performance with richer distances that are supermodular functions of $A$. One potential direction for further enriching image segmentation distance functions is thus to consider non-modular functions from within our submodular Hamming metrics class. These functions have the ability to correct for the over-penalization that the current distance functions may suffer from when the same kind of difference happens repeatedly. For instance, if $A$ differs from $B$ only in the pixels local to a particular block of the image, then current distance functions could be seen as overestimating the difference. Using a submodular Hamming function, the ``loss-augmented inference'' step in SVM optimization becomes an SHmax problem. More concretely, if the segmentation model is defined by a submodular graph cut, then loss-augmented inference amounts to maximizing the sum of the model score and the submodular Hamming loss. In fact, [5] observes superior results with this type of loss-augmented inference, using a special case of a submodular Hamming metric, for the task of multi-label image classification.

Diverse $k$-best: For some machine learning tasks, rather than finding a model's single highest-scoring prediction, it is helpful to find a diverse set of high-quality predictions. For instance, [6] showed that for image segmentation and pose tracking a diverse set of $k$ solutions tended to contain a better predictor than the top $k$ highest-scoring solutions. Additionally, finding diverse solutions can be beneficial for accommodating user interaction. For example, consider the task of selecting a set of photos to summarize the photos that a person took while on vacation. If the model's best prediction (a set of images) is rejected by the user, then the system should probably present a substantially different prediction on its second try. Submodular functions are a natural model for several summarization problems [7, 8]. Thus, given a submodular summarization model $g$, and a set of $t-1$ existing diverse summaries $A_1, \dots, A_{t-1}$, one could find a $t$-th summary to present to the user by solving:

$\max_{A \in \mathcal{C}}\; g(A) + \sum_{i=1}^{t-1} f(A \triangle A_i)$ (6)
If $g$ and $f$ are both positive polymatroids, then this constitutes an instance of the SHmax problem.
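A greedy sketch of this diversification step; the summarization model $g$, the diversity term $f$, and the instance below are invented, and plain greedy is only a heuristic here, since the SHmax objective is non-monotone in $A$:

```python
import math

V = set(range(8))                  # hypothetical ground set of photos
prev = [{0, 1, 2}, {0, 3, 4}]      # existing summaries A_1, A_2

def g(A):                          # toy submodular summary-quality model
    return math.sqrt(len(A))

def f(Y):                          # toy polymatroid diversity term
    return math.sqrt(len(Y))

def objective(A):
    """Eq. (6): model quality plus distance to each previous summary."""
    return g(A) + sum(f(A ^ Ai) for Ai in prev)

def greedy_summary(k):
    """Greedily grow A while some element improves the objective, up to size k."""
    A = set()
    while len(A) < k:
        best = max(sorted(V - A), key=lambda v: objective(A | {v}))
        if objective(A | {best}) <= objective(A):
            break                  # no improving element remains
        A.add(best)
    return A

A = greedy_summary(3)
print(A, round(objective(A), 3))
```

On this toy instance greedy favors photos absent from both previous summaries, since they increase every symmetric difference at once.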
3 Properties of the submodular Hamming metric
We next show several interesting properties of the submodular Hamming distance. Proofs for all theorems and lemmas can be found in the supplementary material. We begin by showing that $\mathrm{dist}_f$ is a metric for any positive polymatroid function $f$. In fact, we show the more general result that $f(A_x \triangle A_y)$ is a metric for any positive normalized monotone subadditive function $f$. This result is known (see for instance Chapter 8 of [9]), but we provide a proof (in the supplementary material) for completeness.
Theorem 3.1.
Let $f$ be a positive normalized monotone subadditive function. Then $\mathrm{dist}_f(x, y) = f(A_x \triangle A_y)$ is a metric on $\{0,1\}^n$.
Proof.
Let $x, y, z \in \{0,1\}^n$ be arbitrary. We check each of the four properties of metrics:

Proof of non-negativity: $\mathrm{dist}_f(x, y) = f(A_x \triangle A_y) \geq 0$ because $f$ is normalized and positive.

Proof of identity of indiscernibles: $\mathrm{dist}_f(x, y) = 0 \Leftrightarrow f(A_x \triangle A_y) = 0 \Leftrightarrow A_x \triangle A_y = \emptyset \Leftrightarrow A_x = A_y \Leftrightarrow x = y$. The second equivalence follows because of normalization and positivity of $f$, and the third follows from the definition of $\triangle$.

Proof of symmetry: $\mathrm{dist}_f(x, y) = f(A_x \triangle A_y) = f(A_y \triangle A_x) = \mathrm{dist}_f(y, x)$, by definition of $\triangle$.

Proof of the triangle inequality: First, note that $A_x \triangle A_z \subseteq (A_x \triangle A_y) \cup (A_y \triangle A_z)$. This follows because each element of $A_x \setminus A_z$ is either in $A_x \setminus A_y$ (true if it is not in $A_y$) or in $A_y \setminus A_z$ (true if it is in $A_y$). Similarly, each element of $A_z \setminus A_x$ is either in $A_z \setminus A_y$ or in $A_y \setminus A_x$. Then, because $f$ is monotone and subadditive, we have:

$f(A_x \triangle A_z) \leq f\big((A_x \triangle A_y) \cup (A_y \triangle A_z)\big) \leq f(A_x \triangle A_y) + f(A_y \triangle A_z)$ (7)
∎
While these subadditive functions are metrics, their optimization is known to be very difficult. The simple subadditive function example in the introduction of [10] shows that subadditive minimization is inapproximable, and Theorem 17 of [11] shows that no polytime algorithm for subadditive maximization can achieve a good approximation factor. By contrast, submodular minimization is polytime in the unconstrained setting [12], and a simple greedy algorithm from [13] gives a $(1 - 1/e)$-approximation for maximization of positive polymatroids subject to a cardinality constraint. Many other approximation results are also known for submodular function optimization subject to various other types of constraints. Thus, in this work we restrict ourselves to positive polymatroids.
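The greedy algorithm referred to here can be sketched as follows (a generic implementation for a monotone submodular $f$ under a cardinality constraint; the toy coverage function is ours):

```python
def greedy_max(f, V, k):
    """Greedy for monotone submodular maximization subject to |A| <= k;
    achieves a (1 - 1/e)-approximation [13]."""
    A = set()
    for _ in range(k):
        # Add the element with the largest marginal gain f(A + v) - f(A).
        v = max(sorted(V - A), key=lambda u: f(A | {u}) - f(A))
        if f(A | {v}) - f(A) <= 0:
            break
        A.add(v)
    return A

# Toy coverage function: the number of groups that A touches (a polymatroid).
groups = [{0, 1}, {2, 3}, {4, 5}]
def cover(A):
    return sum(1 for G in groups if A & G)

print(greedy_max(cover, set(range(6)), 2))  # picks one item from 2 distinct groups
```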
Corollary 3.1.1.
Let $f$ be a positive polymatroid function. Then $\mathrm{dist}_f(x, y) = f(A_x \triangle A_y)$ is a metric on $\{0,1\}^n$.
This restriction does not entirely resolve the question of optimization hardness though. Recall that the optimization in SHmin and SHmax is with respect to $A$, but that the $f_i$ are applied to the sets $A \triangle B_i$. Unfortunately, the function $g_B(A) = f(A \triangle B)$, for a fixed set $B$, is neither necessarily submodular nor supermodular in $A$. The next example demonstrates this violation of submodularity.
Example 3.1.1.
To be submodular, the function $g_B$ must satisfy the following condition for all sets $X \subseteq Y \subseteq V$ and $j \notin Y$: $g_B(X \cup \{j\}) - g_B(X) \geq g_B(Y \cup \{j\}) - g_B(Y)$. Consider the positive polymatroid function $f(Y) = \min(|Y|, 1)$ and let $V$ consist of two elements, with $B = \{1\}$. Then for $X = \emptyset$ and $Y = \{2\}$ (with $j = 1$):

$g_B(X \cup \{1\}) - g_B(X) = f(\emptyset) - f(\{1\}) = -1 \;<\; 0 = f(\{2\}) - f(\{1, 2\}) = g_B(Y \cup \{1\}) - g_B(Y)$ (8)

This violates the definition of submodularity, implying that $g_B$ is not submodular.
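The violation can be checked numerically; this sketch uses the instance from our reconstruction of the example, $f(Y) = \min(|Y|, 1)$ with $B = \{1\}$:

```python
f = lambda Y: min(len(Y), 1)   # truncated cardinality: a positive polymatroid
B = {1}
g = lambda A: f(A ^ B)         # g_B(A) = f(A symmetric-difference B)

X, Y, j = set(), {2}, 1        # X subset of Y, and j not in Y
gain_X = g(X | {j}) - g(X)     # f(empty) - f({1})  = 0 - 1 = -1
gain_Y = g(Y | {j}) - g(Y)     # f({2})  - f({1,2}) = 1 - 1 =  0
print(gain_X, gain_Y)          # -1 0
# Submodularity of g_B would require gain_X >= gain_Y, but -1 < 0.
```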
Although $g_B$ can be non-submodular, we are interestingly still able to make use of the fact that each $f_i$ is submodular to develop approximation algorithms for SHmin and SHmax.
4 Minimization of the submodular Hamming metric
In this section, we focus on SHmin (the centroid-finding problem). We consider the four cases from Table 1: the constrained and unconstrained ($\mathcal{C} = 2^V$) settings, as well as the homogeneous case (where all $f_i$ are the same function) and the heterogeneous case. Before diving in, we note that in all cases we assume not only the natural oracle access to the objective function (i.e., the ability to evaluate $F(A)$ for any $A$), but also knowledge of $\mathcal{B}$ (the sequence of $B_i$). Theorem 4.1 shows that without knowledge of $\mathcal{B}$, SHmin is inapproximable. In practice, requiring knowledge of $\mathcal{B}$ is not a significant limitation; for all of the applications described in Section 2, $\mathcal{B}$ is naturally known.
Theorem 4.1.
Let $f$ be a positive polymatroid function. Suppose that the subset $B \subseteq V$ is fixed but unknown and $F(A) = f(A \triangle B)$. If we only have an oracle for $F$, then there is no polytime approximation algorithm for minimizing $F$, up to any polynomial approximation factor.
Proof.
Define $f$ as follows:

$f(Y) = \min(|Y|, 1)$ (9)

Then $F(A) = f(A \triangle B) = 1$ unless $A = B$, in which case $F(A) = 0$. Thus, it would take any algorithm an exponential number of queries on $F$ to find the minimizer $B$. ∎
4.1 Unconstrained setting
Submodular minimization is polytime in the unconstrained setting [12]. Since a sum of submodular functions is itself submodular, at first glance it might then seem that the sum in SHmin can be minimized in polytime. However, recall from Example 3.1.1 that the terms $f_i(A \triangle B_i)$ are not necessarily submodular in the optimization variable, $A$. This means that the question of SHmin's hardness, even in the unconstrained setting, is nontrivial. Theorem 4.2 resolves this question for the heterogeneous case, showing that it is NP-hard and that no polytime algorithm can do better than a fixed constant-factor approximation guarantee. The question of hardness in the homogeneous case remains open.
Theorem 4.2.
The unconstrained and heterogeneous version of SHmin is NP-hard. Moreover, no polytime algorithm can achieve an approximation factor better than a certain constant $c > 1$.
Proof.
We first show that for any graph $G = (V, E)$ it is possible to construct positive polymatroids $f_i$ and sets $B_i$ such that the corresponding sum in the SHmin problem attains its minimum value exactly at the vertex covers of $G$. For every edge $e = (u, v) \in E$, define two positive polymatroid functions:
(10) 
Summing these functions over all edges defines the overall SHmin objective:
(11) 
The value of each term in this sum is shown in Table 3.
[Table 3: the value of each edge term in the sum, by case.]
Using Table 3, we can show that the minimizers of the sum are exactly the vertex covers of $G$:

Case 1—show that every vertex cover of $G$ is a minimizer: the minimum value of the sum occurs when every edge term attains its smallest possible value, which is clearly achievable by taking $A = V$. Any set $A$ that is a vertex cover contains at least one endpoint of each edge, and hence also attains this smallest value for each edge term.

Case 2—show that every minimizer is a vertex cover of $G$: Suppose that $A$ is a minimizer but not a vertex cover of $G$. Then there exists some uncovered edge $e = (u, v)$ with neither endpoint in $A$. Consider adding $u$ to $A$ to form a set $A' = A \cup \{u\}$. The value of the term for $e$ strictly decreases, the value for each other edge that touches $u$ does not increase, and all other values remain unchanged. Thus, $A'$ has strictly smaller objective value, contradicting the assumption that $A$ is a minimizer.
Borrowing from [14]'s Theorem 3.1, we now define a particular graph and two additional positive polymatroid functions. Consider a bipartite graph $G$ whose edge set consists of edges that form a perfect matching between its two sides. Let $R$ be a random minimum-cardinality vertex cover of $G$. Define the following two functions:
(12) 
where the parameters are set so that the two functions are hard to tell apart. [14] shows that, knowing $G$ but given only value-oracle access to the functions, no polytime algorithm can distinguish between them. Moreover, if restricted to vertex cover solutions, it is easy to see that the first function is minimized on any of the possible vertex covers, while the second is minimized on the set $R$, with a strictly smaller value. The ratio of these minima allows [14] to show that no polytime algorithm can achieve a good approximation for the minimum submodular vertex cover problem.
Now, instead of explicitly restricting to vertex cover solutions, consider unconstrained minimization of the corresponding SHmin objectives. Since the two underlying functions cannot be distinguished in polytime, neither can these objectives. We can also show that: (1) any minimizer of either objective must be a vertex cover, and (2) the ratio of the corresponding vertex cover minimizers remains large.

Show that the first objective's minimizers are vertex covers: Suppose that $A$ is a minimizer but not a vertex cover of $G$. Then there exists some uncovered edge $(u, v)$ with neither endpoint in $A$. Consider adding $u$ to $A$ to form a set $A'$. As shown above, the edge terms of the objective strictly decrease, and plugging in the parameter definitions shows that the overall difference in value is negative for all relevant settings. Thus, $A'$ has smaller value, contradicting the assumption that $A$ is a minimizer.

Show that the second objective's minimizers are vertex covers: The reasoning here is analogous to the previous case, since adding a single node can never change the function's value by more than a bounded amount.

The first objective's minimum value: any vertex cover includes at least half of the nodes, and each edge contributes a fixed amount; combining these gives the stated minimum value.

The second objective's minimum value: the vertex cover consisting of the set $R$ minimizes it, giving a strictly smaller value.
Taking limits of the construction's parameters, the ratio of the two minima approaches the claimed constant; plugging in the definitions from above gives the ratio in terms of $n$. ∎
Since unconstrained SHmin is NP-hard, it makes sense to consider approximation algorithms for this problem. We first provide a simple approximation, UnionSplit (see Algorithm 1). This algorithm splits each term $f_i(A \triangle B_i)$ into $f_i(A \setminus B_i) + f_i(B_i \setminus A)$, then applies standard submodular minimization (see e.g. [15]) to the split function. Theorem 4.3 shows that this algorithm is a 2-approximation for SHmin. It relies on Lemma 4.2.1, which we state first.
Lemma 4.2.1.
Let $f$ be a positive monotone subadditive function. Then, for any $A, B \subseteq V$:

$\frac{1}{2}\big(f(A \setminus B) + f(B \setminus A)\big) \;\leq\; f(A \triangle B) \;\leq\; f(A \setminus B) + f(B \setminus A)$ (13)
Proof.
The upper bound follows from the definition of $\triangle$ and the fact that $f$ is subadditive:

$f(A \triangle B) = f\big((A \setminus B) \cup (B \setminus A)\big) \leq f(A \setminus B) + f(B \setminus A)$ (14)

The lower bound on $f(A \triangle B)$ follows due to the monotonicity of $f$: $f(A \setminus B) \leq f(A \triangle B)$ and $f(B \setminus A) \leq f(A \triangle B)$. Summing these two inequalities gives the bound. ∎
Theorem 4.3.
UnionSplit is a 2-approximation for unconstrained SHmin.
Proof.
An SHmin instance seeks the minimizer of $F(A) = \sum_{i=1}^m f_i(A \triangle B_i)$. Define $F'(A) = \sum_{i=1}^m \big(f_i(A \setminus B_i) + f_i(B_i \setminus A)\big)$. From Lemma 4.2.1, we see that $F'$ is a 2-approximation of $F$ (any submodular function is also subadditive). Thus, if $F'$ can be minimized exactly, the result is a 2-approximation for SHmin. Exact minimization of $F'$ is possible because $F'$ is submodular in $A$. The submodularity of $F'$ follows from the fact that submodular functions are closed under restriction, complementation, and addition (see [16], page 9). These closure properties imply that, for each $i$, $f_i(A \setminus B_i)$ and $f_i(B_i \setminus A)$ are both submodular in $A$, as is their sum. ∎
Note that UnionSplit's approximation bound is tight; there exist problem instances on which it is off by essentially a factor of 2. On such instances, according to the split function $F'$ passed to SubmodularOpt, a large family of solutions all have equal value, yet under the true objective $F$ some of these solutions are better than others by a ratio approaching 2. UnionSplit may therefore return a solution whose true value is nearly twice the optimum.
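For intuition, UnionSplit can be sketched with brute-force minimization standing in for the SubmodularOpt subroutine (a real implementation would use a polytime submodular minimizer); the instance is invented:

```python
import math
from itertools import chain, combinations

def all_subsets(V):
    elems = sorted(V)
    return [set(c) for c in chain.from_iterable(
        combinations(elems, r) for r in range(len(elems) + 1))]

def union_split(fs, Bs, V):
    """Minimize the split objective F'(A) = sum_i [f_i(A - B_i) + f_i(B_i - A)],
    a submodular 2-approximation surrogate for F(A) = sum_i f_i(A ^ B_i)."""
    def F_split(A):
        return sum(f(A - B) + f(B - A) for f, B in zip(fs, Bs))
    return min(all_subsets(V), key=F_split)   # brute-force stand-in for SubmodularOpt

V = {1, 2, 3, 4}
fs = [lambda Y: math.sqrt(len(Y))] * 2
Bs = [{1, 2}, {2, 3}]
F = lambda A: sum(f(A ^ B) for f, B in zip(fs, Bs))
A = union_split(fs, Bs, V)
print(A, F(A))
```

On this instance the surrogate returns $A = \{2\}$ with $F(A) = 2$, while the true optimum $\{1, 2\}$ achieves $\sqrt{2}$, comfortably within the factor-2 guarantee.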
Restricting to the homogeneous setting, we can provide a different algorithm that has a better approximation guarantee than UnionSplit. This algorithm simply checks the value of $F(B_i)$ for each $i$ and returns the minimizing $B_i$. We call this algorithm BestB (Algorithm 2). Theorem 4.4 gives the approximation guarantee for BestB. This result is known [17], as the proof of the guarantee only makes use of metricity and homogeneity (not submodularity), and these properties are common to much other work. We provide the proof in our notation for completeness though.
Theorem 4.4.
For $m = 1$, BestB exactly solves unconstrained SHmin. For $m > 1$, BestB is a $(2 - 2/m)$-approximation for unconstrained homogeneous SHmin.
Proof.
Define $F(A) = \sum_{i=1}^m f_i(A \triangle B_i)$, for $f_i$ positive polymatroid. Since each $f_i$ is normalized and positive, each term is minimized by $A = B_i$: $f_i(B_i \triangle B_i) = f_i(\emptyset) = 0$. Thus, any given term $f_i(A \triangle B_i)$ is minimized by setting $A = B_i$. For $m = 1$, this implies that SHmin is exactly solved by setting $A = B_1$.

Now consider $m > 1$ and the homogeneous setting where there is a single $f$: $F(A) = \sum_{i=1}^m f(A \triangle B_i)$. By Theorem 3.1, $\mathrm{dist}_f$ is a metric, so it obeys the triangle inequality:

$f(B_i \triangle B_j) \leq f(B_i \triangle A) + f(A \triangle B_j)$ (15)

Fixing some $j$ and summing this inequality over all $i \neq j$:

$F(B_j) = \sum_{i=1}^m f(B_i \triangle B_j) \leq \sum_{i \neq j} \big(f(B_i \triangle A) + f(A \triangle B_j)\big)$ (16)

where the first equality is due to the fact that polymatroids are normalized: $f(B_j \triangle B_j) = f(\emptyset) = 0$. Regrouping terms, $f(A \triangle B_j)$ is independent of $i$, so it can be pulled out of the summation:

$F(B_j) \leq \sum_{i \neq j} f(B_i \triangle A) + (m - 1)\, f(A \triangle B_j)$ (17)

Notice that $\sum_{i=1}^m f(B_i \triangle A)$ is exactly $F(A)$, so $\sum_{i \neq j} f(B_i \triangle A) = F(A) - f(A \triangle B_j)$. Substituting in this notation and summing over all $j$:

$\sum_{j=1}^m F(B_j) \leq m\, F(A) + (m - 2) \sum_{j=1}^m f(A \triangle B_j) = (2m - 2)\, F(A)$ (18)

On the left-hand side, a sum over $m$ items is at least $m$ times the minimum term in the sum, so the sum can be replaced by a min:

$\min_{j} F(B_j) \leq \left(2 - \frac{2}{m}\right) F(A)$ (19)

The left-hand side is exactly what the BestB algorithm computes, and since the bound holds in particular for the optimal $A$, the minimizing $B_j$ found by BestB is a $(2 - 2/m)$-approximation for unconstrained homogeneous SHmin. ∎
Note that as a corollary of this result, in the case when $m = 2$, the optimal solution for unconstrained homogeneous SHmin is to take the better of $B_1$ and $B_2$. Also note that since UnionSplit's approximation bound is tight at 2, BestB is theoretically better in terms of worst-case performance in the unconstrained homogeneous setting. However, UnionSplit's performance on practical problems is often better than BestB's, as many practical problems do not hit upon this worst case. For example, consider the case where $f$ is simply cardinality and each $B_i$ consists of two items: one item common to all the $B_i$, plus one item unique to $B_i$. Then the best $B_i$ has value $2(m - 1)$, while the set consisting of just the common item, which UnionSplit can find (for modular $f$ the split objective coincides with $F$), has a lower (better) value of $m$.
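BestB itself is a one-liner; the sketch below also replays the shared-item example just described (our reconstruction, with $m = 4$), where the true optimum beats every $B_i$:

```python
def best_b(fs, Bs):
    """Return the B_i minimizing F(B_i) = sum_j f_j(B_i ^ B_j)."""
    F = lambda A: sum(f(A ^ B) for f, B in zip(fs, Bs))
    return min(Bs, key=F)

# m sets sharing the common item 0, each with one unique item i:
m = 4
Bs = [{0, i} for i in range(1, m + 1)]
fs = [len] * m                     # modular f: plain cardinality
F = lambda A: sum(f(A ^ B) for f, B in zip(fs, Bs))

best = best_b(fs, Bs)
print(F(best))                     # 2*(m-1) = 6
print(F({0}))                      # m = 4: the shared-item set is better
```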
4.2 Constrained setting
In the constrained setting, the SHmin problem becomes more difficult. Essentially, all of the hardness results established in existing work on constrained submodular minimization apply to the constrained SHmin problem as well. Theorem 4.5 shows that, even for a simple cardinality constraint and identical $f_i$ (the homogeneous setting), not only is SHmin NP-hard, but it is also hard to approximate within a curvature-dependent polynomial factor.
Theorem 4.5.
Homogeneous SHmin is NP-hard under cardinality constraints. Moreover, no algorithm can achieve an approximation factor better than $\Omega\big(\sqrt{n} / (1 + (\sqrt{n} - 1)(1 - \kappa_f))\big)$, where $\kappa_f$ denotes the curvature of $f$. This holds even when $m = 1$.
Proof.
Let $m = 1$ and $B_1 = \emptyset$. Then under cardinality constraints, SHmin becomes the problem of minimizing a positive polymatroid $f$ subject to a cardinality lower bound. Corollary 5.1 of [18] establishes that this problem is NP-hard and has the stated curvature-dependent hardness. ∎
We can also show similar hardness results for several other combinatorial constraints including matroid constraints, shortest paths, spanning trees, cuts, etc. [18, 14]. Note that the hardness established in Theorem 4.5 depends on the quantity $\kappa_f$, which is called the curvature of a submodular function [19, 18]. Intuitively, this factor measures how close a submodular function is to a modular function. The result suggests that the closer the function is to being modular, the easier it is to optimize. This makes sense, since with a modular function, SHmin can be exactly minimized under several combinatorial constraints. To see this for the cardinality-constrained case, first note that for modular $f_i$, the corresponding objective $F$ is also modular. Lemma 4.5.1 formalizes this.
Lemma 4.5.1.
If the $f_i$ in SHmin are modular, then $F(A) = \sum_{i=1}^m f_i(A \triangle B_i)$ is also modular in $A$.
Proof.
Any normalized modular function $f_i$ can be represented as a vector $w_i \in \mathbb{R}^n$, such that $f_i(Y) = \sum_{k \in Y} w_i(k)$. With $Y = A \triangle B_i$, this can be written:

$f_i(A \triangle B_i) = \sum_{k \in A \setminus B_i} w_i(k) + \sum_{k \in B_i \setminus A} w_i(k)$ (20)

$= f_i(B_i) + \sum_{k \in A \setminus B_i} w_i(k) - \sum_{k \in A \cap B_i} w_i(k)$ (21)

Summing over $i$ and letting $c = \sum_{i=1}^m f_i(B_i)$ represent the part that is constant with respect to $A$, we have:

$F(A) = c + \sum_{k \in A} \bar{w}(k)$ (22)

Thus, $F$ can be represented by the offset $c$ and a vector $\bar{w}$ with entries $\bar{w}(k) = \sum_{i : k \notin B_i} w_i(k) - \sum_{i : k \in B_i} w_i(k)$. This is sufficient to prove modularity. (For optimization purposes, note that $c$ can be dropped without affecting the solution to SHmin.) ∎
Given Lemma 4.5.1, from the definition of modularity we know that there exist a constant $c$ and a vector $\bar{w}$ such that $F(A) = c + \sum_{k \in A} \bar{w}(k)$. From this representation it is clear that $F$ can be minimized subject to a cardinality constraint by choosing as the set $A$ the items corresponding to the smallest entries in $\bar{w}$. Thus, for modular $f_i$, or for $f_i$ with small curvature $\kappa$, such constrained minimization is relatively easy.
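A sketch of this modular special case (the weights and instance are invented; we use the constraint $|A| = \ell$ for concreteness):

```python
def modular_sh_min(ws, Bs, V, ell):
    """Exact SHmin for modular f_i(Y) = sum_{k in Y} w_i[k] under |A| = ell.
    Build the combined vector w-bar (Eq. (22)) and keep the ell smallest entries."""
    wbar = {k: sum(-w[k] if k in B else w[k] for w, B in zip(ws, Bs))
            for k in V}
    return set(sorted(V, key=lambda k: wbar[k])[:ell])

V = {0, 1, 2, 3}
ws = [{0: 1.0, 1: 2.0, 2: 1.0, 3: 1.0}] * 2    # hypothetical weights
Bs = [{0, 1}, {1, 2}]

A = modular_sh_min(ws, Bs, V, ell=2)
print(A)
```

The entry $\bar{w}(k)$ is negative exactly when including item $k$ removes more symmetric-difference penalty than it adds, so sorting by $\bar{w}$ solves the problem exactly.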
Having established the hardness of constrained SHmin, we now turn to considering approximation algorithms for this problem. Unfortunately, the UnionSplit algorithm from the previous section requires an efficient algorithm for submodular function minimization, and no such algorithm exists in the constrained setting; submodular minimization is NP-hard even under simple cardinality constraints [20] (although see [21], which shows it is possible to get exact solutions for a subset of the cardinality constraints). Similarly, the BestB algorithm breaks down in the constrained setting; its guarantees carry over only if all the $B_i$ are within the constraint set $\mathcal{C}$. Thus, for the constrained SHmin problem we instead propose a majorization-minimization algorithm. Theorem 4.6 shows that this algorithm has a curvature-dependent approximation guarantee, and Algorithm 3 formally defines the algorithm.
Essentially, MajorMin proceeds by iterating the following two steps: constructing $\hat{F}$, a modular upper bound for $F$ at the current solution $A^t$, then minimizing $\hat{F}$ to get a new $A^{t+1}$. $\hat{F}$ consists of superdifferentials [22, 23] of $F$'s component submodular functions. We use the superdifferentials defined as ``grow'' and ``shrink'' in [24]. For a bound tight at the set $Y$, ``grow'' uses the gains $f(k \mid \emptyset)$ for $k \notin Y$ and $f(k \mid Y \setminus \{k\})$ for $k \in Y$, while ``shrink'' uses $f(k \mid Y)$ and $f(k \mid V \setminus \{k\})$ respectively; collecting these gains into a vector $w_Y$, the modular bound can be written:

$m_Y(A) = f(Y) + \sum_{k \in A \setminus Y} w_Y(k) - \sum_{k \in Y \setminus A} w_Y(k)$ (23)

where $f(k \mid Y) = f(Y \cup \{k\}) - f(Y)$ is the gain in value when adding $k$ to $Y$. We now state the main theorem characterizing algorithm MajorMin's performance on SHmin.
Theorem 4.6.
MajorMin is guaranteed to improve the objective value, $F(A^t)$, at every iteration. Moreover, for any constraint over which a modular function can be exactly optimized, it has a curvature-dependent approximation guarantee relative to $A^*$, the optimal solution of SHmin.
Proof.
We first define the full ``grow'' and ``shrink'' superdifferentials:

$m^{\mathrm{grow}}_Y(A) = f(Y) - \sum_{k \in Y \setminus A} f(k \mid Y \setminus \{k\}) + \sum_{k \in A \setminus Y} f(k \mid \emptyset)$ (24)

$m^{\mathrm{shrink}}_Y(A) = f(Y) - \sum_{k \in Y \setminus A} f(k \mid V \setminus \{k\}) + \sum_{k \in A \setminus Y} f(k \mid Y)$ (25)

When referring to either of these modular functions, we write $m_Y$. Note that the $m_Y$ upper-bound $f$ in the following sense: $m_Y(A) \geq f(A)$ for all $A \subseteq V$, and $m_Y(Y) = f(Y)$.
MajorMin proceeds as follows. Starting from a feasible $A^0$ and applying either ``grow'' or ``shrink'' to construct a modular approximation to $f_i$ at $A^t \triangle B_i$ yields the following simple surrogate function for each $i$: $\hat{F}^t_i(A) = m_{A^t \triangle B_i}(A \triangle B_i)$. The bound below then holds (from [18]), where $\alpha$ is the curvature-dependent factor of the guarantee and $\hat{F}^t = \sum_{i=1}^m \hat{F}^t_i$:

$F(A) \leq \hat{F}^t(A) \leq \alpha\, F(A)$ (26)

Let $A^{t+1}$ be the minimizer of $\hat{F}^t$ over $\mathcal{C}$. Also, let $A^*$ be the optimal solution of SHmin. Then, it holds that:

$F(A^{t+1}) \leq \hat{F}^t(A^{t+1})$ (27)

$\leq \hat{F}^t(A^*)$ (28)

$\leq \alpha\, F(A^*)$ (29)

The first inequality follows from the definition of the modular upper bound, the second inequality follows from the fact that $A^{t+1}$ is the minimizer of the modular optimization, and the third inequality follows from Equation 26. We now show that MajorMin improves the objective value at every iteration:

$F(A^{t+1}) \leq \hat{F}^t(A^{t+1}) \leq \hat{F}^t(A^t) = F(A^t)$ (30)
∎
While MajorMin does not have a constant-factor guarantee (which is possible only in the unconstrained setting), its bounds are not too far from the hardness of the constrained setting. For example, in the cardinality case, the guarantee of MajorMin is of the form $n / (1 + (n - 1)(1 - \kappa_f))$, while the hardness shown in Theorem 4.5 is of the form $\sqrt{n} / (1 + (\sqrt{n} - 1)(1 - \kappa_f))$.
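A sketch of the MajorMin loop, using the ``grow'' modular upper bound of Equation (24) and a brute-force modular step (the instance is invented; a real implementation would exploit the closed-form modular minimization of Lemma 4.5.1):

```python
import math
from itertools import combinations

def gain(f, k, Y):
    """Marginal gain f(k | Y) = f(Y + {k}) - f(Y)."""
    return f(Y | {k}) - f(Y)

def grow_bound(f, Y):
    """'Grow' modular upper bound on f, tight at Y (Eq. (24))."""
    def m(A):
        return (f(Y)
                + sum(gain(f, k, set()) for k in A - Y)
                - sum(gain(f, k, Y - {k}) for k in Y - A))
    return m

def major_min(fs, Bs, V, ell, iters=20):
    """Iterate: build modular upper bound at A_t, then minimize it s.t. |A| = ell."""
    F = lambda S: sum(f(S ^ B) for f, B in zip(fs, Bs))
    A = set(sorted(V)[:ell])                    # arbitrary feasible start
    for _ in range(iters):
        ms = [grow_bound(f, A ^ B) for f, B in zip(fs, Bs)]
        surrogate = lambda S: sum(m(S ^ B) for m, B in zip(ms, Bs))
        best = min((set(c) for c in combinations(sorted(V), ell)), key=surrogate)
        if F(best) >= F(A):
            break                               # no further improvement
        A = best
    return A

V = {0, 1, 2, 3}
fs = [lambda Y: math.sqrt(len(Y))] * 2
Bs = [{0, 1}, {0, 2}]
print(major_min(fs, Bs, V, ell=1))
```

Because the surrogate upper-bounds $F$ and is tight at the current $A^t$, accepting a surrogate minimizer can never increase the true objective, matching Equation (30).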
5 Maximization of the submodular Hamming metric
We next characterize the hardness of SHmax (the diversification problem) and describe approximation algorithms for it. We first show that all versions of SHmax, even the unconstrained homogeneous one, are NP-hard. Note that this is a non-trivial result. Maximization of a monotone function such as a polymatroid is not NP-hard; the maximizer is always the full set $V$. But, for SHmax, despite the fact that the $f_i$ are monotone with respect to their argument $A \triangle B_i$, they are not monotone with respect to $A$ itself. This makes SHmax significantly harder. After establishing that SHmax is NP-hard, we show that no polytime algorithm can obtain an approximation factor better than a fixed constant in the unconstrained setting, or better than a (different) fixed constant in the constrained setting. Finally, we provide a simple approximation algorithm which achieves a constant-factor guarantee for all settings.
Theorem 5.1.
All versions of SHmax (constrained or unconstrained, heterogeneous or homogeneous) are NP-hard. Moreover, no polytime algorithm can obtain a factor better than a fixed constant for the unconstrained versions, or better than a (different) fixed constant for the cardinality-constrained versions.
Proof.
We first show that homogeneous unconstrained SHmax is NP-hard. We proceed by constructing an $F$ that can represent any symmetric positive normalized (non-monotone) submodular function. Maximization is NP-hard for this type of function, since it subsumes the Max-Cut problem. Hence, the reduction to unconstrained SHmax suffices to show NP-hardness.
Consider an instance of SHmax with and