Discrete optimization methods such as graph cuts [5, 18] have proven to be quite effective for many computer vision problems, including stereo , interactive segmentation  and texture synthesis . The underlying optimization problem behind graph cuts is a special case of submodular function optimization that can be solved exactly using max flow . Graph cut methods, however, are limited by their reliance on first-order priors involving pairs of pixels, and there is considerable interest in expressing priors that rely on local image patches such as the popular Field of Experts model .
In this paper we focus on an important generalization of the functions that graph cuts can minimize, which can express higher-order priors. These functions, which  called Sum-of-Submodular (SoS), can be efficiently solved with a variant of max flow . While SoS functions have more expressive power, they also involve a large number of parameters. Rather than addressing the question of which existing higher order priors can be expressed as an SoS function, we take a discriminative learning approach and effectively search the space of SoS functions with the goal of finding a higher order prior that gives strong results on our training set.111Since we are taking a discriminative approach, the higher-order energy function we learn does not have a natural probabilistic interpretation. We are using the word “prior” here somewhat loosely, as is common in computer vision papers that focus on energy minimization.
Our main contribution is to introduce the first learning method for training such SoS functions, and to demonstrate the effectiveness of this approach for interactive segmentation using learned higher order priors. Following a Structural SVM approach [14, 33], we show that the training problem can be cast as a quadratic optimization problem over an extended set of linear constraints. This generalizes large-margin training of pairwise submodular (a.k.a. regular ) MRFs [1, 29, 30], where submodularity corresponds to a simple non-negativity constraint. To solve the training problem, we show that an extended cutting-plane algorithm can efficiently search the space of SoS functions.
1.1 Sum-of-Submodular functions and priors
A submodular function on a set satisfies for all . Such a function is sum-of-submodular (SoS) if we can write it as
for where each is submodular. Research on higher-order priors calls a clique .
Of course, a sum of submodular functions is itself submodular, so we could use general submodular optimization to minimize an SoS function. However, general submodular optimization is  (which is impractical for low-level vision problems), whereas we may be able to exploit the neighborhood structure to do better. For example, if all the cliques are pairs () the energy function is referred to as regular and the problem can be reduced to max flow . As mentioned above, this is the underlying technique used in the popular graph cuts approach [5, 20, 27]. The key limitation is the restriction to pairwise cliques, which does not allow us to naturally express important higher order priors such as those involving image patches . The most common approach to solving higher-order priors with graph cuts, which involves transformation to pairwise cliques, in practice almost always produces non-submodular functions that cannot be solved exactly [8, 13].
2 Related Work
Many learning problems in computer vision can be cast as structured output prediction, which allows learning outputs with spatial coherence. Among the most popular generic methods for structured output learning are Conditional Random Fields (CRFs) trained by maximum conditional likelihood , Maximum-Margin Markov Networks (M3N) 
, and Structural Support Vector Machines (SVM-struct)[33, 14]. A key advantage of M3N and SVM-struct over CRFs is that training does not require computation of the partition function. Among the two large-margin approaches M3N and SVM-struct, we follow the SVM-struct methodology since it allows the use of efficient inference procedures during training.
In this paper, we will learn submodular discriminant functions. Prior work on learning submodular functions falls into three categories: submodular function regression , maximization of submodular discriminant functions, and minimization of submodular discriminant functions.
Learning of submodular discriminant functions where a prediction is computed through maximization has widespread use in information retrieval, where submodularity models diversity in the ranking of a search engine [34, 23] or in an automatically generated abstract . While exact (monotone) submodular maximization is intractible, approximate inference using a simple greedy algorithm has approximation guarantees and generally excellent performance in practice.
The models considered in this paper use submodular discriminant functions where a prediction is computed through minimization. The most popular such models are regular MRFs . Traditionally, the parameters of these models have been tuned by hand, but several learning methods exist. Most closely related to the work in this paper are Associative Markov Networks [30, 1]
, which take an M3N approach and exploit the fact that regular MRFs have an integral linear relaxation. These linear programs (LP) are folded into the M3N quadratic program (QP) that is then solved as a monolithic QP. In contrast, SVM-struct training using cutting planes for regular MRFs allows graph cut inference also during training, and [7, 19] show that this approach has interesting approximation properties even the for multi-class case where graph cut inference is only approximate. More complex models for learning spatially coherent priors include separate training for unary and pairwise potentials 
, learning MRFs with functional gradient boosting, and the Potts models, all of which have had success on a variety of vision problems. Note that our general approach for learning multi-label SoS functions, described in section 4.4, includes the Potts model as a special case.
3 SoS minimization
In this section, we briefly summarize how an SoS function can be minimized by means of a submodular flow network (section 3.1), and then present our improved algorithm for solving this minimization (section 3.2).
3.1 SoS minimization via submodular flow
Submodular flow is similar to the max flow problem, in that there is a network of nodes and arcs on which we want to push flow from to . However, the notion of residual capacity will be slightly modified from that of standard max flow. For a much more complete description see .
We begin with a network . As in the max flow reduction for Graph Cuts, there are source and sink arcs and for every . Additionally, for each clique , there is an arc for each .222To explain the notation, note that might be in multiple cliques , so we may have multiple edges (that is, is a multigraph). We distinguish between them by the subscript .
Every arc also has an associated residual capacity . The residual capacity of arcs and are the familiar residual capacities from max flow: there are capacities and (determined by the unary terms of ), and whenever we push flow on a source or sink arc, we decrease the residual capacity by the same amount.
For the interior arcs, we need one further piece of information. In addition to residual capacities, we also keep track of residual clique functions , related to the flow values by the following rule: whenever we push units of flow on arc , we update by
The residual capacities of the interior arcs are chosen so that the are always nonnegative. Accordingly, we define .
Given a flow , define the set of residual arcs as all arcs with . An augmenting path is an path along arcs in . The following theorem of  tells how to optimize by computing a flow in .
Let be a feasible flow such that there is no augmenting path from to . Let be the set of all reachable from along arcs in . Then is the minimum value of over all .
3.2 IBFS for Submodular Flow
Incremental Breadth First Search (IBFS) , which is the state of the art in max flow methods for vision applications, improves the algorithm of  to guarantee polynomial time complexity. We now show how to modify IBFS to compute a maximum submodular flow in .
IBFS is an augmenting paths algorithm: at each step, it finds a path from to with positive residual capacity, and pushes flow along it. Additionally, each augmenting path found is a shortest - path in . To ensure that the paths found are shortest paths, we keep track of distances and from to and from to , and search trees and containing all nodes at distance at most from or from respectively. Two invariants are maintained:
For every in , the unique path from to in is a shortest - path in .
For every in , the unique path from to in is a shortest - path in .
The algorithm proceeds by alternating between forward passes and reverse passes. In a forward pass, we attempt to grow the source tree by one layer (a reverse pass attempts to grow , and is symmetric). To grow , we scan through the vertices at distance away from , and examine each out-arc with positive residual capacity. If is not in or , then we add to at distance level , and with parent . If is in , then we found an augmenting path from to via the arc , so we can push flow on it.
The operation of pushing flow may saturate some arcs (and cause previously saturated arcs to become unsaturated). If the parent arc of in the tree or becomes saturated, then becomes an orphan. After each augmentation, we perform an adoption step, where each orphan finds a new parent. The details of the adoption step are similar to the relabel operation of the Push-Relabel algorithm, in that we search all potential parent arcs in for the neighbor with the lowest distance label, and make that node our new parent.
In order to apply IBFS to the submodular flow problem, all the basic datastructures still make sense: we have a graph where the arcs have residual capacities , and a maximum flow has been found if and only if there is no longer any augmenting path from to .
The main change for the submodular flow problem is that when we increase flow on an edge , instead of just affecting the residual capacity of that arc and the reverse arc, we may also change the residual capacities of other arcs for . However, the following result ensures that this is not a problem.
If was previously saturated, but now has residual capacity as a result of increasing flow along , then (1) either or there was an arc and (2) either or there was an arc .
Increasing flow on an edge never creates a shortcut between and , or from to .
The proofs are based on results of , and can be found in the Supplementary Material.
Corollary 3.3 ensures that we never create any new shorter - or - paths not contained in or . A push operation may cause some edges to become saturated, but this is the same problem as in the normal max flow case, and any orpans so created will be fixed in the adoption step. Therefore, all invariants of the IBFS algorithm are maintained, even in the submodular flow case.
One final property of the IBFS algorithm involves the use of the “current arc heuristic”, which is a mechanism for avoiding iterating through all possible potential parents when performing an adoption step. In the case of Submodular Flows, it is also the case that whenever we create new residual arcs we maintain all invariants related to this current arc heuristic, so the same speedup applies here. However, as this does not affect the correctness of the algorithm, and only the runtime, we will defer the proof to the Supplementary Material.
Running time. The asymptotic complexity of the standard IBFS algorithm is . In the submodular-flow case, we still perform the same number of basic operations. However, note finding residual capacity of an arc requires minimizing for separating and . If , this can be done in time using . However, for , it will likely be much more efficient to use the naive algorithm of searching through all values of . Overall, we add work at each basic step of IBFS, so if we have cliques the total runtime is .
4 S3SVM: SoS Structured SVMs
In this section, we first review the SVM algorithm and its associated Quadratic Program (section 4.1). We then decribe a general class of SoS discriminant functions which can be learned by SVM-struct (section 4.2) and explain this learning procedure (section 4.3). Finally, we generalize SoS functions to the multi-label case (section 4.4).
4.1 Structured SVMs
Structured output prediction describes the problem of learning a function where is the space of inputs, and is the space of (multivariate and structured) outputs for a given problem. To learn , we assume that a training sample of input-output pairs is available and drawn i.i.d. from an unknown distribution. The goal is to find a function from some hypothesis space
that has low prediction error, relative to a loss function. The function quantifies the error associated with predicting when is the correct output value. For example, for image segmentation, a natural loss function might be the Hamming distance between the true segmentation and the predicted labeling.
The mechanism by which Structural SVMs finds a hypothesis is to learn a discriminant function over input/output pairs. One derives a prediction for a given input by minimizing over all .333Note that the use of minimization departs from the usual language of [33, 14] where the hypothesis is . However, because of the prevalence of cost functions throughout computer vision, we have replaced by throughout. We will write this as . We assume is linear in two quantities w and where is a parameter vector and is a feature vector relating input and output . Intuitively, one can think of as a cost function that measures how poorly the output matches the given input .
Ideally, we would find weights w such that the hypothesis always gives correct results on the training set. Stated another way, for each example , the correct prediction should have low discriminant value, while incorrect predictions with large loss should have high discriminant values. We write this constraint as a linear inequality in w
It is convenient to define , so that the above inequality becomes .
Since it may not be possible to satisfy all these conditions exactly, we also add a slack variable to the constraints for example . Intuitively, slack variable represents the maximum misprediction loss on the th example. Since we want to minimize the prediction error, we add an objective function which penalizes large slack. Finally, we also penalize to discourage overfitting, with a regularization parameter to trade off these costs.
Quadratic Program 1.
-Slack Structural SVM
4.2 Submodular Feature Encoding
We now apply the Structured SVM (SVM-struct) framework to the problem of learning SoS functions.
For the moment, assume our prediction task is to assign a binary label for each element of a base set. We will cover the multi-label case in section 4.4. Since the labels are binary, prediction consists of assigning a subset for each input (namely the set of pixels labeled 1).
Our goal is to construct a feature vector that, when used with the SVM-struct algorithm of section 4.1, will allow us to learn sum-of-submodular energy functions. Let’s begin with the simplest case of learning a discriminant function , defined only on a single clique and which does not depend on the input .
Intuitively, our parameters w will correspond to the table of values of the clique function , and our feature vector will be chosen so that . We can accomplish this by letting and w have entries, indexed by subsets , and defining (where is 1 if and 0 otherwise). Note that, as we claimed,
If our parameters are allowed to vary over all , then may be an arbitrary function , and not necessarily submodular. However, we can enforce submodularity by adding a number of linear inequalities. Recall that is submodular if and only if . Therefore, is submodular if and only if the parameters satisfy
These are just linear constraints in w, so we can add them as additional constraints to Quadratic Program 1. There are of them, but each clique has parameters, so this does not increase the asymptotic size of the QP.
By adding constraints (5) to the QP, we ensure that the optimal solution w is defines a submodular . Conversely, for any submodular function , there is a feasible w defined by , so the optimal solution to the QP must be the maximum-margin such function. ∎
To introduce a dependence on the data , we can define to be for an arbitrary nonnegative function .
Because is nonnegative, constraints (5) ensure that the discriminant function is again submodular. ∎
Finally, we can learn multiple clique potentials simultaneously. If we have a neighborhood structure with cliques, each with a data-dependence , we create a feature vector composed of concatenating the different features .
With feature vector , and adding a copy of the constraints (5) for each clique , the learned is the maximum margin of the form
where the can vary over all possible submodular functions on the cliques .
4.3 Solving the quadratic program
The -slack formulation for SSVMs (QP 1) makes intuitive sense, from the point of view of minimizing the misprediction error on the training set. However, in practice it is better to use the -slack reformulation of this QP from . Compared to -slack, the -slack QP can be solved several orders of magnitude faster in practice, as well as having asymptotically better complexity.
The -slack formulation is an equivalent QP which replaces the slack variables with a single variable . The loss constraints (3) are replaced with constraints penalizing the sum of losses across all training examples. We also include submodular constraints on w.
Quadratic Program 2.
1-Slack Structural SVM
Note that we have a constraint for each tuple , which is an exponential sized set. Despite the large set of constraints, we can solve this QP to any desired precision by using the cutting plane algorithm. This algorithm keeps track of a set of current constraints, solves the current QP with regard to those constraints, and then given a solution , finds the most violated constraint and adds it to . Finding the most violated constraint consists of solving for each example the problem
Since the features ensure that is SoS, then as long as factors as a sum over the cliques (for instance, the Hamming loss is such a function), then (7) can be solved with Submodular IBFS. Note that this also allows us to add arbitrary additional features for learning the unary potentials as well. Pseudocode for the entire S3SVM learning is given in Algorithm 1.
4.4 Generalization to multi-label prediction
Submodular functions are intrinsically binary functions. In order to handle the multi-label case, we use expansion moves  to reduce the multi-label optimization problem to a series of binary subproblems, where each pixel may either switch to a given label or keep its current label. If every binary subproblem of computing the optimal expansion move is an SoS problem, we will call the original multi-label energy function an SoS expansion energy.
Let be our label set, with output space . Our learned function will have the form where . For a clique and label , define , i.e., the subset of taking label .
If all the clique functions are of the form
where each is submodular, then any expansion move for the multi-label energy function will be SoS.
Fix a current labeling , and let be the energy when the set switches to label . We can write in terms of the clique functions and sets as
We use a fact from the theory of submodular functions: if is submodular, then for any fixed both and are also submodular. Therefore, is SoS. ∎
Theorem 4.4 characterizes a large class of SoS expansion energies. These functions generalize commonly used multi-label clique functions, including the Potts model . The model pays cost when all pixels are equal to label , and otherwise. We can write this as an SoS expansion energy by letting if and otherwise . Then, is equal to the Potts model, up to an additive constant. Generalizations such as the robust model  can be encoded in a similar fashion. Finally, in order to learn these functions, we let be composed of copies of — one for each , and add corresponding copies of the constraints (5).
As a final note: even though the individual expansion moves can be computed optimally, -expansion still may not find the global optimum for the multi-labeled energy. However, in practice -expansion finds good local optima, and has been used for inference in Structural SVM with good results, as in .
5 Experimental Results
In order to evaluate our algorithms, we focused on binary denoising and interactive segmentation. For binary denoising, Generic Cuts  provides the most natural comparison since it is a state-of-the-art method that uses SoS priors. For interactive segmentation the natural comparison is against GrabCut , where we used the OpenCV implementation. We ran our general S3SVM method, which can learn an arbitrary SoS function, an also considered the special case of only using pairwise priors. For both the denoising and segmentation applications, we significantly improve on the accuracy of the hand-tuned energy functions.
5.1 Binary denoising
Our binary denoising dataset consists of a set of 20 black and white images. Each image is and either a set of geometric lines, or a hand-drawn sketch (see Figure 1). We were unable to obtain the original data used by , so we created our own similar data by adding independent Gaussian noise at each pixel.
For denoising, the hand-tuned Generic Cuts algorithm of  posed a simple MRF, with unary pixels equal to the absolute valued distance from the noisy input, and an SoS prior, where each clique penalizes the square-root of the number of edges with different labeled endpoints within that clique. There is a single parameter , which is the tradeoff between the unary energy and the smoothness term. The neighborhood structure consists of all patches of the image.
Our learned prior includes the same unary terms and clique structure, but instead of the square-root smoothness prior, we learn a clique function to get an MRF . Note that each clique has the same energy as every other, so this is analogous to a graph cuts prior where each pairwise edge has the same attractive potential. Our energy function has 16 total parameters (one for each possible value of , which is defined on patches).
We randomly divided the 20 input images into 10 training images and 10 test images. The loss function was the Hamming distance between the correct, un-noisy image and the predicted image. To hand tune the value , we picked the value which gave the minimum pixel-wise error on the training set. S3SVM training took only 16 minutes.
Numerically, S3SVM performed signficantly better than the hand-tuned method, with an average pixel-wise error of only 4.9% on the training set, compared to 28.6% for Generic Cuts. The time needed to do inference after training was similar for both methods: 0.82 sec/image for S3SVM vs. 0.76 sec/image for Generic Cuts. Visually, the S3SVM images are significantly cleaner looking, as shown in Figure 1.
5.2 Interactive segmentation
The input to interactive segmentation is a color image, together with a set of sparse foreground/background annotations provided by the user. See Figure 2 for examples. From the small set of labeled foreground and background pixels, the prediction task is to recover the ground-truth segmentation for the whole image.
Our baseline comparison is the Grabcut algorithm, which solves a pairwise CRF. The unary terms of the CRF are obtained by fitting a Gaussian Mixture Model to the histograms of pixels labeled as being definitely foreground or background. The pairwise terms are a standard contrast-sensitive Potts potential, where the cost of pixelsand taking different labels is equal to for some hand-coded parameters . Our primary comparison is against the OpenCV implementation of Grabcut, available at www.opencv.org.
As a special case, our algorithm can be applied to pairwise-submodular energy functions, for which it solves the same optimization problem as in Associative Markov Networks (AMN’s) [30, 1]. Automatically learning parameters allows us to add a large number of learned unary features to the CRF.
As a result, in addition to the smoothness parameter , we also learn the relative weights of approximately 400 features describing the color values near a pixel, and relative distances to the nearest labeled foreground/background pixel. Further details on these features can be found in the Supplementary Material. We refer to this method as S3SVM-AMN.
Our general S3SVM method can incorporate higher-order priors instead of just pairwise ones. In addition to the unary features used in S3SVM-AMN, we add a sum-of-submodular higher-order CRF. Each patch in the image has a learned submodular clique function. To obtain the benefits of the contrast-sensitive pairwise potentials for the higher-order case, we cluster (using -means) the and gradient responses of each patch into 50 clusters, and learn one submodular potential for each cluster. Note that S3SVM automatically allows learning the entire energy function, including the clique potentials and unary potentials (which come from the data) simultaneously.
We use a standard interactive segmentation dataset from  of 151 images with annotations, together with pixel-level segmentations provided as ground truth. These images were randomly sorted into training, validation and testing sets, of size 75, 38 and 38 respectively. We trained both S3SVM-AMN and S3SVM on the training set for various values of the regularization parameter , and picked the value which gave the best accuracy on the validation set, and report the results of that value on the test set.
The overall performance is shown in the table below. Training time is measured in seconds, and testing time in seconds per image. Our implementation, which used the submodular flow algorithm based on IBFS discussed in section 3.2, will be made freely available under the MIT license.
Learning and validation was performed 5 times with independently sampled training sets. The averages and standard deviations shown above are from these 5 samples.
-  D. Anguelov, B. Taskar, V. Chatalbashev, D. Koller, D. Gupta, G. Heitz, and A. Y. Ng. Discriminative learning of Markov Random Fields for segmentation of 3D scan data. In CVPR, 2005.
-  C. Arora, S. Banerjee, P. Kalra, and S. N. Maheshwari. Generic cuts: an efficient algorithm for optimal inference in higher order MRF-MAP. In ECCV, 2012.
-  M.-F. Balcan and N. J. A. Harvey. Learning submodular functions. In STOC, 2011.
-  Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. TPAMI, 26(9), 2004.
-  Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. TPAMI, 23(11), 2001.
-  J. Edmonds and R. Giles. A min-max relation for submodular functions on graphs. Annals of Discrete Mathematics, 1:185–204, 1977.
-  T. Finley and T. Joachims. Training structural SVMs when exact inference is intractable. In ICML, 2008.
-  A. Fix, A. Gruber, E. Boros, and R. Zabih. A graph cut algorithm for higher-order Markov Random Fields. In ICCV, 2011.
-  S. Fujishige and X. Zhang. A push/relabel framework for submodular flows and its refinement for 0-1 submodular flows. Optimization, 38(2):133–154, 1996.
-  A. V. Goldberg, S. Hed, H. Kaplan, R. E. Tarjan, and R. F. Werneck. Maximum flows by incremental breadth-first search. In European Symposium on Algorithms, 2011.
-  V. Gulshan, C. Rother, A. Criminisi, A. Blake, and A. Zisserman. Geodesic star convexity for interactive image segmentation. In CVPR, 2010.
-  X. He, R. S. Zemel, and M. A. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, 2004.
-  H. Ishikawa. Transformation of general binary MRF minimization to the first order case. TPAMI, 33(6), 2010.
-  T. Joachims, T. Finley, and C.-N. Yu. Cutting-plane training of structural SVMs. Machine Learning, 77(1), 2009.
-  P. Kohli, M. P. Kumar, and P. H. Torr. P3 and beyond: Move making algorithms for solving higher order functions. TPAMI, 31(9), 2008.
-  P. Kohli, L. Ladicky, and P. Torr. Robust higher order potentials for enforcing label consistency. IJCV, 82, 2009.
-  V. Kolmogorov. Minimizing a sum of submodular functions. Discrete Appl. Math., 160(15):2246–2258, Oct. 2012.
-  V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? TPAMI, 26(2), 2004.
-  H. Koppula, A. Anand, T. Joachims, and A. Saxena. Semantic labeling of 3D point clouds for indoor scenes. In NIPS, 2011.
-  V. Kwatra, A. Schodl, I. Essa, G. Turk, and A. Bobick. Graphcut textures: Image and video synthesis using graph cuts. SIGGRAPH, 2003.
-  L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.
-  J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
H. Lin and J. Bilmes.
Learning mixtures of submodular shells with application to document summarization.In UAI, 2012.
-  D. Munoz, J. A. Bagnell, N. Vandapel, and M. Hebert. Contextual classification with functional max-margin Markov networks. In CVPR, 2009.
-  J. B. Orlin. A faster strongly polynomial time algorithm for submodular function minimization. Math. Program., 118(2):237–251, Jan. 2009.
-  S, Roth and M, Black. Fields of experts. IJCV, 82, 2009.
-  C. Rother, V. Kolmogorov, and A. Blake. “GrabCut” - interactive foreground extraction using iterated graph cuts. SIGGRAPH, 23(3):309–314, 2004.
-  R. Sipos, P. Shivaswamy, and T. Joachims. Large-margin learning of submodular summarization models. In EACL, 2012.
-  M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In ECCV, 2008.
-  B. Taskar, V. Chatalbashev, and D. Koller. Learning associative markov networks. In ICML, 2004.
-  B. Taskar, V. Chatalbashev, and D. Koller. Learning associative markov networks. In ICML. ACM, 2004.
-  B. Taskar, C. Guestrin, and D. Koller. Maximum-margin markov networks. In NIPS, 2003.
-  I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
-  Y. Yue and T. Joachims. Predicting diverse subsets using structural SVMs. In ICML, 2008.