Parameter Learning for Log-supermodular Distributions

by   Tatiana Shpakova, et al.

We consider log-supermodular models on binary variables, which are probabilistic models with negative log-densities which are submodular. These models provide probabilistic interpretations of common combinatorial optimization tasks such as image segmentation. In this paper, we focus primarily on parameter estimation in the models from known upper-bounds on the intractable log-partition function. We show that the bound based on separable optimization on the base polytope of the submodular function is always inferior to a bound based on "perturb-and-MAP" ideas. Then, to learn parameters, given that our approximation of the log-partition function is an expectation (over our own randomization), we use a stochastic subgradient technique to maximize a lower-bound on the log-likelihood. This can also be extended to conditional maximum likelihood. We illustrate our new results in a set of experiments in binary image denoising, where we highlight the flexibility of a probabilistic model to learn with missing data.




1 Introduction

Submodular functions provide efficient and flexible tools for learning on discrete data. Several common combinatorial optimization tasks, such as clustering, image segmentation, or document summarization, can be achieved by the minimization or the maximization of a submodular function [1, 8, 14]. The key benefit of submodularity is the ability to model notions of diminishing returns, and the availability of exact minimization algorithms and approximate maximization algorithms with precise approximation guarantees [12].

In practice, it is not always straightforward to define an appropriate submodular function for a problem at hand. Given fully-labeled data, e.g., images and their foreground/background segmentations in image segmentation, structured-output prediction methods such as the structured-SVM may be used [18]. However, it is common (a) to have missing data, and (b) to embed submodular function minimization within a larger model. These are two situations well tackled by probabilistic modelling.

Log-supermodular models, with negative log-densities equal to a submodular function, are a first important step toward probabilistic modelling on discrete data with submodular functions [5]. However, it is well known that the log-partition function is intractable in such models. Several bounds have been proposed, accompanied by variational approximate inference [6]. These bounds are based on the submodularity of the negative log-densities. However, parameter learning (typically by maximum likelihood), which is a key feature of probabilistic modelling, has not been tackled yet. We make the following contributions:

  • In Section 3, we review existing variational bounds for the log-partition function and show that the bound of [9], based on “perturb-and-MAP” ideas, formally dominates the bounds proposed by [5, 6].

  • In Section 4.1, we show that for parameter learning via maximum likelihood the existing bound of [5, 6] typically leads to a degenerate solution while the one based on “perturb-and-MAP” ideas and logistic samples [9] does not.

  • In Section 4.2, given that the bound based on “perturb-and-MAP” ideas is an expectation (over our own randomization), we propose to use a stochastic subgradient technique to maximize the lower-bound on the log-likelihood, which can also be extended to conditional maximum likelihood.

  • In Section 5, we illustrate our new results on a set of experiments in binary image denoising, where we highlight the flexibility of a probabilistic model for learning with missing data.

2 Submodular functions and log-supermodular models

In this section, we review the relevant theory of submodular functions and recall typical examples of log-supermodular distributions.

2.1 Submodular functions

We consider submodular functions on the vertices of the hypercube $\{0,1\}^D$. This hypercube representation is equivalent to the power set of $\{1,\dots,D\}$. Indeed, we can go from a vertex of the hypercube to a set by looking at the indices of the components equal to one and from set to vertex by taking the indicator vector of the set.

For any two vertices of the hypercube, $x, y \in \{0,1\}^D$, a function $f: \{0,1\}^D \to \mathbb{R}$ is submodular if $f(x) + f(y) \geq f(\min\{x,y\}) + f(\max\{x,y\})$, where the min and max operations are taken component-wise and correspond to the intersection and union of the associated sets. Equivalently, the function $x \mapsto f(x + e_i) - f(x)$, where $e_i \in \mathbb{R}^D$ is the $i$-th canonical basis vector, is non-increasing. Hence, the notion of diminishing returns is often associated with submodular functions. The most widely used submodular functions are cuts, concave functions of subset cardinality, mutual information, set covers, and certain functions of eigenvalues of submatrices [1, 7]. Supermodular functions are simply negatives of submodular functions.
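To make the definition concrete, the following Python sketch checks the inequality $f(x) + f(y) \geq f(\min\{x,y\}) + f(\max\{x,y\})$ by brute force on a small chain graph. It is only an illustration under our own assumptions: the helper names `cut_function` and `is_submodular` and the edge weights are hypothetical, and enumeration is feasible only for small $D$.

```python
import itertools

def cut_function(weights):
    """Return f(x) = sum of weights[(i, j)] over edges with x_i != x_j."""
    def f(x):
        return sum(w for (i, j), w in weights.items() if x[i] != x[j])
    return f

def is_submodular(f, D, tol=1e-12):
    """Brute-force check of f(x) + f(y) >= f(min(x, y)) + f(max(x, y))."""
    vertices = list(itertools.product((0, 1), repeat=D))
    for x in vertices:
        for y in vertices:
            lo = tuple(min(a, b) for a, b in zip(x, y))  # intersection
            hi = tuple(max(a, b) for a, b in zip(x, y))  # union
            if f(x) + f(y) < f(lo) + f(hi) - tol:
                return False
    return True

# Hypothetical chain graph on 4 nodes with non-negative weights.
chain_cut = cut_function({(0, 1): 1.0, (1, 2): 2.0, (2, 3): 0.5})
# The negative of a cut is supermodular, and here it is not submodular.
neg_cut = lambda x: -chain_cut(x)
print(is_submodular(chain_cut, 4), is_submodular(neg_cut, 4))
```

The double loop visits all $4^D$ pairs of vertices, so this is a sanity check for toy problems rather than a practical test.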

In this paper, we are going to use a few properties of such submodular functions (see [1, 7] and references therein). Any submodular function $f$ can be extended from $\{0,1\}^D$ to a convex function on $\mathbb{R}^D$, which is called the Lovász extension. This extension has the same value on $\{0,1\}^D$, hence we use the same notation $f$. Moreover, this function is convex and piecewise linear, which implies the existence of a polytope $B(f) \subset \mathbb{R}^D$, called the base polytope, such that for all $x \in \mathbb{R}^D$, $f(x) = \max_{s \in B(f)} x^\top s$, that is, $f$ is the support function of $B(f)$. The Lovász extension and the base polytope have explicit expressions that are, however, not relevant to this paper. We will only use the fact that $f$ can be efficiently minimized on $\{0,1\}^D$, by a variety of generic algorithms, or by more efficient dedicated ones for subclasses such as graph-cuts.

2.2 Log-supermodular distributions

Log-supermodular models are introduced in [5] to model probability distributions on a hypercube, $x \in \{0,1\}^D$, and are defined as

$$p(x) = \frac{1}{Z(f)} \exp(-f(x)),$$

where $f: \{0,1\}^D \to \mathbb{R}$ is a submodular function such that $f(0) = 0$ and the partition function is $Z(f) = \sum_{x \in \{0,1\}^D} \exp(-f(x))$. It is more convenient to deal with the convex log-partition function $A(f) = \log Z(f) = \log \sum_{x \in \{0,1\}^D} \exp(-f(x))$. In general, calculation of the partition function $Z(f)$ or the log-partition function $A(f)$ is intractable, as it includes simple binary Markov random fields; the exact calculation is known to be #P-hard [10]. In Section 3, we review upper-bounds for the log-partition function $A(f)$.
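For very small $D$, the log-partition function $A(f) = \log \sum_{x} \exp(-f(x))$ can be computed by enumeration, which is useful as a reference when checking bounds. The sketch below is an illustration only (the names `log_partition` and `modular` are ours); it verifies the result against the closed form $\sum_d \log(1 + e^{-s_d})$, which holds when $f(x) = s^\top x$ is modular.

```python
import itertools
import math

def log_partition(f, D):
    """Exact A(f) = log sum_x exp(-f(x)) by enumerating all 2^D vertices."""
    values = [-f(x) for x in itertools.product((0, 1), repeat=D)]
    m = max(values)  # log-sum-exp shift for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in values))

def modular(s):
    """Modular (linear) function x -> s^T x."""
    return lambda x: sum(si * xi for si, xi in zip(s, x))

s = [0.3, -1.2, 2.0]  # arbitrary illustrative weights
A_exact = log_partition(modular(s), 3)
# For a modular function the sum factorizes over coordinates.
A_closed_form = sum(math.log1p(math.exp(-si)) for si in s)
print(A_exact, A_closed_form)
```

The cost grows as $2^D$, which is exactly the intractability that the upper-bounds of the next section are meant to avoid.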

2.3 Examples

Essentially, all submodular functions used in the minimization context can be used as negative log-densities [5, 6]. In computer vision, the most common examples are graph-cuts, which are essentially binary Markov random fields with attractive potentials, but higher-order potentials have been considered as well [11]. In our experiments, we use graph-cuts, where submodular function minimization may be performed with max-flow techniques and is thus efficient [4]. Note that there are extensions of submodular functions to continuous domains that could be considered as well [2].

3 Upper-bounds on the log-partition function

In this section, we review the main existing upper-bounds on the log-partition function for log-supermodular densities. These upper-bounds use several properties of submodular functions, in particular, the Lovász extension and the base polytope. Note that lower bounds based on submodular maximization aspects and superdifferentials [5] can be used to highlight the tightness of various bounds, which we present in Figure 1.

3.1 Base polytope relaxation with L-Field [5]

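This method exploits the fact that any submodular function $f(x)$ can be lower bounded by a modular function $s(x)$, i.e., a linear function of $x \in \{0,1\}^D$ in the hypercube representation. The submodular function and its lower bound are related by $f(x) = \max_{s \in B(f)} s^\top x$, leading to:

$$A(f) = \log \sum_{x \in \{0,1\}^D} \exp(-f(x)) = \log \sum_{x \in \{0,1\}^D} \min_{s \in B(f)} \exp(-s^\top x),$$

which, by swapping the sum and min, is less than

$$\min_{s \in B(f)} \log \sum_{x \in \{0,1\}^D} \exp(-s^\top x) = \min_{s \in B(f)} \sum_{d=1}^D \log(1 + e^{-s_d}) \overset{\text{def}}{=} A_{\text{L-field}}(f). \qquad (1)$$

Since the polytope $B(f)$ is tractable (through its membership oracle or by maximizing linear functions efficiently), the bound $A_{\text{L-field}}(f)$ is tractable, i.e., computable in polynomial time. Moreover, it has a nice interpretation through convex duality, as the logistic function $\log(1 + e^{-s_d})$ may be represented as $\max_{\mu_d \in [0,1]} \{ -\mu_d s_d - \mu_d \log \mu_d - (1-\mu_d) \log(1-\mu_d) \}$, leading to:

$$A_{\text{L-field}}(f) = \min_{s \in B(f)} \max_{\mu \in [0,1]^D} \{ -\mu^\top s + H(\mu) \} = \max_{\mu \in [0,1]^D} \{ H(\mu) - f(\mu) \},$$

where $H(\mu) = -\sum_{d=1}^D \{ \mu_d \log \mu_d + (1-\mu_d) \log(1-\mu_d) \}$. This shows in particular the convexity of $f \mapsto A_{\text{L-field}}(f)$. Finally, [6] shows the remarkable result that the minimizer $s \in B(f)$ may be obtained by minimizing a simpler function on $B(f)$, namely the squared Euclidean norm, thus leading to algorithms such as the minimum-norm-point algorithm [7].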

3.2 “Perturb-and-MAP” with logistic distributions

Estimating the log-partition function can be done through optimization using “perturb-and-MAP” ideas. The main idea is to perturb the log-density, find the maximum a-posteriori configuration (i.e., perform optimization), and then average over several random perturbations [9, 17, 19].

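The Gumbel distribution on $\mathbb{R}$, whose cumulative distribution function is $F(z) = \exp(-\exp(-(z + c)))$, where $c$ is the Euler constant, is particularly useful. Indeed, if $\{g(y)\}_{y \in \{0,1\}^D}$ is a collection of independent random variables $g(y)$ indexed by $y \in \{0,1\}^D$, each following the Gumbel distribution, then the random variable $\max_{y \in \{0,1\}^D} \{ g(y) - f(y) \}$ is such that we have $\log Z(f) = \mathbb{E}_g \big[ \max_{y \in \{0,1\}^D} \{ g(y) - f(y) \} \big]$ [9, Lemma 1]. The main problem is that we need $2^D$ such variables, and a key contribution of [9] is to show that if we consider a factored collection $\{g_d(y_d)\}_{y_d \in \{0,1\},\, d = 1,\dots,D}$ of i.i.d. Gumbel variables, then we get an upper-bound on the log-partition function, that is, $\log Z(f) \leq \mathbb{E}_g \big[ \max_{y \in \{0,1\}^D} \{ \sum_{d=1}^D g_d(y_d) - f(y) \} \big]$.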
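The identity of [9, Lemma 1], $\log Z(f) = \mathbb{E}_g[\max_y \{g(y) - f(y)\}]$ with one zero-mean Gumbel variable per configuration $y$, can be checked numerically on a toy problem. The sketch below is an illustration under our own assumptions: a chain cut on $D = 3$ variables, NumPy's Gumbel sampler with location $-c$ (so that the mean is zero), and a Monte Carlo average whose accuracy depends on the number of samples.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
D = 3
vertices = np.array(list(itertools.product((0, 1), repeat=D)))

def chain_cut(x):
    """Cut function of a chain graph with unit weights."""
    return float(np.sum(np.abs(np.diff(x))))

f_vals = np.array([chain_cut(x) for x in vertices])  # f on all 2^D vertices
log_Z = np.log(np.sum(np.exp(-f_vals)))

# One zero-mean Gumbel variable per configuration: shape (n_samples, 2^D).
n_samples = 200_000
g = rng.gumbel(loc=-np.euler_gamma, scale=1.0, size=(n_samples, len(vertices)))
estimate = np.mean(np.max(g - f_vals, axis=1))
print(log_Z, estimate)
```

Note that this exact identity needs $2^D$ perturbations per sample, which is what motivates the factored perturbations discussed in the text.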

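Writing $g_d(y_d) = [g_d(1) - g_d(0)]\, y_d + g_d(0)$ and using the fact that (a) $g_d(0)$ has zero expectation and (b) the difference between two independent Gumbel distributions has a logistic distribution (with cumulative distribution function $z \mapsto (1 + e^{-z})^{-1}$) [15], we get the following upper-bound:

$$A_{\text{Logistic}}(f) = \mathbb{E}_{z_1,\dots,z_D \sim \text{logistic}} \Big[ \max_{y \in \{0,1\}^D} \{ z^\top y - f(y) \} \Big], \qquad (2)$$

where the random vector $z \in \mathbb{R}^D$ consists of independent elements taken from the logistic distribution. This is always an upper-bound on $A(f)$ and it uses only the fact that submodular functions are efficient to optimize. It is convex in $f$ as an expectation of a maximum of affine functions of $f$.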
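The logistic bound $\mathbb{E}_z[\max_y \{z^\top y - f(y)\}]$ can be estimated by averaging perturbed maxima over logistic samples. The following sketch is a minimal illustration, assuming a toy chain cut on $D = 4$ variables; the inner maximization is done by brute force, whereas a real implementation would call a submodular minimization routine such as a graph-cut solver. The function name `logistic_upper_bound` is ours.

```python
import itertools
import numpy as np

def logistic_upper_bound(f_vals, vertices, n_samples, rng):
    """Monte Carlo estimate of E_z[max_y z^T y - f(y)] with logistic z."""
    z = rng.logistic(size=(n_samples, vertices.shape[1]))
    scores = z @ vertices.T - f_vals  # shape (n_samples, 2^D)
    return float(np.mean(np.max(scores, axis=1)))

rng = np.random.default_rng(0)
D = 4
vertices = np.array(list(itertools.product((0, 1), repeat=D)), dtype=float)
# Chain cut with unit weights, evaluated on every vertex.
f_vals = np.sum(np.abs(np.diff(vertices, axis=1)), axis=1)
A_exact = float(np.log(np.sum(np.exp(-f_vals))))
A_logistic = logistic_upper_bound(f_vals, vertices, 100_000, rng)
print(A_exact, A_logistic)
```

Up to Monte Carlo error, the estimate should not fall below the exact value; for a modular $f$ the bound is tight, which gives a convenient sanity check.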

3.3 Comparison of bounds

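In this section, we show that $A_{\text{L-field}}(f)$ is always dominated by $A_{\text{Logistic}}(f)$, that is, the logistic bound is always tighter. This is complemented by another result within the maximum likelihood framework in Section 4.

Proposition 1.

For any submodular function $f$, we have:

$$A(f) \leq A_{\text{Logistic}}(f) \leq A_{\text{L-field}}(f). \qquad (3)$$

The first inequality was shown by [9]. For the second inequality, we have:

$$\begin{aligned}
A_{\text{Logistic}}(f) &= \mathbb{E}_z \Big[ \max_{y \in \{0,1\}^D} z^\top y - f(y) \Big] \\
&= \mathbb{E}_z \Big[ \max_{y \in \{0,1\}^D} z^\top y - \max_{s \in B(f)} s^\top y \Big] \quad \text{from properties of the base polytope } B(f), \\
&= \mathbb{E}_z \Big[ \max_{y \in \{0,1\}^D} \min_{s \in B(f)} z^\top y - s^\top y \Big] \\
&= \mathbb{E}_z \Big[ \min_{s \in B(f)} \max_{y \in [0,1]^D} z^\top y - s^\top y \Big] \quad \text{by convex duality}, \\
&\leq \min_{s \in B(f)} \mathbb{E}_z \Big[ \max_{y \in [0,1]^D} (z - s)^\top y \Big] \quad \text{by swapping expectation and minimization}, \\
&= \min_{s \in B(f)} \mathbb{E}_z \Big[ \sum_{d=1}^D (z_d - s_d)_+ \Big] \quad \text{by explicit maximization}, \\
&= \min_{s \in B(f)} \sum_{d=1}^D \int_{s_d}^{+\infty} (z_d - s_d)\, P(z_d)\, dz_d \\
&= \min_{s \in B(f)} \sum_{d=1}^D \log(1 + e^{-s_d}) = A_{\text{L-field}}(f).
\end{aligned}$$

In the inequality above, since the logistic distribution has full support, there cannot be equality. However, if the base polytope is such that, with high probability, $|s_d| \geq |z_d|$ for all $d$, then the two bounds are close. Since the logistic distribution is concentrated around zero, the two bounds nearly coincide when $|s_d|$ is large for all $d$ and all $s \in B(f)$.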
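The ordering of the exact log-partition function, the logistic bound and the L-field bound can be observed on a two-variable example, where the base polytope is a segment and can be scanned directly. The sketch below is an illustration under our own assumptions: $f(x) = w\,|x_1 - x_2|$ with $w = 1$, the base polytope written for $D = 2$ as $\{s : s_1 + s_2 = f(1,1),\ s_1 \leq f(1,0),\ s_2 \leq f(0,1)\}$, a grid search for the L-field bound, and a Monte Carlo estimate for the logistic bound.

```python
import numpy as np

w = 1.0
# f on the vertices (0,0), (0,1), (1,0), (1,1) of {0,1}^2, with f(0,0) = 0.
vertices = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
f_vals = w * np.abs(vertices[:, 0] - vertices[:, 1])

A_exact = float(np.log(np.sum(np.exp(-f_vals))))

# Base polytope for D = 2: s1 + s2 = f(1,1), s1 <= f(1,0), s2 <= f(0,1).
f01, f10, f11 = f_vals[1], f_vals[2], f_vals[3]
s1 = np.linspace(f11 - f01, f10, 20001)
s2 = f11 - s1
# L-field bound: minimize sum_d log(1 + exp(-s_d)) over the segment.
A_lfield = float(np.min(np.log1p(np.exp(-s1)) + np.log1p(np.exp(-s2))))

# Logistic bound: average of perturbed maxima over logistic samples.
rng = np.random.default_rng(0)
z = rng.logistic(size=(400_000, 2))
A_logistic = float(np.mean(np.max(z @ vertices.T - f_vals, axis=1)))
print(A_exact, A_logistic, A_lfield)
```

In this example the L-field minimizer is $s = (0, 0)$, so the L-field bound equals $2 \log 2$ regardless of $w$, while the exact value decreases as $w$ grows.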

Theoretical complexity of $A_{\text{L-field}}$ and $A_{\text{Logistic}}$.

The logistic bound $A_{\text{Logistic}}$ can be computed if there is an efficient MAP-solver for submodular functions (plus a modular term). In this case, the divide-and-conquer algorithm can be applied for L-Field [5]. Thus, the complexity is dedicated to the minimization of $O(D)$ problems. Meanwhile, for the method based on logistic samples, it is necessary to solve $M$ optimization problems, where $M$ is the number of logistic samples. In our empirical bound comparison (next paragraph), running time was the same for both methods. Note however that for parameter learning, we need a single SFM problem per gradient iteration (and not $M$).

Empirical comparison of $A_{\text{L-field}}$ and $A_{\text{Logistic}}$.

We compare the upper-bounds on the log-partition function $A_{\text{L-field}}$ and $A_{\text{Logistic}}$, with the setup used by [5]. We thus consider data from a Gaussian mixture model with 2 clusters in

. The centers are sampled from and , respectively. Then we sampled points for each cluster. Further, these points are used as nodes in a complete weighted graph, where the weight between points and is equal to .

We consider the graph cut function associated to this weighted graph, which defines a log-supermodular distribution. We then consider conditional distributions, one for each number $k$ of conditioned pairs, on the events that at least $k$ points from the first cluster lie on one side of the cut and at least $k$ points from the second cluster lie on the other side of the cut. For each conditional distribution, we evaluate and compare the two upper bounds.

In Figure 1, we show various bounds on $A(f)$ as a function of the number of conditioned pairs. The logistic upper bound is obtained using 100 logistic samples: the logistic upper-bound $A_{\text{Logistic}}$ is close to the superdifferential lower bound from [5] and is indeed significantly lower than the bound $A_{\text{L-field}}$.


(a) Mean bounds with confidence intervals, c = 1
(b) Mean bounds with confidence intervals, c = 3
Figure 1: Comparison of log-partition function bounds for different values of $c$. See text for details.

3.4 From bounds to approximate inference

Since linear functions are submodular functions, given any convex upper-bound on the log-partition function, we may derive an approximate marginal probability for each $x_d \in \{0,1\}$. Indeed, following [9], we consider an exponential family model $p(x|t) = \exp(-f(x) + t^\top x - A(f - t))$, where $f - t$ is the function $x \mapsto f(x) - t^\top x$. When $f$ is assumed to be fixed, this can be seen as an exponential family with the base measure $\exp(-f(x))$, sufficient statistics $x$, and $A(f - t)$ is the log-partition function. It is known that the expectation of the sufficient statistics under the exponential family model, $\mathbb{E}_{p(x|t)}\, x$, is the gradient of the log-partition function [23]. Hence, any approximation of this log-partition function gives an approximation of this expectation, which in our situation is the vector of marginal probabilities that an element is equal to $1$.

For the L-field bound, at $t = 0$, we have $\partial_{t_d} A_{\text{L-field}}(f - t) = (1 + e^{s_d^*})^{-1}$, where $s^*$ is the minimizer of $\sum_{d=1}^D \log(1 + e^{-s_d})$ over $B(f)$, thus recovering the interpretation of [6] from another point of view.

For the logistic bound, this is the inference mechanism from [9], with $\partial_{t_d} A_{\text{Logistic}}(f - t) = \mathbb{E}_z\, y_d^*(z)$, where $y^*(z)$ is the maximizer of $\max_{y \in \{0,1\}^D} z^\top y - f(y)$. In practice, in order to perform approximate inference, we only sample $M$ logistic variables. We could do the same for parameter learning, but a much more efficient alternative, based on mixing sampling and convex optimization, is presented in the next section.
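The approximate marginals of the logistic bound are obtained by averaging the perturbed maximizers $y^*(z)$ over logistic samples $z$. The following sketch illustrates this on a purely modular function $f(x) = -t^\top x$, for which the exact marginals are $(1 + e^{-t_d})^{-1}$ and the approximation is exact up to sampling noise. The function name `mean_marginals`, the values of $t$ and the brute-force maximization are our own assumptions for illustration.

```python
import itertools
import numpy as np

def mean_marginals(f_vals, vertices, n_samples, rng):
    """Average the perturbed maximizers y*(z) over logistic samples z."""
    z = rng.logistic(size=(n_samples, vertices.shape[1]))
    best = np.argmax(z @ vertices.T - f_vals, axis=1)  # index of y*(z)
    return vertices[best].mean(axis=0)

rng = np.random.default_rng(0)
t = np.array([1.0, -0.5, 0.0])
vertices = np.array(list(itertools.product((0, 1), repeat=3)), dtype=float)
f_vals = -vertices @ t  # purely modular f(x) = -t^T x
approx = mean_marginals(f_vals, vertices, 100_000, rng)
exact = 1.0 / (1.0 + np.exp(-t))  # marginals of the independent model
print(approx, exact)
```

For a general submodular $f$ the averaged maximizers only approximate the true marginals, and the `argmax` over all vertices would be replaced by a MAP solver.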

4 Parameter learning through maximum likelihood

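An advantage of log-supermodular probabilistic models is the opportunity to learn the model parameters from data using the maximum-likelihood principle. In this section, we consider that we are given $N$ observations $x_1, \dots, x_N \in \{0,1\}^D$, e.g., binary images such as shown in Figure 2.

We consider a submodular function $f(x)$ represented as $f(x) = \sum_{k=1}^K \alpha_k f_k(x) - t^\top x$. The modular term $t^\top x$ is explicitly taken into account with $t \in \mathbb{R}^D$, and $K$ base submodular functions $f_k$ are assumed to be given with $\alpha \in \mathbb{R}_+^K$ so that the function $f$ remains submodular. Assuming the data $x_1, \dots, x_N$ are independent and identically distributed (i.i.d.), maximum likelihood is equivalent to minimizing:

$$\min_{\alpha \in \mathbb{R}_+^K,\ t \in \mathbb{R}^D} -\frac{1}{N} \sum_{n=1}^N \log p(x_n | \alpha, t) = \min_{\alpha \in \mathbb{R}_+^K,\ t \in \mathbb{R}^D} \frac{1}{N} \sum_{n=1}^N \Big\{ \sum_{k=1}^K \alpha_k f_k(x_n) - t^\top x_n + A(f) \Big\},$$

which takes the particularly simple form

$$\min_{\alpha \in \mathbb{R}_+^K,\ t \in \mathbb{R}^D} \sum_{k=1}^K \alpha_k \Big( \frac{1}{N} \sum_{n=1}^N f_k(x_n) \Big) - t^\top \Big( \frac{1}{N} \sum_{n=1}^N x_n \Big) + A(\alpha, t), \qquad (4)$$

where we use the notation $A(\alpha, t) = A(f)$. We now consider replacing the intractable log-partition function by its approximations defined in Section 3.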

4.1 Learning with the L-field approximation

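In this section, we show that if we replace $A(f)$ by $A_{\text{L-field}}(f)$, we obtain a degenerate solution. Indeed, we have

$$A_{\text{L-field}}(\alpha, t) = \min_{s \in B(f)} \sum_{d=1}^D \log(1 + e^{-s_d}) = \min_{s \in B(\sum_{k=1}^K \alpha_k f_k)} \sum_{d=1}^D \log(1 + e^{-s_d + t_d}).$$

This implies that Eq. (4) becomes

$$\min_{\alpha \in \mathbb{R}_+^K,\ t \in \mathbb{R}^D}\ \min_{s \in B(\sum_{k=1}^K \alpha_k f_k)} \sum_{k=1}^K \alpha_k \Big( \frac{1}{N} \sum_{n=1}^N f_k(x_n) \Big) - t^\top \Big( \frac{1}{N} \sum_{n=1}^N x_n \Big) + \sum_{d=1}^D \log(1 + e^{-s_d + t_d}).$$

The minimum with respect to $t_d$ may be performed in closed form with $t_d - s_d = \log \frac{\langle x \rangle_d}{1 - \langle x \rangle_d}$, where $\langle x \rangle = \frac{1}{N} \sum_{n=1}^N x_n$. Putting this back into the equation above, we get the equivalent problem:

$$\min_{\alpha \in \mathbb{R}_+^K}\ \min_{s \in B(\sum_{k=1}^K \alpha_k f_k)} \sum_{k=1}^K \alpha_k \Big( \frac{1}{N} \sum_{n=1}^N f_k(x_n) \Big) - s^\top \Big( \frac{1}{N} \sum_{n=1}^N x_n \Big) + \text{const},$$

which is equivalent to, using the representation of $f_k$ as the support function of $B(f_k)$ for any submodular function:

$$\min_{\alpha \in \mathbb{R}_+^K} \sum_{k=1}^K \alpha_k \Big[ \frac{1}{N} \sum_{n=1}^N f_k(x_n) - f_k\Big( \frac{1}{N} \sum_{n=1}^N x_n \Big) \Big] + \text{const}.$$

Since $f_k$ is convex, by Jensen’s inequality, the linear term in $\alpha_k$ is non-negative; thus maximum likelihood through L-field will lead to a degenerate solution where all $\alpha_k$’s are equal to zero.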
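The sign of the coefficient multiplying each $\alpha_k$ can be checked numerically when $f_k$ is a cut function, since the Lovász extension of a cut with unit weights on a chain is $\sum_i |x_i - x_{i+1}|$. The sketch below is an illustration only: the random binary data and the name `chain_cut_lovasz` are assumptions, not part of the experiments reported here.

```python
import numpy as np

def chain_cut_lovasz(x):
    """Chain cut sum_i |x_i - x_{i+1}|; also its Lovasz extension on R^D."""
    return float(np.sum(np.abs(np.diff(x))))

rng = np.random.default_rng(0)
# 50 synthetic binary vectors with D = 10 (illustrative data only).
X = rng.integers(0, 2, size=(50, 10)).astype(float)

mean_of_f = np.mean([chain_cut_lovasz(x) for x in X])  # average of f_k(x_n)
f_of_mean = chain_cut_lovasz(X.mean(axis=0))           # f_k at the mean
jensen_gap = mean_of_f - f_of_mean  # coefficient multiplying alpha_k
print(jensen_gap)
```

Because the gap is non-negative for any dataset, minimizing over $\alpha_k \geq 0$ pushes each $\alpha_k$ to zero, which is the degeneracy described above.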

4.2 Learning with the logistic approximation with stochastic gradients

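In this section we consider the problem (4) and replace $A(f)$ by $A_{\text{Logistic}}(f)$:

$$\min_{\alpha \in \mathbb{R}_+^K,\ t \in \mathbb{R}^D} \sum_{k=1}^K \alpha_k \langle f_k(x) \rangle_{\text{emp.}} - t^\top \langle x \rangle_{\text{emp.}} + \mathbb{E}_{z \sim \text{logistic}} \Big[ \max_{y \in \{0,1\}^D} z^\top y + t^\top y - \sum_{k=1}^K \alpha_k f_k(y) \Big], \qquad (5)$$

where $\langle M(x) \rangle_{\text{emp.}}$ denotes the empirical average of $M(x)$ (over the data).

Denoting by $y^*(z, t, \alpha) \in \{0,1\}^D$ the maximizers of $z^\top y + t^\top y - \sum_{k=1}^K \alpha_k f_k(y)$, the objective function may be written:

$$\sum_{k=1}^K \alpha_k \big[ \langle f_k(x) \rangle_{\text{emp.}} - \langle f_k(y^*(z,t,\alpha)) \rangle_{\text{logistic}} \big] - t^\top \big[ \langle x \rangle_{\text{emp.}} - \langle y^*(z,t,\alpha) \rangle_{\text{logistic}} \big] + \langle z^\top y^*(z,t,\alpha) \rangle_{\text{logistic}}.$$

This implies that at optimum, for $\alpha_k > 0$, we have $\langle f_k(x) \rangle_{\text{emp.}} = \langle f_k(y^*(z,t,\alpha)) \rangle_{\text{logistic}}$, while $\langle x \rangle_{\text{emp.}} = \langle y^*(z,t,\alpha) \rangle_{\text{logistic}}$: the expected values of the sufficient statistics match between the data and the optimizers used for the logistic approximation [9].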

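In order to minimize the expectation in Eq. (5), we propose to use the projected stochastic gradient method, not on the data as usually done, but on our own internal randomization. The algorithm then becomes, once we add a weighted $\ell_2$-regularization $\Omega(t, \alpha)$:

  • Input: functions $f_k$, $k = 1, \dots, K$, and expected sufficient statistics $\langle f_k(x) \rangle_{\text{emp.}} \in \mathbb{R}$ and $\langle x \rangle_{\text{emp.}} \in [0,1]^D$, regularizer $\Omega(t, \alpha)$.

  • Initialization: $\alpha = 0$, $t = 0$

  • Iterations: for $h$ from $1$ to $H$

    • Sample $z \in \mathbb{R}^D$ as independent logistics

    • Compute $y^* = y^*(z, t, \alpha) \in \arg\max_{y \in \{0,1\}^D} z^\top y + t^\top y - \sum_{k=1}^K \alpha_k f_k(y)$

    • Replace $t$ by $t - \frac{C}{\sqrt{h}} \big[ y^* - \langle x \rangle_{\text{emp.}} + \partial_t \Omega(t, \alpha) \big]$

    • Replace $\alpha_k$ by $\big( \alpha_k - \frac{C}{\sqrt{h}} \big[ \langle f_k(x) \rangle_{\text{emp.}} - f_k(y^*) + \partial_{\alpha_k} \Omega(t, \alpha) \big] \big)_+$.

  • Output: $(\alpha, t)$.

Since our cost function is convex and Lipschitz-continuous, the averaged iterates converge to the global optimum [16] at rate $1/\sqrt{H}$ (for function values).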
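The iterations above can be sketched in a few lines of Python for a toy problem. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the maximizer $y^*$ is found by enumerating all $2^D$ vertices (a graph-cut solver would be used in practice), the regularizer is taken as $\frac{\lambda}{2}(\|t\|^2 + \|\alpha\|^2)$, and the names `learn`, `step` and `lam` as well as the toy data are hypothetical.

```python
import itertools
import numpy as np

def learn(f_list, x_mean, f_emp, n_iters, step=0.5, lam=1e-3, seed=0):
    """Projected stochastic subgradient on (alpha, t); returns averaged iterates."""
    rng = np.random.default_rng(seed)
    D, K = len(x_mean), len(f_list)
    vertices = np.array(list(itertools.product((0, 1), repeat=D)), dtype=float)
    F = np.array([[f(y) for y in vertices] for f in f_list])  # shape (K, 2^D)
    alpha, t = np.zeros(K), np.zeros(D)
    alpha_sum, t_sum = np.zeros(K), np.zeros(D)
    for h in range(1, n_iters + 1):
        z = rng.logistic(size=D)
        # Brute-force MAP; a graph-cut solver would replace this for large D.
        idx = int(np.argmax(vertices @ (z + t) - alpha @ F))
        y_star = vertices[idx]
        gamma = step / np.sqrt(h)
        grad_t = y_star - x_mean + lam * t
        grad_alpha = f_emp - F[:, idx] + lam * alpha
        t = t - gamma * grad_t
        alpha = np.maximum(alpha - gamma * grad_alpha, 0.0)  # projection
        alpha_sum += alpha
        t_sum += t
    return alpha_sum / n_iters, t_sum / n_iters

chain_cut = lambda y: float(np.sum(np.abs(np.diff(y))))
# Toy data: 90% all-ones and 10% all-zeros vectors in {0,1}^3.
X = np.array([[1, 1, 1]] * 9 + [[0, 0, 0]], dtype=float)
f_emp = np.array([np.mean([chain_cut(x) for x in X])])
alpha_hat, t_hat = learn([chain_cut], X.mean(axis=0), f_emp, n_iters=2000)
print(alpha_hat, t_hat)
```

Each iteration requires a single maximization, so the cost per step is that of one submodular minimization; the step-size constant and the number of iterations would need tuning on real data.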

4.3 Extension to conditional maximum likelihood

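In our experiments in Section 5, we consider a joint model over two binary vectors $x, z \in \{0,1\}^D$, as follows

$$p(x, z | \alpha, t, \pi) = p(x | \alpha, t)\, p(z | x, \pi) = \exp(-f(x) - A(f)) \prod_{d=1}^D \pi_d^{\delta(z_d \neq x_d)} (1 - \pi_d)^{\delta(z_d = x_d)}, \qquad (6)$$

which corresponds to sampling $x$ from a log-supermodular model and considering $z$ that switches the values of $x$ with probability $\pi_d$ for each $d$, that is, a noisy observation of $x$. We have:

$$\log p(x, z | \alpha, t, \pi) = -f(x) - A(f) + \sum_{d=1}^D \big\{ -\log(1 + e^{u_d}) + x_d u_d + z_d u_d - 2 x_d z_d u_d \big\},$$

with $u_d = \log \frac{\pi_d}{1 - \pi_d}$, which is equivalent to $\pi_d = (1 + e^{-u_d})^{-1}$.

Using Bayes rule, we have $p(x | z, \alpha, t, \pi) \propto \exp(-f(x) - A(f) + x^\top u - 2 x^\top (u \circ z))$, which leads to a log-supermodular model of the form $p(x | z, \alpha, t, \pi) = \exp(-f(x) + x^\top (u - 2 u \circ z) - A(f - u + 2 u \circ z))$.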
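The rewriting of the noise model in terms of the log-odds $u_d = \log \frac{\pi_d}{1 - \pi_d}$ relies on the identity $\delta(z_d \neq x_d) = x_d + z_d - 2 x_d z_d$ for binary variables. The short sketch below checks numerically, for all pairs $(x, z)$ with $D = 3$, that the direct and the log-odds expressions of $\log p(z | x, \pi)$ agree; the function names and the values of $\pi$ are illustrative assumptions.

```python
import itertools
import math

def log_noise_direct(x, z, pi):
    """log p(z | x): each z_d differs from x_d with probability pi_d."""
    return sum(math.log(p if zd != xd else 1.0 - p)
               for xd, zd, p in zip(x, z, pi))

def log_noise_logit(x, z, pi):
    """Same quantity written with u_d = log(pi_d / (1 - pi_d))."""
    total = 0.0
    for xd, zd, p in zip(x, z, pi):
        u = math.log(p / (1.0 - p))
        total += -math.log1p(math.exp(u)) + u * (xd + zd - 2 * xd * zd)
    return total

pi = [0.1, 0.25, 0.4]  # illustrative flip probabilities
max_gap = max(abs(log_noise_direct(x, z, pi) - log_noise_logit(x, z, pi))
              for x in itertools.product((0, 1), repeat=3)
              for z in itertools.product((0, 1), repeat=3))
print(max_gap)
```

Since the log-odds form is linear in $x$ for fixed $z$, conditioning on $z$ only adds a modular term to $f$, which keeps the conditional model log-supermodular.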

Thus, if we observe both and , we can consider a conditional maximization of the log-likelihood (still a convex optimization problem), which we do in our experiments for supervised image denoising, where we assume we know both noisy and original images at training time. Stochastic gradient on the logistic samples can then be used. Note that our conditional ML estimation can be seen as a form of approximate conditional random fields [13].

While supervised learning can be achieved by other techniques such as structured-output-SVMs [18, 20, 22], our probabilistic approach also applies when we do not observe the original image, which we now consider.

4.4 Missing data through maximum likelihood

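In the model in Eq. (6), we now assume we only observed the noisy output $z$, and we want to perform parameter learning for $\alpha, t, \pi$. This is a latent variable model for which ML can be readily applied. We have:

$$\begin{aligned}
\log p(z | \alpha, t, \pi) &= \log \sum_{x \in \{0,1\}^D} p(z, x | \alpha, t, \pi) \\
&= \log \sum_{x \in \{0,1\}^D} \exp(-f(x) - A(f)) \prod_{d=1}^D \pi_d^{\delta(z_d \neq x_d)} (1 - \pi_d)^{\delta(z_d = x_d)} \\
&= A(f - u + 2 u \circ z) - A(f) + z^\top u - \sum_{d=1}^D \log(1 + e^{u_d}).
\end{aligned}$$

In practice, we will assume that the noise probability $\pi$ (and hence $u$) is uniform across all elements. While we could use majorization-minimization approaches such as the expectation-maximization algorithm (EM), we consider instead stochastic subgradient descent to learn the model parameters $\alpha$, $t$ and $u$ (now a non-convex optimization problem, for which we still observed good convergence).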
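With a uniform noise level $\pi$ and $u = \log \frac{\pi}{1 - \pi}$, the marginal log-likelihood of a noisy observation $z$ can be written as a difference of two log-partition functions, $A(f - w) - A(f) + u \sum_d z_d - D \log(1 + e^{u})$ with $w_d = u - 2 u z_d$, where $f - w$ denotes $x \mapsto f(x) - w^\top x$. The sketch below checks this against direct marginalization over $x$ on a toy chain cut; the setup ($D = 3$, $\pi = 0.2$, the observation $z$) is an assumption chosen for illustration.

```python
import itertools
import math

def log_partition(f, D):
    """A(f) = log sum_x exp(-f(x)), by enumeration."""
    return math.log(sum(math.exp(-f(x))
                        for x in itertools.product((0, 1), repeat=D)))

def chain_cut(x):
    return sum(abs(a - b) for a, b in zip(x, x[1:]))

D, pi = 3, 0.2
u = math.log(pi / (1.0 - pi))
z = (1, 0, 1)  # illustrative noisy observation
A_f = log_partition(chain_cut, D)

# Direct marginalization: sum over x of p(x) * p(z | x).
direct = math.log(sum(
    math.exp(-chain_cut(x) - A_f)
    * math.prod(pi if zd != xd else 1.0 - pi for xd, zd in zip(x, z))
    for x in itertools.product((0, 1), repeat=D)))

# Difference of log-partition functions, with weights w_d = u - 2 u z_d.
tilted = lambda x: chain_cut(x) - sum((u - 2 * u * zd) * xd
                                      for xd, zd in zip(x, z))
via_A = log_partition(tilted, D) - A_f + u * sum(z) - D * math.log1p(math.exp(u))
print(direct, via_A)
```

Both log-partition terms are convex in the parameters, so the log-likelihood is a difference of convex functions, which is why the resulting learning problem is no longer convex.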

5 Experiments

The aim of our experiments is to demonstrate the ability of our approach to remove noise in binary images, following the experimental set-up of [9]. We consider a training sample and a test sample of binary images, each containing a horse silhouette from the Weizmann horse database [3]. We first add noise by flipping pixel values independently with probability $\pi$. In Figure 2, we provide an example from the test sample: the original, the noisy and the denoised image (by our algorithm).

We consider the model from Section 4.3, with the two functions $f_1(x), f_2(x)$, which are horizontal and vertical cut functions with binary weights respectively, together with a modular term of dimension $D$. To perform minimization we use graph-cuts [4], as we deal with positive or attractive potentials.

(a) original image
(b) noisy image
(c) denoised image
Figure 2: Denoising of a horse image from the Weizmann horse database [3].

5.1 Supervised image denoising

We assume that we observe $N$ pairs of original-noisy images, $(x_n, z_n)$, $n = 1, \dots, N$. We perform parameter inference by maximum likelihood using stochastic subgradient descent (over the logistic samples), with regularization by the squared $\ell_2$-norm, one parameter for $t$, one for $\alpha$, both learned by cross-validation. Given our estimates, we may denoise a new image by computing the “max-marginal”, i.e., the maximum a posteriori through a single graph-cut, or computing “mean-marginals” with 100 logistic samples. To calculate the error we use the normalized Hamming distance and 100 test images.

Results are presented in Table 1, where we compare the two types of decoding, as well as a structured output SVM (SVM-Struct [22]) applied to the same problem. Results are reported as the proportion of incorrectly recovered pixels (normalized Hamming error). We see that the probabilistic models here outperform the max-margin formulation and that using mean-marginals (which is optimal given our loss measure) leads to slightly better performance.
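The noise process and the error measure used in these experiments are simple to state in code. The sketch below is an illustration on a synthetic image (a rectangle, not a Weizmann horse); the helper names `flip_pixels` and `normalized_hamming` are ours. It also shows the trivial baseline of returning the noisy image unchanged, whose error is close to the noise level.

```python
import numpy as np

def flip_pixels(image, pi, rng):
    """Flip each binary pixel independently with probability pi."""
    flips = rng.random(image.shape) < pi
    return np.where(flips, 1 - image, image)

def normalized_hamming(a, b):
    """Fraction of pixels on which two binary images disagree."""
    return float(np.mean(a != b))

rng = np.random.default_rng(0)
clean = np.zeros((50, 50), dtype=int)
clean[15:35, 10:40] = 1  # synthetic rectangle, not a Weizmann horse
noisy = flip_pixels(clean, 0.1, rng)
# Baseline: outputting the noisy image gives an error near the noise level.
print(normalized_hamming(clean, noisy))
```

Any denoising method is useful only if its normalized Hamming error falls below this baseline.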

noise   max-marg.   std     mean-marginals   std     SVM-Struct   std
1%      0.4%        <0.1%   0.4%             <0.1%   0.6%         <0.1%
5%      1.1%        <0.1%   1.1%             <0.1%   1.5%         <0.1%
10%     2.1%        <0.1%   2.0%             <0.1%   2.8%         0.3%
20%     4.2%        <0.1%   4.1%             <0.1%   6.0%         0.6%
Table 1: Supervised denoising results.

        $\pi$ is fixed                               $\pi$ is not fixed
noise   max-marg.   std     mean-marg.   std     max-marg.   std    mean-marg.   std
1%      0.5%        <0.1%   0.5%         <0.1%   1.0%        -      1.0%         -
5%      0.9%        0.1%    1.0%         0.1%    3.5%        0.9%   3.6%         0.8%
10%     1.9%        0.4%    2.1%         0.4%    6.8%        2.2%   7.0%         2.0%
20%     5.3%        2.0%    6.0%         2.0%    20.0%       -      20.0%        -
Table 2: Unsupervised denoising results.

5.2 Unsupervised image denoising

We now only consider noisy images to learn the model, without the original images, and we use the latent model from Section 4.4. We apply stochastic subgradient descent to the difference of the two convex functions to learn the model parameters, and use fixed regularization parameters.

We consider two situations: with a known noise level $\pi$, or with learning it together with $\alpha$ and $t$. The error was calculated using either max-marginals or mean-marginals. Note that here, structured-output SVMs cannot be used because there is no supervision. Results are reported in Table 2. One explanation for the better performance of max-marginals in this case is that the unsupervised approach tends to oversmooth the outcome and max-marginals correct this a bit.

When the noise level is known, the performance compared to supervised learning is not degraded much, showing the ability of the probabilistic models to perform parameter estimation with missing data. When the noise level is unknown and learned as well, results are worse: still better than a trivial answer for moderate levels of noise (5% and 10%), but not better than outputting the noisy image for extreme levels (1% and 20%). In the challenging fully unsupervised case, the standard deviation is up to 2.2% (which shows that our results are statistically significant).

6 Conclusion

In this paper, we have presented how approximate inference based on stochastic gradient and “perturb-and-MAP” ideas could be used to learn parameters of log-supermodular models, allowing us to benefit from the versatility of probabilistic modelling, in particular in terms of parameter estimation with missing data. While our experiments have focused on simple binary image denoising, exploring larger-scale applications in computer vision (such as done by [21, 24]) should also show the benefits of mixing probabilistic modelling and submodular functions.


This work was funded by the MacSeNet Innovative Training Network. We would like to thank Sesh Kumar, Anastasia Podosinnikova and Anton Osokin for interesting discussions related to this work.


  • [1] F. Bach. Learning with submodular functions: a convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145–373, 2013.
  • [2] F. Bach. Submodular functions: from discrete to continuous domains. Technical Report 1511.00394, arXiv, 2015.
  • [3] E. Borenstein, E. Sharon, and S. Ullman. Combining Top-down and Bottom-up Segmentation. In Proc. ECCV, 2004.
  • [4] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239, 2001.
  • [5] J. Djolonga and A. Krause. From MAP to Marginals: Variational Inference in Bayesian Submodular Models. In Adv. NIPS, 2014.
  • [6] J. Djolonga and A. Krause. Scalable Variational Inference in Log-supermodular Models. In Proc. ICML, 2015.
  • [7] S. Fujishige. Submodular Functions and Optimization. Annals of discrete mathematics. Elsevier, 2005.
  • [8] D. Golovin and A. Krause. Adaptive Submodularity: Theory and Applications in Active Learning and Stochastic Optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.
  • [9] T. Hazan and T. Jaakkola. On the Partition Function and Random Maximum A-Posteriori Perturbations. In Proc. ICML, 2012.
  • [10] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22(5):1087–1116, 1993.
  • [11] P. Kohli, L. Ladicky, and P. H. S. Torr. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302–324, 2009.
  • [12] A. Krause and D. Golovin. Submodular function maximization. In Tractability: Practical Approaches to Hard Problems. Cambridge University Press, 2014.
  • [13] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML, 2001.
  • [14] H. Lin and J. Bilmes. A class of submodular functions for document summarization. In Proc. NAACL/HLT, 2011.
  • [15] S. Nadarajah and S. Kotz. A generalized logistic distribution. International Journal of Mathematics and Mathematical Sciences, 19:3169 – 3174, 2005.
  • [16] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • [17] G. Papandreou and A. Yuille. Perturb-and-map random fields: Using discrete optimization to learn and sample from energy models. In Proc. ICCV, 2011.
  • [18] M. Szummer, P. Kohli, and D. Hoiem. Learning CRFs using graph cuts. In Proc. ECCV, 2008.
  • [19] D. Tarlow, R.P. Adams, and R.S. Zemel. Randomized optimum models for structured prediction. In Proc. AISTATS, 2012.
  • [20] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. 2003.
  • [21] S. Tschiatschek, J. Djolonga, and A. Krause. Learning probabilistic submodular diversity models via noise contrastive estimation. In Proc. AISTATS, 2016.
  • [22] I. Tsochantaridis, T. Joachims, T. Hofmann, and Y. Altun. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
  • [23] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.
  • [24] J. Zhang, J. Djolonga, and A. Krause. Higher-order inference for multi-class log-supermodular models. In Proc. ICCV, 2015.