Approximate Inference via Weighted Rademacher Complexity

01/27/2018 · Jonathan Kuck et al. · Stanford University and Allen Institute for Artificial Intelligence

Rademacher complexity is often used to characterize the learnability of a hypothesis class and is known to be related to the class size. We leverage this observation and introduce a new technique for estimating the size of an arbitrary weighted set, defined as the sum of weights of all elements in the set. Our technique provides upper and lower bounds on a novel generalization of Rademacher complexity to the weighted setting in terms of the weighted set size. This generalizes Massart's Lemma, a known upper bound on the Rademacher complexity in terms of the unweighted set size. We show that the weighted Rademacher complexity can be estimated by solving a randomly perturbed optimization problem, allowing us to derive high-probability bounds on the size of any weighted set. We apply our method to the problems of calculating the partition function of an Ising model and computing propositional model counts (#SAT). Our experiments demonstrate that we can produce tighter bounds than competing methods in both the weighted and unweighted settings.

1 Introduction

A wide variety of problems can be reduced to computing the sum of (many) non-negative numbers. These include calculating the partition function of a graphical model, propositional model counting (#SAT), and calculating the permanent of a non-negative matrix. Equivalently, each can be viewed as computing the discrete integral of a non-negative weight function. Exact summation, however, is generally intractable due to the curse of dimensionality [Bellman1961].

As alternatives to exact computation, variational methods [Jordan et al.1998, Wainwright, Jordan, and others2008] and sampling [Jerrum and Sinclair1996, Madras2002] are popular approaches for approximate summation. However, they generally do not guarantee the estimate’s quality.

An emerging line of work estimates and formally bounds propositional model counts or, more generally, discrete integrals [Ermon et al.2013a, Chakraborty, Meel, and Vardi2013, Ermon et al.2014, Zhao et al.2016]. These approaches reduce the problem of integration to solving a small number of optimization problems involving the same weight function but subject to additional random constraints introduced by a random hash function. This results in approximating the #P-hard problem of exact summation [Valiant1979] using the solutions of NP-hard optimization problems.

Optimization can be performed efficiently for certain classes of weight functions, such as those involved in the computation of the permanent of a non-negative matrix. If instead of summing (permanent computation) we maximize the same weight function, we obtain a maximum weight matching problem, which is in fact solvable in polynomial time [Kuhn1955]. However, adding hash-based constraints makes the maximum matching optimization problem intractable, which limits the application of randomized hashing approaches [Ermon et al.2013c]. On the other hand, there do exist fully polynomial-time randomized approximation schemes (FPRAS) for non-negative permanent computation [Jerrum, Sinclair, and Vigoda2004, Bezáková et al.2006]. This gives hope that approximation schemes may exist for other counting problems even when optimization with hash-based constraints is intractable.

We present a new method for approximating and bounding the size of a general weighted set (i.e., the sum of the weights of its elements) using geometric arguments based on the set's shape. Our approach, rather than relying on hash-based techniques, establishes a novel connection with Rademacher complexity [Shalev-Shwartz and Ben-David2014]. This generalizes geometric approaches developed for the unweighted case to the weighted setting, such as the work of [Barvinok1997], who uses similar reasoning but without connecting it to Rademacher complexity. In particular, we first generalize Rademacher complexity to weighted sets. While Rademacher complexity is defined as the maximum of the sum of Rademacher variables over a set, the weighted Rademacher complexity also accounts for the weight of each element in the set. Just as Rademacher complexity is related to the size of the set, we show that the weighted Rademacher complexity is related to the total weight of the set. Further, it can be estimated by solving multiple instances of a maximum weight optimization problem, subject to random Rademacher perturbations. Notably, the resulting optimization problem turns out to be computationally much simpler than that required by the aforementioned randomized hashing schemes. In particular, if the weight function is log-supermodular, the corresponding weighted Rademacher complexity can be estimated efficiently, as our perturbation does not change the original optimization problem's complexity [Orlin2009, Bach and others2013].

Our approach most closely resembles a recent line of work involving the Gumbel distribution [Hazan and Jaakkola2012, Hazan, Maji, and Jaakkola2013, Hazan et al.2016, Balog et al.2017, Mussmann and Ermon2016, Mussmann, Levy, and Ermon2017]. There, the Gumbel-max idea is used to bound the partition function by performing MAP inference on a model where the unnormalized probability of each state is perturbed by random noise variables sampled from a Gumbel distribution. While very powerful, exact application of the Gumbel method is impractical, as it requires exponentially many independent random perturbations. One instead uses local approximations of the technique.

Empirically, on spin glass models we show that our technique yields tighter upper bounds and similar lower bounds compared with the Gumbel method, given similar computational resources. On a suite of #SAT model counting instances our approach generally produces comparable or tighter upper and lower bounds given limited computation.

2 Background

Rademacher complexity is an important tool used in learning theory to bound the generalization error of a hypothesis class [Shalev-Shwartz and Ben-David2014].

Definition 1.

The Rademacher complexity of a set $A \subseteq \mathbb{R}^n$ is defined as

$$ R(A) := \mathbb{E}_{c}\Big[ \sup_{a \in A} \langle c, a \rangle \Big], \qquad (1) $$

where $\mathbb{E}_{c}$ denotes expectation over $c$, and $c$ is sampled uniformly from $\{-1,+1\}^n$.

As the name suggests, it is a measure of the complexity of the set $A$ (which, in learning theory, is usually a hypothesis class). It measures how "expressive" $A$ is by evaluating how well we can "fit" a random noise vector $c$ by choosing the closest vector (or hypothesis) from $A$. Intuitively, Rademacher complexity is related to $|A|$, the number of vectors in $A$, another crude notion of the complexity of $A$. However, it also depends on how the vectors in $A$ are arranged in the ambient space $\mathbb{R}^n$. A central focus of this paper will be establishing quantitative relationships between $R(A)$ and $|A|$.
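To make Definition 1 concrete, here is a small Monte Carlo sketch (our own illustration, not code from the paper) that estimates $R(A)$ for two explicit two-element sets; the sets, sample count, and function names are ours.

```python
import numpy as np

def rademacher_complexity_mc(A, num_samples=10000, seed=0):
    """Monte Carlo estimate of E_c[ max_{a in A} <c, a> ], with c uniform on {-1,+1}^n."""
    rng = np.random.default_rng(seed)
    A = np.asarray(A, dtype=float)           # shape (|A|, n): one candidate vector per row
    n = A.shape[1]
    total = 0.0
    for _ in range(num_samples):
        c = rng.choice([-1.0, 1.0], size=n)  # random noise direction
        total += np.max(A @ c)               # best "fit" to the noise over the set
    return total / num_samples

n = 10
tight = np.array([[1] * n, [1] * (n - 1) + [-1]])  # two nearly identical vectors
spread = np.array([[1] * n, [-1] * n])             # two opposite vectors
print(rademacher_complexity_mc(tight))   # small: the set occupies one direction
print(rademacher_complexity_mc(spread))  # larger: the set covers opposite directions
```

Both sets have the same cardinality, but the spread-out set has a larger estimated Rademacher complexity, previewing the shape dependence discussed in Section 4.2.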

A key property of Rademacher complexity that makes it extremely useful in learning theory is that it can be estimated using a small number of random noise samples under mild conditions [Shalev-Shwartz and Ben-David2014]. The result follows from McDiarmid’s inequality:

Proposition 1 (McDiarmid's inequality [McDiarmid1989]).

Let $X_1, \ldots, X_m$ be independent random variables. Let $f$ be a function that satisfies the bounded differences condition: for all $i$ and all values $x_1, \ldots, x_m, x_i'$,

$$ \big| f(x_1, \ldots, x_i, \ldots, x_m) - f(x_1, \ldots, x_i', \ldots, x_m) \big| \le b_i. $$

Then for all $t > 0$,

$$ \Pr\big( \big| f(X_1, \ldots, X_m) - \mathbb{E}[f(X_1, \ldots, X_m)] \big| \ge t \big) \le 2 \exp\!\left( \frac{-2 t^{2}}{\sum_{i=1}^{m} b_i^{2}} \right). $$

McDiarmid’s inequality says we can bound, with high probability, how far a function of independent random variables may deviate from its expected value, given that the function does not change much when the value of a single variable is changed. Because the supremum in Eq. (1), viewed as a function of $c$, satisfies this property [Shalev-Shwartz and Ben-David2014], we can use Eq. (1) to bound $R(A)$ with high probability by computing the supremum for only a small number of noise samples $c$.
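As a quick sanity check of this claim (our derivation, in the notation introduced above), the bounded differences condition for the supremum in Eq. (1) follows directly:

```latex
% f(c) = sup_{a in A} <c, a>.  Flipping a single coordinate c_i changes each inner
% product <c, a> by at most 2|a_i|, and hence changes the supremum by at most
% b_i = 2 max_{a in A} |a_i|.  McDiarmid's inequality then gives, for all t > 0,
\[
  \Pr\!\left( \Big| \sup_{a \in A} \langle c, a \rangle
        - \mathbb{E}_{c}\Big[\sup_{a \in A} \langle c, a \rangle\Big] \Big| \ge t \right)
  \;\le\; 2 \exp\!\left( \frac{-2 t^{2}}{\sum_{i=1}^{n} b_i^{2}} \right),
  \qquad b_i = 2 \max_{a \in A} |a_i|.
\]
% For A \subseteq \{-1,+1\}^n this is 2 exp(-t^2 / (2n)), so a single draw of the
% supremum already lies within O(sqrt(n)) of R(A) with high probability.
```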

3 Problem Setup

In this section we formally define our problem and introduce the optimization oracle central to our solution. Let $w: \{-1,+1\}^n \to \mathbb{R}_{\ge 0}$ be a non-negative weight function. We consider the problem of computing the sum

$$ Z := \sum_{x \in \{-1,+1\}^n} w(x). $$

Many problems, including computing the partition function of an undirected graphical model, where $w(x)$ is the unnormalized probability of state $x$ (see [Koller and Friedman2009]), propositional model counting (#SAT), and computing the permanent of a non-negative matrix, can be reduced to calculating this sum. The problem is challenging because explicit calculation requires summing over $2^n$ states, which is computationally intractable in cases of interest.

Due to the general intractability of exactly calculating $Z$, we focus on an efficient approach for estimating $Z$ that additionally provides upper and lower bounds holding with high probability. Our method depends on the following assumption:

Assumption 1.

We assume the existence of an optimization oracle that can output the value

$$ \delta_w(c) := \max_{x \in \{-1,+1\}^n} \big\{ \langle c, x \rangle + \log_2 w(x) \big\} \qquad (2) $$

for any vector $c \in \mathbb{R}^n$ and weight function $w$, with the convention $\log_2 0 = -\infty$.

Note that throughout the paper we suppress arguments of $\delta_w(c)$ that are clear from context, and assume $w$ is not identically zero. Assumption 1 is reasonable, as there are many classes of models for which such an oracle exists. For instance, polynomial-time algorithms exist for finding the maximum weight matching in a weighted bipartite graph [Hopcroft and Karp1971, Jonker and Volgenant1987]. Graph cut algorithms can be applied to efficiently maximize a class of energy functions [Kolmogorov and Zabin2004]. More generally, MAP inference can be performed efficiently for any log-supermodular weight function [Orlin2009, Chakrabarty, Jain, and Kothari2014, Fujishige1980]. Our perturbation preserves log-supermodularity, since $\langle c, x \rangle$ decomposes into independent single-variable perturbations, so we have an efficient optimization oracle whenever the original weight function is log-supermodular. Further, notice that this is a much weaker assumption than the optimization oracle required by randomized hashing methods [Chakraborty, Meel, and Vardi2013, Ermon et al.2014, Zhao et al.2016].
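As an illustration of Assumption 1 (not the oracle used in our experiments), the following sketch computes the perturbed optimum by brute force for a tiny weight function; it assumes the oracle form $\max_{x \in \{-1,+1\}^n}\{\langle c, x\rangle + \log_2 w(x)\}$ and is only feasible for very small $n$, whereas a practical oracle would exploit structure in $w$ (graph cuts, bipartite matching, submodular minimization).

```python
import itertools
import math

def oracle_brute_force(w, n, c):
    """Brute-force optimization oracle: max over x in {-1,+1}^n of <c, x> + log2 w(x)."""
    best = -math.inf
    for x in itertools.product((-1, 1), repeat=n):
        wx = w(x)
        if wx == 0:
            continue  # log2(0) = -inf, so zero-weight states can never be maximizers
        best = max(best, sum(ci * xi for ci, xi in zip(c, x)) + math.log2(wx))
    return best

# Toy log-linear weight function on n = 4 spins that favors the all +1 state.
n = 4
w = lambda x: math.exp(0.5 * sum(x))
print(oracle_brute_force(w, n, c=[1, -1, 1, 1]))
```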

If an approximate optimization oracle exists that can find a value within some known bound of the maximum, we can modify our bounds to use the approximate oracle. This may improve the efficiency of our algorithm or extend its use to additional problem classes. For the class of log-supermodular distributions, approximate MAP inference is equivalent to performing approximate submodular minimization [Jegelka, Lin, and Bilmes2011].

We note that even when an efficient optimization oracle exists, the problem of exactly calculating $Z$ is generally still hard. For example, polynomial-time algorithms exist for finding the maximum weight perfect matching in a weighted bipartite graph. However, computing the permanent of a bipartite graph's adjacency matrix, which equals the sum of weights of all perfect matchings, is still #P-complete [Jerrum, Sinclair, and Vigoda2004]. A fully polynomial-time randomized approximation scheme (FPRAS) exists [Jerrum, Sinclair, and Vigoda2004, Bezáková et al.2006], based on Markov chain Monte Carlo sampling over perfect matchings. However, the polynomial runtime of this algorithm has a large degree, limiting its practical use.

4 Weighted Rademacher Bounds on $Z$

Our approach for estimating the sum $Z$ is based on the idea that the Rademacher complexity of a set is related to the set's size. In particular, Rademacher complexity is monotonic in the sense that $R(A) \le R(B)$ whenever $A \subseteq B$. Note that monotonicity does not hold for $|A|$; that is, $R(A)$ is monotonic in the contents of $A$ but not necessarily in its size. We estimate the sum $Z$ of arbitrary non-negative elements by generalizing the Rademacher complexity in Definition 2.

Definition 2.

We define the weighted Rademacher complexity of a weight function $w$ as

$$ \rho(w) := \mathbb{E}_{c}\Big[ \max_{x \in \{-1,+1\}^n} \big\{ \langle c, x \rangle + \log_2 w(x) \big\} \Big], \qquad (3) $$

for $c$ sampled uniformly from $\{-1,+1\}^n$.

In the notation of Eq. (2), the weighted Rademacher complexity is simply $\rho(w) = \mathbb{E}_{c}[\delta_w(c)]$. For a set $A \subseteq \{-1,+1\}^n$, let $w_A$ denote the indicator weight function for $A$, defined as $w_A(x) = 1$ if $x \in A$ and $w_A(x) = 0$ otherwise. Then $\rho(w_A) = R(A)$; that is, the weighted Rademacher complexity is identical to the standard Rademacher complexity for indicator weight functions. For a general weight function, the weighted Rademacher complexity extends the standard Rademacher complexity by giving each element (hypothesis) its own weight.

4.1 Algorithmic Strategy

The key idea of this paper is to use the weighted Rademacher complexity to provide probabilistic estimates of $Z$, the total weight of $w$.

This is a reasonable strategy because, as we have seen, for an indicator weight function $w_A$, $\rho(w_A)$ reduces to the standard Rademacher complexity $R(A)$, and $Z$ is simply the cardinality of the set $A$. Therefore we can use known quantitative relationships between $R(A)$ and $|A|$ from learning theory to estimate $Z$ in terms of $\rho(w)$. Although not formulated in the framework of Rademacher complexity, this is the strategy used by [Barvinok1997].

Here, we generalize these results to arbitrary weight functions and show that it is, in fact, possible to use $\rho(w)$ to obtain estimates of $Z$. This observation can be turned into an algorithm by observing that $\rho(w)$ is the expectation of a random variable concentrated around its mean. Therefore, as we will show in Proposition 2, a small number of samples suffices to reliably estimate $\rho(w)$ (and hence, $Z$) with high probability. Whenever $w$ is 'sufficiently nice' and we have access to an optimization oracle, the estimation algorithm is efficient.

Inputs: A positive integer $K$ and a weight function $w$.

Output: A number $\hat{\rho}$ which approximates $\rho(w)$.

  1. Sample $K$ vectors $c^{(1)}, \ldots, c^{(K)}$ independently and uniformly from $\{-1,+1\}^n$.

  2. Apply the optimization oracle of Assumption 1 to each vector and compute the mean $\hat{\rho} = \frac{1}{K} \sum_{k=1}^{K} \delta_w(c^{(k)})$.

  3. Output $\hat{\rho}$ as an estimator of $\rho(w)$ and thus $Z$.

Algorithm 1 Rademacher Estimate of $Z$
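Below is a minimal end-to-end sketch of Algorithm 1 under the same assumptions as the oracle sketch in Section 3 (brute-force optimization, base-2 oracle form); the toy weight function and the choice $K = 50$ are illustrative. The printed estimate is related to $\log_2 Z$ through the bounds of the following sections, not an unbiased estimate of it.

```python
import itertools
import math
import random

def delta_w(w, n, c):
    """Optimization oracle of Assumption 1, by brute force (tiny n only)."""
    best = -math.inf
    for x in itertools.product((-1, 1), repeat=n):
        if w(x) > 0:
            best = max(best, sum(ci * xi for ci, xi in zip(c, x)) + math.log2(w(x)))
    return best

def rademacher_estimate(w, n, K, seed=0):
    """Algorithm 1: average the perturbed optimum over K random Rademacher vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(K):
        c = [rng.choice((-1, 1)) for _ in range(n)]
        total += delta_w(w, n, c)
    return total / K

n = 8
w = lambda x: math.exp(0.3 * sum(x))  # toy log-linear weight function
log2_Z = math.log2(sum(w(x) for x in itertools.product((-1, 1), repeat=n)))
print("Rademacher estimate:", rademacher_estimate(w, n, K=50))
print("exact log2 Z       :", log2_Z)
```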

4.2 Bounding Weighted Rademacher Complexity

The weighted Rademacher complexity is an expectation over optimization problems. The optimization problem is defined by sampling a vector $c$, or direction since all $c \in \{-1,+1\}^n$ have length $\sqrt{n}$, uniformly from $\{-1,+1\}^n$ and finding the vector $x$ that is most aligned with $c$ (largest dot product) after adding the weight term $\log_2 w(x)$. Our first objective is to derive bounds on the weighted Rademacher complexity in terms of the sum $Z$.

Figure 1: Illustration mapping a set of vectors in high dimensional space to the unit circle. Red regions correspond to regions of space that have a large dot product with some vector in the set. Left: when the size of a set is small, very few regions have a large dot product with any vector in the set, so the Rademacher complexity will be small. Right: when a large set of vectors is tightly packed in a small region of space, the Rademacher complexity will remain relatively small. In both left and right figures we have similar (small) Rademacher complexities, yet different set sizes. This illustrates why tight bounds on the set size based on Rademacher complexity are difficult to achieve.

We begin with the observation that it is impossible to derive bounds on the Rademacher complexity in terms of set size that are tight for sets of all shapes. To gain intuition, note that in high dimensional spaces the dot product of any particular vector and another chosen uniformly at random from $\{-1,+1\}^n$ is close to 0 with high probability. The distribution of weight vectors throughout the space may take any geometric form. One extreme configuration is that all vectors with large weights are packed tightly together, forming a Hamming ball. At the other extreme, all vectors with large weights could be distributed uniformly through the space. As Figure 1 illustrates, a large set of tightly packed vectors and a small set of well-distributed vectors will both have similar Rademacher complexity. Thus, bounds on Rademacher complexity that are based on the underlying set's size fundamentally cannot always be tight for all distributions. Nevertheless, the lower and upper bounds we derive next are tight enough to be useful in practice.

4.2.1 Lower bound.

To lower bound the weighted Rademacher complexity we adapt the technique of [Barvinok1997] for lower bounding the standard Rademacher complexity. The high level idea is that the space $\{-1,+1\}^n$ can be mapped to the leaves of a binary tree. By following a path from the root to a leaf, we divide the space in half $n$ times, until we arrive at a leaf which corresponds to a single element (with some fixed weight). By judiciously choosing which half of the space (branch of the tree) to recurse into at each step, we derive the bound in Lemma 1, whose proof is given in the appendix.

Lemma 1.

For any , the weighted Rademacher complexity of a weight function is lower bounded by

with

4.2.2 Upper bound.

In the unweighted setting, a standard upper bound on the Rademacher complexity is used in learning theory to show that the Rademacher complexity of a small hypothesis class is also small, often to prove PAC-learnability. Massart's Lemma (see [Shalev-Shwartz and Ben-David2014], Lemma 26.8) formally upper bounds the Rademacher complexity in terms of the size of the set. This result is intuitive since, as we have noted, the dot product between any one vector $c$ and most other vectors in $\{-1,+1\}^n$ is small. Therefore, if the set is small, the Rademacher complexity must also be small.
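For reference, the unweighted statement being generalized is Massart's lemma, which in the unnormalized form used here (our phrasing) reads:

```latex
% Massart's lemma (unnormalized form). For a finite set A \subset R^n and c uniform
% on {-1,+1}^n,
\[
  R(A) \;=\; \mathbb{E}_{c}\Big[ \max_{a \in A} \langle c, a \rangle \Big]
  \;\le\; \max_{a \in A} \lVert a \rVert_{2} \, \sqrt{2 \ln |A|}.
\]
% For A \subseteq \{-1,+1\}^n every vector has norm sqrt(n), so R(A) <= sqrt(2 n ln|A|):
% a small set forces a small Rademacher complexity.
```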

Adapting the proof technique of Massart’s Lemma to the weighted setting we arrive at the following bound:

Lemma 2.

For any , , and weight functions with , the weighted Rademacher complexity of is upper bounded by

(4)

with

Note that for an indicator weight function we recover the bound from Massart’s Lemma by setting and .

Corollary 2.1.

For sufficiently large and

we recover the bound from Lemma 2.

Lemma 2 holds for any and . In general we set and optimize over to make the bound as tight as possible, comparing the result with the trivial bound given by Corollary 2.1. More sophisticated optimization strategies over and could result in a tighter bound. Please see the appendix for further details and proofs.

4.3 Bounding the Weighted Sum

With our bounds on the weighted Rademacher complexity from the previous section, we now present our method for efficiently bounding the sum $Z$. Proposition 2 states that we can estimate the weighted Rademacher complexity using the optimization oracle of Assumption 1.

Proposition 2.

For $c$ sampled uniformly at random, the bound

(5)

holds with probability greater than .95.

Proof.

By applying Proposition 1 to the function $c \mapsto \delta_w(c)$, and noting that flipping a single coordinate $c_i$ changes this value by at most 2 (so each bounded-differences constant equals 2), we have

This finishes the proof. ∎

To bound $Z$ we use our optimization oracle to solve a perturbed optimization problem, giving an estimate of the weighted Rademacher complexity $\rho(w)$. Next we invert the bounds on $\rho(w)$ (Lemmas 1 and 2) to obtain bounds on $Z$. We optimize the parameters of Lemmas 1 and 2 to make the bounds as tight as possible. By applying our optimization oracle repeatedly, we can reduce the slack introduced in our final bound when estimating $\rho(w)$, and arrive at our bounds on the sum $Z$, stated in the following theorem.

Theorem 1.

With probability at least .95, the sum $Z$ of any weight function $w$ is bounded below and above by the outputs of Algorithms 2 and 3, respectively.

Inputs: The estimator output by algorithm 1, used to compute , and optionally and .

Output: A number which lower bounds .

  1. If was provided as input, calculate

  2. If was provided as input and ,

  3. Otherwise,

  4. Output the lower bound .

Algorithm 2 Rademacher Lower Bound for

Inputs: The estimator , used to compute , and optionally and .

Output: A number which upper bounds .

  1. If was provided as input, calculate

  2. If was provided as input, calculate

  3. Set the value

  4. Output the upper bound :

    1. If , .

    2. If , .

    3. Otherwise,

      where .

Algorithm 3 Rademacher Upper Bound for

5 Experiments

The closest line of work to this paper showed that the partition function can be bounded by solving an optimization problem perturbed by Gumbel random variables [Hazan and Jaakkola2012, Hazan, Maji, and Jaakkola2013, Hazan et al.2016, Kim, Sabharwal, and Ermon2016, Balog et al.2017]. This approach is based on the fact that

$$ \ln Z = \mathbb{E}_{\gamma}\Big[ \max_{x \in \{-1,+1\}^n} \big\{ \ln w(x) + \gamma(x) \big\} \Big], $$

where the random variables $\gamma(x)$ are sampled i.i.d. from the Gumbel distribution with scale 1 and shifted by the Euler-Mascheroni constant to have mean 0. Perturbing all $2^n$ states with i.i.d. Gumbel random variables is intractable, leading the authors to bound $\ln Z$ by perturbing states with a combination of low-dimensional Gumbel perturbations. Specifically, the upper bound

$$ \ln Z \le \mathbb{E}_{\gamma}\Big[ \max_{x \in \{-1,+1\}^n} \big\{ \ln w(x) + \sum_{i=1}^{n} \gamma_i(x_i) \big\} \Big] $$

[Hazan et al.2016] and a corresponding lower bound [Balog et al.2017, p. 6] hold in expectation, where the perturbations $\gamma_i(x_i)$ for $i = 1, \ldots, n$ are sampled i.i.d. from the Gumbel distribution with scale 1 and shifted by the Euler-Mascheroni constant to have mean 0.

To obtain bounds that hold with high probability using Gumbel perturbations, we calculate the slack term of [Hazan et al.2016, p. 32], giving upper and lower bounds that hold with probability .95 when $K$ samples are used to estimate the expectation bounds.

We note that the Gumbel expectation upper bound takes nearly the same form as the weighted Rademacher complexity, with two differences: the perturbation is sampled from a Gumbel distribution rather than formed as a dot product with a vector of Rademacher random variables, and, without scaling, the two bounds are naturally written in different log bases.
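To make this comparison concrete, the following brute-force sketch (our illustration; the toy model and sample counts are arbitrary) checks the exponential-size Gumbel identity against the exact $\ln Z$ of a tiny model and computes the analogous Rademacher-perturbed average, rescaled from base 2 to the natural log base. The rescaled Rademacher quantity is an estimator related to $\ln Z$ through the bounds of Section 4, not an unbiased estimate of it.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
n = 8
states = list(itertools.product((-1, 1), repeat=n))
theta = {x: 0.3 * sum(x) for x in states}          # toy log-weights theta(x) = ln w(x)
ln_Z = math.log(sum(math.exp(t) for t in theta.values()))

K = 2000
gumbel_vals, rademacher_vals = [], []
for _ in range(K):
    # Full Gumbel perturbation: one i.i.d. Gumbel variable per state, shifted by the
    # Euler-Mascheroni constant to have mean 0.
    g = rng.gumbel(loc=-np.euler_gamma, scale=1.0, size=len(states))
    gumbel_vals.append(max(theta[x] + g[i] for i, x in enumerate(states)))
    # Rademacher perturbation: a single random +-1 vector dotted with the state.
    c = rng.choice([-1.0, 1.0], size=n)
    rademacher_vals.append(max(float(np.dot(c, x)) + theta[x] / math.log(2)
                               for x in states))

print("exact ln Z                     :", ln_Z)
print("Gumbel E[max] (approaches ln Z):", float(np.mean(gumbel_vals)))
print("weighted Rademacher, ln scale  :", float(np.mean(rademacher_vals)) * math.log(2))
```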

We experimentally compare our bounds with those obtained by Gumbel perturbations on two models. First, we bound the partition function of the spin glass model from [Hazan et al.2016]. For this problem the weight function is given by the unnormalized probability distribution of the spin glass model. Second, we bound the propositional model counts (#SAT) for a variety of SAT problems. This problem falls into the unweighted category where every weight is either 0 or 1: every satisfying assignment has weight 1, and we bound the total number of satisfying assignments.

Figure 2: Bounds for a 7x7 spin glass model with (for both methods), that hold with probability .95. Our bounds and estimator are scaled to match Gumbel log base bounds.

5.1 Spin Glass Model

Following [Hazan et al.2016], we bound the partition function of a spin glass model with variables $x_i \in \{-1,+1\}$ for $i = 1, \ldots, n$, where each variable represents a spin. Each spin has a local field parameter $\theta_i$, which corresponds to its local potential function $\theta_i x_i$. We performed experiments on grid shaped models where each spin variable has 4 neighbors, unless it occupies a grid edge. Neighboring spins interact with coupling parameters $\theta_{ij}$. The potential function of the spin glass model is

$$ \theta(x) = \sum_{i \in V} \theta_i x_i + \sum_{(i,j) \in E} \theta_{ij} x_i x_j, $$

with corresponding weight function $w(x) = \exp(\theta(x))$.

We compare our bounds on a 7x7 spin glass model. We sampled the local field parameters $\theta_i$ uniformly at random from a symmetric interval around zero and the coupling parameters $\theta_{ij}$ uniformly at random from $[0, C]$, with the maximum coupling strength $C$ varying. Non-negative coupling parameters make it possible to perform MAP inference efficiently using the graph-cuts algorithm [Kolmogorov and Zabin2004, Greig, Porteous, and Seheult1989]. We used the Python maxflow module wrapping the implementation of [Boykov and Kolmogorov2004].
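A small self-contained sketch of this setup follows (our illustration): a 3x3 grid rather than 7x7 so the exact partition function can be enumerated, with the parameter ranges and the base-2 oracle form being assumptions carried over from earlier sketches. For the real 7x7 experiments the perturbed MAP problem is solved with graph cuts rather than enumeration, since the couplings are non-negative and the Rademacher perturbation only adds unary terms.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
rows = cols = 3                                   # 3x3 grid: 2^9 states, enumerable
n = rows * cols
field = rng.uniform(-1.0, 1.0, size=n)            # local field parameters (assumed range)
coupling = {}                                     # non-negative couplings on grid edges
for r in range(rows):
    for s in range(cols):
        i = r * cols + s
        if s + 1 < cols:
            coupling[(i, i + 1)] = rng.uniform(0.0, 1.0)
        if r + 1 < rows:
            coupling[(i, i + cols)] = rng.uniform(0.0, 1.0)

def potential(x):
    """theta(x) = sum_i theta_i x_i + sum_{(i,j) in E} theta_ij x_i x_j."""
    return (sum(field[i] * x[i] for i in range(n))
            + sum(t * x[i] * x[j] for (i, j), t in coupling.items()))

states = list(itertools.product((-1, 1), repeat=n))
ln_Z = math.log(sum(math.exp(potential(x)) for x in states))

# One Rademacher-perturbed MAP problem, solved here by enumeration.
c = rng.choice([-1.0, 1.0], size=n)
delta = max(float(np.dot(c, x)) + potential(x) / math.log(2) for x in states)
print("exact ln Z:", ln_Z, "  perturbed MAP value (base-2 scale):", delta)
```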

Figure 2 shows bounds that hold with probability .95, where all bounds are computed with the same number of perturbation samples $K$ for both methods. For this value of $K$, our approach produces tighter upper bounds than using Gumbel perturbations; the crossover to a tighter Gumbel perturbation upper bound occurs only at a larger value of $K$. Lower bounds are equivalent, although we note it is trivial to recover this bound by simply calculating the largest weight over all states.

| Model | #Variables | #Clauses | ln(Z) | Estimate (ours) | Upper bound (ours) | Upper bound (Gumbel) | Lower bound (ours) | Lower bound (Gumbel) |
|---|---|---|---|---|---|---|---|---|
| log-1 | 939 | 3785 | 47.8 | 64.5 (20.8) | 438.0 (46.2) | 426.5 (43.0) | 0.5 (0.6) | -0.3 (0.0) |
| log-2 | 1337 | 24777 | 24.2 | 48.6 (20.7) | 485.7 (60.3) | 464.0 (45.1) | 0.3 (0.4) | -0.3 (0.0) |
| log-3 | 1413 | 29487 | 26.4 | 49.9 (22.3) | 503.9 (65.3) | 478.2 (42.3) | 0.4 (0.4) | -0.3 (0.0) |
| log-4 | 2303 | 20963 | 65.3 | 106.0 (26.6) | 830.2 (77.7) | 676.9 (58.8) | 0.4 (0.5) | -0.2 (0.0) |
| tire-1 | 352 | 1038 | 20.4 | 30.7 (11.2) | 198.5 (17.6) | 249.6 (23.7) | 0.3 (0.4) | -0.5 (0.1) |
| tire-2 | 550 | 2001 | 27.3 | 42.1 (14.2) | 283.9 (27.7) | 310.2 (29.6) | 0.3 (0.4) | -0.4 (0.1) |
| tire-3 | 577 | 2004 | 26.1 | 36.9 (17.1) | 280.5 (36.1) | 316.5 (29.1) | 0.4 (0.6) | -0.4 (0.1) |
| tire-4 | 812 | 3222 | 32.3 | 55.0 (17.4) | 384.7 (38.9) | 383.3 (35.3) | 0.3 (0.3) | -0.3 (0.0) |
| ra | 1236 | 11416 | 659.2 | 621.1 (15.5) | 856.7 (0.0) | 1100.9 (45.7) | 184.1 (10.1) | 0.3 (0.0) |
| rb | 1854 | 11324 | 855.9 | 857.2 (12.6) | 1285.1 (0.0) | 1387.5 (43.9) | 239.3 (7.7) | 0.2 (0.0) |
| sat-grid-pbl-0010 | 110 | 191 | 54.7 | 51.6 (4.9) | 76.2 (0.0) | 176.3 (13.6) | 7.6 (2.2) | -0.5 (0.1) |
| sat-grid-pbl-0015 | 240 | 436 | 125.4 | 120.2 (6.6) | 166.4 (0.0) | 310.3 (18.8) | 26.6 (3.8) | -0.1 (0.1) |
| sat-grid-pbl-0020 | 420 | 781 | 220.4 | 215.5 (9.0) | 291.1 (0.0) | 472.3 (26.5) | 56.2 (5.6) | 0.0 (0.1) |
| sat-grid-pbl-0025 | 650 | 1226 | 348.3 | 338.8 (9.4) | 450.5 (0.0) | 667.8 (33.1) | 97.0 (6.2) | 0.2 (0.1) |
| sat-grid-pbl-0030 | 930 | 1771 | 502.4 | 482.6 (13.0) | 644.6 (0.0) | 893.3 (36.7) | 144.1 (8.7) | 0.2 (0.0) |
| c432 | 196 | 514 | 25.0 | 42.7 (5.8) | 135.3 (1.0) | 212.6 (18.2) | 1.4 (0.8) | -0.5 (0.1) |
| c499 | 243 | 714 | 28.4 | 58.5 (6.2) | 168.2 (0.4) | 243.8 (16.6) | 3.2 (1.2) | -0.4 (0.1) |
| c880 | 417 | 1060 | 41.6 | 83.1 (8.4) | 281.1 (4.7) | 332.9 (21.3) | 4.2 (1.4) | -0.3 (0.1) |
| c1355 | 555 | 1546 | 28.4 | 79.8 (12.2) | 342.9 (14.2) | 368.6 (28.7) | 2.3 (1.3) | -0.3 (0.1) |
| c1908 | 751 | 2053 | 22.9 | 87.8 (12.2) | 427.7 (19.3) | 419.1 (32.8) | 1.8 (0.9) | -0.3 (0.0) |
| c2670 | 1230 | 2876 | 161.5 | 260.0 (14.6) | 812.8 (10.8) | 701.4 (39.6) | 23.7 (3.4) | -0.1 (0.0) |

Table 1: Empirical comparison of our estimate of $\ln Z$ and our bounds on propositional model counts against bounds based on Gumbel perturbations. The mean over 100 runs is shown with the standard deviation in parentheses. Bounds hold with probability .95, with the same number of perturbation samples used for both methods. Columns, left to right: model name and size, natural logarithm of the ground truth model count, our estimator, upper bounds (ours, then Gumbel), and lower bounds (ours, then Gumbel).

5.2 Propositional Model Counting

Next we evaluate our method on the problem of propositional model counting. Given a Boolean formula $F$, this poses the question of how many assignments to the underlying Boolean variables result in $F$ evaluating to true. Our weight function is given by $w(x) = 1$ if $F(x)$ evaluates to true, and $w(x) = 0$ otherwise.

We performed MAP inference on the perturbed problem using the weighted partial MaxSAT solver MaxHS [Davies2013]. Ground truth was obtained for a variety of models (available at http://reasoning.cs.ucla.edu/c2d/results.html) using three exact propositional model counters [Thurley2006, Sang et al.2004, Oztok and Darwiche2015]; precomputed model counts were downloaded from https://sites.google.com/site/marcthurley/sharpsat/benchmarks/collected-model-counts. Table 1 shows bounds that hold with probability .95. While the Gumbel lower bounds are always trivial, we produce non-trivial lower bounds for several model instances. Our upper bounds are generally comparable to or tighter than the Gumbel upper bounds.
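For #SAT the oracle reduces to weighted partial MaxSAT: the formula's clauses become hard clauses, and since $\log_2 w(x) = 0$ on satisfying assignments the objective is just $\langle c, x\rangle$, which is encoded by unit soft clauses asserting $x_i = \operatorname{sign}(c_i)$ (each flip costs 2). The following sketch (our illustration; the tiny formula, file name, and weights are arbitrary, and the exact encoding used with MaxHS in our experiments may differ) writes such an instance in the classic DIMACS WCNF format.

```python
import random

def write_perturbed_wcnf(clauses, num_vars, path, seed=0):
    """Encode max_{x satisfies F} <c, x> as weighted partial MaxSAT in WCNF format.

    clauses: hard clauses of F, each a list of non-zero DIMACS literals.
    Soft clauses: the unit clause i if c_i = +1 (else -i) with weight 2, since
    agreeing with sign(c_i) contributes +1 to <c, x> and disagreeing contributes -1.
    """
    rng = random.Random(seed)
    c = [rng.choice((-1, 1)) for _ in range(num_vars)]
    top = 2 * num_vars + 1                        # exceeds the total soft weight => hard
    with open(path, "w") as f:
        f.write(f"p wcnf {num_vars} {len(clauses) + num_vars} {top}\n")
        for clause in clauses:                    # hard clauses: the original formula
            f.write(f"{top} " + " ".join(map(str, clause)) + " 0\n")
        for i, ci in enumerate(c, start=1):       # soft clauses: the perturbation
            f.write(f"2 {i if ci > 0 else -i} 0\n")
    return c

# Tiny example formula: (x1 or x2) and (not x1 or x3).
c = write_perturbed_wcnf([[1, 2], [-1, 3]], num_vars=3, path="perturbed.wcnf")
# A MaxSAT solver's optimum then yields max over satisfying assignments of <c, x>,
# via the affine relation <c, x> = (satisfied soft weight) - num_vars.
```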

5.3 Analysis

Our bounds are much looser than those computed by randomized hashing schemes [Chakraborty, Meel, and Vardi2013, Ermon et al.2013d, Ermon et al.2013b, Zhao et al.2016], but also require much less computation [Ermon et al.2013c, Achim, Sabharwal, and Ermon2016]. While our approach provides polynomial runtime guarantees for MAP inference in the spin glass model after random perturbations have been applied, randomized hashing approaches do not. For propositional model counting, we found that our method is computationally cheaper by over two orders of magnitude than the results reported in [Zhao et al.2016]. Additionally, we tried reducing the runtime and accuracy of the randomized hashing schemes by running the code from [Zhao et al.2016] with parameter values of 0, .01, .02, .03, .04, and .05. We set the maximum time limit to 1 hour (while our method required .01 to 6 seconds of computation for the reported results). Throughout experiments on the models reported in Table 1, our approach still generally required orders of magnitude less computation and also found tighter bounds in some instances.

Empirically, our lower bounds were comparable to or tighter than those obtained by Gumbel perturbations on both models. The weighted Rademacher complexity is generally at least as good an estimator of $\ln Z$ as the Gumbel upper bound; however, it is only an estimator and not an upper bound. Our upper bound using the weighted Rademacher complexity, which holds in expectation, is empirically weaker than the corresponding Gumbel expectation upper bound. However, the slack term needed to transform our expectation bound into a high probability bound is tighter than the corresponding Gumbel slack term. Since both slack terms approach zero in the limit of infinite computation (as $K$, the number of samples used to estimate the expectation bounds, grows), this can result in a trade-off where we produce a tighter upper bound up to some value of $K$, after which the Gumbel bound becomes tighter.

6 Conclusion

We introduced the weighted Rademacher complexity, a novel generalization of Rademacher complexity. We showed that this quantity can be used as an estimator of the size of a weighted set, and gave bounds on the weighted Rademacher complexity in terms of the weighted set size. This allowed us to bound the sum of any non-negative weight function, such as the partition function, in terms of the weighted Rademacher complexity. We showed how the weighted Rademacher complexity can be efficiently approximated whenever an efficient optimization oracle exists, as is the case for a variety of practical problems including calculating the partition function of certain graphical models and the permanent of non-negative matrices. Experimental evaluation demonstrated that our approach provides tighter bounds than competing methods under certain conditions.

In future work, our estimator and bounds on $Z$ may be generalized to other forms of randomness. Rather than sampling $c$ uniformly from $\{-1,+1\}^n$, we could conceivably sample each element $c_i$ from some other distribution, such as a continuous uniform distribution, a Gaussian, or a Gumbel. Our bounds should readily adapt to continuous uniform or Gaussian distributions, although derivations may be more complex in general. As another line of future work, the weighted Rademacher complexity may be useful beyond approximate inference, in learning theory.

7 Acknowledgments

We gratefully acknowledge funding from Ford, FLI, and NSF grants. We also thank Tri Dao, Aditya Grover, Rachel Luo, and the anonymous reviewers.

References

  • [Achim, Sabharwal, and Ermon2016] Achim, T.; Sabharwal, A.; and Ermon, S. 2016. Beyond parity constraints: Fourier analysis of hash functions for inference. In International Conference on Machine Learning, 2254–2262.
  • [Bach and others2013] Bach, F., et al. 2013. Learning with submodular functions: A convex optimization perspective. Foundations and Trends® in Machine Learning 6(2-3):145–373.
  • [Balog et al.2017] Balog, M.; Tripuraneni, N.; Ghahramani, Z.; and Weller, A. 2017. Lost relatives of the Gumbel trick. In 34th International Conference on Machine Learning, 371–379.
  • [Barvinok1997] Barvinok, A. I. 1997. Approximate counting via random optimization. Random Structures and Algorithms 11(2):187–198.
  • [Bellman1961] Bellman, R. E. 1961. Adaptive control processes: a guided tour. Princeton university press.
  • [Bezáková et al.2006] Bezáková, I.; Štefankovič, D.; Vazirani, V. V.; and Vigoda, E. 2006. Accelerating simulated annealing for the permanent and combinatorial counting problems. In Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, 900–907.
  • [Boykov and Kolmogorov2004] Boykov, Y., and Kolmogorov, V. 2004. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE transactions on pattern analysis and machine intelligence 26(9):1124–1137.
  • [Chakrabarty, Jain, and Kothari2014] Chakrabarty, D.; Jain, P.; and Kothari, P. 2014. Provable submodular minimization using wolfe’s algorithm. In Advances in Neural Information Processing Systems, 802–809.
  • [Chakraborty, Meel, and Vardi2013] Chakraborty, S.; Meel, K. S.; and Vardi, M. Y. 2013. A scalable approximate model counter. In International Conference on Principles and Practice of Constraint Programming, 200–216. Springer.
  • [Davies2013] Davies, J. 2013. Solving MAXSAT by Decoupling Optimization and Satisfaction. Ph.D. Dissertation, University of Toronto.
  • [Ermon et al.2013a] Ermon, S.; Gomes, C.; Sabharwal, A.; and Selman, B. 2013a. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), 334–342.
  • [Ermon et al.2013b] Ermon, S.; Gomes, C. P.; Sabharwal, A.; and Selman, B. 2013b. Embed and project: Discrete sampling with universal hashing. In Advances in Neural Information Processing Systems (NIPS), 2085–2093.
  • [Ermon et al.2013c] Ermon, S.; Gomes, C. P.; Sabharwal, A.; and Selman, B. 2013c. Optimization with parity constraints: From binary codes to discrete integration. In Proc. of the 29th Conference on Uncertainty in Artificial Intelligence (UAI).
  • [Ermon et al.2013d] Ermon, S.; Gomes, C. P.; Sabharwal, A.; and Selman, B. 2013d. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proc. of the 30th International Conference on Machine Learning (ICML).
  • [Ermon et al.2014] Ermon, S.; Gomes, C.; Sabharwal, A.; and Selman, B. 2014. Low-density parity constraints for hashing-based discrete integration. In International Conference on Machine Learning, 271–279.
  • [Fujishige1980] Fujishige, S. 1980. Lexicographically optimal base of a polymatroid with respect to a weight vector. Mathematics of Operations Research 5(2):186–196.
  • [Greig, Porteous, and Seheult1989] Greig, D. M.; Porteous, B. T.; and Seheult, A. H. 1989. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society. Series B (Methodological) 271–279.
  • [Hazan and Jaakkola2012] Hazan, T., and Jaakkola, T. S. 2012. On the partition function and random maximum a-posteriori perturbations. In Langford, J., and Pineau, J., eds., Proceedings of the 29th International Conference on Machine Learning (ICML-12), 991–998. New York, NY, USA: ACM.
  • [Hazan et al.2016] Hazan, T.; Orabona, F.; Sarwate, A. D.; Maji, S.; and Jaakkola, T. 2016. High dimensional inference with random maximum a-posteriori perturbations. arXiv preprint arXiv:1602.03571.
  • [Hazan, Maji, and Jaakkola2013] Hazan, T.; Maji, S.; and Jaakkola, T. 2013. On sampling from the Gibbs distribution with random maximum a-posteriori perturbations. In Advances in Neural Information Processing Systems, 1268–1276.
  • [Hopcroft and Karp1971] Hopcroft, J. E., and Karp, R. M. 1971. An $n^{5/2}$ algorithm for maximum matchings in bipartite graphs. In 12th Annual Symposium on Switching and Automata Theory, 122–125. IEEE.
  • [Jegelka, Lin, and Bilmes2011] Jegelka, S.; Lin, H.; and Bilmes, J. A. 2011. On fast approximate submodular minimization. In Advances in Neural Information Processing Systems, 460–468.
  • [Jerrum and Sinclair1996] Jerrum, M., and Sinclair, A. 1996. The markov chain monte carlo method: an approach to approximate counting and integration. Approximation algorithms for NP-hard problems 482–520.
  • [Jerrum, Sinclair, and Vigoda2004] Jerrum, M.; Sinclair, A.; and Vigoda, E. 2004. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. Journal of the ACM (JACM) 51(4):671–697.
  • [Jonker and Volgenant1987] Jonker, R., and Volgenant, A. 1987. A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38(4):325–340.
  • [Jordan et al.1998] Jordan, M. I.; Ghahramani, Z.; Jaakkola, T. S.; and Saul, L. K. 1998. An introduction to variational methods for graphical models. NATO ASI SERIES D BEHAVIOURAL AND SOCIAL SCIENCES 89:105–162.
  • [Kim, Sabharwal, and Ermon2016] Kim, C.; Sabharwal, A.; and Ermon, S. 2016. Exact sampling with integer linear programs and random perturbations. In Proc. 30th AAAI Conference on Artificial Intelligence.
  • [Koller and Friedman2009] Koller, D., and Friedman, N. 2009. Probabilistic graphical models: principles and techniques. MIT press.
  • [Kolmogorov and Zabin2004] Kolmogorov, V., and Zabin, R. 2004. What energy functions can be minimized via graph cuts? IEEE transactions on pattern analysis and machine intelligence 26(2):147–159.
  • [Kuhn1955] Kuhn, H. W. 1955. The hungarian method for the assignment problem. Naval Research Logistics (NRL) 2(1-2):83–97.
  • [Madras2002] Madras, N. N. 2002. Lectures on monte carlo methods, volume 16. American Mathematical Soc.
  • [McDiarmid1989] McDiarmid, C. 1989. On the method of bounded differences. Surveys in combinatorics 141(1):148–188.
  • [Mussmann and Ermon2016] Mussmann, S., and Ermon, S. 2016. Learning and inference via maximum inner product search. In International Conference on Machine Learning, 2587–2596.
  • [Mussmann, Levy, and Ermon2017] Mussmann, S.; Levy, D.; and Ermon, S. 2017. Fast amortized inference and learning in log-linear models with randomly perturbed nearest neighbor search. UAI.
  • [Orlin2009] Orlin, J. B. 2009. A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming 118(2):237–251.
  • [Oztok and Darwiche2015] Oztok, U., and Darwiche, A. 2015. A top-down compiler for sentential decision diagrams. In IJCAI, 3141–3148.
  • [Sang et al.2004] Sang, T.; Bacchus, F.; Beame, P.; Kautz, H. A.; and Pitassi, T. 2004. Combining component caching and clause learning for effective model counting. In SAT.
  • [Shalev-Shwartz and Ben-David2014] Shalev-Shwartz, S., and Ben-David, S. 2014. Understanding machine learning: From theory to algorithms. Cambridge university press.
  • [Thurley2006] Thurley, M. 2006. sharpSAT: counting models with advanced component caching and implicit BCP. In SAT, 424–429.
  • [Valiant1979] Valiant, L. G. 1979. The complexity of enumeration and reliability problems. SIAM Journal on Computing 8(3):410–421.
  • [Wainwright, Jordan, and others2008] Wainwright, M. J.; Jordan, M. I.; et al. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1(1–2):1–305.
  • [Zhao et al.2016] Zhao, S.; Chaturapruek, S.; Sabharwal, A.; and Ermon, S. 2016. Closing the gap between short and long xors for model counting. In AAAI, 3322–3329.

8 Appendix

We present formal proofs of our bounds on the sum $Z$ of any non-negative weight function $w$. For readability we occasionally restate results from the main paper. The format of our proof is as follows. First we bound the weighted Rademacher complexity, $\rho(w)$, by the output of our optimization oracle ($\delta_w(c)$, described in Assumption 1), which we refer to as the slack bound. Next we lower bound the sum $Z$ in terms of $\rho(w)$ and apply our slack bound to obtain a lower bound on $Z$ in terms of $\delta_w(c)$. Similarly, we upper bound the sum $Z$ in terms of $\rho(w)$ and apply our slack bound to obtain an upper bound on $Z$ in terms of $\delta_w(c)$. Finally we tighten the bounds by repeatedly applying our optimization oracle.

8.1 Slack Bound

We use McDiarmid's bound (Proposition 1) to bound the difference between the output of our optimization oracle ($\delta_w(c)$, described in Assumption 1) and its expectation, which is the weighted Rademacher complexity $\rho(w)$. For the function $f(c_1, \ldots, c_n) = \max_{x \in \{-1,+1\}^n} \{\langle c, x \rangle + \log_2 w(x)\}$, flipping any single coordinate $c_i$ changes $f$ by at most 2, so the constant in McDiarmid's bound is 2 for every coordinate, giving

$$ \Pr\big( |\delta_w(c) - \rho(w)| \ge t \big) \;\le\; 2 \exp\!\left( \frac{-t^{2}}{2n} \right). $$

By choosing $c$ uniformly at random we can say with probability greater than .95 that

$$ |\delta_w(c) - \rho(w)| \;\le\; \sqrt{2 n \ln 40}. \qquad (6) $$

8.2 Lower Bound

In this section we lower bound the sum $Z$ in terms of $\rho(w)$ and apply our slack bound to obtain a lower bound on $Z$ in terms of $\delta_w(c)$. We extend Massart's lemma [Shalev-Shwartz and Ben-David2014, lemma 26.8] to the weighted setting by accounting for the weight term in the weighted Rademacher complexity. Our lower bound on $Z$ is given by the following lemma:

Lemma 3.

For $c$ sampled uniformly at random, the following bound holds with probability greater than .95:

Proof.

We begin by upper bounding in terms of . Define generated uniformly at random and . For any , , and weight functions with we have

where we have used Jensen’s inequality. By the linearity of expectation and independence between elements in a random vector ,

Using Lemma A.6 from [Shalev-Shwartz and Ben-David2014],

Next,

(7)

where

Note that for we have two valid inequalities that hold for either choice of . Having bounded the weighted Rademacher complexity in terms of , we now apply the slack bound from equation 6 and have that with probability greater than .95

(8)

This upper bound on holds for any , so we could jointly optimize over and to make the bound as tight as possible. However, this is non-trivial because changing changes the weight function we supply to our optimization oracle. Instead we generally set and optimize over only . At the end of this section we derive another bound with a different choice of . This bound is trivial to derive, but illustrates that other choices of could result in meaningful bounds.

Rewriting the bound in Equation 8 with and we have

(9)

so the optimal value of that makes our bound as tight as possible occurs at the maximum of the quadratic function

Where the stated derivatives are valid for , as is piecewise constant with a discontinuity at . The maximum of must occur at , or the value of that makes . By inspection the maximum does not occur at , so the maximum will occur at or else if the derivative is never zero. We have at . Rearranging equation 9 we have

Depending on the value of we have 3 separate regimes for the optimal lower bound on .

  1. : In this case at (note so that ) and the optimal lower bound is

    We require that , but note that we can discard our bound and recompute with a new if we find that for our computed value of , as this can only happen with low probability when our slack bound is violated and we have estimated poorly with . Note that , so .

  2. : In this case is never zero, so at we have the optimal lower bound of

  3. : This case cannot occur because by definition.

We now illustrate how alternative choices of could result in meaningful bounds. From Equation 7 we have

For sufficiently large () we have , which makes for all . Further, for sufficiently large ,

(for all ) and when a single element has the unique largest weight ()

Therefore, for sufficiently large and , we have

This bound is trivial; however, we picked poorly. To tighten this bound we would make as small as possible subject to or alternatively we would make as big as possible subject to . Note that we have used , so guaranteeing or requires a bound on . We leave joint optimization over and for future work.

8.3 Upper Bound

In this section we upper bound the sum $Z$ in terms of $\rho(w)$ and apply our slack bound to obtain an upper bound on $Z$ in terms of $\delta_w(c)$. Our proof technique is inspired by [Barvinok1997], who developed the method for bounding the sum of a weight function with values of either 0 or 1. We generalize the proof to the weighted setting for any non-negative weight function $w$. The principle underlying the method is that by dividing the space in half $n$ times we arrive at a single weight. By judiciously choosing which half of the space to recurse into at each step we can upper bound $Z$.

Define the dimensional space for and . For any vector with , define as