# High Dimensional Discrete Integration by Hashing and Optimization

Recently Ermon et al. (2013) pioneered an ingenious way to practically compute approximations to large scale counting or discrete integration problems by using random hashes. The hashes are used to reduce the counting problems into many separate discrete optimization problems. The optimization problems can be solved by an NP-oracle, and if they allow some amenable structure then commercial SAT solvers or linear programming (LP) solvers can be used in lieu of the NP-oracle. In particular, Ermon et al. have shown that if the domain of integration is {0,1}^n then it is possible to obtain a 16-approximation by this technique. In many crucial counting tasks, such as computation of the partition function of the ferromagnetic Potts model, the domain of integration is naturally {0,1,..., q-1}^n, q>2. A straightforward extension of Ermon et al.'s work would allow a q^2-approximation for this problem, assuming the existence of an optimization oracle. In this paper, we show that it is possible to obtain a (2+2/(q-1))^2-approximation to the discrete integration problem, when q is a power of an odd prime (a similar expression follows for even q). We are able to achieve this by using an idea of optimization over multiple bins of the hash functions, which can be easily implemented by inequality constraints, or even in an unconstrained way. Also the burden on the NP-oracle is not increased by our method (an LP solver can still be used). Furthermore, we provide a close-to-4-approximation for the permanent of a matrix by extending our technique. Note that the domain of integration here is the symmetric group. Finally, we provide memory optimal hash functions that use a minimal number of random bits for the above purpose. We are able to obtain these structured hashes without sacrificing the amenability of the NP-oracle. We provide experimental simulation results to support our algorithms.


## 1 Introduction

Large scale counting problems, such as computing the permanent of a matrix or computing the partition function of a graphical probabilistic generative model, come up often in a variety of inference tasks. These problems can be, without loss of any generality, written as a discrete integration: the summation of evaluations of a nonnegative function w over all elements of a domain Ω:

 S_Ω(w) ≡ ∑_{σ∈Ω} w(σ).  (1)

These problems are computationally intractable because of the exponential (and sometimes super-exponential) size of Ω. A special case is of course the set of problems in #P, the counting problems associated with the decision problems in NP. For example, one might ask how many variable assignments a given CNF (conjunctive normal form) formula satisfies. The complexity class #P was defined by Valiant [15], in the context of computing the permanent of a matrix. Indeed, the permanent of a matrix can be represented as,

 Perm(A) ≡ ∑_{σ∈S_n} ∏_{i=1}^{n} A_{i,σ(i)},  (2)

where S_n is the symmetric group on n elements and A_{i,σ(i)} is the (i,σ(i))-th element of A. Clearly, here S_n is playing the role of Ω, and w(σ) = ∏_{i=1}^{n} A_{i,σ(i)}. Therefore computing the permanent of a matrix is a canonical example of a problem defined by eq. (1).
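As a concrete reference point, Eq. (2) can be evaluated directly for small n; a minimal brute-force sketch (O(n!·n) time, so usable only as a sanity check):

```python
from itertools import permutations

def permanent(A):
    """Brute-force evaluation of Eq. (2): the sum over all sigma in S_n
    of the product A[i][sigma(i)].  O(n! * n) time."""
    n = len(A)
    total = 0
    for sigma in permutations(range(n)):
        prod = 1
        for i in range(n):
            prod *= A[i][sigma[i]]
        total += prod
    return total

# The permanent of the all-ones 3 x 3 matrix is 3! = 6.
assert permanent([[1, 1, 1], [1, 1, 1], [1, 1, 1]]) == 6
```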

Similar counting problems arise when one wants to compute the partition functions of the well-known probabilistic generative models of statistical physics, such as the Ising model, or more generally the ferromagnetic Potts model [10]. Given a graph G = (V, E), and a label-space Q = {0, 1, …, q−1}, the partition function of the Potts model is given by,

 Z(G) ≡ ∑_{σ∈Q^{|V|}} exp(−ζ(J ∑_{(u,v)∈E} δ(σ(u),σ(v)) + H ∑_{u∈V} δ(σ(u),0))),  (3)

where ζ, J and H are system constants (representing the temperature, spin-coupling and external force), δ is the delta function that is 1 if and only if its two arguments are equal and 0 otherwise, and σ ∈ Q^{|V|} represents a label-vector, where σ(u) is the label of vertex u. It has been shown that, under the availability of an NP-oracle, every problem in #P can be approximated within a factor of (1+ε), with high probability, via a randomized algorithm [13]. This result says that the power of an NP-oracle together with randomization is sufficient to approximate #P. However, depending on the weight function w, eq. (1) may or may not be in #P.
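For small graphs, Eq. (3) can likewise be evaluated by brute force over all q^|V| label vectors; a minimal sketch (the toy graph and the constants q, ζ, J, H below are illustrative choices, not the paper's experimental settings):

```python
from itertools import product
from math import exp

def potts_partition(V, E, q, zeta, J, H):
    """Brute-force Z(G) of Eq. (3): sum over all q^|V| label vectors sigma
    of exp(-zeta * (J * #{agreeing edges} + H * #{vertices labeled 0}))."""
    Z = 0.0
    for sigma in product(range(q), repeat=len(V)):
        agree = sum(1 for (u, v) in E if sigma[u] == sigma[v])  # edge delta terms
        field = sum(1 for u in V if sigma[u] == 0)              # external-field delta terms
        Z += exp(-zeta * (J * agree + H * field))
    return Z

# Illustrative triangle graph; zeta = 0 collapses every Boltzmann factor to 1,
# so Z equals q^|V| = 27.
assert abs(potts_partition([0, 1, 2], [(0, 1), (1, 2), (0, 2)], q=3,
                           zeta=0.0, J=0.1, H=0.1) - 27.0) < 1e-9
```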

Recently, Ermon et al. proposed an alternative approach to solve these counting problems [3, 5] by breaking them into multiple optimization problems. Namely, they use families of hash functions and an oracle that can return the correct solution of the (possibly NP-hard) optimization problem: we call this oracle a MAX-oracle. In particular, when Ω can be represented as {0,1}^n, and h is a random hash function, assuming the availability of a MAX-oracle, Ermon et al. [3] propose a randomized algorithm that approximates the discrete sum within a factor of sixteen (a 16-approximation) with high probability. Ermon et al. use simple linear sketches over F_2, i.e., the hash function is defined to be

 h_{A,b}(x) = Ax + b,  (4)

where the arithmetic operations are over F_2. The matrix A and the vector b are randomly and uniformly chosen from the respective sample spaces. The MAX-oracle in this case simply provides solutions to the optimization problem max_{σ∈{0,1}^n : Aσ = b} w(σ).

The constraint space is nice since it is a coset of the nullspace of A, and experimental results showed these constraints to be manageable by optimization software/SAT solvers. In particular, it was observed that, being integer programming constraints, real-world instances are often solved quickly. Since the implementation of the hash function heavily affects the runtime, it makes sense to keep the constraints of the MAX-oracle as an affine space as above. These constraints are also called parity constraints. The idea of using such constraints to show reductions among classes of problems appeared in several papers before, including [11, 16, 7, 14, 8] among others. The key property that the hash functions satisfy is that they are pairwise independent. This property can be relaxed somewhat, and in a subsequent paper Ermon et al. show that the hash family works even if the matrix A is sparse and random, thus effectively reducing the randomness as well as making the problem more tractable empirically [4]. Subsequently, Achlioptas and Jiang [2] have shown another way of achieving similar guarantees. Instead of arriving at the set of configurations in a bin as the solution set of a system of linear equations (over F_2), they view the set as the image of a lower-dimensional space. This is akin to the generator matrix view of a linear error-correcting code as opposed to the parity-check matrix view. This viewpoint allows their MAX-oracle to solve just an unconstrained optimization problem.

Our contributions. Note that some crucial counting problems, such as computing the partition function of the ferromagnetic Potts model of Eq. (3), naturally have Ω = Q^n with q > 2. To use the binary-domain algorithm of [3] for any q > 2, we need to use a look-up table to map q-ary numbers to binary.¹ In this process the number of variables (and also the number of constraints) increases by a factor of ⌈log₂ q⌉. This makes the MAX-oracle significantly slower, especially when q is large. Also, for the permanent problem, where Ω = S_n, this can create a big problem. It would be useful to extend the method of [3] to q > 2 without increasing the number of variables. An extension of the method of [3] to q-ary alphabets is possible: it provides a q²-approximation at best, which is particularly bad if q is growing with n. For the binary setting, it has been noted in [3, Section 5.3] that the approximation ratio can be improved to any constant by increasing the number of variables, which extends to this q-ary setting. However this also results in an increase in the number of variables, which is undesirable.

¹Also it is worth noting that while there exists a polynomial time approximation scheme (FPRAS) for the Ising model (q = 2), an FPRAS for the general Potts model (q > 2) is significantly more challenging (and likely impossible [6]).

Our first contribution in this paper is to provide a new algorithm by modifying and generalizing the algorithm of [3]. For any q that is a power of a prime, our algorithm provides a (2 + 2/(q−1))²-approximation, when q is odd, and a similar approximation factor, when q is even, to the problem of (1), assuming availability of the MAX-oracle. Our algorithm utilizes an idea of using optimization over multiple bins of the hash function that can be easily implemented via inequality constraints. The constraint space of the MAX-oracle remains an affine space and can be represented as an integer linear program. In general, for arbitrary Ω, if Ω is represented as a subset of {0,1}^n, the approximation factor is at best a constant by the hashing technique; but by representing Ω as a subset of F_q^n the approximation factor can be improved by our technique. Our multi-bin technique can also be used to extend the generator-matrix based algorithm of Achlioptas and Jiang [2]. As a result, we need the MAX-oracle to only perform unconstrained maximization, as opposed to constrained. This leads to a significant speed-up in the system, while resulting in the same approximation guarantees.

Secondly, we show that by using our technique and some modifications to the MAX-oracle, it is possible to obtain a close-to-4-approximation to the problem of computing the permanent of nonnegative matrices. The NP-oracle is still amenable to being implemented in an LP solver. It is to be noted that our idea of optimization over multiple bins is crucial here, since the straightforward generalization of Ermon et al.'s result would have given an approximation factor that grows with n. While there already exists a polynomial time randomized approximation scheme (a (1+ε)-approximation) for the permanent of a nonnegative matrix [9], the runtime there is a large polynomial in n. Since we are delegating the hard task to a professional optimization solver, our method can still be of interest here.

We note that it is possible to derandomize the hash families based on parity constraints to the optimal extent while maintaining the essential properties necessary for their performance. Namely, it can be ensured that the hash family can still be represented as h_{A,b}(x) = Ax + b while using information-theoretically optimal memory to generate it.

Finally, we show the performance of our algorithms in computing the partition function of the ferromagnetic Potts model by running experiments on both synthetic and real-world datasets. We also use our algorithm to compute the total variation distance between two joint probability distributions over a large number of variables. The algorithm to compute the permanent is also validated experimentally. All the experiments exhibit good performance.

Organization. The paper is organized as follows. In Section 2 we describe the technique of [3], and then elaborate our new ideas and main results. In Section 3, we provide an improvement of the WISH algorithm of [3] that leads to an improved approximation. We delegate an algorithm with an unconstrained optimization oracle (similar to [2]) and its analysis to the appendix. Section 4 is devoted to the computation of the permanent of a matrix. In Section 5, we show how to optimally derandomize the hash function used in our algorithm. The experimental results on the computation of partition functions, total variation distance, and the permanent are provided in Section 6.

## 2 Background and our techniques

In this section we describe the main ideas developed by [3] and provide an overview of the techniques that we use to arrive at our new results.

First of all, notice that from (1) we obtain S_Ω(w) = ∫₀^∞ G(u) du, where G(u) ≡ |{σ ∈ Ω : w(σ) ≥ u}| is the tail distribution of weights and a nonincreasing function of u. Note that G(0) = |Ω| = q^n. We can split the domain of G into geometrically growing values 1, q, q², …, q^{n′} such that q^{n′} = |Ω| (so n′ = n). Let β_i ≡ w(σ_{q^i}), where σ_j denotes the configuration with the j-th largest weight. Clearly β_i is nonincreasing in i. As we have not made any assumption on the values of the weight function, β_{i−1} and β_i can be far from each other and they are hard to bound despite the fact that G is monotonic in nature. On the other hand we can try to bound the area under the curve G by bounding the area of the slice between q^{i−1} and q^i. This area is at least (q^i − q^{i−1})β_i and at most (q^i − q^{i−1})β_{i−1}. Therefore:

 β_0 + (q−1) ∑_{i=1}^{n′} q^{i−1} β_i ≤ S_Ω(w) ≤ β_0 + (q−1) ∑_{i=1}^{n′} q^{i−1} β_{i−1}.  (5)

Hence β_0 + (q−1) ∑_{i=1}^{n′} q^{i−1} β_i is a q-factor approximation of S_Ω(w), and if we are able to find a c-approximation of each value β_i then we will be able to obtain a cq-factor approximation of S_Ω(w). In [3], subsequently, the main idea is to estimate the coefficients β_i.
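The sandwich of Eq. (5) is easy to verify numerically for a small weight table; the sketch below computes the β_i from the sorted weights and checks that the two estimates bracket S_Ω(w) and are at most a factor q apart (this reflects our reading of the bound, with β_{i−1} in the upper slice):

```python
import random

def slice_bounds(weights, q):
    """Check the geometric-slicing bound of Eq. (5).  `weights` holds the
    q^n values w(sigma); beta_i is the weight of the q^i-th largest one."""
    w = sorted(weights, reverse=True)
    S = sum(w)
    n = 0
    while q ** n < len(w):
        n += 1
    assert q ** n == len(w), "need |Omega| = q^n"
    beta = [w[q ** i - 1] for i in range(n + 1)]  # beta_i = w(sigma_{q^i}), 1-indexed
    lower = beta[0] + (q - 1) * sum(q ** (i - 1) * beta[i] for i in range(1, n + 1))
    upper = beta[0] + (q - 1) * sum(q ** (i - 1) * beta[i - 1] for i in range(1, n + 1))
    return lower, S, upper

random.seed(0)
lo, S, hi = slice_bounds([random.random() for _ in range(3 ** 4)], 3)
assert lo <= S <= hi <= 3 * lo  # the two estimates are at most a factor q apart
```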

Now note that |Ω| = q^n. Suppose, using a random hash function h: Ω → F_q^i, we compute hashes of all elements in Ω. The pre-image of an entry in the range is called the bin corresponding to that value, i.e., h^{−1}(y) is the bin corresponding to the value y. In every bin of the hash function there is, on average, one element σ among the q^i heaviest such that w(σ) ≥ β_i. So for a randomly and arbitrarily chosen bin B, if w_i ≡ max_{σ∈B} w(σ), then w_i ≥ β_i on expectation. However suppose one performs this random hashing T times and then takes the aggregate (in this case the median) value of the w_i's; that is, say, M_i = Median(w_i^{(1)}, …, w_i^{(T)}). Then by using the independence of the hash functions, it can be shown that the aggregate M_i is an upper bound on a nearby β with high probability. Indeed, without loss of generality, if we assume that the configurations within Ω are ordered according to the value of the function w, i.e., w(σ_1) ≥ w(σ_2) ≥ ⋯, then we can take β_i = w(σ_{q^i}). If the hash family is pairwise independent, then by using the Chebyshev inequality it can be shown that, with high probability, β_{i+c} ≤ M_i ≤ β_{i−c} for a small constant c. This leads to a q^{2c}-approximation for S_Ω(w). For q = 2 this leads to the 16-approximation, because [3] identified Ω with {0,1}^n and took c = 2. The WISH algorithm proposed by [3] makes use of the above analysis and provides a 16-approximation of S_Ω(w). If we naively extend this algorithm by identifying Ω with F_q^n then the approximation factor we achieve is q^{2c}. Note that, for q = 2, it was not possible to take c = 1, but as we will see later, it is possible to take c = 1 when q > 2, and for c = 1 this observation immediately gives a q²-approximation to S_Ω(w).
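The single-bin estimation scheme sketched above can be mocked up end to end with a brute-force MAX-oracle; a toy sketch (the estimator form mirrors the lower bound of Eq. (5) with β_i replaced by M_i, the bins are cut out by random parity-style constraints Aσ = b over F_q, and the parameter choices are illustrative only):

```python
import itertools
import random
from statistics import median

def wish_estimate(w, q, n, T=7, rng=random):
    """Toy WISH-style estimator of S = sum over F_q^n of w(sigma).
    M_i is the median, over T random hashes, of the maximum of w over one
    bin {sigma : A sigma = b (mod q)} cut out by i random constraints."""
    Omega = list(itertools.product(range(q), repeat=n))

    def max_in_random_bin(i):
        A = [[rng.randrange(q) for _ in range(n)] for _ in range(i)]
        b = [rng.randrange(q) for _ in range(i)]
        return max((w(s) for s in Omega
                    if all(sum(a * x for a, x in zip(row, s)) % q == bi
                           for row, bi in zip(A, b))), default=0.0)

    M = [median(max_in_random_bin(i) for _ in range(T)) for i in range(n + 1)]
    # Mirror of the lower bound in Eq. (5), with beta_i replaced by M_i.
    return M[0] + (q - 1) * sum(q ** (i - 1) * M[i] for i in range(1, n + 1))

# With w(sigma) = 1 + sum(sigma), the estimate is always at least the largest
# weight (M_0 is the unconstrained maximum) and at most q^n times it.
est = wish_estimate(lambda s: 1.0 + sum(s), q=3, n=3, T=5)
assert 7.0 <= est <= 7.0 * 27
```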

Instead of using a straightforward analysis for the q-ary case, in this paper we use a MAX-oracle that can optimize over multiple bins of the hash function. Using this oracle we propose a modified WISH algorithm and call it MB-WISH (Multi-Bin WISH). Just as in the case of [3, 4], the MAX-oracle constraints can be integer linear programming constraints, and commercial software such as CPLEX can be used.

The main idea of using an optimization over multiple bins is that it boosts the probability that the M_i we are getting above is close to β_i. However, if we restrict ourselves to the binary alphabet then (as will be clear later) there is no immediate way to represent such multiple bins in a compact way in the MAX-oracle. For the non-binary case, it is possible to represent multiple bins of the hash function as simple inequality constraints. This idea leads to an improvement of the approximation factor from q² to (2 + 2/(q−1))², which decays to 4 at a rate proportional to 1/q. Note that we need to choose q to be a power of a prime so that F_q is a field.

In [2], the bins (as described above) are produced as images of some function, and not as pre-images of hashes. Since we want the number of bins to be a prescribed power of q, this can be achieved by looking at images of an affine map defined by a random matrix. The rest of the analysis of [2] is almost the same as above. The benefit of this approach is that the MAX-oracle just has to solve an unconstrained optimization here. Implementing our multi-bin idea in this perspective of [2] is not straightforward, as we can no longer use inequality constraints. However, as we show later, we found a way to combine bins here in a succinct way, generalizing the design of the map. As a result, we get the same approximation guarantee as in MB-WISH, with the oracle load heavily reduced (the details of this part can be found in the Appendix; we call this the Unconstrained MB-WISH).

Coming back to the discussion on MB-WISH: for computing the permanent, the domain of integration is the symmetric group S_n. However, S_n can be embedded in F_q^n for a q ≥ n. Therefore we can try to use the MB-WISH algorithm and the same set of hashes on elements of S_n, treating them as q-ary vectors. We need to be careful though, since it is essential that the MAX-oracle returns a permutation and not an arbitrary vector. The modified MAX-oracle for permanents therefore must have some additional constraints. However, those being affine constraints, it turns out the MAX-oracle is still implementable in common optimization software quite easily.

For the analysis of [3, 4] to go through, we needed a family of hash functions that are pairwise independent.² A hash family H of functions from Ω to a range R is called uniform and pairwise independent if the following two criteria are met for an h randomly and uniformly chosen from H: 1) for every x ∈ Ω and y ∈ R, Pr(h(x) = y) = 1/|R|; and 2) for any two x ≠ x′, the random variables h(x) and h(x′) are independent. By identifying Ω with {0,1}^n (and {0,1} with F_2) and by using the family of hashes defined in (4), Ermon et al. [3] show the family to be pairwise independent and thereby achieve their objective.

²It is sufficient to have the hash family satisfy some weaker constraints, such as being pairwise negatively correlated.

Finally, the size of the hash family determines how many random bits are required for the randomized algorithm to work. By defining the hash family in the above way, Ermon et al. reduce the number of random bits to O(n²) (see p. 3 of [3]). We show that it is possible to come up with hash functions of the same form (i.e., h_{A,b}(x) = Ax + b) and the same pairwise independence properties using only O(n) random bits. If one uses a random sparse Toeplitz matrix, the construction of the hash family takes even fewer random bits, and the hash operation can be faster because of the structure of the matrix.

## 3 The MB-WISH algorithm and its analysis

Let us assume Ω = F_q^n, where q is a prime power. Let us also fix an ordering among the elements of F_q and write α_0 < α_1 < ⋯ < α_{q−1}. In this section, the symbol '<' just signifies a fixed ordering and has no real meaning over the finite field. Extending this notation, for any two vectors u, v ∈ F_q^n, we will say u < v if and only if the i-th coordinates of u and v satisfy u_i < v_i for all i. Below, 1 denotes an all-one vector of a dimension that will be clear from context. Also, for any event E, let 1_E denote the indicator random variable for the event E.

The MAX-oracle for MB-WISH performs the following optimization, given A ∈ F_q^{m×n}, b ∈ F_q^m and an index r:

 max_{σ∈F_q^n : Aσ+b < α_r·1} w(σ).  (6)

The modified WISH algorithm is presented as Algorithm 1. The main result of this section is below.

###### Theorem 1.

Let q be a power of a prime and Ω = F_q^n. Let w: Ω → ℝ₊. For any δ > 0 and a positive constant c, Algorithm 1 makes Θ(n ln(n/δ)) calls to the MAX-oracle and, with probability at least 1 − δ, outputs a (2 + 2/(q−1))²-approximation of S_Ω(w).

The theorem will be proved by a series of lemmas. The key trick that we use is to ask the MAX-oracle to solve an optimization problem over not a single bin, but multiple bins of the hash function. The hash family is defined in the following way. We have h_{A,b}(x) = Ax + b, where A ∈ F_q^{m×n} and b ∈ F_q^m, and the operations are over F_q. Let

 H_{m,n} = {h_{A,b} : A ∈ F_q^{m×n}, b ∈ F_q^m}.  (7)

The coding-theoretic intuition behind our technique is the following. The set of configurations {σ : Aσ = 0} forms a linear code of dimension n − m. The bins of the hash function define the cosets of this linear code. We would like to choose multiple cosets of a random linear code and then find the optimum value of w over the configurations of these cosets via the MAX-oracle. To choose a hash function uniformly and randomly from H_{m,n}, we can just choose the entries of A and b uniformly at random from F_q, independently.
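A small sanity check of the multi-bin idea: the inequality constraint Aσ + b < α_r·1 (identifying α_i with i ∈ {0, …, q−1}) selects exactly r^m bins, i.e., r^m cosets of the nullspace of A, each of size q^{n−m} when A has full rank. A sketch with hypothetical small parameters:

```python
import itertools

q, n, m, r = 3, 4, 2, 2
# A fixed full-rank matrix over F_3 (identity block in front) and an offset b.
A = [[1, 0, 1, 2],
     [0, 1, 2, 1]]
b = [1, 2]

def hval(sigma):
    """Hash value A*sigma + b over F_q, coordinate by coordinate."""
    return tuple((sum(a * x for a, x in zip(row, sigma)) + bi) % q
                 for row, bi in zip(A, b))

# The multi-bin constraint "A sigma + b < alpha_r * 1" keeps every sigma whose
# hash value is coordinatewise below r.
kept = [s for s in itertools.product(range(q), repeat=n)
        if all(v < r for v in hval(s))]

# r^m bins are selected, each a coset of the nullspace of A with q^(n-m) elements.
assert len(kept) == r ** m * q ** (n - m)
```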

Note that the hash family as defined in (7) is uniform and pairwise independent. This follows from the following more general result.

###### Lemma 1.

Let us define Z_{σ,y} to be the indicator random variable denoting the event h(σ) = y, for some y ∈ F_q^m and h randomly and uniformly sampled from H_{m,n}. Then E[Z_{σ,y}] = q^{−m}, and for any two distinct configurations σ ≠ σ′, the random variables Z_{σ,y} and Z_{σ′,y′} are independent.

The proof is delegated to Appendix A.

Fix an ordering of the configurations such that w(σ_1) ≥ w(σ_2) ≥ ⋯ ≥ w(σ_{q^n}). We can also interpolate the space of configurations to make it continuous by the following technique. For any positive real number t = ⌊t⌋ + {t}, where ⌊t⌋ is the integer part and {t} is the fractional part, define w(σ_t) by interpolating between w(σ_{⌊t⌋}) and w(σ_{⌈t⌉}). We take w(σ_t) = 0 for t > q^n. To prove Theorem 1 we need the following crucial lemma as well.

###### Lemma 2.

Let M_i be defined as in Algorithm 1. Then, for each i, M_i is, with high probability, sandwiched between interpolated values of the β-curve at nearby indices (the precise statement is given in Appendix A).

The proof of this lemma is again delegated to Appendix A, along with the full proof of Theorem 1. Here we provide a sketch of the remaining steps. From Lemma 2, the output of the algorithm lies in a range [L, U] with probability at least 1 − δ, where L and U are obtained by substituting the M_i's for the β_i's in the lower and upper estimates of (5); L and U are a factor of (2 + 2/(q−1))² apart. Now, following an argument similar to (5), we can show that L ≤ S_Ω(w) ≤ U. Therefore Algorithm 1 provides a (2 + 2/(q−1))²-approximation to S_Ω(w). The total number of calls to the MAX-oracle is Θ(n ln(n/δ)).

As an example of this result, for q = 5 the algorithm provides a (2 + 2/4)² = 6.25-approximation. Later, in the experimental section, we use a ferromagnetic Potts model over a q-ary alphabet; the approximation guarantee that MB-WISH provides there is given by the same formula. Note that, for a q-ary Potts model, it is only natural to use our algorithm instead of converting the model to binary in conjunction with the original algorithm of Ermon et al.

Instead of pairwise independent hash families, if we employ k-wise independent families, the error probability decays faster. However, it does not improve the approximation factor.

#### MB-WISH with unconstrained optimization oracle.

We can modify and generalize the results of Achlioptas and Jiang [2] to formulate a version of MB-WISH that can use unconstrained optimizers as the MAX-oracle. The MAX-oracle for this algorithm performs an unconstrained optimization of the form max_y w(f(y)), given a map f and a set of bins.

The aim is to carefully design the map f so that all the desirable statistical properties are satisfied. This part is quite different from the hashing-based analysis and is not an immediate extension of [2]. We provide the algorithm (Unconstrained MB-WISH) and its analysis in Appendix B.

## 4 MB-WISH for computing the permanent

Recall the permanent of a matrix A as defined in Eq. (2): Perm(A) = ∑_{σ∈S_n} ∏_{i=1}^{n} A_{i,σ(i)}. We will show that it is possible to approximate the permanent with a modification of the MB-WISH algorithm and our idea of using multiple bins for optimization in the calls to the MAX-oracle. Also, recall from Section 3 that we set Ω = F_q^n, where there exists a fixed ordering α_0 < α_1 < ⋯ < α_{q−1} among the elements of F_q. We set q ≥ n and consider any σ ∈ S_n as an n-length vector over F_q (that is, by identifying 1, …, n with α_0, …, α_{n−1} respectively). Then we define a modified hash family, with the operations over F_q.

However, when calling the MAX-oracle, we need to make sure that we are getting a permutation as the output. Hence the modified MAX-oracle for computing permanent will be:

 max_{σ∈F_q^n} w(σ),  subject to  Aσ + b < α_r·1;  σ < α_{n−1}·1;  σ(i) ≠ σ(j) ∀ i ≠ j.  (8)

These constraints, which are all linear, ensure that the MAX-oracle returns a permutation over n elements. With this change we propose Algorithm 3 to compute the permanent of a matrix and call it PERM-WISH. The full algorithm is provided in Appendix C.
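A brute-force stand-in for the modified MAX-oracle of Eq. (8) can be useful for validating a solver-based implementation on tiny instances; here the permutation constraints (σ ranging over permutation vectors, all entries distinct) are enforced by direct enumeration rather than encoded as linear constraints:

```python
from itertools import permutations

def perm_oracle_max(W, A, b, q, r):
    """Brute-force stand-in for the modified MAX-oracle of Eq. (8): maximize
    prod_i W[i][sigma(i)] over permutations sigma (viewed as vectors, with
    alpha_i identified with i), subject to the multi-bin hash constraint
    A sigma + b < alpha_r * 1 coordinatewise."""
    n = len(W)
    best = 0
    for sigma in permutations(range(n)):
        hashed = [(sum(a * x for a, x in zip(row, sigma)) + bi) % q
                  for row, bi in zip(A, b)]
        if all(v < r for v in hashed):  # sigma falls in one of the allowed bins
            p = 1
            for i in range(n):
                p *= W[i][sigma[i]]
            best = max(best, p)
    return best

# With no hash constraints, the oracle returns the largest permutation
# product: max(1*4, 2*3) = 6.
assert perm_oracle_max([[1, 2], [3, 4]], [], [], q=5, r=1) == 6
```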

The main result of this section is the following.

###### Theorem 2.

Let A be an n×n nonnegative matrix. Let q be a power of a prime with q ≥ n. For any δ > 0 and a positive constant c, Algorithm 3 makes Θ(n ln(n/δ)) calls to the MAX-oracle and, with probability at least 1 − δ, outputs a (2 + 2/(q−1))²-approximation of Perm(A).

The proof of Theorem 2 follows the same trajectory as that of Theorem 1. The constraints in the MAX-oracle ensure that a permutation is always returned. So, in the proof of Theorem 1, the maximizers can be thought of as permutations in this setting. It should be noted that we must take q ≥ n for PERM-WISH to work. That is the reason we get a close-to-4-approximation for the permanent.

It also has to be noted that, since q is large, the straightforward extension of the WISH algorithm would have provided only a q²-approximation of the permanent. Therefore the idea of using optimization over multiple bins is crucial here, as it leads to a close-to-4-approximation.

###### Remark 1 (Constraints).

It turns out that the constraints in the MAX-oracle in Algorithm 3 are linear/affine. Therefore they are still easy to implement in different CSP software.

## 5 Derandomization with structured hash families

In this section, we show that it is possible to construct a pairwise independent hash family using only O(n) random bits, such that any hash function from the family still has the structure h_{A,b}(x) = Ax + b. Both of our constructions can be easily extended to q-ary alphabets. The proofs of the propositions below can all be found in Appendix D in the supplementary material.

Construction 1: Let g be an irreducible polynomial of degree n over F_2. We construct the finite field F_{2^n} using g, with a root α of g taken as a generator of F_{2^n} ∖ {0}. Now, any nonzero element of F_{2^n} can be written as a power of α via a natural map φ: indeed, for any element α^j, consider the polynomial obtained by reducing α^j modulo g, a polynomial in α of degree at most n − 1. The coefficients of this polynomial form an element of F_2^n; φ is just the inverse of this map. Also, assume that the all-zero vector is mapped to 0 under φ.

Let x ∈ F_2^n be the configuration to be hashed. Suppose the hash function is h_{a,b}, indexed by a ∈ F_2^n and b ∈ F_2^m. The hash function is defined as follows: interpret x and a as elements of F_{2^n} and compute the product a·x in F_{2^n}. Let y be the first m bits of the vector representation of a·x. Finally, output h_{a,b}(x) = y + b.

###### Proposition 1.

The hash function can be written as an affine transform () over .

Note that, to choose a random and uniform hash function from this family, one needs only n + m = O(n) random bits. Moreover, the hash family is pairwise independent.

###### Proposition 2.

The hash family is uniform and pairwise independent.

Moreover, the randomness used to construct this hash function is also optimal. It can be shown that the size of a pairwise independent hash family of this kind must be exponential in n (see [12]). This implies that Ω(n) random bits are essential for the construction.

Construction 2: Toeplitz matrix. In [5], a Toeplitz matrix was used as the hash function. In a Toeplitz matrix, each descending diagonal from left to right is fixed, i.e., if T_{i,j} is the (i,j)-th entry of a Toeplitz matrix T, then T_{i,j} = T_{i−1,j−1}. So, to specify an m×n Toeplitz matrix, one needs to provide only m + n − 1 entries (the entries of the first row and first column). Consider the random Toeplitz matrix where each of the entries of the first row and first column is chosen with equal probability from {0,1}, i.e., each entry in the first row and column is a Bernoulli(1/2) random variable. The random hash function h_{T,b}(x) = Tx + b is then constructed by choosing a uniformly random b as well.

###### Proposition 3.

The hash family is uniform and pairwise independent [5].

Note that, since the number of random bits required for this construction is m + n − 1 + m = O(n), in terms of memory requirement this hash family is slightly suboptimal compared to the previous construction. However, Toeplitz matrices allow for much faster computation of the hash function (matrix-vector multiplication with a Toeplitz matrix takes only O(n log n) time, via the FFT, compared to O(mn) for unstructured matrices).
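The m + n − 1 + m bit parameterization can be sketched directly; the helper below applies h_{T,b}(x) = Tx + b over F_2 without ever materializing T, reading the entry T[i][j] off the first column or first row according to the sign of i − j:

```python
def toeplitz_hash(first_row, first_col, b, x):
    """Apply h_{T,b}(x) = T x + b over F_2, where the m x n Toeplitz matrix T
    is given implicitly by its first column (m bits) and first row (n bits;
    the shared corner entry is read from first_col, so first_row[0] is
    unused) -- m + n - 1 matrix bits plus m bits for b, instead of m*n + m."""
    n, m = len(first_row), len(first_col)
    out = []
    for i in range(m):
        acc = b[i]
        for j in range(n):
            # Constant diagonals: T[i][j] depends only on i - j.
            t = first_col[i - j] if i >= j else first_row[j - i]
            acc ^= t & x[j]
        out.append(acc)
    return out

# 2 x 2 example: first_col = [1, 0] and off-diagonal row entry 0 give T = I.
assert toeplitz_hash([1, 0], [1, 0], [0, 0], [1, 1]) == [1, 1]
```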

We remark that sparse Toeplitz matrices can also be used as our hash family, further reducing the randomness. In particular, we could construct a Toeplitz matrix with Bernoulli(p) entries for p < 1/2. While the pairwise independence of the hash family is lost, it is still possible to analyze the MB-WISH algorithm with this family of hashes since they form a strongly universal family [12]. The number of random bits used in this hash family is smaller still. This construction allows us to have sparse rows in the Toeplitz matrix for small values of p, which can lead to a further speed-up.

Both constructions of this section extend to q-ary alphabets straightforwardly.

## 6 Experimental results

All the experiments were performed in a shared parallel computing environment that is equipped with 50 compute nodes with 28 cores Xeon E5-2680 v4 2.40GHz processors with 128GB RAM.

### 6.1 Experiments on simulated Potts model

We implemented our algorithm to estimate the partition function of the Potts model (a generalization of the Ising model). Recall that the partition function of the Potts model on a graph G is given in Eq. (3). For our simulation, we randomly generated regular graphs with varying numbers of nodes and degrees using the Python library networkx. We took a fixed number of states q for the Potts model, set the external force H and the spin-coupling J to 0.1, and then varied the values of ζ. The partition functions for the different cases were calculated using both the brute force technique and our algorithm. We used the Python module constraint (for Constraint Satisfaction Problems (CSPs)) to handle the constrained optimization for the MAX-oracle. The obtained approximation factors for the different values of ζ are listed in Table 1.

For the smaller instances, the approximation factor of MB-WISH is exactly 1 (up to the precision of the number system used). However, the time taken by MB-WISH is much higher than that of brute force. For the largest instance, MB-WISH still achieves a good approximation factor after running for eight hours in the parallel computing environment described above.

For graphs with a larger number of vertices, it is difficult to compute the partition function of the Potts model by brute force. Therefore, we compare the partition function computed by MB-WISH with the one computed by a belief propagation algorithm in the PGMPY library in Python [1]. Again, for our simulation, we randomly generated regular graphs with varying numbers of nodes and degrees using networkx, took the same number of states q, fixed the external force H and the spin-coupling J, and then varied the values of ζ. In our experiments, each optimization instance is run with a timeout of a few minutes, depending on the instance size. The results are summarized in Table 2.

### 6.2 Real-world CSPs

Many instances of real-world graphical models are available at http://www.cs.huji.ac.il/project/PASCAL/showExample.php. Notably, some of them (e.g., image alignment, protein folding) are defined on non-Boolean domains, which justifies the use of MB-WISH. We have computed the partition functions for several datasets from this repository.

The dataset Network.uai is a Markov network whose nodes each have a binary value. A configuration here is a binary sequence whose length equals the number of nodes. To calculate the partition function, we need to find the sum of the weights of the different configurations. In order to use Unconstrained MB-WISH, we view each configuration as a q-ary string of the appropriate length. With a one-hour timeout for each call to the MAX-oracle, our estimate of the log-partition function came out close to the provided benchmark.

The Object detection dataset comprises nodes each taking a q-ary value, and we computed its log-partition function by Unconstrained MB-WISH. The CSP dataset is a Markov network with ternary-valued nodes; we computed its log-partition function as well. For these datasets there were no baselines.

### 6.3 Permanent

We demonstrate the PERM-WISH algorithm by calculating the permanent of randomly generated n×n matrices for a few small values of n, and computing the ratio of the value achieved by our method to the actual permanent obtained by brute force. For the purpose of constrained optimization, we use the Python modules described above. We report the obtained ratios for the respective values of n.

### 6.4 Experiments to estimate Total Variation distance

We use Unconstrained MB-WISH to compute total variation distances between two high-dimensional probability distributions, generated via an i.i.d. model and via two different Markov chains. There is no baseline to compare with for the Markov chain case, but for the product distributions the method shows good accuracy. The detailed results are provided below.

The total variation (TV) distance between any two discrete distributions $P$ and $Q$ with common sample space $\Omega$ is defined to be

$$\|P-Q\|_{TV}=\sup_{A\subseteq \Omega}\,|P(A)-Q(A)|=\frac{1}{2}\sum_{\sigma\in \Omega}|P(\sigma)-Q(\sigma)|.$$
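The sum form of this definition can be evaluated directly for small supports; a minimal sketch with toy distributions:

```python
def tv_distance(P, Q):
    """Total variation distance 0.5 * sum_x |P(x) - Q(x)| over a common
    finite sample space; P and Q are dicts mapping outcomes to probabilities."""
    support = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(x, 0.0) - Q.get(x, 0.0)) for x in support)

# Toy example: 0.5 * (|0.5-0.4| + |0.3-0.4| + |0.2-0.2|) = 0.1
P = {0: 0.5, 1: 0.3, 2: 0.2}
Q = {0: 0.4, 1: 0.4, 2: 0.2}
d = tv_distance(P, Q)
```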

It is difficult to estimate the TV distance between two distributions over product spaces. Even if the random variables are independent, there is no easy way to compute the total variation distance analytically. Simply consider finding the TV distance between the joint distributions of random variables that can take values in . In that case, we seek to find

$$\frac{1}{2}\sum_{\sigma\in\{0,1,\ldots,q-1\}^n}\left|P^n(\sigma)-Q^n(\sigma)\right|,$$

which is in the exact form of Eq. (1). Therefore we can use the MB-WISH algorithm to estimate the total variation distance. To see how good the performance is, we compare the computed value with known upper and lower bounds on the TV distance in terms of the Hellinger distance $h(P,Q)$, where $h(P,Q)^2=\sum_{\sigma}\big(\sqrt{P(\sigma)}-\sqrt{Q(\sigma)}\big)^2$. It is known that

$$\frac{1}{2}h(P,Q)^2\le\|P-Q\|_{TV}\le h(P,Q)\sqrt{1-h(P,Q)^2/4}.$$

For the experiments, we choose two distributions over points (): and , where . The distributions are chosen to be very close to each other. Here the distributions and are supported on , where can be any natural number. It is known that

$$h(P^n,Q^n)^2=2-2\prod_{i=1}^{n}\left(1-\frac{1}{2}h(P_i,Q_i)^2\right).$$
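With the convention $h(P,Q)^2=\sum_\sigma(\sqrt{P(\sigma)}-\sqrt{Q(\sigma)})^2$, both the two-sided bound and the product-form identity above can be sanity-checked numerically; a sketch with hypothetical binary marginals (not the distributions used in our experiments):

```python
from itertools import product
from math import sqrt, prod

p, q = [0.6, 0.4], [0.5, 0.5]   # toy single-coordinate marginals
n = 4                            # number of independent coordinates

def hellinger_sq(P, Q):
    # h(P,Q)^2 = sum (sqrt(P) - sqrt(Q))^2, so 0 <= h <= sqrt(2)
    return sum((sqrt(a) - sqrt(b)) ** 2 for a, b in zip(P, Q))

# Exact TV distance of the n-fold product distributions by enumeration.
tv = 0.5 * sum(abs(prod(p[s] for s in sigma) - prod(q[s] for s in sigma))
               for sigma in product(range(2), repeat=n))

# Tensorization: h(P^n,Q^n)^2 = 2 - 2 * prod_i (1 - h(P_i,Q_i)^2 / 2).
h2 = 2 - 2 * (1 - hellinger_sq(p, q) / 2) ** n
h = sqrt(h2)

# The Hellinger sandwich bound on TV distance holds.
assert 0.5 * h2 <= tv <= h * sqrt(1 - h2 / 4)
```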

To compare our method against the Hellinger-based upper and lower bounds, we choose the random variables to be independent, although our method is capable of approximation irrespective of any such condition. We run our experiments in a time-constrained manner (10 minutes for each call to the MAX-oracle). The results are summarized in Table 3.

We also ran the experiment when the joint distributions are not of independent random variables, and therefore no standard upper or lower bounds are available. We assume the random variables form a Markov chain over states. We consider two different first-order Markov chains with initial probability distributions and , respectively, and state-transition matrices given by:

$$T=\begin{bmatrix}
0.2204 & 0.2813 & 0.2285 & 0.02819 & 0.24155\\
0.11137 & 0.1277 & 0.04995 & 0.3949 & 0.3159\\
0.01068 & 0.3220 & 0.2387 & 0.11459 & 0.3139\\
0.3714 & 0.3397 & 0.1599 & 0.02134 & 0.1074\\
0.02691 & 0.38807 & 0.2869 & 0.05981 & 0.2382
\end{bmatrix}$$

and , where every row of is given by the vector . We estimate the total variation distance between the two chains in Table 4 below.
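For short chains, the TV distance between two Markov chains can be computed exactly by enumerating all state sequences, which is how a small-scale check of the estimates can be performed. A sketch with hypothetical 2-state chains (not the 5-state matrices above):

```python
from itertools import product

def path_prob(seq, init, T):
    """Probability of a state sequence under a first-order Markov chain."""
    p = init[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= T[a][b]
    return p

def tv_markov(init1, T1, init2, T2, n):
    """Exact TV distance between two Markov chains over length-n sequences."""
    k = len(init1)
    return 0.5 * sum(abs(path_prob(s, init1, T1) - path_prob(s, init2, T2))
                     for s in product(range(k), repeat=n))

# Toy 2-state chains (illustrative parameters only).
init1, T1 = [0.5, 0.5], [[0.9, 0.1], [0.2, 0.8]]
init2, T2 = [0.5, 0.5], [[0.8, 0.2], [0.3, 0.7]]
d = tv_markov(init1, T1, init2, T2, n=5)
assert 0.0 <= d <= 1.0
```

The enumeration is over $k^n$ sequences, so this check is only feasible for short chains; MB-WISH is what makes the large-$n$ case tractable.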

## References

• [1] PGMPY documentation. Accessed: 2018-06-28.
• [2] Dimitris Achlioptas and Pei Jiang. Stochastic integration via error-correcting codes. In UAI, pages 22–31, 2015.
• [3] Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. Taming the curse of dimensionality: Discrete integration by hashing and optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 334–342, 2013.
• [4] Stefano Ermon, Carla Gomes, Ashish Sabharwal, and Bart Selman. Low-density parity constraints for hashing-based discrete integration. In International Conference on Machine Learning, pages 271–279, 2014.
• [5] Stefano Ermon, Carla P Gomes, Ashish Sabharwal, and Bart Selman. Optimization with parity constraints: From binary codes to discrete integration. In Uncertainty in Artificial Intelligence, page 202, 2013.
• [6] Leslie Ann Goldberg and Mark Jerrum. Approximating the partition function of the ferromagnetic Potts model. Journal of the ACM (JACM), 59(5):25, 2012.
• [7] Carla P Gomes, Ashish Sabharwal, and Bart Selman. Model counting: A new strategy for obtaining good bounds. In AAAI, pages 54–61, 2006.
• [8] Carla P Gomes, Willem Jan van Hoeve, Ashish Sabharwal, and Bart Selman. Counting csp solutions using generalized xor constraints. In AAAI, pages 204–209, 2007.
• [9] Mark Jerrum, Alistair Sinclair, and Eric Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. Journal of the ACM (JACM), 51(4):671–697, 2004.
• [10] Renfrey Burnard Potts. Some generalized order-disorder transformations. In Mathematical proceedings of the cambridge philosophical society, volume 48, pages 106–109. Cambridge University Press, 1952.
• [11] Michael Sipser. A complexity theoretic approach to randomness. In Proceedings of the fifteenth annual ACM symposium on Theory of computing, pages 330–335. ACM, 1983.
• [12] Douglas R Stinson. On the connections between universal hashing, combinatorial designs and error-correcting codes. Congressus Numerantium, pages 7–28, 1996.
• [13] Larry Stockmeyer. On approximation algorithms for #P. SIAM Journal on Computing, 14(4):849–861, 1985.
• [14] Marc Thurley. An approximation algorithm for #k-SAT. arXiv preprint arXiv:1107.2001, 2011.
• [15] Leslie G Valiant. The complexity of computing the permanent. Theoretical computer science, 8(2):189–201, 1979.
• [16] Leslie G Valiant and Vijay V Vazirani. NP is as easy as detecting unique solutions. Theoretical Computer Science, 47:85–93, 1986.

## Appendix A Omitted proofs from Section 3

###### Proof of Lemma 1.

Let denote the th row of and denote the th entry of . Then .

For all configurations , we must have

$$\Pr(A_i\sigma+b_i<\alpha_r)=\sum_{j=0}^{r-1}\Pr(A_i\sigma+b_i=\alpha_j)=\frac{r}{q}.$$

As are independent, we must have that . Now, for any two configurations ,

$$\begin{aligned}
\Pr(A_i\sigma_1+b_i<\alpha_r \wedge A_i\sigma_2+b_i<\alpha_r)
&=\sum_{k=0}^{r-1}\sum_{j=0}^{r-1}\Pr(A_i\sigma_1+b_i=\alpha_k \wedge A_i\sigma_2+b_i=\alpha_j)\\
&=\sum_{k=0}^{r-1}\sum_{j=0}^{r-1}\Pr(A_i\sigma_1+b_i=\alpha_k)\cdot\Pr(A_i\sigma_2+b_i=\alpha_j \mid A_i\sigma_1+b_i=\alpha_k)\\
&=\sum_{k=0}^{r-1}\sum_{j=0}^{r-1}\Pr(A_i\sigma_1+b_i=\alpha_k)\,\Pr(A_i(\sigma_2-\sigma_1)=\alpha_j-\alpha_k)\\
&=r^2\cdot(1/q)(1/q)=(r/q)^2.
\end{aligned}$$

As all the rows are independent,
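The two probabilities computed in this proof can be verified exactly for a small prime $q$ by enumerating all choices of the row $A_i$ and the shift $b_i$. The sketch below takes $n=2$, $q=5$, $r=2$ as toy parameters and identifies $\alpha_j$ with $j$, so that $h(\sigma)<\alpha_r$ means the hash value falls in $\{0,\ldots,r-1\}$:

```python
from itertools import product
from fractions import Fraction

q, n, r = 5, 2, 2
sigma1, sigma2 = (1, 3), (4, 0)   # two distinct configurations in F_q^n

def hash_val(a, b, sigma):
    # One coordinate of the hash: h(sigma) = <a, sigma> + b over F_q.
    return (sum(x * s for x, s in zip(a, sigma)) + b) % q

rows = list(product(range(q), repeat=n))   # every possible row a
single = joint = 0
for a in rows:
    for b in range(q):
        h1, h2 = hash_val(a, b, sigma1), hash_val(a, b, sigma2)
        single += (h1 < r)
        joint += (h1 < r and h2 < r)

total = len(rows) * q
assert Fraction(single, total) == Fraction(r, q)       # Pr(h(sigma) < alpha_r) = r/q
assert Fraction(joint, total) == Fraction(r, q) ** 2   # pairwise independence
```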

###### Proof of Lemma 2.

Consider the set of the heaviest configurations

$$\Omega_j=\{\sigma_1,\ldots,\sigma_{\lfloor t_j\rfloor}\}.$$

Let . By the uniformity property of the hash function,

$$\mathbb{E}\,S_j(h_i)=\mathbb{E}\sum_{\sigma\in\Omega_j}\mathbf{1}[h_i(\sigma)<\alpha_r\cdot\mathbf{1}]=\frac{\lfloor t_j\rfloor}{t_i}.$$

For each configuration , let us denote the random variable . By our design, . Note that . Also, from Lemma 1, the random variables are pairwise independent. Therefore,

$$\mathrm{var}\,S_j(h_i)=\mathrm{var}\Big(\sum_{\sigma\in\Omega_j}\bar{Z}^i_\sigma\Big)=\sum_{\sigma\in\Omega_j}\mathbb{E}\,(\bar{Z}^i_\sigma)^2=\frac{\lfloor t_j\rfloor}{t_i}\left(1-\frac{1}{t_i}\right).$$
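The mean and variance formulas can be checked exactly in the same way: for the hash family of Lemma 1, the bin count $S=\sum_{\sigma\in\Omega}\mathbf{1}[h(\sigma)<\alpha_r]$ has mean $|\Omega|\cdot(r/q)$ and, by pairwise independence, variance $|\Omega|\cdot(r/q)(1-r/q)$, with $r/q$ playing the role of $1/t_i$. A sketch with toy parameters ($q=5$, $|\Omega|=3$):

```python
from itertools import product
from fractions import Fraction

q, n, r = 5, 2, 1
omega = [(0, 0), (1, 3), (4, 2)]   # a small set of distinct configurations

def hash_val(a, b, sigma):
    # h(sigma) = <a, sigma> + b over F_q
    return (sum(x * s for x, s in zip(a, sigma)) + b) % q

# Bin count S for every choice of (a, b), enumerated exhaustively.
vals = [sum(hash_val(a, b, s) < r for s in omega)
        for a in product(range(q), repeat=n) for b in range(q)]
mean = Fraction(sum(vals), len(vals))
var = Fraction(sum(v * v for v in vals), len(vals)) - mean ** 2

p = Fraction(r, q)
assert mean == len(omega) * p            # E S = |Omega| * r/q
assert var == len(omega) * p * (1 - p)   # var S = |Omega| * (r/q)(1 - r/q)
```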

Now, for any ,

 Pr(w(k)i≥βj)=Pr(w(t)i≥w(σ⌊t