Approximating the Permanent by Sampling from Adaptive Partitions

Jonathan Kuck et al. (Stanford University, Allen Institute for Artificial Intelligence), November 26, 2019

Computing the permanent of a non-negative matrix is a core problem with practical applications ranging from target tracking to statistical thermodynamics. However, this problem is also #P-complete, which leaves little hope for finding an exact solution that can be computed efficiently. While the problem admits a fully polynomial randomized approximation scheme, this method has seen little use because it is both inefficient in practice and difficult to implement. We present AdaPart, a simple and efficient method for drawing exact samples from an unnormalized distribution. Using AdaPart, we show how to construct tight bounds on the permanent which hold with high probability, with guaranteed polynomial runtime for dense matrices. We find that AdaPart can provide empirical speedups exceeding 25x over prior sampling methods on matrices that are challenging for variational based approaches. Finally, in the context of multi-target tracking, exact sampling from the distribution defined by the matrix permanent allows us to use the optimal proposal distribution during particle filtering. Using AdaPart, we show that this leads to improved tracking performance using an order of magnitude fewer samples.


1 Introduction

The permanent of a square, non-negative matrix $A$ is a quantity with natural graph-theoretic interpretations. If $A$ is interpreted as the adjacency matrix of a directed graph, the permanent corresponds to the sum of weights of its cycle covers. If the graph is bipartite, it corresponds to the sum of weights of its perfect matchings. The permanent has many applications in computer science and beyond. In target tracking applications uhlmann2004matrix; morelande2009joint; oh2009markov; hamid2015joint, it is used to calculate the marginal probability of measurement-target associations. More broadly, it is widely used in graph theory and network science, and it also arises in statistical thermodynamics beichl1999approximating.

Unfortunately, computing the permanent of a matrix is believed to be intractable in the worst-case, as the problem has been formally shown to be #P-complete valiant1979complexity. Surprisingly, a fully polynomial randomized approximation scheme (FPRAS) exists, meaning that it is theoretically possible to accurately approximate the permanent in polynomial time. However, this algorithm is not practical: it is difficult to implement, and the degree of its polynomial runtime is so high that, ignoring coefficients, it is no better than exact calculation until matrices are so large that exact computation already takes days on a modern laptop.

The problems of sampling from an unnormalized distribution and calculating the distribution’s normalization constant (or partition function) are closely related and interreducible. An efficient solution to one problem leads to an efficient solution to the other jerrum1986random; jerrum1996markov. Computing the permanent of a matrix is a special instance of computing the partition function of an unnormalized distribution wainwright2008graphical. In this case the distribution is over permutations, the matrix defines a weight for each permutation, and the permanent is the sum of these weights.

1.1 Contributions

First, we present AdaPart, a novel method for drawing exact samples from an unnormalized distribution using any algorithm that upper bounds its partition function.

We use these samples to estimate and bound the partition function with high probability. This is a generalization of prior work huber2006exact; law2009approximately, which showed that a specific bound on the matrix permanent nests (i.e., satisfies a Matryoshka-doll-like property where the bound recursively fits within itself) for a fixed partitioning of the state space. Our novelty lies in adaptively choosing a partitioning of the state space, which (a) is suited to the particular distribution under consideration, and (b) allows us to use any upper bound, or combination of bounds, on the partition function, rather than one that can be proven a priori to nest according to a fixed partitioning.

Second, we provide a complete instantiation of AdaPart for sampling permutations with weights defined by a matrix, and correspondingly computing the permanent of that matrix. To this end, we identify and use an upper bound on the permanent with several desirable properties, including being computable in polynomial time and being tighter than the best known bound that provably nests.

Third, we empirically demonstrate that AdaPart is both computationally efficient and practical for approximating the permanent of a variety of matrices, both randomly generated and from real world applications. We find that AdaPart can be over 25x faster compared to prior work on sampling from and approximating the permanent. In the context of multi-target tracking, AdaPart facilitates sampling from the optimal proposal distribution during particle filtering, which improves multi-target tracking performance while reducing the number of samples by an order of magnitude.

2 Background

The permanent of an $n \times n$ non-negative matrix $A$ is defined as $\mathrm{per}(A) = \sum_{\sigma \in S_n} \prod_{i=1}^{n} a_{i,\sigma(i)}$, where the sum is over all permutations $\sigma$ of $\{1, 2, \ldots, n\}$ and $S_n$ denotes the corresponding symmetric group. Let us define the weight function, or unnormalized probability, of a permutation as $w(\sigma) = \prod_{i=1}^{n} a_{i,\sigma(i)}$. The permanent can then be written as $\mathrm{per}(A) = \sum_{\sigma \in S_n} w(\sigma)$, which is the partition function (normalization constant) of $w$, also denoted $Z(w)$.
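To make the correspondence concrete, the following minimal Python sketch (function names are ours, for illustration) enumerates all $n!$ permutations of a small matrix and computes each permutation's weight along with their sum, i.e., the permanent $Z(w)$; this is feasible only for small $n$:

```python
import numpy as np
from itertools import permutations

def weight(A, sigma):
    """Unnormalized probability w(sigma) = prod_i A[i, sigma(i)]."""
    return np.prod([A[i, sigma[i]] for i in range(len(sigma))])

def permanent_brute_force(A):
    """per(A) = sum of w(sigma) over all permutations, i.e., Z(w)."""
    return sum(weight(A, sigma) for sigma in permutations(range(A.shape[0])))

A = np.random.rand(5, 5)
Z = permanent_brute_force(A)
# Normalized distribution over permutations: p(sigma) = w(sigma) / Z(w).
p = {sigma: weight(A, sigma) / Z for sigma in permutations(range(5))}
```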

We are interested in sampling from the corresponding probability distribution over permutations, $p(\sigma) = w(\sigma)/Z(w)$, or more generally from any unnormalized distribution where the exact partition function is unknown. Instead, we will assume access to a function $\mathrm{UB}$ that upper bounds the partition function, for instance an upper bound on the permanent. By verifying (at runtime) that this upper bound satisfies a natural 'nesting' property w.r.t. a partition of the permutations, we will be able to guarantee exact samples from the underlying distribution. Note that verification is critical since the 'nesting' property does not hold for upper bounds in general.

In the next few sections, we will consider the general case of any non-negative weight function $w$ over a finite set of states $\Omega$ (i.e., $w : \Omega \to \mathbb{R}_{\geq 0}$) and its partition function $Z(w) = \sum_{x \in \Omega} w(x)$, rather than specifically discussing weighted permutations of a matrix and its permanent. This is to simplify the discussion and present it in a general form. We will return to the specific case of the permanent later on.

2.1 Nesting Bounds

huber2006exact and law2009approximately have noted that upper bounds on the partition function that 'nest' can be used to draw exact samples from a distribution defined by an arbitrary, non-negative weight function. For their method to work, the upper bound must nest according to some fixed partitioning of the weight function's state space, formalized as a partition tree $T$ in Definition 1. In Definition 2, we state the properties that must hold for an upper bound to 'nest' according to $T$.

Definition 1 (Partition Tree).

Let $\Omega$ denote a finite state space. A partition tree $T$ for $\Omega$ is a tree where each node is associated with a non-empty subset of $\Omega$ such that:

  1. The root of $T$ is associated with $\Omega$.

  2. If $|\Omega| = 1$, the tree formed by a single node is a partition tree for $\Omega$.

  3. Let $v_1, \ldots, v_k$ be the children of the root node of $T$, and $S_1, \ldots, S_k$ be their associated subsets of $\Omega$. $T$ is a partition tree if $S_1, \ldots, S_k$ are pairwise disjoint, $\bigcup_{i=1}^{k} S_i = \Omega$, and for each $i$ the subtree rooted at $v_i$ is a partition tree for $S_i$.

Definition 2 (Nesting Bounds).

Let $w : \Omega \to \mathbb{R}_{\geq 0}$ be a non-negative weight function with partition function $Z(w) = \sum_{x \in \Omega} w(x)$. Let $T$ be a partition tree for $\Omega$ and let $\mathcal{S}(T)$ be the set containing the subsets of $\Omega$ associated with each node in $T$. The function $\mathrm{UB}$ is a nesting upper bound for $w$ with respect to $T$ if:

  1. The bound is tight for all single element sets: $\mathrm{UB}(\{x\}) = w(x)$ for all $x \in \Omega$. (This requirement can be relaxed by defining a new upper bounding function that returns $w(x)$ for single element sets and the original upper bound, which violated this condition, for multi-element sets.)

  2. The bound 'nests' at every internal node $v$ in $T$. Let $S$ be the subset of $\Omega$ associated with $v$. Let $S_1, \ldots, S_k$ be the subsets associated with the children of $v$ in $T$. Then the bound 'nests' at $v$ if:

    $$\sum_{i=1}^{k} \mathrm{UB}(S_i) \leq \mathrm{UB}(S). \qquad (1)$$

2.2 Rejection Sampling with a Fixed Partition

Setting aside the practical difficulty of finding such a bound and partition, suppose we are given a fixed partition tree $T$ and a guarantee that $\mathrm{UB}$ nests according to $T$. Under these conditions, law2009approximately proposed a rejection sampling method to perfectly sample an element $x$ from the normalized weight function $p(x) = w(x)/Z(w)$ (see Algorithm A.1 in the Appendix). Algorithm A.1 takes the form of a rejection sampler whose proposal distribution matches the true distribution precisely, except for the addition of slack elements with joint probability mass equal to $1 - Z(w)/\mathrm{UB}(\Omega)$. The algorithm recursively samples a partition of the state space until the sampled partition contains a single element or slack is sampled. Samples of slack are rejected and the procedure is repeated until a valid single element is returned.

According to Proposition A.1 (see Appendix), Algorithm A.1 yields exact samples from the desired target distribution. Since it performs rejection sampling using $\mathrm{UB}$ to construct a proposal, its efficiency depends on how close the proposal distribution is to the target distribution. In our case, this is governed by two factors: (a) the tightness of the (nesting) upper bound $\mathrm{UB}$, and (b) the tree $T$ used to partition the state space (in particular, it is desirable for every node in the tree to have a small number of children).

In what follows, we show how to substantially improve upon Algorithm A.1 by utilizing tighter bounds (even if they don’t nest a priori) and iteratively checking for the nesting condition at runtime until it holds.

Inputs:

  1. Non-empty state space $\Omega$

  2. Unnormalized weight function $w : \Omega \to \mathbb{R}_{\geq 0}$

  3. Family of upper bounds $\mathrm{UB}(S)$ for subsets $S \subseteq \Omega$

  4. Refinement function $\mathrm{partition}(S) \subseteq \mathcal{P}(S)$, where $\mathcal{P}(S)$ is the set of all partitions of $S$

Output: A sample $x$ distributed as $p(x) = w(x)/Z(w)$.

  if $\Omega = \{x\}$ then return $x$
  $P \leftarrow \{\Omega\}$
  repeat
      Choose a subset $S \in P$ to refine
      $\{P_1, \ldots, P_m\} \leftarrow \mathrm{partition}(S)$
      for all $i$ do $Z_i \leftarrow \sum_{S' \in P_i} \mathrm{UB}(S')$
      $i^* \leftarrow \arg\min_i Z_i$;  $P \leftarrow (P \setminus \{S\}) \cup P_{i^*}$
  until $\sum_{S \in P} \mathrm{UB}(S) \leq \mathrm{UB}(\Omega)$
  Sample a subset $S \in P$ with prob. $\mathrm{UB}(S)/\mathrm{UB}(\Omega)$, or sample slack with prob. $1 - \sum_{S \in P} \mathrm{UB}(S)/\mathrm{UB}(\Omega)$
  if $S$ is sampled then recursively call AdaPart($S, w, \mathrm{UB}, \mathrm{partition}$)
  else reject and restart with the call AdaPart($\Omega, w, \mathrm{UB}, \mathrm{partition}$)

Algorithm 2.1 AdaPart: Sample from a Weight Function using Adaptive Partitioning


3 Adaptive Partitioning

A key limitation of using the approach in Algorithm A.1 is that it is painstaking to prove a priori that an upper bound nests for a yet unknown weight function with respect to a complete, fixed partition tree. Indeed, a key contribution of prior work huber2006exact; law2009approximately has been to provide a proof that a particular upper bound nests for any weight function according to a fixed partition tree whose nodes all have a small number of children.

In contrast, we observe that it is nearly trivial to empirically verify a posteriori whether an upper bound respects the nesting property for a particular weight function and a particular partition of a state space; that is, whether the condition in Eq. (1) holds for a particular choice of $S$ and $S_1, \ldots, S_k$. This corresponds to checking whether the nesting property holds at an individual node of a partition tree. If it does not, we can refine the partition and repeat the empirical check. We are guaranteed to succeed if we repeat until the partition contains only single elements, but empirically we find that the check succeeds after a single call to $\mathrm{partition}$ for the upper bound we use.

The use of this adaptive partitioning strategy provides two notable advantages: (a) it frees us to choose any upper bounding method, rather than one that can be proven to nest according to a fixed partition tree; and (b) we can customize—and indeed optimize—our partitioning strategy on a per weight function basis. Together, this leads to significant efficiency gains relative to Algorithm A.1.

Algorithm 2.1 describes our proposed method, AdaPart. It formalizes the adaptive, iterative partitioning strategy and also specifies how the partition tree can be created on-the-fly during sampling without instantiating unnecessary pieces. In contrast to Algorithm A.1, AdaPart does not take a fixed partition tree as input. Further, it operates with any (not necessarily nesting) upper bounding method for (subsets of) the state space of interest.

Figure 1 illustrates the difference between our adaptive partitioning strategy and a fixed partitioning strategy. We represent the entire state space as a two-dimensional square. The left square in Figure 1 illustrates a fixed partitioning strategy, as used by law2009approximately. Regardless of the specific weight function defined over the square, the square is always partitioned with alternating horizontal and vertical splits. To use this fixed partitioning, an upper bound must be proven to nest for any weight function. In contrast, our adaptive partitioning strategy is illustrated by the right square in Figure 1, where we choose horizontal or vertical splits based on the particular weight function. Note that slack is not shown and that the figure illustrates the complete partition trees.

Fixed vs. Adaptive Partitioning


Figure 1: Binary partitioning of a square in the order black, blue, orange, green. Left: each subspace is split in half according to a predefined partitioning strategy, alternating vertical and horizontal splits. Right: each subspace is split in half, but the method of splitting (vertical or horizontal) is chosen adaptively with no predefined order. This figure represents tight upper bounds without slack.

AdaPart uses a function $\mathrm{partition}$, which takes as input a subset $S$ of the state space $\Omega$ and outputs a collection of different ways of partitioning $S$. We then use a heuristic to decide which one of these partitions to keep. In Figure 1, $\mathrm{partition}$ takes a rectangle as input and outputs 2 partitionings, the first splitting the rectangle in half horizontally and the second splitting it in half vertically.

AdaPart works as follows. Given a non-negative weight function $w$ for a state space $\Omega$, we start with the trivial partition $P$ containing only one subset, all of $\Omega$. We then call $\mathrm{partition}$ on $\Omega$, which gives $m$ possible partitions of $\Omega$. For each of the possible partitions, we sum the upper bounds on each subset in the partition, denoting this sum as $Z_i$ for the $i$-th partition. At this point, we perform a local optimization step and choose the partition with the tightest (i.e., smallest) upper bound, $Z_{i^*}$. The rest of the options for partitioning are discarded at this point. The partition $P$ is 'refined' by replacing $\Omega$ with the disjoint subsets forming the $i^*$-th partition of $\Omega$.

This process is repeated recursively, by calling $\mathrm{partition}$ on another subset $S \in P$, until the sum of upper bounds on all subsets in $P$ is no greater than the upper bound on $\Omega$. We now have a valid nesting partition of $\Omega$ and can perform rejection sampling. Similar to Algorithm A.1, we draw a random sample from $P \cup \{\mathrm{slack}\}$, where each $S \in P$ is chosen with probability $\mathrm{UB}(S)/\mathrm{UB}(\Omega)$, and slack is chosen with the remaining probability. If subset $S$ is sampled, we recursively call AdaPart on $S$. If slack is selected, we discard the computation and restart the entire process. The process stops when $S$ is a singleton set $\{x\}$, in which case $x$ is output as the sample.
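The following Python sketch mirrors this recursion (it is an illustration, not the reference implementation); `upper_bound` and `refine` are placeholders for $\mathrm{UB}$ and $\mathrm{partition}$, subsets are assumed to support `len` and iteration, and the bound is assumed tight on singletons:

```python
import random

def adapart(omega, upper_bound, refine):
    """Draw one exact sample from p(x) proportional to w(x); a sketch."""
    while True:                                   # restart whenever slack is sampled
        x = _descend(omega, upper_bound, refine)
        if x is not None:
            return x

def _descend(S, upper_bound, refine):
    if len(S) == 1:
        return next(iter(S))                      # singleton: UB({x}) = w(x)
    partition = [S]
    while True:                                   # repeat ... until nesting holds
        T = max(partition, key=upper_bound)       # heuristic: refine the loosest subset
        best = min(refine(T),                     # keep the locally tightest candidate
                   key=lambda P: sum(upper_bound(U) for U in P))
        partition.remove(T)
        partition.extend(best)
        if sum(upper_bound(U) for U in partition) <= upper_bound(S):
            break                                 # nesting condition (Eq. 1) verified
    r = random.uniform(0.0, upper_bound(S))       # subset T w.p. UB(T)/UB(S), else slack
    for T in partition:
        if r < upper_bound(T):
            return _descend(T, upper_bound, refine)
        r -= upper_bound(T)
    return None                                   # slack sampled: reject
```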

AdaPart can be seen as using a greedy approach for optimizing over possible partition trees of $\Omega$ w.r.t. $\mathrm{UB}$. At every node, we partition in a way that minimizes the immediate or "local" slack (among the possible partitioning options). This approach may be sub-optimal due to its greedy nature, but we found it to be efficient and empirically effective. The efficiency of AdaPart can be improved further by tightening upper bounds whenever slack is encountered, resulting in an adaptive rejection sampler gilks1992adaptive; the use of 'adaptive' here connects to the rejection sampling literature and is unrelated to the 'adaptive' partitioning discussed earlier (please refer to Section A.2 in the Appendix for further details).

3.1 Estimating the Partition Function

Armed with a method, AdaPart, for drawing exact samples from a distribution defined by a non-negative weight function whose partition function is unknown, we now outline a simple method for using these samples to estimate the partition function $Z(w)$. The acceptance probability of the rejection sampler embedded in AdaPart can be estimated as

$$\hat{p} = \frac{\text{number of accepted samples}}{\text{total number of samples}}, \qquad (2)$$

which yields $\hat{Z} = \hat{p} \cdot \mathrm{UB}(\Omega)$ as an unbiased estimator of $Z(w)$. The number of accepted samples out of $T$ total samples is distributed as a Binomial random variable with success probability $p = Z(w)/\mathrm{UB}(\Omega)$. The Clopper–Pearson method clopper1934use gives tight, high probability bounds on the true acceptance probability, which in turn give high probability bounds on $Z(w)$. Please refer to Section A.3 in the Appendix for the unbiased estimator of $Z(w)$ when performing bound tightening as in an adaptive rejection sampler.

4 Adaptive Partitioning for the Permanent

In order to use AdaPart for approximating the permanent of a non-negative matrix $A$, we need to specify two pieces: (a) the refinement method $\mathrm{partition}$ for partitioning any given subset of the permutations defined by $A$, and (b) a function $\mathrm{UB}$ that upper bounds the permanent of $A$, as well as any subset of the state space (of permutations) generated by $\mathrm{partition}$.

4.1 The Refinement Function for Permutation Partitioning

We implement the refinement method $\mathrm{partition}$ for an $n \times n$ matrix, returning a set of $n$ different candidate partitions, as follows. One candidate partition is created for each column $c$. The $c$-th partition of the permutations contains $n$ subsets, corresponding to all permutations containing matrix element $a_{i,c}$, for $i = 1, \ldots, n$. This is inspired by the fixed partition of law2009approximately, modified to choose the column for partitioning adaptively.
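A minimal sketch of this refinement in matrix form (function names are ours): a subset of permutations is represented as a weighted submatrix, since fixing column $c$ to row $i$ reduces the remaining permutations to those of the minor with row $i$ and column $c$ removed. The upper bound on a cell is then its weight times the bound (e.g., Eq. 3) on its submatrix.

```python
import numpy as np

def minor(A, i, c):
    """Submatrix of A with row i and column c removed."""
    return np.delete(np.delete(A, i, axis=0), c, axis=1)

def refine(A):
    """All n candidate partitions of the permutations of A, one per column c.

    The c-th candidate splits on which row i is matched to column c; the cell
    (A[i, c], minor(A, i, c)) represents all permutations using entry a_{i,c}.
    Cells with zero weight contain no permutations of positive weight.
    """
    n = A.shape[0]
    return [[(A[i, c], minor(A, i, c)) for i in range(n) if A[i, c] > 0]
            for c in range(n)]
```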

4.2 Upper Bounding the Permanent

There exists a significant body of work on estimating and bounding the permanent (cf. an overview by zhang2016update), on characterizing the potential tightness of upper bounds gurvits2006hyperbolic; samorodnitsky2008upper, and on improving upper bounds hwang1998upper; soules2000extending; soules2003new; soules2005permanental. We use an upper bound from soules2005permanental, which is computed as follows. Define $\gamma(0) = 0$ and $\gamma(k) = (k!)^{1/k}$ for $k \geq 1$. Let $\delta(k) = \gamma(k) - \gamma(k-1)$. Given a matrix $A$ with entries $a_{ij}$, sort the entries of each row from largest to smallest to obtain $a_{i(1)} \geq a_{i(2)} \geq \cdots \geq a_{i(n)}$. This gives the upper bound

$$\mathrm{per}(A) \leq \prod_{i=1}^{n} \sum_{j=1}^{n} \delta(j)\, a_{i(j)}. \qquad (3)$$

If the matrix entries are either 0 or 1, this bound reduces to the Minc-Brègman bound (minc1963upper; bregman1973some). This upper bound has many desirable properties. It can be efficiently computed in polynomial time, while tighter bounds (also given by soules2005permanental) require solving an optimization problem. It is significantly tighter than the one used by law2009approximately. This is advantageous because the runtime of AdaPart scales linearly with the bound’s tightness (via the acceptance probability of the rejection sampler).
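A direct NumPy sketch of the bound in Eq. (3) (the function name is ours):

```python
import numpy as np
from math import factorial

def soules_upper_bound(A):
    """Permanent upper bound of Eq. (3): product over rows of sorted entries
    weighted by delta(j) = gamma(j) - gamma(j-1), where gamma(k) = (k!)^(1/k)."""
    n = A.shape[0]
    gamma = np.array([0.0] + [factorial(k) ** (1.0 / k) for k in range(1, n + 1)])
    delta = np.diff(gamma)                 # delta[j-1] = gamma(j) - gamma(j-1)
    sorted_rows = -np.sort(-A, axis=1)     # each row sorted largest to smallest
    return float(np.prod(sorted_rows @ delta))

# Sanity check: on a 0/1 matrix this equals the Minc-Bregman bound
# prod_i (r_i!)^(1/r_i), where r_i is the number of ones in row i.
```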

Critically, we empirically find that this bound never requires a second call to $\mathrm{partition}$ in the repeat-until loop of AdaPart. That is, in practice we always find at least one column that we can partition on to satisfy the nesting condition. This bounds the number of subsets in a partition to $n$ and avoids a potentially exponential explosion. This is fortuitous, but also interesting, because this bound (unlike the bound used by law2009approximately) does not nest according to any predefined partition tree for all matrices.

4.3 Dense Matrix Polynomial Runtime Guarantee

The runtime of AdaPart is bounded for dense matrices as stated in Proposition 4.1. Please refer to Section A.4 in the Appendix for further details.

Proposition 4.1.

The runtime of AdaPart is $O(n^{1.5 + 0.5/(2\gamma - 1)})$ per sample for matrices with $\gamma n$ entries in every row and column that all take the maximum value of entries in the matrix, when run as shown in Algorithm A.2.

5 Related Work on Approximating the Permanent

The fastest exact methods for calculating the permanent have computational complexity that is exponential in the matrix dimension ryser1963combinatorial; bax1996finite; balasubramanian1980combinatorics; glynn2013permanent. This is to be expected, because computing the permanent has been shown to be #P-complete valiant1979complexity. Work to approximate the permanent has thus followed two parallel tracks, sampling based approaches and variational approaches.

The sampling line of research has achieved complete (theoretical) success. jerrum2004polynomial proved the existence of a fully polynomial randomized approximation scheme (FPRAS) for approximating the permanent of a general non-negative matrix, which was an outstanding problem at the time broder1986hard; jerrum1989approximating. An FPRAS is the best possible solution that can be reasonably hoped for since computing the permanent is #P-complete. Unfortunately, the FPRAS presented by jerrum2004polynomial has seen little, if any, practical use. The algorithm is both difficult to implement and slow, with polynomial complexity of roughly $\tilde{O}(n^{10})$, although this complexity was improved to $O(n^7 \log^4 n)$ by bezakova2006accelerating.

In the variational line of research, the Bethe approximation of the permanent huang2009approximating; vontobel2014bethe is guaranteed to be accurate within a factor of $\sqrt{2}^{\,n}$ anari2018tight. This approach uses belief propagation to minimize the Bethe free energy as a variational objective. A closely related approximation, using Sinkhorn scaling, is guaranteed to be accurate within a factor of $e^n$ gurvits2014bounds. The difference between these approximations is discussed in vontobel2014bethe. The Sinkhorn-based approximation has been shown to converge in polynomial time linial2000deterministic, although the authors of huang2009approximating could not prove polynomial convergence for the Bethe approximation. aaronson2014generalizing build on gurvits2002deterministic (a precursor to gurvits2014bounds) to estimate the permanent in polynomial time within additive error that is exponential in the largest singular value of the matrix. While these variational approaches are relatively computationally efficient, their bounds are still exponentially loose.

There is currently a gap between the two lines of research. The sampling line has found a theoretically ideal FPRAS which is unusable in practice. The variational line has developed algorithms which have been shown to be both theoretically and empirically efficient, but whose approximations to the permanent are exponentially loose, with only specific cases where the approximations are good huang2009approximating; chertkov2008belief; chertkov2010inference. huber2006exact and law2009approximately began a new line of sampling research that aims to bridge this gap. They present a sampling method which is straightforward to implement and has a polynomial runtime guarantee for dense matrices. While there is no runtime guarantee for general matrices, their method is significantly faster than the FPRAS of jerrum2004polynomial for dense matrices. In this paper we present a novel sampling algorithm that builds on the work of huber2006exact; law2009approximately. We show that AdaPart leads to significant empirical speedups, further closing the gap between the sampling and variational lines of research.

6 Experiments

In this section we show the empirical runtime scaling of AdaPart as matrix size increases, test AdaPart on real world matrices, compare AdaPart with the algorithm from law2009approximately for sampling from a fixed partition tree, and compare with variational approximations huang2009approximating; anari2018tight; gurvits2014bounds. Please see Section A.5 in the Appendix for additional experiments verifying that the permanent empirically falls within our high probability bounds.

6.1 Runtime Scaling and Comparison with Variational Approximations

To compare the runtime performance of AdaPart with law2009approximately, we generated random matrices of varying size. We generated matrices in two ways: by uniformly sampling every element from the unit interval (referred to as 'uniform' in plots), and by placing uniformly sampled blocks of a fixed size, plus one smaller remainder block, along the diagonal of an $n \times n$ matrix, with all other elements set to 0 (referred to as 'block diagonal' in plots). Runtime scaling is shown in Figure 2. While AdaPart is faster in both cases, we observe the largest time reduction for the more challenging, low density block diagonal matrices. For reference, a Cython implementation of Ryser's algorithm for exactly computing the permanent in exponential time ryser1963combinatorial requires roughly 1.5 seconds for a single moderately sized matrix.
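For reference, Ryser's formula can be written compactly in pure Python as follows (an unoptimized sketch; a Cython version of the same inclusion-exclusion formula was used for the timing above):

```python
import numpy as np
from itertools import combinations

def ryser_permanent(A):
    """Exact permanent via Ryser's formula:
    per(A) = (-1)^n * sum over non-empty S of (-1)^{|S|} prod_i sum_{j in S} a_ij."""
    n = A.shape[0]
    total = 0.0
    for r in range(1, n + 1):
        for cols in combinations(range(n), r):
            col_sums = A[:, list(cols)].sum(axis=1)  # per-row sum over chosen columns
            total += (-1) ** r * np.prod(col_sums)
    return (-1) ** n * total
```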

Figure 2: Log-log plot of mean runtime over 5 samples against the matrix dimension $n$ (matrices are of size $n \times n$).

To demonstrate that computing the permanent of these matrices is challenging for variational approaches, we plot the bounds obtained from the Bethe and Sinkhorn approximations in Figure 3. Note that the gap between the variational lower and upper bounds is exponential in the matrix dimension $n$. Additionally, the upper bound from soules2005permanental (that we use in AdaPart) is frequently closer to the exact permanent than all variational bounds.

Figure 3: Bounds on the permanent given by the Bethe approximation huang2009approximating; anari2018tight, the Sinkhorn approximation gurvits2014bounds, and the upper bound we use from soules2005permanental.

6.2 Matrices from Real-World Networks

In Table 1 we show the performance of our method on real world problem instances. In the context of directed graphs, the permanent represents the sum of weights of cycle covers (i.e., sets of disjoint directed cycles that together cover all vertices of the graph) and defines a distribution over cycle covers. Sampling cycle covers is then equivalent to sampling permutations from the distribution defined by the permanent. We sampled 10 cycle covers from distributions arising from graphs in the fields of cheminformatics, DNA electrophoresis, and power networks (matrices available at http://networkrepository.com) and report mean runtimes in Table 1. Among the matrices that did not time out, AdaPart can sample cycle covers 12-25x faster than the baseline from law2009approximately. We used 10 samples from AdaPart to compute bounds on the permanent that are tight within a factor of 5 and hold with high probability, shown in the AdaPart sub-columns of Table 1 (we show the natural logarithm of all bounds). Note that we would get comparable bounds using the method from law2009approximately, as it also produces exact samples. For comparison we compute variational bounds using the method of gurvits2014bounds, shown in the 'Sinkhorn' sub-columns. Each of these bounds was computed in less than 0.01 seconds, but they are generally orders of magnitude looser than our sampling bounds. Note that our sampling bounds can be tightened arbitrarily by using more samples at the cost of additional (parallel) computation, while the Sinkhorn bounds cannot be tightened. We do not show bounds given by the Bethe approximation because the MATLAB code from huang2009approximating was very slow for matrices of this size and the C++ code does not handle matrices with 0 elements.

Model Information Sampling Runtime (sec.) Lower Bounds Upper Bounds
Network Name Nodes Edges AdaPart law2009approximately AdaPart Sinkhorn AdaPart Sinkhorn
ENZYMES-g192 31 132 4.2 52.9 19.3 17.0 20.8 38.5
ENZYMES-g230 32 136 3.3 55.5 19.8 17.2 21.3 39.4
ENZYMES-g479 28 98 1.8 45.1 12.3 10.9 13.8 30.3
cage5 37 196 6.1 TIMEOUT -20.2 -29.2 -18.7 -3.6
bcspwr01 39 46 4.2 74.8 18.7 13.2 20.1 40.3
Table 1: Runtime comparison of our algorithm (AdaPart) with the fixed partitioning algorithm from law2009approximately and bound tightness comparison of AdaPart with the Sinkhorn based variational bounds from gurvits2014bounds (logarithm of bounds shown). Best values are in bold.

6.3 Multi-Target Tracking

The connection between measurement association in multi-target tracking and the matrix permanent arises frequently in tracking literature uhlmann2004matrix; morelande2009joint; oh2009markov; hamid2015joint. It is used to calculate the marginal probability that a measurement was produced by a specific target, summing over all other joint measurement-target associations in the association matrix. We implemented a Rao-Blackwellized particle filter that uses AdaPart to sample from the optimal proposal distribution and compute approximate importance weights (see Section A.6 in the Appendix).
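Concretely, the marginal probability that target $k$ produced measurement $j$ is the total weight of permutations with $\sigma(k) = j$ normalized by $\mathrm{per}(A)$, which equals $a_{kj} \cdot \mathrm{per}(A^{(k,j)})/\mathrm{per}(A)$ where $A^{(k,j)}$ is the minor of $A$ with row $k$ and column $j$ removed. A brute-force sketch, feasible only for small association matrices (the function name is ours):

```python
import numpy as np
from itertools import permutations

def association_marginal(A, k, j):
    """P(target k produced measurement j), marginalizing over all other
    joint associations; equals A[k, j] * per(minor(A, k, j)) / per(A)."""
    n = A.shape[0]
    numerator, Z = 0.0, 0.0
    for sigma in permutations(range(n)):
        w = np.prod([A[i, sigma[i]] for i in range(n)])
        Z += w
        if sigma[k] == j:
            numerator += w
    return numerator / Z
```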

We evaluated the performance of our particle filter using synthetic multi-target tracking data. Independent target motion was simulated for 10 targets with linear Gaussian dynamics. Each target was subjected to a unique spring force. As baselines, we evaluated against a Rao-Blackwellized particle filter using a sequential proposal distribution sarkka2004rao and against the standard multiple hypothesis tracking framework (MHT) reid1979algorithm; chong2018forty; kim2015multiple. We ran each method with varying numbers of particles (or tracked hypotheses in the case of MHT) and plot the maximum log-likelihood of measurement associations among sampled particles in Figure 4. The mean squared error over all inferred target locations (for the sampled particle with maximum log-likelihood) is also shown in Figure 4. We see that by sampling from the optimal proposal distribution (blue x's in Figure 4) we can find associations with larger log-likelihood and lower mean squared error than baseline methods while using an order of magnitude fewer samples (or hypotheses in the case of MHT).

Figure 4: Multi-target tracking performance comparison. Left: maximum log-likelihoods among sampled particles (or top-$k$ hypotheses for the MHT baseline). Right: mean squared error over all time steps and target locations.

7 Conclusion and Future Work

Computing the permanent of a matrix is a fundamental problem in computer science. It has many applications, but exact computation is intractable in the worst case. Although a theoretically sound randomized algorithm exists for approximating the permanent in polynomial time, it is impractical. We proposed a general approach, AdaPart, for drawing exact samples from an unnormalized distribution. We used AdaPart to construct high probability bounds on the permanent in provably polynomial time for dense matrices. We showed that AdaPart is significantly faster than prior work on both dense and sparse matrices which are challenging for variational approaches. Finally, we applied AdaPart to the multi-target tracking problem and showed that we can improve tracking performance while using an order of magnitude fewer samples.

In future work, AdaPart may be used to estimate general partition functions if a general upper bound wainwright2003tree; liu2011bounding; lou2017anytime is found to nest with few calls to $\mathrm{partition}$. The matrix permanent specific implementation of AdaPart may benefit from tighter upper bounds on the permanent. In particular, a computationally efficient implementation of the Bethe upper bound huang2009approximating; anari2018tight would yield improvements on sparse matrices (see Figure 3), which could be useful for multi-target tracking where the association matrix is frequently sparse. The 'sharpened' version of the bound we use (Equation 3), also described in soules2005permanental, would offer performance improvements if the 'sharpening' optimization problem can be solved efficiently.

Acknowledgements

Research supported by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550- 19-1-0024), and FLI.


Appendix A

A.1 Proof of Correctness for Sampling with a Fixed Partition

Algorithm A.1 specifies a method for sampling from a weight function given a fixed partition tree $T$ and a bound $\mathrm{UB}$ that provably nests. Its proof of correctness is given in Proposition A.1. Note that a simple property that follows from recursively applying the definition of a nesting bound is that $Z(w) \leq \mathrm{UB}(\Omega)$. More generally, given any node in $T$ associated with the subset $S$, we have $\sum_{x \in S} w(x) \leq \mathrm{UB}(S)$.

Proposition A.1 (huber2006exact, law2009approximately).

Algorithm A.1 samples an element from the normalized weight function $p(x) = w(x)/Z(w)$.

Proof.

The probability of sampling a leaf node $v_d$ at depth $d$ in the partition tree, with ancestors $v_{d-1}, \ldots, v_0$ (where $v_{d-1}$ is the parent node of $v_d$ and $v_0$ is the root node) and associated ancestor subsets $S_{d-1}, \ldots, S_0 = \Omega$, is the telescoping product
$$\prod_{j=1}^{d} \frac{\mathrm{UB}(S_j)}{\mathrm{UB}(S_{j-1})} = \frac{\mathrm{UB}(S_d)}{\mathrm{UB}(\Omega)} = \frac{w(x)}{\mathrm{UB}(\Omega)},$$
where $S_d = \{x\}$ is the singleton associated with $v_d$ and the last equality uses tightness on single element sets. Every $x \in \Omega$ is thus proposed with probability proportional to $w(x)$, so conditioned on acceptance (i.e., slack never being sampled before termination), the returned element is distributed as $p(x) = w(x)/Z(w)$. ∎

Inputs:

  1. Non-empty state space $\Omega$

  2. Partition tree $T$ of $\Omega$

  3. Unnormalized weight function $w : \Omega \to \mathbb{R}_{\geq 0}$

  4. Nesting upper bound $\mathrm{UB}$ for $w$ with respect to $T$

Output: A sample $x$ distributed as $p(x) = w(x)/Z(w)$.

Algorithm:

  1. Set $v$ to the root node of $T$ and $S \leftarrow \Omega$.

  2. Sample a child of $v$ (denoted $v_1, \ldots, v_k$, with associated subsets $S_1, \ldots, S_k$ of $S$) or slack with probabilities: $p_i = \mathrm{UB}(S_i)/\mathrm{UB}(S)$ for child $v_i$, and $p_{\mathrm{slack}} = 1 - \sum_{i=1}^{k} \mathrm{UB}(S_i)/\mathrm{UB}(S)$.

  3. If a child was sampled with an associated subset containing a single element $x$, then return $x$.

  4. If a child $v_i$ was sampled with an associated subset $S_i$ containing more than one element, then set $v \leftarrow v_i$, $S \leftarrow S_i$, and go to step 2.

  5. If the slack element was sampled, then go to step 1.

Algorithm A.1 Sample from a Normalized Weight Function

A.2 Adaptive Rejection Sampling

We can improve the efficiency of AdaPart by tightening the upper bounds whenever we encounter slack. This is done by subtracting the computed slack from the associated upper bounds, which still preserves the nesting properties. The resulting algorithm is an adaptive rejection sampler gilks1992adaptive, where the "envelope" proposal is tightened every time a point is rejected. (As noted earlier, the use of 'adaptive' here connects this section with the rejection sampling literature, and is unrelated to the 'adaptive' partitioning discussed previously.)

Formally, for any partition $P = \{S_1, \ldots, S_k\}$ of $S$, we define a new, tighter upper bound as follows:

$$\mathrm{UB}'(S) = \sum_{i=1}^{k} \mathrm{UB}(S_i). \qquad (4)$$

This is still a valid upper bound on $\sum_{x \in S} w(x)$ because of the additive nature of $Z$, and is, by definition, also nesting w.r.t. the partition $P$. If we encounter any slack, there must exist some $S$ for which $\sum_{i=1}^{k} \mathrm{UB}(S_i) < \mathrm{UB}(S)$, hence we can strictly tighten our bound for subsequent steps of the algorithm (thereby making AdaPart more efficient) by using $\mathrm{UB}'$ instead of $\mathrm{UB}$. For matrices sampled uniformly from the unit interval, we empirically find that bound tightening is most effective for small matrices. After 1000 samples we improve our bound on the permanent to roughly 64%, 77%, and 89% of the original bound for matrices of size 10, 15, and 25 respectively. However, bound tightening is more effective for other types of matrices. For the matrices from real world networks in Section 6, after drawing 10 samples we improve our bound on the permanent to roughly 25%, 22%, 8%, 7%, and 19% for models ENZYMES-g192, ENZYMES-g230, ENZYMES-g479, cage5, and bcspwr01 respectively.
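A sketch of this bookkeeping, under the assumption that subsets are represented by hashable keys (e.g., frozensets of states); names are ours:

```python
tightened = {}  # subset key -> best-known upper bound

def tightened_ub(S, base_ub):
    """Best known upper bound for subset S; falls back to the base bound."""
    return min(tightened.get(S, float("inf")), base_ub(S))

def record_refinement(S, partition, base_ub):
    """After a nesting partition of S is found, Eq. (4) tightens UB(S)."""
    total = sum(tightened_ub(T, base_ub) for T in partition)
    tightened[S] = min(tightened_ub(S, base_ub), total)
```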

A.3 Estimating the Partition Function with Adaptive Rejection Sampling

The number of accepted samples, $k$, is a random variable with expectation $\sum_{i=1}^{T} Z(w)/\mathrm{UB}_i(\Omega)$, where $\mathrm{UB}_i$ is the upper bound on the entire state space in effect when the $i$-th sample is drawn. This gives the unbiased estimator $\hat{Z} = k \big/ \sum_{i=1}^{T} \mathrm{UB}_i(\Omega)^{-1}$ for the partition function. We use bootstrap techniques chernick2008bootstrap to perform Monte Carlo simulations that yield high probability bounds on the partition function. We used 100,000 samples to compute our bounds in Table 1, which took 2-6 seconds using our python script, but note that this computation is extremely parallelizable. We computed each bootstrap upper and lower bound 10 times and always sampled the same values, except for one case where the log upper bound changed by 0.06, showing that 100,000 samples is sufficient.
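A sketch of this bootstrap computation with NumPy, where `accepts[i]` is the accept/reject outcome of the $i$-th draw and `ubs[i]` is the bound $\mathrm{UB}_i(\Omega)$ in effect for that draw (names are ours):

```python
import numpy as np

def bootstrap_bounds(accepts, ubs, delta=0.05, n_boot=10_000, seed=0):
    """Percentile-bootstrap (lower, upper) bounds on Z(w) from per-draw data.

    The point estimate is Z_hat = sum(accepts) / sum(1 / ubs), which is
    unbiased since E[#accepts] = sum_i Z(w) / UB_i(Omega).
    """
    rng = np.random.default_rng(seed)
    accepts = np.asarray(accepts, dtype=float)
    inv_ubs = 1.0 / np.asarray(ubs, dtype=float)
    n = len(accepts)
    idx = rng.integers(0, n, size=(n_boot, n))      # resample draws with replacement
    z = accepts[idx].sum(axis=1) / inv_ubs[idx].sum(axis=1)
    return np.quantile(z, delta / 2), np.quantile(z, 1 - delta / 2)
```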

A.4 Runtime Guarantee of AdaPart

law2009approximately prove that the runtime of Algorithm A.1 is $O(n^{1.5 + 0.5/(2\gamma - 1)})$ per sample when using their upper bound on the permanent [law2009approximately, p. 33], where $\gamma$ controls density. AdaPart has the same guarantee with a minor modification to the presentation in Algorithm 2.1. The repeat loop is removed, and if the terminating condition is not met after a single call to $\mathrm{partition}$, Algorithm A.1 is called with the upper bound and fixed partitioning strategy from law2009approximately, as shown in Algorithm A.2.

Inputs:

  1. Non-empty state space $\Omega$

  2. Unnormalized weight function $w : \Omega \to \mathbb{R}_{\geq 0}$

  3. Family of upper bounds $\mathrm{UB}(S)$ for $S \subseteq \Omega$ that are tight on single element subsets

  4. Refinement function $\mathrm{partition}(S) \subseteq \mathcal{P}(S)$, where $\mathcal{P}(S)$ is the set of all partitions of $S$

Output: A sample $x$ distributed as $p(x) = w(x)/Z(w)$.

  if $\Omega = \{x\}$ then return $x$
  $\{P_1, \ldots, P_m\} \leftarrow \mathrm{partition}(\Omega)$
  for all $i$ do $Z_i \leftarrow \sum_{S' \in P_i} \mathrm{UB}(S')$
  $i^* \leftarrow \arg\min_i Z_i$;  $P \leftarrow P_{i^*}$
  if $\sum_{S \in P} \mathrm{UB}(S) > \mathrm{UB}(\Omega)$ then
      Return the output of Algorithm A.1 called on $\Omega$ and $w$ with the bound and fixed partition of law2009approximately
  else
      Sample a subset $S \in P$ with prob. $\mathrm{UB}(S)/\mathrm{UB}(\Omega)$, or sample slack with prob. $1 - \sum_{S \in P} \mathrm{UB}(S)/\mathrm{UB}(\Omega)$
      if $S$ is sampled then recursively call AdaPart($S, w, \mathrm{UB}, \mathrm{partition}$)
      else restart, i.e., call AdaPart($\Omega, w, \mathrm{UB}, \mathrm{partition}$)

Algorithm A.2 AdaPart: Sample from a Normalized Weight Function using Adaptive Partitioning with Polynomial Runtime Guarantee for Dense Matrices

A.5 Additional Experiments

Figure 5: Accuracy results on randomly sampled block diagonal matrices constructed as described earlier, with fixed-size blocks along the diagonal. We plot the exact permanent, our estimate, and our high probability bounds calculated from 10 samples for each matrix.

While calculating the permanent of a large matrix is generally intractable, it can be done efficiently for certain special types of matrices. One example is block diagonal matrices, where an $n \times n$ matrix is composed of equally sized blocks and a single smaller remainder block along the diagonal. Only elements within these blocks on the diagonal may be non-zero. The permanent of a block diagonal matrix is simply the product of the permanents of each block along the diagonal, which can be calculated efficiently whenever the block size is sufficiently small. We plot the exact permanent, our estimate, and our high probability bounds for randomly sampled block diagonal matrices of various sizes in Figure 5.
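The following minimal sketch exploits this structure, computing each small block's permanent by brute force (names are ours):

```python
import numpy as np
from itertools import permutations

def brute_permanent(B):
    """Brute-force permanent; feasible only for small blocks."""
    n = B.shape[0]
    return sum(np.prod([B[i, s[i]] for i in range(n)])
               for s in permutations(range(n)))

def block_diagonal_permanent(blocks):
    """Permanent of a block diagonal matrix = product of block permanents."""
    return np.prod([brute_permanent(B) for B in blocks])

blocks = [np.random.rand(4, 4) for _ in range(5)]   # a 20 x 20 block diagonal matrix
print(block_diagonal_permanent(blocks))
```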

A.6 Multi-Target Tracking Overview

The multi-target tracking problem is very similar to classical inference problems in hidden Markov models, requiring the estimation of an unobserved state given a time series of noisy measurements. The non-standard catch is that at each time step the observer is given one noisy measurement per target, but is not told which target produced which measurement. Brute forcing the problem is intractable because there are $K!$ potential measurement-target associations at each time step when tracking $K$ targets. The connection between measurement association and the matrix permanent arises frequently in the tracking literature uhlmann2004matrix, morelande2009joint, oh2009markov, hamid2015joint, and its computational complexity is cited when discussing the difficulty of multi-target tracking.

As brief background, the computational complexity of multi-target tracking has led to many heuristic approximations, notably including multiple hypothesis tracking (MHT) reid1979algorithm, chong2018forty, kim2015multiple and joint probabilistic data association (JPDA) fortmann1983sonar, hamid2015joint. As heuristics, they can succumb to failure modes. JPDA is known to suffer from target coalescence where neighboring tracks merge blom2000probabilistic.

Alternatively, sequential Monte Carlo methods (SMC or particle filters) provide an asymptotically unbiased method for sequentially sampling from arbitrarily complex distributions. When targets follow linear Gaussian dynamics, a Rao-Blackwellized particle filter may be used to sample the measurement associations, allowing sufficient statistics for distributions over individual target states to be computed in closed form (by Kalman filtering; see Algorithm A.3 in the Appendix for further details) sarkka2004rao. The proposal distribution is a primary limitation when using Monte Carlo methods. Ideally it should match the target distribution as closely as possible, but this generally makes it computationally unwieldy.

In the case of a Rao-Blackwellized particle filter for multi-target tracking, the optimal proposal distribution [doucet2000sequential, p. 199] that minimizes the variance of each importance weight is a distribution over permutations defined by a matrix permanent (please see Section A.10 in the Appendix for further details). We implemented a Rao-Blackwellized particle filter that uses the optimal proposal distribution. We evaluated its effectiveness against a Rao-Blackwellized particle filter using a sequential proposal distribution sarkka2004rao and against the standard multiple hypothesis tracking framework (MHT) reid1979algorithm, chong2018forty, kim2015multiple.

Our work can be extended to deal with a variable number of targets and clutter measurements using a matrix formulation similar to that in atanasov2014semantic.

A.7 Optimal Single-Target Bayesian Filtering

In this section we give a brief review of the optimal Bayesian filter for single-target tracking. Consider a hidden Markov model with unobserved state $x_t$ and measurement $y_t$ at time $t$. The joint distribution over states and measurements factors as
$$p(x_{1:T}, y_{1:T}) = p(x_1) \prod_{t=1}^{T} p(y_t \mid x_t) \prod_{t=2}^{T} p(x_t \mid x_{t-1})$$
by the Markov property. This factorization of the joint distribution facilitates Bayesian filtering, a recursive algorithm that maintains a fully Bayesian distribution over the hidden state as each measurement is sequentially observed. Given the prior distribution $p(x_1)$ over the initial state, the Bayesian filter consists of the update step (where we abuse notation so that $p(x_1 \mid y_{1:0}) = p(x_1)$)
$$p(x_t \mid y_{1:t}) = \frac{p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})}{\int p(y_t \mid x_t)\, p(x_t \mid y_{1:t-1})\, dx_t}$$

and the prediction step
$$p(x_{t+1} \mid y_{1:t}) = \int p(x_{t+1} \mid x_t)\, p(x_t \mid y_{1:t})\, dx_t.$$

In the special case of linear Gaussian models, where the state transition and measurement processes are linear but corrupted with Gaussian noise ($x_t = A x_{t-1} + q_t$ with $q_t \sim \mathcal{N}(0, Q)$, and $y_t = H x_t + r_t$ with $r_t \sim \mathcal{N}(0, R)$), the above integrals can be computed analytically, giving closed form update and predict steps. The distribution over the hidden state remains Gaussian, $p(x_t \mid y_{1:t}) = \mathcal{N}(x_t \mid m_t, P_t)$, and is given by the Kalman filter with update step

$$S_t = H P_t^{-} H^{\top} + R, \quad K_t = P_t^{-} H^{\top} S_t^{-1}, \quad m_t = m_t^{-} + K_t (y_t - H m_t^{-}), \quad P_t = P_t^{-} - K_t S_t K_t^{\top} \qquad (5)$$

and prediction step

$$m_{t+1}^{-} = A\, m_t, \qquad P_{t+1}^{-} = A P_t A^{\top} + Q. \qquad (6)$$
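As a concrete reference, the two steps in Eqs. (5) and (6) are a few lines of NumPy (a sketch in the notation above; `m_pred` and `P_pred` denote $m_t^-$ and $P_t^-$):

```python
import numpy as np

def kf_update(m_pred, P_pred, y, H, R):
    """Kalman update (Eq. 5): condition the Gaussian estimate on measurement y."""
    S = H @ P_pred @ H.T + R                 # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    m = m_pred + K @ (y - H @ m_pred)
    P = P_pred - K @ S @ K.T
    return m, P

def kf_predict(m, P, A, Q):
    """Kalman prediction (Eq. 6): propagate through the linear dynamics."""
    return A @ m, A @ P @ A.T + Q
```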

A.8 Optimal Multi-Target Bayesian Filtering

In this section we give a brief review of the optimal Bayesian filter for the multi-target tracking problem with a fixed cardinality (fixed number of targets and measurements over time) [oh2009markov, pp. 485-486] and its computational intractability.

Given standard multi-target tracking assumptions, the joint distribution over all target states $X$, measurements $Y$, and measurement-target associations $\pi$ can be factored as
$$p(X, Y, \pi) = \prod_{t=1}^{T} p(\pi_t) \prod_{k=1}^{K} p\big(y_t^{\pi_t(k)} \mid x_t^k\big)\, p\big(x_t^k \mid x_{t-1}^k\big), \qquad (7)$$
where $p(x_1^k \mid x_0^k)$ is understood as the prior $p(x_1^k)$. Here, for a tracking sequence of $K$ targets over $T$ time steps, $X$ is an array where row $X_t$ represents the state of all targets at time $t$ and element $x_t^k$ is a vector representing the state of the $k$-th target at time $t$. Likewise, $Y$ is an array where row $Y_t$ represents all measurements at time $t$ and element $y_t^j$ is a vector representing the $j$-th measurement at time $t$. Measurement-target associations are represented by the array $\pi$, where the element $\pi_t \in S_K$ is a permutation of $\{1, \ldots, K\}$ ($S_K$ denotes the symmetric group).

The optimal Bayesian filter for multi-target tracking is a recursive algorithm, similar to the standard Bayesian filter in the single-target setting, that maintains a distribution over the joint state of all targets by incorporating new measurement information as it is obtained. It is more complex than the single-target Bayesian filter because it must deal with uncertainty in measurement-target association. As in the single-target setting, the filter is composed of prediction and update steps. The prediction step is

$$p(X_{t+1} \mid Y_{1:t}) = \int p(X_{t+1} \mid X_t)\, p(X_t \mid Y_{1:t})\, dX_t. \qquad (8)$$

The update step is

$$p(X_t \mid Y_{1:t}) \propto p(X_t \mid Y_{1:t-1}) \sum_{\pi_t \in S_K} p(\pi_t) \prod_{k=1}^{K} p\big(y_t^{\pi_t(k)} \mid x_t^k\big). \qquad (9)$$

Unfortunately, the multi-target optimal Bayesian filtering steps outlined above are computationally intractable. Even in special cases where the integrals are tractable, such as for linear Gaussian models, summation over all $K!$ measurement-target associations is required.

A.9 Sequential Monte Carlo

Sequential Monte Carlo (SMC) or particle filtering methods can be used to sample sequentially from arbitrarily complex sequential models. These methods can be used to sample from the distribution defined by the optimal Bayesian multi-target filter. When target dynamics are linear Gaussian, a Rao-Blackwellized particle filter can be used to sample measurement-target associations and compute sufficient statistics for individual target distributions in closed form sarkka2004rao.

Pseudo-code for Rao-Blackwellized sequential importance sampling is given in Algorithm A.3. We write $\mathrm{KF}_{\mathrm{update}}$ and $\mathrm{KF}_{\mathrm{predict}}$ for the closed form Kalman filter update and prediction steps given in Equations (5) and (6), respectively.

Outputs: importance samples and weights $(\pi^{(i)}, w^{(i)})$ for $i = 1, \ldots, N$, with corresponding state estimates $m$ and covariance matrices $P$. Note that $m$ and $P$ are both arrays; $m_t^{k,(i)}$ is the $k$-th target's estimated state vector at time $t$ for sample $i$.

  for t = 1, …, T do  // update particles at time t
      for i = 1, …, N do  // sample particle i
          Sample an association $\pi_t^{(i)}$ from the proposal $q(\pi_t \mid \pi_{1:t-1}^{(i)}, Y_{1:t})$
          for k = 1, …, K do  // iterate over targets
              $m_t^{k,(i)}, P_t^{k,(i)} \leftarrow \mathrm{KF}_{\mathrm{update}}\big(\bar{m}_t^{k,(i)}, \bar{P}_t^{k,(i)}, y_t^{\pi_t^{(i)}(k)}\big)$
              $\bar{m}_{t+1}^{k,(i)}, \bar{P}_{t+1}^{k,(i)} \leftarrow \mathrm{KF}_{\mathrm{predict}}\big(m_t^{k,(i)}, P_t^{k,(i)}\big)$
          Update the unnormalized importance weight $\tilde{w}_t^{(i)}$
      for i = 1, …, N do  // normalize importance weights
          $w_t^{(i)} \leftarrow \tilde{w}_t^{(i)} \big/ \sum_{j=1}^{N} \tilde{w}_t^{(j)}$

Algorithm A.3 Rao-Blackwellized Sequential Importance Sampling

A.10 Optimal Proposal Distribution

While SMC methods are asymptotically unbiased, their performance depends on the quality of the proposal distribution. The optimal proposal distribution that minimizes the variance of each importance weight is, as noted in Section A.6, a distribution over permutations defined by a matrix permanent; AdaPart allows us to sample from it exactly.