1 Introduction
Probabilistic inference problems arise in many domains, from statistical physics to machine learning. There is little hope that efficient, exact solutions to these problems exist, as they are at least as hard as NP-complete decision problems. Significant research has been devoted across the fields of machine learning, statistics, and statistical physics to developing variational and sampling-based methods that approximate these challenging problems
(Chandler, 1987; Mézard et al., 2002; Wainwright et al., 2008; Baxter, 2016; Owen, 2013). Variational methods such as Belief Propagation (BP) (Koller and Friedman, 2009) have been particularly successful at providing principled approximations due to extensive theoretical analysis. We introduce belief propagation neural networks (BPNNs), a flexible neural architecture designed to estimate the partition function of a factor graph. BPNNs generalize BP and can thus provide more accurate estimates than BP when trained on a small number of factor graphs with known partition functions. At the same time, BPNNs retain many of BP's properties, which results in more accurate estimates compared to general neural architectures. BPNNs are composed of iterative layers (BPNN-D) and an optional Bethe free energy layer (BPNN-B), both of which maintain the symmetries of BP under factor graph isomorphisms. BPNN-D is a parametrized iterative operator that strictly generalizes BP while preserving many of BP's guarantees. Like BP, BPNN-D is guaranteed to converge on tree structured factor graphs and to return the exact partition function. For factor graphs with loops, BPNN-D computes a lower bound whenever the Bethe approximation obtained from fixed points of BP is a provable lower bound (with mild restrictions on BPNN-D). BPNN-B performs regression from the trajectory of beliefs (over a fixed number of iterations) to the partition function of the input factor graph. While this sacrifices some guarantees, the additional flexibility introduced by BPNN-B generally improves estimation performance.
Experimentally, we show that on Ising models BPNN-D is able to converge faster than standard BP and frequently finds better fixed points that provide tighter lower bounds. BPNN-D generalizes well to Ising models sampled from a different distribution than seen during training and to models with nearly twice as many variables as seen during training, providing estimates of the log partition function that are significantly better than BP or a standard graph neural network (GNN) in these settings. We also perform experiments on community detection problems, where BP is known to perform well both empirically and theoretically, and show improvements over BP and a standard GNN. We then perform experiments on approximate model counting (Stockmeyer, 1983; Jerrum et al., 1986; Karp et al., 1989; Bellare et al., 1998), the problem of computing the number of solutions to a Boolean satisfiability (SAT) problem. Unlike the first two experiments, it is very difficult for BP to converge in this setting. Still, we find that BPNN learns to estimate accurate model counts from a training set of tens of problems and generalizes to problems that are significantly harder for an exact model counter to solve. Compared to hand-crafted approximate model counters, BPNN returns comparable estimates hundreds of times faster using GPU computation.
2 Background: Factor Graphs and Belief Propagation
In this section we provide background on factor graphs and belief propagation (Koller and Friedman, 2009). A factor graph is a representation of a discrete probability distribution that takes advantage of independencies between variables to make the representation more compact. Belief propagation is a method for approximating the normalization constant, or partition function, of a factor graph. Let

$$p(x) = \frac{1}{Z} \prod_{a=1}^{M} f_a(x_a) \qquad (1)$$

be a discrete probability distribution defined in terms of a factor graph, where $f_1, \dots, f_M$ are factors defined over subsets $x_a$ of the variables $x = \{x_1, \dots, x_N\}$ and $Z = \sum_{x} \prod_{a=1}^{M} f_a(x_a)$ is the partition function. As a data structure, a factor graph is a bipartite graph with variable nodes and factor nodes. Factor nodes and variable nodes are connected if and only if the variable is in the scope of the factor.
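The definition above can be made concrete with a small sketch. The following is an illustrative brute-force computation of the partition function, not code from the paper; the scope/table encoding of factors is our own assumption for the example.

```python
import itertools
import numpy as np

def partition_function(n_vars, factors):
    """Brute-force Z = sum over all assignments of the product of factors.

    `factors` is a list of (scope, table) pairs: `scope` lists the variable
    indices in the factor's scope, and `table` is a numpy array with one
    dimension per scoped variable (binary variables here, for simplicity).
    """
    Z = 0.0
    for x in itertools.product([0, 1], repeat=n_vars):
        p = 1.0
        for scope, table in factors:
            p *= table[tuple(x[i] for i in scope)]
        Z += p
    return Z

# A 3-variable chain x0 -- f_a -- x1 -- f_b -- x2 with attractive potentials.
f = np.array([[2.0, 1.0], [1.0, 2.0]])
factors = [([0, 1], f), ([1, 2], f)]
Z = partition_function(3, factors)  # exact by enumeration
```

Brute-force enumeration is exponential in the number of variables, which is exactly why approximations such as belief propagation are needed.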
Belief Propagation
Belief propagation performs iterative message passing between neighboring variable and factor nodes. Variable-to-factor messages, $m_{i \to a}$, and factor-to-variable messages, $m_{a \to i}$, are computed at every iteration $k$ as

$$m^{(k)}_{i \to a}(x_i) = \prod_{c \in N(i) \setminus \{a\}} m^{(k-1)}_{c \to i}(x_i), \qquad m^{(k)}_{a \to i}(x_i) = \sum_{x_a \setminus x_i} f_a(x_a) \prod_{j \in N(a) \setminus \{i\}} m^{(k)}_{j \to a}(x_j). \qquad (2)$$
Messages are typically initialized either randomly or as constants. The BP algorithm estimates approximate marginal probabilities over the sets of variables $x_a$ associated with each factor $f_a$. We denote the belief over variables $x_a$, after message passing iteration $k$ is complete, as $b^{(k)}_a(x_a) \propto f_a(x_a) \prod_{i \in N(a)} m^{(k)}_{i \to a}(x_i)$. Similarly, BP computes beliefs at each variable as $b^{(k)}_i(x_i) \propto \prod_{a \in N(i)} m^{(k)}_{a \to i}(x_i)$. The belief propagation algorithm proceeds by iteratively updating variable-to-factor messages and factor-to-variable messages until they converge to fixed values, referred to as a fixed point of Equations 2, or until a predefined maximum number of iterations is reached. At this point the beliefs are used to compute a variational approximation of the factor graph's partition function. This approximation, originally developed in statistical physics, is known as the Bethe free energy (Bethe, 1935), $F_{\text{Bethe}} = U_{\text{Bethe}} - H_{\text{Bethe}}$. It is defined in terms of the Bethe average energy $U_{\text{Bethe}} = -\sum_{a} \sum_{x_a} b_a(x_a) \ln f_a(x_a)$ and the Bethe entropy $H_{\text{Bethe}} = -\sum_{a} \sum_{x_a} b_a(x_a) \ln b_a(x_a) + \sum_{i} (d_i - 1) \sum_{x_i} b_i(x_i) \ln b_i(x_i)$, where $d_i$ is the degree of variable node $i$.
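The Bethe free energy can be computed directly from converged beliefs. Below is a minimal illustrative sketch of this computation (our own code, not the paper's); on a tree, where beliefs equal exact marginals, $-F_{\text{Bethe}}$ recovers $\ln Z$ exactly.

```python
import numpy as np

def bethe_free_energy(factors, factor_beliefs, var_beliefs, degrees):
    """Bethe free energy F = U - H from beliefs; ln Z is approximated by -F.

    U = -sum_a sum_{x_a} b_a(x_a) ln f_a(x_a)
    H = -sum_a sum_{x_a} b_a(x_a) ln b_a(x_a)
        + sum_i (d_i - 1) sum_{x_i} b_i(x_i) ln b_i(x_i)
    Assumes strictly positive factors and beliefs for the logs.
    """
    U = -sum((b * np.log(f)).sum() for f, b in zip(factors, factor_beliefs))
    H = -sum((b * np.log(b)).sum() for b in factor_beliefs)
    H += sum((d - 1) * (b * np.log(b)).sum()
             for d, b in zip(degrees, var_beliefs))
    return U - H
```

For a graph with a single factor $f = [2, 1]$ over one binary variable, the beliefs are $f/Z$ with $Z = 3$, and $-F_{\text{Bethe}} = \ln 3$ exactly.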
Numerically Stable Belief Propagation.
For numerical stability, belief propagation is generally performed in log space and messages are normalized at every iteration. It is also standard to add a damping parameter, $\alpha \in [0, 1]$, to improve convergence by taking partial update steps. BP without damping is recovered when $\alpha = 1$, while $\alpha = 0$ would correspond to not updating messages and instead retaining their values from the previous iteration. With these modifications, the variable-to-factor messages from Equation 2 are rewritten as follows, where terms scaled by $\alpha$ represent the difference in the message's value from the previous iteration:

$$\tilde{m}^{(k)}_{i \to a}(x_i) = \tilde{m}^{(k-1)}_{i \to a}(x_i) + \alpha\Big(\sum_{c \in N(i) \setminus \{a\}} \tilde{m}^{(k-1)}_{c \to i}(x_i) - \tilde{m}^{(k-1)}_{i \to a}(x_i)\Big) - \tilde{z}_{i \to a}. \qquad (3)$$
Similarly, the factor-to-variable messages from Equation 2 are rewritten as

$$\tilde{m}^{(k)}_{a \to i}(x_i) = \tilde{m}^{(k-1)}_{a \to i}(x_i) + \alpha\Big(\mathrm{LSE}_{x_a \setminus x_i}\Big[\tilde{f}_a(x_a) + \sum_{j \in N(a) \setminus \{i\}} \tilde{m}^{(k)}_{j \to a}(x_j)\Big] - \tilde{m}^{(k-1)}_{a \to i}(x_i)\Big) - \tilde{z}_{a \to i}. \qquad (4)$$
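A sketch of the damped, log-space factor-to-variable update in Equation 4 for a single pairwise factor follows. This is illustrative code under our own simplifying assumptions (one pairwise factor, so a single incoming message), not the paper's implementation.

```python
import numpy as np

def logsumexp(a, axis=None):
    """Numerically stable log(sum(exp(a))) along an axis."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return out.squeeze(axis) if axis is not None else out.item()

def damped_f2v_update(log_f, msg_j_to_a, prev_msg, alpha):
    """One damped, log-space factor-to-variable update (Equation 4) for a
    pairwise factor f_a(x_i, x_j); axis 0 indexes x_i, axis 1 indexes x_j.

    log_f:      (k, k) log potential
    msg_j_to_a: (k,) log message from the other variable j into the factor
    prev_msg:   (k,) previous factor-to-variable message
    alpha:      damping parameter; alpha = 1 recovers undamped BP
    """
    new = logsumexp(log_f + msg_j_to_a[None, :], axis=1)  # marginalize out x_j
    msg = prev_msg + alpha * (new - prev_msg)
    return msg - logsumexp(msg)  # log-space normalization (the z~ term)
```

Setting alpha to 0 leaves a normalized message unchanged, matching the text's description of the two extremes of damping.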
Note that $\tilde{m}_{a \to i}$ and $\tilde{m}_{i \to a}$ are vectors of length $|\mathcal{X}_i|$ (the cardinality of variable $X_i$), $\tilde{f}_a = \ln f_a$ denotes log factors, $\tilde{z}_{i \to a}$ and $\tilde{z}_{a \to i}$ are normalization terms, and we use the shorthand $\mathrm{LSE}$ for the logsumexp function: $\mathrm{LSE}_x\big(g(x)\big) = \ln \sum_x \exp\big(g(x)\big)$.

3 Belief Propagation Neural Networks
We design belief propagation neural networks (BPNNs) as a family of graph neural networks that operate on factor graphs. Unlike standard graph neural networks (GNNs), BPNNs do not resend messages between nodes, a property taken from BP known as avoiding 'double counting' the evidence. This property guarantees that BPNN-D, described below, is exact on trees (Theorem 3). BPNN-D is a strict generalization of BP (Proposition 1), but is still guaranteed to give a lower bound to the partition function upon convergence for a class of factor graphs (Theorem 3) by finding fixed points of BP (Theorem 2). Like BP, BPNN preserves the symmetries inherent to factor graphs (Theorem 4).
BPNNs consist of two parts. First, iterative BPNN layers output messages, analogous to standard BP. These messages are used to compute beliefs using the same equations as for BP. Second, the beliefs are passed into a Bethe free energy layer (BPNN-B) which generalizes the Bethe approximation by performing regression from beliefs to an estimate of $\ln Z$. Alternatively, when the standard Bethe approximation is used in place of BPNN-B, BPNN provides many of BP's guarantees.
BPNN Iterative Layers
BPNN iterative layers are flexible neural operators that can operate on beliefs or messages in a variety of ways. Here, we focus on a specific variant, BPNN-D, due to its strong convergence properties, and we refer the reader to Appendix C for information on other variants. The BPNN iterative damping layer (BPNN-D) modifies factor-to-variable messages (Equation 4) using the output of a learned operator $H : \mathbb{R}^{n} \to \mathbb{R}^{n}$, with $n = \sum_{i=1}^{N} d_i\,|\mathcal{X}_i|$, in place of the conventional damping term scaled by $\alpha$, where $d_i$ denotes the degree and $|\mathcal{X}_i|$ the cardinality of variable $X_i$. This learned operator takes as input the difference between iterations $k$ and $k-1$ of every factor-to-variable message, and modifies these differences jointly. It can thus be much richer than a scalar multiplier. BPNN-D factor-to-variable messages are given by
$$\tilde{m}^{(k)}_{a \to i}(x_i) = \tilde{m}^{(k-1)}_{a \to i}(x_i) + H\big(\Delta^{(k)}\big)_{a \to i}(x_i) - \tilde{z}_{a \to i}, \quad \Delta^{(k)}_{a \to i}(x_i) = \mathrm{LSE}_{x_a \setminus x_i}\Big[\tilde{f}_a(x_a) + \sum_{j \in N(a) \setminus \{i\}} \tilde{m}^{(k)}_{j \to a}(x_j)\Big] - \tilde{m}^{(k-1)}_{a \to i}(x_i), \qquad (5)$$
where $H(\Delta^{(k)})$ denotes the result of applying $H$ to all factor-to-variable message differences and $H(\Delta^{(k)})_{a \to i}$ is the output corresponding to the modified message difference $\Delta^{(k)}_{a \to i}$. Variable-to-factor messages are unchanged from Eq. 3, except for taking BPNN-D messages as input,
$$\tilde{m}^{(k)}_{i \to a}(x_i) = \tilde{m}^{(k-1)}_{i \to a}(x_i) + \alpha\Big(\sum_{c \in N(i) \setminus \{a\}} \tilde{m}^{(k-1)}_{c \to i}(x_i) - \tilde{m}^{(k-1)}_{i \to a}(x_i)\Big) - \tilde{z}_{i \to a}. \qquad (6)$$
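The structure of the BPNN-D update in Equation 5 can be sketched as follows. This is an illustrative toy (our own code, with `H` standing in for the learned operator); with $H(x) = \alpha x$ the update reduces to damped BP, and $H(0) = 0$ means message differences of zero leave messages unchanged, as in Theorem 1.

```python
import numpy as np

def bpnn_d_f2v_update(new_msgs, prev_msgs, H):
    """BPNN-D factor-to-variable update (Equation 5): the damping term
    alpha * (new - prev) of Equation 4 is replaced by a learned operator H
    applied jointly to all factor-to-variable message differences.

    new_msgs, prev_msgs: (n, k) arrays, one row per factor-to-variable message.
    H: maps the (n, k) array of differences to a modified (n, k) array.
    """
    deltas = new_msgs - prev_msgs
    msgs = prev_msgs + H(deltas)
    # per-message normalization in log space (the z~ terms)
    return msgs - np.log(np.exp(msgs).sum(axis=1, keepdims=True))

# With H(x) = alpha * x, BPNN-D reduces to damped BP (Proposition 1).
alpha = 0.5
damped_H = lambda d: alpha * d
```

In the full model, `H` would be a neural network acting jointly on all message differences rather than a fixed scalar multiplier.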
Proposition 1.
BPNN-D subsumes BP and damped BP as a strict generalization.
For nontrivial choices of $H$, whether BPNN-D preserves the fixed points of BP or introduces any new ones turns out to depend only on the set of fixed points of $H$ itself, i.e., $\{x : H(x) = x\}$. As we show next, this property allows us to easily enforce that every fixed point of BP is also a fixed point of BPNN-D (Theorem 1), or vice versa (Theorem 2). (For lack of space, all proofs are deferred to Appendix A.)
Theorem 1.
If zero is a fixed point of $H$, then every fixed point of BP is also a fixed point of BPNN-D.
Theorem 2.
If $H$ does not have any nonzero fixed points, then every fixed point of BPNN-D is also a fixed point of BP.
Corollary 2.1.
If zero is the unique fixed point of $H$, then the fixed points of BP and BPNN-D are identical. This property is satisfied, for example, when $H(x) = x - G(x)$ for an invertible function $G$ with $G(0) = 0$.
Note that a broad class of highly expressive learnable operators are invertible (Behrmann et al., 2019). Enforcing that every fixed point of BPNN-D is also a fixed point of BP is particularly useful, as it immediately follows that BPNN-D returns a lower bound whenever the Bethe approximation obtained from fixed points of BP returns a provable lower bound (Theorem 3). When a BPNN-D layer is applied iteratively until convergence, fast convergence is guaranteed for tree structured factor graphs (Proposition 2). As mentioned, BPNN iterative layers are flexible and can additionally be modified to operate directly on message values or factor beliefs at the expense of no longer returning a lower bound (see Appendix C).
Theorem 3.
If zero is the unique fixed point of $H$, the Bethe approximation computed from beliefs at a fixed point of BPNN-D (1) is exact for tree structured factor graphs and (2) lower bounds the partition function of any factor graph with binary variables and log-supermodular potential functions.
Proposition 2.
BPNN-D converges within $\ell$ iterations on tree structured factor graphs of height $\ell$.
Bethe Free Energy Layer (BPNN-B).
When convergence to a fixed point is unnecessary, we can increase the flexibility of our architecture by building a $K$-layer BPNN from iterative layers that do not share weights. Additionally, we define a Bethe free energy layer (BPNN-B, Equation 7) using two MLPs, $\mathrm{MLP}_F$ and $\mathrm{MLP}_V$, that take the trajectories of learned beliefs from each factor and variable as input and output scalars:
$$\mathrm{BPNN\text{-}B} = \sum_{a=1}^{M} \frac{1}{|x_a|!} \sum_{\sigma \in S_{|x_a|}} \mathrm{MLP}_F\Big(\sigma\big[\tilde{f}_a \,\big\|\, b^{(1)}_a \,\big\|\, \cdots \,\big\|\, b^{(K)}_a\big]\Big) + \sum_{i=1}^{N} \mathrm{MLP}_V\Big(b^{(1)}_i \,\big\|\, \cdots \,\big\|\, b^{(K)}_i\Big) \qquad (7)$$
This parameterization subsumes the standard Bethe approximation, so we can initialize the parameters of $\mathrm{MLP}_F$ and $\mathrm{MLP}_V$ to output the Bethe approximation computed from the final layer beliefs (see the appendix for details). Note that $|x_a|$ is the number of variables in the scope of factor $a$, $S_{|x_a|}$ denotes the symmetric group (all permutations of $\{1, \dots, |x_a|\}$), and the permutation $\sigma$ is applied to the dimensions of all concatenated terms. We ensure that BPNN preserves the symmetries of BP (Theorem 4) by passing all factor permutations through $\mathrm{MLP}_F$ and averaging the result.
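The permutation-averaging trick used by BPNN-B can be sketched in a few lines. This is an illustrative toy (our own code, with a stand-in linear "MLP"); averaging the readout over all orderings of a factor's dimensions makes the output invariant to local variable indexing.

```python
import itertools
import numpy as np

def perm_invariant_readout(factor_belief, mlp):
    """Average an MLP's output over all permutations of a factor's variable
    dimensions, as in Equation 7, so the result is invariant to the local
    indexing of variables within the factor."""
    dims = range(factor_belief.ndim)
    outs = [mlp(np.transpose(factor_belief, perm).ravel())
            for perm in itertools.permutations(dims)]
    return sum(outs) / len(outs)

# A stand-in "MLP": a fixed random linear map followed by a tanh.
rng = np.random.default_rng(0)
W = rng.normal(size=4)
mlp = lambda x: float(np.tanh(W @ x))
```

The cost is a factor of $|x_a|!$ per factor, which is modest for the low-arity factors considered here.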
BPNN Preserves the Symmetries of BP.
BPNN is designed so that equivalent input factor graphs are mapped to equivalent outputs. This is a property that BP satisfies by default. Standard GNNs are also designed to satisfy this property; however, the notion of 'equivalence' between graphs differs from 'equivalence' between factor graphs. In this section we formalize these statements.
Graph isomorphism defines an equivalence relationship between graphs that is respected by standard GNNs. Two isomorphic graphs are structurally equivalent and indistinguishable if the nodes are appropriately matched. More formally, there exists a bijection between nodes (or their indices) in the two graphs that defines this matching. Standard GNNs are designed so that output node representations are equivariant to the input node indexing; the indexing of output node representations matches the indexing of input nodes. Output node representations of a GNN run on two isomorphic graphs can be matched using the same bijection that defines the isomorphism. Further, standard GNNs are designed to map isomorphic graphs to the same graphlevel output representation. These two properties are achieved by using a message aggregation function and a graphlevel output function that are both invariant to node indexing.
We formally define factor graph isomorphism in Definition 1 (Appendix A). This equivalence relationship is more complicated than for standard graphs because factor potentials define a structured relationship between factor and variable nodes. As in a standard graph, variable nodes are indexed globally in the representation of a factor graph. Additionally, variable nodes are also indexed locally by the factors that contain them. This is required because each factor dimension (note that factors are tensors) corresponds to a unique variable, unless the factor happens to be symmetric. Local variable indices define a mapping between factor dimensions and the variables' global indices. These local variable indices lead to additional bijections in the definition of isomorphic factor graphs (condition 2 in Definition 1). Note that standard GNNs do not respect factor graph isomorphisms because of these additional bijections.

In contrast to standard GNNs, BP respects factor graph isomorphisms. When BP is run on two isomorphic factor graphs for the same number of iterations with constant message initialization (any message initialization can be used, as long as initial messages are equivariant; see Lemma 1), the output beliefs and messages satisfy bijections corresponding to those of the input factor graphs. Specifically, messages are equivariant to global node indexing (Lemma 1), variable beliefs are equivariant to global variable node indexing (Lemma 2), and factor beliefs are equivariant to global factor node indexing and to local variable node indexing within factors (Lemma 3). We refer to the above properties as equivariances of BP under factor graph isomorphisms. We show that these properties also apply to BPNN-D when $H$ is equivariant to global node indexing. The Bethe approximation obtained from isomorphic factor graphs is identical when BP is run for the same number of iterations with constant message initialization. BPNN-B also satisfies this property because it is, by design, invariant to local variable indexing within factors (Lemma 4). Together, these properties lead to the following:
Theorem 4.
If $H$ is equivariant to global node indexing, then (1) BPNN-D messages and beliefs preserve the equivariances of BP under factor graph isomorphisms and (2) BPNN-B is invariant under factor graph isomorphisms.
4 Experiments
In our experiments we trained BPNN to estimate the partition function of factor graphs from a variety of domains. First, experiments on synthetic Ising models show that BPNN-D can learn to find better fixed points than BP and to converge faster. Additionally, BPNN generalizes to Ising models with nearly twice as many variables as those seen during training and that were sampled from a different distribution. Second, experiments and an ablation study on the stochastic block model from community detection show that maintaining properties of BP in BPNN improves results over standard GNNs. Finally, model counting experiments performed on real world SAT problems show that BPNN can learn from tens of training problems, generalize to problems that are harder for an exact model counter, and compute estimates hundreds of times faster than hand-crafted approximate model counters. We implemented our BPNN and the baseline GNN using PyTorch Geometric
(Fey and Lenssen, 2019). We refer the reader to Appendix B.2 for details on the GNN.

4.1 Ising Models
We followed a common experimental setup used to evaluate approximate integration methods (Hazan and Jaakkola, 2012; Ermon et al., 2013). We randomly generated grid structured attractive Ising models whose partition functions can be computed exactly using the junction tree algorithm (Lauritzen and Spiegelhalter, 1988) for training and validation. BP computes a provable lower bound for these Ising models (Ruozzi, 2012). This family of Ising models is only slightly more general than the one studied in (Koehler, 2019), where BP was proven to quickly converge to the Bethe free energy's global optimum. We found that an iterative BPNN-D layer was able to converge faster than standard BP and could find tighter lower bounds for these problems. Additionally, we trained a 10-layer BPNN and evaluated its performance against a 10-layer GNN architecture (details in Appendix). Compared to the GNN, BPNN has improved generalization when tested on larger Ising models and Ising models sampled from a different distribution than seen during training.
Improved Lower Bounds and Faster Convergence.
We trained an iterative BPNN-D layer to lower bound the partition function on a training set of 50 random Ising models of size 10x10 (100 variables). (See the appendix for further details.) We then ran the learned BPNN-D and standard BP on a validation set of 50 Ising models. We empirically verified that BPNN-D found fixed points corresponding to tighter lower bounds than BP, and that it found them faster than standard BP. BPNN-D converged on all 50 models, while BP failed to converge within 200 iterations for 6 of the models. We recorded the number of iterations that BPNN-D and BP, run with parallel updates, took to converge, defined as the maximum factor-to-variable message difference falling below a fixed threshold. BPNN-D had a median improvement ratio of 1.7x over BP; please refer to the appendix for complete convergence plots. Among the 44 models where BP converged, the RMSE between the exact log partition function and BPNN-D's estimate was 0.97, compared with 7.20 for BP. For 10 of the 44 models, BPNN-D found fixed points corresponding to lower bounds on the log partition function that were larger (i.e., better) than BP's by 3 to 22 (corresponding to bounds on the partition function that were larger by factors of $e^{3} \approx 20$ up to $e^{22}$). In contrast, the log lower bound found by BP was never larger than the bound found by BPNN-D by more than 1.7.
Out of Distribution Generalization.
We tested BPNN's ability to generalize to larger factor graphs and to shifts in the test distribution. Again we used a training set of 50 Ising models of size 10x10 (100 variables). We sampled test Ising models from distributions with generative parameters increased by factors of 2 and 10 from their training values (see appendix for details) and with their size increased to 14x14 (196 variables instead of the 100 seen during training). For this experiment we used a BPNN architecture with 10 iterative layers whose weights were not tied and with MLPs that operate on factor messages (without a BPNN-B layer). As a baseline we trained a 10-layer GNN (maximally powerful GIN architecture) with width 4 on the same dataset. We also computed the Bethe approximation from running standard loopy belief propagation, as well as the mean field approximation. We used the libDAI (Mooij, 2010) implementation for both. We tested loopy belief propagation with and without damping and with both parallel and sequential message update strategies. We show results for the two settings whose estimates of the partition function differ most drastically: (1) run for a maximum of 10 iterations with parallel updates and damping set to 0.5, and (2) run for a maximum of 1000 iterations with sequential updates using a random sequence and no damping. Full test results are shown in Figure 1. The leftmost point in the left figure shows results for test data that was drawn from the same distribution used for training the BPNN and GNN. The BPNN and GNN perform similarly for data drawn from the distribution seen during training. However, our BPNN significantly outperforms the GNN when the test distribution differs from the training distribution and when generalizing to the larger models. Our BPNN also significantly outperforms loopy belief propagation, both for test data drawn from the training distribution and for out of distribution data.
4.2 Stochastic Block Model
The Stochastic Block Model (SBM) is a generative model describing the formation of communities and is often used to benchmark community detection algorithms (Abbe, 2018). While BP does not lower bound the partition functions of the factor graphs associated with SBMs, it has been shown that BP asymptotically (in the number of nodes) reaches the information theoretic threshold for community recovery on SBMs with fewer than 4 communities (Abbe, 2018). We trained a BPNN to estimate the partition function of the associated factor graph and observed improvements over estimates obtained by BP or a maximally powerful GNN, which leads to more accurate marginals that can better quantify uncertainty in SBM community membership. We refer the reader to Appendix F for a formal definition of SBMs as well as our procedure for constructing factor graphs from a sampled SBM.
Dataset and Methods
In our experiments, we consider SBMs with 2 classes and 15-20 nodes, so that exact inference is possible using the junction tree algorithm. In this non-asymptotic setting, BP is a strong baseline and can almost perfectly recover communities (Chen et al., 2019), but it is not optimal and thus does not compute exact marginals or partition functions. For training, we sample 10 two-class SBMs with 15 nodes, class probabilities of 0.75 and 0.25, and edge probabilities of 0.93 within and 0.067 between classes, along with four such graphs for validation. For each graph, we fix each node to each class and calculate the exact log partition function using the junction tree algorithm, producing 300 training and 120 validation graphs. We explain in Appendix F how these graphs can be used to calculate marginals.
To estimate SBM partition functions, we trained a BPNN with 30 iterative BPNN layers that operate on messages (see Appendix C), followed by a BPNN-B layer. Since BP does not provide a lower bound for SBM partition functions, we took advantage of BPNN's flexibility and chose greater expressive power over BPNN-D's superior convergence properties. We compared against BP and a GNN as baseline methods. Additionally, we performed 2 ablation experiments. First, we trained a BPNN with a BPNN-B layer that was not permutation invariant to local variable indexing, by removing the sum over permutations $\sigma \in S_{|x_a|}$ from Equation 7 and only passing in the original beliefs. We refer to this non-invariant version as BPNN-NI. Second, we forced BPNN-NI to 'double count' messages by changing the sums in Equations 5 and 6 to run over all neighbors, $N(a)$ and $N(i)$, rather than excluding the destination node. We refer to this non-invariant version that performs double counting as BPNN-DC. We refer the reader to Appendix F for further details on models and training.
Results
As shown in Table 1, BPNN provides the best estimates of the partition function. Critically, we see that not 'double counting' messages and preserving the symmetries of BP are key improvements of BPNN over the GNN. Additionally, BPNN outperforms BP and the GNN on out of distribution data and larger graphs, and it can learn more accurate marginals. We refer the reader to Appendix F for more details on these additional experiments.
Table 1: Stochastic Block Model RMSE (Train/Val)

BP: 12.55/11.14 | GNN: 7.33/7.93 | BPNN-DC: 7.04/8.43 | BPNN-NI: 4.43/5.63 | BPNN: 4.16/4.15
4.3 Model Counting
In this section we use a BPNN to estimate the number of satisfying solutions to a Boolean formula, a challenging problem for BP, which generally fails to converge due to the complex logical constraints and zero-probability states. Computing the exact number of satisfying solutions (exact model counting) is a #P-complete problem (Valiant, 1979). Model counting is a fundamental problem that arises in many domains including probabilistic reasoning (Roth, 1993; Belle et al., 2015), network reliability (Dueñas-Osorio et al., 2017), and detecting private information leakage from programs (Biondi et al., 2018). However, the computational complexity of exact model counting has led to a significant body of work on approximate model counting (Stockmeyer, 1983; Jerrum et al., 1986; Karp et al., 1989; Bellare et al., 1998; Gomes et al., 2006; Ermon et al., 2014; Ivrii et al., 2015; Achlioptas and Jiang, 2015; Achlioptas et al., 2018; Soos and Meel, 2019), with the goal of estimating the number of satisfying solutions at a lower computational cost.
Training Setup.
All BPNNs trained in this section were composed of 5 BPNN-D layers followed by a BPNN-B layer and were trained to predict the natural logarithm of the number of satisfying solutions to an input formula in conjunctive normal form (CNF). This is accomplished by converting the CNF formula into a factor graph whose partition function equals the number of satisfying solutions of the input formula. We evaluated the performance of our BPNN using benchmarks from (Soos and Meel, 2019), with ground truth model counts obtained using DSharp (Muise et al., 2012). The benchmarks fall into 7 categories, including network QMR (Quick Medical Reference) problems (Jaakkola and Jordan, 1999), network grid problems, and bit-blasted versions of satisfiability modulo theories library (SMT-LIB) benchmarks (Chakraborty et al., 2016). Each category contains 14 to 105 problems allocated for training and validation. See the appendix for additional details on training, the dataset, and our use of minimal independent support variable sets.
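The reduction from model counting to a partition function computation can be sketched concretely. The following illustrative code (our own, not the paper's) encodes each clause as a 0/1 factor, so the partition function of the resulting factor graph is exactly the model count; here we evaluate it by brute force on a tiny formula.

```python
import itertools

def count_models(n_vars, clauses):
    """#SAT by brute force: each clause c is a 0/1 factor f_c(x) equal to 1
    iff the clause is satisfied, so the partition function of the factor
    graph (the sum over assignments of the product of factors) equals the
    model count. Clauses use DIMACS-style signed integer literals."""
    count = 0
    for x in itertools.product([False, True], repeat=n_vars):
        sat = all(any(x[abs(l) - 1] == (l > 0) for l in c) for c in clauses)
        count += sat
    return count

# (x1 or x2) and (not x1 or x3)
clauses = [[1, 2], [-1, 3]]
```

A BPNN instead runs message passing on this factor graph and regresses to the log partition function, avoiding the exponential enumeration.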
Baseline Approximate Model Counters.
For comparison we ran two state-of-the-art approximate model counters on all benchmarks: ApproxMC3 (Chakraborty et al., 2016; Soos and Meel, 2019) and F2 (Achlioptas and Theodoropoulos, 2017; Achlioptas et al., 2018). ApproxMC3 is a randomized hashing algorithm that returns an estimate of the model count that is guaranteed to be within a multiplicative factor of the exact model count with high probability. F2 gives up the probabilistic guarantee that the returned estimate will be within a multiplicative factor of the true model count in return for significantly increased computational efficiency. We also attempted to train a GNN, using the architecture from (Selsam et al., 2018) adapted from classification to regression. We used the authors' code, slightly modified to perform regression, but were not successful in achieving non-trivial learning.
BPNNs Provide Excellent Computational Efficiency.
Figure 2 shows runtimes and estimates for BPNN, ApproxMC3, and F2 on all benchmarks from the category 'or_50'. BPNN is significantly faster than both F2 and ApproxMC3. BPNN provides median speedups of 2.2x and 32x over F2 and ApproxMC3, respectively, when all methods are run on a CPU. When BPNN is allowed to run in parallel on a GPU, it provides median speedups of 248x and 3,689x over F2 and ApproxMC3. Additionally, BPNN's estimates are significantly tighter than F2's, with an RMSE of 0.30 for BPNN compared with 2.5 for F2. Please see the appendix for further runtime comparisons between methods.
Learning from Limited Data.
We trained a separate BPNN on a random sampling of 70% of the problems in each training category. This gave each BPNN only 9 to 73 benchmarks to learn from. In contrast, prior work has performed approximate model counting on Boolean formulas in disjunctive normal form (DNF) by creating a large training set of 100k examples whose model counts can be approximated with an efficient polynomial time algorithm (Abboud et al., 2020). Such an algorithm does not exist for model counting on CNF formulas, making this approach intractable. Nonetheless, BPNN achieves training and validation RMSEs comparable to or better than F2 across the range of benchmark categories (see the appendix for complete results). This demonstrates that BPNNs can capture the distribution of diverse families of SAT problems in an extremely data limited regime.
Generalizing from Easy Data to Hard Data.
We repeated the experiment from the previous paragraph, but trained each BPNN on the 70% of problems from each category that DSharp solved fastest. Validation was performed on the remaining 30% of problems that took longest for DSharp to solve. These hard validation sets are significantly more challenging for DSharp: the median runtime in each category's hard validation set is 4 to 15 times longer than the longest runtime in the corresponding easy training set. Validation RMSE on these hard problems was within 33% of the validation error obtained when training and validating on a random split, for 3 of the 7 categories. This demonstrates that BPNNs have the potential to be trained on available data and then generalize to related problems that are too difficult for any current methods. See the appendix for complete results.
Learning Across Diverse Domains.
We trained a BPNN on a random sampling of 70% of problems from all categories, spanning network grid problems, bit-blasted versions of SMT-LIB benchmarks, and network DQMR problems. The BPNN achieved a final training RMSE of 3.9 and a validation RMSE of 5.31, demonstrating that BPNN is capable of capturing a broad distribution spanning multiple domains from a small training set.
5 Related Work
(Abboud et al., 2020) use a graph neural network to perform approximate weighted disjunctive normal form (DNF) counting. Weighted DNF counting is a #P-complete problem. However, in contrast to model counting on CNF formulas, there exists a polynomial time approximation algorithm for weighted DNF counting, with runtime polynomial in $n$ and $m$ (where $n$ is the number of variables and $m$ is the number of clauses). The authors leverage this to generate a large training dataset of 100k DNF formulas with approximate solutions. In comparison, our BPNN can learn and generalize from a very small training dataset of fewer than 50 problems. This result addresses the significant direction for future work alluded to in the conclusion of (Abboud et al., 2020).
Recently, (Satorras and Welling, 2020) designed a graph neural network that operates on factor graphs and exchanges messages with BP to perform error correction decoding. (An early version of our paper, concurrent with (Satorras and Welling, 2020), was submitted to UAI 2020: https://github.com/jkuck/jkuck.github.io/blob/master/files/BPNN_UAI_submission.pdf.) In contrast, BPNN-D preserves all of BP's fixed points, computes the exact partition function on tree structured factor graphs, and returns a lower bound whenever the Bethe approximation obtained from fixed points of BP is a provable lower bound. All BPNN layers preserve BP's symmetries (invariances and equivariances) under permutations of both variable and factor indices. Finally, BPNN avoids 'double counting' during message passing.
Prior work has shown that neural networks can learn to solve NP-complete decision problems and optimization problems (Selsam et al., 2018; Prates et al., 2019; Hsieh et al., 2019). (Yoon et al., 2018) perform marginal inference in relatively small graphical models using GNNs. (Heess et al., 2013) consider improving message passing in expectation propagation for probabilistic programming, when users can specify arbitrary code to define factors and the optimal updates are intractable. (Wiseman and Kim, 2019)
consider learning Markov random fields and address the problem of estimating marginal likelihoods (generally intractable to compute precisely). They use a transformer network that is faster than LBP but computes comparable estimates. This allows for faster amortized inference during training when likelihoods must be computed at every training step. In contrast, BPNNs significantly outperform LBP and generalize to out of distribution data.
6 Conclusion
We introduced belief propagation neural networks, a strict generalization of BP that learns to find better fixed points faster. The BPNN architecture resembles that of a standard GNN, but preserves BP's invariances and equivariances to permutations of variable and factor indices. We empirically demonstrated that BPNNs can learn from tiny data sets containing only tens of training points and generalize to test data drawn from a different distribution than seen during training. BPNNs significantly outperform loopy belief propagation and standard graph neural networks in terms of accuracy. BPNNs provide excellent computational efficiency, running orders of magnitude faster than state-of-the-art randomized hashing algorithms while maintaining comparable accuracy.
Broader impact
This work makes both a theoretical contribution and a practical one by advancing the state-of-the-art in approximate inference on some benchmark problems. Our theoretical analysis of neural fixed point iterators is unlikely to have a direct impact on society. BPNN, on the other hand, can make approximate inference more scalable. Because approximate inference is a key computational problem underlying, for example, much of Bayesian statistics, it is applicable to many domains, both beneficial and harmful to society. Among the beneficial ones are applications of probabilistic inference to medical diagnosis and applications of model counting to reliability, safety, and privacy analysis.
Acknowledgements
We thank Tri Dao, Ines Chami, and Shengjia Zhao for helpful discussions and feedback. Research supported by NSF (#1651565, #1522054, #1733686), ONR (N000141912145), AFOSR (FA9550 1910024), and FLI.
References
 Community detection and stochastic block models: recent developments. Journal of Machine Learning Research 18 (177), pp. 1–86. External Links: Link Cited by: §4.2.
 Learning to reason: leveraging neural networks for approximate DNF counting. AAAI. Cited by: §4.3, §5.
 Fast and flexible probabilistic model counting. In SAT, pp. 148–164. Cited by: Appendix E, Appendix E, §4.3, §4.3.
 Stochastic integration via error-correcting codes. In Proc. Uncertainty in Artificial Intelligence, Cited by: §4.3.
 Probabilistic model counting with short XORs. In SAT, Cited by: Appendix E, §4.3.
 Exactly solved models in statistical mechanics. Elsevier. Cited by: §1.
 Invertible residual networks. In International Conference on Machine Learning, pp. 573–582. Cited by: §3.
 Uniform generation of NP-witnesses using an NP-oracle. Electronic Colloquium on Computational Complexity (ECCC) 5. Cited by: §1, §4.3.
 Hashing-based approximate probabilistic inference in hybrid domains. In Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI), Cited by: §4.3.
 Statistical theory of superlattices. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences 150 (871), pp. 552–575. Cited by: §B.1, §2.
 Scalable approximation of quantitative information flow in programs. In VMCAI, Cited by: §4.3.
 Algorithmic improvements in approximate counting for probabilistic inference: from linear to logarithmic SAT calls. In IJCAI, Cited by: Appendix E, Appendix E, §4.3, §4.3.
 Introduction to modern statistical mechanics. Oxford University Press, Oxford, UK. Cited by: §1.
 Supervised community detection with line graph neural networks. ICLR. Cited by: Appendix F, §4.2.
 Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Phys. Rev. E 84, pp. 066106. External Links: Document, Link Cited by: Appendix F.
 Counting-based reliability estimation for power-transmission grids. In AAAI, Cited by: §4.3.
 Low-density parity constraints for hashing-based discrete integration. In ICML, pp. 271–279. Cited by: §4.3.

 Taming the curse of dimensionality: discrete integration by hashing and optimization. In ICML, pp. 334–342. Cited by: §4.1.
 Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, Cited by: §4.
 Model counting: a new strategy for obtaining good bounds. In AAAI, pp. 54–61. Cited by: §4.3.
 On the partition function and random maximum a-posteriori perturbations. In ICML, pp. 991–998. Cited by: §4.1.
 Learning to pass expectation propagation messages. In NeurIPS, pp. 3219–3227. Cited by: §5.
 Learning neural pde solvers with convergence guarantees. ICLR. Cited by: §5.
 On computing minimal independent support and its applications to sampling and counting. Constraints, pp. 1–18. Cited by: §4.3.
 On computing minimal independent support and its applications to sampling and counting. Constraints 21 (1), pp. 41–58. Cited by: Appendix E.
 Variational probabilistic inference and the QMR-DT network. Journal of Artificial Intelligence Research 10, pp. 291–322. Cited by: §4.3.

 Random generation of combinatorial structures from a uniform distribution. Theor. Comput. Sci. 43, pp. 169–188. Cited by: §1, §4.3.
 Monte-Carlo approximation algorithms for enumeration problems. J. Algorithms 10, pp. 429–448. Cited by: §1, §4.3.
 Adam: a method for stochastic optimization. In ICLR, Cited by: Appendix D, Appendix E.
 Fast convergence of belief propagation to global optima: beyond correlation decay. In NeurIPS, Cited by: §4.1.
 Probabilistic graphical models: principles and techniques. MIT press. Cited by: Appendix A, §1, §2.
 Factor graphs and the sum-product algorithm. IEEE Trans. on Information Theory 47 (2), pp. 498–519. Cited by: §B.1.
 Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Methodological) 50 (2), pp. 157–194. Cited by: §4.1.
 Analytic and algorithmic solution of random satisfiability problems. Science 297 (5582), pp. 812–815. Cited by: §1.

 LibDAI: a free and open source C++ library for discrete approximate inference in graphical models. JMLR 11, pp. 2169–2173. External Links: Link Cited by: Appendix D, §4.1.
 New understanding of the Bethe approximation and the replica method. arXiv preprint arXiv:1303.2168. Cited by: Appendix A.
 DSHARP: Fast dDNNF Compilation with sharpSAT. In Canadian Conference on Artificial Intelligence, Cited by: Appendix E, §4.3.
 Monte Carlo theory, methods and examples. Cited by: §1.
 Learning to solve NPcomplete problems: a graph neural network for decision TSP. In AAAI, Vol. 33, pp. 4731–4738. Cited by: §5.
 On the hardness of approximate reasoning. In IJCAI, Cited by: §4.3.
 The Bethe partition function of log-supermodular graphical models. In NeurIPS, Cited by: Appendix A, §4.1.
 Neural enhanced belief propagation on factor graphs. arXiv preprint arXiv:2003.01998. Cited by: §5, footnote 3.
 Learning a SAT solver from single-bit supervision. In ICLR, Cited by: Appendix E, §4.3, §5.
 BIRD: engineering an efficient CNF-XOR SAT solver and its applications to approximate model counting. In AAAI, Cited by: Appendix E, Appendix E, Appendix E, Appendix E, §4.3, §4.3, §4.3.
 Extending SAT solvers to cryptographic problems. In SAT, Cited by: Appendix E.
 The complexity of approximate counting. In STOC ’83, Cited by: §1, §4.3.
 The complexity of enumeration and reliability problems. SIAM Journal on Computing 8 (3), pp. 410–421. Cited by: §4.3.
 Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1 (1–2), pp. 1–305. Cited by: §1.
 A reduction of a graph to a canonical form and an algebra arising during this reduction. Nauchno-Technicheskaya Informatsia 2 (9), pp. 12–16. Cited by: §B.2.
 Amortized Bethe free energy minimization for learning MRFs. In NeurIPS, pp. 15520–15531. Cited by: §5.
 How powerful are graph neural networks? In ICLR, Cited by: Appendix A, §B.2.
 Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. on Information Theory 51 (7), pp. 2282–2312. Cited by: §B.1, §B.1.
 Inference in probabilistic graphical models by graph neural networks. ArXiv abs/1803.07710. Cited by: §5.
Appendix A Proofs
Theorem 1.
Theorem 2.
Theorem 3.
If zero is the unique fixed point of , then the fixed points of BPNND and BP are identical by Theorems 1 and 2. Therefore, (1) the Bethe approximation obtained from fixed points of BPNND on tree structured factor graphs is exact because it is exact for fixed points of BP [Koller and Friedman, 2009] (see [Mori, 2013, p. 27] for a detailed proof). (2) Ruozzi [2012, p. 8] proves in Corollary 4.2 that the Bethe approximation at any fixed point of BP is a lower bound on the partition function for factor graphs with binary variables and log-supermodular potential functions, so it follows that the Bethe approximation at any fixed point of BPNND lower bounds the partition function. ∎
Proposition 2.
If we consider a BPNN with weight tying, then regardless of the number of iterations or layers, the output messages are the same if the input messages are the same. Without loss of generality, let us first consider any node as the root node, and consider all the messages on the path from the leaf nodes through . Let denote the depth of the subtree with root when we consider as the root (e.g. for a leaf node , ). We use the following induction argument:

At iteration , the message from all nodes with to their parents will be fixed for subsequent iterations since the inputs to the BPNN for these messages are the same.

If at iteration , the message from all nodes with to their parents are fixed for all subsequent iterations, then the inputs to the BPNN for all the messages from all nodes with to their parents will be fixed (since they depend on lower level messages that are fixed). Therefore, at iteration , the messages from all the nodes with to their parents will be fixed because of weight tying between BPNN layers.

The maximum tree depth is , so . From the induction argument above, after at most iterations, all the messages along the path from leaf nodes to will be fixed.
Since the BPNN layer performs the operation over all nodes, the argument above is valid for all nodes when we consider them as root nodes. Therefore, all messages will be fixed after at most iterations, which completes the proof. ∎
Isomorphic Factor Graphs
To prove Theorem 4 we define isomorphic factor graphs, an equivalence relation among factor graph representations, and break Theorem 4 into the lemmas in this section. Standard GNNs are built on the assumption that isomorphic graphs should be mapped to the same representation and non-isomorphic graphs should be mapped to different representations [Xu et al., 2018]. This is a challenging goal; in fact, Xu et al. [2018, p. 4] prove in Lemma 2 that any GNN that aggregates messages from 1-hop neighbors is, at most, as discriminative as the Weisfeiler-Lehman (WL) graph isomorphism test. Xu et al. [2018] go on to propose a provably ‘maximally powerful’ GNN, one that maps isomorphic graphs to the same representation and maps non-isomorphic graphs to different representations whenever the WL test maps them to different representations, which is the best result possible for this class of graph neural networks that aggregate messages from 1-hop neighbors. The input to a standard GNN is a graph represented by an adjacency matrix whose ith row and column correspond to the ith node. Nodes and edges may have corresponding features. The GNN in [Xu et al., 2018] was designed to map isomorphic graphs to the same representation by outputting learned node representations that are equivariant to the input node indexing and a graph-wide representation that is invariant to input node indexing.
The input to a BPNN is a factor graph. With the same motivation as for standard GNNs, BPNNs should map isomorphic factor graphs to the same output representation. (Note that a factor graph can be viewed as a weighted hypergraph where factors define hyperedges and factor potentials define hyperedge weights for every variable assignment within the factor.) A factor graph is represented by three components. The first is an adjacency matrix over factor nodes and variable nodes (for readability, we use a and b to index factors and i and j to index variables throughout this section), whose (a, i) entry indicates whether the ith variable is in the scope of the ath factor. The second is an ordered list of factor potentials, where the ath factor potential corresponds to the ath factor (row) in the adjacency matrix and is represented as a tensor with one dimension for every variable in the scope of that factor. The third is an ordered list of ordered lists that locally indexes variables within each factor, specifying that the kth dimension of the ath potential tensor corresponds to the ith variable (column) in the adjacency matrix. We define two factor graphs to be isomorphic when they meet the conditions of Definition 1.
Definition 1.
Factor graphs and with and are isomorphic if and only if , , and

There exist bijections and such that for all and , where and .

There exists a bijection for every factor,
(8) such that and , where , , , and denotes permuting the dimensions of the tensor according to .
Condition 1 in Definition 1 states that permuting the global indices of variables or factors in a factor graph results in an isomorphic factor graph. Condition 2 in Definition 1 states that permuting the local indices of variables within factors also results in an isomorphic factor graph. In Lemmas 1, 2, and 3 we formalize the equivariance of messages and beliefs obtained by applying BPNN iterative layers, using the bijections from Definition 1 to construct bijective mappings between messages and beliefs. In Lemma 4 we use the equivariance of beliefs between isomorphic factor graphs to show that the output of BPNNB is identical for isomorphic factor graphs.
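As a concrete sanity check of Condition 1, the following sketch verifies that globally permuting variable indices (and reindexing the factor scopes accordingly) leaves the partition function unchanged. The chain graph and its potentials are illustrative choices, not taken from the paper:

```python
import numpy as np
from itertools import product

# Toy factor graph over binary variables x0, x1, x2 with factors
# f_a(x0, x1) and f_b(x1, x2); the potential values are arbitrary.
f_a = np.array([[2.0, 1.0], [1.0, 3.0]])
f_b = np.array([[1.0, 2.0], [4.0, 1.0]])

# Partition function of the original factor graph.
Z1 = sum(f_a[x0, x1] * f_b[x1, x2]
         for x0, x1, x2 in product([0, 1], repeat=3))

# Apply the global variable permutation swapping x0 and x2: the
# isomorphic graph has f_a over (x2, x1) and f_b over (x1, x0).
Z2 = sum(f_a[x2, x1] * f_b[x1, x0]
         for x0, x1, x2 in product([0, 1], repeat=3))

print(Z1 == Z2)  # True: isomorphic factor graphs share one Z
```

The same reindexing argument applies to messages and beliefs, which is what the lemmas below formalize.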
Lemma 1.
Message equivariance: Let and denote variable to factor messages and and denote factor to variable messages obtained by applying k iterations of BP to factor graphs and . If and are isomorphic as factor graphs and messages are initialized to a constant (any message initialization strategy can be used, as long as the initial messages are equivariant, i.e. they satisfy the bijective mapping and where and ), then there is a bijective mapping between messages: and where and . This property holds for BPNND iterative layers if is equivariant to global node indexing.
Proof.
We use a proof by induction.
Base case: the initial messages are all equal when constant initialization is used and therefore satisfy any bijective mapping.
Inductive step: Writing the definition of variable to factor messages, we have
(9) 
since the bijective mapping holds for factor to variable messages at iteration by the inductive hypothesis. Writing the definition of factor to variable messages, we have
(10) 
showing that the bijective mapping continues to hold at iteration .
Proof extension to BPNND: the logic of the proof is unchanged when BP is performed in log-space with damping. The only difference between BPNND and standard BP is the replacement of the term in the computation of factor to variable messages with , where . If is equivariant to global node indexing (the bijective mapping holds, where denotes applying the operator to the kth iteration’s message differences when the input factor graph is and taking the output corresponding to message ), then equality is maintained in Equation 10 and the bijective mapping between messages holds. ∎
Lemma 2.
Variable belief equivariance: Let and denote the variable beliefs obtained by applying k iterations of BP (or BPNND iterative layers with equivariant to global node indexing) to factor graphs and . If and are isomorphic as factor graphs, then there is a bijective mapping between beliefs: , where .
Proof.
By the definition of variable beliefs,
(11) 
where the second equality holds due to factor to variable message equivariance from Lemma 1. ∎
Lemma 3.
Factor belief equivariance: Let and denote the factor beliefs obtained by applying k iterations of BP (or BPNND iterative layers with equivariant to global node indexing) to factor graphs and . If and are isomorphic as factor graphs, then there is a bijective mapping between beliefs: , where and .
Proof.
By the definition of factor beliefs,
(12) 
where the second equality holds due to variable to factor message equivariance from Lemma 1. ∎
Lemma 4.
Bethe approximation invariance: If factor graphs and are isomorphic, then the Bethe approximations obtained by applying BP to and (or the output of BPNNB) are identical.
Proof.
By the definition of the Bethe approximation (or the negative Bethe free energy),
(13) 
where , the second equality follows from the equivariance of variable and factor beliefs (Lemmas 2 and 3), and the final equality follows from the commutative property of addition.
Proof extension to BPNNB: the proof holds for BPNNB because every permutation (in ) of factor belief terms is input to . ∎
Appendix B Extended Background
We provide background on belief propagation and graph neural networks (GNN) to motivate and clarify belief propagation neural networks (BPNN).
b.1 Belief Propagation
We describe a general version of belief propagation [Yedidia et al., 2005] that operates on factor graphs.
Factor Graphs.
A factor graph [Kschischang et al., 2001, Yedidia et al., 2005] is a general representation of a distribution over $n$ discrete random variables, $x_1, x_2, \dots, x_n$. Let $x_i$ denote a possible state of the $i$th variable. We use the shorthand $p(\mathbf{x})$ for the joint probability mass function, where $\mathbf{x}$ is a specific realization of all $n$ variables. Without loss of generality, $p(\mathbf{x})$ can be written as the product
(14) $p(\mathbf{x}) = \frac{1}{Z} \prod_{a=1}^{M} f_a(\mathbf{x}_a).$
The functions $f_1, f_2, \dots, f_M$ each take some subset of variables as arguments; function $f_a$ takes $\mathbf{x}_a \subseteq \{x_1, x_2, \dots, x_n\}$. We require that all functions are non-negative and finite. This makes $p(\mathbf{x})$ a well defined probability distribution after normalizing by the distribution’s partition function
(15) $Z = \sum_{\mathbf{x}} \prod_{a=1}^{M} f_a(\mathbf{x}_a).$
A factor graph is a bipartite graph that expresses the factorization of the distribution in equation 14. A factor graph’s nodes represent the variables and functions present in equation 14. The nodes corresponding to functions are referred to as factor nodes. Edges exist between factor nodes and variable nodes if and only if the variable is an argument to the corresponding function.
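To make Equations 14 and 15 concrete, the following sketch builds a tiny factor graph and computes its partition function by brute-force enumeration. The potentials f_a and f_b are arbitrary illustrative choices, not from the paper:

```python
import numpy as np
from itertools import product

# Chain-structured factor graph over three binary variables:
#   x0 - f_a - x1 - f_b - x2
f_a = np.array([[2.0, 1.0],
                [1.0, 3.0]])  # f_a[x0, x1]
f_b = np.array([[1.0, 2.0],
                [4.0, 1.0]])  # f_b[x1, x2]

# Z = sum over all joint assignments of the product of factors (Eq. 15).
Z = sum(f_a[x0, x1] * f_b[x1, x2]
        for x0, x1, x2 in product([0, 1], repeat=3))
print(Z)  # 29.0
```

Enumeration costs time exponential in the number of variables, which is why the variational approximations below matter.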
Message Updates.
Belief propagation performs iterative message passing. The message from variable node $i$ to factor node $a$ during iteration $k$ is computed according to the rule
(16) $m_{i \to a}^{(k)}(x_i) = \prod_{c \in \mathcal{N}(i) \setminus \{a\}} m_{c \to i}^{(k-1)}(x_i).$
The message from factor node $a$ to variable node $i$ during iteration $k$ is then computed according to the rule
(17) $m_{a \to i}^{(k)}(x_i) = \sum_{\mathbf{x}_a \setminus x_i} f_a(\mathbf{x}_a) \prod_{j \in \mathcal{N}(a) \setminus \{i\}} m_{j \to a}^{(k)}(x_j),$ where $\mathcal{N}(\cdot)$ denotes the neighbors of a node in the factor graph.
The BP algorithm estimates approximate marginal probabilities for each variable, referred to as beliefs. We denote the belief at variable node $i$, after message passing iteration $k$ is complete, as $b_i^{(k)}(x_i)$, which is computed as
(18) $b_i^{(k)}(x_i) \propto \prod_{a \in \mathcal{N}(i)} m_{a \to i}^{(k)}(x_i).$
Similarly, BP computes joint beliefs over the sets of variables $\mathbf{x}_a$ associated with each factor $f_a$. We denote the belief over variables $\mathbf{x}_a$, after message passing iteration $k$ is complete, as $b_a^{(k)}(\mathbf{x}_a)$, which is computed as
(19) $b_a^{(k)}(\mathbf{x}_a) \propto f_a(\mathbf{x}_a) \prod_{i \in \mathcal{N}(a)} m_{i \to a}^{(k)}(x_i).$
Partition Function Approximation.
The belief propagation algorithm proceeds by iteratively updating variable to factor messages (Equation 16) and factor to variable messages (Equation 17) until they converge to fixed values, referred to as a fixed point of Equations 16 and 17, or until a predefined maximum number of iterations is reached. While BP is not guaranteed to converge in general, whenever a fixed point is found it defines a set of consistent beliefs, meaning that marginal beliefs at factor nodes agree with the beliefs at every variable node they are connected to. At this point the beliefs are used to compute a variational approximation of the factor graph’s partition function. This approximation, originally developed in statistical physics, is known as the Bethe free energy [Bethe, 1935]. It is defined in terms of the Bethe average energy $U_{\text{Bethe}}$ and the Bethe entropy $H_{\text{Bethe}}$.
Definition 2.
$U_{\text{Bethe}} = -\sum_{a=1}^{M} \sum_{\mathbf{x}_a} b_a(\mathbf{x}_a) \ln f_a(\mathbf{x}_a)$ defines the Bethe average energy.
Definition 3.
$H_{\text{Bethe}} = -\sum_{a=1}^{M} \sum_{\mathbf{x}_a} b_a(\mathbf{x}_a) \ln b_a(\mathbf{x}_a) + \sum_{i=1}^{n} (d_i - 1) \sum_{x_i} b_i(x_i) \ln b_i(x_i)$ defines the Bethe entropy, where $d_i$ is the degree of variable node $i$.
Definition 4.
The Bethe free energy is defined as $F_{\text{Bethe}} = U_{\text{Bethe}} - H_{\text{Bethe}}$.
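Definitions 2-4 can be checked numerically: on a tree-structured graph, plugging the exact marginals in as beliefs makes $\exp(-F_{\text{Bethe}})$ recover $Z$ exactly. The chain and potentials below are illustrative assumptions, not from the paper:

```python
import numpy as np
from itertools import product

# Same illustrative chain x0 - f_a - x1 - f_b - x2 as above.
f_a = np.array([[2.0, 1.0], [1.0, 3.0]])
f_b = np.array([[1.0, 2.0], [4.0, 1.0]])

# Exact joint distribution and partition function by enumeration.
joint = np.zeros((2, 2, 2))
for x0, x1, x2 in product([0, 1], repeat=3):
    joint[x0, x1, x2] = f_a[x0, x1] * f_b[x1, x2]
Z = joint.sum()
p = joint / Z

# On a tree, BP's fixed-point beliefs equal the exact marginals.
b_a = p.sum(axis=2)          # factor belief b_a(x0, x1)
b_b = p.sum(axis=0)          # factor belief b_b(x1, x2)
b = [p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))]
degree = [1, 2, 1]           # number of factors touching each variable

# Definition 2: Bethe average energy.
U = -(b_a * np.log(f_a)).sum() - (b_b * np.log(f_b)).sum()
# Definition 3: Bethe entropy with the (d_i - 1) variable correction.
H = -(b_a * np.log(b_a)).sum() - (b_b * np.log(b_b)).sum()
H += sum((d - 1) * (bi * np.log(bi)).sum() for bi, d in zip(b, degree))
# Definition 4: Bethe free energy.
F = U - H
print(np.isclose(np.exp(-F), Z))  # True: Bethe is exact on trees
```

On loopy graphs the same formula gives only an approximation of $\ln Z$, which is the quantity BPNN learns to correct.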
b.2 GNN Background
This section provides background on graph neural networks (GNNs), a form of neural network used to perform representation learning on graph structured data. GNNs perform iterative message passing operations between neighboring nodes in graphs, updating the learned, hidden representation of each node after every iteration.
Xu et al. [2018] showed that graph neural networks are at most as powerful as the Weisfeiler-Lehman graph isomorphism test [Weisfeiler and Lehman, 1968], which is a strong test that generally works well for discriminating between graphs. Additionally, Xu et al. [2018] presented a GNN architecture called the Graph Isomorphism Network (GIN), which they showed has discriminative power equal to that of the Weisfeiler-Lehman test and thus strong representational power. We use GIN as a baseline GNN for comparison in our experiments because it is provably as discriminative as any GNN that aggregates information from 1-hop neighbors. We now describe in detail the GIN architecture that we use as a baseline. Our architecture performs regression on graphs, learning a function from graphs to a real number. Our input is a graph with node feature vectors for and edge feature vectors for . Our output is the number , which should ideally be close to the ground truth value . Let denote the representation vector corresponding to node after the message passing operation. We use a slightly modified GIN update to account for edge features as follows:
(20) $h_v^{(k)} = \text{MLP}^{(k)}\Big( (1 + \epsilon^{(k)}) \cdot h_v^{(k-1)} + \sum_{u \in \mathcal{N}(v)} \text{MLP}_e^{(k)}\big( h_u^{(k-1)}, e_{uv} \big) \Big)$
A $K$-layer GIN network with width $d$ is defined by $K$ successive GIN updates as given by Equation 20, where $h_v^{(k)}$ is a $d$-dimensional feature vector for $k \geq 1$. All MLPs within GIN updates (except the first) are multilayer perceptrons with a single hidden layer whose input, hidden, and output layers all have dimensionality $d$. The first is different in that its input dimensionality is given by the dimensionality of the original node feature representations. The final output of our GIN network is given by
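A minimal numpy sketch of one GIN-style update with edge features follows. The weight shapes, random initialization, and the concatenation of edge features into the neighbor aggregation are assumptions for illustration rather than the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden width (illustrative)

def mlp(x, W1, W2):
    """Single-hidden-layer MLP with ReLU, as in the GIN updates."""
    return np.maximum(x @ W1, 0.0) @ W2

# Random weights for the node-update MLP and an edge-feature MLP.
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))
We1, We2 = rng.normal(size=(2 * d, d)), rng.normal(size=(d, d))

def gin_layer(h, edges, edge_feat, eps=0.0):
    """One GIN-style update: sum-aggregate transformed neighbor states
    (concatenated with edge features), then apply the update MLP."""
    agg = np.zeros_like(h)
    for (u, v), e in zip(edges, edge_feat):
        agg[v] += mlp(np.concatenate([h[u], e]), We1, We2)
        agg[u] += mlp(np.concatenate([h[v], e]), We1, We2)  # undirected
    return mlp((1.0 + eps) * h + agg, W1, W2)

h = rng.normal(size=(4, d))                # 4 nodes with d-dim features
edges = [(0, 1), (1, 2), (2, 3)]           # a path graph
e_feat = rng.normal(size=(len(edges), d))  # one feature vector per edge
h1 = gin_layer(h, edges, e_feat)
print(h1.shape)  # (4, 8)
```

Summing over neighbors (rather than averaging) is what gives GIN its injective aggregation, and it is the same permutation-invariant pooling that the BPNN layers preserve over factor and variable indices.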