Probabilistic Graphical Models (PGMs) are powerful, general machine learning models that encode distributions over random variables. PGM inference, in which we seek to compute probabilistic beliefs within the system modeled by the PGM, is in general an intractable problem, leading to dependence on approximate algorithms. Belief Propagation (BP) is a widely employed approximate inference algorithm for PGMs. BP has been successfully utilized in many areas, including computer vision, error-correcting codes, and protein folding.
BP is a message-passing algorithm, in which messages are passed along edges of the PGM graph. While BP is exact on tree PGMs, it is approximate on general graphs containing loops, where iterative updates are applied until convergence. Like others, we break the performance of BP into two properties: convergence (for how many input graphs does it reach a convergent state) and speed (how long does it take to reach the convergent state). While BP has been shown to perform well on many graphs containing loops, there is no guarantee of convergence in most cases, and graphs of varying structure and parameterization can prevent BP from converging or cause it to converge slowly.
General-purpose GPU computing has recently begun exploring many-core parallelism for graph-based problems. This, combined with the inherent parallelism available between message updates, suggests that many-core parallelism can be effectively applied to BP to yield good performance on the GPU (that is, good convergence and speed). To ensure good performance, one must take care in the implementation to avoid the convergence and speed pitfalls inherent in Belief Propagation.
In existing BP literature, there has been much interest in exploring the use of message schedulings for improving BP performance. The naive scheduling is known as Synchronous or Loopy BP (LBP), where all messages are updated in parallel. Asynchronous approaches, where some amount of sequentiality is enforced during the message updates, for example via subgraph updates or greedy message selection [6, 8], have been shown both empirically and theoretically to outperform LBP in single-core environments. The general intuition is that enforcing sequentialism in the scheduling encourages more direct propagation of information, and thus faster convergence. The contrast between LBP and Asynchronous BP introduces a parallelism vs. efficiency spectrum (also found in other graph problems such as single-source shortest paths, SSSP). LBP exposes high levels of parallelism but is work-inefficient. Asynchronous BP is efficient and convergent but exposes little parallelism. We hypothesize that there exists a tradeoff between parallelism and sequentialism in Belief Propagation, and that GPUs can effectively harness that tradeoff to yield performant BP.
We start by presenting many-core frontier-based implementations for two greedy asynchronous message schedulings, Residual Belief Propagation and Residual Splash. We then benchmark their performance, varying parallelism to explore how it affects the performance of BP. As expected, we find that as parallelism is increased, we see less convergence but obtain faster speed. As parallelism is decreased, we see more convergence but lower speeds. This is encouraging, as it means we can still get convergence boosts while exploiting parallelism, but we also see that existing approaches incur significant overheads, and that performance is heavily tied to the choice of parallelism. To overcome these drawbacks, we propose a new message scheduling, called Randomized Belief Propagation (RnBP), which uses low-overhead, randomized scheduling and outperforms existing approaches.
To summarize, our contributions are:
Demonstration of a tradeoff between parallelism and sequentialism in terms of the speed and convergence of BP.
Demonstration that overheads prevent existing asynchronous message scheduling approaches from scaling to the GPU.
II-A Belief Propagation
We focus our attention on the Sum-Product Belief Propagation algorithm over discrete pairwise Markov Random Fields (MRFs), though we expect the results to generalize to other variants of BP. Suppose we have a set of discrete random variables $X = \{X_1, \dots, X_n\}$, each taking on a value $x_i \in \mathcal{A}_i$, where $\mathcal{A}_i$ is a finite set. An MRF is an undirected graph $G = (V, E)$. Each vertex $i \in V$ represents a discrete random variable $X_i$. $\{\psi_i\}_{i \in V}$ is the set of unary potential functions, one for each random variable. Each edge $(i, j) \in E$ represents the probabilistic relationship between two variables. $\{\psi_{ij}\}_{(i,j) \in E}$ is the set of binary potential functions, one for each edge. An MRF yields the following joint distribution over $X$:
$$P(x_1, \dots, x_n) \propto \prod_{i \in V} \psi_i(x_i) \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j)$$
The goal of inference is to derive the vertices' marginal distributions $P(x_i)$. This is intractable in general; however, BP can be used to find exact marginals (for trees) or approximate marginals (for graphs containing loops). This is done through the iterative passing of messages along the edges of the graph. Each edge has two messages being passed along it, indicating each vertex's belief about the other's state. The message $m_{i \to j}$ is a distribution over $x_j$, updated as follows:
$$m_{i \to j}(x_j) \propto \sum_{x_i \in \mathcal{A}_i} \psi_i(x_i)\, \psi_{ij}(x_i, x_j) \prod_{k \in N(i) \setminus \{j\}} m_{k \to i}(x_i)$$
where $N(i)$ indicates the neighbors of $i$. Each message is initialized to the uniform distribution and normalized between updates. Messages are iterated until convergence, at which point we calculate the beliefs at each vertex:
$$b_i(x_i) \propto \psi_i(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i)$$
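As an illustration of the sum-product updates described above, the following is a minimal CPU-side sketch in plain Python. It is not the GPU implementation; the function names (`loopy_bp`, `exact_marginals`), the dictionary-based data layout, and the max-norm stopping rule with tolerance `eps` are assumptions made for this example. It runs synchronous updates and includes a brute-force marginal computation that is only feasible for tiny graphs.

```python
import itertools


def normalize(v):
    s = sum(v)
    return [x / s for x in v]


def loopy_bp(unary, pairwise, edges, max_iters=100, eps=1e-9):
    """Synchronous sum-product BP on a discrete pairwise MRF.

    unary[i][a]            : unary potential of vertex i in state a
    pairwise[(i, j)][a][b] : edge potential for x_i = a, x_j = b
    Returns (beliefs, converged).
    """
    n = len(unary)
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def psi(i, j, a, b):
        # Look up the edge potential regardless of stored orientation.
        if (i, j) in pairwise:
            return pairwise[(i, j)][a][b]
        return pairwise[(j, i)][b][a]

    # Two directed messages per edge, initialized to uniform.
    msgs = {}
    for i, j in edges:
        msgs[(i, j)] = normalize([1.0] * len(unary[j]))
        msgs[(j, i)] = normalize([1.0] * len(unary[i]))

    converged = False
    for _ in range(max_iters):
        new = {}
        for (i, j) in msgs:
            out = []
            for b in range(len(unary[j])):
                total = 0.0
                for a in range(len(unary[i])):
                    prod = unary[i][a] * psi(i, j, a, b)
                    for k in nbrs[i]:
                        if k != j:
                            prod *= msgs[(k, i)][a]
                    total += prod
                out.append(total)
            new[(i, j)] = normalize(out)
        # Largest change of any message entry in this sweep.
        change = max(
            max(abs(x - y) for x, y in zip(new[e], msgs[e])) for e in msgs
        )
        msgs = new
        if change < eps:
            converged = True
            break

    beliefs = []
    for i in range(n):
        b = list(unary[i])
        for k in nbrs[i]:
            b = [x * m for x, m in zip(b, msgs[(k, i)])]
        beliefs.append(normalize(b))
    return beliefs, converged


def exact_marginals(unary, pairwise, edges):
    """Brute-force marginals by enumerating every joint assignment."""
    n = len(unary)
    marg = [[0.0] * len(u) for u in unary]
    for x in itertools.product(*[range(len(u)) for u in unary]):
        w = 1.0
        for i in range(n):
            w *= unary[i][x[i]]
        for i, j in edges:
            w *= pairwise[(i, j)][x[i]][x[j]]
        for i in range(n):
            marg[i][x[i]] += w
    return [normalize(m) for m in marg]
```

On a tree such as a three-vertex chain, the beliefs returned by `loopy_bp` should agree with `exact_marginals` up to floating-point error, which makes the pair a convenient sanity check.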
II-B Message Scheduling for Belief Propagation
BP message schedulings differ in which messages are updated each iteration. LBP simply updates every edge, every iteration, in parallel. That is, all messages are updated using the previous iteration’s messages. LBP performance has been examined both empirically and theoretically.
Asynchronous BP (ABP) approaches enforce sequentialism in message updates, updating each message using the most recent messages. That is, a single message is updated, and that update is immediately used to update other messages. If we assume LBP to be a max-norm contraction, ABP has convergence rate guarantees at least as good as those of LBP.
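Residual Belief Propagation (RBP) [6] is a greedy asynchronous approach. It defines the residual of a message as the magnitude of the change that the message's next update would produce. RBP then selects the next message to update asynchronously based on the highest residual. Intuitively, the program focuses its computational effort on the parts of the graph where updates move it closer to a converged state.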
Residual Splash (RS) is an extension of RBP for multi-core parallelization. It extends residuals to vertices, where the residual of a vertex is the maximum residual of its incoming messages. As in RBP, vertices are selected greedily; however, in RS, a splash, i.e., a breadth-first search (BFS) of fixed depth around the selected vertex, is performed, with updates moving sequentially through the BFS tree. RS demonstrates linear speedup in the number of cores. In this paper we explore LBP, RBP, and RS because of their simplicity and good performance in existing work.
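To make the splash idea concrete, here is a small Python sketch of one way to derive a vertex update order from a depth-limited BFS. The name `splash_order` and the adjacency-dictionary layout are hypothetical. The leaves-to-root pass followed by a root-to-leaves pass follows the usual description of a splash; skipping the repeated root between the two passes is a simplification chosen for this example.

```python
from collections import deque


def splash_order(adj, root, depth):
    """Return the vertex update order for one splash around `root`.

    adj   : dict mapping each vertex to a list of its neighbors
    depth : maximum BFS depth of the splash
    """
    dist = {root: 0}
    bfs = [root]
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if dist[u] == depth:
            continue  # do not expand beyond the splash depth
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                bfs.append(v)
                queue.append(v)
    # Inward pass (leaves to root), then outward pass (root to leaves).
    return bfs[::-1] + bfs[1:]
```

For a five-vertex chain `0-1-2-3-4`, a splash of depth 1 rooted at vertex 2 visits `3, 1, 2, 1, 3`: information is first pulled toward the root and then pushed back out.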
II-C Related Work
BP has been implemented on the GPU for specific BP workloads, including stereo matching [10, 3] and error-correcting codes. Several works specifically explore memory usage, as the unique architecture of the GPU closely ties memory use and performance. Grauer et al. explore using registers, shared memory, and local memory for Belief Propagation and their effect on GPU occupancy for the stereo matching problem. Liang et al. show a general approach for reducing memory usage for BP by storing only the messages along the edges of partitions of the graph, allowing messages to be stored in faster shared memory. While we do not explore memory use, our message scheduling work combines naturally with the memory work of both of these approaches.
Several works explore different message schedulings on the GPU for specific BP applications. Yang et al. filter the messages to be updated by removing any messages that have already converged. We employ the same filter as one of the filters in our final RnBP scheduling approach. Xiang et al. change BP on a grid-based stereo problem by using directional updates, that is, messages are updated along dimensions of the grid. Of course, this directional update is specific to grid-based models such as the ones used in computer vision. Romero et al. constructed an LDPC code structure in such a fashion that the updates could be partitioned so that many could be completed in parallel while still maintaining sequentiality overall. In general, we cannot control the problem as in their case, and creating effective message partitions is problem-specific and non-trivial. Our work takes a general approach that can apply to any BP problem, and explores message schedulings that have not yet been explored on the GPU, to the best of the authors' knowledge.
III Frontier-Based Belief Propagation on the GPU
We present all algorithms examined as realizations of a frontier-based BP framework. In this section, we implement several existing schedulings and benchmark their performance. In the next section, we introduce our own GPU-centric scheduling approach, Randomized Belief Propagation.
To transfer the schedulings onto the synchronous, many-core architecture of the GPU, we utilize a data-centric, frontier-based parallelization framework [23, 20]. We consider the frontier to be the set of messages selected to be updated synchronously and in parallel each iteration. Message schedulings differ on selection of the frontier, but follow the same general structure presented in Algorithm 1.
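The general structure shared by the schedulings can be sketched as a bulk-synchronous loop. The Python below is an illustrative CPU-side sketch, not a reproduction of Algorithm 1: the name `run_frontier_bp`, the callback signatures, and the max-norm residual are assumptions for this example. Each iteration computes candidate updates from the current messages, checks convergence, asks a scheduling-specific `select_frontier` callback for the messages to update, and commits only those.

```python
def run_frontier_bp(messages, compute_update, select_frontier,
                    eps=1e-3, max_iters=1000):
    """Generic frontier-based BP driver.

    messages        : dict mapping message ID -> list of floats
    compute_update  : f(message_id, messages) -> new value of that message
    select_frontier : f(residuals dict) -> iterable of message IDs to update
    Returns (messages, iterations_used, converged).
    """
    messages = dict(messages)
    for it in range(max_iters):
        # All candidates are computed from the same snapshot, which mirrors
        # a synchronous, data-parallel update step.
        candidates = {e: compute_update(e, messages) for e in messages}
        residuals = {
            e: max(abs(n - o) for n, o in zip(candidates[e], messages[e]))
            for e in messages
        }
        if max(residuals.values()) < eps:
            return messages, it, True
        # Only the frontier chosen by the scheduling is committed.
        for e in select_frontier(residuals):
            messages[e] = candidates[e]
    return messages, max_iters, False
```

Under this sketch, LBP corresponds to `select_frontier = lambda residuals: list(residuals)`, while a greedy scheduling would return only the highest-residual IDs.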
III-A Greedy Update Frontier Selection
We use this frontier-based approach to implement several existing schedulings on the GPU, specifically LBP, RBP, and RS. LBP is simple to implement in this framework: every iteration, all the messages are put in the frontier to be updated. RBP and RS rely on greedily selecting updates based on message residuals. In order to explore the tradeoff between parallelism and greedy sequentialism, we simply adjust the greedy approach to select multiple elements as a frontier per iteration, as opposed to a single element. We can consider this to be the selection of the top-$k$ values for update each iteration. Adjusting $k$ allows us to adjust parallelism.
For single-core implementations, the primary data structure employed to perform these greedy updates is a priority queue. While concurrent priority queues have been developed, they rely on mutual exclusion, and thus are best suited for asynchronous environments, unlike the GPU. Other work on using GPUs for algorithms with priority-queue-based methods has turned to other approaches, involving sort-and-select, binning, or problem division [5, 19]. Several algorithms for direct top-$k$ GPU selection exist [2, 14], but speedup only occurs for very large problem sizes. We choose to use a simple sort-and-select approach in order to select the top-$k$ elements.
We now present the high-level approach for our bulk-parallel greedy update selection. We maintain the residuals of either the messages for RBP or the vertices for RS. Each iteration, we perform a key-value sort of the residuals with their corresponding vertices/edges. The top-$k$ elements after the sort form the update frontier. RBP updates this frontier directly; RS updates the splash around the selected nodes. A single update is visualized in Fig.1.
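The sort-and-select step can be illustrated with a short Python sketch. It stands in for the GPU key-value sort with Python's built-in sort, so it shows the selection logic only, not its performance characteristics. The names `select_top_k` and `vertex_residuals` and the flat-list layout are hypothetical.

```python
def select_top_k(residuals, k):
    """Sort IDs by residual (descending) and keep the first k.

    residuals : list of floats, indexed by message or vertex ID
    """
    order = sorted(range(len(residuals)),
                   key=lambda i: residuals[i], reverse=True)
    return order[:k]


def vertex_residuals(msg_residuals, msg_dst, num_vertices):
    """Vertex residual = maximum residual over its incoming messages.

    msg_dst[m] is the destination vertex of message m.
    """
    out = [0.0] * num_vertices
    for m, r in enumerate(msg_residuals):
        if r > out[msg_dst[m]]:
            out[msg_dst[m]] = r
    return out
```

With `k` equal to the number of messages, `select_top_k` returns every ID and the scheduling degenerates to a full synchronous update; with `k = 1` it mimics the serial greedy choice.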
We implement LBP, RBP, and RS using Nvidia’s CUDA library. We use a simple adjacency list format for storing graph structure and parameterization. Each edge and vertex is assigned an ID, and for parallel operations, each thread is assigned a subset of the IDs to update. We use the CUDA occupancy API for kernel launch settings and the radix sort from Nvidia’s CUB library for the sort operation. We implement serial RBP (SRBP) as a performance benchmark. It uses the same adjacency list format and the Boost library’s Fibonacci heap for the priority queue.
To accurately benchmark performance, we would like to be able to adjust the difficulty of the inference problem. A synthetic benchmark that gives us control over difficulty is the Ising dataset, a standard benchmark for message propagation algorithms. Ising grids are $N \times N$ grids of binary variables. Univariate potentials are randomly sampled from the [0,1] range. The pairwise potentials are set to $e^{\lambda C}$ when $x_i = x_j$ and $e^{-\lambda C}$ otherwise. $\lambda$ is sampled from [-0.5,0.5] to make certain potentials favor agreement while others favor disagreement. Varying $C$ changes the difficulty of the inference problem (higher $C$ being more difficult). For RBP and RS, we test on Ising grids of size and , with . We also run on simpler chain graphs, where binary variables are arranged in a single long chain. Of course, when a graph is a chain, BP is guaranteed to converge. We sample the univariate and pairwise potentials in the same manner used for our Ising grids. For RBP and RS, we test on chain graphs of size , with .
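A benchmark of this kind can be generated with a few lines of Python. The sketch below is an assumption-laden illustration: it assumes the pairwise potentials take the form exp(λC) for agreeing states and exp(-λC) otherwise, with λ drawn uniformly from [-0.5, 0.5] and C the difficulty parameter, and it samples each unary entry independently from [0, 1]. The function name `make_ising_grid` and the returned layout are hypothetical.

```python
import math
import random


def make_ising_grid(n, C, seed=0):
    """Build an n x n Ising grid of binary variables.

    Returns (unary, pairwise, edges), where vertex (r, c) has ID r * n + c.
    """
    rng = random.Random(seed)
    unary = [[rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)]
             for _ in range(n * n)]
    edges = []
    pairwise = {}
    for r in range(n):
        for c in range(n):
            # Connect each vertex to its right and lower neighbor.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < n and cc < n:
                    lam = rng.uniform(-0.5, 0.5)
                    agree = math.exp(lam * C)
                    disagree = math.exp(-lam * C)
                    e = (r * n + c, rr * n + cc)
                    edges.append(e)
                    pairwise[e] = [[agree, disagree], [disagree, agree]]
    return unary, pairwise, edges
```

A positive λ makes an edge favor agreement and a negative λ makes it favor disagreement; larger C pushes the potentials further from uniform, which is what makes the inference problem harder.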
In order to examine parallelism’s effect on performance, we introduce a multiplier that scales the frontier size each round. Varying this multiplier thus varies the parallelism used. For RS, we lock the splash depth to a fixed value. (Exploration of different splash depths could be interesting, though we change our focus to randomized updates, and thus do not pursue this further.) We time how long it takes the message updates to converge. Our GPU code is run on a single Nvidia Tesla V100 and our CPU code is run on Intel Xeon processors.
Fig.2 shows GPU RS performance on our three benchmarks as cumulative convergence graphs, indicating the cumulative percentage of the set of input graphs that have converged as a function of time. GPU RBP exhibits the same patterns on each dataset and thus is not shown for brevity.
Our results indicate that a tradeoff does indeed exist between parallelism and sequentialism. Specifically, we see that as we decrease the frontier-size multiplier, that is, as we reduce our parallelism, more graphs converge, but they take longer to do so. Thus, low parallelism encourages convergence, while high parallelism encourages speed. LBP, with full parallelism, demonstrates only partial convergence, while RS is able to extend convergence, given time, by reducing parallelism (Fig.2).
In Tables I and II, we show the speedup results comparing GPU RBP and RS to SRBP. We compare with the fastest setting in our test runs that converges on all or most of the graphs, indicated for each dataset. For cases where SRBP convergence did not occur (i.e., SRBP failed to converge on all but the Ising , dataset), we provide a conservative lower-bound on speedup based on how long we gave SRBP to run (90 seconds). We see that RS outperforms RBP and both outperform SRBP.
There are two primary shortcomings to RBP and RS. First, performance relies heavily upon the chosen parallelism multiplier, and effective selection is non-trivial. Second, the sort-and-select approach incurs significant overhead. This is best demonstrated by the easy chain dataset (Fig.2), where RS takes significantly longer than LBP, which converges very quickly. Profiling indicates that on many graphs, both RBP and RS spend more than 90% of runtime in the sort-and-select step, and up to 98% for certain runs.
IV Randomized Belief Propagation
To overcome the shortcomings of existing approaches on the GPU and exploit the tradeoff we have demonstrated, we present our novel, low-overhead, randomized message scheduling technique for Belief Propagation on the GPU, Randomized Belief Propagation (RnBP).
We hypothesize that, in a many-core environment, varying the parallelism affects performance more than the specific selection of messages each round. We thus perform random selection as opposed to exact top-$k$ selection.
In creating our message frontier, we employ two filters. In order to encourage selection to be similar to the top-$k$, we only choose the messages to update from those whose residual is above the convergence threshold. Thus, our first filter prunes all messages whose next update will move them by less than this threshold.
The second filter is our randomized filter. We randomly select some percentage of the remaining messages to be updated. Adjusting this percentage thus allows us to adjust the parallelism for that round. A single update is visualized in Fig.3.
Finally, we dynamically adjust the selection percentage based on the convergence of the run. Throughout the run, we can track how many of the edges have not converged. The ratio of the number of unconverged edges in the current iteration to the number in the previous iteration becomes an indicator of runtime convergence performance. If this ratio is low, it is indicative of good convergence; if it is high, it is indicative of bad convergence. We introduce two parallelism settings, one high and one low. We know from our results in Section III that low parallelism encourages convergence and high parallelism encourages speed. Thus, if the ratio indicates poor convergence, we use the lower parallelism setting, thus encouraging convergence. Otherwise, we use the higher parallelism setting, thus encouraging speed. We note that overhead prevented similar dynamic selection from aiding GPU RBP/RS.
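The two filters and the dynamic choice between a high and a low parallelism setting can be sketched in Python as follows. This is a hedged illustration rather than the GPU code: the names `rnbp_frontier` and `choose_parallelism` are hypothetical, the random filter is modeled as an independent coin flip per message (so the selected share only matches the target on average), and `ratio_threshold` is a placeholder parameter whose value is not taken from the text.

```python
import random


def rnbp_frontier(residuals, eps, p, rng):
    """Select a randomized frontier from message residuals.

    Filter 1 keeps messages whose residual is at least eps (unconverged).
    Filter 2 keeps each surviving message with probability p.
    Returns (frontier IDs, number of unconverged messages).
    """
    active = [e for e, r in enumerate(residuals) if r >= eps]
    frontier = [e for e in active if rng.random() < p]
    return frontier, len(active)


def choose_parallelism(unconverged_now, unconverged_prev,
                       low_p, high_p, ratio_threshold):
    """Pick the parallelism setting from the unconverged-edge ratio.

    A ratio close to or above 1 means few edges converged in the last
    round, so the lower (more sequential) setting is used.
    ratio_threshold is an assumed tuning parameter.
    """
    if unconverged_prev == 0:
        return high_p
    ratio = unconverged_now / unconverged_prev
    return low_p if ratio > ratio_threshold else high_p
```

Because both steps are simple per-message predicates, they avoid the global sort that the greedy top-$k$ selection requires, which is the source of the low overhead.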
We use the same chain and Ising grid benchmarks described in III-C. We test with Ising grids of size with and of size with . For chain graphs, we test with size with .
Again, our GPU code is run on a single Nvidia Tesla V100 and our CPU code is run on Intel Xeon processors. We continue to compare to LBP and SRBP.
As with RBP and RS, we can vary our high and low parallelism settings to get different levels of parallelism at run time. We found that for our synthetic datasets the high parallelism setting mattered less than the low parallelism setting. As such, we locked our high parallelism setting to be a full update: whenever the high setting is selected, we update the full message frontier. We show performance on all datasets with the low parallelism setting set to 0.7, 0.4, and 0.1.
Fig.4 shows GPU RnBP performance on our benchmarks as cumulative convergence graphs. For easy graph datasets, where LBP converges for most or all graphs, we notice that RnBP with higher parallelism settings nearly matches LBP performance (see Fig.4). This shows the value of RnBP’s lack of overhead. As the graphs become more difficult, where LBP only converges on some, we see that RnBP continues to converge quickly on all graphs (see Fig.4). RnBP converges with much higher parallelism than that required for RS and RBP. Using the higher parallelism settings allows speed paired with convergence. We see that this allows RnBP to actually provide speedups over GPU LBP runtimes (Fig.4), averaging 9x speedups on the Ising $200 \times 200$, $C=2.5$ dataset.
Notice that LBP fails to converge on any graphs for the difficult $100 \times 100$, $C=3$ dataset. We see, however, that we can effectively drop parallelism in RnBP to encourage convergence (see Fig.4). We do so without incurring significant overheads that would cause dramatic slowdowns. This convergence behavior applies to larger and more difficult graphs than the ones RBP and RS could handle. RnBP thus extends the classes of Belief Propagation problems to which GPU speedups can be applied. We note that for the difficult dataset, RnBP can still be sensitive to the selected parallelism. However, on all our other datasets, RnBP is fairly robust to parallelism selection. Thus, while the sensitivity to parallelism selection is not completely solved, RnBP is a considerable improvement over existing approaches.
We characterize the speedup of RnBP over SRBP in Table III. Again, we compare with the fastest setting in our test runs that converges on all or most of the graphs, indicated for each dataset, and present conservative lower bounds when SRBP failed to converge (given 90 seconds).
IV-E Additional Tests
As RnBP is a novel message scheduling, we provide several additional tests to examine performance. To test correctness, we created a smaller Ising dataset, size , , for which exact inference is tractable. We use Variable Elimination to find the exact marginal values, then determine the KL-divergence between the exact results and the results of both SRBP and RnBP (run with ). These are shown in Fig.5. We see that RnBP achieves the same quality of result as SRBP.
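The correctness comparison rests on the KL-divergence between exact and approximate marginals, which can be computed as in the Python sketch below. The direction KL(exact || approximate) and the averaging over vertices in `mean_kl` are assumptions made for this example, and both function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same states.

    Terms with p_i = 0 contribute nothing; q_i is assumed positive
    wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(exact, approx):
    """Average per-vertex KL-divergence between two lists of marginals."""
    total = sum(kl_divergence(p, q) for p, q in zip(exact, approx))
    return total / len(exact)
```

A value near zero for every vertex indicates that the approximate beliefs closely match the exact marginals.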
We tested RnBP on a real-world dataset, specifically a protein-folding dataset. This dataset contains graphs with vertices representing amino acid units and the setting at each vertex representing the side-chain configuration. The number of possible settings at each vertex ranges from 2 to 81, and the graph structure is highly irregular. The cumulative convergence is shown in Fig.4 (we run RnBP with ). Despite the different structure as compared to our synthetic dataset, and without any fine-tuning to handle load-imbalanced message updates, we see that RnBP yields fast, convergent performance. Given 3 minutes per graph, RnBP was the only approach to converge on all graphs and yielded an average of 4.4x speedup over SRBP when SRBP converged.
V Conclusions and Future Work
In this work, we presented a study of message scheduling approaches for BP on many-core GPU systems (summarized in Table IV). We hypothesized the existence of a tradeoff between parallelism and sequentialism for BP speed and convergence, and that GPUs could be used to exploit that tradeoff for performant BP. We presented many-core, frontier-based implementations for two asynchronous message schedulings, RBP and RS, and showed empirically that a tradeoff indeed exists. Specifically, lower parallelism encourages convergence, while higher parallelism encourages speed. We also showed that these approaches incur significant overhead, suggesting that a new GPU-centric approach is needed. In this direction, we presented a novel message scheduling we call Randomized Belief Propagation (RnBP), which utilizes randomization to select frontiers for updating. We demonstrated that this approach yields higher convergence while maintaining speed, providing speedups over serial and existing GPU methods on both synthetic and real-world datasets. Our implementation is available online (https://github.com/mvandermerwe/BP-GPU-Message-Scheduling).
|GPU LBP||All Messages||✓|
|Serial RBP/RS||Priority Queue||X|
This work was supported in part by NSF awards 1704715 and 1817073.
[1] (2000) The generalized distributive law. IEEE Transactions on Information Theory 46 (2), pp. 325–343.
[2] (2012) Fast k-selection algorithms for graphics processing units. Journal of Experimental Algorithmics (JEA) 17, pp. 4–2.
[3] (2006) Belief propagation on the GPU for stereo vision. In Computer and Robot Vision, 2006. The 3rd Canadian Conference on, pp. 76–76.
[4] (2012) A GPU implementation of belief propagation decoder for polar codes. In Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on, pp. 1272–1276.
[5] (2014) Work-efficient parallel GPU methods for single-source shortest paths. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 349–359.
[6] (2006) Residual belief propagation: informed scheduling for asynchronous message passing. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pp. 165–173.
[7] (2006) Efficient belief propagation for early vision. International Journal of Computer Vision 70 (1), pp. 41–54.
[8] (2009) Residual splash for optimally parallelizing belief propagation. In Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, D. van Dyk and M. Welling (Eds.), Proceedings of Machine Learning Research, Vol. 5, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, pp. 177–184.
[9] (2010) Optimizing and auto-tuning belief propagation on the GPU. In International Workshop on Languages and Compilers for Parallel Computing, pp. 121–135.
[10] (2008) GPU implementation of belief propagation using CUDA for cloud tracking and reconstruction. In Pattern Recognition in Remote Sensing (PRRS 2008), 2008 IAPR Workshop on, pp. 1–4.
[11] (2011) Hardware-efficient belief propagation. IEEE Transactions on Circuits and Systems for Video Technology 21 (5), pp. 525–537.
[12] (2018) Graph partition neural networks for semi-supervised classification. arXiv preprint arXiv:1803.06272.
[13] (2015) CUDA Unbound (CUB) library.
[14] (2011) Randomized selection on the GPU. In Proceedings of the ACM SIGGRAPH Symposium on High Performance Graphics, pp. 89–98.
[15] (2007) Sufficient conditions for convergence of the sum–product algorithm. IEEE Transactions on Information Theory 53 (12), pp. 4422–4437.
[16] (1999) Loopy belief propagation for approximate inference: an empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 467–475.
[17] (2010) CUDA programming guide.
[18] (2010) CURAND library.
[19] (2010) A GPU-based application framework supporting fast discrete-event simulation. Simulation 86 (10), pp. 613–628.
[20] (2011) The tao of parallelism in algorithms. In ACM SIGPLAN Notices, Vol. 46, pp. 12–25.
[21] (2012) Sequential decoding of non-binary LDPC codes on graphics processing units. In Signals, Systems and Computers (ASILOMAR), 2012 Conference Record of the Forty Sixth Asilomar Conference on, pp. 1267–1271.
[22] (2003) Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory 49 (5), pp. 1120–1146.
[23] (2017) Gunrock: GPU graph analytics. ACM Transactions on Parallel Computing (TOPC) 4 (1), pp. 3.
[24] (2012) Real-time stereo matching based on fast belief propagation. Machine Vision and Applications 23 (6), pp. 1219–1227.
[25] (2006) Real-time global stereo matching using hierarchical belief propagation. In BMVC, Vol. 6, pp. 989–998.
[26] (2003) Approximate inference and protein-folding. In Advances in Neural Information Processing Systems, pp. 1481–1488.
[27] (2001) Generalized belief propagation. In Advances in Neural Information Processing Systems, pp. 689–695.