Deep learning (DL) has shown remarkable success in various computer vision and natural language processing tasks. However, it has a serious drawback: it does not explicitly handle real-world uncertainty, constraints, and causal relations, which are crucial for deploying robust decision-making systems in practice. To address this, researchers are developing hybrid models that combine DL with probabilistic reasoning techniques, e.g., random sum-product networks, probabilistic logic neural networks, etc.
To ease the development of such hybrid models, several software frameworks like Pyro, DeepProbLog, etc. have been developed. However, deploying these hybrid models in real applications imposes demands on throughput, latency, energy consumption, etc. While several efficient hardware platforms have been developed for DL, probabilistic reasoning kernels are still implemented on general-purpose CPUs, which is energy-inefficient. To address this, we propose a customized processor architecture that efficiently accelerates probabilistic reasoning workloads. Specifically, in this work:
GPU execution bottlenecks are identified by developing a highly-optimized CUDA implementation.
A programmable processor architecture is designed to alleviate these bottlenecks.
Experiments are performed to validate the performance improvement over a suite of benchmarks.
II Probabilistic reasoning compute kernel
As shown in fig. 1, probabilistic reasoning is usually performed with a probabilistic program or a probabilistic graphical model. These models can be converted to a tractable representation for efficient inference, called a Sum-Product Network (SPN, also known as an arithmetic circuit) [9, 1]. Furthermore, SPNs can also be learned directly from data. SPNs are rooted directed acyclic graphs wherein the internal nodes are either sums or products, and the leaf nodes are probabilistic parameters or data inputs.
On a general-purpose processor, an SPN can be implemented as a list of operations (alg. 1), or as a for loop over a vector (alg. 2). The loop formulation already shows that, although SPNs offer parallelism, they inherently consist of scalar operations that consume prior intermediate results via irregular indirect memory accesses. This leads to inefficient parallel execution, as will be demonstrated in the next section.
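As an illustration, both formulations can be written down in C. This is a minimal sketch of a tiny 4-node SPN (our own example, not the paper's exact alg. 1 and alg. 2):

```c
/* Alg. 1 style: a fully unrolled list of scalar operations for the
   example SPN (w0*x0 + w1*x1) * x2. */
double spn_unrolled(const double *x, const double *w) {
    double t0 = w[0] * x[0];
    double t1 = w[1] * x[1];
    double t2 = t0 + t1;      /* sum node */
    return t2 * x[2];         /* product node at the root */
}

/* Alg. 2 style: a loop over an operation table. Each entry names the
   opcode and the indices of its two operands in a shared value buffer;
   these index lookups are the irregular indirect accesses noted above. */
typedef enum { OP_SUM, OP_PROD } opcode_t;
typedef struct { opcode_t op; int a, b; } spn_op_t;

/* buf[0..n_in-1] holds leaf values; results are appended after them. */
double spn_loop(double *buf, const spn_op_t *ops, int n_ops, int n_in) {
    for (int i = 0; i < n_ops; i++) {
        double a = buf[ops[i].a], b = buf[ops[i].b];  /* indirect reads */
        buf[n_in + i] = (ops[i].op == OP_SUM) ? a + b : a * b;
    }
    return buf[n_in + n_ops - 1];  /* root node result */
}
```

Both routines compute the same value; the loop form makes the dependence of each operation on earlier buffer entries explicit.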
III SPNs on CPU and GPU
To evaluate the performance of general-purpose platforms, SPNs are implemented on a CPU and a GPU.
CPU: An Intel Core i5-7200 CPU is used to execute SPNs as lists of operations in C (alg. 1), giving the compiler maximal freedom to reorder and parallelize operations. The for-loop implementation (alg. 2) is also evaluated, but it consistently performs slower than alg. 1.
GPU: A GPU offers parallelism in the form of the single-instruction, multiple-thread (SIMT) execution model. Multiple operations of an SPN can be executed in parallel by scheduling them on multiple threads. However, the irregularity of SPN graphs gives rise to the following two problems.
1) Inter-thread data transfers: Inter-thread data transfers are needed when threads use data computed by other threads, which frequently happens when an SPN is scheduled across multiple threads. Such transfers are allowed only after thread synchronization, as the GPU does not guarantee lock-step execution of threads. To minimize these synchronization overheads in our experiments, SPNs are decomposed into groups of independent nodes (colored in fig. 2). As the nodes in a group do not depend on other nodes of the same group, they can be executed on any thread without synchronization. When moving to a new group, threads still have to be synchronized with the CUDA primitive __syncthreads(). Moreover, this inter-thread communication requires the data to reside in either GPU shared memory or global memory. If the data fits, the shared memory is used for low-latency inter-thread transfers; otherwise the global memory is used with L1 caching enabled.
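The grouping step can be sketched as a simple levelization pass (an assumed representation for illustration; the actual implementation may differ): every node is assigned to the earliest group in which both of its operands are available, so the nodes of one group are mutually independent and need no synchronization between them.

```c
/* Operand indices of an internal node; indices < n_in refer to leaves. */
typedef struct { int a, b; } node_t;

/* Assigns level[i] to internal node i (nodes are assumed to be in
   topological order, so operands always precede their consumers).
   Leaves sit at level 0; a node goes one level above its deepest
   operand. All nodes sharing a level form one synchronization-free
   group. Returns the number of groups. */
int assign_groups(const node_t *nodes, int n_nodes, int n_in, int *level) {
    int max_level = 0;
    for (int i = 0; i < n_nodes; i++) {
        int la = (nodes[i].a < n_in) ? 0 : level[nodes[i].a - n_in];
        int lb = (nodes[i].b < n_in) ? 0 : level[nodes[i].b - n_in];
        level[i] = 1 + (la > lb ? la : lb);
        if (level[i] > max_level) max_level = level[i];
    }
    return max_level;
}
```

A barrier such as __syncthreads() is then needed only between consecutive groups, not between individual operations.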
2) Irregular memory accesses: Data transfers from the shared memory become a bottleneck when multiple threads in a GPU warp access the same bank, resulting in bank conflicts. These bank conflicts are minimized with a graph coloring-based bank allocation that ensures that the threads in a warp access different banks.
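The bank allocation can be sketched as a greedy graph coloring (a simplified illustration of the idea, not the exact CUDA tooling): values accessed by threads of the same warp in the same cycle are connected in a conflict graph, and each color corresponds to a shared-memory bank.

```c
#define MAX_V   64   /* hypothetical graph size for illustration */
#define N_BANKS 32   /* shared-memory banks on the Jetson TX2 GPU */

/* Greedy coloring: each vertex takes the lowest bank not already used
   by a previously colored neighbor. Adjacent vertices therefore end up
   in different banks, avoiding a bank conflict within the warp.
   Returns -1 if a vertex cannot be colored (a spill would be needed). */
int color_banks(int n, int adj[MAX_V][MAX_V], int bank[MAX_V]) {
    for (int v = 0; v < n; v++) {
        int used[N_BANKS] = {0};
        for (int u = 0; u < v; u++)
            if (adj[v][u]) used[bank[u]] = 1;
        bank[v] = -1;
        for (int c = 0; c < N_BANKS; c++)
            if (!used[c]) { bank[v] = c; break; }
        if (bank[v] < 0) return -1;
    }
    return 0;
}
```

Greedy coloring is not optimal in general, but with 32 banks it is usually sufficient for the per-warp conflict graphs that arise here.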
The pseudocode of the developed CUDA implementation is shown in alg. 3. Several implementations with a varying number of threads in a single thread block are evaluated on the Nvidia Jetson TX2 embedded GPU platform. An SPN trained on a benchmark in  is used for performance measurement, with the resulting throughput shown in fig. 2(c). The GPU implementation utilizing a single thread expectedly performs worse than the CPU, as a GPU CUDA core is simpler than a superscalar CPU core. Using 256 threads, however, increases the throughput by a factor of only 4.1x, a sublinear scaling due to the following reasons:
Overhead of thread synchronization.
Limited shared memory bandwidth. In the Jetson TX2 GPU, the shared memory has 32 banks shared among 128 CUDA cores. As all the threads read from (and write to) the shared memory, its bandwidth becomes a bottleneck.
Thread divergence due to selection between sum and product operations leads to inactive threads in a warp.
An efficient processor for SPNs should strive for improved parallelization by avoiding these bottlenecks.
IV SPN processor architecture
We propose a custom processor architecture (fig. 3) to alleviate the bottlenecks in SPN parallelization. It uses a lightweight method to exchange data among processing elements, and contains an appropriate memory hierarchy to handle irregular reads and writes efficiently. The important features are the following:
Processing element (PE): The PE is a flexible arithmetic unit that can perform all the needed arithmetic operations (+, ×). It can additionally be configured to forward either of its inputs to the output without performing any operation. The output of the PE is registered to allow an adequate clock frequency.
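The PE behavior described above can be modeled as follows (a behavioral sketch inferred from the text; the opcode names are ours):

```c
/* Opcodes of the flexible PE: sum, product, or forwarding of either
   input unchanged (used when a tree level must be bypassed). */
typedef enum { PE_ADD, PE_MUL, PE_FWD_A, PE_FWD_B } pe_op_t;

/* Combinational function of one PE; in hardware the result is
   registered before feeding the next tree level. */
double pe(pe_op_t op, double a, double b) {
    switch (op) {
        case PE_ADD:   return a + b;
        case PE_MUL:   return a * b;
        case PE_FWD_A: return a;     /* pass-through, no arithmetic */
        default:       return b;     /* PE_FWD_B */
    }
}
```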
Trees of PEs: The processor’s datapath consists of trees of PEs that enable local reuse of data by avoiding frequent writebacks to the register file. Multiple trees allow execution of independent subgraphs of an SPN in parallel.
Register file and register writes: Each PE tree writes to a private banked register file. PEs at different levels of a tree can write to different numbers of banks. As shown in fig. 3 (bottom left), each PE at the first level can write to 2 banks, each PE at the second level to 4 banks, and so on.
Crossbar and register reads: The inputs of the PE trees are serviced by the register file via a crossbar, allowing the PE trees to read from any of the banks. Multiple inputs, however, cannot access the same bank in a given cycle. The crossbar is a combinational block with no internal memory.
Data memory: All the register files load/store data together in the form of a vector from a single address of the data memory, limiting irregular accesses to the register file, while requiring only vectorized transactions to the data memory.
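The vectorized transaction can be sketched as follows (sizes are hypothetical, chosen only for illustration): one data-memory address supplies one element to every register bank, so the memory sees only wide, regular transfers while the irregular accesses stay inside the banked register file.

```c
#define N_BANKS    8   /* hypothetical number of register banks */
#define BANK_DEPTH 4   /* hypothetical registers per bank */

/* Loads one vector row from the data memory: element b of the row at
   `addr` goes to position `reg_idx` of register bank b. A vector store
   would be the symmetric operation. */
void vector_load(const double *data_mem, int addr,
                 double regs[N_BANKS][BANK_DEPTH], int reg_idx) {
    for (int b = 0; b < N_BANKS; b++)
        regs[b][reg_idx] = data_mem[addr * N_BANKS + b];
}
```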
Programmability: The architecture is made programmable with a custom VLIW instruction set that can configure the trees and the crossbar every clock cycle, copy data within register banks, and load/store from the data memory.
Compilation: A custom compiler is developed to effectively schedule SPNs on the processor. The compiler directly takes as input the SPNs generated from tools like . The compiler allocates register banks to intermediate results while trying to minimize register read/write bank conflicts. This allocation has to happen in tandem with the placement of operations on PEs, since PEs cannot write to all register banks. Furthermore, the compiler reorders operations to minimize read-after-write hazards due to pipelining in the PE trees. Finally, intermediate data from the register files are spilled carefully to the data memory to minimize loads and stores. In the end, the compiler generates a list of custom VLIW instructions that can be executed directly on the processor.
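The read-after-write hazard handling can be illustrated with a toy scheduling pass (illustrative only; the actual compiler additionally reorders operations to fill these stall slots): an operation whose producer finished fewer than `latency` cycles earlier is stalled, modeling the pipeline depth of the PE trees.

```c
#define MAX_OPS 64  /* hypothetical program size */

/* Producer op indices of each operation; -1 means a primary input
   that is always available. Ops are assumed topologically ordered. */
typedef struct { int src0, src1; } op_t;

/* Assigns an issue cycle to every op, stalling until both producers
   have cleared the pipeline. Returns the total cycle count including
   stalls, so a lower return value means a better schedule. */
int schedule(const op_t *ops, int n, int latency, int cycle_of[MAX_OPS]) {
    int cycle = 0;
    for (int i = 0; i < n; i++) {
        int ready = 0;
        if (ops[i].src0 >= 0 && cycle_of[ops[i].src0] + latency > ready)
            ready = cycle_of[ops[i].src0] + latency;
        if (ops[i].src1 >= 0 && cycle_of[ops[i].src1] + latency > ready)
            ready = cycle_of[ops[i].src1] + latency;
        cycle = (ready > cycle) ? ready : cycle;  /* stall on hazard */
        cycle_of[i] = cycle++;
    }
    return cycle;
}
```

Moving independent operations into the stall slots, as the real compiler does, shortens the returned cycle count without changing any dependence.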
In summary, the architecture allows local reuse of intermediate results by using trees of PEs, enables flexible movement of data with the crossbar, and efficiently supports irregular data fetches by using banked register files. The custom compiler makes it possible to use these architectural features effectively.
V Experimental results
We experimentally benchmark the throughput of the proposed processor architecture against optimized CPU and GPU implementations, with SPNs trained on a suite of standard benchmarks [7, 3] using the algorithm in . The CPU and GPU implementations are as described in sec. III. For a fair comparison, the proposed processor is evaluated with configurations that consume fewer resources than the GPU (summarized in Table I): 32 banks in the register file, the same as the number of GPU shared-memory banks, with 64 registers in each bank.
Two configurations of the custom processor with different arrangements of PEs are evaluated to highlight the impact of the tree arrangement. The Ptree configuration uses 2 trees, each with 4 levels of PEs (30 PEs in total). The Pvect configuration removes the trees and uses only the lowest-level PEs (16 PEs in total). Note that both configurations are identical in every other aspect (crossbar, register file, etc.); they differ only in the arrangement and the number of PEs.
Throughput for the CPU and GPU is measured by averaging the runtime of 100k SPN executions, resulting in peak performances of 0.55 and 0.95 effective operations/cycle, respectively. For accurate throughput measurement of our processor, a cycle-accurate model is developed in the MyHDL framework . The resulting performance is compared in fig. 4 in terms of operations/cycle. The throughput of Ptree is at least 12x higher than that of the CPU and GPU, with a peak performance of 11.6 operations/cycle. Moreover, Ptree performs 2x better than Pvect, confirming the benefit of the tree arrangement of the PEs.
VI Related work
In , the authors implemented SPNs on an Nvidia 1080Ti discrete GPU using Tensorflow and observed much lower throughput than on a CPU. However, Tensorflow is ill-suited for SPNs, as it launches a new GPU kernel for every operation in the SPN, incurring prohibitively high overhead. We show that a custom CUDA-based implementation can achieve throughput similar to a CPU, even on an embedded GPU platform.
In recent years, a few works have proposed custom hardware for SPN acceleration. The works in [12, 13, 14] use fully-parallel spatial architectures that implement every operation in an SPN as a unique hardware operator, typically intended for FPGA implementation. These fully unrolled implementations, however, do not scale to the large SPNs required for real applications. In , the authors proposed a processor with multiple processing elements that share intermediate results over a common data bus, making the bus a severe bottleneck. The processor in , proposed for a different probabilistic reasoning workload, has similarities with our work. However, it has unnecessary hardware redundancy (multiple crossbars) while imposing stringent requirements on the graph structure (only polytrees), making it unsuitable for SPNs. Our work retains its advantages while being more general.
An efficient processor architecture is proposed to accelerate sum-product networks, an important compute kernel for probabilistic reasoning. The processor contains hardware features suited to the irregular graph processing of SPNs, such as trees of processing elements for data reuse, a crossbar for flexible sharing of data, and independently-addressable register banks for irregular data reads and writes. The performance of the processor is compared with a CPU and a highly-optimized GPU CUDA implementation, showing 12x higher throughput while using fewer computational and memory units.
-  (2008) On probabilistic inference by weighted model counting. Vol. 172, pp. 772–799.
-  (2019) MyHDL manual. Note: Version 0.11.
-  UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
-  (2013) Understanding and modeling the synchronization cost in the GPU architecture.
-  Learning the structure of probabilistic sentential decision diagrams. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI).
-  (2010) High-throughput Bayesian computing machine with reconfigurable hardware. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 73–82.
-  (2010) Learning Markov network structure with decision trees. In 2010 IEEE International Conference on Data Mining, pp. 334–343.
-  (2019) Random sum-product networks: a simple and effective approach to probabilistic deep learning. In Conference on Uncertainty in Artificial Intelligence (UAI).
-  (2011) Sum-product networks: a new deep architecture. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 689–690.
-  (2019) Probabilistic logic neural networks for reasoning. arXiv preprint arXiv:1906.08495.
-  (2015) Towards real-time, on-board, hardware-supported sensor and software health management for unmanned aerial systems. International Journal of Prognostics and Health Management.
-  (2019) ProbLP: a framework for low-precision probabilistic inference. In Proceedings of the 56th Annual Design Automation Conference, pp. 190.
-  (2018) Automatic mapping of the sum-product network inference problem to FPGA-based accelerators. In IEEE 36th International Conference on Computer Design (ICCD), pp. 350–357.
-  (2015) Bayesian network-based framework for the design of reconfigurable health management monitors. In 2015 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), pp. 1–8.