Acceleration of probabilistic reasoning through custom processor architecture

02/27/2021, by Nimish Shah, et al.

Probabilistic reasoning is an essential tool for robust decision-making systems because of its ability to explicitly handle real-world uncertainty, constraints and causal relations. Consequently, researchers are developing hybrid models by combining Deep Learning with probabilistic reasoning for safety-critical applications like self-driving vehicles, autonomous drones, etc. However, probabilistic reasoning kernels do not execute efficiently on CPUs or GPUs. This paper, therefore, proposes a custom programmable processor to accelerate sum-product networks, an important probabilistic reasoning execution kernel. The processor has a datapath architecture and memory hierarchy optimized for sum-product network execution. Experimental results show that the processor, while requiring fewer computational and memory units, achieves a 12x throughput benefit over the Nvidia Jetson TX2 embedded GPU platform.

I Introduction

Deep learning (DL) has shown remarkable success in various computer vision and natural language processing tasks. However, it suffers from a serious drawback of not explicitly handling real-world uncertainty, constraints and causal relations, which are crucial for deploying robust decision-making systems in practice. To address this, researchers are developing hybrid models by combining DL with probabilistic reasoning techniques, e.g., random sum-product networks [8], probabilistic logic neural networks [10], etc.

To ease the development of such hybrid models, several software frameworks like Pyro, DeepProbLog, etc. have been developed. However, deploying these hybrid models in real applications imposes demands on throughput, latency, energy consumption, etc. While several efficient hardware platforms have been developed for DL, probabilistic reasoning kernels are still implemented on general-purpose CPUs, which is energy-inefficient. To address this, we propose a customized processor architecture that efficiently accelerates probabilistic reasoning workloads. Specifically, in this work:

  • GPU execution bottlenecks are identified by developing a highly-optimized CUDA implementation.

  • A programmable processor architecture is designed to alleviate these bottlenecks.

  • Experiments are performed to validate the performance improvement over a suite of benchmarks.

Fig. 1: An example of a hybrid system that uses Deep Learning for perception and probabilistic models for reasoning. Probabilistic models are usually converted to sum-product networks (SPN) for efficient inference.

The paper is organized as follows: sec. II describes the compute kernel, while sec. III discusses its performance on a CPU and GPU. The proposed processor architecture is described in sec. IV. Sections V and VI evaluate the performance and discuss related work. Finally, sec. VII concludes the work.

II Probabilistic reasoning compute kernel

As shown in fig. 1, probabilistic reasoning is usually performed with a probabilistic program or a probabilistic graphical model. These models can be converted to a tractable representation for efficient inference, called a Sum-Product Network (SPN, also known as an arithmetic circuit) [9, 1]. Furthermore, SPNs can also be learned directly from data. SPNs are rooted directed acyclic graphs wherein the internal nodes are either sums or products, and the leaf nodes are probabilistic parameters or data inputs.
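To make this concrete, the node structure and a direct recursive evaluation can be sketched in C as follows; the struct layout and field names are our own illustration, not the flattened representation used later in alg. 1 and alg. 2.

#include <stddef.h>

typedef enum { NODE_SUM, NODE_PRODUCT, NODE_LEAF } node_type_t;

/* One node of the rooted DAG: internal nodes are sums or products,
 * leaves hold a probabilistic parameter or a data input. */
typedef struct spn_node {
    node_type_t type;
    double value;                 /* leaf parameter/input, or computed result */
    struct spn_node **children;   /* operand nodes (none for leaves) */
    size_t num_children;
} spn_node_t;

/* Direct recursive evaluation of the SPN rooted at `node`.
 * (For clarity only: shared subgraphs would be memoized in practice.) */
double spn_eval(spn_node_t *node) {
    if (node->type == NODE_LEAF)
        return node->value;
    double acc = (node->type == NODE_SUM) ? 0.0 : 1.0;
    for (size_t i = 0; i < node->num_children; i++) {
        double child = spn_eval(node->children[i]);
        acc = (node->type == NODE_SUM) ? acc + child : acc * child;
    }
    node->value = acc;
    return acc;
}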

On a general-purpose processor, an SPN can be implemented as a list of operations (alg. 1) or as a for loop over a vector (alg. 2). This loop formulation already shows that, although SPNs offer parallelism, they inherently consist of scalar operations that consume prior intermediate results via irregular indirect memory accesses. This leads to inefficient parallel execution, as will be demonstrated in the next section.

1: Inputs: IN: SPN leaf nodes as a vector
2: r0 ← IN[0] × IN[1]
3: r1 ← IN[2] × IN[3]
4: r2 ← r0 + r1
5: …
6: return rN
Algorithm 1 SPN as a list of operations
1: Inputs: IN: SPN leaf nodes as a vector of size m. O: Vector of size n with binary indicators to select between sum or product for each arithmetic operation. B, C: Vectors with pointers to the first and second operand of each arithmetic operation.
2: A ← [] // a vector of size m+n
3: A[0…m-1] ← IN // initialize with inputs
4: for i ← 0; i < n; i++ do
5:    if O[i] == SUM then
6:       A[i+m] ← A[B[i]] + A[C[i]]
7:    else
8:       A[i+m] ← A[B[i]] × A[C[i]]
9: return A[m+n-1]
Algorithm 2 SPN as a for loop
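For reference, the for-loop formulation of alg. 2 maps almost directly to C; the sketch below assumes 32-bit indices and single-precision values (the type choices are ours).

#include <stdlib.h>

#define OP_SUM     0
#define OP_PRODUCT 1

/* IN: m leaf values. O: n opcodes. B, C: operand indices into A.
 * Returns the value of the root node, A[m+n-1]. */
float spn_forward(const float *IN, int m,
                  const int *O, const int *B, const int *C, int n) {
    float *A = (float *)malloc((size_t)(m + n) * sizeof(float));
    for (int i = 0; i < m; i++)
        A[i] = IN[i];                               /* initialize with inputs */
    for (int i = 0; i < n; i++) {
        /* indirect reads of earlier results: the irregular access pattern */
        float b = A[B[i]], c = A[C[i]];
        A[i + m] = (O[i] == OP_SUM) ? b + c : b * c;
    }
    float root = A[m + n - 1];
    free(A);
    return root;
}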

III SPNs on CPU and GPU

To evaluate the performance of general-purpose platforms, SPNs are implemented on a CPU and a GPU.

CPU: An Intel Core i5-7200 CPU is used to execute SPNs as lists of operations in C (alg. 1), giving the compiler maximal freedom to reorder and parallelize operations. The for-loop implementation (alg. 2) is also evaluated, but it consistently performs slower than alg. 1.

GPU: A GPU offers parallelism in the form of the single-instruction multiple-thread (SIMT) execution mode. Multiple operations of an SPN can be executed in parallel by scheduling them on multiple threads. However, the following two problems arise in doing so, due to the irregularity of SPN graphs.

1) Inter-thread data transfers: Inter-thread data transfers are needed when threads use data computed by other threads, which frequently happens when an SPN is scheduled across multiple threads. Such transfers are only safe after thread synchronization, as the GPU does not guarantee lock-step execution of threads. To minimize these synchronization overheads in our experiments, SPNs are decomposed into groups of independent nodes (colored in fig. 2). As nodes in a group are not dependent on other nodes of the same group, they can be executed on any thread without synchronization. When executing a new group, threads still have to be synchronized with the CUDA primitive __syncthreads(). Moreover, this inter-thread communication needs the data to be in either GPU shared memory or global memory. If the data fits, shared memory is used for low-latency inter-thread transfers; otherwise, global memory is used with L1 caching enabled.
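The paper does not spell out how the groups are formed; one straightforward possibility is a levelization pass over the DAG, where a node's group is one more than the deepest group among its operands, so that all nodes sharing a group are mutually independent. A minimal sketch, reusing the index vectors of alg. 2:

/* Assign a group (level) to each of the n operations described by B and C,
 * which index the combined array of m leaves followed by n results.
 * Leaves belong to group 0; operations are assumed topologically ordered,
 * as in alg. 2. Nodes with equal group numbers are mutually independent. */
void assign_groups(const int *B, const int *C, int m, int n, int *group) {
    for (int i = 0; i < n; i++) {
        int gb = (B[i] < m) ? 0 : group[B[i] - m];
        int gc = (C[i] < m) ? 0 : group[C[i] - m];
        group[i] = 1 + (gb > gc ? gb : gc);
    }
}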

2) Irregular memory accesses: Data transfers from the shared memory become a bottleneck if multiple threads in a GPU warp access the same bank, resulting in bank conflicts. These bank conflicts are minimized with a graph coloring-based bank allocation that ensures that the threads in a warp access different banks.
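The exact coloring heuristic is not described; a minimal greedy variant could build a conflict relation between values that are read in the same cycle by threads of one warp, and assign each value the lowest bank (color) not used by a conflicting value:

#define NUM_BANKS 32

/* conflict[v*num_values + u] is non-zero when values v and u are read in the
 * same cycle by threads of one warp and must therefore live in different banks.
 * Returns -1 if some value cannot be placed within NUM_BANKS banks. */
int color_banks(const unsigned char *conflict, int num_values, int *bank) {
    for (int v = 0; v < num_values; v++) {
        int used[NUM_BANKS] = {0};
        for (int u = 0; u < v; u++)
            if (conflict[v * num_values + u])
                used[bank[u]] = 1;
        bank[v] = -1;
        for (int b = 0; b < NUM_BANKS; b++)
            if (!used[b]) { bank[v] = b; break; }   /* lowest conflict-free bank */
        if (bank[v] < 0)
            return -1;
    }
    return 0;
}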

Fig. 2: Execution of an SPN on multiple GPU threads: (a) decompose the SPN into 'groups' based on dependencies, and (b) execute on multiple threads, synchronizing them before executing a new group. (c) Throughput comparison of CPU and GPU.

The pseudocode of the developed CUDA implementation is shown in alg. 3. Several implementations with a varying number of threads in one thread block are evaluated on the Nvidia Jetson TX2 embedded GPU platform. An SPN trained on a benchmark from [7] is used for performance measurement, with the resulting throughput shown in fig. 2(c). The GPU implementation utilizing a single thread expectedly performs worse than the CPU, as a GPU CUDA core is simpler than a superscalar CPU core. Using 256 threads, however, increases the throughput by only a factor of 4.1x, a sublinear scaling due to the following reasons:

  • Overhead of thread synchronization [4].

  • Limited shared-memory bandwidth. In the Jetson TX2 GPU, the shared memory has 32 banks shared among 128 CUDA cores. As all the threads read from (and write to) the shared memory, its bandwidth becomes a bottleneck.

  • Thread divergence due to selection between sum and product operations leads to inactive threads in a warp.

An efficient processor for SPNs should strive for improved parallelization by avoiding these bottlenecks.

1: kernel inputs: Same as inputs of alg. 2
2: int i = blockDim.x*blockIdx.x + threadIdx.x // Thread ID
3: const int t = TOTAL_THREADS, x = length(IN)
4: __shared__ A[] // a vector in shared memory
5:
6: for j ← 0; j < x; j++ do
7:    A[i + j*t] ← IN[i + j*t] // copy inputs to shared memory
8: __syncthreads()
9:
10: // compute first group of nodes
11: if O[i] == SUM then
12:    A[i + x*t] ← A[B[i]] + A[C[i]]
13: else
14:    A[i + x*t] ← A[B[i]] × A[C[i]]
15: if O[i + t] == SUM then
16:    A[i + (x+1)*t] ← A[B[i + t]] + A[C[i + t]]
17: else
18:    A[i + (x+1)*t] ← A[B[i + t]] × A[C[i + t]]
19:
20: __syncthreads()
21: // compute second group of nodes, and so on
Algorithm 3 SPNcuda(IN, B, C, O)
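For illustration, a compact CUDA rendering of alg. 3 with an explicit loop over groups might look as follows; the packed group-offset bookkeeping and the kernel signature are our own assumptions, since the actual implementation is not published with the paper.

#define OP_SUM 0

// One thread block evaluates the SPN; each thread handles at most one node per
// group. IN: x*t leaf values. O, B, C: per-node opcode and operand indices
// into A (laid out as leaves followed by packed group results).
// group_size[g]: nodes in group g; num_groups: number of groups.
__global__ void spn_cuda(const float *IN, const int *O, const int *B,
                         const int *C, const int *group_size, int num_groups,
                         int x, float *out) {
    extern __shared__ float A[];            // leaves followed by node results
    const int i = threadIdx.x;
    const int t = blockDim.x;

    for (int j = 0; j < x; j++)
        A[i + j * t] = IN[i + j * t];       // copy inputs to shared memory
    __syncthreads();

    int base = x * t;                       // where results of group 0 start
    int node = 0;                           // index of the next node overall
    for (int g = 0; g < num_groups; g++) {
        if (i < group_size[g]) {
            int k = node + i;               // this thread's node in group g
            float b = A[B[k]], c = A[C[k]];
            A[base + i] = (O[k] == OP_SUM) ? b + c : b * c;
        }
        node += group_size[g];
        base += group_size[g];
        __syncthreads();                    // finish group g before group g+1
    }
    if (i == 0)
        *out = A[base - 1];                 // root is the last node computed
}

Such a kernel would be launched with one thread block and dynamic shared memory covering all leaves and node results, e.g. spn_cuda<<<1, t, (x*t + total_nodes)*sizeof(float)>>>(…).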

IV SPN processor architecture

We propose a custom processor architecture (fig. 3) to alleviate the bottlenecks in SPN parallelization. It uses a light-weight method to exchange data among processing elements, and contains an appropriate memory hierarchy to handle irregular reads and writes efficiently. The important features are the following:

Processing element (PE): The PE is a flexible arithmetic unit that can perform all the needed arithmetic operations (+, ×). It can additionally be configured to forward either of its inputs to the output without performing any operation. The output of the PE is registered to allow an adequate clock frequency.
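Behaviorally, a PE can be modeled as a small combinational function plus an output register; the sketch below captures the modes described above (the mode encoding is illustrative, not the RTL):

typedef enum { PE_ADD, PE_MUL, PE_PASS_A, PE_PASS_B } pe_mode_t;

/* Behavioral model of one processing element: one arithmetic operation or a
 * pass-through per cycle. The registered output is modeled by the caller
 * latching the returned value for use in the next cycle. */
static inline float pe_step(pe_mode_t mode, float a, float b) {
    switch (mode) {
        case PE_ADD:    return a + b;
        case PE_MUL:    return a * b;
        case PE_PASS_A: return a;     /* forward an input without an operation */
        default:        return b;     /* PE_PASS_B */
    }
}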

Trees of PEs: The processor’s datapath consists of trees of PEs that enable local reuse of data, by avoiding frequent writebacks to the register file. Multiple trees allow execution of independent subgraphs of an SPN in parallel.

Fig. 3: The SPN processor architecture based on trees of processing elements (PE). A PE tree writes outputs to a private banked register file, while the inputs are serviced via a crossbar, allowing trees to access other register files.

Register file and register writes: Each PE tree writes to a private banked register file. PEs at different levels of the tree can write to a different number of banks. As shown at the bottom left of fig. 3, each PE at the first level can write to 2 banks, each PE at the second level to 4 banks, and so on.

Crossbar and register reads: The inputs of the PE trees are serviced by the register file via a crossbar, allowing PE trees to read any of the banks. Multiple inputs, however, cannot access the same bank in a given cycle. The crossbar is a combinational block with no internal memory.

Data memory: All the register files together load/store data as a vector from a single address of the data memory, confining irregular accesses to the register files while requiring only vectorized transactions to the data memory.

Programmability: The architecture is made programmable with a custom VLIW instruction set that can configure the trees and the crossbar every clock cycle, copy data within register banks, and load/store from the data memory.
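To make the programmability concrete, a single VLIW instruction can be pictured as one configuration word applied per cycle. The field breakdown below is purely hypothetical and sized here for two 4-level trees of PEs; the real instruction encoding is not given in the paper.

#define NUM_PES       30   /* two 4-level trees of PEs */
#define NUM_PE_INPUTS 32   /* leaf-level PE inputs served through the crossbar */

/* Hypothetical layout of one VLIW instruction, applied every clock cycle. */
typedef struct {
    unsigned char pe_mode[NUM_PES];         /* add / multiply / pass, per PE */
    unsigned char src_bank[NUM_PE_INPUTS];  /* crossbar: bank read by each input */
    unsigned char src_reg[NUM_PE_INPUTS];   /* register index within that bank */
    unsigned char dst_bank[NUM_PES];        /* destination bank for each PE output */
    unsigned char dst_reg[NUM_PES];         /* destination register in that bank */
    unsigned char mem_op;                   /* none / vector load / vector store */
    unsigned short mem_addr;                /* data-memory address for load/store */
} vliw_insn_t;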

Compilation: A custom compiler is developed to effectively schedule SPNs on the processor. The compiler directly takes as input the SPNs generated from tools like [5]. It allocates register banks to intermediate results while trying to minimize register read/write bank conflicts. This allocation has to happen in tandem with the placement of operations on the PEs, since PEs cannot write to all register banks. Furthermore, the compiler reorders operations to minimize read-after-write hazards due to pipelining in the PE trees. Finally, intermediate data from the register files are carefully spilled to the data memory to minimize load-stores. In the end, the compiler generates a list of custom VLIW instructions that can be executed directly on the processor.
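As one example of the reordering step, a greedy list scheduler can delay an operation until both of its operands have cleared the PE-tree pipeline; the latency value and the scheduling policy below are assumptions made for illustration, not the compiler's actual algorithm.

#include <stdlib.h>
#include <limits.h>

#define PIPE_LATENCY 3   /* assumed cycles before a PE-tree result can be read back */

/* Greedy reordering sketch: operations 0..n-1 (operand index vectors B and C
 * as in alg. 2, with m leaf inputs) are issued only once both operands have
 * left the pipeline; otherwise the cycle is lost to a stall. order[] receives
 * the issue order; the return value is the total cycle count. */
int schedule_ops(const int *B, const int *C, int m, int n, int *order) {
    int *ready = (int *)malloc((size_t)n * sizeof(int));
    for (int i = 0; i < n; i++)
        ready[i] = INT_MAX;                 /* INT_MAX marks "not yet issued" */
    int cycle = 0, emitted = 0;
    while (emitted < n) {
        int pick = -1;
        for (int i = 0; i < n; i++) {
            if (ready[i] != INT_MAX) continue;   /* already issued */
            int rb = (B[i] < m) ? 0 : ready[B[i] - m];
            int rc = (C[i] < m) ? 0 : ready[C[i] - m];
            if (rb <= cycle && rc <= cycle) { pick = i; break; }
        }
        if (pick >= 0) {
            order[emitted++] = pick;
            ready[pick] = cycle + PIPE_LATENCY;
        }                                   /* else: stall (NOP) this cycle */
        cycle++;
    }
    free(ready);
    return cycle;
}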

In summary, the architecture allows local reuse of intermediate results through trees of PEs, enables flexible movement of data with the crossbar, and efficiently supports irregular data fetches through banked register files. The custom compiler makes it possible to exploit these architectural features effectively.

V Experimental results

We experimentally benchmark the throughput of the proposed processor architecture against optimized CPU and GPU implementations, with SPNs trained on a suite of standard benchmarks [7, 3] using the algorithm in [5]. The CPU and GPU implementations are as described in sec. III. For a fair comparison, the performance of the proposed processor is evaluated for configurations that consume fewer resources than the GPU (summarized in Table I): 32 banks in the register file, the same as the number of GPU shared-memory banks, with 64 registers in each bank.

Platform     | Compute units                        | Immediate memory size                | Memory banks
CPU          | 2 arith. units in a superscalar core | 168 80b registers, 32 KB L1 cache    | 16
GPU          | 128 CUDA cores                       | 64K 32b registers, 64 KB shared mem. | 32
Ours (Pvect) | 16 PEs                               | 2K 32b registers, 64 KB data mem.    | 32
Ours (Ptree) | 30 PEs                               | 2K 32b registers, 64 KB data mem.    | 32
TABLE I: Compute and memory details of different processing platforms
Fig. 4: Throughput comparison of different platforms on several benchmarks.

Two configurations of the custom processor with different arrangements of PEs are evaluated to highlight the impact of the tree arrangement. The Ptree configuration uses 2 trees, each with 4 levels of PEs (30 PEs in total). The Pvect configuration removes the trees and uses only the lowest-level PEs (16 PEs in total). Note that both configurations are identical in every other aspect (crossbar, register file, etc.), differing only in the arrangement and number of PEs.

Throughput for the CPU and GPU is measured by averaging the runtime over 100k SPN executions, resulting in peak performance of 0.55 and 0.95 effective operations/cycle, respectively. For accurate throughput measurement of our processor, a cycle-accurate model is developed in the MyHDL framework [2]. The resulting performance is compared in fig. 4 in terms of operations/cycle. The throughput of Ptree is at least 12x higher than that of the CPU and GPU, with a peak performance of 11.6 operations/cycle. Moreover, Ptree performs 2x better than Pvect, confirming the benefit of the tree arrangement of the PEs.

VI Related work

In [13], the authors implemented SPNs on an Nvidia 1080Ti discrete GPU using TensorFlow and observed much lower throughput than on a CPU. However, TensorFlow is ill-suited for SPNs as it launches a new GPU kernel for every operation in the SPN, incurring prohibitively high overhead. We show that a custom CUDA-based implementation can achieve throughput similar to a CPU, even on an embedded GPU platform.

In recent years, a few works have proposed custom hardware for SPN acceleration. The works in [12, 13, 14] use fully-parallel spatial architectures that implement every operation of an SPN as a unique hardware operator, typically intended for FPGA implementation. These fully unrolled implementations, however, do not scale to the large SPNs required for real applications. In [11], the authors proposed a processor with multiple processing elements that share intermediate results over a common data bus, making the bus a severe bottleneck. The processor in [6], proposed for a different probabilistic reasoning workload, has similarities with our work. However, it has unnecessary hardware redundancy (multiple crossbars) while imposing stringent requirements on the graph structure (only polytrees), making it unsuitable for SPNs. Our work retains its advantages while being more general.

VII Conclusion

An efficient processor architecture is proposed to accelerate sum-product networks (SPNs), an important compute kernel for probabilistic reasoning. The processor contains hardware features suited to the irregular graph processing of SPNs, such as trees of processing elements for data reuse, a crossbar for flexible sharing of data, and independently-addressable register banks for irregular data reads and writes. The performance of the processor is compared with a CPU and a highly-optimized GPU CUDA implementation, showing 12x higher throughput while using fewer computational and memory units.

References

  • [1] M. Chavira and A. Darwiche (2008) On probabilistic inference by weighted model counting. Vol. 172, pp. 772–799.
  • [2] J. Decaluwe (2019) MyHDL manual, version 0.11.
  • [3] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
  • [4] J. Letendre (2013) Understanding and modeling the synchronization cost in the GPU architecture.
  • [5] Y. Liang, J. Bekker, and G. Van den Broeck (2017) Learning the structure of probabilistic sentential decision diagrams. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI).
  • [6] M. Lin, I. Lebedev, and J. Wawrzynek (2010) High-throughput Bayesian computing machine with reconfigurable hardware. In Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 73–82.
  • [7] D. Lowd and J. Davis (2010) Learning Markov network structure with decision trees. In 2010 IEEE International Conference on Data Mining, pp. 334–343.
  • [8] R. Peharz, A. Vergari, K. Stelzner, A. Molina, X. Shao, M. Trapp, K. Kersting, and Z. Ghahramani (2019) Random sum-product networks: a simple and effective approach to probabilistic deep learning. In Conference on Uncertainty in Artificial Intelligence (UAI).
  • [9] H. Poon and P. Domingos (2011) Sum-product networks: a new deep architecture. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pp. 689–690.
  • [10] M. Qu and J. Tang (2019) Probabilistic logic neural networks for reasoning. arXiv preprint arXiv:1906.08495.
  • [11] J. Schumann, K. Y. Rozier, T. Reinbacher, O. J. Mengshoel, T. Mbaya, and C. Ippolito (2015) Towards real-time, on-board, hardware-supported sensor and software health management for unmanned aerial systems. International Journal of Prognostics and Health Management.
  • [12] N. Shah, L. I. G. Olascoaga, W. Meert, and M. Verhelst (2019) ProbLP: a framework for low-precision probabilistic inference. In Proceedings of the 56th Annual Design Automation Conference, p. 190.
  • [13] L. Sommer, J. Oppermann, A. Molina, C. Binnig, K. Kersting, and A. Koch (2018) Automatic mapping of the sum-product network inference problem to FPGA-based accelerators. In IEEE 36th International Conference on Computer Design (ICCD), pp. 350–357.
  • [14] S. Zermani, C. Dezan, R. Euler, and J. Diguet (2015) Bayesian network-based framework for the design of reconfigurable health management monitors. In 2015 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), pp. 1–8.