Weighted Random Sampling on GPUs

06/23/2021
by Hans-Peter Lehmann, et al. (Karlsruhe Institute of Technology)

An alias table is a data structure that allows for efficiently drawing weighted random samples in constant time and can be constructed in linear time. The PSA algorithm by Hübschle-Schneider and Sanders constructs alias tables in parallel on the CPU. In this report, we transfer the PSA algorithm to the GPU. Our construction algorithm achieves a speedup of 17 on a consumer GPU over the PSA method on a 16-core high-end desktop CPU. For sampling, we achieve an up to 24 times higher throughput. Both operations also require several times less energy than on the CPU. The adaptations that make this possible include changing memory access patterns to perform coalesced access. Where that is not possible, we first copy data to the faster shared memory using coalesced access. We also enhance a generalization of binary search that enables threads to search for a range of items in parallel. Besides naive sampling, we give improved batched sampling algorithms.

1 Introduction

Weighted random sampling is the process of drawing items from a set $S = \{s_1, \dots, s_N\}$, where each item $s_i$ has a specific weight $w_i$. Denoting the total weight with $W = \sum_{i=1}^{N} w_i$, each item is drawn with probability $w_i/W$. In this report, we consider sampling with replacement, so the same item can be sampled multiple times. GPUs are becoming more important for high performance computing because of their fast memory and high degree of parallelism. Therefore, there is a need for an efficient method to construct data structures for drawing weighted random samples on GPUs. A data structure that allows for efficiently sampling from a weighted random distribution in $\mathcal{O}(1)$ time is the alias table, introduced by Walker [walker1977efficient].
Weighted random sampling has numerous applications, for example sampling recursion layers when generating R-MAT graphs [hubschle2019linear], sampling particle source positions in medical simulations [wilderman2007method], sampling ray directions in photorealistic rendering [burke2004bidirectional], and sampling word distributions in machine learning [li2017saberlda]. Alias tables can also be used for interactive noise function generation [galerne2012gabor].
This report is based on and has text overlaps with the master’s thesis of the first author [lehmann2021alias]. The source code of the implementation is available on GitHub [sourceGitHub].

2 Preliminaries

2.1 GPUs

Basic Architecture.

A GPU is highly symmetrical, consisting of multiple streaming multiprocessors (SMs). Each SM simultaneously executes multiple threads. The smallest level of parallelism, 32 threads, is called a warp. All threads in a warp share their instruction pointer and inactive threads are masked out [nvidia2018turingArchitecture]. Functions that are executed on the GPU are called kernels. A kernel is executed on a grid of blocks, each of which consists of a grid of threads that are scheduled to the same SM. Threads from the same block can synchronize and share memory, while threads from different blocks cannot cooperate directly [nvidia2008gettingstarted].

GPU Memory.

The GPU has a large global memory (also called device memory). Additionally, each block can allocate shared memory that is located directly on the SM and can be accessed much faster. Whenever the threads of a warp access global memory, the number of 32-byte transactions needed to fulfill the requests is minimized (coalescing) [nvidia2020bestpractices]. To leverage this performance improvement, special memory access patterns like interleaved addressing need to be used. Moreover, the GPU’s memory addresses are distributed over multiple physical memory modules called banks. The banks can perform transactions in parallel but when multiple threads access the same bank in different rows, the operations need to be serialized [nguyen2007gpu].

2.2 Alias Tables

An alias table [walker1977efficient] has $N$ rows, where $N$ is the number of items in the input set. Each row represents a bucket holding an equal share $W/N$ of the total weight. It has two columns, namely a weight $w_i$ and an alias $a_i$. To sample, we draw a uniform random number $x \in [0, 1)$ and multiply it by $N$. The integer part $i := \lfloor xN \rfloor$ selects a row from the table. The fractional part $y := xN - i$ is used to choose between item $i$ and its alias $a_i$ by checking if $y \cdot W/N < w_i$. Thus, alias tables allow for sampling an item in time $\mathcal{O}(1)$. It is possible to construct an alias table for every discrete distribution.
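
To make the lookup concrete, the following CUDA sketch shows the sampling decision. The row layout and all names are our own illustration, not necessarily the layout used in the implementation.

// Sketch of one alias table lookup. Row i keeps weight w_i for item i;
// the rest of the bucket (W/N - w_i) belongs to the alias a_i.
struct AliasTableRow {
    double weight;  // weight that item i keeps in its own bucket
    int    alias;   // item that receives the remainder of the bucket
};

// x is uniform in [0, 1); W is the total weight, N the number of rows.
__host__ __device__ int sample(const AliasTableRow *table, int N,
                               double W, double x) {
    double scaled = x * N;
    int row = (int) scaled;            // integer part: table row
    double frac = scaled - row;        // fractional part: item vs. alias
    return (frac * (W / N) < table[row].weight) ? row : table[row].alias;
}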

Figure 1: Illustration of alias table construction. (a) input weights; (b) constructed table. Buckets of items with weight smaller than the average are filled with excess weight of heavy items.

Sequential Construction.

The idea of alias table construction is that heavy items, which are more likely to be sampled than a table row holds ($w_i > W/N$), give excess weight to the buckets of one or more light items ($w_i \le W/N$). This procedure is illustrated in Figure 1. Vose [vose1991linear] describes an alias table construction algorithm that explicitly maintains lists l and h of light and heavy items. While both lists are non-empty, the algorithm takes a heavy item. It then distributes the excess weight of that item by taking light items and filling their buckets. When a heavy item's weight drops below $W/N$, it is moved to the list of light items.
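
As a reference point for the parallel variants below, here is a minimal host-side sketch of Vose's construction, reusing the AliasTableRow struct from the sketch above. Error handling and numerical robustness are omitted.

#include <stack>
#include <vector>

void buildAliasTable(const std::vector<double> &w,
                     std::vector<AliasTableRow> &table) {
    const int N = (int) w.size();
    double W = 0;
    for (double x : w) W += x;
    const double avg = W / N;            // weight per bucket
    std::vector<double> residual(w);     // weight not yet placed in a bucket
    std::stack<int> light, heavy;
    for (int i = 0; i < N; i++) (w[i] <= avg ? light : heavy).push(i);
    table.assign(N, {0.0, 0});
    while (!heavy.empty() && !light.empty()) {
        int h = heavy.top();
        int l = light.top(); light.pop();
        table[l] = { residual[l], h };   // fill light bucket with excess of h
        residual[h] -= avg - residual[l];
        if (residual[h] <= avg) {        // h became light
            heavy.pop();
            light.push(h);
        }
    }
    // leftover items (weight == avg up to rounding) fill their own buckets
    while (!light.empty()) { int i = light.top(); light.pop(); table[i] = { avg, i }; }
    while (!heavy.empty()) { int i = heavy.top(); heavy.pop(); table[i] = { avg, i }; }
}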

Parallel Construction.

Hübschle-Schneider and Sanders’ [hubschle2019parallel] PSA method uses a two-step approach to parallel alias table construction. During the first step, splitting, the algorithm precomputes the state of Vose’s construction at $k$ positions. These splits define sections that can later be worked on independently in parallel. The algorithm selects a number of light and heavy items in such a way that the number of items in each section is $N/k$ and the weights are balanced. Valid split positions are found by executing a binary search on the prefix sums of light and heavy items. Because the weight usually does not exactly fit into a section, the algorithm stores the remaining weight of the last heavy item as spill for the next section. The result of the splitting step is a list of section boundaries and their respective spill values. The second step, packing, then constructs the actual alias table. In parallel, each processor iterates over the items of one of the sections and distributes weight from buckets of heavy items to buckets of light items. The PSA+ method [hubschle2019parallel] is a semi-greedy variant that, instead of calculating prefix sums and splits for all items, first builds the alias table in fixed-size sections until each section runs out of light or heavy items. PSA+ then only performs the PSA construction with the remaining items.

3 Related Work

Mohanty et al. [mohanty2012efficient] implement only the alias table sampling step on the GPU and use it for Monte Carlo simulations. Binder and Keller [binder2019massively] introduce a monotonic sampling algorithm for GPUs that is not based on alias tables and can sample in $\mathcal{O}(1)$ average case and $\mathcal{O}(\log N)$ worst case running time. A sampling algorithm is called monotonic if a larger random number also generates a larger sample. This can be used to preserve the low discrepancy of quasi-random number generators. In this report, we do not consider the additional requirement of monotonicity and are rather interested in improving throughput.

4 Construction

Our new method is based on PSA [hubschle2019parallel], so we introduce our construction algorithm by explaining the splitting and packing steps individually.

4.1 Split Method

Let us denote the number of splits with $k$. As a baseline, we transfer the original split algorithm of PSA [hubschle2019parallel] directly to the GPU. Being based on binary search, the first iterations of all threads take the same branches and therefore read the same memory locations. This allows for coalescing but does not utilize the parallelism of the memory banks. We therefore introduce a new search operation, which we call partial $p$-ary search, that makes use of both architectural properties. While we present it only in the context of alias table construction, it can be used in other contexts, too.

Partial $p$-ary Search.

For finding an item in a sorted list, Kaldewey et al. [kaldewey2009parallel] evaluate $p$-ary search on GPUs. In contrast to binary search, $p$-ary search reduces the search range to $1/p$ of its size in each iteration by looking at $p$ equally spaced pivots in parallel. The threads synchronize after each memory access and limit the search range to one of the sections. With plain $p$-ary search, all threads cooperate to search for one single item. Our new partial $p$-ary search algorithm can be used to search for one item per thread. It makes use of the fact that the threads of a block often search for items that are close together in memory. To our knowledge, the algorithm has not previously been described in the literature. It works in two phases. In the first phase, it executes $p$-ary search for all items of the block at once. In each iteration, instead of continuing the search on one section, partial $p$-ary search reduces the search range to the range between the smallest and largest section that contain at least one of the searched items. This can be achieved by comparing the pivots only with the smallest and largest item sought by the block. This is repeated until the search range can no longer be reduced. In the second phase, each thread looks for its own item using ordinary binary search, initialized with the range determined by the $p$-ary phase. We call the method partial $p$-ary search because only the first iterations are executed in $p$-ary fashion before falling back to standard binary search. Listing 1 illustrates the idea.

function binarySearch(v_0, …, v_{n-1}: ordered list to search in,
            x: item to search, (a, b): initial search range)
    while a < b do
        m ← ⌊(a + b) / 2⌋
        if v_m < x then a ← m + 1 else b ← m
    return a
function partialP-arySearch(v_0, …, v_{n-1}: ordered list to search in,
            x_0, …, x_{p-1}: ordered items to search, i: thread index)
    (a, b) ← (0, n)
    q_0, …, q_{p-1}: pivots of all threads (shared)
    s_0, …, s_{p-1}: state of all threads (shared)
    while true do
        q_i ← v_{a + (i+1)·(b−a)/(p+1)}
        if q_i < x_0 then
            s_i ← smaller
        else if q_i > x_{p-1} then
            s_i ← larger
        else
            s_i ← within
        a′ ← a + (j+1)·(b−a)/(p+1), where j is the largest index with s_j = smaller (a′ ← a if none)
        b′ ← a + (j+1)·(b−a)/(p+1), where j is the smallest index with s_j = larger (b′ ← b if none)
        if (a′, b′) is no smaller than (a, b) then break
        (a, b) ← (a′, b′)
    return binarySearch(v, x_i, (a, b))
Listing 1: Partial $p$-ary search
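
The following CUDA sketch shows one possible realization of Listing 1 for a single block of P threads, where each thread searches for its own key in a sorted array. The fixed block size and all names are our own assumptions.

#define P 32  // threads per block = pivots probed per iteration

__device__ int binarySearch(const float *v, int a, int b, float x) {
    while (a < b) {
        int m = (a + b) / 2;
        if (v[m] < x) a = m + 1; else b = m;
    }
    return a;
}

// v: sorted array of length n; keys: sorted, one per thread of the block.
__device__ int partialParySearch(const float *v, int n, const float *keys) {
    __shared__ int lo, hi;
    __shared__ bool smaller[P], larger[P];
    const int t = threadIdx.x;
    if (t == 0) { lo = 0; hi = n; }
    __syncthreads();
    const float kMin = keys[0], kMax = keys[P - 1];
    while (true) {
        const int a = lo, b = hi;          // consistent snapshot of the range
        if (b - a <= P) break;
        // phase 1: every thread probes one equally spaced pivot in [a, b)
        int pos = a + (int) ((long long)(t + 1) * (b - a) / (P + 1));
        smaller[t] = v[pos] < kMin;        // pivot lies below all searched keys
        larger[t]  = v[pos] > kMax;        // pivot lies above all searched keys
        __syncthreads();
        if (t == 0) {                      // shrink [lo, hi) around the keys
            for (int j = 0; j < P; j++) {
                int pj = a + (int) ((long long)(j + 1) * (b - a) / (P + 1));
                if (smaller[j]) lo = pj + 1;        // last pivot below the keys
                if (larger[j] && hi == b) hi = pj;  // first pivot above the keys
            }
        }
        __syncthreads();
        if (lo == a && hi == b) break;     // range can no longer be reduced
    }
    // phase 2: each thread finishes with an ordinary binary search
    return binarySearch(v, lo, hi, keys[t]);
}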

Uncompetitive Method.

For each of the $k$ sections, the split method searches for the number of heavy items to include. To make use of interleaved addressing, an inverse split algorithm would start one thread for each item and check all possible split positions in parallel. However, this method is 60 times slower than the baseline.

4.2 Pack Method

The pack step is similar to sequential alias table construction but starts at a specific position that is determined by the split. As a baseline, we transfer the original pack algorithm of PSA [hubschle2019parallel] to the GPU. We now explain multiple ideas that incrementally improve its performance.

l and h in Shared Memory.

The baseline pack method accesses the l and h arrays in a way that cannot be coalesced. For the shared memory method, we first copy the array sections that each block will later access to the shared memory in an efficient interleaved fashion. Because shared memory is much faster, the rather inefficient memory access pattern of the pack operation is no longer a problem.
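
The copy itself is straightforward: consecutive threads read consecutive global addresses, so each warp's reads coalesce into few transactions. A minimal sketch (names are ours):

__device__ void copyToShared(const float *globalSrc, float *sharedDst, int count) {
    for (int i = threadIdx.x; i < count; i += blockDim.x)
        sharedDst[i] = globalSrc[i];  // interleaved addressing, coalesced reads
    __syncthreads();                  // make the data visible to the whole block
}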

Weight in l and h Arrays.

In the baseline method, the l and h arrays only store the index of the light and heavy items. The pack method reads items from the arrays and then loads the corresponding weight from the input array. Using the shared memory method above, access to the l and h arrays is cheap but access to the weights array is still expensive and not properly coalesced. Instead of only storing the item index in l and h, we now also store the weight of the items. Because we do this during partitioning into the l and h arrays, no additional passes over the data are required.
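
A possible entry layout, assuming 4-byte indices and single-precision weights, is sketched below; one coalesced 8-byte load then replaces an index load followed by a scattered weight lookup.

struct LhEntry {
    int   item;    // index of the light or heavy item
    float weight;  // its weight, filled in during partitioning
};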

Chunked Loading.

With the shared memory pack method, we assume that the light and heavy items of each section fit into the shared memory. In order to make each section small enough, we need to compute a large number of splits. The idea of the chunked pack method is to generate larger sections and therefore reduce the number of splits required. This can be achieved by efficiently loading chunks of the l and h arrays to shared memory as needed. During packing, whenever all threads of a block have no light or no heavy items left, the threads cooperate to load a new chunk of data from the global l and h arrays in an interleaved way (see Listing 2).

function chunkedPack()
    while not all threads of the block are finished do
        copyChunks()
        if current thread is not finished then
            packUntilChunkEnd()
function copyChunks()
    foreach worker thread t do
        if t already handled most of its light items in shared memory then
            copy the next chunk of light items that t will access to shared memory
        if t already handled most of its heavy items in shared memory then
            copy the next chunk of heavy items that t will access to shared memory
function packUntilChunkEnd()
    s: state (current light item, heavy item, residual weight) as in the PSA method
    restore state s of the current thread
    while true do
        if the light or heavy chunk in shared memory ran out of items then
            store state s
            return
        if all items of the thread's section are packed then
            mark current thread as finished
            return
        // normal packing iteration, see PSA [hubschle2019parallel]:
        if the residual weight fits a light bucket then
            fill the bucket of the next light item
        else
            fill a bucket from the excess weight of the current heavy item
Listing 2: Chunked pack method

Uncompetitive Methods.

Because the l and h arrays are sorted by item index, write operations to the alias table cannot be coalesced. Writing to shared memory first and copying the table afterwards is not feasible because split sections can write to overlapping memory locations in the output. (Without loss of generality, the last processed light item of a thread can have a significantly lower index in the input array than the last processed heavy item. The next thread can then process a light item with an index smaller than the index of the current thread's last heavy item.) Reordering the l and h arrays before executing the split kernel is up to 2.2 times slower than the baseline method. The CPU implementation [hubschle2019parallel] initializes the alias table with the weights instead of accessing the weights array directly in the pack step; on the GPU, this method is roughly 15 % slower than the baseline. The CPU implementation also iterates over the input items to find the next heavy item instead of using the l and h arrays; on the GPU, this is more than 3.7 times slower than the baseline. Finally, the pack method accesses the weights array indirectly using weight[l[i]], but precomputing those values into a dedicated array is roughly 10 % slower than the baseline method.

4.3 PSA+

Hübschle-Schneider and Sanders’ implementation [hubschle2019parallel] executes greedy packing before partitioning into the l and h arrays. For that, it uses the sweeping pack method [hubschle2019parallel], which is not efficient on GPUs. The idea of our PSA+ implementation is to perform greedy packing while partitioning, when the arrays are already available in the fast shared memory. We then copy only those items back to the global l and h arrays that are not yet handled. With this method, we reduce both the time for the prefix sum and the number of memory reads and writes to the l and h arrays. Our PSA+ implementation does not perform any additional accesses to global memory that PSA would not also have performed.

5 Sampling

We now consider algorithms for efficiently sampling from alias tables on the GPU. The baseline sampling method directly follows the algorithm of Walker [walker1977efficient], which first chooses a random table row and then either outputs the item or its alias. The throughput scales with the number of samples drawn because table rows that are accessed a second time might already be cached. We now present batched methods that make explicit use of the cache.
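
A baseline kernel might look as follows, reusing the sample() function from the sketch in Section 2.2. The use of cuRAND, the grid-stride indexing, and all parameter names are our own choices.

#include <curand_kernel.h>

__global__ void sampleKernel(const AliasTableRow *table, int N, double W,
                             int samplesPerThread, unsigned long long seed,
                             int *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState state;
    curand_init(seed, tid, 0, &state);
    for (int s = 0; s < samplesPerThread; s++) {
        // curand_uniform_double returns (0, 1]; shift it to [0, 1)
        double x = 1.0 - curand_uniform_double(&state);
        out[tid * samplesPerThread + s] = sample(table, N, W, x);
    }
}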

Cached Sectioned Sampling.

To increase the number of cache hits, we use an idea similar to Algorithm R [sanders2018efficient]. For uniform sampling, Algorithm R splits the items to be sampled into two sections recursively. The number of samples to be drawn from each section is decided using a binomial deviate. Each thread then only draws samples from one section and therefore accesses more local memory areas. Our new cached sectioned sampling algorithm uses the same idea to split the alias table into one section per block. The threads in the block then draw their samples only from that section, relying on the cache to improve sampling throughput. Splitting an alias table is easier than splitting the items themselves because each table row is sampled with the same probability. As in Algorithm R, it is possible to determine the sections without communication by using a pseudorandom number generator. The size of the sections serves as a tuning parameter between the number of sections to calculate and the cache hit probability. In our setting, with a large number of samples per section, the normal distribution is a good approximation of the binomial distribution [martinez2013accelerating] and is computationally much easier to evaluate.
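
A host-side sketch of the recursive splitting, with the binomial deviate replaced by its normal approximation (our simplification; all names are ours):

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <random>
#include <vector>

// Distribute `samples` draws over `sections` equally likely table sections.
void assignSamples(int64_t samples, int sections, std::mt19937_64 &rng,
                   std::vector<int64_t> &out) {
    if (sections == 1) { out.push_back(samples); return; }
    int left = sections / 2;
    double p = (double) left / sections;   // every table row is equally likely
    double stddev = std::sqrt((double) samples * p * (1.0 - p));
    int64_t leftSamples = samples / 2;
    if (stddev > 0) {                      // normal approximation of Binomial(samples, p)
        std::normal_distribution<double> normal((double) samples * p, stddev);
        leftSamples = std::clamp<int64_t>((int64_t) std::llround(normal(rng)), 0, samples);
    }
    assignSamples(leftSamples, left, rng, out);
    assignSamples(samples - leftSamples, sections - left, rng, out);
}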

Cached Limited Sectioned Sampling.

Even if the whole section theoretically fits into the cache, the cached sectioned sampling method only achieves a small increase in throughput. This is due to multiple blocks being scheduled to each SM and therefore evicting each other's cache entries. Our new cached limited sectioned method allocates (but does not use) so much shared memory that only a single block can be executed on each SM. Like the cached sectioned method, it allows for using the section size as a tuning parameter.
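
The trick can be sketched as follows: the kernel requests (almost) the entire shared memory of an SM as dynamic shared memory without touching it, so no second block fits on the SM. The 48 KiB figure is an assumption; the usable limit depends on the GPU.

__global__ void cachedLimitedKernel(/* table, sections, output, ... */) {
    extern __shared__ char occupancyLimiter[];  // allocated, never accessed
    // ... sample from this block's section as in the cached sectioned method ...
}

// Launch with the dynamic shared memory size as third launch parameter:
// cachedLimitedKernel<<<numBlocks, threadsPerBlock, 48 * 1024>>>(...);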

Shared Memory Sectioned Sampling.

Our shared memory sampling algorithm explicitly copies each block’s section to the fast shared memory in an interleaved fashion and then samples from there. The section size is limited by the size of the shared memory, so it cannot be used as a tuning parameter.

6 Evaluation

For comparing our methods among each other and with the CPU implementation [hubschle2019parallel], we use both consumer devices and powerful servers, as listed in Table 1. For speedups, we compare the RTX 2080 and the high-end desktop CPU because they are in a similar price range. Because the behavior is similar on all tested GPUs, we only plot measurements from the RTX 2080. We use uniform random weights and a shuffled power law distribution (weights $w_i = i^{-s}$, assigned in random order).

Machine        Hardware specifications
Desktop        AMD Ryzen 3950X (16 cores, 32 threads), Ubuntu 20.04
AMD server     AMD EPYC 7551P (32 cores, 64 threads), Ubuntu 20.04
Intel server   4x Intel Xeon Gold 6138 (80 cores, 160 threads), Ubuntu 20.04
GTX 1650S      Nvidia GeForce GTX 1650 Super GPU, CUDA 11.1;
               host: Intel Xeon 1230 V2 (4 cores), Arch Linux (2021-06-11)
RTX 2080       Nvidia GeForce RTX 2080 GPU, CUDA 11.1;
               host: Intel Core i5-750 (4 cores), Ubuntu 16.04
Tesla V100     Nvidia Tesla V100 data center GPU, CUDA 11.0;
               host: Intel Xeon Gold 6230 (using 4 cores), Red Hat Enterprise 7.7

Table 1: Machines used for the evaluation.

Implementation details.

By performing only index and pointer arithmetic in conditional branches and accessing the memory afterwards, we achieve a speedup of up to 2. In the pack method, we use casts to int4 to help the compiler generate a single 128-bit operation instead of two 64-bit operations for accesses to the alias table rows. This reduces the number of memory transactions by 50 % and makes the pack operation nearly 50 % faster.
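
The int4 trick can be sketched as follows; the 16-byte row layout is our assumption. The __align__(16) qualifier is needed so that the 128-bit access is legal.

struct __align__(16) TableRow {  // 16 bytes: 8-byte weight plus 8-byte alias
    double    weight;
    long long alias;
};

__device__ void storeRow(TableRow *table, int i, TableRow row) {
    // one 128-bit store instead of two 64-bit stores
    reinterpret_cast<int4 *>(table)[i] = *reinterpret_cast<int4 *>(&row);
}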

6.1 Construction

Figure 2: Time needed for determining a single split using different split algorithms. Input items have uniform random weights.
Figure 3: Construction duration for a table with uniform random weights.

A comparison of our split methods is plotted in Figure 2. Independently of the number of splits $k$, the partial $p$-ary split method is up to 1.5 times faster than the baseline method, depending on the input distribution. Figure 3 shows how the techniques of Section 4.2 achieve a speedup of 3.7 over the baseline. Because the pack method influences the number of splits to calculate, the figure shows the full construction time including splitting. When storing weights in l and h, the pack step gets 2 times faster, while the split and partition steps get 2 times slower because of the increased size of the array elements. In total, this is a net improvement because the pack step dominates the overall time. The pack step of the chunked method is slower than that of the shared memory method because its memory access cannot be coalesced as well, but it speeds up the splitting step significantly. For large $N$, the chunked method is slightly faster than the shared memory method.

PSA+.

When the items have uniform random weights, PSA+ on GPUs can greedily handle around 90 % of the items. A reason why Hübschle-Schneider and Sanders' [hubschle2019parallel] algorithm can pack a higher fraction of the items greedily is that our section size is limited by the shared memory and is therefore rather small. We only attempt greedy packing in promising situations by introducing a threshold for the minimum number of light and heavy items in each section. Using uniform random weights, PSA+ achieves a speedup of 1.5 over PSA; using a shuffled power law distribution, it achieves a speedup of 1.4. While PSA+ can be slower for some weight distributions, it achieves significant speedups for these important distributions.

Comparison with the CPU method.

Our GPU-based chunked method achieves a speedup of 17 on the RTX 2080 over Ref. [hubschle2019parallel] on a desktop CPU, as listed in Table 2. Our method is faster even when including the time to transfer the input weights to the GPU. In fact, our construction is faster than only transferring a finished alias table to the GPU would be.

Machine        Construction time (smaller input)   (larger input)
Desktop CPU    69.2 ms                             743.2 ms
AMD server     21.3 ms                             151.5 ms
Intel server   18.2 ms                             83.1 ms
GTX 1650S      7.6 ms                              n/a*
RTX 2080       4.0 ms                              32.8 ms
Tesla V100     2.5 ms                              23.9 ms

* Not enough memory for temporary data structures during construction.

Table 2: Construction duration comparison with the CPU method [hubschle2019parallel], for two input sizes with shuffled power law distributed weights.

6.2 Sampling

Figure 4 shows a comparison of the baseline sampling method and the three sectioned methods. The baseline method does not need preprocessing and is therefore fastest for small numbers of samples. The sectioned methods have significant startup overhead for determining the sections or copying data but if the number of samples drawn is increased, the investment pays off. Figure 5 shows the best method for varying table size and number of samples. While the shared memory sectioned method can achieve higher peak throughputs, the cached limited sectioned method is more generic and achieves a good throughput in more cases.

Figure 4: Comparison between sampling methods depending on the input size and the number of samples drawn; panels (a) and (b) show two different table sizes. Input is a uniform random weight distribution. Note the logarithmic x-axes.
Figure 5: The method with the highest throughput, depending on table size and number of samples drawn. The input weights are drawn from a uniform random distribution.

Comparison with the CPU method.

Table 3 compares the throughput of our cached limited sectioned method with the CPU implementation of Ref. [hubschle2019parallel] using shuffled power law distributed weights. Our GPU method achieves up to 24 times more throughput on the RTX 2080 than Ref. [hubschle2019parallel] on the desktop CPU. Even for large $N$, we can outperform the expensive Intel server using consumer hardware.

Machine        GSamples/s (table sizes, small to large)
Desktop CPU    3.67      0.42     0.37    0.37
AMD server     1.36      0.92     0.92    0.89
Intel server   7.98      2.67     2.17    1.63
GTX 1650S      6.41      3.18     1.06
RTX 2080       13.43     10.14    2.44
Tesla V100     106.71*   26.93    5.62

* For the smallest table size, throughput is only constrained by the 64-bit floating point unit, which is significantly faster on the Tesla V100 than on the other cards.

Table 3: Sampling throughput comparison with the CPU method, drawing samples from tables of varying size. On the GPU, we use our fastest variant for each input size (cached limited sectioned for the smaller tables, baseline for the largest).

6.3 Power Consumption

Machine        Construction (per table)   Sampling
Desktop CPU    98 J                       376 J/GSample
AMD server     25 J                       181 J/GSample
Intel server   45 J                       242 J/GSample
GTX 1650S      7 J*                       92 J/GSample
RTX 2080       7 J                        69 J/GSample
Tesla V100     5 J                        28 J/GSample

* Not enough memory for temporary data structures during construction; extrapolation based on a measurement with a smaller input.

Table 4: Power usage for constructing an alias table with shuffled power law distributed weights and for drawing samples.

Because of their different architectures, comparing only running time between GPUs and CPUs can be unfair. A good sanity check is to compare energy consumption, which is independent of current market prices and covers a major cost factor of computing. To compensate for different hardware setups, we calculate the CPU power usage as the difference between idle and loaded state using external measurements. For the GPUs, we directly use the values reported by the cards, adding 40 W to account for the CPUs that manage the cards (based on external measurements with the RTX 2080). Table 4 lists the power usage measurements of construction and sampling.

7 Conclusions

In this report, we have presented new algorithms that make construction of and sampling from alias tables efficient on GPUs. We achieve a speedup of 17 over the CPU implementation of Hübschle-Schneider and Sanders [hubschle2019parallel], while simultaneously being more energy-efficient. We introduce a new search algorithm, partial $p$-ary search, that enables fast splitting. Our pack method with chunked loading to shared memory adapts the memory access pattern to be more efficient on GPUs. Our cached limited sectioned sampling algorithm is up to 24 times faster than the CPU implementation. This is achieved by dividing the alias table into sections, which can then be sampled in a more cache-efficient way. In the future, we plan to evaluate our methods in real-world applications such as graph generation and to evaluate partial $p$-ary search on its own.

8 Acknowledgments

The authors acknowledge support by the state of Baden-Württemberg through bwHPC. We also thank Emanuel Schrade for co-supervising the thesis that this report is based on.

References