Efficient construction of and sampling from alias tables on the GPU
An alias table is a data structure that allows for drawing weighted random samples in constant time and can be constructed in linear time. The PSA algorithm by Hübschle-Schneider and Sanders constructs alias tables in parallel on the CPU. In this report, we transfer the PSA algorithm to the GPU. Our construction algorithm achieves a speedup of 17 on a consumer GPU over the PSA method on a 16-core high-end desktop CPU. For sampling, we achieve an up to 24 times higher throughput. Both operations also require several times less energy than on the CPU. Adaptations that help to achieve this include changing memory access patterns to perform coalesced access. Where this is not possible, we first copy data to the faster shared memory using coalesced access. We also enhance a generalization of binary search that enables searching for a range of items in parallel. Besides naive sampling, we also give improved batched sampling algorithms.
Weighted random sampling is the process of drawing items from a set S = {1, …, n}, where each item i has a specific weight w_i. Denoting the total weight with W = w_1 + … + w_n, each item is drawn with probability p_i = w_i / W. In this report, we consider sampling with replacement, so the same item can be sampled multiple times. GPUs are becoming more important for high performance computing because of their fast memory and high degree of parallelism. Therefore, there is a need for an efficient method to construct data structures for drawing weighted random samples on GPUs. A data structure that allows for efficiently sampling from a weighted random distribution in O(1) time is the alias table, introduced by Walker [walker1977efficient].
Alias tables have many applications, ranging from sampling word distributions in machine learning [li2017saberlda] to interactive noise function generation [galerne2012gabor].
A GPU is highly symmetrical, consisting of multiple streaming multiprocessors (SMs). Each SM simultaneously executes multiple threads. The smallest level of parallelism, 32 threads, is called a warp. All threads in a warp share their instruction pointer and inactive threads are masked out [nvidia2018turingArchitecture]. Functions that are executed on the GPU are called kernels. A kernel is executed on a grid of blocks, each of which consists of a grid of threads that are scheduled to the same SM. Threads from the same block can synchronize and share memory, while threads from different blocks cannot cooperate directly [nvidia2008gettingstarted].
The GPU has a large global memory (also called device memory). Additionally, each block can allocate shared memory that is located directly on the SM and can be accessed much faster. Whenever the threads of a warp access global memory, the number of 32-byte transactions needed to fulfill the requests is minimized (coalescing) [nvidia2020bestpractices]. To leverage this performance improvement, special memory access patterns like interleaved addressing need to be used. Moreover, the GPU’s memory addresses are distributed over multiple physical memory modules called banks. The banks can perform transactions in parallel but when multiple threads access the same bank in different rows, the operations need to be serialized [nguyen2007gpu].
An alias table [walker1977efficient] has n rows, where n is the number of items in the input set. Each row represents a bucket with an equal share W/n of the total weight. It has two columns, namely a weight w_i and an alias a_i. To sample, we draw a uniform random number x ∈ [0, 1) and multiply it by n. The integer part i selects a row from the table. The fractional part y is used to choose between item i and its alias a_i by checking if y · W/n < w_i. Thus, alias tables allow for sampling an item in time O(1). It is possible to construct an alias table for every discrete distribution.
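The lookup can be sketched on the CPU as follows. The (weight, alias) row layout and the function interface are our own illustration, not the paper's code; we assume each row's weight stores the share of the bucket owned by the row's item:

```python
import random

def sample(table, total_weight):
    """Draw one weighted sample from an alias table.

    `table` is a list of (weight, alias) rows.  Row i owns a bucket of
    size total_weight / n, of which `weight` belongs to item i and the
    remainder to `alias`.
    """
    n = len(table)
    x = random.random() * n      # one uniform deviate drives both choices
    i = int(x)                   # integer part selects the table row
    y = x - i                    # fractional part: position inside the bucket
    weight, alias = table[i]
    return i if y * (total_weight / n) < weight else alias
```

Note that a single uniform deviate selects both the row and the item within the row, so only one random number is needed per sample.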
The idea of alias table construction is that heavy items that are more likely to be sampled than a table row (w_i > W/n) give excess weight to the buckets of one or more light items (w_i ≤ W/n). This procedure is illustrated in Figure 1. Vose [vose1991linear] describes an alias table construction algorithm that explicitly maintains lists l and h of light and heavy items. While there are items available, the algorithm takes a heavy item j. It then distributes the excess weight of that item by taking light items and filling their buckets. When a heavy item's weight drops below W/n, it is moved to the list of light items.
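Vose's sequential construction can be sketched in Python as follows. This is a minimal adaptation; variable names and the returned (weight, alias) row layout are our own choices, not the paper's exact data layout:

```python
def build_alias_table(weights):
    """Sequential alias table construction in the style of Vose.

    Returns (table, bucket) where table[i] = (weight, alias): row i owns
    a bucket of size `bucket`; `weight` of it belongs to item i, the
    rest to `alias`.  A minimal sketch, not the paper's implementation.
    """
    n = len(weights)
    bucket = sum(weights) / n
    w = list(weights)                        # remaining weight per item
    table = [(bucket, i) for i in range(n)]  # default: row fully owned by i
    light = [i for i in range(n) if w[i] <= bucket]
    heavy = [i for i in range(n) if w[i] > bucket]
    while light and heavy:
        l = light.pop()
        h = heavy[-1]
        table[l] = (w[l], h)     # fill l's bucket: w[l] for l, excess from h
        w[h] -= bucket - w[l]    # h handed over the rest of the bucket
        if w[h] <= bucket:             # h's weight dropped below a bucket,
            light.append(heavy.pop())  # so it becomes a light item
    return table, bucket
```

Each iteration finishes one bucket, so the construction runs in linear time.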
Hübschle-Schneider and Sanders' [hubschle2019parallel] PSA method uses a two-step approach to parallel alias table construction. During the first step, splitting, the algorithm precomputes the state of Vose's construction at k positions. These splits define sections that can later be worked on independently in parallel. The algorithm selects a number of light and heavy items such that each section contains n/k items and the weights are balanced. Valid split positions are found by executing a binary search on the prefix sums of the light and heavy items' weights. Because the weight usually does not exactly fit into a section, the algorithm stores the remaining weight of the last heavy item as spill to the next section. The result of the splitting step is a list of section boundaries and their respective spill values. The second step, packing, then constructs the actual alias table. In parallel, each processor iterates over the items of one of the sections and distributes weight from buckets of heavy items to buckets of light items. The PSA+ method [hubschle2019parallel] is a semi-greedy variant that, instead of calculating prefix sums and splits for all items, first builds the alias table in fixed-size sections until each section runs out of light or heavy items. PSA+ then only performs the full PSA construction with the remaining items.
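The binary search at the heart of the splitting step can be sketched as follows. This is our own illustration, under the assumption that a split picks a light and b heavy items with a + b = t whose combined weight stays just below the section's weight target, the remainder becoming spill; the helper `find_split` and its interface are hypothetical, not the paper's code:

```python
def find_split(pre_light, pre_heavy, t, target):
    """Find (a, b) with a + b == t such that the first a light and the
    first b heavy items weigh at most `target` (the rest becomes spill).

    pre_light / pre_heavy are prefix sums of the light / heavy item
    weights with pre[0] == 0.  Swapping a heavy item for a light one can
    only decrease the total, so pre_light[a] + pre_heavy[t - a] is
    non-increasing in a and we can binary search for the smallest
    feasible a.
    """
    lo = max(0, t - (len(pre_heavy) - 1))  # need t - a <= number of heavy items
    hi = min(t, len(pre_light) - 1)        # need a <= number of light items
    while lo < hi:
        mid = (lo + hi) // 2
        if pre_light[mid] + pre_heavy[t - mid] > target:
            lo = mid + 1                   # too heavy: take more light items
        else:
            hi = mid
    return lo, t - lo
```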
Mohanty et al. [mohanty2012efficient] implement only the alias table sampling step on the GPU and use it for Monte Carlo simulations. Binder and Keller [binder2019massively] introduce a monotonic sampling algorithm for GPUs that is not based on alias tables and samples efficiently in the average case but has a higher worst-case running time. A sampling algorithm is called monotonic if a larger random number also generates a larger sample. This can be used to preserve the low discrepancy of quasi-random number generators. In this report, we do not consider the additional requirement of monotonicity and are instead interested in improving throughput.
Our new method is based on PSA [hubschle2019parallel]. We introduce our construction algorithm by explaining the splitting and packing steps individually.
We denote the number of splits with k. As a baseline, we transfer the original split algorithm of PSA [hubschle2019parallel] directly to the GPU. Because it is based on binary search, the first iterations of all threads take the same branches and therefore read the same memory locations. This allows for coalescing but does not utilize the parallelism of the memory banks. We then introduce a new search operation that we call partial k-ary search, which makes use of both architectural properties. While we present it only in the context of alias table construction, it can be used in other contexts, too.
For finding an item in a sorted list, Kaldewey et al. [kaldewey2009parallel] evaluate k-ary search on GPUs. In contrast to binary search, k-ary search reduces the search range to 1/k of its size in each iteration by looking at k equally spaced pivots in parallel. The threads synchronize after each memory access and limit the search range to one of the resulting sections. With plain k-ary search, all threads cooperate to search for one single item. Our new partial k-ary search algorithm can be used to search for one item per thread. It makes use of the fact that the threads of a block often search for items that are close together in memory. The algorithm, to our knowledge, has not previously been described in the literature. It works in two phases. In the first phase, it executes k-ary search for all items of the block at once. In each iteration, instead of continuing the search on one section, partial k-ary search reduces the search range to the range between the smallest and largest section that contain at least one of the searched items. This can be achieved by only comparing with the smallest and largest item of the block. This is repeated until the search range can no longer be reduced. In the second phase, each thread looks for its own item using ordinary binary search, initialized with the range determined using k-ary search. We call the method partial k-ary search because only the first iterations are executed in k-ary fashion before falling back to standard binary search. Algorithm 1 illustrates the idea.
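The two phases can be modeled sequentially in Python; on the GPU, the k pivot probes of the first phase are carried out by the threads of a block in parallel. The function name and interface are ours, not the paper's:

```python
import bisect

def partial_kary_search(data, keys, k=8):
    """Return, for every key, the first index in sorted `data` with
    data[index] >= key.  Phase 1 narrows one shared range [lo, hi) for
    ALL keys at once using k pivots; phase 2 finishes each key with an
    ordinary binary search inside that range.
    """
    lo, hi = 0, len(data)
    kmin, kmax = min(keys), max(keys)
    while hi - lo > k:
        pivots = [lo + (hi - lo) * i // k for i in range(k + 1)]
        pvals = [data[p] for p in pivots[:-1]]
        # smallest section containing kmin, largest section containing kmax
        i_min = max(bisect.bisect_right(pvals, kmin) - 1, 0)
        i_max = max(bisect.bisect_right(pvals, kmax) - 1, 0)
        new_lo, new_hi = pivots[i_min], pivots[i_max + 1]
        if (new_lo, new_hi) == (lo, hi):
            break                       # range can no longer be reduced
        lo, hi = new_lo, new_hi
    # Phase 2: each "thread" finishes its own key with binary search.
    return [bisect.bisect_left(data, key, lo, hi) for key in keys]
```

The closer together the keys are, the further phase one can shrink the shared range before the per-key binary searches start.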
For each of the k splits, one thread searches for the number of heavy items to include. To make use of interleaved addressing, an inverse split algorithm would instead start one thread for each item and check them all in parallel. This inverse method, however, is 60 times slower than the baseline.
The pack step is similar to sequential alias table construction but starts at a specific position that is determined by the split. As a baseline, we transfer the original pack algorithm of PSA [hubschle2019parallel] to the GPU. We now explain multiple ideas that incrementally improve its performance.
The baseline pack method accesses the l and h arrays in a way that cannot be coalesced. For the shared memory method, we first copy the array sections that each block will later access to the shared memory in an efficient interleaved fashion. Because shared memory is much faster, the rather inefficient memory access pattern of the pack operation is no longer a problem.
In the baseline method, the l and h arrays only store the index of the light and heavy items. The pack method reads items from the arrays and then loads the corresponding weight from the input array. Using the shared memory method above, access to the l and h arrays is cheap but access to the weights array is still expensive and not properly coalesced. Instead of only storing the item index in l and h, we now also store the weight of the items. Because we do this during partitioning into the l and h arrays, no additional passes over the data are required.
With the shared memory pack method, we assume that the light and heavy items of each section fit into the shared memory. In order to make each section small enough, we need to compute a large number of splits. The idea of the chunked pack method is to generate larger sections and therefore reduce the number of splits required. This is achieved by loading chunks of the l and h arrays to shared memory as needed. During packing, whenever all threads of a block have no light or no heavy items left, the threads cooperate to load a new chunk of data from the global l and h arrays in an interleaved way (see Algorithm 2).
Because the l and h arrays are sorted by item index, write operations to the alias table cannot be coalesced. Writing to the shared memory first and copying the table afterwards is not feasible because split sections can write to overlapping memory locations in the output. (Without loss of generality, the last processed light item of a thread can have a significantly lower index in the input array than the last processed heavy item. The next thread can then process a light item with an index smaller than the index of the current thread's last heavy item.) Reordering the l and h arrays before executing the split kernel is up to 2.2 times slower than the baseline method. The CPU implementation [hubschle2019parallel] initializes the alias table with the weights instead of accessing the weights array in the pack step; on the GPU, this method is roughly 15 % slower than the baseline. The CPU implementation also iterates over the input items to find the next heavy item instead of using the l and h arrays; on the GPU, this method is more than 3.7 times slower than the baseline. The pack method accesses the weights array indirectly using weight[l[i]], but precomputing those values directly to an array is roughly 10 % slower than the baseline method.
Hübschle-Schneider and Sanders' implementation [hubschle2019parallel] executes greedy packing before partitioning into the l and h arrays. For that, it uses the sweeping pack method [hubschle2019parallel], which is not efficient on GPUs. The idea of our PSA+ implementation is to perform greedy packing during partitioning, when the arrays are already available in the fast shared memory. We then copy only those items back to the global l and h arrays that are not yet handled. With this method, we reduce both the time of the prefix sum and the memory reads and writes to the l and h arrays. Our PSA+ implementation does not perform any access to global memory that would not also have been done with PSA.
We now consider algorithms for efficiently sampling from alias tables on the GPU. The baseline sampling method directly follows the algorithm of Walker [walker1977efficient]: it first chooses a random table row and then either outputs the item or its alias. The throughput scales with the number of samples drawn because table rows that are accessed a second time might already be cached. We now present batched methods that make explicit use of the cache.
To increase the number of cache hits, we use a similar idea as in Algorithm R [sanders2018efficient]. For uniform sampling, Algorithm R splits the items to be sampled into two sections recursively. The number of samples to be drawn from each section is decided using a binomial deviate. Each thread then only draws samples from one section and therefore accesses more local memory areas. Our new cached sectioned sampling algorithm uses the same idea to split the alias table into one section per block. The threads in the block then draw their samples only from that section, relying on the cache to improve sampling throughput. Splitting an alias table is easier than splitting the items themselves because each table row is sampled with the same probability. As in Algorithm R, it is possible to determine the sections without communication by using a pseudorandom number generator. The size of the sections serves as a tuning parameter between the number of sections to calculate and the cache hit probability. In our setting, the binomial deviates can be approximated [martinez2013accelerating], which is computationally much easier to evaluate.
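The recursive section assignment borrowed from Algorithm R can be sketched as follows. For clarity, this sequential illustration draws the binomial deviate naively by counting coin flips, where a real implementation would use a fast binomial (or approximating) generator:

```python
import random

def samples_per_section(total, sections, rng):
    """Split `total` samples over `sections` equally likely sections.

    Each alias table section carries the same probability mass, so the
    number of samples landing in the left half follows a binomial
    distribution.  Recursing yields one count per section; with a
    pseudorandom generator, every block can reproduce the deviates on
    its own path without communication.
    """
    if sections == 1:
        return [total]
    left = sections // 2
    p = left / sections                       # mass of the left half
    hits = sum(1 for _ in range(total) if rng.random() < p)
    return (samples_per_section(hits, left, rng)
            + samples_per_section(total - hits, sections - left, rng))
```

With the counts fixed, each block draws exactly its assigned number of samples from its own section of the table.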
Even if a whole section theoretically fits into the cache, the cached sectioned sampling method achieves only a small increase in throughput. This is due to multiple blocks being scheduled to each SM, which evict each other's cache entries. Our new cached limited sectioned method allocates (but does not use) enough shared memory that only a single block can be executed on each SM. Like the cached sectioned method, it allows for using the section size as a tuning parameter.
Our shared memory sampling algorithm explicitly copies each block’s section to the fast shared memory in an interleaved fashion and then samples from there. The section size is limited by the size of the shared memory, so it cannot be used as a tuning parameter.
For comparing our methods among each other and with the CPU implementation [hubschle2019parallel], we use both consumer devices and powerful servers, as listed in Table 1. For speedups, we compare the RTX 2080 and the high-end desktop CPU because they are in a similar price range. Because the behavior is similar on all tested GPUs, we only plot measurements from the RTX 2080. We use uniform random weights and a shuffled power law distribution (power law weights in random order).
Desktop       AMD Ryzen 3950X (16 cores, 32 threads), Ubuntu 20.04
AMD server    AMD EPYC 7551P (32 cores, 64 threads), Ubuntu 20.04
Intel server  4x Intel Xeon Gold 6138 (4 × 20 cores, 160 threads), Ubuntu 20.04
GTX 1650S     Nvidia GeForce GTX 1650 Super GPU, CUDA 11.1; host: Intel Xeon 1230 V2 (4 cores), Arch Linux (2021-06-11)
RTX 2080      Nvidia GeForce RTX 2080 GPU, CUDA 11.1; host: Intel Core i5-750 (4 cores), Ubuntu 16.04
Tesla V100    Nvidia Tesla V100 data center GPU, CUDA 11.0; host: Intel Xeon Gold 6230 (using 4 cores), Red Hat Enterprise 7.7
By performing only index and pointer arithmetic in conditional branches and accessing the memory afterwards, we achieve a speedup of up to 2. In the pack method, we use casts to int4 to help the compiler generate a single 128-bit operation instead of two 64-bit operations for accesses to the alias table rows. This reduces the number of memory transfers by 50 % and makes the pack operation nearly 50 % faster.
A comparison of our split methods is plotted in Figure 2. Depending on the input distribution and the number of splits k, the partial k-ary split method is up to 1.5 times faster than the baseline method. Figure 3 shows how the techniques of Section 4.2 achieve a speedup of 3.7 over the baseline. Because the pack method has an influence on the number of splits to calculate, the figure shows the full construction time including splitting. When storing weights in l and h, the pack step gets 2 times faster, while the split and partition steps get 2 times slower because of the increased size of the array elements. In total, this results in a speed improvement because the pack step takes the most time overall. The pack step of the chunked method is slower than that of the shared memory method because its memory access cannot be coalesced as well, but it speeds up the splitting step significantly. For large inputs, the chunked method is slightly faster than the shared memory method.
When the items have uniform random weights, PSA+ on GPUs can greedily handle around 90 % of the items. A reason why Hübschle-Schneider and Sanders' [hubschle2019parallel] algorithm can pack a higher fraction of the items greedily is that our section size is limited by the shared memory and is therefore rather small. We only attempt greedy packing in promising situations by introducing a threshold for the minimum number of light and heavy items in each section. Using uniform random weights, PSA+ achieves a speedup of 1.5 over PSA, and using a shuffled power law distribution, it achieves a speedup of 1.4. While PSA+ can be slower for some weight distributions, it achieves significant speedups for these important distributions.
Our GPU-based chunked method achieves a speedup of 17 on the RTX 2080 over Ref. [hubschle2019parallel] on a desktop CPU, as listed in Table 2. Our method is faster even when including the time to transfer the input weights to the GPU. In fact, our construction takes less time than transferring a finished alias table to the GPU would.
Desktop CPU   69.2 ms   743.2 ms
AMD server    21.3 ms   151.5 ms
Intel server  18.2 ms    83.1 ms
GTX 1650S      7.6 ms    – (not enough memory for temporary data structures during construction)
RTX 2080       4.0 ms    32.8 ms
Tesla V100     2.5 ms    23.9 ms
Figure 4 shows a comparison of the baseline sampling method and the three sectioned methods. The baseline method does not need preprocessing and is therefore fastest for small numbers of samples. The sectioned methods have significant startup overhead for determining the sections or copying data but if the number of samples drawn is increased, the investment pays off. Figure 5 shows the best method for varying table size and number of samples. While the shared memory sectioned method can achieve higher peak throughputs, the cached limited sectioned method is more generic and achieves a good throughput in more cases.
Table 3 compares the throughput of our cached limited sectioned method with the CPU implementation of Ref. [hubschle2019parallel] when using shuffled power law distributed weights. Our GPU method achieves up to 24 times more throughput on the RTX 2080 than Ref. [hubschle2019parallel] on the desktop CPU. Even for large tables, we can outperform the expensive Intel server using consumer hardware.
Tesla V100   106.71   26.93   5.62   –   (the first throughput value is constrained only by the 64-bit floating point unit, which is significantly faster on the Tesla V100 than on the other cards)
Desktop CPU   98 J/ items    376 J/GSample
AMD server    25 J/ items    181 J/GSample
Intel server  45 J/ items    242 J/GSample
GTX 1650S      7 J/ items (extrapolated; not enough memory for temporary data structures during construction)    92 J/GSample
RTX 2080       7 J/ items     69 J/GSample
Tesla V100     5 J/ items     28 J/GSample
Because of their different architectures, comparing only running time between GPUs and CPUs can be unfair. A good sanity check is to compare energy consumption, which is independent of current market prices and covers a major cost factor of computing. To compensate for different hardware setups, we calculate the CPU power usage as the difference between idle and loaded state using external measurements. For the GPUs, we directly use the values reported by the cards, adding 40 W to account for the CPUs that manage the cards (based on external measurements with the RTX 2080). Table 4 lists the energy measurements of construction and sampling.
In this report, we have presented new algorithms that make construction of and sampling from alias tables efficient on GPUs. We achieve a speedup of 17 over the CPU implementation of Hübschle-Schneider and Sanders [hubschle2019parallel] while simultaneously being more energy-efficient. We introduce a new search algorithm, partial k-ary search, that enables fast splitting. Our pack method with chunked loading to the shared memory adapts the memory access pattern to be more efficient on GPUs. Our cached limited sectioned sampling algorithm is up to 24 times faster than the CPU implementation. This is achieved by dividing the alias table into sections that can then be sampled in a more cache-efficient way. In the future, we plan to evaluate our methods in real-world applications such as graph generation and to evaluate partial k-ary search on its own.
The authors acknowledge support by the state of Baden-Württemberg through bwHPC. We also thank Emanuel Schrade for co-supervising the thesis that this report is based on.