Our goal is to make general-purpose GPU computing available to a wider range of developers through better support for object-oriented programming (OOP). We focus on applications in which parallelism is expressed by running the same method on a large set of objects of the same type, a pattern that we call Single-Method Multiple-Objects (SMMO). SMMO is the OOP counterpart of Single-Instruction Multiple-Data (SIMD) and has many applications in high-performance computing, e.g., agent-based simulations (Allan, 2010; Bandini et al., 2009) or physical simulations (Maureira-Fredes and Amaro-Seoane, 2018). One of the cornerstones of OOP is dynamic memory management, i.e., the flexibility to create and delete objects at any time. Dynamic memory management is challenging on GPUs (1) due to their massively data-parallel execution with a large number of simultaneous allocations and (2) because data access and layout must be optimized to achieve good memory bandwidth utilization. However, state-of-the-art GPU allocators focus on raw (de)allocation performance and miss key optimizations for the memory access of structured data (objects) in application code.
We propose SoaAlloc, a new dynamic object allocator for SMMO applications in CUDA. SoaAlloc can significantly speed up application code by allocating objects in a Structure of Arrays (SOA) memory layout (Kofler et al., 2015; Strzodka, 2012; Homann and Laenen, 2018) and by scheduling objects efficiently with a custom do-all operation that maximizes the benefit of SOA. Do-all runs a method on all objects of a type in parallel. SOA is a well-studied best practice for SIMD programs, yet it is not utilized by any other GPU memory allocator. Similarly, no other memory allocator provides a do-all operation. With both optimizations, SoaAlloc speeds up application code by more than 2x in our benchmarks.
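To illustrate why the SOA layout matters, the following host-side C++ sketch (the type `BodyAoS`/`BodiesSoA` and its fields are illustrative, not from SoaAlloc) contrasts the memory stride between the same field of consecutive objects in the two layouts; in SOA the values are adjacent, so a SIMD group reading one field of consecutive objects issues coalesced memory accesses:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kN = 1024;

// Array of Structures (the conventional OOP layout).
struct BodyAoS { float pos_x; float pos_y; float mass; };
BodyAoS bodies_aos[kN];

// Structure of Arrays (the layout SoaAlloc uses inside its blocks):
// all pos_x values are contiguous, as are all pos_y and mass values.
struct BodiesSoA {
    float pos_x[kN];
    float pos_y[kN];
    float mass[kN];
};
BodiesSoA bodies_soa;

// Byte distance between pos_x of object 0 and pos_x of object 1.
std::ptrdiff_t stride_aos() {
    return reinterpret_cast<char*>(&bodies_aos[1].pos_x) -
           reinterpret_cast<char*>(&bodies_aos[0].pos_x);
}
std::ptrdiff_t stride_soa() {
    return reinterpret_cast<char*>(&bodies_soa.pos_x[1]) -
           reinterpret_cast<char*>(&bodies_soa.pos_x[0]);
}
```

In the AoS layout the stride is the full object size, so 32 threads reading `pos_x` touch 32 scattered cache lines; in the SOA layout the stride is `sizeof(float)` and the same read is served by a few coalesced transactions.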
SoaAlloc divides the heap into blocks of equal byte size (Fig. 2). Objects are stored in blocks in SOA layout, i.e., all values of a field are stored together; every block can store only objects of the same type (class/struct). Within a block, allocations and free slots are tracked with an object allocation bitmap. Since types can have different sizes, blocks of different types may hold different numbers of objects.
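A minimal sketch of such a block, with hypothetical names and illustrative fields (the real block layout is in Fig. 2): each field's values are stored contiguously, the capacity is a compile-time parameter depending on the type's size, and a 64-bit allocation bitmap tracks which object slots are in use:

```cpp
#include <cassert>
#include <cstdint>

// One SoaAlloc-style block for a single type (sketch; kCapacity <= 64
// so that one 64-bit word suffices as the object allocation bitmap).
template <int kCapacity>
struct Block {
    uint64_t allocation_bitmap;   // bit i set = slot i holds an object
    float field_a[kCapacity];     // all field_a values stored together (SOA)
    int   field_b[kCapacity];     // all field_b values stored together (SOA)

    // Fill level of the block: number of set bits in the bitmap.
    int num_objects() const {
        return __builtin_popcountll(allocation_bitmap);
    }
};
```

Because the bitmap is a single machine word, fill-level queries and slot claims reduce to cheap bitwise operations (population count, find first set, atomic AND/OR).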
A block can be in one of three states: uninitialized, allocated for a certain type, or allocated and active (Fig. 2). Allocated but not entirely full blocks are active. New objects are always allocated in active blocks to reduce fragmentation. This is important because high fragmentation scatters objects across the heap and reduces the benefit of SOA. A new block is initialized (allocated) only if no active block exists for the requested type. To quickly locate blocks and space for new blocks, SoaAlloc maintains a global free bitmap and, for each type, a block allocation bitmap and an active block bitmap. A high-level overview of the object allocation process is shown in Figure 3.
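The allocation decision can be sketched as follows (a single-threaded host-side C++ simplification with hypothetical names; the real allocator updates hierarchical bitmaps with atomic operations, and each bitmap here is shortened to one 64-bit word):

```cpp
#include <cassert>
#include <cstdint>

// Bit i of `heap_free` set  = block i is uninitialized.
// Bit i of `allocated` set  = block i is allocated for this type.
// Bit i of `active` set     = block i is allocated and not full.
struct HeapSketch {
    uint64_t heap_free = ~0ull;  // global free bitmap
    uint64_t allocated = 0;      // block allocation bitmap (one type)
    uint64_t active    = 0;      // active block bitmap (one type)

    // Returns the index of the block in which a new object is placed.
    int find_block_for_allocation() {
        if (active != 0) {
            // Prefer an existing active block to keep fragmentation low.
            return __builtin_ffsll(active) - 1;
        }
        // No active block for this type: initialize (allocate) a new one.
        int b = __builtin_ffsll(heap_free) - 1;
        heap_free &= ~(1ull << b);
        allocated |= 1ull << b;
        active    |= 1ull << b;  // a fresh block is not full, hence active
        return b;
    }
};
```

When a block becomes full, its bit in the active bitmap is cleared; when its last object is deleted, the block returns to the global free bitmap.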
Apart from SOA, SoaAlloc applies three optimizations to make allocation more efficient: (a) Similar to XMalloc (Huang et al., 2010), allocation requests in a SIMD thread group (warp) are combined into a single request to reduce the number of memory operations. (b) Bitmaps are hierarchical, such that blocks can be found without scanning an entire bitmap. (c) SoaAlloc is implemented with efficient bitwise operations provided by many hardware architectures (e.g., find first set (ffs)).
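Optimization (a) can be emulated on the host as follows (the real implementation uses CUDA warp intrinsics such as ballot/prefix operations; the function name and slot-distribution details here are hypothetical). A group of up to 64 allocation requests is served with a single atomic bitmap update instead of one atomic operation per request, using find first set (ffs) to pick free slots:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Serve `n_requests` slot allocations from one block's free-slot bitmap
// (bit set = slot free) with a single compare-and-swap. Returns the number
// of slots actually granted (fewer if the block fills up) and writes the
// granted slot indices to `out_slots`.
int allocate_group(std::atomic<uint64_t>& free_slots,
                   int n_requests, int out_slots[]) {
    uint64_t old_bits, new_bits;
    int granted;
    do {  // lock-free retry loop
        old_bits = free_slots.load();
        uint64_t bits = old_bits;
        granted = 0;
        // Claim up to n_requests free slots, lowest index first (ffs).
        while (granted < n_requests && bits != 0) {
            int slot = __builtin_ffsll(bits) - 1;
            out_slots[granted++] = slot;
            bits &= ~(1ull << slot);
        }
        new_bits = bits;
        // One CAS replaces n_requests individual atomic updates.
    } while (!free_slots.compare_exchange_weak(old_bits, new_bits));
    return granted;
}
```

The hierarchical bitmaps of optimization (b) apply the same idea one level up: a set bit in a higher-level word summarizes a 64-bit word below, so a free block is found with a few ffs steps instead of a linear scan.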
Since objects are always allocated in active blocks, we expect blocks to have a high average fill level (low fragmentation). SoaAlloc assigns threads with consecutive IDs to all object slots of a block, regardless of the fill level. This improves data locality and memory bandwidth utilization through memory coalescing and better cache utilization (Feng and Berger, 2005), because threads in a warp (consecutive IDs) process objects in the same block. SoaAlloc finds active blocks with a top-down traversal of the block allocation bitmap. Compared to a stream compaction (prefix sum (Billeter et al., 2009)) step on a flat bitmap, this avoids scanning large bitmap regions that contain no allocated blocks.
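The thread-assignment scheme of do-all can be sketched as a simple index calculation (host-side C++; the 64-slot capacity and names are illustrative): thread t is assigned slot t mod 64 of block t / 64, so a warp of 32 consecutive threads always works within a single block and therefore on adjacent SOA field values:

```cpp
#include <cassert>

constexpr int kSlotsPerBlock = 64;  // illustrative block capacity

struct Assignment { int block; int slot; };

// Map a do-all thread ID to an object slot. A thread whose slot is not
// marked in the block's object allocation bitmap simply returns without
// doing work, which is cheap when blocks are mostly full.
Assignment assign(int thread_id) {
    return { thread_id / kSlotsPerBlock, thread_id % kSlotsPerBlock };
}
```

Because slots are assigned regardless of fill level, a sparsely filled block wastes some threads; this is why keeping fragmentation low (allocating only in active blocks) directly benefits do-all performance.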
SoaAlloc makes heavy use of bitwise atomic operations and retry loops (Cederman et al., 2017) to implement its lock-free, concurrent data structures (bitmaps and blocks). Based on the return value of an atomic operation, a thread knows whether it is responsible for updating other internal data structures. This is a common pattern in lock-free algorithms (Michael, 2004).
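As a concrete instance of this pattern (a sketch with hypothetical names, not SoaAlloc's exact code): when a thread claims a slot with an atomic AND, the returned previous bitmap value tells it, atomically and uniquely, whether it just filled the block; exactly that one thread is then responsible for clearing the block's bit in the active block bitmap:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Claim `slot` in a block's free-slot bitmap (bit set = slot free; the
// caller must have determined that the slot is free). Returns true iff
// this thread cleared the LAST free bit, i.e., it filled the block and
// must now also deactivate it in the active block bitmap.
bool claim_slot(std::atomic<uint64_t>& free_slots, int slot) {
    uint64_t before = free_slots.fetch_and(~(1ull << slot));
    // `before` is the bitmap value this thread observed atomically:
    // if slot's bit was the only bit set, the block is now full.
    return before == (1ull << slot);
}
```

Since `fetch_and` returns the old value atomically, no two threads can both conclude that they filled the block, so the follow-up update is performed exactly once without any lock.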
3. Performance Evaluation
Figure 4 shows the overall running time of Wa-Tor (Dewdney, 1984) (fish and sharks) iterations (excluding do-all time, because the other allocators do not support do-all) and the memory fragmentation of SoaAlloc. Wa-Tor is a cellular automaton and an interesting OOP application with classes for fish, sharks, and cells. One iteration consists of multiple do-all operations; e.g., fish/sharks have a method for moving to a neighboring cell. We compare the performance of application code with different state-of-the-art memory allocators for CUDA. SoaAlloc is faster than the other allocators, mainly because of the SOA data layout and the do-all object scheduling strategy, which result in coalesced memory accesses. Fragmentation increases after massive deallocations around iteration 70, because blocks can be deallocated only when all of their objects have been deallocated. However, fragmentation recovers quickly, because new objects are created/destroyed continuously and SoaAlloc allocates objects only in existing active blocks.
(Figure 4 legend: SoaAlloc, ScatterAlloc (Steinberger et al., 2012), Halloc (Adinetz and Pleiter, 2014), CUDA allocator.)
Acknowledgements. This work was supported by JSPS KAKENHI Grant 18J14726.
- Adinetz and Pleiter (2014) A. V. Adinetz and D. Pleiter. 2014. Halloc: A High-Throughput Dynamic Memory Allocator for GPGPU Architectures. https://github.com/canonizer/halloc. In GPU Technology Conference 2014.
- Allan (2010) Robert J. Allan. 2010. Survey of Agent Based Modelling and Simulation Tools. Technical Report. Science and Technology Facilities Council, Warrington, United Kingdom.
- Bandini et al. (2009) S. Bandini, S. Manzoni, and G. Vizzari. 2009. Agent Based Modeling and Simulation: An Informatics Perspective. Journal of Artificial Societies and Social Simulation 12, 4 (2009), 4.
- Billeter et al. (2009) M. Billeter, O. Olsson, and U. Assarsson. 2009. Efficient Stream Compaction on Wide SIMD Many-core Architectures. In Proceedings of the Conference on High Performance Graphics 2009 (HPG ’09). ACM, New York, NY, USA, 159–166.
- Cederman et al. (2017) D. Cederman, A. Gidenstam, P. Ha, H. Sundell, M. Papatriantafilou, and P. Tsigas. 2017. Lock-Free Concurrent Data Structures. Wiley-Blackwell, Chapter 3, 59–79.
- Dewdney (1984) A. K. Dewdney. 1984. Computer Recreations: Sharks and fish wage an ecological war on the toroidal planet Wa-Tor. Scientific American 251, 6 (Dec. 1984), 14–22.
- Feng and Berger (2005) Y. Feng and E. D. Berger. 2005. A Locality-improving Dynamic Memory Allocator. In Proceedings of the 2005 Workshop on Memory System Performance (MSP ’05). ACM, New York, NY, USA, 68–77.
- Homann and Laenen (2018) H. Homann and F. Laenen. 2018. SoAx: A generic C++ Structure of Arrays for handling particles in HPC codes. Computer Physics Communications 224 (2018), 325–332.
- Huang et al. (2010) X. Huang, C. I. Rodrigues, S. Jones, I. Buck, and W.-m. Hwu. 2010. XMalloc: A Scalable Lock-free Dynamic Memory Allocator for Many-core Machines. In 2010 10th IEEE International Conference on Computer and Information Technology. 1134–1139.
- Kofler et al. (2015) K. Kofler, B. Cosenza, and T. Fahringer. 2015. Automatic Data Layout Optimizations for GPUs. In Euro-Par 2015: Parallel Processing. Springer Berlin Heidelberg, Berlin, Heidelberg, 263–274.
- Maureira-Fredes and Amaro-Seoane (2018) C. Maureira-Fredes and P. Amaro-Seoane. 2018. GraviDy, a GPU modular, parallel direct-summation N-body integrator: dynamics with softening. Monthly Notices of the Royal Astronomical Society 473, 3 (2018), 3113–3127.
- Michael (2004) M. M. Michael. 2004. Scalable Lock-free Dynamic Memory Allocation. In Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation (PLDI ’04). ACM, New York, NY, USA, 35–46.
- Steinberger et al. (2012) M. Steinberger, M. Kenzel, B. Kainz, and D. Schmalstieg. 2012. ScatterAlloc: Massively parallel dynamic memory allocation for the GPU. In 2012 Innovative Parallel Computing (InPar). 1–10.
- Strzodka (2012) R. Strzodka. 2012. Chapter 31 - Abstraction for AoS and SoA Layout in C++. In GPU Computing Gems Jade Edition. Morgan Kaufmann, Boston, 429–441.