BCL: A Cross-Platform Distributed Container Library

by Benjamin Brock, et al.

One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI's one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters, and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization.




I Introduction

Writing parallel programs for supercomputers is notoriously difficult, particularly when they have irregular control flow; however, high-level languages and libraries can make this easier. A number of languages have been developed for high-performance computing, including several using the Partitioned Global Address Space (PGAS) model: Titanium, UPC, Coarray Fortran, X10, and Chapel [yelick1998titanium, upc2005upc, numrich1998co, charles2005x10, weiland2007chapel, chamberlain2007parallel]. These languages are especially well-suited to problems that require asynchronous one-sided communication, or communication that takes place without a matching receive operation or outside of a global collective. However, PGAS languages lack the kind of high-level libraries that exist in other popular programming environments. For example, high-performance scientific simulations written in MPI can leverage a broad set of numerical libraries for dense or sparse matrices, or for structured, unstructured, or adaptive meshes. PGAS languages can sometimes use those numerical libraries, but lack the kind of data structures that are important in some of the most irregular parallel programs.

In this paper we describe a library, the Berkeley Container Library (BCL), that is intended to support applications with irregular patterns of communication and computation and data structures with asynchronous access, for example hash tables and queues, that can be distributed across processes but manipulated independently by each process. BCL is designed to provide a complementary set of abstractions for data analytics problems, various types of search algorithms, and other applications that do not easily fit a bulk-synchronous model. BCL is written in C++ and its data structures are designed to be coordination-free, using one-sided communication primitives that can be executed using RDMA hardware without requiring coordination with remote CPUs. In this way, BCL is consistent with the spirit of PGAS languages, but provides higher-level operations such as insert and find in a hash table, rather than low-level remote read and write. As in PGAS languages, BCL data structures live in a global address space and can be accessed by every process in a parallel program. BCL data structures are also partitioned to ensure good locality whenever possible and allow for scalable implementations across multiple nodes with physically disjoint memory.

BCL is cross-platform, and is designed to be agnostic about the underlying communication layer as long as it provides one-sided communication primitives. It runs on top of MPI’s one-sided communication primitives, OpenSHMEM, and GASNet-EX, all of which provide direct access to low-level remote read and write primitives to buffers in memory [gerstenberger2014enabling, chapman2010introducing, bonachea2017gasnet]. BCL provides higher level abstractions than these communication layers, hiding many of the details of buffering, aggregation, and synchronization from users that are specific to a given data structure. BCL also has an experimental UPC++ backend, allowing BCL data structures to be used inside another high-level programming environment.

We present the design of BCL with an initial set of data structures and operations. We then evaluate BCL’s performance on ISx, an integer sorting benchmark, and Meraculous, a benchmark taken from a large-scale genomics application. We explain how BCL’s data structures and design decisions make developing high-performance implementations of these benchmarks more straightforward and demonstrate that BCL is able to match or exceed the performance of both specialized, expert-tuned implementations as well as general libraries across three different HPC systems.

I-A Contributions

  1. A distributed data structures library that is designed for high performance and portability by using a small set of core primitives

  2. A distributed hash table implementation that supports fast insertion and lookup phases, dynamic message aggregation, and individual insert and find operations

  3. A distributed queue abstraction for many-to-many data exchanges performed without global synchronization

  4. A distributed Bloom filter which achieves fully atomic insertions using only one-sided operations

  5. The BCL ObjectContainer abstraction, which allows data structures to transparently handle serialization of complex types while maintaining high performance for simple types

  6. A fast and portable implementation of the Meraculous benchmark built in BCL

  7. An experimental analysis of irregular data structures across three different computing systems along with comparisons between BCL and other standard implementations.

II Background and High-Level Design

Several approaches have been used to address programmability issues in high-performance computing, including parallel languages like Chapel, template metaprogramming libraries like UPC++, and embedded DSLs like STAPL. These environments provide core language abstractions that can boost productivity, and some of them have sophisticated support for multidimensional arrays. However, none of these environments feature the kind of rich data structure libraries that exist in sequential programming environments like C++ or Java. A particular need is for distributed memory data structures that allow for nontrivial forms of concurrent access that go beyond partitioned arrays in order to address the needs of irregular applications. These data structures tend to have more complicated concurrency control and locality optimizations that go beyond tiling and ghost regions.

Our goal is to build robust, reusable, high-level components to support these irregular computational patterns while maintaining performance close to hardware limits. We aim to achieve this goal using the following design principles.

II-1 Low Cost for Abstraction

While BCL offers data structures with high-level primitives like hash table and queue insertions, these commands will be compiled directly into a small number of one-sided remote memory operations. Where hardware support is available, all primary data structure operations, such as reads, writes, inserts, and finds, are executed purely in RDMA without requiring coordination with remote CPUs.

II-2 Portability

BCL is cross-platform and can be used natively in programs written in MPI, OpenSHMEM, GASNet-EX, and UPC++. When programs only use BCL data structures, users can pick whichever backend’s implementation is most optimized for their system and network hardware.

II-3 Software Toolchain Complexity

BCL is a header-only library, so users need only include the appropriate header files and compile with a C++14-compliant compiler to build a BCL program. BCL data structures can be used in part of an application without having to rewrite the whole application or include any new dependencies.

III BCL Core

III-A Memory Model

The BCL Core is the cross-platform internal DSL we use to implement BCL data structures. It provides a high-level PGAS memory model. During initialization, each process creates a shared memory segment of a fixed size. Processes can read and write from any location within the shared memory segment of another node, but cannot directly read or write from any remote memory address outside of the shared segment. Ranks can refer to specific locations within a shared memory segment using a global pointer, which is simply a C++ object that contains (1) the rank number of the process on which the memory is located and (2) the particular offset within that process's shared memory segment being referenced. Together, these two values uniquely identify a global memory address. Global pointers are regular data objects and can be passed around between BCL processes using communication primitives or stored in global memory. Global pointers support pointer arithmetic operations similar to local pointer arithmetic.

III-B Communication Primitives

III-B1 Writing and Reading

The BCL Core’s primary memory operations involve writing and reading to global pointers. Remote get operations read from a global pointer and copy the result into local memory, and remote put operations write the contents of an object in local memory to a shared memory location referenced by a global pointer. Remote completion of put operations is not guaranteed until after a memory fence such as a flush or barrier.

[Figure: BCL Containers layered on top of the internal DSL (BCL Core).]

Fig. 1: Organizational diagram of BCL.

III-B2 Collectives

BCL includes the broadcast and allreduce collectives. Depending on the backend, these may be implemented using raw remote put and remote get operations, or, more likely, may map directly to high-performance implementations offered by the backend communication framework. In the work presented here, collective performance is not critical, since collectives are mainly used for transporting pointers and control values.

III-B3 Atomics

BCL’s data structures avoid coordination between CPUs, instead relying on remote memory atomics to maintain consistency. BCL backends must implement at least the atomic compare-and-swap operation, since all other atomic memory operations (AMOs) can be implemented on top of compare-and-swap [herlihy1991wait]. However, backends will achieve much higher performance by directly including any atomic operations available in hardware. Other atomic operations provided by current BCL backends and utilized by BCL data structures include atomic fetch-and-add and atomic fetch-and-or. We depend on backends to provide high-quality interfaces to atomic operations as implemented in hardware, but also to provide atomic operation support through active messages or progress threads when hardware atomics are not available.

III-B4 Barriers

BCL applications enforce synchronization using BCL barriers, which are both barriers and memory fences, forcing ordering of remote memory operations. In order for a rank to enter a barrier, all of its memory operations must complete, both locally and at the remote target. In order for a rank to exit a barrier, all other ranks must have entered the barrier.

III-C Type Safety and Error Checking

The BCL Core is designed to avoid successfully compiling incorrect code where possible. This is accomplished largely through the type system. Unlike MPI, where no compile-time checks are performed to verify that pointers are the correct type, remote memory operations in BCL are inherently type safe. Attempting to read or write to a global pointer with data of an incorrect type will cause a compiler error. Global pointers cannot be implicitly cast from one type to another, but must be explicitly cast. BCL’s ops structs, which specify the type of operation, such as addition or multiplication, to be used with a memory operation like a collective or an atomic, use a class hierarchy to enforce that the ops are used correctly. For example, trying to use the “addition” op with a float type in an atomic operation will cause a compile-time error if, as is commonly the case, the backend does not support atomic floating-point addition in network hardware.

IV BCL Data Structures

BCL data structures are split into two categories: distributed and hosted. Distributed data structures live in globally addressable memory and are automatically distributed among all the ranks in a BCL program. Hosted data structures, while resident in globally addressable memory, are hosted only on a particular process. All other processes may read or write from the data structure lying on the host process. We have found hosted data structures to be an important building block in creating distributed data structures.

All BCL data structures are coordination-free, by which we mean that primary data structure operations, such as insertions, deletions, updates, reads, and writes, can be performed without coordinating with the CPUs of other nodes, but purely in RDMA where hardware support is available. Other operations, such as resizing or migrating hosted data structures from one node to another, may require coordination. In particular, operations which modify the size and location of the data portions of BCL data structures must be performed collectively, on both distributed and hosted data structures. This is because coordination-free data structure methods, such as insertions, use global knowledge of the size and location of the data portion of the data structure. For example, one process cannot change the size or location of a hash table without alerting other processes, since they may try to insert into the old hash table memory locations. Tables I and II give an overview of the available data structures and operations. Table II also gives the best-case cost of each operation in terms of remote reads, remote writes, atomic operations, local operations, and global barriers. As demonstrated by the table, each high-level data structure operation is compiled down to a small number of remote memory operations.

All BCL data structures are also generic, meaning they can be used to hold any type, including complex, user-defined types. Most common types will be handled automatically, without any intervention by the user. See Section LABEL:sec:container for a detailed description of BCL’s lightweight serialization mechanism.

Data Structure Distributed or Hosted
BCL::HashMap Distributed
BCL::CircularQueue Hosted
BCL::HashMapBuffer Distributed
BCL::BloomFilter Distributed
BCL::DArray Distributed
BCL::Array Hosted
Table I: A summary of BCL data structures.

Data Structure      Method                                    Collective  Atomic  Description
BCL::HashMap        bool insert(const K &key, const V &val)   N           Y       Insert item into hash table.
                    bool find(const K &key, V &val)           N           Y       Find item in table, return val.
BCL::BloomFilter    bool insert(const T &val)                 N           Y       Insert item into Bloom filter, return true if already present.
                    bool find(const T &val)                   N           Y       Find item in filter, return whether present.
BCL::CircularQueue  bool insert(const T &val)                 N           Y       Insert item into queue.
                    bool pop(T &val)                          N           Y       Pop item from queue.
                    bool insert(const std::vector<T> &vals)   N           Y       Insert items into queue.
                    bool pop(std::vector<T> &vals, size_t n)  N           Y       Pop items from queue.
                    bool local_nonatomic_pop(T &val)          N           N       Nonatomically pop item from a local queue.
                    void resize(size_t n)                     Y           Y       Resize queue.
                    void migrate(size_t n)                    Y           Y       Migrate queue to new host.

Table II: A selection of methods from BCL data structures. Costs are best case and are expressed in terms of remote reads, remote writes, remote atomic memory operations, barriers, local memory operations, and the number of elements involved.

IV-A Hash Table

BCL’s hash table is implemented as a single logically contiguous array of hash table buckets distributed block-wise among all processes. Each bucket is a struct including a key, value, and status flag. Our hash table uses open addressing with quadratic probing to resolve hash collisions. As a result, neither insert nor find operations to our hash table require any coordination with remote ranks. Where hardware support is available, hash table operations will take place purely with RDMA operations.

IV-A1 Interface

BCL’s BCL::HashMap is a distributed data structure. Users can create a BCL::HashMap by calling the constructor as a collective operation among all ranks. BCL hash tables are created with a fixed key and value type as well as a fixed size. BCL hash tables use ObjectContainers, discussed in Section LABEL:sec:container, to store keys and values of any type. BCL hash tables also use the standard C++ STL method for handling hash functions, which is to look for a std::hash<K> template struct in the standard namespace that provides a mechanism for hashing key objects.

The hash table supports two primary methods, bool insert(const K &key, const V &val) and bool find(const K &key, V &val). Section LABEL:sec:eval gives a performance analysis of our hash table.

IV-A2 Atomicity

Hash table insertions are atomic with respect to one another, including simultaneous insert operations with the same key. This is accomplished by using a separate reserved array along with the data array which holds the hash table keys and values. In order to insert a value into the array, a process probes through the reserved array, using an atomic compare-and-swap operation to request a slot to insert its value. If the process successfully reserves a slot, it will insert its key and value into the data portion of the hash table, flush those remote memory operations to ensure completion, and then update its reserved entry to indicate that the entry is ready to be read. If a process encounters an entry which is marked as ready, it will read that entry’s key, and, if the key matches the key to be inserted, request that slot, modify the value, then mark it as ready after completion. If a process encounters an entry that is marked as reserved but not ready, it must wait until the entry is marked as ready before proceeding to check the corresponding key.

IV-A3 Hash Table Size

A current limitation of BCL is that, since hash tables are initialized to a fixed size and do not dynamically resize, an insertion may fail. In the future, we plan to support dynamically resizing hash tables. Currently, the user must call the collective resize method herself when the hash table becomes full.

IV-B Queues

The BCL::CircularQueue data structure is implemented as a ring buffer. A BCL::CircularQueue is initialized with a fixed size and host rank and is assigned a block of memory for data as well as head and tail indexes. To insert a value or array of values into the queue, a rank first atomically increments the tail pointer, checks that this does not surpass the head pointer, and then inserts its value or values into the data segment of the queue. An illustration of a push operation is shown in Figure LABEL:fig:queue_insert. In general, the head overrun check is performed without a remote memory operation by caching the position of the head pointer, so an insertion requires two remote memory operations. We similarly cache the location of the tail pointer, so pops to the queue usually require only one atomic memory operation to increment the head pointer and one remote memory operation to read the popped values. We have also included a local_nonatomic_pop() operation, which pops a value from a circular queue hosted locally using only local memory operations. This operation is nonatomic with respect to other pops.

BCL::CircularQueue supports resizing the queue as well as migrating the queue to another host process, both as collective operations. BCL::CircularQueue supports concurrent pushes and concurrent pops, but pushes and pops must be separated by a barrier. This is to guarantee that the rput operation which writes items pushed to the queue has completed before the items are read. A separate data structure provides a circular queue which supports concurrent pushes and pops, but it is not discussed here for reasons of space. We evaluate the performance of our circular queue data structure in Section LABEL:sec:cqeval.