In today’s world, performance is improved mainly by increasing parallelism, and therefore a scalable programming model is a necessity. The Actor Model [17, 2] is the primary concurrency mechanism in languages such as Scala and Erlang and is also gaining popularity in modern system programming languages such as Rust. Large-scale cloud applications from companies such as Facebook and Twitter that serve millions of users use actor-based models to support them. Actors express asynchronous communication using “mailboxes”. Selectors extend the actor model with multiple mailboxes in a single actor, thereby improving its expressiveness. It has also been shown that actors can be integrated with other parallel programming constructs through the use of a unified task-parallel runtime system. Another important property of actors/selectors is their inherent asynchrony, i.e., there are no global constraints on the order in which messages are processed in mailboxes.
Traditionally, HPC applications mainly made use of dense data structures, whereas recently there has been a renewed interest in sparse data structures, including those used in graph algorithms, sparse linear algebra algorithms, and machine learning algorithms [32, 15]. These algorithms represent application domains in which there are real-world needs for large-scale distributed data and computations that span multiple processing elements (PEs). The PGAS (Partitioned Global Address Space) execution model is well suited to such irregular applications due to its efficient support for short, non-blocking one-sided messages and the convenience of a non-uniform global address space abstraction, which enables the programmer to implement scalable locality-aware algorithms. Notable PGAS programming systems include Co-array Fortran, OpenSHMEM, and Unified Parallel C (UPC). Distribution of data structures across multiple PEs gives rise to large numbers of fine-grain communications, which can be expressed succinctly in the PGAS model.
While regular applications with dense data structures can amortize overheads by using medium-grain or coarse-grain messages, a key challenge for PGAS applications is the need for careful aggregation and coordination of short messages to achieve low overhead, high network utilization, and correct termination logic. Communication aggregation libraries such as Conveyors can help address this problem by locally buffering fine-grain communication calls and aggregating them to create medium/coarse-grain messages. However, the use of such aggregation libraries places a significant burden on programmer productivity and assumes a high expertise level.
Due to its asynchronous nature, the Actor model can help in such cases by enabling automatic aggregation of messages without requiring the programmer to perform the aggregation. Although the existing non-blocking communication primitives used in HPC systems have the potential for enabling aggregation, the use of actors would make these systems more welcoming to users from other domains such as the cloud. The non-blocking primitives also pose a challenge: the actor system should perform at least as well as the current non-blocking primitives, and preferably better. Actors can also support the model of migrating the computation closer to where the data is located, which is beneficial for many irregular applications.
In this paper, we extend the Selector model to obtain a new scalable programming system for PGAS runtimes which enables the programmer to specify fine-grained asynchronous communications without worrying about complexities related to message aggregation and termination detection.
Specifically, this paper makes the following contributions:
A new PGAS programming system which extends the selector model to enable asynchronous communication with automatic message aggregation for scalable performance.
Design of a communication termination protocol, which transfers the burden of termination detection and related communication bookkeeping from the programmer to the selector runtime.
The Actor Model [17, 2] is an asynchronous message-based concurrency model. The key idea behind actors is to encapsulate mutable state and to use asynchronous messaging to coordinate activities among actors. The actor runtime maintains a separate logical mailbox for each actor. Any actor or non-actor can send messages to an actor’s mailbox. An actor also maintains a local state, which is initialized during creation. The actor is only allowed to update its local state using data from the messages it receives. The actor is restricted to processing at most one message at a time, thereby avoiding data races and the need for synchronization. In the actor model, there is no restriction on the order in which the actor processes incoming messages. A Selector is an extension of an actor with multiple mailboxes.
II-B HClib Asynchronous Many-Task (AMT) Runtime
The Habanero C/C++ library (HClib) is a lightweight runtime based on the asynchronous many-task programming model. It uses a lightweight work-stealing scheduler to schedule tasks. HClib uses a persistent pool of threads, called workers, on which tasks are scheduled and load balanced using lock-free concurrent deques. HClib exposes several programming constructs to users, which help them express parallelism easily and efficiently.
A brief summary of the relevant APIs is as follows:
async: Used to create asynchronous tasks dynamically.
finish: Used for bulk task synchronization. It waits on all tasks spawned (including nested tasks) within the scope of the finish.
promise and future: Used for point-to-point inter-task synchronization starting from C++11. A promise is a single-assignment, thread-safe container into which a value is written, and a future is a read-only handle to that value. Waiting on a future causes a task to suspend until the corresponding promise is satisfied by putting a value into it.
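The promise/future semantics described above can be illustrated with the standard C++11 types. This is a minimal sketch that uses `std::promise` and `std::future` as stand-ins; it does not use HClib's own promise and future classes, and the helper names `is_ready`, `put_then_get`, and `second_put_is_rejected` are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <future>

// True once the promise behind `f` has been satisfied.
bool is_ready(const std::future<int>& f) {
    return f.wait_for(std::chrono::seconds(0)) == std::future_status::ready;
}

// Put a value into a promise and read it back through the future.
int put_then_get(int value) {
    std::promise<int> p;                  // single-assignment container
    std::future<int> f = p.get_future();  // read-only handle to its value
    assert(!is_ready(f));                 // nothing has been put yet
    p.set_value(value);                   // satisfies the promise
    assert(is_ready(f));
    return f.get();                       // would block if not yet satisfied
}

// A second put on the same promise is rejected with std::future_error.
bool second_put_is_rejected() {
    std::promise<int> p;
    p.set_value(1);
    try {
        p.set_value(2);
    } catch (const std::future_error&) {
        return true;
    }
    return false;
}
```

In an AMT runtime the waiting task would be suspended and its worker reused for other tasks, whereas `std::future::get` blocks the calling thread; the sketch only shows the single-assignment and read-only-handle properties.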
III Communication Layer
In this section, we describe the selection of an appropriate communication layer to build the Selector model. We will illustrate the communication layer by discussing examples of two basic communication idioms, namely update and gather idioms. The simple patterns are shown for demonstration purposes, and our approach can support other complex communication patterns as well. Since the focus of our work is on scalable parallelism, we assume that each processing element (PE) starts by executing the same code in the following code examples (as in the Single Program Multiple Data - SPMD model).
III-A Idiom 1: Update Pattern
This pattern is used extensively in computations where each PE needs to update a local or remote location that is only known dynamically at the time of the update.
A high-level version of this pattern is shown in Listing 1.
This program creates a histogram in a distributed array named histo based on global indices stored in each PE’s local index array (+ performs an atomic increment).
As can be seen in Listing 1, this version supports
a “global view” of the single distributed histo array. The corresponding operation can be performed in OpenSHMEM using shmem_int64_atomic_add/inc, in UPC using upc_atomic_relaxed, and in MPI using
III-B Idiom 2: Gather Pattern
The second common idiom is the gather pattern, in which each PE sends a request for data from a dynamically identified remote location and then processes the data received in response to the request. A high-level version of a program using this pattern is shown in Listing 2. This program reads data from a distributed array named data (“global view”) and stores the retrieved values in a local array named gather based on global indices stored in a local array named index. The corresponding operation can be performed in OpenSHMEM using shmem_int64_g or shmem_int64_get_nbi, in UPC as shown in the listing or using upc_memget_nbi, and in MPI using MPI_Get.
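The two idioms can be sketched in a single process by simulating each PE's slice of the distributed array. The cyclic layout (global index g on PE g mod P at offset g / P) and the names `Location`, `locate`, `Distributed`, `update`, and `gather` are assumptions made for illustration; they are not taken from the listings.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical cyclic layout: global index g lives on PE (g % num_pes)
// at local offset (g / num_pes).
struct Location { int pe; std::int64_t offset; };

Location locate(std::int64_t global_index, int num_pes) {
    assert(num_pes > 0);
    return { static_cast<int>(global_index % num_pes), global_index / num_pes };
}

// A distributed array simulated in one process: one local slice per PE.
using Distributed = std::vector<std::vector<std::int64_t>>;

// Idiom 1 (update): increment histo[g] for every global index g in `index`.
void update(Distributed& histo, const std::vector<std::int64_t>& index) {
    const int num_pes = static_cast<int>(histo.size());
    for (std::int64_t g : index) {
        Location loc = locate(g, num_pes);
        histo[loc.pe][loc.offset] += 1;  // a remote atomic increment in a real PGAS run
    }
}

// Idiom 2 (gather): read data[g] for every g in `index` into a local array.
std::vector<std::int64_t> gather(const Distributed& data,
                                 const std::vector<std::int64_t>& index) {
    const int num_pes = static_cast<int>(data.size());
    std::vector<std::int64_t> gathered;
    gathered.reserve(index.size());
    for (std::int64_t g : index) {
        Location loc = locate(g, num_pes);
        gathered.push_back(data[loc.pe][loc.offset]);  // a remote get in a real PGAS run
    }
    return gathered;
}
```

The update idiom is one-directional (a write to the owner), while the gather idiom needs a reply from the owner, which is why it involves communication in two directions once the accesses become messages.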
Table I compares the performance of the Histogram and Index-gather mini-applications across various commonly used communication systems/libraries. The Conveyors version outperforms all other versions by a significant margin, and therefore we decided to use Conveyors as the communication library for the Selector runtime.
|Mini-application|Communication system|Non-blocking|Performance (lower is better)|
|---|---|---|---|
|Histogram|OpenSHMEM NBI (cray-shmem 7.7.10)|Y|4.3|
|Histogram|UPC (Berkeley-UPC 2020.4.0)|N|23.9|
|Histogram|MPI3-RMA (OpenMPI 4.0.2)|Y|88.9|
|Histogram|MPI3-RMA (cray-mpich 7.7.10)|Y|300|
|Histogram|Charm++ (6.10.1, gni-crayxc w/ TRAM)|Y|9.7|
|Histogram|Conveyors (2.1 on cray-shmem 7.7.10)|Y|0.5|
|Index-gather|OpenSHMEM (cray-shmem 7.7.10)|N|35.5|
|Index-gather|OpenSHMEM NBI (cray-shmem 7.7.10)|Y|4.2|
|Index-gather|UPC (Berkeley-UPC 2020.4.0)|N|22.6|
|Index-gather|UPC NBI (Berkeley-UPC 2020.4.0)|Y|19.7|
|Index-gather|MPI3-RMA (OpenMPI 4.0.2)|Y|25.8|
|Index-gather|MPI3-RMA (cray-mpich 7.7.10)|Y|8.3|
|Index-gather|Charm++ (6.10.1, gni-crayxc w/ TRAM)|Y|21.3|
|Index-gather|Conveyors (2.1 on cray-shmem 7.7.10)|Y|2.3|
Conveyors is a C-based message aggregation library built on top of conventional communication libraries such as SHMEM, MPI, and UPC. It provides the following three basic operations:
push (convey_push): attempts to locally enqueue a message for delivery to a specified PE.
pull (convey_pull): attempts to fetch a received message from the local buffer.
advance (convey_advance): enables forward progress of communication by transferring buffers.
It is worth noting that both push and pull operations can fail, i.e., return false, meaning that the intended operation was not fulfilled. push can fail due to a lack of local buffer space, and pull can fail due to a lack of an available item. Because of these failures, push and pull operations must always be placed in a loop that calls advance to ensure progress and to detect termination. Calling advance too frequently can result in extra overheads, while failing to call advance when needed can result in livelock or deadlock.
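The retry structure described above can be sketched with a toy, single-process stand-in for a conveyor. `ToyConveyor` and `send_all` are hypothetical names and are not part of the Conveyors library; the class only mimics the fail-and-retry behavior of push, pull, and advance by looping every message back to the same process.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Toy loopback conveyor (NOT the Conveyors library): bounded outgoing buffer,
// every message is "delivered" back to the same process.
class ToyConveyor {
public:
    explicit ToyConveyor(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // Fails (returns false) when the outgoing buffer is full.
    bool push(std::int64_t item) {
        if (outgoing_.size() >= capacity_) return false;
        outgoing_.push_back(item);
        return true;
    }

    // Fails (returns false) when no received item is available.
    bool pull(std::int64_t& item) {
        if (incoming_.empty()) return false;
        item = incoming_.front();
        incoming_.pop_front();
        return true;
    }

    // Transfers buffered items. Returns false only after the caller has
    // declared that it is done pushing and every item has been pulled.
    bool advance(bool done) {
        while (!outgoing_.empty()) {
            incoming_.push_back(outgoing_.front());
            outgoing_.pop_front();
        }
        return !(done && incoming_.empty());
    }

private:
    std::size_t capacity_;
    std::deque<std::int64_t> outgoing_;
    std::deque<std::int64_t> incoming_;
};

// The loop the application programmer has to write by hand.
std::vector<std::int64_t> send_all(const std::vector<std::int64_t>& items,
                                   std::size_t capacity) {
    ToyConveyor c(capacity);
    std::vector<std::int64_t> received;
    std::size_t i = 0;
    while (c.advance(i == items.size())) {
        while (i < items.size() && c.push(items[i])) ++i;  // push until it fails
        std::int64_t item;
        while (c.pull(item)) received.push_back(item);     // pull until it fails
    }
    return received;
}
```

Even in this reduced form, the send index, the termination condition passed to advance, and the two retry loops are interleaved, which is the bookkeeping that an aggregation library leaves to the programmer.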
Listing 3, which is the Conveyors equivalent of Listing 2, uses two conveyors, q and r: conveyor q is used for sending and processing queries, and conveyor r is used for sending and processing responses. The array location that needs to be accessed is sent to the PE that owns that portion of the array. The target PE pulls this query and processes it. While processing the query, it gets the value from the data array and returns this value to the requester using convey_push.
Although the performance and scalability of the Conveyors-based implementations are excellent, the programs are neither easy to write nor easy to understand. The application programmer has to take care of error-handling operations as well as retry operations that are interleaved with advance operations. These complexities place a significant burden on programmer productivity and assume a high expertise level. Table I demonstrates that user-directed message aggregation with Conveyors can achieve much higher performance than non-blocking operations in state-of-the-art communication libraries/systems, some of which include automatic message aggregation [9, 8].
IV-A High-Level Design
Our primary goal is to support a high-level actor programming model for PGAS applications that delivers comparable performance to that of explicit user-directed message aggregation and termination. Since we use Conveyors as the underlying communication layer in our runtime, we would like to keep users from worrying about 1) the lack of available buffer space (convey_push), 2) the lack of an available item (convey_pull), and 3) the progress and termination of communications (convey_advance). We believe that the use of the Actor/Selector model is well suited for this problem since its programming model productively enables the specification of lightweight asynchronous message passing.
IV-A1 Abstracting buffers as mailboxes
Since the mailbox in the actor model is analogous to the buffer in aggregation libraries, a mailbox is a good fit for abstracting the convey_push and convey_pull operations. Thus, we map convey_push to Actor.send and convey_pull to Actor.process, which processes the received message, and leave it to the runtime to handle buffer/item failures and to progress and terminate communication between actors (mailboxes). More details on how the runtime takes care of failure scenarios are given in Section V-B.
Another design decision is to treat a mailbox as a scalable distributed object and partition it across PEs, which is analogous to how memory is accessed in the PGAS programming model. This partitioned global actor design allows users to access a target actor’s mailbox quickly (i.e., via a target PE ID: Actor.send(, PE);) instead of searching for the corresponding actor object across multiple nodes.
IV-A2 Supporting Selectors
Of the two patterns discussed in Section III, the gather pattern differs from the update pattern in that it involves communication in two directions, namely request and response. Since it is challenging for actors to implement such synchronization and coordination patterns, we instead use a Selector, which is an actor with multiple mailboxes, as a high-level abstraction. For example, for the gather pattern, users only need to create two mailboxes (one for Request, the other for Response) and implement Selector.process functions for the two mailboxes. This partitioned global selector design enables a uniform programming interface across the different communication patterns.
IV-A3 Progress and Termination
In general, the Actors/Selectors model provides an exit operation to terminate actors/selectors. One may think that it is natural to expose this operation to users. However, because communication is asynchronous, one problem with this termination semantics is that it requires users to make sure that all messages in the incoming mailbox are processed (or received in some cases) before invoking exit, which adds complexity even for mini-applications such as Histogram and Index Gather. To mitigate this burden, we added a relaxed version of exit, which we call done, to let the runtime do the heavy lifting. The semantics of done is that users tell the runtime that the PE on which a specific actor/selector object resides will not send any more messages to a particular mailbox, while the runtime can still keep the corresponding actor/selector alive so that it can continue to receive and process messages. More details on progress and termination can be found in Section IV-C.
IV-B User-facing API
IV-B1 The Actor/Selector API
Based on the discussions in Section IV-A, we provide a C/C++ based actor/selector programming framework, as shown in Listing 4. (Later, in Section VI-A, we discuss how the process method can be replaced by an inline lambda in the send API.)
The two patterns can be expressed using these APIs as follows:
Idiom 1: Update Pattern. Listing 5 shows our version of the histogram. We first define our custom Actor by extending the Actor base class and defining the mailbox’s process method. The main program creates an Actor object as a collective operation, which is used for communication. Then, to create the histogram, it computes the target PE and the local index within the target from the global index. The local index is then sent to the target PE’s mailbox using the Actor’s send method. Once the target PE’s mailbox gets the message, the actor invokes the process method, which updates the histo array.
Idiom 2: Gather Pattern. Listing 6 shows how the index-gather pattern can be implemented using our selector-based approach to support multiple mailboxes. (Recall that an actor is simply a selector with one mailbox.) Here we extend the Selector base class, passing the number of mailboxes, which in this case is 2, and define the process method for each mailbox. The program executes by creating a data packet to send to the computed target PE. The data packet contains the array location within the target. The data packet also includes the index i, which is saved for later use by the response. The packet is sent to the target PE’s Request mailbox. Once the Request mailbox receives the message, its process routine is invoked, which gets the value from the data array and sends the response back to the sender.
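The two-mailbox structure of the gather pattern can be sketched with a toy, single-process selector. `ToySelector`, `Packet`, `on_message`, `run`, and `gather_with_selector` are hypothetical names, not the API of our library, and the sketch assumes a cyclic layout in which global index g resides on PE g mod P at offset g / P.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// slot: position in the requester's output array.
// value: array offset in a request, data value in a response.
struct Packet { std::int64_t slot; std::int64_t value; };

// Toy selector: several mailboxes, all PEs simulated in one process.
class ToySelector {
public:
    using Handler = std::function<void(int self_pe, Packet pkt, int sender_pe)>;

    explicit ToySelector(int num_mailboxes) : handlers_(num_mailboxes) {}

    void on_message(int mailbox, Handler h) { handlers_[mailbox] = std::move(h); }

    // Asynchronous send: the message is only queued here, not processed.
    void send(int mailbox, Packet pkt, int target_pe, int sender_pe) {
        pending_.push({mailbox, pkt, target_pe, sender_pe});
    }

    // Stand-in for the runtime: deliver queued messages one at a time.
    void run() {
        while (!pending_.empty()) {
            Message m = pending_.front();
            pending_.pop();
            handlers_[m.mailbox](m.target_pe, m.pkt, m.sender_pe);
        }
    }

private:
    struct Message { int mailbox; Packet pkt; int target_pe; int sender_pe; };
    std::vector<Handler> handlers_;
    std::queue<Message> pending_;
};

enum { REQUEST = 0, RESPONSE = 1 };

// Gather data[g] for each global index g requested by PE `me`.
std::vector<std::int64_t> gather_with_selector(
        const std::vector<std::vector<std::int64_t>>& data,
        const std::vector<std::int64_t>& index, int me) {
    const int num_pes = static_cast<int>(data.size());
    std::vector<std::int64_t> gathered(index.size(), -1);
    ToySelector sel(2);

    // Request mailbox: runs on the owner PE, reads the value, replies to the sender.
    sel.on_message(REQUEST, [&](int self_pe, Packet pkt, int sender_pe) {
        Packet reply{pkt.slot, data[self_pe][pkt.value]};
        sel.send(RESPONSE, reply, sender_pe, self_pe);
    });
    // Response mailbox: runs on the requester, stores the value in the saved slot.
    sel.on_message(RESPONSE, [&](int, Packet pkt, int) {
        gathered[pkt.slot] = pkt.value;
    });

    for (std::size_t i = 0; i < index.size(); ++i) {
        int target_pe = static_cast<int>(index[i] % num_pes);
        std::int64_t offset = index[i] / num_pes;
        sel.send(REQUEST, Packet{static_cast<std::int64_t>(i), offset}, target_pe, me);
    }
    sel.run();  // the toy's replacement for the runtime's progress and termination
    return gathered;
}
```

The application code consists only of the two handlers and the send loop; there is no retry loop and no explicit progress call, which is the simplification the selector abstraction is meant to provide.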
Comparing Listing 3, which directly uses Conveyors, with Listing 6, which uses Selectors, makes the decrease in code complexity evident. The tedious tasks of using convey_advance along with the loop conditions to progress communication and of dealing with failures in convey_push and convey_pull have all been moved into the runtime. Further, subparts of the application that involve a large number of latency-tolerant communication operations can use our API to achieve high throughput, while other parts of the application can continue using other PGAS interfaces as convenient.
The Actor/Selector equivalents are verbose because of the class-definition boilerplate. Section VI-A discusses the use of lambdas to reduce the verbosity.
IV-C Termination Graphs
As mentioned earlier, we provide the done operation as an alternative to terminating actors/selectors with exit. This design is based on our observation that 1) sending messages from a partition can be considered the active part of communication, where the user has to invoke send explicitly, whereas 2) receiving messages is the passive part, since the arrival of a message is not directly under the user’s control. Therefore, we designed the done termination interface to be associated with the sending of messages to a mailbox, and we leave it to the runtime to keep track of and drain all messages that are in flight or will be sent to it in the future.
One may have noticed that, in Listing 6, the done operation is performed only for the Request mailbox and not for the Response mailbox. This is possible because the Response mailbox depends on the Request mailbox, i.e., messages to the Response mailbox are sent only from the process function of the Request mailbox.
Here we introduce the concept of Termination Graph to discuss how this is possible. Let us first define that mailbox Y depends on mailbox X if a message is sent to mailbox Y in the process function of mailbox X. Based on the dependency relation, in general, we can create a directed graph between mailboxes within a selector. We assume an imaginary Outside mailbox, which is a virtual mailbox that does not depend on any mailboxes within the selector. A dependency on Outside mailbox implies a message is received from a non-actor/selector or a different actor/selector from the current distributed one.
Given a termination graph, removing an edge, say the one from X to Y, implies no more messages will be sent to mailbox Y in the process function of mailbox X. Therefore the done operation invoked on a mailbox by the user corresponds to removing all incoming edges to that mailbox since the semantics of done means no more messages will be sent to that mailbox. Using this edge deletion notion, termination of a selector can be formulated as follows.
Given a graph whose nodes are mailboxes of a selector and edges represent dependency between those mailboxes, termination of the selector corresponds to the removal of all edges from this graph.
Once such a graph is obtained, the user needs to invoke the done operation for those mailboxes that depend on the Outside mailbox and to break cycles. Using the dependency graph, the runtime can find out when to perform done* - a mini version of done invoked by the runtime - on the dependent mailboxes, as explained in the next paragraph. The done* operation on Mailbox Y from Mailbox X, only removes the edge from X to Y whereas user invoked done on Mailbox X removes all edges to X.
Figure 1 shows a sample mailbox dependency graph in which an arrow from A to D implies that mailbox D depends on mailbox A. For the given figure, the user needs to call done for mailboxes A, B, and C. Accordingly, the incoming edges Outside→A, Outside→B, Outside→C, and B→C are removed. The runtime can deduce when to invoke done* automatically for the dependent mailboxes D and E. Once the user performs done(B) on a partition, no more sends can be invoked on B from that partition. Still, that partition can continue to receive and process messages. This implies that messages can still be sent from the process method in any partition of mailbox B to mailbox E. Therefore, the runtime needs to ensure that done(B) has been invoked on all partitions and wait for all messages to be drained from all partitions of mailbox B. At this stage, no more messages can be sent to mailbox E from mailbox B, so the runtime can perform done* on mailbox E, which corresponds to removing edge B→E. Since B is the only source of messages for E, the runtime can then safely invoke done on E. If a mailbox depends on multiple sources, like mailbox F depending on C and E, the runtime waits for the termination of both C and E, i.e., the deletion of edges E→F and C→F (in other words, done* on mailbox F from mailboxes C and E), before invoking done(F). This procedure removes all edges except the cycle between D and G. At this point, the runtime cannot make any progress on its own; the user needs to invoke done on either D or G, and the runtime can then eventually process termination of the other mailbox as explained above.
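The edge-deletion view of termination can be sketched as a small graph simulation. `TerminationGraph` and `figure_like_graph` are hypothetical names, the edge set contains only the edges named in the text (Figure 1 itself may contain more), and the sketch ignores partitions and message draining by treating a mailbox with no incoming edges as fully drained.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>

// Toy termination graph: an edge (X, Y) means mailbox Y depends on mailbox X.
class TerminationGraph {
public:
    void add_edge(const std::string& from, const std::string& to) {
        edges_.insert({from, to});
    }

    // User-invoked done(X): remove every incoming edge of X, then let the
    // "runtime" apply done* wherever it has become possible.
    void done(const std::string& mailbox) {
        for (auto it = edges_.begin(); it != edges_.end();) {
            if (it->second == mailbox) it = edges_.erase(it);
            else ++it;
        }
        propagate();
    }

    std::size_t remaining_edges() const { return edges_.size(); }

private:
    bool has_incoming(const std::string& m) const {
        for (const auto& e : edges_)
            if (e.second == m) return true;
        return false;
    }

    // done*: a mailbox with no incoming edges can receive nothing more, so
    // each of its outgoing edges is removed individually. Edges from Outside
    // are never removed here; only a user-invoked done can remove them.
    void propagate() {
        bool changed = true;
        while (changed) {
            changed = false;
            for (auto it = edges_.begin(); it != edges_.end();) {
                if (it->first != "Outside" && !has_incoming(it->first)) {
                    it = edges_.erase(it);
                    changed = true;
                } else {
                    ++it;
                }
            }
        }
    }

    std::set<std::pair<std::string, std::string>> edges_;
};

// The edges mentioned in the discussion of Figure 1 (an assumption, see above).
TerminationGraph figure_like_graph() {
    TerminationGraph g;
    g.add_edge("Outside", "A");
    g.add_edge("Outside", "B");
    g.add_edge("Outside", "C");
    g.add_edge("B", "C");
    g.add_edge("A", "D");
    g.add_edge("B", "E");
    g.add_edge("C", "F");
    g.add_edge("E", "F");
    g.add_edge("D", "G");
    g.add_edge("G", "D");
    return g;
}
```

Under these assumptions, calling done on A, B, and C leaves exactly the two edges of the D/G cycle, and one further done on D (or G) removes the rest, matching the walkthrough above.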
For the mini-applications considered in our evaluation, the only pattern involved was a linear graph in which one mailbox depends on another. Our current implementation uses the linear graph as the default pattern.
IV-D Relaxing Access to Selector State
In a pure selector model, as mentioned in Section II, the selector is only allowed to update its local state using data from the messages it receives. We allow the user to relax this requirement so that the selector’s local state can be accessed or manipulated outside the message processing routines. In those cases, it may be necessary to interleave message processing with other parts of the user code. To accomplish this, the user can insert a yield construct in the user code, which transfers control to the selector runtime. Eventually, the runtime will process the messages and return control to the user code at the point where yield was invoked.
In this section, we discuss the implementation of the selector runtime prototype created by extending HClib, a C/C++ Asynchronous Many-Task (AMT) runtime library. We first discuss our execution model in Section V-A and then describe our extensions to the HClib runtime to support our selector runtime in Section V-B.
V-A Execution Model
Figure 2 shows the high-level structure of the execution model for our approach from the perspective of PE j, shown as process[j], with memory[j] representing that PE’s locally accessible memory. This local memory includes partitions of global distributed data, in accordance with the PGAS model. Users can create as many tasks as required by the application, which are shown as Computation Tasks. For the communication part, each mailbox corresponds to a Communication Task. All tasks are scheduled for execution on the underlying worker threads. For example, if an application uses a selector with two mailboxes and an actor/selector with one mailbox, it corresponds to three communication tasks: two for the selector and one for the actor. All computation and communication tasks are created using the HClib Asynchronous Many-Task (AMT) runtime library.
To enable asynchronous communication, the computation tasks offload all remote accesses to the communication tasks. When a computation task sends a message, the message is first pushed to the communication task associated with the mailbox using a local buffer. Eventually, the communication task uses the Conveyors library to perform message aggregation and the actual communication. Currently, we use a single worker thread that multiplexes all the tasks. When a mailbox receives a message, the mailbox’s process routine is invoked.
It is worth noting that users are also allowed to directly invoke other communication calls outside the purview of our Selector runtime. For example, the user application can directly invoke the OpenSHMEM barrier or other collectives.
V-B Selector Runtime
As mentioned earlier, a key goal of our approach is to hide the low-level details of Conveyors operations from the programmer and incorporate them into our Selector runtime instead. To reiterate, such details include maintaining the progress and the termination of communication as well as handling 1) the lack of available buffer space, and 2) the lack of an available item. This enables users to only stick with the send(), done(), and process() APIs. The implementation details of these APIs are as follows:
Selector.send(): We map each mailbox to a conveyor object. Each send to a mailbox is eventually mapped to a convey_push. Note that send does not directly invoke convey_push, because we want to relieve the computation task on which the application is running from dealing with the failure handling of convey_push. Instead, this API adds a packet with the message and the receiver PE’s rank to a small local buffer that is based on the Boost Circular Buffer library. (This local buffer is different from the conveyor’s internal buffer.) The packet is later picked up by the communication task associated with the mailbox and is passed to a convey_push operation. Whenever the mailbox’s local circular buffer is full, the runtime automatically passes control to the communication task, which drains the buffer, thereby allowing us to keep its size fixed.
Selector.done(): Analogous to send, when done is invoked, we enqueue a special packet to the mailbox that denotes the end of sending messages from the current PE to that mailbox.
Selector.process(): When the communication task receives a data packet through convey_pull, the mailbox’s process routine is invoked.
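The division of labor between send, done, and the communication task can be sketched as follows. `ToyMailbox` is a hypothetical, single-process illustration and not the HClib implementation: it uses a `std::deque` with a capacity check in place of the Boost circular buffer, and it merely records forwarded payloads where the runtime would call convey_push.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Toy mailbox partition with a small fixed-capacity local buffer.
class ToyMailbox {
public:
    explicit ToyMailbox(std::size_t capacity) : capacity_(capacity) {
        assert(capacity_ > 0);
    }

    // send(): only enqueue locally; never touch the network directly.
    void send(std::int64_t payload, int target_pe) {
        enqueue({payload, target_pe, false});
    }

    // done(): enqueue a special end-of-sending packet, like any other message.
    void done() { enqueue({0, -1, true}); }

    // Communication-task stand-in: forward buffered packets in FIFO order.
    void run_communication_task() {
        while (!buffer_.empty()) {
            Packet p = buffer_.front();
            buffer_.pop_front();
            if (p.is_done) finished_ = true;          // start of the termination phase
            else forwarded_.push_back(p.payload);     // convey_push in the real runtime
        }
    }

    bool finished() const { return finished_; }
    std::size_t peak_buffered() const { return peak_; }
    const std::vector<std::int64_t>& forwarded() const { return forwarded_; }

private:
    struct Packet { std::int64_t payload; int target_pe; bool is_done; };

    void enqueue(Packet p) {
        // Buffer full: hand control to the communication task so that the
        // buffer can keep a fixed size.
        if (buffer_.size() == capacity_) run_communication_task();
        buffer_.push_back(p);
        if (buffer_.size() > peak_) peak_ = buffer_.size();
    }

    std::size_t capacity_;
    std::size_t peak_ = 0;
    bool finished_ = false;
    std::deque<Packet> buffer_;
    std::vector<std::int64_t> forwarded_;
};
```

Because the done marker travels through the same buffer as ordinary packets, it is seen by the communication task only after every earlier send from that partition, which preserves the meaning of done as "no more sends from here".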
Worker Loop: The selector runtime creates a conveyor object for each mailbox and processes them separately within its own worker loop, as shown in Algorithm 1. When a mailbox is started, it creates a corresponding conveyor object (conv_obj) and a communication task that executes the algorithm shown in Algorithm 1. Initially, the communication task waits for data packets in the mailbox’s local buffer, which gets added when the user performs a send from the mailbox partition. During this polling for packets from the buffer, the communication task yields control to other tasks, as shown in Line 2. Once the data is added to the buffer, it breaks out of the polling loop and starts to drain elements from the buffer in Line 6. It then pushes each element in the buffer to the target PE in Line 11 until push fails. Then it removes all the pushed items from the buffer and starts the pull cycle. It pulls the received data in Line 16 and creates a computation task, which in turn invokes the mailbox’s process method, as shown in Line 18. As mentioned before, in case there is only one worker that is shared by all the tasks, we invoke the process method directly without the creation of any computation task. Once we come out of the processing of the received data, the task yields so that other communication tasks can share the communication worker.
Once the user invokes done, a special packet is enqueued to the buffer. When this special packet is processed, the is_done API in Line 5 returns true, thereby informing the conveyor object to start its termination phase. Once the communication of all remaining items is finished, the convey_advance API returns false, thereby exiting the worker loop. Finally, the communication task terminates and signals the completion of the mailbox using a promise variable named end_promise, as shown in Line 23. Signaling the promise schedules a dependent cleanup task, which informs all dependent mailboxes (as in Figure 1) about the termination of the current mailbox. This task also manages a counter to determine when all the mailboxes in the selector have performed cleanup, in order to signal the completion of the selector itself using a future variable associated with the selector. Since the selector runtime is integrated with the HClib runtime, the standard synchronization constructs of AMT runtimes, such as the finish scope and the future, can be used to coordinate with the completion of the selector. Other dependent tasks can use the future associated with the selector to wait for its completion. Users can also wait for completion by using a finish scope; for example, the communication portions of the main programs in Listing 5 and Listing 6 can each be enclosed in a finish scope.
This section presents the results of an empirical evaluation of our selector runtime system on a multi-node platform to demonstrate its performance and scalability.
Purpose: Our goal is
to demonstrate that the selector programming model approach based on partitioned global mailboxes can be used to express a range of irregular mini-applications,
to compare the performance of our approach with that of UPC, OpenSHMEM and Conveyors versions of these mini-applications.
Machine: We ran the experiments on the Cori supercomputer located at NERSC. In Cori, each node has two sockets, with each socket containing a 16-core Intel Xeon E5-2698 v3 CPU at 2.30 GHz (Haswell). For inter-node connectivity, Cori uses the Cray Aries interconnect with a Dragonfly topology that has a global peak bisection bandwidth of 45.0 TB/s. We used cray-shmem 7.7.10, Berkeley UPC 2020.4.0, and GCC 8.3.0 (all available on Cori; we believe these modules were set up using the best parameters for Cori) to build all the software. Cray-shmem and Berkeley UPC on Cori use Cross-partition memory (XPMEM) technology for cross-process mapping of user-allocated memory within a node, which enables load-and-store semantics and native atomics. We use one worker thread per PE for the experiments. If separate workers are used, the computation task needs to ensure mutual exclusion while enqueuing data to the mailbox buffer, since the communication task can simultaneously access the same buffer. Since the mini-applications have enough parallelism at one level, they are written in an SPMD manner rather than an SPMD+multi-threaded manner. Therefore, we used one worker per PE to avoid unnecessary locking overhead. Conveyors was compiled using cray-shmem for our experiments, since cray-shmem provided the best performance based on our evaluation in Table I. Conveyors can also use UPC or MPI as backends, in which case our Selectors library can also be invoked from any UPC or MPI program.
Mini-applications: We used all seven mini-applications in Bale [25, 24] that have Conveyors versions for our study. Bale can be used as a proxy for an application’s subpart involving a large number of irregular communication operations that are latency tolerant. All scalability results were obtained with weak scaling.
The first mini-application computes a histogram, using a partitioned global array to distribute elements across PEs. Each PE processes its set of indices and increments the appropriate elements, which are often remote locations in the global array. This mini-application is a simple example of the update pattern idiom described in Section III-A. In our experiments, we use a partitioned global array with each PE storing a local table of 1,000 integer elements. Each PE independently performs 10,000,000 atomic increments on this global array. With weak scaling to P PEs, the global array has 1,000 × P elements, and the total number of increments performed is 10,000,000 × P.
The second mini-application performs an index gather on a partitioned global array. This mini-application is a straightforward use case of the gather pattern described in Section III-B.
The third mini-application performs permutations on a distributed sparse matrix. It permutes the rows and columns of the matrix according to two random permutations. This mini-application uses both the update pattern and gather pattern idioms.
The fourth mini-application solves the random permutation problem, which is to generate a random permutation of a set of elements, assuming that each permutation is equally likely. This mini-application generates a random permutation in parallel using the “dart-throwing algorithm”. This mini-application uses both the update pattern and gather pattern idioms.
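The idea behind dart throwing can be sketched sequentially as follows. This is an illustrative sketch rather than the Bale implementation: the function names and the choice of a target array twice as large as the number of darts are assumptions. Each dart is thrown at random slots until it lands on an empty one, and the permutation is then read off by scanning the target array in slot order.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Sequential sketch of dart throwing for a permutation of 0..n-1.
inline std::vector<std::int64_t> dart_throw_permutation(std::int64_t n,
                                                        unsigned seed) {
    assert(n >= 0);
    std::vector<std::int64_t> perm;
    if (n == 0) return perm;

    const std::int64_t target_size = 2 * n;  // assumption: 2x oversized target
    std::vector<std::int64_t> target(target_size, -1);  // -1 marks an empty slot
    std::mt19937_64 rng(seed);
    std::uniform_int_distribution<std::int64_t> pick(0, target_size - 1);

    // Throw each dart until it sticks in an empty slot.
    for (std::int64_t dart = 0; dart < n; ++dart) {
        for (;;) {
            const std::int64_t slot = pick(rng);
            if (target[slot] < 0) {
                target[slot] = dart;
                break;
            }
        }
    }

    // Compact the target array: the order in which darts appear is the permutation.
    perm.reserve(static_cast<std::size_t>(n));
    for (std::int64_t v : target)
        if (v >= 0) perm.push_back(v);
    return perm;
}

// Checks that perm contains every value in 0..perm.size()-1 exactly once.
inline bool is_permutation_of_range(const std::vector<std::int64_t>& perm) {
    std::vector<bool> seen(perm.size(), false);
    for (std::int64_t v : perm) {
        if (v < 0 || v >= static_cast<std::int64_t>(perm.size()) || seen[v])
            return false;
        seen[v] = true;
    }
    return true;
}
```

In a distributed setting the throw corresponds to an update on a remote slot and the retry requires learning whether the throw succeeded, which is why the mini-application exercises both the update and gather idioms.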
The fifth mini-application performs topological sorting on a distributed sparse matrix. It uses an upper-triangular matrix (with ones on the diagonal) as its input. We then randomly permute the rows and columns of the upper-triangular matrix to obtain a new matrix. Our goal is to find a row permutation and a column permutation such that, when these permutations are applied, we recover an upper-triangular matrix. This mini-application mainly uses the update pattern idiom.
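One way to recover such permutations is to repeatedly peel off a row that has exactly one remaining nonzero. The sequential sketch below illustrates this idea; it is not the Bale code, and the names `SparseRows`, `TopoResult`, `toposort`, and `is_upper_triangular` are hypothetical. It assumes the input really is a row- and column-permuted upper-triangular matrix with a full diagonal.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

using SparseRows = std::vector<std::vector<std::int64_t>>;  // column indices per row

struct TopoResult {
    std::vector<std::int64_t> row_perm;  // row_perm[r] = new position of row r
    std::vector<std::int64_t> col_perm;  // col_perm[c] = new position of column c
    bool complete;                       // true if every row was placed
};

inline TopoResult toposort(const SparseRows& rows) {
    const std::int64_t n = static_cast<std::int64_t>(rows.size());
    std::vector<std::int64_t> count(n, 0), col_sum(n, 0);
    SparseRows rows_of_col(n);  // transpose: which rows touch each column
    std::queue<std::int64_t> ready;
    for (std::int64_t r = 0; r < n; ++r) {
        for (std::int64_t c : rows[r]) {
            assert(c >= 0 && c < n);
            ++count[r];
            col_sum[r] += c;  // when count[r] == 1, col_sum[r] is that column
            rows_of_col[c].push_back(r);
        }
        if (count[r] == 1) ready.push(r);
    }

    TopoResult out{std::vector<std::int64_t>(n, -1),
                   std::vector<std::int64_t>(n, -1), false};
    std::int64_t pos = n - 1;  // fill positions from the bottom-right corner
    while (!ready.empty()) {
        const std::int64_t r = ready.front();
        ready.pop();
        const std::int64_t c = col_sum[r];
        out.row_perm[r] = pos;
        out.col_perm[c] = pos;
        --pos;
        // Remove column c from every row that contains it.
        for (std::int64_t r2 : rows_of_col[c]) {
            --count[r2];
            col_sum[r2] -= c;
            if (count[r2] == 1) ready.push(r2);
        }
    }
    out.complete = (pos == -1);
    return out;
}

// True if applying the permutations puts every nonzero on or above the diagonal.
inline bool is_upper_triangular(const SparseRows& rows, const TopoResult& t) {
    if (!t.complete) return false;
    for (std::size_t r = 0; r < rows.size(); ++r)
        for (std::int64_t c : rows[r])
            if (t.col_perm[c] < t.row_perm[r]) return false;
    return true;
}
```

In the distributed setting, the column-removal loop becomes a stream of small updates to rows owned by other PEs, which is consistent with the update pattern being the dominant idiom here.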
The sixth mini-application finds the transpose of a distributed sparse matrix in parallel. This mini-application mainly uses the update pattern idiom.
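A sequential sketch of sparse transpose on a compressed-sparse-row pattern shows why the update idiom dominates: one pass counts nonzeros per column (a histogram), and a second pass scatters row indices into their new positions. The `CsrPattern` type and `transpose` function are hypothetical names, and storing only the nonzero pattern (no values) is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical pattern-only CSR matrix (no numerical values stored).
struct CsrPattern {
    std::int64_t num_rows;
    std::int64_t num_cols;
    std::vector<std::int64_t> offsets;    // size num_rows + 1
    std::vector<std::int64_t> col_index;  // column of each nonzero, row by row
};

inline CsrPattern transpose(const CsrPattern& a) {
    CsrPattern t{a.num_cols, a.num_rows,
                 std::vector<std::int64_t>(a.num_cols + 1, 0),
                 std::vector<std::int64_t>(a.col_index.size(), 0)};

    // Pass 1: count the nonzeros in each column of a (a histogram update).
    for (std::int64_t c : a.col_index) {
        assert(c >= 0 && c < a.num_cols);
        ++t.offsets[c + 1];
    }
    // Prefix sum turns the counts into row offsets of the transpose.
    for (std::int64_t i = 0; i < a.num_cols; ++i)
        t.offsets[i + 1] += t.offsets[i];

    // Pass 2: scatter each nonzero (r, c) of a into row c of the transpose.
    std::vector<std::int64_t> next(t.offsets.begin(), t.offsets.end() - 1);
    for (std::int64_t r = 0; r < a.num_rows; ++r)
        for (std::int64_t k = a.offsets[r]; k < a.offsets[r + 1]; ++k) {
            const std::int64_t c = a.col_index[k];
            t.col_index[next[c]++] = r;
        }
    return t;
}
```

For a 2×3 matrix with nonzeros in columns {0, 2} of row 0 and {1, 2} of row 1, the transpose has offsets {0, 1, 2, 4} and column indices {0, 1, 0, 1}.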
The seventh mini-application performs triangle counting in a graph. The graph is represented as an adjacency matrix, which, in turn, is stored using a sparse matrix data structure. This mini-application mainly uses the update pattern idiom.
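As an illustration, triangle counting on the strictly lower-triangular part of the adjacency matrix can be sketched sequentially as below. The function name `count_triangles` and the edge-list input are assumptions for this example and do not reflect the Bale interface. For every edge (i, j) with j < i, the sketch counts the vertices that appear in both row i and row j, so each triangle is counted exactly once, at the edge joining its two largest vertices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

inline std::int64_t count_triangles(
        int num_vertices, const std::vector<std::pair<int, int>>& edges) {
    // lower[i] holds the neighbors of i with a smaller id (row i of L).
    std::vector<std::vector<int>> lower(num_vertices);
    for (const auto& e : edges) {
        assert(e.first >= 0 && e.first < num_vertices);
        assert(e.second >= 0 && e.second < num_vertices);
        if (e.first == e.second) continue;  // ignore self loops
        lower[std::max(e.first, e.second)].push_back(
            std::min(e.first, e.second));
    }
    for (auto& row : lower) {
        std::sort(row.begin(), row.end());
        row.erase(std::unique(row.begin(), row.end()), row.end());
    }

    std::int64_t triangles = 0;
    for (int i = 0; i < num_vertices; ++i) {
        for (int j : lower[i]) {
            // Two-pointer intersection of the sorted rows i and j.
            auto a = lower[i].begin(), b = lower[j].begin();
            while (a != lower[i].end() && b != lower[j].end()) {
                if (*a < *b) ++a;
                else if (*b < *a) ++b;
                else { ++triangles; ++a; ++b; }
            }
        }
    }
    return triangles;
}
```

In a distributed version, row j typically lives on another PE, so the work for edge (i, j) is shipped to the owner of row j as a message, which matches the update idiom.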
Experimental variants: Each mini-application was evaluated by comparing the following four versions. Among these, the UPC and OpenSHMEM versions were obtained from the Bale release by replacing calls to libgetput with direct UPC and OpenSHMEM constructs (which resulted in performance similar to that of the libgetput calls). The Conveyor versions were used unchanged from the Bale release. All problem sizes are identical to those used in the Bale release with one exception: the average number of nonzeros per row was reduced from 35 to 10 for all versions of the topological-sorting mini-application to make its execution time more comparable to that of the other mini-applications.
UPC: This version is written using UPC.
OpenSHMEM: This version is written using OpenSHMEM.
Conveyor: This version directly invokes the Conveyors APIs and includes explicit handling of failure cases and communication progress.
Selector: This version uses the Selector API introduced in this paper with partitioned global mailboxes.
In Figures 3(a) to 3(g), the Y-axis shows the execution time in seconds, so smaller is better. Since we used weak scaling across PEs, ideal weak scaling would show the same execution time for a mini-application at all PE counts. From the figures, we can see that the Conveyor versions perform much better than their UPC and OpenSHMEM counterparts. For the 2048 PE/core case, the Conveyor versions show a geometric mean performance improvement of 27.77× relative to the UPC versions and 21.52× relative to the OpenSHMEM versions, across all seven mini-applications.
This justifies our decision to use the Conveyors library for message aggregation in our Selector-based approach. Overall, we see that the Selector versions also perform much better than the UPC/OpenSHMEM versions and close to the Conveyor versions. For the 2048 PE/core case, the Selector versions show a geometric mean performance improvement of 25.59× relative to the UPC versions and 19.83× relative to the OpenSHMEM versions, and a geometric mean slowdown of only 1.09× relative to the Conveyor versions. These results confirm the performance advantages of our approach, while the productivity advantages can be seen in the simpler programming interface for the Selector versions relative to the Conveyor versions.
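For readers who want to reproduce this kind of summary, the geometric mean of per-application speedups can be computed as sketched below. The function name `geometric_mean_speedup` is hypothetical, and any timings passed to it are the caller's own; no measurements from this paper are embedded in the code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Geometric mean of baseline_sec[i] / candidate_sec[i] over all applications.
// Computed in log space to avoid overflow when many ratios are multiplied.
inline double geometric_mean_speedup(const std::vector<double>& baseline_sec,
                                     const std::vector<double>& candidate_sec) {
    assert(!baseline_sec.empty());
    assert(baseline_sec.size() == candidate_sec.size());
    double log_sum = 0.0;
    for (std::size_t i = 0; i < baseline_sec.size(); ++i) {
        assert(baseline_sec[i] > 0.0 && candidate_sec[i] > 0.0);
        log_sum += std::log(baseline_sec[i] / candidate_sec[i]);
    }
    return std::exp(log_sum / static_cast<double>(baseline_sec.size()));
}
```

Because the geometric mean is multiplicative, the ratio of two improvements measured against the same baseline equals the geometric mean of the direct comparison. The figures quoted above are consistent with this: 27.77 / 25.59 ≈ 1.085 and 21.52 / 19.83 ≈ 1.085, both of which round to the reported 1.09 slowdown.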
VI-A Reducing Verbosity
The current usage of our partitioned global selector can be somewhat verbose due to the boilerplate class definition and the specification of process methods. We therefore also created a succinct version that uses C++ lambdas to remove this boilerplate, bringing the line count close to that of the high-level version in Listing 1. Listing 7 shows a version of the Histogram mini-application written using lambdas, where the message processing routine is specified as part of the send API. Compared to Listing 5, the class definition of HistoActor is no longer needed. However, due to the dynamic creation of the lambda and its larger size, performance is lower than that of the class-based version. For reference, if we extend the Histogram entry of Table I, the lambda version takes 2.6 sec and the class-based version takes 0.6 sec. Even with the additional overhead, the lambda version performs better than other highly optimized state-of-the-art communication systems such as cray-shmem (4.3 sec). We are currently developing a tool that performs source-to-source translation, as shown in Figure 4, from the lambda version to the class-based version to get the best of both worlds.
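The trade-off between the two styles can be seen in a toy, single-process model. The classes `ToyHistoActor` and `ToyLambdaMailbox` below are hypothetical stand-ins written for this illustration; they are not the HClib Selector API and perform no communication or aggregation. In the class-based style a message is just an index and the handler is a method, whereas in the lambda style every message carries its own handler and captured state.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Class-based style: small fixed-size messages, handler defined once as a method.
class ToyHistoActor {
public:
    explicit ToyHistoActor(std::vector<std::int64_t>& table) : table_(table) {}
    void send(std::int64_t index) { pending_.push(index); }
    void drain() {
        while (!pending_.empty()) {
            process(pending_.front());
            pending_.pop();
        }
    }
private:
    void process(std::int64_t index) {
        assert(index >= 0 && index < static_cast<std::int64_t>(table_.size()));
        table_[index] += 1;
    }
    std::vector<std::int64_t>& table_;
    std::queue<std::int64_t> pending_;
};

// Lambda style: no user-defined class, but each message is a type-erased
// callable that is larger and more costly to create than a bare index.
class ToyLambdaMailbox {
public:
    void send(std::function<void()> handler) {
        pending_.push(std::move(handler));
    }
    void drain() {
        while (!pending_.empty()) {
            std::function<void()> handler = std::move(pending_.front());
            pending_.pop();
            handler();
        }
    }
private:
    std::queue<std::function<void()>> pending_;
};
```

A caller of the lambda style would write something like `box.send([&table, i] { table[i] += 1; });`, which is shorter than defining an actor class but moves a callable object, rather than a single integer, through the mailbox.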
VII Related Work
The Chare abstraction in Charm++ takes inspiration from the Actor model and is also designed for scalability. As indicated earlier, the performance of Charm++ is below that of Conveyors (and hence that of our approach) for the workloads studied in this paper.
In the past, there has been much work on optimizing the communication of PGAS programs through communication aggregation. Alvanos et al. used techniques such as static coalescing and the inspector-executor model to optimize communication in UPC. Chavarria-Miranda and Mellor-Crummey performed communication coalescing when generating code for regular, data-parallel applications in High-Performance Fortran (HPF). Hayashi et al. introduced new PGAS language-aware LLVM passes to reduce communication overheads. Wesolowski et al. introduced the TRAM library, which optimizes communication by routing and combining short messages. Jenkins et al. created the Chapel Aggregation Library (CAL), which aggregates user-defined data using an Aggregator object. UPC [9, 8] performs automatic message aggregation to improve the performance of fine-grained communication but is unable to match the performance of user-directed message aggregation. We use Conveyors, which is a modular, portable, efficient, and scalable library, as our message aggregation runtime.
There have also been several efforts to integrate Actors with PGAS programming models. Shali and Lin explored the addition of the Actor model to the Chapel language as a message-passing layer. Roloff et al. included Actors as a library in X10 called ActorX10. Pöppl et al. later added an Actor library to UPC++ and ported a shallow water application to it. None of these integrations demonstrates the use of an actor/selector as a high-level abstraction to aggregate communication and support automatic termination detection.
Finally, another major difference between our approach and past work is that we introduce a highly scalable actor/selector interface that can be integrated with any commonly used one-sided communication library (e.g., OpenSHMEM/UPC/MPI3-RMA) rather than creating a standalone programming model.
VIII Conclusions and Future Work
This paper proposes a scalable programming system for PGAS runtimes to accelerate irregular distributed applications. Our approach is based on the actor/selector model and introduces the concept of a Partitioned Global Mailbox. Subparts of the application that involve a large number of latency-tolerant communication operations can use our system to achieve high throughput, while other parts of the application can keep using the PGAS communication system/runtime directly. Actors are often used as the concurrency mechanism in newer languages such as Scala or Rust and in other domains such as the cloud. Thus, the addition of actors to the PGAS ecosystem creates a low-overhead path for users from other domains to develop high-performance computing applications without ramping up on non-blocking primitives. Moreover, we have shown that our Actor system outperforms the non-blocking operations in state-of-the-art communication libraries/systems by a wide margin, thereby demonstrating the need to add or improve message aggregation in such libraries. Our programming system also abstracts away low-level details of message aggregation (e.g., manipulating local buffers and managing progress and termination) so that the programmer can work with a high-level selector interface. Further, this approach can be integrated with the standard synchronization constructs in asynchronous task runtimes (e.g., async-finish, future constructs). Our Actor runtime is more than a message-aggregation system since it also supports user-defined active messages, which can support the migration of computation closer to data, a capability that is beneficial for irregular applications. Our implementation, which is based on the HClib runtime and the Conveyors library, demonstrates a desirable intermediate point in the productivity-performance space, with scalable performance that approaches that of user-initiated message aggregation and productivity that approaches that of standard PGAS programming.
For the 2048 PE case, our approach shows a geometric mean performance improvement of 25.59× relative to the UPC versions and 19.83× relative to the OpenSHMEM versions, and a geometric mean slowdown of only 1.09× relative to the Conveyors versions.
In future work, we plan to support variable-sized messages, since our current Mailbox implementation only accepts messages of a fixed size that is specified when creating a selector. Further, the original Selectors model allows for operations such as mailbox priorities and enable/disable operations on mailboxes; support for such operations could enable richer forms of coordination logic across messages in our implementation. Finally, it would be interesting to explore compiler extensions that automatically translate from the natural version to our selector version, thereby directly improving the performance of natural PGAS programs.
[1] (2014) Actors programming for the mobile cloud. In 2014 IEEE 13th International Symposium on Parallel and Distributed Computing, pp. 3–9.
[2] ACTORS - a model of concurrent computation in distributed systems. MIT Press series in artificial intelligence, MIT Press.
[3] (2013) Improving communication in PGAS environments: static and dynamic coalescing in UPC. In International Conference on Supercomputing, ICS’13, Eugene, OR, USA - June 10 - 14, 2013, A. D. Malony, M. Nemirovsky, and S. P. Midkiff (Eds.), pp. 129–138.
[4] (2004) The cascade high productivity language. In 9th International Workshop on High-Level Programming Models and Supportive Environments (HIPS 2004), 26 April 2004, Santa Fe, NM, USA, pp. 52–60.
[5] (1999) Introduction to UPC and language specification. Technical Report CCS-TR-99-157, IDA Center for Computing Sciences.
[6] (2010) Introducing OpenSHMEM: SHMEM for the PGAS community. In Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, PGAS 2010, New York, NY, USA, October 12-15, 2010, J. E. Moreira, C. Iancu, and V. A. Saraswat (Eds.), pp. 2.
[7] (2005) Effective communication coalescing for data-parallel applications. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2005, June 15-17, 2005, Chicago, IL, USA, K. Pingali, K. A. Yelick, and A. S. Grimshaw (Eds.), pp. 14–25.
[8] (2007) Automatic nonblocking communication for partitioned global address space programs. In Proceedings of the 21st Annual International Conference on Supercomputing, ICS ’07, New York, NY, USA, pp. 158–167.
[9] (2004-12) Building a source-to-source UPC-to-C translator. Technical Report UCB/CSD-04-1369, EECS Department, University of California, Berkeley.
[10] (2020) Future.
[11] Boost.Circular Buffer. Note: https://www.boost.org/doc/libs/1_72_0/doc/html/circular_buffer.html [Online; accessed 20-Apr-2020].
[12] (1996) Efficient low-contention parallel algorithms. J. Comput. Syst. Sci. 53 (3), pp. 417–442.
[13] (2016) Integrating asynchronous task parallelism with OpenSHMEM. In Third Workshop, OpenSHMEM 2016, Baltimore, MD, USA, August 2-4, 2016, Vol. 10007, pp. 3–17.
[14] (2017) A pluggable framework for composable HPC scheduling libraries. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops 2017, Orlando / Buena Vista, FL, USA, May 29 - June 2, 2017, pp. 723–732.
[15] (2018) Programming artificial intelligence with Chapel and Deep 6 AI.
[16] (2015) LLVM-based communication optimizations for PGAS programs. In Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC, LLVM 2015, Austin, Texas, USA, November 15, 2015, H. Finkel (Ed.), pp. 1:1–1:11.
[17] (1973) A universal modular ACTOR formalism for artificial intelligence. In Proceedings of the 3rd International Joint Conference on Artificial Intelligence, Stanford, CA, USA, August 20-23, 1973, N. J. Nilsson (Ed.), pp. 235–245.
[18] (2012) Integrating task parallelism with actors. In Proceedings of the 27th Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2012, part of SPLASH 2012, Tucson, AZ, USA, October 21-25, 2012, G. T. Leavens and M. B. Dwyer (Eds.), pp. 753–772.
[19] (2014) Selectors: actors with multiple guarded mailboxes. In Proceedings of the 4th International Workshop on Programming based on Actors Agents & Decentralized Control, AGERE! 2014, Portland, OR, USA, October 20, 2014, pp. 1–14.
[20] (2018) Chapel aggregation library (CAL). In 2018 IEEE/ACM Parallel Applications Workshop, Alternatives To MPI, PAW-ATM SC 2018, Dallas, TX, USA, November 16, 2018, pp. 34–43.
[21] (2020) Introduction to the actor model.
[22] (1993) CHARM++: A portable concurrent object oriented system based on C++. In Conference on Object-Oriented Programming Systems, Languages, and Applications (OOPSLA), Eighth Annual Conference, Washington, DC, USA, September 26 - October 1, 1993, Proceedings, T. Babitsky and J. Salmons (Eds.), pp. 91–108.
[23] (2017) A case for migrating execution for irregular applications. In Proceedings of the Seventh Workshop on Irregular Applications: Architectures and Algorithms, IA3@SC 2017, Denver, CO, USA, November 12 - 17, 2017, pp. 6:1–6:8.
[24] (2019) Conveyors for streaming many-to-many communication. In 9th IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms, IA3 SC 2019, Denver, CO, USA, November 18, 2019, pp. 1–8.
[25] (2020) A collection of buffered communication libraries and some mini-applications. Note: https://github.com/jdevinney/bale [Online; accessed 20-Apr-2020].
[26] (1998-08) Co-array Fortran for parallel programming. SIGPLAN Fortran Forum 17 (2), pp. 1–31.
[27] (2021) Shmem get nbi selector.
[28] (2021) Shmem put nbi selector.
[29] (2019-11) A UPC++ actor library and its evaluation on a shallow water proxy application. In 2019 IEEE/ACM Parallel Applications Workshop, Alternatives To MPI (PAW-ATM), pp. 11–24.
[30] (2016) ActorX10: an actor library for X10. In Proceedings of the 6th ACM SIGPLAN Workshop on X10, X10 PLDI 2016, Santa Barbara, CA, USA, June 14, 2016, C. Fohry and O. Tardieu (Eds.), pp. 24–29.
[31] (2010) Actor oriented programming in Chapel.
[32] Scalable machine learning with OpenSHMEM.
[33] (2009) Concurrency in Erlang and Scala: the actor model.
[34] (2014) TRAM: optimizing fine-grained communication with topological routing and aggregation of messages. In 43rd International Conference on Parallel Processing, ICPP 2014, Minneapolis, MN, USA, September 9-12, 2014, pp. 211–220.
[35] (2007) Productivity and performance using partitioned global address space languages. In Parallel Symbolic Computation, PASCO 2007, International Workshop, 27-28 July 2007, University of Western Ontario, London, Ontario, Canada, M. M. Maza and S. M. Watt (Eds.), pp. 24–32.