I Introduction
Many applications within scientific computing and big data analytics rely heavily on efficient implementations of sparse linear algebra algorithms, such as sparse matrix-vector multiplication (SpMV). Examples include conjugate gradient solvers [1], tensor decomposition [2], and graph analytics [3]. Unlike dense linear algebra algorithms, sparse kernels present difficult challenges to achieving high performance on today's common architectures. These challenges include irregular access patterns and weak locality, which impede static optimizations and efficient cache utilization. As a result, there has been an abundance of research regarding the design of data structures and algorithms to take advantage of the capabilities of today's systems, which include deep-memory-hierarchy architectures and graphics processing units (GPUs) [4].

Beyond designing data structures and algorithms for sparse kernels that conform to existing systems, there have been efforts to develop novel architectures that are better suited for sparse algorithms. One such effort is the Emu architecture [5], a cacheless system centered around lightweight migratory threads and near-memory processing capabilities. The premise of this architecture is that the challenges posed by irregular applications can be overcome through the use of fine-grained memory accesses that reduce the memory system load by transferring only lightweight thread contexts. The Emu architecture is described in Section II.
To determine the efficacy of such a novel architecture, it is insightful to understand how the impact of existing optimizations for sparse algorithms differs between Emu and cache-memory-based systems. To this end, we explore SpMV on the Emu architecture and investigate the effects of traditional sparse optimizations such as vector data layouts, work distributions, and matrix reorderings. We focus on SpMV as it is one of the most prevalent sparse kernels, is found across a wide range of applications, and exhibits the algorithmic traits that the Emu architecture targets. Our implementation leverages the standard Compressed Sparse Row (CSR) data format for storing matrices.
This paper’s contributions are as follows:

We implement a standard CSR-based SpMV algorithm for the Emu architecture with two different data layout schemes for the vectors and two different work distribution strategies.

We conduct a performance evaluation of our implementation and the different sparse optimizations across a set of real-world matrices on the Emu Chick system.

We find that initially distributing work evenly across the system is inadequate to maintain load balancing over time due to the migratory nature of Emu threads.

We demonstrate that traditional matrix reordering techniques can improve SpMV performance on the Emu architecture by as much as 70% by encouraging sustained load balancing. On the other hand, we find that the performance gains of the same reordering techniques on a cache-memory-based system are no more than 16%.
The rest of this paper is organized as follows: Section II provides a brief overview of the Emu architecture. The details of our CSRbased SpMV implementation and sparse optimizations are presented in Section III. Section IV describes our experimental setup and the results of our performance study. Related work is presented in Section V and we provide concluding remarks in Section VI.
II Emu Architecture
The unique aspects of the Emu architecture are migratory threads and memory-side processing. These features are designed to improve the performance of code containing irregular memory access patterns [5]. Rather than fetching data from memory and bringing it to the location of the computation, the Emu architecture sends or migrates the computation to that data. This is accomplished by forcing threads to execute on processors that are co-located with the accessed data or by sending computation to co-located memory-side processors.
II-A Emu Nodelet Architecture
The basic building block of an Emu system is a nodelet, which consists of one or more Gossamer Cores, a bank of narrow-channel DRAM, and a memory-side processor. Eight nodelets are combined together to make up a single node. Figure 1 depicts one such node within the Emu architecture.
A Gossamer Core (GC) is a general-purpose, cacheless processing unit developed specifically for the Emu architecture. It supports the execution of up to 64 concurrent lightweight threads. A GC can issue a single instruction every cycle, and each thread on a GC is limited to one active instruction at any given time. Such restrictions, coupled with the lack of caches, simplify the logic required by a GC.
The target applications for the Emu architecture have irregular access patterns and little spatial locality. Because such applications often require 8 bytes of memory per access, it is inefficient to load 64-byte blocks from main memory. These larger accesses are often required by other architectures due to standard DDR interfaces and cache line sizes. The narrow-channel DRAM within an Emu system is designed to support narrower accesses by using eight 8-bit channels rather than a single, wider 64-bit interface.
When a thread on a GC makes a memory request to a remote address, a migration is generated. A migration involves a GC issuing a request to the Nodelet Queue Manager (NQM) to migrate the thread context to the nodelet that contains the desired data. The NQM interfaces with the Migration Engine (ME), which is the communication fabric that connects multiple nodelets together. The thread context sits in the migration queue until it is accepted by the ME, at which point it is sent over the ME and is processed by the destination nodelet’s NQM. An executable’s compiled code is replicated on each nodelet, so the thread can resume execution without being aware that it was migrated. By limiting the size of a thread context to roughly 200 bytes [6], the Emu architecture is able to keep the cost of a migration low. We have observed that a memory access that requires a migration is roughly 2x slower than a memory access without a migration.
The current Emu architecture supports a range of atomic operations on 64-bit data. These include add, AND, OR, XOR, min, and max. Atomic operations are handled by the memory-side processor on each nodelet. For those atomic and store operations that do not return a value to a thread, the memory-side processor can perform the operation on behalf of a thread executing on a different nodelet without generating a migration. Such operations are referred to as remote updates. A remote update generates a packet that is sent to the nodelet owning the remote memory location. The packet contains the data as well as the operation to be performed. While remote updates do not return results, they generate acknowledgements that are sent back to the source thread. A thread cannot migrate until all outstanding acknowledgements have been received.
Each nodelet has three queues that hold threads in various states. The run queue consists of threads that are waiting for an available register set on a GC. The migration queue holds threads, or packets, that are waiting to depart the nodelet. Packets containing remote updates are held in the memory queue until they can be executed by the memoryside processor. In the current architecture, the number of active threads on a nodelet can be throttled depending on the availability of resources on that nodelet, including space in these queues.
In this work, we perform our experiments on the Emu Chick system, which consists of 8 nodes connected together by a RapidIO network. In the current Emu Chick system, each nodelet consists of only one GC clocked at 150 MHz and 8 GB of narrow-channel DDR4 memory clocked at 1600 MHz. The GC and ME on each nodelet are implemented on an Arria 10 FPGA.
II-B Data Distribution Support
There are two basic functions for specifying where allocated memory is placed within the system. The mw_malloc1dlong function allocates an array of 64-bit elements that is distributed cyclically, element by element, across the available nodelets in the system. The mw_malloc2d function allocates an array of pointers that is distributed cyclically, element by element, across the available nodelets, where each pointer points to a block of co-located memory of a specified size. While mw_malloc2d does not directly support variable-size blocks, one can achieve such a layout by invoking mw_malloc2d with a block size of 1 and then using standard malloc to allocate the desired block size on each of the nodelets.
III Implementation
In this section, we provide details regarding our implementation of the Compressed Sparse Row (CSR) storage format and its accompanying SpMV algorithm on the Emu architecture. We also describe how we achieve different data layouts and work distribution strategies. For the remainder of this paper, we will refer to a generic sparse matrix A as having m rows, n columns, and NNZ nonzeros. We consider b = Ax as the formulation of SpMV, where x and b are dense vectors of length n and m, respectively.
III-A Compressed Sparse Row Format
We adopt the standard CSR storage format, which stores A as three separate arrays: values stores all the nonzero values of A, colIndex stores the column indices of all the nonzeros and rowPtr stores pointers into colIndex that correspond to the start of each row. We distribute the rows of A across the desired number of nodelets such that each nodelet will have local access to all portions of the CSR arrays it needs to traverse its assigned rows. An illustration of this distribution for 4 nodelets is shown in Figure 2, where each nodelet is assigned two rows. The rows assigned to a given nodelet can then be distributed among the desired number of threads spawned on the nodelet. Note that each nodelet stores a “mini” CSR matrix for its rows with relative row offsets.
To perform SpMV, we start by spawning a “parent” thread on each of the desired nodelets. Each of these parent threads then spawns the desired number of worker threads to process the rows assigned to the nodelet. In order to make updates to b during SpMV, each worker thread needs to be aware of the absolute index of its assigned rows, as only relative offsets are stored in the nodelet’s rowPtr array. We accomplish this by passing in the absolute row index of the first row assigned to each nodelet when we spawn the parent threads. Worker threads migrate to and from the nodelets as needed to access elements of x and b. The rest of the SpMV algorithm is unchanged from the standard CSR implementation.
III-B Vector Data Layout
While it is possible to enforce only local accesses to the CSR arrays for SpMV, accesses to the vectors are much more challenging to control. As these memory accesses largely dictate the overall performance of SpMV, it is crucial to address the data layout for x and b on the Emu architecture.
Since the nonzeros in row i of A will only be assigned to a single worker thread, that thread can accumulate the updates to b_i in a local register and then issue a single store to b_i. Therefore, fully computing b requires only m stores to b, which do not require migrations as they are either local writes or remote updates. On the other hand, there are a total of NNZ loads from x required for SpMV, each of which may require a migration. As it is often the case that NNZ >> m for most sparse matrices, the layout of x has more bearing on performance than that of b.
We implement two different data layouts for both x and b: cyclic and block. In a cyclic layout, adjacent elements of a vector are stored on different nodelets in a round-robin fashion such that each consecutive access requires a migration. For a block layout, contiguous elements of a vector are stored in a fixed-size block on each nodelet. Assuming a block size of k elements, one migration will be required for every k consecutive accesses. For our approach, we use the same block size on each nodelet, which is m/N for b and n/N for x, where N is the number of nodelets utilized.
III-C Work Distribution Strategies
We explore two different strategies for distributing work across the nodelets: one that only considers the number of rows in A and one that also factors in the number of nonzeros on each row. For the row-based approach, we evenly distribute the rows of A to each nodelet and then further divide those blocks of rows among the worker threads utilized by each nodelet. When using the block layout for b, the block size is equal to the number of rows assigned to each nodelet via the row distribution strategy. Therefore, the worker threads on a nodelet will have local access to the elements of b that need to be updated.
While each nodelet may receive the same number of rows via the row approach, the sparsity pattern of the matrix may result in some worker threads being assigned a significantly different number of nonzeros than others. Since the number of nonzeros given to a nodelet largely dictates its workload for SpMV, this can lead to load imbalance. To mitigate this, the nonzero approach distributes the rows of A to each worker thread such that the total number of nonzeros assigned to each thread is roughly the same. We achieve this by iterating over rowPtr and accumulating rows until the threshold of NNZ/T is met, where T is the total number of worker threads used across all of the nodelets. For matrices with very irregular sparsity patterns, this can result in a given nodelet being assigned a significantly different number of rows than another. In such cases, a block layout for b no longer guarantees that the required elements of b will be local to each nodelet.
IV Performance Evaluation
In order to understand the impact of traditional sparse optimizations on a migratory thread architecture, we evaluated our SpMV implementation on the Emu Chick system across a range of real-world matrices. In this section, we describe our experimental setup and then present the results of several different experiments that evaluate the sparse optimizations described in Section III.
IV-A Experimental Setup
For our experiments, we ran on a single node of the Emu Chick system, as described in Section II, and used version 18.04.1 of the Emu toolchain. Multi-node execution on the Emu Chick hardware was not reliable enough to conduct our tests, so we limit our experiments to a single node and leave multi-node tests for future work. We utilize all 8 nodelets on a node and leverage 64 worker threads per nodelet.
We executed our SpMV implementation across a suite of 40 different matrices. In the following sections, we focus our evaluation on the matrices shown in Table I, which are representative of the suite and highlight the most interesting performance characteristics. All matrices were obtained from the University of Florida Sparse Matrix Collection [7] with the exception of rmat, which is an RMAT graph that was generated with RMAT a, b and c parameters of 0.45, 0.22 and 0.22, respectively [8]. For the symmetric matrices, we store and operate on the entire matrix rather than just the upper or lower triangular matrix. All results reported are the average of 10 trials.
TABLE I

Name       | Dimensions  | Nonzeros | Density  | Symmetric
ford1      | 18k x 18k   | 100k     | 2.9E-04  | Yes
cop20k_A   | 120k x 120k | 2.6M     | 1.79E-04 | Yes
webbase1M  | 1M x 1M     | 3.1M     | 3.11E-06 | No
rmat       | 445k x 445k | 7.4M     | 3.74E-05 | No
nd24k      | 72k x 72k   | 28.7M    | 5.54E-03 | Yes
audikw_1   | 943k x 943k | 77.6M    | 8.72E-05 | Yes
IV-B Cyclic Versus Block Vector Data Layout
Figure 3 shows the achieved SpMV bandwidth for the cyclic and block vector layouts across the different matrices. A row-based work distribution strategy was employed for these results. The block layout outperforms the cyclic layout on each matrix, achieving up to 25% more bandwidth.
We can study the sparsity patterns of the matrices, as shown in Figure 4, to understand why the block layout performs better than the cyclic layout. A particular sparsity pattern that offers significant benefits for the block layout is one in which a majority of the nonzeros are clustered around the main diagonal of the matrix. Since the matrices in Table I are square, the length of both x and b is equal to the number of rows in A. With the rowbased work distribution, the number of rows assigned to each nodelet is equal to the block size used for the vectors. For a matrix where the nonzeros are clustered on the main diagonal, this means that very few migrations are incurred when accessing x. Figure 5 illustrates this for the ford1 matrix across 8 nodelets. As can be seen, a majority of the nonzeros are contained in the shaded boxes, which represent local accesses to x. If x were distributed in a cyclic layout, then we could not exploit such a sparsity pattern, as consecutive accesses to x would cause migrations. Indeed, we observe that the block layout generates 1.42x – 6.3x fewer migrations than the cyclic layout across the matrices in Table I.
For the remainder of the results, we will assume a block data layout for x and b.
IV-C Row Versus Nonzero Work Distribution
Figure 6 presents the SpMV bandwidth results of the row and nonzero work distribution strategies across the matrices. We observe that the nonzero approach consistently outperforms the row approach, achieving as much as 3.34x more bandwidth. As described in Section III-C, the row-based approach can lead to severe work imbalances by assigning a given nodelet a block of rows with a significantly different number of nonzeros than other nodelets.
To quantify the performance advantage of the nonzero work distribution, Figure 7 shows the coefficient of variation (CV) for the number of memory instructions executed by each nodelet. The CV is the standard deviation divided by the mean and is a measure of relative variability. Since SpMV is memory-bound, a low memory instruction CV indicates that each nodelet completed roughly the same amount of work. Indeed, we observe a significantly lower CV for the nonzero approach across all of the matrices.
While the work is distributed more evenly with the nonzero approach and we observe higher bandwidth, the nonzero strategy incurs an average of 1.69x more migrations than the row-based approach. This is because the row strategy coupled with a block layout for the vectors works to minimize migrations, especially for matrices with a dense main diagonal. However, the nonzero distribution does not necessarily assign equally sized blocks of rows to each nodelet. The result is that the block data layout for the vectors is less successful at minimizing migrations as a nodelet's assigned rows and nonzeros are not necessarily aligned with the block partitions of x and b.
Despite incurring more migrations, the nonzero approach offers better performance than the row-based approach. This suggests that the penalty of migrations can be offset by more uniform work distribution and load balancing. We discuss this topic in more detail in the next section. For the remainder of the results, we assume the use of the nonzero work distribution strategy.
IV-D Hardware Load Balancing
On a traditional cache-memory-based system, both memory access locality and hardware load balancing for SpMV can be controlled by distributing the nonzeros among the threads and binding the threads to hardware resources such as cores. However, the Emu architecture differs because threads cannot be isolated to specific hardware resources, such as a Gossamer Core, due to their migratory behavior. To bind Emu threads to cores, one would need to read only from local memory, avoiding migrations entirely. At the other extreme, despite best efforts to initially lay out and distribute work evenly across the nodelets, it is possible for all of the threads to migrate to a single nodelet and oversubscribe that nodelet's resources. In general, the layout of an application's data structures across the nodelets, as well as its memory access pattern, determines the load balancing of the hardware.
Consider the cop20k_A matrix as shown in Figure 4. Regardless of how the rows are distributed among the nodelets, a large portion of the total nonzeros require access to elements of x that all reside on the same nodelet. Specifically, 25% of the total nonzeros in the matrix require access to elements of x that are stored on nodelet 0. Within the Emu architecture, this results in a majority of the threads all migrating to the same nodelet at roughly the same time.
The particular load imbalance scenario for the cop20k_A matrix is shown in Figure 8, where we monitor the total number of threads residing on each nodelet during the SpMV execution, including those executing and those waiting in the run queues. By the time 150 ms have elapsed, all of the nodelets except for nodelet 0 have ceased significant activity, indicating that there is a clear load imbalance of the system resources. However, as described above, we would expect to see higher than average activity on nodelet 0 due to all of the threads requiring access to elements of x stored on nodelet 0. Instead, the number of threads on nodelet 0 is, on average, 32, while the other nodelets maintained between 53 and 75 threads on average.
To understand this behavior, we observed the sizes of the migration queue on each nodelet over the SpMV execution. We found that nodelet 0 experiences an immediate surge of packets into its migration queue. This is due to a majority of the worker threads migrating to nodelet 0 to access x and then requiring a migration back to their parent nodelet to access their CSR arrays. Nodelet 1 also exhibits noticeable activity, as it too holds elements of x that are required by a large number of nonzeros residing on other nodelets. In the current Emu architecture, thread activity on a nodelet is throttled based on the nodelet's available resources, which includes space in the migration queue. Because the migration queue on nodelet 0 is immediately filled nearly to capacity, the nodelet reduces the number of threads that can be executed. It is not until the other nodelets approach completion of their work that the migration queue on nodelet 0 starts to empty out and thread activity increases. We note that running SpMV on the cop20k_A matrix with fewer threads per nodelet provides better load balancing by reducing the pressure on the migration queue, thus allowing more threads to be active on nodelet 0. This suggests that as systems scale up to more nodelets and threads, load balancing issues due to thread migration hotspots will require attention.
IV-E Matrix Reordering
As shown in the previous section, the sparsity pattern of a given matrix can have profound impacts on SpMV performance, despite efforts to properly lay out data structures and distribute work evenly to all of the threads. There are many known techniques to reorder the nonzeros of a matrix in order to improve locality and data reuse on traditional cache-memory-based systems. We investigated whether these existing reordering algorithms could offer similar performance benefits on the Emu system as well as mitigate potential hardware load balancing issues. We focus on the following reordering techniques: Breadth-First Search (BFS) [9], METIS [10], and Random. A random reordering performs a random permutation of the matrix rows and columns using a Fisher-Yates shuffle. As an example, Figure 9 shows the original cop20k_A matrix and the results of the three reordering algorithms.
Figure 10 presents the achieved SpMV bandwidth for the different reordering techniques across the matrices, where NONE refers to the original matrix. We find that BFS and METIS generally offer the best performance, achieving up to 70% more bandwidth than the original matrix. BFS and METIS both attempt to move the nonzeros towards the main diagonal, and they tend to put an equal number of nonzeros on each row. Having the nonzeros clustered on the main diagonal allows us to exploit the block data layout of the vectors and reduce the total number of migrations, as described in Section IV-B. Since BFS and METIS tend to produce balanced rows, both of the work distribution strategies achieve roughly the same outcome: each nodelet is assigned an equal number of rows and the total number of nonzeros assigned to each nodelet is roughly the same. Therefore, these reordering techniques allow us to maintain an equal amount of activity on each nodelet and mitigate hotspots by encouraging threads to be "pinned" to their parent nodelet.
However, an interesting result from Figure 10 is that a random reordering can offer up to a 50% increase in bandwidth over the original matrix, and in some cases, outperform BFS or METIS. The random reordering has the effect of also producing balanced rows by uniformly spreading out the nonzeros rather than clustering them together. As one would expect, this results in more migrations than the other techniques, but it provides a natural hotspot mitigation for SpMV. This is because it is very unlikely that a majority of the threads would all converge onto the same nodelet at the same time. We observe that such an effect is similar to the distributed randomized algorithm for packet routing proposed by Valiant [11], which prevents multiple packets from being sent across the same wire at the same time. As alluded to at the end of Section IV-C, the cost of extra migrations can be overcome by better load balancing. Indeed, we can see the improvement in load balancing achieved by the random reordering in Figure 11, which tracks the total number of threads on each nodelet during SpMV for the cop20k_A matrix.
The results from Figure 10 highlight two approaches for achieving hardware load balancing on the Emu architecture: (1) assign an equal amount of work to each nodelet and lay out the data so that threads rarely need to migrate off of their parent nodelet, and (2) assign an equal amount of work to each nodelet and lay out the data so that threads will rarely converge onto the same nodelet at the same time. The first approach, as achieved by BFS and METIS, attempts to enforce the original intent of the data layout and work distribution, and generally offers the best performance. The second approach, as achieved by the random reordering, incurs more migrations but can be useful due to the minimal amount of work required to perform the reordering.
It is worth noting the difference in impact of the matrix reordering techniques on the Emu architecture when compared to a traditional cache-memory-based architecture. Figure 12 shows the achieved bandwidth for an identical SpMV implementation on a dual-socket Broadwell Xeon system with 45 MB of last-level cache. We observe that BFS and METIS only achieve up to 12% and 16% higher bandwidth over the original matrix, respectively. Furthermore, a random reordering never outperforms the original matrix, and in general, is considerably worse than the other reordering techniques. This behavior is what we would expect, as the random reordering has poor spatial locality. The difference in time to access the L1 cache versus main memory can be on the order of 100–200x, which is much more severe than the relative cost of a migration on the Emu architecture. Such a penalty cannot easily be amortized by the load balancing benefits provided by a random reordering.
V Related Work
The Emu architecture is described in detail by Dysart et al. [5], which also gives initial performance results obtained from a simulator of the architecture. Also using a simulator of an Emu system, Minutoli et al. [12] present an implementation of radix sort and benchmark its performance on up to 128 nodelets. Hein et al. [6] performed an evaluation of several microbenchmarks, including CSR-based SpMV, on actual Emu Chick system hardware. While our work is similar to Hein et al., there are significant differences. In that work, x was replicated on each nodelet and the entirety of b was placed on a single nodelet. Furthermore, their evaluation only considered synthetically generated Laplacian matrices. In our work, A, x, and b are all distributed in some fashion across the nodelets, with no portion of the vectors being replicated. Additionally, we run experiments on real-world sparse matrices that are drawn from a variety of different domains.
VI Conclusions
As migratory thread architectures, such as Emu, mature and evolve, optimizing sparse codes to capitalize on these new architectures' unique strengths will be increasingly important. We evaluated several traditional sparse optimizations on the Emu architecture, including vector data layout, work distribution, and matrix reordering. Our findings can be summarized as follows:

While designing data structures and algorithms to minimize migrations is generally a good strategy, we found that work distribution and load balancing are of similar importance for achieving high performance.

Unlike traditional systems, it is very difficult to explicitly enforce hardware load balancing for the Emu architecture due to thread migration. Specifically, the placement of the data to be accessed and the patterns of these accesses entirely dictate the work performed by a given hardware resource, irrespective of how much work is initially delegated to each processing element.

The impact of employing known matrix reordering techniques is more significant on the Emu architecture than a traditional cache-memory-based system. We found that the METIS and BFS matrix reordering techniques can increase performance by as much as 70% on Emu while we observed a maximum gain of 16% on a traditional architecture. Furthermore, a completely random reordering of the rows and columns can exhibit better performance on Emu than not reordering at all, which contradicts what we observe on a traditional system.
For future work, we would like to reevaluate our performance study on the newly upgraded Emu hardware and toolchain (version 18.08.1), which includes a faster Gossamer Core clock rate and hotspot mitigation improvements. We are also interested in more thoroughly investigating multinode performance, specifically on hardware, as thus far we have only been able to do so via simulation. We would also like to evaluate other sparse matrix formats, including new formats targeted specifically at the Emu architecture. Furthermore, we are interested in investigating prior work by Valiant [11] on randomized data distributions and how it can apply to data layout schemes on Emu.
References
 [1] D. P. O'Leary, "The block conjugate gradient algorithm and related methods," Linear Algebra and its Applications, vol. 29, pp. 293–322, 1980, Special Volume Dedicated to Alson S. Householder. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0024379580902475
 [2] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, "SPLATT: Efficient and parallel sparse tensor-matrix multiplication," 29th IEEE International Parallel & Distributed Processing Symposium, 2015.
 [3] A. Ashari, N. Sedaghati, J. Eisenlohr, S. Parthasarathy, and P. Sadayappan, "Fast sparse matrix-vector multiplication on GPUs for graph applications," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '14. Piscataway, NJ, USA: IEEE Press, 2014, pp. 781–792. [Online]. Available: https://doi.org/10.1109/SC.2014.69
 [4] A. Monakov, A. Lokhmotov, and A. Avetisyan, "Automatically tuning sparse matrix-vector multiplication for GPU architectures," in High Performance Embedded Architectures and Compilers, Y. N. Patt, P. Foglia, E. Duesterwald, P. Faraboschi, and X. Martorell, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 111–125.
 [5] T. Dysart, P. Kogge, M. Deneroff, E. Bovell, P. Briggs, J. Brockman, K. Jacobsen, Y. Juan, S. Kuntz, R. Lethin, J. McMahon, C. Pawar, M. Perrigo, S. Rucker, J. Ruttenberg, M. Ruttenberg, and S. Stein, “Highly scalable near memory processing with migrating threads on the Emu system architecture,” in Proceedings of the Sixth Workshop on Irregular Applications: Architectures and Algorithms. Piscataway, NJ, USA: IEEE Press, 2016, pp. 2–9. [Online]. Available: https://doi.org/10.1109/IA3.2016.7
 [6] E. Hein, T. Conte, J. Young, S. Eswar, J. Li, P. Lavain, R. Vuduc, and J. Riedy, “An initial characterization of the Emu Chick,” in The 8th International Workshop on Accelerators and Hybrid Exascale Systems (AsHES), 2018.
 [7] T. A. Davis and Y. Hu, “The University of Florida Sparse Matrix Collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1:1–1:25, Dec. 2011. [Online]. Available: http://doi.acm.org/10.1145/2049662.2049663
 [8] F. Khorasani, R. Gupta, and L. N. Bhuyan, "Scalable SIMD-efficient graph processing on GPUs," in Proceedings of the 24th International Conference on Parallel Architectures and Compilation Techniques, ser. PACT '15, 2015, pp. 39–50.
 [9] I. AlFuraih and S. Ranka, “Memory hierarchy management for iterative graph structures,” in Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, March 1998, pp. 298–302.
 [10] G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 359–392, 1998. [Online]. Available: https://doi.org/10.1137/S1064827595287997
 [11] L. Valiant, “A scheme for fast parallel communication,” SIAM Journal on Computing, vol. 11, no. 2, pp. 350–361, 1982. [Online]. Available: https://doi.org/10.1137/0211027
 [12] M. Minutoli, S. K. Kuntz, A. Tumeo, and P. M. Kogge, "Implementing radix sort on Emu 1," in The 3rd Workshop on Near-Data Processing (WoNDP), 2015.