As parallel computers and programming languages have proliferated, programmers are increasingly faced with the task of improving software performance through parallelism. Shared-memory multithreading is an especially common programming and execution model that is at the heart of a wide variety of server and user applications. The value of the shared memory model is its simple programming interface. However, shared memory also requires programmers to overcome several barriers to realize an application’s parallel performance potential.
Synchronization and data movement are the key impediments to an application’s efficient, parallel execution. To ensure that data shared by multiple threads remain consistent, the programmer must use synchronization (e.g., mutex locks) to serialize the threads’ accesses to the data. Synchronization limits parallelism because it forces threads to sequentially access shared resources, often requiring threads to stop executing and wait for one another. Processors’ data caches are essential to high performance, and the need to manipulate shared data in cache requires the system to move the data between different processors’ caches during an execution. The latency of data movement impedes performance. Moreover, systems must use cache coherence to ensure that processors always operate on the most up-to-date version of a value. Coherence protocol implementations cause processors to serialize their accesses to shared data, further limiting parallelism and performance.
Our work is motivated by an observation about synchronization and data movement: while accesses to shared data by different threads must be serialized, the order in which those accesses are serialized is often inconsequential. Indeed, in a multithreaded execution, the execution order of such accesses may vary non-deterministically, potentially leading to different – yet correct – outcomes. We refer to operations with this permissible form of order non-determinism as “commutative operations” (or COps) and the data that they access as “commutatively accessible data” (or CData).
Recent work described COUP, which modified the coherence protocol to exploit the commutativity of common operations (e.g., addition, logical OR). While COUP is effective at improving parallelism for programs that use these commutative operations, it has several important limitations. COUP is limited to a small, fixed set of operations that are built into the hardware. If software uses even a slightly different commutative update (e.g., saturating addition, complex arithmetic), COUP is inapplicable and its performance benefits are lost. Additionally, COUP tightly couples commutative updates to the coherence protocol, adding a new coherence state, along with its attendant complexity and need for re-verification.
This work describes a hybrid software/hardware approach to exploiting the commutativity of COps on CData. We describe CCache, which uses simple hardware support that does not modify the cache coherence protocol to improve the parallel performance of threads executing flexible, software-defined commutative operations. Cores in CCache perform COps to replicated, privatized copies of the same CData without the need for synchronization, coherence, or data movement. Threads that perform parallel COps on replicated CData must eventually merge the result of their COps using an application-specific merge function that the programmer writes, to combine independently manipulated copies of CData. Merging combines the CData results of different threads’ COps, effectively serializing the execution of the parallel COps. CCache improves parallel performance through on-demand privatization, creating a copy of CData on which each thread may perform COps independently.
We describe extensions to a commodity multicore architecture that support CCache. CCache’s microarchitectural additions have low complexity and do not interfere with critical path operations. CCache uses a simple set of ISA extensions to support a programming interface that allows programmers to express COps and define, register, and execute a merge function. CCache requires commutatively manipulated data and coherently manipulated data to be disjoint, enabling efficient commutative updates without coherence protocol modifications. Coherent cache lines are handled by the existing coherence protocol. Commutatively manipulated lines never generate coherence actions and never match the tag of an incoming coherence message. CCache thus avoids the cost and complexity of a protocol change.
We evaluate CCache on a collection of benchmarks including a key-value store, K-means clustering, Breadth-first Search, and PageRank. To illustrate the flexibility of CCache, we implement variants of each benchmark that use different, application-specific merge operations. Using direct comparisons to static data duplication and fine-grained locking, our evaluation shows CCache improves the single-machine, in-memory performance of these applications by up to 3.2x over an already optimized, parallel baseline. Moreover, with half the L3 cache capacity, CCache has a 1.07-1.9x performance improvement over static duplication.
To summarize, our main contributions are:
The CCache execution model, which uses on-demand privatization to improve parallel performance of commutative shared data accesses.
We present a collection of architecture extensions that implement on-demand privatization in CCache without affecting the coherence protocol.
We port several important applications to use CCache’s ISA extensions, including a key-value store, PageRank, BFS, and K-means clustering.
We implement static duplication and fine-grained locking versions of our workloads, and we show by direct comparison that CCache improves performance by up to 3.2x across applications.
2 Background and Motivation
This section motivates the CCache approach to on-demand privatization in hardware. We frame CCache with a discussion of fine-grained locking and static data duplication, done manually [11, 12] or with compiler support.
2.1 Locking and Data Duplication
Parallel code requires threads to synchronize their accesses to shared data to keep the data consistent. Lock-based synchronization requires threads to use locks associated with shared data to serialize the threads’ accesses to the shared data. The simplest way to implement locking in a parallel program is to use coarse-grained locking (CGL). CGL associates one lock (or a small number of locks) with a large, shared data structure. CGL makes programming simple because the programmer is not required to reason about the details of associating locks with each variable or memory location. However, CGL can impede performance by serializing accesses to unrelated memory locations that are protected by the same lock. Fine-grained locking (FGL) is one response to the performance impediment of CGL. FGL associates a lock with each variable (or with a small group of variables), eliminating unnecessary serialization of accesses to unrelated data. The key problem with FGL is the need for a programmer to express the mapping of locks to data, which is more complex for FGL than for CGL, and is a source of errors. Figure 1 illustrates the difference between FGL and CGL.
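The contrast between CGL and FGL can be sketched in a few lines of Python. This is only an illustration of the two locking disciplines, not code from our benchmarks; the class names, the bucket count, and the `run` helper are hypothetical.

```python
import threading

NUM_BUCKETS = 8  # hypothetical size of the shared table


class CoarseCounterTable:
    """CGL: a single lock guards every bucket."""

    def __init__(self):
        self.values = [0] * NUM_BUCKETS
        self.lock = threading.Lock()

    def increment(self, i):
        with self.lock:  # serializes updates even to unrelated buckets
            self.values[i] += 1


class FineCounterTable:
    """FGL: one lock per bucket; the programmer maintains the lock-to-data map."""

    def __init__(self):
        self.values = [0] * NUM_BUCKETS
        self.locks = [threading.Lock() for _ in range(NUM_BUCKETS)]

    def increment(self, i):
        with self.locks[i]:  # only updates to the same bucket wait on each other
            self.values[i] += 1


def run(table, num_threads=4, updates_per_thread=1000):
    """Drive the table from several threads and return the total count."""

    def worker(tid):
        for k in range(updates_per_thread):
            table.increment((tid + k) % NUM_BUCKETS)

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(table.values)
```

Both tables produce the same totals; they differ only in how much unrelated work is serialized and in how much lock bookkeeping the programmer must get right.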
2.1.1 Data Duplication
Data duplication (DUP) is a strategy for increasing parallelism by creating copies of a memory location that different threads can manipulate independently of one another. To ensure the correctness of an execution that both reads and writes duplicated data, the program must, at some point, combine the results computed by the different threads. Functional reductions [28, 16, 15] and symbolic parallel updates [32, 31, 9] combine the result of each thread’s computation on its duplicated copy of the data. Reduction applies a (usually) side-effect-free operation to all of the copies producing a single, coherent output value. Some prior work has statically replicated data using compiler and runtime support, parceling copies out to threads and merging them with a reduction [28, 16, 15, 31]. Other prior work has used hardware support for speculation to effectively duplicate data [17, 9].
DUP is highly effective, especially when threads can independently perform many operations on their copy of the data. Duplication improves parallel performance by eliminating serialization due to synchronization (i.e., locking) and cache coherence, both of which hinder FGL and CGL. Instead, the reduction computes a result that is equivalent to some serialization of the independent computations.
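As a rough sketch of static duplication followed by a reduction, the following Python fragment gives each thread a full private copy of a small histogram and sums the copies at the end. The function name, the histogram workload, and the strided work split are assumptions chosen for illustration.

```python
import threading


def histogram_dup(data, num_bins, num_threads):
    # Static duplication: one full private copy of the histogram per thread.
    # The footprint grows to num_threads * num_bins entries.
    copies = [[0] * num_bins for _ in range(num_threads)]

    def worker(tid):
        private = copies[tid]
        for x in data[tid::num_threads]:  # each thread takes a strided slice
            private[x % num_bins] += 1    # no lock: nobody else touches this copy

    threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Reduction: combine the per-thread copies into one consistent result.
    return [sum(column) for column in zip(*copies)]


result = histogram_dup(list(range(100)), num_bins=4, num_threads=3)
```

The update loop needs no synchronization, but every thread pays for a complete copy of the data whether or not it touches every bin.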
Despite its benefits, DUP has several drawbacks. Duplicating data increases the application’s memory footprint. The increase in footprint leads to higher LLC occupancy and miss rate. Efficiently laying out and distributing duplicated data is difficult; we discuss this programming difficulty of data duplication in the context of our benchmarks in Section 5. Anecdotally, we found static data duplication more difficult to get right than FGL, forcing us to simultaneously reason about consistency, locality, and false sharing.
Our main insight is that data duplication and locking both have merits and drawbacks: data duplication increases parallelism at a cost in memory and LLC misses; locking decreases parallelism and makes programming complex, but does not degrade LLC performance or increase the memory footprint. Our work aims to capture the “best of all worlds”: the low complexity and low LLC and memory occupancy of CGL (or FGL), and the increased parallelism of DUP. As Figure 1 shows, CCache, our novel programming model and architecture, dynamically privatizes and later merges data without the need for the programmer to tediously lay out and manage in-memory copies. Additionally, our mechanism can use the same merge function as defined for DUP. Hence, CCache applies generally, to all cases where static DUP is possible. Section 3 describes the new CCache approach at a high level and Section 4 describes our programming model and architecture.
3 CCache: On-Demand Privatization
CCache is a new programming and execution model for a parallel program that uses on-demand data duplication to increase parallel access to data (CData) that are manipulated by commutative operations (COps). When several cores access the same CData memory locations using COps, the data are privatized, providing a copy to each core. After privatization, the cores can manipulate the CData in their caches in parallel without coherence or synchronization. When a core finishes operating on a CData location, it uses a programmer-defined merge function to merge its updated value back into the multiprocessor’s closest level of shared storage (i.e., the LLC or memory). When all cores have merged their privatized copies, the result is equivalent to some serialization of the cores’ parallel manipulations of the CData.
We describe the operation of CCache, assuming a baseline multicore architecture with an arbitrary cache hierarchy. In Section 4 we describe a concrete architectural incarnation of CCache. There are two parts to CCache. We first define COps and CData and show how on-demand data duplication increases parallelism. Then we describe what CCache does when a core executes a COp to a CData memory location. Last, we describe what a CCache core does when its commutative CData computation completes.
3.1 Executing Commutative Operations in Parallel
CCache increases the parallel performance of a program’s commutative manipulations of shared data. Operations to a shared memory location by different cores are commutative with one another if their execution can be serialized in either possible execution order and produce a correct result. Figure 2 shows a simple program in which two cores increment a shared counter variable x. The figure illustrates that these operations are commutative: the arbitrary serialization of the loop’s iterations and the coarse serialization produce the same result at the end of the execution. Any other serialization of the updates to x yields the same result.
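The order-insensitivity of such updates can be checked exhaustively for a small case. The sketch below, with hypothetical names and values, applies a set of additions in every possible order and, for contrast, does the same for a pair of operations that do not commute.

```python
from itertools import permutations


def apply_in_order(initial, ops):
    """Apply a sequence of single-argument update functions to a value."""
    x = initial
    for op in ops:
        x = op(x)
    return x


# Four additions to a shared counter, as two cores might issue them.
increments = [lambda x: x + 1, lambda x: x + 1, lambda x: x + 3, lambda x: x + 5]
results = {apply_in_order(0, order) for order in permutations(increments)}

# Contrast: an addition and a multiplication do not commute with each other.
mixed = [lambda x: x + 1, lambda x: x * 2]
mixed_results = {apply_in_order(1, order) for order in permutations(mixed)}
```

Every ordering of the additions ends at the same value, while the mixed pair ends at different values depending on the order.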
CCache increases a core’s parallel performance running operations that access memory commutatively (COps) by automatically, dynamically duplicating the memory being accessed. CCache requires the programmer to explicitly identify which memory operations in the program access data commutatively (COps). The COps defined in CCache are the CRead and the CWrite operations. A CRead or a CWrite operation creates two copies of the memory location it is accessing. The first is the core-local source copy, which the core preserves. The second is the core-local update copy, which the core uses to perform its computation, instead of referring directly to the location in memory. Each core executes its COps independently on a privatized copy of the shared CData, and then merges the resulting privatized copies, producing a result that is equivalent to some serialization. We discuss merging in Section 3.2.
Figure 2 shows how duplication improves parallelism in the CCache-like “privatize & merge” serialization depicted. The two cores privatize the value of x by preserving its source value into the abstract storage locations s1 and s2 and copying x into core-local, abstract storage locations x1 and x2. The cores then independently execute their loops, updating their private copies. Note that this “privatize & merge” execution model for manipulating commutative data does not specify where to put the abstract storage for the copies. To simplify our exposition, we show data copies in named variables in the figure, but CCache does not use explicit, named copies. As Section 4.1 describes, CCache stores the updated copy in the core’s private cache, and its source copy in a new hardware structure called the source buffer. Architecture support for privatizing data is crucial, avoiding the memory overhead of statically allocated copies and the time overhead of dynamically allocating copies in software.
3.2 Merging Updates to Privatized Data
A merge function in CCache is a programmer-provided, application-specific function that uses a core’s updated value, saved source value, and the in-memory value to update the in-memory copy. Merging is a partial reduction [28, 15, 16] of a core’s value and the in-memory copy.
A typical merge function examines the difference between the source copy and the update copy to compute the update to apply to the memory copy. The merge function then applies the update to the memory copy, reflecting the execution of the core’s COps. When a set of cores that are commutatively manipulating data have all applied their merge functions, the data are consistent and represent a serialization of the cores’ updates.
The flexibility of a software-defined merge function is one of the most important, novel aspects of CCache, allowing its applicability to a broad variety of applications, and allowing applications (and their merge functions) to evolve with time.
Possible software merge functions in CCache include, but are not limited to, complex multiplication, saturating or thresholding addition, arbitrary bitwise logic, and variants using floating- and fixed-point operations. CCache also supports approximate merge techniques. An example of an approximate merge is to dynamically, selectively drop updates according to a programmer-provided binomial distribution, similar to loop perforation. Each of these merge functions is represented in a benchmark that we use to evaluate CCache in Section 6. We emphasize that CCache is a stark contrast to a system with fixed, hardware merge operations, which are less broadly applicable and unable to evolve to changing application needs.
Figure 2 shows how merging produces a correct serialized result. After a core completes its update loop, it executes the programmer-provided merge function shown. The programmer writes the merge function with the knowledge of the updates applied by the threads in their loops: the execution applies a sequence of increments to x. The merge function computes the update to apply to memory. In this example, the update is to add to the memory value the difference between the core’s modified value and the preserved source value. To apply the update, the merge function adds the computed difference to the value in memory. After both cores execute their merge function, x is in a consistent final state, equivalent to both the arbitrary and the coarse-grained serialization.
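The counter example can be modeled in software as follows. This is a minimal sketch of the "privatize & merge" idea and not the hardware mechanism; the function names, the starting value, and the iteration counts are hypothetical, and the two cores run one after the other for simplicity.

```python
def merge_add(mem, src, upd):
    """Additive merge: apply the core's net change (upd - src) to memory."""
    return mem + (upd - src)


def run_core(shared_value, num_increments):
    """Privatize the shared value, then increment the private copy."""
    src = shared_value  # source copy, preserved unchanged
    upd = shared_value  # update copy, modified by the core's loop
    for _ in range(num_increments):
        upd += 1
    return src, upd


def final_value(initial, merge_order):
    """Both cores privatize the same initial value; merges run in the given order."""
    copies = {1: run_core(initial, 5), 2: run_core(initial, 7)}
    mem = initial
    for core in merge_order:
        src, upd = copies[core]
        mem = merge_add(mem, src, upd)
    return mem


result_12 = final_value(10, merge_order=[1, 2])
result_21 = final_value(10, merge_order=[2, 1])
```

Either merge order leaves the same value in memory, matching a serial execution of all twelve increments.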
3.2.1 Synchronization and Merging
Parallel merges to the same location must be serialized for correctness, and the execution of each merge to a location must be atomic with respect to that location. Such per-merge synchronization ensures that each subsequent merge sees the updated memory result of previous merges, so that the result is a serialization of atomically applied updates. Section 4 describes how our CCache architecture serializes merges.
In addition to serialized, atomic merges, the programmer may sometimes need a barrier-like merge boundary that pauses every thread manipulating CData until all threads have merged all CData. A program needs a merge boundary when it is transitioning between phases and the next phase needs results from the previous phase. A programmer can implement a merge boundary by executing a standard barrier, preceded by an explicit CCache merge operation in each core that merges all duplicated values. If each core executes the CCache merge operation and then waits at the barrier, the merge boundary leaves all data consistent. Note that when such a barrier is needed in CCache, it would also be needed in a conventional program and CCache imposes no additional need for barrier synchronization.
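A merge boundary can be sketched in software as an atomic merge followed by a standard barrier. In the fragment below a Python lock stands in for the per-location atomicity of a merge; the thread count, variable names, and the per-thread workload are assumptions made for illustration.

```python
import threading

NUM_THREADS = 4
shared = {"total": 0}
merge_lock = threading.Lock()              # stand-in for per-location merge atomicity
boundary = threading.Barrier(NUM_THREADS)  # the standard barrier after the merge
observed = [None] * NUM_THREADS


def worker(tid):
    # Phase 1: privatize, then update the private copy without synchronization.
    src = shared["total"]
    upd = src
    for _ in range(10):
        upd += tid + 1

    # Merge boundary: merge the private copy, then wait for every other thread.
    with merge_lock:
        shared["total"] += upd - src
    boundary.wait()

    # Phase 2: every thread now sees the fully merged result of phase 1.
    observed[tid] = shared["total"]


threads = [threading.Thread(target=worker, args=(tid,)) for tid in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each thread merges before it reaches the barrier, no thread can enter the second phase until all first-phase updates are in shared memory.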
3.3 Example: A CCache Key-value Store
We illustrate how CCache’s on-demand data duplication idea increases parallelism with an example. Figure 3 shows a key-value store manipulated by two cores. The program keeps a lookup table KV indexed by key, containing integer values that the cores increment. We use CRead and CWrite operations that perform on-demand data duplication. The merge function takes the source value at the time it was first read and privatized by the CRead, the updated value, and the shared memory value. The merge function adds the difference between the updated value and the source value to the memory value.
The figure reveals several important characteristics of CCache. First, the figure shows how CCache obviates duplicating KV. Instead, cores copy individual entries of KV on-demand into the Src and Upd copies. Second, the example shows that by privatizing KV, the cores can independently read and write its value in parallel. Third, the figure shows that core 1 has locality in its independent accesses to its privatized copy of KV. Fourth, the figure shows that the serialized execution of the merge functions by each core installs correct, consistent final values into shared memory.
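The behavior described for the key-value store can be approximated with a small software model. The `Core` class below is a hypothetical stand-in for one core's Src and Upd copies, and its method names mirror the CRead and CWrite operations only loosely; the keys and values are illustrative.

```python
class Core:
    """Software model of one core's on-demand privatized entries of KV."""

    def __init__(self, kv):
        self.kv = kv   # the shared table
        self.src = {}  # source copies, created on first access
        self.upd = {}  # update copies, modified locally

    def c_read(self, key):
        if key not in self.upd:  # first touch: privatize this entry only
            self.src[key] = self.kv[key]
            self.upd[key] = self.kv[key]
        return self.upd[key]

    def c_write(self, key, value):
        self.c_read(key)         # make sure the entry is privatized
        self.upd[key] = value

    def merge(self):
        # Additive merge of every privatized entry, then drop the copies.
        for key, upd in self.upd.items():
            self.kv[key] += upd - self.src[key]
        self.src.clear()
        self.upd.clear()


kv = {"a": 0, "b": 10, "c": 100}
core1, core2 = Core(kv), Core(kv)

for _ in range(3):
    core1.c_write("a", core1.c_read("a") + 1)  # repeated hits on a private copy
core1.c_write("b", core1.c_read("b") + 1)
for _ in range(2):
    core2.c_write("b", core2.c_read("b") + 1)

privatized_by_core1 = set(core1.src)           # "c" is never copied
core1.merge()
core2.merge()
```

Only the entries a core actually touches are copied, and the merged table reflects the updates of both cores to the shared key.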
4 CCache Architecture
We co-designed CCache as a collection of programming primitives implemented in a commodity multicore architecture. We assume a base multicore architecture with core-private L1 and L2 caches and a shared last-level cache (LLC) with a directory-based, MESI cache coherence protocol.
The CCache programming interface includes operations for manipulating and merging CData, which are summarized in Table 1. We implement the CCache programming primitives directly as ISA extensions using modifications to the L1 cache and a dedicated source buffer that manages source copies of CData. We add support for CCache to maintain a collection of merge functions and associate each CData line with its merge function. Figure 4 shows the structures that we add to our base architecture design. Beyond the basic CCache design, we improve the performance of merging with an optimization that improves locality and eliminates unnecessary evictions.
|Operation||Description|
|merge_init(&fn,i)||Stores pointer to merge function fn into MFR i|
|c_read(CData,i)||Read CData into source buffer and L1, set CCache bit, set merge type to i|
|c_write(CData,v,i)||Read CData into source buffer on miss, write v in L1, set CCache bit if unset, set merge type to i|
|rd_mreg(reg,i)||Return word i of merge register reg|
|wr_mreg(reg,v,i)||Write v into word i of merge register reg|
|soft_merge(core)||Set mergeable bit in L1 for each valid source buffer entry|
|merge(core_id)||For each valid source buffer entry, populate merge registers, lock LLC line, call line’s merge function from MFRF, copy memory merge register to LLC, flash clear source buffer, unset CCache bit, unlock LLC line|
4.1 CRead and CWrite
We introduce c_read and c_write operations that access CData in a similar way to typical load and store instructions. When a core executes a c_read or c_write operation, it loads the data into its L1 cache, as usual, but does not perform any coherence actions for the involved line. The core also need not acquire any lock before accessing CData with a c_read or c_write.
To track which lines hold CData, we add a single CCache bit to each line in the L1 cache. When a c_read or a c_write accesses a line for the first time (i.e., on an L1 miss), the core sets the line’s CCache bit. We also add a field to each cache line that describes the merge type of the data in the line. The merge type field determines which merge function to call when merging a line’s CData (Section 4.2). The size of the merge type field is the logarithm of the number of different merge functions in the system. An implementation using two bits (i.e., four merge functions) is reasonably flexible and makes the merge type field’s hardware overhead very low.
To allow for update-based merging, CCache must maintain the source copy, updated copy, and memory copy, as described in Section 3.2. CCache uses the L1 cache itself to maintain the updated copy and keeps the memory copy in shared memory as usual.
To maintain the source copy of a line accessed by a c_read or c_write we add a dedicated hardware structure to each core called the source buffer. The source buffer is a small, fully associative, cache-like memory that stores data at the granularity of cache lines. Figure 4 illustrates the source buffer in the context of the entire core. When a c_read or c_write experiences an L1 miss, CCache copies the value into an entry in the source buffer in parallel with loading the line into the L1.
The programmer registers a programmer-defined merge function for a region of CData using CCache’s merge_init operation. At a merge point, the system executes the merge function, passing as arguments the memory, source, and updated value of the CData location to be merged. The result of a merge function is that the memory copy of the data reflects the updates executed by the merging core before the merge. The signature of a merge function is fixed. A merge function takes pointers to three 64-byte values: the source and updated values are read-only inputs and the memory value acts as both an input and an output. The merge function must read and write these values using the rd_mreg and wr_mreg CCache operations depicted in Table 1.
Registering Merge Functions. To implement merging, we add a merge function register file (MFRF) to the architecture to hold the addresses of merge functions. We add a merge_init ISA instruction that installs a merge function pointer into a specified MFRF entry. The size of the MFRF is dictated by the number of simultaneous CData types in the system. A four-entry MFRF allows four different merge types and requires only two merge type bits per cache line to identify a line’s merge function.
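The relationship between the number of MFRF entries and the merge type bits can be illustrated with a small software model. The class below is a hypothetical sketch, not the hardware interface: `merge_init` here takes a Python callable rather than a function address, and `lookup` is an invented helper.

```python
class MergeFunctionRegisterFile:
    """Software model of an MFRF holding one merge function per merge type."""

    def __init__(self, num_entries=4):
        self.entries = [None] * num_entries

    @property
    def merge_type_bits(self):
        # Bits needed per cache line to name any entry: ceil(log2(num_entries)).
        return (len(self.entries) - 1).bit_length()

    def merge_init(self, fn, i):
        """Install merge function fn into entry i."""
        if not 0 <= i < len(self.entries):
            raise IndexError("no such MFRF entry")
        self.entries[i] = fn

    def lookup(self, merge_type):
        """Return the merge function selected by a line's merge type field."""
        fn = self.entries[merge_type]
        if fn is None:
            raise LookupError("no merge function registered for this type")
        return fn


def add_merge(mem, src, upd):
    return mem + (upd - src)


mfrf = MergeFunctionRegisterFile(num_entries=4)
mfrf.merge_init(add_merge, 1)
```

With four entries the model reports two merge type bits, and a line tagged with type 1 resolves to the registered additive merge.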
Executing a Merge Function. CCache runs a merge function at a merge operation. Table 1 shows CCache’s two varieties of merge: soft_merge and merge. We discuss the basic merge instruction here and defer discussion of the optimized soft_merge to Section 4.3.
A merge merges all of a core’s CData: the executing core walks the source buffer array and executes a merge for each valid entry. To execute a merge for a line, the core first locks the corresponding line in the LLC, preventing any other core from accessing the line until the merge completes. To prevent deadlock, a merge function can access only its source, updated, and memory values. Allowing arbitrary access to LLC data could cause two threads in merge functions to wait on one another’s locked LLC lines.
After locking the LLC line, the core next prepares the source, updated, and memory values for the merge. To prepare them, the core copies the value of each into its own dedicated, cache-line-sized merge register, which we add to the core (see Figure 4). After preparing the merge registers, the core calls the merge function. As the function executes, rd_mreg and wr_mreg operations that refer to the memory, updated, and source copies of CData access the copies in the merge registers. After the merge function completes, the core moves the contents of the merge register that corresponds to the memory copy of the merged line into the L1 and triggers a write back to the LLC. Additionally, the source buffer line is invalidated and the CCache bit is reset to zero. The core then unlocks the LLC line, completing the merge. The entire sequence of steps during merging is shown in the flowchart in Figure 5.
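The merge sequence can be traced with a simplified software model. The sketch below treats each cache line as a single integer rather than 64 bytes, uses a Python lock per LLC line, and represents the merge registers as a dictionary; all class and field names are hypothetical.

```python
import threading


class LLC:
    """Shared last-level cache model: one value and one lock per line."""

    def __init__(self, lines):
        self.data = dict(lines)
        self.locks = {addr: threading.Lock() for addr in lines}


def add_merge(regs):
    """Merge function: reads src and upd, updates the memory register in place."""
    regs["mem"] += regs["upd"] - regs["src"]


class CoreModel:
    def __init__(self, llc, merge_fn):
        self.llc = llc
        self.merge_fn = merge_fn
        self.l1 = {}             # addr -> {"value": ..., "ccache": bool}
        self.source_buffer = {}  # addr -> source copy

    def _privatize(self, addr):
        line = self.l1.get(addr)
        if line is None or not line["ccache"]:  # first COp to this line
            current = self.llc.data[addr]
            self.source_buffer[addr] = current
            line = self.l1[addr] = {"value": current, "ccache": True}
        return line

    def c_read(self, addr):
        return self._privatize(addr)["value"]

    def c_write(self, addr, value):
        self._privatize(addr)["value"] = value

    def merge(self):
        for addr in list(self.source_buffer):     # walk valid source buffer entries
            with self.llc.locks[addr]:            # lock the LLC line
                regs = {"mem": self.llc.data[addr],
                        "src": self.source_buffer[addr],
                        "upd": self.l1[addr]["value"]}  # populate merge registers
                self.merge_fn(regs)               # run the merge function
                self.l1[addr]["value"] = regs["mem"]    # memory register into L1
                self.llc.data[addr] = regs["mem"]       # write back to the LLC
                del self.source_buffer[addr]      # invalidate source buffer entry
                self.l1[addr]["ccache"] = False   # reset the CCache bit
            # leaving the with-block unlocks the LLC line


llc = LLC({0x40: 10})
core1, core2 = CoreModel(llc, add_merge), CoreModel(llc, add_merge)
core1.c_write(0x40, core1.c_read(0x40) + 5)
core2.c_write(0x40, core2.c_read(0x40) + 7)
core1.merge()
core2.merge()
```

The second merge observes the result of the first because it reads the memory value only after acquiring the line's lock.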
Serialization and Merge Functions. A merge instruction serializes accesses to each merged LLC line by individually locking and unlocking them. The merge does not enforce the atomicity of a group of merge operations to different lines in the source buffer. Only individual lines’ merges are atomic. For coarser atomicity, the programmer should use locks and barriers as usual. Note that any situation in CCache that calls for a lock or barrier would require at least the same synchronization (or more) in a non-CCache program because such a point requires results to be serialized, regardless of the programming model.
4.3 Merge Optimizations
merge operations incur the cost of merging all of a core’s CData, including merge function execution and write back to the LLC. We applied two optimizations to CCache to reduce the cost of merge operations. First, we introduce a new instruction, soft_merge, that delays the merge and write back of CData lines for as long as possible. Second, we perform the merge operation described above only when a core updates a CData line.
soft_merge works by setting a CData line into a new mergeable state and delaying the merge of the line’s contents with the in-memory copy until the line’s eviction from the L1 cache and source buffer. To track lines in the mergeable state, we add a new mergeable bit per cache line in the L1, which is depicted in Figure 4. A soft_merge operation sets a line’s mergeable bit, indicating that it is ready to merge. When a core must evict from L1, lines with both their CCache and mergeable bits set are candidates for eviction. (Recall that lines with only their CCache bit set cannot be evicted.) When a mergeable line is selected for eviction, the core first executes the merge procedure (from Section 4.2) for the line and then evicts it. A c_read or c_write to a mergeable line resets the line’s mergeable bit to prevent the line from being evicted during subsequent commutative updates. These c_read or c_write operations enjoy additional locality in the source buffer and the L1 cache, compared to our unoptimized implementation.
The second optimization is to skip merge operations for clean CData lines because merging an unmodified copy would not affect the in-memory result. CCache checks a line’s L1 dirty bit to decide whether a merge operation is required. CData lines that are candidates for eviction (i.e., mergeable bit set) and do not have their dirty bit set can be silently evicted from the L1 cache and removed from the source buffer.
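Both optimizations can be sketched together in a small software model. The classes below are hypothetical, lines are modeled as single integers, the merge is additive, and `evict` stands in for the cache's replacement logic selecting a victim.

```python
class Line:
    def __init__(self, value):
        self.value = value
        self.ccache = True      # line holds CData
        self.mergeable = False  # set by soft_merge, cleared by a later COp
        self.dirty = False      # set by c_write


class SoftMergeCore:
    def __init__(self, memory):
        self.memory = memory
        self.l1 = {}
        self.source_buffer = {}
        self.merges_run = 0

    def _access(self, addr):
        if addr not in self.l1:          # miss: privatize the line
            self.source_buffer[addr] = self.memory[addr]
            self.l1[addr] = Line(self.memory[addr])
        line = self.l1[addr]
        line.mergeable = False           # in use again, so pin it in the cache
        return line

    def c_read(self, addr):
        return self._access(addr).value

    def c_write(self, addr, value):
        line = self._access(addr)
        line.value = value
        line.dirty = True

    def soft_merge(self):
        for addr in self.source_buffer:  # mark, but do not merge yet
            self.l1[addr].mergeable = True

    def evict(self, addr):
        line = self.l1[addr]
        if not line.mergeable:
            return False                 # CData that is not mergeable stays put
        if line.dirty:                   # only modified lines need a merge
            self.memory[addr] += line.value - self.source_buffer[addr]
            self.merges_run += 1
        del self.l1[addr]
        del self.source_buffer[addr]
        return True


memory = {0: 5, 1: 7}
core = SoftMergeCore(memory)
core.c_write(0, core.c_read(0) + 3)      # line 0 becomes dirty
core.c_read(1)                           # line 1 stays clean
evicted_before_soft_merge = core.evict(0)
core.soft_merge()
core.evict(1)                            # clean line: silent eviction
merges_after_clean_eviction = core.merges_run
core.evict(0)                            # dirty line: merge, then evict
```

The clean line leaves the cache without running a merge, and the dirty line is merged only when it is finally evicted.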
CCache does not require modifications to the cache coherence protocol, avoiding a major source of complexity and verification cost. The cache coherence protocol operates as usual for non-commutatively, coherently manipulated lines. CCache does not require sending any new coherence messages because a core never generates a coherence message for a line with its CCache bit set. CCache also does not require a core to specially handle any incoming coherence message because no incoming coherence message can ever refer to a line of CData. A coherence message cannot refer to a CData line because a CData line can only ever be manipulated by a CRead or CWrite operation with the line’s CCache bit set, which never generates any coherence messages.
CCache affects the cache’s replacement policy because CData are not allowed to be evicted. CCache must ensure that data accessed without coherence by a c_read or c_write are merged before being evicted. However, CCache cannot simply merge on an eviction because words from the line might be modified in registers. If such a line were evicted along with its source buffer entry, then when the register value was written back using a c_write operation, the source buffer entry would no longer be available. Furthermore, the in-memory value may have changed (due to writes by other cores) by the time of the c_write. The absence of a source value and potential for changes to the memory copy in this situation precludes the eviction, and CCache conservatively prohibits all evictions of data with their CCache bit set.
A program cannot access more cache lines using COps than there are ways in the cache without an intervening merge. If there are n ways in the cache, CCache will deadlock once COps touch more than n distinct lines that all map to the same cache set. Consequently, the programmer needs to carefully ensure that their program accesses at most n - 1 distinct cache lines without an intervening merge. A limit of n - 1 guarantees that in the worst case, when all accesses map to the same cache set, one way in the set is always available for coherent data, access to which may be required to make progress. In systems with SMT, hardware threads evenly divide cache ways for CData. While somewhat limiting, this programming restriction is similar to recent, widespread hardware transactional memory proposals [40, 1].
We assume CData are cache line aligned and that lines containing CData are only ever accessed using a c_read or c_write instruction. We require the programmer or the compiler to add padding bytes to these aligned CData variables to ensure that a cache line never contains both CData and normal data bytes. This restriction prevents operations other than c_read and c_write operations from accessing CData lines.
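The padding requirement amounts to rounding each CData variable up to a whole number of cache lines. The following sketch assumes 64-byte lines; the helper names and the example sizes are hypothetical.

```python
CACHE_LINE_BYTES = 64  # assumed line size


def padded_size(nbytes, line_bytes=CACHE_LINE_BYTES):
    """Round a variable's size up to a whole number of cache lines."""
    return -(-nbytes // line_bytes) * line_bytes


def layout(sizes, line_bytes=CACHE_LINE_BYTES):
    """Assign line-aligned offsets so no two variables share a cache line."""
    offsets = []
    next_free = 0
    for size in sizes:
        offsets.append(next_free)
        next_free += padded_size(size, line_bytes)
    return offsets


def lines_touched(offset, size, line_bytes=CACHE_LINE_BYTES):
    """Set of cache line indices covered by a variable."""
    return set(range(offset // line_bytes, (offset + size - 1) // line_bytes + 1))


sizes = [8, 64, 65]  # an 8-byte counter, a full line, and a slightly larger record
offsets = layout(sizes)
lines = [lines_touched(o, s) for o, s in zip(offsets, sizes)]
```

An 8-byte counter therefore occupies a full line by itself, which is the footprint cost of keeping CData and normal data apart.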
4.5 Commutativity of the Merge Function
In CCache, the programmer is solely responsible for the correctness of merge functions. The key to writing a merge function is to determine what update to apply to the in-memory copy of the cache line, given the updated copy, and source buffer copy. Merge functions are often arithmetic and computing and applying the update is simple. We have written many such cases (e.g., addition, minimum) that can be used as a library with little programmer effort.
A modestly more complex case is a commutative update with a conditional that depends on the values of the CData. In this case, the programmer must ensure that the merge function’s conditional observes the in-memory copy of the value, rather than the updated copy. An example of such a program might randomly access and increment an array of integers up to a maximum. A simple merge for integer addition adds the difference between the source value and the updated value to the in-memory value. To enforce a maximum, the merge function must also assess whether applying the update would exceed the maximum and, if so, update the in-memory copy to the maximum only.
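The saturating case can be sketched as follows. The function names, the maximum of 10, and the starting values are hypothetical; the second function shows the mistake the text warns against, where the conditional inspects the core's updated copy instead of the in-memory value.

```python
def merge_saturating_add(mem, src, upd, maximum):
    """Apply the core's delta to memory, clamping the in-memory result."""
    delta = upd - src
    return min(mem + delta, maximum)


def merge_checks_wrong_copy(mem, src, upd, maximum):
    """Incorrect variant: the conditional looks at the core's updated copy."""
    delta = upd - src
    if upd <= maximum:  # upd can be below the maximum even when mem + delta is not
        return mem + delta
    return maximum


MAXIMUM = 10
initial = 6
src1, upd1 = 6, 9  # core 1 adds 3 to its private copy
src2, upd2 = 6, 9  # core 2 adds 3 to its private copy

correct = merge_saturating_add(initial, src1, upd1, MAXIMUM)
correct = merge_saturating_add(correct, src2, upd2, MAXIMUM)

wrong = merge_checks_wrong_copy(initial, src1, upd1, MAXIMUM)
wrong = merge_checks_wrong_copy(wrong, src2, upd2, MAXIMUM)
```

Each private copy stays below the maximum, so only a check against the in-memory value catches the overflow produced by the combined updates.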
4.6 Handling Context Switches and Interrupts
CCache cannot evict CData from the cache without merging. However, at an arbitrarily timed context switch or interrupt, a program may be using CData (i.e., in registers), making a merge impossible. There are two options for handling these events. The first option is to delay the context switch or interrupt until after a soft_merge executes for each CData line. At such a point, the architecture can safely execute the merge function for each CData line and then execute the context switch as usual. The main drawback to this approach is the need to delay context switches and interrupts, which may not be possible or desirable in some cases. Additionally, the delay is unpredictable, depending on the number of CData lines and the timing of merge operations.
An alternative is to save cached CData and source buffer entries with the process control block before the context switch. When switching back, a process must restore its CData and source buffer entries. This approach eliminates the delay of waiting to merge, but increases the size of process state. With an 8-way L1 and an 8-entry source buffer, a process must preserve at most 1KB of state, a manageable overhead for infrequent and already-costly context switches.
4.7 Area and Energy Overheads
We used CACTI [26] to quantify the overhead of extending the microarchitecture for CCache. We observed that the energy and area overhead of adding tracking bits to each cache line in the L1 cache and LLC would be negligible. A 32-entry, fully associative source buffer would occupy 0.1% of the area of the Last Level Cache. The energy of reading and writing data from a source buffer of this size would be 6.5% of the energy of accessing the LLC. We assumed a 32nm process for all the caches in the system.
5 Experimental Setup
In this section we describe the simulation setup we used to evaluate CCache. We built our simulation infrastructure using PIN [24]. To measure baseline performance, we developed a simulator that modeled a 3-level cache hierarchy implementing a directory-based MESI coherence protocol. To measure the performance of CCache, we extended the baseline simulator code to model CCache’s architectural additions. Our CCache simulator modeled incoherent accesses to CData, an 8-entry source buffer, and a modified cache replacement policy that excludes CData. We also modeled the cost of executing merge functions in software, including the cost of accessing the merge registers and the LLC. Our model does not include the latency incurred by waiting on locked LLC lines, but concurrent merges of the same line are rare and this simplification is unlikely to significantly alter our results. Table 2 describes the architectural parameters of our simulated architectures.
| Component | Configuration |
|---|---|
| Processor | 8 cores. Non-memory instructions: 1 cycle |
| L1 cache | 8-way, 32KB, 64B lines, 4 cyc./hit |
| L2 cache | 8-way, 512KB, 64B lines, 10 cyc./hit |
| Shared LLC | 16-way, 4MB, 64B lines, 70 cyc./hit |
| Main memory | 300 cyc./access |
| Source buffer | Fully assoc., 512B per core, 64B lines, 3 cyc./hit |
| Merge latency | 170 cycles incl. LLC round-trip |
To evaluate CCache, we manually ported four parallel applications that commutatively manipulate shared data to use CCache: a Key-value Store, K-Means clustering, Page Rank, and BFS. For each benchmark, we also implemented two variants: one that uses fine-grained locks to protect data (FGL) and another that statically duplicates data (DUP). The following subsections provide a brief overview of the operation of each application.
Key-Value Store A key-value store is a lookup table that associates a key with a value, allowing a core to refer to and manipulate the value using the key. In our Key-value Store benchmark, 8 cores increment the values associated with randomly chosen keys. We used COps to implement the updates because increments commute. We set the total number of accesses to random keys to 16 times the number of keys, which we varied in our experiments from 250,000 to 4,000,000. Our DUP scheme creates a per-thread copy of the value array. The merge function computes the difference between the updated copy and the source copy and adds the difference to the memory copy. We use the same merge function for both CCache and DUP.
K-Means K-Means is a popular clustering algorithm. The algorithm partitions a set of multi-dimensional vectors (points) into k clusters with the objective of minimizing the sum of distances between each point and its cluster’s center. The algorithm iteratively assigns each point to the nearest cluster and then recomputes the cluster centers. To restrict simulation times, we fix the number of iterations of the algorithm.
Our DUP implementation is based on Rodinia’s [14] static data duplication scheme, which creates a per-thread copy of the cluster center data structure. For the CCache implementation, we made the cluster centers CData and used COps to manipulate them. The merge function for both CCache and DUP does a component-wise addition of weights on point vectors in a cluster.
The results for K-Means also highlight the need for our soft-merge optimization (described in Section 4.3). The cluster centers stored in CCache experience high reuse over the course of the application. While a naive implementation of CCache would require the CData to be merged after every iteration, the soft-merge optimization can exploit the locality in CData by delaying the merge operation until CCache becomes full. We discuss the benefit of this optimization in further detail in Section 6.4.
Page Rank Page Rank [10] is a relevance-based graph node ranking algorithm that assigns a rank to a node based on its indegree and outdegree. As the algorithm computes the rank recursively, the data structure that contains each node’s rank is shared among all the threads. A naive DUP implementation would allocate each thread a private copy of the entire node rank data structure. Instead, we wrote an optimized data duplication implementation that partitions nodes across threads and creates only one duplicate. One copy of the structure holds the current iteration’s updates while the other copy is a read-only copy of the result of the previous iteration. These copies are then switched at the end of every iteration. For the CCache version, we made the node rank data structure CData and manipulated it using COps. To test Page Rank on varied inputs, we used three inputs generated by the Graph500 [27] benchmark input generator using the RMAT, SSCA, and Random configurations. The merge function adds an iteration’s update to the global rank.
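The two-copy duplication scheme can be illustrated with a small, sequential sketch. It is not the benchmark code: it assumes a pull-style rank update with a conventional damping factor, a round-robin partition of nodes, and a graph with no dangling nodes. The function name `pagerank_double_buffered` and its parameters are hypothetical.

```python
def pagerank_double_buffered(in_edges, out_degree, num_threads=2, iters=20, d=0.85):
    """Pull-style Page Rank with one read-only copy and one write copy.

    in_edges[v] lists the nodes with an edge into v; out_degree[u] is the
    number of edges leaving u. Partitions are processed one after another
    here, but each could run on its own thread because a partition writes
    only the `curr` entries of the nodes it owns.
    """
    n = len(in_edges)
    prev = [1.0 / n] * n   # read-only result of the previous iteration
    curr = [0.0] * n       # receives the current iteration's updates
    # Static partition: thread t owns nodes t, t + num_threads, ...
    partitions = [range(t, n, num_threads) for t in range(num_threads)]
    for _ in range(iters):
        for part in partitions:
            for v in part:
                incoming = sum(prev[u] / out_degree[u] for u in in_edges[v])
                curr[v] = (1.0 - d) / n + d * incoming
        prev, curr = curr, prev   # switch the two copies
    return prev


# Three-node cycle 0 -> 1 -> 2 -> 0: every node ends with the same rank.
ranks = pagerank_double_buffered(in_edges=[[2], [0], [1]], out_degree=[1, 1, 1])
```

Only two copies of the rank array exist regardless of the thread count, which is the memory saving the optimized DUP version relies on.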
Breadth First Search (BFS) Breadth First Search is a graph traversal order that forms an important kernel in many graph applications. For our evaluations, we used the BFS kernel of the Betweenness Centrality application from the GAP benchmark suite [4]. The implementation uses a bitmap to efficiently represent the edges linking successors from a source vertex. The original implementation uses a Compare-and-Swap to atomically set an entry in the bitmap. In our FGL version, we replaced the atomic operations with fine-grained locks that matched the update granularity of the set operation. For the DUP version, we used an optimization that avoids privatization of the entire bitmap. Instead, we store all the updates from a thread in a thread-local, dynamically sized container and apply these updates atomically during a merge operation at the end. For the CCache version, we simply marked the bitmap as CData and used COps to set locations of the bitmap. The CCache merge function performs a logical OR of all the privatized copies. We evaluated the four versions on Kronecker and uniform random graphs provided in the GAP benchmark.
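As an illustration of the OR-based merge, the sketch below treats one bitmap word as a Python integer. The function name `merge_or` is hypothetical, and the masking step that isolates the bits a core set locally is an assumption. Because bits are only ever set in this benchmark, OR-ing the whole updated copy into memory would give the same result.

```python
def merge_or(mem, source, updated):
    """OR-merge for one bitmap word (hypothetical helper).

    `source` is the word as the core fetched it and `updated` is the core's
    private copy. Only bits the core itself set are folded into memory.
    """
    newly_set = updated & ~source
    return mem | newly_set


# Both cores fetched 0b0001; core A set bit 2 and core B set bit 3.
fetched = 0b0001
a_then_b = merge_or(merge_or(fetched, fetched, 0b0101), fetched, 0b1001)
b_then_a = merge_or(merge_or(fetched, fetched, 0b1001), fetched, 0b0101)
```

Both merge orders produce the same word, which is the property that makes set-bit updates safe to perform on private copies.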
Duplication Strategies Porting code from fine-grained locking to a memory-efficient DUP version is a non-trivial task. We used an optimized DUP strategy for Page Rank because it was inefficient to replicate the entire rank array across all threads. Similarly, we avoid naive replication in BFS because the size of the bitmap makes creating thread-local copies infeasible. By contrast, in K-Means, we observed that replicating the data structure storing cluster centers did not drastically increase memory footprint. As a testament to the complexity of efficient duplication, we found that Rodinia’s K-Means implementation suffered from high false sharing. The Key-value Store imposes an application-level constraint on duplication: partitioning is not a good match because any core may access any key. In our experiments, it was reasonable to duplicate the table across all cores. In general, making decisions about how to duplicate data requires difficult, application-dependent reasoning. CCache eliminates the need for such subtle reasoning about data duplication; it requires only that the programmer use COps.
6 Evaluation
We evaluated CCache to show that it improves performance compared to fine-grained locking (FGL) and data duplication (DUP), and that the improvement scales across working set sizes. We show that CCache improves performance compared to static duplication even with fewer hardware resources. We also characterize our results to show they are robust to architectural parameters.
Figure 6 shows CCache’s performance improvement for each benchmark compared to DUP and FGL for an 8-core system. Our key result is that CCache improves performance by up to 3.2X compared to FGL and by up to 10X compared to DUP across all benchmarks. To characterize how our performance results vary with input size, we experimented with inputs ranging from 25% of the L3 cache size up to 400% of the L3 size. We report the performance improvement of DUP and CCache versions relative to the FGL version at each input size.
CCache hits the L3 capacity at a larger input size than FGL (which stores locks with data in the L3) and DUP (which stores duplicated data in the L3) because CCache’s on-demand duplication improves L3 utilization. We evaluated the improvement by cutting CCache’s L3 capacity in half (i.e., giving CCache a 2MB L3) and comparing its performance to DUP with a full-sized L3 (i.e., DUP has 4MB of L3). Figure 7 compares CCache’s and DUP’s performance when run on an input matching the LLC capacity. With half the L3 cache size, CCache provides performance improvements of 1.1X for Page Rank and Key-value Store, 1.19X for K-Means, and 1.91X for BFS. CCache’s on-demand duplication is a marked improvement over DUP.
We also measured the peak memory overhead of the different versions of our benchmarks when run on an input whose working set size equals the LLC capacity. We used a mixture of static and dynamic reasoning to estimate the maximum amount of memory a version might use. The memory overhead for FGL is consistently the largest across all benchmarks, because storing fine-grained locks requires more memory than statically duplicating the data structure across threads. However, in practice, we observed that the FGL version had fewer L3 misses than the DUP version because most of the benchmarks had significant sharing and, hence, did not incur the peak overhead of FGL. The data show that the low memory overhead of CCache helps improve performance compared to FGL and DUP.
We collected additional data to characterize the performance difference between CCache, FGL, and DUP. The data suggest that reductions in invalidation traffic and L3 misses contributed to the systems’ performance differences.
DUP vs. FGL Our performance results show that CCache consistently outperforms the FGL and DUP versions at larger working set sizes. However, the relative performance of FGL and DUP does not show a consistent trend across applications. For Page Rank, Key-Value Store, and K-Means, DUP outperforms FGL by eliminating the serialization caused by fine-grained locking and the coherence traffic generated by the exchange of critical sections. In BFS, DUP’s performance suffers because of the overhead of duplicating data across different cores. These results illustrate the tension between the serialization and coherence traffic incurred by fine-grained locking and the increase in memory footprint caused by data duplication.
Page Rank. Figure 8(a) shows the number of directory messages issued per 1000 cycles for our three versions when run on the random graph input. The reduction in directory accesses in CCache compared to the FGL and DUP versions explains the speedup achieved by CCache. Further, the increase in directory accesses of DUP with larger working sets corresponds with the reduced performance improvement provided by DUP for large working sets. We also observed a decrease in the number of L3 misses incurred by CCache compared to DUP and FGL, which could also contribute to CCache’s performance improvement. Note that CCache was able to outperform our highly optimized DUP implementation for larger, more realistic working set sizes without imposing the burden of efficient duplication on the programmer.
Key-value Store. Figure 8(b) shows the number of L3 misses per 1000 cycles for FGL, DUP, and CCache. The main result is that CCache’s performance improvement for Key-value Store corresponds to the reduction in L3 misses. The data also show that CCache incurs fewer L3 misses (2.5–3X fewer) than DUP and FGL when the working set size matches LLC capacity, further illustrating that CCache better utilizes the LLC. We also observed a consistent reduction in the number of invalidation signals issued in CCache compared to FGL. The reduction in invalidation traffic also likely contributes to CCache’s 2.3X performance improvement.
BFS. Figure 8(c) shows the invalidation traffic per 1000 cycles for the FGL, DUP, CCache, and atomics versions. The result shows a significant reduction in invalidation traffic in the DUP and CCache versions compared to the FGL and atomics versions. The difference in the normalized invalidation traffic across different working set sizes corresponds with the performance improvement of CCache over FGL and atomics. We also observed that CCache and atomics incurred about the same number of L3 misses, which was substantially fewer than those incurred by FGL and DUP. This could explain the bigger performance gap of CCache over FGL and DUP compared to atomics. CCache provides the performance benefits of atomic instructions while being more generally applicable to a variety of commutative updates. We discuss CCache’s generality in more detail in Section 6.3.
K-Means. Figure 8(d) shows the invalidation traffic normalized to cycle count for the three versions when run on the floating point dataset, illustrating the likely root of CCache’s performance improvement for K-Means. CCache has less coherence traffic than FGL because CCache operates on private duplicates, whereas FGL must synchronize and make updates globally visible. FGL requires coherence actions to keep both locks and data coherent, which CCache need not do. CCache also had fewer coherence actions than DUP for K-Means because CCache’s merge differs from DUP’s. During a DUP merge, one thread iterates over all cores’ copies of the data to produce a consistent result. The merging core incurs a coherence overhead to invalidate the duplicated data in every other core’s cache. After the merge, each core that had its duplicate invalidated must re-cache the data, incurring yet more coherence overhead. CCache cores avoid the coherence overhead by manipulating data in their L1s and merging their own data.
6.3 Support for Diverse Merge Functions
To demonstrate CCache’s flexibility in supporting arbitrary merge functions, we implemented saturating counter and complex number multiplication versions of Key-Value Store and an approximate merge version of K-Means. In the saturating counter benchmark, CCache’s merge function reduces privatized copies up to a threshold. For complex multiplication, the merge function complex-multiplies privatized copies. We showed that CCache also flexibly supports approximate computing by writing an approximate merge function for K-Means. The approximate merge version discards updates for some points in a dataset, which does not significantly alter cluster centers. We randomly dropped 10% of a core’s merge operations, which leads to a 20% degradation in the intra-cluster distance metric. In some cases, quality degradation is tolerable and CCache allows the programmer to make such quality-performance trade-offs. Our evaluation showed that CCache provides a speedup over FGL and DUP similar to the baseline versions of these three applications (Figure 6). The results show that CCache’s performance benefits are not restricted to applications with simple commutative operations and extend to arbitrary commutative updates.
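The sketch below illustrates two of these merge styles on scalar values. Both helpers are hypothetical and are not the benchmark implementations. `merge_cmul` assumes the source copy is nonzero so that the core's net multiplicative factor can be recovered by division. `approx_merge_add` shows the dropping idea on a scalar additive merge instead of K-Means cluster centers.

```python
import random


def merge_cmul(mem, source, updated):
    """Multiplicative merge for complex values.

    Assumes `source` is nonzero; the core's net factor is updated / source.
    """
    factor = updated / source
    return mem * factor


def approx_merge_add(mem, source, updated, rng, drop_prob=0.1):
    """Additive merge that discards the update with probability `drop_prob`."""
    if rng.random() < drop_prob:
        return mem            # update dropped: memory is left unchanged
    return mem + (updated - source)


# Two cores fetched 1+1j and privately multiplied by 2j and by 3.
start = complex(1, 1)
mem = merge_cmul(start, start, start * 2j)
mem = merge_cmul(mem, start, start * 3)

# 1000 unit increments merged approximately; roughly 10% are discarded.
rng = random.Random(0)
total = 0
for _ in range(1000):
    total = approx_merge_add(total, source=0, updated=1, rng=rng)
```

The multiplicative merge yields the product of both cores' factors applied to the starting value, while the approximate merge trades a lower final count for fewer applied updates.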
6.4 Characterizing the Merge-on-evict Optimization
By default, CCache uses the merge-on-evict and dirty-merge optimizations (Section 4.3). We studied the benefit of these optimizations by re-running our benchmarks without the optimizations and comparing to the performance with the optimizations. Neither optimization significantly improved the performance of un-optimized CCache because merge functions comprise only a small fraction of total cycles executed for all our benchmarks.
While the optimizations did not improve the performance of a baseline CCache implementation, they are essential for improving performance over DUP and FGL versions. The merge-on-evict optimization improves locality at the source buffer, requiring fewer merges and, consequently, reduced locking of LLC lines. Figure 9 shows the reduction in source buffer evictions by our merge-on-evict optimization compared to a CCache implementation without the optimization. The optimization reduced the number of evictions by 2.2X in BFS and 409.9X in K-Means. The increased source buffer locality makes CCache a more efficient alternative to data duplication than DUP. We also evaluated the performance benefits of the dirty-merge optimization. The optimization reduces the number of merges required by only merging data updated by a core. Our evaluation showed that this optimization does not provide a performance benefit to update-heavy benchmarks like K-Means, Key-value Store, and BFS. However, for Page Rank, where a lot of CData is only read and never updated, the dirty-merge optimization was able to reduce the number of merges performed by 24X compared to a CCache implementation without the optimization.
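The effect of the dirty-merge optimization can be sketched in software as follows. This is an illustrative model, not the hardware mechanism: `PrivateLine`, `evict`, and the `stats` dictionary are hypothetical names, a line holds a single integer, and the merge is assumed to be additive.

```python
from dataclasses import dataclass


@dataclass
class PrivateLine:
    source: int          # value observed when the line was fetched
    value: int           # core's private working copy
    dirty: bool = False  # set on the first update to the private copy

    def read(self):
        return self.value

    def add(self, amount):
        self.value += amount
        self.dirty = True


def evict(memory, addr, line, stats):
    """Merge a privatized line back into memory, skipping clean lines."""
    if not line.dirty:
        stats["skipped"] += 1   # read-only line: no merge is needed
        return
    memory[addr] += line.value - line.source
    stats["merged"] += 1


memory = {0: 10, 1: 20}
stats = {"merged": 0, "skipped": 0}
line0 = PrivateLine(source=memory[0], value=memory[0])
line1 = PrivateLine(source=memory[1], value=memory[1])
line0.read()      # line 0 is only read
line1.add(5)      # line 1 is updated
evict(memory, 0, line0, stats)
evict(memory, 1, line1, stats)
```

A read-mostly workload such as Page Rank spends most evictions on the skipped path, which is consistent with the large reduction in merges reported above.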
7 Related Work
Several areas of prior work relate to CCache. Section 2 discussed explicit data duplication and reductions. This section discusses recent work on COUP [42] and then discusses work on: (1) combining parallel updates; (2) expansion and privatization; and (3) speculation, including transactions.
The closest prior work to CCache is COUP, which uses commutativity to reduce data movement. COUP extends the coherence protocol to support commutativity and supports a small, fixed set of operations in hardware. While similar, CCache differs significantly. CCache is more flexible, allowing programmer-defined, software commutative operations. In contrast, COUP supports a fixed set of hardware commutative operations only. This key difference makes CCache more flexible and broadly applicable than COUP. Section 6 evaluates CCache’s flexibility with a spectrum of merge types. Additionally, COUP requires coherence protocol changes and CCache does not. In CCache, COps never generate outgoing coherence requests, and are never the subject of incoming requests; CData lines need no coherence actions and non-CData remain coherent as usual. Lastly, COUP cannot exploit CCache’s merge-on-evict optimization because COUP does not get information from the programmer (i.e., soft_merge vs. merge).
7.1 Combining Independent Parallel Updates
Prior systems supported combining the results of independent executions of dependent operations. Parallel prefix [22] computations broke dependences by recursively decomposing operations and later combining partial results, similar to how CCache merges updates.
Commutativity analysis [32] identifies and parallelizes commutative regions of code, inserting code to combine commutative partial results. CCache draws inspiration from this work, but differs considerably, allowing arbitrary, programmer-defined merge functions and targeting hardware.
Commutative set [30] is a language that allows specifying commutative code blocks. A compiler can then produce code that executes commutative blocks in parallel, serializing them on completion. The main distinction from CCache is that CCache’s parallelization is on-demand and avoids the need for compiler analysis by using hardware support.
Concurrent revisions [11, 12] followed the tack of commutativity analysis, promoting an execution model that allows a software thread to operate on a copy of a shared data structure. The central metaphor of this work is “memory as version control”. The system resolves conflicting parallel updates with a custom merge function. This work’s execution model was motivational for CCache’s use of duplication and merging. A key difference is that CCache uses architecture, requiring very few software or compiler changes.
RetCon [9] operates on speculative parallel copies of data in transactions. When transactions conflict, RetCon avoids rollback by symbolically tracking updated values. Applying an update derived from symbolic tracking is like CCache’s use of a merge function to combine partial results. What differs is that symbolic tracking is limited in the types of merges it can perform. RetCon cannot perform merges that cannot be represented with the supported symbolic arithmetic expressions. CCache permits general merge functions and does not incur the cost of speculation.
Symple [31] automatically parallelizes dependent operations to user-defined data aggregations, also using symbolic expressions. Symple treats unresolved dependences as symbolic expressions, eventually resolving them to concrete values. Like Symple, CCache allows manipulating shared data and merging partial results. CCache differs in its use of architecture and in not needing symbolic analysis, which is likely to be complex in hardware.
7.2 Duplication, Expansion, and Privatization
Expansion makes copies of scalars [29], data structures [41], and arrays [38], allowing threads to manipulate independent replicas. Expansion, especially of large structures, is like the duplication in our evaluation (Section 6). Expansion risks excessive cache pressure, especially in the single-machine, in-memory workloads that we target.
Data duplication and reduction have widespread adoption in parallel frameworks [15, 16, 28, 18]. These systems focus on scaling to big data and large numbers of machines, unlike CCache, which does not require a language framework, instead leveraging hardware to avoid static duplication.
7.3 Speculative Privatization
A class of techniques uses speculation to parallelize accesses to shared data. Speculation increases parallelism, but has high software overheads and hardware complexity.
Software and hardware transactions buffer updates (or log values) and threads compute independently on shared data. Mis-speculation aborts and rolls back updates, re-executing to find a conflict-free serialization. The similarity to CCache is that transactional threads manipulate isolated values. However, transactions abort work on a conflict, rather than trying to produce a valid serialization. By contrast, CCache’s merge function aggressively finds a serialization, despite conflicts.
Both speculative multithreading [20, 37, 36] and bulk speculative consistency [19, 13, 39, 8] are transaction-like execution models that continuously dynamically duplicate data, enabling different threads to operate on duplicates in parallel. Like most other transactions work, these efforts primarily roll back work when the system detects an access to the same data in different threads. In contrast, CCache merges manipulations of shared data in different threads.
Prior work on TMESI [33] also modified the coherence protocol to support programmable privatization for transactions. CCache also offers a form of programmable privatization, but differs in several ways. CCache does not require a large number of additional coherence protocol states to handle privatized data. CCache has only a single “state” – the CCache bit – because it privatizes commutatively updated data only, and those data are kept coherent by merging. Unlike TMESI, CCache avoids the cost of speculation, providing atomicity at the granularity of cache lines only, not transactional read and write sets. Moreover, CCache is applicable to lock-based code, while TMESI is specific to transactions.
8 Conclusions and Future Work
We presented CCache, a system that improves the performance of parallel programs using on-demand data duplication. CCache improves the performance of accesses to memory that are commutative. Leveraging the fact that commutative operations can execute correctly in any order, CCache allows each core to operate on involved data independently, without coherence actions or synchronization. Merging combines cores’ independently computed results with memory, producing a consistent, coherent final memory state. Our evaluation showed that CCache considerably improves the performance of several important applications, including clustering, graph processing, and key-value lookups, even earning a performance improvement over a system with twice the amount of L3 cache. Future work on CCache goes in two directions. First, we plan to leverage other high-level properties, such as approximability, to extend its benefits to programs with non-commutative operations. Second, we envision CCache-like support to remediate conflicts between commutative operations in conflict-checking parallel execution models [23, 7, 25].
References
- [1] Advanced Micro Devices. Advanced Synchronization Facility Proposed Architectural Specification, Publication 45432, Rev. 2.1, 2009.
- [2] C. S. Ananian, K. Asanovic, B. C. Kuszmaul, C. E. Leiserson, and S. Lie. Unbounded transactional memory. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture, HPCA ’05, pages 316–327, Washington, DC, USA, 2005. IEEE Computer Society.
- [3] A. Aviram, S.-C. Weng, S. Hu, and B. Ford. Efficient system-enforced deterministic parallelism. Communications of the ACM, 55(5):111–119, 2012.
- [4] S. Beamer, K. Asanovic, and D. A. Patterson. The GAP benchmark suite. CoRR, abs/1508.03619, 2015.
- [5] T. Bergan, O. Anderson, J. Devietti, L. Ceze, and D. Grossman. Coredet: a compiler and runtime system for deterministic multithreaded execution. In ACM SIGARCH Computer Architecture News, volume 38, pages 53–64. ACM, 2010.
- [6] E. Berger, T. Yang, T. Liu, D. Krishnan, and A. Novark. Grace: safe and efficient concurrent programming. Citeseer, 2009.
- [7] S. Biswas, M. Zhang, M. D. Bond, and B. Lucia. Valor: Efficient, software-only region conflict exceptions. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, pages 241–259, New York, NY, USA, 2015. ACM.
- [8] C. Blundell, M. M. Martin, and T. F. Wenisch. Invisifence: Performance-transparent memory ordering in conventional multiprocessors. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, pages 233–244, New York, NY, USA, 2009. ACM.
- [9] C. Blundell, A. Raghavan, and M. M. Martin. Retcon: Transactional repair without replay. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, pages 258–269, New York, NY, USA, 2010. ACM.
- [10] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on World Wide Web 7, WWW7, pages 107–117, Amsterdam, The Netherlands, 1998. Elsevier Science Publishers B. V.
- [11] S. Burckhardt, A. Baldassin, and D. Leijen. Concurrent programming with revisions and isolation types. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA ’10, pages 691–707, New York, NY, USA, 2010. ACM.
- [12] S. Burckhardt, D. Leijen, C. Sadowski, J. Yi, and T. Ball. Two for the price of one: A model for parallel and incremental computation. In Proceedings of the 2011 ACM International Conference on Object Oriented Programming Systems Languages and Applications, OOPSLA ’11, pages 427–444, New York, NY, USA, 2011. ACM.
- [13] L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas. Bulksc: Bulk enforcement of sequential consistency. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, pages 278–289, New York, NY, USA, 2007. ACM.
- [14] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. In Proceedings of the 2009 IEEE International Symposium on Workload Characterization (IISWC), IISWC ’09, pages 44–54, Washington, DC, USA, 2009. IEEE Computer Society.
- [15] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, OSDI’04, pages 10–10, Berkeley, CA, USA, 2004. USENIX Association.
- [16] J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008.
- [17] K. Fraser and T. Harris. Concurrent programming without locks. ACM Trans. Comput. Syst., 25(2), May 2007.
- [18] M. Frigo, P. Halpern, C. E. Leiserson, and S. Lewin-Berlin. Reducers and other cilk++ hyperobjects. In Proceedings of the Twenty-first Annual Symposium on Parallelism in Algorithms and Architectures, SPAA ’09, pages 79–90, New York, NY, USA, 2009. ACM.
- [19] L. Hammond, B. D. Carlstrom, V. Wong, B. Hertzberg, M. Chen, C. Kozyrakis, and K. Olukotun. Programming with transactional coherence and consistency (tcc). In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XI, pages 1–13, New York, NY, USA, 2004. ACM.
- [20] L. Hammond, M. Willey, and K. Olukotun. Data speculation support for a chip multiprocessor. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS VIII, pages 58–69, New York, NY, USA, 1998. ACM.
- [21] IEEE and The Open Group. IEEE Standard 1003.1-2001, 2001.
- [22] R. E. Ladner and M. J. Fischer. Parallel prefix computation. J. ACM, 27(4):831–838, Oct. 1980.
- [23] B. Lucia, L. Ceze, K. Strauss, S. Qadeer, and H.-J. Boehm. Conflict exceptions: Simplifying concurrent language semantics with precise hardware exceptions for data-races. In Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, pages 210–221, New York, NY, USA, 2010. ACM.
- [24] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’05, pages 190–200, New York, NY, USA, 2005. ACM.
- [25] D. Marino, A. Singh, T. Millstein, M. Musuvathi, and S. Narayanasamy. Drfx: A simple and efficient memory model for concurrent programming languages. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10, pages 351–362, New York, NY, USA, 2010. ACM.
- [26] N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi. Cacti 6.0: A tool to model large caches. HP Laboratories, pages 22–31, 2009.
- [27] R. C. Murphy, K. B. Wheeler, B. W. Barrett, and J. A. Ang. Introducing the graph 500. Cray Users Group (CUG), 2010.
- [28] OpenMP Architecture Review Board. OpenMP Application Programming Interface Version 4.0. http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf. July 2013.
- [29] D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Commun. ACM, 29(12):1184–1201, Dec. 1986.
- [30] P. Prabhu, S. Ghosh, Y. Zhang, N. P. Johnson, and D. I. August. Commutative set: A language extension for implicit parallel programming. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’11, pages 1–11, New York, NY, USA, 2011. ACM.
- [31] V. Raychev, M. Musuvathi, and T. Mytkowicz. Parallelizing user-defined aggregations using symbolic execution. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP ’15, pages 153–167, New York, NY, USA, 2015. ACM.
- [32] M. C. Rinard and P. C. Diniz. Commutativity analysis: A new analysis technique for parallelizing compilers. ACM Trans. Program. Lang. Syst., 19(6):942–991, Nov. 1997.
- [33] A. Shriraman, M. F. Spear, H. Hossain, V. J. Marathe, S. Dwarkadas, and M. L. Scott. An integrated hardware-software approach to flexible transactional memory. In ACM SIGARCH Computer Architecture News, volume 35, pages 104–115. ACM, 2007.
- [34] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard. Managing performance vs. accuracy trade-offs with loop perforation. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering, pages 124–134. ACM, 2011.
- [35] D. J. Sorin, M. D. Hill, and D. A. Wood. A Primer on Memory Consistency and Cache Coherence. Morgan & Claypool Publishers, 1st edition, 2011.
- [36] J. G. Steffan, C. Colohan, A. Zhai, and T. C. Mowry. The stampede approach to thread-level speculation. ACM Trans. Comput. Syst., 23(3):253–300, Aug. 2005.
- [37] J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. A scalable approach to thread-level speculation. In Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA ’00, pages 1–12, New York, NY, USA, 2000. ACM.
- [38] P. Tu and D. A. Padua. Automatic array privatization. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing, pages 500–521, London, UK, 1994. Springer-Verlag.
- [39] T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. Mechanisms for store-wait-free multiprocessors. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, pages 266–277, New York, NY, USA, 2007. ACM.
- [40] R. Yoo, C. Hughes, K. Lai, and R. Rajwar. Performance evaluation of intel(r) transactional synchronization extensions for high-performance computing. In Supercomputing 2013, 2013.
- [41] H. Yu, H.-J. Ko, and Z. Li. General data structure expansion for multi-threading. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’13, pages 243–252, New York, NY, USA, 2013. ACM.
- [42] G. Zhang, W. Horn, and D. Sanchez. Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems. In Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48, pages 13–25, New York, NY, USA, 2015. ACM.