1 Introduction
Conventional SRAM/DRAM-based memory systems are unable to meet the growing demand for low-power, low-cost and large-capacity memories. Scaling up the memory size is hampered by technology scalability as well as leakage and refresh power. As a result, multiple non-volatile memories (NVMs) such as phase change memory (PCM), spin transfer torque RAM (STT-RAM) and resistive RAM (ReRAM) have emerged and attracted considerable attention [1, 2, 3, 4]. These memory technologies offer power, bandwidth and scalability features amenable to processor scaling. However, they pose new challenges such as imperfect reliability and higher write latency. The relatively new spin-orbitronics-based racetrack memory (RM) represents a promising option to surmount the aforementioned limitations by offering ultra-high capacity, energy efficiency, lower per-bit cost, higher reliability and smaller read/write latency [5, 6]. Due to these attractive features, RMs have been investigated at all levels of the memory hierarchy. Table 1 provides a comparison of RM with contemporary volatile and non-volatile memories.
The diverse memory landscape has motivated research on hardware and software optimizations for more efficient utilization of NVMs in the memory subsystem. To avoid the design complexity added by hardware solutions, software-based data placement has become an important emerging area for compiler optimization [7]. Even modern processors such as Intel's Knights Landing offer means for software-managed on-board memories. Compiler-guided data placement techniques have been proposed at various levels of the memory hierarchy, not only to improve the temporal/spatial locality of memory objects but also to address the limited lifetime and high write latency of NVMs [8, 9, 10, 11]. In the context of near-data processing (NDP), efficient data placement improves the effectiveness of NDP cores by allowing more accesses to the local memory stack and mitigating remote accesses.
In this paper, we study data placement optimizations for the particular case of racetrack memories. While RMs do not suffer from reliability and latency issues, they pose a significantly different challenge. From the architectural perspective, RMs store multiple bits (1 to 100) per access point in the form of magnetic domains in a tape-like structure, referred to as a track. Each track is equipped with one or more magnetic tunnel junction (MTJ) sensors, referred to as access ports, that are used to perform read/write operations. While a track can be equipped with multiple access ports, the number of access ports per track is always much smaller than the number of domains. In the scope of this paper, we consider the ideal case of a single access port per track, which yields the highest RM density. This implies that the desired bits have to be shifted and aligned to the port position prior to access. The shift operations not only lead to variable access latency but also impact the energy consumption of the system, since the time and the energy required for an access depend on the position of the domain relative to the access port. We propose a set of techniques that reduce the number of shift operations by placing temporally close accesses at nearby locations inside the RM.
Table 1: Comparison of RM 4.0 with contemporary volatile and non-volatile memories.

|                      | SRAM      | eDRAM  | DRAM   | STT-RAM | ReRAM  | PCM       | RaceTrack 4.0 |
|----------------------|-----------|--------|--------|---------|--------|-----------|---------------|
| Cell Size (F^2)      | 120-200   | 30-100 | 4-8    | 6-50    | 4-10   | 4-12      | 2             |
| Write Endurance      | —         | —      | —      | 4×10^12 | —      | —         | —             |
| Read Time            | Very Fast | Fast   | Medium | Medium  | Medium | Slow      | Fast          |
| Write Time           | Very Fast | Fast   | Medium | Slow    | Slow   | Very Slow | Fast          |
| Dynamic Write Energy | Low       | Medium | Medium | High    | High   | High      | Low           |
| Dynamic Read Energy  | Low       | Medium | Medium | Low     | Low    | Medium    | Low           |
| Leakage Power        | High      | Medium | Medium | Low     | Low    | Low       | Low           |
| Retention Period     | As long as voltage applied | μs | ms | Variable | Years | Years | Years |
Concretely, we make the following contributions.

- An integer linear programming (ILP) formulation of the data placement problem for RMs.
- A thorough analysis of existing offset assignment heuristics, originally proposed for data placement in DSP stack frames, applied to data placement in RM.
- ShiftsReduce, a heuristic that computes memory offsets by exploiting the temporal locality of accesses.
- An improvement to the state-of-the-art RM placement heuristic [13] that judiciously decides the next memory offset in case of multiple contenders.
- A final refinement step based on a genetic algorithm to further improve the results.
We compare our approach with existing solutions on the OffsetStone benchmarks [14]. ShiftsReduce reduces the number of shifts by 28.8%, which is 4.4% and 6.6% better than the best-performing existing heuristics [14] and [13], respectively.
The rest of the paper is organized as follows. Section 2 explains the recently proposed RM 4.0, provides the motivation for this work and reviews existing data placement heuristics. Our ILP formulation and the ShiftsReduce heuristic are described in Section 3 and Section 4, respectively. Benchmark descriptions, evaluation results and analysis are presented in Section 5. Section 6 discusses the state of the art and Section 7 concludes the paper.
2 Background and motivation
This section provides background on the working principle of RMs, current architectural sketches and further motivates the data placement problem (both for RAMs and RMs).
2.1 Racetrack memory
Memory devices have evolved over the last decades from hard disk drives to novel spin-orbitronics-based memories. The latter use spin-polarized currents to manipulate the state of the memory. The domain walls (DWs) in RMs are moved into a third dimension by an electrical current [15, 5]. The racetracks can be placed vertically (3D) or horizontally (2D) on the surface of a silicon wafer, as shown in Fig. 1. This allows for higher density but is constrained by crucial design factors such as the shift speed, the DW-to-DW distance and insensitivity to external influences such as magnetic fields.
In earlier RM versions, DWs were driven by a current through a magnetic layer, which attained a DW velocity of about 100 m/s [16]. The discovery of even higher DW velocities in structures where the magnetic film is grown on top of a heavy metal allowed the DW velocity to be increased to about 300 m/s [17]. The driving mechanism is based on spin-orbit effects in the heavy metal, which lead to spin currents injected into the magnetic layer [18]. However, a major drawback of these designs was that the magnetic film was very sensitive to external magnetic fields. Furthermore, they exhibited fringing fields, which prevented DWs from being packed closely together.
The most recent RM 4.0 resolves these issues by adding an additional magnetic layer on top, which fully compensates the magnetic moment of the bottom layer. As a consequence, the magnetic layer does not exhibit fringing fields and is insensitive to external magnetic fields. In addition, due to the exchange coupling of the two magnetic layers, the DW velocity can reach up to 1000 m/s [19, 6].
2.1.1 Memory architecture
Fig. 2 shows a widespread architectural sketch of an RM based on [20]. In this architecture an RM is divided into multiple domain block clusters (DBCs), each of which contains M tracks with N domains each. Each domain stores a single bit, and we assume that each M-bit variable is distributed across the M tracks of a DBC. Accessing a bit from a track requires shifting and aligning the corresponding domain to the track's port position. We further assume that the domains of all tracks in a particular DBC move in lockstep, so that all bits of a variable are aligned to the port position at the same time for simultaneous access. We consider a single port per track because adding more ports increases the area, due to the additional transistors, decoders, sense amplifiers and output drivers. As shown in Fig. 2, each DBC can store a maximum of N variables.
Under the above assumptions, the shift cost to access a particular variable may vary from 0 to N−1 domain positions. It is worth mentioning that worst-case shifts can consume more than 50% of the RM energy [21] and prolong access latency by up to 26x compared to SRAM [20]. The architectural simulator RTSim [22] can be used to analyze the impact of shifts on RM performance and energy consumption, and to explore the design space by varying the above-mentioned design parameters.
2.2 Motivation example
To illustrate the problem of data placement, consider the set of data items and their access order in Fig. (a). We refer to the set of program data items as the set of program variables V and to the order of their accesses as the access sequence S, where each element of S is a variable in V, for any given source code. Note that data items can refer to actual variables placed on a function stack or to accesses to fields of a structure or elements of an array. We consider two different memory placements of the program variables, a naive one (P1) and a more carefully chosen one (P2), as shown in Fig. (b).
The numbers of shifts for the two placements P1 and P2 in Fig. (b) are shown in Fig. 6. The shift cost between any two successive accesses in the access sequence is equal to the absolute difference of their memory offsets (e.g., the cost of accessing c directly after b in P1 is the distance between their offsets). The naive data placement P1 incurs considerably more shifts over the entire access sequence than P2. This leads to an improvement in both latency and energy consumption even for this simple illustrative example.
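Since the per-access shift cost is simply the distance between consecutive offsets, the comparison can be reproduced in a few lines of code. The sequence and the two placements below are illustrative stand-ins, not the exact values of the paper's figure:

```python
# Count RM shifts for a memory placement by tracking the access port position.
# Sequence and placements are hypothetical, chosen only to illustrate the effect.

def total_shifts(placement, sequence):
    """Sum of |offset(curr) - offset(prev)| over successive accesses."""
    shifts = 0
    port = placement[sequence[0]]      # align the port to the first access
    for item in sequence[1:]:
        shifts += abs(placement[item] - port)
        port = placement[item]         # the port now rests at this domain
    return shifts

sequence = ["a", "b", "c", "b", "a", "d", "b", "c"]
p1 = {"a": 0, "b": 1, "c": 2, "d": 3}  # naive: order of first appearance
p2 = {"d": 0, "a": 1, "b": 2, "c": 3}  # temporally close items placed adjacently

print(total_shifts(p1, sequence), total_shifts(p2, sequence))
```

For this sequence the naive placement costs 10 shifts while the locality-aware one costs 8, mirroring the P1/P2 effect described above.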
2.3 Problem definition
Fig. 7 shows a conceptual flow of the data placement problem in RMs. The access sequence corresponds to memory traces, which can be obtained with standard techniques: via profiling and tracing, e.g., using Pin [23], inferred from static analysis, e.g., for static control parts using polyhedral analysis, or with a hybrid of both as in [24]. In this paper we assume the traces are given and focus on the data placement step that produces the memory layout. We investigate a number of exact/inexact solutions that intelligently decide the memory offsets of the program variables, referred to as the memory layout, based on the access sequence. The memory for which the layout is generated could be a scratchpad memory, a software-managed flat memory similar to the on-board memory in Intel's Knights Landing processor, or the memory stack exposed to an NDP core.
The shift cost of an access sequence depends on the memory offsets of the data items. We assume that each data item is stored at a single memory offset of the RM (cf. Section 2.1.1). We denote the memory offset of a data item v as π(v). The shift cost between two data items v_i and v_j is then:

C(v_i, v_j) = |π(v_i) − π(v_j)|    (1)
The total shift cost C_total(S) of an access sequence S = ⟨s_1, …, s_|S|⟩ is computed by accumulating the shift costs of successive accesses:

C_total(S) = Σ_{k=1}^{|S|−1} C(s_k, s_{k+1})    (2)
The data placement problem for RMs can be then defined as:
Definition 1
Given a set of variables V and an access sequence S, find a data placement π for V such that the total shift cost C_total(S) is minimized.
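For tiny instances, this definition can be made concrete with a brute-force reference solver that enumerates all offset permutations. The sketch below uses an illustrative sequence (not the paper's example) and is only feasible for a handful of variables; the ILP and heuristics discussed later are needed at scale:

```python
# Exhaustive solver for the data placement problem: try every permutation of
# offsets and keep the one that minimizes the total shift cost.
from itertools import permutations

def total_cost(offsets, seq):
    # Sum of absolute offset differences over successive accesses.
    return sum(abs(offsets[a] - offsets[b]) for a, b in zip(seq, seq[1:]))

def optimal_placement(variables, seq):
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(variables))):
        offsets = dict(zip(variables, perm))
        cost = total_cost(offsets, seq)
        if cost < best_cost:
            best, best_cost = offsets, cost
    return best, best_cost

seq = ["a", "b", "c", "b", "a", "d", "b", "c"]
placement, cost = optimal_placement(["a", "b", "c", "d"], seq)
print(placement, cost)
```

Enumeration grows as n!, which is exactly why the paper resorts to an ILP and heuristics for realistic instance sizes.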
2.4 State-of-the-art data placement solutions
The data placement problem in RMs is similar to the classical single offset assignment (SOA) problem in DSP stack frames [14, 25, 26, 27]. The heuristics proposed for SOA assign offsets to stack variables, aiming at maximizing the likelihood that two consecutive references at runtime will be to the same or adjacent stack locations. Most SOA heuristics work on an access graph and formulate the problem as a maximum weighted Hamiltonian path (MWHP) or maximum weight path covering (MWPC) problem. An access graph G = (V, E) represents an access sequence S, where V is the set of vertices corresponding to the program variables. An edge (v_i, v_j) ∈ E has weight w_ij if the variables v_i and v_j are accessed consecutively w_ij times in S. The assignment is then constructed by solving the MWHP/MWPC problem. The access graph for the access sequence in Fig. (a) is shown in Fig. 8.
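Constructing such an access graph from a sequence is straightforward; a minimal sketch (using an illustrative sequence, not the paper's) is:

```python
# Build the weighted access graph from an access sequence: the weight of edge
# {u, v} counts how often u and v are accessed consecutively.
from collections import Counter

def access_graph(seq):
    weights = Counter()
    for u, v in zip(seq, seq[1:]):
        if u != v:                     # repeated accesses to one item add no edge
            weights[frozenset((u, v))] += 1
    return weights

g = access_graph(["a", "b", "c", "b", "a", "d", "b", "c"])
print({tuple(sorted(k)): w for k, w in g.items()})
```

The resulting edge weights are what the MWHP/MWPC formulations and the group-based heuristics below operate on.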
The SOA cost for two consecutive accesses is binary: if the next access cannot be reached within the auto-increment/decrement range, an extra instruction is needed to modify the address register (a cost of 1); otherwise the cost is 0. In contrast, the shift cost in RM is a natural number. For RM placement, the SOA heuristics must be revisited, since they only consider edge weights of successive elements in S. This may produce good results on short access sequences, due to the limited number of vertices and the small end-to-end distance in the placement, but might not perform well on longer access sequences. In this paper, we extend the SOA heuristics to account for the more general cost function.
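The contrast between the two cost models can be stated in two one-line functions (the auto-increment/decrement reach of 1 is an assumption for illustration):

```python
# SOA charges a fixed 1 only when the next offset falls outside the
# auto-increment/decrement range; RM pays the full distance in shifts.

def soa_cost(o1, o2, reach=1):
    return 0 if abs(o1 - o2) <= reach else 1

def rm_cost(o1, o2):
    return abs(o1 - o2)

print(soa_cost(0, 7), rm_cost(0, 7))  # SOA: 1 extra instruction, RM: 7 shifts
```

A placement that is good under the binary SOA model can therefore still be far from optimal under the RM model, which is why the distance-aware cost function matters.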
Chen et al. recently proposed a group-based heuristic for data placement in RMs [13]. Based on an access graph, it assigns offsets to vertices by moving them into a group g. The position of a data item within the group indicates its memory offset. The first vertex added to the group is the one with the maximum vertex weight in the access graph, where the vertex weight is the sum of all edge weights that connect a vertex to the other vertices in V. The remaining elements are iteratively added to the group based on their vertex-to-group weights (maximum first). The vertex-to-group weight of a vertex v is the sum of all edge weights that connect v to the vertices in g.
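A compact sketch of this group-based scheme, reconstructed from the description above (the details of the published algorithm in [13] may differ), looks as follows. Note that tie handling here is arbitrary (`max` keeps the first maximum in sorted order), which is precisely the weakness that ChenTB and ShiftsReduce address later:

```python
# Group-based placement in the style of Chen et al.: seed the group with the
# maximum vertex-weight node, then repeatedly append the ungrouped node with
# the largest vertex-to-group weight. Group position becomes the offset.

def chen_placement(vertices, w):
    # w maps frozenset({u, v}) -> edge weight of the access graph.
    def to_group(v, group):
        return sum(w.get(frozenset((v, g)), 0) for g in group)

    ungrouped = sorted(vertices)          # sorted for deterministic ties
    first = max(ungrouped, key=lambda v: to_group(v, [u for u in ungrouped if u != v]))
    group = [first]
    ungrouped.remove(first)
    while ungrouped:
        nxt = max(ungrouped, key=lambda v: to_group(v, group))
        group.append(nxt)
        ungrouped.remove(nxt)
    return {v: i for i, v in enumerate(group)}

w = {frozenset(("a", "b")): 2, frozenset(("b", "c")): 3,
     frozenset(("a", "d")): 1, frozenset(("b", "d")): 1}
print(chen_placement(["a", "b", "c", "d"], w))
```

On this illustrative graph, b has the largest vertex weight and is placed first, followed by c, a and d in order of their vertex-to-group weights.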
We argue that intelligent tie-breaking for equal vertex-to-group weights deserves investigation. Further, the static assignment of the highest-weight vertex to offset 0 seems restrictive; defining positions relative to other vertices provides more flexibility to navigate the solution space.
3 Optimal data placement: ILP formulation
This section presents an ILP formulation for the data placement problem in RM. Consider the access graph G = (V, E) and an access sequence S of accesses to the variables V = {v_1, …, v_n}. The edge weight w_ij between variables v_i and v_j can be computed as:

w_ij = Σ_{k=1}^{|S|−1} δ_k(v_i, v_j)    (3)

with δ_k(v_i, v_j) defined as:

δ_k(v_i, v_j) = 1 if {s_k, s_{k+1}} = {v_i, v_j}, and 0 otherwise    (4)
To model unique variable offsets we introduce binary variables x_ij (1 ≤ i ≤ n, 0 ≤ j ≤ n−1), where x_ij = 1 if and only if variable v_i is placed at memory offset j. Each variable receives exactly one offset:

Σ_{j=0}^{n−1} x_ij = 1  for all 1 ≤ i ≤ n    (5)

The memory offset of v_i is then computed as:

π(v_i) = Σ_{j=0}^{n−1} j · x_ij    (6)
Since the edges in the access graph embody the access sequence information, we use them to compute the total shift cost as:

C_total = Σ_{(v_i, v_j) ∈ E} w_ij · |π(v_i) − π(v_j)|    (7)
The cost function in Equation 7 is not inherently linear due to the absolute value in C(v_i, v_j) (cf. Equation 1). Therefore, we introduce auxiliary variables and perform a subsequent linearization. We introduce two non-negative integer variables p_ij and q_ij to rewrite |π(v_i) − π(v_j)| as:

|π(v_i) − π(v_j)| = p_ij + q_ij    (8)

such that

π(v_i) − π(v_j) = p_ij − q_ij    (C1)

p_ij · q_ij = 0    (C2)
The second, nonlinear constraint (C2) implies that at least one of the two integer variables must be 0. To linearize it, we use two binary variables α_ij and β_ij and a set of constraints:

p_ij ≤ (n−1) · α_ij    (C3)

q_ij ≤ (n−1) · β_ij    (C4)

α_ij + β_ij ≤ 1    (C5)
C5 guarantees that the binary variables α_ij and β_ij cannot both be 1 for a given pair (i, j). This, in combination with C3 and C4, sets at least one of the two integer variables to 0. We introduce the following constraint to enforce that the offsets assigned to data items are unique:

p_ij + q_ij ≥ 1  for all i ≠ j    (C6)

It ensures uniqueness because the left-hand side of the constraint equals the absolute difference of the two memory offsets (cf. Eq. 8), which would be 0 for two variables sharing an offset.
Finally, the linear objective function is:

minimize Σ_{(v_i, v_j) ∈ E} w_ij · (p_ij + q_ij)    (9)

The following two constraints are added to ensure that the offsets are within range:

π(v_i) ≥ 0  for all 1 ≤ i ≤ n    (C7)

π(v_i) ≤ n − 1  for all 1 ≤ i ≤ n    (C8)
4 Approximate data placement
In this section we describe our proposed heuristic and use its insights to extend the heuristic by Chen et al. [13].
4.1 The ShiftsReduce heuristic
ShiftsReduce is a group-based heuristic that effectively exploits the locality of accesses in the access sequence and assigns offsets accordingly. The algorithm starts with the maximum-weight vertex in the access graph and iteratively assigns offsets to the remaining vertices by considering their vertex-to-group weights. Recall from Section 2.4 that the weight of a vertex indicates the count of successive accesses of that vertex with the other vertices in V. Note that the maximum-weight vertex is not necessarily the vertex with the highest access frequency, considering repeated accesses of the same vertex.
Definition 2
The vertex-to-group weight of a vertex v with respect to a group g is defined as the sum of all edge weights that connect v to the vertices in g, i.e., w(v, g) = Σ_{u ∈ g} w_vu.
ShiftsReduce maintains two groups, referred to as the left group g_L (highlighted in red in Fig. 9) and the right group g_R (highlighted in green). Both g_L and g_R are lists that store the already placed vertices. The heuristic assigns offsets to vertices based on their global and local adjacencies. The global adjacency of a vertex is its vertex-to-group weight with the combined group g_L ∪ g_R (we abuse notation, using set operations on lists for better readability), while the local adjacency is its vertex-to-group weight with either of the subgroups, i.e., g_L or g_R.
Pseudocode for the ShiftsReduce heuristic is shown in Algorithm 1. The subgroups g_L and g_R initially start at index 0, the only index shared between them, and expand in opposite directions as new elements are added. We represent this with negative and positive indices, respectively, as shown in Fig. 9. The algorithm selects the maximum-weight vertex and places it at index 0 in both subgroups (cf. lines 5-6).
The algorithm then determines two more nodes and adds them to the right (cf. line 8) and left (cf. line 10) groups, respectively. These two nodes are the ones with the highest vertex-to-group weight, which boils down to the maximum edge weight to the initial vertex. Lines 12-27 iteratively select the next vertex based on its global adjacency (maximum first) and add it to g_L or g_R based on its local adjacency. If the local adjacency of a vertex with the left group is greater than that with the right group, it is added to the left group (cf. lines 14-16). Otherwise, the vertex is added to the right group (cf. lines 17-19).
The algorithm prudently breaks both inter-group and intra-group ties. In an inter-group tie situation (cf. line 20), when the vertex-to-group weight of the selected vertex is equal for both subgroups, the algorithm compares the edge weight of the selected vertex with the last vertices of both groups and favors the one with the maximum edge weight (cf. lines 21-26).
To resolve intra-group ties, we introduce the Tiebreak function. An intra-group tie arises when two candidate vertices have equal vertex-to-group weights (cf. line 2 in Tiebreak). Since the two vertices have equal adjacency with the other group elements, they could be placed in either order. We fix their order by comparing their edge weights with the fixed vertex and prioritize the vertex with the higher edge weight. The algorithm checks for an intra-group tie for every vertex before assigning it to the left group (cf. line 16) or right group (cf. line 19).
We demonstrate ShiftsReduce in Fig. 9 for the example in Fig. 8. The vertex with the highest vertex weight (equal to 4 + 3 + 1 = 8) is placed at index 0 in both subgroups. The two vertices with maximum edge weights to it are added to the right and left groups, respectively (cf. lines 8 and 10). At this point, the two subgroups contain two elements each. The next vertex is added to g_L because it has higher local adjacency with the left group than with the right group. The remaining vertices are added to g_L and g_R in a similar fashion. ShiftsReduce ensures that vertices at the far ends of the two groups have the least adjacency compared to the vertices placed in the middle. Note that the numbers of elements in g_L and g_R are not necessarily equal. Finally, offsets are assigned to vertices based on their group positions, as highlighted in Fig. 9.
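The bidirectional growth can be sketched as follows. This is a simplified reconstruction of Algorithm 1 from the description above, not the published implementation: the intra-group Tiebreak refinement is omitted for brevity, and the example graph is illustrative rather than the paper's Fig. 8:

```python
# Simplified ShiftsReduce sketch: grow a left and a right group outward from
# the maximum-weight vertex, always taking the remaining vertex with the
# largest global adjacency and attaching it to the side it is more locally
# adjacent to. Group positions (negative..0..positive) become offsets.

def shifts_reduce(vertices, w):
    def adj(v, group):
        return sum(w.get(frozenset((v, g)), 0) for g in group)

    rest = sorted(vertices)                      # sorted for deterministic ties
    v0 = max(rest, key=lambda v: adj(v, set(rest) - {v}))
    rest.remove(v0)
    left, right = [v0], [v0]                     # index 0 is shared
    for side in (right, left):                   # seed with v0's strongest neighbours
        if rest:
            nxt = max(rest, key=lambda v: w.get(frozenset((v, v0)), 0))
            side.append(nxt)
            rest.remove(nxt)
    while rest:
        grouped = set(left) | set(right)
        v = max(rest, key=lambda u: adj(u, grouped))   # global adjacency first
        rest.remove(v)
        la, ra = adj(v, left), adj(v, right)           # local adjacencies
        if la > ra:
            left.append(v)
        elif ra > la:
            right.append(v)
        else:                                    # inter-group tie: compare edge
            el = w.get(frozenset((v, left[-1])), 0)    # weight with each group's
            er = w.get(frozenset((v, right[-1])), 0)   # last vertex
            (left if el > er else right).append(v)
    order = list(reversed(left[1:])) + right
    return {v: i for i, v in enumerate(order)}

w = {frozenset(("a", "b")): 2, frozenset(("b", "c")): 3,
     frozenset(("a", "d")): 1, frozenset(("b", "d")): 1}
print(shifts_reduce(["a", "b", "c", "d"], w))
```

On this toy graph the sketch yields the layout d, a, b, c, whose total shift cost of 8 happens to be optimal, illustrating how combining global and local adjacency can beat a purely greedy single-group strategy.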
Given that we add vertices to two different groups, ties occur less often than in algorithms such as Chen's [13], where vertices are always added to the same group. For comparison purposes, we extend Chen's heuristic with tie-breaking in the following section.
4.2 The ChenTB heuristic
ChenTB is a heuristic that extends Chen's heuristic with the Tiebreak strategy introduced for ShiftsReduce. As shown in Algorithm 2, ChenTB initially adds three vertices to the group (lines 4-13). In contrast to Chen, we judiciously swap the order of the first two group elements by inspecting their edge weights with the third group element. Subsequently, lines 14-18 iteratively decide the positions of the remaining group elements until V is exhausted.
The stepwise addition of vertices to the group is demonstrated in Fig. 10. Initially, the algorithm inspects three vertices from V. The first group element is the vertex with the largest vertex weight (cf. line 4). The second is the vertex with the maximum edge weight to the first (cf. line 6). The third is the vertex with the maximum vertex-to-group weight (which is 3) with the current group (cf. line 8). Since the edge weight between the first and third elements (3) is higher than the edge weight between the second and third elements (0), we swap the positions of the first two elements in the group (cf. lines 10-11). At this point, the position of the first element is fixed while the second remains the last group element. The next vertex is selected for its highest vertex-to-group weight with the group; its vertex-to-group weight and that of the current last element are compared with respect to the fixed elements (cf. line 2 in Tiebreak), and the vertex with the higher weight becomes the new last element while the other's position is fixed (cf. line 9 in Tiebreak). The same argument applies to the subsequent selections. When the selected vertex and the last element have equal vertex-to-group weights with the fixed elements, ChenTB prioritizes the vertex with the higher edge weight to the last fixed element.
The final data placements of Chen, ChenTB and ShiftsReduce are presented in Fig. 11. For the access sequence in Fig. 8, ChenTB reduces the number of shifts to 23, compared to 27 for Chen, as shown in Fig. 11. ShiftsReduce further diminishes the shift cost to 19. Note that the placement decided by ShiftsReduce is the optimal placement shown in Fig. (b). We assume three or more vertices in the access graph for our heuristics, because the number of shifts for two vertices, in either order, remains unchanged.
5 Results and discussion
This section provides an evaluation and analysis of the proposed solutions on real-world application benchmarks. It presents a detailed qualitative and quantitative comparison with state-of-the-art techniques. Further, it provides a thorough analysis of SOA solutions for RMs.
5.1 Experimental setup
We perform all experiments on a Linux Ubuntu 16.04 system with an Intel Core i7-4790 (3.8 GHz) processor, using g++ v5.4.0. We implement our ILP model using the Python interface of the Gurobi optimizer (Gurobi 8.0.1) [28].
As benchmarks we use OffsetStone [14], which contains more than 3000 realistic sequences obtained from complex real-world applications (control-dominated as well as signal, image and video processing). Each application consists of a set of program variables and one or more access sequences. The number of program variables per sequence varies from 1 to 1336, while the length of the access sequences lies in the range of 0 to 3640. We evaluate and compare the performance of the following algorithms.
5.2 Revisiting SOA algorithms
We revisit, for the first time, all well-known offset assignment heuristics in the context of RMs. The empirical results in Fig. 12 show that the SOA heuristics can reduce the shift cost in RM by up to 24.4%. On average, Bartley, Liao, SOATB, INC and INCTB reduce the number of shifts by 10.9%, 10.9%, 12.2%, 22.9% and 24.4% compared to OFU, respectively. For brevity, we consider only the best-performing heuristic, i.e., INCTB, for the detailed analysis in the following sections.
5.3 Analysis of ShiftsReduce
In the following we analyze our ShiftsReduce heuristic.
5.3.1 Results overview
An overview of the results for all heuristics across all benchmarks, normalized to the OFU heuristic, is shown in Fig. 13. As illustrated, ShiftsReduce yields considerably better performance on most benchmarks. It outperforms Chen's heuristic on all benchmarks and INCTB on 22 out of 28. The results indicate that INCTB underperforms on benchmarks such as mp3, viterbi, gif2asc, dspstone and h263. On average, ShiftsReduce reduces the number of shifts by 28.8%, which is 4.4% and 6.6% better than INCTB and Chen, respectively.
Closer analysis reveals that Chen significantly reduces the shift cost on benchmarks with longer access sequences, because it considers the global adjacency of a vertex before offset assignment. In contrast, INCTB maximizes local adjacencies and favors benchmarks that consist only of shorter sequences. ShiftsReduce combines the benefits of both local and global adjacencies, providing superior results. None of the algorithms reduces the number of shifts for fft, since in this benchmark each variable is accessed only once; any permutation of the variable placement therefore results in identical performance.
5.3.2 Impact of access sequence length
As mentioned above, the length of the access sequence plays a role in the performance of the different heuristics. To further analyze this effect, we partition the sequences from all benchmarks into 6 bins based on their lengths. The concrete bins and the results are shown in Fig. 14, which reports the average number of shifts (smaller is better) relative to OFU.
Several conclusions can be drawn from Fig. 14. First, INCTB performs better than the other heuristics on short sequences. For the first bin (0-70), INCTB reduces the number of shifts by 26.3% compared to OFU, which is 10.9%, 7.1% and 2.2% better than Chen, ChenTB and ShiftsReduce, respectively. Second, the longer the sequence, the larger the reduction compared to OFU. Third, the performance of INCTB degrades relative to the group-based heuristics as the access sequence length increases. For bin 5 (501-800), INCTB reduces the shift cost by 25.2% compared to OFU, while Chen, ChenTB and ShiftsReduce reduce it by 38.3%, 38.6% and 41.2%, respectively. Beyond length 800 (last bin), INCTB deteriorates compared to OFU and even increases the number of shifts by 97.8%. This is because INCTB maximizes memory accesses to consecutive locations (i.e., edge weights) without considering the impact on farther memory accesses (i.e., global adjacency). Fourth, Chen performs better than INCTB on long sequences (on average 36.6% for bins 3-6) but falls behind it by 6.9% on short sequences (bins 1-2). Fifth, ChenTB consistently outperforms Chen on all sequence lengths, demonstrating the positive impact of the tie-breaking proposed in this paper. Finally, the proposed ShiftsReduce heuristic consistently outperforms Chen in all 6 bins. This is because ShiftsReduce exploits bidirectional group expansion and considers both local and global adjacencies for data placement (cf. Section 4.1). On average, it surpasses (INCTB, Chen, ChenTB) by (39.8%, 3.2%, 2.8%) for long sequences (bins 3-6) and by (0.3%, 7.3%, 4.5%) for short sequences (bins 1-2).
Table 2: Classification of benchmarks based on access sequence lengths.

| Category | Benchmark | Short Seqs (%) | Long Seqs (%) | Very Long Seqs (%) |
|----------|-----------|----------------|---------------|--------------------|
| I   | mp3      | 65.1  | 25.6 | 9.3  |
| I   | viterbi  | 35.0  | 40.0 | 25.0 |
| I   | gif2asc  | 17.7  | 50.0 | 33.3 |
| I   | dspstone | 63.0  | 29.6 | 7.4  |
| I   | gsm      | 65.1  | 21.6 | 13.3 |
| I   | cavity   | 20.0  | 40.0 | 40.0 |
| I   | h263     | 0.0   | 75.0 | 25.0 |
| I   | codecs   | 59.7  | 33.3 | 8.0  |
| I   | flex     | 75.8  | 16.9 | 7.3  |
| I   | sparse   | 69.6  | 22.8 | 7.6  |
| I   | klt      | 54.5  | 15.9 | 29.6 |
| I   | triangle | 75.4  | 17.2 | 7.4  |
| I   | f2c      | 79.5  | 15.2 | 6.3  |
| I   | mpeg2    | 50.7  | 32.4 | 16.9 |
| I   | bison    | 63.8  | 26.4 | 9.8  |
| I   | cpp      | 43.7  | 33.3 | 13.0 |
| I   | gzip     | 50.1  | 35.2 | 14.7 |
| I   | lpsolve  | 44.6  | 38.5 | 16.9 |
| I   | jpeg     | 54.5  | 15.9 | 29.6 |
| II  | bdd      | 85.8  | 10.8 | 3.4  |
| II  | adpcm    | 93.2  | 3.4  | 3.4  |
| II  | fft      | 100.0 | 0.0  | 0.0  |
| II  | anagram  | 100.0 | 0.0  | 0.0  |
| II  | eqntott  | 100.0 | 0.0  | 0.0  |
| III | fuzzy    | 100.0 | 0.0  | 0.0  |
| III | hmm      | 79.7  | 10.3 | 0.0  |
| III | 8051     | 80.0  | 20.0 | 0.0  |
| III | cc65     | 84.6  | 13.1 | 2.3  |
5.3.3 Categorywise benchmarks evaluation
Based on the above analysis, we classify all benchmarks into three categories, as shown in Table 2. We categorize each access sequence as short, long or very long based on its length. The first benchmark category comprises 19 benchmarks, each containing at least 15% long and 5% very long access sequences. The second and third categories mostly contain short sequences.
Fig. 15 shows that ShiftsReduce provides significant gains on category I and curtails the number of shifts by 31.9% (up to a maximum of 43.9%) compared to OFU. This is 8.1% and 6.4% better than INCTB and Chen, respectively. Similarly, ChenTB outperforms both Chen and INCTB by 2.3% and 4%, respectively. INCTB does not produce good results because the majority of the benchmarks in category I are dominated by long and/or very long sequences (cf. Table 2 and Section 5.3.2). Category II comprises 5 benchmarks, mostly dominated by short sequences. Here, INCTB provides a higher shift reduction (19.6%) than Chen (13.2%) and ChenTB (15.3%), but exhibits performance comparable to ShiftsReduce; on average, ShiftsReduce outperforms INCTB by 1.1%. INCTB outperforms ShiftsReduce only on the 4 benchmarks listed in category III.
5.4 GASOA vs GAOurs
Apart from heuristics, genetic algorithms (GAs) have also been employed to solve the SOA problem [30]. They start with a random population and compute an efficient solution by imitating natural evolution. However, GAs always take longer computation times than heuristics. To avoid premature convergence, GAs are often initialized with suboptimal initial solutions.
This section leverages two genetic algorithms (namely GASOA and GAOurs) for RM data placement. We analyze the impact on the GA results of using our solutions as the initial population, compared to using solutions obtained with SOA heuristics. Both algorithms use the same parameters as presented in [14]. The initial populations of GASOA and GAOurs are composed of (OFU, Liao [25], INCTB [14]) and (OFU, ChenTB, ShiftsReduce), respectively.
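The seeding idea can be illustrated with a toy GA. This sketch does not reproduce the operators or parameters of [14]; it uses plain elitist selection and swap mutation, a hypothetical seed layout, and an illustrative access sequence:

```python
# Toy genetic algorithm for offset assignment, seeded with heuristic layouts
# instead of a purely random population (the idea behind GAOurs).
import random

def total_cost(offsets, seq):
    return sum(abs(offsets[a] - offsets[b]) for a, b in zip(seq, seq[1:]))

def ga_placement(variables, seq, seeds, generations=200, pop_size=20, rng=None):
    rng = rng or random.Random(0)      # fixed seed for reproducibility
    def randomize():
        perm = list(variables)
        rng.shuffle(perm)
        return {v: i for i, v in enumerate(perm)}
    pop = [dict(s) for s in seeds] + [randomize() for _ in range(pop_size - len(seeds))]
    for _ in range(generations):
        pop.sort(key=lambda p: total_cost(p, seq))
        pop = pop[: pop_size // 2]                 # elitist selection
        children = []
        for parent in pop:
            child = dict(parent)
            a, b = rng.sample(list(variables), 2)
            child[a], child[b] = child[b], child[a]   # swap mutation
            children.append(child)
        pop += children
    return min(pop, key=lambda p: total_cost(p, seq))

seq = ["a", "b", "c", "b", "a", "d", "b", "c"]
seed = {"a": 0, "b": 1, "c": 2, "d": 3}            # hypothetical heuristic layout
best = ga_placement(["a", "b", "c", "d"], seq, [seed])
print(total_cost(best, seq))
```

Because the seed is part of the population and the elite always survives, the returned layout is never worse than the seed; a better heuristic seed therefore gives the GA a better starting point, which is the rationale for GAOurs.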
Experimental results demonstrate that GAOurs is superior to GASOA on all benchmarks. The average reductions in shift cost across all benchmarks (cf. Fig. 17) translate to 35.1% and 38.3% for GASOA and GAOurs, respectively.
5.5 ILP results
As expected, the ILP solver could not produce any solution for almost 30% of the instances when given three hours per instance. For the remaining instances, the solver provides either an optimal solution (on shorter sequences) or an intermediate solution. We evaluate ShiftsReduce and GAOurs on those instances where the ILP solver produces results and show the comparison in Fig. 16.
On average, the ShiftsReduce results deviate by 8.2% from the ILP results. GAOurs bridges this gap and deviates by only 1.3%.
5.6 Summary runtimes and energy analysis
Recall the results overview from Fig. 17. In comparison to OFU, ShiftsReduce and ChenTB reduce the number of shifts by 28.8% and 24.5%, which is (4.4%, 0.1%) and (6.6%, 2.3%) better than INCTB and Chen, respectively. Compared to the offset assignment heuristics in Fig. 12, the improvements of ShiftsReduce and ChenTB translate to (17.9%, 17.9%, 16.6%, 5.9%) and (13.6%, 13.6%, 12.3%, 1.6%) over Bartley, Liao, SOATB and INC, respectively. GAOurs further reduces the number of shifts of ShiftsReduce by 9.5%. The average runtime of ChenTB and ShiftsReduce is 2.99 ms, which is comparable to the other heuristics, i.e., Bartley (0.23 ms), Liao (0.08 ms), SOATB (0.11 ms), INC (2.3 s), INCTB (2.7 s), GASOA (4.96 s), GAOurs (4.98 s) and Chen (2.98 ms).
Using the latest RM 4.0 prototype device in our in-house physics lab facility, a current pulse of 1 ns is applied to the nanowire to drive the domains. For a 50 nm wide, 4 nm thick wire, the shift current corresponds to 0.1 mA. With a 5 V applied voltage, the power to drive a single domain translates to 0.5 mW. Therefore, the energy to shift a single domain amounts to 0.5 pJ. The RM 4.0 device characteristics indicate that the domains in RM 4.0 shift at a constant velocity without inertial effects. Therefore, for a 32-bit data item, the total shift energy amounts to 16 pJ. The overall shift energy saved by a particular solution is calculated as the total number of shifts across all instances of all benchmarks multiplied by the per-data-item shift energy (i.e., 16 pJ). Using RM 4.0, the shift energy reduction of ShiftsReduce relative to OFU translates to 35%. In contrast to RM 4.0, the domains in earlier RM prototypes show inertial effects when driven by current; considering these effects, we expect smaller energy benefits for earlier prototypes than for RM 4.0.
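The arithmetic behind these figures can be checked directly (units as reconstructed above: 5 V supply, 0.1 mA shift current, 1 ns pulse, 32-bit items):

```python
# Back-of-the-envelope check of the shift-energy numbers (SI units).
voltage = 5.0        # V
current = 0.1e-3     # A  (0.1 mA)
pulse = 1e-9         # s  (1 ns)

power = voltage * current                 # 0.5 mW to drive a single domain
energy_per_domain = power * pulse         # 0.5 pJ per single-domain shift
energy_per_item = 32 * energy_per_domain  # 16 pJ per 32-bit data item shift

print(power, energy_per_domain, energy_per_item)
```

Multiplying the 16 pJ per-item figure by the total shift counts reported in the evaluation yields the overall energy savings quoted above.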
6 Related Work
Conceptually, the racetrack memory is a 1-dimensional version of the classical bubble memory technology of the late 1960s. Bubble memory employs a thin film of magnetic material to hold small magnetized areas known as bubbles. It is typically organized as a 2-dimensional structure of bubbles composed of major and minor loops [31]. The bubble technology could not compete with Flash due to speed limitations, and it vanished entirely by the late 1980s. Various data reorganization techniques have been proposed for bubble memories [32, 33, 31]. These techniques alter the relative positions of the data items in memory via dynamic reordering, so that the more frequently accessed items stay close to the access port. Since these architectural techniques are blind to the exact memory reference patterns of the applications, they may exacerbate the total energy consumption.
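As a rough illustration of such dynamic reordering (a transposition-style sketch, not the exact scheme of [31, 32, 33]), each access swaps the accessed item one step toward the access port:

```python
def access_and_promote(track, port_index, item):
    """Access `item` on a track with a single port at `port_index`, then
    swap it one step toward the port, so frequently accessed items migrate
    to cheaper positions. Returns the shift cost of the access itself."""
    i = track.index(item)
    cost = abs(i - port_index)
    if i > port_index:
        # promote one step toward the port
        track[i - 1], track[i] = track[i], track[i - 1]
    elif i < port_index:
        track[i + 1], track[i] = track[i], track[i + 1]
    return cost
```

Repeated accesses gradually migrate a hot item toward the port, but the policy is oblivious to the application's actual reference pattern, which is why it can backfire on unfavorable access sequences.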
Compared to other memory technologies, RMs have the potential to dominate in all performance metrics, for which reason they have recently received considerable attention. RMs have been proposed as replacements for all levels of the memory hierarchy in different application scenarios. Wang et al. and Mao et al. proposed RM-based GPU register files to combat the high leakage and scalability problems of conventional SRAM-based register files [34, 35]. Xu et al. evaluated RM at lower cache levels and reported an energy reduction of 69% with performance comparable to an iso-capacity SRAM [36]. Sun et al. and Venkatesan et al. demonstrated RM as last-level cache and reported significant improvements in area (6.4×), energy (1.4×) and performance (25%) [37, 20]. Park advocates the use of RM instead of SSDs for graph storage, which not only expedites graph processing but also reduces energy by up to 90% [38]. Besides, RMs have been proposed as scratchpad memories [39], content addressable memories [40] and reconfigurable memories [41].
Various architectural techniques have been proposed to hide the RM access latency by pre-shifting the likely accessed domain walls (DWs) to the port position [20]. Sun et al. proposed swapping highly accessed DWs with those closer to the access port(s) [37]. Atoofian proposed predictor-based proactive shifting that exploits register locality [42]. Likewise, proactive shifting can be performed on the data items waiting in the queue [35]. While these architectural approaches reduce the access latency, they may increase the total number of shifts, which exacerbates energy consumption.
To abate the total number of shifts, techniques such as data migration [36], data swapping [37], data compression [43], data reorganization for bubble memories [32, 33, 31], and efficient data placement [13, 39] have been proposed. Among these, data placement has shown great promise because it effectively reduces the number of shifts with negligible overhead.
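To make the benefit concrete, a minimal cost model for a single-port track (assuming, as in our setting, that the track stays wherever the previous access aligned it; the function name is ours, for illustration) counts the shifts of an access sequence under a given placement:

```python
def count_shifts(placement, accesses):
    """Shift cost of an access sequence on a single-port track: the track
    is left wherever the previous access aligned it, so each access costs
    the offset distance to the previously accessed item."""
    pos = {item: i for i, item in enumerate(placement)}
    port = 0  # assume the track starts aligned to offset 0
    total = 0
    for a in accesses:
        total += abs(pos[a] - port)
        port = pos[a]
    return total
```

For the sequence a, c, a, c, the placement [a, b, c] costs 0 + 2 + 2 + 2 = 6 shifts, whereas [a, c, b] costs only 3: placing temporally close accesses at adjacent offsets halves the shift count, which is exactly what shift-aware data placement exploits.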
Historically, data placement has been proposed for different memory technologies at various levels of the memory hierarchy. It has been demonstrated that efficient data placement improves energy consumption and system performance by exploiting the temporal/spatial locality of memory objects [44]. More recently, data placement techniques have been employed in NVM-based memory systems to improve their performance and lifetime. For instance, [45, 8] employ data placement to hide the higher write latency, and hence the cache block migration overhead, in an STT-RAM based hybrid cache. Similarly, [9, 10, 11] propose data placement techniques for the efficient utilization of memory systems equipped with multiple memory technologies. Likewise, data placement in RMs has been proposed for GPU register files [46], scratchpad memories [39] and stacks [47] in order to reduce the number of shifts.
In the past, various data placement solutions have been proposed for signal processing in the embedded systems domain (i.e., SOA, cf. Section 5.2). These solutions include heuristics [26, 25, 29, 27, 14], genetic algorithms [30] and ILP-based exact solutions [48, 49, 50]. As discussed in Section 5, our heuristic builds on top of this previous work, providing an improved data allocation.
7 Conclusions
This paper presented a set of techniques to minimize the number of shifts in RMs by means of efficient data placement. We introduced an ILP model of the data placement problem for exact solutions, and heuristic algorithms for efficient solutions. We showed that our heuristic computes near-optimal solutions, at least for small problems, in less than 3 ms. We revisited well-known offset assignment heuristics for racetrack memories and experimentally showed that they perform better on short access sequences. In contrast, group-based approaches such as the Chen heuristic exploit global adjacencies and produce better results on longer sequences. Our ShiftsReduce heuristic combines the benefits of local and global adjacencies and outperforms all other heuristics, reducing the number of shifts by up to 40%. ShiftsReduce employs intelligent tie-breaking, a technique that we also used to improve the original Chen heuristic. To improve the results further, we combined ShiftsReduce with a genetic algorithm, which reduced the number of shifts by an additional 9.5%. In future work, we plan to investigate placement decisions together with the reordering of accesses from higher abstraction levels in the compiler, e.g., from a polyhedral model or by exploiting additional semantic information from domain-specific languages.
Acknowledgments
This work was partially funded by the German Research Council (DFG) through the Cluster of Excellence ‘Center for Advancing Electronics Dresden’ (cfaed). We thank Andrés Goens for his useful input on the ILP formulation and Dr. Sven Mallach from the Universität zu Köln (Cologne) for providing the sources of the SOA heuristics.
References
 [1] H.-S. Philip Wong, Simone Raoux, Sangbum Kim, Jiale Liang, John Reifenberg, Bipin Rajendran, Mehdi Asheghi, and Kenneth Goodson. Phase change memory. Proceedings of the IEEE, 98(12), 2010.
 [2] E. Kultursay, M. Kandemir, A. Sivasubramaniam, and O. Mutlu. Evaluating STT-RAM as an energy-efficient main memory alternative. In International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 256–267, 2013.
 [3] F. Hameed, A. A. Khan, and J. Castrillon. Performance and energy-efficient design of STT-RAM last-level cache. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 26(6):1059–1072, June 2018.
 [4] H.-S. P. Wong, H. Lee, S. Yu, Y. Chen, Y. Wu, P. Chen, B. Lee, F. T. Chen, and M. Tsai. Metal-oxide RRAM. Proceedings of the IEEE, 100(6):1951–1970, June 2012.
 [5] Stuart Parkin, Masamitsu Hayashi, and Luc Thomas. Magnetic Domain-Wall Racetrack Memory. Science, 320:190–194, 2008.
 [6] Stuart Parkin and See-Hun Yang. Memory on the Racetrack. Nature Nanotechnology, 10:195–198, 2015.
 [7] Sparsh Mittal and Jeffrey Vetter. A survey of software techniques for using non-volatile memories for storage and main memory systems. IEEE Transactions on Parallel and Distributed Systems, 27, 2015.
 [8] Qingan Li, Jianhua Li, Liang Shi, Chun Jason Xue, and Yanxiang He. MAC: Migration-aware compilation for STT-RAM based hybrid cache in embedded systems. In Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED ’12, pages 351–356, New York, NY, USA, 2012. ACM.
 [9] Ivy Bo Peng, Roberto Gioiosa, Gokcen Kestor, Pietro Cicotti, Erwin Laure, and Stefano Markidis. RTHMS: A tool for data placement on hybrid memory system. In Proceedings of the 2017 ACM SIGPLAN International Symposium on Memory Management, ISMM 2017, pages 82–91, New York, NY, USA, 2017. ACM.
 [10] Wei Wei, Dejun Jiang, Sally A. McKee, Jin Xiong, and Mingyu Chen. Exploiting program semantics to place data in hybrid memory. In Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), PACT ’15, pages 163–173, Washington, DC, USA, 2015. IEEE Computer Society.
 [11] H. Servat, A. J. Peña, G. Llort, E. Mercadal, H. Hoppe, and J. Labarta. Automating the application data placement in hybrid memory systems. In 2017 IEEE International Conference on Cluster Computing (CLUSTER), pages 126–136, Sept 2017.
 [12] S. Mittal, J. S. Vetter, and D. Li. A survey of architectural approaches for managing embedded dram and nonvolatile onchip caches. IEEE Transactions on Parallel and Distributed Systems, 26(6):1524–1537, June 2015.
 [13] Xianzhang Chen, Edwin Hsing-Mean Sha, Qingfeng Zhuge, Chun Jason Xue, Weiwen Jiang, and Yuangang Wang. Efficient data placement for improving data access performance on domain-wall memory. IEEE Trans. Very Large Scale Integr. Syst., 24(10):3094–3104, October 2016.
 [14] Rainer Leupers. Offset assignment showdown: Evaluation of DSP address code optimization algorithms. In Proceedings of the 12th International Conference on Compiler Construction, CC’03, pages 290–302, Berlin, Heidelberg, 2003. Springer-Verlag.
 [15] S. S. Parkin. Shiftable Magnetic Shift Register and Method of Using the Same. US Patent, December 2004.
 [16] M. Hayashi, L. Thomas, C. Rettner, R. Moriya, Y. B. Bazaliy, and S. Parkin. Current Driven Domain Wall Velocities Exceeding the Spin Angular Momentum Transfer Rate in Permalloy Nanowires. Physical Review Letters, 98:037204, 2007.
 [17] I. Mihai Miron, T. Moore, H. Szambolics, L. Buda-Prejbeanu, S. Auffret, B. Rodmacq, S. Pizzini, J. Vogel, M. Bonfim, A. Schuhl, and G. Gaudin. Fast Current-induced Domain-wall Motion Controlled by the Rashba Effect. Nature Materials, 10:419–423, 2011.
 [18] K.-S. Ryu, L. Thomas, S.-H. Yang, and S. Parkin. Chiral Spin Torque at Magnetic Domain Walls. Nature Nanotechnology, 8, 2013.
 [19] See-Hun Yang, Kwang-Su Ryu, and Stuart Parkin. Domain-wall Velocities of up to 750 m/s Driven by Exchange-coupling Torque in Synthetic Antiferromagnets. Nature Nanotechnology, 10, 2015.
 [20] Rangharajan Venkatesan, Vivek Kozhikkottu, Charles Augustine, Arijit Raychowdhury, Kaushik Roy, and Anand Raghunathan. TapeCache: A high density, energy efficient cache based on domain wall memory. In Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design, ISLPED ’12, pages 185–190, New York, NY, USA, 2012. ACM.
 [21] Chao Zhang, Guangyu Sun, Weiqi Zhang, Fan Mi, Hai Li, and W. Zhao. Quantitative modeling of racetrack memory, a tradeoff among area, performance, and power. In The 20th Asia and South Pacific Design Automation Conference, pages 100–105, Jan 2015.
 [22] A. A. Khan, F. Hameed, R. Blaesing, S. Parkin, and J. Castrillon. RTSim: A cycle-accurate simulator for racetrack memories. IEEE Computer Architecture Letters, 2019.
 [23] ChiKeung Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Vijay Janapa Reddi, and Kim Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’05, pages 190–200, New York, NY, USA, 2005. ACM.
 [24] Silvius Rus, Lawrence Rauchwerger, and Jay Hoeflinger. Hybrid analysis: Static & dynamic memory reference analysis. International Journal of Parallel Programming, 31(4):251–283, Aug 2003.
 [25] Stan Liao, Srinivas Devadas, Kurt Keutzer, Steve Tjiang, and Albert Wang. Storage assignment to decrease code size. SIGPLAN Not., 30(6):186–195, June 1995.
 [26] David H. Bartley. Optimizing stack frame accesses for processors with restricted addressing modes. Softw. Pract. Exper., 22(2):101–110, February 1992.
 [27] Sunil Atri, J. Ramanujam, and Mahmut T. Kandemir. Improving offset assignment for embedded processors. In Proceedings of the 13th International Workshop on Languages and Compilers for Parallel Computing, Revised Papers, LCPC ’00, pages 158–172, London, UK, 2001. Springer-Verlag.
 [28] LLC Gurobi Optimization. Gurobi optimizer reference manual, 2018.
 [29] R. Leupers and P. Marwedel. Algorithms for address assignment in DSP code generation. In Proceedings of the International Conference on Computer Aided Design, pages 109–112, Nov 1996.
 [30] R. Leupers and F. David. A uniform optimization technique for offset assignment problems. In Proceedings. 11th International Symposium on System Synthesis (Cat. No.98EX210), pages 3–8, Dec 1998.
 [31] Mario Jino and Jane W. S. Liu. Intelligent Magnetic Bubble Memories. In Proceedings of the 5th Annual Symposium on Computer Architecture, ISCA ’78, pages 166–174. ACM, 1978.
 [32] O. Voegeli, B. A. Calhoun, L. L. Rosier, and J. C. Slonczewski. The use of bubble lattices for information storage. AIP Conference Proceedings, 24(1):617–619, 1975.
 [33] C. K. Wong and P. C. Yue. Data Organization in Magnetic Bubble Lattice Files. IBM Journal of Research and Development, 20(6):576–581, Nov 1976.
 [34] Shuo Wang, Yun Liang, Chao Zhang, Xiaolong Xie, Guangyu Sun, Yongpan Liu, Yu Wang, and Xiuhong Li. Performance-centric register file design for GPUs using racetrack memory. In 2016 21st Asia and South Pacific Design Automation Conference (ASP-DAC), pages 25–30, Jan 2016.
 [35] M. Mao, W. Wen, Y. Zhang, Y. Chen, and H. Li. Exploration of GPGPU register file architecture using domain-wall-shift-write based racetrack memory. In 2014 51st ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–6, June 2014.
 [36] H. Xu, Y. Alkabani, R. Melhem, and A. K. Jones. FusedCache: A naturally inclusive, racetrack memory, dual-level private cache. IEEE Transactions on Multi-Scale Computing Systems, 2(2):69–82, April 2016.
 [37] Z. Sun, Wenqing Wu, and Hai Li. Cross-layer racetrack memory design for ultra high density and low power consumption. In 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), pages 1–6, May 2013.
 [38] E. Park, S. Yoo, S. Lee, and H. Li. Accelerating graph computation with racetrack memory and pointer-assisted graph representation. In 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1–4, March 2014.
 [39] H. Mao, C. Zhang, G. Sun, and J. Shu. Exploring data placement in racetrack memory based scratchpad memory. In 2015 IEEE NonVolatile Memory System and Applications Symposium (NVMSA), pages 1–5, Aug 2015.
 [40] Y. Zhang, W. Zhao, J. Klein, D. Ravelosona, and C. Chappert. Ultra-high density content addressable memory based on current induced domain wall motion in magnetic track. IEEE Transactions on Magnetics, 48(11):3219–3222, Nov 2012.
 [41] W. Zhao, N. Ben Romdhane, Y. Zhang, J. Klein, and D. Ravelosona. Racetrack memory based reconfigurable computing. In 2013 IEEE Faible Tension Faible Consommation, pages 1–4, June 2013.
 [42] Ehsan Atoofian. Reducing shift penalty in domain wall memory through register locality. In Proceedings of the 2015 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES ’15, pages 177–186, Piscataway, NJ, USA, 2015. IEEE Press.
 [43] Haifeng Xu, Yong Li, R. Melhem, and A. K. Jones. Multilane racetrack caches: Improving efficiency through compression and independent shifting. In The 20th Asia and South Pacific Design Automation Conference, pages 417–422, Jan 2015.
 [44] Brad Calder, Chandra Krintz, Simmi John, and Todd Austin. Cache-conscious data placement. SIGPLAN Not., 33(11):139–149, October 1998.
 [45] Q. Li, J. Li, L. Shi, M. Zhao, C. J. Xue, and Y. He. Compiler-assisted STT-RAM-based hybrid cache for energy efficient embedded systems. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(8):1829–1840, Aug 2014.
 [46] Yun Liang and Shuo Wang. Performancecentric optimization for racetrack memory based register file on gpus. Journal of Computer Science and Technology, 31(1):36–49, Jan 2016.
 [47] Hoda Aghaei Khouzani and Chengmo Yang. A DWM-based stack architecture implementation for energy harvesting systems. ACM Trans. Embed. Comput. Syst., 16(5s):155:1–155:18, September 2017.
 [48] Michael Jünger and Sven Mallach. Solving the simple offset assignment problem as a traveling salesman. In Proceedings of the 16th International Workshop on Software and Compilers for Embedded Systems, M-SCOPES ’13, pages 31–39, New York, NY, USA, 2013. ACM.
 [49] Sven Mallach and Roberto Castañeda Lozano. Optimal general offset assignment. In Proceedings of the 17th International Workshop on Software and Compilers for Embedded Systems, SCOPES ’14, pages 50–59, New York, NY, USA, 2014. ACM.
 [50] Sven Mallach. More general optimal offset assignment. Leibniz Transactions on Embedded Systems, 2(1):02:1–02:18, 2015.