Existing prefetchers are designed to analyze the memory access stream and identify specific types of access patterns, ranging from sequential and strided ones to traversals over linked data structures. Most of them target spatio-temporal locality and temporal correlation between addresses or address-space artifacts (e.g., address deltas), based on the observation that temporally or spatially adjacent accesses tend to repeat.
Many applications, however, make use of data structures and algorithms whose physical layout and data access patterns are not plainly observable in the memory address-space domain (e.g., linked lists, arrays of pointers, sparse graphs, cross-indexed tables) and require deeper analysis in order to understand the causal relations between accesses to objects in memory. These causal relations may involve complicated arithmetic computations or a chain of memory dereferences. Such relations between accesses exhibit semantic locality if they represent consequential steps along a data structure or an algorithm. These steps are characterized by the existence of some program code flow that traverses from one data object to the next. (Spatio-temporal locality represents a specific case where the relations are purely arithmetic and can be detected through address comparison, but semantic locality encompasses all forms of algorithmic and data structural relations, such as proximity within linked data structures or connectivity in cross-indexed tables.) The set of all semantic relations within a program can be said to span its data structures, describing all the steps that the program may employ to walk through them.
In this paper we argue that the semantic relations between memory accesses can be represented through the code segments (referred to as code slices) that generate the memory traversals. The set of all slices effectively forms an abstract guide to the program’s data layout, but we can further combine or extrapolate these flows to create forecast slices with more complex “lookahead” semantics that can predict program behavior over longer periods.
Following our observation, we present the semantic prefetcher that dynamically constructs and injects prefetching code for arbitrary memory traversals. The prefetcher analyzes program code at run-time, identifies the dependency chains forming all address calculations, and detects locality artifacts within that code based on contextual similarities. The prefetcher then generates compact and optimized forecast slices which are code constructs that did not exist in the original program and enhance the code to generate longer memory traversal steps capable of reaching future iterations. The semantic prefetcher generates the forecast slices using hardware-managed binary optimization. The slices are constructed to have no lingering side effects. Once the prefetcher reaches sufficient confidence in their correctness and structural stability, it injects them at certain interception points to trigger prefetches.
The semantic prefetcher is fundamentally different from previous prefetchers that aim to reconstruct address relations or code sequences such as temporal-correlation prefetchers [25, 35, 38, 3] and runahead-based prefetchers [9, 23, 10, 1, 24, 13, 39]. Unlike temporal correlation prefetchers, which detect correlations between addresses, the semantic prefetcher correlates program states (specific code locations with specific history and context) with the generated code slices. Similarly, unlike runahead-based prefetchers that run the program (or its address generation code) in parallel to reach future iterations earlier (but are ultimately constrained by finite out-of-order depths), the semantic prefetcher can peek into future iso-context iterations without having to execute everything in the middle.
The semantic prefetcher was implemented on an industrial-grade, cycle-accurate x86 simulator that represents a modern micro-architecture. It provides a 24% IPC speedup on average over SPEC 2006 (outliers of up to 3.7×), and 16% on average over SPEC 2017 (outliers of up to 85%).
Our contributions in this paper are as follows:
We present a novel scheme of prefetching using forecast slices. We utilize internal locality artifacts to extrapolate the code slices and create new functional behavior with lookahead semantics.
We present the design of the semantic prefetcher that injects forecast slices directly into the execution stream. We describe its architecture: flaky load detection, slice generation, binary optimization, and dynamic prefetching depth control.
We demonstrate how the forecast slices can reproduce complex patterns prevalent in common applications, and show that these patterns are not addressed by existing prefetchers.
We model the semantic prefetcher using a cycle-accurate simulator. We show that it outperforms five competing state-of-the-art prefetchers, some of which target irregular access patterns.
The remainder of this paper is organized as follows: Section II discusses semantic locality and its manifestation in forecast slices. Section III presents the semantic prefetcher and its architecture. Section IV explains the experimental methodology. Section V shows the evaluation results and discussion. Section VI describes related work. We conclude in Section VII.
II Extracting Semantic Locality from memory access patterns
Existing memory prefetchers scan the stream of memory accesses and extract spatio-temporal correlations in order to identify patterns and predict future memory accesses. Some prefetchers [25, 2] also associate memory accesses with program context (e.g., instruction pointer) to further refine their predictions.
However, basing predictions solely on the stream of memory accesses that the memory unit emits makes prefetchers oblivious to the underlying program code semantics. Indeed, most existing prefetchers ignore the data and control flows that generate the memory access sequences they are meant to detect. A small number of exceptional prefetchers capable of detecting more elaborate or irregular relations focus only on specific access patterns such as indirect accesses (e.g., A[B[i]]) and linked data structures [31, 32, 4].
In this section we argue that a more fundamental form of locality can be extracted even when no spatio-temporal locality is present. Semantic locality [27, 28] correlates memory accesses through their dependency within the program’s abstract data layout and usage flow, such as being adjacent steps on a data structure traversal path or being consequential steps in the execution of an algorithm. These accesses do not necessarily exhibit any spatio-temporal correlation. While prior work attempted to approximate semantic locality through memoization and correlative program context cues, we show that extracting this form of locality requires following the set of operations that constitutes the relation between two memory addresses. To this end, we define a code slice as the minimal subset of the dynamic code preceding a certain memory operation that is required to generate its memory address. Notably, this subset can be described through the data dependency chain that starts with the address calculation, and goes backwards through all relevant sources at each step.
As semantic locality usually describes program constructs such as data structures or algorithms, the relations it captures often have strong recurrence. Extracting that form of locality can therefore be achieved by analysis of the address-generation code slice between two recurring consequential loads.
The remainder of this section demonstrates how program introspection can generate short, explicit code slices that can be replayed to generate memory accesses, or manipulated to create forecast slices that generate future accesses at an arbitrary distance. These code slices can be injected into the code sequence at run time to issue memory accesses ahead of time. Finally, the memory access stream of typical programs is shown to be adequately covered by a small number of distinct code slices.
II-A How code slices describe dependency chains
Memory access patterns can often be tightly incorporated in a way that makes it difficult for a simple address scan to distinguish between them without understanding program semantics. Figure 1 demonstrates this over a breadth-first search (BFS) code taken from the Graph500 benchmark. The main while loop traverses an array of vertices. An internal loop then scans each vertex’s outgoing edges to find its neighboring vertices and check their BFS depth.
Notably, the top level access pattern (array “vlist”) is sequential, but the deeper levels are accessed through data dependent patterns (the edge loop is also sequential but relatively short, making the first edge of each vertex the critical element). These accesses have no spatial locality and very little temporal reuse. Even contextual cues such as the program counter do not help in correlating the accesses. However, the figure shows that the dependency chain within each iteration, whose accesses are increasingly critical to program performance, can be represented using a short code slice.
The use of the extracted code slice is demonstrated in Figure 2. Thanks to the sequential nature of the top loop that exposes spatial locality within the first load in the slice, a simple change in the stride delta can create a forecast slice that predicts accesses in the next iterations at the top loop. Overall, the example detailed in Figures 1 and 2 demonstrates how code slices can represent the dependency chain of irregular data structures, and how these slices can generate lookahead semantics within the algorithm.
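As a rough illustration of the idea, the following sketch replays a BFS-like dependency chain and then extends the stride of its first load to reach a later iteration. It is a minimal model under stated assumptions, not the code from the figures: the array names `first_edge` and `edges`, the function names, and the toy graph are all hypothetical; only `vlist` is named in the text.

```python
def base_slice(i, vlist, first_edge, edges):
    """Replay the address-generation chain of iteration i.

    Returns the neighbor index reached through the first edge of vertex
    vlist[i], i.e. the critical data-dependent access of that iteration.
    """
    v = vlist[i]        # sequential load over the vertex list
    e = first_edge[v]   # dependent load: offset of the vertex's first edge
    return edges[e]     # dependent load: neighbor reached through that edge


def forecast_slice(i, lookahead, vlist, first_edge, edges):
    """Same chain, with the stride of the first load extended by `lookahead`."""
    j = i + lookahead
    if j >= len(vlist):
        return None     # nothing left to prefetch past the end of the list
    return base_slice(j, vlist, first_edge, edges)


# Hypothetical toy graph: 3 vertices, edges stored in one flat array.
vlist = [2, 0, 1]
first_edge = [0, 2, 3]
edges = [1, 2, 0, 1]
```

Only the first load changes between the base and the forecast version; the dependent loads are replayed unmodified, which is what lets a single stride adjustment carry the whole chain forward.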
II-B Forecast slice creation
Tracking all dependency chains for a given load would construct a graph of instructions that may span back to the beginning of the program. To generate concise and useful slices, history tracking is limited by breaking the dependency chain in the following cases:
Constant values remaining static during analysis.
Strided values that were shown to have a constant stride or are produced by a simple add/sub operation with constant or immediate sources.
When a loop wraps around to the same operation where the analysis started from, the dependency chain can usually stop as it would also repeat itself. Linked data structures may iterate a few times to create a deeper chain.
Before the code slice can be used to produce future accesses, it needs to be clean of any side effects. The code is sanitized by performing two final steps: first the destination registers are replaced with temporary ones (which are guaranteed not to be used by the original code) and their occurrences as sources within the slice are renamed accordingly. Second, all memory writes are eliminated from the slice. Since the code was generated through a dependency chain, all writes to memory were added to resolve younger loads reading from the same address. Therefore, a simple memory-renaming may be performed to replace such store-load operations with move operations to a reserved register. For the sake of simplicity partial address overlaps are ignored.
When the base slices are ready, they may be converted into forecast slices. To this end, any detected stride is extended by the lookahead factor: if a certain operation in the slice was detected to induce a stride of $N$, that stride is replaced by $L \times N$, where $L$ is the current lookahead applied for that slice. This lookahead variable is initialized to point a few iterations ahead, but its value dynamically changes to allow further lookahead based on the average hit depth for that slice. The hit depth is updated dynamically as explained in Section III.
II-C Data-structure spanning code slices
Code slice generation is flexible and general enough to cover whole applications efficiently, with a relatively small number of code slices. Any given data structure has a set of operations that define all forms of traversals across it within a given program. We define this set of operations as spanning the data structure. Some examples are shown in Figure 3.
A linked list, for example, is spanned by the action of dereferencing its next-element pointer. A tree is spanned by the actions of descending from a node to any given child. The semantic relation must capture all data structures required to complete any recurring traversal step.
We demonstrate the effectiveness of code slices in Figure 4, which shows the number of unique slices needed to cover all accesses to the main data structures in the SPEC 2006 and 2017 benchmarks (sampling methodology is explained in Section IV). The results were obtained by running the construction flow on each load that has a sufficient level of recurrence (recurring at least three times and passing seven validation phases to confirm that its slice is invariant). We filtered out low-usage strides (below 1k of actual hits on a generated slice). Figure 4 demonstrates the efficiency of memory access coverage of code slices. Specifically, 39 of the 46 benchmarks require only up to 100 slices to cover all recurring loads, and only one benchmark requires more than 300 slices. This indicates that a prefetcher constructing code slices can cover a large code base with reasonable storage requirements. The average slice size is 6.6 operations, and the median is 3.5 operations.
Detecting semantic locality through code slices generalizes the existing paradigms of data locality. Figure 5 classifies the code slices (observed in Figure 4) according to their memory dereference depth (longest dependent load chain within the slice) and the number of arithmetic operations. The circle sizes indicate the relative number of slices within each bucket. The special case of (1,1) represents slices that have a single arithmetic operation and a single load based on it, which for the most part conform with the common stride pattern. Another interesting case is the (2,1) and (2,2) data points (two loads and one or two arithmetic operations), which include most examples of array-of-arrays/pointers (A[B[i]] accesses): one index stride, one internal array reference, possible index/pointer math and outer array access. These are potentially covered by IMP or similar dereference-based prefetchers like Jump-pointers.
Notably, while the largest data points pertain to loads that are addressed by existing prefetchers (37% of all loads in the figure fall under the stride pattern; 13% are within the two simple single-dereference patterns), there are still many cases left outside that are not targeted by any existing prefetcher. Semantic analysis can cover all these cases using the same mechanism, thereby generalizing existing prefetchers without having to manually design for each use-case.
III Prefetcher architecture
In this section we describe the architecture of the semantic prefetcher. Figure 7 shows the high level block diagram of the components: 1) the flakiness detector, for tracking recurring loads; 2) a cyclic history queue, tracking retired code flow; 3) the prefetch injection entries (PIE) array, storing slices; 4) several walker FSMs, generating and validating slices; 5) the slice injector; and 6) the prefetch queue, for tracking usefulness and providing feedback.
III-A Flaky load detection
The first component is the flakiness detector, which is responsible for isolating loads that have both high recurrence and miss rates. The unit identifies and tracks load context by a combination of its instruction pointer (IP) and a branch history register (BHR). Our BHR tracks up to 6 recent branches. Each branch is represented by the lowest 4 bits of its IP, with the least significant bit XOR-ed with the binary outcome of that branch (taken or not). Together, the load IP and the BHR represent an instance of load within a specific program context. This method can distinguish between different occurrences within nested loops or complex control flows, which may affect the access pattern and the generated slice. The IP and BHR are concatenated and hashed to create an index that would identify the load throughout the prefetcher mechanisms.
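The context construction described above can be sketched as follows. The BHR layout (6 branches, 4 bits each, least significant bit XOR-ed with the outcome) follows the text; the hash itself is not specified in the paper, so the XOR-folding in `context_index` and the 10-bit index width are assumptions chosen purely for illustration, as are the function names.

```python
BRANCH_BITS = 4                      # lowest 4 bits of each branch IP
BHR_BRANCHES = 6                     # up to 6 recent branches
BHR_BITS = BRANCH_BITS * BHR_BRANCHES
BHR_MASK = (1 << BHR_BITS) - 1       # 24-bit branch history register


def update_bhr(bhr, branch_ip, taken):
    """Shift in one branch: low 4 IP bits, LSB XOR-ed with the outcome."""
    entry = (branch_ip & 0xF) ^ int(taken)
    return ((bhr << BRANCH_BITS) | entry) & BHR_MASK


def context_index(load_ip, bhr, index_bits=10):
    """Concatenate load IP and BHR, then fold into an index.

    The XOR-fold is a placeholder hash (an assumption, not the paper's).
    """
    key = (load_ip << BHR_BITS) | bhr
    mask = (1 << index_bits) - 1
    index = 0
    while key:
        index ^= key & mask
        key >>= index_bits
    return index
```

With this layout the same load IP reached through different branch outcomes maps to a different context, which is the property the detector relies on to separate occurrences inside nested loops.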
For each load missing the L1, the prefetcher allocates a prefetch injection entry (PIE) in the PIE array. These entries serve to track potential loads (within some context) and, if considered useful, construct a slice for them and store it for prefetching. Figure 7 describes the life cycle of a single PIE. Once allocated, the entry starts at the “Active” state. The flakiness detector tracks recurrence and miss rate for each of the active loads. Once a PIE has been qualified as flaky (above-threshold miss rate) and hot (high recurrence over a time window) its state switches to “Gen” and it is assigned a walker to construct its slice of code (a PIE slice).
III-B PIE slice generation
The slice that generates the load address consists of a subset of the code path prior to that load. The prefetcher tracks the program code flow at retirement using a cyclic history queue, although in modern processors this can be replaced with existing debug features such as Intel’s Real-Time Instruction Tracing (RTIT). Once a PIE is switched to the “Gen” state and needs to construct a slice, it is assigned one of the free walker finite-state-machines (FSMs). The walker traverses the history queue from the youngest instruction (the flaky load itself) to the oldest and constructs the PIE slice.
To track data dependency, the walker uses 1) a source bitmap, which assigns one bit per register to track the active sources (only general-purpose and flags registers); 2) a renaming cache that tracks memory operations for potential memory renaming; 3) a temporary register map that tracks architectural registers replaced with temporary ones. The walker also has storage for 16 operations that serves as the local copy of the slice during construction. Finally, the walker has an index pointing to the history queue (for the traversal), an index for the local slice (for construction), and a counter of temporary registers used for memory renaming.
The walker first sets the bits representing the load sources, and then traverses the history queue backwards (from youngest to oldest instruction). On each instruction that writes to a register in the bitmap, the walker does the following:
Pushes the instruction to its local code slice (using an index that starts from the last entry and going backwards from the end of the slice).
Clears the destination register from the sources bitmap marking that its producer has been added.
Sets all the registers corresponding with the current instruction sources. This ensures older operations producing these sources will also be added.
Records the destination value. This will be checked for constants or strides during the next phases.
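The backward walk outlined in the steps above can be modeled in a few lines. This is a simplified sketch under assumptions: the source bitmap is represented as a Python set, memory operands and memory renaming are left out, and the `Inst` record, the register names, and the example history are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Inst(NamedTuple):
    opcode: str
    dst: Optional[str]        # destination register, if any
    srcs: Tuple[str, ...]     # source registers
    value: int                # recorded destination value (for const/stride checks)


def walk(history, max_ops=16):
    """Build a slice for the flaky load at the end of `history`.

    `history` is ordered oldest to youngest. Returns the slice in program
    order, or None if it would exceed `max_ops` operations.
    """
    load = history[-1]
    sources = set(load.srcs)              # models the source bitmap
    slice_ops = [load]                    # filled from the end backwards
    for inst in reversed(history[:-1]):
        if not sources:
            break                         # no live sources left to resolve
        if inst.dst in sources:
            if len(slice_ops) == max_ops:
                return None               # slice too long: abort construction
            slice_ops.append(inst)        # push the producer (value travels with it)
            sources.discard(inst.dst)     # its producer has been added
            sources.update(inst.srcs)     # older producers of these are needed
    slice_ops.reverse()
    return slice_ops


# Hypothetical history: an unrelated multiply sits between the producers.
history = [
    Inst("load", "rbx", ("rdx",), 0x7000),
    Inst("add", "rcx", ("rcx",), 5),
    Inst("mul", "r8", ("r8", "r9"), 42),
    Inst("load", "rax", ("rbx", "rcx"), 0x1234),
]
```

Clearing the destination before setting the sources matters for self-updating instructions such as the `add` above: the register stays marked, so an older producer of the same register would still be picked up.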
Loads that are added to the slice record their address and their index within the local slice in the rename cache. This structure can host 16 addresses in the form of a set-associative cache. The walker then performs memory renaming whenever an older store is observed (further along the walk) that matches an address in the rename cache. The renaming is done by extracting the index of the matching load from the structure and replacing both the store and the load operations in the slice with a move to and from (respectively) an available temporary register. It should be noted that reducing the store/load pair further by moving the store data directly to the load destination is not possible, since the load destination register may be reused between the store and the load (and therefore override the data).
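A minimal sketch of this store/load renaming follows. It makes several simplifying assumptions that are not from the paper: operations are plain dictionaries, the rename cache is an unbounded Python dict rather than a 16-entry set-associative structure, only exact address matches are handled, and at most one load per address is tracked. The function and temporary register names are hypothetical.

```python
def rename_memory(ops):
    """Replace matching store/load pairs with moves through a temporary.

    `ops` is a slice in program order. The walk goes from youngest to
    oldest, as the walker does, and returns a new list.
    """
    out = [dict(op) for op in ops]
    rename_cache = {}                     # address -> index of the load in the slice
    next_temp = 0
    for idx in range(len(out) - 1, -1, -1):
        op = out[idx]
        if op["opcode"] == "load":
            rename_cache[op["addr"]] = idx
        elif op["opcode"] == "store" and op["addr"] in rename_cache:
            load_idx = rename_cache.pop(op["addr"])
            tmp = f"tmp{next_temp}"
            next_temp += 1
            load_dst = out[load_idx]["dst"]
            # store -> move data into the temporary
            out[idx] = {"opcode": "mov", "dst": tmp, "src": op["src"]}
            # load -> move from the temporary into the original destination
            out[load_idx] = {"opcode": "mov", "dst": load_dst, "src": tmp}
    return out


ops = [
    {"opcode": "store", "addr": 0x1000, "src": "rsi"},
    {"opcode": "add", "dst": "rdi", "src": "rdi"},
    {"opcode": "load", "addr": 0x1000, "dst": "rax"},
]
```

Routing the data through a dedicated temporary, rather than writing it straight into the load destination, keeps the value intact even if the destination register is reused between the store and the load.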
The walker completes the traversal upon 1) reaching the tail of the cyclic history queue; 2) when there are no longer valid sources marked in the source bitmap; or 3) when the loop completes a round-trip and the same load within the same BHR context is encountered. Upon successful completion, the walker switches the PIE to the “Validate” phase. When the same load context is encountered again, the prefetcher assigns a walker to perform the walk once more to validate that the code slice did not change. The PIE remains in the validation phase for several encounters to ensure the code is stable and to identify constant/strided values (the strides themselves may be caused by code beyond the scope of the slice).
The prefetcher performs three validation rounds. Other values (up to seven) were tested, indicating that prime numbers work better, especially when no BHR context is used, as they may avoid some loop patterns from confusing the validation process. However, a lower value was chosen as overall performance benefits more from the speed of generating new slices than from the additional accuracy that may accompany longer validation.
After finishing all validation rounds the entry is switched to the “Trim” phase. The “Trim” phase is the only step allowed to change the PIE slice since it was first generated. It performs the same walk, but stops tracking sources when reaching constants or strides that were discovered during the validation passes and replaces them with a simple immediate move or add/sub. As a result, some branches of the data dependency flow may be removed from the PIE slice.
Another change performed during trimming is renaming the destinations to temporary registers to avoid any side effects. The walker performs a forward traversal over the constructed slice and converts each destination register to the next available temp register. The conversions are recorded in the temporary register map. During the following traversal steps, all younger slice instructions will rename any matching sources to read from the corresponding temporary register. After trimming is done, the entry is switched to the “Armed” state.
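The forward renaming pass can be sketched as below. This is an illustrative model only: the `Op` record, the `t0, t1, ...` naming of temporaries, and the example slice are assumptions, and the limit on the number of temporaries is not enforced here.

```python
from typing import NamedTuple, Tuple


class Op(NamedTuple):
    opcode: str
    dst: str
    srcs: Tuple[str, ...]


def rename_to_temps(slice_ops):
    """Forward pass: give every destination a fresh temporary register.

    Sources are renamed before the map is updated, so an instruction that
    reads and writes the same register still reads the architectural
    (live-in) value unless an older slice instruction produced it.
    """
    temp_map = {}                         # architectural register -> temporary
    renamed = []
    for n, op in enumerate(slice_ops):
        srcs = tuple(temp_map.get(s, s) for s in op.srcs)
        tmp = f"t{n}"                     # next available temporary
        temp_map[op.dst] = tmp
        renamed.append(Op(op.opcode, tmp, srcs))
    return renamed


example = [
    Op("add", "rcx", ("rcx",)),
    Op("load", "rbx", ("rdx",)),
    Op("load", "rax", ("rbx", "rcx")),
]
```

After the pass no instruction in the slice writes an architectural register, while live-in registers such as `rcx` and `rdx` above are still read from the program state at the injection point.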
We assume that the walker FSM can handle up to 8 instructions per cycle without exceeding timing restrictions (based on similar existing mechanisms like branch recovery walks), so the full history walk should take up to 16 cycles. However, to ensure feasibility and allow larger history queue sizes, our evaluation assumes that a walk may take up to 64 cycles. Since the prefetcher may encounter additional loads during that time, it may use several parallel walkers, assigned to generation or validation phases based on availability.
Figure 8 shows an example of slice generation over code striding across an array that requires double dereference (e.g., because it was passed by pointer). The dependency chain is discovered by walking the history queue as shown on the right hand side (removing the unrelated multiply operation but identifying all other operations as part of the dependency chain). The intermediate slice remains consistent during several validation phase iterations. During that phase the add operation is detected as a stride of one and the two middle loads are identified as constants. Since RBX is constant, the Trim phase replaces it with a move and stops processing its dependencies (thereby also eliminating the RDX load). The final slice is therefore only three operations long.
III-C Slice injection
Once a slice has been armed, each encounter with its load context (i.e., hitting the same IP while having the same branch history) triggers the PIE slice injection. The allocation stops immediately prior to the triggering load (thus preserving the same register roles and meaning as seen during construction). The slice operations are then injected in order, with no lingering side effects as the temporary registers used are not part of the architectural state. Any memory operation is allowed to lookup the TLBs and caches and, if needed, perform page walks and allocate line fill buffers. These accesses may, by themselves, act as prefetches.
The injected operations may be executed by the normal machine out-of-order resources. However, this may incur a substantial cost to the actual program performance due to added stress over critical execution resources. Instead, an internal execution engine was added to perform the arithmetic operations without interfering with the normal core activity (other than stalling allocation). We evaluate both the shared-resources and the private-resources modes in Section V.
During the injection, sources marked as constants use the recorded constant value, but operations marked as having a stride are adjusted by having their stride value multiplied by a dynamic lookahead factor. The dynamic lookahead is initialized for each slice to one, meaning that by default the prefetcher injects the PIE slice as-observed, with no extrapolation (thereby performing the address computation of the next iteration). However, if the PIE is eventually detected as non-timely (as explained in the next section) the lookahead factor will increase gradually up to a maximum of 64 (chosen to allow prefetching far enough ahead of time, but not too far as to exceed the cache lifetime). All strides within a slice are always multiplied by the same lookahead factor so that the ratios between strides are always kept as they were detected over a single iteration.
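The effect of the shared lookahead factor on strided sources can be sketched as follows. The maximum of 64 and the rule that all strides use the same factor come from the text; the function name, the register names, and the dictionary-based representation are hypothetical, and the policy for growing the factor is not modeled.

```python
MAX_LOOKAHEAD = 64


def apply_lookahead(current, strides, lookahead):
    """Advance every strided source by stride * lookahead.

    `current` maps registers to their values at the trigger point and
    `strides` maps strided registers to their per-iteration stride.
    Registers without a recorded stride are left untouched.
    """
    lookahead = min(max(lookahead, 1), MAX_LOOKAHEAD)
    return {
        reg: value + strides.get(reg, 0) * lookahead
        for reg, value in current.items()
    }


# Hypothetical trigger state: an index, a byte pointer, and a plain base.
state = {"rcx": 10, "rsi": 0x1000, "rbx": 0x8000}
strides = {"rcx": 1, "rsi": 8}
```

Because both strided registers are scaled by the same factor, the index and the pointer stay in step: with a lookahead of 4, `rcx` moves by 4 elements and `rsi` by 32 bytes, exactly as they would after four real iterations.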
The final operation in the slice is a copy of the original load that the slice was constructed from, but since any strided sources were enhanced to apply a lookahead over their strides, the load address would belong to some future iteration. This becomes the final prefetch address and is sent to the memory unit as a prefetch. In parallel, it is also pushed to the prefetch queue along with its predicting PIE-id for usefulness tracking.
III-D Usefulness tracking
The generated prefetches must be correct and timely (i.e., the address should be used later by a demand, and do so within a sufficiently short time period as to avoid being flushed from the cache). We address both requirements by tracking the prefetches in the prefetch queue. Each demand address is checked against the queue to find the first (most recent) matching prefetch, and the entry is marked as hit. If the hit is within useful distance (determined by a reward function as in the context-RL prefetcher), the PIE receives a confidence upgrade based on the reward score. On the other hand, if a prefetch entry reaches the end of the prefetch queue without ever being hit, it is considered useless. The number of sent and useless prefetches is tracked in the PIE (the counters are both right-shifted whenever they are about to exceed their limit in order to preserve their ratio). When a PIE goes below the usefulness threshold (we used 10% in our experiments), it is reset, but allowed to regenerate the slice in case the current code flow changed compared to when it was originally constructed. If a PIE is reset more than 25 times, it is considered a stale PIE, and its state becomes Disabled, preventing it from reconstructing.
Another form of filtering is tracking recurring addresses. The last address emitted by each slice is saved, and if the slice generates it again multiple times in a row, the slice is reset due to low usefulness.
III-E Dropping PIE slices
Multiple issues could stop a slice construction process or reset an already constructed one. Construction can be aborted due to the following reasons:
Slice is inconsistent during validation. This may indicate insufficient context length, the code having no useful recurrence, or a complex control flow.
Timeout while waiting for another validation pass. This may indicate the load is not as hot as predicted.
Slice is too long (over 16 operations).
Complex instruction (e.g., one with non-trivial side effects).
Too many temporary registers (over 8) are needed.
Resets during slice construction are considered transient, meaning that a later attempt may still construct a useful slice. Conversely, slices can also be reset during run-time. If too many prefetches fall off the prefetch queue without ever being hit by demands, the slice may have failed capturing the semantic relation. The minimal usefulness ratio is configurable and by default a threshold of 10% is used. The failure rate is tracked using 2 counters: Failures and sent-prefetches. Both counters saturate at 64, and both shift right by 1 whenever any of them reaches that limit. If, after the counters reach steady-state, the ratio between them drops below the usefulness threshold, the prefetcher resets the entry.
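The two-counter failure tracking can be modeled as below. The limit of 64, the right shift by one, and the 10% default threshold are taken from the text; the class and method names are hypothetical, and `min_samples` is an assumed stand-in for "steady state", which the paper does not quantify.

```python
class UsefulnessTracker:
    LIMIT = 64                            # both counters are halved at this value

    def __init__(self, threshold=0.10):
        self.threshold = threshold        # minimal usefulness ratio
        self.sent = 0                     # prefetches sent
        self.failures = 0                 # prefetches that were never hit

    def record(self, useless):
        """Account for one prefetch leaving the prefetch queue."""
        self.sent += 1
        if useless:
            self.failures += 1
        if self.sent >= self.LIMIT or self.failures >= self.LIMIT:
            # Halve both counters together so their ratio is preserved.
            self.sent >>= 1
            self.failures >>= 1

    def should_reset(self, min_samples=32):
        """True when the useful fraction falls below the threshold."""
        if self.sent < min_samples:
            return False                  # not enough history yet (assumption)
        useful = self.sent - self.failures
        return useful / self.sent < self.threshold
```

Halving both counters at the same time keeps the hardware cost at a few bits per PIE while still letting recent behavior gradually outweigh old history.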
Alternatively, if the same code slice produces the exact same address over and over, the slice is no longer considered meaningful. This may occur when reaching the history limit during construction, when the walker cannot include a source being changed. Aborting a slice at run-time provides information that triggering a prefetch on the initiating context might harm performance. Therefore, the PIE array records these resets and keeps them (unless the PIE is overridden by another load context). If too many run-time resets occur, the PIE switches to disabled state and no longer accepts re-construction attempts for that context.
III-F Prefetcher area
The parameters used for the prefetcher are summarized in Table I. For the sake of this paper each stored micro-operation is assumed to be represented with 64 bits including all data required for reproduction. The history queue entry also has to store the 64-bit result value (for const/stride detection), and therefore requires 128 bits. A 128-entry history queue requires 2kB.
Each PIE slice requires 16 operations, a context tag for indexing, a walker ID (used during generation and validation) and additional bits for state and reset tracking. Overall size is 140 Bytes. Since the PIEs are relatively large, the PIE array holds only 16 entries, with a total size of 2.25kB. This is sufficient for most of the applications since the number of slices presented in Figure 4 refers to the entire lifetime of the application, but at any given program phase only a few slices are actively used. This is demonstrated later in Section V-C. Future work may find ways to reduce the size of each entry (for example by compressing identical operations, as some slices may share parts of their history). The storage size is therefore not a fundamental issue.
The slice generation FSM (walker) requires a source bitmap (32 bits), a memory renaming cache (16 entries with a 64b tag + 4b index each = 136 bytes), and a temporary registers map (40 bits). Each FSM also has a slice storage for the construction process, reaching a total of 280B. Having 2 parallel walkers would therefore require 0.6kB.
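The per-structure figures quoted in this subsection can be cross-checked with a few lines of arithmetic. The sketch below only recomputes numbers stated in the text (entry widths and entry counts); the variable names are hypothetical, and the breakdown of the remaining walker and PIE bytes into indices, tags, and state bits is not specified in the paper.

```python
BITS_PER_BYTE = 8

# History queue: 128 entries of 128 bits (64-bit micro-op + 64-bit result).
history_queue_bytes = 128 * 128 // BITS_PER_BYTE

# Memory renaming cache: 16 entries of a 64-bit tag plus a 4-bit index.
rename_cache_bytes = 16 * (64 + 4) // BITS_PER_BYTE

# Slice storage: 16 operations of 64 bits each.
slice_storage_bytes = 16 * 64 // BITS_PER_BYTE

# Walker: 32-bit source bitmap, renaming cache, 40-bit temp map, local slice.
walker_core_bytes = 32 // BITS_PER_BYTE + rename_cache_bytes + 40 // BITS_PER_BYTE + slice_storage_bytes
```

The history queue comes to 2048 bytes (2kB) and the renaming cache to 136 bytes, matching the text. The listed walker components add up to 273 bytes, which leaves a few bytes for the indices and counter within the quoted 280B, and the 128 bytes of slice storage leave 12 bytes of the 140-byte PIE for the tag, walker ID, and state bits.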
Power considerations are reviewed in Section V.
Table I: Prefetcher parameters
|History queue||128 instructions, 2kB|
|BHR size||24 bit (4b × 6 last branches)|
|Mem. renaming cache||16 × (64 + 4) bits = 1kB|
|Walkers||4 × 280B = 0.6kB|
|PIE array size||16, 2.25kB|
|Total size (kB)||6kB|
|hot/flaky thresholds||2 appearances / 1 miss|
Table II: Simulator parameters
|Core type||OoO, 4-wide fetch, 3.2GHz|
|Queue sizes||224 ROB, 97 RS, 180/168 int/FP regs, 72 load buffer, 56 store buffer|
|(estimated) 10 L1, 20 L2|
|L1 cache||32kB Data + 32kB Code, 8 way, 2 cycles|
|L2 cache||256kB, 4 ways|
|L3 cache||4MB, 16 ways x 2 slices|
|Memory||LPDDR3, 2-channel, 8GB, 1600 MHz|
|GHB (all)||GHB size: 2K, History length: 3, Prefetch degree: 3, Overall size: 32kB|
|SMS||PHT size: 2K, AGT size: 32, Filter: 32, Regions size: 2kB, Overall size: 20kB|
|VLDP||3 DPTs × 64 entries|
|Context RL||CST size: 2K entries x 4 links (18kB), Reducer: 16K entries (12kB)|
The semantic prefetcher was implemented in a proprietary cycle-accurate x86 simulator configured to match the Skylake micro-architecture and validated against real hardware over a wide range of applications (showing an average error margin within 2%). Table II specifies the parameters used. All prefetchers support L1 triggering and virtual addresses and can use TLBs/page walks when needed.
The prefetcher was tested over the SPEC 2006 and 2017 benchmark suites, compiled on Linux with ICC 14 and 16 (respectively). Each application had 5-8 different traces chosen by a SimPoint-equivalent tool based on workload characterization. The traces are weighted to represent overall application performance while measuring 5M instructions each (following a warmup of about 1B memory accesses and 20-50M actual instructions).
To test multithreaded performance we also run combinations of workloads on different threads, although we do not implement the ability to share learned slices across threads. For that purpose, we use combinations of traces from the same application to measure SPEC-rate behavior (where several copies of the same application run independently). We run the traces with separate physical memory address ranges to avoid data collisions. We also offset the run phases by a few million instructions to ensure some heterogeneity.
To evaluate the benefits of the semantic prefetcher, we first need to determine its ability to cover enough performance-critical loads within common workloads. Figure 9 shows the coverage of different SPEC 2006/2017 workloads: each application shows the number of dynamic loads analyzed by the semantic prefetcher, and the break-down by analysis outcome. The Non-Flaky component counts loads not deemed hot enough or not having enough misses to be interesting. The Timing component shows slices failing during the validation period due to a low rate of recurrence or hash conflicts. The Non-Stable component shows slices failing validation due to variability of the code path. These results are shown with zero context; increasing the context is shown later to reduce this component, since code paths are more consistent when compared across recurrences with the same branch history. The remaining component at the top shows the number of armed slices per workload.
The absolute number of slice validation failures is shown in Figure 10 for the 7 SPEC workloads with the highest failure rate when using no context (all having over 40% of their slices reset during slice generation). The number of failures is compared across context lengths of 0 to 24 bits (i.e., indexing loads based on the history of the last 0 to 6 branches and their resolutions). Adding the full context length eliminated between 30% (gcc) and 98% (milc) of the failures, indicating that recurrences with the same branch history are more likely to have consistent code slice behavior.
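The effect of context on slice indexing can be sketched as follows. The 4-bits-per-branch, 6-branch layout follows the BHR size in Table II, but the exact encoding (3 low PC bits plus the resolution bit) and the names `BranchHistory` and `slice_key` are assumptions for illustration only.

```python
BITS_PER_BRANCH = 4      # 4b per branch, as in the BHR row of Table II
MAX_BRANCHES = 6         # 6 last branches -> 24-bit BHR
BHR_MASK = (1 << (BITS_PER_BRANCH * MAX_BRANCHES)) - 1


class BranchHistory:
    """Shift register holding the last MAX_BRANCHES branch records."""

    def __init__(self) -> None:
        self.bhr = 0

    def record(self, branch_pc: int, taken: bool) -> None:
        # Assumed encoding: 3 low PC bits plus 1 resolution bit per branch.
        nibble = ((branch_pc & 0x7) << 1) | int(taken)
        self.bhr = ((self.bhr << BITS_PER_BRANCH) | nibble) & BHR_MASK

    def context(self, n_branches: int) -> int:
        """Return the history of the most recent n_branches (0..6)."""
        if not 0 <= n_branches <= MAX_BRANCHES:
            raise ValueError("n_branches must be between 0 and 6")
        return self.bhr & ((1 << (BITS_PER_BRANCH * n_branches)) - 1)


def slice_key(load_pc: int, history: BranchHistory, n_branches: int) -> tuple:
    """Key under which a load's slice is tracked: (load PC, branch context)."""
    return (load_pc, history.context(n_branches))


# The same load reached through two different outcomes of one branch.
path_a, path_b = BranchHistory(), BranchHistory()
path_a.record(0x401, taken=True)
path_b.record(0x401, taken=False)

# Zero context: both paths share one slice and may disagree on its code flow.
print(slice_key(0x1000, path_a, 0) == slice_key(0x1000, path_b, 0))
# One branch of context: each path builds its own slice.
print(slice_key(0x1000, path_a, 1) == slice_key(0x1000, path_b, 1))
```

With zero context both paths collapse onto the same key, which is where code-flow variance would show up as a validation failure; with context the two paths are tracked separately.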
The overall breakdown of reset causes appears in Figure 11. The first element is failures due to hash collisions: new dynamic loads matching the PIE index of an existing slice that is under construction, causing it to drop (armed slices are protected from overwrite and can only be reset due to low usefulness). The second element is variance in code flow during slice generation. The third is timeout during the construction: a slice that was not armed within 100k cycles is reset due to low relevance. The last reset cause, Too-Many-Failures, is a run-time reset cause, occurring after a slice was validated and armed, as explained in Section III-D.
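The four reset causes can be summarized as a small state check per PIE entry. This is a minimal sketch, not the hardware logic: the 100k-cycle timeout comes from the text, while `MAX_FAILURES`, the choice to clear the failure count on a useful prefetch, and all class and method names are hypothetical.

```python
from enum import Enum, auto
from typing import Optional

ARM_TIMEOUT_CYCLES = 100_000   # from the text
MAX_FAILURES = 4               # hypothetical run-time failure limit


class Reset(Enum):
    HASH_COLLISION = auto()
    FLOW_VARIANCE = auto()
    TIMEOUT = auto()
    TOO_MANY_FAILURES = auto()


class PieEntry:
    """One tracked slice: under construction until armed, then in use."""

    def __init__(self, load_tag: int, start_cycle: int) -> None:
        self.load_tag = load_tag
        self.start_cycle = start_cycle
        self.armed = False
        self.failures = 0

    def on_allocation_request(self, new_tag: int) -> Optional[Reset]:
        # A different load hashing to this entry evicts it only while unarmed.
        if new_tag != self.load_tag and not self.armed:
            return Reset.HASH_COLLISION
        return None

    def on_walk(self, same_flow: bool, cycle: int) -> Optional[Reset]:
        # Construction-time checks; armed slices are not affected.
        if self.armed:
            return None
        if cycle - self.start_cycle > ARM_TIMEOUT_CYCLES:
            return Reset.TIMEOUT
        if not same_flow:
            return Reset.FLOW_VARIANCE
        return None

    def on_prefetch_outcome(self, useful: bool) -> Optional[Reset]:
        # Run-time check, only meaningful once the slice is armed.
        if not self.armed:
            return None
        # Assumption: a useful prefetch clears the failure count.
        self.failures = 0 if useful else self.failures + 1
        if self.failures >= MAX_FAILURES:
            return Reset.TOO_MANY_FAILURES
        return None


entry = PieEntry(load_tag=0x1000, start_cycle=0)
print(entry.on_allocation_request(0x2000))   # collision while under construction
entry.armed = True
print(entry.on_allocation_request(0x2000))   # armed slices are protected
```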
It should be noted that the various reset thresholds and parameters have shown very little sensitivity to tuning. This is because most slice stabilization and resets occur during warmup and are therefore negligible on longer runs. On the other hand, the flakiness and usefulness parameters are more sensitive to tuning, and perform better the more aggressively they are set (i.e., building and maintaining slices for more loads). However, optimizing with a higher cost of slice injection (especially without dedicated execution) would likely lead to more conservative thresholds.
The speedup of the semantic prefetcher (with 0 and 16 bits of context) is shown in Figure 12 across the SPEC 2006/2017 benchmarks. Several competing prefetchers are also shown. The overall gain is significantly higher on SPEC 2006 (24.5%), mostly due to the lack of software prefetching by the older compiler, but an improvement is also seen on SPEC 2017 (16%).
Slice injection also adds computation work. Figure 13 shows the injected instructions as a share of the overall instructions executed, compared with the performance gain. In most cases there is good correlation (i.e., the performance gain is proportional to the added work), but some applications with relatively simple slices are able to gain significantly more than their overhead. If we assume the prefetcher’s steady-state power cost (disregarding slice generation) is equivalent to the added operations, then the power/performance score shows on average 2.5× more IPC gain than power cost.
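The power/performance score described above reduces to a ratio between IPC gain and the share of injected instructions. The sketch below shows that arithmetic under the stated assumption that steady-state power cost equals the added operations; the function name and the per-application numbers are made up and are not results from the paper.

```python
def gain_to_cost(ipc_gain_pct: float, injected_pct: float) -> float:
    """IPC gain divided by injected-instruction share (both in percent)."""
    if injected_pct <= 0:
        raise ValueError("injected_pct must be positive")
    return ipc_gain_pct / injected_pct


# Made-up (IPC gain %, injected instructions %) pairs, for illustration only.
samples = {"app_a": (30.0, 10.0), "app_b": (6.0, 6.0)}
ratios = {name: gain_to_cost(gain, cost) for name, (gain, cost) in samples.items()}
mean_ratio = sum(ratios.values()) / len(ratios)
print(ratios, mean_ratio)
```

A ratio above 1 means the application gains more performance than it pays in added work; a ratio of exactly 1 corresponds to the proportional case mentioned above.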
Finally, the semantic prefetcher also improves performance on multi-threaded runs. Figure 14 shows the speedup over SPEC-rate simulation (4 traces from the same application over 4 physical cores). In some cases prefetching provides a higher gain than on a single thread (in h264, for example). There is no sharing of generated slices across physical cores, so the only gain comes from increasing the effective prefetching depth. On MT runs the system becomes more saturated, and memory bandwidth usually becomes a more critical bottleneck than memory latency. This may reduce the efficiency of prefetching (or even the chances of issuing the prefetches). On the other hand, prefetches may also serve as cache hints that increase the lifetime of required lines, thereby reducing overall bandwidth and improving MT performance. On the highest MT gainer (libquantum), the prefetcher reduced L1 MPKI by 40%.
V-A Comparison with other prefetchers
We compare the speedup of the semantic prefetcher with other prefetchers that take different approaches and offer different coverage. The semantic prefetcher wins over most SPEC workloads, scoring on average more than twice the speedup of the next best prefetcher. However, on some workloads the semantic prefetcher loses to one of the competing prefetchers. In gemsFDTD, a simple stride prefetcher gains almost 60% speedup while the semantic prefetcher gains only 17% at best. The reason is short nested loops, where the inner recurrence is too long to fit in the context history length, but too short to allow re-learning the inner loop on every outer iteration. This control flow gives an advantage to simple and fast prefetchers that need only a few loop recurrences to learn and trigger (the stride prefetcher can start issuing prefetches after the 3rd inner iteration). The semantic prefetcher can still learn the code pattern given sufficient context, and in fact begins to gain performance by covering at least some of the cases with a context of 32 bits and above, but such a context length begins to stress the physical design and does not solve the general case where inner loops can be much longer. This could be solved by having the context support a compressed representation of loop patterns.
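One way to picture the nested-loop limitation is to look at what a 6-branch history can distinguish inside an inner loop. The sketch below is a simplified model written for this illustration (only loop back-edges are recorded, and the function name is hypothetical); it is not the prefetcher's actual context logic.

```python
from collections import deque


def contexts_per_iteration(inner_trip: int, history_len: int = 6, outer_trip: int = 3):
    """Branch-history context seen by a load in each inner-loop iteration.

    Returns one list of contexts per outer iteration. Each context is the
    tuple of the last `history_len` (branch, taken) records.
    """
    history = deque([None] * history_len, maxlen=history_len)
    result = []
    for _ in range(outer_trip):
        contexts = []
        for it in range(inner_trip):
            contexts.append(tuple(history))                 # context at the load
            history.append(("inner", it < inner_trip - 1))  # inner back-edge
        history.append(("outer", True))                     # outer back-edge
        result.append(contexts)
    return result


for trip in (4, 10):
    steady = contexts_per_iteration(trip)[-1]   # skip the cold first pass
    print(trip, "iterations ->", len(set(steady)), "distinct contexts")
```

In this model an inner loop shorter than the history gives every iteration its own context, while a longer one leaves the later iterations with a history made only of taken inner back-edges, so they alias to a single context.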
Another competing prefetcher that outperforms the semantic prefetcher on some workloads is the Best-Offset prefetcher (BOP). BOP has several outliers in its favor, most notably zeusMP. Unlike GHB and other fast-learning prefetchers, BOP also takes a while to discover the optimal depth, but once it does, it has a throttling effect where it eliminates unnecessary prefetches (at sub-optimal offsets). The gains in zeusMP (and to a lesser extent also in dealII, sphinx3 and cactusADM) come mostly from reducing the bandwidth consumed by excessive prefetching. For the same reason, adding context to the semantic prefetcher also helps on zeusMP by eliminating badly constructed slices that emit useless prefetches.
Finally, IMP presents an interesting point: within the SPEC applications it wins only on WRF (and only by 8%), but the graph500 example had array dereferences that make IMP quite useful, except on the longest ones.
V-B Lookahead method
The lookahead multiplier (the number of strides we prefetch ahead) plays a key role in the prefetcher speedup. However, when applying a fixed multiplier we noticed that different workloads favored different values, and finding the best value required dynamic tuning. We implemented the following approaches:
Constant lookahead: we set a constant value and always perform the lookahead according to it.
Hit-depth normalization: we measure the average depth of hits within the prefetch queue, which indicates the actual distance between a prefetch and the demand access using it. We then increase or decrease the lookahead value to normalize this hit depth to the desired range (if we hit too early, around the beginning of the prefetch queue, we need to extend our lookahead, and vice versa).
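The hit-depth normalization policy can be sketched as a small feedback controller. The text only states that the lookahead is increased or decreased to keep the hit depth in a desired range, so everything else here is an assumption: the window size, the range bounds, the doubling/halving step, the convention that depth 0 is the most recently issued prefetch, and the class name itself.

```python
class LookaheadController:
    """Adjusts the lookahead multiplier from observed prefetch-queue hit depths."""

    def __init__(self, start: int = 1, queue_len: int = 16, window: int = 8,
                 low: float = 0.25, high: float = 0.75,
                 max_lookahead: int = 64) -> None:
        self.lookahead = start
        self.queue_len = queue_len
        self.window = window            # hits averaged per adjustment (assumed)
        self.low, self.high = low, high  # desired range, as a fraction of the queue
        self.max_lookahead = max_lookahead
        self._depths = []

    def record_hit(self, depth: int) -> None:
        """depth: queue position of the hit entry, 0 = most recently issued."""
        self._depths.append(depth)
        if len(self._depths) < self.window:
            return
        avg = sum(self._depths) / len(self._depths) / self.queue_len
        self._depths.clear()
        if avg < self.low:
            # Demand arrives soon after the prefetch: look further ahead.
            self.lookahead = min(self.lookahead * 2, self.max_lookahead)
        elif avg > self.high:
            # Prefetched lines wait too long before use: pull back.
            self.lookahead = max(self.lookahead // 2, 1)


ctl = LookaheadController(start=1)
for _ in range(8):
    ctl.record_hit(1)       # hits near the head of the queue
print(ctl.lookahead)
```

Starting the controller at 1 or 16 corresponds to the two dynamic configurations compared in the evaluation; a fixed policy would simply never call `record_hit`.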
The difference between the policies is shown in Figure 15. Lookahead 1 and 16 dynamically adjust the lookahead distance, starting from a multiplier of 1 or 16 iterations ahead (respectively) and increasing from there. The fixed lookahead policy (always 32 iterations ahead) has some minor gains in cases where it starts from a more effective prefetching depth while the dynamic policy takes a while to get there, but it is ultimately inferior on most runs, where the dynamic approach is more adaptable.
V-C Execution resources
The semantic prefetcher uses a dedicated generic ALU as the execution engine of the PIE slices. However, if the slice execution latency becomes a critical factor, the area addition is too high, and the overall execution bandwidth is sufficient, we may choose to simply inject the slice code into the main out-of-order engine and let the existing resources do the work for us. Figure 16 shows the penalty of dropping the dedicated HW and sharing the core execution resources.
Another tradeoff is the number of walkers performing the slice generation and validation. Figure 17 uses cumulative time histograms to show how many parallel walks (traversals over the history queue) and PIE entries (slices tracked for construction) are needed. Overall, cycles with any number of active walks account for only 0.5% of the run-time (also indicating that the power consumption of the walk itself is negligible). The results indicate that two walkers are sufficient. In the same way, 16 PIE entries are enough to cover 99.4% of the run.
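Reading a resource count off a cumulative time histogram amounts to finding the smallest count whose cumulative share of cycles reaches a target coverage. The sketch below shows that lookup; the function name and the histogram values are made up for illustration and are not the measured data behind Figure 17.

```python
def resources_needed(cycles_by_count: dict, coverage: float) -> int:
    """Smallest resource count that covers `coverage` of the histogram's cycles.

    cycles_by_count[n] is the number of cycles in which exactly n units
    (walkers or PIE entries) were busy.
    """
    total = sum(cycles_by_count.values())
    covered = 0
    for count in sorted(cycles_by_count):
        covered += cycles_by_count[count]
        if covered >= coverage * total:
            return count
    return max(cycles_by_count)


# Made-up histogram of cycles with 1..4 concurrent walks (not measured data).
walk_cycles = {1: 900, 2: 80, 3: 15, 4: 5}
print(resources_needed(walk_cycles, 0.95))
```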
VI Related work
VI-A Program semantics
Several research efforts attempt to automate the analysis and understanding of software applications. Shape analysis [33] attempts to build a set of properties and rules representing the program’s data sets in order to facilitate program verification (mostly of memory object allocation and management, bounds/coherence checking and functional correctness).
Newer approaches attempt to represent programs in abstract forms derived from their code and behavioral analysis [15, 5], in order to find similarities for code suggestion/completion, anti-plagiarism, or algorithm identification. These approaches may be useful in high level dynamic tuning (adjusting HW properties such as the type of prefetcher used, or optimal aggressiveness), but they do not yet assist in the analysis of the access pattern or address generation.
VI-B Using code slices
Collecting and using slices of actual code has already been proposed for various purposes. The trace cache [29] is a form of constructing and efficiently caching selective code paths based on run-time analysis. In the realm of branch prediction, Zilles et al. [41] proposed using code slices to execute ahead along the predicted code path in order to resolve hard-to-predict branches. A similar approach suggested by Peled et al. [26] is based on injecting code slices to resolve data-dependent branches by prefetching the data, using it to resolve the branch condition, and queuing the resolution for overriding the prediction.
This method was also proposed for memory latency mitigation. Carlson et al. [6] proposed a similar mechanism that dynamically learns load dependency chains in order to expedite their execution. However, their approach was based on in-order cores, and was motivated by extracting a small portion of the ILP available to full out-of-order cores by executing only load address dependencies out of order.
Prefetching can also be achieved by executing actual code (or even just critical subsets of it) ahead of time, as proposed by Mutlu and Hashemi et al. in their set of Runahead techniques [24, 13, 14], and by Collins et al. in their speculative precomputation technique [10] (extended by Atta et al. [1] to also include limited control flow). That work relied on continued execution of the same program context past memory stalls, and focused on managing a complicated speculative execution state for that purpose. It did not modify the executed code, although most approaches did filter out code not required for memory address calculation. It also did no extrapolation, and thus was limited in range to what could fit in the enhanced out-of-order window.
Prefetching based on actual code slices can also be done by helper threads (the Runahead flavor by Xekalakis et al. [39] and slice-based processors by Moshovos et al. [23]). Similar decoupled approaches were also used for branch prediction by Chappell et al. [8, 7] and Farcy et al. [12]. However, this form is asynchronous with the progress of the main thread and cannot ensure a fixed (or even positive) prefetching distance.
Another form of expedited execution through other threads is hardware scouting [9], which is intended for highly multithreaded machines and uses other threads to run ahead of execution. However, this approach attempts to optimize MT throughput, and does not address single-threaded performance.
VI-C Prefetching techniques
Falsafi and Wenisch classified prefetching techniques into the following groups [11]:
, and Access Map Pattern Matching (AMPM), proposed various methods of testing different strides and choosing the optimal one, thereby covering complex flows through common recurrence deltas. Other prefetchers such as the variable length delta prefetcher (VLDP) [34] extended that ability to varying stride patterns.
Address-correlating prefetchers detect correlation within sequences of recurring accesses. This form of locality has the ability to cover some semantic relations, but is ultimately limited by the storage capacity for correlated addresses. Examples include the Markov predictor [18], the Global History Buffer Address-Correlation flavors (GHB/AC) [25], and prefetchers targeting linked data structures through partial memoization, such as those by Roth, Moshovos and Sohi [31, 32], and Bekerman et al. [4]. An extension of address-correlation is context-correlation [27, 28], which seeks to correlate a larger context vector with future addresses.
Spatially-correlated prefetchers use an extension of temporal locality that correlates between spatial patterns instead of absolute addresses. These prefetchers seek out recurring spatial patterns that are not part of a long consecutive sequence but may repeat locally, such as accesses to the same fields of a structure across different instances. Examples of this family are Spatial Memory Streaming (SMS) [36] and the DC flavors of GHB [25].
Irregular data pattern prefetchers target specific data structures that do not have spatio-temporal locality. IMP [40] prefetches future elements within an array of indexes (A[B[i]]). Other data-driven prefetchers include the Irregular Stream Buffer (ISB) [17], which restructures the dataset spatially. Another form of irregular prefetching based on context is B-Fetch [19] by Kadjo et al., which uses branch history to detect strides in registers used for address generation. However, since it does not execute actual code, it is limited to simple register strides and cannot reconstruct complex value manipulations or see through memory dereferences.
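The A[B[i]] pattern mentioned above can be made concrete with a short address calculation. This is a generic sketch of indirect address generation, not IMP's actual hardware algorithm; the function name, the `distance` and `degree` parameters, and the sample values are illustrative assumptions.

```python
def indirect_prefetch_addresses(index_array, i, base_addr, elem_size,
                                distance, degree=1):
    """Addresses of A[B[i + distance]] .. A[B[i + distance + degree - 1]].

    index_array plays the role of B; base_addr and elem_size describe A.
    Positions past the end of B are skipped.
    """
    addrs = []
    for k in range(i + distance, i + distance + degree):
        if 0 <= k < len(index_array):
            addrs.append(base_addr + index_array[k] * elem_size)
    return addrs


B = [7, 2, 9, 4, 0, 5]
addrs = indirect_prefetch_addresses(B, i=1, base_addr=0x1000, elem_size=8,
                                    distance=2, degree=2)
print([hex(a) for a in addrs])
```

The target addresses depend on the values stored in B rather than on any address delta, which is why stride-based schemes cannot follow this pattern without reading the index array.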
This paper presents the semantic prefetcher. The prefetcher is designed to utilize the most generalized form of locality: semantic locality, which is not limited to spatial or temporal artifacts of the memory address sequence, but instead attempts to extract the code slices responsible for invoking consequential memory accesses along the data structure traversal path or algorithmic flow. We then combine and manipulate these slices into forecast slices by utilizing locality artifacts within their code to create new functionality that can predict the memory behavior of future iterations.
While some existing prefetchers attempt to capture access patterns that belong to specific use-cases, whether spatio-temporal relations, temporal correlations, or even irregular and data-dependent patterns, there is currently no generalized technique that can capture all such cases. The semantic prefetcher attempts to solve that by observing the code path directly, imitating any program functionality and extending it to create lookahead functionality. Based on that technique, the semantic prefetcher can extend the coverage provided by existing prefetchers (including irregular ones) through inherently supporting complex address generation flows and multiple memory dereferences. The semantic prefetcher provides a speedup of 24.5% over SPEC-2006 and 16% over SPEC-2017, exceeding gains from other prefetchers by over 2×.
[1] (2015) Self-contained, accurate precomputation prefetching. In Proceedings of the 48th International Symposium on Microarchitecture, MICRO-48, New York, NY, USA, pp. 153–165. Cited by: §I, §VI-B.
[2] (1991) An effective on-chip preloading scheme to reduce data access penalty. In Supercomputing’91: Proceedings of the 1991 ACM/IEEE Conference on Supercomputing, pp. 176–186. Cited by: §II.
[3] (2018-02) Domino temporal data prefetcher. In Symp. on High-Performance Computer Architecture (HPCA), pp. 131–142. Cited by: §I.
[4] (1999-05) Correlated load-address predictors. In Intl. Symp. on Computer Architecture (ISCA). Cited by: §II, 2nd item.
[5] (2018) Neural code comprehension: A learnable representation of code semantics. CoRR abs/1806.07336. Cited by: §VI-A.
[6] (2015-06) The load slice core microarchitecture. In Intl. Symp. on Computer Architecture (ISCA), pp. 272–284. Cited by: §VI-B.
[7] (2002-05) Difficult-path branch prediction using subordinate microthreads. In Proceedings 29th Annual International Symposium on Computer Architecture, pp. 307–317. Cited by: §VI-B.
[8] (1999) Simultaneous subordinate microthreading (SSMT). In Proceedings of the 26th International Symposium on Computer Architecture (Cat. No. 99CB36367), pp. 186–195. Cited by: §VI-B.
[9] (2005-05) High-performance throughput computing. IEEE Micro 25 (3), pp. 32–45. Cited by: §I, §VI-B.
[10] (2001) Speculative precomputation: long-range prefetching of delinquent loads. In Proceedings 28th Annual International Symposium on Computer Architecture, pp. 14–25. Cited by: §I, §VI-B.
[11] (2014) A primer on hardware prefetching. Synthesis Lectures on Computer Architecture 9 (1). Cited by: §I, §II, §VI-C.
[12] (1998) Dataflow analysis of branch mispredictions and its application to early resolution of branch outcomes. In Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture, pp. 59–68. Cited by: §VI-B.
[13] (2016) Continuous runahead: transparent hardware acceleration for memory intensive workloads. In The 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-49, Piscataway, NJ, USA, pp. 61:1–61:12. Cited by: §I, §VI-B.
[14] (2015) Filtered runahead execution with a runahead buffer. In Proceedings of the 48th International Symposium on Microarchitecture, pp. 358–369. Cited by: §VI-B.
[15] (2018) Code vectors: understanding programs through embedded abstracted symbolic traces. In ESEC/SIGSOFT FSE. Cited by: §VI-A.
[16] (2009-06) Access map pattern matching for data cache prefetch. In Intl. Conf. on Supercomputing (ICS). Cited by: 1st item.
[17] (2013-12) Linearizing irregular memory accesses for improved correlated prefetching. In Intl. Symp. on Microarchitecture (MICRO). Cited by: 4th item.
[18] (1997-06) Prefetching using Markov predictors. In Intl. Symp. on Computer Architecture (ISCA). Cited by: 2nd item.
[19] (2014-12) B-Fetch: branch prediction directed prefetching for chip-multiprocessors. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 623–634. Cited by: 4th item.
[20] Cited by: §III-B.
[21] (2015) Technology insight: Intel’s next generation microarchitecture code name Skylake. In Intel Developer Forum, San Francisco. Cited by: §IV.
[22] (2016-03) Best-offset hardware prefetching. In Symp. on High-Performance Computer Architecture (HPCA), pp. 469–480. Cited by: 1st item.
[23] (2001) Slice-processors: an implementation of operation-based prediction. In Proceedings of the 15th International Conference on Supercomputing, ICS ’01, New York, NY, USA, pp. 321–334. Cited by: §I, §VI-B.
[24] (2003-02) Runahead execution: an alternative to very large instruction windows for out-of-order processors. In The Ninth International Symposium on High-Performance Computer Architecture, 2003. HPCA-9 2003. Proceedings., pp. 129–140. Cited by: §I, §VI-B.
[25] (2004-02) Data cache prefetching using a global history buffer. In Symp. on High-Performance Computer Architecture (HPCA). Cited by: §I, §II, TABLE II, 2nd item, 3rd item.
[26] (2018-06) Branch predictor with branch resolution code injection. Google Patents. Note: US Patent App. 15/385,011. Cited by: §VI-B.
[27] Semantic locality and context-based prefetching using reinforcement learning. In Intl. Symp. on Computer Architecture (ISCA). Cited by: §I, §II, §III-D, TABLE II, 2nd item.
[28] A neural network prefetcher for arbitrary memory access patterns. ACM Trans. Archit. Code Optim. 16 (4), pp. 37:1–37:27. Cited by: §II, 2nd item.
[29] (1995-01) Dynamic flow instruction cache memory organized around trace segments independent of virtual address line. Note: US Patent 5,381,533. Cited by: §VI-B.
[30] (2014-02) Sandbox prefetching: safe run-time evaluation of aggressive prefetchers. In Symp. on High-Performance Computer Architecture (HPCA). Cited by: 1st item.
[31] (1998-10) Dependence based prefetching for linked data structures. In Intl. Conf. on Arch. Support for Programming Languages & Operating Systems (ASPLOS). Cited by: §II, 2nd item.
[32] (1999-05) Effective jump-pointer prefetching for linked data structures. In Intl. Symp. on Computer Architecture (ISCA). Cited by: §II-C, §II, 2nd item.
[33] (1999) Parametric shape analysis via 3-valued logic. In Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’99, New York, NY, USA, pp. 105–118. Cited by: §VI-A.
[34] (2015-12) Efficiently prefetching complex address patterns. In Intl. Symp. on Microarchitecture (MICRO). Cited by: TABLE II, 1st item.
[35] (2009) Spatio-temporal memory streaming. In Intl. Symp. on Computer Architecture (ISCA), ISCA ’09, pp. 69–80. Cited by: §I.
[36] (2006-06) Spatial memory streaming. In Intl. Symp. on Computer Architecture (ISCA). Cited by: TABLE II, 3rd item.
[37] (1997) Improving the accuracy and performance of memory communication through renaming. In Proceedings of the 30th Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 30, Washington, DC, USA, pp. 218–227. Cited by: §III-B.
[38] (2009-02) Practical off-chip meta-data for temporal memory streaming. In Symp. on High-Performance Computer Architecture (HPCA), pp. 79–90. Cited by: §I.
[39] (2009) Combining thread level speculation helper threads and runahead execution. In Proceedings of the 23rd International Conference on Supercomputing, ICS ’09, New York, NY, USA, pp. 410–420. Cited by: §I, §VI-B.
[40] (2015) IMP: indirect memory prefetcher. In Proceedings of the 48th International Symposium on Microarchitecture, pp. 178–190. Cited by: §II-C, §II, 4th item.
[41] (2001) Execution-based prediction using speculative slices. In Intl. Symp. on Computer Architecture (ISCA), ISCA ’01, New York, NY, USA, pp. 2–13. Cited by: §VI-B.