I Introduction
† 2nd ISCA International Workshop on AI-Assisted Design for Architecture (AIDArc), June 2019, Phoenix, AZ, USA

CPU branch predictors enable speculative execution and are a critical tool for hiding latency in out-of-order cores. They work by inferring the unresolved direction of a branch instruction when it is fetched, based on a model trained on previously observed directions. Today, branch prediction units (BPUs) perform both prediction and training online within a CPU's frontend, as an application runs. Though tightly constrained, e.g. in storage and latency, existing predictors achieve 99% accuracy on 99% of static branch instructions in SPEC 2017 [1, 2, 3].
However, the mispredictions that remain hide a major performance upside. Our data [3] shows that a small number of static branch instructions, just 10 on average per SPECint 2017 SimPoint phase, are systematically mispredicted. Improving accuracy on these hard-to-predict branches (H2Ps) would boost instructions per cycle (IPC) on an Intel Skylake core by up to 14.0%, and by up to 37.4% on a projected future CPU pipeline with width and depth scaled by 4. But when the best-known branch predictors are afforded exponentially more resources, 80% of this opportunity remains untapped. New approaches are needed to extract this performance, which lies in just a handful of static branches in each application.
For the first time, branch prediction poses an attractive deployment scenario for machine learning (ML). Gains in branch predictors over past decades have balanced strict BPU constraints against the need for high accuracy on thousands of static branches at a time. Solutions have favored simple, lightweight pattern matching [4, 2, 3], while comparatively powerful yet expensive ML models such as deep neural networks, support vector machines, and random forests have been left unexplored. The large IPC opportunity that remains, and its concentration in a few H2Ps that resist existing techniques, leads us to pursue ML models that implement more sophisticated pattern matching within the BPU.
We propose ML-driven helper predictors that operate alongside a baseline predictor to boost accuracy for individual H2Ps. This report provides a tutorial developing convolutional neural network (CNN) helpers to improve pattern matching on the same global history data used by existing branch predictors. We show how convolutional filters better tolerate distortions in history data caused by control structures with variable iteration counts. We then train CNN helpers with 2-bit weights and translate their inference procedure into a small number of table lookups that meet BPU constraints. Finally, we evaluate CNN helpers on applications traced over multiple inputs to establish that gains hold in future executions. At full precision, CNN helpers reduce mispredictions by an average of 36.6% on 47% of H2Ps in SPECint 2017; our implementable design improves 24% of H2Ps by 14.4%.
We adopt a deployment scenario in which helpers are trained on runtime data offline and uploaded to the BPU in future application executions to generate predictions online [5, 6]. This approach amortizes training over the lifetime of a device and across devices that run the same application, e.g. in a datacenter. The result is an application-specific IPC boost that requires no access to source code and no painstaking expert analysis. Given a rich set of ML helper predictors, we envision an optimization service that automatically fits the best helper to each H2P and packages those that maximize IPC as application metadata. CNN helpers solve one source of systematic misprediction, and we intend this report as a blueprint for the development of other ML-driven helpers.
II Mispredictions Due to Variable-Iteration Control Structures
We motivate CNN helpers by showing one class of H2P that arises due to control structures with data-dependent iteration counts. Two examples, one illustrative and the other drawn from deepsjeng in SPEC 2017, demonstrate that this common motif causes positional variations in the data available to the BPU. Such distortions confound state-of-the-art predictors that rely on exact sequence matching or positional correlations, but are tolerated by convolutional filters. These examples are predicated on the following:

- We consider conditional branches only;

- When a branch is fetched, its global history is the sequence of instruction pointer values (IPs) and directions of the branches executed leading up to the current instruction;

- TAGE-SCL is the state-of-the-art branch predictor [2]. It conditions each prediction on the longest recognized subsequence of global history by approximating Partial Pattern Matching (PPM) [7]. It is implemented by hashing global history subsequences into tagged tables. Table entries hold a saturating counter that tallies previously observed directions and can be thresholded to make a prediction. TAGE-SCL also implements a loop predictor, arbitrating between it and the longest-matching PPM predictions using a statistical corrector, itself a perceptron;

- The perceptron predictor (distinct from the statistical corrector above) is an alternative to PPM predictors that trains weights for each global history position, isolating directions correlated with the current prediction [8]. This mechanism filters noisy data that affects TAGE-SCL's exact-match hash lookups, but requires positional weights to be stored and retrieved for many branches.
Illustrative Example – Listing 1 showcases an H2P (H2P1) whose global history is affected by a loop with a variable iteration count. H2P1 is exactly correlated with a preceding data-dependent branch, and both branches are biased to be taken 33% of the time when uvec's values are uniformly distributed. Crucially, they are separated by a loop whose iteration count depends on data. This code contains a simple, stable relationship that predicts H2P1: with no additional information on data values, these two branches should be predicted with 66% and 100% accuracy, respectively.
When a simple program calls f() repeatedly with random inputs, H2P1's global histories exhibit significant variations. The loop injects different numbers of uncorrelated branches into the history data, causing the position of the predictive data-dependent branch to change relative to H2P1. This positional variation explodes the number of unique histories a PPM predictor must memorize and breaks perceptron predictors that require positional consistency. Consequently, TAGE-SCL predicts H2P1 with 68% accuracy, storing statistics in table entries corresponding to all tracked subsequence lengths while reusing few for prediction. Training a perceptron on H2P1's global history gives a similar 69% accuracy. In contrast, a CNN helper predicts H2P1 with 100% accuracy (see Section III).
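The distortion can be made concrete with a few lines of simulation. This is a hypothetical sketch, not the authors' code: IP 0x400587 (taken from the filter discussion later in the text) stands in for the correlated data-dependent branch, and 0x400600 is an assumed IP for the loop back-edge.

```python
import random

def make_histories(n=1000, max_iters=8, seed=0):
    """Simulate the H2P1 motif: a data-dependent branch whose direction
    exactly predicts the H2P, separated from it by a loop with a
    data-dependent iteration count."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        taken = rng.random() < 0.33            # data-dependent branch, ~33% taken
        iters = rng.randrange(1, max_iters)    # variable loop trip count
        # Most-recent-first global history of (IP, direction) tuples: the loop
        # back-edge branch appears `iters` times before the correlated branch.
        history = [(0x400600, True)] * iters + [(0x400587, taken)]
        samples.append((history, taken))       # H2P outcome == correlated branch
    return samples

samples = make_histories()
# The correlated branch's position shifts with the loop's iteration count, so
# exact-sequence (PPM) and fixed-position (perceptron) matching both degrade.
positions = {len(h) - 1 for h, _ in samples}
```

Running this shows the predictive branch scattered over many history positions even though its direction perfectly determines the H2P outcome.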
[Fig. 1 (pc_pos_histogram.pdf): histogram of Correlated Branch A's position in the H2P's global history]
SPECint 2017 deepsjeng – Positional variations appear in the wild in SPECint 2017, for example in deepsjeng. A code fragment from the deepsjeng source contains an H2P branch (red) and a set of correlated branches in the H2P's global history (orange and yellow). They all reside within a loop conditioned on the variable lc, whose value is initialized before the loop but modified over its iterations. As a result, the loop iteration count is variable and the correlated branches shift position in the H2P's global history.

This is evident on examination of just one of the correlated branches, Correlated Branch A (yellow). Fig. 1 shows that the distribution of its position in the H2P's histories spans roughly the most recent 25 positions. Increasing the global history subsequence length, an optimization made when scaling TAGE-SCL from 8KB to 64KB, does not directly address this variation, which is the root cause of this H2P. Positional variations are also exhibited by all of the other (orange) correlated branches in the fragment. As a result, we find that TAGE-SCL predicts this H2P with just 93.8% accuracy, while a CNN predicts it with 100% accuracy.
III A CNN Global History Model
To show how a CNN predicts H2P1, we first walk through the forward pass of the two-layer CNN in Listing 2 to produce a prediction, and then describe how it is trained. We initially use the full network representation, but in Section IV translate it into a mechanism that meets BPU constraints.
III-A Encoding History Data
[Fig. 2 (H2PEmbeddingTop.pdf, H2PEmbeddingBot.pdf): one-hot encoding of (IP, direction) tuples into history-matrix columns for H2P1]
Given a dynamic instance of an H2P, we convert its global history sequence of (IP, direction) tuples into an algebraic vector representation. IPs are discrete and take on a large number of possible values, so we use a hash function to index into a "one-hot" vector, which contains a one at the index position and zeros elsewhere. Setting the vector dimension to d = 2^(p+1), we map each tuple to an index by concatenating the observed-direction bit onto the p LSBs of the IP: idx = 2·(IP mod 2^p) + direction. This process is shown in Fig. 2 for H2P1. Four branches from Listing 1 are shown alongside their IP values, observed directions, and the indices used to generate one-hot vectors. We concatenate the column vectors to form a global history matrix X, which is input to the CNN.
Though one-hot history matrices appear costly in terms of storage, we choose this encoding because the matrices can be replaced on-BPU with direct-mapped table lookups (Section IV). Our experiments show that our CNNs perform well with as few as seven LSBs from each IP, making them agnostic to an application's base virtual address. To ensure history encodings behave consistently across executions, we set p = 7.
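The encoding above can be sketched in a few lines. This is an illustrative implementation, not the authors' code; note that IP 0x400587 from Listing 1 lands on indices 14 (not-taken) and 15 (taken), consistent with the filter weights discussed for Fig. 3.

```python
def onehot_index(ip, direction, p=7):
    """Hash an (IP, direction) tuple to a one-hot index by concatenating the
    direction bit onto the p least-significant bits of the IP (p = 7,
    matching the seven LSBs the text reports are sufficient)."""
    return ((ip & ((1 << p) - 1)) << 1) | int(direction)

def encode_history(history, p=7):
    """Build the d x h one-hot global-history matrix X, one column per tuple."""
    d = 1 << (p + 1)                        # vector dimension (256 for p = 7)
    X = [[0] * len(history) for _ in range(d)]
    for col, (ip, taken) in enumerate(history):
        X[onehot_index(ip, taken, p)][col] = 1
    return X
```

Because the hash uses only the IP's low bits, the encoding is identical across runs regardless of the application's base virtual address.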
III-B Layer 1: Convolutional Correlation
[Fig. 3 (TutorialLayer1Filters.pdf, Top: trained Layer 1 filter weights; TutorialLayer1Activations.pdf, Bottom: per-position Layer 1 filter responses)]
CNNs perform pattern matching using inner-product computations between a data vector x and a weight vector w, also called a filter, with an optional bias term b. A similar computation is used in perceptron predictors; however, our CNNs differ by performing the same filter matches at every history position, and by also matching on IPs.
y = w · x + b    (1)
To illustrate, we instantiate our CNN with two filters and plot their values in Fig. 3 (Top) after training on history matrices and observed directions for H2P1. We see that Filter 1 learns a large positive weight at index 14, aligning with correlated branch 0x400587 being not-taken, while Filter 2 exhibits a large weight at index 15 for the same IP being taken. Small weights adjust for branches that are biased in H2P1's history, though their magnitudes are negligible in comparison.
Evaluating Eq. 1 for each filter against each column of the history matrix produces inner-product scores at every position for history length 200. Fig. 3 (Bottom) shows the 200 scores computed from Filter 2. We call f() from a loop, so H2P1's global history also contains stale appearances of the correlated data-dependent branch, and each produces a large filter response.
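The per-position matching can be sketched as a width-1 convolution: Eq. 1 applied to every history column with every filter. A minimal sketch (plain Python lists, hypothetical weights):

```python
def conv_layer1(filters, biases, X):
    """Apply Eq. 1 at every history position:
    scores[k][j] = <w_k, x_j> + b_k, with filters k x d and X d x h."""
    h = len(X[0])
    scores = []
    for w, b in zip(filters, biases):
        scores.append([sum(w[i] * X[i][j] for i in range(len(w))) + b
                       for j in range(h)])
    return scores
```

Because each column of X is one-hot, each inner product reduces to selecting a single filter weight, a property exploited for the lookup table of Section IV.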
III-C Layer 2: Positional Prediction
[Fig. 4 (TutorialLayer2Filter.pdf): trained Layer 2 filter weights]
Scores computed in the convolutional layer above are passed to a perceptron-like linear layer, which contains a single filter that is matched against the output of Layer 1. The trained weights of Layer 2's filter are shown in Fig. 4. Near-zero weights damp positions beyond the most recent 30, filtering out stale appearances of IP 0x400587. Eq. 1 is applied once at Layer 2, using its filter and the Layer 1 scores as inputs.
III-D Stacking the Layers Together
The result of Layer 2's pattern-matching operation predicts "taken" if greater than zero and "not-taken" otherwise. The two layers of this CNN handle different aspects of predicting H2P1's direction: the first layer is position-agnostic and identifies which (IP, direction) tuples in a branch history correlate highly with the H2P's direction; the second layer identifies which positions in a branch history contribute most to the prediction. The combined filtering action of these stacked layers allows the CNN to latch precisely onto the predictive signal in H2P1's histories even as it shifts position; it is this mechanism, the result of stacking convolutional and linear layers, that gives our CNNs a pattern-matching advantage over PPM and perceptron predictors.
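The stacked computation can be sketched end to end. The weights below are hypothetical stand-ins for the learned filters; the structure follows the two-layer description above.

```python
def cnn_predict(W1, b1, w2, b2, X):
    """Two-layer CNN forward pass: position-agnostic Layer 1 filter matches,
    then one perceptron-like Layer 2 filter over all filter/position scores;
    predict "taken" iff the final score exceeds zero.
    W1: k x d Layer 1 filters; w2: flat k*h Layer 2 weights; X: d x h."""
    k, h = len(W1), len(X[0])
    # Layer 1: width-1 convolution (Eq. 1 at every position, for every filter)
    s1 = [sum(W1[f][i] * X[i][j] for i in range(len(W1[f]))) + b1[f]
          for f in range(k) for j in range(h)]
    # Layer 2: positional weighting of the Layer 1 scores (Eq. 1 applied once)
    return sum(w * s for w, s in zip(w2, s1)) + b2 > 0
```

With a Layer 1 filter keyed to the correlated tuple and Layer 2 weights that damp stale positions, the prediction tracks the correlated branch wherever it lands in recent history.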
III-E Offline Training
The training dataset for a CNN helper consists of history matrices of an H2P alongside its observed directions, which we collect using the Pin binary instrumentation tool [10]. We train networks using Chainer [11] and find that, for the CNN configurations used in Section V, 5,000 history-matrix/H2P-direction pairs sampled uniformly from runtime data are sufficient to converge in 40 epochs using the Adam optimizer [12].

IV On-BPU Inference with 2-Bit CNNs
To deploy our CNN helpers in a BPU, we train networks with 2-bit weights and show that they need only modest on-chip storage and bitwise-parallel logic at prediction time. CNNs provide strong pattern recognition even when their weights are constrained to values in {−1, 0, +1} [13, 14], allowing logical operations to replace arithmetic during inference. Following Courbariaux et al. [13], we impose low-precision constraints during training by clipping weights to [−1, 1], normalizing activations, and quantizing during forward CNN computations (Listing 3). We train the resulting ternary CNN helper for H2P1 on the same training data. Fig. 5 shows the ternary Layer 2 weights. Compared to the full-precision weights in Fig. 4, quantized weights lose accuracy encoding the magnitude of each position's contribution to predictions, but still detect correlated (IP, direction) tuples and damp stale data. This ternary CNN helper yields 98% accuracy for H2P1.
[Fig. 5 (TutorialLayer2TernaryFilter.pdf): ternary Layer 2 filter weights]
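The clip-and-quantize constraints described above can be sketched as follows; the threshold t = 0.5 is an assumed value (the text later notes t defines the quantization buckets and may even be learned).

```python
def clip_weight(w, lo=-1.0, hi=1.0):
    """Clip the full-precision shadow weight to [-1, 1] during training,
    following the Courbariaux-style low-precision scheme [13]."""
    return max(lo, min(hi, w))

def quantize_ternary(w, t=0.5):
    """Quantize a clipped weight into {-1, 0, +1} using the three bins
    (-inf, -t), [-t, t], (t, inf); t = 0.5 is an assumption here."""
    if w < -t:
        return -1
    if w > t:
        return 1
    return 0
```

During training, gradients update the full-precision shadow weights while the forward pass uses their quantized values.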
Not only does accuracy remain high, but ternary CNN inference for our network can also be made far more efficient than its full-precision counterpart, based on the following observations:
Multiplication by a one-hot vector yields a scalar. The inner product of a one-hot input vector and a filter is the filter value aligned to the input's sole nonzero entry. We therefore sidestep history matrices completely on-BPU by indexing (IP, direction) tuples into a table of filter values. Subsequent normalization and quantization steps can also be folded into this table, since after training they produce a fixed 2-bit value from each possible filter value. We precompute this half of the inference computation for any input by populating a lookup table as follows: for k filters of length d, denoted f_1, …, f_k; indices i ∈ {0, …, d−1}; learned parameters γ, β, μ, σ from a normalization layer that transforms data according to y′ = γ(y − μ)/σ + β; and quantization bins defined over the ranges (−∞, −t), [−t, t], (t, ∞), we populate a (d × k)-entry, 2-bit table T as:
T[i, k] = Q(γ_k (f_k[i] − μ_k) / σ_k + β_k),  where Q(z) = −1 if z < −t, 0 if −t ≤ z ≤ t, and +1 if z > t    (2)
The threshold t defines the quantization buckets for ternary CNN weights [13, 14]; we set t to a fixed value, but note that its value may instead be learned [15].
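Folding normalization and quantization into the table can be sketched as follows; the symbol names mirror the normalization form y′ = γ(y − μ)/σ + β above, and all values in the example are placeholders.

```python
def build_lookup_table(filters, gamma, beta, mu, sigma, t=0.5):
    """Precompute Eq. 2: for every filter f_k and one-hot index i, normalize
    the trained filter value and quantize it to a 2-bit ternary entry, so
    on-BPU inference needs only a direct-mapped table lookup per branch."""
    def q(z):  # ternary quantizer with bins split at +/- t
        return -1 if z < -t else (1 if z > t else 0)
    return [[q(gamma[k] * (w - mu[k]) / sigma[k] + beta[k]) for w in f]
            for k, f in enumerate(filters)]
```

The table is computed once, offline, after training; nothing about the original full-precision filters needs to be stored on the BPU.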
1-wide convolutions can be computed as soon as history tuples are available. When applying convolutions of width 1, filter responses for each history position are independent of their neighbors. This allows us to retrieve Layer 1 responses well before an H2P is encountered, as (IP, direction) tuples become available. Whenever a prior branch's direction is predicted, the corresponding Layer 1 responses are retrieved from T and pushed into a FIFO buffer. When an H2P is fetched and a CNN prediction is needed, the Layer 1 outputs are already available in the FIFO buffer.
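The streaming scheme can be sketched with a bounded deque standing in for the hardware FIFO; the class and method names are hypothetical.

```python
from collections import deque

class Layer1Fifo:
    """Push precomputed Layer 1 responses as each branch is predicted, so
    the responses are already buffered when the H2P is fetched."""
    def __init__(self, table, history_len, p=7):
        self.table, self.p = table, p
        self.fifo = deque(maxlen=history_len)   # oldest entries fall off

    def on_branch(self, ip, taken):
        """Look up the 2-bit Layer 1 responses for this (IP, direction)."""
        idx = ((ip & ((1 << self.p) - 1)) << 1) | int(taken)
        self.fifo.append([row[idx] for row in self.table])  # one per filter

    def on_mispredict(self, wrong_path_count):
        """Roll back by shifting wrong-path entries off the buffer."""
        for _ in range(wrong_path_count):
            self.fifo.pop()
```

Rollback simply discards the youngest entries, matching the text's observation that wrong-path entries are shifted off after a misprediction.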
Ternary inner products require only bitwise-parallel logic, popcount, and subtraction. At prediction time, we evaluate Layer 2 and its normalization layer. This entails an inner product between the FIFO buffer's contents and the ternary weights, scaling and shifting the resulting integer value by the learned normalization parameters, and comparing to 0 to give a "taken" or "not-taken" prediction. We implement the ternary inner product as:
y = popcount(v_h ∧ v_w ∧ ¬(s_h ⊕ s_w)) − popcount(v_h ∧ v_w ∧ (s_h ⊕ s_w))    (3)
where s_h and v_h are the sign and value bits of the FIFO buffer, respectively, and s_w and v_w contain those for the Layer 2 filter. We apply the inverse of normalization to 0 to solve for a threshold τ above which (assuming γ > 0) we predict taken:
τ = μ − σβ/γ    (4)
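Eqs. 3 and 4 can be sketched with integer bitmasks. The sign/value packing convention here (a set sign bit means negative) is an assumption consistent with the text.

```python
def ternary_dot(s_h, v_h, s_w, v_w):
    """Eq. 3: ternary inner product from sign/value bitmasks. A position
    contributes +1 when both operands are nonzero and their signs agree,
    -1 when they disagree; only AND/XOR, popcount, and one subtract."""
    both = v_h & v_w                # positions where both values are nonzero
    disagree = (s_h ^ s_w) & both   # nonzero positions with opposite signs
    agree = both & ~disagree
    return bin(agree).count("1") - bin(disagree).count("1")

def predict_taken(dot, gamma, beta, mu, sigma):
    """Eq. 4: fold the Layer 2 normalization into a precomputed threshold
    tau = mu - sigma*beta/gamma (assuming gamma > 0) and compare."""
    return dot > mu - sigma * beta / gamma
```

In hardware, the two popcounts run in parallel and only the subtract and final comparison are serial, which is what makes the latency analysis below favorable.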
On-BPU CNN inference thus consists of two steps, defined by Algorithms 1 and 2. The first is a table lookup to update the FIFO buffer of Layer 1 filter responses whenever any dynamic conditional branch is fetched. The second is a ternary inner product between the FIFO buffer and the Layer 2 filter when the H2P is fetched and a prediction is needed. Any time a branch is mispredicted, the CPU is rolled back to that instruction, and wrong-path entries are simply shifted off the FIFO buffer.
IV-A On-BPU Storage and Latency
To install a CNN helper in a BPU, we must store four components: (1) a (d × k)-entry, 2-bit table T to hold Layer 1 filter responses; (2) an (h × k)-entry, 2-bit FIFO buffer to hold convolution results; (3) an (h × k)-entry, 2-bit buffer to hold the Layer 2 weights; and (4) a small buffer to hold the precomputed integer threshold. Our network, with h = 200, d = 256, and k = 2 filters, requires 336 bytes per helper. For k = 32 filters, storage is 5.2KB.
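As a sanity check, the storage arithmetic can be reproduced directly; the size of the threshold register is an assumption.

```python
def helper_storage_bytes(h=200, p=7, k=2, threshold_bytes=8):
    """Total 2-bit storage for one CNN helper: a (d x k)-entry lookup table,
    an (h x k)-entry FIFO, and (h x k) Layer 2 weights, plus a small
    threshold register (threshold_bytes is an assumed size)."""
    d = 1 << (p + 1)                  # one-hot dimension (256 for p = 7)
    entries = d * k + h * k + h * k
    return (2 * entries) // 8 + threshold_bytes

print(helper_storage_bytes(k=2))      # 336 bytes for the two-filter network
print(helper_storage_bytes(k=32))     # ~5.2 KB for 32 filters
```

The storage scales linearly in the filter count k, which drives the trade-off between the 2-filter tutorial network and the 32-filter configuration evaluated in Section V.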
While a full layout of our CNN helper is beyond the scope of this report, we can compare the relative latency of ternary CNN inference and TAGE-SCL by analyzing the computation graphs of their prediction procedures (Table II). For example, in Algorithm 2, we can compute Lines 2–3 in parallel, the bitwise AND/XOR operations on Line 4 serially, the popcounts in parallel, and finally the subtract and comparison serially. Predictions from a 2-bit CNN helper thus require six serial computations. The bottleneck computation is popcount, which requires a 13- or 15-stage circuit depending on the number of filters [16]. In contrast, TAGE-SCL 8KB and 64KB require 34 and 32 serial computations, respectively (TAGE-SCL 8KB uses more complex hashing). Their bottleneck computations are back-to-back lookups to 4k- and 8k-entry tables, depending on the predictor. This comparison shows that a ternary CNN helper requires a similar number of computation steps to existing predictors.
V CNN Helper Gains & Reusability
Table I. Per-benchmark CNN helper results: fraction of H2Ps improved (% Winners) and mean misprediction reduction per improved H2P.

SPECint 2017      #Training  #H2Ps (All  FP-CNN vs TAGE 8KB    TP-CNN vs TAGE 8KB    FP-CNN beyond TAGE 64KB
Benchmark         Folds      Phases)     %Win   Mispred.Red.   %Win   Mispred.Red.   %Win   Mispred.Red.
600.perlbench_s    4          16          51%    63.2%          18%    26.6%           4%     8.2%
605.mcf_s          8          20          55%    44.8%          28%    27.9%          35%    19.3%
620.omnetpp_s      5          28          71%    33.6%          30%    16.3%          24%    11.2%
623.xalancbmk_s    4           8          39%    27.4%           0%     0.0%          23%    12.8%
625.x264_s        14           7          44%    16.8%          35%    12.0%          33%    12.2%
631.deepsjeng_s   12          49          56%    31.2%          24%    10.0%          12%    15.3%
641.leela_s       10          68          68%    40.7%          44%    15.3%          41%    19.7%
645.exchange2_s    5          19           9%    46.5%           4%     6.0%           0%     0.0%
657.xz_s           5          50          28%    25.2%          29%    15.4%          15%    12.3%
MEAN               7.3        29          47%    36.6%          24%    14.4%          21%    12.3%
We demonstrate CNN helpers on SPECint 2017 and assess reusability with the dataset of [3], which traces each benchmark over multiple inputs. For each benchmark, we screen for H2Ps using TAGE-SCL 8KB as the baseline predictor in the Championship Branch Prediction 2016 simulator [4], and train a CNN helper for any H2P appearing in 3 or more application inputs (i.e., workloads) to support cross-validation over workloads. We train on data from the entirety of a single workload and report performance averaged across all held-out workloads; this constitutes one fold, and we average over all possible folds to compute the expected gains in future executions, assuming we train on data from an arbitrary execution. Training on one workload and testing on the holdouts demonstrates the reusability of our CNNs. We evaluate full-precision CNN helpers (FP-CNN) as a limit study alongside ternary CNNs (TP-CNN). For both, we use a history length of 200, encode 7 bits of each IP plus 1 direction bit, and use 32 Layer 1 filters.
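The fold scheme above can be sketched as follows; the per-workload misprediction rates in the test are hypothetical, and `rates[train][test]` denotes the rate on workload `test` of a helper trained on workload `train`.

```python
def expected_reduction(baseline, rates):
    """Train on one workload, average misprediction reduction over the
    held-out workloads (one fold), then average across all folds to
    estimate the expected gain on a future, unseen execution."""
    workloads = sorted(baseline)
    fold_means = []
    for train in workloads:
        held_out = [w for w in workloads if w != train]
        reductions = [(baseline[w] - rates[train][w]) / baseline[w]
                      for w in held_out]
        fold_means.append(sum(reductions) / len(reductions))
    return sum(fold_means) / len(fold_means)
```

Scoring only on held-out workloads is what separates reusable gains from memorization of a single execution's history data.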
Table II. Prediction-generation complexity.

                          TAGE 8KB           TAGE 64KB          TP-CNN, 8 filters    TP-CNN, 32 filters
# Serial Computations     34                 32                 6                    6
# Serial Table Lookups    2                  2                  0                    0
Latency-Limiting          2x lookup,         2x lookup,         popcount             popcount
Computation               4k-entry table     8k-entry table     (13-stage circuit)   (15-stage circuit)
Table I breaks out the portion of CNN helper predictors that improved H2P accuracy (% Winners) by benchmark, alongside the average accuracy improvement per H2P (% reduction in mispredictions). On average, FP-CNN shows that pattern matching with tolerance for positional variations improves accuracy on 47% of H2Ps, with an average 36.6% reduction in mispredictions, reusably across workloads. When we use FP-CNN helpers and scale TAGE-SCL to 64KB, we still find additional gains: 21% of H2Ps improve by 12.3% on average, improving gains in mispredictions per kilo-instruction (MPKI) from 21.2% to 22.3% over the baseline. This shows one example where improved pattern matching provides a fundamental advantage over scaling existing algorithms.
TP-CNN helpers improve 24% of H2Ps by 14.4% on average, capturing roughly half the gain of FP-CNNs. Given that quantizing Layer 2 weights in TP-CNN (Fig. 5) tempers the positional precedence captured by the full-precision Layer 2 (Fig. 4), this comparison shows that arbitrating with potentially stale data is also an important contributor to prediction accuracy.
VI Directions for Future ML Helpers
This paper details how a two-layer CNN reduces systematic branch mispredictions caused by positional variations in global history data. We demonstrate a path to deployment that (1) meets on-BPU constraints for prediction generation, and (2) can amortize iterative batch training through reuse across application executions. Several natural future directions exist:

- CNNs provide an expressive pattern-matching framework and support rapid experimentation; exploring topologies, e.g. to learn predictive multi-IP subsequences or to extract patterns from arbitrarily long global histories using recurrence, can address different causes of misprediction;

- The gap between FP-CNN and TP-CNN shows the need for alternative on-BPU designs that, e.g., integrate dependent branch IPs identified by a CNN into lightweight predictors. In such a design, ML models act as an automated analysis tool rather than directly as an on-BPU predictor;

- Feeding additional data, such as register values, into ML models may boost prediction accuracy for data-dependent branches. In this manner, an ML model acts as an approximate value predictor, possibly exploiting idle multiply-accumulate cycles in the core.
These avenues and others will provide fruitful ground for machine learning in branch predictor development.
References
 [1] A Fog. The Microarchitecture of Intel, AMD, and VIA CPUs: An Optimization Guide for Assembly Programmers and Compiler Makers. Copenhagen University College of Engineering, 2018.
 [2] A Seznec. TAGE-SCL Branch Predictors Again. In Proc. 5th Championship on Branch Prediction, 2016.
 [3] CK Lin and SJ Tarsa. Branch Prediction is Not a Solved Problem: Measurements, Opportunities, and Future Directions. arXiv:1906.08170, 2019.
 [4] CBP5 Kit. In Proc. 5th Championship on Branch Prediction, 2016.
 [5] GS Ravi and MH Lipasti. CHARSTAR: Clock Hierarchy Aware Resource Scaling in Tiled Architectures. ACM SIGARCH, 2017.
 [6] SJ Tarsa, RBR Chowdhury, J Sebot, GN Chinya, J Gaur, K Sankaranarayanan, CK Lin, R Chappell, R Singhal, and H Wang. Practical PostSilicon CPU Adaptation Using Machine Learning. In ISCA, 2019.
 [7] J Cleary and I Witten. Data Compression Using Adaptive Coding and Partial String Matching. IEEE Trans Comms, 1984.
 [8] DA Jiménez. Multiperspective Perceptron Predictor. In Proc. 5th Championship on Branch Prediction, 2016.
 [9] S Song, Q Wu, S Flolid, J Dean, et al. Experiments with SPEC CPU 2017: Similarity, Balance, Phase Behavior and SimPoints. Technical report TR18051501, Dept. of ECE, UT-Austin, 2018.
 [10] CK Luk, R Cohn, R Muth, H Patil, A Klauser, G Lowney, S Wallace, VJ Reddi, and K Hazelwood. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. 2005.
 [11] S Tokui, K Oono, S Hido, and J Clayton. Chainer: A Next-Generation Open Source Framework for Deep Learning. In LearnSys, 2015.
 [12] D Kingma and J Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
 [13] M Courbariaux, I Hubara, D Soudry, R El-Yaniv, and Y Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. arXiv:1602.02830, 2016.
 [14] M Rastegari, V Ordonez, J Redmon, and A Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In ECCV, 2016.
 [15] C Zhu, S Han, H Mao, and WJ Dally. Trained Ternary Quantization. arXiv:1612.01064, 2016.
 [16] R Ramanarayanan, S Mathew, V Erraguntla, R Krishnamurthy, and S Gueron. A 2.1Ghz 6.5mW 64bit Unified Popcount/Bitscan Datapath Unit for 65nm HighPerformance Microprocessor Execution Cores. In VLSID, 2008.
V CNN Helper Gains & Reusability
! SPECint2017 Benchmark # Training Folds # H2Ps (All Phases) FPCNN with TAGE 8KB Baseline TPCNN with TAGE 8KB Baseline FPCNN, Gains Beyond TAGE 64KB % Winners Mispred. Red. per H2P % Winners Mispred. Red. per H2P % Winners Mispred. Red. per H2P 600.perlbench_s 4 16 51% 63.2% 18% 26.6% 4% 8.2% 605.mcf_s 8 20 55% 44.8% 28% 27.9% 35% 19.3% 620.omnetpp_s 5 28 71% 33.6% 30% 16.3% 24% 11.2% 623.xalancbmk_s 4 8 39% 27.4% 0% 0.0% 23% 12.8% 625.x264_s 14 7 44% 16.8% 35% 12.0% 33% 12.2% 631.deepsjeng_s 12 49 56% 31.2% 24% 10.0% 12% 15.3% 641.leela_s 10 68 68% 40.7% 44% 15.3% 41% 19.7% 645.exchange2_s 5 19 9% 46.5% 4% 6.0% 0% 0.0% 657.xz_s 5 50 28% 25.2% 29% 15.4% 15% 12.3% MEAN 7.3 29 47% 36.6% 24% 14.4% 21% 12.3%
We demonstrate CNN helpers on SPECint 2017 and assess reusability with the dataset of [3], which traces each benchmark over multiple inputs. For each benchmark, we screen for H2Ps using TAGESCL 8KB as the baseline predictor in the Championship Branch Prediction 2016 simulator [4], and train a CNN helper for any H2P appearing in 3 or more application inputs (i.e. workloads) to support fold crossvalidation. We train on data from the entirety of a single workload and report performance averaged across all heldout workloads; this constitutes one fold, and we average all possible folds to compute the expected gains in future executions, assuming we train on data from an arbitrary execution. Training on one workload and testing on the holdouts demonstrates the reusability of our CNNs. We evaluate fullprecision CNN Helpers (FPCNN) as a limit study and ternary CNNs (TPCNN). For both, we use a history length of 200, encode 7 bits of each IP and 1 direction bit, and 32 Layer 1 filters.
0.47! Prediction Generation Complexity TAGE 8 KB TAGE 64 KB TPCNN 8 filter TPCNN 32 filter # Serial Computations 34 32 6 6 # Serial Tbl. Lkups. 2 2 0 0 Latency Limiting Computation 2 lookup, 4kentry table 2 lookup, 8kentry table popcount (13 stage circuit) popcount (15 stage circuit)
Table I breaks out the portion of CNN helper predictors that improved H2P accuracy (% Winners) by benchmark, alongside the accuracy improvement per H2P (% Reduction in Mispredictions). On average, the FPCNN shows that pattern matching with tolerance for positional variations improves accuracy on 47% of H2Ps by an average 36.6% reduction in mispredictions, reusably across workloads. When we use FPCNN helpers and scale TAGESCL to 64KB, we still find additional gains—21% of H2Ps improve by 12.3% on average, improving gains in mispredictionsperkiloinstruction (MPKI) from 21.2% to 22.3% over the baseline. This shows one example when improved pattern matching provides a fundamental advantage over scaling existing algorithms.
TPCNN helpers improve 24% of H2Ps by 14.4% on average, capturing roughly half the gain of FPCNNs. Given that quantizing Layer 2 weights in TPCNN (Fig. 5) tempers the positional precedence captured by a fullprecision Layer 2 (Fig. 4), this comparison shows that arbitrating with potentially stale data is also an important contributor to prediction accuracy.
Vi Directions for Future ML Helpers
This paper details how a twolayer CNN reduces systematic branch mispredictions caused by positionalvariations in global history data. We demonstrate a path to deployment that (1) meets onBPU constraints for prediction generation, and (2) can amortize iterative batch training through reuse across application executions. Several natural future directions exist:

CNNs provide an expressive pattern matching framework and support rapid experimentation; exploring topologies, e.g. to learn predictive multiIP subsequences or extract patterns from arbitrarily long global histories using recurrence can address different causes of misprediction;

The gap between FPCNN and TPCNN shows the need for alternative onBPU designs, that, e.g., integrate dependent branch IPs identified by a CNN into lightweight predictors. In such a design, ML models act as an automated analysis tool, rather than an onBPU predictor directly;

Feeding additional data such as register values into ML models may boost prediction accuracy for datadependent branches. In this manner, an ML model acts as a approximate value predictor, possibly exploiting idle multiplyaccumulate cycles in the core.
These avenues and others will provide fruitful ground for machine learning in branch predictor development.
References
 [1] A Fog. The microarchitecture of intel, amd, and via cpus. An Optimization Guide for Assembly Programmers and Compiler Makers. Copenhagen University College of Engineering, 2018.
 [2] A Seznec. TAGESCL Branch Predictors Again. In Proc. 5th Championship on Branch Prediction, 2016.
 [3] CK Lin and SJ Tarsa. Branch Prediction is Not a Solved Problem: Measurements, Opportunities, and Future Directions. arXiv:1906.08170, 2019.
 [4] CBP5 Kit. In Proc. 5th Championship on Branch Prediction, 2016.
 [5] GS Ravi and MH Lipasti. CHARSTAR: Clock Hierarchy Aware Resource Scaling in Tiled Architectures. ACM SIGARCH, 2017.
 [6] SJ Tarsa, RBR Chowdhury, J Sebot, GN Chinya, J Gaur, K Sankaranarayanan, CK Lin, R Chappell, R Singhal, and H Wang. Practical PostSilicon CPU Adaptation Using Machine Learning. In ISCA, 2019.
 [7] J Cleary and I Witten. Data Compression Using Adaptive Coding and Partial String Matching. IEEE Trans Comms, 1984.
 [8] DA Jiménez. Multiperspective Perceptron Predictor. In Proc. 5th Championship on Branch Prediction, 2016.
 [9] S Song, Q Wu, S Flolid, and J et al Dean. Experiments with SPEC CPU 2017: Similarity, Balance, Phase Behavior and Simpoints. Technical report, TR18051501, Dept. of ECE, UTAustin, 2018.
 [10] CK Luk, R Cohn, R Muth, H Patil, A Klauser, G Lowney, S Wallace, VJ Reddi, and K Hazelwood. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. 2005.

[11]
S Tokui, K Oono, S Hido, and J Clayton.
Chainer: A NextGeneration Open Source Framework for Deep Learning.
In LearnSys, 2015.  [12] D Kingma and J Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
 [13] M Courbariaux, I Hubara, D Soudry, R ElYaniv, and Y Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or 1. arXiv:1602.02830, 2016.
 [14] M Rastegari, V Ordonez, J Redmon, and A Farhadi. XNORNet: Imagenet Classification Using Binary Convolutional Neural Networks. In ECCV, 2016.
 [15] C Zhu, S Han, H Mao, and WJ Dally. Trained Ternary Quantization. arXiv:1612.01064, 2016.
 [16] R Ramanarayanan, S Mathew, V Erraguntla, R Krishnamurthy, and S Gueron. A 2.1Ghz 6.5mW 64bit Unified Popcount/Bitscan Datapath Unit for 65nm HighPerformance Microprocessor Execution Cores. In VLSID, 2008.
IV On-BPU Inference with 2-Bit CNNs
To deploy our CNN helper in a BPU, we train networks with 2-bit weights and show that they need only modest on-chip storage and bitwise-parallel logic at prediction time. CNNs provide strong pattern recognition even when their weights are constrained to values in $\{-1, 0, +1\}$ [13, 14], allowing logical operations to replace arithmetic during inference. Following Courbariaux et al. [13], we impose low-precision constraints during training by clipping weights to $[-1, 1]$, normalizing activations, and quantizing during forward CNN computations (Listing 3). We train the resulting ternary CNN helper for H2P1 on the same training data. Fig. 5 shows ternary Layer 2 weights. Compared to the full-precision weights in Fig. 4, quantized weights lose accuracy encoding the magnitude of each position's contribution to predictions, but still detect correlated ⟨IP, direction⟩ tuples and damp stale data. This ternary CNN helper yields 98% accuracy for H2P1.
[Fig. 5: ternary Layer 2 filter weights.]
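As an illustration (not the paper's training code), the low-precision training constraints described above can be sketched in Python; the threshold value 0.5 and all function names here are assumptions:

```python
def ternarize(w, t=0.5):
    """Quantize a full-precision weight to {-1, 0, +1} using threshold t."""
    if w > t:
        return 1
    if w < -t:
        return -1
    return 0

def clip(w, lo=-1.0, hi=1.0):
    """Clip the full-precision shadow weight kept for gradient updates."""
    return max(lo, min(hi, w))

# Forward passes use ternarized weights; gradients update the clipped
# full-precision shadow copy (a straight-through-estimator scheme).
weights = [-0.9, -0.2, 0.1, 0.7]
print([ternarize(w) for w in weights])   # -> [-1, 0, 0, 1]
```

During training the full-precision copy is retained so small gradient steps accumulate; only the quantized view is used at inference.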
Not only does accuracy remain high, but ternary CNN inference for our network can also be made far more efficient than its full-precision counterpart, based on the following observations:
Multiplication by a 1-hot vector yields a scalar. The inner product of a 1-hot input vector and a filter is the filter value that aligns to the input's sole nonzero entry. We therefore sidestep history matrices completely on-BPU by indexing ⟨IP, direction⟩ tuples into a table of filter values. Subsequent normalization and quantization steps can also be folded into this table, since, after training, they produce a 2-bit value from each possible filter value. We precompute this half of the inference computation for any input by populating a lookup table as follows: for filters $f$ of length $d$; indices $i \in \{0, \ldots, d-1\}$; learned parameters $\gamma, \beta$ from a normalization layer that transforms data according to $y = \gamma x + \beta$; and quantization bins defined over the ranges $(-\infty, -t)$, $[-t, t]$, $(t, \infty)$, we populate a $d \times 2$-bit table $T$ as:

$$T[i] = Q(\gamma f[i] + \beta), \qquad Q(y) = \begin{cases} +1 & y > t \\ 0 & -t \le y \le t \\ -1 & y < -t \end{cases} \qquad (2)$$

Here $t$ defines the quantization buckets for ternary CNN weights [13, 14]; we fix $t$ at training time, but note that its value may also be learned [15].
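A software sketch of this table construction, with illustrative names and a toy key space, might look like the following; folding normalization and quantization into the table is the point, while the specific key layout is an assumption:

```python
def ternarize(y, t=0.5):
    """Quantize a normalized response to {-1, 0, +1}."""
    return (y > t) - (y < -t)

def build_table(filter_vals, gamma, beta, t=0.5):
    """Fold normalization (y = gamma*x + beta) and quantization into a
    per-key table of 2-bit Layer 1 responses. filter_vals holds one
    learned filter value per <IP bits, direction> key (2**8 keys for a
    7-bit IP hash plus a direction bit in the full design)."""
    return [ternarize(gamma * x + beta, t) for x in filter_vals]

# Toy 4-entry key space:
table = build_table([1.0, -1.0, 0.2, 0.0], gamma=1.0, beta=0.0)
print(table)   # -> [1, -1, 0, 0]
```

At prediction time, a single indexed read replaces the 1-hot multiply, normalize, and quantize chain.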
1-wide convolutions can be computed as soon as history tuples are available. When applying convolutions of width 1, filter responses for each history position are independent of their neighbors. This allows us to retrieve the Layer 1 responses well before an H2P is encountered, as ⟨IP, direction⟩ tuples become available. Whenever a prior branch's direction is predicted, the corresponding Layer 1 responses are retrieved from the lookup table and pushed into a FIFO buffer. When an H2P is fetched and a CNN prediction is needed, Layer 1 outputs are already available in the FIFO buffer.
Ternary inner products require only bitwise-parallel logic, popcount, and subtraction. At prediction time, we evaluate Layer 2 and its normalization layer. This entails an inner product between the FIFO buffer's contents and ternary weights, scaling and shifting the resulting integer value by learned normalization parameters, and comparing to 0 to give a "taken" or "not-taken" prediction. We implement the ternary inner product as:

$$\mathrm{dot} = \mathrm{popcount}\big(v_h \wedge v_f \wedge \lnot(s_h \oplus s_f)\big) - \mathrm{popcount}\big(v_h \wedge v_f \wedge (s_h \oplus s_f)\big) \qquad (3)$$

where $s_h$ and $v_h$ are the sign and value bits of the FIFO buffer, respectively, and $s_f$ and $v_f$ contain those for the Layer 2 filter. We apply the inverse of normalization to 0 to solve for a threshold $\tau$, above which we predict taken:

$$\tau = -\beta / \gamma \qquad (4)$$

where $\gamma$ and $\beta$ are the Layer 2 normalization parameters; the branch is predicted taken whenever $\mathrm{dot} > \tau$.
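The bit-packed ternary inner product can be sketched as follows; the encoding (value bits flag nonzero entries, sign bits flag negative ones) follows the text, while the packing order is an assumption:

```python
def popcount(x):
    """Count set bits of a non-negative integer."""
    return bin(x).count("1")

def ternary_dot(s_h, v_h, s_f, v_f):
    """Inner product of two ternary vectors packed as bitmasks.

    A product term is nonzero only where both operands are nonzero,
    and it is negative exactly where the operand signs differ (XOR).
    """
    nonzero = v_h & v_f
    disagree = s_h ^ s_f
    return popcount(nonzero & ~disagree) - popcount(nonzero & disagree)

# h = [+1, -1, 0, +1], f = [+1, +1, -1, -1]  (bit i = history position i)
s_h, v_h = 0b0010, 0b1011
s_f, v_f = 0b1100, 0b1111
print(ternary_dot(s_h, v_h, s_f, v_f))   # -> -1  (= +1 - 1 + 0 - 1)
```

In hardware, the ANDs and XOR are bitwise-parallel, the two popcounts run in parallel, and one subtraction completes the product.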
On-BPU CNN inference thus consists of two steps, defined by Algorithms 1 and 2. The first is a table lookup to update the FIFO buffer of Layer 1 filter responses whenever any dynamic conditional branch is fetched. The second is a ternary inner product between the FIFO buffer and the Layer 2 filter when the H2P is fetched and a prediction is needed. Any time a branch is mispredicted, the CPU is rolled back to that instruction, and wrong-path entries are simply shifted off the FIFO buffer.
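A minimal single-filter sketch of this two-step procedure follows; the class and parameter names are illustrative, and the Layer 2 filter is simplified to all-ones so the inner product reduces to a sum:

```python
from collections import deque

class CNNHelperSketch:
    """Sketch of on-BPU inference: per-branch table lookups feed a FIFO,
    and the H2P prediction is a thresholded sum over that FIFO."""

    def __init__(self, table, gamma, beta, history_len=200, ip_bits=7):
        self.table = table            # precomputed Layer 1 responses
        self.ip_bits = ip_bits
        self.tau = -beta / gamma      # inverse of y = gamma*x + beta at
                                      # y = 0 (assumes gamma > 0)
        self.fifo = deque(maxlen=history_len)

    def on_branch(self, ip, taken):
        """Step 1: on every predicted conditional branch, look up the
        Layer 1 response for the <IP, direction> key and push it."""
        key = ((ip & ((1 << self.ip_bits) - 1)) << 1) | int(taken)
        self.fifo.append(self.table[key])

    def predict(self):
        """Step 2: at the H2P, reduce the FIFO against the (here
        all-ones) Layer 2 filter and compare with the threshold."""
        return sum(self.fifo) > self.tau
```

The `deque` with `maxlen` models the fixed-depth history: old entries fall off as new branches are predicted, and a rollback simply shifts wrong-path entries off the tail.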
IV-A On-BPU Storage and Latency
To install a CNN helper in a BPU, we must store four components: (1) a table of 2-bit Layer 1 filter responses; (2) a FIFO buffer of 2-bit convolution results; (3) a buffer of 2-bit Layer 2 weights; (4) a buffer to hold the precomputed integer threshold. Our base network requires 336 bytes per helper; with 32 Layer 1 filters, storage is 5.2KB.
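Under the parameters reported in Section V (a 7-bit IP hash plus one direction bit giving 256 table keys, history length 200, 2-bit entries), the 32-filter footprint can be reproduced approximately; the component breakdown below is our assumption and ignores the small threshold buffer:

```python
ip_bits, dir_bits = 7, 1            # per-branch encoding (Sec. V)
keys = 2 ** (ip_bits + dir_bits)    # 256 table entries per filter
history = 200                       # global history length
filters = 32                        # Layer 1 filters
bits = 2                            # ternary entries and weights

table_b  = keys * bits * filters // 8     # Layer 1 response table
fifo_b   = history * bits * filters // 8  # FIFO of Layer 1 responses
weight_b = history * filters * bits // 8  # Layer 2 weights
total_b  = table_b + fifo_b + weight_b    # ~5.2KB, matching the text
print(table_b, fifo_b, weight_b, total_b)   # -> 2048 1600 1600 5248
```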
While a full layout of our CNN helper is beyond the scope of this report, we can compare the relative latency of ternary CNN inference and TAGE-SC-L by analyzing the computation graphs of their prediction procedures (Table II). For example, in Algorithm 2, we can compute Lines 2–3 in parallel, compute (s_bits & v_bits) on Line 4 and the remaining bitwise operations serially, the popcounts in parallel, and finally the subtract and comparison serially. Predictions from a 2-bit CNN helper thus require six serial computations. The bottleneck computation is popcount, which requires a 13- or 15-stage circuit depending on filter count [16]. In contrast, TAGE-SC-L 8KB and 64KB require 34 and 32 serial computations, respectively (TAGE-SC-L 8KB uses more complex hashing). Their bottleneck computations are back-to-back lookups into 4k- and 8k-entry tables, respectively. This comparison shows that a ternary CNN helper requires a similar number of computation steps to existing predictors.
V CNN Helper Gains & Reusability
Table I: Per-benchmark CNN helper gains (% Win = portion of H2Ps improved; Red. = mean misprediction reduction per improved H2P).

| SPECint 2017 Benchmark | # Training Folds | # H2Ps (All Phases) | FP-CNN vs. TAGE 8KB (% Win / Red.) | TP-CNN vs. TAGE 8KB (% Win / Red.) | FP-CNN beyond TAGE 64KB (% Win / Red.) |
| 600.perlbench_s | 4 | 16 | 51% / 63.2% | 18% / 26.6% | 4% / 8.2% |
| 605.mcf_s | 8 | 20 | 55% / 44.8% | 28% / 27.9% | 35% / 19.3% |
| 620.omnetpp_s | 5 | 28 | 71% / 33.6% | 30% / 16.3% | 24% / 11.2% |
| 623.xalancbmk_s | 4 | 8 | 39% / 27.4% | 0% / 0.0% | 23% / 12.8% |
| 625.x264_s | 14 | 7 | 44% / 16.8% | 35% / 12.0% | 33% / 12.2% |
| 631.deepsjeng_s | 12 | 49 | 56% / 31.2% | 24% / 10.0% | 12% / 15.3% |
| 641.leela_s | 10 | 68 | 68% / 40.7% | 44% / 15.3% | 41% / 19.7% |
| 645.exchange2_s | 5 | 19 | 9% / 46.5% | 4% / 6.0% | 0% / 0.0% |
| 657.xz_s | 5 | 50 | 28% / 25.2% | 29% / 15.4% | 15% / 12.3% |
| MEAN | 7.3 | 29 | 47% / 36.6% | 24% / 14.4% | 21% / 12.3% |
We demonstrate CNN helpers on SPECint 2017 and assess reusability with the dataset of [3], which traces each benchmark over multiple inputs. For each benchmark, we screen for H2Ps using TAGE-SC-L 8KB as the baseline predictor in the Championship Branch Prediction 2016 simulator [4], and train a CNN helper for any H2P appearing in 3 or more application inputs (i.e., workloads) to support cross-validation. We train on data from the entirety of a single workload and report performance averaged across all held-out workloads; this constitutes one fold, and we average over all possible folds to compute the expected gains in future executions, assuming we train on data from an arbitrary execution. Training on one workload and testing on the holdouts demonstrates the reusability of our CNNs. We evaluate full-precision CNN helpers (FP-CNN) as a limit study, and ternary CNNs (TP-CNN). For both, we use a history length of 200, encode 7 bits of each IP and 1 direction bit, and use 32 Layer 1 filters.
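The fold construction described above amounts to the following averaging; the data layout and workload names are hypothetical:

```python
def expected_gain(reduction):
    """Average held-out misprediction reduction over all folds.

    reduction[train_wl][test_wl] gives the % reduction achieved when a
    helper trained on train_wl predicts held-out workload test_wl.
    Each fold averages its held-out workloads; we then average folds.
    """
    fold_means = []
    for train_wl, results in reduction.items():
        held_out = [r for wl, r in results.items() if wl != train_wl]
        fold_means.append(sum(held_out) / len(held_out))
    return sum(fold_means) / len(fold_means)

# Toy example with hypothetical workloads:
toy = {
    "ref1": {"ref2": 20.0, "ref3": 40.0},
    "ref2": {"ref1": 10.0, "ref3": 30.0},
}
print(expected_gain(toy))   # -> 25.0
```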
Table II: Prediction generation complexity.

| | TAGE 8KB | TAGE 64KB | TP-CNN 8-filter | TP-CNN 32-filter |
| # Serial Computations | 34 | 32 | 6 | 6 |
| # Serial Table Lookups | 2 | 2 | 0 | 0 |
| Latency-Limiting Computation | 2nd lookup, 4k-entry table | 2nd lookup, 8k-entry table | popcount (13-stage circuit) | popcount (15-stage circuit) |
Table I breaks out the portion of CNN helper predictors that improved H2P accuracy (% Winners) by benchmark, alongside the accuracy improvement per H2P (% Reduction in Mispredictions). On average, the FP-CNN shows that pattern matching with tolerance for positional variation improves accuracy on 47% of H2Ps, with an average 36.6% reduction in mispredictions, reusably across workloads. When we use FP-CNN helpers and scale TAGE-SC-L to 64KB, we still find additional gains: 21% of H2Ps improve by 12.3% on average, improving the reduction in mispredictions-per-kilo-instruction (MPKI) over the baseline from 21.2% to 22.3%. This shows one example where improved pattern matching provides a fundamental advantage over scaling existing algorithms.
TP-CNN helpers improve 24% of H2Ps by 14.4% on average, capturing roughly half the gain of FP-CNNs. Given that quantizing Layer 2 weights in the TP-CNN (Fig. 5) tempers the positional precedence captured by a full-precision Layer 2 (Fig. 4), this comparison shows that the handling of potentially stale data is also an important contributor to prediction accuracy.
VI Directions for Future ML Helpers
This paper details how a two-layer CNN reduces systematic branch mispredictions caused by positional variation in global history data. We demonstrate a path to deployment that (1) meets on-BPU constraints for prediction generation, and (2) can amortize iterative batch training through reuse across application executions. Several natural future directions exist:

- CNNs provide an expressive pattern-matching framework and support rapid experimentation; exploring topologies, e.g., to learn predictive multi-IP subsequences or to extract patterns from arbitrarily long global histories using recurrence, can address different causes of misprediction.

- The gap between FP-CNN and TP-CNN shows the need for alternative on-BPU designs that, e.g., integrate dependent branch IPs identified by a CNN into lightweight predictors. In such a design, ML models act as an automated analysis tool rather than as an on-BPU predictor directly.

- Feeding additional data, such as register values, into ML models may boost prediction accuracy for data-dependent branches. In this manner, an ML model acts as an approximate value predictor, possibly exploiting idle multiply-accumulate cycles in the core.
These avenues and others will provide fruitful ground for machine learning in branch predictor development.
References
 [1] A Fog. The Microarchitecture of Intel, AMD, and VIA CPUs: An Optimization Guide for Assembly Programmers and Compiler Makers. Copenhagen University College of Engineering, 2018.
 [2] A Seznec. TAGE-SC-L Branch Predictors Again. In Proc. 5th Championship on Branch Prediction, 2016.
 [3] CK Lin and SJ Tarsa. Branch Prediction is Not a Solved Problem: Measurements, Opportunities, and Future Directions. arXiv:1906.08170, 2019.
 [4] CBP5 Kit. In Proc. 5th Championship on Branch Prediction, 2016.
 [5] GS Ravi and MH Lipasti. CHARSTAR: Clock Hierarchy Aware Resource Scaling in Tiled Architectures. ACM SIGARCH, 2017.
 [6] SJ Tarsa, RBR Chowdhury, J Sebot, GN Chinya, J Gaur, K Sankaranarayanan, CK Lin, R Chappell, R Singhal, and H Wang. Practical PostSilicon CPU Adaptation Using Machine Learning. In ISCA, 2019.
 [7] J Cleary and I Witten. Data Compression Using Adaptive Coding and Partial String Matching. IEEE Trans Comms, 1984.
 [8] DA Jiménez. Multiperspective Perceptron Predictor. In Proc. 5th Championship on Branch Prediction, 2016.
 [9] S Song, Q Wu, S Flolid, J Dean, et al. Experiments with SPEC CPU 2017: Similarity, Balance, Phase Behavior and Simpoints. Technical Report TR-180515-01, Dept. of ECE, UT Austin, 2018.
 [10] CK Luk, R Cohn, R Muth, H Patil, A Klauser, G Lowney, S Wallace, VJ Reddi, and K Hazelwood. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In PLDI, 2005.
 [11] S Tokui, K Oono, S Hido, and J Clayton. Chainer: A Next-Generation Open Source Framework for Deep Learning. In LearnSys, 2015.
 [12] D Kingma and J Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
 [13] M Courbariaux, I Hubara, D Soudry, R El-Yaniv, and Y Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. arXiv:1602.02830, 2016.
 [14] M Rastegari, V Ordonez, J Redmon, and A Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In ECCV, 2016.
 [15] C Zhu, S Han, H Mao, and WJ Dally. Trained Ternary Quantization. arXiv:1612.01064, 2016.
 [16] R Ramanarayanan, S Mathew, V Erraguntla, R Krishnamurthy, and S Gueron. A 2.1GHz 6.5mW 64-bit Unified Popcount/Bitscan Datapath Unit for 65nm High-Performance Microprocessor Execution Cores. In VLSID, 2008.