1 Introduction & Motivation
Superscalar pipelined processors rely profoundly on speculative and out-of-order (OoO) execution to keep their pipelines busy, hiding latency with instruction- and memory-level parallelism (ILP and MLP). Branch prediction (BP) is the key mechanism that drives speculative execution by steering the front end of the pipeline, predicting at each step the instructions that follow in the stream. Working on the correct code path is paramount to performance, since recovering from a branch misprediction and refilling the processor's instruction reservoirs (ROB, instruction/issue queue, etc.) after a pipeline restart can cost dozens of cycles [z15_IBM]. Notably, as the misprediction penalty grows with increasing pipeline depth and width [Sprangle2002, michaud2001, karkhanis2004], modern processors have come to rely on remarkably accurate branch predictors.
Today, virtually all branch predictors in high-end processors are highly engineered variants of TAGE [seznec2006] and Perceptron [perceptron2001]. Over the past 15 years, branch prediction research has focused primarily on these two designs, resulting in highly sophisticated branch predictors that achieve remarkably high overall prediction accuracy [seznec2016, jimenez2016]. However, recent studies demonstrate that the remaining accuracy gap substantially limits the performance scalability of future processors [bp:not_solved:2019, bp:autopredication2020]. As further enhancement of BP methods becomes increasingly challenging, performance improvements have stagnated in recent years. Lately, the focus has concentrated on improving the prediction accuracy of a specific class of branches that are regularly mispredicted. In this work, we argue that there are still untapped opportunities that can unveil key improvements in branch prediction, not by strictly focusing on what is classified as "hard-to-predict" (H2P), but by exploring vital control-flow properties that have been systematically overlooked.

More specifically, despite extensive differences in their prediction mechanisms, state-of-the-art branch predictors learn correlations between branches through large hardware tables and long histories of previous branch outcomes (taken/not-taken), with efficacy strongly associated with the amount of available storage. An essential observation from both TAGE- and Perceptron-based predictors is that predictive signatures in branch history occur with varying lengths. That is, long histories may expose important branch patterns that shorter ones fail to, and vice versa. As such, modern BP designs are built with a set of input features based on branch histories of different lengths (in number of bits) [isca:exynos, z15_IBM].
Ultimately, longer histories can elicit correlations with the more distant past, but in the general case, the correlation of a branch with a long history is rather sparse, i.e., only a few selective parts of the history are eventually informative of the branch outcome [evers:phdthesis, evers:branch_sparsity]. As an example, in Fig. 1 we demonstrate the sparsity of branch correlations in the two SPECINT2017 applications with the highest branch misprediction rates, xz and leela [limaye2018workload]. To characterize sparsity, we use linear regression analysis methods. In particular, we perform Lasso logistic regression [book:stat_learning_lasso] over all the static branches in our dataset (a static instruction, branch or not, is uniquely identified by its PC (program counter) address and might be executed dynamically more than once in a program execution, e.g., in a loop), and we build sparse linear functions (models) that predict branches based on the branch-outcome history. We choose linear models as they lend themselves well to a pragmatic hardware implementation. We only consider sparse models of over 99% accuracy that relate to branches not highly biased towards a specific direction. Rows in Fig. 1 correspond to a few such branches per application, while each mark indicates the existence of a correlation between their outcome and the corresponding history location. As illustrated, the screened branches are correlated with at most 20 sparse history locations, and in some cases, only with a single one. Nonetheless, modern history-based branch predictors either do not employ methods for discarding the non-informative (noisy) parts of the branch history during prediction (TAGE-based), or they attempt to identify them online (i.e., at runtime) with weight balancing (Perceptron-based). This fundamental design decision leads to a catastrophic explosion of the number of table entries required for tracking all observable history patterns, since that number grows exponentially with the history length [fern2000dynamic].
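To make this analysis concrete, the following is a minimal sketch of Lasso logistic regression over branch histories, on synthetic data where the branch outcome simply copies a single earlier direction. The dataset, the proximal-gradient (ISTA) solver, and all parameter values are illustrative, not the exact setup used in our study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic trace: 2000 samples of a 64-bit history (+1 taken / -1 not-taken).
# The modeled branch copies the direction at history position 3;
# every other position is uncorrelated noise.
H = rng.choice([-1.0, 1.0], size=(2000, 64))
y = (H[:, 3] > 0).astype(float)                     # target in {0, 1}

def lasso_logistic(H, y, lam=0.02, lr=0.1, steps=1000):
    """L1-regularized logistic regression via proximal gradient (ISTA)."""
    n, d = H.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(H @ w + b)))      # sigmoid
        g = H.T @ (p - y) / n                       # logistic-loss gradient
        w -= lr * g
        b -= lr * float(np.mean(p - y))
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w, b

w, b = lasso_logistic(H, y)
nnz = np.flatnonzero(np.abs(w) > 1e-3)              # surviving history positions
acc = float(np.mean(((H @ w + b) > 0) == (y > 0.5)))
print(list(nnz), acc)                               # dominant weight at index 3
```

The L1 soft-thresholding step zeroes the weights of the noise positions, so the recovered model is sparse even though 64 history bits were offered as features.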
As branch predictors are traditionally expected to be trained online, designers resort to various heuristics to work around this issue. Yet, the focus has mainly been on achieving high accuracy per bit of storage capacity rather than on rigorously revising the key prediction principles [amdzen2].

Our work aims to fill this gap by attending to only the few important bits in the branch history vectors, while discarding the non-informative ones. To properly filter the branch history at this granularity, it is necessary to adopt more detailed training algorithms through additional compiler support, otherwise dubbed "offline training". Fortunately, in an era where data center/cloud applications surge, the otherwise overwhelming cost of offline training can be amortized through economies of scale. In this work, we employ an "offline-training/online-inference" paradigm, as detailed by Lin & Tarsa [bp:not_solved:2019], for introducing sparsity-aware branch prediction. Specifically, our contributions are as follows:
Effective detection of sparsely correlated branches. We demonstrate that in practical workloads there exist branches whose direction (i.e., taken/not-taken) can be modeled with a sparse linear mapping from their history. In particular, we employ an offline methodology for determining sparse linear model parameters per branch, which we call "sparsity hints".

Novel sparsity-aware branch prediction design. We show that sparsity hints improve prediction accuracy using compact storage. We present a detailed microarchitectural design of a branch predictor, dubbed the Sparse Linear Branch Inference Unit (SLBIU), that uses sparsity hints for a few selected branches, operating as an auxiliary component alongside a primary branch predictor. Overall, our scheme improves prediction accuracy on the CBP5 [cbp2016] and SPECINT2017 [spec2017] benchmark suites compared to TAGE-SCL 8KB, at a storage overhead of a few KB. Moreover, our design can operate within the latency acceptable to the CPU front end for branch prediction, while remaining under reasonable area and power limitations.
Table 1.

| Trace / branch | Oracle: History bits | Oracle: Misses | TAGE-SCL Misses (8KB) | TAGE-SCL Misses (64KB) | TAGE-SCL 8KB: Entries (avg) / Allocations | TAGE-SCL 64KB: Entries (avg) / Allocations |
|---|---|---|---|---|---|---|
| LONGMOBILE1 / 548221168352 | 1 | 6,118 | 657,293 | 12,734 | 443.5 / 549K | 1,131.9 / 47K |
| SHORTMOBILE16 / 1566871128 | 7 | 3,697 | 194,634 | 10,181 | 36.8 / 328K | 150.9 / 24K |
| SHORTSERVER225 / 5564716 | 1 | 45,794 | 103,383 | 74,393 | 311.7 / 56K | 3,022.14 / 83K |
| SHORTMOBILE60 / 50044 | 7 | 711 | 66,516 | 54,686 | 61.9 / 72K | 601.9 / 75K |
| LONGMOBILE24 / 50044 | 7 | 711 | 66,310 | 54,976 | 69.3 / 71K | 834.5 / 77K |
| SHORTMOBILE59 / 50044 | 9 | 879 | 64,126 | 33,102 | 73.5 / 40K | 421.7 / 42K |
2 Current Predictors' Limitations
State-of-the-art branch predictors, such as TAGE [seznec2006] and Perceptron [perceptron2001], are tabular and history-based, i.e., they identify predictable branch patterns by mapping recent branch-outcome histories to internal state machines (saturating counters or weights) stored in table entries. Most commonly, they use various formations of the local history (previous outcomes of a static branch), the global history record (prior outcomes of any branch), and the branch program counter (PC). Intuitively, branch histories are expressed as a sequence of consecutive binary events (taken/not-taken) leading to a branch. For example, the global history is a bit sequence representing the branch directions before a certain branch. As such, history-based predictors attempt to identify predictable patterns in the form of dense series of events, even when branches correlate sparsely with them. Ideally, a separate table entry is allocated for each observable pattern. As the number of patterns grows exponentially with the history length, the required number of entries quickly becomes too large.
Current branch predictors use astonishingly long branch histories. TAGE-SCL [seznec2016], the most recent TAGE variant and winner of the last BP championship (CBP5 [cbp2016]), tracks histories with lengths on the order of thousands of bits, whereas the Multiperspective Perceptron predictor [jimenez2016] (ranked second in CBP5) uses several features acquired by tracking similarly sized histories. As the employed histories become longer, more previous branch directions can be looked up, and thus the chances of identifying correlations with more distant branches increase. Nonetheless, so do the sparsity and the variation of the predictive signatures.
Consider the simple case of two if-type branches, A and B, with dependent conditions, where A precedes B in the program order. In such a scenario, the outcome of B correlates with that of A. If there are N other if-type branches between A and B with independent conditions that change directions constantly, there will be at least 2^N different patterns that the predictor needs to distinguish to predict B accurately. These intermediate branch outcomes can be considered "noise" in the history, since B can be predicted accurately by solely monitoring A. This noise burdens the predictor with a superfluous increase in entry allocations for tracking all the observable patterns. Under storage constraints, such allocation pressure can heavily disorganize the predictor's state machines, compromising accuracy. Although branch predictors employ various storage-saving heuristics to mitigate entry allocation, they miss sparsity by tracking history patterns in a dense form.
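As a toy illustration of this blow-up (the branches A and B and the count N are the hypothetical ones from the example above), enumerating the joint outcomes of A and N intervening noise branches yields 2^(N+1) distinct history patterns, even though only A's bit matters:

```python
import itertools

def patterns_to_track(noise_branches):
    """Count distinct (A, noise...) history patterns a dense, table-based
    predictor would track; only A's bit actually determines B's outcome."""
    return len(set(itertools.product([0, 1], repeat=noise_branches + 1)))

for n in (2, 4, 8):
    print(n, patterns_to_track(n))   # 2 -> 8, 4 -> 32, 8 -> 512
```

A sparse model that monitors only A's position needs a single signature regardless of N.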
TAGE predictors are based on the PPM data compression scheme [ppm1984]. They employ a plurality of tables indexed using overlapping history slices of increasing lengths, hashed (XORed) with the branch address. Table entries contain (partially) tagged saturating counters that model the prediction. The matching entry accessed with the longest history eventually provides the prediction. As in PPM, the intuition is that longer histories should be used when shorter ones fail to be accurate. Yet, even when the correlation of a branch with the history is sparse, TAGE predictors need to allocate storage for tracking all the possible patterns, since the histories are used in a compact way, i.e., without an explicit mechanism for disregarding the non-informative parts. A glimpse at Table 1 (detailed in Section 3) demonstrates the effects of excessive entry allocation when histories are noisy. Especially under the storage pressure of 8KB, TAGE-SCL repeatedly "forgets" and "relearns" branch patterns, inducing a much higher number of mispredictions than at 64KB.
Perceptron predictors are loosely based on the homonymous learning algorithm. Their prediction mechanism receives a vector of inputs with corresponding weights and computes their dot product, which is then thresholded to provide the prediction. Input weights are trained and adapted according to the correctness of the produced predictions. Initial proposals used only the global history [perceptron2001], while in the latest Perceptron predictors the input vector consists of several different hashes (organizations) of the branch history [tarjan2005, jimenez2016]. Still, without filtering the branch history a priori, Perceptron predictors suffer from significant aliasing among noisy histories and their synthetic formations.
As Perceptron predictors achieve lower accuracy than TAGE predictors on our dataset (see Section 5.1), we employ the latest TAGE-SCL models [seznec2016] for our study. In the next section we explore the application of sparse modeling in BP for capturing the sparse correlations between branches and recent history. We specifically focus on sparse linear models, dropping nonlinear ones that can induce excessive storage and computational overheads. As we will show, sparse linear modeling effectively captures branches' correlation with a commonly used set of features, namely the global and local branch history, and inherently identifies its sparsity.
3 Sparse Branch Correlations
We start our exploration by outlining in Table 1 a few examples that demonstrate the opportunity behind sparse branch-history correlations. Interestingly, the reported branch examples are not "hard to predict" for TAGE-SCL (around a 1% misprediction ratio). However, the average number of entries that TAGE-SCL allocates for these branches is far from optimal, since it has to replicate them exponentially for every noisy uncorrelated event (branch outcome) in the history. For example, TAGE-SCL 64KB allocates more than 1K entries on average for a branch that could be predicted accurately with a single sparse signature (LONGMOBILE1 / 548221168352). TAGE-SCL 8KB requires around 444 entries on average for the same branch and, more importantly, induces around 51x more mispredictions than at 64KB by not being able to suppress the effects of destructive aliasing within such a limited storage budget.
This storage efficiency issue is in line with previous work that exposed it by examining TAGE-SCL 64KB with large code-footprint applications [bp:not_solved:2019]. Storing only sparse correlations could therefore generally improve the performance of state-of-the-art branch predictors by eliminating the need to represent irrelevant features in their storage, thereby reducing the predictor's footprint. Nonetheless, identifying and exploiting sparsity effectively is a grand challenge for branch predictors. Because training has historically been performed solely online, typical branch-prediction designs are quite cumbersome to extend with powerful sparse modeling. The next section provides a brief background on sparse linear modeling, revealing its interconnection with branch prediction.
3.1 Sparse Linear Modeling
In supervised learning prediction tasks, often only a small subset of input variables contributes to the prediction outcome. In the case of linear models, where we focus, the problem is well-studied in the literature under various synonyms, i.e., sparse modeling, sparse signal recovery, and Lasso regression [book:stat_learning_lasso]. In our study, we apply these techniques in the context of BP. Due to the binary nature of branch outcomes, BP resembles a classification problem. Thus, we opt to employ the Lasso logistic regression model, i.e., $\ell_1$-regularized logistic regression. Our main focus is on offline linear methods to understand the limitations and opportunities that sparsity could present in BP. However, in Section 7, we also examine the promising yet challenging case of online sparse linear modeling.

Assume a collection of $N$ input feature-vector and target pairs $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ is the $i$-th input feature-vector and $y_i$ is the corresponding target. Lasso logistic regression seeks a linear logistic function $f(x) = \sigma(w \cdot x + b)$, where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function, $b$ is the model bias/intercept, $\cdot$ denotes the dot-product and $w$ is the vector of model weights. The parameters $(w, b)$ that define $f$ are determined by optimizing the following objective:

$$\min_{w,\,b} \;\; \sum_{i=1}^{N} \ell\big(f(x_i), y_i\big) + \lambda \lVert w \rVert_1 \qquad (1)$$

for some $\lambda \geq 0$, $\ell$ being the logistic loss and $\lVert \cdot \rVert_1$ the $\ell_1$ norm. The hyperparameter $\lambda$ enforces a trade-off between the model's sparsity and prediction performance. That is, in the special case where $\lambda = 0$, Equation (1) corresponds to plain logistic regression (no model sparsity enforced). On the other extreme, it can be shown that $w$ is the all-zeroes vector for a large enough value of $\lambda$. Setting an appropriate $\lambda$ value is not a trivial task, although methods such as coordinate descent [lasso:glmnet] can compute the full regularization path of $\lambda$ values. Such methods are also guaranteed to converge to a globally optimal solution [lasso:cd:tseng_2001]. In our experiments, we observed that a binary search on the range of $\lambda$ with a stopping criterion of 99% accuracy allows us to fine-tune $\lambda$ efficiently. Finally, we should note that there are several methods that can compute an optimal solution of Equation (1) [book:stat_learning_lasso].

3.2 Sparse Linear Models on Branch History
Branch predictors correlate the behavior of branches with the recent history of previous branch outcomes, expressed most commonly through the global and the local history records (GHR/LHR). To leverage the sparsity of branches' correlations, we construct sparse linear models that express branch outcomes based on the branch history. In particular, we aim to define the parameters (bias $b$ and sparse weights $w$) of a logistic regression model $f$ so that $f(h) \approx y$, where $h$ is the branch history with branch directions represented as $\pm 1$, and $y$ is the modeled branch outcome. $f(h)$ returns the predicted probability that the branch identified by its PC will be taken. To simplify the inference, we replace the sigmoid function with a sign check: if the sign is negative, the branch is predicted not-taken, and taken otherwise. During training, once the history is sufficiently populated, we collect training samples of the form $(h_i, y_i)$, where $h_i$ is the history vector and $y_i$ is the sampled branch outcome. Equipped with the above modeling configuration, Equation (1) can be optimized for every static branch. After training, based (mainly) on the accuracy and the sparsity of the resulting models of all the screened branches, it is determined which of them follow a sparse linear model. Section 4.1.2 provides an in-depth analysis of this process.
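A behavioral sketch of this sign-check inference follows; the function and variable names, and the toy model, are illustrative:

```python
# Sign-check inference: the sigmoid is dropped, and the branch is predicted
# taken iff the linear score is non-negative (sigmoid(s) >= 0.5  <=>  s >= 0).

def predict_taken(weights, indices, intercept, history):
    """history: +1 (taken) / -1 (not-taken) outcomes, index 0 = most recent.
    weights/indices: the sparse model's nonzero weights and their positions."""
    score = intercept + sum(w * history[i] for w, i in zip(weights, indices))
    return score >= 0

# Hypothetical branch correlated only with history position 3:
hist = [-1, +1, -1, +1, -1, -1, +1, -1]
print(predict_taken([2.0], [3], 0.0, hist))   # True: position 3 was taken
```

Because only the sign of the score matters, the hardware never needs to evaluate the sigmoid.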
Based on this setup, we now perform sparsity analysis on traces of SPECINT2017 [spec2017] applications described in Section 5.1. The branch history we use is the concatenation of a GHR and an LHR of fixed lengths. We perform Lasso logistic regression on all the branches of each trace (with a few exceptions) to show that sparse linear correlations can be modeled efficiently. To strengthen our confidence in the statistical properties of the computed models, we exclude branches that are highly biased towards a single direction, i.e., those with a taken rate close to 0% or 100%, and branches that appear only a few times in a trace. As the crux of our work is to design effective sparsity-aware branch predictors, we also set a rigid 99% accuracy threshold, only above which a sparse model is considered sufficiently accurate. In Fig. 2 we plot the distribution of all sufficiently accurate and non-highly-biased branches according to the number of nonzero (nnz) weights of their model, as returned by sparse modeling. Note that, per application, some branches may be counted more than once if they appear in multiple traces of the application.
As illustrated, branches follow a sharply decaying distribution based on the number of their nonzero (nnz) weights. Furthermore, the majority of them require a fairly small number of nonzero weights in proportion to the employed history length. Note that, for clarity, we have also dropped the first histogram bin, grouping the branches with a single nonzero weight, as it was fairly large and less challenging, representing essentially branches that correspond to loops.
According to our analysis, only a relatively small number of branches are identified as interestingly sparse and accurate cases. Nonetheless, they can significantly affect the predictor's effectiveness by creating increased allocation pressure when the available storage is limited. In our trace set, TAGE-SCL 8KB is plagued largely by such cases. In Fig. 3 we show numerous cases of branches (for clarity, Fig. 3 does not include xalancbmk and mcf, where we found very few branches with highly accurate sparse models, as shown in Fig. 2) whose sparse models contain only a handful of nonzero weights, i.e., branches correlated with only a few previous directions from the history, that still account for a massive number of average unique entries in TAGE-SCL 8KB. Even more interestingly, TAGE-SCL 8KB mispredicts a large fraction of these branches at a relatively high ratio, despite dedicating a considerable amount of its total storage to tracking them.
Overall, designing branch predictors that overlook sparsity leads to suboptimal use of on-chip resources that can greatly affect prediction accuracy. In the next section, we introduce a complete architectural design that enables sparsity-aware BP for effectively handling sparsely correlated branches.
4 Sparse Predictor Architecture
We now present our sparsity-aware BP scheme, outlined in Fig. 4. Our proposal assumes a deployment scenario where predictions are generated at runtime after an offline training phase, as also considered in the work of Lin & Tarsa [bp:not_solved:2019]. Offline training is necessary to capture the predictive statistics of the otherwise hardly detectable sparse correlations. It consists of three major steps: sparse linear modeling for extracting the branch models, compression of these models, and finally filtering. Branch models are filtered according to certain microarchitectural design constraints that facilitate their use for runtime prediction by a dedicated component called the Sparse Linear Branch Inference Unit (SLBIU). SLBIU is the BP mechanism at the core of our scheme, specialized to predict branches with offline-prepared sparse models. We envision SLBIU as an auxiliary element of the branch prediction unit (BPU), complementing the functionality of the primary branch predictor that is traditionally trained solely online. In the rest of this section, we describe all the details of our scheme, including offline processing, microarchitectural modifications, and certain system requirements of our deployment scenario.
4.1 Offline Training Process
The offline stage requires a set of branch traces collected by profiling target applications in a post-compilation phase. These traces undergo Sparse Modeling through Lasso logistic regression (as described in Section 3.2), which produces the sparsity hints: a set of per-branch sparse linear models attached to the program binary. Sparsity hints essentially materialize as a collection of weight and history-index pairs; each weight expresses the correlation of the screened branch with the branch outcome at the respective history index. The sparsity hints are loaded into the SLBIU using a dedicated SW/HW API and are used to perform BP with the dynamic history as input. The SLBIU functionality and API are described in Section 4.2. It is crucial to keep storage requirements and complexity at reasonable levels without compromising performance. To do so, we perform two necessary optimizations on sparsity hints before deployment, denoted as the Compression and Selection routines in Fig. 4.
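The materialization of a hint as weight/history-index pairs can be sketched as follows; the dictionary layout and the zero threshold are illustrative, not the deployed binary format:

```python
# Sketch: reduce a trained dense weight vector to the sparse
# (history-index, weight) pairs that form a sparsity hint.

def to_sparsity_hint(pc, weights, intercept, eps=1e-6):
    """Keep only the nonzero weights, paired with their history indices."""
    pairs = [(i, w) for i, w in enumerate(weights) if abs(w) > eps]
    return {"pc": pc, "intercept": intercept, "pairs": pairs}

# Hypothetical model: only history positions 2 and 4 carry weight.
hint = to_sparsity_hint(0x401A, [0.0, 0.0, 1.8, 0.0, -0.6], 0.1)
print(hint["pairs"])   # [(2, 1.8), (4, -0.6)]
```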
4.1.1 Compression
Weights Quantization: Sparse linear model training is performed in floating-point arithmetic, with high enough precision to express small parameter updates during model optimization. However, once sparse models are trained, weights can be represented with fewer bits for use during inference. We quantize the model weights to a fixed-precision format [gupta2021neural] by rounding each weight to its nearest representable signed number [quantization:golovin13]. As we will show, 8-bit weights are sufficient for our models.

History Deduplication: We observed that, quite often, branch histories lead to (conceptually) duplicated inputs. For example, in a loop-type branch with a fixed iteration count, the local history is periodic, i.e., $h[i] = h[i+k]$ for some positive offset $k$. For such a branch, Lasso will assign arbitrary weights over the full set of duplicated history positions. However, if there is indeed some correlation with any of these history indices, only one of them is sufficient to express it. To detect these duplicates, we leverage ElasticNet, a generalization of Lasso that assigns approximately the same weight to highly correlated or duplicated features [elasticnet_2005]. In this way, we manage to keep only one instead of multiple identical nonzero weights for branches with such a "strided" local history.
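A minimal sketch of the rounding-based quantization step, assuming a signed 8-bit fixed-point format; the split between integer and fractional bits chosen here is illustrative:

```python
# Round-to-nearest quantization of a weight into a signed fixed-point format,
# saturating at the representable range.

def quantize_q(w, frac_bits=4, total_bits=8):
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    q = max(lo, min(hi, round(w * scale)))   # nearest representable integer
    return q / scale                          # dequantized value used at inference

print(quantize_q(1.37))    # -> 1.375 (22/16)
print(quantize_q(100.0))   # saturates at 127/16 = 7.9375
```

Since only the sign of the final dot product matters for prediction, small rounding errors in individual weights rarely flip the predicted direction.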
4.1.2 Selection
The set of sparsity hints produced after compression contains a sparse model for every static branch of the target program. However, sparse models are evidently not efficient for all static branches. Essentially, the number of static branches per application that are predicted more accurately with a sparse model than with a state-of-the-art predictor is conveniently small. In our evaluation, we show that the number of finally selected sparsity hints remains small, for a storage overhead of a few KB. Still, removing the burden of predicting such sparsely correlated branches from the primary predictor enables important improvements, as we discuss in Section 6.
As it appears, it is necessary to employ a selection method to filter out the non-promising cases, or otherwise to define the cases where employing a sparse model is effective. Such a selection method needs to solve an optimization problem: identify the subset of branches that are predicted with better accuracy through sparse models under certain storage constraints. We consider two dimensions to express storage constraints: the maximum permitted number of sparsity hints, denoted by N, and the maximum permitted number of nonzero weights per selected hint, denoted by W. Our selection method follows three steps. First, we employ a score function that assigns a scalar score value to each hint; positive scores indicate a potential improvement and negative scores a potential drop in performance. Second, hints with negative scores or with more than W weights are discarded. Finally, the remaining hints are ranked by their score and the top (at most) N are selected.
We define two different score functions: independent and relative. In the independent score function, hints with relatively low accuracy are dropped a priori. The remaining hints are assigned a score equal to the number of their correct predictions during the offline analysis. Scores are always positive and do not consider the performance of the primary predictor. In the relative score function, scores are defined as the difference between the number of correct predictions made by the respective sparse models offline and the number of correct predictions (estimated with sampling) made by the primary predictor. Both score functions are based on counts of correct predictions instead of accuracy rates, to prioritize hints that relate to branches with greater impact, i.e., hints that achieve high accuracy for branches with very few invocations will have less priority.
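The selection procedure above can be sketched as follows; the field names and the data layout are illustrative, and the a-priori low-accuracy filter of the independent score is folded into the positive-score check for brevity:

```python
# Score each hint, drop over-wide or non-improving ones, keep the top-N.
# With primary_correct=None the independent score is used; otherwise the
# relative score subtracts the primary predictor's (sampled) correct count.

def select_hints(hints, max_hints, max_weights, primary_correct=None):
    scored = []
    for h in hints:
        if len(h["weights"]) > max_weights:          # too many nonzero weights
            continue
        if primary_correct is None:
            score = h["correct"]                     # independent score
        else:
            score = h["correct"] - primary_correct[h["pc"]]  # relative score
        if score > 0:                                # keep only improvements
            scored.append((score, h["pc"]))
    scored.sort(reverse=True)                        # rank by score
    return [pc for _, pc in scored[:max_hints]]

hints = [
    {"pc": 0x40, "weights": [3, -2], "correct": 900},
    {"pc": 0x44, "weights": [1] * 9, "correct": 999},   # exceeds weight budget
    {"pc": 0x48, "weights": [5], "correct": 400},
]
print(select_hints(hints, max_hints=2, max_weights=8))
```

With the relative score, a hint whose offline accuracy does not beat the primary predictor receives a negative score and is discarded, even if its absolute accuracy is high.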
4.2 Sparse Linear Branch Inference Unit
As mentioned above, the deployment scenario of our scheme involves an offline training phase that produces the sparse prediction models of branches in the form of hints. Sparsity hints are models that receive branch histories as input and provide accurate predictions for the corresponding branches. SLBIU, the "Sparse Linear Branch Inference Unit", is the hardware mechanism that enables the runtime prediction of branches based on these sparse models which are trained offline. In the rest of this section, we describe the structure and functionality of SLBIU and explain all the microarchitectural modifications required by our sparsityaware branch prediction scheme.
4.2.1 SLBIU Structure
| Parameter | Definition |
|---|---|
| $l_l$ | Local history length (in bits) |
| $l_g$ | Global history length (in bits) |
| $N$ | Number of branches with sparse models |
| $W$ | Maximum number of nonzero weights |
| $b_w$ | Weights bitwidth |
| $b_{PC}$ | Branch PC bitwidth |
Fig. 5 depicts in detail the building blocks of SLBIU, along with their essential design parameters described in Table 2. SLBIU is abstractly divided into two main parts: the Sparse-Model Storage, a content-addressable memory (CAM) space, and the Prediction Engine, an arithmetic logic circuit. The CAM space is fully associative and PC-indexed, holding the sparsity hints and the local histories of the (static) branches selected during the offline training process. Recall that sparsity hints represent sparse models. We use the Coordinate (COO) storage format to encode the sparsity hints, both to keep a fixed size per hint-vector (zero padding) and to allow a convenient hardware implementation.
SLBIU can store up to $N$ hint-vectors in its CAM space, $N$ being the number of branches with sparse models. Each hint-vector includes the $W$ weights and their respective history indices. Indices are $\lceil \log_2(l_g + l_l) \rceil$ bits wide, as they encode a position in the concatenated global/local branch history of $l_g + l_l$ bits, where $l_g$ and $l_l$ are the global and local history lengths. The intercept values (one per sparse model) and the weights are reduced to $b_w$ bits after the quantization step discussed in Section 4.1.1. Eventually, considering all the design parameters listed in Table 2, and with $b_{PC}$ denoting the branch-PC bitwidth, the amount of storage space required by SLBIU is defined as:

$$\text{Storage} = N \cdot \Big( b_{PC} + b_w + l_l + W \cdot \big( b_w + \lceil \log_2(l_g + l_l) \rceil \big) \Big) \qquad (2)$$

As Equation (2) shows, the SLBIU storage scales logarithmically with the global history length ($l_g$). Considering that current branch predictors already employ global histories on the order of thousands of previous branch outcomes (bits), this logarithmic relation essentially discharges the history length from typically being the major limiting factor. On the other hand, the linear relation of the required storage with the number of branches with sparse models ($N$) and with the number of their nonzero weights ($W$) makes $N$ and $W$ the most crucial parameters. In Section 6 we discuss the tradeoffs that arise from the nontrivial task of determining optimal values of $N$ and $W$ under certain storage budgets.
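The scaling behavior just described can be checked with a small storage calculator; the per-entry layout and the default bitwidths here are illustrative assumptions, not the exact hardware encoding:

```python
# Estimate SLBIU storage in bits: a PC tag, an intercept, W COO weight/index
# pairs, and the per-branch LHR per entry, times N entries.
from math import ceil, log2

def slbiu_storage_bits(n_branches, max_nnz, l_global, l_local,
                       weight_bits=8, pc_bits=32):
    idx_bits = ceil(log2(l_global + l_local))         # logarithmic in history
    per_entry = (pc_bits                              # CAM tag
                 + weight_bits                        # intercept
                 + max_nnz * (weight_bits + idx_bits) # COO weight/index pairs
                 + l_local)                           # per-branch LHR
    return n_branches * per_entry

# Doubling the global history barely moves the total (one extra index bit):
print(slbiu_storage_bits(64, 8, 1024, 16))
print(slbiu_storage_bits(64, 8, 2048, 16))
```

Doubling the global history length costs only one additional index bit per weight, whereas doubling the number of hints or the weight budget doubles the corresponding term outright.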
4.2.2 SLBIU Functionality
As outlined in Fig. 4, in our design SLBIU is not a standalone branch predictor. Its utility is revealed when coupled with the primary branch predictor of a CPU design, by improving the prediction of sparsely correlated branches. In our deployment scenario, SLBIU is informed of the branches with accurate sparse models after an offline analysis. That is, we assume that SLBIU is initialized before a program's execution phase starts. Initialization includes loading all the sparse model parameters of the selected branches trained offline (intercept value, weights and history indices, as shown in Fig. 5) and resetting all the respective LHR fields to zero. Nevertheless, the potential of performing sparse modeling purely online is not generally excluded; in Section 7 we briefly discuss the feasibility of such an implementation scenario, although we keep it out of the scope of this work.
Eventually, SLBIU contains one sparse model for each static branch selected at the end of the offline training phase. When branches are routed for prediction in the BPU, SLBIU is probed with the branch PC in parallel with the primary branch predictor. Upon discovering that a branch possesses an entry in SLBIU (a "hit"), the main branch predictor is signaled to halt any update of its internal state related to that branch, i.e., entry allocations and/or updates of entry state machines. Thereafter, the prediction of that branch is made exclusively by SLBIU. In that sense, the selected sparsely correlated branches are offloaded to SLBIU, since they no longer require resources from the primary predictor. However, any branch history organization that is used and maintained in the BPU is still updated normally.
In particular, we assume that the BPU manages the global branch history with a single GHR that is common to the primary predictor and SLBIU. The fraction of the GHR that each component uses internally for prediction and update can differ, depending on the mechanism. In our evaluation we couple SLBIU with TAGE-SCL 8KB, which uses GHR slices of various lengths, whereas for SLBIU we have found that a limited global history length suffices to effectively capture the potential for improvement in our dataset. Below we describe the prediction and update process in SLBIU.
Prediction computation: After initialization, the GHR along with the per-branch maintained LHR are the two main inputs that SLBIU uses to predict an offloaded branch. At prediction time, the LHR and GHR are concatenated, forming a long history vector from which the important bits are selected through an array of multiplexers (HistorySelect()). Predictions are based on the dot product of the important history bits and their weights. An adder tree can be used to sum up the vector of products and the intercept value. Similarly to previous work [perceptron2001], as we interpret taken/not-taken events with ±1 values (expressed as 1/0 in the GHR/LHR), respectively, the vector multiplication reduces in practice to flipping the signs of the weights that are paired with not-taken history bits (SignFlip()). The prediction of a branch is not-taken if the dot-product sign is negative, and taken otherwise. To allow higher clock rates and to limit the latency to an acceptable 3–4 cycles [zangeneh2020branchnet, zhao2021cobra], we pipeline the prediction operation of SLBIU into three stages: the 1^{st} stage performs the fully associative lookup, the 2^{nd} stage extracts the model weights and LHR from the CAM, concatenates and selects history bits and flips weight signs, while the 3^{rd} stage is dedicated to the adder tree.
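As a minimal illustration of this prediction flow, the following Python sketch mirrors HistorySelect() and SignFlip(); the SparseModel record and function names are ours, and floating-point weights stand in for the hardware's fixed-point values:

```python
from dataclasses import dataclass

@dataclass
class SparseModel:
    intercept: float
    indices: list    # positions of the selected bits in LHR||GHR
    weights: list    # one weight per selected history bit

def slbiu_predict(model: SparseModel, ghr: list, lhr: list) -> bool:
    """Mimic the SLBIU pipeline: concatenate LHR and GHR, select the
    important bits (HistorySelect), flip the sign of each weight whose
    history bit is not-taken (SignFlip), and sum everything with the
    intercept (the adder tree). A non-negative sum predicts taken."""
    history = lhr + ghr                       # concatenated bit vector
    acc = model.intercept
    for i, w in zip(model.indices, model.weights):
        acc += w if history[i] else -w        # SignFlip() + adder tree
    return acc >= 0

# Toy model: the branch outcome hinges on one local-history bit.
m = SparseModel(intercept=-1.0, indices=[1], weights=[3.0])
print(slbiu_predict(m, ghr=[1, 1, 0, 1], lhr=[1, 1]))  # → True  (-1 + 3 = 2)
print(slbiu_predict(m, ghr=[1, 1, 0, 1], lhr=[1, 0]))  # → False (-1 - 3 = -4)
```

In hardware, the per-weight conditional negation is exactly why the "dot product" needs no multipliers.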
Update: The sparse models of the branches offloaded to SLBIU are only updated at initialization, i.e., at a context switch or at the start of a program phase for which a different set of hints has been produced offline. Our design assumes that a different set of sparsity hints can be loaded in SLBIU per program phase; Section 4.3 clarifies the requirements of such an approach. During execution, only the LHRs are updated according to the respective branch outcomes. To support updating LHRs simultaneously with retrieving model parameters for prediction, we implement a dual-ported CAM storage space, i.e., a single-bit port for writing and an entry-wide port for reading.
4.3 SW/HW Interface for Sparsity Hints
In this paper, we advocate the use of offline training for more accurate and focused sparse modeling. Our scheme relies on a sparsity analysis performed at compile time over traces that contain branch outcomes recorded from normal runs of the application. The traces can be generated through a profile-guided optimization (PGO) phase [Gupta02profileguided], through manual workload optimization, or through a JIT analysis [JIT_tracing]. Recent work also suggests that PGO may be performed over a wide corpus of other applications using ML techniques [PGO_ML].
In their previous work [bp:not_solved:2019], Lin and Tarsa indicate that acquiring several traces from a single program allows training to be effectively refined over specialized program statistics. Therefore, we obtain multiple traces representing different phases of a program's execution (using SimPoint [hamerly2005simpoint] intervals), resulting in a respective set of trained sparse models. These models pass the selection and compression steps individually, so each may focus on a different subset of branches that dominate a specific program phase. Even if a branch appears in different models, its weights may differ, representing its localized behavior.
The sparse models are stored in the binary as program metadata. According to our experiments (Section 6.1), the binary size overhead is expected to be minimal, since, per execution phase, sparse models that account for less than KB of storage space suffice to effectively capture the majority of mispredictions from sparse branches. For the on-chip delivery of model parameters, the binary can be annotated with trigger points that indicate the passing of program phases. The application is then responsible for unpacking the sparse model of each phase to memory and, using dedicated instructions (ISA extensions), populating the SLBIU weights from that memory range. Given that the storage overhead of each model is small and that the represented phases each span multiple billions of instructions (depending on SimPoint representativeness), the overall loading time is also assumed to be negligible. We leave a more detailed analysis and evaluation of the requirements of such a SW/HW interface for sparsity hints, or of alternative approaches, to future work.
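As a purely hypothetical illustration of what such per-phase metadata could look like, the sketch below packs and unpacks sparse models with Python's struct module; the layout, field widths, and names are our own assumptions, not a format defined by this work:

```python
import struct

# Hypothetical serialization of one phase's sparsity hints: a count,
# then per model the branch PC, the intercept, and its (history index,
# quantized weight) pairs. Field widths are illustrative.
def pack_phase(models):
    blob = struct.pack("<I", len(models))
    for pc, intercept, pairs in models:
        blob += struct.pack("<QfI", pc, intercept, len(pairs))
        for idx, w in pairs:
            blob += struct.pack("<Hh", idx, w)   # 16-bit index, 16-bit weight
    return blob

def unpack_phase(blob):
    (n,) = struct.unpack_from("<I", blob)
    off, models = 4, []
    for _ in range(n):
        pc, intercept, k = struct.unpack_from("<QfI", blob, off)
        off += 16
        pairs = []
        for _ in range(k):
            pairs.append(struct.unpack_from("<Hh", blob, off))
            off += 4
        models.append((pc, intercept, pairs))
    return models

phase = [(0x40123C, -0.75, [(3, 120), (505, -44)])]
assert unpack_phase(pack_phase(phase)) == phase   # lossless round trip
```

At a phase trigger, a runtime loader would unpack such a blob and feed each model into SLBIU through the assumed ISA extension.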
5 Evaluation Methodology
5.1 Experimental Setup & Benchmarks
We have implemented the full offline training process as a software module that receives branch traces and produces the set of selected branch sparse models. Sparse modeling is performed through Lasso logistic regression. Training on branches with roughly million dynamic executions and bit-long branch-history features takes around minutes on a commodity server.
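A minimal stand-in for this training step, assuming synthetic data and illustrative hyperparameters (not the production pipeline), is an L1-regularized (Lasso) logistic regression trained with proximal gradient descent:

```python
import numpy as np

def lasso_logistic(X, y, lam=0.05, lr=0.1, steps=2000):
    """L1-regularized logistic regression via proximal gradient (ISTA):
    a gradient step on the logistic loss followed by soft-thresholding,
    which drives uninformative weights to exactly zero."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))          # predicted P(taken)
        w -= lr * (X.T @ (p - y) / n)                   # loss gradient step
        b -= lr * np.mean(p - y)
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # soft-threshold
    return w, b

# Synthetic trace: 64 history bits encoded as +/-1; the outcome depends
# on a single distant history position (bit 7).
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(2000, 64))
y = (X[:, 7] > 0).astype(float)
w, b = lasso_logistic(X, y)
print(np.flatnonzero(np.abs(w) > 0.1))   # the surviving history positions
```

On such data the L1 penalty zeroes out nearly every position except the informative one, which is exactly the sparsity pattern the offline analysis looks for.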
To evaluate our sparsity-aware BP scheme, we implemented SLBIU and the interface for initialization with the offline-produced sparse models in the CBP5 trace-based framework [cbp2016]. The TAGE-SCL predictors are configured to ignore the branches that hit in SLBIU. TAGE-SCL and SLBIU work in tandem based on a common GHR. We also model the program-phase adaptation of sparse models by initializing SLBIU before each trace simulation, assuming that each trace represents a program phase with different sparse models. As we perform simulations in the CBP5 framework, we evaluate our BP model exclusively based on MPKI measurements. In this way, we are able to gauge the improvement over state-of-the-art BP independently from the various design-specific artifacts of a modern CPU. Nonetheless, this paper includes all the necessary details for a full-scale evaluation.
We evaluate our microarchitectural model over a rich set of traces that undergo sparse modeling before simulation. Our trace pool includes the publicly available set of CBP5 [cbp2016] traces and a set of traces recorded from running the SPECINT2017 Rate benchmarks [spec2017] using the ref input set. SPEC traces are obtained for each of the M-instruction-long SimPoints that we identify per benchmark ( on average). Therefore, in our experiments, SLBIU is reconfigured every M instructions for SPECINT2017 and every M instructions on average for the CBP5 traces, i.e., once per trace. In total, we use CBP5 traces, after dropping some duplicated and corrupted traces and also a few that are already highly optimized by TAGE-SCL 8KB ( MPKI). For SPEC benchmarks with traces from several inputs (xz, x264, perlbench, gcc) we report average metrics with ranges. Note that all our measurements concern the total trace simulation, i.e., there is no warmup phase. As we are interested in cases where storage pressure is high, we couple SLBIU with TAGE-SCL 8KB and report the relative difference in the obtained MPKI. As such, we (arguably) emulate scenarios where large predictors are cornered by applications with extensive working sets. Since our dataset lacks such cases, we do not observe significant differences when coupling SLBIU with TAGE-SCL 64KB; therefore, we do not present a quantitative analysis for the 64KB variant of TAGE-SCL. Yet, the 8KB configuration of TAGE-SCL that we study closely resembles a practical BP design for today's common CPU resource budgets, as also considered by previous work [bp:not_solved:2019].
5.2 Physical Implementation
We implemented SLBIU as presented in Section 4.2 with an industry-grade technology library and compiler tools. The place-and-route was performed at the typical corner (0.8 V, 25 °C). We use retiming to balance the 3 pipeline stages, automatic clock gating, and manual clock gating based on the hit signal to avoid switching in the history-select and compute logic. The CAM is a fully-associative lookup table implemented as a register file of entries of bits each.
The power is evaluated based on the switching activities (VCD) extracted from timing-annotated (SDF) post-place-and-route gate-level simulations running a synthetic benchmark that sweeps over different scenarios of branch ratio and branch offloading rate. A scenario consists of a randomly generated trace and a set of random sparsity hints. The scenarios are parameterized by "branch frequency" (the ratio between the number of dynamic branches and the number of all dynamic instructions) and by "offloaded branch ratio" (the ratio between the number of offloaded sparsity hints and the total number of static branches in the trace). Instructions are scheduled uniformly across the trace, while the hints' weights and indices are drawn from a uniform distribution. Using synthetic scenarios allows us to explore various extreme cases of read/write intensity to our proposed SLBIU circuit. The simulation of SLBIU is embedded in a cocotb-based [rosser2018cocotb] testbench, which begins with the sparsity hints initialization (the entire CAM is always filled with hints), followed by the execution of the 10K-long trace, fetching an instruction every clock cycle. For each branch instruction of the trace, the corresponding inputs are applied to SLBIU (PC, GHR and Predict signals), the prediction output is compared with the expected result to verify correctness, and the LHR is updated based on the branch resolution.
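The synthetic scenarios described above can be sketched with a small generator; the trace encoding, names, and parameter defaults below are our own simplification of the testbench inputs:

```python
import random

def make_scenario(n_instr=10_000, branch_freq=0.5, offload_ratio=0.5,
                  n_static=64, hist_bits=128, n_weights=8, seed=0):
    """Build one synthetic power scenario: a trace in which roughly
    `branch_freq` of the instructions are branches drawn uniformly from
    `n_static` static PCs, plus uniform-random sparsity hints covering
    `offload_ratio` of those static branches."""
    rng = random.Random(seed)
    pcs = [0x1000 + 4 * i for i in range(n_static)]
    trace = [("BR", rng.choice(pcs), rng.getrandbits(1))   # (kind, PC, outcome)
             if rng.random() < branch_freq else ("OP", None, None)
             for _ in range(n_instr)]
    offloaded = rng.sample(pcs, int(offload_ratio * n_static))
    hints = {pc: {"indices": rng.sample(range(hist_bits), n_weights),
                  "weights": [rng.randint(-2048, 2047)     # signed fixed-point
                              for _ in range(n_weights)]}
             for pc in offloaded}
    return trace, hints

trace, hints = make_scenario()
print(len(hints), sum(t[0] == "BR" for t in trace))
```

Sweeping `branch_freq` and `offload_ratio` over a grid reproduces the read/write-intensity corners explored in the power analysis.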
6 Results & Analysis
We start our analysis by quantifying the effects of various design parameters, then continue with an overall evaluation of our approach, and finally demonstrate the effectiveness we observe at the circuit level. In all our experiments, TAGE-SCL 8KB, as implemented in CBP5 [cbp2016], is the primary branch predictor coupled with the SLBIU configuration under examination. All evaluations are relative to the standalone TAGE-SCL 8KB. Overall, our results demonstrate that our design can achieve noticeable MPKI improvements with insignificant storage overheads.
6.1 Sensitivity to Design Parameters
We distinguish four main factors that can affect the performance of our design: the amount of available storage, the method of hints selection, the length of branch histories, and the quantization degree of the models' weights. As these design parameters are interdependent, to identify their impact we define a specific range of cases to experiment with, and then perform an MPKI evaluation by simulating our SPEC traces. After experimenting with several combinations, we analyze the most interesting and representative cases.
In particular, we choose to experiment with three storage budgets, the smallest being 0.5KB, a moderate option of 2KB, and the largest at 8KB. We perform sparse modeling (as described in Section 5.1) and prepare a set of branch sparse models for two different history sizes, 64 and 512 bits. In our first set of experiments, to fully capture the potential of branch sparse models, we allow full-precision floating-point weights (32-bit wide) without quantization. Thereafter, we run our two methods for hints selection, independent and relative, to identify the optimal sets of sparsity hints that satisfy the three storage budgets examined. In practice, each selected set defines the actual dimensions of SLBIU's CAM space. Eventually, the pair we choose is the one with the highest sum of scores. We configure SLBIU according to the chosen pair and simulate our SPEC trace set.
Fig. 6 depicts the results of our analysis in two distinct columns, the left side for independent and the right side for relative selection. Horizontal figure-pairs represent the three storage budgets we examine, comparing the two different history sizes used in SLBIU. Unsurprisingly, the relative selection consistently outperforms the independent selection by prioritizing only the hints that are guaranteed to improve TAGE-SCL. Noticeable improvements can be seen in almost all the benchmarks, except for mcf, with leela demonstrating the largest uplift in all cases. As expected, the benefit increases for larger storage budgets where more sparse models can be employed, as validated by Table 3 listing the chosen pairs. The maximum number of offloaded branches increases almost linearly with the available storage, while sparsity remains steadily high. The 8KB SLBIU configuration features the maximum , that of , whereas for both the other storage configurations it does not exceed .
Table 3. The chosen pairs for each storage budget and selection method.

               Independent                Relative
         lh=gh=64    lh=gh=512    lh=gh=64    lh=gh=512
0.5KB    (5,16)      (3,17)       (5,16)      (2,34)
2KB      (18,19)     (8,34)       (13,28)     (8,34)
8KB      (50,29)     (29,39)      (46,32)     (27,43)
Furthermore, the set of 512-bit histories accounts for higher improvements, with a few exceptions in gcc and leela at KB, and exchange2 at KB, where 64-bit histories perform better. Larger histories therefore broaden the scope of sparse models and allow them to effectively capture correlations found in the quite distant past. Even more so, as illustrated in Table 3, they achieve that while requiring storage for only the few important segments of the history. Following these findings, in the rest of our analysis we focus on sparsity hints over 512-bit histories, filtered with the relative selection method.
Next, we explore the impact of quantization, another important performance factor of our design. We evaluate the effectiveness of our model with 16- and 8-bit quantization degrees using the Q3.12 and Q3.4 signed fixed-point formats, respectively. To do so, we simulate our SPEC traces using the two most promising SLBIU configurations of 2KB and 8KB. In Fig. 7 we plot the MPKI improvements obtained with quantized models, comparing them with full precision.
Table 4. The chosen pairs for each precision format.

         FP32       Q3.12      Q3.4
2KB      (8,34)     (11,34)    (13,36)
8KB      (27,43)    (33,53)    (53,42)
According to our results, the MPKI improvements are successfully sustained after quantization, although no significant gains are observed. More specifically, for the KB SLBIU configuration, the benefits are mostly higher in the (bit) than in the (bit) format, manifesting the available quantization headroom. With the KB SLBIU, this trend is observed only in gcc and deepsjeng. In most benchmarks, improvements tend to saturate across all precision formats, with the exception of perlbench, where the format performs marginally better. Naturally, quantization enables storage savings that allow more branches to be offloaded to SLBIU. Our evaluation reveals that this opportunity matters more at lower storage budgets. That is, quantization appears to be necessary for minimizing storage requirements without compromising prediction accuracy.
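For illustration, such a round-and-saturate conversion to a signed Qm.n fixed-point format can be sketched as follows (we return the dequantized value for readability; the hardware would keep the integer codes):

```python
def quantize(w: float, int_bits: int = 3, frac_bits: int = 4) -> float:
    """Round a full-precision weight to a signed Qm.n fixed-point value
    (default Q3.4: 1 sign bit, 3 integer bits, 4 fractional bits),
    saturating at the representable range."""
    scale = 1 << frac_bits
    lo = -(1 << (int_bits + frac_bits))          # most negative code
    hi = (1 << (int_bits + frac_bits)) - 1       # most positive code
    code = max(lo, min(hi, round(w * scale)))
    return code / scale                          # value used in the dot product

print(quantize(1.37))              # → 1.375  (nearest Q3.4 step of 1/16)
print(quantize(100.0))             # → 7.9375 (saturated at the Q3.4 maximum)
print(quantize(1.37, 3, 12))       # Q3.12: step of 1/4096, much finer
```

The quantization error is bounded by half a step (1/32 for Q3.4, 1/8192 for Q3.12), which is why accuracy is largely preserved while storage per weight shrinks from 32 to 16 or 8 bits.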
In Table 4 we present the pairs chosen in our experiment, confirming the increase of from full precision to quantized models. More importantly, Table 4 shows that the sparsity levels remain comparable; is equal to and for the 2KB and 8KB configurations, respectively. That is, regardless of the available storage, offloaded branches satisfy a similar sparsity threshold well. Recall that specifies the maximum number of weights per offloaded branch by defining the width of SLBIU's CAM (see Fig. 5). As such, configuring SLBIU efficiently to exploit most of the underlying opportunity appears to be feasible. In the next section, we explore this aspect by evaluating the effectiveness of specific SLBIU designs over our large set of traces.
6.2 Large-scale MPKI Evaluation
We evaluate the performance of the two best-performing configurations from our previous experiments over our full trace set (described in Section 5.1). In particular, we compare SLBIU of 2KB and 8KB for 512-bit histories, where the sparse models' weights are quantized to bits ( format) and the set of offloaded branches is selected with the relative method, satisfying the dimensions in Table 4, i.e., there can be up to selected branches per execution phase (trace) with fewer than weights.
Fig. 8 depicts an S-curve of MPKI for the 392 traces of our evaluation set (x-axis), sorted according to TAGE-SCL 8KB MPKI. We refer to the improvements of the 2KB/8KB configurations in 3 MPKI ranges. In the high range above 5 MPKI (128 traces), our configurations reduce MPKI on average by 0.13/0.15 (1.3%/1.6%). In the middle range of 1–5 MPKI (141 traces), they reduce MPKI on average by 0.05/0.07 (2.1%/2.7%). In the low range of 0.01–1 MPKI (123 traces), they reduce MPKI on average by 0.012/0.014 (3.4%/3.7%). Note that across the various MPKI ranges we also observed traces (for both designs) where no branches were offloaded to SLBIU, resulting in no change in MPKI. Interestingly, in some traces the 2KB configuration achieves (marginally) the best performance. Essentially, this phenomenon exposes the importance of the selection method in large storage budgets, which needs to be optimized adequately to balance storage exploitation and performance. It also shows that the 2KB SLBIU configuration can be a highly effective design.
Fig. 9 compares the mean MPKI improvements of the two SLBIU configurations for different groups of traces. Although the 8KB storage budget gives a higher improvement across all groups of traces, it is only marginally higher than that of the 2KB one. This demonstrates that large storage budgets are not necessary to effectively capture the underlying opportunity.
6.3 Circuit-level Evaluation
We evaluate the two SLBIU configurations that achieved the best MPKI improvement on SPECINT2017 under a 2 KB storage budget, for the long and the short history lengths, respectively. The evaluation is in terms of timing, area and power in a 28 nm technology. We also provide a rough estimate of this evaluation for a 7 nm technology, for which we use the following scaling factors from 28 nm to 7 nm: 0.4x power at the same speed, 1.8x speed at the same power, and 3.4x area [techscaling, 7nm].
Timing. Both designs run at 750/790 MHz in 28 nm (1.4 GHz in 7 nm). The 3 pipeline stages are balanced, with the critical path dictated by the adder tree (3^{rd} stage) due to the 16-bit operands in the short-history configuration, and by the history selection (2^{nd} stage) in the long-history configuration. Note that in 7 nm the same 3-cycle latency can also be retained under certain frequency requirements [zhao2021cobra, zangeneh2020branchnet].
Area. For both designs, the standard-cell-based content-addressable memory (CAM) is dimensioned to 2 KB and thus dominates the area breakdown (98%/88%). Nonetheless, the CAM occupies just 0.34 mm² of area (0.1 mm² in 7 nm).
Power. Fig. 9(a) shows the power of both modules for various offloading ratios and branch frequencies. At full utilization (100%/100%), the two candidate modules consume 15 or 40 mW (i.e., up to 28 mW at 1.3 GHz in 7 nm), whereas at zero utilization they require just 220 or 240 μW thanks to the high switching reduction achieved by our manual clock gating scheme. Importantly, SLBIU spends only negligible power on the lookups in its CAM that result in a miss (purple line in Fig. 9(a)). Conversely, power is effectively spent only in the case of a hit. Fig. 9(b) shows a breakdown of the power consumption of each SLBIU component (CAM, adder tree, etc.) in the most aggressive, yet realistic, scenario with 100% branch frequency (the unit is queried every clock cycle) over varying offloaded branch ratios. Most power is spent in the history-select unit (50%/13%) and the CAM register file (39%/56%). Despite the same storage, the power consumption is smaller for the 64-bit history configuration due to its 16× smaller history-select unit and narrower CAM.
7 Online Sparse Modeling
In our study, we have shown that sparse correlations of branches with branch history can be detected efficiently offline through sparse modeling. In this section, we briefly argue that such sparsity can also be detected with online training.
To that end, we implemented sparse linear modeling in an online setting, where model parameters are updated after predictions are resolved during trace simulation. We employ bit-long global and local histories for comparing the online results with the offline findings of Table 1. We experimented with several optimization methods and found stochastic gradient descent with a cumulative penalty (SGD-L1) [lasso:sgd_cum_l1] to be the most efficient. We improved SGD-L1 by adapting the regularization hyperparameter with an online binary search, i.e., starting with an initial value and halving or doubling it within a fixed range to keep the number of nonzero model weights bounded.
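For reference, the cumulative-penalty update of SGD-L1 can be sketched as follows (a simplified online logistic learner; the class and parameter names are ours, and the binary search over the regularization strength described above is omitted for brevity):

```python
import math
import random

class SGDL1:
    """Online logistic regression with the cumulative L1 penalty:
    each weight pays only the penalty accumulated since it was last
    moved, clipped so it cannot cross zero, which keeps most weights
    exactly zero."""
    def __init__(self, dim, lr=0.1, lam=0.01):
        self.w = [0.0] * dim
        self.q = [0.0] * dim   # penalty already applied to each weight
        self.u = 0.0           # total penalty each weight should have paid
        self.lr, self.lam = lr, lam

    def predict(self, x):      # x: history bits encoded as +/-1
        z = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

    def update(self, x, y):    # y: 1 = taken, 0 = not-taken
        err = y - self.predict(x)
        self.u += self.lr * self.lam
        for i, xi in enumerate(x):
            self.w[i] += self.lr * err * xi      # plain SGD step
            z = self.w[i]                        # apply pending penalty,
            if z > 0:                            # clipped at zero
                self.w[i] = max(0.0, z - (self.u + self.q[i]))
            elif z < 0:
                self.w[i] = min(0.0, z + (self.u - self.q[i]))
            self.q[i] += self.w[i] - z

# Train on a synthetic stream whose outcome depends only on history bit 3.
rng = random.Random(0)
model = SGDL1(dim=32)
for _ in range(3000):
    x = [rng.choice([-1, 1]) for _ in range(32)]
    model.update(x, 1 if x[3] == 1 else 0)
print(max(range(32), key=lambda i: abs(model.w[i])))
```

The clipping is what keeps lazy weights at exactly zero: a weight that has idled at zero must first "pay back" its accumulated penalty before it can grow.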
Table 5. Mispredictions of online SGD-L1 against the offline sparse linear model (average number of nonzero weights in parentheses).

Trace / PC                           Online (nnz)      Offline (nnz)
LONG_MOBILE-1 / 548221168352         921     (1.2)     6,118  (1)
SHORT_MOBILE-16 / 1566871128         2,085   (34.1)    3,697  (7)
SHORT_SERVER-225 / 5564716           44,966  (28.5)    45,794 (1)
SHORT_MOBILE-60 / 50044              2,375   (27.0)    711    (7)
LONG_MOBILE-24 / 50044               2,584   (35.9)    711    (7)
SHORT_MOBILE-59 / 50044              3,152   (35.83)   879    (9)
The hardware implementation of SGD-L1 is challenging, since it requires two additional bit floats per history bit, one for the trainable model weight and one for the cumulative penalty. Nonetheless, specific accuracy restrictions can be used to track only a certain subset of static branches with a reasonable entry count and to adapt adequately. Similar approaches have already been used in the past in commercial products for learning complicated correlations at runtime, such as the Perceptron training in IBM's z15 [z15_IBM].
In Table 5, we present the number of mispredictions of SGD-L1 against the offline sparse linear model of Section 3.2. Within parentheses, we show the average number of nonzero weights that SGD-L1 maintains over its execution. As illustrated, this number never exceeds , and thus a unit similar to SLBIU can be efficiently tuned for predicting branches based on the trained models in a timely fashion, as we demonstrated in Section 6.3. Note that online training is not able to learn the sparsity patterns exactly as in the offline setup, with the pleasant exception of the first branch in Table 5. However, online sparse modeling does learn an enlarged version of each sparsity pattern. Furthermore, the number of mispredictions is comparable with the offline one. These early findings suggest that online sparse linear modeling is a promising research direction.
8 Other Related Work
Evers et al. investigated branch correlations and showed that most branches can be predicted efficiently by considering a selective history of only a few previous branches. Essentially, our study corroborates the early findings of Evers et al. by employing sparse linear modeling to pinpoint the informative branch-history locations. The findings of Evers et al. also motivated the Spotlight branch predictor [bp:spotlight]. Similarly to our work, Spotlight identifies the important parts of the history offline, which are then used by a gshare-like predictor [mcfarling1993]. During profiling, the global-history segments that lead to a branch are analyzed exhaustively to discover the combination that provides the highest accuracy. In contrast, our work employs non-exhaustive training methods based on sparse modeling to build a dedicated linear model for predicting each screened branch efficiently with a specialized hardware unit. Fern et al. [fern2000dynamic] proposed a decision-tree-based branch predictor with dynamic feature selection to decrease the number of input features, thereby reducing the storage of a tabular branch predictor. Some recent studies [gupta2021neural, lafiandra2021brat] focused on the implementation of neural-network-based predictors that can be trained online, similarly to the online sparse modeling concept we briefly discussed in Section 7. BranchNet [zangeneh2020branchnet] and the work of Tarsa et al. [tarsa2019improving] train convolutional neural networks offline for branches that are hard-to-predict for TAGE. Our work differs in the simplicity of the linear models we deploy and in targeting a fundamental control-flow property, the sparsity of branch correlations, which is independent of the predictor's mechanism. Similarly to the above works, however, our model does not target data-dependent branches, which have recently been addressed using compiler support by the SLB predictor [farooq2013slb] or using aggressive runahead execution as in Branch Runahead [pruett2021branch].
9 Conclusions & Future Work
This study stimulates the development of sparsity-aware branch prediction for improving accuracy by exploiting a fundamental property of programs' control flow, the sparsity of branch correlations. We analyzed several traces derived from the SPECINT2017 benchmarks and CBP5 by capitalizing on sparse modeling methods. Our results demonstrated the existence of numerous sparsely correlated branches. Such branches impede the effectiveness of common branch predictors by putting them under unnecessary pressure for entry allocations. To eliminate their effects, we propose to employ offline sparse modeling to produce the respective sparse models of branches that will be used for runtime prediction. To that end, we introduce SLBIU, a hardware mechanism specialized to predict branches with offline-prepared sparse models. SLBIU works as an auxiliary to the primary branch predictor of a CPU design and significantly improves the prediction of sparsely correlated branches. In particular, when combined with TAGE-SCL 8KB, SLBIU accounts for up to 42% (2.3% on average) of MPKI improvement with 2 KB of storage overhead. Furthermore, our circuit-level evaluation in 28 nm technology showed that SLBIU is able to deliver predictions in 3 clock cycles at 740 MHz while requiring no more than 40 mW of power and as little as 0.34 mm^{2} of area. Essentially, our results demonstrate the important benefits in branch prediction from identifying and exploiting sparsity effectively. Our study unlocks several other topics for exploration, mainly related to the optimization of the offline training of sparse models. In future work, we plan to investigate the effectiveness of other algorithms from the quiver of sparse modeling and also study their runtime adaptability.