Accurately predicting the throughput of a basic block of instructions – the number of clock cycles taken to execute the block in steady state – is an essential requirement in many systems, especially those that strive to optimize the runtime performance of programs. The throughput of a sequence of instructions determines how fast it can process data, especially inside hot loops. Empirical evaluation of ground-truth throughput values is too expensive for systems such as compilers and superoptimizers, so they use analytical models for prediction. For example, the LLVM compiler uses a complex machine model to estimate throughput when performing low-level optimization tasks such as register allocation, instruction selection and instruction scheduling, to produce code that yields fast execution times. The STOKE superoptimizer uses a simple additive model as an approximation to instruction cycle counts in its search for high-performance code sequences.
Static estimation of the throughput of a basic block of instructions is subject to several challenging constraints: an ideal throughput estimator should be accurate, portable and fast.
A throughput estimator should be accurate. However, estimating the throughput of an instruction sequence executed on a modern x86-64 Complex Instruction Set Computer (CISC) processor is no easy task. To be concrete, estimating throughput statically for a basic block that executes on a simple, non-pipelined, in-order, scalar processor follows directly by summing the specified latency of each instruction. Such additive models are easy to implement and are therefore used in many contexts, including stochastic superoptimizers. However, static throughput estimation for a modern processor quickly encounters substantial complexity, including:
Micro-op Expansion: each instruction is expanded into micro-ops within the processor. Thus, pipelining, dependencies, stalls and resource bottlenecks occur at the micro-op level, which is far more granular than the instruction level.
Out-of-Order and Superscalar Execution: out-of-order processors leverage the data dependence structure of a basic block to map its linear order to an execution schedule that maximizes the instruction-level parallelism of the micro-ops. Superscalar execution units allow multiple instructions to be executed in parallel. The resulting throughput estimation problem is therefore non-linear.
Microarchitectural Resources: the microarchitecture may include additional resources that are not exposed in the architectural specification. For example, the implied dependence structure of a basic block may include anti-dependencies and output dependencies that can be eliminated by the processor because of additional physical registers (and register renaming). Without knowledge of these resources, the estimation model will be overly pessimistic.
Microarchitectural Resource Constraints: the microarchitecture has resource constraints such as execution port limitations, the binding of certain operations to specific execution ports, pipeline interlocks, etc. Without precise knowledge of these limitations, models can be overly optimistic.
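For contrast with these complexities, the simple additive baseline described above – summing a specified latency per instruction – can be sketched in a few lines. The latency table here is purely illustrative, not taken from any vendor documentation:

```python
# Additive baseline: predicted cycles = sum of per-instruction latencies.
# These latency values are made up for illustration only.
LATENCY = {"mov": 1, "add": 1, "shl": 1, "mul": 3}

def additive_throughput(block):
    # `block` is a list of tuples whose first element is the opcode.
    return sum(LATENCY[opcode] for opcode, *_ in block)

print(additive_throughput([("mov",), ("add",), ("mul",)]))  # 5
```

Such a model is trivially fast and portable, but it captures none of the micro-op, out-of-order, or resource-constraint effects listed above.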
Apart from modeling these complexities efficiently, the throughput estimator has to deal with errors and unspecified microarchitectural details, including:
Latent Microarchitectural Optimizations: the microarchitecture may include undocumented optimizations such as micro-op fusions and idioms which execute in separate data paths (e.g., zeroing registers).
Specification Inconsistency: the vendor’s documentation is a best-effort approach to document a complicated system. Therefore, while the vendor may have intended to document a given feature of the microarchitecture, the resulting documentation may be incorrect.
Under-specified Features: vague documentation of certain microarchitectural features poses an additional challenge. For example, while out-of-order execution is specified, the reorder buffer size may not be.
A throughput estimator should ideally be portable. We should be able to port the estimator from one microarchitecture to another with minimal manual intervention. Each microarchitecture in a processor family has its own architectural quirks and implementation changes. Manually tailoring the throughput estimator to suit different microarchitectures may require rewriting instruction tables and resource utilization charts, all of which is tedious and error prone.
A throughput estimator has to be fast. Systems such as STOKE and LLVM need to search through many code blocks before emitting the fastest version of a given instruction sequence. Any complex model that has to simulate the effects of the pipeline will be excruciatingly slow. Running the basic blocks to obtain ground-truth measurements requires sandboxing and many iterations of execution to arrive at a consistent throughput estimate, which is impractical. In certain cases, the compilation target microarchitecture may also differ from the native microarchitecture. Therefore, fast static estimation of throughput is important.
1.1 Existing Approaches
Delivering precise throughput estimates requires modeling and managing all of these complexities efficiently and in a portable manner. llvm-mca is a recent (as of March 2018) command-line tool distributed with LLVM (http://lists.llvm.org/pipermail/llvm-dev/2018-March/121490.html) that exposes the machine model LLVM uses for throughput estimation. It models a complex out-of-order processor with resource constraints and superscalar units. LLVM uses separate table-description files to list instructions, their resource usage in terms of the execution ports used, their throughputs, etc. for different x86-64 microarchitectures. llvm-mca uses these elaborate descriptions to calculate the throughput of a set of instructions.
Intel has released a machine code analyzer, IACA, which also predicts the throughput of a given instruction sequence; naturally we expect it to be well tuned for Intel's own processors, incorporating any undocumented processor features into its throughput calculation algorithm. Both llvm-mca and IACA rely on hand-written rules and are tedious to maintain as well as to check for correctness.
1.2 Ithemal - Data Driven Approach
In this paper we introduce Ithemal (Instruction THroughput Estimator using MAchine Learning)
, which takes a novel data-driven approach to predicting the throughput of a block of instructions, inspired by recent advances in Natural Language Processing (NLP). Ithemal models throughput estimation using a Deep Neural Network (DNN), which can be learned from a large corpus of labeled data. We show that, while non-trivial, by understanding the capabilities of DNNs and the complexity of the problem at hand, we were able to devise a novel DNN architecture that predicts throughput with almost half the error (e.g., a drop in average error from 0.2206 to 0.1053 for the Haswell microarchitecture) of the sophisticated models used in llvm-mca and IACA, while delivering the fastest estimation speed of all. We show that our data-driven approach allows us to easily port from one microarchitecture to another with minimal human intervention.
Ithemal leverages a DNN model whose structure mimics that of the underlying dataflow execution of a modern-processor. Specifically, Ithemal maps a basic block to a dataflow embedding
: a directed, acyclic graph with real-valued vectors as the content of each node. Each node in the graph corresponds to an instruction in the basic block. Each node and, correspondingly, each instruction, also has an $n$-dimensional, real-valued vector associated with it. The graph contains a directed edge from a node $u$ to a node $v$ if the instruction for $v$ depends on the instruction for $u$ (as determined by analyzing the source operands of $v$ and the destination operands of $u$).
The $n$-dimensional vector for each node follows from using traditional natural language processing techniques to map a sequence of textual tokens – in this case the instruction’s opcode, source operands, and destination operands – to a deep representation in the form of a vector [24, 30, 3].
Compared to analytical approaches, the only domain knowledge we need to supply is the dataflow graph of the instruction sequence and the explicit listing of the implicit register sources and destinations of x86-64 instructions. The DNN learns all other microarchitectural details that contribute to accurate throughput estimation.
In this paper, we present the following contributions:
Data-Driven Throughput Estimation. We present Ithemal, the first system and technique for data-driven throughput estimation.
Dataflow Embedding and DAG-RNN. We present a novel approach for mapping an assembly basic block to an embedding that captures the basic block’s data dependencies in the form of a graph (as in traditional dataflow graphs). We then leverage a DAG-RNN – a DNN that operates on DAG-structured inputs – to perform prediction.
Evaluation. We demonstrate that Ithemal is more accurate than state-of-the-art hand-written throughput estimation tools, while maintaining the highest estimation speed. We also demonstrate the portability of Ithemal’s data-driven approach.
DNN Architecture Exploration. We demonstrate that encoding the data dependencies of a basic block is important to the accuracy of the system by evaluating alternative, compatible DNN designs to the throughput estimation problem.
Ithemal holds out the promise that future systems can leverage data-driven techniques to either augment or fully replace manually developed throughput estimators.
Analytical approaches require detailed modeling of the underlying microarchitecture to arrive at accurate throughput estimates. Ithemal’s data-driven approach instead learns accurate prediction intrinsically from the available ground-truth data.
Consider code sequences (a)-(e) and their associated throughput predictions for 100 iterations of each sequence on the Intel Haswell microarchitecture in Figure 1 (sequences taken from the test set). Sequences (a) and (b) are single-instruction sequences which zero out (a) the scalar register eax and (b) the upper halves of all ymm vector registers. IACA and our model closely follow the measured values. However, llvm-mca’s predictions are off by several fold. Upon close inspection, we found that register zeroing is a common idiom that Intel processors execute using a faster data path, separate from the normal instruction fetch-execute path. This fact may be coded into IACA, but is missing from the LLVM model. Our model, however, being driven by data, intrinsically learns to predict these cases correctly.
Sequence (c) is a short sequence with a data dependency. Both llvm-mca and IACA predict similar values, and both report that bypassing happens within the processor pipeline, hence the mov instruction does not consume many additional clock cycles. However, the throughput value they use (from the Intel manual) for the shl instruction differs from the measured value. In comparison, our data-driven model closely predicts the actual throughput, even when the vendor manual's numbers differ from the actual value.
Sequence (d) is a mixed scalar and vector instruction sequence. IACA predicts a lower cost since it finds a micro-op fusion opportunity at the vpaddd instruction; however, this fusion is not visible in the measured value.
Sequence (e) shows a highly dependent sequence of floating point operations. While none of the models predicts it exactly, our data-driven model predicts the value closest to the actual: IACA’s prediction is more than 1.5x higher and llvm-mca’s is more than 3.5x higher.
All of these examples show that even when many human hours are spent fine-tuning these processor models, they are less than perfect, even the ones released by processor vendors themselves. Sometimes even the details released in the processor manuals are imperfect. Our data-driven model, however, automatically learns to predict throughput values without needing to model the processors precisely.
In fact, the amount of prior knowledge that we feed into the model is kept to a minimum. Concretely, the only structural detail we enforce on the model is the data-dependence structure of the instruction sequence. In spite of that, our model exhibits higher accuracy than hand-written models. Further, we conjecture that with more data our model's accuracy will continue to grow, and collecting more x86-64 assembly basic blocks is not difficult; it just takes time. We show that our model is portable: learning to predict the throughputs of a different microarchitecture requires no structural changes to the model, whereas hand-written models need to be customized to cater to the intricacies of each microarchitecture.
3 Data Driven Model
Figure 2 presents the high-level design of Ithemal’s approach. Specifically, we decompose its operation into the following stages: canonicalization, embedding and prediction.
Ithemal’s canonicalization stage takes an assembly block specified as text (Intel syntax) and maps it to a list of instructions, each of which consists of an opcode (specified as text), a list of source operands (each specified as text), and a list of destination operands (each specified as text). This mapping is akin to traditional lexing and parsing, with the only exception being that canonicalization makes explicit any operands that are typically implicit in standard assembly language notation. For example, the instruction add rbx, 0x02 has rbx as an implicit source operand.
The information required to parse an instruction, compute its implicit operands, and categorize its operands into either source or destination operands is the only architecture-specific information Ithemal requires as input. Using only this information, Ithemal automatically embeds and predicts the throughput of a basic block.
Ithemal’s embedding stage takes a canonicalized basic block and produces an embedding
, which is a representation of the basic block that is amenable to consumption by a neural network. Specifically, neural networks typically take as input a sequence of real-valued inputs. In domains such as speech and computer vision, the natural specification of images and audio as signals easily lends itself to consumption by a neural network. However, in structured domains, such as text or – as in our domain – programs, inputs are discrete in nature (such as words and basic blocks) and therefore one must map each structured input to a representation that can be consumed by a neural network.
Ithemal maps a basic block to a dataflow embedding: a directed, acyclic graph with real-valued vectors as the content of each node. Each node in the graph corresponds to an instruction in the basic block. Each node and, correspondingly, each instruction, also has an $n$-dimensional, real-valued vector associated with it. The graph contains a directed edge from a node $u$ to a node $v$ if the instruction for $v$ depends on the instruction for $u$ (as determined by analyzing the source operands of $v$ and the destination operands of $u$).
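A minimal sketch of how such a dependence graph could be derived from canonicalized instructions. The `Instr` record and field names here are illustrative stand-ins, not Ithemal's actual data structures; only true (read-after-write) dependencies are tracked:

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    # Hypothetical canonicalized instruction with explicit operand lists.
    opcode: str
    srcs: list
    dsts: list
    deps: list = field(default_factory=list)  # indices of predecessor instructions

def build_dataflow_graph(block):
    last_writer = {}  # operand name -> index of instruction that last wrote it
    for i, ins in enumerate(block):
        for s in ins.srcs:
            # Edge from the last writer of each source to this instruction.
            if s in last_writer:
                ins.deps.append(last_writer[s])
        for d in ins.dsts:
            last_writer[d] = i
    return block

block = build_dataflow_graph([
    Instr("mov", ["mem_0"], ["ebx"]),
    Instr("add", ["ebx", "eax"], ["eax"]),
    Instr("mul", ["ecx", "eax"], ["eax", "edx"]),
])
# mul depends on add (through eax); add depends on mov (through ebx).
```

Register renaming, anti-dependencies and output dependencies need not be modeled here: as the text notes, the graph captures only the data dependencies that constrain execution order.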
The $n$-dimensional vector for each node follows from using traditional natural language processing techniques to map a sequence of textual tokens – in this case the instruction’s opcode, source operands, and destination operands – to a deep representation in the form of a vector [24, 30, 3].
A dataflow embedding is therefore a dataflow graph that specifies an $n$-dimensional vector for each instruction. The graph structure of our embedding mimics the underlying dataflow execution model of out-of-order processors and – as we show in Section 6 – is a critical element for high-accuracy prediction when compared to embedding approaches and neural network architectures that do not directly integrate the basic block’s dependence structure.
Ithemal’s prediction stage takes a basic block’s dataflow embedding and predicts its throughput. Because a dataflow embedding is a Directed Acyclic Graph, we leverage a DAG-Recurrent Neural Network (DAG-RNN) [38, 44]
to perform prediction. Ithemal’s DAG-RNN traverses the graph structure of the dataflow embedding in topological order, computing a deep, real-valued, vector representation of each connected subgraph in the dataflow embedding. Given the vectors for each subgraph, Ithemal reduces these vectors into a single vector and then performs prediction using linear regression.
Figure 3 illustrates an example of the input and output of Ithemal’s canonicalization stage. Ithemal takes a textual representation of a basic block and produces a canonicalized form that includes each instruction’s implicit operands.
Specifically, consider an instruction $I$ with opcode $o$, destination operands $d_1, \ldots, d_m$, and source operands $s_1, \ldots, s_k$. Then $I$’s canonicalized form is a token list – a list of strings, or tokens:

$(o, \texttt{<D>}, s_1, \ldots, s_k, \texttt{<D>}, d_1, \ldots, d_m, \texttt{<D>})$
where <D> is a special delimiting string. For example, consider the instruction mul ecx on the last line of the assembly source in Figure 3. The instruction has one explicit source (ecx) and one implicit source (eax). It stores the multiplication result in two implicit destinations (eax and edx). Therefore, canonicalization produces the token list:

(mul, <D>, ecx, eax, <D>, eax, edx, <D>)
Each destination and source can be a register or a memory location. For memory locations, we map all such operands to a numbered, distinguished token mem_j, where j denotes the position of the reference in the ordered list of memory references in the basic block (ordered left to right and top to bottom). For example, for the mov ebx, [ecx] instruction on the first line of the assembly code in Figure 3, canonicalization maps the memory reference [ecx] to the token mem_0.
This mapping results in a unique token for each reference in the basic block, under the assumption that memory references in a basic block do not alias. Further, we give aliased partial registers separate unique tokens.
Sources can additionally be integer or floating point immediates, for which canonicalization elides the value but preserves the type. For example, for the instruction shl eax, 0x02, canonicalization maps the immediate value 0x02 to the token INT.
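The operand-tokenization rules above can be sketched as follows. This is a simplified illustration, not Ithemal's implementation: it omits implicit-operand expansion and partial-register handling, and the delimiter placement follows the description in the text:

```python
import re

def canonicalize(opcode, srcs, dsts, mem_counter):
    """Map one instruction's textual operands to a token list.

    mem_counter is a one-element list used as a mutable counter so that
    memory references are numbered across the whole basic block.
    """
    def tok(op):
        if op.startswith("["):                       # memory reference
            t = "mem_%d" % mem_counter[0]
            mem_counter[0] += 1
            return t
        if re.fullmatch(r"0x[0-9a-fA-F]+|\d+", op):  # integer immediate
            return "INT"                             # keep type, elide value
        return op                                    # register token
    return ([opcode, "<D>"]
            + [tok(s) for s in srcs] + ["<D>"]
            + [tok(d) for d in dsts] + ["<D>"])

counter = [0]
print(canonicalize("shl", ["eax", "0x02"], ["eax"], counter))
# ['shl', '<D>', 'eax', 'INT', '<D>', 'eax', '<D>']
```

A real canonicalizer would also consult per-opcode tables to insert implicit sources and destinations (e.g., eax and edx for mul) before tokenization.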
3.2 Dataflow Embedding
Ithemal next converts a canonicalized basic block into a dataflow embedding, which is a directed, acyclic graph with $n$-dimensional vectors as nodes. Each node corresponds to an instruction in the basic block.
Figure 4 presents an example dataflow embedding for our running example. Each node $v_I$, where $I$ is an instruction in the basic block, is an $n$-dimensional, real-valued vector that is a deep representation of the instruction.
Ithemal constructs the dataflow embedding by embedding each instruction and then connecting them into a graph using the dependence information that can be computed from the canonicalized form of each instruction (Section 3.1).
3.2.1 Instruction Embedding.
Ithemal’s instruction embedding step maps an instruction to a single $n$-dimensional vector. Because each instruction can have a variable number of tokens, depending on its number of source and destination operands, the input to our model is dynamic and unbounded. Inspired by the variable-length sentence parsing used in sequence-to-sequence learning, we use a Recurrent Neural Network (RNN) architecture to produce an instruction’s embedding.
Recurrent Neural Network (RNN).
An RNN consumes a sequence of vectors representing a sequence of inputs and produces a sequence of vectors as output. An RNN is recursive in that its execution is recursively defined on the length of the sequence where at each step the RNN applies a cell to produce the output vector for that step. The cell computation depends on the output vector of the previous step and – therefore – the dependency between steps enables the RNN to compute vectors that depend on inputs consumed earlier in the sequence. The final vector that an RNN computes therefore summarizes the entire sequence.
Figure 5 presents the operation of our RNN-based instruction embedding approach on our running mul ecx example. The topmost level presents the sequence of tokens for the instruction. The second level presents the sequence of token embeddings: a sequence of $n$-dimensional vectors, one for each token in the instruction. The third level presents the sequential application of an RNN cell to reduce the token embedding sequence into the final instruction embedding.
The first step in executing the RNN on an instruction’s token list is to map each token to a vector, producing a sequence of vectors. In Figure 5, each token – whether an opcode or an operand – is denoted by such a vector. To represent each token as an embedding, our approach maps each token in the sequence to an $n$-dimensional vector using a lookup table. This lookup-table approach is common in other domains, such as natural language processing (e.g., Word2vec and GloVe). The values of the vectors in this table are not given a priori; instead, the end-to-end training process (Section 4.3) learns these values automatically.
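The lookup table can be sketched as follows. The embedding dimension, initialization and lazy vocabulary growth here are illustrative choices (Ithemal uses 256-dimensional embeddings learned during training):

```python
import numpy as np

n = 4                 # embedding dimension (256 in Ithemal; small here for display)
rng = np.random.default_rng(0)
vocab = {}            # token -> n-dimensional vector, created lazily

def embed(token):
    # The table starts with random vectors; end-to-end training would
    # adjust these values via backpropagation.
    if token not in vocab:
        vocab[token] = rng.standard_normal(n)
    return vocab[token]

seq = [embed(t) for t in ["mul", "<D>", "ecx", "eax", "<D>", "eax", "edx", "<D>"]]
# Repeated tokens share one entry: both "eax" lookups return the same vector.
```

In a framework like PyTorch the same role is played by an embedding layer indexed by token ids, with the table stored as a trainable weight matrix.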
An RNN cell, Cell, is a function with internal learnable parameters that, when evaluated at position $i$ in the sequence, consumes an input vector $x_i$ and a hidden state vector $h_{i-1}$ computed at the previous position, and then produces a new hidden state vector $h_i$. We use the following notation to denote an RNN cell’s computation:

$h_i = \text{Cell}(h_{i-1}, x_i)$
The exact implementation of an RNN cell is a configurable choice – or hyperparameter – of our approach and could in principle be implemented using many common approaches, such as the Gated Recurrent Unit (GRU) or the Long Short Term Memory (LSTM) cell. Regardless of the implementation, a key detail of any cell’s structure is that it includes a set of internal parameters that are trained by the overall training algorithm. Moreover, these parameters are shared across cell applications: each box in Figure 5 corresponds to an application of the same cell, and therefore the same shared parameters, to each element of the sequence.
Ithemal uses an LSTM cell in its implementation. An LSTM selectively remembers information passed to it from the previous cell using forget and input gates. The state passed along the RNN chain consists of two components: the hidden state $h_t$ and the cell state $c_t$ at time $t$. To be consistent with the notation of a generic RNN cell, one can consider the single concatenated state $[h_t; c_t]$ to be passed along the RNN chain. The computation of $h_t$ and $c_t$ inside the LSTM cell is governed by the following set of equations, where $x_t$ is the current input to the cell at time $t$:

$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$
$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$
$g_t = \tanh(W_g x_t + U_g h_{t-1} + b_g)$
$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$
$c_t = f_t \odot c_{t-1} + i_t \odot g_t$
$h_t = o_t \odot \tanh(c_t)$
The $W$s, $U$s and $b$s are learnable parameters (weight matrices and bias vectors) and the $\odot$ products are element-wise. The computation of $h_t$ and $c_t$ depends on the hidden and cell states at time $t-1$, namely $h_{t-1}$ and $c_{t-1}$, and the input at time $t$, $x_t$. Initial hidden and cell states $h_0$ and $c_0$ are used at time $t = 1$. $i_t$, $f_t$, $g_t$ and $o_t$ are known as the input, forget, cell and output gates, which control the computation of the hidden and cell states at time $t$. The interested reader can refer to [26, 20] for more details about LSTMs.
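A direct numpy transcription of an LSTM cell and its unrolling over a token-embedding sequence can illustrate these equations. The weight shapes, random initialization and the single concatenated weight over $[h_{t-1}; x_t]$ are one common convention, not Ithemal's actual parameterization (Ithemal uses PyTorch's trained LSTM):

```python
import numpy as np

n = 4
rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# One weight matrix and bias per gate, acting on the concatenation [h; x].
W = {g: rng.standard_normal((n, 2 * n)) * 0.1 for g in "ifgo"}
b = {g: np.zeros(n) for g in "ifgo"}

def lstm_cell(h_prev, c_prev, x):
    hx = np.concatenate([h_prev, x])
    i = sigmoid(W["i"] @ hx + b["i"])   # input gate
    f = sigmoid(W["f"] @ hx + b["f"])   # forget gate
    g = np.tanh(W["g"] @ hx + b["g"])   # candidate cell state
    o = sigmoid(W["o"] @ hx + b["o"])   # output gate
    c = f * c_prev + i * g              # element-wise gating
    h = o * np.tanh(c)
    return h, c

# Unrolling the cell over a sequence of token embeddings yields h_k,
# the instruction embedding (initial h_0 = c_0 = 0, as in the text).
h = c = np.zeros(n)
for x in rng.standard_normal((8, n)):
    h, c = lstm_cell(h, c, x)
```

Because the same `W` and `b` are reused at every step, the cell's parameters are shared across all positions, exactly as described for Figure 5.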
The output of an RNN is the hidden state vector computed by the application of the RNN cell to the last element in the sequence. If an instruction has $k$ tokens with corresponding input vectors $x_1, \ldots, x_k$, then the output of the RNN is given by:

$h_k$, where $h_i = \text{Cell}(h_{i-1}, x_i)$.
The initial cell application at $i = 1$ uses an initial hidden state vector, $h_0$.
3.2.2 Graph Construction.
Given each instruction’s embedding, Ithemal then constructs the final dataflow embedding, which includes the basic block’s graph structure. The categorization of each canonicalized instruction’s operands into source and destination operands is sufficient to construct a standard dataflow graph, where each node in the graph corresponds to an instruction. Ithemal therefore maps the basic block’s dataflow graph to its dataflow embedding by mapping each node in the dataflow graph to its instruction embedding.
Ithemal’s prediction stage takes as input a dataflow embedding and performs prediction. Given that the dataflow embedding is a directed graph, we leverage a DAG-Recurrent Neural Network.
Our DAG-RNN based approach is a close cousin of the RNN approach presented in Section 3.2. Specifically, while an RNN processes a linear sequence of elements, our DAG-RNN processes a directed acyclic graph. The key difference is therefore that each element may have multiple predecessors in the graph, as opposed to a single predecessor as in a sequence.
The implementation difference is that, for a given node, the DAG-RNN first reduces the multiple hidden state vectors of the node's predecessors into a single vector, and then applies an RNN cell to the resulting reduced hidden state.
Reducing Multiple Predecessors.
A reduction function takes as input a variable number of hidden states from a node's predecessors and produces an aggregated hidden state. Two important properties of a reduction function are that it should be commutative and differentiable. Commutativity ensures that the DAG-RNN’s semantics is invariant to the order in which it evaluates a node’s predecessors. Differentiability ensures that the entire DAG-RNN can still be trained with gradient-based optimization (i.e., Stochastic Gradient Descent). Ithemal uses element-wise max as its reduction function.
Concretely, we summarize the computation of a DAG-RNN as follows. Let the cell used by the DAG-RNN be Cell and the reduction function used for predecessors be $R$. For an instruction $I$, the output of the DAG-RNN for its node is computed as:

$h_I = \text{Cell}(R(h_{P_1}, \ldots, h_{P_m}), e_I)$
where $P_1, \ldots, P_m$ are the predecessor instructions of $I$ in the dataflow graph, $h_{P_j}$ is the corresponding hidden state output of the DAG-RNN for instruction $P_j$, and $e_I$ is the instruction embedding for instruction $I$. Note that this computation is recursive; the base case is for instructions without predecessor instructions. As for an RNN, these Cell applications use a global initial hidden state vector, $h_{start}$, where start denotes a distinguished starting instruction for each basic block. We also note that the cell used in our DAG-RNN is distinct from that of the RNN, with a separate set of learnable parameters.
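The recursive computation above amounts to a topological-order traversal with a per-node reduction. The sketch below uses a stand-in cell (a simple tanh mixing) rather than an LSTM, and assumes instructions are already listed in program (hence topological) order:

```python
import numpy as np

n = 4

def cell(h_prev, x):
    # Stand-in for the DAG-RNN's LSTM cell: any function of (state, input).
    return np.tanh(h_prev + x)

def dag_rnn(embeddings, preds, h_start):
    """embeddings[i]: instruction embedding e_I for instruction i.
    preds[i]: indices of i's predecessors in the dataflow graph."""
    hs = []
    for i, e in enumerate(embeddings):
        if preds[i]:
            # Element-wise max is commutative, so predecessor order is irrelevant.
            h_in = np.max([hs[p] for p in preds[i]], axis=0)
        else:
            h_in = h_start            # base case: no predecessors
        hs.append(cell(h_in, e))
    return hs

rng = np.random.default_rng(2)
hs = dag_rnn(rng.standard_normal((3, n)), [[], [0], [0, 1]], np.zeros(n))
```

Each entry of `hs` summarizes the subgraph reachable backwards from that instruction, which is what the final prediction stage consumes.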
Given the output of the DAG-RNN, Ithemal next estimates the basic block’s throughput. One key concern when processing the output of the DAG-RNN computation step is that the computation may produce multiple hidden state vectors. Specifically, the dataflow graph – and therefore the dataflow embedding – of the basic block may have multiple disconnected subgraphs, because the basic block can contain multiple independent sequences of instructions. To reduce these multiple vectors to a single vector, we introduce a reduction function (similar to that used for reducing multiple predecessors in the DAG-RNN). We use element-wise max as the reduction function for multiple sinks in our final model.
Ithemal’s final step is to compute its throughput estimate using linear regression. Specifically, given a reduced hidden state vector $h$, Ithemal computes $w \cdot h + b$, where $w$ is a vector of parameters and $b$ is a bias. This computation produces a final, real-valued number that denotes the network’s throughput prediction.
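The sink reduction and regression step is small enough to state directly; the concrete vectors below are arbitrary illustrative values:

```python
import numpy as np

def predict(sink_states, w, b):
    # Reduce the hidden states of the disconnected subgraphs' sinks
    # with element-wise max, then apply the linear regression w.h + b.
    h = np.max(sink_states, axis=0)
    return float(w @ h + b)

w = np.array([1.0, 2.0, 0.5])
result = predict([np.array([1.0, 0.0, 2.0]),
                  np.array([0.0, 3.0, 1.0])], w, b=0.5)
# max gives [1, 3, 2]; 1*1 + 2*3 + 0.5*2 + 0.5 = 8.5
```

Because the output is a single unconstrained real number, the same head serves any microarchitecture; only the learned parameters differ.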
The following aspects of Ithemal’s model are hyperparameters, i.e., choices we made in its design.
Token Embedding Size. Each token embedding is a vector of size 256.
Hidden State Size. Each hidden state, in both the RNN and the DAG-RNN, is a vector of size 256.
Initial Hidden State Value. We initialize the initial hidden state values $h_0$ and $h_{start}$ to the zero vector.
Hidden State Reductions. We reduce multiple hidden states using element-wise max.
Cell Implementation. Both the RNN and the DAG-RNN cells are LSTM cells. We note that the parameters of the LSTM cell in the RNN are distinct from (and therefore not shared with) the parameters of the LSTM cell in the DAG-RNN.
In principle, a user can modify these parameters to suit the dataset at hand. However, in practice, the above choices were our initial choices during Ithemal’s design, made prior to evaluating Ithemal’s performance on the data. Given Ithemal’s high precision across multiple microarchitectures with no changes to these parameters, we have not found the need to change them or to evaluate Ithemal’s sensitivity to their values.
4 Data and Training
Ithemal’s training approach requires training data in the form of a set of x86-64 basic blocks of which each is labeled with its true throughput value. We collect the set of basic blocks from a corpus of applications and use runtime profiling to calculate precise throughput values for each basic block.
| Benchmark suite | Description | Binaries | Total BBs | Timed BBs |
|---|---|---|---|---|
| Linux Shared Libraries | Linux loaders, standard library and other utilities | 14 | 19069 | 15021 |
| SPEC2006 | benchmark suite with compilers, chess engines, video compression and various simulation applications; commonly used for benchmarking compilers | 31 | 254583 | 192818 |
| SPEC2017 | similar to SPEC2006, but with a larger codebase and variety | 47 | 648309 | 453998 |
| NAS | benchmarks with stencil computations (dense loops) | 8 | 4332 | 2687 |
| polybench-3.1 | polyhedral compilation test suite (dense loops) | 30 | 1915 | 1218 |
| TSVC | suite for testing compiler auto-vectorization | 2 | 5191 | 2893 |
| cortexsuite | computer vision workloads including neural networks | 7 | 6398 | 4975 |
| simd | heavily hand-vectorized image processing library (exposes many SSE2, AVX, AVX2 instruction variants) | 113 | 215096 | 172151 |
Table 7 summarizes the set of programs/benchmark suites in our dataset. We designed the dataset to include real world benchmarking programs as well as programs that exercise a wide variety of x86-64 instructions, such as different vector instructions sets.
We compiled each program using GCC 4.9.4 at the -O3 optimization level on a Haswell machine. We use DynamoRIO, a dynamic binary instrumentation tool, to dynamically dump the textual representation (in both AT&T and Intel syntax) as well as the canonicalized form of the executed x86-64 basic blocks. We run the benchmarks using the standard inputs supplied with them.
4.2 Throughput Profiling
To collect throughput numbers, we profile the execution of a loop that executes each basic block in isolation 100 times. We randomly initialize the registers at the beginning of the loop’s execution. We measure the throughput in terms of the core clock cycles performance counter using the timing script developed by Agner Fog (https://www.agner.org/optimize/testp.zip). This timing script is used by compiler writers, for example LLVM developers, to validate per-instruction throughput values.
We redirect all memory references to a globally defined array by adding a constant offset to each memory access that points to the start of the array. This technique enables us to redirect memory accesses to valid regions of memory and therefore reduce the number of segmentation faults that result from executing a basic block in isolation, disembodied from the application and its state at the time the basic block would otherwise execute.
However, this technique is not perfect because profiling can still (and does) result in segmentation faults. For example, if a memory access in a basic block has a large stride between iterations, then the access can eventually exceed the length of our allocated buffer and, therefore, result in a fault.
We profile each basic block 8 times and take the mode of the measured throughput values. This method captures consistent memory effects, such as prefetching and caching, modulo transient behavior such as cold misses. We want the measured throughput numbers to capture the steady-state memory access patterns that exist when these basic blocks are executed many times, especially inside the hot inner loops of a given program.
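Taking the mode rather than the mean is what discards the transients: a single cold-miss-inflated run does not shift the reported value. A toy illustration with hypothetical cycle counts:

```python
from statistics import mode

# Eight hypothetical measurements of one block: the first run pays
# cold-cache misses; the steady-state value dominates the rest.
runs = [412, 305, 305, 306, 305, 305, 307, 305]

print(mode(runs))            # 305 (the steady-state measurement)
print(sum(runs) / len(runs)) # the mean, pulled upward by the outlier
```

With a mean, the 412-cycle outlier would bias the label; the mode reports the value the block settles into.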
This assumption is more realistic than those used in most sophisticated static models in modern compiler scheduling (e.g., LLVM's model, exposed through the llvm-mca tool) and in other machine code static analyzers such as IACA, which assume that all memory values are already in the L1 cache when statically estimating throughput.
Figure 7 presents the breakdown of basic blocks processed by our dataset collection methodology when executed on a CPU with the Haswell microarchitecture (Intel(R) Xeon(R) CPU E5-2680). Data collection takes 3 to 4 days for each microarchitecture.
On Haswell, our methodology produced throughput values for 845761 out of the 1154893 basic blocks. Each of the remaining basic blocks either experienced a segmentation fault or encountered an assembler error due to bugs in the assembly output of DynamoRIO.
The average length of a basic block is 5.428 assembly instructions. Figure 8 presents a histogram of the measured throughput values for 100 repetitions of those basic blocks with throughput values less than 1000 cycles on Intel Haswell microarchitecture.
We further collected throughput values for the Intel Nehalem and Skylake microarchitectures. We used the same set of basic blocks compiled under Haswell to collect throughput values for the Intel Skylake microarchitecture (Intel(R) Xeon(R) Platinum 8175). The Intel Nehalem microarchitecture supports only up to the SSE4.2 instruction set and is more than two generations older than Haswell, which supports up to the AVX2 instruction set. To reflect the different code generation strategies used by compilers targeting the Nehalem microarchitecture, we recompiled the benchmarks/programs listed in Table 7 natively on a Nehalem machine (Intel(R) Xeon(R) CPU X7550) and used the resulting basic blocks to collect throughput data for Nehalem. In summary, we collected throughput values for 814234 and 889737 basic blocks for the Intel Nehalem and Skylake microarchitectures, respectively.
4.3 Training and Testing Methodology
We implemented our neural network model in PyTorch (0.4.0a0+59bda9a), a neural network framework that allows building models with dynamic computation graphs. The learnable parameters in our model include the token embeddings, the LSTM parameters in our networks, and the affine coefficients in the final linear regression. The model is trained with supervision, using the L2 norm as the loss function. We partition the dataset into two sets: 80% for training (the training set) and 20% for testing (the test set). To ensure a fair distribution of basic blocks from different programs in each set, we randomize the composition of the two sets.
We use Stochastic Gradient Descent to train the model. For each batch, we randomly sample 1000 basic blocks from the training set. However, since the dataflow graph of each basic block can differ, we feed the sampled blocks to the model one at a time (an effective batch size of 1). We train the network for 5 epochs, sampling batches for each epoch. Typical total training time is around 10 hours per model.
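The sampling and per-example SGD scheme can be sketched in plain Python. The `AdditiveCostModel` below is a toy stand-in for the actual network (all names and the batch counts are ours, for illustration), but the batching structure mirrors the description above: sample a batch, then take one gradient step per basic block on the squared (L2) error.

```python
import random

class AdditiveCostModel:
    """Toy stand-in for the network: one learnable cost per opcode,
    with a block's prediction being the sum of its opcode costs."""
    def __init__(self):
        self.cost = {}

    def predict(self, block):
        return sum(self.cost.get(op, 0.0) for op in block)

    def step(self, block, residual, lr):
        # Gradient of 0.5 * residual^2 w.r.t. cost[op] is residual times
        # the multiplicity of op in the block; the loop handles repeats.
        for op in block:
            self.cost[op] = self.cost.get(op, 0.0) - lr * residual

def train(blocks, targets, model, lr=0.1, epochs=5,
          batches_per_epoch=10, batch_size=1000):
    """Per epoch, sample random batches of basic blocks, then step the
    model one block at a time (effective batch size 1)."""
    for _ in range(epochs):
        for _ in range(batches_per_epoch):
            idx = random.sample(range(len(blocks)),
                                min(batch_size, len(blocks)))
            for i in idx:
                residual = model.predict(blocks[i]) - targets[i]
                model.step(blocks[i], residual, lr)
```

With consistent linear data (e.g., `add` costs 1 cycle, `mul` costs 3), this toy model recovers the per-opcode costs; the real model replaces `AdditiveCostModel` with the LSTM-based network and autograd.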
We use the normalized error to evaluate the performance of our trained model on a given test example, defined as

normalized error = |measured throughput - predicted throughput| / measured throughput.
We report the average normalized error as the accuracy metric for the test set. Further, we calculate Pearson's correlation coefficient (r) between the measured and predicted throughput values as an additional accuracy metric.
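Both metrics are straightforward to compute; a minimal sketch in plain Python (function names are ours):

```python
import math

def normalized_error(measured, predicted):
    """Normalized error for one test example:
    |measured - predicted| / measured."""
    return abs(measured - predicted) / measured

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A measured value of 100 cycles predicted as 90 gives a normalized error of 0.1; a perfectly linear relationship between measured and predicted values gives r = 1.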
We evaluate our neural network model against state-of-the-art hand-written machine models used in machine code analyzers. Concretely, these include the machine model used by LLVM (7.0.0svn), exposed through the tool llvm-mca, and the closed-source machine code analyzer IACA (v3.0-28-g1ba2cbb), developed by the processor vendor Intel. These models capture most of the complexities of modern processors (pipelined, superscalar, out-of-order units) while maintaining relatively high prediction speeds. We show that our data-driven model is more accurate (Section 5.1) as well as faster (Section 5.3) than these sophisticated hand-written models. Further, we show that our model is portable across different microarchitectures in Section 5.2.
We compared Ithemal with LLVM’s llvm-mca and Intel’s IACA, both of which implement sophisticated machine models. We evaluated the accuracy of each model against the actual throughput values for Intel Nehalem, Haswell and Skylake microarchitectures. We trained Ithemal using the data collected under all three microarchitectures without changing its structure.
The test set for each microarchitecture is constructed from 20% of the data available under that microarchitecture, using the method specified in Section 4.3. The test sets contain 162847, 169512, and 177947 basic blocks under the Intel Nehalem, Haswell, and Skylake microarchitectures, respectively. The latest Intel IACA version (the version we use) does not support throughput estimation under Nehalem; hence, we evaluate accuracy only for Ithemal and llvm-mca under the Nehalem microarchitecture. Further, Intel's IACA predicts much more slowly than both llvm-mca and our model, and in certain cases does not finish analyzing large basic blocks (repeated 100 times) in an acceptable amount of time. Therefore, to make data collection feasible, we limited comparisons to basic blocks whose analysis finishes within an acceptable time limit (60s). A total of 129395 and 132419 basic blocks finished analysis within this time limit using IACA for the Intel Haswell and Skylake microarchitectures, respectively. We use only these basic blocks when comparing the relative performance of each prediction method. The throughput distribution of the basic blocks used for comparing models under each microarchitecture is shown in Appendix B.
We report the average errors and correlations for each model in Table 10. It is evident that our learnt model is better at predicting the throughput of basic blocks, both in terms of average normalized error and correlation. Moreover, in terms of average error, the neural network model is almost twice as accurate as the hand-written models across all three microarchitectures. Upon investigation, we found that our model predicts closer approximations of the actual throughput than both IACA and LLVM for 70.35% of the basic blocks.
We present heatmaps of actual and predicted values in Figure 9 for basic blocks with throughput values less than 1000 cycles (for 100 iterations) for each prediction method under the Haswell microarchitecture. Appendix A shows the full set of heatmaps for each prediction method under each microarchitecture. To generate each heatmap, we binned the actual and predicted data into 2500 (50 x 50) bins, with 50 bins per axis. Each time an (actual, predicted) pair of values hits one of the bins, we increment a counter. The color in each heatmap represents the density (the counter value) of a particular bin with reference to the colorbar shown to the right of the heatmap. Note that we use a log scale since the throughput distribution is skewed. To help interpret these graphs: an oracle that perfectly estimates throughput would have all its points lying on the unit gradient line (y = x). Therefore, we expect models with high density near the unit gradient line to be better.
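The binning procedure can be sketched as follows (the function name and parameter defaults are ours, chosen to match the 50 x 50 grid over the 0-1000 cycle range; a plotting library would then render the counts on a log color scale):

```python
def bin_counts(actual, predicted, lo=0.0, hi=1000.0, nbins=50):
    """Bin (actual, predicted) pairs into an nbins x nbins grid of counts.
    Pairs with either coordinate outside [lo, hi) are dropped."""
    width = (hi - lo) / nbins
    grid = [[0] * nbins for _ in range(nbins)]
    for a, p in zip(actual, predicted):
        if lo <= a < hi and lo <= p < hi:
            i = int((a - lo) / width)  # row: actual throughput bin
            j = int((p - lo) / width)  # column: predicted throughput bin
            grid[i][j] += 1
    return grid
```

A perfect predictor would put all counts on the diagonal of this grid, i.e., the unit gradient line of the heatmap.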
We see a higher and sharper density near the unit gradient line for Ithemal compared to both llvm-mca and IACA. Notice that llvm-mca has high-density horizontal lines spread across various predicted throughput values. This shows that llvm-mca suffers inaccuracies in its model across a wide range of actual throughput values. IACA, on the other hand, very rarely predicts high throughput values, and we can see sharp horizontal lines near low throughput predictions. IACA predicts more accurately than llvm-mca for smaller basic blocks, but suffers heavily when analyzing larger basic blocks. This is confirmed by the average error each system exhibits for each throughput range, shown in Figure 11 for the Haswell microarchitecture. Note that Figure 11 uses the same set of bins as the heatmaps. Ithemal exhibits lower errors for basic blocks with lower throughput values and is significantly better on larger basic blocks compared to IACA and llvm-mca. This shows the robustness of the data-driven model. Average error curves for the other microarchitectures are shown in Appendix B.
| Microarchitecture | Method | Average Error | Correlation (r) |
We trained Ithemal’s neural network model without any changes to its structure using the data collected for Nehalem, Haswell and Skylake processors. Table 10 summarizes the average error and cross correlations for Ithemal with the measured values under each microarchitecture.
Ithemal can learn to predict throughput values under each microarchitecture, with a maximum average error of 0.1053, exhibited on the Haswell microarchitecture. Compared to hand-written models, which require rewrites for each processor generation, our model only needs measured data for a given processor and requires minimal human intervention. More specifically, data collection for a given microarchitecture took us 3 to 4 days and training took around 10 hours. Both of these tasks are automated and do not require human intervention, whereas writing and maintaining processor models in LLVM requires considerable human effort and expertise. For example, the X86SchedHaswell.td file, which describes the scheduling model for Haswell, has more than 100 commits, each fixing or enhancing its model.
Maintenance and support of hand-written models is tedious, and this may be a reason why IACA does not support throughput estimation for the Intel Nehalem microarchitecture. Ithemal, on the other hand, is learnt from data and does not require manual maintenance of processor models. Its accuracy for a given microarchitecture can be maintained or increased by using more data.
Appendix A shows the heatmaps of the predicted and measured values under Ithemal for each microarchitecture, for basic blocks with throughput values less than 1000 cycles (for 100 iterations). It is evident that our model is almost equally good at learning to predict throughput values under each microarchitecture. This demonstrates the portability of our neural network architecture.
Table 12 lists the throughput, in terms of the number of basic blocks predicted per second, for each prediction model. Our neural network model is faster than both llvm-mca and IACA at throughput prediction. This is in spite of the fact that our neural network model is implemented in Python using a research framework. We expect severalfold speedups in prediction when it is ported to a faster implementation in a language like C++.
IACA's prediction rate is not suitable for tasks such as compilation. We noticed that for certain larger basic blocks IACA takes close to a minute to predict throughput, whereas our model as well as llvm-mca predict in less than a second. The reported throughput of IACA excludes basic blocks that take it more than 60s to analyze. That said, both llvm-mca and IACA output a resource usage chart as well as other information, so their prediction speeds may be affected by these extra calculations.
| Method | Throughput (BB / second) |
6 Neural Network Architecture Exploration
We evaluated a number of neural network architectures with varying levels of structure and complexity before arriving at Ithemal's network architecture (Section 3.3). While, in theory, a neural network with a single hidden layer can approximate any continuous function with sufficient accuracy, in practice, networks whose structure more closely matches that of the task at hand require less data or time (or both) to train.
Figure 13 presents two additional neural network architectures that we explored, listed in increasing order of complexity. These networks are compatible with the task at hand and are logical applications of network architectures from other domains (e.g., natural language processing) to ours. However, our exploration shows that the networks learn various concepts in our domain at different rates and to different accuracy levels.
The Sequential RNN (Figure 13 (a)) consumes all tokens in a basic block using a single RNN, with no distinction between instructions. The Hierarchical RNN (Figure 13 (b)) has two RNNs, one for tokens and one for instructions (which consumes the output of the token-based RNN). Essentially, this has the same hierarchical composition as Ithemal's graph-based model (Section 3.3), but we have replaced the DAG-RNN, which captures the data dependencies between instructions, with a standard RNN.
To compare these two models with Ithemal’s model, we evaluated these models on three tasks.
Simple In-Order Model: learning a simple (non-pipelined, scalar) in-order processor model. We created an artificial dataset by giving each instruction a fixed, random latency cost; the total cost of a basic block is the sum of the costs of its individual instructions. Abstractly, this is a linear model, so the neural network should learn to solve a set of linear equations.
Out-of-Order Model: learning an out-of-order processor that has no additional physical registers but also has no resource constraints (e.g., execution port restrictions); this model executes all instructions with maximum instruction-level parallelism. We created an artificial dataset using artificial per-instruction costs, but now the cost of a basic block is the sum of the costs of the instructions on the basic block's critical path. Abstractly, the model should learn dependencies, in addition to learning how to solve linear equations.
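Both artificial cost models can be sketched directly. The block encoding below (an opcode plus the indices of earlier instructions it depends on) and the function names are ours, for illustration:

```python
def inorder_cost(block, latency):
    """Simple in-order model: the block's cost is the sum of the
    per-instruction latency costs."""
    return sum(latency[op] for op, _ in block)

def out_of_order_cost(block, latency):
    """Idealized out-of-order model: the block's cost is the length of the
    longest (critical) path through its data-dependence DAG.

    Each instruction is (opcode, deps), where deps are indices of earlier
    instructions it depends on; instructions appear in program order.
    """
    finish = []
    for op, deps in block:
        start = max((finish[d] for d in deps), default=0)
        finish.append(start + latency[op])
    return max(finish, default=0)
```

For example, with latencies `{"add": 1, "mul": 3}` and the block `[("add", []), ("mul", []), ("add", [0, 1])]`, the in-order cost is 5, while the out-of-order cost is 4 because the two independent leading instructions overlap and only the critical path through the `mul` matters.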
We performed training for each network and task pair as specified in Section 4.3.
Figure 14 shows how the training loss changes for each network across the first 300 sampled batches. The hierarchical and sequential RNNs are both clearly better at learning the simple, in-order model, whereas the graph neural network is much slower and saturates at a higher training loss. Essentially, the sequential architectures are better at learning static costs and then performing addition. However, notice that for the out-of-order model and for the Haswell model, the graph neural network outperforms the other two networks. Both tasks need to reason about instruction order invariance, which is explicit in the out-of-order model and implicitly true for the Haswell model due to out-of-order execution in modern processors.
We derive two important conclusions from our neural network architecture search: different networks are better at learning different tasks, and providing known structure as a prior can give a network the inductive bias it needs to learn higher-level concepts.
7 Related Work
DAG-RNNs and Graph Neural Networks
Neural networks with generic graph-based structures have been used in NLP tasks, in particular to model relations among words in sentences [29, 12]. In the programming languages field, high-level programs have been represented as Gated Graph Neural Networks to perform tasks such as identifying variable misuses and variable completion, targeting constructs present in high-level programming languages. Graph neural networks have also been used to find binary similarity across different execution platforms. Compared to these approaches, our representation is simpler: we model basic block code sequences based on their dataflow graphs and use a DAG-RNN, inspired by its uses in NLP problems [38, 44], to predict throughput values.
Instruction Throughput Estimation
Apart from state-of-the-art tools like llvm-mca and IACA, other analytical models exist for instruction throughput estimation. Some perform throughput estimation for multithreaded whole programs. Cycle-accurate simulators such as ZSim and Marss have a high start-up cost and are better suited to coarse-grained simulations.
Learnt Models for Runtime Estimation
To the best of our knowledge, our system is the first to automatically learn to predict the throughput of assembly instruction sequences. Other systems predict the runtimes of monolithic programs with varying levels of manual intervention. GameTime [36, 37] uses SMT solvers to generate inputs and game-theoretic approaches to predict the distribution of runtimes of embedded programs. It uses a cycle-accurate simulator to simulate various program paths for the example inputs generated for each prediction, and it needs to model the execution environment to formulate the game-theoretic adversary. In contrast, Ithemal predicts throughput statically, without simulation or execution, does not need a processor model, and needs to be trained only once per microarchitecture.
Other work introduces sparse polynomial regression to predict the execution time of programs by extracting suitable features of high-level programs. To apply that technique to instruction sequences, one would need to handcraft features again at the instruction level. Our method does not rely on any feature extraction process.
Analytical Models for Runtime Estimation
Various analytical models exist for predicting runtime execution. There are also models for predicting the performance of restricted classes of programs. For example, work on predicting parallel program runtimes includes [33, 1, 16, 4, 13]. Another class of analytical models, especially useful in real-time embedded systems, predicts Worst Case Execution Times (WCET) [14, 21] of programs. However, these models require embedded processor simulation through detailed processor models. Statistical systems have also been developed for WCET estimation. Finally, other analytical models predict the performance of another important class of applications, namely stencil computations.
We present Ithemal, a data-driven system for basic block throughput estimation. Ithemal is the first such data-driven approach to this problem. Moreover, Ithemal’s accuracy surpasses that of state-of-the-art, handwritten analytical models.
Ithemal achieves its accuracy by leveraging a deep neural network that we have designed to capture the out-of-order behavior of modern processors, which is a first-order concern for throughput estimation.
In an emerging future in which processor architectures are more varied and continued improvements in application performance require exploiting detailed features of specialized architectures, Ithemal holds out the promise that future compilation and performance engineering tools can be augmented with data-driven approaches to improve their functionality with limited developer effort.
-  Vikram S. Adve and Mary K. Vernon. Parallel program performance prediction using deterministic task graph analysis. ACM Trans. Comput. Syst., 22(1):94–136, February 2004.
-  Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. Learning to represent programs with graphs. In International Conference on Learning Representations, 2018.
-  Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. In International Conference on Learning Representations, 2017.
-  Vicente Blanco, Jesus A González, Coromoto León, Casiano Rodrıguez, Germán Rodrıguez, and Marcela Printista. Predicting the performance of parallel programs. Parallel Computing, 30(3):337–356, 2004.
-  Derek Bruening, Qin Zhao, and Saman Amarasinghe. Transparent dynamic instrumentation. In Proceedings of the 8th ACM SIGPLAN/SIGOPS Conference on Virtual Execution Environments, VEE ’12, pages 133–144, New York, NY, USA, 2012. ACM.
-  Xi E Chen and Tor M Aamodt. A first-order fine-grained multithreaded throughput model. In HPCA-15 2009. IEEE 15th International Symposium on High Performance Computer Architecture, pages 329–340. IEEE, 2009.
-  Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
-  Intel Corporation. Intel architecture code analyzer, 2017.
-  SPEC Corporation. SPEC CPU2006 benchmark suite, 2006.
-  G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, Dec 1989.
-  Kaushik Datta, Shoaib Kamil, Samuel Williams, Leonid Oliker, John Shalf, and Katherine Yelick. Optimization and performance modeling of stencil computations on modern microprocessors. SIAM review, 51(1):129–159, 2009.
-  B. Dhingra, Z. Yang, W. W. Cohen, and R. Salakhutdinov. Linguistic Knowledge as Memory for Recurrent Neural Networks. ArXiv e-prints, March 2017.
-  Thomas Fahringer and Hans P Zima. A static parameter based performance prediction tool for parallel programs. In Proceedings of the 7th international conference on Supercomputing, pages 207–219. ACM, 1993.
-  Christian Ferdinand, Reinhold Heckmann, Marc Langenbach, Florian Martin, Michael Schmidt, Henrik Theiling, Stephan Thesing, and Reinhard Wilhelm. Reliable and precise wcet determination for a real-life processor. In International Workshop on Embedded Software, pages 469–485. Springer, 2001.
-  Jeffery Hansen, Scott Hissam, and Gabriel A Moreno. Statistical-based wcet estimation and validation. In OASIcs-OpenAccess Series in Informatics, volume 10. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2009.
-  Franz Hartleb and Vassilis Mertsiotakis. Bounds for the mean runtime of parallel programs. In Proceedings of the Sixth International Conference on Modelling Techniques and Tools for Computer Performance Evaluation, volume 92, pages 197–210, 1992.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997.
-  Ling Huang, Jinzhu Jia, Bin Yu, Byung gon Chun, Petros Maniatis, and Mayur Naik. Predicting execution time of computer programs using sparse polynomial regression. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 883–891. Curran Associates, Inc., 2010.
-  Yermalayeu Ihar, Antonenka Mikhail, Radchenko Andrey, Dmitry Fedorov, and Kirill Matsaberydze. Simd library for image processing, 2018.
-  Pytorch LSTM implementation. https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.lstm, 2018.
-  Xianfeng Li, Yun Liang, Tulika Mitra, and Abhik Roychoudhury. Chronos: A timing analyzer for embedded software. Science of Computer Programming, 69(1):56 – 67, 2007. Special issue on Experimental Software and Toolkits.
-  LLVM. llvm-mca, 2018.
-  Saeed Maleki, Yaoqing Gao, Maria J Garzar, Tommy Wong, David A Padua, et al. An evaluation of vectorizing compilers. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pages 372–382. IEEE, 2011.
-  Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
-  NASA Advanced Supercomputing Division. NAS C benchmark suite 3.0, 1991–2014.
-  Christopher Olah. http://colah.github.io/posts/2015-08-understanding-lstms/, 2015.
-  Chang Yun Park. Predicting program execution times by analyzing static and dynamic program paths. Real-Time Systems, 5(1):31–62, 1993.
-  Avadh Patel, Furat Afram, Shunfei Chen, and Kanad Ghose. Marss: a full system simulator for multicore x86 cpus. In Design Automation Conference (DAC), 2011 48th ACM/EDAC/IEEE, pages 1050–1055. IEEE, 2011.
-  Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. Cross-sentence n-ary relation extraction with graph lstms. Transactions of the Association for Computational Linguistics, 5:101–115, 2017.
-  Jeffrey Pennington, Richard Socher, and Christopher D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
-  Louis-Noël Pouchet. The polyhedral benchmark suite, 2012.
-  Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
-  Radu Rugina and Klause Schauser. Predicting the running times of parallel programs by simulation. In ipps, page 0654. IEEE, 1998.
-  Daniel Sanchez and Christos Kozyrakis. Zsim: Fast and accurate microarchitectural simulation of thousand-core systems. In ACM SIGARCH Computer architecture news, volume 41, pages 475–486. ACM, 2013.
-  Eric Schkufza, Rahul Sharma, and Alex Aiken. Stochastic superoptimization. SIGPLAN Not., 48(4):305–316, March 2013.
-  Sanjit A. Seshia and Jonathan Kotker. GameTime: A toolkit for timing analysis of software. In Proceedings of Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 388–392, March 2011.
-  Sanjit A Seshia and Alexander Rakhlin. Quantitative analysis of systems using game-theoretic learning. ACM Transactions on Embedded Computing Systems (TECS), 11(S2):55, 2012.
-  Bing Shuai, Zhen Zuo, Gang Wang, and Bing Wang. Dag-recurrent neural networks for scene labeling. CoRR, abs/1509.00552, 2015.
-  SPEC Corporation. SPEC CPU2017 benchmark suite, 2017.
-  Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proc. NIPS, Montreal, CA, 2014.
-  Tarek M Taha and D Scott Wills. An instruction throughput model of superscalar processors. In Rapid Systems Prototyping, 2003. Proceedings. 14th IEEE International Workshop on, pages 156–163. IEEE, 2003.
-  S. K. Venkata, I. Ahn, D. Jeon, A. Gupta, C. Louie, S. Garcia, S. Belongie, and M. B. Taylor. Sd-vbs: The san diego vision benchmark suite. In 2009 IEEE International Symposium on Workload Characterization (IISWC), pages 55–64, Oct 2009.
-  Xiaojun Xu, Chang Liu, Qian Feng, Heng Yin, Le Song, and Dawn Song. Neural network-based graph embedding for cross-platform binary code similarity detection. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS ’17, pages 363–376, New York, NY, USA, 2017. ACM.
-  Xiaodan Zhu, Parinaz Sobhani, and Hongyu Guo. Dag-structured long short-term memory for semantic compositionality. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 917–926, 2016.
Appendix A Heatmaps of Different Prediction Methods
Figure 15 shows all prediction heatmaps for Ithemal, llvm-mca, and IACA under the Intel Nehalem, Haswell, and Skylake microarchitectures. Note that the latest IACA version does not support the Intel Nehalem microarchitecture, and hence its prediction heatmap is not available.
Appendix B Prediction Errors for Throughput Ranges
Figure 16 shows how the average error varies across throughput ranges for each prediction method under different microarchitectures, for basic blocks with throughput values under 1000 cycles (100 iterations). Throughput values are broken up into 50 equally spaced ranges. The figure also shows the throughput distribution of the basic blocks for which the average error was calculated, using the same ranges. Note that the distribution is skewed, so the total average error across all ranges depends on how well a model predicts values near the mode of the distribution. Under all microarchitectures, Ithemal predicts throughput values near the mode with lower average error than llvm-mca and IACA. Moreover, it maintains a significant edge over the other two methods at higher throughput values. Overall, Ithemal is more robust in its predictions across all throughput ranges than llvm-mca and IACA, which show higher fluctuations.
Also note that llvm-mca is better at predicting throughput values for Nehalem than for Haswell and Skylake. This may be an artifact of the Nehalem model having undergone more refinement; it is an example of more development effort correlating with better hand-written models. For Ithemal, the analogous claim is that with more data a more accurate model may be learnt, without requiring more manual intervention.