1 Introduction
Compilers often rely on performance models for solving optimization problems because collecting performance measurements from a real machine can be expensive, limited by hardware availability, or infeasible (such as during ahead-of-time compilation). For example, LLVM’s loop vectorizer uses a performance model to compute the optimal vectorization and unroll factors LLVMvector, and GCC uses a model to decide when to apply loop peeling, loop versioning, outer-loop vectorization, and intra-iteration vectorization GCCvector. In addition, a performance model can be used by a compiler autotuner to evaluate candidate configurations in a search space AutoTVM; autohalide; PipeDream; roc.
Developing an accurate analytical model of program performance on a modern processor is challenging and can take months of engineering effort. Program performance is tightly coupled with the underlying processor architecture as well as the optimization decisions made during compilation Chaos. Developers of analytical models are often unaware of detailed features of the processor or of effects from all compiler passes. Furthermore, architectural features and the underlying compiler code generation interact in extremely complex ways; manually implementing these interactions and their effects on program performance is tedious and error-prone. The recent proliferation of deep learning accelerators has only exacerbated this problem by demanding rapid, repeated development of performance models targeting new accelerators.
This paper addresses these problems by applying machine learning techniques to produce a performance model. In particular, we are interested in learning a model for predicting execution time of tensor programs on TPUs, which are widely used accelerators for deep learning workloads TPUisca; TPUtraining. We aim to develop a learned approach to performance modeling that satisfies the following key criteria for ease of development and deployment. First, the approach must be general enough to handle nontrivial constructs in tensor programs (e.g., multilevel loop nests common in programs involving high-dimensional tensors). Second, it must generalize across programs of different application domains as well as to programs unseen at training time. Third, it should not rely on well-crafted features that require significant domain expertise and effort to develop and tune. Finally, the approach should be retargetable to different optimization tasks with minimal effort.
While there has been some prior work autohalide; AutoTVM; ithemal proposing learned approaches to performance modeling, to the best of our knowledge, none of them satisfy the four criteria stated above. For instance, Ithemal ithemal does not handle complex multilevel loop nests. While Halide’s learned performance model can handle tensor programs autohalide, it requires heavy feature engineering. Although AutoTVM’s models do not rely entirely on manually engineered features AutoTVM, they show limited ability to generalize across kernels.
Like prior work, we formulate the runtime estimation problem as a regression task. However, we make specific architectural choices to satisfy the desiderata. First, our approach represents tensor programs as data-flow graphs with nodes that represent operations and edges that represent tensor flows between nodes. Second, we use a graph-based neural network optionally coupled with a sequence model; the graph model ensures generalizability across different programs, while the sequence model captures long-range dependencies within a graph. Third, we directly encode operation properties to generate a feature vector for each node in the graph. While our approach does not require any program analyses, adding manually engineered features as additional inputs is trivial. Our approach is retargetable to different tensor graph optimization tasks. We evaluate our performance model on its ability to predict runtimes for two tasks: tile-size selection and operator fusion. The model is applied to evaluate program configurations generated by an autotuner for the Accelerated Linear Algebra (XLA) compiler XLA, as depicted in Fig. 2.
In summary, we make the following contributions:

We develop a learned performance model for tensor programs that does not require feature engineering, generalizes to unseen programs, and is retargetable for different compiler optimization tasks.

We show that our learned models achieve 96.3% and 95.5% accuracy with respect to true measurements, and 2.4% and 26.6% better accuracy than the best hand-tuned model, for the tile-size and fusion tasks, respectively.

We conduct a comprehensive set of ablation studies over modeling choices.

We integrate our learned performance model into an XLA autotuner, and demonstrate that it helps in discovering faster programs when access to real hardware is limited or expensive, which is often true in practice.
2 Target Hardware and Tasks
Our approach to learning a performance model is applicable to any target processor executing tensor programs. A tensor program can be represented as a computation graph, which is acyclic and directed. A node in a computation graph represents a tensor operation, processing one or more input tensors into a single output, and an edge connects an output tensor from one node to an input tensor of another node.
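To make this representation concrete, here is a toy sketch (in Python with NumPy) of a four-node computation graph encoded as node features plus a directed adjacency matrix. The opcode ids and the feature layout are illustrative assumptions for this sketch, not XLA’s actual encoding:

```python
import numpy as np

# A toy computation graph: param -> exp -> add <- param
# Node features here are illustrative (opcode id + output shape),
# not XLA's real feature encoding.
OPCODES = {"parameter": 0, "exponential": 1, "add": 2}

nodes = [
    ("parameter", (8, 128)),    # node 0
    ("exponential", (8, 128)),  # node 1, consumes node 0's output
    ("parameter", (8, 128)),    # node 2
    ("add", (8, 128)),          # node 3, consumes nodes 1 and 2
]
edges = [(0, 1), (1, 3), (2, 3)]  # (producer, consumer)

n = len(nodes)
adjacency = np.zeros((n, n), dtype=np.int8)
for src, dst in edges:
    adjacency[src, dst] = 1  # directed: tensor flows src -> dst

# One feature row per node: opcode id followed by the output shape.
node_features = np.array(
    [[OPCODES[op], *shape] for op, shape in nodes], dtype=np.float32
)

assert adjacency.sum() == len(edges)
assert node_features.shape == (4, 3)
```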
To evaluate our method, we build a learned model to predict runtimes of XLA programs on a TPU. XLA is a machine learning compiler for multiple hardware targets and is used by various machine learning programming frameworks. XLA first performs high-level optimizations at the whole-program level. During this stage, some nodes (primitive operations) in the original computation graph may be merged into a fused node, called a kernel, as illustrated in Fig. 2. After that, XLA lowers each kernel into a low-level representation, which is then further optimized and compiled to machine code. In this paper, we evaluate on two optimization tasks: tile-size selection (a kernel-level optimization applied during lowering) and operator fusion (a program-level optimization).
2.1 Tensor Processing Unit
Tensor Processing Units TPUtraining are fast, energy-efficient machine learning accelerators. They achieve high performance by employing systolic-array-based matrix multiplication units. The architecture incorporates a vector processing unit, a VLIW instruction set, 2D vector registers, and a transpose/reduction/permute unit. Programs can access the High Bandwidth Memory (HBM) or the faster but smaller on-chip, software-managed scratchpad memory. While a TPU has no out-of-order execution, it relies heavily on instruction-level parallelism, arranged by the compiler backend across several passes (including critical-path scheduling and register allocation), which makes performance modeling challenging. TPUs do not support multithreading; one kernel is executed at a time.
The design of a TPU allows us to compute the runtime of an entire program by summing the runtimes of its kernel executions. We expect our approach to work best for targets where the runtime of one kernel is independent of others (e.g., no overlapping kernel executions and no inter-kernel caching). For example, prior work has shown that this approach is sufficiently accurate for autotuning graph rewrites MetaFlow and parallelization configurations FlexFlow; PipeDream on GPUs.
We evaluate our approach on TPUs v2 and v3 to demonstrate its generalizability across different generations of hardware. TPU v3 has higher memory bandwidth and twice as many matrix multiplier units compared to TPU v2.
2.2 Optimization Tasks
Tile-Size Selection
To generate efficient code, XLA utilizes the fast scratchpad to store data. Because of the limited scratchpad memory, a kernel cannot consume its whole input or compute its entire output at once. Instead, it computes one piece of its output at a time from one or more pieces of its inputs. These pieces are called tiles. An output tile is copied to the slower HBM before the next tile is computed. The goal of tile-size selection is to choose a tile size that minimizes kernel runtime.
Operator Fusion
Operator fusion merges multiple operations into a single unit. Before this pass, a node in a computation graph is a primitive tensor operation (e.g., convolution, element-wise add, etc.). When producer and consumer nodes are fused, intermediate data is stored in scratchpad memory without being transferred to or from HBM, thereby reducing data communication. After the fusion pass, a node in a computation graph is either a single primitive operation or a fused operation comprising many primitive operations.
2.3 Existing Analytical Model
For tile-size selection, XLA enumerates all possible tile sizes and selects the best according to a heavily hand-tuned analytical performance model. This model estimates the kernel’s data transfer time and computation time, and takes the maximum of the two. Tile-size selection happens prior to code generation, so the model relies on several heuristics that may cause inaccuracy. While the model works well in practice, the approach has the following drawbacks: (a) execution behaviors arising from poorly understood architectural characteristics are missing from the model, implying missed opportunities; (b) changes in the code generation result in a constant need to update the heuristics; and (c) each new hardware generation requires not only tuning of existing heuristics but also additional modeling. Details can be found in Appendix A.
Unlike tile-size selection, XLA does not use a precise performance model for the fusion task. Instead, it relies on estimates of whether including each node into a fused group will save memory space and access time. It then prioritizes fusion decisions according to these estimates.
2.4 Autotuning
Instead of relying on the compiler’s heuristics and analytical performance model, an XLA autotuner has been developed to search for the fastest tile size for each kernel, and the fastest fusion configuration for each XLA program. The autotuner found up to 25% speedup over the compiler’s default on some production deep learning models. However, the autotuning process involves exhaustively running each kernel with all valid tile sizes (ranging from 2 to 500,000 options) and exploring an exponentially large space of fusion configurations for each program. This requires many evaluations, each of which is slow due to the time spent in compilation and execution. An accurate performance model can provide a cheap and reliable estimate of the runtime, significantly reducing the time and resource requirements of the autotuning process.
3 Model Design
Our approach decomposes an XLA program into smaller computation subgraphs (kernels) whose runtimes are predicted with a neural network. The estimated program runtime is the sum of its kernel runtimes. Predicting kernel runtimes instead of the whole program’s runtime has multiple benefits. First, this decomposition is general enough that we can apply the neural network model to various tasks, including both program- and kernel-level optimizations. Second, predicting runtime at a low-level representation should be more accurate, as the model does not have to capture what happens inside the high-level compiler. Additionally, kernels are smaller than whole programs, simplifying the model’s domain. The rest of this section focuses on the three main components for predicting a kernel runtime: model inputs, model architecture, and training objectives.
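The decomposition itself is straightforward; a minimal sketch, in which the `predict_kernel_runtime` callable stands in for the learned model:

```python
def predict_program_runtime(kernels, predict_kernel_runtime):
    """Estimate total program runtime as the sum of predicted kernel runtimes.

    `predict_kernel_runtime` stands in for the learned model; any callable
    mapping a kernel to a runtime estimate (e.g., in microseconds) works.
    """
    return sum(predict_kernel_runtime(k) for k in kernels)

# Toy usage with a stand-in "model" that predicts from a precomputed table.
fake_model = {"conv.1": 120.0, "fused_add_relu": 3.5, "reduce.2": 14.0}.get
total = predict_program_runtime(["conv.1", "fused_add_relu", "reduce.2"], fake_model)
assert total == 137.5
```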
3.1 Model Inputs
A model input is a kernel represented as node features, whole-kernel features, and an adjacency matrix (highlighted yellow, red, and blue respectively in Fig. 3). Node features include the integer-valued type of the operation (opcode), as well as scalar features that further describe the node’s behavior, such as output tensor shape, tensor layout, striding, padding, and, when applicable, convolution filter size. (Integers are cast to reals. Features are independently scaled to be in the range [0, 1] using the minimum and maximum observed in the training set.) A kernel’s inputs are expressed by nodes with the parameter opcode, and its outputs are expressed via an extra feature associated with the output nodes. Kernel features include the tile size (only for the tile-size selection task) and optional static performance information. The adjacency matrix captures data-flow dependencies between nodes in the kernel, as shown in Fig. 2.
Optional Static Performance Features
The XLA compiler has static analyses that determine high-level performance metrics of a given kernel. In addition to the features extracted directly from the program representation, we consider providing information from these analyses as additional inputs to the model. These features are optional, but may improve the model’s accuracy. We consider four such kernel features: (1) number of floating point operations, (2) amount of data read in bytes, (3) amount of data written in bytes, and (4) number of instructions executing on a special functional unit. These are estimates because the static analyses do not precisely model the compiler’s backend code generation. The static performance features of the same kernel with different tile sizes are identical.
Variable-Sized Features
Many node and kernel features are naturally interpreted as variable-length lists of numbers, because tensors are multi-dimensional arrays and some features describe individual tensor dimensions. For example, the tile size is encoded as a vector whose components correspond to tensor dimensions. We encode these features as fixed-size subvectors, padded or truncated as necessary. Additionally, we include the sum and product of all the values. Including the product is critical, as it usually represents the volume of a tensor and can be more predictive in cases where the feature has been truncated and the product could not otherwise be recovered by the model.
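A minimal sketch of this encoding; the sub-vector size of 6 and the zero padding are illustrative assumptions rather than the paper’s actual settings:

```python
import numpy as np

def encode_variable_feature(values, size=6, pad=0.0):
    """Pad/truncate a variable-length feature to `size` entries and append
    the sum and product of *all* original values.  The product often
    approximates tensor volume even after truncation.  `size=6` is an
    illustrative choice, not the paper's setting."""
    values = list(values)
    fixed = (values + [pad] * size)[:size]
    return np.array(fixed + [sum(values), float(np.prod(values))],
                    dtype=np.float32)

v = encode_variable_feature([4, 8, 128], size=6)
assert v.shape == (8,)
assert v[-2] == 140.0   # sum of dimensions
assert v[-1] == 4096.0  # product ~ tensor volume
```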
3.2 Model Architecture
Figure 3 depicts the architecture of our model. We apply a Graph Neural Network (GNN) to capture local structural information and then apply a reduction over node embeddings to generate a kernel embedding, which is in turn used to predict the final runtime. We explore different choices for the reduction, including sequence and attention models that can capture global graph information.
Node and Kernel Features
The opcode of an operation is categorical, so we follow best practices and map it to a learned vector of parameters called an opcode embedding. Opcode embeddings are concatenated with node features before being passed, along with the adjacency matrix, to a GNN. Kernel features are duplicated and concatenated with the node feature vectors (‘option 1’ in Fig. 3).
Node Embedding
We use a GNN to combine information from a node and its neighbors to generate the node’s representation. We use a GNN because (i) a tensor computation kernel is naturally represented as a graph, and (ii) learning node representations conditioned only on their own features and local neighborhoods has been shown to improve generalization in other settings. We believe that local neighborhoods capture information that is important for estimating runtime. For example, node features include the output tensor shape but not the input tensors’ shapes, because operations can have variable numbers of inputs. With a GNN, the model can receive input shape information from a node’s neighbors.
Our model builds on the GraphSAGE architecture graphsage. We selected GraphSAGE since it is one of the simpler GNN formulations that has been used successfully in inductive tasks. The GraphSAGE embedding of node $v$ considering $k$-hop neighbors can be computed as follows:
$\epsilon_v^{(k)} = l_2\!\left( f_1^{(k)}\!\left( \mathrm{concat}\!\left( \epsilon_v^{(k-1)},\; r\!\left( \{ f_2^{(k)}(\epsilon_u^{(k-1)}) : u \in \mathcal{N}(v) \} \right) \right) \right) \right)$
for all $k > 0$, and $\epsilon_v^{(0)}$ is the node’s feature vector otherwise. Here: $f_1^{(k)}$ and $f_2^{(k)}$ denote feedforward layers specific to depth $k$; $l_2$ denotes L2 normalization; $\mathcal{N}(v)$ is the set of immediate neighbors of node $v$; and $r$ is a reduction chosen during hyperparameter search.
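As an illustration, one hop of a GraphSAGE-style update can be sketched as follows. The layer shapes, ReLU nonlinearity, sum reduction, and undirected neighborhood here are assumptions for the sketch; the actual model distinguishes edge directions and tunes the reduction:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def graphsage_layer(h, adjacency, w_self, w_neigh):
    """One GraphSAGE-style hop (a sketch; real layers use learned
    feedforward networks and a tunable neighbor reduction).
    h: (n, d) node embeddings; adjacency: (n, n) 0/1 matrix.
    Neighbors here are nodes connected by an edge in either direction."""
    neighbors = np.maximum(adjacency, adjacency.T)
    neigh_sum = neighbors @ h  # sum-reduction over neighbor embeddings
    combined = np.concatenate([h @ w_self, neigh_sum @ w_neigh], axis=-1)
    return l2_normalize(np.maximum(combined, 0.0))  # ReLU + L2 normalization

n, d_in, d_out = 4, 8, 16
h0 = rng.normal(size=(n, d_in))
adj = np.zeros((n, n))
adj[0, 1] = adj[1, 3] = adj[2, 3] = 1
h1 = graphsage_layer(h0, adj,
                     rng.normal(size=(d_in, d_out)),
                     rng.normal(size=(d_in, d_out)))
assert h1.shape == (4, 32)  # concat of self- and neighbor-transforms
```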
Kernel Embedding & Prediction
We combine the node embeddings to create the embedding of the kernel. We treat the exact method of computing the kernel embedding as a hyperparameter, choosing from the following methods:

a fully deterministic concatenation of one or more of column-wise maximum, mean, and/or sum reductions over the node embeddings (column-wise option),

the final state of an LSTM LSTM on topologically sorted node embeddings, and

the application of a Transformer encoder Transformer to the node embeddings.
In each of these cases, the resulting kernel embedding is linearly transformed into a scalar output by a feedforward layer without activation. Additionally, we evaluate per-node predictions, which are the scalar outputs of a feedforward layer applied to each node embedding; we then compute the sum of the per-node predictions to obtain the kernel prediction (per-node option). The LSTM and Transformer reduction models are able to capture global and long-range dependency information, while the column-wise and per-node methods are not.
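The two non-model-based options can be sketched in a few lines; the weights `w` of the per-node head are a stand-in for a learned feedforward layer:

```python
import numpy as np

def columnwise_reduction(node_embeddings):
    """Deterministic kernel embedding: concatenated column-wise max, mean, sum."""
    return np.concatenate([node_embeddings.max(axis=0),
                           node_embeddings.mean(axis=0),
                           node_embeddings.sum(axis=0)])

def per_node_prediction(node_embeddings, w):
    """Per-node option: a scalar head per node, summed into the kernel
    prediction.  `w` stands in for the learned feedforward head's weights."""
    return float((node_embeddings @ w).sum())

h = np.array([[1.0, 2.0],
              [3.0, 0.0]])
k = columnwise_reduction(h)
assert k.tolist() == [3.0, 2.0, 2.0, 1.0, 4.0, 2.0]  # max | mean | sum
assert per_node_prediction(h, np.array([1.0, 0.5])) == 5.0
```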
3.3 Training Objectives
Tile-Size Selection Task
In this task, we are interested in the relative speed between different tile sizes within each kernel. Therefore, the performance model does not need to predict absolute runtime, but instead should be able to rank tile sizes by relative speed within each kernel. With this intuition, we train the model with a pairwise rank loss ranklosspaper:
(1) $\mathcal{L} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{n} \phi(y'_i - y'_j) \cdot \mathrm{pos}(y_i - y_j)$
where $n$ is the number of samples in each batch; $y_i$ and $y'_i$ are the true and predicted runtimes of sample $i$; $\mathrm{pos}(z)$ is 1 if $z > 0$, or 0 otherwise; $\phi$ is either the hinge function $\phi(z) = \max(0, 1-z)$ or the logistic function $\phi(z) = \log(1 + e^{-z})$, tuned via hyperparameter search.
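A hedged sketch of this loss, assuming all pairs in the batch are compared and the sum is normalized by the batch size; the hinge variant is used by default:

```python
import numpy as np

def pairwise_rank_loss(pred, true, phi="hinge"):
    """Pairwise rank loss (a sketch of Eq. 1).  For every pair where
    true[i] > true[j], penalize pred[i] failing to exceed pred[j]."""
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    diff = pred[:, None] - pred[None, :]           # pred_i - pred_j
    slower = (true[:, None] - true[None, :]) > 0   # 1 iff true_i > true_j
    if phi == "hinge":
        penalty = np.maximum(0.0, 1.0 - diff)
    else:  # logistic
        penalty = np.log1p(np.exp(-diff))
    return float((penalty * slower).sum() / len(pred))

# Correctly ordered, well-separated predictions incur less loss than inverted ones.
good = pairwise_rank_loss([1.0, 5.0, 9.0], [10.0, 50.0, 90.0])
bad = pairwise_rank_loss([9.0, 5.0, 1.0], [10.0, 50.0, 90.0])
assert good == 0.0  # every correct pair clears the hinge margin
assert good < bad
```

Note that only the ordering of `pred` matters, which is exactly why the model need not predict absolute runtimes for this task.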
Operator Fusion Task
In this task, we would like the model to predict absolute kernel runtimes, which can be used to compute the total program runtime. Thus, we minimize the model’s squared error loss with log-transformed targets. We apply the log transformation because targets are right-skewed, ranging from nanoseconds to seconds.
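A sketch of this objective; the nanosecond unit is illustrative:

```python
import numpy as np

def log_mse(pred_ns, true_ns):
    """Squared error on log-transformed runtimes.  The log compresses
    targets spanning roughly nanoseconds to seconds into a comparable
    scale, so large kernels do not dominate the loss."""
    return float(np.mean((np.log(pred_ns) - np.log(true_ns)) ** 2))

# A 2x overestimate costs the same whether the kernel runs ~1us or ~1s.
assert np.isclose(log_mse([2e3], [1e3]), log_mse([2e9], [1e9]))
```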
4 Data
            Random Split Method                       Manual Split Method
            Programs           Kernels                Programs           Kernels
Set         Tile-Size  Fusion  Tile-Size  Fusion     Tile-Size  Fusion  Tile-Size  Fusion
Train       93         78      21.8M      157.5M     93         78      22.9M      190.2M
Validation  8          8       1.6M       30.1M      8          8       1.4M       11.2M
Test        8          8       1.4M       20.3M      6          6       0.5M       6.6M
Our dataset consists of 104 XLA programs used in production or commonly used in research. To test the ability of our approach to generalize to unseen programs, the programs were split into training, validation, and test sets in two ways: (a) the random split method, in which programs were partitioned randomly into sets, and (b) the manual split method, in which the test set was chosen by hand to minimize the subjective similarity of programs between the training and other two sets. For each of the train, validation, and test sets, programs were expanded into individual kernels. Table 1 shows the number of programs and kernels in each set under both splitting methods. The number of nodes per kernel is 41 on average across all programs, ranging from 1 to 1,000. We measured kernel runtimes on both TPU v2 and v3. Our experiments use TPU v2 measurements unless mentioned otherwise.
Tile-Size Dataset
For the tile-size dataset, we compiled each XLA program using the compiler’s default fusion heuristics, obtaining an optimized computation graph that we decompose into kernels. For each kernel, we queried the compiler for a list of valid tile sizes. The runtime target for each sample is the minimum runtime from three runs. A kernel may have as many as 500,000 valid tile sizes, so we measured runtimes for as many as possible for each kernel within 30 minutes across 50 hosts, each with an accelerator. This process generated a total of 25 million samples.
Fusion Dataset
For the fusion dataset, we ran the fusion autotuner with a random search strategy to generate, for each computation graph, 50,000 fusion configurations or as many as possible before timeout (4 hours using 50 machines). Graphs were then decomposed into kernels, yielding 208 million samples after duplicate elimination. Approximately half of the resulting kernels have runtimes below 5µs. These contribute negligibly to total program runtimes, so we emphasize larger kernels in our analysis.
Imbalances
Our data are imbalanced in two ways. First, programs are not wholly independent. For example, there are many variations of ResNet models, but just one AlexNet model and one DLRM (recommendation) model. Second, the number of kernels and tile sizes varies widely across models. In the fusion dataset, ResNet variants have 300x more samples than AlexNet variants, and in the tile-size dataset, models using Inception have 400x more kernels than autocompletion models. To account for these imbalances, we draw examples evenly from each model type during training.
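One simple way to implement such balanced sampling is to draw the same number of examples from each model type per batch; the exact batching scheme below is an assumption, as the paper does not specify the mechanism:

```python
import random

def balanced_batches(samples_by_model, batch_size, seed=0):
    """Yield batches drawn evenly across model types (e.g., as many AlexNet
    as ResNet examples per batch), despite 100x+ imbalance in raw counts."""
    rng = random.Random(seed)
    model_types = sorted(samples_by_model)
    per_type = batch_size // len(model_types)
    while True:
        yield [rng.choice(samples_by_model[m])
               for _ in range(per_type)
               for m in model_types]

# Toy data: 300 ResNet samples vs. 3 AlexNet samples (ids 300-302).
data = {"resnet": list(range(300)), "alexnet": list(range(300, 303))}
batch = next(balanced_batches(data, batch_size=8))
assert len(batch) == 8
assert sum(1 for x in batch if x >= 300) == 4  # half the batch is AlexNet
```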
5 Model Accuracy Evaluation
We trained our models on a single NVIDIA V100 instance with 96GB of RAM and 10 CPU cores. For all the learned models, we performed a hyperparameter search (presented in Appendix B) and selected the best-performing model for each task on the validation set.
             Tile-Size                            Fusion
             Tile-Size APE      Kendall’s τ       MAPE               Kendall’s τ
             Learned  Analyt.   Learned  Analyt.  Learned  Analyt.   Learned  Analyt.
ConvDRAW     9.7      3.9       0.75     0.79     17.5     21.6      0.80     0.77
WaveRNN      1.5      2.8       0.75     0.65     2.9      322.9     0.97     0.70
NMT Model    3.1      13.1      0.86     0.81     9.8      26.3      0.94     0.91
SSD          3.9      7.3       0.82     0.77     11.4     55.9      0.88     0.76
RNN          8.0      10.2      0.64     0.55     1.9      20.5      0.97     0.86
ResNet v1    2.8      4.6       0.85     0.73     3.1      11.5      0.95     0.88
ResNet v2    2.7      5.4       0.87     0.73     2.4      13.3      0.96     0.86
Translate    3.4      7.1       0.93     0.92     2.1      27.2      0.92     0.74
Median       3.3      6.2       0.84     0.75     3.0      24.0      0.95     0.82
Mean         3.7      6.1       0.80     0.74     4.5      31.1      0.92     0.80

Table 2: The main evaluation metrics for both tasks on the randomly split test set, grouped by test application, comparing our best learned performance models against the analytical baseline. Geometric mean and median statistics are over application-level metrics. Fusion statistics are evaluated over kernels with true runtimes of at least 5µs, which account for the majority of total runtime in our programs.
5.1 Tile-Size Task
Metrics
For this task, we are interested in relative runtimes between different tile sizes within each kernel. Thus, for each kernel, we find the tile size with the best predicted runtime and the one with the best true runtime, and take the difference between their true runtimes. This is distinct from measuring differences between predicted and true runtimes. The ‘Tile-Size APE’ (listed in Table 2) is computed by summing these differences across all program kernels and dividing the sum by the runtime of the program as if it had chosen the best tile size for every kernel. More precisely, the Tile-Size APE of a program with kernels $K$, where kernel $k$ has the set of tile-size configurations $C_k$, is:
(2) $\mathrm{APE} = \frac{\sum_{k \in K} \left( r_k(\tilde{c}_k) - \min_{c \in C_k} r_k(c) \right)}{\sum_{k \in K} \min_{c \in C_k} r_k(c)} \times 100$
where $r_k(c)$ is the true runtime of tile-size configuration $c$ for kernel $k$, and $\tilde{c}_k$ is the predicted-best configuration. This is a good measure of efficacy for our setting, in which we use the performance model to select the top candidates and verify their actual runtimes on real hardware. Tile-Size APE shows how far we are from the fastest program. We also measure the Kendall rank correlation ($\tau$) between targets and predictions of tile-size runtimes within each kernel, and compute the average over all kernels in each program.
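The metric can be sketched directly from this definition; the kernel names, tile configurations, and runtimes below are made up for illustration:

```python
def tile_size_ape(kernels):
    """Tile-Size APE for one program (a sketch of Eq. 2).
    `kernels` maps kernel name -> {tile_config: (true_runtime, predicted_runtime)}."""
    chosen, best = 0.0, 0.0
    for configs in kernels.values():
        pred_best = min(configs, key=lambda c: configs[c][1])  # model's pick
        chosen += configs[pred_best][0]                        # its true runtime
        best += min(t for t, _ in configs.values())            # oracle choice
    return 100.0 * (chosen - best) / best

kernels = {
    "k0": {"8x128": (10.0, 11.0), "16x64": (12.0, 9.0)},  # model picks 16x64
    "k1": {"32x32": (5.0, 4.0), "8x8": (7.0, 8.0)},       # model picks 32x32
}
# Chosen true runtimes: 12 + 5 = 17; oracle: 10 + 5 = 15; APE = 2/15 * 100.
assert round(tile_size_ape(kernels), 2) == 13.33
```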
Results
Table 2 shows results for the randomly split dataset. The baseline is XLA’s mature analytical performance model designed for this task, as described in Section 2.3. Our learned performance model (3.7% mean error and 0.80 mean correlation) performs better than the analytical model (6.1% mean error and 0.74 mean correlation). Our learned model is consistently better than the analytical model on all benchmarks except ConvDRAW, which differs more (subjectively) from the programs in our training set than any other test program. TPU v3 results are similar; the learned performance model has 3.8% mean error with a slightly lower mean correlation of 0.65.
On the manually split dataset, the learned model (6.3% mean error) performs slightly worse than the analytical model (2.3% mean error). It is expected that the test error of the learned model on this test set will be higher than that of the randomly split test set, as these test programs were chosen for their dissimilarity to the training set. See Table 8 in the appendix for more detail.
5.2 Fusion Task
Metric
In this task, we use mean absolute percentage error (MAPE) as we wish to estimate the absolute runtime of the kernels in order to predict the total program runtime.
Baseline
The existing analytical performance model in XLA is built for selecting the fastest tile size for a given kernel, so performance estimates for different kernel types (e.g., fused kernels with and without convolutions) are on different scales. Hence, we scale the analytical model’s output by a coefficient associated with the kernel’s type to get an estimated absolute runtime. Coefficients are determined by executing each program in the test set with a default fusion configuration, and dividing the actual total runtime of all kernels of each type by the estimate in its original scale. The analytical model does not support kernels without tile-size options, which account for 1% of kernels in the dataset. We ignore these kernels in the comparisons in this section.
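A sketch of fitting these per-type coefficients from a default-configuration run; the data layout (a flat list of per-kernel observations) is an assumption:

```python
def fit_type_coefficients(kernels):
    """Scale the analytical model's per-type scores into absolute runtimes.
    `kernels` is a list of (kernel_type, true_runtime, analytical_score)
    observations from a default-configuration run; the result is one
    coefficient per kernel type (total true runtime / total score)."""
    totals = {}
    for ktype, true_rt, score in kernels:
        t = totals.setdefault(ktype, [0.0, 0.0])
        t[0] += true_rt
        t[1] += score
    return {ktype: true_sum / score_sum
            for ktype, (true_sum, score_sum) in totals.items()}

obs = [("conv", 40.0, 4.0), ("conv", 20.0, 2.0), ("elementwise", 3.0, 30.0)]
coef = fit_type_coefficients(obs)
assert coef["conv"] == 10.0         # conv scores under-estimate by 10x
assert coef["elementwise"] == 0.1   # elementwise scores over-estimate by 10x
```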
Results
Table 2 reports MAPEs of kernels with runtimes of at least 5µs. Our best learned model (4.5 MAPE and 0.92 mean correlation) substantially outperforms the analytical model (31.1 MAPE and 0.80 mean correlation). Similar to the tile-size dataset, our model consistently performs better than the analytical model on all benchmarks. On kernels with runtimes below 5µs, results follow the same trend; our model and the analytical model have MAPEs of 5.0 and 22.7, and mean Kendall’s τ coefficients of 0.89 and 0.70, respectively. For kernels with runtimes of at least 5µs on TPU v3, the learned performance model has 4.9 MAPE and 0.92 mean correlation.
On the harder manual split, the learned model still outperforms the analytical model significantly (see Table 8 in the appendix). On kernels with runtimes of at least 5µs, our model and the analytical model have MAPEs of 6.2 and 18.1, respectively.
6 Model Ablation Studies
We ran a comprehensive set of ablation experiments to study the effects of the design decisions underlying the best-performing model presented in Section 5, including the objectives used, the presence of static performance features, and the model architecture. Experiments in this section use the randomly split datasets and the same evaluation metrics as the previous section: Tile-Size APE for the tile-size task and MAPE for the fusion task. Each ablation (row in Table 3) is a single change to the ‘vanilla’ configuration.
6.1 Graph Features and Loss Function
To determine which input features are important and which training loss function is suitable, we used the same neural network model across all the experiments in Section 6.1. In particular, we used GraphSAGE with the simple per-node reduction, which is quick to train, and one of our best-performing hyperparameter configurations. Each model configuration was trained for 3 million steps.
Edge Direction
First, we considered a model variant that, unlike the ‘vanilla’ model, applies the same feedforward network to node representations from incoming and outgoing edges (see ‘Undirected’ in Table 3). The results suggest that edge direction is important for the fusion task, reducing the mean error by 3.8%, but irrelevant to the tile-size task.
Static Performance Features
The ‘With static perf. (as node features)’ row of Table 3 shows the result when we add the four static performance features explained in Section 3.1 to the ‘vanilla’ model, which uses only features extracted directly from the XLA program representation. Similar to edge direction, these features significantly improve model accuracy for the fusion task, reducing the mean error by 5%, but less so for the tile-size task.
The finding that edge direction and static performance information help only the fusion task is somewhat unexpected but not entirely surprising. In the tile-size selection task, we predict the relative runtimes of different tile sizes of the same kernel, but never compare runtimes of different kernels. Thus, the static performance features and the kernel graph are constant across different tile sizes, and the only changing input features are the tile-size features. These constant features may still help determine the relative runtimes of different tile sizes more accurately, as we can see that the static performance features slightly improve tile-size runtime prediction accuracy. Hence, adding more input features may not help significantly if they are constant across the configurations that will be compared against each other.
                                             Tile-Size        Fusion
                                             Median  Mean     Median  Mean
Vanilla                                      6.2     6.8      9.5     10.2
Undirected                                   7.2     6.8      11.0    14.0
With static perf. † (as node features)       6.5     6.3      4.0     5.2
With static perf. (in kernel embedding)      6.1     5.9      5.7     6.0
Move tile-size (node feats. to kernel emb.)  10.2    9.4      N/A     N/A
MSE loss (not rank)                          16.7    17.7     N/A     N/A
                   Tile-Size                              Fusion
Reduction / Graph  No GNN      GraphSAGE  GAT             No GNN        GraphSAGE   GAT
per-node           10.7 (5.3)  6.0 (3.8)  9.2 (6.4)       16.6 (132.7)  7.3 (34.6)  15.1 (4.0)
column-wise        9.3 (3.3)   6.9 (3.0)  8.4 (4.2)       6.6 (9.1)     5.1 (3.6)   8.5 (3.8)
LSTM               7.1 (3.7)   3.7 (2.8)  7.7 (4.2)       3.9 (7.5)     5.0 (4.3)   7.4 (4.5)
Transformer        10.8 (7.4)  4.6 (2.6)  8.2 (3.8)       7.3 (10.1)    4.5 (5.8)   14.6 (11.3)

Table 4: Model ablation study results. The table reports Tile-Size APE for the tile-size dataset, and MAPE for the fusion dataset, on test programs. Standard deviations of the errors across test applications are in parentheses. Reductions (rows) are defined in Section 3. Bold indicates the selected models used in Section 5. All models are trained for 5 million steps.
Kernel Features Encoding
Two ways to encode kernel information are shown in Fig. 3, labeled ‘kernel features (option 1)’ and ‘(option 2)’. In Table 3, the ‘vanilla’ model uses option 1, whereas ‘Move tile-size (node feats. to kernel emb.)’ uses option 2. Encoding the tile size with node features outperforms encoding it in the kernel embedding (2.6% lower mean error). We believe this is because the tile size can be important for estimating the runtime of an individual operation before aggregation. When the tile-size information is available at the node level, the model still has all the information about the node, such as its input and output shapes. On the other hand, encoding static performance information as node features or kernel features makes little difference, because these features are not very important for estimating the runtime of an individual node.
MSE vs. Rank Loss
For the tile-size dataset, we compare MSE and rank loss as the training objective. The ‘vanilla’ model is trained using rank loss, while ‘MSE loss (not rank)’ in Table 3 uses MSE loss. The effect is significant: using rank loss is 10.9% more accurate. This result confirms our intuition that training a model to predict relative speeds is easier than predicting absolute runtimes.
6.2 Neural Network Model
Once we determined the best set of graph features to include and the suitable loss function for each task from the experiment in Section 6.1, we performed a comprehensive comparison between different modeling choices to answer:

Does a GNN outperform models that treat programs as sequences?

Do we need an additional model to capture longrange dependencies in a kernel beyond a GNN?

How does GraphSAGE compare to a more sophisticated GNN: Graph Attention Network (GAT)?
To answer these questions, we explore different combinations of modeling choices for the GNN (GraphSAGE, GAT, and no GNN) and node reduction methods (per-node, column-wise, LSTM, and Transformer).
For all the models in this experiment, we train for five million steps and use the best settings from Section 6.1: distinguishing edge direction, and including static performance information (and tile size) as node features.
Q1: Graphs vs. Sequences
Prior work proposes an LSTM-based performance model for x86 basic blocks ithemal. To understand the effect of representing program examples as graphs rather than sequences, we compare GraphSAGE with the simple column-wise reduction to an LSTM and a Transformer (with no GNN). The LSTM and Transformer models are trained over topologically sorted sequences of nodes, whose embeddings are the same per-node representations fed into the GNNs.
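The linearization of a kernel graph into a node sequence for these sequence baselines can be sketched with Kahn's algorithm (a generic implementation; the tie-breaking order among ready nodes is an assumption):

```python
from collections import deque

def topo_order(num_nodes, edges):
    """Topologically sort the nodes of a dataflow graph so that a
    sequence model (LSTM/Transformer) consumes each node only after
    all of its producers (Kahn's algorithm)."""
    indegree = [0] * num_nodes
    successors = [[] for _ in range(num_nodes)]
    for u, v in edges:  # edge u -> v: node v consumes node u's output
        successors[u].append(v)
        indegree[v] += 1
    ready = deque(n for n in range(num_nodes) if indegree[n] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return order
```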
According to Table 4, GraphSAGE with the column-wise reduction is more accurate than using an LSTM or Transformer without a GNN on the tile-size dataset. On the fusion dataset, the LSTM is slightly better than GraphSAGE, but it has a higher error variance across test applications. Since we want the performance model to be consistently accurate across all programs, we conclude that GraphSAGE is a crucial component for achieving that consistency. Another interesting finding is that, even without GraphSAGE, the Transformer is worse than the simple column-wise reduction.
Q2: Most Effective Global Reduction
GNNs capture local graph structural information but not the global structure. To consider some global information and long-range dependencies in a kernel, we apply a sequence model (LSTM) and a global attention model (Transformer) to generate kernel embeddings from the node embeddings produced by a GNN.
As seen in Table 4, applying either an LSTM or a Transformer on top of GraphSAGE improves the model accuracy over GraphSAGE with a non-model-based reduction (per-node or column-wise). This result suggests that in order to achieve the best accuracy, the model indeed needs to capture more than local dependencies. GraphSAGE-LSTM and GraphSAGE-Transformer perform equally well on both the tile-size and fusion datasets. However, GraphSAGE-Transformer is much faster to train.
Nevertheless, GraphSAGE with the simple column-wise reduction already works reasonably well. If one prefers a model with fast inference time, we recommend such a combination. While GraphSAGE with the per-node reduction is slightly better than GraphSAGE with the column-wise reduction on the tile-size task, it is significantly worse on the fusion task, with high variance across applications.
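The simple column-wise reduction referred to here can be sketched as follows (per Table 5, it concatenates the column-wise mean and max; this plain-Python version is for illustration only):

```python
def columnwise_reduction(node_embeddings):
    """Reduce a variable-size set of node embeddings to one fixed-size
    kernel embedding by concatenating the per-dimension (column-wise)
    mean and max over all nodes."""
    dims = len(node_embeddings[0])
    n = len(node_embeddings)
    col_mean = [sum(e[d] for e in node_embeddings) / n for d in range(dims)]
    col_max = [max(e[d] for e in node_embeddings) for d in range(dims)]
    return col_mean + col_max  # kernel embedding of size 2 * dims
```

Because mean and max are permutation-invariant, this reduction ignores node ordering entirely, which is why the LSTM and Transformer reductions can add value by modeling longer-range structure.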
Q3: Most Effective GNN
We compare our choice of GraphSAGE to GAT, which has achieved state-of-the-art performance on a number of benchmarks gtn. We use GAT with multiple attention heads per layer. According to Table 4, GraphSAGE consistently exhibits better test accuracy than GAT. We also found training GATs to be especially sensitive to hyperparameter choices; for example, GraphSAGE was less sensitive to learning rate changes than GAT. We therefore conclude that, with roughly the same number of learnable parameters, GraphSAGE generalizes better than GAT for our cost prediction regression task. Additionally, we observed that GAT with LSTM/Transformer is worse than LSTM/Transformer alone; we hypothesize that compounding two complex models further increases the training difficulty.
7 XLA Toolchain Integration
In this section, we describe how we integrated the learned model into the XLA compiler and autotuner.
7.1 Tile-Size Compiler Integration
We integrated the model directly into the XLA compiler, replacing the analytical model. Fig. 4's 'Learned model 1' shows the benchmark speedup over the default tile-size configurations (the best tile sizes according to the analytical cost model). The first eight benchmarks are from the test set, and the remaining four are the benchmarks that gain the most speedup from exhaustive search.
On the test set benchmarks, the learned model is comparable to the analytical model, except for ConvDraw. We observe a few percent slowdown on NMT, SSD, and Translate even though our model shows better accuracy on these benchmarks in Table 2. This is likely because the dataset does not contain all possible tile sizes for a kernel if the time limit is reached during data generation.
On 3 out of 4 additional benchmarks, the learned cost model is better than the analytical model. On Translate (3), replacing the compiler's analytical model with the learned model would yield a 20% speedup. This demonstrates another advantage of a learned performance model over a manually written model: it can be easily improved with more data. If the learned model does not perform well on some benchmarks, we can retrain or fine-tune the model on similar benchmarks. In contrast, to fix the analytical model, engineers must identify the problem and fix it in a way that does not hurt other benchmarks, which is challenging in practice.
7.2 Tile-Size Autotuner Integration
Instead of using the learned model inside the compiler directly, we can use it in the tile-size autotuner. Fig. 4 reports the end-to-end benchmark speedup found by the autotuner. By default, the autotuner enumerates all tile sizes for each kernel and evaluates them on hardware (labeled 'Exhaustive'). Instead, we use the learned performance model (labeled 'Learned model 10') and the analytical model (labeled 'Analytical 10') to select the top 10 tile sizes to be evaluated on hardware, as a way to reduce the search time. The figure shows that the learned model is on par with the analytical model across all benchmarks (within 1–3% of each other).
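The model-guided search described above amounts to a simple rank-and-measure loop, sketched here (the function names and interfaces are illustrative, not XLA's actual API):

```python
def autotune_tile_sizes(candidates, predict_cost, measure_on_hw, k=10):
    """Rank all tile-size candidates with a (learned or analytical)
    cost model, then measure only the top-k predicted candidates on
    real hardware and return the fastest one."""
    ranked = sorted(candidates, key=predict_cost)  # cheapest predicted first
    top_k = ranked[:k]
    measured = {c: measure_on_hw(c) for c in top_k}  # the expensive step
    return min(measured, key=measured.get)
```

This reduces hardware measurements from the full enumeration to k per kernel, at the cost of possibly missing a fast tile size that the model ranks poorly.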
7.3 Fusion Autotuner Integration
We also integrate the best learned performance model from Section 5.2 into the XLA fusion autotuner. The analytical model is not used in this experiment, as it cannot estimate runtimes for kernels that lack tile-size options: kernels that are not fusion, convolution, or data-formatting operations.
Experimental Setup
TPUs are in high demand, so we wish to minimize their use during autotuning. CPUs are more abundant and better support timesharing, and, with a performance model, can be used to more cheaply run the autotuner. We compare the baseline autotuner (which uses TPUs) with the learned autotuner (which uses both the learned performance model and a TPU). In this experiment, the autotuner searches via simulated annealing. The baseline autotuner evaluates fusion configurations on real hardware for 10 minutes. The learned autotuner first evaluates fusion configurations on a CPU for an hour, then runs promising fusion configurations on the real hardware for up to either 1 or 10 minutes, in the order ranked by the predicted costs.
In this experiment, we compare the fusion autotuner on a set of programs that gain significant speedup from autotuning. The autotuner starts the search from a default configuration, generated by the compiler's fusion heuristic for the given program. Although some test programs (Transformer, Char2Feats, and ResNet-parallel) are in our training set, most kernels seen during the evaluation are unlikely to appear in the training set. This is because kernels in the training set are generated using a random search, as opposed to the simulated annealing used during this evaluation; as a result, different kernels are produced even for the same program.
Results
We run the autotuner on each program 10 times and report the best speedup found over the default configuration in Fig. 5. Using the learned performance model together with the hardware lets us discover fusion configurations that are on average 1.5% faster than using the hardware alone. Additionally, they are on average only 1.5% slower than the best known configurations, found by running the autotuner on hardware for 4 hours. When running simulated annealing from a random starting configuration, the benefit of the performance model is even more pronounced: on average, using the performance model led to discovering 10% faster configurations than not using it.
Furthermore, the learned performance model reduces the time spent on real target hardware for evaluation from 10 minutes to 1 minute without degrading performance. This demonstrates that when access to the target hardware is limited, the autotuner can utilize the learned performance model to discover faster code. This experiment shows that our approach can indeed be used to build a practical, accurate performance model to guide a compiler optimization task.
8 Related Work
Ithemal uses a hierarchical recurrent neural network to estimate throughputs of x86-64 basic blocks ithemal. Basic blocks are short, loop-free sequences of instructions (6.06 instructions on average). In contrast, our work addresses larger kernels with implicit nested loops containing up to a thousand operators. Ithemal was evaluated on its ability to generalize to held-out basic blocks, whereas our method is tested for its ability to generalize to novel tensor programs and targets a very different processor.
The code-feature-based performance model codefeaturesbased, Halide's performance model autohalide, and the work by Justus et al. justusetalmodel use simple neural networks to predict runtime from manually engineered features produced by a static analyzer that examines an optimized program. Since extracting these features from an XLA graph is nontrivial, we train a more complex neural network (using features that can be extracted directly from the XLA graph and very minimal features produced by an already existing static analyzer) with sufficient capacity to recover similarly powerful representations.
AutoTVM also uses a learned performance model to optimize tensor kernels, by ranking candidates AutoTVM. However, AutoTVM’s model shows limited ability to generalize between even very similar individual kernels (e.g., different kinds of convolution). In contrast, we train a performance model over entire tensor programs with many kernels, and can generalize to novel tensor programs containing many kernels dissimilar to those seen during training.
Additionally, Neural Architecture Search (NAS) often employs a related idea: learning models to predict the error of a deep learning model architecture (e.g., peephole; istrate2018tapas; wen2019neural). Others, such as ReNAS xu2019renasrelativistic, learn to rank candidate neural architectures rather than predict runtimes in isolation.
Deep learning-based techniques have been proposed to find better compiler optimizations endtoenddeeplearning; Chameleon. More specifically, GNNs have been used in the context of various compiler optimizations. ProGraML programl uses GNNs to perform compiler analyses. Vemal vemal proposes imitation learning-based auto-vectorization based on gated GNNs. Reinforcement learning and evolutionary search-based techniques using GNN-based policies have been proposed for the device placement task regal; placeto; zhou2019gdp.
9 Conclusion
We have presented an approach for automatically learning a performance model for tensor programs. We have found that the learned model can generalize well to programs with some similarity to our training set, usually matching or improving upon the performance of the best known analytical baseline for our target hardware. We also demonstrated that the learned performance model can be employed by autotuners to discover faster tensor programs than using hardware targets alone when hardware access is limited. In addition, we showed several advantages of the learned approach over the manual one, beyond accuracy. First, we have created, without manual feature engineering, a new performance model for the XLA fusion task where none existed before. Second, we can improve the learned model by retraining or finetuning with more data.
Acknowledgements
We would like to thank the XLA team, especially Bjarke Roune, for feedback and help during development, Hyeontaek Lim for code review and writing feedback, and Rishabh Singh for guidance.
References
Appendix A Tile-Size Analytical Model
A key to achieving high performance is using the fast scratchpad memory effectively. Choosing an appropriate tile size is essential to this goal for a number of reasons:

- Tile selection determines how many times data must be copied between HBM and the scratchpad memory; a bad tile choice may result in larger data movement.
- Tile selection determines the quantity of data copied in a given iteration and the amount of compute performed in that iteration; a good balance between the two is essential for achieving high performance through the overlap of compute and data transfers.
- Because tile size determines the size of each data copy, it also controls the achieved bandwidth for data transfers; larger transfers are more efficient.
The analytical model estimates the kernel's data transfer time and computation time, and takes the maximum of the two. The compiler pipelines the code, overlapping the computation of a given tile with the data copy-in (HBM to scratchpad) of the next tile and the data copy-out (scratchpad to HBM) of the previous tile. The performance model takes into account the memory required by all the operations the kernel contains. The compiler's code generation scheme distributes operations among the functional units while respecting data dependencies, so to estimate the computation cost, the model must estimate the instruction schedule for each operation to determine the critical path. Since different tiles may demand and execute different amounts of data transfer and computation, the total cost is determined on a per-iteration basis.
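The overall shape of this estimate can be sketched as follows (the bandwidth and peak-FLOPS constants are illustrative placeholders, not the TPU's actual figures, and the real model's instruction-schedule estimation is omitted):

```python
def estimate_kernel_time(tile_iterations):
    """Sketch of the analytical estimate: per tile iteration, the cost
    is the max of data-transfer time and compute time (the compiler
    overlaps them via pipelining), summed over all iterations.
    Each iteration is given as a (bytes_moved, flops) pair."""
    HBM_BANDWIDTH = 600e9   # bytes/sec -- illustrative placeholder
    PEAK_FLOPS = 90e12      # flops/sec -- illustrative placeholder
    total = 0.0
    for bytes_moved, flops in tile_iterations:
        transfer_time = bytes_moved / HBM_BANDWIDTH
        compute_time = flops / PEAK_FLOPS
        total += max(transfer_time, compute_time)  # overlapped, so take max
    return total
```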
Since tile-size selection happens prior to code generation, the model has to rely on several heuristics due to: (i) the inability to accurately estimate bidirectional data transfers, (ii) limitations in modeling instruction scheduling, (iii) the inability to model the effect of register usage, and (iv) limitations in capturing dynamic execution properties, such as issue stalls. The heuristics are chosen by tuning the performance model on a set of benchmark programs.
Appendix B Hyperparameters
Table 5 shows the fixed hyperparameters we used in all experiments. These hyperparameters were tuned in earlier preliminary experiments. Table 6 and Table 7 report the hyperparameters of the best performing models in Table 4 for the tile-size and fusion datasets, respectively.
The model hyperparameters used to produce Table 3 are the same as 'GraphSAGE + per-node' in Table 6 and Table 7. The training hyperparameters are slightly different but in the same range, as we tuned these parameters in every experiment.
Table 5: Fixed hyperparameters used in all experiments.

Hyperparameter | Applicable to | Fixed value
Opcode embedding size | All | 256
Node neighbor size (a) | GNN | 20
GNN layers | GNN | 3
GraphSAGE aggregator | GNN | mean
Node final layers (b) | All | 3
Column-wise reduction type | Column-wise reduction | mean & max (c)
Transformer attn. heads | Transformer | 4
Transformer reduction | Transformer | sum
Include per-layer biases | All | no

(a) Node neighbor size is the maximum number of neighbors included per node; if a node has more neighbors, we truncate the neighbor list. We experimented with sampling instead of truncation, but found no difference.
(b) Node final layers is the number of feed-forward layers applied to node embeddings before the reduction.
(c) Concatenation of the column-wise mean and column-wise max.
Table 6: Best hyperparameters on the tile-size dataset. Columns Hidden dim. through GAT heads are model hyperparameters; the remaining columns are training hyperparameters.

Tile-size dataset | Hidden dim. | Module L2 norm | Module layers | Transformer layers | GAT heads | Learning rate | Learning rate decay | Grad. clip | Dropout | Rank loss
No GNN + per-node | 512 | False | 3 | N/A | N/A | 0.000802 | 1.0 | none | 0.1 | hinge
No GNN + column-wise | 1024 | False | 3 | N/A | N/A | 0.000642 | 1.0 | none | 0.1 | hinge
No GNN + LSTM | 512 | False | 0 | N/A | N/A | 0.000434 | 0.99 | norm | 0.1 | hinge
No GNN + Transformer | 1024 | False | 0 | 3 | N/A | 0.000424 | 0.99 | norm | 0.1 | hinge
GraphSAGE + per-node | 512 | False | 0 | N/A | N/A | 0.001526 | 1.0 | none | 0.1 | hinge
GraphSAGE + column-wise | 1024 | False | 0 | N/A | N/A | 0.000642 | 1.0 | none | 0.1 | logistic
GraphSAGE + LSTM | 1024 | False | 0 | N/A | N/A | 0.000386 | 0.98 | norm | 0.1 | hinge
GraphSAGE + Transformer | 1024 | False | 0 | 1 | N/A | 0.000466 | 1.0 | norm | 0.1 | hinge
GAT + per-node | 512 | False | 0 | N/A | 2 | 0.00001 | 1.00 | norm | 0.1 | hinge
GAT + column-wise | 512 | False | 0 | N/A | 4 | 0.00001 | 1.00 | norm | 0.1 | hinge
GAT + LSTM | 512 | False | 0 | N/A | 4 | 0.00001 | 0.99 | norm | 0.1 | hinge
GAT + Transformer | 512 | False | 0 | 2 | 4 | 0.00001 | 1.00 | norm | 0.1 | hinge
Table 7: Best hyperparameters on the fusion dataset. Columns Hidden dim. through GAT heads are model hyperparameters; the remaining columns are training hyperparameters.

Fusion dataset | Hidden dim. | Module L2 norm | Module layers | Transformer layers | GAT heads | Learning rate | Learning rate decay | Grad. clip | Dropout
No GNN + per-node | 512 | False | 3 | N/A | N/A | 0.000214 | 0.95 | none | 0.2
No GNN + column-wise | 512 | False | 3 | N/A | N/A | 0.000102 | 1.0 | none | 0.25
No GNN + LSTM | 512 | False | 0 | N/A | N/A | 0.000144 | 1.0 | none | 0.25
No GNN + Transformer | 512 | True | 0 | 1 | N/A | 0.000862 | 1.0 | norm | 0.25
GraphSAGE + per-node | 512 | False | 0 | N/A | N/A | 0.000664 | 0.9 | none | 0.2
GraphSAGE + column-wise | 1024 | False | 0 | N/A | N/A | 0.000469 | 0.9 | none | 0.2
GraphSAGE + LSTM | 1024 | False | 0 | N/A | N/A | 0.000962 | 0.9 | none | 0.2
GraphSAGE + Transformer | 512 | True | 0 | 2 | N/A | 0.000768 | 1.0 | none | 0.2
GAT + per-node | 1024 | False | 0 | N/A | 2 | 0.000002 | 0.90 | none | 0.25
GAT + column-wise | 1024 | False | 0 | N/A | 2 | 0.000004 | 0.95 | none | 0.2
GAT + LSTM | 1024 | False | 0 | N/A | 2 | 0.000006 | 0.95 | none | 0.25
GAT + Transformer | 1024 | False | 0 | 2 | 2 | 0.000001 | 1.00 | norm | 0.2
Per-benchmark errors on the tile-size task (APE and Kendall's τ) and the fusion task (MAPE and Kendall's τ); each cell shows Learned / Analytical.

Benchmark | Tile-Size APE (Learned / Analytical) | Tile-Size Kendall's τ (Learned / Analytical) | Fusion MAPE (Learned / Analytical) | Fusion Kendall's τ (Learned / Analytical)
Ranking | 9.5 / 1.4 | 0.81 / 0.71 | 10.8 / 10.7 | 0.72 / 0.81
Feats2Wave | 16.9 / 1.2 | 0.71 / 0.83 | 9.6 / 72.4 | 0.59 / 0.72
ImageEmbed | 5.7 / 5.6 | 0.81 / 0.75 | 11.4 / 14.6 | 0.90 / 0.90
SmartCompose | 3.2 / 1.6 | 0.67 / 0.76 | 6.6 / 40.2 | 0.96 / 0.95
WaveRNN 1 | 7.0 / 2.6 | 0.66 / 0.81 | 2.7 / 8.8 | 0.97 / 0.95
WaveRNN 2 | 3.4 / 4.4 | 0.72 / 0.68 | 2.8 / 10.3 | 0.97 / 0.94
Median | 6.3 / 2.1 | 0.71 / 0.75 | 8.1 / 12.6 | 0.93 / 0.92
Mean | 6.4 / 2.3 | 0.73 / 0.75 | 6.2 / 18.1 | 0.84 / 0.88