Learning Execution through Neural Code Fusion

06/17/2019
by   Zhan Shi, et al.
Google
The University of Texas at Austin

As the performance of computer systems stagnates due to the end of Moore's Law, there is a need for new models that can understand and optimize the execution of general purpose code. While there is a growing body of work on using Graph Neural Networks (GNNs) to learn representations of source code, these representations do not understand how code dynamically executes. In this work, we propose a new approach to use GNNs to learn fused representations of general source code and its execution. Our approach defines a multi-task GNN over low-level representations of source code and program state (i.e., assembly code and dynamic memory states), converting complex source code constructs and complex data structures into a simpler, more uniform format. We show that this leads to improved performance over similar methods that do not use execution and it opens the door to applying GNN models to new tasks that would not be feasible from static code alone. As an illustration of this, we apply the new model to challenging dynamic tasks (branch prediction and prefetching) from the SPEC CPU benchmark suite, outperforming the state-of-the-art by 26% and 45% respectively. Moreover, we use the learned fused graph embeddings to demonstrate transfer learning with high performance on an indirectly related task (algorithm classification).



1 Introduction

Over the last 50 years, hardware improvements have led to exponential increases in software performance, driven by Moore’s Law. The end of this exponential scaling has enormous ramifications for computing Hennessy and Patterson (2019) since the demand for compute has simultaneously grown exponentially, relying on Moore’s Law to compensate Ranganathan (2017). As the onus of performance optimization shifts to software, new models, representations, and methodologies for program understanding are needed to drive research and development in computer architecture and compilers, and to aid engineers in writing high-performance code.

Deep learning has emerged as a powerful framework for solving difficult prediction problems across many domains, including vision Krizhevsky et al. (2012), speech Hinton et al. (2012), and text Sutskever et al. (2014). Recent work has started to frame many canonical tasks in computer architecture as analogous prediction problems, and has shown that deep learning has the potential to outperform traditional heuristics Hashemi et al. (2018). In this work, we focus on two representative tasks: address prefetching (modeling code data-flow) Jouppi (1990); Wenisch et al. (2009); Jain and Lin (2013); Hashemi et al. (2018) and branch prediction (modeling code control-flow) Jiménez and Lin (2001); Seznec (2011); Yeh and Patt (1992); Smith (1981).¹ Traditional models for solving these tasks memorize historical access patterns and branch history to make predictions about the future. However, this approach is inherently limited, as there are simple cases where history-based methods cannot generalize (Section 4.7). Instead, we argue that these tasks (control-flow and data-flow) jointly model the intermediate behavior of a program as it executes. During execution, there is a rich and informative set of features in intermediate memory states that models can learn to drive both prediction tasks. Additionally, since programs are highly structured objects, static program syntax can supplement dynamic information with additional context about the program’s execution.

¹ As Moore’s Law ends, modeling techniques in these fields have also stagnated. For example, the winner of the most recent branch prediction championship increased precision by 3.7% Dundas (2016).

In this paper, we combine these two sources of information by learning a representation of a program from both its static syntax and its dynamic intermediate state during execution. This incorporates a new set of previously unexplored features for prefetching and branch prediction, and we demonstrate that these can be leveraged to obtain significant performance improvements. Inspired by recent work on learning representations of code Allamanis et al. (2017), our approach is distinguished by two aspects. First, instead of using high level source code Allamanis et al. (2017), we construct a new graph representation of low-level assembly code and model it with a graph neural network. Assembly makes operations like register reads, memory accesses, and branch statements explicit, naturally allowing us to solve address prefetching and branch prediction within a single, unified representation. Second, to model intermediate state, we propose a novel snapshot mechanism that feeds limited memory states into the graph (Section 3.2).

We call our approach neural code fusion (NCF). This same representation can easily be leveraged for a bevy of other low-level optimizations (including indirect branch prediction, value prediction, and memory disambiguation) and opens up new possibilities for multi-task learning that were not previously possible with traditional heuristics. NCF can also be used to generate useful representations of programs for indirectly related downstream tasks, and we demonstrate this transfer learning approach on an algorithm classification problem.

On the SPEC CPU2006 benchmarks Sta (2006), NCF outperforms the state-of-the-art in address and branch prediction by a significant margin. Moreover, NCF is orthogonal to existing history-based methods, which could easily be combined with our learned representations to potentially boost accuracy further. To our knowledge, NCF is the first instance of a single model that can learn simultaneously on control-flow and data-flow tasks, setting the stage for teaching neural network models to better understand how programs execute.

In summary, this paper makes the following contributions:

  • An extensible graph neural network based representation of code that fuses static and dynamic information into one graph.

  • A binary representation for dynamic memory states that generalizes better than scalar or categorical representations.

  • The first unified representation for both control-flow and data-flow tasks.

  • State-of-the-art performance, improving branch prediction accuracy by 26% and address prefetching accuracy by 45%.

  • A demonstration that NCF representations pre-trained on branch prediction are useful for transfer learning, achieving competitive performance on an algorithm classification task.

2 Background

In order to generate our fused representation (Figure 1), we combine three fundamental components. The representation itself builds on Graph Neural Networks (GNN). Instead of directly representing source code, our static representation uses assembly code. To drive dynamic information through the GNN, we use binary memory snapshots. We start with background on these three components.

Figure 1: Overview of the fused static/dynamic graph representation.

2.1 Gated Graph Neural Networks

A generic graph neural network structure consists of a set of nodes $\mathcal{V}$ and sets of directed edges $\mathcal{E} = \{\mathcal{E}_1, \ldots, \mathcal{E}_K\}$, where $\mathcal{E}_k$ is the set of directed edges of type $k$. Each node $v \in \mathcal{V}$ is annotated with an initial node embedding denoted by $x_v \in \mathbb{R}^D$ and associated with a node state vector $h_v^{(t)} \in \mathbb{R}^D$ for each step of propagation $t$.

Our work builds on a specific GNN variant – Gated Graph Neural Networks (GGNNs) Li et al. (2015). GGNNs propagate information in the graph through message passing. At each step of propagation, “messages” to each node $v$ are computed as:

$$m_{kv}^{(t)} = \sum_{(u,v) \in \mathcal{E}_k} f_k\big(h_u^{(t)}\big) \tag{1}$$

where $m_{kv}^{(t)}$ is the zero vector if there are no edges of type $k$ directed towards $v$. $f_k$ is a linear layer in this model, but can be an arbitrary function. To update the state vector of a node $v$, all nonzero incoming messages are aggregated as:

$$m_v^{(t)} = g\big(\{m_{kv}^{(t)} : m_{kv}^{(t)} \neq \mathbf{0}\}\big) \tag{2}$$

Here $g$ is an aggregation function, for which we use element-wise summation. Finally, the next state vector is computed using a gated recurrent unit (GRU) Chung et al. (2014):

$$h_v^{(t+1)} = \mathrm{GRU}\big(m_v^{(t)},\, h_v^{(t)}\big) \tag{3}$$

The propagation is initialized with $h_v^{(0)} = x_v$ and repeated $T$ times. The state vectors $h_v^{(T)}$ are taken as the final node embeddings. For each task, we mark a specific node $v^*$ as the “task node”. We feed its final state vector $h_{v^*}^{(T)}$ to a linear output layer to make final predictions.
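To make the propagation scheme concrete, below is a minimal sketch of Equations (1)-(3) in PyTorch, assuming dense float adjacency tensors and summation aggregation; it is an illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GGNNSketch(nn.Module):
    def __init__(self, num_edge_types: int, dim: int, steps: int):
        super().__init__()
        # One linear layer f_k per edge type (Eq. 1).
        self.edge_fns = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_edge_types)])
        self.gru = nn.GRUCell(dim, dim)  # gated update (Eq. 3)
        self.steps = steps

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: [num_nodes, dim] initial node embeddings, so h^(0) = x.
        # adj: [num_edge_types, num_nodes, num_nodes] float tensor,
        # with adj[k, u, v] = 1 iff there is an edge u -> v of type k.
        h = x
        for _ in range(self.steps):
            # Eq. (1): per-type messages summed over incoming edges;
            # Eq. (2): aggregation across edge types by element-wise summation.
            m = sum(adj[k].t() @ f_k(h) for k, f_k in enumerate(self.edge_fns))
            h = self.gru(m, h)  # Eq. (3)
        return h  # final node embeddings h^(T)
```

A task head then reads off the task node's row of the returned embeddings and applies a linear output layer, as described above.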

2.2 Program Representations

Here we give a brief review of how compilers and processors represent source code and program state, along with tools for extracting these representations from programs and their executions. Incorporating these representations into a GNN model will be discussed in the next section.

Dynamic Execution State.

The dynamic state of a program is the set of values that change as a program executes. This is defined by a fixed set of registers (referenced by names like %rdi and %rax) and memory (which is much larger and indexed by an integer memory address). Values are moved from memory to registers via load instructions and from registers to memory via store instructions. Finally, the instruction pointer specifies which instruction should be executed next.

So, what is the correct subset of dynamic state to feed into a model? In principle it could include all registers and memory. However, this can be difficult to work with (memory is very large) and it is expensive to access arbitrary memory at test time. Instead, we restrict dynamic state to a snapshot that only includes CPU general purpose registers and recently used memory states as input. These values are cheaply obtainable in hardware through buffers that hold recently used data and in software through dynamic instrumentation tools like Pin (see Tools section).
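As a concrete, hypothetical illustration of what such a snapshot might contain, consider a simple container for register values and recently used memory (field names are ours, not from the paper):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Snapshot:
    """Dynamic state captured at one branch or load instruction (illustrative)."""
    instruction_pointer: int                    # address of the instruction to execute next
    registers: Dict[str, int] = field(default_factory=dict)  # e.g. {"%rax": 0x7f3a12, ...}
    recent_loads: List[Tuple[int, int]] = field(default_factory=list)  # (address, value) pairs
```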

Assembly Code.

Assembly code is compiled from source code and is specific to a particular processor architecture (such as x86). It is a sequence of instructions, some of which operate on register values, some of which move values between registers and memory (loads and stores), and some of which conditionally jump to other locations in the program. A common way of organizing assembly code is in a control flow graph (CFG). Nodes of a CFG are basic blocks, which are sequences of instructions without any control flow statements. Edges point from a source basic block to a target basic block when it is possible for control to jump from the source block to the target block. For x86 direct branches, there are only two possible target blocks for a given source block, which we can refer to as the true block and false block. A benefit of assembly code in our context is that it is typically less stylistically varied and tied more closely to program semantics. For example, source programs that are syntactically different but semantically equivalent tend to correspond to the same assembly, as shown in Figure 2.

(a) Assembly example 1: for vs. while.
(b) Assembly example 2: if-else vs. ternary.
Figure 2: Two assembly examples. For each example, two syntactically different but semantically equivalent source code snippets are compiled to the same assembly. Corresponding sections of source code and lines of assembly are shaded in the same color.

While we only use assembly for static code in this work, it is also possible to link the assembly code to the source code it was generated from in order to gain additional information about high-level constructs like data structures.

Tasks.

We test learned understanding of control-flow using the branch prediction task. Branch prediction traditionally uses heuristics to predict which target basic block will be entered next. The instruction pointer determines which basic block is currently being executed, and the target output is a boolean specifying either the true block or the false block.

Branch prediction is a difficult problem with large performance implications for even small relative improvements. Since high-performance microprocessors can execute hundreds of instructions speculatively, every mispredicted branch means that the processor has to discard all work done after that branch and restart execution.

Learned understanding of data-flow is tested using the prefetching task. Prefetching predicts the memory address that will be accessed in the next load operation. Since data access time is the largest bottleneck in server applications, solving data prefetching has significant implications for scaling computer architectures Hashemi et al. (2018). Note that there is generally interleaving of branching and memory instructions, so predicting the next memory access may depend on an unknown branch decision, and vice versa.

Tools.

Compilers convert source code into assembly code; we use gcc. Creating a usable snapshot of the dynamic state of a program is nontrivial. Given the large size of memory, we need to focus on memory locations that are relevant to the execution. These are obtained by monitoring the dynamic target memory addresses of the load instructions that are executed. To obtain these snapshots, we instrument branch and load instructions with a dynamic instrumentation tool called Pin Luk et al. (2005).

3 Model

We model the static assembly as a GNN (Section 3.1). Dynamic snapshots are then used as features to inform the GNN of the data-flow and control-flow choices that the software makes during execution (Section 3.2). By modeling these two sources of uncertainty, we show in Section 4 that the model learns the behavior of the application.

3.1 Graph Structure

Figure 3 provides an example of our graph structure translating from 3 lines of assembly to a GNN. The graph consists of three major types of nodes: instruction nodes (in white), variable nodes (in yellow), and pseudo nodes (in grey).

Instruction nodes are created from instructions to serve as the backbone of the graph. Each instruction can have variable nodes or pseudo nodes as child nodes.

Variable nodes represent variables that use dynamic values, including registers and constants.

Rather than connecting instruction nodes directly to their child variable nodes, pseudo nodes represent the sub-operations inside an instruction. The value associated with a pseudo node is computed in a bottom-up manner by recursively executing the sub-operations of its child nodes. For example, in instruction 0 in Figure 3, a pseudo node is created to represent the source operand that loads data from memory,² and it contains a child constant 0x48 and a child register %rbx. There are a number of different pseudo node types, listed in the appendix.

² In x86 assembly, parentheses represent addressing memory.

Figure 3: Graph structure on assembly code.

Three major types of edges are used to connect nodes in the graph: control-flow edges, parent edges and usage edges. Control-flow edges connect an instruction node to all potential subsequent instruction nodes. For non-branch instructions, the control-flow edge from an instruction node points to the next sequential instruction node in the program. For branch instructions, control-flow edges are used to connect to both the next instruction and the branch target. Parent edges are used to connect child variable nodes or pseudo nodes to their parent instruction nodes or pseudo nodes. Usage edges provide the graph with data flow information, connecting variable nodes with their last read or write. Given this static structure, Section 3.2 describes how the GNN is initialized and used.
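To make the structure concrete, here is a simplified, hypothetical sketch of the nodes and parent edges for a single memory-load instruction like the one in Figure 3; sub-type names follow the appendix, and the encoding is ours.

```python
# Nodes: instruction (backbone), pseudo (sub-operations), variable (registers/constants).
nodes = {
    0: ("instruction", "mov"),
    1: ("pseudo", "mem-src"),       # source operand 0x48(%rbx): a memory load
    2: ("variable", "const 0x48"),
    3: ("variable", "reg %rbx"),
    4: ("pseudo", "non-mem-tgt"),   # target operand: written directly to a register
    5: ("variable", "reg %rcx"),
}
# Edges as (type, src, dst). Parent edges connect children to their parents.
# A control-flow edge would point from node 0 to the next instruction node,
# and usage edges would link %rbx/%rcx to their last read or write.
edges = [
    ("parent", 1, 0), ("parent", 2, 1), ("parent", 3, 1),
    ("parent", 4, 0), ("parent", 5, 4),
]
```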

3.2 Fused Static/Dynamic Gated Graph Neural Networks

Node initialization. Unlike previous approaches to code analysis where node embeddings are initialized with the static text of source code, we fuse the static graph with dynamic snapshots by using dynamic state to initialize nodes in the graph.

Each variable node and pseudo node is initialized with a dynamic value from the memory snapshot. These values are converted into initial node embeddings via a learned embedding layer. We find that the numerical format of the dynamic values is critical to allowing the model to understand the application. We consider three types of representations for data values: categorical, scalar, and binary. Our results (Section 4.7) show that binary has an inherent ability to generalize more efficiently than categorical or scalar representations. The intuition behind why binary generalizes so well is that the representation is inherently hierarchical, which allows for stronger generalization to previously unseen bit patterns.

Lastly, instruction nodes are initialized with zero vectors as embeddings. Given the initial embeddings, the GNN runs for a predefined number of propagation steps to obtain the final embeddings.
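A minimal sketch of this initialization, under our assumption that the bit vector is mapped to the embedding by a learned linear layer (sizes taken from the appendix):

```python
import torch
import torch.nn as nn

def to_bits(value: int, width: int = 64) -> torch.Tensor:
    # Little-endian binary representation of a dynamic value.
    return torch.tensor([(value >> i) & 1 for i in range(width)], dtype=torch.float)

embed = nn.Linear(64, 64)            # learned embedding layer (assumed linear)
x_v = embed(to_bits(0x7ffe1a2b))     # initial embedding for a variable/pseudo node
x_instr = torch.zeros(64)            # instruction nodes start as zero vectors
```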

Defining tasks on the graph. Control-flow and data-flow tasks are defined on nodes using masking. Similar to masking in RNNs to handle variable sequence lengths, masking in GNNs handles different numbers of task nodes. A node with a task defined on it has a mask value of 1, and nodes without a task are masked out with 0 during both forward and backward propagation.

The control-flow task (branch-prediction) is defined on the branch instruction node. Since each branch can either be taken or not taken, this is a binary decision. The final node embeddings are fed into a linear layer to generate a scalar output using a sigmoid activation and a cross entropy loss.

The data-flow task (prefetching) is defined on the src pseudo node that represents a memory load operation. The task is to predict the 64-bit target address of the next memory load from this node. A 64-bit output is generated by feeding the final node embedding of the task node to a different linear layer. In this case, the output layer is 64-dimensional to correspond to a 64-bit address. The loss is the summation of the sigmoid cross entropy loss over all 64 bits.³

³ Our framework supports multitasking in that it handles control-flow and data-flow tasks simultaneously. However, in our ablation studies, we did not see significant evidence that these tasks currently help each other.
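A sketch of this output head as described above: a 64-way linear layer on the task node's final embedding, with sigmoid cross entropy summed over bits (names are illustrative, not from the released code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

address_head = nn.Linear(64, 64)  # one logit per bit of the 64-bit target address

def prefetch_loss(h_task: torch.Tensor, target_address: int) -> torch.Tensor:
    logits = address_head(h_task)
    bits = torch.tensor([(target_address >> i) & 1 for i in range(64)],
                        dtype=torch.float)
    # Summation of sigmoid cross entropy loss over all 64 bits.
    return F.binary_cross_entropy_with_logits(logits, bits, reduction="sum")
```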

Scaling to large programs. For some large-scale programs, it is unrealistic to utilize the static graph built on the entire assembly file. For example, the gcc benchmark studied in this paper has more than 500K instructions. To handle large graph sizes, we only use the part of the graph that is close to the task node in a snapshot. Given a snapshot, we discard instruction nodes and their child nodes if the instruction node is more than 100 steps from the task node. As in Li et al. (2015), the message propagation mechanism runs for a fixed number of steps, and only nodes whose distances to the task node are within the total number of propagation steps will affect the prediction.
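One way to realize this cutoff (our sketch; the paper does not specify the traversal) is a breadth-first search from the task node, discarding instruction nodes beyond the distance threshold:

```python
from collections import deque

def prune_graph(neighbors, task_node, max_dist=100):
    """Keep instruction nodes within max_dist steps of the task node.
    neighbors: dict mapping an instruction node to adjacent instruction nodes."""
    dist = {task_node: 0}
    queue = deque([task_node])
    while queue:
        n = queue.popleft()
        if dist[n] == max_dist:
            continue
        for m in neighbors.get(n, []):
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return set(dist)  # child variable/pseudo nodes of kept instructions are kept too
```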

4 Experiments

4.1 Data Collection

Our model consists of two parts, the static assembly and dynamic snapshots. To collect static assembly we use gcc to compile source code for each binary. This binary is then disassembled using the GNU binary utilities to obtain the assembly code.

The dynamic snapshots are captured for conditional branch and memory load instructions using the dynamic instrumentation tool Pin Luk et al. (2005). We run the benchmarks with the reference input set and use SimPoint Hamerly et al. (2005) to generate a single representative sample of 100 million instructions for each benchmark. Our tool attaches to the running process, fast forwards to the region of interest and outputs values of general registers and related memory addresses into a file every time the target conditional branch instructions or memory load instructions are executed by the instrumented application.

We use SPECint 2006 to evaluate our proposal. This is a standard benchmark suite commonly used to evaluate hardware and software system performance.

4.2 Experimental Setup

Our proposal is evaluated offline. We use the first 70% of snapshots for training and the last 30% for evaluation. We train the model on each benchmark independently. Hyperparameters are reported in the appendix.

4.3 Metrics

To evaluate our control-flow task we follow computer architecture research and use mispredictions per thousand instructions (MPKI) Jiménez and Lin (2001); Lee et al. (1997) as a metric. Predicting data flow is a harder problem, as the predictor needs to accurately predict all bits of a target memory address Peled et al. (2018). A prediction that is off by even one bit, especially in the high bits, points to an incorrect and often distant memory location. We evaluate our data-flow predictor using complete accuracy, defined as an accurate prediction in all bits.
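For reference, MPKI is computed as below (a standard definition in the architecture literature; the helper is ours):

```python
def mpki(mispredictions: int, instructions: int) -> float:
    # Mispredictions per thousand (kilo) executed instructions; lower is better.
    return 1000.0 * mispredictions / instructions
```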

4.4 Model Comparisons

We compare our model to three branch predictors. The first is a bimodal predictor that uses a 2-bit saturating counter for each branch instruction to keep track of its branch history Lee et al. (1997). The second is a perceptron branch predictor Jiménez and Lin (2001) that uses the perceptron learning algorithm on long sequential binary taken/not-taken branch histories Jiménez (2016). This state-of-the-art predictor is widely used in modern processors for its effectiveness and online adaptation capability. As a more powerful baseline, we implement an offline multi-layer perceptron (MLP), which can accurately learn to classify non-linearly separable data. The MLP has two hidden layers, each of the same size as the input layer. A default SGD solver is used for optimization. The results are shown in Figure 4. We find that NCF reduces MPKI by 26% and 22% compared to the perceptron and the MLP respectively.

Figure 4: Evaluation of the branch-prediction task (lower is better).
Figure 5: Evaluation of the prefetching task (higher is better).

Two baselines are used to evaluate our data-flow predictor in Figure 5. The first is a stride data prefetcher Chen and Baer (1995) that is good at detecting regular patterns, such as array operations. The second is a state-of-the-art address correlation (AC) prefetcher that handles irregular patterns by learning temporal address correlation Wenisch et al. (2009). Figure 5 shows that NCF achieves significantly higher performance than prior work by handling both regular and irregular patterns with its binary representation. When compared to recent LSTM-based models, due to its binary representation, NCF achieves nearly 100% coverage of all addresses, unlike the 50-80% reported for the LSTM-prefetcher of Hashemi et al. (2018). In both Figures 4 and 5, the applications are sorted from most challenging to least challenging. We find that NCF particularly outperforms the traditional baselines on the most challenging datasets.

The traditional baselines in both branch prediction and prefetching leverage long sequential features. Our NCF does not yet use sequential features or sequential snapshots; we leave this for future work.

4.5 Ablation Study

The effectiveness of the GNN depends on the input graph. As pseudo nodes are a large component of the static graph, we run additional experiments to understand their importance. In particular, we use only the pseudo nodes src and tgt, which are directly connected to instruction nodes. Our data show that removing all other pseudo nodes and connecting variable nodes directly to src and tgt has little impact on control-flow prediction (an MPKI increase of 0.26), but a large impact on data-flow accuracy (which drops by 12.1%).

Figure 6 shows the sensitivity of task performance to the number of propagation steps during training for the GNN on omnetpp. We find that prefetching is more sensitive to the number of propagation steps than branch prediction, and requires 5-8 steps for peak accuracy. Due to the control flow of programs, we find that 5-8 steps propagate information across 50-60 instruction nodes along the graph’s backbone for omnetpp (up to 6000 nodes for perlbench).

Figure 6: GNN performance vs. number of propagation steps.

4.6 Algorithm Classification

To test whether the model has learned about the behavior of the application, we test the NCF representation on an algorithm classification dataset Lili Mou (2016). We randomly select a subset of 15 problems from this dataset⁴ and generate inputs for each program. 50 programs are randomly selected from each class. These are split into 30 for training, 10 for validation (for tuning the linear SVM described below), and 10 for testing.

⁴ We use a subset because the programs had to be modified (by adding appropriate headers, fixing bugs) to compile and run in order to retrieve the assembly code and dynamic states.

We generate the graph for each program post-compilation and obtain memory snapshots via our instrumentation tool. The representation is pre-trained on the control-flow task and the resultant embeddings are averaged to serve as the final embedding of the program. A linear SVM is trained using the pre-trained embeddings to output a predicted class.
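The final classification step reduces to a few lines; the sketch below uses scikit-learn with stand-in arrays in place of the pre-trained embeddings (hypothetical data, not the paper's).

```python
import numpy as np
from sklearn.svm import LinearSVC

def program_embedding(node_embeddings: np.ndarray) -> np.ndarray:
    # Average the final node embeddings from the branch-prediction-pretrained GNN.
    return node_embeddings.mean(axis=0)

rng = np.random.default_rng(0)
X_train = rng.normal(size=(15 * 30, 64))  # stand-ins: 30 programs for each of 15 classes
y_train = np.repeat(np.arange(15), 30)
clf = LinearSVC().fit(X_train, y_train)   # predicted class = clf.predict(x)
```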

Our approach achieves 96.0% test accuracy, whereas the state-of-the-art Ben-Nun et al. (2018) achieves 95.3% on the same subset. In contrast to Ben-Nun et al. (2018), which pre-trains an LSTM on over 50M lines of LLVM IR, our embeddings are trained on 203k lines of assembly from the algorithm classification dataset itself. This shows that control-flow can be highly predictive of high-level program attributes, suggesting that it may be fruitful to consider using dynamic information to help solve other static tasks.

4.7 Generalization Test on Representations

Lastly, we test the effectiveness of binary representations of memory state. There are three major options for representing dynamic state: categorical, real-valued scalar, and binary. State-of-the-art data prefetchers Wenisch et al. (2009); Jain and Lin (2013) tend to use categorical representations. Recent advances in representing and manipulating numbers for neural arithmetic logic units use scalar representations  Trask et al. (2018).

We evaluate the generalization ability of these representations using a simple loop. We replace the constant 10 total iterations of the loop in Figure 2(a) with a variable N. The control-flow of the loop decides whether to stay in or jump out of the loop by comparing the loop counter i with N. The branch will be not taken for the first N-1 iterations but will be taken at the Nth iteration. Since traditional state-of-the-art branch predictors only depend on memorizing past branch history, they will always mispredict the final branch (as it has never been taken before). Our proposal is able to make the correct prediction at the Nth iteration.

The challenge for our model is that the value of N can change during program execution. The model needs to be able to generalize to values that have not been seen before. We use this example to test the three representations. For each representation, the testing set uses values of N from 1 to 80, while the training set only contains values from 1 to 40 with a step size of 3 (1, 4, 7, …, 37). A representation is considered correct on a given value of N only if all branch predictions for the loop, including the final taken branch, are correct. We feed all three representations to MLP predictors, each with one hidden layer of the same size as its input representation (160 for categorical, 2 for scalar, and 14 for binary). The results are shown in Figure 7.

Figure 7: Generalization ability of representations. “Yes” means that all branch predictions for the loop with a given value of N are correct.

The categorical representation can only correctly predict training samples, missing two out of every three values, whereas the scalar and binary representations are both able to generalize across a continuous range, filling the “holes” between training samples. We find that the binary representation generalizes to a larger range than the scalar representation, as long as the bits have been seen and toggled in the training set. Since binary is inherently hierarchical (the range increases exponentially with the number of bits), this advantage is greater on a real-world 64-bit machine.
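The three encodings compared above can be sketched as follows; based on the quoted input sizes (160 for categorical, 2 for scalar, 14 for binary), we assume the MLP input concatenates the encodings of the two compared values (80 + 80 one-hot categories, 1 + 1 scalars, 7 + 7 bits), which is an assumption on our part.

```python
import torch

def categorical(n: int, num_values: int = 80) -> torch.Tensor:
    v = torch.zeros(num_values)
    v[n - 1] = 1.0  # one-hot: an unseen value shares no features with seen ones
    return v

def scalar(n: int) -> torch.Tensor:
    return torch.tensor([float(n)])  # magnitude is explicit but unbounded

def binary(n: int, bits: int = 7) -> torch.Tensor:
    # Overlapping bit patterns let the MLP generalize to unseen values
    # whose bits were seen (and toggled) during training.
    return torch.tensor([float((n >> i) & 1) for i in range(bits)])

# Hypothetical MLP input for one branch: torch.cat([binary(i), binary(N)]).
```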

5 Related Work

5.1 Learning from Source Code & Execution Behavior

There is a significant body of work on learning for code, and we refer the reader to Allamanis et al. (2018) for a survey. We focus on the most relevant methods here. Li et al. (2015) use GNNs to represent the state of heap memory for a program verification application. Allamanis et al. (2017) learn to represent source code with GNNs.

Similar to us, Ben-Nun et al. (2018) learn representations of code from low-level syntax, the LLVM intermediate representation (IR), but do not use dynamic information. We use assembly code instead of IR to maintain a 1:1 mapping between dynamic state and the static backbone of the graph (since instructions are atomic when executed).

Wang et al. (2017) embed the sequences of values that variables take on during the execution of a program as a dynamic program embedding. The code is not otherwise used. The states are relatively simple (variables can take on relatively few possible values) in contrast to our dynamic states that are “from the wild.” Cummins et al. (2017) embed code and optionally allow a flat vector of auxiliary features that can depend on dynamic information. Abstract program execution can also be used as a basis for learning program representations DeFreez et al. (2018); Henkel et al. (2018). However, neither uses concrete program state.

5.2 Using Program State to Guide Program Synthesis

There are several works that learn from program state to aid program synthesis Balog et al. (2016); Parisotto et al. (2016); Devlin et al. (2017); Zohar and Wolf (2018); Chen et al. (2019); Vijayakumar et al. (2018); Menon et al. (2013). In particular, Balog et al. (2016) use neural networks to learn a mapping from list-of-integer-valued input-output examples to the set of primitives needed. All of these operate on programs in relatively simple Domain Specific Languages and are learning mappings from program state to code, rather than learning joint embeddings of code and program state.

5.3 Dynamic Prediction Tasks

Branch prediction and prefetching are heavily studied in the computer architecture domain. High-performance modern microprocessors commonly include perceptron Jiménez and Lin (2001) or table-based branch predictors that memorize commonly taken paths through code Seznec (2011).

While there has been a significant amount of work around correlation prefetching in academia Wenisch et al. (2009); Charney and Reeves (1995); Roth et al. (1998), modern processors only commonly implement simple stream prefetchers Chen and Baer (1995); Jouppi (1990); Gindele (1977). Recent work has related prefetching to natural language models and shown that LSTMs achieve high accuracy Hashemi et al. (2018). However, their categorical representation covers only a limited portion of the access patterns while the binary representation described here is more general.

6 Conclusion

We develop a novel graph neural network that uses both static and dynamic features to learn a rich representation for code. Since the representation is based on a relational network, it is easy to envision extensions that incorporate high-level source code into the model or add new prediction tasks. Instead of focusing on hardware-realizable systems with real-time performance (as is common in systems research), our primary focus in this paper is to develop representations that explore the limits of predictive accuracy for these problems with extremely powerful models, so that the improvements can eventually be distilled. This is common in machine learning research, where typically the limits of performance for a given approach are reached and then distilled into a sufficiently performant system, e.g., Van Den Oord et al. (2016); Oord et al. (2017). However, immediate benefits can still be derived by using the model to affect program behavior through compilation hints Chilimbi and Hirzel (2002); Jagannathan and Wright (1996); Wolf et al. (1996), making this exploration immediately practical. We argue that fusing both static and dynamic features into one representation is an exciting direction to enable further progress in neural program understanding.

References

7 Hyperparameters

The hyperparameters for all models are given in Table 1.

Model                              Hyperparameter        Value
Fused GGNN                         input feature size    64
                                   hidden size           64
                                   propagation steps     5
                                   optimizer             adam
                                   learning rate         0.01
MLPs for generalization test       hidden size           2 × input size
                                   optimizer             adam
                                   L2 regularization     0.0001
SVM for algorithm classification   loss                  square hinge
                                   L2 regularization     0.01
Baseline: bimodal                  counter bits          2
                                   resources             unlimited
Baseline: perceptron               history length        64
                                   L2 regularization     0.0001
Baseline: stride                   unlimited resources to store all strides (delta between
                                   addresses) for each load, predicting the most frequent stride
Baseline: address correlation      unlimited resources to store every pairwise correlation,
                                   predicting the most frequent pair

Table 1: Hyperparameters for models.

8 Node Sub-Types

We describe the node sub-types in Table 2. Pseudo-nodes implement operations that are commonly known as the addressing modes of the Instruction Set Architecture. Note that node sub-types are used to derive initial node embeddings and for interpretability. They do not factor into the computation of the graph neural network.

Major node type   Sub-type      Description
Pseudo nodes      non-mem-src   a source operand that does not involve a memory load,
                                obtained directly from register(s) and/or constant(s)
                  mem-src       a source operand that involves a memory load, obtained by
                                loading data from a memory location
                  non-mem-tgt   a target operand that does not involve a memory write,
                                writing directly to a register
                  mem-tgt       a target operand that involves a memory write, writing
                                data to a memory location
                  base          a base that is obtained directly from a variable node
                  ind-base      an indirect base that is obtained from certain operations on the
                                child variable nodes, such as multiplying a register by a constant
                  offset        an offset value that is to be added to a base
Variable nodes    reg           a register; its value changes dynamically during execution
                  const         a constant; its value is specified in the assembly

Table 2: Descriptions of node sub-types.