Code Vectors: Understanding Programs Through Embedded Abstracted Symbolic Traces

03/18/2018 ∙ by Jordan Henkel, et al. ∙ University of Wisconsin-Madison ∙ Microsoft

With the rise of machine learning, there is a great deal of interest in treating programs as data to be fed to learning algorithms. However, programs do not start off in a form that is immediately amenable to most off-the-shelf learning techniques. Instead, it is necessary to transform the program to a suitable representation before a learning technique can be applied. In this paper, we use abstractions of traces obtained from symbolic execution of a program as a representation for learning word embeddings. We trained a variety of word embeddings under hundreds of parameterizations, and evaluated each learned embedding on a suite of different tasks. In our evaluation, we obtain 93% top-1 accuracy on a suite of over 19,000 API-usage analogies extracted from the Linux kernel. In addition, we show that embeddings learned from (mainly) semantic abstractions provide nearly triple the accuracy of those learned from (mainly) syntactic abstractions.


1. Introduction

Computer science has a long history of considering programs as data objects (INRIA:DGHKL80; Proc:PADO85). With the rise of machine learning, there has been renewed interest in treating programs as data to be fed to learning algorithms (DBLP:journals/corr/abs-1709-06182). However, programs have special characteristics, including several layers of structure, such as a program’s context-free syntactic structure, non-context-free name and type constraints, and the program’s semantics. Consequently, programs do not start off in a form that is immediately amenable to most off-the-shelf learning techniques. Instead, it is necessary to transform the program to a suitable representation before a learning technique can be applied.

This paper contributes to the study of such representations in the context of word embeddings. Word embeddings are a well-studied method for converting a corpus of natural-language text to vector representations of words embedded into a low-dimensional space. These techniques have been applied successfully to programs before (Nguyen2017; Pradel2017; Gu:2016:DAL:2950290.2950334), but different encodings of programs into word sequences are possible, and some encodings may be more appropriate than others as the input to a word-vector learner.

The high-level goals of our work can be stated as follows:

Devise a parametric encoding of programs into word sequences that (i) can be tuned to capture different representation choices on the spectrum from (mainly) syntactic to (mainly) semantic, (ii) is amenable to word-vector-learning techniques, and (iii) can be obtained from programs efficiently.

We also wish to understand the advantages and disadvantages of our encoding method. RQ1–RQ4 summarize the experiments that we performed to provide insight on high-level goal (ii).

We satisfy high-level goals (i) and (iii) by basing the encoding on a lightweight form of intraprocedural symbolic execution.

  • We base our technique on symbolic execution due to the gap between syntax (e.g., tokens or abstract syntax trees (ASTs)) and the semantics of a procedure in a program. In particular, token-based techniques impose a heavy burden on the embedding learner. For instance, it is difficult to encode the difference between constructions such as a == b and !(a != b) via a learned, low-dimensional embedding (Allamanis2016a).

  • Our method is intraprocedural so that different procedures can be processed in parallel.

  • Our method is parametric in the sense that we introduce a level of mapping from symbolic-execution traces to the word sequences that are input to the word-vector learner. (We call these abstraction mappings or abstractions, although strictly speaking they are not abstractions in the sense of abstract interpretation (POPL:CC77).) Different abstraction mappings can be used to extract different word sequences that are in different positions on the spectrum of (mainly) syntactic to (mainly) semantic.

We have developed a highly parallelizable toolchain that is capable of producing a parametric encoding of programs to word sequences. For instance, we can process 311,670 procedures in the Linux kernel (specifically, a prerelease of Linux 4.3 corresponding to commit fd7cd061adcf5f7503515ba52b6a724642a839c8 in the GitHub Linux kernel repository) in 4 hours (during trace generation, we exclude only vhash_update, from crypto/vmac.c, due to its size), using a 64-core workstation (4 CPUs, each clocked at 2.3 GHz) running CentOS 7.4 with 252 GB of RAM.

After we present our infrastructure for generating parametric encodings of programs as word sequences (Overview), there are a number of natural research questions that we consider.

First, we explore the utility of embeddings learned from our toolchain:

RQ1. Do vectors learned from abstracted symbolic traces encode useful information?

Judging utility is a difficult endeavor. Natural-language embeddings have the advantage of being compatible with several canonical benchmarks for word-similarity prediction or analogy solving (the-microsoft-research-sentence-completion-challenge; Finkelstein:2001:PSC:371920.372094; Luong2013BetterWR; Szumlanski; Hill:2015:SES:2893320.2893324; Rubenstein:1965:CCS:365628.365657; NIPS2013_5021). In the domain of program understanding, no such canonical benchmarks exist. Therefore, we designed a suite of over nineteen thousand code analogies to aid in the evaluation of our learned vectors.
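To make concrete how such a suite can be scored, the sketch below (Python) solves each analogy "a is to b as c is to d" by nearest-neighbor search around vec(b) - vec(a) + vec(c) and counts top-1 hits. The embedding is assumed to be a plain dictionary from tokens to NumPy vectors; this is an illustrative scoring scheme, not the paper's exact evaluation code.

import numpy as np

def solve_analogy(emb, a, b, c):
    # Return the token whose vector is closest (by cosine similarity)
    # to vec(b) - vec(a) + vec(c), excluding the three query tokens.
    target = emb[b] - emb[a] + emb[c]
    target = target / np.linalg.norm(target)
    best_tok, best_sim = None, -1.0
    for tok, vec in emb.items():
        if tok in (a, b, c):
            continue
        sim = np.dot(vec, target) / np.linalg.norm(vec)
        if sim > best_sim:
            best_tok, best_sim = tok, sim
    return best_tok

def top1_accuracy(emb, analogies):
    # analogies: list of (a, b, c, d) token quadruples,
    # read as "a is to b as c is to d".
    hits = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in analogies)
    return hits / len(analogies)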

Next, we examine the impact of different parameterizations of our toolchain by performing an ablation study. The purpose of this study is to answer the following question:

RQ2. Which abstractions produce the best program encodings for word-vector learning?

There are several examples of learning from syntactic artifacts, such as ASTs or tokens. The success of such techniques raises the question of whether adding a symbolic-execution engine to the toolchain improves the quality of our learned representations.

RQ3. Do abstracted symbolic traces at the (mainly) semantic end of the spectrum provide more utility as input to a word-vector learner than traces at the (mainly) syntactic end?

Because our suite of analogies is only a proxy for utility in more complex downstream tasks that use learned embeddings, we pose one more question:

RQ4. Can we use pre-trained word-vector embeddings on a downstream task?

The contributions of our work can be summarized as follows:

We created a toolchain for taking a program or corpus of programs and producing intraprocedural symbolic traces. The toolchain is based on Docker containers, is parametric, and operates in a massively parallel manner. Our symbolic-execution engine prioritizes the amount of data generated over the precision of the analysis: in particular, no feasibility checking is performed, and no memory model is used during symbolic execution.

We generated several datasets of abstracted symbolic traces from the Linux kernel. These datasets feature different parameterizations (abstractions), and are stored in a format suitable for off-the-shelf word-vector learners.

We created a benchmark suite of over 19,000 API-usage analogies.

We report on several experiments using these datasets:

  • In RQ1, we achieve 93% top-1 accuracy on a suite of over 19,000 analogies.

  • In RQ2, we perform an ablation study to assess the effects of different abstractions on the learned vectors.

  • In RQ3, we demonstrate how vectors learned from (mainly) semantic abstractions can provide nearly triple the accuracy of vectors learned from (mainly) syntactic abstractions.

  • In RQ4, we learn a model of a specific program behavior (which error a trace is likely to return), and apply the model in a case study to confirm actual bugs found via traditional static analysis.
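For a flavor of the downstream task in RQ4 (predicting which error a trace is likely to return), the sketch below averages a trace's token vectors and fits a standard classifier. The featurization and choice of classifier are our own assumptions for illustration, not necessarily the setup used in the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

def trace_features(emb, trace):
    # Average the embedding vectors of a trace's abstract tokens;
    # tokens missing from the vocabulary are skipped.
    vecs = [emb[tok] for tok in trace if tok in emb]
    return np.mean(vecs, axis=0)

def train_error_model(emb, traces, error_labels):
    # traces: lists of abstract tokens; error_labels: strings such as "ENOMEM".
    X = np.stack([trace_features(emb, t) for t in traces])
    return LogisticRegression(max_iter=1000).fit(X, error_labels)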

Our toolchain, pre-trained word embeddings, and code-analogy suite are all available as part of the artifact accompanying this paper; details are given in Artifact.

Organization.

The remainder of the paper is organized as follows: Overview provides an overview of our toolchain and applications. ABS details the parametric aspect of our toolchain and the abstractions we use throughout the remainder of the paper. WV briefly describes word-vector learning. RQ1–RQ4 address our four research questions. THREATS considers threats to the validity of our approach. RW discusses related work. Artifact describes supporting materials that are intended to help others build on our work. CONC concludes.

2. Overview

Our toolchain consists of three phases: transformation, abstraction, and learning. As input, the toolchain expects a corpus of buildable C projects, a description of the abstractions to use, and a word-vector learner. As output, it produces an embedding of abstract tokens into double-precision vectors of a fixed, user-supplied dimension. We illustrate this process as applied to the example in OverviewProg.

int example() {
  char *buf = alloc(12);
  if (buf != 0) {
    bar(buf);
    free(buf);
    return 0;
  } else {
    return -ENOMEM;
  }
}
Figure 1. An example procedure

Phase I: Transformation. The first phase of the toolchain enumerates all paths in each source procedure. We begin by unrolling (and truncating) each loop so that its body is executed zero or one time, thereby making each procedure loop-free at the cost of discarding many feasible traces. We then apply an intraprocedural symbolic executor to each procedure. OverviewTraces shows the results of this process as applied to the example code in OverviewProg.

call alloc(12);
assume alloc(12) != 0;
call bar(alloc(12));
call free(alloc(12));
return 0;
(a) Trace 1
call alloc(12);
assume alloc(12) == 0;
return -ENOMEM;
(b) Trace 2
Figure 2. Traces from the symbolic execution of the procedure in OverviewProg
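To illustrate Phase I, the following Python sketch enumerates all paths through a loop-free procedure expressed in a toy intermediate form. The statement encoding (tuples tagged "call", "if", and "return") is an illustrative assumption; like the real engine, the sketch performs no feasibility checking and uses no memory model.

def enumerate_traces(stmts, prefix=()):
    # Emit one textual trace per path through a loop-free statement list.
    if not stmts:
        yield prefix
        return
    stmt, rest = stmts[0], stmts[1:]
    if stmt[0] == "call":
        yield from enumerate_traces(rest, prefix + ("call %s;" % stmt[1],))
    elif stmt[0] == "return":
        # A return ends the path; statements after it are unreachable.
        yield prefix + ("return %s;" % stmt[1],)
    elif stmt[0] == "if":
        _, cond, then_stmts, else_stmts = stmt
        yield from enumerate_traces(tuple(then_stmts) + rest,
                                    prefix + ("assume %s;" % cond,))
        # A real engine would simplify the negated guard
        # (e.g., to alloc(12) == 0, as in Trace 2 above).
        yield from enumerate_traces(tuple(else_stmts) + rest,
                                    prefix + ("assume !(%s);" % cond,))

# The procedure of Figure 1 in the toy form:
example = (
    ("call", "alloc(12)"),
    ("if", "alloc(12) != 0",
     [("call", "bar(alloc(12))"),
      ("call", "free(alloc(12))"),
      ("return", "0")],
     [("return", "-ENOMEM")]),
)

for trace in enumerate_traces(example):
    print("\n".join(trace), end="\n\n")

Running this reproduces the two traces above (modulo simplification of the negated branch condition).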

Phase II: Abstraction. Given a user-defined set of abstractions, the second phase of our toolchain leverages the information gleaned from symbolic execution to generate abstracted traces. One key advantage of performing some kind of abstraction is a drastic reduction in the number of distinct tokens that appear in the traces. Consider the transformed trace in OverviewTrace2. If we want to understand the relationship between allocators and certain error codes, then we might abstract procedure calls as parameterized tokens of the form Called(·); comparisons of returned values to constants as parameterized RetEq(·, ·) tokens; and returned error codes as parameterized RetError(·) tokens. OverviewAbstractions shows the result of applying these abstractions to the traces from OverviewTraces.

Called(alloc)
RetNeq(alloc, 0)
Called(bar)
Called(free)
(a) Abstracted Trace 1
Called(alloc)
RetEq(alloc, 0)
RetError(ENOMEM)
(b) Abstracted Trace 2
Figure 3. Result of abstracting the two traces in OverviewTrace2
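The sketch below (Python) implements these three abstractions as pattern matches over the textual trace events of Figure 2. The regular expressions, the event syntax they assume, and the decision to drop unmatched events (such as "return 0;") are illustrative assumptions rather than the toolchain's actual implementation.

import re

CALL   = re.compile(r"call (\w+)\(")
ASSUME = re.compile(r"assume (\w+)\(.*\) (==|!=) (\w+);")
RETERR = re.compile(r"return -(\w+);")

def abstract_event(event):
    if m := CALL.match(event):
        return "Called(%s)" % m.group(1)
    if m := ASSUME.match(event):
        rel = "RetEq" if m.group(2) == "==" else "RetNeq"
        return "%s(%s, %s)" % (rel, m.group(1), m.group(3))
    if m := RETERR.match(event):
        return "RetError(%s)" % m.group(1)
    return None  # events outside the abstraction are dropped

def abstract_trace(trace):
    return [tok for e in trace if (tok := abstract_event(e)) is not None]

trace2 = ["call alloc(12);", "assume alloc(12) == 0;", "return -ENOMEM;"]
print(abstract_trace(trace2))
# ['Called(alloc)', 'RetEq(alloc, 0)', 'RetError(ENOMEM)']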

Phase III: Learning. Our abstracted representation discards irrelevant details, flattens control flow into sequential traces, and exposes key properties in the form of parameterized tokens that capture domain information such as Linux error codes. These qualities make abstracted traces suitable for use with a word-vector learner. Word-vector learners place words that appear in similar contexts close together in an embedding space. When applied to natural language, learned embeddings can answer questions such as “king is to queen as man is to what?” (Answer: woman.) Our goal is to learn embeddings that can answer questions such as:

  • If a lock acquired by calling spin_lock is released by calling spin_unlock, then how should I release a lock acquired by calling mutex_lock_nested? That is, Called(spin_lock) is to Called(spin_unlock) as Called(mutex_lock_nested) is to what? (Answer: Called(mutex_unlock).)

  • Which error code is most commonly used to report allocation failures? That is, which RetError() is most related to RetEq(alloc, 0)? (Answer: RetError(ENOMEM).)

  • Which procedures and checks are most related to alloc? (Answers: Called(free), RetNeq(alloc, 0), etc.)
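As a sketch of how Phase III can consume such traces with an off-the-shelf learner, the Python snippet below feeds the two abstracted traces of Figure 3 to gensim's word2vec implementation. Two traces are of course far too little data to learn anything meaningful, and the hyperparameters are placeholders; the snippet only illustrates the plumbing and the analogy-query interface.

from gensim.models import Word2Vec

# Each abstracted trace is one "sentence" of abstract tokens.
traces = [
    ["Called(alloc)", "RetNeq(alloc, 0)", "Called(bar)", "Called(free)"],
    ["Called(alloc)", "RetEq(alloc, 0)", "RetError(ENOMEM)"],
]

model = Word2Vec(sentences=traces, vector_size=100, window=5,
                 min_count=1, sg=1, epochs=50)

# With a full kernel corpus, the first question above becomes:
# model.wv.most_similar(
#     positive=["Called(spin_unlock)", "Called(mutex_lock_nested)"],
#     negative=["Called(spin_lock)"], topn=1)

# Nearest neighbors of an allocation check (third question above):
print(model.wv.most_similar("RetEq(alloc, 0)", topn=3))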

The remainder of the paper describes a framework of abstractions and a methodology of learning embeddings that can effectively solve these problems. Along the way, we detail the challenges that arise in applying word embeddings to abstract path-sensitive artifacts.

3. Abstractions