Learning to Repair Software Vulnerabilities with Generative Adversarial Networks

by Jacob Harer, et al.
Boston University
Draper, Inc.

Motivated by the problem of automated repair of software vulnerabilities, we propose an adversarial learning approach that maps from one discrete source domain to another target domain without requiring paired labeled examples or source and target domains to be bijections. We demonstrate that the proposed adversarial learning approach is an effective technique for repairing software vulnerabilities, performing close to seq2seq approaches that require labeled pairs. The proposed Generative Adversarial Network approach is application-agnostic in that it can be applied to other problems similar to code repair, such as grammar correction or sentiment translation.



1 Introduction

Security vulnerabilities in software programs pose serious risks to computer systems. Malicious users can compromise programs through their vulnerabilities to force them to behave in undesirable ways (e.g., crash or expose sensitive user information). Thousands of such vulnerabilities are reported publicly to the Common Vulnerabilities and Exposures (CVE) database each year, and many more are discovered internally in proprietary code and patched CVE, BugOccurrence. These vulnerabilities are often the result of errors made by programmers and, due to the prevalence of open source software and code reuse, can propagate quickly.

In this paper, we address the problem of learning to automatically repair the source code of software containing security vulnerabilities. This problem is analogous to grammatical error correction, in which a grammatically incorrect sentence is translated into a correct one. In our case, bad source code (that contains a vulnerability) takes the place of an incorrect sentence and is repaired into good source code.

Neural Machine Translation (NMT) systems have recently achieved state-of-the-art performance on language translation and correction tasks Ji:2017gk, Yuan:2016kf, Schmaltz:2017ur, Xie:2016vi. These models use an encoder-decoder approach to transform an input sequence into an output sequence, e.g., translating a sequence of words forming a sentence in English to one in German. By far the most common method of training NMT systems is to use labeled pairs of examples to compare the likelihood of network output to a desired version, necessitating a one-to-one mapping between input and desired output data. Such paired data can be difficult to obtain, as in most cases it requires costly hand annotation.

In many sequence-to-sequence (seq2seq) applications, it is much easier to obtain unpaired data, i.e., data from both source and target domains without any matching pairs, since this only requires data to be labeled as either source or target. For example, in natural language translation it is easy to obtain monolingual corpora in different languages with almost no cost. For source code, automated error detection methods exist, such as static analyzers or machine learning approaches, which can be used to label code as having vulnerabilities or not, but do not provide one-to-one pairing between labeled sets Harer:2018te, Anonymous:2017tb.

Our approach to address this problem is adversarial learning with Generative Adversarial Networks (GANs) Goodfellow:2014wp. This approach allows us to train without paired examples. We employ a traditional NMT model as the generator, and replace the typical negative likelihood loss with the gradient stemming from the loss of an adversarial discriminator. The discriminator is trained to distinguish between NMT-generated outputs and real examples of desired output, and so its loss serves as a proxy for the discrepancy between the generated and real distributions. This problem has three main difficulties. First, sampling from the output of NMT systems, in order to produce discrete outputs, is non-differentiable. We address this problem by using a discriminator which operates directly on the expected (soft) outputs of the NMT system during training, which we discuss in detail in Section 3.2. Second, adversarial training does not guarantee that the generated code will correspond to the input bad code (i.e., the generator is trained to match distributions, not samples). To force the generator to produce useful repairs (i.e., generated code that is a repaired version of the input bad code), we condition our NMT generator on the input by incorporating two novel generator loss functions, described in Section 3.3. Third, the domains we consider are not bijective, i.e., a bad code can have more than one repair or a good code can be broken in more than one way. The regularizers we propose in Section 3.3 still work in this case.

We should note that although our motivation is to repair source code, the approach and the techniques proposed in this paper are application-agnostic in that they can be applied to other similar problems, such as correcting grammar errors or converting between negative and positive sentiments (e.g., in online reviews). Additionally, while software vulnerability repair is a harder problem than detection, our proposed repair technique can leverage the same datasets used for detection and yields a much more explainable and useful tool than detection alone.

2 Related Work

2.1 Software Repair

Much research has been done on automatic repair of software. Here we describe previous data-driven approaches (see Monperrus:2018:survey for a more extensive review of the subject). Two successful recent approaches are that of Le et al. Le:2016history and Long and Rinard Long:2016automatic. Le et al. mine a history of bug fixes across multiple projects and attempt to reuse common bug fix patterns on newly discovered bugs. Long and Rinard learn and use a probabilistic model to rank potential fixes for defective code. These works, along with the majority of past work in this area, require a set of test cases which is used to rank and validate produced repairs. Devlin et al. Devlin:2017tf avoid the need for test cases by generating repairs with a rule-based method and then ranking them using a neural network. Gupta et al. Gupta:2017deepfix take this one step further by training a seq2seq model to directly generate repairs for incorrect code. Hence, the work in Gupta:2017deepfix most closely resembles our work, but has the major drawback of requiring paired training data.

2.2 GANs

GANs were first introduced by Goodfellow et al. Goodfellow:2014th to learn a generative model of natural images. Since then, many variants of GANs have been created and applied to the image domain Arjovsky:2017tk, Arjovsky:2017vh, Chen:2016infogan, Creswell:2018generative, Radford:2015unsupervised. GANs have generally focused on images due to the abundance of data and their continuous nature. Applying GANs to discrete data (e.g., text) poses technical challenges not present in the continuous case (e.g., propagating gradients through discrete values). One successful approach is that of Yu et al. Yu:2016wz, which treats the output of the discriminator as a reward in a reinforcement learning setting. This allows the sampling of outputs from the generator since gradients do not need to be passed back through the discriminator. However, since a reward is provided for the entire sequence, gradients computed for the generator do not provide information on which parts of the output sequence the discriminator considers incorrect, resulting in long convergence times. Several other approaches have had success with directly applying an adversarial discriminator to the output of a sequence generator with likelihood output. Zhang et al. Zhang:2017ut replace the traditional GAN loss in the discriminator with a Maximum Mean Discrepancy (MMD) metric in order to stabilize GAN training. Both Press et al. Press:2017wz and Rajeswar et al. Rajeswar:2017wd are able to generate fairly realistic looking sentences of modest length using Wasserstein GAN Arjovsky:2017vh, which is the approach we adopt in this paper.

Work has also been done on how to condition a GAN’s generator on an input sequence instead of a random variable. This can easily be performed when paired data is available, by providing the discriminator with both the input sequence and the corresponding output sequence, thereby formulating the problem as in the conditional approach of Mirza and Osindero Mirza:2014wi, Yang:2017td. This approach, however, is clearly more difficult when pairs are not available. One approach is to enforce conditionality through the use of dual generator pairs which translate between domains in opposite directions. For example, Gomez et al. apply the cycle GAN Zhu:2017ua approach to cipher cracking Gomez:2018taa. They train two generators, one to take raw text and produce ciphered text, and the other to undo the cipher. Having two generators allows Gomez et al. to encrypt raw data using the first generator, then decrypt it with the other, ensuring conditionality by adding a loss function which compares this doubly translated output with the original raw input. Lample et al. Lample:2017ts adopt a somewhat similar approach for NMT. They translate using two encoder/decoder pairs which convert from a given language to a latent representation and back, respectively. They then use an adversarial loss to ensure that the latent representations are the same between both languages, thus allowing translation by encoding from one language and then decoding into the second. For conditionality they adopt a similar approach to Gomez et al. by fully translating a sentence from one language to another, translating it back, and then comparing the original sentence to the produced double translation.

The approaches of both Gomez et al. and Lample et al. rely on the ability to transform a sentence across domains in both directions. This makes sense in many translation spaces, as there are a finite number of reasonable ways to transform a sentence in one language to a correct one in the other. A network which finds a single mapping from every point in one domain to a single point in the other domain can therefore still cover the majority of translations. Unfortunately, in a sequence correction task such as our problem, one domain contains all correct sequences, while the other contains everything not in the correct domain. Therefore, the mapping from correct to incorrect is not one-to-one; it is one-to-many. A single mapping discovered by a network would fail to cover the space of all bad functions, thus enforcing conditionality only on the relatively small set of bad functions it covers. Therefore, we propose to enforce conditionality using a self-regularization term on the generator, similar in nature to that used by Shrivastava et al. Shrivastava:2017vy to generate realistic looking images from simulated ones.

3 Formulation

GANs are generative models originally proposed to generate realistic images, $y$, from random noise vectors, $z$ Goodfellow:2014wp. GANs find a mapping $G: z \to y$ by framing the learning problem as a two-player minimax game between a generator $G$ and a discriminator $D$, where the generator learns to generate realistic looking data samples by minimizing the performance of a discriminator whose goal is to maximize its own performance on discriminating between generated and real samples.

Our problem in this paper is different from the original GAN problem in that our goal is to find a mapping between two discrete-valued domains, namely between a given bad code (or source) domain $X$ and a good code (or target) domain $Y$, by using unpaired training samples $\{x_i\}_{i=1}^{N}$ and $\{y_j\}_{j=1}^{M}$, where $x_i \in X$ and $y_j \in Y$.

3.1 Adversarial Loss

The original GAN loss of Goodfellow et al. Goodfellow:2014wp is expressed as

$$\mathcal{L}_{GAN}(D, G) = \mathbb{E}_{y \sim P(y)}\left[\log D(y)\right] + \mathbb{E}_{x \sim P(x)}\left[\log\left(1 - D(G(x))\right)\right], \tag{1}$$

where the optimal generator is $G^{*} = \arg\min_{G} \max_{D} \mathcal{L}_{GAN}(D, G)$. It is well known that this loss can be unstable when the supports of the distributions of generated and real samples do not overlap Arjovsky:2017tk. This causes the discriminator to provide zero gradients. Further, this standard loss function can lead to mode collapse, where the resulting samples come from a single mode of the real data distribution. To alleviate these problems, Arjovsky et al. Arjovsky:2017vh proposed the Wasserstein GAN (WGAN) loss, which instead uses the Wasserstein-1 or Earth-Mover's (EM) distance between generated and real data samples in the discriminator. EM distance is relatively straightforward to estimate and leads to the easily computable loss function:

$$\mathcal{L}_{WGAN}(D, G) = \mathbb{E}_{y \sim P(y)}\left[D(y)\right] - \mathbb{E}_{x \sim P(x)}\left[D(G(x))\right], \tag{2}$$

where the discriminator function $D$ is constrained to be 1-Lipschitz. We use WGAN in our model as it leads to more stable training.
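As a rough illustration of the Wasserstein objective described above, the sketch below estimates the gap between mean discriminator scores on real and generated batches and applies weight clipping, the mechanism the original WGAN uses to approximate the 1-Lipschitz constraint. The function names, the toy scores, and the clipping threshold `c=0.01` are assumptions made for this example and are not taken from the paper.

```python
from statistics import mean

def wgan_objective(real_scores, fake_scores):
    """Batch estimate of E[D(y)] - E[D(G(x))] from discriminator scores."""
    return mean(real_scores) - mean(fake_scores)

def clip_weights(weights, c=0.01):
    """Clamp each weight into [-c, c] (hypothetical threshold c)."""
    return [max(-c, min(c, w)) for w in weights]

# Toy scores: the discriminator rates real samples higher than generated ones.
real_scores = [0.9, 1.1, 1.0]
fake_scores = [0.1, -0.1, 0.0]
gap = wgan_objective(real_scores, fake_scores)

# After each discriminator update, its weights are clipped.
clipped = clip_weights([0.5, -0.003, -2.0], c=0.01)
print(gap, clipped)
```

The discriminator is trained to increase this gap, while the generator is trained to decrease it by raising the scores of its own outputs.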

3.2 GANs with Discrete Data

One of the main challenges of adversarial training with discrete sequences is that sampling from the output of NMT systems in order to produce discrete outputs is non-differentiable. The goal of training is to generate samples from the unknown distribution of real sequences $P(y)$, which can be factorized as

$$P(y) = P(y_1, \ldots, y_T) = P(y_1) \prod_{t=2}^{T} P(y_t \mid y_1, \ldots, y_{t-1}), \tag{3}$$

where each conditional distribution is estimated (using an RNN generator in our case) with a softmax output

$$P(y_t \mid y_1, \ldots, y_{t-1}) = \mathrm{softmax}(g(h_t)), \tag{4}$$

where $g$ and $h_t$ denote the generator network and the hidden state of the RNN at time $t$, respectively. Ideally, we would sample from (4) to generate a sequence and provide that to the discriminator along with the real data for training, but this sampling process is non-differentiable. Instead, we provide the discriminator with the softmax outputs $\mathrm{softmax}(g(h_t))$ directly. Since each output at time $t$ depends on the previously produced output token and the RNN state, we still need to sample a token from the softmax output at time $t-1$ in order to generate the output at time $t$. Note that $\mathrm{softmax}(g(h_t))$ can be interpreted as a soft one-hot representation, as it corresponds to the expectation of one-hot vectors with respect to the conditional distribution in (4). Although this soft representation alleviates the issue of non-differentiability, it may introduce potential issues with the discriminator, which we discuss next.

Note that since each generator output is a probability vector, it will almost surely not be a one-hot vector. In other words, while every real token, $y_t$, lies on one of the vertices of the standard $(n-1)$-dimensional simplex (with $n$ the vocabulary size), our generated outputs lie in the interior of the simplex. This implies that the real and generated distributions have disjoint supports and are perfectly separable in theory. Therefore, there exists a ‘trivial’ discriminator which looks at each token independently and discriminates based on whether a sequence consists of one-hot vectors or not. Such a discriminator would not provide useful information for training the generator since it does not pay attention to the sequential dependencies between tokens. Nevertheless, we conjecture that simple discriminator architectures do not have this problem, since such a ‘trivial’ discriminator may be hard to realize in practice. This was verified in our experiments, where we found that relatively shallow networks, such as those using only a single convolutional layer, performed better than deeper ones.
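To make the ‘trivial’ discriminator concrete, the following minimal sketch contrasts real one-hot tokens with soft generator outputs. The helper names and the toy logits are hypothetical; the point is only that a per-token one-hot check separates the two perfectly while ignoring token order entirely.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(index, size):
    """One-hot vector: a vertex of the probability simplex."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def trivial_discriminator(sequence):
    """Label a sequence 'real' iff every token vector is exactly one-hot.

    Each position is inspected independently, so the order of the
    tokens has no influence on the decision.
    """
    return all(max(vec) == 1.0 for vec in sequence)

real = [one_hot(2, 4), one_hot(0, 4)]
generated = [softmax([0.1, 0.2, 5.0, 0.3]), softmax([4.0, 0.0, 0.1, 0.2])]

print(trivial_discriminator(real))       # True
print(trivial_discriminator(generated))  # False
```

Reversing `real` does not change the verdict, which is exactly why such a discriminator gives the generator no signal about sequential structure.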

There is related work in the literature Press:2017wz, Rajeswar:2017wd, Gomez:2018taa that reported avoiding this ‘trivial’ discriminator by using the improved Wasserstein GAN (WGAN-GP) loss Gulrajani:2017te. However, in our implementations, we had more success with the original version of the Wasserstein GAN, which uses clipped weights in the discriminator (after both versions had sufficient hyper-parameter tuning). We believe that this is because weight clipping in the original Wasserstein GAN forces the discriminator to learn simpler functions, as was shown in the improved WGAN paper Gulrajani:2017te. These simpler functions do not allow the discriminator to simply focus on one-hot vectors and force it to pay attention to sequential dependencies between tokens. To further analyze this point, we provide some visualizations in Figure 1, where we use a paired dataset for analysis purposes. We sample a random set of data pairs $(x, y)$, where $x$ is a bad version of $y$, and compute Wasserstein loss values, as defined in (2), for two separate cases. For the first loss calculation, we select pairs where the generator generates correct outputs ($G(x) = y$), and for the second loss, the generated outputs are incorrect ($G(x) \neq y$). We then take the ratio of these two loss values and plot them in Figure 1(a) for three different discriminator settings, namely i) 1-layer CNN with WGAN loss; ii) -layer CNN with WGAN-GP loss; and iii) -layer CNN with WGAN loss. A discriminator which only differentiates inputs based on whether they are one-hot vectors or not should have very similar loss values for the two cases, resulting in a loss ratio close to 1, since in neither case does the generator produce one-hot vectors. As we observe in Figure 1(a), the simpler network architecture (1-layer CNN in this case) with the original Wasserstein loss provides better separation, i.e., better signal, for training the generator.
This is further emphasized by Figures 1(b)-1(d), where we show normalized weights of the 1-D convolutional filters (whose kernel size is 11) on the first convolutional layer in each network. Filters for the simplest network in Figure 1(b) have a low degree of sparsity, implying that they are aggregating data from multiple tokens and taking into account sequential dependencies, whereas the networks in both Figures 1(c) and 1(d) have a much higher degree of sparsity, often emphasizing only a single token at a time, which we would expect for discriminators that inspect individual tokens and decide based on whether a given token is one-hot or not. These observations imply an inherent trade-off. An overly complex discriminator can learn to discriminate based on spurious features, i.e., whether a vector is one-hot or not, which can lead to overfitting. On the other hand, a very simple discriminator will not accurately model the data and, therefore, will not provide any useful information to the generator. One needs to treat this trade-off as one would treat a hyperparameter, by tuning the discriminator model on an application-by-application basis.

We should also mention that there are two other approaches proposed in the literature to overcome the issues we discussed above. The first approach is to (linearly and deterministically) embed each one-hot vector into a lower dimensional space Press:2017wz. This approach is still vulnerable to the problem of a sufficiently complex discriminator ignoring sequential dependencies, since these embeddings are deterministic. We found this to be the case in practice as well; adding an embedding to the discriminator alone produced no noticeable improvement and still required the use of simple networks. The second alternative approach is to reparameterize the discrete sampling process via a continuous relaxation using the Gumbel-softmax distribution Jang:2016ub, Maddison:2016tw. This approach, due to continuous relaxation, still generates (random) outputs via a softmax function, which are therefore similar to our soft one-hot outputs. We experimented with this approach and did not observe any improvements.

Figure 1: (a) Wasserstein loss ratios between correctly and incorrectly generated pairs during training. (b-d) Weights of 1-layer D CNN with WGAN loss, -layer D CNN with WGAN-GP loss, and -layer D CNN with WGAN loss, respectively.

3.3 Domain Mapping with Self-Regularization

In the context of source code repair, or more generally sequence correction, we need to constrain our generated samples to be corrected versions of the inputs $x$. Therefore, we have the following two requirements: (1) correct sequences should remain unchanged when passed through the generator; and (2) repaired sequences should be close to the original corresponding incorrect input sequences.

We explore two regularizers to address these requirements. As our first regularizer, in addition to GAN training, we train our generator as an autoencoder on data sampled from correct sequences. This directly enforces item (1), while indirectly enforcing item (2) since the autoencoder loss encourages subsequences which are correct to remain unchanged. The autoencoder regularizer is given as

$$\mathcal{L}_{AUTO}(G) = \mathbb{E}_{y \sim P(y)}\left[-y \log G(y)\right]. \tag{5}$$
As our second regularizer, we enforce that the frequency of each token in the generated output remains close to the frequency of the input tokens. This enforces item (2) with the exception that it may allow changes in the order of the sequence, e.g., arbitrary reordering does not increase this loss. However, the GAN loss alleviates this issue since arbitrary reordering produces incorrect sequences which differ heavily from the distribution of correct sequences. Our second regularizer is given as

$$\mathcal{L}_{FREQ}(G) = \mathbb{E}_{x \sim P(x)}\left[\sum_{i=1}^{n} \left\| \mathrm{freq}(x, i) - \mathrm{freq}(G(x), i) \right\|_2^2\right], \tag{6}$$

where $n$ is the size of the vocabulary and $\mathrm{freq}(x, i)$ is the frequency of the $i$th token in $x$.
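The behavior of the frequency regularizer can be checked on a small example. The sketch below is an assumption-laden simplification: it works on hard (already sampled) integer tokens rather than the generator's soft outputs, uses raw counts as the frequency, and penalizes the squared difference per vocabulary entry. The function names are hypothetical.

```python
from collections import Counter

def token_counts(sequence, vocab_size):
    """Count occurrences of each token id in 0..vocab_size-1."""
    counts = Counter(sequence)
    return [counts.get(i, 0) for i in range(vocab_size)]

def frequency_loss(source, generated, vocab_size):
    """Sum of squared count differences over the vocabulary."""
    src = token_counts(source, vocab_size)
    gen = token_counts(generated, vocab_size)
    return sum((s - g) ** 2 for s, g in zip(src, gen))

source = [3, 1, 2, 1]
reordered = [1, 1, 2, 3]   # same tokens, different order
edited = [3, 1, 2, 0]      # one token 1 replaced by token 0

print(frequency_loss(source, reordered, vocab_size=4))  # 0
print(frequency_loss(source, edited, vocab_size=4))     # 2
```

The zero loss for the reordered sequence shows why this term alone cannot enforce order, and why the adversarial loss is needed alongside it.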

4 Putting It All Together - Proposed GAN Framework

The generator in our network consists of a standard NMT system with an attention mechanism similar to that of Luong et al. Luong:2015wx. For all experiments the encoder and decoder consist of multi-layer RNNs utilizing Long Short-Term Memory (LSTM) units Hochreiter:1997fq. We use a dot-product attention mechanism as per Luong:2015wx. We use convolution-based discriminators since they have been shown to be easier to train and to generally perform better than RNN-based discriminators Yang:2017td. Additional network details are provided in Appendix A.

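We have two different regularized loss models given as

$$\mathcal{L}(D, G) = \mathcal{L}_{WGAN}(D, G) + \lambda \mathcal{L}_{AUTO}(G), \tag{7}$$

$$\mathcal{L}(D, G) = \mathcal{L}_{WGAN}(D, G) + \lambda \mathcal{L}_{FREQ}(G), \tag{8}$$

where $\mathcal{L}_{AUTO}$ and $\mathcal{L}_{FREQ}$ are defined in Section 3.3. We also experiment with the unregularized base loss model where we set $\lambda = 0$.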

4.1 Autoencoder Pre-Training

We rely heavily on pre-training to give our GAN a good starting point. Our generators are pre-trained as de-noising autoencoders on the desired data Vincent:2008jr. Specifically, we train the generator with the loss function:

$$\mathcal{L}_{AUTO\_PRE}(G) = \mathbb{E}_{y \sim P(y)}\left[-y \log G(\hat{y})\right], \tag{9}$$

where $\hat{y}$ is the noisy version of the input $y$, created by dropping tokens in $y$ with probability and randomly inserting and deleting tokens, where is times the sequence length. These numbers were selected based on hyperparameter tuning.
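A minimal sketch of the kind of noise function used for de-noising pre-training is given below. The drop probability and the edit fraction are left as parameters because the tuned values are not reproduced here; the values in the usage lines are placeholders only. Performing the same number of insertions and deletions is one possible reading of the description, and all names are hypothetical.

```python
import random

def add_noise(tokens, vocab, drop_prob, edit_fraction, rng):
    """Return a corrupted copy of `tokens`.

    1. Drop each token independently with probability `drop_prob`.
    2. Insert n random vocabulary tokens at random positions.
    3. Delete n tokens at random positions.
    Here n = round(edit_fraction * len(tokens)).
    """
    noisy = [t for t in tokens if rng.random() >= drop_prob]
    n_edits = round(edit_fraction * len(tokens))
    for _ in range(n_edits):
        noisy.insert(rng.randrange(len(noisy) + 1), rng.choice(vocab))
    for _ in range(n_edits):
        if noisy:
            del noisy[rng.randrange(len(noisy))]
    return noisy

rng = random.Random(0)
clean = ["int", "x", "=", "0", ";"]
vocab = ["int", "x", "=", "0", ";", "(", ")"]
# Placeholder noise levels, chosen only for illustration.
noisy = add_noise(clean, vocab, drop_prob=0.2, edit_fraction=0.2, rng=rng)
print(noisy)
```

During pre-training the generator would receive `noisy` as input and be trained to reproduce `clean`.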

4.2 Curriculum Learning

Likelihood-based methods for training seq2seq networks often utilize teacher forcing during training, where the input to the decoder is forced to be the desired value regardless of what was generated at the previous time step Williams:1989vi. This allows stable training of very long sequence lengths even at the start of training. Adversarial methods cannot use teacher forcing since the desired sequence is unknown, and must therefore always pass a sample of the output at time $t-1$ as the input to time $t$. This can lead to unstable training since errors early in the output will be propagated forward, potentially creating gibberish in the latter parts of the sequence. To avoid this problem, we adopt a curriculum learning strategy where we incrementally increase the length of produced sequences throughout training. Instead of selecting subsets of the data for curriculum training, we clip all sequences to have a predefined maximum length for each curriculum step. Although this approach relies on the discriminator being able to handle incomplete sentences, it does not degrade the performance as long as the discriminator is briefly retrained after each curriculum update.
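The clipping-based curriculum can be sketched in a few lines. The schedule below (start length, final length, increment) is a hypothetical choice for illustration, as are the function names; the only element taken from the text is that every sequence is truncated to the current maximum length rather than filtering the dataset.

```python
def curriculum_schedule(start, stop, step):
    """Maximum sequence lengths for successive curriculum steps."""
    return list(range(start, stop, step)) + [stop]

def clip_batch(sequences, max_len):
    """Truncate every sequence to at most `max_len` tokens."""
    return [seq[:max_len] for seq in sequences]

data = [list(range(12)), list(range(7)), list(range(3))]
schedule = curriculum_schedule(start=5, stop=12, step=5)

for max_len in schedule:
    batch = clip_batch(data, max_len)
    # A real training loop would briefly retrain the discriminator here,
    # then continue adversarial training on the clipped batch.
    print(max_len, [len(seq) for seq in batch])
```

Because clipping keeps all examples in play at every step, short sequences are seen in full from the start while long ones are revealed gradually.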

5 Experiments

GAN methods have often been criticized for their lack of easy evaluation metrics. Therefore, we focus our experiments on datasets which contain paired examples. This enables us to meaningfully evaluate the performance of our approach, even though our GAN approach does not require pairs to train. These datasets also allow us to train seq2seq networks and use their performance as an upper bound to our GAN-based approach. We start our experiments by exploring two hand-curated datasets, namely sequences of sorted numbers and Context Free Grammar (CFG), which help highlight the benefits of our proposed GAN approach to address the domain mapping problem. We then investigate the harder problem of correcting errors in C/C++ code. All of our results are given in Table 1.

5.1 Sorting

In order to show the necessity of enforcing accurate domain mapping, we generate a dataset where the repair task is to sort the input into ascending order. We generate sequences of randomly selected integers (without replacement) between and in ascending order. We then inject errors by swapping n selected tokens which are next to each other, where n is a (rounded) Gaussian random variable with mean 8 and standard deviation 4. The task is then to sort the sequence back into its original ascending order given the error-injected sequence. This scheme of data generation allows us to maintain pairs of good (before error injection) and bad (after error injection) data, and to compute the accuracy with which our GAN is able to restore the good sequences from the bad. We refer to this accuracy as ‘Sequence Accuracy’ (or Seq. Acc.). In order to assess our domain mapping approach and evaluate the usefulness of our self-regularizer loss functions defined in Section 3.3, we also compute the percentage of sequences which have valid orderings but not necessarily valid domain mappings, which we refer to as ‘Order Accuracy’ (or Order Acc.).
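The data generation and the two metrics can be sketched as follows. The sequence length and integer range in the usage lines are placeholder assumptions, as is clamping negative swap counts to zero; only the Gaussian with mean 8 and standard deviation 4 comes from the description, and the swaps are interpreted here as n swaps of adjacent positions. All function names are hypothetical.

```python
import random

def make_sorting_pair(rng, length, low, high, mean_swaps=8, std_swaps=4):
    """Return (bad, good): a sorted sequence and an error-injected copy."""
    good = sorted(rng.sample(range(low, high + 1), length))
    n = max(0, round(rng.gauss(mean_swaps, std_swaps)))
    bad = list(good)
    for _ in range(n):
        i = rng.randrange(length - 1)          # pick an adjacent pair
        bad[i], bad[i + 1] = bad[i + 1], bad[i]
    return bad, good

def is_ordered(seq):
    """True if the sequence is strictly ascending."""
    return all(a < b for a, b in zip(seq, seq[1:]))

def accuracies(outputs, targets):
    """Return (sequence accuracy, order accuracy) as fractions."""
    seq_acc = sum(o == t for o, t in zip(outputs, targets)) / len(targets)
    order_acc = sum(is_ordered(o) for o in outputs) / len(outputs)
    return seq_acc, order_acc

rng = random.Random(0)
bad, good = make_sorting_pair(rng, length=20, low=0, high=50)

# The second output is validly ordered but is not the sorted input.
outputs = [[1, 2, 3], [4, 5, 6]]
targets = [[1, 2, 3], [4, 5, 7]]
print(accuracies(outputs, targets))  # (0.5, 1.0)
```

The toy outputs show how a generator can score perfectly on Order Accuracy while missing half the targets, which is the failure mode the regularizers are meant to prevent.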

It is clear from the results in Table 1 that the vanilla (base) GAN easily learns to generate sequences with valid ordering, without necessarily paying attention to the input sequence. This leads to high Order Accuracy, but low Sequence Accuracy. However, adding Auto or Freq loss regularizers, as in (7) and (8), significantly improves the Seq. Acc., which shows that these losses do effectively enforce correct mapping between source and target domains.

5.2 Simple Grammar

For our second experiment, we generate data from a simple Context Free Grammar similar to that used by Rajeswar et al. Rajeswar:2017wd. The specifics of the CFG are provided in Appendix B. Our good data is selected randomly from the set of all sequences which satisfy the grammar and are less than length . We then inject errors into each sequence, where the number of errors is chosen as a Gaussian random variable (zero thresholded and rounded) with mean and standard deviation . Each error is then randomly chosen to be either a deletion of a random token, insertion of a random token, or swap of two random tokens.
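The error injection procedure can be sketched as below. The mean and standard deviation of the error count are passed in as parameters and the values in the usage lines are placeholders, not the paper's settings; the toy token sequence and vocabulary are likewise invented for the example and do not come from the grammar in Appendix B.

```python
import random

def sample_error_count(mean, std, rng):
    """Rounded Gaussian, thresholded at zero."""
    return max(0, round(rng.gauss(mean, std)))

def inject_errors(tokens, vocab, n_errors, rng):
    """Apply n_errors random edits: deletion, insertion, or swap."""
    seq = list(tokens)
    for _ in range(n_errors):
        kind = rng.choice(["delete", "insert", "swap"])
        if kind == "delete" and seq:
            del seq[rng.randrange(len(seq))]
        elif kind == "insert":
            seq.insert(rng.randrange(len(seq) + 1), rng.choice(vocab))
        elif kind == "swap" and len(seq) >= 2:
            i, j = rng.sample(range(len(seq)), 2)
            seq[i], seq[j] = seq[j], seq[i]
    return seq

rng = random.Random(0)
good = ["(", "a", "+", "b", ")"]
vocab = ["(", ")", "a", "b", "+", "*"]
n = sample_error_count(mean=2, std=1, rng=rng)  # placeholder values
bad = inject_errors(good, vocab, n, rng)
print(n, bad)
```

Because different random edits of different good sequences can collide, and one good sequence yields many bad ones, the resulting mapping between the two sets is not bijective.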

The network is tasked with generating the original sequence from the error-injected one. This task better models real data than the sorting task above, because each generated token must follow the grammar and is therefore conditioned on all previous tokens. The results in Table 1 show that our proposed GAN approach is able to achieve high CFG accuracy, in terms of generating correct sequences that fit the CFG. In addition to CFG accuracy, we also compute BLEU scores based on the pairs before and after error injection. We should note that our random error injection process results in many bad examples corresponding to a specific good example or vice versa, i.e., mappings are not bijective. Having multiple bad examples in the dataset paired with the same good example contributes to the slightly lower BLEU scores, since the network can only map each bad input to a single output. This issue appears frequently in real world repair datasets, since code sequences can be repaired or broken in multiple different ways. Our GAN approach performs well on this CFG dataset, suggesting that it can handle this issue, for which cycle approaches are not appropriate Gomez:2018taa, Lample:2017ts, Zhu:2017ua.

5.3 SATE IV

SATE IV is a dataset which contains C/C++ synthetic code examples (functions) with vulnerabilities from 116 different Common Weakness Enumeration (CWE) classes, and was originally designed to explore performance of static and dynamic analyzers Okun:2013jr . Each bad function contains a specific vulnerability, and is paired with several candidate repairs. There is a total of functions of which contain a vulnerability and do not. We lex each function using our custom lexer. After lexing, each function ranges in length from to tokens.

Using this data, we created two datasets to perform two different experiments, namely paired and unpaired datasets. The paired dataset allows us to compare the performance of our GAN approach with a seq2seq approach. In order to have a dataset which is fair for both GAN and seq2seq training, we created paired data by taking each example of vulnerable code and sampling one of its repairs randomly. We iterate this process through the dataset four times, pairing each vulnerable function with a sampled repair, and combine the resulting sets into a single large dataset. We should mention that although the paired dataset includes labeled pairs, those labels are not utilized for GAN training. For the unpaired dataset, we wanted to guarantee that a given source sequence does not have a corresponding target sequence anywhere in the training data. To achieve this, we divided the data into two disjoint sets by placing either a vulnerable function or its candidate repairs into the training dataset with equal probability. Note that this operation reduces the size of our training data by half. For testing, we compute BLEU scores using all of the candidate repairs for each vulnerable function. We use a //% train/validation/test split.
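The construction of the unpaired training set can be illustrated with a short sketch. The data layout (a list of tuples holding a vulnerable function and its candidate repairs) and the function name are assumptions for this example; the only rule taken from the text is that, for each function, either the vulnerable version or its repairs is kept, with equal probability.

```python
import random

def unpaired_split(examples, rng):
    """Split (vulnerable, repairs) records into disjoint source/target sets.

    For each record, keep either the vulnerable function (source side)
    or all of its candidate repairs (target side), never both.
    """
    sources, targets = [], []
    for vulnerable, repairs in examples:
        if rng.random() < 0.5:
            sources.append(vulnerable)
        else:
            targets.extend(repairs)
    return sources, targets

examples = [
    ("bad_fn_1", ["good_fn_1a", "good_fn_1b"]),
    ("bad_fn_2", ["good_fn_2a"]),
    ("bad_fn_3", ["good_fn_3a", "good_fn_3b"]),
]
sources, targets = unpaired_split(examples, random.Random(0))
print(sources, targets)
```

Every record contributes to only one side, which is why the resulting training set is roughly half the size of the paired one.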

As shown in Table 1, our proposed GAN approach achieves progressively better results when we add (a) curriculum training, and (b) either the Auto or the Freq regularization loss. The Base + Cur + Freq model proves to be the best among the different GAN models, and performs reasonably close to the seq2seq baseline, which is the upper performance bound. The results on the unpaired dataset are fairly close to those achieved on the paired dataset, particularly in the Base case, even though they are obtained with only half of the training data. Some code examples where our GAN makes correct repairs are provided in Table 2, with additional examples in Appendix C.

                       Sorting                  CFG                  Sate4-P   Sate4-U
Model                  Seq. Acc.   Order Acc.   BLEU-4   CFG Acc.    BLEU-4    BLEU-4
Seq2seq
  Base                 99.7        99.8         91.3     99.3        96.3      N/A
  Base + Cur           99.7        99.8         90.2     98.9        96.4      N/A
Proposed GAN
  Base                 82.8        96.9         88.5     98.0        84.2      79.3
  Base + Auto          98.9        99.6         88.6     96.5        85.7      79.2
  Base + Freq          99.3        99.7         88.3     97.5        86.2      79.5
  Base + Cur           81.5        98.0         88.4     98.9        88.3      81.1
  Base + Cur + Auto    96.2        98.0         88.5     97.8        89.9      81.5
  Base + Cur + Freq    98.2        99.1         88.6     96.3        90.3      81.3

Table 1: Results on all experiments. Cur refers to experiments using curriculum learning, while Auto and Freq are those using the autoencoder and frequency regularizers of Section 3.3, respectively. Sate4-P and Sate4-U denote paired and unpaired datasets, respectively.
Example 1, with vulnerability:

    void CWE685_Function_Call_With_Incorrect_Number_Of_Arguments() {
      char dst[DST_SZ];
      sprintf(dst, "%s %s", SRC_STR);
      printLine(dst);
    }

Example 1, repaired:

    void CWE685_Function_Call_With_Incorrect_Number_Of_Arguments() {
      char dst[DST_SZ];
      sprintf(dst, "%s %s", SRC_STR, SRC_STR);
      printLine(dst);
    }

Example 2, with vulnerability:

    void CWE415_Double_Free__malloc_free_struct_31() {
      twoints *data;
      data = NULL;
      data = (twoints *)malloc(100 * sizeof(twoints));
      free(data);
      {
        twoints *data_copy = data;
        twoints *data = data_copy;
        free(data);
      }
    }

Example 2, repaired:

    void CWE415_Double_Free__malloc_free_struct_31() {
      twoints *data;
      data = NULL;
      data = (twoints *)malloc(100 * sizeof(twoints));
      {
        twoints *data_copy = data;
        twoints *data = data_copy;
        free(data);
      }
    }

Table 2: Successful repairs. (Example 1) This function calls sprintf with a format string that expects two strings, but only provides the first string to print. Our GAN repairs it by providing a second string. (Example 2) This function frees the same memory twice, once directly and once through a copy of the pointer. Our GAN repairs it by removing the first free.

6 Conclusions

We have proposed a GAN-based approach to train an NMT system for discrete domain mapping applications. The major advantage of our approach is that it can be used in the absence of paired data, opening up a wide set of previously unusable data sets for the sequence correction task. Key to our approach is the addition of two novel generator loss functions which enforce accurate domain mapping without requiring multiple generators or bijective domains. We have also discussed, and made some progress toward, handling discrete outputs with GANs. We note, however, that this problem is far from solved and will likely inspire more research. Although we only apply our approach to the problem of source code correction, it is applicable to other sequence correction problems, such as Grammatical Error Correction or language sentiment translation, e.g., converting negative reviews into positive ones.


This project was sponsored by the Air Force Research Laboratory (AFRL) as part of the DARPA MUSE program.


  • [1] MITRE. Common vulnerabilities and exposures. cve.mitre.org.
  • [2] T. D. LaToza, G. Venolia, and R. DeLine. Maintaining mental models: A study of developer work habits. International Conference on Software Engineering (ICSE), 2006.
  • [3] J. Ji, Q. Wang, K. Toutanova, Y. Gong, S. Truong, and J. Gao. A Nested Attention Neural Hybrid Model for Grammatical Error Correction. Annual Meeting of the Association for Computational Linguistics (ACL), pages 753–762, 2017.
  • [4] Z. Yuan and T. Briscoe. Grammatical error correction using neural machine translation. North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL HLT), 2016.
  • [5] A. Schmaltz, Y. Kim, A. M. Rush, and S. M. Shieber. Adapting Sequence Models for Sentence Correction. Empirical Methods in Natural Language Processing (EMNLP), 2017.
  • [6] Z. Xie, A. Avati, N. Arivazhagan, D. Jurafsky, and A. Y. Ng. Neural Language Correction with Character-Based Attention. arXiv:1603.09727, March 2016.
  • [7] J. A. Harer, L. Y. Kim, R. L. Russell, O. Ozdemir, L. R. Kosta, A. Rangamani, L. H. Hamilton, G. I. Centeno, J. R. Key, P. M. Ellingwood, M. W. McConley, J. M. Opper, P. Chin, and T. Lazovich. Automated software vulnerability detection with machine learning. arXiv:1803.04497, February 2018.
  • [8] Fourth Workshop on the LLVM Compiler Infrastructure in HPC. LLVM-HPC, 2017.
  • [9] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. Neural Information Processing Systems (NIPS), June 2014.
  • [10] M. Monperrus. Automatic software repair: A bibliography. ACM Computing Surveys (CSUR), 51(1):17:1–17:24, January 2018.
  • [11] X. B. D. Le, D. Lo, and C. Le Goues. History driven program repair. Software Analysis, Evolution, and Reengineering (SANER), 2016.
  • [12] F. Long and M. Rinard. Automatic patch generation by learning correct code. Principles of Programming Languages (POPL), 2016.
  • [13] J. Devlin, J. Uesato, R. Singh, and P. Kohli. Semantic Code Repair using Neuro-Symbolic Transformation Networks. arXiv:1710.11054, October 2017.
  • [14] R. Gupta, S. Pal, A. Kanade, and S. Shevade. DeepFix: Fixing common C language errors by deep learning. Association for the Advancement of Artificial Intelligence (AAAI), pages 1345–1351, 2017.
  • [15] I. J. Goodfellow, O. Vinyals, and A. M. Saxe. Qualitatively characterizing neural network optimization problems. International Conference on Learning Representations (ICLR), 2015.
  • [16] M. Arjovsky and L. Bottou. Towards Principled Methods for Training Generative Adversarial Networks. International Conference on Learning Representations (ICLR), 2017.
  • [17] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein Generative Adversarial Networks. International Conference on Machine Learning (ICML), 2017.
  • [18] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. Neural Information Processing Systems (NIPS), 2016.
  • [19] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath. Generative adversarial networks: An overview. IEEE Signal Processing Magazine, 35(1):53–65, 2018.
  • [20] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. International Conference on Learning Representations (ICLR), 2016.
  • [21] L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. Association for the Advancement of Artificial Intelligence (AAAI), 2017.
  • [22] Y. Zhang, Z. Gan, K. Fan, Z. Chen, R. Henao, D. Shen, and L. Carin. Adversarial Feature Matching for Text Generation. International Conference on Machine Learning (ICML), 2017.
  • [23] O. Press, A. Bar, B. Bogin, J. Berant, and L. Wolf. Language Generation with Recurrent Generative Adversarial Networks without Pre-training. 1st Workshop on Subword and Character Level Models in NLP (SCLeM), 2017.
  • [24] S. Rajeswar, S. Subramanian, F. Dutil, C. Pal, and A. Courville. Adversarial Generation of Natural Language. 2nd Workshop on Representation Learning for NLP (RepL4NLP), 2017.
  • [25] M. Mirza and S. Osindero. Conditional Generative Adversarial Nets. arXiv:1411.1784, November 2014.
  • [26] Z. Yang, W. Chen, F. Wang, and B. Xu. Improving Neural Machine Translation with Conditional Sequence Generative Adversarial Nets. North American Chapter of the Association for Computational Linguistics (NAACL), 2018.
  • [27] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. International Conference on Computer Vision (ICCV), 2017.
  • [28] A. N. Gomez, S. Huang, I. Zhang, B. M. Li, M. Osama, and L. Kaiser. Unsupervised Cipher Cracking Using Discrete GANs. International Conference on Learning Representations (ICLR), 2018.
  • [29] G. Lample, L. Denoyer, and M. Ranzato. Unsupervised Machine Translation Using Monolingual Corpora Only. International Conference on Learning Representations (ICLR), 2018.
  • [30] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from Simulated and Unsupervised Images through Adversarial Training. Computer Vision and Pattern Recognition (CVPR), 2017.
  • [31] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved Training of Wasserstein GANs. Neural Information Processing Systems (NIPS), 2017.
  • [32] E. Jang, S. Gu, and B. Poole. Categorical Reparameterization with Gumbel-Softmax. International Conference on Learning Representations (ICLR), 2017.
  • [33] C. J. Maddison, A. Mnih, and Y. W. Teh. The Concrete Distribution - A Continuous Relaxation of Discrete Random Variables. International Conference on Learning Representations (ICLR), 2017.
  • [34] M.-T. Luong, H. Pham, and C. D. Manning. Effective Approaches to Attention-based Neural Machine Translation. Empirical Methods in Natural Language Processing (EMNLP), 2015.
  • [35] S. Hochreiter and J. Schmidhuber. Long Short-term Memory. Neural Computation, 9(8):1735–1780, December 1997.
  • [36] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. International Conference on Machine Learning (ICML), 2008.
  • [37] R. J. Williams and D. Zipser. A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation, 1989.
  • [38] V. Okun, A. Delaitre, and P. E. Black. Report on the Static Analysis Tool Exposition (SATE) IV. Technical report, 2013.

Appendix A Network and Training Details

Here we provide additional network and training details useful for experimental replication. All of the networks used in this paper use a similar architecture but vary in the number and size of layers.

A.1 Network Architecture

For all the experiments, we use identical networks for the generator in our GAN model and for the NMT model in our seq2seq baseline. Thus, when we refer to the generator in the rest of this section, the description applies to both models. Our network architecture is shown in Figure 2.

Our generator consists of two RNNs, an encoder and a decoder. For Sorting and CFG experiments, the generator RNNs contain layers of neurons each. For Sate4, it contains layers of neurons each. The encoder RNN processes the input sequence and produces a set of hidden states . The final hidden state is used as the initial state of the decoder RNN, which generates outputs one at a time, feeding each output back as its next input until an end-of-sequence character is produced. The decoder and encoder are linked using global dot-product attention as per Luong et al. [34].
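
As a rough illustration of the global dot-product attention that links the decoder to the encoder, the following Python sketch scores every encoder hidden state against the current decoder state, normalizes the scores with a softmax, and returns the weighted context vector. The function name and the toy vectors are hypothetical, and the learned projection that combines the context with the decoder state is omitted.

```python
import math

def dot_attention(decoder_state, encoder_states):
    """Global dot-product attention for one decoder step (illustrative sketch)."""
    # Score each encoder hidden state by its dot product with the decoder state.
    scores = [sum(d * h for d, h in zip(decoder_state, enc)) for enc in encoder_states]
    # Softmax over source positions, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the encoder states.
    dim = len(decoder_state)
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy example: three encoder states of dimension 2 (made-up numbers).
weights, context = dot_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]])
```

The encoder state most aligned with the decoder state receives the largest weight, which is what lets the decoder copy most of the input code unchanged while editing only a few tokens.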

All networks share the same discriminator architecture. Discriminator inputs each in are concatenated into a matrix in . They are then passed through a single 1D convolutional layer with filters each of sizes of , , and . These outputs are then aggregated and fed into a max pooling operation over the entire sequence length. The pooled features are fed into two fully connected layers, the first with neurons, and the second with a single neuron, the output of the discriminator.
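
The discriminator's forward pass can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the filter widths, layer sizes, random weights, and the ReLU between the two fully connected layers are all placeholders, and only one filter per width is used to keep the example short.

```python
import random

def conv1d_maxpool(seq, filters):
    """Slide each filter over the sequence and keep its maximum response."""
    pooled = []
    for filt in filters:                       # filt has shape width x dim
        width = len(filt)
        responses = []
        for t in range(len(seq) - width + 1):
            window = seq[t:t + width]
            responses.append(sum(w * x
                                 for w_row, x_row in zip(filt, window)
                                 for w, x in zip(w_row, x_row)))
        pooled.append(max(responses))          # max pooling over the whole sequence
    return pooled

def dense(x, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def discriminator(seq, filters, w1, b1, w2, b2):
    features = conv1d_maxpool(seq, filters)
    hidden = [max(0.0, h) for h in dense(features, w1, b1)]  # ReLU is an assumption
    return dense(hidden, w2, b2)[0]            # single output neuron

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def one_hot(index, size):
    return [1.0 if j == index else 0.0 for j in range(size)]

rng = random.Random(0)
vocab, seq_len, hidden_units = 4, 6, 5         # placeholder sizes
seq = [one_hot(rng.randrange(vocab), vocab) for _ in range(seq_len)]
filters = [rand_matrix(width, vocab, rng) for width in (2, 3, 4)]  # placeholder widths
w1, b1 = rand_matrix(hidden_units, len(filters), rng), [0.0] * hidden_units
w2, b2 = rand_matrix(1, hidden_units, rng), [0.0]
score = discriminator(seq, filters, w1, b1, w2, b2)
```

Because pooling runs over the entire sequence, the same network can score sequences of different lengths, which matters when the curriculum clips inputs to varying lengths.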

Figure 2: (Left) Generator consisting of N encoder layers feeding N decoder layers. Outputs from the encoder are also used as inputs to the attention mechanism with the query coming from the decoder output. (Right) Discriminator consisting of N convolution layers, a temporal max pooling, and N fully connected layers.

A.2 Training

We first train our generator as a denoising autoencoder for which we use the Adam optimization algorithm with a learning rate of . The same pretrained network is used to initialize the generator for all GAN and seq2seq networks.

GAN networks are trained using the RMSprop optimization algorithm. Learning rates are initialized to for the discriminator and for the generator. We train the discriminator times for every generator update. Seq2seq models are trained using the Adam optimizer with a learning rate of . We experimented extensively with varying the learning rates and found that increasing the discriminator learning rate made it unstable, causing its accuracy to decrease. Increasing the generator learning rate caused the generator to update too quickly for the discriminator, meaning that the discriminator did not remain close to optimal and the gradients through it were therefore unreliable. To ensure that the discriminator starts at a good initial point, we initialize it by training it alone for the first epochs. The generator's learning rate is decayed by a factor of every epochs. In models where we employ curriculum learning, this decay is only performed after the curriculum is completed. Networks are trained for , , and epochs for the sorting, CFG, and SATE4 experiments, respectively.
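
The update schedule described above can be sketched as follows. All names and numeric arguments are hypothetical placeholders, since the actual values are not given here, and the decay is assumed to be multiplicative.

```python
def generator_lr(epoch, base_lr, decay_factor, decay_every, decay_start=0):
    """Step decay of the generator learning rate.

    decay_start is the epoch at which decay begins, e.g. the epoch at which
    a curriculum finishes (assumed interpretation).
    """
    if epoch < decay_start:
        return base_lr
    steps = (epoch - decay_start) // decay_every
    return base_lr * (decay_factor ** steps)

def update_plan(num_epochs, batches_per_epoch, n_critic, warmup_epochs):
    """Yield (epoch, network) pairs in the order updates would be applied."""
    for epoch in range(num_epochs):
        for batch in range(batches_per_epoch):
            yield epoch, "D"                   # discriminator update on every batch
            # The generator is frozen during warm-up; afterwards it gets one
            # update for every n_critic discriminator updates.
            if epoch >= warmup_epochs and (batch + 1) % n_critic == 0:
                yield epoch, "G"

plan = list(update_plan(num_epochs=2, batches_per_epoch=4, n_critic=2, warmup_epochs=1))
```

Keeping the discriminator ahead of the generator in this way is what keeps it close to optimal, so that the gradients it passes to the generator stay informative.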

GAN training uses the original clipped version of Wasserstein GAN [17] with a clipping threshold of . We also experimented heavily with this threshold and found that a lower threshold led to low discriminator accuracy, while a higher threshold led to the same discrete domain issues as discussed in the paper for WGAN-I in Section .
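
For readers unfamiliar with the clipped variant, the sketch below shows the two pieces that distinguish it: the critic loss and the weight clamp applied after each discriminator update. The parameter dictionary and the threshold value are illustrative assumptions, not the settings used in our experiments.

```python
def critic_loss(real_scores, fake_scores):
    """WGAN critic loss to minimize: mean score on fakes minus mean score on reals."""
    return sum(fake_scores) / len(fake_scores) - sum(real_scores) / len(real_scores)

def clip_weights(params, threshold):
    """Clamp every discriminator weight into [-threshold, threshold]."""
    for name in list(params):
        params[name] = [max(-threshold, min(threshold, v)) for v in params[name]]
    return params

# Hypothetical flat parameter lists and threshold.
params = clip_weights({"conv": [0.5, -0.2, 0.01], "fc": [-0.03]}, threshold=0.05)
loss = critic_loss(real_scores=[1.0, 3.0], fake_scores=[0.0, 1.0])
```

A smaller threshold restricts the critic more tightly, which is consistent with the drop in discriminator accuracy we observed at low thresholds.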

Our curriculum clips each sequence to a given length. We step up the curriculum length either when the discriminator accuracy falls below or after epochs, whichever comes first. Sorting and CFG curriculum starts at sequence length and is increased by at each step. SATE4 curriculum starts at length and is increased by at each step.
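
A minimal sketch of this curriculum logic is given below. The class, its argument names, the example values, and the notion of a final length at which the curriculum is considered complete are all assumptions made for illustration.

```python
def clip_batch(batch, max_len):
    """Truncate every token sequence in the batch to at most max_len tokens."""
    return [seq[:max_len] for seq in batch]

class Curriculum:
    def __init__(self, start_len, step, final_len, acc_threshold, max_epochs_per_stage):
        self.length = start_len
        self.step = step
        self.final_len = final_len              # assumed stopping point
        self.acc_threshold = acc_threshold
        self.max_epochs_per_stage = max_epochs_per_stage
        self.epochs_in_stage = 0

    @property
    def done(self):
        return self.length >= self.final_len

    def end_of_epoch(self, disc_accuracy):
        """Advance the curriculum if either trigger fires; return the current length."""
        self.epochs_in_stage += 1
        stalled = self.epochs_in_stage >= self.max_epochs_per_stage
        fooled = disc_accuracy < self.acc_threshold
        if not self.done and (fooled or stalled):
            self.length = min(self.final_len, self.length + self.step)
            self.epochs_in_stage = 0
        return self.length

curriculum = Curriculum(start_len=5, step=5, final_len=15,
                        acc_threshold=0.6, max_epochs_per_stage=3)
lengths = [curriculum.end_of_epoch(acc) for acc in (0.9, 0.9, 0.9, 0.5)]
```

In this toy run the length first increases because the epoch limit is reached and then because the discriminator accuracy drops below the threshold, the two triggers described above.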

Appendix B Context Free Grammar

Our simple CFG experiment uses the following CFG. Each line is a production rule with possible sequences separated by |. Symbols in quotes are terminals.

  SOS: '1'
  EOS: '2'
  NP: Det Nom | PropN
  Nom: Adj N | N
  VP: V NP | V NP PP
  PP: P NP
  PropN: '3' | '4' | '5'
  Det: '6' | '7'
  N: '8' | '9' | '10' | '11' | '12'
  Adj: '13' | '14' | '15' | '16' | '17'
  V: '18' | '19' | '20' | '21'
  P: '22' | '23'
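
To make the grammar concrete, the following Python sketch samples token sequences from it by recursive expansion. The rules listed above do not include a top-level sentence rule, so the start sequence SOS NP VP EOS used here is an assumption, as are the function names.

```python
import random

# Production rules copied from the grammar above; each alternative is a list of symbols.
GRAMMAR = {
    "SOS": [["1"]],
    "EOS": [["2"]],
    "NP": [["Det", "Nom"], ["PropN"]],
    "Nom": [["Adj", "N"], ["N"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "PropN": [["3"], ["4"], ["5"]],
    "Det": [["6"], ["7"]],
    "N": [["8"], ["9"], ["10"], ["11"], ["12"]],
    "Adj": [["13"], ["14"], ["15"], ["16"], ["17"]],
    "V": [["18"], ["19"], ["20"], ["21"]],
    "P": [["22"], ["23"]],
}

def expand(symbol, rng):
    """Recursively expand a symbol into a list of terminal tokens."""
    if symbol not in GRAMMAR:                  # terminals are returned unchanged
        return [symbol]
    tokens = []
    for child in rng.choice(GRAMMAR[symbol]):  # pick one alternative uniformly
        tokens.extend(expand(child, rng))
    return tokens

def sample_sentence(rng, start=("SOS", "NP", "VP", "EOS")):
    """Sample one sentence; the start sequence is an assumed sentence rule."""
    tokens = []
    for symbol in start:
        tokens.extend(expand(symbol, rng))
    return tokens

sentence = sample_sentence(random.Random(0))
```

Under the assumed start sequence every sampled sentence begins with '1' and ends with '2', and its length is bounded because no rule is recursive.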

Appendix C Repair Examples

Here we provide additional selected examples of source code correctly and incorrectly repaired by our GAN model. Tables 3-6 show successful repairs, and Tables 7-8 show failures.

With Vulnerability:

  void CWE129_Improper_Validation_Of_Array_Index() {
    int data;
    data = -1;
    {
  #ifdef _WIN32
      WSADATA wsa_data;
      int wsa_data_init = 0;
  #endif
      int recv_rv;
      struct sockaddr_in s_in;
      SOCKET connect_socket = INVALID_SOCKET;
      char input_buf[CHAR_ARRAY_SIZE];
      do {
  #ifdef _WIN32
        if (WSAStartup(MAKEWORD(2, 2), &wsa_data) != NO_ERROR)
          break;
        wsa_data_init = 1;
  #endif
        connect_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (connect_socket == INVALID_SOCKET)
          break;
        memset(&s_in, 0, sizeof(s_in));
        s_in.sin_family = AF_INET;
        s_in.sin_addr.s_addr = inet_addr("");
        s_in.sin_port = htons(TCP_PORT);
        if (connect(connect_socket, (struct sockaddr *)&s_in, sizeof(s_in)) ==
            SOCKET_ERROR)
          break;
        recv_rv = recv(connect_socket, input_buf, CHAR_ARRAY_SIZE, 0);
        if (recv_rv == SOCKET_ERROR || recv_rv == 0)
          break;
        data = atoi(input_buf);
      } while (0);
      if (connect_socket != INVALID_SOCKET)
        CLOSE_SOCKET(connect_socket);
  #ifdef _WIN32
      if (wsa_data_init)
        WSACleanup();
  #endif
    }
    {
      int data_copy = data;
      int data = data_copy;
      {
        int data_buf[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        if (data >= 0) {
          printIntLine(data_buf[data]);
        } else {
          printLine("ERROR: Array index is negative");
        }
      }
    }
  }

Repaired:

  void CWE129_Improper_Validation_Of_Array_Index() {
    int data;
    data = -1;
    {
  #ifdef _WIN32
      WSADATA wsa_data;
      int wsa_data_init = 0;
  #endif
      int recv_rv;
      struct sockaddr_in s_in;
      SOCKET connect_socket = INVALID_SOCKET;
      char input_buf[CHAR_ARRAY_SIZE];
      do {
  #ifdef _WIN32
        if (WSAStartup(MAKEWORD(2, 2), &wsa_data) != NO_ERROR)
          break;
        wsa_data_init = 1;
  #endif
        connect_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (connect_socket == INVALID_SOCKET)
          break;
        memset(&s_in, 0, sizeof(s_in));
        s_in.sin_family = AF_INET;
        s_in.sin_addr.s_addr = inet_addr("");
        s_in.sin_port = htons(TCP_PORT);
        if (connect(connect_socket, (struct sockaddr *)&s_in, sizeof(s_in)) ==
            SOCKET_ERROR)
          break;
        recv_rv = recv(connect_socket, input_buf, CHAR_ARRAY_SIZE, 0);
        if (recv_rv == SOCKET_ERROR || recv_rv == 0)
          break;
        data = atoi(input_buf);
      } while (0);
      if (connect_socket != INVALID_SOCKET)
        CLOSE_SOCKET(connect_socket);
  #ifdef _WIN32
      if (wsa_data_init)
        WSACleanup();
  #endif
    }
    {
      int data_copy = data;
      int data = data_copy;
      {
        int data_buf[10] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        if (data >= 0 && data < 10) {
          printIntLine(data_buf[data]);
        } else {
          printLine("ERROR: Array index is out-of-bounds");
        }
      }
    }
  }

Table 3: Successful Repair - This function reads the index of an array access from a socket and returns the memory at that index. The vulnerable function only checks the lower bound on the array size. Our GAN repairs it by adding an additional check on the upper bound.
With Vulnerability:

  void CWE666_Operation_on_Resource_in_Wrong_Phase_of_Lifetime__accept_listen_bind_() {
    {
      char data[100] = "";
  #ifdef _WIN32
      WSADATA wsa_data;
      int wsa_data_init = 0;
  #endif
      int recv_rv;
      struct sockaddr_in s_in;
      char *replace;
      SOCKET listen_socket = INVALID_SOCKET;
      SOCKET accept_socket = INVALID_SOCKET;
      size_t data_len = strlen(data);
      do {
  #ifdef _WIN32
        if (WSAStartup(MAKEWORD(2, 2), &wsa_data) != NO_ERROR)
          break;
        wsa_data_init = 1;
  #endif
        listen_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (listen_socket == INVALID_SOCKET)
          break;
        memset(&s_in, 0, sizeof(s_in));
        s_in.sin_family = AF_INET;
        s_in.sin_addr.s_addr = INADDR_ANY;
        s_in.sin_port = htons(TCP_PORT);
        accept_socket = accept(listen_socket, NULL, NULL);
        if (accept_socket == SOCKET_ERROR)
          break;
        if (listen(listen_socket, LISTEN_BACKLOG) == SOCKET_ERROR)
          break;
        if (bind(listen_socket, (struct sockaddr *)&s_in, sizeof(s_in)) ==
            SOCKET_ERROR)
          break;
        recv_rv = recv(accept_socket, (char *)data + data_len,
                       (int)(100 - data_len - 1), 0);
        if (recv_rv == SOCKET_ERROR || recv_rv == 0)
          break;
        data[recv_rv] = '\0';
        replace = strchr(data, '\r');
        if (replace)
          *replace = '\0';
        replace = strchr(data, '\n');
        if (replace)
          *replace = '\0';
      } while (0);
      if (listen_socket != INVALID_SOCKET)
        CLOSE_SOCKET(listen_socket);
    }
  }

Repaired:

  void CWE666_Operation_on_Resource_in_Wrong_Phase_of_Lifetime__accept_listen_bind_() {
    {
      char data[100] = "";
  #ifdef _WIN32
      WSADATA wsa_data;
      int wsa_data_init = 0;
  #endif
      int recv_rv;
      struct sockaddr_in s_in;
      char *replace;
      SOCKET listen_socket = INVALID_SOCKET;
      SOCKET accept_socket = INVALID_SOCKET;
      size_t data_len = strlen(data);
      do {
  #ifdef _WIN32
        if (WSAStartup(MAKEWORD(2, 2), &wsa_data) != NO_ERROR)
          break;
        wsa_data_init = 1;
  #endif
        listen_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (listen_socket == INVALID_SOCKET)
          break;
        memset(&s_in, 0, sizeof(s_in));
        s_in.sin_family = AF_INET;
        s_in.sin_addr.s_addr = INADDR_ANY;
        s_in.sin_port = htons(TCP_PORT);
        if (bind(listen_socket, (struct sockaddr *)&s_in, sizeof(s_in)) ==
            SOCKET_ERROR)
          break;
        if (listen(listen_socket, LISTEN_BACKLOG) == SOCKET_ERROR)
          break;
        accept_socket = accept(listen_socket, NULL, NULL);
        if (accept_socket == SOCKET_ERROR)
          break;
        recv_rv = recv(accept_socket, (char *)data + data_len,
                       (int)(100 - data_len - 1), 0);
        if (recv_rv == SOCKET_ERROR || recv_rv == 0)
          break;
        data[recv_rv] = '\0';
        replace = strchr(data, '\r');
        if (replace)
          *replace = '\0';
        replace = strchr(data, '\n');
        if (replace)
          *replace = '\0';
      } while (0);
      if (listen_socket != INVALID_SOCKET)
        CLOSE_SOCKET(listen_socket);
    }
  }

Table 4: Successful Repair - This function attempts to accept a socket and use it before it has bound it. Our GAN approach repairs the function by reordering the bind, listen, and accept into the correct order.
With Vulnerability:

  void CWE131_Incorrect_Calculation_Of_Buffer_Size() {
    wchar_t *data;
    data = NULL;
    data = (wchar_t *)malloc(10 * sizeof(wchar_t));
    {
      wchar_t data_src[10 + 1] = SRC_STRING;
      size_t i, src_len;
      src_len = wcslen(data_src);
      for (i = 0; i < src_len; i++) {
        data[i] = data_src[i];
      }
      data[wcslen(data_src)] = L'\0';
      printWLine(data);
      free(data);
    }
  }

Repaired:

  void CWE131_Incorrect_Calculation_Of_Buffer_Size() {
    wchar_t *data;
    data = NULL;
    data = (wchar_t *)malloc((10 + 1) * sizeof(wchar_t));
    {
      wchar_t data_src[10 + 1] = SRC_STRING;
      size_t i, src_len;
      src_len = wcslen(data_src);
      for (i = 0; i < src_len; i++) {
        data[i] = data_src[i];
      }
      data[wcslen(data_src)] = L'\0';
      printWLine(data);
      free(data);
    }
  }

Table 5: Successful Repair - This function has a buffer allocated which is too small for the resulting data write. Our GAN repairs it by increasing the amount of memory allocated to the buffer.
With Vulnerability:

  void CWE685_Function_Call_With_Incorrect_Number_Of_Arguments() {
    char dst[DST_SZ];
    sprintf(dst, "%s %s", SRC_STR);
    printLine(dst);
  }

Repaired:

  void CWE685_Function_Call_With_Incorrect_Number_Of_Arguments() {
    char dst[DST_SZ];
    sprintf(dst, "%s %s", SRC_STR, SRC_STR);
    printLine(dst);
  }

Table 6: Successful Repair - This function calls sprintf to print out two strings, but only provides the first string to print. Our GAN repairs it by providing a second string.
With Vulnerability:

  void CWE457_Use_of_Uninitialized_Variable() {
    double *data;
    data = (double *)ALLOCA(10 * sizeof(double));
    {
      int i;
      for (i = 0; i < (10 / 2); i++) {
        data[i] = (double)i;
      }
    }
    {
      int i;
      for (i = 0; i < 10; i++) {
        printDoubleLine(data[i]);
      }
    }
  }

GAN Attempted Repair:

  void CWE457_Use_of_Uninitialized_Variable(){
    double *data;
    data = (double *)ALLOCA(10 * sizeof(double));
    {
      size_t i;
      for (i = 0; i < 10; i++) {
        data[i] = (double)i;
      }
    }
    {
      int i;
      for (i = 0; i < 10; i++) {
        static_five(data[i]);
      }
    }
  }

Golden Repair:

  void CWE457_Use_of_Uninitialized_Variable() {
    double *data;
    data = (double *)ALLOCA(10 * sizeof(double));
    {
      int i;
      for (i = 0; i < 10; i++) {
        data[i] = (double)i;
      }
    }
    {
      int i;
      for (i = 0; i < 10; i++) {
        printDoubleLine(data[i]);
      }
    }
  }

Table 7: Incorrect Repair - This function tries to print out from data which has not been fully initialized. Our GAN corrects this by fully initializing the variable, but then attempts to call a variable identifier instead of making the appropriate function call.
With Vulnerability:

  void CWE191_Integer_Underflow() {
    unsigned char data;
    data =  ;
    data = (unsigned char)rand();
    {
      unsigned char result = data - 1;
      printHexUnsignedCharLine(result);
    }
  }

GAN Attempted Repair:

  void CWE191_Integer_Underflow(){
    unsigned char data;
    data =  ;
    data = (unsigned char)rand((unsigned int)data);
    {
      char data = data;
      free(data);
    }
  }

Golden Repair:

  void CWE191_Integer_Underflow() {
    unsigned char data;
    data =  ;
    data = 5;
    {
      unsigned char result = data - 1;
      printHexUnsignedCharLine(result);
    }
  }

Table 8: Incorrect Repair - This function encounters an integer underflow by assigning a random value to a char and then subtracting 1. If the random value is 0, this will underflow the char. The given golden repair in this case is simply to change the random char to a known value. However, our GAN gets confused: it instead modifies the call to rand in an unexpected way and proceeds to free the data rather than print it.