Using Sequence-to-Sequence Learning for Repairing C Vulnerabilities

12/04/2019 · by Zimin Chen, et al. · KTH Royal Institute of Technology, Colorado State University

Software vulnerabilities affect all businesses, and research is being done to avoid, detect, or repair them. In this article, we contribute a new technique for automatic vulnerability fixing. We present a system that uses the rich software development history available on GitHub to train an AI system that generates patches. We apply sequence-to-sequence learning on a large dataset of code changes, and we evaluate the trained system on real-world vulnerabilities from the CVE database. The results show the feasibility of using sequence-to-sequence learning for fixing software vulnerabilities.


Sequence-to-sequence learning

Sequence-to-sequence (seq2seq) learning is a modern machine learning framework that is used to learn the mapping between two sequences, typically of words [sutskever2014sequence]. It is widely used in automated translation, text summarization, and other tasks related to natural language. A seq2seq model consists of two parts, an encoder and a decoder. The encoder maps the input sequence $X = (x_1, \dots, x_n)$ to an intermediate continuous representation $z$. Then, given $z$, the decoder generates the output sequence $Y = (y_1, \dots, y_m)$. Note that the sizes of the input and output sequences, $n$ and $m$, can be different. A seq2seq model is optimized on a training dataset to maximize the conditional probability $p(Y \mid X)$, which is equivalent to:

$$p(Y \mid X) = \prod_{i=1}^{m} p(y_i \mid y_1, \dots, y_{i-1}, z)$$

Prior work has shown that source code is as natural as human language [hindle2012naturalness], and techniques used in natural language processing have been demonstrated to work well on source code, including seq2seq learning [chen2019sequencer]. In our work, we use a seq2seq model called the "transformer" [vaswani2017attention], which is the state-of-the-art architecture for seq2seq learning.

Rare words in source code

One of the main challenges of using seq2seq models on source code is handling very rare words [hellendoorn2017deep]. The problem is that rare words, such as project-specific literals or domain-specific identifiers, are too uncommon or even absent in the collected training data, and hence cannot be produced at decoding time. Indeed, rare variable and function names are more common in source code than rare words are in human language. A basic technique to mitigate the rare word problem is to increase the vocabulary size, say from 10k to 50k, but this is only a workaround: there will always be rare words for which not enough data is available at training time.

However, a rare word may be composed of subwords that are frequent. For example, the word underworld might be rare, but its subwords under and world are common. So if we represent the vocabulary with frequent subwords, we can generate any word by composing them. Byte pair encoding (BPE) is the state-of-the-art technique for learning the most frequent subwords [sennrich2015neural]. BPE starts with the basic characters as the vocabulary (e.g., the letters of the Latin alphabet). At each step, the most frequent pair of adjacent subwords is merged into a new subword, which is added to the vocabulary. This continues until a predefined vocabulary size is reached. Listing 2 shows an example of applying BPE on a C function, and a minimal sketch of the merge loop is given after the listing. Identifiers like destroyKeyValuePair and freeValue are split into 4 and 2 subwords respectively (destroy Key Value Pair and free Value), which are more common units that can be expected to occur in other code snippets. BPE has been successfully used in machine translation [sennrich2015neural] and source code modeling [karampatsis2019maybe]. In this paper, we are the first to report on using seq2seq with BPE for patch generation.

                         C code
void destroyKeyValuePair(keyValuePair kvPair) {
    kvPair -> freeValue(kvPair -> value);
    kvPair -> freeKey(kvPair -> key);
    free(kvPair);
}
                After applying BPE
▁void ▁destroy Key Value Pair ▁( ▁key
Value Pair ▁kv Pair ▁) ▁{ ▁kv Pair ▁->
▁free Value ▁( ▁kv Pair ▁-> ▁value ▁) ▁;
▁kv Pair ▁-> ▁free Key ▁( ▁kv Pair ▁->
▁key ▁) ▁; ▁free ▁( ▁kv Pair ▁) ▁; ▁}
Listing 2: Example of applying learned BPE on a C function. "▁" (U+2581) indicates the start of a new word. BPE learned some useful subwords such as Key, Value, and Pair.
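
To make the merge procedure concrete, the following Python code is a minimal sketch of the BPE learning loop described above; it illustrates the algorithm only and is not the implementation used in our experiments (which rely on SentencePiece, see below).

from collections import Counter

def learn_bpe(corpus_words, target_vocab_size):
    """Toy BPE: corpus_words is a list of words, e.g., identifiers from tokenized C code."""
    words = [list(w) for w in corpus_words]          # each word starts as single characters
    vocab = {c for w in words for c in w}            # initial vocabulary: basic characters
    while len(vocab) < target_vocab_size:
        pairs = Counter()                            # count adjacent subword pairs
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]          # most frequent pair
        merged = a + b
        vocab.add(merged)                            # add the new subword
        for w in words:                              # apply the merge everywhere
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [merged]
                else:
                    i += 1
    return vocab

# Example: learn_bpe(["destroyKeyValuePair", "freeValue", "freeKey"], 40)
# may learn subwords such as "free", "Key", and "Value".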

Data collection and filtering

Training dataset. To train a seq2seq model, we need a large corpus of buggy and fixed source code. We create such a corpus by mining the GitHub development platform. We use GH Archive [GHArchive] to download all GitHub events that happened between 2017-01-01 and 2018-12-31. These events are triggered by development activities such as the creation of a GitHub issue or the opening of a pull request. In our case, we focus on push events, which are triggered when commits are pushed to a repository branch. To collect only bug fix commits, we adopt a keyword-based heuristic [Martinez2013]: if the commit message contains the keywords (fix OR solve OR repair) AND (bug OR issue OR problem OR error OR fault OR vulnerability), we consider it a bug fix commit and add it to our corpus. In total, we have analyzed 730 million commits and selected 21 million of them as bug fix commits.
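
As an illustration of this heuristic, a minimal Python sketch is given below; whether the original matching is case-insensitive or accepts inflected forms (e.g., "fixed") is an assumption of this sketch.

ACTION_WORDS = ("fix", "solve", "repair")
TOPIC_WORDS = ("bug", "issue", "problem", "error", "fault", "vulnerability")

def is_bug_fix_commit(message: str) -> bool:
    # Keep the commit if its message contains at least one action keyword
    # AND at least one topic keyword.
    msg = message.lower()
    return any(w in msg for w in ACTION_WORDS) and any(w in msg for w in TOPIC_WORDS)

# is_bug_fix_commit("Fix null pointer error in parser")  -> True
# is_bug_fix_commit("Add a logging feature")             -> False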

In our experiment, we focus on C code as the target programming language for automatic repair. Therefore, we further filter the bug fix commits based on file extension: we remove commits that do not fix any file ending with '.c', which leaves us with 910 000 buggy C commits. Then, for each commit, we extract the function pairs that were changed in the commit. We learn function-level changes instead of file-level changes because seq2seq suffers from long inputs and outputs [cho2014properties]. To identify function-level changes, we use the GNU compiler preprocessor to remove all comments and extract only the functions that are changed. Then, we use Clang to parse and tokenize the function source code.
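
The following Python sketch shows how function definitions can be extracted and tokenized with libclang's Python bindings (clang.cindex). It illustrates the parsing and tokenization step only; the exact options and the handling of macros in our pipeline may differ.

import clang.cindex

def function_token_streams(c_file_path):
    """Yield (function_name, token_list) for each function defined in the given C file."""
    index = clang.cindex.Index.create()
    tu = index.parse(c_file_path)
    for cursor in tu.cursor.walk_preorder():
        if cursor.kind == clang.cindex.CursorKind.FUNCTION_DECL and cursor.is_definition():
            # Skip declarations coming from included headers.
            if cursor.location.file and cursor.location.file.name == c_file_path:
                tokens = [t.spelling for t in cursor.get_tokens()]
                yield cursor.spelling, tokens

# Functions whose token sequence differs between the buggy and the fixed revision
# of a file form one function-level change (input = before, output = after).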

In the end, we obtain 1 806 879 function-level changes, reduced to 642 399 after removing duplicates. The sizes of functions vary, and we observe that some C functions are still too long to be learned with seq2seq. Therefore, we further divide the training data into three datasets, referred to as the 200-token, 100-token, and 50-token training sets, in which the function lengths before and after the change are limited to 200, 100, and 50 tokens respectively. These three training sets contain 299 976, 146 051, and 49 340 function-level changes respectively. The function code before the change is used as input to the seq2seq model, and the function after the change is used as the ground-truth output for training.
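
A minimal sketch of this dataset construction follows; the pair representation is an assumption for illustration.

def build_dataset(function_pairs, max_tokens):
    """function_pairs: list of (tokens_before, tokens_after) function-level changes.
    Keep only pairs where both versions fit within the token limit."""
    return [
        (before, after)
        for before, after in function_pairs
        if len(before) <= max_tokens and len(after) <= max_tokens
    ]

# e.g., the 50-token training set corresponds to build_dataset(all_pairs, 50)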

Testing dataset. We also collect a dataset for testing the ability of seq2seq to fix real vulnerabilities. We use Data7 [JimenezPT18] to collect known vulnerabilities with a CVE identifier from four well-known projects: the Linux kernel, OpenSSL, systemd, and Wireshark. Each sample in the testing dataset consists of a CVE number and the list of commits that fixed the vulnerability. Next, we extract function-level changes from these vulnerabilities, and we call them vulnerable functions. We consider a vulnerability to be completely fixed if all of its vulnerable functions are fixed, and partially fixed if at least one of its vulnerable functions is fixed (a sketch of this counting is given below). The 200-token, 100-token, and 50-token test sets are created by keeping only the vulnerable functions whose token lengths before and after the change are limited to 200, 100, and 50 tokens respectively. The 200-token test set contains 1615 vulnerable functions representing 630 vulnerabilities, the 100-token test set contains 725 vulnerable functions representing 288 vulnerabilities, and the 50-token test set contains 120 vulnerable functions spread over 85 vulnerabilities.
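
The per-vulnerability metrics can be summarized as follows; the data layout (a mapping from CVE identifier to per-function fix results) is an assumption for illustration.

def count_fixed_vulnerabilities(results):
    """results: {cve_id: [True/False for each of its vulnerable functions]}"""
    partially = sum(1 for fixed in results.values() if any(fixed))    # at least one function fixed
    completely = sum(1 for fixed in results.values() if all(fixed))   # every function fixed
    return partially, completely

# count_fixed_vulnerabilities({"CVE-A": [True, False], "CVE-B": [True]})  -> (2, 1)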

Experiment setup

The 200-token, 100-token, and 50-token training datasets are each randomly divided into training data and validation data, with 98% used for training and 2% for validation. We select the best models, i.e., those with the highest validation accuracy, via a grid search in the hyper-parameter space. We evaluate the resulting models on our testing datasets. We train three baseline seq2seq models, one per training dataset, with a vocabulary set to the top 50k most common tokens. These baselines represent the state-of-the-art seq2seq model without specific care to address the rare word problem.

Next, we explore seq2seq models that use BPE to handle rare tokens. For the BPE configuration, we set the size of the subword vocabulary to 1000, 5000, or 10000, i.e., the vocabulary is the top 1000, 5000, or 10000 most frequent subwords in source code identifiers. After having identified the BPE subword vocabularies, we train the seq2seq models on each of the three training datasets. Consequently, in addition to our baselines, we have nine different settings: the cross-product of the three token length limits and the three vocabulary sizes defined by BPE. In total, we have 12 different seq2seq models, summarized in Table 1. Each seq2seq model is named after its training dataset and its BPE configuration: for example, the 50-token BPE-1k model is trained on the 50-token dataset with a vocabulary set to the top 1k most frequent subwords.

The best model for each of the 12 settings is evaluated on the corresponding testing dataset: models trained with the 50-token limit are evaluated on the 50-token test set, models trained with the 100-token limit on the 100-token test set, and so on. We use beam search to predict fixes of vulnerable functions, which means that the seq2seq model generates the top 50 most likely predictions per vulnerable function. A vulnerable function is considered fixed when one of the 50 predictions matches the ground-truth human fix, as done in prior work [chen2019sequencer, tufano2018empirical].
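
The evaluation criterion can be sketched as follows; whitespace and token normalization details are assumptions of this illustration.

def evaluate(test_set, predict_top50):
    """test_set: list of (buggy_tokens, fixed_tokens) pairs.
    predict_top50: wrapper around the model returning its 50 most likely output token sequences."""
    fixed = 0
    for buggy_tokens, fixed_tokens in test_set:
        predictions = predict_top50(buggy_tokens)              # beam search, beam width >= 50
        if any(pred == fixed_tokens for pred in predictions):  # exact match with the human fix
            fixed += 1
    return fixed, fixed / len(test_set)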

We use OpenNMT-tf [2017opennmt] for training the transformer model, and SentencePiece [kudo2018sentencepiece] for learning BPE on the training data.
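
For completeness, here is a sketch of learning and applying a BPE subword model with SentencePiece; the file names and the non-essential training options are assumptions, only the vocabulary sizes (1000, 5000, 10000) come from our setup.

import sentencepiece as spm

# Learn a BPE model with a 5000-subword vocabulary from tokenized C functions
# (one function per line in a hypothetical text file).
spm.SentencePieceTrainer.train(
    input="train_functions.txt",
    model_prefix="c_bpe_5k",
    vocab_size=5000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="c_bpe_5k.model")
pieces = sp.encode("void destroyKeyValuePair(keyValuePair kvPair)", out_type=str)
# pieces is a list of subword strings; "▁" marks the start of a new word, as in Listing 2.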

Experimental Results

Model              | Fixed functions  | Partially fixed vulnerabilities | Completely fixed vulnerabilities
50-token Baseline  | 5/120 (4.2%)     | 3/85 (3.5%)                     | 1/85 (1.2%)
50-token BPE-1k    | 26/120 (21.7%)   | 17/85 (20%)                     | 3/85 (3.5%)
50-token BPE-5k    | 32/120 (26.7%)   | 22/85 (25.9%)                   | 3/85 (3.5%)
50-token BPE-10k   | 28/120 (23.3%)   | 18/85 (21.1%)                   | 3/85 (3.5%)
100-token Baseline | 2/725 (0.3%)     | 2/288 (0.7%)                    | 0/288 (0%)
100-token BPE-1k   | 68/725 (9.4%)    | 40/288 (13.9%)                  | 6/288 (2.1%)
100-token BPE-5k   | 97/725 (13.4%)   | 47/288 (16.3%)                  | 5/288 (1.7%)
100-token BPE-10k  | 99/725 (13.7%)   | 45/288 (15.6%)                  | 10/288 (3.5%)
200-token Baseline | 0/1615 (0%)      | 0/630 (0%)                      | 0/630 (0%)
200-token BPE-1k   | 109/1615 (6.7%)  | 45/630 (7.1%)                   | 9/630 (1.4%)
200-token BPE-5k   | 131/1615 (8.1%)  | 52/630 (8.3%)                   | 9/630 (1.4%)
200-token BPE-10k  | 148/1615 (9.2%)  | 55/630 (8.7%)                   | 14/630 (2.2%)

Table 1: Performance of our trained seq2seq models on the testing datasets. The first column gives the seq2seq model (training dataset and BPE configuration). The second column shows the number and percentage of correctly fixed vulnerable functions on the corresponding test set. The third column displays the number of partially fixed vulnerabilities, and the fourth column the number of completely fixed vulnerabilities.

Can the trained seq2seq models generate patches for real-world vulnerabilities? The main results are presented in Table 1. The first column gives the name of the seq2seq model, based on its training dataset and BPE configuration. The second column shows the prediction accuracy on the corresponding test set. The third column displays the number of partially fixed vulnerabilities, and the fourth column the number of completely fixed vulnerabilities.

We first focus on the number of vulnerable C functions that are correctly patched by our system. Overall, our models are able to fix real-world vulnerabilities, up to 32/120 (26.7%) for vulnerabilities in small C functions of fewer than 50 tokens. The seq2seq models' performance decreases with input and output length: the values for the 50-token models (first group of four rows) are higher than those for the larger functions. In other words, it is easier to fix vulnerabilities in shorter C functions.

Recall that the baseline models do not use BPE, and Table 1 indicates that their accuracy is close to 0. This shows that a standard seq2seq model with a fixed vocabulary is not an option for handling the rare token problem. To further analyze this phenomenon, we analyzed the 80 750 predictions generated by the 200-token baseline (1 615 vulnerable functions × 50 predictions per function). We found that 80 047 / 80 346 (99%) of the predictions contain out-of-vocabulary tokens, which further confirms the prevalence of rare words in source code. Our results show that byte-pair encoding (BPE) is a powerful solution to this problem: the number of fixed C functions jumps from 5 to 32 for small functions (50-token limit) and from 0 to 148 for larger functions (200-token limit). Our data suggests that for large functions there is still room for improvement with a larger subword vocabulary (a subword vocabulary of 20000 would likely increase the accuracy).
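
The out-of-vocabulary analysis above amounts to the following check; the assumption here is that the decoder emits a dedicated <unk> token for any word outside its fixed 50k vocabulary, which is the usual behavior of fixed-vocabulary seq2seq decoders.

def count_oov_predictions(predictions, unk_token="<unk>"):
    """predictions: list of token sequences generated by the baseline model."""
    return sum(1 for tokens in predictions if unk_token in tokens)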

Recall that a single vulnerability is often fixed in multiple functions at once. The third and fourth columns of Table 1 focus on the number of fixed vulnerabilities instead of the number of fixed vulnerable functions. Our seq2seq models are able to partially fix up to 22/85 vulnerabilities in small functions. Completely fixing a vulnerability is much harder, because all of its vulnerable functions must be fixed: for instance, in CVE-2011-1771, the fix spans 95 vulnerable functions. Despite this strong requirement, our approach is able to completely fix 3/85 vulnerabilities in the 50-token setting, 10/288 in the 100-token setting, and 14/630 in the 200-token setting. To our knowledge, this is the first reported result of seq2seq learning fixing general vulnerabilities.

The results of the seq2seq models are computed with respect to the ground-truth human fix. In production, such a vulnerability fixing system would be used without a ground-truth fix. Based on the suspicious functions, we would filter the output of seq2seq using additional checks, such as compilation (to remove uncompilable code) and test execution (to remove patches yielding test failures). Previous work has shown that such automatic filtering correctly filters out up to 97% of the patches generated with seq2seq [chen2019sequencer]. A sketch of a compilation-based filter is shown below.
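
The first of these checks, compilation, can be sketched as follows; the compiler flags, temporary-file handling, and the idea of syntax-checking a full patched file are assumptions of this illustration, and a real pipeline would also build the project and run its test suite.

import os
import subprocess
import tempfile

def compiles(patched_c_source: str) -> bool:
    """patched_c_source: the full C file after applying a predicted patch."""
    with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False) as f:
        f.write(patched_c_source)
        path = f.name
    try:
        # Syntax and type check only, no object file or linking.
        result = subprocess.run(["gcc", "-fsyntax-only", path], capture_output=True)
        return result.returncode == 0
    finally:
        os.unlink(path)

def filter_patches(candidate_sources):
    """Keep only candidate patched files that compile."""
    return [src for src in candidate_sources if compiles(src)]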

Case studies

int cifs_close(struct inode *inode, struct file *file) {
-   cifsFileInfo_put(file->private_data);
-   file->private_data = NULL;
+   if (file->private_data != NULL) {
+       cifsFileInfo_put(file->private_data);
+       file->private_data = NULL;
+   }
    /* return code from the ->release op is always ignored */
    return 0;
}
Listing 3: CVE-2011-1771 is successfully fixed by seq2seq.

Let us now discuss interesting cases. CVE-2011-1771 is a vulnerability from the Linux kernel; the description of this vulnerability from NVD is: "The cifs_close function in fs/cifs/file.c in the Linux kernel before 2.6.39 allows local users to cause a denial of service (NULL pointer dereference and BUG) or possibly have unspecified other impact by setting the O_DIRECT flag during an attempt to open a file on a CIFS filesystem." The human patch for this vulnerability is shown in Listing 3. The fix consists of adding a null check for the field private_data. Our seq2seq approach is able to generate this exact patch (4 of our 12 models do so).

static int omninet_open(struct tty_struct *tty, struct usb_serial_port *port) {
-   struct usb_serial   *serial = port->serial;
-   struct usb_serial_port      *wport;
-
-   wport = serial->port[1];
-   tty_port_tty_set(&wport->port, tty);
-
    return usb_serial_generic_open(tty, port);
}
Listing 4: CVE-2017-8925 is successfully fixed by seq2seq.

CVE-2017-8925 is another vulnerability from the Linux kernel; the description from NVD is: "The omninet_open function in drivers/usb/serial/omninet.c in the Linux kernel before 4.10.4 allows local users to cause a denial of service (tty exhaustion) by leveraging reference count mishandling." It is categorized as 'Improper Resource Shutdown or Release'. The human fix, shown in Listing 4, removes the statements that improperly handle the variables 'port' and 'tty'. This exact patch can be generated by all our seq2seq models trained on the 100-token and 200-token datasets. With appropriate training data, our seq2seq approach to generating vulnerability fixes is able to predict the same patches as human developers.


Conclusion

Software vulnerabilities are common and can cause great damage. In this paper, we took a step towards the automatic repair of security vulnerabilities. We devised, implemented, and evaluated a novel system based on sequence-to-sequence learning over past commits from software repositories. We mined two years of commit history from GitHub, and we addressed the rare word problem in source code by using the byte-pair encoding technique. Our results show that real-world vulnerable C functions can be fixed in a fully automated, data-driven manner. Future work is required to increase the performance of automatic vulnerability fixing tools on general vulnerabilities, and to explore the integration of such technology into the software development process.

