Security bugs (i.e., vulnerabilities) in software are becoming more and more difficult to identify in today’s applications, allowing hackers and attackers to profit from their exploit. Every year, tens of thousands of such vulnerabilities are discovered and fixed. Manually auditing source code and finding vulnerabilities is costly at best, if not impossible at all. Therefore, researchers and practitioners have proposed various tools that can help in discovering vulnerabilities automatically. Classical vulnerability detection tools rely on static [10, 3, 17] or dynamic [22, 20, 24] code analysis, symbolic execution or taint analysis. However, with the advent of efficient machine learning techniques, new approaches appear that try to solve Software Engineering (SE) problems by training AI prediction models on large amount of annotated code samples. Vulnerability detection is one typical such SE task that has been addressed with these new ML approaches. As of today, using prediction models to decide if a source code fragment is vulnerable or has became a very common and effective approach. An inherent property of such approaches is that they can work with numeric vectors (i.e., feature vectors) as inputs.Therefore, one needs to transform the source code into such feature vectors, often referred to as code embedding. This process can be either manual (i.e., defining and extracting features from source code manually, like lines of code, number of branches, code complexity) or automatic (i.e., applying ML based techniques to automatically learn the vector representation of code). A popular approach for automatic code embedding is to adapt natural language processing techniques, like text representation , to automatically derive the necessary features from the source code.Despite their popularity, the suitability and comparison of different text representation techniques for solving SE problems has been rarely studied systematically.Given the fact that the accuracy of prediction models relies heavily on the appropriate representation of input data, we need empirical data about the effect of such representations on the underlying SE task to be solved.
applied to the SE task of detecting vulnerabilities in Python code. We applied a data mining approach to collect a suitable training data for training vulnerability prediction models.Using a heuristic approach (i.e., searching for simple terms indicating security fixes in commit logs) we collected a large volume of Python source code from GitHub in both vulnerable and fixed forms.We generated the vector representation of these code fragments using automatic code embedding based on the word2vec, fastText, and BERT text representation methods and used a Long Short-Term Memory network to create a vulnerability prediction model based on them. Training the LSTM model with the same architecture on each of the different code representations, we could compare the efficiency of the various embeddings in deriving meaningful feature vectors for vulnerability prediction. We investigated the following two research questions using the above described methodology:
RQ1: Is there a significant difference in the performance of the vulnerability prediction models based on the different code embedding methods?
RQ2: Are some of the code embedding methods more suitable for predicting certain types of vulnerabilities than others?
Our findings show that all the text representation methods are suitable for code representation in this particular task, but the BERT model is the most promising, as it is the least time consuming, and the LSTM model based on it achieved the best overall accuracy(93.8%) in predicting Python source code vulnerabilities. Regarding the various vulnerability types, we observed slight variances in model performances based on the applied source code embeddings. Nonetheless, the prediction model based on the word2vec representation of code clearly outperformed models based on fastText and BERT for detecting SQL injection, while in case of Command Injection, Cross-Site Request Forgery (XSRF), Remote Code Execution (RCE), and Path Disclosure the BERT based models achieved better results than models based on the other two embeddings.
The rest of the paper is organized as follows. In Section 2 we introduce works that are similar to ours. Section 3 gives details about our dataset collection methodology, while in Section 4 we describe our overall approach for the systematic comparison of the different source code embedding methods. Section 5 contains the comparison results. We summarize the set of threats to the validity of our work in Section 6 and conclude our findings in Section 7.
2 Related Work
Solving a SE task with machine learning requires the input source code to be represented as a numeric vector. Therefore many approaches have been proposed for deriving meaningful code representations to feed into ML models.
Alon et al.  introduce code2vec, a neural model for representing snippets of code as continuous distributed vectors. The method first breaks down the code to a collection of paths in its abstract syntax tree. Then, the network learns the atomic representation of each path while simultaneously learning how to aggregate a set of them.
Lozoya et al.  introduce a new code embedding technique called commit2vec based on code2vec. This representation focuses on embedding code changes rather than code snapshots, which they used to successfully train models to detect vulnerability fixing commits.
Ben-Nun et al.  propose a code embedding technique called inst2vec that is based on an Intermediate Representation (IR) of the code that is independent of the source programming language. They provide a novel definition of contextual flow for this IR, leveraging both the underlying data- and control-flow of the program. The athors of the paper demonstrate the effectiveness of the approach on compute device mapping, optimal thread coarsening and algorithm classification .
Mou et al. 
propose the “coding criterion” to build program vector representations, which are the premise of deep learning for program analysis. They evaluate the learned vector representations both qualitatively and quantitatively.
In our work, we do not propose new code embedding techniques, rather evaluate the impact of different text representation methods (i.e., word embeddings) used as code representations for vulnerability prediction. As these techniques are usually used under the hood of the more complex code representations, this is a natural first step towards better understanding the application of natural language processing techniques for solving SE tasks.
There are also many related works that focus on vulnerability prediction or similar SE tasks based on word embeddings in particular. Harer et al.  use word2vec to create word embeddings for C/C++ tokens. Based on this code representation they successfully train machine learning models to predict the results of static analyzers.
White et al.  apply word2vec in the scope of automatic program repair. In their approach, DeepRepair, they create Java token embeddings that they use to start a recursive encoder for abstract syntax trees. Chen and Monperrus  use word2vec to create Java token embeddings for automated program repair as well, in order to find the correct ingredients.
Unlike the works mentioned above, we are not focusing on creating a state-of-the-art prediction model for a particular SE task. Rather, we investigate the capabilities of various text vectorization techniques as source code representations in the context of identifying vulnerabilities in Python code with ML.
The work of Russel et al. 
is the closest one to ours. They developed a fast and scalable vulnerability detection tool based on deep feature representation learning that directly interprets lexed source code. They compared the bag-of-words (BOW) based simple source code embedding with code representations learned by CNN and RNN models automatically (i.e., with an embedding layer). Although our approach is similar, we do not compare different ML models and automatic code representation learning, but explicitly compare the effect of applying word embeddings as features for an LSTM prediction model. To the best of our knowledge, ours is the first attempt to systematically evaluate the impact of word2vec, fastText, and BERT-based code embeddings on vulnerability prediction.
3 Dataset Extraction
In order to compare the various word embedding based code representations, we need a training dataset for the vulnerability prediction model relying on them. We applied a data mining approach to gather actual vulnerability fixes from various repositories and use them to train our model to recognize different patterns of security vulnerabilities in source code. We chose GitHub as our data source as it contains a wide range of open source applications, including Python source code, which we focus on in this work. We searched for commits in Python projects with some vulnerability related keywords included in their commit messages. There are numerous types of security vulnerabilities in programming languages, but most of them are spread across languages. We focus on some of the most popular vulnerability types, namely SQL and command injection, cross-site scripting, cross-site request forgery, remote code execution, and path disclosure. Our overall data mining process is displayed in Figure1.
3.1 Mining GitHub
We followed the guidance provided by Chaturvedi et al. , who demonstrated how to use tools and datasets to mine database repositories and assist us in gathering data in this time-consuming task. The first step is to collect a large number of commits that are candidate vulnerability fixes. We searched for vulnerability fixing commits by querying GitHub data through its public REST API. Because of GitHub constraints, we first had to extract a dataset containing commits coming from different language projects and then filter out data related to Python projects. We ended up collecting approximately 70k commits yielding to 140k Python code snippets (vulnerable and fixed together) from 14k different Python projects. To facilitate reproducibility, we published all the collected data and processing scripts in the form of an online appendix.111https://doi.org/10.5281/zenodo.4703996
3.2 Filtering the Data
After collecting the candidate commits, we filtered them based on some security relevant keywords. Some sample keywords we use in the heuristic data collection scripts are shown in Figure 2. For the complete list of search terms, see the script source in the online appendix package.
We used the PyDriller tool  to download repositories and look for relevant commits and extract information from them. We also filtered out commits that did not contain changes in files with ’.py’ extension. Once we identified the commits that are related to vulnerability fixes, we downloaded the changed source code before and after the fixes. It turned out that downloading the source code in a reasonable amount of time was possible if all of the scanning was done ahead of time in a clever way to keep the number of downloaded repositories to a bare minimum. The diffs files we downloaded are essentially large text files that represent the changes in the source code introduced by a commit; however, they contained some unnecessary details (file name, line number) that we eliminated before assembling the final dataset using the previous and subsequent versions of the code snippet. Both snippets contained the changed lines so we could extract and label the functions in the previous version as vulnerable while after the fix, they become not vulnerable.
After filtering the commits based on their messages and downloaded the relevant code changes in form of diffs, we created the final labeled dataset as follows. We removed the comments from the affected code blocks because they are unlikely to impact a file’s vulnerability. After that, we extracted the fragments of code (i.e., code blocks) from the diff files that were affected by the fix in the commit and assigned the vulnerable label to its pre-fixed version while not vulnerable label was assigned to its fixed form. However, it is not trivial to identify the exact code blocks within the whole source file that were affected by a vulnerability fixing change. For this, we analyzed the downloaded diff files and implemented the algorithm presented by Hovsepyan et al.  and Wartschinski et al.  to find the appropriate code block.
The primary goal of this work is to compare various embedding layers based on text representations in order to determine their capabilities in detecting vulnerabilities in Python programs. To achieve an objective comparison, we need to apply the different source code embedding methods selected with the same ML algorithm trained on the same dataset. We evaluate the embedding methods by training a Long Short-Term Memory (LSTM) model with the same hyper-parameters on the dataset described in Section 3.
4.1 The Evaluated Embedding Layers
To encode the code tokens we need to transfer the code tokens into vectors using one of our selected embedding methods (word2vec, fastText and BERT). For word2vec and fastText we need to train a model that learns the embeddings based on a large corpus of Python source code. To collect this, we also mined GitHub for popular Python projects.
Word2vec is one of the most widely adopted word embedding methods to represent source code in vector form. To derive word2vec based source code embeddings, we needed to train a suitable word2vec model on Python source code to encode the code tokens into word2vec vectors. Training the word2vec model requires a large corpus of Python source code (for further reference, see the works of Bhoopchand et al.  and Allamanis et al. ).
To collect such a corpus, we searched for popular projects on GitHub. GitHub uses two metrics to measure a repository’s popularity: stars (user-created highlights) and forks (number of copies of a repository). The list of selected repositories with a high number of stars and forks used as a code corpus is available in our online appendix. We used PyDriller  for querying the most popular projects and downloading their source files. The resulting source code, 11 million lines in total from 38 of the most popular projects, is simply concatenated to create a single massive Python code file. Another script is then used to fix any issues with the text, such as indentation errors. We transform the Python programs into Python tokens using the built-in Python tokenizer. We delete the comments from files and add new lines at the end of the file. Tabs and indentations have been normalized. The word2vec model is then trained on the corpus using the Gensim 222https://radimrehurek.com/gensim/ Python package (see Figure 3). All the word2vec training scripts are also part of our online appendix.
We use the exact same process for calculating fastText 
embeddings as for word2vec. This means, we apply the same data analysis and tokenizer scripts and train the embedding model with the same Python code corpus. We chose fastText as a study subject because word2vec only learns vectors for words that are complete in the training corpus. FastText, on the other hand, learns vectors for both the n-grams and the full words contained inside each word. FastText uses the mean of the target word vector and its component n-gram vectors for training at each step. The change derived from the error is then applied uniformly to all of the vectors that were combined to form the target. This adds a significant amount of extra computation to the training step. A word must sum and average its n-gram component parts at each point.
Bidirectional Encoder Representations from Transformers (BERT) is a Google-developed Transformer-based machine learning technique for natural language processing (NLP) pre-training . Jacob Devlin and his Google colleagues developed and released BERT in 2018. We selected this embedding method for comparison due to its recent successes within the NLP field. As the BERT model is pre-trained on natural language texts, to adopt it to source code, we used its Microsoft’s variant, called CodeBERT . CodeBERT is a pre-trained BERT model for programming languages. In the context of this paper, we used BERT only as an embedding method and we feed all output tokens to an LSTM model. The biggest difference between BERT and the other two embedding methods is that the training part of the embedding model is done in advance using a huge corpus in case of BERT, while for word2vec and fastText, we need to do the training locally. This means that CodeBERT can be used out of the box, without having to train a model for token embeddings.
4.2 Preparing the Data for Classification
The collected vulnerability dataset (see Section 3
) contains data in the form of vulnerable and not vulnerable code snippets. We need to transform these into a list of tokens (such as ’+’, ’for’, ’init’) and convert each token into its vector representation according to the different embedding methods. Each list of such vectors (representing the list of tokens of the underlying code snippet) will be labeled, where label ’1’ means that the code is vulnerable and ’0’ means it is not vulnerable. Since ML models require a fix-sized input, we took the overall length of the concatenated vectors for the longest code snippet in our dataset and padded all the shorter ones with zeros to make them have the same length.
We split the training dataset into three sets, train, validation and test. 80% of the data selected randomly is used for training, 10% for validating and 10% for testing. Note that the validation data is only used to evaluate the model performance but all the results are presented on the test data, which the model has never seen before, applying a 10-fold cross-validation.
4.3 Training the LSTM
After transforming each of our learning samples into a fix-sized numeric vector, we are ready for training the LSTM model on them. For the implementation of the model, we used the Keras library. The first component of the model is the LSTM layer, which learns the features associated with the label of the code snippet (i.e., whether it is vulnerable or not). We can use a variety of hyper-parameters for the model, like dropout.
Then, there is an activation layer that creates a dense output layer with one neuron. We used the Sigmoid activation function as our aim is to predict between two classes: vulnerable and not vulnerable. We applied different hyper-parameters and tried out several different combinations of them as a set. Technically, the evaluation metric and loss functions are also hyper-parameters. We chose to compare the model performances based on the F1-score metric.
Our model’s base hyper-parameter is the number of neurons, which has a direct impact on its learning capacity; the more neurons we use, the more complex structures our model can recognize, but the training can also take longer. Finally, we have the number of epochs, or the number of times the learning algorithm can iterate over the entire training data set, which we set to 100 and 200. The high-level overview of our code representation evaluation/comparison process is shown in Figure4.
With data mining, we created a large dataset of Python code snippets from GitHub and labeled them as being vulnerable or not vulnerable based on detected vulnerability fixing commits. The dataset covers six common types of vulnerabilities (SQL and command injection, cross-site scripting, cross-site request forgery, remote code execution, and path disclosure). We trained an LSTM classifier using different embedding layers (i.e., word2vec, fastText, and BERT) with different hyper-parameters to compare the impact of the three different source code embeddings on vulnerability detection predicting performance.
The results of the LSTM models based on the three different code embedding layers are displayed in Figures 5 and 6. Figure 5 shows the changes in F1 scores based on the number of epochs and ratio of dropout applied for the training. As can be seen, the results are very close for the three embedding approaches. Word2vec based results are slightly outperforming the others for small number of epochs and high droput rate. However, for more than 75 epochs and a droput rate lower than 20%, BERT based models perform the best.
Figure 6 depicts F1 score changes based on the hyper-parameters of neuron counts and batch sizes. Again, the results achieved with the different embedding methods are very close. For smaller number of neurons, word2vec based models work better, while for large number of neurons, BERT becomes superior to others. BERT is also the best performing model when it comes to various batch sizes. It works best with batch sizes between 400 and 600. It is also true, that word2vec slightly outperforms fastText when applying larger batch sizes.
LSTM models using the BERT-based code representation achieve, on average, an accuracy of 93.8%, a recall of 83.2%, a precision of 91.4%, and an an F1 score of 87.1%. The models based on the word2vec code representation achieve, on average, an accuracy of 91%, a recall of 86.1%, a precision of 88.2%, and an F1 score of 85.6%. While fastText code representation based models achieve, on average, an accuracy of 91.8%, a recall of 86.4%, a precision of 85.1%, and an F1 score of 84%, on average, an accuracy of 91%, a recall of 81%, a precision of 90% and an F1 score of 82%. Based on the results, we can answer RQ1 as follows.
To answer RQ2, we calculated the same performance measures grouped by the different vulnerability types. We categorized each code snippet according to the keywords found in the vulnerability fixing commit. As it can be seen from Figures 7, 8, and 9, there is not much variance in model performance values within the categories.
The word2vec based models (see Figure 7) show the least variance in model performances within vulnerability categories. The only minor exception is the recall for XSS, which is clearly lower than that of the others or the average, mostly because finding good vulnerable dataset of it is difficult and we think that we didn’t train it with enough data. On the other hand, word2vec based models perform the best among all in detecting SQL injection vulnerabilities.
The fastText based models (see Figure 8) show a higher variance within vulnerability categories. They have similar average performance values to the word2vec based models, but are less effective in finding XSS, XSRF, and Path disclosure types of vulnerabilities.
The BERT based models (see Figure 9) have the highest average performance measures and the variance in the values is small within the categories. The only exception is SQL injection, where BERT based models are less efficient than the other models (recall of 78%, F1 score of 80.1%). However, in case of Command Injection, Cross-Site Request Forgery (XSRF), Remote Code Execution (RCE), and Path Disclosure the BERT based models achieved better results than models based on other embeddings. We can sum up the above observations to answer RQ2 as follows.
6 Threats to Validity
The heuristic data collection is a major threat to the validity of the results. With a keyword based commit search we might include irrelevant commits (that do not fix vulnerabilities) and we might miss out those that fix vulnerabilities but do not contain the searched keywords. To mitigate this threat, we manually investigated a small sample of the collected data, which we found to be accurate in the majority of cases. Since we collected a very large amount of such training data, the impact of several mis-classified commits should be negligible.
Many scenarios exist where a weakness arises from the interaction of lines of code that are distributed over a large file (or multiple files). However, since the examples for vulnerabilities used to train the model only concentrate on the immediate vicinity of fixed lines, the model might be unable to learn the consequences of far-reaching dependencies. Even though our results might not generalize to all of the different types of vulnerabilities, we believe these preliminary empirical results are already valuable.
Limitations in the chosen approach may also be a threat to the internal validity. We selected one specific prediction model to compare three different code embedding algorithms. Different ML models might yield to different results. In the future, we plan to extend our scope and add multiple ML models and code embeddings to the comparison.
In this paper, we presented an empirical study where we performed the comparison of three word embedding based code representation methods in the context of vulnerability prediction in Python code. These methods – word2vec, fastText, and BERT – adopted from the field of natural language processing, are widely used in practice to represent source code as numeric vectors and solve SE tasks (e.g., code summarization, bug detection, or finding copy-pasted code parts) with ML models trained on these representations. Despite their popularity, very few works evaluate and compare their impact on the prediction performances of the ML models relying on them.
With a data mining approach, we collected vulnerability fixing commits from which we could extract vulnerable (before the fix) and not vulnerable (after the fix) code snippets that formed our training dataset (140k Python code snippets in total). We applied the three investigated code embeddings to these code snippets and fed the resulting vectors into the same LSTM architecture to train a prediction model. Our findings show that all the text representation methods are suitable for code representation in this particular task, but the BERT model is the most promising as it is the least time consuming and the LSTM model based on it achieved the best overall accuracy(93.8%) in predicting source code vulnerabilities. Regarding the various vulnerability types, we observed slight variances in model performances based on the applied source code embeddings. Nonetheless, the prediction model based on the word2vec representation of code clearly outperformed models based on fastText and BERT for detecting SQL injection, while in case of Command Injection, Cross-Site Request Forgery (XSRF), Remote Code Execution (RCE), and Path Disclosure the BERT based models achieved better results than models based on other embeddings.
Our future work will focus on using different classifiers and improving the approach for labelling the data, collecting a dataset of higher quality, and leveraging the commit context to create actionable fix recommendations. The work could also be extended to other programming languages or types of vulnerabilities.
The presented work was carried out within the SETIT Project (2018-1.2.1-NKP-2018-00004)333Project no. 2018-1.2.1-NKP-2018-00004 has been implemented with the support provided from the National Research, Development and Innovation Fund of Hungary, financed under the 2018-1.2.1-NKP funding scheme. and supported by the Ministry of Innovation and Technology NRDI Office within the framework of the Artificial Intelligence National Laboratory Program (MILAB). The research was partly supported by the EU-funded project AssureMOSS (Grant no. 952647).
Furthermore, Péter Hegedűs was supported by the Bolyai János Scholarship of the Hungarian Academy of Sciences and the ÚNKP-20-5-SZTE-650 New National Excellence Program of the Ministry for Innovation and Technology.
-  (2013) Mining source code repositories at massive scale using language modeling. In 2013 10th Working Conference on Mining Software Repositories (MSR), pp. 207–216. Cited by: §4.1.1.
Code2vec: learning distributed representations of code. Proceedings of the ACM on Programming Languages 3 (POPL), pp. 1–29. Cited by: §2.
-  (2016) An user configurable clang static analyzer taint checker. In 2016 35th International Conference of the Chilean Computer Science Society (SCCC), pp. 1–12. Cited by: §1.
-  (2018) Neural code comprehension: a learnable representation of code semantics. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Red Hook, NY, USA, pp. 3589–3601. Cited by: §2.
-  (2016) Learning python code suggestion with a sparse pointer network. arXiv preprint arXiv:1611.08307. Cited by: §4.1.1.
-  (2013) Tools in mining software repositories. In 2013 13th International Conference on Computational Science and Its Applications, pp. 89–98. Cited by: §3.1.
-  (2018) The remarkable role of similarity in redundancy-based program repair. arXiv preprint arXiv:1811.05703. Cited by: §2.
-  (2018) Keras: the python deep learning library. Astrophysics Source Code Library, pp. ascl–1806. Cited by: §4.3.
-  (2017) Word2Vec. Natural Language Engineering 23 (1), pp. 155–162. Cited by: §1.
-  (2005) The astrÉe analyzer. In European Symposium on Programming, pp. 21–30. Cited by: §1.
-  (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Cited by: §1, §4.1.3.
-  (2020) Codebert: a pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155. Cited by: §4.1.3.
-  (2018) Automated software vulnerability detection with machine learning. External Links: Cited by: §2.
-  (2012) Software vulnerability prediction using text analysis techniques. In Proceedings of the 4th international workshop on Security measurements and metrics, pp. 7–10. Cited by: §3.3.
-  (2016) Fasttext.zip: compressing text classification models. arXiv:1612.03651. Cited by: §1, §4.1.2.
-  (2021) Commit2vec: learning distributed representations of code changes. SN Computer Science 2 (3), pp. 1–16. Cited by: §2.
-  (2014) Coccinelle: tool support for automated cert c secure coding standard certification. Science of Computer Programming 91, pp. 141–160. Cited by: §1.
-  (2015) Building program vector representations for deep learning. In Knowledge Science, Engineering and Management, S. Zhang, M. Wirsing, and Z. Zhang (Eds.), Cham, pp. 547–553. External Links: Cited by: §2.
-  (2018) Automated vulnerability detection in source code using deep representation learning. In 2018 17th IEEE international conference on machine learning and applications (ICMLA), pp. 757–762. Cited by: §2.
-  (2010) Dynamic program analysis of microsoft windows applications. In 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS), pp. 2–12. Cited by: §1.
-  (2018) Pydriller: python framework for mining software repositories. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 908–911. Cited by: §3.2, §4.1.1.
-  (2004) ATOM: a system for building customized program analysis tools. ACM SIGPLAN Notices 39 (4), pp. 528–539. Cited by: §1.
LSTM neural networks for language modeling. In 13th Annual Conference of the International Speech Communication Association, Cited by: §1.
-  (2010) Dynamic analysis and profiling of multithreaded systems. In Advanced Operating Systems and Kernel Applications: Techniques and Technologies, pp. 156–199. Cited by: §1.
-  (2014) Detecting software vulnerabilities with deep learning. Master’s Thesis, Humboldt University, Berlin. Cited by: §3.3.
Learning text representation using recurrent convolutional neural network with highway layers. arXiv preprint arXiv:1606.06905. Cited by: §1.
-  (2019-02) Sorting and transforming program repair ingredients via deep learning code similarities. 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). External Links: Cited by: §2.