Malicious Source Code Detection Using Transformer

by   Chen Tsfaty, et al.

Open source code is considered a common practice in modern software development. However, reusing other developers' code gives bad actors access to a wide community of developers and, hence, to the products that rely on it. Such attacks are categorized as supply chain attacks. Recent years have seen a growing number of supply chain attacks that leverage open source during software development, relying on the download and installation procedures, whether automatic or manual. Over the years, many approaches have been invented for detecting vulnerable packages. However, it is uncommon to detect malicious code within packages. Those detection approaches can be broadly categorized as analyses that use code execution (dynamic) and analyses that do not (static). Here, we introduce the Malicious Source code Detection using Transformers (MSDT) algorithm. MSDT is a novel static analysis based on a deep learning method that detects real-world code injection cases in source code packages. In this study, we used MSDT and a dataset with over 600,000 different functions to embed various functions and applied a clustering algorithm to the resulting vectors, detecting the malicious functions by detecting the outliers. We evaluated MSDT's performance by conducting extensive experiments and demonstrated that our algorithm is capable of detecting functions that were injected with malicious code with precision@k values of up to 0.909.


1 Introduction

Software supply chain attacks aim to access source code, build processes, or update mechanisms by infecting legitimate apps to distribute malware. Hence, end-users will treat that malware as trusted software, e.g., when it arrives through download or update sites. An illustrative example of such attacks is the Codecov attack [jackson_2021], a backdoor concealed within a widely downloaded Codecov uploader script. In April 2021, attackers compromised a Codecov server to inject their malicious code into a bash uploader script. Codecov customers then downloaded this script for two months. When executed, the script exfiltrated sensitive information, including keys, tokens, and credentials from those customers' Continuous Integration/Continuous Delivery (CI/CD) environments. By utilizing this data, Codecov attackers reportedly breached hundreds of customer networks, including HashiCorp, Twilio, Rapid7, and e-commerce giant Mercari [jackson_2021].

These types of attacks are becoming more popular and harmful [sonatype2021state] due to modern development procedures. Those procedures use open-source packages and public repositories for many reasons: efficiency, accelerated development, cost-effectiveness, etc. For that reason, demand for open source is becoming widespread among developers, with a 73% growth in components downloaded in 2021 compared to 2020 [sonatype2021state]. The development procedures that involve those packages and repositories are mostly automatic, such as build procedures, or semi-automatic, such as developers installing an open-source package [chris2021beware]. As a result of this growth, popular packages, development communities, lead contributors, and many more can be considered attractive targets for software supply chain attacks [NIST2021defending, sawers2021next, sharma2021newly, peterson2021software, gregory2021supply]. Such attacks may pass their vulnerability on to dependent software projects. As of 2021, OWASP considers the software supply chain threat one of the Top-10 security issues worldwide. A leading example of such attacks is the ua-parser-js attack [sharma_2021], which occurred in October 2021. The attacker gained ownership of the package through an account takeover and published three malicious versions. At that time, ua-parser-js was a highly popular package with more than seven million weekly downloads.

In recent years, a vast research field has emerged to address this threat [NIST2021defending, ohm2020backstabber]. This field is researched by academia and is part of the application security market, which was valued at 6.42 billion USD [marketsandmarkets2020application]. This field includes many aspects that depend on various parameters, such as (1) the programming language (PL), since different PLs have different security issues [georgian2020common, kelly2021cpp]; and (2) the scope of the examined functionality (function, class, script, etc.). For example, there are attacks targeting a central function [bertus2019discord] or modules [constantin2018npm].

In this study, we developed the MSDT algorithm, a novel method for detecting malicious code injection within functions' source code by static analysis, which consists of the following four key steps (see Figure 1 and Section 3.1): First, we used the PY150 dataset [raychev2016probabilistic] to train a transformer architecture model. Second, by utilizing the transformer, we were able to embed every function in the CodeSearchNet (CSN) Python dataset [husain2019codesearchnet], which is used for evaluating the experiments, into the representation space of the transformer's encoding part. Third, we applied a clustering algorithm over the implementations of every function type to detect anomalies by searching for outliers. Lastly, we ranked the anomalies by their distance from the nearest cluster's border points: the farther the point is, the higher the score.

We conducted extensive experiments to evaluate MSDT's performance. The experiments consisted of randomly injecting five different real-world malicious codes into the top 100 common functions, using Code2Seq [alon2018code2seq] as the transformer and DBSCAN [prado2017dbscan] as the clustering algorithm. We then evaluated the results by precision at k (precision@k), for various k values, of matching the functions classified as malicious with their true tagging (see Section 3.2). The precision@k values measured by applying MSDT reached up to 0.909. For example, MSDT achieved this result for the different implementations of the get function. Those implementations were randomly injected with a real-world attack presented by Bertus et al. [bertus2019discord]. Additionally, we empirically evaluated MSDT on a real-world attack and succeeded in detecting it. Lastly, we empirically compared MSDT to widely used static analysis tools, which are only able to work on files, while MSDT works on functions. MSDT's capability to work on functions gives it a more precise ability to detect an injection in a given function.
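The precision@k metric used above can be illustrated with a minimal sketch. The function name `precision_at_k` and the toy ranking are assumptions made for illustration and are not taken from the MSDT implementation; the input is a list of ground-truth flags ordered from the most to the least anomalous function.

```python
def precision_at_k(ranked_is_malicious, k):
    """Fraction of the k highest-ranked items that are truly malicious.

    ranked_is_malicious: booleans ordered from most to least anomalous,
    where True means the function really was injected.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_is_malicious[:k]
    return sum(top_k) / k


# Toy ranking (illustrative data only): 2 of the top 3 are true injections.
ranking = [True, True, False, True, False]
score = precision_at_k(ranking, 3)
```

A value close to 1 for small k means that the functions ranked as most anomalous are almost all true injections.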

Figure 1: Overview of our data embedding and anomaly detection model process.

The key contributions of our study are threefold:

  1. We have developed MSDT, a novel algorithm to automatically detect code injection via anomaly detection in functions’ source code.

  2. We have created MSDT to support any textual PL. This is ensured by using the proper grammar and a transformer architecture (Code2Seq [alon2018code2seq]) to embed functions' source code.

  3. We have curated an open dataset of 607,461 functions that were injected with several real-world malicious codes. This dataset can be used in future works in the field of code injection detection.

The remainder of the paper is structured as follows: Section 2 summarizes the related work. Section 3 describes the proposed methodology and the conducted experiments in the study. Section 4 presents the results of this study. That is followed by Section 5, in which we discuss the study results. Lastly, Section 6 summarizes and concludes the study and offers future work.

2 Related Work

Malformed open-source packages pose several threats to every component in some development procedures and have become a vast research field with three main branches [ogasawara1998experiences]. In the following subsections, we provide an overview of these branches: Section 2.1 introduces the security issues that commonly appear in public repositories or occur due to the exploitation of weaknesses in PL features. Next, Section 2.2 provides an overview of the widely used methods to detect those attacks or weaknesses. Lastly, Section 2.3 gives an overview of the different Deep Learning (DL) methods in the field of code representation, which are used to apply advanced static analysis to the targeted code.

2.1 Security issues within open source packages

In recent years, the awareness of the threats regarding public repositories and open-source packages has increased. As a result, many studies [ohm2020backstabber, Harush2021, birsan2021dependency] point out two main security issues with the usage of those packages: (1) vulnerable packages [snyk2021opencv], which contain a flaw in their design [snyk2018double], an unhandled code error [snyk2020unchecked], or other bad practices that could be a future security risk [ruohonen2021large, ruohonen2018empirical]. This threat is widespread and has been vastly researched by communities and commercial companies (e.g., Snyk and WhiteSource). Usually, this threat is based on Common Vulnerabilities and Exposures (CVEs). Those vulnerabilities allow the malicious actor, with prior knowledge of where the package is used, to achieve its goal with a few actions [Tal2020, Sharma2021]; and (2) malicious intent in packages [tschacher2016typosquatting], which contain a bad design, an unhandled code error, a piece of code that does not serve the main functionality of the program, etc. Those examples are created to be exploited or triggered at some phase of the package lifecycle (installation, test, runtime, etc.).

Studies have shown a rise in malicious functionalities appearing in public repositories and highly used packages [ruohonen2021large, zimmermann2019small, polkovnichenko_2022]. With this rise, it becomes clear that there are common injection methods for malicious actors to infect packages. As demonstrated by Ohm et al. [ohm2020backstabber], to inject malicious code into a package, an attacker may either infect an existing package or create a new package that will be similar to the original one (often called dependency confusion [birsan2021dependency]). A new malicious package developed and published by a malicious actor has to follow several principles: (1) to be a proper replacement for the targeted package, it has to contain near-identical functionality; and (2) it has to be attractive, so that it ends up in the targeted users' dependency tree. To encourage the use of such new packages, one of the following methods can be used: naming the malicious package similarly to the original one (typosquatting) [bertus2019discord, birsan2021dependency, tschacher2016typosquatting, cimpanu2018twelve], creating a trojan in the package [constantin2018npm, cimpanu2019malicious], or taking over an unmaintained package or user account (use after free) [claburn2018resurrect]. As mentioned, the second injection strategy is to infect existing packages by one of the following methods: (1) injecting into the source of the original package through a pull request or social engineering [chris2021beware, della2021anatomy, thomas2018compromised, us2021malware]; (2) the open source project owner adding malicious functionality out of ideology, such as political motives [paganini2022nodeipc]; (3) injecting during the build process [kisielius2021breaking]; and (4) injecting through the repository system [cappos2008look].

Ohm et al. [ohm2020backstabber] demonstrated that the malicious intent in packages could be categorized by several parameters: targeted OS (Operating System), PL, the actual malicious activity, the location of the malicious functionality within the package (where it is injected), and more. Additionally, they showed the majority of the maliciousness is associated with persistence purposes, which can be categorized into several major groups: Backdoors, Droppers, and Data Exfiltration [ohm2020backstabber].

In this study, we focus on the second security issue, specifically in a dynamic PL (Python as a test case), because of the popularity of such PLs and of injection-oriented attacks within their repositories (Node.js, Python, etc.) [ohm2020backstabber]. Those injections are often related to the PLs' dynamicity features [georgian2020common], such as exposing the running functionality only at runtime (e.g., exec("print('Hello world!')")) and configurable dependencies and imports of packages (e.g., importing from a local package instead of a global one).
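To make the first dynamicity feature concrete, the following minimal sketch shows how code passed to exec stays invisible to a plain-text reading of the file until runtime. The payload here is a harmless arithmetic statement chosen for illustration; the variable names are hypothetical and nothing in this block comes from the attacks discussed in the paper.

```python
import base64

# A benign stand-in for the hidden logic (illustrative assumption).
hidden_source = "result = 6 * 7"
encoded = base64.b64encode(hidden_source.encode()).decode()

# What a reviewer would see in the file: an opaque string handed to exec.
visible_line = f'exec(base64.b64decode("{encoded}").decode())'

# The actual behaviour only becomes observable when the string is executed.
namespace = {}
exec(base64.b64decode(encoded).decode(), namespace)
```

After execution, `namespace["result"]` holds the computed value, although the expression never appears in `visible_line`.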

The described use of the PLs' dynamicity features is the most common among the known attacks [ohm2020backstabber, sonatype2020state]. A leading example of this kind of attack was presented by Bertus [bertus2019discord]. Bertus reviewed a malicious package named "pytz3-dev," which was seen in PyPI (the Python Package Index, the main repository of Python packages) and downloaded by many. This package contains malicious code in the initialization module and searches for a Discord authentication token stored in an SQLite database. Eventually, the code exfiltrated the token if found. This attack was carried out unnoticed for seven months and downloaded by 3000 users in 3 months [bertus2019discord, sonatype2020state]. Those features, and many more, are used by attackers, making this threat one of the most common attack techniques associated with a supply chain attack, as covered by NIST [NIST2021defending].

2.2 Detection methods of malicious intent in source code

As a result of the increase in the security issues mentioned above, two major detection methods were developed:

2.2.1 Static Analysis

A type of analysis that finds irregularities in a program without executing it. The irregularities can broadly be categorized into three main branches: coding style enforcement, reliability, and maintainability [ruohonen2021large, lizdenis2020configure]. The security issues are mainly associated with the reliability domain, which primarily covers bug detection [wang2010detect], vulnerability detection [russell2018automated], and malware detection challenges [idika2007survey, patil2017detection]. To deal with those challenges, the following are common techniques in static analysis that gather information regarding the detection mission:

  • Syntax properties. This technique uses the PL syntax to find irregularities. For example, using AST to search obfuscated strings that are most likely to be executed [bertus2018detecting] or a linter operation to check the program’s correctness [lizdenis2020configure].

  • Feature-based technique. This technique uses the occurrence counts of known problematic functionalities [ruohonen2021large, garrett2019detecting]. For example, Patil et al. [patil2017detection] constructed a classifier from a labeled dataset and several extracted features (function appearances, length of the script, etc.) that can predict the maliciousness of a script. The main drawback of this technique is that it is strongly bound to reverse-engineering research that points to features related to the attack, which may lead to detection overfitting to the attacks that have already been revealed and learned. Secondly, potential attackers could evade detection by several methods, such as not using the searched features in the code or using them in a seemingly proper way.

    An example of such a static analysis tool is Bandit [bandit_2022]. Bandit is a widespread tool [ruohonen2021large] designed to find common security issues in Python files using hard-coded rules. This tool uses the AST form (see Section 2.3) of the source code to better examine the rule set. In addition, Bandit's detection method includes the following metrics: the severity of the detected issues and the confidence of detection for a given issue. Each metric takes one of three values: low, medium, or high. Each rule gets its severity and confidence values manually from the Bandit community.

  • Data preprocessing. This technique constructs a workable data structure that captures the syntactic and semantic information of the code to represent the code better (see Section 2.3). With a proper code representation, it is convenient to apply anomaly detection or classification. For example, Alomari et al. [alomari2019scalable] construct a control flow graph and, by finding similar subgraphs, manage to identify similar code segments between programs.

  • Signature-based detection (in the case of malware detection). This is a process in which a set of rules (based on a reverse-engineering procedure) defines the maliciousness level of the program [sentinelone2021what]. The rules generated for static analysis purposes are often a set of functionalities or opcodes in a specific order that match the researched code behavior; for example, YARA is a commonly used static signature tool. The rules generated for dynamic analysis purposes are often a set of executed operations, memory states, registers' values, etc. [idika2007survey]. The main drawback of this technique is that it applies only to known maliciousness.

  • Comparing packages to known CVEs (see Section 2.1).
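As an illustration of the syntax-properties technique listed above, the following minimal sketch walks a Python AST and reports calls to functions that execute strings. The rule set `SUSPICIOUS_CALLS` and the helper names are assumptions made for this example; they are not the rules of Bandit or of any tool cited in this paper.

```python
import ast

# Illustrative rule set (an assumption, not taken from any cited tool).
SUSPICIOUS_CALLS = {"exec", "eval", "os.system"}


def call_name(node):
    """Return a dotted name for a call target such as exec or os.system."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute) and isinstance(func.value, ast.Name):
        return f"{func.value.id}.{func.attr}"
    return None


def find_suspicious_calls(source):
    """Return sorted (line number, call name) pairs without running the code."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and call_name(node) in SUSPICIOUS_CALLS:
            findings.append((node.lineno, call_name(node)))
    return sorted(findings)


sample = (
    "import os\n"
    "\n"
    "def get(url):\n"
    "    exec(payload)\n"
    "    os.system('ls')\n"
    "    return url\n"
)
findings = find_suspicious_calls(sample)
```

Such hard-coded rules are easy to explain but, as noted for feature-based techniques, they only cover patterns that are already known.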

On the one hand, static analysis tends to scale well over many PL classes (with a given grammar), efficiently operates on large corpora, often will identify well-known security issues, and in many cases, is explainable [pvs2015static]. On the other hand, this kind of analysis suffers from a high number of false positives and poor configuration issues detection [wichers2020source].

2.2.2 Dynamic Analysis

This type of analysis finds irregularities in a program after its execution and determines its maliciousness. In this type of analysis, the gathered data (system calls, variable values, IO access, etc.) are often used as part of an anomaly detection or classification problem [idika2007survey]. There are several drawbacks to this type of analysis of source code [pvs2013dynamic]: (a) Data gathering difficulties - the package must be activated and its functionality executed, which makes the data extraction procedure hard to automate; and (b) Scalability - every learned and tested program must be activated, and the wanted data must be extracted from each. In this study, we focus on advanced static analysis.

2.3 Deep learning methods for analyzing source code

In recent years, there has been an increasing need to use machine learning (ML) methods in code intelligence for productivity and security improvement [lu2021codexglue]. As a result, many studies construct statistical models for code intelligence tasks. Recently, pre-trained models were constructed by learning from big PL corpora, such as CodeBERT [feng2020codebert] and CodeX [chen2021evaluating]. These pre-trained models are commonly based on models from the natural language processing (NLP) field, such as BERT [devlin2018bert] and GPT [brown2020language]. This development led not only to improvement in code understanding [lu2021codexglue] and generation problems [alon2020structural] but also to enlarging the number of tasks and their necessity [lu2021codexglue], such as clone detection [ain2019systematic] and code completion [raychev2014code]. Those tasks include several challenges, such as capturing semantic essence [nagar2021code], syntax resemblance [alomari2019scalable], and figuring out the execution flow [yu2019empirical]. For every challenge, it turns out that some model fits better than others [lu2021codexglue]. For example, for code translation between PLs, algorithms that include a "Cross-lingual Language Model" with masked-token preprocessing are superior at capturing the semantic essence [feng2020codebert, lachaux2020unsupervised].

Over the years, several ML methods have been researched in the context of code analysis tasks. In 2012, Hovsepyan et al. [hovsepyan2012software] showed the use of techniques from the classic text analysis field, for example, using SVM on a bag-of-words (BOW) representation of simple tokenization (lexing by the PL grammar) of Java source. In 2016, Dam et al. [dam2016deep] and Liang et al. [liang2018automatic] presented techniques to get context for the extracted tokens, for example, using the output of a recurrent neural network (RNN) trained over tokenized (lexing representations) code [dam2016deep]. However, according to Ahmad et al. [ahmad2020transformer], RNN-based sequence models fall short on several aspects of source code representation. First, they represent the non-sequential structure of source code inaccurately. Second, RNN-based models may be inefficient for very long sequences. Third, those models fail to grasp the syntactic and semantic information of the source code. Therefore, starting in 2018, studies include two significant changes in learning source code representation. The first is the use of Transformers, which have proven to be efficient in capturing long-range dependencies [alon2020structural]. The second is the use of different data preprocessing procedures, which yield more informative data structures to learn on: Alon et al. [alon2018code2seq] used AST paths for a transformer architecture named Code2Seq [alon2018code2seq], and Mou et al. [mou2014tbcnn] utilized abstract syntax tree (AST) nodes to train tree-based convolutional neural networks for supervised classification problems. An AST is a well-known data structure for representing a program with a given PL grammar. Lately, researchers have tried to include semantic data of the PLs. For example, Feng et al. [feng2020codebert] presented the CodeBERT model, which uses a bimodal pre-trained model to learn the semantic relationship between natural language and PLs such as Java, PHP, Python, etc.

In this study, we used the Code2Seq model, a transformer architecture developed by Alon et al. [alon2018code2seq]. Additionally, similarly to Ramakrishnan et al. [ramakrishnan2020semantic], we trained the model using the PY150 dataset [raychev2016probabilistic], a dataset that contains Python functions in the form of ASTs (see Section 3.2.1). In this architecture, a function is represented as an AST. The tree's internal nodes represent the construction of the program with known rules, as described in the given grammar. The tree's leaves represent information regarding the program variables, such as names, types, values, etc. Figure 2 outlines the notion of an AST on a code snippet.

Figure 2: Example AST transformation of the code snippet if x == 3: print(“Hello”). Example of AST path painted in red.

Eventually, the Code2Seq model gets as an input a set of AST paths that were extracted from code snippets. Every pairwise path between two leaf tokens is represented as a sequence containing the AST nodes. Those nodes are connected by up and down arrows, which indicate the upward or downward link between the nodes in the tree. An example of an AST path shown in Figure 2 is (x, ↑if stmt, ↑method dec ↓print: "Hello"). A bi-directional LSTM encodes those paths to create a vector representation for each path and its AST values separately. Then the decoder attends over those encoded paths while generating the target sequence. The final output of the Code2Seq model is a generated sequence of words that explains the functionality of the given code snippet [alon2018code2seq].
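The notion of a pairwise AST path can be sketched with Python's standard `ast` module. This is only an illustration under stated assumptions: the node names come from Python's own grammar (for example `If` and `Compare`) and therefore differ from the labels in Figure 2, and the helper names `leaf_chains` and `ast_paths` are hypothetical, not part of Code2Seq.

```python
import ast
from itertools import combinations

UP, DOWN = "\u2191", "\u2193"  # up and down arrows


def leaf_chains(source):
    """Collect (token, root-to-leaf node chain) for every name or constant."""
    chains = []

    def walk(node, chain):
        chain = chain + [node]
        if isinstance(node, ast.Name):
            chains.append((node.id, chain))
        elif isinstance(node, ast.Constant):
            chains.append((repr(node.value), chain))
        else:
            for child in ast.iter_child_nodes(node):
                walk(child, chain)

    walk(ast.parse(source), [])
    return chains


def ast_paths(source):
    """Return (leaf, path, leaf) triples for every pair of leaf tokens."""
    paths = []
    for (tok_a, chain_a), (tok_b, chain_b) in combinations(leaf_chains(source), 2):
        # Count the ancestors shared by both leaves; the last one is
        # their lowest common ancestor.
        shared = 0
        limit = min(len(chain_a), len(chain_b))
        while shared < limit and chain_a[shared] is chain_b[shared]:
            shared += 1
        lowest_common = chain_a[shared - 1]
        hops = [UP + type(n).__name__ for n in reversed(chain_a[shared:-1])]
        hops.append(UP + type(lowest_common).__name__)
        hops += [DOWN + type(n).__name__ for n in chain_b[shared:-1]]
        paths.append((tok_a, " ".join(hops), tok_b))
    return paths


paths = ast_paths('if x == 3: print("Hello")')
```

For the snippet of Figure 2, one of the resulting triples connects the leaf `x` to the leaf `'Hello'` by climbing to the `If` node and descending through the call to `print`.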

Code2seq can be integrated into many applications [alon2018code2seq, nagar2021code, ramakrishnan2020semantic], such as code search, in which the input is a sentence describing a piece of code and the output is the wanted code. For example, Nagar et al. [nagar2021code] used the Code2seq model to generate comments for collected code snippets. Then, the candidate code snippets and corresponding machine-generated comments are stored in a database. Eventually, the code snippets whose comments are semantically similar to natural language queries are retrieved.

Recent studies have presented more advanced code embedding methods that try to include the program’s semantic, syntactic, and execution flow as part of the representation [alomari2019scalable, yu2019empirical].

3 Methods

The primary goal of this study is to detect code injection by applying static analysis to the source code. This section describes the static analysis algorithm we developed (see Section 3.1) and our experiments to test and evaluate our proposed method, MSDT (see Section 3.2).

3.1 The proposed method

As presented in Section 2.1, in supply chain attacks, the injected functionality will often be added to the source of the targeted program. Therefore, the code will be changed. This study presents MSDT, an algorithm to detect the mentioned difference in the program’s functionality for a chosen PL, by the four following steps (see Figure 1):

  1. Data collection. In this step, we collect a sufficient amount of function implementations of the chosen PL, for each function type. For example, to detect code injection in the "encode" function, we collect a sufficient amount of "encode" implementations to better estimate the distribution of the implementations. In addition, the collected data can be different versions of the same function. The data can be collected manually from any code-base warehouse (such as GitHub) or extracted from an existing code dataset. For example, an existing dataset of functions with their names and implementations (see Section


  2. Code embedding. In this step, we create an embedding of the given source code snippets by using an algorithm that takes sequence data and represents it as a vector. Examples of such algorithms are transformers, which vectorize the input sequence and transform it into another sequence, such as Seq2seq [ramakrishnan2020semantic], Code2seq [alon2018code2seq], CodeBERT [feng2020codebert], and TransCoder [lachaux2020unsupervised]. The resulting embedding has to be reasonable, so that similarity in the source code snippets (similar functions) translates to similarity in the embedding space. For example, the vectors of the square-root and cube-root functions will be relatively close to each other and farther from the parse timezone function's vector.

  3. Anomaly detection. In this step, we apply an anomaly detection technique by applying clustering algorithms and detecting the outliers. For example, we can utilize DBSCAN and K-means to cluster the input and detect outliers [badr_2019]. We use this technique on the embeddings of every function type and manage to differentiate code snippets that were injected from benign code snippets.

  4. Anomaly ranking. Lastly, in this step, we rank the outliers by their distance from the nearest cluster's border points [huang2013rank]. The farther the point is, the higher the score.
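Steps 3 and 4 above can be sketched in pure Python under several assumptions. The two-dimensional points stand in for function embeddings, the `dbscan` function is a minimal textbook-style version written for this example (not the implementation used in the study), and outliers are scored by their distance to the nearest clustered point, which simplifies the border-point distance described in step 4.

```python
import math


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN sketch: one label per point, -1 meaning noise."""
    n = len(points)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise, may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point, keep expanding
    return labels


def rank_outliers(points, labels):
    """Rank noise points by distance to the nearest clustered point."""
    clustered = [p for p, lab in zip(points, labels) if lab != -1]
    scores = {
        i: min((math.dist(points[i], c) for c in clustered), default=math.inf)
        for i, lab in enumerate(labels)
        if lab == -1
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical embeddings: four similar implementations and two outliers.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (2.0, 2.0), (5.0, 5.0)]
labels = dbscan(embeddings, eps=0.3, min_samples=3)
ranking = rank_outliers(embeddings, labels)
```

In this toy run the two distant points are labeled as noise, and the farther one receives the higher score, which mirrors the ranking rule of step 4.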

3.2 Experiments

There are several datasets including labeled function implementations for several purposes [lu2021codexglue]. In this study, we used 607,461 public Python function implementations, with simulated test cases and real-world observed attacks. Additionally, this study combines an embedding layer based on a transformer, Code2Seq [alon2018code2seq]. Lastly, this study showcases traditional anomaly detection techniques over the Code2Seq representation based on DBSCAN [prado2017dbscan] compared to another anomaly detection technique based on Ecod [li2022ecod].

3.2.1 Datasets

In this study, we utilized three datasets: (1) The PY150 dataset [raychev2016probabilistic] is used for training Code2Seq. PY150 is a Python corpus with 150,000 files. Each file contains up to 30,000 AST nodes from open-source projects with non-viral licenses such as MIT. For the training procedure, we randomly split the PY150 dataset into validation/test/train sets of 10K/20K/120K files; (2) The CodeSearchNet (CSN) Python dataset [husain2019codesearchnet] is used for evaluating the different experiments. CSN is a Python corpus containing 457,461 pairs from open source libraries, of which we use only the code; and (3) The Backstabber's Knife Collection [ohm2020backstabber] is used for the malicious functionalities injected during the simulations. The Backstabber's Knife Collection is a dataset of manually analyzed malicious code from 174 packages that were used by real-world attackers. Namely, we use five different malicious code injections from this collection to inject into the 100 most common functions within the CSN corpus. We chose those specific malicious codes for their straightforward integration within the injected function and their download popularity [ohm2020backstabber].

As mentioned above, the input to the Code2seq model is an AST representation of a function. To get this representation for each function, we extracted tokens using fissix and tree_sitter, which allow us to normalize the code to get consistent encoding. With the normalized output code, we then generate an AST using fissix.
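The idea of normalizing code before building an AST can be sketched with Python's standard `ast` module. This is a stand-in chosen for the example: the study itself uses fissix and tree_sitter, whose APIs are not reproduced here, and `ast.unparse` requires Python 3.9 or later.

```python
import ast


def normalize(source):
    """Parse the code and print it back in a canonical layout.

    Comments, redundant parentheses, and spacing differences disappear,
    so equivalent implementations map to the same text.
    """
    return ast.unparse(ast.parse(source))


version_a = "def f(a,b):\n    return a+b  # add the values\n"
version_b = "def f( a, b ):\n\n    return (a + b)\n"

normalized_a = normalize(version_a)
normalized_b = normalize(version_b)
```

Both versions normalize to the same text, so their AST paths, and therefore their encodings, are consistent.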

3.2.2 Injection simulation

To simulate the real-world number of code injections, we randomly selected up to 10% [sonatype2021state] of the implementations of each of the top 100 common functions to be code injected, with a total of 48,627 implementations. To find the 100 most common functions, we counted the number of implementations of each function in the CSN dataset and took the 100 most frequent functions. The injected functionalities were five malicious samples collected from the Backstabber's Knife Collection [ohm2020backstabber]. Those injections illustrate several attack types:

  1. A one-liner execution of an obfuscated string, encoded by base64 [bertus2019discord]. This string is a script that finds the Discord chat application's data folder on Windows machines and then attempts to extract the Discord token from an SQLite database file. Once the Discord token is found, it is sent to a web server. We use two different execution functions (in different types of injections): exec and os.system. These functions allow the user to execute a string.

  2. A one-liner execution of a non-obfuscated script - the deobfuscated version of the attack described above.

  3. Loading a file from the root directory of the program. The loaded file is a keylogger that eventually sends the collected data to a remote server via emails. To mask the keylogger loading, we use the Popen function to execute the malicious functionality in other subprocesses [meyers_tozer_2020].

  4. Attacker payload construction as an obfuscation use case. We split the obfuscated string (the first attack mentioned in this section) into several substrings. Then we concatenate those strings in several parts of the program to construct the original attacker string, and execute the concatenated string using the os.system function.

The functionalities were injected at the beginning of the randomly selected implementations of those popular function types, similar to the attacks mentioned above [bertus2019discord, meyers_tozer_2020] and as observed by Ohm et al. [ohm2020backstabber].
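The simulation procedure described in this section can be sketched as follows. Everything here is an illustrative assumption: the payload is a harmless print statement rather than any of the real samples, the helper names are hypothetical, and the standard `ast` module (Python 3.9 or later for `ast.unparse`) is used to place the payload at the beginning of the function body.

```python
import ast
import random
from collections import Counter, defaultdict

# Benign placeholder; the study injects real-world samples instead.
PLACEHOLDER_PAYLOAD = "print('injected placeholder')"


def inject_at_start(function_source, payload=PLACEHOLDER_PAYLOAD):
    """Insert the payload as the first statement of the function body."""
    tree = ast.parse(function_source)
    function = tree.body[0]
    function.body = ast.parse(payload).body + function.body
    return ast.unparse(tree)


def simulate_injections(functions, top_n=100, rate=0.10, seed=0):
    """functions: list of (name, source); returns (name, source, injected)."""
    rng = random.Random(seed)
    counts = Counter(name for name, _ in functions)
    top_names = {name for name, _ in counts.most_common(top_n)}
    by_name = defaultdict(list)
    for index, (name, _) in enumerate(functions):
        if name in top_names:
            by_name[name].append(index)
    chosen = set()
    for indexes in by_name.values():
        # "Up to 10%": round down the number of injected implementations.
        chosen.update(rng.sample(indexes, int(len(indexes) * rate)))
    return [
        (name, inject_at_start(src) if i in chosen else src, i in chosen)
        for i, (name, src) in enumerate(functions)
    ]


# Toy corpus: 20 implementations of get and 3 of put.
corpus = [("get", f"def get(x):\n    return x + {i}\n") for i in range(20)]
corpus += [("put", "def put(x):\n    return x\n")] * 3
simulated = simulate_injections(corpus, top_n=1)
```

With `top_n=1`, only the most frequent function name is eligible, so two of the twenty `get` implementations receive the placeholder and the `put` implementations stay untouched.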

3.2.3 Code2seq representation

In this study, we use the result vectors of the attention procedure (see Section 2.3), named context vectors, with 320 dimensions; this is the model's representation space for code snippets. At each decoding step, the probability of the next target token depends on the previous tokens.

We used the Code2Seq implementation of Alon et al. [alon2018code2seq] and set it with the same parameters. We trained the Code2Seq model on a server with a high RAM setting (256G RAM and 48 CPU cores). The training process continued for 24 hours on 130K functions.

We construct the encoder as two bi-directional LSTMs of 128 units each that encode the AST paths, and we set a dropout of 0.5 on each LSTM. Then, we construct the decoder as a single-layer LSTM of size 320, and we set a dropout of 0.75 to support the generation of longer target sequences. Finally, we trained the model for 20 epochs or until there was no improvement after 10 iterations. Eventually, we tested our Code2Seq model on the PY150 test set (as mentioned in Section 3.2.1) and achieved the following metrics on the mentioned randomly sampled test set: recall of 47%, precision of 64%, and F1 of 54%.
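As a quick consistency check on the reported metrics, the F1 score can be recomputed from the stated precision and recall. The snippet below is only an arithmetic sketch using the standard harmonic-mean definition of F1; the function name `f1_score` is our own.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


reported_precision = 0.64
reported_recall = 0.47
f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}")
```

With these inputs the result rounds to 0.54, in agreement with the reported F1 of 54%.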

3.2.4 Anomaly detection on representation

In this step, we use our Code2Seq representation (see Section 3.2.3) for the given injected functions and for the non-injected functions of the same type. Then, we test several clustering algorithms, such as DBSCAN, K-means, Ecod, and Hierarchical clustering. Eventually, we chose the DBSCAN method to find outliers because it works well on multi-dimensional data, as presented by Oskolkov et al. [Nikolay_Oskolkov_2019]. We tuned the following parameters of the DBSCAN method [prado2017dbscan]:

  1. eps, which specifies the maximum distance between two points for them to be considered neighbors, tested with values from 0.2 to 1.0.

  2. min_samples, which specifies the minimum number of neighbors needed to consider a point part of a cluster, tested with values from 2 to 10.

For each iteration, we apply 10-fold cross-validation and report the mean over the folds of the following metrics: TPR, AP (Average Precision), and detecting outlier precision.
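To make the tuning procedure concrete, the following is a small, dependency-free sketch of DBSCAN noise labelling combined with a grid over eps and min_samples. It is an illustration under assumptions, not the code used in the study: a library implementation would normally be used in practice, the grid step of 0.1 for eps is assumed (the text gives only the range), the toy 2-D points stand in for the 320-dimensional context vectors, and the cross-validation loop is omitted for brevity.

```python
import math
from itertools import product

NOISE = -1


def neighbors_of(points, i, eps):
    """Indices of all points within distance eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors_of(points, i, eps)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors_of(points, j, eps)
            if len(j_neighbors) >= min_samples:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
    return labels


def outlier_precision(labels, is_injected):
    """Fraction of the points flagged as noise that are truly injected."""
    flagged = [i for i, label in enumerate(labels) if label == NOISE]
    if not flagged:
        return 0.0
    return sum(is_injected[i] for i in flagged) / len(flagged)


def grid_search(points, is_injected):
    """Return the (eps, min_samples) pair with the best outlier precision."""
    eps_grid = [round(0.2 + 0.1 * step, 1) for step in range(9)]  # 0.2 ... 1.0 (step assumed)
    min_samples_grid = range(2, 11)                                # 2 ... 10
    return max(product(eps_grid, min_samples_grid),
               key=lambda p: outlier_precision(dbscan(points, p[0], p[1]), is_injected))


# Toy data: a dense group of "benign" vectors and two distant "injected" ones.
benign = [(0.05 * i, 0.0) for i in range(12)]
injected = [(5.0, 5.0), (-4.0, 3.0)]
points = benign + injected
is_injected = [False] * len(benign) + [True] * len(injected)

labels = dbscan(points, eps=0.3, min_samples=3)
best_eps, best_min_samples = grid_search(points, is_injected)
print(labels, best_eps, best_min_samples)
```

Here `min_samples` counts the point itself among its neighbors, which is a common convention but still an assumption about the setup used in the study.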

3.2.5 Evaluation Process

The performance of the anomalies detected by MSDT was measured by a precision at k (precision@k) study, which stands for the true positive rate (TPR) of the results that occur within the top k of the ranking [ruohonen2021large]. We rank the anomalies by their Euclidean distance from the border points of the nearest cluster. Eventually, we measured the precision@k metric for each function type with the mentioned code injection attacks and compared it to a random baseline, to show the performance of MSDT relative to a random decision. Additionally, to better understand how MSDT detects attacks, we examine the correlation between the detection rate and the number of implementations among the various function types. To that end, we measured the average precision@k for every attack, and for every function type we calculated the mean of these averages over the various attacks. We used Spearman's rank correlation to measure the correlation between this per-function-type mean and the number of implementations.
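The ranking and precision@k computation can be sketched as below. This is a hedged illustration, not the study's code: the helper names are hypothetical, the distance to the nearest clustered point is used as a simple stand-in for the distance to the nearest cluster border point, and precision@k follows the standard definition (hits among the top k divided by k), which may differ from how shorter rankings were handled in the experiments.

```python
import math

NOISE = -1


def rank_outliers(points, labels):
    """Noise indices sorted by distance to the nearest clustered point, farthest first."""
    clustered = [p for p, label in zip(points, labels) if label != NOISE]
    outliers = [i for i, label in enumerate(labels) if label == NOISE]
    if not clustered:
        return outliers  # nothing to measure distances against
    return sorted(outliers,
                  key=lambda i: min(math.dist(points[i], c) for c in clustered),
                  reverse=True)


def precision_at_k(ranked, injected, k):
    """Share of the k top-ranked anomalies that are truly injected."""
    return sum(i in injected for i in ranked[:k]) / k


def random_baseline(n_injected, n_total):
    """Expected precision@k of a ranking chosen uniformly at random."""
    return n_injected / n_total


# Toy example: three clustered points, two outliers, one of which is injected.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (3.0, 0.0), (0.0, 9.0)]
labels = [0, 0, 0, NOISE, NOISE]
injected = {4}

ranked = rank_outliers(points, labels)
print(ranked, precision_at_k(ranked, injected, 1), precision_at_k(ranked, injected, 2))
```

Comparing `precision_at_k` with `random_baseline` shows how much better the ranking is than picking functions at random.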

We compared MSDT's performance to another widely used outlier detection baseline method named Ecod [li2022ecod] over the mentioned representation (see Section 3.2.4). We use Ecod to detect outliers as follows: First, we apply Ecod on every function type for every attack type, as was done for MSDT. Second, we measure the anomaly score of each implementation. (The Ecod algorithm calculates this score; the more distant the vector, the higher its score.) Third, we extract the precision@k, where k indexes the anomalies in descending order of score, i.e., precision@2 is the precision of the two most highly ranked anomalies, as simulated by Amidon et al. [amidon2022ecod].
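To give an intuition for the anomaly score used by this baseline, the sketch below computes an ECOD-style score from per-dimension empirical tail probabilities. It reflects our reading of the ECOD idea (left-tail, right-tail, and skewness-selected tail scores, aggregated over dimensions, with the maximum of the three taken per sample) and is a simplified illustration only; it is not the library implementation used in the study, and the function name `ecod_scores` is hypothetical.

```python
import math


def ecod_scores(rows):
    """ECOD-style outlier scores: larger means more extreme in the empirical tails."""
    n = len(rows)
    dims = len(rows[0])
    left = [0.0] * n   # sum over dimensions of -log(left-tail probability)
    right = [0.0] * n  # sum over dimensions of -log(right-tail probability)
    auto = [0.0] * n   # tail chosen per dimension by the sign of the skewness
    for d in range(dims):
        column = [row[d] for row in rows]
        mean = sum(column) / n
        # Only the sign of the skewness is needed, so the third central moment suffices.
        third_moment = sum((x - mean) ** 3 for x in column)
        for i, x in enumerate(column):
            left_tail = sum(v <= x for v in column) / n
            right_tail = sum(v >= x for v in column) / n
            left[i] -= math.log(left_tail)
            right[i] -= math.log(right_tail)
            auto[i] -= math.log(left_tail if third_moment < 0 else right_tail)
    return [max(l, r, a) for l, r, a in zip(left, right, auto)]


# Toy vectors: five ordinary points and one point that is extreme in both dimensions.
rows = [(0.0, 0.3), (0.1, 0.2), (0.2, 0.1), (0.3, 0.0), (0.15, 0.15), (5.0, 5.0)]
scores = ecod_scores(rows)
ranking = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Sorting the implementations by this score in descending order yields the ranking over which precision@k is then measured.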

To evaluate our method on real-world injections, we applied MSDT to a real-world case taken from the Backstabber's Knife Collection [ohm2020backstabber]. The case is a sample of malicious functionality, injected into a multiply calculation function, that loads a file by Popen, as mentioned above in Section 3.2.2. We collected 48 implementations of multiply-related functions from the mentioned datasets (see Section 3.2.1). We did so to obtain benign implementations as a reference for the injected multiply function, and thus we were able to apply MSDT to this multiply case.

Additionally, we compared MSDT with the mentioned Ecod method and with two well-known static analysis tools, Bandit and Snyk (see Section 2.2.1). Namely, we evaluated those static analysis tools on the original file in which the malicious implementation of multiply appeared.

Lastly, to emphasize the relations between the malicious and the benign implementations, we visualized the embeddings obtained for the get and the log functions with the injected code. We produced this visualization by applying PCA (2 components) [li2019pca] to the Code2Seq context vectors (see Section 3.2.3).

4 Results

In this section, we present the experimental results, which were obtained by the MSDT algorithm (see Section 3.1) when applied to the constructed function types dataset that contains both injected and benign implementations (see Section 3.2.2). (We used a server with 8G RAM and 8 CPU cores to evaluate the algorithm. The process took 10 minutes for 48627 different implementations.)

The constructed dataset includes the 100 most common function types from the CSN dataset (see Section 3.2.1). According to the distribution of implementations per function type (see Figure 3), the most common function type is get, with over 3,000 unique implementations, and the least common of those function types is prepare, with 102 unique implementations.

Figure 3: Number of different implementations per function type.

The first experiment included parameter tuning of the DBSCAN method mentioned in Section 3.2.4. We obtained the best results (see Figure 4) for eps=0.3 and min_samples=10: TPR=0.637, AP=0.384, detecting outlier precision=0.953. These results indicate that it is possible to detect anomalies by finding outliers at reasonable rates. In addition, with the default parameter values of the DBSCAN method [schubert2017dbscan], we got TPR=0.632, AP=0.373, detecting outlier precision=0.738. Therefore, DBSCAN with the tuned parameters outperformed DBSCAN with the default parameters.

Figure 4: The DBSCAN parameter tuning process: (1) the size of the outlier cluster, which indicates whether the method overfits or underfits; (2) the measured precision@k for a range of parameter values; and (3) the measured AP (average precision) for a range of parameter values.

The second experiment included the evaluation of MSDT on every function type against every attack type and every k in the range of 1 to 10 percent of the implementations. For every value of k, we measured precision@k. We found that MSDT detects injections well for several functions and attacks. For example, for the get function with three of the mentioned attacks, MSDT reached its highest precision@k values (see Figure 5), above those obtained by the baseline. On the other hand, we found that MSDT achieved less successful results on several functions regardless of the type of the applied attack and the value of k, such as the log function with all the attacks, and specifically with the non-obfuscated attack. Table 1 and the Appendix present the results of these experiments in detail.

In addition, we discovered that the measured Spearman's rank correlation between MSDT's detection rate and the number of implementations is positive, which indicates a correlation between the detection rate and the number of implementations.
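For readers unfamiliar with the measure, Spearman's rank correlation is the Pearson correlation of the ranks of the two variables. The sketch below computes it from scratch; the helper names are our own, and the numbers in the example are made up for illustration only and are not results from the study.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical values (NOT taken from the paper): implementations per function type
# and the corresponding mean detection rate.
n_implementations = [3000, 1200, 400, 150, 102]
mean_detection_rate = [0.85, 0.60, 0.65, 0.30, 0.20]
rho = spearman(n_implementations, mean_detection_rate)
print(round(rho, 3))
```

A value close to 1 means that function types with more implementations tend to have higher detection rates, a value close to 0 means no monotonic relation, and a negative value means the opposite trend.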

Additionally, we tested the Ecod baseline with the same experimental settings described in Section 3.2.3. Following the mentioned evaluation (see Section 3.2.5), we measured the precision@k for every k in the range of 1 to 30. We can observe that Ecod generally detects the two top-ranked anomalies and is less successful for the following values of k (see Figure 6).

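| Model | Function name | k | Execution of an obfuscated string using exec | Execution of a non-obfuscated script using exec | Execution of an obfuscated string using os.system | Loading a file from the root directory of the program | Payload construction as an obfuscation use case |
|---|---|---|---|---|---|---|---|
| MSDT | get | 10 | 0.9 | 0.8 | 0.889 | 0.9 | 0.7 |
| MSDT | get | 20 | 0.9 | 0.4 | 0.889 | 0.909 | 0.35 |
| MSDT | get | 30 | 0.9 | 0.267 | 0.889 | 0.909 | 0.233 |
| MSDT | log | 10 | 0.4 | 0.1 | 0.4 | 0.3 | 0.3 |
| MSDT | log | 20 | 0.15 | 0.05 | 0.25 | 0.25 | 0.2 |
| MSDT | log | 30 | 0.3 | 0.033 | 0.267 | 0.233 | 0.267 |
| MSDT | update | 10 | 0.7 | 0.167 | 0.7 | 0.7 | 0.6 |
| MSDT | update | 20 | 0.733 | 0.167 | 0.722 | 0.75 | 0.706 |
| MSDT | update | 30 | 0.733 | 0.167 | 0.722 | 0.821 | 0.706 |
| Ecod | get | 10 | 0.5 | 0.4 | 0.3 | 0.1 | 0.2 |
| Ecod | get | 20 | 0.3 | 0.25 | 0.15 | 0.05 | 0.1 |
| Ecod | get | 30 | 0.276 | 0.172 | 0.138 | 0.034 | 0.103 |
| Ecod | log | 10 | 0.3 | 0.1 | 0.1 | 0.2 | 0.2 |
| Ecod | log | 20 | 0.15 | 0.15 | 0.1 | 0.1 | 0.2 |
| Ecod | log | 30 | 0.172 | 0.103 | 0.103 | 0.069 | 0.172 |
| Ecod | update | 10 | 0.2 | 0.5 | 0.4 | 0.1 | 0.2 |
| Ecod | update | 20 | 0.2 | 0.35 | 0.35 | 0.05 | 0.2 |
| Ecod | update | 30 | 0.172 | 0.276 | 0.276 | 0.038 | 0.241 |

Table 1: precision@k for 3 functions with all attacks and k values. The complete precision@k results are shown in the Appendix.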
Figure 5: The measured precision@k of MSDT and of the baseline for the get and the log functions' implementations.
Figure 6: The measured mean precision@k of MSDT and of the Ecod baseline over all the 100 function types and the 5 attacks.

The third experiment included detecting injected malicious implementations of multiply by applying MSDT to them. By visualizing the PCA (2 components) of the collected samples (see Figure 7), we can see that detecting the attacked functions, in this case, is not a straightforward task. Additionally, we can see (see Figure 7) that by applying MSDT, we managed to detect the malicious implementation, along with two unique and odd implementations of multiply. (Those implementations are: (1) adding the first input number in a for loop according to the second input number; and (2) outputting the result by comparing the two input numbers to a results dictionary.) Then we compared the results of this experiment to Bandit and Snyk (see Section 2.2.1) and found that those static analysis tools failed to detect these attacks. Additionally, we compared MSDT to the Ecod baseline, which detected only one of the mentioned unique implementations.

Figure 7: PCA (2 components) visualization of real-case detection. The red data point is the attacked function, and the two yellow data points are the unique functions.

The fourth experiment emphasizes the relations between malicious and benign implementations. The visualizations (see Figures 8 and 9) show that the get functions tend to cluster, whereas the log functions do not cluster well. This illustrates the differences in the distribution of the various function types.

Figure 8: PCA of the get function benign (blue) and malicious (red) implementations.
Figure 9: PCA of the log function benign (blue) and malicious (red) implementations.

5 Discussions

Based on our analysis of the results presented in Section 4 and the Appendix, we can observe the following:

First, MSDT, which detects malicious code injections into functions by anomaly detection on an embedding layer, had promising results when evaluated on different function types with various injected attacks, reaching precision@k of up to 0.909 with median=0.889 and mean=0.807 for the get and list function types (see the Appendix and Figure 5).

Second, MSDT achieved successful results compared to other tools and methods (see Table 1 and Figure 6). For example, the general precision@k of MSDT is higher than that of the Ecod-based method (as can be seen in Section 6). As mentioned in Section 3.2.2, the simulated injections are taken from real-world cases and injected into functions. To illustrate real-world code injection detection, we conducted an empirical experiment that includes detecting a real-world attack with MSDT (see Section 3.2.5). MSDT's results seem promising compared to other widely used static analysis tools and to Ecod in this specific case (see Figure 7 and Section 4). In the future, we would like to evaluate MSDT on other real-world cases and test it on functions written in other programming languages. In addition, we note that the mentioned static analysis tools only work on files, whilst MSDT works on functions. On the one hand, this gives a more precise ability to detect code injections into functions. On the other hand, when applied to rare functions without many implementations, MSDT would not necessarily succeed. In this case, we would like to test whether applying MSDT to similar functions helps to detect code injection in rare functions.

Third, we observed that evaluating MSDT on similar attacks yields similar results. For example, the attacks that use exec and os.system (as can be seen in the get results in Figure 5) use the same payload but different execution functions. Additionally, the precision@k values are in general relatively similar for these two attacks (see the Appendix). This suggests that if MSDT detects one attack well, it should also detect another semantically related attack; we would like to explore this further in future work.

Fourth, we found that MSDT seems to succeed when applied to functions with a specific functionality that repeats across the various implementations of the same function type. For example, the update implementations tend to be similar: in general, this type of function receives an object and calculates, or receives as input, a new value to insert into the given object. As we can see in the Appendix, functions like reset, list, and update have one main functionality and a relatively high precision@k. In this case, the various implementations of the same function type are semantically similar, so their embeddings are close and hence cluster well (see Figure 8 for illustration).

Fifth, we found that MSDT's detection rate is positively correlated with the number of implementations in the function type. Hence, MSDT is more likely to achieve a higher detection rate on a more common function type with numerous implementations.

Sixth, when injecting attacks with many lines, such as the non-obfuscated script execution, MSDT tends to achieve less successful results (see Figure 5). For example, when evaluating MSDT on the different function types injected with the non-obfuscated script, we generally get a low precision@k (see the Appendix). In this case, the injected functionality is a script with numerous lines, which probably affects Code2Seq's robustness and causes it to misinfer the function's functionality, as researched by Ramakrishnan et al. [ramakrishnan2020semantic]. In future work, we would like to build a stacking model that combines Code2Seq with a more robust model for source code (such as Seq2Seq [ramakrishnan2020semantic]) to overcome Code2Seq's vulnerabilities.

Seventh, we can observe that MSDT tended to achieve less successful results when applied to abstract functions whose functionality does not repeat in other implementations, as we can see in the Appendix for functions like run, main, etc. Consider, for example, the install function: generally, this function is supposed to change the state of the endpoint through activities that belong to the installation process, such as writing files to disk or establishing a connection with a remote server. Each application has a different installation process with its own unique activities. In this case, the various implementations of the same function type are inherently different, so their embeddings are not close and hence do not cluster well (see Figure 9 for illustration). However, we would be able to detect anomalies with MSDT when given versions of the abstract function.

Finally, as can be observed from the results, statically detecting code injection within functions is a difficult task that is not homogeneous across the various cases, such as function and attack types. However, MSDT showed successful results for some of the cases simulated in the experiments. Therefore, MSDT can be used as a detection tool that indicates which functions need further investigation, thus reducing the search space and allowing anomalies to be prioritized.

6 Conclusions and Future Works

This study introduces MSDT, a novel algorithm to statically detect code injection in functions' source code by utilizing a transformer-based model named Code2Seq and applying anomaly detection techniques on Code2Seq's representation for each function type. We provided a comprehensive description of MSDT's steps, which start with collecting a dataset and preprocessing it. After injecting five malicious functionalities into random implementations, we extracted an embedding for each of the implementations in the function type. Based on these embeddings, we applied an anomaly detection technique, resulting in anomalies that we eventually ranked by their distance from the nearest cluster border point.

The evaluation of MSDT on the constructed dataset demonstrates that MSDT succeeded in cases where: (1) the functions have a repetitive functionality; and (2) the injected code has a limited number of lines. However, MSDT was less successful when: (1) the injected code contains a relatively large number of lines; and (2) the functions have a more abstract functionality.

For MSDT to use the Code2Seq embedding, it is necessary to convert every function to an AST representation. A possible future research direction is using a more comprehensive representation of code that includes the semantic, syntactic, and execution flow data of the program, for instance, execution paths in a control flow graph [alomari2019scalable, yu2019empirical] that has been constructed statically from a program. Another possible research direction is exploring models other than Code2Seq for source code embeddings, like Seq2Seq, CodeBERT, and Codex.

These future directions follow directly from the MSDT evaluation and results. We believe that this future research, along with MSDT, can lead to more secure software products and more effective software development procedures.

7 Data and Code Availability

The code that implements our simulations (see Section 3.2.2) and the simulated datasets we created (see Section 3.2.1) will be available after publication upon request.