Energy-bounded Learning for Robust Models of Code

by Nghi D. Q. Bui, et al.
HUAWEI Technologies Co., Ltd.

In programming, learning code representations has a variety of applications, including code classification, code search, comment generation, and bug prediction. Various representations of code in terms of tokens, syntax trees, dependency graphs, code navigation paths, or a combination of their variants have been proposed. However, existing vanilla learning techniques have a major limitation in robustness, i.e., it is easy for the models to make incorrect predictions when the inputs are altered in a subtle way. To enhance robustness, existing approaches focus on recognizing adversarial samples rather than valid samples that fall outside a given distribution, which we refer to as out-of-distribution (OOD) samples. Recognizing such OOD samples is the novel problem investigated in this paper. To this end, we propose to first augment the in-distribution datasets with out-of-distribution samples such that, when trained together, they will enhance the model's robustness. To incorporate such out-of-distribution samples into the training process of source code models, we propose an energy-bounded learning objective function that assigns a higher score to in-distribution samples and a lower score to out-of-distribution samples. Our evaluation results on OOD detection and adversarial sample detection show that existing source code models become more accurate at recognizing OOD data while also being more resistant to adversarial attacks. Furthermore, the proposed energy-bounded score outperforms all existing OOD detection scores by a large margin, including the softmax confidence score, the Mahalanobis score, and ODIN.




1. Introduction

Learning code representations (a.k.a. embeddings) and developing a prediction model for programs have been found to be beneficial for a variety of programming tasks, including program functionality classification (Nix and Zhang, 2017; Dahl et al., 2013), code search (Gu et al., 2018; Kim et al., 2018; Sachdev et al., 2018), code comments generation (Hu et al., 2018; Wan et al., 2018; Alon et al., 2019), bug prediction (Li et al., 2017; Zhou et al., 2019), program translation (Chen et al., 2018; Gu et al., 2017), etc.

While the performance of source code models continues to improve, the real world is open and full of unknowns, presenting significant challenges for such learning models to handle diverse inputs reliably. Existing source code models suffer from two kinds of robustness problems: (1) adversarial robustness: small, seemingly innocuous perturbations to the input lead to incorrect predictions (Rabin et al., 2020; Zhang et al., ; Bielik and Vechev, 2020); (2) out-of-distribution (OOD) uncertainty: a machine learning model sees an input that differs from its training data, yet still predicts one of its existing class labels. To the best of our knowledge, adversarial robustness for code has been studied recently (Rabin et al., 2019; Bielik and Vechev, 2020; Rabin et al., 2020; Ramakrishnan et al., 2020; Zhang et al., ; Yefet et al., 2020), while OOD uncertainty detection has been neglected in the research on the robustness of code models.

In this work, we seek to address the OOD robustness problem by proposing a novel learning method that brings two benefits together: (1) making source code models more resilient to adversarial samples; (2) enabling the detection of OOD samples.

We observe that current learning models only capture very specific patterns from limited training datasets, i.e., the datasets are classified by a given set of labels, and the models can only know what is known from the datasets to make known predictions. In other words, they learn known-knowns and unknown-knowns, rather than known-unknowns and unknown-unknowns.

In general, when training and testing data are drawn from the same distribution, modern neural networks are known to generalize effectively. However, when the neural networks are used in real-world applications, there is often limited control over the distribution of testing data. The training data may not contain a sufficient number of classes beforehand. Therefore, regardless of the size of the training dataset, the model cannot encompass all possible labels, so it may simply output a label based on the value of the softmax layer even though the testing sample does not belong to any of the labelled classes.

To tackle this problem, we aim to enable source code models to say "I don't know" whenever appropriate instead of making a blind prediction as if they knew the answer. For example, POJ104 (Mou et al., 2016) is a well-known dataset for the code classification task that contains 52000 samples and 104 classes. Many code learning models have been trained on this dataset, including Tree-based CNN (Mou et al., 2016), ASTNN (Zhang et al., 2019), TreeCaps (Bui et al., 2021), etc. However, when one deploys any of these models in real-world usage, a new/unforeseen program must still be classified into one of the 104 classes, even when it belongs to none of them. Ideally, the code classification model should be able to tell if the new code snippet does not really belong to any of these classes.

Apart from code classification, a significant variety of programming tasks can be formulated as classification problems. For instance, Alon et al. (2019b) formulates the method name prediction task as an extreme classification problem (extreme classification is a rapidly growing research area focusing on multi-class and multi-label problems, where the label space is extremely large). Hoang et al. (2019) develops a tool for the patch identification task, which can also be formulated as a classification problem. Numerous studies on bug detection have also been conducted as classification tasks (Nari and Ghorbani, 2013; Pascanu et al., 2015; Zhao et al., 2021; Liu et al., 2021). Given that OOD is a general problem for classification tasks, all of these programming tasks will suffer from it and might expose risks when deployed in real-world usage. Therefore, it is our belief that determining whether inputs are OOD is essential for source code models in real-world applications.

To overcome the above-mentioned limitation in learning methods, we propose to use an auxiliary dataset in addition to the main dataset to train for a specific code learning task. The auxiliary dataset is used as an external resource to improve the model's prediction capability, i.e., the model will know when not to predict something outside of its knowledge. In the case of code learning, a collection of code snippets from any code hosting repository, such as GitHub, can be used as the auxiliary dataset. The auxiliary dataset therefore has the benefit of being inexpensive to collect because one does not need to label its elements, as we will demonstrate later in technical details. We then aim to train the model to generate a single scalar value as the measurement score, so that in-distribution samples have high scores and out-of-distribution samples have low scores.

A softmax confidence score has been used in prior work for OOD detection in vision learning and natural language processing to safeguard against OOD inputs (Lee et al., 2018; Liang et al., 2018), where OOD is defined as an input with a low softmax confidence score. However, neural networks can yield arbitrarily high softmax confidence for inputs that are far away from the training data. On the other hand, energy-based models have been shown to improve image synthesis calibration, robustness, and OOD detection (Grathwohl et al., 2019). Inspired by this pioneering work, we propose an energy-bounded objective function for the auxiliary dataset that will be jointly trained with the cross-entropy objective function for the main dataset. The additional knowledge from the auxiliary dataset is transferred to the model during this joint training phase by pushing the OOD samples farther away, in distance, from the in-distribution samples.

Since our energy-bounded learning method is agnostic to the cross-entropy objective functions, it is applicable to a wide range of source code models and to several programming tasks. In this work, we evaluated the method on code representations such as Tree-based CNN, TreeCaps, Code2Vec, Bi-LSTM, and Transformer, with programming tasks including code classification and method name prediction. The results show that training the source code model with an auxiliary dataset and using an energy-bounded objective improves the model's robustness in terms of both adversarial robustness and OOD detection.

In summary, our contributions are listed as follows.

  • We are the first to explore a novel research direction to detect OOD samples to enhance the robustness of source code models;

  • We propose to use the energy as the score to detect OOD samples at inference time. We show that the energy score outperforms the other scores, such as softmax confidence score (Hendrycks and Gimpel, 2016), the Mahalanobis score (Lee et al., 2018), and ODIN (Liang et al., 2018) on novel OOD evaluation benchmarks of source code models.

  • We propose a novel energy-bounded loss function as an alternative to the cross-entropy loss function at training time for source code models. The source code models that employ this loss are trained on an auxiliary dataset (out-of-distribution) in addition to the main dataset (in-distribution), which is intended to improve the models' robustness. Our results demonstrate that using an energy-bounded loss function significantly improves the robustness of source code models in terms of adversarial robustness and OOD detection.

  • We leverage several semantics-equivalent program transformations as adversarial attacks because a weak code model could regard such transformed programs as being a new class of program. Therefore, we use these transformations to test the adversarial robustness of the energy-bound enhanced learning against subtle changes in the input code on the percentage of predictions changed metric;

  • Using code classification and method name prediction tasks, our evaluations demonstrate that using an energy-bounded loss function and an auxiliary dataset increases the capacity of big code models to discriminate between in- and out-of-distribution samples while retaining classification performance.

  • We release our code on the anonymous repository and the auxiliary dataset for OOD benchmark.

The remainder of the paper is organized as follows. Section 2 presents related work, and Section 3 introduces background concepts necessary for our approach, which is detailed in Section 4. Section 5 presents the evaluation of our approach, and Section 6 analyses the case study for threats to validity, with a discussion of future work in Section 7. Finally, we conclude in Section 8.

2. Related Work

Code Representation Learning

There has been a huge interest in applying deep learning techniques to programming tasks such as program functionality classification (Mou et al., 2016; Zhang et al., 2019), bug localization (Pradel and Sen, 2018; Gupta et al., 2019), function name prediction (Fernandes et al., 2019), code clone detection (Zhang et al., 2019), program refactoring (Hu et al., 2018), program translation (Chen et al., 2018), and code synthesis (Brockschmidt et al., 2019). Allamanis et al. (2018) extend ASTs to graphs by adding a variety of code dependencies as edges among the tree nodes, intended to represent code semantics, and apply Gated Graph Neural Networks (GGNN) (Li et al., 2016) to learn the graphs from code; Code2vec (Alon et al., 2019), Code2seq (Alon et al., 2019a), and ASTNN (Zhang et al., 2019) are designed based on splitting ASTs into smaller ones, either as a bag of path-contexts or as flattened subtrees representing individual statements. They use various kinds of Recurrent Neural Networks (RNNs) to learn such code representations.

Robustness of Neural Networks

A considerable body of research has been conducted on the robustness of artificial intelligence (AI) systems in general, and deep neural networks in particular. Szegedy et al. (2014) were the first to discover that deep neural networks are vulnerable to small perturbations that are imperceptible to human eyes. Many follow-up works (Kurakin et al., 2016; Moosavi-Dezfooli et al., 2016; Carlini and Wagner, 2017; Dong et al., 2018) further demonstrated the severity of the robustness issues with a variety of attacking methods. While the aforementioned approaches only apply to models for image classification, new attacks have been proposed that target models in other domains, such as natural language processing (Li et al., 2016; Jia and Liang, 2017; Zhao et al., 2018) and graphs (Dai et al., 2018; Zügner et al., 2018). In addition, a new line of research on methods to ensure the robustness of source code models has emerged (Bielik and Vechev, 2020; Zhang et al., ; Yefet et al., 2020; Ramakrishnan et al., 2020). Aside from these methods, Rabin et al. (2021) proposed a method that leverages program transformations to measure the generalizability of source code models.

OOD Detection

Significant progress has recently been made in identifying OOD samples and training models that are robust to such examples (defenses). However, the majority of the research has targeted computer-vision and natural language processing tasks (Hendrycks and Gimpel, 2016; Lee et al., 2018; Liang et al., 2018; DeVries and Taylor, 2018). The softmax confidence score has become a common baseline for OOD detection (Hendrycks and Gimpel, 2016). Many follow-up methods propose different scores to detect OOD samples, such as the Mahalanobis score (Lee et al., 2018) and a softmax score with an added temperature scaling factor (Liang et al., 2018). DeVries and Taylor (2018) propose to learn the confidence score by attaching an auxiliary branch to a pre-trained classifier and deriving an OOD score. While most of these methods have been proposed for either vision learning or natural language processing, we found no previous work that examines whether these methods are applicable to code representation learning, or whether they require any adaptation.

3. Background

In this section, we present the background concepts to prepare for the understanding of our solution in Section 4.

3.1. Code Representation Learning

The method of learning source code representations typically consists of two stages: (1) converting a code snippet to an intermediate representation (IR), such as token streams, ASTs, AST paths, or graphs; and (2) designing a neural network capable of processing such intermediate representations.

This type of neural network is also referred to as an encoder because it takes the input code IR and maps it to a code vector embedding, typically a combination of various types of code elements. The embedding vector is then fed into the next layer(s) for training with an objective function associated with the learning system's specific task.

Furthermore, cross-entropy is the loss function most commonly used across techniques that model source code (Watson et al., 2020). In our work, we propose an energy-bounded objective as an alternative to the cross-entropy loss function in order to enhance the robustness of source code models.

3.2. Energy-based Models

The essence of the energy-based model (EBM) (LeCun et al., 2006) is the construction of a function $E(x)$ that maps each point $x$ in the input space to a single, non-probabilistic scalar called the energy. A collection of energy values can be turned into a probability density $p(y \mid x)$ through the Gibbs distribution:

$$p(y \mid x) = \frac{e^{-E(x, y)/T}}{\int_{y'} e^{-E(x, y')/T}} = \frac{e^{-E(x, y)/T}}{e^{-E(x)/T}} \tag{1}$$

where the denominator $\int_{y'} e^{-E(x, y')/T}$ is called the partition function, which marginalizes over $y$, and $T$ is the temperature parameter.

The energy $E(x)$ of a given data point $x$ can be expressed as the negative logarithm of the partition function:

$$E(x) = -T \cdot \log \int_{y'} e^{-E(x, y')/T} \tag{2}$$

3.3. OOD detection

OOD detection is the task of deciding whether or not a sample is from a learned distribution $D_{in}$. Samples from $D_{in}$ are called in-distribution, and all other samples are said to be out-of-distribution (OOD) with respect to $D_{in}$. Therefore, OOD detection is a binary classification problem that relies on a score to differentiate between in- and out-of-distribution samples. We call this classifier the out-of-distribution detector (OODD).

Out-of-distribution Detector. An out-of-distribution detector should produce values that are distinguishable between in- and out-of-distribution. For each code sample $x$, we first feed $x$ into a pre-trained code-processing neural network (such as Tree-based CNN), then calculate the confidence score $S(x)$ and compare it to a threshold $\tau$.

Mathematically, the out-of-distribution detector can be described as:

$$G(x; \tau) = \begin{cases} \text{in-distribution} & \text{if } S(x) \geq \tau \\ \text{out-of-distribution} & \text{otherwise} \end{cases} \tag{3}$$

3.4. Adversarial Attack and Robustness

Adversarial robustness for source code models involves two steps. The first step is to generate samples that are adversarial to the normal samples, i.e., samples that the classifier is expected to find difficult to associate with the correct labels. The second step is to train the source code models in such a way that they can defend themselves against an adversarial attack.

For code learning models, a general idea to generate adversarial samples is to apply some program transformations to obtain from a program a variant of itself. The requirement is that the semantics of the original program must be preserved by the transformed programs, i.e., given an input, the transformed program must produce the same output as does the original program.
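As a minimal sketch of such a semantics-preserving transformation, the snippet below renames a local variable and checks that the original and transformed programs return the same output. It uses Python and its standard `ast` module purely for illustration; the programs studied in this paper are written in C and Java, and the names `RenameVariable`, `rename_variable`, and `run` are hypothetical helpers, not part of the authors' tooling.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one identifier (hypothetical helper)."""

    def __init__(self, old, new):
        self.old, self.new = old, new

    def visit_Name(self, node):
        # Variable reads and writes are Name nodes.
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        # Function parameters are arg nodes.
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename_variable(source, old, new):
    """Return the source code with identifier `old` renamed to `new`."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


ORIGINAL = """
def total(values):
    acc = 0
    for v in values:
        acc += v
    return acc
"""

TRANSFORMED = rename_variable(ORIGINAL, "acc", "running_sum")


def run(source, argument):
    """Execute the program text and call its `total` function."""
    namespace = {}
    exec(source, namespace)
    return namespace["total"](argument)


# Same input, same output: the transformation preserved the semantics.
same_output = run(ORIGINAL, [1, 2, 3]) == run(TRANSFORMED, [1, 2, 3])
```

The sketch assumes the new name does not clash with an existing identifier; a real transformation tool would need to check for such collisions.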

The adversarial robustness of a code learning model refers to how little its performance degrades when adversarial samples are used. Generative adversarial networks (GANs) could enhance the robustness of a given neural network. In this work, we do not use them because our purpose is to measure the adversarial robustness through the generated samples.

4. Our Approach

The general steps for training source code models are usually carried out as follows. Initially, the code samples need to be mapped into an appropriate representation, and then the code representation is fed into a code encoder that is specifically designed to process such a representation. For example, for the code classification task using Tree-based CNN (TBCNN) (Mou et al., 2016), the code samples in the main training dataset are represented as Abstract Syntax Trees (ASTs), and each AST is fed into the TBCNN to obtain the code embedding. The code embedding is then fed into a softmax layer for classification, and the cross-entropy loss function is utilized as the objective to train the code classification model.

Figure  1 presents an overview of our framework.

Figure 1. An overview of our approach: the encoder processes both the in-distribution dataset and the out-of-distribution dataset to get the corresponding embeddings. Then the embeddings are fed into a feed-forward neural network to get the logits. The scalar energy scores $E(x_{in})$ and $E(x_{out})$ for in- and out-of-distribution samples, respectively, are computed from the logits. Next, the energy-bounded loss function $L_{energy}$ is computed from $E(x_{in})$ and $E(x_{out})$, and the cross-entropy loss $L_{CE}$ is computed from the in-distribution logits. Finally, $L_{CE}$ and $L_{energy}$ are combined into a total loss function for the end-to-end training process.

First, an auxiliary dataset including additional code samples will be treated as out-of-distribution data, while the samples in the main dataset will be treated as in-distribution data. The code samples in the OOD dataset will be encoded in exactly the same way as the samples in the main dataset, except for the steps after obtaining the code embeddings and logits: the energies of both the auxiliary and main code embeddings are computed and passed to an energy-bounded loss function. The intuition behind introducing the energy-bounded loss function is to assign distinct score ranges that separate out-of-distribution from in-distribution data. The cross-entropy and energy-bounded loss functions are trained jointly in our end-to-end learning framework as the total loss.

4.1. Energy-bounded Loss Function at Training Time

Here we present the energy-bounded loss function used to train a source code model with an auxiliary dataset. While an energy score may be advantageous for a pre-trained neural network, the energy difference between in- and out-of-distribution samples is not always optimal for distinction. Following (Liu et al., 2020), we propose an energy-bounded learning objective in which the neural network is designed to purposely generate a gap between in-distribution and out-of-distribution data by assigning lower energy to in-distribution data and larger energy to out-of-distribution data.

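Specifically, the energy-based classifier is trained using the following objective function:

$$\min_{\theta} \; \mathbb{E}_{(x, y) \sim D_{in}^{train}} \left[ -\log F_y(x) \right] + \lambda \cdot L_{energy} \tag{4}$$

where $F(x)$ is the softmax output of the classification model and $D_{in}^{train}$ is the in-distribution training data.

The overall training objective combines the standard cross-entropy loss, along with a regularization loss defined in terms of energy:

$$L_{energy} = \mathbb{E}_{(x_{in}, y) \sim D_{in}^{train}} \left( \max(0, E(x_{in}) - m_{in}) \right)^2 + \mathbb{E}_{x_{out} \sim D_{out}^{train}} \left( \max(0, m_{out} - E(x_{out})) \right)^2 \tag{5}$$

where $D_{out}^{train}$ is the unlabeled auxiliary OOD training data.

In particular, the energy is regularized using two squared hinge loss terms with separate margin hyperparameters $m_{in}$ and $m_{out}$. In one term, the model penalizes in-distribution samples that produce energy higher than the specified margin parameter $m_{in}$. Similarly, in another term, the model penalizes the out-of-distribution samples with energy lower than the margin parameter $m_{out}$. In other words, the loss function penalizes the samples with energy $E(x) \in [m_{in}, m_{out}]$.

In our evaluation, we choose $\lambda$ = 0.1. The margins $m_{in}$ and $m_{out}$ are each chosen from a set of candidate values.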
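The objective described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it operates on precomputed logits rather than on a trainable encoder, and the logits and the margin values (`m_in=-5.0`, `m_out=-1.0`) are hypothetical numbers chosen only to make the example concrete. Only the weight `lam=0.1` is taken from the text.

```python
import math


def free_energy(logits, temperature=1.0):
    """E(x) = -T * log(sum_i exp(f_i(x) / T)), computed stably."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in scaled))
    return -temperature * log_sum


def cross_entropy(logits, label):
    """-log F_y(x) for a softmax classifier: log-sum-exp minus the true logit."""
    return -free_energy(logits) - logits[label]


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def energy_bounded_loss(in_batch, out_batch, m_in, m_out, lam=0.1):
    """Cross-entropy on labeled in-distribution data plus lam * L_energy.

    in_batch:  list of (logits, label) pairs from the main dataset
    out_batch: list of logits from the unlabeled auxiliary dataset
    """
    ce = mean(cross_entropy(logits, label) for logits, label in in_batch)
    # Penalize in-distribution samples whose energy exceeds m_in.
    hinge_in = mean(
        max(0.0, free_energy(logits) - m_in) ** 2 for logits, _ in in_batch
    )
    # Penalize auxiliary (OOD) samples whose energy falls below m_out.
    hinge_out = mean(
        max(0.0, m_out - free_energy(logits)) ** 2 for logits in out_batch
    )
    return ce + lam * (hinge_in + hinge_out)


# Hypothetical logits and margins, chosen only for illustration.
in_batch = [([10.0, 0.0, 0.0], 0)]
out_batch = [[0.0, 0.0, 0.0]]
loss = energy_bounded_loss(in_batch, out_batch, m_in=-5.0, m_out=-1.0)
```

In this toy batch the in-distribution sample already has energy below `m_in`, so only the auxiliary sample contributes to the regularization term. In practice the same computation would be expressed with a deep learning framework so that gradients flow back into the encoder.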

4.2. Using Energy on Pretrained Models for OOD Detection at Inference Time

Once the model is trained, the energy score can be derived from it for OOD detection. Here we intuitively present the steps to derive the energy score.

The energy-based model has an inherent connection with modern machine learning, especially discriminative models. To see this, consider a discriminative neural classifier $f(x)$, which maps an input $x$ to $K$ real-valued numbers known as logits.

Typically, these logits are used to derive a categorical distribution using the softmax function:

$$p(y \mid x) = \frac{e^{f_y(x)/T}}{\sum_{i=1}^{K} e^{f_i(x)/T}} \tag{6}$$

where $f_y(x)$ indicates the $y$-th index of $f(x)$, i.e., the logit corresponding to the $y$-th class label.

By connecting Eq. 1 and Eq. 6, the energy can be defined for a given input $(x, y)$ as $E(x, y) = -f_y(x)$. More importantly, without changing the parameterization of the neural network $f(x)$, we can express the free energy function $E(x; f)$ over $x$ in terms of the denominator of the softmax activation:

$$E(x; f) = -T \cdot \log \sum_{i=1}^{K} e^{f_i(x)/T} \tag{7}$$

It should be noted that the energy score can be derived from a pre-trained model that was trained using either the cross-entropy or the energy-bounded loss function. We will demonstrate the use of the energy score in the Evaluation section.
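The derivation of the energy score from logits can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names, the example logits, and the threshold of `-5.0` are assumptions made for the example. The two logit vectors differ by a constant shift, so they yield identical softmax distributions (softmax is shift-invariant), yet their energies differ, which shows that the energy carries information the softmax confidence discards.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free energy E(x; f) = -T * log(sum_i exp(f_i(x) / T))."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    log_sum = peak + math.log(sum(math.exp(value - peak) for value in scaled))
    return -temperature * log_sum


def is_out_of_distribution(logits, threshold, temperature=1.0):
    """Flag an input as OOD when its energy is above the threshold."""
    return energy_score(logits, temperature) > threshold


# Hypothetical logits: `strong` equals `weak` shifted by +10 in every class.
weak = [1.0, 0.0]
strong = [11.0, 10.0]

weak_energy = energy_score(weak)
strong_energy = energy_score(strong)

# Hypothetical threshold, chosen only for illustration.
flags = [is_out_of_distribution(logits, threshold=-5.0) for logits in (weak, strong)]
```

Here the shifted logits lower the energy by exactly the shift, so only the low-magnitude input is flagged, even though a softmax-based score would treat the two inputs identically.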

5. Evaluation

We describe two ways for measuring the robustness of code learning models. The first way is to determine whether or not a sample is out-of-distribution, i.e., OOD Detection. The second way is to determine whether a code learning model can continue to predict correctly against adversarial samples, i.e., Adversarial Robustness.

Notably, these two robustness measurements must be evaluated in accordance with specific code learning tasks. In Section 5.1, we describe the code learning tasks and the source code models for each of the task. Then in Sections 5.3 and 5.4, we describe the settings and the evaluation results for OOD Detection and Adversarial Robustness measurements, respectively.

5.1. Code Learning Tasks

The code learning tasks we consider in this paper are code classification and method name prediction. These tasks share a similar framework: the code representation is fed into the code encoder, and the encoder's output is used to predict a label, such as the tag name of a code snippet, the method name of a function, or a text description of a function body. The code encoder can be any of the well-known models that have been used for code learning recently, such as Tree-based CNN (Mou et al., 2016), Code2vec (Alon et al., 2019b), Code2seq (Alon et al., 2019a), BiLSTM (Schuster and Paliwal, 1997) and Transformer (Vaswani et al., 2017). Due to the generality of deep learning representations, each of these models can be adapted to these code learning tasks, even when its original design is for other tasks.

Below we describe the goal of each code learning task:

  • Code Classification: Given a piece of code, this task is to classify the functionality it belongs to. This task may go under different names depending on the purposes, such as malware classification, patch identification, bug triage, etc. Even though the names are different, the general goal is similar, i.e., to classify a piece of code into a given class. We choose 'code classification' to represent this family of problems because their processes are similar;

  • Method Name Prediction: Given a piece of code (without its function header), this task is to predict a meaningful name that reflects the functionality of the code. It should be noted that this task can be formulated in a variety of ways. The first way, e.g., Code2Vec (Alon et al., 2019b), is to treat this problem as an extreme classification task, where the classes are the set of method names, which can be very large. The second way, e.g., Code2seq (Alon et al., 2019a), is to treat this problem as a generation task, meaning that one can use a decoder to generate a sequence of sub-tokens. Since our aim is to consider OOD for classification tasks, we use the first approach, as in Code2Vec, to predict the method name out of a large number of classes.

5.2. Training Settings

We train the code classification models and the method name prediction models using two different loss functions. The first one is the well-known cross-entropy loss function used in most of the source code models (Watson et al., 2020). The second one is the energy-bounded loss function we presented in this study (Equation 4). In general, we call the source code models trained on cross-entropy loss as CE Models, and the energy-bounded loss as EB Models.

Note that the code snippet is processed differently for the appropriate model. For TBCNN, Code2vec and Code2seq models, code must be parsed into an AST for these models to process. We use tree-sitter, an incremental parsing system which supports multiple programming languages (Clem and Thomson, 2021) as the parser to parse code into the AST. For Bi-LSTM and Transformer, one can treat code simply as a sequence of tokens.

To train the neural networks, a common setting used among all these techniques is that they all utilize both node type and token information to initialize a node in ASTs. To make the comparison consistent and fair among all of the baselines, we follow (Bui et al., 2021) and set the dimensionality of both type embeddings and text embeddings to 100. Note that we make the source code models as strong as possible by choosing the hyper-parameters above as the optimal settings according to their papers or code. Also, note that the goal of this work is not to increase the performance of source code models on certain code learning tasks. Our goal is to train such models with the EB loss function such that the performance of the EB models is comparable to that of the CE models, while the robustness of the EB models is significantly higher than that of the CE models.

Model | CC (Acc), CE | CC (Acc), EB | MNP (F1), CE | MNP (F1), EB
--- | --- | --- | --- | ---
TBCNN | 95.12 | 95.35 | 40.89 | 39.86
TreeCaps | 97.24 | 97.56 | 48.29 | 47.01
Code2Vec | - | - | 27.49 | 28.83
Bi-LSTM | 84.59 | 85.92 | 38.25 | 38.69
Transformer | 93.49 | 93.80 | 45.39 | 44.28

Table 1. Performance of Code Classification (CC) and Method Name Prediction (MNP) of source code models when training with the Cross-Entropy loss (CE) and the Energy-Bounded loss (EB), respectively

Table 1 shows the performance when we train such models using the cross-entropy loss (CE) and energy-bounded loss (EB). The metric to measure the quality of code classification (CC) is accuracy (Acc), and the metric for method name prediction (MNP) is the F1 score on the sub-tokens. As seen, the performance of every model is comparable between CE and EB. This shows that training with the energy-bounded loss does not affect the prediction capability of source code models.

5.3. Out-of-distribution Detection

5.3.1. Baselines

We aim to compare our energy score with the following scores for OOD detection:

  • Maximum softmax probability (MSP) (Hendrycks and Gimpel, 2016): the softmax score has been used as a strong baseline to detect OOD samples, i.e., OOD samples tend to have a lower softmax score than in-distribution samples, so the score can be used to detect them.

  • Mahalanobis Distance (Lee et al., 2018): this score measures the probability density of a test sample on the feature spaces of DNNs, utilizing the concept of a "generative" classifier.

  • ODIN (Liang et al., 2018): this score is an improved version of the MSP score, which uses an additional temperature scaling of the softmax score to improve OOD detection.
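As a rough illustration of the two softmax-based baselines above, the sketch below computes the maximum softmax probability with an optional temperature. It is a simplified sketch under stated assumptions: the logits and the temperature value of 10 are hypothetical, and the block covers only the temperature-scaling part of ODIN, which in its original form also perturbs the input. The Mahalanobis score is omitted because it requires class-conditional feature statistics from the trained network.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, computed stably."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def max_softmax_probability(logits, temperature=1.0):
    """Confidence score: lower values suggest an OOD input."""
    return max(softmax(logits, temperature))


# Hypothetical logits for a three-class model.
logits = [4.0, 1.0, 0.0]
msp = max_softmax_probability(logits)                          # plain MSP
scaled_msp = max_softmax_probability(logits, temperature=10.0)  # with temperature
```

A larger temperature flattens the distribution, so the temperature-scaled confidence is lower than the plain MSP for the same logits.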

In contrast, we use the energy score in two different manners. First, an energy score is derived from Equation 7 for a pre-trained CE model; this is denoted as Eneg. Second, a similar energy score is also derived from Equation 7, but from a pre-trained EB model instead. By training the source code model with an auxiliary dataset, we would like to know whether the energy-bounded loss function yields better OOD detection than the energy score simply derived from a CE model.

5.3.2. Datasets

Code Classification

For the in-distribution dataset, we use the POJ dataset from (Mou et al., 2016), which has been widely used as the benchmark for code classification and comprises 52000 C programs in 104 classes. For the out-of-distribution dataset, we use the data from Project CodeNet (Puri et al., 2021), a large-scale dataset consisting of over 14 million code samples and about 500 million lines of code in 55 different programming languages. Because the POJ dataset is for the C language only, we select the samples in Project CodeNet that are written in C, which results in around 750k samples. The number of OOD samples is much larger than the number of in-distribution samples. To deal with the imbalanced datasets and speed up training, we randomly sample 60k samples from these 750k samples. Then we split both the in- and out-of-distribution datasets into training/testing/validation with the ratio 70/20/10, giving the data portions $D_{in}^{train}$, $D_{in}^{test}$, $D_{in}^{val}$ and $D_{out}^{train}$, $D_{out}^{test}$, $D_{out}^{val}$.
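The subsampling and 70/20/10 split described above can be sketched as follows. This is an illustrative sketch only: the helper name `sample_and_split`, the fixed seed, and the scaled-down sizes (600 samples drawn from 7500 candidates instead of 60k from 750k) are assumptions made to keep the example small.

```python
import random


def sample_and_split(samples, sample_size, seed=0):
    """Randomly subsample, then split into train/test/validation at 70/20/10."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    chosen = rng.sample(samples, sample_size)
    n_train = len(chosen) * 70 // 100
    n_test = len(chosen) * 20 // 100
    train = chosen[:n_train]
    test = chosen[n_train:n_train + n_test]
    validation = chosen[n_train + n_test:]
    return train, test, validation


# Stand-in identifiers for the candidate OOD programs (hypothetical data).
candidate_ids = list(range(7500))
train, test, validation = sample_and_split(candidate_ids, sample_size=600)
```

Because the three portions are slices of one shuffled sample, they are disjoint by construction.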

Method Name Prediction

For the in-distribution dataset, we use the Java-Small dataset (Alon et al., 2019a), a widely used benchmark for method name prediction that comprises 700k samples. This dataset is already split into training/testing/validation portions by project, which we use as the in-distribution portions. For the out-of-distribution dataset, we again use data from Project CodeNet (Puri et al., 2021). Because Java-Small contains Java code, we select only the Project CodeNet samples written in Java, which results in roughly 2M samples. We randomly select 700k of these 2M samples and split them into training/testing/validation portions with a 70/20/10 ratio to form the out-of-distribution portions.

5.3.3. Metrics

To evaluate out-of-distribution detection methods, we treat the OOD examples as the positive class and evaluate four metrics: area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPR), the false positive rate at 95% true positive rate (FPR95), and the test error at 95% true positive rate (Test Error).

AUROC and AUPR are holistic metrics that summarize the performance of a detection method across multiple thresholds. They are also useful when the anomaly score threshold is not known a priori. AUROC can be thought of as the probability that an anomalous example is given a higher OOD score than an in-distribution example (Davis and Goadrich, 2006). Thus, a higher AUROC is better, and a random detector has an AUROC of 50%. AUPR is useful when anomalous examples are infrequent (Manning and Schutze, 1999), as it takes the base rate of anomalies into account.

Whereas AUROC and AUPR represent the detection performance across various thresholds, the FPR95 metric represents the performance at one strict threshold. By observing the performance at a strict threshold, we can make clear comparisons among strong detectors. The FPR95 metric (Liu et al., 2018; Kumar BG et al., 2016; Balntas et al., 2016) is the probability that an in-distribution example (negative) raises a false alarm when 95% of anomalous examples (positive) are detected, so a lower FPR95 is better.

Capturing nearly all anomalies with few false alarms can be of high practical value. The Test Error is also derived from FPR95 and measures the misclassification probability when the TPR is 95%.
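As a minimal sketch of the two main detection metrics, the code below computes AUROC as the probability that an OOD example receives a higher OOD score than an in-distribution example, and the false positive rate at a 95% true positive rate. It assumes that a higher score means "more likely OOD" and that OOD is the positive class; the function names and toy scores are illustrative only.

```python
import math

def auroc(ood_scores, in_scores):
    """Probability that a random OOD sample outscores a random in-distribution one.

    Ties count as half a win. Quadratic in the number of samples, which is fine
    for a small illustration.
    """
    wins = 0.0
    for s_out in ood_scores:
        for s_in in in_scores:
            if s_out > s_in:
                wins += 1.0
            elif s_out == s_in:
                wins += 0.5
    return wins / (len(ood_scores) * len(in_scores))

def fpr_at_tpr(ood_scores, in_scores, tpr=0.95):
    """False positive rate at the threshold that detects at least `tpr` of OOD samples."""
    ranked = sorted(ood_scores, reverse=True)
    k = math.ceil(tpr * len(ranked))  # number of OOD samples that must be detected
    threshold = ranked[k - 1]
    false_alarms = sum(1 for s in in_scores if s >= threshold)
    return false_alarms / len(in_scores)

# Hypothetical OOD scores for four OOD and four in-distribution samples.
ood = [0.9, 0.8, 0.7, 0.6]
ind = [0.1, 0.2, 0.3, 0.65]
print(auroc(ood, ind), fpr_at_tpr(ood, ind))
```

In this toy example one in-distribution sample scores above the weakest OOD sample, so the detector is not perfect on either metric.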

5.3.4. Evaluation Results

Figure 2. Distribution of softmax scores vs. energy scores from TBCNN-CE vs. TBCNN-EB. (a) FPR95 = 99.71 for the softmax confidence score derived from the TBCNN-CE model. (b) FPR95 = 68.62 for the energy score derived from the TBCNN-CE model. (c) FPR95 = 0.00 for the energy score derived from the TBCNN-EB model.

| Task | Model | FPR95 ↓ | AUROC ↑ | AUPR ↑ | Test Error ↓ |
| --- | --- | --- | --- | --- | --- |
| CC | TBCNN | 99.71 / 89.85 / 74.58 / 68.62 / 0.00 | 14.74 / 43.66 / 51.98 / 74.05 / 99.95 | 48.33 / 67.21 / 61.39 / 84.39 / 99.98 | 33.09 / 36.32 / 31.24 / 32.37 / 0.71 |
| CC | Bi-LSTM | 99.87 / 94.96 / 89.43 / 92.55 / 5.96 | 9.74 / 56.45 / 48.45 / 82.34 / 95.82 | 34.25 / 45.83 / 60.42 / 80.49 / 96.99 | 35.89 / 34.55 / 31.32 / 33.35 / 8.45 |
| CC | TreeCaps | 99.65 / 93.82 / 88.28 / 89.34 / 0.00 | 25.46 / 49.12 / 62.82 / 84.28 / 99.95 | 47.69 / 68.89 / 64.35 / 89.85 / 99.95 | 31.92 / 30.55 / 40.59 / 27.49 / 0.71 |
| CC | Transformer | 99.68 / 86.92 / 84.41 / 88.85 / 0.00 | 20.88 / 45.45 / 60.57 / 73.95 / 99.97 | 48.21 / 67.32 / 65.22 / 84.90 / 99.95 | 32.59 / 28.55 / 35.41 / 37.33 / 0.90 |
| MNP | TBCNN | 89.71 / 80.85 / 74.58 / 68.28 / 6.56 | 25.56 / 34.55 / 40.57 / 58.33 / 93.56 | 59.29 / 85.56 / 40.59 / 64.44 / 92.49 | 25.53 / 24.55 / 27.59 / 24.31 / 12.29 |
| MNP | Bi-LSTM | 86.16 / 82.85 / 74.58 / 70.28 / 14.89 | 25.98 / 30.94 / 61.09 / 62.45 / 90.94 | 66.56 / 74.39 / 40.59 / 78.33 / 90.59 | 20.31 / 21.55 / 23.59 / 29.98 / 25.49 |
| MNP | Code2Vec | 89.71 / 89.36 / 74.58 / 73.58 / 6.39 | 36.98 / 39.94 / 53.08 / 67.35 / 95.32 | 57.12 / 78.69 / 40.59 / 81.90 / 93.99 | 19.57 / 26.93 / 25.02 / 20.04 / 10.32 |
| MNP | TreeCaps | 80.36 / 92.84 / 65.56 / 60.23 / 5.34 | 38.32 / 48.15 / 69.82 / 84.97 / 97.23 | 42.63 / 71.82 / 67.59 / 92.89 / 97.22 | 28.83 / 34.21 / 45.32 / 39.49 / 10.75 |
| MNP | Transformer | 89.71 / 84.52 / 70.25 / 62.39 / 11.49 | 35.98 / 30.94 / 61.19 / 55.97 / 94.71 | 58.93 / 79.03 / 40.59 / 76.95 / 93.87 | 23.22 / 24.82 / 28.53 / 19.83 / 15.84 |

Table 2. OOD Performance. Each cell lists the scores in the order MSP / Maha / ODIN / Energy / Energy-bounded. CC stands for Code Classification, MNP stands for Method Name Prediction. ↓ indicates smaller values are better and ↑ indicates larger values are better.

Table 2 shows the main evaluation results. Among the score names, “MSP” stands for the Maximum Softmax Probability score and “Maha” stands for the Mahalanobis score. “ODIN” is the MSP score with an additional temperature scaling factor, and “Energy” is the energy score computed from Equation 7. All of these scores are computed from a CE model.

The remaining score, “Energy-bounded”, also refers to the energy score derived from Equation 7, but computed from the EB models instead. The evaluation metrics are FPR95, AUROC, AUPR and Test Error; smaller values are better for FPR95 and Test Error, and larger values are better for AUROC and AUPR.

We can see that, on average, the energy score outperforms all of the other scores computed from the CE models in terms of FPR95, AUROC and AUPR, on both the code classification and the method name prediction task. This shows that the energy score is a powerful measure for detecting OOD samples. Moreover, if we train the source code models using the energy-bounded loss function, the derived energy score distinguishes in- and out-of-distribution samples even better: in Table 2, the Energy-bounded score is the best among all scores on every metric (FPR95, AUROC, AUPR, Test Error) for every model, with the single exception of Test Error for Bi-LSTM on method name prediction, where MSP, Maha and ODIN are lower (20.31, 21.55 and 23.59 versus 25.49).

It should be noted that Bi-LSTM mostly performs worse than the other code models on every metric. This is because Bi-LSTM is not a strong model for code learning tasks; we nevertheless include it in the evaluation to show that the energy-bounded learning method is applicable even to a weaker model.

To gain further insights, we compare the score distributions of in- and out-of-distribution data for the code classification task. Figure 2 compares the histograms of the softmax and energy scores derived from a TBCNN model trained with cross-entropy loss (TBCNN-CE) and from another TBCNN model trained with the energy-bounded loss (TBCNN-EB). As Figure 2a shows, the softmax histograms derived from TBCNN-CE make it difficult to distinguish the in- and out-of-distribution data, resulting in an FPR95 of 99.71%. The energy histograms derived from TBCNN-CE in Figure 2b separate the two distributions better, resulting in an FPR95 of 68.62%. Finally, Figure 2c shows the energy histograms derived from TBCNN-EB, which result in a perfect FPR95 of 0.00%. This demonstrates the superior performance of the energy-bounded model trained with the auxiliary dataset.

5.4. Adversarial Robustness

We measure the robustness of each model by applying semantic-preserving program transformations to a set of test programs. We follow Wang et al. (2019) and Rabin et al. (2020) and transform programs in five ways that change the code syntax but preserve its functionality:

  • Variable Renaming (VN): a refactoring transformation that renames a variable in the code, where the new name is taken randomly from the vocabulary of variable names in the training set.

  • Unused Statement (US): inserts an unused string declaration into a randomly selected basic block in the code.

  • Permute Statement (PS): swaps two independent statements (i.e., statements with no dependence between them) in a basic block in the code.

  • Loop Exchange (LX): replaces for loops with while loops or vice versa. We traverse the AST to identify the node that defines the for loop (or the while loop) and then replace one with the other, adjusting the initialization, the condition, and the afterthought.

  • Switch to If (SF): replaces a switch statement in the method with its equivalent if statement. We traverse the AST to identify a switch statement, then extract the subtree of each case statement of the switch and assign it to a new if statement.
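To illustrate what an AST-based, semantic-preserving transformation looks like, the sketch below implements a simplified Variable Renaming. It is an analogue written for Python source using the standard `ast` module (Python 3.9+ for `ast.unparse`), whereas the experiments in the paper target Java; the class and function names are hypothetical, the new name is passed in explicitly rather than drawn from a training vocabulary, and no check for name collisions is performed.

```python
import ast

class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one identifier (variables and parameters)."""

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def visit_Name(self, node):
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node):
        self.generic_visit(node)  # also visit a possible annotation
        if node.arg == self.old:
            node.arg = self.new
        return node

def rename_variable(source, old, new):
    """Parse the source, rename the identifier in the AST, and print the code back."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)

original = (
    "def total(xs):\n"
    "    acc = 0\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc\n"
)
transformed = rename_variable(original, "acc", "count")
print(transformed)
```

The transformed function computes the same result as the original, but a model that relies too heavily on identifier names may change its prediction.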

We then examine whether the models make the same predictions for the transformed programs as they did for the original programs. For this adversarial robustness measurement, we use the method name prediction task with the datasets used for OOD detection.

5.4.1. Metrics

We use the percentage of predictions changed ($\mathit{PPC}$), the metric used by Rabin et al. (2020), Zhang et al., and Wang et al. (2019), to measure the robustness of the code models. Formally, suppose $P$ denotes a set of test programs, $T$ a semantic-preserving program transformation that transforms $P$ into a set of transformed programs $P' = \{T(p) \mid p \in P\}$, and $M$ a source code model that can make a prediction for any program $p$: $M(p) = l$, where $l \in L$ denotes the predicted label for $p$ according to the set of labels $L$ learned by $M$. We compute the percentage of predictions changed as:

$$\mathit{PPC} = \frac{|\{p \in P : M(p) \neq M(T(p))\}|}{|P|} \times 100\%$$

Lower values of $\mathit{PPC}$ suggest higher robustness, as the model maintains more of its original predictions under the transformation.
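The metric described above, i.e., the share of test programs whose predicted label changes after a transformation, can be sketched in a few lines. The toy `model` and `transform` below are placeholders chosen only to make the example runnable; they are not the models or transformations evaluated in the paper.

```python
def percentage_of_predictions_changed(model, programs, transform):
    """Percentage of programs whose prediction differs before and after the transformation."""
    changed = sum(1 for p in programs if model(p) != model(transform(p)))
    return 100.0 * changed / len(programs)

# Toy stand-ins: a "model" that keys on a token, and a loop-exchange-like rewrite.
def model(program):
    return "loop" if "for" in program else "other"

def transform(program):
    return program.replace("for", "while")

programs = ["for i in xs", "x = 1", "for j in ys", "return y"]
ppc = percentage_of_predictions_changed(model, programs, transform)
print(ppc)
```

Here the toy model changes its prediction on the two programs that contain a `for` loop, so half of the predictions change.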

5.4.2. Results

Table 3 shows the results of this task, comparing the percentage of predictions changed for models trained with CE and with EB under the five semantic-preserving transformations. For all of these transformations, EB outperforms CE in terms of adversarial robustness (by an average of 6.7% across the transformations).

Although additional types of program transformations could be used to enhance model robustness, the analysis of current transformations already indicates that training all source code models with energy-bounded loss makes them more robust to adversarial attacks. In future work, one could introduce more semantic-preserving program transformations to enhance the robustness of the models.

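| Model | CE | EB | CE | EB | CE | EB | CE | EB | CE | EB |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| TBCNN | 22.45% | 15.48% | 24.92% | 14.60% | 18.49% | 15.33% | 19.84% | 12.33% | 25.09% | 18.50% |
| TreeCaps | 16.39% | 10.49% | 22.49% | 13.01% | 17.94% | 13.45% | 17.39% | 11.20% | 23.22% | 15.29% |
| Code2Vec | 24.40% | 20.59% | 24.06% | 20.58% | 19.55% | 17.29% | 23.88% | 18.49% | 29.31% | 18.02% |
| Bi-LSTM | 28.29% | 22.60% | 29.92% | 23.09% | 24.92% | 20.99% | 26.29% | 20.55% | 27.23% | 22.50% |
| Transformer | 20.09% | 16.33% | 21.11% | 16.29% | 20.94% | 16.30% | 22.98% | 16.23% | 27.95% | 21.28% |

Table 3. Model robustness under cross-entropy loss (CE) vs. energy-bounded loss (EB), measured as the percentage of predictions changed with respect to semantic-preserving program transformations. Each CE/EB column pair corresponds to one of the five transformations. Lower values indicate a more robust model.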

6. Case Study

Our goal in this study is to use the pre-trained EB models to determine whether the code snippets in the in-distribution test set (in-testing) are out-of-distribution. We choose method name prediction because code classification is not suitable for this analysis: the in-testing of code classification is split from the same data distribution as its training data, whereas the in-testing of method name prediction is collected by project. For example, WildFly and Gradle are two projects in the in-testing. Determining whether the code snippets in these projects are OOD is therefore beneficial, as it can reduce the false positives produced by the model’s predictions.

To do so, we need to define a threshold on the energy score such that a sample is considered out-of-distribution when its energy score is higher than the threshold, and in-distribution otherwise. We use the ROC curve to find a good threshold. In our case, a good threshold yields a high True Positive Rate (TPR) and a low False Positive Rate (FPR), i.e., few misclassifications. Since our dataset is balanced, we choose the threshold that maximizes TPR - FPR.
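The threshold selection described above can be sketched as a simple search over candidate thresholds, keeping the one with the largest TPR - FPR. The sketch assumes that OOD is the positive class and that a sample is flagged as OOD when its energy score is strictly higher than the threshold; the function name and the toy energy values are illustrative only.

```python
def best_threshold(in_scores, ood_scores):
    """Return (threshold, TPR - FPR) for the candidate that maximizes TPR - FPR."""
    candidates = sorted(set(in_scores) | set(ood_scores))
    best_tau, best_gap = None, float("-inf")
    for tau in candidates:
        tpr = sum(s > tau for s in ood_scores) / len(ood_scores)  # OOD correctly flagged
        fpr = sum(s > tau for s in in_scores) / len(in_scores)    # in-distribution wrongly flagged
        if tpr - fpr > best_gap:  # ties keep the earlier (lower) threshold
            best_tau, best_gap = tau, tpr - fpr
    return best_tau, best_gap

# Hypothetical energy scores: in-distribution samples tend to have lower energy.
in_energy = [-9.0, -8.5, -7.0, -6.0]
ood_energy = [-6.5, -3.0, -2.5, -1.0]
tau, gap = best_threshold(in_energy, ood_energy)
print(tau, gap)
```

Maximizing TPR - FPR weights both error types equally, which is a natural choice when the in- and out-of-distribution sets are balanced, as stated above.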

In this study, we employ the TreeCaps-EB model for method name prediction. We first determine the energy threshold from the test samples and then use it to assess whether the code snippets in the in-testing part are OOD. According to our analysis, 64% of the samples in the in-testing part are OOD. If we accept that the model will fail to predict a suitable name for these samples, we can disregard them during the testing phase. By eliminating these 64% of test samples, the F1 score of TreeCaps-EB for method name prediction rises from 39.86% to 74.38%. To verify that this gain is due to the OOD selection and not simply to the smaller test set, we instead eliminate a randomly selected 64% of the test samples and evaluate the remainder; this yields an F1 score of only 44.34%, far below the 74.38% obtained by excluding the OOD samples. This means that OOD detection helps us exclude samples for which the model cannot make an accurate prediction.
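The comparison between excluding OOD samples and excluding the same number of random samples can be sketched as follows. This is a simplified illustration: it uses plain accuracy on a handful of made-up records instead of the F1 score on real method names, and the record fields, helper names and threshold value are assumptions.

```python
import random

def exclude_ood(samples, tau):
    """Keep only samples whose energy score does not exceed the threshold."""
    return [s for s in samples if s["energy"] <= tau]

def exclude_random(samples, n_drop, seed=0):
    """Control condition: drop the same number of samples, chosen at random."""
    rng = random.Random(seed)
    dropped = set(rng.sample(range(len(samples)), n_drop))
    return [s for i, s in enumerate(samples) if i not in dropped]

def accuracy(samples):
    return sum(s["correct"] for s in samples) / len(samples)

# Made-up records: energy score and whether the model's prediction was correct.
samples = [
    {"energy": -9.1, "correct": True},
    {"energy": -8.4, "correct": True},
    {"energy": -7.9, "correct": False},
    {"energy": -3.2, "correct": False},
    {"energy": -2.8, "correct": False},
    {"energy": -1.5, "correct": True},
]
tau = -7.0  # hypothetical threshold

kept = exclude_ood(samples, tau)
control = exclude_random(samples, len(samples) - len(kept))
print(accuracy(samples), accuracy(kept), accuracy(control))
```

Dropping the same number of samples in the control condition separates the effect of the OOD selection from the effect of merely shrinking the test set.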

It is worth noting that code learning models, even state-of-the-art ones such as TreeCaps, are not always able to produce correct results for classification tasks that involve many class labels. For example, Figure 3 shows some code snippets for which the TreeCaps-EB model could not predict the correct name in the method name prediction task. These samples have been predicted to be OOD by our method. Therefore, it is possible to use OOD detection to better understand the limitations of code learning models in such tasks.

In summary, the threats to validity in this case study are as follows:

Construct validity

Even though we chose the energy threshold by maximizing TPR - FPR, it is still an empirical parameter that may vary across datasets and case studies. Further study is required to determine whether there are better signals for setting the threshold. Another potential threat to validity is the choice of procedural programming languages. Since existing code classification and method name prediction tasks have not been evaluated on functional programming paradigms, we do not extend our scope to them either. In the future, we may extend the scope to other paradigms once the code learning models have been evaluated on them.

Internal and external validity

By using semantic-preserving program transformations, we minimize the potential errors introduced by us, because all the labels of out-of-distribution samples are determined by the algorithm. The two tasks used in this study have been benchmarked by other researchers in this field. We have not introduced additional datasets or labels that could bias the analysis.

(a) Ground Truth: getMetaData, TreeCaps’s Prediction: getInstance  
(b) Ground Truth: solve, TreeCaps’s Prediction: getCharacter  
(c) Ground Truth: expectedDefaultForNoModeZips , TreeCaps’s Prediction: assert  
(d) Ground Truth: maybeFire, TreeCaps’s Prediction: registry  
Figure 3. Examples of code snippets that our energy-based learning algorithm identified as being out-of-distribution successfully. We illustrate the ground-truth method name in each case study alongside the prediction provided by the TreeCaps model from the Softmax layer. All of the predicted names are incorrect or incomplete.

7. Discussion

In this section, we discuss our technique in further detail. First, our technique was assessed exclusively on classification-based tasks, but many programming problems can be formulated differently. For instance, extreme summarization (Allamanis et al., 2016) summarizes a program into a textual description, and program translation (Chen et al., 2018) translates a program from one language into another. Both can be considered translation tasks in general, and there is a need to quantify the uncertainty of the translated results, i.e., the model should not produce incorrect translation results when it is uncertain about the output. Numerous works have proposed techniques for testing machine translation in natural language processing (He et al., 2021; Sun et al., 2020; Gupta et al., 2020), but these techniques focus exclusively on the errors in the translated sentences rather than on their uncertainty. Additionally, those methods are limited to natural language processing, whereas our approach is suited to code representation learning.

Second, there are other ways of evaluating the robustness of source code models against adversarial attacks. For instance, Yefet et al. (2020) propose a gradient-based technique for generating adversarial examples that can be used to attack source code models. This method has been used to generate adversarial examples against the method name prediction and bug detection tasks. In our setting, we use the more straightforward approach suggested by Rabin et al. (2021), which applies a predefined set of transformation operators to the code snippets. We leave the evaluation of our energy-based models against the gradient-based method to future work.

8. Conclusion

In this paper, we propose an energy-bounded loss function as an alternative to the cross-entropy loss function commonly used to train source code models. Along with the main (in-distribution) dataset, the energy-bounded objective is trained using an auxiliary (out-of-distribution) dataset. We hypothesise that this training technique improves the robustness of source code models while preserving their predictive ability, and that the energy score produced by such models can be used to detect out-of-distribution code snippets. The results of our evaluation on two programming-related tasks indicate that the energy-bounded score surpasses all of the other scores on the OOD detection task. Additionally, for the method name prediction task, models trained with the energy-bounded loss function demonstrate higher robustness than models trained with the cross-entropy loss function. We therefore recommend considering this alternative in classification-based code learning.


  • M. Allamanis, M. Brockschmidt, and M. Khademi (2018) Learning to represent programs with graphs. In International Conference on Learning Representations, Cited by: §2.
  • M. Allamanis, H. Peng, and C. Sutton (2016) A convolutional attention network for extreme summarization of source code. In International conference on machine learning, pp. 2091–2100. Cited by: §7.
  • U. Alon, O. Levy, and E. Yahav (2019a) Code2seq: generating sequences from structured representations of code. In International Conference on Learning Representations, External Links: Link Cited by: §2, 2nd item, §5.1, §5.3.2.
  • U. Alon, M. Zilberstein, O. Levy, and E. Yahav (2019) Code2Vec: learning distributed representations of code. Proc. ACM Programming Languages 3 (POPL), pp. 40:1–40:29. Cited by: §1, §2.
  • U. Alon, M. Zilberstein, O. Levy, and E. Yahav (2019b) Code2vec: learning distributed representations of code. Proceedings of the ACM on Programming Languages 3 (POPL), pp. 1–29. Cited by: §1, 2nd item, §5.1.
  • V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk (2016) Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC, Vol. 1, pp. 3. Cited by: §5.3.3.
  • P. Bielik and M. Vechev (2020) Adversarial robustness for code. arXiv preprint arXiv:2002.04694. Cited by: §1, §2.
  • M. Brockschmidt, M. Allamanis, A. L. Gaunt, and O. Polozov (2019) Generative code modeling with graphs. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §2.
  • N. D. Bui, Y. Yu, and L. Jiang (2021) TreeCaps: tree-based capsule networks for source code processing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 30–38. Cited by: §1, §5.2.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pp. 39–57. Cited by: §2.
  • X. Chen, C. Liu, and D. Song (2018) Tree-to-tree neural networks for program translation. In Advances in Neural Information Processing Systems, pp. 2547–2557. Cited by: §1, §2, §7.
  • T. Clem and P. Thomson (2021) Static analysis at github: an experience report. Queue 19 (4), pp. 42–67. Cited by: §5.2.
  • G. E. Dahl, J. W. Stokes, L. Deng, and D. Yu (2013) Large-scale malware classification using random projections and neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3422–3426. Cited by: §1.
  • H. Dai, H. Li, T. Tian, X. Huang, L. Wang, J. Zhu, and L. Song (2018) Adversarial attack on graph structured data. Proceedings of the 35th International Conference on Machine Learning, PMLR 80. External Links: Link Cited by: §2.
  • J. Davis and M. Goadrich (2006) The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pp. 233–240. Cited by: §5.3.3.
  • T. DeVries and G. W. Taylor (2018) Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865. Cited by: §2.
  • Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li (2018) Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9185–9193. Cited by: §2.
  • P. Fernandes, M. Allamanis, and M. Brockschmidt (2019) Structured neural summarization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, External Links: Link Cited by: §2.
  • W. Grathwohl, K. Wang, J. Jacobsen, D. Duvenaud, M. Norouzi, and K. Swersky (2019) Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263. Cited by: §1.
  • X. Gu, H. Zhang, and S. Kim (2018) Deep code search. In 40th ICSE, pp. 933–944. Cited by: §1.
  • X. Gu, H. Zhang, D. Zhang, and S. Kim (2017) DeepAM: migrate apis with multi-modal sequence to sequence learning. In International Joint Conference on Artificial Intelligence, pp. 3675–3681. Cited by: §1.
  • R. Gupta, A. Kanade, and S. Shevade (2019) Neural attribution for semantic bug-localization in student programs. In Advances in Neural Information Processing Systems, pp. 11861–11871. Cited by: §2.
  • S. Gupta, P. He, C. Meister, and Z. Su (2020) Machine translation testing via pathological invariance. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 863–875. Cited by: §7.
  • P. He, C. Meister, and Z. Su (2021) Testing machine translation via referential transparency. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pp. 410–422. Cited by: §7.
  • D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: 2nd item, §2, 1st item.
  • T. Hoang, J. Lawall, R. J. Oentaryo, Y. Tian, and D. Lo (2019) PatchNet: a tool for deep patch classification. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), pp. 83–86. Cited by: §1.
  • X. Hu, G. Li, X. Xia, D. Lo, and Z. Jin (2018) Deep code comment generation. In International Conference on Program Comprehension, pp. 200–210. Cited by: §1, §2.
  • R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328. Cited by: §2.
  • K. Kim, D. Kim, T. F. Bissyandé, E. Choi, L. Li, J. Klein, and Y. L. Traon (2018) FaCoY: a code-to-code search engine. In icse, pp. 946–957. Cited by: §1.
  • V. Kumar BG, G. Carneiro, and I. Reid (2016) Learning local image descriptors with deep siamese and triplet convolutional networks by minimising global loss functions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5385–5394. Cited by: §5.3.3.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Cited by: §2.
  • Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006) A tutorial on energy-based learning. Predicting structured data 1 (0). Cited by: §3.2.
  • K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems 31. Cited by: 2nd item, §1, §2, 2nd item.
  • J. Li, P. He, J. Zhu, and M. R. Lyu (2017) Software defect prediction via convolutional neural network. In IEEE International Conference on Software Quality, Reliability and Security, pp. 318–328. Cited by: §1.
  • J. Li, W. Monroe, and D. Jurafsky (2016) Understanding neural networks through representation erasure. arXiv preprint arXiv:1612.08220. Cited by: §2.
  • Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel (2016) Gated graph sequence neural networks. In International Conference on Learning Representations, Cited by: §2.
  • S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: 2nd item, §1, §2, 3rd item.
  • S. Liu, R. Garrepalli, T. Dietterich, A. Fern, and D. Hendrycks (2018) Open category detection with pac guarantees. In International Conference on Machine Learning, pp. 3169–3178. Cited by: §5.3.3.
  • W. Liu, X. Wang, J. D. Owens, and Y. Li (2020) Energy-based out-of-distribution detection. arXiv preprint arXiv:2010.03759. Cited by: §4.1.
  • Y. Liu, C. Tantithamthavorn, L. Li, and Y. Liu (2021) Deep learning for android malware defenses: a systematic literature review. arXiv preprint arXiv:2103.05292. Cited by: §1.
  • C. Manning and H. Schutze (1999) Foundations of statistical natural language processing. MIT press. Cited by: §5.3.3.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §2.
  • L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin (2016) Convolutional neural networks over tree structures for programming language processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, pp. 1287–1293. Cited by: §5.1, §5.3.2.
  • L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin (2016) Convolutional neural networks over tree structures for programming language processing.. Cited by: §1, §2, §4.
  • S. Nari and A. A. Ghorbani (2013) Automated malware classification based on network behavior. In 2013 International Conference on Computing, Networking and Communications (ICNC), pp. 642–647. Cited by: §1.
  • R. Nix and J. Zhang (2017) Classification of android apps and malware using deep neural networks. In International Joint Conference on Neural Networks, pp. 1871–1878. Cited by: §1.
  • R. Pascanu, J. W. Stokes, H. Sanossian, M. Marinescu, and A. Thomas (2015) Malware classification with recurrent networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1916–1920. Cited by: §1.
  • M. Pradel and K. Sen (2018) DeepBugs: a learning approach to name-based bug detection. Proceedings of the ACM on Programming Languages 2 (OOPSLA), pp. 147. Cited by: §2.
  • R. Puri, D. S. Kung, G. Janssen, W. Zhang, G. Domeniconi, V. Zolotov, J. Dolby, J. Chen, M. Choudhury, L. Decker, et al. (2021) Project codenet: a large-scale ai for code dataset for learning a diversity of coding tasks. arXiv preprint arXiv:2105.12655. Cited by: §5.3.2, §5.3.2.
  • M. Rabin, R. Islam, N. D. Bui, Y. Yu, L. Jiang, and M. A. Alipour (2020) On the generalizability of neural program analyzers with respect to semantic-preserving program transformations. arXiv preprint arXiv:2008.01566. Cited by: §1, §5.4.1, §5.4.
  • M. R. I. Rabin, N. D. Bui, K. Wang, Y. Yu, L. Jiang, and M. A. Alipour (2021) On the generalizability of neural program models with respect to semantic-preserving program transformations. Information and Software Technology 135, pp. 106552. Cited by: §2, §7.
  • M. R. I. Rabin, K. Wang, and M. A. Alipour (2019) Testing neural program analyzers. In 34th IEEE/ACM International Conference on Automated Software Engineering (Late Breaking Research-Track), Cited by: §1.
  • G. Ramakrishnan, J. Henkel, Z. Wang, A. Albarghouthi, S. Jha, and T. Reps (2020) Semantic robustness of models of source code. arXiv preprint arXiv:2002.03043. Cited by: §1, §2.
  • S. Sachdev, H. Li, S. Luan, S. Kim, K. Sen, and S. Chandra (2018) Retrieval on source code: a neural code search. In 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 31–41. External Links: Document, ISBN 9781450358347 Cited by: §1.
  • M. Schuster and K. K. Paliwal (1997) Bidirectional recurrent neural networks. IEEE transactions on Signal Processing 45 (11), pp. 2673–2681. Cited by: §5.1.
  • Z. Sun, J. M. Zhang, M. Harman, M. Papadakis, and L. Zhang (2020) Automatic testing and improvement of machine translation. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, pp. 974–985. Cited by: §7.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In ICLR, Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §5.1.
  • Y. Wan, Z. Zhao, M. Yang, G. Xu, H. Ying, J. Wu, and P. S. Yu (2018) Improving automatic source code summarization via deep reinforcement learning. In 33rd ASE, New York, NY, USA, pp. 397–407. External Links: Document, ISBN 9781450359375 Cited by: §1.
  • Y. Wang, F. Gao, L. Wang, and K. Wang (2019) Learning a static bug finder from data. arXiv preprint arXiv:1907.05579. Cited by: §5.4.1, §5.4.
  • C. Watson, N. Cooper, D. N. Palacio, K. Moran, and D. Poshyvanyk (2020) A systematic literature review on the use of deep learning in software engineering research. arXiv preprint arXiv:2009.06520. Cited by: §3.1, §5.2.
  • N. Yefet, U. Alon, and E. Yahav (2020) Adversarial examples for models of code. Proceedings of the ACM on Programming Languages 4 (OOPSLA), pp. 1–30. Cited by: §1, §2, §7.
  • H. Zhang, Z. Li, G. Li, L. Ma, Y. Liu, and Z. Jin Generating adversarial examples for holding robustness of source code processing models. Cited by: §1, §2, §5.4.1.
  • J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu (2019) A novel neural source code representation based on abstract syntax tree. In 41st ICSE, pp. 783–794. Cited by: §1, §2.
  • Y. Zhao, L. Li, H. Wang, H. Cai, T. F. Bissyandé, J. Klein, and J. Grundy (2021) On the impact of sample duplication in machine-learning-based android malware detection. ACM Transactions on Software Engineering and Methodology (TOSEM) 30 (3), pp. 1–38. Cited by: §1.
  • Z. Zhao, D. Dua, and S. Singh (2018) Generating natural adversarial examples. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  • Y. Zhou, S. Liu, J. K. Siow, X. Du, and Y. Liu (2019) Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 10197–10207. External Links: Link Cited by: §1.
  • D. Zügner, A. Akbarnejad, and S. Günnemann (2018) Adversarial attacks on neural networks for graph data. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2847–2856. Cited by: §2.