, GPT-3 (Brown et al., 2020), T5 (Raffel et al., 2020), and BART (Lewis et al., 2020), have greatly contributed to the development of natural language processing (NLP) and have gradually established the pretrain-then-finetune paradigm. The basic idea of this paradigm is to first pre-train a model on large general-purpose datasets with self-supervised tasks, e.g., masking tokens in the training data and asking the model to guess the masked tokens. The trained model is then fine-tuned on smaller and more specialized datasets, each designed to support a specific task. The success of pre-trained models in the natural language domain has also spawned a series of pre-trained models for programming language understanding and generation, including CodeBERT (Feng et al., 2020), GraphCodeBERT (Guo et al., 2021), PLBART (Ahmad et al., 2021), and the usage of T5 to support code-related tasks (Mastropaolo et al., 2021), improving the performance of a variety of source code understanding and generation tasks.
However, pre-training a large-scale model from scratch is costly. Additionally, as the number of pre-trained models grows, how to effectively adapt these models to a new task has not been fully explored. In this paper, we take a first step toward bridging large pre-trained models and code-related downstream tasks. Moreover, despite their success on code-related tasks, existing pre-trained models have two potential issues. First, these models graft NLP pre-training techniques onto source code to understand its semantics; however, the semantics of programming languages and natural languages are essentially different, and semantically equivalent source code may take various syntactic forms. Second, pre-trained models typically have at least millions of parameters, so when a pre-trained model is applied to downstream tasks with specialized datasets, there is a risk of overfitting because the model is over-parameterized for the target dataset. McCoy et al. (McCoy et al., 2019) find that many models, including BERT, overuse syntactic heuristics in the natural language inference task, e.g., associating irrelevant sentences because of lexical overlap, and thus perform well for the wrong reasons. Many studies have also found that when the test set differs from the actual scenario or is slightly perturbed, various models for source code make mistakes (Quiring et al., 2019; Ramakrishnan et al., 2020; Yefet et al., 2020).
To address the above issues, we design a lightweight approach on top of the existing pre-trained language model fine-tuning paradigm that (1) extracts code semantic knowledge embedded in diverse syntactic forms and supplies it to pre-trained models, and (2) reduces overfitting to the target dataset and is more robust at test time. To incorporate semantic knowledge of programming languages into models, we employ data augmentation, which is mainly used to enrich the training dataset and make it as diverse as possible. There are many successful applications of data augmentation in the field of image processing, such as random cropping (Krizhevsky et al., 2012), flipping (Simonyan and Zisserman, 2015) and dropout (Srivastava et al., 2014). For code data, this paper considers semantic-preserving transformations. An example of code transformation is shown in Fig. 1, where the same program is transformed three times successively, keeping the semantics unchanged. Since the semantics of the original program are preserved, the model should behave the same on a transformed program as on the original program. Moreover, it is cheap to leverage a source-to-source compiler (Aho et al., 2006) to perform semantic-preserving transformations on source code. Thus, without additional labelled data, diverse data are created through semantic-preserving transformations. The transformations also introduce semantic knowledge into the model learning process, allowing the model to learn semantically invariant code features rather than relying on syntax and implementation details.
In this paper, we build our approach on a series of large-scale pre-trained models, including natural language pre-trained model RoBERTa and code pre-trained models CodeBERT and GraphCodeBERT, to bridge pre-trained models with downstream tasks for source code. We first construct semantic-preserving transformation sequences and apply them to original training samples, as in Fig. 1, to generate new training data and introduce code semantic knowledge into models. The transformation sequences make code transformations more complicated and could guide models to better learn the underlying semantics of the code. These training data are then fed to pre-trained models to fine-tune the models. Finally, in order to make full use of the features learned from semantically equivalent transformations during the training process, we augment the test sets with the same augmentation techniques as the training sets to obtain multiple transformed test sets. In this way, the transformations that appear during testing would act like prompts and help the model make accurate predictions more easily. To further reduce overfitting from the training process, we average the model performance on these test sets. Since our method averages the predictions from various transformation versions for any code snippet in test sets, the final predictions are robust to any transformation copy.
The transformed data significantly increase the data diversity; however, they can also be regarded as adversarial examples compared to the original data (Ramakrishnan et al., 2020; Rabin et al., 2020). Fig. 1 shows the original program and programs after multiple code transformations. As the number of transformations increases, new tokens and syntactic forms are introduced, and the distribution of the transformed data becomes more distinct from that of the original data, making it more difficult to learn. To solve this issue, we introduce Curriculum Learning (CL) (Matiisen et al., 2020) and present training examples in an easy-to-hard manner, instead of in a completely random order during training. Many studies have shown that such an ordering benefits the learning process not only for humans but also for machines (Elman, 1993; Krueger and Dayan, 2009). There are many successful applications of CL in natural language processing, including machine translation (Platanios et al., 2019; Zhang et al., 2018), natural language understanding (Xu et al., 2020), and answer generation (Liu et al., 2018). The key challenge of CL is how to define easy and hard samples, and in this paper we propose two hypotheses and experimentally verify them to determine the learning order.
In our experiments, based on the pre-trained models CodeBERT and GraphCodeBERT, our method significantly surpasses the state-of-the-art performance on algorithm classification, code clone detection and code search tasks. In the algorithm classification task, our approach improves Mean Average Precision (MAP) by 10.24% compared to the state-of-the-art performance, and in the code clone detection task, using only 10% of the randomly sampled training data, the code pre-trained model CodeBERT fine-tuned with our approach outperforms the state-of-the-art model GraphCodeBERT normally fine-tuned with all training data. In the code search task, our method improves the state-of-the-art performance to 0.720 Mean Reciprocal Rank (MRR). More impressively, to test whether our approach introduces additional semantic knowledge of source code into the model, we apply our approach to the natural language pre-trained model RoBERTa and find that it even outperforms CodeBERT by 3.88% MAP on the algorithm classification task, outperforms RoBERTa pre-trained with code on the code search task, and matches the performance of CodeBERT on the code clone detection task. The data, pre-trained models and implementation of our approach are publicly available at the anonymous link: https://anonymous.4open.science/r/DACL-F660/.
The main contributions of our paper are as follows:
We design a lightweight approach on top of the existing pre-trained language model fine-tuning paradigm, to bridge pre-trained models and downstream tasks for source code. To the best of our knowledge, it is the first work in this direction.
We apply our method to pre-trained models CodeBERT and GraphCodeBERT, and the augmented models dramatically outperform the state-of-the-art performance on algorithm classification, code clone detection and code search tasks.
Our study reveals that for code-related tasks, without the need for heavy pre-training on code data, natural language models (e.g. RoBERTa) easily outperform the same models pre-trained with code, as well as the state-of-the-art code pre-trained models (e.g. CodeBERT) with the help of our approach.
The rest of the paper is organized as follows: preliminaries and hypotheses are described in Section 2. The technical details of our approach are presented in Section 3. The evaluation and analysis of our approach are shown in Section 4 and Section 5. Related work and threats to validity are in Section 6 and Section 7. Section 8 concludes the paper.
2. Preliminaries and Hypotheses
2.1. Data Augmentation
Data Augmentation (DA) is a technique to artificially create new training data from existing training data. It is done by applying random transformations to increase the diversity of the training set. Data augmentation is often performed on image data, where copies of images in the training set are created by applying image transformation techniques such as zooms, flips and shifts. Data augmentation can also be applied to natural language and code data. In this paper, our purpose in introducing data augmentation is to learn code semantics from semantic-preserving transformations and, more specifically, to help models extract and learn features that are invariant to semantically equivalent declarations, APIs, control structures and so on.
In this paper, we exploit data augmentation not only for the training set but also for the test set. The application of data augmentation to the test set is called Test-Time Augmentation (TTA) (Simonyan and Zisserman, 2015; Moshkov et al., 2020). Specifically, it creates multiple augmented copies of each sample in the test set, has the model make a prediction for each, and then returns an ensemble of those predictions. The number of copies of the given data for which a model must make a prediction is often small. In our experiments, for each test sample we randomly draw three of its augmented copies, average the results on these copies to obtain the result from the augmented perspective, and add this to the result on the original data to obtain the final result.
2.2. Curriculum Learning
The learning process of humans and animals generally proceeds from easy to difficult, and CL draws on this idea. Bengio et al. (Bengio et al., 2009) first propose CL, imitating the process of human learning, and advocate that the model should start learning from easy samples and gradually expand to complex samples. In recent years, CL strategies have been widely used in various scenarios such as computer vision and natural language processing. CL has shown powerful benefits in improving the generalization ability and accelerating the convergence of various models (Guo et al., 2018; Jiang et al., 2014; Platanios et al., 2019; Tay et al., 2019). At the same time, it is easy to use, since it is a flexible plug-and-play submodule independent of the original training algorithms.
There are two key points of CL, one is the scoring function and the other is the pacing function. The scoring function makes it possible to sort the training examples by difficulty, and present to the network the easier samples first. The pacing function determines the pace by which data is presented to the model. The main challenge is how to obtain an effective scoring function without additional labelling of the data.
We formulate two hypotheses about the scoring functions to determine the order of learning and conduct experiments to verify them.
Many studies have shown that deep models for source code are vulnerable to adversarial examples (Quiring et al., 2019; Ramakrishnan et al., 2020; Yefet et al., 2020). Slight perturbations to the input programs could cause the model to make false predictions. Therefore, it is natural for us to formulate the first hypothesis: the augmented data are more challenging to learn than the original data for general models. We design an experiment to verify this hypothesis directly, as shown in Algorithm 1. It gives the pseudocode for verifying the impact of code transformation by comparing the performance of the model on a range of training set variants. The training set variants are generated by iterating the transformation functions on the original training set (lines 4-7). After the model is trained on the original training set, we evaluate the model on these training set variants.
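The verification procedure described above can be sketched as follows. This is a minimal Python sketch rather than the paper's Algorithm 1: the names `build_variants`, `evaluate_on_variants`, `train_fn` and `eval_fn` are hypothetical, the transformations are cycled in a fixed order as an assumption, and a toy model that memorizes its training programs stands in for CodeBERT.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, int]  # (program source, label)


def build_variants(train_set: Sequence[Sample],
                   transforms: Sequence[Callable[[str], str]],
                   rounds: int) -> List[List[Sample]]:
    """Return [original, 1x transformed, 2x transformed, ...]."""
    variants = [list(train_set)]
    for k in range(rounds):
        # Assumption: transformations are applied one after another, cyclically.
        transform = transforms[k % len(transforms)]
        variants.append([(transform(code), label) for code, label in variants[-1]])
    return variants


def evaluate_on_variants(train_set, transforms, rounds, train_fn, eval_fn):
    """Train once on the original data, then score every training set variant."""
    model = train_fn(train_set)
    return [eval_fn(model, variant)
            for variant in build_variants(train_set, transforms, rounds)]


# Toy stand-ins (hypothetical): the "model" is a lookup table of seen programs.
def train_fn(data):
    return dict(data)


def eval_fn(model, data):
    return sum(model.get(code) == label for code, label in data) / len(data)


train = [("i++;", 0), ("return 0;", 1)]
transforms = [lambda c: c.replace("i++", "i += 1"),
              lambda c: c.replace("return 0;", "return (0);")]
scores = evaluate_on_variants(train, transforms, 2, train_fn, eval_fn)
print(scores)  # accuracy on the original set and on each successive variant
```

With this toy model the score drops with every additional transformation, which is the trend the hypothesis predicts; a real model would degrade more gradually.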
We apply Algorithm 1 to the state-of-the-art model CodeBERT with the benchmark dataset POJ104 (Mou et al., 2016) (described in Section 4.1). Fig. 2 shows the performance of CodeBERT on these training set variants. Since the model is trained and tested on the same dataset, it performs best on the original training set. The performance gets progressively worse as the number of transformations applied to the original dataset increases, which indicates that data augmentation increases the difficulty of the training set and experimentally supports our hypothesis.
As the augmented data are more difficult to learn, it is natural to let the model learn the original data first and then the augmented data from easy to hard. We detail our curriculum learning strategy based on this hypothesis in the next section.
The second hypothesis we propose is to solve the multiclass classification task. Image classification, text classification like news, and algorithm classification are all classical multiclass classification tasks. The task is quite difficult, and a common simplification is to split the multiclass classification task into easily solvable subtasks with fewer classes. Hence, we formulate the hypothesis that for the multiclass classification task, it is more effective to determine the learning order of the model from a class perspective. Based on this hypothesis, the optimization goal of the model gradually transitions from a classification problem with few classes to a classification of multiple classes during the entire training process. Intuitively, the task is much easier to solve under this setting compared to a straightforward solution. We next conduct an experiment to verify the hypothesis.
The difficulty of code data may be reflected in the length of the code, the use of rare tokens, the complexity of logic, etc. Although these heuristics are reasonable for people, they are not necessarily the case for models. Therefore, unlike the previous validation experiment that uses code augmentation techniques to distinguish the difficulty of the samples artificially, we let the model itself give an evaluation of the data as the difficulty scores, as shown in Algorithm 2.
The purpose of Algorithm 2 is to get the average difficulty score of each class on the training set. To get the difficulty score of each sample in the training set, we apply the leave-one-out strategy, i.e., when we compute the difficulty scores for one part of the samples, we train the model on all the other data (lines 4-9). Then we compute the average difficulty score of each class (lines 11-14).
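The class-level scoring described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the paper's Algorithm 2: the training set is split into `k` parts, each part is scored by a model trained on the remaining parts, and the per-sample scores are averaged per class. The names `class_difficulty`, `train_fn` and `score_fn` are hypothetical, the difficulty score is assumed to be the 0/1 prediction error, and a nearest-centroid classifier on scalar features stands in for the fine-tuned model.

```python
from collections import defaultdict


def class_difficulty(samples, k, train_fn, score_fn):
    """Average held-out difficulty score per class."""
    folds = [samples[i::k] for i in range(k)]
    totals, counts = defaultdict(float), defaultdict(int)
    for i, held_out in enumerate(folds):
        # Train on every part except the one being scored.
        rest = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_fn(rest)
        for x, y in held_out:
            totals[y] += score_fn(model, x, y)
            counts[y] += 1
    return {label: totals[label] / counts[label] for label in totals}


# Toy stand-ins (hypothetical): nearest-centroid classifier on one feature.
def train_fn(data):
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in data:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def score_fn(model, x, y):
    predicted = min(model, key=lambda label: abs(model[label] - x))
    return 0.0 if predicted == y else 1.0  # assumed difficulty: 0/1 error


data = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 1), (11, 1), (12, 1), (2, 1)]
difficulty = class_difficulty(data, 2, train_fn, score_fn)
easy_to_hard = sorted(difficulty, key=difficulty.get)
print(difficulty, easy_to_hard)
```

Sorting classes by the returned scores gives the easy-to-hard class order that a class-based curriculum would follow.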
To have a comparison with the learning order under the first hypothesis, we also apply Algorithm 2 to the state-of-the-art model CodeBERT with POJ104 dataset. POJ104 dataset contains many classes, and the task of POJ104 dataset is to predict the class for a given program. We apply Algorithm 2 to both the original training set and the augmented training set. We sort their average difficulty scores of each class according to the scores on the original training set, as shown in Fig. 3.
From Fig. 3 it can be found that the performance of the model on various classes varies greatly. The experimental performance reflects the difficulty of classes; the better the experimental performance, the lower the difficulty, and vice versa. Also, we find that the performance on the augmented dataset is almost always lower than that on the original dataset, further validating our previous hypothesis. At the same time, Fig. 3 shows that the performance of the model on the augmented dataset, although decreasing, is always distributed around the performance of the same class on the original dataset. Therefore, we conclude that for multiclass classification tasks organizing the data by class can yield data with more stable gradients than artificially differentiating the data by augmentation techniques. It motivates us to expose models to the easier classes first and then gradually transition to the harder classes.
3. Proposed Approach
In this section, we describe the details of our approach. Our method is built on the fine-tuning paradigm and adapts pre-trained models to downstream tasks. Given pre-trained models and datasets of downstream tasks, we exploit the potential of pre-trained models on these tasks by acting on the data only.
3.1. Approach Overview
Fig. 4 presents an overview of our approach. Our approach mainly consists of three components.
Augmentation for training data, which transforms given programs into semantically equivalent programs and builds an augmented dataset to make the training data more diverse.
Curriculum strategy, which organizes the augmented dataset into an ordered dataset in easy-to-hard order. The order is determined by scoring functions.
Test-time augmentation, which yields transformed versions of programs for prediction. The final results fuse the results on original programs with those on transformed programs of different transformation types.
3.2. Augmentation for Training Data
To help models learn code features that are invariant across semantically equivalent programs, we construct semantic-preserving transformations for code data. The lexical appearance and syntactic structure of a program differ before and after a transformation, but its semantics are identical.
Different languages call for different transformation techniques because of their specific characteristics. In this paper, we use the same transformation techniques for data in the same language, and these techniques do not rely on prior knowledge of tasks or datasets. There are two programming languages in our experiments. For C/C++, we modify the work of Quiring et al. (Quiring et al., 2019). For Java, we apply the SPAT tool (SantiagoMunz, 2021). We apply ten transformations for C/C++ and nine transformations for Java. The specific transformations are shown in Table 1. These techniques are grouped by the granularity of their changes. They change the control structure, API and declaration, respectively, to help models extract and learn the corresponding features, while ensuring that the semantics remain unchanged. Taking the transformations in Fig. 1 as an example, the first transformer is applied to the original program and converts one control structure into an equivalent one. This type of transformation enables the model to understand various control structures. In the next step, further transformers are applied. This type of transformation can also generate diverse and equivalent declaration statements by merging, splitting and swapping declaration statements, helping the model ignore the interference of syntactic forms and focus on semantics. In the last transformation, the output API and the operator "++" are replaced with an equivalent API and "+=", respectively. The API transformation exploits the fact that the same function can be implemented by different APIs and operators. These transformation techniques can also be combined to make the dataset more diverse.
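To make the notion of a semantic-preserving transformation concrete, the following is an illustrative analog in Python, not the Clang-based or SPAT-based transformers used in the paper. It uses the standard `ast` module (Python 3.9+ for `ast.unparse`) to rewrite `x += e` into `x = x + e`, mirroring the operator-level rewrite discussed above. The class and function names are hypothetical, and the rewrite is only assumed equivalent for immutable values such as numbers.

```python
import ast


class ExpandAugAssign(ast.NodeTransformer):
    """Rewrite `x op= e` into `x = x op e` for simple variable targets."""

    def visit_AugAssign(self, node):
        if not isinstance(node.target, ast.Name):
            return node  # leave attribute/subscript targets untouched
        read_target = ast.Name(id=node.target.id, ctx=ast.Load())
        value = ast.BinOp(left=read_target, op=node.op, right=node.value)
        new_node = ast.Assign(targets=[node.target], value=value)
        return ast.copy_location(new_node, node)


def expand_aug_assign(source: str) -> str:
    """Return source code with augmented assignments expanded."""
    tree = ExpandAugAssign().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


original = "total = 0\nfor i in range(4):\n    total += i\n"
transformed = expand_aug_assign(original)
print(transformed)

# Both versions compute the same value, although their token sequences differ.
before, after = {}, {}
exec(original, before)
exec(transformed, after)
print(before["total"] == after["total"])
```

The two programs differ lexically but produce the same result, which is the property the augmented training pairs rely on.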
3.3. Curriculum Strategy
The key challenge of curriculum learning is how to define easy/difficult examples. In this paper, we propose two difficulty scoring functions based on the hypotheses presented in Section 2.3.
Augmentation-based Curriculum Strategy
The previous section has introduced data augmentation techniques for code data, and it is cheap to generate diverse data through transformations. However, compared with original data, the augmented data can be regarded as perturbations or adversarial examples of original data (Ramakrishnan et al., 2020; Yefet et al., 2020), and they should be more difficult to learn as verified in Section 2.3.
Therefore, we design an augmentation-based curriculum strategy. We first train on only the original data, and then gradually increase the proportion of the augmented data, ensuring that the model is exposed to more data and the difficulty gradually increases during the training process.
In particular, when learning the augmented data we do not strictly follow the order in which programs are successively transformed, since we find that some programs have far more transformed variants than others, and multiple transformations could make the data unbalanced. Therefore, for each sample in the original training set we sample an equal number of augmented samples from its transformed variants; the data statistics are shown in Table 2. This method is easy to implement on general models, and we illustrate its effects in the following experiments.
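The augmentation-based strategy above can be sketched as follows. This is a minimal Python sketch under stated assumptions: `balanced_augmented_pool` and `curriculum_stages` are hypothetical names, the proportion of augmented data is assumed to grow linearly over the stages, and the paper's actual schedule is governed by its pacing function.

```python
import random


def balanced_augmented_pool(variants, per_sample, seed=0):
    """Draw the same number of augmented copies for every original sample."""
    rng = random.Random(seed)
    pool = []
    for original_id in sorted(variants):
        candidates = variants[original_id]
        pool.extend(rng.sample(candidates, min(per_sample, len(candidates))))
    rng.shuffle(pool)  # avoid grouping copies of the same original together
    return pool


def curriculum_stages(originals, pool, n_stages):
    """Stage 0 holds only original data; later stages add more augmented data."""
    stages = []
    for stage in range(n_stages + 1):
        cut = len(pool) * stage // n_stages  # assumed linear growth
        stages.append(list(originals) + pool[:cut])
    return stages


variants = {"a": ["a1", "a2", "a3", "a4"], "b": ["b1", "b2"]}
pool = balanced_augmented_pool(variants, per_sample=2)
stages = curriculum_stages(["a", "b"], pool, n_stages=2)
print([len(stage) for stage in stages])
```

Sampling per original program keeps programs with many variants from dominating the augmented pool.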
Class-based Curriculum Strategy
Especially for multiclass classification tasks, based on the hypothesis verified in Section 2.3, we propose a class-based curriculum strategy.
Specifically, the leave-one-out strategy is employed to obtain the difficulty scores on the entire training set, and then the average difficulty score of each class is calculated. Every sample in a class takes the average difficulty score of that class as its own difficulty score. In the training process, this setting allows the model to learn easier classes first and then move on to more difficult classes. The model therefore needs to extract and learn more features to deal with increasingly difficult tasks.
Once the scoring function is determined, we still need to define the pace at which we transition from easy samples to harder samples. Following Penha and Hauff (Penha and Hauff, 2020), when selecting and applying different pacing functions, we ensure that the model has a number of samples to learn from when training begins and is gradually exposed to more difficult samples until all samples are available. We implement a range of pacing functions following Penha and Hauff (Penha and Hauff, 2020) and illustrate their effects in Section 5.3.
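As an illustration of how a pacing function interacts with a difficulty-sorted training set, the following minimal Python sketch implements two common forms, linear and root pacing. The specific functions and hyperparameters used in the paper are not given in this section, so `delta` (the initial fraction of data) and `total_steps` (the step at which all data becomes available) are assumed names and values.

```python
import math


def linear_pacing(step, total_steps, delta):
    """Fraction of data available grows linearly from delta to 1."""
    return min(1.0, delta + (1.0 - delta) * step / total_steps)


def root_pacing(step, total_steps, delta):
    """Fraction grows quickly at first, then slows down as it approaches 1."""
    return min(1.0, math.sqrt(delta ** 2 + (1.0 - delta ** 2) * step / total_steps))


def available_samples(sorted_samples, fraction):
    """Return the easiest `fraction` of samples (input sorted easy to hard)."""
    count = max(1, math.ceil(fraction * len(sorted_samples)))
    return sorted_samples[:count]


samples_easy_to_hard = list(range(10))  # stand-in for difficulty-sorted data
for step in (0, 50, 100):
    fraction = root_pacing(step, total_steps=100, delta=0.25)
    print(step, len(available_samples(samples_easy_to_hard, fraction)))
```

Both functions start with a non-empty subset, so the model always has samples to learn from at the first iteration, and both reach the full dataset at `total_steps`.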
3.4. Test-Time Augmentation
To align transformation techniques applied on the training set, we also apply augmentations on the test set. These are the same as the augmentation techniques applied on the training set. In this way, the features learned from semantically equivalent transformations during the training process would be fully utilized during evaluation.
To further eliminate overconfident incorrect predictions due to overfitting (Wang et al., 2019), for each sample in the test set we sample three augmented copies from its transformed candidates. Sampling more samples for prediction may make the results more robust, but would increase the prediction time proportionally. As shown in the right part of Fig. 4, the final experimental performance is the sum of results on the original test set and results in the augmented perspective, which are the average of the results on augmented copies. As a result, incorrect prediction on a single test case by the model is corrected by combining multiple perspectives to make a final prediction.
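The fusion described above can be sketched as follows. This is a minimal Python sketch in which `tta_predict` and `predict_fn` are hypothetical names, predictions are assumed to be per-class score vectors, and a lookup table stands in for the fine-tuned model.

```python
import random


def tta_predict(predict_fn, original, candidates, n_copies=3, seed=0):
    """Sum the original prediction with the mean prediction over augmented copies."""
    rng = random.Random(seed)
    copies = rng.sample(list(candidates), min(n_copies, len(candidates)))
    fused = list(predict_fn(original))
    if copies:
        predictions = [predict_fn(copy) for copy in copies]
        for i in range(len(fused)):
            # Augmented perspective: average score of class i over the copies.
            fused[i] += sum(p[i] for p in predictions) / len(predictions)
    return fused


# Toy stand-in (hypothetical): fixed scores for one program and its variants.
toy_scores = {
    "orig": [0.4, 0.6],
    "v1": [0.9, 0.1],
    "v2": [0.6, 0.4],
    "v3": [0.9, 0.1],
}
fused = tta_predict(toy_scores.__getitem__, "orig", ["v1", "v2", "v3"])
predicted_class = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted_class)
```

In this toy case the original program alone would be assigned class 1, while the fused scores favor class 0, showing how the augmented perspective can overturn a single incorrect prediction.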
In this section, we conduct experiments to verify whether our method is effective in different tasks, including algorithm classification, code clone detection and code search tasks.
|Dataset||Original training set||Augmented training set|
4.1. Data preparation
In this subsection, we present benchmark datasets for three tasks from CodeXGLUE (Lu et al., 2021): POJ104, BigCloneBench (Svajlenko et al., 2014) and CodeSearchNet (Husain et al., 2019) and describe how to simply adapt data of various tasks to our approach.
The POJ104 dataset is collected from an online judge platform; it consists of 104 program classes and includes 500 student-written C/C++ programs for each class. The task on the POJ104 dataset is to retrieve other programs that solve the same problem as a given program. We split the dataset according to labels. We use 64 classes of programs for training, 24 classes of programs for testing, and 16 classes of programs for validation. For data augmentation, to successfully compile the programs, “#include” statements are prepended to the programs. This process does not introduce differences since the added statements are the same for all programs. As some programs cannot be compiled, we further use regular expressions to correct programs with simple grammatical errors, and remove the rest with serious grammatical and semantic problems. A total of 1710 programs were removed, accounting for about 3% (1710/52000). To guarantee the fairness of the experiments, we also evaluate the baseline models on both the original dataset and the normalized dataset. For test-time augmentation, the results of the original and augmented versions of the same program are merged to make a prediction.
The BigCloneBench dataset contains 25,000 Java projects, covering 10 functionalities and including 6,000,000 true clone pairs and 260,000 false clone pairs. The dataset provided by Wang et al. (Wang et al., 2020) is filtered by discarding code fragments without any tagged true or false clone pairs, leaving 9,134 Java code fragments. The dataset includes 901,028/415,416/415,416 pairs for training, validation and testing, respectively. This dataset has been widely used for the code clone detection task. For code augmentation, since the data is in the form of code pairs, we replace any original program in clone pairs with augmented programs to form new pairs. For test-time augmentation, all versions of a code pair are considered to determine whether it is a clone pair.
CodeSearchNet contains about 6 million functions from open-source code spanning six programming languages. In this paper, we use the Java portion of the dataset. Given a natural language query as the input, the task is to find the most semantically related code from a collection of candidate programs. Following the state-of-the-art model GraphCodeBERT (Guo et al., 2021), we expand the 1,000 query candidates to the whole code corpus, which is closer to the real-life scenario. The answer to each query is retrieved from the whole validation and testing code corpus instead of from 1,000 candidate programs. For code augmentation in the training set, since the data are pairs of natural language queries and programming language fragments, we replace original programs with augmented programs and form new pairs with their natural language queries. Test-time augmentation differs from the previous two tasks. Since the test set is the set of natural language queries, we apply code augmentation techniques to the codebase corresponding to these queries, and build augmented codebases of the same size. The final result is the average of the results on the original codebase and the augmented codebases.
The original and augmented data statistics of the above tasks are shown in Table 2, and the augmented datasets contain the original data. We release all data for verification and future development. Theoretically, more augmented data can be obtained; however, training on more data would bring larger time overhead. To trade off experimental performance against time overhead, we use a limited amount of augmented data, and we apply a curriculum learning strategy in which the model starts training on a smaller amount of data, further reducing the overhead.
4.2. Experimental Setups
To illustrate the effectiveness of our method on code-related tasks, we build our approach on the code pre-trained models CodeBERT and GraphCodeBERT. To illustrate the applicability of our method, we also evaluate it on the natural language pre-trained model RoBERTa (Liu et al., 2019), which has not been exposed to code at all. In replication experiments, we follow the description in their original papers and released code. For parameter settings, to ensure fairness, we keep all parameters consistent with their released code, including random seeds, except for the warmup step and epoch. The warmup step parameter adapts to the increase of the dataset, and its value is adjusted from the original dataset size to the augmented dataset size. Also, due to the increase in data size and the progressive curriculum learning, we increase the epoch and set it to 20, 10, and 15 on POJ104, BigCloneBench, and CodeSearchNet, respectively. We replicate CodeBERT and GraphCodeBERT with the same parameter settings. The results reported in the original papers and our replicated results are not much different, and we present all the results. For data augmentation, we implement augmentation techniques on top of Clang(7) for C/C++. With respect to the pacing function, the hyperparameters are set according to Penha and Hauff (Penha and Hauff, 2020).
4.3. Algorithm Classification
Metrics and Baselines
We use precision and MAP as the evaluation metrics of the algorithm classification task. Precision is defined as the average precision score and MAP is the rank-based mean of the average precision score, each of which is evaluated for retrieving the most similar samples given a query. We apply RoBERTa and the state-of-the-art model CodeBERT as baseline methods. RoBERTa is a model pre-trained on natural language. CodeBERT is a model pre-trained on code data. It combines masked language modeling (Devlin et al., 2019) with a replaced token detection objective (Clark et al., 2020) to pre-train a Transformer (Vaswani et al., 2017) encoder.
|Model|Precision (%)|MAP (%)|
|---|---|---|
|RoBERTa + DA + CL|88.15|86.55|
|CodeBERT + DA + CL|93.63|92.91|
We compare these pre-trained models with and without our method (DA + CL). Table 3 summarizes the results. For baseline methods, all experimental results are evaluated on our normalized dataset, except for the MAP results in parentheses. These results are reported in the original papers of the baseline methods, and MAP is their only metric for the algorithm classification task. The natural language pre-trained model RoBERTa fine-tuned with our method achieves 88.15% on precision and 86.55% on MAP. Our method improves its performance noticeably, by 5.33% on precision and 6.31% on MAP, and by 9.88% on MAP compared to the result reported in the original paper. The code pre-trained model CodeBERT fine-tuned with our method achieves 93.63% on precision and 92.91% on MAP. Our method yields substantial improvements of 8.35% on precision and 10.15% on MAP, and of 10.24% on MAP compared to the originally reported result. Notably, with our method, the RoBERTa model, without being pre-trained on code data, outperforms the existing state-of-the-art model CodeBERT fine-tuned on this task by 3.79% MAP.
4.4. Code Clone Detection
Metrics and Baselines
We use precision, recall and F1 score as the evaluation metrics of the code clone detection task. In our experiments, we compare a range of models including the state-of-the-art model GraphCodeBERT. GraphCodeBERT is a pre-trained model for code which improves CodeBERT by modeling the data flow edges between code tokens. CDLH (Wei and Li, 2017) learns representations of code fragments through AST-based LSTM. ASTNN (Zhang et al., 2019) encodes AST subtrees for statements and feeds the encodings of all statement trees into an RNN to learn representation for a program. FA-AST-GMN (Wang et al., 2020) leverages explicit control and data flow information and uses GNNs over a flow-augmented AST to learn representation for programs. TBCCD (Yu et al., 2019) proposes a tree convolution-based method to detect semantic clone, that is, using AST to capture structural information and obtain lexical information from the position-aware character embedding.
|Model||Precision||Recall||F1|
|RoBERTa (10% data) + DA + CL||0.973||0.957||0.965|
|CodeBERT (10% data) + DA + CL||0.972||0.972||0.972|
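Since F1 is the harmonic mean of precision and recall, the three reported scores can be sanity-checked against each other. The following is a minimal Python sketch, not the authors' evaluation script; the function names `precision_recall_f1` and `f1_from` are hypothetical, and clone pairs are assumed to be labeled 1 (clone) or 0 (non-clone).

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_from(precision, recall)


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For instance, `f1_from(0.973, 0.957)` rounds to 0.965, which agrees with the F1 score listed for RoBERTa with DA + CL.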
Table 4 shows the results for code clone detection. Our reproduced results are mostly consistent with the results reported in the original papers, except for the F1 score of 0.964 for RoBERTa, which is higher than the original result of 0.949. We implement our method on RoBERTa and CodeBERT. Experiments show that models with our method consistently perform better than the original models. Notably, with our method, RoBERTa performs comparably to CodeBERT, and CodeBERT outperforms the state-of-the-art model GraphCodeBERT. More importantly, following the original settings of CodeBERT, CodeBERT is trained on only a randomly sampled 10% of the data, unlike GraphCodeBERT. Even though we expand the data using data augmentation in the experiment for CodeBERT, the data used by CodeBERT are still far fewer than the data used by GraphCodeBERT.
4.5. Code Search
Metrics and Baselines
For the code search task, we use mean reciprocal rank (MRR) as the evaluation metric. MRR is the average of the reciprocal ranks of the results over a set of queries. The reciprocal rank of a query is the inverse of the rank of the first hit result.
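The MRR definition above can be illustrated with a short Python sketch. The function name `mean_reciprocal_rank` and the convention that a query with no hit contributes 0 are assumptions made for illustration, not details taken from the paper.

```python
from typing import Hashable, Sequence


def mean_reciprocal_rank(
    ranked_results: Sequence[Sequence[Hashable]],
    targets: Sequence[Hashable],
) -> float:
    """Average of 1/rank of the first correct result over all queries.

    ranked_results[i] is the ranked candidate list returned for query i,
    targets[i] is the correct answer for query i. A query whose target
    does not appear in its list contributes 0 (an assumption).
    """
    total = 0.0
    for results, target in zip(ranked_results, targets):
        for rank, item in enumerate(results, start=1):
            if item == target:
                total += 1.0 / rank
                break
    return total / len(ranked_results)
```

With three queries whose correct answers appear at rank 1, at rank 3, and not at all, the reciprocal ranks are 1, 1/3 and 0, so the MRR is 4/9.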
Table 5 shows the results of different approaches on the CodeSearchNet corpus. The first four rows are reported by Husain et al. (Husain et al., 2019). NBOW, CNN, BIRNN and SELFATT represent neural bag-of-words (Sheikh et al., 2016), convolutional neural network (Kim, 2014), bidirectional GRU-based recurrent neural network (Cho et al., 2014), and multi-head attention (Vaswani et al., 2017), respectively.
|Model||MRR|
|RoBERTa + DA + CL||0.635|
|CodeBERT + DA + CL||0.697|
|GraphCodeBERT + DA + CL||0.720|
Table 5 shows the results of the different approaches for code search. RoBERTa (code) is pre-trained on programs from CodeSearchNet with masked language modeling while maintaining the RoBERTa architecture. Our reproduced result of 0.696 for GraphCodeBERT differs slightly from the originally reported result of 0.691. We implement our method on RoBERTa, CodeBERT and the state-of-the-art model GraphCodeBERT for code search. The results show that the natural language pre-trained model RoBERTa with our method outperforms RoBERTa (code), which has the same model architecture but is pre-trained on code data. CodeBERT with our method outperforms the original state-of-the-art model GraphCodeBERT. The performance of GraphCodeBERT with our method reaches 0.720 MRR, surpassing the original result of 0.691 MRR.
On the above tasks and their benchmark datasets, our method substantially improves the performance of a range of pre-trained models, achieving state-of-the-art performance on all tasks. With the help of our approach, a natural language pre-trained model with no exposure to code at all is able to match or even surpass existing code pre-trained models fine-tuned in the usual way on the corresponding tasks. In the code search task, RoBERTa pre-trained on natural language and fine-tuned with our method surpasses the same architecture pre-trained on code data and fine-tuned with the general method. These results all illustrate the strong bridging role of our method between pre-trained models and code-related downstream tasks, which it achieves by introducing semantic knowledge for downstream tasks into pre-trained models.
For code-related tasks, applying our approach to a pre-trained model at the fine-tuning stage at a relatively small cost is preferable to pre-training a more complicated model from scratch with huge resources. This illustrates the superiority of our method, but it does not negate the work on code pre-trained models. In fact, our approach achieves better results when applied to a superior pre-trained model. Research on pre-trained models for source code probably still has much work to do in terms of data diversity and integration with downstream tasks.
This section analyzes the effects of different parameters on the performance of tasks in our experiment.
5.1. Ablation Study
This section investigates how data augmentation and curriculum learning each affect the performance of the models. The following subsections show these results for the algorithm classification, code clone detection and code search tasks.
|Model||Precision (%)||MAP (%)|
|CodeBERT + DA + CL||93.63||92.91|
For the algorithm classification task, we conduct experiments without augmentation on the training set (DA-Training), without test-time augmentation, or without curriculum learning. The results are shown in Table 6. The first row shows the results of the baseline model. The second row presents the results of the baseline model with our full method. The third row removes augmentation on the training set. The fourth row presents the results of removing test-time augmentation. The results of removing the curriculum learning strategy are shown in the last row. As seen from the results, removing any of the components leads to a drop in model performance, and the removal of test-time augmentation leads to a significant performance degradation. This indicates that all three components are necessary to improve performance, and that test-time augmentation contributes the most to the improvements. We believe that for clustering tasks similar to algorithm classification, integrating multiple perspectives through data augmentation during testing could be a huge boost to model performance.
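The idea of integrating multiple views of one program at test time can be sketched as follows. This is a minimal illustration only: the aggregation rule (averaging class probabilities over the original program and its augmented copies) is an assumption, since this section does not spell out how the views are combined, and `tta_predict`, `toy_model` and the toy probabilities are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def tta_predict(
    predict_proba: Callable[[str], Sequence[float]],
    original: str,
    augmented_copies: Sequence[str],
) -> Tuple[int, List[float]]:
    """Average class probabilities over the original and its augmented views."""
    views = [original, *augmented_copies]
    all_probs = [predict_proba(view) for view in views]
    n_classes = len(all_probs[0])
    averaged = [
        sum(probs[c] for probs in all_probs) / len(all_probs)
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: averaged[c])
    return label, averaged


# Hypothetical stand-in for a fine-tuned model: maps a program to class probabilities.
_TOY_OUTPUTS = {
    "orig": [0.6, 0.4],  # the original view alone favors class 0
    "aug1": [0.2, 0.8],
    "aug2": [0.3, 0.7],
}


def toy_model(code: str) -> Sequence[float]:
    return _TOY_OUTPUTS[code]
```

In this toy setting the original view alone would predict class 0, while `tta_predict(toy_model, "orig", ["aug1", "aug2"])` selects class 1, which shows how semantically equivalent views can outvote a single prediction.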
|Model||Precision||Recall||F1|
|CodeBERT + DA + CL||0.972||0.972||0.972|
|w/o DA-Training + CL||0.964||0.965||0.964|
Code Clone Detection
For the code clone detection task, we also conduct experiments without augmentation on the training set, without test-time augmentation, or without curriculum learning. Unlike algorithm classification, we apply augmentation-based curriculum learning for the code clone detection task. Removing augmentation on the training set therefore also disables the CL component, so that only the test-time augmentation component remains active. The experimental results in Table 7 show that the combination of augmentation on the training set and the CL component yields the largest performance improvement, while test-time augmentation yields no significant improvement, although the model can still benefit from it.
|Model||MRR|
|GraphCodeBERT + DA + CL||0.720|
|w/o DA-Training + CL||0.710|
With the same ablation experimental setups as for the code clone detection task, we conduct experiments on the code search task. As shown in Table 8, we conclude that all three components are necessary for the improvements. The last row shows the result using only test-time augmentation, which is able to significantly exceed the original state-of-the-art performance without training with additional augmentation data. We speculate that test-time augmentation is able to combine multiple augmentation copies in the code retrieval process to make judgments and eliminate overconfident incorrect predictions on the original test set. The penultimate row shows the experimental result of removing CL component. In other words, it is obtained by the combination of augmentation on the training set and test-time augmentation acting on the model. Compared to the result of applying test-time augmentation component only in the last row, we find that more augmented data used for training may result in negative gains. One possible reason is that the augmented data introduces more noise, causing the model to choose from more candidates for the same query during training. These results further illustrate the necessity of curriculum learning on augmented data.
5.2. Effects of Augmentation Type
Since this paper considers multiple augmentation techniques, in this section we explore the effects of augmentation techniques at different granularities on the experimental results. We build transformed datasets of the same size using augmentation techniques of different granularities and train CodeBERT separately on these datasets for the algorithm classification task. Results are shown in Table 9. The first row shows the results using all augmentation techniques of the three granularities, while the second to fourth rows show the results without the augmentation techniques for the declaration, API, or control structure granularity, respectively. The results show that omitting the augmentation techniques of declaration or API granularity leads to a decrease in performance, while omitting the augmentation techniques of control structure granularity leads to an increase. This indicates that declaration and API augmentation contribute more to the improvements, whereas control structure augmentation introduces more noise than benefit. We speculate that changing the control structure has a greater impact on token order and context than the other two granularities of augmentation techniques, and the pre-trained models we use are based on masked language modeling and are therefore context sensitive. This makes it more difficult for the models to learn the knowledge and features introduced by changing the control structure. This finding also encourages code pre-trained models to further exploit the structural information of source code in order to better understand program semantics.
5.3. Effects of Pacing Function
To understand how the model is affected by the pace at which we move from easy to hard examples, we evaluate the effects of different pacing functions on the experimental results, as shown in Table 10. We conduct experiments on the POJ104 dataset for the algorithm classification task. The learning order is determined by the scoring function described in Section 3.3. The baseline model CodeBERT is trained in a random order, and the Anti method orders training samples from hard to easy. The other methods learn training samples from easy to hard, with the difference that at each epoch a different proportion of the training data is fed to the model, as determined by their functions. We briefly introduce the different pacing functions; the details are described in Penha et al. (Penha and Hauff, 2020). The linear function linearly increases the percentage of training data input to the model. The step function divides the training data into several groups, and after a fixed number of epochs a group of training samples is added for model training. The root and geometric progression functions correspond to two extreme cases. The root function feeds the model with a large number of easy samples and then slowly increases the proportion of hard samples, while the geometric progression function does the opposite. In the root function, the degree of the root is the hyperparameter, and the larger its value, the more training data are fed to the model at the beginning. All these functions are fed with the same training data at the final stage of training.
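The pacing behaviors described above can be sketched in Python. No formulas are given in this section, so the functional forms below are assumptions modeled on pacing functions commonly used in curriculum learning; the names (`linear_pacing`, `root_pacing`, `geometric_pacing`, `step_pacing`, `curriculum_subset`), the initial fraction `c0 = 0.33` and the group count are all hypothetical. Each function returns the fraction of the difficulty-sorted training set that is available at training step `t`, with the full set available from step `total` onward.

```python
import math
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def linear_pacing(t: float, total: float, c0: float = 0.33) -> float:
    """Fraction grows linearly from c0 to 1."""
    return min(1.0, c0 + (1.0 - c0) * t / total)


def root_pacing(t: float, total: float, c0: float = 0.33, degree: int = 2) -> float:
    """Fraction grows quickly at first; a larger degree exposes more data early."""
    base = c0 ** degree
    return min(1.0, (base + (1.0 - base) * t / total) ** (1.0 / degree))


def geometric_pacing(t: float, total: float, c0: float = 0.33) -> float:
    """Fraction grows slowly at first and quickly near the end."""
    return min(1.0, c0 ** (1.0 - t / total))


def step_pacing(t: float, total: float, n_groups: int = 3) -> float:
    """Data is split into n_groups; one more group is added after each interval."""
    group_len = total / n_groups
    groups_seen = int(t // group_len) + 1
    return min(1.0, groups_seen / n_groups)


def curriculum_subset(sorted_samples: Sequence[T], fraction: float) -> List[T]:
    """Take the easiest `fraction` of samples (sorted from easy to hard)."""
    k = max(1, math.ceil(fraction * len(sorted_samples)))
    return list(sorted_samples[:k])
```

Under these assumed forms, halfway through the schedule the geometric function exposes the least data, the linear function more, and the root functions the most, with a higher root degree exposing more data earlier. All of them reach the full training set by the end of the schedule.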
In Table 10, we can see that feeding data from easy to hard yields a certain performance improvement, while feeding training samples from hard to easy performs significantly worse than the baseline trained in a random order. These results illustrate the effectiveness of our curriculum learning strategy and scoring functions. Comparison of different pacing functions shows that and functions achieve similar results as function. The functions obviously outperform the function, which is consistent with the findings of Sohrmann et al. (Sohrmann et al., 2020) and Penha et al. (Penha and Hauff, 2020). The reasons are that the root function gives the model more time to learn from harder instances and is better than no CL in terms of statistical significance. In our experiments, we used function for algorithm classification task, and since we did not perform ablation study on the datasets of the other two tasks, we use function by default. The performance on these two tasks could probably be further improved with different pacing functions, and we leave this for future work.
6. Related Work
6.1. Data Augmentation
Data augmentation aims to increase data diversity, and thus the generalization ability of the model, through various transformation techniques. This approach is widely used in the computer vision domain (Wei et al., 2019; Shorten and Khoshgoftaar, 2019; Zhong et al., 2020). In recent years, researchers have applied data augmentation to code data as well (Quiring et al., 2019; Ramakrishnan et al., 2020; Zhang et al., 2020; Yefet et al., 2020; Rabin et al., 2020). A series of studies are motivated by the fact that existing models are vulnerable to adversarial examples, and they design methods to expose the vulnerability of models and improve their robustness. Unlike these methods, our aim is to make the models more generalizable and perform better on real data. Jain et al. (Jain et al., 2020) improve accuracy on code summarization and type inference tasks based on equivalent data transformations and unsupervised auxiliary tasks. Nghi et al. (Bui et al., 2021) propose a self-supervised contrastive learning framework for code retrieval and code summarization tasks. Our aim is similar to these studies, but we do not need to design an objective function or model architecture. Without the need for complicated model design, our approach accomplishes the same goal by acting on the data only. We simply augment the data and feed the augmented data into the model in an easy-to-hard manner. Therefore, our lightweight method can be easily applied on top of existing models and to various downstream tasks.
6.2. Curriculum Learning
Learning educational material in order from easy to difficult is very common in the human learning process. Inspired by cognitive science (Rohde and Plaut, 1999), researchers have found that model training can also benefit from a similar curriculum learning setting. Since then, CL has been successfully applied to image classification (Gong et al., 2016; Hacohen and Weinshall, 2019), machine translation (Kocmi and Bojar, 2017; Platanios et al., 2019; Zhang et al., 2018), answer generation (Liu et al., 2018) and information retrieval (Penha and Hauff, 2020).
The core of CL lies in the design of the scoring function, that is, how to define easy and hard samples. A straightforward approach is to study the data to create specific heuristic rules. For example, Bengio et al. (Bengio et al., 2009) use images containing less varied shapes as easy examples to be learned first. Tay et al. (Tay et al., 2019) use paragraph length as an evaluation criterion for difficulty in the question answering task. However, these heuristics are highly dependent on the task dataset and cannot be generalized to general tasks. Guo et al. (Guo et al., 2018) examine the examples in their feature space and define difficulty by the distribution density, which successfully distinguishes noisy images. Xu et al. (Xu et al., 2020) distinguish easy examples from difficult ones on natural language understanding tasks by reviewing the training set in a crossed way. In this paper, similar to Xu et al. (Xu et al., 2020), we also utilize cross validation to measure data difficulty by the model itself, but we also take the class distribution into consideration. We intuitively solve the multiclass classification problem from a class perspective by first transforming it into a classification of fewer easy classes and then gradually increasing the number of difficult classes. At the same time, we combine curriculum learning and data augmentation to overcome the problem that augmented data is more difficult to learn. We first learn the original data, then gradually transition to augmented data, and experimentally illustrate and verify the effectiveness of the design.
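The augmentation-based curriculum described above (original data first, augmented data later) can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the `Sample` class, its `is_augmented` flag, and the helpers `curriculum_order` and `take_prefix` are hypothetical, and difficulty is reduced to a binary original-versus-augmented score.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    code: str
    label: int
    is_augmented: bool  # True if produced by a code transformation


def curriculum_order(samples: Sequence[Sample]) -> List[Sample]:
    """Place original samples before augmented ones.

    sorted() is stable, so the relative order within each group is kept.
    """
    return sorted(samples, key=lambda s: s.is_augmented)


def take_prefix(ordered: Sequence[Sample], fraction: float) -> List[Sample]:
    """Return the easiest `fraction` of the ordered data for the current stage."""
    k = min(len(ordered), max(1, math.ceil(fraction * len(ordered))))
    return list(ordered[:k])
```

Early training stages would call `take_prefix` with a small fraction so that only original programs are seen, and later stages would raise the fraction toward 1.0 so that the augmented programs are gradually mixed in.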
7. Threats to Validity
There are several threats to validity of our method.
Due to the use of test-time augmentation in our method, this component cannot be easily applied to code generation tasks. Augmentation on the training set and curriculum learning are still applicable, e.g., Jain et al. (Jain et al., 2020) have achieved good performance on the code summarization task using code augmentation.
The transformation techniques we use are not representative of all possible transformations. Due to the characteristics of various tasks and datasets, some transformations may lead to large improvements and some may bring no improvement. Therefore, we release the datasets to support replication and reduce experimental bias. Our approach is designed to be a lightweight component that generalizes to multiple downstream tasks. For specific downstream tasks, new augmentation techniques can also be applied to optimize the performance.
Due to limited computing resources, we did not explore the performance of our approach for the code clone detection task on GraphCodeBERT or conduct ablation studies on all three tasks regarding the pacing function and transformation type. There is likely room for improvement and for interesting conclusions to be explored. We expect that better results could be obtained by searching for more suitable pacing functions and transformation types for the other two tasks. We leave this for future work.
In this paper, we focus on bridging pre-trained models and code-related downstream tasks and propose a lightweight approach on the fine-tuning paradigm, which is easy to implement on top of various models. We build our approach on code pre-trained models of CodeBERT and GraphCodeBERT, and these models substantially outperform original models and achieve the state-of-the-art performance on algorithm classification, code clone detection and code search. Moreover, we apply our method to natural language pre-trained model RoBERTa and it achieves comparable or better performance than existing state-of-the-art code pre-trained models fine-tuned on these tasks. This finding reveals that there is still much room for improvement in existing pre-trained models for source code understanding.
This paper focuses on code discriminative tasks. It is more challenging to apply our approach to code generation tasks. However, generation tasks are data-hungry and may require more diverse data for learning, such as code generation where multiple code candidates are expected to be generated. In the future, it would be interesting to combine our approach and prompt-based learning (Liu et al., 2021) to further exploit the potential of generative pre-trained models on code generation tasks.
- Unified pre-training for program understanding and generation. In NAACL, Cited by: §1.
- Compilers: principles, techniques, and tools (2nd edition). Cited by: §1.
- Curriculum learning. In ICML ’09, Cited by: §2.2, §6.2.
- Language models are few-shot learners. ArXiv abs/2005.14165. Cited by: §1.
- Self-supervised contrastive learning for code retrieval and summarization via semantic-preserving transformations. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. Cited by: §6.1.
- Learning phrase representations using rnn encoder–decoder for statistical machine translation. ArXiv abs/1406.1078. Cited by: §4.5.
- Clang: a C language family frontend for LLVM. Cited by: §4.2.
- ELECTRA: pre-training text encoders as discriminators rather than generators. ArXiv abs/2003.10555. Cited by: §4.3.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §1, §4.3.
- Learning and development in neural networks: the importance of starting small. Cognition 48, pp. 71–99. Cited by: §1.
- CodeBERT: a pre-trained model for programming and natural languages. ArXiv abs/2002.08155. Cited by: §1.
- Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing 25, pp. 3249–3260. Cited by: §6.2.
- GraphCodeBERT: pre-training code representations with data flow. ArXiv abs/2009.08366. Cited by: §1, §4.1.
- CurriculumNet: weakly supervised learning from large-scale web images. ArXiv abs/1808.01097. Cited by: §2.2, §6.2.
- On the power of curriculum learning in training deep networks. ArXiv abs/1904.03626. Cited by: §6.2.
- CodeSearchNet challenge: evaluating the state of semantic code search. ArXiv abs/1909.09436. Cited by: §4.1, §4.5.
- Contrastive code representation learning. ArXiv abs/2007.04973. Cited by: §6.1, 1st item.
- Easy samples first: self-paced reranking for zero-example multimedia search. Proceedings of the 22nd ACM international conference on Multimedia. Cited by: §2.2.
- Convolutional neural networks for sentence classification. In EMNLP, Cited by: §4.5.
- Curriculum learning and minibatch bucketing in neural machine translation. In RANLP, Cited by: §6.2.
- ImageNet classification with deep convolutional neural networks. Communications of the ACM 60, pp. 84 – 90. Cited by: §1.
- Flexible shaping: how learning in small steps helps. Cognition 110, pp. 380–394. Cited by: §1.
- BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. ArXiv abs/1910.13461. Cited by: §1.
- Curriculum learning for natural answer generation. In IJCAI, Cited by: §1, §6.2.
- Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ArXiv abs/2107.13586. Cited by: §8.
- RoBERTa: a robustly optimized bert pretraining approach. ArXiv abs/1907.11692. Cited by: §1, §4.2.
- CodeXGLUE: a machine learning benchmark dataset for code understanding and generation. ArXiv abs/2102.04664. Cited by: §4.1.
- Studying the usage of text-to-text transfer transformer to support code-related tasks. 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pp. 336–347. Cited by: §1.
- Teacher–student curriculum learning. IEEE Transactions on Neural Networks and Learning Systems 31, pp. 3732–3740. Cited by: §1.
- Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. ArXiv abs/1902.01007. Cited by: §1.
- Test-time augmentation for deep learning-based cell segmentation on microscopy images. Scientific Reports 10. Cited by: §2.1.
- Convolutional neural networks over tree structures for programming language processing. In AAAI, Cited by: §2.3.
- Curriculum learning strategies for ir. Advances in Information Retrieval 12035, pp. 699 – 713. Cited by: §3.3, §4.2, §5.3, §5.3, §6.2.
- Competence-based curriculum learning for neural machine translation. In NAACL-HLT, Cited by: §1, §2.2, §6.2.
- Misleading authorship attribution of source code using adversarial learning. In USENIX Security Symposium, Cited by: §1, §2.3, §3.2, §6.1.
- On the generalizability of neural program analyzers with respect to semantic-preserving program transformations. ArXiv abs/2008.01566. Cited by: §1, §6.1.
- Exploring the limits of transfer learning with a unified text-to-text transformer. ArXiv abs/1910.10683. Cited by: §1.
- Semantic robustness of models of source code. ArXiv abs/2002.03043. Cited by: §1, §1, §2.3, §3.3, §6.1.
- Language acquisition in the absence of explicit negative evidence: how important is starting small?. Cognition 72, pp. 67–109. Cited by: §6.2.
- Semantic preserving auto transformation. Cited by: §3.2.
- Learning word importance with the neural bag-of-words model. In Rep4NLP@ACL, Cited by: §4.5.
- A survey on image data augmentation for deep learning. Journal of Big Data 6, pp. 1–48. Cited by: §6.1.
- Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: §1, §2.1.
- Nationwide introduction of a new competency framework for undergraduate medical curricula: a collaborative approach.. Swiss medical weekly 150, pp. w20201. Cited by: §5.3.
- Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, pp. 1929–1958. Cited by: §1.
- Towards a big data curated benchmark of inter-project code clones. 2014 IEEE International Conference on Software Maintenance and Evolution, pp. 476–480. Cited by: §4.1.
- Simple and effective curriculum pointer-generator networks for reading comprehension over long narratives. ArXiv abs/1905.10847. Cited by: §2.2, §6.2.
- Attention is all you need. ArXiv abs/1706.03762. Cited by: §4.3, §4.5.
- Aleatoric uncertainty estimation with test-time augmentation for medical image segmentation with convolutional neural networks. Neurocomputing 335, pp. 34 – 45. Cited by: §3.4.
- Detecting code clones with graph neural network and flow-augmented abstract syntax tree. 2020 IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), pp. 261–271. Cited by: §4.1, §4.4.
- Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In IJCAI, Cited by: §4.4.
- Generative image translation for data augmentation in colorectal histopathology images. Proceedings of machine learning research 116, pp. 10–24. Cited by: §6.1.
- Curriculum learning for natural language understanding. In ACL, Cited by: §1, §6.2.
- Adversarial examples for models of code. Proceedings of the ACM on Programming Languages 4, pp. 1 – 30. Cited by: §1, §2.3, §3.3, §6.1.
- Neural detection of semantic code clones via tree-based convolution. 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pp. 70–80. Cited by: §4.4.
- Generating adversarial examples for holding robustness of source code processing models. In AAAI, Cited by: §6.1.
- A novel neural source code representation based on abstract syntax tree. 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 783–794. Cited by: §4.4.
- An empirical exploration of curriculum learning for neural machine translation. ArXiv abs/1811.00739. Cited by: §1, §6.2.
- Random erasing data augmentation. In AAAI, Cited by: §6.1.