Code understanding is an increasingly important application of AI, with over 100 papers targeting the area in the last year alone (https://ml4code.github.io/papers.html). Much research in this area has focused on understanding code from abstract representations of the program such as Abstract Syntax Trees (ASTs) and program flow. However, there has been little emphasis on utilizing important semantics about code buried in textual artifacts, such as documentation or forum discussions. Extracting such information can significantly enrich code representations. As an example, Figure 1 shows a program where the classes GLM and SGDClassifier are being used. If one could enrich the representation of the two classes in the code with their key features from text, we would understand that both represent linear models, and hence both code snippets perform similar functions.
To enrich code with textual information, we need to be able to summarize textual information about classes and functions into vector representations. Pre-trained language models are an obvious choice, but we currently do not know how applicable they are to text about code, given the specialized language of the programming domain. We need a set of code-related downstream tasks to evaluate these models, just as GLUE [wang-etal-2018-glue] and SuperGLUE [SuperGLUE] have been used extensively to further language model development in the natural language understanding domain. CodeXGLUE [CodeXGLUE] provides a suite of tasks, but only a single task in it is related to textual code artifacts: the translation of documentation about code from one natural language to another. To our knowledge, no other tasks focus on relations between textual artifacts about code. This paper attempts to fill this gap.
We have three goals in this paper: (a) design a suite of tasks we refer to as BLANCA (Benchmarks for LANguage models on Coding Artifacts) that can be used to train language models about the semantics of code, (b) evaluate whether existing models, fine-tuned for different aspects of natural language processing or different code-oriented corpora, can perform well on these tasks, and (c) establish whether these tasks can be used to build better models for code understanding.
To construct these tasks, we relied on existing annotations in large public repositories such as GitHub (for code), StackOverflow, StackExchange and code documentation (for text about code). We exploited an integration of these sources in an open source dataset [abdelaziz2020graph4code] to define the following five tasks focused on text about code:
Forum Answer Ranking (R). Some answers on forums have many votes or are selected as the best relative to others. Can language models predict the best answers?
Forum Link Prediction (L). Forum users often link to other similar posts, so linked posts are more semantically related than random pairs. Can language models predict links?
Forum to Class Prediction (F). Key features of classes or functions often get discussed in forum posts. Do language models discriminate related posts and class documentation from unrelated ones?
Class Hierarchy Distance Prediction (H). Code is often organized into class hierarchies. Do embedding distances from language models reflect class distances in the hierarchy?
Class Usage Prediction (U). Similar code is often used in similar ways. Are embedding distances smaller for documentation about classes that are used similarly, and larger for dissimilar ones?
We compare performance on these tasks for seven language models, chosen for differences in architecture, training tasks, and corpora, as outlined in Section 3. Our main findings are as follows:
Out of the box, language models trained on general corpora perform reasonably well on most BLANCA tasks, compared to models trained on code specific corpora such as CodeBERT [feng2020codebert] or BERTOverflow [tabassum2020code], attesting to the generality of these models.
However, on every task, fine-tuning code-specific models resulted in a significant boost in performance, highlighting the usefulness of BLANCA tasks for building better language models.
Multi-task training produced better performance on many BLANCA tasks, suggesting the tasks do help models learn code semantics that transfer across tasks.
To aid further research in code understanding, the code, datasets and the fine-tuned models are publicly available (https://github.com/wala/blanca) under an open source license (Eclipse for the code and Creative Commons Attribution for the data). We hope they prove useful to the code understanding community for enriching representations of programs with textual information about classes and functions.
2 Related Work
There have been numerous benchmarks for code summarization or generation of code from natural language, and hence they have focused on collecting code and textual documentation that characterize the code. For these tasks, most have used the approach of generating code and its associated documentation strings, e.g., [leclair-mcmillan-2019-recommendations], [movshovitz-attias-cohen-2013-natural]. Similarly, code and corresponding textual documentation have been used for numerous tasks involving searching for code, e.g., [li2019neural], [husain2019codesearchnet], or searching for posts given code, e.g., [10.1145/2597073.2597077].
While such benchmarks are useful for joint embeddings of code and their associated text, they are restricted to tasks around code summarization, code generation, comment generation or code search; i.e., they do not directly help with the evaluation of language models for textual artifacts about code. Furthermore, most of the datasets in the literature do not correlate textual artifacts around code with code usage, with the exception of [ijcai2018-314], which does link the generation of API sequence information from their usage in code to the problem of code summarization. Other work connects code on GitHub to StackOverflow posts, but the latter dataset is not available. Again, their datasets are targeted to the tasks of code summarization and code search, respectively. Similarly, StackOverflow posts have been used for tasks such as answer summarization [10.1145/3338906.3341186], program repair, or generating code. Finding directly related or duplicate posts is a recent task and dataset proposed in [Shirani2019QuestionRO], but there is no evaluation of any language model in that work.
[tabassum2020code] provided a BERTOverflow model for an in-domain representation of text about code. BERTOverflow is trained on 152 million StackOverflow questions over a BERT architecture, and has been fine-tuned for software named entity recognition (e.g., finding mentions of operating systems in text). We use BERTOverflow as one of the models for the BLANCA tasks. There has also been work building structural language models from abstract syntax trees, e.g., [ANY-Code], which is clearly a related task, but the focus is once again on code. CodeXGLUE [CodeXGLUE] provides a novel text-text benchmark which involves translation of documentation about code from one language to another, but that is arguably closer to natural language processing than code.
Thus, to the best of our knowledge, no work so far has examined how language models perform on a set of code related tasks for textual program artifacts, nor has there been much emphasis on building benchmarks to build better text representations for code understanding. BLANCA is built to address this gap.
In choosing models for our experimentation, we needed language models to encode paragraphs in either class documentation or posts. We relied largely on the sentence transformers library [reimers-2019-sentence-bert], which provides a wide range of transformer models that have been fine-tuned for tasks such as information retrieval, paraphrase detection, and sentence similarity detection. These models have been shown to be effective in sentence and paragraph encoding style tasks. We also chose models with different bases, such as BERT [devlin-etal-2019-bert], XLM-RoBERTa [DBLP:journals/corr/abs-1911-02116] and DistilBERT [sanh2020distilbert]. We also added a non-transformer style model (Google's Universal Sentence Encoder [cer-etal-2018-universal]), and models fine-tuned on StackOverflow posts (BERTOverflow [tabassum2020code]) and code documentation (CodeBERT [feng2020codebert]) to see if domain-specific training is helpful. We did not consider models, such as CuBERT [cubert], designed only for code and not for text about code. The reason is that CuBERT's vocabulary is based on programming language tokens for Java or Python, which is only partially useful for text about code. Table 1 shows the types of base models used in our evaluation, using the names from the sentence-transformers library (SBERT, https://github.com/UKPLab/sentence-transformers).
|Model name||Fine-tuning task|
|Universal Sentence Encoder||N/A|
|CodeBERT-mlm||NL-PL pairs in 6 languages|
Models used as baselines. Sources were tensorflow-hub, and SBERT. bert-base-nli-stsb-mean-tokens, distilroberta-base-paraphrase-v1, xlm-r-distilroberta-base-paraphrase-v1, msmarco-distilroberta-base-v2 and microsoft/codebert-base-mlm are their corresponding names in SBERT.
We also tested if fine-tuning on each task would enhance performance, to establish whether the tasks can be used to build a better language model. For fine-tuning, we started either with BERTOverflow or CodeBERT, with the assumption that an in-domain representation would provide some advantage. We also examined whether multi-task training would improve performance, to see if better models could be built from using a combination of BLANCA tasks.
All our datasets describe code artifacts in Python, and are derived from Graph4Code, which links 1.3 million programs of Python code to associated posts and class-documentation [abdelaziz2020graph4code]. For multi-task fine tuning, we report, for each task, the model with the best performance, and we outline its characteristics. In Section 4.6, we discuss more general findings for multi-task training. Performance on tasks is encoded as follows in tables: (1) Forum Answer Ranking (R), (2) Forum Link Prediction (L), (3) Forum Class Prediction (F), (4) Class Hierarchy Prediction (H), and (5) Class Usage Prediction (U). Table 2 lists each BLANCA task and the corresponding train/test data sizes.
|Task||Pair type||Train||Test|
|Forum Answer Ranking||Question-answer pairs||450,000||50,000|
|Forum Link Prediction||Question-Question pairs||23,516||5,854|
|Forum to Class Prediction||Question-Class pairs||11,488||1,275|
|Class Hierarchy Prediction||Class-Class pairs||16,215,400||1,801,716|
|Class Usage Prediction||Class-Class pairs||75,862||8,439|
4.0.1 Dataset Annotation Quality
Two of the BLANCA tasks are based on datasets manually curated by millions of users: ranking answers in StackOverflow forums (Forum Answer Ranking) and manually linking similar posts (Forum Link Prediction). These data are high quality, in the sense that they are crowd annotated by humans, which is how most gold standards get constructed. The Class Hierarchy and Class Usage Prediction tasks are both based on objective properties of code artifacts (class hierarchy and similarities among classes in terms of their methods, respectively), so once again, the issues of data quality do not arise. The only task where we did not have explicit human labeling for every example is Forum to Class Prediction. In this task, we relied on heuristics to automatically label the data. Furthermore, to assure quality, we performed a manual evaluation of a sample with three human annotators (see Section 4.3 for details).
4.0.2 Hyperparameter Search for Finetuning
We started with the default parameters of our base models, CodeBERT and BERTOverflow. We also tried Population Based Training from Ray Tune (https://docs.ray.io/en/latest/tune/index.html) to perform hyper-parameter search for the Forum Answer Ranking (R) and Forum Link Prediction (L) tasks. However, we did not get better performance compared to using the default parameters from the corresponding base models.
We describe below how we formulated each task, the dataset definition process and the performance of various language models on it.
4.1 Forum Answer Ranking (R)
4.1.1 Task Description
StackOverflow and StackExchange contain questions and answers. Accepted answers are manually annotated and most answers have a vote count. The core task here is to predict the best answer to each question, and order the answers by their popularity.
We generated a dataset of 500K questions such that each question comes with at least three answers. The average number of answers per question in this dataset is 4.9, the average number of votes per question is 23.5, and the average number of votes per answer is 12.74. The data were split 90-10, so the train set had 450,000 questions and the test set had 50,000 questions. To build the fine-tuning model, we modeled this as a task similar to training on the Semantic Textual Similarity Benchmark (STSB) adopted by SBERT. Each answer was ranked according to popularity, and ties were broken by keeping only one of the tied answers. The ranks were then converted to a score between 0 (worst rank) and 1 (best rank), and we trained with a cosine similarity loss and an embedding similarity evaluator from the SBERT library. Fine-tuning was performed on the BERTOverflow and CodeBERT models for 10 epochs, with 90% of the training data used for training and 10% for validation.
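The rank-to-score conversion described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `ranks_to_scores` is hypothetical, and the linear spacing between ranks is an assumption, since the text only fixes the endpoints (0 for the worst rank, 1 for the best).

```python
def ranks_to_scores(votes):
    """Map vote counts to scores in [0, 1], where 1 is the most popular answer."""
    # Keep one entry per distinct vote count, mirroring the tie-breaking rule
    # of retaining only one of the tied answers.
    distinct = sorted(set(votes), reverse=True)
    if len(distinct) == 1:
        return {distinct[0]: 1.0}
    # Assumption: ranks are spaced linearly between the two endpoints.
    step = 1.0 / (len(distinct) - 1)
    return {v: 1.0 - i * step for i, v in enumerate(distinct)}


# Toy question with four answers; two answers are tied at 3 votes.
scores = ranks_to_scores([12, 3, 3, 40])
```

Each (question, answer, score) triple would then serve as one training example for a similarity-regression objective.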
To capture how well the embeddings of different language models identified the ranking of answers, we computed the cosine distances between the question embedding and the embedding of each of the answers, and ranked answers by nearest in cosine distance to furthest. We report standard information retrieval metrics of average Mean Reciprocal Rank (MRR) and average Normalized Discounted Cumulative Gain (NDCG) on this predicted ranking.
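As a rough sketch of this evaluation, the snippet below ranks toy answer embeddings by cosine distance to a question embedding and computes the reciprocal rank and NDCG for that single question. The vectors and relevance scores are invented for illustration, and the use of a linear gain in NDCG is an assumption; the paper does not state which gain function was used.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def reciprocal_rank(predicted, best):
    # 1 / position of the best answer in the predicted ordering.
    return 1.0 / (predicted.index(best) + 1)


def ndcg(predicted, relevance):
    # Assumption: linear gain (the relevance score itself), log2 discount.
    def dcg(order):
        return sum(relevance[a] / math.log2(i + 2) for i, a in enumerate(order))

    ideal = sorted(relevance, key=relevance.get, reverse=True)
    return dcg(predicted) / dcg(ideal)


# Toy data: 2-d "embeddings" and gold popularity scores (hypothetical).
question = [1.0, 0.0]
answers = {"a1": [0.9, 0.1], "a2": [0.2, 0.8], "a3": [0.6, 0.4]}
relevance = {"a1": 0.5, "a2": 0.0, "a3": 1.0}

# Nearest answer first.
predicted = sorted(answers, key=lambda a: cosine_distance(question, answers[a]))
rr = reciprocal_rank(predicted, best="a3")
score = ndcg(predicted, relevance)
```

Averaging `rr` and `score` over all test questions would give the MRR and NDCG figures of the kind reported in the table.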
Table 3 shows that most language models do reasonably well on this task, which is not surprising because text in forum posts is mostly natural language. Surprisingly though, there is no benefit for the base BERTOverflow model, which has been tuned on StackOverflow posts, compared to the other models that were not fine-tuned. However, fine-tuned BERTOverflow does much better, which is consistent with our hypothesis that it is possible to use these tasks for building better language models. Across many tasks, fine-tuning on BERTOverflow produced better performance than fine-tuning on CodeBERT, which suggests that forum discussions contain, in most cases, the right mixture of explanations in natural language along with code. Moreover, the best performance was achieved with multi-task fine-tuning (RFLHU-BERTOverflow), which suggests that the use of multiple BLANCA tasks builds better language models for textual code artifacts.
|Model||MRR||NDCG|
|DistilBERT-paraphrasing||0.5937 (.001)||0.8393 (.001)|
|BERT-NLI||0.5972 (.001)||0.8407 (.001)|
|msmarco-DistilRoBERTa||0.5992 (.001)||0.8427 (.001)|
|xlm-r-paraphrase-v1||0.5977 (.001)||0.8411 (.001)|
|USE||0.6114 (.001)||0.8483 (.001)|
|BERTOverflow||0.5910 (.001)||0.8375 (.001)|
|RFLHU-BERTOverflow||0.6879 (.001)||0.8893 (.001)|
Performance of language models on forum answer ranking (R). The numbers in parentheses are the standard errors of the sample mean. FT represents fine tuning on R alone, RFLHU-BERTOverflow is the best multi-task training model.
4.2 Forum Link Prediction (L)
4.2.1 Task Description
Forum posts with links to one another are usually related compared to unlinked posts; we investigate if language models place such related post pairs closer in vector space. We focus on embedding distance because it is a more direct metric for assessing the quality of the embedding rather than classification accuracy.
For this task, we generated 23,516 pairs of posts for training (11,758 positive and 11,758 negative) and 5,854 pairs (2,927 positive and 2,927 negative) for testing. Fine-tuning was set up as a classification task in SBERT, with the use of contrastive loss along with a binary classification evaluator from the SBERT library. All other training details were similar to the forum answer ranking task.
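For readers unfamiliar with contrastive loss, the per-pair computation can be sketched in plain Python. This follows the standard formulation (pull positive pairs together, push negative pairs beyond a margin); the margin of 0.5 and the function name are assumptions for illustration, and the exact settings used in the experiments are not stated in the text.

```python
def contrastive_loss(distance, label, margin=0.5):
    """Loss for one pair of posts: label 1 = linked, label 0 = unlinked."""
    # Linked pairs are penalized in proportion to their squared distance.
    pull = label * distance ** 2
    # Unlinked pairs are penalized only while they sit inside the margin.
    push = (1 - label) * max(0.0, margin - distance) ** 2
    return 0.5 * (pull + push)


loss_linked_close = contrastive_loss(0.2, label=1)
loss_unlinked_far = contrastive_loss(0.8, label=0)
loss_unlinked_close = contrastive_loss(0.1, label=0)
```

An unlinked pair that is already further apart than the margin contributes no loss, so training effort concentrates on pairs the model still confuses.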
Relevant to this task, [Shirani2019QuestionRO] recently introduced a similar benchmark for predicting relatedness in StackOverflow posts focused on Java code, as opposed to our dataset, which is language agnostic. Their dataset contains 300K linked pairs categorized into 1) duplicates: questions in StackOverflow marked by moderators as duplicates, 2) direct: explicitly linked posts, 3) indirect: posts transitively connected through a direct or a duplicate link, and 4) isolated: unlinked posts. Direct and isolated links are similar to our positive and negative examples. We evaluate all our models' ability to differentiate these link types. Note that we did not use this data for fine-tuning a model which discriminates the different categories, but one might expect direct links and duplicates to be closer in embedding distance, and isolated links to be the furthest, with indirect links in the middle. [Shirani2019QuestionRO] did not evaluate this with any of the language models, so we examine whether these categories of relatedness of posts are reflected in embeddings of pre-trained models.
As shown in Table 4, all language models showed a statistically significant difference on independent sample t-tests between linked and unlinked posts. BERTOverflow with fine-tuning (both the version tuned on L only and RFLHU) performed the best in terms of pulling apart linked and unlinked posts. We note that the T value normalizes the distance between linked and unlinked posts by their variance; that is, the T value captures not only the average distance but also the separation between the two distributions. Our focus then is on the absolute value of that separation as provided by the T value. Figure 2 shows this visually. We note that BERTOverflow and CodeBERT as base models discriminated least between linked and unlinked posts, but fine-tuning clearly helped greatly. This is evident in the solid and dashed lines for RFLHU-BERTOverflow, which show little overlap between linked and unlinked posts.
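To make the role of the T value concrete, here is a small sketch that computes a two-sample t statistic over embedding distances for linked and unlinked pairs. It uses Welch's unequal-variance form, which is an assumption; the text only says "independent sample t-tests". The distance values are invented toy numbers.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Difference in means divided by its standard error (Welch's form)."""
    standard_error = math.sqrt(
        variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical cosine distances between post pairs.
linked = [0.21, 0.25, 0.30, 0.28, 0.24]
unlinked = [0.61, 0.58, 0.70, 0.66, 0.63]

t_value = welch_t(linked, unlinked)
```

A large absolute `t_value` indicates that the two distance distributions are well separated relative to their spread, which is the property compared across models.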
Figure 3 shows the results of a variety of language models for the question relatedness variant of this task [Shirani2019QuestionRO]. We ensured that none of the test set examples from [Shirani2019QuestionRO] were used in our training set. Across all models, directly related questions are closest in embedding space, followed by indirectly related questions. Questions marked as duplicate posts were similar to the indirect questions only in the RFLHU model, which seemed to be picking up relatedness in both indirect and duplicate questions. We note that questions marked as duplicates in forums are only duplicates at a level of coding abstraction. For example, the two questions "How to return multiple objects from a Java method?" and "Java how to return two variables?" are a duplicate pair. Although the two questions talk about the same problem, the discussions and even the solutions are different. Therefore, the cosine similarity between them is not as high as one would expect. Finally, isolated question pairs are the most distant compared to all other pairs across all models (all differences from isolated pairs to direct, indirect and duplicate pairs were statistically significant at the .01 level). Multi-task fine-tuning (RFLHU-BERTOverflow) clearly helped the most in bringing semantically related posts closer and pulling apart the unrelated ones.
4.3 Forum to Class Prediction (F)
4.3.1 Task Description
Forum posts often describe specific code artifacts in text, where they discuss key features of a class or a function. A key question is whether a model can predict if a post about a class and documentation of the same class are related.
As shown in Table 5, all language models showed a statistically significant difference on independent sample t-tests between positive and negative class-post examples. Again, fine-tuning significantly improves performance on this task, e.g., single-task tuning of FT-CodeBERT vs. CodeBERT and FT-BERTOverflow vs. BERTOverflow. Fine-tuning on multiple tasks, e.g., RFLHU-CodeBERT, gave better performance compared to the single-task tuned models, FT-CodeBERT and FT-BERTOverflow. As shown in Figure 4, the distance was greatly enhanced by fine-tuning, e.g., BERTOverflow vs. fine-tuned RFLHU-BERTOverflow.
4.4 Class Hierarchy Distance Prediction (H)
4.4.1 Task Description
Semantically related classes tend to be linked by developers in a class hierarchy, so it is reasonable to ask if neural embeddings of related classes cluster closer together in a class hierarchy. We structured this as a class distance prediction task, with class distances ranging from 1 to 10.
We collected the documentation associated with 257,655 classes in Graph4Code [abdelaziz2020graph4code], but many of these represent different names that resolve to the same class. We aliased each class to its canonical version by loading the class dynamically to obtain its runtime name, and kept the classes that we could not load under their original names, which resulted in 90,464 classes.
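The aliasing step can be illustrated with a short sketch that loads a class by its documented name and reads back its runtime name. The helper name `canonical_name` and the fallback of keeping the original name for unloadable classes are assumptions made for this example; the authors' actual implementation may differ.

```python
import importlib


def canonical_name(qualified_name):
    """Resolve a documented class name to its runtime module and name."""
    module_name, _, class_name = qualified_name.rpartition(".")
    try:
        cls = getattr(importlib.import_module(module_name), class_name)
        return f"{cls.__module__}.{cls.__qualname__}"
    except (ImportError, AttributeError, ValueError):
        # Class could not be loaded: keep the name as documented.
        return qualified_name


# json.JSONDecoder is re-exported from the json.decoder module.
resolved = canonical_name("json.JSONDecoder")
unresolved = canonical_name("no_such_package_xyz.Foo")
```

Two documented names that resolve to the same runtime name can then be merged into a single class entry.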
To get classes related by distance, we created an undirected graph of class to superclass relations for every module, being careful not to add edges from any class to the class object. For each module graph, we computed distances between every pair of classes using an all pairs shortest paths algorithm. We eliminated pairs with distances greater than 10, and this resulted in a set of pairs that we split randomly such that 16,215,400 pairs of classes were in train, and 1,801,716 pairs were in test. For fine-tuning, we structured this similarly to the forum ranking task, with distances translated to scores between 0 (least related) and 1 (most related), and we used cosine similarity loss, coupled with an embedding similarity evaluator from SBERT. Training on 16.2 million pairs was computationally expensive, so we trained on a random sample of 100,000 training examples, 10,000 of which were used for validation. Figure 5 shows the distribution of embedding distances for each class distance (1-10) to show the dataset characteristics.
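The pair-construction procedure above can be sketched for a single module graph as follows. Breadth-first search from every class stands in for the all pairs shortest paths algorithm, which is valid because edges are unweighted. The linear distance-to-score mapping is an assumption (the text only gives the 0 and 1 endpoints), and the class names are toy examples.

```python
from collections import defaultdict, deque


def class_distances(superclass_edges, max_distance=10):
    """Shortest-path distances between classes in an undirected hierarchy graph."""
    graph = defaultdict(set)
    for cls, parent in superclass_edges:
        if parent == "object":  # skip edges to the root class object
            continue
        graph[cls].add(parent)
        graph[parent].add(cls)

    pairs = {}
    for source in graph:
        seen = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            if seen[node] == max_distance:  # do not expand beyond the cutoff
                continue
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen[neighbour] = seen[node] + 1
                    queue.append(neighbour)
        for target, dist in seen.items():
            if source < target:  # record each unordered pair once
                pairs[(source, target)] = dist
    return pairs


def distance_to_score(dist, max_distance=10):
    # Assumption: linear mapping, distance 1 -> 1.0 and distance 10 -> 0.0.
    return 1.0 - (dist - 1) / (max_distance - 1)


edges = [("Base", "object"), ("Linear", "Base"), ("Tree", "Base"), ("Ridge", "Linear")]
pairs = class_distances(edges)
```

Here `Ridge` and `Tree` end up at distance 3 (via `Linear` and `Base`), so they would receive a lower similarity score than a class and its direct superclass.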
Since this is a regression task, we evaluated the Pearson correlation, which as shown in Table 6 varied from 0.17 (for BERTOverflow) to 0.34 (for fine-tuned BERTOverflow); all are statistically significant. Regression for each model is shown in Figure 6. The improvement from fine-tuning for BERTOverflow showed that the task is useful for building better embeddings. Some other models showed reasonable performance with no tuning (xlm-r-paraphrase at 0.27 and USE at 0.28), so there is clearly room to improve these different base models as well, but we leave that issue for future work, since our goal is task development rather than building better models.
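For reference, the Pearson correlation used in this evaluation can be computed as in the sketch below. The two lists are invented toy values standing in for hierarchy distances and the corresponding embedding distances of class pairs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical class pairs: hierarchy distance vs. embedding cosine distance.
hierarchy_distance = [1, 2, 3, 4, 5]
embedding_distance = [0.10, 0.22, 0.28, 0.41, 0.50]

r = pearson(hierarchy_distance, embedding_distance)
```

A value near 1 would mean embedding distance grows almost linearly with hierarchy distance; the correlations reported in this section are considerably lower than in this toy example.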
4.5 Class Usage Prediction (U)
4.5.1 Task Description
GitHub contains millions of programs, where classes are used in code to achieve some purpose. Classes that are used in the same way; i.e., same set of methods get invoked on them, might be expected to be rated as more similar than classes that do not share any methods. We structured this as a similarity rating task.
To construct this dataset, we used the Graph4Code knowledge graph [abdelaziz2020graph4code], which has data flow graphs for 1.3 million GitHub programs. Dataflow tracks the flow of data through return values and parameters within a program. As an example, for the program snippet shown in Figure 1, dataflow would show that fit and predict calls occur on objects returned by calls to the constructors of SGDClassifier and GLM. In this example, SGDClassifier shares 2 methods (denoted as $m$) with 1 class (denoted as $c$), which in this instance is GLM. The classes are similar, in the sense that they both share the same methods in usage, but the degree of similarity depends on the number of shared methods ($m$) and the number of classes that have the same methods ($c$). The smaller the $c$, the more likely it is that a pair is similar, and the larger the $m$, the more likely it is that the pair is similar. To capture both dimensions of similarity in a single distance metric for learning, we defined an 'ideal' class pair in terms of our data, that is, a vector of $(m, c)$ values. We used the Euclidean distance of each class pair from this ideal vector as the dissimilarity metric. Given a pair, the task then is to predict if the classes were similar or distant based on their usage.
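The dissimilarity metric can be sketched as a Euclidean distance in a two-dimensional space whose axes are the number of shared methods and the number of classes sharing those methods. The ideal vector below (50 shared methods, 1 sharing class) is purely hypothetical; the actual values of the ideal pair are not recoverable from the text, and `usage_distance` is an illustrative name.

```python
import math


def usage_distance(shared_methods, sharing_classes, ideal):
    """Euclidean distance of a class pair from the 'ideal' pair."""
    return math.dist((shared_methods, sharing_classes), ideal)


# Hypothetical ideal pair: many shared methods, shared with a single class.
ideal = (50, 1)

# SGDClassifier and GLM from the running example: 2 shared methods, 1 class.
d_example = usage_distance(2, 1, ideal)
```

Under this sketch, pairs that share more methods, or whose methods are shared with fewer other classes, lie closer to the ideal vector and are treated as more similar.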
The train task contains 75,862 class pairs, and the test task contains 8,439 pairs, with an average distance of 312.21 for train pairs and an average distance of 312.12 for test pairs, suggesting the two had similar characteristics. The fine-tuning task was modeled the same as the class hierarchy prediction task.
We frame this task as a regression task and evaluate the Pearson correlation once again for all models, which varied from 0.17 (for BERT-NLI) to 0.61 (for HU-BERTOverflow) as shown in Table 7; all results are statistically significant. The improvement from fine-tuning for BERTOverflow (0.37 to 0.52) also shows the task is useful for building better embeddings. Adding the hierarchy task as well (HU-BERTOverflow) further improved performance to 0.61. This was not the case, though, for the CodeBERT models with and without fine-tuning, where performance dropped from 0.33 to 0.30. We also show in Figure 7 the effectiveness of usage distance as a predictor of cosine embedding distance.
4.6 Multi-Task Training
We focus our multi-task training discussion on BERTOverflow, because combining it with multi-task training consistently produced the best performance. As shown in Table 8, tasks derived from code properties (usage (U) and hierarchy (H)) did not benefit from training on the ranking (R), forum to class (F) or linked posts (L) tasks on BERTOverflow, which suggests that tasks derived from code properties require different features than those emphasized by the RFL tasks. Tasks derived from code properties (HU), however, helped the RFL tasks, suggesting the importance of having a diversity of tasks for tuning. We expected class hierarchy training to help the usage task, since code that is closely related in the type hierarchy tends to have similar usage due to the nature of classes; this was confirmed by our findings. We also expected usage analysis to help class hierarchy training, because we expected parameter types of methods to relate to the class hierarchy; this did not happen, perhaps due to the dynamically typed nature of Python, where distinct types can share method names. We expect this to be different in statically typed languages such as Java, and we plan to investigate it in future work.
In this paper, we presented BLANCA, a set of benchmarks to help further research in code understanding from textual manuals and posts about code. We used BLANCA tasks to show that one can build better language models for understanding code artifacts. We also used multi-task training to demonstrate better representations of classes and functions from these models. We hope these will be useful in enriching code representations with the textual semantics embedded in natural language artifacts of code.