Neural Code Search Evaluation Dataset

08/26/2019 ∙ by Hongyu Li, et al. ∙ Facebook

There has been increasing interest in code search using natural language. Assessing the performance of such code search models can be difficult without a readily available evaluation suite. In this paper, we present an evaluation dataset consisting of natural language query and code snippet pairs, with the hope that future work in this area can use this dataset as a common benchmark. We also provide the results of two code search models ([1] and [6]) from recent work.




1. Introduction

In recent years, learning the mapping between natural language and code snippets has been a popular field of research. In particular, [6], [1], and [2] have explored finding relevant code snippets given a natural language query, with models ranging from word embeddings and IR techniques to sophisticated neural networks. To evaluate the performance of these models, Stack Overflow question and code answer pairs are prime candidates, as Stack Overflow questions closely resemble what a developer may ask. One such example is "Close/hide the Android Soft Keyboard".

One of the first answers on Stack Overflow (by Reto Meier, to a question asked by Vidar Vestnes) correctly answers this question. However, collecting these questions can be tedious, and systematically comparing various models can pose a challenge.

To this end, we have constructed an evaluation dataset that contains natural language queries and relevant code snippet answers from Stack Overflow. It also includes code snippet examples from the search corpus (public repositories from GitHub) that correctly answer each query. We hope that this dataset can serve as a benchmark to evaluate performance across various code search models.

The paper is organized as follows. First, we explain what data we are releasing in the dataset. Then we describe the process for obtaining this dataset. Finally, we evaluate two code search models of our own creation, NCS and UNIF, on the evaluation dataset as a benchmark.

2. Dataset Contents

In this section, we explain what data we are releasing.

2.1. GitHub Repositories

The most popular Android repositories on GitHub (ranked by the number of stars) are used to create the search corpus. For each repository that we indexed, we provide the link, specific to the commit that was used (as of August 2018). In total, there are 24,549 repositories. (There were originally 26,109 repositories; the difference is due to reasons outside of our control, e.g., repositories being deleted.) Note that some of the links in this dataset may not be available in the future for similar reasons. We release a text file containing the download links for these GitHub repositories. See Listing 1 for an example.
Listing 1: GitHub repositories download links example.
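As a sketch of how the released links file might be consumed, the links can be read and mapped to local archive names. The one-URL-per-line format and the `/owner/repo/archive/<commit>` path layout are assumptions for illustration, not guarantees of the release:

```python
from pathlib import Path
from urllib.parse import urlparse

def read_links(path):
    """Read one download link per line, skipping blank lines."""
    return [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]

def archive_name(link):
    """Derive a local file name such as 'owner__repo__sha.tar.gz' from a
    GitHub archive URL (assumed layout: /owner/repo/archive/sha.tar.gz)."""
    parts = urlparse(link).path.strip("/").split("/")
    owner, repo, archive = parts[0], parts[1], parts[-1]
    return f"{owner}__{repo}__{archive}"
```

The actual download step (e.g. with `urllib.request.urlretrieve`) is left out; since some repositories may have disappeared, a robust script should tolerate failed requests.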

2.2. Search Corpus

The search corpus is indexed using all method bodies parsed from the 24,549 GitHub repositories. In total, there are 4,716,814 methods in this corpus. The code search model will find relevant code snippets (i.e. method bodies) from this corpus given a natural language query. In this data release, we will provide the following information for each method in the corpus:


  • id: Each method in the corpus has a unique numeric identifier. This ID number will also be referenced in our evaluation dataset.

  • filepath: Path of the source file that contains the method.

  • method_name: Name of the method.

  • start_line: Starting line number of the method in the file.

  • end_line: Ending line number of the method in the file.

  • url: GitHub link to the method body with commit ID and line numbers encoded.

Listing 2 provides an example of a method in the search corpus.

   "id": 4716813,
   "filepath": "Mindgames/VideoStreamServer/playersdk/src/main/java/com/kaltura/playersdk/",
   "method_name": "notifyKPlayerEvent",
   "start_line": 506,
   "end_line": 566,
Listing 2: Search corpus example.
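A minimal sketch of loading the corpus for lookup by id. It assumes one JSON object per line; the released file's exact serialization may differ (e.g., a single JSON array):

```python
import json

def index_corpus(lines):
    """Build an id -> record map for the search corpus, assuming one
    JSON object per line (adjust if the release uses another layout)."""
    corpus = {}
    for line in lines:
        line = line.strip()
        if line:
            rec = json.loads(line)
            corpus[rec["id"]] = rec
    return corpus
```

With the corpus indexed this way, the numeric ids referenced by the evaluation dataset resolve directly to method records.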

2.3. Evaluation Dataset

The evaluation dataset is composed of 287 Stack Overflow question and answer pairs, for which we release the following information:


  • stackoverflow_id: Stack Overflow post ID.

  • question: Title of the Stack Overflow post.

  • question_url: URL of the Stack Overflow post.

  • answer: Code snippet answer to the question.

  • answer_url: URL of the StackOverflow answer to the question.

  • examples: 3 methods from the search corpus that best answer the question (most similar to the Stack Overflow answer).

  • examples_url: GitHub links to the examples.

Note that there may be more than one acceptable answer to each question. See Listing 3 for a concrete example of an evaluation question in this dataset. The question and answer pairs were extracted from the Stack Exchange Network [4].

   "stackoverflow_id": 1109022,
   "question": "Close/hide the Android Soft Keyboard",
   "question_author": "Vidar Vestnes",
   "answer": "// Check if no view has focus:\nView view = this.getCurrentFocus();\nif (view != null) {   InputMethodManager imm = (InputMethodManager)getSystemService(Context.INPUT_METHOD_SERVICE);  imm.hideSoftInputFromWindow(view.getWindowToken(), 0);}",
   "answer_url": "",
   "answer_author": "Reto Meier",
   "examples": [1841045, 1800067, 1271795],
   "examples_url": [
Listing 3: Evaluation dataset example.
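Given both files, the numeric `examples` ids can be joined back to search corpus records. A small sketch, assuming those ids align with the corpus `id` field as described above:

```python
def resolve_examples(entry, corpus):
    """Return the corpus method records for an evaluation entry's
    'examples' ids (ids assumed to match the corpus 'id' field);
    ids missing from the corpus are skipped."""
    return [corpus[i] for i in entry["examples"] if i in corpus]
```

This is convenient when spot-checking whether a model's returned methods overlap with the ground-truth examples.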

2.4. NCS / UNIF Score Sheet

We provide the evaluation results for two code search models of our creation, each with two variations:


  • NCS: an unsupervised model that uses word embeddings derived directly from the search corpus [6].

  • NCS_postrank: an extension of the base NCS model that performs a post-pass ranking, as explained in [6].

  • UNIF_android, UNIF_stackoverflow: supervised extensions of the NCS model that use a bag-of-words-based neural network with attention. The supervision is learned from the GitHub-Android-Train and StackOverflow-Android-Train datasets, respectively, as described in [1].

We provide the rank of the first correct answer (FRank) for each question in our evaluation dataset. The score sheet is saved as a comma-separated values (CSV) file, as illustrated in Listing 4.

No.,StackOverflow ID,NCS FRank,NCS_postrank FRank,UNIF_android FRank,UNIF_stackoverflow FRank
Listing 4: Score sheet example. "NF" stands for: correct answer not found in the top 50 returned results.
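The score sheet can be parsed with, for example, Python's standard `csv` module. A sketch in which "NF" cells are mapped to `None`:

```python
import csv
import io

# Column names taken from the score sheet header shown above.
FRANK_COLS = ["NCS FRank", "NCS_postrank FRank",
              "UNIF_android FRank", "UNIF_stackoverflow FRank"]

def parse_frank(cell):
    """'NF' means the correct answer was not in the top 50 results."""
    return None if cell.strip() == "NF" else int(cell)

def load_scores(csv_text):
    """Parse the score sheet into a list of per-question dicts of FRanks."""
    return [{col: parse_frank(row[col]) for col in FRANK_COLS}
            for row in csv.DictReader(io.StringIO(csv_text))]
```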

3. How we Obtained the Dataset

In this section, we describe the procedure for how we obtained the data.

GitHub repositories. We obtained the information about the GitHub repositories with the GitHub REST API [3], and the source files were downloaded using publicly available links.

Search corpus. The search corpus was obtained by dividing each file in the GitHub repositories by method-level granularity.

Evaluation dataset. The benchmark questions were collected from a data dump publicly released by Stack Exchange [4]. To select the set of Stack Overflow question and answer pairs, we created a heuristics-based filtering pipeline that discards open-ended, discussion-style questions. We first obtained the 17,000 most popular questions on Stack Overflow with the "Android" and "Java" tags. The dataset was then filtered with the following criteria: 1) there exists an upvoted code answer, and 2) the ground-truth code snippet has at least one match in the search corpus. From this pipeline, we obtained 518 questions. Finally, we manually went through these questions and filtered out those with vague queries and/or code answers. The final dataset contains 287 Stack Overflow question and answer pairs.
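The two automatic criteria in the pipeline above can be sketched as a predicate; the field names here are illustrative only, not taken from the released data:

```python
def keep_question(question, in_corpus):
    """Sketch of the two automatic filtering criteria: keep a question
    only if (1) it has an upvoted answer containing code and (2) its
    ground-truth snippet matches something in the search corpus.
    'in_corpus' is a caller-supplied matching predicate."""
    has_upvoted_code = any(a["score"] > 0 and a["has_code"]
                           for a in question["answers"])
    return has_upvoted_code and in_corpus(question["ground_truth"])
```

The final manual pass (removing vague queries and answers) has no code analogue; it was done by inspection.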

NCS / UNIF score sheet. To judge whether a method body correctly answers the query, we compare how similar it is to the Stack Overflow answer. We do this systematically using Aroma, a code-to-code similarity tool [5]. Aroma gives a similarity score between two code snippets; if this score is above a certain threshold (0.25 in our case), we count the result as a success. This similarity score aims to mimic a manual assessment of the correctness of search results in an automatic and reproducible fashion, leaving human judgment out of the process. More details on how we chose this threshold can be found in [1].
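Under this scheme, the FRank of a query might be computed as follows. This is a sketch: Aroma itself is a separate tool, so the per-result similarity scores are taken as given:

```python
def first_rank(similarities, threshold=0.25):
    """1-based rank of the first returned method whose Aroma similarity
    to the ground-truth answer exceeds the threshold; None means 'NF'
    (no correct answer among the, e.g., top-50 scored results)."""
    for rank, sim in enumerate(similarities, start=1):
        if sim > threshold:
            return rank
    return None
```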

4. Evaluation

We provide the results for four models: NCS, NCS_postrank, UNIF_android, and UNIF_stackoverflow.

Table 1 reports the number of questions answered within the top n returned code snippets, for n = 1, 5, and 10 (Answered@1, 5, 10 in Table 1), as well as the Mean Reciprocal Rank (MRR).

Model               Answered@1  Answered@5  Answered@10  MRR
NCS                 33          74          98           0.189
NCS_postrank        85          151         180          0.400
UNIF_android        25          74          110          0.178
UNIF_stackoverflow  104         164         188          0.465
Table 1. Number of questions answered in the top 1, 5, and 10 results, and MRR, for NCS, NCS_postrank, UNIF_android, and UNIF_stackoverflow.
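Given the FRank values from the score sheet (with `None` for "NF"), the metrics in Table 1 can be computed as, for example:

```python
def answered_at(franks, k):
    """Count questions whose first correct answer is ranked within the
    top k (franks: 1-based FRank values, None = not found)."""
    return sum(1 for r in franks if r is not None and r <= k)

def mrr(franks):
    """Mean Reciprocal Rank over all questions; questions with no
    correct answer found contribute 0 to the mean."""
    return sum(1.0 / r for r in franks if r is not None) / len(franks)
```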


References

  • (1) Jose Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. When deep learning met code search. CoRR, abs/1905.03813, 2019. arXiv:1905.03813.
  • (2) Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. Deep code search. In Proceedings of the 40th International Conference on Software Engineering, pages 933–944. ACM, 2018.
  • (3) GitHub Inc. GitHub REST API v3.
  • (4) Stack Exchange Inc. Stack Exchange data dump, 2018. CC-BY-SA 3.0.
  • (5) Sifei Luan, Di Yang, Celeste Barnaby, Koushik Sen, and Satish Chandra. Aroma: Code recommendation via structural code search. CoRR, abs/1812.01158, 2018. arXiv:1812.01158.
  • (6) Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish Chandra. Retrieval on source code: a neural code search. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 31–41. ACM, 2018.