CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

09/20/2019
by Hamel Husain, et al.

Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language, which is better suited to describing vague concepts and ideas. To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from the CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained by mechanically scraping and preprocessing the associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task. We hope that the CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further, and we will host a competition and leaderboard to track progress on the challenge. We are also keen on extending the challenge to more queries and programming languages in the future.
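
To make the task concrete, the following is a minimal sketch of a purely lexical code search over a tiny in-memory corpus. It is not one of the baselines from the article: the function names (`subtokens`, `jaccard`, `search`), the three-entry `corpus`, and the choice of Jaccard overlap on identifier subtokens are all assumptions made for illustration.

```python
import re


def subtokens(text):
    """Return the set of lowercase subtokens in text.

    Splits on non-letters (so snake_case breaks apart) and on
    camelCase / PascalCase boundaries.
    """
    parts = []
    for word in re.findall(r"[A-Za-z]+", text):
        # An uppercase run not followed by lowercase (e.g. "HTTP"),
        # or an optionally capitalised lowercase run (e.g. "Server").
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", word))
    return {p.lower() for p in parts}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def search(query, functions, top_k=3):
    """Rank function names by subtoken overlap with the query.

    `functions` maps a (hypothetical) function name to its source text.
    Functions with no overlap at all are not returned.
    """
    q = subtokens(query)
    scored = [(jaccard(q, subtokens(code)), name)
              for name, code in functions.items()]
    # Highest score first; ties broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_k] if score > 0]


# Toy corpus, invented for this example.
corpus = {
    "read_lines": "def read_lines(path):\n"
                  "    with open(path) as f:\n"
                  "        return f.read().splitlines()",
    "parseJson": "function parseJson(text) { return JSON.parse(text); }",
    "sum_list": "def sum_list(values):\n    return sum(values)",
}

print(search("read lines from a file path", corpus))
print(search("add up numbers", corpus))
```

In this toy setup the second query shares no subtoken with `sum_list`, so a purely lexical matcher finds nothing for it even though that function is the relevant one. This is a small instance of the gap between code vocabulary and natural language that the abstract describes.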
