CodeDSI: Differentiable Code Search

by Usama Nadeem, et al.

Reimplementing solutions to previously solved software engineering problems is not only inefficient but also introduces inadequate and error-prone code. Many existing methods achieve impressive performance on this issue by using autoregressive text-generation models trained on code. However, these methods are not without their flaws. The code generated by these models can be buggy, lack documentation, and introduce vulnerabilities that may go unnoticed by developers. An alternative to code generation, neural code search, is a field of machine learning in which a model takes natural language queries as input and returns relevant code samples from a database. Because this database exists in advance, code samples can be documented, tested, licensed, and checked for vulnerabilities before developers use them in production. In this work, we present CodeDSI, an end-to-end unified approach to code search. CodeDSI is trained to directly map natural language queries to their respective code samples, which can be retrieved later. To improve the performance of code search, we investigate docid representation strategies, the impact of tokenization on docid structure, and the effect of dataset size on overall code search performance. Our results demonstrate CodeDSI's strong performance, exceeding conventional robust baselines by 2-6 across varying dataset sizes.
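The retrieval framing described above (a natural language query goes in, an identifier for a vetted code sample comes out, and the sample is then looked up in a pre-existing database) can be sketched in a few lines. This is only an illustration under stated assumptions: the `CodeSample` type, the three-entry `DATABASE`, and the `predict_docid` function are hypothetical, and `predict_docid` uses simple word overlap as a stand-in for the trained model, which is not how CodeDSI itself produces docids.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodeSample:
    docid: str        # identifier a search model would be trained to emit
    description: str  # natural-language documentation for the sample
    code: str         # vetted source code stored in the database


# Hypothetical pre-existing database keyed by docid (assumption for illustration).
DATABASE = {
    "0": CodeSample("0", "reverse a string",
                    "def reverse(s):\n    return s[::-1]"),
    "1": CodeSample("1", "read a file into a list of lines",
                    "def read_lines(path):\n"
                    "    with open(path) as f:\n"
                    "        return f.read().splitlines()"),
    "2": CodeSample("2", "compute the factorial of a number",
                    "def factorial(n):\n"
                    "    return 1 if n <= 1 else n * factorial(n - 1)"),
}


def predict_docid(query: str) -> str:
    """Stand-in for a trained model: return the docid whose description
    shares the most words with the query (NOT the paper's method)."""
    query_words = set(query.lower().split())

    def overlap(sample: CodeSample) -> int:
        return len(query_words & set(sample.description.lower().split()))

    return max(DATABASE.values(), key=overlap).docid


def search(query: str) -> CodeSample:
    """Map a query to a docid, then fetch the stored sample."""
    return DATABASE[predict_docid(query)]


if __name__ == "__main__":
    result = search("how to reverse a string")
    print(result.docid)
    print(result.code)
```

The point of the sketch is the interface rather than the scorer: because the search step only ever returns a docid that already exists in the database, every returned sample is one that could have been documented, tested, and checked beforehand.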




