DiffSearch: A Scalable and Precise Search Engine for Code Changes

04/06/2022
by Luca Di Grazia, et al.

The source code of successful projects evolves continuously, resulting in hundreds of thousands of code changes stored in source code repositories. This wealth of data can be useful, e.g., to find changes similar to a planned code change or examples of recurring code improvements. This paper presents DiffSearch, a search engine that, given a query that describes a code change, returns a set of changes that match the query. The approach is enabled by three key contributions. First, we present a query language that extends the underlying programming language with wildcards and placeholders, providing an intuitive way of formulating queries that is easy to adapt to different programming languages. Second, to ensure scalability, the approach indexes code changes in a one-time preprocessing step, mapping them into a feature space, and then performs an efficient search in the feature space for each query. Third, to guarantee precision, i.e., that any returned code change indeed matches the given query, we present a tree-based matching algorithm that checks whether a query can be expanded to a concrete code change. We present implementations for Java, JavaScript, and Python, and show that the approach responds within seconds to queries across one million code changes, has a recall of 80.7%, enables users to find relevant code changes more effectively than a regular expression-based search, and is helpful for gathering a large-scale dataset of real-world bug fixes.
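
To make the two-phase idea in the abstract concrete (a one-time index followed by a precise match per query), here is a minimal Python sketch. It is not the DiffSearch implementation. The query syntax (`ID1`-style placeholders and a `<...>` wildcard), the `ChangeIndex` class, and all helper names are assumptions invented for illustration. Two simplifications are deliberate: an exact inverted index over tokens stands in for the feature-space search, and backtracking over token sequences stands in for the tree-based matching algorithm.

```python
import re
from collections import defaultdict

# Hypothetical toy syntax, not the paper's query language.
WILDCARD = "<...>"   # matches any run of tokens (possibly empty)
ARROW = "-->"        # separates the old side from the new side
TOKEN = re.compile(r"<\.\.\.>|[A-Za-z_]\w*|\d+|\S")
PLACEHOLDER = re.compile(r"ID\d*")        # ID, ID1, ID2, ... bind one identifier
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def tokenize(code):
    return TOKEN.findall(code)


def features(old, new):
    # A feature is a token tagged with the side of the change it occurs on.
    return {("old", t) for t in tokenize(old)} | {("new", t) for t in tokenize(new)}


def is_literal(token):
    return token != WILDCARD and not PLACEHOLDER.fullmatch(token)


def match(pattern, tokens, bindings=None):
    """Return placeholder bindings if pattern expands to tokens, else None."""
    bindings = {} if bindings is None else bindings
    if not pattern:
        return bindings if not tokens else None
    head, rest = pattern[0], pattern[1:]
    if head == WILDCARD:
        for i in range(len(tokens) + 1):
            if i > 0 and tokens[i - 1] == ARROW:
                break  # a wildcard may not swallow the old/new separator
            result = match(rest, tokens[i:], dict(bindings))
            if result is not None:
                return result
        return None
    if not tokens:
        return None
    if PLACEHOLDER.fullmatch(head):
        if not IDENTIFIER.fullmatch(tokens[0]):
            return None
        if bindings.get(head, tokens[0]) != tokens[0]:
            return None  # the same placeholder must bind the same identifier
        return match(rest, tokens[1:], {**bindings, head: tokens[0]})
    if head != tokens[0]:
        return None
    return match(rest, tokens[1:], bindings)


class ChangeIndex:
    def __init__(self, changes):
        # One-time preprocessing: build an inverted index feature -> change ids.
        self.changes = list(changes)
        self.postings = defaultdict(set)
        for cid, (old, new) in enumerate(self.changes):
            for feature in features(old, new):
                self.postings[feature].add(cid)

    def candidates(self, q_old, q_new):
        # Cheap filter: a change must contain every literal token of the query.
        ids = set(range(len(self.changes)))
        for feature in features(q_old, q_new):
            if is_literal(feature[1]):
                ids &= self.postings.get(feature, set())
        return ids

    def search(self, q_old, q_new):
        # Precise step: keep only candidates the query actually expands to.
        pattern = tokenize(q_old) + [ARROW] + tokenize(q_new)
        hits = []
        for cid in sorted(self.candidates(q_old, q_new)):
            old, new = self.changes[cid]
            if match(pattern, tokenize(old) + [ARROW] + tokenize(new)) is not None:
                hits.append((old, new))
        return hits


changes = [
    ("x.foo(a);", "x.bar(a);"),
    ("if (x > 0) {", "if (x >= 0) {"),
    ("y.foo(1, 2);", "z.bar(1, 2);"),
    ("log.foo();", "log.bar();"),
]
index = ChangeIndex(changes)
hits = index.search("ID1.foo(<...>);", "ID1.bar(<...>);")
print(hits)
```

In this sketch the index narrows four changes down to three candidates, and the matcher then rejects the third change because `ID1` would have to bind both `y` and `z`. The split mirrors the design described above: the index keeps each query cheap, while the final matching step is what ensures that every returned change really matches.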

research
06/08/2017

Source Forager: A Search Engine for Similar Source Code

Developers spend a significant amount of time searching for code: e.g., ...
research
11/24/2020

Code Search Intent Classification Using Weak Supervision

Developers use search for various tasks such as finding code, documentat...
research
05/25/2023

Beryllium: Neural Search for Algorithm Implementations

In this paper, we explore the feasibility of finding algorithm implement...
research
02/07/2019

How Different Are Different diff Algorithms in Git? Use --histogram for Code Changes

Automatic identification of the differences between two versions of a fi...
research
08/26/2020

MAR: A structure-based search engine for models

The availability of shared software models provides opportunities for re...
research
03/12/2020

Code Clone Matching: A Practical and Effective Approach to Find Code Snippets

Finding the same or similar code snippets in source code is one of funda...
research
01/27/2019

CRAQL: A Composable Language for Querying Source Code

This paper describes the design and implementation of CRAQL (Composable ...
