RepoFusion: Training Code Models to Understand Your Repository

06/19/2023
by Disha Shrivastava, et al.

Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names, etc.), thereby producing inaccurate code completions. This effect is more pronounced when using these assistants for repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models such as CodeGen-16B-multi (∼73× larger) and closely match the performance of the ∼70× larger StarCoderBase model that was trained with the Fill-in-the-Middle objective. We find these results to be a novel and compelling demonstration of the gains that training with repository context can bring. We carry out extensive ablation studies to investigate the impact of design choices such as context type, number of contexts, context length, and initialization within our framework. Lastly, we release Stack-Repo, a dataset of 200 Java repositories with permissive licenses and near-deduplicated files that are augmented with three types of repository contexts. Additionally, we make available the code and trained checkpoints for our work. Our released resources can be found at <https://huggingface.co/RepoFusion>.
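
To make the idea concrete: one natural way to use several repository contexts at once is to pair each context with the code surrounding the completion target and encode each pairing separately, fusing them during decoding. The sketch below illustrates how such training inputs might be assembled. It is a minimal illustration based on the description above, not the authors' released implementation; all names and fields (`CompletionExample`, `build_fused_inputs`, the context list) are assumptions made for this example.

```python
# Minimal sketch (illustrative, not the released RepoFusion code) of assembling
# per-encoder inputs: each repository context (e.g., imported files, parent
# classes, files with similar names) is paired with the code surrounding the
# completion "hole" and truncated to a fixed budget.

from dataclasses import dataclass
from typing import List

@dataclass
class CompletionExample:
    surrounding_code: str      # code immediately around the line to complete
    repo_contexts: List[str]   # e.g., imports, parent classes, similar-name files
    target_line: str           # the held-out single line to predict

def build_fused_inputs(ex: CompletionExample,
                       num_contexts: int = 4,
                       max_chars: int = 2048) -> List[str]:
    """Produce one encoder input per repository context by appending the
    surrounding code to each context."""
    inputs = []
    for ctx in ex.repo_contexts[:num_contexts]:
        joined = f"{ctx}\n{ex.surrounding_code}"
        inputs.append(joined[-max_chars:])  # keep the text nearest the hole
    return inputs

# Usage: each string in `inputs` would be encoded independently, and the
# decoder would attend over all encoded contexts jointly to generate the
# completion (here, the rest of a Java method).
example = CompletionExample(
    surrounding_code="public int area() { return width * ",
    repo_contexts=["import java.util.List;",
                   "class Shape { int width; int height; }"],
    target_line="height; }",
)
inputs = build_fused_inputs(example)
```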
