Guiding Language Models of Code with Global Context using Monitors

06/19/2023
by Lakshya A Agrawal, et al.

Language models of code (LMs) work well when the surrounding code in the vicinity of generation provides sufficient context. This is not true when it becomes necessary to use types or functionality defined in another module or library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating, e.g., using types defined in other files incorrectly. Recent work tries to overcome this issue by retrieving global information to augment the local context. However, this bloats the prompt or requires architecture modifications and additional training. Integrated development environments (IDEs) assist developers by bringing the global context to their fingertips using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose a notion of monitors that use static analysis in the background to guide the decoding. Unlike a priori retrieval, static analysis is invoked iteratively throughout the decoding process, providing the most relevant suggestions on demand. We demonstrate the usefulness of our proposal by monitoring for type-consistent use of identifiers whenever an LM generates code for an object dereference. To evaluate our approach, we curate PragmaticCode, a dataset of open-source projects with their development environments. On models of varying parameter scale, we show that monitor-guided decoding consistently improves an LM's ability to generate identifiers that match the ground truth, and also improves compilation rates and agreement with the ground truth. We find that LMs with fewer parameters, when guided with our monitor, can outperform larger LMs. With monitor-guided decoding, SantaCoder-1.1B achieves a better compilation rate and next-identifier match than the much larger text-davinci-003 model. The datasets and code will be released at https://aka.ms/monitors4codegen.
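To make the idea concrete, below is a minimal sketch of a single monitor-guided decoding step, assuming a HuggingFace-style causal LM and tokenizer. The function get_valid_members is a hypothetical stand-in for the static-analysis query (the paper obtains such information from a language server); this is an illustration of the general technique, not the authors' implementation, which is released at https://aka.ms/monitors4codegen.

    import torch

    def monitor_guided_step(model, tokenizer, input_ids, get_valid_members):
        """Decode one token; when the partial program ends in an object
        dereference ('.'), constrain the next token to type-consistent
        member identifiers supplied by static analysis."""
        with torch.no_grad():
            logits = model(input_ids).logits[0, -1]   # next-token logits
        prefix = tokenizer.decode(input_ids[0])
        if prefix.rstrip().endswith("."):             # crude dereference check
            # Query static analysis for members valid at this program point.
            members = get_valid_members(prefix)       # e.g. ["append", "pop"]
            # For brevity, constrain only the first sub-token of each valid
            # identifier; a full monitor tracks the constraint across steps.
            allowed = {tokenizer.encode(m, add_special_tokens=False)[0]
                       for m in members}
            mask = torch.full_like(logits, float("-inf"))
            mask[list(allowed)] = 0.0
            logits = logits + mask                    # mask inconsistent tokens
        next_id = int(torch.argmax(logits))           # greedy decoding for brevity
        return torch.cat([input_ids, input_ids.new_tensor([[next_id]])], dim=1)

Note that, unlike retrieval-augmented prompting, nothing is prepended to the prompt: the analysis is consulted on demand, only at dereference points, so the context window is not consumed by global information.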


