Prompt-tuned Code Language Model as a Neural Knowledge Base for Type Inference in Statically-Typed Partial Code

08/10/2022
by Qing Huang et al.

Partial code usually involves non-fully-qualified type names (non-FQNs) and undeclared receiving objects. Resolving the FQNs of these non-FQN types and undeclared receiving objects (referred to as type inference) is a prerequisite for the effective search and reuse of partial code. Existing dictionary-lookup-based methods build a symbolic knowledge base of API names and code contexts; they incur significant compilation overhead and are sensitive to unseen API names and variations in code context. In this paper, we formulate type inference as a cloze-style fill-in-the-blank language task. Building on the naturalness of source code, our approach fine-tunes a code masked language model (MLM) into a neural knowledge base of code elements, following a novel "pre-train, prompt and predict" paradigm over raw source code. Our approach is lightweight and has minimal requirements on code compilation. Unlike existing symbolic name and context matching for type inference, our prompt-tuned code MLM packs FQN syntax and usage into its parameters and supports fuzzy neural type inference. We systematically evaluate our approach on a large body of source code from GitHub and Stack Overflow. Our results confirm the effectiveness of our design and the practicality of our approach for partial-code type inference. As the first of its kind, our neural type inference method opens the door to many innovative ways of using partial code.
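To make the cloze-style formulation concrete, the sketch below masks the missing type in a partial Java snippet and asks a code MLM to fill in the blank. It uses the public microsoft/codebert-base-mlm checkpoint from Hugging Face as a stand-in, not the paper's prompt-tuned model, and a single mask recovers only one subtoken rather than a full multi-token FQN.

    from transformers import pipeline

    # Minimal sketch of type inference as cloze-style fill-in-the-blank.
    # The off-the-shelf CodeBERT MLM checkpoint stands in for the paper's
    # prompt-tuned model (an assumption for illustration only).
    fill_mask = pipeline("fill-mask", model="microsoft/codebert-base-mlm")

    # Partial code: the declared type of `reader` is missing.
    snippet = ("<mask> reader = new BufferedReader(new FileReader(path)); "
               "String line = reader.readLine();")

    # The MLM ranks candidates for the blank from what it has learned about
    # type syntax and usage; one <mask> yields a single subtoken, whereas the
    # paper's approach decodes full FQNs such as java.io.BufferedReader.
    for candidate in fill_mask(snippet, top_k=3):
        print(candidate["token_str"], round(candidate["score"], 3))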

Related research

06/21/2023
A Chain of AI-based Solutions for Resolving FQNs and Fixing Syntax Errors in Partial Code
API documentation, technical blogs and programming Q&A sites contain n...

09/14/2023
Pop Quiz! Do Pre-trained Code Models Possess Knowledge of Correct API Names?
Recent breakthroughs in pre-trained code models, such as CodeBERT and Co...

09/19/2019
DIRE: A Neural Approach to Decompiled Identifier Naming
The decompiler is one of the most common tools for examining binaries wi...

04/01/2017
Topic modeling of public repositories at scale using names in source code
Programming languages themselves have a limited number of reserved keywo...

03/26/2018
code2vec: Learning Distributed Representations of Code
We present a neural model for representing snippets of code as continuou...

12/16/2022
SE Factual Knowledge in Frozen Giant Code Model: A Study on FQN and its Retrieval
Pre-trained giant code models (PCMs) start coming into the developers' d...

08/01/2023
CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code
Recent works have widely adopted large language model pretraining for so...
