ToolCoder: Teach Code Generation Models to use API search tools

05/06/2023
by Kechi Zhang, et al.

Automatically generating source code from natural language descriptions has been a growing field of research in recent years. However, current large-scale code generation models often struggle to select appropriate APIs for a given context. They may generate APIs that do not meet the requirements or refer to APIs that do not exist in third-party libraries, especially for lesser-known or private libraries. Inspired by how human developers use tools to search for APIs, we propose ToolCoder, a novel approach that integrates API search tools with existing models to assist in code generation and API selection. To teach our model to use tools, we introduce an automated data annotation method that uses ChatGPT to add tool usage information to source code data, and we fine-tune code generation models on the annotated data. During inference, we integrate API search tools into the generation process so that the model can automatically call the search tool for suggestions when selecting an API. Our experimental results demonstrate that ToolCoder exhibits excellent performance and generalization across five public and private library code generation benchmarks, with at least a 6.21% improvement in average pass@1 and a 9.64% improvement in average pass@10 compared to state-of-the-art methods. Furthermore, we show that our relatively small ToolCoder model performs comparably to GPT-3.5, one of the current best models, highlighting the potential of incorporating programming tools into the code generation process.
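
The inference-time idea in the abstract, where generation pauses at an API choice, queries a search tool, and continues with the suggestion, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the paper's implementation: the `<search>...</search>` marker, the word-overlap `search_api` tool, the toy `index`, and the scripted stand-in for a model are all hypothetical. The sketch also simplifies by splicing the suggestion directly into the output in place of the tool call.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical marker: the model wraps a search query in <search>...</search>
# when it wants API suggestions. This syntax is invented for the sketch.
TOOL_CALL = re.compile(r"<search>(.*?)</search>$", re.DOTALL)


def search_api(query: str, index: Dict[str, str]) -> str:
    """Toy search tool: return the API whose description shares the most
    words with the query. A real tool might query documentation or the web."""
    words = set(query.lower().split())
    return max(index, key=lambda api: len(words & set(index[api].lower().split())))


def scripted_model(chunks: List[str]) -> Callable[[str], Optional[str]]:
    """Stand-in for a fine-tuned model: replays fixed chunks, then None."""
    it = iter(chunks)
    return lambda _text: next(it, None)


def generate_with_tool(
    prompt: str,
    step: Callable[[str], Optional[str]],
    index: Dict[str, str],
    max_steps: int = 20,
) -> str:
    """Generate chunk by chunk; when the text ends with a tool call,
    run the search and replace the call with the suggested API name."""
    text = prompt
    for _ in range(max_steps):
        chunk = step(text)
        if chunk is None:  # the model has finished
            break
        text += chunk
        match = TOOL_CALL.search(text)
        if match:
            suggestion = search_api(match.group(1), index)
            text = text[: match.start()] + suggestion
    return text


# Toy API index (hypothetical descriptions of real NumPy functions).
index = {
    "numpy.argsort": "return indices that would sort an array",
    "numpy.sort": "return a sorted copy of an array",
    "numpy.cumsum": "cumulative sum of array elements",
}
model = scripted_model(["order = ", "<search>indices that sort array</search>", "(values)\n"])
result = generate_with_tool("import numpy\nvalues = [3, 1, 2]\n", model, index)
print(result)
```

In this toy run the query overlaps most with the description of `numpy.argsort`, so the tool call is replaced by that name before generation resumes. Swapping `search_api` for a lookup over private-library documentation would follow the same loop.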


