Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow

05/23/2018
by Pengcheng Yin, et al.

For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source for such a data set: the questions are diverse, and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the code in the accepted answer) are limited both in their coverage and in the correctness of the NL-code pairs obtained. In this paper, we propose a novel method to mine high-quality aligned data from SO using two sets of features: hand-crafted features considering the structure of the extracted snippets, and correspondence features obtained by training a probabilistic model to capture the correlation between NL and code using neural networks. These features are fed into a classifier that determines the quality of mined NL-code pairs. Experiments using Python and Java as test beds show that the proposed method greatly expands coverage and improves accuracy over existing mining methods, even when using only a small number of labeled examples. Further, we find that reasonable results are achieved even when training the classifier on one language and testing on another, showing promise for scaling NL-code mining to a wide variety of programming languages beyond those for which we are able to annotate data.
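
As a rough illustration of the pipeline the abstract describes (structural features plus a correspondence feature, combined by a classifier), the following is a minimal sketch under stated assumptions, not the authors' implementation. The two structural features, the token-overlap score standing in for the neural correspondence model, and the fixed weights are all hypothetical; in the paper the correspondence features come from a trained probabilistic model and the classifier is learned from labeled examples.

```python
import math
import re


def structural_features(snippet):
    """Hypothetical hand-crafted features describing the snippet's structure."""
    lines = [ln for ln in snippet.splitlines() if ln.strip()]
    return [
        1.0 if len(lines) == 1 else 0.0,                         # single-line snippet
        1.0 if snippet.lstrip().startswith("import ") else 0.0,  # begins with an import
    ]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic/underscore tokens."""
    return set(re.findall(r"[A-Za-z_]+", text.lower()))


def correspondence_feature(intent, snippet):
    """Stand-in for the neural correspondence model (assumption):
    Jaccard overlap between NL tokens and code tokens."""
    a, b = tokenize(intent), tokenize(snippet)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def pair_score(intent, snippet, weights, bias):
    """Logistic score in (0, 1) estimating the quality of an NL-code pair."""
    feats = structural_features(snippet) + [correspondence_feature(intent, snippet)]
    z = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))


# Illustrative weights only; a real classifier would learn these from labeled pairs.
WEIGHTS = [0.5, -1.0, 4.0]
BIAS = -1.0

intent = "sort a list in reverse order"
candidates = ["import os", "sorted(my_list, reverse=True)"]
best = max(candidates, key=lambda c: pair_score(intent, c, WEIGHTS, BIAS))
print(best)
```

With these illustrative weights, the candidate that shares vocabulary with the intent and is not a bare import statement receives the higher score. Swapping the overlap function for a learned model would leave the rest of the sketch unchanged.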


