A parallel corpus of Python functions and documentation strings for automated code documentation and code generation

Automated documentation of programming source code and automated code generation from natural language are challenging tasks of both practical and scientific interest. Progress in these areas has been limited by the low availability of parallel corpora of code and natural language descriptions, which tend to be small and constrained to specific domains. In this work we introduce a large and diverse parallel corpus of a hundred thousand Python functions with their documentation strings ("docstrings"), generated by scraping open-source repositories on GitHub. We describe baseline results for the code documentation and code generation tasks obtained by neural machine translation. We also experiment with data augmentation techniques to further increase the amount of training data. We release our datasets and processing scripts in order to stimulate research in these areas.
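To make the notion of a function/docstring pair concrete, the following is a minimal sketch of how such pairs could be pulled from a Python source file using the standard `ast` module. It is an illustration only and not the released processing scripts; the function name `extract_pairs` and the sample input are hypothetical.

```python
import ast


def extract_pairs(source):
    """Return (name, docstring, code) tuples for every documented function.

    Hypothetical helper for illustration; not taken from the paper's scripts.
    """
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        # Consider both regular and async function definitions.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if doc:  # skip functions that have no docstring
                # get_source_segment requires Python 3.8 or later.
                code = ast.get_source_segment(source, node)
                pairs.append((node.name, doc, code))
    return pairs


# Made-up sample input: one documented and one undocumented function.
sample = (
    "def add(a, b):\n"
    '    """Return the sum of a and b."""\n'
    "    return a + b\n"
    "\n"
    "def undocumented(x):\n"
    "    return x\n"
)

for name, doc, code in extract_pairs(sample):
    print(name, "->", doc)
```

Only functions that carry a docstring yield a pair, so the undocumented function in the sample is skipped. A real pipeline would also need to separate the docstring from the function body and handle files that fail to parse, which this sketch does not attempt.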

