CoTexT: Multi-task Learning with Code-Text Transformer

05/18/2021
by Long Phan, et al.

We present CoTexT, a pre-trained, transformer-based encoder-decoder model that learns the representative context between natural language (NL) and programming language (PL). Using self-supervision, CoTexT is pre-trained on large programming language corpora to learn a general understanding of language and code. CoTexT supports downstream NL-PL tasks such as code summarization/documentation, code generation, defect detection, and code debugging. We train CoTexT on different combinations of available PL corpora, including both "bimodal" and "unimodal" data. Here, bimodal data is the combination of text and corresponding code snippets, whereas unimodal data consists of code snippets alone. We first evaluate CoTexT with multi-task learning: we perform Code Summarization on 6 different programming languages and Code Refinement on both the small and medium datasets featured in CodeXGLUE. We further conduct extensive experiments to investigate CoTexT on other tasks within the CodeXGLUE dataset, including Code Generation and Defect Detection. We consistently achieve state-of-the-art (SOTA) results on these tasks, demonstrating the versatility of our models.
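
The distinction between bimodal and unimodal data, and the idea of serving several tasks with one model, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Sample` class, the helper names, the way text and code are concatenated, and the task prefix strings are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Sample:
    """One corpus entry (hypothetical structure, not from the paper)."""
    code: str
    text: Optional[str] = None  # docstring or comment; None for code-only entries


def is_bimodal(sample: Sample) -> bool:
    """Bimodal: text paired with its code. Unimodal: code only."""
    return sample.text is not None and sample.text.strip() != ""


def to_pretraining_input(sample: Sample) -> str:
    """Flatten a sample into one sequence for self-supervised pre-training.

    Assumption: bimodal samples are joined as "<text> <code>"; the real
    input format may differ.
    """
    if is_bimodal(sample):
        return f"{sample.text.strip()} {sample.code.strip()}"
    return sample.code.strip()


def to_multitask_pair(task: str, source: str, target: str) -> Tuple[str, str]:
    """Build a (source, target) pair for multi-task fine-tuning.

    Assumption: tasks are distinguished by a plain-text prefix on the
    source sequence; the prefix strings here are illustrative.
    """
    return f"{task}: {source}", target


if __name__ == "__main__":
    bimodal = Sample(code="def add(a, b): return a + b", text="Add two numbers.")
    unimodal = Sample(code="def add(a, b): return a + b")
    print(to_pretraining_input(bimodal))
    print(to_pretraining_input(unimodal))
    print(to_multitask_pair("summarize python", bimodal.code, bimodal.text))
```

With a prefix scheme of this kind, examples from Code Summarization and Code Refinement could be mixed in one training set while the model still sees which task each source sequence belongs to.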


