ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

12/13/2022
by   Yekun Chai, et al.
0

Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We will make our code and pre-trained models publicly available.

READ FULL TEXT

page 8

page 9

page 20

page 21

research
05/02/2021

Larger-Scale Transformers for Multilingual Masked Language Modeling

Recent work has demonstrated the effectiveness of cross-lingual language...
research
04/01/2021

UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

Vision-and-language pre-training has achieved impressive success in lear...
research
06/13/2023

Soft Language Clustering for Multilingual Model Pre-training

Multilingual pre-trained language models have demonstrated impressive (z...
research
04/16/2021

ProphetNet-X: Large-Scale Pre-training Models for English, Chinese, Multi-lingual, Dialog, and Code Generation

Now, the pre-training technique is ubiquitous in natural language proces...
research
10/26/2022

Multi-lingual Evaluation of Code Generation Models

We present MBXP, an execution-based code completion benchmark in 10+ pro...
research
10/12/2021

Learning Compact Metrics for MT

Recent developments in machine translation and multilingual text generat...
research
02/08/2023

Syntax and Domain Aware Model for Unsupervised Program Translation

There is growing interest in software migration as the development of so...

Please sign up or login with your details

Forgot password? Click here to reset