Multilingual training for Software Engineering

12/03/2021
by Toufique Ahmed, et al.
Well-trained machine-learning models, which leverage large amounts of open-source software data, have now become an interesting approach to automating many software engineering tasks. Several SE tasks have been subject to this approach, with performance gradually improving over the past several years thanks to better models and training methods. More, and more diverse, clean, labeled data is better for training; but constructing good-quality datasets is time-consuming and challenging. Ways of augmenting the volume and diversity of clean, labeled data generally have wide applicability. For some languages (e.g., Ruby), labeled data is less abundant; in others (e.g., JavaScript), the available data may be more focused on some application domains, and thus less diverse. As a way around such data bottlenecks, we present evidence suggesting that human-written code in different languages (which performs the same function) is rather similar, and particularly preserving of identifier naming patterns; we further present evidence suggesting that identifiers are a very important element of training data for software engineering tasks. We leverage this rather fortuitous phenomenon to find evidence that available multilingual training data (across different languages) can be used to amplify performance. We study this for three different tasks: code summarization, code retrieval, and function naming. We note that this data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models.


