On the Impact of Language Selection for Training and Evaluating Programming Language Models

08/25/2023
by Jonathan Katzy, et al.

Recent advances in Transformer-based language models have demonstrated significant potential for enhancing the multilingual capabilities of these models. This progress applies not only to natural language tasks but also to programming languages. Although these models can learn from multiple languages, evaluations typically focus on particular combinations of the same languages. In this study, we evaluate the similarity of programming languages by analyzing their representations using a CodeBERT-based model. Our experiments reveal that token representations in languages such as C++, Python, and Java lie close to one another, whereas the same tokens in languages such as Mathematica and R are significantly dissimilar. Our findings suggest that this phenomenon can result in performance challenges when dealing with diverse languages. We therefore recommend using our similarity measure to select a diverse set of programming languages when training and evaluating future models.
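The abstract does not spell out the similarity measure, so the following is only a minimal sketch of one plausible reading: compare the vectors that a model assigns to the same token in two languages and average the cosine similarity over the tokens they share. The function names, the choice of cosine similarity, and the toy vectors are all assumptions made for illustration; they are not the authors' method or data, and in practice the vectors would come from a model such as CodeBERT.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def language_similarity(reps_a, reps_b):
    """Mean cosine similarity over the tokens present in both languages.

    reps_a and reps_b map a token string to its representation vector
    in one language (hypothetical input format).
    """
    shared = sorted(set(reps_a) & set(reps_b))
    if not shared:
        raise ValueError("the two languages share no tokens")
    return sum(cosine(reps_a[t], reps_b[t]) for t in shared) / len(shared)


# Made-up 2-dimensional vectors, purely to exercise the functions.
# They are NOT measurements from the paper or from any model.
toy_reps = {
    "lang_a": {"if": [1.0, 0.0], "return": [0.0, 1.0]},
    "lang_b": {"if": [1.0, 0.0], "return": [0.0, 1.0]},
    "lang_c": {"if": [0.0, 1.0], "return": [1.0, 0.0]},
}

sim_ab = language_similarity(toy_reps["lang_a"], toy_reps["lang_b"])
sim_ac = language_similarity(toy_reps["lang_a"], toy_reps["lang_c"])
```

Under this reading, a score near 1 means the shared tokens are represented almost identically in the two languages, while a low score marks a pair that adds diversity to a training or evaluation set.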


