The Devil is in the Tails: How Long-Tailed Code Distributions Impact Large Language Models

09/07/2023
by Xin Zhou, et al.

Learning-based techniques, especially advanced Large Language Models (LLMs) for code, have gained considerable popularity in various software engineering (SE) tasks. However, most existing works focus on designing better learning-based models and pay less attention to the properties of datasets. Learning-based models, including popular LLMs for code, heavily rely on data, and the data's properties (e.g., data distribution) could significantly affect their behavior. We conducted an exploratory study on the distribution of SE data and found that such data usually follows a skewed distribution (i.e., long-tailed distribution) where a small number of classes have an extensive collection of samples, while a large number of classes have very few samples. We investigate three distinct SE tasks and analyze the impacts of long-tailed distribution on the performance of LLMs for code. Our experimental results reveal that the long-tailed distribution has a substantial impact on the effectiveness of LLMs for code. Specifically, LLMs for code perform between 30.0% and 254.0% worse on data samples associated with infrequent labels compared to data samples of frequent labels. Our study provides a better understanding of the effects of long-tailed distributions on popular LLMs for code and insights for the future development of SE automation.
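
To make the head-versus-tail comparison concrete, here is a minimal sketch of how one might measure it on a labeled dataset. It is not the authors' procedure: the rule for splitting labels (the most frequent labels that together cover half of the samples form the head), the function names split_head_tail and accuracy_by_group, and the head_fraction parameter are all assumptions made for illustration.

```python
from collections import Counter


def split_head_tail(labels, head_fraction=0.5):
    """Partition distinct labels into a head set and a tail set.

    Assumption (not taken from the paper): the head is the smallest set of
    most-frequent labels whose samples cover at least `head_fraction` of
    the data; every other label belongs to the tail.
    """
    counts = Counter(labels)
    total = len(labels)
    head, covered = set(), 0
    for label, n in counts.most_common():
        if covered >= head_fraction * total:
            break
        head.add(label)
        covered += n
    tail = set(counts) - head
    return head, tail


def accuracy_by_group(y_true, y_pred, head):
    """Return accuracy separately for samples with head and tail labels."""
    hits = {"head": 0, "tail": 0}
    totals = {"head": 0, "tail": 0}
    for truth, pred in zip(y_true, y_pred):
        group = "head" if truth in head else "tail"
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {
        group: hits[group] / totals[group] if totals[group] else float("nan")
        for group in totals
    }


if __name__ == "__main__":
    # Toy data: one frequent label and three rare ones.
    y_true = ["a"] * 6 + ["b"] * 2 + ["c", "d"]
    y_pred = ["a"] * 6 + ["a"] * 2 + ["c", "x"]
    head, tail = split_head_tail(y_true)
    print(sorted(head), sorted(tail))
    print(accuracy_by_group(y_true, y_pred, head))
```

On real data, y_true would hold the ground-truth labels of a test set and y_pred the model's predictions; a large gap between the two accuracies is the kind of effect the abstract describes.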

