The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python

Large Language Models (LLMs) have been applied successfully to code generation tasks, raising the question of how well these models understand programming. Typical programming languages have invariances and equivariances in their semantics that human programmers intuitively understand and exploit, such as the (near) invariance to the renaming of identifiers. We show that LLMs fail to generate correct Python code when default function names are swapped, and that some of them become more confident in their incorrect predictions as model size increases. This is an instance of the recently discovered phenomenon of Inverse Scaling, which runs contrary to the commonly observed trend of prediction quality improving with model size. Our findings indicate that, despite their astonishing typical-case performance, LLMs still lack a deep, abstract understanding of the content they manipulate, which makes them unsuitable for tasks that statistically deviate from their training data, and that mere scaling is not enough to achieve such a capability.
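
To make the notion of an identifier swap concrete, the following is a minimal sketch and not the paper's actual prompts or evaluation code. It assumes the swap rebinds the builtin names `len` and `print` to each other; the function names `correct_after_swap` and `naive_after_swap` are hypothetical and exist only for illustration. The swap is confined to local scope so the rest of the program is unaffected.

```python
import builtins


def correct_after_swap(items):
    # Rebind the two builtin names locally: `len` now refers to the
    # original print function and `print` to the original length function.
    len, print = builtins.print, builtins.len
    # A completion that respects the swap must call `print` to get a length.
    return print(items)


def naive_after_swap(items):
    # Same swap as above.
    len, print = builtins.print, builtins.len
    # A completion that ignores the swap calls `len`, which is now the
    # original print function: it writes to stdout and returns None.
    return len(items)


if __name__ == "__main__":
    data = [1, 2, 3]
    assert correct_after_swap(data) == 3
    assert naive_after_swap(data) is None
```

The second function is the statistically familiar continuation, yet it is wrong under the swapped bindings. This is the kind of error the abstract describes.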
