Benchmarking Language Models for Code Syntax Understanding

10/26/2022
by Da Shen, et al.

Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding. These models represent the input as a token sequence without explicitly modeling its structure. Prior work has shown that pre-trained language models can capture the syntactic rules of natural languages without being fine-tuned on syntax understanding tasks. However, how well pre-trained models understand code structure remains poorly understood. In this work, we perform the first thorough benchmarking of state-of-the-art pre-trained models for identifying the syntactic structures of programs. Specifically, we introduce CodeSyntax, a large-scale dataset of programs annotated with the syntactic relationships in their corresponding abstract syntax trees. Our key observation is that existing language models pre-trained on code still lack an understanding of code syntax: these models fail to match the performance of simple baselines based on positional offsets and keywords. We also present a natural language benchmark to highlight the differences between natural languages and programming languages in terms of syntactic structure understanding. Our findings point out key limitations of existing pre-training methods for programming languages and suggest the importance of modeling code syntactic structures.
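To make the two key ideas concrete, here is a minimal sketch, not the paper's actual annotation pipeline or baseline definitions. It uses Python's built-in `ast` module to enumerate the parent-child syntactic relations an AST exposes (the kind of relationships CodeSyntax annotates), and includes a toy positional-offset baseline of the sort the abstract alludes to. The function names and the relation naming scheme are our own illustrative assumptions.

```python
import ast

def ast_relations(source: str):
    """Yield (relation, parent, child) edges from a program's AST.

    The relation label "NodeType.field" (e.g. "If.test", "Call.args")
    is an illustrative convention, not CodeSyntax's actual format.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        for field, value in ast.iter_fields(node):
            children = value if isinstance(value, list) else [value]
            for child in children:
                if isinstance(child, ast.AST):
                    yield f"{type(node).__name__}.{field}", node, child

def offset_baseline(tokens, index, offset=1):
    """Toy positional baseline: predict the token a fixed offset away
    as the related token. Illustrative only; the paper defines its
    baselines over its own relation set."""
    target = index + offset
    return target if 0 <= target < len(tokens) else None

if __name__ == "__main__":
    code = "x = foo(1, 2)\nif x:\n    print(x)"
    for relation, parent, child in ast_relations(code):
        print(relation, "->", type(child).__name__)
```

A baseline like `offset_baseline` is strong for code precisely because programming language syntax is rigid: for many relations (e.g. a keyword and its operand), the related token sits at a nearly fixed positional offset, which is the regularity the paper's simple baselines exploit.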

