Demystifying What Code Summarization Models Learned

03/04/2023
by Yu Wang, et al.

Studying the patterns that models have learned has long been a focus of pattern recognition research. Explaining what patterns are discovered from training data, and how those patterns generalize to unseen data, is instrumental to understanding and advancing pattern recognition methods. Unfortunately, the vast majority of application domains deal with continuous data (i.e., statistical in nature) from which extracted patterns cannot be formally defined. For example, in image classification there is no principled definition for the label cat or dog. Even in natural language, the meaning of a word can vary with its surrounding context. Unlike these data formats, programs are a unique data structure with well-defined syntax and semantics, which creates a golden opportunity to formalize what models have learned from source code. This paper presents the first formal definition of the patterns discovered by code summarization models (i.e., models that predict the name of a method given its body), and gives a sound algorithm to infer a context-free grammar (CFG) that formally describes the learned patterns. We realize our approach in PATIC, which produces CFGs summarizing the patterns discovered by code summarization models. In particular, we pick two prominent instances, code2vec and code2seq, to evaluate PATIC. PATIC shows that the patterns extracted by each model are heavily restricted to local, syntactic code structures with little to no semantic implication. Based on these findings, we present two example uses of the formal definition of patterns: a new method for evaluating the robustness of code summarization models and a new technique for improving their accuracy. Our work opens up an exciting new direction of studying what models have learned from source code.
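To make the abstract's central idea concrete, the minimal sketch below shows what a grammar over purely local, syntactic patterns could look like for the method-naming task. This is a hypothetical toy, not PATIC or its actual inferred grammars: the function `predict_name_prefix`, the three production rules, and the `get`/`set`/`is` prefixes are all illustrative assumptions, chosen only to mirror the paper's finding that such models key on shallow syntactic cues rather than semantics.

```python
# Hypothetical toy grammar fragment, in the spirit of a CFG over learned patterns:
#   MethodName -> Prefix Rest
#   Prefix     -> "get" | "set" | "is" | "unknown"
# Each Prefix production fires on a purely local, syntactic cue in the method body.

import ast

def predict_name_prefix(method_source: str) -> str:
    """Guess a method-name prefix from shallow syntax alone (no semantic analysis)."""
    func = ast.parse(method_source).body[0]
    assert isinstance(func, ast.FunctionDef)
    body = func.body
    # Rule 1: body is a single `return self.<attr>`        =>  "get"
    if (len(body) == 1 and isinstance(body[0], ast.Return)
            and isinstance(body[0].value, ast.Attribute)):
        return "get"
    # Rule 2: body is a single assignment to `self.<attr>` =>  "set"
    if (len(body) == 1 and isinstance(body[0], ast.Assign)
            and isinstance(body[0].targets[0], ast.Attribute)):
        return "set"
    # Rule 3: body returns a comparison                    =>  "is"
    if (len(body) == 1 and isinstance(body[0], ast.Return)
            and isinstance(body[0].value, ast.Compare)):
        return "is"
    return "unknown"

print(predict_name_prefix("def f(self):\n    return self.name"))   # get
print(predict_name_prefix("def f(self, v):\n    self.name = v"))   # set
print(predict_name_prefix("def f(self):\n    return self.n > 0"))  # is
```

Note how each rule inspects only the shape of the AST, never the meaning of the code; this is exactly the kind of local, syntax-only pattern the paper reports code2vec and code2seq relying on.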
