Evaluating Semantic Representations of Source Code

10/11/2019
by   Yaza Wainakh, et al.

Learned representations of source code enable various software developer tools, e.g., to detect bugs or to predict program properties. At the core of most code representations are word embeddings of identifier names, since identifiers account for the majority of the source code vocabulary and convey important semantic information. Unfortunately, there currently is no generally accepted way of evaluating the quality of word embeddings of identifiers, and current evaluations are biased toward specific downstream tasks. This paper presents IdBench, the first benchmark for evaluating to what extent word embeddings of identifiers represent semantic relatedness and similarity. The benchmark is based on thousands of ratings gathered by surveying 500 software developers. We use IdBench to evaluate state-of-the-art embedding techniques proposed for natural language, an embedding technique specifically designed for source code, and lexical string distance functions, which are often used in current developer tools. Our results show that the effectiveness of embeddings varies significantly across techniques and that the best available embeddings successfully represent semantic relatedness. On the downside, no existing embedding provides a satisfactory representation of semantic similarity, e.g., because embeddings consider identifiers with opposing meanings as similar, which may lead to fatal mistakes in downstream developer tools. IdBench provides a gold standard to guide the development of novel embeddings that address these limitations.
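The style of evaluation the abstract describes can be sketched in a few lines: score each identifier pair with an embedding-based similarity (cosine) and with a lexical string-distance baseline, then correlate each set of scores with developer ratings using Spearman rank correlation. The identifiers, vectors, and ratings below are purely hypothetical illustrations, not data from IdBench, and the string baseline shown (difflib's sequence-matching ratio) stands in for whatever distance function a tool might use.

```python
import difflib

# Hypothetical toy embeddings for a few identifiers; a real evaluation
# would load vectors trained by an embedding technique (FastText, etc.).
EMBEDDINGS = {
    "maxLength": [0.9, 0.1, 0.3],
    "maxLen":    [0.8, 0.2, 0.3],
    "minLength": [0.9, 0.1, -0.4],
    "length":    [0.9, 0.1, 0.2],
    "size":      [0.7, 0.3, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def string_similarity(a, b):
    """Lexical baseline: difflib's matching-character ratio in [0, 1]."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def spearman(xs, ys):
    """Spearman rank correlation for tie-free score lists."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # equal for rx and ry (no ties)
    return cov / var

# Identifier pairs with hypothetical developer similarity ratings:
# an abbreviation (high), a synonym pair (high), an antonym pair (low).
pairs = [("maxLength", "maxLen"), ("length", "size"), ("maxLength", "minLength")]
human = [0.9, 0.8, 0.2]

emb_scores = [cosine(EMBEDDINGS[a], EMBEDDINGS[b]) for a, b in pairs]
str_scores = [string_similarity(a, b) for a, b in pairs]

print("embedding vs. human ratings:", spearman(emb_scores, human))
print("string    vs. human ratings:", spearman(str_scores, human))
```

The antonym pair illustrates the failure mode the paper highlights: `maxLength` and `minLength` are lexically near-identical, so the string baseline ranks them as highly similar even though developers rate them as dissimilar, which lowers its rank correlation with the human ratings.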


