Topic modeling of public repositories at scale using names in source code

04/01/2017
by   Vadim Markovtsev, et al.
0

Programming languages themselves have a limited number of reserved keywords and character based tokens that define the language specification. However, programmers have a rich use of natural language within their code through comments, text literals and naming entities. The programmer defined names that can be found in source code are a rich source of information to build a high level understanding of the project. The goal of this paper is to apply topic modeling to names used in over 13.6 million repositories and perceive the inferred topics. One of the problems in such a study is the occurrence of duplicate repositories not officially marked as forks (obscure forks). We show how to address it using the same identifiers which are extracted for topic modeling. We open with a discussion on naming in source code, we then elaborate on our approach to remove exact duplicate and fuzzy duplicate repositories using Locality Sensitive Hashing on the bag-of-words model and then discuss our work on topic modeling; and finally present the results from our data analysis together with open-access to the source code, tools and datasets.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/13/2019

Associating Natural Language Comment and Source Code Entities

Comments are an integral part of software development; they are natural ...
research
08/04/2017

On the Effect of Semantically Enriched Context Models on Software Modularization

Many of the existing approaches for program comprehension rely on the li...
research
05/28/2023

RefBERT: A Two-Stage Pre-trained Framework for Automatic Rename Refactoring

Refactoring is an indispensable practice of improving the quality and ma...
research
09/01/2021

An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags

This paper presents an ensemble part-of-speech tagging approach for sour...
research
03/20/2021

Keywords Guided Method Name Generation

High quality method names are descriptive and readable, which are helpfu...
research
08/10/2022

Prompt-tuned Code Language Model as a Neural Knowledge Base for Type Inference in Statically-Typed Partial Code

Partial code usually involves non-fully-qualified type names (non-FQNs) ...
research
03/31/2019

Exploring the Generality of a Java-based Loop Action Model for the Quorum Programming Language

Many algorithmic steps require more than one statement to implement, but...

Please sign up or login with your details

Forgot password? Click here to reset