Predicting the Programming Language of Questions and Snippets of StackOverflow Using Natural Language Processing

09/21/2018
by Kamel Alreshedy, et al.

Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, it hosts questions that usually contain a code snippet. Stack Overflow relies on users to tag the programming language of a question correctly, and it simply assumes that the snippets inside a question are written in the language named by the question's tag. In this paper, we propose a classifier that predicts the programming language of questions posted on Stack Overflow using Natural Language Processing (NLP) and Machine Learning (ML). The classifier achieves an accuracy of 91.1% by combining features from the title, body, and code snippets of the question. We also propose a classifier that uses only the title and body of the question and has an accuracy of 81.1%, and a classifier that uses code snippets only and achieves an accuracy of 77.7%. Applying Machine Learning techniques to the combination of text and code snippets of a question therefore provides the best performance. These results also demonstrate that it is possible to identify the programming language of a snippet of only a few lines of source code. We visualize the feature space of two programming languages, Java and SQL, to identify special properties of the information inside the corresponding Stack Overflow questions.
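To make the idea of combining title, body, and code snippet features concrete, here is a minimal sketch in Python. The abstract does not name the model or the feature extraction used, so this sketch swaps in a small multinomial Naive Bayes over bag-of-words tokens purely for illustration. The names `tokenize`, `featurize`, `NaiveBayes`, and `predict_language`, and the four toy training questions, are hypothetical and not taken from the paper.

```python
import math
import re
from collections import Counter, defaultdict

# Words plus single punctuation marks; punctuation is a useful cue in code.
TOKEN_RE = re.compile(r"[A-Za-z_]+|[^\sA-Za-z_0-9]")


def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def featurize(title, body, snippet):
    """Combine the three parts of a question into one token list."""
    return tokenize(title) + tokenize(body) + tokenize(snippet)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed model)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy questions: (title, body, snippet) with a language tag.
TRAIN = [
    (("How to iterate a HashMap",
      "I want to loop over entries of a map in Java",
      "for (Map.Entry<String, Integer> e : map.entrySet()) "
      "{ System.out.println(e.getKey()); }"), "java"),
    (("NullPointerException when calling method",
      "My object is null and the class throws an exception",
      "public static void main(String[] args) "
      "{ String s = null; s.length(); }"), "java"),
    (("Join two tables",
      "I need rows from both tables matched by id in my query",
      "SELECT a.name, b.total FROM users a "
      "JOIN orders b ON a.id = b.user_id;"), "sql"),
    (("Count rows per group",
      "How do I count records grouped by a column in the database",
      "SELECT city, COUNT(*) FROM customers GROUP BY city;"), "sql"),
]

model = NaiveBayes().fit(
    [featurize(*fields) for fields, _ in TRAIN],
    [label for _, label in TRAIN],
)


def predict_language(title, body, snippet):
    """Predict a language tag from all three parts of a question."""
    return model.predict(featurize(title, body, snippet))
```

Passing an empty string for `snippet`, or for `title` and `body`, gives the text-only and snippet-only variants that the abstract compares against the combined classifier. Four training questions are far too few to say anything about accuracy; the sketch only shows the shape of the pipeline.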

