SCC: Automatic Classification of Code Snippets

09/21/2018
by Kamel Alreshedy, et al.

Determining the programming language of a source code file has been studied in the research community, and Machine Learning (ML) and Natural Language Processing (NLP) algorithms have been shown to be effective at identifying the language of whole source files. However, determining the programming language of a code snippet, or of just a few lines of source code, remains a challenging task. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we describe Source Code Classification (SCC), a classifier that can identify the programming language of code snippets written in 21 different programming languages. It employs a Multinomial Naive Bayes (MNB) classifier trained on Stack Overflow posts. It is shown to achieve an accuracy of 75%, compared with a proprietary online classifier of snippets whose accuracy is only 55.5%. The average scores for precision, recall and F1 with the proposed tool are 0.76, 0.75 and 0.75, respectively. In addition, SCC can distinguish between code snippets from a family of programming languages such as C, C++ and C#, and can also identify the programming language version, such as C# 3.0, C# 4.0 and C# 5.0.
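To make the idea of a Multinomial Naive Bayes snippet classifier concrete, here is a minimal sketch in plain Python. It is not the SCC implementation: the abstract does not specify the features or tokenization used, so the `tokenize` function, the `TinyMultinomialNB` class, and the hand-written `TRAIN` snippets (standing in for Stack Overflow posts) are all assumptions made for illustration, covering three languages instead of 21.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed tokenization: identifiers/keywords plus single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|[^\sA-Za-z_0-9]")


def tokenize(snippet):
    """Split a code snippet into word-like tokens and punctuation symbols."""
    return TOKEN_RE.findall(snippet)


class TinyMultinomialNB:
    """Hypothetical, minimal Multinomial Naive Bayes with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, snippets, labels):
        self.token_counts = defaultdict(Counter)  # label -> token frequencies
        self.doc_counts = Counter(labels)         # label -> number of snippets
        for snippet, label in zip(snippets, labels):
            self.token_counts[label].update(tokenize(snippet))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        self.total_docs = len(labels)
        return self

    def log_scores(self, snippet):
        """Return log P(label) + sum of log P(token | label) for each label."""
        tokens = [t for t in tokenize(snippet) if t in self.vocab]
        vocab_size = len(self.vocab)
        scores = {}
        for label, counts in self.token_counts.items():
            total = sum(counts.values())
            score = math.log(self.doc_counts[label] / self.total_docs)
            for t in tokens:
                score += math.log(
                    (counts[t] + self.alpha) / (total + self.alpha * vocab_size)
                )
            scores[label] = score
        return scores

    def predict(self, snippet):
        scores = self.log_scores(snippet)
        return max(scores, key=scores.get)


# Toy training data (illustrative only, not taken from the paper's dataset).
TRAIN = [
    ("def add(a, b):\n    return a + b", "python"),
    ("for x in range(10):\n    print(x)", "python"),
    ("import os\nprint(os.getcwd())", "python"),
    ('#include <stdio.h>\nint main() { printf("hi"); return 0; }', "c"),
    ("int *p = malloc(sizeof(int)); free(p);", "c"),
    ("struct node { int val; struct node *next; };", "c"),
    ("public class Main { public static void main(String[] args) { } }", "java"),
    ('System.out.println("hi");', "java"),
    ("List<String> xs = new ArrayList<>();", "java"),
]

if __name__ == "__main__":
    clf = TinyMultinomialNB().fit([s for s, _ in TRAIN], [l for _, l in TRAIN])
    for snippet in ["def square(n):\n    return n * n", "int main() { return 0; }"]:
        print(repr(snippet), "->", clf.predict(snippet))
```

Tokens that never appeared in training are ignored at prediction time, and the smoothing parameter `alpha` keeps a single unseen token from zeroing out a class. With so little training data the sketch is only meaningful for snippets that share vocabulary with `TRAIN`; telling apart close relatives such as C, C++ and C#, as the abstract describes, would need far more data and likely richer features.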


