psc2code: Denoising Code Extraction from Programming Screencasts

03/22/2021
by Lingfeng Bao, et al.

In this paper, we propose psc2code, an approach that denoises the process of extracting source code from programming screencasts. First, psc2code uses Convolutional Neural Network (CNN) based image classification to remove non-code and noisy-code frames. Then, it performs edge detection and clustering-based image segmentation to detect sub-windows in a code frame and, based on the detected sub-windows, identifies and crops the screen region that is most likely to be a code editor. Finally, psc2code calls the API of a professional OCR tool to extract source code from the cropped code regions, and corrects errors in the OCRed source code using cross-frame information from the screencast and a statistical language model built from a large corpus of source code. We conduct an experiment on 1,142 programming screencasts from YouTube. Our CNN-based image classification technique effectively removes non-code and noisy-code frames, achieving an F1-score of 0.95 on the valid code frames. Based on the source code denoised by psc2code, we implement two applications: 1) a programming screencast search engine; 2) an interaction-enhanced programming screencast watching tool. Using the source code extracted from the 1,142 collected screencasts, our experiments show that the search engine achieves precision@5, precision@10, and precision@20 of 0.93, 0.81, and 0.63, respectively.
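For readers unfamiliar with the precision@k metric reported for the search engine, the following is a minimal sketch of its standard definition: the fraction of the top k returned results that are relevant, averaged over queries. The function names, the video identifiers, and the choice to divide by k even when fewer than k results are returned are illustrative assumptions; the abstract does not describe the paper's exact evaluation protocol.

```python
from typing import List, Sequence, Set, Tuple


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant.

    Assumption: the denominator is always k, so a result list shorter
    than k is penalized for the missing positions.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def mean_precision_at_k(
    runs: List[Tuple[Sequence[str], Set[str]]], k: int
) -> float:
    """Average precision@k over (ranked results, relevant set) pairs."""
    if not runs:
        raise ValueError("at least one query is required")
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical query results; the IDs are made up for illustration.
    ranked = ["v1", "v2", "v3", "v4", "v5"]
    relevant = {"v1", "v3", "v9"}
    print(precision_at_k(ranked, relevant, 5))
```

Under this definition, a precision@5 of 0.93 would mean that, averaged over the evaluated queries, about 4.65 of the first five returned screencasts were judged relevant.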


research
09/21/2018

SCC: Automatic Classification of Code Snippets

Determining the programming language of a source code file has been cons...
research
10/03/2021

DeepSCC: Source Code Classification Based on Fine-Tuned RoBERTa

In software engineering-related tasks (such as programming language tag ...
research
03/15/2018

Using StackOverflow content to assist in code review

An important goal for programmers is to minimize cost of identifying and...
research
04/29/2021

Using Paragraph Vectors to improve our existing code review assisting tool-CRUSO

Code reviews are one of the effective methods to estimate defectiveness ...
research
05/25/2022

Towards Using Data-Influence Methods to Detect Noisy Samples in Source Code Corpora

Despite the recent trend of developing and applying neural source code m...
research
07/10/2023

Calculating Originality of LLM Assisted Source Code

The ease of using a Large Language Model (LLM) to answer a wide variety ...
research
01/30/2021

ICodeNet – A Hierarchical Neural Network Approach for Source Code Author Identification

With the open-source revolution, source codes are now more easily access...
