Jupyter Notebooks on GitHub: Characteristics and Code Clones

07/20/2020
by Malin Källén, et al.

Jupyter notebooks have emerged as a standard tool for data science programming. Programs in Jupyter notebooks differ from typical programs in that they divide code into snippets interleaved with text and visualisation, and allow interactive exploration and execution of snippets in different orders. Previous studies have shown considerable code duplication in the sources of traditional programs, in both so-called systems programming languages and so-called scripting languages. In this paper we present the first large-scale study of code cloning in Jupyter notebooks. We analyse a corpus of 2.7 million Jupyter notebooks hosted on GitHub, representing over 36,974,714 individual snippets and 226,744,094 lines of code. We study clones at the level of individual snippets, and the extent to which snippets recur across multiple notebooks. We study both identical clones (with possible differences in whitespace) and approximate clones, and conduct a small-scale ocular inspection of the most common clones. We find that code cloning is common in Jupyter notebooks: more than 70% of all notebooks do not have a unique snippet, but consist solely of snippets that are also found elsewhere. In notebooks written in Python, around 80% of all snippets are near-miss clones, and the prevalence of code cloning is higher in Python than in other languages. We further find that clones between different repositories are far more common than clones within the same repository. However, the most common individual repository from which a Jupyter notebook contains clones is the repository in which the notebook itself resides.
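To make the notion of identical clones with possible whitespace differences concrete, the following is a minimal sketch and not the authors' tooling. It assumes each notebook is already available as a list of snippet strings, and it treats two snippets as clones when they are equal after all whitespace is removed, which is a crude normalisation chosen for illustration. The names `normalize`, `clone_groups`, and `notebooks_without_unique_snippet` are hypothetical.

```python
import hashlib
from collections import defaultdict


def normalize(snippet: str) -> str:
    """Remove all whitespace (an assumed, deliberately crude normalisation)."""
    return "".join(snippet.split())


def fingerprint(snippet: str) -> str:
    """Hash the normalised snippet so large corpora can be grouped cheaply."""
    return hashlib.sha1(normalize(snippet).encode("utf-8")).hexdigest()


def clone_groups(notebooks: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each snippet fingerprint to the set of notebooks containing it."""
    groups: dict[str, set[str]] = defaultdict(set)
    for name, snippets in notebooks.items():
        for snippet in snippets:
            if normalize(snippet):  # skip empty or whitespace-only snippets
                groups[fingerprint(snippet)].add(name)
    return dict(groups)


def notebooks_without_unique_snippet(notebooks: dict[str, list[str]]) -> list[str]:
    """Return notebooks whose every non-empty snippet also occurs in another notebook."""
    groups = clone_groups(notebooks)
    result = []
    for name, snippets in notebooks.items():
        prints = [fingerprint(s) for s in snippets if normalize(s)]
        # Assumption: a repeat inside the same notebook does not count as a clone here.
        if prints and all(len(groups[p]) > 1 for p in prints):
            result.append(name)
    return result
```

Approximate (near-miss) clones would need a token-based or similarity-based comparison instead of an exact hash; that is outside this sketch.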

