KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle

by Luigi Quaranta, et al.

Computational notebooks have become the tool of choice for many data scientists and practitioners for performing analyses and disseminating results. Despite their increasing popularity, the research community cannot yet count on a large, curated dataset of computational notebooks. In this paper, we fill this gap by introducing KGTorrent, a dataset of Python Jupyter notebooks with rich metadata retrieved from Kaggle, a platform hosting data science competitions for learners and practitioners at all levels of expertise. We describe how we built KGTorrent and provide instructions on how to use it and how to refresh the collection to keep it up to date. Our vision is that the research community will use KGTorrent to study how data scientists, especially practitioners, use Jupyter Notebook in the wild and to identify potential shortcomings that can inform the design of its future extensions.



