KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle

03/18/2021
by Luigi Quaranta, et al.

Computational notebooks have become the tool of choice for many data scientists and practitioners for performing analyses and disseminating results. Despite their increasing popularity, the research community cannot yet count on a large, curated dataset of computational notebooks. In this paper, we fill this gap by introducing KGTorrent, a dataset of Python Jupyter notebooks with rich metadata retrieved from Kaggle, a platform hosting data science competitions for learners and practitioners at all levels of expertise. We describe how we built KGTorrent and provide instructions on how to use it and how to refresh the collection to keep it up to date. Our vision is that the research community will use KGTorrent to study how data scientists, especially practitioners, use Jupyter Notebook in the wild, and to identify potential shortcomings that can inform the design of its future extensions.


