Mining the Characteristics of Jupyter Notebooks in Data Science Projects

04/11/2023
by Morakot Choetkiertikul, et al.

Many industries now have exceptional demand for data science skills such as data analysis, data mining, and machine learning. The computational notebook (e.g., Jupyter Notebook) is a well-known data science tool that is widely adopted in practice. Kaggle and GitHub are two platforms that data science communities use for knowledge sharing, skill practice, and collaboration. Although tutorials and guidelines for novice data scientists are available on both platforms, only a small number of Jupyter Notebooks receive a high number of votes from the community. A high-voted notebook is considered well documented, easy to understand, and consistent with best practices in data science and software engineering. In this research, we aim to understand the characteristics of high-voted Jupyter Notebooks on Kaggle and of popular Jupyter Notebooks for data science projects on GitHub. We plan to mine and analyse the Jupyter Notebooks on both platforms. We will apply exploratory analytics, data visualization, and feature importance analysis to understand the overall structure of these notebooks and to identify the common patterns and best-practice features that separate low-voted from high-voted notebooks. Upon completion of this research, the discovered insights can serve as training guidelines for aspiring data scientists and machine learning practitioners who want to progress from a novice-ranked Jupyter Notebook on Kaggle to a deployable project on GitHub.
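
As a rough illustration of what mining the overall structure of a notebook could involve, the sketch below extracts a few structural features from a notebook's JSON representation. The function name `notebook_features` and the chosen features (cell counts, markdown ratio, code line count, heading count) are assumptions made for illustration only; the abstract does not specify which features the authors extract. The sketch relies only on the fact that a version 4 `.ipynb` file is a JSON document with a `cells` list whose entries carry a `cell_type` and a `source`.

```python
import json


def notebook_features(nb_json: str) -> dict:
    """Compute simple structural features from a notebook's JSON text.

    The feature set is hypothetical and chosen only for illustration.
    """
    nb = json.loads(nb_json)
    cells = nb.get("cells", [])

    def source_text(cell: dict) -> str:
        # The notebook format allows "source" to be a string or a list of strings.
        src = cell.get("source", "")
        return "".join(src) if isinstance(src, list) else src

    code = [source_text(c) for c in cells if c.get("cell_type") == "code"]
    markdown = [source_text(c) for c in cells if c.get("cell_type") == "markdown"]
    total = len(code) + len(markdown)

    return {
        "n_code_cells": len(code),
        "n_markdown_cells": len(markdown),
        # Share of documentation cells among code and markdown cells.
        "markdown_ratio": len(markdown) / total if total else 0.0,
        "n_code_lines": sum(len(s.splitlines()) for s in code),
        # Markdown lines that start with '#' are counted as headings.
        "n_headings": sum(
            1
            for s in markdown
            for line in s.splitlines()
            if line.lstrip().startswith("#")
        ),
    }


# A tiny in-memory notebook, made up for this example.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Title\n", "Intro text"]},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None,
         "source": "import pandas as pd\ndf = pd.DataFrame()"},
        {"cell_type": "code", "metadata": {}, "outputs": [],
         "execution_count": None, "source": ["print(1)"]},
    ],
})

features = notebook_features(example)
print(features)
```

Feature dictionaries of this kind, computed per notebook and paired with vote counts, could then feed the exploratory analysis and feature importance steps described above.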

