A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects

07/17/2020
by   Andrew J. Simmons, et al.

Background: Meeting the growing industry demand for Data Science requires cross-disciplinary teams that can translate machine learning research into production-ready code. Software engineering teams value adherence to coding standards as an indication of code readability, maintainability, and developer expertise. However, there are no large-scale empirical studies of coding standards focused specifically on Data Science projects. Aims: This study investigates the extent to which Data Science projects follow coding standards. In particular, which standards are followed, which are ignored, and how does this differ from traditional software projects? Method: We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity. Results: Data Science projects have a significantly higher rate of functions that use an excessive number of parameters and local variables. Data Science projects also follow different variable naming conventions from non-Data Science projects. Conclusions: The differences indicate that Data Science codebases are distinct from traditional software codebases and do not follow traditional software engineering conventions. Our conjecture is that this may be because traditional software engineering conventions are inappropriate in the context of Data Science projects.
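To make the "excessive number of parameters and local variables" finding concrete, here is a minimal sketch of how such functions might be flagged in Python source. The abstract does not say which tool or thresholds the study used, so everything here is an assumption for illustration: the helper names (`count_params`, `count_locals`, `find_violations`) are hypothetical, the limits of 5 parameters and 15 local variables are assumed values (they happen to match Pylint's default `max-args` and `max-locals`), and the local-variable count is only a rough approximation.

```python
import ast

# Assumed thresholds for illustration only; the abstract does not state them.
MAX_PARAMS = 5
MAX_LOCALS = 15


def _param_names(func):
    """Return the names of all parameters declared by a function node."""
    a = func.args
    names = [arg.arg for arg in a.posonlyargs + a.args + a.kwonlyargs]
    if a.vararg:
        names.append(a.vararg.arg)
    if a.kwarg:
        names.append(a.kwarg.arg)
    return names


def count_params(func):
    """Count every declared parameter, including *args and **kwargs."""
    return len(_param_names(func))


def count_locals(func):
    """Approximate the local-variable count: parameters plus distinct
    names assigned anywhere in the body (nested scopes are not separated)."""
    names = set(_param_names(func))
    for node in ast.walk(func):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.add(node.id)
    return len(names)


def find_violations(source):
    """Return (function name, kind, count) for each exceeded threshold."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            n_params = count_params(node)
            n_locals = count_locals(node)
            if n_params > MAX_PARAMS:
                violations.append((node.name, "parameters", n_params))
            if n_locals > MAX_LOCALS:
                violations.append((node.name, "locals", n_locals))
    return violations


# Hypothetical sample: a training function with many hyperparameters,
# next to a small helper that stays within both limits.
SAMPLE = '''
def train(x, y, lr, epochs, batch_size, momentum):
    return x

def add(a, b):
    total = a + b
    return total
'''

if __name__ == "__main__":
    for name, kind, count in find_violations(SAMPLE):
        print(f"{name}: too many {kind} ({count})")
```

The sample is chosen to echo the abstract's conjecture: a function that takes several hyperparameters can exceed a conventional parameter limit even when the code is reasonable for its domain.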


