A large-scale comparative analysis of Coding Standard conformance in Open-Source Data Science projects

07/17/2020 ∙ by Andrew J. Simmons, et al. ∙ Deakin University

Background: Meeting the growing industry demand for Data Science requires cross-disciplinary teams that can translate machine learning research into production-ready code. Software engineering teams value adherence to coding standards as an indication of code readability, maintainability, and developer expertise. However, there are no large-scale empirical studies of coding standards focused specifically on Data Science projects. Aims: This study investigates the extent to which Data Science projects follow code standards. In particular, which standards are followed, which are ignored, and how does this differ from traditional software projects? Method: We compare a corpus of 1048 Open-Source Data Science projects to a reference group of 1099 non-Data Science projects with a similar level of quality and maturity. Results: Data Science projects suffer from a significantly higher rate of functions that use an excessive number of parameters and local variables. Data Science projects also follow different variable naming conventions from non-Data Science projects. Conclusions: The differences indicate that Data Science codebases are distinct from traditional software codebases and do not follow traditional software engineering conventions. Our conjecture is that this may be because traditional software engineering conventions are inappropriate in the context of Data Science projects.


1. Introduction

Software that conforms to coding standards creates a shared understanding between developers (Lee et al., 2015), and improves maintainability (Dos Santos and Gerosa, 2018; Nundhapana and Senivongse, 2018). Consistent coding standards also enhance the readability of the code by defining where to declare instance variables; how to name classes, methods, and variables; how to structure the code; and other similar guidelines (Martin, 2009).

A growing trend in the software industry is the inclusion of Data Scientists on software engineering teams to build predictive models, analyse product and customer data, present insights to the business, and build data engineering pipelines (Kim et al., 2018). Data Scientists may not follow common coding standards, as they often come from non-software engineering backgrounds where software engineering best practices are unknown. Furthermore, a Data Scientist focuses on producing insights from data and building predictive models rather than on the inherent quality attributes of the code. In this context of a mixed team of Data Scientists and Software Engineers, team members must collaborate to produce a coherent software artefact, and a consistent coding standard is essential for cohesive collaboration and to maintain productivity. However, it is currently unknown whether Data Scientists follow common coding standards.

Recent studies have focused on conformance to coding standards within particular application categories (Zou et al., 2019; Bafatakis et al., 2019; Papamichail et al., 2020) and have not considered the inherent technical domain (Barnett, 2017) of Data Science. For example, Data Science applications may follow implicit or undocumented guidelines for 1) describing mathematical notation when implementing algorithms, e.g. "X" is often used as the input matrix in statistics, 2) defining domain-specific constructs such as a "models" folder for storing predictive models, and 3) writing vectorised code rather than using an imperative style to improve performance. Thus, general coding standards may not be applicable to Data Science code.
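As a brief illustration of points 1 and 3, the following sketch contrasts an imperative loop with the vectorised style favoured in Data Science code, using the statistical convention of naming the input matrix X (the function names and data are invented for illustration):

```python
import numpy as np

def predict_loop(X, w):
    """Imperative style: explicit loops over rows and columns."""
    y = []
    for i in range(X.shape[0]):
        total = 0.0
        for j in range(X.shape[1]):
            total += X[i, j] * w[j]
        y.append(total)
    return np.array(y)

def predict_vectorised(X, w):
    """Vectorised style: a single matrix product, no explicit loops."""
    return X @ w

# "X" follows statistics notation for the input matrix, but clashes with
# PEP 8 / Pylint naming rules for local and argument names.
X = np.random.rand(100, 5)
w = np.random.rand(5)
assert np.allclose(predict_loop(X, w), predict_vectorised(X, w))
```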

While research into improving the predictive accuracy of machine learning algorithms for Data Science has flourished, less attention has been paid to the software engineering concerns of Data Science. Industry has reported growing architectural debt in machine learning, well beyond that experienced in traditional software projects (Sculley et al., 2015). While the primary concern is "hidden" system-level architectural debt, there may also be observable symptoms such as dead code paths due to rapid experimentation (Sculley et al., 2015).

In this study we conduct an empirical analysis to investigate the discrepancies in conformance to coding standards between Data Science projects and traditional software engineering projects. The novel contributions of this research are threefold. First, to the best of the authors’ knowledge this is the first large scale (1048 Open-Source Python Data Science projects and 1099 non-Data Science projects) empirical analysis of coding standards in Data Science projects. Second, we empirically identify differences in coding standard conformance between Data Science and non-Data Science projects. Finally, we investigate the impact of using machine learning frameworks on coding standards. Our study focuses on Python projects as this is the most popular language for Open-Source Data Science projects on GitHub (Braiek et al., 2018).

Our work contributes empirical evidence towards understanding phenomena that data science teams experience anecdotally, and lays the foundation for future research to provide guidelines and tooling to support the unique nature of Data Science projects.

2. Motivating Example

As a motivating example for our research, consider Naomi, a Data Scientist working with a team of other Data Scientists and Software Engineers to build a predictive model that will be deployed as part of the backend for a web application. She finds a recent research paper in the field and implements the latest algorithm using Python due to the availability of machine learning frameworks (Braiek et al., 2018). However, during automated code review Naomi finds out that her code does not adhere to the coding standards enforced by the web framework used within Naomi's company. A snippet of the kind of code written by Naomi is shown in Figure 1.

Figure 1. Example code snippet from a machine learning loop taken from an Open-Source project (https://github.com/shuyo/iir/blob/master/activelearn/mmms.py#L18)
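The figure itself is not reproduced here; the snippet below is an illustrative sketch of the kind of training loop described (a simple perceptron-style update, not the code from the linked repository), in which short, mathematics-style names and many locals accumulate within a single function:

```python
import numpy as np

def train(X, y, T=100, eta=0.1):
    # Short, maths-style names (X, y, w, b, T, eta) match the algorithm's
    # notation but are flagged by Pylint's naming checks. Labels y are
    # assumed to be in {-1, +1}.
    N, D = X.shape
    w = np.zeros(D)
    b = 0.0
    for t in range(T):
        for i in range(N):
            m = y[i] * (X[i] @ w + b)   # functional margin
            if m <= 0:                  # misclassified: perceptron update
                w += eta * y[i] * X[i]
                b += eta * y[i]
    return w, b
```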

The team realises that there are exceptions to the coding standards they have put in place, but would like to formally define these standards for consistency across the organisation. They also do not have any guidelines for implementing Data Science specific coding standards or where these standards should be applied. To assist Naomi's team, we investigate the answers to the following research questions (RQ):

  • RQ1: What is the current landscape of Data Science projects? (age, LOC, cyclomatic complexity, distributions)

  • RQ2: How does adherence to coding standards differ between Data Science projects and traditional projects?

  • RQ3: Where do coding standard violations occur within Data Science projects?

3. Background and Related Work

3.1. Coding standards

Coding standards, also known as code conventions or coding style, have been developed under the premise that consistent standards (covering inline documentation, class and variable naming, and the organisation of structural constructs) improve readability, assist in identifying incorrect code, and encourage best practices. These attributes are associated with improved maintainability (Spolsky, 2008; Posnett et al., 2011), readability (Dos Santos and Gerosa, 2018) and understandability (Nundhapana and Senivongse, 2018) in general; however, certain practices may have negligible or even negative effect (Lee et al., 2015).

In this paper, we specifically discuss Python coding standards, given that this is the preferred language for Data Science projects (Braiek et al., 2018). The Style Guide for Python Code, known as PEP 8 (https://www.python.org/dev/peps/pep-0008/), includes standards regarding code layout, naming, programming recommendations and comments (python.org, 2019). In addition, standards for Python also cover package organisation and the use of virtual environments (Allbee, 2018). The Python community also strives to have a single way of doing things, as defined in the ‘Zen of Python’ (Alexandru et al., 2018), so we focus on analysing the official style guide for Python, PEP 8. Static analysis of Python programs is notoriously difficult due to the dynamic type system, in which types are determined at run time rather than at compile time.
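As a brief illustration, the following sketch contrasts names that conform to PEP 8 (as enforced by Pylint's default configuration) with names that its invalid-name check typically reports (the identifiers are invented for illustration):

```python
# Names that conform to PEP 8 as checked by Pylint's defaults
RANDOM_SEED = 42                     # constants: UPPER_CASE

class FeatureScaler:                 # classes: CapWords
    def fit_transform(self, samples):    # methods and variables: snake_case
        max_value = max(samples)
        return [sample / max_value for sample in samples]

# Names typically reported by Pylint's invalid-name check
def scale(samples):
    X = samples                      # upper-case local variable (maths notation)
    Xmax = max(X)                    # mixed-case local variable
    return [x / Xmax for x in X]
```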

3.2. Mining of coding standards

Allamanis et al. designed a tool to infer code style based on examples in a domain and to recommend suggestions on how to improve code consistency (Allamanis et al., 2014). Markovtsev et al. later presented a fully-automated approach powered by a decision tree forest model (Markovtsev et al., 2019). Zhu et al. performed a study on the use of folders in software projects and their possible relationship with project popularity (Zhu et al., 2014). Using the premise that conventions make code easier to read, understand and maintain, Smit et al. defined a metric for "convention adherence" and analysed a set of projects to study consistency among teams regarding their code conventions (Smit et al., 2011). Jarczyk et al. devised metrics for a project's popularity and the quality of user support by (virtual or remote) team members; these metrics were used to analyse possible correlations between project quality and the characteristics of the team that developed it (Jarczyk et al., 2014). Raghuraman et al. performed a quantitative analysis of the relationship between UML model design and project quality according to the number of issues in its repository, reporting that projects designed with UML have fewer software defects than projects designed without UML (Raghuraman et al., 2019). The ecosystem for analysis of statically typed programming languages is mature; however, analysis of dynamic programming languages such as Python presents challenges. In our work, we analyse the Python programming language as it has become the lingua franca of Data Science teams.

3.3. Mining of Python repositories

Bafatakis et al. used Pylint to analyse code-style conformance of Python snippets shared on the Stack Overflow question-answer website, and studied the relationship between code-style and vote score (Bafatakis et al., 2019). Omari and Martinez curated a corpus of 132 open source Python projects and analysed their Pylint scores and cyclomatic complexity (Omari and Martinez, 2020). Their corpus includes machine learning repositories; however, it is dominated by web frameworks and all topics were analysed together as part of the same group. Biswas et al. formed a corpus of Data Science projects written in Python and made the dataset available using the Boa infrastructure (Biswas et al., 2019); however, as the Boa infrastructure only holds the abstract syntax tree rather than the underlying source code, their dataset cannot be used to analyse presentation related style violations as is.

Biswas et al. claim to provide "the first dataset that includes Data Science projects written in Python" (Biswas et al., 2019), but their dataset does not provide a means for measuring presentation-related aspects of code style. To the best of our knowledge, we therefore provide the first large-scale empirical investigation to specifically examine the extent to which Data Science projects written in Python follow code standards.

4. Empirical Study Design

Figure 2. Study Design

To investigate the current landscape of data science projects (RQ1), we sought to use an established corpus of high-quality Data Science projects. For this, we reuse the Boa Data Science corpus of 1558 Open-Source Python projects curated from GitHub by Biswas et al. (Biswas et al., 2019) and recently presented at the Mining Software Repositories (MSR 2019) conference. However, we encountered two obstacles in attempting to reuse the existing corpus as-is. Firstly, Biswas et al. represent project code in the form of an abstract syntax tree for querying using the Boa platform. This discards the concrete syntax (the code as written, including its layout), which may hold valuable insights about developer behaviours (Spinellis, 2011). Secondly, understanding the unique aspects of Data Science projects requires a baseline of non-Data Science projects to contrast them against. As such, we curated a secondary corpus of non-Data Science projects, while controlling for project quality and project maturity in order to ensure a meaningful comparison of how adherence to coding standards differs between the two corpora (RQ2). A subset of the projects in the Data Science corpus was investigated at a deeper level to examine how coding standards vary between modules within projects (RQ3). The high-level study design is presented in Figure 2.

Results of the analysis, including raw logs of all detected code violations and source code to reproduce all results in this paper, are made publicly available (https://doi.org/10.6084/m9.figshare.12377237.v2).

4.1. Formation of corpora

Figure 3. Curation of Projects

Figure 3 shows the number of project repositories available (1558 projects in the Boa Data Science corpus, and 10551 potential non-Data Science projects available on GitHub), and how we curated a selection of 1048 Data Science projects and 1102 Non-Data Science projects for further analysis.

4.1.1. Data Science Corpus

For each project in the Boa Data Science corpus, we fetched project metadata, such as the repository URL, current number of project stars, forks, issues, and committers, from the GitHub API. We were unable to fetch metadata from GitHub for six (0.4%) of the projects listed in the Boa Corpus. This can occur due to projects on GitHub being renamed, deleted, or marked private by their owners after the publication of the Boa Data Science Corpus. We were able to retrieve full source code for all projects available via the GitHub API; however, we needed to manually download and extract six of the projects that could not be cloned using the Git client (as the downloaded zip archives did not include the project Git history, these six repositories were excluded from the age calculation and assumed to follow the same distribution as the other repositories in the corpus).

We needed to discard one repository in the corpus for technical reasons (we discovered that one of the files in the project was triggering a bug in Pylint that caused our analysis pipeline to hang indefinitely). Three of the projects were excluded as they contained no Python (.py) code files; while use of Python was a criterion checked by Biswas et al., it is possible that the Python files had since been deleted in the most recent version, or that the project had been replaced by another. Finally, we restricted the analysis to only Python 3 compatible projects. Python version 3 introduces major, backwards-incompatible changes to the Python language, which have impacted the entire Python ecosystem. Limiting the analysis to a single Python version helps to prevent confounding factors that would be introduced by mixing different Python versions. Furthermore, Python 2 is no longer supported, and recent versions of code analysis tools such as Pylint have dropped support for Python 2 code. To detect the Python version, we attempted to parse each Python (.py) file using both Python 2.7.13 and Python 3.6.5. If at least as many files can be parsed using Python 3 as can be parsed using Python 2, then we consider the project to be Python 3 compatible. It is also possible that repositories may include some Python files that cannot be parsed with either version of Python: this can occur due to syntax errors committed to the repository; use of syntax features only recently introduced in the latest versions of Python; or due to inclusion of special syntax intended for alternative environments such as Jupyter Notebooks. We included such repositories in the analysis, but only processed files in the repository that were compatible with Python 3 analysis tools (specifically, Pylint).
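The version-detection rule above can be sketched as follows (a minimal sketch assuming python2 and python3 executables are available on PATH; the study used interpreters 2.7.13 and 3.6.5 specifically):

```python
import pathlib
import subprocess

# One-liner handed to each interpreter: succeed only if the file parses.
PARSE_SNIPPET = "import ast, sys; ast.parse(open(sys.argv[1], 'rb').read())"

def parseable(interpreter, path):
    """Return True if `interpreter` can parse the file at `path`."""
    result = subprocess.run([interpreter, "-c", PARSE_SNIPPET, str(path)],
                            capture_output=True)
    return result.returncode == 0

def is_python3_compatible(repo_dir):
    """Rule from the paper: at least as many files parse under Python 3
    as under Python 2."""
    py_files = list(pathlib.Path(repo_dir).rglob("*.py"))
    py2 = sum(parseable("python2", f) for f in py_files)
    py3 = sum(parseable("python3", f) for f in py_files)
    return py3 >= py2
```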

4.1.2. Traditional (Non-Data Science) Corpus

To identify a corpus of traditional software repositories, we use the same project quality criteria that Biswas et al. (Biswas et al., 2019) used to construct the Boa Data Science corpus, but negate any keywords relating to Data Science. As this corpus contains all types of projects other than Data Science, we will refer to it as the "non-Data Science corpus". Following Biswas et al., we select projects having at least 80 stars on GitHub (as an indication of community acceptance of the project quality) and that are not forked. (GitHub allows users to star a project, thus serving as an indication to other users of the project's popularity and community acceptance. While stars may be subject to some manipulation, for the purpose of this paper we only use stars as a means to ensure that the Data Science and non-Data Science corpora are of a similar nature.) In contrast to Biswas et al., we check that the project description does not match any of their Data Science keywords, but still verify that the project has a non-empty description. Biswas et al. perform an additional check that the project imports a common Data Science library. We considered negating this check; however, as their list includes some common libraries that could be used outside of a Data Science context (e.g. matplotlib, a general-purpose data visualisation library), we decided to include projects regardless of their imports, so long as they were not already included as part of Biswas et al.'s corpus.
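A minimal sketch of these selection criteria, applied to repository metadata as returned by the GitHub API (stargazers_count, fork, description, and html_url are standard API fields; the keyword list shown is an illustrative subset, not Biswas et al.'s full list):

```python
# Illustrative subset of Data Science keywords (assumed for this sketch)
DS_KEYWORDS = ("machine learning", "deep learning", "data science", "neural network")

def keep_as_non_data_science(repo, boa_corpus_urls):
    """Selection criteria for the non-Data Science corpus, applied to one
    repository's metadata dictionary as returned by the GitHub API."""
    description = (repo.get("description") or "").lower()
    return (repo.get("stargazers_count", 0) >= 80          # community acceptance
            and not repo.get("fork", False)                # exclude forks
            and description != ""                          # non-empty description
            and not any(k in description for k in DS_KEYWORDS)
            and repo["html_url"] not in boa_corpus_urls)   # not in the DS corpus
```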

Due to limitations imposed by the GitHub search API (search query results are limited to return at most 1000 repositories), we ran our search query against GHTorrent (Gousios, 2013), an archive of GitHub metadata. As Biswas et al. (Biswas et al., 2019) do not state the time of their query, nor how they overcame the GitHub API limits, we chose to use the ght_2018_04_01 GHTorrent dump available via Google BigQuery (https://ghtorrent.org/gcloud.html). This query led to the identification of 10551 repositories. Of these, we were able to retrieve current metadata for 9882 repositories directly from the GitHub API, but were unable to fetch 669 (6.8%). Gousios notes limitations in the way that GHTorrent collects data from GitHub (Gousios, 2013) (e.g. that GHTorrent additively collects data from GitHub, but GitHub does not report deletions) that may explain why some repositories were listed in GHTorrent, but not available from GitHub. These effects would have been exacerbated by repository changes since the time of the data dump. We were able to successfully obtain full source code for all the repositories for which the GitHub API returned metadata (albeit that two of these repositories needed to be manually extracted from zip archives). We removed a further 38 repositories that were already present in the Boa Data Science Corpus; while in theory our query should have excluded these, their inclusion may have been due to changes in Data Science keywords in the project description between the time of the GHTorrent dump and the time that Biswas et al. constructed the Boa Data Science Corpus.

As Data Science has recently undergone rapid growth (Braiek et al., 2018), it is also necessary to consider the maturity of projects in our non-Data Science set in order to ensure a meaningful comparison with Data Science projects. To achieve this, we only included projects that, according to the GitHub API, had been created in 2016 or later. However, we found cases of mature projects that matched this criterion due to being recently migrated to GitHub. Thus, as an additional measure, we define a project's age as the number of days between the first commit in the Git log and the most recent commit to the default branch, and restrict the selection to projects under 1500 days old (approx. 4.1 years). This reduced the non-Data Science set to 1519 repositories. After removing projects that contained no Python files, or were not Python 3 compatible, we were left with 1102 non-Data Science repositories (as discussed in subsection 4.4, a further three were discarded in the coding standard analysis stage as they did not contain any files that could be analysed).
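A sketch of the age calculation under the definition above, assuming the repository has been cloned locally and git is available on PATH:

```python
import subprocess

def project_age_days(repo_dir):
    """Days between the first and the most recent commit on the checked-out
    (default) branch, using Unix commit timestamps from `git log`."""
    out = subprocess.run(["git", "-C", repo_dir, "log", "--format=%ct"],
                         capture_output=True, text=True, check=True).stdout.split()
    timestamps = [int(t) for t in out]
    return (max(timestamps) - min(timestamps)) / 86400  # seconds per day

# Projects with project_age_days(...) of 1500 or more were excluded.
```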

4.2. Overview of Corpora

Figure 4 shows survival plots (1 - cumulative distribution function) of the distribution of project stars and project age (after selection). To quantify the differences between the two corpora, we use the two-sample Kolmogorov–Smirnov (K–S) statistic (computed using SciPy version 1.4.1, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html), which is defined as the maximum vertical difference between the two curves in the survival plots, and can be used to test the null hypothesis that the samples from the two corpora are drawn from the same statistical distribution.
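As an illustration, the two-sample K–S statistic can be computed with SciPy as follows (the star counts shown are illustrative stand-ins, not values from the corpora):

```python
from scipy.stats import ks_2samp

# Illustrative stand-ins for per-project star counts in the two corpora
ds_stars = [80, 95, 120, 310, 4500]
non_ds_stars = [82, 101, 150, 280, 3900]

# The K-S statistic is the maximum vertical distance between the two
# empirical survival curves; the p-value tests whether both samples could
# plausibly have been drawn from the same distribution.
statistic, p_value = ks_2samp(ds_stars, non_ds_stars)
print(f"K-S statistic={statistic:.2f}, p={p_value:.2f}")
```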

The distribution of stars in both corpora is almost identical; the two-sample K–S statistic is 0.06, which represents only a weakly significant (p=0.03) difference beyond what would be expected by chance if drawn from the same distribution. In contrast, despite our attempt to restrict the age of non-Data Science projects to match that of the Data Science corpus, it is only an approximate fit (K–S statistic of 0.19). The mean age of the selected projects in the Data Science corpus was 671 days, whereas after selection, the mean age of projects in the non-Data Science corpus was 743 days. We could have imposed additional selection criteria on the non-Data Science corpus to bring the age distribution closer; however, this would have reduced the size of the dataset and further shifted the distribution away from the underlying population of non-Data Science projects. Our intent was to ensure both corpora have similar project quality and maturity characteristics to permit a meaningful comparison of coding standards between the corpora, rather than to perfectly control all characteristics.

Figure 4. Survival plots showing controlled characteristics (stars, age) of the projects in the DS corpus and the Non-DS corpus (after selection). Stars were controlled for almost perfectly; however, age differences were only partially controlled for.

4.3. Collecting project metrics

In addition to project stars and age (defined in subsubsection 4.1.1), we examined how these relate to a broad range of project-level metrics such as forks, issues, contributors, lines of code, and cyclomatic complexity in order to understand the current landscape of Data Science projects. The number of project forks, open issues, and contributors were retrieved through the GitHub API along with the number of stars when constructing the corpus. After cloning the repositories, we calculated the non-blank Lines Of Code (LOC) for all Python files (anything ending in the .py extension). The advantage of the non-blank lines of code metric is that it is easy to define (all lines of code, including comments, other than lines that consist entirely of whitespace) and reproduce (Malloy and Power, 2019). It is also fast to calculate (even for thousands of repositories), and does not require parsing the source code. We used Radon (Radon, 2018) to compute McCabe's cyclomatic complexity (McCabe, 1976): the number of linearly independent paths through the code, equivalent to the number of decisions in a code block plus 1.
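A sketch of both metrics for a single file, using the non-blank LOC definition above and Radon's cc_visit function (cc_visit reports per-block complexities, which approximate the per-function scores averaged in the study):

```python
from radon.complexity import cc_visit

def non_blank_loc(path):
    """All lines, including comments, except those that are only whitespace."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        return sum(1 for line in f if line.strip())

def block_complexities(path):
    """McCabe cyclomatic complexity of each code block Radon identifies."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        source = f.read()
    return [block.complexity for block in cc_visit(source)]
```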

4.4. Collecting coding standard violations

Pylint (pylint) is a tool that performs a series of error checks, attempts to enforce coding standards (as per PEP 8, discussed previously), and finds code smells (i.e., suspicious code that may be an indication of deeper problems in the system (Fowler, 2018)). Pylint analyses code and displays a set of messages, categorised as convention violations, refactors for code smells, errors or bugs, warnings for Python-specific problems, and fatal errors that prevented Pylint from further processing. We executed Pylint on each module (file) of each repository in order to count the number of code standard violations of each type. Any files that could not be analysed (e.g. due to using or importing invalid syntax, broken symbolic links, or triggering internal Pylint issues) were removed from this stage of the analysis (including removal from the LOC counts to ensure they did not distort the frequency of warnings per LOC). This resulted in three non-Data Science repositories being discarded completely, leaving a total of 1048 Data Science repositories and 1099 non-Data Science repositories.

While many of the coding standards enforced by Pylint are configurable, we left these as their default values to reflect the community standards. Pylint gives the ability to ignore unwanted warning types, but for our purposes we included all warning types in the analysis then grouped them at the most detailed level. This allowed us to identify precisely which warning types are significant.
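One way to collect these counts (a sketch, not the authors' exact pipeline) is to run Pylint with its default settings on each module via the JSON output format and tally the messages by their symbolic name:

```python
import collections
import json
import subprocess

def count_pylint_warnings(py_file):
    """Run Pylint with default settings on one module and tally messages
    by symbolic name (e.g. 'too-many-locals', 'invalid-name')."""
    result = subprocess.run(["pylint", "--output-format=json", py_file],
                            capture_output=True, text=True)
    messages = json.loads(result.stdout or "[]")
    return collections.Counter(m["symbol"] for m in messages)

# Usage: divide a count by the file's non-blank LOC to obtain the
# warnings-per-LOC rate used in the analysis, e.g.
# count_pylint_warnings("model.py")["too-many-locals"] / non_blank_loc("model.py")
```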

4.5. Analysing module-level import graph

We used the Python FindImports library (https://pypi.org/project/findimports/) to extract the direct dependencies (imports) of each module (file). To detect unit tests, imports were checked against a list of common Python testing frameworks (unittest, pytest, unittest2, mock). We checked the remaining modules against the same list of Data Science libraries as Biswas et al. (Biswas et al., 2019) (machine learning, visualisation, and data wrangling libraries).
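The categorisation can be sketched with the standard-library ast module in place of FindImports (the Data Science library list shown is an illustrative subset of the list used by Biswas et al.):

```python
import ast

TEST_LIBS = {"unittest", "pytest", "unittest2", "mock"}
# Illustrative subset of the Data Science libraries checked in the study
DS_LIBS = {"tensorflow", "torch", "sklearn", "pandas", "numpy", "matplotlib"}

def imported_packages(path):
    """Top-level package names imported by a module."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        tree = ast.parse(f.read())
    packages = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            packages.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            packages.add(node.module.split(".")[0])
    return packages

def categorise(path):
    """Assign a module to exactly one of the three mutually exclusive categories."""
    packages = imported_packages(path)
    if packages & TEST_LIBS:
        return "Unit Test"
    if packages & DS_LIBS:
        return "Uses DS Library"
    return "Does not use DS Library"
```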

5. Results

5.1. (RQ1) What is the current landscape of Data Science projects?

In this section, we investigate the Data Science landscape through the lens of the project level metrics described in subsection 4.3, while controlling for the project stars (as an indicator of community acceptance) and age (as an indicator of project maturity) as described in subsection 4.2 to ensure a meaningful comparison between the Data Science corpus and the non-Data Science corpus. The resultant distributions are presented in Figure 5.

Figure 5. Survival plots showing characteristics of the projects selected for the DS corpus and the control group.

Lines of code was greater in the Data Science corpus. When manually examining the results, we found cases of projects that encoded data files as Python code in order to make data easy to access from Python, which may explain some of the difference, particularly in the tail of the distribution.

For each project, cyclomatic complexity was calculated for each method or function, then averaged to provide a score for that project (projects without any methods or functions were excluded from the analysis). Cyclomatic complexity is similar in both corpora (K–S statistic=0.06, p=0.05). This was an unexpected result, as implementations of machine learning algorithms are an important component of Data Science projects, which one would intuitively expect to result in a higher cyclomatic complexity caused by nested loops and branches. A potential explanation is the use of frameworks and vectorised code that will tend to flatten out loops.

The number of open issues shows no significant differences (K–S statistic=0.03, p=0.6) between Data Science and non-Data Science projects. While our selection process attempts to control the distribution of stars and age, we did not make any attempt to manipulate the distribution of open issues. The finding that open issues were also similar between the projects is an indication that the Data Science and non-Data Science corpora are of a similar quality and maturity.

Data Science projects had a lower number of contributors than non-Data Science projects, but had a higher number of forks. This is consistent with a development workflow whereby data scientists copy/fork each other’s code/models, then work alone or in small teams to tune it to their use case. However further investigation would be needed to confirm whether this is the case.

RQ1 summary. Compared to other projects of similar quality (stars) and maturity (age), Data Science projects on GitHub have the same number of open issues and fewer contributors, but are forked more often and, surprisingly, have similar average cyclomatic complexity to non-Data Science projects.

5.2. (RQ2) How does adherence to coding standards differ between Data Science projects and traditional projects?

Pylint warning (values are warnings per non-blank LOC; columns in order: DS mean, DS median, Non-DS mean, Non-DS median)
too-many-locals 0.34% 0.28% 0.16% 0.06%
import-error 2.73% 2.07% 1.83% 1.16%
too-many-arguments 0.33% 0.26% 0.19% 0.08%
invalid-name 9.88% 7.73% 6.94% 4.86%
broad-except 0.03% 0.00% 0.13% 0.00%
bad-indentation 6.58% 0.00% 2.24% 0.00%
consider-using-enumerate 0.05% 0.00% 0.03% 0.00%
missing-class-docstring 0.43% 0.26% 0.73% 0.44%
trailing-whitespace 1.59% 0.05% 1.05% 0.00%
missing-function-docstring 2.56% 2.31% 3.26% 2.94%
unsubscriptable-object 0.04% 0.00% 0.02% 0.00%
inconsistent-return-statements 0.03% 0.00% 0.07% 0.00%
wrong-import-position 0.32% 0.00% 0.18% 0.00%
bad-whitespace 5.43% 0.53% 4.90% 0.15%
too-many-instance-attributes 0.11% 0.05% 0.08% 0.02%
unnecessary-comprehension 0.03% 0.00% 0.02% 0.00%
missing-final-newline 0.12% 0.00% 0.20% 0.00%
unused-variable 0.38% 0.23% 0.33% 0.14%
ungrouped-imports 0.05% 0.00% 0.04% 0.00%
too-few-public-methods 0.17% 0.08% 0.36% 0.12%
too-many-public-methods 0.00% 0.00% 0.01% 0.00%
missing-module-docstring 0.82% 0.65% 1.17% 0.74%
too-many-statements 0.07% 0.03% 0.06% 0.00%
attribute-defined-outside-init 0.42% 0.01% 0.24% 0.00%
line-too-long 2.54% 1.64% 2.08% 1.02%
bad-continuation 2.17% 0.51% 2.22% 0.34%
bad-option-value 0.004% 0.00% 0.002% 0.00%
consider-using-dict-comprehension 0.007% 0.00% 0.003% 0.00%
trailing-newlines 0.09% 0.00% 0.07% 0.00%
invalid-unary-operand-type 0.003% 0.00% 0.001% 0.00%
reimported 0.024% 0.00% 0.018% 0.00%
Table 1. Coding standard violations (warnings per non-blank LOC) found to differ significantly between DS repositories and Non-DS repositories.

We analysed all files in each repository with Pylint (version 2.4.4) and computed the number of coding standard violations per non-blank LOC. Pylint checked source code for 229 different types of code standard violations. We then examined differences in the error distribution between Data Science repositories compared to non-Data Science repositories.

Due to the sparse nature of code violations, many (or in some cases most) repositories contained zero violations of a particular type, whereas those that contained violations tended to do so with an extreme frequency (e.g. projects that used non-standard formatting could trigger multiple violations per line). As such, a two-sided Mann-Whitney U-test (computed using SciPy version 1.4.1, https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html) was selected to detect differences in the distributions, as it is a non-parametric test that depends only on the ranking (ordering) of values rather than their absolute quantities. To guard against data dredging (an inflated number of false positives as a consequence of performing many comparisons), we apply Bonferroni correction to the significance threshold as a conservative counter-measure.
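A sketch of the per-warning-type test: a two-sided Mann-Whitney U-test on the per-repository warning rates of the two corpora, compared against a Bonferroni-corrected threshold (the 0.05 base level shown here is an assumption for illustration, not the paper's stated value):

```python
from scipy.stats import mannwhitneyu

N_WARNING_TYPES = 229                  # number of comparisons performed
ALPHA = 0.05 / N_WARNING_TYPES         # assumed Bonferroni-corrected threshold

def significantly_different(ds_rates, non_ds_rates):
    """Two-sided Mann-Whitney U-test on per-repository warning rates
    (warnings per non-blank LOC) for a single warning type."""
    _, p_value = mannwhitneyu(ds_rates, non_ds_rates, alternative="two-sided")
    return p_value < ALPHA, p_value
```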

Figure 6. Survival plots comparing Data Science coding standard violation frequency to non-Data Science projects.

A table of the Pylint warning types that had significantly different frequencies between Data Science projects and other projects is provided in Table 1. The table shows that the top five most significant differences were: too-many-locals, import-error, too-many-arguments, invalid-name, and broad-except. The distributions of key warnings are compared visually in Figure 6.

Data Science projects (mean of 0.34% warnings/LOC) trigger the too-many-locals warning over twice as frequently as non-Data Science projects (mean of 0.16% warnings/LOC). By default, Pylint triggers the too-many-locals warning when a function or method contains more than 15 local variables. As function arguments count towards this limit, it is related to the too-many-arguments warning, which triggers when a function definition contains more than five parameters. Combined with our findings from RQ1, this implies that Data Science projects tend to incorporate functions with many variables, but not necessarily a high cyclomatic complexity. A possible cause is models with multiple hyperparameters that are either hard-coded as variables in the function definition or passed to the function as individual parameters rather than being stored in a configuration object.
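The conjectured pattern can be illustrated as follows (a hypothetical sketch with invented hyperparameter names): passing each hyperparameter individually exceeds Pylint's default limits, whereas grouping them in a configuration object keeps the signature small.

```python
from dataclasses import dataclass

# Exceeds Pylint's default max-args of 5 (too-many-arguments); the arguments
# also count towards the 15-name limit behind too-many-locals.
def train_model(data, labels, learning_rate, batch_size, epochs,
                dropout, momentum, weight_decay, seed):
    ...

# Grouping hyperparameters in a configuration object avoids both warnings.
@dataclass
class TrainConfig:
    learning_rate: float = 0.01
    batch_size: int = 32
    epochs: int = 10
    dropout: float = 0.5
    momentum: float = 0.9
    weight_decay: float = 1e-4
    seed: int = 42

def train_model_with_config(data, labels, config: TrainConfig):
    ...
```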

Data Science projects also contained more frequent violations of Pylint's variable naming rules. Pylint will flag variables that are improperly capitalised (e.g. local variables are expected to be lowercase) or too short (e.g. single-letter variable names other than loop variables). This may conflict with mathematical notation, e.g. an upper-case "X" for an input matrix.
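For example, code written to mirror statistical notation, such as the following hypothetical least-squares routine, is reported under Pylint's default naming rules even though the names are conventional in the mathematical literature:

```python
import numpy as np

def fit(X, y):
    # "X" (upper-case input matrix) and the single-letter "y" follow
    # statistics notation but are reported as invalid-name under
    # Pylint's default snake_case rules; so is the mixed-case "XtX".
    XtX = X.T @ X
    return np.linalg.solve(XtX, X.T @ y)   # normal equations
```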

The import-error and bad-option-value warnings are artifacts of our analysis environment and should be ignored. As we did not attempt to install individual project dependencies into the environment, the larger number of import-error warnings triggered by Data Science projects is more likely an indication that they import external libraries more frequently.

The broad-except warning represents cases where a try-except statement catches a general (overly broad) exception rather than a particular type. This warning was more common in non-Data Science projects. However, the difference could also be explained by Data Science projects that fail to use exception handling at all. Similarly, the missing-class-docstring, missing-function-docstring, and missing-module-docstring warnings were more common in non-Data Science projects; however, it is unclear whether this is due to worse documentation or a consequence of an Object-Oriented design with shorter modules and more common use of classes and functions.

RQ2 summary. Data Science projects contain over twice as many cases (per line of code) of functions/methods with excessive numbers of local variables. Data Science projects contain more frequent violations (per line of code) of variable naming standards than non-Data Science projects.

5.3. (RQ3) Where do coding standard violations occur within Data Science projects?

In this section, we examine how coding standard violations vary within projects. We first label any modules that import unit testing libraries as Unit Test. The remaining modules are checked to see if they import any Data Science or Machine Learning libraries (e.g. tensorflow) and labelled as either Uses DS Library or Does not use DS Library (the three categories are mutually exclusive).

To prevent the analysis from becoming swamped by large repositories with many modules, an intermediate average is calculated per category (Unit Test, etc.) for each repository, such that the repository contributes one data point per category to the final distribution. For this research question, we only analyse Data Science repositories that contain a combination of all three categories (290 repositories). The final breakdown of the results is presented in Figure 7.
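A sketch of this aggregation using pandas (the module rows shown are illustrative):

```python
import pandas as pd

# One row per module: its repository, category, and warning rate
modules = pd.DataFrame({
    "repo": ["r1", "r1", "r1", "r2", "r2"],
    "category": ["Unit Test", "Uses DS Library", "Does not use DS Library",
                 "Uses DS Library", "Unit Test"],
    "invalid_name_per_loc": [0.01, 0.09, 0.03, 0.12, 0.02],
})

# Intermediate average: one data point per repository per category
per_repo = (modules.groupby(["repo", "category"])["invalid_name_per_loc"]
            .mean().reset_index())

# Keep only repositories that contain all three categories
complete = per_repo.groupby("repo")["category"].transform("nunique") == 3
per_repo = per_repo[complete]
print(per_repo)
```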

Figure 7. Survival plots for selection of coding standard violations within projects, categorised based on module imports.

The plot of too-many-locals confirms that the violations are indeed coming from modules that import a Data Science library. However, other code in the repository does not appear to be affected.

The plot of invalid-name shows that naming violations are frequent in modules that import Data Science libraries. However, other code in the repository, such as unit tests, also seems to be affected. This indicates that the naming violations spread throughout Data Science repositories.

The majority of code was defined in modules that imported Data Science libraries (median of 3706 non-blank LOC) in contrast to unit testing code (median 746 non-blank LOC) and other code (median 1604 non-blank LOC).

RQ3 summary. Overuse of local variables appears to be confined to modules that directly import Data Science libraries. In contrast, non-conforming variable names in Data Science projects occur throughout the entire project. The majority of Data Science code is defined in modules that import Data Science libraries.

6. Discussion

In this section we use the technical domain as a lens to interpret our findings (Barnett, 2017), as our study focused on the specific technical domain of Data Science. Our study found that Data Science projects are forked more often and have a lower number of contributors, which would indicate greater software reuse amongst Data Scientists. This is consistent with Data Science specific tasks such as data preparation, statistical modelling, and training machine learning models.

Compared to other projects of similar community acceptance and maturity, Data Science projects showed significantly higher rates of functions that use an excessive number of parameters (too-many-arguments) and local variables (too-many-locals).

Our module-level analysis confirms that this appears in code that imports machine learning libraries, but does not appear to affect other modules in the application. Despite the excessive number of local variables and parameters, the average cyclomatic complexity of functions in Data Science projects appeared similar to that of non-Data Science projects. This suggests that machine learning frameworks and libraries are effective in helping Data Scientists avoid branching and looping constructs, but not in managing the excessive number of variables associated with model parameters and configuration.

Variable naming conventions in Data Science projects also differ from those used in other projects, as evidenced by the higher frequency of Pylint (invalid-name) warnings. In contrast to the other warnings, the variable naming differences appear to permeate the entire Data Science code base.

6.1. Implications

In this subsection we present the implications of our research.

6.1.1. Software teams should be aware that Data Scientists may be following unconventional coding standards.

Our findings indicate that Data Science code follows different coding standards, so this should be taken into consideration when establishing team-wide standards.

6.1.2. Further investigation is needed to evaluate the case for Data Science aware coding standards.

Answers to RQ2 and RQ3 indicate that Data Science projects do not follow coding conventions to the same extent as non-Data Science projects. This raises at least two additional research questions: ‘Do Data Scientists follow Data Science specific coding standards?’ and ‘Why does conformance to coding standards differ between Data Science projects and non-Data Science projects?’ To approach the first question, we plan to conduct a qualitative analysis of the data collected in this paper to derive potential Data Science specific coding standards before an evaluation with practitioners. To better understand Data Scientists’ attitudes towards coding standards, we propose a large-scale survey.

6.1.3. Study the impact of variable naming conventions in Data Science projects on readability.

A recent study indicated that short variable names take longer to comprehend (Hofmeister et al., 2019), a finding which may not hold for Data Science code where an algorithm implementation is written to resemble mathematical notation. Our study found that Data Science projects follow different coding standards than non-Data Science projects (RQ2), which potentially impacts readability.

6.1.4. Develop richer tools to support empirical software engineering research involving Python source code.

We found that Radon hangs on large Python files, and encountered multiple issues with Pylint as described in section 7. With Python's dominance in the Data Science space and continuing popularity in the Open-Source community (Bafatakis et al., 2019), the need for tools to aid empirical software engineering research is going to grow.

6.1.5. Further study of conformance to Python idioms by Data Scientists.

The Python community is well known for the ‘Zen of Python’, a set of high-level coding conventions, and recent work has set about cataloguing these conventions (Alexandru et al., 2018). While our work indicated a discrepancy between coding standards in Data Science projects compared to non-Data Science projects, Python specific idioms not detected by Pylint may still have been used frequently.

7. Threats to Validity

7.1. Internal validity

7.1.1. Confounding variables

As seen in Figure 5, our dataset includes a diversity of projects with a wide spread of lines of code. As such, it is vital to correct for the number of lines of code in the project when reporting the number of code style violations. We do this by reporting the number of violations per line of code. However, this may not be suitable for all warning types. For example, as imports only occur once at the top of the file, import related errors may not increase linearly with the number of lines of code (in our analysis we exclude import errors). Similarly, for errors related to functions (e.g. “too-many-arguments”), it may be more appropriate to normalise by the number of functions rather than the lines of code.

7.1.2. Assumption of independence

An assumption in our calculation of statistical significance is that the data points are independent. This assumption can be violated if the data contains duplicated repositories. As a guard against this, we excluded forked repositories, as noted in subsubsection 4.1.2. However, during our investigation, we came across some instances where the exact same code violation was repeated across multiple repositories due to partial sharing of code that had been copied from another repository then modified. Another threat is that the same developer may be involved in multiple projects. As such, the probability that the patterns observed in our paper are coincidental (i.e. false positives) may be higher than suggested by the p-values due to projects that share code and/or authors.

7.1.3. Robustness to outliers

Some repositories contained non-project related code, such as dependencies copied directly into the repository, or even entire Python virtual environments. Furthermore, some repositories encoded data as Python files, leading to source files with over 50,000 lines of code. Only the maximum disk limit enforced by GitHub limits what a repository can contain, so any single repository has the potential to arbitrarily distort averages through committing irrelevant directories containing large Python files. To limit the extent to which any one repository can affect the results, subsection 5.2 treats all repositories with equivalent weight (regardless of how many files they contain), and reports the median as this is more robust to outliers than the mean (but unfortunately is not suited to rare error types, for which the median is zero). The module-level analysis in subsection 5.3 also limits the impact of large repositories containing irrelevant modules by averaging over all files in a repository of a certain category (Unit Test, Uses DS Library, or Does not use DS Library) such that each repository only contributes a single data point for each category. Unfortunately, however, the module-level analysis may be affected by uneven group sizes (e.g. if the Uses DS Library category contains longer or more files than the Does not use DS Library category, then there is a greater chance that it will contain at least one line that produces a warning, and thus a greater chance of producing a positive, albeit small, warnings per LOC score). As such, the distributions presented in this section need to be interpreted with caution; the median of the resultant distribution may be distorted due to uneven group sizes, but the mean of the resultant distribution (mean of means) will not be biased by group size.

7.1.4. Validity of Pylint

Pylint uses the current Python environment in order to resolve imports and detect coding errors such as calling a non-existent method or passing the wrong number of arguments. As such, the Pylint results may be influenced by which Python packages were installed in the Python environment—for example, if the project depends on a newer version of a library than what is installed in the Python environment, it may lead to warnings related to use of newer methods that do not appear in the installed version of the library. However, given the diversity of projects analysed (some of which did not specify dependencies, or only did so via a README file), it was impractical to reliably determine and install dependencies for the projects analysed. To ensure a fair comparison, we used a clean Python environment containing only essential packages needed for data extraction (findimports, pylint, radon, and their dependencies). A consequence of this decision is that certain Pylint checks (such as import-error) should be ignored.

As Pylint is popular in the Python community, projects that use Pylint (or a similar linting library) as part of their Continuous Integration (CI) are likely to have few (or no) warnings. Furthermore, source files may include “# pylint:” comments that can instruct Pylint to ignore warnings for that file. Thus our results may reflect whether projects used a linting system rather than developer commitment to code quality. However, Pylint identified at least one warning in nearly all repositories, indicating that Pylint is either not being widely used by repositories in the corpus, or that projects do not have all Pylint checks turned on.

Pylint also has known bugs relating to false positives. The Pylint issue tracker currently contains 104 open bugs (and 318 closed bugs) reporting or relating to a “false positive”. As this paper examines differences in the number of code standard violations, false positives will not undermine the analysis if they are triggered consistently across all projects. Unfortunately however, false positives can be triggered by Python features such as type hints, leading to counter intuitive results where use of features such as type hints (designed to promote better code) has the potential to trigger more warnings.

7.1.5. Identification of Data Science repositories

We relied on Biswas et al. (Biswas et al., 2019) for the identification of Data Science repositories, and negated their query to identify non-Data Science repositories. However, as an automated approach, it is possible that some traditional projects appear in the Data Science corpus, and vice versa. Nevertheless, as our study focuses on examining the differences between these groups, the observed differences are still meaningful so long as the Data Science corpus contains a higher proportion of Data Science projects than the non-Data Science corpus. The effect of imperfect identification will be to reduce the observed strength of the differences. In future work we intend to manually perform a qualitative analysis of the corpora to form a more complete picture of their contents.

7.2. External validity

Our analysis was limited to Open-Source Data Science projects on GitHub. Additional research is needed to test whether our results generalise to closed-source Data Science projects undertaken for commercial reasons, in which there is the ability for managers to enforce top-down processes across the entire development team.

8. Conclusion

In this study we conduct the first large-scale empirical investigation to examine the extent to which Open-Source Data Science projects follow coding standards. Our results provide empirical evidence in support of our hypothesis that Data Science projects show differences in coding standards when compared to other projects. Specifically, we found that Open-Source Data Science projects in the corpus contain over twice as many cases of functions/methods with excessive numbers of local variables and also contain more frequent violations of variable naming standards.

Through pursuing this line of research, we aim to identify barriers to communication within inter-disciplinary teams arising from differing norms, and restore the sense of shared ownership over the entire end-to-end Data Science workflow. Future work will involve qualitative analysis to identify causal links for our findings and an evaluation with practitioners to determine practically relevant coding standard violations. Our research provides an initial indication that the technical domain of Data Science may require specific coding standards. These coding standards could be used to inform development of a software linting tool tailored to the needs of Data Science projects.

Acknowledgements.
The authors would like to thank Nikhil Bhat and the Surround development team for assistance with extracting the project metrics used in this paper, and acknowledge the Pylint project contributors.

References

  • C. V. Alexandru, J. J. Merchante, S. Panichella, S. Proksch, H. C. Gall, and G. Robles (2018) On the usage of pythonic idioms. In 2018 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 1–11.
  • M. Allamanis, E. T. Barr, C. Bird, and C. Sutton (2014) Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2014), pp. 281–293.
  • B. Allbee (2018) Hands-On Software Engineering with Python: Move beyond basic programming and construct reliable and efficient software with complex code. Packt Publishing Ltd.
  • N. Bafatakis, N. Boecker, W. Boon, M. C. Salazar, J. Krinke, G. Oznacar, and R. White (2019) Python coding style compliance on Stack Overflow. In Proceedings of the 16th International Conference on Mining Software Repositories (MSR), pp. 210–214.
  • S. Barnett (2017) Extracting Technical Domain Knowledge to Improve Software Architecture. Ph.D. Thesis, Swinburne University of Technology.
  • S. Biswas, M. J. Islam, Y. Huang, and H. Rajan (2019) Boa meets Python: a Boa dataset of Data Science software in Python language. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pp. 577–581.
  • H. B. Braiek, F. Khomh, and B. Adams (2018) The open-closed principle of modern machine learning frameworks. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR '18), pp. 353–363.
  • R. M. Dos Santos and M. A. Gerosa (2018) Impacts of coding practices on readability. In Proceedings of the International Conference on Software Engineering, pp. 277–285.
  • M. Fowler (2018) Refactoring: Improving the Design of Existing Code. Addison-Wesley Professional.
  • G. Gousios (2013) The GHTorrent dataset and tool suite. In 2013 10th Working Conference on Mining Software Repositories (MSR), pp. 233–236.
  • J. C. Hofmeister, J. Siegmund, and D. V. Holt (2019) Shorter identifier names take longer to comprehend. Empirical Software Engineering 24 (1), pp. 417–443.
  • O. Jarczyk, B. Gruszka, S. Jaroszewicz, L. Bukowski, and A. Wierzbicki (2014) GitHub projects. Quality analysis of open-source software. In International Conference on Social Informatics, pp. 80–94.
  • M. Kim, T. Zimmermann, R. DeLine, and A. Begel (2018) Data scientists in software teams: state of the art and challenges. IEEE Transactions on Software Engineering 44 (11), pp. 1024–1038.
  • T. Lee, J. B. Lee, and H. P. In (2015) Effect analysis of coding convention violations on readability of post-delivered code. IEICE Transactions on Information and Systems E98D (7), pp. 1286–1296.
  • B. A. Malloy and J. F. Power (2019) An empirical analysis of the transition from Python 2 to Python 3. Empirical Software Engineering 24 (2), pp. 751–778.
  • V. Markovtsev, W. Long, H. Mougard, K. Slavnov, and E. Bulychev (2019) Style-Analyzer: fixing code style inconsistencies with interpretable unsupervised algorithms. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pp. 468–478.
  • R. C. Martin (2009) Clean Code: A Handbook of Agile Software Craftsmanship. Pearson Education.
  • T. J. McCabe (1976) A complexity measure. IEEE Transactions on Software Engineering SE-2 (4), pp. 308–320.
  • R. Nundhapana and T. Senivongse (2018) Enhancing understandability of Objective C programs using naming convention checking framework. Lecture Notes in Engineering and Computer Science 2237, pp. 314–319.
  • S. Omari and G. Martinez (2020) Enabling empirical research: a corpus of large-scale Python systems. In Proceedings of the Future Technologies Conference (FTC) 2019, pp. 661–669.
  • A. Papamichail, A. V. Zarras, and P. Vassiliadis (2020) Do people use naming conventions in SQL programming? In SOFSEM 2020: Theory and Practice of Computer Science, pp. 429–440.
  • D. Posnett, A. Hindle, and P. Devanbu (2011) A simpler model of software readability. In Proceedings of the 8th Working Conference on Mining Software Repositories, pp. 73–82.
  • pylint. Pylint - code analysis for Python. www.pylint.org.
  • python.org (2019) PEP 8 - Style Guide for Python Code. https://www.python.org/dev/peps/pep-0008/.
  • Radon (2018) Radon 2.4.0 documentation.
  • A. Raghuraman, T. Ho-Quang, M. R. V. Chaudron, A. Serebrenik, and B. Vasilescu (2019) Does UML modeling associate with lower defect proneness? A preliminary empirical investigation. In 16th International Conference on Mining Software Repositories (MSR).
  • D. Sculley, G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, M. Young, J. Crespo, and D. Dennison (2015) Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems, pp. 2503–2511.
  • M. Smit, B. Gergel, H. J. Hoover, and E. Stroulia (2011) Code convention adherence in evolving software. In 2011 27th IEEE International Conference on Software Maintenance (ICSM), pp. 504–507.
  • D. Spinellis (2011) elytS edoC. IEEE Software 28 (2), pp. 103–104.
  • A. J. Spolsky (2008) More Joel on Software: Further Thoughts on Diverse and Occasionally Related Matters That Will Prove of Interest to Software Developers, Designers, and Managers, and to Those Who, Whether by Good Fortune or Ill Luck, Work with Them in Some Capacity. Apress.
  • J. Zhu, M. Zhou, and A. Mockus (2014) Patterns of folder use and project popularity: a case study of GitHub repositories. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, p. 30.
  • W. Zou, J. Xuan, X. Xie, Z. Chen, and B. Xu (2019) How does code style inconsistency affect pull request integration? An exploratory study on 117 GitHub projects. Empirical Software Engineering 24 (6), pp. 3871–3903.