On the differences between quality increasing and other changes in open source Java projects

09/08/2021 ∙ by Alexander Trautsch, et al. ∙ Clausthal University of Technology, University of Göttingen

Static software metrics, e.g., size, complexity and coupling, are used in defect prediction research as well as in software quality models to evaluate software quality. Static analysis tools also include boundary values for complexity and size that generate warnings for developers. However, recent studies found that complexity metrics may be unreliable indicators for the understandability of source code and therefore may have less impact on software quality. To explore the relationship between quality and changes, we leverage the intent of developers about what constitutes a quality improvement in their own code base. We manually classify a randomized sample of 2,533 commits from 54 Java open source projects as quality improving depending on the intent of the developer by inspecting the commit message. We distinguish between perfective and corrective maintenance via predefined guidelines and use this data as ground truth for fine-tuning a state-of-the-art deep learning model created for natural language processing. The benchmark we provide with our ground truth indicates that the deep learning model can be confidently used for commit intent classification. We use the model to increase our data set to 125,482 commits. Based on the resulting data set, we investigate the differences in size and 14 static source code metrics between changes that increase quality and other changes. In addition, we investigate which files are the targets of quality improvements. We find that quality improving commits are smaller than other commits. Perfective changes have a positive impact on static source code metrics, while corrective changes tend to add complexity. Furthermore, we find that files which are the target of perfective maintenance already have a lower median complexity than other files.




1 Introduction

Software quality is notoriously hard to measure (Kitchenham and Pfleeger, 1996). The main reason is that quality is subjective and consists of multiple factors. This idea was formalized by Boehm and McCall in the 1970s (Boehm et al., 1976; McCall et al., 1977). Both introduced a layered approach in which software quality consists of multiple factors. The ISO/IEC 9126 standard (ISO/IEC, 2001) and its successor ISO/IEC 25010 (2011) also approach software quality in this fashion.

All these ideas contain abstract quality factors. However, the question remains which concrete measurements we can perform to evaluate the abstract factors of which software quality consists, i.e., how do we measure software quality? Some software quality models recommend concrete measurements, e.g., ColumbusQM (Bakota et al., 2011) and Quamoco (Wagner et al., 2012). Defect prediction researchers also try to build (machine learning) models to find a function that maps measurable metrics to the number of defects in the source code. This can also be thought of as a software quality evaluation that tries to map internal software quality, measured by code or process metrics, to external software quality, measured by defects (Fenton and Bieman, 2014).

Software quality models and defect prediction models use static source code metrics as a proxy for quality. The intuition is that complex code, as measured by static source code metrics, is harder to reason about and therefore more prone to errors. However, recent research by Peitek et al. (2021) showed that measured code complexity is perceived very differently between developers and does not translate well to code understanding. A similar result was found by Scalabrino et al. (2021), although their work focuses on readability measured in a static way. Both studies, due to their nature, observe developers in a controlled experiment with code snippets. To supplement these results, it would be interesting to measure what developers change in their code “in the wild” to improve software quality.

While there are multiple publications on maintenance or change classification after Swanson (1976), e.g., (Mockus and Votta, 2000; Mauczka et al., 2012; Levin and Yehudai, 2017; Hönel et al., 2019), we are not aware of a publication that investigates differences in multiple software metrics between corrective or perfective maintenance and other changes. Most recent work focuses on specific aspects instead of a broad overview, e.g., how software metrics change when code smells are removed (Bavota et al., 2015) or refactorings are applied (Bavota et al., 2015; Alshayeb, 2009; Pantiuchina et al., 2020). However, a broader overview is needed to answer the question of what, in general, changes in quality improving commits with respect to other commits. In this work, we identify changes that are intended to increase quality and measure the current value, previous value, and delta of common source code metrics used in a current version (Bakota et al., 2014) of the Columbus quality model (Bakota et al., 2011). We use the commit message contained in each change to find commits where the intent of the developer is to improve software quality.

This provides us with a view of corrective and perfective maintenance commits when we frame this approach in terms of the classification by Swanson (1976). Perfective maintenance should increase internal quality, while corrective maintenance should increase external quality. Both categories should increase the overall quality of the software. To ease readability, we adopt the perfective and corrective terms defined by Swanson for the rest of the paper when referring to the categories. For general assumptions, we adopt the internal and external quality terms by Fenton and Bieman (2014).

Within our study, we first manually classify the commit intent for a sample of 2,533 commits from 54 open source projects. The manual classification is provided by two researchers according to predefined guidelines. According to the overview of previous research in this area provided by AlOmar et al. (2021), this is the largest manual commit classification study. We use this data as ground truth to fine-tune a state-of-the-art deep learning model for natural language processing that was pre-trained exclusively on software engineering data (von der Mosel et al., 2021). After we determine the performance of the model, we use the fine-tuned version to classify all commits, increasing our data set to 125,482 commits.

We use the automatically classified data to conduct a two-part study. The first part is a confirmatory study into the expected behavior of metrics for quality increasing changes. We derive hypotheses about the expected behavior of the measurements from existing quality models and from the related literature. In case our data matches the expected behavior from the literature, we can confirm the postulated theories and provide evidence in favor of using the measurements. Otherwise, we try to establish which metrics may be unsuitable for quality estimation, including the potential reasons. Furthermore, we determine whether metrics used in software quality models are impacted by quality increasing maintenance, thereby providing an evaluation of software quality measurement metrics.

The second part of our study is of an exploratory nature. We investigate which files are the target of quality improvements by the developers. We explore whether only complex files receive perfective changes and which metric values are indicative of corrective changes. This provides practitioners and static analysis tool vendors with data for boundary values which are likely to have a positive impact on the quality of source code from the perspective of the developers.

Overall, our work provides the following contributions:

  • A large data set of manual classifications of commit intents with categories for improving internal and external quality.

  • A confirmatory study of size and complexity metric value changes for quality improvements.

  • An exploratory study of size and complexity metric values of files that are the target of quality improvements.

  • A fine-tuned state-of-the-art deep learning model for automatic classification of commit intents.

The main findings of our study are the following:

  • Our confirmatory study confirms the finding of previous work that quality increasing commits are smaller than other changes.

  • While perfective changes have a positive impact on most static source code metrics, corrective changes have a negative impact on size and complexity metrics.

  • The files that are the target of perfective changes are already less complex and smaller than other files.

  • The files that are the target of corrective changes are more complex and larger than other files.

The remainder of this paper is structured as follows. In Section 2, we define our research questions and hypotheses. In Section 3, we discuss the previous work related to our study. Section 4 contains our case study design with descriptions for subject selection as well as data sources and analysis procedure. In Section 5, we present the results of our case study which are discussed in Section 6. Section 7 lists our identified threats to validity and Section 8 closes with a short conclusion of our work.

2 Research Questions and Hypotheses

In our study, we answer two research questions.

  • RQ1: Does developer intent to improve internal or external quality have a positive impact on software metrics?
    Previous work provides us with certain indications about the impact on software metrics. This is part of our confirmatory study, and we derive two hypotheses from previous work regarding how size and software metrics should change for different types of quality improvement. We formulate our assumptions as hypotheses and test these in our case study.

    • H1: Intended quality improvements are smaller than other changes.
      Mockus and Votta (2000) found that corrective changes modify fewer lines, while perfective changes delete more lines. Purushothaman and Perry (2005) also observed more deletions for perfective maintenance and an overall smaller size of perfective and corrective maintenance. Hönel et al. (2019) used size-based metrics as additional features for an automated approach to classify maintenance types. They found that the size-based metrics increased the classification performance. Moreover, just-in-time quality assurance (Kamei et al., 2013) builds on the assumption that changes and metrics derived from these changes can predict bug introduction, meaning there should be a difference. Therefore, we hypothesize that corrective as well as perfective maintenance consists of smaller changes. Feature additions should be larger than both; therefore, we assume that the categories we are interested in, perfective and corrective, are smaller overall.

    • H2: Intended quality improvements impact software quality metrics in a positive way.
      In this paper we focus on metrics used in the Columbus Quality Model (Bakota et al., 2011, 2014). The metrics are specifically chosen for a quality model, so they should provide different measurements depending on the maintenance category. Prior research, e.g., Chávez et al. (2017) and Stroggylos and Spinellis (2007), found that refactorings, which are part of our classification, have a measurable impact on software metrics. We hypothesize that an improvement consciously applied by a developer via a perfective commit has a measurable, positive impact on software metrics. Positive means that we expect a certain direction of the metric value change, e.g., complexity is reduced. We note our expected direction for each metric together with a description in Table 4.

      Defect prediction research assumes a connection between software metrics and external software quality in the form of bugs. While most publications in defect prediction do not investigate the impact of single bug fixing changes, the most common datasets all contain coupling, size and complexity metrics as independent variables, e.g., (Jureczko and Madeyski, 2010; NASA, 2004; D’Ambros et al., 2012); see also Hosseini et al. (2017). We hypothesize that fixing bugs via corrective commits has a measurable, positive impact on software metrics.

Our second research question is exploratory in nature.

  • RQ2: What kind of files are the target of internal or external quality improvements?
    The first part of our study provides us with information about metric value changes for quality increasing commits. In this part, we explore which files are the target of quality increasing commits. We are interested in how complex, on average, a file that receives perfective maintenance is, e.g., measured via cyclomatic complexity. Moreover, on the external quality side, we are interested in which files receive corrective changes. Due to the exploratory nature of this research question we do not derive hypotheses.

3 Related Work

We separate the discussion of the related work into publications on the classification of changes and publications on the relation between the intended quality improvements and software metrics.

Most prior work that follows a similar approach to ours is concerned with specific types of quality improving changes, e.g., refactoring changes and code smells. We note that some code smell detection approaches are based on the internal software quality metrics which we use in our study.

We first present previous research related to the first phase of our study: classification of changes. Multiple studies are concerned with classification of changes into maintenance types.

Mockus and Votta (2000) study changes in a large system and identify reasons for changes. They find that a textual description of the change can be used to identify the type of change with a keyword-based approach, which they validated with a developer survey. The authors classified changes into Swanson's maintenance types. They find that corrective and perfective changes are smaller and that perfective changes delete more lines than other changes.

Mauczka et al. (2012) present an automatic keyword-based approach for classification into Swanson's maintenance types. They evaluate their approach and provide a keyword list for each maintenance type together with a weight.
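To make this line of work concrete, a weighted keyword classifier in the spirit of these approaches can be sketched as follows. The keyword lists and weights below are made up for illustration; they are not the published lists of Mauczka et al. (2012).

```python
# Illustrative sketch of a weighted keyword-based maintenance-type
# classifier in the spirit of Mockus and Votta (2000) and
# Mauczka et al. (2012). Keywords and weights are invented examples.
KEYWORDS = {
    "corrective": {"fix": 2, "bug": 2, "npe": 1, "error": 1},
    "perfective": {"refactor": 2, "cleanup": 2, "simplify": 1, "javadoc": 1},
}

def classify(commit_message: str) -> str:
    """Return the maintenance type with the highest keyword score."""
    text = commit_message.lower()
    scores = {
        label: sum(w for kw, w in kws.items() if kw in text)
        for label, kws in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    # No keyword hit at all means we cannot assign a maintenance type.
    return best if scores[best] > 0 else "other"

print(classify("Fix NPE in monitor module"))   # corrective
print(classify("Refactor parser, cleanup"))    # perfective
print(classify("Add support for PAM files"))   # other
```

Such heuristics are cheap but brittle, which is one motivation for the manual classification and the fine-tuned language model used in this study.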

Fu et al. (2015) present an approach for change classification that uses latent Dirichlet allocation. They study five open source projects and classify changes into Swanson's maintenance types together with a “not sure” type. The keyword list of their study is based on Mauczka et al. (2012).

Mauczka et al. (2015) collect developer classifications for three different classification schemes. Their data contains 967 commits from six open source projects. While the developers themselves are the best source of information, we believe that within the guidelines of our approach our classifications are similar to those of the developers. We evaluate this assumption in Section 4.2.

Yan et al. (2016) use discriminative topic modeling also based on the keyword list by Mauczka et al. (2012). They focus on changes with multiple categories.

Levin and Yehudai (2017) improve maintenance type classification by utilizing source code in addition to keywords. This is an indication that metrics which are computed using source code are impacted by different maintenance types.

Hönel et al. (2019) use size based metrics as additional features for automated classification of changes. In our study, we first classify the change and then look at how this impacts size and spread of the change. However, the differences we found in our study support the assumption that size based features can be used to distinguish change categories.

More recently, Wang et al. (2021) also analyze developer intents from commit messages. They focus on code changes with large review effort instead of quality changes or maintenance types. They also use a keyword-based heuristic for the classification. They do not, however, include a perfective maintenance classification.

Ghadhab et al. (2021) also use a deep learning model to classify commits. They use word embeddings from the deep learning model in combination with fine-grained code changes to classify into Swanson's maintenance categories. In contrast to Ghadhab et al., we do not include code changes in our automatic classification and focus only on the commit message.

The classification of changes for the ground truth in our study is based on manual inspection by two researchers instead of a keyword list. We specify guidelines for the classification procedure which enable other researchers to replicate our work. To accept or reject our hypotheses, we only inspect internal and external quality improvements which would correspond to the perfective and corrective maintenance types by Swanson. In contrast to the previous studies, we relate our classified changes also to a set of static software metrics.

We now present research related to our second phase, the relation between intended quality improvements and software metrics. Stroggylos and Spinellis (2007) identified changes where the developers stated a refactoring intent in the commit message. The authors then measured several source code metrics to evaluate the quality change. In contrast to the work of Stroggylos and Spinellis (2007), we do not focus on refactoring keywords. Instead, we consider refactoring as a part of our classification guidelines. Moreover, our aim is to investigate whether the metrics most commonly used as internal quality metrics (see also Al Dallal and Abdin (2018)) are the ones that change when developers perform quality improving changes, including refactoring.

Fakhoury et al. (2019) investigated the practical impact of software evolution with developer-perceived readability improvements on existing readability models. After finding target commits via commit message filtering, they applied state-of-the-art readability models before and after the change and investigated the impact of the change on the resulting readability score.

Pantiuchina et al. (2018) analyze commit messages to extract the intent of the developer to improve certain static source code metrics related to software quality. In contrast to their work, we are not extracting the intent to improve certain static code metrics but instead focus on overall improvement to measure the delta of a multitude of metrics between the improving commit and its parents. Developers may not use the terminology Pantiuchina et al. base their keywords on, e.g., instead of writing reduce coupling or increase cohesion the developer may simply write refactoring or simplify code.

In contrast to the previous studies, we relate developer intents to improve the quality either by perfective maintenance or by corrective maintenance to change size metrics and static source code metrics. In addition, we also look at which files are the target of quality improvements regarding their static source code metrics.

4 Case Study Design

The goal of our case study is to gather empirical data about changes due to developer intents to improve the quality of the code base. To achieve this, we first sample a number of commits from our selected study subjects. This sample is classified by two researchers into two categories of quality improving and other commits. This data is then used to train a model that can confidently classify the rest of our data. The classified commits are then used to investigate the static source code metric changes to accept or reject our hypotheses in the confirmatory part of our study. After that, we investigate the metric values before the change is applied in the exploratory part of our study.

4.1 Data and Study Subject Selection

Project Timeframe #C #S #SP #SC #AP #AC
archiva 2005-2018 3,914 79 35 17 1,478 1,005
calcite 2012-2018 1,987 40 8 14 565 665
cayenne 2007-2018 3,738 75 31 14 1,470 1,007
commons-bcel 2001-2019 884 18 9 6 588 171
commons-beanutils 2001-2018 577 12 5 2 317 130
commons-codec 2003-2018 828 17 12 1 619 76
commons-collections 2001-2018 1,827 37 27 3 1,185 200
commons-compress 2003-2018 1,598 32 17 6 873 317
commons-configuration 2003-2018 2,075 42 23 7 1,027 253
commons-dbcp 2001-2019 1,034 21 15 3 672 211
commons-digester 2001-2017 1,256 26 16 0 744 113
commons-imaging 2007-2018 682 14 10 2 476 96
commons-io 2002-2018 1,036 21 15 3 613 171
commons-jcs 2002-2018 788 16 10 1 400 162
commons-jexl 2002-2018 1,469 30 20 1 873 199
commons-lang 2002-2018 3,261 66 50 6 2,182 420
commons-math 2003-2018 4,675 94 66 10 2,981 574
commons-net 2002-2018 1,092 22 13 5 585 246
commons-rdf 2014-2018 529 11 9 0 341 35
commons-scxml 2005-2018 479 10 6 2 256 76
commons-validator 2002-2018 1,573 32 18 6 900 296
commons-vfs 2002-2018 1,136 23 11 8 628 207
eagle 2015-2018 582 12 5 4 104 199
falcon 2011-2018 1,547 31 7 13 255 676
flume 2011-2018 1,489 30 5 14 266 591
giraph 2010-2018 854 18 4 6 201 281
gora 2010-2019 569 12 3 4 182 141
helix 2011-2019 2,199 44 8 9 552 580
httpcomponents-client 2005-2019 2,399 48 22 16 1,113 639
httpcomponents-core 2005-2019 2,598 52 25 12 1,326 544
jena 2002-2019 8,698 174 88 34 4,163 1,424
jspwiki 2001-2018 4,326 87 32 25 1,523 941
knox 2012-2018 1,131 23 3 10 266 306
kylin 2014-2018 6,789 136 40 40 1,904 2,163
lens 2013-2018 1,370 28 9 9 321 479
mahout 2008-2018 2,075 42 16 15 836 467
manifoldcf 2010-2019 2,867 58 10 21 602 1,164
mina-sshd 2008-2019 1,281 26 10 6 381 396
nifi 2014-2018 3,299 66 12 18 592 1,052
opennlp 2008-2018 1,763 36 22 6 805 275
parquet-mr 2012-2018 1,228 25 7 9 439 316
pdfbox 2008-2018 8,256 166 81 69 3,934 2,904
phoenix 2014-2019 7,835 157 23 83 828 4,545
ranger 2014-2018 2,213 45 10 20 434 908
roller 2005-2019 2,435 49 15 13 869 723
santuario-java 2001-2019 1,455 30 14 5 627 406
storm 2011-2018 2,839 57 24 9 987 716
streams 2012-2019 911 19 7 2 264 196
struts 2006-2018 2,945 59 21 18 1,191 682
systemml 2012-2018 3,860 78 21 25 921 1,416
tez 2013-2018 2,359 48 8 27 443 1,223
tika 2007-2018 2,581 52 11 10 705 740
wss4j 2004-2018 2,455 50 22 10 712 702
zeppelin 2013-2018 1,836 37 11 6 333 699
Total 125,482 2,533 1,022 685 47,852 35,124
Table 1: Case study subjects with time frame and distribution of commits. All considered commits (#C), sample size (#S), sample perfective commits (#SP), sample corrective commits (#SC), all perfective commits (#AP), all corrective commits (#AC)

The data we use in our study is a SmartSHARK (Trautsch et al., 2017) database taken from Trautsch et al. (2020). We use all projects and commits in the database. However, only commits that change production code and which are not empty are considered. For each change in our data we extract the list of changed files, the number of changed lines, the number of hunks (a hunk is an area within a file that is changed), and the delta as well as the previous and current values of the source code metrics of the changed files between the parent and the current commit. To create our ground truth sample, we randomly sample 2% of commits per project, rounded up, for manual classification.
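The per-project sampling step above (2% of commits, rounded up) could be sketched as follows; the dictionary structure and the seed are illustrative assumptions, not the study's actual tooling.

```python
import math
import random

def sample_commits(commits_per_project: dict, fraction: float = 0.02,
                   seed: int = 42) -> dict:
    """Draw ceil(fraction * n) commits per project without replacement.

    `commits_per_project` maps a project name to its list of commit
    hashes; both the structure and the fixed seed are illustrative.
    """
    rng = random.Random(seed)
    sample = {}
    for project, commits in commits_per_project.items():
        k = math.ceil(fraction * len(commits))  # 2% rounded up
        sample[project] = rng.sample(commits, k)
    return sample

# archiva has 3,914 commits in Table 1, so 2% rounded up is 79,
# matching the #S column.
demo = {"archiva": [f"c{i}" for i in range(3914)]}
print(len(sample_commits(demo)["archiva"]))  # 79
```

Rounding up guarantees that even small projects such as commons-scxml (479 commits, sample size 10) contribute at least a handful of commits to the ground truth.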

The data consists of Java open source projects under the umbrella of the Apache Software Foundation. All projects use an issue tracking system and were still active when the data was collected. Each project consists of at least 100 files and 1,000 commits and is at least two years old. Table 1 shows every project, the number of commits, and the years of data we consider for sampling. In addition, we include the number of perfective and corrective commits for our ground truth and the final classification.

4.2 Change Type Classification Guidelines

As we do not rely on a keyword-based approach and there is no existing guideline for this kind of classification, we created a guideline based on Herzig et al. (2013). Our ground truth consists of a sample of changes which we manually classified into perfective, corrective, and other changes. Every commit message is inspected independently by two researchers using a graphical frontend that loads the sample and displays the commit message, to which each researcher then assigns a label. If the commit message does not provide enough information, we inspect additional linked information in the form of bug reports or the change itself. In case of a link between the commit message and the issue tracking system, we inspect the bug report and determine whether it is a bug according to the guidelines by Herzig et al. (2013). We perform this step because the reporter of a bug sometimes assigns a wrong type. We defined the guidelines listed in Table 2, which both researchers used for the classification of changes. The deep learning model for our final classification of intents only receives the commit message. This is a conscious trade-off: on the one hand, we want the ground truth to be as exact as possible; on the other hand, we want to keep the automatic intent classification as simple as possible. The results of our fine-tuning evaluation show that the model does not need the additional data from changes and issue reports to perform well.

Both researchers achieve a substantial inter-rater agreement (Landis and Koch, 1977) with a Cohen's Kappa score of 0.66 (Cohen, 1960). Disagreements are discussed and assigned a label both researchers agree upon after discussion. The disagreement frontend shows both prior labels anonymized in random order.
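Cohen's kappa compares the observed agreement p_o with the agreement p_e expected by chance from each rater's label distribution. A minimal pure-Python sketch, with invented example labels rather than the study's actual ratings:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both raters labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Invented example ratings for two raters over six commits.
a = ["perfective", "corrective", "other", "other", "perfective", "corrective"]
b = ["perfective", "corrective", "other", "perfective", "perfective", "other"]
print(round(cohens_kappa(a, b), 2))  # 0.5
```

A raw agreement percentage would overstate reliability here, which is why the study reports kappa: it discounts the agreement two raters would reach by chance alone.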

A change is classified as perfective if…
  1. the commit message says code is removed or marked as deprecated.

  2. code is moved to new packages.

  3. generics are introduced, new Java features are used, existing code is switched to collections, or class members are switched to final.

  4. documentation is improved or example code is updated.

  5. static analysis warnings are fixed even though no related bug is reported.

  6. code is reformatted or the readability is otherwise improved (e.g. whitespace fixes or tabs to spaces).

  7. existing code is cleaned up, simplified, or its efficiency improved.

  8. dependencies are updated.

  9. developer tooling is improved, e.g., build scripts or logging facilities.

  10. the repository layout is cleaned, e.g., by removing compiled code or maintaining .gitignore files.

  11. tests are improved or added.

Examples: “Eliminated unused private field. JIRA: DBCP-255” (commons-dbcp:fd59279): because of other null checks it was already impossible to use the field, thus this is clean up. “[CODEC-127] Non-ascii characters in source files” (commons-codec:096c0cc): while the linked issue is a bug, it only affects IDEs for developers and not the compiled code; thus, this is an improvement of developer tooling. “JEXL-240: Javadoc” (commons-jexl:e3c80ca): the message indicates that this commit only improved the code comments, therefore it is classified as perfective.
A change is classified as corrective if…
  1. the commit message mentions bug fixes.

  2. the commit message or the linked issue mentions that a wrong behaviour is fixed.

  3. the commit message or the linked issue mentions that a NullPointerException is fixed.

  4. a bug report is linked via the commit message that is of type bug and is not just a feature request in disguise (see Herzig et al. (2013)).

Examples: “KYLIN-940, fix NPE in monitor module, apply patch from Xiaoyu Wang” (kylin:623585f): this fixes a NullPointerException that is visible to the end user. “owl syntax checker (bug fixes)” (jena:15ecc3c): fixes a wrong behavior.
A change is classified as other if…
  1. the commit message mentions feature or functionality addition.

  2. the commit message mentions license information or copyrights changes.

  3. the commit message mentions repository related information with unclear purpose, e.g., merges of branches without information, tagging of releases.

  4. the commit message mentions that a release is prepared.

  5. an issue is linked via the commit message that requests a feature.

  6. any of the points 1-5 is tangled with a perfective or corrective classification.

Examples: “KYLIN-715 fix license issue” (kylin:d9cf556): license changes or additions are not direct improvements of source code. “Support the alpha channel for PAM files. Fix the alpha channel order when reading and writing. Add various tests.” (commons-imaging:9c563ec): this change adds support for a new feature, fixes something, and adds tests; it is therefore highly tangled and we do not classify it as either or both.
Table 2: Classification rules and examples; project:commit identifiers denote the example commit messages from our data.

In contrast to the classification by Mauczka et al. (2015) and Hattori and Lanza (2008), we do not categorize release tagging, license or copyright corrections as perfective. Our rationale is that these changes are not related to code quality, which is our main interest in this study.

To validate our guidelines against developer classifications, the first author re-classified the Java projects from Mauczka et al. (2015) via our guidelines. Of the 339 commits from Deltaspike, Mylyn-reviews and Tapiji, our classification concurs with the developers' perfective classification 251 times and disagrees 88 times. Of the 88 perfective disagreements, 33 are due to differences in our guidelines, e.g., because we do not consider license and copyright information or tagging of releases as perfective. We are left with 55 perfective disagreements where it is unknown why the developers classified them as perfective. Several commits contain some variation of “minor bugfixes” which are classified as perfective maintenance by the developers, or as both corrective and perfective, whereas we classify them as corrective. Additionally, code removal or test additions were not classified as perfective changes by the developers, but rather as corrective changes. This difference may be due to additional knowledge about tangled changes which is not visible through the commit message.

For corrective commits, our classification agreed with the developers on 296 commits and disagreed on 43 commits. Some of these are also contained in the 55 perfective disagreements; we find that some test changes are classified as corrective while we classify them as perfective. There are also some cleanups and removals which contain no hint of an underlying bug but which are classified as corrective by the developers. We assume that the developers simply have more information here than we do. Based on the information available to us, we cannot decide if these are misclassifications by the developers, the result of differences in the guidelines for classification, or misclassifications by the first author due to lack of in-depth knowledge about the projects.

Overall, we agree in 80% of the cases with the developers. This indicates a high validity of our guidelines.

4.3 Deep Learning For Commit Intent Classification

Model Acc. F1 MCC Description
von der Mosel et al. (2021) 0.80 0.79 0.70 BERT model pre-trained on software engineering data, fine-tuned with only commit messages
Ghadhab et al. (2021) 0.78 0.80 - BERT model pre-trained on natural language, includes code changes
Gharbi et al. (2019) - 0.46 - Multi-label active learning, only commit message
Levin and Yehudai (2017) 0.76 - - Keywords and code changes, Random Forest model
Hönel et al. (2019) 0.80 - - LogitBoost model, includes code density
Table 3: Change classification model performance comparison.

In order to use all available data, we use a deep learning model that classifies all data which is not manually classified into perfective, corrective or other. Due to the size of state-of-the-art deep learning models and the computing requirements for training them, a current best practice is to use a pre-trained model which was trained unsupervised on a large data set. The model is then fine-tuned on labeled data for a specific task.

To achieve a high performance, we use seBERT (von der Mosel et al., 2021), a model that is pre-trained on textual software engineering data with two common Natural Language Processing (NLP) tasks: Masked Language Model (MLM) and Next Sentence Prediction (NSP), which predict randomly masked words in a sentence and the next sentence, respectively. Combined, this allows the model to learn a contextual understanding of the language. While von der Mosel et al. (2021) include a similar benchmark based on our ground truth data, it only used the perfective label, i.e., a binary classification, to demonstrate text classification for software engineering data. In our study, we measure performance for the multi-class case with all three labels: perfective, corrective and other. We first use our ground truth data to evaluate the multi-class performance of the model. We perform a 10x10 cross-validation which splits our data into 10 parts and uses 9 for fine-tuning the model and one for evaluating the performance. The fine-tuning itself splits the data into 80% training and 20% validation data. The model is fine-tuned and evaluated on the validation data after each epoch; the best epoch is then chosen to classify the test data of the fold. This is repeated 10 times for every fold, which yields 100 performance measurements.
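The 10x10 cross-validation procedure can be sketched as follows. This is a simplified illustration, not the actual fine-tuning code: `fine_tune` and `evaluate` are hypothetical callbacks standing in for the seBERT fine-tuning (with per-epoch evaluation and best-epoch selection) and scoring steps.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def ten_by_ten_cv(texts, labels, fine_tune, evaluate, seed=0):
    """Sketch of a 10x10 cross-validation: 10 repetitions of a
    stratified 10-fold split; each fold's training portion is split
    again into 80% training / 20% validation for epoch selection."""
    scores = []
    for rep in range(10):
        kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed + rep)
        for train_idx, test_idx in kf.split(texts, labels):
            # inner 80/20 split used to pick the best epoch
            tr_idx, val_idx = train_test_split(
                train_idx, test_size=0.2, random_state=seed + rep,
                stratify=labels[train_idx])
            model = fine_tune(texts[tr_idx], labels[tr_idx],
                              texts[val_idx], labels[val_idx])
            scores.append(evaluate(model, texts[test_idx], labels[test_idx]))
    return np.array(scores)  # 100 performance measurements
```

With 10 repetitions of 10 folds, the returned array holds the 100 per-fold performance values over which the benchmark statistics are computed.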

Our experiment shows sufficient performance, comparable to other state-of-the-art models for commit classification. We provide the final fine-tuned model as well as the fine-tuning code as part of our replication kit for other researchers. Performance-wise, our model is comparable to Ghadhab et al. (2021) and improves upon other studies, e.g., Gharbi et al. (2019); Levin and Yehudai (2017). However, we note that we fine-tuned the model with only the labels used in our study, i.e., perfective, corrective and other. Therefore, it cannot be used for or directly compared with models that support other commit classification labels. Since such a comparison would require the same data and labels, we can only compare the reported model performance metrics, which we do in Table 3. We can see that our model performs comparably to the BERT model by Ghadhab et al. (2021) and outperforms the Random Forest model by Levin and Yehudai (2017). If we look at the overview of commit classification studies by AlOmar et al. (2021), we can see that our model outperforms the other models for comparable tasks where accuracy or F-measure is given. While this is evidence that our model can perform our required commit intent classification, a thorough comparison of different commit intent classification approaches is not within the scope of this study.

4.4 Metric Selection

Name and Description Abbrev (expected direction)
Cyclomatic Complexity (McCabe, 1976): The number of independent control-flow paths. McCC (decrease)
Logical Lines of Code: Number of lines in a file without comments and empty lines. LLOC (decrease)
Nesting Level else-if: Maximum nesting level in a file. NLE (decrease)
Number of parameters in a method: The sum of all parameters of all methods in a file. NUMPAR (decrease)
Clone Coverage: Ratio of code covered by duplicates. CC (decrease)
Comment lines of code: Sum of commented lines. CLOC (increase)
Comment density: Ratio of CLOC to LLOC. CD (increase)
API Documentation: Number of documented public methods, +1 if the class is documented. AD (increase)
Number of Ancestors: Number of classes, interfaces, enums from which the class inherits. NOA (decrease)
Coupling between object classes: Number of used classes (inheritance, function call, type reference). CBO (decrease)
Number of Incoming Invocations: Other methods that call the current class. NII (decrease)
Minor static analysis warnings: E.g., brace rules, naming conventions. Minor (decrease)
Major static analysis warnings: E.g., type resolution rules, unnecessary/unused code rules. Major (decrease)
Critical static analysis warnings: E.g., equals for string comparison, catching null pointer exceptions. Critical (decrease)
Table 4: Static source code metrics and static analysis warning severities used in this study, including the expected direction of their values in quality increasing commits.

The metric selection is based on the Columbus software quality model by Bakota et al. (2011). The metrics are selected from the current version of the model, which is also in use as QualityGate (Bakota et al., 2014). The current model consists of 14 static source code metrics related to size, complexity, documentation, re-usability and fault-proneness. While the quality model provides us with a selection of metrics, we do not use it directly, as it requires a baseline of projects before it can estimate the quality of a candidate project.

Table 4 shows the metrics utilized in this study, a short description, and the direction in which we assume they change in quality improving commits. As most of the metrics are size and complexity metrics, we expect that their values decrease in comparison to all other commits. The metrics we expect to increase in quality improving commits are comment lines of code, comment density, and API documentation, as added documentation should increase these metrics. The three bottom rows consist of static analysis warnings from PMD (https://pmd.github.io/) aggregated by severity for every file. We are of the opinion that this selection strikes a good balance of size, complexity, documentation, clone, and coupling based metrics.

As we are interested in static source code metrics at commit granularity, we sum the metrics for all files that are changed within a commit. In addition, we extract meta information about each change. The static source code metrics are provided by a SmartSHARK plugin using the OpenStaticAnalyzer (https://openstaticanalyzer.github.io/). To answer our research questions, we provide the delta of the metric value changes as well as their current and previous values.
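The aggregation step can be sketched as follows; the column names and example values are hypothetical, and only one metric (LLOC) with its previous value is shown, assuming one row per changed file and commit.

```python
import pandas as pd

# Hypothetical file-level metric rows: one row per (commit, file).
files = pd.DataFrame({
    "commit":    ["c1", "c1", "c2"],
    "file":      ["A.java", "B.java", "A.java"],
    "LLOC":      [120, 80, 110],
    "LLOC_prev": [100, 80, 120],
})

# Sum the metric over all files changed within a commit ...
per_commit = files.groupby("commit")[["LLOC", "LLOC_prev"]].sum()
# ... and compute the delta between the current and previous value.
per_commit["LLOC_delta"] = per_commit["LLOC"] - per_commit["LLOC_prev"]
```

For commit `c1` this yields a delta of +20 LLOC (two files summed), for `c2` a delta of -10 LLOC.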

4.5 Analysis Procedure

For our confirmatory study as part of RQ1, we compare the difference between two samples. To choose a valid statistical test for whether there is a difference between both samples, we first perform the Shapiro-Wilk test (Wilk and Shapiro, 1965) to check each sample for normality. Since we found that the data is non-normal, we perform the Mann-Whitney U test (Mann and Whitney, 1947) to evaluate whether the metric values of one population dominate the other. Since we have an expectation about the direction of metric changes, we perform a one-sided Mann-Whitney U test. The null hypothesis is that both samples are the same; the alternative hypothesis is that one sample contains lower or higher values, depending on our expectation. The expected direction of the metric change is noted in Table 4.

As our data contains a large number of metrics, we cannot assume that a single statistical test yields a valid rejection of a hypothesis. To mitigate the problem posed by a high number of statistical tests, we perform Bonferroni correction (Abdi, 2007). We choose a significance level of α = 0.05 with Bonferroni correction for 192 statistical tests. They consist of four size metrics with two groups and three statistical tests as well as 14 source code metrics with two groups and three statistical tests (normality tests for the two samples and the Mann-Whitney U test for the difference between the samples). The second part is repeated for RQ2. We reject the hypothesis that there is no difference between samples at p < 0.05/192 ≈ 0.00026.

To calculate the effect size of the Mann-Whitney U test, we use Cliff's δ (Cliff, 1993) as a non-parametric effect size measure. We follow a common interpretation of the values (Grissom and Kim, 2005): |δ| < 0.147 is negligible, |δ| < 0.33 is small, |δ| < 0.474 is medium and |δ| ≥ 0.474 is large. We provide the effect size for every difference that is statistically significant.
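Cliff's δ can be computed directly from all pairwise comparisons between the two samples; a minimal sketch, with the commonly used interpretation thresholds (assumed here, following Grissom and Kim):

```python
import numpy as np

def cliffs_delta(a, b):
    """Cliff's delta: P(a > b) - P(a < b) over all pairs of values."""
    a, b = np.asarray(a), np.asarray(b)
    diff = a[:, None] - b[None, :]          # all pairwise differences
    return (np.sum(diff > 0) - np.sum(diff < 0)) / (a.size * b.size)

def magnitude(d):
    """Common interpretation thresholds for |delta|."""
    d = abs(d)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"
```

For large samples the pairwise matrix can be replaced by a rank-based formulation, but the definition above makes the dominance interpretation explicit.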

Every box plot shows three groups: all, perfective and corrective. This allows us to show the values of each metric for each group and serves to highlight the differences. Additionally, we report the differences between each group and its counterpart, e.g., perfective and not perfective, in the tables where we report the statistical differences.

A more detailed description of the procedure for each hypothesis follows. For H1, we want to compare the structure of quality improving changes with every other change. We compare the size (changed lines) and diffusion (number of hunks, number of changed files) to evaluate the hypothesis. We visualize the results with box plots and report results for statistical tests to determine if the difference in samples is statistically significant.

For H2, we also visualize the results via box plots. As most of the differences hover around 0, we transform the data before plotting via log(x + |min(D)| + 1), where D is the complete, untransformed dataset, which we therefore also require for the visualizations. Due to the difference in change sizes, we provide our data size-corrected, e.g., the delta of McCC is divided by the number of modified lines. Additionally, we report the percentage of data that is non-zero to indicate how often the measurements change in our data. In addition to the visualization, we provide a table with the differences between the samples and statistical test results.
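A sketch of this transform, assuming it shifts every value by the global minimum of the untransformed dataset D so that all arguments to the logarithm are positive:

```python
import numpy as np

def shifted_log(x, full_data):
    """Shift by the global minimum of the full dataset so all values
    become positive, then take the log: log(x + |min(D)| + 1)."""
    shift = abs(np.min(full_data)) + 1
    return np.log(np.asarray(x) + shift)

# Metric deltas hover around 0 and can be negative.
D = np.array([-5.0, -1.0, 0.0, 2.0, 40.0])
t = shifted_log(D, D)
```

The transform is monotone, so it compresses the long tails for plotting without reordering the values; the global minimum maps to log(1) = 0.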

As part of our exploratory study for answering RQ2, we also provide box plots of our metric values. Instead of transformed delta values, we provide the raw averages per file in a change before the change was applied. In addition, we provide the median values of all of our metrics before the change was applied. In this part, we apply a two-sided Mann-Whitney U test as we have no expectation about the direction of the metric changes for these categories.

4.6 Replication Kit

All data and source code can be found in our replication kit (Trautsch et al., 2021). In addition, we provide a small website for this publication that contains all information and where the fine-tuned model can be tested live (https://user.informatik.uni-goettingen.de/~trautsch2/emse_2021/).

5 Results

In this section, we present the results for evaluating our hypotheses for our first research question and, after that, the result of the exploratory part of our study for our second research question.

5.1 RQ1: Does developer intent to improve internal or external quality have a positive impact on software metrics?

We first present the results of our confirmatory study and evaluate our hypotheses.

5.1.1 Results H1

Perfective Corrective
Metric p-value δ p-value δ
#lines added <0.0001 0.20 (s) <0.0001 0.21 (s)
#lines deleted <0.0001 0.15 (s) <0.0001 0.16 (s)
#files modified 0.2081 - <0.0001 0.22 (s)
#hunks <0.0001 0.01 (n) <0.0001 0.22 (s)
Table 5: Statistical test results for perfective and corrective commits, Mann-Whitney U test p-values and effect sizes (δ) with category: |δ| < 0.147 is negligible (n), |δ| < 0.33 is small (s). Statistically significant p-values are bolded.
Figure 1: Commit size distribution over all projects for all, perfective and corrective commits. Fliers are omitted.

Figure 1 shows the distribution of sizes for perfective, corrective, and all commits. Table 5 shows the statistical test results for the differences. We can see that perfective commits tend to add fewer lines and instead remove more lines than other commits. When we calculate a median delta between all commits and perfective commits, we find a difference of 28 for added lines and -2 for deleted lines. While the effect sizes are negligible to small, we can also see this difference in Figure 1. The diffusion of the change over files is also different; however, for the number of modified files the difference is not significant for perfective commits.

Corrective commits also tend to add less code, while they do not delete as much; the difference in added and deleted lines is also statistically significant. While the effect size is small, we can see the difference in Figure 1. For corrective commits, we can also see a difference in the number of files changed and the number of hunks modified. This diffusion of the change over files and hunks is also statistically significant, although, again, with a small effect size.

We can conclude that perfective commits tend to remove more lines and generally add fewer lines to the repository. Corrective commits both delete and add fewer lines than other commits. Corrective commits are also distributed over fewer hunks and fewer files than other commits.

       We accept H1 that intended quality improvements are smaller than other changes. Perfective and corrective commits tend to add fewer lines, and perfective commits remove more lines. The effect size in all cases is negligible to small.   

5.1.2 Results H2

We first note that not every metric changes in each instance of our data. This can be seen in Table 6, which shows the percentage of non-zero changes for each metric for perfective, corrective, and all changes.

Metric %NZ %NZ P %NZ C
McCC 51.03 31.01 57.70
LLOC 74.69 60.93 77.99
NLE 36.76 23.92 34.28
NUMPAR 35.93 24.44 24.98
CC 49.41 37.81 55.14
CLOC 51.56 46.52 42.51
CD 76.07 66.48 77.35
AD 27.19 20.63 15.82
NOA 10.51 6.96 3.62
CBO 30.89 22.52 22.22
NII 27.08 17.78 21.09
Minor 36.15 27.02 29.77
Major 19.87 13.23 14.77
Critical 7.23 4.20 4.95
Table 6: Percentage of commits where the metric change is not zero for all commits (%NZ), perfective commits (%NZ P) and corrective commits (%NZ C).

We can see that some metrics change rarely, e.g., critical PMD warnings change in only about 7% of commits while LLOC changes in about 75%. There are also differences between the categories, e.g., McCC changes in 31% of perfective changes and in 57% of corrective changes.

To evaluate H2, we present the differences in all changes visually as box plots in Figure 2.

Figure 2: Static source code metrics changes in all, perfective and corrective commits divided by changed lines. Fliers are omitted.
Perfective Corrective
Metric p-value δ p-value δ
McCC <0.0001 0.39 (m) 1.0000 -
LLOC <0.0001 0.45 (m) 1.0000 -
NLE <0.0001 0.27 (s) 1.0000 -
NUMPAR <0.0001 0.25 (s) <0.0001 0.09 (n)
CC 1.0000 - <0.0001 0.12 (s)
CLOC <0.0001 0.16 (s) <0.0001 0.05 (n)
CD 1.0000 - <0.0001 0.16 (s)
AD <0.0001 0.02 (n) <0.0001 0.08 (n)
NOA <0.0001 0.08 (n) <0.0001 0.07 (n)
CBO <0.0001 0.19 (s) <0.0001 0.06 (n)
NII <0.0001 0.19 (s) <0.0001 0.02 (n)
Minor <0.0001 0.19 (s) <0.0001 0.05 (n)
Major <0.0001 0.12 (s) <0.0001 0.05 (n)
Critical <0.0001 0.05 (n) <0.0001 0.03 (n)
Table 7: Statistical test results for perfective and corrective commits, Mann-Whitney U test p-values and effect sizes (δ) with category: |δ| < 0.147 is negligible (n), |δ| < 0.33 is small (s), |δ| < 0.474 is medium (m). Statistically significant p-values are bolded. All values are normalized by the number of changed lines.

In addition, we provide Table 7 which shows the Mann-Whitney U test (Mann and Whitney, 1947) p-values, and effect sizes for differences between the types of commits. We can see that most metrics are different depending on whether they are measured in perfective, corrective or all other commits. We now discuss the differences for each measured metric.

McCC: the cyclomatic complexity of perfective changes is smaller than for other changes, even when we do not account for the size of the change. This is expected, as some perfective commits mention simplification of code. For perfective commits the effect size is medium. Corrective commits, however, have a higher McCC than other commits. This can be seen in Figure 2: the median of corrective commits is higher than for the other commits. Our assumption about McCC being lower in all quality improving commits is not met in this case. While it makes sense that corrective commits add complexity, the comparison here is one of stochastic dominance between all other commits and corrective commits, not whether corrective commits remove or add McCC. Thus, this means that changes in corrective commits are more complex than other changes.

LLOC: the difference in LLOC is the most pronounced in our data, even when we do not correct for the size of the change. While manually classifying the commits, we found that code is often removed because it was previously marked as deprecated or was no longer needed for other reasons. The effect size for perfective commits is medium. For corrective commits, we can see the same result as for McCC. While we assumed that bug fixes usually add code, we did not expect them to dominate all other commits, including feature additions.

NLE: the nesting level else-if is smaller in perfective commits. We expect this is due to simplification and removal of complex code. The box plot in Figure 2 shows a noticeable difference. This means simplification is a high priority when improving code quality in perfective commits. For corrective commits, we can see the same effect as previously for McCC and LLOC: the NLE is not lower but higher. This is more evidence that bug fixes add more complex code. There may be a timing factor involved, e.g., if bug fixes are quick fixes, they would add more complex code without a subsequent refactoring which would decrease the complexity again.

NUMPAR: the number of parameters in a method is also different for perfective commits. This may hint at the type of perfective maintenance performed most often in perfective commits. The manual classification showed a lot of commit messages that claimed a simplification of the changed code, and this metric would also be impacted by simplification or refactoring operations. Corrective commits also show fewer additions in this metric; while the effect size is only negligible, it is still statistically significant. Fixing bugs seems to include some code reduction or at least less addition of parameters to methods.

CC: the clone coverage is not different for perfective commits. We would have expected it to decrease in perfective commits. However, it seems that clone removal is not a big part of perfective maintenance in our study subjects, which contradicts our expectation. Corrective commits, however, do contain a lower clone coverage. This could either be because corrective commits introduce fewer new clones than other commits or because they remove more. A possible reason for clone removal may be the correction of copy-and-paste related bugs.

CLOC: the comment lines of code show a difference for both perfective and corrective commits. While we expected CLOC to increase in both types of quality improving commits, the effect size is higher in perfective commits. It seems that bug fixing operations do not add enough comment lines to show a larger difference for corrective commits.


CD: the comment density of perfective commits is not statistically significantly different from other commits. We would have expected a difference here, because perfective maintenance should include additional comments on new or previously uncommented code. We can, however, see a difference for corrective commits. This shows that the density of comments also improves in bug fixing operations, probably due to clarifications for parts of the code that were fixed.

AD: the API documentation metric changes in both perfective and corrective commits compared to other commits. A reason could be that perfective commits add enough API documentation to make the difference significant. Corrective changes that introduce code in our study subjects seem to almost always include API documentation, therefore we can see a difference here. However, the effect size is negligible in both cases.

NOA: the number of ancestors is lower in perfective commits, as expected. This metric would be affected by simplification and clean-up maintenance operations. For corrective commits, we can also see a lower value; this hints at some clean-up operations happening during bug fixing.

CBO: the coupling between objects is lower after perfective commits. This is expected due to class removal and the subsequent decoupling of classes. For corrective commits, we can also see a difference. While the effect size is negligible, some code clean-up is happening during bug fixes, e.g., NOA and CC are also lower in corrective than in other commits.

NII: the number of incoming invocations is lower in both perfective and corrective commits. However, the effect size is small in perfective and negligible in corrective commits. It seems reasonable to see a difference in this metric because, in the case of perfective commits, we have a lot of source code removal. However, there are also maintenance activities which decouple classes, which would also impact this metric. Corrective maintenance seems to involve only limited decoupling operations, as also seen in CBO.

Minor: The PMD warnings of minor severity are different in both types of changes. However, the effect size is larger for perfective changes, which makes sense as addressing these warnings can be part of perfective maintenance.

Major: The PMD warnings of major severity are also different in both types of changes. We again see the difference in effect size, and we expect the reason is the same as for Minor.

Critical: The PMD warnings of critical severity are different for both types of changes. Here, the effect size is negligible for both types. However, as they change in only about 7% of our commits, they do not change often regardless of commit type.

       There are significant differences between perfective and corrective changes. We reject H2 that intended quality improvements have a positive impact on quality metrics.   

5.2 Summary RQ1

In summary, we have the following results for RQ1.

       RQ1 Summary: While intended quality improvements by developers yield measurable differences in almost all metrics, we find that not all metrics change in the expected, positive direction.
       Perfective changes: Perfective commits have a positive effect on metrics that measure code complexity through size, conditional statements, number of parameters, and coupling. For two metrics we do not find the expected difference to other commits: code clone and comment density metrics are not statistically significantly different in perfective commits.
       Corrective changes: Only for two metrics do we observe a non-negligible and statistically significant change in the predicted direction. For LLOC, McCC and NLE, we observe the opposite of the expectation, which indicates that bug fixes add complex code.   

5.3 RQ2: What kind of files are the target of internal or external quality improvements?

As part of our exploratory study, we present results on which kinds of files are changed. The extracted metrics are considered on a per-change basis, i.e., we divide the metrics by the number of changed files to get an average metric value per file.

Figure 3: Static source code metrics before the change is applied. Fliers are omitted.

Figure 3 shows box plots for the metric values of files before the change is applied. We can see that perfective changes are not necessarily applied to complex files. If we compare the median values in Table 8, we can see that perfective changes are applied to smaller, simpler files than the average or corrective change. McCC, LLOC, NLE, NUMPAR and CBO are lower for the files which receive perfective changes, while CLOC, CD and AD are higher. This means that less complex files are often the target of perfective changes. If we look at corrective changes, we see that they are applied to more complex and usually larger files. McCC, LLOC, NLE, NUMPAR, CBO, NII as well as Minor, Major and Critical are higher than for other or perfective changes. As we consider the metric values before the change is applied, they can be considered pre-bugfix. However, when we consider our results for RQ1, the corrective changes usually increase the complexity even further.

Table 9 shows the results of our statistical tests. Analogous to RQ1, we compare the difference between perfective and non-perfective as well as corrective and non-corrective changes. While most metric differences are statistically significant, we observe only some small effect sizes for the comment-related metrics, while the rest are negligible.

Metric All Perfective Corrective
McCC 21.78 18.78 33.23
LLOC 186.98 163.75 264.18
NLE 9.60 8.33 14.00
NUMPAR 16.06 15.00 22.00
CC 0.04 0.04 0.05
CLOC 46.25 55.00 54.00
CD 0.25 0.32 0.25
AD 0.50 0.67 0.46
NOA 1.00 1.00 1.00
CBO 9.67 8.00 14.00
NII 8.00 8.50 9.50
Minor 7.00 6.00 9.67
Major 2.00 1.25 3.00
Critical 0.00 0.00 0.00
Table 8: Median metric values per file before the change is applied.
Perfective Corrective
Metric p-value δ p-value δ
McCC <0.0001 0.05 (n) <0.0001 0.08 (n)
LLOC <0.0001 0.05 (n) <0.0001 0.05 (n)
NLE <0.0001 0.04 (n) <0.0001 0.07 (n)
NUMPAR 0.6367 - 0.0218 -
CC <0.0001 0.01 (n) 0.0011 -
CLOC <0.0001 0.12 (s) <0.0001 0.06 (n)
CD <0.0001 0.15 (s) <0.0001 0.15 (s)
AD <0.0001 0.17 (s) <0.0001 0.15 (s)
NOA 0.5109 - <0.0001 0.02 (n)
CBO <0.0001 0.09 (n) <0.0001 0.07 (n)
NII <0.0001 0.05 (n) <0.0001 0.04 (n)
Minor <0.0001 0.04 (n) <0.0001 0.02 (n)
Major <0.0001 0.09 (n) <0.0001 0.04 (n)
Critical <0.0001 0.05 (n) <0.0001 0.03 (n)
Table 9: Statistical test results for perfective and corrective commits regarding their average metrics before the change, Mann-Whitney U test p-values and effect sizes (δ) with category: |δ| < 0.147 is negligible (n), |δ| < 0.33 is small (s), |δ| < 0.474 is medium (m). Statistically significant p-values are bolded.
Figure 4: Density plot of metric values for perfective and corrective categories before the change.

Figure 4 shows another perspective on our data in the form of a direct comparison of the densities for perfective and corrective changes. We can see that McCC, NLE, LLOC, NUMPAR, CD, CBO, NII and Minor have a lower density for perfective than for corrective changes. While the differences are small, they are noticeable.

       RQ2 Summary The files that are targets of perfective changes are usually not large and complex even before the change is applied. Corrective changes are applied to files which are already very complex and large.   

6 Discussion

Our results for H1 show that the size differs for both types of commits. The size difference between all commits and perfective as well as corrective commits shows that both tend to be smaller than other commits. In the case of perfective commits, code is statistically significantly more often deleted.

The differences in change sizes as well as the increased number of deletions for perfective commits we found for H1 confirm previous research. Multiple studies, e.g., Mockus and Votta (2000), Purushothaman and Perry (2005) and Alali et al. (2008) found that perfective maintenance activities are usually smaller. Mockus and Votta (2000) as well as Purushothaman and Perry (2005) found that corrective maintenance is also smaller and that perfective maintenance deletes more code. Another indication that size differs between maintenance types can be seen in the work by Hönel et al. (2019), which used size based metrics as predictors for maintenance types and showed that they improved the performance of classification models.

Our results for H2 show statistically significant differences in metric measurements between perfective commits and other commits. This result indicates a confirmation of the measurements used by quality models, as the majority of metrics change as expected when developers actively improve the internal code quality. This empirical confirmation of the connection between quality metrics and developer intent is one of our main contributions and was, to the best of our knowledge, not part of any prior study. However, there are several examples of prior work that assumed this relationship.

The publications by McCabe (1976) and Chidamber and Kemerer (1994) assume that reducing complexity and coupling metrics increases software quality, which is in line with our developer intents. While all metrics are included in a current ColumbusQM version (Bakota et al., 2014) because we used it as a basis, the CBO, McCC, LLOC and NOA metrics are also part of the SQUALE model (Mordal-Manet et al., 2009), and AD, NLE, McCC, and PMD warnings are also part of Quamoco (Wagner et al., 2012). It seems that developers and the Columbus quality model agree in their view on software quality. We find that most of the metrics used in the quality model change when developers perceive their change as quality increasing. This is also true for most of the metrics shared with the SQUALE and Quamoco quality models, although the implementation of the metrics may differ between the models. Our work establishes that all these quality models are directly related to intended improvements of the internal code quality by the developers.

Surprisingly, we have found only a few statistically significant and non-negligible differences for corrective commits. Not all software metric values change in the expected direction for corrective commits. For example, we can see that McCC, LLOC and NLE increase in corrective changes compared to other commits. While we did not expect them to decrease for every corrective commit, we assumed that in comparison to all other commits they would decrease. Even considering software aging (Parnas, 2001), we would expect the aging to impact all kinds of changes, not just corrective changes. When we look at popular data sets used in the defect prediction domain, we often find coupling, size and complexity software metrics (Herbold et al., 2020). For example, the popular (Hosseini et al., 2017) data set by Jureczko and Madeyski (2010) uses such features, but they are also common in more recent data sets, e.g., by Ferenc et al. (2020) or Yatish et al. (2019).

That the most significant difference is in the size of changes could explain various recent findings from the literature, in which size was found to be a very good indicator both for release-level defect prediction (Zhou et al., 2018) and just-in-time defect prediction (Huang et al., 2017). This could also be an explanation for possible ceiling effects (Menzies et al., 2008) when such criteria are used, as the differences to other changes are relatively small. We believe that these aspects should be further considered by the defect prediction community and that more research is required to establish causal relationships between features and defectiveness.

While the work by Peitek et al. (2021) indicates that cyclomatic complexity may not be as indicative of code understandability as expected, we show within our work that it often changes in quality increasing commits. It seems that developers associate overall complexity, as measured by McCC, NLE and NUMPAR, with code that needs quality improvement. However, as we can see in the exploratory part of our study, the most complex files are usually not targeted for quality increasing changes.

The results of our exploratory study for RQ2, about the files that are the target of quality increasing commits, reveal additional interesting data. We show that perfective maintenance does not necessarily target files that are in need of it due to high complexity in comparison to other changes. In fact, low complexity files, as measured by McCC and NLE, are more often part of additional quality increasing work by the developers. This may hint at problems regarding the prioritization of quality improvements in the source code. Perhaps errors could have been avoided if perfective changes had targeted more complex files. There could also be effects of different developers or a bias of perfective changes towards simpler code; this warrants future investigation. Corrective changes, in contrast to perfective changes, are applied to files which are large and complex. This was expected; however, combined with the results of RQ1, this means that bugs are fixed in complex and large files and then the files get, on average, even more complex and even larger.

Future work could investigate boundary values based on our data. When we compare the median values of our measurements in Table 8 with current boundary values from PMD (https://pmd.github.io/pmd/pmd_rules_java_design.html#cyclomaticcomplexity), the PMD warning value of 80 may be too high. A PMD warning triggered at 34 McCC per file would have warned about at least 50% of the files that were in need of a bug fix. However, lowering the boundary will also result in more warnings for files that were not the target of corrective changes.

7 Threats to validity

In this section, we discuss the threats to validity we identified for our work. We discuss four basic types of validity separately as suggested by Wohlin et al. (2000) and include reliability due to our manual classification approach.

7.1 Reliability

We classify changes to a software retroactively and without the developers. This may introduce a researcher bias into the data and subsequently the results. However, this is a necessity given the size of the data and the unrestricted time frame of the sample and full data, because it would not be feasible to ask developers about commits from years ago. To mitigate this threat, we perform the labeling according to guidelines, and every change is independently classified by two researchers. We also compare our labels with a sample of changes classified by the developers themselves from Mauczka et al. (2015) and find that we agree on most changes. In addition, we measure the inter-rater agreement between the researchers and find that it is substantial.
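The inter-rater agreement we report can be computed with Cohen's kappa (Cohen, 1960). The following is a minimal sketch with hypothetical labels for the three categories; the substantial agreement reported above refers to the real classification data, not this toy example.

```python
# Cohen's kappa for two raters labeling commits as perfective (p),
# corrective (c), or neither (n). The labels below are hypothetical.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # chance agreement estimated from the raters' marginal label frequencies
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

rater_a = ["p", "p", "c", "n", "c", "p"]
rater_b = ["p", "c", "c", "n", "c", "p"]
kappa = cohens_kappa(rater_a, rater_b)  # above 0.6, i.e. substantial per Landis and Koch
```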

7.2 Construct Validity

Our definition of quality improving may be too broad. We aggregate different types of quality improvement together, e.g., improving error messages, the structure of the code, or readability. This may influence the changes we observe within our metrics. While these differences should be studied as well, we believe that a broad overview of generic quality improvements independent of their type has advantages. We avoid the risk of focusing only on structural improvements, e.g., due to the use of generics or new Java features, while still capturing broader changes such as the simplification of method code.

7.3 Conclusion Validity

We report differences in metric changes between perfective and corrective changes in the software development history of our study subjects. We find a difference for perfective commits and only some non-negligible, statistically significant differences for corrective commits. This could be an effect of the sample used as ground truth; however, we drew randomly from a list of commits in our study subjects, so our sample should be representative.

We use a deep learning model to classify all of our commits based on the ground truth we provide. This can introduce bias or errors into the classification. We note, however, that the non-negligible effect sizes of our results do not change. The quality metric evaluation of only the ground truth data is included in the appendix and shows similar results. We note that for the small effect sizes we observe, a large number of observations is needed to show a significant difference, as demonstrated by the results in this article when compared to the ground truth.
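The effect sizes discussed here are Cliff's delta values (Cliff, 1993). The following sketch shows the statistic together with a common magnitude classification; the thresholds follow widely used conventions and the sample data is synthetic, so this is an illustration rather than the study's exact analysis code.

```python
# Cliff's delta: an ordinal, distribution-free effect size for two samples.
# Magnitude thresholds (0.147 / 0.33 / 0.474) follow common conventions;
# the sample data below is synthetic.

def cliffs_delta(xs, ys):
    greater = sum(x > y for x in xs for y in ys)
    smaller = sum(x < y for x in xs for y in ys)
    return (greater - smaller) / (len(xs) * len(ys))

def magnitude(d):
    d = abs(d)
    if d < 0.147:
        return "negligible"
    if d < 0.33:
        return "small"
    if d < 0.474:
        return "medium"
    return "large"

delta = cliffs_delta([1, 2, 3, 4], [2, 3, 4, 5])  # synthetic metric deltas
label = magnitude(delta)
```

Because the statistic compares all pairs across both samples, a small but real difference only separates itself from noise once many observations are available, which matches the behavior described above.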

7.4 Internal Validity

A possible threat could be tangled commits which improve quality and at the same time add a feature. We mitigate this in our ground truth by manually inspecting the commit message of every change considered. We excluded tangled commits whenever this could be determined from the commit message. As no automatic untangling approach is available to us, and available approaches to label tangled commits already use the commit message, we consider tangled commits that are not identifiable from the commit message a minor threat.

Another threat could be a low number of feature additions in our study subjects. If feature additions happened too infrequently to influence the results, corrective commits would appear to add more complex code than other commits for that reason alone. Since we include projects that have been in development for a long period of time and do not restrict the time frame of our study, we believe this threat is mitigated.

7.5 External Validity

We are restricted to a convenience sample consisting of Java open source projects under the umbrella of the Apache Software Foundation. We consider this a minor threat to external validity: although we are limited to one organization, our data still covers a wide variety of different types of software. We believe this somewhat mitigates the limited variety of project governance.

Furthermore, we only include Java projects. However, Java is used in a wide variety of projects and remains a popular language, and its age provides us with a long history of data to utilize in this study. Nevertheless, we note that this study may not generalize to all Java projects, let alone software projects in other languages.

8 Conclusion

Numerous quality measurements exist, and numerous software quality models try to connect concrete quality metrics with abstract quality factors and sub-factors. Although it seems clear that some static source code metrics influence software quality factors, the questions of which metrics and to what extent remain open. Instead of relying on necessarily limited developer or expert evaluations of source code or changes, we extract metrics from past changes in which, according to the commit message, the developers intended to increase quality.

Within this work, we performed a manual classification of developer intents on a sample of 2,533 commits from 54 Java open source projects, conducted by two researchers independently. We classify the commits into three categories: perfective maintenance, corrective maintenance, or neither. We further evaluate our classification guidelines with a re-classification of a developer-labeled sample. We use this data as ground truth to evaluate and then fine-tune a state-of-the-art deep learning model for text classification. The fine-tuned model is then used to classify all available commits into our categories, increasing our data set to 125,482 commits. We extract static source code metrics and static analysis warnings for all 125,482 commits, which allows us to investigate the impact of changes and the distribution of metric values before the changes are applied. Based on the literature, we hypothesize that certain metrics change in a certain direction, e.g., that perfective changes reduce complexity. We accept or reject these hypotheses based on our data in a confirmatory study.

We find that perfective commits remove code more often and generally add fewer lines. Regarding the metric measurements, we find that most metric changes of perfective commits are significantly different from other commits and have a positive, non-negligible impact on the majority of metrics.

Surprisingly, we found that corrective changes are more complex and larger than other changes. It seems that fixing a bug increases not only the size but also the complexity as measured via McCC and NLE. Since we compare against all other changes, we expected a bug fix to add less complexity than, e.g., a feature addition. We conclude that the process of performing a bug fix tends to add more complex code than other changes.

We find that complex files are not necessarily the primary target of quality increasing work by developers, including refactoring. On the contrary, we find that perfective quality changes are applied to files that are already less complex than the files changed in other or corrective commits. Files contained in corrective changes, on the other hand, are more complex and usually larger than both perfective and other files. In combination with our first result, this shows that corrective changes are applied to files which are already complex and become even more complex after the change is applied.

While we explored a limited number of metrics and commits, we think that this approach can be used to evaluate more metrics connected with software quality in a meaningful way and can provide practitioners and researchers with additional empirical data.


This work was partly funded by the German Research Foundation (DFG) through the project DEFECTS, grant 402774445. We also want to thank the GWDG Göttingen (https://www.gwdg.de) for providing us with computing resources within their HPC-Cluster.


  • H. Abdi (2007) Bonferroni and Sidak corrections for multiple comparisons. In Encyclopedia of Measurement and Statistics, pp. 103–107. Cited by: §4.5.
  • J. Al Dallal and A. Abdin (2018) Empirical evaluation of the impact of object-oriented code refactoring on quality attributes: a systematic literature review. 44 (1), pp. 44–69. External Links: Document, ISSN 2326-3881 Cited by: §3.
  • A. Alali, H. Kagdi, and J. I. Maletic (2008) What’s a typical commit? a characterization of open source software repositories. In 2008 16th IEEE International Conference on Program Comprehension, Vol. , pp. 182–191. External Links: Document Cited by: §6.
  • E. A. AlOmar, M. W. Mkaouer, and A. Ouni (2021) Toward the automatic classification of self-affirmed refactoring. Journal of Systems and Software 171, pp. 110821. External Links: ISSN 0164-1212, Document, Link Cited by: §1, §4.3.
  • M. Alshayeb (2009) Empirical investigation of refactoring effect on software quality. 51 (9), pp. 1319 – 1326. External Links: ISSN 0950-5849, Document, Link Cited by: §1.
  • T. Bakota, P. Hegedűs, P. Körtvélyesi, R. Ferenc, and T. Gyimóthy (2011) A probabilistic software quality model. In 2011 27th IEEE International Conference on Software Maintenance (ICSM), Vol. , pp. 243–252. External Links: Document, ISSN 1063-6773 Cited by: §1, §1, 2nd item, §4.4.
  • T. Bakota, P. Hegedűs, I. Siket, G. Ladányi, and R. Ferenc (2014) Qualitygate sourceaudit: a tool for assessing the technical quality of software. In 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), Vol. , pp. 440–445. External Links: Document Cited by: §1, 2nd item, §4.4, §6.
  • G. Bavota, A. De Lucia, M. Di Penta, R. Oliveto, and F. Palomba (2015) An experimental investigation on the innate relationship between quality and refactoring. 107, pp. 1 – 14. External Links: ISSN 0164-1212, Document, Link Cited by: §1.
  • B. W. Boehm, J. R. Brown, and M. Lipow (1976) Quantitative evaluation of software quality. In Proceedings of the 2Nd International Conference on Software Engineering, ICSE ’76, Los Alamitos, CA, USA, pp. 592–605. External Links: Link Cited by: §1.
  • A. Ch’avez, I. Ferreira, E. Fernandes, D. Cedrim, and A. Garcia (2017) How does refactoring affect internal quality attributes? a multi-project study. In Proceedings of the 31st Brazilian Symposium on Software Engineering, SBES’17, New York, NY, USA, pp. 74–83. External Links: ISBN 9781450353267, Link, Document Cited by: 2nd item.
  • S. R. Chidamber and C. F. Kemerer (1994) A metrics suite for object oriented design. 20 (6), pp. 476–493. External Links: Document Cited by: §6.
  • N. Cliff (1993) Dominance statistics: ordinal analyses to answer ordinal questions. Cited by: §4.5.
  • J. Cohen (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20 (1), pp. 37–46. External Links: Document, Link, https://doi.org/10.1177/001316446002000104 Cited by: §4.2.
  • M. D’Ambros, M. Lanza, and R. Robbes (2012) Evaluating defect prediction approaches: a benchmark and an extensive comparison. 17 (4-5), pp. 531–577. External Links: ISSN 1382-3256, Document, Link Cited by: 2nd item.
  • S. Fakhoury, D. Roy, A. Hassan, and V. Arnaoudova (2019) Improving source code readability: theory and practice. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), Vol. , pp. 2–12. External Links: Document, ISSN 2643-7147 Cited by: §3.
  • N. Fenton and J. Bieman (2014) Software metrics: a rigorous and practical approach, third edition. 3rd edition, CRC Press, Inc., Boca Raton, FL, USA. External Links: ISBN 1439838224, 9781439838228 Cited by: §1, §1.
  • R. Ferenc, P. Gyimesi, G. Gyimesi, Z. Tóth, and T. Gyimóthy (2020) An automatically created novel bug dataset and its validation in bug prediction. 169, pp. 110691. External Links: ISSN 0164-1212, Document, Link Cited by: §6.
  • Y. Fu, M. Yan, X. Zhang, L. Xu, D. Yang, and J. D. Kymer (2015) Automated classification of software change messages by semi-supervised latent dirichlet allocation. 57, pp. 369 – 377. External Links: ISSN 0950-5849, Document, Link Cited by: §3.
  • L. Ghadhab, I. Jenhani, M. W. Mkaouer, and M. Ben Messaoud (2021) Augmenting commit classification by using fine-grained source code changes and a pre-trained deep neural language model. Information and Software Technology 135, pp. 106566. External Links: ISSN 0950-5849, Document, Link Cited by: §3, §4.3, Table 3.
  • S. Gharbi, M. W. Mkaouer, I. Jenhani, and M. B. Messaoud (2019) On the classification of software change messages using multi-label active learning. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, SAC ’19, New York, NY, USA, pp. 1760–1767. External Links: ISBN 9781450359337, Link, Document Cited by: §4.3, Table 3.
  • R. J. Griessom and J. J. Kim (2005) Effect sizes for research: a broad practical approach. Lawrence Erlbaum Associates Publishers. Cited by: §4.5.
  • L. P. Hattori and M. Lanza (2008) On the nature of commits. In Proceedings of the 23rd IEEE/ACM International Conference on Automated Software Engineering, ASE’08, Piscataway, NJ, USA, pp. III–63–III–71. External Links: ISBN 978-1-4244-2776-5, Link, Document Cited by: §4.2.
  • S. Herbold, A. Trautsch, F. Trautsch, and B. Ledel (2020) Issues with szz: an empirical assessment of the state of practice of defect prediction data collection. Note: Submitted External Links: Link Cited by: §6.
  • K. Herzig, S. Just, and A. Zeller (2013) It’s not a bug, it’s a feature: how misclassification impacts bug prediction. In Proceedings of the 2013 International Conference on Software Engineering, ICSE ’13, pp. 392–401. External Links: ISBN 9781467330763 Cited by: item 4, §4.2.
  • S. Hönel, M. Ericsson, W. Löwe, and A. Wingkvist (2019) Importance and aptitude of source code density for commit classification into maintenance activities. In 2019 IEEE 19th International Conference on Software Quality, Reliability and Security (QRS), Vol. , pp. 109–120. External Links: Document Cited by: §1, 1st item, §3, Table 3, §6.
  • S. Hosseini, B. Turhan, and D. Gunarathna (2017) A systematic literature review and meta-analysis on cross project defect prediction. PP (99), pp. 1–1. External Links: Document, ISSN 0098-5589 Cited by: 2nd item, §6.
  • Q. Huang, X. Xia, and D. Lo (2017) Supervised vs unsupervised models: a holistic look at effort-aware just-in-time defect prediction. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), Vol. , pp. 159–170. External Links: Document Cited by: §6.
  • ISO/IEC 25010 (2011) ISO/IEC 25010:2011, systems and software engineering — systems and software quality requirements and evaluation (square) — system and software quality models. Cited by: §1.
  • ISO/IEC (2001) ISO/iec 9126. software engineering – product quality. ISO/IEC. Cited by: §1.
  • M. Jureczko and L. Madeyski (2010) Towards identifying software project clusters with regard to defect prediction. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering, PROMISE ’10, New York, NY, USA. External Links: ISBN 9781450304047, Link, Document Cited by: 2nd item, §6.
  • Y. Kamei, E. Shihab, B. Adams, A. E. Hassan, A. Mockus, A. Sinha, and N. Ubayashi (2013) A large-scale empirical study of just-in-time quality assurance. 39 (6), pp. 757–773. External Links: Document, ISSN 2326-3881 Cited by: 1st item.
  • B. Kitchenham and S. L. Pfleeger (1996) Software quality: the elusive target [special issues section]. 13 (1), pp. 12–21. External Links: Document, ISSN Cited by: §1.
  • J. R. Landis and G. G. Koch (1977) An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers. Biometrics 33 (2), pp. 363–374. External Links: ISSN 0006341X, 15410420, Link Cited by: §4.2.
  • S. Levin and A. Yehudai (2017) Boosting automatic commit classification into maintenance activities by utilizing source code changes. In Proceedings of the 13th International Conference on Predictive Models and Data Analytics in Software Engineering, PROMISE, New York, NY, USA, pp. 97–106. External Links: ISBN 9781450353052, Link, Document Cited by: §1, §3, §4.3, Table 3.
  • H. B. Mann and D. R. Whitney (1947) On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics 18 (1), pp. 50–60. Cited by: §4.5, §5.1.2.
  • A. Mauczka, F. Brosch, C. Schanes, and T. Grechenig (2015) Dataset of developer-labeled commit messages. In Proceedings of the 12th Working Conference on Mining Software Repositories, MSR ’15, Piscataway, NJ, USA, pp. 490–493. External Links: ISBN 978-0-7695-5594-2, Link Cited by: §3, §4.2, §4.2, §7.1.
  • A. Mauczka, M. Huber, C. Schanes, W. Schramm, M. Bernhart, and T. Grechenig (2012) Tracing your maintenance work — a cross-project validation of an automated classification dictionary for commit messages. In Proceedings of the 15th International Conference on Fundamental Approaches to Software Engineering, FASE’12, Berlin, Heidelberg, pp. 301–315. External Links: ISBN 9783642288715, Link, Document Cited by: §1, §3, §3, §3.
  • T. J. McCabe (1976) A complexity measure. 2 (4), pp. 308–320. External Links: ISSN 0098-5589, Document Cited by: Table 4, §6.
  • J. A. McCall, P. K. Richards, and G. F. Walters (1977) Factors in software quality: concept and definitions of software quality. 1 (3). Cited by: §1.
  • T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, and Y. Jiang (2008) Implications of ceiling effects in defect predictors. In Proceedings of the 4th International Workshop on Predictor Models in Software Engineering, PROMISE ’08, New York, NY, USA, pp. 47–54. External Links: ISBN 9781605580364, Link, Document Cited by: §6.
  • Mockus and Votta (2000) Identifying reasons for software changes using historic databases. In Proceedings 2000 International Conference on Software Maintenance, Vol. , pp. 120–130. External Links: Document Cited by: §1, 1st item, §3, §6.
  • K. Mordal-Manet, F. Balmas, S. Denier, S. Ducasse, H. Wertz, J. Laval, F. Bellingard, and P. Vaillergues (2009) The squale model — a practice-based industrial quality model. In 2009 IEEE International Conference on Software Maintenance, Vol. , pp. 531–534. External Links: Document Cited by: §6.
  • NASA (2004) NASA iv & v facility metrics data program. External Links: Link Cited by: 2nd item.
  • J. Pantiuchina, M. Lanza, and G. Bavota (2018) Improving code: the (mis) perception of quality metrics. In 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME), Vol. , pp. 80–91. External Links: Document, ISSN Cited by: §3.
  • J. Pantiuchina, F. Zampetti, S. Scalabrino, V. Piantadosi, R. Oliveto, G. Bavota, and M. D. Penta (2020) Why developers refactor source code: a mining-based study. 29 (4). External Links: ISSN 1049-331X, Link, Document Cited by: §1.
  • D. L. Parnas (2001) Software aging. In Software Fundamentals: Collected Papers by David L. Parnas, pp. 551–567. External Links: ISBN 0201703696 Cited by: §6.
  • N. Peitek, S. Apel, C. Parnin, A. Brechmann, and J. Siegmund (2021) Program comprehension and code complexity metrics: an fmri study. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), Vol. , pp. 524–536. External Links: Document Cited by: §1, §6.
  • R. Purushothaman and D. E. Perry (2005) Toward understanding the rhetoric of small source code changes. 31 (6), pp. 511–526. External Links: Document Cited by: 1st item, §6.
  • S. Scalabrino, G. Bavota, C. Vendome, M. Linares-Vásquez, D. Poshyvanyk, and R. Oliveto (2021) Automatically assessing code understandability. 47 (3), pp. 595–613. External Links: Document Cited by: §1.
  • K. Stroggylos and D. Spinellis (2007) Refactoring–does it improve software quality?. In Fifth International Workshop on Software Quality (WoSQ’07: ICSE Workshops 2007), Vol. , pp. 10–10. External Links: Document, ISSN null Cited by: 2nd item, §3.
  • E. B. Swanson (1976) The dimensions of maintenance. In Proceedings of the 2nd International Conference on Software Engineering, ICSE ’76, Washington, DC, USA, pp. 492–497. Cited by: §1, §1.
  • A. Trautsch, J. Erbel, S. Herbold, and J. Grabowski (2021) Replication kit. External Links: Link Cited by: §4.6.
  • A. Trautsch, S. Herbold, and J. Grabowski (2020) A longitudinal study of static analysis warning evolution and the effects of PMD on software quality in Apache open source projects. External Links: Link, Document Cited by: §4.1.
  • F. Trautsch, S. Herbold, P. Makedonski, and J. Grabowski (2017) Addressing problems with replicability and validity of repository mining studies through a smart data platform. External Links: Document Cited by: §4.1.
  • J. von der Mosel, A. Trautsch, and S. Herbold (2021) On the validity of pre-trained transformers for natural language processing in the software engineering domain. Note: Submitted External Links: Link Cited by: §1, §4.3, Table 3.
  • S. Wagner, K. Lochmann, L. Heinemann, M. Kläs, A. Trendowicz, R. Plösch, A. Seidl, A. Goeb, and J. Streit (2012) The quamoco product quality modelling and assessment approach. In Proceedings of the 34th International Conference on Software Engineering, ICSE ’12, Piscataway, NJ, USA, pp. 1133–1142. External Links: ISBN 978-1-4673-1067-3, Link Cited by: §1, §6.
  • S. Wang, C. Bansal, and N. Nagappan (2021) Large-scale intent analysis for identifying large-review-effort code changes. 130, pp. 106408. External Links: ISSN 0950-5849, Document, Link Cited by: §3.
  • M. B. Wilk and S. S. Shapiro (1965) An analysis of variance test for normality (complete samples). Biometrika 52 (3-4), pp. 591–611. External Links: ISSN 0006-3444, Document Cited by: §4.5.
  • C. Wohlin, P. Runeson, M. Höst, M. C. Ohlsson, B. Regnell, and A. Wesslén (2000) Experimentation in software engineering: an introduction. Kluwer Academic Publishers, Norwell, MA, USA. External Links: ISBN 0-7923-8682-5 Cited by: §7.
  • M. Yan, Y. Fu, X. Zhang, D. Yang, L. Xu, and J. D. Kymer (2016) Automatically classifying software changes via discriminative topic model: supporting multi-category and cross-project. 113, pp. 296 – 308. External Links: ISSN 0164-1212, Document, Link Cited by: §3.
  • S. Yatish, J. Jiarpakdee, P. Thongtanunam, and C. Tantithamthavorn (2019) Mining software defects: should we consider affected releases?. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), Vol. , pp. 654–665. External Links: Document Cited by: §6.
  • Y. Zhou, Y. Yang, H. Lu, L. Chen, Y. Li, Y. Zhao, J. Qian, and B. Xu (2018) How far we have progressed in the journey? an examination of cross-project defect prediction. 27 (1). External Links: ISSN 1049-331X, Link, Document Cited by: §6.

Appendix A Ground truth only results

Figure 5: Ground truth only. Commit size distribution over all projects for all, perfective and corrective commits. Fliers are omitted.
                  Perfective           Corrective
Metric            p-value   d          p-value   d
#lines added      <0.0001   0.20 (s)   <0.0001   0.20 (s)
#lines deleted    <0.0001   0.13 (s)   <0.0001   0.17 (s)
#files modified   0.2829    -          <0.0001   0.22 (s)
#hunks            0.7009    -          <0.0001   0.21 (s)
Table 10: Ground truth only. Statistical test results for perfective and corrective commits, Mann-Whitney U test p-values and effect size (d), where (n) denotes a negligible and (s) a small effect. Statistically significant p-values are bolded.
Figure 6: Ground truth only. Static source code metrics changes in all, perfective and corrective commits divided by changed lines. Fliers are omitted.
            Perfective           Corrective
Metric      p-value   d          p-value   d
McCC        <0.0001   0.37 (m)   1.0000    -
LLOC        <0.0001   0.42 (m)   1.0000    -
NLE         <0.0001   0.26 (s)   0.9577    -
NUMPAR      <0.0001   0.24 (s)   <0.0001   0.09 (n)
CC          1.0000    -          <0.0001   0.12 (s)
CLOC        <0.0001   0.19 (s)   0.1906    -
CD          0.9303    -          <0.0001   0.15 (s)
AD          0.1556    -          <0.0001   0.10 (s)
NOA         <0.0001   0.08 (n)   <0.0001   0.09 (n)
CBO         <0.0001   0.18 (s)   0.0145    -
NII         <0.0001   0.19 (s)   0.0620    -
Minor       <0.0001   0.18 (s)   0.0005    -
Major       <0.0001   0.10 (s)   0.0002    0.06 (n)
Critical    <0.0001   0.06 (n)   0.1111    -
Table 11: Ground truth only. Statistical test results for perfective and corrective commits, Mann-Whitney U test p-values and effect size (d), where (n) is negligible, (s) is small, and (m) is medium. Statistically significant p-values are bolded. All values are normalized by changed lines.
Figure 7: Ground truth only. Static source code metrics before the change is applied. Fliers are omitted.
Metric      All       Perfective   Corrective
McCC        21.00     18.00        34.00
LLOC        187.22    160.38       270.00
NLE         9.50      7.67         15.20
NUMPAR      16.00     14.67        21.00
CC          0.04      0.04         0.04
CLOC        48.22     55.00        55.00
CD          0.25      0.31         0.24
AD          0.50      0.63         0.49
NOA         1.00      1.00         1.00
CBO         9.50      8.00         14.00
NII         8.00      8.00         9.00
Minor       7.00      5.43         10.00
Major       2.00      1.00         2.67
Critical    0.00      0.00         0.00
Table 12: Median metric values before the change is applied.
            Perfective           Corrective
Metric      p-value   d          p-value   d
McCC        0.0003    -          0.0016    -
LLOC        0.0005    -          0.1138    -
NLE         0.0003    -          0.0072    -
NUMPAR      0.5344    -          0.4704    -
CC          0.4142    -          0.0210    -
CLOC        <0.0001   0.10 (n)   0.0111    -
CD          <0.0001   0.15 (s)   <0.0001   0.16 (s)
AD          <0.0001   0.15 (s)   <0.0001   0.15 (s)
NOA         0.6847    -          0.2103    -
CBO         <0.0001   0.11 (s)   0.0190    -
NII         0.0510    -          0.0105    -
Minor       0.0006    -          0.6288    -
Major       <0.0001   0.12 (s)   0.0852    -
Critical    0.0179    -          0.5730    -
Table 13: Ground truth only. Statistical test results for perfective and corrective commits regarding their average metrics before the change, Mann-Whitney U test p-values and effect size (d), where (n) is negligible, (s) is small, and (m) is medium. Statistically significant p-values are bolded.