Towards Surgically-Precise Technical Debt Estimation: Early Results and Research Roadmap

08/02/2019 ∙ Valentina Lenarduzzi et al. ∙ TU Eindhoven, Tampere Universities, Universitetet i Oslo

The concept of technical debt has been explored from many perspectives, but its precise estimation is still under heavy empirical and experimental inquiry. We aim to understand whether, by harnessing approximate, data-driven, machine-learning approaches, it is possible to improve the current techniques for technical debt estimation, as represented by a top industry quality analysis tool such as SonarQube. For the sake of simplicity, we focus on relatively simple regression modelling techniques and apply them to modelling the additional project cost connected to the sub-optimal conditions existing in the projects under study. Our results show that current techniques can be improved towards a more precise estimation of technical debt, and our case study shows promising first steps towards such an estimation.


1. Introduction

Companies commonly spend time to improve the quality of the software they develop, investing effort into refactoring activities aimed at removing technical issues believed to impact software qualities. Technical issues include any kind of information that can be derived from the source code and from the software process, such as usage of specific patterns, compliance with coding or documentation conventions, architectural issues, and many others.

Technical Debt (TD) is a metaphor from the economic domain that ”refers to different software maintenance activities that are postponed in favor of the development of new features in order to get short-term payoff” (Cunningham, 1992). The growth of TD commonly slows down the development process (Cunningham, 1992), (Li and Shatnawi, 2007) and software companies need to manage it. Many factors related to unpredictable business or environmental forces internal or external to the company can lead to TD  (Martini et al., 2015), (Besker et al., 2018).

The adoption of tools to measure internal software quality is increasing (Lenarduzzi et al., 2017), (Lenarduzzi et al., 2019c), and SonarQube is one of the most used, since it has been adopted by more than 100K organizations (https://www.sonarqube.org), including more than 15K public open-source projects (https://sonarcloud.io/explore/projects).

More specifically, SonarQube checks code compliance against a set of coding rules and calculates an estimated effort (remediation time) to refactor each violated rule (TD items). The diffuseness of TD items in software systems has been well investigated (Digkas et al., 2017), (Digkas et al., 2018), (Saarimäki et al., 2019), (Lenarduzzi et al., 2019a). To properly manage overall software maintenance costs, the individual and overarching impact of TD items on software quality needs further attention, especially considering that the severity of that impact is still not clear (Saarimäki et al., 2019), (Romano, 2019). A precise understanding of which TD items developers should refactor, and at which cost, is paramount for proper just-in-time management of overall TD. Although developers typically gain a preliminary overview of the TD by considering all estimation rules in tools such as SonarQube, there is still a clear need for instruments capable of more precisely estimating the technical debt connected to every single TD item, over time, and spanning a period long enough to encompass the inception of the TD item, its resolution, as well as its eventual refactoring after resolution. For example, imagine a TD item T (a bug or a code smell) being added at moment X, removed at moment Y, and subsequently re-added/refactored at moment Z. Current tools would offer a rule-based snapshot of three distinct scenarios (X, Y, and Z) without properly understanding, and factoring into their estimation techniques, the nature, nurture, and dynamics around item T.
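As a minimal illustration of this lifecycle view, the sketch below (hypothetical data, not part of the study) records the events of a single TD item and derives the intervals during which it was open; the event kinds, dates, and the TDEvent structure are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Hypothetical event log for a single TD item T: the dates below stand in
# for the moments X (inception), Y (resolution), and Z (re-introduction).
@dataclass
class TDEvent:
    kind: str      # "introduced", "removed", or "reintroduced"
    when: date

def open_intervals(events: List[TDEvent]):
    """Return the (start, end) periods during which the TD item was open."""
    intervals, start = [], None
    for e in sorted(events, key=lambda e: e.when):
        if e.kind in ("introduced", "reintroduced") and start is None:
            start = e.when
        elif e.kind == "removed" and start is not None:
            intervals.append((start, e.when))
            start = None
    if start is not None:              # still open at the last snapshot
        intervals.append((start, None))
    return intervals

history = [TDEvent("introduced", date(2017, 1, 10)),    # moment X
           TDEvent("removed", date(2017, 6, 2)),        # moment Y
           TDEvent("reintroduced", date(2018, 3, 15))]  # moment Z
print(open_intervals(history))
```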

In this paper, we aim to conceptualize a technical debt estimation approach which is intended to be ”Surgically-Precise”, that is, one that enables a more precise and fine-grained lens of analysis over individual TD items, as well as over the evolution of their code-related history over time. We apply machine-learning techniques, since the aforementioned exercise is a predictive modelling exercise, and start by deriving a preliminary model of the actual gap between the rule-based approach of SonarQube (as represented by its own estimations using its own atomic metrics) and the actual timings and costs evident from the history of the software projects in our dataset.

Section 2 describes tool-based technical debt estimation, while Section 3 outlines the motivation of this study and reports our preliminary results. Section 4 describes our proposed approach to estimate technical debt. Section 5 identifies the threats to the validity of our study, Section 6 presents related work, and Section 7 draws conclusions and gives an outlook on possible future work.

2. Tool-based Technical Debt Estimation: The SonarQube Approach

SonarQube is one of the most common open-source static code analysis tools for measuring code technical debt. SonarQube is provided as a service by the sonarcloud.io platform or can be downloaded and executed on a private server.

SonarQube calculates several metrics such as number of lines of code and code complexity, and verifies the code’s compliance against a specific set of ”coding rules” defined for most common development languages.

If the analyzed source code violates a coding rule, SonarQube generates a ”TD issue”. The time needed to remove these issues (remediation effort) is used to calculate the remediation cost and the technical debt. SonarQube includes reliability, maintainability, and security rules.

Reliability rules, also named Bugs, create TD issues that ”represent something wrong in the code” and that will soon be reflected in a bug. Code smells are considered ”maintainability-related issues” in the code that decrease code readability and code modifiability. It is important to note that the term ”code smells” adopted in SonarQube does not refer to the commonly known term code smells defined by Fowler et al. (Beck, 1999), but to a different set of rules.

Moreover, SonarQube calculates three types of technical debt (https://docs.sonarqube.org/latest/user-guide/metric-definitions/):

  • Technical debt. SonarQube calculates technical debt as the sqale index, that is, ”the effort to fix all Code Smells”, expressed in minutes.

  • Reliability remediation effort. SonarQube calculates the reliability remediation effort as ”the effort to fix all bug issues”.

  • Security remediation effort. SonarQube calculates the security remediation effort as ”the effort to fix all vulnerability issues”.
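For reference, these measures can be read programmatically. The sketch below (a minimal example, not the study's extraction pipeline) queries SonarQube's Web API for the three metric keys documented above; the server URL and project key are placeholders, and private servers would additionally require an authentication token.

```python
import requests

SONAR_URL = "https://sonarcloud.io"          # or a private SonarQube server
PROJECT_KEY = "my_org:my_project"            # placeholder project key

# Metric keys as documented in SonarQube's metric definitions.
METRICS = ["sqale_index",                    # effort to fix all code smells (minutes)
           "reliability_remediation_effort", # effort to fix all bug issues
           "security_remediation_effort"]    # effort to fix all vulnerability issues

resp = requests.get(f"{SONAR_URL}/api/measures/component",
                    params={"component": PROJECT_KEY,
                            "metricKeys": ",".join(METRICS)})
resp.raise_for_status()
for measure in resp.json()["component"]["measures"]:
    print(measure["metric"], measure["value"])
```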

3. Motivation

SonarQube is currently adopted by more than 98% of the public projects (SonarQube Quality Profiles: https://docs.sonarqube.org/display/SONAR/Quality+Profiles, last access: May 2018). SonarQube suggests customizing its out-of-the-box set of rules (named ”sonar way”). However, customers are reluctant to do so and mostly rely on the standard rule set (Vassallo et al., 2018). Developers are not completely sure about the usefulness of the rules (Vassallo et al., 2018), (Taibi et al., 2017), and do not discriminate among the different rule categories. Generally, developers remove violations with a high severity level (Vassallo et al., 2018) to reduce the risk of faults (Taibi et al., 2017).

Moreover, a recent study confirms developers' concerns (Lenarduzzi et al., 2019a): it investigates the fault proneness of SonarQube violations, to understand which violations are actually fault-prone and to assess the accuracy of fault-prediction models. The authors conducted an empirical study on 21 well-known, mature open-source projects from the Apache Software Foundation (ASF). Each fault-inducing commit was labeled applying the SZZ algorithm and analyzed with eight machine learning techniques (Logistic Regression, Decision Tree, Random Forest, Extremely Randomized Trees, AdaBoost, Gradient Boosting, XGBoost). Results showed that, among the 202 SonarQube violations, only 26 have a low fault-proneness, and violations classified as ”bugs” hardly ever led to a failure. Moreover, the accuracy of a fault-prediction model trained on all violations is extremely low (AUC 50.94%) compared with the accuracy obtained considering only the 26 violations labeled as fault-prone (AUC 83%).

These results confirmed that SonarQube rules should be thoroughly investigated in order to understand which ones are really harmful, so that technical debt can be reduced effectively.

Based on this, we investigated whether the SonarQube technical debt could be derived from other metrics that SonarQube measures but that are not involved in its computation.

For this purpose, we conducted an empirical study, designed as a case study based on the guidelines defined by Runeson and Höst (Höst, 2009).

Goal and Research Questions. The goal of this study is to investigate whether technical debt can be derived from software metrics. We therefore derived the following research question:

RQ: To what extent can basic software metrics allow continuous prediction of technical debt?

More specifically, we are interested in knowing more about the intimate nature of technical debt items while allowing for a more precise, instantaneous, and continuous estimation of technical debt over time. We aim at understanding (a) what software metrics allow for a better estimation of the actual added project cost connected to specific TD items as well as (b) which classifier is most promising to instrument such prediction. Therefore, we formulate two sub-research questions:

RQ1.1 What software quality metrics from SonarQube better instrument a prediction of technical debt?
RQ1.2 What classifier is better suited to instrument a prediction of technical debt?

Context. For this study, we adopted the projects included in the Technical Debt Dataset (Lenarduzzi et al., 2019b). The projects in the dataset were selected based on ”criterion sampling” (Patton, 2002). The selected projects had to fulfill all of the following criteria (a sketch of such a selection filter follows the list):

  • Developed in Java;

  • Older than three years;

  • Featuring more than 500 commits;

  • Featuring more than 100 classes;

  • Using an issue tracking system with at least 100 issues reported.
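A minimal sketch of the selection filter referenced above, using hypothetical candidate records (in the study, these facts come from the Technical Debt Dataset and the projects' repositories and issue trackers):

```python
from datetime import date

# Hypothetical candidate records; field names are illustrative.
candidates = [
    {"name": "commons-io", "language": "Java", "first_commit": date(2002, 1, 25),
     "commits": 2180, "classes": 274, "issues": 950},
    {"name": "tiny-lib", "language": "Java", "first_commit": date(2018, 5, 1),
     "commits": 120, "classes": 40, "issues": 12},
]

def meets_criteria(p, today=date(2019, 8, 1)):
    return (p["language"] == "Java"
            and (today - p["first_commit"]).days > 3 * 365  # older than three years
            and p["commits"] > 500
            and p["classes"] > 100
            and p["issues"] >= 100)                         # issue tracker in use

selected = [p["name"] for p in candidates if meets_criteria(p)]
print(selected)  # -> ['commons-io']
```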

Moreover, as recommended by Nagappan et al. (Nagappan et al., 2013), we also tried to maximize diversity and representativeness by considering a comparable number of projects with respect to project age, size, and domain.

Based on these criteria, we selected 33 Java projects from the Apache Software Foundation (ASF) repository (http://apache.org). This repository includes some of the most widely used software solutions. The available projects can be considered industrial and mature, due to the strict review and inclusion process required by the ASF. Moreover, the included projects regularly review their code and follow a strict quality process (https://incubator.apache.org/policy/process.html).

In Table 1, we report the list of the 33 projects we considered together with the number of analyzed commits, the project sizes (LOC) of the last analyzed commits, and the number of artifacts in the commits.

Name  #Analyzed Commits  Timeframe  #LOC  #Artifacts
Accumulo 3 2011/10 - 2013/03 307,167 4,137
Ambari 8 2011/08 - 2015/08 774,181 3,047
Atlas 7 2014/11 - 2018/05 206,253 1,443
Aurora 16 2010/04 - 2018/03 103,395 1,028
Batik 3 2000/10 - 2002/04 141,990 1,969
BCEL 32 2001/10 - 2018/02 43,803 522
Beam 3 2014/12 - 2016/06 135,199 2,421
BeanUtils 33 2001/03 - 2018/06 35,769 332
Cocoon 7 2003/02 - 2006/08 398,984 3,120
Codec 30 2003/04 - 2018/02 21,932 147
Collections 35 2001/04 - 2018/07 66,381 750
Commons CLI 29 2002/06 - 2017/09 9,547 58
Commons Configuration 29 2003/12 - 2018/04 87,553 565
Commons Daemon 27 2003/09 - 2017/12 4,613 24
Commons DBCP 33 2001/04 - 2018/01 23,646 139
Commons DbUtils 26 2003/11 - 2018/02 8,441 108
Commons Digester 30 2001/05 - 2017/08 26,637 340
Commons Exec 21 2005/07 - 2017/11 4,815 56
Commons FileUpload 28 2002/03 - 2017/12 6,296 69
Commons HttpClient 25 2005/12 - 2018/04 74,396 779
Commons IO 33 2002/01 - 2018/05 33,040 274
Commons Jelly 24 2002/02 - 2017/05 30,100 584
Commons JEXL 31 2002/04 - 2018/02 27,821 333
Commons JXPath 29 2001/08 - 2017/11 28,688 253
Commons Net 32 2002/04 - 2018/01 30,956 276
Commons OGNL 8 2011/05 - 2016/10 22,567 333
Commons Validator 30 2002/01 - 2018/04 19,958 161
Commons VFS 32 2002/07 - 2018/04 32,400 432
Felix 2 2005/07 - 2006/07 55,298 687
HttpCore 21 2005/02 - 2017/06 60,565 739
Santuario 33 2001/09 - 2018/01 124,782 839
SSHD 19 2008/12 - 2018/04 94,442 1,103
ZooKeeper 7 2014/07 - 2018/01 72,223 835
Sum 726 2,528,636 27,903
Table 1. Description of the selected projects

Data Collection. All selected projects were cloned from their Git repositories. Each commit was analyzed using SonarQube’s default rule set, taking a snapshot of the main branch of each project every 180 days, and the results were exported as a CSV file using the SonarQube APIs (the data is available in the replication package). Furthermore, we collected the 28 software metrics measured by SonarQube, listed in Table 2, and two of the types of technical debt defined by SonarQube (https://docs.sonarqube.org/latest/user-guide/metric-definitions/): the maintainability remediation effort (also known as ”Squale Index”) and the reliability remediation effort. We did not consider the security remediation effort, since SonarQube does not provide software metrics clearly useful to predict it (Table 2).
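The snapshot selection can be sketched as follows (a simplified example, not the exact pipeline used in the study; the repository path and branch name are placeholders): one commit is picked roughly every 180 days along the main branch, and each selected snapshot is then checked out and analyzed with sonar-scanner before its measures are exported to CSV.

```python
import subprocess
from datetime import datetime, timedelta

def snapshot_commits(repo_path, branch="master", interval_days=180):
    """Pick one commit roughly every `interval_days` along the main branch."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", branch, "--first-parent",
         "--reverse", "--format=%H %cI"],
        capture_output=True, text=True, check=True).stdout.splitlines()
    snapshots, next_cut = [], None
    for line in log:
        sha, iso = line.split(maxsplit=1)
        when = datetime.fromisoformat(iso)
        if next_cut is None or when >= next_cut:
            snapshots.append((sha, when.date()))
            next_cut = when + timedelta(days=interval_days)
    return snapshots

# Example (placeholder path):
# for sha, day in snapshot_commits("/tmp/commons-io"):
#     print(sha, day)   # each sha would then be analyzed with sonar-scanner
```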

Metric Description
Size
Number of classes Number of classes (including nested classes, interfaces, enums and annotations).
Number of files Number of files.
Lines Number of physical lines (number of carriage returns).
Ncloc Also known as Effective Lines of Code (eLOC). Number of physical lines that contain at least one character which is neither a whitespace nor a tabulation nor part of a comment.
Ncloc language distribution Non Commenting Lines of Code Distributed By Language
Number of classes and interfaces Number of Java classes and Java interfaces
Missing package info Missing package-info.java file (used to generate package-level documentation)
Package Number of packages
Statements Number of statements.
Number of directories Number of directories in the project, also including directories not containing code (e.g., images, other files…).
Number of functions Number of functions. Depending on the language, a function is either a function or a method or a paragraph.
Number of comment lines Number of lines containing either comment or commented-out code. Non-significant comment lines (empty comment lines, comment lines containing only special characters, etc.) do not increase the number of comment lines.
Comment lines density Density of comment lines = Comment lines / (Lines of code + Comment lines) * 100
Complexity
Complexity It is the Cyclomatic Complexity calculated based on the number of paths through the code. Whenever the control flow of a function splits, the complexity counter gets incremented by one. Each function has a minimum complexity of 1. This calculation varies slightly by language because keywords and functionalities do.
Class complexity Complexity average by class
Function complexity Complexity average by method
Function complexity distribution Distribution of method complexity
File complexity distribution Distribution of complexity per class
Cognitive complexity How hard it is to understand the code’s control flow.
Package dependency cycles Number of package dependency cycles
Test coverage
Coverage It is a mix of Line coverage and Condition coverage. Its goal is to provide an even more accurate answer to the following question: How much of the source code has been covered by the unit tests?
Lines to cover Number of lines of code which could be covered by unit tests (for example, blank lines or full comments lines are not considered as lines to cover).
Line coverage On a given line of code, Line coverage simply answers the following question: Has this line of code been executed during the execution of the unit tests?
Uncovered lines Number of lines of code which are not covered by unit tests.
Duplication
Duplicated lines Number of lines involved in duplications
Duplicated blocks Number of duplicated blocks of lines.
Duplicated files Number of files involved in duplications.
Duplicated lines density Density of duplicated lines = (Duplicated lines / Lines) * 100
Table 2. The software metrics

Data Analysis. Similarly to previous work (Di Nucci et al., 2018), we selected 8 machine learning techniques, namely Linear Regression, Random Forest, Gradient Boost, Extra Trees, Decision Trees, Bagging, AdaBoost, and SVM, in order to overcome the limitations of any single technique. We then performed a second analysis, retraining the models using a drop-column mechanism (Terence et al., 2018). This mechanism is a simplified variant of the exhaustive search (Yoon et al., 2005), which iteratively tests every subset of features for its regression performance. The full exhaustive search is very time-consuming, requiring $2^n$ train-evaluation steps for an $n$-dimensional feature space; instead, we only drop individual features one at a time, rather than all possible groups of features. For each regressor, to easily gauge the overall accuracy of the machine learning algorithm in a model, we calculated the coefficient of determination ($R^2$) and the Mean Absolute Error (MAE).

MAE is defined as follows: $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$, where $y_i$ is the observed value, $\hat{y}_i$ the predicted value, and $n$ the number of observations.
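A minimal scikit-learn sketch of this analysis is shown below, assuming that X holds the 28 SonarQube metrics per snapshot and y the corresponding remediation effort (e.g., the sqale index); hyperparameters are left at their defaults and may differ from the configuration actually used in the study.

```python
import numpy as np
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              ExtraTreesRegressor, BaggingRegressor, AdaBoostRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_validate

# The eight regressor families named in the text, with default settings.
REGRESSORS = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(),
    "Gradient Boost": GradientBoostingRegressor(),
    "Extra Trees": ExtraTreesRegressor(),
    "Decision Trees": DecisionTreeRegressor(),
    "Bagging": BaggingRegressor(),
    "AdaBoost": AdaBoostRegressor(),
    "SVM": SVR(),
}

def evaluate(model, X, y, cv=10):
    """Cross-validated MAE and R2 (mean and standard deviation)."""
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=("neg_mean_absolute_error", "r2"))
    mae = -scores["test_neg_mean_absolute_error"]
    r2 = scores["test_r2"]
    return mae.mean(), mae.std(), r2.mean(), r2.std()

def drop_column_importance(model, X, y, feature_names):
    """Drop-column importance: increase in MAE when one feature is removed."""
    baseline_mae = evaluate(model, X, y)[0]
    importances = {}
    for i, name in enumerate(feature_names):
        reduced = np.delete(X, i, axis=1)
        importances[name] = evaluate(model, reduced, y)[0] - baseline_mae
    return importances
```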

Results. We report the results obtained in order to answer our RQ in Table 3 and Table 4. As we can see, even though R2 is good in many cases, the absolute error (MAE) is very high for all the machine learning techniques applied in this study.

Regressor MAE MAE_std R2 R2_std
Linear Regression 9,382.623 4,372.698 0.952 0.075
Random Forest 6,594.945 1,161.236 0.976 0.019
Gradient Boost 7,717.614 1,150.637 0.974 0.022
Extra Trees 5,789.625 1,404.204 0.981 0.017
Decision Trees 7,626.258 1,689.545 0.967 0.030
Bagging 6,663.218 1,120.130 0.976 0.019
AdaBoost 13,024.412 3,303.271 0.954 0.043
SVM 91,231.180 45,517.892 -0.521 0.140
Table 3. Maintainability remediation effort vs All Metrics
Regressor MAE MAE_std R2 R2_std
Linear Regression 259.860 92.249 0.839 0.237
Random Forest 360.371 146.910 0.324 0.699
Gradient Boost 429.584 142.428 0.210 0.812
Extra Trees 252.508 96.295 0.770 0.222
Decision Trees 359.689 206.836 0.372 0.616
Bagging 362.272 155.184 0.287 0.801
AdaBoost 488.048 101.195 0.348 0.566
SVM 1,583.805 1,571.807 -0.371 0.072
Table 4. Reliability remediation effort vs All Metrics

Based on the obtained results, we notice that neither technical debt nor reliability remediation effort is well correlated with the 28 software metrics measured by SonarQube. Moreover, we are not able to determine which classifier is better suited to instrument a technical debt prediction.

4. Surgically-Precise Technical Debt Estimation: Concept and Approach

Our preliminary results, together with the additional recent work reported here, highlight how the current instruments for estimating TD are not mature yet: in particular, current tools and metrics to estimate Code Debt do not provide agreement regarding what to refactor with respect to maintainability and reliability. Software practitioners have a plethora of metrics and recommendations to improve their code, but, in practice, it is difficult to prioritize the right ones. This can have the negative effect of creating confusion and keeping practitioners from using the available instruments to estimate and refactor TD. There is a need for techniques that are precise enough for practitioners to trust them. We therefore propose two main approaches for future work, in order to estimate TD in a surgically-precise way. In both cases, the use of machine learning approaches would provide a great opportunity to achieve such precision.

1. Estimation precision based on real impact and costs. First and foremost, the metrics explored here do not take into consideration the real effort and costs incurred by practitioners (the principal and interest of technical debt). Does a complex class lead to more effort for developers? Do more violations highlighted by SonarQube make the code really more difficult to change and more bug-prone? Are these issues hindering developers in continuously delivering value to the customers? We propose to refine the existing metrics and recommendations with additional metrics related to project costs and effort. The integration of such metrics would help in creating a model where code smells and refactoring suggestions are ranked higher if they are associated with a higher negative impact, and are therefore more important for practitioners to refactor (in accordance with technical debt theory). As an example, code smells that have been associated with the occurrence of more bugs should be prioritized by developers. Such a surgically-precise approach can make use of the most advanced machine learning techniques in order to create a reliable cost-impact model to classify and rank code smells.
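As a toy illustration of such a cost-impact ranking (hypothetical data and an illustrative scoring function, not the model proposed above), the sketch below orders TD items by their observed impact per unit of estimated remediation cost; the rule identifiers are example labels only.

```python
# Hypothetical TD items with an impact proxy (bugs later linked to the
# offending code) and a remediation estimate in minutes.
td_items = [
    {"rule": "rule-A", "linked_bugs": 4, "remediation_min": 10},
    {"rule": "rule-B", "linked_bugs": 1, "remediation_min": 60},
    {"rule": "rule-C", "linked_bugs": 0, "remediation_min": 5},
]

def priority(item):
    # Rank items with a higher observed impact per unit of remediation cost first.
    return item["linked_bugs"] / item["remediation_min"]

for item in sorted(td_items, key=priority, reverse=True):
    print(item["rule"], round(priority(item), 3))
```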

2. Estimation precision based on historical data. During the lifetime of software artefacts in a project, such artefacts undergo various lifecycle stages: they are incepted, refactored, deprecated, and more. To achieve surgically-precise estimation, in this case we use techniques intended to take into account the entire history of each TD item, either from a specific target project under analysis or from related projects elsewhere in a software ecosystem. In line with this assumption, we also assume that each TD item has its own nature, evolutionary dynamics, as well as nurture, causes, and effects. As such, we propose the use of machine-learning approaches to encompass this analysis and provide a precise estimation. The fundamental research concept we propose is that the intimate nature of each TDEBT item should be connected to the estimation mechanisms behind technical debt: if debt evolves conjointly with the artefacts and complex mechanisms regulate its precise estimation, then effort estimation for project success is, in turn, simplified and made more precise, to the point at which automated mechanisms can be further used to plan, direct, and execute software maintenance and evolution activities. Our conjecture is that machine-learning approaches can account for such dynamics and offer a solution. Figure 1 offers an overview of the intended context of analysis.

Figure 1. Approach and conceptual overview: artifacts evolve over time and TDEBT should be estimated more precisely in a just-in-time fashion
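As a minimal illustration of how the history of a single TD item could be encoded as input for such a predictive model (the feature set below is illustrative, not the one proposed in this paper):

```python
import numpy as np

def lifecycle_features(events):
    """Turn a TD item's event history into a fixed-length feature vector.

    `events` is a chronologically sorted list of (event_kind, day_offset)
    tuples, e.g. [("introduced", 0), ("removed", 120), ("reintroduced", 400)].
    The chosen features are illustrative assumptions.
    """
    kinds = [k for k, _ in events]
    days = [d for _, d in events]
    open_spans = [b - a for (ka, a), (kb, b) in zip(events, events[1:])
                  if ka in ("introduced", "reintroduced") and kb == "removed"]
    return np.array([
        len(events),                               # how eventful the item's life is
        kinds.count("reintroduced"),               # recurrence of the same issue
        np.mean(open_spans) if open_spans else 0,  # average time spent open (days)
        days[-1] - days[0],                        # total observed lifetime (days)
    ])

# Such vectors, together with project cost/effort metrics, would feed the
# predictive model sketched in the data analysis of Section 3.
print(lifecycle_features([("introduced", 0), ("removed", 120), ("reintroduced", 400)]))
```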

In the context in question, software metrics are used to keep track of the nature of the software artefacts that are part of a project, as well as of the variations in their status (e.g., re-opened bugs, mutated code smells, etc.). In addition, project metrics can be used to take costs and efforts into account. In turn, a predictive model can factor in the metrics themselves and provide evolving snapshots of the additional project cost (i.e., technical debt). In line with this concept and approach, we envision the following challenges:

  • TD estimation in conjunction with the lifecycle and evolution of single TDEBT items. Different TD items might follow different evolutions; for example, the presence of a code smell might create bugs in the short term but not in the long term. In this case, a precise model would recognize such a smell as urgent to refactor.

  • Continuous estimation of TD over time. Tapping into the history of related projects or related refactoring scenarios, TD could be estimated continuously using a comparative analysis of within- and cross-project estimation.

  • Association of TD with impact metrics. A few impact metrics have been proposed as proxies for effort and costs, such as bug and change proneness, but additional project metrics could help, e.g., bug-fixing times.

  • Costs related to the refactoring of TD. TD items should also be weighted with respect to the cost of their refactoring. If a TD item has the same impact as another one, but it is known to take more time to refactor, the former should be prioritized.

5. Threats to Validity

In this section, we introduce the threats to validity and the different tactics we adopted to mitigate them.

We selected 33 projects from the Apache Software Foundation, which incubates only systems that follow specific and strict quality rules. Our case study was not based on only one application domain; we avoided this since we aimed to find general models for the prediction of technical debt in a system. Choosing only one or a very small number of application domains could have limited the generality of our study, as only prediction models from the selected application domain would have been obtained. The selected projects stem from a very large set of application domains, ranging from external libraries, frameworks, and web utilities to large computational infrastructures. The application domain was not an important criterion for the selection of the projects to be analyzed, but in any case we tried to balance the selection and pick systems from as many contexts as possible. We considered only open-source projects, so we cannot speculate on industrial projects. Moreover, we only considered Java projects due to the limitations of the tools used (SonarQube provides a different set of TD issues for each language), and results from different languages would not have been comparable.

In our case, this threat could be represented by the analysis method applied in our study. We reported the results considering descriptive statistics. Moreover, instead of relying on a single regression technique, we compared the prediction power of different regressors to reduce the bias of the low prediction power that one single technique could have. We do not exclude the possibility that other statistical or machine learning approaches, such as Deep Learning, might have yielded similar or even better accuracy than our modeling approach. However, considering the extremely low importance of each TD issue and its statistical significance, we do not expect to find big differences applying other types of classifiers.

6. Related Work

Saarimaki et al. (Romano, 2019) investigated the accuracy of the SonarQube remediation time estimation by asking 65 novice developers to remove TD items from 15 open-source Java projects. They compared the effort needed by developers to repay TD with the estimation proposed by SonarQube. Remediation time is generally overestimated by the tool compared to the actual time needed to patch TD items. The most accurate estimations relate to code smells, while the least accurate concern bugs.

Lenarduzzi et al. (Lenarduzzi et al., 2019a) investigated the fault proneness of SonarQube violations, in order to understand which violations are actually fault-prone and to assess the accuracy of fault-prediction models. They conducted an empirical study on 21 well-known, mature open-source projects from the Apache Software Foundation (ASF). Each fault-inducing commit was labeled applying the SZZ algorithm and analyzed with eight machine learning techniques (Logistic Regression, Decision Tree, Random Forest, Extremely Randomized Trees, AdaBoost, Gradient Boosting, XGBoost). Results showed that, among the 202 SonarQube violations, only 26 have a low fault-proneness, and violations classified as ”bugs” hardly ever led to a failure. Moreover, the accuracy of a fault-prediction model trained on all violations is extremely low (AUC 50.94%) compared with the accuracy obtained considering only the 26 violations labeled as fault-prone (AUC 83%). These results confirm that SonarQube rules should be thoroughly investigated in order to understand which ones are really harmful, so that technical debt can be reduced effectively.

Falessi et al. (Falessi et al., 2017) analyzed the distribution of 16 metrics and 106 (out of 202) SonarQube violations in an industrial project. Moreover, this study also evaluated the fault-proneness of these measures. They claimed that, by removing the violations, 20% of the faults in the code would have been preventable.

Tollin et al. (I. Tollin et al., 2017) investigated the change-proneness of SonarQube violations applying machine learning techniques. They found that the presence of violations increases change-proneness at class level.

7. Conclusion

In this work, we conceptualize a technical debt estimation approach that enables a more precise and fine-grained analysis of technical debt, based on the evolution of the software over time. We apply the first steps of the approach to a dataset of 33 Java projects from the Apache Software Foundation, analyzing them with different machine-learning techniques in order to obtain a preliminary model of the actual gap between the rule-based approach of SonarQube (as represented by its own estimations using its own metrics) and the actual timings and costs evident from the history of the software projects in our dataset.

The main outcome of our preliminary investigation is that the current instruments for estimating TD are not mature yet. Despite the large variety of software metrics available for software measurement and improvement, it is very complex to understand which metrics to consider and how to prioritize their importance, mainly because the metrics explored so far do not take into consideration the real effort and costs incurred by practitioners (the principal and interest of technical debt).

Future work includes the application of this approach to a larger dataset and the implementation of the approach on different types of issues, including code smells and rules detected by SonarQube, but also rules detected by other tools such as BetterCodeHub, Coverity Scan, and others.

Acknowledgments. Damian's work is partially supported by the European Commission grants no. 787061 (H2020), ANITA, no. 825040 (H2020), RADON, no. 825480 (H2020), SODALITE.

References

  • K. Beck (1999) Refactoring: improving the design of existing code. Addison-Wesley Longman Publishing Co., Inc.. Cited by: §2.
  • T. Besker, A. Martini, R. E. Lokuge, K. Blincoe, and J. Bosch (2018) Embracing technical debt, from a startup company perspective. In Int. Conf. on Software Maintenance and Evolution (ICSME), Vol. , pp. 415–425. Cited by: §1.
  • W. Cunningham (1992) The wycash portfolio management system. OOPSLA-92. External Links: ISBN 0-89791-610-7 Cited by: §1.
  • D. Di Nucci, F. Palomba, D. Tamburri, A. Serebrenik, and A. De Lucia (2018) Detecting code smells using machine learning techniques: are we there yet?. In Int. Conf. on Software Analysis, Evolution, and Reengineering, pp. . Cited by: §3.
  • G. Digkas, M. Lungu, P. Avgeriou, A. Chatzigeorgiou, and A. Ampatzoglou (2018) How do developers fix issues and pay back technical debt in the apache ecosystem?. In SANER 2018, Vol. , pp. 153–163. External Links: ISSN Cited by: §1.
  • G. Digkas, A. C. M. Lungu, and P. Avgeriou (2017) The evolution of technical debt in the apache ecosystem. pp. 51–66. Cited by: §1.
  • D. Falessi, B. Russo, and K. Mullen (2017) What if I had no smells?. ESEM 2017. Cited by: §6.
  • M. Höst (2009) Guidelines for conducting and reporting case study research in software engineering. Empirical Softw. Engg. 14 (2), pp. 131–164. Cited by: §3.
  • I. Tollin, F. Arcelli Fontana, M. Zanoni, and R. Roveda (2017) Change prediction through coding rules violations. EASE’17, pp. 61–64. Cited by: §6.
  • V. Lenarduzzi, F. Lomio, D. Taibi, and H. Huttunen (2019a) On the fault proneness of sonarqube technical debt violations: a comparison of eight machine learning techniques. arXiv:1907.00376. Cited by: §1, §3, §6.
  • V. Lenarduzzi, N. Saarimäki, and D. Taibi (2019b) The technical debt dataset. In 15th conference on PREdictive Models and data analycs In Software Engineering, PROMISE ’19. Cited by: §3.
  • V. Lenarduzzi, A. Sillitti, and D. Taibi (2017) Analyzing forty years of software maintenance models. In Proceedings of the 39th International Conference on Software Engineering Companion, ICSE-C ’17, pp. 146–148. External Links: ISBN 978-1-5386-1589-8 Cited by: §1.
  • V. Lenarduzzi, A. Sillitti, and D. Taibi (2019c) A survey on code analysis tools for software maintenance prediction. In Int. Conf. in Software Engineering for Defence Applications, Cited by: §1.
  • W. Li and R. Shatnawi (2007) An empirical study of the bad smells and class error probability in the post-release object-oriented system evolution. J. Syst. Softw. 80 (7), pp. 1120–1128. External Links: ISSN 0164-1212 Cited by: §1.
  • A. Martini, J. Bosch, and M. Chaudron (2015) Investigating architectural technical debt accumulation and refactoring over time: a multiple-case study. Information and Software Technology 67, pp. 237 – 253. Cited by: §1.
  • M. Nagappan, T. Zimmermann, and C. Bird (2013) Diversity in software engineering research. ESEC/FSE 2013, pp. 466–476. External Links: ISBN 978-1-4503-2237-9 Cited by: §3.
  • M. Patton (2002) Qualitative Evaluation and Research Methods. Sage, Newbury Park. Cited by: §3.
  • S. Romano (2019) On the accuracy of sonarqube technical debt remediation time. SEAA Euromicro 2019. Cited by: §1, §6.
  • N. Saarimäki, V. Lenarduzzi, and D. Taibi (2019) On the diffuseness of code technical debt in open source projects. Int. Conf. on Technical Debt (TechDebt 2019). Cited by: §1.
  • D. Taibi, A. Janes, and V. Lenarduzzi (2017) How developers perceive smells in source code: a replicated study. Information and Software Technology 92, pp. 223 – 235. Cited by: §3.
  • P. Terence, T. Kerem, C. Christopher, and H. Jeremy (2018) Beware default random forest importances. Note: http://explained.ai/rf-importance/index.html, accessed: 2019-07-20. Cited by: §3.
  • C. Vassallo, S. Panichella, F. Palomba, S. Proksch, A. Zaidman, and H. C. Gall (2018) Context is king: the developer perspective on the usage of static analysis tools. SANER 2018. Cited by: §3.
  • H. Yoon, K. Yang, and C. Shahabi (2005) Feature subset selection and feature ranking for multivariate time series. IEEE transactions on knowledge and data engineering 17 (9), pp. 1186–1198. Cited by: §3.