Better Technical Debt Detection via SURVEYing

05/20/2019 ∙ by Fahmid M. Fahid, et al. ∙ NC State University

Software analytics can be improved by surveying; i.e. rechecking and (possibly) revising the labels offered by prior analysis. Surveying is a time-consuming task and effective surveyors must carefully manage their time. Specifically, they must balance the cost of further surveying against the additional benefits of that extra effort. This paper proposes SURVEY0, an incremental Logistic Regression estimation method that implements such a cost/benefit analysis. Some classifier is used to rank the as-yet-unvisited examples according to how interesting they might be. Humans then review the most interesting examples, after which their feedback is used to update an estimator of how many interesting examples remain. This paper evaluates SURVEY0 in the context of self-admitted technical debt. As software projects mature, they can accumulate "technical debt"; i.e. developer decisions which are sub-optimal and decrease the overall quality of the code. Such decisions are often commented on by programmers in the code; i.e. they are self-admitted technical debt (SATD). Recent results show that text classifiers can automatically detect such debt. We find that we can significantly outperform prior results by SURVEYing the data. Specifically, for ten open-source JAVA projects, we can find 83% of the technical debt while humans read only 16% of the comments (and if higher levels of recall are required, SURVEY0 can adjust towards that with some additional effort).


I Introduction

This paper is about cost-effective analytics using surveying; i.e. rechecking and (possibly) revising the labels found by prior analysis. We demonstrate the value of surveying by showing that it can lead to better predictors for technical debt than existing state-of-the-art methods [1].

Studying technical debt is important since it can significantly damage project maintainability [2, 3, 4]. When developers cut corners and make haste to rush out code, then that code often contains technical debt; i.e. decisions that must be repaid, later on, with further work. Technical debt is like dirt in the gears of software production. As technical debt accumulates, development becomes harder and slower. Technical debt can affect many aspects of a system including evolvability (how fast we can add new functionality) and maintainability (how easily developers can handle new or unseen bugs in code).

Surveying is important for automated software engineering since many automated software analytics methods assume that they are learning from correctly labelled examples. However, before an automated method uses labels from old data, it is prudent to revisit and recheck the labels generated by prior analysis. This is needed since humans often make mistakes in the labelling [5]. But surveying can be a (very) time-consuming process. For example, later we show that surveying all the data used in this study would require more than 350 hours. Clearly, surveying will not be adopted as standard practice unless we can reduce its associated effort.

Algorithm 1 describes SURVEY0, a human-in-the-loop algorithm for reducing the cost of surveying. The details of SURVEY0 are offered later in this paper. For now, suffice it to say that SURVEY0 includes an early exit strategy (in Step 5) that is triggered if “enough” examples have been found.

To assess the significance of SURVEY0, this paper builds predictors for technical debt, with and without surveying. That experience let us answer the following research questions.

  1. Randomly sort the software artifacts (e.g. code comments);

  2. Use prior data to build a classifier C and a sorter S;

  3. Using S, ask the reader R to review and label the first n artifacts as “good” or “bad” (in our case, “bad” means “has TD”);

  4. Using the labelled examples, update an estimator E of how many “bad” examples remain in the unlabelled examples;

  5. Exit if E says we have found enough “bad” examples;

  6. Else:

    • Skip over the first n artifacts (they are now labelled);

    • Apply the sorter S to arrange the remaining artifacts in order of descending “bad”-ness;

    • Loop to Step 3.

Algorithm 1 SURVEY0 = ⟨R, S, C, E, n⟩. Using a human reader R, a sorter S, a classifier C and an estimator E, SURVEY0 updates its knowledge every n examples.
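To make the workflow concrete, the following is a minimal Python sketch of the Algorithm 1 loop. The classifier, sorter, estimator and human arguments are hypothetical callables standing in for the components described in Section IV, and the early-exit test is a simplified reading of Step 5; treat this as an illustration, not the authors' released implementation.

import random

def survey0(artifacts, classifier, sorter, estimator, human, n=100, target=0.90):
    """Minimal sketch of Algorithm 1 (helper objects are assumptions).

    artifacts  : list of unlabelled items (e.g. code comments)    -- Step 1
    classifier : model pre-trained on prior projects              -- Step 2
    sorter     : callable ranking artifacts by likely "bad"-ness  -- Step 2
    human      : callable returning "good"/"bad" for one artifact -- Step 3
    estimator  : callable guessing how many "bad" items remain    -- Step 4
    """
    random.shuffle(artifacts)                              # Step 1: random initial order
    labelled = []
    while artifacts:
        batch, artifacts = artifacts[:n], artifacts[n:]    # Step 3: next n items
        labelled += [(a, human(a)) for a in batch]         # human labels them
        found = sum(1 for _, y in labelled if y == "bad")
        remaining = estimator(labelled, artifacts)         # Step 4: update estimate
        if found >= target * (found + remaining):          # Step 5: early exit
            break
        artifacts = sorter(classifier, artifacts)          # Step 6: re-rank the rest
    return labelled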

RQ1: Is surveying necessary? If there are no disputes about what labels to assign to (e.g.) code comments, there is no need for surveying the data. But this is not the case. We find that labels about technical debt from different sources have many disagreements (36% to 79%, median to max). Hence:

Fig. 1: Impact of technical debt on software. From [6].

Conclusion #1: Surveying is required to resolve disagreement about labels.

RQ2: Is SURVEY0 useful? SURVEY0 cannot be recommended unless it improves our ability to make quality predictions. Therefore we compared the predictive power of classifiers trained with and without SURVEY0’s labels. We found that SURVEY0’s labels improved recall from 63% to 83% (median values across ten data sets). That is: Conclusion #2: SURVEY0 improved quality predictions.

RQ3: Is SURVEY0 comparable to the state-of-the-art for human-in-the-loop AI? SURVEY0 is not a research contribution unless it out-performs other human-in-the-loop AI tools. Therefore we compared our results to those from an “optimum” tool. For this study, “optimum” was computed by giving prior state-of-the-art methods an undue advantage (allowing those methods to tune hyperparameters for better test results). We found that:

Conclusion #3: SURVEY0 made its predictions at a near optimum rate.

RQ4: How soon can SURVEY0 learn quality predictors? SURVEY0 cannot be said to mitigate the relabelling problem unless it finds most quality issues using very few artifacts. Therefore we tracked how many artifacts SURVEY0 had to show the humans before it found most of the technical debt. We found that SURVEY0 asked humans to read around 16% of the comments while finding 83% of the technical debt (median values across ten data sets). Conclusion #4: SURVEY0 can be recommended as a way to reduce the labelling effort.

RQ5: Can SURVEY0 find more issues? The previous research question showed that SURVEY0 can find most issues after minimal human effort. But if finding (e.g.) 83% of the quality issues is not enough, can SURVEY0 be used to find even more technical debt issues? We find that SURVEY0’s stopping rule can be modified to find more issues at the additional cost of more reading: Conclusion #5: SURVEY0 can be used to reach additional desired levels of quality assurance.

RQ6: How much does SURVEY0 delay human readers? Humans grow frustrated and unproductive when they wait a long time for a system response. Therefore we recorded how long humans had to wait for SURVEY0’s conclusions. We found that SURVEY0 needs half a minute to find the next most interesting programmer comments. Humans, on the other hand, need around twenty minutes [7] to assess whether those 100 comments are examples of “self-admitted technical debt” (defined later in this paper). That is, SURVEY0 delays humans by less than 5%. To put that another way: Conclusion #6: SURVEY0 imposed negligible overhead (i.e. less than 5%) on the activity of human experts.

The rest of this paper is structured as follows. In Section II, we first discuss the background work and other related concepts needed for our study. In Section IV, we give a brief description of our dataset, a detailed description of SURVEY0, and our experiment and evaluation methods. Section VI discusses the results of our study. We discuss the threats to validity in Section VII. We close by discussing the implications of this work and possible future directions.

Note that, for reproduction purposes, all our data and scripts are publicly available (see github.com/blinded4Review).

II Background

II-A About Technical Debt

Technical debt (TD) affects multiple aspects of the software development process (see Figure 1). The term was first introduced by Cunningham in 1993 [2]. It is a widespread problem:

  • In 2012, after interviewing 35 software developers from different projects in different companies, varying in both size and type, Lim et al. [8] found that developers generate TD due to factors like increased workload, unrealistic project deadlines, lack of knowledge, boredom, peer-pressure among developers, unawareness or short-term business goals of stakeholders, and reuse of legacy, third party, or open source code.

  • After observing five large scale projects, Wehaibi et al. found that the amount of technical debt in a project may be very low (only 3% on average), yet it creates a significant number of defects in the future (and fixing such technical debt is more difficult than fixing regular defects) [9].

  • Another study of five large scale software companies revealed that TD contaminates other parts of a software system, and that most of the future interest is non-linear with respect to time [10].

  • According to the SIG (Software Improvement Group) study of Nugroho et al., resolving the TD owed by a regular mid-level project has a Return On Investment (ROI) of 15% over seven years [4].

  • Guo et al. also found similar results and concluded that the cost of resolving TD in the future is twice that of resolving it immediately [3].

Much research has tried to identify TD as part of code smells using static code analysis, with limited success [11, 12, 13, 14, 15]. Static code analysis has a high rate of false alarms while imposing complex and heavy structures for identifying TD [16, 17, 18, 19].

Recently, much more success has been seen in the work on so-called “self-admitted technical debt” (SATD). A significant part of technical debt is often “self-admitted” by the developer in code comments [20]. In 2014, after studying four large scale open source software projects, Potdar and Shihab [20] concluded that developers intentionally leave traces of TD in their comments with remarks like “hack, fixme, is problematic, this isn’t very solid, probably a bug, hope everything will work, fix this crap”. Potdar and Shihab found 62 distinct keywords for identifying such TD [20] (similar conclusions were made by Farias et al. [21]). In 2015, Maldonado et al. used five open source projects to manually classify different types of SATD [7] and found:

  • SATD mostly contains requirement debt and design debt in source code comments;

  • 75% of the SATD gets removed, but the median lifetime of SATD ranges between 18-173 days [22].

Another study tried to find SATD-introducing commits in GitHub using different change-level features [23]. Instead of a bag-of-words approach, a more recent study proposed word embeddings as the vectorization technique for identifying SATD [24]. Other studies investigated source code comments using different text processing techniques. For example, Tan et al. analyzed source code comments using natural language processing to understand programming rules and documentation and to indicate comment quality and inconsistency [25, 26]. A similar study was done by Khamis et al. [27]. After analyzing and categorizing comments in source code, Steidl et al. proposed a machine learning technique that can measure comment quality according to category [28]. Malik et al. used random forests to understand the lifetime of code comments [29]. A similar study over three open source projects was done by Fluri et al. [30].

In 2017, Maldonado et al. identified two types of SATD in 10 open source projects (average 63% F1 score) using Natural Language Processing (a Max Entropy Stanford Classifier) with only 23% of the training examples [31]. A different approach was introduced by Huang et al. in 2018 [1]. Using eight datasets, Huang et al. built a Naive Bayes Multinomial sub-classifier for each training dataset, using information gain for feature selection. By combining the sub-classifiers into an ensemble, they found an average F1 score of 73% across all datasets [1]. An Eclipse IDE plugin was recently released that uses this technique for identifying SATD in Java projects [32].

To the best of our knowledge, Huang et al.’s EMSE’18 paper is the current state-of-the-art approach for identifying SATD. Hence, we base our work on their methods.

II-B About Surveying

This section describes surveying, why it is needed, and why cost-effective methods for surveying are required.

Standard practice in software analytics is for different researchers to try their methods on shared data sets. For example, in 2010, Jureczko et al. [33] offered tables of data that summarized dozens of open source JAVA projects. That data is widely used in the literature. A search at Google Scholar on “xalan synapse” (two of the Jureczko data sets) shows that these data sets are used in 177 papers and eight textbooks, 126 of which are from the last five years.

Reusing data sets from other researchers has its advantages and disadvantages. One advantage is repeatability of research results; i.e. using this shared data, it is now possible and practical to repeat, refute, or improve prior results. For examples of this kind of analysis, see the proceedings of the PROMISE conference or the ROSE festivals (recognizing and rewarding open science in SE) at FSE’18, FSE’19, ESEM’19 and ICSE’19. See also the list of 678 papers which reuse data from the Software-artifact Infrastructure Repository at Nebraska University (sir.csc.ncsu.edu/portal/usage.php).

Another advantage is faster research. Software analytics data sets contain independent and dependent variables. For example, in the case of self-admitted technical debt, the independent variables are the programmer comments and the dependent variable is the label “SATD=yes” or “SATD=no”. Independent variables can often be collected very quickly (e.g. GitHub’s API permits 5000 queries per hour). However, assigning the dependent labels is a comparatively much slower task. According to Maldonado and Shihab [7], classifying 33,093 comments as “SATD {yes,no}” from five open source projects took approximately 95 hours for a single person; i.e. 10.3 seconds per comment. Using that information, we calculated that relabelling the data used in this paper would require months of work (see Table I). When a task takes months to complete, it is not surprising that research teams tend to reuse old labels rather than make their own.

That said, the clear disadvantage of reusing old labels is reusing old mistakes. Humans often make mistakes when labelling [5]. Hence, it is prudent to review the labels found in a dataset. We use the term “surveying” to refer to the process of revisiting, rechecking, and possibly revising the labels offered by prior analysis.

In our experience, surveying is usually done on a somewhat informal basis. For example, researchers would manually survey a small number of randomly selected artifacts (e.g. 1% of the corpus; or 100 artifacts). There are many problems with the informal approach to surveying:

  • How many random selections are enough? That is, on what basis should we select?

  • And when to stop surveying? Should finding errors prompt more samples? Or is there some point after which further surveying is no longer cost-effective?

In order to answer these questions, the rest of this article discusses cost-effective methods for surveying.


Scenario #1) Hackathons: Our dataset contains 62,275 comments (from ten projects). At 10.3 seconds/comment [7], this takes roughly 62,275 × 10.3 s ≈ 178 hours to label. With two readers (one to read, one to verify) this time becomes 356 hours. Using the power of pizza, we can assemble a hackathon team of half a dozen graduate students willing to work on tasks like this, six hours per day, two days per month; i.e. 6*6*2=72 hours per month. At that rate, this relabelling would take around five months.


Scenario #2) Teams of Two: Note that, if pushed, we could demand more time from these students. For example, we could demand that two students work on this task, full time. Given the tedium of that task, we imagine that they could work productively on this task for 20 hours per week per person. Under these conditions, revisiting and relabelling our data would take nearly two months.


Scenario #3) Crowdsourcing: Given sufficient funds, such labelling could be done at a much faster rate. Crowdsourcing tools like Mechanical Turk could be used to assemble any number of readers to revisit and relabel all comments in just a matter of hours [34, 35, 36]. While this is certainly a useful heuristic method for scaling up labelling, every so often there must be a validation study where the results of crowdsourcing are checked against some “ground truth”. This paper is concerned with cost-effective methods for generating that ground truth.

TABLE I: Cost of labelling: three different scenarios.

III Related Work

The process we call surveying uses some technology from research on active learning [37, 38]. “Active learners” assume that some oracle offers labels to examples and that there is a cost incurred, each time we invoke the oracle (in the case of surveying, that might mean asking a human to check if a particular code comment is an example of SATD).

The research of active learning was certainly motivating for this work. However, standard active learning methods were not immediately applicable to the problem of technical debt. Accordingly, we made numerous changes to standard methods.

Firstly, SURVEY0’s workflow is different (more informed) than that of a standard active learner. Such learners do not know when to stop learning. Since our goal is to understand how many more items we need to read, SURVEY0 adds an incremental estimation method that studies how fast humans are currently finding interesting examples and imposes a stopping criterion based on that estimate. That estimation method is described later in this paper.

(Aside: outside the machine learning literature, we did find two information retrieval methods for predicting when to stop incremental learning from Ros et al. [39] and Cormack [40]. When we experimented with these methods, we found that our estimators out-performed these methods. For more on this point, see the RQ3 results discussed later in this paper.)

Secondly, we needed different learning methods. Active learning in SE has been applied previously in (e.g.) the code search recommender tools of Gay et al. [41] that seek methods implicated in bug reports. Our work is very different to that:

  • Code search recommender tools input bug reports and output code locations. In between, those tools search static code descriptors; i.e. theirs is a code analysis tool.

  • The tools of this paper input programmer comments and output predictors of technical debt. In between, our methods search text comments; i.e. ours is a text mining tool.

Thirdly, we had to make more use of prior knowledge:

  • Initially, we tried tools built to help researchers find (say) a few dozen relevant papers within 1000 abstracts downloaded from Google Scholar [42]. Those tools were not successful (they resulted in single digit recall values).

  • On investigation, we realized those tools started learning afresh for each new problem. That is, those tools assumed that prior knowledge was not relevant to new projects.

  • That assumption seemed inappropriate for this paper since, for most commercial software developers, software is more often extended and refined than built from scratch. In such an environment, it is possible to discover important lessons from prior projects.

  • Hence, as shown in Step 2 of Algorithm 1, SURVEY0 starts by learning models from all prior projects. After that, the rest of SURVEY0 uses feedback from the current project to refine the estimations from those models.

IV Inside SURVEY0

Recall from the above that SURVEY0 is characterized by SURVEY0 = ⟨R, S, C, E, n⟩. That is, SURVEY0 updates its knowledge every n examples, using a reader R, a sorter S, a classifier C and an estimator E. In the experiments of this paper, n defines how much data is passed to humans at each iteration (each time, we pass n = 100 examples).

The rest of this section describes these components. Just to state the obvious, this section includes many engineering choices which future research may want to revisit and revise. We make no claim that SURVEY0 is the best surveying tool. Rather, our goal is to popularize the surveying problem and produce a baseline result which can be used to guide the creation of better surveyors.

IV-A About the Classifiers C

This paper compares two classifiers: a linear Support Vector Machine (SVM) and an ensemble of decision trees (EnsembleDT). These learners were selected as follows. Firstly, as to our use of SVMs, these are a commonly used method for text mining [43]. An SVM uses a kernel to transform the problem data into a higher dimensional space where it is easier to find a decision boundary between the examples. The SVM models this boundary as a set of support vectors, i.e. the examples of different classes closest to the boundary, and selects the boundary that keeps the margin between the two classes as wide as possible. Depending on the kernel used, SVM training times can be very fast or very slow. For this work, we tried a linear SVM and an SVM with radial basis functions. There was no significant performance delta between them and the linear SVM was much faster. Hence, for this work, we report the linear SVM only.
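As a concrete illustration, a minimal TF-IDF plus linear SVM pipeline in scikit-learn is sketched below; the toy training snippets and default parameters are placeholders, not our exact configuration.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Train on labelled comments from prior projects (toy placeholder data).
comments = ["TODO: fix this hack later", "returns the sum of two numbers"]
labels = [1, 0]  # 1 = SATD, 0 = not SATD

svm = make_pipeline(TfidfVectorizer(), LinearSVC())
svm.fit(comments, labels)

# decision_function gives the signed distance from the boundary,
# which SURVEY0 later reuses as its sorter score (see Section IV-B).
scores = svm.decision_function(["hope everything will work", "adds two numbers"])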

Secondly, as to our use of EnsembleDT, our aim was to extend the results of Huang et al.’s EMSE’18 paper. That work used an ensemble of Naive Bayes classifiers. In that approach:

  • The authors first trained one Naive Bayes Multinomial (NBM) sub-classifier for each training project.

  • These solo classifiers were then consulted as an ensemble, where each solo classifier voted on whether or not some test example was an example of SATD.

  • The output of such an ensemble classifier is the majority vote across all the ensemble members.

To build their system, Huang et al. used the Weka library (written in Java) [44] with its built-in “StringToWordVector” for vectorization and “NaiveBayesMultinomial” for classification. We were unable to find an equivalent vectorizer in Python, so we used the standard TF-IDF vectorizer. We failed to reproduce their results using Scikit-learn’s [45] Naive Bayes Multinomial. But by retaining the ensemble method (as recommended by Huang et al.) and switching the classifier to Decision Trees (DT), we obtained similar results. Thus, for our experiment, we used their framework and data (with two additional projects) but with a modification to the learner (Decision Trees, not Naive Bayes Multinomial).

Decision tree learners recursively split the data such that each split is more uniform than its parent. The attribute used to split the data is selected to minimize the diversity of the data after each split. This is a very fast and efficient machine learning algorithm [46].
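A minimal sketch of that EnsembleDT variant follows, with one decision tree per prior project and a majority vote over the trees; the project data is a toy placeholder and the shared TF-IDF vectorizer is an engineering assumption on our part.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# One sub-classifier per prior project (toy placeholder data).
projects = {
    "ant":    (["fixme: this is a hack", "open the file"],          [1, 0]),
    "jmeter": (["this isn't very solid", "parse the config value"], [1, 0]),
}

vectorizer = TfidfVectorizer().fit(
    [c for comments, _ in projects.values() for c in comments])

trees = []
for comments, labels in projects.values():
    trees.append(DecisionTreeClassifier().fit(vectorizer.transform(comments), labels))

def ensemble_votes(comments):
    """Number of sub-classifiers voting 'SATD' for each comment."""
    X = vectorizer.transform(comments)
    return np.sum([t.predict(X) for t in trees], axis=0)

votes = ensemble_votes(["probably a bug", "returns the count"])
is_satd = votes > len(trees) / 2     # majority vote over the ensemble members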

IV-B About the Sorters S

SURVEY0 asks its learners to sort examples by how “interesting” they are. Our two classifiers need different sorters:

  • For EnsembleDT, the sorter counts how many ensemble members vote for SATD.

  • For linear SVMs, a “most interesting” example would be an unlabelled artifact on the SATD side of the decision boundary, and furthest away from that boundary. Hence, the sorter for linear SVMs is “distance from the boundary” (and for this measure, we take the SATD side of the boundary to be positive distance).

Note that when our estimator E needs the probability that an example is technical debt, we reach into these sort orders and use the position of that example with respect to the other examples. Formally, those probabilities are generated by normalizing the sort scores into the range 0..1.
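The two sorters and the score-to-probability normalization can be sketched as follows, reusing the svm pipeline and the trees/vectorizer objects from the sketches above; the min-max normalization is our reading of “normalizing the sort scores into the range 0..1”.

import numpy as np

def sort_scores_svm(svm, comments):
    # Signed distance from the boundary; positive = SATD side, larger = more interesting.
    return svm.decision_function(comments)

def sort_scores_ensemble(trees, vectorizer, comments):
    # Number of ensemble members voting "SATD" for each comment.
    X = vectorizer.transform(comments)
    return np.sum([t.predict(X) for t in trees], axis=0)

def to_probabilities(scores):
    # Normalize sort scores into [0, 1] so the estimator can treat them as probabilities.
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    return np.zeros_like(scores) if hi == lo else (scores - lo) / (hi - lo)

def descending_order(comments, scores):
    # Arrange artifacts in order of descending "bad"-ness (Step 6 of Algorithm 1).
    return [comments[i] for i in np.argsort(scores)[::-1]]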

Fig. 2: Example retrieval curve (project SQL12) using SURVEY0. “Actual” is the retrieval by the human according to the sorter. “Total Estimation” is the output from the estimator. With Target@90, that becomes the “Target@90 Estimation”. This intersects the actual curve at point S, where we stop with 85% recall and 17% cost.

IV-C About the Estimator E

SURVEY0 uses an internal estimator, built using a Logistic Regression (LR) curve. Using this estimator, it is possible to guess how many more interesting examples are left to find.

This estimator is used as follows. First, users specify a target goal; e.g. find 90% of all the technical debt comments. Next, SURVEY0 executes, asking the reader to examine n comments at a time. As this process continues, more and more of the technical debt comments are discovered.

Figure 2 shows a typical growth curve. The dotted blue line shows the evolving estimate. In practice, the estimator often over-estimates how much technical debt has been found. Hence, after reading 17% of the comments, the estimator reports that the target has been reached; i.e. that 90% of the TD has been found (even though the exact figure is 85%, see Figure 2).

Algorithm 2 describes our estimator. The estimator takes two inputs: the probabilities from the sorter (see §IV-B) and the labels. All unlabelled examples are assumed to be “not technical debt” (because, as shown in Table II, actual TD comments are quite rare). A logistic regression model is then trained using the probabilities (from the learners) as the independent variable and the labels as the dependent variable. Using an iterative approach, the labels for the unlabelled data are then predicted and the total number of remaining target-class examples is calculated.

  1. Count the total number of positives in D (all examples with their current labels), say t0.

  2. Train a Logistic Regression model LR using the sorter probabilities as the independent variable and the current labels as the dependent variable;

  3. Use LR to predict the probabilities of U (all the unlabelled data points), say P.

  4. Sort P in decreasing order.

  5. Walk down the sorted list, accumulating a cumulative sum of the probabilities and marking each data point as “seen”. Whenever the sum reaches 1, set the provisional label of the first data point in that run to 1 and the rest to 0, then continue until every data point is marked as “seen”. At the end of this step, U has new provisional labels consisting of 1s and 0s only.

  6. Merge U (the provisional labels of the unlabelled examples) with L (the human-labelled examples) to get a new D.

  7. Count the total number of positives in the new D, say t1.

  8. If t1 ≠ t0, go back to Step 1 with t1 as the new t0.

Algorithm 2 The SURVEY0 estimator E. Here L holds the examples labelled by the human (1 is SATD, 0 is Not-SATD) and U holds the unlabelled examples, all initially marked with 0 (because the dataset is very imbalanced). The algorithm obtains its probabilities from the sorter S described in §IV-B.
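A minimal Python sketch of this estimator follows. The helper name, the convergence guard, and the quantization of Step 5 reflect our reading of the algorithm above; treat it as an illustration, not the authors' released code.

import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_total_satd(sorter_probs, labels, labelled_mask, max_iters=50):
    """Sketch of the Algorithm 2 estimator E.

    sorter_probs  : sorter scores normalized to [0, 1] for ALL examples
    labels        : 1 = SATD, 0 = not-SATD; unlabelled items start at 0
    labelled_mask : True where a human has already supplied the label
    Returns an estimate of how many SATD examples exist in total.
    """
    labels = np.array(labels, dtype=int)
    labelled_mask = np.asarray(labelled_mask, dtype=bool)
    X = np.asarray(sorter_probs, dtype=float).reshape(-1, 1)
    prev_total = -1
    for _ in range(max_iters):
        total = labels.sum()                                # Steps 1 and 7
        if total == prev_total:                             # Step 8: converged
            break
        prev_total = total
        if total in (0, len(labels)):                       # LR needs both classes present
            break
        lr = LogisticRegression().fit(X, labels)            # Step 2
        probs = lr.predict_proba(X[~labelled_mask])[:, 1]   # Step 3
        order = np.argsort(probs)[::-1]                     # Step 4
        provisional = np.zeros(len(probs), dtype=int)
        cum = 0.0
        for i in order:                                     # Step 5: convert the
            cum += probs[i]                                 # probability mass into
            if cum >= 1.0:                                  # whole predicted positives
                provisional[i] = 1
                cum -= 1.0
        labels[~labelled_mask] = provisional                # Step 6: merge with L
    return int(labels.sum())                                # estimated total SATD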

IV-D About the Reader R

SURVEY0 uses a human expert to label examples in the test project. At each iteration, the sorter suggests the n most likely target-class examples from the unlabelled data points and the human labels them one by one.

In this experiment, we implemented an automated human oracle to mimic the behaviour of a human reader. To do that, we kept the actual labels of our test project (labelled by the authors of the data set [7]) as a separate reference set. At each iteration, the oracle looks into the reference set and labels the comment (thus mimicking a human expert).
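A minimal sketch of that automated oracle; the dictionary layout is an assumption, the point being only that labels are replayed from the held-out reference set rather than predicted.

# reference_labels: comment text -> label from the Maldonado & Shihab dataset [7]
reference_labels = {
    "TODO: fix this hack later": "bad",      # has SATD
    "returns the sum of two numbers": "good",
}

def oracle(comment):
    # Mimics a human reader by replaying the label from the held-out reference set.
    return reference_labels[comment]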

V Experimental Materials

V-A Evaluation Metrics

Recall: Our framework is concerned with how much of the target class (SATD) is found within the comments that have been checked. Formally, this is known as recall:

\[ \mathrm{Recall} = \frac{\text{number of SATD comments found by the reader}}{\text{total number of SATD comments in the project}} \tag{1} \]

The larger the recall, the better the retrieval process.

Cost: As our framework has a human involved, we also measure the cost of finding the target class (SATD). For that, we focus on the number of comments read as a ratio of the total number of comments. Thus,

\[ \mathrm{Cost} = \frac{\text{number of comments read by the human}}{\text{total number of comments in the project}} \tag{2} \]

Cost is a measurement of the overall effort needed from the human. The smaller the cost, the better the surveying.
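For example, take the stopping point S of Figure 2 for project SQL12 (7,215 comments, 286 of them SATD, per Table II). The reported 85% recall and 17% cost correspond to roughly 243 SATD comments found after reading about 1,230 comments; these counts are back-calculated from the reported percentages, not measured directly:

\[ \mathrm{Recall} \approx \frac{243}{286} \approx 0.85, \qquad \mathrm{Cost} \approx \frac{1230}{7215} \approx 0.17 \]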

Project Release Description Comments (C) SATD (S) S/C × 100 (%)
Apache Ant 1.7.0 Automating Build 4098 131 3.2
Apache JMeter 2.10 Testing 8057 374 4.64
ArgoUML - UML Diagram 9452 1413 14.95
Columba 1.4 Email Client 6468 204 3.15
EMF 2.4.1 Model Framework 4390 104 2.37
Hibernate Distribution 3.3.2 Object Mapping Tool 2968 472 15.90
jEdit 4.2 Java Text Editor 10322 256 2.48
jFree Chart 1.0.19 Java Framework 4408 209 4.74
jRuby 1.4.0 Ruby for Java 4897 622 12.70
SQL12 - Database 7215 286 3.96
MEDIAN 5683 271 4.77
TABLE II: Dataset Details. In ten projects, the self-admitted technical debt comments are around 5% of all comments.

V-B Data

Table II shows the data used in this study. This data comes from the same source as Huang et al.; i.e. the publicly available dataset from Maldonado and Shihab [7]. This dataset contains ten open source JAVA projects from different application domains, varying in size, in number of developers and, most importantly, in number of comments in the source code. The provided dataset contains project names and classification types (if any) along with the actual comments. Note that our problem is not concerned with the type of SATD; rather, we care about the binary problem of whether a comment is SATD or not. So, we converted the final label into a binary one by treating any comment annotated with a TD type as “SATD=yes” and the rest (i.e. comments without any TD classification) as “SATD=no”.
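A minimal pandas sketch of this relabelling step; the column names and the WITHOUT_CLASSIFICATION tag are assumptions about how the dataset is encoded, used here only for illustration.

import pandas as pd

# Hypothetical layout of the Maldonado & Shihab dataset [7]; column names are assumptions.
df = pd.DataFrame({
    "comment": ["TODO: fix this hack later", "returns the sum of two numbers"],
    "classification": ["DESIGN", "WITHOUT_CLASSIFICATION"],
})

# Collapse the TD types of Alves et al. [48] into a binary SATD flag:
# any annotated TD type becomes 1, everything else becomes 0.
df["satd"] = (df["classification"] != "WITHOUT_CLASSIFICATION").astype(int)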

When creating this dataset, Maldonado et al. [7] used jDeodorant [47], an Eclipse plugin, for extracting comments from the source code of Java files. After that, they applied four filtering heuristics to the comments. A short description of these is given below (for more details, see [7]):

  • Removed license comments, auto-generated comments, etc. because, according to the authors, they do not contain SATD by developers.

  • Removed commented-out source code, as commented-out source code does not contain any SATD.

  • Removed Javadoc comments that do not contain words like “todo”, “fixme”, “xxx”, etc. because, according to the authors, the remaining Javadoc comments rarely contain SATD.

  • Grouped multiple consecutive single-line comments into a single comment, because they all convey a single message and it is easier to consider them as a group.

After applying these filters, the number of comments in each project was reduced significantly (for example, the number of comments in Apache Ant was reduced to 4,098, almost 19% of the original size).

Two of the authors of [7] then manually labelled each comment according to the six different types of TD mentioned by Alves et al. [48]. Note that if those labels were perfect, then SURVEY0 would not be necessary.

V-C Standard Rig

In the following, when we say standard rig we mean a 10-by-10 cross validation study that tries to build a predictor for technical debt, as follows:

  • For i = 1 to 10 projects

    • test = project[i]

    • train = projects - test

    • 10 times repeat

      • Generate a new random number as seed.

      • Apply the classifier C.

        • For ensembles, we generate n-1 decision trees using the seed (learning from 90% of the training data, selected at random).

        • For SVM, we shuffle the data using the seed.

      • Apply SURVEY0, with n = 100, stopping at some target recall (usually, 90% recall for SATD).

Note also, when generating the estimator, we shuffle the data using the seed for building the logistic regression model.
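A sketch of the standard rig as a Python function; build_classifier and run_survey0 are placeholder callables standing in for the components of Section IV, so this shows the cross-project loop only, not a complete implementation.

import random

def standard_rig(projects, build_classifier, run_survey0, target_recall=0.90, repeats=10):
    """10-by-10 cross-project rig: hold out one project, train on the remaining nine.

    projects         : dict mapping project name -> its comments and labels
    build_classifier : callable(train_projects, seed) -> classifier (EnsembleDT or SVM)
    run_survey0      : callable(classifier, test_project, ...) -> (recall, cost)
    """
    results = []
    for test_name in projects:
        train = {k: v for k, v in projects.items() if k != test_name}
        for _ in range(repeats):
            seed = random.randint(0, 2**31 - 1)                    # fresh seed per repeat
            clf = build_classifier(train, seed)                    # learn from 9 projects
            recall, cost = run_survey0(clf, projects[test_name],   # survey the 10th
                                       n=100, target=target_recall, seed=seed)
            results.append((test_name, recall, cost))
    return results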

Projects Recall 100-Recall
jruby 88 12
argouml 87 13
columba 87 13
jmeter 73 27
hibernate 72 28
jfreechart 46 54
emf 33 67
sql12 54 46
ant 24 76
jedit 21 79
MEDIAN 63 37
TABLE III: RQ1 results: Agreement between SURVEY0’s labels and those generated by other methods. Sorted by 100-recall. In this table, the smaller the values in the right-hand-side column, the larger the agreement between SURVEY0’s labels and other labels.

VI Results

The experimental materials described above were used to answer the research questions from the introduction.

VI-A RQ1: Is surveying necessary?

Table III reports the levels of (dis)agreement seen between the labels found after rechecking and revising (using SURVEY0) and the labels in the original data. This data was generated using our standard rig:

  • In that table, we measure disagreement as 100-recall.

  • A disagreement of 0% indicates that the labels found via SURVEY0 are the same as in the original data sets.

  • Note that the disagreements are quite large and range from 36% to 79% (median to max). That is:

Conclusion #1: Surveying is required to resolve disagreement about labels. When discussing these results with colleagues, they commented “does that not mean that SURVEY0 is just getting it wrong all the time?”. We would argue against that interpretation. As shown below, surveying improves classification predictions, so whatever SURVEY0 is doing, it is also improving the correspondence between the labels and the target concept.

Now suppose the reader is unconvinced by the last paragraph and wants to check whose labels are correct:

  • The pre-existing labels?

  • Or the labels generated by SURVEY0?

At that point the reader would encounter the “ground truth” problem. That is, to assess which labels are correct, the reader would need some “correct” labels. After some reflection (and a review of Table I), the reader might realize that finding that correct set of labels can be very costly; so much so that they would like some intelligent assistant to help them label the data in a cost-effective manner. That is, they need some tool like SURVEY0.

This is not a fanciful scenario. We envisage that once tools like SURVEY0 become widespread, informal “labelling collectives” will emerge between collaborating research groups. Data sets would be passed between research groups, each one checking the labels of the other. If the level of disagreement on the next round of labelling falls below a community-decided level of acceptability, then that data could move on to be used in research papers.

VI-B RQ2: Is SURVEY0 useful?

Figure 3 and Figure 4 show the recalls and costs achieved from the standard rig (when the target goal is 90%). In those figures:

  • The EnsembleDT results are the closest we can come to reproducing the methods of Huang et al. from EMSE’18. In these results, a classifier is learned from nine projects, then applied to the tenth. These results make no use of SURVEY0; i.e. here, there is no label review or revision.

  • The other plots come from SURVEY0 using either EnsembleDT or Linear SVM as the learner.

We observe that:

  • The two sets of treatments have median recalls of 82.5% and 62%, respectively.

  • That is, the treatments using SURVEY0 perform much better than those that do not.

That is: Conclusion #2: SURVEY0 improved quality predictions. As to which classifier we would recommend, Figure 3 reports that, in terms of recall, both Linear SVM and EnsembleDT perform equally well. However, Figure 4 reports that Linear SVM has a much lower associated cost; i.e. it can find the technical debt comments much faster than EnsembleDT.

Based on these results, we recommend SURVEY0 with a Linear SVM classifier.

Fig. 3: Recall of SURVEY0, SURVEY-EnsembleDT and recall of Ensemble DT without SURVEYing (from RQ1)
Fig. 4: Cost of SURVEY0 and SURVEY-EnsembleDT

VI-C RQ3: Is SURVEY0 comparable to the state-of-the-art for human-in-the-loop AI?

Certain information retrieval methods offer stopping criteria for when to halt exploring new data. Here, we assess two such state-of-the-art approaches, developed for assisting Systematic Literature Reviews.

Those stopping methods require certain tuning parameters, which we set by “cheating”; i.e. manually tuning them using our test data. That is, we gave the information retrieval methods an undue advantage over SURVEY0.

Ros et al. [39] suggest that, if no target-class example is found in the last T consecutively seen examples (if each iteration offers n examples, this amounts to T/n iterations), then we should stop. Ros et al. proposed a default value for T but, after “cheating”, we found that a different setting worked better (i.e. obtained higher recalls at minimum cost).

Cormack et al. [40] find the knee in the current retrieval curve at each iteration and, if the ratio between the slope before the knee and the slope after it is greater than a predefined threshold ρ, then they stop. Note that the knee can be found using the Kneedle algorithm [49]. Cormack et al. proposed a default value for ρ but, after “cheating”, we found that a different value was better.
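For reference, simplified sketches of the two baseline stopping rules follow. The thresholds and the average-slope-ratio knee test are our simplified reading of Ros et al. [39] and Cormack et al. [40], not their exact implementations, and the default values below are illustrative rather than the tuned settings used in Figure 5.

def ros_stop(labels_so_far, batch_size=100, patience_batches=2):
    """Ros et al. [39]-style rule: stop after T consecutive examples with no target class."""
    t = batch_size * patience_batches
    tail = labels_so_far[-t:]                      # most recent t labels (1 = SATD, 0 = not)
    return len(tail) == t and not any(tail)

def cormack_stop(found_curve, rho=6.0):
    """Cormack & Grossman [40]-style rule: stop once the retrieval curve has a sharp knee,
    i.e. the average slope before some point is rho times steeper than the slope after it."""
    n = len(found_curve)                           # cumulative SATD found vs. comments read
    for k in range(2, n - 1):
        before = (found_curve[k] - found_curve[0]) / k
        after = (found_curve[-1] - found_curve[k]) / (n - 1 - k)
        if after == 0 and before > 0:
            return True
        if after > 0 and before / after >= rho:
            return True
    return False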

Fig. 5: Comparison between optimized state-of-the-art human-in-the-loop frameworks with SURVEY0

We compared these two baselines with our standard rig. As we can see from Figure 5, even after letting the other methods “cheat” (i.e. manually tuning those methods using the test data), SURVEY0 wins on 6 projects out of 9 and has an overall recall almost as good as the “cheating” results. Thus, we say:

Conclusion #3: SURVEY0 made its predictions at a near optimum rate.

RQ4: How soon can SURVEY0 learn quality predictors? As we know from RQ2 and RQ3, SURVEY0 has a high recall; that is, when looking for SATD, it finds most of it. But in order to do that, SURVEY0 needs a human expert to read through the comments suggested by the classifier and sorter. Thus, a core part of SURVEY0 is ensuring that the cost of reading is minimized by learning when to stop.

Our experiments with SURVEY0 at Target@90 recall show that, after reading only 16% (median) of the comments, SURVEY0 stops while finding 83% (median) of the SATD. This 16% cost has an IQR (inter-quartile range; i.e. the (75-25)th percentile) of 5% across projects, implying that the cost is nearly the same for all ten projects. Hence, we say: Conclusion #4: SURVEY0 can be recommended as a way to reduce the labelling effort.

RQ5: Can SURVEY0 find more issues?

In the above experiments, we set the target goal to be 90% recall. Here, we report what happens when we seek to find more technical debt; i.e. when we set the target recall to 95%.

Table IV shows the results. As before, if we set the target to X%, we achieve a performance level slightly less than X (the median recalls achieved when the target was 90% or 95% were 83% and 89%, respectively).

We also see that increasing the target recall by just 5% (from 90 to 95) nearly doubles the cost of finding the technical debt (from reading 16% of the comments to 29%). We make no comment here on whether or not it is worth increasing the cost in this way. All we say is that, if required, our methods can be used to tune how much work is done to reach some desired level of recall. That is:

Projects Target@90 Target@95
Name Recall Cost Recall Cost
ant 85 21 87 35
jmeter 89 15 93 38
argouml 79 15 90 21
columba 98 16 98 33
emf 78 32 83 41
hibernate 78 15 85 21
jedit 80 32 86 44
jfreechart 61 19 66 20
jruby 85 13 94 19
sql12 85 17 91 24
MEDIAN 83 16 89 29
IQR 7 5 7 16
TABLE IV: RQ5: Cost Effective SURVEY0 for Target@90 and Target@95.

Conclusion #5: SURVEY0 can be used to advise on how much more work is required to achieve some additional desired level of quality assurance.

VI-D RQ6: How much does SURVEY0 delay human readers?

SURVEY0 needs a human expert in the loop to classify the most promising data points. To that end, SURVEY0 offers n examples at each iteration, before estimating the remaining target class (here, SATD).

This estimation process has its own overhead. According to Maldonado et al., each example needs approximately 10.3 seconds to classify. So, if n examples are offered at each iteration, then the human expert will need approximately n × 10.3 seconds to finish reading. If the estimation process takes longer than this, then the human expert becomes unproductive while waiting for the next iteration. After experimenting, we see that, on average, the estimation process takes 30 seconds. On the other hand, for n = 100, the human reader will need around 1,030 seconds, or approximately 20 minutes. The overhead of each iteration is therefore only about 3% (see the arithmetic below). To put that another way: Conclusion #6: SURVEY0 imposed negligible overhead (i.e. less than 5%) on the activity of human experts.
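For completeness, the overhead arithmetic behind Conclusion #6, assuming n = 100 comments per iteration and 10.3 seconds of human reading per comment [7]:

\[ \text{overhead} = \frac{\text{machine time per iteration}}{\text{human time per iteration}} = \frac{30\ \text{s}}{100 \times 10.3\ \text{s}} \approx 0.03 \quad (\text{about } 3\%, \text{ well under } 5\%) \]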

VII Threats to Validity

Model Bias: One internal threat to validity is our bias in classifier selection and stopping rule selection. We experimented with a wide variety of state-of-the-art classifiers used in text mining as rankers while building SURVEY0 and found SVM to be the best. Yet there are other, more advanced and complex classifiers (such as LSTMs) that we did not use in our selection, because of the simplicity of our dataset and because no prior work has used them to classify SATD. We also intentionally avoided a few stopping rules as baselines (such as Wallace [50]), since previous research showed [51] that our baselines are significantly better. Nevertheless, we are aware that our model selection is not comprehensive and could be explored further in future research.

Evaluation Bias: We have reported Recall and Cost as our overall measures. We repeated each experiment ten times and reported only the median values to minimize any bias due to randomness. We understand that these quality measures are not comprehensive and there might be other quality measures used in software engineering that would give a more comprehensive summary of our findings. A more comprehensive analysis using other measures is left for future work.

Sample Bias: The dataset was provided by authors Maldonado and Shihab [7]. Other data might lead to other conclusions. Clearly, this work needs to be repeated on other data.

VIII Discussion

In this work, we have studied the comments of ten open source projects developed in JAVA. Our work shows that, with minimal cost, we can identify self-admitted technical debt using a combination of AI and human effort. There are several ways to extend the current work.

  • Feature Selection and Vectorization: Recent work shows that feature selection can improve the overall classification of SATD [1]. More recent work also implies that word embedding models such as word2vec are promising for identifying SATD [24]. We believe our framework could also improve significantly after proper feature selection and vectorization. We initially did some feature selection, but a more rigorous experiment must be done in this regard in the future.

  • New Datasets: Our work is confined to Java projects and open source projects. We want to develop new datasets to generalize our findings and possibly discover new facts along the way.

  • Metrics: There are other goal metrics to explore. For example, measuring cost in terms of time or person-hours might be a better quality measure.

  • Results: According to our experiments, we can find 83% of the SATD while reading only 16% of the data (both medians). We hypothesize that these results can be improved using hyperparameter optimization. The only drawback is the run-time of such tuning. In future work, we will try to find improved results using hyperparameter tuning.

IX Conclusion

Technical debt is a metaphor that describes quick and dirty workarounds adopted for immediate gain. This is an intentional practice, and developers often leave comments indicating that their work is sub-optimal. Although this phenomenon may be unavoidable in practice, research shows that the long term impact of these practices is dire. Thus, identifying technical debt is a major concern for the industry. This work has explored methods for building a technical debt predictor at minimal cost.

The methods used here to reduce the cost of building technical debt predictors generalize to any human-in-the-loop process where some subject matter expert is required to read and label a large corpus. Such work can be time-consuming, tedious, and error-prone. Our work is a response to that. We offer a complete framework in which a human is guided by an AI to label artifacts with minimal effort. At least for the data studied here, we can find 83% (median) of the artifacts of interest by reading only 16% (median) of the artifacts. Examining the possible implications on larger datasets, with better estimators and well-tuned parameters, will open interesting possibilities in the future.

References

  • [1] Q. Huang, E. Shihab, X. Xia, D. Lo, and S. Li, “Identifying self-admitted technical debt in open source projects using text mining,” Empirical Software Engineering, vol. 23, no. 1, pp. 418–451, 2018.
  • [2] W. Cunningham, “The wycash portfolio management system,” ACM SIGPLAN OOPS Messenger, vol. 4, no. 2, pp. 29–30, 1993.
  • [3] Y. Guo, C. Seaman, R. Gomes, A. Cavalcanti, G. Tonin, F. Q. Da Silva, A. L. Santos, and C. Siebra, “Tracking technical debt—an exploratory case study,” in 2011 27th IEEE International Conference on Software Maintenance (ICSM).   IEEE, 2011, pp. 528–531.
  • [4] A. Nugroho, J. Visser, and T. Kuipers, “An empirical model of technical debt and interest,” in Proceedings of the 2nd Workshop on Managing Technical Debt.   ACM, 2011, pp. 1–8.
  • [5] L. Hatton, “Testing the value of checklists in code inspections,” IEEE software, vol. 25, no. 4, 2008.
  • [6] I. Ozkaya, R. L. Nord, and P. Kruchten, “Technical debt: From metaphor to theory and practice,” IEEE Software, vol. 29, no. 06, pp. 18–21, nov 2012.
  • [7] E. d. S. Maldonado and E. Shihab, “Detecting and quantifying different types of self-admitted technical debt,” in 2015 IEEE 7th International Workshop on Managing Technical Debt (MTD).   IEEE, 2015, pp. 9–15.
  • [8] E. Lim, N. Taksande, and C. Seaman, “A balancing act: What software practitioners have to say about technical debt,” IEEE software, vol. 29, no. 6, pp. 22–27, 2012.
  • [9] S. Wehaibi, E. Shihab, and L. Guerrouj, “Examining the impact of self-admitted technical debt on software quality,” in 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), vol. 1.   IEEE, 2016, pp. 179–188.
  • [10] A. Martini and J. Bosch, “The danger of architectural technical debt: Contagious debt and vicious circles,” in 2015 12th Working IEEE/IFIP Conference on Software Architecture.   IEEE, 2015, pp. 1–10.
  • [11] R. Marinescu, G. Ganea, and I. Verebi, “Incode: Continuous quality assessment and improvement,” in 2010 14th European Conference on Software Maintenance and Reengineering.   IEEE, 2010, pp. 274–275.
  • [12] R. Marinescu, “Detection strategies: Metrics-based rules for detecting design flaws,” in 20th IEEE International Conference on Software Maintenance, 2004. Proceedings.   IEEE, 2004, pp. 350–359.
  • [13] ——, “Assessing technical debt by identifying design flaws in software systems,” IBM Journal of Research and Development, vol. 56, no. 5, pp. 9–1, 2012.
  • [14] N. Zazworka, R. O. Spínola, A. Vetro, F. Shull, and C. Seaman, “A case study on effectively identifying technical debt,” in Proceedings of the 17th International Conference on Evaluation and Assessment in Software Engineering.   ACM, 2013, pp. 42–47.
  • [15] F. A. Fontana, V. Ferme, and S. Spinelli, “Investigating the impact of code smells debt on quality code evaluation,” in Proceedings of the Third International Workshop on Managing Technical Debt.   IEEE Press, 2012, pp. 15–22.
  • [16] N. Tsantalis and A. Chatzigeorgiou, “Identification of extract method refactoring opportunities for the decomposition of methods,” Journal of Systems and Software, vol. 84, no. 10, pp. 1757–1782, 2011.
  • [17] N. Tsantalis, D. Mazinanian, and G. P. Krishnan, “Assessing the refactorability of software clones,” IEEE Transactions on Software Engineering, vol. 41, no. 11, pp. 1055–1090, 2015.
  • [18] J. Graf, “Speeding up context-, object-and field-sensitive sdg generation,” in 2010 10th IEEE Working Conference on Source Code Analysis and Manipulation.   IEEE, 2010, pp. 105–114.
  • [19] K. Ali and O. Lhoták, “Application-only call graph construction,” in European Conference on Object-Oriented Programming.   Springer, 2012, pp. 688–712.
  • [20] A. Potdar and E. Shihab, “An exploratory study on self-admitted technical debt,” in 2014 IEEE International Conference on Software Maintenance and Evolution.   IEEE, 2014, pp. 91–100.
  • [21] M. A. de Freitas Farias, M. G. de Mendonça Neto, A. B. da Silva, and R. O. Spínola, “A contextualized vocabulary model for identifying technical debt on code comments,” in 2015 IEEE 7th International Workshop on Managing Technical Debt (MTD).   IEEE, 2015, pp. 25–32.
  • [22] E. d. S. Maldonado, R. Abdalkareem, E. Shihab, and A. Serebrenik, “An empirical study on the removal of self-admitted technical debt,” in 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME).   IEEE, 2017, pp. 238–248.
  • [23] M. Yan, X. Xia, E. Shihab, D. Lo, J. Yin, and X. Yang, “Automating change-level self-admitted technical debt determination,” IEEE Transactions on Software Engineering, 2018.
  • [24] J. Flisar and V. Podgorelec, “Enhanced feature selection using word embeddings for self-admitted technical debt identification,” in 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA).   IEEE, 2018, pp. 230–233.
  • [25] L. Tan, D. Yuan, G. Krishna, and Y. Zhou, “/* icomment: Bugs or bad comments?*,” in ACM SIGOPS Operating Systems Review, vol. 41, no. 6.   ACM, 2007, pp. 145–158.
  • [26] S. H. Tan, D. Marinov, L. Tan, and G. T. Leavens, “@ tcomment: Testing javadoc comments to detect comment-code inconsistencies,” in 2012 IEEE Fifth International Conference on Software Testing, Verification and Validation.   IEEE, 2012, pp. 260–269.
  • [27] N. Khamis, R. Witte, and J. Rilling, “Automatic quality assessment of source code comments: the javadocminer,” in International Conference on Application of Natural Language to Information Systems.   Springer, 2010, pp. 68–79.
  • [28] D. Steidl, B. Hummel, and E. Juergens, “Quality analysis of source code comments,” in 2013 21st International Conference on Program Comprehension (ICPC).   Ieee, 2013, pp. 83–92.
  • [29] H. Malik, I. Chowdhury, H.-M. Tsou, Z. M. Jiang, and A. E. Hassan, “Understanding the rationale for updating a function’s comment,” in 2008 IEEE International Conference on Software Maintenance.   IEEE, 2008, pp. 167–176.
  • [30] B. Fluri, M. Wursch, and H. C. Gall, “Do code and comments co-evolve? on the relation between source code and comment changes,” in 14th Working Conference on Reverse Engineering (WCRE 2007).   IEEE, 2007, pp. 70–79.
  • [31] E. da Silva Maldonado, E. Shihab, and N. Tsantalis, “Using natural language processing to automatically detect self-admitted technical debt,” IEEE Transactions on Software Engineering, vol. 43, no. 11, pp. 1044–1062, 2017.
  • [32] Z. Liu, Q. Huang, X. Xia, E. Shihab, D. Lo, and S. Li, “Satd detector: a text-mining-based self-admitted technical debt detection tool,” in Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings.   ACM, 2018, pp. 9–12.
  • [33] M. Jureczko and L. Madeyski, “Towards identifying software project clusters with regard to defect prediction,” in Proceedings of the 6th International Conference on Predictive Models in Software Engineering, ser. PROMISE ’10.   New York, NY, USA: ACM, 2010, pp. 9:1–9:10. [Online]. Available: http://doi.acm.org/10.1145/1868328.1868342
  • [34] D. Chen, K. T. Stolee, and T. Menzies, “Replication can improve prior results: A github study of pull request acceptance,” ICPC’19, 2019.
  • [35] J. Wang, M. Li, S. Wang, T. Menzies, and Q. Wang, “Images don’t lie: Duplicate crowdtesting reports detection with screenshot information,” Information & Software Technology, vol. 110, pp. 139–155, 2019. [Online]. Available: https://doi.org/10.1016/j.infsof.2019.03.003
  • [36] J. Wang, Y. Yang, Z. Yu, T. Menzies, and Q. Wang, “Characterizing crowds to better optimize worker recommendation in crowdsourced testing,” TSE’19 (to appear), 2019.
  • [37] G. V. Cormack and M. R. Grossman, “Evaluation of machine-learning protocols for technology-assisted review in electronic discovery,” in Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval.   ACM, 2014, pp. 153–162.
  • [38] N. Abe and H. Mamitsuka, “Query learning strategies using boosting and bagging,” in Proceedings of the Fifteenth International Conference on Machine Learning, ser. ICML ’98.   San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1998, pp. 1–9. [Online]. Available: http://dl.acm.org/citation.cfm?id=645527.657478
  • [39] R. Ros, E. Bjarnason, and P. Runeson, “A machine learning approach for semi-automated search and selection in literature studies,” in Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering.   ACM, 2017, pp. 118–127.
  • [40] G. V. Cormack and M. R. Grossman, “Engineering quality and reliability in technology-assisted review,” in Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval.   ACM, 2016, pp. 75–84.
  • [41] G. Gay, S. Haiduc, A. Marcus, and T. Menzies, “On the use of relevance feedback in ir-based concept location,” in 2009 IEEE International Conference on Software Maintenance, 2009, pp. 351–360.
  • [42] Z. Yu, N. A. Kraft, and T. Menzies, “Finding better active learners for faster literature reviews,” Empirical Software Engineering, vol. 23, no. 6, pp. 3161–3186, 2018.
  • [43] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vector machines,” IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18–28, 1998.
  • [44] I. H. Witten, E. Frank, L. E. Trigg, M. A. Hall, G. Holmes, and S. J. Cunningham, “Weka: Practical machine learning tools and techniques with java implementations,” 1999.
  • [45] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
  • [46] S. R. Safavian and D. Landgrebe, “A survey of decision tree classifier methodology,” IEEE transactions on systems, man, and cybernetics, vol. 21, no. 3, pp. 660–674, 1991.
  • [47] M. Fokaefs, N. Tsantalis, E. Stroulia, and A. Chatzigeorgiou, “Jdeodorant: identification and application of extract class refactorings,” in 2011 33rd International Conference on Software Engineering (ICSE).   IEEE, 2011, pp. 1037–1039.
  • [48] N. S. Alves, L. F. Ribeiro, V. Caires, T. S. Mendes, and R. O. Spínola, “Towards an ontology of terms on technical debt,” in 2014 Sixth International Workshop on Managing Technical Debt.   IEEE, 2014, pp. 1–7.
  • [49] V. Satopaa, J. Albrecht, D. Irwin, and B. Raghavan, “Finding a” kneedle” in a haystack: Detecting knee points in system behavior,” in 2011 31st International Conference on Distributed Computing Systems Workshops.   IEEE, 2011, pp. 166–171.
  • [50] B. C. Wallace, I. J. Dahabreh, K. H. Moran, C. E. Brodley, and T. A. Trikalinos, “Active literature discovery for scoping evidence reviews: How many needles are there,” in KDD workshop on data mining for healthcare (KDD-DMH), 2013.
  • [51] Z. Yu and T. Menzies, “: Better automated support for finding relevant se research papers,” arXiv preprint arXiv:1705.05420, 2017.