Estimating defectiveness of source code: A predictive model using GitHub content

03/21/2018 ∙ by Ritu Kapur, et al. ∙ Indian Institute of Technology Ropar

Two key contributions presented in this paper are: i) a method for building a dataset containing source code features extracted from source files taken from Open Source Software (OSS) and associated bug reports, and ii) a predictive model for estimating the defectiveness of a given source code. These artifacts can be useful for building tools and techniques pertaining to several automated software engineering areas such as bug localization, code review and recommendation, and program repair. In order to achieve our goal, we first extract coding style information (e.g. related to the programming language constructs used in the source code) for source code files present on GitHub. Then the information available in the bug reports (if any) associated with these source code files is extracted. The un-/semi-structured information thus fetched is then transformed into a structured knowledge base. We considered more than 30400 source code files from 20 different GitHub repositories with about 14950 associated bug reports across 4 bug tracking portals. The source code files considered are written in four programming languages (viz., C, C++, Java, and Python) and belong to different types of applications. A machine learning (ML) model for estimating the defectiveness of a given input source code is then trained using the knowledge base. In order to pick the best ML model, we evaluated 8 different ML algorithms, such as Random Forest, K Nearest Neighbour and SVM, with around 50 parameter configurations to compare their performance on our tasks. One of our findings is that the best K-fold (with k=5) cross-validation results are obtained with the Nu-SVM technique, which gives a mean F1 score of 0.914.




1. Introduction

Maintenance of software, particularly identifying and fixing defects, contributes significantly towards the overall cost (Bennett and Rajlich, 2000; Lientz and Swanson, 1980) of developing and deploying a software application. It has been observed (Deissenboeck and Pizka, 2006; Allamanis et al., 2014) that the coding practices adopted by programmers greatly influence the quality of software. We believe that an in-depth comprehension of how programming styles influence the defectiveness of software can help in avoiding software defects. The availability of a large volume of source code from Open Source Software (OSS) and its associated defect reports presents a ripe opportunity to apply Artificial Intelligence (AI) techniques to build a system for estimating the presence of defects in a given source code.

One of our important goals is to be able to determine the defectiveness of a source code by considering the programming style used in the code. In other words, a broader question (and aim for us) is to check whether there exists any correlation between programming style and software quality. We believe that among two software applications developed using similar processes, the one having fewer reported defects is more likely to be of better quality than the one which has more defects reported against it. Defectiveness for us includes two things: i) the likelihood of finding defects of different types and severity, expressed on the scale {Likely-to-be-defective, Unpredictable}, and ii) the properties of potential defects that can be present in a given source file. Such properties include: type, severity, project phase in which reported, etc.

Broadly speaking, our system can answer two questions for an input source file: whether it is likely to be defective, and if yes then what are the properties of likely defects. We use a binary classification methodology while answering these questions.

Leveraging information available on OSS projects
OSS repositories like GitHub, SourceForge etc. provide access to a large set of source code artifacts and their related information (e.g. the number of developers involved, the number of bugs associated with a project, the people reporting the bugs etc.) (Allamanis and Sutton, 2013). Similarly, the bugs reported against various OSS software are usually tracked and made available through the bug tracking portals (Ye et al., 2014) of major OSS foundations such as Apache and Eclipse. The availability of such a rich set of raw data offers a ripe opportunity to extract and exploit the latent knowledge present in it.

One possible direction for extracting and exploiting such latent knowledge involves examining the programming style features of the source code. Typically, a software engineer makes a variety of programming choices during the construction of a software application. For instance, some of the key choices include:

  • Choosing a particular programming language (e.g. Java over C#).

  • Choosing a particular programming construct (e.g. an if statement over a switch statement) to be used in a scenario.

  • Choosing a particular form of identifier name (e.g. a long but descriptive identifier name or a short but ambiguous one (Lawrie et al., 2006)).

  • Program design decisions (Allamanis et al., 2014).

All such choices that a programmer makes define the programming style of that programmer. The programming style used in a software system influences several of its quality attributes, such as readability, portability, ease-of-learning (Malheiros et al., 2012), reliability and maintainability (Allamanis et al., 2014). For instance, the choice of concise and consistent naming conventions for identifiers has been reported to result in better quality software (Deissenboeck and Pizka, 2006).

In our approach we have exploited the relationship that exists between the programming style and the defects associated with a source code. The basic idea of our approach is as follows (also depicted in Figure 1):

  • We first extract the programming style information (shallow knowledge) associated with source files taken from various OSS repositories. The shallow knowledge is stored in a relational database.

  • Refined knowledge is then created by first extracting a subset of the shallow knowledge (created in the previous step) and then applying a couple of transformations on this knowledge to avoid dataset bias, as discussed in Section 2.1.

  • Using the dataset created, we then build a Machine Learning (ML) model, as shown in Figure 5, which estimates the defectiveness associated with an input source file. An exhaustive comparison of the results obtained corresponding to various ML techniques used for prediction is presented in Figure 6-10.

The software artifacts thus created can be further used to build tools and techniques pertaining to several automated software engineering areas such as bug localization (Zhou et al., 2012; Hindle et al., 2012), code review and recommendation (Malheiros et al., 2012; Hindle et al., 2012) and program repair (Kim et al., 2013a). The rest of the paper is organized as follows. Section 2 describes the details of the proposed solution. We present the details of our implementation in Section 3, and in Section 4 we discuss the results and analysis of our experiments. Section 5 describes the related work. Conclusions drawn from our work are presented in Section 6.

2. Proposed solution

Figure 1. Outline of the proposed approach

A broad outline of the proposed solution is shown in Figure 1. The Predictive Model, shown as the central entity in Figure 1, estimates the defectiveness associated with an input source file. A predictive model is typically trained using pertinent information about the intended inputs. Thus, for our case a dataset is built by extracting useful knowledge from source code files and their corresponding defect reports. We achieve this by extracting information about the programming styles used and the corresponding bugs, if any were reported, from various OSS repositories and bug tracking portals. Thus the entire work can be split into two major tasks: 1) fetching raw data and creating a structured dataset, 2) building the predictive model to estimate the defectiveness of a given input source code.

2.1. Creating a structured dataset

It has been shown (Deissenboeck and Pizka, 2006; Allamanis et al., 2014) that programming style significantly influences the quality of software. Based on these results we decided to use programming style information as one of the primary features for training our prediction models (discussed shortly). Thus, we extracted programming style information by processing a corpus of source code files written in different programming languages. We also extracted relevant information from the available bug reports corresponding to the source code files that we considered. The main steps involved are as follows:

  1. Identify suitable source code repositories and bug tracking portals: Code repositories such as GitHub, SourceForge etc. contain a large number of open source projects. Similarly, the bug tracking portals of major OSS foundations such as Apache and Eclipse track bugs corresponding to such OSS projects. Several OSS projects on GitHub have sufficiently many source files with at least one bug reported against them. Therefore we took the bulk of the raw content (source files) from GitHub for building our dataset.

  2. Identify suitable OSS repositories: Java, Python, C++ and C appear to be the most used programming languages (per the StackOverflow Professional Developer Survey 2017). Thus we considered mainly those OSS repositories that have source code files written in one of these programming languages.

  3. Establishing the mapping between the bug reports and the source files present in OSS repositories: In order to find the source files associated with bug reports, we utilize the summary and the patch field of the bug reports. The patch information associated with the bug reports mostly contains at least one mention of the associated source file. We use this information to establish a mapping between the source files present in the OSS repositories and the bugs reported corresponding to them as explained in Figure 3.

  4. Feature extraction technique: Raw source files need to be processed in order to extract the relevant programming style attributes/features present in each. For this, a custom lexer/parser was built using ANTLR (ANother Tool for Language Recognition). Given a grammar file as input, ANTLR generates a parse tree listener class which contains methods for handling AST tokens as they are encountered during parsing of input source code conforming to the grammar.

    We overrode the relevant methods of the parse tree listener to extract the desired features from the input source code. The features were computed using statistical measures such as average, min., max., and standard deviation over values such as the counts, lengths and depths of different programming constructs, as enumerated in Table 3.


  5. Extract relevant un/semi- structured information (shallow knowledge): The statistical measures captured in the previous step are aggregated at various levels (viz., function level, class level and file level) and stored in a database as shallow knowledge. The detailed extraction process is depicted in Figure 4. Also, for each source file which is referred to in a bug report we extract information such as priority, status, type and user exposure from that bug report and store that information in the database.

  6. Refine the shallow knowledge:

    Refining the shallow knowledge is necessary for building a training dataset for a classifier. Since the dataset used for training the ML models should not be biased by factors such as a particular language, file length or a particular feature, we normalize the dataset. The first step of normalization involves filtering only those source files (across programming languages) which are of similar size. This was done to limit the bias due to the size of a source file. Further, the absolute values of various features extracted from a source file were divided by the file length. Similarly, the bias towards individual features was removed by using the MinMaxScaler function of Scikit-Learn. The result of this step is our final dataset.
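Step 3 above can be sketched as a scan of each bug report's patch text for source-file names. The following is a minimal, hypothetical sketch: the regex, the four file extensions and the suffix-matching rule are our assumptions, not the authors' exact procedure.

```python
import re

# Hypothetical pattern for paths ending in one of the four studied
# languages' extensions. The character class and extensions are assumptions.
SOURCE_FILE_RE = re.compile(r'[\w./-]+\.(?:c|cpp|java|py)\b')

def files_mentioned(patch_text):
    """Return the set of source-file paths referenced in a patch."""
    return set(SOURCE_FILE_RE.findall(patch_text))

def link_bugs_to_files(bug_reports, repo_files):
    """Map each repository file to the bug ids whose patches mention it.

    bug_reports: dict of bug id -> patch text; repo_files: list of paths."""
    mapping = {}
    for bug_id, patch in bug_reports.items():
        for path in files_mentioned(patch):
            name = path.split('/')[-1]
            for f in repo_files:
                if f.endswith(name):
                    mapping.setdefault(f, []).append(bug_id)
    return mapping
```

In practice the summary field of a bug report would be scanned the same way when the patch is absent.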
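The statistical measures of step 4 (the Table 3 metrics) can be sketched in a few lines, assuming the parse tree listener has already emitted one (construct, depth, length) record per occurrence. The record shape and naming scheme here are illustrative, not the authors' code.

```python
from collections import defaultdict
from statistics import mean, pstdev

def style_metrics(records):
    """Compute Table 3 style metrics.

    records: iterable of (construct, depth, length) tuples, one per
    occurrence of a construct in a source file."""
    by_construct = defaultdict(lambda: {"depths": [], "lengths": []})
    for construct, depth, length in records:
        by_construct[construct]["depths"].append(depth)
        by_construct[construct]["lengths"].append(length)

    features = {}
    for c, vals in by_construct.items():
        features[f"{c}Count"] = len(vals["depths"])
        for name, series in (("Depth", vals["depths"]), ("Length", vals["lengths"])):
            # max/min/avg/stdDev per construct, as in Table 3.
            features[f"max{c}{name}"] = max(series)
            features[f"min{c}{name}"] = min(series)
            features[f"avg{c}{name}"] = mean(series)
            features[f"stdDev{c}{name}"] = pstdev(series)
    return features
```

The same aggregation would be repeated at function, class and file level, as described in step 5.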
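The refinement in step 6 amounts to two transformations: dividing absolute feature values by file length, then scaling each feature column to [0, 1] (which is what Scikit-Learn's MinMaxScaler does). A pure-Python sketch of that pipeline, under the assumption that features arrive as equal-length vectors:

```python
def normalize(feature_rows, file_lengths):
    """Sketch of the two-step refinement.

    feature_rows: list of equal-length feature vectors (absolute counts);
    file_lengths: size (e.g. lines of code) of each corresponding file."""
    # Step 1: remove file-size bias by dividing each vector by its file length.
    scaled = [[v / n for v in row] for row, n in zip(feature_rows, file_lengths)]
    # Step 2: per-feature min-max scaling to [0, 1], as MinMaxScaler does.
    cols = list(zip(*scaled))
    result = []
    for row in scaled:
        result.append([
            (v - min(col)) / (max(col) - min(col)) if max(col) > min(col) else 0.0
            for v, col in zip(row, cols)
        ])
    return result
```

With Scikit-Learn available, step 2 would simply be `MinMaxScaler().fit_transform(scaled)`.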

2.1.1. Composition of the dataset

Language | Total file count | Total bug-linked file count | Candidate file count (files of similar length chosen to build the feature matrix for training the ML models)
Table 1. Language wise description of source files present in the dataset

In order to eliminate programming language selection bias we fetched source code written in four different programming languages, viz., C, C++, Java and Python. A total of more than 30,400 source files from 20 different GitHub repositories were considered for creating the dataset. Further, about 14,950 bug reports associated with these source files were extracted from the bug tracking portals of major OSS foundations, viz., Apache, GNU, Eclipse and Python. Tables 1 and 2 describe the composition of the dataset by providing details such as the type of source files, OSS repositories and bug tracking portals chosen while building the dataset.

The information captured in our dataset is stored in a relational database. Partial schema of the database showing main tables comprising our dataset are shown in Figure 2. A brief description of these tables is as follows:

  1. SourceCodeFeatures: It contains the features extracted (as described in Section 2.1) from various source files present in different OSS GitHub repositories.

  2. BugInfo: It contains the relevant meta-data from the bug reports associated with the source files covered by the SourceCodeFeatures table.

  3. SourceFileToBugMapping: It stores the mapping between the source files and the bug reports considered in the dataset formation.

  4. LanguageConstructs: It contains the information about the unique identifiers assigned to various language constructs.

  5. RefinedFeatures: It stores the refined features obtained by transforming the shallow knowledge as explained in step 6 of Section 2.1.
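The five tables above can be sketched as a SQLite schema. Table names follow the text; the column choices are illustrative assumptions, since Figure 2 only shows a partial schema.

```python
import sqlite3

# Sketch of the partial schema from Figure 2. Columns are assumed, not
# taken from the paper.
SCHEMA = """
CREATE TABLE SourceCodeFeatures (file_id INTEGER PRIMARY KEY, repo TEXT,
    path TEXT, language TEXT, loc INTEGER);
CREATE TABLE BugInfo (bug_id INTEGER PRIMARY KEY, portal TEXT,
    priority TEXT, status TEXT, type TEXT, user_exposure INTEGER);
CREATE TABLE SourceFileToBugMapping (
    file_id INTEGER REFERENCES SourceCodeFeatures,
    bug_id INTEGER REFERENCES BugInfo);
CREATE TABLE LanguageConstructs (construct_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE RefinedFeatures (
    file_id INTEGER REFERENCES SourceCodeFeatures,
    construct_id INTEGER REFERENCES LanguageConstructs, value REAL);
"""

def create_dataset_db(path=":memory:"):
    """Create the dataset database and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```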

OSS repository | Total source files | Bug tracking portal | Total reported bugs | Total source files linked to bugs (not every bug report provides information about the involved source file)
ant-master Apache
commons-bcel-trunk
lenya-BRANCH-1-2-X
webdavjedit-master
pengyou-clients-master
gcc-master GCC
org.eclipse.paho.mqtt.python-master Eclipse
paho.mqtt.embedded-c-master
paho.mqtt.java-master
cpython-master Python
pythondotorg-master
Table 2. Details of GitHub repositories and associated bug reports
Software Metric Description
maxXCount maximum number of times an X construct is used in a source code
minXCount minimum number of times an X construct is used in a source code
avgXCount average number of times an X construct is used in a source code
stdDevXCount standard deviation of the number of times an X construct is used in a source code
maxXDepth maximum depth at which an X construct is used in a source code
minXDepth minimum depth at which an X construct is used in a source code
avgXDepth average depth at which an X construct is used in a source code
stdDevXDepth standard deviation of depth at which an X construct is used in a source code
maxXLength maximum length of an X construct used in a source code
minXLength minimum length of an X construct used in a source code
avgXLength average length of an X construct used in a source code
stdDevXLength standard deviation of length of an X construct used in a source code
Table 3. Details of source code features
Figure 2. Partial schema showing main entities of dataset.
Figure 3. Establishing mapping between the source files and bug reports
Figure 4. Dataset Creation
Figure 5. Predictive Model

A copy of the dataset that we used is available at

2.2. Building the predictive model

The next important task in our system is to estimate the defectiveness of an input source code file. For this we have to identify the most accurate ML algorithm, with its parameters tuned, for the problem. We make use of the dataset described in Section 2.1.1 to train different ML models (henceforth referred to as the Predictive Model) as indicated in Table 4. We then select the best performing model based on standard accuracy metric scores (discussed in Section 4).

The Predictive Model estimates the defectiveness of an input source code file in two phases: a) it first classifies the file as likely-to-be-defective or unpredictable, and b) if the input source file is classified as likely-to-be-defective then it labels the file with prominent characteristics of bugs (for instance, the type and severity of the bugs). The second phase thus deals with determining the characteristics of likely bugs. It seeks to answer questions such as: What is the probability that an input file marked as likely-to-be-defective contains bugs:

  1. of a specific priority/severity (e.g. Critical, High, Low, Medium etc.)?

  2. of a specific type (e.g. Enhancement, BugFix etc.)?

  3. that are manifesting on a specific operating system (OS)?

  4. that are manifesting on a specific hardware?

  5. that involve a specific level of user exposure (measured, for instance, via number of comments on the bug reports)?
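Phase-II can thus be seen as one binary classifier per bug characteristic. The sketch below makes that structure concrete; the trivial majority-class "model" is a stand-in for the tuned SVM/RF/Nu-SVM models of Table 4, and all names here are illustrative.

```python
class MajorityClassModel:
    """Placeholder for a per-characteristic classifier: always predicts
    the most frequent training label."""
    def fit(self, X, y):
        self.label = max(set(y), key=y.count)
        return self

    def predict(self, X):
        return [self.label] * len(X)

def train_phase2(features, labels_by_characteristic):
    """Train one binary model per bug characteristic.

    labels_by_characteristic: e.g. {'high_priority': [0, 1, ...],
    'enhancement': [...]}, one label list per question above."""
    return {c: MajorityClassModel().fit(features, y)
            for c, y in labels_by_characteristic.items()}
```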

The corresponding results are shown in Figures 6-10 and discussed in Section 4. Although we have performed the experiments on a subset of bug characteristics, the work can be repeated for other bug characteristics too. Further, it is not difficult to repeat the experiments (we plan to do so in future) for predicting defectiveness along various qualitative scenarios such as:

  • Which programming language is likely to lead to defective programs?

  • Which types of bugs (enhancements, defects etc.) are most likely on a particular OS, hardware or programming language?

3. Implementation details

Key Phase 1 Phase 2
a LSVM(, ‘ovr’)
b LSVM(, ‘ovo’)
c LSVM(, ‘ovr’)
d LSVM(, ‘ovo’)
e SVM(, ‘ovr’, ‘l’) LSVM(, ‘ovr’)
f SVM(, ‘ovo’, ‘l’) LSVM(, ‘ovo’)
g SVM(, ‘ovr’, ‘r’, ) LSM(, ‘ovr’)
h SVM(, ‘ovo’, ‘r’, ) LSM(, ‘ovo’)
i SVM(, ‘ovo’, ‘p’, ) LSM(, ‘ovo’)
j SVM(, ‘ovr’, ‘p’, ) LSM(, ‘ovo’)
k SVM(, ‘ovo’, ‘p’, ) SVM(, ‘ovr’, ‘l’)
l SVM(, ‘ovr’, ‘p’, ) SVM(, ‘ovo’, ‘l’)
m SVM(, ‘ovr’, ‘s’, ) SVM(, ‘ovr’, ‘r’, )
n SVM(, ‘ovo’, ‘s’, ) SVM(, ‘ovo’, ‘r’, )
o RF() SVM(, ‘ovo’, ‘p’, )
p RF() SVM(, ‘ovr’, ‘p’, )
q RF() SVM(, ‘ovo’, ‘p’, )
r RF() SVM(, ‘ovr’, ‘p’, )
s RF() SVM(, ‘ovr’, ‘s’, )
t RF() SVM(, ‘ovo’, ‘s’, )
u NSVM(, ‘l’, ‘ovo’) RF()
v NSVM(, ‘l’, ‘ovr’) RF()
w NSVM(, ‘l’, ‘ovo’) RF()
x NSVM(, ‘l’, ‘ovr’) RF()
y NSVM(, ‘r’, ‘ovo’) RF()
z NSVM(, ‘r’, ‘ovr’) RF()
A NSVM(, ‘r’, ‘ovo’) NSVM(, ‘l’, ‘ovo’)
B NSVM(, ‘r’, ‘ovr’) NSVM(, ‘l’, ‘ovr’)
C NSVM(, ‘r’, ‘ovo’, ) NSVM(, ‘l’, ‘ovo’)
D NSVM(, ‘r’, ‘ovr’, ) NSVM(, ‘l’, ‘ovr’)
E NSVM(, ‘r’, ‘ovo’, ) NSVM(, ‘r’, ‘ovo’, )
F NSVM(, ‘r’, ‘ovr’, ) NSVM(, ‘r’, ‘ovo’, )
G NSVM(, ‘s’, ‘ovo’) NSVM(, ‘ovo’, ‘p’, )
H NSVM(, ‘s’, ‘ovr’) NSVM(, ‘ovr’, ‘p’, )
I NSVM(, ‘s’, ‘ovo’) NSVM(, ‘ovo’, ‘p’, )
J NSVM(, ‘s’, ‘ovr’) NSVM(, ‘ovr’, ‘p’, )
K NSVM(, ‘p’, ‘ovo’) NSVM(, ‘s’, ‘ovo’)
L NSVM(, ‘p’, ‘ovr’) NSVM(, ‘s’, ‘ovr’)
M NSVM(, ‘p’, ‘ovr’) NSVM(, ‘s’, ‘ovo’)
N NSVM(, ‘p’, ‘ovo’) NSVM(, ‘s’, ‘ovr’)
O NSVM(, ‘p’, ‘ovo’, ) Gauss(‘r’, ‘ovo’)
P NSVM(, ‘p’, ‘ovr’, ) Gauss(‘r’, ‘ovr’)
Q NSVM(, ‘p’, ‘ovo’, ) KNN(‘e’)
R NSVM(, ‘p’, ‘ovr’, ) KNN(‘m’)
S Gauss(‘r’, ‘ovo’) MLP
T Gauss(‘r’, ‘ovr’) -
Table 4. ML models used in different phases of the predictive model

To obtain the best prediction results, training and testing for both phases is performed using a variety of relevant ML classification techniques as described in Table 4. They include: Linear SVM (LSVM) (Fan et al., 2008), SVM and Nu-SVM (NSVM) with radial, sigmoid and poly kernels (based on libSVM (Chang and Lin, 2011)), Gaussian Process (Gauss) classifier (Rasmussen, 2004), K Nearest Neighbors (KNN) classifier (Dasarathy, 1991), Random Forest (RF) classifier (Breiman, 2001) and Multi-Layer Perceptron (MLP) classifier (Hinton, 1990). We have used the implementations of these algorithms provided in Scikit-Learn. Tuning of algorithm-specific parameters was performed by carrying out several experiments using different parameter configurations of these ML techniques. A brief description of the pertinent parameters (as mentioned in Table 4) of the different ML algorithms that we tuned is as follows:

For LSVM, of the two input parameters in LSVM(a, b), a refers to the penalty parameter and b represents the method of classification (where b=‘ovr’ refers to the one-vs-rest approach and b=‘ovo’ refers to the one-vs-one approach). For SVM, the first two input parameters remain the same, while the third (say c) represents the kernel (where c=‘l’, ‘r’, ‘p’ and ‘s’ refer to a linear, radial, polynomial and sigmoid kernel respectively) and the fourth represents the degree in the case of a polynomial kernel (‘p’). RF has only one input parameter, which specifies the number of decision trees (the number of estimators). NSVM takes nu as its first parameter, the kernel as the second, the method of classification as the third and gamma as the fourth. For KNN, the ‘e’ input parameter represents the use of Euclidean distance whereas ‘m’ represents the use of Manhattan distance.

3.1. Evaluating effectiveness of the system

To measure the efficacy of our system we compute the standard (Michalski et al., 2013) evaluation metrics of Precision, Recall and F1 score for the prediction models. Literature (Kim et al., 2013b; Wang et al., 2016; Williams and Hollingsworth, 2005) reports the use of these metrics to measure the effectiveness of similar predictive systems. The respective equations are:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 × (Precision × Recall) / (Precision + Recall)

where TP, FP and FN denote true positives, false positives and false negatives respectively.
Since the F1 score captures the effect of both Precision and Recall, we show only the F1 score values and their respective standard deviation (or error) values. The higher the F1 score, the better the prediction accuracy of the model. Further, all the results obtained are validated using k-fold cross validation (with k=5). Thus the evaluation metrics reported are values averaged over all k folds. Results of the experiments are discussed in Section 4.
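The metric computation and its k-fold averaging can be sketched directly from the definitions above. Inputs here are assumed to be per-fold lists of binary (predicted, actual) pairs.

```python
from statistics import mean

def f1_score(pairs):
    """F1 score over binary (predicted, actual) pairs."""
    tp = sum(1 for p, a in pairs if p == a == 1)   # true positives
    fp = sum(1 for p, a in pairs if p == 1 and a == 0)
    fn = sum(1 for p, a in pairs if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def cross_validated_f1(folds):
    """Mean F1 over k folds, as reported in Section 4 (here k = len(folds))."""
    return mean(f1_score(fold) for fold in folds)
```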

4. Results and analysis

A major goal of our experiments was to assess the efficacy of the proposed system. The system can be considered effective if it accurately estimates the defectiveness of input source code. This required us to identify the best ML model which could accurately estimate the defectiveness of an input source code. As such we present and discuss here only the accuracy metrics (please see Section 3.1) achieved by using different ML algorithms on our dataset, when answering prediction questions involving different scenarios. For instance, our aim in the experiments has been to identify the best performing algorithm for:

  • The likely-to-be-defective vs. unpredictable classification task on the input source code.

  • Estimating associated bug characteristics for a likely-to- be-defective labeled input.

Salient observations from our experiments are discussed next.

4.1. Phase-I prediction results

Phase-I performs the task of classifying an input source file as likely-to-be-defective or Unpredictable. Accuracy metrics achieved by different ML models when tested on our dataset are shown in Figure 6. Salient observations are as follows: a) SVM with a radial kernel gives the highest (best) averaged F1 score. b) The SVM classifier with a poly kernel gives the lowest (worst) F1 score on the dataset. The shaded portion in Figure 6 shows points corresponding to the lowest F1 score while the points marked with dark diamonds represent the models that yield the highest F1 score.

These accuracy metrics are obtained for the best of respective ML algorithm configurations. The tuning parameter values used for different scenarios are depicted in Table 4.

Figure 6. Phase 1 results
Figure 7. Prediction Results for bugs with medium user exposure
Figure 8. Prediction Results for bugs associated with Windows operating system
Figure 9. Prediction Results for bugs of type Enhancement
Figure 10. Prediction Results for bugs of high priority

4.2. Phase-II prediction results

Phase-II deals with predicting characteristics of the bugs that are likely to be associated with an input source code file that has been classified as likely-to-be-defective in the previous phase.

Salient observations here are: a) For predicting Medium exposure bugs, the Nu-SVM model outperforms the rest in F1 score. b) The RF model yields the best F1 score for predicting bugs of type Enhancement. c) The RF model also achieves the best F1 score when predicting bugs annotated as highest priority. d) When predicting bugs annotated with a specific OS, the Linear SVM gives the best F1 score.

All these observations are shown in Figure 7-10. The shaded portion in these figures represents the points corresponding to the lowest F1 score. In most cases the shaded portion corresponds to the SVM classifier with a poly kernel setup, implying that it fared the worst on the dataset used. Points marked with dark diamonds on the graphs correspond to the models that give highest F1 score.

5. Related work

The effects of programming styles on the quality of software have been well studied and reported in the research literature. Most such studies can be categorized into two broad areas – quality and maintenance of software. Authors in (Allamanis et al., 2014, 2015; Deissenboeck and Pizka, 2006; Lawrie et al., 2006) have examined in depth the use of various programming styles, e.g., the use of descriptive identifier names for improving software quality.

One of the works that addressed problems similar to ours is (Wang et al., 2016). They propose a learning algorithm to automatically extract the semantic representation associated with a program. They train a Deep Belief Network (DBN) on the token vectors extracted from the Abstract Syntax Tree (AST) representations of various programs. They rely on the trained DBNs to detect differences in the token vectors extracted from an input to predict defects.

Another related contribution is presented by (Kim et al., 2013b). They propose a two-phase recommendation model for suggesting the files to be fixed by analyzing the bug reports associated with them. The recommendation model in its first phase categorizes a file as “predictable” or “unpredictable” depending upon the adequacy of the available content of bug reports. Then, in its second phase, amongst the files categorized as predictable, it recommends the top k files to be fixed. They experimented on a limited set of projects, viz., the “Firefox” and “Core” packages of Mozilla.

An information retrieval technique based bug localization module “BugLocator” is proposed by (Zhou et al., 2012). The BugLocator uses the concept of textual similarity to find a set of bug reports similar to an input bug report, and using the linked source code tries to identify potential bugs related to input. Use of textual similarity for code identification can pose problems because often times a given programming task may be coded in more than one way.

A source code recommender, Mentor, for the use of novice programmers is presented by (Malheiros et al., 2012). Their aim is to help avoid unnecessary diversion of a programmer’s attention from the main tasks. Mentor is based on a Prediction by Pattern Matching (PPM) algorithm. The authors have compared the performance of PPM with LSI on three OSS projects (GTK+, GIMP and Hadoop) and claim PPM to be better. Similarly, (Hindle et al., 2012) apply N-gram language models to infer syntactic and semantic properties of a program so as to provide code completion facilities.

Overall, some of the key gaps that we find in the majority of the existing works that address problems similar to ours can be summarized as follows.

  • In works that use (or propose) source code feature extraction for defect prediction and localization, only limited types of nodes from a program’s AST have been utilized. For instance, node types such as identifier nodes, operator nodes, scope nodes, user-defined data types etc. have not usually been considered. In our work we have captured all such node types as per a language’s grammar.

  • While building feature vectors, existing works consider the mere presence/absence of programming constructs. In our work, however, we also capture additional characteristics implied by/associated with the programming constructs, for example, the “length”, “count” and “depth-of-occurrence” of various constructs.

  • Association of “programming styles” adopted in software with the characteristics of the defects reported against such source code has not been adequately studied in literature.

  • Last but not least, most works reported their results using a very limited volume of source code and bug reports, if any. Our study spans more than 30,400 source code files written in four different programming languages and taken from 20 OSS repositories.

6. Conclusions

Software maintenance tasks consume significant (Bennett and Rajlich, 2000; Lientz and Swanson, 1980) resources during software development. It has been well established (Deissenboeck and Pizka, 2006; Allamanis et al., 2014) that the programming practices followed during construction of a software greatly impact the quality of software. Analysis of the latent features present in source code can thus offer a valuable avenue for building predictive systems for detecting potential defects in software.

In this paper we have proposed a system which leverages the large volume of source code available in OSS projects and their associated bug reports to create a new dataset of program features. We then use this dataset to train a variety of state-of-the-art ML models which can accurately estimate the defectiveness of a given source code. We have used setups of different ML algorithms to examine their prediction accuracy on our dataset. This allowed us to identify the best performing model for predicting the defectiveness of source code under a variety of scenarios. For instance, we have shown that SVM with a radial kernel performs the best for identifying a source file as potentially defective or not. Similarly, the RF model performs best when predicting bugs annotated with highest priority.


  • Allamanis et al. (2014) Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. 2014. Learning natural coding conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 281–293.
  • Allamanis et al. (2015) Miltiadis Allamanis, Earl T Barr, Christian Bird, and Charles Sutton. 2015. Suggesting accurate method and class names. In Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering. ACM, 38–49.
  • Allamanis and Sutton (2013) Miltiadis Allamanis and Charles Sutton. 2013. Mining source code repositories at massive scale using language modeling. In Proceedings of the 10th Working Conference on Mining Software Repositories. IEEE Press, 207–216.
  • Bennett and Rajlich (2000) Keith H Bennett and Václav T Rajlich. 2000. Software maintenance and evolution: a roadmap. In Proceedings of the Conference on the Future of Software Engineering. ACM, 73–87.
  • Breiman (2001) Leo Breiman. 2001. Random forests. Machine Learning 45, 1 (2001), 5–32.
  • Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2, 3 (2011), 27.
  • Dasarathy (1991) Belur V Dasarathy. 1991. Nearest neighbor (NN) norms: NN pattern classification techniques. (1991).
  • Deissenboeck and Pizka (2006) Florian Deissenboeck and Markus Pizka. 2006. Concise and consistent naming. Software Quality Journal 14, 3 (2006), 261–282.
  • Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. Journal of machine learning research 9, Aug (2008), 1871–1874.
  • Hindle et al. (2012) Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 837–847.
  • Hinton (1990) Geoffrey E Hinton. 1990. Connectionist learning procedures. In Machine Learning, Volume III. Elsevier, 555–610.
  • Kim et al. (2013a) Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013a. Automatic patch generation learned from human-written patches. In Proceedings of the 2013 International Conference on Software Engineering. IEEE Press, 802–811.
  • Kim et al. (2013b) Dongsun Kim, Yida Tao, Sunghun Kim, and Andreas Zeller. 2013b. Where should we fix this bug? a two-phase recommendation model. IEEE transactions on software Engineering 39, 11 (2013), 1597–1610.
  • Lawrie et al. (2006) Dawn Lawrie, Christopher Morrell, Henry Feild, and David Binkley. 2006. What’s in a name? a study of identifiers. In Program Comprehension, 2006. ICPC 2006. 14th IEEE International Conference on. IEEE, 3–12.
  • Lientz and Swanson (1980) Bennet P Lientz and E Burton Swanson. 1980. Software maintenance management. (1980).
  • Malheiros et al. (2012) Yuri Malheiros, Alan Moraes, Cleyton Trindade, and Silvio Meira. 2012. A source code recommender system to support newcomers. In Computer Software and Applications Conference (COMPSAC), 2012 IEEE 36th Annual. IEEE, 19–24.
  • Michalski et al. (2013) Ryszard S Michalski, Jaime G Carbonell, and Tom M Mitchell. 2013. Machine learning: An artificial intelligence approach. Springer Science & Business Media.
  • Rasmussen (2004) Carl Edward Rasmussen. 2004. Gaussian processes in machine learning. In Advanced lectures on machine learning. Springer, 63–71.
  • Wang et al. (2016) Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering. ACM, 297–308.
  • Williams and Hollingsworth (2005) Chadd C Williams and Jeffrey K Hollingsworth. 2005. Automatic mining of source code repositories to improve bug finding techniques. IEEE Transactions on Software Engineering 31, 6 (2005), 466–480.
  • Ye et al. (2014) Xin Ye, Razvan Bunescu, and Chang Liu. 2014. Learning to rank relevant files for bug reports using domain knowledge. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 689–699.
  • Zhou et al. (2012) Jian Zhou, Hongyu Zhang, and David Lo. 2012. Where should the bugs be fixed?-more accurate information retrieval-based bug localization based on bug reports. In Proceedings of the 34th International Conference on Software Engineering. IEEE Press, 14–24.