Code Smell Detection using Multilabel Classification Approach

by Thirupathi Guggulothu, et al.

Code smells are characteristics of software that indicate a code or design problem and can make the software hard to understand, evolve, and maintain. The code smell detection tools proposed in the literature produce different results, as smells are informally defined or are subjective in nature. To address the issue of tool subjectivity, machine learning techniques have been proposed that can learn to distinguish the characteristics of smelly and non-smelly source code elements (classes or methods). However, the existing machine learning techniques can detect only a single type of smell in a code element, which does not correspond to a real-world scenario. In this paper, we use multilabel classification methods to detect whether a given code element is affected by multiple smells. We consider two code smell datasets and convert them into a multilabel dataset. In our experiments, two multilabel methods applied to the converted dataset demonstrate good performance under 10-fold cross-validation with ten repetitions.




1 Introduction

Code smell refers to an anomaly in the source code that indicates a violation of basic design principles such as abstraction, hierarchy, encapsulation, modularity, and modifiability booch1980object . Even when developers know the design principles, they often violate them because of inexperience, deadline pressure, and heavy competition in the market. Fowler et al. fowler1999refactoring have defined 22 informal code smells. One way to remove them is by using refactoring techniques opdyke1992refactoring . Refactoring is a technique that improves the internal structure (design quality) of the code without altering the external behavior of the software.

In the literature, there are several techniques kessentini2014cooperative and tools fontana2012automatic available to detect different code smells. Each technique and tool produces different results. According to Kessentini et al., code smell detection techniques can be classified into seven categories that differ in the underlying algorithm: cooperative-based abdelmoez2014risk , visualization-based murphy2010interactive , search-based palomba2015mining liu2013monitor palomba2013detecting , probabilistic Rao07detectingbad , metric-based marinescu2004detection moha2010decor tsantalis2009identification , symptom-based moha2010domain , and manual techniques travassos1999detecting ciupke1999automatic . Bowes et al. bowes2013inconsistent compared two smell detection tools on the message chains smell and showed the disparity of results between them. Because of the differing results, Rasool et al. rasool2015review classified, compared, and evaluated existing detection tools and techniques to understand the categorization better. The three main reasons for the disparity in the results are: (1) code smells can be interpreted subjectively by developers, and hence detected in different ways; (2) agreement between detectors is low, i.e., different tools or rules detect different types of smell for different code elements; (3) the threshold value for identifying a smell can vary between detectors.

To address the above limitations, in particular the subjective nature of smells, Fontana et al. fontana2016comparing proposed a machine learning (ML) approach to detect four code smells with the help of 32 classification techniques. The authors showed that most of the classifiers achieved more than 95% in terms of accuracy and F-measure, and on this basis suggested that ML algorithms are the most suitable approach for code smell detection. Di Nucci et al. di2018detecting pointed out a limitation of Fontana et al. fontana2016comparing : the prepared datasets do not represent a real-world scenario. In those datasets, the metric distribution of smelly elements is strongly different from that of non-smelly elements, so any ML technique might easily distinguish the two classes. In real cases, however, the boundary between smelly and non-smelly characteristics is not always clear tufano2017and , fontana2016antipattern . In addition, Fontana et al. built four datasets, one for each smell, and each dataset contained only code elements (instances) affected by that type of smell or non-smelly ones. This makes the datasets unrealistic, since a software system usually contains different types of smells, and might have made it easier for the classifiers to discriminate smelly instances.

To overcome the above limitations, Di Nucci et al. di2018detecting modified the datasets of Fontana et al. fontana2016comparing to simulate a more realistic scenario by merging the datasets at the class level and at the method level. The merged datasets have a reduced difference in metric distribution and contain instances of more than one type of smell. The authors applied the same ML techniques as Fontana et al. to the revised datasets and achieved an average accuracy of 76% across all models. However, their datasets contain some instances that are identical but have different class labels (smelly and non-smelly); we call this disparity. In this paper, we address these disparity instances, which we show to be the reason for the decreased performance in Di Nucci et al. di2018detecting . For example, in method-level merging, if the long method dataset has an instance that is smelly and the same instance also appears in the feature envy dataset, then the authors di2018detecting replicate that instance in the long method dataset as non-smelly. This disparity confuses the ML algorithms. Apart from this issue, the datasets contain instances of multiple code smell types, but the trained models cannot detect more than one smell per instance.

In this work, we removed the disparity instances from the merged method-level datasets and applied tree-based classifiers to them. The results show an average accuracy of 95% to 98% on these datasets, a drastic change in performance after the removal of disparity. From the datasets of Fontana et al. fontana2016comparing and Di Nucci et al. di2018detecting , we observed that there are 395 common instances at the method level. These instances led to the idea of forming a multilabel dataset. With such a dataset, disparity can be eliminated, and more than one smell can be detected for the same instance by using multilabel classification methods. Until now, three classification types have been used for code smell detection in the literature azeem2019machine : (1) binary (presence or absence of a smell), (2) based on probability, and (3) based on severity.

In this paper, we formulate code smell detection as a multilabel classification (MLC) problem. For this, we considered two method-level datasets from Fontana et al. fontana2016comparing and converted them into a multilabel dataset (MLD). We applied two multilabel classification methods to this dataset and found that they achieved good performance (91% on average) under 10-fold cross-validation with 10 repetitions.

The rest of the paper is organized as follows. The second section introduces work related to the detection of code smells using ML techniques. The third section describes the reference study of the considered datasets. The fourth section explains the proposed approach. The fifth section presents the experimental setup and results of the proposed study and discusses how it compares with previous work. The final section gives conclusions and future directions.

2 Related Work

Over the past fifteen years, researchers have presented various tools and techniques for detecting code smells. According to Kessentini et al. kessentini2014cooperative , code smell detection approaches are classified into seven categories (cooperative-based, visualization-based, machine learning-based, probabilistic, metric-based, symptom-based, and manual approaches). In this section, we consider only machine learning-based approaches for detecting code smells.

Kreimer kreimer2005adaptive introduces an adaptive detection approach that combines known methods for finding design flaws, viz. Big Class (Large Class) and Long Method, on the basis of metrics with learning decision trees. The analyses were conducted on two software systems: the IYC system and the WEKA package.

Khomh et al. khomh2009bayesian propose a Bayesian approach to detect occurrences of the Blob antipattern in open-source programs (GanttProject v1.10.2 and Xerces v2.7.0). Khomh et al. khomh2011bdtex present BDTEX (Bayesian Detection Expert), a Goal Question Metric approach to build Bayesian Belief Networks from the definitions of antipatterns, and validate BDTEX with the Blob, Functional Decomposition, and Spaghetti Code antipatterns on two open-source programs.

Maneerat et al. maneerat2011bad collect datasets from the literature regarding the evaluation of 7 bad smells and apply 7 machine learning algorithms for bad-smell prediction, using 27 design model metrics extracted by a tool as independent variables. The authors make no explicit reference to the datasets used.

Maiga et al. maiga2012support introduce SVMDetect, an approach to detect antipatterns based on support vector machines. The subjects of their study are the Blob, Functional Decomposition, Spaghetti Code, and Swiss Army Knife antipatterns, on three open-source programs: ArgoUML, Azureus, and Xerces.

Maiga et al. maiga2012smurf extend the previous paper by introducing SMURF, which takes practitioners’ feedback into account.

Wang et al. wang2012can propose an approach that assists in understanding the harmfulness of intended cloning operations using Bayesian Networks and a set of features such as history, code, and destination features.

Yang et al. yang2015classification study the judgment of individual users by applying machine learning algorithms to code clones. White et al. white2016deep detect code clones using deep learning techniques. The authors sampled 398 files and 480 method-level pairs across 8 real-world Java software systems.

Amorim et al. amorim2015experience studied the effectiveness of the Decision Tree algorithm in recognizing code smells. The authors experimented on 4 open-source projects and compared the results with a manual oracle, with existing detection approaches, and with other machine learning algorithms.

Fontana et al. fontana2016comparing experimented with and compared supervised ML algorithms for code smell detection. The authors used 74 Java systems, with manually validated instances as the training dataset, and 16 different classification algorithms. In addition, a boosting technique is applied on 4 code smells, viz. Data Class, Long Method, Feature Envy, and God Class.

Fontana et al. fontana2017code classified the severity of code smells by using machine learning methods. This approach can help software developers to prioritize or rank classes or methods. Multinomial classification and regression were used for code smell severity classification.

Di Nucci et al. di2018detecting addressed some of the limitations of Fontana et al. fontana2016comparing . The authors reconfigured the datasets of Fontana et al. and provided new datasets that are better suited to real-case scenarios.

The major difference between the previous work and the proposed approach is that we view the detection of code smells as a multilabel classification problem. This paper also addresses some limitations of di2018detecting and shows the reason for its degraded results.

3 Reference Datasets

In this paper, we consider two method-level datasets (long method and feature envy) from Fontana et al. fontana2016comparing . In the existing literature, these datasets are used as single-label datasets. In the following subsections, we briefly describe the data preparation methodology of Fontana et al. These datasets are available at

3.1 Systems and Code Smell Selection

Fontana et al. fontana2016comparing analyzed the software systems of the Qualitas Corpus collected by Tempero et al. tempero2010qualitas . Among the 111 systems of the corpus, 74 were considered. The remaining 37 systems could not be used for code smell detection because they did not compile successfully. For the 74 software systems, the authors computed 61 class-level and 82 method-level metrics. These metrics are the features (independent variables) in the datasets. The two method-level code smells considered here are long method and feature envy.

  1. Long Method (LM): A method is said to be a long method when it has a large number of lines of code and requires too many parameters. This increases the functional complexity of the method and makes it difficult to understand.

  2. Feature Envy (FE): Feature envy is a method-level smell in which a method uses more data from other classes than from its own class, i.e., it accesses more foreign data than local data.
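As a purely illustrative sketch of the feature envy definition (the datasets above are built from Java systems; the Python classes and method names below are hypothetical and not taken from the paper), the following snippet shows a method that reads only another class's data next to a refactored version that moves the behavior to the class owning the data.

```python
class Customer:
    def __init__(self, name, street, city, postcode):
        self.name = name
        self.street = street
        self.city = city
        self.postcode = postcode

    def address_label(self):
        # Refactored home for the logic: it only touches Customer's own data.
        return f"{self.name}\n{self.street}\n{self.postcode} {self.city}"


class Order:
    def __init__(self, customer, total):
        self.customer = customer
        self.total = total

    def shipping_label_envious(self):
        # Feature envy: four accesses to Customer's data, none to Order's own.
        c = self.customer
        return f"{c.name}\n{c.street}\n{c.postcode} {c.city}"

    def shipping_label(self):
        # After a Move Method refactoring, Order simply delegates.
        return self.customer.address_label()


order = Order(Customer("Ada", "1 Main St", "Springfield", "12345"), 42.0)
```

Both methods return the same label; the difference is only in where the data access lives, which is what metric-based detectors try to capture.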

3.2 Dataset Preparation

To establish the dependent variable for code smell prediction models, the authors applied to each code smell a set of automatic detectors shown in Table 1. However, code smell detectors cannot usually achieve 100% recall, meaning that an automatic detection process might not identify actual code smell instances (i.e., false negatives) even in the case that multiple detectors are combined. To cope with false positives and to increase their confidence in validity of the dependent variable, the authors applied a stratified random sampling of the classes/methods of the considered systems: this sampling produced 1,986 instances (826 smelly elements and 1,160 non-smelly ones), which were manually validated by the authors in order to verify the results of the detectors. As a final step, the sampled dataset was normalized for size: the authors randomly removed smelly and non-smelly elements building four disjoint datasets, i.e., one for each code smell type, composed of 140 smelly instances and 280 non-smelly ones (for a total of 420 elements). These datasets represented the training set for the ML techniques.

Code Smell | Detectors
Long Method | PMD, iPlasma (marinescu2005measurement), Marinescu (marinescu2002measurement)
Feature Envy | Fluid Tool (nongpong2012integrating), iPlasma (marinescu2005measurement)
Table 1: Automatic Code Smell Detector Tools

4 Multilabel Classification Approach

Supervised classification is the task of using algorithms that allow the machine to learn associations between instances and class labels. Supervision comes in the form of previously labeled instances, from which an algorithm builds a model to automatically predict the labels of new instances. In ML, classification problems fall into three main categories: binary (yes or no), multiclass, and multilabel classification (MLC). In the literature azeem2019machine , code smell detectors were single-label (binary) classifiers, used to detect only a single type of code smell (presence or absence). In this work, multilabel classifiers are used to detect multiple code smells for the same element.

MLC is a way to learn from instances that are each associated with a set of labels (predictive classes). That is, every instance can have one or more labels associated with it. MLC is frequently used in application areas such as multimedia classification, medical diagnosis, text categorization, and semantic scene classification. Similarly, in the code smell detection domain, instances are code elements and the labels are code smells, i.e., a code element can contain more than one type of smell, which is not addressed by the earlier approaches. The main difference between MLC and the existing approaches is the expected output of the trained models: existing approaches detect only one smell, whereas the proposed one can detect more than one. In the following subsections, we explain the procedure used to construct the MLD and the methods used for the multilabel classification experiments.

4.1 Construction of Multi-label Dataset

The considered LM and FE datasets have 420 instances each, which are used to construct the multilabel dataset. The steps to create the MLD are as follows.

  1. Initially, each dataset has 420 instances. Of those, the 395 instances common to both datasets are added to the MLD with their corresponding two class labels.

  2. The remaining 25 instances of each single-label dataset are added to the MLD, with the other class label set to non-smelly.

An overview of the procedure is depicted in Figure 1. As shown in the figure, the dataset contains 82 method metrics, M1, M2, ..., M82 (independent variables); I1, I2, ... are the instances, and the class labels are LM and FE.

Figure 1: Procedure of constructing multilabel dataset
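The construction steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling: the function name `build_multilabel`, the dictionary representation, and the instance identifiers are hypothetical, and the metric values are omitted for brevity.

```python
def build_multilabel(lm_labels, fe_labels):
    """Combine two single-label datasets into one multilabel dataset.

    lm_labels, fe_labels: dicts mapping an instance identifier to 1 (smelly)
    or 0 (non-smelly). An instance missing from one dataset is treated as
    non-smelly for that label, mirroring step 2 of the procedure.
    """
    mld = {}
    for key in sorted(lm_labels.keys() | fe_labels.keys()):
        mld[key] = (lm_labels.get(key, 0), fe_labels.get(key, 0))
    return mld


# Toy data: I1 and I2 are common to both datasets, I3 and I4 are not.
lm = {"I1": 1, "I2": 0, "I3": 1}
fe = {"I1": 1, "I2": 1, "I4": 0}
mld = build_multilabel(lm, fe)
```

With two datasets of 420 instances each and 395 instances in common, the union built this way has 420 + 420 - 395 = 445 instances.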

4.2 Methods of Multilabel Classification

Two approaches are widely used to handle MLC problems tsoumakas2007multi : problem transformation methods (PTM) and algorithm adaptation methods (AAM). In PTM, the MLD is transformed into one or more single-label problems, which are solved by appropriate single-label classifiers. In AAM, a single-label classifier is adapted to handle the MLD directly. In this paper, we consider only problem transformation methods.

Many methods fall under the PTM category. Two of them can be thought of as the foundation of many others. (1) The binary relevance (BR) method godbole2004discriminative converts an MLD into as many binary datasets as there are distinct labels. The predictions of the binary classifiers trained on these datasets are joined to obtain the final outcome. (2) The label powerset (LP) method boutell2004learning converts an MLD into a multi-class dataset by using the label set of each instance as a class identifier. Any multi-class classifier can then be trained on it, and the predicted classes are transformed back into label sets.
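The two transformations can be illustrated with a small Python sketch. The function names and the toy metric values are assumptions made for illustration; the string encoding of label sets (for example "10") is one possible class identifier, not necessarily the one used by any particular tool.

```python
def binary_relevance(X, Y):
    """Return one (X, y_j) binary dataset per label column j of Y."""
    n_labels = len(Y[0])
    return [(X, [row[j] for row in Y]) for j in range(n_labels)]


def label_powerset(X, Y):
    """Return a single multi-class dataset whose class is the label set."""
    classes = ["".join(str(bit) for bit in row) for row in Y]
    return X, classes


def decode_class(cls):
    """Map a powerset class such as '10' back to a label vector [1, 0]."""
    return [int(ch) for ch in cls]


X = [[0.2, 5.0], [0.9, 1.0], [0.4, 3.0]]  # made-up metric values
Y = [[1, 1], [0, 1], [0, 0]]              # columns: LM, FE
br_datasets = binary_relevance(X, Y)
_, lp_classes = label_powerset(X, Y)
```

Binary relevance yields two independent binary problems here, while label powerset yields one problem with up to four classes for two labels.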

Several algorithms have been developed under the BR and LP methods. In this paper, we use two algorithms that cover these methods: classifier chains (CC) under the BR category, and LC (also known as LP) under the LP category. The reason for choosing these algorithms is that they capture label dependencies (correlation or co-occurrence) during classification, which leads to improved classification performance guo2011multi . The considered code smells usually co-occur with each other palomba2017investigating .

After the transformation, we used the top 5 tree-based (single-label) classifiers as base learners for the multilabel methods (CC, LC). Previous studies in the literature azeem2019machine have shown that these classifiers achieve high performance in code smell classification.

We identified a set of specific research questions that guide the classification of code smells using the multilabel approach:

RQ1: How many disparity instances exist in the configured datasets of the concerned code smells in di2018detecting ?

RQ2: What is the performance improvement after removing the disparity instances?

RQ3: What is the performance when the dataset is constructed as a multilabel dataset instead of by merging?

5 Experimental Results

5.1 Experimentation Setup

In the following, we give a short description of the selected MLC methods. The MEKA tool read2016meka provides the implementation of these methods.

  • Classifier Chains (CC) read2011classifier : The algorithm tries to enhance BR by taking label correlation into account. To predict new labels, it trains Q classifiers (one per label) that are connected to one another in such a way that the prediction of each classifier is added to the dataset as a new feature for the next classifier.

  • LC aka LP (Label Powerset) Method boutell2004learning : Treats each label combination as a single class in a multi-class learning scheme. The set of possible class values is the powerset of the labels.
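The chaining idea can be sketched as follows. This is a simplified illustration, not the MEKA implementation: the class names are hypothetical, and a tiny 1-nearest-neighbour learner stands in for the tree-based base classifiers so that the example needs no external library.

```python
class OneNN:
    """Tiny 1-nearest-neighbour base learner (a stand-in, see lead-in)."""

    def fit(self, X, y):
        self.X = [list(row) for row in X]
        self.y = list(y)
        return self

    def predict_one(self, x):
        dists = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in self.X]
        return self.y[dists.index(min(dists))]


class ClassifierChain:
    """One binary classifier per label; classifier j also sees labels 0..j-1."""

    def __init__(self, base_factory, n_labels):
        self.base_factory = base_factory
        self.n_labels = n_labels

    def fit(self, X, Y):
        self.models = []
        for j in range(self.n_labels):
            # Training features: original metrics plus the TRUE earlier labels.
            Xj = [list(x) + list(y[:j]) for x, y in zip(X, Y)]
            yj = [y[j] for y in Y]
            self.models.append(self.base_factory().fit(Xj, yj))
        return self

    def predict_one(self, x):
        preds = []
        for model in self.models:
            # Prediction features: metrics plus the PREDICTED earlier labels.
            preds.append(model.predict_one(list(x) + preds))
        return preds


X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # made-up metric values
Y = [[0, 0], [0, 1], [1, 0], [1, 1]]  # columns: LM, FE
chain = ClassifierChain(OneNN, n_labels=2).fit(X, Y)
```

With two labels the chain has only two links, so the second classifier (FE) receives the LM prediction as one extra feature.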

To test the performance of the different code smell prediction models, we apply 10-fold cross-validation and repeat it 10 times to cope with randomness hall2011developing . Next, we evaluate the classification performance.
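The evaluation protocol can be sketched as an index generator. This is a minimal, unstratified illustration with a hypothetical function name and an arbitrary seed; the fold assignment actually used by the evaluation tool may differ.

```python
import random


def repeated_kfold(n_instances, k=10, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for `repeats` runs of k-fold CV."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_instances))
        rng.shuffle(indices)  # a fresh shuffle for every repetition
        folds = [indices[i::k] for i in range(k)]
        for i, test in enumerate(folds):
            train = [idx for j, fold in enumerate(folds) if j != i
                     for idx in fold]
            yield train, test


splits = list(repeated_kfold(445))
```

Ten repetitions of ten folds give 100 train/test splits, and the reported figures are averages over those 100 evaluations.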

The evaluation metrics of MLC differ from those of single-label classification, since each instance has multiple labels, which may be classified partly correctly or partly incorrectly. MLC evaluation metrics are classified into two groups: (1) example-based metrics and (2) label-based metrics. Example-based metrics are calculated for each instance and then averaged over all instances to give the final outcome. Label-based metrics are computed for each label instead of each instance. In this work, we consider example-based measures, because label-based measures would fail to directly address the correlations among different classes sorower2010literature . Equations 1, 2, and 3 below are used to measure the performance of the MLC methods; all belong to the example-based category. Here, $D$ denotes the number of instances, $L$ the number of labels, $Z_i$ the set of predicted labels for instance $i$, and $Y_i$ the set of true labels for instance $i$. All other measures are discussed in detail in sorower2010literature .

  • Accuracy: The proportion of correctly predicted labels with respect to the total number of labels (predicted and actual) for each instance, averaged over all instances.

$$\text{Accuracy} = \frac{1}{D} \sum_{i=1}^{D} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|} \tag{1}$$

  • Hamming Loss: The prediction error (an incorrect label is predicted) and the missing error (a relevant label is not predicted), normalized over the total number of classes and the total number of examples. Here $\triangle$ denotes the symmetric difference of two sets.

$$\text{Hamming Loss} = \frac{1}{D \cdot L} \sum_{i=1}^{D} |Y_i \,\triangle\, Z_i| \tag{2}$$

  • Exact Match Ratio: The proportion of instances whose predicted label set is identical to the actual label set. It is the strictest evaluation metric. Here $I(\cdot)$ is the indicator function.

$$\text{Exact Match Ratio} = \frac{1}{D} \sum_{i=1}^{D} I(Y_i = Z_i) \tag{3}$$
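The three example-based measures described above can be computed as in the following sketch. The function name is hypothetical, and one detail is an explicit assumption: an instance with no true and no predicted labels is given a per-instance accuracy of 1, since the ratio is otherwise undefined for such instances.

```python
def example_based_metrics(Y_true, Y_pred):
    """Accuracy (Jaccard), Hamming loss and exact match over 0/1 label rows."""
    n = len(Y_true)
    n_labels = len(Y_true[0])
    jaccard_sum = 0.0
    mismatches = 0
    exact = 0
    for y, z in zip(Y_true, Y_pred):
        true_set = {j for j, v in enumerate(y) if v}
        pred_set = {j for j, v in enumerate(z) if v}
        union = true_set | pred_set
        # Assumption: empty true set and empty predicted set count as 1.
        jaccard_sum += len(true_set & pred_set) / len(union) if union else 1.0
        mismatches += len(true_set ^ pred_set)  # wrong plus missed labels
        exact += int(true_set == pred_set)
    return {
        "accuracy": jaccard_sum / n,
        "hamming_loss": mismatches / (n * n_labels),
        "exact_match": exact / n,
    }


Y_true = [[1, 1], [1, 0], [0, 0], [0, 1]]  # columns: LM, FE
Y_pred = [[1, 1], [1, 1], [0, 0], [1, 0]]
scores = example_based_metrics(Y_true, Y_pred)
```

In this toy example two of the four instances are predicted exactly, one is half right, and one is entirely wrong, which shows how accuracy sits between Hamming loss and the stricter exact match.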


5.2 Results

5.2.1 Datasets

To answer RQ1, we considered the configured datasets of di2018detecting . The authors merged the FE dataset into the LM dataset and vice versa. The merged datasets are listed in Table 2. Each dataset has 840 instances, of which 140 are affected (smelly) and 700 are non-smelly. When FE is merged into LM, there are 395 common instances, of which 132 are smelly in the LM dataset. In the same way, when LM is merged into FE, there are 125 smelly instances in the FE dataset. These 132 and 125 instances suffer from disparity, i.e., the same instance has two class labels (smelly and non-smelly). Because of these disparity instances, the authors of di2018detecting achieved lower performance with the ML classification techniques. The merged datasets are available at

Method Level Merged Datasets
Number of Instances | Long Method: Smelly | Long Method: Non Smelly | Feature Envy: Smelly | Feature Envy: Non Smelly
840 | 140 | 700 | 140 | 700
Table 2: Configured Datasets

In this paper, the MLD is created by combining the 395 common and 50 uncommon (25 each) instances of LM and FE, giving 445 instances in total. Table 3 shows the number and percentage of instances affected in the MLD. Out of 445, 85 instances are affected by both smells. Considered individually, 140 instances are affected by LM and 140 by FE. The graphical representation of the MLD is shown in Figure 2.

Long Method | Feature Envy | Number of Instances Affected | % of Affected
Yes | Yes | 85 | 19.1%
Yes | No | 55 | 12.3%
No | Yes | 55 | 12.3%
No | No | 250 | 56.17%
Table 3: Number of instances affected in multilabel dataset
Figure 2: Multilabel dataset

5.2.2 Multilabel Dataset Statistics

Table 4 lists the basic measures that characterize the multilabel training dataset. The basic measures of a single-label dataset are the numbers of attributes, instances, and labels. In addition, there are further measures specific to multilabel datasets tsoumakas2007multi . In the table, cardinality indicates the average number of active labels per instance. Dividing this measure by the number of labels in the dataset results in a dimensionless measure known as density. The two labels give four label combinations (label sets) in our dataset. The mean imbalance ratio (MeanIR) indicates whether the dataset is imbalanced or not. As a general rule charte2015addressing , any MLD with a MeanIR value higher than 1.5 should be considered imbalanced. The prepared multilabel dataset is therefore well balanced, because its MeanIR value is 1.0, which is less than 1.5.

Number of Instances | Number of Attributes | Number of Labels | Number of Label Sets | Cardinality | Density | MeanIR
445 | 82 | 2 | 4 | 0.629 | 0.314 | 1.0
Table 4: Statistics of Multilabel Dataset
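The statistics in Table 4 can be reproduced from the label counts in Table 3 with a short sketch. The function name is hypothetical, and the code assumes that every label occurs at least once (otherwise the imbalance ratio would divide by zero); MeanIR is computed as the average over labels of the largest label count divided by that label's count.

```python
def mld_statistics(Y):
    """Basic multilabel statistics for a list of 0/1 label rows."""
    n = len(Y)
    n_labels = len(Y[0])
    label_counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    cardinality = sum(label_counts) / n   # average active labels per instance
    density = cardinality / n_labels      # cardinality normalised by labels
    label_sets = len({tuple(row) for row in Y})
    max_count = max(label_counts)
    mean_ir = sum(max_count / c for c in label_counts) / n_labels
    return {
        "instances": n,
        "labels": n_labels,
        "label_sets": label_sets,
        "cardinality": cardinality,
        "density": density,
        "mean_ir": mean_ir,
    }


# Label rows (LM, FE) rebuilt from the counts in Table 3.
Y = [[1, 1]] * 85 + [[1, 0]] * 55 + [[0, 1]] * 55 + [[0, 0]] * 250
stats = mld_statistics(Y)
```

Since both labels have exactly 140 positive instances, each per-label imbalance ratio is 1, which gives the MeanIR of 1.0 reported in the table.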

5.2.3 Performance Improvements in Existing Datasets

To answer RQ2, we removed the 132 and 125 disparity instances from the LM and FE merged datasets, respectively. The LM dataset now has 708 instances, of which 140 are positive (smelly) and 568 are negative (non-smelly). The FE dataset has 715 instances, of which 140 are positive and 575 are negative. We then applied single-label ML techniques (tree-based classifiers) to those datasets. The performance improved drastically on both datasets, as shown in Tables 5 and 6. Earlier, the average accuracy of the tree-based classifiers on the long method and feature envy datasets was 73% and 75%, respectively. After removal of the disparity instances, we obtained averages of 95% and 98%. This is evidence that disparity was the cause of the lower performance of Di Nucci et al. di2018detecting on the concerned code smell datasets.
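Detecting and removing disparity instances can be sketched as follows. This is an illustration under assumptions, not the procedure used in the study: the function names and toy metric values are hypothetical, and dropping the non-smelly copy of each conflicting instance is inferred from the reported counts (the smelly count stays at 140 while the non-smelly count falls from 700 to 568 and 575).

```python
from collections import defaultdict


def find_disparities(rows):
    """Return feature vectors that occur with both labels (0 and 1)."""
    labels_by_features = defaultdict(set)
    for features, label in rows:
        labels_by_features[tuple(features)].add(label)
    return {f for f, labels in labels_by_features.items() if len(labels) > 1}


def remove_disparities(rows):
    """Drop the non-smelly copy of every conflicting instance."""
    conflicted = find_disparities(rows)
    return [(f, y) for f, y in rows
            if not (tuple(f) in conflicted and y == 0)]


rows = [
    ((10, 3), 1),  # smelly in the original dataset
    ((10, 3), 0),  # same instance replicated as non-smelly by the merge
    ((2, 1), 0),
    ((2, 1), 0),   # duplicate with a consistent label: kept
    ((7, 5), 1),
]
cleaned = remove_disparities(rows)
```

Applied to the merged datasets, this kind of filter accounts for the sizes reported above: 840 - 132 = 708 instances for LM and 840 - 125 = 715 for FE.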

Classifier | Accuracy | F-Measure | ROC Area
B-Random Forest | 95.9% | 96.0% | 97.6%
Random Forest | 95.9% | 96.0% | 97.7%
B-J48 UnPruned | 95.4% | 95.5% | 97.1%
B-J48 Pruned | 94.7% | 94.8% | 97.7%
J48 Unpruned | 93.5% | 93.5% | 91.9%
Table 5: Long Method Results

Classifier | Accuracy | F-Measure | ROC Area
B-Random Forest | 98.0% | 98.0% | 99.9%
Random Forest | 98.1% | 98.2% | 99.9%
B-J48 UnPruned | 99.0% | 99.0% | 98.7%
B-J48 Pruned | 99.1% | 99.2% | 99.3%
J48 Unpruned | 98.1% | 98.2% | 98.0%
Table 6: Feature Envy Results

5.2.4 Performance of Multilabel Classification

To answer RQ3, two problem transformation methods (CC, LC) are used to transform the multilabel training dataset into a set of binary or multi-class datasets. We then applied the top 5 tree-based classification techniques to the transformed datasets. Their performance is shown in Tables 7 and 8. To evaluate the techniques, we ran 10-fold cross-validation for 10 iterations and measured the average accuracy, Hamming loss, and exact match over those 100 runs.

Tables 7 and 8 show that all top 5 classifiers perform well under the CC and LC methods. The results range from 89.6% to 93.6% accuracy for CC and from 89% to 93.5% for LC, with a low Hamming loss (below 0.079). In both tables, the random forest-based classifiers give the best performance on all three measures. Method-wise, CC performs slightly better than LC. In addition to these results, we also list other (label-based) metrics of the CC and LC methods in Appendix Tables 9 and 10.

CC (10-Fold Cross Validation Run for 10 Iterations), Example Based Metrics
Single Label Classifier | Accuracy (Jaccard Index) | Hamming Loss | Exact Match
J48 Pruned | 89.6% | 0.078 | 85.4%
Random Forest | 93.6% | 0.047 | 91.1%
B-J48 pruned | 92.2% | 0.056 | 89.4%
B-J48 UnPruned | 91.1% | 0.064 | 87.9%
B-Random Forest | 92.8% | 0.053 | 89.9%
Table 7: Results of CC method using top 5 single label classifiers.

LC (10-Fold Cross Validation Run for 10 Iterations), Example Based Metrics
Single Label Classifier | Accuracy (Jaccard Index) | Hamming Loss | Exact Match
J48 Pruned | 89.0% | 0.075 | 85.2%
Random Forest | 93.3% | 0.053 | 90.1%
B-J48 pruned | 90.0% | 0.069 | 87.0%
B-J48 UnPruned | 90.7% | 0.063 | 87.9%
B-Random Forest | 93.5% | 0.049 | 90.6%
Table 8: Results of LC method using top 5 single label classifiers.

The LC method (aka LP) converts the MLD into a multi-class dataset by using the label set of each instance as a class identifier. In this work, the resulting class variable can take four values (00, 01, 10, 11): 00 means affected by neither smell, 01 means affected by feature envy only, 10 means affected by long method only, and 11 means affected by both smells. Table 8 therefore also reports the results of this multi-class classification.

5.3 Discussion

In this section, we discuss how the existing studies differ from the proposed study and why our proposed approach is more useful in a real-world scenario.

The study di2018detecting replicated and modified the datasets of fontana2016comparing by merging in the instances of other code smell datasets in order to (i) reduce the difference in the metric distribution and (ii) have different types of smells in the same dataset, so as to model a more realistic scenario.

In this paper, we identified the disparity instances in the merged datasets and removed them manually. After that, we applied the same tree-based classifiers as in di2018detecting to the datasets with disparity instances removed and achieved 95% and 98% accuracy for LM and FE, respectively. This disparity led to the idea of forming a multilabel dataset.

In a real-world scenario, a code element can contain more than one design problem (code smell), and our MLD is constructed accordingly. The MLD also maintains characteristics similar to those of the modified datasets of di2018detecting , such as the metric distribution and the presence of different types of smells. Two MLC methods were then applied to the MLD. In the existing study, the average accuracy was 76% and only one type of smell was detected. In the proposed study, we detected two smells in the same instance and obtained 91% accuracy. Our findings have important implications for the research community: (1) the detected code smells can be analyzed after detection to decide which smell to refactor first, reducing developer effort, because different refactoring orders require different effort; (2) critical code elements can be identified (or prioritized) for refactoring based on the number of code smells detected in them. That is, an element affected by more design problems is given the highest priority for refactoring.

6 Conclusion and Future Directions

In this work, we detected two method-level code smells using a multilabel classification approach. Existing studies detect a single type of code smell, whereas the proposed study detects two code smells, whether or not they occur in the same method. For this work, we considered two method-level datasets that were constructed by single-type detectors. These datasets have 395 common instances, which gave rise to disparity during the merging process in the existing study and lowered the performance reported there. In this paper, these common instances are instead used to construct the MLD, which avoids the disparity. We applied two multilabel classification methods (CC, LC) to the MLD. The CC method performed better than LC on all three measures. The performance of the proposed study is much better than that of the existing study: the models in the existing study achieved an average accuracy of 73%, whereas in the proposed study we obtained an average of 91%.

The proposed approach detects only two smells, but it is not limited to them. In the future, we want to detect other method-level code smells as well. In addition, multilabel classification of code smells can identify the critical code elements (methods or classes) that are in urgent need of refactoring. That is, critical elements can be identified based on the number of code smells detected in each element of the dataset. For example, if there are two code smells in the same method, then this method suffers from more design problems (is more critical) than a method with a single code smell.

The datasets with the disparity instances removed are available for download at

The multilabel dataset is available for download at


  • (1) G. Booch, Object-oriented analysis and design, Addison-Wesley, 1980.
  • (2) M. Fowler, K. Beck, J. Brant, W. Opdyke, D. Roberts, Refactoring: Improving the design of existing programs (1999).
  • (3) W. F. Opdyke, Refactoring: A program restructuring aid in designing object-oriented application frameworks, Ph.D. thesis, University of Illinois at Urbana-Champaign (1992).
  • (4) W. Kessentini, M. Kessentini, H. Sahraoui, S. Bechikh, A. Ouni, A cooperative parallel search-based software engineering approach for code-smells detection, IEEE Transactions on Software Engineering 40 (9) (2014) 841–861.
  • (5) F. A. Fontana, P. Braione, M. Zanoni, Automatic detection of bad smells in code: An experimental assessment, Journal of Object Technology 11 (2) (2012) 5–1.
  • (6) W. Abdelmoez, E. Kosba, A. F. Iesa, Risk-based code smells detection tool, in: The International Conference on Computing Technology and Information Management (ICCTIM2014), The Society of Digital Information and Wireless Communication, 2014, pp. 148–159.
  • (7) E. Murphy-Hill, A. P. Black, An interactive ambient visualization for code smells, in: Proceedings of the 5th international symposium on Software visualization, ACM, 2010, pp. 5–14.
  • (8) F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, D. Poshyvanyk, A. De Lucia, Mining version histories for detecting code smells, IEEE Transactions on Software Engineering 41 (5) (2015) 462–489.
  • (9) H. Liu, X. Guo, W. Shao, Monitor-based instant software refactoring, IEEE Transactions on Software Engineering (2013) 1.
  • (10) F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia, D. Poshyvanyk, Detecting bad smells in source code using change history information, in: Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, IEEE Press, 2013, pp. 268–278.
  • (11) A. A. Rao, K. N. Reddy, Detecting bad smells in object oriented design using design change propagation probability matrix 1 (2007).
  • (12) R. Marinescu, Detection strategies: Metrics-based rules for detecting design flaws, in: Software Maintenance, 2004. Proceedings. 20th IEEE International Conference on, IEEE, 2004, pp. 350–359.
  • (13) N. Moha, Y.-G. Gueheneuc, A.-F. Duchien, et al., Decor: A method for the specification and detection of code and design smells, IEEE Transactions on Software Engineering (TSE) 36 (1) (2010) 20–36.
  • (14) N. Tsantalis, A. Chatzigeorgiou, Identification of move method refactoring opportunities, IEEE Transactions on Software Engineering 35 (3) (2009) 347–367.
  • (15) N. Moha, Y.-G. Guéhéneuc, A.-F. Le Meur, L. Duchien, A. Tiberghien, From a domain analysis to the specification and detection of code and design smells, Formal Aspects of Computing 22 (3-4) (2010) 345–361.
  • (16) G. Travassos, F. Shull, M. Fredericks, V. R. Basili, Detecting defects in object-oriented designs: using reading techniques to increase software quality, in: ACM Sigplan Notices, Vol. 34, ACM, 1999, pp. 47–56.
  • (17) O. Ciupke, Automatic detection of design problems in object-oriented reengineering, in: Technology of Object-Oriented Languages and Systems, 1999. TOOLS 30 Proceedings, IEEE, 1999, pp. 18–32.
  • (18) D. Bowes, D. Randall, T. Hall, The inconsistent measurement of message chains, in: Emerging Trends in Software Metrics (WETSoM), 2013 4th International Workshop on, IEEE, 2013, pp. 62–68.
  • (19) G. Rasool, Z. Arshad, A review of code smell mining techniques, Journal of Software: Evolution and Process 27 (11) (2015) 867–895.
  • (20) F. A. Fontana, M. V. Mäntylä, M. Zanoni, A. Marino, Comparing and experimenting machine learning techniques for code smell detection, Empirical Software Engineering 21 (3) (2016) 1143–1191.
  • (21) D. Di Nucci, F. Palomba, D. A. Tamburri, A. Serebrenik, A. De Lucia, Detecting code smells using machine learning techniques: are we there yet?, in: 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), IEEE, 2018, pp. 612–621.
  • (22) M. Tufano, F. Palomba, G. Bavota, R. Oliveto, M. Di Penta, A. De Lucia, D. Poshyvanyk, When and why your code starts to smell bad (and whether the smells go away), IEEE Transactions on Software Engineering 43 (11) (2017) 1063–1088.
  • (23) F. A. Fontana, J. Dietrich, B. Walter, A. Yamashita, M. Zanoni, Antipattern and code smell false positives: Preliminary conceptualization and classification, in: Software Analysis, Evolution, and Reengineering (SANER), 2016 IEEE 23rd International Conference on, Vol. 1, IEEE, 2016, pp. 609–613.
  • (24) M. I. Azeem, F. Palomba, L. Shi, Q. Wang, Machine learning techniques for code smell detection: A systematic literature review and meta-analysis, Information and Software Technology.
  • (25) J. Kreimer, Adaptive detection of design flaws, Electronic Notes in Theoretical Computer Science 141 (4) (2005) 117–136.
  • (26) F. Khomh, S. Vaucher, Y.-G. Guéhéneuc, H. Sahraoui, A bayesian approach for the detection of code and design smells, in: Quality Software, 2009. QSIC’09. 9th International Conference on, IEEE, 2009, pp. 305–314.
  • (27) F. Khomh, S. Vaucher, Y.-G. Guéhéneuc, H. Sahraoui, Bdtex: A gqm-based bayesian approach for the detection of antipatterns, Journal of Systems and Software 84 (4) (2011) 559–572.
  • (28) N. Maneerat, P. Muenchaisri, Bad-smell prediction from software design model using machine learning techniques, in: Computer Science and Software Engineering (JCSSE), 2011 Eighth International Joint Conference on, IEEE, 2011, pp. 331–336.
  • (29) A. Maiga, N. Ali, N. Bhattacharya, A. Sabané, Y.-G. Guéhéneuc, G. Antoniol, E. Aïmeur, Support vector machines for anti-pattern detection, in: Automated Software Engineering (ASE), 2012 Proceedings of the 27th IEEE/ACM International Conference on, IEEE, 2012, pp. 278–281.
  • (30) A. Maiga, N. Ali, N. Bhattacharya, A. Sabane, Y.-G. Gueheneuc, E. Aimeur, Smurf: A svm-based incremental anti-pattern detection approach, in: Reverse engineering (WCRE), 2012 19th working conference on, IEEE, 2012, pp. 466–475.
  • (31) X. Wang, Y. Dang, L. Zhang, D. Zhang, E. Lan, H. Mei, Can i clone this piece of code here?, in: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering, ACM, 2012, pp. 170–179.
  • (32) J. Yang, K. Hotta, Y. Higo, H. Igaki, S. Kusumoto, Classification model for code clones based on machine learning, Empirical Software Engineering 20 (4) (2015) 1095–1125.
  • (33) M. White, M. Tufano, C. Vendome, D. Poshyvanyk, Deep learning code fragments for code clone detection, in: Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ACM, 2016, pp. 87–98.
  • (34) L. Amorim, E. Costa, N. Antunes, B. Fonseca, M. Ribeiro, Experience report: Evaluating the effectiveness of decision trees for detecting code smells, in: Software Reliability Engineering (ISSRE), 2015 IEEE 26th International Symposium on, IEEE, 2015, pp. 261–269.
  • (35) F. A. Fontana, M. Zanoni, Code smell severity classification using machine learning techniques, Knowledge-Based Systems 128 (2017) 43–58.
  • (36) E. Tempero, C. Anslow, J. Dietrich, T. Han, J. Li, M. Lumpe, H. Melton, J. Noble, The qualitas corpus: A curated collection of java code for empirical studies, in: Software Engineering Conference (APSEC), 2010 17th Asia Pacific, IEEE, 2010, pp. 336–345.
  • (37) R. Marinescu, Measurement and quality in object-oriented design, in: Software Maintenance, 2005. ICSM’05. Proceedings of the 21st IEEE International Conference on, IEEE, 2005, pp. 701–704.
  • (38) R. Marinescu, Measurement and quality in object-oriented design.
  • (39) K. Nongpong, Integrating "code smells" detection with refactoring tool support.
  • (40) G. Tsoumakas, I. Katakis, Multi-label classification: An overview, International Journal of Data Warehousing and Mining (IJDWM) 3 (3) (2007) 1–13.
  • (41) S. Godbole, S. Sarawagi, Discriminative methods for multi-labeled classification, in: Pacific-Asia conference on knowledge discovery and data mining, Springer, 2004, pp. 22–30.
  • (42) M. R. Boutell, J. Luo, X. Shen, C. M. Brown, Learning multi-label scene classification, Pattern recognition 37 (9) (2004) 1757–1771.
  • (43) Y. Guo, S. Gu, Multi-label classification using conditional dependency networks, in: IJCAI Proceedings-International Joint Conference on Artificial Intelligence, Vol. 22, 2011, p. 1300.
  • (44) F. Palomba, R. Oliveto, A. De Lucia, Investigating code smell co-occurrences using association rule learning: A replicated study, in: Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE), IEEE Workshop on, IEEE, 2017, pp. 8–13.
  • (45) J. Read, P. Reutemann, B. Pfahringer, G. Holmes, Meka: a multi-label/multi-target extension to weka, The Journal of Machine Learning Research 17 (1) (2016) 667–671.
  • (46) J. Read, B. Pfahringer, G. Holmes, E. Frank, Classifier chains for multi-label classification, Machine learning 85 (3) (2011) 333.
  • (47) T. Hall, S. Beecham, D. Bowes, D. Gray, S. Counsell, Developing fault-prediction models: What the research can show industry, IEEE software 28 (6) (2011) 96–99.
  • (48) M. S. Sorower, A literature survey on algorithms for multi-label learning, Oregon State University, Corvallis 18.
  • (49) F. Charte, A. J. Rivera, M. J. del Jesus, F. Herrera, Addressing imbalance in multilabel classification: Measures and random resampling algorithms, Neurocomputing 163 (2015) 3–16.


CC (10-Fold Cross Validation Run for 10 Iterations), Label Based Metrics

Single Label      | Micro Averaging       | Macro Averaging
J48 Pruned        | 89.0%  95.0%  91.9%   | 89.1%  95.0%  91.9%
Random Forest     | 89.9%  95.7%  92.8%   | 90.2%  95.7%  95.4%
B-J48 pruned      | 89.7%  92.9%  91.2%   | 89.8%  92.9%  91.3%
B-J48 UnPruned    | 88.6%  91.4%  90.0%   | 88.7%  91.4%  90.0%
B-Random Forest   | 89.0%  95.0%  91.9%   | 89.1%  95.0%  94.8%

Table 9: Results of CC method (example based) using top 5 single label classifiers.

CC (10-Fold Cross Validation Run for 10 Iterations), Label Based Metrics

Single Label      | Micro Averaging       | Macro Averaging
J48 Pruned        | 88.2%  87.9%  88.0%   | 88.1%  87.9%  88.0%
Random Forest     | 88.4%  95.7%  91.9%   | 88.6%  95.7%  92.0%
B-J48 pruned      | 88.7%  89.6%  89.2%   | 88.8%  89.6%  89.2%
B-J48 UnPruned    | 86.2%  91.1%  90.1%   | 89.2%  91.1%  90.1%
B-Random Forest   | 88.8%  96.4%  92.5%   | 89.0%  96.4%  92.5%

Table 10: Results of LC method (example based) using top 5 single label classifiers.
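
Tables 9 and 10 group their columns under micro and macro averaging. The difference between the two averaging schemes can be sketched as follows, using precision as the example metric. This choice is an assumption, since the table headers do not name the individual columns; the confusion counts and label names below are hypothetical and are not results from this study.

```python
# Hypothetical per-label confusion counts: (true positives, false positives).
counts = {"smell_a": (90, 10), "smell_b": (40, 10)}


def macro_precision(counts):
    """Average the per-label precisions, weighting every label equally."""
    per_label = [tp / (tp + fp) for tp, fp in counts.values()]
    return sum(per_label) / len(per_label)


def micro_precision(counts):
    """Pool the counts over all labels, then compute a single precision."""
    tp_total = sum(tp for tp, _ in counts.values())
    fp_total = sum(fp for _, fp in counts.values())
    return tp_total / (tp_total + fp_total)


print(round(macro_precision(counts), 4))
print(round(micro_precision(counts), 4))
```

Macro averaging gives each smell the same weight regardless of how often it occurs, while micro averaging is dominated by the more frequent smell, so the two values diverge when the labels are imbalanced.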