
Empirical Analysis on Effectiveness of NLP Methods for Predicting Code Smell

by   Himanshu Gupta, et al.
BITS Pilani

A code smell is a surface indicator of an inherent problem in the system, most often due to deviation from standard coding practices on the developers' part during the development phase. Studies observe that code smells make code more susceptible to modifications and corrections than code without them. Restructuring the code at an early stage of development saves the exponentially increasing effort that would otherwise be required to address the issues stemming from these code smells. Instead of using traditional features to detect code smells, we use user comments to manually construct features to predict code smells. We use three Extreme Learning Machine kernels over 629 packages to identify eight code smells by leveraging feature engineering and sampling techniques. Our findings indicate that the radial basis function kernel performs best of the three kernel methods, with a mean accuracy of 98.52%.



1 Introduction

The existence of code smells in source code points towards poor design and violations of standard coding practices [10]. Code smells may not necessarily be identified as defects in the software in the current phase, but the affected classes have a high likelihood of developing bugs in the future. Since these code smells do not cause defects, the only way to identify them is by inspection, i.e., manually combing through thousands of lines of code. This method is highly disorganized and costly, and it becomes less efficient as the size of the code package grows. In our work, we automate this process of identifying code smells. We use the input source code packages to build a set of source code metrics and develop a model to locate and predict code smells in the application packages. These models will reduce the cost and improve the efficiency of maintaining software while enforcing standard coding practices and improving its quality.

In this paper, we used three kernels (the linear kernel, the radial basis function kernel, and the polynomial kernel) to develop models to predict the following eight code smells: Swiss Army Knife (SAK), Long Method (LM), Member Ignoring Method (MIM), No Low Memory Resolver (NLMR), Blob Class (BLOB), Internal Getter/Setter (IGS), Leaking Inner Class (LIC), and Complex Class (CC). The source code metrics come from the application’s source code packages and are used to engineer relevant features and select relevant metrics. We used the Wilcoxon Rank Sum Test to achieve the second of these objectives. In this work, we have analyzed the performance of the various kernel functions using accuracy, area under the curve (AUC), and F-measure to predict code smells. We have attempted to answer three research questions in this paper:

  • RQ1: Discuss the ability of different NLP Methods to generate features that help detect code smells. Traditional code smell detection techniques rely on code smell metrics. In this work, we have instead manually constructed 100 features derived from peer developers’ reviews of the software’s source code. We use the Continuous Bag of Words (CBOW) and Skip-gram methods to construct features for detection. We use accuracy, area under the curve, and F1 score to compare each technique’s performance.

  • RQ2: Explore the potential of Data Sampling Techniques to discover code smells. Instead of using just the original data, we have used three sampling techniques to generate datasets: SMOTE [1] (Synthetic Minority Over-sampling Technique), Borderline-SMOTE [4], and SVM-SMOTE (Support Vector Machine SMOTE) [7]. Together with the original data, this gives us four datasets. We compare performance on these datasets using area under the curve and statistical significance tests.

  • RQ3: Study the capacity of various ELM Kernels to predict code smells. We have used three Extreme Learning Machine kernels to detect code smells from the various features and datasets. Linear Kernel (LINK), Radial Basis Function kernel (RBF), and Polynomial kernel have been used for classification. Their performance has been compared using statistical significance tests and Area Under the Curve Analysis.

Organization: The paper is organized as follows. Section 2 summarizes the related work. Section 3 offers an in-depth review of all the components used in the experiment. Section 4 describes the study’s framework pipeline and how the components described in Section 3 interact with each other. Section 5 provides the experimental outcome, and Section 6 answers the questions raised in the introduction. Section 7 concludes our research.

2 Related Work

Evgeniy et al. used contextual analysis of document data to generate features, making use of word sense disambiguation. Long Ma et al. used Word2Vec to output word vectors that represent large pieces of text or entire documents. They used CBOW and skip-gram as component models of Word2Vec to create word vectors and then evaluate word similarity. [6]

Hui Han et al. introduced Borderline-SMOTE, a variant of SMOTE in which only minority examples near the borderline are over-sampled. [4] Josey Mathew et al. proposed a kernel-based SMOTE (SVM-SMOTE) algorithm that directly generates the minority data points. Their proposed technique performs better than other SMOTE techniques on 51 benchmark datasets. [7]

Guang-Bin Huang et al. proposed the Extreme Learning Machine (ELM), which randomly chooses hidden nodes and analytically determines the output weights of Single-hidden Layer Feed-forward Neural Networks (SLFNs). [5]

Francisco Fernandez-Navarro et al. proposed a modified version of ELM, which uses a Gaussian distribution to parameterize the radial basis function. ELM is used to optimize the parameters of the model. [2]


| Code smell | Repository # without any code smell | Repository # with code smell | Percent of classes without code smell | Percent of classes with code smell |
|---|---|---|---|---|
| Complex Class (CC) | 230 | 399 | 63.4% | 36.5% |
| Leaking Inner Class (LIC) | 160 | 469 | 74.5% | 25.4% |
| Blob Class (BLOB) | 460 | 169 | 26.8% | 73.1% |
| No Low Memory Resolver (NLMR) | 190 | 439 | 69.7% | 30.2% |
| Internal Getter/Setter (IGS) | 264 | 365 | 58.02% | 41.9% |
| Member Ignoring Method (MIM) | 265 | 364 | 57.8% | 42.1% |
| Swiss Army Knife (SAK) | 155 | 474 | 75.3% | 24.6% |
| Long Method (LM) | 225 | 404 | 64.2% | 35.7% |

Table 1: Statistics on code smell distribution by type

3 Research Methodology

Figure 1: Flowchart of the Research Framework

A detailed description of the Dataset, Data Sampling Techniques, Feature Generation Techniques, Feature Selection Techniques, and Classification Algorithms is given below.

3.1 Experimental Data Set

In this research, our main database comprised 629 freely available software packages. Our dataset consisted of a list of packages and the code smells present in them. The characteristics and patterns exhibited by all of the code smells are presented in Table 1. Table 1 shows that the code smells are present in a range from 25.4% to 73.1%. We also observed that the lowest presence of any code smell was 25.4%, for the BLOB Class code smell, while the highest presence, 75.04%, was for the Swiss Army Knife (SAK) code smell.

3.2 Data Sampling Techniques

We use three data sampling techniques to generate additional datasets and mitigate the class imbalance in the dataset:

  • SMOTE [1] selects a minority-class sample and randomly chooses one of its k nearest minority-class neighbors. A synthetic sample is then created between the selected sample and that neighbor.

  • Borderline-SMOTE [4] works on a similar principle but creates synthetic data only along the boundary between the classes, so as not to introduce external bias.

  • SVM-SMOTE [11, 7] uses a Support Vector Machine instead of k nearest neighbors to generate samples between a chosen sample and the decision boundary.
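The interpolation step shared by these techniques can be sketched in a few lines. This is only an illustration of basic SMOTE under simple assumptions (Euclidean distance, a hypothetical `smote` helper, toy two-dimensional data); it is not the implementation used in the study, and the borderline and SVM variants differ in how they pick the base samples.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class (hypothetical data): the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, n_new=6, k=2)
```

Because every synthetic point lies on a segment between two existing minority samples, the new data stays inside the region the minority class already occupies.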

3.3 Feature Generation Techniques

We use two architectures from the word2vec [6] family, namely the Continuous Bag of Words (CBOW) and Skip-gram. The CBOW method [12] uses the surrounding words to predict the current word. Since it derives from the bag-of-words model, words in the window surrounding the current word are not differentiated by their distance from it. The skip-gram model [3] uses the current word to predict the context. Here, nearer words are weighted more heavily than words farther away from the current word. Comparing the two models, CBOW is faster than skip-gram, but skip-gram performs better when uncommon words are involved.
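The difference between the two architectures is easiest to see in the training examples each one builds from a sentence. The sketch below is illustrative only: the helper names, the window size, and the sample comment are assumptions, and it shows the pairing step rather than the embedding training itself.

```python
def cbow_pairs(tokens, window=2):
    """(context words, target word): the surrounding words predict the current word."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

def skipgram_pairs(tokens, window=2):
    """(current word, context word): the current word predicts each neighbour."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical review comment, tokenized on whitespace.
comment = "this method is too long".split()
cbow = cbow_pairs(comment, window=1)
skip = skipgram_pairs(comment, window=1)
```

CBOW yields one example per word, with the whole context treated as an unordered bag, while skip-gram yields one example per (word, neighbour) pair, which is part of why it trains more slowly.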

3.4 Feature Selection Techniques

We have generated 100 feature metrics, but not all of them may be relevant to the code smells we have considered. We use the Wilcoxon rank-sum test to assess whether each feature differs significantly between the smelly and clean applications. We set 0.05 as the threshold for the p-value, and we reject the null hypothesis if the value is lower. We then employ cross-correlation analysis to select uncorrelated features. Our selected variables share a high correlation with the output variable and have a low correlation among themselves.
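A significance filter of this kind can be sketched as follows, using the rank-sum form of the test named in the introduction and in Tables 5 and 6. The normal approximation without tie correction, the helper names, and the toy data are all assumptions made for illustration; the cross-correlation step is not shown.

```python
import math

def ranksum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, no tie correction)."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[t] = avg
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)  # rank sum of x
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

def select_features(smelly, clean, alpha=0.05):
    """Keep the indices of features that differ significantly between the two groups."""
    return [f for f in range(len(smelly[0]))
            if ranksum_pvalue([r[f] for r in smelly], [r[f] for r in clean]) < alpha]

# Hypothetical data: feature 0 separates the groups, feature 1 does not.
smelly = [[5.1, 0.2], [5.5, 0.9], [6.0, 0.4], [5.8, 0.7], [6.2, 0.1], [5.3, 0.6]]
clean = [[1.0, 0.3], [1.4, 0.8], [0.9, 0.5], [1.2, 0.65], [1.1, 0.15], [1.3, 0.55]]
selected = select_features(smelly, clean)
```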

Figure 2: Architecture diagram of the Extreme Learning Machine kernel

3.5 Classification Algorithms

This paper uses three ELM kernel functions [5, 8] to train models to predict code smells, namely the linear kernel function, the radial basis kernel function, and the polynomial kernel function [9]. As shown in Figure 2, Extreme Learning Machines (ELMs) can be simply defined as feed-forward neural networks, and they can be used for clustering, classification, and regression, among other things. Kernel functions are mathematical functions used to transform training data into higher dimensions. Which of the three kernels works best depends on whether the data is linearly separable and whether the problem is linear or nonlinear. The linear kernel is generally chosen when dealing with linearly separable data; it is also commonly used when a dataset has many features. The radial basis function (RBF) kernel is a nonlinear kernel used, for example, for training SVMs on nonlinear problems. The polynomial kernel function is also used to train nonlinear models. Models with the linear or polynomial kernel are faster and require fewer resources to train than those with the RBF kernel, but they tend to be less accurate. We also use ten-fold cross-validation to mitigate overfitting and selection bias and to gain insight into our model’s performance on an independent dataset. We use the area under the curve (AUC) and F-measure, among other measures, to compare their performance.
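To make the role of the kernel concrete, the following sketch implements a small kernel ELM classifier. It assumes the commonly used closed-form solution in which the output weights are obtained by solving (I/C + Ω)β = T, where Ω is the kernel matrix of the training data; the paper does not state its exact formulation or hyperparameters, so the regularization constant `C`, `gamma`, `degree`, and `coef0` are illustrative choices, and the XOR data is a toy example.

```python
import numpy as np

def linear_kernel(A, B):
    return A @ B.T

def polynomial_kernel(A, B, degree=2, coef0=1.0):
    return (A @ B.T + coef0) ** degree

def rbf_kernel(A, B, gamma=1.0):
    # Squared Euclidean distances between all rows of A and all rows of B
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

class KernelELM:
    """Binary kernel ELM sketch: beta = (I/C + K(X, X))^-1 T."""

    def __init__(self, kernel, C=100.0):
        self.kernel, self.C = kernel, C

    def fit(self, X, y):
        self.X = X
        T = np.where(y == 1, 1.0, -1.0)  # encode the two classes as +1 / -1
        omega = self.kernel(X, X)
        self.beta = np.linalg.solve(np.eye(len(X)) / self.C + omega, T)
        return self

    def predict(self, X):
        return (self.kernel(X, self.X) @ self.beta > 0).astype(int)

# XOR is not linearly separable, so it separates the kernels clearly.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1, 1, 0])
rbf_pred = KernelELM(rbf_kernel).fit(X, y).predict(X)
lin_pred = KernelELM(linear_kernel).fit(X, y).predict(X)
```

On this toy problem the RBF model reproduces the labels while the linear model cannot, which mirrors the distinction between linear and nonlinear kernels drawn above.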

4 Research Framework

We make use of the code data from 629 open-source software packages on GitHub. To address the class imbalance problem in the data, we use SMOTE, Borderline-SMOTE, and SVM-SMOTE [1, 4, 7], which gives four datasets: the Original Dataset (ORD), the SMOTE dataset, the Borderline-SMOTE dataset, and the SVM-SMOTE dataset. We use three kernel functions: the linear kernel function, the radial basis kernel function, and the polynomial kernel function. To compare performance across all four datasets, we use accuracy, the area under the curve (AUC), and F-measure, among other tests. Figure 1 provides a clear representation of the framework.
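The evaluation side of this pipeline can be sketched with plain Python. The functions below are a minimal illustration of ten-fold splitting and of the three reported measures; the helper names, the interleaved fold assignment, and the sample labels and scores are assumptions, not details taken from the study.

```python
def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k roughly equal, interleaved folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f_measure(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and classifier scores, thresholded at 0.5.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
y_pred = [int(s >= 0.5) for s in scores]
folds = list(k_fold_indices(629, k=10))  # 629 packages, as in the dataset
```

Accuracy and F-measure depend on the chosen threshold, whereas AUC is computed from the ranking of the scores alone, which is one reason the three measures can disagree.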

Original Data
BLOB 62.80 100.00 100.00 73.77 100.00 100.00 62.64 100.00 72.02 63.91 68.04 76.63
LM 75.36 100.00 100.00 75.68 99.84 100.00 75.36 100.00 100.00 75.36 100.00 100.00
SAK 75.20 100.00 100.00 74.72 100.00 100.00 75.83 75.83 100.00 73.77 100.00 94.28
CC 71.70 100.00 100.00 71.54 100.00 100.00 70.27 100.00 100.00 70.75 100.00 93.80
IGS 67.73 99.68 100.00 68.04 63.43 100.00 61.84 100.00 65.66 57.71 74.56 100.00
MIM 58.98 100.00 100.00 59.46 82.35 100.00 59.62 100.00 94.59 59.62 88.71 100.00
NLMR 81.08 100.00 100.00 81.56 100.00 100.00 75.04 100.00 99.84 75.04 94.75 93.16
LIC 70.11 100.00 100.00 65.18 85.06 100.00 64.86 100.00 97.46 66.45 78.54 100.00
(a) Accuracy values
Original Data
BLOB 0.64 1.00 1.00 0.78 1.00 1.00 0.60 1.00 0.78 0.62 0.86 0.84
LM 0.72 1.00 1.00 0.75 1.00 1.00 0.72 1.00 1.00 0.67 1.00 1.00
SAK 0.71 1.00 1.00 0.68 1.00 1.00 0.69 0.72 1.00 0.65 1.00 0.99
CC 0.75 1.00 1.00 0.70 1.00 1.00 0.65 1.00 1.00 0.69 1.00 0.98
IGS 0.74 1.00 1.00 0.74 0.74 1.00 0.65 1.00 0.73 0.61 0.84 1.00
MIM 0.63 1.00 1.00 0.61 0.91 1.00 0.62 1.00 0.99 0.61 0.96 1.00
NLMR 0.84 1.00 1.00 0.85 1.00 1.00 0.71 1.00 1.00 0.67 0.99 0.98
LIC 0.76 1.00 1.00 0.67 0.98 1.00 0.67 1.00 1.00 0.68 0.94 1.00
(b) AUC Values
Table 2: Area Under Curve and Accuracy figures for ELM models trained on the original dataset
Figure 3: Box plot comparison between All Metrics

5 Experimental Results

Tables 2A and 2B give accuracy and AUC values for all the ELM methods, using the feature engineering methods and all feature selection techniques. Table 3 and Table 4 summarize the statistical measures of the different metrics used in our research. It is evident from Tables 2A and 2B that the radial basis function and polynomial kernels perform much better than the linear kernel for most samples. Table 3B and Figure 3 show that models trained using all metrics perform better than those using only significant metrics. The high values of the performance indicators encourage the use of the code smell prediction model. The following observations can be made from the results obtained:

  • The performance of the models varies greatly, with a minimum accuracy of 56.41% and a maximum accuracy of 100%. AUC follows a similar trend to accuracy, but F-measure varies the most, from a minimum of 0.02 to a maximum of 1.

  • The radial basis and polynomial kernels perform much better than the linear kernel across all three statistical measures, that is, AUC, accuracy, and F-measure, which also indicates the high efficiency of the developed models.

  • The linear kernel performs best on the NLMR class (77.5%) and worst on the MIM class (61.95%).
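The summary statistics reported in the following tables (minimum, maximum, mean, median, and the 25th and 75th percentiles) can be computed as sketched below. The input is the row of twelve accuracy values reported for Long Method (LM) in Table 2(a); the `summarize` helper and the "inclusive" quartile convention are assumptions, since the paper does not say how its percentiles were calculated.

```python
import statistics

def summarize(values):
    """Box-plot style summary of a list of performance values."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": median,
        "25th": q1,
        "75th": q3,
    }

# Accuracy values of the LM row in Table 2(a).
lm_accuracy = [75.36, 100.00, 100.00, 75.68, 99.84, 100.00,
               75.36, 100.00, 100.00, 75.36, 100.00, 100.00]
stats = summarize(lm_accuracy)
```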

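| Metric | Technique | Min | Max | Mean | Median | 25th | 75th |
|---|---|---|---|---|---|---|---|
| Accuracy | CBOW | 57.98 | 100.00 | 88.63 | 100.00 | 74.61 | 100.00 |
| Accuracy | SKM | 56.41 | 100.00 | 87.82 | 99.85 | 73.98 | 100.00 |
| AUC | CBOW | 0.59 | 1.00 | 0.90 | 1.00 | 0.78 | 1.00 |
| AUC | SKM | 0.59 | 1.00 | 0.90 | 1.00 | 0.80 | 1.00 |
| F-measure | CBOW | 0.02 | 1.00 | 0.86 | 1.00 | 0.74 | 1.00 |
| F-measure | SKM | 0.05 | 1.00 | 0.84 | 1.00 | 0.72 | 1.00 |

(a) Feature Generation Techniques

| Metric | Feature set | Min | Max | Mean | Median | 25th | 75th |
|---|---|---|---|---|---|---|---|
| Accuracy | ALM | 58.98 | 100.00 | 90.72 | 100.00 | 77.94 | 100.00 |
| Accuracy | SGM | 56.41 | 100.00 | 85.73 | 94.55 | 70.46 | 100.00 |
| AUC | ALM | 0.61 | 1.00 | 0.93 | 1.00 | 0.84 | 1.00 |
| AUC | SGM | 0.59 | 1.00 | 0.88 | 0.99 | 0.73 | 1.00 |
| F-measure | ALM | 0.02 | 1.00 | 0.88 | 1.00 | 0.78 | 1.00 |
| F-measure | SGM | 0.02 | 1.00 | 0.83 | 0.95 | 0.68 | 1.00 |

(b) Features Selection Metrics

Table 3: Statistical Measure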
Figure 4: Accuracy, AUC and F-measure box-plot of different feature generation techniques

6 Comparison

RQ1: Discuss the ability of different NLP Methods to generate features that help detect code smells.
Table 3A and Figure 4 show that CBOW performs slightly better than skip-gram on the accuracy and F-measure metrics. CBOW is known to be several times faster to train than skip-gram. CBOW performs marginally better when common words are considered, while skip-gram performs better on rare words or phrases [11]. Our model performs better with CBOW, indicating that the user comments from which the feature vectors are generated contain more common words than rare words. Table 6B shows the result of the rank-sum test on the vectors generated using these two methods, and we can conclude that the vectors generated are highly uncorrelated.

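| Metric | Kernel | Min | Max | Mean | Median | 25th | 75th |
|---|---|---|---|---|---|---|---|
| Accuracy | LINK | 56.41 | 82.59 | 69.84 | 70.57 | 64.39 | 75.11 |
| Accuracy | RBFK | 63.43 | 100.00 | 98.52 | 100.00 | 100.00 | 100.00 |
| Accuracy | POLYK | 64.92 | 100.00 | 96.32 | 100.00 | 95.57 | 100.00 |
| AUC | LINK | 0.59 | 0.90 | 0.74 | 0.74 | 0.66 | 0.80 |
| AUC | RBFK | 0.72 | 1.00 | 0.99 | 1.00 | 1.00 | 1.00 |
| AUC | POLYK | 0.71 | 1.00 | 0.98 | 1.00 | 0.99 | 1.00 |
| F-measure | LINK | 0.02 | 0.86 | 0.62 | 0.69 | 0.59 | 0.75 |
| F-measure | RBFK | 0.32 | 1.00 | 0.98 | 1.00 | 1.00 | 1.00 |
| F-measure | POLYK | 0.45 | 1.00 | 0.96 | 1.00 | 0.96 | 1.00 |

(a) Performance on various ELM Kernels

| Metric | Dataset | Min | Max | Mean | Median | 25th | 75th |
|---|---|---|---|---|---|---|---|
| Accuracy | ORD | 57.71 | 100.00 | 86.66 | 96.10 | 73.77 | 100.00 |
| Accuracy | SMOTE | 57.02 | 100.00 | 88.78 | 100.00 | 74.46 | 100.00 |
| Accuracy | BSMOTE | 56.41 | 100.00 | 88.50 | 100.00 | 73.14 | 100.00 |
| Accuracy | SVMSMOTE | 58.69 | 100.00 | 88.96 | 99.95 | 74.61 | 100.00 |
| AUC | ORD | 0.60 | 1.00 | 0.88 | 0.99 | 0.72 | 1.00 |
| AUC | SMOTE | 0.61 | 1.00 | 0.91 | 1.00 | 0.81 | 1.00 |
| AUC | BSMOTE | 0.59 | 1.00 | 0.91 | 1.00 | 0.80 | 1.00 |
| AUC | SVMSMOTE | 0.61 | 1.00 | 0.92 | 1.00 | 0.82 | 1.00 |
| F-measure | ORD | 0.02 | 1.00 | 0.74 | 0.96 | 0.50 | 1.00 |
| F-measure | SMOTE | 0.58 | 1.00 | 0.89 | 1.00 | 0.75 | 1.00 |
| F-measure | BSMOTE | 0.57 | 1.00 | 0.89 | 1.00 | 0.74 | 1.00 |
| F-measure | SVMSMOTE | 0.59 | 1.00 | 0.89 | 1.00 | 0.75 | 1.00 |

(b) Performance on different Datasets

Table 4: Statistics about different datasets and machine learning models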
Figure 5: Comparison between different data sampling techniques.

RQ2: Explore the potential of Data Sampling Techniques to discover code smells.

Table 4B and Figure 5 show that the data sampling techniques perform better than the original data in AUC, accuracy, and F-measure. Although all three sampling techniques (SMOTE, Borderline-SMOTE, and SVM-SMOTE) perform nearly the same, SVM-SMOTE performs the best. SVM-SMOTE outperforms the others because they rely on KNN, whereas an SVM can employ kernels to find a better hyperplane in higher dimensions. KNN uses Euclidean distance, which may not work as well in the same case. Also, KNN computes the distances to the nearest neighbors, leading to poorer performance when working on a large dataset.

Table 5A gives the result of the rank-sum test on the datasets generated using these methods. We observe that all the datasets generated by the sampling techniques vary a lot from the original dataset, and we can conclude that the datasets are highly uncorrelated. We also observe that SMOTE, Borderline-SMOTE, and SVM-SMOTE are very similar to each other, and hence the performance of the models trained on them also shows similar trends.

Figure 6: Box-plot comparison between different ELM Kernel methods
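| | ORD | SMOTE | BSMOTE | SVMSMOTE |
|---|---|---|---|---|
| ORD | 1.00 | 0.12 | 0.19 | 0.12 |
| SMOTE | 0.12 | 1.00 | 0.76 | 0.89 |
| BSMOTE | 0.19 | 0.76 | 1.00 | 0.83 |
| SVMSMOTE | 0.12 | 0.89 | 0.83 | 1.00 |

(a) Different sampling methods

| | LINK | RBFK | POLYK |
|---|---|---|---|
| LINK | 1.00 | 0.00 | 0.00 |
| RBFK | 0.00 | 1.00 | 0.00 |
| POLYK | 0.00 | 0.00 | 1.00 |

(b) Model similarity

Table 5: Ranksum Test

| | ALM | SGM |
|---|---|---|
| ALM | 1.00 | 0.00 |
| SGM | 0.00 | 1.00 |

(a) Feature Combination

| | CBOW | SKM |
|---|---|---|
| CBOW | 1.00 | 0.43 |
| SKM | 0.43 | 1.00 |

(b) Feature Generation Methods

Table 6: Ranksum Test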

RQ3: Study the capacity of various ELM Kernels to predict code smells.

Table 4A and Figure 6 show the three kernel methods’ performance in terms of accuracy, AUC, and F-measure. Since our data does not have a linear distribution, the linear kernel’s performance is relatively lackluster. The polynomial and RBF kernels both perform significantly better than the linear kernel due to a fixed small number of features. The RBF kernel shows the best performance of the three. Table 5B shows the result of the rank-sum tests on models generated using the different ELM kernels. We observe that the prediction models developed using the various kernels are significantly different from each other, and the models are highly unrelated.

7 Conclusion

This paper provides the empirical evaluation of code smell prediction utilizing various ELM methods, feature generation methods using NLP techniques, feature selection, and data sampling techniques. The models are evaluated using ten-fold cross-validation, and their prediction abilities are compared using accuracy, AUC, and F-measure. We draw the following conclusions from our research study:

  • CBOW performs better than skip-grams in feature generation.

  • SVM-SMOTE performs best among the data sampling techniques.

  • Models based on all metrics perform better than models based on significant metrics selected using the Wilcoxon rank-sum test.

  • The RBF kernel performs best among the ELM methods in predicting code smells.


References

  • [1] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research 16, pp. 321–357.
  • [2] F. Fernández-Navarro, C. Hervás-Martínez, J. Sanchez-Monedero, and P. A. Gutiérrez (2011) MELM-grbf: a modified version of the extreme learning machine for generalized radial basis function neural networks. Neurocomputing 74 (16), pp. 2502–2510.
  • [3] D. Guthrie, B. Allison, W. Liu, L. Guthrie, and Y. Wilks (2006) A closer look at skip-gram modelling. In LREC, Vol. 6, pp. 1222–1225.
  • [4] H. Han, W. Wang, and B. Mao (2005) Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing, pp. 878–887.
  • [5] G. Huang, Q. Zhu, and C. Siew (2006) Extreme learning machine: theory and applications. Neurocomputing 70 (1-3), pp. 489–501.
  • [6] L. Ma and Y. Zhang (2015) Using word2vec to process big text data. In 2015 IEEE International Conference on Big Data (Big Data), pp. 2895–2897.
  • [7] J. Mathew, M. Luo, C. K. Pang, and H. L. Chan (2015) Kernel-based smote for svm classification of imbalanced datasets. In IECON 2015-41st Annual Conference of the IEEE Industrial Electronics Society, pp. 001127–001132.
  • [8] C. A. Micchelli, M. Pontil, and P. Bartlett (2005) Learning the kernel function via regularization. Journal of machine learning research 6 (7).
  • [9] G. L. Prajapati and A. Patle (2010) On performing classification using svm with radial basis and polynomial kernel functions. In 2010 3rd International Conference on Emerging Trends in Engineering and Technology, pp. 512–515.
  • [10] E. Van Emden and L. Moonen (2002) Java quality assurance by detecting code smells. In Ninth Working Conference on Reverse Engineering, 2002. Proceedings., pp. 97–106.
  • [11] Q. Wang, Z. Luo, J. Huang, Y. Feng, and Z. Liu (2017) A novel ensemble method for imbalanced data learning: bagging of extrapolation-smote svm. Computational intelligence and neuroscience 2017.
  • [12] Q. Wang, J. Xu, H. Chen, and B. He (2017) Two improved continuous bag-of-word models. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2851–2856.