1 Introduction
The existence of code smells in source code points towards poor design and violations of standard coding practices [10]. Code smells may not necessarily be identified as defects in the current phase of the software, but the affected classes have a high likelihood of developing bugs in the future. Since code smells do not directly cause defects, they are typically identified by inspection, i.e., by manually combing through thousands of lines of code. This method is highly disorganized and costly, and it becomes more inefficient as the size of the code package grows. In our work, we automate this process of identifying code smells. We use the input source code packages to build a set of source code metrics and develop models to locate and predict code smells in the application packages. These models will reduce the cost and improve the efficiency of maintaining software while enforcing standard coding practices and improving software quality.
In this paper, we use three kernels (linear, radial basis function, and polynomial) to develop models that predict the following eight code smells: Swiss Army Knife (SAK), Long Method (LM), Member Ignoring Method (MIM), No Low Memory Resolver (NLMR), Blob Class (BLOB), Internal Getter/Setter (IGS), Leaking Inner Class (LIC), and Complex Class (CC). The source code metrics are derived from the applications' source code packages and are used to engineer relevant features and select relevant metrics. We use the Wilcoxon rank-sum test to achieve the second of these objectives, namely selecting relevant metrics. We analyze the performance of the various kernel functions in predicting code smells using accuracy, area under the curve (AUC), and F-measure. We attempt to answer three research questions in this paper:
- RQ1: Discuss the ability of different NLP methods to generate features that help detect code smells. Traditional code smell detection techniques rely on source code metrics to detect code smells. In this work, we have manually constructed 100 features derived from peer developers' reviews of the software's source code. We use the Continuous Bag of Words (CBOW) and Skip-gram methods to construct features for detection, and we compare each technique's performance using accuracy, area under the curve (AUC), and F1 score.
- RQ2: Explore the potential of data sampling techniques to discover code smells. Instead of using just the original data, we use three sampling techniques to generate datasets. SMOTE [1] (Synthetic Minority Over-sampling Technique), Borderline SMOTE [4], and SVM SMOTE (Support Vector Machine SMOTE) [7], along with the original data, give us four datasets. We compare the performance of these datasets using AUC and statistical significance tests.
- RQ3: Study the capacity of various ELM kernels to predict code smells. We use three Extreme Learning Machine (ELM) kernels to detect code smells from the various features and datasets: the linear kernel (LINK), the radial basis function (RBF) kernel, and the polynomial kernel. Their performance is compared using statistical significance tests and AUC analysis.
Organization: The paper is organized as follows. Section 2 summarizes the related work. Section 3 offers an in-depth review of all the components used in the experiment. Section 4 describes the study's framework pipeline and how the components described in Section 3 interact with each other. Section 5 provides the experimental results, and Section 6 answers the questions raised in the introduction. Section 7 concludes our research.
2 Related Work
Evgeniy et al. used contextual analysis of document data to generate features, making use of word sense disambiguation. Long Ma et al. used Word2Vec to output word vectors that represent large pieces of text or entire documents. They used CBOW and skip-gram as component models of Word2Vec to create word vectors and then evaluate word similarity [6].
Hui Han et al. introduced Borderline-SMOTE, a variant of SMOTE in which only minority examples near the borderline are over-sampled [4]. Josey Mathew et al. proposed a kernel-based SMOTE (SVM SMOTE) algorithm that directly generates the minority data points; their technique performs better than other SMOTE techniques on 51 benchmark datasets [7]. Guang-Bin Huang et al. proposed the Extreme Learning Machine (ELM), which randomly chooses hidden nodes and determines the output weights of single-hidden-layer feedforward neural networks (SLFNs) [5]. Francisco Fernandez-Navarro et al. proposed a modified version of ELM that uses a Gaussian distribution to parameterize the radial basis function; ELM is then used to optimize the parameters of the model [2].
Table 1

| Code smell | | | | |
|---|---|---|---|---|
| Complex Class (CC) | 230 | 399 | 63.4% | 36.5% |
| Leaking Inner Class (LIC) | 160 | 469 | 74.5% | 25.4% |
| Blob Class (BLOB) | 460 | 169 | 26.8% | 73.1% |
| No Low Memory Resolver (NLMR) | 190 | 439 | 69.7% | 30.2% |
| Internal Getter/Setter (IGS) | 264 | 365 | 58.02% | 41.9% |
| Member Ignoring Method (MIM) | 265 | 364 | 57.8% | 42.1% |
| Swiss Army Knife (SAK) | 155 | 474 | 75.3% | 24.6% |
| Long Method (LM) | 225 | 404 | 64.2% | 35.7% |
3 Research Methodology

A detailed description of the Dataset, Data Sampling Techniques, Feature Generation Techniques, Feature Selection Techniques, and Classification Algorithms is given below.
3.1 Experimental Data Set
In this research, our main dataset comprises 629 freely available software packages. It consists of a list of packages and the code smells present in them. The characteristics and patterns exhibited by all of the code smells are presented in Table 1. Table 1 shows that the code smells are present in a range from 25.4% to 73.1%. We also observed that the lowest presence of any code smell was 25.4%, for the Blob Class (BLOB) code smell, while the highest presence, 75.04%, was for the Swiss Army Knife (SAK) code smell.
3.2 Data Sampling Techniques
We use three data sampling techniques to generate additional datasets and mitigate the bias in the dataset:
- SMOTE [1] picks a minority-class sample and randomly chooses one of its k nearest neighbors from the minority class. The synthetic data point is created between the selected sample and that neighbor.
- Borderline SMOTE [4] works on a similar principle but creates data only along the boundary between the classes, so as not to introduce external bias.
- SVM SMOTE [7] is a kernel-based SMOTE algorithm that directly generates the minority data points.
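The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a minimal illustration in plain Python and not the implementation used in the study; the function name `smote_sample`, its parameters, and the toy data are assumptions made for the example.

```python
import math
import random


def smote_sample(minority, k=5, n_new=10, seed=0):
    """Create n_new synthetic points between minority samples and their neighbours.

    Hypothetical helper for illustration: `minority` is a list of feature
    vectors that all belong to the minority class.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen sample
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # random position on the segment between the sample and its neighbour
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sample(minority, k=2, n_new=5)
```

Every synthetic point is a convex combination of two existing minority samples, so it stays inside the region already occupied by the minority class. Borderline variants restrict the choice of `base` to samples close to the class boundary.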
3.3 Feature Generation Techniques
We use two architectures from the word2vec [6] family, namely Continuous Bag of Words (CBOW) and Skip-gram. The CBOW method [12] uses the surrounding words to predict the current word. Since it derives from the bag-of-words model, words in the window surrounding the current word are not differentiated by their distance from the current word. The skip-gram model [3] uses the current word to predict the context. Here, nearer words are weighted more heavily than words farther from the current word. Comparing the two models, CBOW is faster than skip-gram, but skip-gram performs better when uncommon words are involved.
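The difference between the two architectures is easiest to see in the training pairs each one builds from a sentence. The sketch below only constructs those pairs and does not train any embedding. The function names, the window size, and the sample review text are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    """CBOW view: the surrounding words (input) predict the current word (target)."""
    pairs = []
    for i, target in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Skip-gram view: the current word (input) predicts each context word (target)."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical review text, tokenized by whitespace
review = "this method is too long".split()
cbow = cbow_pairs(review, window=1)
skipgram = skipgram_pairs(review, window=1)
```

CBOW produces one training example per position, with the whole context as a single unordered input. Skip-gram produces one example per (word, context word) pair, which yields more updates per sentence and is consistent with it being slower to train.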
3.4 Feature Selection Techniques
We have generated 100 feature metrics, but not all of them may be relevant to the code smells we have considered. We use the Wilcoxon rank-sum test to assess the statistical relation between each feature and the smelly and clean applications. We set 0.05 as the threshold for the p-value and reject the null hypothesis if the value is lower. We then employ cross-correlation analysis to select uncorrelated features. The selected variables share a high correlation with the output variables and have a low correlation among themselves.
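The significance filter can be sketched as follows. This is an illustrative version only: it assumes a two-sided rank-sum test under the normal approximation without tie correction, and the helper names and toy data are hypothetical. The cross-correlation step is not shown.

```python
import math


def ranksum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, no tie correction)."""
    combined = sorted((v, g) for g, vals in enumerate((x, y)) for v in vals)
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for t in range(i, j + 1):
            ranks[t] = (i + j) / 2 + 1  # tied values share the average rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def select_features(smelly, clean, alpha=0.05):
    """Keep the indices of features whose values differ between the two groups."""
    keep = []
    for f in range(len(smelly[0])):
        p = ranksum_pvalue([row[f] for row in smelly], [row[f] for row in clean])
        if p < alpha:
            keep.append(f)
    return keep


# Toy data: feature 0 separates the groups, feature 1 does not
smelly = [[5.1, 0.3], [5.5, 0.7], [6.0, 0.1], [5.8, 0.9], [6.2, 0.5], [5.3, 0.4]]
clean = [[1.0, 0.2], [1.4, 0.8], [0.9, 0.6], [1.2, 0.35], [1.1, 0.45], [1.3, 0.65]]
selected = select_features(smelly, clean)
```

On this toy data only feature 0 passes the 0.05 threshold, because every smelly value exceeds every clean value, while feature 1 is interleaved between the groups.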

Figure 2: Architecture diagram of the Extreme Learning Machine (ELM) kernel.
3.5 Classification Algorithms
This paper uses three ELM kernel functions [5, 8] to train models to predict code smells, namely the linear kernel function, the radial basis function (RBF) kernel, and the polynomial kernel function [9]. As shown in Figure 2, Extreme Learning Machines (ELM) can be simply defined as feed-forward neural networks, and they can be used for clustering, classification, and regression, among other things. The three kernel methods suit different kinds of data, depending on whether the data is linearly separable and whether the problem is linear or nonlinear. Kernel functions are mathematical functions used to transform training data into higher dimensions. The linear kernel is generally chosen when dealing with linearly separable data; it is also most commonly used when a dataset has many features. The RBF kernel is a nonlinear kernel commonly used for training SVMs on nonlinear problems. The polynomial kernel function is also used to train nonlinear models. Models with linear or polynomial kernels are faster and require fewer resources to train than models with the RBF kernel, but they are less accurate in comparison. We also use ten-fold cross-validation to overcome overfitting and selection-bias issues and to obtain insights into our model's performance on an independent dataset. We use the area under the curve (AUC) and F-measure, among other measures, to compare their performance.
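The three kernels and their use in a kernel-based ELM can be sketched as below. The paper does not spell out its solver, so the closed form used here, solving (K + I/C) β = t and scoring a new point as Σ βᵢ k(x, xᵢ), is an assumption based on a common kernel ELM formulation. The regularization constant `C`, the kernel parameters, and all function names are likewise hypothetical.

```python
import math


def linear_kernel(x, y):
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=2.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def fit_kernel_elm(X, t, kernel, C=100.0):
    """Output weights beta from (K + I/C) beta = t (assumed formulation)."""
    n = len(X)
    A = [[kernel(X[i], X[j]) + (1.0 / C if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve(A, t)


def predict(X_train, beta, kernel, x):
    """Decision score for x; its sign gives the predicted class."""
    return sum(b * kernel(x, xi) for b, xi in zip(beta, X_train))


# XOR-style toy data with targets +1 (smelly) and -1 (clean)
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
t = [-1.0, 1.0, 1.0, -1.0]
beta_rbf = fit_kernel_elm(X, t, rbf_kernel)
rbf_scores = [predict(X, beta_rbf, rbf_kernel, x) for x in X]
beta_lin = fit_kernel_elm(X, t, linear_kernel)
lin_scores = [predict(X, beta_lin, linear_kernel, x) for x in X]
```

On this toy data, which is not linearly separable, the RBF scores carry the correct sign for every point, while the linear kernel cannot separate the two classes. This mirrors the distinction between linear and nonlinear kernels described above.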
4 Research Framework
We use the code data from 629 open-source software packages on GitHub. To mitigate the class imbalance problem in the data, we use SMOTE, Borderline SMOTE, and SVM SMOTE [1, 4, 7] to obtain four datasets: the Original Dataset (ORD), the SMOTE dataset, the Borderline SMOTE dataset, and the SVM SMOTE dataset. We train models with three kernel functions: the linear kernel function, the radial basis function kernel, and the polynomial kernel function. To compare performance across all four datasets, we use the area under the curve (AUC) and F-measure, among other measures. Figure 1 provides a clear representation of this framework.
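For reference, the three performance measures used throughout the comparison can be computed as in the sketch below. It assumes binary labels (1 for smelly, 0 for clean) and a real-valued score per sample; the function names, the 0.5 decision threshold, and the toy values are illustrative assumptions.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (smelly) class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed decision threshold

acc = accuracy(y_true, y_pred)
f1 = f_measure(y_true, y_pred)
area = auc(y_true, scores)
```

Accuracy and F-measure depend on the chosen threshold, whereas AUC depends only on how the scores rank the samples. This is one reason the measures can diverge on imbalanced data.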

5 Experimental Results
Tables 2A and 2B give accuracy and AUC values for all the ELM methods, using the feature engineering methods and all feature selection techniques. Table 3 and Table 4 summarize the various statistical measures of the different metrics used in our research. It is evident from Tables 2A and 2B that the radial basis function and polynomial kernels perform much better than the linear kernel for most samples. Table 3B and Figure 3 show that models trained using all metrics perform better than those using only significant metrics. The high values of the performance indicators encourage the use of the code smell prediction model. The following observations can be made from the results obtained:
- The performance of the models varies greatly, with a minimum accuracy of 56.41% and a maximum accuracy of 100%. AUC follows a similar trend to accuracy, but F-measure varies the most, from a minimum of 0.02 to a maximum of 1.
- The radial basis and polynomial kernels perform much better than the linear kernel across all three performance measures, that is, AUC, accuracy, and F-measure, which also indicates the high efficiency of the developed models.
- The linear kernel performs best on the NLMR class (77.5%) and worst on the MIM class (61.95%).

6 Comparison
RQ1: Discuss the ability of different NLP Methods to generate features that help detect code smells.
Table 3A and Figure 4 show that CBOW performs slightly better than skip-gram on the accuracy and F-measure metrics. CBOW is known to be many times faster to train than skip-gram. CBOW performs marginally better when common words are considered, while skip-gram performs better on rare words or phrases [11]. Our model performs better with CBOW, indicating that the user comments from which the feature vectors are generated contain more common words than rare words.
Table 6B shows the result of the rank-sum test on the vectors generated using these two methods, from which we conclude that the generated vectors are highly uncorrelated.

RQ2: Explore the potential of Data Sampling Techniques to discover code smells.
Table 4B and Figure 5 show that the data sampling techniques perform better than the original data in AUC, accuracy, and F-measure. Although all three techniques (SMOTE, Borderline-SMOTE, and SVM-SMOTE) perform nearly the same, SVM-SMOTE performs the best. SVM-SMOTE performs better than the others because they rely on KNN. An SVM can employ kernels to find a better hyperplane in higher dimensions. KNN, on the other hand, uses Euclidean distance, which may not work well in the same case. Also, KNN computes the distance to the nearest neighbors, leading to poorer performance when working on a large dataset.
Table 5A gives the result of the rank-sum test on the datasets generated using these methods. We observe that all the datasets generated by the sampling techniques differ considerably from the original dataset, and we conclude that the datasets are highly uncorrelated. We also observe that SMOTE, Borderline-SMOTE, and SVM-SMOTE are very similar to each other, and hence the performance of the models trained on them also shows similar trends.

RQ3: Study the capacity of various ELM Kernels to predict code smells.
Table 4A and Figure 6 show the performance of the three kernel methods in terms of accuracy, AUC, and F-measure. Since our data does not have a linear distribution, the linear kernel method's performance is relatively lackluster. The polynomial and RBF kernels both perform significantly better than the linear kernel due to a fixed small number of features. The RBF kernel shows the best performance of the three. Table 5B shows the result of the rank-sum tests on the models generated using the different ELM kernels. We observe that the prediction models developed using the various methods are significantly different from each other, and the models are highly unrelated.
7 Conclusion
This paper provides an empirical evaluation of code smell prediction utilizing various ELM methods, feature generation methods using NLP techniques, feature selection, and data sampling techniques. The models are evaluated using ten-fold cross-validation, and their prediction abilities are compared using accuracy, AUC, and F-measure. We draw the following conclusions from our research study:
- CBOW performs better than skip-gram in feature generation.
- SVM-SMOTE performs best among the data sampling techniques.
- Models based on all metrics perform better than models based on the significant metrics selected using the Wilcoxon rank-sum test.
- The RBF kernel performs best among the ELM methods in predicting code smells.
References
- [1] (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357.
- [2] (2011) MELM-GRBF: a modified version of the extreme learning machine for generalized radial basis function neural networks. Neurocomputing 74 (16), pp. 2502–2510.
- [3] (2006) A closer look at skip-gram modelling. In LREC, Vol. 6, pp. 1222–1225.
- [4] (2005) Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pp. 878–887.
- [5] (2006) Extreme learning machine: theory and applications. Neurocomputing 70 (1-3), pp. 489–501.
- [6] (2015) Using word2vec to process big text data. In 2015 IEEE International Conference on Big Data (Big Data), pp. 2895–2897.
- [7] (2015) Kernel-based SMOTE for SVM classification of imbalanced datasets. In IECON 2015-41st Annual Conference of the IEEE Industrial Electronics Society, pp. 001127–001132.
- [8] (2005) Learning the kernel function via regularization. Journal of Machine Learning Research 6 (7).
- [9] (2010) On performing classification using SVM with radial basis and polynomial kernel functions. In 2010 3rd International Conference on Emerging Trends in Engineering and Technology, pp. 512–515.
- [10] (2002) Java quality assurance by detecting code smells. In Ninth Working Conference on Reverse Engineering, 2002. Proceedings., pp. 97–106.
- [11] (2017) A novel ensemble method for imbalanced data learning: bagging of extrapolation-SMOTE SVM. Computational Intelligence and Neuroscience 2017.
- [12] (2017) Two improved continuous bag-of-word models. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2851–2856.