On the Feasibility of Transfer-learning Code Smells using Deep Learning

Context: A substantial amount of work has been done to detect smells in source code using metrics-based and heuristics-based methods. Machine learning methods have been recently applied to detect source code smells; however, the current practices are considered far from mature. Objective: First, explore the feasibility of applying deep learning models to detect smells without extensive feature engineering, just by feeding the source code in tokenized form. Second, investigate the possibility of applying transfer-learning in the context of deep learning models for smell detection. Method: We use existing metric-based state-of-the-art methods for detecting three implementation smells and one design smell in C# code. Using these results as the annotated gold standard, we train smell detection models on three different deep learning architectures. These architectures use Convolution Neural Networks (CNNs) of one or two dimensions, or Recurrent Neural Networks (RNNs) as their principal hidden layers. For the first objective of our study, we perform training and evaluation on C# samples, whereas for the second objective, we train the models from C# code and evaluate the models over Java code samples. We perform the experiments with various combinations of hyper-parameters for each model. Results: We find it feasible to detect smells using deep learning methods. Our comparative experiments find that there is no clearly superior method between CNN-1D and CNN-2D. We also observe that the performance of the deep learning models is smell-specific. Our transfer-learning experiments show that transfer-learning is definitely feasible for implementation smells with performance comparable to that of direct-learning. This work opens up a new paradigm for detecting code smells by transfer-learning, especially for programming languages for which comprehensive code smell detection tools are not available.






1. Introduction

The metaphor of code smells is used to indicate the presence of quality issues in source code (Fowler, 1999; Sharma and Spinellis, 2018). A large number of smells in a software system is associated with a high level of technical debt (Kruchten et al., 2012) hampering the system’s evolution. Given the practical significance of code smells, software engineering researchers have studied the concept in detail and explored various aspects associated with it including causes, impacts, and detection methods (Sharma and Spinellis, 2018).

A large body of work has been carried out to detect smells in source code. Traditionally, metric-based (Marinescu, 2005; Salehie et al., 2006) and rule/heuristic-based (Moha et al., 2010; Sharma et al., 2016) smell detection techniques are commonly used (Sharma and Spinellis, 2018). In recent years, machine-learning based smell detection techniques (Maiga et al., 2012b; Czibula et al., 2015) have emerged as a potent alternative, as they not only have the potential to bring human judgment into smell detection but also provide grounds for transferring results from one problem to another. Researchers have used Bayesian belief networks (Khomh et al., 2009, 2011), support vector machines (Maiga et al., 2012a), and binary logistic regression (Bryton et al., 2010) to identify smells.

In particular, transfer-learning refers to the technique where a learning algorithm exploits the commonalities between different learning tasks to enable knowledge transfer across the tasks (Bengio et al., 2013). In this context, it would be interesting to explore the possibility of leveraging the availability of tools and data related to code smell detection in a programming language in order to train machine learning models that address the same problem on another language. The cross-application of a machine learning model could provide opportunities for detecting smells without actually developing a language-specific smell detection tool from scratch.

Despite the potential prospects, existing approaches for applying machine learning techniques to smell detection are considered far from mature. In a recent study, Di Nucci et al. (Nucci et al., 2018) note that the problem of detecting smells still requires extensive research to attain a maturity that would produce results of practical use. In addition, machine learning techniques (such as Bayesian networks, support vector machines, and logistic regression) that have been applied so far require considerable pre-processing to generate features from the source code, a substantial effort that hinders their adoption in practice. Traditionally, researchers use machine-learning methods that require extracting feature-sets from source code; typically, code metrics serve as the feature set for smell detection purposes. We perceive two shortcomings in such usage of machine-learning methods for detecting smells. First, we need an external tool to compute metrics for the target programming language on which we would like to apply the machine learning model. Languages that already have a metrics computation tool allow many smells to be deduced directly by combining these metrics (Marinescu, 2004; Sharma and Spinellis, 2018), making a machine-learning method redundant. Second, and more importantly, we limit the machine learning algorithm to use only the metrics that we compute and feed as the feature-set; therefore, the algorithm cannot observe any pattern that is not captured by the provided set of metrics.

In this context, deep learning models, specifically neural networks, offer a compelling alternative. The Convolution Neural Network (cnn) and the Recurrent Neural Network (rnn) are state-of-the-art supervised learning methods currently employed in many practical applications, including image recognition (Krizhevsky et al., 2012; Szegedy et al., 2015), voice recognition and processing (Sainath et al., 2015), and natural language processing (Johnson and Zhang, 2015). These advanced models are capable of inferring features during training and can learn to classify samples based on these inferred features.

In this paper, we present experiments with deep learning models with two specific goals:

  • To investigate whether deep learning methods—particularly architectures that include layers of cnns and rnns—can effectively detect code smells, and to examine how different models perform on detecting diverse code smells and how model performance is affected by tweaking the learning hyper-parameters.

  • To investigate whether results on smell detection through deep learning are transferable; specifically, to explore whether models trained for detecting smells on a programming language can be re-used to detect smells on another language.

Keeping these goals in mind, we define research questions and prepare an experimental setup to detect four smells viz. complex method, empty catch block, magic number, and multifaceted abstraction using deep learning models in different configurations. We develop a set of tools and scripts to automate the experiment and collate the results. Based on the results, we derive conclusions to our addressed research questions.

The contributions of this paper are summarized below.

  • An extensive study that applies deep learning models in detecting code smells and compares the performance of different methods; to the best of our knowledge this is the first study of this kind and scale.

  • An exploration regarding the feasibility of employing deep learning models in transfer-learning. This exploration can potentially open a new paradigm for detecting smells, specifically for programming languages for which comprehensive code smell detection tools are not available.

  • Openly available tools, scripts, and data used in this experiment (https://github.com/tushartushar/DeepLearningSmells) to promote replication as well as extension studies.

The rest of the paper is organized as follows. Section 2 sets up the stage by presenting background and related work. We define our research objective in Section 3 and research method in Section 4. Section 5 presents our findings, discussion, and further research opportunities. We present threats to validity of this work in Section 6 and conclude in Section 7.

2. Background and Related Work

In this section, we present the background about the topic of code smells as well as machine learning and elaborate on the related literature.

2.1. Code Smells

Kent Beck coined the term “code smell” (Fowler, 1999) and defined it as “certain structures in the code that suggest (or sometimes scream) for refactoring.” Code smells indicate the presence of quality problems impacting many facets of quality (Sharma and Spinellis, 2018) of a software system (Fowler, 1999; Suryanarayana et al., 2014). The presence of an excessive number of smells in a software system makes it hard to maintain and evolve.

Smells are categorized as implementation (Fowler, 1999), design (Suryanarayana et al., 2014), and architecture smells (Garcia et al., 2009a) based on their scope, granularity, and impact. Implementation smells are typically confined to a limited scope and impact (e.g., a method). Examples of implementation smells are long method, complex method, long parameter list, and complex conditional (Fowler, 1999). Design smells occur at higher granularity, i.e., abstractions, and hence are confined to a class or a set of classes. God class, multifaceted abstraction, cyclic-dependency modularization, and rebellious hierarchy are examples of design smells (Suryanarayana et al., 2014). Along similar lines, architecture smells span multiple components and have a system-wide impact. Some examples of architecture smells are god component (Lippert and Roock, 2006), feature concentration (de Andrade et al., 2014), and scattered functionality (Garcia et al., 2009b).

A plethora of work related to code smell detection exists in the software engineering literature. Researchers have proposed methods for detecting smells that can be largely divided into five categories (Sharma and Spinellis, 2018). Metric-based smell detection methods (Marinescu, 2005; Vidal et al., 2014; Salehie et al., 2006) take source code as input, prepare a source code model, such as an Abstract Syntax Tree (ast), compute a set of source code metrics, and detect smells by applying appropriate thresholds (Marinescu, 2005). Rule/Heuristic-based smell detection methods (Moha et al., 2010; Sharma et al., 2016; Arnaoudova et al., 2013; Tsantalis and Chatzigeorgiou, 2011) typically take source code models and sometimes additional software metrics as inputs. They detect a set of smells when the defined rules/heuristics get satisfied. History-based smell detection methods use source code evolution information (Palomba et al., 2015; Fu and Shen, 2015). Such methods extract structural information of the code and how it has changed over a period of time. This information is used by a detection model to infer smells in the code. Optimization-based smell detection approaches (Sahin et al., 2014; Ouni et al., 2015; Kessentini et al., 2014) apply an optimization algorithm on computed software metrics and, in some cases, existing examples of smells to detect new smells in the source code.

In recent times, machine learning-based smell detection methods have attracted software engineering researchers. Machine learning is a subfield of artificial intelligence that trains solutions to problems rather than modeling them through hard-coded rules. In this approach, the rules that solve a problem are not set a priori; rather, they are inferred in a data-driven manner. In supervised learning, a model is trained by being exposed to example instances of the problem along with their expected answers, from which statistical regularities are drawn. The representations learned from the data can in turn be applied and generalized to new, unseen data in a similar context.

A typical machine learning smell detection method starts with a mathematical model representing the smell detection problem. Existing examples and source code models are used to train the model, and the trained model is used to classify or predict code fragments as smelly or non-smelly instances. Foutse et al. (Khomh et al., 2009, 2011) use a Bayesian approach for the detection of smells. Their study forms a Bayesian graph using a set of metrics and determines the probability of whether a class exhibits a smell or not. Similarly, Abdou et al. (Maiga et al., 2012b, a) employ support vector machine-based classifiers trained using a set of object-oriented metrics for each class to detect design smells (blob, feature concentration, spaghetti code, and swiss army knife). Furthermore, Sérgio et al. (Bryton et al., 2010) detect long method smell instances by employing binary logistic regression; they use commonly used method metrics, such as Method Lines of Code (mloc) and cyclomatic complexity, as regressors. Barbez et al. (Barbez et al., 2019) present an ensemble method that combines the outcomes of multiple tools to detect god class and feature envy smells; they identify a set of key metrics for each smell and feed them to a cnn-based architecture. Fontana et al. (Fontana et al., 2016) compare the performance of various machine learning algorithms in detecting data class, god class, feature envy, and long method.

The performance of a machine learning task depends heavily on the formation of the evaluation samples. As the ratio of positive and negative samples becomes more balanced, the classification task becomes easier; hence, a network performs significantly better when classifying data from balanced datasets. Most of the above-mentioned approaches do not explicitly mention the ratio of positive and negative samples used for evaluation. Fontana et al. (Fontana et al., 2016) carry out their evaluation using a set of positive and negative samples for each smell that is considerably more balanced than a realistic case. The realistic samples that we use for evaluation are highly imbalanced (refer to Section 5); for example, the ratio of negative to positive samples in the 1d evaluation of the multifaceted abstraction smell is heavily skewed toward negatives.
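The impact of evaluation-set composition on reported metrics can be illustrated with a small sketch. The true-positive and true-negative rates below are hypothetical, chosen only to show how the same classifier yields very different F1 scores on balanced versus imbalanced evaluation sets:

```python
def f1_at_ratio(pos, neg, tpr=0.80, tnr=0.90):
    """F1 score of a hypothetical classifier with fixed true-positive
    and true-negative rates, evaluated on pos/neg samples."""
    tp = tpr * pos
    fn = pos - tp
    fp = (1 - tnr) * neg
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Same classifier, different evaluation-set composition:
balanced   = f1_at_ratio(pos=1000, neg=1000)   # 1:1 ratio
imbalanced = f1_at_ratio(pos=1000, neg=50000)  # 1:50, closer to real code
print(round(balanced, 3), round(imbalanced, 3))  # → 0.842 0.235
```

As negatives come to dominate, false positives overwhelm the true positives and precision collapses, which is why evaluations on artificially balanced sets can overstate practical performance.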

2.2. Deep Learning

Deep learning is a subfield of machine learning that allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction (LeCun et al., 2015; Goodfellow et al., 2016). Even though the idea of layered neural networks with internal “hidden” units was already introduced in the 80s (Rumelhart et al., 1986), a breakthrough in the field came in 2006 by Hinton et al. (Hinton et al., 2006) who introduced the idea of learning a hierarchy of features one level at a time. Ever since, and particularly during the course of the last decade, the field has taken off due to the advances in hardware, the release of benchmark datasets (Deng et al., 2009; Krizhevsky and Hinton, 2009; LeCun et al., 2010), and a growing research focus on optimization methods (Martens, 2010; Kingma and Ba, 2014). Although deep learning architectures often consist of tens or hundreds of successive layers, much shallower architectures may also fall under the category of deep learning, as long as at least one hidden layer exists between the input and the output layer.

Deep learning architectures are being used extensively for addressing a multitude of detection, classification, and prediction problems. Architectures involving layers of cnns are inspired by the hierarchical organization of the visual cortex in animals, which consists of alternating layers of simple and complex cells (Felleman and Van Essen, 1991; Hubel and Wiesel, 1962). cnns have proven particularly effective for problems of optical recognition and are widely used for image classification and detection (Krizhevsky et al., 2012; Szegedy et al., 2015; LeCun et al., 1998), segmentation of regions of interest in biological images (Kraus et al., 2016), and face recognition (Lawrence et al., 1997; Parkhi et al., 2015). Besides recognition of directly interpretable visual features of an image, cnns have also been used for pattern recognition in signal spectrograms, with applications in speech recognition (Sainath et al., 2015). In these applications the input data are given in the form of matrices (2d arrays) representing the 2d grid layout of pixels in an image. 1d representations of data have been used for applying 1d convolutions to sequential data such as textual patterns (Johnson and Zhang, 2015) or temporal event patterns (Lee et al., 2017; Abdeljaber et al., 2017). However, when it comes to sequential data, Recurrent Neural Networks (rnns) (Rumelhart et al., 1986) have proven superior due to their capability to dynamically “memorize” information provided in previous states and incorporate it into a current state. Long Short Term Memory (lstm) networks are a special kind of rnn that can connect information spanning long-term intervals, thus capturing long-term dependencies. lstms have been found to perform reasonably well on various data sets within the context of representative applications that exhibit sequential patterns, such as speech recognition and music modeling (Greff et al., 2017; Graves et al., 2013). In addition, they have been established as state-of-the-art networks for a variety of natural language processing tasks; indicative applications include natural language generation (Wen et al., 2015), sentiment classification (Wang et al., 2016; Baziotis et al., 2017), and neural machine translation (Cho et al., 2014), among others.

2.3. Machine Learning Techniques on Source Code

The emergence of online open-source repository hosting platforms such as GitHub in recent years has led to an explosion on the volumes of openly available source code along with metadata related to software development activities; this bulk of data is often referred to as “Big Code” (Allamanis et al., 2018). As an effect, software maintenance activities have started leveraging the wealth of openly available data, the availability of computational resources, and the recent advances in machine learning research. In this context, statistical regularities observed in source code have revealed the repetitive and predictable nature of programming languages, which has been compared to that of natural languages (Hindle et al., 2012; Ernst, 2017). To this end, problems of automation in natural language processing, such as identification of semantic similarity between texts, translation, text summarisation, word prediction and language generation have been examined in parallel with the automation of software development tasks. Relevant problems in software development include clone detection (White et al., 2016; Wei and Li, 2017), de-obfuscation (Vasilescu et al., 2017), language migration (Nguyen et al., 2013), source code summarisation (Iyer et al., 2016), auto-correction (Pu et al., 2016; Gupta et al., 2017), auto-completion (Foster et al., 2012), generation (Oda et al., 2015; Ling et al., 2016; Yin and Neubig, 2017), and comprehension (Alexandru et al., 2017).

On a par with equivalent problems in natural language processing, the methods employed to address these software engineering problems have switched from traditional rule-based and probabilistic n-gram models to deep learning methods. The majority of the proposed deep learning solutions rely on the use of rnns, which provide sophisticated mechanisms for capturing long-term dependencies in sequential data, and specifically lstms (Hochreiter and Schmidhuber, 1997), which have demonstrated particularly effective performance on natural language processing problems.

Alternative approaches to mining source code have employed cnns in order to learn features from source code. Li et al. (Li et al., 2017) have used a single-dimension cnn to learn semantic and structural features of programs by working at the ast level of granularity and combining the learned features with traditional hand-crafted features to predict software defects. This method, however, incorporates hand-crafted features in the learning process and is not proven to yield transferable results. Similarly, a one-dimensional cnn-based architecture has been used by Allamanis et al. (Allamanis et al., 2016) in order to detect patterns in source code and identify “interesting” locations where attention should be focused. The objective of the study is to predict short and descriptive names of source code snippets (e.g., a method body) given solely its tokens. cnns have also been used by Huo et al. (Huo et al., 2016) in order to address the problem of bug localization. This approach leverages both the lexical information expressed in the natural language of a bug report and the structural information of source code in order to learn unified features. A more coarse-grained approach that also employs cnns has been proposed in the context of program comprehension (Ott et al., 2018), where the authors use imagery rather than text in order to discriminate between scripts written in two programming languages, namely Java and Python.

2.4. Challenges in Applying Deep Learning on Source Code

Applying deep learning techniques on source code is non-trivial. In this section, we present challenges that we face in the process of applying deep learning techniques on source code.

2.4.1. Analogies with other problems

Deep learning is advancing rapidly in domains that address problems of image, video, audio, text, and speech processing (LeCun et al., 2015). Consequently, these advances drive current trends in deep learning and inspire applications across disciplines. As such, studies that apply deep learning on source code rely heavily on results from these domains, and particularly that of text mining.

Based on prior observations that demonstrate similarity between source code and natural language (Hindle et al., 2012), the research community has largely addressed relevant problems on mining source code by adopting latest state-of-the-art natural language processing methods (Allamanis et al., 2016; Palomba et al., 2016; Iyer et al., 2016; Vasilescu et al., 2017; Yin and Neubig, 2017). However, besides similarities, there also exist major differences that need to be taken into consideration when designing such studies. First of all, source code, unlike natural language, is semantically fragile; minor syntactic changes can drastically change the meaning of code (Allamanis et al., 2018). As an effect, treating code as text by ignoring the underlying formal semantics carries the risk of not preserving the appropriate meaning. Besides formal semantics, the syntax of source code obviously presents substantial differences compared to the syntax found in text. As a result, methods that perform well on text are likely to under-perform on source code. Architectures involving cnn-1d layers, for instance, have been proven effective for matching subsequences of short lengths (Chollet, 2017), which are often found in natural language where the length of sentences is limited. This however does not necessarily apply to self-contained fragments of source code, such as method definitions, which tend to be longer. Finally, even though good practices dictate naming conventions in coding, unlike natural language, there is no universal vocabulary of source code. This results in a diversity in the artificial vocabulary found in source code that may affect the quality of the models learned.

Approaches that treat code as text mainly focus on the mining of sequential patterns of source code tokens. Other emerging approaches look into structural characteristics of the code with the objective of extracting visual patterns delineated on code (Ott et al., 2018). Even though there are features in source code, such as nesting, which demonstrate distinctive visual patterns, treating source code in terms of such patterns and ignoring the rich intertwined semantics carries the risk of oversimplifying the problem.

2.4.2. Lack of Resources

Research employing deep learning techniques on software engineering data, including source code as well as other relevant artifacts, is still young. Consequently, results against traditional baseline techniques are very limited (Fu and Menzies, 2017; Hellendoorn and Devanbu, 2017). Especially when it comes to processing solely source code artifacts, relevant studies are scarce and mostly address the problem of drawing out semantics related to the functionality of a piece of code (Allamanis et al., 2016; White et al., 2015; White et al., 2016; Mou et al., 2016; Piech et al., 2015). To the best of our knowledge, our study is the first to thoroughly investigate the application of deep learning techniques with the objective of examining characteristics of source code quality. Therefore, a major challenge in studies of this kind is that there is no prior knowledge to guide the investigation, a challenge reflected in all stages of the inquiry. At the level of designing an experiment, there exist no rules of thumb indicating a set up for a deep learning architecture that adequately models the fine-grained features required for the problem at hand. Furthermore, at the level of training a model, there is no prior baseline for hyper-parameters that would lead to an optimal solution. Finally, at the level of evaluating a trained model, there exist no benchmarks to compare against; there is no prior concrete indication on the expected outcomes in terms of reported metrics. Hence, a result that would appear sub-optimal in another domain, such as natural language processing, may actually account for a significant advance in software quality assessment.

Besides challenges that relate to the know-how of applying deep learning techniques on source code, there are technical difficulties that arise due to the paucity of curated data in the field. The need for openly available data that can serve for replicating data-driven studies in software engineering has been long stressed (Robles, 2010). The release of curated data in the field is encouraged through badging artifact-evaluated papers as well as dedicated data showcase venues for publication. However, the software engineering domain is still far from providing benchmark datasets, whereas the available datasets are limited to curated collections of repositories with associated metadata that lack ground truth annotation that is essential for a multitude of supervised machine learning tasks. Therefore, unlike domains such as image processing and natural language processing where an abundance of annotated data exist (Krizhevsky and Hinton, 2009; Deng et al., 2009; LeCun et al., 2010; Maas et al., 2011), in the field of software engineering the lack of gold standards induces the inherent difficulty of collecting and curating data from scratch.

3. Research Objectives

The goal of this research is to explore the possibility of applying state-of-the-art deep learning methods to detect smells. Further, within the same context, this work inquires into the feasibility of applying transfer-learning. Based on the stated goals, we define the following research questions that this work aims to explore.


Is it possible to use deep learning methods to detect code smells? If yes, which deep learning method performs best?

We use cnn and rnn models in this exploration. For the cnn-based architecture, we provide input samples in 1d and 2d format to observe the difference in their capabilities due to the added dimension; we refer to them as cnn-1d and cnn-2d respectively. In the context of this research question, we define the following hypotheses.
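The difference between the two input formats can be sketched as follows; the token ids and dimensions here are made up for illustration, not taken from the paper's experimental setup:

```python
def to_1d(tokens, length):
    """Flat token sequence, padded/truncated to a fixed length (cnn-1d input)."""
    return (tokens + [0] * length)[:length]

def to_2d(lines, height, width):
    """One padded row per source line, preserving the visual layout of the
    code (e.g., indentation and nesting) for the cnn-2d input."""
    grid = [to_1d(line, width) for line in lines[:height]]
    grid += [[0] * width] * (height - len(grid))
    return grid

# A hypothetical method tokenized line by line:
method_lines = [[4, 8, 15], [16, 23], [42]]
flat = to_1d([t for line in method_lines for t in line], length=8)
grid = to_2d(method_lines, height=4, width=4)
print(flat)  # → [4, 8, 15, 16, 23, 42, 0, 0]
print(grid)  # → [[4, 8, 15, 0], [16, 23, 0, 0], [42, 0, 0, 0], [0, 0, 0, 0]]
```

The 2d form discards nothing that the 1d form contains, but it additionally exposes the line structure of the code to the convolution filters.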


It is feasible to detect smells using deep learning methods.
The considered deep learning models are powerful mechanisms that have the ability to detect complex patterns with sufficient training. These models have demonstrated high performance in the domain of image processing (Krizhevsky et al., 2012; Szegedy et al., 2015) and natural language processing (Luong et al., 2015). We believe we can leverage these models in the presented context.


cnn-2d performs better than cnn-1d in the context of detecting smells.
The rationale behind this hypothesis is the added dimensionality in cnn-2d. The 2d model might observe inherent patterns when input data is presented in two dimensions that may be hidden in the one-dimensional format. For instance, a 2d variant could identify the nesting depth of a method more easily than its 1d counterpart when detecting the complex method smell.


An rnn model performs better than cnn models in the smell detection context.
rnns are considered better for capturing sequential patterns and have a reputation for working well with text. Thus, taking into account the similarities that source code and natural language share, rnns could prove superior to cnn models.


Is transfer-learning feasible in the context of detecting smells? If yes, which deep learning model exhibits superior performance in detecting smells when applied in a transfer-learning setting?

Transfer-learning is the capability of an algorithm to exploit the similarities between different learning tasks and offer a solution to one task by transferring knowledge acquired while solving another. We would like to explore whether it is feasible to train a deep learning model on samples of C# and use this trained model to predict smells in samples of the Java programming language.

We derive the following hypotheses.


It is feasible to apply transfer-learning in the context of code smell detection.
We train the deep learning models using C# code fragments and evaluate the trained models using Java fragments. Given the high similarity in syntax between the two programming languages, we believe that models trained on the C# samples can classify smelly and non-smelly fragments in the Java evaluation samples.
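A minimal sketch of why such a transfer can work at the input level: a vocabulary built from C# tokens can encode syntactically similar Java code almost verbatim, with only unseen tokens falling back to a reserved unknown id. The regex tokenizer and id scheme here are simplifications for illustration, not the actual Tokenizer used in the study:

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
UNK = 1  # reserved id for tokens unseen during training; 0 is reserved for padding

def build_vocab(snippets):
    """Map each distinct token in the training corpus to an integer id."""
    vocab = {}
    for code in snippets:
        for tok in TOKEN_RE.findall(code):
            vocab.setdefault(tok, len(vocab) + 2)  # ids start after PAD/UNK
    return vocab

def encode(code, vocab):
    """Encode a snippet with a fixed vocabulary; unknown tokens become UNK."""
    return [vocab.get(tok, UNK) for tok in TOKEN_RE.findall(code)]

# Vocabulary learned from C# training code...
vocab = build_vocab(['if (x == null) { return; }'])
# ...reused verbatim to encode Java evaluation code; only 'y' is unknown.
print(encode('if (y == null) { return; }', vocab))  # → [2, 3, 1, 5, 5, 6, 7, 8, 9, 10, 11]
```

Because keywords, operators, and punctuation largely overlap between the two languages, most tokens of the Java snippet map to ids the model has already seen during training.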


Transfer-learning performs inferior compared to direct-learning.
Direct-learning in the context of our study refers to the case where training and evaluation samples belong to the same programming language. We expect that the performance of the models in transfer-learning could be inferior to that of direct-learning, given that both problems are equally hard, i.e., the negative and positive samples show similar distributions.

4. Research Method

This section describes the employed research method by first providing an overview and then elaborating on the data curation process. We also discuss the selection protocol of smells and architecture of the deep learning models.

4.1. Overview of the Method

Figure 1 provides an overview of the experiment. We download C# and Java repositories from GitHub. We use Designite and DesigniteJava to analyze C# and Java code respectively. We use CodeSplit to extract each method and class definition into separate files from C# and Java programs. Then the learning data generator uses the detected smells to bifurcate code fragments into positive or negative samples for a smell—positive samples contain the smell while the negative samples are free from that smell. Tokenizer takes a method or class definition and generates an integer token for each token in the source code. We apply a preprocessing operation, specifically duplicate removal, on the output of Tokenizer. The processed output of Tokenizer is then ready to be fed to the neural networks.

Figure 1. Overview of the Proposed Method

4.2. Data Curation

In this section, we elaborate on the process of generating training and evaluation samples along with the tools used in the process.

4.2.1. Downloading Repositories

We use the following protocol to identify and download our subject systems.

  • We download repositories containing C# and Java code from GitHub. We use RepoReapers (Munaiah et al., 2017) to filter out low-quality repositories. RepoReapers analyzes GitHub repositories and provides scores for eight dimensions of their quality. These dimensions are architecture, community, continuous integration, documentation, history, license, issues, and unit tests.

  • We select all the repositories where at least six out of eight and seven out of eight RepoReapers’ dimensions have suitable scores for C# and Java repositories respectively. We consider a score suitable if it has a value greater than zero.

  • We ensure that RepoReapers results do not include forked repositories.

  • We discard repositories with fewer than five stars and fewer than loc.

  • Following these criteria, we get a filtered list of C# and Java repositories. We select repositories randomly from the filtered list of Java repositories. Finally, we download and analyze the C# and Java repositories.
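The selection protocol above can be sketched as follows. This is a hypothetical illustration: the shape of the RepoReapers score records (a `scores` mapping with one entry per quality dimension) is an assumed format, not RepoReapers' actual output.

```python
# Hypothetical sketch of the repository selection protocol; the record
# layout is an illustrative assumption, not RepoReapers' actual format.
DIMENSIONS = ["architecture", "community", "continuous_integration",
              "documentation", "history", "license", "issues", "unit_tests"]

def suitable_dimensions(scores):
    """A dimension's score is considered suitable if it is greater than zero."""
    return sum(1 for d in DIMENSIONS if scores[d] > 0)

def select_repositories(repos, language, min_suitable):
    """Keep repositories of the given language that meet the per-language
    threshold (six of eight dimensions for C#, seven of eight for Java)."""
    return [r for r in repos
            if r["language"] == language
            and suitable_dimensions(r["scores"]) >= min_suitable]
```

For example, `select_repositories(repos, "C#", 6)` would implement the C# criterion, and `select_repositories(repos, "Java", 7)` the Java one.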

4.2.2. Smell Detection

We use Designite to detect smells in C# code. Designite (Sharma et al., 2016; Sharma, 2016) is a software design quality assessment tool for code written in C#. It supports detection of eleven implementation, design, and seven architecture smells. It also provides commonly used code metrics and other features such as trend analysis, code clone detection, and dependency structure matrix to help developers assess the software quality. A free academic license of Designite can be requested.

Similar to the C# version, we have developed DesigniteJava (Sharma, 2018), which is an open-source tool for analyzing and detecting smells in a Java codebase. The tool supports detection of design and ten implementation smells.

We use the console version of Designite (version ) and DesigniteJava (version ) to analyze C# and Java code respectively and detect the specified design and implementation smells in each of the downloaded repositories.

4.2.3. Splitting Code Fragments

CodeSplit is a set of two utility programs, one for each programming language, that split methods or classes written in C# and Java source code into individual files. Hence, given a C# or Java project, the utilities can parse the code correctly (using Roslyn for C# and Eclipse jdt for Java) and emit the individual method or class fragments into separate files following a hierarchical structure (i.e., namespaces/packages become folders). CodeSplit for Java is an open-source project that can be found on GitHub (Sharma, 2019b). CodeSplit for C# can be downloaded freely online (Sharma, 2019a).

4.2.4. Generating Training and Evaluation Data

The learning data generator requires information from two sources—a list of detected smells for each analyzed repository and a path to the folder where the code fragments corresponding to the repository are stored. The program takes a method (or class in case of design smells) at a time and checks whether the given smell has been detected in the method (or class) by Designite. If the method (or class) suffers from the smell, the program puts the code fragment into a “positive” folder corresponding to the smell otherwise into a “negative” folder.
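The bifurcation step can be sketched as below. This is a hypothetical illustration: the directory layout, file extension, and the form of the smell report (`smelly_fragments` as a set of fragment names) are our assumptions, not the authors' actual interfaces.

```python
# Hypothetical sketch of the learning data generator; the directory layout
# and smell-report format are illustrative assumptions.
import shutil
from pathlib import Path

def bifurcate(fragments_dir, smelly_fragments, out_dir, smell):
    """Place each code fragment into a "positive" or "negative" folder for
    the given smell, depending on whether the deterministic tool reported
    the smell in that fragment."""
    pos = Path(out_dir, smell, "positive")
    neg = Path(out_dir, smell, "negative")
    pos.mkdir(parents=True, exist_ok=True)
    neg.mkdir(parents=True, exist_ok=True)
    for fragment in Path(fragments_dir).rglob("*.cs"):
        target = pos if fragment.stem in smelly_fragments else neg
        shutil.copy(fragment, target / fragment.name)
```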

4.2.5. Tokenizing Learning Data

Machine learning algorithms require the inputs to be given in a representation appropriate for extracting the features of interest, given the problem at hand. For a multitude of machine learning tasks it is common practice to convert data into numerical representations before feeding them to a machine learning algorithm. In the context of this study, we need to convert source code into vectors of numbers honoring the language keywords and other semantics. Tokenizer (Spinellis, 2019) is an open-source tool that provides, among other things, functionality for tokenizing source code elements into integers, where different ranges of integers map to different types of elements in source code. Figure 2 shows a small C# method and the corresponding tokens generated by Tokenizer. Currently, the tool supports six programming languages, including C# and Java.

Figure 2. Tokens generated by Tokenizer for an example
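The range-coded encoding can be illustrated with a toy mapping. This is illustrative only: the actual integer ranges that Tokenizer assigns are defined by the tool; the ranges and the simplistic lexing below are our assumptions, showing only the idea of keywords, symbols, and identifiers landing in distinct integer ranges.

```python
# Illustrative only: the integer ranges Tokenizer actually assigns are
# tool-specific; this toy mapping just shows the idea of range-coded tokens.
KEYWORDS = {"public": 100, "int": 101, "return": 102}
SYMBOLS = {"(": 200, ")": 201, "{": 202, "}": 203, ";": 204}

def toy_tokenize(code):
    for sym in SYMBOLS:
        code = code.replace(sym, f" {sym} ")      # isolate punctuation
    tokens, seen = [], {}
    for tok in code.split():
        if tok in KEYWORDS:
            tokens.append(KEYWORDS[tok])
        elif tok in SYMBOLS:
            tokens.append(SYMBOLS[tok])
        else:                                     # identifiers/literals get
            tokens.append(seen.setdefault(tok, 1000 + len(seen)))  # own range
    return tokens
```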

4.2.6. Data Preparation

The stored samples are read into numpy arrays, preprocessed, and filtered. We first perform bare minimum preprocessing to clean the data—for both 1d and 2d samples, we scan all the samples for each smell and remove duplicates if any exist.

We split the samples in the ratio of 70-30 for training; i.e., 70% of the samples are used for training a model while 30% samples are used for evaluation. We limit the maximum number of positive/negative training samples to . Therefore, for instance, if negative samples are more than , we drop the rest of the samples. We perform model training using balanced samples, i.e., we balance the number of samples for training by choosing the smaller number from the positive and negative sample count; we discard the remaining training samples from the larger side. Table 1 presents an example of data preparation.

                        Initial samples   70-30 split   Applying max limit   Balancing
Positive   Training     4,961             3,472         3,472                3,472
           Evaluation                     1,489         1,489                1,489
Negative   Training     178,048           121,161       5,000                3,472
           Evaluation                     51,926        51,926               51,926
Table 1. Number of samples in each step of preparing input data
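The preparation steps in Table 1 can be sketched as below. The 5,000-sample cap is taken from the table; note that the table's negative 70-30 split counts differ slightly from a single global split (the split may be performed per repository), so this sketch shows the principle rather than reproducing every figure.

```python
# Sketch of the split/cap/balance procedure of Table 1; the 5,000 cap is
# taken from the table, and exact negative counts may differ in practice.
def prepare_counts(n_pos, n_neg, cap=5000):
    pos_train = int(n_pos * 0.7)                  # 70-30 split
    pos_eval = n_pos - pos_train
    neg_train = int(n_neg * 0.7)
    neg_eval = n_neg - neg_train
    pos_train = min(pos_train, cap)               # applying max limit
    neg_train = min(neg_train, cap)
    balanced = min(pos_train, neg_train)          # balancing both classes
    return {"pos_train": balanced, "pos_eval": pos_eval,
            "neg_train": balanced, "neg_eval": neg_eval}
```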

Each individual input instance, either a method in the case of implementation smells or a class in the case of design smells, is stored in the appropriate data structure depending upon the model that will use it. In the 1d representation, each input instance is represented by a flat 1d array of sequences of tokens, compatible with the rnn and cnn-1d models. In the 2d representation, each input instance is represented by a 2d array of tokens, preserving the original statement-by-statement delineation of the source code, thus providing the grid-like input format required by cnn-2d models. All the individual samples are stored in a few files (where each file size is approximately mb) to optimize the I/O operations given the large number of files. We read all the samples into a numpy array and filter out the outliers. In particular, we compute the mean input size and discard all the samples with length over one standard deviation away from the mean. This filtering helps us keep the training set within reasonable bounds and avoids wasting memory and processing resources. We pad the input array with zeros to the length of the longest remaining input in order to create vectors of uniform length and bring the data into the appropriate format for use with the deep learning models. Finally, we shuffle the array of input samples along with its corresponding labels array.
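The outlier filtering, zero padding, and shuffling steps just described can be sketched with numpy as follows; the function and parameter names are ours, and the one-sided length threshold is our reading of "over one standard deviation away from the mean".

```python
# Minimal numpy sketch of the described preprocessing: drop samples longer
# than one standard deviation above the mean length, zero-pad to the longest
# remaining sample, and shuffle samples together with their labels.
import numpy as np

def filter_pad_shuffle(samples, labels, seed=0):
    lengths = np.array([len(s) for s in samples])
    limit = lengths.mean() + lengths.std()          # outlier threshold
    keep = [i for i, n in enumerate(lengths) if n <= limit]
    max_len = max(lengths[i] for i in keep)
    padded = np.zeros((len(keep), max_len), dtype=np.int64)
    for row, i in enumerate(keep):
        padded[row, :lengths[i]] = samples[i]       # zero-pad on the right
    y = np.array([labels[i] for i in keep])
    order = np.random.default_rng(seed).permutation(len(keep))
    return padded[order], y[order]
```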

4.3. Selection of Smells

Over the last two decades, the software engineering community has documented many smells associated with different granularities, scopes, and domains (Sharma and Spinellis, 2018). A comprehensive taxonomy of software smells can be found online (http://www.tusharma.in/smells). For this study, the selection of smells is a crucial decision. The scope of the higher granularity smells, such as design and architecture smells, is large, often spanning multiple classes and components. It is essential to provide all the intertwined source code fragments to the deep learning model to make sure that the model captures the key deciding elements from the provided input source code. Hence, it is naturally difficult to detect them using deep learning approaches, unless extensive feature engineering is performed beforehand to attain an appropriate representation of the data. We started with implementation smells because they can typically be detected just by looking at a method. However, we wanted to avoid very simple smells (such as long method) which can be easily detected by less sophisticated techniques.

We chose complex method (cm, i.e., a method with high cyclomatic complexity), magic number (mn, i.e., an unexplained numeric literal used in an expression), and empty catch block (ecb, i.e., an empty catch block of an exception). These three smells represent three different kinds of smells where neural networks have to spot specific features. For instance, to detect magic number, the neural network must spot a specific range of tokens representing magic numbers. On the other hand, detection of complex method requires looking at the entire method and its structural properties (i.e., the nesting depth of the method). For the detection of empty catch block, the neural network has to recognize the sequence of a try block followed by an empty catch block.

To expand the horizon of the experiment, we also select the multifaceted abstraction (ma, i.e., a class has more than one responsibility assigned to it) design smell. The scope of this smell is larger (i.e., the whole class) and its detection is not trivial, since the neural network has to capture the cohesion aspect among the methods (typically captured by the Lack of Cohesion of Methods (lcom) metric in deterministic tools) to detect it accurately. This smell not only allows us to compare the capabilities of neural networks in detecting implementation smells with design smells but also sets the stage for future work to build on.

4.4. Architecture of Deep Learning Models

In this section, we present the architecture of the neural network models that we use in this study. The Python implementation of the experiments using the Keras library can be found online.


4.4.1. cnn Model

Figure 3 presents the architecture of the cnn model used to detect smells. This architecture is inspired by typical cnn architectures used in image classification (Krizhevsky et al., 2012) and consists of a feature extraction part followed by a classification part. The feature extraction part is composed of an ensemble of layers, specifically convolution, batch normalization, and max pooling layers. This set of layers forms the hidden layers of the architecture. The convolution layer performs convolution operations based on the specified filter and kernel parameters and computes accordingly the network weights to the next layer, whereas the max pooling layer reduces the dimensionality of the feature space. Batch normalization (Ioffe and Szegedy, 2015) mitigates the effects of varied input distributions for each training mini-batch, thus optimizing training. In order to experiment with different configurations, we use one, two, and three hidden layers.

The output of the last max pooling layer is connected to a dropout layer. Dropout performs another type of regularization by ignoring some randomly selected nodes during training in order to prevent over-fitting (Srivastava et al., 2014). In our experiments, we set the dropout rate for the layer to , which means that the nodes to be ignored are randomly selected with probability .

The output of the last dropout layer is fed into a densely connected classifier network that consists of a stack of two dense layers. These classifiers process 1d vectors, whereas the incoming output from the last hidden layer is a 3d tensor (corresponding to the height and width of an input sample, and the channel; in this case, the number of channels is one). For this reason, a flatten layer is used first to transform the data into the appropriate format before feeding them to the first dense layer with units and relu activation. This is followed by the second dense layer with one unit and sigmoid activation. This last dense layer comprises the output layer and contains a single neuron in order to make predictions on whether a given instance belongs to the positive or negative class in terms of the smell under investigation. The layer uses the sigmoid activation function in order to produce a probability within the range of [0, 1].


Figure 3. Architecture of employed CNN
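The described architecture can be sketched in Keras as follows. This is our reconstruction from the description above, not the authors' code; the dense-layer width and dropout rate are elided in the text, so the values below are illustrative assumptions.

```python
# Keras reconstruction of the described CNN-1D (a sketch, not the authors'
# code); dense_units and dropout_rate are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn1d(input_length, filters=16, kernel_size=7, pool_window=4,
                hidden_blocks=2, dense_units=32, dropout_rate=0.2):
    """Repeated (Conv1D -> BatchNormalization -> MaxPooling1D) blocks,
    then Dropout, Flatten, and a two-layer dense classifier ending in a
    single sigmoid unit, as in Figure 3."""
    inputs = keras.Input(shape=(input_length, 1))
    x = inputs
    for _ in range(hidden_blocks):
        x = layers.Conv1D(filters, kernel_size, activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling1D(pool_window)(x)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(dense_units, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

As the text notes, the cnn-2d variant uses the same structure with the 2d Keras layers (Conv2D, MaxPooling2D) and a (height, width, 1) input shape substituted for their 1d counterparts.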

We use a dynamic batch size depending upon the size of the samples to train. We divide the training sample size by and use the result as the index to choose one of the items in the possible batch size array ( ). For instance, we use as the batch size when the training sample size is and when the training sample size is .

The hyper-parameters are set to different values in order to experiment with different configurations of the model. Table 2 lists all the values chosen for the hyper-parameters. We execute the cnn models for the configurations that result from combinations of the different hyper-parameter values and the number of repetitions of the set of hidden units. We label each configuration between and , where configuration refers to number of repetitions of the set of hidden units = , number of filters = , kernel size = , and pooling window size = ; similarly, configuration refers to number of repetitions of the set of hidden units = , number of filters = , kernel size = , and pooling window size = . Both the 1d and 2d variants use the same architecture, replacing the 2d versions of the Keras layers with their 1d counterparts.

Hyper-parameter                              Values
Filters in convolution layer                 {8, 16, 32, 64}
Kernel size in convolution layer             {5, 7, 11}
Pooling window size in max pooling layer     {2, 3, 4, 5}
Maximum epochs
Table 2. Chosen values of hyper-parameters for the cnn model

We ensure the best attainable performance and avoid over-fitting by using early stopping (https://keras.io/callbacks/) as a regularization method. It implies that the model may reach a maximum of epochs during training. However, if there is no improvement in the validation loss of the trained model for five consecutive epochs (since patience, a parameter of the early stopping mechanism, is set to five), the training is interrupted. Along with it, we also use a model checkpoint to restore the best weights of the trained model.
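The early-stopping rule can be illustrated with a simplified pure-Python sketch; Keras's actual EarlyStopping callback also supports options such as min_delta, baseline, and best-weight restoration, which are omitted here.

```python
# Simplified illustration of the early-stopping rule: training halts when
# validation loss has not improved for `patience` consecutive epochs.
def epochs_run(val_losses, patience=5, max_epochs=50):
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, waited = loss, 0                # improvement: reset counter
        else:
            waited += 1
            if waited >= patience:
                return epoch                      # training interrupted here
    return min(len(val_losses), max_epochs)
```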

For each experiment, we compute the following performance metrics: accuracy, roc-auc (Receiver Operating Characteristic – Area Under Curve), precision, recall, F1, and average precision score. We also record the actual epoch count at which each model stopped training (due to early stopping). After we complete all the experiments with all the chosen hyper-parameters, we choose the best performing configuration and the corresponding number of epochs, retrain the model, and record the final and best performance of the model.
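For reference, the threshold-based metrics reported throughout the paper are computed as below; this is a plain sketch, and in practice libraries such as scikit-learn provide these directly.

```python
# Plain sketch of precision, recall, and F1 from binary predictions.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```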

4.4.2. rnn Model

Figure 4 presents the architecture of the employed rnn model which is inspired by state-of-the-art models in natural language modeling that employ an lstm network as a recurrent layer (Sundermeyer et al., 2012). The model consists of an embedding layer followed by the feature learning part — a hidden lstm layer. It is succeeded by the regularization (realized by a dropout layer) and classification (consisting of a dense layer) part.

Figure 4. Architecture of employed RNN

The embedding layer maps discrete tokens into compact dense vector representations. One of the advantages of lstm networks is that they can effectively handle sequences of varying lengths. To this end, in order to avoid the noise produced by the padded zeros in the input array, we set the mask_zero parameter, provided by the Keras embedding layer implementation, to True. Thus the padding is ignored and only the meaningful part of the input data is taken into account. We set the dropout and recurrent_dropout parameters of the lstm layer to . Regular dropout masks (drops) network units at inputs and/or outputs, whereas recurrent dropout drops the connections between the recurrent units, along with dropping units at inputs and/or outputs (Gal and Ghahramani, 2015). The output from the embedding layer is fed into the lstm layer, which in turn outputs to the dropout layer. As in the case of the cnn model, we experiment with different depths of the rnn model by repeating multiple instances of the hidden layer.

The dropout layer uses a dropout rate equal to , which we empirically found effective for preventing over-training, yet conservative enough to avoid under-training. The dense layer, which comprises the classification output layer, is configured with one unit and sigmoid activation, as in the case of the cnn model. Similarly to the cnn model, we use early stopping (with maximum epochs = and patience = ) and model checkpoint callbacks. Also, we use the dynamic batch size selection as explained in the previous subsection.
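The rnn model can be sketched in Keras as follows. This is our reconstruction from the description, not the authors' code; the dropout rates are elided in the text, so the 0.2 below is an illustrative assumption.

```python
# Keras reconstruction of the described RNN (a sketch, not the authors'
# code); dropout_rate is an illustrative assumption.
from tensorflow import keras
from tensorflow.keras import layers

def build_rnn(vocab_size, embedding_dim=16, lstm_units=32, dropout_rate=0.2):
    """Embedding (mask_zero=True so padded zeros are ignored) -> LSTM with
    regular and recurrent dropout -> Dropout -> single sigmoid unit,
    as in Figure 4."""
    model = keras.Sequential([
        layers.Embedding(vocab_size, embedding_dim, mask_zero=True),
        layers.LSTM(lstm_units, dropout=dropout_rate,
                    recurrent_dropout=dropout_rate),
        layers.Dropout(dropout_rate),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```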

We try different values for the model hyper-parameters. Table 3 presents different values selected for each hyper-parameter. We measure the performance of the rnn model in configurations by forming the combinations produced by the different chosen values of hyper-parameters and the number of repetitions of the set of hidden units.

Hyper-parameter                      Values
Dimensionality of embedding layer    {16, 32}
lstm units                           {32, 64, 128}
Maximum epochs                       50
Table 3. Chosen values of hyper-parameters for the rnn model

As described earlier, we pick the best performing hyper-parameters and number of epochs and retrain the model to obtain the final and best performance of the model.

4.5. Hardware Specification

We perform all the experiments on the supercomputing facility offered by grnet (Greek Research and Technology Network). The experiments were run on gpu nodes (NVidia K40). Each gpu incorporates 2880 cuda cores. We request one gpu node with gb of memory for most of the experiments when submitting a job to the supercomputing facility. Some rnn experiments require more memory for training; we request gb of memory for them.

5. Results and Discussion

As elaborated in this section, we found that it is feasible to detect smells using deep learning models without extensive feature engineering. Our results also indicate that performance of deep learning models is highly smell-specific. Furthermore, we found that it is feasible to apply transfer-learning in the context of code smells detection. In the rest of the section, we discuss the results in detail.

5.1. Results of RQ1


Is it possible to use deep learning methods to detect code smells? If yes, which deep learning method performs best?

5.1.1. Approach

We prepare the input samples as described in Section 4.2. Table 4 presents the number of positive and negative samples used for each smell for training and evaluation; cnn-1d and rnn use 1d samples and cnn-2d uses 2d samples. As mentioned earlier, we train our models with the same number of positive and negative samples. The sample size for multifaceted abstraction (ma) is considerably lower than for the other smells because each sample for this smell is a class (the other smells use method fragments). The one-dimensional sample counts differ from their two-dimensional counterparts because, in the two-dimensional case, we apply an additional constraint for outlier exclusion on the permissible height, in addition to the width.

       cnn-1d and rnn                     cnn-2d
       Training   Evaluation             Training   Evaluation
       p and n    p        n             p and n    p        n
cm     3,472      1,489    51,926        2,641      1,132    45,204
ecb    1,200      515      52,900        982        422      45,915
mn     5,000      5,901    47,514        5,000      5,002    41,334
ma     290        125      22,727        284        122      17,362
Table 4. Number of positive (P) and negative (N) samples used for training and evaluation for RQ1

5.1.2. Results

Figure 5 presents the performance (F1) of the models for the considered smells for all the configurations that we experimented with. The results, viewed from each model's perspective, show that the performance of the models varies depending on the smell under analysis. Another observation from the trendlines shown in the plots is that the performance of the convolution models remains more or less stable across different configurations, while the rnn exhibits better performance as the complexity of the model increases, except for the multifaceted abstraction smell. This implies that the hyper-parameters that we experimented with do not play a very significant role for the convolution models.

(a) Performance of CNN-1D
(b) Performance of CNN-2D
(c) Performance of RNN
Figure 5. Scatter plots of the performance (F1) exhibited by the considered deep learning models along with their corresponding trendlines

Figure 6 presents boxplots comparing, for each smell, the performance of all trained models under all configurations. For the complex method smell, both convolution models outperform the rnn. Between the convolution models, the various configurations of the cnn-1d model appear accumulated around the mean, whereas cnn-2d shows higher variance among the F1 scores obtained at different configurations. Though cnn-1d shows lower variance, it has a higher number of outliers compared to the cnn-2d model. The rnn model performs significantly better than the convolution models for the empty catch block smell, with an F1 score of versus and achieved by cnn-1d and cnn-2d respectively; the performance of the model, however, shows a wide variation depending on the chosen hyper-parameters. For the magic number smell, most of the rnn configurations do better than the best of the convolution-based configurations. The rnn exhibits a very high variance in performance compared to the convolution models for the multifaceted abstraction smell.

(a) Complex method
(b) Empty catch block
(c) Magic number
(d) Multifaceted abstraction
Figure 6. Boxplots of the performance (F1) exhibited by the considered deep learning models for all four smells

Equipped with the experiment results, we attempt to validate each of the addressed hypotheses in the rest of the section, presenting auc, precision, recall, and F1 to show the performance of the analyzed deep learning models.


It is feasible to detect smells using deep learning methods.

Table 5 lists performance metrics (auc, precision, recall, and F1) for the optimal configuration for each smell, comparing all three deep learning models. It also lists the hyper-parameters associated with the optimal configuration for each smell. Figure 7 presents the performance (F1) of the deep learning models corresponding to each smell considered in this exploration.

Model    Smell   AUC    Precision   Recall   F1      L   F    K    MPW   ED   LSTM   E
cnn-1d   cm      0.82   0.26        0.69     0.38    2   16   7    4     -    -      25
cnn-1d   ecb     0.59   0.02        0.31     0.04    2   64   11   4     -    -      40
cnn-1d   mn      0.68   0.18        0.77     0.29    2   16   5    5     -    -      17
cnn-1d   ma      0.83   0.05        0.75     0.09    3   16   11   5     -    -      36
cnn-2d   cm      0.82   0.30        0.68     0.41    3   64   5    4     -    -      17
cnn-2d   ecb     0.50   0.01        1        0.02    3   64   7    2     -    -      32
cnn-2d   mn      0.65   0.31        0.41     0.35    1   16   11   2     -    -      50
cnn-2d   ma      0.87   0.03        0.95     0.06    2   8    7    2     -    -      19
rnn      cm      0.85   0.19        0.80     0.31    3   -    -    -     16   32     8
rnn      ecb     0.86   0.13        0.76     0.22    2   -    -    -     16   128    15
rnn      mn      0.91   0.55        0.91     0.68    2   -    -    -     16   128    19
rnn      ma      0.69   0.01        0.86     0.02    1   -    -    -     32   128    9
Table 5. Performance of all three models with configuration corresponding to the optimal performance. L refers to deep learning layers, F refers to number of filters, K refers to kernel size, MPW refers to maximum pooling window size, ED refers to embedding dimension, LSTM refers to number of LSTM units, and E refers to number of epochs.
Figure 7. Comparative performance of the deep learning models for each considered smell

For the complex method smell, cnn-2d performs the best, though the performance of cnn-1d is comparable. This could be an implication of the fact that the smell is exhibited through the structure of a method; hence, the cnn models, in this case, could identify the related structural features for classifying the smell correctly. On the other hand, the cnn models perform significantly worse than the rnn in identifying empty catch block smells. The smell is characterized by a micro-structure where the catch block of a try-catch statement is empty. The rnn model identifies the sequence of tokens (i.e., opening and closing braces) following the tokens of a try block, whereas the cnn models fail to do so; thus the rnn performs significantly better than the cnn models. The rnn model also performs remarkably better than the cnn models for the magic number smell. The smell is characterized by a specific range of tokens, and the rnn does well in spotting them. Multifaceted abstraction is a non-trivial smell that requires analysis of method interactions to observe the incohesiveness of a class. None of the employed deep learning models could capture the complex characteristics of the smell, implying that the token-level representation of the data may not be appropriate for capturing the higher-level features required for detecting the smell. It is evident from the above discussion that all the employed models are capable of detecting smells in general; however, their smell-specific performances differ significantly. Therefore, the hypothesis exploring the feasibility of detecting smells using deep learning models holds true.


cnn-2d performs better than cnn-1d in the context of detecting smells.

Table 5 shows that cnn-1d performs better than the cnn-2d model for the empty catch block and multifaceted abstraction smells with the optimal configuration. On the other hand, cnn-2d performs slightly better than its one-dimensional counterpart for detecting the complex method and magic number smells. In summary, there is no universally superior model for detecting all four smells; their performance varies depending on the smell under analysis. Therefore, we reject the hypothesis that cnn-2d performs overall better than cnn-1d, as neither model is clearly superior to the other in all cases.


rnn model performs better than cnn models in the smell detection context.

Table 6 presents the comparison of rnn with cnn-1d and cnn-2d in terms of pairwise F1 differences in percentages, where the F1 values are obtained by the optimal configuration in each case. Here, the performance difference in percentage is calculated as (F1 of the rnn − F1 of the cnn) / (F1 of the rnn) × 100, which is consistent with the signs and magnitudes in Table 6. rnn performs far better for the empty catch block and magic number smells against both convolution models. However, the performance of rnn is lower for the complex method and multifaceted abstraction smells.

Smell   rnn vs cnn-1d   rnn vs cnn-2d
cm      -22.94%         -33.81%
ecb      80.23%          91.94%
mn       57.19%          48.48%
ma     -353.15%        -208.00%
Table 6. Performance (F1) comparison of RNN with CNN-1D and CNN-2D
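The pairwise differences in Table 6 are consistent with a percentage computed relative to the rnn's F1; note that the reported values use the unrounded F1 scores, so recomputing from Table 5's rounded entries gives slightly different numbers. A sketch of the computation:

```python
# Percentage F1 difference relative to the RNN score: positive when the
# RNN is better, negative when the CNN is better.
def pct_diff(f1_rnn, f1_cnn):
    return (f1_rnn - f1_cnn) / f1_rnn * 100
```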

The analysis suggests that performance of the deep learning models is smell-specific. Therefore, we reject the hypothesis that rnn models perform better than cnn models for all considered smells.

5.1.3. Implications

This is the first attempt in the software engineering literature to show the feasibility of detecting smells using deep learning models from the tokenized source code without extensive feature engineering. It may motivate researchers and developers to explore this direction and build over it. For instance, context plays an important role in deciding whether a reported smell is actually a quality issue for the development team. One of the future works that the community may explore is to combine the models trained using samples classified by the existing smell detection tools with the developer’s feedback to identify more relevant smells considering the context.

Our results show that, although each of the convolution methods performs better for specific smells, their performance is comparable for each smell. This implies that we may use one-dimensional or two-dimensional cnns interchangeably without compromising performance significantly.

The comparative results on applying diverse deep learning models for detecting different types of smells suggest that there exists no universal optimal model for detecting all smells under consideration. The performance of the model is highly dependent on the kind of smell that the model is trying to classify. This observation provides grounds for further investigation, encouraging the software engineering community to propose improvements on smell-specific deep learning models.

5.2. Results of RQ2


Is transfer-learning feasible in the context of detecting smells? If yes, which deep learning model exhibits superior performance in detecting smells when applied in a transfer-learning setting?

5.2.1. Approach

In the case of direct-learning, the training and evaluation samples belong to the same programming language, whereas in the transfer-learning case, the training and evaluation samples come from two similar but different programming languages. This research question investigates the feasibility of applying transfer-learning, i.e., training neural networks using C# samples and employing the trained model to classify code fragments written in Java.

For the transfer-learning experiment, we keep the training samples exactly the same as the ones we used in RQ1. For evaluation, we download repositories containing Java source code and preprocess the samples as described in Section 4.2. Similar to RQ1, evaluation is performed on a realistic scenario, i.e., we use all the positive and negative samples from the selected repositories. This arrangement ensures that the models would perform as reported if employed in a real-world application. Table 7 shows the number of samples used for training and evaluation for this research question.

       cnn-1d and rnn                     cnn-2d
       Training   Evaluation             Training   Evaluation
       p and n    p        n             p and n    p        n
cm     3,472      2,163    48,633        2,641      2,001    30,215
ecb    1,200      597      50,199        982        538      31,678
mn     5,000      42,037   50,905        5,000      7,778    24,438
ma     290        25       13,110        284        23       11,812
Table 7. Positive (P) and negative (N) number of samples used for training and evaluation for RQ2

5.2.2. Results

As an overview, Figure 8 shows scatter plots for each deep learning model comparing the performance (F1) of direct-learning and transfer-learning for all the considered smells under all configurations. These plots outline the performance exhibited by the models in both cases, with trend lines distinguishing the compared series. The plots imply that the models perform better in the transfer-learning case for all smells except the multifaceted abstraction design smell.

(a) CNN-1D for complex method smell
(b) CNN-1D for empty catch block smell
(c) CNN-1D for magic number smell
(d) CNN-1D for multifaceted abstraction smell
(e) CNN-2D for complex method smell
(f) CNN-2D for empty catch block smell
(g) CNN-2D for magic number smell
(h) CNN-2D for multifaceted abstraction smell
(i) RNN for complex method smell
(j) RNN for empty catch block smell
(k) RNN for magic number smell
(l) RNN for multifaceted abstraction smell
Figure 8. Scatter plots for each model and for each considered smell comparing F1 of direct-learning and transfer-learning along with corresponding trendline

In the rest of the section, we report quantitative results on applying transfer-learning from C# to Java. The results are based on the optimal configuration of each model for each smell.


It is feasible to apply transfer-learning in the context of code smells detection.

Table 8 presents the performance of the models for all the considered smells, demonstrating strong evidence for the feasibility of applying transfer-learning for smell detection. The performance pattern is in alignment with that of the direct-learning case; the Spearman correlation between the performance produced by direct-learning and transfer-learning is (with p-value = ). Therefore, we accept the hypothesis that transfer-learning is feasible in the context of code smell detection.

                Performance                     Configuration
        Smell   AUC   Precision  Recall  F1     L   F   K  MPW  ED  LSTM   E
cnn-1d  cm      0.87  0.38       0.79    0.51   2  32   7   4    -    -   23
        ecb     0.56  0.05       0.15    0.08   3   8   5   5    -    -   27
        mn      0.64  0.48       0.37    0.42   1  32  11   3    -    -   12
        ma      0.52  0.01       0.04    0.02   2   8  11   5    -    -   13
cnn-2d  cm      0.88  0.43       0.84    0.57   1   8   7   2    -    -   37
        ecb     0.54  0.04       0.12    0.06   3  16   5   4    -    -   19
        mn      0.65  0.43       0.54    0.48   1  64   5   4    -    -    8
        ma      0.50  0.0        0.0     0.0    3   8   5   5    -    -   17
rnn     cm      0.66  0.62       0.32    0.42   1   -   -   -   32   64    8
        ecb     0.90  0.09       0.91    0.16   3   -   -   -   32   32   27
        mn      0.95  0.94       0.91    0.92   1   -   -   -   32   32   22
        ma      0.51  0.0        0.08    0.0    1   -   -   -   32   32   18
Table 8. Performance of all three models with the configuration corresponding to the optimal performance. L refers to the number of deep learning layers, F to the number of filters, K to the kernel size, MPW to the maximum pooling window size, ED to the embedding dimension, LSTM to the number of LSTM units, and E to the number of epochs; dashes mark parameters not applicable to a model.
Figure 9. Comparative performance of the deep learning models for each considered smell in transfer-learning settings

Figure 9 presents a comparison of the performance (i.e., F1) exhibited by the three deep learning models for each considered smell. rnn performs significantly better for the empty catch block and magic number smells, following a trend comparable to that of direct-learning. For the complex method smell, cnn-2d performs best, followed by cnn-1d. All three models perform poorly on the multifaceted abstraction smell.


Transfer-learning performs comparably to direct-learning.

Figure 10. Comparison of performance of the deep learning models between direct-learning (DL) and transfer-learning (TL) settings

Figure 10 compares the performance of the models at their optimal configurations in the transfer-learning and direct-learning settings. We observe that, in the majority of cases, transfer-learning performs better than the corresponding direct-learning counterpart. The only exception among the implementation smells is rnn applied to the empty catch block smell, where direct-learning shows better results. For the only design smell, i.e., multifaceted abstraction, all the models perform poorly in both settings.

To dig deeper into the cause of the better performance of the deep learning models in the transfer-learning case, we calculate the ratio of positive to negative evaluation samples in both research questions. Table 9 presents the ratio of samples used in both research questions, as well as the percentage difference of the ratio of positive to negative samples in RQ2 compared to the sample ratio in RQ1. The percentage difference is computed as follows: Difference% = (Ratio_RQ2 − Ratio_RQ1) / Ratio_RQ2 × 100. It is evident that the Java code samples have a higher ratio of positive samples, up to 62% higher, compared to the C# samples for the implementation smells. We deduce that, due to the significantly higher number of positive samples, the deep learning models show better performance statistics in the transfer-learning case. On the other hand, the multifaceted abstraction smell occurs significantly less often (up to 261% lower by this measure) in the Java code compared to the C# code, and this lower ratio further degrades the performance of the models for the multifaceted abstraction smell. Therefore, due to the size discrepancies in the samples available for evaluation in direct-learning and transfer-learning, we cannot conclude the superiority or inferiority of the results produced by applying transfer-learning compared to those of direct-learning.
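The percentage-difference computation above can be sketched in a couple of lines; the sample ratios used below are those reported for the complex method smell in Table 9 (1d tokenization), and small deviations from the published figure stem from rounding of the ratios:

```python
def pct_diff(ratio_rq1, ratio_rq2):
    """Percentage difference of RQ2's positive-to-negative sample ratio
    relative to RQ1's, as used for Table 9."""
    return (ratio_rq2 - ratio_rq1) / ratio_rq2 * 100

# Complex method, 1d tokenization: ratios 0.0287 (RQ1, C#) and 0.0445 (RQ2, Java)
print(round(pct_diff(0.0287, 0.0445), 2))  # close to the 35.53 reported in Table 9
```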

       Ratio (RQ1)       Ratio (RQ2)       Difference %
Smell  1d      2d        1d      2d        1d       2d
cm     0.0287  0.0250    0.0445  0.0662     35.53    62.19
ecb    0.0097  0.0092    0.0119  0.0170     18.14    45.88
mn     0.1242  0.1210    0.2084  0.3183     40.40    61.98
ma     0.0055  0.0070    0.0019  0.0019   -188.42  -260.87
Table 9. Difference in ratio (in percent) of positive to negative evaluation samples in RQ2 compared to the sample ratio in RQ1

5.2.3. Implications

Our results demonstrate that it is feasible to apply transfer-learning in the smell detection context. Exploiting this approach can lead to a new category of smell detection tools, specifically for the programming languages where comprehensive smell detection tools are not available.

5.3. Discussion

As is the case with most research, our results are sobering rather than sensational. Although it is possible to detect some code smells using deep learning models, the method is by no means universal, and the outcome is sensitive to the training set composition and the training time. In the rest of the section, we elaborate on these observations emerging from the presented results.

5.3.1. Is there any silver-bullet?

In practical settings, one would want to employ a universal model architecture that performs consistently well for all the considered smells, since it would make the implementation of tools simpler.

rnns have a reputation for performing well on textual data and sequential patterns, while cnns are considered good for image data and visual patterns. Given the similarity of source code to natural language, one would expect good performance from rnn. Our results show that rnn significantly outperforms both cnn models in the cases of empty catch block and magic number. However, in the case of complex method, the cnn models outperform the rnn, whereas in the case of multifaceted abstraction, none of the models yields satisfactory results. These outcomes suggest that no single deep learning model can be used for all kinds of smells. We use a uniform architecture for each model, and we observe that the performance of a model differs significantly across smells. This suggests that it is non-trivial, if not impossible, to propose a universal model architecture that works for all smells. Each smell exhibits diverse distinctive features, and hence their detection mechanisms differ significantly. Therefore, given the nature of the problem, it is unlikely that one universal model architecture will be the silver bullet for the detection of a wide range of smells.

5.3.2. Performance comparison with baseline

It is not feasible to compare the results presented in this paper with other attempts (Khomh et al., 2009, 2011; Maiga et al., 2012b, a; Bryton et al., 2010; Barbez et al., 2019; Fontana et al., 2016) that use machine learning for smell detection, for the following reasons. First, the replication packages of the related attempts are not available. Second, for most of the existing attempts, the ratio of positive to negative evaluation samples is not known; in the absence of this information, we cannot compare them fairly with our results, since this ratio plays an important role in the performance of machine learning models. Furthermore, the existing approaches compute metrics and feed them to machine learning models, while we feed tokenized source code.

We compare our results with the results obtained from two baseline random classifiers that do not actually learn from the data but use only the distribution of smells in the training set to form their predictions. Table 10 presents the comparison. The first random classifier generates predictions by following the training set's class distribution: for every item in the evaluation set, it predicts whether the item is a smell based on the frequency of smells in the training data. We did this for both balanced and unbalanced evaluation samples to mimic the learning process of the actual experiment. In the middle three columns of the table, referred to as 'rc (frequency)', we show the results for the balanced setting, as they were better than the results for the unbalanced setting. The second random classifier always predicts that a smell is present; this gives perfect recall but low precision, as shown in the columns corresponding to 'rc (all smells)'. Overall, our models perform far better than a random classifier under both baseline variants for all but the multifaceted abstraction smell.
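The expected scores of both baselines can be derived analytically rather than simulated. This sketch (our own illustration, not part of the replication package) shows why the frequency classifier's precision equals the evaluation set's positive base rate while its recall equals the prediction probability:

```python
def expected_baseline(base_rate, predict_prob):
    """Expected precision/recall/F1 of a classifier that predicts 'smell'
    independently with probability q = predict_prob on an evaluation set
    whose fraction of true smells is p = base_rate.

    E[TP] = N*p*q, E[FP] = N*(1-p)*q, E[FN] = N*p*(1-q), hence
    precision = p*q / (p*q + (1-p)*q) = p and recall = p*q / (p*q + p*(1-q)) = q.
    """
    precision = base_rate
    recall = predict_prob
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Magic number: roughly 11% positives in the evaluation set.
print(expected_baseline(0.11, 0.5))  # frequency baseline (balanced training -> q = 0.5)
print(expected_baseline(0.11, 1.0))  # 'all smells' baseline: perfect recall, low precision
```

The analytic values match the 'rc' columns of Table 10: the base rate caps the precision of any classifier that ignores the input.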

               Our results        RC (frequency)     RC (all smells)
        Smell  P     R     F1     P     R     F1     P     R  F1
cnn-1d  cm     0.38  0.79  0.51   0.03  0.50  0.05   0.03  1  0.05
        ecb    0.05  0.15  0.08   0.01  0.50  0.02   0.01  1  0.02
        mn     0.48  0.37  0.42   0.11  0.50  0.18   0.11  1  0.20
        ma     0.01  0.04  0.02   0.01  0.50  0.01   0.01  1  0.01
cnn-2d  cm     0.43  0.84  0.57   0.02  0.50  0.05   0.02  1  0.05
        ecb    0.04  0.12  0.06   0.01  0.50  0.02   0.01  1  0.02
        mn     0.43  0.54  0.48   0.11  0.50  0.18   0.11  1  0.19
        ma     0.0   0.0   0.0    0.01  0.50  0.01   0.01  1  0.01
rnn     cm     0.62  0.32  0.42   0.03  0.50  0.05   0.03  1  0.05
        ecb    0.09  0.91  0.16   0.01  0.50  0.02   0.01  1  0.02
        mn     0.94  0.91  0.92   0.11  0.50  0.18   0.11  1  0.20
        ma     0.0   0.08  0.0    0.01  0.50  0.01   0.01  1  0.01
Table 10. Comparison of performance (precision, recall, and F1) with a random classifier (RC) that either follows the training set frequencies or always indicates a smell

5.3.3. Poor performance in detecting a design smell

The presented neural networks perform very poorly when it comes to detecting the sole design smell, multifaceted abstraction. We infer two reasons for this under-performance. First, design smells such as multifaceted abstraction are inherently difficult to spot unless a deeper semantic analysis is performed. Specifically, in the case of multifaceted abstraction, the interactions among the methods of a class, as well as its member fields, must be analyzed to observe the cohesion among the methods; this is a non-trivial aspect, and the neural networks could not spot it from the provided input. Therefore, we need to provide refined semantic information, in the form of engineered features, along with the source code to help the neural networks identify the inherent patterns. Second, the number of positive training samples was very low, significantly restricting our training set. This low number severely impacts the ability of the neural networks to infer the aspects that cause the smell. This limitation can be addressed by increasing the number of repositories under analysis.
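To illustrate the kind of engineered semantic feature meant above, the following sketch (our own illustration; the class representation is hypothetical, not the tools' internal model) computes a simple method–field cohesion score: the fraction of method pairs that share at least one member field, which a multifaceted abstraction would keep low.

```python
from itertools import combinations

def cohesion(field_accesses):
    """field_accesses maps each method name to the set of member fields it
    touches; returns the fraction of method pairs sharing at least one field."""
    pairs = list(combinations(field_accesses.values(), 2))
    if not pairs:
        return 1.0
    shared = sum(1 for a, b in pairs if a & b)
    return shared / len(pairs)

# A class doing several unrelated things: low cohesion hints at
# multifaceted abstraction (only log/flush share a field below).
multifaceted = {"parse": {"buffer"}, "render": {"canvas"},
                "log": {"logFile"}, "flush": {"logFile"}}
print(cohesion(multifaceted))
```

Feeding such a score alongside the token stream is one way to surface the method-interaction information that the raw tokens do not expose.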

5.3.4. Trading performance with training-time

As observed in the results section, rnn performs significantly better than cnn in some specific cases. However, we also note that the rnn models take considerably more time to train than the cnn models. We logged the time taken by each experiment for this comparison. Table 11 presents the average time taken by each model for each smell per epoch. The table shows that the rnn's performance comes from much more intensive processing compared to cnn. Therefore, if the performance of rnn and cnn is comparable for a given task, one should choose a cnn-based solution for its significantly faster training time.

Smell  cnn-1d  cnn-2d  rnn
cm     1.2     1.0     2,134
ecb    0.8     0.5     1,038
mn     3.2     3.9     5,737
ma     0.8     4.6     2,208
Table 11. Average time taken by the models to train a single epoch, in seconds

5.4. Opportunities

This study may encourage the research community to explore deep learning methods as a viable option for addressing the problem of smell detection. Given that we did not consider the context of smells reported by deterministic tools, or developers' opinions on them, it would be particularly interesting to combine these aspects, either by considering developers' opinions (by manually tagging the samples) while segregating positive and negative samples, or by designing models that take such opinions as an input.

We have shown that transfer-learning is feasible in the context of code smells. This result introduces new directions for automating smell detection which is particularly useful for programming languages for which smell detection tools are either not available or not matured.

This work shows the feasibility of detecting implementation smells; however, complex smells such as multifaceted abstraction require further exploration and present many open research challenges. The research community may build on the results presented in this study and further explore optimizations of the presented models, alternative models, or innovative model architectures to address the detection of complex design and architecture smells.

Beyond smell detection, proposing an appropriate refactoring to remove a smell is a non-trivial challenge. There have been some attempts (Tsantalis et al., 2018; Biegel et al., 2011) to separate refactoring changes from bug fixes and feature additions. One may exploit the information produced by such tools, together with the power of deep learning methods, to construct tools that propose suitable refactoring mechanisms.

6. Threats to Validity

Threats to the validity of our work mainly stem from the correctness of the employed tools, our assumption concerning the similarity of the two programming languages, and the generalizability and repeatability of the presented results.

6.1. Construct Validity

Construct validity measures the degree to which tools and metrics actually measure the properties that they are supposed to measure. It concerns the appropriateness of observations and inferences made on the basis of measurements taken during the study.

In the context of using deep learning techniques for smell detection, we use Designite and DesigniteJava to detect smells in C# and Java code, respectively, and use their results as the ground truth. Relying on the outcome of two different tools may pose a threat to validity, especially in the case of transfer-learning. To mitigate the risk, we make sure that both tools use exactly the same set of metrics, thresholds, and heuristics to detect smells. We also verify the similarity of their smell detection through automated as well as manual testing.

To address potential threats posed by representational discrepancies between the two languages, we ensure that Tokenizer generates the same tokens for the same or similar language constructs. For instance, all the common reserved words are mapped to the same integer token for both programming languages.
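Such a cross-language mapping can be sketched as follows (our own illustration; the vocabulary and token ids are hypothetical, not the actual Tokenizer output): reserved words common to C# and Java share one integer id, so equivalent constructs in either language present the model with the same token stream.

```python
# Reserved words common to C# and Java share one integer id; the ids here
# are illustrative, not the actual Tokenizer vocabulary.
SHARED_KEYWORDS = ["class", "if", "else", "for", "while", "return", "try", "catch"]
TOKEN_IDS = {kw: i + 1 for i, kw in enumerate(SHARED_KEYWORDS)}

def tokenize(source):
    """Map known keywords to their shared ids and everything else to 0 (OOV)."""
    words = source.replace("(", " ").replace(")", " ").split()
    return [TOKEN_IDS.get(w, 0) for w in words]

# The same construct in either language yields an identical token stream,
# even though the library identifiers differ.
print(tokenize("try { Console.WriteLine(a); } catch"))
print(tokenize("try { System.out.println(a); } catch"))
```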

6.2. Internal Validity

Internal validity refers to the validity of the research findings. It is primarily concerned with controlling the extraneous variables and outside influences that may impact the outcome.

In the context of our investigation, exploring the feasibility of applying transfer-learning for smell detection, we assume that both programming languages are similar by paradigm, structure, and language constructs. It would be interesting to observe how two completely different programming languages (for example, Java and Python) can be combined in a transfer-learning experiment.

6.3. External Validity

External validity concerns the generalizability and repeatability of the produced results. The method presented in the study is programming language agnostic and thus can be repeated for any other programming language, given the availability of an appropriate tool-chain. To encourage replication and building over this work, we have made all the tools, scripts, and data available online at https://github.com/tushartushar/DeepLearningSmells.

7. Conclusions

The interest in machine learning-based techniques for processing source code has gained momentum in recent years. Despite existing attempts, the community has identified the immaturity of the discipline for source code processing, especially when it comes to identifying quality aspects such as code smells. In this paper, we establish that deep learning methods can be used for smell detection. Specifically, we found that cnn and rnn deep learning models can be used for code smell detection, though with varying performance. We did not find a clearly superior method between 1d and 2d convolutional neural networks: cnn-1d performed slightly better for the empty catch block and multifaceted abstraction smells, while cnn-2d outperformed its one-dimensional counterpart for complex method and magic number. Further, our results indicate that rnn performs far better than the convolutional networks for the empty catch block and magic number smells. Our experiment on applying transfer-learning demonstrates the feasibility of transfer-learning in the context of smell detection, especially for implementation smells.

With the results presented in this paper, we encourage software engineering researchers to build over our work, as we identify ample opportunities for automating smell detection based on deep learning models. There are grounds for extending this work to a wider scope by including smells of design and architecture granularities. Furthermore, there exist opportunities for further exploiting these results, coupled with deep learning methods, to identify suitable refactoring candidates. On the practical side, tool developers may adopt deep learning methods for effective smell detection and use transfer-learning to detect smells in programming languages for which comprehensive code smell detection tools are not available.

This work is partially funded by the seneca project, which is part of the Marie Skłodowska-Curie Innovative Training Networks (itn-eid) under grant agreement number 642954, and by the CROSSMINER project, which has received funding from the European Union's Horizon 2020 Research and Innovation Programme under grant agreement No. 732223. We would like to thank Antonis Gkortzis, Theodore Stassinopoulos, and Alexandra Chaniotakis for generously contributing effort to our DesigniteJava project. We would like to thank grnet (Greek Research and Network Center) for allowing us to perform the experiments on their super-computing facility.


  • Abdeljaber et al. (2017) Osama Abdeljaber, Onur Avci, Serkan Kiranyaz, Moncef Gabbouj, and Daniel J Inman. 2017. Real-time vibration-based structural damage detection using one-dimensional convolutional neural networks. Journal of Sound and Vibration 388 (2017), 154–170.
  • Alexandru et al. (2017) Carol V Alexandru, Sebastiano Panichella, and Harald C Gall. 2017. Replicating parser behavior using neural machine translation. In Proceedings of the 25th International Conference on Program Comprehension. IEEE Press, 316–319.
  • Allamanis et al. (2018) Miltiadis Allamanis, Earl T Barr, Premkumar Devanbu, and Charles Sutton. 2018. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR) 51, 4 (2018), 81.
  • Allamanis et al. (2016) Miltiadis Allamanis, Hao Peng, and Charles Sutton. 2016. A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning. 2091–2100.
  • Arnaoudova et al. (2013) Venera Arnaoudova, Massimiliano Di Penta, Giuliano Antoniol, and Yann-Gaël Guéhéneuc. 2013. A New Family of Software Anti-patterns: Linguistic Anti-patterns. In CSMR ’13: Proceedings of the 2013 17th European Conference on Software Maintenance and Reengineering. IEEE Computer Society, 187–196.
  • Barbez et al. (2019) Antoine Barbez, Foutse Khomh, and Yann-Gaël Guéhéneuc. 2019. A Machine-learning Based Ensemble Method For Anti-patterns Detection. arXiv:cs.SE/1903.01899
  • Baziotis et al. (2017) Christos Baziotis, Nikos Pelekis, and Christos Doulkeridis. 2017. Datastories at semeval-2017 task 4: Deep lstm with attention for message-level and topic-based sentiment analysis. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). 747–754.
  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. 2013. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35, 8 (2013), 1798–1828.
  • Biegel et al. (2011) Benjamin Biegel, Quinten David Soetens, Willi Hornig, Stephan Diehl, and Serge Demeyer. 2011. Comparison of Similarity Metrics for Refactoring Detection. In Proceedings of the 8th Working Conference on Mining Software Repositories (MSR ’11). ACM, 53–62. https://doi.org/10.1145/1985441.1985452
  • Bryton et al. (2010) Sérgio Bryton, Fernando Brito E Abreu, and Miguel Monteiro. 2010. Reducing subjectivity in code smells detection: Experimenting with the Long Method. In Proceedings - 7th International Conference on the Quality of Information and Communications Technology, QUATIC 2010. Faculdade de Ciencias e Tecnologia, New University of Lisbon, Caparica, Portugal, IEEE, 337–342.
  • Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724–1734.
  • Chollet (2017) Francois Chollet. 2017. Deep learning with python. Manning Publications Co.
  • Czibula et al. (2015) Gabriela Czibula, Zsuzsanna Marian, and Istvan Gergely Czibula. 2015. Detecting software design defects using relational association rule mining. Knowledge and Information Systems 42, 3 (March 2015), 545–577.
  • de Andrade et al. (2014) Hugo Sica de Andrade, Eduardo Almeida, and Ivica Crnkovic. 2014. Architectural Bad Smells in Software Product Lines: An Exploratory Study. In Proceedings of the WICSA 2014 Companion Volume (WICSA ’14 Companion). ACM, Article 12, 6 pages. https://doi.org/10.1145/2578128.2578237
  • Deng et al. (2009) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
  • Ernst (2017) Michael D Ernst. 2017. Natural language is a programming language: Applying natural language processing to software development. In LIPIcs-Leibniz International Proceedings in Informatics, Vol. 71. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.
  • Felleman and Van Essen (1991) Daniel J Felleman and David C Van Essen. 1991. Distributed Hierarchical Processing in the Primate Cerebral Cortex. Cerebral Cortex 1, 1 (1991), 1–47.
  • Fontana et al. (2016) Francesca Arcelli Fontana, Ilaria Pigazzini, Riccardo Roveda, and Marco Zanoni. 2016. Automatic Detection of Instability Architectural Smells. In Software Maintenance and Evolution (ICSME), 2016 IEEE International Conference on. IEEE, 433–437.
  • Foster et al. (2012) Stephen R Foster, William G Griswold, and Sorin Lerner. 2012. WitchDoctor: IDE support for real-time auto-completion of refactorings. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 222–232.
  • Fowler (1999) Martin Fowler. 1999. Refactoring: Improving the Design of Existing Programs (1 ed.). Addison-Wesley Professional.
  • Fu and Shen (2015) Shizhe Fu and Beijun Shen. 2015. Code Bad Smell Detection through Evolutionary Data Mining. In International Symposium on Empirical Software Engineering and Measurement. Shanghai Jiaotong University, Shanghai, China, IEEE, 41–49.
  • Fu and Menzies (2017) Wei Fu and Tim Menzies. 2017. Easy over hard: A case study on deep learning. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 49–60.
  • Gal and Ghahramani (2015) Yarin Gal and Zoubin Ghahramani. 2015. A Theoretically Grounded Application of Dropout in Recurrent Neural Networks. arXiv e-prints, Article arXiv:1512.05287 (Dec 2015), arXiv:1512.05287 pages. arXiv:stat.ML/1512.05287
  • Garcia et al. (2009a) Joshua Garcia, Daniel Popescu, George Edwards, and Nenad Medvidovic. 2009a. Identifying Architectural Bad Smells. In CSMR ’09: Proceedings of the 2009 European Conference on Software Maintenance and Reengineering. IEEE Computer Society, 255–258.
  • Garcia et al. (2009b) Joshua Garcia, Daniel Popescu, George Edwards, and Nenad Medvidovic. 2009b. Toward a Catalogue of Architectural Bad Smells. In Proceedings of the 5th International Conference on the Quality of Software Architectures: Architectures for Adaptive Software Systems (QoSA ’09). Springer-Verlag, 146–162. https://doi.org/10.1007/978-3-642-02351-4_10
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, Aaron Courville, and Yoshua Bengio. 2016. Deep learning. Vol. 1. MIT press Cambridge.
  • Graves et al. (2013) Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. 2013. Hybrid speech recognition with deep bidirectional LSTM. In Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. IEEE, 273–278.
  • Greff et al. (2017) Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. 2017. LSTM: A search space odyssey. IEEE transactions on neural networks and learning systems 28, 10 (2017), 2222–2232.
  • Gupta et al. (2017) Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing Common C Language Errors by Deep Learning.. In AAAI. 1345–1351.
  • Hellendoorn and Devanbu (2017) Vincent J Hellendoorn and Premkumar Devanbu. 2017. Are deep neural networks the best choice for modeling source code?. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 763–773.
  • Hindle et al. (2012) Abram Hindle, Earl T Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the naturalness of software. In Software Engineering (ICSE), 2012 34th International Conference on. IEEE, 837–847.
  • Hinton et al. (2006) Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. 2006. A fast learning algorithm for deep belief nets. Neural computation 18, 7 (2006), 1527–1554.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
  • Hubel and Wiesel (1962) David H Hubel and Torsten N Wiesel. 1962. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. The Journal of physiology 160, 1 (1962), 106–154.
  • Huo et al. (2016) Xuan Huo, Ming Li, and Zhi-Hua Zhou. 2016. Learning Unified Features from Natural and Programming Languages for Locating Buggy Source Code.. In IJCAI. 1606–1612.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on International Conference on Machine Learning-Volume 37. JMLR. org, 448–456.
  • Iyer et al. (2016) Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing source code using a neural attention model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 2073–2083.
  • Johnson and Zhang (2015) Rie Johnson and Tong Zhang. 2015. Effective Use of Word Order for Text Categorization with Convolutional Neural Networks. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 103–112.
  • Kessentini et al. (2014) Wael Kessentini, Marouane Kessentini, Houari Sahraoui, Slim Bechikh, and Ali Ouni. 2014. A Cooperative Parallel Search-Based Software Engineering Approach for Code-Smells Detection. IEEE Transactions on Software Engineering 40, 9 (2014), 841–861.
  • Khomh et al. (2009) Foutse Khomh, Stéphane Vaucher, Yann-Gaël Guéhéneuc, and Houari Sahraoui. 2009. A Bayesian Approach for the Detection of Code and Design Smells. In QSIC ’09: Proceedings of the 2009 Ninth International Conference on Quality Software. IEEE Computer Society, 305–314.
  • Khomh et al. (2011) Foutse Khomh, Stéphane Vaucher, Yann-Gaël Guéhéneuc, and Houari Sahraoui. 2011. BDTEX: A GQM-based Bayesian approach for the detection of antipatterns. In Journal of Systems and Software. Ecole Polytechnique de Montreal, Montreal, Canada, 559–572.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Kraus et al. (2016) Oren Z Kraus, Jimmy Lei Ba, and Brendan J Frey. 2016. Classifying and segmenting microscopy images with deep multiple instance learning. Bioinformatics 32, 12 (2016), i52–i59.
  • Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report. Citeseer.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems. 1097–1105.
  • Kruchten et al. (2012) Philippe Kruchten, Robert L. Nord, and Ipek Ozkaya. 2012. Technical Debt: From Metaphor to Theory and Practice. IEEE Software 29, 6 (2012), 18–21. https://doi.org/10.1109/MS.2012.167
  • Lawrence et al. (1997) Steve Lawrence, C Lee Giles, Ah Chung Tsoi, and Andrew D Back. 1997. Face recognition: A convolutional neural-network approach. IEEE transactions on neural networks 8, 1 (1997), 98–113.
  • LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. nature 521, 7553 (2015), 436.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (1998), 2278–2324.
  • LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. 2010. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann. lecun. com/exdb/mnist 2 (2010).
  • Lee et al. (2017) Song-Mi Lee, Sang Min Yoon, and Heeryon Cho. 2017. Human activity recognition from accelerometer data using Convolutional Neural Network. In Big Data and Smart Computing (BigComp), 2017 IEEE International Conference on. IEEE, 131–134.
  • Li et al. (2017) Jian Li, Pinjia He, Jieming Zhu, and Michael R Lyu. 2017. Software defect prediction via convolutional neural network. In Software Quality, Reliability and Security (QRS), 2017 IEEE International Conference on. IEEE, 318–328.
  • Ling et al. (2016) Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiskỳ, Fumin Wang, and Andrew Senior. 2016. Latent Predictor Networks for Code Generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 599–609.
  • Lippert and Roock (2006) Martin Lippert and Stephen Roock. 2006. Refactoring in large software projects: performing complex restructurings successfully. John Wiley & Sons.
  • Luong et al. (2015) Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 1412–1421.
  • Maas et al. (2011) Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. 2011. Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, 142–150. http://www.aclweb.org/anthology/P11-1015
  • Maiga et al. (2012a) Abdou Maiga, Nasir Ali, Neelesh Bhattacharya, Aminata Sabané, Yann-Gaël Guéhéneuc, and Esma Aïmeur. 2012a. SMURF: A SVM-based incremental anti-pattern detection approach. In Proceedings - Working Conference on Reverse Engineering, WCRE. Ptidej Team, IEEE, 466–475.
  • Maiga et al. (2012b) Abdou Maiga, Nasir Ali, Neelesh Bhattacharya, Aminata Sabané, Yann-Gaël Guéhéneuc, Giuliano Antoniol, and Esma Aïmeur. 2012b. Support vector machines for anti-pattern detection. In ASE 2012: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering. Polytechnic School of Montreal, ACM, 278–281.
  • Marinescu (2004) Radu Marinescu. 2004. Detection Strategies: Metrics-Based Rules for Detecting Design Flaws. In Proceedings of the 20th IEEE International Conference on Software Maintenance (ICSM ’04). IEEE Computer Society, 350–359.
  • Marinescu (2005) R Marinescu. 2005. Measurement and quality in object-oriented design. In 21st IEEE International Conference on Software Maintenance (ICSM’05). Universitatea Politehnica din Timisoara, Timisoara, Romania, IEEE, 701–704.
  • Martens (2010) James Martens. 2010. Deep learning via Hessian-free optimization.. In ICML, Vol. 27. 735–742.
  • Moha et al. (2010) Naouel Moha, Yann-Gaël Guéhéneuc, Laurence Duchien, and Anne-Françoise Le Meur. 2010. DECOR: A Method for the Specification and Detection of Code and Design Smells. IEEE Trans. Software Eng. 36, 1 (2010), 20–36. https://doi.org/10.1109/TSE.2009.50
  • Mou et al. (2016) Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In AAAI, Vol. 2. 4.
  • Munaiah et al. (2017) Nuthan Munaiah, Steven Kroh, Craig Cabrey, and Meiyappan Nagappan. 2017. Curating GitHub for engineered software projects. Empirical Software Engineering 22, 6 (01 Dec 2017), 3219–3253. https://doi.org/10.1007/s10664-017-9512-6
  • Nguyen et al. (2013) Anh Tuan Nguyen, Tung Thanh Nguyen, and Tien N Nguyen. 2013. Lexical statistical machine translation for language migration. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering. ACM, 651–654.
  • Nucci et al. (2018) D. Di Nucci, F. Palomba, D. A. Tamburri, A. Serebrenik, and A. De Lucia. 2018. Detecting code smells using machine learning techniques: Are we there yet? In 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER), Vol. 00. 612–621. https://doi.org/10.1109/SANER.2018.8330266
  • Oda et al. (2015) Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation (t). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 574–584.
  • Ott et al. (2018) Jordan Ott, Abigail Atchison, Paul Harnack, Natalie Best, Haley Anderson, Cristiano Firmani, and Erik Linstead. 2018. Learning Lexical Features of Programming Languages from Imagery Using Convolutional Neural Networks. (2018), 336–339. https://doi.org/10.1145/3196321.3196359
  • Ouni et al. (2015) Ali Ouni, Raula Gaikovina Kula, Marouane Kessentini, and Katsuro Inoue. 2015. Web Service Antipatterns Detection Using Genetic Programming. In GECCO ’15: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation. Osaka University, ACM, 1351–1358.
  • Palomba et al. (2015) Fabio Palomba, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, Denys Poshyvanyk, and Andrea De Lucia. 2015. Mining version histories for detecting code smells. IEEE Transactions on Software Engineering 41, 5 (May 2015), 462–489.
  • Palomba et al. (2016) Fabio Palomba, Annibale Panichella, Andrea De Lucia, Rocco Oliveto, and Andy Zaidman. 2016. A textual-based technique for Smell Detection. In 2016 IEEE 24th International Conference on Program Comprehension (ICPC). Universita di Salerno, Salerno, Italy, IEEE, 1–10.
  • Parkhi et al. (2015) Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, et al. 2015. Deep face recognition. In BMVC, Vol. 1. 6.
  • Piech et al. (2015) Chris Piech, Jonathan Huang, Andy Nguyen, Mike Phulsuksombati, Mehran Sahami, and Leonidas Guibas. 2015. Learning Program Embeddings to Propagate Feedback on Student Code. In International Conference on Machine Learning. 1093–1102.
  • Pu et al. (2016) Yewen Pu, Karthik Narasimhan, Armando Solar-Lezama, and Regina Barzilay. 2016. sk_p: a neural program corrector for MOOCs. In Companion Proceedings of the 2016 ACM SIGPLAN International Conference on Systems, Programming, Languages and Applications: Software for Humanity. ACM, 39–40.
  • Robles (2010) Gregorio Robles. 2010. Replicating MSR: A study of the potential replicability of papers published in the Mining Software Repositories proceedings. In Mining Software Repositories (MSR), 2010 7th IEEE Working Conference on. IEEE, 171–180.
  • Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature 323, 6088 (1986), 533.
  • Sahin et al. (2014) Dilan Sahin, Marouane Kessentini, Slim Bechikh, and Kalyanmoy Deb. 2014. Code-Smell Detection as a Bilevel Problem. ACM Transactions on Software Engineering and Methodology (TOSEM) 24, 1 (Oct. 2014), 6–44.
  • Sainath et al. (2015) Tara N Sainath, Brian Kingsbury, George Saon, Hagen Soltau, Abdel-rahman Mohamed, George Dahl, and Bhuvana Ramabhadran. 2015. Deep convolutional neural networks for large-scale speech tasks. Neural Networks 64 (2015), 39–48.
  • Salehie et al. (2006) Mazeiar Salehie, Shimin Li, and Ladan Tahvildari. 2006. A Metric-Based Heuristic Framework to Detect Object-Oriented Design Flaws. In ICPC ’06: Proceedings of the 14th IEEE International Conference on Program Comprehension (ICPC’06). University of Waterloo, IEEE Computer Society, 159–168.
  • Sharma (2016) Tushar Sharma. 2016. Designite - A Software Design Quality Assessment Tool. https://doi.org/10.5281/zenodo.2566832 http://www.designite-tools.com.
  • Sharma (2018) Tushar Sharma. 2018. DesigniteJava. https://doi.org/10.5281/zenodo.2566861 https://github.com/tushartushar/DesigniteJava.
  • Sharma (2019a) Tushar Sharma. 2019a. CodeSplit for C#. https://doi.org/10.5281/zenodo.2566905
  • Sharma (2019b) Tushar Sharma. 2019b. CodeSplitJava. https://doi.org/10.5281/zenodo.2566865 https://github.com/tushartushar/CodeSplitJava.
  • Sharma et al. (2016) Tushar Sharma, Pratibha Mishra, and Rohit Tiwari. 2016. Designite — A Software Design Quality Assessment Tool. In Proceedings of the First International Workshop on Bringing Architecture Design Thinking into Developers’ Daily Activities (BRIDGE ’16). ACM. https://doi.org/10.1145/2896935.2896938
  • Sharma and Spinellis (2018) Tushar Sharma and Diomidis Spinellis. 2018. A survey on software smells. Journal of Systems and Software 138 (2018), 158 – 173. https://doi.org/10.1016/j.jss.2017.12.034
  • Spinellis (2019) Diomidis Spinellis. 2019. dspinellis/tokenizer: Version 1.1. https://doi.org/10.5281/zenodo.2558420 https://github.com/dspinellis/tokenizer.
  • Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
  • Sundermeyer et al. (2012) Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. LSTM neural networks for language modeling. In Thirteenth annual conference of the international speech communication association.
  • Suryanarayana et al. (2014) Girish Suryanarayana, Ganesh Samarthyam, and Tushar Sharma. 2014. Refactoring for Software Design Smells: Managing Technical Debt (1 ed.). Morgan Kaufmann.
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
  • Tsantalis and Chatzigeorgiou (2011) Nikolaos Tsantalis and Alexander Chatzigeorgiou. 2011. Identification of Extract Method Refactoring Opportunities for the Decomposition of Methods. Journal of Systems & Software 84, 10 (Oct. 2011), 1757–1782. https://doi.org/10.1016/j.jss.2011.05.016
  • Tsantalis et al. (2018) Nikolaos Tsantalis, Matin Mansouri, Laleh M. Eshkevari, Davood Mazinanian, and Danny Dig. 2018. Accurate and Efficient Refactoring Detection in Commit History. In Proceedings of the 40th International Conference on Software Engineering (ICSE ’18). ACM, 483–494. https://doi.org/10.1145/3180155.3180206
  • Vasilescu et al. (2017) Bogdan Vasilescu, Casey Casalnuovo, and Premkumar Devanbu. 2017. Recovering clear, natural identifiers from obfuscated JS names. In Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 683–693.
  • Vidal et al. (2014) Santiago A Vidal, Claudia Marcos, and J Andrés Díaz-Pace. 2014. An approach to prioritize code smells for refactoring. Automated Software Engineering 23, 3 (2014), 501–532.
  • Wang et al. (2016) Yequan Wang, Minlie Huang, Li Zhao, et al. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 606–615.
  • Wei and Li (2017) Huihui Wei and Ming Li. 2017. Supervised Deep Features for Software Functional Clone Detection by Exploiting Lexical and Syntactical Information in Source Code. In IJCAI. 3034–3040.
  • Wen et al. (2015) Tsung-Hsien Wen, Milica Gasic, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically Conditioned LSTM-based Natural Language Generation for Spoken Dialogue Systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. 1711–1721.
  • White et al. (2016) Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 87–98.
  • White et al. (2015) Martin White, Christopher Vendome, Mario Linares-Vásquez, and Denys Poshyvanyk. 2015. Toward deep learning software repositories. In Proceedings of the 12th Working Conference on Mining Software Repositories. IEEE Press, 334–345.
  • Yin and Neubig (2017) Pengcheng Yin and Graham Neubig. 2017. A Syntactic Neural Model for General-Purpose Code Generation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1. 440–450.