Source Code Comments: Overlooked in the Realm of Code Clone Detection

06/25/2020 ∙ by Sandeep Kaur Kuttal, et al. ∙ University of Tulsa

Reusing code can produce duplicate or near-duplicate code clones in code repositories. Current code clone detection techniques, like Program Dependence Graphs, rely on code structure and its dependencies to detect clones. These techniques are expensive, using large amounts of processing power, time, and memory. In practice, programmers often utilize code comments to comprehend and reuse code, as comments carry important domain knowledge. But current code clone detection techniques ignore code comments, mainly due to the ambiguity of the English language. Recent advances in information retrieval techniques may have the potential to utilize code comments for clone detection. We investigated this by empirically comparing the accuracy of detecting clones with solely comments versus solely source code (without comments) on the JHotDraw package, which contains 315 classes and 27K lines of code. To detect clones at the file level, we used a topic modeling technique, Latent Dirichlet Allocation, to analyze code comments and GRAPLE – utilizing Program Dependency Graph – to analyze code. Our results show 94.86% recall and 84.21% precision with Latent Dirichlet Allocation and 28.7% recall and 55.39% precision using GRAPLE. We found Latent Dirichlet Allocation generated false positives in cases where programs lacked quality comments. But this limitation can be addressed by using a hybrid approach: utilizing code comments at the file level to reduce the clone set and then using Program Dependency Graph-based techniques at the method level to detect precise clones. Our further analysis across Java and Python packages, Java Swing and PyGUI, found a recall of 74.86% and a precision of 84.21%. Our findings call for reexamining the assumptions regarding the use of code comments in current clone detection techniques.




I Introduction

Code reuse involves modifying existing code fragments for a new context or problem and is a common practice among programmers to improve their productivity [90, 12, 97, 98, 28, 32, 14]. However, reuse practices by programmers can produce duplicate or near-duplicate fragments of code in code repositories [39]. Studies have found that 7% to 23% of software systems contain duplicated code fragments [5, 33, 9, 42]. Such duplicated code fragments – code clones – increase the complexity and cost of software maintenance [13]. In many cases, code clones are also unintentionally generated by programmers as they work on similar programming tasks, overcome the limitations of programming languages, follow organizational coding conventions, or use design patterns [7]. To support reuse and decrease maintenance costs, various code clone detection techniques have been utilized to find similarities between code fragments [13, 77, 18, 62, 20, 74, 16, 44, 96].

One notable technique for detecting code clones uses the Program Dependency Graph (PDG) [96], especially to find semantic clones – functionally similar code fragments that have the same pre- and postconditions, but may or may not be syntactically similar. PDG-based techniques are effective as they preserve statement ordering and analyze the data and control dependencies of the code [56]. But these techniques are known to be computationally expensive and are biased, detecting certain clones but not others [96, 76, 73, 85, 100, 99].

Programmers often utilize code comments to understand and reuse code, as comments carry important domain knowledge [40, 17, 11, 94, 84]. But current code clone detection techniques only evaluate the source code and ignore comments within the source code, which are a significant portion of software systems [79].

One of the primary reasons that clone detection techniques ignore code comments is the ambiguity of the English language. For humans, it is easy to comprehend the similarities and differences between words or topics, but a machine may treat the words differently. However, with recent advancements in topic modeling techniques, we are able to make more accurate predictions using machine learning and natural language processing tools. One of the most versatile topic modeling techniques is Latent Dirichlet Allocation (LDA), which has been extensively used for mining hidden topic patterns.

We conjectured that we can utilize code comments by applying current topic modeling techniques to detect semantic clones. Thus, we investigated the following:

RQ1: Can code comments be utilized as effectively as source code to detect semantic clones at a file level?

To answer this, we empirically compared the precision and recall of LDA (solely using comments) and the PDG-based tool GRAPLE (solely using source code) on an open source software package called JHotDraw (315 Java source files with 27 KLOC). We found 94.86% recall and 84.21% precision with LDA and 28.7% recall and 55.39% precision using GRAPLE.

RQ2: Can we utilize a hybrid approach to detect clones?
Despite being able to use code comments to detect clones as in RQ1, our objective was not to replace current state-of-the-art semantic clone techniques such as PDG since they are more precise and generate fewer false positives. Hence, we formulated RQ2 to investigate if we can utilize the whole program file, both code comments and source code, to detect clones. Therefore, we analyzed the clone sets generated by LDA and PDG for the JHotDraw package and found common clones between them.

RQ3: Can code comments help in detecting code clones across programming languages?
Since code comments are in plain English, they are more often similar across programming languages unlike the syntax of source code. Hence, we conjecture that good quality comments can support interoperability across several programming languages. To investigate the utilization of quality comments to detect clones in different languages, we formulated RQ3. Thus, we analyzed two popular Java and Python graphics packages, Java Swing (318 source files with 30KLOC) and PyGUI (355 source files with 28KLOC). When comparing across languages, we found recall of 74.86% and precision of 84.21%.

II Background

II-A Program Dependency Graphs - Solely using Source Code

Clone detection techniques utilize program representations such as source code text, tokens, abstract syntax trees (ASTs), and program dependence graphs (PDGs), each with its own set of advantages and disadvantages. For our purpose, PDG-based clone detection techniques are well suited to detecting semantic (functionally equivalent) clones, as the use of PDGs preserves the semantics of statement ordering and is oblivious to code syntax.

A PDG is a static representation of the flow of data through a procedure [38]. The nodes of a PDG could be declarations, simple statements, expressions, or control points of a source file. Control points are program branches, loops, or enter/exit points for a procedure. The edges of a PDG encode the data and control dependencies between program points. Since PDGs abstract many arbitrary syntactic decisions that a programmer makes while constructing a function, they are the best fit for finding semantic code clones [56].

A Program Dependence Graph can be represented as a directed graph $G = (V, E)$ with a set of vertices $V$ and a set of edges $E \subseteq V \times V$. A labeling function $l$ maps vertices or edges to labels, $l: V \cup E \rightarrow L$. $E$ can be represented by a matrix $E$, where $E_{i,j} = 1$ if there is an edge from vertex $v_i$ to vertex $v_j$. Otherwise it is 0.

A subgraph $H$ of $G$ ($H \subseteq G$) only exists if there is an injective mapping $m: V_H \rightarrow V_G$ such that:

  1. Labels of all vertices in H should be the same when mapped to the labels of vertices in G

  2. All edges in H are also in G

  3. Labels of all edges in H should map with all edge labels in G

These mappings are called embeddings. Fig. 1 shows two code fragments and their respective PDGs. Both PDGs are the same despite one code fragment using a for-loop and the other a while-loop to calculate the factorial of a number.
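The three embedding conditions can be checked mechanically. The following is a minimal sketch, not GRAPLE's implementation: the dictionary-based graph layout, the function name `is_embedding`, and the toy vertex and edge labels are all assumptions made for illustration.

```python
def is_embedding(mapping, sub, graph):
    """Return True if `mapping` (vertex of H -> vertex of G) is an embedding of H in G."""
    # The mapping must cover every vertex of H and be injective.
    if set(mapping) != set(sub["labels"]):
        return False
    if len(set(mapping.values())) != len(mapping):
        return False
    # Condition 1: each vertex of H keeps its label when mapped into G.
    for vertex, image in mapping.items():
        if graph["labels"].get(image) != sub["labels"][vertex]:
            return False
    # Conditions 2 and 3: each edge of H exists in G and carries the same label.
    for (u, v), label in sub["edges"].items():
        if graph["edges"].get((mapping[u], mapping[v])) != label:
            return False
    return True


# Toy graphs (hypothetical labels, not taken from a real PDG).
G = {
    "labels": {1: "entry", 2: "assign", 3: "loop", 4: "return"},
    "edges": {(1, 2): "control", (2, 3): "data", (3, 4): "data"},
}
H = {
    "labels": {"a": "assign", "b": "loop"},
    "edges": {("a", "b"): "data"},
}

print(is_embedding({"a": 2, "b": 3}, H, G))  # True
print(is_embedding({"a": 3, "b": 2}, H, G))  # False
```

The second call fails on condition 1, because vertex "a" (labeled "assign") would be mapped to a vertex labeled "loop".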

Fig. 1: Code fragments for determining the factorial of a number, fragment 1 uses a for-loop and fragment 2 uses a while-loop. Each fragment is accompanied by a program dependency graph. Although the code fragments use different syntax, the dependency graphs are the same. Hence, the code fragments are semantic clones and can be detected by PDG-based techniques.
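As an illustration of the kind of pair Fig. 1 describes, the two fragments below compute a factorial with a for-loop and a while-loop. This is a sketch in Python rather than the Java of the figure, and the function names are ours; the point is only that the two fragments differ syntactically while having the same data and control dependencies on `result` and the loop counter.

```python
def factorial_for(n):
    """Fragment 1: factorial using a for-loop."""
    result = 1
    for i in range(1, n + 1):
        result = result * i
    return result


def factorial_while(n):
    """Fragment 2: factorial using a while-loop."""
    result = 1
    i = 1
    while i <= n:
        result = result * i
        i = i + 1
    return result


# Both fragments produce the same output for the same input.
print(factorial_for(5), factorial_while(5))  # 120 120
```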


GRAPLE (GRAph samPLE) is a code clone detection technique used to sample subgraphs of large graphs and then statistically estimate the subgraphs’ characteristics [96]. We decided to use GRAPLE as it (1) is a statistically unbiased method for sampling dependence clones, unlike most PDG-based clone detection tools, which are biased towards detecting certain types of clones, and (2) allows estimating parameters of the whole clone population to reduce computational cost, as it is impractical to process all dependence clones [96].

GRAPLE uses Frequent Subgraph Mining (FSGM). FSGM techniques identify recurring subgraphs of a graph database by traversing the $k$-frequent connected subgraph lattice, $k$ being the number of times a subgraph recurs. Once a subgraph is identified, its support is calculated as the number of embeddings (mappings) that it has in the graph database. Finding recurring isomorphic subgraphs is expensive, and it can be improved by storing the embeddings of each subgraph. Storing embeddings is called the canonicalization process. After the canonicalization process, GRAPLE randomly samples from the space of maximal frequent subgraphs. A frequent subgraph is maximal if no larger frequent subgraphs can be constructed. The sampling procedure allows us to compute selection probabilities for subgraphs, which can be used in statistical estimators such as the Horvitz-Thompson unequal probability estimator [81]. To use the Horvitz-Thompson probability estimator, it is necessary to determine the probability that a maximal frequent pattern is selected on a random walk of the $k$-frequent connected subgraph lattice. GRAPLE uses Markov chains to compute this probability [10]. A Markov chain moves from one state to another according to the probability $p_{i,j}$ (the transition probability from state $s_i$ to $s_j$) in the transition matrix $P$ of a finite set of states, $S = \{s_1, s_2, \ldots, s_n\}$. Here, states are considered as vertices of the lattice (i.e. frequent patterns). The transition probability for an edge from $s_i$ to $s_j$ is the reciprocal of the out-degree of $s_i$:

$$p_{i,j} = \frac{1}{\text{outdeg}(s_i)}$$
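To make the random walk concrete, the sketch below computes the probability that a walk ends at each maximal pattern of a tiny lattice. The lattice, its pattern names, and both function names are hypothetical. The sketch also enumerates paths directly rather than solving the Markov chain through its transition matrix, which is only practical for a toy acyclic example.

```python
from fractions import Fraction

# Hypothetical lattice of frequent patterns: an edge u -> v means that
# pattern v extends pattern u. Patterns with no successors are maximal.
lattice = {
    "root": ["A", "B"],
    "A": ["AB", "AC"],
    "B": ["AB"],
    "AB": [],
    "AC": [],
}


def transition_probability(lattice, u, v):
    """Reciprocal of the out-degree of u if u -> v is an edge, else 0."""
    successors = lattice[u]
    if v not in successors:
        return Fraction(0)
    return Fraction(1, len(successors))


def selection_probabilities(lattice, start):
    """Probability that a random walk from `start` stops at each maximal pattern."""
    result = {}

    def walk(node, prob):
        successors = lattice[node]
        if not successors:  # maximal pattern: the walk stops here
            result[node] = result.get(node, Fraction(0)) + prob
            return
        for nxt in successors:
            walk(nxt, prob * transition_probability(lattice, node, nxt))

    walk(start, Fraction(1))
    return result


probs = selection_probabilities(lattice, "root")
print(probs["AB"], probs["AC"])  # 3/4 1/4
```

The pattern "AB" is reachable along two paths (through "A" and through "B"), so it is selected more often than "AC". Unequal selection probabilities of this kind are what an estimator such as Horvitz-Thompson corrects for.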

II-B Latent Dirichlet Allocation - Solely using Comments

Latent Dirichlet Allocation (LDA) is a statistical model for topic modeling that has been extensively used in natural language processing for representing text documents [15]. To investigate the use of code comments in clone detection, we specifically utilized LDA as it (1) is the most popular topic modeling technique in the fields of machine learning and artificial intelligence [31, 53], (2) allows faster training, (3) is simple, and (4) efficiently utilizes statistics. Further, we wanted to use a simple and efficient count-based NLP technique to investigate our RQs rather than focusing on the performance supported by sophisticated techniques like doc2vec. Given a corpus of documents, LDA identifies a set of topics; it associates a set of words with a topic, and a specific mixture of these topics with each document.

Basic terminologies to describe LDA are:

Word: A word is a basic unit that has been extracted from a vocabulary.

Document: A document is a series of $N$ words denoted by $d = (w_1, w_2, \ldots, w_N)$, where $w_i$ is the $i$-th word in the series. We considered source files as documents.

Corpus: A corpus is a set of $M$ documents denoted by $D = \{d_1, d_2, \ldots, d_M\}$. We considered a code repository (or package) of source files as a corpus.

Topic: Topics are identified based on frequent and similar words in a corpus. Each document $d$ can be modeled as a multinomial distribution $\theta_d$ over $T$ topics, and each topic $z_t$, $t = 1, \ldots, T$, as a multinomial distribution $\phi_t$ over the set of words $W$.

Our task is to make an estimate of $\theta$ and $\phi$ in order to discover the set of topics used and the distribution of these topics in each document in a corpus of $D$ documents [15]. However, the LDA model assumes a prior Dirichlet distribution on $\theta$, thus allowing the estimation of $\phi$ without requiring the estimation of $\theta$.

The LDA algorithm [15] works as follows:

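  1. Choose $N \sim \text{Poisson}(\xi)$: Select the number of words $N$

  2. Choose $\theta \sim \text{Dir}(\alpha)$: Select $\theta$ from the Dirichlet distribution parameterized by $\alpha$

  3. For each word $w_n \in W$ do

    - choose a topic $z_n \sim \text{Multinomial}(\theta)$

    - choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$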
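The generative steps listed above can be sketched with the Python standard library alone. This is an illustration of the generative story, not of how an LDA model is trained. The vocabulary, the topic-word weights `beta`, the hyperparameter values, and all function names are assumptions chosen for the example.

```python
import math
import random


def sample_poisson(rng, lam):
    """Draw from Poisson(lam) using Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_dirichlet(rng, alpha):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, alpha, beta, vocabulary, xi):
    """Generate one document following steps 1 to 3 of the LDA generative process."""
    n_words = sample_poisson(rng, xi)        # step 1: number of words N
    theta = sample_dirichlet(rng, alpha)     # step 2: topic mixture theta
    words = []
    for _ in range(n_words):                 # step 3: one topic and one word per position
        topic = rng.choices(range(len(alpha)), weights=theta)[0]
        word = rng.choices(vocabulary, weights=beta[topic])[0]
        words.append(word)
    return theta, words


# Hypothetical two-topic setup over a five-word vocabulary.
vocabulary = ["draw", "figure", "applet", "parse", "exception"]
beta = [
    [0.4, 0.4, 0.2, 0.0, 0.0],  # topic 0: drawing-related words
    [0.0, 0.0, 0.1, 0.4, 0.5],  # topic 1: parsing-related words
]
rng = random.Random(0)
theta, words = generate_document(rng, alpha=[0.5, 0.5], beta=beta,
                                 vocabulary=vocabulary, xi=8)
print(theta, words)
```

Training inverts this process: given only the observed words, it infers the topic mixtures and topic-word distributions that most plausibly generated them.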

The LDA model uses a document-word matrix $[w_i, f_j] = v_{ij}$, where $v_{ij}$ is a value indicating the importance of the word $w_i$ in the file $f_j$. The value of $v_{ij}$ is computed using Gibbs sampling [91], which uses a Markov Chain Monte Carlo method to converge to the target distributions over a number of iterations. LDA assumes documents are produced from a mixture of topics, and these topics generate words based on their probability distribution. LDA samples topics of documents. LDA takes primarily two parameters, $\alpha$ and $\beta$, where $\alpha$ represents document-topic density and $\beta$ represents topic-word density. The higher the value of $\alpha$, the more topics the documents are composed of. A lower value of $\alpha$ indicates that the document contains fewer topics. On the other hand, the higher the value of $\beta$, the more words the topic contains. A lower value of $\beta$ indicates that the topic contains fewer words. Fig. 2 shows the labeling of code comments from two source files and indicates the relevant topics.

Fig. 2: The mechanics of LDA model after the comments are extracted from source files. Bold arrows depict the highest relevance between the comments and a topic, whereas the dashed arrows depict the next best relevance between them. Both source files have the highest relevance with Topic 3; the highlighted words indicate the occurrence of each word in respect to their topic.

III Approach

To investigate the potential of using code comments to detect clones, we utilized Open Source Software (OSS) because code reuse is an acceptable practice in the OSS community as it is believed that knowledge should be shared with humankind [49].

III-A Dataset

We used three open source packages: JHotDraw [47], Java Swing [46], and PyGUI [69]. We used JHotDraw and Java Swing because they are popular packages that have been widely used in clone detection studies [107, 77]. JHotDraw contains 315 source files with approximately 27 KLOC and Java Swing contains 318 source files with 30 KLOC. To compare PDG and LDA, we used the JHotDraw package. For cross-language clone detection, we used Java Swing and PyGUI. We decided to use PyGUI as it is a graphical application package similar to Java Swing, making the two ideal for comparing cross-language clones. PyGUI contains 355 source files with 28 KLOC.

III-B System Configuration

For computing and evaluating our dataset, we used a core i7 Quad Core 4th Gen processor with 3.6 GHz of clock speed and 32 GB of RAM.

III-C Procedure

To study semantic clone detection, we conducted simulations using PDG and LDA.

III-C1 Using PDG

The Program Dependence Graph is one of the most advanced representations used for identifying semantic code clones. Our approach with PDG was accomplished with the help of GRAPLE, an existing code clone detection tool, which generates the clone sets by sampling isomorphic subgraphs. Fig. 3 shows the overall mechanism for detecting clones using source code (without comments). The PDG evaluation consisted of:

Fig. 3: Block Diagram of PDG Evaluation.

JPDG: We used jpdg [96], a PDG generator developed by Henderson et al. [24], as it is more effective in detecting semantic clones than its predecessors. The jpdg tool runs on Apache Buildr 1.4.15 and only works with the Java 1.7 platform, so we had to rely solely on Java 1.7 for generating PDGs. jpdg generates a JSON file, which contains a list of dictionaries of two types: a list of vertices and a list of edges. The dictionaries for the vertices consist of key-value pairs, with keys such as “id,” “label,” “package_name,” “class_name,” “method_name,” “type,” “start_line,” “end_line,” etc. For instance, the key “id” represents the vertex number in the PDG. The dictionaries for the edges contain keys such as “src,” “targ,” “src_label,” “targ_label,” etc. jpdg generated the whole graph database in a single .veg file (49 MB in size), which was used by GRAPLE for sampling the clone sets with and without probability.

GRAPLE with probability: GRAPLE takes the .veg file as an argument along with standard parameters such as minimum-support, sample-size, minimum-vertices, and probabilities. Then, GRAPLE creates a transition matrix for computing the selection probability of each frequent pattern. Each cell of the transition matrix stores the probability of transitioning from one vertex to another vertex. Thus, this process encounters the “Curse of Dimensionality,” as it consumes a huge amount of resources, both memory and processing power. To address this, we restricted GRAPLE to only generating subgraphs with 20 edges (which still generates a large transition matrix), as [96] indicated this to be when the submatrix is manageable. The selection probabilities were computed with support=5, sample-size=100, and min-vertices=8. Once selection probabilities were computed, GRAPLE generated the clone sets.

GRAPLE without probability: To avoid the “Curse of Dimensionality,” we used the option to turn off the selection probabilities. GRAPLE generated a file containing digraphs, each of which represented a clone set. We wrote a script in Python that identified all the clone sets from this file by collecting the nodes of each digraph whose labels and strings contained a source file. We varied the standard parameter, sample-size, from 20 to 200 and observed a very small increase in the clone sets.

The outputs from GRAPLE with and without probability were then passed to the recall and precision module. The recall and precision were reported separately for GRAPLE with selection probabilities and for GRAPLE without selection probabilities.

Fig. 4: Block Diagram of LDA Evaluation.

III-C2 Using LDA

Fig. 4 shows the block diagram of the evaluation mechanism, explaining the training of the LDA model and the extracting and generating of topics using it.

Extracting Comments: In order to train the model, we needed to extract the comments from the source files. We wrote a script in Python using RegEx to extract the comments from JHotDraw.

Comments serve as an integral part of a source file, and are used for understanding the code structure and functionality of the source file [93, 83, 78]. Seidl et al. differentiated Java’s code comments into seven categories [17]. We used all types of comments (refer to Fig. 5), except for copyright and task comments. The reason for this was that copyright comments do not contain information related to the functionality of the source code, and task comments (developer notes containing todo) were not present in our dataset.

Finally, we did not consider the HTML syntax from the comments. The primary reason for excluding the HTML tags (e.g. <html>, <p>, and <br>) was that they misdirected the LDA training process. As LDA uses a multinomial distribution, the large frequency of the HTML tags in the corpus caused the model to assign higher probabilities to irrelevant characters.
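A comment extraction step of this kind can be sketched with Python's `re` module. The patterns and the function name below are our own assumptions, not the script used in the study. The patterns are also deliberately simple: they would misfire on comment markers that appear inside string literals, and they do not filter out copyright comments.

```python
import re

# Java block comments (/* ... */ and /** ... */) and line comments (// ...).
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)
# HTML tags such as <p>, <br>, </b>.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def extract_comments(source):
    """Return the comment texts of a Java source string, without delimiters or HTML tags."""
    comments = []
    for match in COMMENT_RE.findall(source):
        text = TAG_RE.sub(" ", match)
        # Drop comment delimiters and the leading '*' of Javadoc continuation lines.
        text = re.sub(r"^\s*(/\*+|\*+/|\*|//)", " ", text, flags=re.MULTILINE)
        text = text.replace("*/", " ")
        comments.append(" ".join(text.split()))
    return [c for c in comments if c]


sample = """
/**
 * Draws a <b>rounded</b> rectangle.<p>
 */
public class RoundRect {
    // corner radius in pixels
    int radius = 4; /* default */
}
"""
print(extract_comments(sample))
# ['Draws a rounded rectangle.', 'corner radius in pixels', 'default']
```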

Fig. 5: Different types of comments in a source file.

Clean and Normalize: Once the comments were extracted from the source files, we used the Natural Language ToolKit to remove stopwords and punctuation and then to normalize the comments. Once the comments were processed, they were combined to form the corpus.
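The cleaning step can be approximated without external dependencies. In the sketch below, the tiny `STOPWORDS` set is a stand-in for the Natural Language ToolKit's English stopword list, and normalization is reduced to lowercasing. Both simplifications are assumptions made to keep the example self-contained.

```python
import string

# Stand-in stopword list (an assumption; far smaller than a real stopword list).
STOPWORDS = {"a", "an", "the", "of", "in", "is", "to", "and", "for", "this"}

# Translation table that turns every punctuation character into a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_and_normalize(comment):
    """Lowercase a comment, strip punctuation, and drop stopwords."""
    tokens = comment.lower().translate(PUNCT_TO_SPACE).split()
    return [tok for tok in tokens if tok not in STOPWORDS]


print(clean_and_normalize("Draws the corner of a rounded rectangle."))
# ['draws', 'corner', 'rounded', 'rectangle']
```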

Create Dictionary: The corpus generated from the previous module was used to create a dictionary. A dictionary is a collection of all the unique words in the corpus. It also maps between the normalized words and assigned IDs (IDs are generated by the function itself). This was used later to train the LDA model.

Create Doc-Term Matrix: Once a dictionary was prepared, it was used to create the document-term matrix. A typical document-term matrix displays the unique words in the columns and the documents in the rows. So, a cell in the doc-term matrix holds the frequency of the word in the document.
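The dictionary and the document-term matrix can be illustrated with plain Python data structures. The function names and the two toy documents are assumptions; a topic modeling library would normally provide equivalent utilities.

```python
def build_dictionary(docs):
    """Map each unique word in the corpus to an integer ID, in order of first appearance."""
    word_to_id = {}
    for tokens in docs:
        for tok in tokens:
            if tok not in word_to_id:
                word_to_id[tok] = len(word_to_id)
    return word_to_id


def doc_term_matrix(docs, word_to_id):
    """One row per document, one column per word ID; each cell holds a word frequency."""
    matrix = []
    for tokens in docs:
        row = [0] * len(word_to_id)
        for tok in tokens:
            row[word_to_id[tok]] += 1
        matrix.append(row)
    return matrix


# Two toy documents, already cleaned and tokenized.
docs = [["draw", "rectangle", "rectangle"], ["draw", "applet"]]
dictionary = build_dictionary(docs)
print(dictionary)                         # {'draw': 0, 'rectangle': 1, 'applet': 2}
print(doc_term_matrix(docs, dictionary))  # [[1, 2, 0], [1, 0, 1]]
```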

LDA: After the data was processed, we used the corpus along with the dictionary to train the LDA model. Once the model was trained, individual source files were passed to it to generate an associated topic for each file. The model was run with varying passes and iterations for between 1 and 1000 topics, and the number of topics was then set to the value that generated the maximum number of topics. Our primary concern with LDA was that it cannot assign a meaningful label (topic) to the source files. Since we were more interested in the clusters of similar files assigned to a single topic (clone set), we were not concerned with LDA’s inability to assign meaningful names (topics). Below is an example of our output, with each clone set belonging to a topic.

Clone Set 1:[‘Locator’, ‘TriangleRotationHandler’]
Clone Set 2:[‘DrawApplet’, ‘NetApplet’, ‘SVGApplet’, ‘PertApplet’]

We did not impose any restriction on the number of words for a labeled topic.
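Grouping files by their assigned topic can be sketched as follows. The per-file topic weights are made-up numbers (only the file names come from the example output), the function name is ours, and dropping topics with a single file is our assumption about what counts as a clone set.

```python
from collections import defaultdict


def clone_sets_from_topics(topic_weights):
    """Group files by their highest-weight topic; keep groups of two or more files."""
    groups = defaultdict(list)
    for file_name, weights in topic_weights.items():
        dominant = max(range(len(weights)), key=weights.__getitem__)
        groups[dominant].append(file_name)
    return [sorted(files) for _, files in sorted(groups.items()) if len(files) >= 2]


# Hypothetical topic distributions over three topics.
topic_weights = {
    "Locator": [0.80, 0.15, 0.05],
    "TriangleRotationHandler": [0.70, 0.10, 0.20],
    "DrawApplet": [0.10, 0.85, 0.05],
    "NetApplet": [0.05, 0.90, 0.05],
    "BezierPath": [0.20, 0.20, 0.60],
}
for number, files in enumerate(clone_sets_from_topics(topic_weights), start=1):
    print(f"Clone Set {number}:{files}")
# Clone Set 1:['Locator', 'TriangleRotationHandler']
# Clone Set 2:['DrawApplet', 'NetApplet']
```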

We proposed two mechanisms to compare the clone sets generated by LDA and the ground truth.

  • Method 1: Once the clone sets were generated, we calculated the precision and recall by unifying all the clone sets into one single clone set with unique file names. For instance, Clone Set 1:[‘F1’, ‘F2’] and Clone Set 2:[‘F1’, ‘F4’] would become New Clone Set 1:[‘F1’, ‘F2’, ‘F4’]. Similarly, with the ground truth we created another unified clone set and then computed the recall and precision between the two unified sets; a similar approach was used by Maskeri et al. [23].

  • Method 2: Here we kept the clone sets intact. We calculated the precision and recall between each individual clone set reported by LDA and each clone set in the ground truth, and then we took the highest value, since the pair with the highest value would undoubtedly be a match. Finally, we took the average of precision and recall over all the clone sets.
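The two comparison mechanisms can be sketched as below. Several details are assumptions on our part: precision and recall are computed as set overlaps, the "highest value" in Method 2 is taken to be the pair with the highest sum of precision and recall, and the final average runs over the reported clone sets.

```python
def precision_recall(reported, truth):
    """Precision and recall of a reported set of files against a ground-truth set."""
    reported, truth = set(reported), set(truth)
    common = reported & truth
    precision = len(common) / len(reported) if reported else 0.0
    recall = len(common) / len(truth) if truth else 0.0
    return precision, recall


def method_1(reported_sets, truth_sets):
    """Unify all clone sets on each side into one set of unique files, then compare."""
    reported = set().union(*reported_sets)
    truth = set().union(*truth_sets)
    return precision_recall(reported, truth)


def method_2(reported_sets, truth_sets):
    """Match each reported clone set to its best ground-truth set, then average."""
    best = []
    for reported in reported_sets:
        scores = [precision_recall(reported, truth) for truth in truth_sets]
        best.append(max(scores, key=lambda pr: pr[0] + pr[1]))
    avg_precision = sum(p for p, _ in best) / len(best)
    avg_recall = sum(r for _, r in best) / len(best)
    return avg_precision, avg_recall


reported_sets = [["F1", "F2"], ["F1", "F4"]]
truth_sets = [["F1", "F2", "F3"], ["F4", "F5"]]
print(method_1(reported_sets, truth_sets))  # (1.0, 0.6)
print(method_2(reported_sets, truth_sets))
```

In this toy case Method 1 reports perfect precision because every reported file appears somewhere in the ground truth, while Method 2 is stricter: the second reported set only half-matches its best ground-truth set, which lowers the average precision to 0.75.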

III-D Ground Truth

To evaluate the effectiveness of the PDG and LDA methods, a senior undergraduate student and a graduate student investigated the semantic clones using Java Compare and the Eclipse Java Editor to build the ground truth. In total, 45 hours were spent generating the ground truth. For the cross-language analysis, another senior undergraduate student and graduate student created the ground truth. They spent 55 hours generating it.

In the JHotDraw package, 52 clone sets were found; similarly, in Java Swing, 19 clone sets were found, and in PyGUI, 50 clone sets were found. The clone sets consisted of clones ranging from a minimum of 2 to a maximum of 45.

Precision is the percentage of correctly reported differential multisets. Precision is calculated as $\frac{|R \cap G|}{|R|}$, where $R$ is the set of multisets reported by either LDA or PDG and $G$ is the set of multisets in the ground truth.

Recall is the percentage of actual differential multisets reported. It is calculated as $\frac{|R \cap G|}{|G|}$.

IV Results

The state-of-the-art semantic clone detection techniques [95, 101] rely on programs that yield the same outputs using dynamic code similarity detection [51, 82, 29], and identify similar behaviors of different programs by comparing instruction-level execution [21]. These approaches are precise, but not scalable, and have limitations for practical usage. Moreover, none of these approaches have considered code comments for identifying similar code fragments. We wanted to explore the feasibility of using code comments as a parameter for clone detection at a file level. Hence, we investigated:

RQ1: Can code comments be utilized as effectively as source code for detecting semantic clones at a file level?

IV-A GRAPLE

To investigate RQ1, first, we used GRAPLE [96] with and without the selection probability (refer to Sec. III).

GRAPLE with Probability: We used jpdg to generate the program dependence graph and then used GRAPLE to detect the clone sets. jpdg generated the .veg file for JHotDraw, which consisted of 61K vertices and over 118K edges, with an overall size of 49 MB. We used the same parameters as used by Henderson et al. [96]: the sample-size set to 100, the minimum vertices set to 8, and the support set to 5. Using the selection probabilities generated 80 clone sets. By manually inspecting all clone sets, we found most had just one source file. These files were removed from our evaluation, as GRAPLE had detected similar method/function-level clones within that same file. These files were also excluded from the clone sets, as we were interested in detecting clones at the file level. Moreover, we also found some clone sets that were duplicates, where the file pairs had more than one similar functionality in different sections of their code. These extra clone sets were removed and only one was retained, as it did not matter if more than one part of a source file was similar to another part of the source file. After excluding the function-level and duplicated clone sets, only 22 remained. We found a recall of 28.7% and a precision of 55.39%.

GRAPLE without Probability: Without using the selection probabilities, we generated clone sets by varying the sample-size from 20 to 200; however, only a small increase in clone sets was observed. Hence, we generated clone sets using the above-mentioned specifications: sample-size set to 100, minimum-vertices set to 8, and support set to 5. A total of 68 clone sets were generated. After excluding the function-level and duplicated clone sets, only 17 remained. We found a recall of 27.84% and a precision of 52.94%.

IV-B LDA

Secondly, to study whether comments can be utilized for clone detection, we trained the LDA model on code comments from 310 files of JHotDraw using Method 1 and Method 2. Five files were excluded from the data set as they did not contain any code comments.

Fig. 6: Snapshot of Clone Sets of JHotDraw Package.

LDA using Method 1: We used LDA to generate topics (interpreted as a set of semantically related linguistic terms) derived from comments. The LDA model was trained using all the files in the corpus. We randomly set the topic limit to 100. Using this process, 66 clone sets containing 274 files were extracted. Fig. 6 depicts a subset of clone sets generated by LDA (Note: since the assigned topics were not labeled by LDA, we refer to them as clone sets). Once all the clone sets were generated, precision and recall metrics were computed. Based on the ground truth, recall was 94.86% and precision was 84.21%. Hence, we concluded that comments can be utilized to detect code clones.

LDA using Method 2: In Method 2, we used a more sophisticated approach to calculate the precision and recall. The topic number was set to 105, as Ghosh and Kuttal [2] found that when LDA was trained with 1 to 1000 topics using 1000 iterations with 50 passes, it had the highest recall at 100-108 topics. LDA found 7 clone sets with 21 files. Based on the ground truth, recall was 28.61% and precision was 88.57%. After manually analyzing the code to check the authenticity of the clone sets, we found that the matched clone sets were functionally identical in terms of object or instance creation.

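| Technique | Variant | #Clones | Recall | Precision |
| --- | --- | --- | --- | --- |
| PDG (GRAPLE) | With Pr. | 22 | 28.7% | 55.39% |
| PDG (GRAPLE) | Without Pr. | 17 | 27.84% | 52.94% |
| LDA | Method 1 | 66 | 94.86% | 84.21% |
| LDA | Method 2 | 7 | 28.61% | 88.57% |

TABLE I: Recall and Precision for JHotDraw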

IV-C Discussions

Table I summarizes the number of clone sets reported, the recall, and the precision of PDG (with and without probability) and LDA (with the different methods). Our results show that we can utilize comments for detecting code clones at a file level.

With vs. Without the Selection Probabilities: We wanted to compare and contrast the clone sets generated with and without the selection probabilities. Table I shows that 22 clone sets were detected using selection probabilities and 17 without using selection probabilities. To check the quality of the clones, we manually went through all the clone sets reported by the two approaches and also measured the overlap between the clone sets reported by GRAPLE with selection probabilities and those reported by GRAPLE without selection probabilities. We found that 16 out of the 17 clone sets reported without selection probabilities had also been reported with selection probabilities. In addition, we found that without selection probabilities GRAPLE missed 5 clone sets and wrongly classified 1 clone set compared to the run with selection probabilities. Although GRAPLE with selection probabilities produced a more comprehensive clone set, it cost 30 hours and 74 GB of memory. On the other hand, GRAPLE without selection probabilities found 17 clone sets in 4.5 seconds and consumed 481.5 MB.

Effect of PDG-based Tool’s Constraints: The performance of PDG (GRAPLE) was attributed to the parameters, code structure, and code dependencies. For example, PDG was not able to detect the clone set [XMLParseException, XMLException] as clones because, while comparing these files, PDG considered the difference in the number of arguments, the classes extended (different super classes), and the presence of additional methods. XMLParseException extends RuntimeException and requires three argument values: ‘name,’ ‘message,’ and ‘LineNr.’ The XMLException class extends Exception, requires five arguments in the constructor (‘SystemID,’ ‘lineNr,’ ‘Exception,’ ‘msg,’ and ‘reportParams’), and has a separate method to print stack traces for the exceptions. These constraints of GRAPLE affected the detection.

Recall vs. Precision using LDA: Balancing the trade-off between recall and precision is one of the major concerns in using LDA. When we applied the LDA model to the comments and varied the topic numbers from 1 to 1000, we observed a steady increase in precision and decrease in recall. Identifying the range where the recall and precision will be balanced is challenging and may differ based on package size and contents of the comments.

Effect of Code Comments on LDA: The performance of LDA was very much dependent on the quality of the code comments. When quality code comments were present, LDA was able to detect clones that were not detected by PDG. For example, LDA found the [XMLParseException, XMLException] clone set, which PDG did not find. As seen in Fig. 2, the code comments of both source files contained better information on the exceptions. However, in the cases where the code comments were vague, LDA detected false positives. Moreover, as expected, the lack of code comments resulted in LDA missing clones completely.

The problem of mismatched, missing, and outdated comments has been well approached by the software research community [67, 25] by utilizing techniques such as manually crafted heuristics and stereotypes [52], information retrieval [87, 88], probabilistic models [102, 66, 86, 59], Recurrent Neural Networks, and deep learning [103]. We believe that these advanced techniques can be utilized to address LDA’s limitation of vague comments or lack of comments.

RQ2: Can we utilize a hybrid approach to detect clones?

To investigate whether we could utilize the whole source file, i.e. both the source code and its comments, we started by examining the similarities between the clone sets generated by LDA and PDG for the JHotDraw package. We computed $CS_{LDA} \cap CS_{PDG}$, where $CS_{LDA}$ is the clones reported by LDA using Method 1 and $CS_{PDG}$ is the clones reported by GRAPLE without selection probability. We started by setting the similarity index to 1, i.e. we collected all the clone sets with at least one matching file. For instance, ‘BezierPath’ is the matching file among $CS_{LDA}$ = [‘BezierPath,’ ‘DOMStorable,’ ‘RoundRectangleFigure’] and $CS_{PDG}$ = [‘BezierPath,’ ‘DoubleStroke’]. Next, we set the similarity index to 2 and 3. By following this procedure, we found that the largest clone set can be generated when the similarity index is 3, making it the maximum similarity index.

After collecting clone sets based on the similarity indices, we created a superset containing all those files. The superset contained unique file names. In our case we found 40-50 common files within the LDA and PDG clone sets. This superset can be used as input for PDG-based techniques like GRAPLE to detect function-level clones. This will help in utilizing the code as well as the comments of a program to find unique sets of clones.
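The matching and superset steps can be sketched as follows. The function names are ours, and we assume that a similarity index of k means "at least k matching files," which the text states explicitly only for an index of 1. The example clone sets are the ones quoted above.

```python
def matching_pairs(lda_sets, pdg_sets, similarity_index):
    """Pairs of (LDA clone set, PDG clone set) sharing at least `similarity_index` files."""
    pairs = []
    for lda in lda_sets:
        for pdg in pdg_sets:
            if len(set(lda) & set(pdg)) >= similarity_index:
                pairs.append((lda, pdg))
    return pairs


def superset(pairs):
    """Unique file names across all matched clone sets."""
    files = set()
    for lda, pdg in pairs:
        files.update(lda)
        files.update(pdg)
    return sorted(files)


lda_sets = [["BezierPath", "DOMStorable", "RoundRectangleFigure"]]
pdg_sets = [["BezierPath", "DoubleStroke"]]

pairs = matching_pairs(lda_sets, pdg_sets, similarity_index=1)
print(superset(pairs))
# ['BezierPath', 'DOMStorable', 'DoubleStroke', 'RoundRectangleFigure']
print(matching_pairs(lda_sets, pdg_sets, similarity_index=2))  # []
```

The resulting list of files is what would be handed to a PDG-based tool for function-level analysis, in place of the full package.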

IV-D Discussions

We recommend the use of a hybrid technique, i.e., utilizing both LDA and PDG to detect semantic clones. By applying LDA, we obtained 130 unique files for our data set, which reduced the dataset size by more than 50%. (Footnote 1: LDA generates the clone sets based on the random states assigned. For our experiment, we set the random seed to 100, so that the LDA model always returned the same number of clone sets, i.e., the same number of unique files.) This reduced dataset obtained from LDA can be used by GRAPLE with selection probabilities. Hence, the hybrid approach can reduce both the time and space complexity of the whole process. This will help in utilizing source code as well as the comments of a program to detect unique sets of clones. Processing all clone dependencies for even moderately sized programs is impractical. As noted by Henderson et al. [96], for programs with 70 KLOC, around 10 million clones were detected before the space was exhausted. LDA can be utilized to generate the clone sets at the file level, and then PDG-based techniques can be applied to these selected clone sets to detect the function-level clones.
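The two-stage idea can be sketched as follows. This is only an illustration under stated assumptions: `hybrid_detect` and `placeholder_detector` are hypothetical names, the file names are invented, and the second stage is a stand-in for a PDG-based tool such as GRAPLE, whose actual interface is not modeled here.

```python
def hybrid_detect(all_files, file_level_sets, fine_detector):
    """Run an expensive detector only on files kept by the coarse, file-level stage."""
    # Stage 1 output: unique files that appear in any comment-based clone set.
    candidates = sorted({name for clone_set in file_level_sets for name in clone_set})
    # Fraction of the original data set that the second stage no longer has to process.
    reduction = 1 - len(candidates) / len(all_files)
    # Stage 2: the fine-grained (e.g., PDG-based) detector sees only the candidates.
    return fine_detector(candidates), reduction


def placeholder_detector(files):
    # Stand-in for a PDG-based tool: it only records which files it was given.
    return {"analyzed": files}


all_files = ["A.java", "B.java", "C.java", "D.java", "E.java", "F.java"]
file_level_sets = [["A.java", "B.java"], ["B.java", "C.java"]]

result, reduction = hybrid_detect(all_files, file_level_sets, placeholder_detector)
print(result["analyzed"])
print(f"{reduction:.0%}")
```

Passing the detector in as a callable keeps the coarse filter independent of whichever function-level technique is applied afterwards.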

RQ3: Can code comments help in detecting clones across programming languages?

To determine whether comments can be utilized to detect clones across multiple programming languages, we investigated code comments in Java Swing and PyGUI, which are popular options for building graphical user interfaces in Java and Python, respectively. We used Method 1 and Method 2 as discussed in Section III. Table II summarizes the recall and precision for Java Swing and PyGUI. For Java Swing, we found 66 clone sets with 90.68% recall and 49.49% precision according to Method 1 and 69 clone sets with 37.83% recall and 32.77% precision according to Method 2. For PyGUI, we found 61 clone sets with 51.12% recall and 39.21% precision for Method 1 and 58 clone sets with 65.62% recall and 53.33% precision for Method 2.

To detect clone sets across the two packages, Java Swing and PyGUI, using LDA, we created a dictionary from both packages. The corpus consisted of 314 source files from Java Swing and 355 source files from PyGUI. For each source file, the comments were extracted and passed to the LDA model one file at a time, such that each source file was assigned a particular topic number. Source files with the same topic number were put together to form a clone set. Once the clone sets were created, we calculated the recall and precision. We found 88 clones common between Java Swing and PyGUI (see Table II), with a recall of 74.86% and a precision of 84.21% using Method 1, and a recall of 28.61% and a precision of 58.57% using Method 2.
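The grouping and scoring steps can be illustrated with a short Python sketch. It assumes the topic number of each file has already been produced by an LDA model, it evaluates at the level of file pairs, and it uses the standard definitions of precision and recall. These choices are assumptions and may differ from Method 1 and Method 2; all file names and the reference pairs are made up.

```python
from collections import defaultdict
from itertools import combinations


def clone_sets_from_topics(file_topics):
    """Put files that were assigned the same topic number into one clone set."""
    groups = defaultdict(list)
    for filename, topic in file_topics.items():
        groups[topic].append(filename)
    # A clone set needs at least two files.
    return [sorted(files) for files in groups.values() if len(files) > 1]


def clone_pairs(clone_sets):
    """Expand clone sets into unordered file pairs."""
    pairs = set()
    for files in clone_sets:
        pairs.update(frozenset(pair) for pair in combinations(files, 2))
    return pairs


def precision_recall(detected, reference):
    """Standard precision and recall over sets of clone pairs."""
    true_positives = len(detected & reference)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical topic assignments for a mixed Java/Python corpus.
file_topics = {
    "Button.java": 0, "button.py": 0,
    "Slider.java": 1, "slider.py": 1, "Menu.java": 1,
    "Timer.java": 2,
}
# Hypothetical manually verified clone pairs.
reference = {
    frozenset({"Button.java", "button.py"}),
    frozenset({"Slider.java", "slider.py"}),
    frozenset({"Timer.java", "Menu.java"}),
}

clone_sets = clone_sets_from_topics(file_topics)
precision, recall = precision_recall(clone_pairs(clone_sets), reference)
print(clone_sets)
print(round(precision, 2), round(recall, 2))
```

Because every file sharing a topic lands in the same set, a single off-topic file (here `Menu.java`) adds several false-positive pairs, which mirrors the sensitivity to comment quality discussed above.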

IV-E Discussions

Our results show that code comments can be utilized, to an extent, to detect clones across different languages. However, the limitations arising from the quality of code comments and the large number of false positives will persist. As discussed before, we recommend using automated code comment generation and a hybrid technique that combines LDA with a language-specific PDG to detect clones.

Package               Method     #Clones   Recall    Precision
Java Swing            Method 1   66        90.68%    49.49%
Java Swing            Method 2   69        37.83%    32.77%
PyGUI                 Method 1   61        51.12%    39.21%
PyGUI                 Method 2   58        65.62%    53.33%
Java Swing - PyGUI    Method 1   88        74.86%    84.21%
Java Swing - PyGUI    Method 2   81        28.61%    58.57%
TABLE II: Cross-Language Evaluation of Java Swing and PyGUI

V Threats to validity

Threat to External Validity: We studied medium-sized Java projects and a Python project, which cannot serve as an exemplar for all software systems. Secondly, for cross-language verification we studied only two languages, Java and Python. Thirdly, we explored only one PDG based tool – GRAPLE – and only one topic modeling technique – LDA. Despite these limitations, this study is a first step towards exploring the viability of the utilization of code comments in clone detection. Future studies on large-scale systems and with different languages, tools, and techniques need to be done to analyze the generalizability of our results.

Threat to Internal Validity: LDA’s recall and precision depend largely on the quantity and quality of contents. In our data, we adjusted the number of passes and iterations to find a balance between the recall and precision of LDA. But, in practice, finding the right balance is challenging and limits the usage of LDA. Additionally, the LDA approach cannot be applied to visual programming languages as they do not contain code comments. Yet, our results indicate that topic modeling techniques could be applied to text-based programming languages to detect clones.

Threat to Construct Validity: To maintain consistency throughout the study, we analyzed only a single version of the Java library files. This decision was based on the fact that GRAPLE and past studies used JHotDraw v7.0.6. Furthermore, with different versions, the number of files could have been altered by the authors, which might have caused mismatches in the detection process.

VI Related Work

VI-A Code Clone Detection Techniques

In software engineering, many techniques [13] have been proposed to detect code clones based on token similarity (e.g., CCFinder [77], CloneMiner [18], and CloneDetective [62]), Abstract Syntax Tree (AST) similarity (e.g., CloneDR [20], Deckard [74]), or Program Dependence Graph similarity (e.g., [16, 44, 96]). These clone detectors can detect not only textually identical clones (Type I), but also parameterized clones (Type II) and gapped clones (Type III) [55]. Textually identical clones refer to code fragments with differences only in whitespace, layout, and comments. Parameterized clones refer to syntactically identical code fragments, except for differences in identifiers, literals, and types. Gapped clones refer to copied fragments with further modifications such as changed, added, or removed statements. A code clone often appears in multiple places in the system; i.e., it has multiple instances. Detecting and analyzing differences in parameterized and gapped clones has been used in the software engineering literature to manage and maintain code clones by identifying refactoring opportunities [63], detecting bugs [76], supporting change propagation in code clones [34, 43], searching code [72], and detecting plagiarism [48].
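To make the distinction between Type I and Type II clones concrete, the following sketch compares token streams of small Python functions. It is an illustration only and does not reproduce any of the cited tools. It assumes that dropping comments and layout models Type I equivalence and that additionally replacing identifiers and literals with placeholders models Type II equivalence. The function `normalized_tokens` and the sample snippets are hypothetical.

```python
import io
import keyword
import tokenize


def normalized_tokens(source, abstract=False):
    """Token stream without comments or layout; optionally abstract names and literals."""
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in (tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER):
            continue  # comments and blank lines never count as differences
        if tok.type in (tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT):
            result.append(tokenize.tok_name[tok.type])  # keep structure, drop exact spacing
        elif abstract and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            result.append("ID")   # any identifier
        elif abstract and tok.type in (tokenize.NUMBER, tokenize.STRING):
            result.append("LIT")  # any literal
        else:
            result.append(tok.string)
    return result


original = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
# Same code with a comment, a blank line, and different spacing (Type I).
type1 = "def total(xs):  # sum\n    s = 0\n\n    for x in xs:\n        s  +=  x\n    return s\n"
# Same structure with renamed identifiers (Type II).
type2 = "def add_all(values):\n    acc = 0\n    for v in values:\n        acc += v\n    return acc\n"

print(normalized_tokens(original) == normalized_tokens(type1))
print(normalized_tokens(original) == normalized_tokens(type2))
print(normalized_tokens(original, abstract=True) == normalized_tokens(type2, abstract=True))
```

Gapped (Type III) clones would still differ after this normalization, since added or removed statements change the token sequence itself. That is why they require similarity thresholds rather than exact matching.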

Despite this detection and analysis, finding Type III along with semantic clones (Type IV) is still an open research problem [13]. Semantic clones are functionally similar code fragments that have similar pre- and postconditions, but may or may not be syntactically similar. Basit et al. [26] have explored the applicability of generics for the removal of code clones in the Java Buffer Library and the C++ Standard Template Library (STL) and concluded that programming language constructs limit the applicability of generics or templates for clone removal. Most existing clone detection techniques analyze the lower-level code (e.g., assembly code, Java Bytecode, or .NET intermediate language) obtained from the transformation by the compiler rather than the textual source code [19, 36, 35].

Clone detection mechanisms have utilized various techniques, like searching for isomorphic sub-graphs [71], tracing program executions [56], using deep learning [58], and using abstract memory states [48]. Overall, prior research has ignored code comments. Therefore, we explored the viability of comments to detect semantic clones at the file level by using LDA, a topic modeling technique.

VI-A1 Clone Detection using PDG

Krinke’s Duplix algorithm [41] and Komondoor and Horwitz’s algorithms [71] have been utilized as PDG-based techniques to detect clones. Higo and Kusumoto extended Komondoor’s algorithm to detect contiguous clones [105, 106]. Deckard [74] showed an innovative way to map PDGs to abstract syntax trees for detecting clones. Pham et al. [64] conducted research to detect clones by using labeled directed graphs and finding clones with vSiGram. Henderson and Podgurski [96] developed a PDG-based clone detection tool using maximal frequent subgraph mining with different graph mining patterns [27] and GRAPLE [96]. We used the GRAPLE semantic clone detection tool to detect clones when considering code without comments.

VI-B Clone Detection Across Languages

The problem of detecting clones persists across multiple languages, especially in large-scale software systems. With an increase in the size of clones, the relation between them gets more subtle [80, 4, 57, 8]. The existing approaches mentioned above perform clone detection only in a single language. Some research has been conducted on cross-language clone detection. Kraft et al. [60] conducted clone detection research mainly on .NET languages. Microsoft’s Common Intermediate Language (CIL) has been used by Al-Omari et al. [20] to represent source code in a tool that detects similar code fragments. This tool can find true positive cross-language code clones in .NET languages only. Another important contribution comes from Avetisyan et al. [1], which uses LLVM bitcode to detect semantic clones. The approach is applicable to any language that can be compiled to LLVM bitcode. Cheng et al. [101] conceptualized the notion of detecting similarities in sets of components written in different languages, using Natural Language Processing techniques to mine projects’ revision histories. Vislavski et al. [95] designed a tool, LICCA, for cross-language clone detection that is based on an intermediate program representation to unify semantically similar code fragments. They evaluated the tool on an extended set of cloning scenarios over five different languages (i.e., Java, JavaScript, C, Modula-2, Scheme). One of the limitations of LICCA [95] is that it can detect only small fragments (a few LOC) of code clones and is ineffective in large software systems. None of these techniques consider code comments for clone detection.

VI-C Software Engineering and LDA

LDA has been widely used in software engineering, but mostly for program comprehension and maintainability. Wilde et al. [65] proposed the use of linguistic information to identify the functional intent of the system. Biggerstaff et al. [92] have suggested the assignment of domain concepts as an approach to program comprehension. Prior research has proposed using function names and signatures to obtain domain-specific functions [6]. Furthermore, file names often carry the functional intent of the source code specified in the file [61]. Antoniol et al. used information retrieval methods to find traceability links between code documentation and source code [22]. Oezbek et al. [11] created an Eclipse plugin, JTourBus, to lead the programmer directly to relevant details by creating a tour through the source code. Kuhn et al. [3] used a Latent Semantic Analysis based approach for software comprehension, identifying topics in source code by semantically clustering software artifacts such as methods, files, or packages based on identifiers’ names and comments. Unlike these approaches, we are interested in utilizing linguistic topics (of code comments) to detect clones rather than to comprehend programs (source code).

VII Conclusions

This is the first study to investigate the realm of code comments to detect code clones. We made the following contributions:

  1. Provided empirical evidence that code comments can be utilized for detecting clones at a file level and even across programming languages. We found that the precision of the detected clones largely depended on the quality of the comments. In the presence of vague or incomplete comments, there was a higher number of false positives.

  2. Demonstrated that clone detection may utilize a hybrid approach, a combination of LDA and PDG, first detecting clones at the file level using code comments and then at the method or statement level using the source code. A hybrid approach can help in reducing the cost of clone detection by utilizing less resource-intensive techniques, like LDA which can be applied to reduce the number of clone sets. It also helps in increasing accuracy of clone detection by utilizing sophisticated and resource-intensive techniques, like PDG which can utilize code structure for finding method and statement level clones.

  3. Revealed that PDG-based techniques can miss detecting some clones because of their strict constraints, like matching parameters, code structure, and code dependencies.

Our study provided evidence that comments, which are underrated in clone detection research, can be effectively utilized.


  • [1] A. Avetisyan, S. Kurmangaleev, S. Sargsyan, M. Arutunian, and A. Belevantsev. LLVM-based code clone detection framework. Computer Science and Information Technologies, pp.100-104, 2015.
  • [2] A. Ghosh and S. K. Kuttal, Semantic Code Clones: Can Source Code Comments Help?, IEEE Conference VL/HCC, 2018.
  • [3] A. Kuhn, S. Ducasse, and T. Gîrba, Semantic clustering: Identifying topics in source code, Information and Software Technology, 49(3), 2006.
  • [4] A. Sheneamer and J. Kalita, Article: A Survey of Software Clone Detection Techniques, International Journal of Computer Applications, 137(10) 2016.
  • [5] B. Baker, On Finding Duplication and Near-Duplication in Large Software Systems, In Proceedings of Working Conference on Reverse Engineering, pp. 86-95, 1995.
  • [6] B. Caprile and P. Tonella, Nomen est omen: Analyzing the language of function identifiers, In Proceedings of the Sixth Working Conference on Reverse Engineering, pp.112-122, 1999.
  • [7] C. K. Roy and J. R. Cordy, A survey on software clone detection research, Technical Report TR 2007-541, Queens University, 2007.
  • [8] C. K. Roy, J. R. Cordy, and R. Koschke, Comparison and evaluation of code clone detection techniques and tools: A qualitative approach, Science of Computer Programming, 74(7), 2009.
  • [9] C. Kapser and M. Godfrey, Supporting the Analysis of Clones in Software Systems: A Case Study. Journal of Software Maintenance and Evolution: Research and Practice , 18(2), pp.61-82, 2006.
  • [10] C. M. Grinstead and J. L. Snell, Introduction to Probability, American Mathematical Society, Providence, RI, 2 edition, 1997.
  • [11] C. Oezbek and L. Prechelt, JTourBus: Simplifying Program Understanding by Documentation that Provides Tours Through the Source Code, In Proceedings of IEEE International Conference on Software Maintenance, pp.64-73, 2007.
  • [12] C. W. Krueger (1992). ”Software reuse”. In ACM Computing Surveys. 24(2):131-183.
  • [13] C.K. Roy, M.F. Zibran, and R. Koschke, The vision of software clone management: Past, present, and future (Keynote paper), In Proceedings of IEEE Software Maintenance, Reengineering and Reverse Engineering, pp.18-33, 2014.
  • [14] Charles W. Krueger. 1992. Software reuse. ACM Comput. Surv. 24, 2 (June 1992), 131–183.
  • [15] D. Blei, A. Ng, and M. Jordan, Latent dirichlet allocation, Journal of Machine Learning Research, pp.993-1022, 2003.
  • [16] D. Chatterji, J. C. Carver, and N. A. Kraft, Cloning: The need to understand developer intent, In Proceedings of International Workshop on Software Clones, pp.14-15, 2013.
  • [17] D. Seidl, B. Hummel and E. Juergens, Quality Analysis of Source Code comments, In Proceedings of International Conference on Program Comprehension, pp.43-60, 2013.
  • [18] E. Adar and M. Kim, SoftGUESS: Visualization and exploration of code clones in context, In Proceedings of International Conference on Software Engineering, pp.762-766, 2007.
  • [19] F. Al-Omari, I. Keivanloo, C. K. Roy, and J. Rilling, Detecting clones across Microsoft .net programming languages, In Proceeding of Working Conference on Reverse Engineering, pp.405-414, 2012.
  • [20] F. Al-Omari, I. Keivanloo, C. K. Roy, and J. Rilling, Detecting clones across microsoft .net programming languages, In Proceedings of Working Conference on Reverse Engineering, pp.405-414, 2012.
  • [21] F. H. Su, J. Bell, K. Harvey, S. Sethumadhavan, G. Kaiser, and T. Jebara, Code Relatives: Detecting Similarly Behaving Software, In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering (FSE 2016), pp.702–714, 2016.
  • [22] G. Antoniol, G. Canfora, G. Casazza, A. De Lucia and E. Merlo, Recovering traceability links between code and documentation, In Proceedings of IEEE Transactions on Software Engineering, 28(10), pp.970-983, 2002.
  • [23] G. Maskeri, S Sarkar and K Heafield, Mining business topics in source code using latent dirichlet allocation, In Proceedings of 1st India software engineering conference, pp.113-120, 2008.
  • [24] G. Shu, B.Sun, T.A.D. Henderson and A. Podgurski, JavaPDG: A New Platform for Program Dependence Analysis, In Proceedings of International Conference on Software Testing, Verification and Validation, pp.408-415, 2013.
  • [25] Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock, and K VijayShanker. 2010. Towards automatically generating summary comments for java methods. In Proceedings of the IEEE/ACM international conference on Automated software engineering. ACM, 43–52.
  • [26] H. Basit, D. Rajapakse, and S. Jarzabek, An empirical study on limits of clone unification using generics, In Proceedings of SEKE, pp.109-114, 2005.
  • [27] H. Cheng, X. Yan, and J. Han, Mining Graph Patterns, In Frequent Pattern Mining, pp.307-338, 2014.
  • [28] H. D. Rombach (1991). Software reuse: a key to the maintenance problem. In Information and Software Technology Journal, 33(1), Jan/Feb.
  • [29] H. Kim, Y. Jung, S. Kim, and K. Yi, MeCC: Memory comparison-based clone detector, In Proceedings of the 33rd International Conference on Software Engineering, pp.301–310, 2011.
  • [30]
  • [31]
  • [32] Hubig R., Morschel I. (1997) Quality and Productivity Improvement in Object-Oriented Software Development. In: Lehner F., Dumke R., Abran A. (eds) Software Metrics. Information Engineering und IV-Controlling. Deutscher Universitätsverlag, Wiesbaden
  • [33] I. Baxter, A. Yahin, L. Moura and M. Anna, Clone Detection Using Abstract Syntax Trees, In Proceedings of Software Maintenance, pp.368-377, 1998.
  • [34] I. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier, Clone detection using abstract syntax trees, In Proceedings of International Conference on Software Maintenance, pp.368-377, 1998.
  • [35] I. Davis and M. Godfrey, Clone detection by exploiting assembler, In Proceedings of 4th International Workshop on Software Clones, pp.77-78, 2010.
  • [36] I. Keivanloo, C. K. Roy, and J. Rilling, SeByte: Scalable clone and similarity search for bytecode, Science of Computer Programming, 95, 2014
  • [37] I.D. Baxter, A. Quigley, L. Bier, M. Sant’Anna, L. Moura, and A. Yahin, CloneDR: clone detection and removal, In Proceedings of the 1st International Workshop on Soft Computing Applied to Software Engineering, pp.111-117, 1999.
  • [38] J. Ferrante, K. J. Ottenstein, and J. D. Warren, The program dependence graph and its use in optimization, ACM Transactions on Programming Languages and Systems, 9(3), pp.319-349, 1987.
  • [39] J. Howard Johnson, Visualizing textual redundancy in legacy source, In Proceedings of the conference of the Centre for Advanced Studies on Collaborative research, John Botsford, Ann Gawman, Morven Gentleman, Evelyn Kidd, Kelly Lyons, Jacob Slonim, and Howard Johnson (Eds.). IBM Press, 1994.
  • [40] J. Johnson, Visualizing Textual Redundancy in Legacy Source, In Proceedings of CASCON, pp.171-183, 1994.
  • [41] J. Krinke, Identifying similar code with program dependence graphs. In Proceedings of Working Conference on Reverse Engineering, pp.301-309, 2001.
  • [42] J. Mayrand, C. Leblanc and E. Merlo, Experiment on the Automatic Detection of Function Clones in a Software System Using Metrics, In Proceeding of ICSM, pp.244-253, 1996.
  • [43] J. R. Cordy, Live scatterplots, In Proceedings of International Workshop on Software Clones, pp.79-80, 2011.
  • [44] J. R. Cordy. Comprehending reality: Practical barriers to industrial adoption of software maintenance automation, In Proceedings of International Workshop on Program Comprehension, pp.196-206, 2003.
  • [45] J.R. Cordy and C. K. Roy, The NiCad clone detector, In Proceedings of Program Comprehension (ICPC), pp.219-220, 2011.
  • [46] Java Swing:
  • [47] JHotDraw:
  • [48] K. Kim, D. Kim, T. F. Bissyandé, E. Choi, J. Klein, Y. Le Traon, FaCoY–A Code-to-Code Search Engine, In Proceedings of the 40th International Conference on Software Engineering, 2018.
  • [49] K. Nakakoji, Y. Yamamoto, Y. Nishinaka, Kouichi Kishida, and Y. Ye, Evolution Patterns of Open-Source Software Systems and Communities, In Proceedings of the International Workshop on Principles of Software Evolution, pp.76-85, 2002.
  • [50] K. T. Stolee, S. Elbaum, and Daniel Dobos, Solving the Search for Source Code, ACM Transactions on Software Engineering Methodology, 23(3), pp.26, 2014
  • [51] L. Jiang and Z. Su, Automatic mining of functionally equivalent code fragments via random testing. In Proceedings of the eighteenth international symposium on Software testing and analysis, pp.81–92, 2009.
  • [52] Laura Moreno, Jairo Aponte, Giriprasad Sridhara, Andrian Marcus, Lori Pollock, and K Vijay-Shanker. 2013. Automatic generation of natural language summaries for java classes. In Program Comprehension (ICPC), 2013 IEEE 21st International Conference on. IEEE, 23–32.
  • [53] Liangjie Hong and Brian D. Davison, Empirical study of topic modeling in Twitter, In Proceedings of the First Workshop on Social Media Analytics, pp 80-88, 2010.
  • [54] Lopes, Cristina V., Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani, and Jan Vitek, DéjàVu: a map of code duplicates on GitHub, In Proceedings of the ACM on Programming Languages 1, OOPSLA, pp.84, 2017
  • [55] M. Asaduzzaman, C. K. Roy, and K. A. Schneider, VisCad: Flexible code clone analysis support for NiCad, In Proceedings of International Workshop on Software Clones, pp.77-78, 2011.
  • [56] M. Gabel, L Jiang, Zhendong SU, Scalable detection of semantic clones, In Proceedings of the International Conference on Software Engineering, pp.321-330, 2008.
  • [57] M. Sudhamani and L. Rangarajan, Duplicate Code Detection using Control Statements, International Journal of Computer Applications Technology and Research, 4(10), 2015
  • [58] M. White, M. Tufano,C.Vendome, and D. Poshyvanyk, Deep learning code fragments for code clone detection, In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, pp.87-98, 2016
  • [59] Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 87–98.
  • [60] N. A. Kraft, B. W. Bonds, and R. K. Smith, Cross-language Clone Detection, In Proceedings of SEKE, pp.54-59, 2008.
  • [61] N. Anquetil and T. C. Lethbridge, Recovering software architecture from the names of source files, Journal of Software Maintenance: Research and Practice, 11(3),pp.201-221, 1999.
  • [62] N. Bettenburg, W. Shang, W. Ibrahim, B. Adams, Y. Zou, and A. Hassan, An empirical study on inconsistent changes to code clones at the release level, Science of Computer Programming, 77(6), pp.760-776, 2012.
  • [63] N. Göde, and K. Rainer, Studying clone evolution using incremental clone detection, Journal of Software: Evolution and Process, 25(2), pp.165-192, 2013.
  • [64] N. H. Pham, H. A. Nguyen, T. T. Nguyen, J. M. Al-Kofahi, and T. N. Nguyen, Complete and accurate clone detection in graph-based models, In Proceedings of International Conference on Software Engineering, pp.276-286, 2009
  • [65] N. Wilde, M. Buckellew, H. Page, V. Rajlich, and L. Pounds, A comparison of methods for locating features in legacy software, Journal of Systems and Software, 65(2), pp.105-114, 2003.
  • [66] Pablo Loyola, Edison Marrese-Taylor, and Yutaka Matsuo. 2017. A Neural Architecture for Generating Natural Language Descriptions from Source Code Changes. arXiv preprint arXiv:1704.04856 (2017).
  • [67] Paul W McBurney and Collin McMillan. 2014. Automatic documentation generation via source code summarization of method context. In Proceedings of the 22nd International Conference on Program Comprehension. ACM, 279–290.
  • [68] Popular Programming Language (2020).URL
  • [69] PyGUI: Doc/index.html
  • [70] R. Hoffmann, J. Fogarty, and D. S. Weld, Assieme: finding and leveraging implicit references in a web search interface for programmers, In Proceedings of the 20th annual ACM symposium on User interface software and technology, pp.13–22, 2007.
  • [71] R. Komondoor and S. Horwitz, Using Slicing to Identify Duplication in Source Code, In Proceedings of International static analysis symposium, pp.40-56, 2001.
  • [72] R. Sirres, T. F. Bissyandé, D. Kim, D. Lo, J. Klein, K. Kim, and Y. L. Traon, Augmenting and Structuring User Queries to Support Efficient Free-Form Code Search, In Proceedings of Empirical Software Engineering, 23(5), pp.2622–2654, 2018.
  • [73] Roy, Chanchal K., and James R. Cordy, NICAD: Accurate detection of near-miss intentional clones using flexible pretty-printing and code normalization, 16th IEEE international conference on program comprehension (ICPC), 2008.
  • [74] S. Bazrafshan, and R. Koschke, An empirical study of clone removals, In Proceedings of International Conference Software Maintenance, pp.50-59, 2013.
  • [75] S. Bellon, R. Koschke, G. Antoniol, J. Krinke and E. Merlo, ”Comparison and Evaluation of Clone Detection Tools,” in IEEE Transactions on Software Engineering, vol. 33, no. 9, pp. 577-591, Sept. 2007.
  • [76] S. Bellon, R. Koschke, G. Antoniol, J. Krinke, and E. Merlo, Comparison and evaluation of clone detection tools, IEEE Transaction on Software Engineering, 33(9), pp.577-591, 2007.
  • [77] S. Bouktif, G. Antoniol, M. Neteler, and E. Merlo, A novel approach to optimize clone refactoring activity, In Proceedings of Genetic and Evolutionary Computation, pp.1885-1892, 2006.
  • [78] S. C. B. de Souza, N. Anquetil, and K. M. de Oliveira, “A Study of the Documentation Essential to Software Maintenance,” ser. SIGDOC ’05, 2005.
  • [79] S. C. B. de Souza, N. Anquetil, and K. M. de Oliveira, A Study of the Documentation Essential to Software Maintenance, In Proceedings of the 23rd annual international conference on Design of communication: documenting & designing for pervasive information, pp.68-75, 2005.
  • [80] S. Dang and S. A. Wani, Performance Evaluation of Clone Detection Tools, International Journal of Science and Research, 2015
  • [81] S. K. Thompson. Sampling. John Wiley and Sons, New York, 2 edition, 2002.
  • [82] S. Li, X. Xiao, B. Bassett, T. Xie, and Nikolai Tillmann, Measuring Code Behavioral Similarity for Programming and Software Engineering Education, In Proceedings of the 38th International Conference on Software Engineering Companion, pp.501–510, 2016.
  • [83] S. N. Woodfield, H. E. Dunsmore, and V. Y. Shen, “The effect of modularization and comments on program comprehension,” ser. ICSE’81, 1981.
  • [84] S. N. Woodfield, H. E. Dunsmore, and V. Y. Shen, The effect of modularization and comments on program comprehension, In Proceedings of International Conference on Software Engineering, pp.215-223, 1981.
  • [85] Sajnani, Hitesh, et al., SourcererCC: scaling code clone detection to big-code, In Proceedings of International Conference on Software Engineering, 2016.
  • [86] Song Wang, Taiyue Liu, and Lin Tan. 2016. Automatically learning semantic features for defect prediction. In Proceedings of the 38th International Conference on Software Engineering. ACM, 297–308.
  • [87] Sonia Haiduc, Jairo Aponte, and Andrian Marcus. 2010. Supporting program comprehension with source code summarization. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 2. ACM, 223– 226.
  • [88] Sonia Haiduc, Jairo Aponte, Laura Moreno, and Andrian Marcus. 2010. On the use of automated text summarization techniques for summarizing source code. In Reverse Engineering (WCRE), 2010 17th Working Conference on. IEEE, 35–44.
  • [89] Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. 2016. Summarizing Source Code using a Neural Attention Model. In ACL (1).
  • [90] T. C. Jones (1986). Programming Productivity. McGraw-Hill.
  • [91] T. Griffiths and M. Steyvers, Finding scientific topics, In Proceedings of the National Academy of Sciences, pp.5228-5235, 2004.
  • [92] T. J. Biggerstaff, B. G. Mitbander, and D. Webster, Program understanding and the concept assignment problem, Communications of the ACM, 37(5):pp.72-83, 1994.
  • [93] T. Tenny, “Program Readability: Procedures Versus Comments,” IEEE Trans. Softw. Eng., vol. 14, no. 9, 1988.
  • [94] T. Tenny, Program Readability: Procedures Versus Comments, IEEE Transactions of Software Engineering, 14(9), pp.1271-1279, 1988.
  • [95] T. Vislavski, G. Rakic, Z and N Budimac, LICCA: A tool for cross-language clone detection, 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering, pp.512-516, 2018
  • [96] TAD Henderson and A Podgurski, Sampling code clones from program dependence graphs with GRAPLE, In Proceedings of the 2nd International Workshop on Software Analytics, pp.47-53, 2016.
  • [97] V. Basili (1990). Viewing maintenance as reuse-oriented software development. In IEEE Software, 7(1):19–25, Jan.
  • [98] V. Basili and H. D. Rombach (1991). ”Support for comprehensive reuse”. In IEE Software Engineering Journal. Sept. pp. 303-316
  • [99] Vaibhav Saini, Farima Farmahini farahani, Yadong Lu, Pierre Baldi, and Cristina V. Lopes, Oreo: detection of clones in the twilight zone, In Proceedings of European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp 354-365, 2018.
  • [100] Wang, Pengcheng, et al., CCAligner: a token based large-gap clone detector, Proceedings of the 40th International Conference on Software Engineering (ICSE). ACM, 2018.
  • [101] X. Cheng, Z. Peng, L. Jiang, H. Zhong, H. Yu, and J. Zhao, CLCMiner: Detecting Cross-Language Clones without Intermediates. IEICE Trans, on Information and Systems, 100(2), pp.273-284, 2017.
  • [102] Xiaodong Gu, Hongyu Zhang, Dongmei Zhang, and Sunghun Kim. 2016. Deep API learning. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 631–642.
  • [103] Xing Hu, Ge Li, Xin Xia, David Lo, and Zhi Jin. 2018. Deep code comment generation. In Proceedings of the 26th Conference on Program Comprehension, 200-210.
  • [104] Y. Higo and S. Kusumoto, Code Clone Detection on Specialized PDGs with Heuristics, In Proceedings of Software Maintenance and Reengineering, pp.75-84, 2011.
  • [105] Y. Higo and S. Kusumoto, Enhancing quality of code clone detection with program dependency graph, In Proceedings of Working Conference on Reverse Engineering, pp.315-316, 2009.
  • [106] Y. Higo, U. Yasushi, M. Nishino, and S. Kusumoto, Incremental code clone detection: A PDG-based approach, In Proceedings of Working Conference on Reverse Engineering, pp.3-12, 2011
  • [107] Y. Lin, Z. Xing, Y. Xue, Y. Liu, X. Peng, J. Sun, and W. Zhao, Detecting differences across multiple instances of code clones, In Proceedings of International Conference on Software Engineering, pp.164-174, 2014.
  • [108] Yoshiki Higo and Hiroaki Murakami and Shinji Kusumoto, Revisiting Capability of PDG-based Clone Detection, Technical Report, Graduate School of Information Science and Technology, Osaka University, 2013.