I Background
Source code comments are considered an indispensable part of computer programs. Code comments are often used to make up for the lack of proper software documentation. However, comments are also an elusive part of a computer program. Unlike the actual code, which is written in a programming language, code comments are mostly written in natural language, which allows far more freedom of expression and hence defies any sort of formal analysis or objective testing.
We are particularly interested in source code comments that appear inside a function or method and explain how the code works at a microscopic level. In this paper, we call them “local comments”. Local comments tend to describe things like “the kind of data that is stored in a certain variable” or “the assumption that holds at a certain point”, and so on. They normally describe a small part (often just a few lines) of code and are largely invisible in official documentation, because they are too technical or obscure for most users. Such comments, however, are often crucial for understanding a tricky part of the code, as they give us a rare glimpse into the developer’s mind. Programmers often need this kind of information for diagnosing software issues or extending its functionality.
Note that this paper is part of a larger effort to obtain the semantic relationship between source code and program function. We plan to use the techniques presented in this paper to analyze the function of an unknown program given its source code. That future stage, however, is not presented in this paper.
Our original goal was to leverage local comments for static analysis of a program. We soon discovered that finding good local comments is itself a nontrivial task. One such comment is the following:
b.start = b.end = 0; // clear the ring buffer.
In this example, the actual operation of the code is just to assign zeroes to two variables, b.start and b.end. But the comment gives it a much richer meaning than just two assignments; namely, they are clearing the ring buffer. In order to discover this kind of example, we need to know a couple of things. First, we have to recognize that the above comment actually explains the code to its left. Then we also have to recognize that the comment explains the function of the code. To illustrate why this is not a trivial problem, let us consider another example:
// error occurred.
b.start = b.end = 0;
In this case, the role of the comment is entirely different. While this comment might be just as important and relevant, it is no longer explaining what the code does. Rather, it is stating the reason why this code needs to be executed. Evidently, code comments play several different roles, but it was not clear how these roles differ and how we can recognize them automatically.
There have been a few seminal attempts to categorize source code comments. Padioleau et al. studied code comments in operating system kernels and classified them by detailed topics [1]. Steidl et al. proposed seven categories of comments [2]; they focused on macroscopic comments, such as module headers or method descriptions. Pascarella and Bacchelli further developed this idea and proposed a hierarchical categorization [3]. The key difference between our work and these previous works is that we primarily focus on the finer relationship between code and its local comments, with the aim of collecting code/comment pairs for future use.
Today, several industry guidelines [5, 6] exist for writing macroscopic, or non-local, comments. However, the manner and style of local comments are still largely left to a programmer’s own discretion. In The Practice of Programming [4], Kernighan and Pike mention a couple of principles for writing code comments, such as “Don’t belabor the obvious” or “Clarify, don’t confuse”, but they do not go much further. As a result, analyzing local comments is still believed to be hard, and there have not been many attempts in this area.
We initially set out to collect comments from popular GitHub projects (repositories) and review them manually. Throughout the process, we recognized that there are some general unspoken rules for local comments. Our own experience as programmers agrees with this; each programmer develops their own “grammar” for comments, even when there is no explicit instruction. While the style of local comments still varies from project to project, we observed a certain tendency among them, which gives hope that we can analyze them somewhat mechanically. This paper is our attempt to propose a framework that allows a rigorous analysis of local comments and lays the groundwork for further research in this field.
I-A Contributions of This Paper
This paper makes three contributions. Firstly, we propose common structural elements of local comments written in natural language. To our knowledge, this is the first attempt to automatically analyze source code comments at this level of detail. Secondly, we develop a machine learning algorithm that can identify each element of a local comment with reasonable accuracy. Thirdly, we show that those structures are roughly preserved across different projects and, to some extent, across different languages.
More specifically, we try to answer the following Research Questions:
- RQ1. What are the common (syntactic or semantic) elements among source code comments written in natural language (English)?
- RQ2. Can those elements be identified by a certain machine learning algorithm?
- RQ3. How commonly can those elements be observed across different projects or programming languages?
The rest of this paper is organized as follows: First, we present our model of local comments that explain how code works. Then we describe our attempt to automatically identify them using a machine learning algorithm (decision tree). We then apply this method to real projects to see whether our model is relevant, and discuss our findings. Finally, we briefly review related work and state our conclusion.
II Structure of Comments
In this section, we present our model of source code comments in an attempt to answer our first Research Question. In most cases, there is a relationship between a comment and the code it describes. Each relationship can be viewed as an arc that has three elements (Fig. 1): its source (the comment itself), its destination (the target code), and the type of relationship (the comment category). In the rest of this section, we describe each element one by one.

II-A Comment Extent
The extent of a comment is the part of the source code that can be recognized as one “chunk” of explanatory text. This is less obvious than it sounds because, in many cases, a single explanatory text is not expressed by a single comment tag. This is illustrated by the following example:
// This is still
// one sentence.
In many modern languages such as Java, C++ or Python, inline comments are commonly used. Inline comments, or end-of-line comments, are a type of comment that starts with a comment tag (such as // or #) and continues until the end of the line. It is very common for a single sentence to be split into multiple inline comments because programmers want to limit the length of each line to maintain readability.
In the rest of this paper, we define the extent of a comment as a sequence of consecutive comment tags that can be taken as a single continuous text. One could say a comment extent is a “whole” comment rather than an individual comment tag. In theory, there could also be a disjoint set of comments that forms one explanation, but we have not found such an example in the source code we reviewed.
II-B Comment Target
Comments are by nature aimed at a certain subject, as we often say we comment on something. The same can be said for source code comments. However, there has not been much discussion about what the actual targets of code comments are, or how they can be specified in terms of programming language syntax.
In modern languages, program source code is first transformed into a parse tree. A typical parse tree consists of a number of syntax elements that cover certain parts of the source code (Fig. 2). In many popular languages, however, comments are treated as special tokens and are not part of the syntax tree. Java Development Tools (JDT), a popular Java parser implementation, treats source code comments as special syntax elements that belong to the entire file [7]. In the Python Abstract Syntax Tree (AST) module, comments are discarded at the tokenization stage and completely ignored by the subsequent parser. However, when programmers write a comment, they often try to align it with the existing language syntax. It is therefore natural to assume that there is some way to express the target of a comment using some form of formal syntax.
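As a concrete illustration, the following minimal sketch uses the JDT API (the file name is a placeholder): comments are reachable only through the compilation unit’s comment list, not through ordinary tree traversal.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import org.eclipse.jdt.core.dom.AST;
import org.eclipse.jdt.core.dom.ASTParser;
import org.eclipse.jdt.core.dom.Comment;
import org.eclipse.jdt.core.dom.CompilationUnit;

public class ListComments {
    public static void main(String[] args) throws Exception {
        // Read a Java source file (the path is just an example).
        String source = new String(Files.readAllBytes(Paths.get("Example.java")));

        // Build the parse tree with JDT.
        ASTParser parser = ASTParser.newParser(AST.JLS8);
        parser.setKind(ASTParser.K_COMPILATION_UNIT);
        parser.setSource(source.toCharArray());
        CompilationUnit cu = (CompilationUnit) parser.createAST(null);

        // Comments are not children of ordinary nodes; they are
        // attached to the whole compilation unit instead.
        for (Object o : cu.getCommentList()) {
            Comment c = (Comment) o;
            int line = cu.getLineNumber(c.getStartPosition());
            System.out.println("comment at line " + line
                + ", length " + c.getLength());
        }
    }
}
```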
From a manual review of 1,000 Java code comments, we discovered that most comment targets can be specified relative to the surrounding syntax elements (Fig. 3). Every syntax element within a parse tree has a start and end point, and comments sit between two syntax elements. We found that comment targets can be specified as one of four types:

- Left : The comment targets the syntax element that ends immediately before the comment. When there are overlapping elements that end at the same point, the element with the longest span is chosen. Note that this changes the size of a comment target depending on its position. For example, the comment in
thread.join(); // Let the job finish.
targets the entire statement that precedes it, whereas the comment in
c.query(uri, DOWNLOAD, null /* selection */);
only targets the expression (null).
- Right : The comment targets the syntax element that starts immediately after the comment. When there are overlapping elements that start at the same point, the longest element is chosen. A typical example looks like this:
// Copy the array.
for (int i = 0; i < a.length; i++) { b[i] = a[i]; }
where the comment targets the entire for block to its right.
- Parent : The comment targets its parent element, i.e. the syntax element that contains the comment. This type of target is commonly seen in an if statement:
if (obj == null) { // error
  return;
}
where the comment targets the entire then-block.
- In-Place : The comment does not describe any code. The target is considered to be the comment itself. Examples include metadata such as authors or copyright notices.
Note that not all comments have a target. A notable example is commented-out code. Such a comment is considered to have an “In-Place” target. Out of the 1,000 comments we reviewed, only 3 did not fit the above four types. In the rest of this paper, we ignore these irregular targets.
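As an illustration of the “Left” and “Right” rules above, the following sketch resolves a target from element offsets. The SyntaxElement type is a simplified stand-in for real parser nodes, not part of our implementation.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

class SyntaxElement {
    int start;  // character offset where the element begins
    int end;    // character offset where the element ends
    SyntaxElement(int start, int end) { this.start = start; this.end = end; }
    int span() { return end - start; }
}

class TargetResolver {
    // "Left": the element that ends immediately before the comment;
    // among elements ending at that same point, prefer the longest span.
    static Optional<SyntaxElement> left(List<SyntaxElement> elems, int commentStart) {
        int bestEnd = elems.stream()
            .filter(e -> e.end <= commentStart)
            .mapToInt(e -> e.end).max().orElse(-1);
        return elems.stream()
            .filter(e -> e.end == bestEnd)
            .max(Comparator.comparingInt(SyntaxElement::span));
    }

    // "Right": the element that starts immediately after the comment;
    // among elements starting at that same point, prefer the longest span.
    static Optional<SyntaxElement> right(List<SyntaxElement> elems, int commentEnd) {
        int bestStart = elems.stream()
            .filter(e -> e.start >= commentEnd)
            .mapToInt(e -> e.start).min().orElse(Integer.MAX_VALUE);
        return elems.stream()
            .filter(e -> e.start == bestStart)
            .max(Comparator.comparingInt(SyntaxElement::span));
    }
}
```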
II-C Comment Category
A comment category represents the type of relationship between a comment and its target. While a comment extent and a comment target are mostly syntactic elements, a comment category involves some sort of semantics. There are several existing works on comment categories [2, 3], but we independently made our own list of categories suitable for local comments. The procedure for building the category list was as follows: we reviewed each comment and checked whether it fit one of the existing categories; if not, we regarded the comment as belonging to a new category. After finishing this process, eleven categories had been formed. They are listed in Tab. I. Note that certain categories such as “Guide” or “Meta Information” are rarely used for local comments, but are kept for the sake of completeness.
Category | Description | Example |
---|---|---|
Postcondition | Conditions or effects that hold after the code is executed. Typically used for explaining “what” the code does. | // create some test data Map<String, String> data = createTestData(testSize); // if we had a prior association, // restore and throw an exception if (previous != null) { taskVertices.put(id, previous); ... |
Precondition | Conditions that hold before the code is executed. This includes statements that hold regardless of the code execution. Typically used for explaining “why” the code is needed. | // Unable to find the specidifed document. return Status.ERROR; if (myStatusBar != null) { //not welcome screen myStatusBar.addProgress(this, myInfo); } |
Value Description | Phrase that can be equated with a variable, constant or expression. | addSourceFolders( SourceFolder.FACTORY, getSourceFoldersToInputsIndex(), false /* wantsPackagePrefix */, context); |
Instruction | Instruction for code maintainers. Often referred to as “TODO” comments. | // TODO Auto-generated catch block e.printStackTrace(); Assert.fail("Failed"); |
Guide | Guide for code users. Not to be confused with Instructions. | // Example: renderText(100, 100, FONT, 12, "Hello"); |
Interface | Description of a function, type, class or interface. | // Comparison function class MyComparator implements Comparator { public int compare(Object o1, Object o2) { ... |
Meta Information | Meta information such as author, date, or copyright. | // from org.apache.curator.framework. // CuratorFrameworkFactory this.maxCloseWait = 1000; |
Comment Out | Commented out code. This type of comment does not have a target. | while ((m = ch.receive()) != null) { //System.out.println(Strand.currentStrand()); ... |
Directive | Compiler directive that isn’t directed to human readers. | //CHECKSTYLE:OFF } catch (final Exception ex) { //CHECKSTYLE:ON |
Visual Cue | Text inserted just for the ease of reading. | // // Initialization key storage // |
Uncategorized | All other comments that don’t fit the above categories. | |
II-C1 Manual Annotation Experiment
In order to verify the relevance of our definition of the comment categories, we conducted a manual annotation experiment. The three authors of this paper participated in this experiment. Each participant was given the list of guidelines shown in Tab. I and instructed to choose a category for each of 100 Java code snippets. The annotation tool is implemented as a Web application where a participant can choose the category from a drop-down menu (Fig. 4). The time taken for each choice is recorded. The categories obtained here are later used as a test set for measuring the performance of our classifiers.
The code snippets given in this experiment were randomly chosen from 100 popular GitHub projects. All the participants received the same set of snippets. Each snippet contains a comment and its neighboring lines (four lines before and after the comment). Since some questions might require a participant to study the source code in depth, a link is provided for each snippet, leading to the original GitHub repository where the participant can view a wider range of the source code, or other project files if necessary.
There are three participants in this experiment. One of them is a doctoral student and the other two are faculty members. They are all male, and their age ranges from 37 to 51. The total time spent for this experiment ranges from 30 minutes to 2 hours per person. Their median time for each question ranges from 10 seconds to 30 seconds.
Measuring the Agreement Ratio
We used Fleiss’ Kappa as the measure of inter-rater agreement. Fleiss’ Kappa is commonly used for measuring agreement among n raters where n ≥ 3; in the case of n = 2, Cohen’s Kappa is typically used. Both methods are based on the same principle, as we illustrate in the following paragraphs.
When there is no gold standard for the answers, the only thing we can measure for agreement is how often people choose the same category. However, people choosing the same category on some questions does not guarantee that they always agree on every question. The idea behind Cohen’s Kappa is to prevent one type of answer from accidentally dominating the entire agreement ratio. Technically, this is done by discounting the agreement on categories that are frequently chosen.
Fleiss’ Kappa is an extension of Cohen’s Kappa to three or more raters. It is calculated as follows: assume that there is a complete graph connecting all the participants (a triangle, in our case), and count the number of edges where both participants on the edge agree on the answer. Then discount the agreement on frequent categories in the same manner as Cohen’s Kappa. We calculated Fleiss’ Kappa for our experiment; by a commonly used guideline, the obtained value is considered “moderate agreement”.
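For reference, a minimal, self-contained sketch of the standard Fleiss’ Kappa computation over a ratings matrix, where counts[i][j] is the number of raters who assigned snippet i to category j (the matrix below is a toy example, not our data):

```java
public class FleissKappa {
    // counts[i][j]: number of raters who assigned item i to category j.
    static double kappa(int[][] counts) {
        int items = counts.length;
        int categories = counts[0].length;
        int raters = 0;
        for (int j = 0; j < categories; j++) raters += counts[0][j];

        // Per-item agreement: fraction of agreeing rater pairs.
        double meanAgreement = 0.0;
        double[] categoryShare = new double[categories];
        for (int[] row : counts) {
            double pairs = 0.0;
            for (int j = 0; j < categories; j++) {
                pairs += row[j] * (row[j] - 1);
                categoryShare[j] += row[j];
            }
            meanAgreement += pairs / (raters * (raters - 1));
        }
        meanAgreement /= items;

        // Expected agreement by chance, based on how often each category
        // is chosen overall (the "discounting" step described above).
        double expected = 0.0;
        for (int j = 0; j < categories; j++) {
            double share = categoryShare[j] / (items * raters);
            expected += share * share;
        }
        return (meanAgreement - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        // Toy example: 3 raters, 2 items, 3 categories.
        int[][] counts = { {3, 0, 0}, {1, 2, 0} };
        System.out.printf("kappa = %.3f%n", kappa(counts));
    }
}
```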
The confusion matrix is shown in Tab. II. The agreement ratio on comment extents or comment targets was not measured.

Category | Po | Pr | Co | Vi | Va | In | Di | Gu | Un |
---|---|---|---|---|---|---|---|---|---|
Postcondition | 75 | ||||||||
Precondition | 37 | 37 | |||||||
Comment Out | 0 | 0 | 12 | ||||||
Visual Cue | 17 | 11 | 0 | 11 | |||||
Value Descr. | 10 | 7 | 0 | 1 | 15 | ||||
Instruction | 2 | 4 | 0 | 1 | 0 | 14 | |||
Directive | 0 | 0 | 0 | 5 | 0 | 0 | 8 | ||
Guide | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | |
Uncategorized | 6 | 5 | 0 | 10 | 0 | 1 | 1 | 0 | 9 |
We found that there is a relatively high chance of disagreement between the “Precondition” and “Postcondition” categories. We think there are three major reasons for this. Firstly, both categories are inherently tricky because they require a deep understanding of the code and the comment; sometimes the participants had to view a much wider range of the code to get the proper context for a snippet. Secondly, some comments are long enough to actually serve multiple purposes; although the participants were asked to choose the most dominant category, it is sometimes not clear which category fits best. And thirdly, code quality and style vary between projects.
III Building Classifiers
To answer our second Research Question, we now describe our attempt to build machine learning classifiers that identify the three elements described in the previous section. We use the C4.5 decision tree algorithm [9]. A decision tree algorithm is efficient and easy to implement, although the feature set we use is relatively complex, as explained later. We particularly like the property that the obtained tree is, to some extent, human readable, allowing us to investigate which features work best.
We built three different classifiers, one for each element: Extent, Target and Category. Fig. 5 shows the overall system architecture. An input source code is first processed by a Java parser; we use Java Development Tools (JDT) [7] here. The parse tree is then fed into the Extent Tagger, the first classifier, which identifies the beginning and end of each comment extent. The Extent Tagger needs to be applied prior to the other two classifiers (Target and Category) because they require an entire comment extent rather than individual comments. After this stage, inline comments are grouped into one chunk and each comment extent has one continuous text. Then a natural language parser is applied. We use an English parser, under the assumption that the majority of source code comments are written in English (out of the 1,000 comments we reviewed, 38 were actually written in Chinese; all others were in English). Finally, the combined features are fed into the Target and Category classifiers respectively. In the following subsections, we describe each classifier in the pipeline.
III-A Recognizing Comment Extent
The Extent Tagger is a decision tree-based classifier that marks the beginning and end of each comment extent. In order to perform this as a classification task, we use IOB notation [10] (sometimes called “BIO notation”). Each comment tag is assigned one of three tags: “I” (middle of an extent), “O” (outside of an extent) or “B” (beginning of an extent). A tagging example is shown in Fig. 6. Note that the comments at lines 4 and 5 are regarded as separate even though they are consecutive, hence a “B” tag is given to both lines. Since we only deal with comment tags, only “B” and “I” tags appear in the actual process. IOB notation is a common technique for marking the boundaries of an object sequence, and it is used in many natural language processing tasks.
The actual mechanism of the Extent Tagger works as follows. First, preprocessing is applied to each comment tag and its surrounding syntax elements are identified; they are listed as a set of discrete features. Since comments are not well integrated into the parse tree, we process them separately and combine them based on their text location within the source file. To our knowledge, this is the first attempt to use the precise relationship between a language parse tree and its comments for categorizing comments. Then the distance between each comment and its neighboring comments is calculated, expressed as a number of lines (vertical distance) and a number of columns (horizontal distance). Finally, these features are fed into the classifier and the IOB tag is identified. The list of features used is shown in Tab. III.
Feature | Description |
---|---|
DeltaRows | Distance in lines from the previous comment. |
DeltaCols | Difference in columns from the previous comment. |
DeltaLeft | Difference in columns between a comment and a syntax element. |
LeftSyntax | Syntax element to the left of the comment. |
RightSyntax | Syntax element to the right of the comment. |
ParentSyntax | Parent syntax element of the comment. |
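A minimal sketch (using the JDT API) of how the DeltaRows and DeltaCols features in Tab. III could be computed for consecutive comment tags:

```java
import java.util.List;
import org.eclipse.jdt.core.dom.Comment;
import org.eclipse.jdt.core.dom.CompilationUnit;

class ExtentFeatures {
    // Compute DeltaRows/DeltaCols between each comment tag and the
    // previous one, as described in Tab. III.
    static void printDistances(CompilationUnit cu) {
        List<?> comments = cu.getCommentList();
        Comment prev = null;
        for (Object o : comments) {
            Comment c = (Comment) o;
            int line = cu.getLineNumber(c.getStartPosition());
            int col = cu.getColumnNumber(c.getStartPosition());
            if (prev != null) {
                int prevLine = cu.getLineNumber(prev.getStartPosition());
                int prevCol = cu.getColumnNumber(prev.getStartPosition());
                int deltaRows = line - prevLine;   // vertical distance
                int deltaCols = col - prevCol;     // horizontal distance
                System.out.println("DeltaRows=" + deltaRows
                    + " DeltaCols=" + deltaCols);
            }
            prev = c;
        }
    }
}
```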
III-B Identifying Comment Target and Category
After the Extent Tagger stage, comments are merged into one continuous text. Then Part-of-Speech (POS) tagging is applied. The POS tagger assigns one of 36 POS tags to each word, such as VBZ (verb, 3rd person) or NNS (plural noun) [11]. We use the CoreNLP natural language processing toolkit [12]. At this point, every word in a comment extent has been assigned a POS tag, and these tags can be used as features. We add a few extra binary features (“HasSymbol”) using a regular expression pattern to detect whether the text includes a symbol that is commonly used in program code. These extra features are manually crafted and are expected to help the classifier identify characteristics of certain categories. Then we independently apply two classifiers to this feature set and identify the target and category of each comment extent. The list of used features is shown in Tab. IV.
Feature | Description |
---|---|
LeftSyntax | Syntax element to the left of the comment. |
RightSyntax | Syntax element to the right of the comment. |
ParentSyntax | Parent syntax element of the comment. |
HasSymbol | Does the comment text include a symbol? |
PosTagFirst | POS tag of the first word of the comment. |
PosTagAny | Does the comment text include a certain POS tag? |
WordFirst | First word of the comment text. |
WordAny | Does the comment text include a certain word? |
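A minimal sketch of how the natural-language features in Tab. IV might be computed with CoreNLP; the HasSymbol pattern shown here is an illustrative guess, not the exact expression used in our implementation:

```java
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

class CommentFeatures {
    // Illustrative pattern for "symbols commonly used in program code".
    private static final Pattern SYMBOL =
        Pattern.compile("[(){}\\[\\];=<>]|->|::|\\w+\\(\\)");

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String comment = "clear the ring buffer.";
        Annotation ann = new Annotation(comment);
        pipeline.annotate(ann);

        List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
        String posTagFirst =
            tokens.get(0).get(CoreAnnotations.PartOfSpeechAnnotation.class);
        String wordFirst = tokens.get(0).word();
        boolean hasSymbol = SYMBOL.matcher(comment).find();

        System.out.println("PosTagFirst=" + posTagFirst   // e.g. VB
            + " WordFirst=" + wordFirst
            + " HasSymbol=" + hasSymbol);
    }
}
```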
Our C4.5 implementation is fairly straightforward. The decision tree learner works as follows: it scans all the input examples and searches for the feature that splits the given examples best. In the C4.5 algorithm, this means that the split with the maximum information gain is chosen. The algorithm starts with the most significant feature and then repeatedly splits the subtrees until it meets a predefined cutoff criterion; an important feature tends to appear at the top of the tree, and less significant features appear further down. In general, setting the cutoff threshold too small causes the tree to overfit, while setting it too large makes it underfit. In our experiment, we found that setting the minimum threshold to 10 examples produced the best results. The more detailed mechanism is described in [9]. Fig. 7 shows a sample decision tree. Once the decision tree is constructed, it is converted to simple if-then clauses, so that the actual classification can be performed efficiently.
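The following is a rough, self-contained sketch of the split criterion just described (maximum information gain with a minimum-example cutoff); it illustrates the idea rather than reproducing our actual C4.5 implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SplitSelection {
    // One training example: discrete feature values plus a class label.
    record Example(Map<String, String> features, String label) {}

    // Entropy of the class labels in a set of examples.
    static double entropy(List<Example> examples) {
        Map<String, Integer> counts = new HashMap<>();
        for (Example e : examples) counts.merge(e.label(), 1, Integer::sum);
        double h = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / examples.size();
            h -= p * Math.log(p) / Math.log(2);
        }
        return h;
    }

    // Information gain obtained by splitting the examples on one feature.
    static double gain(List<Example> examples, String feature) {
        Map<String, List<Example>> parts = new HashMap<>();
        for (Example e : examples) {
            parts.computeIfAbsent(e.features().get(feature),
                                  k -> new ArrayList<>()).add(e);
        }
        double remainder = 0.0;
        for (List<Example> part : parts.values()) {
            remainder += ((double) part.size() / examples.size()) * entropy(part);
        }
        return entropy(examples) - remainder;
    }

    // Choose the feature with the maximum information gain, refusing to
    // split nodes smaller than the minimum example count (the cutoff).
    static String bestFeature(List<Example> examples,
                              List<String> features, int minExamples) {
        if (examples.size() < minExamples) return null;
        String best = null;
        double bestGain = 0.0;
        for (String f : features) {
            double g = gain(examples, f);
            if (g > bestGain) { bestGain = g; best = f; }
        }
        return best;
    }
}
```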
IV Experiments
In this section, we describe our experimental setup and results. We measured the performance of the three classifiers described above. We first describe our data set and then present the experimental results on Java source code. Then, to measure the generality of our model, we apply the same classifiers to another language, Python. Finally, we apply our method to a wider range of projects and report the findings.
IV-A Data Set
As a data set for the experiments, we selected the top 1,000 GitHub projects by popularity (number of Stars); the data was retrieved in July 2017 (Java) and November 2017 (Python). The overall size of the data set is listed in Tab. V. We then parsed all the Java files in each project and randomly chose 1,000 comments as a training set for the classifiers. (We enumerated all the comments of the above projects, shuffled them, then picked the first 1,000 comments while limiting the maximum number of comments per source file to three. This way, an unusually large source file does not affect the overall distribution, while large projects with many source files can still be more representative than smaller projects.) These 1,000 comments were manually annotated with the three elements described in Section II. The frequency of each comment target and category in the training set is listed in Tab. VI. We then chose another 100 comments independently and used them for the manual tagging experiment described in Section II-C1. The result of the tagging experiment was further narrowed down because some comments had no agreement in category (all the participants chose different categories). In the end, the remaining 84 comments were used as the test set. The distribution of categories in the training set and the test set is similar: the Kullback-Leibler divergence between the two distributions is 0.11, calculated as $D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$, where $P(i)$ and $Q(i)$ are the probability of each category in the training set and test set, respectively.
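A short sketch of this calculation, with placeholder probabilities rather than the actual category frequencies:

```java
class KLDivergence {
    // D_KL(P || Q) = sum_i P(i) * log(P(i) / Q(i)).
    static double kl(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] > 0.0) {
                d += p[i] * Math.log(p[i] / q[i]);
            }
        }
        return d;
    }

    public static void main(String[] args) {
        // Placeholder category probabilities for training and test sets.
        double[] train = {0.55, 0.20, 0.10, 0.08, 0.07};
        double[] test  = {0.60, 0.18, 0.06, 0.10, 0.06};
        System.out.printf("KL = %.3f%n", kl(train, test));
    }
}
```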
Language | Projects | Files | SLOC | Comments |
---|---|---|---|---|
Java | 1,000 | 480,600 | 63,224,880 | 4,049,628 |
Python | 990 | 160,844 | 29,070,278 | 2,215,683 |
IV-B Experimental Results
We now present the performance of our classifiers. The first classifier, the Extent Tagger, had 97.7% accuracy per comment tag. The Target classifier had 70% accuracy per comment extent. For the Category classifier, we measured the performance for each category, as listed in Tab. VII. Although the accuracy of the classifier varies depending on the category, it has reasonable performance (61% precision and 89% recall) for the “Postcondition” category, which was our original purpose for this research. Since no comment was classified as the “Meta Information” category, its column is left out from both tables.
Category | Po | Pr | Co | Vi | Va | In | Di | Cls. |
---|---|---|---|---|---|---|---|---|
Postcondition | 31 | 3 | 1 | 0 | 0 | 0 | 0 | 35 |
Precondition | 8 | 10 | 0 | 0 | 1 | 0 | 0 | 19 |
Comment Out | 0 | 0 | 3 | 0 | 1 | 0 | 0 | 4 |
Visual Cue | 3 | 0 | 0 | 6 | 0 | 0 | 0 | 9 |
Value Descr. | 4 | 1 | 0 | 0 | 2 | 0 | 0 | 7 |
Instruction | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 6 |
Directive | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 4 |
Answer | 51 | 15 | 5 | 8 | 4 | 1 | 0 | 84 |
Category | Precision | Recall | F1 |
---|---|---|---|
Postcondition | 0.61 (31/51) | 0.89 (31/35) | 0.72 |
Precondition | 0.67 (10/15) | 0.53 (10/19) | 0.59 |
Comment Out | 0.60 (3/5) | 0.75 (3/4) | 0.67 |
Visual Cue | 0.75 (6/8) | 0.67 (6/9) | 0.71 |
Value Descr. | 0.50 (2/4) | 0.29 (2/7) | 0.36 |
Instruction | 0.00 (0/0) | 0.00 (0/6) | 0.00 |
Directive | 1.00 (1/1) | 0.25 (1/4) | 0.40 |
IV-C Adapting to Another Language
After experimenting with Java source code, we applied the obtained classifier to another language, Python. This is done by applying a rather straightforward transformation to the features. More specifically, we converted the names of Java syntax elements in features like LeftSyntax, RightSyntax and ParentSyntax into their Python counterparts by simply replacing them (Tab. VIII). All other features in the decision tree were kept intact. Note that this decision tree was originally obtained from Java source code, so we did not prepare any training data for this experiment. We manually annotated 100 Python comments as the test set, in the same manner described in Section IV-A. We use the Python Abstract Syntax Tree (AST) module as the parser [8].
Java Syntax | Python Syntax |
---|---|
SimpleName | Name |
MethodDeclaration | FunctionDef |
ExpressionStatement | Expr |
IfStatement | If |
MethodInvocation | Call |
ForStatement | For |
StringLiteral | Str |
NumberLiteral | Num |
ArrayInitializer | Tuple |
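The substitution itself is mechanical; a minimal sketch of the feature-name rewriting using the mapping in Tab. VIII (the feature representation is simplified):

```java
import java.util.HashMap;
import java.util.Map;

class SyntaxMapping {
    // Java syntax element names -> Python counterparts (Tab. VIII).
    static final Map<String, String> JAVA_TO_PYTHON = new HashMap<>();
    static {
        JAVA_TO_PYTHON.put("SimpleName", "Name");
        JAVA_TO_PYTHON.put("MethodDeclaration", "FunctionDef");
        JAVA_TO_PYTHON.put("ExpressionStatement", "Expr");
        JAVA_TO_PYTHON.put("IfStatement", "If");
        JAVA_TO_PYTHON.put("MethodInvocation", "Call");
        JAVA_TO_PYTHON.put("ForStatement", "For");
        JAVA_TO_PYTHON.put("StringLiteral", "Str");
        JAVA_TO_PYTHON.put("NumberLiteral", "Num");
        JAVA_TO_PYTHON.put("ArrayInitializer", "Tuple");
    }

    // Rewrite a syntax-valued feature (LeftSyntax, RightSyntax,
    // ParentSyntax) so the Java-trained tree can be reused on Python.
    static String toPython(String javaSyntaxName) {
        return JAVA_TO_PYTHON.getOrDefault(javaSyntaxName, javaSyntaxName);
    }
}
```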
Tab. IX shows the results. Note that despite the overall degradation of performance in all categories, the accuracy for the “Postcondition” category stayed at a reasonable level. This result suggests that this type of comment has the same characteristics in both languages, which implies the existence of a universal “grammar” for them.
Category | Po | Pr | Co | Vi | Va | In | Di | Cls. |
---|---|---|---|---|---|---|---|---|
Postcondition | 35 | 10 | 5 | 1 | 0 | 0 | 2 | 53 |
Precondition | 14 | 8 | 3 | 2 | 1 | 2 | 1 | 31 |
Comment Out | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
Visual Cue | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
Value Descr. | 3 | 3 | 0 | 2 | 1 | 0 | 0 | 9 |
Instruction | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 3 |
Directive | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Answer | 56 | 21 | 10 | 5 | 2 | 2 | 4 | 100 |
Category | Precision | Recall | F1 |
---|---|---|---|
Postcondition | 0.63 (35/56) | 0.66 (35/53) | 0.64 |
Precondition | 0.38 (8/21) | 0.26 (8/31) | 0.20 |
Comment Out | 0.10 (1/10) | 1.00 (1/1) | 0.18 |
Visual Cue | 0.00 (0/5) | 0.00 (0/3) | 0.00 |
Value Descr. | 0.50 (1/2) | 0.11 (1/9) | 0.11 |
Instruction | 0.25 (1/4) | 0.33 (1/3) | 0.17 |
Directive | 0.00 (0/2) | 0.00 (0/0) | 0.00 |
IV-D Applying to Large Corpora
We applied our classifier to a number of GitHub repositories to see the overall tendency of comment categories in various projects, both in Java and in Python. Figs. 8 and 9 show the category distributions of the top 10 projects by number of comments (shown in parentheses). We can see that most projects have a comparable ratio of categories. This answers our third Research Question.
A notable exception in this figure is the neo4j project, which has an unusually high ratio of the “Visual Cue” category. It turned out that this project has a high volume of unit testing code (about 30% of its 900k lines of code). According to a unit test convention in Java, each unit test has sections marked by comments such as Given, When, and Then. These comments were recognized as visual cues by our classifier.
In an attempt to further explore our results, we extracted the most commonly used “verb + noun” pairs in the “Postcondition” comments. The results are shown in Tab. X and XI. It turned out that the most common phrase in Java comments across all the projects is “do nothing”. We also extracted a few anecdotal code snippets whose corresponding comments contain “clear” and “buffer”. The obtained snippets are shown in Fig. 10 (one possible extraction procedure is sketched after the figure). This demonstrates one way to utilize the proposed method for our original goal; by focusing on postconditional comments, we can discover a number of nontrivial ways in which a buffer is “cleared”.
Verb + Noun | Projects |
---|---|
do nothing | 332 |
throw exception | 170 |
set default | 161 |
add list | 154 |
do anything | 146 |
set value | 140 |
use default | 122 |
have value | 119 |
create file | 119 |
create list | 116 |
Verb + Noun | Projects |
---|---|
create object | 149 |
get list | 143 |
get data | 134 |
do anything | 133 |
do nothing | 130 |
get name | 129 |
keep track | 128 |
raise exception | 128 |
write file | 123 |
create file | 123 |
OpenGrok/…/UtilTest.java:
out.getBuffer().setLength(0); // clear buffer

atlas/…/BaseLayer.java:
// Clear the off screen buffer. This is
// necessary for some phones.
canvas.drawRect(0, 0, canvas.getWidth(), canvas.getHeight(), clearPaint);

druid/…/LimitedBufferGrouper.java:
// clear the used bits of the first buffer
for (int i = 0; i < maxBuckets; i++) {
  subHashTableBuffers[0].put(i * bucketSizeWithHash, (byte)0);
}

hadoop/…/Shell.java:
// clear the input stream buffer
String line = inReader.readLine();
while(line != null) {
  line = inReader.readLine();
}
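The paper does not spell out the extraction procedure; as one possible sketch, the verb + noun pairs above could be obtained by taking, for each “Postcondition” comment, the first verb lemma and the first noun lemma that follows it in the POS-tagged text:

```java
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

class VerbNounPairs {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String comment = "clear the buffer before reuse";
        Annotation ann = new Annotation(comment);
        pipeline.annotate(ann);
        List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);

        String verb = null, noun = null;
        for (CoreLabel tok : tokens) {
            String pos = tok.get(CoreAnnotations.PartOfSpeechAnnotation.class);
            String lemma = tok.get(CoreAnnotations.LemmaAnnotation.class);
            if (verb == null && pos.startsWith("VB")) {
                verb = lemma;                       // first verb lemma
            } else if (verb != null && pos.startsWith("NN")) {
                noun = lemma;                       // first noun after it
                break;
            }
        }
        if (verb != null && noun != null) {
            System.out.println(verb + " " + noun);  // e.g. "clear buffer"
        }
    }
}
```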
V Discussions
In this section, we briefly discuss the experimental results from the previous section. First of all, the performance of our Category classifier is worse than expected. This is probably due to reasons similar to those behind the low agreement ratio of the manual annotations: the distinction between “Precondition” and “Postcondition” is inherently tricky, and the comments were of varying quality.
One way to improve the performance is to use more features. From an inspection of the obtained decision tree for the classifier, we discovered that PosTagFirst is the most significant feature used for identifying a category, and that programming language syntax matters less. This is probably why the classifier worked for both Java and Python. It also roughly corresponds to our experience, as many “Postcondition” comments start with an imperative verb, as in the form of “do this”. Therefore, we can expect some performance gain from using more natural-language-based features. For example, we could use a full English parse tree as a feature, which would allow the classifier to recognize more complex phrases in a consistent manner.
Throughout the experiments, we observed that the ratio of each comment category is mostly unchanged across different projects and different languages (Java and Python). The ratio of “Postcondition” comments usually ranges from 60% to 70%, and the ratio of “Precondition” comments ranges from 10% to 20%. Although they are not precisely comparable, we think that our “Postcondition” and “Precondition” categories roughly correspond to the “Summary” and “Expand” categories described by Pascarella and Bacchelli [3], who reported a similar ratio (about 7:2) across six Java projects. Assuming the ratio of each category does not differ significantly from project to project, our result means that our classifier can consistently find the same categories in both Java and Python projects, suggesting the existence of a universal “grammar” for source code comments.
VI Threats to Validity
There are a couple of threats to the validity of our conclusions. As for threats to internal validity, we are aware that our definition of comment targets or categories might still be incomplete. The validity of our eleven comment categories can be gauged from the number of uncategorized comments (21 out of 300 annotations) as well as the inter-rater agreement ratio. One could argue that the number of reviewed examples or the degree of agreement among the manual annotations is weak. This is especially true for the distinction between the “Precondition” and “Postcondition” categories (as stated in Section II-C1). The small number of annotators could also be a problem, and the annotators might be biased. However, the observed Fleiss’ Kappa indicates moderate agreement, which is not bad. As for the design of the classifiers, we assumed that a comment target and a comment category are independent of each other (Section III); this might not be the case. Also, we treated each comment extent as independent from other comments, i.e. the interpretation of one comment extent is not affected by other comments in the source code. Realistically, this is unlikely: we often see that a comment uses names, ideas, or terms introduced by other comments in the same file or in external files. This could affect the classification results.
As for threats to external validity, one could argue that the amount of test data is not enough. Indeed, the test set for Java comment categories had only 84 cases (Section IV-A). However, the distributions of categories in the training set and the test set were not very different, as shown in Section IV-A. Another threat is that both the training set and the test set are highly skewed in their categories. For now, our main focus is to distinguish the two major categories (“Precondition” and “Postcondition”), so the predictions for smaller categories should be treated carefully; we plan to address this issue in future work. The test set for Python comments was equally small. Also, one could argue that the application of our classifier to Python source code did not work well, citing its lower score (Section IV-C). Python does support docstrings, another mechanism for software documentation. However, docstring texts are usually applied only to class- or method-level descriptions, not to local comments, so we do not think they affect the results.
Another threat to external validity is the way we used GitHub projects. They are all open source software, which could bias the use of code comments. It is also known that code quality varies widely among GitHub projects. A carefully selected set of projects might give a different outcome.
VII Related Work
Source code comments have been an active target of research, but there is no prior work that explores the semantics of local comments for the purpose of obtaining nontrivial comments in an empirical manner. While a group of people advocate “self-documenting code” [13], code comments are still considered an irreplaceable way to express programmers’ intentions [14]. Padioleau et al. studied source code comments from operating systems and concluded that programmers use comments when they cannot express their intention in any other way [1]. They also classified the comments by detailed topics, such as “error code” or “lock related”. Comment classification has been actively studied by Pascarella and Bacchelli [3]. In terms of writing comments, Kramer provided a case study of Java programmers who wrote Javadoc comments for the Java API at Sun Microsystems [6].
There are numerous attempts to mine source code comments to get extra intelligence about a program. Jiang and Hassan examined the relationship between stale comments and bugs [15]. Tan et al. presented a clever approach that automatically finds lock-related bugs by obtaining special patterns from code comments [16]. Ying et al. explored “TODO” comments, or task comments, as a way of programmers’ communication with their coworkers [17]. Storey et al. investigated the relationship between “TODO” comments and software bugs [18]. Sridhara described a way to detect up-to-date TODO comments [19]. Aman et al. studied the possibility of commented-out code leading to software bugs [20, 21].
As for the relationship between code comments and their readers, Salviulo and Scanniello suggested that novice programmers tend to rely on comments more than professionals do [22]. Hirata and Mizuno examined the relevance of code comments using text filtering methods [23]. Some researchers are exploring the idea of automatic comment generation. Sridhara et al. presented a framework for automatically generating a Java method description based on program analysis [24]. Wong et al. proposed a way to generate comments by using programming question sites such as Stack Overflow [25].
To our knowledge, code comments are still largely treated independently from the source code syntax tree. This can be a problem when code is automatically refactored by an IDE. Sommerlad et al. tried to address this problem [26].
VIII Conclusion
In this paper, we presented our attempt to develop a framework for collecting and analyzing source code comments in detail. We proposed a model of comments that has three elements: extent, target and category. We described the definition of each element and conducted a manual annotation experiment. We then presented our attempt to build classifiers that identify these three elements using a decision tree algorithm (C4.5). The obtained classifiers could recognize these elements with reasonable accuracy. We tested our classifiers on two programming languages (Java and Python), and applied them to various GitHub projects to test our hypothesis that there is a universal structure in source code comments.
References
- [1] Yoann Padioleau, Lin Tan and Yuanyuan Zhou, “Listening to Programmers - Taxonomies and Characteristics of Comments in Operating System Code”, Proceedings of the 31st International Conference on Software Engineering (ICSE ’09), pp. 331-341.
- [2] Daniela Steidl, Benjamin Hummel and Elmar Jurgens, “Quality Analysis of Source Code Comments”, Proceedings of the 21st International Conference on Program Comprehension (ICPC’ 13), pp. 83-92.
- [3] Luca Pascarella and Alberto Bacchelli, “Classifying code comments in Java open-source software systems”, Proceedings of the 14th International Conference on Mining Software Repositories (MSR ’17), pp. 227-237, ISBN: 978-1-5386-1544-7, DOI: 10.1109/MSR.2017.63.
- [4] Brian W. Kernighan and Rob Pike, “The Practice of Programming”, Addison-Wesley Professional, 1999, ISBN 9780139376818.
- [5] Oracle Corporation, “How to Write Doc Comments for the Javadoc Tool”, http://www.oracle.com/technetwork/java/javase/documentation/index-137868.html, Retrieved on 2018-02-01.
- [6] Douglas Kramer, “API Documentation from Source Code Comments: A Case Study of Javadoc”, Proceedings of the 17th Annual International Conference on Computer Documentation (SIGDOC ’99), pp. 147-153.
- [7] The Eclipse Foundation, “Eclipse Java development tools (JDT)”, https://www.eclipse.org/jdt/, Retrieved on 2018-02-01.
- [8] Python Software Foundation, “32.2. ast - Abstract Syntax Trees”, https://docs.python.org/2/library/ast.html, Retrieved on 2018-02-01.
- [9] J. Ross Quinlan, “C4.5: Programs for Machine Learning”, Morgan Kaufmann Publishers Inc., 1993, ISBN: 1558602402.
- [10] Dan Jurafsky and James H. Martin, “Speech and Language Processing, 2nd Edition”, Prentice Hall, May 16, 2008, ISBN: 0131873210.
- [11] Mitchell P. Marcus, Beatrice Santorini and Mary Ann Marcinkiewicz, “Building a Large Annotated Corpus of English: The Penn Treebank”, Computational Linguistics - Special Issue on Using Large Corpora: II Archive Volume 19 Issue 2, June 1993, pp. 313-330.
- [12] Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky, “The Stanford CoreNLP Natural Language Processing Toolkit”, Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55-60.
- [13] C2 Wiki, “Self Documenting Code”, http://wiki.c2.com/?SelfDocumentingCode, Retrieved on 2018-02-01.
- [14] Jef Raskin, “Comments are More Important than Code”, ACM Queue - Patching and Deployment, Vol. 3, Issue 2, Mar. 2005, pp. 64-65.
- [15] Zhen Ming Jiang and Ahmed E. Hassan, “Examining the Evolution of Code Comments in PostgreSQL”, Proceedings of the 2006 International Workshop on Mining Software Repositories (MSR ’06), pp. 179-180.
- [16] Lin Tan, Ding Yuan and Yuanyuan Zhou, “HotComments: How to Make Program Comments More Useful?”, Proceedings of HotOS’07: 11th Workshop on Hot Topics in Operating Systems, May 7-9, 2005, San Diego, California, USA.
- [17] Annie T. T. Ying, James L. Wright and Steven Abrams, “Source code that talks: an exploration of Eclipse task comments and their implication to repository mining”, Proceedings of the 2005 International Workshop on Mining Software Repositories (MSR ’05), pp. 1-5.
- [18] Margaret-Anne Storey, Jody Ryall, R. Ian Bull, Del Myers and Janice Singer, “TODO or To Bug: Exploring How Task Annotations Play a Role in the Work Practices of Software Developers”, Proceedings of the 30th International Conference on Software Engineering (ICSE ’08), pp. 251-260.
- [19] Giriprasad Sridhara, “Automatically Detecting the Up-To-Date Status of ToDo Comments in Java Programs”, Proceedings of the 9th India Software Engineering Conference (ISEC ’16), pp. 16-25.
- [20] Hirohisa Aman, “Quantitative Analysis of Relationships among Comment Description, Comment Out and Fault-proneness in Open Source Software”, Journal of Information Processing (Japanese), Vol. 53, No. 2, pp. 612-621 (Feb. 2012).
- [21] Hirohisa Aman, Takashi Sasaki, Sousuke Amasaki and Minoru Kawahara, “Empirical analysis of comments and fault-proneness in methods: can comments point to faulty methods?” Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM ’14), Article No. 63, ISBN: 978-1-4503-2774-9, DOI: 10.1145/2652524.2652592.
- [22] Felice Salviulo and Giuseppe Scanniello, “Dealing with Identifiers and Comments in Source Code Comprehension and Maintenance: Results from an Ethnographically-informed Study with Students and Professionals”, Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (EASE ’14), Article No. 48. ISBN: 978-1-4503-2476-2, DOI: 10.1145/2601248.2601251.
- [23] Yukinao Hirata and Osamu Mizuno, “Do Comments Explain Codes Adequately?: Investigation by Text Filtering”, Proceedings of the 8th Working Conference on Mining Software Repositories (MSR ’11), pp. 242-245.
- [24] Giriprasad Sridhara, Emily Hill, Divya Muppaneni, Lori Pollock and K. Vijay-Shanker, “Towards Automatically Generating Summary Comments for Java Methods”, Proceedings of the IEEE/ACM International Conference on Automated Software Engineering (ASE ’10), pp. 43-52.
- [25] Edmund Wong, Jinqiu Yang and Lin Tan, “AutoComment: Mining Question and Answer Sites for Automatic Comment Generation”, Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering (ASE ’13), pp. 562-567.
- [26] Peter Sommerlad, Guido Zgraggen, Thomas Corbat and Lukas Felber, “Retaining Comments when Refactoring Code”, Companion to the 23rd ACM SIGPLAN Conference on Object-Oriented Programming Systems Languages and Applications (OOPSLA Companion ’08), pp. 653-662.