Identifying Algorithm Names in Code Comments

07/10/2019 ∙ by Jakapong Klainongsuang, et al.

Recent machine-learning-based tasks such as API sequence generation, comment generation, and document generation require large amounts of data. When software developers implement algorithms in code, we find that they often mention algorithm names in code comments. Code annotated with such algorithm names can be a valuable data source. In this paper, we propose an automatic method for algorithm name identification. The key idea is extracting important N-gram terms that end with the word `algorithm'. We also consider part-of-speech patterns to derive rules for appropriate algorithm name identification. Our rule evaluation produced high precision and recall values (more than 0.70). We apply our rules to extract algorithm names from a large number of comments in active FLOSS projects written in seven programming languages (C, C++, Java, JavaScript, Python, PHP, and Ruby) and report commonly mentioned algorithm names in code comments.


1 Introduction

The importance of annotated data is increasing for various machine-learning-based tasks, such as API sequence generation Gu et al. (2016), comment generation Jiang et al. (2017), pseudo-code generation Oda et al. (2015), and document generation Wong et al. (2013); Takata et al. (2018). For preparing annotated data, manual annotation is considered an effective approach for obtaining high-quality data (for example, collecting code and comment pairs Oda et al. (2015)). However, because of the human effort required, growing annotated datasets is costly and difficult. To automate the preparation of annotated data, Yin et al. proposed a method to mine aligned code and natural language pairs from Stack Overflow Yin et al. (2018).

In this paper, we focus on algorithms as annotations of code. In mathematics and computer science, algorithms are a central subject of study: they process a set of input data to obtain an output, perform computation, or even automate reasoning tasks (https://en.wikipedia.org/wiki/Algorithm). They are essential because of their problem-independence, although their use can be obscured by control strategies or programming-language-specific details Smith and Lowry (1990). Toward automatically preparing code annotated with algorithm names, this paper addresses identifying algorithm names in code comments. To the best of our knowledge, this is the first study of mining algorithm names from code comments.

    Sorts the array of size count using comparator lessThan using an Insertion
    Sort algorithm.
Listing 1: A code comment containing an algorithm name (Insertion Sort algorithm)

To explain the design of implemented code, we find that developers sometimes explicitly mention algorithm names in code comments. Listing 1 shows an example of an algorithm name provided inline in a code comment (https://github.com/servo/skia/blob/v0.30000019.0/src/core/SkTSort.h#L95). The comment mentions the name of the algorithm, Insertion Sort algorithm.

In this paper, we propose a method for identifying algorithm names and report the frequencies of algorithm names that appear in code comments of free/libre and open source software (FLOSS). The key idea of our identification is extracting important N-gram terms that end with the word ‘algorithm’. We also consider part-of-speech patterns to derive rules for appropriate algorithm name identification. Our rule evaluation produced high precision and recall values (more than 0.70). We apply our rules to extract algorithm names from a large number of comments in active FLOSS projects written in seven programming languages (C, C++, Java, JavaScript, Python, PHP, and Ruby) and report commonly mentioned algorithm names in code comments.

Our contributions can be summarized as follows.

  • We propose a method to identify algorithm names in code comments.

  • The proposed method is manually evaluated with preliminary data containing 1,581 identified N-gram terms and 458 summarized distinct N-gram terms.

  • By applying our method to large-scale FLOSS data, we report commonly used algorithms for seven languages (C, C++, Java, JavaScript, Python, PHP, and Ruby).

2 Algorithm Name Identification

2.1 Overview

In this section, we give an overview of our proposed method. Fig. 1 shows its main steps. In this study, we obtained preliminary code comment data from FLOSS projects hosted on GitHub to create rules for identifying algorithm names.

Figure 1: Overview of our rule creation process

From code comments, we extract terms using N-gram IDF, similar to the study by Terdchanakul et al. (2017). First, to identify algorithm names, we obtain N-gram terms containing the keyword ‘algorithm’ in the last position (e.g., quick sort algorithm, search algorithm). Second, part-of-speech (POS) tagging is applied to all code comments. Finally, we remove unnecessary words at the head position of the N-gram terms by considering POS tags. With the above process, we create inclusive and exclusive criteria to identify appropriate algorithm names.

As these steps show, we do not target algorithm names that do not end with ‘algorithm’. We consider that this makes our identification method precise by ignoring inappropriate terms that merely resemble algorithm names. In addition, our method does not need a list of algorithm names, which makes it robust.

2.2 Text Preprocessing, N-gram Extraction, and POS tagging

To create our inclusive and exclusive criteria, we use preliminary data of code comments containing algorithm keywords from two C repositories (gecko-dev and linux). Special characters such as ‘*’, ‘#’, and ‘/’ were removed, and all code comments were part-of-speech tagged using the spaCy library Omran and Treude (2017).

We applied N-gram IDF, a theoretical extension of Inverse Document Frequency (IDF) introduced by Shirakawa et al. (2015, 2017), to capture N-gram terms, and obtained terms that contain ‘algorithm’ in the last position. Each N-gram term was then searched for in all code comments. If a term matches the words in a comment, it is tagged with the corresponding parts of speech.
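A minimal sketch of the cleaning and candidate extraction steps (the helper names are illustrative, and the actual N-gram IDF weighting of Shirakawa et al. is omitted for brevity):

```python
import re

def clean_comment(comment):
    """Strip special characters such as '*', '#', '/' and normalize spaces."""
    text = re.sub(r"[^a-z0-9\s]", " ", comment.lower())
    return re.sub(r"\s+", " ", text).strip()

def candidate_terms(comment, max_n=4):
    """Collect N-gram terms (n = 2..max_n) whose last word is 'algorithm'."""
    words = clean_comment(comment).split()
    terms = []
    for i, w in enumerate(words):
        if w == "algorithm":
            for n in range(2, max_n + 1):
                if i - n + 1 >= 0:
                    terms.append(" ".join(words[i - n + 1:i + 1]))
    return terms

# The comment from Listing 1 yields 'sort algorithm',
# 'insertion sort algorithm', and 'an insertion sort algorithm'.
print(candidate_terms("/* Sorts the array using an Insertion Sort algorithm. */"))
```

In the actual pipeline, N-gram IDF scores would decide which of these candidates are important terms; the sketch only enumerates them.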

2.3 Remove Unnecessary Words and Summarize Same N-gram Terms

In this step, we recursively remove the unnecessary words shown in Table 1 from the head position, as those words cannot be part of algorithm names. If a single comment contains the term “quick sort algorithm”, we can extract both “sort algorithm” and “quick sort algorithm”. In this case we select the longest one, “quick sort algorithm”.

Part of speech Description
VERB ADP verb followed by a subordinating conjunction or preposition
ADP subordinating conjunction or preposition
NUM number
DET determiner
Table 1: Unnecessary words in the head position
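The recursive head-word removal and longest-term selection can be sketched as follows (an illustrative implementation, with the VERB ADP pattern of Table 1 treated as a removable two-word unit):

```python
# POS tags from Table 1 that cannot begin an algorithm name.
UNNECESSARY_HEAD = {"ADP", "NUM", "DET"}

def strip_head(words, tags):
    """Recursively drop head words whose POS is listed in Table 1."""
    while len(tags) > 1:
        if tags[0] in UNNECESSARY_HEAD:
            words, tags = words[1:], tags[1:]
        elif len(tags) > 2 and tags[0] == "VERB" and tags[1] == "ADP":
            # 'VERB ADP' is removed as a unit, e.g. 'sorted by'.
            words, tags = words[2:], tags[2:]
        else:
            break
    return words, tags

def longest_term(terms):
    """When nested terms come from one comment, keep the longest."""
    return max(terms, key=lambda t: len(t.split()))

print(strip_head(["the", "quick", "sort", "algorithm"],
                 ["DET", "ADJ", "NOUN", "NOUN"]))
print(longest_term(["sort algorithm", "quick sort algorithm"]))
```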

Because the same N-gram term can be tagged with different parts of speech depending on the position of its words in each code comment, we summarize terms by the majority of their POS patterns. Table 2 shows “sort algorithm” from four different code comments. Three occurrences are tagged as NOUN NOUN while the last one is tagged as VERB NOUN. Because NOUN NOUN holds a majority of 3 out of 4 candidates, we set the POS pattern of “sort algorithm” to NOUN NOUN. Meanwhile, the term “blur algorithm” has no majority; we ignore such algorithm name candidates.

N-gram term Part of speech
sort algorithm NOUN NOUN
sort algorithm NOUN NOUN
sort algorithm NOUN NOUN
sort algorithm VERB NOUN
blur algorithm NOUN NOUN
blur algorithm ADJ NOUN
Table 2: Example of N-gram terms and their part of speech
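The majority-vote summarization over a term's tagged occurrences can be sketched as follows (an illustrative helper, not the authors' implementation):

```python
from collections import Counter

def majority_pos(tagged_occurrences):
    """Pick the majority POS pattern across occurrences of one term.

    Returns None when no single pattern holds a majority; such
    candidates are ignored, like 'blur algorithm' in Table 2.
    """
    counts = Counter(tagged_occurrences)
    (top, n), *rest = counts.most_common()
    if rest and rest[0][1] == n:
        return None  # tie: no majority
    return top

# 'sort algorithm' from Table 2: three NOUN NOUN, one VERB NOUN.
print(majority_pos(["NOUN NOUN", "NOUN NOUN", "NOUN NOUN", "VERB NOUN"]))
# 'blur algorithm': one NOUN NOUN, one ADJ NOUN -> no majority.
print(majority_pos(["NOUN NOUN", "ADJ NOUN"]))
```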

2.4 Creating Rules

In this step, we classify the N-gram terms by POS pattern and manually create identification rules. We focus only on reliable N-gram terms; terms with no majority POS pattern or with only a single candidate are excluded. We manually verified, for all obtained algorithm name candidates in the preliminary data, whether each is a correct algorithm name, and labeled it valid or invalid.

If the number of valid algorithm names for a POS pattern is larger than the number of invalid ones, the pattern is included in the inclusive rule set. Otherwise, if the number of valid algorithm names is smaller than the number of invalid ones, the pattern is considered an exclusive rule. An inclusive rule marks an unlabeled term as a valid algorithm name, while an exclusive rule marks it as invalid.
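Rule derivation from the labeled counts can be sketched as follows; the text does not specify tie handling, so this illustration conservatively treats ties as exclusive:

```python
def derive_rules(labeled):
    """Split POS patterns into inclusive and exclusive rule sets.

    labeled maps a POS pattern to (num_valid, num_invalid) counts
    from the manually labeled preliminary data.
    """
    inclusive, exclusive = set(), set()
    for pattern, (valid, invalid) in labeled.items():
        if valid > invalid:
            inclusive.add(pattern)   # unlabeled terms with this pattern -> valid
        else:
            exclusive.add(pattern)   # ties and minorities -> invalid (assumption)
    return inclusive, exclusive

# Hypothetical counts for illustration only.
inc, exc = derive_rules({"NOUN NOUN": (40, 5), "VERB ADP NOUN": (1, 9)})
print(inc, exc)
```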

Next, we built a program to apply the rules automatically to larger data. The procedure is shown in Algorithm 1 together with all created rules. Descriptions of the parts of speech are given in Table 3.

Part of speech Description
VERB verb
VERB(-ing) verb ending with the suffix “ing”
NOUN noun
PART particle
CONJ coordinating conjunction
ADJ adjective
ADV adverb
Table 3: Description of Part of Speech
1:procedure rule checking (part of speech of unlabeled term)
2:      
3:      if  then
4:            return valid
5:      else if  then
6:            return invalid
7:      else if  then
8:            return invalid
9:      else if  then
10:            if  then
11:            else if  then
12:                 return valid
13:            else if  then
14:                 return valid             
15:            return invalid
16:      else if  then
17:            if  then
18:                 return valid
19:            else if  then
20:                 return valid
21:            else if  then
22:                 return valid
23:            else if  then
24:                 return valid
25:            else if  then
26:                 return valid             
27:            return invalid
28:      else if  then
29:            if  then
30:                 return valid             
31:            return invalid
32:      else if  then
33:            if  then
34:                 return valid
35:            else if  then
36:                 return valid
37:            else if  then
38:                 return valid
39:            else if  then
40:                 return valid             
41:            return invalid       
Algorithm 1 Algorithm Name Identification

3 Identification Evaluation with Preliminary Data

In this section, we evaluate our method (Algorithm 1) with the preliminary data. One of the authors manually created an oracle of all algorithm name candidates with valid and invalid labels. All algorithm name candidates in the preliminary data are classified as valid or invalid by our method and compared with the oracle. We present Precision, Recall, and F-measure in Table 4. Precision measures the accuracy of Algorithm 1 in correctly identifying valid algorithm names. Recall is the fraction of valid algorithm names that are successfully retrieved. F-measure is the harmonic mean of Precision and Recall. We obtained considerably high accuracy (more than 0.7 in Precision, Recall, and F-measure).

Metric Score
Precision 0.76
Recall 0.70
F-measure 0.73
Table 4: Performance of algorithm name identification
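For illustration, the three metrics follow from true positive, false positive, and false negative counts; the counts below are hypothetical, chosen only to reproduce the values in Table 4:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from classification counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts (not reported in the paper) matching Table 4.
p, r, f = prf(tp=70, fp=22, fn=30)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.76 0.7 0.73
```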

4 Applying to Large-scale FLOSS Data

Several Web articles indicate that developers and students are interested in knowing important algorithms Ojha (2015); Quora (2016); Ojha (2017). Despite this interest in understanding algorithm usage in practice, there is, as far as we know, no empirical study of algorithm use in FLOSS.

To investigate the frequencies of algorithm names in FLOSS code comments, we collect GitHub repositories that have more than 500 commits in their entire history and at least 100 commits in their most active two years; this criterion excludes long-term inactive projects and very short-term active projects (university classes, for example). Code comments were extracted from the identified repositories: C 2,771, C++ 3,563, Java 4,995, JavaScript 7,130, Python 5,263, Ruby 2,233, and PHP 3,279.
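The activity criterion can be sketched as a sliding two-year window over a repository's commit timestamps (an illustrative implementation, not the one used in the study):

```python
from bisect import bisect_right

TWO_YEARS = 2 * 365 * 24 * 3600  # seconds, ignoring leap days

def is_active(commit_times, min_total=500, min_window=100):
    """Apply the repository filter from Section 4.

    commit_times: sorted Unix timestamps of a repository's commits.
    Keeps repositories with at least min_total commits overall and at
    least min_window commits inside some two-year window.
    """
    if len(commit_times) < min_total:
        return False
    for i, t in enumerate(commit_times):
        # Count commits in [t, t + TWO_YEARS].
        j = bisect_right(commit_times, t + TWO_YEARS)
        if j - i >= min_window:
            return True
    return False

dense = [i * 86400 for i in range(600)]        # 600 commits in ~20 months
sparse = [i * 40 * 86400 for i in range(600)]  # one commit every 40 days
print(is_active(dense), is_active(sparse))     # True False
```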

Table 5 shows the top 10 results after applying our rules, ranked by frequency (the numbers below the algorithm names). Of the 70 algorithm names, 15 were not appropriate algorithm names, such as “learning algorithm”, “following algorithm”, and “legacy algorithm”, giving a precision of 0.786 (55/70). Since such inappropriate algorithm names can be spotted immediately, we exclude them from the results.

Rank Programming Language
Java Ruby Python PHP C C++ JavaScript
1 Search Encryption Nagles Encryption Hash Compression Ordering
3,142 537 123 213 3,193 925 799
2 Optimization Compression Signature RC4 Compression Unicode Bidirectional Sort
493 22 119 206 2,592 711 599
3 Neighbour search Audionormalizationsettings SBO Fragment parsing Scheduling Slow iterative Diff
454 22 86 157 2,472 597 534
4 Estimation Hash Parsing Search Sfrcacc Sorting Parsing
451 15 74 137 2,363 523 513
5 Signature Search Hashing Depthfirst search Search Dct Ritters
379 12 70 87 1,939 448 448
6 Sorting Kex Exchange Ordering Encryption Hash O(nd) Difference
377 12 67 84 1,774 428 448
7 Encryption Serverside encryption Encryption Signing Auth MD5 Search
370 10 66 51 1,397 415 406
8 Binary search Tarjans Generation Header canonicalization Rate control LLM Clipping
266 9 61 51 1,340 410 351
9 Stemming Matching Nagle Body canonicalization Public key Euclids Iteration
238 9 56 51 1,073 391 322
10 Hash Signature Matching Canonicalization Authentication Clipping Ray casting
223 5 55 50 1,070 389 284
Table 5: Top ranked algorithms found in seven programming languages. The term ‘algorithm’ is abbreviated.

We see that search algorithms appear frequently in Java, Ruby, PHP, C, and JavaScript. Encryption algorithms appear in Java, Ruby, Python, PHP, and C. In C, most of the top 10 algorithms are related to security. Some groups of algorithms, such as search, parsing, hash, and sorting, are implemented in many programming languages. PHP and JavaScript have algorithms related to the Web.

Table 6 shows examples of code comments including identified algorithm names, along with programming languages, organizations, repositories, and file names. By extracting the code associated with the identified algorithm names, it seems possible to automatically collect annotated code in the future.

Language, Organization/Repository, Filename Example of Comment
C, /*Encryption Algorithm for Unicast Packet */
dorimanx/Dorimanx-LG-G2-D802-Kerne,
wifi.h
Ruby, # Instantiates one of the Transport::Kex classes (based on the negotiated
openshift/openshift-extras, # kex algorithm), and uses it to exchange keys. */
algorithms.rb
Java, // The reason for this method name, as opposed to getFirstStrongDir(), is that
codenameone/CodenameOne, // “first strong” is a commonly used description of Unicode’s estimation algorithm
BidiFormatter.java
C++, /* This is the central step in the MD5 algorithm
hwine/test-mc-ma-cvs,
md5.cc
Python, # Enable Nagle’s algorithm for proxies, to avoid packet fragmentation.
Andersh75/resurseffektivitet, # We cannot know if the user has added default socket options, so we cannot replace the
connectionpool.py # list.
PHP, * Returns the input text encrypted using RC4 algorithm and the specified key.
DerDu/SPHERE-Framework, * RC4 is the standard encryption algorithm used in PDF format
tcpdf_static.php
JavaScript, // ray casting algorithm for detecting if point is in polygon
CartoDB/d3.cartodb,
leaflet.js
Table 6: Example of Results

5 Related Work

5.1 Benefits of code comments

Some studies have analyzed the importance of code comments in maintaining software. According to Tenny (1988), a programmer's ability to acquire important information from source code relies on the readability of the program, which is crucial in software maintenance: the more readable the code, the easier it is for developers to maintain the program. The experiment compared the effect of procedures and code comments on more than 100 software engineering students. The analysis shows that code comments written by authors improve the ability to read the program, and the effect is even more significant when no procedures are provided in the source code. Conversely, procedures affect the readability of a program only slightly.

Similar work was also performed by Woodfield et al. (1981) to investigate the benefit of comments in code. The authors analyzed the relationship between types of modularization and comments and the ability of programmers to comprehend code. Programmers were asked several questions about four different modularization types (monolithic, functional, super, and abstract data type) of the same program, with and without comments. The results indicate that programmers found it easier to answer when the source code contained comments, and the modularization under which subjects performed best was the abstract data type.

5.2 Measurement of code comments

Hu et al. (2018) investigated the effectiveness of the code comments used by developers. They argued that comments are essential for guiding programmers to understand source code and for making programs easier and faster to analyze; however, comments are frequently inconsistent, outdated, or even misplaced in a project. Applying NLP to FLOSS projects, the authors propose an automatic tool to generate comments for methods written in Java. Their findings show that the proposed method surpasses existing techniques by a substantial margin.

An analysis and assessment of code comment quality was also conducted by Steidl et al. (2013). The authors state that software developers rely on their understanding of source code for development and maintenance, but their comprehension depends on the quality of the code comments. Their approach, applied to FLOSS, consists of comment classification, quality model development, model assessment, and validity evaluation. The study indicates that the proposed method offers a more detailed analysis than existing techniques for classifying code comments.

6 Threats to Validity

Threats to construct validity exist in our rule creation approach with the preliminary data. Since the rules were created from limited data, they may not generalize. However, given the good precision observed on the large-scale data (Section 4), we consider the rules robust and not limited to specific projects.

Threats to external validity exist in our data preparation. Although we analyzed a large number of repositories on GitHub, we cannot generalize our findings to industry or to open source projects in general; some open source repositories are hosted outside GitHub, e.g., on GitLab or private servers. Further studies are required.

7 Conclusion

In this paper, we have presented a method to identify algorithm names in code comments by creating rules using N-gram IDF and part-of-speech tagging. We determine the dominant part-of-speech pattern of each N-gram term by majority vote. Our evaluation shows that our method accurately identifies algorithm names in FLOSS code comments. In the future, we plan to identify the associated algorithm implementation code as well as algorithm names. Using such data, we could help developers by recommending actual code implementing algorithms from active FLOSS projects.

References

  • Gu et al. (2016) X. Gu, H. Zhang, D. Zhang, S. Kim, Deep api learning, in: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, ACM, New York, NY, USA, 2016, pp. 631–642.
  • Jiang et al. (2017) S. Jiang, A. Armaly, C. McMillan,

    Automatically generating commit messages from diffs using neural machine translation,

    in: Proceedings of the 32Nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, IEEE Press, Piscataway, NJ, USA, 2017, pp. 135–146.
  • Oda et al. (2015) Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, S. Nakamura, Learning to generate pseudo-code from source code using statistical machine translation (t), in: Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), ASE ’15, IEEE Computer Society, Washington, DC, USA, 2015, pp. 574–584.
  • Wong et al. (2013) E. Wong, J. Yang, L. Tan, Autocomment: Mining question and answer sites for automatic comment generation, in: Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, ASE’13, IEEE Press, Piscataway, NJ, USA, 2013, pp. 562–567.
  • Takata et al. (2018) D. Takata, A. Alhefdhi, M. Rungroj, H. Hata, H. K. Dam, T. Ishio, K. Matsumoto, Catalogen: Generating catalogs of code examples collected from oss, in: 2018 IEEE Third International Workshop on Dynamic Software Documentation (DySDoc3), pp. 11–12.
  • Yin et al. (2018) P. Yin, B. Deng, E. Chen, B. Vasilescu, G. Neubig, Learning to mine aligned code and natural language pairs from stack overflow, in: Proceedings of the 15th International Conference on Mining Software Repositories, MSR ’18, ACM, New York, NY, USA, 2018, pp. 476–486.
  • Smith and Lowry (1990) D. R. Smith, M. R. Lowry, Algorithm theories and design tactics, Science of Computer Programming 14 (1990) 305 – 321.
  • Terdchanakul et al. (2017) P. Terdchanakul, H. Hata, P. Phannachitta, K. Matsumoto, Bug or not? bug report classification using n-gram idf, in: 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 534–538.
  • Omran and Treude (2017) F. N. A. A. Omran, C. Treude, Choosing an nlp library for analyzing software documentation: A systematic literature review and a series of experiments, in: Proceedings of the 14th International Conference on Mining Software Repositories, MSR ’17, IEEE Press, Piscataway, NJ, USA, 2017, pp. 187–197.
  • Shirakawa et al. (2015) M. Shirakawa, T. Hara, S. Nishio, N-gram idf: A global term weighting scheme based on information distance, in: Proceedings of the 24th International Conference on World Wide Web, WWW ’15, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2015, pp. 960–970.
  • Shirakawa et al. (2017) M. Shirakawa, T. Hara, S. Nishio, Idf for word n-grams, ACM Trans. Inf. Syst. 36 (2017) 5:1–5:38.
  • Ojha (2015) R. Ojha, Top 7 algorithms and data structures every programmer should know about, https://www.hackerearth.com/blog/algorithms/top-7-algorithms-data-structures-every-programmer-know/, 2015.
  • Quora (2016) Quora, What are the top 10 algorithms every software engineer should know by heart?, https://www.quora.com/What-are-the-top-10-algorithms-every-software-engineer-should-know-by-heart, 2016.
  • Ojha (2017) R. Ojha, Top 10 algorithms every software engineer should know by heart, https://www.freelancinggig.com/blog/2017/05/09/top-10-algorithms-every-software-engineer-know-heart/, 2017.
  • Tenny (1988) T. Tenny, Program readability: procedures versus comments, IEEE Transactions on Software Engineering 14 (1988) 1271–1279.
  • Woodfield et al. (1981) S. N. Woodfield, H. E. Dunsmore, V. Y. Shen, The effect of modularization and comments on program comprehension, in: Proceedings of the 5th International Conference on Software Engineering, ICSE ’81, IEEE Press, Piscataway, NJ, USA, 1981, pp. 215–223.
  • Hu et al. (2018) X. Hu, G. Li, X. Xia, D. Lo, Z. Jin, Deep code comment generation, in: Proceedings of the 26th Conference on Program Comprehension, ICPC ’18, ACM, New York, NY, USA, 2018, pp. 200–210.
  • Steidl et al. (2013) D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, in: 2013 21st International Conference on Program Comprehension (ICPC), pp. 83–92.