The importance of annotated data is increasing for various machine-learning-based tasks, such as generating API sequences Gu et al. (2016), comment generation Jiang et al. (2017), pseudo-code generation Oda et al. (2015), and document generation Wong et al. (2013); Takata et al. (2018). To prepare annotated data, manual annotation is considered an effective approach for obtaining high-quality data (for example, collecting code and comment pairs Oda et al. (2015)). However, because of the human effort required, growing annotated data sets is costly and not easy. To automate the preparation of annotated data, Yin et al. proposed a method to mine aligned code and natural language pairs from Stack Overflow Yin et al. (2018).
In this paper, we focus on algorithms as annotations of code. In mathematics and computer science, an algorithm is a finite sequence of well-defined steps that processes a set of input data to obtain an output, performs a computation, or even automates reasoning tasks (https://en.wikipedia.org/wiki/Algorithm). Algorithms are essential because they are problem-independent, yet in implementations they are obscured by control strategies and programming-language choices Smith and Lowry (1990). Toward automatically preparing code annotated with algorithm names, this paper addresses identifying algorithm names in code comments. To the best of our knowledge, this is the first study of mining algorithm names from code comments.
To explain the design of implemented code, we find that developers sometimes explicitly mention algorithm names in the associated code. Listing 1 shows an example of an algorithm name provided inline in a code comment (https://github.com/servo/skia/blob/v0.30000019.0/src/core/SkTSort.h#L95). The comment mentions the name of the algorithm, the insertion sort algorithm.
Our contributions can be summarized as follows.
We propose a method to identify algorithm names in code comments.
The proposed method is manually evaluated with preliminary data containing 1,581 identified N-gram terms and 458 summarized distinct N-gram terms.
2 Algorithm Name Identification
In this section, we give an overview of our proposed method. Fig. 1 shows the main steps of our method. In this study, we obtained the preliminary code comment data from FLOSS projects hosted on GitHub to create rules to identify algorithm names.
2.1 Overview
From code comments, we extract word terms using N-gram IDF, similar to the study by Terdchanakul et al. (2017). First, to identify algorithm names, we obtain N-gram terms containing the keyword ‘algorithm’ in the last position (e.g. quick sort algorithm, search algorithm). Second, part-of-speech (POS) tagging is applied to all code comments. Finally, we remove the unnecessary words in the head position of the N-gram word terms by considering POS tags. With the above process, we create Inclusive and Exclusive criteria to identify appropriate algorithm names.
As the processes show, we do not target algorithm names that do not include ‘algorithm’ in the last position. We consider that this makes our identification precise, because it ignores inappropriate terms that merely resemble algorithm names. In addition, our method does not need a list of algorithm names, which makes it robust.
2.2 Text Preprocessing, N-gram Extraction, and POS tagging
To create our inclusive and exclusive criteria, we use our preliminary data of code comments containing the keyword ‘algorithm’ from two repositories written in the C programming language (i.e. gecko-dev and linux). Special characters such as ‘*’, ‘#’, and ‘/’ were removed, and all code comments were tagged with parts of speech using the spacy library Omran and Treude (2017).
We applied N-gram IDF, a theoretical extension of Inverse Document Frequency (IDF) introduced by Shirakawa et al. (2015, 2017), to capture N-gram terms, and obtained terms that contain ‘algorithm’ in the last position. Each N-gram word term was then searched for in all code comments; terms that match the words in a comment are tagged with the corresponding parts of speech.
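As a concrete illustration, the cleaning and candidate-extraction steps above can be sketched in Python (a minimal sketch with our own helper names; the actual implementation relies on the N-gram IDF tool for term weighting and spacy for POS tagging):

```python
import re
from typing import List

def clean_comment(comment: str) -> str:
    """Strip special characters such as '*', '#', '/' from a comment."""
    return re.sub(r"[*#/]", " ", comment).strip()

def ngrams(tokens: List[str], max_n: int = 4) -> List[str]:
    """Enumerate word N-grams up to max_n tokens."""
    result = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[i:i + n]))
    return result

def algorithm_candidates(comment: str) -> List[str]:
    """Keep only N-gram terms whose last word is 'algorithm'."""
    tokens = clean_comment(comment).lower().split()
    return [t for t in ngrams(tokens) if t.endswith(" algorithm")]

# e.g. "/* use the quick sort algorithm here */" yields candidates
# such as "sort algorithm" and "quick sort algorithm"
```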
2.3 Remove Unnecessary Words and Summarize Same N-gram Terms
In this step, we recursively remove the unnecessary words shown in Table 1 from the head position, as those words cannot be part of algorithm names. If a single comment contains the term “quick sort algorithm”, we can extract both “sort algorithm” and “quick sort algorithm”. In this case we select the longest one, “quick sort algorithm”.
| Part of speech | Description |
| VERB ADP | verb followed by an adposition (subordinating conjunction or preposition) |
| ADP | adposition (subordinating conjunction or preposition) |
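The head-word removal and longest-match selection above can be sketched as follows (a minimal sketch with our own function names, covering the two POS patterns listed in Table 1):

```python
def strip_head(tokens, tags):
    """Recursively remove unnecessary head words (cf. Table 1):
    a leading verb+adposition pair, or a leading adposition."""
    while True:
        if len(tags) >= 2 and tags[0] == "VERB" and tags[1] == "ADP":
            tokens, tags = tokens[2:], tags[2:]
        elif tags and tags[0] == "ADP":
            tokens, tags = tokens[1:], tags[1:]
        else:
            return tokens, tags

def longest_terms(candidates):
    """From overlapping candidates extracted from one comment,
    keep only terms not contained in a longer candidate."""
    return [c for c in candidates
            if not any(c != o and c in o for o in candidates)]

# e.g. "based on quick sort algorithm" (VERB ADP ADJ NOUN NOUN)
# reduces to "quick sort algorithm"
```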
Because the same N-gram word term can be tagged with different parts of speech depending on the position of the words in each code comment, we summarize them by the majority part of speech. Table 2 shows “sort algorithm” from four different code comments. Three occurrences are tagged as NOUN NOUN while the last one is tagged as VERB NOUN. Because NOUN NOUN accounts for 3 out of 4 candidate parts of speech, we set the majority part of speech of “sort algorithm” to NOUN NOUN. Meanwhile, the term “blur algorithm” has no majority; we ignore such algorithm name candidates.
| N-gram term | Part of speech |
| sort algorithm | NOUN NOUN |
| sort algorithm | NOUN NOUN |
| sort algorithm | NOUN NOUN |
| sort algorithm | VERB NOUN |
| blur algorithm | NOUN NOUN |
| blur algorithm | ADJ NOUN |
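The majority summarization can be sketched as follows (a minimal sketch; `majority_pos` is our own name):

```python
from collections import Counter

def majority_pos(tag_sequences):
    """Return the majority POS pattern observed for one N-gram term
    across all comments, or None when there is no majority (a tie)."""
    counts = Counter(tag_sequences).most_common()
    (top, n_top), rest = counts[0], counts[1:]
    if rest and rest[0][1] == n_top:  # tie between top patterns
        return None
    return top

# "sort algorithm": NOUN NOUN wins 3-to-1 -> "NOUN NOUN"
# "blur algorithm": NOUN NOUN vs ADJ NOUN, 1-1 tie -> None (discarded)
```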
2.4 Creating Rules
In this step, we classify the N-gram word terms by POS tag and manually create identification rules. We focus only on reliable N-gram word terms; terms that have no majority part of speech, or that have only a single candidate, are excluded. We manually verified whether each obtained algorithm name candidate in the preliminary data is a correct algorithm name, and labeled it valid or invalid.
If the number of valid algorithm names for a part-of-speech pattern is larger than the number of invalid ones, the pattern is included in the Inclusive rule set. Otherwise, if the number of valid algorithm names is smaller than the number of invalid ones, the pattern is considered an Exclusive rule. An inclusive rule marks an unlabeled term as a valid algorithm name, while an exclusive rule marks an unlabeled term as an invalid algorithm name.
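This rule derivation could be sketched like this (our own function name and data shape; the actual rules are the manually verified ones applied in Algorithm 1):

```python
from collections import defaultdict

def derive_rules(labeled_terms):
    """Derive inclusive/exclusive POS-pattern rules from manually
    labeled candidates given as (pos_pattern, is_valid) pairs."""
    tally = defaultdict(lambda: [0, 0])  # pattern -> [valid, invalid]
    for pattern, is_valid in labeled_terms:
        tally[pattern][0 if is_valid else 1] += 1
    inclusive = {p for p, (v, i) in tally.items() if v > i}
    exclusive = {p for p, (v, i) in tally.items() if v < i}
    return inclusive, exclusive
```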
Next, we built a program to apply the rules automatically to larger data. The procedure used in the code is shown in Algorithm 1 with all the created rules. Descriptions of the parts of speech are shown in Table 3.
| Part of speech | Description |
| VERB(-ing) | verb ending with the suffix “ing” |
3 Identification Evaluation with Preliminary Data
In this section, we evaluate our method shown in Algorithm 1 with the preliminary data. One of the authors manually created an oracle of all algorithm name candidates with valid and invalid labels. All algorithm name candidates in the preliminary data are classified into valid or invalid with our method and are compared with the oracle. We present Precision, Recall, and F-measure in Table 4. Precision measures the accuracy of Algorithm 1 in correctly identifying valid algorithm names. Recall is the fraction of the valid algorithm names that are successfully retrieved. F-measure is the harmonic mean of Precision and Recall. We obtained considerably high accuracy (more than 0.7 in Precision, Recall, and F-measure).
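The three measures can be computed as follows (a standard sketch over sets of candidate names; the function name is our own):

```python
def evaluate(predicted_valid, oracle_valid):
    """Precision, recall, and F-measure of the names the rules mark
    as valid (predicted_valid) against the manually built oracle."""
    tp = len(predicted_valid & oracle_valid)  # true positives
    precision = tp / len(predicted_valid) if predicted_valid else 0.0
    recall = tp / len(oracle_valid) if oracle_valid else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```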
4 Applying to Large-scale FLOSS Data
There are some Web articles indicating that developers and students are interested in knowing important algorithms Ojha (2015); Quora (2016); Ojha (2017). Despite this interest in understanding algorithm usage in practice, there is, as far as we know, no empirical study of algorithm use in FLOSS.
Table 5 shows the top 10 results obtained after applying our rules, ordered by frequency (the numbers below the algorithm names). Of the 70 identified algorithm names, 15 were not appropriate algorithm names, such as “learning algorithm”, “following algorithm”, and “legacy algorithm”, which corresponds to a precision of 0.786 (55/70). Since we can immediately spot such inappropriate algorithm names, we exclude them from the results.
| 3 | Neighbour search | Audionormalizationsettings | SBO | Fragment parsing | Scheduling | Slow iterative | Diff |
| 8 | Binary search | Tarjans | Generation | Header canonicalization | Rate control | LLM | Clipping |
| 9 | Stemming | Matching | Nagle | Body canonicalization | Public key | Euclids | Iteration |
Table 6 shows examples of code comments including identified algorithm names, as well as programming languages, organizations, repositories, and file names. By extracting the code associated with the identified algorithm names, it seems possible to automatically collect annotated code in the future.
| Language, Organization/Repository, Filename | Example of Comment |
| C | /*Encryption Algorithm for Unicast Packet */ |
| Ruby, openshift/openshift-extras | # Instantiates one of the Transport::Kex classes (based on the negotiated kex algorithm), and uses it to exchange keys. */ |
| Java, codenameone/CodenameOne | // The reason for this method name, as opposed to getFirstStrongDir(), is that “first strong” is a commonly used description of Unicode’s estimation algorithm |
| C++ | /* This is the central step in the MD5 algorithm |
| Python, Andersh75/resurseffektivitet | # Enable Nagle’s algorithm for proxies, to avoid packet fragmentation. # We cannot know if the user has added default socket options, so we cannot replace the |
| PHP, DerDu/SPHERE-Framework | * Returns the input text encrypted using RC4 algorithm and the specified key. * RC4 is the standard encryption algorithm used in PDF format |
5 Related Work
5.1 Benefits of code comments
Several studies have analyzed the importance of code comments in maintaining software. According to Tenny (1988), a programmer’s ability to acquire important information from source code relies on the readability of the program, which is crucial in software maintenance: the more readable the code, the easier it is for developers to maintain the program. The experiment compared the effect of procedures and code comments on more than 100 software engineering students. The analysis shows that code comments written by authors improve the ability to read the program, and the effect is stronger when no procedures are provided in the source code. Conversely, procedures affect the readability of a program only slightly.
Similar work was also performed by Woodfield et al. (1981), investigating the advantage of comments in code. The authors analyzed the relationship between types of modularization, comments, and the ability of programmers to comprehend code. Several questions were asked of programmers based on four different modularization types (monolithic, functional, super, and abstract data type) of the same program, with and without comments. The results indicate that programmers found it easier to provide answers when the source code contained comments. Furthermore, the modularization version with which subjects performed best was the abstract data type.
5.2 Measurement of code comments
Hu et al. (2018) investigated the effectiveness of the code comments used by developers. They argue that comments are essential to guide programmers in understanding source code and make analyzing programs easier and faster. However, comments are frequently mismatched, outdated, or even misplaced in a project. Applying NLP to FLOSS projects, the authors propose an automatic tool to generate comments for methods written in Java. Their findings show that the proposed method surpasses existing techniques by a significant margin.
An analysis and assessment of code comment quality was also conducted by Steidl et al. (2013). The authors state that software developers rely on their understanding of source code for development and maintenance; however, their comprehension of programs depends on the quality of the code comments. Their approach, applied to FLOSS, consists of comment classification, quality model development, model assessment, and validity evaluation. The study indicates that the proposed method offers a more detailed analysis than existing techniques for the classification of code comments.
6 Threats to Validity
Threats to the construct validity exist in our rule creation approach with the preliminary data. Since the rules were created with limited data only, they may not generalize. However, given the good precision obtained on the large-scale data (Section 4), we consider the rules robust and not limited to specific projects.
Threats to the external validity exist in our data preparation. Although we analyzed a large number of repositories on GitHub, we cannot generalize our findings to industry or to open source projects in general; some open source repositories are hosted outside of GitHub, e.g., on GitLab or private servers. Further studies are required.
7 Conclusion
In this paper, we have presented a method to identify algorithm names in code comments by creating rules using N-gram IDF and part-of-speech tagging. We determine the representative part of speech of each N-gram word term by majority vote. Our evaluation shows that our method accurately identifies algorithm names in FLOSS code comments. In the future, we plan to identify associated algorithm implementation code as well as algorithm names. Using such data, we could help developers by recommending actual code implementing algorithms from active FLOSS projects.
- Gu et al. (2016) X. Gu, H. Zhang, D. Zhang, S. Kim, Deep api learning, in: Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, ACM, New York, NY, USA, 2016, pp. 631–642.
- Jiang et al. (2017) S. Jiang, A. Armaly, C. McMillan, Automatically generating commit messages from diffs using neural machine translation, in: Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, IEEE Press, Piscataway, NJ, USA, 2017, pp. 135–146.
- Oda et al. (2015) Y. Oda, H. Fudaba, G. Neubig, H. Hata, S. Sakti, T. Toda, S. Nakamura, Learning to generate pseudo-code from source code using statistical machine translation (t), in: Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), ASE ’15, IEEE Computer Society, Washington, DC, USA, 2015, pp. 574–584.
- Wong et al. (2013) E. Wong, J. Yang, L. Tan, Autocomment: Mining question and answer sites for automatic comment generation, in: Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering, ASE’13, IEEE Press, Piscataway, NJ, USA, 2013, pp. 562–567.
- Takata et al. (2018) D. Takata, A. Alhefdhi, M. Rungroj, H. Hata, H. K. Dam, T. Ishio, K. Matsumoto, Catalogen: Generating catalogs of code examples collected from oss, in: 2018 IEEE Third International Workshop on Dynamic Software Documentation (DySDoc3), pp. 11–12.
- Yin et al. (2018) P. Yin, B. Deng, E. Chen, B. Vasilescu, G. Neubig, Learning to mine aligned code and natural language pairs from stack overflow, in: Proceedings of the 15th International Conference on Mining Software Repositories, MSR ’18, ACM, New York, NY, USA, 2018, pp. 476–486.
- Smith and Lowry (1990) D. R. Smith, M. R. Lowry, Algorithm theories and design tactics, Science of Computer Programming 14 (1990) 305–321.
- Terdchanakul et al. (2017) P. Terdchanakul, H. Hata, P. Phannachitta, K. Matsumoto, Bug or not? bug report classification using n-gram idf, in: 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 534–538.
- Omran and Treude (2017) F. N. A. A. Omran, C. Treude, Choosing an nlp library for analyzing software documentation: A systematic literature review and a series of experiments, in: Proceedings of the 14th International Conference on Mining Software Repositories, MSR ’17, IEEE Press, Piscataway, NJ, USA, 2017, pp. 187–197.
- Shirakawa et al. (2015) M. Shirakawa, T. Hara, S. Nishio, N-gram idf: A global term weighting scheme based on information distance, in: Proceedings of the 24th International Conference on World Wide Web, WWW ’15, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2015, pp. 960–970.
- Shirakawa et al. (2017) M. Shirakawa, T. Hara, S. Nishio, Idf for word n-grams, ACM Trans. Inf. Syst. 36 (2017) 5:1–5:38.
- Ojha (2015) R. Ojha, Top 7 algorithms and data structures every programmer should know about, https://www.hackerearth.com/blog/algorithms/top-7-algorithms-data-structures-every-programmer-know/, 2015.
- Quora (2016) Quora, What are the top 10 algorithms every software engineer should know by heart?, https://www.quora.com/What-are-the-top-10-algorithms-every-software-engineer-should-know-by-heart, 2016.
- Ojha (2017) R. Ojha, Top 10 algorithms every software engineer should know by heart, https://www.freelancinggig.com/blog/2017/05/09/top-10-algorithms-every-software-engineer-know-heart/, 2017.
- Tenny (1988) T. Tenny, Program readability: procedures versus comments, IEEE Transactions on Software Engineering 14 (1988) 1271–1279.
- Woodfield et al. (1981) S. N. Woodfield, H. E. Dunsmore, V. Y. Shen, The effect of modularization and comments on program comprehension, in: Proceedings of the 5th International Conference on Software Engineering, ICSE ’81, IEEE Press, Piscataway, NJ, USA, 1981, pp. 215–223.
- Hu et al. (2018) X. Hu, G. Li, X. Xia, D. Lo, Z. Jin, Deep code comment generation, in: Proceedings of the 26th Conference on Program Comprehension, ICPC ’18, ACM, New York, NY, USA, 2018, pp. 200–210.
- Steidl et al. (2013) D. Steidl, B. Hummel, E. Juergens, Quality analysis of source code comments, in: 2013 21st International Conference on Program Comprehension (ICPC), pp. 83–92.