Semantic Source Code Search: A Study of the Past and a Glimpse at the Future

08/15/2019 ∙ by Muhammad Khalifa, et al.

With the recent explosion in the size and complexity of source codebases and software projects, the need for efficient source code search engines has increased dramatically. Unfortunately, existing information retrieval-based methods fail to capture the query semantics and perform well only when the query contains syntax-based keywords. Consequently, such methods will perform poorly when given high-level natural language queries. In this paper, we review existing methods for building code search engines. We also outline the open research directions and the various obstacles that stand in the way of having a universal source code search engine.




1 Introduction

Code search is a common activity in the software development process. Developers mainly perform it for the purpose of either code review or code reuse. In semantic code search, the search engine captures and understands the meaning of the query, allowing the query to be expressed in natural language rather than in technology-related terms or syntax. Such tools are especially useful for educational purposes, where the user is not necessarily familiar with the syntax of the language in which the codebase in question is written.

Previous works on code search have mostly borrowed their ideas from information retrieval (IR) systems, where code snippets are treated as a collection of documents (Lu et al., 2015) (McMillan et al., 2011) (Haiduc et al., 2013). On one hand, such methods are able to retrieve relevant results as measured by IR performance metrics. On the other hand, these methods are mainly limited to keyword search. Moreover, they have difficulty understanding the query semantics, especially when the query is expressed in high-level natural language. Another problem with IR-based systems is their sensitivity to irrelevant and noisy keywords in the query, where one misused keyword can lead to extremely different results.

Recently, the Natural Language Processing (NLP) field has witnessed great improvements on different levels of natural language understanding. This improvement has been mainly driven by the usage of deep neural models for text representation, where a text unit (a paragraph, a sentence or a word) is represented by a vector in space. Since source code is, in some sense, textual data, these methods become relevant in code search and other code-related tasks. In this paper, we highlight one system based on what is known as multi-modal embeddings (Karpathy and Fei-Fei, 2015), which is an idea borrowed from both the NLP and Computer Vision domains.

2 Literature Review

We divide the proposed methods into two major areas: information retrieval-based and deep learning-based methods. We begin our review with IR-based methods, where we review and discuss four different methodologies. Then we move to deep learning methods, where we discuss one paper in much more detail.

Information Retrieval Methods

The first system we discuss is Sourcerer (Bajracharya et al., 2006), which is an information retrieval-based code search tool that combines the textual content of a program with structural information. The structural information used includes how various code components interact with each other and which methods call which other methods. This type of information is then used in a traditional retrieval approach to improve the search results.

(McMillan et al., 2011) borrowed ideas from web search, where code functions are treated similarly to how web pages are modeled. A directed graph is created where each node represents a function, and if a function f calls another function g, then there is a directed edge from f to g. This Function Call Graph (FCG) is then analyzed using PageRank, the link analysis algorithm used mainly by the Google search engine, in order to map queries to relevant results. An advantage of this system is that instead of returning a single function as a result, it is able to return a set of related functions. Figure 1 shows the system components.

Figure 1: Portfolio System
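The idea of ranking functions by applying PageRank to a function call graph can be illustrated with a minimal sketch. This is not the Portfolio implementation: the function names in `call_graph`, the damping factor of 0.85 and the fixed iteration count are all assumptions chosen for illustration, and the plain power-iteration variant below is only one common way to compute PageRank.

```python
from typing import Dict, List


def pagerank(call_graph: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over a caller -> callees adjacency list."""
    # Collect every function that appears as a caller or as a callee.
    nodes = set(call_graph)
    for callees in call_graph.values():
        nodes.update(callees)
    ordered = sorted(nodes)
    n = len(ordered)

    rank = {f: 1.0 / n for f in ordered}
    for _ in range(iterations):
        # Every function starts the round with the "random jump" share.
        new_rank = {f: (1.0 - damping) / n for f in ordered}
        for caller in ordered:
            callees = call_graph.get(caller, [])
            if callees:
                # A caller splits its rank evenly among the functions it calls.
                share = damping * rank[caller] / len(callees)
                for callee in callees:
                    new_rank[callee] += share
            else:
                # A function that calls nothing spreads its rank uniformly.
                share = damping * rank[caller] / n
                for f in ordered:
                    new_rank[f] += share
        rank = new_rank
    return rank


# Hypothetical call graph: an edge f -> g means "f calls g".
call_graph = {
    "main": ["parse_args", "read_file"],
    "parse_args": ["open_stream"],
    "read_file": ["open_stream"],
    "open_stream": [],
}
scores = pagerank(call_graph)
ranked_functions = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph, `open_stream` is called by two functions and so accumulates the most rank, while `main`, which nothing calls, ends up last in `ranked_functions`.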

(Hill et al., 2011) proposed to improve upon basic bag-of-words IR by leveraging the contextual and semantic role of the words within the query. This was formulated in terms of a concept-based scoring function that made use of query-related information such as the location of a word within the query, its semantic role, its head distance and the frequency of the word in a candidate result, where a lower frequency implies a more important word.

Query expansion (Efthimiadis, 1996) is the technique of augmenting the query with synonyms before it is used for retrieval. Since the new version of the query is likely to have more keywords than the original one, this technique helps improve the overall system performance, especially recall. (Lu et al., 2015) proposed to employ query expansion for code search, where the synonyms are extracted from WordNet (Miller, 1995). See Figure 2.

Figure 2: Query Expansion for Code Search
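A minimal sketch of query expansion follows. To keep it self-contained, the synonyms come from a small hand-written table, `TOY_SYNONYMS`, which is a hypothetical stand-in for a WordNet lookup. The helper `keyword_overlap` is likewise an assumed, simplified matching score and not the scoring used by (Lu et al., 2015).

```python
from typing import Dict, List

# Hypothetical synonym table standing in for a WordNet lookup.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "delete": ["remove", "erase"],
    "file": ["document"],
}


def expand_query(query: str, synonyms: Dict[str, List[str]]) -> List[str]:
    """Return the query terms followed by their synonyms, without duplicates."""
    expanded: List[str] = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def keyword_overlap(terms: List[str], snippet_tokens: List[str]) -> int:
    """Count how many query terms occur among the tokens of a code snippet."""
    token_set = set(snippet_tokens)
    return sum(1 for term in terms if term in token_set)


original_terms = "delete file".split()
expanded_terms = expand_query("delete file", TOY_SYNONYMS)

# Tokens of a hypothetical method named removeDocumentFromDisk.
snippet_tokens = ["remove", "document", "from", "disk"]
score_before = keyword_overlap(original_terms, snippet_tokens)
score_after = keyword_overlap(expanded_terms, snippet_tokens)
```

Here the original query shares no keyword with the snippet, while the expanded query matches two of its tokens, which is the recall gain that query expansion aims for.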

As we have stated earlier, these methods require the query to be in the form of syntax-based keywords where the user needs to be familiar with the language or technology in question. As a result, a query expressed in natural language is likely to be misinterpreted by the system leading to irrelevant results.

Deep Learning Methods

Embedding (also known as distributed representation (Mikolov et al., 2013)) is a technique for learning vector representations of entities such as words, sentences and images in such a way that similar entities have vectors close to each other. (Gu et al., 2018) proposed to employ multi-modal embeddings (Karpathy and Fei-Fei, 2015) to map both code and description into a common space where related code snippets and code descriptions have similar vectors. Then, given a textual query, we retrieve the code snippet whose embedding is most similar to the query embedding. See Figure 4.

The model consists mainly of two Recurrent Neural Networks (RNNs) for embedding source code and text, respectively (see Figure 3). The embedding is then obtained by applying max pooling over the hidden states of the RNN. The model is trained to maximize the similarity between pairs of related code snippets and text, and to minimize it for unrelated pairs. The similarity measure used is the cosine similarity, which is expressed by the formula

$$\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}$$

where $\mathbf{a}$ and $\mathbf{b}$ are both vectors.
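The pooling step and the training objective can be sketched as follows. The hidden states are hand-written numbers rather than the output of a real RNN. The margin-based ranking loss is an assumption: it is one common way to push related pairs together and unrelated pairs apart, and the margin value is arbitrary. The similarity scores passed to `ranking_loss` are assumed to be cosine similarities computed elsewhere.

```python
from typing import List, Sequence


def max_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over the time steps of an RNN's hidden states."""
    # zip(*hidden_states) groups the values of each dimension across time.
    return [max(dimension) for dimension in zip(*hidden_states)]


def ranking_loss(sim_related: float, sim_unrelated: float,
                 margin: float = 0.05) -> float:
    """Hinge-style loss: zero once the related pair beats the unrelated
    pair by at least `margin` (the margin value is an assumption)."""
    return max(0.0, margin - sim_related + sim_unrelated)


# Hypothetical hidden states: three time steps, two dimensions each.
hidden_states = [
    [0.1, 0.9],
    [0.7, 0.2],
    [0.3, 0.4],
]
embedding = max_pool(hidden_states)

# A well-separated pair gives no loss; a badly ordered pair is penalized.
loss_good = ranking_loss(sim_related=0.9, sim_unrelated=0.1)
loss_bad = ranking_loss(sim_related=0.1, sim_unrelated=0.9)
```

Minimizing such a loss over many (code, related description, unrelated description) triples raises the similarity of related pairs while lowering it for unrelated ones.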

To use the model, all code snippets are first embedded using the code RNN. The query is then embedded using the description RNN, and the code snippets whose embeddings are most similar to the query embedding are retrieved.
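The retrieval step can be sketched as below, assuming the embeddings have already been produced. The vectors in `code_index` and `query_vec` are made-up values, not the output of a trained model, and the snippet names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_snippets(query_vec: Sequence[float],
                   code_index: Dict[str, List[float]],
                   k: int = 2) -> List[Tuple[str, float]]:
    """Return the k snippets whose embeddings are most similar to the query."""
    scored = [(name, cosine_similarity(query_vec, vec))
              for name, vec in code_index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical precomputed code embeddings (built once, offline).
code_index = {
    "read_file": [0.9, 0.1, 0.0],
    "sort_list": [0.0, 0.2, 0.9],
    "write_file": [0.7, 0.6, 0.1],
}
# Hypothetical embedding of a query such as "load the contents of a file".
query_vec = [1.0, 0.2, 0.0]

results = top_k_snippets(query_vec, code_index, k=2)
result_names = [name for name, _ in results]
```

Because the code embeddings do not depend on the query, they can be computed once and reused; only the query has to be embedded at search time.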

Figure 3: Generating embeddings with two RNNs
Figure 4: Mapping code and description into a common space

The authors compared their model against two baselines: Lucene and CodeHow (Lv et al., 2015). Although their results show the superiority of their model, two baselines are generally not enough to fully assess the retrieval performance. We also worry that there might have been some overlap between the training and test sets and that this could be the cause of the performance improvement.

Although deep learning methods have shown very promising results, such methods tend to have some drawbacks. First of all, a huge amount of labeled data is required. The authors used about 200k documented Java projects extracted from GitHub to achieve the reported performance. Secondly, training deep learning models tends to be very time-consuming, although that may not be an issue given recent advances in processing power.

3 Challenges and Future Work

There are still many open questions when it comes to improving upon existing models. For instance, the deep learning-based model discussed above required a huge amount of labeled (documented) projects to give reasonable results. Thus, such a model is likely to perform poorly when it is used with source code that comes from a domain other than the one it was trained on. This makes it difficult to use such methods with technologies or programming languages for which we cannot find a large number of existing documented projects to train the model on. This poses the question of whether we can build unsupervised or semi-supervised models that need little or no labeled data to achieve good performance.

Another area for improvement is building models that are faster to train. Since RNNs are sequence processing models, they need to process data in order. This makes them hard to parallelize, leading to time-consuming training. This calls for another research direction where parallelizable models such as Convolutional Neural Networks (Conneau et al., 2016) and attention-based models (Vaswani et al., 2017) are used in place of RNNs.

Another direction that may lead to performance improvement is to employ a multi-task learning setting, where the model is trained to simultaneously improve the performance on two tasks in the hope that improvement on one task will lead to improvement on the other. For semantic code search, a model may be trained simultaneously to perform both code embedding and code completion.

Another direction is building a universal embedder that is not limited to a single programming language but is able to handle different programming languages and technologies. Building such a model is no easy task. First of all, a huge amount of data, spanning different languages and technologies, needs to be collected for training the model. Another challenge would be equipping the model with the ability to handle semantic conflicts between different languages, where one entity would mean two different things in two languages. We conjecture that such a universal model would likely need to convert its input into an intermediate representation before embedding.

4 Conclusion

In this paper, we reviewed several proposed source code search systems. We also pointed to a specific promising model borrowed from the NLP domain that gives good retrieval performance compared to traditional methods. We concluded with the possible research directions that may be followed in the future, such as building a language-invariant universal code search engine. Finally, we invite the research community to work on both the open questions and the challenges outlined above.


  • S. Bajracharya, T. Ngo, E. Linstead, Y. Dou, P. Rigor, P. Baldi, and C. Lopes (2006) Sourcerer: a search engine for open source code supporting structure-based search. In Companion to the 21st ACM SIGPLAN symposium on Object-oriented programming systems, languages, and applications, pp. 681–682. Cited by: §2.
  • A. Conneau, H. Schwenk, L. Barrault, and Y. Lecun (2016) Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781. Cited by: §3.
  • E. N. Efthimiadis (1996) Query expansion. Annual review of information science and technology (ARIST) 31, pp. 121–87. Cited by: §2.
  • X. Gu, H. Zhang, and S. Kim (2018) Deep code search. In Proceedings of the 40th International Conference on Software Engineering, pp. 933–944. Cited by: §2.
  • S. Haiduc, G. Bavota, A. Marcus, R. Oliveto, A. De Lucia, and T. Menzies (2013) Automatic query reformulations for text retrieval in software engineering. In Proceedings of the 2013 International Conference on Software Engineering, pp. 842–851. Cited by: §1.
  • E. Hill, L. Pollock, and K. Vijay-Shanker (2011) Improving source code search with natural language phrasal representations of method signatures. In Proceedings of the 2011 26th IEEE/ACM International Conference on Automated Software Engineering, pp. 524–527. Cited by: §2.
  • A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §1, §2.
  • M. Lu, X. Sun, S. Wang, D. Lo, and Y. Duan (2015) Query expansion via wordnet for effective code search. In Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on, pp. 545–549. Cited by: §1, §2.
  • F. Lv, H. Zhang, J. Lou, S. Wang, D. Zhang, and J. Zhao (2015) Codehow: effective code search based on api understanding and extended boolean model (e). In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on, pp. 260–270. Cited by: §2.
  • C. McMillan, M. Grechanik, D. Poshyvanyk, Q. Xie, and C. Fu (2011) Portfolio: finding relevant functions and their usage. In Proceedings of the 33rd International Conference on Software Engineering, pp. 111–120. Cited by: §1, §2.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §2.
  • G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §3.