Sosed: a tool for finding similar software projects

07/06/2020
by   Egor Bogomolov, et al.
JetBrains

In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute embeddings of sub-tokens into a dense space for 120,000 GitHub repositories in 200 languages. Then, we cluster the embeddings to identify groups of semantically similar sub-tokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their sub-tokens. The tool receives an arbitrary project as input, extracts sub-tokens in the 16 most popular programming languages, computes the cluster distribution, and finds the projects with the closest distributions in the search base. We labeled the sub-token clusters with short descriptions to enable Sosed to produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of sub-tokens is available separately at https://github.com/JetBrains-Research/buckwheat/.




1. Introduction

Identification of similar projects in a large set of open-source repositories can help in several software engineering tasks: rapid prototyping, program understanding, and plagiarism detection (Mens et al., 2014). Additionally, it requires the development of new approaches to understanding the meaning behind code and representing software projects at a large scale. In turn, if the developed methods can detect similar projects, they might also be applied to other software engineering tasks.

While popular search engines provide an option to search for web pages or images similar to the input, there is no common approach to finding similar software projects. Prior work on similar project detection leveraged several sources of data: Java API calls (McMillan et al., 2012), contents of README files (Zhang et al., 2017), user reactions in the form of GitHub stars (Zhang et al., 2017), and tags on SourceForge (Thung et al., 2012).

Recently, several papers proposed splitting code tokens into sub-tokens to improve results in method name prediction (Alon et al., 2018), variable misuse identification (Hellendoorn et al., 2020), and source code topic modeling (Markovtsev and Kant, 2017). Following these advances, we suggest a novel approach to representing arbitrary fragments of code based on sub-token embeddings, i.e., their numerical representations in a dense space. We train the sub-token embeddings with fastText (Bojanowski et al., 2017), an algorithm for training word embeddings that takes into account both words and their subparts.

As prior work demonstrated, words with similar embeddings tend to be semantically related (Schnabel et al., 2015). We retrieve groups of related sub-tokens by clustering their embeddings with the spherical K-means algorithm (Hornik et al., 2012), a modification of regular K-means (Lloyd, 1982) that works with cosine distance. These clusters represent topics that occur in a large corpus of source code. We represent code as a distribution of clusters among its sub-tokens.

We implemented the suggested code representation in a tool for detecting similar projects called Sosed. We define the similarity of projects as the similarity of the corresponding cluster distributions. To measure it, we suggest using either KL-divergence (Kullback and Leibler, 1951) or cosine similarity of the distribution vectors.

Sosed identifies similar projects based solely on their codebase and supports the 16 most popular languages. It does not use collaboration data (e.g., GitHub stars) to avoid popularity bias. Currently, Sosed searches for similar repositories across 9 million repositories that comprise all unique public projects on GitHub as of the end of 2016. In the future, we plan to update the dataset to an up-to-date snapshot of GitHub.

An important feature of Sosed is the explainability of its output. We manually labeled the sub-token clusters with short descriptions of their topics. For each query result, we can provide descriptions of topics that contributed the most to the similarity measure.

The main contribution of our work is Sosed, an open-source tool for finding similar repositories based on a novel code representation. Sosed provides explainable output, supports 16 programming languages, and searches across millions of reference projects.

The tool is available on GitHub (12). The part of Sosed used for sub-token extraction and language identification is also available as a standalone tool (11).

2. Background

Previous work on detecting similar repositories leveraged several sources of data. McMillan et al. (McMillan et al., 2012) suggested CLAN, a Java-specific approach that detects similar Java applications by analyzing their API calls. The authors applied Latent Semantic Indexing (Deerwester et al., 1990) to an occurrence matrix, where columns represent projects and rows represent API calls. They obtained vector representations of Java applications and defined the similarity of two projects as the cosine similarity of the corresponding vectors.

Aside from analyzing the code, several approaches to similarity search used data specific to code hosting platforms (e.g., SourceForge (Thung et al., 2012) or GitHub (Zhang et al., 2017)). Thung et al. (Thung et al., 2012) used SourceForge's tag system to define project similarity. Tags are short descriptions of project characteristics: category, language, user interface, and so on. Since some tags are more descriptive than others, the authors proposed assigning a weight to each tag. Then, they computed the similarity of two projects from their tag sets and the sets' intersection. Zhang et al. (Zhang et al., 2017) measured the similarity of projects hosted on GitHub based on stars given by the same user within a short period of time and on the contents of the projects' README files.

The problem of detecting similar applications is also actively researched in the domain of mobile apps (Chen et al., 2015; Linares-Vásquez et al., 2016; Li et al., 2017; Gonzalez et al., 2014). The main difference from open source software projects is the data associated with each app. For apps in app stores, source code is often not openly accessible, but there are multiple other kinds of data available: description, images, permissions, user reviews, download size.

Another method related to measuring similarity of projects is topic modeling on code. The goal of topic modeling is to automatically detect topics in a corpus of unlabeled data, e.g., software projects. The output of a topic modeling algorithm is a set of topics, and a distribution of topics in each item from the corpus. A topic is usually represented by a group of reference words or labels that are most frequent across data comprising the topic. According to the survey by Sun et al. (Sun et al., 2016), the most popular approach to topic modeling in software engineering is LDA (Blei et al., 2003). It treats source code as a bag of tokens, such as variable names, function names, and other identifiers. Markovtsev et al. (Markovtsev and Kant, 2017) used ARTM (Vorontsov and Potapenko, 2015), an algorithm similar to LDA, to identify topics across 9 million GitHub projects, which makes it, to the best of our knowledge, the largest study of topic modeling on source code.

3. Description of the tool

In this work, we present Sosed, a tool for finding similar software projects based on a novel representation of code.

Outline of Sosed's internals. Figure 1 provides an overview of Sosed's internals. To find similar projects, we need to define a search space, represent projects in a way suitable for searching, and set up a similarity measure.

As for the search space, we use the dataset of 9 million GitHub repositories collected by Markovtsev et al. (Markovtsev and Kant, 2017). To the best of our knowledge, it is the largest deduplicated dataset of software projects, which makes it suitable for our task out of the box.

As a preprocessing step, we transform projects into numerical vectors. Firstly, we train embeddings of sub-tokens on a large corpus of code (Markovtsev, 2017) with fastText (Bojanowski et al., 2017). Secondly, we find clusters of sub-tokens with the spherical K-means algorithm (Hornik et al., 2012), where the number of clusters K is a manually selected parameter. Finally, for each repository, we compute the distribution of clusters among its sub-tokens. The distribution for a project is a K-dimensional vector, where the i-th component is the probability of cluster i appearing among the project's sub-tokens.

We implement two methods for measuring the similarity of projects: explicitly computing the KL-divergence (Kullback and Leibler, 1951) of their cluster distributions (a measure of how much one distribution diverges from another), or computing the cosine similarity of the distribution vectors. In both cases, we use the Faiss (Johnson et al., 2017) library to find the closest distributions.

In the rest of this section, we describe the parts of the tool in more detail.

Figure 1. Overview of the algorithm to compute projects’ similarity

Reference projects.

For each repository, the dataset introduced by Markovtsev et al. (Markovtsev and Kant, 2017) contains the set of all sub-tokens found in the project. We describe the process of extracting sub-tokens later in this section.

The dataset is already cleaned of both explicit and implicit forks (i.e., copies of other projects that are not marked as forks on GitHub by their authors). It contains all the GitHub projects as of the end of 2016. Even though the projects in the dataset are not up to date, it allows us to implement the search over a vast number of projects. In the future, we plan to create an up-to-date version of the dataset.

Training sub-token embeddings. For training sub-token embeddings, we use a dataset of identifiers extracted from 120,000 GitHub repositories (Markovtsev, 2017). It contains sequences of sub-tokens from files in approximately 200 programming languages.

We use fastText (Bojanowski et al., 2017) to compute embeddings of sub-tokens into a 100-dimensional space. Alongside the embeddings of input words, fastText also computes embeddings of the encountered n-grams. This is helpful in the source code domain, because even at the sub-token level there are highly repetitive n-grams. Another important feature of fastText is its ability to compute embeddings for out-of-vocabulary (OOV) tokens: sub-tokens of reference projects not encountered in the corpus used for training the embeddings. We computed embeddings for the OOV sub-tokens with the trained fastText model, which gave us a set of 40 million known sub-tokens.
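The subword mechanism behind the OOV embeddings can be illustrated with a toy numpy sketch: a sub-token's vector is assembled from the vectors of its hashed character n-grams, which is the core of fastText's subword model. The hash function (CRC32 here), the bucket count, and the plain averaging are simplifications for illustration, not the actual fastText internals.

```python
import zlib
import numpy as np

# Toy illustration (not the real fastText internals) of embedding an
# out-of-vocabulary sub-token: average the vectors of its hashed character
# n-grams. Real fastText uses its own hashing and also adds word vectors.
DIM, BUCKETS = 100, 50_000  # bucket count shrunk for the sketch
rng = np.random.default_rng(0)
ngram_table = rng.standard_normal((BUCKETS, DIM)).astype(np.float32)

def char_ngrams(token, n_min=3, n_max=6):
    padded = f"<{token}>"  # fastText pads words with boundary markers
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]

def oov_vector(token):
    # map each n-gram to a table row via a hash, then average the rows
    rows = [zlib.crc32(g.encode()) % BUCKETS for g in char_ngrams(token)]
    return ngram_table[rows].mean(axis=0)

vec = oov_vector("subtokenizer")  # a 100-dimensional vector
```

Because morphologically close sub-tokens share most of their n-grams, their vectors end up close even when neither token was seen during training.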

Extracting sub-tokens from repositories. The part of this work used for sub-token extraction and language identification might be useful for other tasks as well. To share it with the community and facilitate its reuse, we make it available as a separate project (11). The input of the sub-token extractor is a list of either links to GitHub repositories or paths to local directories. The output is a list of all extracted sub-tokens and their counts for each project.

In the first step of tokenization, we use enry (7) to recognize the languages of the files in each project. enry is a Go-based language detection tool that employs several strategies to determine the language of a given file, including its name, extension, and content. enry supports 382 languages, is fast, and does not require a Git repository to work, meaning that the input project can be any collection of files.

When run on a directory, enry outputs a JSON file with the recognized languages as keys and lists of files as values. Using these keys, we keep only the languages that we are interested in. Based on statistics on programming language popularity (26), we currently support 16 languages, namely: C, C#, C++, Go, Haskell, Java, JavaScript, Kotlin, PHP, Python, Ruby, Rust, Scala, Shell, Swift, and TypeScript.
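The filtering step can be sketched as follows; the function name and the JSON sample are illustrative, not the tool's actual code:

```python
import json

# Illustrative sketch: keep only the files whose language, as recognized
# by enry, is among the 16 supported ones.
SUPPORTED = {
    "C", "C#", "C++", "Go", "Haskell", "Java", "JavaScript", "Kotlin",
    "PHP", "Python", "Ruby", "Rust", "Scala", "Shell", "Swift", "TypeScript",
}

def filter_enry_output(enry_json):
    by_language = json.loads(enry_json)  # {language: [files]}
    return {lang: files for lang, files in by_language.items()
            if lang in SUPPORTED}

sample = '{"Python": ["setup.py"], "Markdown": ["README.md"], "Go": ["cmd/run.go"]}'
kept = filter_enry_output(sample)  # keeps Python and Go, drops Markdown
```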

The next step of tokenization is extraction of identifiers. Since we are only interested in identifiers and names, we need to iterate over all the tokens in the file and gather only those that belong to specific types (excluding literals, comments, etc.). To do that, we employ two different tools. 12 out of 16 languages (including 10 most popular ones) are passed on to Tree-sitter (30), a fast parsing tool that uses language-specific grammars to parse a given file into an abstract syntax tree (AST). We then filter the AST leaves to obtain various kinds of identifiers, names, constants, etc.

The four remaining languages (Scala, Swift, Kotlin, and Haskell) either do not have a Tree-sitter grammar at the time of writing or the grammar is in development. The files in these languages are passed on to Pygments (23) lexers. A Pygments lexer splits the code into tokens, each of which also has a certain type. From the list of tokens, we extract those that are of interest to us: this includes the token.Name type by default, but for some languages it also makes sense to gather other types.

The last step of tokenization is splitting each token into sub-tokens. Following Markovtsev et al. (Markovtsev and Kant, 2017), we split the tokens by camel case and snake case, append short sub-tokens (less than three characters) to the adjacent longer ones, and stem sub-tokens longer than 6 characters using the Snowball stemmer (Porter, 2001).
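The splitting rules above can be sketched in a few lines of Python. Two simplifications are assumed: naive truncation stands in for the Snowball stemmer used by the real tool, and a short sub-token is glued to the preceding one (one reading of "adjacent"):

```python
import re

# Sketch of the sub-tokenization rules: split by snake_case and camelCase,
# glue sub-tokens shorter than 3 characters to a neighbor, and "stem"
# sub-tokens longer than 6 characters (truncation stands in for Snowball).

def split_token(token):
    parts = []
    for chunk in token.split("_"):  # snake_case first
        # then camelCase words, acronyms (HTTP), and digit runs
        parts += [m.lower() for m in
                  re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+", chunk)]
    merged = []
    for part in parts:
        if len(part) < 3 and merged:
            merged[-1] += part       # glue short sub-tokens to the previous one
        else:
            merged.append(part)
    return [p[:6] if len(p) > 6 else p for p in merged]

subtokens = split_token("getHTTPResponseCode_v2")
# -> ['get', 'http', 'respon', 'codev2']
```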

For a given project, we carry out identifier extraction and subtokenization for all files written in the supported languages and accumulate the results: in the end, the repository is represented as a dictionary with sub-tokens as keys and their counts as values.

Clustering sub-token embeddings. We use the spherical K-means algorithm (Hornik et al., 2012) to find clusters of similar sub-tokens. The algorithm is similar to the regular K-means (Lloyd, 1982), but it works with cosine distance instead of the Euclidean distance. Since we work with millions of high-dimensional vectors and cosine distance, other approaches like DBSCAN (Ester et al., 1996) turn out to be too computationally expensive.
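A minimal numpy sketch of spherical K-means captures the difference from Lloyd's algorithm: points and centroids are kept on the unit sphere, and assignment maximizes cosine similarity. This is an illustration only; the real tool uses the implementation of Hornik et al.

```python
import numpy as np

# Spherical K-means sketch: normalize the data, assign each point to the
# centroid with the highest cosine similarity, and re-normalize the mean
# of each cluster to get the new centroid.

def spherical_kmeans(X, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)      # project to sphere
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = (X @ centers.T).argmax(axis=1)           # cosine assignment
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centers[j] = c / np.linalg.norm(c)        # re-normalized mean
    return labels, centers

# Two well-separated directions are recovered as two clusters.
X = np.array([[1, 0.01], [1, -0.02], [0.01, 1], [-0.03, 1]], dtype=float)
labels, centers = spherical_kmeans(X, k=2)
```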

Spherical K-means requires choosing the number of clusters K beforehand. We estimated the optimal number of clusters with the gap statistic (Tibshirani et al., 2001), a technique based on comparing the distribution of inner-cluster distances with a uniform distribution. It did not show any significant difference for numbers of clusters above 256, so we decided to set K to 256 to reduce the dimensionality of the project representations at the next step.

Clusters represent groups of semantically similar sub-tokens. They can be seen as topics at the sub-token level. As in topic modeling, a topic can be guessed from a set of representatives. In our case, the representatives are the most frequent sub-tokens in the cluster and the sub-tokens closest to the cluster center. To surface this information and make Sosed's output explainable, we manually labeled the clusters with short descriptions by looking both at the representatives and at the projects where they are frequently used.

Project representations. From the previous step, we get a mapping from sub-tokens to clusters. Then, we compute the distribution of clusters among sub-tokens in each project. For each repository, we get a K-dimensional vector, where the coordinate along the i-th dimension equals the probability of cluster i appearing among the project's sub-tokens.
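Building such a representation can be sketched as follows; the names, the toy K, and the sample counts are illustrative, not taken from the tool:

```python
import numpy as np

# Sketch: from a repository's sub-token counts and a sub-token -> cluster
# mapping, produce the K-dimensional cluster distribution.
K = 4  # the real tool uses K = 256

def cluster_distribution(subtoken_counts, subtoken_to_cluster):
    vec = np.zeros(K)
    for subtoken, count in subtoken_counts.items():
        cluster = subtoken_to_cluster.get(subtoken)
        if cluster is not None:      # unknown sub-tokens are skipped
            vec[cluster] += count
    return vec / vec.sum()           # normalize counts to probabilities

counts = {"tensor": 6, "layer": 3, "http": 1}
mapping = {"tensor": 0, "layer": 0, "http": 2}
dist = cluster_distribution(counts, mapping)  # [0.9, 0.0, 0.1, 0.0]
```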

We applied the described technique to compute representations of the 9 million repositories from the dataset of Markovtsev et al. (Markovtsev and Kant, 2017), which includes all unique projects (excluding both explicit and implicit forks) on GitHub as of the end of 2016. This large set of projects forms Sosed's search space.

Searching for similar repositories. To find repositories similar to a given one, we first compute its cluster distribution. We tokenize the project as previously described and then collect the pre-computed cluster indices for the sub-tokens encountered in the reference projects. We do not compute embeddings for OOV sub-tokens in new projects for two reasons. Firstly, their number is small, because the reference projects already contain 40 million different sub-tokens. Secondly, OOV sub-tokens may refer to libraries and technologies that emerged after the reference dataset had been collected, i.e., after the end of 2016. In such cases, the embeddings would not reflect the underlying semantics of the sub-tokens.

We implement two methods to compare cluster distributions between query projects and reference projects: direct computation of the KL-divergence (Kullback and Leibler, 1951) between the two distributions, and cosine similarity of the distribution vectors. Cosine similarity equals the inner product of the normalized distribution vectors. KL-divergence can be expressed by the following formula:

$$D_{KL}(Q \parallel R) = \sum_{i=1}^{K} q_i \log \frac{q_i}{r_i},$$

where $Q = (q_1, \dots, q_K)$ and $R = (r_1, \dots, r_K)$ are the cluster distributions of a query and a reference project, respectively. Finding the reference project that minimizes the KL-divergence for a given query project is equivalent to maximizing the following function:

$$f(Q, R) = \sum_{i=1}^{K} q_i \log r_i.$$

This function is the inner product of the cluster distribution $Q$ and the point-wise logarithm of the distribution $R$. Thus, both for KL-divergence and for cosine similarity, the search for similar projects reduces to maximizing an inner product between two vectors.
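This equivalence is easy to check numerically. In the sketch below, random Dirichlet-distributed vectors stand in for cluster distributions (all sizes are illustrative); the reference minimizing KL-divergence is exactly the one maximizing the inner product with the point-wise logarithm:

```python
import numpy as np

# Numerical check of the reduction: ranking references by minimal
# KL-divergence from a query Q equals ranking them by the inner product
# of Q with the point-wise log of each reference distribution.
rng = np.random.default_rng(42)
Q = rng.dirichlet(np.ones(256))                  # query distribution
refs = rng.dirichlet(np.ones(256), size=100)     # reference distributions

kl = (Q * np.log(Q / refs)).sum(axis=1)          # D_KL(Q || R) per reference
inner = np.log(refs) @ Q                         # <Q, log R> per reference

best_by_kl = kl.argmin()
best_by_inner = inner.argmax()                   # same project wins
```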

We use the Faiss (Johnson et al., 2017) library to find the vectors yielding the maximal inner product. Faiss transforms the reference vectors into an indexing structure that can then be queried. The indexing structure used in our work does not introduce a significant memory overhead, which allows us to use it with a large search space.
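The semantics of this search (a flat maximum inner-product index) can be reproduced with a brute-force numpy stand-in; the data below is synthetic, and Faiss exists precisely to avoid this brute force at the scale of millions of vectors:

```python
import numpy as np

# Brute-force stand-in for maximum inner-product search: rank reference
# vectors by their inner product with a query and return the top n.

def search_top_n(reference, query, n=5):
    scores = reference @ query        # one inner product per reference vector
    top = np.argsort(-scores)[:n]     # indices of the n largest scores
    return top, scores[top]

rng = np.random.default_rng(1)
refs = rng.dirichlet(np.ones(8), size=1000)          # stand-in distributions
refs /= np.linalg.norm(refs, axis=1, keepdims=True)  # normalize: cosine mode
query = refs[123] + rng.normal(0, 1e-4, size=8)      # near-duplicate of #123
query /= np.linalg.norm(query)

top, scores = search_top_n(refs, query)  # top[0] should be index 123
```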

To enable the tool to explain project similarity, we find the sub-token clusters corresponding to the terms that contributed the most to the vectors' inner product. In the tool's output, we display their contributions alongside the manually assigned labels and sub-tokens from these clusters.

4. Evaluation

To the best of our knowledge, the only approach used in previous work (McMillan et al., 2012; Thung et al., 2012; Zhang et al., 2017) to evaluate algorithms for finding similar projects is conducting a survey of developers.

Since Sosed works with projects in 16 programming languages, thoroughly evaluating its performance without diving deep into the specific ecosystems is challenging. We plan to conduct a survey of a large group of programmers with diverse expertise so that its results are reliable.

For now, we evaluated Sosed's output on a set of 94 GitHub projects that comprises top-starred repositories in different languages. The results are available on our GitHub page (12). For example, the top-5 projects most similar to TensorFlow (https://github.com/tensorflow/tensorflow/) are deep learning and machine learning frameworks. For Bitcoin (https://github.com/bitcoin/bitcoin/), Sosed detected other open-source cryptocurrencies. Among the projects similar to Python (https://github.com/python/cpython/), we found Brython (https://github.com/brython-dev/brython/), a Python implementation running in the browser.

5. Conclusion

Finding similar software projects in a large set of repositories can benefit practical software engineering tasks like rapid prototyping and program understanding. Aside from that, it requires the development of new methods for representing source code, which can find application in other software-related tasks.

We created a novel approach to represent code based on the topic distribution among its sub-tokens. We implemented it as a tool for finding similar software repositories called Sosed. The main features of Sosed are explainability of its output, support of 16 programming languages, and independence of project popularity. Sosed is available on GitHub (12; 11).

For now, Sosed searches among a set of 9 million GitHub projects. While this is a large amount of data, the open-source community has grown rapidly in recent years (27). To catch up with the growth of the open-source ecosystem, we plan to collect a new dataset containing an up-to-date set of GitHub projects.

Implementing novel ideas as open-source tools has several benefits. This way, we can quickly evaluate a method's performance, check its practical applicability, and gather feedback from the tool's users. We encourage others to create open-source software based on newly developed methods in order to speed up communication and evolution in the research community.

References

  • U. Alon, O. Levy, and E. Yahav (2018) Code2seq: generating sequences from structured representations of code. CoRR abs/1808.01400. External Links: Link, 1808.01400 Cited by: §1.
  • D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent dirichlet allocation. J. Mach. Learn. Res. 3 (null), pp. 993–1022. External Links: ISSN 1532-4435 Cited by: §2.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146. External Links: ISSN 2307-387X Cited by: §1, §3, §3.
  • N. Chen, S. Hoi, S. Li, and X. Xiao (2015) SimApp: a framework for detecting similar mobile applications by online kernel learning. pp. . External Links: Document Cited by: §2.
  • S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman (1990) Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 41, pp. 391–407. Cited by: §2.
  • M. Ester, H. Kriegel, J. Sander, and X. Xu (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD’96, pp. 226–231. Cited by: §3.
  • [7] (2020)(Website) External Links: Link Cited by: §3.
  • H. Gonzalez, N. Stakhanova, and A. Ghorbani (2014) DroidKin: lightweight detection of android apps similarity. Vol. 152, pp. . External Links: Document Cited by: §2.
  • V. J. Hellendoorn, C. Sutton, R. Singh, P. Maniatis, and D. Bieber (2020) Global relational models of source code. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • K. Hornik, I. Feinerer, M. Kober, and C. Buchta (2012) Spherical k-means clustering. Journal of Statistical Software 50, pp. 1–22. External Links: Document Cited by: §1, §3, §3.
  • [11] (2020)(Website) External Links: Link Cited by: §1, §3, §5.
  • [12] (2020)(Website) External Links: Link Cited by: §1, §4, §5.
  • J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Cited by: §3, §3.
  • S. Kullback and R. A. Leibler (1951) On information and sufficiency. Ann. Math. Statist. 22 (1), pp. 79–86. External Links: Document, Link Cited by: §1, §3, §3.
  • L. Li, T. F. Bissyandé, and J. Klein (2017) SimiDroid: identifying and explaining similarities in android apps. In 2017 IEEE Trustcom/BigDataSE/ICESS, Vol. , pp. 136–143. Cited by: §2.
  • M. Linares-Vásquez, A. Holtzhauer, and D. Poshyvanyk (2016) On automatically detecting similar android apps. In 2016 IEEE 24th International Conference on Program Comprehension (ICPC), Vol. , pp. 1–10. Cited by: §2.
  • S. Lloyd (1982) Least squares quantization in PCM. IEEE Transactions on Information Theory 28 (2), pp. 129–137. External Links: Document, Link Cited by: §1, §3.
  • V. Markovtsev and E. Kant (2017) Topic modeling of public repositories at scale using names in source code. arXiv preprint arXiv:1704.00135. Cited by: §1, §2, §3, §3, §3, §3.
  • V. Markovtsev (2017) GitHub word2vec 120k. Note: https://data.world/vmarkovtsev/github-word-2-vec-120-k Cited by: §3, §3.
  • C. McMillan, M. Grechanik, and D. Poshyvanyk (2012) Detecting similar software applications. In Proceedings of the 34th International Conference on Software Engineering, ICSE ’12, pp. 364–374. External Links: ISBN 9781467310673 Cited by: §1, §2, §4.
  • T. Mens, A. Serebrenik, and A. Cleve (2014) Evolving software systems. Springer Publishing Company, Incorporated. External Links: ISBN 364245397X Cited by: §1.
  • M. F. Porter (2001) Snowball: a language for stemming algorithms. Cited by: §3.
  • [23] (2020)(Website) External Links: Link Cited by: §3.
  • T. Schnabel, I. Labutov, D. Mimno, and T. Joachims (2015) Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 298–307. External Links: Link, Document Cited by: §1.
  • X. Sun, X. Liu, L. Bin, Y. Duan, H. Yang, and J. Hu (2016) Exploring topic models in software engineering data analysis: a survey. pp. 357–362. External Links: Document Cited by: §2.
  • [26] (2020)(Website) External Links: Link Cited by: §3.
  • [27] (2019)(Website) External Links: Link Cited by: §5.
  • F. Thung, D. Lo, and L. Jiang (2012) Detecting similar applications with collaborative tagging. In 2012 28th IEEE International Conference on Software Maintenance (ICSM), Vol. , pp. 600–603. Cited by: §1, §2, §4.
  • R. Tibshirani, G. Walther, and T. Hastie (2001) Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 63 (2), pp. 411–423. External Links: Document, Link, https://rss.onlinelibrary.wiley.com/doi/pdf/10.1111/1467-9868.00293 Cited by: §3.
  • [30] (2020)(Website) External Links: Link Cited by: §3.
  • K. Vorontsov and A. Potapenko (2015) Additive regularization of topic models. Machine Learning 101 (1), pp. 303–323. External Links: ISSN 1573-0565, Document, Link Cited by: §2.
  • Y. Zhang, D. Lo, P. S. Kochhar, X. Xia, Q. Li, and J. Sun (2017) Detecting similar repositories on github. pp. 13–23. External Links: Document Cited by: §1, §2, §4.