Sosed: a tool for finding similar software projects

07/06/2020
by   Egor Bogomolov, et al.

In this paper, we present Sosed, a tool for discovering similar software projects. We use fastText to compute embeddings of sub-tokens into a dense space for 120,000 GitHub repositories in 200 languages. Then, we cluster the embeddings to identify groups of semantically similar sub-tokens that reflect topics in source code. We use a dataset of 9 million GitHub projects as a reference search base. To identify similar projects, we compare the distributions of clusters among their sub-tokens. The tool receives an arbitrary project as input, extracts sub-tokens from code written in the 16 most popular programming languages, computes the project's cluster distribution, and finds the projects with the closest distributions in the search base. We labeled the sub-token clusters with short descriptions so that Sosed can produce interpretable output. Sosed is available at https://github.com/JetBrains-Research/sosed/. The tool demo is available at https://www.youtube.com/watch?v=LYLkztCGRt8. The multi-language extractor of sub-tokens is available separately at https://github.com/JetBrains-Research/buckwheat/.
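The search step described above (sub-tokens, then cluster distribution, then closest projects) can be illustrated with a minimal sketch. This is not Sosed's code. The function names, the regex-based identifier splitting, the hand-written `cluster_of` mapping (standing in for clusters obtained from fastText embeddings), and the choice of cosine similarity as the closeness measure are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def split_subtokens(identifier):
    """Split an identifier on underscores and camelCase boundaries (hypothetical helper)."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", identifier)
    return [p.lower() for p in parts]


def cluster_distribution(identifiers, cluster_of, n_clusters):
    """Return the normalized histogram of cluster ids over a project's sub-tokens.

    Sub-tokens missing from `cluster_of` are ignored. `cluster_of` is assumed
    to map a sub-token to the id of its cluster.
    """
    counts = Counter()
    for identifier in identifiers:
        for sub in split_subtokens(identifier):
            if sub in cluster_of:
                counts[cluster_of[sub]] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_clusters
    return [counts[c] / total for c in range(n_clusters)]


def cosine_similarity(p, q):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def most_similar(query, search_base, k=5):
    """Rank projects in `search_base` (name -> distribution) by similarity to `query`."""
    scored = [(name, cosine_similarity(query, dist)) for name, dist in search_base.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Toy data: two clusters, 0 for "web" and 1 for "numeric" (invented for this sketch).
    cluster_of = {"http": 0, "request": 0, "server": 0,
                  "matrix": 1, "tensor": 1, "gradient": 1}
    base = {
        "web-app": cluster_distribution(["handleHttpRequest", "start_server"], cluster_of, 2),
        "ml-lib": cluster_distribution(["matrixMultiply", "tensor_gradient"], cluster_of, 2),
    }
    query = cluster_distribution(["HttpServer", "sendRequest"], cluster_of, 2)
    print(most_similar(query, base, k=2))
```

In this sketch each project is reduced to a fixed-length vector, so comparing a query against a large search base only needs one vector comparison per project. Other closeness measures over distributions could be swapped into `most_similar` without changing the rest.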


