Topical: Learning Repository Embeddings from Source Code using Attention

08/19/2022
by Agathe Lherondelle, et al.

Machine learning on source code (MLOnCode) promises to transform how software is delivered. By mining the context and relationships between software artefacts, MLOnCode augments software developers' capabilities with code auto-generation, code recommendation, code auto-tagging and other data-driven enhancements. For many of these tasks a script-level representation of code is sufficient. In many other cases, however, a repository-level representation that takes into account the various dependencies and the repository structure is imperative, for example when auto-tagging repositories with topics or auto-documenting repository code. Existing methods for computing repository-level representations suffer from (a) reliance on natural language documentation of code (for example, README files) and (b) naive aggregation of method- or script-level representations, for example by concatenation or averaging. This paper introduces Topical, a deep neural network that generates repository-level embeddings of publicly available GitHub code repositories directly from source code. Topical incorporates an attention mechanism that projects the source code, the full dependency graph and the script-level textual information into a dense repository-level representation. To compute the repository-level representations, Topical is trained to predict the topics associated with a repository, using a dataset of publicly available GitHub repositories that were crawled along with their ground-truth topic tags. Our experiments show that, on the task of repository auto-tagging, the embeddings computed by Topical outperform multiple baselines, including baselines that naively combine method-level representations through averaging or concatenation.
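To make the contrast between naive averaging and attention-based aggregation concrete, the following is a minimal sketch and not the authors' architecture. It assumes each script in a repository has already been encoded as a fixed-size vector, and it uses a single hand-set query vector where a trained model would learn one. The function names (`mean_pool`, `attention_pool`, `topic_probabilities`) and the per-topic sigmoid head are illustrative assumptions; the abstract does not specify these details.

```python
import math
from typing import List, Tuple

Vector = List[float]


def mean_pool(script_embeddings: List[Vector]) -> Vector:
    """Baseline: average the script-level vectors, weighting every script equally."""
    n = len(script_embeddings)
    dim = len(script_embeddings[0])
    return [sum(e[i] for e in script_embeddings) / n for i in range(dim)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def attention_pool(script_embeddings: List[Vector], query: Vector) -> Tuple[Vector, List[float]]:
    """Weight each script by its scaled dot-product score against a query vector.

    The query is an assumption of this sketch: in a trained model it would be
    a learned parameter, here it is supplied by the caller.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, e)) / math.sqrt(dim)
        for e in script_embeddings
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * e[i] for w, e in zip(weights, script_embeddings))
        for i in range(dim)
    ]
    return pooled, weights


def topic_probabilities(repo_embedding: Vector, topic_weights: List[Vector]) -> List[float]:
    """Hypothetical multi-label head: one independent sigmoid per topic."""
    logits = [sum(w * x for w, x in zip(row, repo_embedding)) for row in topic_weights]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


if __name__ == "__main__":
    # Three toy script embeddings for one repository (made-up values).
    scripts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    repo_vec, weights = attention_pool(scripts, query=[5.0, 0.0])
    print("attention weights:", weights)
    print("repository embedding:", repo_vec)
    print("mean-pooled baseline:", mean_pool(scripts))
    print("topic probabilities:", topic_probabilities(repo_vec, [[1.0, -1.0], [0.5, 0.5]]))
```

With an all-zero query every script receives the same weight, so the attention pooling reduces to the averaging baseline. A non-zero query lets some scripts contribute more than others to the repository vector, which is the property that separates attention from the naive aggregation the abstract criticises.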

research
10/18/2020

Topic Recommendation for Software Repositories using Multi-label Classification Algorithms

Many platforms exploit collaborative tagging to provide their users with...
research
06/20/2018

A Large-Scale Study on Source Code Reviewer Recommendation

Context: Software code reviews are an important part of the development ...
research
05/31/2022

Semantically-enhanced Topic Recommendation System for Software Projects

Software-related platforms have enabled their users to collaboratively l...
research
07/11/2021

Repo2Vec: A Comprehensive Embedding Approach for Determining Repository Similarity

How can we identify similar repositories and clusters among a large onli...
research
06/26/2022

Repository-Level Prompt Generation for Large Language Models of Code

With the success of large language models (LLMs) of code and their use a...
research
01/28/2021

Peptipedia: a comprehensive database for peptide research supported by Assembled predictive models and Data Mining approaches

Motivation: Peptides have attracted the attention in this century due to...
research
06/05/2023

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Large Language Models (LLMs) have greatly advanced code auto-completion ...
