Topic modeling of public repositories at scale using names in source code

04/01/2017 ∙ by Vadim Markovtsev, et al. ∙ source{d}

Programming languages themselves have a limited number of reserved keywords and character-based tokens that define the language specification. However, programmers make rich use of natural language within their code through comments, text literals and naming entities. The programmer-defined names found in source code are a rich source of information for building a high-level understanding of a project. The goal of this paper is to apply topic modeling to names used in over 13.6 million repositories and interpret the inferred topics. One of the problems in such a study is the occurrence of duplicate repositories not officially marked as forks (obscure forks). We show how to address it using the same identifiers which are extracted for topic modeling. We open with a discussion of naming in source code, then elaborate on our approach to removing exact and fuzzy duplicate repositories using Locality Sensitive Hashing on the bag-of-words model, then discuss our work on topic modeling, and finally present the results of our data analysis together with open access to the source code, tools and datasets.




I Introduction

There are more than 18 million non-empty public repositories on GitHub which are not marked as forks. This makes GitHub the largest version control repository hosting service. It has become difficult to explore such a large number of projects and nearly impossible to classify them. One of the main sources of information about public repositories is their code.

To gain a deeper understanding of software development it is important to understand the trends among open-source projects. Bleeding edge technologies are often used first in open source projects and later employed in proprietary solutions when they become stable enough (notable examples include the Linux OS kernel, the PostgreSQL database engine, the Apache Spark cluster-computing framework and the Docker containers). An exploratory analysis of open-source projects can help to detect such trends and provide valuable insight for industry and academia.

Since GitHub appeared, the open-source movement has gained significant momentum. Historically, developers would manually register their open-source projects in software digests. As the number of projects dramatically grew, those lists became very hard to update; as a result they grew fragmented and started specializing exclusively in narrow technological ecosystems. The next attempt to classify open source projects was based on manually submitted lists of keywords. While this approach works [1], it requires careful keyword engineering to be comprehensive, and thus has not been widely adopted by end users in practice. GitHub introduced repository tags in January 2017, which are a variant of manual keyword submission.

The present paper describes how to conduct fully automated topic extraction from millions of public repositories. The approach scales linearly with the overall source code size and has a substantial performance reserve to support future growth. We propose building a bag-of-words model on the names occurring in source code and applying proven Natural Language Processing algorithms to it. Particularly, we describe how the "Weighted MinHash" algorithm [2] helps to filter fuzzy duplicates and how an Additive Regularized Topic Model (ARTM) [3] can be efficiently trained. The result of the topic modeling is a nearly complete classification of open source projects. It reflects the drastic variety of open source projects across multiple features. The dataset we work with consists of approx. 18 million public repositories retrieved from GitHub in October 2016.

The rest of the paper is organised as follows: Section 2 reviews prior work on the subject. Section 3 elaborates on how we turn software repositories into bags-of-words. Section 4 describes the approach to efficient filtering of fuzzy repository clones. Section 5 covers the building of the ARTM model with 256 manually labeled topics. Section 6 presents the achieved topic modeling results. Section 7 lists the open datasets we were able to prepare. Finally, section 8 presents a conclusion and suggests directions for future work.

II Related work

II-A Academia

An open source community study presenting statistics about manually picked topics was published in 2005 by J. Xu [4].

Blincoe [5] studied GitHub ecosystems using reference coupling over the GHTorrent dataset [6], which contained 2.4 million projects. Our research employs an alternative topic modeling method on the source code of 13.6 million projects. Instead of using the GHTorrent dataset, we prepared open datasets from almost all public repositories on GitHub in order to have a more comprehensive overview.

M. Lungi [7] conducted an in-depth study of software ecosystems in 2009, the year GitHub appeared. The examples in that work used samples of approx. 10 repositories, and the proposed discovery methods did not include Natural Language Processing.

The problem of the correct analysis of forks on GitHub has been discussed by Kalliamvakou [8] along with other valuable concerns.

Topic modeling of source code has been applied to a variety of problems reviewed in [9]: improvement of software maintenance [10], [11], defects explanation [12], concept analysis [13], [14], software evolution analysis [15], [16], finding similarities and clones [17], clustering source code and discovering the internal structure [18], [19], [20], summarizing [21], [22], [23]. In the aforementioned works, the scope of the research was focused on individual projects.

Topic modeling was used in [24] to improve software maintenance, evaluated on 4 software projects. Concepts were extracted using a corpus of 24 projects in [25]. Giriprasad Sridhara [26], Yang and Tan [27], and Howard [28] considered comments and/or names to find semantically similar terms; Haiduc and Marcus [29] researched common domain terms appearing in source code. The approach presented in this paper reveals similar and domain terms as well, but leverages a significantly larger dataset of 13.6 million repositories.

Bajracharya and Lopes [30] trained a topic model on the year long usage log of Koders, one of the major commercial code search engines. The topic categories suggested by Bajracharya and Lopes share little similarity with the categories described in this paper since the input domain is much different.

II-B Industry

To our knowledge, there are few companies which maintain a complete mirror of GitHub repositories. source{d} [31] is focused on doing machine learning on top of the collected source code. Software Heritage [32] strives to collect and preserve all open source software. SourceGraph [33] processes source code references, internal and external, and has created a complete reference graph for projects written in Golang. [34] is not GitHub-centric but rather processes the dependencies and metadata of open source packages fetched from a number of repositories. It analyses the dependency graph at the level of projects, while SourceGraph analyses it at the level of functions.

III Building the bag-of-words model

This section describes how we convert software repositories into bags-of-words, the model which stores each project as a multiset of its identifiers, ignoring order while maintaining multiplicity.

For the purpose of our analysis we choose the latest version of the master branch of each repository and treat each repository as a single document. An improvement for further research would be to use the entire history of each repository, including the unique code found in each branch.

III-A Preliminary Processing

Our first goal is to process each repository to identify which files contain source code and which files are redundant for our purpose. GitHub has an open-source machine learning based library named linguist [35] that identifies the programming language used within a file based on its extension and contents. We modified it to also identify vendor code and automatically generated files. The first step in our pre-processing is to run linguist over each repository's master branch. From 11.94 million repositories we end up with 402.6 million source files which we are highly confident contain source code written by developers of those projects. Identifying the programming language used within each file is important for the next step, the name extraction, as it determines the programming language parser.

III-B Extracting Names

Source code highlighting is a typical task for professional text editors and IDEs, and several open source libraries have been created to tackle it. Each works from a grammar file written per programming language which contains the rules. Pygments [36] is a high quality community-driven package for Python which supports more than 400 programming languages and markups. According to Pygments, all source code tokens are classified across the following categories: comments, escapes, indentations and generic symbols, reserved keywords, literals, operators, punctuation and names.

Linguist and Pygments have different sets of supported languages. Linguist stores its list at master/lib/linguist/languages.yml and the similar Pygments list is stored as pygments.lexers.LEXERS. Each has nearly 400 items and the intersection is approximately 200 programming languages (items of Linguist's "programming" type). The languages common to Linguist and Pygments which were chosen are listed in appendix A. In this research we apply Pygments to the 402.6 million source files to extract all tokens which belong to the type Token.Name.

III-C Processing names

The next step is to process the names according to naming conventions. As an example, class FooBarBaz adds three words to the bag: foo, bar and baz, and int wdSize adds two: wdsize and size. Fig. 1 is the full listing of the function written in Python 3.4+ which splits identifiers.

    import re

    NAME_BREAKUP_RE = re.compile(r"[^a-zA-Z]+")

    def extract_names(token):
        token = token.strip()
        prev_p = [""]

        def ret(name):
            r = name.lower()
            if len(name) >= 3:
                yield r
                if prev_p[0]:
                    yield prev_p[0] + r
                    prev_p[0] = ""
            else:
                prev_p[0] = r

        for part in NAME_BREAKUP_RE.split(token):
            if not part:
                continue
            prev = part[0]
            pos = 0
            for i in range(1, len(part)):
                this = part[i]
                if prev.islower() and this.isupper():
                    yield from ret(part[pos:i])
                    pos = i
                elif prev.isupper() and this.islower():
                    if 0 < i - 1 - pos <= 3:
                        yield from ret(part[pos:i - 1])
                        pos = i - 1
                    elif i - 1 > pos:
                        yield from ret(part[pos:i])
                        pos = i
                prev = this
            last = part[pos:]
            if last:
                yield from ret(last)

Fig. 1: Identifier splitting algorithm, Python 3.4+
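For illustration, the splitting conventions can be approximated with a compact regular-expression sketch. This is our simplified version: unlike the ret helper in Fig. 1, it does not fuse short fragments with their neighbours.

```python
import re

# Case-transition pattern: capitalized or lowercase words, and
# uppercase abbreviation runs not followed by a lowercase letter.
CAMEL_RE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])")

def split_identifier(name):
    """Split an identifier on non-alphabetic characters and case
    transitions, lowercasing the parts (simplified sketch)."""
    parts = []
    for chunk in re.split(r"[^a-zA-Z]+", name):
        parts.extend(m.group(0).lower() for m in CAMEL_RE.finditer(chunk))
    return parts
```

For example, split_identifier("FooBarBaz") yields foo, bar and baz, and split_identifier("HTTPResponse") yields http and response.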

In this step each repository is saved as an SQLite database file which contains a table with the programming language, the extracted name and its frequency of occurrence in that repository. The total number of unique names extracted was 17.28 million.

III-D Stemming names

It is common to stem names when creating a bag-of-words in NLP. Since we are working with natural language that is predominantly English, we have applied the Snowball stemmer [37] from the Natural Language Toolkit (NLTK) [38]. The stemmer was applied to names which were 6 or more characters long. In further research, a diligent step would be to compare results with and without stemming of the names, and also to determine the language of each name (when possible) and apply stemmers for different languages.

The length of words on which stemming was applied was chosen after the manual observation that shorter identifiers tend to collide with each other when stemmed and longer identifiers need to be normalized. Fig. 2 represents the distribution of identifier lengths in the dataset:

Fig. 2: Name lengths distribution

It can be seen that the most common name length is 6. Fig. 3 is the plot of the number of unique words in the dataset depending on the stemming threshold:

Fig. 3: Influence of the stemming threshold on the vocabulary size

We observe the breaking point at length 5; the vocabulary size grows linearly starting with length 6. Having manually inspected several collisions at smaller thresholds, we came to the conclusion that 6 corresponds to the best trade-off between collisions and word normalization.

The Snowball algorithm was chosen based on the comparative study by Jivani [39]. Stems not being real words are acceptable, but it is critical to have minimal over-stemming since it increases the number of collisions. The total number of unique names after stemming is 16.06 million.
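The length-gated stemming step can be sketched as follows. Here toy_stem is a stand-in for illustration only; the paper applies NLTK's Snowball stemmer at this point.

```python
def apply_stemming(names, stem, threshold=6):
    """Stem only names that are `threshold` or more characters long;
    shorter identifiers tend to collide with each other when stemmed."""
    return [stem(n) if len(n) >= threshold else n for n in names]

# Toy stand-in stemmer (strips a trailing "ing"), for illustration only.
def toy_stem(word):
    return word[:-3] if word.endswith("ing") else word
```

Short names such as "parse" pass through unchanged, while "parsing" is stemmed.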

To be able to efficiently pre-process our data we used Apache Spark [40] running on 64 4-core nodes which allowed us to process repositories in parallel in less than 1 day.

However, before training a topic model one has to exclude near-duplicate repositories. In many cases GitHub users copy the source code of existing projects without preserving the commit history; for example, this is common for web sites, blogs and Linux-based firmwares. Those repositories contain very few original changes and may introduce frequency noise into the overall name distribution. This paper suggests a way to filter such fuzzy duplicates based on the bag-of-words model built on the names in the source code.

IV Filtering near-duplicate repositories

There were more than 70 million GitHub repositories in October 2016 by our estimation. Approx. 18 million were not marked as forks. Nearly 800,000 repositories were de facto forks but not marked correspondingly by GitHub; that is, they had the same git commit history with identical hashes. Such repositories may appear when a user pushes a cloned or imported repository under his or her own account without using the GitHub web interface to initiate a fork.

When we remove such hidden forks from the initial 18 million repositories, there still remain repositories which are highly similar. A duplicate repository is sometimes the result of a git push of an existing project with a small number of changes. For example, there are a large number of repositories with the Linux kernel which are ports to specific devices. In another case, repositories containing web sites were created using a cloned web engine while preserving the development history. Finally, a large number of repositories are nearly identical: they contain much text content and few identifiers, and those identifiers are typically the same (HTML tags, CSS rules, etc.).

Filtering out such fuzzy forks speeds up the subsequent training of the topic model and reduces the noise. As we obtained a bag-of-words for every repository, the naive approach would be to measure all pairwise similarities and find cliques. But first we need to define the similarity between bags-of-words.

IV-A Weighted MinHash

Suppose that we have two dictionaries $A$ and $B$: key-value mappings with unique keys and values indicating non-negative "weights" of the corresponding keys. We would like to introduce a similarity measure between them. The Weighted Jaccard Similarity between dictionaries $A$ and $B$ is defined as

$$J(A, B) = \frac{\sum_k \min(A_k, B_k)}{\sum_k \max(A_k, B_k)},$$

where $A_k$ and $B_k$ are the weights of key $k$, taken as 0 if the key is absent. If the weights are binary, this formula is equivalent to the common Jaccard Similarity definition.
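As a sketch, the Weighted Jaccard Similarity of two weight dictionaries can be computed directly (the function name is ours):

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard Similarity of two {key: weight} dictionaries
    with non-negative weights; missing keys count as weight 0."""
    keys = a.keys() | b.keys()
    num = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    den = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return num / den if den else 1.0
```

With binary weights this reduces to the ordinary set Jaccard similarity.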

In the same way as MinHash is an algorithm to find similar sets in linear time, Weighted MinHash is an algorithm to find similar dictionaries in linear time. Weighted MinHash was introduced by Ioffe in [2]. We have chosen it in this paper because it is very efficient and allows execution on GPUs instead of large CPU clusters. The algorithm depends on a parameter $K$ which adjusts the resulting hash length.

Given the dictionary weights $S_i$, for every sample index $k$ in range($K$):

1. For each element $i$, sample $r_{ki}, c_{ki} \sim \mathrm{Gamma}(2, 1)$ (the PDF is $p(x) = x e^{-x}$) and $\beta_{ki} \sim \mathrm{Uniform}(0, 1)$.

2. Compute $t_i = \lfloor \ln S_i / r_{ki} + \beta_{ki} \rfloor$, $y_i = \exp(r_{ki}(t_i - \beta_{ki}))$ and $a_i = c_{ki} / (y_i \exp(r_{ki}))$.

3. Find $i^* = \arg\min_i a_i$ and return the sample $(i^*, t_{i^*})$.

Thus given $K$ and supposing that the integers are 32-bit, we obtain a hash of size $8K$ bytes. Samples from the $\mathrm{Gamma}(2, 1)$ distribution can be efficiently calculated as $-\ln(u_1 u_2)$, where $u_1$ and $u_2$ are drawn from the uniform distribution between 0 and 1.
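The sampling scheme above can be sketched in pure Python as follows. This is a readable illustration, not the MinHashCUDA implementation; the helper names and the per-element seeding scheme are ours.

```python
import math
import random

def weighted_minhash(weights, k, seed=0):
    """Weighted MinHash samples in the style of Ioffe's consistent
    weighted sampling.

    weights: dict mapping element index -> positive weight.
    Returns k samples of the form (i*, t*); two dictionaries agree on
    a sample with probability equal to their Weighted Jaccard similarity.
    """
    samples = []
    for s in range(k):
        best_a, best = math.inf, None
        for i, w in weights.items():
            # Derive r, c, beta deterministically from (seed, s, i) so
            # the same element draws identical randomness in every
            # document -- this is what makes the hashes comparable.
            rng = random.Random(seed * 1_000_003 + s * 10_007 + i)
            r = -math.log(rng.random() * rng.random())  # Gamma(2, 1)
            c = -math.log(rng.random() * rng.random())  # Gamma(2, 1)
            beta = rng.random()                         # Uniform(0, 1)
            t = math.floor(math.log(w) / r + beta)
            y = math.exp(r * (t - beta))
            a = c / (y * math.exp(r))
            if a < best_a:
                best_a, best = a, (i, t)
        samples.append(best)
    return samples
```

With $K = 128$ and two 32-bit integers per sample, this corresponds to the $8K$-byte hashes discussed above.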

We developed the MinHashCUDA [41] library and Python native extension, an implementation of the Weighted MinHash algorithm for NVIDIA GPUs using CUDA [42]. There were several engineering challenges in that implementation which are unfortunately out of the scope of this paper. We were able to hash all 10 million repositories with hash size 128 in less than 5 minutes using MinHashCUDA and two NVIDIA Titan X Pascal GPU cards.

IV-B Locality Sensitive Hashing

Having calculated all the hashes in the dataset, we can perform Locality Sensitive Hashing. We define several hash tables, each for its own sub-hash, the number depending on the target level of false positives. Similar elements appear in the same bucket; the union of the bucket sets across all the hash tables for a specific sample yields all the similar samples. Since our goal is to determine the sets of mutually similar samples, we consider the set intersection instead.

We used the implementation of Weighted MinHash LSH from Datasketch [43]. It is designed after the corresponding algorithm in Mining of Massive Datasets [44]. LSH takes a single parameter, the target Weighted Jaccard Similarity value ("threshold"). MinHash LSH puts every repository into a number of separate hash tables which depends on the threshold and the hash size. We used the default threshold of 0.9 in our experiments, which ensures a low level of dissimilarity within a hash table bin.
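The banding idea behind MinHash LSH can be illustrated with a minimal sketch (this is not Datasketch's actual API; each hash table is keyed by one sub-hash of the signature):

```python
from collections import defaultdict

def lsh_buckets(signatures, bands):
    """Split each MinHash signature into `bands` equal sub-hashes and
    bucket repositories by each sub-hash (one hash table per band).
    Repositories sharing a bucket in any table become near-duplicate
    candidates."""
    tables = [defaultdict(set) for _ in range(bands)]
    for repo, sig in signatures.items():
        n = len(sig) // bands
        for b, table in enumerate(tables):
            table[sig[b * n:(b + 1) * n]].add(repo)
    return tables
```

More bands make the scheme more permissive (a single matching band suffices), which is how the threshold controls the false-positive level.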

Fig. 5 describes the fuzzy duplicate detection pipeline. Step 6 discards less than 0.5% of all the sets and aims at reducing the number of false positives. The bin size distribution after step 5 is depicted in Fig. 4; it is clearly seen that the majority of the bins have size 2. Step 6 uses a Weighted Jaccard similarity threshold of 0.8 instead of 0.9 to be sensitive exclusively to evident outliers.

Fig. 4: LSH hash table’s bin size distribution
1:Calculate Weighted MinHash with hash size 128 for all the repositories.
2:Feed each hash to MinHash LSH with threshold 0.9 so that every repository appears in each of the 5 hash tables.
3:Filter out hash table bins with single entries.
4:For every repository, intersect the bins it appears in across all the hash tables. Cache the intersections, that is, if a repository appears in the same existing set, do not create the new one.
5:Filter out sets with a single element. The resulting number of unique repositories corresponds to ”Filtered size” in Table I.
6:For every set with 2 items, calculate the precise Weighted Jaccard similarity value and filter out those with less than 0.8 (optional).
7:Return the resulting list of sets. The number of unique repositories corresponds to ”Final size” in Table I.
Fig. 5: Fuzzy duplicates detection pipeline

Table I reveals how different hash sizes influence the resulting number of fuzzy clones:

Hash size Hash tables Average bins Filtered size Final size
64 3 272000 1,745,000 1,730,000
128 5 263000 1,714,000 1,700,000
160 6 261063 1,687,000 1,675,000
192 7 258000 1,666,000 1,655,000
TABLE I: Influence of the Weighted MinHash size to the number of fuzzy clones

The pipeline in Fig. 5 results in approximately 467,000 sets of fuzzy duplicates covering 1.7 million unique repositories overall. Each repository appears in two sets on average. Examples of fuzzy duplicates are listed in appendix B. The detection algorithm works especially well for static web sites which share the same JavaScript libraries.

After the exclusion of the fuzzy duplicates, we finish the dataset processing and pass over to the training of the topic model. The total number of unique names has now been reduced by 2.06 million to 14 million. To build a meaningful dataset, names occurring fewer than 20 times were excluded from the final vocabulary. 20 was chosen from the frequency histogram shown in Fig. 6 since it is the drop-off point.

Fig. 6: Stemmed names frequency histogram

After this exclusion, 2 million unique names remain, with an average bag-of-words size of 285 per repository; Fig. 7 displays the heavy-tailed bag size distribution.

Fig. 7: Bag sizes after fuzzy duplicates filtering

V Training the ARTM topic model

This section reviews ARTM and describes how the training of the topic model was performed. We chose ARTM over other topic modeling algorithms since it has the most efficient parallel CPU implementation, BigARTM, according to our benchmarks.

V-A Additive Regularized Topic Model

Suppose that we have a probabilistic topic model of a collection of documents $D$ which describes the occurrence of terms $w$ in documents $d$ through topics $t$:

$$p(w|d) = \sum_{t \in T} p(w|t)\, p(t|d).$$

$p(w|t)$ is the probability of the term $w$ to belong to the topic $t$, and $p(t|d)$ is the probability of the topic $t$ to belong to the document $d$; thus the whole formula is just an expression of the total probability, accepting the hypothesis of conditional independence: $p(w|d, t) = p(w|t)$. Terms belong to the vocabulary $W$, topics are taken from the set $T$ which is simply the series of indices $\{1, 2, \dots, |T|\}$.

We'd like to solve the problem of recovering $p(w|t)$ and $p(t|d)$ from the given set of documents $d \in D$. We normally assume $\hat{p}(w|d) = n_{wd} / n_d$, $n_{wd}$ being the number of times term $w$ occurred in document $d$ and $n_d$ the total number of terms in $d$, but this implies that all the terms are equally important, which is not always true. "Importance" here means some measure which negatively correlates with the overall frequency of the term. Let us denote the recovered probabilities as $\phi_{wt} \approx p(w|t)$ and $\theta_{td} \approx p(t|d)$, $\Phi = (\phi_{wt})$, $\Theta = (\theta_{td})$. Thus our problem is the stochastic matrix decomposition, which is not correctly stated:

$$\hat{p}(w|d) \approx \sum_{t \in T} \phi_{wt}\, \theta_{td}.$$

The stated problem can be solved by applying maximum likelihood estimation:

$$\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt} \theta_{td} \to \max_{\Phi, \Theta}$$

upon the conditions

$$\phi_{wt} \geq 0, \quad \sum_{w \in W} \phi_{wt} = 1, \quad \theta_{td} \geq 0, \quad \sum_{t \in T} \theta_{td} = 1.$$

The idea of ARTM is to naturally introduce regularization as one or several extra additive members:

$$\sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt} \theta_{td} + R(\Phi, \Theta) \to \max_{\Phi, \Theta}.$$

Since this is a simple summation, one can combine a series of regularizers in the same objective function. For example, it is possible to increase $\Phi$ and $\Theta$ sparsity or to make topics less correlated. The well-known LDA model [45] can be reproduced as ARTM too.

The matrices $\Phi$ and $\Theta$ can be effectively calculated using the iterative expectation-maximization (EM) algorithm [46]. Many ready-to-use ARTM regularizers are already implemented in the BigARTM open source project [47].
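As an illustration of the EM procedure, a toy loop for plain PLSA (the objective above with $R = 0$; a sketch, not BigARTM's implementation) can be written as:

```python
import random

def plsa_em(counts, n_topics, iters=50, seed=0):
    """Toy PLSA EM iterations (the ARTM objective without regularizers).

    counts: list of dicts, counts[d][w] = n_dw.
    Returns phi (word -> list over topics; each topic column sums to 1)
    and theta (document -> distribution over topics)."""
    rng = random.Random(seed)
    words = sorted({w for doc in counts for w in doc})
    topics = range(n_topics)
    phi = {w: [rng.random() for _ in topics] for w in words}
    theta = [[rng.random() for _ in topics] for _ in counts]
    for _ in range(iters):
        n_wt = {w: [0.0] * n_topics for w in words}
        n_td = [[0.0] * n_topics for _ in counts]
        for d, doc in enumerate(counts):
            for w, n in doc.items():
                p = [phi[w][t] * theta[d][t] for t in topics]
                z = sum(p) or 1.0
                for t in topics:
                    delta = n * p[t] / z  # E-step: n_dw * p(t | d, w)
                    n_wt[w][t] += delta
                    n_td[d][t] += delta
        for t in topics:  # M-step: renormalize the counters
            col = sum(n_wt[w][t] for w in words) or 1.0
            for w in words:
                phi[w][t] = n_wt[w][t] / col
        for d, row in enumerate(n_td):
            s = sum(row) or 1.0
            theta[d] = [x / s for x in row]
    return phi, theta
```

ARTM regularizers would enter as additional terms in the M-step updates.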

V-B Training

Vorontsov shows in [3] that ARTM is trained best if the regularizers are activated sequentially, with a lag relative to each other. For example, the first EM iterations are performed without any regularizers at all until the model reaches the target perplexity, then the $\Phi$ and $\Theta$ sparsity regularizers are activated and the model optimizes for the new members in the objective function while not increasing the perplexity. Finally, other advanced regularizers are appended and the model minimizes the corresponding members while leaving the old ones intact.

We apply only the $\Phi$ and $\Theta$ sparsity regularizers in this paper; further research is required to leverage the others. We experimented with the training of ARTM on the source code identifiers from section III and observed that the final perplexity and sparsity values do not change considerably over a wide range of adjustable meta-parameters. The best training meta-parameters are given in Table II.

Parameter Value
Topics 256
Iterations without regularizers 10
Iterations with regularizers 8
$\Phi$ sparsity weight 0.5
$\Theta$ sparsity weight 0.5
TABLE II: Used ARTM meta-parameters

We chose 256 topics merely because it is time intensive to label them and 256 was the largest amount we could label. The traditional ways of determining the optimal number of topics, e.g. elbow curves, are not applicable to our data. We cannot consider topics as clusters since a typical repository corresponds to several topics, and increasing the number of topics worsens the model's generalization and requires a dedicated topic decorrelation regularizer. The overall number of iterations equals 18. The convergence plot is shown in Fig. 8.

Fig. 8: ARTM convergence

The achieved quality metric values are as given in Table III.

Metric Value
Perplexity 10168
$\Phi$ sparsity 0.964
$\Theta$ sparsity 0.913
TABLE III: Achieved ARTM metrics

On average, a single iteration took 40 minutes to complete on our hardware. We used BigARTM in a Linux environment on a 16-core (32 threads) Intel(R) Xeon(R) CPU E5-2620 v4 computer with 256 GB of RAM. BigARTM supports parallel training and we set the number of workers to 30. The peak memory usage was approximately 32 GB.

It is possible to relax the hardware requirements and speed up the training if the model size is reduced. If we set the frequency threshold to a greater value, we can dramatically reduce the input data size at the risk of losing the model's ability to generalize.

We trained a reference LDA topic model using the built-in LDA engine in BigARTM to provide a baseline. 20 iterations resulted in a perplexity of 10336 and $\Phi$ and $\Theta$ sparsity of 0 (fully dense matrices). It can be seen that the additional regularization not only made the model sparse but also yielded a better perplexity. We relate this observation to the fact that LDA assumes the topic distribution to have a sparse Dirichlet prior, which does not necessarily hold for our dataset.

V-C Converting the repositories to the topics space

Let $E$ be the matrix of the repositories in the topic space of size $|D| \times |T|$, $F$ be the sparse matrix representing the dataset of size $|D| \times |W|$, and $\Phi$ be the matrix representing the trained topic model of size $|W| \times |T|$. We perform the matrix multiplication to get the repository embeddings:

$$E = F\, \Phi.$$

We further normalize each row of the matrix by the $L_2$ metric:

$$E_d \leftarrow \frac{E_d}{\lVert E_d \rVert_2}.$$
The sum along every column of this matrix indicates the significance of each topic. Fig. 9 shows the distribution of this measure.
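The projection and normalization can be sketched with dictionary stand-ins for the sparse matrices (function and variable names are ours):

```python
import math

def embed_repositories(bags, phi, n_topics):
    """Project each repository bag-of-words into the topic space
    (one matrix-vector product per repository) and L2-normalize
    the resulting rows."""
    embeddings = []
    for bag in bags:
        row = [sum(n * phi[w][t] for w, n in bag.items())
               for t in range(n_topics)]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        embeddings.append([x / norm for x in row])
    return embeddings
```

Summing the resulting matrix along every column then gives the per-topic significance measure described above.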

Fig. 9: ARTM topic significance distribution

VI Topic modeling results

The employed topic model is unable to summarize the topics the same way humans do. It is possible to interpret some topics based on the most significant words and some based on the relevant repositories, but many require manual supervision with careful analysis of the most relevant names and repositories. This supervision is labour intensive: a single topic normally takes up to 30 minutes to summarize with proper confidence, and the 256 topics required several man-days to complete the analysis.

After a careful analysis, we sorted the labelled topics into the following groups:

  • Concepts (41) - general, broad and abstract. The most interesting group; it includes scientific terms and facts about the world and society.

  • Human languages (10) - it appears that one can determine a programmer's approximate native language by looking at their code, thanks to the stem bias.

  • Programming languages (33) - not so interesting since this is the information we already have after linguist classification. Programming languages usually have a standard library of classes and functions which is imported/included into most of the programs, and the corresponding names are revealed by our topic modeling. Some topics are more narrow than a programming language.

  • General IT (72) - topics which could appear in Concepts if they had an expressive list of key words, but do not. The repositories are associated by a unique set of names in the code without any special meaning.

  • Technologies (87) - devoted to some specific, potentially narrow technology or product. Often indicates an ecosystem or community around the technology.

  • Games (13) - related to video games. Includes specific gaming engines.

The complete topics list is in appendix C. An example topic labelled "Machine Learning, Data Science" is shown in the appendix.
It can be observed that some topics are dual and need to be split. That duality is a sign that the number of topics should be bigger. At the same time, some topics appear twice and need to be de-correlated, e.g. using the "decorrelation" ARTM regularizer. However, a simple reduction or increase of the number of topics does not solve those problems, as we found out while experimenting with 200 and 320 topics.

VII Released datasets

We generated several datasets extracted from our internal 100 TB GitHub repository storage. We published them on [48], the recently emerged "GitHub for data scientists"; each has a description, an origin note and a format definition. Besides, the datasets are uploaded to Zenodo and have DOIs. They are listed in Table IV.

Name and DOI Description
source code names, 10.5281/zenodo.284554: names extracted from 13,000,000 repositories (fuzzy clones excluded), considered in section III
452,000,000 commits, 10.5281/zenodo.285467: metadata of all the commits in 16,000,000 repositories (fuzzy clones excluded)
keyword frequencies, 10.5281/zenodo.285293: frequencies of programming language keywords (reserved tokens) across 16,000,000 repositories (fuzzy clones excluded)
readme files, 10.5281/zenodo.285419: README files extracted from 16,000,000 repositories (fuzzy clones excluded)
duplicate repositories, 10.5281/zenodo.285377: fuzzy clones which were considered in section IV
TABLE IV: Open datasets

VIII Conclusion and future work

Topic modeling of GitHub repositories is an important step towards understanding software development trends and open source communities. We built a repository processing pipeline and applied it to more than 18 million public repositories on GitHub. Using MinHashCUDA, an open source tool we developed, we were able to remove 1.6 million fuzzy duplicate repositories from the dataset. The preprocessed dataset with source code names, as well as the other datasets, are open, and the presented results can be reproduced. We trained ARTM on the resulting dataset and manually labelled 256 topics. The data processing and model training can be performed with a single GPU card and a moderately sized Apache Spark cluster. The topics covered a broad range of projects, but there were repeating and dual ones. The chosen number of topics was enough for general exploration but not for a complete description of the dataset.

Future work may involve experimentation with clustering the repositories in the topic space and comparison with clusters based on dependency or social graphs [49].

Appendix A Parsed languages


  • abap

  • abl

  • actionscript

  • ada

  • agda

  • ahk

  • alloy

  • antlr

  • apl

  • applescript

  • arduino

  • as3

  • aspectj

  • aspx-vb

  • autohotkey

  • autoit

  • awk

  • b3d

  • bash

  • batchfile

  • befunge

  • blitzbasic

  • blitzmax

  • bmax

  • boo

  • bplus

  • brainfuck

  • bro

  • bsdmake

  • c

  • c#

  • c++

  • ceylon

  • cfc

  • cfm

  • chapel

  • chpl

  • cirru

  • clipper

  • clojure

  • cmake

  • cobol

  • coffeescript

  • coldfusion

  • common lisp

  • component pascal

  • console

  • coq

  • csharp

  • csound

  • cucumber

  • cuda

  • cython

  • d

  • dart

  • delphi

  • dosbatch

  • dylan

  • ec

  • ecl

  • eiffel

  • elisp

  • elixir

  • elm

  • emacs

  • erlang

  • factor

  • fancy

  • fantom

  • fish

  • fortran

  • foxpro

  • fsharp

  • gap

  • gas

  • genshi

  • gherkin

  • glsl

  • gnuplot

  • go

  • golo

  • gosu

  • groovy

  • haskell

  • haxe

  • hy

  • i7

  • idl

  • idris

  • igor

  • igorpro

  • inform 7

  • io

  • ioke

  • j

  • isabelle

  • jasmin

  • java

  • javascript

  • jsp

  • julia

  • kotlin

  • lasso

  • lassoscript

  • lean

  • lhaskell

  • lhs

  • limbo

  • lisp

  • literate agda

  • literate haskell

  • livescript

  • llvm

  • logos

  • logtalk

  • lsl

  • lua

  • make

  • mako

  • mathematica

  • matlab

  • mf

  • minid

  • mma

  • modelica

  • modula-2

  • monkey

  • moocode

  • moonscript

  • mupad

  • myghty

  • nasm

  • nemerle

  • nesc

  • newlisp

  • nimrod

  • nit

  • nix

  • nixos

  • nsis

  • numpy

  • obj-c

  • obj-c++

  • obj-j

  • objectpascal

  • ocaml

  • octave

  • ooc

  • opa

  • openedge

  • pan

  • pascal

  • pawn

  • perl

  • php

  • pike

  • plpgsql

  • posh

  • povray

  • powershell

  • progress

  • prolog

  • puppet

  • pyrex

  • python

  • qml

  • robotframework


  • r

  • racket

  • ragel

  • rb

  • rebol

  • red

  • redcode

  • ruby

  • rust

  • sage

  • salt

  • scala

  • scheme

  • scilab

  • shell

  • shen

  • smali

  • smalltalk

  • smarty

  • sml

  • sourcepawn

  • splus

  • squeak

  • stan

  • standard ml

  • supercollider

  • swift

  • tcl

  • tcsh

  • thrift

  • typescript

  • vala


  • verilog

  • vhdl

  • vim

  • winbatch

  • x10

  • xbase

  • xml+genshi

  • xml+kid

  • xquery

  • xslt

  • xtend

  • zephir

Appendix B Examples of fuzzy duplicate repositories

B-A Linux kernel

  • 1406/linux-0.11

  • yi5971/linux-0.11

  • love520134/linux-0.11

  • wfirewood/source-linux-0.11

  • sunrunning/linux-0.11

  • Aaron123/linux-0.11

  • junjee/linux-0.11

  • pengdonglin137/linux-0.11

  • yakantosat/linux-0.11

B-B Tutorials

  • dcarbajosa/linuxacademy-chef

  • jachinh/linuxacademy-chef

  • flarotag/linuxacademy-chef

  • qhawk/linuxacademy-chef

  • paul-e-allen/linuxacademy-chef

B-C Web applications 1

  • choysama/my-django-first-blog

  • mihuie/django-first

  • PubMahesh/my-first-django-app

  • nickmalhotra/first-django-blog

  • Jmeggesto/MyFirstDjango

  • atlwendy/django-first

  • susancodes/first-django-app

  • quipper7/DjangoFirstProject

  • phidang/first-django-blog

B-D Web applications 2

  • iggitye/omrails

  • ilrobinson81/omrails

  • OCushman/omrails

  • hambini/One-Month-Rails

  • Ben2pop/omrails

  • chrislorusso/omrails

  • arjunurs/omrails

  • crazystingray/omrails

  • scorcoran33/omrails

  • Joelf001/Omrails

Appendix C Complete list of labelled topics

C-A Concepts

  1. 2D geometry

  2. 3D geometry

  3. Arithmetic

  4. Audio

  5. Bitcoin

  6. Card Games

  7. Chess; Hadoop #

  8. Classical mechanics (physics)

  9. Color Manipulation/Generation

  10. Commerce, ordering

  11. Computational Physics

  12. Date and time

  13. Design patterns; HTML parsing

  14. Email *

  15. Email *

  16. Enumerators, Mathematical Expressions

  17. Finance and trading

  18. Food (e.g. pizza, cheese, beverage), Calculator

  19. Genomics

  20. Geolocalization, Maps

  21. Graphs

  22. Hexadecimal numbers

  23. Human

  24. Identifiers

  25. Language names; JavaFX #

  26. Linear Algebra; Optimization

  27. Machine Learning, Data Science

  28. My

  29. Parsing

  30. Particle physics

  31. Person Names (American)

  32. Personal Information

  33. Photography, Flickr

  34. Places, transportation, travel

  35. Publishing; Flask #

  36. Space and solar system

  37. Sun and moon

  38. Trade

  39. Trees, Binary Trees

  40. Video; movies

  41. Word Term

C-B Human languages

  1. Chinese

  2. Dutch

  3. French *

  4. French *

  5. German

  6. Portuguese *

  7. Portuguese *

  8. Spanish *

  9. Spanish *

  10. Vietnamese

C-C Programming languages

  1. Assembler

  2. Autoconf

  3. Clojure

  4. ColdFusion *

  5. ColdFusion *

  6. Common LISP

  7. Emacs LISP

  8. Emulated assembly

  9. Go

  10. HTML

  11. Human education system

  12. Java AST and bytecode

  13. libc

  14. Low-level PHP

  15. Lua *

  16. Lua *

  17. Makefiles

  18. Mathematics: proofs, sets

  19. Matlab

  20. Object Pascal

  21. Objective-C

  22. Perl

  23. Python

  24. Python, ctypes

  25. Ruby

  26. Ruby with language extensions

  27. SQL

  28. String Manipulation in C

  29. Verilog/VHDL

  30. Work, money, employment, driving, living

  31. x86 Assembler *

  32. x86 Assembler *

  33. XPCOM

C-D General IT

  1. 3-char identifiers

  2. Advertising (Facebook, Ad Engines, Ad Blockers, AdMob)

  3. Animation

  4. Antispam; PHP forums

  5. Antivirus; database access #

  6. Barcodes; browser engines #

  7. Charting

  8. Chat; messaging

  9. Chinese web

  10. Code analysis and generation

  11. Computer memory and interfaces

  12. Console, terminal, COM

  13. CPU and kernel

  14. Cryptography

  15. Date and time picker

  16. DB Sharding, MongoDB sharding

  17. Design patterns; formal architecture

  18. DevOps

  19. Drawing *

  20. Drawing *

  21. Forms (UI)

  22. Glyphs; X11 and FreeType #

  23. Grids and tables

  24. HTTP auth

  25. iBeacons

  26. Image Manipulation

  27. Image processing

  28. Intel SIMD, Linear Algebra #

  29. IO operations

  30. Javascript selectors

  31. JPEG and PNG

  32. Media Players

  33. Metaprogramming

  34. Modern JS frontend (Bower, Grunt, Yeoman)

  35. Names starting with “m”

  36. Networking

  37. OAuth; major web services #

  38. Observer design pattern

  39. Online education; Moodle

  40. OpenGL *

  41. Parsers and compilers

  42. Plotting

  43. Pointers

  44. POSIX Shell; VCS #

  45. Promises and deferred execution; Angular #

  46. Proof of concept

  47. RDF and SGML parsing

  48. Request and Response

  49. Requirements and dependencies

  50. Sensors; DIY devices

  51. Sockets C API

  52. Sockets, Networking

  53. Sorting and searching

  54. SQL database

  55. SQL DB, XML in PHP projects

  56. SSL

  57. Strings

  58. Testing with mocks

  59. Text editor UI

  60. Threads and concurrency

  61. Typing suggestions and dropdowns

  62. UI

  63. Video player

  64. VoIP

  65. Web Media, Arch Packages #

  66. Web posts

  67. Web testing; crawling

  68. Web UI

  69. Wireless

  70. Working with buffers

  71. XML (SAX, XSL)

  72. XMPP

  73. .NET

  74. Android Apps

  75. Android UI

  76. Apache Libraries for BigData

  77. Apache Thrift

  78. Arduino, AVR

  79. ASP.NET *

  80. ASP.NET *

C-E Technologies

  1. Backbone.js

  2. Chardet (Python)

  3. Cocos2D

  4. Comp. vision; OpenCV

  5. Cordova

  6. CPython

  7. Crumbs; cake(PHP)

  8. cURL

  9. DirectDraw

  10. DirectX

  11. Django Web Apps, CMS

  12. Drupal

  13. Eclipse SWT

  14. Emacs configs

  15. Emoji and Dojo #

  16. Facebook; Parse SDK #

  17. ffmpeg

  18. FLTK

  19. Fonts

  20. FPGA, Verilog

  21. FreeRTOS (Embedded)

  22. Glib

  23. Ionic framework, Cordova

  24. iOS Networking

  25. iOS Objective-C API

  26. iOS UI

  27. Jasmine tests, JS exercises, exercism #

  28. Java GUI

  29. Java Native Interface

  30. Java web servers

  31. Javascript AJAX, Javascript DOM manipulation

  32. Joomla

  33. JQuery

  34. jQuery Grid

  35. Lex, Yacc compiler

  36. libav / ffmpeg

  37. Linear algebra libraries

  38. Linux Kernel, Linux Wireless

  39. Lodash

  40. MFC Desktop Applications

  41. Minecraft mods

  42. Monads

  43. OpenCL

  44. OpenGL *

  45. PHP sites written by non-native English people

  46. PIC32

  47. Portable Document Format

  48. Puppet

  49. Apps

  50. Python packaging

  51. Python scientific stack

  52. Python scrapers

  53. Qt *

  54. Qt *

  55. React

  56. ROS (Robot Operating System)

  57. Ruby On Rails Apps

  58. SaltStack

  59. Shockwave Flash

  60. Spreadsheets (Excel)

  61. Spreadsheets with PHP

  62. SQLite

  63. STL, Boost

  64. STM32

  65. Sublime Extensions

  66. Symphony, Doctrine; NLP #

  67. U-boot

  68. Vim Extensions

  69. Visual Basic, MSSQL

  70. Web scraping

  71. WinAPI

  72. Wordpress *

  73. Wordpress *

  74. Wordpress-like frontend

  75. Working with PDF in PHP

  76. wxWidgets

  77. Zend framework

  78. zlib *

  79. zlib *

C-F Games

  1. 3D graphics and Unity

  2. Fantasy Creatures

  3. Games

  4. Hello World, Games

  5. Minecraft

  7. Pokemon

  8. Puzzle games

  9. RPG, fantasy

  10. Shooters (SDL)

  11. Unity Engine

  12. Unity3D Games

  13. Web Games

* Repeating topic with different key words, see section VI.

# Dual topic, see section VI.
Appendix D Key words and repositories belonging to topic #27 (Machine Learning, Data Science)

Weight Word | Weight Repository
0.313115 plot 1.000000 jingxia/kaggle_yelp
0.303456 numpy 0.999998 Carreau/spicy
0.273759 plt 0.999962 zck17388/test
0.187565 figur 0.999719 jonathanekstrand/Python_…
0.181307 zeros 0.999658 skendrew/astroScanr
0.169696 matplotlib 0.999543 southstarj/YCSim
0.166166 dtype 0.999430 parteekDhream/statthermo…
0.165236 fig 0.999430 axellundholm/FMN050
0.159658 ylabel 0.999361 soviet1977/PriceList
0.153094 xlabel 0.999354 connormarrs/3D-Rocket-…
0.146327 subplot 0.999282 Holiver/matplot
0.144736 shape 0.999103 wetlife/networkx
0.132792 pyplot 0.999034 JingshiPeter/CS373
0.124264 scipy 0.998969 marialeon/los4mas2
0.120666 axis 0.998385 acnz/Project
0.110212 arang 0.998138 khintz/GFSprob
0.110049 mean 0.998123 claralusan/test_aug_18
0.096037 reshap 0.997822 amcleod5/PythonPrograms
0.093182 range 0.997662 ericqh/deeplearning
0.084059 ylim 0.997567 laserson/stitcher
0.082812 linspac 0.996786 hs-jiang/MCA_python
0.081260 savefig 0.996327 DianaSplit/Tracking
0.080978 xlim 0.995153 SivaGabbi/Ipython-Noteb…
0.080325 axes 0.994776 prov-suite/prov-sty
0.077891 legend 0.992801 bmoerker/EffectSizeEstim…
0.076858 bins 0.992558 natalink/machine_learning
0.076140 panda 0.992324 olehermanse/INF1411-El…
0.076043 astyp 0.991026 fonnesbeck/scipy2014_tut…
0.075235 pylab 0.990514 fablab-paderborn/device-…
0.073265 ones 0.989586 mqchau/dataAnalysis
0.072214 xrang 0.988722 acemaster/Image-Process…
0.072196 len 0.988514 henryoswald/Sin-Plot
0.069818 float 0.987874 npinto/virtualenv-bootstrap
0.065453 linewidth 0.987039 ipashchenko/test_datajoy
0.065453 linalg 0.986802 Ryou-Watanabe/practice
0.065322 norm 0.986272 mirthbottle/datascience-…
0.064042 hist 0.986039 hglabska/doktorat
0.062975 label 0.985837 parejkoj/yaledemo
0.061608 sum 0.985246 grajasumant/python
0.060443 cmap 0.985173 aaronspring/Scientific-Py…
0.059155 scatter 0.985043 asesana/plots
0.058877 fontsiz 0.984838 Sojojo83/SJFirstRepository
0.057343 self 0.983808 Metres/MetresAndThtu…
0.057328 none 0.983393 e-champenois/pySurf
0.056908 true 0.983170 pawel-kw/dexy-latex-ex…
0.056292 xtick 0.983074 keiikegami/envelopetheorem
0.051978 figsiz 0.982967 msansa/test
0.051359 sigma 0.982904 qdonnellan/spyder-examples
0.050785 ndarray 0.981725 qiuwch/PythonNotebook…
0.050586 sqrt 0.981156 rescolo/getdaa
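A key-word listing like the left column above can be reproduced by sorting a topic's word-probability vector, i.e. a column of the phi matrix in ARTM/LDA. A minimal sketch over a synthetic phi slice:

```python
# Synthetic slice of one topic's phi column: word -> P(word | topic).
phi_topic = {
    "plot": 0.313115, "numpy": 0.303456, "plt": 0.273759,
    "figur": 0.187565, "zeros": 0.181307,
}

def top_words(phi, n):
    # The highest-probability words characterize the topic.
    return sorted(phi.items(), key=lambda kv: -kv[1])[:n]

leaders = top_words(phi_topic, 3)  # plot, numpy, plt lead this topic
```

The repository column is obtained the same way from the topic-document (theta) matrix, sorting repositories by their probability within the topic.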


  • [1] “Sourceforge directory.”
  • [2] S. Ioffe, “Improved consistent sampling, weighted minhash and l1 sketching,” in Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM ’10, (Washington, DC, USA), pp. 246–255, IEEE Computer Society, 2010.
  • [3] K. Vorontsov and A. Potapenko, “Additive regularization of topic models,” Machine Learning, vol. 101, no. 1, pp. 303–323, 2015.
  • [4] J. Xu, S. Christley, and G. Madey, “The open source software community structure,” in Proceedings of the North American Association for Computation Social and Organization Science, NAACSOS ’05, 2005.
  • [5] K. Blincoe, F. Harrison, and D. Damian, “Ecosystems in github and a method for ecosystem identification using reference coupling,” in Proceedings of the 12th Working Conference on Mining Software Repositories, MSR ’15, (Piscataway, NJ, USA), pp. 202–207, IEEE Press, 2015.
  • [6] G. Gousios, “The ghtorrent dataset and tool suite,” in Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, (Piscataway, NJ, USA), pp. 233–236, IEEE Press, 2013.
  • [7] M. Lungu, Reverse Engineering Software Ecosystems. PhD thesis, University of Lugano, Sept 2009.
  • [8] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian, “The promises and perils of mining github,” in Proceedings of the 11th Working Conference on Mining Software Repositories, MSR 2014, (New York, NY, USA), pp. 92–101, ACM, 2014.
  • [9] X. Sun, X. Liu, B. Li, Y. Duan, H. Yang, and J. Hu, “Exploring topic models in software engineering data analysis: A survey,” in 2016 17th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), pp. 357–362, May 2016.
  • [10] S. Grant, J. R. Cordy, and D. B. Skillicorn, “Using topic models to support software maintenance,” in 2012 16th European Conference on Software Maintenance and Reengineering, pp. 403–408, March 2012.
  • [11] S. Grant and J. R. Cordy, “Examining the relationship between topic model similarity and software maintenance,” in 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE), pp. 303–307, Feb 2014.
  • [12] T.-H. Chen, S. W. Thomas, M. Nagappan, and A. E. Hassan, “Explaining software defects using topic models,” in Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, MSR ’12, (Piscataway, NJ, USA), pp. 189–198, IEEE Press, 2012.
  • [13] S. Grant, J. R. Cordy, and D. Skillicorn, “Automated concept location using independent component analysis,” in 2008 15th Working Conference on Reverse Engineering, pp. 138–142, Oct 2008.
  • [14] E. Linstead, P. Rigor, S. Bajracharya, C. Lopes, and P. Baldi, “Mining concepts from code with probabilistic topic models,” in Proceedings of the Twenty-second IEEE/ACM International Conference on Automated Software Engineering, ASE ’07, (New York, NY, USA), pp. 461–464, ACM, 2007.
  • [15] E. Linstead, C. Lopes, and P. Baldi, “An application of latent dirichlet allocation to analyzing software evolution,” in 2008 Seventh International Conference on Machine Learning and Applications, pp. 813–818, Dec 2008.
  • [16] S. W. Thomas, B. Adams, A. E. Hassan, and D. Blostein, “Modeling the evolution of topics in source code histories,” in Proceedings of the 8th Working Conference on Mining Software Repositories, MSR ’11, (New York, NY, USA), pp. 173–182, ACM, 2011.
  • [17] J. I. Maletic and A. Marcus, “Using latent semantic analysis to identify similarities in source code to support program understanding,” in Proceedings 12th IEEE Internationals Conference on Tools with Artificial Intelligence. ICTAI 2000, pp. 46–53, 2000.
  • [18] J. I. Maletic and N. Valluri, “Automatic software clustering via latent semantic analysis,” in 14th IEEE International Conference on Automated Software Engineering, pp. 251–254, Oct 1999.
  • [19] A. Kuhn, S. Ducasse, and T. Gírba, “Semantic clustering: Identifying topics in source code,” Inf. Softw. Technol., vol. 49, pp. 230–243, Mar. 2007.
  • [20] S. W. Thomas, “Mining software repositories using topic models,” in Proceedings of the 33rd International Conference on Software Engineering, ICSE ’11, (New York, NY, USA), pp. 1138–1139, ACM, 2011.
  • [21] B. P. Eddy, J. A. Robinson, N. A. Kraft, and J. C. Carver, “Evaluating source code summarization techniques: Replication and expansion,” in 2013 21st International Conference on Program Comprehension (ICPC), pp. 13–22, May 2013.
  • [22] P. W. McBurney, C. Liu, C. McMillan, and T. Weninger, “Improving topic model source code summarization,” in Proceedings of the 22Nd International Conference on Program Comprehension, ICPC 2014, (New York, NY, USA), pp. 291–294, ACM, 2014.
  • [23] A. M. Saeidi, J. Hage, R. Khadka, and S. Jansen, “Itmviz: Interactive topic modeling for source code analysis,” in Proceedings of the 2015 IEEE 23rd International Conference on Program Comprehension, ICPC ’15, (Piscataway, NJ, USA), pp. 295–298, IEEE Press, 2015.
  • [24] X. Sun, B. Li, H. Leung, B. Li, and Y. Li, “Msr4sm: Using topic models to effectively mining software repositories for software maintenance tasks,” Inf. Softw. Technol., vol. 66, pp. 1–12, Oct. 2015.
  • [25] V. Prince, C. Nebut, M. Dao, M. Huchard, J.-R. Falleri, and M. Lafourcade, “Automatic extraction of a wordnet-like identifier network from software,” International Conference on Program Comprehension, vol. 00, pp. 4–13, 2010.
  • [26] G. Sridhara, L. Pollock, E. Hill, and K. Vijay-Shanker, “Identifying word relations in software: A comparative study of semantic similarity tools,” International Conference on Program Comprehension, vol. 00, pp. 123–132, 2008.
  • [27] J. Yang and L. Tan, “Inferring semantically related words from software context,” in Proceedings of the 9th IEEE Working Conference on Mining Software Repositories, MSR ’12, (Piscataway, NJ, USA), pp. 161–170, IEEE Press, 2012.
  • [28] M. J. Howard, S. Gupta, L. Pollock, and K. Vijay-Shanker, “Automatically mining software-based, semantically-similar words from comment-code mappings,” in Proceedings of the 10th Working Conference on Mining Software Repositories, MSR ’13, (Piscataway, NJ, USA), pp. 377–386, IEEE Press, 2013.
  • [29] S. Haiduc and A. Marcus, “On the use of domain terms in source code,” International Conference on Program Comprehension, vol. 00, pp. 113–122, 2008.
  • [30] S. Bajracharya and C. Lopes, “Mining search topics from a code search engine usage log,” in Proceedings of the 2009 6th IEEE International Working Conference on Mining Software Repositories, MSR ’09, (Washington, DC, USA), pp. 111–120, IEEE Computer Society, 2009.
  • [31] “source{d}.”
  • [32] “Software heritage.”
  • [33] “Sourcegraph.”
  • [34] A. Nesbitt, “”
  • [35] “github/linguist.”
  • [36] Pocoo, “Pygments - generic syntax highlighter.”
  • [37] M. F. Porter, “Snowball: A language for stemming algorithms,” 2001.
  • [38] “Natural language toolkit.”
  • [39] A. G. Jivani, “A comparative study of stemming algorithms,” vol. 2 (6) of IJCTA, pp. 1930–1938, 2011.
  • [40] “Apache spark.”
  • [41] source{d} and V. Markovtsev, “Minhashcuda - the implementation of weighted minhash on gpu.” Appeared in 2016-09.
  • [42] J. Nickolls, I. Buck, M. Garland, and K. Skadron, “Scalable parallel programming with cuda,” Queue, vol. 6, pp. 40–53, Mar. 2008.
  • [43] E. Zhu, “ekzhu/datasketch.” Appeared in 2015-03.
  • [44] J. Leskovec, A. Rajaraman, and J. D. Ullman, Mining of Massive Datasets. New York, NY, USA: Cambridge University Press, 2nd ed., 2014.
  • [45] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” J. Mach. Learn. Res., vol. 3, pp. 993–1022, Mar. 2003.
  • [46] F. Dellaert, “The expectation maximization algorithm,” tech. rep., Georgia Institute of Technology, 2002.
  • [47] O. Frei, M. Apishev, N. Shapovalov, and P. Romov, “Bigartm - the state-of-the-art platform for topic modeling.” Appeared in 2014-11.
  • [48] “ - the most meaningful, collaborative, and abundant data resource in the world.”
  • [49] S. Syed and S. Jansen, “On clusters in open source ecosystems,” 2013.