Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering

by   Egor Bogomolov, et al.

Authorship attribution of source code has been an established research topic for several decades. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. With this study, we first introduce a language-agnostic approach to authorship attribution of source code. Two machine learning models based on our approach match or improve over state-of-the-art results, originally achieved by language-specific approaches, on existing datasets for code in C++, Python, and Java. After that, we discuss limitations of existing synthetic datasets for authorship attribution, and propose a data collection approach that delivers datasets that better reflect aspects important for potential practical use in software engineering. In particular, we discuss the concept of work context and its importance for authorship attribution. Finally, we demonstrate that high accuracy of authorship attribution models on existing datasets drastically drops when they are evaluated on more realistic data. We conclude the paper by outlining next steps in design and evaluation of authorship attribution models that could bring the research efforts closer to practical use.








1 Introduction

The task of source code authorship attribution can be formulated as follows: given a piece of code and a predefined set of authors, attribute this piece to one of these authors or determine that it was written by someone else. This problem has been an area of interest for researchers for at least three decades [1].

The task of authorship attribution in academic works is usually motivated by the needs of computer security, where it can be used to identify authors of malware programs [2, 3, 4, 5]. However, past research has shown that software engineering tasks, such as software maintenance [6, 7, 8] and software quality analysis [9, 10, 11, 12], also benefit from authorship information. Since authorship information in the software engineering domain may be missing or inaccurate (e.g., due to pair-programming, co-authored commits, and changes made after code review suggestions), authorship attribution is an important goal to achieve.

Source code authorship attribution is also useful for plagiarism detection, either to directly determine the author of plagiarized code [13, 14, 15] or to ensure that several fragments of code were written by a single author [16]. Plagiarism detection, in turn, is important in software engineering: software companies need to pay extra attention to copyright and licensing issues, as they can become liable to lawsuits [17]. For example, developers often copy and paste code snippets from Stack Overflow into their projects. However, if they do not take special care, code borrowed from Stack Overflow can induce licence conflicts on top of complicating maintenance [18].

Recently, several works improved the state of the art in authorship attribution on datasets for three popular programming languages: C++, Python, and Java. For C++, Caliskan et al. reported an accuracy of 92.8% when distinguishing among 1,600 potential authors of code [3]. For Python, Alsulami et al. attributed code of 70 programmers with 88.9% accuracy [19]. Yang et al. developed a neural network model that achieved 91.1% accuracy for a dataset of Java code by 40 authors [20].


In this study, we suggest two language-agnostic authorship attribution models. Both models work with path-based representations of code [21]. The first model, called PbRF (Path-based Random Forest), is a random forest trained on term frequencies of tokens and paths in abstract syntax trees (ASTs). This random forest model matches or improves the state of the art on the Java, C++, and Python datasets, even with few available samples per author. The second model, named PbNN (Path-based Neural Network), is an adapted version of the code2vec neural network [21]. PbNN outperforms PbRF when the number of available samples per author is large. Both models improve state-of-the-art results for Java on a dataset by Yang et al. [20], with 97% and 98% accuracy, respectively.

Existing works on authorship attribution operate with artificial data: examples from books [1, 4], students’ assignments [22, 23, 5], solutions to programming competitions [3, 24, 19, 25], and open-source projects with a single author [15, 26, 20, 13]. In this study, based on our experience with improving authorship attribution as well as with the software engineering research domain, we investigate the differences between the mentioned data sources and code that can be found in real-world programming projects or other practical applications. Based on the results, we propose a new data collection technique that can reduce these differences. To formalize the differences, we introduce the concept of work context, which includes aspects that can affect a developer’s coding practices and that are specific to the concrete project, its domain, team, internal coding conventions, and more. We also discuss how the evolution of a programmer’s individual coding practices over time influences the problem of source code attribution.

Our quantitative evaluation shows that the accuracy of authorship models plunges when the models are tested under software engineering conditions more realistic than those in previous artificial datasets. In particular, a model that can distinguish between 40 authors with 98% accuracy in one setup reaches only 22% accuracy for 26 developers in another. This result suggests that, before their practical adoption for software engineering, existing results in the field of source code attribution should be revisited to evaluate their sensitivity to the aforementioned aspects.

With this work we make the following contributions:

  • Two language-agnostic models that work with path-based representation of code. These models match or outperform the language-specific state of the art.

  • A discussion on the limitations of existing datasets, particularly when applied to the software engineering domain.

  • A novel, scalable approach to data collection for evaluation of source code authorship attribution models.

  • The concept of developer’s work context and the empirical evaluation of its influence on authorship attribution.

  • Empirical evidence on how the evolution of developers’ coding practices strongly impacts current authorship models’ performance. (Here and onwards, we use the word performance interchangeably with accuracy.)

The implementations of PbNN, PbRF, and the novel data collection approach are available on GitHub.

2 Background

The first work on source code authorship attribution dates back to Oman et al. [1] in 1989. Although the results and approaches have changed and improved since then, the underlying idea remains the same: applying machine learning to features extracted from the source code.

According to the recent survey by Kalgutkar et al. [27], the following are the best results achieved for various languages:

  • C++: Caliskan et al. [3] reported the best results using a random forest trained on syntactical features. They achieved 92.8% and 98% accuracy for datasets with 1,600 and 250 developers to distinguish, respectively.

  • Python: Alsulami et al. [19] suggested to use tree-based LSTM to derive implicit features from ASTs, achieving 88.9% accuracy in distinguishing among 70 authors.

  • Java: Yang et al. [20] reported 91.1% accuracy for a dataset of 40 authors using neural networks. Instead of a commonly used stochastic gradient descent optimizer, the authors trained the network with particle swarm optimization [28], improving the performance by 15%.

The approaches differ not only in their target languages and suggested models, but also in the datasets used for the evaluation. Four sources of data are used in previous work:

  • Code examples from books ([1, 4]). They were used before the era of easily available open-source projects, due to the lack of other sources.

  • Students’ assignments ([22, 23, 5, 29]). Often, researchers are not allowed to publish these datasets (mostly from university courses) due to privacy or intellectual property issues. The lack of published data complicates the comparison of results across works.

  • Solutions to programming competitions ([3, 24, 19, 25]). Researchers mostly work with data from Google Code Jam (GCJ), an annual competition held by Google since 2008.

  • Single-author open-source projects ([15, 26, 20, 13]). With the increasing popularity of GitHub, this has become the major source of data. Researchers avoid projects with multiple authors because in this case even small fragments of code might be a result of a shared work.

The syntactic features derived from the code’s AST improved the results for authorship attribution [3, 19] as well as for other software engineering tasks, such as code summarization [30], method name prediction [21], and clone detection [31].

Compared to real-world data, where a programmer often works in multiple languages and projects, existing datasets are limited to a single language and one project per author. To overcome this limitation, the models should work with different programming languages in the same manner. Following this idea, we decided to build a language-independent model, based on syntactic features, that works on par with prior studies.

3 Models

Our first goal is to develop an authorship attribution solution that is language-agnostic and achieves accuracy comparable to state-of-the-art approaches.

To apply machine learning methods to code, we should transform it into a numerical form called representation. While some works use explicitly designed language-specific features [3, 20], we represent the code using the path-based representation [32] to be able to work with code in various programming languages uniformly.

A common way to use path-based representation is code2vec neural network [21], suggested by the same authors. However, code2vec requires a significant number of samples for each author to infer meaningful information. Thus, alongside with the neural network, we employ a random forest model, trained on similar features. The random forest model shows better performance for small datasets, but generalizes worse for larger ones. In the rest of this section, we describe our models and define related concepts.

3.1 Definitions

Abstract Syntax Tree. An abstract syntax tree (AST) is a representation of a program’s code in the form of a rooted tree. Nodes of an AST correspond to different code constructs (e.g., math operations and variable declarations). Children of a node correspond to the smaller constructs that comprise its code. Different constructs are represented with different node types. An AST omits parentheses, tabs, and other formatting details. Figure 3 shows an example of a code fragment and the corresponding AST.

Fig. 3: A code example (a) and its corresponding AST (b)
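Since the figure itself is not reproduced here, the AST concept can be illustrated with Python's built-in ast module. This sketch parses a square(x) function, similar in spirit to the paper's example, and prints every node type with its children; it is only a stand-in for the figure, not the paper's parsing pipeline.

```python
import ast

# Illustrative only: Python's built-in AST for a function `square`
# with a single argument `x`, mirroring the paper's example.
source = "def square(x):\n    return x * x\n"
tree = ast.parse(source)

# Print every node type together with the types of its children.
for node in ast.walk(tree):
    children = [type(child).__name__ for child in ast.iter_child_nodes(node)]
    print(type(node).__name__, "->", children)
```

Note that formatting details such as parentheses never appear in the output; only structural node types like FunctionDef and BinOp do, which is exactly the property the definition above describes.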

Path in AST. A path is a sequence of connected nodes in an AST. Start and end nodes of a path may be arbitrary, but we only use paths between two leaves of the AST, to conform with code2vec and to benefit from working with the smaller pieces of code that such paths represent. Following Alon et al. [21], we denote a path by a sequence of node types and directions (up or down) between consequent nodes. In Figure 3b, an example of a path between the leaves of an AST is shown with red arrows. In the notation of node types and directions, this path can be denoted as follows:

SimpleName ↑ FunctionDeclaration ↓ SingleVariableDeclaration ↓ SimpleName

Path-context. The path-based representation operates with path-contexts, which are triples consisting of (1) a path between two nodes and the tokens corresponding to its (2) start and (3) end nodes. From the human perspective, a path-context represents two tokens in code and a structural connection between them. This allows a path-context to capture information about the structure of the code. Prior works show that code structure also carries semantic information [21, 30]. Figure 3b highlights the following path-context:

(square, SimpleName ↑ FunctionDeclaration ↓ SingleVariableDeclaration ↓ SimpleName, x)

This path-context represents a declaration of a function named square with a single argument named x. The path in this path-context encodes the following information: it passes through the Function Declaration and Single Variable Declaration nodes, and both tokens are attached to Simple Name AST nodes.

Path-based representation. The path-based representation treats a piece of code as a bag of path-contexts. For larger pieces of code, the number of path-contexts in the bag might be large: if a fragment of code contains n tokens, its AST contains n leaves and n(n−1)/2 path-contexts. The number of path-contexts can be reduced by setting a limit on the length (i.e., the number of vertices in the path) or width (i.e., the difference in leaf indices between the path’s endpoints) of the paths. If the maximum allowed width is w, then the number of path-contexts is at most n·w. These limits on length and width are hyperparameters and are determined empirically. We set them to 8 and 3, respectively, as was done in prior works [21, 32].
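The extraction of leaf-to-leaf paths with length and width limits can be sketched as follows. This is a toy illustration (the Node class and the abbreviated node type names are our own), not the authors' actual tooling:

```python
from itertools import combinations

# Toy AST node; type names mimic the paper's example but the class
# itself is our own illustration, not the authors' tooling.
class Node:
    def __init__(self, typ, children=()):
        self.typ, self.children = typ, list(children)

def leaves_with_paths(root):
    """Collect (leaf, root-to-leaf path) pairs, leaves in left-to-right order."""
    out = []
    def walk(node, path):
        if not node.children:
            out.append((node, path + [node]))
        for child in node.children:
            walk(child, path + [node])
    walk(root, [])
    return out

def leaf_to_leaf_paths(root, max_len=8, max_width=3):
    """Enumerate leaf-to-leaf paths limited by length (nodes on the path)
    and width (difference of the leaves' left-to-right indices)."""
    leaves = leaves_with_paths(root)
    result = []
    for (i, (a, pa)), (j, (b, pb)) in combinations(enumerate(leaves), 2):
        if j - i > max_width:
            continue
        k = 0  # index of the lowest common ancestor on both root paths
        while k + 1 < min(len(pa), len(pb)) and pa[k + 1] is pb[k + 1]:
            k += 1
        path = pa[k:][::-1] + pb[k + 1:]  # up from `a` to the LCA, down to `b`
        if len(path) <= max_len:
            result.append((a.typ, [n.typ for n in path], b.typ))
    return result

# AST for `def square(x): ...`, following the paper's description.
ast_root = Node("FunctionDecl", [
    Node("SimpleName"),                            # token: square
    Node("SingleVarDecl", [Node("SimpleName")]),   # token: x
    Node("Block", [Node("Return")]),
])
for start, path, end in leaf_to_leaf_paths(ast_root):
    print(start, path, end)
```

The first enumerated path corresponds to the square/x path-context described above: SimpleName up to FunctionDecl, then down through SingleVarDecl to SimpleName.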

To make a bag of contexts produced by a path-based representation suitable for training a model, we need to transform it into a numerical representation. For the random forest model, the transformation is done by computing term-frequencies of paths and tokens. For the neural network model, an embedding translates paths and tokens into numerical vectors. Since the path-based representation does not require any specific properties from the programming language, both PbRF and PbNN are language-agnostic. The following subsections cover both cases in more detail.
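A minimal sketch of the term-frequency transformation used for the random forest input might look like this; the triple format and the tiny vocabularies are simplifying assumptions, not the paper's exact pipeline:

```python
from collections import Counter

# Sketch: term frequencies of tokens and paths for PbRF's input vector.
def term_frequencies(path_contexts, token_vocab, path_vocab):
    """Turn a bag of (start_token, path, end_token) triples into a
    frequency vector over the token and path vocabularies."""
    tokens, paths = Counter(), Counter()
    for start, path, end in path_contexts:
        tokens[start] += 1
        tokens[end] += 1
        paths[path] += 1
    total_tokens = sum(tokens.values()) or 1
    total_paths = sum(paths.values()) or 1
    return ([tokens[t] / total_tokens for t in token_vocab]
            + [paths[p] / total_paths for p in path_vocab])

# A file with two path-contexts over a two-token, two-path vocabulary.
bag = [("square", "P1", "x"), ("square", "P2", "x")]
vec = term_frequencies(bag, ["square", "x"], ["P1", "P2"])
print(vec)  # [0.5, 0.5, 0.5, 0.5]
```

In a real setting the vocabularies span the whole corpus and the resulting vector is stored sparsely, as discussed in the next subsection.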

3.2 PbRF (random forest model)

The random forest model was designed to work when the number of samples for each author is rather small, as the usage of random forest has proved to be effective in this setup [3, 25]. Random forest does not allow training an embedding of path-contexts, thus, instead of combining paths and tokens into path-contexts, we use raw term frequencies of tokens and paths as features.

If a set of documents contains T unique tokens and P unique paths, the random forest model takes a sparse vector of size T + P as an input. The size of such a vector might be significant (up to millions), with some features being unimportant for identifying the author. Unimportant features create additional noise for the model, reduce its performance, and increase memory usage. For this reason, we employed feature filtering to reduce the effect of the dimensionality problem. As in previous works on authorship attribution [33, 34, 3], we used filtering based on mutual information. The mutual information of a feature F and an author A can be expressed as:

I(F; A) = H(A) − H(A | F),

where H is Shannon entropy [35], and the value of H(A) does not depend on a specific feature. For the task of authorship identification, we interpret it as follows: the higher the mutual information (i.e., the lower H(A | F)), the better one can recognize the author based on the value of the given feature. For example, if H(A | F) is 0, then the author is always the same for a fixed value of feature F.

Feature selection based on the mutual information criterion ranks all the features by their importance and takes the fraction k of the most important ones, where k is a hyperparameter determined empirically during the evaluation process by trying various values. This approach does not take into account the dependencies among features; for example, if there are two identical high-ranked features, we would take both and miss some other feature. To avoid this problem, we could have added features one by one and recomputed mutual information after each step, but at the scale of millions of features this procedure would be too costly. Hence, we do not tackle this limitation.
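The ranking can be sketched directly from the standard definition I(F; A) = H(A) − H(A | F) for discrete features; the helper names and toy data below are ours, not from the paper:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def mutual_information(feature_values, authors):
    """I(F; A) = H(A) - H(A|F) for a discrete feature."""
    n = len(authors)
    h_conditional = 0.0
    for value, count in Counter(feature_values).items():
        subset = [a for f, a in zip(feature_values, authors) if f == value]
        h_conditional += count / n * entropy(subset)
    return entropy(authors) - h_conditional

def select_top_features(feature_columns, authors, ratio=0.1):
    """Keep the top `ratio` share of features, ranked by mutual information."""
    order = sorted(range(len(feature_columns)),
                   key=lambda j: mutual_information(feature_columns[j], authors),
                   reverse=True)
    return order[:max(1, int(len(order) * ratio))]

authors = ["a", "a", "b", "b"]
columns = [[0, 0, 1, 1],   # perfectly identifies the author: I = H(A)
           [0, 1, 0, 1]]   # carries no information about the author: I = 0
print(select_top_features(columns, authors, ratio=0.5))  # [0]
```

The first toy feature pins down the author exactly (H(A | F) = 0, so I = H(A) = 1 bit), while the second is uninformative, so only the first survives selection.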

3.3 PbNN (neural network model)

While in the cybersecurity domain one would assume a shortage of available data due to its sensitivity and the lack of public datasets, in software engineering we should be able to work with large projects. To achieve better performance on larger datasets, we adapted the neural network called code2vec [32]. Compared to classical machine learning methods, neural networks can derive more complex concepts and relationships from structured data when given enough training samples.

Fig. 4: Architecture of the PbNN

Figure 4 shows the architecture of the network. The network takes a bag of path-contexts as an input. The number of path-contexts, even with restrictions on path length and width in place, might be tremendous. To speed up the computations, at each training iteration we only take up to 500 random path-contexts for each sample.

Further, we transform path-contexts into a numerical form that can be passed to the network. We embed a path and both tokens into d-dimensional vectors and concatenate them to form a 3d-dimensional context vector. Embeddings for tokens and paths are matrices of size T × d and P × d, respectively. At first, the matrices are random; their values are adjusted during the network training process. The size of the embedding vectors might be set separately for paths and tokens, but, as the vocabulary sizes are of roughly the same order of magnitude, it is easier to tune one hyperparameter instead of two, so we set both of them to d.

Then, a fully-connected layer with a tanh activation function transforms raw path-context vectors of size 3d into context vectors of size d. This step is not obligatory, but it speeds up convergence of the model [32]. After that, a piece of code is represented by a set of d-dimensional vectors corresponding to its path-contexts.

At the next step, we aggregate vectors of individual path-contexts into a representation of a code snippet through an attention mechanism [36]. We use a simple version of attention, represented by a single trainable vector a. For the path-context vectors c_1, …, c_k, the attention values are computed as dot products:

s_i = c_i · a

Weights α_i for the context vectors are a softmax of the attention values:

α_i = exp(s_i) / Σ_j exp(s_j)

Then, the representation v of the code snippet is:

v = Σ_i α_i c_i

Finally, a fully-connected layer with a softmax activation outputs author predictions.

The number of the PbNN’s parameters is on the order of (T + P) · d. Since T + P is usually large (tens of thousands to millions), the number of samples required to train the model is also significant.
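A forward pass of the attention-based aggregation can be sketched in NumPy with randomly initialized stand-ins for the trained parameters; the sizes are illustrative (the paper tunes the embedding size to 64), and this omits the embedding and tanh layers:

```python
import numpy as np

# Random stand-ins for trained weights; sizes are illustrative only.
rng = np.random.default_rng(0)
d, k, num_authors = 8, 5, 3                 # embedding size, contexts, classes
context_vectors = rng.normal(size=(k, d))   # outputs of the tanh layer
attention = rng.normal(size=d)              # single trainable attention vector
output_layer = rng.normal(size=(d, num_authors))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

weights = softmax(context_vectors @ attention)  # attention weights, shape (k,)
snippet_vector = weights @ context_vectors      # weighted sum, shape (d,)
author_probs = softmax(snippet_vector @ output_layer)
print(author_probs)  # a probability distribution over the authors
```

The snippet vector is a convex combination of context vectors, so contexts with larger dot products against the attention vector dominate the representation.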

4 Evaluation on existing datasets

We evaluated our models on the publicly available datasets for Java, C++, and Python used in recent work [3, 19, 20] (see Section 2) and compared the accuracy of PbRF and PbNN to the models proposed in these papers. Table I shows statistical information about the datasets and Table II presents the results.

To compare the results of different models when the results are obtained through multiple runs (i.e., folds in cross-validation), we apply the Wilcoxon signed-rank test [37] to the per-run accuracy values. When only the mean accuracy is available (which is the case for the previous work), or the number of runs is too small, we compare mean values.
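As a sketch of how paired per-fold accuracies feed this test, the following computes the signed-rank statistic from scratch; in practice one would use scipy.stats.wilcoxon, which also yields the p-value, and the accuracy values here are made up:

```python
# Signed-rank statistic for paired per-fold accuracies (percent values
# below are made up). This sketch only shows the mechanics of the test.

def wilcoxon_statistic(acc_a, acc_b):
    """W = min(W+, W-) over ranked absolute differences; zero diffs dropped."""
    diffs = [a - b for a, b in zip(acc_a, acc_b) if a != b]
    ranked = sorted(diffs, key=abs)
    ranks = {}
    i = 0
    while i < len(ranked):          # assign average ranks to ties in |diff|
        j = i
        while j < len(ranked) and abs(ranked[j]) == abs(ranked[i]):
            j += 1
        for idx in range(i, j):
            ranks[idx] = (i + 1 + j) / 2   # ranks are 1-based
        i = j
    w_plus = sum(r for idx, r in ranks.items() if ranked[idx] > 0)
    w_minus = sum(r for idx, r in ranks.items() if ranked[idx] < 0)
    return min(w_plus, w_minus)

# Four folds; the first model wins three of them.
print(wilcoxon_statistic([97, 98, 96, 99], [95, 96, 97, 95]))  # 1.0
```

A small statistic (almost all rank mass on one side) is evidence against the null hypothesis that the two models' per-fold accuracies come from the same distribution.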

C++ [3] Java [20] Python [19]
Number of authors 1,600 40 70
Number of files 14,400 3,021 700
Unique tokens 30,200 36,700 4,300
Unique paths 169,900 7,900 46,300
TABLE I: Datasets used in previous works. The number of paths is provided for the maximum path length and width of 8 and 3, respectively
Language C++ Python Java

Caliskan et al. [3]
0.928 0.729 N/A
Alsulami et al. [19] N/A 0.889 N/A
Yang et al., SGD [20] N/A N/A 0.760
Yang et al., PSO [20] N/A N/A 0.911
This work, PbNN 0.415 0.617 0.981
This work, PbRF 0.927 0.937 0.97
TABLE II: Mean accuracy by approach and dataset

4.1 Evaluation on C++

The C++ dataset was introduced by Caliskan et al. [3]. It contains solutions of 1,600 participants to 9 problems from Google Code Jam 2012, making it the largest dataset in terms of the number of authors.

Since every author contributed nine files, the dataset is perfectly balanced: It contains the same number of code samples for every person. For testing, solutions for one problem are held out and the model trains on solutions for the rest of the problems. This procedure is repeated for each problem and the final result is an average accuracy of determining a solution’s author.
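The leave-one-problem-out procedure can be sketched as follows; train and evaluate are placeholders for a real attribution model and accuracy metric, and the toy ones in the usage example are purely illustrative:

```python
# Leave-one-problem-out cross-validation sketch.
def cross_validate_by_problem(samples, train, evaluate):
    """samples: list of (features, author, problem_id) triples."""
    problems = sorted({p for _, _, p in samples})
    accuracies = []
    for held_out in problems:
        train_set = [(x, a) for x, a, p in samples if p != held_out]
        test_set = [(x, a) for x, a, p in samples if p == held_out]
        model = train(train_set)
        accuracies.append(evaluate(model, test_set))
    return sum(accuracies) / len(accuracies)

# Toy usage: the "model" just remembers which authors it saw in training.
samples = [("f1", "alice", "A"), ("f2", "bob", "A"),
           ("f3", "alice", "B"), ("f4", "bob", "B")]
train = lambda data: {a for _, a in data}
evaluate = lambda model, data: sum(a in model for _, a in data) / len(data)
print(cross_validate_by_problem(samples, train, evaluate))  # 1.0
```

Grouping folds by problem rather than by random split ensures the model never sees any solution to the held-out problem during training.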

Caliskan et al. [3] reported 92.8% mean accuracy after 9-fold cross-validation. PbNN and PbRF achieve 41.5% and 92.7% average accuracy, respectively. The neural model performs poorly because of overfitting: the number of available data points is too small to train the much larger set of the network’s internal parameters. PbRF’s accuracy is marginally lower than that of Caliskan et al. (92.7% vs. 92.8%), but the difference is smaller than the standard deviation computed over the cross-validation folds (0.8%) and is thus indistinguishable from random noise. We conclude that PbRF is on par with the previous best result.

4.2 Evaluation on Python

The Python dataset also contains Google Code Jam solutions. It was collected and introduced by Alsulami et al. [19] and consists of solutions to 10 problems implemented by 70 authors. This dataset is also perfectly balanced. During cross-validation, the model is trained on 8 problems and validated on 2 other problems that are initially held out. The best reported average accuracy is 88.9%.

On this dataset, our models achieve 61.7% (PbNN) and 93.7% (PbRF). Similarly to the C++ dataset, PbRF shows better performance compared to PbNN because the number of available samples is too small for efficient training of the neural network.

4.3 Evaluation on Java

The Java dataset, introduced by Yang et al. [20], consists of 40 open source projects, each authored exclusively by a single developer. Each project contains 11 to 712 files with a median value of 54, totaling 3,021 files overall. The dataset is unbalanced, because the number of samples per author varies by person.

For the evaluation, we split the dataset into 10 folds and perform cross-validation, similarly to the work by Yang et al. [20]. Ideally, to compare the performance of our model to theirs in the most precise manner, we would use an identical split of the dataset into folds for our evaluation. However, the original split into folds is not available, and we created our own with a fixed random seed.

The model by Yang et al. achieves an average accuracy of 91.1% using 10-fold cross-validation. Both our models reach an accuracy of more than 97%. The previous result lies outside the standard deviation range in both cases, indicating that the difference is statistically significant. Although the median number of samples per author is only 54, the neural network shows high accuracy. Even though the average accuracy of the neural network model is slightly higher than that of the random forest, the statistical test yields a p-value of 0.07, so this difference is not statistically significant at the 0.05 level.

4.4 Hyperparameters

Both our models have parameters that should be fixed before the training phase, i.e., hyperparameters. These hyperparameters are the number of trees and the percentage of features left after the feature selection (feature ratio) for the random forest and the size of the embedding vector for the neural network.

We tuned these hyperparameters using grid search [38]. We found that the optimal values are the same across the datasets. For the random forest model, increasing the number of trees improves performance until this number reaches 300, after which performance plateaus. The optimal feature ratio for all datasets lies between 5% and 10%. For the neural network, increasing the embedding dimensionality results in a significant growth in accuracy until the size of the vector reaches 64.
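A minimal grid search over the two PbRF hyperparameters might look like this; the evaluate lambda is a synthetic stand-in that peaks near the reported optima (300 trees, a feature ratio around 7%), not a real training run:

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustively score every combination of hyperparameter values."""
    best_score, best_params = float("-inf"), None
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Synthetic objective peaking at n_trees=300, feature_ratio=0.07.
params, _ = grid_search(
    lambda p: -abs(p["n_trees"] - 300) - abs(p["feature_ratio"] - 0.07),
    {"n_trees": [100, 300, 500], "feature_ratio": [0.05, 0.07, 0.10]},
)
print(params)
```

In practice each evaluate call would train the model on a training split and return validation accuracy, which makes the search cost grow multiplicatively with the grid size.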

5 Limitations of current evaluations

As was shown in Section 4, PbRF demonstrates comparable or better results than recent work on authorship attribution for the C++, Python, and Java datasets. PbNN achieves 98% accuracy on the Java dataset. While this result could be interpreted as superiority of our models over the prior state of the art, especially considering their language agnosticism, we have identified a number of limitations posed by this evaluation technique. In particular, using a single accuracy value to compare complex approaches to problems motivated by practical needs, such as authorship attribution in software engineering, is a one-sided solution. In this section, we discuss the limitations of such an evaluation.

Academic work on authorship attribution is motivated by the practical needs such as detection of plagiarism [39, 40, 26], detection of ghostwriting [22, 15], and attribution of malware [2, 3, 23, 5]. In a perfect world, introduced authorship attribution approaches should be evaluated on real-world data. However, such real-world data is privacy-sensitive and is seldom publicly available. For this reason, to show how models behave and compare against each other, researchers create datasets from available data sources. Even though these datasets try to mimic the data found in practical applications, there are major differences. To illustrate them, we introduce the concept of work context, i.e., the environment that surrounds programmers when they write code. The work context includes the following:

  • Files: Fragments of code in the same file are usually related implementation-wise: e.g., use the same fields or call the same methods.

  • Parts of a codebase: A codebase usually contains logically connected code that implements specific features and is organized into packages, modules, or other components. This logical connection often implies a lower-level connection observed in code: calls of methods, creation of objects, similar names of entities.

  • Project domains: The domain of the task (e.g., an Android application) influences names, used libraries, implemented features, and architectural patterns that are used more commonly.

  • Projects: The project itself might have internal naming conventions and utility components that are called from different parts of the codebase. Moreover, some companies have their own style guides for programming languages that affect naming conventions, formatting, and the preferred use of specific language constructions. An example of this is the Google Style Guides.

  • Set of tools: Integrated development environments (IDEs) or text editors, version control systems, as well as build and deployment tools may influence the way developers write code. For example, a recent survey of programmers concluded that merely using GitHub in their projects affects their development practices [41].

This list is not complete and could easily be extended, but even a single one of the aforementioned items might be significant for the task of authorship attribution. Practical applications of authorship attribution often imply that the model should be trained on code written in one work context and tested in another, or should distinguish between developers working in the same context. However, datasets used in prior works do the opposite: there is a difference between the authors’ work contexts (e.g., different projects) and no difference between the work contexts of the same developer in the training and evaluation sets.

Another concern is that existing datasets do not consider the impact of collaboration on the code. All of them consist of projects developed by a single programmer. However, projects studied in the software engineering domain are usually developed by teams. This collaborative work may introduce additional complexity for the authorship attribution task, but it is not reflected in prior works.

Moreover, it is not clear whether developers’ individual coding practices remain the same over time [42]. It is reasonable to think that changes in coding practices (e.g., programming style, used libraries, naming conventions, development process) can significantly influence the performance of the models. In existing datasets, all code written by a single author belongs to roughly the same period of time (e.g., one project, one competition, or assignments from one course), and there is no temporal division between the evaluation and training sets. However, for practical problems, one might need to train a model on historical data and apply it to new samples. This temporal aspect may introduce a potentially significant difference in individual coding practices between the code used for training and testing.

Two prior studies consider the evolution of programmers’ style as a challenge for the authorship attribution task [42, 3]. Burrows et al. evaluated the difference between six student assignments, showing that students’ coding style changes over time [42]. Caliskan et al. trained a model on solutions from Google Code Jam (GCJ) 2012 and evaluated it on a single problem from GCJ 2014. Their experiment did not reveal any major differences in accuracy compared to evaluating on a problem from the same GCJ 2012 [3]. These results are contradictory, but both studies operated with small datasets and in domains different from real-world projects, so further research on this topic is required.

We conclude that there is a gap (at least a theoretical one) between the existing datasets and what can be collected from and used for real-world applications. In particular, there are differences in terms of work context, effects of developer collaboration, and changes over time. In the following sections, we suggest a novel approach to data collection that allows quantitative evaluation of the impact of both temporal and contextual issues.

6 Collecting realistic data

As previously discussed, existing studies on authorship attribution are evaluated with data that differs from the data in practical tasks in terms of work context and separation of samples in time. To quantitatively evaluate the impact of these dissimilarities on the accuracy of authorship attribution techniques, we developed a new approach to data collection. It uses Git repositories as its data source and, unlike existing datasets built from open-source data [43, 20, 19, 13], overcomes the limitation of a single author per project.

6.1 Method of data collection

We suggest a new approach to collecting testing data for the authorship attribution task. The approach works with any Git project, without restrictions on the number of developers. In particular, using Git as the main data source allows taking data from GitHub, the world’s largest repository hosting platform with more than 100 million repositories and 30 million users [44]. Git repositories, and GitHub in particular, are a uniquely rich source of data for a variety of recent software engineering research efforts [45, 46, 47, 48]. Table III displays the ten largest open repositories on GitHub by number of commits. Commits are atomic units of contribution in Git projects, and a commit usually has a single author. This authorship information, associated with every change recorded in a Git repository, makes it a particularly rich source of data for authorship attribution studies.

Repository | Commits (×1000) | Language
torvalds/linux | 782 | C
LibreOffice/core | 428 | C++
liferay/liferay-portal | 283 | Java
jsonn/pkgsrc | 266 | Makefile
freebsd/freebsd | 254 | C
JetBrains/intellij-community | 230 | Java
cms-sw/cmssw | 194 | C++
openbsd/src | 192 | C
NixOS/nixpkgs | 154 | Nix
Wikia/app | 152 | PHP
TABLE III: GitHub repositories with the largest number of commits

The first step of our method is to traverse the history of a repository to gather individual commits. Then, we need to identify commits authored by the same developer. This is not a trivial task, because a developer can work within one repository under multiple aliases, using different names and emails. Even though there are prior studies of this problem [49, 50, 51], their methods either make strong assumptions or are probabilistic to some extent. To avoid possible mistakes, we decided to merge identities in a deterministic way and then manually clean the results.

To merge user identities, we create a bipartite graph of author names and emails, connecting vertices that appear together in a single commit. Then, we exclude company-wide emails and stub names (e.g., “unknown”) by inspecting the vertices with the highest numbers of connections. Afterwards, we detect connected components in the graph; each component corresponds to the names and emails that most likely belong to one author. To clean the results, we manually merge components that belong to the same person. Finally, we label commits with component indices.
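The deterministic merging step can be sketched in Python as follows (a simplified, hypothetical implementation using a union-find structure over name and email vertices; the filtering of stub names and company emails and the manual cleaning are omitted):

```python
def merge_identities(commits):
    """Label each (name, email) commit pair with an author component index.

    Names and emails that ever appear together in a commit are linked;
    connected components of this bipartite graph are treated as one author.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Connect the name vertex and the email vertex of every commit.
    for name, email in commits:
        union(("name", name), ("email", email))

    # Assign a stable component index to every commit.
    roots, labels = {}, []
    for name, email in commits:
        root = find(("name", name))
        labels.append(roots.setdefault(root, len(roots)))
    return labels
```

For example, two aliases that share an email address collapse into one component, while an unrelated author keeps a separate label.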

In the second step, we split each commit into changes of fixed granularity (e.g., a change to a single class, method, or field). In this study, we use method-level changes. Within the scope of a single commit, methods can be renamed or moved to other files. We used GumTree [52] to precisely track such changes in Java code, as well as simple changes to a method's body. As a result, we get the set of all method changes made during the project's development. Afterwards, the extracted data can be grouped into datasets with different properties.

6.2 Collected datasets

We applied the described approach to extract data from the IntelliJ IDEA Community Edition project, the second-largest Java project on GitHub. At the time of processing, the project contained about 240,000 commits by 500 developers. These commits comprise about two million individual method changes of three types: creation of a method, its deletion, and a modification of its body or signature. Modifications, unlike method creations, cannot be processed directly by authorship identification models: newly added code fragments might be incomplete, and the concrete modifications might be scattered across the method body. Moreover, the author of the original code might differ from the one who modifies the method, which makes it impossible to define a sole author of the code fragment. In the datasets designed in this work, we only use method creations, because they contain new code fragments implemented by a single person and can be labeled accordingly. However, attributing method modifications is an interesting task for future research. In the IntelliJ Community repository, the 100 most active developers made 98% of the changes, and the 50 most active made 90%. Out of all the changes, 700,000 are method creations.

To quantitatively evaluate the impact of work context and the evolution of coding practices on the performance of authorship attribution, we created two datasets from the IntelliJ IDEA data. To make the evaluation conditions as close to practical tasks as possible, we would ideally have processed several projects and split them between the training and testing sets. However, at this point it is unclear how to define similarity between work contexts of different projects, so we would not be able to run several experiments with an increasing degree of context difference to perform a quantitative evaluation. This left us with one project and multiple datasets.

6.2.1 Dataset with gradual separation of work context

The purpose of this dataset is to measure the influence of variation in developers' work context on the quality of authorship attribution. To achieve this, we need multiple pairs of training and evaluation samples that differ only in their work context. More specifically, the pairs should contain the same code fragments, split differently between the training and testing parts.

To control the degree of difference in work contexts, we use the project's file tree. Figure 5 shows an example of such a tree: it consists of folders, with edges between a folder and its contents; leaves in the tree correspond to files. For Java code, we are interested only in Java source files, identified by the '.java' extension. To reduce the size and depth of the tree, we compress paths of folders with a single sub-folder into single nodes: in Figure 5, the folders 'plugins', 'src', and 'main' are compressed into a single node 'plugins/src/main'. This operation preserves the structure of the tree.
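The tree construction and path compression can be sketched as follows (a minimal illustration over nested dictionaries; the paths are hypothetical):

```python
def build_tree(paths):
    """Build a nested-dict file tree from slash-separated file paths;
    files map to None, folders to dicts."""
    root = {}
    for path in paths:
        node = root
        parts = path.split("/")
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = None
    return root


def compress(tree):
    """Merge chains of folders with a single sub-folder into one node
    (e.g. 'plugins/src/main'), preserving the tree structure."""
    result = {}
    for name, child in tree.items():
        while isinstance(child, dict) and len(child) == 1:
            (only, sub), = child.items()
            if not isinstance(sub, dict):
                break  # a folder holding a single file is kept as-is
            name, child = name + "/" + only, sub
        result[name] = compress(child) if isinstance(child, dict) else None
    return result
```

A folder with two sub-folders (like 'platform' below) stays a branching node, while a single-child chain collapses into one compressed node.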

A file tree for a Java codebase resembles the structure of its packages. Usually, classes in one package are logically connected and refer to each other; thus, they have similar work contexts. At a higher level of abstraction, this also applies to classes in different sub-packages of the same package. In Figure 5, the class 'API-A' has a work context similar to 'API-B', because they are in the same package, and a less similar context to 'Impl-A' from 'platform-impl'. Still, these two are much closer to each other than to any file from the 'plugins' package, since both are used to implement platform features and probably even depend on one another.

From this observation, we derive a way to measure the similarity of the work contexts of two files: it is defined as the depth of their lowest common ancestor in the file tree. In Figure 5, the similarities between 'API-A' and the other classes are shown with arrows (the depth of the root is considered to be 0). For a training-evaluation split, we then define the similarity of the work contexts of the two parts as the highest value of pairwise similarity between their files.
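This similarity measure can be sketched as follows (the file paths are hypothetical, and compressed folder chains are assumed to appear as single path components):

```python
def context_similarity(path_a, path_b):
    """Depth of the lowest common ancestor of two files in the
    (compressed) file tree; the root has depth 0."""
    depth = 0
    # Compare folder components only; the file name itself is ignored.
    for x, y in zip(path_a.split("/")[:-1], path_b.split("/")[:-1]):
        if x != y:
            break
        depth += 1
    return depth


def split_similarity(train_files, eval_files):
    """Similarity of a training-evaluation split: the highest pairwise
    similarity between files from the two parts."""
    return max(context_similarity(a, b)
               for a in train_files for b in eval_files)
```

Two files in the same package get a high similarity, while files whose paths diverge at the root get similarity 0.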

The next step is to create a sequence of data splits with increasing similarity, or depth of split. To preserve the distribution of authors at each level, we find splits for different authors independently and merge them afterwards. We fix a fraction of training samples f, a depth of split d, and an author a. Then, we collect all folders at depth d and files at depth d or less. Afterwards, we greedily divide them into training and evaluation parts, trying to get the training fraction as close to f as possible. When a folder is put into the training (or testing) part, all the methods created by a in its subtree go into this part.
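The greedy division of subtrees can be sketched as follows (a simplified, hypothetical implementation for one author; the constraint on mutual information between consecutive levels is omitted here):

```python
def greedy_split(subtree_counts, train_fraction):
    """Greedily divide subtrees at the chosen depth between the training
    and evaluation parts. `subtree_counts` maps a folder (or shallow
    file) to the number of methods the author created in it; the goal is
    a training share close to `train_fraction`."""
    target = train_fraction * sum(subtree_counts.values())
    train, taken = set(), 0
    # Consider the largest subtrees first; add a subtree to the training
    # part only if that moves its total size closer to the target.
    for folder, count in sorted(subtree_counts.items(),
                                key=lambda kv: -kv[1]):
        if abs(taken + count - target) < abs(taken - target):
            train.add(folder)
            taken += count
    evaluation = set(subtree_counts) - train
    return train, evaluation
```

Because whole subtrees are assigned to one side, all of an author's methods under a folder land in the same part, which is what separates work contexts.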

Table IV shows an example of correct splits at different depths. Note that splits with smaller values of similarity remain valid for the subsequent levels: if we split the parts randomly without any restrictions, this can result in the same split at all levels, in contrast to our goal of obtaining splits with different values of similarity. To fix this issue, we set a limit on the maximal value of mutual information between consecutive splits. The mutual information shows the degree of randomness of a split with respect to the previous one. In Table IV, the mutual information between consecutive splits is 0, because every time the training and evaluation parts are split into halves.
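The mutual information between two splits can be computed as follows (a sketch; representing each split as a dict from folder to its assigned part is an assumption of this illustration):

```python
from collections import Counter
from math import log2


def split_mutual_information(split_a, split_b):
    """Mutual information (in bits) between two train/test assignments
    of the same folders (dicts mapping folder -> 'train' or 'test').
    A value of 0 means the second split looks random w.r.t. the first."""
    n = len(split_a)
    joint = Counter((split_a[f], split_b[f]) for f in split_a)
    pa, pb = Counter(split_a.values()), Counter(split_b.values())
    # I(A; B) = sum over cells of p(x, y) * log2(p(x, y) / (p(x) p(y)))
    return sum(c / n * log2(c * n / (pa[x] * pb[y]))
               for (x, y), c in joint.items())
```

Two identical half/half splits give 1 bit of mutual information, while a split that is independent of the previous one gives 0.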

To create this dataset, we took all method creations by the 50 most active developers and applied the described algorithm to them. The file tree of the IntelliJ IDEA project has a depth of 12. Figure 6 shows the distribution of files by depth. Since an increase in the similarity value from d to d+1 only affects files at depth d+1 or more, we vary the depth of split only in the range from 1 to 9, where more than 95% of the files lie.

Because of the restrictions on the mutual information between splits and on the ratio of training samples, for some authors we could not find a suitable separation at every level. For this reason, we filtered out the samples of these authors, which left us with 26 authors out of 50. Finally, we obtained a dataset consisting of about 348,000 samples by 26 authors, split at 9 different levels.

Fig. 5: An example of a project’s file tree with similarities between files
Similarity | Training | Evaluation
1 | plugins/src/main | platform
2 | P-A, P-B, platform-api | P-C, P-D, platform-impl
3 | P-A, P-B, API-A, Impl-A | P-C, P-D, API-B, Impl-B
TABLE IV: Correct splits at different levels of similarity, with equal sizes of the training and evaluation sets
Fig. 6: Fraction of Java files with depth not exceeding d

6.2.2 Dataset with separation in time

This dataset was designed to investigate whether developers' coding practices change over time. The high-level idea is as follows: we pick a set of methods from the IntelliJ IDEA project, sort them by time, split them into 10 folds, then train a model on one fold and evaluate it on the others.

More specifically, we gathered all method-creation events produced by the 20 most active developers. Then, we sorted the methods written by each author by creation time and divided them into ten buckets of equal size. To preserve the same distribution of authors across the buckets, we did the division independently for each programmer. We also evaluated an alternative approach of splitting all the methods simultaneously, but it added too much noise to the data: some programmers joined the project later, and a model trained on earlier folds would have no information about them.
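The per-author division into time-ordered buckets can be sketched as follows (a simplified illustration; representing events as (author, timestamp) pairs and dropping remainders that do not fill a whole fold are assumptions of this sketch):

```python
from collections import defaultdict


def time_folds(methods, n_folds=10):
    """Divide method-creation events into time-ordered folds.
    Each author's events are split independently, so the author
    distribution stays the same in every fold."""
    per_author = defaultdict(list)
    for author, timestamp in methods:
        per_author[author].append((author, timestamp))
    folds = [[] for _ in range(n_folds)]
    for author, events in per_author.items():
        events.sort(key=lambda e: e[1])       # order by creation time
        size = len(events) // n_folds
        for i in range(n_folds):
            folds[i].extend(events[i * size:(i + 1) * size])
    return folds
```

Every fold then contains the same number of events per author, and for each author the events in fold i all precede those in fold i+1.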

The resulting dataset consists of 350,000 methods split into ten equal folds. The folds are ordered in time, and the difference between fold indices can be used as the temporal distance between them. The distribution of authors and the number of samples are uniform across folds.

6.3 Benefits of the data collection technique

The proposed data collection technique enables the collection of datasets that have several major benefits over existing datasets and capture some important effects that are specific to real-world data:

  • Smaller gap between work context of code written by different authors. While different developers still tend to work in different parts of the codebase, naming conventions, internal utility libraries, and the general domain are the same for everyone, since all code originates from the same codebase.

  • Large number of samples available per author. Existing datasets mostly provide up to several hundred code fragments per author. In the IntelliJ IDEA dataset collected with our technique, two developers have a hundred thousand method changes each. The ability to collect multiple contributions by a single author makes the resulting data suitable for studying more fine-grained aspects of authorship attribution, such as the effect of changes in coding practices over time on attribution accuracy.

  • Broader domain of application. Since our data collection technique allows one to collect data from any Git project, it is possible to investigate cross-project or cross-domain authorship attribution.

7 Evaluation on collected datasets

We evaluated both our models (see Section 3) on the IntelliJ IDEA datasets that we created using the previously described technique.

7.1 Separated working contexts

First, we work with separated work contexts (Section 6.2.1). This dataset contains nine different splits of the same large pool of method-creation events, labeled with the method's author, into training and evaluation parts. Each split is parameterized with a depth value, which indicates the maximum possible depth of the lowest common ancestor of files in the training and evaluation sets.

If a model is sensitive to the influence of work context, it should perform better at higher split depths, because as the split depth grows, the training and testing sets become more and more similar. Figure 7 shows the dependence of both models' accuracy on the split depth. Since the number of available samples per author is high (around 10,000 on average) and sufficient for proper training, the neural network model (PbNN) outperforms the random forest model (PbRF).

Fig. 7: Models’ performance on the dataset with the separation of the working context

Figure 7 shows that the accuracy values increase with depth. We tried to eliminate all possible reasons for this other than the difference in work context: the experiments were performed on the same data points, the sizes of the training sets vary by less than 3%, and we trained the models until convergence. Thus, we conclude that work context strongly affects the accuracy of authorship attribution.

In addition to the runs with separated contexts, we performed an experiment where the training and testing parts were not separated at all, meaning that methods from the same file could appear in both sets. This experiment can be seen as an extreme case of non-separated work contexts. Both models performed significantly better than in the previous runs: 60.3% and 45.9% accuracy for PbNN and PbRF, respectively. The gain in accuracy further demonstrates the models' tendency to capture work context.

7.2 Time-separated dataset

To see whether developers' coding practices change over time, which might affect the accuracy of authorship attribution models, we evaluated our models on the collected dataset with folds separated in time (see Section 6.2.2). It contains samples from the 20 most active developers in the IntelliJ IDEA project. For each developer, the data was divided by time into 10 folds of equal size. This way, we preserved the distribution of authors across folds, eliminating all differences between folds except the time when the code was written.

We train a separate model on the data from each fold except the last. Then, each model is tested on the code fragments from all subsequent folds. Thus, for ten folds we get nine trained models and 45 fold predictions. If developers' practices change over time, the accuracy should be lower for more distant folds.
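The evaluation scheme can be sketched as follows (`train_fn` and `accuracy_fn` are hypothetical placeholders standing in for the actual model training and scoring code):

```python
def time_separated_evaluation(folds, train_fn, accuracy_fn):
    """Train one model per fold (except the last) and test it on every
    subsequent fold, giving n*(n-1)/2 accuracy values (45 for ten
    folds), keyed by the (train fold, eval fold) index pair."""
    results = {}
    for i in range(len(folds) - 1):
        model = train_fn(folds[i])
        for j in range(i + 1, len(folds)):
            results[(i, j)] = accuracy_fn(model, folds[j])
    return results
```

The index difference j - i then serves as the temporal distance between the training and evaluation data.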

The results for both models are presented in Figure 10. The neural network (PbNN) outperforms the random forest model (PbRF), with an average difference of 2%; the difference is statistically significant. The graphs show that the accuracy of both models drops as the distance in time grows, which confirms our hypothesis: the evolution of coding practices affects the accuracy of authorship attribution.

(a) PbNN’s performance
(b) PbRF’s performance
Fig. 10: Models’ performance on the dataset with separation in time. Lines are drawn for better readability and do not denote a linear approximation

Based on the obtained results, we computed the mean and standard deviation of the average accuracy for each evaluation fold and for each distance between folds (the difference of their indices). Graphs of both dependencies are presented in Figure 13. For the collected dataset, the standard deviation for a fixed time distance is about 3 times smaller than for a fixed evaluation fold, which means that the time difference between training and evaluation data has a greater impact on a model's accuracy than the actual evaluation data.
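The aggregation behind this comparison can be sketched as follows (a minimal illustration over hypothetical accuracy values, keyed by (train fold, eval fold) pairs):

```python
from statistics import mean, stdev


def accuracy_by_fold_and_distance(results):
    """Aggregate per-pair accuracies two ways: by evaluation fold and by
    index distance between folds, returning (mean, stdev) per group."""
    by_fold, by_dist = {}, {}
    for (i, j), acc in results.items():
        by_fold.setdefault(j, []).append(acc)
        by_dist.setdefault(j - i, []).append(acc)

    def agg(groups):
        return {k: (mean(v), stdev(v) if len(v) > 1 else 0.0)
                for k, v in groups.items()}

    return agg(by_fold), agg(by_dist)
```

Comparing the standard deviations of the two groupings is what reveals whether the distance in time or the specific evaluation fold dominates the variance.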

(a) Mean and standard deviation of accuracy for evaluation folds
(b) Mean and standard deviation of accuracy for the distance between folds
Fig. 13: Dependence of models' accuracy on the fold and on the distance between folds for the dataset split in time

8 Results discussion

8.1 Influence of the working context

In the previous section, we described an experiment with separation of work context between training and evaluation sets. We demonstrated that the models' performance decreases as we train and evaluate them on more distant (in terms of the codebase file tree) pieces of code for each author. Specifically, the accuracy can vary by almost a factor of three depending on how the same samples are divided (see Figure 7).

We conclude that there is a real gap between the datasets used in previous works and the data observed in practical tasks. Specifically, a model's accuracy can drop from 97–98% in one setting to 22.5% in another (see Figure 7). This suggests that, to provide relevant information about the performance of a solution, researchers should use datasets where the training and testing parts for each author belong to different environments.

The proposed dataset with gradual separation of training and evaluation sets can be used by researchers to measure their models' tendency to rely on context-related features rather than on individual developers' traits. The models proposed in this work turned out to have a strong dependency on work context as well, with accuracy dropping from 48% to 22.5% (PbNN) and from 37% to 17.8% (PbRF) between splits at depth 9 and depth 1. To lower the influence of work context on a model's performance, researchers could design context-independent features or add regularization terms.

8.2 Evolution of developers’ coding practices

Evaluation on the dataset with each developer's code samples split in time showed that, as programmers' coding practices evolve, training on older contributions to attribute authorship of new code impairs attribution accuracy. To maximize attribution accuracy in potential real-world scenarios, we should use training data that is as recent as possible and re-train or fine-tune the models as new data samples are gathered.

An interesting finding is that the deviation of accuracy for a fixed index distance between folds is significantly smaller than for a fixed evaluation fold. The same pattern was observed for both the neural network and the random forest models. This difference in deviations means that, for this setup, the models' performance depends more on the recency of the training data than on the specific testing data. It can be interpreted as follows: from the perspective of authorship attribution, the speed of evolution of developers' coding practices remains the same throughout the project's history. However, this might be a feature of the specific project or team organization reflected in the dataset; further research is needed to test the hypothesis in a more general case.

8.3 Threats to validity

The experiment with gradual separation of work context relies on the proposed method of measuring work-context similarity as the depth of the lowest common ancestor in the file tree. Despite the provided rationale for why this is reasonable, the dissimilarities between code in different parts of the codebase might be caused by factors other than work context.

We concluded that the evolution of developers' coding practices strongly affects the accuracy of authorship attribution models. However, the observed drop in performance could also be caused by the evolution of the whole project rather than of individual programmers. Moreover, the observed results are limited to the IntelliJ IDEA data; further research is needed to extend them to the general case.

While the datasets collected from the IntelliJ IDEA repository suit the goal of measuring the influence of work context and of the evolution of developers' coding practices on authorship attribution, additional work is needed to create a dataset for proper evaluation of the models in a practical setting. Such a dataset should comprise several projects with overlapping sets of developers, with the projects divided between training and testing sets.

9 Conclusion

Source code authorship attribution could be useful in the software engineering field for tasks such as software maintenance, software quality analysis, and plagiarism detection. While recent studies in the field report high accuracy values, they use language-dependent models.

We propose two models for source code attribution: PbNN (a neural network) and PbRF (a random forest model). Both models are language-agnostic and work with path-based representations of code. Evaluation on datasets for C++, Python, and Java used in recent works shows that the suggested models are competitive with state-of-the-art solutions. In particular, they improve attribution accuracy on the Java dataset from 91.1% to 97% (PbRF) and 98% (PbNN).

While demonstrating high accuracy, existing works in authorship attribution are evaluated on datasets that might model real-world conditions inaccurately. This might pose a barrier to the adoption of such methods in practical software engineering tools. To formalize the differences, we introduce the concept of work context — the environment that influences the process of writing code, such as the surrounding files, the broader codebase, or team conventions. When work context is taken into account, a significant dissimilarity between academic and practical datasets emerges. Another concern discussed in this work is the evolution of developers' coding practices and its potential impact on the performance of authorship attribution; this topic has not been thoroughly investigated in prior research.

We suggest a novel approach to the creation of authorship attribution datasets. In contrast to prior studies that were limited to projects with a single author, our approach works with any Git project. We used it to process the history of a large Java project (the IntelliJ IDEA Community repository on GitHub) and create several datasets to study the influence of work context and of the evolution of coding practices on the performance of authorship attribution. The first dataset contains 348,000 data points authored by 26 active developers, split at nine different levels of context similarity. The second dataset comprises 350,000 methods created by 20 developers, divided in time into ten equal folds.

Evaluation of our models on the dataset with separation of work context shows that the accuracy of both PbNN and PbRF goes down as the similarity values decrease. As we gradually change the similarity level from maximal to minimal, the accuracy dramatically drops to a low of 22.5%, which is much lower than the 98% achieved on the existing dataset of 40 single-authored projects. In the experiment with folds divided in time, the accuracy drops as the time difference between the training and evaluation folds increases, and the drop can also be significant (more than 3-fold for the most distant folds). We conclude that programmers' coding practices evolve over time, at least in large projects, and this evolution negatively affects the quality of authorship attribution methods.

Our study demonstrates that solutions achieving state-of-the-art or comparable accuracy on existing datasets can perform very differently when put into conditions close to the real world. This should be taken into account when evaluating authorship attribution approaches, especially during the data collection and training/testing division steps.


  • [1] P. W. Oman and C. R. Cook, “Programming style authorship analysis,” in Proceedings of the 17th Conference on ACM Annual Computer Science Conference, ser. CSC ’89.   New York, NY, USA: ACM, 1989, pp. 320–326. [Online]. Available:
  • [2] R. Layton, P. Watters, and R. Dazeley, “Automatically determining phishing campaigns using the uscap methodology,” in 2010 eCrime Researchers Summit, Oct. 2010, pp. 1–8.
  • [3] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, and R. Greenstadt, “De-anonymizing programmers via code stylometry,” in 24th USENIX Security Symposium (USENIX Security 15).   Washington, D.C.: USENIX Association, 2015, pp. 255–270. [Online]. Available:
  • [4]

    G. Frantzeskou, E. Stamatatos, S. Gritzalis, and S. Katsikas, “Source code author identification based on n-gram author profiles,” in

    Artificial Intelligence Applications and Innovations, I. Maglogiannis, K. Karpouzis, and M. Bramer, Eds.   Boston, MA: Springer US, 2006, pp. 508–515.
  • [5] I. Krsul and E. Spafford, “Authorship analysis: Identifying the author of a program,” Computers and Security, vol. 16, pp. 233–257, Dec. 1997.
  • [6] J. Anvik, L. Hiew, and G. C. Murphy, “Who should fix this bug?” in Proceedings of the 28th International Conference on Software Engineering, ser. ICSE ’06.   New York, NY, USA: ACM, 2006, pp. 361–370. [Online]. Available:
  • [7] T. Fritz, J. Ou, G. C. Murphy, and E. Murphy-Hill, “A degree-of-knowledge model to capture source code familiarity,” in Proceedings of the 32Nd ACM/IEEE International Conference on Software Engineering - Volume 1, ser. ICSE ’10.   New York, NY, USA: ACM, 2010, pp. 385–394. [Online]. Available:
  • [8] T. Girba, A. Kuhn, M. Seeberger, and S. Ducasse, “How developers drive software evolution,” in Eighth International Workshop on Principles of Software Evolution (IWPSE’05), Sep. 2005, pp. 113–122.
  • [9] C. Bird, N. Nagappan, B. Murphy, H. Gall, and P. Devanbu, “Don’t touch my code!: Examining the effects of ownership on software quality,” in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ser. ESEC/FSE ’11.   New York, NY, USA: ACM, 2011, pp. 4–14. [Online]. Available:
  • [10] P. Thongtanunam, S. McIntosh, A. E. Hassan, and H. Iida, “Revisiting code ownership and its relationship with software quality in the scope of modern code review,” in 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), May 2016, pp. 1039–1050.
  • [11] F. Rahman and P. Devanbu, “Ownership, experience and defects: A fine-grained study of authorship,” in Proceedings of the 33rd International Conference on Software Engineering, ser. ICSE ’11.   New York, NY, USA: ACM, 2011, pp. 491–500. [Online]. Available:
  • [12] Z. Yin, D. Yuan, Y. Zhou, S. Pasupathy, and L. Bairavasundaram, “How do fixes become bugs?” in Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ser. ESEC/FSE ’11.   New York, NY, USA: ACM, 2011, pp. 26–36. [Online]. Available:
  • [13] C. Zhang, S. Wang, J. Wu, and Z. Niu, “Authorship identification of source codes,” in APWeb/WAIM, 2017.
  • [14] S. Burrows and S. Tahaghoghi, “Source code authorship attribution using n-grams,” ADCS 2007 - Proceedings of the Twelfth Australasian Document Computing Symposium, Jan. 2007.
  • [15] J. Kothari, M. Shevertalov, E. Stehle, and S. Mancoridis, “A probabilistic approach to source code authorship identification,” in Fourth International Conference on Information Technology (ITNG’07), Apr. 2007, pp. 243–248.
  • [16] B. Stein, N. Lipka, and P. Prettenhofer, “Intrinsic plagiarism analysis,” Language Resources and Evaluation, vol. 45, no. 1, pp. 63–82, Mar 2011. [Online]. Available:
  • [17] P. S. Menell, “Api copyrightability bleak house: Unraveling and repairing the oracle v. google jurisdictional mess,” Berkeley Technology Law Journal, April 2017.
  • [18] S. Baltes and S. Diehl, “Usage and attribution of stack overflow code snippets in github projects,” Empirical Software Engineering, vol. 24, no. 3, pp. 1259–1295, Jun 2019. [Online]. Available:
  • [19]

    B. Alsulami, E. Dauber, R. Harang, S. Mancoridis, and R. Greenstadt, “Source code authorship attribution using long short-term memory based networks,” in

    Computer Security – ESORICS 2017, S. N. Foley, D. Gollmann, and E. Snekkenes, Eds.   Cham: Springer International Publishing, 2017, pp. 65–82.
  • [20] X. Yang, G. Xu, Q. Li, Y. Guo, and M. Zhang, “Authorship attribution of source code by using back propagation neural network based on particle swarm optimization,” PLOS ONE, vol. 12, no. 11, pp. 1–18, Nov. 2017. [Online]. Available:
  • [21] U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “A general path-based representation for predicting program properties,” SIGPLAN Not., vol. 53, no. 4, pp. 404–419, Jun. 2018. [Online]. Available:
  • [22] B. S. Elenbogen and N. Seliya, “Detecting outsourced student programming assignments,” J. Comput. Sci. Coll., vol. 23, no. 3, pp. 50–57, Jan. 2008. [Online]. Available:
  • [23] G. Frantzeskou, E. Stamatatos, S. Gritzalis, C. Chaski, and B. Stephen H., “Identifying authorship by byte-level n-grams: The source code author profile (scap) method.” IJDE, vol. 6, Jan. 2007.
  • [24] N. Rosenblum, B. P. Miller, and X. Zhu, “Recovering the toolchain provenance of binary code,” in Proceedings of the 2011 International Symposium on Software Testing and Analysis, ser. ISSTA ’11.   New York, NY, USA: ACM, 2011, pp. 100–110. [Online]. Available:
  • [25] L. Simko, L. Zettlemoyer, and T. Kohno, “Recognizing and imitating programmer style: Adversaries in program authorship attribution,” Proceedings on Privacy Enhancing Technologies, vol. 2018, no. 1, pp. 127–144, 2018. [Online]. Available:
  • [26] M. Shevertalov, J. Kothari, E. Stehle, and S. Mancoridis, “On the use of discretized source code metrics for author identification,” in 2009 1st International Symposium on Search Based Software Engineering, May 2009, pp. 69–78.
  • [27] V. Kalgutkar, R. Kaur, H. Gonzalez, N. Stakhanova, and A. Matyukhina, “Code authorship attribution: Methods and challenges,” ACM Comput. Surv., vol. 52, no. 1, pp. 3:1–3:36, Feb. 2019. [Online]. Available:
  • [28] J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proceedings of ICNN’95 - International Conference on Neural Networks, vol. 4, Nov. 1995, pp. 1942–1948 vol.4.
  • [29] S. Burrows, A. Uitdenbogerd, and A. Turpin, “Comparing techniques for authorship attribution of source code,” Software: Practice and Experience, vol. 44, 01 2014.
  • [30] U. Alon, O. Levy, and E. Yahav, “code2seq: Generating sequences from structured representations of code,” CoRR, vol. abs/1808.01400, 2018. [Online]. Available:
  • [31] D. Perez and S. Chiba, “Cross-language clone detection by learning over abstract syntax trees,” in Proceedings of the 16th International Conference on Mining Software Repositories, ser. MSR ’19.   Piscataway, NJ, USA: IEEE Press, 2019, pp. 518–528. [Online]. Available:
  • [32]

    U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “Code2vec: Learning distributed representations of code,”

    Proc. ACM Program. Lang., vol. 3, no. POPL, pp. 40:1–40:29, Jan. 2019. [Online]. Available:
  • [33] N. Rosenblum, X. Zhu, and B. P. Miller, “Who wrote this code? identifying the authors of program binaries,” in Proceedings of the 16th European Conference on Research in Computer Security, ser. ESORICS’11.   Berlin, Heidelberg: Springer-Verlag, 2011, pp. 172–189. [Online]. Available:
  • [34] X. Meng, “Fine-grained binary code authorship identification,” in Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2016.   New York, NY, USA: ACM, 2016, pp. 1097–1099. [Online]. Available:
  • [35] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical Journal, vol. 27, no. 3, pp. 379–423, 1948. [Online]. Available:
  • [36] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. [Online]. Available:
  • [37] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945. [Online]. Available:
  • [38] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in python,” J. Mach. Learn. Res., vol. 12, pp. 2825–2830, Nov. 2011. [Online]. Available:
  • [39] K. J. Ottenstein, “An algorithmic approach to the detection and prevention of plagiarism,” SIGCSE Bull., vol. 8, no. 4, pp. 30–41, Dec. 1976. [Online]. Available:
  • [40] C. Liu, C. Chen, J. Han, and P. S. Yu, “Gplag: Detection of software plagiarism by program dependence graph analysis,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’06.   New York, NY, USA: ACM, 2006, pp. 872–881. [Online]. Available:
  • [41] E. Kalliamvakou, D. Damian, K. Blincoe, L. Singer, and D. M. German, “Open source-style collaborative development practices in commercial projects using GitHub,” in 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, vol. 1, May 2015, pp. 574–585.
  • [42] S. Burrows, A. L. Uitdenbogerd, and A. Turpin, “Temporally robust software features for authorship attribution,” in 2009 33rd Annual IEEE International Computer Software and Applications Conference, vol. 1, July 2009, pp. 599–606.
  • [43] R. C. Lange and S. Mancoridis, “Using code metric histograms and genetic algorithms to perform author identification for software forensics,” in Proceedings of the 9th Annual Conference on Genetic and Evolutionary Computation, ser. GECCO ’07.   New York, NY, USA: ACM, 2007, pp. 2082–2089.
  • [44] (2018) The State of the Octoverse.
  • [45] C. Bird, P. C. Rigby, E. T. Barr, D. J. Hamilton, D. M. German, and P. Devanbu, “The promises and perils of mining git,” in 2009 6th IEEE International Working Conference on Mining Software Repositories, May 2009, pp. 1–10.
  • [46] M. Allamanis, E. T. Barr, C. Bird, and C. Sutton, “Learning natural coding conventions,” in Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering, ser. FSE 2014.   New York, NY, USA: ACM, 2014, pp. 281–293.
  • [47] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian, “The promises and perils of mining GitHub,” in Proceedings of the 11th Working Conference on Mining Software Repositories, ser. MSR 2014.   New York, NY, USA: ACM, 2014, pp. 92–101.
  • [48] G. G. L. Menezes, L. G. P. Murta, M. O. Barros, and A. van der Hoek, “On the nature of merge conflicts: A study of 2,731 open source Java projects hosted by GitHub,” IEEE Transactions on Software Engineering, pp. 1–1, 2018.
  • [49] C. Bird, A. Gourley, P. Devanbu, M. Gertz, and A. Swaminathan, “Mining email social networks,” in Proceedings of the 2006 International Workshop on Mining Software Repositories, ser. MSR ’06.   New York, NY, USA: ACM, 2006, pp. 137–143.
  • [50] G. Robles and J. M. Gonzalez-Barahona, “Developer identification methods for integrated data from various sources,” SIGSOFT Softw. Eng. Notes, vol. 30, no. 4, pp. 1–5, May 2005.
  • [51] E. Kouters, B. Vasilescu, A. Serebrenik, and M. G. J. van den Brand, “Who’s who in GNOME: Using LSA to merge software repository identities,” in 2012 28th IEEE International Conference on Software Maintenance (ICSM), Sep. 2012, pp. 592–595.
  • [52] J. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus, “Fine-grained and accurate source code differencing,” in ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden, September 15–19, 2014, pp. 313–324.