Building Implicit Vector Representations of Individual Coding Style

02/10/2020
by Vladimir Kovalenko, et al.
Universität Zürich
JetBrains

With the goal of facilitating team collaboration, we propose a new approach to building vector representations of individual developers by capturing their individual contribution style, or coding style. Such representations can find use in the next generation of software development team collaboration tools, for example by enabling the tools to track knowledge transfer in teams. The key idea of our approach is to avoid using explicitly defined metrics of coding style and instead build the representations through training a model for authorship recognition and extracting the representations of individual developers from the trained model. By empirically evaluating the output of our approach, we find that implicitly built individual representations reflect some properties of team structure: developers who report learning from each other are represented closer to each other.


1. Introduction

Machine learning (ML) has lately been making an increasing impact in software engineering, improving over the state of the art in problems such as code summarization (Haiduc et al., 2010; Allamanis et al., 2016) and program synthesis (Devlin et al., 2017). Many of the methods that apply ML techniques to code aim to enhance software development tools by offering engineers assistance in routine tasks. Examples of such enhancements include code completion engines, static analysis, and automated code review systems (Raychev et al., 2014). Most of these methods are designed to assist with a task that is relevant within a short time scope, such as inserting the right code snippet or fixing problems in a single changeset.

While this assistance promises a significant improvement in developers' daily experience with their tools, it does not cover the complete scope of potential tooling support for engineering teams, particularly in relation to socio-technical aspects. Indeed, developers report that their use of developer tools is not only related to technical artifacts, but is also a vital part of interpersonal communication in teams (e.g., as is the case with code review tools (Bacchelli and Bird, 2013)).

Social aspects of teamwork are manifested in technical artifacts, and records of these processes can be extracted with software repository mining techniques (Valetto et al., 2007). Still, modern software team collaboration tools make little to no use of the data available in software repositories to assist with the social aspects of the engineering process. To enable tools to assist with interpersonal processes at the scale of a software team, one first needs reliable and transparent models of these processes, as well as methods to retrieve the corresponding data from software repositories. As a step in this direction, in this work we propose a new approach to building representations of developers' individual coding fingerprints – their coding style. Such representations can find use in the next generation of team collaboration tools, which could, for example, track the process of knowledge transfer in teams and provide assistance. Other potential applications include searching for similar developers and profiling individual coding habits for tasks related to the management of human resources.

Existing work on code stylometry typically relies on explicitly defined features to represent code style (Caliskan-Islam et al., 2015). We take a different direction: instead of using explicit measures of code style, we implicitly extract the distinguishing features of individual developers by training a model to recognize the authorship of a batch of code changes and processing the model's internal representations. The input of the model consists of changes made by a developer to individual methods, and its output is a label for the predicted author. To maximize the transparency of the model, we use an attention mechanism – a technique widely used in neural machine translation (Bahdanau et al., 2014) – which allows us to point out the particular code constructs that are most important for authorship attribution. After training the model to recognize the author of code changes, we extract representations of individual developers' contributions from it. We do so by combining the vector representations of the individual method changes made by a developer over a time period, using the weights for the vectors of individual method changes that are learned by the attention mechanism.

Finally, we assess the capability of the representations to reflect a practical social aspect, namely, learning within teams. For this, we produce multiple snapshots of individual representations of developers in a large open source project maintained by an enterprise, with each snapshot corresponding to a specific time period. Thanks to a co-located development team, we were also able to collect reports of mutual learning from the developers of this project. We then look for a connection between reported learning and the relative distances between developer representations. While we find no connection between reported learning and the relative movement of developers' representations between consecutive time buckets, we do see that reported learning is associated with a lower distance between the representations of two developers.

The primary contribution of this work is a novel method for extracting representations of the contribution style of individual developers. The method is designed to work with raw data from software repositories and to require no additional labeling or explicit feature engineering. Another contribution is an empirical assessment of how well the retrieved representations match learning as perceived by software developers.

2. Background and motivation

2.1. The need for developer representations

In the following, we reason about the importance of representations of individual developers' style, using IDEs as an example of a class of advanced tools that successfully make use of a comprehensive model of the main medium they are designed to manipulate: code.

Despite the important role of team collaboration tools in software engineering, existing approaches aimed at improving software engineering tools with data have mostly targeted coding environments and IDEs. Modern industry-grade IDEs, such as IntelliJ IDEA (17) and Eclipse (13), provide rich toolkits for code manipulation and maintenance. These IDEs feature automated code refactoring and code inspections pointing at potential issues, are able to provide a high-level overview of large codebases, and enable deep integration of external tools, e.g., debuggers, with the code editor. The capability of IDEs to provide such rich code manipulation features is based on comprehensive internal models of software projects. In particular, IntelliJ relies on PSI (20), a rich internal code model. Language-specific features in Eclipse IDEs are based on language support packages like JDT (14) and CDT (12), which likewise operate on comprehensive language-specific program structure data. Outside the industrial IDE realm, academic methods aimed at code manipulation and improvement also operate on code models and in some cases partially rely on language support initially designed for use in IDEs (Falleri et al., 2014).

Modern team collaboration tools, such as code review tools, repository hosting engines, and bug tracking systems, are vital media for collaborative software engineering. In fact, these tools do not simply provide an environment for short-term tasks, like reviewing changes or communicating an issue, but also play a crucial role in supporting knowledge transfer in teams (Bacchelli and Bird, 2013) and serve as a knowledge base (Tran et al., 2008).

In contrast to the comprehensive code manipulation and problem detection features in IDEs, most team collaboration tools, despite their vital role in team-wide processes, do not maintain a comparably complex and detailed model of teams' communication structure, nor do they routinely analyze records of prior communication in teams. There are exceptions to this rule, such as data-driven reviewer or assignee recommendation systems (Thongtanunam et al., 2015; Kovalenko et al., 2018; Anvik et al., 2006) and the repository analytics features present in some collaboration tools. Still, these tools have yet to evolve to utilize a more comprehensive model of team communication and to assist in maintaining and improving communication at a larger time scale.

Enabling assistive features in team collaboration tools requires that the tools model social processes internally. While existing research suggests that social processes are to an extent reflected in technical artifacts (Cataldo et al., 2008) and can be extracted with data mining techniques (Valetto et al., 2007), it is important to focus on extracting representations of the individual properties of contributors from records of their collaborative work.

This work is dedicated to extracting representations of the individual properties of an engineer's coding that distinguish their contributions from their peers'. Such representations could provide tools with a sense of the proximity of the individual properties of their users' work, which could be used to detect learning in teams or to provide onboarding assistance. We require that the extraction of representations not rely on any explicitly defined set of features, as opposed to existing code stylometry approaches. Using a neural code change embedding technique, we avoid feature engineering and utilize the ability of the model to capture optimal distinguishing features implicitly.

2.2. Existing work

Despite a solid track record of academic efforts, individual developer representations see little use in modern team collaboration tools; we believe them to be a promising ground for the evolution of such tools.

The idea of building representations of individual developers’ style has been around in the research community for several decades. The need for such representations is mostly motivated by the demand for code authorship attribution, which is deemed important for a variety of real-world applications, such as malware detection (Caliskan-Islam et al., 2015) and plagiarism elimination (Lange and Mancoridis, 2007).

Some recent work is closely related to ours. Azcona et al. (2019) propose building vector representations of individual computer science students based on the source code of their assignment submissions. As opposed to our work, they do not use any information on the structure of code. Alsulami et al. (2017) use a deep learning model to attribute authorship of source code based on traversal sequences of the AST. Authorship attribution, however, is the sole task of their approach. Moreover, the model they propose works on code snippets and not code modifications, which makes it hard to apply to data from software repositories.

3. Method

Figure 1. A high-level overview of our approach to building representations of individual developers

In contrast to the explicitly defined feature sets for developers' coding style commonly used in the existing literature, we deliberately define individual coding style in the scope of a single repository loosely, as anything that distinguishes a developer's contributions to the codebase from their peers' contributions, and we focus our method on capturing exactly that.

We propose a two-step method for building the embeddings – essentially, vector representations – of individual code style. The overarching idea of our approach is to first learn to vectorize individual method changes in a way that best represents the individual contribution style of each developer, and then to combine the representations of multiple changes made by a single developer into this developer's individual contribution fingerprint.

In the first step, we extract individual changes to Java methods from the project's VCS history and randomly group them into batches of changes authored by the same person. Then we train a neural network to vectorize individual changes and their batches so as to distinguish between contributions as efficiently as possible.

At this step, the machine learning model essentially learns a function that maps a code change to a vector. The primary requirement for this function, which defines the learning process, is to represent method changes in a way that places multiple changes made by one person close to each other and far from changes made by other developers. In addition, we use an attention mechanism by training the model on batches of multiple code changes instead of single changes. Thanks to attention, the model is also capable of assigning a weight to each method change, defining its "importance" for attributing the authorship of the change batch. The inner workings of the authorship attribution model are explained in more detail in Section 3.1.

In the second step, we combine the representations of changes made by every individual developer into a representation of that developer. Here, we use the trained authorship attribution model to produce vectors for the individual code changes made by the developer, and calculate the representation of a person as a weighted sum of the changes they have made, using the attention values from the model as weights.

In the rest of this section we provide a more detailed technical overview of the extraction pipeline. In the first part, we discuss the inner workings of the authorship attribution model. In the second part, we describe extraction of representations of individual developers from the trained model for authorship recognition.

Figure 2. Overview of the authorship attribution pipeline that we use to obtain authorship-based embeddings of method changes and importance of individual method changes for attribution of authorship. The method nodes and their attention weights are later used to produce developer representations.

3.1. Vectorizing code changes to represent authorship

The first step in building the representations of code style is to train a model to distinguish among the contributions of individual developers. During training for authorship attribution, the model implicitly learns to extract information that distinguishes method changes made by each developer from those made by their teammates.

We operate with method changes, rather than static snapshots of code snippets, so that we can use authorship labels from version control: a method change can always be attributed to the single person who performed it, and this data is already present in the version control system. Moreover, we use batches of randomly selected method changes as the unit of input for authorship attribution. This allows us to use the whole history of a project for training. For each developer, we shuffle their history of changes and split it into batches. A batch consists of 16 methods authored by the same developer, sampled uniformly at random from their development history. We chose this number empirically as a good balance between having batches that are too small and having too few data points per developer. While making the authorship attribution task easier by letting the model focus on more important pieces of input, the use of batches and attention forces the model to estimate the importance of specific changes before making a prediction. We further use these attention values when constructing individual style vectors for developers, so as to let the code changes that are more representative of a developer contribute more to their fingerprint.
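To illustrate, a minimal Python sketch of this batching step could look as follows; the MethodChange container, the function name, and the decision to drop incomplete batches are our own illustrative assumptions rather than part of the released pipeline.

    import random
    from dataclasses import dataclass
    from typing import Dict, List

    BATCH_SIZE = 16  # empirically chosen batch size described above

    @dataclass
    class MethodChange:
        author: str          # VCS author of the change
        path_contexts: list  # path-contexts extracted from the change to the method

    def make_batches(changes: List[MethodChange], seed: int = 0) -> Dict[str, List[List[MethodChange]]]:
        """Shuffle each developer's change history and split it into fixed-size batches."""
        rng = random.Random(seed)
        by_author: Dict[str, List[MethodChange]] = {}
        for change in changes:
            by_author.setdefault(change.author, []).append(change)

        batches: Dict[str, List[List[MethodChange]]] = {}
        for author, history in by_author.items():
            rng.shuffle(history)
            # keep only full batches so that every training example has exactly BATCH_SIZE changes
            full = len(history) // BATCH_SIZE * BATCH_SIZE
            batches[author] = [history[i:i + BATCH_SIZE] for i in range(0, full, BATCH_SIZE)]
        return batches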

Our authorship recognition model is a neural network; Figure 2 presents an overview. Its architecture is based on code2vec (Alon et al., 2019), a state-of-the-art code embedding model. Similarly to code2vec, it uses path-based representations (Alon et al., 2018) of the versions of each method before and after a change.

3.1.1. Path-based representations.

Path-based representations are explained in detail in the original work by Alon et al. (Alon et al., 2018); we explain the essential concepts below.
Abstract Syntax Tree. An abstract syntax tree (AST) is a representation of a program's code in the form of a tree. The nodes of the tree correspond to different code constructs (e.g., math operations and variable declarations), and the children of a node correspond to the smaller constructs that comprise it. Different constructs are represented by different node types. An AST omits parentheses, tabs, and other formatting details. Figure 3 shows an example of a code fragment and the corresponding AST.
AST path. A path is a sequence of connected nodes in an AST. The start and end nodes of a path may be arbitrary, but we only use paths between two leaves of the AST, to conform with code2vec (Alon et al., 2019). Following Alon et al. (Alon et al., 2018), we denote an AST path by the sequence of node types and the directions (up or down) between consecutive nodes. In Figure 3(b), an example of a path between the leaves of an AST is shown with red arrows. In terms of node types and directions, this path is denoted as follows:

Path-context. The path-based representation operates with path-contexts: triples consisting of (1) a path between two nodes and the tokens corresponding to its (2) start and (3) end nodes. From a human perspective, a path-context represents two tokens in the code and the structural connection between them, which allows it to capture information about the structure of the code. Figure 3(b) highlights the following path-context:

This path-context represents the declaration of a function named square with a single argument named x. The path in this path-context encodes the following information: it contains a Function Declaration node as well as a Single Variable Declaration node, and the tokens are linked to Simple Name AST nodes.

(a) An example code fragment
(b) AST of this code fragment
Figure 3. A code example and corresponding AST
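To make the notion of a path-context more tangible, below is a small, self-contained sketch that extracts leaf-to-leaf path-contexts from a toy, hand-built AST of the square example. The node type names and helper functions are illustrative only; the actual pipeline relies on a full Java parser.

    from dataclasses import dataclass, field
    from typing import List, Optional, Tuple

    @dataclass
    class Node:
        type: str                      # AST node type, e.g. "FunctionDeclaration"
        token: Optional[str] = None    # token text for leaves, e.g. "square"
        children: List["Node"] = field(default_factory=list)

    def leaves_with_paths(node: Node, prefix=()) -> List[Tuple[Node, tuple]]:
        """Return every leaf together with the chain of nodes from the root down to it."""
        prefix = prefix + (node,)
        if not node.children:
            return [(node, prefix)]
        result = []
        for child in node.children:
            result.extend(leaves_with_paths(child, prefix))
        return result

    def path_contexts(root: Node) -> List[Tuple[str, str, str]]:
        """All (start token, path, end token) triples between pairs of leaves."""
        leaves = leaves_with_paths(root)
        contexts = []
        for i in range(len(leaves)):
            for j in range(i + 1, len(leaves)):
                (a, pa), (b, pb) = leaves[i], leaves[j]
                # walk down while the two root-to-leaf chains still agree: pa[k] is the lowest common ancestor
                k = 0
                while k < min(len(pa), len(pb)) - 1 and pa[k + 1] is pb[k + 1]:
                    k += 1
                up = [n.type for n in reversed(pa[k:])]   # from the first leaf up to the common ancestor
                down = [n.type for n in pb[k + 1:]]       # from below the ancestor down to the second leaf
                path = " UP ".join(up) + " DOWN " + " DOWN ".join(down)
                contexts.append((a.token, path, b.token))
        return contexts

    # Toy, heavily simplified AST for a function `square` with a single argument `x`
    ast = Node("FunctionDeclaration", children=[
        Node("SimpleName", token="square"),
        Node("SingleVariableDeclaration", children=[Node("SimpleName", token="x")]),
    ])
    for ctx in path_contexts(ast):
        print(ctx)  # ('square', 'SimpleName UP FunctionDeclaration DOWN SingleVariableDeclaration DOWN SimpleName', 'x')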

3.1.2. Mechanics of the authorship recognition model.

The first step in the authorship recognition task is to convert a set of individual method changes into vector form. For each changed method, we parse both the old and the new version of the method to retrieve their ASTs, and extract path-contexts from both versions. We impose limitations on the maximum length and width of the paths, so as to only include path-contexts representing local relations in code. To distill the concrete effect of each code change, we only use the difference between the sets of path-contexts representing the old and new versions of each method: we only keep the path-contexts that were introduced or removed by the change. We then convert the path-contexts representing the difference into a numerical form that can be passed to the neural network. To do this, we apply vocabulary-based encoding to paths and tokens, which consists in representing every distinct path or token with a unique integer.
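A minimal sketch of this distillation and encoding step is shown below; the names and the decision to reserve id 0 for padding are illustrative assumptions.

    from typing import Dict, List, Set, Tuple

    PathContext = Tuple[str, str, str]  # (start token, path, end token)

    def change_path_contexts(before: Set[PathContext], after: Set[PathContext]) -> Set[PathContext]:
        """Keep only the path-contexts that were added or removed by the change."""
        return before.symmetric_difference(after)

    class Vocabulary:
        """Maps every distinct item (token or path) to a unique integer id."""
        def __init__(self):
            self._ids: Dict[str, int] = {}

        def id_of(self, item: str) -> int:
            return self._ids.setdefault(item, len(self._ids) + 1)  # id 0 is reserved for padding

    def encode_change(diff: Set[PathContext],
                      token_vocab: Vocabulary,
                      path_vocab: Vocabulary) -> List[Tuple[int, int, int]]:
        """Turn a set of path-contexts into integer triples the network can consume."""
        return [(token_vocab.id_of(start), path_vocab.id_of(path), token_vocab.id_of(end))
                for start, path, end in diff]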

After that, we learn embeddings for tokens and paths. Essentially, at this step the model learns to convert paths and tokens into vector form in the way that is most meaningful for the ultimate objective: attributing the authorship of the method changes they comprise. Initialized as random matrices, the stacked embeddings of paths and tokens converge to optimal values during training.

Further down the pipeline, the model concatenates the embedding vector of a path with the embeddings of its start and end tokens to build a path-context vector – a combined representation of the path and its end tokens. We transform the path-context vectors with a fully connected layer and aggregate them into method change vectors, using weights from an attention layer. The attention mechanism essentially attributes a "relevance" weight to each path-context within the code change. Path-contexts with higher attention values are more important for distinguishing between developers, i.e., they capture more individual information. By highlighting the relevant path-contexts, the attention mechanism improves both the accuracy of the model and its interpretability: it is possible to pinpoint the concrete path-contexts in the input that drive a prediction.

As depicted in the bottom part of Figure 2, we use another attention mechanism to combine a batch of method change vectors, each corresponding to a change made to a single method, into a change batch vector. Combining changes into batches, rather than using a single change for every prediction, allows us to attribute to each individual method change an attention weight representing its importance for authorship attribution. The size of this vector is a hyperparameter of the model, but we choose it to be much smaller than the number of possible developer labels, to ensure that the representations of developers are dense. The final fully connected layer with softmax activation solves the classification problem by learning to attribute a change batch vector to a concrete developer.
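For illustration, a condensed PyTorch-style sketch of such an architecture is given below; all layer sizes, activation choices, and names are our own assumptions, and the real model may differ in these details.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AuthorshipModel(nn.Module):
        def __init__(self, n_tokens, n_paths, n_developers, dim=128, change_dim=64):
            super().__init__()
            self.token_emb = nn.Embedding(n_tokens, dim, padding_idx=0)
            self.path_emb = nn.Embedding(n_paths, dim, padding_idx=0)
            self.context_fc = nn.Linear(3 * dim, dim)      # fuse (start, path, end) embeddings
            self.context_attn = nn.Linear(dim, 1)          # attention over path-contexts within a change
            self.change_fc = nn.Linear(dim, change_dim)
            self.change_attn = nn.Linear(change_dim, 1)    # attention over method changes within a batch
            self.classifier = nn.Linear(change_dim, n_developers)

        def forward(self, starts, paths, ends):
            # starts, paths, ends: integer id tensors of shape (changes_in_batch, contexts_per_change)
            ctx = torch.cat([self.token_emb(starts),
                             self.path_emb(paths),
                             self.token_emb(ends)], dim=-1)
            ctx = torch.tanh(self.context_fc(ctx))                        # (changes, contexts, dim)
            ctx_w = F.softmax(self.context_attn(ctx), dim=1)              # weight per path-context
            change_vec = torch.tanh(self.change_fc((ctx_w * ctx).sum(dim=1)))  # one vector per method change
            change_w = F.softmax(self.change_attn(change_vec), dim=0)     # weight per change in the batch
            batch_vec = (change_w * change_vec).sum(dim=0)                # change batch vector
            logits = F.log_softmax(self.classifier(batch_vec), dim=-1)    # one label per developer
            return logits, change_vec, change_w.squeeze(-1)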

3.2. From authorship recognition to developer embeddings

The classifier in the authorship recognition model learns to attribute a change batch vector (which is a weighted sum of the method change vectors for the methods in the batch) to an individual developer. Essentially, the whole model learns to map batches of individual method changes into a vector space so that the sets of contributions of individual developers can be separated as well as possible. High accuracy on the authorship recognition task suggests that the learned model separates the space of method change vectors into areas that correspond to individual developers, thus capturing the individual characteristics of a sample of a developer's contributions.

The pivotal idea of our approach is to extract a representation of a developer from the trained model. Recall that the attention mechanism learns to evaluate the importance of each single method change within the combined method batch vector. To extract a representation of the contributions of a single developer over a period of time, we consider all the changes made by that developer during the period. We feed these changes into the model and retrieve a vector representing each method change, together with the corresponding attention value. Finally, we combine these vectors and weights into a representation of the developer as a weighted sum.
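In code, this final aggregation amounts to a normalized weighted sum; a small sketch, assuming the per-change vectors and attention weights have already been extracted from the trained model:

    import numpy as np

    def developer_embedding(change_vectors: np.ndarray, attention_weights: np.ndarray) -> np.ndarray:
        """Combine all of a developer's method change vectors for one time bucket into a
        single style vector, weighting each change by its attention value.

        change_vectors: (n_changes, dim), attention_weights: (n_changes,)."""
        weights = attention_weights / attention_weights.sum()  # normalize over the developer's changes
        return (weights[:, None] * change_vectors).sum(axis=0)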

This representation is used in further analysis: we calculate representations for multiple time buckets and retrieve multiple representation vectors for each team member, each vector corresponding to a certain time period. Finally, we explore whether the positions and relative movement of the representations are connected to learning from peers, as reported by developers.

Figure 4. Change of developers’ vectors between time buckets. A developer may be only active in some of the buckets.

3.3. Threats to validity

The curse of context. The authorship recognition model distinguishes contributors based both on the structure of code (represented by the sequences of node types in AST paths) and on the context of their changes (represented by the tokens). Tokens include variable names and the names of declared and invoked methods. These names may be highly specific to a concrete, narrow area of code. It is reasonable to think that in projects with a practice of individual code ownership, this context information alone can be enough to recognize the author of a method change. Given our ultimate goal of capturing individual characteristics of developers, including context information in the model is not always desirable. While we perform a separate evaluation on data with tokens excluded, there is a chance that context information is also reflected in characteristic unique code patterns captured in sequences of AST node types.

Performance. We must note that the resource consumption of our approach is very high, mostly due to the need to repeat training multiple times to reduce noise in the data. While not a crucial property for a proof-of-concept tool, reasonable performance is a requirement for practical applicability to this task. Producing a slow pipeline for a long-term, yet practical, goal weakens the strength of the motivation.

4. Evaluation Setup

One critical design choice of our approach to building individual developer representations, or embeddings, is to step away from explicitly defined code style. In essence, as we use the authorship attribution task to build the representations, we implicitly define code style as "anything that distinguishes between individuals' code". While avoiding an explicit definition of code style gives the potential to include characteristics of style that would otherwise be left out, it makes evaluating the quality of the embeddings challenging, as there is no ground truth to compare against.

To get a realistic estimate of the quality of the code style embeddings, we decided to evaluate them in the context of a possible application. As discussed in Section 2.1, one promising application of code style representations is enabling team collaboration tools to make sense of the proximity of individual contribution styles, capture the process of knowledge transfer in teams, and potentially provide aid by making this process more efficient. Essentially, we formulate the evaluation task as assessing the embeddings' ability to capture learning between individuals. In the rest of this section, we elaborate on the evaluation setup and technique.

4.1. Dataset preparation

As an evaluation dataset, we use the source code and development history of IntelliJ Community (https://github.com/jetbrains/intellij-community).

Merging, splitting and filtering. In the first step, we merge the name-email pairs recorded in the VCS history that belong to the same developer into single entities. For this purpose, we used a separate user management tool, internally used by the developers of IntelliJ and accessible to us, which contains merged records of VCS identities for developers in the project. To facilitate running our pipeline on other projects, we also implemented a simple algorithm for merging identities: it builds a bipartite graph of names and emails, where pairs that appear together are connected by an edge, and connected components in the graph correspond to merged entities. While this algorithm is not perfect, it gives an approximate merging which can be further improved manually. We include the implementation in the reproduction package.
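A minimal sketch of this merging heuristic, implemented as a connected-components computation over the bipartite graph (with illustrative names, not the code from the reproduction package), could look like this:

    from typing import Dict, Iterable, List, Set, Tuple

    def merge_identities(name_email_pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
        """Group VCS (name, email) pairs into developer entities: build a bipartite graph
        of names and emails and return its connected components."""
        adjacency: Dict[str, Set[str]] = {}
        for name, email in name_email_pairs:
            a, b = "name:" + name, "email:" + email  # prefix to keep the two sides of the graph distinct
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

        seen: Set[str] = set()
        components: List[Set[str]] = []
        for node in adjacency:
            if node in seen:
                continue
            stack, component = [node], set()
            while stack:  # plain depth-first search over the graph
                current = stack.pop()
                if current in seen:
                    continue
                seen.add(current)
                component.add(current)
                stack.extend(adjacency[current] - seen)
            components.append(component)
        return components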

Afterward, we split the history of the repository into multiple time chunks, each containing the same number of commits. Learning representations of developers over a small time chunk, rather than over the complete history of a repository, allows us to produce multiple representations of a developer, each corresponding to a relatively short time bucket. This accounts for potential changes in developers' coding traits over time and allows us to look at changes in the distances between representations in consecutive time buckets – in other words, to track the relative movement of the representations.
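The split itself is straightforward; a small sketch, assuming commits are already ordered chronologically and using the 20 buckets of our setup as the default:

    from typing import List, Sequence

    def split_into_buckets(commits: Sequence[str], n_buckets: int = 20) -> List[Sequence[str]]:
        """Split a chronologically ordered commit list into buckets of equal size
        (any remainder after integer division is dropped for simplicity)."""
        per_bucket = len(commits) // n_buckets
        return [commits[i * per_bucket:(i + 1) * per_bucket] for i in range(n_buckets)]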

As explained in Section 3, we focus on Java method changes for training and for producing the representations. The repository contains contributions from about 500 developers, with a long tail of developers with only a few contributions. To ensure that we have sufficient data for every developer, we exclude the developers who made fewer than 1,000 method changes over the whole history of the repository, which leaves 124 active developers in the dataset. Having significant amounts of data for every developer increases the stability of the representations and reduces noise.

Noise reduction. The weights in the authorship attribution model, which ultimately define the resulting change embedding function, are randomly initialized before learning and may converge to very different configurations depending on the initial random seed. Moreover, the density of the developers' representations differs from snapshot to snapshot.

To account for the varying density, we calculate, for each time bucket, the average distance between every pair of developer representations, and divide the actual distances by this value. This normalization makes distances between two given representations comparable across consecutive time buckets, which is necessary because the density of representations may differ between two buckets.

On top of normalization, to account for the random nature of the representations, we repeat the whole learning and style representation extraction 30 times and calculate the average normalized distance between every two developers in every time bucket across all runs. While this makes obtaining representations from a large repository computationally demanding, repeating the learning and averaging the results makes the data less noisy and the comparison of distances between two different pairs more reliable.
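Combining the two noise-reduction steps, a sketch of the per-bucket normalization and the averaging over repeated runs (with illustrative names and shapes) could be:

    import numpy as np

    def normalized_distances(embeddings: np.ndarray) -> np.ndarray:
        """Pairwise distances between developer representations in one time bucket,
        divided by the bucket's average pairwise distance.

        embeddings: (n_developers, dim) representations for a single run and bucket."""
        diffs = embeddings[:, None, :] - embeddings[None, :, :]
        distances = np.linalg.norm(diffs, axis=-1)                     # (n_developers, n_developers)
        off_diagonal = distances[~np.eye(len(distances), dtype=bool)]  # exclude self-distances
        return distances / off_diagonal.mean()

    def average_over_runs(per_run_embeddings: list) -> np.ndarray:
        """Average the normalized distance matrices over repeated training runs (30 in our setup);
        rows must correspond to the same developers in every run."""
        return np.mean([normalized_distances(e) for e in per_run_embeddings], axis=0)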

The resulting data consists of 20 lists of relative distances between every two representations, with each list corresponding to a certain time bucket. The number of representations in a bucket varies between 23 and 87, depending on the bucket, and displays an upward trend that reflects the growth of the team.

4.2. Team survey

To get a baseline for checking to what extent the proximity and relative movement of individual representations, as extracted by the model, reflect actual learning in the team, we circulated a short online survey among the development team of the project in the dataset via a post in the project's internal communication channel.

The goal of the survey was to collect information on mutual learning between developers in the team, so that this data could serve as ground truth on actual learning taking place and let us see whether the relative movement of developer representations reflects learning reported by developers. In the survey, we ask each respondent the questions depicted in Figure 5 three times, to get information about three different colleagues they have learned from. We also included an option to mention only one or two colleagues in a similar way. To avoid hinting the respondents toward any particular definition of the coding elements their learning may relate to, we explicitly stated that we would like them to define it for themselves and to consider anything they may have learned during collaborative work, mentoring, code review, or other team activities. Responses to this question provide us with examples of positive pairs in terms of reported learning.

In addition, we ask the respondents to mention several developers who contribute to the same project but from whom they are sure they have not learned any elements of coding: “Please name a few colleagues from the IntelliJ team you think you did not learn any coding elements from at all”. This question provides us with a set of negative examples.

Figure 5. Excerpt from the survey

4.3. Survey results and model output

The actual evaluation consisted in mapping the results of the developer survey onto data of relative movement of developers’ representations between different time buckets.

While we asked the survey participants to indicate the degree of perceived learning and the time period when such learning took place, we believe that the data for these two questions is not reliable enough. Three respondents simply reported a similar neutral or extreme score for every person they mentioned and noted that they were confused by the request to indicate the degree of learning. Regarding the time period when learning took place, only one in three respondents provided any meaningful information, which in most cases was still too vague to attribute to a concrete time bucket. Thus, to map the results of the survey onto the output of the model, we only use the fact that a participant named someone they have learned from, or certainly have not learned from. The final data from the survey consists of 23 positive pairs – pairs of developers one of whom reports learning from the other – and 13 negative pairs, where one of the developers named the other as someone they certainly have not learned any elements of coding from.

To see whether distances between representations reflect reported learning, we compare the distribution of distances over all positive pairs, taken across all buckets where both developers in the pair were active and present in the data, to a similarly defined distribution over all negative pairs. The sample of distances for positive pairs consists of 229 values; the corresponding sample for negative pairs has 113 values.

To extract the dynamics of relative distances, we obtain the distribution of differences in distances between two consecutive time buckets for all pairs in the positive and negative groups, using every time bucket where these distances are defined in both the given and the previous bucket. The sample of distance differences for positive pairs contains 204 values, and the negative pairs yield 99 values.

Finally, we use a 2-sample Kolmogorov-Smirnov test to compare the distributions of distances, as well as the distributions of their differences, between the positive and negative groups. (We use this test because we cannot make any assumptions about the distribution of these values, due to the obscure stochastic process of learning and the fact that the data may reflect social processes in teams that are impossible to completely quantify.)
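For reference, this comparison boils down to two calls to scipy.stats.ks_2samp; a minimal sketch with hypothetical variable names:

    from typing import Sequence
    from scipy.stats import ks_2samp

    def compare_groups(positive: Sequence[float], negative: Sequence[float]):
        """2-sample Kolmogorov-Smirnov test between the positive-pair and negative-pair samples."""
        result = ks_2samp(positive, negative)
        return result.statistic, result.pvalue

    # distances pooled over all buckets where both developers of a pair are present
    # (229 positive and 113 negative values in our data):
    #   compare_groups(positive_distances, negative_distances)
    # differences of distances between consecutive buckets (204 and 99 values):
    #   compare_groups(positive_deltas, negative_deltas)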

5. Evaluation results

Relative distances. When tokens are present in the input data, the distribution of the 229 distance values in pairs with reported learning has a mean of and a variance of . For pairs with reported lack of learning, the 113 distance values display a mean of with a variance of . The p-value of the 2-sample Kolmogorov-Smirnov test comparing these two samples is under . When tokens are removed from the input data (to minimize the context information available to the model), the mean distance between positive pairs is with a variance of ; for negative pairs, the mean distance is and the variance is . The KS test yields a p-value .

These results suggest that the representations of developers in pairs with reported learning are located closer to each other, both in cases with included and excluded tokens.

Relative movement. We perform a similar comparison for the distributions of distances in pairs of representations between two consequent time buckets. With tokens in data, 204 values for positive pairs are distributed with mean of and variance of . 99 values for negative pairs display a mean of and variance of . The KS test yields a p-value of , suggesting that samples of distance differences for positive and negative pairs are likely from similar distributions. When no tokens are present, values for positive pairs display a distribution with mean of and variance of . Values for negative pairs are distributed with mean of and variance of . The p-value in the KS test is 0.16.

These results suggest that distributions of differences in relative distances between consecutive time buckets are no different between pairs with reported learning and with reported lack of learning, regardless of whether tokens are included.

Summary. Overall, learning is to an extent reflected in the distances between representations: the distances between developers who report learning from each other are lower than between developers who report not learning from each other. This result persists when tokens are removed from the data, suggesting, importantly, that learning is not only captured in the context of developers' contributions as represented by tokens. However, a similar comparison of the distributions of distance differences between consecutive time buckets suggests that these distributions are similar for both groups of developer pairs.

6. Discussion

The core idea of our approach to building coding style representations is to step away from explicitly defined feature sets for developer representations and instead build the representations of style by aggregating embeddings of code changes, which are produced by the authorship attribution model to distinguish between developers as well as possible.

We evaluated the computed developer embeddings against reports of peer learning in a large development team and found that inter-peer learning is indeed reflected in the embeddings: the representations of developers whom a given developer reports having learned from are closer to their own representation than the representations of those whom they report not having learned from. Importantly, this effect persists even when the context information (reflected in concrete code tokens) is removed from the training data. This result suggests that implicitly built developer representations reflect the fuzzy process of learning in teams by capturing individual characteristic patterns of code constructs and their proximity for people who report learning from each other, rather than just capturing context information.

However, we do not see any connection between reported learning and the relative movement of developer representations. In our opinion, the most likely reason for the lack of such a connection is the low stability of the resulting representations. While they reflect learning in the distributions of the relative positions of individual representations, the representations are too noisy to reason about learning in a side-by-side comparison of team snapshots in consecutive time buckets. We discuss possible ways to mitigate this in the next subsection.

6.1. Future work

Evaluation. A more elaborate evaluation of our approach, involving multiple software projects and more feedback from developers, could help clarify the quality of the representations.
Other scopes. Apart from the learning detection task, the representations produced by our approach could find use in other contexts, including tasks already supported by team collaboration tools, e.g., recommendation of code reviewers (Kovalenko et al., 2018) or issue assignment (Anvik et al., 2006).
Alternative embedding techniques. We use a modified version of code2vec for method change embeddings; using other embedding techniques could improve the performance of our approach.
Transparency. The authorship attribution model uses two attention layers: one to learn the importance of individual changes in a batch and the other for individual path-contexts. While we use the weights from the former to produce developer representations, a careful consideration of the values from both layers could provide more insight into what makes a certain change characteristic of a developer.
Stability of representations. Noise and jitter in representation snapshots between time buckets make the extracted representations harder to work with. Additional constraints could help increase the representations' stability. Figuring out a way to increase stability without compromising the ability of the representations to capture important social properties is another promising direction for future work.

7. Conclusion

We introduced an approach to building representations of the individual coding style of developers relative to their peers in the team. The most important feature of our approach is that it does not require explicit feature engineering; instead, it relies on the implicit vectorization of code changes via a code embedding model trained to distinguish between changes made by individual developers, and on the aggregation of those individual changes.

We demonstrate that it is possible to build representations of individual developers' coding style without defining style formally. The resulting representations reflect learning between peers in the team to a certain degree.

Reproducibility. The technical artifacts of this work – the pipeline to build the representations and data analysis code that we used to map the survey results to the data – are available online: https://zenodo.org/record/3647645

References

  • M. Allamanis, H. Peng, and C. Sutton (2016) A convolutional attention network for extreme summarization of source code. In International Conference on Machine Learning, pp. 2091–2100. Cited by: §1.
  • U. Alon, M. Zilberstein, O. Levy, and E. Yahav (2018) A general path-based representation for predicting program properties. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 404–419. Cited by: §3.1.1, §3.1.
  • U. Alon, M. Zilberstein, O. Levy, and E. Yahav (2019) Code2vec: learning distributed representations of code. Proceedings of the ACM on Programming Languages 3 (POPL), pp. 40. Cited by: §3.1.1, §3.1.
  • B. Alsulami, E. Dauber, R. Harang, S. Mancoridis, and R. Greenstadt (2017) Source code authorship attribution using long short-term memory based networks. In European Symposium on Research in Computer Security, pp. 65–82. Cited by: §2.2.
  • J. Anvik, L. Hiew, and G. C. Murphy (2006) Who should fix this bug?. In Proceedings of the 28th international conference on Software engineering, pp. 361–370. Cited by: §2.1, §6.1.
  • D. Azcona, P. Arora, I. Hsiao, and A. Smeaton (2019) User2code2vec: embeddings for profiling students based on distributional representations of source code. In Proceedings of the 9th International Conference on Learning Analytics & Knowledge, pp. 86–95. Cited by: §2.2.
  • A. Bacchelli and C. Bird (2013) Expectations, outcomes, and challenges of modern code review. In Proceedings of the 2013 international conference on software engineering, pp. 712–721. Cited by: §1, §2.1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1.
  • A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, and R. Greenstadt (2015) De-anonymizing programmers via code stylometry. In 24th USENIX Security Symposium (USENIX Security 15), Washington, D.C., pp. 255–270. External Links: ISBN 978-1-931971-232, Link Cited by: §1, §2.2.
  • M. Cataldo, J. D. Herbsleb, and K. M. Carley (2008) Socio-technical congruence: a framework for assessing the impact of technical and work dependencies on software development productivity. In Proceedings of the Second ACM-IEEE international symposium on Empirical software engineering and measurement, pp. 2–11. Cited by: §2.1.
  • J. Devlin, J. Uesato, S. Bhupatiraju, R. Singh, A. Mohamed, and P. Kohli (2017) Robustfill: neural program learning under noisy i/o. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 990–998. Cited by: §1.
  • [12] (2019) Eclipse CDT — The Eclipse Foundation. https://www.eclipse.org/cdt/. Accessed: 2019-12-27. Cited by: §2.1.
  • [13] (2019) Eclipse Desktop & Web IDEs — The Eclipse Foundation. https://www.eclipse.org/ide/. Accessed: 2019-12-27. Cited by: §2.1.
  • [14] (2019) Eclipse Java Development Tools (JDT) — The Eclipse Foundation. https://www.eclipse.org/jdt/. Accessed: 2019-12-27. Cited by: §2.1.
  • J. Falleri, F. Morandat, X. Blanc, M. Martinez, and M. Monperrus (2014) Fine-grained and accurate source code differencing. In ACM/IEEE International Conference on Automated Software Engineering, ASE ’14, Vasteras, Sweden - September 15 - 19, 2014, pp. 313–324. External Links: Link, Document Cited by: §2.1.
  • S. Haiduc, J. Aponte, and A. Marcus (2010) Supporting program comprehension with source code summarization. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 2, pp. 223–226. Cited by: §1.
  • [17] (2019) IntelliJ IDEA: the Java IDE for professional developers by JetBrains. https://www.jetbrains.com/idea/. Accessed: 2019-12-27. Cited by: §2.1.
  • V. Kovalenko, N. Tintarev, E. Pasynkov, C. Bird, and A. Bacchelli (2018) Does reviewer recommendation help developers?. IEEE Transactions on Software Engineering. Cited by: §2.1, §6.1.
  • R. C. Lange and S. Mancoridis (2007) Using code metric histograms and genetic algorithms to perform author identification for software forensics. In Proceedings of the 9th annual conference on Genetic and evolutionary computation, pp. 2082–2089. Cited by: §2.2.
  • [20] (2019) Program Structure Interface (PSI). https://www.jetbrains.org/intellij/sdk/docs/basics/architectural_overview/psi.html. Accessed: 2019-12-26. Cited by: §2.1.
  • V. Raychev, M. Vechev, and E. Yahav (2014) Code completion with statistical language models. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, New York, NY, USA, pp. 419–428. External Links: ISBN 978-1-4503-2784-8, Link, Document Cited by: §1.
  • P. Thongtanunam, C. Tantithamthavorn, R. G. Kula, N. Yoshida, H. Iida, and K. Matsumoto (2015) Who should review my code? a file location-based code-reviewer recommendation approach for modern code review. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 141–150. Cited by: §2.1.
  • H. M. Tran, G. Chulkov, and J. Schönwälder (2008) Crawling bug tracker for semantic bug search. In International Workshop on Distributed Systems: Operations and Management, pp. 55–68. Cited by: §2.1.
  • G. Valetto, M. Helander, K. Ehrlich, S. Chulani, M. Wegman, and C. Williams (2007) Using software repositories to investigate socio-technical congruence in development projects. In Fourth International Workshop on Mining Software Repositories (MSR’07: ICSE Workshops 2007), pp. 25–25. Cited by: §1, §2.1.