
Using Large-scale Heterogeneous Graph Representation Learning for Code Review Recommendations

Code review is an integral part of any mature software development process, and identifying the best reviewer for a code change is a well accepted problem within the software engineering community. Selecting a reviewer who lacks expertise and understanding can slow development or result in more defects. To date, most reviewer recommendation systems rely primarily on historical file change and review information; those who changed or reviewed a file in the past are the best positioned to review in the future. We posit that while these approaches are able to identify and suggest qualified reviewers, they may be blind to reviewers who have the needed expertise and have simply never interacted with the changed files before. To address this, we present CORAL, a novel approach to reviewer recommendation that leverages a socio-technical graph built from the rich set of entities (developers, repositories, files, pull requests, work-items, etc.) and their relationships in modern source code management systems. We employ a graph convolutional neural network on this graph and train it on two and a half years of history on 332 repositories. We show that CORAL is able to model the manual history of reviewer selection remarkably well. Further, based on an extensive user study, we demonstrate that this approach identifies relevant and qualified reviewers who traditional reviewer recommenders miss, and that these developers desire to be included in the review process. Finally, we find that "classical" reviewer recommendation systems perform better on smaller (in terms of developers) software projects while CORAL excels on larger projects, suggesting that there is "no one model to rule them all."


1. Introduction

Code review (also known as pull request review) has become an integral process in software development, both in industrial and open source development (Gousios et al., 2014; Rigby et al., 2012; Rigby and Bird, 2013), and all code hosting systems support it. Code reviews facilitate knowledge transfer, help to identify potential issues in code, and promote discussion of alternative solutions (Bacchelli and Bird, 2013). Modern code review is characterized by asynchronous review of changes to the software system, facilitated by automated tools and infrastructure (Bacchelli and Bird, 2013).

As code review inherently requires expertise and prior knowledge, many studies have noted the importance of identifying the “right” reviewers, which can lead to faster turnaround, more useful feedback, and ultimately higher code quality (Rigby and Storey, 2011; Bosu et al., 2015). Selecting the wrong reviewer slows down development at best and can lead to post-deployment issues. In response to this finding, a vibrant line of code reviewer recommendation research has emerged, to great success (Lipcak and Rossi, 2018; Ouni et al., 2016; Jiang et al., 2017; Yu et al., 2014, 2016; Lee et al., 2013; Sülün et al., 2019; Thongtanunam et al., 2015). Some of these have, in fact, even been put into practice in industry (Asthana et al., 2019).

All reviewer recommender approaches that we are aware of rely on historical information of changes and reviews. The principle underlying these is that the best reviewers of a change are those who have previously authored or reviewed the files involved in the review. While recommenders that leverage this idea have proven to be valid and successful, we posit that they may be blind to qualified reviewers who may have never interacted with these files in the past, especially as the number of developers in a project grows.

We note that there is a wealth of additional recorded information in software repositories that can be leveraged to improve reviewer recommendation and address this weakness. Specifically, we assert that incorporating information around interactions between code contributors as well as the semantics of code changes and their descriptions can help identify the best reviewers. As one intuitive example, if a set of existing pull requests are determined to be semantically similar to a new incoming pull request, then reviewers who contributed meaningfully to the former may likely be good candidates to review the latter, even if the reviews do not share common files or components. To leverage this idea, we construct a socio-technical graph from repository information, comprising files, authors, reviewers, pull requests, and work items, along with the relationships that connect them. Prior work has shown that code review is a social process in addition to a technical one (Kononenko et al., 2015; Bosu and Carver, 2013). As such, our primary belief is that this heterogeneous graph captures both and can be used to address many software engineering tasks, with code reviewer recommendation being the first that we address.

Learning on such a graph poses a challenge. Fortunately, the area of machine learning has advanced by leaps and bounds in the past eight years since reviewer recommendation became a recognized important research problem. Neural approaches give us tools to deal with the relational information found in software repositories and make inferences about who is best able to review a change (Wu et al., 2020).

Based on these observations and ideas, we introduce Coral, a novel approach for identifying the best reviewers for a code change. We train a graph convolutional neural network on this socio-technical graph and use it to recommend reviewers for future pull requests. Compared to existing state of the art, this approach works quite well.

To test our hypotheses, we build a historical graph of the entities and their relationships in 332 software projects over a two and a half year period. We show that a neural network trained on this graph is able to model review history surprisingly well. We perform a large-scale user study of Coral by contacting those potential reviewers recommended by our neural approach that the “classical” baseline (in production) approach did not identify because they had not previously interacted with the files in the pull requests. Their responses reveal that there is a large population of developers who not only are qualified to contribute to these code reviews, but who also desire to be involved. We also investigate in what contexts Coral works best and find it performs better than the baseline in large (in terms of developers) software projects, while the baseline excels in small projects, indicating that there is no “one model to rule them all.” Finally, through an ablation study of Coral, we demonstrate that while both files and their natural language text in the graph are important, there is a tremendous performance boost when they are used together.

We make the following contributions in this paper:

1. We present a general socio-technical graph based on the entities and interactions in modern source code repository systems.

2. We introduce Coral, a novel code reviewer recommendation approach that leverages graph convolutional neural networks on the socio-technical repository graph.

3. We evaluate our approach through retrospective analyses, a large-scale user study, and an ablation study, showing that Coral improves on state-of-the-art deployed approaches across a broad set of historical reviews; we also conduct a user study based on running our system on real-time PRs.

2. Related work

There have been many approaches to the code reviewer recommendation problem. We survey a broad set of studies and approaches here and refer the reader to the work of Çetin et al. (Çetin et al., 2021) for a more comprehensive survey of existing work.

The first reviewer recommendation system we are aware of was introduced by Balachandran (Balachandran, 2013). They used authorship of the changed lines in a code review (using git blame) to identify who had worked on that code before and suggested a ranked list of this set as potential reviewers. Lipcak and Rossi (Lipcak and Rossi, 2018) performed a large-scale (293,000 pull requests) study of reviewer recommendation systems. They found that no single recommender works best for all projects, further supporting our assertion that there is no “one recommender to rule them all.” Thongtanunam et al. (Thongtanunam et al., 2015) proposed RevFinder, a reviewer recommender based on file locations. RevFinder is able to recommend reviewers for new files based on reviews of files that have similar paths in the filesystem. The approach was evaluated on over 40,000 code reviews across three OSS projects, and recalls a correct reviewer in the top 10 recommendations 79% of the time on average. Sülün et al. (Sülün et al., 2019) construct an artifact graph similar to our socio-technical graph and recommend potential reviewers based on paths through this graph from people to the artifact under review. Lee et al. (Lee et al., 2013) build a graph of developers and files, with edges indicating that a developer committed to a file or that one file is close to another file in the Java namespace tree. They use a random walk approach on this graph to recommend reviewers.

Yu et al. (Yu et al., 2014, 2016) recommend reviewers for a pull request by examining other pull requests whose terms have high textual similarity (cosine similarity in a term vector space), the comment network of other developers who have commented on the author’s pull requests in the past, and prior social interactions of developers with the author on GitHub. Jiang et al. (Jiang et al., 2017) examine the impact of various attributes of a pull request on a reviewer recommender, including file similarity, PR text similarity, social relations, “activeness”, and time difference. They find that adding measures of activeness to prior models increases performance considerably. Ouni et al. (Ouni et al., 2016) used a genetic search-based approach to find the optimal set of reviewers based on their expertise on the files involved and their previous collaboration with the change author. Zanjani et al. (Zanjani et al., 2015) train a model of expertise based on author interactions with files and a time decay to provide ranked lists of potential reviewers for a given change set. Rahman et al. (Rahman et al., 2016) propose CORRECT, an approach to recommend reviewers based on their history across all of GitHub as well as their experience with certain specialized technologies associated with a pull request.

Doğan et al. (Doğan et al., 2019) investigated the problem of ground truth in reviewer recommendation systems. They point out that many tools are trained and evaluated on historical code reviews and rely on an (often unstated) assumption that the selected reviewers were the correct reviewers. They find that using history as the ground truth is inherently flawed.

3. Socio-technical Graph

Figure 1. Coral architecture

The Coral system contains three main building blocks (as shown in Figure 1):

  1. Building the socio-technical graph.

  2. Performing the graph representation learning to generate node embeddings.

  3. Performing inductive inference to predict reviewers for the new pull requests.

In this section, we describe the process of building the socio-technical graph from entities (developers, repositories, files, pull requests, work-items) and their relationships in modern source code management systems shown as step 1 in Figure 1.

3.1. Socio-technical Graph

The socio-technical graph consists of nodes, which represent the people and the artifacts, and edges, which represent the relationships or interactions that exist between the nodes. Figure 2 shows the nodes and the edges along with their properties. The socio-technical graph (STG) has two fundamental elements.
Nodes There are six types of nodes in the socio-technical graph. They are pull request, work item, author, reviewer, file, and repository.
Edges There are seven types of edges in the socio-technical graph, as listed below.
creates created between an author node and a pull request node.
reviews created between a reviewer node and a pull request node.
contains created between a repository node and a pull request node if the repository contains the pull request.
changes created between a pull request node and a file node if the pull request is changing the file.
linkedTo created between a pull request node and a work item node if the pull request is linked to the work item.
commentsOn created between a pull request node and a reviewer node if the reviewer places code review comments.
parentOf created between a work item node and another work item node if there exists a parent-child relationship between them.
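
To make this structure concrete, the following sketch (ours, with hypothetical node identifiers and an invented example pull request; not code from the production system) shows how such a typed multigraph could be assembled and queried with networkx:

```python
# Illustrative sketch: assembling a small socio-technical graph with networkx.
# Node IDs and the example PR are hypothetical, not taken from the paper's data.
import networkx as nx

stg = nx.MultiDiGraph()

# Typed nodes: pull request, work item, author, reviewer, file, repository.
stg.add_node("repo:contoso-svc", kind="repository")
stg.add_node("pr:1234", kind="pull_request", title="Fix retry logic in uploader")
stg.add_node("wi:987", kind="work_item", title="Uploader intermittently fails")
stg.add_node("user:alice", kind="author")
stg.add_node("user:bob", kind="reviewer")
stg.add_node("file:src/uploader.py", kind="file")

# Typed edges mirroring the relations listed above.
stg.add_edge("user:alice", "pr:1234", relation="creates")
stg.add_edge("user:bob", "pr:1234", relation="reviews")
stg.add_edge("repo:contoso-svc", "pr:1234", relation="contains")
stg.add_edge("pr:1234", "file:src/uploader.py", relation="changes")
stg.add_edge("pr:1234", "wi:987", relation="linkedTo")
stg.add_edge("pr:1234", "user:bob", relation="commentsOn")

# Example query: which pull requests has a given reviewer reviewed?
prs_reviewed_by_bob = [v for u, v, d in stg.edges(data=True)
                       if u == "user:bob" and d["relation"] == "reviews"]
print(prs_reviewed_by_bob)  # ['pr:1234']
```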

Figure 2. socio-technical graph

3.2. Augmented socio-technical graph

To include semantic information, we expand the socio-technical graph to have text tokens represented as nodes. This has two benefits:

  1. Map users to concepts (word tokens): this helps in building a knowledge base mapping users (authors, reviewers) to concepts. For example, if a user authors or reviews pull requests that contain a token, a second-order relationship is established from that user to the token.

  2. Bring semantically similar tokens together: as we establish edges between words that appear together, we capture the semantic similarity between the words.

We perform the four operations explained below to construct the augmented socio-technical graph (ASTG):

  • Tokenize the text (title and description) of each pull request, work item, and the names of the source code files edited in those pull requests.

  • Filter out stop words using a block list (12).

  • All the text nodes that appear in a pull request title or description and work item title or description are linked to the respective pull requests. All the text nodes that appear in a file name are linked to the file nodes.

  • Text nodes are linked to each other based on their co-occurrence in the pull request corpus. Pointwise Mutual Information (PMI) (Manning and Schütze, 1999) is a common measure of the strength of association between two terms:

    PMI(x, y) = log [ p(x, y) / (p(x) p(y)) ]   (1)

    The formula is based on maximum likelihood estimates: when we know the number of observations for token x, o(x), the number of observations for token y, o(y), the number of co-occurrences of x and y, o(x, y), and the size of the corpus N, the probabilities for the tokens x and y, and for the co-occurrence of x and y, are calculated by:

    p(x) = o(x) / N,   p(y) = o(y) / N,   p(x, y) = o(x, y) / N   (2)

    The term p(x, y) is the probability of observing x and y together.
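
As a concrete illustration of Equations 1 and 2, the sketch below (ours; it uses a toy corpus and counts co-occurrence per document rather than within the five-token window listed in Table 2) computes PMI for a pair of tokens:

```python
# Illustrative PMI computation over a toy pull-request corpus (Equations 1 and 2).
# The corpus and the per-document co-occurrence counting are hypothetical choices.
import math
from collections import Counter
from itertools import combinations

corpus = [
    "fix retry logic in uploader",
    "add retry policy for uploader client",
    "update readme for uploader",
]

token_counts = Counter()
pair_counts = Counter()
for doc in corpus:
    tokens = doc.split()
    token_counts.update(tokens)
    # Count co-occurrences of distinct token pairs within the same document.
    pair_counts.update(frozenset(p) for p in combinations(set(tokens), 2))

N = sum(token_counts.values())  # corpus size in tokens

def pmi(x, y):
    """PMI(x, y) = log(p(x, y) / (p(x) * p(y))) with maximum likelihood estimates."""
    p_x = token_counts[x] / N
    p_y = token_counts[y] / N
    p_xy = pair_counts[frozenset((x, y))] / N
    return math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

print(round(pmi("retry", "uploader"), 3))  # positive: the tokens co-occur often
```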

Element type Label Count
Node pull request 1,342,821
Node work item 542,866
Node file 2,809,805
Node author 18,001
Node reviewer 30,585
Node text 1,104,427
Total 5,858,834
Edge creates 1,342,821
Edge reviews 7,066,703
Edge contains 1,342,821
Edge changes 12,595,859
Edge parent of 148,422
Edge linked to 1,252,901
Edge comments on 53,506
Total 23,803,053
Table 1. Distribution of node and edge types in the socio-technical graph

3.3. Scale

The socio-technical graph is built using the software development activity data from 332 repositories. We ingest data starting from 1st January, 2019, or from when the first pull request is created in a repository (whichever is older). The graph is refreshed three times a day. During the refresh we perform two operations:
Insert Ingest new pull request, work item, and code review information across all 332 repositories by creating the corresponding nodes, edges, and properties.
Update the word tokens connected to nodes, if there are changes. We also update the edges between nodes to reflect the changes in the source data.

The socio-technical graph contains 5,858,834 nodes and 23,803,053 edges. Detailed statistics of node and edge types can be found in Table 1.

4. Reviewer Recommendation via Graph Neural Networks

Reviewing a pull request is a collaborative effort. Good reviewers are expected to write good code review comments that help improve the quality of the code and thus shape a good product. To achieve this, a good reviewer needs to be 1) familiar with the feature that is implemented in the pull request, 2) experienced in working with the source code and the files that are modified by the pull request, 3) a good collaborator with others in the team, and 4) actively involved in creating and reviewing related pull requests in the repository. Hence, a machine learning algorithm that recommends reviewers for a pull request needs to model these complex interaction patterns to produce good recommendations. Feature learning via embedding generation has shown good promise in the literature for capturing complex patterns in data (Hoff et al., 2002; Hamilton et al., 2017; Chen et al., 2018; Zhou et al., 2021). Hence, in this work we pose the reviewer recommendation problem as ranking reviewers by similarity scores between users and pull requests in the embedding space. In the rest of this section we give details on learning embeddings for pull requests and users along with other entities (such as files, word tokens, etc.), and on scoring top reviewers for a new pull request using the learned embeddings.

The socio-technical graph shown in Figure 2 contains the essential ingredients to model the characteristics of a good reviewer: 1) the user - pull request - token path in the graph associates a user with a set of words that characterize the user’s familiarity with one or more topics; 2) the user - pull request - file path associates a user with a set of files that the user authors or reviews; 3) the user - pull request - user path characterizes the collaboration between people in a project; 4) the pull request - user - pull request path characterizes users working on related pull requests. Essentially, by envisioning software development activity as an interaction graph of various entities, we are able to capture interesting and complex relations and patterns in the system. We aim to encode all these complex interactions into entity embeddings using a Graph Neural Network (GNN) (Wu et al., 2021). These embeddings are then used as features to predict the most relevant reviewers for a pull request. In Figure 1 this is depicted as steps 2 and 3.

4.1. Graph Neural Network Architecture

The Graph Convolutional Network (GCN) (Schlichtkrull et al., 2017), a form of GNN, has shown great success in the machine learning community in capturing complex relations and interaction patterns in a graph through node embedding learning. In a GCN, for each node, we aggregate the feature information from all of its neighbors as well as the node’s own features. During this aggregation, neighbors are weighted according to the edge (relation) weight. A common approach that has been used effectively in the literature is to weight the edges using symmetric normalization, where the edge weight is normalized by the degrees of both nodes connected by the edge. The aggregated feature values are then transformed and fed to the next layer. This procedure is repeated for every node in the graph.

Mathematically it can be represented as follows:

h_v^(l+1) = σ( Σ_{u ∈ N(v) ∪ {v}} (1 / √(d_u d_v)) W^(l) h_u^(l) )   (3)

where h_v^(l) is the embedding of node v in the l-th layer; h_v^(0) is the initial set of node features, which can be set to one-hot vectors if no other features are available; N(v) is the set of neighbors of node v; W^(l) is the feature transformation weight matrix for the l-th step (learned via training); and σ is the activation function (such as ReLU (Nair and Hinton, 2010)). Note that symmetric normalization is achieved by dividing by √(d_u d_v), where d_u and d_v are the degrees of nodes u and v.
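
The following sketch (ours, with random toy inputs; not the trained Coral model) shows one GCN propagation step with symmetric normalization in the spirit of Equation 3:

```python
# Sketch of a single GCN layer with symmetric normalization (Equation 3).
# Random features and weights; purely illustrative.
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops so each node keeps its own features
    d = A_hat.sum(axis=1)                   # node degrees including self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D^{-1/2}
    H_next = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W
    return np.maximum(H_next, 0.0)          # ReLU activation

# Toy graph: 4 nodes on a path 0-1-2-3, one-hot-style features, 4-dim output.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H0 = np.eye(4, 8)                            # initial one-hot features
W0 = np.random.default_rng(0).normal(size=(8, 4))
print(gcn_layer(A, H0, W0).shape)            # (4, 4)
```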

GCN learns node embeddings from a homogeneous graph with a single node type and a single relation type. However, the pull request graph in Figure 2 is a heterogeneous graph with different node types and different relation types between them. In this case, inspired by RGCN (Schlichtkrull et al., 2017), for each node we aggregate the feature information separately for each type of relation.

Mathematically it can be represented as follows:

h_v^(l+1) = σ( Σ_{r ∈ R} Σ_{u ∈ N_r(v)} (1 / |N_r(v)|) W_r^(l) h_u^(l) + W_0^(l) h_v^(l) )   (4)

where R is the set of relations, N_r(v) is the set of neighbors of v connected via relation r, W_r^(l) is the relation-specific feature transformation weight matrix for the l-th layer, and W_0^(l) is the feature transformation weight matrix for the self node.
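
A minimal sketch of one relation-aware layer in the spirit of Equation 4 follows (ours; relations, adjacency matrices, and weights are toy data, and per-relation mean normalization is assumed):

```python
# Sketch of one relational GCN layer (Equation 4): per-relation mean aggregation
# plus a self transform. All inputs here are random toy data.
import numpy as np

def rgcn_layer(adj_by_rel, H, W_by_rel, W_self):
    """H'_v = ReLU( sum_r mean_{u in N_r(v)} W_r h_u  +  W_self h_v )."""
    out = H @ W_self                               # self term W_0^{(l)} h_v^{(l)}
    for rel, A_r in adj_by_rel.items():
        deg = A_r.sum(axis=1, keepdims=True)       # |N_r(v)| per node
        deg[deg == 0] = 1.0                        # avoid division by zero for isolated nodes
        out += (A_r / deg) @ H @ W_by_rel[rel]     # relation-specific mean aggregation
    return np.maximum(out, 0.0)

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 8, 4
H0 = rng.normal(size=(n, d_in))
adj = {"pr-user": rng.integers(0, 2, size=(n, n)).astype(float),
       "pr-file": rng.integers(0, 2, size=(n, n)).astype(float)}
W = {r: rng.normal(size=(d_in, d_out)) for r in adj}
W_self = rng.normal(size=(d_in, d_out))
print(rgcn_layer(adj, H0, W, W_self).shape)        # (5, 4)
```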

The set of relations R captures the semantic relatedness of different types of nodes in the graph. This is generally determined by domain knowledge. For Coral, we identified the set of useful relations listed in Table 2.

Relation Semantic Description
1 PullRequest - User Captures the author or reviewer relationship between a pull request and a user
2 PullRequest - File Captures the related file modification needed for a pull request
3 PullRequest - Word Captures the semantic description of a pull request through the words
4 File - Word Captures the semantic description of a file.
5 Word - Word Captures the related words in a window of size 5 in a sentence (in the pull request title/description)
Table 2. Relations (R) used for generating embeddings

In our experiments, we use a 2-layer GCN network, i.e., we set the number of layers to 2 in Equation 4. With this, the GCN can capture second-order relations such as User-User, File-File, User-File, User-Word, etc., which we believe are useful in capturing interesting dependencies between various entities, such as related files, related users, files authored/modified by users, and words associated with users. While setting the number of layers to an even higher value can fold in longer-distance relations, it is not clear whether that helps or introduces more noise. We leave that exploration to future work.

4.2. Training the Model

To learn the parameters of the model (i.e., the weight matrices W_r^(l) and W_0^(l)) we pose it as a link prediction problem. Here, we set the probability of the existence of a link/edge between two nodes u and v as proportional to the dot product between their embeddings derived from the 2-layer GCN. In particular, we set the link probability equal to σ(z_u · z_v), where σ denotes the logistic function and z_u and z_v denote the embeddings of nodes u and v respectively (i.e., their final-layer outputs from Equation 4). This probability is high when the nodes u and v are connected in the graph, and low when they are not. Accordingly, we prepare a training data set containing records of triplets (u, v, y), where (u, v) are node pairs and y ∈ {0, 1} denotes the presence or absence of an edge between u and v. Since there can be a very large number of node pairs where u and v are not connected, we employ random sampling to select a sizable number of such pairs. The training objective is to minimize the cross-entropy loss in Equation 5.

L = − Σ_{(u, v, y)} [ y log σ(z_u · z_v) + (1 − y) log(1 − σ(z_u · z_v)) ]   (5)

Minimizing the above loss enforces the dot product of the embeddings of two nodes to attain a high value when they are connected by an edge in the graph (i.e., when y = 1), and a low value when they are not connected (i.e., when y = 0). The parameters of the model are updated as the training progresses to minimize the above loss. We stop training when the loss function stops decreasing (or the decrease becomes negligible).
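
The sketch below (ours, with toy embeddings and a handful of sampled pairs) illustrates the cross-entropy objective of Equation 5 with negative sampling:

```python
# Sketch of the link-prediction objective (Equation 5): binary cross-entropy on
# dot products of node embeddings, with randomly sampled negative pairs.
# Embeddings and edges are toy data, not the trained Coral model.
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 100, 16
Z = rng.normal(size=(num_nodes, dim))            # node embeddings from the 2-layer GCN

pos_edges = [(0, 1), (2, 3), (4, 5)]             # observed edges (label y = 1)
neg_edges = [(rng.integers(num_nodes), rng.integers(num_nodes))
             for _ in range(len(pos_edges))]     # sampled non-edges (label y = 0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def link_prediction_loss(Z, pos_edges, neg_edges):
    """Cross-entropy over edge probabilities sigma(z_u . z_v)."""
    loss = 0.0
    for (u, v), y in [(e, 1.0) for e in pos_edges] + [(e, 0.0) for e in neg_edges]:
        p = sigmoid(Z[u] @ Z[v])
        loss -= y * np.log(p + 1e-12) + (1.0 - y) * np.log(1.0 - p + 1e-12)
    return loss / (len(pos_edges) + len(neg_edges))

print(round(link_prediction_loss(Z, pos_edges, neg_edges), 4))
```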

4.3. Inductive Recommendation for New Pull Requests

GCN by design is a transductive model. That is, it can generate embeddings only for the nodes that are present in the graph during training. It cannot generate embeddings for new nodes without adding those nodes to the graph and retraining. On the other hand, inductive models can infer embeddings for new nodes that were unseen during training by applying the learned model to them. Since Coral is a GCN-based model, we will not have an embedding for a new pull request p at inference time. We need to derive the embedding for p on the fly by applying Equation 4. The challenge in deriving the embedding is in getting the correct self embedding for p: as per Equation 4, generating it requires trained self-transformation weights and initial features for p, which are not available for new nodes. Hence we approximate the embedding of the new node by ignoring its self embedding part in Equation 4, which leads to the following approximation:

z_p ≈ σ( Σ_{r ∈ R_p} Σ_{u ∈ N_r(p)} (1 / |N_r(p)|) W_r^(1) h_u^(1) )   (6)

Here, z_p is the embedding of the new pull request p, R_p is the set of relations involving the pull request node (i.e., PullRequest-User, PullRequest-File, and PullRequest-Word), W_r^(1) are the trained model weights of the second layer of the GCN, and h_u^(1) are the embeddings coming out of the first layer of the GCN.

After obtaining the embedding of the new pull request p as per Equation 6, we can get the top-k reviewers for it by finding the top-k closest users in the embedding space. That is,

TopReviewers(p) = argmax-k_{u ∈ Users} ( z_p · z_u )   (7)

where z_u is the embedding of user u.

Since our training objective enforces a high dot product when the likelihood of an edge is high, Equation 7 finds the users who are most likely to be associated with the pull request as reviewers. Finding top reviewers in this way using their embeddings allows us to naturally make use of the complex relationships encoded in those embeddings to capture a user’s relatedness to the pull request.
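
The following sketch (ours, with randomly generated toy weights and embeddings) walks through Equations 6 and 7: it aggregates a new pull request’s neighbor embeddings through the trained second-layer relation weights and then ranks users by dot product:

```python
# Sketch of inductive inference for a new pull request (Equations 6 and 7).
# Weights, neighbor embeddings, and user embeddings are random toy data.
import numpy as np

rng = np.random.default_rng(0)
d1, d2, num_users = 16, 8, 50

# Trained second-layer weights per relation involving a pull request node.
W2 = {"pr-user": rng.normal(size=(d1, d2)),
      "pr-file": rng.normal(size=(d1, d2)),
      "pr-word": rng.normal(size=(d1, d2))}

# First-layer embeddings of the new PR's neighbors (its author, files, and words).
neighbors = {"pr-user": rng.normal(size=(1, d1)),
             "pr-file": rng.normal(size=(3, d1)),
             "pr-word": rng.normal(size=(12, d1))}

# Equation 6: mean-aggregate per relation, sum over relations, apply the activation.
z_p = np.zeros(d2)
for rel, H_nb in neighbors.items():
    z_p += H_nb.mean(axis=0) @ W2[rel]
z_p = np.maximum(z_p, 0.0)

# Equation 7: score every user embedding by dot product and take the top k.
user_embeddings = rng.normal(size=(num_users, d2))   # z_u for each user
scores = user_embeddings @ z_p
top_k = np.argsort(-scores)[:5]
print(top_k)                                          # indices of the 5 highest-scoring users
```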

5. Experiments

To assess the value of Coral empirically, we pose three research questions:

  1. RQ1: Can Coral simulate the review history?

  2. RQ2: Under what circumstances does Coral perform better than a rule-based model (and vice versa)?

  3. RQ3: What are developers’ perceptions about Coral?

The vast majority of code reviewer recommendation approaches are evaluated by comparing recommendations from the tool with historical code reviews and examining how often the recommended reviewers were the actual reviewers (Doğan et al., 2019). In line with this accepted practice, RQ1 asks how often the network is able to recommend the reviewers that the authors added. However, as Doğan et al. point out, there is an underlying (and often unstated) assumption that these are the correct reviewers (Doğan et al., 2019). (We would point out that if this assumption were entirely correct, there would be no need for a recommender in the first place!) To address this flawed assumption and pursue a more complete ground truth, we reach out to the reviewers recommended by Coral that were not recommended by a rule-based model. The results of this developer study help address RQ2 and RQ3.

For the purpose of conducting the experiments and comparative studies, we use a rule-based model built on the heuristics proposed by Zanjani et al. (Zanjani et al., 2015), which demonstrated that considering the history of source code files edited in a pull request, in terms of authorship and reviewership, is an effective way to recommend peer reviewers for a code change. This model is currently deployed in production at our company. This gives us an opportunity to conduct comparative studies by observing the recommendations made by the model and the telemetry generated from the production deployment.

5.1. Methodology

5.1.1. Retrospective Evaluation

To address RQ1, we construct a dataset of 254K code reviews, i.e. pull request–reviewer pairs, starting from 2019 to evaluate Coral. To keep training and validation cases separate, these nodes and their edges are not present in the graph during model training. We use the following metrics, which are the most common measures for evaluating reviewer recommender approaches (Thongtanunam et al., 2015; Ouni et al., 2016; Sülün et al., 2019; Zanjani et al., 2015; Balachandran, 2013; Rahman et al., 2016):

Accuracy We measure the percentage of pull requests from the test data for which Coral is able to recommend at least one reviewer correctly, and report the percentage for the top 1, 3, 5, and 7 reviewers suggested by the model. Specifically, given a set of pull requests P, the top-k accuracy can be calculated using Equation 8. The isCorrect(p, Top-k) function returns 1 if at least one of the top-k recommended reviewers actually reviews the pull request p, and 0 otherwise.

Top-k Accuracy = ( Σ_{p ∈ P} isCorrect(p, Top-k) ) / |P|   (8)

Mean reciprocal rank (MRR) This metric is used extensively in recommender systems to assess whether the correct recommendation is made at the top of a ranked list (Manning et al., 2008). MRR is calculated using Equation 9, where rank(candidates(p)) returns the rank of the first correct reviewer in the recommendation list candidates(p).

MRR = ( 1 / |P| ) Σ_{p ∈ P} 1 / rank(candidates(p))   (9)
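
As an illustration of how these metrics are computed, the sketch below (ours, over two hypothetical pull requests; it computes plain MRR rather than the rank-truncated MRR@k reported in Table 4) implements Equations 8 and 9:

```python
# Sketch of the evaluation metrics (Equations 8 and 9) over hypothetical
# recommendation lists; in the paper, the actual reviewers come from held-out history.
def top_k_accuracy(recommendations, actual_reviewers, k):
    """Fraction of PRs where at least one top-k recommendation is an actual reviewer."""
    hits = sum(1 for pr, recs in recommendations.items()
               if set(recs[:k]) & actual_reviewers[pr])
    return hits / len(recommendations)

def mean_reciprocal_rank(recommendations, actual_reviewers):
    """Average of 1 / rank of the first correct reviewer in each recommendation list."""
    total = 0.0
    for pr, recs in recommendations.items():
        for rank, user in enumerate(recs, start=1):
            if user in actual_reviewers[pr]:
                total += 1.0 / rank
                break
    return total / len(recommendations)

# Toy example with two pull requests.
recs = {"pr:1": ["alice", "bob", "carol"], "pr:2": ["dave", "erin", "frank"]}
truth = {"pr:1": {"bob"}, "pr:2": {"frank"}}
print(top_k_accuracy(recs, truth, k=3))       # 1.0: both PRs have a hit in the top 3
print(mean_reciprocal_rank(recs, truth))      # (1/2 + 1/3) / 2 ≈ 0.4167
```
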
Repo size (# of developers) # data
Large 220
Medium 200
Small 80
Table 3. Pull request distribution across dimensions

5.1.2. User Study

To address RQ2 and RQ3, we conduct a user study, reaching out to reviewers recommended by Coral to see whether they would be qualified to review the pull requests.

Sampling We select 500 pull requests from the test data set of 254K pull requests and randomly pick one of the top 2 recommendations made by Coral as the recommended reviewer to reach out to. Note that the selected pull requests had not received recommendations from the rule-based model, and each recommended reviewer appears at most once. The pull requests are collected from repositories having different numbers of developers using stratified random sampling, following the distribution in Table 3. The categories are defined by number of developers: Large (> 100 developers), Medium (between 25 and 100 developers), Small (< 25 developers).

Questionnaire We perform the user study by posing a set of questions on what actions a reviewer might take when they were recommended for a specific pull request:

  1. Is this pull request relevant to you (as of the PR creation date, state)?
    A - Not relevant
    B - Relevant, I’d like to be informed about the pull request.
    C - Relevant, I’d take some action and/or I’d comment on the pull request.

  2. If possible, could you please explain the reason behind your choice?

We avoid intruding on the actual workflow, yet still maintain an adequate level of realism by working with actual pull requests, thus balancing realism and control in our study (Stol and Fitzgerald, 2018). Note that, with 287 responses, this is one of the largest field studies conducted to understand the effects of an automated reviewer recommender system.

We divided the questionnaire among four people to conduct the user studies. The interviewers did not know these reviewers, nor had they worked with them before. The teams working on the systems under study are organizationally distant from the interviewers, so the interviewers have no direct influence on the study participants. The interview format is semi-structured: users are free to bring up their own ideas and to express their opinions about the recommendations.

We use question (2) to collect user feedback and analyze it to generate insights about developers’ perceptions of automated reviewer recommendation systems (RQ3), namely the factors that make people reluctant to use an automated reviewer recommendation system.

5.1.3. Comparing with Rule-based Model

To compare Coral with the rule-based model, we select another 500 pull requests from the set of pull requests on which the rule-based model (currently deployed in production) has made recommendations, following the same distribution as the pull requests selected for evaluating Coral (Table 3). We then collect from telemetry the recommendations made by the rule-based model and the subsequent actions performed by the recommended reviewers (changing the status of the pull request, adding a code review comment, or both) for the selected pull requests. The telemetry yields two benefits: (1) it helps us gather user feedback without conducting another large-scale user study, as the telemetry already captures the user actions; (2) it spares potential study participants from taking one more survey (saving them time and frustration), because they already indicated their preferences on the pull request when it was active and they were added as reviewers.

An important point to keep in mind is that the rule-based model adds recommended reviewers directly to the pull requests. This increases the probability of them taking an action, because the reviewers are called out in public (everyone, including their managers, can see who is reviewing the pull request) (Allen and Davis, 2011); if they do not respond, it might look like they are blocking the pull request’s progress. In contrast, Coral’s recommendations are validated through user studies conducted in a private one-on-one setting, where reviewers can be open about their decisions. Therefore, Coral might be at a slight disadvantage in this comparison.

5.2. Results

5.2.1. RQ1: Can Coral simulate the review history?

To answer RQ1, we examine who the pull request author invited to review a change and then check whether Coral recommended the same reviewers. In this context, a “correct” recommendation is defined as the recommended reviewer being invited to the pull request. While the author’s actions may not actually reflect the ground truth of who is best able to review the change, most prior work in code reviewer recommendation evaluates recommenders in this way (see (Doğan et al., 2019) for a thorough discussion), and so we follow suit here. Table 4 shows the accuracy and MRR for Coral across all 254K (pull request, reviewer) pairs. In 73% of the pull requests, Coral is able to replicate the human authors’ behavior in picking the reviewers within the top 3 recommendations, which validates that Coral models the history of reviewer selection quite well.

Metric k = 1 k=3 k=5 k=7
Accuracy 0.50 0.73 0.78 0.80
MRR 0.49 0.61 0.68 0.72
Table 4. Link prediction accuracy and MRR
Repo size (# of developers) RM Coral
Large 0.19 0.37
Medium 0.31 0.36
Small 0.35 0.23
Table 5. Comparative user study precision across dimensions. RM is the Rule-based Model. Only the difference between the two models for large repositories is statistically significant (see Section 5.2.2).

5.2.2. RQ2: Under what circumstances does Coral perform better than a rule-based model (and vice versa)?

In Table 5, we show the recommendation precision of the rule-based model and Coral. Specifically, on the sampled data for each model, precision is calculated as the percentage of recommended reviewers who are willing to engage in reviewing the pull requests. For the rule-based model, reviewers who either change the status of the pull request or add a code review comment are considered engaged. For Coral, reviewers who say that the pull request is relevant and that they would take some action are considered engaged.

Generally, there is “no one model to rule them all”: neither of the models performs consistently better than the other on pull requests from repositories of all categories. As shown in Table 5, Coral performs better on pull requests from large and medium repositories, while the rule-based model does well on pull requests from small repositories. However, when we statistically tested for differences, Fisher exact tests (Agresti, 2003) only showed a statistically significant difference between the two approaches for large repositories ().

One observation that may explain this result is that, due to their size, large software projects dominate the graph. Thus, Coral is trained on many more pull requests from large projects than from smaller projects. If the mechanisms, factors, behaviors, etc., for reviewer selection are different in smaller projects than in large ones, then the model is likely to learn those used in larger projects. This hypothesis could be confirmed by splitting the training data by project size and training multiple models. However, as reviewer recommendation is most important in projects with many developers, and that appears to be where Coral excels, we do not pursue this line of inquiry here.

We have observed that in small repositories, usually with few developers, one or two experienced developers are more likely to take on the responsibility of reviewing pull requests, which accounts for the high accuracy of the rule-based model. However, this phenomenon, in which a small number of experienced people in a particular repository are assigned the lion’s share of reviews, is problematic, and heuristics have been used to “share the load” (Asthana et al., 2019). Because the socio-technical graph contains historical information about a developer across many repositories, and PRs from different repositories may be semantically related, Coral is able to leverage more information per developer and per PR, which may help avoid this problem.

The following feedback received from the user study (question (2)) also demonstrates that Coral identifies relevant and qualified reviewers who traditional reviewer recommenders miss:

This PR is created in a repository on which our service has a dependency on. I would love to review these PRs. In fact, I am thinking of asking x on these PRs going forward.

I never reviewed y’s PRs. I work with her on the same project and know what she is doing. I am happy to provide any feedback (of course if she’d like :))

The content of the PR might impact another repository that I have ownership of because we use some of the components in that lib. Based on that I would say it is a relevant PR and I will not mind reviewing it.

5.2.3. RQ3: What are developers’ perceptions about an automated reviewer recommendation model?

Category # of responses (%)
I will review this pull request 170 (59.23%)
I’d like to be added to this pull request 24 (8.36%)
This pull request is not relevant to me 93 (32.40%)
Table 6. Distribution of qualitative user study responses.

We show the distribution of user study responses in Table 6. Out of the 500 user study messages we sent, 287 users responded. 67.6% of the users gave positive feedback, saying that the given pull request was relevant to them to some degree. Within this group, 8.36% of all respondents said they would like to be informed about the pull request, and 59.23% said that they would take some action and/or leave a comment on it. 32.4% of the users gave negative feedback, saying that the pull request was not relevant to them.

Category Feedback # of feedback (%)
I This pull request is no longer relevant to me 71 (91.03%)
II Never participate in code review 5 (6.41%)
III Pull request does not need reviewer 2 (2.56%)
Table 7. Users’ Negative Feedback Categories.

To understand why users do not like Coral’s recommendations, we analyze the negative feedback (comments/anecdotes from the developers) and classify it into 3 categories, with their distribution shown in Table 7. To offer an impression, we show some typical negative quotes that we received from users.

In 91.03% of the negative feedback we received, respondents said that the pull request was no longer relevant to them; 69.23% of these said it was because they had started to work in a different area, and 21.79% mentioned that they no longer work in this repository because they switched groups or their team was transferred: “Not relevant since I no longer work on the team that manages this service.” 6.41% of the users mentioned that they are never involved in code review: “I’m a PM. I’m less interested in PRs in general. Only when I’m needed by the devs and then they mention me there.” Two users said that the pull requests we provided did not need to be reviewed: “Let me explain. This is an automated commit that updates the version number of the product as part of the nightly build. It pretty much happens every night. So it doesn’t need a reviewer like a traditional pull request would.”

From users’ negative feedback, we learn that in order to improve Coral we need to account for several additional factors. First, our socio-technical graph should take people’s movement into consideration and update the graph dynamically, namely by identifying inactive users and removing, or decaying the weight of, the edges between user nodes and repository nodes.

Second, Coral should include and learn the job role of every user (such as SDE or PM) in the socio-technical graph through node embeddings, so that it can filter out irrelevant users and suggest reviewers more precisely.

Third, before running Coral, heuristic rules can be applied to filter out automated or deprecated pull requests.

Besides the negative feedback, we received a lot of positive comments from users:

The recommendation makes a lot of sense since I primarily contributed to that repository for a few years. However, a recent re-org means I no longer work on that repository.

I am lead of this area and would like to review these kinds of PRs which are likely fixing some regressions.

These comments validate our claim that Coral does consider the interactions between users and files, and that its recommendations are understandable by humans. Since Coral is trained and evaluated on historical pull requests dating back to 2019, it is hard to reconstruct the situation in which the pull requests were created, and many users complained that it is difficult to recall the context of old pull requests, thus putting Coral at a disadvantage. We expect it to perform better in an actual production deployment.

Models Accuracy MRR
k = 1 k=3 k=5 k=7 k = 1 k=3 k=5 k=7
(1) No words or files 0.02 0.08 0.13 0.16 0.01 0.04 0.05 0.06
(2) Words only 0.21 0.30 0.32 0.34 0.21 0.25 0.26 0.32
(3) Files only 0.29 0.69 0.73 0.76 0.29 0.48 0.49 0.50
(4) Words + Files 0.49 0.73 0.77 0.80 0.49 0.61 0.68 0.72
Table 8. Link prediction accuracy and MRR for various configurations of parameters

5.2.4. Ablation Study

To evaluate the contribution of each entity type in Coral, we perform an ablation study, with results shown in Table 8. Specifically, we first remove the entities from the socio-technical graph and training data, and then retrain the graph convolutional neural network. We find that ablating each entity type deteriorates performance across metrics. After removing word entities and file entities from the graph, i.e., when the socio-technical graph only contains user and pull request entities, the model can hardly recommend correct reviewers. By comparing (1) with (2) and (1) with (3), we demonstrate the importance of both the semantic information and the file change history introduced by file entities in recommending reviewers, with file entities providing more value than words. Looking at (3) and (4), we observe a boost in performance when adding semantic information on top of the file change and review activities, which underlines our claim that incorporating information about interactions between code contributors as well as the semantics of code changes and their descriptions helps identify the best reviewers.

6. Threats and Limitations

As part of our study, we reached out to people who were not invited to a review but whom Coral recommended as potential reviewers. It is possible that their responses to our solicitations differed from what they would have actually done had they been unaware that their actions/responses were being observed (the so-called Hawthorne Effect (Adair, 1984)). The company at which this was evaluated has tens of thousands of developers, and we were careful not to include any repositories or participants that we had interacted with before or that might have a conflict of interest with us. Nonetheless, there is a small chance that respondents may have been positive about the system because they wanted to make the interviewers happy.

The socio-technical graph contains information about who was added as a reviewer on a PR, but it does not include why that person was added or whether they were added as the result of a reviewer recommendation tool. Thus, in our evaluation of how well Coral is able to recommend reviewers that were historically added to reviews, it is unclear how much of the history comes from the rule-based recommender and how much from authors acting without the aid of a recommender.

When looking at repository history, the initial recommendation by the rule-based model is based on files involved in the initial review, while Coral includes files and descriptions in the review’s final state. If the description or the set of files was modified, then Coral may have a different set of information available to it than it would have had it been used at the time of PR creation.

In our evaluation of Coral, we use a training set of PRs to train the model and keep a held-out set for evaluation. These datasets are disjoint, but they are not temporally divided. In an ideal setting, all training PRs would precede all evaluation PRs in time, and we would evaluate our approach by looking at Coral’s recommendation for the next unseen PR (ordered by time), then add that PR to the socio-technical graph, retrain the model on the updated graph for the following PR, and repeat until all PRs in the evaluation set were exhausted. This form of evaluation proved too costly and time consuming to conduct, so we used a random split of training and testing data sets.

We sampled the 500 PRs from the population using a random selection approach. We selected the sample size in an effort to avoid bias and confounding factors in the sample, but we cannot guarantee that this data set is free from noise, bias, etc.

7. Future Work

In this work we showed that a simple GCN-style model is able to capture complex interaction patterns between various entities in the code review ecosystem and can be used to predict relevant reviewers for pull requests effectively. While this method is very promising on large repositories, we believe it can be improved to make good recommendations on other repositories as well by training repository-type-specific models. In this work we mainly focused on using the interaction graph of various entities (pull requests, users, files, words, etc.) to learn complex features through embeddings. We captured neither node-specific features (e.g., user-specific features, file-specific features, etc.) nor edge-specific features (e.g., how long ago a user authored/modified a file, or whether two users belong to the same organization). Incorporating such features may help the model learn even more complex patterns from the data and further improve recommendation accuracy. Furthermore, we believe that a detailed study of the effect of model hyperparameters (such as embedding dimension, number of GCN layers, and different activation functions) on recommendation accuracy would be a very useful result. We intend to explore these directions in our future work.

The techniques explained in this paper and the Coral system are generic enough to be applied to any dataset that follows a Git-based development model. Therefore, we see opportunities for implementing Coral for source control systems like GitHub and GitLab.

8. Conclusion

In this work, we seek to leverage additional recorded information in software repositories to improve reviewer recommendation and to address the weakness of approaches that rely only on the historical information of changes and reviews.

To that end we propose Coral, a novel graph-based machine learning model that leverages a socio-technical graph built from the rich set of entities (developers, repositories, files, pull requests, work items, etc.) and their relationships in modern source code management systems. We train a Graph Convolutional Network (GCN) on this graph to learn to recommend code reviewers for pull requests.

Our retrospective results show that in 73% of the pull requests, Coral is able to replicate the human pull request authors’ behavior within its top 3 recommendations, and that it performs better than the rule-based model in production on pull requests in large repositories by 94.7%. A large-scale user study with 500 developers showed 67.6% positive feedback, confirming its relevance in suggesting the correct code reviewers for pull requests.

Our results open new possibilities for incorporating the rich set of information available in software repositories, and the interactions that exist between various actors and entities, to develop code reviewer recommendation models. We believe the techniques and the system have wider applicability, ranging from individual organizations to large open source projects. Beyond code reviewer recommendation, future research could also target other recommendation scenarios in source code repositories that could aid software developers by leveraging socio-technical graphs.

9. Data Availability

We are unfortunately unable to make the data involved in this study publicly available, as it contains personally identifiable information as well as confidential information. Access to the data for this study was granted under a condition of confidentiality by the company providing it, and we cannot share it while remaining compliant with the General Data Protection Regulation (GDPR) (13).

References

  • J. G. Adair (1984) The hawthorne effect: a reconsideration of the methodological artifact.. Journal of applied psychology 69 (2), pp. 334. Cited by: §6.
  • A. Agresti (2003) Categorical data analysis. Vol. 482, John Wiley & Sons. Cited by: §5.2.2.
  • R. L. Allen and A. S. Davis (2011) Hawthorne effect. In Encyclopedia of Child Behavior and Development, S. Goldstein and J. A. Naglieri (Eds.), pp. 731–732. External Links: ISBN 978-0-387-79061-9, Document, Link Cited by: §5.1.3.
  • S. Asthana, R. Kumar, R. Bhagwan, C. Bird, C. Bansal, C. Maddila, S. Mehta, and B. Ashok (2019) WhoDo: automating reviewer suggestions at scale. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 937–945. Cited by: §1, §5.2.2.
  • A. Bacchelli and C. Bird (2013) Expectations, outcomes, and challenges of modern code review. In 2013 35th International Conference on Software Engineering (ICSE), pp. 712–721. Cited by: §1.
  • V. Balachandran (2013) Reducing human effort and improving quality in peer code reviews using automatic static analysis and reviewer recommendation. In 2013 35th International Conference on Software Engineering (ICSE), pp. 931–940. Cited by: §2, §5.1.1.
  • A. Bosu and J. C. Carver (2013) Impact of peer code review on peer impression formation: a survey. In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 133–142. Cited by: §1.
  • A. Bosu, M. Greiler, and C. Bird (2015) Characteristics of useful code reviews: an empirical study at microsoft. In 2015 IEEE/ACM 12th Working Conference on Mining Software Repositories, pp. 146–156. Cited by: §1.
  • H. A. Çetin, E. Doğan, and E. Tüzün (2021) A review of code reviewer recommendation studies: challenges and future directions. Science of Computer Programming, pp. 102652. Cited by: §2.
  • H. Chen, B. Perozzi, R. Al-Rfou, and S. Skiena (2018) A tutorial on network embeddings. External Links: 1808.02590 Cited by: §4.
  • E. Doğan, E. Tüzün, K. A. Tecimer, and H. A. Güvenir (2019) Investigating the validity of ground truth in code reviewer recommendation studies. In 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pp. 1–6. Cited by: §2, §5.2.1, §5.
  • [12] (Accessed 2021) English stop words. External Links: Link Cited by: 2nd item.
  • [13] (2018-05-25)(Website) European Commission. External Links: Link Cited by: §9.
  • G. Gousios, M. Pinzger, and A. v. Deursen (2014) An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering, pp. 345–355. Cited by: §1.
  • W. L. Hamilton, R. Ying, and J. Leskovec (2017) Representation learning on graphs: methods and applications. arXiv preprint arXiv:1709.05584. Cited by: §4.
  • P. D. Hoff, A. E. Raftery, and M. S. Handcock (2002) Latent space approaches to social network analysis. Journal of the american Statistical association 97 (460), pp. 1090–1098. Cited by: §4.
  • J. Jiang, Y. Yang, J. He, X. Blanc, and L. Zhang (2017) Who should comment on this pull request? analyzing attributes for more accurate commenter recommendation in pull-based development. Information and Software Technology 84, pp. 48–62. Cited by: §1, §2.
  • O. Kononenko, O. Baysal, L. Guerrouj, Y. Cao, and M. W. Godfrey (2015) Investigating code review quality: do people and participation matter?. In 2015 IEEE international conference on software maintenance and evolution (ICSME), pp. 111–120. Cited by: §1.
  • J. B. Lee, A. Ihara, A. Monden, and K. Matsumoto (2013) Patch reviewer recommendation in oss projects. In 2013 20th Asia-Pacific Software Engineering Conference (APSEC), Vol. 2, pp. 1–6. Cited by: §1, §2.
  • J. Lipcak and B. Rossi (2018) A large-scale study on source code reviewer recommendation. In 2018 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 378–387. Cited by: §1, §2.
  • C. D. Manning, P. Raghavan, and H. Schütze (2008) Introduction to information retrieval. Cambridge University Press, USA. External Links: ISBN 0521865719 Cited by: §5.1.1.
  • C. D. Manning and H. Schütze (1999) Foundations of statistical natural language processing. The MIT Press, Cambridge, Massachusetts. External Links: Link Cited by: 4th item.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10, Madison, WI, USA, pp. 807–814. External Links: ISBN 9781605589077 Cited by: §4.1.
  • A. Ouni, R. G. Kula, and K. Inoue (2016) Search-based peer reviewers recommendation in modern code review. In 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 367–377. Cited by: §1, §2, §5.1.1.
  • M. M. Rahman, C. K. Roy, and J. A. Collins (2016) Correct: code reviewer recommendation in github based on cross-project and technology experience. In Proceedings of the 38th international conference on software engineering companion, pp. 222–231. Cited by: §2, §5.1.1.
  • P. C. Rigby and C. Bird (2013) Convergent contemporary software peer review practices. In Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, pp. 202–212. Cited by: §1.
  • P. C. Rigby and M. Storey (2011) Understanding broadcast based peer review on open source software projects. In 2011 33rd International Conference on Software Engineering (ICSE), pp. 541–550. Cited by: §1.
  • P. Rigby, B. Cleary, F. Painchaud, M. Storey, and D. German (2012) Contemporary peer review in action: lessons from open source development. IEEE software 29 (6), pp. 56–61. Cited by: §1.
  • M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling (2017) Modeling relational data with graph convolutional networks. External Links: 1703.06103 Cited by: §4.1, §4.1.
  • K. Stol and B. Fitzgerald (2018) The abc of software engineering research. ACM Trans. Softw. Eng. Methodol. 27 (3). External Links: ISSN 1049-331X, Link, Document Cited by: §5.1.2.
  • E. Sülün, E. Tüzün, and U. Doğrusöz (2019) Reviewer recommendation using software artifact traceability graphs. In Proceedings of the Fifteenth International Conference on Predictive Models and Data Analytics in Software Engineering, pp. 66–75. Cited by: §1, §2, §5.1.1.
  • P. Thongtanunam, C. Tantithamthavorn, R. G. Kula, N. Yoshida, H. Iida, and K. Matsumoto (2015) Who should review my code? a file location-based code-reviewer recommendation approach for modern code review. In 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), pp. 141–150. Cited by: §1, §2, §5.1.1.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and S. Y. Philip (2020) A comprehensive survey on graph neural networks. IEEE transactions on neural networks and learning systems 32 (1), pp. 4–24. Cited by: §1.
  • Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu (2021) A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems 32 (1), pp. 4–24. External Links: ISSN 2162-2388, Link, Document Cited by: §4.
  • Y. Yu, H. Wang, G. Yin, and C. X. Ling (2014) Reviewer recommender of pull-requests in github. In 2014 IEEE International Conference on Software Maintenance and Evolution, pp. 609–612. Cited by: §1, §2.
  • Y. Yu, H. Wang, G. Yin, and T. Wang (2016) Reviewer recommendation for pull-requests in github: what can we learn from code review and bug assignment?. Information and Software Technology 74, pp. 204–218. Cited by: §1, §2.
  • M. B. Zanjani, H. Kagdi, and C. Bird (2015) Automatically recommending peer reviewers in modern code review. IEEE Transactions on Software Engineering 42 (6), pp. 530–543. Cited by: §2, §5.1.1, §5.
  • J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, and M. Sun (2021) Graph neural networks: a review of methods and applications. External Links: 1812.08434 Cited by: §4.