Productivity Assessment of Neural Code Completion

05/13/2022
by   Albert Ziegler, et al.

Neural code synthesis has reached a point where snippet generation is accurate enough to be considered for integration into human software development workflows. Commercial products aim to increase programmers' productivity, without being able to measure it directly. In this case study, we asked users of GitHub Copilot about its impact on their productivity, and sought to find a reflection of their perception in directly measurable user data. We find that the rate with which shown suggestions are accepted, rather than more specific metrics regarding the persistence of completions in the code over time, drives developers' perception of productivity.


1. Introduction

Code completion systems that offer suggestions to a developer on the basis of contextual information from the IDE have been shown to be by far the most frequently used kind of programmer assistance (Amann et al., 2016). One common example is that of proposing a list of method names based on the type of a variable. Neural code synthesis approaches to code completion generate suggestions by using a language model to predict what the user might type next (the completion) from the context of what they are working on at the moment (the prompt) (Austin et al., 2021). Rather than focusing on a particular task (such as suggesting a method to call), neural code synthesis predicts arbitrary sections of code, and rather than generating single tokens, these systems might predict multiple lines of code at once.

The potential benefits of generating large sections of code automatically are huge, but evaluating these systems is challenging. Offline evaluation, where the system is shown a snippet of code with (say) an identifier removed and then asked to complete it, is difficult not least because for longer completions there are many acceptable alternatives and no straightforward mechanism for labeling them automatically (Chen et al., 2021). An additional step taken by some researchers (Svyatkovskiy et al., 2021; Zhou et al., 2021; Aye et al., 2021) is to use online evaluation and track the frequency with which real users accept suggestions, assuming that the more contributions a system makes to the developer’s code, the higher its benefit. The validity of this assumption is not obvious when considering issues such as whether two short completions are more valuable than one long one, or other human factors such as whether reviewing suggestions is detrimental to programming flow.

Neural synthesis tools such as GitHub Copilot (https://copilot.github.com/), Kite (https://www.kite.com), and TabNine (https://tabnine.com) suggest code snippets within an IDE with the explicitly stated intention to increase a user’s productivity. Developer productivity has many aspects, and a recent study has shown that tools like these are helpful in ways that are only partially reflected by measures like completion times for standardized tasks (Vaithilingam et al., 2022a). Alternatively, we can leverage the developers themselves as expert assessors of their own productivity. This meshes well with current thinking in software engineering research which suggests measuring productivity on multiple dimensions and using self-reported data (Forsgren et al., 2021). We will thus focus on studying perceived productivity.

In this paper we investigate whether usage measurements of developer interactions with GitHub Copilot can be used to predict perceived productivity as reported by developers. We analyze survey responses from developers using GitHub Copilot and match their responses to usage measurements collected from the IDE. We consider acceptance counts and more detailed measures of contribution such as the amount of code contributed by GitHub Copilot, and the rate of acceptances which subsequently persist in the code unchanged. We find that acceptance rate of shown suggestions is a better predictor of perceived productivity than the alternative measures. We also find that acceptance rate varies significantly over our developer population as well as over time, and present a deeper dive into some of these variations.

Our results support the principle that acceptance rate can be used for coarse-grained monitoring of the performance of a neural code synthesis system. In particular, the ratio of shown suggestions being accepted correlates better than more detailed measures of contribution. However, other approaches remain necessary for fine-grained investigation due to the many human factors involved.

2. Background

Offline evaluation of code completion can have shortcomings even in tractable circumstances where completions can be labeled for correctness. For example, a study of completions by developers in Visual Studio found significant differences between synthetic benchmarks used for model evaluation and real-world usage (Hellendoorn et al., 2019). The evaluation of context-aware API completion for Visual Studio IntelliCode considered Recall@5—the proportion of completions for which the correct method call was in the top 5 suggestions. This metric fell considerably when moving from offline evaluation to online use (Svyatkovskiy et al., 2021).

Due to the diversity of potential solutions to a multi-line completion task, researchers have used software testing to evaluate the behaviour of completions. Competitive programming sites have been used as a source of such data (Kulal et al., 2019; Hendrycks et al., 2021) as well as hand-written programming problems (Chen et al., 2021). Human-written tests can be augmented with automatically generated tests to reduce false-positive rates in which erroneous programs are accepted (Li et al., 2022). Online evaluation remains important for general code completion tools because one needs to understand how well performance on programming competition data generalizes to interactive development in an IDE.

In this work we define acceptance rate as the fraction of completions shown to the developer that are subsequently accepted for inclusion in the source file. The IntelliCode Compose system uses the term CTR (Click Through Rate) for this measure and reports its value from online trials (Svyatkovskiy et al., 2020). An alternative measure is DCPU (Daily Completions accepted Per User), for which values have been reported in prior work (Zhou et al., 2021; Aye et al., 2021). To compare these measures one must, of course, normalize DCPU by the time spent coding each day. For context, the acceptance rate and mean DCPU we observe for GitHub Copilot in this study differ from these previously reported values. These differences are presumably due to differences in the kinds of completion offered, or perhaps to user interface choices. We discuss later how developer objectives, choice of programming language and even time of day seem to affect our data. Such discrepancies highlight the difficulty in using acceptance rate to understand the value of a system.
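To make the relationship between these measures concrete, the following sketch computes an acceptance rate and a time-normalized DCPU for a single developer-day; the numbers are hypothetical and not drawn from our data.

# Hypothetical daily counts for one developer (illustrative only).
shown, accepted, coding_hours = 120, 30, 4

acceptance_rate = accepted / shown           # fraction of shown completions accepted
dcpu = accepted                              # daily completions accepted per user
dcpu_per_coding_hour = dcpu / coding_hours   # normalization by time spent coding

print(acceptance_rate, dcpu_per_coding_hour)  # 0.25 7.5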

There is some evidence that acceptance rate (and indeed correctness) might not tell the whole story. One survey of developers considered the use of AI to support translation between programming languages and found indications that developers tolerated, and in some cases valued, erroneous suggestions from the model (Weisz et al., 2021).

Measuring developer productivity through activity counts over time (a typical definition of productivity borrowed from economics) disregards the complexity of software development as they account for only a subset of developer outputs. A more holistic picture is formed by measuring perceived productivity through self-reported data across various dimensions (Forsgren et al., 2021), and supplementing it with automatically measured data (Beller et al., 2020). In our investigation we used the SPACE framework (Forsgren et al., 2021) to design a survey that captures self-reported productivity, and paired the self-reported data with usage telemetry.

To the best of our knowledge, this is the first study of code suggestion tools establishing a clear link between usage measurements and developer productivity or happiness. A previous study comparing GitHub Copilot against IntelliCode with 25 participants found no significant correlation between task completion times and survey responses (Vaithilingam et al., 2022b). Another study considered the benefits of using a plugin converting natural language prompts to code (Xu et al., 2021). It found no statistically significant improvements in task completion time or task correctness despite positive qualitative survey results (possibly due to small sample size).

3. Data and Methodology

3.1. Usage Measurements

GitHub Copilot provides code completions using OpenAI Codex (Chen et al., 2021), which is a version of GPT-3 that has been tuned on publicly available source code. It runs within the IDE and at appropriate points sends a completion request to a cloud-hosted instance of the neural model. Completion requests contain a prompt drawn from the code currently in the IDE. GitHub Copilot can generate completions at arbitrary points in code rather than (say) only being triggered when a developer types a period for invoking a method on an object. We use a variety of rules to determine appropriate points to request a completion; to abandon requests if the developer has moved on before the model is ready with a completion; and to determine how much of the response from the model to surface as a completion.

Figure 1. GitHub Copilot’s code completion funnel.

opportunity: a heuristic-based determination by the IDE and the plugin that a completion might be appropriate at this point in the code (e.g. the cursor is not in the middle of a word)
shown: completion shown to the developer
accepted: completion accepted by the developer for inclusion in the source file
accepted_char: the number of characters in an accepted completion
mostly_unchanged_X: completion persisting in source code with limited modifications (Levenshtein distance less than 33%) after X seconds, where we consider durations of 30, 120, 300, and 600 seconds
unchanged_X: completion persisting in source code unmodified after X seconds
(active) hour: an hour during which the developer was using their IDE with the plugin active

Table 1. Developer usage events collected by GitHub Copilot.

We make usage measurements for each developer by counting the events shown in Table 1, collected for all users of GitHub Copilot according to our terms of usage (https://docs.github.com/en/github/copilot/github-copilot-telemetry-terms).

Our measures of persistence go further than existing work, which stops at acceptance. The intuition here is that a completion which is accepted into the source file but subsequently turns out to be incorrect can be considered to have wasted developer time, both in reviewing it and then in having to go back and delete it again. We also record mostly unchanged completions, reasoning that a large completion requiring a few edits might still be a positive contribution. It is not clear how long after acceptance one should confirm persistence, and so we consider a range of durations.
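As a concrete illustration of these persistence measures, the sketch below (not the actual telemetry implementation, and assuming the text currently occupying the completion’s insertion site has already been located) labels a single accepted completion as unchanged or mostly unchanged at one follow-up time.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def persistence_labels(completion: str, text_at_site_later: str) -> dict:
    """Label one accepted completion at a single follow-up time.

    "mostly unchanged" uses the 33% relative Levenshtein threshold mentioned
    above; measuring it against the completion length is an assumption here.
    """
    dist = levenshtein(completion, text_at_site_later)
    return {
        "unchanged": dist == 0,
        "mostly_unchanged": dist < 0.33 * max(len(completion), 1),
    }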

The events pertaining to completions form a funnel, which we show quantitatively in Figure 1. We include a summary of all data in Appendix A.

We normalize these measures against each other and write X_per_Y to indicate we have normalized metric X by metric Y. For example: accepted_per_hour is calculated as the total number of accepted events divided by the total number of (active) hour events.
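For example, the following sketch derives a few of the X_per_Y measures from hypothetical per-developer event counts (the field names are illustrative, not the actual telemetry schema).

from dataclasses import dataclass

@dataclass
class UsageCounts:
    opportunity: int
    shown: int
    accepted: int
    accepted_char: int
    active_hour: int

def normalized_metrics(c: UsageCounts) -> dict:
    ratio = lambda num, den: num / den if den else float("nan")
    return {
        "shown_per_opportunity": ratio(c.shown, c.opportunity),
        "accepted_per_shown": ratio(c.accepted, c.shown),                 # acceptance rate
        "accepted_per_hour": ratio(c.accepted, c.active_hour),            # acceptance frequency
        "accepted_char_per_hour": ratio(c.accepted_char, c.active_hour),  # contribution speed
    }

# A developer with 10,000 opportunities, 1,500 shown completions,
# 400 acceptances totalling 20,000 characters over 80 active hours.
print(normalized_metrics(UsageCounts(10_000, 1_500, 400, 20_000, 80)))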

Shown rate: percentage of completion opportunities that resulted in a completion being shown to the user (shown_per_opportunity)
Acceptance rate: percentage of shown completions accepted by the user (accepted_per_shown)
Persistence rate: percentage of accepted completions unchanged after 30, 120, 300, and 600 seconds (unchanged_X_per_accepted)
Fuzzy persistence rate: percentage of accepted completions mostly unchanged after 30, 120, 300, and 600 seconds (mostly_unchanged_X_per_accepted)
Efficiency: percentage of completion opportunities that resulted in a completion accepted and unchanged after 30, 120, 300, and 600 seconds (accepted_X_per_opportunity, unchanged_X_per_opportunity)
Contribution speed: number of characters in accepted completions per distinct, active hour (accepted_char_per_hour)
Acceptance frequency: number of accepted completions per distinct, active hour (accepted_per_hour)
Persistence frequency: number of unchanged completions per distinct, active hour (unchanged_X_per_hour)
Total volume: total number of completions shown to the user (shown)
Loquaciousness: number of shown completions per distinct, active hour (shown_per_hour)
Eagerness: number of shown completions per opportunity (shown_per_opportunity)
Table 2. The core set of measurements considered in this paper.

Table 2 defines a core set of metrics which we feel have a natural interpretation in this context. We note that there are other alternatives, and we incorporate these in our discussion where relevant.

3.2. Productivity Survey

To understand users’ experience with GitHub Copilot, we emailed users a link to an online survey. These were participants of the unpaid technical preview who were using GitHub Copilot with their everyday programming tasks. The only selection criterion was having previously opted in to receive communications. Between 10 February 2022 and 6 March 2022, we received responses that we could match to usage measurements during the four-week period from 12 February to 12 March 2022. We focus on usage data from this period, since the vast majority of respondents had filled out the survey by then.

The survey contained multiple-choice questions, in particular regarding demographic information (shown in Figure 2), and Likert-style questions about different aspects of productivity, which were randomized in their order of appearance. Figure 2 shows the demographic composition of our respondents. We note the significant proportion of professional programmers who responded.

Figure 2. Demographic composition of survey respondents.

The SPACE framework (Forsgren et al., 2021) defines 5 dimensions of productivity: Satisfaction and well-being, Performance, Activity, Communication and Collaboration, and Efficiency and Flow. We use 4 of these (S, P, C, E), since self-reporting on Activity (A) is generally considered inferior to direct measurement. We included statements covering these dimensions in addition to a single statement “I am more productive when using GitHub Copilot”. For each self-reported productivity measure, we encoded its five ordinal response values to numeric labels (1 = Strongly Disagree, …, 5 = Strongly Agree). We include the full list of questions and their coding to the SPACE framework in Appendix C.

Early in our analysis we found that the usage metrics described in Section 3.1 corresponded similarly to each of the measured dimensions of productivity, and that these dimensions were in turn highly correlated with each other (for details see Figure 3 and Appendix B). We therefore added an aggregate productivity score calculated as the mean of all 12 individual measures (excluding skipped questions). This average can only serve as a rough proxy for the much more complex concept of productivity, but it facilitates recognition of overall trends, which may be less discernible on individual variables due to higher statistical variation.
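A minimal sketch of this aggregation follows; the ordinal coding matches the description above, while the wording of the middle response option is an assumption on our part.

LIKERT = {
    "Strongly Disagree": 1,
    "Disagree": 2,
    "Neither agree nor disagree": 3,   # assumed label for the scale midpoint
    "Agree": 4,
    "Strongly Agree": 5,
}

def aggregate_productivity(responses: dict) -> float:
    """Mean of the coded answers over the 12 statements; skipped (None) answers excluded."""
    coded = [LIKERT[v] for v in responses.values() if v is not None]
    return sum(coded) / len(coded) if coded else float("nan")

example = {"more_productive": "Agree", "tasks_faster": "Strongly Agree", "less_frustrated": None}
print(aggregate_productivity(example))  # -> 4.5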

For reproducibility and transparency, the full data set of these aggregate productivity scores together with the usage measurements considered in this article is available at https://github.com/wunderalbert/prod-neural-materials.

4. What Drives Perceived Productivity?

Figure 3. Correlation between metrics.

Metrics are ordered by similarity based on distance in the correlation matrix, except that aggregate productivity and acceptance rate are manually placed at the end for visibility.

To examine the relationship between objective measurements of user behavior and self-reported perceptions of productivity, we used our set of core usage measurements (Table 2). We then calculated Pearson’s R correlation coefficient and the corresponding p-value of the F-statistic between each pair of usage measurement and perceived productivity metric. Next, we computed a PLS regression from all usage measurements jointly. Finally, we performed incremental feature selection by analyzing the significance of a univariate model in which each usage measurement seeks to predict the residuals of a model fit with varying numbers of other metrics; this allows us to rank each metric more directly.
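The sketch below outlines the first two of these analyses using pandas, SciPy, and scikit-learn; it is an illustration of the procedure rather than our exact analysis code.

import pandas as pd
from scipy.stats import pearsonr
from sklearn.cross_decomposition import PLSRegression

def correlations_and_pls(df: pd.DataFrame, usage_cols: list, target: str = "aggregate_productivity"):
    # Pairwise Pearson correlations (r, p-value) between each usage
    # measurement and the perceived productivity target.
    correlations = {col: pearsonr(df[col], df[target]) for col in usage_cols}

    # PLS regression on all usage measurements jointly; two components,
    # matching the two components discussed in Section 4.
    pls = PLSRegression(n_components=2)
    pls.fit(df[usage_cols], df[target])

    return correlations, pls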

We summarize these results in Figure 3 showing the correlation coefficients between all measures and survey questions. The full table of all results is included in Appendix B.

Across all three analyses, we find that acceptance rate (accepted_per_shown) most positively predicts users’ perception of productivity, although, given the confounding and human factors, there is still notable unexplained variance.

Of all usage measurements, acceptance rate correlates best with aggregate productivity (r = 0.24, p < 0.0001). This measurement is also the best performing for at least one survey question in each of the SPACE dimensions. This correlation is high confidence but leaves considerable unexplained variance. Below we explore improvements from combining multiple usage measurements.

Looking at the more detailed metrics around persistence, we see that persistence over shorter time periods generally correlates better with perceived productivity than persistence over longer periods. This is intuitive in the sense that shorter periods move the measure closer to acceptance rate. We also expect that at some point after acceptance the completion simply becomes part of the code, so any changes (or lack thereof) after that point will not be attributed to GitHub Copilot. All persistence measures were less well correlated than acceptance rate.

In order to assess the different metrics in a single model in a way that is robust against their strong collinearity and unaffected by the decision whether or not to include highly similar metrics, we ran a regression using projection on latent structures (PLS), which captures the common variation of these variables insofar as it is linearly connected to the aggregate productivity (Wold et al., 2001). The first component, to which every metric under consideration contributes positively, explains the largest share of the variance. The second component captures the acceptance rate / change rate dichotomy and explains a further, smaller share.

Figure 4. Different metrics clustering in latent structures predicting perceived productivity. We color the following groups: flawless suggestions (anything counting the number of unchanged suggestions), persistence rate (ratio of accepted suggestions that are unchanged), and fuzzy persistence rate (ratio of accepted suggestions that are mostly unchanged).

The results of both the individual correlations as well as the PLS strongly point to acceptance rate being the most immediate indicator of perceived productivity.

But what about combinations of metrics? We aim to quantify the extra information provided by one metric over a set of others. In the vein of incremental feature selection, we fit an additional predictor to the residuals of a model represented by already selected metrics. Starting from the best single predictors, Table 3 shows the next most useful predictors of the residuals of the acceptance rate model. Given a model fit to acceptance rate, adding the shown frequency or rate, as well as either the number of accepted characters or accepted completions per hour, each further improves predictive ability at statistically significant levels. No other additions were statistically significant in further iterations.

Metric  p-value  univ. coef.
accepted_per_shown  <0.0001  0.13
     + shown_per_hour  0.04  +0.03
     + accepted_char_per_hour  0.04  +0.03
     + shown_per_opportunity  0.04  +0.03
     + accepted_per_hour  0.05  +0.02
accepted_per_opportunity  <0.0001  0.12
     + accepted_char_per_hour  <0.0001  +0.04
     + accepted_per_hour  0.01  +0.03
     + unchanged_30_per_hour  0.01  +0.03
     + accepted_per_shown  0.02  +0.03
Table 3. Incremental benefit of additional metrics.

So even though acceptance rate may be the best of the metrics we considered, it is beneficial to combine it with others to get a fuller picture.
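A sketch of one step of this residual-based selection is shown below (statsmodels OLS; illustrative rather than the exact code used): fit a model on the already selected metrics, then rank the remaining metrics by how significantly they predict its residuals.

import statsmodels.api as sm

def incremental_step(df, selected, candidates, target="aggregate_productivity"):
    """Rank candidate metrics by how well they predict the residuals of the current model."""
    base = sm.OLS(df[target], sm.add_constant(df[selected])).fit()
    residuals = base.resid
    results = {}
    for col in candidates:
        fit = sm.OLS(residuals, sm.add_constant(df[col])).fit()
        results[col] = (fit.params[col], fit.pvalues[col])
    return sorted(results.items(), key=lambda kv: kv[1][1])  # most significant first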

5. What Drives Acceptance Rate?

5.1. Language Use

We are aware that there are significant differences in how GitHub Copilot performs for different programming languages. The most common languages among our user base, both by shown completions in the observed time frame and among survey users, are TypeScript, JavaScript, and Python. The latter two enjoy higher acceptance rates, possibly hinting at a relative strength of neural tooling versus deductive tooling for untyped languages. Regardless of language, survey participants had a slightly higher acceptance rate than the whole user base.

Figure 5. Programming language use by survey participants vs. all users.

This difference in acceptance rate between languages cannot explain away the effects from Section 4: when considering the linear regression of perceived productivity on acceptance rate, only a small part of the explained variance can be attributed to language. On the other hand, much of the variance between the languages’ average perceived productivity levels can be explained by their acceptance rates in Figure 5 above (using a linear model factoring through acceptance rate).

5.2. Circadian and Weekly Rhythms

For coherence in the meaning of timestamps and weekdays, all data in this section was restricted to users from the United States (whether in the survey or not). We used the same time frame as for the investigation in Section 4.

We observe strong regular patterns in overall acceptance rate (Figure 6). These lead us to distinguish three different time regimes, all of which have statistically significantly distinct acceptance rates (using a bootstrap test re-sampling the proposed time regimes):

  • The weekend: Saturdays and Sundays (day boundaries taken from PST), where the average acceptance rate is comparatively high.

  • Typical non-working hours during the week: evenings after 4 pm PST until 7 am PST the following morning, where the average acceptance rate is also rather high.

  • Typical working hours during the week, from 7 am PST to 4 pm PST, where the average acceptance rate is much lower.

The border between the second and third regime is fuzzy, probably partially due to time zone variation within the United States.
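The sketch below shows how such a split can be implemented, together with a simple bootstrap comparison of mean acceptance rates between two regimes; it is a simplified stand-in for the re-sampling test used above.

import numpy as np
import pandas as pd

def regime(ts: pd.Timestamp) -> str:
    """Classify a PST timestamp into one of the three regimes described above."""
    if ts.dayofweek >= 5:          # Saturday or Sunday
        return "weekend"
    if 7 <= ts.hour < 16:          # 7 am to 4 pm PST on a weekday
        return "work hours"
    return "non-working hours"

def bootstrap_diff(rates_a, rates_b, n_boot=10_000, seed=0):
    """95% bootstrap interval for the difference in mean acceptance rate between two regimes."""
    rng = np.random.default_rng(seed)
    rates_a, rates_b = np.asarray(rates_a), np.asarray(rates_b)
    diffs = [rng.choice(rates_a, rates_a.size).mean() - rng.choice(rates_b, rates_b.size).mean()
             for _ in range(n_boot)]
    return np.percentile(diffs, [2.5, 97.5])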

Figure 6. Average acceptance rate for hour-long time buckets during the week. Each point represents the average for such a bucket, whereas the shaded ribbon represents the min-max variation for single hours during the observed 4 week period.

Users’ inclination to accept more suggestions outside of standard working hours could be attributed either to regular changes in the users’ behavior (e.g. accepting more solutions because they are more relaxed), or to changes in the underlying distribution of who is coding and what they are working on (e.g. personal projects being easier to suggest code for).

To distinguish between these two explanations, we trained a model to predict a user’s acceptance rate for a particular time bucket from their usual contribution times (Table 4). To our surprise, we found that the actual time bucket mattered very little – what did matter was whether it lay in the user’s usual time regime. That means a user normally active only during the week accepts fewer solutions on the rare occasions they do code on a weekend, and a user whose activity is normally restricted to working hours accepts fewer solutions when they do venture outside that time range.

Figure 7. Acceptance rate depending on whether the user is mostly active on weekdays / typical work hours (x-axis), and whether it is actually a weekday / typical office hour (color).
                      coeff    p-value   t-value
weekend user         -0.035    <0.001    -24.5
on a weekday         -0.004     0.004     -2.9
usual day for user    0.021    <0.001     11.3

                      coeff    p-value   t-value
work hour user       -0.036    <0.001    -22.3
during work hours    -0.001     0.578     -0.6
usual time for user   0.019    <0.001     11.5

Results of a linear regression of acceptance rate on: 1. the user’s percentile value for how much their activity is concentrated on weekdays (during typical work hours), 2. a categorical variable describing whether the suggestion is actually made on a weekday (during typical work hours), and 3. the proportion of this user’s contributions for which that categorical variable would have the same value.

Table 4. Acceptance rate depending on time factors.
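The following sketch reproduces the shape of this regression on synthetic data (statsmodels formula API; the column names are illustrative and not the real telemetry schema).

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy hour-bucket data standing in for the real telemetry.
rng = np.random.default_rng(0)
buckets = pd.DataFrame({
    "acceptance_rate": rng.uniform(0.1, 0.4, 500),
    "weekday_user_percentile": rng.uniform(0, 1, 500),   # factor 1
    "is_weekday": rng.integers(0, 2, 500),               # factor 2 (categorical, 0/1)
    "usual_time_share": rng.uniform(0, 1, 500),          # factor 3
})

model = smf.ols(
    "acceptance_rate ~ weekday_user_percentile + is_weekday + usual_time_share",
    data=buckets,
).fit()
print(model.params, model.pvalues, model.tvalues)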

6. Threats To Validity

The principal challenge to this work is that we have only been able to investigate correlation and not causation. We hope to have mitigated this to some extent by selecting “sensible” metrics that we can justifiably believe could capture a causal relationship with productivity. Nor do we claim that these metrics themselves directly impact productivity (a developer with a faulty tab-key that accidentally accepts suggestions without user intention will probably not be extra productive because of it), but only that these metrics are a good indicator of an underlying quality that predicts productivity.

We did not set out to accurately predict a survey participant’s answers, but merely to find a signal between developers’ perceived productivity and usage metrics. Still, we must highlight that our best performing measurement, accepted_per_shown, has a Pearson coefficient of only 0.24, so a considerable amount of unexplained variance remains.

User-perceived productivity is also not necessarily actual productivity: seeking to maximise accepted_per_shown might satisfy an individual developer without directly reducing the amount of time it takes them to solve a task. And indeed one study looking at the benefits of GitHub Copilot over IntelliCode found no measurable impact on task completion time despite notably positive feedback from developers (Vaithilingam et al., 2022b). On the other hand, a drawback of task-oriented studies is the question of how representative the chosen tasks are of real workloads, whereas online studies (such as ours) capture authentic activity.

Another substantial caveat is that we only considered a single completion system, with a single fixed neural engine. Alternative systems could differ in many aspects that affect developer attitudes, including the average quality and latency of completions, their length, and even the user interface used to present them.

7. Conclusion

Neural code completion systems have the potential to hugely improve developer productivity through their ability to assimilate contextual information about the developer’s current activity and then generate substantial completions in response. In this paper we investigated ways of connecting the productivity benefit of GitHub Copilot to usage measurements from developer activity. Our approach was to seek correlations between our measurements and user-reported productivity from survey results.

In common with prior work we collected measurements about the acceptance of completions, but we also developed measures of persistence. This was based on the idea that for longer completions a developer might have to take more action after accepting a completion such as deleting or correcting an erroneous one.

We were surprised to find that acceptance rate (number of acceptances normalized by the number of shown completions) was better correlated with reported productivity than our measures of persistence.

But in hindsight, this makes sense. Coding is not typing, and GitHub Copilot’s central value lies not in being the way the user enters the highest possible number of lines of code. Instead, it lies in helping the user to make the best progress towards their goals. A suggestion that serves as a useful template to tinker with may be as good or better than a perfectly correct (but obvious) line of code that only saves the user a few keystrokes.

This suggests that a narrow focus on the correctness of suggestions would not tell the whole story for these kinds of tooling. Instead one could view code suggestions inside an IDE to be more akin to a conversation with a chatbot. We see anecdotal evidence of this in comments posted about GitHub Copilot online (see Appendix E for examples) in which users talk about sequences of interactions. A conversation turn in this context consists of the prompt in the completion request and the reply as the completion itself. The developer’s response to the completion arises from the subsequent changes which are incorporated in the next prompt to the model. And there are clear programming parallels to factors such as specificity and repetition that have been identified to affect human judgements of conversation quality (See et al., 2019). Researchers have already investigated the benefits of natural language feedback to guide program synthesis (Austin et al., 2021) and so ours is not a radical proposal. But neither is it one we have seen followed.

In future work, we wish to further explore this analogy, borrowing ideas (van der Lee et al., 2019) from the evaluation of chatbots and natural language text generation.

8. Broader Impact

A detailed impact analysis of the model that underlies GitHub Copilot may be found in the Appendix of (Chen et al., 2021). In this section, we focus more specifically on the potential impact of using the metrics we have described in this paper to evaluate the success of neural code completion systems.

First, focusing on a single top-level metric such as acceptance rate may bias a tool toward the most popular use cases — the most popular programming languages, natural languages, IDEs, locations, etc. Users in underrepresented groups may see lower quality results. We can mitigate this by slicing our data along the lines described above, and avoiding shipping changes that improve the top-level metric but degrade performance for other slices of the data.

Second, to compute these metrics, we must collect telemetry from users. Collecting these metrics exposes users to potential security and privacy concerns. We mitigate this by enacting strict access controls to user data and collaborating with organization and industry experts at protecting user data.

Third, blindly optimizing for a proxy (acceptance rate) of a desired property (usefulness) encourages artificial changes that improve only that proxy. For example, cutting code suggestions in half and suggesting both parts consecutively would likely transform one accepted suggestion into two, while not substantially increasing the number of rejections. It would therefore likely increase acceptance rate without substantially increasing, and maybe even while decreasing, user benefit. We can thus not recommend acceptance rate as a singular and ultimate criterion of quality – it will be useful for many applications, e.g. comparing incremental changes to the code-generating model, but its validity is limited in other cases, especially those involving significantly changed operational parameters.

Acknowledgments

We thank the GitHub Copilot team for their help, and in particular Krzysztof Cieslak and Johan Rosenkilde for implementing the highly complex telemetry of suggestion fate, including calculating edit distances for fuzzy matches. We thank the SAINTes team at Microsoft Research for their advice; Nicole Forsgren and Denae Ford Robinson for advising on questions capturing perceived productivity with the SPACE framework, and Tom Zimmermann and Christian Bird for recommending more time intervals to consider for suggestion fate monitoring. We thank Rahul Pandita for LaTeX support and proofreading. Finally, we are grateful to GitHub Incorporated for supporting this research.

9. Appendix

Appendix A Summary of usage measurements collected

This table shows summary statistics of our core metrics (highlighted in black). We include other possible metrics (arising from different normalization options) in the table for context.

Metric N Mean Std. Min. Median Max.
opportunity 2,047 13,085.23 14,493.16 1.00 8,686.00 241,033.00
shown 2,047 1,872.05 1,922.22 0.00 1,276.00 15,832.00
accepted 2,047 503.94 639.20 0.00 293.00 5,851.00
unchanged_30 2,047 328.96 449.91 0.00 178.00 4,253.00
unchanged_120 2,047 289.15 400.16 0.00 155.00 3,937.00
unchanged_300 2,047 262.31 367.78 0.00 140.00 3,737.00
unchanged_600 2,047 240.69 342.01 0.00 125.00 3,487.00
accepted_char 2,047 25,869.83 33,288.97 0.00 14,662.00 320,080.00
active_hour 2,047 77.10 55.86 1.00 66.00 364.00
mostly_unchanged_30 2,047 434.24 556.89 0.00 254.00 5,025.00
mostly_unchanged_120 2,047 406.51 523.54 0.00 237.00 4,785.00
mostly_unchanged_300 2,047 390.58 503.78 0.00 230.00 4,586.00
mostly_unchanged_600 2,047 382.34 493.05 0.00 222.00 4,491.00
opportunity_per_active_hour 2,047 158.07 107.40 1.00 134.40 1,844.00
shown_per_active_hour 2,047 22.52 13.45 0.00 20.02 137.75
accepted_per_active_hour 2,047 6.24 5.76 0.00 4.60 58.56
shown_per_opportunity 2,047 0.15 0.05 0.00 0.15 0.37
accepted_per_opportunity 2,047 0.04 0.03 0.00 0.04 0.22
accepted_per_shown 2,038 0.26 0.12 0.00 0.24 1.00
accepted_char_per_active_hour 2,047 335.15 464.08 0.00 235.00 14,064.00
accepted_char_per_opportunity 2,047 2.14 1.76 0.00 1.71 20.87
accepted_char_per_shown 2,038 13.73 10.00 0.00 11.63 194.00
accepted_char_per_accepted 2,019 52.84 22.14 5.00 48.85 494.83
mostly_unchanged_30_per_active_hour 2,047 5.37 5.02 0.00 3.89 51.92
mostly_unchanged_30_per_opportunity 2,047 0.04 0.02 0.00 0.03 0.19
mostly_unchanged_30_per_shown 2,038 0.22 0.11 0.00 0.21 1.00
mostly_unchanged_30_per_accepted 2,019 0.86 0.08 0.00 0.86 1.00
mostly_unchanged_120_per_active_hour 2,047 5.01 4.66 0.00 3.67 49.00
mostly_unchanged_120_per_opportunity 2,047 0.03 0.02 0.00 0.03 0.19
mostly_unchanged_120_per_shown 2,038 0.21 0.11 0.00 0.19 1.00
mostly_unchanged_120_per_accepted 2,019 0.80 0.10 0.00 0.81 1.00
mostly_unchanged_300_per_active_hour 2,047 4.81 4.40 0.00 3.54 46.61
mostly_unchanged_300_per_opportunity 2,047 0.03 0.02 0.00 0.03 0.19
mostly_unchanged_300_per_shown 2,038 0.20 0.10 0.00 0.18 1.00
mostly_unchanged_300_per_accepted 2,019 0.77 0.11 0.00 0.77 1.00
mostly_unchanged_600_per_active_hour 2,047 4.72 4.32 0.00 3.51 44.85
mostly_unchanged_600_per_opportunity 2,047 0.03 0.02 0.00 0.03 0.19
mostly_unchanged_600_per_shown 2,038 0.20 0.10 0.00 0.18 1.00
mostly_unchanged_600_per_accepted 2,019 0.76 0.11 0.00 0.76 1.00
unchanged_30_per_active_hour 2,047 4.03 4.06 0.00 2.89 43.84
unchanged_30_per_opportunity 2,047 0.03 0.02 0.00 0.02 0.19
unchanged_30_per_shown 2,038 0.17 0.10 0.00 0.15 0.64
unchanged_30_per_accepted 2,019 0.64 0.17 0.00 0.67 1.00
unchanged_120_per_active_hour 2,047 3.51 3.50 0.00 2.52 40.00
unchanged_120_per_opportunity 2,047 0.02 0.02 0.00 0.02 0.19
unchanged_120_per_shown 2,038 0.15 0.09 0.00 0.13 0.63
unchanged_120_per_accepted 2,019 0.56 0.16 0.00 0.59 1.00
unchanged_300_per_active_hour 2,047 3.13 3.08 0.00 2.24 36.77
unchanged_300_per_opportunity 2,047 0.02 0.02 0.00 0.02 0.17
unchanged_300_per_shown 2,038 0.13 0.08 0.00 0.12 0.62
unchanged_300_per_accepted 2,019 0.51 0.16 0.00 0.53 1.00
unchanged_600_per_active_hour 2,047 2.82 2.76 0.00 2.06 34.15
unchanged_600_per_opportunity 2,047 0.02 0.02 0.00 0.02 0.17
unchanged_600_per_shown 2,038 0.12 0.07 0.00 0.11 0.61
unchanged_600_per_accepted 2,019 0.46 0.16 0.00 0.48 1.00

Appendix B Correlations between usage measurements and survey questions

This table shows the correlation of the aggregate productivity score from our survey against our core metrics (highlighted in black), as well as their PLS scores. We include other possible metrics (arising from different normalization options) in the table for context.

Metric N Coefficient P-Value PLS-1 PLS-2
accepted_per_shown 1,780 0.24 <0.0001 0.0133 0.029
mostly_unchanged_30_per_shown 1,780 0.23 <0.0001
mostly_unchanged_120_per_shown 1,780 0.23 <0.0001
mostly_unchanged_300_per_shown 1,780 0.22 <0.0001
accepted_per_opportunity 1,789 0.22 <0.0001 0.0122 0.0204
mostly_unchanged_600_per_shown 1,780 0.22 <0.0001
unchanged_30_per_shown 1,780 0.21 <0.0001
mostly_unchanged_30_per_opportunity 1,789 0.21 <0.0001
mostly_unchanged_120_per_opportunity 1,789 0.21 <0.0001
mostly_unchanged_30_per_active_hour 1,789 0.21 <0.0001
accepted_per_active_hour 1,789 0.21 <0.0001 0.0116 0.0164
mostly_unchanged_120_per_active_hour 1,789 0.21 <0.0001
unchanged_120_per_shown 1,780 0.21 <0.0001
mostly_unchanged_300_per_active_hour 1,789 0.21 <0.0001
mostly_unchanged_300_per_opportunity 1,789 0.21 <0.0001
mostly_unchanged_600_per_active_hour 1,789 0.20 <0.0001
mostly_unchanged_600_per_opportunity 1,789 0.20 <0.0001
unchanged_30_per_opportunity 1,789 0.20 <0.0001 0.0112 0.0122
unchanged_30_per_active_hour 1,789 0.20 <0.0001 0.0111 0.0104
unchanged_300_per_shown 1,780 0.19 <0.0001
unchanged_120_per_active_hour 1,789 0.19 <0.0001 0.011 0.0092
unchanged_120_per_opportunity 1,789 0.19 <0.0001 0.0108 0.0099
accepted_char_per_opportunity 1,789 0.19 <0.0001
unchanged_300_per_active_hour 1,789 0.19 <0.0001 0.0107 0.0082
unchanged_300_per_opportunity 1,789 0.18 <0.0001 0.0103 0.0082
unchanged_600_per_active_hour 1,789 0.18 <0.0001 0.01 0.006
accepted_char_per_shown 1,780 0.17 <0.0001
unchanged_600_per_shown 1,780 0.17 <0.0001
unchanged_600_per_opportunity 1,789 0.16 <0.0001 0.0093 0.0049
accepted_char_per_active_hour 1,789 0.16 <0.0001 0.0094 0.0155
accepted_char 1,789 0.11 <0.0001
mostly_unchanged_30 1,789 0.11 <0.0001
shown_per_active_hour 1,789 0.11 <0.0001 0.006 0.0025
accepted 1,789 0.11 <0.0001
mostly_unchanged_120 1,789 0.11 <0.0001
mostly_unchanged_600 1,789 0.11 <0.0001
shown_per_opportunity 1,789 0.11 <0.0001 0.0059 0.0082
mostly_unchanged_300 1,789 0.11 <0.0001
unchanged_30 1,789 0.11 <0.0001
unchanged_120 1,789 0.10 <0.0001
unchanged_300 1,789 0.10 <0.0001
unchanged_600 1,789 0.10 <0.0001
mostly_unchanged_30_per_accepted 1,763 0.07 0.01 0.0036 0.0077
unchanged_30_per_accepted 1,763 0.06 0.02 0.0031 -0.0011
mostly_unchanged_120_per_accepted 1,763 0.05 0.04 0.0028 0.0064
unchanged_120_per_accepted 1,763 0.04 0.09 0.0023 -0.0047
mostly_unchanged_600_per_accepted 1,763 0.04 0.09 0.0022 0.007
mostly_unchanged_300_per_accepted 1,763 0.04 0.09 0.0022 0.0053
accepted_char_per_accepted 1,763 0.03 0.20
opportunity_per_active_hour 1,789 0.03 0.24
unchanged_300_per_accepted 1,763 0.02 0.40 0.0011 -0.0094
shown 1,789 0.01 0.75 4e-04 -0.0149
unchanged_600_per_accepted 1,763 -0.00 0.95 -1e-04 -0.0138
opportunity 1,789 -0.04 0.08
active_hour 1,789 -0.05 0.03

Appendix C Aspects of Productivity Measured In The Survey

This table shows the relationship between the survey statements, the metrics and the different dimension of the SPACE framework (Forsgren et al., 2021).

Survey statements Productivity aspect Code Metric name
“I am more productive when using GitHub Copilot” Perceived productivity more_productive
“I feel more fulfilled with my job when using GitHub Copilot.” more_fulfilled
“I find myself less frustrated during coding sessions when using GitHub Copilot.” less_frustrated
“I can focus on more satisfying work when using GitHub Copilot.” Satisfaction and well-being S focus_satisfying
“While working with an unfamiliar language, I make progress faster when using GitHub Copilot.” unfamiliar_progress
“The code I write using GitHub Copilot is better than the code I would have written without GitHub Copilot.” Performance P better_code
n/a Activity A n/a
“I learn from the suggestions GitHub Copilot shows me.” Communication and collaboration (Wang et al., 2020) C learn_from
“Using GitHub Copilot helps me stay in the flow.” stay_in_flow
“I complete tasks faster when using GitHub Copilot.” tasks_faster
“I complete repetitive programming tasks faster when using GitHub Copilot.” repetitive_faster
“I spend less mental effort on repetitive programming tasks when using GitHub Copilot.” less_effort_repetitive
“I spend less time searching for information or examples when using GitHub Copilot.” Efficiency and flow E less_time_searching

Appendix D Incremental Feature Selection

Univariate fit to aggregate_productivity, excluding 120, 300 and 600 second unchanged metrics:

Behavioral Metric Coefficient P-value SSR
accepted_per_shown 0.13 0.00 466.26
accepted_per_opportunity 0.12 0.00 471.28
accepted_per_hour 0.11 0.00 473.79
unchanged_30_per_opportunity 0.11 0.00 475.43
unchanged_30_per_hour 0.10 0.00 476.36
accepted_char_per_hour 0.09 0.00 482.39
shown_per_opportunity 0.06 0.00 489.60
shown_per_hour 0.06 0.00 489.66
mostly_unchanged_30_per_accepted 0.03 0.01 493.49
unchanged_30_per_accepted 0.03 0.02 494.04
shown 0.00 0.78 495.58

Best metric fit to a univariate model modeling residuals of accepted_per_shown univariate model:

Behavioral Metric Coefficient P-value SSR
shown_per_hour 0.03 0.0363 465.10
accepted_char_per_hour 0.03 0.0385 465.13
shown_per_opportunity 0.03 0.0399 465.15
accepted_per_hour 0.02 0.0465 465.21
unchanged_30_per_hour 0.02 0.0955 465.53
mostly_unchanged_30_per_accepted 0.02 0.2124 465.85
unchanged_30_per_accepted 0.01 0.2461 465.91
accepted_per_opportunity 0.01 0.4819 466.13
unchanged_30_per_opportunity 0.01 0.5442 466.17
shown -0.01 0.5539 466.17

Appendix E Publicly posted comments

Below we include a selection of (unsolicited) publicly posted comments which give a sense that developers are thinking about engagement with GitHub Copilot rather than solely the immediate correctness of a completion.

It’s a little like pair programming with an incredibly eager junior developer who has read a lot of the documentation of every popular API in the world.

Cycling through GitHub Copilot suggestions and manually editing the suggested code is an amazing flow. What I really like is that OurTool adapts to my own code style.

It says, ‘How can I facilitate your thinking process?’ rather than, ‘How can I take away your thinking process and just give you code?’

i was writing a function that evaluates a polynomial, in a lambda (‘let eval_polynomial = —‘) and it autofilled a function that evaluated the polynomial but i wanted horner’s method, so i deleted and typed ‘let eval_polynomial_horner = —‘ it correctly autofilled (with one small error) horner’s method for evaluating polynomials

Just pasted in an AttributeError as a comment in my Python file, and GitHub Copilot began trying to help me debug my implementation of a @HuggingFace Transformers model. Its advice wasn’t 100% all-knowing, but was enough to get me to a resolution!

References

  • S. Amann, S. Proksch, S. Nadi, and M. Mezini (2016) A study of visual studio usage in practice. In IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering, SANER 2016, Suita, Osaka, Japan, March 14-18, 2016 - Volume 1, pp. 124–134. External Links: Link, Document Cited by: §1.
  • J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton (2021) Program synthesis with large language models. CoRR abs/2108.07732. External Links: Link, 2108.07732 Cited by: §1, §7.
  • G. A. Aye, S. Kim, and H. Li (2021) Learning autocompletion from real-world datasets. In 43rd IEEE/ACM International Conference on Software Engineering: Software Engineering in Practice, ICSE (SEIP) 2021, Madrid, Spain, May 25-28, 2021, pp. 131–139. External Links: Link, Document Cited by: §1, §2.
  • M. Beller, V. Orgovan, S. Buja, and T. Zimmermann (2020) Mind the gap: on the relationship between automatically measured and self-reported productivity. IEEE Software 38 (5), pp. 24–31. Cited by: §2.
  • M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021) Evaluating large language models trained on code. CoRR abs/2107.03374. External Links: Link, 2107.03374 Cited by: §1, §2, §3.1, §8.
  • N. Forsgren, M. Storey, C. Maddila, T. Zimmermann, B. Houck, and J. Butler (2021) The space of developer productivity: there’s more to it than you think.. Queue 19 (1), pp. 20–48. Cited by: Appendix C, §1, §2, §3.2.
  • V. J. Hellendoorn, S. Proksch, H. C. Gall, and A. Bacchelli (2019) When code completion fails: a case study on real-world completions. In Proceedings of the 41st International Conference on Software Engineering, ICSE 2019, Montreal, QC, Canada, May 25-31, 2019, J. M. Atlee, T. Bultan, and J. Whittle (Eds.), pp. 960–970. External Links: Link, Document Cited by: §2.
  • D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt (2021) Measuring coding challenge competence with APPS. CoRR abs/2105.09938. External Links: Link, 2105.09938 Cited by: §2.
  • S. Kulal, P. Pasupat, K. Chandra, M. Lee, O. Padon, A. Aiken, and P. Liang (2019) SPoC: search-based pseudocode to code. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 11883–11894. External Links: Link Cited by: §2.
  • Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. Mankowitz, E. Sutherland Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals (2022) Competition-level code generation with alphacode. Cited by: §2.
  • A. See, S. Roller, D. Kiela, and J. Weston (2019) What makes a good conversation? how controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), J. Burstein, C. Doran, and T. Solorio (Eds.), pp. 1702–1723. External Links: Link, Document Cited by: §7.
  • A. Svyatkovskiy, S. K. Deng, S. Fu, and N. Sundaresan (2020) IntelliCode compose: code generation using transformer. In ESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, P. Devanbu, M. B. Cohen, and T. Zimmermann (Eds.), pp. 1433–1443. External Links: Link, Document Cited by: §2.
  • A. Svyatkovskiy, S. Lee, A. Hadjitofi, M. Riechert, J. V. Franco, and M. Allamanis (2021) Fast and memory-efficient neural code completion. In 18th IEEE/ACM International Conference on Mining Software Repositories, MSR 2021, Madrid, Spain, May 17-19, 2021, pp. 329–340. External Links: Link, Document Cited by: §1, §2.
  • P. Vaithilingam, T. Zhang, and E. L. Glassman (2022a) Expectation vs. experience: evaluating the usability of code generation tools powered by large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts, CHI EA ’22, New York, NY, USA. External Links: ISBN 9781450391566, Link, Document Cited by: §1.
  • P. Vaithilingam, T. Zhang, and E. Glassman (2022b) Expectation vs. experience: evaluating the usability of code generation tools powered by large language models. In CHI ’22 Late-Breaking Work: Proceedings of the 2022 Conference on Human Factors in Computing Systems, Cited by: §2, §6.
  • C. van der Lee, A. Gatt, E. van Miltenburg, S. Wubben, and E. Krahmer (2019) Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, INLG 2019, Tokyo, Japan, October 29 - November 1, 2019, K. van Deemter, C. Lin, and H. Takamura (Eds.), pp. 355–368. External Links: Link, Document Cited by: §7.
  • D. Wang, E. Churchill, P. Maes, X. Fan, B. Shneiderman, Y. Shi, and Q. Wang (2020) From human-human collaboration to human-ai collaboration: designing ai systems that can work together with people. In Extended abstracts of the 2020 CHI conference on human factors in computing systems, pp. 1–6. Cited by: Appendix C.
  • J. D. Weisz, M. J. Muller, S. Houde, J. T. Richards, S. I. Ross, F. Martinez, M. Agarwal, and K. Talamadupula (2021) Perfection not required? human-ai partnerships in code translation. In IUI ’21: 26th International Conference on Intelligent User Interfaces, College Station, TX, USA, April 13-17, 2021, T. Hammond, K. Verbert, D. Parra, B. P. Knijnenburg, J. O’Donovan, and P. Teale (Eds.), pp. 402–412. External Links: Link, Document Cited by: §2.
  • S. Wold, M. Sjöström, and L. Eriksson (2001) PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent Laboratory Systems 58 (2), pp. 109–130. Note: PLS Methods External Links: ISSN 0169-7439, Document, Link Cited by: §4.
  • F. F. Xu, B. Vasilescu, and G. Neubig (2021) In-ide code generation from natural language: promise and challenges. CoRR abs/2101.11149. External Links: Link, 2101.11149 Cited by: §2.
  • W. Zhou, S. Kim, V. Murali, and G. A. Aye (2021) Improving code autocompletion with transfer learning. CoRR abs/2105.05991. External Links: Link, 2105.05991 Cited by: §1, §2.