“Productivity Assessment of Neural Code Completion”, Albert Ziegler, Eirini Kalliamvakou, Shawn Simister, Ganesh Sittampalam, Alice Li, Andrew Rice, Devon Rifkin, Edward Aftandilian (2022-05-13):

Neural code synthesis has reached a point where snippet generation is accurate enough to be considered for integration into human software development workflows. Commercial products aim to increase programmers’ productivity, without being able to measure it directly.

In this case study, we asked users of GitHub Copilot about its impact on their productivity, and sought to find a reflection of their perception in directly measurable user data.

We find that the rate at which shown suggestions are accepted, rather than more specific metrics regarding the persistence of completions in the code over time, drives developers’ perception of productivity.

…In this work we define acceptance rate as the fraction of completions shown to the developer that are subsequently accepted for inclusion in the source file. The IntelliCode Compose system uses the term CTR (Click Through Rate) for this and reports a value of 10% in online trials [12]. An alternative measure is DCPU (Daily Completions accepted Per User), for which a value of around 20 has been reported [3, 21]. DCPU is not directly comparable to acceptance rate: one must, of course, normalize it by the time spent coding each day. For context, in our study GitHub Copilot has an acceptance rate of 27% and a mean DCPU in excess of 31.
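The two metrics can be made concrete with a small sketch. This is not code from the paper; the function names and the time normalization are illustrative assumptions, using the figures quoted above.

```python
def acceptance_rate(shown: int, accepted: int) -> float:
    """Fraction of shown completions that were accepted into the file."""
    return accepted / shown if shown else 0.0

def completions_per_coding_hour(daily_accepted: int, coding_hours: float) -> float:
    """Normalize raw DCPU by time spent coding, since DCPU alone
    depends on how long the developer codes each day."""
    return daily_accepted / coding_hours if coding_hours else 0.0

# Illustrative numbers only: 31 acceptances out of 115 shown completions
# gives roughly the 27% acceptance rate reported in the study.
rate = acceptance_rate(shown=115, accepted=31)   # ~0.27
per_hour = completions_per_coding_hour(daily_accepted=31, coding_hours=4.0)
```

The point of the normalization is that two users with the same DCPU can have very different acceptance rates if one spends twice as many hours coding.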

…Language Use: We are aware that there are substantial differences in how GitHub Copilot performs across programming languages. The most common languages among our user base are TypeScript (24.7% of all shown completions in the observed time frame, 21.9% for users in the survey), JavaScript (21.3%, 24.2%), and Python (14.1%, 14.5%). The latter two enjoy higher acceptance rates, possibly hinting at a relative strength of neural tooling versus deductive tooling for untyped languages. Regardless of language, survey participants had a slightly higher acceptance rate than the user base as a whole.
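A per-language acceptance rate of the kind discussed above could be computed from a completion event log along these lines. The log format is hypothetical and the events are made-up placeholders, not data from the study.

```python
from collections import defaultdict

# Hypothetical event log: one (language, was_accepted) pair
# per completion shown to a developer.
events = [
    ("TypeScript", False), ("TypeScript", True),
    ("JavaScript", True), ("JavaScript", True), ("JavaScript", False),
    ("Python", True), ("Python", False),
]

shown: dict[str, int] = defaultdict(int)
accepted: dict[str, int] = defaultdict(int)
for language, was_accepted in events:
    shown[language] += 1
    accepted[language] += int(was_accepted)

# Acceptance rate per language: accepted / shown.
rates = {language: accepted[language] / shown[language] for language in shown}
```

Grouping by language before normalizing is what makes the comparison in the text meaningful: raw acceptance counts would mostly reflect each language's share of shown completions.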

Figure 5: Programming language use by survey participants vs. all users.

…We were surprised to find that acceptance rate (number of acceptances normalized by the number of shown completions) was better correlated with reported productivity than our measures of persistence.

But in hindsight, this makes sense. Coding is not typing, and GitHub Copilot’s central value lies not in being the mechanism by which the user enters the greatest possible number of lines of code. Instead, it lies in helping the user make the best progress toward their goals. A suggestion that serves as a useful template to tinker with may be as good as, or better than, a perfectly correct (but obvious) line of code that only saves the user a few keystrokes.

This suggests that a narrow focus on the correctness of suggestions would not tell the whole story for this kind of tooling. Instead, one could view code suggestions inside an IDE as more akin to a conversation with a chatbot. We see anecdotal evidence of this in comments posted about GitHub Copilot online (see Appendix E for examples) in which users talk about sequences of interactions. A conversation turn in this context consists of the prompt in the completion request and the reply as the completion itself. The developer’s response to the completion arises from their subsequent changes, which are incorporated into the next prompt to the model. And there are clear programming parallels to factors such as specificity and repetition that have been identified to affect human judgements of conversation quality [11]. Researchers have already investigated the benefits of natural language feedback to guide program synthesis [2], so ours is not a radical proposal. But neither is it one we have seen followed.
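The conversation-turn framing above can be sketched as a data structure. This is a minimal illustration of the loop described in the text, not an implementation from the paper; the class and function names are invented, and the "next prompt" construction is deliberately simplified to plain concatenation.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    prompt: str       # code context sent in the completion request
    completion: str   # the model's reply, shown as a suggestion
    edited: str       # what the developer actually kept after tinkering

def next_prompt(previous: Turn) -> str:
    """The developer's edits, not the raw completion, become part of the
    context for the next request, closing the conversational loop."""
    return previous.prompt + previous.edited

# One turn: the developer accepts a template-like suggestion and adapts it.
turn = Turn(
    prompt="def mean(xs):\n",
    completion="    return sum(xs) / len(xs)\n",
    edited="    return sum(xs) / len(xs) if xs else 0.0\n",
)
follow_up = next_prompt(turn)
```

Under this framing, a completion that was heavily edited before being kept still contributed to the conversation, which is consistent with acceptance rate mattering more than persistence.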

In future work, we wish to further explore this analogy, borrowing ideas [16] from the evaluation of chatbots and natural language text generation.