“Research Recitation: A First Look at Rote Learning in GitHub Copilot Suggestions”, 2021-07:
I limited the investigation to Python suggestions with a cutoff on May 7, 2021 (the day we started extracting that data). That left 453,780 suggestions spread out over 396 “user weeks”, i.e. calendar weeks during which a user actively used GitHub Copilot on Python code.
…For most of GitHub Copilot’s suggestions, our automatic filter didn’t find any substantial overlap with the code used for training. But it did bring 473 cases to our attention. Removing the first bucket (cases that look very similar to other cases) left me with 185 suggestions. Of these, 144 got sorted into buckets 2–4. This left 41 cases in the last bucket, the “recitations”, in the sense of the term I have in mind.
That corresponds to 1 recitation event every 10 user weeks (95% confidence interval: 7–13 weeks, using a Poisson test).
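The interval can be reproduced from the counts given above (41 recitation events in 396 user weeks). A minimal sketch in Python, using the standard chi-squared bounds for a Poisson count; the Wilson–Hilferty approximation to the chi-squared quantile keeps it stdlib-only (SciPy’s `chi2.ppf` would give exact quantiles):

```python
import math
from statistics import NormalDist

def chi2_ppf(p, df):
    # Wilson-Hilferty approximation to the chi-squared quantile function;
    # accurate to a few tenths for the degrees of freedom used here.
    z = NormalDist().inv_cdf(p)
    h = 2.0 / (9.0 * df)
    return df * (1.0 - h + z * math.sqrt(h)) ** 3

events, weeks = 41, 396

# Exact-style Poisson 95% CI for the event count, via chi-squared quantiles.
lam_lo = chi2_ppf(0.025, 2 * events) / 2        # ~29.4 events
lam_hi = chi2_ppf(0.975, 2 * (events + 1)) / 2  # ~55.6 events

# Convert rates into "one event every N user weeks".
print(weeks / events)                  # ~9.7  -> "every 10 user weeks"
print(weeks / lam_hi, weeks / lam_lo)  # ~7.1 .. ~13.5 -> the "7-13 weeks" CI
```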
…This investigation demonstrates that GitHub Copilot can quote a body of code verbatim, but that it rarely does so, and when it does, it mostly quotes code that everybody quotes, and mostly at the beginning of a file, as if to break the ice.
…The answer is obvious: sharing the pre-filtering solution we used in this analysis to detect overlap with the training set. When a suggestion contains snippets copied from the training set, the UI should simply tell you where it’s quoted from. You can then either include proper attribution or decide against using that code altogether. This duplication search is not yet integrated into the technical preview, but we plan to integrate it. And we will continue to work both on decreasing rates of recitation and on making its detection more precise.
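The post doesn’t describe the filter’s internals, but a duplication search of this kind is commonly built by indexing token n-grams of the training files and flagging any suggestion that shares a long enough window with one of them. The following is only an illustrative sketch under that assumption; the names (`build_index`, `find_quotes`) and the window length are made up for the example:

```python
from collections import defaultdict

def ngrams(tokens, n):
    """All contiguous n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_index(corpus, n=8):
    """corpus: {filename: source text}. Map each token n-gram to the
    set of training files it occurs in."""
    index = defaultdict(set)
    for name, text in corpus.items():
        for gram in ngrams(text.split(), n):
            index[gram].add(name)
    return index

def find_quotes(suggestion, index, n=8):
    """Return the training files sharing at least one n-token window
    with the suggestion -- candidates for attribution."""
    hits = set()
    for gram in ngrams(suggestion.split(), n):
        hits |= index.get(gram, set())
    return hits

# Toy corpus: one "famous" license header and one ordinary file.
corpus = {
    "licenses/gpl.py": "# This program is free software you can redistribute it",
    "app.py": "def main(): pass",
}
idx = build_index(corpus, n=4)
print(find_quotes("# This program is free software you can redistribute it and modify",
                  idx, n=4))
```

A real system would tokenize with a lexer rather than `str.split`, normalize whitespace and identifiers, and require a longer match window to keep the false-positive rate down.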
[If you find ‘1 possible copy every 10 user-weeks’ concerning, you’d better not look too hard at the overlap of your own codebase with StackExchange/GitHub or the licensing requirements (especially attribution) of all code therein…]