“Can Pre-Trained Language Models Be Used to Resolve Textual and Semantic Merge Conflicts?”, Jialu Zhang, Todd Mytkowicz, Mike Kaufman, Ruzica Piskac, Shuvendu K. Lahiri, 2021-11-23:

Program merging is standard practice when developers integrate their individual changes to a common code base. When the merge algorithm fails, this is called a merge conflict. The conflict either manifests in textual merge conflicts where the merge fails to produce code, or semantic merge conflicts where the merged code results in compiler or test breaks. Resolving these conflicts for large code projects is expensive because it requires developers to manually identify the sources of conflict and correct them.

In this paper, we explore the feasibility of automatically repairing merge conflicts (both textual and semantic) using k-shot learning with large neural language models (LMs) such as GPT-3 [but not Codex/Copilot, or fine-tuned GPT-3]. One of the challenges in leveraging such language models is fitting the examples and the queries within a small prompt (2,048 tokens). We evaluate LMs and k-shot learning for 2 broad applications: (1) textual and semantic merge conflicts for a divergent fork, Microsoft Edge, and (2) textual merge conflicts for a large number of JavaScript projects on GitHub.
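The k-shot setup under a fixed token budget can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual prompt format: the section labels, the whitespace token proxy, and the function names are all assumptions.

```python
# Hypothetical sketch of one-shot prompt assembly under the 2,048-token
# GPT-3 prompt limit cited in the abstract. The "Conflict:"/"Resolution:"
# labels and the tokenizer proxy are illustrative assumptions, not the
# paper's actual format.

MAX_TOKENS = 2048

def approx_tokens(text: str) -> int:
    # Crude proxy: a real system would use the model's BPE tokenizer.
    return len(text.split())

def build_prompt(shot_conflict: str, shot_resolution: str,
                 query_conflict: str) -> str:
    """Assemble a one-shot prompt: one worked example (conflict plus its
    developer resolution), then the query conflict left open for the model
    to complete."""
    prompt = (
        "Conflict:\n" + shot_conflict + "\n"
        "Resolution:\n" + shot_resolution + "\n"
        "Conflict:\n" + query_conflict + "\n"
        "Resolution:\n"
    )
    if approx_tokens(prompt) > MAX_TOKENS:
        raise ValueError("prompt exceeds the model's context window")
    return prompt
```

The point of the shot is visible in the layout: the example pair shows the model both the task and the shape of the expected output, and everything must still fit in the one fixed-size prompt.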

Our results are mixed: on the one hand, LMs provide state-of-the-art (SOTA) performance on semantic merge conflict resolution for Edge compared to earlier symbolic approaches; on the other hand, LMs do not yet obviate the benefits of fine-tuning neural models (when sufficient data is available) or of designing special-purpose domain-specific languages (DSLs) for restricted patterns of program synthesis.

Figure 8: Accuracy Comparison for Gmerge on Resolving Semantic Merge Conflicts

…The evaluation shows that including a shot in the input to the language model substantially improves the results across all prompt structures. This matches our predictions: a shot not only clearly pinpoints the current task for the model but also provides an example of the expected output. Moreover, the evaluation shows that providing more conflict-related code changes as context improves the model's accuracy. It further shows that, with the heuristics, Gmerge achieved its highest accuracy of 64.6%.

GPT-3 and GPT-J each output one resolution per model trial. In our experiment, we repeatedly query GPT-3, and if the correct resolution is produced in any of the trials, we mark the merge conflict as “resolved”. We evaluated how the number of trials affects model accuracy. Figure 8 shows that the overall accuracy of both GPT-3 and GPT-J increased with the number of model trials. For example, for GPT-3, 10 independent trials achieve an accuracy of 64.6%, in contrast to 37.2% with only one trial. Compared to GPT-3, we observed only a modest accuracy gain for GPT-J. StringMerge and Transformation.Text show no accuracy gain because they produce a deterministic result in every run.
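The evaluation criterion above (a conflict counts as resolved if any of k independent samples reproduces the developer's resolution) can be sketched as below. The stochastic toy model, its per-trial success rate, and the dataset are invented stand-ins for the real LM queries.

```python
# Sketch of the repeated-trial evaluation: a conflict is "resolved" if any
# independent trial reproduces the ground-truth resolution. The toy model
# below is a stand-in for querying GPT-3; its 0.37 per-trial hit rate is
# an illustrative assumption, not the paper's measurement procedure.
import random

def resolved_within_trials(sample_fn, ground_truth, trials):
    """True if any of `trials` independent samples matches the ground truth."""
    return any(sample_fn() == ground_truth for _ in range(trials))

def accuracy(dataset, sample_fn, trials):
    """Fraction of (conflict, ground_truth) pairs resolved within the budget."""
    hits = sum(resolved_within_trials(lambda: sample_fn(c), gt, trials)
               for c, gt in dataset)
    return hits / len(dataset)

# Toy stochastic "model": each trial succeeds with probability 0.37,
# so accuracy rises with the trial budget, as in Figure 8.
rng = random.Random(0)
def toy_model(conflict):
    return "fix" if rng.random() < 0.37 else "wrong"

data = [("conflict-%d" % i, "fix") for i in range(200)]
acc1 = accuracy(data, toy_model, 1)
acc10 = accuracy(data, toy_model, 10)
```

Note that this "any trial counts" metric is optimistic in deployment terms: it assumes a developer (or a test suite) can cheaply recognize the correct resolution among the candidates.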

Are larger language models more accurate than smaller ones? One benefit of Gmerge is that its k-shot approach does not require expensive task-specific fine-tuning; thus, Gmerge can benefit from large-scale autoregressive language models. In this section, we demonstrate that model size has a substantial impact on Gmerge’s task-specific accuracy. Figure 8 shows that the overall accuracy of GPT-3 increased more sharply than that of GPT-J as the number of model queries grew: querying GPT-3 multiple times resolves ~30% additional merge conflicts, whereas for GPT-J only 5% additional merge conflicts can be resolved in this setting.
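A back-of-envelope calculation, under the (clearly false) assumption that trials on the same conflict are independent, shows why the 64.6% ceiling is informative rather than just a sampling artifact:

```python
# If each GPT-3 trial succeeded independently at the reported single-trial
# rate, ten trials would resolve almost every conflict. The independence
# assumption here is mine, made precisely to show it fails.
p_single = 0.372                       # single-trial accuracy reported above
k = 10
p_independent = 1 - (1 - p_single) ** k  # ≈ 0.99
```

Since the observed 10-trial accuracy is only 64.6%, far below ~99%, trials on the same conflict must be strongly correlated: repeated sampling helps on borderline conflicts, but a substantial fraction of conflicts are never resolved no matter how many times the model is queried.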