1 Introduction
Collaborative software development relies on version control systems such as git to track changes across files. In most projects, developers work primarily in a branch of a software repository, periodically synchronizing their code changes with the main branch via pull requests gousios2016work. When multiple developers make concurrent changes to the same line of code, a merge conflict may occur. According to an empirical study of four large software projects merge-study2, up to 46% of all merge commits result in conflicts. Resolving merge conflicts is a time-consuming, complicated, and error-prone activity that requires understanding both syntax and program semantics, often taking more time than developing the code feature itself bird2012avb.
Modern version control systems such as git utilize the diff3 algorithm for performing an unstructured, line-based, three-way merge of input files smith-98. This algorithm aligns the two-way diffs of the two versions of the code, A and B, over the common base O into a sequence of diff “slots”. At each slot, a change from either A or B is selected. If both program versions introduce a change at the same slot, a merge conflict is produced, and manual resolution of the conflicting modifications is required.
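To make the slot-based behavior concrete, below is a minimal Python sketch of a diff3-style three-way merge over sequences of lines. It is an illustrative toy rather than the actual diff3 implementation (real diff3 chunks and aligns differently), and all names are our own.

```python
import difflib

def _stable_map(base, other):
    """Map base indices that difflib keeps unchanged in `other` to their
    positions in `other`."""
    sm = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
    return {i1 + k: j1 + k
            for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag == "equal"
            for k in range(i2 - i1)}

def three_way_merge(a, base, b):
    """Toy diff3-style merge over sequences of lines (or tokens).

    Base elements left untouched by both A and B act as sync points ("slots");
    the chunk between two sync points is taken from whichever side changed it,
    and a ('conflict', a_chunk, base_chunk, b_chunk) entry is emitted when
    both sides changed it differently."""
    ma, mb = _stable_map(base, a), _stable_map(base, b)
    sync_points = [i for i in range(len(base)) if i in ma and i in mb] + [len(base)]
    merged, ia, ib, io = [], 0, 0, 0
    for s in sync_points:
        ja, jb = ma.get(s, len(a)), mb.get(s, len(b))
        a_chunk, o_chunk, b_chunk = a[ia:ja], base[io:s], b[ib:jb]
        if a_chunk == o_chunk:                          # only B (or neither) changed this chunk
            merged.extend(b_chunk)
        elif b_chunk == o_chunk or a_chunk == b_chunk:  # only A changed, or both made the same change
            merged.extend(a_chunk)
        else:                                           # both changed it differently: conflict slot
            merged.append(("conflict", a_chunk, o_chunk, b_chunk))
        if s < len(base):                               # emit the stable sync element itself
            merged.append(base[s])
        ia, ib, io = ja + 1, jb + 1, s + 1
    return merged
```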
A versatile, production-level merge conflict resolution system should be aware of programming language syntax and semantics yet be sufficiently flexible to work with any source code files, irrespective of the programming language. It should generalize to a wide variety of real-world merge conflicts beyond a specific merge type or a domain of software artifacts.
Inspired by the exceptional performance of transformer models and self-supervised pretraining in natural language understanding and generation tasks bert; gpt2; Liu2019RoBERTaAR; lewis2019bart; raffel2020exploring, as well as in the programming language domain feng-etal-2020-codebert; gptc; clement2020pymt5; tufanoUnitTest; plbart, we introduce MergeBERT: a neural program merge framework based on token-level three-way differencing and transfer learning. Unlike the standard diff3 algorithm, which makes deterministic merge decisions for each line of code, we introduce a token-level variant of diff3 that helps to localize the conflicting chunks, and then utilize a probabilistic neural model that selects the most likely primitive merge pattern. MergeBERT is based on a bidirectional transformer encoder model; however, other encoder architectures, such as LSTM lstm, or efficient transformer variants, like Poolingformer poolingformer, could be utilized here. To endow our model with a basic knowledge of programming language syntax and semantics, we adopt a two-step training procedure: (1) unsupervised masked language model pretraining on a massively multilingual source code corpus, and (2) supervised finetuning for the sequence classification task. We transfer the weights of the pretrained encoder into a multi-input model architecture that encodes all inputs that the standard diff3 algorithm does (the two two-way diffs of the input programs) as well as the edit sequence information, and then aggregates them for learning. We select a bidirectional transformer encoder (BERT) as our encoder implementation. As a bidirectional encoder, BERT allows us to include code context around the conflicting chunks, which is a key advantage over left-to-right language models.
The paper contributions are as follows: (1) we introduce MergeBERT, a novel transformer-based program merge framework that leverages token-level differencing and formulates the task of generating the resolution sequence as a classification task over a set of primitive merge patterns extracted from real-world merge commit data; (2) we effectively transfer knowledge about program syntax and the types of source code identifiers from millions of software programs to the downstream sequence classification task of merge conflict resolution by using unsupervised masked language model pretraining (see section 5), which also makes this approach computationally more feasible; (3) we overcome several limitations of the existing neural program merge models and semi-structured program merge tools like jsFSTMerge and JDime, improving upon the state-of-the-art deepmerge by nearly 2× (see sections 8 and 10); and finally, (4) we demonstrate that a multilingual MergeBERT model trained on the Java, JavaScript, TypeScript, and C# programming languages nearly matches the performance of monolingual models and generalizes zero-shot to unseen languages.
1.1 Related Work
There have been multiple attempts to improve merge algorithms by restricting the merge algorithm to a particular programming language or a specific type of application mens2002state. Typically, such attempts result in algorithms that do not scale well or have low coverage.
Syntactic merge algorithms improve upon diff3 by verifying the syntactic correctness of the merged programs. Several syntactic program merge techniques have been proposed westfechtel1991structure; Asklund1999IdentifyingCD which are based on parse trees or abstract syntax trees and graphs.
Apel et al. noted that structured and unstructured merge each have strengths and weaknesses. They developed a semi-structured merge tool, FSTMerge, which switches between the two approaches apel2010semistructured. They later introduced JDime, an approach that automatically tunes a mixture of structured and unstructured merge based on conflict locations apel2012structured. Tavares et al. tavares2019javascript later implemented jsFSTMerge by adapting an off-the-shelf grammar for JavaScript and modifying the FSTMerge algorithm itself to address JavaScript-specific issues.
In addition, pan-synthesis-2021 explore using program synthesis to learn repeated merge resolutions within a project. However, the approach is limited to a single C++ project and only deals with restricted cases of import statements. Sousa18 explore the use of program verification to certify that a merge obeys a semantic correctness criterion, but this does not help resolve merge conflicts.
2 Motivating Example
In this section, we illustrate how MergeBERT formulates the traditional line-level merge conflict resolution problem as a classification task over token-level conflicting regions. Fig. 1 provides an example merge conflict in JavaScript. Fig. 1(a) on the left shows the standard diff3 markers “<<<<<<< A.js”, “||||||| O.js”, “=======” and “>>>>>>> B.js”, which denote the conflicting regions introduced by programs A, base O, and B, respectively. Here, O represents the most common ancestor of programs A and B in the version control history. We denote the program text of the diff3 conflicting regions as A_i, O_i, B_i, where i is a conflict index. The conflict index may be omitted when referring to programs consisting of a single conflict only. We refer to the program text outside the diff3 conflicting chunks, common to all merged program versions, as a prefix and suffix, and denote it respectively as Pref and Suff throughout the paper. First, MergeBERT represents each line-level merge conflict instance at token level (Fig. 1(b)) with localized conflicting regions a, o, b, and then it predicts its resolution via classification (Fig. 1(c)). Here, and throughout the paper, we use lower-case notation to refer to attributes of token-level differencing (e.g., suff is the suffix of a token-level diff3 conflict region). Intuitively, a token-level merge first turns the line-structured text into a list of tokens (including space and line delimiters), applies diff3 to the resulting documents, and reconstructs the merged document at line level. As a result of the token-level merge, the whole “let x = max(y,” string is cleanly merged, becoming a part of the program prefix pref, and “)” is prepended to the suffix suff.
Observe that the resolution does not consist of any single line from either A or B, since both edits modify a common line present in the base. Hence, earlier neural approaches such as DeepMerge deepmerge would not be able to synthesize the resolution. On the other hand, structured merge techniques (such as jsFSTMerge tavares2019javascript) cannot resolve the conflict, as the conflict appears on a program statement, which has side effects.
A token-level merge can interleave edits within lines (i.e., tokens in which one edit does not conflict with another are trivially merged). Consider A’s edit of the var keyword to let: such non-conflicting edits suffice to demonstrate the above. Likewise, consider the token-level conflict for the arguments to the max function: an appropriate model trained on JavaScript should be able to easily deduce that taking the edit from A (i.e., "11, z") captures the behavior of B’s edit as well. The suggested resolution gives an intuitive demonstration of how MergeBERT turns a complex line-level resolution into a simpler token-level classification problem.
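To illustrate the token-level view, the snippet below reuses the toy three_way_merge sketch from the introduction on whitespace-preserving tokens. The concrete strings are assumptions loosely reconstructed from the description of Fig. 1, not the exact figure contents.

```python
import re

def tokenize(src):
    """Split code into word / whitespace / punctuation tokens so that whitespace
    is preserved and the merged text can be rebuilt by simple concatenation."""
    return re.findall(r"\w+|\s+|[^\w\s]", src)

# Assumed versions in the spirit of Fig. 1.
base = tokenize("var x = max(y, 10);")
a    = tokenize("let x = max(y, 11, z);")   # branch A: var -> let, second argument edited
b    = tokenize("var x = max(y, 11);")      # branch B: only the second argument edited

merged = three_way_merge(a, base, b)        # toy sketch defined in the introduction
print(merged)
# The tokens of "let x = max(y, " and ") ;" merge cleanly; the only remaining entry is
# ('conflict', ['11', ',', ' ', 'z'], ['10'], ['11']), i.e., a single localized
# token-level conflict left for the classifier to label.
```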
MergeBERT can deal with non-trivial real-world merges composed of multiple conflicting chunks. To provide an example of such a merge conflict, we include a complete example in the Appendix.
Figure 1: (a) An example merge conflict in JavaScript with standard diff3 markers; (b) the same conflict after token-level differencing with localized conflicting regions; (c) the resolution predicted via classification.
3 Background: Data-driven Merge
deepmerge introduced the data-driven program merge problem as a supervised machine learning problem. A program merge consists of a 4-tuple of programs (A, B, O, R), where

- the base program O is the most common ancestor in the version history for programs A and B,
- diff3 produces an unstructured (line-level) conflict when applied to (A, O, B), and
- R is the program with the developer resolution, having no conflicts.

Given a set of such programs and merges D, the goal of a data-driven merge is to learn a function merge that maximizes the set of examples for which merge(A, O, B) = R. Moreover, since a program may have multiple unstructured conflicts C_j, j = 0…N, the data-driven merge considers the different merge tuples corresponding to the conflicting regions independently and poses the learning problem over all the merge tuples present in D. deepmerge also provides an algorithm for extracting the exact resolution regions for each merge tuple and defines a dataset that corresponds to non-trivial resolutions, where the developer does not drop one of the changes in the resolution. Further, they provide a sequence-to-sequence encoder-decoder architecture, where a bi-directional gated recurrent unit (GRU) is used for encoding the merge inputs comprising the segments of a merge tuple, and a pointer mechanism is used to restrict the output to choose only from line segments present in the input. Given the restriction on copying only lines from the inputs, the dataset defined in the paper did not consider merges where the resolution required token-level interleaving. Lastly, the dataset consists of merge conflicts in a single language, JavaScript. In contrast, our paper addresses both of these limitations.

4 Merge Conflict Resolution as a Classification Task
In this work, we demonstrate how to exploit the restricted nature of merge conflict resolutions (compared to arbitrary program repair) to leverage discriminative models for the task of generating the resolution sequence. We have empirically observed that a token-level variant of diff3 enjoys two useful properties over its line-level counterpart: (i) it helps localize the merge conflicts to small program segments, effectively reducing the size of conflicting regions, and (ii) most resolutions at the token level consist entirely of changes from a, b, or o, or a sequential composition of a followed by b or vice versa. On the flip side, a token-level merge has the potential to introduce many small conflicts. To balance the trade-off, we start with the line-level conflicts as produced by the line-level merge and perform a token-level merge of only the segments present in the line-level conflict. There are several potential outcomes for such a two-level merge at the line level:
- A conflict-free token-level merge: for example, the edit from A (var to let) is merged, since B does not edit that slot, as shown in Fig. 1(b).
- A single localized token-level merge conflict: for example, the edits from both A and B to the arguments of max yield a single conflict, as shown in Fig. 1(b).
- Multiple token-level conflicts: such a case (not illustrated above) can result in several token-level conflicts.
For a given line-level conflict C_i, we represent the conflicts and resolutions at the token level as a sequence r. We empirically observe that many such r at the token level consist entirely of (a) a, (b) b, (c) o, or the concatenation of (d) a followed by b, or (e) b followed by a. We can therefore treat the problem of constructing r as a classification task predicting among these possibilities. It is important to note that although we are predicting simple resolution strategies at the token level, they translate to complex interleavings at the line level.
Of course, not all line-level conflicts can be resolved by breaking the conflict into tokens—some resolutions that are complex line-based interleavings are not expressible as a choice at the token level.
4.1 Primitive Merge Resolution Types
Given a merge tuple with line-level conflicting regions C_i, i = 0…N, and token-level conflicting regions (a_k, o_k, b_k) corresponding to a line-level conflict C_i, we define the following nine basic merge resolution types, which serve as labels for the supervised classification task:
- Take changes proposed in program A (developer branch A) as the resolution,
- Take changes proposed in program B as the resolution,
- Take changes in the base reference program O as the resolution,
- Take a string concatenation of the changes in A and B as the resolution,
- Take a string concatenation of the changes in B and A as the resolution (reverse order as compared to the previous),
- Take changes proposed in program A, excluding the lines also present in the base reference program O, as the resolution,
- Take changes proposed in program B, excluding the lines present in the base, as the resolution,
- Take a string concatenation of the changes in A and B, excluding the lines present in the base, as the resolution,
- Take a string concatenation of the changes in B and A, excluding the lines present in the base, as the resolution (reverse order as compared to the previous).
We use a data-driven approach to identify these nine primitive merge resolution patterns, based on an analysis of real-world merge conflict resolutions from GitHub. Our analysis shows that over 85% of all merge conflicts can be represented using these labels. While the above nine resolution types are primitive, they form a basis sufficient to cover a large class of real-world merge resolutions in modern version control systems, including arbitrary combinations or interleavings of lines.
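For concreteness, below is a small sketch of how these nine labels could be materialized into a resolution for one conflicting region. The enum names are ours, and the exclusion semantics (“excluding the lines present in the base”) are assumed to mean dropping lines that literally occur in the base region.

```python
from enum import Enum

class ResolutionLabel(Enum):
    """Nine primitive resolution types (names are ours, not the paper's)."""
    TAKE_A = "take A"
    TAKE_B = "take B"
    TAKE_BASE = "take O"
    TAKE_AB = "take A then B"
    TAKE_BA = "take B then A"
    TAKE_A_NO_BASE = "take A minus O"
    TAKE_B_NO_BASE = "take B minus O"
    TAKE_AB_NO_BASE = "take A then B, minus O"
    TAKE_BA_NO_BASE = "take B then A, minus O"

def _drop_base_lines(lines, base_lines):
    """Remove lines that also appear in the base conflicting region."""
    base_set = set(base_lines)
    return [ln for ln in lines if ln not in base_set]

def apply_resolution(label, a, o, b):
    """Materialize a predicted label into concrete resolution lines
    for one conflicting region (a, o, b are lists of lines or tokens)."""
    if label is ResolutionLabel.TAKE_A: return a
    if label is ResolutionLabel.TAKE_B: return b
    if label is ResolutionLabel.TAKE_BASE: return o
    if label is ResolutionLabel.TAKE_AB: return a + b
    if label is ResolutionLabel.TAKE_BA: return b + a
    if label is ResolutionLabel.TAKE_A_NO_BASE: return _drop_base_lines(a, o)
    if label is ResolutionLabel.TAKE_B_NO_BASE: return _drop_base_lines(b, o)
    if label is ResolutionLabel.TAKE_AB_NO_BASE:
        return _drop_base_lines(a, o) + _drop_base_lines(b, o)
    if label is ResolutionLabel.TAKE_BA_NO_BASE:
        return _drop_base_lines(b, o) + _drop_base_lines(a, o)
    raise ValueError(f"unknown label: {label}")
```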
Figure 2: Label distribution for the TypeScript dataset: standard line-level diff3 conflicts (left) and token-level conflicts excluding trivial take A / take B resolutions (right).
Fig. 2 shows the label distribution in our dataset for the TypeScript programming language. The plot on the left shows the label distribution obtained for the standard (line-level) diff3 conflicting regions. As seen, nearly 60% of all cases are trivial – take changes from branch A or B. Arguably, these cases can be resolved without machine learning and are easily addressed by “take ours” or “take theirs” merge resolution strategies. The plot on the right shows the label distribution obtained with the token-level differencing algorithm; it excludes trivial (take A or take B) merge resolutions. Note that a “take A” merge resolution at the token level does not correspond to the “take ours” or “take theirs” merge resolution strategy and can map to any label at the line level, thus representing a non-trivial merge scenario of interest for machine learning studies.
It is important to stress that these primitive merge resolution types are not strictly defined templates dictating which syntactic structures should be selected from the input programs. For instance, the label “take changes proposed in program A” can correspond to a single code token as well as an entire method signature or body. As such, the merge types are not restrictive in their representational power, being capable of representing over 85% of all conflicts.
5 MergeBERT: Neural Program Merge Framework
MergeBERT is a textual program merge model based on a bidirectional transformer encoder. It approaches merge conflict resolution as a sequence classification task, given conflicting regions extracted with token-level differencing and surrounding code as context. The key technical innovation of MergeBERT lies in how it breaks program text into an input representation amenable to training with a bidirectional transformer encoder, and how it pools the various input encodings for classification.
MergeBERT follows the traditional two-step pretraining and finetuning procedure. We use unsupervised masked language modeling (MLM) pretraining on a massively multilingual source code corpus, followed by supervised finetuning for the classification task. For finetuning, we construct a multi-input model architecture that encodes pair-wise aligned token sequences of the conflicting programs A and B with respect to the original program O, as well as the corresponding edit sequence steps (see section 5.3 for details on merge representations), and then aggregates them for learning. An overview of the MergeBERT model architecture is shown in Fig. 3.
Figure 3: Overview of the MergeBERT architecture. Four aligned token sequences are fed to the multi-input encoder neural network, while edit sequences are consumed as type embeddings. Finally, encoded token sequences are summarized into a hidden state which serves as input to the classification layer. We decode the merge conflict resolution by concatenating the prefix tokens, the predicted token-level resolution, and the suffix tokens. See Algorithm 1 for details about merge resolution decoding. Parts of the neural network colored in blue are finetuned; the rest are transferred from the pretrained encoder and frozen.

Given a merge tuple with token-level conflicting chunks (a_k, o_k, b_k), k = 1…K, MergeBERT models the following conditional probability distribution:

P(y_k | a_k|_o, o_k|_a, b_k|_o, o_k|_b)    (1)

and consequently, for entire programs:

P(y_1, …, y_K | A, O, B) = ∏_{k=1}^{K} P(y_k | a_k|_o, o_k|_a, b_k|_o, o_k|_b)    (2)

where K is the number of token-level conflicts in the merge tuple.
5.1 Representing Merge Conflicts
As shown by deepmerge, an effective merge representation needs to be “edit aware”, providing an indication that A and B are edits of the original program O. Prior work on distributed representations of edits yin2019learning describes how to compute a two-way diff using a standard deterministic diffing algorithm and represent the resulting pair-wise alignment as a vector consumable by machine learning models.

Given a merge tuple (A, O, B), MergeBERT first calculates two two-way alignments between the sequence of tokens of the conflicting region a (respectively b) and that of the original program o. For each pair of aligned token sequences we extract an “edit sequence” that represents how to turn the second sequence into the first. These edit sequences – Δ_a and Δ_b – are comprised of the following editing actions (kinds of edits): = represents equivalent tokens, + represents insertions, - represents deletions, a dedicated symbol represents a replacement, and another symbol is used as a padding token. Overall, this produces four token sequences and two edit sequences: (a|_o, o|_a, and Δ_a) and (b|_o, o|_b, and Δ_b). Each token sequence covers the corresponding conflicting region and, potentially, surrounding code tokens (see section 9 for details). Fig. 3 shows an example of an edit sequence.
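A rough sketch of the pair-wise alignment and edit-sequence extraction described above is given below, using difflib as the deterministic two-way diffing algorithm. The marker symbols and the padding convention are assumptions; the exact format used by MergeBERT may differ.

```python
import difflib

EQ, INS, DEL, REP, PAD = "=", "+", "-", "r", "<pad>"   # assumed markers for the edit kinds

def align_with_edits(target, source):
    """Return (aligned_target_tokens, edit_steps) describing how to turn `source`
    into `target`; deleted source tokens appear as PAD positions so that the two
    outputs stay aligned position by position."""
    tokens, steps = [], []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(
            a=source, b=target, autojunk=False).get_opcodes():
        if tag == "equal":
            tokens += target[j1:j2]; steps += [EQ] * (j2 - j1)
        elif tag == "insert":
            tokens += target[j1:j2]; steps += [INS] * (j2 - j1)
        elif tag == "replace":
            tokens += target[j1:j2]; steps += [REP] * (j2 - j1)
        elif tag == "delete":
            tokens += [PAD] * (i2 - i1); steps += [DEL] * (i2 - i1)
    return tokens, steps
```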
5.2 Context Encoding

We pretrain a bidirectional transformer encoder (BERT) model with the masked language modeling objective on a multilingual dataset of source code files. In each source code file, a set of tokens is sampled uniformly at random and replaced with [MASK] symbols, and the model aims to reconstruct the original sequence. We make use of a Byte-Pair Encoding (BPE) unsupervised tokenization procedure to avoid a blowup in the vocabulary size, given the sparse nature of code identifiers 10.1145/3377811.3380342. Besides code tokens, the vocabulary includes the special symbols representing editing steps and the [MASK] symbol.
During finetuning, we introduce an edit type embedding, combining it with the token and position embeddings via addition: e = e_token + e_position + e_edit. The edit type embedding helps the model recognize the edit steps, which are not supplied during pretraining. See Fig. 4 for details.
Figure 4: Combining token, position, and edit type embeddings.
As shown in Fig. 3, we utilize the pretrained encoder model to independently encode each of the four token sequences (a|_o, o|_a, b|_o, and o|_b) of the merged programs, passing the edit sequences (Δ_a and Δ_b) as type embedding indices.
5.3 Merge Tuple Summarization
In standard sequence learning tasks there is one input and one output sequence. In the merge conflict resolution setting, there are multiple input programs and one resolution. To facilitate learning in this setting, we construct MergeBERT as a multi-input encoder neural network, which first encodes the token sequences a|_o, o|_a, b|_o, and o|_b, and then aggregates them into a single hidden summarization state:

h_m = ∑_{x_i ∈ (a|_o, o|_a, b|_o, o|_b)} θ_i ⋅ E(x_i, Δ)

where E is the context encoder and θ_i are the embedding tensors for each of the sequences x_i. After encoding and aggregation, a linear classification layer with softmax is applied:

P(y | h_m) = softmax(W h_m + b)    (3)
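The following PyTorch sketch illustrates the multi-input encoding, summarization, and classification described above. Class and parameter names are ours; in particular, the scalar per-input weights, the [CLS]-position summary, and feeding edit steps through token_type_ids are simplifying assumptions, not the exact MergeBERT implementation.

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class MergeClassifier(nn.Module):
    """Minimal sketch of a multi-input MergeBERT-style head.

    A single pretrained encoder is shared across the four aligned token
    sequences; edit-step indices are fed through token_type_ids so they act
    as an additional edit type embedding; the four pooled states are combined
    with learned per-input weights and classified into the nine labels."""

    def __init__(self, pretrained="roberta-base", num_labels=9, num_edit_types=6):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(pretrained)
        # enlarge the type-embedding table so it can hold the edit-step vocabulary
        self.encoder.embeddings.token_type_embeddings = nn.Embedding(
            num_edit_types, self.encoder.config.hidden_size)
        self.input_weights = nn.Parameter(torch.ones(4))   # theta_i in the summarization formula
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, token_ids, edit_ids, attention_mask):
        # token_ids, edit_ids, attention_mask: (batch, 4, seq_len)
        pooled = []
        for i in range(4):
            out = self.encoder(input_ids=token_ids[:, i],
                               attention_mask=attention_mask[:, i],
                               token_type_ids=edit_ids[:, i])
            pooled.append(out.last_hidden_state[:, 0])      # first-position summary
        h_m = sum(self.input_weights[i] * pooled[i] for i in range(4))
        return self.classifier(h_m)                          # logits over the nine labels
```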
The resulting line-level resolution region is obtained by concatenating the prefix pref, the predicted token-level resolution, and the suffix suff.
Finally, in the case of a one-to-many correspondence between the original line-level and the token-level conflicts (see appendix for an example), MergeBERT uses a standard beam-search to decode the most promising token-level predictions.
6 Merge Resolution Decoding
Each model prediction yields a probability distribution over token-level merge classes given a conflict. In case of a one-to-many correspondence between the original line-level and the token-level conflicts (see, for instance, Fig. 7), to approximate the original resolution we decode the most promising combination from the predicted solution space. This can be conceptualized as a maximum-cost path search on a matrix, which we approach via a beam search algorithm.
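A minimal sketch of such a beam search over per-chunk label distributions is shown below; the input format (one label-to-probability dictionary per token-level chunk) is a hypothetical simplification.

```python
def beam_decode(chunk_label_probs, beam_width=5):
    """Keep the highest-probability combinations of per-chunk labels.
    chunk_label_probs: one dict {label: probability} per token-level conflict."""
    beams = [((), 1.0)]                       # (labels chosen so far, joint probability)
    for probs in chunk_label_probs:
        candidates = [(labels + (lbl,), score * p)
                      for labels, score in beams
                      for lbl, p in probs.items()]
        beams = sorted(candidates, key=lambda c: -c[1])[:beam_width]
    return beams                              # beams[0] is the most promising combination
```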
As a result, the model prediction for each line-level conflict consists of either a label for a single token-level conflict or a combination of labels for multiple token-level conflicts, representing the best prediction for each token-level conflict within the line-level conflict. Given these labels for each line-level conflict and the contents of the merged file, MergeBERT generates the code corresponding to the resolution region. The contents of the merged file include the conflict in question and its surrounding regions. Therefore, for each conflicting line, MergeBERT chooses between the versions of code based on the labels the model produced and generates the resolution code by concatenating them. Afterwards, MergeBERT checks the syntax of the generated resolved code and, in case of correctness, outputs it as the candidate merge conflict resolution.
In case of multiple line-level conflicts in the merged file, MergeBERT refines the contents of the merged file that serve as the surrounding region of the conflict. More specifically, for each line-level conflict, MergeBERT replaces the other conflicts in the merged file contents with the code it previously generated as their predicted resolutions. For this purpose, MergeBERT updates the contents of the merged file after resolving each line-level conflict with the code it generates as the conflict resolution based on the model prediction.
7 Dataset
To create a dataset for pretraining, we clone all non-fork repositories with more than 20 stars in GitHub that have C, C++, C#, Python, Java, JavaScript, TypeScript, PHP, Go, and Ruby as their top language. The resulting dataset comprises over 64 million source code files.
The finetuning dataset is mined from over 100 thousand open-source software repositories in multiple programming languages with merge conflicts. It contains commits from git histories with at least two parents which resulted in a merge conflict. We replay git merge on the two parents to see if it generates any conflicts; otherwise, we discard the merge from our dataset (a sketch of this replay step is shown after Tab. 1). We follow deepmerge to extract resolution regions—however, we do not restrict ourselves to conflicts with fewer than 30 lines. Lastly, we extract token-level conflicts (and labels) from line-level conflicts (and resolutions). Tab. 1 provides a summary of the finetuning dataset.

Programming language | Train set | Test set
---|---|---
C# | 27874 | 6969
JavaScript | 66573 | 16644
TypeScript | 22422 | 5606
Java | 103065 | 25767
Scala | — | 389
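A sketch of the merge-replay step is shown below. It is a hypothetical helper built on plain git commands; the actual mining pipeline (repository selection, timeouts, file filtering) is more involved.

```python
import subprocess

def conflicting_merges(repo_path):
    """Yield merge commits whose parents, when re-merged, produce conflicts."""
    log = subprocess.run(
        ["git", "-C", repo_path, "rev-list", "--merges", "--parents", "HEAD"],
        capture_output=True, text=True, check=True).stdout
    for line in log.splitlines():
        commit, *parents = line.split()
        if len(parents) != 2:
            continue
        subprocess.run(["git", "-C", repo_path, "checkout", "-q", "--detach", parents[0]],
                       check=True)
        merge = subprocess.run(
            ["git", "-C", repo_path, "merge", "--no-commit", "--no-ff", parents[1]],
            capture_output=True, text=True)
        if merge.returncode != 0:            # git exits non-zero when the merge conflicts
            yield commit
        subprocess.run(["git", "-C", repo_path, "merge", "--abort"], capture_output=True)
```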
8 Baseline Models
8.1 Language Model Baseline
Neural language models (LMs) have shown great performance in natural language generation gpt2; sellam-etal-2020-bleurt, and have been successfully applied to the domain of source code 10.5555/2337223.2337322; gptc; feng-etal-2020-codebert. We consider the generative pretrained transformer language model for code (GPT-C) and appeal to the naturalness-of-software hypothesis naturalness to construct our baseline approaches for the merge resolution synthesis task. We establish the following baseline: given an unstructured (line-level) conflict produced by diff3, we take the common source code prefix Pref, acting as the user intent for the program merge, and attempt to generate the entire resolution region token-by-token using beam search. As an ablation experiment, we repeat this for the conflict produced with the token-level differencing algorithm (see Fig. 1 for details about the prefix and conflicting regions).
8.2 DeepMerge: Neural Model for Interleavings
Next, we consider DeepMerge deepmerge: a sequence-to-sequence model based on the bi-directional GRU summarized in section 3. It learns to generate a resolution region by choosing from line segments present in the input (line interleavings) with a pointer mechanism. We retrain the DeepMerge model on our TypeScript dataset.
8.3 JDime
Looking for a stronger baseline, we consider JDime, a Java-specific merge tool that automatically tunes the merging process by switching between structured and unstructured merge algorithms apel2012structured. Structured merge is abstract syntax tree (AST) aware and leverages syntactic information to improve the matching precision of conflicting nodes. To compare the accuracy of JDime to that of MergeBERT, we use the Java test set and complete the following evaluation steps. First, we identify the set of merge scenarios where JDime did not report a merge conflict but the standard diff3 algorithm did. Second, we compare the JDime output to the version of the code where the merge conflict is resolved. Third, we calculate JDime accuracy by identifying the number of merges where the JDime output file correctly matches the resolved conflict file.
As a result of its AST matching approach, code generated by JDime is reformatted, and the original order of statements is not always preserved. In addition, source code comments that are part of conflicting code chunks are not merged.
A simple syntactic comparison is too restrictive, as JDime merge output can still be semantically correct. To accurately identify semantically equivalent merges, we use GumTree FalleriMBMM14, an AST differencing tool, to compute fine-grained edit scripts between the two merge files. By ignoring semantically equivalent differences computed by GumTree (such as moved method declarations), we obtain a more accurate comparison between the number of semantically equivalent merges generated by JDime and MergeBERT.
8.4 jsFSTMerge
Apel et al. apel2012fstmerge introduced FSTMerge, a semi-structured merge engine that uses an approach similar to JDime but allows a user to provide an annotated language grammar specification for any language. tavares2019javascript implemented jsFSTMerge by adapting an off-the-shelf grammar for JavaScript to address shortcomings of FSTMerge and also modifying the FSTMerge algorithm itself. For example, in JavaScript, statements can be intermingled with function declarations at the same syntactic level, and statement order must be preserved while function order need not be. jsFSTMerge allows certain types of nodes to maintain their relative order (e.g., statements) while others may be order-independent (e.g., function declarations), even if they share the same parent node.
For cases where jsFSTMerge produces a resolution that does not match the user resolution, we manually inspect the output for semantic equivalence (e.g., reordered import statements).
9 Implementation Details
We pretrain a BERT model with 6 encoder layers, 12 attention heads, and a hidden state size of 768. The vocabulary is constructed using the byte-pair encoding method sennrich2015neural, with a vocabulary size of 50000. We set the maximum sequence length to 512. Input sequences cover the conflicting regions and surrounding code (i.e., fragments of Pref and Suff) up to a maximum length of 512 BPE tokens. The backbone of our implementation is HuggingFace’s RobertaModel and RobertaForSequenceClassification classes in PyTorch, which are modified to turn the model into the multi-input architecture shown in Fig. 3.

We train the model with the Adam stochastic optimizer with weight decay fix, using a learning rate of 5e-5, a batch size of 512, and 8 backward passes per allreduce, on 64M files in the C, C++, C#, Python, Java, JavaScript, TypeScript, PHP, Go, and Ruby programming languages. The training was performed on 64 NVIDIA Tesla V100 GPUs with 32GB memory for 21 days; we utilized mixed precision. Finetuning was performed on 4 NVIDIA Tesla V100 GPUs with 16GB memory for 6 hours.

In the inference phase, the model prediction for each line-level conflict consists of one or more token-level predictions. Given the token-level predictions and the contents of the merged file, MergeBERT generates the code corresponding to the resolution region. The contents of the merged file include the conflict in question and its surrounding regions. Afterward, MergeBERT checks the syntax of the generated code with the tree-sitter parser (https://tree-sitter.github.io/tree-sitter/) and outputs it as the candidate merge conflict resolution only in case of correctness.
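As an illustration of the syntax check, the snippet below uses the py-tree-sitter bindings (the older, widely documented API; newer versions changed the setup, and the grammar bundle path is an assumption).

```python
from tree_sitter import Language, Parser

# Assumes a grammar bundle was built beforehand, e.g. with Language.build_library(...).
JS_LANGUAGE = Language("build/languages.so", "javascript")
parser = Parser()
parser.set_language(JS_LANGUAGE)

def is_syntactically_valid(code: str) -> bool:
    """Accept a candidate resolution only if the parse tree contains no error nodes."""
    tree = parser.parse(code.encode("utf8"))
    return not tree.root_node.has_error
```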
10 Evaluation
We evaluate MergeBERT’s accuracy of resolution synthesis. Our evaluation metrics are precision and recall of verbatim string match (modulo whitespace and indentation) of the decoded top-1 prediction against the user resolution extracted from real-world merge resolutions. This definition is rather restrictive, as a predicted resolution might differ from the true user resolution only by, for instance, the order of statements, being semantically equivalent otherwise. As such, this evaluation approach gives a lower bound of MergeBERT's performance.
In addition to the precision and recall, we estimate the fraction of syntactically correct (or parseable) source code suggestions to filter out merge resolutions with syntax errors.
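Our reading of these metrics is sketched below; the exact whitespace normalization is an assumption, since the text only states that matches are verbatim modulo whitespace and indentation.

```python
import re

def normalize(code: str) -> str:
    """Collapse whitespace/indentation so the match is 'verbatim modulo whitespace'."""
    return re.sub(r"\s+", " ", code).strip()

def precision_recall(predictions, user_resolutions):
    """predictions[i] is the top-1 suggested resolution for conflict i, or None when
    the model produced no suggestion; user_resolutions[i] is the ground truth."""
    suggested = [(p, r) for p, r in zip(predictions, user_resolutions) if p is not None]
    correct = sum(normalize(p) == normalize(r) for p, r in suggested)
    precision = correct / len(suggested) if suggested else 0.0
    recall = correct / len(user_resolutions) if user_resolutions else 0.0
    return precision, recall
```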
10.1 Baseline Model Evaluations
Approach | Precision | Recall | F-score | BLEU-4 |
---|---|---|---|---|
LM (line) | 3.6 | 3.1 | 3.3 | 5.2 |
LM (token) | 49.7 | 48.1 | 48.9 | 55.3 |
DeepMerge | 34.9 | 22.2 | 27.1 | 51.2 |
MergeBERT | 69.1 | 68.2 | 68.7 | 78.6 |
As seen in Tab. 2, MergeBERT significantly outperforms the language model baselines in the precision of merge resolution synthesis, suggesting that the naturalness hypothesis alone is insufficient to capture developer intent when merging programs. This is perhaps not surprising given the strict notion of precision, which does not tolerate even a single token mismatch. We therefore also consider a more relaxed evaluation metric – the BLEU-4 score – which defines similarity based on an n-gram model. The LM baseline over token-level conflicts achieves a modest 55.3, while MergeBERT still outperforms it with 78.6.
DeepMerge achieves a respectable 34.9% top-1 precision of merge resolution synthesis, but this is roughly half of the 69.1% of correctly generated resolutions by MergeBERT. Moreover, it was only able to produce predictions for 63.8% of the test conflicts, failing to generate predictions for merge conflicts that are not representable as a line interleaving. This type of merge conflict comprises almost a third of the test set, leading to an F-score roughly 2.5 times lower than MergeBERT's.
As can be seen from Tab. 4, jsFSTMerge is only able to produce a resolution for 22.8% of conflicts, and the produced resolution is correct only 15.8% of the time. This is in line with the conclusions of the creators of jsFSTMerge that semi-structured merge approaches may not be as advantageous for dynamic scripting languages tavares2019javascript. Because jsFSTMerge may produce reformatted code, we manually examined cases where a resolution was produced but did not match the user resolution (our oracle). If the produced resolution was semantically equivalent to the user resolution, we classified it as correct.
Tab. 3 shows the detailed evaluation results of MergeBERT.
Test (Train) Languages | Precision | Recall | F-score | Fraction Merged | Syntax correct
---|---|---|---|---|---
JavaScript (JS, TS, C#, Java) | 66.6 | 65.3 | 65.9 | 98.1 | 97.4 |
TypeScript (JS, TS, C#, Java) | 68.5 | 67.6 | 68.0 | 98.7 | 96.9 |
Java (JS, TS, C#, Java) | 63.6 | 62.9 | 63.2 | 98.9 | 98.2 |
C# (JS, TS, C#, Java) | 66.3 | 65.1 | 65.7 | 98.1 | 98.3 |
JavaScript (JS) | 66.9 | 65.6 | 66.2 | 98.0 | 97.4 |
TypeScript (TS) | 69.1 | 68.2 | 68.7 | 98.7 | 97.0 |
Java (Java) | 63.9 | 63.2 | 63.5 | 98.8 | 98.3 |
C# (C#) | 68.7 | 67.3 | 68.0 | 97.9 | 98.3 |
Scala | 57.8 | 56.5 | 57.1 | 97.7 | 97.9 |
Approach | Precision | Recall | F-score | Syntax |
---|---|---|---|---|
JDime | 26.3 | 21.6 | 23.7 | 90.9% |
MergeBERT (Java) | 63.9 | 63.2 | 63.5 | 98.3%
jsFSTMerge | 15.8 | 3.6 | 5.9 | 94.4% |
MergeBERT (JavaScript) | 66.9 | 65.6 | 66.2 | 97.4%
10.2 Impact of Pretraining
As shown in Fig. 3, the effect of transfer learning is two-fold: (1) it speeds up the time to solution as a result of faster model convergence – we observe a 20% higher F-score after 5 training epochs – and a 32 times larger finetuning training throughput, and (2) it yields a 14% higher overall F-score as compared to a model trained from scratch.
Figure: Effect of pretraining on finetuning convergence and F-score.
For reference, we employ the CodeBERT public checkpoint for the downstream task of merge conflict resolution. It shows an F-score comparable to our pretrained encoder; a likely explanation for the remaining difference is that CodeBERT is pretrained on the CodeSearchNet dataset, which does not include the C# and TypeScript programming languages used in this study.
10.3 Multilinguality and Zero-shot Generalization
The multilingual variant of MergeBERT yields a high top-1 precision of verbatim match and relatively high recall values (Tab. 3). Overall, the multilingual variant of the model generates results comparable to the monolingual versions on the languages present in the training set and shows the potential for zero-shot generalization to unseen languages. We test the zero-shot generalization property on merge conflicts in the Scala (https://www.scala-lang.org/) programming language and obtain an encouraging 57.8% precision of merge resolution synthesis.
10.4 Inference Cost
Computational efficiency is an important constraint influencing machine learning design decisions in production environments (e.g., deployment in an IDE or a GitHub action). In the following, we discuss the inference costs and floating point operations (FLOPs) of MergeBERT as compared to the best-performing baseline – the GPT-C language model.
In this paper, we reformulate the task of merge conflict resolution as a classification problem. This provides a major speedup during inference due to the smaller number of inference calls necessary to decode a resolution. Indeed, in most cases MergeBERT requires only 1 inference call to resolve a merge conflict, with up to 3 calls in the worst case, based on our dataset. The cost of a single inference call on a 16GB Tesla V100 GPU is 60 ms. The end-to-end time to resolve a merge conflict (including tokenization, alignment, and edit sequence extraction) is 105 ms on average, and up to 500 ms in the worst case.
With the GPT-C language model, the resolution region is decoded token-by-token via the beam search algorithm. The average time to decode a single token (in our experiments we use a beam width of 5 and a 1024-token context length, with past hidden state caching optimization enabled) on a 16GB Tesla V100 GPU is about 15 ms. With token-level differencing, the resolution size is 70 tokens on average (up to 1584 tokens maximum in our dataset), which yields 1.1 seconds on average and up to 23.8 seconds in the worst case (the largest conflict) to generate the resolution token sequence. Overall, the end-to-end inference time required to resolve a merge conflict (including parsing and tokenization) is 2.3 seconds on average and up to 48.5 seconds for the largest conflict. From the user experience perspective in an IDE, inference times of over 10 seconds are prohibitively slow.
10.4.1 Floating Point Operations
In the following, we identify the main operations in the transformer encoder for the multi-input MergeBERT architecture (see Fig. 3 for reference):

- Self-attention: 600 MFLOPs x 4 inputs (encoder weights are shared for all inputs),
- Feed-forward layer: 1200 MFLOPs x 4 inputs.

The contribution of the lightweight pooling (aggregation) and classification layers is negligibly small. With a total of 6 transformer encoder layers, this yields 43200 MFLOPs per forward pass.
For the GPT-C transformer decoder-only model we get:

- Self-attention: 600 MFLOPs,
- Feed-forward layer: 1200 MFLOPs.

With a total of 12 transformer layers this yields 21600 MFLOPs per inference call (and 10800 MFLOPs for 6 layers).
Although MergeBERT requires more FLOPs per single forward pass than the generative approach, it achieves a significant reduction in the total FLOPs required to decode a resolution region because it needs orders of magnitude fewer calls (1–3 calls with MergeBERT as compared to 70–1584 with a language model), making this approach an appealing candidate for deployment in an IDE.
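A back-of-envelope comparison using the per-call figures above (all numbers in MFLOPs; call counts from Sec. 10.4):

```python
mergebert_per_call, gptc_per_call = 43_200, 21_600   # MFLOPs per forward pass / inference call
average = {"MergeBERT": 1 * mergebert_per_call, "GPT-C": 70 * gptc_per_call}
worst   = {"MergeBERT": 3 * mergebert_per_call, "GPT-C": 1_584 * gptc_per_call}
print(average)  # {'MergeBERT': 43200, 'GPT-C': 1512000}  -> roughly 35x fewer MFLOPs on average
print(worst)    # {'MergeBERT': 129600, 'GPT-C': 34214400} -> roughly 260x fewer in the worst case
```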
11 Conclusion
This paper introduces MergeBERT, a neural program merge framework that improves upon the existing state-of-the-art tools for automatic merge resolution by over 2×. MergeBERT exploits pretraining over massive amounts of code followed by finetuning on specific programming languages, achieving 64–69% precision of merge resolution synthesis. MergeBERT views a line-level merge conflict as a token-level prediction task, thus turning a generative sequence-to-sequence task into a discriminative one. Lastly, MergeBERT is flexible and effective, capable of resolving more conflicts than the existing tools in multiple programming languages.
Our work focuses on helping software developers resolve merge conflicts and improve their productivity. The finetuning approach that lies at the core of this tool promotes the re-usability of pretrained transformer models for software engineering tasks, thus reducing the carbon footprint of a product that may utilize MergeBERT.
References
12 Appendix
MergeBERT can deal with non-trivial real-world merges, composed of multiple conflicting chunks. To provide an example of such a merge conflict, we include Fig. 7. MergeBERT correctly predicts a concatenation of changes proposed by developers A and B for the first token-level chunk and a concatenation of changes proposed by developers B and A (in the reverse order) for the second chunk.
12.1 Implementation Details
Each model prediction yields a probability distribution over word-level merge classes given a conflict. In case of a one-to-many correspondence between the original line-level and the word-level conflicts (see, for instance, Fig. 7), to approximate the original resolution we decode the most promising combination from the predicted solution space. This can be conceptualized as a maximum-cost path search on a matrix, which we approach via a dynamic programming algorithm.

As a result, the model prediction for each line-level conflict consists of either a label for a single word-level conflict or a combination of labels for multiple word-level conflicts, representing the best prediction for each word-level conflict within the line-level conflict. Given these labels for each line-level conflict and the contents of the merged file, MergeBERT generates the code corresponding to the resolution region. The contents of the merged file include the conflict in question and its surrounding regions. Therefore, for each conflicting line, MergeBERT chooses between the versions of code based on the labels the model produced and generates the resolution code by concatenating them. Afterwards, MergeBERT checks the syntax of the generated resolved code and, in case of correctness, outputs it as the candidate merge conflict resolution.

In case of multiple line-level conflicts in the merged file, MergeBERT refines the contents of the merged file that serve as the surrounding region of the conflict. More specifically, MergeBERT replaces the other conflicts in the merged file contents with the code it previously generated as their predicted resolutions. For this purpose, MergeBERT updates the contents of the merged file after resolving each line-level conflict with the code it generates as the conflict resolution based on the model prediction.
Figure 7: A non-trivial real-world merge conflict composed of multiple token-level conflicting chunks and the resolutions predicted by MergeBERT.