A tremendous amount of code is generated and updated every day, which is often maintained by version control systems such as Git. In the course of software development and maintenance, developers could frequently change their code to fix bugs, add features, perform refactoring, etc. Commits keep track of these code changes. Each commit is associated with a message which describes what and why these code changes are made. Commit messages can help developers understand and analyze code changes. For example, they can provide additional explanatory power in maintenance classification  and Just-In-Time defect prediction . Refactoring opportunities can be found with the analysis of commit messages .
However, writing commit messages manually is time-consuming and laborious, especially when the code is updated frequently. Generating commit messages automatically is therefore very helpful to developers. Early work on commit message generation [40, 10, 12] is based on expert rules. Commit messages generated by rule-based methods tend to have too many lines, making it difficult to convey the key intention of the code changes. Later, information retrieval techniques were introduced to commit message generation. For instance, NNGen, proposed by Liu et al., is a simple yet effective retrieval-based method that uses the nearest neighbor algorithm. ChangeDoc, proposed by Huang et al., is another method that retrieves the most similar commits according to the syntax and semantics of the changed code. Recently, various deep learning-based models have been proposed for commit message generation. Some studies [18, 29, 30] represent code changes as textual sequences and use Neural Machine Translation (NMT) techniques to translate the source code changes into target commit messages. In addition, Liu et al. adopt the pointer-generator network to handle out-of-vocabulary (OOV) words. Other studies leverage the rich structural information of source code. Xu et al. jointly model the semantic representation and structural representation of code changes, and Liu et al. capture both the AST structure of code changes and its semantics for commit message generation.
However, we notice that several important aspects are overlooked in existing work. First, the evaluation metrics used to evaluate commit message generation models vary considerably, and the differences between these metrics and their applicability have received little attention. Second, the evaluation datasets differ across models, and most studies experiment on only one dataset. Third, the applicable scenarios of the models, such as those implied by different data splitting strategies, are rarely discussed. In view of these limitations, in this paper we dive deep into the problem and answer: how can commit message generation models be evaluated and compared more correctly and comprehensively?
To answer the above question, we conduct a systematic analysis of commit message generation methods and their performance. We analyze the features in commit messages and compare the BLEU variants with human judgment. Moreover, we collect a large-scale commit message dataset from 500 repositories with more than 3.6M commit messages in five popular programming languages (PLs). Benefiting from the comprehensive information provided by our dataset, we evaluate the performance of existing methods on multiple PLs. We also compare the influence of different splitting strategies on model performance and discuss the applicable scenarios.
Through extensive experimental and human evaluation, we obtain the following findings about the current commit message generation models and datasets:
Most existing datasets are crawled from only Java repositories while repositories in other PLs are rarely taken into account. Moreover, we find some context information contributing to generation models is not available in the existing public datasets.
By comparing three BLEU variants commonly used in existing work, we find they show inconsistent results in many cases. From the correlation coefficient between human evaluation and different BLEU variants, we find that B-Norm is a more suitable BLEU variant for this task and it is recommended to be used in future research.
By comparing different models on existing datasets and our multi-programming-language dataset, we find existing models show different performance. On the positive side, most of them can be migrated to repositories in other PLs.
The dataset splitting strategies have a significant impact on the evaluation of commit message generation models. Many studies randomly split datasets by commit. Such a splitting strategy cannot simulate the Just-In-Time situation, where the training set contains no data later than that in the test set. Our results show that models evaluated on datasets split by timestamp perform much worse than on datasets split by commit. In the Just-In-Time scenario, we suggest splitting by timestamp for a practical evaluation of models. Moreover, splitting data by project also leads to worse performance, meaning that the generalization ability of existing models to new repositories is limited. Therefore, in order to evaluate models more comprehensively, we also suggest evaluating models on datasets split by project.
Through these findings, we give actionable suggestions on comprehensively evaluating commit message generation models. Then, we discuss future work from three aspects: metrics, information, different scenarios. Finally, we discuss some threats to validity. To summarize, the major contributions of this paper are as follows:
We perform a systematic evaluation of existing work on commit message generation and summarize our findings (including the inappropriate use of BLEU metrics and data splitting strategies), which have not received enough attention in the past.
We develop a large multi-programming-language dataset, which contains comprehensive information for each commit and is publicly available (https://doi.org/10.5281/zenodo.5025758).
We suggest ways for better evaluating commit message generation models. We also suggest possible research directions that address the limitations of existing models.
II Commit Message Generation
In this section, we give an overview of existing models, experimental datasets, and evaluation metrics for commit message generation.
We study representative commit message generation models proposed in recent years. They can be classified into three categories, namely generation-based models, retrieval-based models, and hybrid models.
II-A1 Generation-based Models
CommitGen, proposed by Jiang et al. , is an early attempt to adopt Neural Machine Translation (NMT) techniques in commit message generation. It treats code diffs and commit messages as inputs and outputs, respectively. CommitGen adapts one NMT model Nematus  with the attentional RNN encoder-decoder . The attention mechanism is introduced to capture long-distance features as CommitGen uses 100 as the maximum input length, which is much longer than that (50) used in Nematus .
CoDiSum, proposed by Xu et al. , is another encoder-decoder based model. The major difference between CoDiSum and other models is the design of the encoder part. CoDiSum jointly models the structure and the semantics of the code diffs with a multi-layer bidirectional GRU to better learn the representations of the code changes. Moreover, the copying mechanism  is used in the decoder to mitigate the out-of-vocabulary (OOV) issue.
II-A2 Retrieval-based Models
NNGen, proposed by Liu et al., leverages the nearest neighbor (NN) algorithm to generate commit messages. Code diffs are represented as vectors in the form of “bag of words”. To generate a commit message, NNGen calculates the cosine similarity between the target code diff and each code diff in the training set. Then, the top-k code diffs in the training set are selected to compute the BLEU scores between each of them and the target code diff. The one with the largest BLEU score is regarded as the most similar code diff and its commit message will be used as the target commit message. The use of cosine similarity for retrieval can boost efficiency and the use of BLEU for ranking can improve the performance. Therefore, NNGen strikes a balance between effectiveness and efficiency.
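The two-stage retrieval described above can be illustrated with a minimal Python sketch. This is not the NNGen implementation: the function names (`nngen_predict`, `simple_bleu`, `cosine`), the whitespace tokenization, the default `k`, and the add-one smoothed BLEU used for re-ranking are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def simple_bleu(reference, candidate, max_n=4):
    """Add-one smoothed sentence BLEU (illustrative stand-in, not NNGen's script)."""
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        matched = sum(min(c, ref[g]) for g, c in cand.items())
        log_sum += math.log((matched + 1) / (sum(cand.values()) + 1))
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_sum / max_n)


def nngen_predict(target_diff, training_pairs, k=5):
    """Return the message of the training diff judged most similar to target_diff.

    training_pairs is a list of (diff, message) string pairs (hypothetical format).
    """
    target_tokens = target_diff.split()
    target_bow = Counter(target_tokens)
    # Step 1: shortlist the k nearest training diffs by cosine similarity.
    shortlist = sorted(
        training_pairs,
        key=lambda pair: cosine(target_bow, Counter(pair[0].split())),
        reverse=True,
    )[:k]
    # Step 2: re-rank the shortlist by BLEU between each diff and the target diff.
    _, best_message = max(
        shortlist, key=lambda pair: simple_bleu(target_tokens, pair[0].split())
    )
    return best_message
```

The cheap cosine pass limits the more expensive BLEU computation to k candidates, which is the efficiency/effectiveness trade-off noted above.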
CC2Vec, proposed by Hoang et al. , is a neural model that learns a representation of code changes guided by the accompanying commit messages. It aims to represent the semantic intent of the code changes by modeling the hierarchical structure of a code change with an attention mechanism and using various comparison functions to identify the differences between the deleted and added code. The learned vector representations of diffs are used to adapt the NNGen model to retrieve a code diff that is most similar to the input so that the corresponding commit message can be used as the output.
II-A3 Hybrid Model
ATOM, proposed by Liu et al. , is a hybrid model that combines the techniques in generation-based models and retrieval-based models. ATOM is the first model that makes use of the Abstract Syntax Trees (ASTs) of the code diffs for commit message generation. In the generation module, AST paths extracted from ASTs are encoded with a BiLSTM model to represent the code diffs and then the attention mechanism is used in the decoder to generate a sequence as the commit message. In the retrieval module, the code diff that has the largest cosine similarity with the input code diff is retrieved from the training set. Finally, a hybrid ranking module is used to prioritize the commit messages obtained from the generation and retrieval modules.
There are several existing datasets for the commit message generation task. Table I reports their basic information.
CommitGendata: It is an early commit message generation dataset used in CommitGen and other studies [27, 25, 16]. It is pre-processed from the commit dataset provided by Jiang et al.  which is collected from top-1000 Java GitHub projects. CommitGendata extracts the first sentence from the original commit messages and removes the commits which have ids of issues. Merge and rollback commits are also removed because existing models are not suitable for most of these commits with too many lines. A Verb-Direct Object filter is also introduced to filter out non-compliant commit messages. After filtering, 32K commits remain in the dataset. The training, validation and test sets contain , , and commits, respectively.
NNGendata: Liu et al. find that CommitGendata contains about 16% noisy messages that can be divided into two categories: bot messages (generated by bots) and trivial messages (written by humans but containing little information about the code diff). Liu et al. remove this noise and propose the cleaned subset of CommitGendata, i.e., NNGendata. The training, validation and test sets contain / / commits, respectively.
CoDiSumdata: Based on the dataset in , Xu et al. remove the commits that contain no source code changes. They also remove punctuation and special symbols in commit messages, and filter out duplicates and commit messages that contain fewer than three words. Finally, they obtain 90,661 ⟨Diff, Message⟩ pairs, and they randomly choose / / for training, validation, and testing.
PtrGendata: Liu et al. collect the top 1,001-2,081 Java projects. They remove the rollback and merge commits, extract the first sentence from messages, and replace the diff signs (+ and -) with special tokens <add> and <delete>. The training, validation and test sets contain / / commits, respectively.
ATOMdata: Liu et al.  collect data from 56 Java projects with the largest number of stars. After filtering commits with noisy messages and commits that contain no source code changes, ATOMdata contains 197,968 commits. This dataset is designed to provide not only the raw commits but also the extracted functions which are affected in each commit. They randomly choose 81% for training, 10% for validation, and 9% for testing.
II-C Evaluation Metrics
Several metrics commonly used in NLP tasks such as machine translation, text summarization, and captioning can be adopted for evaluating commit message generation. These metrics include BLEU, Meteor , Rouge-L , Cider , etc. In this study, we focus on BLEU as it is the metric that the related works [18, 27, 16, 42, 13, 45] use to evaluate the performance of commit message generation models.
BLEU score is used to evaluate the correlation between the generated and the reference sentences in NLP tasks. For the commit message generation, the references are the commit messages written by developers and the generated sentences are the outputs from the models. Different BLEU variants are used in prior work and they could produce different scores for the same generated commit message. In the following, we are going to illustrate three different variants of BLEU used before. The names are not intended to be a standard in the literature, but just for easy reference in this study.
B-Moses: Prior studies use the same BLEU script from open-sourced code, which calculates B-Moses. B-Moses is designed for statistical machine translation and does not use a smoothing function. It can be calculated as follows:

$$\mathrm{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag{1}$$

where $w_n$ is the weight of the $n$-gram precision $p_n$, which can be obtained as in Equation 3. If not explicitly specified, $N = 4$ and uniform weights $w_n = 1/N$ are used.

$BP$ is the brevity penalty, which is computed as:

$$BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases} \tag{2}$$

where $c$ is the length of the candidate generation and $r$ is the length of the reference.

The $n$-gram precision $p_n$ can be obtained as

$$p_n = \frac{m_n}{l_n} \tag{3}$$

where $m_n$ is the number of matched $n$-grams between the reference and the generation, and $l_n$ is the total number of $n$-grams in the generation.

B-Norm: B-Norm is a BLEU variant adapted from B-Moses. It is used by Loyola et al. One difference between B-Norm and B-Moses is that B-Norm converts all characters in both the reference and the generation to lowercase before calculating scores. Therefore, B-Norm is case insensitive. The smoothing method proposed by Lin and Och is used in B-Norm to smooth the calculation of $n$-gram precision scores. It adds a constant number (one) to both the numerator and denominator of $p_n$ for $n > 1$.

B-CC: Hoang et al. use the BLEU measure provided by NLTK with the smoothing method proposed by Chen and Cherry in their evaluation. This smoothing method is inspired by the assumption that matched counts for similar values of $n$ should be similar. The average value of the $(n-1)$-, $n$-, and $(n+1)$-gram matched counts is used as the $n$-gram matched count, i.e., $m'_n = (m'_{n-1} + m_n + m_{n+1})/3$. $m'_0$ is defined as $m_1 + 1$. Therefore, $p_n$ for $1 \le n \le N$ is defined as: $p_n = \frac{m'_{n-1} + m_n + m_{n+1}}{3\, l_n}$.
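To make the difference between the unsmoothed, case-sensitive B-Moses and the smoothed, case-insensitive B-Norm concrete, the following Python sketch computes a sentence-level BLEU from the definitions in this section (brevity penalty, uniform weights, and n-gram precision as matched n-grams over generated n-grams). It is an illustration only, not the original evaluation scripts: the function name `sentence_bleu`, its flags, and the whitespace tokenization are assumptions, and the real scripts may differ in details such as tokenization and length handling. B-CC is omitted.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Return a multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, candidate, max_n=4, lowercase=False, add_one=False):
    """Sentence-level BLEU in percent, following the definitions in this section.

    lowercase=False, add_one=False approximates B-Moses;
    lowercase=True, add_one=True approximates B-Norm.
    """
    if lowercase:
        reference, candidate = reference.lower(), candidate.lower()
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngram_counts(cand, n), ngram_counts(ref, n)
        # Matched n-grams (clipped by the reference counts) and total n-grams.
        matched = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = sum(cand_ngrams.values())
        if add_one and n > 1:
            # Add-one smoothing on numerator and denominator for n > 1.
            matched, total = matched + 1, total + 1
        if matched == 0:
            return 0.0  # a zero precision zeroes the geometric mean
        log_sum += math.log(matched / total) / max_n  # uniform weights 1/N
    c, r = len(cand), len(ref)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return 100.0 * brevity_penalty * math.exp(log_sum)


reference, generated = "Fix merge conflicts", "fix merge conflicts"
print(sentence_bleu(reference, generated))                               # 0.0
print(sentence_bleu(reference, generated, lowercase=True, add_one=True))  # 100.0
```

In this example the two messages differ only in the case of the first token, yet the unsmoothed, case-sensitive score is zero because no 3-gram or 4-gram matches, while the smoothed, case-insensitive score is perfect.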
III Study Design
III-A Experimental Models
In this study, we select the models to be evaluated according to the following criteria: a) the source code is publicly available, and b) we can confirm the correctness of the source code by checking the implementation provided by the authors and reproducing the results presented in the original paper. CommitGen, CoDiSum, NMT, PtrGNCMsg, and NNGen satisfy these criteria and thus are selected for in-depth evaluation in our study. For models whose hyper-parameter settings are reported in their papers, we use the same hyper-parameters. Otherwise, we tune the hyper-parameters empirically to optimize each model.
For CC2Vec, after inspecting its public source code (https://github.com/CC2Vec/CC2Vec), we suspect that the implementation is inconsistent with the descriptions in the paper. We find that the scores reported in the CC2Vec paper are produced by code that retrieves the commit message of the most similar code diff in terms of BLEU score instead of the vector representation described in the CC2Vec paper. We have tried to modify the code according to the paper, but the result drops significantly in BLEU. We have contacted the authors regarding this issue. Therefore, CC2Vec is not evaluated in this study.
For ATOM, extracting AST paths from code diffs is a key pre-processing step. However, the tool for this step is not available. We have contacted the authors, but it cannot be provided for commercial reasons. We have tried to replace the required path extraction tool with JavaExtractor (https://github.com/tech-srl/code2seq/tree/master/JavaExtractor) for extracting AST paths from Java functions. However, this attempt cannot fully reproduce the results in the ATOM paper. Therefore, ATOM is not evaluated in most experiments, the exception being the one reported in Table IV.
III-B Experimental Datasets
III-B1 Existing Datasets
We select datasets from the existing ones based on their availability and representativeness. In this way, three existing datasets are chosen as highlighted in bold in Table I: CommitGendata, NNGendata, and CoDiSumdata. The reasons are explained as follows.
CommitGendata and NNGendata are Java datasets and commonly used in [19, 27, 25, 16]. NNGendata is more difficult than CommitGendata because commit messages with certain patterns are filtered in NNGendata. CoDiSumdata is another Java dataset with different features compared to CommitGendata and NNGendata. It is a deduplicated dataset, i.e., its training set and test set are not overlapping. CoDiSumdata can be used to evaluate the generalization ability of models.
PtrGendata is not used in our study since it is very similar to CommitGendata except that CommitGendata is collected from top-1,000 starred GitHub projects while PtrGendata is from top-1001 to top-2081 projects. The performance difference is reported to be very small on CommitGendata and PtrGendata . Besides, CommitGendata is used more often in the literature [18, 27, 28, 16] than PtrGendata .
MultiLangdata is not used for two reasons. First, in MultiLangdata, commits for each programming language are collected from only three repositories, resulting in small and sparse data. Second, the three collected repositories are not available from the provided link (https://osf.io/67kyc/?view_only=ad588fe5d1a14dd795553fb4951b5bf9), making it difficult to inspect the data source.
ATOMdata is not chosen because the given dataset is incomplete. For example, all commits to the repository retrofit are missing. Recovering the data is not feasible because both the repositories’ full names (including owner and repositories’ name) and version numbers are not provided.
III-B2 Our Dataset MCMD
Existing datasets have facilitated the development of commit message generation. However, the available datasets have their limitations: most are in a single programming language (i.e., Java), and the available information is very limited. There is only one dataset in multiple PLs, MultiLangdata, but it is not usable, as explained in Section III-B1. To provide a large-scale dataset in multiple PLs and with rich information, we created a new dataset MCMD, short for Multi-programming-language Commit Message Dataset. For each language, we collected commits before 2021 from the top 100 starred projects on GitHub. In this step, a total of 3.69M commits were collected. To improve the quality of commits in our dataset, we removed branch merging and rollback commits and filtered out noisy messages, following Liu et al. About 3.42M commits remain after filtering. To balance the size of data in each programming language so that we can fairly compare the performance of models across programming languages in subsequent experiments, we randomly sampled and retained 450,000 commits for each language.
Existing datasets [18, 27, 43] contain the information of only code diffs and the commit messages. However, the context of the code diffs can contribute to explaining why this code is added and what role it plays in the software. For example, extracting AST paths from the code diffs is beneficial to the commit message generation model, which requires the dataset to provide enough information to find the complete affected functions around the code changes. However, most of the existing datasets [18, 27, 43, 30] do not provide information for retrieving related functions. To trace back to the original repository, the RepoFullname (including owner and repository’s name) and SHA of a repository should be recorded. The RepoFullname can be used to find the corresponding repository and SHA is a unique ID to identify the version of the repository. Moreover, if we want to split the dataset by timestamp, timestamps of commits are necessary. Considering the above demands, our dataset MCMD contains the complete information of commits, including not only code diffs and commit messages, but also RepoFullname, SHA, and timestamp. We have made MCMD public (https://doi.org/10.5281/zenodo.5025758) to benefit future research on commit message generation.
III-C Research Questions
We have identified the following Research Questions (RQs) and will seek their answers in our evaluation:
RQ1: How do different BLEU variants affect the evaluation of commit message generation?
As described in Section II-C, most existing works use BLEU as an evaluation metric but the scores in their papers are different BLEU variants. Scores for different BLEU variants can vary a lot for the same sentence as shown in Table II.
|Reference||Generation||B-Moses||B-Norm||B-CC|
|add setup ( )||add setUp ( )||0.00||100.00||22.80|
|Fix merge conflicts||fix merge conflicts||0.00||100.00||25.00|
|BAEL - 3001||BAEL - 2412 : Add a new class||0.00||19.64||12.54|
|Fix typo||Fix typo in core - validation . adoc||0.00||19.64||12.54|
|Update visualvm to build 908||[ GR - 6405 ] Update visualvm .||0.00||19.64||12.54|
|[ FIXED JENKINS - 12514 ]||[ FIXED JENKINS - 12514 ] Fixed a bug in bundled plugins on Windows .||22.31||36.41||41.26|
|[ GR - 22084 ] Add TruffleCreateGraphTime timer .||Add a timer to the timer .||0.00||19.68||11.51|
|Remove dead code .||[ GR - 19154 ] Remove unused code .||0.00||19.07||13.25|
|Fix reported leaks||Fix a bug in SnappyFramedEncoderTest||0.00||24.03||8.98|
|[ fixed ] Bug in Mesh . setIndices ( ) . had to clear buffer first .||[ fixed ] Mesh . setVertices ( ) .||0.00||18.97||16.63|
For instance, as Table II and Figure 1 illustrate, the commit message “[ fixed ] Mesh . setVertices ( ) .” generated by PtrGNCMsg is relatively reasonable for the code changes. Compared with the reference, it shares tokens and its meaning is partially correct. However, it receives different scores under different BLEU variants, as shown in Table II. The B-Moses score of 0 suggests that the generation and the reference are completely different, which is not true. Another case is the commit message “fix merge conflicts” generated by NMT, which has the same meaning as the reference but receives low scores for B-Moses (0) and B-CC (25.00), while it receives the perfect score for B-Norm (100). These examples are just the “tip of the iceberg”.
RQ1 chooses the most suitable BLEU metric for the task of commit message generation by human evaluation and analyzes why that variant is better than the others. Following best practice for human evaluation, three human experts manually labeled the data. All of them have more than 5 years of programming experience and majored in Computer Science. First, we randomly select 100 commit messages from the generation results that show large disagreement (the variance among the three BLEU metrics is larger than 30). Then, we define five levels of criteria for manual labeling, as shown in Table III. The raters give a score from 0 to 4 to measure the semantic similarity between the reference and the generated commit message. After labeling, all scores are double-checked by each rater to confirm that the human scores are stable and reliable.
To validate the reliability of human scores, we calculate Krippendorff’s alpha and Kendall rank correlation coefficient (Kendall’s Tau) values (the details of the calculation can be found in our repository: https://github.com/DeepSoftwareAnalytics/CommitMsgEmpirical). Before the calculation, these human direct assessments are converted into relative rankings, as Direct Assessment Relative Rankings (DaRR) serve as the golden standard for segment-level evaluation. The Krippendorff’s alpha of the three raters is 0.86, and the Kendall’s Tau value between any two raters is greater than 0.8, which indicates that there is a high degree of agreement between the raters and that the human scores are reliable.
Human scores are regarded as the reference, and we compare the three BLEU variants by their correlations with this reference. The Spearman and Kendall correlation coefficients are selected because the scores are ordinal and satisfy their assumptions. Note that these correlation coefficients are calculated per commit and more details can be found in our repository.
|Score||Criterion|
|0||No similarity between the generation and reference.|
|1||Have few shared tokens, not semantically similar.|
|2||Have some shared tokens, probably semantically similar.|
|3||Very similar in semantics, but a few tokens are different.|
|4||Identical in semantics.|
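The per-commit correlation between human scores and a BLEU variant can be sketched in plain Python as follows. This is a minimal illustration, not the authors' analysis code: the function names are hypothetical, Spearman's coefficient is computed as the Pearson correlation of average ranks, Kendall's coefficient is the tie-corrected tau-b, and no p-values are computed.

```python
import math
from itertools import combinations


def average_ranks(values):
    """Rank values starting from 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either sequence."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: contributes to neither term
        if dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt(
        (untied + only_x_tied) * (untied + only_y_tied)
    )
```

For example, with hypothetical human scores `[0, 1, 2, 3, 4]` and metric scores that increase in the same order, both coefficients equal 1; a metric that assigns 0 to most messages produces many ties and a much weaker correlation.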
RQ2: How good are the existing models and datasets?
As described in Section II-B, when evaluating commit message generation models, not only are the evaluation metrics different, but the datasets used are also different. Most of the existing studies [30, 43, 26] only experiment on one dataset. Therefore, in RQ2, we conduct a unified evaluation of existing models on existing datasets, and study the impact of using different datasets on model evaluation.
RQ3: Why do we need a new dataset MCMD for evaluating commit message generation?
Since most of the prior works focus on Java datasets, there is a lack of research on other PLs’ repositories that have the same need to generate commit messages automatically. In RQ3, we explore the necessity of a new dataset, and rely on the new dataset MCMD we collected to explore the performance of migrating existing models to other PLs. Similar to previous work [18, 28, 26], for the dataset of each language in MCMD, we randomly select 80% data for training, 10% data for validation and 10% data for testing.
RQ4: What is the impact of different dataset splitting strategies?
Commit message generation models have different usage scenarios, so they need to be evaluated on datasets split by different strategies that simulate different scenarios. However, for the datasets used in most previous work, commits are split into training, validation, and testing sets randomly (split by commit), while other situations such as split by project (where training, validation, and testing sets contain commits from disjoint projects) are not considered. In RQ4, we study the impact of different dataset splitting strategies from two aspects.
Split by timestamp. As a previous study on code summarization suggests: “Care must be taken to avoid unrealistic scenarios, such as ensuring that the training set consists only of code older than the code in the test set”. The commit message generation task is similar to the code summarization task in this regard. In a just-in-time (JIT) scenario in particular, the model cannot see future data and can only use past data for training.
Therefore, we further conduct experiments on datasets split by timestamp instead of by commit to ensure future commits are not used as training data. In this splitting strategy, we divide each programming language’s dataset for training (80%), validation (10%), and test (10%) in chronological order.
Split by project. According to the study of LeClair and McMillan  on code summarization, splitting dataset by function (in analogy with “commit” in our study) might cause information leakage from test set projects into the training or validation sets and should be avoided. Following previous studies [22, 26], we also evaluate the performance of models based on MCMD split by project to ensure that projects in the training, validation, and test sets are disjoint. In this splitting strategy, we divide each PL’s dataset for training (80%), validation (10%), and test (10%) by the repository.
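The three splitting strategies (by commit, by timestamp, and by project) can be sketched as follows. This is an illustrative sketch rather than the scripts used in the study: each commit is assumed to be a dictionary with hypothetical `repo` and `timestamp` keys, and the 80%/10%/10% proportions are applied with integer arithmetic.

```python
import random


def split_80_10_10(items):
    """Cut an ordered list into 80% / 10% / 10% parts."""
    n_train = len(items) * 8 // 10
    n_valid = len(items) // 10
    return (
        items[:n_train],
        items[n_train:n_train + n_valid],
        items[n_train + n_valid:],
    )


def split_by_commit(commits, seed=0):
    """Random split: commits of one project may land in every part."""
    shuffled = list(commits)
    random.Random(seed).shuffle(shuffled)
    return split_80_10_10(shuffled)


def split_by_timestamp(commits):
    """Chronological split: training data is never newer than test data."""
    return split_80_10_10(sorted(commits, key=lambda c: c["timestamp"]))


def split_by_project(commits, seed=0):
    """Split repositories, not commits, so the three parts share no project."""
    repos = sorted({c["repo"] for c in commits})
    random.Random(seed).shuffle(repos)
    parts = []
    for repo_part in split_80_10_10(repos):
        selected = set(repo_part)
        parts.append([c for c in commits if c["repo"] in selected])
    return tuple(parts)
```

Because the project split assigns whole repositories, the resulting commit counts only approximate 80%/10%/10% when repositories differ in size.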
The experiments on splitting by project reflect the performance of models on new repositories. For a repository that already has commits, the models can be trained using only other repositories, using only the repository itself, or using both the repository itself and others. Therefore, to study how a trained model predicts commit messages for a target project under these options, we conduct a series of experiments called single project experiments. As illustrated in Figure 2, the test set for all experiments is the same, and it comes from the target project. The training sets for the three experiments in the single project experiments are: (1) data only from the target project for the Within-Project setting; (2) data only from other projects in the same programming language for the Cross-Project setting; and (3) the union of the training sets from the target project and other projects for the Full-Project setting. The experiments on these three settings can provide suggestions for the models’ usability on existing repositories.
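A minimal sketch of how the three training sets of the single project experiments could be assembled is given below. The function name, the `repo` and `timestamp` keys, and the `test_fraction` parameter are hypothetical, and taking the most recent commits of the target project as the shared test set is an assumption; the text above does not state how the test set is drawn.

```python
def single_project_settings(commits, target_repo, test_fraction=0.1):
    """Build the shared test set and the three training sets for one target project.

    All commits are assumed to come from the same programming language.
    """
    target = sorted(
        (c for c in commits if c["repo"] == target_repo),
        key=lambda c: c["timestamp"],
    )
    others = [c for c in commits if c["repo"] != target_repo]
    n_test = max(1, int(len(target) * test_fraction))
    # Assumption: the latest commits of the target project form the test set.
    target_train, test = target[:-n_test], target[-n_test:]
    training_sets = {
        "within_project": target_train,         # target project only
        "cross_project": others,                # other projects only
        "full_project": target_train + others,  # union of both
    }
    return training_sets, test
```

All three settings are evaluated on the same `test` list, so differences in scores can be attributed to the training data alone.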
IV Results and Findings
IV-A How Do Different BLEU Variants Affect the Evaluation of Commit Message Generation? (RQ1)
To answer this question, we evaluate commit message generation models using different BLEU variants, and compare BLEU scores with the results from a human evaluation.
IV-A1 Experiments under different BLEU variants
As shown in Table IV, rankings of models are inconsistent under different metrics (all the generated commit messages are available in our repository). For example, on CoDiSumdata, CoDiSum is the best model when measured by B-Norm or B-CC, while NNGen is the best model when measured by B-Moses. On CommitGendata, if the B-Moses metric is used, PtrGNCMsg is better than CommitGen, but the opposite conclusion will be drawn if B-Norm is used. Similar inconsistencies can also be observed in Table VII.
IV-A2 Human Evaluation
Table V shows the correlation scores between the three BLEU variants and human evaluation under two correlation metrics: Spearman and Kendall. We can see that B-Norm is the most correlated with human judgment, and this conclusion holds consistently for both correlation metrics at the 95% confidence level. After manually investigating a large number of commit messages in the test set and comparing the design of the three BLEU variants, we find two possible reasons: smoothing and case sensitivity.
|Correlation||B-Moses||B-Norm||B-CC|
|Spearman||0.1989 (< 0.05)||0.6188 (< 0.05)||0.5375 (< 0.05)|
|Kendall||0.1716 (< 0.05)||0.4639 (< 0.05)||0.3967 (< 0.05)|
(with p-values shown in parentheses).
B-Moses does not perform smoothing when calculating each n-gram precision, while B-Norm and B-CC do. More than 17.19% of the commit messages in MCMD have fewer than five tokens. Therefore, the 4-gram precision (shown in Equation 3) of these commit messages is close to zero, which drives the geometric mean of the n-gram precision scores to zero even if there are many 1-gram, 2-gram, or 3-gram matches. Without smoothing, a short commit message that is identical to the reference will get a near-zero B-Moses score, which is unreasonable. As many commit messages are short, we believe that B-Moses is not very suitable for evaluating commit message generation.
B-Norm is not case sensitive while B-Moses and B-CC are. As shown in Table II, the generated message “fix merge conflicts” has the same meaning as the reference “Fix merge conflicts”. The only difference is the case of “Fix” and “fix”. Besides, other words such as “add” and “Add” also have the same meaning. Scores of B-Moses and B-CC for commit messages that differ only in case tend to be low. The low scores are unreasonable, since these messages have exactly the same meaning as the references.
Summary: For the evaluation of commit message generation models, using different metrics may lead to different conclusions. B-Norm, which uses a smoothing method and is case insensitive, is more in line with human judgments.
IV-B How Good Are the Existing Models and Datasets? (RQ2)
Based on the experimental results shown in Table IV, we have the following findings:
The scores of the same model on different datasets can vary a lot. For example, the B-Norm score of the NNGen model on CommitGendata is 34.74. When evaluated on CoDiSumdata, the score drops to 9.07. Hence, we should consider more datasets in the evaluation: good performance on one dataset does not mean we can observe similar performance on another dataset.
The scores on CommitGendata are higher than the scores on NNGendata. NNGendata is a subset of CommitGendata in which noisy messages are filtered out, as described in Section II-B. To investigate the role of noisy data in model training and evaluation, we conduct ablation experiments on CommitGen; the results are shown in Table VI. We split the test set of CommitGendata into two parts: one contains only noisy messages, and the other is the rest (i.e., the test set of NNGendata). The scores on the noisy test set are much higher than those on the NNGendata test set, indicating that noisy messages are easy to generate. However, these messages (e.g., branch merging messages) are often bot-generated and do not need to be predicted by a model. Therefore, what really needs to be compared is the performance on the NNGendata test set. We further investigate whether excluding noisy data from model training improves performance. As shown in Table VI, training on NNGendata (noise-free data) yields a higher score than training on CommitGendata, which indicates that it is better not to include noisy data in model training.
When experimenting on CommitGendata and NNGendata, NNGen has the highest scores under all metrics, but it does not perform the best on CoDiSumdata. We speculate that this is because NNGen is retrieval-based and can easily achieve a high score on datasets with duplicated data. After checking, we find that in the test set of NNGendata, 16.02% of the commit messages and 5.16% of the ⟨Diff, Message⟩ pairs are duplicated from the training set. In the test set of CommitGendata, 29.13% of the commit messages and 4.67% of the ⟨Diff, Message⟩ pairs are duplicated. In contrast, CoDiSumdata is a deduplicated dataset, as described in Section III-B1. As a retrieval-based model, NNGen obtains a high score by leveraging the duplication in the dataset.
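A duplication check of this kind can be sketched as follows. The function name `duplication_rates`, the representation of each sample as a `(diff, message)` tuple, and the use of exact string equality to decide what counts as a duplicate are all assumptions for illustration; the matching rule used in the study may differ.

```python
def duplication_rates(train, test):
    """Return the fraction of test messages and test (diff, message) pairs seen in training.

    `train` and `test` are lists of (diff, message) string tuples; this layout
    is a hypothetical one chosen for the sketch.
    """
    if not test:
        return 0.0, 0.0
    train_messages = {message for _, message in train}
    train_pairs = set(train)
    # Count test samples whose message (or whole pair) also occurs in the training set.
    dup_messages = sum(message in train_messages for _, message in test)
    dup_pairs = sum(pair in train_pairs for pair in test)
    return dup_messages / len(test), dup_pairs / len(test)


# Hypothetical toy data, not taken from any of the studied datasets.
train = [("diff a", "update readme"), ("diff b", "fix typo")]
test = [
    ("diff a", "update readme"),  # duplicated pair
    ("diff c", "fix typo"),       # duplicated message only
    ("diff d", "add tests"),
    ("diff e", "bump version"),
]
message_rate, pair_rate = duplication_rates(train, test)
print(f"duplicated messages: {message_rate:.2%}, duplicated pairs: {pair_rate:.2%}")
```

A retrieval-based model can return a training message verbatim, so a high message-level rate directly raises the ceiling on its score.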
Summary: More datasets can be used for more comprehensive evaluation since good performance of a model on one dataset does not mean good performance on other datasets. Removing noisy data (commits with bot and trivial messages) during model training can improve performance. The duplication of commit data makes the performance of retrieval-based models such as NNGen better.
IV-C Why Do We Need a New Dataset MCMD for Evaluating Commit Message Generation? (RQ3)
Most existing datasets only retain ⟨Diff, Message⟩ information and cannot be used to evaluate models that require more information. For example, as shown in Table IV, only our MCMD dataset can be used to evaluate ATOM. This is because ATOM needs the complete code of the modified functions in order to extract the AST information. However, this information is unavailable in existing public datasets (although ATOMdata can be used for evaluating ATOM, it is not publicly available, as described in Section III-B1). Compared to existing datasets, MCMD provides complete information for each commit. For example, the provided ⟨RepoFullname, SHA⟩ information can be used to obtain the complete functions for AST extraction. Please note that the ATOM results presented here only illustrate the necessity of a richly informative dataset. As part of the ATOM code is not disclosed (as described in Section III-A), ATOM is out of the scope of this study. We believe future research that requires other commit information can benefit from the complete information provided by MCMD.
Extremely low scores are observed for CoDiSum on CommitGendata and NNGendata. The reason might be that the size of CommitGendata and NNGendata is not large enough to support the model’s training after filtering the data. CoDiSum is designed to extract additional structure information from code changes in “.java” files. Therefore, code changes that are not related to “.java” files are filtered. After filtering by CoDiSum, there are only hundreds of commits left, which are inadequate for its training. With the large-scale dataset MCMD, we have tens of thousands of commits after filtering to support the training of CoDiSum. Considering that some filtering steps reduce the size of the original dataset for the model’s training, using a larger dataset can reduce the negative impact of insufficient training during the evaluation.
The multiple-programming-language nature of the MCMD dataset makes it possible to study the migration of commit message generation models to other PLs. From Table VII, we find that the ranking of models on the Java dataset cannot be preserved when migrating to other languages and the best model for different languages may vary. Overall, the retrieval-based model NNGen performs the best, with an average B-Norm score of 17.82. But no model can consistently outperform others.
Summary: A large-scale, multi-language, and information-rich dataset is needed to comprehensively evaluate commit message generation models. Overall, NNGen performs the best, but no model can consistently outperform other models in all PLs. Therefore, when choosing commit message generation models for a new language, we suggest testing multiple models in that language to select the best one.
IV-D What Is the Impact of Different Dataset Splitting Strategies? (RQ4)
We analyze the impact of different dataset splitting strategies including splitting by timestamp and splitting by project.
IV-D1 Split by Timestamp
Table VIII shows the experimental results on datasets split by timestamp. Compared to Table VII, the performance of all models drops consistently on all datasets, and the BLEU scores of the models drop by 17.88% to 51.71% on average. This shows that it is more difficult to predict future commit messages with models trained on past data.
Although the retrieval-based model NNGen shows the best performance in the split-by-commit setting (Table VII), the results under splitting by timestamp are different: the performance degradation of NNGen is greater than that of other models, and PtrGNCMsg performs the best. Therefore, in the JIT application scenario, datasets split by commit are not suitable for evaluating models. Moreover, PtrGNCMsg, which is based on generation and the pointer-generator mechanism, has better generalization ability in the JIT scenario.
|CommitGen||8.08 (34.75%)||4.53 (75.04%)||7.08 (38.85%)||5.50 (50.49%)||8.91 (48.77%)||6.82 (51.71%)|
|CoDiSum||12.71 (9.21%)||4.85 (61.90%)||12.24 (1.77%)||12.46 (14.83%)||11.17 (0.62%)||11.17 (17.88%)|
|NMT||9.50 (29.05%)||5.15 (70.27%)||8.53 (26.21%)||7.31 (36.60%)||11.58 (32.20%)||8.41 (40.65%)|
|NNGen||10.73 (39.78%)||7.83 (65.81%)||9.30 (32.05%)||9.36 (43.78%)||12.07 (33.04%)||9.86 (44.67%)|
|PtrGNCMsg||13.30 (13.21%)||9.38 (52.45%)||10.94 (16.29%)||13.21 (17.38%)||18.07 (7.73%)||12.98 (22.45%)|
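The split-by-timestamp setting can be sketched as a chronological cut, so that every test commit is at least as recent as every training commit. This is only an illustrative sketch: the function name `split_by_timestamp`, the `timestamp` field, and the 80/10/10 ratios are assumptions, not the configuration used in the experiments.

```python
def split_by_timestamp(commits, train_ratio=0.8, valid_ratio=0.1):
    """Sort commits by time and cut chronologically so that test commits come last.

    Each commit is a dict with a sortable "timestamp" value (hypothetical field name).
    """
    ordered = sorted(commits, key=lambda commit: commit["timestamp"])
    train_end = int(len(ordered) * train_ratio)
    valid_end = train_end + int(len(ordered) * valid_ratio)
    return ordered[:train_end], ordered[train_end:valid_end], ordered[valid_end:]


# Hypothetical commits with integer timestamps in arbitrary order.
commits = [{"sha": str(i), "timestamp": t} for i, t in enumerate([5, 3, 9, 1, 7, 2, 8, 4, 6, 0])]
train, valid, test = split_by_timestamp(commits)
print(len(train), len(valid), len(test))
```

Unlike a random split by commit, this layout never lets a model see commits that were made after the ones it is evaluated on.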
IV-D2 Split by Project
Table IX shows the experimental results on datasets split by project. Compared to Table VII, the performance of all models drops consistently on all datasets, by 26.93% to 73.41%. This indicates that the split-by-project scenario is much more difficult than split-by-commit, and models need to have better generalization ability when applied to new projects. We also find that on datasets split by project, PtrGNCMsg shows the best performance.
|CommitGen||5.20 (58.03%)||4.82 (73.41%)||4.47 (61.37%)||7.61 (31.46%)||7.05 (59.50%)||5.83 (58.71%)|
|CoDiSum||10.23 (26.93%)||8.43 (33.78%)||2.87 (76.97%)||9.23 (36.91%)||8.02 (28.65%)||7.76 (40.39%)|
|NMT||7.94 (40.70%)||5.95 (65.65%)||5.73 (50.43%)||5.29 (54.12%)||7.39 (56.73%)||6.46 (54.43%)|
|NNGen||5.67 (68.16%)||9.89 (56.85%)||3.90 (71.51%)||4.66 (72.00%)||5.72 (68.28%)||5.97 (66.50%)|
|PtrGNCMsg||7.92 (48.34%)||8.08 (59.03%)||6.28 (51.95%)||8.79 (45.03%)||11.98 (38.82%)||8.61 (48.57%)|
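A split by project assigns whole repositories to either the training or the test side, so no test commit comes from a project seen during training. The sketch below is illustrative only: the function name `split_by_project`, the `repo` field, the 20% test fraction, and the fixed random seed are assumptions.

```python
import random


def split_by_project(commits, test_fraction=0.2, seed=0):
    """Hold out entire projects for testing.

    Each commit is a dict with a "repo" key naming its project (hypothetical field name).
    """
    projects = sorted({commit["repo"] for commit in commits})
    random.Random(seed).shuffle(projects)  # seeded for reproducibility
    n_test = max(1, int(len(projects) * test_fraction))
    test_projects = set(projects[:n_test])
    train = [c for c in commits if c["repo"] not in test_projects]
    test = [c for c in commits if c["repo"] in test_projects]
    return train, test


# Hypothetical data: five projects with three commits each.
commits = [{"repo": f"org/project{p}", "sha": f"{p}-{i}"} for p in range(5) for i in range(3)]
train, test = split_by_project(commits)
print(len(train), len(test))
```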
Furthermore, to emulate the scenario in which we need to generate commit messages for a new project, we conduct a series of single project experiments as described in Section III-C with the NMT model. From the results shown in Table X, we can find that the performance of NMT with Cross-Project training is poor, which is consistent with the previous split-by-project conclusion. Within-Project training is much better than Cross-Project, and the performance of NMT model can be further improved through Full-Project training.
Summary: Dataset splitting strategies have a significant impact on the evaluation of commit message generation models. Under the split-by-timestamp or split-by-project strategies, the evaluation scores of models are significantly lower than those under split-by-commit, and the PtrGNCMsg model is the best overall. Moreover, to achieve the best performance, it is recommended to train models with data from both the target project and other projects.
V-A Future Research Directions
As described in Section IV-A, evaluation metrics are important for the evaluation of commit message generation models, and using different metrics may lead to different conclusions. We have only studied the metrics that have been used in this task. A possible future direction is to study whether other metrics used in NLP tasks are better than B-Norm, or even to design a new metric for the commit message generation task.
Another future direction is to leverage more context information from repositories in the design of commit message generation models. In addition to the AST information that ATOM has studied, there is other information that could be explored, such as programming language features, contributors, the change history of the target code, and associated bug reports. In this regard, our dataset MCMD, which is information-rich and can be traced back to the original repositories to extract all of the information mentioned above, can be very helpful for future research. We have made our dataset public.
Finally, as discussed in Section IV-C and Section IV-D, there are multiple aspects that should attract more attention when evaluating commit message generation models in the future, including multiple PLs, dataset splitting strategies, etc. More specifically, in the settings of split-by-timestamp and split-by-project, the performance of existing models needs to be further improved.
V-B Possible Ways to Improve?
Our findings suggest that commit message generation still has a long way to go. We now show that the performance of commit message generation models can be improved through small changes. Note that we do not aim to design a full-scale model here. Our purpose is to show that there is still ample room for further improving existing models.
As the experimental results in Table VII suggest, a retrieval-based model is a simple yet effective approach to generating commit messages. However, NNGen only uses a “bag of words” (i.e., 1-grams) as the retrieval index, which is basic and can be optimized. Inspired by the design of BLEU, which measures the precision of n-gram matches, we change the representation of diff tokens from 1-grams to n-grams. NNGen-Gram4 in Table XI represents each diff with all of its 1-, 2-, 3-, and 4-grams rather than 1-grams only. Besides the representation, we also change the retrieval method. As described in Section IV-A, adding a smoothing function can affect the BLEU score, so the original BLEU metric used by NNGen may not be the most suitable option for retrieval. Therefore, we add a smoothing function to the retrieval method (shown as NNGen-Smooth in Table XI). We compare our two variants, NNGen-Gram4 and NNGen-Smooth, with NNGen on NNGendata, as shown in Table XI. The two variants achieve consistently higher scores than the original NNGen, and their combination, NNGen-Smooth-Gram4, further improves the performance.
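The effect of moving the retrieval index from 1-grams to n-grams can be illustrated with a small nearest-neighbor sketch. It is not the NNGen or NNGen-Gram4 implementation: the function names, the whitespace tokenization, and the use of cosine similarity over raw n-gram counts are assumptions, and the BLEU-based step of the retrieval method is omitted.

```python
import math
from collections import Counter


def ngram_bag(tokens, max_n):
    """Count all n-grams of length 1..max_n in a token list."""
    bag = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            bag[tuple(tokens[i:i + n])] += 1
    return bag


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_message(query_diff, corpus, max_n=4):
    """Return the message of the training diff most similar to the query diff.

    `corpus` is a list of (diff, message) tuples (hypothetical layout).
    """
    query = ngram_bag(query_diff.split(), max_n)
    best = max(corpus, key=lambda pair: cosine(query, ngram_bag(pair[0].split(), max_n)))
    return best[1]


# Hypothetical corpus: the first diff has the same tokens as the query but in a different order.
corpus = [
    ("b + a return", "shuffle operands"),
    ("return a + b ; }", "add sum helper"),
]
query = "return a + b"
print(retrieve_message(query, corpus, max_n=1))  # bag of words ignores token order
print(retrieve_message(query, corpus, max_n=4))  # n-grams reward matching token sequences
```

With 1-grams only, a diff containing the same tokens in any order looks identical to the query, whereas the 1- to 4-gram index prefers the diff that shares contiguous token sequences.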
V-C Threats to Validity
We have identified the main threats to validity as follows:
Data Quality. The quality of the data could be a threat to validity. To mitigate this threat, we have used multiple filtering rules following previous work  to obtain a set of relatively good-quality commit messages. But there might still exist low-quality data. It is possible to further improve our dataset MCMD to obtain higher quality.
Human Labeling Bias. Our manual annotation of the quality of commit messages may be biased, and inter-rater reliability could be a threat to validity: the scores assigned to the same sentence by different raters may differ. We attempted to mitigate this threat by (1) defining clear scoring rules, as shown in Table III, before the actual scoring, and (2) discussing disagreement cases so that the standard deviations among all raters are small.
Data Sampling. 100 sampled commit messages are used in the human evaluation. These samples are randomly selected from the messages with the variance among the three BLEU variants larger than 30, because the aim is to find which BLEU variant correlates with human scores the most, especially when there is large variance between the BLEU scores. This sampling could be a threat to validity. Larger scale human evaluation can be conducted to alleviate this threat.
Programming Languages. There are plenty of programming languages (PLs) with various characteristics. Although we have expanded the language diversity in MCMD to 5 popular PLs, it is still not exhaustive. Caution is needed when applying our findings to other PLs.
Replication. There could be errors in our implementations and experiments. To mitigate this threat, we reuse the existing implementations of the models from the original authors when possible. The new implementations in our experiments (such as data collection, scripts for comparison experiments, and the improved NNGen) are double-checked by multiple experts to ensure correctness.
In this paper, we conduct an in-depth analysis of the datasets and models in the commit message generation task. We have investigated several aspects, including evaluation metrics, datasets in multiple programming languages, and dataset splitting strategies. Our study points out that all of these aspects have a large impact on the evaluation. We believe that the results and findings of our study can be of great help to practitioners and researchers working in this area.
Our source code and data are available at https://github.com/DeepSoftwareAnalytics/CommitMsgEmpirical.
- https://github.com/SoftWiser-group/CoDiSum
- https://sjiang1.github.io/commitgen
- https://github.com/epochx/commitgen
- https://github.com/Tbabm/nngen
- https://zenodo.org/record/2542706#.XECK8C277BJ
- (2019) Code2seq: generating sequences from structured representations of code. In ICLR.
- (2015) Neural machine translation by jointly learning to align and translate. In ICLR.
- (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In IEEvaluation@ACL, pp. 65–72.
- (2016) The relationship between commit message detail and defect proneness in java projects on github. In MSR, pp. 496–499.
- (2010) Automatically documenting program changes. In ASE, pp. 33–42.
- (2014) A systematic comparison of smoothing techniques for sentence-level BLEU. In WMT@ACL, pp. 362–367.
- (2014) On automatically generating commit messages via summarization of source code changes. In SCAM, pp. 275–284.
- (2016) Deep API learning. In SIGSOFT FSE, pp. 631–642.
- (2007) Answering the call for a standard reliability measure for coding data. Communication Methods and Measures 1 (1), pp. 77–89.
- (2009) Automatic classification of large changes into maintenance categories. In ICPC, pp. 30–39.
- (2020) CC2Vec: distributed representations of code changes. In ICSE, pp. 518–529.
- (2020) Learning human-written commit messages to document code changes. J. Comput. Sci. Technol. 35 (6), pp. 1258–1277.
- (2017) Automatically generating commit messages from diffs using neural machine translation. In ASE.
- (2017) Towards automatic generation of short summaries of commits. In Proceedings of the 25th International Conference on Program Comprehension, ICPC 2017, Buenos Aires, Argentina, May 22-23, 2017.
- (1945) The treatment of ties in ranking problems. Biometrika 33 (3), pp. 239–251.
- (2007) Moses: open source toolkit for statistical machine translation. In ACL.
- (2019) Recommendations for datasets for source code summarization. In NAACL-HLT (1), pp. 3931–3937.
- (2004) Automatic evaluation of machine translation quality using longest common subsequence and skip-bigram statistics. In ACL, pp. 605–612.
- (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, pp. 74–81.
- (2019) Generating commit messages from diffs using pointer-generator network. In MSR, pp. 299–309.
- (2020-11) ATOM: commit message generation based on abstract syntax tree and hybrid ranking. TSE PP, pp. 1–1.
- (2018) Neural-machine-translation-based commit message generation: how far are we?. In ASE, pp. 373–384.
- (2019) Automatic generation of pull request descriptions. In ASE, pp. 176–188.
- (2018) Content aware source code change description generation. In INLG, pp. 119–128.
- (2017) A neural architecture for generating natural language descriptions from source code changes. In ACL (2), pp. 287–292.
- (2015) Effective approaches to attention-based neural machine translation. In EMNLP, pp. 1412–1421.
- (2019) Results of the WMT19 metrics shared task: segment-level and strong MT systems pose big challenges. In WMT (2), pp. 62–90.
- (2010) Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: Introduction to Information Retrieval - Cambridge University Press, Cambridge, England, 2008, 482 pp, ISBN: 978-0-521-86571-5. Inf. Retr. 13 (2), pp. 192–195.
- (2002) Bleu: a method for automatic evaluation of machine translation. In ACL, pp. 311–318.
- (2020) Recommending refactorings via commit message analysis. Inf. Softw. Technol. 126, pp. 106332.
- (2017) Get to the point: summarization with pointer-generator networks. In ACL (1), pp. 1073–1083.
- (2017) Nematus: a toolkit for neural machine translation. In EACL (Software Demonstrations), pp. 65–68.
- (2008) Asking and answering questions during a programming change task. IEEE Trans. Software Eng. 34 (4), pp. 434–451.
- (2019) Best practices for the human evaluation of automatically generated text. In Proceedings of the 12th International Conference on Natural Language Generation, INLG.
- (2015) ChangeScribe: A tool for automatically generating commit messages. In ICSE (2), pp. 709–712.
- (2015) CIDEr: consensus-based image description evaluation. In CVPR, pp. 4566–4575.
- (2020) Cocogum: contextual code summarization with multi-relational gnn on umls. Technical report, Microsoft, MSR-TR-2020-16. [Online]. Available: https://www.microsoft.com/en-us/research/publication/cocogum-contextual-code-summarization-with-multi-relational-gnn-on-umls.
- (2019) Commit message generation for source code changes. In IJCAI, pp. 3975–3981.
- (2011) Steven Bird, Evan Klein and Edward Loper. Natural Language Processing with Python. O’Reilly Media, Inc 2009. ISBN: 978-0-596-51649-9. Nat. Lang. Eng. 17 (3), pp. 419–424.
- (2020) Retrieval-based neural source code summarization. In ICSE.
- (1999) CRC standard probability and statistics tables and formulae. CRC Press.