gec-ranking
Data and code used in the 2015 ACL paper, "Ground Truth for Grammatical Error Correction Metrics"
The GLEU metric was proposed for evaluating grammatical error corrections using n-gram overlap with a set of reference sentences, as opposed to precision/recall of specific annotated errors (Napoles et al., 2015). This paper describes improvements made to the GLEU metric that address problems that arise when using an increasing number of reference sets. Unlike the originally presented metric, the modified metric does not require tuning. We recommend that this version be used instead of the original version.
GLEU (Generalized Language Understanding Evaluation; not to be confused with the metric of the same name presented in Mutton et al., 2007) was designed and developed using two sets of annotations as references, with a tunable weight to penalize n-grams that should have been changed in the system output but were left unchanged [Napoles et al., 2015]. After publication, we observed that the weight needs to be re-tuned as the number of references changes. With more references, more variations of each sentence are seen, which results in a larger set of reference n-grams. Larger sets of reference n-grams tend to have higher overlap with the source n-grams, which decreases the number of n-grams seen in the source but not the reference. As a result, the penalty term decreases, and a larger weight is needed for the penalty to have the same magnitude as when there are fewer references.
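This shrinking of the penalty set can be seen with a toy example (the sentences below are invented for illustration): once a second reference correction is added, fewer source n-grams fall outside the union of reference n-grams, so there is less for the penalty term to count.

```python
# Toy illustration (invented sentences): the set of source n-grams subject
# to the penalty shrinks as more reference corrections are unioned.
def bigrams(sentence):
    tokens = sentence.split()
    return {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}

src = bigrams("he go to school yesterday")
ref1 = bigrams("he went to school yesterday")
ref2 = bigrams("he did go to school yesterday")  # a second, different correction

print(len(src - ref1))           # source bigrams outside one reference -> 2
print(len(src - (ref1 | ref2)))  # outside the union of two references  -> 1
```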
As re-tuning the weight for different sized reference sets is undesirable, we simplified GLEU so that there is no tuning needed and the metric is portable across comparisons against any number of references.
Our GLEU implementation differs from that of Napoles et al. (2015), which we refer to here as GLEU₀. As originally presented, in computing n-gram precision, GLEU₀ double-counts n-grams in the reference that do not appear in the source, and it subtracts a weighted count of n-grams that appear in the source (S) but not the reference (R). We use a modified version, GLEU, that simplifies this. Precision is simply the number of reference n-gram matches, minus the counts of n-grams found more often in the source than in the reference (Equation 1). GLEU follows the same intuition as GLEU₀: overlap between the candidate (C) and R should be rewarded, and n-grams that should have been changed in C but were not should be penalized.
$$
p_n = \frac{\sum_{g \in C \cap R} \mathrm{count}_{C,R}(g) \;-\; \sum_{g \in C \cap S} \max\bigl[0,\ \mathrm{count}_{C,S}(g) - \mathrm{count}_{C,R}(g)\bigr]}{\sum_{g \in C} \mathrm{count}_{C}(g)}
$$

where C, R, and S are the multisets of n-grams in the candidate, reference, and source, and count_{A,B}(g) = min(count_A(g), count_B(g)) is the number of times g occurs in both A and B.

Equation 1: Modified precision calculation of GLEU.
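A minimal sketch of this modified precision (function and variable names are ours, not from the released implementation; the clip to zero for a negative numerator is an assumption):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, source, reference, n):
    """Reference n-gram matches, minus n-grams that overlap the source
    more than the reference, over all candidate n-grams (Equation 1)."""
    c = ngram_counts(candidate, n)
    s = ngram_counts(source, n)
    r = ngram_counts(reference, n)
    matches = sum((c & r).values())  # count_{C,R}: clipped reference matches
    penalty = sum(max(0, (c & s)[g] - (c & r)[g]) for g in (c & s))
    total = sum(c.values())
    return max(0, matches - penalty) / total if total else 0.0
```

For example, leaving an error uncorrected is penalized: with source "he go to school" and reference "he goes to school", unigram precision is 1.0 for the corrected candidate but only 0.5 for an unchanged copy of the source.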
The precision term in Equation 1 is then used in the standard BLEU equation [Papineni et al., 2002] to get the GLEU score. Because the number of possible reference n-grams increases as more reference sets are used, we calculate an intermediate GLEU by randomly sampling one of the references for each sentence, and report the mean score over 500 iterations. It takes less than 30 seconds to evaluate 1,000 sentences using 500 iterations.
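The sampling procedure can be sketched as follows, with `corpus_score` standing in for the full BLEU-style GLEU computation (the function names and the fixed seed are ours):

```python
import random
from statistics import mean

def iterative_score(corpus_score, candidates, sources, reference_sets,
                    iters=500, seed=0):
    """Mean corpus score over `iters` iterations, each time scoring
    against one reference sampled per sentence from the available
    reference sets (each set is a list aligned with the candidates)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    per_iter = []
    for _ in range(iters):
        sampled = [rng.choice([refs[i] for refs in reference_sets])
                   for i in range(len(candidates))]
        per_iter.append(corpus_score(candidates, sources, sampled))
    return mean(per_iter)
```

Here `corpus_score(candidates, sources, references)` would combine the n-gram precisions of Equation 1 with the brevity penalty, as in BLEU.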
Using this revised version of GLEU, we calculated the scores for each system submitted to the CoNLL-2014 Shared Task on Grammatical Error Correction (http://www.comp.nus.edu.sg/~nlp/conll14st.html) to update the results reported in Tables 4 and 5 of Napoles et al. (2015). The system ranking by GLEU is compared to the originally reported GLEU (GLEU₀), M², and the human ranking (Table 1).
Table 1: System rankings, best first.

Human | M² | GLEU | GLEU₀
---|---|---|---
CAMB | CUUI | CUUI | CAMB |
AMU | CAMB | AMU | CUUI |
RAC | AMU | UFC | AMU |
CUUI | POST | CAMB | UMC |
source | UMC | source | PKU |
POST | NTHU | IITB | POST |
UFC | PKU | SJTU | SJTU |
SJTU | RAC | PKU | NTHU |
IITB | SJTU | UMC | UFC |
PKU | UFC | NTHU | IITB |
UMC | IPN | POST | source |
NTHU | IITB | RAC | RAC |
IPN | source | IPN | IPN |
On average, M² ranks systems within 3.4 places of the human ranking. Both GLEU scores have closer rankings on average: GLEU within 2.6 and GLEU₀ within 2.9 places of the human ranking.
The correlation between the system scores and the human ranking is shown in Table 2. GLEU₀ has slightly stronger Pearson correlation with the human ranking than GLEU (0.549 vs. 0.542), and both are significantly greater than that of M²; however, the rank correlation of GLEU₀ is weaker than both GLEU and M².
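The rank statistics in Tables 1 and 2 can be recomputed from the rankings alone (lists transcribed from Table 1; helper names are ours, and since the rankings are tie-free, Spearman's ρ reduces to the 1 − 6Σd²/(n(n²−1)) form):

```python
# Rankings transcribed from Table 1, best system first.
# gleu = third ranking column, gleu0 = fourth ranking column.
human = ["CAMB", "AMU", "RAC", "CUUI", "source", "POST", "UFC",
         "SJTU", "IITB", "PKU", "UMC", "NTHU", "IPN"]
m2    = ["CUUI", "CAMB", "AMU", "POST", "UMC", "NTHU", "PKU",
         "RAC", "SJTU", "UFC", "IPN", "IITB", "source"]
gleu  = ["CUUI", "AMU", "UFC", "CAMB", "source", "IITB", "SJTU",
         "PKU", "UMC", "NTHU", "POST", "RAC", "IPN"]
gleu0 = ["CAMB", "CUUI", "AMU", "UMC", "PKU", "POST", "SJTU",
         "NTHU", "UFC", "IITB", "source", "RAC", "IPN"]

def displacements(ranking, reference):
    """Absolute rank difference of each system from its reference rank."""
    pos = {system: i for i, system in enumerate(reference)}
    return [abs(i - pos[system]) for i, system in enumerate(ranking)]

def avg_displacement(ranking, reference):
    d = displacements(ranking, reference)
    return sum(d) / len(d)

def spearman(ranking, reference):
    """Spearman's rho for tie-free rankings."""
    d = displacements(ranking, reference)
    n = len(d)
    return 1 - 6 * sum(x * x for x in d) / (n * (n * n - 1))
```

This reproduces the reported figures: average displacements of 3.4 (M²), 2.6 (GLEU), and 2.9 (GLEU₀), and rank correlations of 0.429, 0.555, and 0.401 respectively.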
We recommend that the originally presented GLEU (GLEU₀) no longer be used, due to the issues identified in Section 1.
The updated version of GLEU, which does not require tuning, should be used instead.
The code is available at
https://github.com/cnap/gec-ranking.
Table 2: Correlation with the human ranking (Pearson's r and Spearman's ρ).

Metric | r | ρ
---|---|---
GLEU₀ | 0.549 | 0.401
GLEU | 0.542 | 0.555
M² | 0.358 | 0.429
I-measure | -0.051 | -0.005 |
BLEU | -0.125 | -0.225 |
Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 588–593, Beijing, China, July. Association for Computational Linguistics.