Isabelle, Cherry and Foster introduce a challenge set approach to evaluating machine translation (MT) systems. This approach is not meant as a substitute for traditional evaluation methods such as average BLEU or human scores on a held out portion of the training corpus. It is rather meant to supplement these methods with tools that directly measure the extent to which MT systems manage to tackle some of the more difficult translation problems. Thus, unlike traditional metrics, challenge sets provide developers with a fine-grained view of the remaining obstacles.
Ideally, one would like challenge sets to be constructed automatically. This is all the more desirable in that such sets are intrinsically language-pair dependent. But until automatic construction methods become available, we can turn to human experts for developing limited sets of challenging sentences. This is what Isabelle et al. (2017) did for English→French machine translation. However, challenge sets are not only language-pair dependent: they are also direction dependent. For example, in English→French translation there is a need to choose between the French verbs savoir and connaître as the correct translation for the English verb to know. As it turns out, this depends on the syntactic nature of the complement of the verb. But in the opposite direction this problem does not arise: both savoir and connaître simply translate as to know. This kind of asymmetry led us to develop a new challenge set that specifically targets French→English MT.
In section 2, we describe the makeup of our new challenge set. In section 3 we report on the results of subjecting both Google Translate and DEEPL to the resulting challenge on two different dates: 5 October 2017 and 16 January 2018. As we will see, this constitutes an interesting way to track the systems’ evolution.
2 Makeup of the New Challenge Set
In developing our French→English challenge set we closely followed the practices described in Isabelle et al. (2017). In particular:
We used short sentences that are each meant to bring into focus a single linguistic issue.
All sentences are based on "common general vocabulary" because our goal is not to test lexical coverage but rather the system’s ability to bridge specific linguistic divergences.
We provide one reference translation for each sentence, notwithstanding the fact that other acceptable translations are usually possible.
Each sentence is accompanied by a yes-no question that focuses the attention of evaluators on the particular issue that the sentence is intended to test.
The evaluator’s responses to each yes/no question completely determine the outcome of the evaluation of the relevant sentence. Consequently, translation errors that lie outside the scope of the yes-no questions will be ignored as irrelevant.
We use the same major classes of structural divergences as in the paper mentioned above: morpho-syntactic, lexico-syntactic and purely syntactic divergences. Each class is further subdivided into a set of subclasses which has a large overlap with those presented in the same paper.
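The design principles above suggest a simple record for each challenge item. The following is a minimal sketch in Python, not the format in which the dataset is actually distributed: the names `ChallengeItem`, `Category` and `outcome` are hypothetical, the wording of the example question is invented for illustration, and the fourth category (purely lexical) is the one introduced in Section 2.4.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    """Top-level divergence classes (names are illustrative)."""
    MORPHO_SYNTACTIC = "morpho-syntactic"
    LEXICO_SYNTACTIC = "lexico-syntactic"
    SYNTACTIC = "purely syntactic"
    LEXICAL = "purely lexical"  # introduced in Section 2.4


@dataclass(frozen=True)
class ChallengeItem:
    source: str        # short French sentence targeting a single issue
    reference: str     # one reference translation (others may be acceptable)
    question: str      # yes/no question shown to the evaluator
    category: Category
    subcategory: str


def outcome(answer: str) -> bool:
    """Map an evaluator's yes/no answer to pass (True) or fail (False).

    The answer alone decides the outcome; errors outside the scope of
    the question are deliberately not represented here.
    """
    normalized = answer.strip().lower()
    if normalized in ("yes", "y"):
        return True
    if normalized in ("no", "n"):
        return False
    raise ValueError(f"expected a yes/no answer, got {answer!r}")


# Example item built from a sentence pair given in Section 2.2.
item = ChallengeItem(
    source="Mary manque beaucoup à John.",
    reference="John misses Mary a lot.",
    question="Are the arguments of the verb in the correct order (y/n)?",  # hypothetical wording
    category=Category.LEXICO_SYNTACTIC,
    subcategory="Argument switch",
)
```

Keeping the question on the item itself reflects the fact that the evaluation of a sentence is reduced to a single binary judgment.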
One important difference with the earlier paper is that, in addition to testing the system’s ability to deal with structural divergences, this new dataset includes some 199 examples that are intended to probe the systems’ ability to cope with the notoriously difficult translation of grammatical words. This is achieved using groups of examples in which the same French grammatical word needs to be rendered in different ways in the English translation.
In the remainder of this section we describe and illustrate the classes of linguistic difficulties that are built into our challenge set.
2.1 Morpho-Syntactic divergences
We use the term morpho-syntactic divergences to refer to cases where the two languages differ in which grammatical features are overtly marked in the morphology of corresponding words. Whenever a target language word requires a feature marking that is not explicitly marked in its source, the MT system needs to infer the relevant feature from the context. Our challenge set is probing that capability for the following cases:
Proclitic pronouns. French complement pronouns often need to be procliticized, that is, moved to the left of the verb and phonetically attached to it. When translated into English, these pronouns need to be repositioned in their normal complement position. Moreover, the French clitic frequently underdetermines the grammatical features of the complement, such as its gender, person, number and case. In the following example, the French clitic se is not marked for gender but it needs to be translated as the neuter reflexive pronoun itself because it co-refers with the neuter noun machine.
Cette machine se répare elle-même. This machine repairs itself.
The following two examples illustrate that the French clitic leur underdetermines the case/preposition marking of the corresponding English complement:
Je leur ai parlé. I spoke to them.
Je leur ai emprunté un livre. I borrowed a book from them.
Chez soi. The French form chez moi/toi/lui… can translate as at home but only when it is being used reflexively:
Mon fils est demeuré chez lui. My son stayed at home.
Ma fille est demeurée chez lui. My daughter stayed at his place | *at home. (We use the asterisk to mark a translation as incorrect.)
Verb tense. The overt French verb tense marking frequently underdetermines its English counterpart. In the following example, the French verb form is compatible with both the indicative and subjunctive mood while its English counterpart is explicitly marked as subjunctive. As a result the MT system must be able to determine whether or not the context is triggering the subjunctive mood:
Il est essentiel qu’il arrive à temps. It is essential that he arrive on time.
Verb tense concordance. In sentences expressing two events with a specific time dependency, French and English often feature different tense concordance constraints. In the following example both verbs are semantically future but English, contrary to French, requires the subordinate verb to be in the grammatical present tense:
Max partira dès que tu te lèveras. Max will leave as soon as you *will get up | get up.
2.2 Lexico-syntactic divergences
We now turn to lexico-syntactic divergences. We place under that heading all cases where the corresponding governing words of the two languages happen to organize their respective dependents in different ways. As a result, whenever such a governing word gets translated the system must be able to reorganize its dependents accordingly.
Argument switch. In some cases, the most straightforward translation of a given verb requires a change in the order of the verb’s arguments. This is the case when the French verb manquer à is translated as to miss:
Mary manque beaucoup à John. John misses Mary a lot.
Manner of movement verbs. In English, a completed movement is often expressed by using a verb that expresses a manner of moving (walk, climb, swim, etc.) and combining it with a prepositional phrase that expresses the endpoint of the movement. In French, this is normally expressed by a more generic movement verb together with an adverbial expressing the manner of that movement. Here are two examples:
John aimerait traverser l’océan à la nage. John would like to swim across the ocean.
John entra dans la salle en courant. John ran into the room.
Verb/adverb transposition. Some French verbs tend to be expressed as adverbs in English. This involves a reorganization of the sentence in which a verb which is subordinate in French becomes the main verb in English.
Max a fini par comprendre la difficulté. Max finally understood the difficulty.
Non-finite to finite clause. It is quite common for a non-finite clause of French to be translated as a finite clause in English. This raises the difficulty of introducing an adequate subject as well as an adequate verb tense in the English translation.
Max croit connaître la vérité. Max thinks he knows the truth.
Mary croyait connaître la vérité. Mary thought she knew the truth.
Aussitôt son travail terminé, Mary partit. As soon as her work was over, Mary left.
Cela provient de ce que Max a trop dormi. That arises from the fact that Max has slept too much.
Max sait réparer une cafetière. Max knows how to repair a coffee maker.
Middle voice. The so-called middle-voice of French involves a pseudo-pronominal verb form whose interpretation is related to that of a passive sentence, often with a generic interpretation. It is most often translated in English with a passive form.
Ce type de moteur se répare facilement. This type of engine can be repaired easily.
Control verbs. So-called subject-control verbs take an infinitival complement whose subject is understood to co-refer with the subject of the control verb. In contrast, the understood subject of the complement of an object-control verb is the object of the control verb. This difference can be brought to light when the infinitival complement is reflexive.
Max a convaincu sa fille de se sacrifier. Max convinced his daughter to sacrifice *himself | herself.
Max a promis à sa fille de ne pas se sacrifier. Max promised his daughter not to sacrifice himself | *herself.
Mass versus count nouns. Both languages make a grammatical distinction between count nouns (e.g. book, table, idea) versus mass nouns (e.g. wine, butter, fear). However, there are cases where a French noun and its English counterpart happen to fall on different sides of the divide. In such cases, a partitive noun may need to be introduced or deleted in the translation.
Max lui a donné un conseil. Max gave him *an advice | a piece of advice.
Factitives. Some French verbs require the use of the auxiliary verb "faire" in order to receive an agentive reading. In such cases, that auxiliary must disappear in the translation.
Max a *fondu | fait fondre la glace. Max melted the ice.
Max a *explosé | fait exploser un rocher. Max blew up a rock.
Two-position adjectives. The correct translation of several French adjectives depends on whether they are placed before or after the noun they modify.
Une idée simple n’est pas forcément mauvaise. A simple | *mere idea is not necessarily bad.
La simple idée de partir la terrorisait. The *simple | mere idea of leaving terrorized her.
Genitive. Contrary to French, the preferred way to express genitives in English is not to use a prepositional phrase but rather the case marking ’s.
Il a pris le livre de mon frère aîné. He took my elder brother’s book.
2.3 Purely syntactic divergences
The third type of divergence considered stems from the fact that some syntactic constructions only exist in one or the other language. Whenever a French sentence contains a construction that has no direct counterpart in English, the MT system needs to be able to recast the source language material into a different construction.
In fact, we have already seen one such case above, namely that of French proclitics. While we listed them under the heading of morpho-syntactic divergences, they do exemplify both types of divergence at once. Since there are no proclitics in English, a French object proclitic needs to be relocated in the standard post-verbal position in the English translation: Il la voit. He sees her. Here are some other subtypes of purely syntactic divergences.
Yes-no question syntax. French and English differ in the way yes-no questions are formed. Basically, French questions are obtained as follows: if the subject is a proclitic, move it after the verb; otherwise insert a particle (either est-ce que at the beginning of the sentence or -il after the verb). In contrast, English questions are obtained by fronting an auxiliary verb.
As-tu lu ce livre? Have you read this book?
Max partira-t-il à temps? Will Max leave on time?
Tag questions. The so-called tag-question construction of English does not exist in French, but the French n’est-ce pas? sentence-final question is normally translated as an appropriate tag question, which involves selecting the right auxiliary verb.
Il a vu la photo hier, n’est-ce pas? He saw the picture yesterday, didn’t he?
Nous devrions vérifier le niveau d’huile, n’est-ce pas? We should check the oil level, shouldn’t we?
WH-movement: relative clauses. When a relative clause is formed, its internal relativized element gets fronted, typically in the form of a "WH-word". For example, in The man whom you saw is my brother the word "whom" is understood to refer to the object of the verb "saw", which we will call its native site. French and English relative constructions are often parallel enough that an MT system can get away with a superficial process that falls short of explicitly relating the WH word to its native site. However, such a superficial approach breaks down in the case of stranded prepositions. In French, whenever a prepositional phrase is relativized, its preposition must be fronted alongside the WH-word: la fille avec qui tu as dansé. In contrast, English will often leave the preposition stranded: the girl you danced with. Note that in the French→English direction, the MT system does not have to move the preposition to its native site, since preposition fronting is also permitted in English. However, if the system does move the preposition to its correct native site, then this provides nice evidence that it is able to perform some deeper processing.
L’homme à qui Max a donné un livre est parti. The man whom Max gave a book to is gone.
La fille dont il a parlé est brillante. The girl that he talked about is brilliant.
WH-movement: interrogatives. Question formation and relative formation are highly parallel in both French and English. As a result, stranded prepositions raise the same translation issues with questions as with relatives.
À qui Max a-t-il donné un livre? Whom did Max give a book to?
Pour quelle compagnie travaille-t-il? What company does he work for?
Negation. In French, negation is typically expressed using a discontinuous form such as ne … pas/jamais/plus/nullement while in English this is typically done using a single word. MT systems often run into difficulty with this phenomenon. In our first example below the system needs to recognize that ne is being used in an "expletive" (i.e. non-negative) way and therefore should not be translated. In our second example, the French negation is to be rendered by the single negation word not, reinforced with the intensifying adverbial at all.
Je crains que Max ne vienne nous voir. I’m afraid Max is coming to see us.
Max ne comprend nullement cette idée. Max does not understand this idea at all.
Double negation. Double negations are sometimes used for stylistic effect and some MT systems appear to have difficulty coping with that.
Ce politicien n’est pas capable de ne pas mentir. This politician is not able not to lie.
C’est le docteur dont il est impossible que vous n’ayez pas entendu parler. It is the doctor of whom it is impossible that you have not heard.
Other doubled concepts. Some MT systems appear to experience some difficulty with sentences that contain two tokens for the same concept.
Il a commis faute sur faute. He committed mistake after mistake.
C’est beaucoup beaucoup mieux. This is much much better.
2.4 Purely lexical divergences.
The kinds of structural divergences described above closely mirror what was done in Isabelle et al. (2017) for English→French machine translation. However, in that work idiomatic phrases and support verbs were placed under the broader category of lexico-syntactic divergences. In the present work, we instead introduce an additional top-level category, namely that of purely lexical divergences. Alongside testing material for idioms and support verbs, this category will include a substantial amount of additional material meant to test the ability of MT systems to translate common grammatical words such as prepositions.
Common idioms – fixed. Some phrases need to be translated as a group because they happen to have a language-specific idiomatic meaning. The simpler case is that of fixed idioms, those that always appear under one and the same form.
Ils sont déterminés à continuer envers et contre tous. They are determined to continue in spite of all opposition.
Ils partiront entre chien et loup. They will leave at dusk.
Common idioms – variable. Many idioms exhibit some morphological and/or syntactic flexibility. As a result, there is a need for MT systems to generalize over a range of different surface forms.
Cessez de tourner autour du pot. Stop beating around the bush.
Il tournait constamment autour du pot. He was constantly beating around the bush.
Vous mettez la charrue devant les bœufs. You put the cart before the horse.
La charrue a été mise avant les bœufs. The cart was put before the horse.
Support verbs. These verbs (also known as "light verbs") carry little meaning in themselves. Rather they combine with their complement to express what can often be expressed as a single verb. For example, to walk and to take a walk are roughly equivalent. But even though the support verb - here, take - carries little meaning in itself, its choice is not free. In this example, *make a walk is not an acceptable substitute. Support verbs must be translated as a whole with their complements.
Max a fait campagne contre le maire hier. Max campaigned against the mayor yesterday.
Ceci apporte la preuve qu’il était au courant. This is proof that he was aware.
Unacceptable, literal translations for these two examples would be:
Max *made a campaign against the mayor yesterday.
This *brings proof that he was aware.
Grammatical words. Grammatical words such as prepositions are notoriously difficult to translate. Our challenge set includes testing material for some 28 different grammatical words or phrases that are relatively difficult to translate correctly because they each have multiple uses. For each one we provide sets of sentences where the word needs to receive different translations as a result of these different uses. Consider for example some different uses/translations of the French preposition en:
Il lui a offert un foulard en soie. He offered her a silk scarf.
Il est docteur en philosophie. He’s a doctor of philosophy.
En semaine, je travaille. On weekdays, I work.
J’ai payé mes études en vendant du café. I paid my tuition by selling coffee.
En travaillant, j’aime écouter de la musique. While working, I like to listen to music.
Another good example is the multiple uses and translations of the preposition par:
Il a été averti par Paul. He was warned by Paul.
Un lundi par mois, il se rend au marché. One Monday per month, he goes to the market.
Il a fait cela par plaisir. He did it for pleasure.
Il a fait cela par habitude. He did it out of habit.
Le bateau a coulé par cent mètres de fond. The boat sank to a depth of a hundred meters.
2.5 Our New Challenge Set.
We manually developed a set of 506 different challenging examples populating the main categories discussed above with the distribution shown in Table 1.
|Category||No. of examples||Percent|
In addition to making use of our own personal experience in machine translation, we were able to draw many examples from Morris Salkoff’s highly detailed and insightful French-English contrastive grammar (Salkoff, 1999).
3 Testing Google Translate and DEEPL
Armed with this new French→English challenge set, we decided to evaluate the performance of the Google Translate and DEEPL neural machine translation systems. We submitted all 506 sentences to each system on two different dates: 5 October 2017 and 16 January 2018. We collected the results and proceeded to evaluate them.
The evaluation protocol was as follows. The human evaluator looks at each test case in turn, being provided with: a) the source-language sentence; b) one reference translation; c) the machine-translated sentence to be evaluated; and d) a single yes/no question about the translation and its relationship to the source-language sentence. The evaluator simply provides an answer to the yes/no question associated with each translated example. Figure 1 provides two examples of material being presented to the evaluator together with his/her response (either "Yes" or "No").
|Src||La femme s’est regardée dans le miroir.|
|Ref||The woman looked at herself in the mirror.|
|Sys||The woman looked at herself in the mirror.|
|Is the French highlighted pronoun correctly translated (y/n)? Yes|
|Src||Je le suppose.|
|Ref||I suppose so.|
|Is the French highlighted pronoun correctly translated (y/n)? No|
The first author made an initial pass at responding to each one of the 2024 relevant questions (506 for each one of the four machine translations). The second author checked all these judgments and noted all disagreements. Each difference was then discussed by the two authors and a joint decision was made.
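The two-pass procedure just described can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `(system run, sentence id)` keys and the toy judgments are all hypothetical, and the actual discussion between the two authors is represented simply as a table of joint decisions.

```python
def find_disagreements(first_pass, second_pass):
    """Return, in sorted order, the questions on which the two passes differ."""
    if first_pass.keys() != second_pass.keys():
        raise ValueError("both passes must cover the same questions")
    return sorted(k for k in first_pass if first_pass[k] != second_pass[k])


def adjudicate(first_pass, second_pass, joint_decisions):
    """Merge the two passes; each disagreement takes the jointly agreed answer."""
    disputed = find_disagreements(first_pass, second_pass)
    missing = [k for k in disputed if k not in joint_decisions]
    if missing:
        raise ValueError(f"no joint decision recorded for: {missing}")
    final = dict(first_pass)
    for key in disputed:
        final[key] = joint_decisions[key]
    return final


# 506 sentences, each translated by 2 systems on 2 dates.
N_SENTENCES = 506
N_TRANSLATIONS = 4
N_QUESTIONS = N_SENTENCES * N_TRANSLATIONS  # 2024 questions in total

# Toy judgments (True = "Yes"); keys are (system run, sentence id) and are hypothetical.
first = {("google-2017-10", 1): True, ("google-2017-10", 2): False, ("deepl-2017-10", 1): True}
second = {("google-2017-10", 1): True, ("google-2017-10", 2): True, ("deepl-2017-10", 1): True}
joint = {("google-2017-10", 2): False}  # outcome of the discussion for the disputed item

final = adjudicate(first, second, joint)
```

Raising an error when a disputed question has no joint decision makes it impossible to compute results from a partially adjudicated set of judgments.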
Thus, unlike in Isabelle et al. (2017), where three independent evaluators were used, the results presented below only rely on the authors’ judgments. However, we are making these judgments available alongside the new challenge set so that interested parties can compare them with their own judgments.
The main results are presented in Table 2. The outcome of October 2017 was similar to that presented in Isabelle et al. (2017) for the English→French direction: in both cases DEEPL turned out to deal with the challenge set quite a bit better than Google. On the present challenge set, DEEPL’s overall rate of success was almost 13% higher than that of Google. This advantage holds in all categories of examples except for morpho-syntax where both systems are tied.
We can also see that the overall performance of both systems turned out to be somewhat better in January 2018. The Google system achieved an overall improvement of 2.6% for a relative error reduction of 3.6%, while the DEEPL system got a 1% improvement for a relative error reduction of 1.3%. The rate of progress varied across categories. In the case of morpho-syntax, Google managed to gain 7%, significantly bettering DEEPL which turned out to lose 4.6%. Conversely, in the case of pure syntax Google lost 3.5% while DEEPL’s performance remained unchanged. For the other two categories (purely lexical and lexico-syntactic) both systems progressed but Google did so more markedly.
Table 3 provides a breakdown of the same results in terms of our finer-grained subcategories.
|Verb tense concord||7||71.4%||85.7%||42.9%||42.9%|
|Non-finite → finite clause||20||80%||80%||75%||80%|
|"De/à ce que" → "from the fact that"||2||0%||50%||50%||50%|
|V1 V2inf → V1 how to V2inf||2||100%||100%||100%||100%|
|Count Vs mass nouns||6||83.3%||83.3%||66.7%||66.7%|
|"Voilà [TIME] que"||3||66.7%||0%||100%||100%|
|Syn||Yes-no question syntax||4||50%||50%||100%||100%|
|WH movement, relatives||19||78.9%||68.4%||78.9%||78.9%|
|WH movement, questions||10||80%||70%||90%||90%|
|Other doubled concepts||5||80%||40%||20%||20%|
|Lex||Common idioms – fixed||25||32%||40%||52%||48%|
|Common idioms – variable||24||33.3%||58.3%||66.7%||66.7%|
|Translation of grammatical words||199||64.8%||66.3%||80.9%||81.4%|
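Assuming that each percentage in Table 3 is the share of "Yes" judgments among the examples of a subcategory, the aggregation can be sketched as below. The function names and the individual pass/fail values are hypothetical; the only grounded figure is arithmetic, namely that a rate of 71.4% over 7 examples corresponds to 5 successes.

```python
from collections import defaultdict


def success_rates(judgments):
    """Aggregate (subcategory, passed) pairs into {subcategory: (n, percent)}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for subcategory, passed in judgments:
        totals[subcategory] += 1
        passes[subcategory] += int(passed)
    return {s: (totals[s], round(100 * passes[s] / totals[s], 1)) for s in totals}


def format_percent(value):
    """Render 80.0 as '80%' and 71.4 as '71.4%'."""
    return f"{value:g}%"


# Toy judgments: 5 passes out of 7 for one subcategory, 2 out of 4 for another.
toy = (
    [("Verb tense concord", True)] * 5
    + [("Verb tense concord", False)] * 2
    + [("Yes-no question syntax", True)] * 2
    + [("Yes-no question syntax", False)] * 2
)

rates = success_rates(toy)
for name, (n, pct) in rates.items():
    print(f"|{name}||{n}||{format_percent(pct)}|")
```

With so few examples in some subcategories, a single changed judgment moves the rate by many points (one item out of 7 is about 14 points), which is worth keeping in mind when comparing columns.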
4 Conclusion
We have presented a new challenge set for evaluating machine translation systems in the French→English direction, based on the principles presented in Isabelle et al. (2017). This new set includes 506 different sentences spread across four categories: morpho-syntactic, lexico-syntactic, purely syntactic and purely lexical. The first three categories mirror those of Isabelle et al. (2017) but the last one is novel. Each sentence is meant to test the ability of MT systems to bridge one specific divergence issue between the two languages.
Our 506 challenge sentences were submitted to the Google and DEEPL MT systems on two different dates: 5 October 2017 and 16 January 2018. The results were evaluated according to the method presented in Isabelle et al. (2017), which amounts to responding to the yes-no questions attached to each challenge sentence.
In this case the evaluators were the co-authors of this paper, which is not optimal. However, we are making all the data available so that readers can compare our judgments with theirs.
References
- Pierre Isabelle, Colin Cherry, and George Foster. A challenge set approach for evaluating machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2486–2496, Copenhagen, September 2017. Association for Computational Linguistics.
- Morris Salkoff. A French-English grammar: contrastive grammar on translation principles, volume 1. John Benjamins, Amsterdam, 1999.