Scalable Cross Lingual Pivots to Model Pronoun Gender for Translation

June 16, 2020 · Kellie Webster et al. · Google

Machine translation systems with inadequate document understanding can make errors when translating dropped or neutral pronouns into languages with gendered pronouns (e.g., English). Predicting the underlying gender of these pronouns is difficult since it is not marked textually and must instead be inferred from coreferent mentions in the context. We propose a novel cross-lingual pivoting technique for automatically producing high-quality gender labels, and show that this data can be used to fine-tune a BERT classifier that achieves 92% F1 for feminine pronouns, compared with 30-51% for neural machine translation models and 54-71% for a non-fine-tuned BERT model. We augment a neural machine translation model with labels from our classifier to improve pronoun translation, while still having parallelizable translation models that translate a sentence at a time.


1 Introduction

While neural modeling has solved many challenges in machine translation, a number remain, notably document-level consistency. Since standard architectures are sentence-based, they can misrepresent pronoun gender since cues for this are often extra-sentential Läubli et al. (2018). For example, in Figure 1 (source: https://es.wikipedia.org/wiki/Britney_Spears; translation retrieved from translate.google.com, 27 Feb 2019), there is a dropped subject and a neutral possessive pronoun in the Spanish, which are both translated with masculine pronouns in English despite context indicating they refer to Britney Jean Spears. Recommendations for the use of discourse in translation are developed in Sim Smith (2017).

Such errors are worrying for their prevalence. An oracle study found that augmenting an NMT system with perfect pronoun information could give up to 6 BLEU points improvement Stojanovski and Fraser (2018). Further, the DiscoMT shared task on pronoun prediction tested multiple language pairs and found that the most difficult language pair by far was Spanish→English and that “[t]his result reflects the difficulty of translating null subjects” Loáiciga et al. (2017).

Spanish: Britney Jean Spears Adquirió fama durante su niñez al participar en el programa de televisión The Mickey Mouse Club (1992).
Spanish→English: He gained fame during his childhood by participating in the television program The Mickey Mouse Club (1992).

Figure 1: Typical example of pronoun gender errors when translating from pronouns without textual gender marking (e.g., dropped or possessive pronouns in Spanish) to a language with gendered pronouns.

We address document-level modeling for dropped and neutral personal pronouns in Spanish. We significantly improve the ability of a strong sentence-level translator to produce correctly gendered pronouns in three stages. First, we automatically collect high-quality pronoun gender-tagged examples at scale using a novel cross-lingual pivoting technique (Section 3). This technique easily scales to produce large amounts of data without explicit human labeling, while remaining largely language agnostic. This means our technique may also be easily scaled to produce data for other languages than Spanish; we present results on Spanish as a proof of concept.

Next, we show how to fine-tune BERT Devlin et al. (2019), a state-of-the-art pre-trained language model, to classify pronoun gender using our new dataset (Section 4). Finally, we add context tags from our BERT model into standard MT sentence input, yielding an 8.8% F1 improvement in feminine pronoun generation (Section 5).

2 Background

In this section, we overview particular nuances associated with pronoun understanding, resulting challenges for translation, and existing relevant datasets.

Properties           | En  | Es
Nominative, Masc.    | he  | él or ∅
Nominative, Fem.     | she | ella or ∅
Possessive, Masc.    | his | su
Possessive, Fem.     | her | su
Table 1: Masculine and feminine pronouns cannot be distinguished by their surface forms in Spanish in both the dropped and possessive cases.

Dropped (Implicit) Pronouns

Spanish personal pronouns have less textual marking of gender than their English counterparts. Where they may be inferred from context, Spanish drops subject pronouns él/ella (he/she). Several other languages such as Chinese, Japanese, Greek, and Italian also have dropped pronouns.

Possessive Pronouns

Additionally, Spanish possessive pronouns do not mark the gender of the possessor, with su(s) used where English uses his and her. In other Romance languages such as French, the possessive pronoun agrees with the gender of the possessed object, rather than the possessor as in English. These usages are summarized for Spanish in Table 1 and illustrated in Figure 1.

MT of Ambiguous Pronouns

Document-level consistency is an outstanding challenge for machine translation Läubli et al. (2018). One reason is that standard techniques translate the sentences independently of one another, meaning that context outside a given sentence cannot be used to make translation decisions. In a language like Spanish, this means that critical cues for pronoun gender are simply not present at decoding time.

There have been a number of directions exploring how to introduce general contextual awareness into translation. Tiedemann and Scherrer (2017) investigated a simple technique of providing multiple sentences, concatenated together, to a translator and learning a model to translate the final sentence in the stream using the preceding one(s) for contextualization. Unfortunately, no substantial performance gain was seen. Next, the Transformer model Vaswani et al. (2017) was extended by introducing a second, context decoder Wang et al. (2017); Zhang et al. (2018). Complementary to this, Kuang et al. (2018) introduces two caches, one lexical and the other topical, to store information from context pertinent to translating future sentences.

Specific to the translation of implicit or dropped pronouns, Wang et al. (2018) shows that modeling pronoun prediction jointly with translation improves quality, while Miculicich Werlen and Popescu-Belis (2017a) presents a proof-of-concept where coreference links are available at decoding time. Most similar to our current work, Habash et al. (2019); Moryossef et al. (2019); Vanmassenhove et al. (2018) improve gender fidelity in translations of first-person singular pronouns in languages with grammatical gender inflection in this position. They show promising results from either injecting explicit information about speaker gender, or predicting it with a classifier. We extend this line of work without requiring human labels to train our classifier.

(A) Page Alignment
Title: "Mitsuko Shiga" https://en.wikipedia.org/wiki/Mitsuko_Shiga
https://es.wikipedia.org/wiki/Mitsuko_Shiga
(B) Sentence Alignment
Spanish Publicó numerosas antologías de su poesía durante su vida, incluyendo Fuji no Mi , Asa Tsuki, Asa Ginu, y Kamakura Zakki.
Spanish→English He published numerous anthologies of his poetry during his life, including Fuji no Mi, Asa Tsuki, Asa Ginu, and Kamakura Zakki.
English She published numerous anthologies of her poetry during her lifetime, including Fuji no Mi (”Wisteria Beans”), Asa Tsuki (”Morning Moon”), Asa Ginu (”Linen Silk”), and Kamakura Zakki (”Kamakura Miscellany”).
(C) Pronoun Tagging ∅/She Publicó numerosas antologías de su/her poesía durante su/her vida, incluyendo Fuji no Mi, Asa Tsuki, Asa Ginu, y Kamakura Zakki.
Table 2: Example of how pronoun translations are extracted from comparable texts. The English and Spanish sentences were extracted from Wikipedia articles about the same subject, the sentences were aligned due to the similarities in their translations, and the labels were extracted by matching the Spanish dropped and ambiguous pronouns with the English pronouns.

Prior Work on Aligned Corpora

Translation of sentences has been a task at the Workshop on Machine Translation (WMT) since 2006. Over this period, more and larger datasets have been released for training statistical models. Typically released data is given in sentence-pair form, meaning that for each source language sentence (e.g. Spanish), there is a target language sentence (e.g. English) which may be considered its true translation. For a discussion of the adequacy of this approach, see Guillou and Hardmeier (2018).

Translation of Spanish sentences to English was most recently contested at WMT’13 (http://www.statmt.org/wmt13/translation-task.html) Bojar et al. (2013), which offered participants 14,980,513 sentence pairs from Europarl (http://www.statmt.org/europarl/) Habash et al. (2017), Common Crawl Smith et al. (2013), the United Nations corpus (https://cms.unov.org/UNCorpus/) Ziemski et al. (2016), and news commentary. While large, this dataset was not designed for document-level understanding. Document boundaries are available, but correspond only loosely to what a user of a translation system might consider a document: whole days of European Parliament proceedings, for instance.

DiscoMT was introduced as a workshop in 2013 (http://aclweb.org/anthology/W13-3300) to target the translation of pronouns in document context. As well as Europarl, TED talk subtitles (http://opus.nlpl.eu/TED2013.php) Tiedemann (2012) were included for both training and evaluation. Also available for the translation of extended speech is OpenSubtitles (http://opus.nlpl.eu/OpenSubtitles.php) Lison and Tiedemann (2016); Muller et al. (2018), which contains utterance pairs aligned and contextualized according to their temporal appearance in a film. Although subtitle datasets make very wide context available, they are not ideal for this study due to their heavy use of first- and second-person pronouns and their noisy alignment.

High-quality targeted datasets for discourse phenomena were explored in Bawden et al. (2018); however, the released corpus was a small French→English test set. In this paper, we introduce a novel cross-lingual technique for gathering a large-scale targeted dataset without explicit human annotation.

3 Cross-Lingual Pivoting for Pronoun Gender Data Creation

Despite document-level context being critical for pronoun translation quality, little of the available aligned data is directly suitable for training models. In this section, we exploit the special properties of Wikipedia to automate the collection of high-quality gender-tagged pronouns in document context.

                     | Prodrop                        | Possessive
                     | Articles   | he      | she    | Articles   | his     | her
Spanish Wikipedia    | 1,326,469  | -       | -      | 1,326,469  | -       | -
Page Alignment       | 420,850    | -       | -      | 420,850    | -       | -
Pronoun Tagging      | 45,210     | 118,911 | 41,524 | 52,559     | 266,886 | 111,411
Table 3: Extraction yield in each stage of the multi-lingual pivot extraction pipeline.

3.1 Cross-Lingual Pivoting

Wikipedia has a number of desirable properties which we exploit to automatically tag pronoun gender. Firstly, it is very large for head languages like English and Spanish. More importantly, many non-English Wikipedia pages express similar content to their English counterpart given the resource’s style and formatting requirements, and the ease of using translation systems to generate base text for a missing non-English page. Based on these properties, we develop the following three stage extraction pipeline, illustrated in Table 2 with pipeline yield summarized in Table 3.

Page Alignment

Identify as pairs pages with the same title in English and Spanish. We prioritize precision and require exact match.
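A minimal sketch of this step, assuming title-keyed page indexes (the dictionary form and function name are our illustration, not the paper's implementation):

```python
def align_pages(en_pages, es_pages):
    """Pair pages whose titles match exactly across the two Wikipedias.

    Requiring an exact title match favors precision over recall, as the
    paper does. `en_pages` / `es_pages` map titles to page identifiers.
    """
    shared_titles = en_pages.keys() & es_pages.keys()
    return {t: (en_pages[t], es_pages[t]) for t in sorted(shared_titles)}
```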

Sentence Alignment

Identify sentence pairs in aligned pages which express nearly the same meaning. To do this, we compare English translations of the sentences from the Spanish page (produced by an in-house translation tool) to sentences from the English page. We do bipartite matching over these English sentence pairs, selecting pairings with the smallest edit distance. We require that the edit distance is at most half the sentence length, that paired sentences share either a noun or a verb, and that the mapping is at most one-to-one. Table 2, step (B) shows an aligned sentence pair.
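The matching just described can be sketched as a greedy approximation under stated assumptions: token-level edit distance, and a caller-supplied `shared_content_word` predicate standing in for the noun/verb check (which in practice would use a POS tagger). This is an illustration, not the paper's exact implementation:

```python
from itertools import product

def edit_distance(a, b):
    """Token-level Levenshtein distance with a rolling DP row."""
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
    return d[len(b)]

def align_sentences(translated_es, en_sents, shared_content_word):
    """Greedy one-to-one matching by smallest edit distance, subject to the
    paper's filters: distance at most half the sentence length and at least
    one shared noun/verb (approximated by `shared_content_word`)."""
    candidates = []
    for (i, es), (j, en) in product(enumerate(translated_es), enumerate(en_sents)):
        dist = edit_distance(es.split(), en.split())
        if dist <= len(en.split()) / 2 and shared_content_word(es, en):
            candidates.append((dist, i, j))
    used_es, used_en, pairs = set(), set(), []
    for dist, i, j in sorted(candidates):   # smallest distance first
        if i not in used_es and j not in used_en:
            pairs.append((i, j))
            used_es.add(i)
            used_en.add(j)
    return pairs
```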

Pronoun Tagging

Perform alignment over the tokens in detected sentence pairs (we use an implementation of Vogel et al. (1996)), identifying where English uses a gendered pronoun and Spanish has either a dropped or possessive pronoun. Copy the gender of the English pronoun as a label onto the ambiguous target pronoun: masculine for he and his, feminine for she and her. Step (C) in Table 2 shows the resulting pronoun tagging in the example.
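Given such token alignments, the label-copying step can be sketched as follows; the data shapes and names are illustrative, and the alignment itself (Vogel et al., 1996, in the paper) is assumed given:

```python
# Gender carried by the English pronouns the paper projects from.
GENDER = {"he": "MASC", "his": "MASC", "she": "FEM", "her": "FEM"}

def tag_pronouns(en_tokens, alignment, ambiguous_positions):
    """Project English pronoun gender onto ambiguous Spanish positions.

    alignment: list of (en_index, es_index) token-alignment pairs.
    ambiguous_positions: Spanish token indices holding a dropped-subject
    slot or a possessive `su`.
    Returns {es_index: gender_tag}.
    """
    tags = {}
    for en_i, es_i in alignment:
        gender = GENDER.get(en_tokens[en_i].lower())
        if gender and es_i in ambiguous_positions:
            tags[es_i] = gender
    return tags
```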

3.2 Final Dataset

Gender

We can see from Table 3 that there are approximately three times as many masculine examples extracted compared to feminine examples. This is not surprising given the disparity in representation by gender found in Wikipedia Wagner et al. (2016). For the final dataset, we down-sample masculine examples to achieve a 1:1 gender balance, yielding a final dataset with 79,240 prodrop and 187,224 possessive examples. We split the 79,240 prodrop examples into training, development, and test sets of size 72,120, 2,968, and 4,152, respectively; the 187,224 possessive examples are similarly split into sets of size 167,222, 8,862, and 11,140.
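The balancing and splitting could be sketched as below; the sampling scheme, function name, and fraction parameters are our illustration (the paper does not specify them):

```python
import random

def balance_and_split(masc, fem, dev_frac, test_frac, seed=0):
    """Down-sample the larger (masculine) class to a 1:1 gender balance,
    then carve out development and test portions. Returns (train, dev, test)."""
    rng = random.Random(seed)
    masc = rng.sample(masc, len(fem))   # assumes masc is the larger class
    data = masc + fem
    rng.shuffle(data)
    n_dev = int(len(data) * dev_frac)
    n_test = int(len(data) * test_frac)
    return data[n_dev + n_test:], data[:n_dev], data[n_dev:n_dev + n_test]
```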

Quality

To assess the quality of our examples, we sent 993 examples for human rating. In-house bilingual speakers were shown a paragraph of Spanish text containing a labeled pronoun and asked whether, given the context, the pronoun referred to a masculine person, feminine person, or whether it was unclear. Agreement between these human labels and those produced by our automatic pipeline is high. 84% of prodrop and 80% of possessive pronouns could be disambiguated with the provided context and, of those, 98% and 92% of automatically generated gender tags agreed with human judgment. Where the provided paragraph was not enough, it was typically the case that relevant context was in the directly preceding paragraph. Confusion matrices are given in Tables 4 and 5.

                  | Human
Pipeline          | he   | she  | unclear
he                | 221  | 3    | 36
she               | 7    | 179  | 43
Table 4: Agreement between automatic labels and human judgment for prodrop examples.

                  | Human
Pipeline          | his  | her  | unclear
his               | 198  | 6    | 48
her               | 25   | 179  | 55
Table 5: Agreement between automatic labels and human judgment for possessive pronoun examples.
                                         | Masculine           | Feminine
                                         | P     R     F1      | P     R     F1
Prodrop     Baseline MT (Sentences)      | 56.2  81.3  66.5    | 91.3  18.6  30.9
            Baseline MT (Contexts)       | 62.4  60.5  61.4    | 93.0  25.1  39.5
            Context MT (Contexts)        | 65.1  65.3  65.2    | 88.9  35.8  51.0
            BERT (Sentences)             | 66.6  61.6  64.0    | 87.0  40.2  54.9
            BERT (Contexts)              | 81.6  58.5  68.1    | 92.6  58.5  71.7
            BERT (Contexts) + Data       | 90.4  95.2  92.7    | 95.0  89.8  92.3
Possessives Baseline MT (Sentences)      | 58.2  77.8  66.6    | 86.7  30.4  45.0
            Baseline MT (Contexts)       | 63.6  63.4  63.5    | 87.2  35.0  50.0
            Context MT (Contexts)        | 64.2  67.6  65.9    | 86.4  38.0  52.8
            BERT (Contexts) + Data       | 87.4  91.3  89.3    | 90.9  86.8  88.8
Table 6: Intrinsic evaluation of pronoun gender classification over our new test sets.

4 Classifying Pronoun Gender

Language model (LM) pretraining Peters et al. (2018); Devlin et al. (2019) is emerging as an essential technique for human-quality language understanding. However, LMs are limited to learning from what is explicitly expressed in text, which seems insufficient for Spanish given that gender is often not textually realized for pronouns. We show that BERT, a state-of-the-art pretrained-LM model Devlin et al. (2019), already models some aspects of pronoun gender, achieving scores above a strong NMT baseline. We then show that fine-tuning this LM using our new data improves pronoun gender discrimination by over 20% absolute F1 for both genders.

4.1 Neural Machine Translation

        | Count  | Prodrop | Rate
he      | 1,220  | 1,106   | 90.7%
she     | 314    | 251     | 79.9%
Table 7: Even accounting for the difference in the raw number of pronouns by gender, prodrop is more likely for masculine entities than feminine entities in WMT’13 (development).

We take as our NMT baseline a Transformer Vaswani et al. (2017) with vocabulary size 32,000, trained for up to 250k steps. We set the input dimension to 512 and use 6 layers with 8 attention heads, hidden dimension 2048, dropout probabilities of 0.1, and learning rate 2.0. We train this model on WMT’13 Spanish→English data (http://www.statmt.org/wmt13/translation-task.html), preprocessed using the Moses toolkit (https://github.com/moses-smt/mosesdecoder). Rows designated Baseline in Table 6 are trained with the standard sentence-level formulation of the task and yield a best test set BLEU score Papineni et al. (2002) of 34.02. Context MT is trained using the 2+1 strategy of Tiedemann and Scherrer (2017), where an extended context, in our case full sentences up to 128 subtokens, is used to predict the translation of the contextualized target sentence. (Given this very local context, we assume that adjacent sentences in the pre-shuffled dataset are document-continuous; this assumption introduces noise at document boundaries, which occur much less frequently than document-continuation transitions.) This modeling yields 33.99 BLEU, which is not substantially different from Baseline MT, consistent with Tiedemann and Scherrer.

To evaluate how well these NMT systems model pronoun gender, we use them to translate the Spanish text from our new dataset into English. For each target Spanish pronoun, we use an implementation of Vogel et al. (1996) to find the corresponding English pronoun and count the translation as correct if the gender tag on the Spanish agrees with the English pronoun gender. Table 6 shows the results when the input units are single sentences (Sentences) or full sentences up to 128 subtokens (Contexts). All context sentences precede the one containing the pronoun (no lookahead).
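The correctness rule, and the per-gender precision/recall/F1 reported throughout Table 6, can be sketched as follows (function names are our choices; the alignment of pronouns is assumed given):

```python
# Gender of the English pronouns used to check a translation against the
# example's gold tag.
EN_GENDER = {"he": "MASC", "his": "MASC", "she": "FEM", "her": "FEM"}

def is_correct(gold_tag, aligned_en_pronoun):
    """A translation counts as correct when the aligned English pronoun
    carries the same gender as the example's gold tag."""
    return EN_GENDER.get(aligned_en_pronoun.lower()) == gold_tag

def gender_prf(pairs):
    """Per-gender precision/recall/F1 from (gold_tag, predicted_tag) pairs."""
    scores = {}
    for g in ("MASC", "FEM"):
        tp = sum(gold == pred == g for gold, pred in pairs)
        fp = sum(pred == g and gold != g for gold, pred in pairs)
        fn = sum(gold == g and pred != g for gold, pred in pairs)
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[g] = {"P": p, "R": r, "F1": f1}
    return scores
```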

Performance is weak overall, but especially on feminine examples of prodrop. The tendency is for masculine pronouns to be over-used: feminine pronouns are used by Baseline MT (Sentences) less than one time in five when they should be. To understand why this effect is so strong, we analyzed the development set of WMT’13 (Table 7). Specifically, we counted the number of times the English pronouns he and she occurred (first column). Not surprisingly, there was an over-representation of masculine pronouns; for each feminine example in the data, there were four masculine examples. Next, we looked in the aligned Spanish sentences for the corresponding pronouns él and ella and counted a case of prodrop each time they were missing (second column). Even accounting for differences in raw frequency by gender, prodrop is more frequent for masculine entities than feminine ones. This is the first time this difference has been reported, and future work should control for the prior it induces.

Introducing context in the input to an MT model, whether trained on sentences or contexts, improves overall performance by enhancing feminine recall and masculine precision at the expense of masculine recall. One possible reason is that the added context includes the relevant cues for breaking out of the prior for masculine pronouns. The magnitude of the change in F1 score is perhaps larger than expected given that Baseline MT and Context MT achieve similar BLEU scores. However, this finding builds on the arguments in Cifka and Bojar (2018), who show that BLEU score does not correlate perfectly with semantics.

Input: Britney Jean Spears (McComb, Misisipi; 2 de diciembre de 1981), conocida como Britney Spears, es una cantante, bailarina, compositora, modelo, actriz, diseñadora de modas y empresaria estadounidense. Comenzó a actuar desde niña, a través de papeles en producciones teatrales. *** adquirió fama durante su niñez al participar en el programa de televisión The Mickey Mouse Club (1992).
MaskLM Output: spears (0.61), britney (0.23), tambien (0.06), ella (0.03), rapidamente (0.01)…
Prediction: Feminine

Figure 2: Example of how the BERT pretrained model Devlin et al. (2019) is evaluated on gendered pronoun prediction. The MaskLM predicts a word for the wildcard position. The feminine ella is found near the top of the k-best list, so the prediction is Feminine.

4.2 Language Model Pretraining

We initialize our LM experiments from a BERT model (110M parameters; see Devlin et al. (2019)) trained over Wikipedia in 102 languages, including Spanish. To assess how much this model knows about pronoun gender out of the box, we extend the masked LM approach from the original BERT work, inserting a mask token in each dropped subject position (see Figure 2 for an example). This probing applies to prodrop only; no analogous probing can be done for the neutral Spanish possessive pronouns without potentially semantics-breaking rewriting. To generate gender predictions, we deduce masculine if él appears before ella in the list of mask fills and vice versa; we make no gender prediction where neither pronoun appears in the top ten mask fills. The results of this probing are shown in the BERT rows of Table 6, which use input feeds of one sentence (Sentences) and up to 128 subtokens (Contexts).
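The prediction rule can be expressed as a small pure function over the ranked mask fills; the function name, the abstention value, and the parameterized pronoun spellings are our choices:

```python
def gender_from_fills(ranked_fills, k=10, masc="él", fem="ella"):
    """Deduce gender from a masked-LM's ranked fills for a dropped subject.

    Masculine if `él` outranks `ella` in the top k fills, feminine if the
    reverse; None (no prediction) when neither appears in the top k.
    """
    top = ranked_fills[:k]
    m = top.index(masc) if masc in top else None
    f = top.index(fem) if fem in top else None
    if m is None and f is None:
        return None
    if f is None or (m is not None and m < f):
        return "MASC"
    return "FEM"
```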

Despite scanning the full top ten mask fills, BERT's strength lies in its precision rather than in boosts to recall. While not directly comparable to the MT values, which are generated single-shot, we can see that the BERT values are stronger, particularly for feminine examples, which are handled poorly by machine translation.

Comenzó a actuar desde niña, a través de papeles en producciones teatrales. <t> Adquirió </t> fama durante su niñez.
S/he started to act as a child.FEM, through roles in theatrical productions. S/he acquired fame in his/her childhood.

Before Fine-tuning               | After Fine-tuning
Class    | Token     | Weight    | Class  | Token    | Weight
*Masc.   | durante   | 0.01      | Fem.   | niña     | 0.29
         | través    | -0.01     |        | Comenzó  | -0.10
         | actuar    | -0.01     |        | niñez    | 0.09
         | Comenzó   | 0.01      |        | papeles  | -0.01
Table 8: LIME analysis for a representative Spanish text in which all third-person singular pronouns are either dropped or genderless. An English gloss, including grammatical gender marking, is given above.

4.3 BERT Classification Task

We extend this understanding of pronoun gender using the fine-tuning recipe given in the original BERT paper. Specifically, we generate classification examples where the output is a gender tag (masculine or feminine) and the input text is up to 128 subtokens, including special <t> tokens to mark the target pronoun position (as in Table 8). We train over both prodrop and possessive examples from the training portion of our new dataset.
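A sketch of the example construction, with whitespace tokens standing in for BERT subtokens (real inputs would be subword-tokenized; the function and field names are illustrative):

```python
def build_example(tokens, target_index, gold_gender, max_tokens=128):
    """Wrap the target position in <t>…</t> markers, as in Table 8, and
    truncate to a token budget. For prodrop, the marked token is the verb
    whose subject was dropped."""
    marked = (tokens[:target_index]
              + ["<t>", tokens[target_index], "</t>"]
              + tokens[target_index + 1:])
    # Keep the most recent context so the marked pronoun is never cut off
    # (which side is truncated is our assumption, not stated in the paper).
    return {"text": " ".join(marked[-max_tokens:]), "label": gold_gender}
```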

Table 6 shows that this approach is extraordinarily powerful for modeling pronoun gender in our dataset. We note that the performance we see here is remarkably similar to human ratings of quality for the dataset.

On all metrics and in both genders, we see leaps of 20 points and higher. There is minimal performance difference across the two genders; precision and recall do not need to trade off against one another. Our novel method, which makes implicit pronoun gender explicit and amenable to text-level modeling, is thus very well suited to pronoun gender classification.

4.4 Analysis

We use the LIME toolkit (https://github.com/marcotcr/lime) Ribeiro et al. (2016) to understand how the fine-tuning recipe yields such a strong model of pronoun gender. LIME is built for interpretability, learning linear models which approximate heavy-weight architectures over features which are intuitive to humans. By analyzing the highly-weighted features in these approximations, we can learn what factors are important in the classification decisions of the heavy-weight models.

Table 8 shows the top tokens in the LIME analysis of our gender classifier, both at the start and end of fine-tuning, for a representative Spanish text, shown with its English gloss. Before fine-tuning, the classification decision is incorrect and importance is low and spread fairly evenly across all terms in context. After fine-tuning, a correct classification is made, primarily with reference to the niña token, which is the feminine form of child and refers to a young Britney Spears, the dropped subject of Adquirió. That is, just as the original Transformer model implicitly models coreference for translation Voita et al. (2018); Webster et al. (2018), a fine-tuned BERT classifier appears to implicitly learn coreference to make gender decisions here.

5 Improving Machine Translation with Classification Tags

Model               | Translation | Classifier | BLEU  | Masculine         | Feminine
                    | Input       | Input      |       | P     R     F1    | P     R     F1
Baseline MT         | Sentences   | -          | 34.02 | 95.2  97.1  96.2  | 69.7  57.5  63.0
  + Gender Tags     | Sentences   | Contexts   | 34.12 | 96.8  96.2  96.5  | 70.0  73.7  71.8
Context MT          | Contexts    | -          | 33.99 | 97.6  94.7  96.1  | 63.2  80.0  70.6
Table 9: BLEU score and pronoun accuracy by gender on the WMT’13 Spanish-to-English test set.

Given the strong intrinsic evaluation, we use the gender classification tags directly in a standard NMT architecture to introduce implicit contextual awareness while still having translation parallelizable across sentences. This follows precedent set in Johnson et al. (2017) and Zhou et al. (2019) for controlling translations via tag-based annotations. Our simple approach yields substantial improvements in pronoun translation quality.

Baseline MT + Gender Tags

We use the Transformer model specification above but train with WMT’13 sentence pairs augmented with gender tags. For each Spanish sentence, we classify whether it contains a third-person singular dropped subject or possessive pronoun using heuristics over morphological analysis, part-of-speech tags, and syntactic dependency structure Andor et al. (2016): dropped pronouns occur where a ROOT verb has no nsubj dependent and no noun to its left; possessive pronouns are tagged with a label ending in $. For sentences containing a target pronoun, we concatenate the gender tag predicted by BERT (Contexts) + Data after the sentence (to preserve positional encoding), following a special <c> context token. For example:

Adquirió fama durante su niñez al participar en el programa de televisión The Mickey Mouse Club (1992). <c> FEM

For robustness, we introduce noise by randomly flipping whether the sentence is treated as containing a target pronoun 2% of the time (in which case the input contains no special <c> token) and generating a random gender tag 5% of the time.
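A sketch of the tag-augmentation step, including the noise scheme just described. The token spellings and the simplified flip (which only drops tags, rather than also adding spurious ones) are our assumptions:

```python
import random

def augment_source(sentence, predicted_gender, rng):
    """Append the classifier's tag after a <c> separator, with noise:
    2% of the time the tag is dropped entirely, and 5% of the time a
    random gender is substituted for the predicted one."""
    if predicted_gender is None or rng.random() < 0.02:
        return sentence                      # no <c> token at all
    tag = predicted_gender
    if rng.random() < 0.05:
        tag = rng.choice(["MASC", "FEM"])    # random gender tag
    return f"{sentence} <c> {tag}"
```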

We choose this strategy as a simple extension of WMT which is fit for our purposes. Specifically, we seek to show that our gender classification model captures implicit pronoun gender well and that its understanding is useful for the translation of ambiguous pronouns.

Results

We evaluate translation performance using BLEU score, and pronoun generation quality using F1 score disaggregated by gender over gendered pronouns. We do not report APT score (https://github.com/idiap/APT) Miculicich Werlen and Popescu-Belis (2017b) since we target few pronouns and make conclusions at the level of gender, not individual pronouns.

Table 9 shows that incorporating contextual awareness, whether by concatenating full sentences or by adding gender tags to the translation input, markedly improves pronoun quality. Masculine performance is uniformly strong, and our new model, Baseline MT + Gender Tags, gives the best performance on feminine pronouns, achieving an 8.8% F1 score improvement over Baseline MT. As for the contextual MT models analyzed in the previous section, the improvement comes from addressing the low recall over feminine pronouns in sentence-level translation (while retaining precision). We manually reviewed error cases and noted that the major class arose from domain shift: whereas antecedents were close to pronouns in Wikipedia, WMT more frequently had antecedents outside the BERT context window, and these cases were misclassified. Future work could look at methods for learning pre-trained language models from extended document context.

6 Conclusion

In this paper, we introduce a cross-lingual pivoting technique which generates large volumes of labeled data for pronoun classification without explicit human annotation. We demonstrate improvements on translating pronouns from Spanish to English using a BERT-based classifier fine-tuned on this data, which achieves an F1 of 92% for feminine pronouns, compared with 30-51% for neural machine translation models and 54-71% for non-fine-tuned BERT. We show that our classifier can be incorporated into a standard sentence-level neural machine translation system to yield an 8.8% F1 score improvement in feminine pronoun generation.

Our data creation method is largely language-agnostic and may be extended in future work to new language pairs, or even different grammatical features. Further in this direction, we are excited by the prospect of zero-shot pronoun gender classification, particularly in low-resource languages.

Acknowledgements

We would like to thank Melvin Johnson, Romina Stella, and Macduff Hughes for assistance with machine translation work, as well as Mike Collins, Slav Petrov, and other members of the Google Research team for their feedback and suggestions on earlier versions of this manuscript.

References

  • Andor et al. (2016) Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442–2452, Berlin, Germany. Association for Computational Linguistics.
  • Bawden et al. (2018) Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304–1313, New Orleans, Louisiana. Association for Computational Linguistics.
  • Bojar et al. (2013) Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.
  • Cifka and Bojar (2018) Ondrej Cifka and Ondrej Bojar. 2018. Are BLEU and Meaning Representation in Opposition? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1362–1371, Melbourne, Australia. Association for Computational Linguistics.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Guillou and Hardmeier (2018) Liane Guillou and Christian Hardmeier. 2018. Automatic reference-based evaluation of pronoun translation misses the point. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4797–4802, Brussels, Belgium. Association for Computational Linguistics.
  • Habash et al. (2019) Nizar Habash, Houda Bouamor, and Christine Chung. 2019. Automatic gender identification and reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 155–165, Florence, Italy. Association for Computational Linguistics.
  • Habash et al. (2017) Nizar Habash, Nasser Zalmout, Dima Taji, Hieu Hoang, and Maverick Alzate. 2017. A Parallel Corpus for Evaluating Machine Translation between Arabic and European Languages. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 235–241, Valencia, Spain. Association for Computational Linguistics.
  • Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
  • Kuang et al. (2018) Shaohui Kuang, Deyi Xiong, Weihua Luo, and Guodong Zhou. 2018. Modeling coherence for neural machine translation with dynamic and topic caches. In Proceedings of the 27th International Conference on Computational Linguistics, pages 596–606, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Läubli et al. (2018) Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has machine translation achieved human parity? a case for document-level evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796.
  • Lison and Tiedemann (2016) Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16).
  • Loáiciga et al. (2017) Sharid Loáiciga, Sara Stymne, Preslav Nakov, Christian Hardmeier, Jörg Tiedemann, Mauro Cettolo, and Yannick Versley. 2017. Findings of the 2017 DiscoMT shared task on cross-lingual pronoun prediction. In The Third Workshop on Discourse in Machine Translation.
  • Miculicich Werlen and Popescu-Belis (2017a) Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017a. Using coreference links to improve spanish-to-english machine translation. In Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017), pages 30–40, Valencia, Spain. Association for Computational Linguistics.
  • Miculicich Werlen and Popescu-Belis (2017b) Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017b. Validation of an Automatic Metric for the Accuracy of Pronoun Translation (APT). In Proceedings of the Third Workshop on Discourse in Machine Translation (DiscoMT), EPFL-CONF-229974. Association for Computational Linguistics (ACL).
  • Moryossef et al. (2019) Amit Moryossef, Roee Aharoni, and Yoav Goldberg. 2019. Filling gender & number gaps in neural machine translation with black-box context injection. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 49–54, Florence, Italy. Association for Computational Linguistics.
  • Muller et al. (2018) Mathias Muller, Annette Rios, Elena Voita, and Rico Sennrich. 2018. A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 61–72, Belgium, Brussels. Association for Computational Linguistics.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the North American Chapter of the Association of Computational Linguistics.
  • Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. ”why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144.
  • Sim Smith (2017) Karin Sim Smith. 2017. On integrating discourse in machine translation. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 110–121, Copenhagen, Denmark. Association for Computational Linguistics.
  • Smith et al. (2013) Jason R. Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt Cheap Web-Scale Parallel Text from the Common Crawl. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1374–1383, Sofia, Bulgaria. Association for Computational Linguistics.
  • Stojanovski and Fraser (2018) Dario Stojanovski and Alexander Fraser. 2018. Coreference and coherence in neural machine translation: A study using oracle experiments. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 49–60.
  • Tiedemann (2012) Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).
  • Tiedemann and Scherrer (2017) Jörg Tiedemann and Yves Scherrer. 2017. Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 82–92, Copenhagen, Denmark. Association for Computational Linguistics.
  • Vanmassenhove et al. (2018) Eva Vanmassenhove, Christian Hardmeier, and Andy Way. 2018. Getting gender right in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3003–3008, Brussels, Belgium. Association for Computational Linguistics.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 6000–6010, Long Beach, California.
  • Vogel et al. (1996) Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th conference on Computational linguistics. Association for Computational Linguistics.
  • Voita et al. (2018) Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274, Melbourne, Australia. Association for Computational Linguistics.
  • Wagner et al. (2016) Claudia Wagner, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. 2016. Women through the glass ceiling: Gender assymetrics in Wikipedia.

    EPJ Data Science

    , 5.
  • Wang et al. (2017) Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017. Exploiting cross-sentence context for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2826–2831, Copenhagen, Denmark. Association for Computational Linguistics.
  • Wang et al. (2018) Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2018. Learning to jointly translate and predict dropped pronouns with a shared reconstruction mechanism. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2997–3002, Brussels, Belgium. Association for Computational Linguistics.
  • Webster et al. (2018) Kellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. 2018. Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns. In Transactions of the ACL, page to appear.
  • Zhang et al. (2018) Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. 2018. Improving the transformer translation model with document-level context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 533–542, Brussels, Belgium. Association for Computational Linguistics.
  • Zhou et al. (2019) Pei Zhou, Weijia Shi, Jieyu Zhao, Kuan-Hao Huang, Muhao Chen, Ryan Cotterell, and Kai-Wei Chang. 2019. Examining gender bias in languages with grammatical gender. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5276–5284, Hong Kong, China. Association for Computational Linguistics.
  • Ziemski et al. (2016) Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference.

References

  • Andor et al. (2016) Daniel Andor, Chris Alberti, David Weiss, Aliaksei Severyn, Alessandro Presta, Kuzman Ganchev, Slav Petrov, and Michael Collins. 2016. Globally normalized transition-based neural networks. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2442–2452, Berlin, Germany. Association for Computational Linguistics.
  • Bawden et al. (2018) Rachel Bawden, Rico Sennrich, Alexandra Birch, and Barry Haddow. 2018. Evaluating discourse phenomena in neural machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1304–1313, New Orleans, Louisiana. Association for Computational Linguistics.
  • Bojar et al. (2013) Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.
  • Cífka and Bojar (2018) Ondřej Cífka and Ondřej Bojar. 2018. Are BLEU and Meaning Representation in Opposition? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1362–1371, Melbourne, Australia. Association for Computational Linguistics.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Guillou and Hardmeier (2018) Liane Guillou and Christian Hardmeier. 2018. Automatic reference-based evaluation of pronoun translation misses the point. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4797–4802, Brussels, Belgium. Association for Computational Linguistics.
  • Habash et al. (2019) Nizar Habash, Houda Bouamor, and Christine Chung. 2019. Automatic gender identification and reinflection in Arabic. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 155–165, Florence, Italy. Association for Computational Linguistics.
  • Habash et al. (2017) Nizar Habash, Nasser Zalmout, Dima Taji, Hieu Hoang, and Maverick Alzate. 2017. A Parallel Corpus for Evaluating Machine Translation between Arabic and European Languages. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 235–241, Valencia, Spain. Association for Computational Linguistics.
  • Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351.
  • Kuang et al. (2018) Shaohui Kuang, Deyi Xiong, Weihua Luo, and Guodong Zhou. 2018. Modeling coherence for neural machine translation with dynamic and topic caches. In Proceedings of the 27th International Conference on Computational Linguistics, pages 596–606, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  • Läubli et al. (2018) Samuel Läubli, Rico Sennrich, and Martin Volk. 2018. Has machine translation achieved human parity? a case for document-level evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4791–4796.
  • Lison and Tiedemann (2016) Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16).
  • Loáiciga et al. (2017) Sharid Loáiciga, Sara Stymne, Preslav Nakov, Christian Hardmeier, Jörg Tiedemann, Mauro Cettolo, and Yannick Versley. 2017. Findings of the 2017 DiscoMT shared task on cross-lingual pronoun prediction. In The Third Workshop on Discourse in Machine Translation.
  • Miculicich Werlen and Popescu-Belis (2017a) Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017a. Using coreference links to improve Spanish-to-English machine translation. In Proceedings of the 2nd Workshop on Coreference Resolution Beyond OntoNotes (CORBON 2017), pages 30–40, Valencia, Spain. Association for Computational Linguistics.
  • Miculicich Werlen and Popescu-Belis (2017b) Lesly Miculicich Werlen and Andrei Popescu-Belis. 2017b. Validation of an Automatic Metric for the Accuracy of Pronoun Translation (APT). In Proceedings of the Third Workshop on Discourse in Machine Translation (DiscoMT), EPFL-CONF-229974. Association for Computational Linguistics (ACL).
  • Moryossef et al. (2019) Amit Moryossef, Roee Aharoni, and Yoav Goldberg. 2019. Filling gender & number gaps in neural machine translation with black-box context injection. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pages 49–54, Florence, Italy. Association for Computational Linguistics.
  • Müller et al. (2018) Mathias Müller, Annette Rios, Elena Voita, and Rico Sennrich. 2018. A large-scale test set for the evaluation of context-aware pronoun translation in neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 61–72, Brussels, Belgium. Association for Computational Linguistics.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the North American Chapter of the Association of Computational Linguistics.
  • Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 1135–1144.
  • Sim Smith (2017) Karin Sim Smith. 2017. On integrating discourse in machine translation. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 110–121, Copenhagen, Denmark. Association for Computational Linguistics.
  • Smith et al. (2013) Jason R. Smith, Herve Saint-Amand, Magdalena Plamada, Philipp Koehn, Chris Callison-Burch, and Adam Lopez. 2013. Dirt Cheap Web-Scale Parallel Text from the Common Crawl. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1374–1383, Sofia, Bulgaria. Association for Computational Linguistics.
  • Stojanovski and Fraser (2018) Dario Stojanovski and Alexander Fraser. 2018. Coreference and coherence in neural machine translation: A study using oracle experiments. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 49–60.
  • Tiedemann (2012) Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey. European Language Resources Association (ELRA).
  • Tiedemann and Scherrer (2017) Jörg Tiedemann and Yves Scherrer. 2017. Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation, pages 82–92, Copenhagen, Denmark. Association for Computational Linguistics.
  • Vanmassenhove et al. (2018) Eva Vanmassenhove, Christian Hardmeier, and Andy Way. 2018. Getting gender right in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3003–3008, Brussels, Belgium. Association for Computational Linguistics.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 6000–6010, Long Beach, California.
  • Vogel et al. (1996) Stephan Vogel, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th conference on Computational linguistics. Association for Computational Linguistics.
  • Voita et al. (2018) Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1264–1274, Melbourne, Australia. Association for Computational Linguistics.
  • Wagner et al. (2016) Claudia Wagner, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. 2016. Women through the glass ceiling: Gender asymmetries in Wikipedia. EPJ Data Science, 5.
  • Wang et al. (2017) Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2017. Exploiting cross-sentence context for neural machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2826–2831, Copenhagen, Denmark. Association for Computational Linguistics.
  • Wang et al. (2018) Longyue Wang, Zhaopeng Tu, Andy Way, and Qun Liu. 2018. Learning to jointly translate and predict dropped pronouns with a shared reconstruction mechanism. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2997–3002, Brussels, Belgium. Association for Computational Linguistics.
  • Webster et al. (2018) Kellie Webster, Marta Recasens, Vera Axelrod, and Jason Baldridge. 2018. Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns. Transactions of the Association for Computational Linguistics, 6:605–617.
  • Zhang et al. (2018) Jiacheng Zhang, Huanbo Luan, Maosong Sun, Feifei Zhai, Jingfang Xu, Min Zhang, and Yang Liu. 2018. Improving the transformer translation model with document-level context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 533–542, Brussels, Belgium. Association for Computational Linguistics.
  • Zhou et al. (2019) Pei Zhou, Weijia Shi, Jieyu Zhao, Kuan-Hao Huang, Muhao Chen, Ryan Cotterell, and Kai-Wei Chang. 2019. Examining gender bias in languages with grammatical gender. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5276–5284, Hong Kong, China. Association for Computational Linguistics.
  • Ziemski et al. (2016) Michał Ziemski, Marcin Junczys-Dowmunt, and Bruno Pouliquen. 2016. The United Nations Parallel Corpus v1.0. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference.