
Checks and Strategies for Enabling Code-Switched Machine Translation

Code-switching is a common phenomenon among multilingual speakers, where alternation between two or more languages occurs within the context of a single conversation. While multilingual humans can seamlessly switch back and forth between languages, multilingual neural machine translation (NMT) models are not robust to such sudden changes in input. This work explores multilingual NMT models' ability to handle code-switched text. First, we propose checks to measure switching capability. Second, we investigate simple and effective data augmentation methods that can enhance an NMT model's ability to support code-switching. Finally, by using a glass-box analysis of attention modules, we demonstrate the effectiveness of these methods in improving robustness.


1 Introduction

Neural machine translation (NMT) Sutskever et al. (2014); Bahdanau et al. (2015); Vaswani et al. (2017) has made significant progress, from supporting only a pair of languages per model to simultaneously supporting hundreds of languages Johnson et al. (2017); Zhang et al. (2020); Tiedemann (2020); Gowda et al. (2021). Multilingual NMT models have been deployed in production systems and are actively used to translate across languages in day-to-day settings Wu et al. (2016); Caswell (2020); Mohan and Skotdal (2021). Many metrics for the evaluation of machine translation have been proposed Doddington (2002); Banerjee and Lavie (2005); Snover et al. (2006); Popović (2015); Gowda et al. (2021); a more comprehensive list would exceed space limitations. However, with the exception of context-aware MT, nearly all approaches consider translation in the context of a single sentence. Even approaches that generalize to support translation of multiple languages Zhang et al. (2020); Tiedemann (2020); Gowda et al. (2021) continue to use the single-sentence, single-language paradigm. In reality, however, multilingual environments often involve language alternation or code-switching (CS), in which speakers seamlessly alternate between two or more languages Myers-Scotton and Ury (1977).

CS can be broadly classified into two types Myers-Scotton (1989): (i) intra-sentential CS, where switching occurs within a sentence or clause, and (ii) inter-sentential CS, where switching occurs at sentence or clause boundaries. An example of each type is given in Table 1. CS has been studied extensively in the linguistics community Nilep (2006); however, efforts in the MT community are scant Gupta et al. (2021).

Intra: Ce moment when you start penser en deux langues at the same temps.
       (The moment when you start to think in two languages at the same time.)
Inter: Comme on fait son lit, you must lie on it.
       (As you make your bed, you must lie on it.)
Table 1: Intra- and inter-sentential code-switching examples between French and English.

In this work, we show that, as commonly built, multilingual NMT models are not robust to multi-sentence translation, especially when CS is involved. The contributions of this work are as follows. First, we describe a few simple but effective checks for improving test coverage in multilingual NMT evaluation (Section 2). Second, we explore training data augmentation techniques such as concatenation and noise addition in the context of multilingual NMT (Section 3). Third, using a many-to-one multilingual translation task setup (Section 4), we investigate the relationship between training data augmentation methods and their impact on multilingual test cases. Fourth, we conduct a glass-box analysis of cross-attention in the Transformer architecture and show, both visually and quantitatively, that models trained with concatenated training sentences learn a more sharply focused attention mechanism than others. Finally, we examine how our data augmentation strategies generalize to multi-sentence translation for a variable number of sentences, and determine that two-sentence concatenation in training is sufficient to model many-sentence concatenation at inference (Section 5.2).

2 Multilingual Translation Evaluation: Additional Checks

Notation: For simplicity, consider a many-to-one model that translates sentences from $n$ source languages to a target language $t$. Let $x^{(u)}_i$ be the sentence at position $i$ in source language $u$, and let its translation in the target language be $y_i$; where unambiguous, we omit the superscripts.

We propose the following checks to be used for multilingual NMT (a construction sketch in code follows the definitions):


C-TL:

Consecutive sentences in the source and target languages. This check tests whether the translator can translate in the presence of inter-sentential CS and preserve phrases that are already in the target language. For completeness, we can test both source-to-target and target-to-source CS, as follows:

$$x_i \oplus y_{i+1} \Rightarrow y_i \oplus y_{i+1} \quad (1)$$
$$y_i \oplus x_{i+1} \Rightarrow y_i \oplus y_{i+1} \quad (2)$$

In practice, we use a space character to join sentences, indicated by the concatenation operator '$\oplus$'. (We focus on orthographies that use space as a word-breaker; in orthographies without a word-breaker, joining may be performed without any glue character.) This check requires that the held-out set sentence order preserve the coherency of the original document.

C-XL:

This check tests whether a multilingual translator is agnostic to CS. It is created by concatenating consecutive sentences across source languages, which is possible iff the held-out sets are multi-parallel across languages and, as with the previous check, each preserves the coherency of the original documents. Given two languages $u$ and $v$, we obtain a test sentence as follows:

$$x^{(u)}_i \oplus x^{(v)}_{i+1} \Rightarrow y_i \oplus y_{i+1} \quad (3)$$
R-XL:

This check tests whether a multilingual translator can function in light of a topic switch among its supported source languages. For any two languages $u$ and $v$ and random positions $i$ and $j$ in their original corpora, we obtain a test segment by concatenating them as:

$$x^{(u)}_i \oplus x^{(v)}_j \Rightarrow y_i \oplus y_j \quad (4)$$

This method makes the fewest assumptions about the nature of held-out datasets: unlike the previous methods, neither multi-parallelism nor coherency in sentence order is necessary.

C-SL:

Concatenate consecutive sentences in the same language. While this check is not a test of CS, it helps test whether the model is invariant to a missed segmentation, as it is not always trivial to determine sentence segmentation in continuous language. This check is possible iff the held-out set sentence order preserves the coherency of the original document. Formally,

$$x_i \oplus x_{i+1} \Rightarrow y_i \oplus y_{i+1} \quad (5)$$
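To make these definitions concrete, the following is a minimal Python sketch of how the four check sets can be constructed, assuming a multi-parallel held-out set stored as per-language lists of (source, target) sentence pairs in original document order; the function names and data layout are ours, for illustration only.

```python
import random

# Minimal sketch of the four checks (Section 2). Each `pairs` argument is a
# list of (source_sentence, target_sentence) tuples in document order.
# Sentences are joined with a space, per the convention above.

def c_sl(pairs):
    """C-SL: concatenate consecutive sentences in the same language."""
    return [(f"{pairs[i][0]} {pairs[i+1][0]}", f"{pairs[i][1]} {pairs[i+1][1]}")
            for i in range(len(pairs) - 1)]

def c_tl(pairs):
    """C-TL: mix source and target languages across consecutive sentences."""
    checks = []
    for i in range(len(pairs) - 1):
        (x1, y1), (x2, y2) = pairs[i], pairs[i + 1]
        checks.append((f"{x1} {y2}", f"{y1} {y2}"))  # source-to-target switch
        checks.append((f"{y1} {x2}", f"{y1} {y2}"))  # target-to-source switch
    return checks

def c_xl(pairs_u, pairs_v):
    """C-XL: consecutive sentences drawn from two different source languages."""
    return [(f"{pairs_u[i][0]} {pairs_v[i+1][0]}",
             f"{pairs_u[i][1]} {pairs_v[i+1][1]}")
            for i in range(len(pairs_u) - 1)]

def r_xl(pairs_u, pairs_v, n, seed=0):
    """R-XL: random positions from two source languages (no coherency needed)."""
    rng = random.Random(seed)
    checks = []
    for _ in range(n):
        x1, y1 = rng.choice(pairs_u)
        x2, y2 = rng.choice(pairs_v)
        checks.append((f"{x1} {x2}", f"{y1} {y2}"))
    return checks
```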

3 Achieving Robustness via Data Augmentation Methods

In the previous section, we described several ways of improving test coverage for multilingual translation models. In this section, we explore training data augmentation techniques to improve robustness to code-switching settings.

3.1 Concatenation

Concatenation of training sentences has proven to be a useful data augmentation technique; Nguyen et al. (2021) investigate the key factors behind the usefulness of training segment concatenation in bilingual settings. Their experiments reveal that concatenating random sentences performs as well as concatenating consecutive sentences, which suggests that discourse coherence is unlikely to be the driving factor behind the gains. They attribute the gains to three factors: context diversity, length diversity, and position shifting.

In this work, we investigate training data concatenation in multilingual settings, hypothesizing that concatenation helps achieve the robustness checks described in Section 2. Our training concatenation approaches are similar to our check sets, with the notable exception that we do not specifically consider consecutive-sentence training, both because of Nguyen et al. (2021)'s finding and because training data gathering techniques can often restrict the availability of consecutive data Bañón et al. (2020). We investigate the following sub-settings for concatenation (a construction sketch in code follows the list):


CatSL:

Concatenate a pair of source sentences in the same language, using a space where appropriate (e.g., for languages with space-separated tokens).

$$x^{(u)}_i \oplus x^{(u)}_j \Rightarrow y_i \oplus y_j \quad (6)$$

CatXL:

Concatenate a pair of source sentences, without constraint on language.

$$x^{(u)}_i \oplus x^{(v)}_j \Rightarrow y_i \oplus y_j \quad (7)$$

CatRepeat:

The same sentence is repeated and then concatenated. Although this seems uninteresting, it serves a key role in ruling out gains possibly due to data repetition and modification of sentence lengths.

$$x_i \oplus x_i \Rightarrow y_i \oplus y_i \quad (8)$$
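As referenced above, here is a minimal sketch of the three concatenation augmentations, assuming the training corpus is a dict mapping each source language to its list of (source, target) pairs; the sampling scheme and names are illustrative, not the released implementation.

```python
import random

# Minimal sketch of CatSL, CatXL, and CatRepeat (Section 3.1). `corpus` maps
# each source language to a list of (source_sentence, target_sentence) pairs.

def cat_sl(corpus, n, seed=0):
    """CatSL: join two random sentences from the same source language."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        lang = rng.choice(list(corpus))
        (x1, y1), (x2, y2) = rng.choice(corpus[lang]), rng.choice(corpus[lang])
        out.append((f"{x1} {x2}", f"{y1} {y2}"))
    return out

def cat_xl(corpus, n, seed=0):
    """CatXL: join two random sentences with no constraint on language."""
    rng = random.Random(seed)
    pool = [pair for pairs in corpus.values() for pair in pairs]
    out = []
    for _ in range(n):
        (x1, y1), (x2, y2) = rng.choice(pool), rng.choice(pool)
        out.append((f"{x1} {x2}", f"{y1} {y2}"))
    return out

def cat_repeat(corpus, n, seed=0):
    """CatRepeat: repeat a sentence, to control for length/repetition effects."""
    rng = random.Random(seed)
    pool = [pair for pairs in corpus.values() for pair in pairs]
    out = []
    for _ in range(n):
        x, y = rng.choice(pool)
        out.append((f"{x} {x}", f"{y} {y}"))
    return out
```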

3.2 Adding Noise

We hypothesize that introducing noise during training might help achieve robustness and investigate two approaches that rely on noise addition:


DenoiseTgt:

Form the source side of a segment by adding noise to its target side. Formally, $(\mathrm{noise}(y_i; p), y_i)$, where the hyperparameter $p$ controls the noise ratio. Denoising is an important technique in unsupervised NMT Artetxe et al. (2018); Lample et al. (2018).

NoisySrc:

Add noise to the source side of a translation pair. Formally, $(\mathrm{noise}(x_i; p), y_i)$. This resembles back-translation Sennrich et al. (2016a), where augmented data is formed by pairing noisy source sentences with clean target sentences.

The $\mathrm{noise}(\cdot; p)$ function is implemented as follows: (i) a random $p$ fraction of tokens are dropped, (ii) a random $p$ fraction of tokens are replaced with types uniformly sampled from the vocabulary, and (iii) a random $p$ fraction of tokens' positions are displaced within the sequence. We use $p = 0.1$ in this work.
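A minimal sketch of such a noise function follows, assuming whitespace-tokenized sentences and a single ratio p shared by the three operations; the displacement step here is one simple realization among several possible.

```python
import random

# Minimal noise() sketch: drop, replace, and displace, each applied to a
# random p-fraction of tokens. `vocab` is a list of token types to sample
# replacements from.
def noise(tokens, vocab, p=0.1, seed=None):
    rng = random.Random(seed)
    # (i) drop a random p-fraction of tokens
    tokens = [tok for tok in tokens if rng.random() >= p]
    # (ii) replace a random p-fraction with types sampled uniformly from vocab
    tokens = [rng.choice(vocab) if rng.random() < p else tok for tok in tokens]
    # (iii) displace a random p-fraction of positions: permute their contents
    idx = [i for i in range(len(tokens)) if rng.random() < p]
    perm = idx[:]
    rng.shuffle(perm)
    out = tokens[:]
    for src, dst in zip(idx, perm):
        out[dst] = tokens[src]
    return out

# e.g., noisy_source = " ".join(noise(target_sentence.split(), vocab))
```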

Language In-domain All-data
Bengali (BN) 23.3k/0.4M/0.4M 1.3M/19.5M/21.3M
Gujarati (GU) 41.6k/0.7M/0.8M 0.5M/07.2M/09.5M
Hindi (HI) 50.3k/1.1M/1.0M 3.1M/54.7M/51.8M
Kannada (KN) 28.9k/0.4M/0.6M 0.4M/04.6M/08.7M
Malayalam (ML) 26.9k/0.3M/0.5M 1.1M/11.6M/19.0M
Marathi (MR) 29.0k/0.4M/0.5M 0.6M/09.2M/13.1M
Oriya (OR) 32.0k/0.5M/0.6M 0.3M/04.4M/05.1M
Punjabi (PA) 28.3k/0.6M/0.5M 0.5M/10.1M/10.9M
Tamil (TA) 32.6k/0.4M/0.6M 1.4M/16.0M/27.0M
Telugu (TE) 33.4k/0.5M/0.6M 0.5M/05.7M/09.1M
All 326k/5.3M/6.1M 9.6M/143M/175M
Table 2: Training dataset statistics: segments / source / target tokens, before tokenization.
Name Dev Test
Orig 10k/140.5k/163.2k 23.9k/331.1k/385.1k
C-TL 10k/303.7k/326.4k 23.9k/716.1k/770.1k
C-XL 10k/283.9k/326.4k 23.9k/670.7k/770.1k
R-XL 10k/216.0k/251.2k 23.9k/514.5k/600.5k
C-SL 10k/281.0k/326.4k 23.9k/662.1k/770.1k
Table 3: Development and test set statistics: segments / source / target tokens, before subword tokenization. The row named ‘Orig’ is the union of all ten individual languages’ datasets, and the rest are created as per definitions in Section 2. Dev-Orig set is used for validation and early stopping in all our multilingual models.
Table 4: Concatenated sentence examples from the development set. Bengali (BN), Gujarati (GU), Kannada (KN), and Hindi (HI) are chosen for illustration; similar augmentations are performed for all other languages in the corpus. Indices $i$ and $i+1$ indicate consecutive positions, and $i$ and $j$ indicate random positions.

4 Setup

4.1 Dataset

We use publicly available datasets from the Workshop on Asian Translation 2021 (WAT21) MultiIndicMT shared task Nakazawa et al. (2021) (http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/). This task involves translation between English (EN) and 10 Indic languages, namely: Bengali (BN), Gujarati (GU), Hindi (HI), Kannada (KN), Malayalam (ML), Marathi (MR), Oriya (OR), Punjabi (PA), Tamil (TA), and Telugu (TE). The development and held-out test sets are multi-parallel and contain 1,000 and 2,390 sentences, respectively. The training set contains a small portion of data from the same domain as the held-out sets, as well as additional datasets from other domains. All training data statistics are given in Table 2. We focus on the Indic→English (many-to-one) translation direction in this work.

Following the definitions in Section 2, we create C-SL, C-TL, C-XL, and R-XL versions of the development and test sets; statistics are given in Table 3. An example demonstrating the nuances of all four methods is shown in Table 4. Following the definitions in Section 3, we create CatSL, CatXL, CatRepeat, DenoiseTgt, and NoisySrc augmented training segments. For each of these training corpus augmentation methods, we restrict the total augmented data to roughly the same number of segments as the original corpus, i.e., 326k and 9.6M segments in the in-domain and all-data setups, respectively.

4.2 Model and Training Process

We use a Transformer base model Vaswani et al. (2017), which has 512 hidden dimensions, 6 encoder and 6 decoder layers, 8 attention heads, and intermediate feedforward layers of 2048 dimensions. We use a PyTorch-based NMT toolkit. (Additional details are withheld at the moment to preserve the anonymity of the authors; all code, data, and models will be publicly released.) Tuning the vocabulary size and batch size is important for achieving competitive performance. We use byte-pair encoding (BPE) Sennrich et al. (2016b), with vocabulary size adjusted per the recommendations of Gowda and May (2020). Since the source side has many languages and the target side only one, we use a larger source vocabulary than target vocabulary: the source-side vocabulary contains BPE types from all 11 languages (i.e., the ten source languages and English), whereas, to improve the efficiency of the decoder's softmax layer, the target vocabulary is restricted to English only. Our in-domain limited-data setup learns BPE vocabularies of 30.4k and 4.8k types for the source and target sides, respectively; the all-data setup learns 230.4k and 63.4k types. The training batch size for all multilingual models is 10k tokens in the in-domain limited-data setup and 25k tokens in the larger all-data setup; the batch size for the baseline bilingual models is adjusted to data size using a 'thousand tokens of batch size per million training tokens' rule of thumb that we devised, with a maximum of 25k tokens. The median sequence lengths in training, after subword segmentation but before sentence concatenation, are 15 on the Indic side and 17 on the English side; we model sequence lengths of up to 512 time steps during training. We use the same learning rate schedule as Vaswani et al. (2017). We train our models for a maximum of 200k optimizer steps and use early stopping with a patience of 10 validations, validating every 1000 optimizer steps. All models are trained on one Nvidia A40 GPU per setting; the smaller in-domain setup takes less than 24 hours per run, whereas the larger all-data setup takes at most 48 hours per run (less when early-stopping criteria are reached). We run each experiment twice and report the average. During inference, we average the last 5 checkpoints and use a beam decoder of size 4 with length penalty following Vaswani et al. (2017); Wu et al. (2016).
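As a concrete illustration of the checkpoint-averaging step, the sketch below computes an element-wise mean of the last few saved checkpoints; it assumes each checkpoint is a plain PyTorch state dict of floating-point parameter tensors, and the paths are hypothetical placeholders since the toolkit is unnamed here.

```python
import torch

# Minimal checkpoint-averaging sketch: element-wise mean over the parameters
# of the last k checkpoints. Assumes each file holds a state dict whose
# values are floating-point tensors.
def average_checkpoints(paths):
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}

# e.g., decode with: model.load_state_dict(average_checkpoints(last_5_paths))
```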

Dev Test
ID In-domain Orig C-TL C-SL C-XL R-XL Orig C-TL C-SL C-XL R-XL
#I1 Baseline (B) 26.5 10.8 17.0 16.9 15.9 22.7 9.4 14.9 14.7 13.6
#I2 B+CatRepeat 25.3 9.9 14.5 14.7 13.3 21.6 8.6 13 13 11.4
#I3 B+CatXL 26.2 12.6 26.1 25.9 26.5 22.6 11.1 22.7 22.5 22.3
#I4 B+CatSL 26.1 13.2 26.1 25.9 26.5 22.6 11.4 22.9 22.6 22.3
#I5 B+NoisySrc 25.2 10.5 16.2 16.0 15.2 21.2 9.1 14.3 14.1 12.9
#I6 B+DenoiseTgt 26.7 40.4 17.9 17.7 16.6 23.2 39.7 15.7 15.4 14.1
#I7 B+CatXL+DenoiseTgt 26.1 55.2 26.3 26.0 26.4 22.6 53.4 23.0 22.6 22.4
Table 5: Indic→English BLEU scores for models trained on in-domain training data only. Abbreviations: Orig: average across the ten languages' original held-out sets, C-: consecutive sentences, R-: random sentences, TL: target language (i.e., English), SL: same language, XL: cross-language.
Dev Test
ID All-data Orig C-TL C-SL C-XL R-XL Orig C-TL C-SL C-XL R-XL
#A1 Baseline (B) 35.0 43.1 30.0 29.5 28.2 32.4 42.2 27.8 27.3 26.1
#A2 B+CatRepeat 34.5 43.7 30.3 29.9 28.8 32.0 42.9 28.0 27.6 26.3
#A3 B+CatXL 34.1 53.3 31.9 33.7 34.4 31.6 52.4 29.7 31.0 31.2
#A4 B+CatSL 33.6 54.0 32.5 32.2 34.3 31.3 53.3 30.4 29.9 31.1
#A5 B+NoisySrc 34.9 42.1 29.8 29.2 27.8 32.3 41.7 27.6 27.1 25.8
#A6 B+DenoiseTgt 33.3 60.0 28.9 28.4 27.3 31.3 59.4 27.1 26.5 25.4
#A7 B+CatXL+DenoiseTgt 33.3 65.8 31.1 33.0 33.6 31.0 64.7 28.9 30.4 30.3
Table 6: Indic→English BLEU scores for models trained on all data. (Abbreviations are the same as in Table 5.)
Dev Test
ID C-TL C-SL C-XL R-XL C-TL C-SL C-XL R-XL
#A1 Baseline (B) 14.3 10.4 10.3 10.1 14.3 10.6 10.5 10.3
#A2 B+CatRepeat 12.3 8.9 8.9 8.6 12.5 9.0 9.0 8.7
#A3 B+CatXL 5.8 7.2 4.3 4.3 5.8 7.2 4.4 4.3
#A4 B+CatSL 5.3 6.2 6.1 5.2 5.4 6.2 6.2 5.2
#A5 B+NoisySrc 17.4 16.1 16.1 15.8 17.5 16.2 16.2 15.9
#A6 B+DenoiseTgt 7.9 8.3 8.4 8.0 8.1 8.5 8.5 8.1
#A7 B+CatXL+DenoiseTgt 4.3 6.8 3.9 4.1 4.4 7.0 4.0 4.1
Table 7: Cross-attention bleed rate (lower is better). All numbers are scaled from $[0,1]$ to $[0,100]$ for easier interpretation, and the best settings per test are indicated in bold font. Models trained on concatenated sentences have a lower attention bleed rate. Denoising is better than the baseline, but not by as much as concatenation. The lowest bleed rate is achieved by using both the concatenation and denoising methods. (Abbreviations are the same as in Table 5.)

5 Results and Analysis

We train multilingual many-to-one models for both the in-domain and the all-data settings. Table 5 presents our results on the limited-quantity in-domain dataset. The baseline model (#I1) performs strongly on individual sentences but degrades on held-out sets involving missed sentence segmentation and code-switching. The concatenation experiments, CatXL (#I3) and CatSL (#I4), appear to make no improvement on the regular held-out sets but improve BLEU significantly on C-SL, C-XL, and R-XL; both show a similar trend. While they also make a small gain on the C-TL setting, the DenoiseTgt method is clearly the best performer on C-TL. The model that includes both concatenation and denoising (#I7) achieves consistent gains across all the robustness check columns. In contrast, the CatRepeat (#I2) and NoisySrc (#I5) methods show no gains.

Our results from the all-data setup are provided in Table 6. While none of the augmentation methods surpasses baseline BLEU on the regular held-out sets (i.e., the Orig column), their improvements to robustness mirror those of the in-domain setup. We show a qualitative example in Table 8.

Table 8: Example translations from the models trained in the all-data setup. See Table 6 for quantitative scores of these models, and Figures 1 and 2 for a visualization of cross-attention.

5.1 Attention Bleed

Figures 1 and 2 visualize cross-attention (also known as encoder-decoder attention) from our baseline model without augmentation as well as from models trained with augmentation. Generally, the NMT decoder is run autoregressively; however, to facilitate the analysis described in this section, we force-decode reference translations and extract cross-attention tensors from all models. The cross-attention visualization between a pair of concatenated sentences, say $x_1 \oplus x_2 \Rightarrow y_1 \oplus y_2$, shows that models trained on augmented datasets have less cross-attention mass across sentences, i.e., in the attention grid regions pairing $y_1$ with $x_2$ and $y_2$ with $x_1$. We call attention mass in such regions attention bleed. This observation confirms some of the findings suggested by Nguyen et al. (2021). We quantify attention bleed as follows: consider a Transformer NMT model with $L$ layers, each having $H$ attention heads, and a held-out dataset of $N$ segments. Furthermore, let each segment be a concatenation of two sentences, i.e., $x = x_1 \oplus x_2$ with reference $y = y_1 \oplus y_2$ and known sentence boundaries. Let $|x|$ and $|y|$ be the sequence lengths after BPE segmentation, and let $s$ and $t$ be the indices of the end of the first sentence (i.e., the sentence boundary) on the source and target sides, respectively. The average attention bleed across all segments, layers, and heads is defined as:

$$\beta = \frac{1}{N \times L \times H} \sum_{n=1}^{N} \sum_{l=1}^{L} \sum_{h=1}^{H} \beta_{n,l,h} \quad (9)$$

where $\beta_{n,l,h}$ is the attention bleed rate in attention head $h$ of layer $l$ for a single record $n$. To compute $\beta_{n,l,h}$, consider that the attention grid is of size $|y| \times |x|$. Then

$$\beta_{n,l,h} = \frac{1}{|y|} \left( \sum_{j=1}^{t} \sum_{i=s+1}^{|x|} a_{j,i} + \sum_{j=t+1}^{|y|} \sum_{i=1}^{s} a_{j,i} \right) \quad (10)$$

where $a_{j,i}$ is the fraction of attention paid to source position $i$ by target position $j$ at decoder layer $l$ and head $h$ in record $n$; since each target position's attention sums to 1, normalizing by $|y|$ keeps $\beta_{n,l,h}$ in $[0,1]$. Intuitively, a lower value of $\beta$ is better, as it indicates that the model has learned to pay attention to the appropriate regions. As shown in Table 7, the models trained on augmented sentences achieve lower attention bleed.
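The following is a minimal sketch of Equation 10 for a single attention grid, assuming `attn` is a $|y| \times |x|$ matrix of cross-attention weights whose rows each sum to 1, with `s` and `t` the boundary indices defined above:

```python
import torch

# Minimal sketch of the per-grid attention bleed rate (Eq. 10). `attn` is a
# [tgt_len, src_len] cross-attention matrix for one layer/head/record; rows
# sum to 1. `s` and `t` are the lengths (in subword tokens) of the first
# source and target sentences, i.e., the sentence-boundary indices.
def bleed_rate(attn: torch.Tensor, s: int, t: int) -> float:
    tgt_len = attn.shape[0]
    bleed = attn[:t, s:].sum()   # first target sentence attending to second source sentence
    bleed += attn[t:, :s].sum()  # second target sentence attending to first source sentence
    return (bleed / tgt_len).item()  # normalize so the rate lies in [0, 1]

# Averaging bleed_rate over all records, layers, and heads yields Eq. 9.
```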

(a) Baseline model without sentence concatenation (#A1)
(b) Model trained with concatenated sentences (#A3)
Figure 1: Cross-attention visualization from the baseline model and the concatenated (cross-language) model. For each position in the grid, only the maximum value across all attention heads from all layers is visualized. Darker color implies more attention weight, and black bars indicate sentence boundaries. The model trained on concatenated sentences has more pronounced cross-attention boundaries than the baseline, indicating that less mass is bled across sentences.
(a) Model trained with DenoiseTgt augmentation (#A6)
(b) Model trained with both CatXL and DenoiseTgt augmentations (#A7)
Figure 2: Cross-attention visualization (continued from Figure 1). The model trained with both concatenated and denoised sentences has the least attention mass across sentences.

5.2 Sentence Concatenation Generalization

In the previous sections, only two-segment concatenation was explored; here, we investigate whether concatenating more segments further improves model performance, and whether models trained on two segments generalize to more than two at test time. We prepare a training dataset with up to four concatenated sentences and evaluate on held-out sets with up to four concatenated sentences. As shown in Table 9, the model trained with just two-segment concatenation achieves similar BLEU to the model trained with up to four concatenations.

Dev Test
C-SL C-4SL C-SL C-4SL
Baseline / no join 30.0 27.8 27.8 25.7
Up to two joins 31.9 28.9 29.7 26.7
Up to four joins 31.0 28.9 28.8 26.8
Table 9: Indic→English BLEU on held-out sets containing up to 4 consecutive sentence concatenations in the same language (C-4SL). The two-sentence dataset (C-SL) is also given for comparison. The model trained on two concatenated sentences achieves comparable results on C-4SL, indicating that no further gains are obtained from increasing concatenation in training.

6 Related Work

Robustness and Code-Switching:

MT robustness has been investigated before within the scope of bilingual translation settings. Some of those efforts include robustness against input perturbations Cheng et al. (2018), naturally occurring noise Vaibhav et al. (2019), and domain shift Müller et al. (2020). However, as we have shown in this work, multilingual translation models introduce new aspects of robustness to be desired and evaluated. The robustness checklist proposed by Ribeiro et al. (2020) for NLP modeling in general does not cover translation tasks, whereas our work focuses entirely on the multilingual translation task. Clinchant et al. (2019) and Niu et al. (2020) create synthetic test sets to increase test coverage; however, unlike our work, their synthetic tests do not simulate CS. Belinkov and Bisk (2018) investigate the effect of noise on character-based NMT and find that excess noise is detrimental to performance, as models are brittle. Yang et al. (2020) artificially create CS text via unsupervised lexicon induction for pretraining NMT in bilingual settings, and Song et al. (2019) create CS training data to achieve lexically constrained translation; however, neither investigates a model's ability to translate CS text at evaluation time.

Augmentation Through Concatenation:

Concatenation has been used before as a simple-to-incorporate augmentation method. Concatenation can be limited to consecutive sentences as a means of providing extended context for translation Tiedemann and Scherrer (2017); Agrawal et al. (2018), or can additionally put random sentences together, which has been shown to yield gains in low-resource settings Nguyen et al. (2021); Kondo et al. (2021). While in a multilingual setting such as ours data scarcity is less of a concern, since multiple corpora are combined, concatenation is still helpful for preparing the model for scenarios where code-switching is plausible. Besides data augmentation, concatenation has also been used to train multi-source NMT models. Multi-source models Och and Ney (2001) translate multiple semantically-equivalent source sentences into a single target sentence. Dabre et al. (2017) show that by concatenating the source sentences (equivalent sentences from different languages), they are able to train a single-encoder NMT model that is competitive with models that use separate encoders for different source languages. Back-translation Sennrich et al. (2016a) is another useful data augmentation method; however, it is more expensive when the source side has many languages, and it does not focus on code-switching.

Attention Weights:

The attention mechanism Bahdanau et al. (2015) enables the NMT decoder to choose which part of the input to focus on during its stepwise generation. The attention distributions learned while training a machine translation model, as an indicator of the context on which the decoder is focusing, have been used to obtain word alignments Garg et al. (2019); Zenkel et al. (2019, 2020); Chen et al. (2020). In this work, by visualizing attention weights, we show how augmenting the training data guides attention to focus more neatly on the sentence of interest while decoding its corresponding target sentence. We also quantify this effect by introducing the attention bleed metric.

7 Conclusion

We have described simple but effective checks for improving test coverage in multilingual NMT (Section 2) and have explored training data augmentation methods such as sentence concatenation and noise addition (Section 3). Using a many-to-one multilingual setup, we have investigated the relationship between these augmentation methods and their impact on robustness in multilingual translation. While the methods are useful in limited training data settings, their impact may not be visible on single-sentence test sets in a high-resource setting; however, our proposed evaluation checks reveal the robustness improvements in both the low-resource and high-resource settings. We have conducted a glass-box analysis of cross-attention in Transformer NMT, showing both visually and quantitatively that models trained with augmentations, specifically sentence concatenation and target-sentence denoising, learn a more sharply focused attention mechanism (Section 5.1). Finally, we have determined that two-sentence concatenation in training corpora generalizes sufficiently to many-sentence concatenation at inference time (Section 5.2).

Acknowledgements

This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via AFRL Contract FA8650-17-C-9116. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.

Limitations

  1. This work is focused on translating CS input and does not attempt to generate CS text during translation. We consider the general problem of many-to-many translation with CS text on both input and output a promising future direction.

  2. As mentioned in Section 2, some of the multilingual evaluation checks require the datasets to be multi-parallel and coherent in sentence order. When neither multi-parallelism nor coherency in the held-out set sentence order is available, we recommend R-XL.

  3. While the proposed checks serve as starting points for testing CS, we do not claim that they exhaustively cover all manner of CS. The proposed checks specifically simulate inter-sentential CS; intra-sentential CS checks are left for future work.

  4. We have investigated robustness under Indic-English translation tasks, where all languages use space characters as word-breakers; we have not investigated other languages such as Chinese, Thai, etc. We use the term Indic language to collectively reference 10 Indian languages only, per the MultiIndicMT shared task. While the remaining Indian languages and their dialects are not covered, we believe that the approaches discussed in this work generalize to other languages in the same families.

References

  • R. R. Agrawal, M. Turchi, and M. Negri (2018) Contextual handling in neural machine translation: look behind, ahead and on both sides. In 21st Annual Conference of the European Association for Machine Translation, pp. 11–20. Cited by: §6.
  • M. Artetxe, G. Labaka, E. Agirre, and K. Cho (2018) Unsupervised neural machine translation. In International Conference on Learning Representations, External Links: Link Cited by: item DenoiseTgt:.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §1, §6.
  • S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, pp. 65–72. External Links: Link Cited by: §1.
  • M. Bañón, P. Chen, B. Haddow, K. Heafield, H. Hoang, M. Esplà-Gomis, M. L. Forcada, A. Kamran, F. Kirefu, P. Koehn, S. Ortiz Rojas, L. Pla Sempere, G. Ramírez-Sánchez, E. Sarrías, M. Strelec, B. Thompson, W. Waites, D. Wiggins, and J. Zaragoza (2020) ParaCrawl: web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4555–4567. External Links: Link, Document Cited by: §3.1.
  • Y. Belinkov and Y. Bisk (2018) Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations, External Links: Link Cited by: §6.
  • I. Caswell (2020) Google translate adds five languages. Google. Note: Accessed: 2022-01-14 External Links: Link Cited by: §1.
  • Y. Chen, Y. Liu, G. Chen, X. Jiang, and Q. Liu (2020) Accurate word alignment induction from neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 566–576. External Links: Link, Document Cited by: §6.
  • Y. Cheng, Z. Tu, F. Meng, J. Zhai, and Y. Liu (2018) Towards robust neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1756–1766. External Links: Link, Document Cited by: §6.
  • S. Clinchant, K. W. Jung, and V. Nikoulina (2019) On the use of BERT for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, pp. 108–117. External Links: Link, Document Cited by: §6.
  • G. Doddington (2002) Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Second International Conference on Human Language Technology Research, HLT ’02, San Francisco, CA, USA, pp. 138–145. External Links: Link Cited by: §1.
  • S. Garg, S. Peitz, U. Nallasamy, and M. Paulik (2019) Jointly learning to align and translate with transformer models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4453–4462. External Links: Link, Document Cited by: §6.
  • T. Gowda and J. May (2020) Finding the optimal vocabulary size for neural machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 3955–3964. External Links: Link, Document Cited by: §4.2.
  • T. Gowda, W. You, C. Lignos, and J. May (2021) Macro-average: rare types are important too. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 1138–1157. External Links: Link, Document Cited by: §1.
  • T. Gowda, Z. Zhang, C. Mattmann, and J. May (2021) Many-to-English machine translation tools, data, and pretrained models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Online, pp. 306–316. External Links: Link, Document Cited by: §1.
  • A. Gupta, A. Vavre, and S. Sarawagi (2021) Training data augmentation for code-mixed translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 5760–5766. External Links: Link, Document Cited by: §1.
  • M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean (2017) Google’s multilingual neural machine translation system: enabling zero-shot translation. Transactions of the Association for Computational Linguistics 5, pp. 339–351. External Links: Link, Document Cited by: §1.
  • S. Kondo, K. Hotate, T. Hirasawa, M. Kaneko, and M. Komachi (2021) Sentence concatenation approach to data augmentation for neural machine translation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, Online, pp. 143–149. External Links: Link, Document Cited by: §6.
  • G. Lample, A. Conneau, L. Denoyer, and M. Ranzato (2018) Unsupervised machine translation using monolingual corpora only. In International Conference on Learning Representations, External Links: Link Cited by: item DenoiseTgt:.
  • K. D. Mohan and J. Skotdal (2021) Microsoft translator: now translating 100 languages and counting!. Microsoft Research Blog. Note: Accessed: 2022-01-14 External Links: Link Cited by: §1.
  • M. Müller, A. Rios, and R. Sennrich (2020) Domain robustness in neural machine translation. In Proceedings of the 14th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track), Virtual, pp. 151–164. External Links: Link Cited by: §6.
  • C. Myers-Scotton and W. Ury (1977) Bilingual strategies: the social functions of code-switching. Linguistics: An Interdisciplinary Journal of the Language Sciences 1977 (13), pp. 5–20. External Links: Document, Link Cited by: §1.
  • C. Myers-Scotton (1989) Codeswitching with english: types of switching, types of communities. World Englishes 8 (3), pp. 333–346. Cited by: §1.
  • T. Nakazawa, H. Nakayama, C. Ding, R. Dabre, S. Higashiyama, H. Mino, I. Goto, W. Pa Pa, A. Kunchukuttan, S. Parida, O. Bojar, C. Chu, A. Eriguchi, K. Abe, Y. Oda, and S. Kurohashi (2021) Overview of the 8th workshop on Asian translation. In Proceedings of the 8th Workshop on Asian Translation (WAT2021), Online, pp. 1–45. External Links: Link, Document Cited by: §4.1.
  • T. Q. Nguyen, K. Murray, and D. Chiang (2021) Data augmentation by concatenation for low-resource translation: a mystery and a solution. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), Bangkok, Thailand (online), pp. 287–293. External Links: Link, Document Cited by: §3.1, §3.1, §5.1, §6.
  • C. Nilep (2006) “Code switching” in sociocultural linguistics. Colorado Research in Linguistics 19. External Links: Link, Document Cited by: §1.
  • X. Niu, P. Mathur, G. Dinu, and Y. Al-Onaizan (2020) Evaluating robustness to input perturbations for neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 8538–8544. External Links: Link, Document Cited by: §6.
  • F. J. Och and H. Ney (2001) Statistical multi-source translation. In Proceedings of Machine Translation Summit VIII, Santiago de Compostela, Spain. External Links: Link Cited by: §6.
  • M. Popović (2015) ChrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, pp. 392–395. External Links: Link, Document Cited by: §1.
  • M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh (2020) Beyond accuracy: behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4902–4912. External Links: Link, Document Cited by: §6.
  • R. Sennrich, B. Haddow, and A. Birch (2016a) Improving neural machine translation models with monolingual data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 86–96. External Links: Link, Document Cited by: item NoisySrc:, §6.
  • R. Sennrich, B. Haddow, and A. Birch (2016b) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: §4.2.
  • M. Snover, B. Dorr, R. Schwartz, L. Micciulla, and J. Makhoul (2006) A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Cambridge, Massachusetts, USA, pp. 223–231. External Links: Link Cited by: §1.
  • K. Song, Y. Zhang, H. Yu, W. Luo, K. Wang, and M. Zhang (2019) Code-switching for enhancing NMT with pre-specified translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 449–459. External Links: Link, Document Cited by: §6.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q. Weinberger (Eds.), Vol. 27. External Links: Link Cited by: §1.
  • J. Tiedemann and Y. Scherrer (2017) Neural machine translation with extended context. In Proceedings of the Third Workshop on Discourse in Machine Translation, Copenhagen, Denmark, pp. 82–92. External Links: Link, Document Cited by: §6.
  • J. Tiedemann (2020) The tatoeba translation challenge – realistic data sets for low resource and multilingual MT. In Proceedings of the Fifth Conference on Machine Translation, Online, pp. 1174–1182. External Links: Link Cited by: §1.
  • V. Vaibhav, S. Singh, C. Stewart, and G. Neubig (2019) Improving robustness of machine translation with synthetic noise. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1916–1920. External Links: Link, Document Cited by: §6.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. External Links: Link Cited by: §1, §4.2.
  • Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. External Links: Link Cited by: §1, §4.2.
  • Z. Yang, B. Hu, A. Han, S. Huang, and Q. Ju (2020) CSP:code-switching pre-training for neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 2624–2636. External Links: Link, Document Cited by: §6.
  • T. Zenkel, J. Wuebker, and J. DeNero (2019) Adding interpretable attention to neural translation models improves word alignment. CoRR abs/1901.11359. External Links: Link, 1901.11359 Cited by: §6.
  • T. Zenkel, J. Wuebker, and J. DeNero (2020) End-to-end neural word alignment outperforms GIZA++. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1605–1617. External Links: Link, Document Cited by: §6.
  • B. Zhang, P. Williams, I. Titov, and R. Sennrich (2020) Improving massively multilingual neural machine translation and zero-shot translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1628–1639. External Links: Link, Document Cited by: §1.