A Discriminative Neural Model for Cross-Lingual Word Alignment

09/01/2019 ∙ by Elias Stengel-Eskin, et al. ∙ Johns Hopkins University

We introduce a novel discriminative word alignment model, which we integrate into a Transformer-based machine translation model. In experiments based on a small number of labeled examples (1.7K-5K sentences) we evaluate its performance intrinsically on both English-Chinese and English-Arabic alignment, where we achieve major improvements over unsupervised baselines (11-27 F1). We evaluate the model extrinsically on data projection for Chinese NER, showing that our alignments lead to higher performance when used to project NER tags from English to Chinese. Finally, we perform an ablation analysis and an annotation experiment that jointly support the utility and feasibility of future manual alignment elicitation.




1 Introduction

Neural-network-based models for Machine Translation (MT) have set new standards for performance, especially when large amounts of parallel text (bitext) are available. However, explicit word-to-word alignments, which were foundational to pre-neural statistical MT (SMT) (Brown et al., 1993), have largely been lost in neural MT (NMT) models. This is unfortunate: while alignments are not necessary for NMT systems, they have a wealth of applications in downstream tasks, such as transferring input formatting, incorporating lexica, and human-in-the-loop translation. Crucially, they are central to cross-lingual dataset creation via projection (Yarowsky et al., 2001), where token-level annotations in a high-resource language are projected across alignments to a low-resource language; using projection, datasets have been created for a variety of natural language processing tasks, including named-entity recognition (NER), part-of-speech tagging, parsing, information extraction (IE), and semantic role labeling (Yarowsky et al., 2001; Hwa et al., 2005; Riloff et al., 2002; Padó and Lapata, 2009; Xi and Hwa, 2005). This paradigm allows researchers to address a lack of annotated multilingual resources and changing ontologies with minimal annotation effort.

Figure 1: One application of alignments is to project annotations (e.g., Named Entities) across bitexts. Here we see a lack of correspondence between FastAlign (red) alignments and gold alignments (blue).

The introduction of attention (Bahdanau et al., 2014) has allowed NMT to advance past decoding from a single sentence-level vector representation and has been crucial in supporting fluent, meaningful translations, especially of longer sentences. However, despite their intuitive analogy to word alignments, attention-based alignments typically do not correspond very closely to gold-standard human annotations and suffer from systematic errors (Koehn and Knowles, 2017; Ghader and Monz, 2017), falling short of the promise to jointly learn to align and translate. This problem has only been exacerbated by the introduction of deeper models containing multiple attention layers and heads (e.g. Vaswani et al., 2017). There have been several attempts to mitigate this discrepancy, mostly with a view to improving the quality of output translations (Alkhouli et al., 2016, 2018; Chen et al., 2016; Liu et al., 2016) – though some work has focused specifically on alignment (Legrand et al., 2016; Zenkel et al., 2019). We confirm that the poor performance of encoder-decoder attention for Recurrent Neural Networks (RNNs) found by Koehn and Knowles (2017) and Ghader and Monz (2017) can also be observed in the self-attentional Transformer model (Vaswani et al., 2017), further motivating an approach focusing explicitly on alignment quality.

We introduce a novel alignment module that learns to produce high-quality alignments after training on 1.7K and 4.9K human-annotated sentence pairs for Arabic and Chinese, respectively. While our module is integrated into the state-of-the-art Transformer model (Vaswani et al., 2017), the implementation is architecture-neutral, and can therefore be applied to RNN-based models (Schwenk, 2012; Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014) or fully convolutional models (Gehring et al., 2017). Our experiments on English-Chinese and English-Arabic bitext show that our model yields major improvements of 11 and 27 alignment F1 points (respectively) over baseline models. Ablation experiments indicate that with only half the labelled data, minor improvements over the baseline can still be observed for Chinese, and that the amount of annotated alignment data has a far greater impact on alignment score than the amount of unlabelled bitext used for pre-training.

Furthermore, our NER projection trials demonstrate a major downstream improvement when using our alignments over FastAlign. Taken together, these results motivate the further annotation of small amounts of quality alignment data in a variety of languages. We demonstrate that such annotation can be performed rapidly by untrained L2 (second language) speakers.

2 Related work

Statistical Alignment

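Generative alignment (Brown et al., 1993) models the posterior over a target sequence $t$ given a source sequence $s$ by marginalizing over latent alignments $a$:

$$p(t \mid s) = \sum_{a} p(t, a \mid s) = \sum_{a} \prod_{i} p(t_i \mid a_{\le i}, t_{<i}, s)\, p(a_i \mid a_{<i}, t_{<i}, s)$$

Such models are asymmetrical: they are learned in both directions and then heuristically combined.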

Discriminative alignment models, on the other hand, directly model $p(a \mid s, t)$, usually by extracting features from the source and target sequences and training a supervised classifier using labelled data. A comprehensive account of methods and features for discriminative alignment can be found in Tomeh (2012).
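The heuristic combination of two directional alignments can be illustrated with the two simplest strategies, intersection and union. This is only a minimal sketch: the function name and the toy link sets are hypothetical, and practical heuristics such as grow-diag-final-and interpolate between these two extremes.

```python
def symmetrize(src2tgt, tgt2src):
    """Combine two directional alignments.

    src2tgt holds (src_idx, tgt_idx) links; tgt2src holds (tgt_idx, src_idx)
    links produced by the model trained in the reverse direction.
    """
    # Flip the reverse-direction links so both sets use (src_idx, tgt_idx).
    flipped = {(s, t) for (t, s) in tgt2src}
    intersection = src2tgt & flipped  # links both directions agree on
    union = src2tgt | flipped         # links proposed by either direction
    return intersection, union


# Toy example (hypothetical links).
forward = {(0, 0), (1, 2), (2, 1)}
backward = {(0, 0), (2, 1), (3, 3)}  # stored as (tgt_idx, src_idx)
inter, union = symmetrize(forward, backward)
```

The intersection tends to favor precision and the union recall, which is why practical heuristics start from the former and grow towards the latter.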

Neural Machine Translation

The attentional encoder-decoder proposed by Bahdanau et al. (2014) greatly improved the initial sequence-to-sequence architecture; rather than conditioning each target word on the final hidden unit of the encoder, each target word prediction is conditioned on a weighted average of all the encoder hidden units. However, the weights in the average do not constitute a soft alignment, as confirmed by Koehn and Knowles (2017) and Ghader and Monz (2017). Our findings align with those of Ghader and Monz (2017), who provide a detailed analysis of where attention and alignment diverge, offering possible explanations for the modest results obtained when attempting to bias attention weights towards alignments. For example, Chen et al. (2016) introduce a loss over the attention weight that penalizes the model when the attention deviates from alignments; this yields only minor BLEU score improvements. (Footnote 1: This may also be a result of noisy training data, as they use unsupervised alignments as training data for their models.)

Neural Alignment Models

Legrand et al. (2016) develop a neural alignment model that uses a convolutional encoder for the source and target sequences and a negative-sampling-based objective. Tamura et al. (2014) introduce a supervised RNN-based aligner which conditions each alignment decision on the source and target sequences as well as previous alignment decisions. Alkhouli et al. (2018) extend the Transformer architecture for alignment by adding an additional alignment head to the multi-head encoder-decoder attention, biasing the attention weights to correspond with alignments. Similarly, Zenkel et al. (2019) introduce a single-layer attention module (akin to the standard Transformer decoder) which predicts the target word conditioned on the encoder and decoder states, following the intuition that a source word should be aligned to a target word if it is highly predictive of the target word. Unlike our model, all of these models are asymmetrical and are trained without explicit alignment data.

Following the generative formulation of the alignment problem, Alkhouli et al. (2016) present neural lexical translation and alignment models, which they train using silver-standard alignments obtained from GIZA++. A similar training strategy is used in Alkhouli and Ney (2017), who bias an RNN attention module with silver-standard alignments for use as a lexical model. In the same vein, Peter et al. (2017) improve the attention module by allowing it to peek at the next target word.

Note that unlike our framework, none of the models mentioned make use of existing gold-standard labeled data. Following Legrand et al. (2016) and Zenkel et al. (2019) we focus exclusively on alignment quality. A similar approach is taken by Ouyang and McKeown (2019), who introduce a pointer network-based model for monolingual phrase alignment which allows for the alignment of variable-length phrases.

3 Model

NMT systems typically use an encoder-decoder architecture, where one neural network (the encoder) learns to provide another network (the decoder) with a continuous representation of the input sequence of words in the source language, $s_{1:N}$, from which the decoder can accurately predict the target sequence $t_{1:M}$. The intermediate representations of the encoder and decoder are known as hidden states. After training, the hidden states of the encoder ($\mathbf{s}_{1:N}$) and decoder ($\mathbf{t}_{1:M}$) contain information about their co-indexed source and target words as well as their respective contexts; this can be thought of as a rough approximation of the semantic content of the source and target words. In an MT model (such as the Transformer-based model used here) the states of the encoder and decoder networks also bear information on cross-lingual correspondence.

From these representations, we can derive a natural formulation of the alignment problem: words of the source language should be aligned to the words in the target language that convey roughly the same meaning in the translated sentence, i.e. their contextual vector-based meaning representations are similar.

We use the dot product as an unnormalized measure of distance between each pair of vectors $\mathbf{s}_i$ and $\mathbf{t}_j$. In order to apply a distance metric, the two vectors are projected into a shared space via a shared three-layer feed-forward neural network with tanh activations and a linear output layer, producing source and target matrices $S'$ and $T'$. (Footnote 2: We use 512 hidden units in each layer.) This is given by:

$$\mathbf{s}'_i = W_3 \tanh(W_2 \tanh(W_1 \mathbf{s}_i)), \qquad \mathbf{t}'_j = W_3 \tanh(W_2 \tanh(W_1 \mathbf{t}_j))$$

Taking the matrix product of the projected source and target sequences along their shared dimension yields a matrix $A$ of semantic “links”, where each element $a_{ij}$ is an un-normalized distance between vectors $\mathbf{s}'_i$ and $\mathbf{t}'_j$, i.e.

$$A = S' T'^{\top}$$

These distances will be passed through a sigmoid function, normalizing them to probabilities $p(a_{ij} \mid \mathbf{s}_i, \mathbf{t}_j)$. However, alignments are context-sensitive: the decision to align source word $s_i$ to target word $t_j$ may affect the probability of aligning the neighbors of $s_i$ and $t_j$. Therefore, before applying the sigmoid function to the matrix $A$ we convolve it with a shared weight matrix $W_{conv}$, yielding a matrix $A'$, which encodes the relationships between alignments; this is given by $A' = W_{conv} * A$, where $*$ is the convolution operator. This can be thought of as a way of conditioning each alignment decision on its neighbors, modeling $p(a_{ij} \mid s_{1:N}, t_{1:M}, a_{i' \neq i, j' \neq j})$. This step is crucial: without a convolution, F1 results hover below 50.

Finally, we apply the sigmoid function $\sigma$ to $A'$ and treat each alignment decision as a separate binary classification problem, using binary cross entropy loss:

$$\mathcal{L} = -\sum_{i=1}^{N} \sum_{j=1}^{M} \left[ \hat{a}_{ij} \log \sigma(a'_{ij}) + (1 - \hat{a}_{ij}) \log\left(1 - \sigma(a'_{ij})\right) \right]$$

where $\hat{a}_{ij} = 1$ if source word $i$ is aligned to target word $j$ in the gold-standard data, and $\hat{a}_{ij} = 0$ otherwise. Note that the convolution step ensures these are not independent alignment decisions.
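The scoring pipeline described in this section (shared projection, dot-product link matrix, convolution, sigmoid, binary cross entropy) can be sketched in a few lines of dependency-free Python. This is a toy illustration under stated assumptions, not the authors' Sockeye implementation: the dimensions, the random weights, the 3x3 kernel size, the zero padding, and the omission of bias terms are all choices made here for brevity.

```python
import math
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def project(x, W1, W2, W3):
    """Shared three-layer projection: tanh, tanh, then a linear output layer."""
    h1 = [math.tanh(v) for v in matvec(W1, x)]
    h2 = [math.tanh(v) for v in matvec(W2, h1)]
    return matvec(W3, h2)


def conv2d_same(A, K):
    """Zero-padded 2D convolution (cross-correlation form) with an odd square kernel."""
    n, m, k = len(A), len(A[0]), len(K) // 2
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for di in range(-k, k + 1):
                for dj in range(-k, k + 1):
                    if 0 <= i + di < n and 0 <= j + dj < m:
                        out[i][j] += K[di + k][dj + k] * A[i + di][j + dj]
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def align_probs(S, T, W1, W2, W3, K):
    """Return an N x M matrix of link probabilities for hidden states S and T."""
    S_proj = [project(s, W1, W2, W3) for s in S]
    T_proj = [project(t, W1, W2, W3) for t in T]
    # A[i][j] is the dot product of projected source state i and target state j.
    A = [[sum(a * b for a, b in zip(s, t)) for t in T_proj] for s in S_proj]
    A_conv = conv2d_same(A, K)
    return [[sigmoid(v) for v in row] for row in A_conv]


def bce_loss(P, gold):
    """Summed binary cross entropy; gold is a set of (src_idx, tgt_idx) links."""
    eps = 1e-12
    total = 0.0
    for i, row in enumerate(P):
        for j, p in enumerate(row):
            y = 1.0 if (i, j) in gold else 0.0
            total -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return total


# Toy usage with random "hidden states" (hypothetical sizes: N=3, M=5, d=4).
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


d = 4
W1, W2, W3 = rand_matrix(d, d), rand_matrix(d, d), rand_matrix(d, d)
K = rand_matrix(3, 3)
S, T = rand_matrix(3, d), rand_matrix(5, d)
P = align_probs(S, T, W1, W2, W3, K)
loss = bce_loss(P, gold={(0, 0), (1, 2), (2, 4)})
```

Because the same projection weights are applied to both sides, the scorer treats source and target symmetrically; the convolution is the only place where neighboring link scores interact.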

When training our model, we begin by pre-training an MT model on unlabelled bitext; the weights of this model are used to initialize the encoder and decoder in the alignment model. To obtain an alignment at test time, the source and target word sequences are encoded and presented to the aligner.

4 Alignment Experiments

4.1 Baselines

We compare against two baselines: alignments obtained from FastAlign (Dyer et al., 2013), and attention-based alignments. FastAlign is a fast log-linear reparameterization of IBM Model 2 (Brown et al., 1993).

Unlike in the RNN case, where a single attention layer intervenes between the encoder and decoder, Transformers make use of multi-head and multi-layer attention. Our attentional alignments are obtained by averaging across all heads of the final multi-head encoder-decoder attention layer.
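As a rough illustration of this baseline, the sketch below averages a stack of per-head attention matrices and keeps the links whose averaged weight reaches a threshold. The tensor layout, function names, and toy weights are assumptions made for the example; extracting the attention weights from a trained model is not shown.

```python
def average_heads(attention):
    """attention[h][j][i]: weight that head h puts on source i when predicting target j."""
    n_heads = len(attention)
    n_tgt, n_src = len(attention[0]), len(attention[0][0])
    return [[sum(attention[h][j][i] for h in range(n_heads)) / n_heads
             for i in range(n_src)]
            for j in range(n_tgt)]


def threshold_links(avg, threshold):
    """Return (src_idx, tgt_idx) links whose averaged weight reaches the threshold."""
    return {(i, j)
            for j, row in enumerate(avg)
            for i, w in enumerate(row)
            if w >= threshold}


# Two heads over a 2-token source and a 2-token target (toy weights).
heads = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.4, 0.6]],
]
avg = average_heads(heads)
links = threshold_links(avg, threshold=0.5)
```

In this toy case one head is sharp and the other nearly uniform, so averaging dilutes the sharp head's signal; this is the kind of noise the averaging baseline is exposed to.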

Both of these methods allow for fast online decoding, making them comparable to our model, and both have been used for word alignment; moreover, FastAlign is a standard choice for projection (Arthur et al., 2016; Fu et al., 2014; Eger et al., 2018; Agić et al., 2016; Sharoff, 2018).

Our model extends the Sockeye implementation of the Transformer model (Hieber et al., 2017). (Footnote 3: We pre-train a 6-layer model with the Adam optimizer (Kingma and Ba, 2014), 8-headed multi-head attention, 2048-dimensional feed-forward layers, and 512-dimensional encoder/decoder output layers. We include weight-tying (Press and Wolf, 2017) between the source and target embedding layers. All other MT hyperparameters were set to the Sockeye defaults.)

Following Dyer et al. (2013), FastAlign hyperparameters were tuned by training on 10K training examples, evaluating on a 100-sentence subset of the validation split. The grow-diag-final-and heuristic was used for symmetrization. Thresholds for both attention alignment and our model were tuned on the same 100-sentence validation subset.

4.2 Data

Chinese character-tokenized bitext and alignments were sourced from the GALE Chinese-English Parallel Aligned Treebank (Li et al., 2015); Arabic data was obtained from the GALE Arabic-English Parallel Aligned Treebank (Li et al., 2013). Both GALE datasets are based on broadcast news corpora; gold-standard alignments were produced by multiple trained human annotators, with several checks for quality including redundant annotation, review by senior annotators, and automatic heuristics. The GALE corpora satisfy the following criteria:

  1. Quality: although some problems were detected in the GALE alignments (e.g. mismatched bitext and formatting errors), the multiple levels of quality control suggest a high level of overall quality.

  2. Size: while other gold-standard alignment corpora exist (e.g. Europarl), these are typically on the order of 500 sentences; conversely, GALE is large enough to be used as a training set in a neural setting.

  3. Annotations: GALE is annotated with a variety of labels, and a subset of the Chinese portion of GALE was annotated for NER as part of the OntoNotes corpus (Weischedel et al., 2013).

Sentences were split into train, test, and validation portions and cleaned by removing extraneous annotations (e.g. timestamp tags, etc.). Corpus statistics are given in Table 1.

Language train dev test
AR 1687 299 315
ZH 4871 596 636
Table 1: Num. annotated alignment sentence pairs.

The discriminative models (referred to as DiscAlign) were initialized with weights from an MT model pre-trained on 4 million sentences, which were tokenized using the Moses tokenizer (Koehn et al., 2007). Chinese data was drawn from the WMT 2017 En-Zh bitext, while the Arabic data was sourced from local resources; both Chinese and Arabic models were trained with a 60K-word vocabulary. For FastAlign, the GALE train split was concatenated to the pretraining bitext, while for both the discriminative and attentional models the GALE data was used to finetune a model pre-trained on bitext alone.

Contemporary MT systems use subword units to tackle the problem of out-of-vocabulary low-frequency words. This is often implemented by byte-pair encoding (BPE) (Sennrich et al., 2016). We applied a BPE model with 30K merge operations to all of our data for the BPE experimental condition. Training alignments were expanded by aligning all subwords of a given source word to all the subwords of the target word to which the source word aligns. At test time, following Zenkel et al. (2019), to reduce BPE alignments to regular alignments, we considered a source and target word to be aligned if any of their subwords were aligned. (Footnote 4: Note that this step is necessary in order to make the produced and reference alignments comparable; to evaluate alignments, the sequences must be identical, or the evaluation is ill-posed.) This heuristic was used to evaluate all BPE models.
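The expansion and reduction heuristics for BPE alignments can be sketched as follows. The function names and the toy segmentation (using the `@@` continuation marker of common BPE tooling) are illustrative assumptions; only the two rules themselves come from the description above.

```python
def subword_spans(segmented_words):
    """Map each word index to the indices of its subwords in the flat BPE sequence."""
    spans, start = [], 0
    for pieces in segmented_words:
        spans.append(list(range(start, start + len(pieces))))
        start += len(pieces)
    return spans


def expand_links(word_links, src_spans, tgt_spans):
    """Training: align every subword of a source word to every subword of its target word."""
    return {(si, tj)
            for (i, j) in word_links
            for si in src_spans[i]
            for tj in tgt_spans[j]}


def reduce_links(subword_links, src_spans, tgt_spans):
    """Test: two words are aligned if any pair of their subwords is aligned."""
    src_word = {si: i for i, span in enumerate(src_spans) for si in span}
    tgt_word = {tj: j for j, span in enumerate(tgt_spans) for tj in span}
    return {(src_word[si], tgt_word[tj]) for (si, tj) in subword_links}


# Toy segmentation: 2 source words (4 subwords), 2 target words (3 subwords).
src = [["un@@", "break@@", "able"], ["glass"]]
tgt = [["verre"], ["in@@", "cassable"]]
src_spans, tgt_spans = subword_spans(src), subword_spans(tgt)

word_links = {(0, 1), (1, 0)}
expanded = expand_links(word_links, src_spans, tgt_spans)
recovered = reduce_links(expanded, src_spans, tgt_spans)
```

Reducing an expanded alignment recovers the original word-level links, and a single predicted subword link such as `(1, 2)` is enough to align source word 0 with target word 1.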

With the task of projection for NER data in mind (Yarowsky et al., 2001), we evaluate all models on Chinese NER spans. These spans were obtained from the OntoNotes corpus, which subsumes the GALE Chinese corpus. Due to formatting differences, only a subset of the GALE sentences (287 validation, 189 test) were recoverable from OntoNotes.

We evaluate our models using macro-F1 score, as Fraser and Marcu (2007) showed that alignment error rate does not match human judgments. F1 scores were obtained using the macro-F1 scorer provided by FastAlign.
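For reference, precision, recall, and F1 over sets of alignment links can be computed as in the sketch below. The macro-averaging shown (a mean of per-sentence F1 scores) is an assumption made for illustration; the exact behavior of the FastAlign scorer is not described here and may differ.

```python
def prf1(predicted, gold):
    """Precision, recall, and F1 between two sets of (src_idx, tgt_idx) links."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def macro_f1(pred_sents, gold_sents):
    """Assumed definition of 'macro': average the per-sentence F1 scores."""
    scores = [prf1(p, g)[2] for p, g in zip(pred_sents, gold_sents)]
    return sum(scores) / len(scores)


# Two toy sentence pairs (hypothetical links).
pred = [{(0, 0), (1, 1), (2, 1)}, {(0, 0)}]
gold = [{(0, 0), (1, 1), (2, 2), (3, 3)}, {(0, 0)}]
p, r, f = prf1(pred[0], gold[0])
score = macro_f1(pred, gold)
```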

Figure 2: (a) Dev. F1 as function of training corpus size (log scale), contrast against FastAlign F1 (dashed). (b) Dev. precision (blue) and recall (orange) at different thresholds.

Method                    P      R      F1
Avg. attention            36.30  46.17  40.65
Avg. attention (BPE)      37.89  49.82  43.05
Avg. attention (NER)      16.57  35.85  22.66
FastAlign                 80.46  50.46  62.02
FastAlign (BPE)           70.41  55.43  62.03
FastAlign (NER)           83.70  49.54  62.24
DiscAlign                 72.92  73.91  73.41
DiscAlign (BPE)           69.36  67.11  68.22
DiscAlign (NER)           74.52  77.05  75.78
DiscAlign (NER) +prec.    84.69  58.41  69.14

Table 2: Precision, Recall, and F1 on Chinese GALE test data. BPE indicates “with BPE”, and NER denotes restriction to NER spans.

Method                    P      R      F1
Avg. attention             8.46  32.50  13.42
Avg. attention (BPE)      10.11  17.27  12.75
FastAlign                 62.26  51.06  56.11
FastAlign (BPE)           62.74  51.25  56.42
DiscAlign                 91.30  75.66  82.74
DiscAlign (BPE)           87.05  76.98  81.71

Table 3: Precision, Recall, and F1 on Arabic GALE test data.

4.3 Results

Our model outperforms both baselines for all languages and experimental conditions. Table 2 shows that our model performs especially well in the NER setting, where we observe the largest improvement over FastAlign (75.78 vs. 62.24 F1). By increasing the threshold above which a predicted link probability is considered an alignment we can obtain high-precision alignments, exceeding FastAlign’s precision and recall (the “+prec.” row of Table 2). The best threshold values were selected on the development set, separately for average attention on data with and without BPE and for the discriminative settings; a higher threshold was used for the high-precision “+prec” setting.

Table 3 further underscores these findings; we observe an even more dramatic increase in performance on Arabic, despite the fact that the model is trained on only 1.7K labelled examples. Furthermore, we see that average attention performs abysmally in Arabic. The best thresholds for Arabic differed substantially from those for Chinese, both for average attention (with and without BPE) and for the discriminative aligner. This is reflected in the high precision of the discriminative aligner.

Figure 2(a) shows the relationship between the number of training examples and validation F1 performance for Chinese, with the performance of FastAlign overlaid as a benchmark. We observe that with a little over half the training examples (2500), we perform comparably to FastAlign.

Figure 3: Chinese F1 performance with respect to the amount of unlabelled pretraining bitext (y axis) and the number of sentences annotated for alignment (x axis); increasing the number of annotated sentences has a far greater impact on the F1 score than increasing the amount of bitext.

In Figure 3, we demonstrate that labelled alignment data has a greater impact than the amount of pre-training bitext. For example, halving the number of pre-training examples from 8 million to 4 million only reduces F1 by 3.69, but halving the number of labelled finetuning examples while keeping the number of pre-training sentences constant at 8 million lowers the performance by 9.44 F1.


Taken together, Table 2 and Figure 2 support the use of discriminative alignment models in settings where test data is presented in an online fashion as well as in cases where high-quality alignments matter. Most importantly, the results in the NER condition highlight the applicability of our model to dataset projection for NER.

Figure 2(b) also shows that by varying the minimum threshold on the predicted alignment probability, we can manipulate the precision-recall tradeoff, giving our model an advantage in scenarios where high-precision alignments are required (e.g., dataset projection) as well as in ones where high recall is preferred (e.g., correction by human annotators).
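The precision-recall tradeoff can be made concrete with a small threshold sweep over a toy probability matrix. The probabilities, gold links, and thresholds below are invented for illustration and are unrelated to the tuned values used in the experiments.

```python
def links_at(probs, threshold):
    """Keep every (src_idx, tgt_idx) whose probability reaches the threshold."""
    return {(i, j)
            for i, row in enumerate(probs)
            for j, p in enumerate(row)
            if p >= threshold}


def precision_recall(predicted, gold):
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


# Toy link probabilities for a 3 x 3 sentence pair.
probs = [
    [0.95, 0.40, 0.05],
    [0.30, 0.85, 0.55],
    [0.10, 0.20, 0.60],
]
gold = {(0, 0), (1, 1), (2, 2)}

sweep = {t: precision_recall(links_at(probs, t), gold) for t in (0.25, 0.5, 0.9)}
```

Raising the threshold from 0.25 to 0.9 in this toy case moves precision from 0.5 to 1.0 while recall falls from 1.0 to one third.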

Our results underscore the poor performance of attentional alignments. One factor driving the low F1 scores observed may be the Transformer’s design: each attention head in the Transformer’s multi-head attention layer learns to attend to different aspects of the encoder states; thus, averaging across all heads is likely to yield even noisier results than the RNN model, where there is effectively only one attention head. Another reason attention performs poorly may be that attention heads are learning task- and dataset- specific patterns that do not match alignments but nevertheless provide useful cues for translation. This would support the observations of Ghader and Monz (2017), who find that for certain parts of speech (e.g. verbs) alignment is very poorly correlated with translation loss.

These findings illustrate the benefit of supervised training for alignment, suggesting further annotation of gold-standard alignments is warranted, with an emphasis on alignment quality. (Footnote 5: In addition to the preprocessing steps required to clean the data, a bilingual speaker pointed out several systematic deviations from the alignment protocol by GALE annotators.) In particular, Figure 3 indicates that labelled alignment data is far more valuable than bitext, meaning that efforts should be focused on annotating existing bitext for alignment; FastAlign’s performance (when trained on the entire 4 million-sentence training bitext and the alignment bitext) can be surpassed after pre-training the translation model on only 400K sentences of unlabelled bitext and the full alignment set (4871 sentences).

5 NER Experiments

In Table 2 we evaluated the quality of our model’s alignments intrinsically, showing that they dramatically outperform the baseline models, especially for NER spans. To show that these improvements translate to downstream performance, we conduct projection experiments for NER using the Chinese and English data from the OntoNotes 5.0 corpus.


Dataset projection, as defined by Yarowsky et al. (2001), is a method of obtaining data in a low resource language by leveraging existing bitext and high-resource annotations. As shown (for a single sentence) in Figure 1, a set of sentences in a high-resource language for which there exist task-specific token-level annotations (e.g. an English NER dataset) is translated into the target language. Using alignments, the annotations are “projected” from the source text to the target; each target token receives the label of the source token it is aligned to (or a default label if the token is unaligned). This produces an annotated resource in the target language which can be used to train or pre-train a model.
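A minimal sketch of the projection step follows. The tag scheme (plain labels without BIO prefixes), the example sentence, and the rule that a non-default source label overrides the default are assumptions made for illustration; how conflicting labels on one target token are resolved is not specified above.

```python
def project_tags(src_tags, links, n_tgt, default="O"):
    """Copy each source token's label to the target tokens it is aligned to.

    links is a set of (src_idx, tgt_idx) pairs; unaligned target tokens keep
    the default label.
    """
    tgt_tags = [default] * n_tgt
    for src_idx, tgt_idx in sorted(links):
        if src_tags[src_idx] != default:
            tgt_tags[tgt_idx] = src_tags[src_idx]
    return tgt_tags


# Toy example: "Barack Obama visited Beijing" -> "奥巴马 访问 了 北京".
src_tags = ["PERSON", "PERSON", "O", "GPE"]
links = {(0, 0), (1, 0), (2, 1), (3, 3)}
tgt_tags = project_tags(src_tags, links, n_tgt=4)
```

Here the third target token is unaligned and therefore receives the default label, while both source tokens of the name project onto a single target token.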


We project the English OntoNotes NER training data (53K sentences) by first translating it to Chinese using our BPE translation model, and then aligning the source and target sequences using FastAlign and our best discriminative model. We create two datasets, one where the tags have been projected using FastAlign alignments and another using the higher-quality DiscAlign alignments. OntoNotes contains gold-labelled NER data for 18 tag types in both Chinese and English. The official splits (http://conll.cemantix.org/2012/data.html) were used. Because there are fewer training examples for Chinese than English, two NER models were trained, one with the same number of projected sentences as there were gold Chinese sentences (36K), and the second using the full English dataset (53K sentences) projected to Chinese. The data was pre-processed following the same procedure as in §4.

NER Model

Given the recent successes of contextual encoders for sequence modeling, especially for NER, we make use of a pre-trained BERT model (Devlin et al., 2018), where the NER label is determined by the softmax output of a linear layer situated on top of the encoder. (Footnote 7: An NER model was adapted from http://github.com/kamalkraj/BERT-NER to use a pre-trained Chinese model and allow for incremental checkpointing.)

Each model was trained for 4 to 8 epochs.

We explore the performance of the BERT NER model when trained on both gold-standard and projected datasets. In addition, we analyze its performance when trained on decreasing amounts of gold data, and contrast those results with the performance of a model pre-trained on projected data and fine-tuned with gold data.

5.1 Results & Analysis

Method      # train  P      R      F1
Zh Gold     36K      75.46  80.55  77.81
FastAlign   36K      38.99  36.61  37.55
FastAlign   53K      39.46  36.65  37.77
DiscAlign   36K      51.94  52.37  51.76
DiscAlign   53K      51.92  51.93  51.57

Table 4: F1 results on OntoNotes test for systems trained on data projected via FastAlign and DiscAlign.

# train  P      R      F1
500      41.43  57.97  48.07
1000     58.52  68.05  62.72
2500     68.30  74.93  71.35
4871     72.00  76.90  74.30
10K      73.13  78.79  75.77

Table 5: OntoNotes test set performance when trained on subsamples of the Chinese gold NER data.

# gold  P      R      F1     ΔF1
500     63.16  68.90  65.74  17.67
1000    66.39  72.23  69.13   6.41
2500    69.57  75.52  72.33   0.98
4871    71.11  76.38  73.60  -0.70
10K     72.52  78.32  75.24  -0.53
full    75.04  80.18  77.45  -0.36

Table 6: OntoNotes test performance when pre-trained on projected data and finetuned on varying amounts of Chinese gold NER data. ΔF1 compares these values to corresponding rows in Table 5 (the “full” row is compared to the Zh Gold row of Table 4).

Figure 4: Comparing F1 across different tag types when a model is trained on 500 gold NER sentences to when it is pre-trained on projected data and then finetuned on 500 gold sentences, ranked by number of train examples.

Table 4 shows that while NER systems trained on projected data do categorically worse than an NER system trained on gold-standard data, the higher-quality alignments obtained from DiscAlign lead to a major improvement in F1 when compared to FastAlign.

For many tasks, an English resource (i.e. the English portion of the OntoNotes data) exists along with bitext between English and a given target language, but very few resources are available in the target language. Comparing Table 4 with Table 5 shows that a system trained on projected data outperforms a system trained on gold data when there is very little gold data available (500 sentences; 51.76 vs. 48.07 F1); as the amount of gold data increases, these gains disappear. However, Table 6 indicates that projected data provides a useful pre-training objective for NER, making low-data NER models perform about as well as if they had been trained on twice as much gold data.
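The final column of Table 6 (the difference from Table 5) can be recomputed directly from the two F1 columns. The snippet below only restates numbers already given in Tables 5 and 6; the variable names are arbitrary.

```python
# F1 when training on gold data only (Table 5).
gold_only = {"500": 48.07, "1000": 62.72, "2500": 71.35, "4871": 74.30, "10K": 75.77}
# F1 when pre-training on projected data, then finetuning on gold data (Table 6).
pretrained = {"500": 65.74, "1000": 69.13, "2500": 72.33, "4871": 73.60, "10K": 75.24}

delta = {size: round(pretrained[size] - gold_only[size], 2) for size in gold_only}

# Does pre-training with 500 gold sentences beat gold-only training on 1000?
beats_double = pretrained["500"] > gold_only["1000"]
```

The gain is largest at 500 gold sentences and turns slightly negative from 4871 sentences onward, which matches the trend described above.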

Figure 4 shows that for nearly all classes, especially rare ones, pre-training on projected data improves downstream results when there is little annotated data. This is partly due to the fact that some classes (e.g. LANGUAGE, ORDINAL) appear very rarely in the 500-sentence training set, although all classes appear at least 5 times in the training data. These classes are far more common in the much larger projected dataset. In addition, the major gains seen in tags with higher support in the gold data (e.g. MONEY, LOCATION) suggest that pre-training the NER model on projected data does more than merely expose the model to unseen classes.

6 Human Evaluation

In §4 we established that our discriminatively-trained neural aligner outperforms unsupervised baselines, especially on NER spans; in §5 we verified that the alignments it produces can be productively applied to a downstream IE task (NER) via dataset projection.

However, unlike these unsupervised baselines, our aligner requires labelled data on the order of thousands of sentences, and thus cannot be applied to language pairs for which no labelled alignment data exists (most languages). We contend that using a web-based crowdsourcing interface (https://github.com/hltcoe/tasa), alignment annotation can be performed rapidly by annotators with minimal experience; to justify this claim, we conduct a small timed experiment using untrained human annotators. Six L2 Chinese speakers (L1 English) were asked to annotate English-Chinese bitext for alignment. These sentences were taken from the GALE development portion of the OntoNotes dataset. Note that these annotators had no experience in annotating data for alignment, and had approximately equivalent amounts of Chinese instruction. Annotators were introduced to the task and the alignment interface, and then given 50 minutes to align the sentences as accurately as possible. They were divided into two groups, with each group annotating a different partition of sentences.

Model       sents/min  P      R      F1
Human       4.4        90.09  62.85  73.92
DiscAlign   -          74.54  72.15  73.31
Hu. (NER)   -          87.73  71.02  78.24
DA (NER)    -          77.37  67.69  71.94

Table 7: Sentences per minute and average scores against gold-labelled data for sentences annotated for alignment by human annotators (Hu.), compared to DiscAlign (DA) on the same sentences.

6.1 Results & Analysis

Table 7 shows that alignment can be performed rapidly, at 4.4 sentences per minute. The annotators achieve high overall precision for alignment, but fall short on recall. When only alignments of NER spans are considered, their F1 score improves considerably. Additionally, human annotators outperform DiscAlign when evaluated on the same sentences.

These are promising results, as they indicate that entirely untrained L2 annotators can rapidly annotate data. Their relatively low recall scores might be due to their lack of training, and given more extensive training and guidelines we would expect them to improve.

7 Conclusion & Future Work

We described a model extension to NMT which supports discriminative alignment, where fewer than 5K Chinese and 1.7K Arabic labelled sentences lead to better F1 scores in all tested scenarios than standardly employed mechanisms. By evaluating the model on both Arabic and Chinese we demonstrated that our model’s improvements generalize across typologically divergent languages. We showed that in Chinese our model’s performance is particularly strong for NER spans.

We projected a dataset for NER from English to Chinese using our alignments, verifying that our model’s higher alignment quality led to downstream improvements; our aligner’s intrinsic improvements result in major gains over FastAlign in a downstream NER evaluation. Comparing the performance of our projected dataset with a gold-standard dataset, we showed that a large projected dataset can outperform a small labelled one, and that projection can be leveraged as a method of data augmentation for low-resource settings.

Finally, we showed that untrained L2 annotators can rapidly annotate bitext for alignment, and yield promising results for both full alignment and NER-specific annotation, confirming that we can collect additional data in a variety of languages.

Based on our findings that:

  • discriminative alignments outperform common unsupervised baselines in two typologically divergent languages

  • this performance boost leads to major downstream improvements on NER

  • only a small amount of labelled data is needed to realize these improvements

  • and that these labelled examples can be obtained from L2 speakers with minimal training

we conclude with a call for additional annotation efforts in a wider variety of languages. While multilingual datasets directly annotated for a given task typically lead to the highest-performing systems (see Table 4), these datasets are task-specific and not robust to ontology changes. In contrast, the framework of projection via discriminatively-trained alignments which we present is task-agnostic (any token-level annotations can be projected) and requires only one source-side dataset to be annotated.

Future work will explore different sets of languages paired with new, higher-quality elicited alignments of existing bitext. We will also explore architecture modifications to allow for task-specific annotation, training, and inference for a given IE task, where only certain spans are of interest (e.g. training our aligner only on NER-span alignments). We expect that asking annotators to align only named entities (or some other token type) will greatly increase annotation speed. These experiments will extend beyond NER to other token-level tasks such as coreference resolution and event detection.


This work was supported by DARPA LORELEI and AIDA. We thank the anonymous reviewers for their constructive feedback. The views and conclusions expressed herein are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government.


  • Ž. Agić, A. Johannsen, B. Plank, H. A. Martínez, N. Schluter, and A. Søgaard (2016) Multilingual projection for parsing truly low-resource languages. Transactions of the Association for Computational Linguistics 4, pp. 301–312. Cited by: §4.1.
  • T. Alkhouli, G. Bretschner, and H. Ney (2018) On the alignment problem in multi-head attention-based neural machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, pp. 177–185. Cited by: §1, §2.
  • T. Alkhouli, G. Bretschner, J. Peter, M. Hethnawi, A. Guta, and H. Ney (2016) Alignment-based neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers, pp. 54–65. Cited by: §1, §2.
  • T. Alkhouli and H. Ney (2017) Biasing attention-based recurrent neural networks using external alignment information. In Proceedings of the Second Conference on Machine Translation, pp. 108–117. Cited by: §2.
  • P. Arthur, G. Neubig, and S. Nakamura (2016) Incorporating discrete translation lexicons into neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1557–1567. Cited by: §4.1.
  • D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §1, §2.
  • P. F. Brown, V. J. D. Pietra, S. A. D. Pietra, and R. L. Mercer (1993) The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19 (2), pp. 263–311. Cited by: §1, §2, §4.1.
  • W. Chen, E. Matusov, S. Khadivi, and J. Peter (2016) Guided alignment training for topic-aware neural machine translation. arXiv preprint arXiv:1607.01628. Cited by: §1, §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §5.
  • C. Dyer, V. Chahuneau, and N. A. Smith (2013) A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 644–648. Cited by: §4.1, §4.1.
  • S. Eger, J. Daxenberger, C. Stab, and I. Gurevych (2018) Cross-lingual argumentation mining: machine translation (and a bit of projection) is all you need!. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 831–844. Cited by: §4.1.
  • A. Fraser and D. Marcu (2007) Measuring word alignment quality for statistical machine translation. Computational Linguistics 33 (3), pp. 293–303. Cited by: §4.2.
  • R. Fu, B. Qin, and T. Liu (2014) Generating Chinese named entity data from parallel corpora. Frontiers of Computer Science 8 (4), pp. 629–641. Cited by: §4.1.
  • J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1243–1252. Cited by: §1.
  • H. Ghader and C. Monz (2017) What does attention in neural machine translation pay attention to?. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 30–39. Cited by: §1, §2, §4.3.
  • F. Hieber, T. Domhan, M. Denkowski, D. Vilar, A. Sokolov, A. Clifton, and M. Post (2017) Sockeye: a toolkit for neural machine translation. arXiv preprint arXiv:1712.05690. Cited by: §4.1.
  • R. Hwa, P. Resnik, A. Weinberg, C. Cabezas, and O. Kolak (2005) Bootstrapping parsers via syntactic projection across parallel texts. Natural Language Engineering 11 (3), pp. 311–325. Cited by: §1.
  • N. Kalchbrenner and P. Blunsom (2013) Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1700–1709. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: footnote 3.
  • P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. (2007) Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, pp. 177–180. Cited by: §4.2.
  • P. Koehn and R. Knowles (2017) Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pp. 28–39. Cited by: §1, §2.
  • J. Legrand, M. Auli, and R. Collobert (2016) Neural network-based word alignment through score aggregation. In Proceedings of the First Conference on Machine Translation: Volume 1, Research Papers, pp. 66–73. Cited by: §1, §2, §2.
  • X. Li, S. Grimes, S. Ismael, S. Strassel, M. Maamouri, and A. Bies (2013) GALE Arabic-English parallel aligned treebank – broadcast news part 1; LDC2013T14. Cited by: §4.2.
  • X. Li, S. Grimes, S. Strassel, X. Ma, N. Xue, M. Marcus, and A. Taylor (2015) GALE Chinese-English parallel aligned treebank–training; LDC2015T06. Cited by: §4.2.
  • L. Liu, M. Utiyama, A. Finch, and E. Sumita (2016) Neural machine translation with supervised attention. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 3093–3102. Cited by: §1.
  • J. Ouyang and K. McKeown (2019) Neural network alignment for sentential paraphrases. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pp. 4724–4735. Cited by: §2.
  • S. Padó and M. Lapata (2009) Cross-lingual annotation projection for semantic roles. Journal of Artificial Intelligence Research 36, pp. 307–340. Cited by: §1.
  • J. Peter, A. Nix, and H. Ney (2017) Generating alignments using target foresight in attention-based neural machine translation. The Prague Bulletin of Mathematical Linguistics 108 (1), pp. 27–36. Cited by: §2.
  • O. Press and L. Wolf (2017) Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Vol. 2, pp. 157–163. Cited by: footnote 3.
  • E. Riloff, C. Schafer, and D. Yarowsky (2002) Inducing information extraction systems for new languages via cross-language projection. In Proceedings of the 19th international conference on Computational linguistics-Volume 1, pp. 1–7. Cited by: §1.
  • H. Schwenk (2012) Continuous space translation models for phrase-based statistical machine translation. Proceedings of COLING 2012: Posters, pp. 1071–1080. Cited by: §1.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1715–1725. Cited by: §4.2.
  • S. Sharoff (2018) Language adaptation experiments via cross-lingual embeddings for related languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), Cited by: §4.1.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1.
  • A. Tamura, T. Watanabe, and E. Sumita (2014) Recurrent neural networks for word alignment model. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1470–1480. Cited by: §2.
  • N. Tomeh (2012) Discriminative alignment models for statistical machine translation. Ph.D. Thesis, Université Paris Sud-Paris XI. Cited by: §2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §1.
  • R. Weischedel, M. Palmer, M. Marcus, E. Hovy, S. Pradhan, L. Ramshaw, N. Xue, A. Taylor, J. Kaufman, M. Franchini, et al. (2013) OntoNotes Release 5.0 LDC2013T19. Linguistic Data Consortium, Philadelphia, PA. Cited by: item 3.
  • C. Xi and R. Hwa (2005) A backoff model for bootstrapping resources for non-English languages. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 851–858. Cited by: §1.
  • D. Yarowsky, G. Ngai, and R. Wicentowski (2001) Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the first international conference on Human language technology research, pp. 1–8. Cited by: §1, §4.2, §5.
  • T. Zenkel, J. Wuebker, and J. DeNero (2019) Adding interpretable attention to neural translation models improves word alignment. arXiv preprint arXiv:1901.11359. Cited by: §1, §2, §2, §4.2.