Adversarial Neural Networks for Cross-lingual Sequence Tagging

08/14/2018, by Heike Adel et al.

We study cross-lingual sequence tagging with little or no labeled data in the target language. Adversarial training has previously been shown to be effective for training cross-lingual sentence classifiers. However, it is not clear if language-agnostic representations enforced by an adversarial language discriminator will also enable effective transfer for token-level prediction tasks. Therefore, we experiment with different types of adversarial training on two tasks: dependency parsing and sentence compression. We show that adversarial training consistently leads to improved cross-lingual performance on each task compared to a conventionally trained baseline.




1 Introduction

Cross-lingual modeling is especially interesting when generalizing from a “source” language with labeled data to a “target” language without fully supervised annotations. While POS taggers and dependency parsers are available in many languages Nivre et al. (2016), data for tasks like sentence compression Filippova et al. (2015) or classification Chen et al. (2017); Joty et al. (2017) is much harder to come by. Past success for cross-lingual transfer is mainly achieved by label projection Täckström et al. (2013); Wisniewski et al. (2014); Agić et al. (2016), which requires manual efforts, machine translation or parallel data.

In this work, we address the challenge of building models that learn cross-lingual regularities. Our goal is to avoid overfitting to the source language and to generalize to new languages through adversarial training. Ideally, our models internally build language-agnostic representations which can then be applied to any new target language. Adversarial training has recently gained a lot of attention for domain adaptation by building domain-independent feature representations Ganin et al. (2016); Chen and Cardie (2018). As we show in this work, cross-lingual transfer can be treated as a language-specific variant of domain adaptation. This poses additional challenges to adversarial training: in contrast to most domain adaptation settings, the change from source to target involves not only a shift in word distributions (as when adapting models from news to web data) but a change of the entire vocabulary. To address this, we use bilingual word embeddings and universal POS tags as a common intermediate representation. The second difficulty is that, to the best of our knowledge, adversarial loss has only been applied to cross-lingual NLP classification tasks Chen et al. (2017); Joty et al. (2017); Chen and Cardie (2018), in which a single output label is predicted. Filling this gap, we are the first to show that adversarial loss functions are also effective for cross-lingual sequence tagging, in which multiple outputs are predicted for a given input sequence.

We show for both a syntactic task (dependency parsing) and a semantic task (extractive sentence compression) that adversarial training improves cross-lingual transfer when little or no data is available. For completeness, we also provide a negative result: for training POS taggers, bilingual word embeddings and adversarial training are not sufficient to produce useful cross-lingual models.

Our contributions are: (i) We adapt adversarial training for structured prediction and compare gradient reversal Ganin et al. (2016), GAN Goodfellow et al. (2014) and WGAN Arjovsky et al. (2017) objectives. (ii) We show that this procedure is useful for both a syntactic and a semantic task.

2 Related Work

Adversarial training Goodfellow et al. (2014) has received increasing interest in the NLP community Gulrajani et al. (2017); Hjelm et al. (2017); Li et al. (2017); Press et al. (2017); Rajeswar et al. (2017); Yu et al. (2017); Zhao et al. (2017).

Ganin et al. (2016) propose adversarial training with a gradient reversal layer for domain adaptation (for image classification and sentiment analysis, respectively). Similarly, Chen et al. (2017) and Joty et al. (2017) apply adversarial training to cross-lingual sentiment classification and community question answering, respectively. While most previous work in NLP has investigated adversarial domain adaptation for sentence-level classification tasks, we are the first to explore it for cross-lingual sequence tagging. Moreover, we provide a direct comparison of different adversarial loss functions in a cross-lingual training setting. Li et al. (2016) also work on adversarial sequence tagging but treat sequence tag prediction and sequence labeling as adversarial model parts. This is very different in motivation from using adversarial training in cross-lingual settings. More related to our work is the paper by Yasunaga et al. (2018), which applies adversarial training in the context of part-of-speech tagging. However, their model is different from ours in two crucial respects. They train single-language models whereas we train a cross-lingual model. They use adversarial training in the form of input perturbations whereas we use adversarial loss functions in order to derive language-independent representations that allow knowledge transfer between source and target languages.

3 Model

We implement our model using DRAGNN, a TensorFlow framework for efficient training of recurrent neural networks Kong et al. (2017).

3.1 Model Architecture

Our model consists of three main components: a feature generator, a domain discriminator and a sequence tagger (see Figure 1).

Figure 1: General architecture of our model. G: feature generator (bi-LSTM), T: target sequence tagger, D: domain discriminator. On the right, we illustrate the flow of the different losses.
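Shape-wise, the three components can be sketched as follows. This is only an illustration with made-up dimensions, and a plain tanh projection stands in for the bi-LSTM; it is not the paper's DRAGNN code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the paper's): token features -> shared
# representation h_t -> tagger head T and discriminator head D.
n_tokens, d_in, d_hid, n_tags, n_langs = 4, 8, 16, 3, 2

# Stand-in for the bi-LSTM feature generator G: one shared projection.
W_G = rng.normal(size=(d_in, d_hid))
# Tagger T and discriminator D heads, both feed-forward.
W_T = rng.normal(size=(d_hid, n_tags))
W_D = rng.normal(size=(d_hid, n_langs))

def softmax(z):
    """Numerically stable row-wise softmax."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x = rng.normal(size=(n_tokens, d_in))  # embedded input tokens
h = np.tanh(x @ W_G)                   # shared features from G
tag_probs = softmax(h @ W_T)           # per-token tag distribution (tagger)
lid_probs = softmax(h @ W_D)           # per-token language id (discriminator)
```

Both heads consume the same shared representation h, which is what lets the adversarial signal from the discriminator shape the features the tagger sees.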

The feature generator is a one-layer bi-directional LSTM (bi-LSTM) Hochreiter and Schmidhuber (1997) which uses the embeddings of words, their POS tags and Brown clusters as input (see Section 4 for more details). Its output is consumed by the domain discriminator D and the target tagger T, which are both implemented as feed-forward networks and predict the language id (lid) of the input sequence and the target token label at each step, respectively. (For dependency parsing, the tagger is replaced by the arc-standard model from Kong et al. (2017), which predicts two labels per token, but the feature generator architecture is the same.) We found that predicting the language id at the token level was more effective than predicting it at the sentence level. The tagger objective maximizes the log-likelihood of the target tag sequence:

J_T = Σ_t log p(y_t | h_t; θ_T)    (1)

updating w.r.t. the parameters θ_T of the tagger T, where h_t is the feature generator output at token t. The objectives for the discriminator and feature generator depend on the adversarial techniques, which we describe next.
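As a numerical illustration of this token-level log-likelihood (with made-up tag distributions, not the paper's model):

```python
import numpy as np

def tagger_log_likelihood(token_probs, gold_tags):
    """Token-level log-likelihood of a gold tag sequence.

    token_probs: (seq_len, n_tags) array of per-token tag distributions,
                 as the tagger predicts over the generator outputs.
    gold_tags:   length-seq_len list of gold tag indices.
    Returns the sum of log-probabilities, which training maximizes.
    """
    picked = token_probs[np.arange(len(gold_tags)), gold_tags]
    return float(np.sum(np.log(picked)))

# Two tokens, three tags; gold tags are indices 0 and 2.
probs = np.array([[0.5, 0.25, 0.25],
                  [0.25, 0.25, 0.5]])
ll = tagger_log_likelihood(probs, [0, 2])  # log(0.5) + log(0.5)
```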

3.2 Training with Gradient Reversal

The first adversarial architecture we investigate is gradient reversal training as proposed by Ganin et al. (2016). In this setting, the discriminator is a classifier which identifies the input domain given a single feature vector h_t. Thus, its objective is J_D = Σ_t log p(lid | h_t; θ_D).

The goal of the generator is to fool the discriminator, which is achieved by updating the generator weights in the opposite direction w.r.t. the discriminator gradient:

θ_G ← θ_G + α (∂J_T/∂θ_G − λ ∂J_D/∂θ_G)    (2)

where λ is used to scale the gradient from the discriminator.
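This reversed update can be sketched numerically. The values, the learning rate, and the scale λ below are all illustrative, not the paper's settings:

```python
import numpy as np

def gradient_reversal_update(theta_G, grad_tagger, grad_disc, lr=0.1, lam=0.5):
    """One generator update in the spirit of gradient reversal
    (Ganin et al., 2016): ascend the tagger objective but descend the
    discriminator objective, the latter scaled by lam.
    Signs follow a gradient-ascent convention on log-likelihoods."""
    return theta_G + lr * (grad_tagger - lam * grad_disc)

theta = np.array([1.0, -2.0])
g_tag = np.array([-0.2, -0.4])   # gradient of tagger objective w.r.t. theta
g_disc = np.array([-1.0, 1.0])   # gradient of discriminator objective
new_theta = gradient_reversal_update(theta, g_tag, g_disc)
```

Note how the discriminator gradient enters with a flipped sign: the generator moves to make language identification harder while still improving tagging.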

3.3 Training with GAN and WGAN

The other two adversarial training objectives we investigate are the GAN objective Goodfellow et al. (2014) and its variant, the Wasserstein GAN (WGAN) objective Arjovsky et al. (2017). In contrast to gradient reversal, the discriminator inputs are sampled from the source and target distributions P_S and P_T, respectively. The adversarial objective for GAN is:

J_D = E_{x∼P_S}[log D(G(x))] + E_{x∼P_T}[log(1 − D(G(x)))]    (3)
The objective of the feature generator is to act as an adversary w.r.t. the discriminator while being collaborative w.r.t. the target tagger:

J_G = J_T − E_{x∼P_T}[log(1 − D(G(x)))]    (4)
To stabilize adversarial training, Arjovsky et al. (2017) proposed WGAN, which trains the discriminator as:

J_D = E_{x∼P_S}[D(G(x))] − E_{x∼P_T}[D(G(x))]    (5)

with weight clipping θ_D ← clip(θ_D, −c, c) restricting the range of the discriminator weights for Lipschitz continuity. The objective for the feature generator has the same form as in Eq. 4.
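As a small numerical sketch of these objectives on sampled batches (illustrative discriminator outputs and clipping constant, not the paper's implementation):

```python
import numpy as np

def gan_discriminator_objective(d_source, d_target):
    """GAN discriminator objective on sampled batches:
    mean log D(G(x)) on source plus mean log(1 - D(G(x))) on target.
    d_source/d_target: discriminator outputs in (0, 1)."""
    return float(np.mean(np.log(d_source)) + np.mean(np.log(1.0 - d_target)))

def wgan_discriminator_objective(d_source, d_target):
    """WGAN critic objective: mean D(G(x)) on source minus on target."""
    return float(np.mean(d_source) - np.mean(d_target))

def clip_weights(weights, c=0.01):
    """WGAN weight clipping, keeping the critic (roughly) Lipschitz."""
    return np.clip(weights, -c, c)

d_src = np.array([0.8, 0.6])   # discriminator scores on source samples
d_tgt = np.array([0.3, 0.1])   # discriminator scores on target samples
gan_val = gan_discriminator_objective(d_src, d_tgt)
wgan_val = wgan_discriminator_objective(d_src, d_tgt)  # 0.7 - 0.2 = 0.5
clipped = clip_weights(np.array([0.5, -0.005]))        # -> [0.01, -0.005]
```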

Thus, the feature generator is incentivized to extract language-agnostic representations in order to fool the discriminator while helping the tagger. As a result, the tagger is forced to rely more on language-agnostic features. As we discuss later, this works well for higher-level syntactic or semantic tasks but does not work for low-level, highly lexicalized tasks like POS tagging.

Note that the updates to the target tagger are affected by the discriminator only indirectly through the shared feature generator.

4 Experiments

We address two cross-lingual sequential prediction tasks: dependency parsing and extractive sentence compression. For both tasks, we evaluate four different settings: (i) No ADA: no adversarial training; the feature generator and tagger are trained on the source language and then tested on the target language, (ii) GR: gradient-reversal training, (iii) GAN: using the GAN loss, (iv) WGAN: using the WGAN loss. In (ii)-(iv), a discriminator is trained in order to achieve language-agnostic representations in the feature generator. In all setups, we use bilingual word embeddings (BWE), Brown clusters and universal POS tags as input representations that are common across languages. The BWEs are trained on unsupervised multi-lingual corpora as described in Soricut et al. (2016). The POS tags are predicted by simple bi-LSTM taggers trained on the full datasets.

4.1 Data and Evaluation Measure

For parsing, we use the French (FR) and Spanish (ES) parts of the Universal Dependencies v1.3 Nivre et al. (2016). For sentence compression, in the absence of non-English datasets, we collect our own datasets for FR and ES from online sources (e.g., Wikipedia) and ask professionally trained linguists to label each token with KEPT or DROPPED (see Section 4.3) such that the compressed sentences are grammatical and informative. See Table 1 for statistics. In all our experiments, we use ES as source and FR as target.

We follow standard approaches for evaluation: token-level labeled attachment score (LAS) for dependency parsing and sentence-level accuracy for sentence compression Filippova et al. (2015). Note that for the latter, all token-level KEPT/DROPPED decisions need to be correct in order to get a positive score for a sentence.
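This all-or-nothing sentence-level scoring can be sketched as follows (a minimal illustration with invented labels, not the official evaluation script):

```python
def sentence_level_accuracy(gold, predicted):
    """Sentence-level accuracy for sentence compression: a sentence
    counts as correct only if every token-level KEPT/DROPPED decision
    matches the gold labels.

    gold/predicted: lists of label sequences, one sequence per sentence.
    """
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

gold = [["KEPT", "DROPPED", "KEPT"], ["KEPT", "KEPT"]]
pred = [["KEPT", "DROPPED", "KEPT"], ["KEPT", "DROPPED"]]
acc = sentence_level_accuracy(gold, pred)  # 1 of 2 sentences fully correct
```

A single wrong token in the second sentence zeroes out that whole sentence, which makes this metric much stricter than token-level accuracy.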

Note that, for all experiments without target training data, we do not use target data for tuning the models. Thus, the sequence-tagger is fully unsupervised w.r.t. the target language.

                             Spanish             French
#sent. (train / dev / test)  2353 / 115 / 115    2760 / 115 / 115
avg. sent. length            41 tokens           37 tokens
compression rate             26.8%               37.8%
Table 1: Statistics of the sentence compression (SC) dataset.
Training data   No ADA   GR      GAN     WGAN
ES + 0k FR      63.35    64.25   61.20   62.51
ES + 1k FR      67.33    67.43   66.69   66.97
ES + 2k FR      68.51    69.05   68.54   68.14
ES + all FR     80.17    80.46   80.68   80.22
Table 2: ES→FR dependency parsing results (LAS).

4.2 Dependency Parsing

We train the tagger on ES data with and without adversarial loss. Table 2 shows the results. Adversarial training gives consistent improvements over the conventionally trained baseline in all settings. It also outperforms a monolingual model trained on the full dataset of the target language FR which achieves a score of 80.21. When comparing the different adversarial loss functions, GR outperforms GAN and WGAN in most cases. One possible reason is a difference in the discriminator’s strength: During training, we observe that the discriminator of GAN and WGAN could easily predict the language id correctly (even after careful tuning of the update rate between generator and discriminator). This is a well-known problem with training GANs: When the discriminator becomes too strong it provides no useful signal for the feature generator. In contrast, GR training updates the feature generator by taking the inverse of the gradient of the discriminator cross-entropy loss. This simpler setup possibly results in a better training signal for the generator.

Training data       Accuracy
1k FR               0.85
1k MT-FR            1.71
1k FR + 1k MT-FR    17.09
2k FR               23.93
All FR              25.64
Table 3: Monolingual SC results on French.
Training data   No ADA   GR      GAN     WGAN
ES + 0k FR      0.00     9.17    0.00    1.71
ES + 1k FR      20.51    22.27   11.97   15.38
ES + 2k FR      29.06    24.89   15.38   19.66
ES + all FR     29.91    30.77   22.22   29.06
Table 4: ES→FR SC results with adversarial training.

4.3 Sentence Compression

Extractive sentence compression aims at generating shorter versions of a given sentence by deleting tokens Knight and Marcu (2000); Clarke and Lapata (2008); Berg-Kirkpatrick et al. (2011); Filippova et al. (2015); Klerke et al. (2016). This is useful for text summarization as well as for simplifying sentences or providing shorter answers to questions. We follow related work and treat it as a sequence-tagging problem: each token of the input sentence is tagged with either KEPT or DROPPED, indicating which words should occur in the compressed sentence. To solve the task, the model needs to consider the meaning of words and sentences. Thus, although we frame it as sequence tagging, we consider it a semantic task.
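As a minimal illustration of the tagging formulation (the example sentence and tags are invented), the KEPT/DROPPED decisions directly yield the compression:

```python
def compress(tokens, tags):
    """Turn token-level KEPT/DROPPED tags into a compressed sentence."""
    return " ".join(tok for tok, tag in zip(tokens, tags) if tag == "KEPT")

tokens = ["The", "very", "old", "house", "collapsed", "yesterday"]
tags = ["KEPT", "DROPPED", "DROPPED", "KEPT", "KEPT", "DROPPED"]
compressed = compress(tokens, tags)  # "The house collapsed"
```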

Monolingual and MT models. Most work on sentence compression considers English corpora only. Studies on other languages either train different monolingual models Steinberger and Tesar (2007) or use translation or alignments to transfer compressions from English into another language Aziz et al. (2012); Takeno and Yamamoto (2015); Ive and Yvon (2016). In order to get baseline results, we follow these approaches and train monolingual models (on FR) and models on translated data (MT-FR; we use the Google MT API to translate from ES into FR). Thus, the feature generator and tagger are monolingual models and there is no language discriminator. Table 3 shows the results. We find that MT can help to bootstrap first models in a new language. However, training data in the target language is better (see the performance gap from 1k FR + 1k MT-FR to 2k FR).

Cross-lingual models. Next, we train cross-lingual models (see Table 4). Even without adversarial training, the models perform better than the monolingual models. This shows that information can already be shared from ES to FR by using bilingual word embeddings. When adding adversarial training, we again notice better performance of GR compared to GAN or WGAN: GR training boosts the results, especially with no or little FR training data. The GAN and WGAN losses do not perform as well as they do for dependency parsing.

5 Discussion

We compared tasks of different natures: a syntactic task (dependency parsing) and a semantic task (sentence compression). For completeness, we also report that language-agnostic POS taggers did not lead to promising results: even though adversarial training improved the No-ADA baseline by several points, cross-lingual transfer still yielded only 45% POS accuracy in the target language, which is not accurate enough to be useful in downstream models. We assume that POS tagging depends on seeing language-specific vocabulary more than other tasks do. However, we showed that the higher-level tasks of dependency parsing and sentence compression can benefit from language-agnostic representations. To the best of our knowledge, this is the first work showing the effectiveness of adversarial loss for cross-lingual sequence tagging; we therefore opted for a language pair from the same language family. However, our algorithm is applicable to language pairs from different language groups as well.

6 Conclusion and Future Work

In this paper, we study the utility of adversarial training for cross-lingual sequence tagging. Our results show that the more high-level structure a task requires, the larger the gains we achieve with cross-lingual models. Gradient reversal training outperformed the GAN and WGAN losses in our experiments. In future work, we plan to extend our study to other language pairs, including languages from different families.