Cross-Lingual Transfer Learning for Question Answering

07/13/2019 ∙ by Chia-Hsuan Lee, et al. ∙ National Taiwan University

Deep learning based question answering (QA) on English documents has achieved success because a large number of English training examples is available. However, for most languages, training examples for high-quality QA models are not available. In this paper, we explore the problem of cross-lingual transfer learning for QA, where a source language task with plentiful annotations is utilized to improve the performance of a QA model on a target language task with limited available annotations. We examine two different approaches. A machine translation (MT) based approach translates the source language into the target language, or vice versa. Although the MT-based approach brings improvement, it assumes the availability of a sentence-level translation system. A GAN-based approach incorporates a language discriminator to learn language-universal feature representations, and consequently transfers knowledge from the source language. The GAN-based approach rivals the performance of the MT-based approach with fewer linguistic resources. Applying both approaches simultaneously yields the best results. We use two English benchmark datasets, SQuAD and NewsQA, as source language data, and show significant improvements over a number of established baselines on a Chinese QA task. We achieve a new state of the art on the Chinese QA dataset.




1 Introduction

Question answering (QA) has drawn much attention in the past few years, and end-to-end deep learning has demonstrated promising performance on QA Yu et al. (2018); Xiong et al. (2016); Seo et al. (2016); Huang et al. (2017); Wang et al. (2018); Liu et al. (2017); Hu et al. (2017); Wang et al. (2017). QA tasks on images Zitnick et al. (2016), audio Lee et al. (2018); Li et al. (2018) or video descriptions Rohrbach et al. (2015) have been studied, but most focus on understanding text documents Lai et al. (2017); Rajpurkar et al. (2016); Clark et al. (2018); Joshi et al. (2017); He et al. (2017); Nguyen et al. (2016); Trischler et al. (2016).

To train end-to-end network-based QA models, a large number of training examples is needed. Although English training examples are plentiful, most languages still lack the resources to train high-quality QA models; moreover, annotating training examples for QA is expensive. Therefore, it is desirable to transfer the knowledge of a QA model from a source language with many training examples, such as English, to target languages with fewer training examples.

Translating the data in the source language into the target language, or vice versa, by machine translation (MT) is an intuitive way to achieve cross-lingual transfer learning for QA. In this paper, we first validate the benefits of sentence-level MT-based approaches for transferring QA knowledge; we find that MT-based approaches bring significant improvements.

However, not all language pairs have a high-quality sentence-level MT system. We propose using domain adversarial training Chen et al. (2016) to learn language-invariant latent representations on top of the QA models. With only a word-by-word bilingual dictionary, this approach rivals the performance of the sentence-level MT-based approach. Finally, we find that applying both approaches simultaneously yields the best results.

2 Related Work

2.1 Transfer Learning for Question Answering

Transfer learning is a set of techniques using source domain data to enhance the performance of a model on the target domain. Transfer learning reduces the required amount of target domain training data; equivalently, it improves performance in the target domain. It has been successfully applied in computer vision Ganin et al. (2016) and speech recognition Shinohara (2016). It is also widely studied on NLP tasks such as sequence tagging Yang et al. (2017), syntactic parsing McClosky et al. (2010), and named entity recognition Chiticariu et al. (2010).

QA transfer learning has also been studied. In probably the first work to apply transfer learning to QA, the authors show that the transferred system is significantly better than a random baseline Kadlec et al. (2016). Min et al. (2017) use source data to pre-train a QA model, and then fine-tune the model with target data. Wiese et al. (2017) show that it is possible to transfer knowledge from an open domain dataset for improvements on a biomedical dataset. Chung et al. (2017) study both supervised and unsupervised transfer learning techniques. Golub et al. (2017) propose a two-stage synthesis network that generates synthetic questions and answers in the target domain without annotations to augment training data. However, in all these studies, the source and target domains are in the same language.

2.2 Multilingual Question Answering

Another line of research related to our work is multilingual question answering (MLQA), also called cross-lingual question answering (CLQA), in which the machine is required to return an answer in the target language with respect to a question in the source language using documents in the source language Magnini et al. (2006); Sasaki et al. (2005). One typical approach is to use a monolingual QA system for the source language and translate the resulting answers into the target language Bos and Nissim (2006). Another is to translate questions from the source language into the target language and use a monolingual QA system for the target language to return answers Mori and Takahashi (2007). The task in this paper is completely different from MLQA: the question, document, and answer are all in the same language, and we use training data from another language to improve model performance on the target language.

2.3 Cross-Lingual Transfer Learning

For knowledge transfer between different languages, it is intuitive to use machine translation (MT) to translate the source language into the target language, or vice versa. Using MT to achieve knowledge transfer between different languages has been studied on sentiment analysis Mohammad et al. (2016), spoken language understanding Stepanov et al. (2013); Schuster et al. (2018) and question answering Asai et al. (2018). It is also possible to train a language model to obtain cross-lingual text representations and further improve QA performance in different languages using parallel data Lample and Conneau (2019); Mulcaire et al. (2019) or even without parallel data Devlin et al. (2018) (we refer to the multilingual version).

The performance of MT-based approaches depends highly on the quality of the available MT systems. Due to the lack of reliable MT tools for some language pairs, approaches that require only limited linguistic knowledge resources between the source and the target languages have been proposed. Zirikly and Hagiwara (2015) only assume the availability of a small bilingual dictionary and train the same model to tag the target corpus. Lu et al. (2011) augment the available labeled data in the target language with unlabeled parallel data.

Compared with the above tasks, cross-lingual transfer learning for QA without a sentence-level MT system is even more challenging because returning the right answer requires that the machine comprehends the document and conducts cross-sentence reasoning. By considering the source and target languages to be two different domains, we propose using domain adversarial training Chen et al. (2016); Kim et al. (2017) to cause the QA model to learn language-invariant latent representations, which encourages knowledge transfer between languages.

3 Task Description

Of the many existing QA settings, here we focus on extraction-based QA (EQA), but it is possible to adapt the approaches described in this study to other types of QA tasks. In EQA, each example is a triple $(q, d, a)$ in which $q$ is a question and $d$ is a multi-sentence document that contains the answer $a$.

We seek to improve EQA model performance on the target EQA dataset by transferring knowledge from the source EQA dataset. The target and source datasets are in different languages. For training, we have a large amount of training examples from the source domain, but only a few examples from the target domain. The testing data is in the target language. Due to the limited number of training examples in the target domain, training a QA model from the target domain training set and then applying the model on the testing set of the target domain would yield poor performance. The goal of this research is to use the training examples from the source domain to improve performance on the testing set of target domain.

In this paper, the target language is Chinese and the source language is English, but the following discussion can be applied to other language pairs. Compared with other language pairs, knowledge transfer between English and Chinese is difficult because the two languages have no characters in common. As such, we cannot use shared character embeddings Yang et al. (2017).

4 Cross-Lingual Transfer Learning for QA

In this section, we present the cross-lingual transfer learning techniques for QA. In the first approach, we use an MT system to translate the sentences from one language to the other. In the second approach, language-invariant representations are learned, and only a word-by-word bilingual dictionary is needed.

4.1 Machine Translation Based Approaches

There are two ways to use a (sentence-level) MT system.

  • Train-on-Target García et al. (2012): In this approach the MT system is used to translate the documents, questions and answers of the source domain into the target domain. As the source domain training examples are translated into the target domain, they can be used to train the QA model for the target language.

  • Test-on-Source He et al. (2013): In this approach the MT system is used to translate the documents, questions and answers of the target domain into the source domain, so the QA model trained on the source language can be directly applied to the translated target domain testing data. Asai et al. (2018) proposed back-translating the answer into the target language using a white-box neural machine translation system, whereas we use an off-the-shelf MT system (e.g., Google Translate). In our experiments, this approach performs worse than Train-on-Target. See the supplemental material for details.

In both approaches, we split the document into sentences and translate them individually. The translated document is obtained by concatenating all translated sentences. We remove the data examples in which answer spans are no longer recoverable in the documents after translation.
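The sentence-by-sentence translation and span filtering described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: `translate` is a hypothetical stand-in for any sentence-level MT system, joining sentences without a separator assumes a target language such as Chinese, and "recoverable" is interpreted here as an exact substring match.

```python
from typing import Callable, Dict, List, Optional


def translate_example(
    sentences: List[str],
    question: str,
    answer: str,
    translate: Callable[[str], str],
) -> Optional[Dict[str, object]]:
    """Translate one QA example sentence by sentence.

    `translate` is a hypothetical stand-in for a sentence-level MT system.
    Returns None when the translated answer is no longer a span of the
    translated document, so the caller can drop the example.
    """
    # Translate each sentence on its own, then concatenate the results.
    # Joining without a separator assumes a target language such as Chinese.
    document = "".join(translate(sentence) for sentence in sentences)
    translated_answer = translate(answer)

    # The answer must still be an exact span of the translated document.
    start = document.find(translated_answer)
    if start < 0:
        return None

    return {
        "document": document,
        "question": translate(question),
        "answer": translated_answer,
        "answer_start": start,
    }
```

Examples for which the function returns None would simply be left out of the translated training set.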

4.2 GAN-based Approach

In this approach, we propose a QA model which projects the sentences of two different language domains onto the same space, so the model benefits from utilizing training examples from both source and target languages. Instead of utilizing a sentence-level MT system, this approach requires only a word-by-word bilingual dictionary.

Bilingual Embeddings. First of all, we seek to produce word embeddings for the two languages in the same continuous space. Aligning word embeddings between two languages without supervision is an active research field Conneau et al. (2017); however, this approach yielded poor performance in our preliminary experiments. We conjecture that this is due to fundamental linguistic differences between the source language (English) and the target language (Chinese), so that the embedding spaces could not be properly aligned. (It has been shown that aligning word embeddings between English and Chinese is more challenging than for other language pairs Conneau et al. (2017).)

Here we use a word-by-word bilingual dictionary to solve the problem. For each word in the source language, we use the dictionary to fetch the corresponding word in the target language. We then use the embedding of that target-language word as the embedding of the source-language word. Thus, as the two languages actually use the same set of word embeddings, the model is more likely to bootstrap knowledge from both languages during training and eventually improve performance when testing on the target language.
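A minimal sketch of this dictionary lookup is given below. All names are illustrative, and the fallback to an unknown-word vector for source words missing from the dictionary or from the target vocabulary is an assumption; the text does not say how such words are handled.

```python
from typing import Dict, Iterable, List

Vector = List[float]


def build_shared_embeddings(
    source_vocab: Iterable[str],
    dictionary: Dict[str, str],
    target_embeddings: Dict[str, Vector],
    unk_vector: Vector,
) -> Dict[str, Vector]:
    """Give each source word the embedding of its dictionary translation."""
    shared = {}
    for word in source_vocab:
        # Look up the target-language translation of the source word.
        target_word = dictionary.get(word)
        # Assumption: words without a usable translation share one UNK vector.
        shared[word] = target_embeddings.get(target_word, unk_vector)
    return shared
```

With this mapping, an English word and its Chinese translation are fed to the model as the same vector, so both languages share one embedding table.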

Figure 1: Architecture of proposed QA model and language discriminator. Each language uses the same network architecture for the language-independent layers.

Language-Dependent and -Independent Layers. However, in preliminary experiments, we found that using the same set of word embeddings is not sufficient to achieve efficient joint training. This is because the syntactic structures of the two languages are often very different. To address this problem, we propose the QA model architecture in Fig. 1. This is a general concept, and is independent of the network architecture.

The QA model in Fig. 1 separately encodes sentences from the target and source domains with two sets of language-dependent layers, $E_t$ for the target language and $E_s$ for the source language, to handle the different underlying linguistic structures. Each language has its own language-dependent layers, each of which takes a sequence of words of a language as input and outputs a sequence of vectors. The length of the output vector sequence is the same as the length of the input word sequence. The input word sequence can be a query $q$ or a document $d$. We use $H^q_t$ and $H^d_t$ ($H^q_s$ and $H^d_s$) to represent the output vector sequences given the query and document in the target (source) language, respectively.

The subsequent layers in the QA model are language-independent layers, which take $H^q_t$ and $H^d_t$ ($H^q_s$ and $H^d_s$) as input and output the final answer. The parameters of the language-independent layers are shared across languages.

Adversarial Training. Although it may be that the two language-dependent layers occasionally learn common latent representations, adversarial learning can further encourage the outputs of the language-dependent layers to be language-invariant. We incorporate a language discriminator $D$ whose job is to identify the language of the input vector sequence from the output of the target-language layers $E_t$ and the source-language layers $E_s$. $D$ is also shown in Fig. 1. The intuition is that if the output of $E_s$ is indistinguishable from that of $E_t$, the features they extract are language-invariant, making it easier for the following layers of the QA model to use the knowledge from the training examples of both source and target languages.

The discriminator is trained to minimize $L_D$ below:

$$L_D = \mathbb{E}_{t}\left[ D(H^q_t) + D(H^d_t) \right] - \mathbb{E}_{s}\left[ D(H^q_s) + D(H^d_s) \right] \tag{1}$$

where $D$ is the language discriminator, and $H^q_t$ and $H^d_t$ ($H^q_s$ and $H^d_s$) are the language-dependent layer outputs for the query and document of a target-language (source-language) example. In (1), given a training example from the target language ($t$), $D$ learns to assign a lower score to the $H^q_t$ and $H^d_t$ of that example, that is, to minimize $D(H^q_t)$ and $D(H^d_t)$. Conversely, given a training example from the source language ($s$), $D$ learns to assign larger values to $H^q_s$ and $H^d_s$.

The two language-dependent layers in the QA model are trained to maximize $L_D$ while minimizing the loss for QA, $L_{qa}$. The loss function $L_{dep}$ for the language-dependent layers is

$$L_{dep} = L_{qa} - \lambda L_D \tag{2}$$

where $\lambda$ is a hyperparameter. Because the parameters of the language-independent layers are independent of the loss of the language discriminator, the loss function of the language-independent layers, $L_{indep}$, is equivalent to $L_{qa}$, that is, $L_{indep} = L_{qa}$.
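The interplay of the two losses can be illustrated with a small numeric sketch. It assumes a critic-style discriminator that outputs one scalar score per example (the summed scores of query and document), takes the discriminator loss as the mean target score minus the mean source score, and subtracts a weighted copy of that loss from the QA loss for the language-dependent layers. Function and argument names are illustrative.

```python
from typing import Sequence


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def discriminator_loss(
    target_scores: Sequence[float], source_scores: Sequence[float]
) -> float:
    """Low when target examples score low and source examples score high."""
    return mean(target_scores) - mean(source_scores)


def dependent_layer_loss(qa_loss: float, disc_loss: float, lam: float) -> float:
    """Minimizing this minimizes the QA loss while maximizing disc_loss."""
    return qa_loss - lam * disc_loss


# Toy batch: the discriminator already separates the two languages well.
d_loss = discriminator_loss([-1.0, -3.0], [2.0, 4.0])
dep_loss = dependent_layer_loss(qa_loss=1.5, disc_loss=d_loss, lam=0.001)
```

When the discriminator separates the languages well, its loss is strongly negative, which raises the loss of the language-dependent layers and pushes them toward representations the discriminator cannot tell apart.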

The definition of the QA loss $L_{qa}$ depends on the specific QA task. In EQA, each example is labeled with a span in the document containing the answer. The QA model predicts the probability of each position in the document being the start or end of an answer span. The QA loss function is defined as the negative sum of the log probabilities of the predicted distributions indexed by the true start and end indices, averaged over all the training examples:

$$L_{qa} = -\frac{1}{N} \sum_{i=1}^{N} \left[ \log p^{start}_{y^{start}_i} + \log p^{end}_{y^{end}_i} \right] \tag{3}$$

where $N$ is the number of training examples, $y^{start}_i$ and $y^{end}_i$ are respectively the ground-truth starting and ending positions of example $i$, and $p^{start}$ and $p^{end}$ are respectively the predicted probability distributions over the starting and ending positions.
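As a worked illustration of this span loss, the sketch below computes the averaged negative log-likelihood of the true start and end positions from already-normalized probability distributions. The names are illustrative; a real implementation would typically work on logits with a numerically stable log-softmax.

```python
import math
from typing import Sequence


def qa_span_loss(
    start_probs: Sequence[Sequence[float]],
    end_probs: Sequence[Sequence[float]],
    start_idx: Sequence[int],
    end_idx: Sequence[int],
) -> float:
    """Average negative log-probability of the true start and end positions."""
    total = 0.0
    for p_start, p_end, y_start, y_end in zip(
        start_probs, end_probs, start_idx, end_idx
    ):
        # Index each predicted distribution by the ground-truth position.
        total += math.log(p_start[y_start]) + math.log(p_end[y_end])
    return -total / len(start_idx)
```

For a single example with start probability 0.5 and end probability 0.75 at the true positions, the loss is $-(\ln 0.5 + \ln 0.75)$.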

The QA model and the language discriminator are learned in an adversarial manner. The training procedure of our model is summarized in Algorithm 1.

for each training iteration do
    for $k$ steps do
        Sample a batch of $m$ training examples from the target domain
        Sample a batch of $m$ training examples from the source domain
        Update the discriminator $D$ by minimizing its loss function $L_D$ in (1)
    end for
    Update the language-dependent layers by minimizing their loss function $L_{qa} - \lambda L_D$ ($L_{qa}$ defined in (3))
    Update the language-independent layers by minimizing their loss function $L_{qa}$
end for
Algorithm 1 Training procedure.
$k$ is the number of learning steps for the discriminator in each iteration. $\lambda$ is the discriminator loss coefficient, which is initially zero and scales up with the training steps. $m$ is the batch size. Details on these hyperparameters are provided in the Experiments section.
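The control flow of Algorithm 1 can be sketched as a plain Python loop. The update and sampling functions are hypothetical callables supplied by the caller, and reusing the most recently sampled batches for the two QA-model updates is an assumption, since the algorithm does not state which batches they use. The sketch also assumes `k >= 1`.

```python
from typing import Any, Callable


def train(
    num_iterations: int,
    k: int,
    sample_target: Callable[[], Any],
    sample_source: Callable[[], Any],
    update_discriminator: Callable[[Any, Any], None],
    update_dependent: Callable[[Any, Any, float], None],
    update_independent: Callable[[Any, Any], None],
    coefficient: Callable[[int], float],
) -> None:
    """Alternate k discriminator updates with one QA-model update."""
    for step in range(num_iterations):
        for _ in range(k):
            target_batch = sample_target()
            source_batch = sample_source()
            # The discriminator learns to tell the two languages apart.
            update_discriminator(target_batch, source_batch)
        # Language-dependent layers: QA loss minus weighted discriminator loss.
        update_dependent(target_batch, source_batch, coefficient(step))
        # Language-independent layers: QA loss only.
        update_independent(target_batch, source_batch)
```

Keeping the three updates as separate callables mirrors the fact that each group of parameters is optimized against its own loss.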

5 Experiments

5.1 Question Answering Model

Of the numerous models proposed for EQA, we chose QANet Yu et al. (2018) as the base model due to its good performance; also, thanks to its lack of recurrent layers, it can be trained efficiently. It is possible to replace QANet with other QA models as long as they can be separated into language-dependent and -independent layers. Below we briefly introduce the architecture of QANet. The network architecture can be found in Fig. 1. Note that the concatenation of the language-dependent and -independent layers forms the original QANet.

In the language-dependent layers, the Input Embedding Layer first produces the word and character embedding for each word in the query $q$ or document $d$, after which an Embedding Encoder Layer, which consists of an encoder block, is used to model the temporal interactions between words and refine them into contextualized representations. The encoder block is composed exclusively of depthwise separable convolutions and self-attention. The intuition is that convolution components model local interactions and self-attention components model global interactions. We found it necessary to include the embedding encoder layer in the language-dependent layers, because in preliminary experiments, with only the input embedding layer, the language-dependent layers did not generate language-invariant representations.

In the language-independent layers, a context-query attention layer generates the question-document similarity matrix and computes the question-aware vector representations of the context words. We also tried placing the context-query attention layer in the language-dependent layers instead of the language-independent layers, but this made the discriminator's task too easy and degraded performance. After the context-query attention layer, a model encoder layer consisting of seven encoder blocks captures the interaction among the context words conditioned on the question. Finally, the output layer predicts the start position and end position in the document and then extracts the answer from the document.

| | Chinese (original) | English translation |
|---|---|---|
| Document $d$ | “…所有南亞的主要書寫系統事實上都用於梵語文稿的抄寫。自19世紀晚期,天城文被定為梵語的標準書寫系統…” | “…all the major writing systems in South Asia are actually used for the transcription of Sanskrit manuscripts. Since the late 19th century, Tianchengwen has been designated as the standard writing system for Sanskrit…” |
| Query $q$ | 天城文在何時成為梵語的標準書寫系統? | When did Tianchengwen become the standard writing system for Sanskrit? |
| Answer $a$ | 19世紀晚期 | Late 19th century |

Table 1: An example (document $d$, query $q$, and answer $a$) in DRCD with its English translation.

5.2 Corpora

We use the SQuAD English corpus, the NewsQA English corpus and the DRCD Chinese corpus. These three corpora are introduced below.

Source Data – SQuAD. For the source-language dataset, we chose the SQuAD Rajpurkar et al. (2016) training set, in which documents are a set of Wikipedia articles, and the answer with respect to the given question is always a span in the document. This training set contains 87,500 question-answer pairs. The average article length is 250 words, and the average question length is 10 words.

Source Data – NewsQA. In order to test the generality of our proposed approach, we conduct experiments on another source-language dataset, NewsQA Trischler et al. (2016) training set, in which documents are a set of CNN news articles. The answer with respect to the given question is also a span in the document. In original NewsQA, there are unanswerable questions specifically designed to test the reasoning ability of model. We remove these questions and leave the challenge of identifying the unanswerable questions for future work. To eliminate the difference in article length as a possible cause of trivial discrimination between source and target data, we remove the articles with lengths longer than 600 words. The resulting training set contains 42,300 question-answer pairs. The average article length is 378 words, and the average question length is 6 words.

Target Data – DRCD. For the target language dataset, we used the Delta Reading Comprehension Dataset (DRCD) Shao et al. (2018), an EQA dataset in which documents are a set of Chinese Wikipedia articles, and the answer with respect to the given question is always a span in the document, as in SQuAD. An example from DRCD and its English translation is shown in Table 1. In DRCD, the training set contains 26,936 questions with 8,014 paragraphs, which is smaller than the SQuAD training set; the testing set contains 3,524 questions with 1,000 paragraphs (the testing set mentioned here is actually a development set; the real testing set is not publicly available yet). After word segmentation, the average number of words per document is 262, and the average number of words per question is 13, both slightly more than their SQuAD counterparts. In DRCD every question has only one ground-truth answer, as opposed to SQuAD's three, which makes it more difficult for the model prediction to match the ground truth.

5.3 Experimental Setup

The word embeddings in all our experiments were initialized from the 300-dimensional pre-trained Fasttext embeddings Bojanowski et al. (2016) and fixed during training for both English and Chinese. The word-by-word bilingual dictionary used for naive word-by-word translation was provided by Google Machine Translation. We used JIEBA, a Python library, to segment Chinese sentences into words. The resulting word vocabulary size for DRCD was around 110,000. We used Google Machine Translation (obtained in November 2018) for the MT-based approaches.

5.4 Training details

In the QANet embedding encoder layer, the convolutional blocks and self-attention blocks encode the words from both document and question into contextualized word representations. We chose to incorporate our language discriminator on top of this layer to encourage more explicit alignment between feature representations.

We adopted the discriminator design from Gulrajani et al. (2017) for our language discriminator: it stacks five residual blocks of 1D convolutional layers with 96 filters and a filter size of 5, followed by one linear layer to convert each input vector sequence into a scalar value. All models used in the experiments were trained with a batch size of 24, using the Adam optimizer with a learning rate of 1e-3 until convergence. We adopted the architecture suggested by the QANet paper, but as we found that using the suggested hyperparameters did not yield optimal results, we set the hidden state size to 96 across all layers and the number of self-attention heads to 2. We also used L2 norm regularization, gradient clipping, and moving averages of all weights with an exponential decay rate of 0.999.

We used a variable weight for the discriminator loss coefficient $\lambda$. We initially set $\lambda$ to 0, training the whole model like a normal QA model. Then we slowly increased $\lambda$ to 0.001 over the first 30,000 training steps to gradually encourage the model to produce invariant representations. Without this scheduling technique, we observed that the model was overly influenced by the discriminator loss and hindered from performing the normal QA task. The number of discriminator steps per iteration, $k$ in Algorithm 1, was set to 5.
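One simple way to realize such a schedule is a linear warm-up, sketched below. The text only states that the coefficient starts at 0 and reaches 0.001 over the first 30,000 steps; the linear shape and the function name are assumptions.

```python
def discriminator_coefficient(
    step: int, max_value: float = 0.001, warmup_steps: int = 30000
) -> float:
    """Linearly ramp the discriminator loss coefficient from 0 to max_value."""
    if step <= 0:
        return 0.0
    # After the warm-up the coefficient stays at its maximum value.
    return max_value * min(1.0, step / warmup_steps)
```

Early in training the adversarial term is therefore negligible, and the model behaves like an ordinary QA model.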

5.5 Evaluation Measures

The evaluation metrics we used were exact match (EM) and the (macro-averaged) F1 score. If the predicted text answer and the ground-truth text answer are exactly the same, then the EM score is 1, and 0 otherwise. The F1 score is based on precision and recall. Precision is the percentage of Chinese characters in the predicted answer that occur in the ground-truth answer, and recall is the percentage of Chinese characters in the ground-truth answer that also occur in the predicted answer. The EM and F1 scores of each testing example were averaged as the final EM and F1 score. We removed all the punctuation in the answers and used the standard evaluation script from SQuAD Rajpurkar et al. (2016) to evaluate the performance.
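The two metrics can be sketched at the character level as follows. This is an illustrative reimplementation, not the official SQuAD script: it strips Unicode punctuation and whitespace (the whitespace removal is an assumption), treats every remaining character as a token, and counts overlap as a multiset intersection.

```python
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """Drop whitespace and all Unicode punctuation characters."""
    return "".join(
        ch
        for ch in text
        if not ch.isspace() and not unicodedata.category(ch).startswith("P")
    )


def exact_match(prediction: str, truth: str) -> float:
    return 1.0 if normalize(prediction) == normalize(truth) else 0.0


def f1_score(prediction: str, truth: str) -> float:
    """Character-level F1 between a predicted and a ground-truth answer."""
    pred_chars = list(normalize(prediction))
    true_chars = list(normalize(truth))
    if not pred_chars or not true_chars:
        return float(pred_chars == true_chars)
    # Multiset intersection counts each shared character at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_chars) & Counter(true_chars)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(true_chars)
    return 2 * precision * recall / (precision + recall)
```

For the prediction "19世紀" against the ground truth "19世紀晚期", precision is 1 and recall is 4/6, giving an F1 of 0.8 and an EM of 0.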

5.6 MT-based Approach

The train-on-target results and the results of competing approaches are shown in Table 2. We translate the sentences of SQuAD and NewsQA into Chinese using the Google Machine Translation system. The resulting corpora are denoted as SQuAD (MT) and NewsQA (MT). We also translate SQuAD and NewsQA into Chinese using only a word-by-word bilingual dictionary. The results are denoted as SQuAD (word-by-word) and NewsQA (word-by-word). Rows (a) to (e) are the established baselines, with the models trained directly on the DRCD training set. Row (f) shows human performance. Row (g) is the result when QANet is trained on untranslated SQuAD, and row (h) is the result of joint training on both SQuAD and DRCD (the subsets of data are combined and shuffled before joint training). Row (k) is the result when QANet is trained only on SQuAD (MT), and row (l) is the result of joint training on both SQuAD (MT) and DRCD. Row (o) is the result when QANet is trained only on SQuAD (word-by-word), while row (p) is the result of joint training on both SQuAD (word-by-word) and DRCD. Rows (i), (j), (m), (n), (q) and (r) are the corresponding results for NewsQA. We see that adding English training data without translation yields no substantial improvement (rows (h) and (j) vs. row (e)). We also see that training on translated source data together with DRCD yields improvements (rows (l), (n), (p) and (r) vs. row (e)). Finally, we find that word-by-word translation performs worse than machine translation (rows (p) and (r) vs. rows (l) and (n)).

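| Model or training data | Row | EM | F1 |
|---|---|---|---|
| Shao et al. (2018) | (a) | - | 53.78 |
| BiDAF Seo et al. (2016) | (b) | 56.45 | 70.57 |
| DrQA Chen et al. (2017) | (c) | 63.21 | 74.11 |
| F-Net Huang et al. (2017) | (d) | 57.54 | 70.86 |
| QANet (Baseline) | (e) | 66.10 | 78.01 |
| Human Performance | (f) | 80.43 | 93.30 |
| SQuAD | (g) | 3.00 | 12.65 |
| +DRCD | (h) | 66.56 | 78.67 |
| NewsQA | (i) | 0.93 | 7.51 |
| +DRCD | (j) | 66.23 | 78.45 |
| SQuAD (MT) | (k) | 53.50 | 72.22 |
| +DRCD | (l) | 74.20 | 85.67 |
| NewsQA (MT) | (m) | 22.42 | 35.99 |
| +DRCD | (n) | 68.98 | 81.41 |
| SQuAD (word-by-word) | (o) | 18.89 | 37.98 |
| +DRCD | (p) | 68.89 | 81.31 |
| NewsQA (word-by-word) | (q) | 11.74 | 26.82 |
| +DRCD | (r) | 67.16 | 79.52 |

Table 2: EM/F1 train-on-target scores over DRCD. Rows (a) to (e) are baselines; rows (g) to (r) are QANet trained on the listed data. FusionNet is denoted as F-Net.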
Figure 2: Results of proposed GAN-based approaches on DRCD testing set
Figure 3: Performance curves given labeled training data from target domain. Vertical axis shows F1 score. Performance curve of GAN-CH is plotted to compare with baseline model.

5.7 GAN-based Approach

Here we show the results of the experiments on the GAN-based approach, in which two language-dependent layers and a language discriminator are included in the QA model. It is possible to update the target-language layers $E_t$ and the source-language layers $E_s$ simultaneously to minimize the language-dependent loss $L_{dep}$ in (2), but this leads to unstable training. Therefore, in our implementation, we update either $E_t$ or $E_s$ to minimize $L_{dep}$ in (2); the other only minimizes the QA loss $L_{qa}$ in (2) and ignores the discriminator loss $L_D$. This strategy results in more stable training. When $E_s$ minimizes $L_{dep}$ while $E_t$ minimizes $L_{qa}$, we are steering the output of $E_s$ toward that of $E_t$. The results are reported in Table 3. Row (a) repeats row (e) of Table 2. Rows (b) to (d) are the results using SQuAD as source data. We find that even without the discriminator, the QA model with language-dependent layers brings improvement (row (b) vs. row (p) in Table 2).

The language discriminator is used in rows (c) and (d). GAN-CH and GAN-EN denote the results when the source-language (English) layers $E_s$ or the target-language (Chinese) layers $E_t$, respectively, are updated to minimize the language-dependent loss $L_{dep}$. Incorporating adversarial learning is helpful (rows (c) and (d) vs. row (b)). Row (c) performs better than row (d), which attests to the greater benefit of aligning source domain representations to target domain representations.

The GAN-CH approach rivals the performance of the best setting among the MT-based approaches (row (c) vs. row (l) in Table 2). Note that the GAN-based approach in row (c) uses only a word-by-word bilingual dictionary and does not use any sentence-level MT tool. It is also beneficial to apply our proposed language-dependent layers and language discriminator, as we obtain improvements over the base model with the same resources (row (d) vs. row (p) in Table 2). Rows (e) to (g) are the results using NewsQA as source data. A similar trend is found with NewsQA.

In Figure 2, we also compare the training processes of rows (c), (d), (f) and (g). The testing-set F1 scores of GAN-EN grow quickly but end up lower than those of GAN-CH, which could indicate early convergence.

Next, in Figure 3, we show the performance curves of GAN-CH with various amounts of labeled Chinese (target) data. We see that the proposed approach requires only 15,000 Chinese QA pairs to rival the baseline model trained with 25,000 QA pairs; this demonstrates the effects of cross-lingual transfer learning.

5.8 Combining MT and GAN approaches

Instead of using word-by-word translation and bilingual embeddings to encode source language data as in Section 4.2, one can use an MT system to translate source language data into the target language before encoding the results using target language word embeddings. The translated source language data and the target language data are then fed into the model in Fig. 1. We found that using both the MT-based approach and GAN yielded the best performance. The results are shown in Table 4.

| Source data | Approaches | Row | EM | F1 |
|---|---|---|---|---|
| - | Baseline | (a) | 66.10 | 78.01 |
| SQuAD | Dependent | (b) | 70.97 | 81.92 |
| SQuAD | +GAN-CH | (c) | 74.00 | 85.35 |
| SQuAD | +GAN-EN | (d) | 70.97 | 83.36 |
| NewsQA | Dependent | (e) | 67.96 | 80.51 |
| NewsQA | +GAN-CH | (f) | 71.73 | 83.90 |
| NewsQA | +GAN-EN | (g) | 69.25 | 82.28 |

Table 3: EM/F1 scores of GAN-based approaches on DRCD. Dependent denotes QA model trained without discriminator; MT denotes the MT-based approach; GAN-CH denotes the output of the English dependent layer steered toward Chinese; GAN-EN denotes the output of the Chinese dependent layer steered toward English; MT+GAN-CH denotes the combination of MT and GAN-CH.

| Source data | Approaches | Row | EM | F1 |
|---|---|---|---|---|
| SQuAD | MT+GAN-CH | (a) | 75.12 | 87.26 |
| NewsQA | MT+GAN-CH | (b) | 72.79 | 84.94 |

Table 4: EM/F1 scores of combining MT and GAN-based approaches on DRCD.

6 Conclusion

We investigate several cross-lingual transfer learning approaches for QA. First, we apply sentence-level MT-based approaches, which bring significant improvements on target-domain testing data. Second, by incorporating domain adversarial learning, the GAN-based approach learns language-invariant latent representations and consequently transfers knowledge from the source domain. Given only a word-by-word bilingual dictionary, the GAN-based approach rivals the performance of the best MT-based approach, and integrating MT-based and GAN-based approaches yields the best results. We conducted experiments using SQuAD and NewsQA as source language datasets and achieved a new state of the art on a Chinese QA dataset, DRCD.


Appendix A Supplemental Material to accompany Cross-Lingual Transfer Learning for Question Answering

A.1 Test-on-Source

In the EQA task, the ground-truth answer is always a span in the document. However, in the test-on-source approach, the testing data is translated into English. It is possible that after translation, the translation of the answer is not in the translation of its corresponding document; therefore we compiled a filtered testing set from DRCD (referred to below as filtered DRCD), in which only the examples fulfilling the requirement of EQA after translation are included. We report the performance on both the filtered DRCD testing set and the original DRCD testing set.

For the test-on-source approach, all the Chinese examples in DRCD, including the training and testing sets, were translated into English; they are denoted as DRCD (English). All results of the various test-on-source settings are reported in Table 7.

We report the evaluation results on both filtered DRCD (English) and DRCD (English). Because the testing documents were translated into English, the output predictions of the QA model were also in English (column (1), column (3)). For evaluation, all the ground-truth answers in this case were translated from the original Chinese answers using Google Machine Translation. We also back-translated the English predictions into Chinese (column (2), column (4)) and evaluated the performance.

As shown in Table 7, the EM and F1 scores on Chinese answers (column (2), column (4)) are much lower than those on the English answers (column (1), column (3)). This was expected because the Chinese predictions were translated from the English predictions; the resulting translation errors degraded performance. However, when comparing the different approaches, the conclusions drawn from the Chinese and English evaluations are the same: both show that training on DRCD (English) outperforms training on SQuAD, even though SQuAD has more training examples. This is intuitive because the data distribution of DRCD (English) is closer to that of the testing data. Using both corpora is even better (row (e) vs. row (d)).

A.2 Train-on-Target

We also report the performance of Train-on-Target on the filtered DRCD testing set in Table 5.

A.3 GAN-based Approach

We also report the performance of GAN-based approaches on the filtered DRCD testing set in Table 6.

| Approaches | Row | Filtered DRCD EM | Filtered DRCD F1 | DRCD EM | DRCD F1 |
|---|---|---|---|---|---|
| Shao et al. (2018) | (a) | - | - | - | 53.78 |
| Baseline | (b) | 67.72 | 79.23 | 66.10 | 78.01 |
| Human | (c) | - | - | 80.43 | 93.30 |
| SQuAD-zh | (d) | 59.00 | 76.21 | 53.50 | 72.22 |
| +DRCD | (e) | 77.61 | 87.62 | 74.20 | 85.67 |
| NewsQA-zh | (g) | 24.76 | 38.7 | 22.42 | 35.99 |
| +DRCD | (h) | 70.49 | 83.53 | 68.98 | 81.41 |

Table 5: EM/F1 train-on-target scores over filtered DRCD and DRCD. SQuAD-zh denotes SQuAD (MT) and NewsQA-zh denotes NewsQA (MT).
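| Source data | Approaches | Row | Filtered DRCD EM | Filtered DRCD F1 | DRCD EM | DRCD F1 |
|---|---|---|---|---|---|---|
| - | Baseline | (a) | 67.72 | 79.23 | 66.10 | 78.01 |
| SQuAD | MT | (b) | 77.61 | 87.62 | 74.20 | 85.67 |
| SQuAD | Dependent | (c) | 73.18 | 83.10 | 70.97 | 81.92 |
| SQuAD | +GAN-CH | (d) | 77.04 | 87.51 | 74.00 | 85.35 |
| SQuAD | +GAN-EN | (e) | 72.45 | 85.27 | 70.97 | 83.36 |
| SQuAD | MT+GAN-CH | (f) | 78.21 | 89.36 | 75.12 | 87.26 |
| NewsQA | MT | (g) | 70.49 | 83.53 | 68.98 | 81.41 |
| NewsQA | Dependent | (h) | 71.20 | 82.12 | 67.96 | 80.51 |
| NewsQA | +GAN-CH | (i) | 75.54 | 86.35 | 71.73 | 83.90 |
| NewsQA | +GAN-EN | (j) | 72.58 | 84.36 | 69.25 | 82.28 |
| NewsQA | MT+GAN-CH | (k) | 76.83 | 87.22 | 72.79 | 84.94 |

Table 6: EM/F1 scores of GAN-based approaches over filtered DRCD and DRCD.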
| Approaches | Row | (1) ENG Ans EM | (1) ENG Ans F1 | (2) CH Ans EM | (2) CH Ans F1 | (3) ENG Ans EM | (3) ENG Ans F1 | (4) CH Ans EM | (4) CH Ans F1 |
|---|---|---|---|---|---|---|---|---|---|
| Shao et al. (2018) | (a) | - | - | - | - | - | - | - | 53.78 |
| Baseline | (b) | - | - | 67.72 | 79.23 | - | - | 66.10 | 78.01 |
| SQuAD | (c) | 49.75 | 60.61 | 30.81 | 51.91 | 34.67 | 49.31 | 23.43 | 45.29 |
| DRCD (English) | (d) | 53.97 | 63.10 | 32.40 | 53.85 | 38.02 | 51.30 | 24.57 | 46.07 |
| SQuAD + DRCD (English) | (e) | 59.44 | 68.72 | 35.60 | 58.03 | 42.02 | 56.23 | 27.09 | 50.44 |

Table 7: EM/F1 test-on-source scores over filtered DRCD (English) (columns (1) and (2)) and DRCD (English) (columns (3) and (4)). Because the testing documents are translated into English, the output predictions of the QA model are also in English (column (1), column (3)). We also back-translated the English predictions into Chinese (column (2), column (4)) and evaluated with ground truths from filtered DRCD and DRCD respectively.