Neural Skill Transfer from Supervised Language Tasks to Reading Comprehension

11/10/2017 ∙ by Todor Mihaylov, et al.

Reading comprehension is a challenging task in natural language processing that requires a set of skills to be solved. While current approaches focus on solving the task as a whole, in this paper we propose a neural network ‘skill’ transfer approach. We transfer knowledge from several lower-level language tasks (skills), including textual entailment, named entity recognition, paraphrase detection, and question type classification, into the reading comprehension model. We conduct an empirical evaluation and show that transferring language skill knowledge leads to significant improvements on the task in far fewer training steps than the baseline model needs. We also show that the skill transfer approach is effective even with small amounts of training data. Another finding of this work is that using token-wise deep label supervision for text classification improves the performance of transfer learning.




1 Introduction

Reading comprehension (RC) is a language understanding task, typically evaluated in a question answering setting, where a system is expected to read a given passage of text (document D) and answer questions (Q) about it. Recent work has introduced several large-scale datasets for reading comprehension that have gained a lot of attention, such as CNN/Daily Mail [12], MCTest [45], the Children's Book Test [13], and bAbI [48], which are formed automatically following a cloze-style setup. Most recently, SQuAD [35] and NewsQA [44] were created using crowd-sourcing.

Reading comprehension has been shown [42, 4, 35] to require different sets of skills, such as paraphrase detection, recognition of named entities, and natural language inference. The common approach to tackling a higher-level task such as reading comprehension is to build a complex neural model that reads a large-scale dataset and tries to learn all required skills at once. We instead propose learning the ‘skills’ required for reading comprehension from existing supervised language tasks. We evaluate several learned lower-level ‘skills’ for reading comprehension on the SQuAD [35] dataset by integrating them into a simple neural model. This is in contrast to [7], who propose learning sentence compression representations from a large supervised corpus and transferring the learned knowledge to a set of smaller tasks. Our approach is similar to [25], who used weights pre-trained on machine translation to boost the performance of a very good RC system [50]. Instead of solving a single complex task, we propose using the knowledge learned from multiple supervised, possibly small-scale, language tasks as ‘skills’. We propose a simple model into which learned ‘skill’ representations can be injected easily, and we analyze the learning behavior of this skill transfer model for reading comprehension. We also experiment with training on smaller parts of the training data (2%, 5%, 10%, 25%) to examine the impact of ‘skill’ transfer on smaller datasets.

2 Method

In this work, we tackle the task of reading comprehension using lower-level supporting ‘skill’ tasks. To do that, we implement a baseline model to represent the relation between a given question and the story context and enrich the representation by reusing encoder weights from the chosen ‘skill’ tasks.

Figure 1: Skillful Reader: Architecture for transferring knowledge from ‘skill’ language tasks to a reading comprehension model.

Our ‘skill’ transfer method is visualized in Figure 1 and can be summarized in two main steps:

  • Skill Learning: Train context encoder-based (Bi-LSTM) models for several language skill tasks and save the learned encoder weights.

  • Neural Skill Transfer: Reuse the learned context encoder skill weights to encode the text context of document and question, in a simple model for the higher-level task (QA/RC).

Our model can be considered similar to progressive neural networks [38] without the notion of sequential learning of the tasks. We refer to the underlying tasks as skills, following [42], who show that complex tasks like RC require a set of language analysis skills. We show that using such skills, learned from specialized corpora, boosts the performance of a good baseline RC system (i) early in training and (ii) when training on smaller portions (2, 5, and 10 percent) of the original training data.
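The two steps above, together with the choice between fine-tuning and freezing the transferred weights, can be sketched roughly as follows. This is a toy illustration, not the authors' implementation: the `Encoder` class, `build_reader`, and `sgd_step` are hypothetical names, encoder weights are stood in for by short lists of floats, and the main encoder is initialized to zeros here (rather than randomly) only to keep the example deterministic.

```python
import copy
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Encoder:
    """Toy stand-in for a Bi-LSTM context encoder (hypothetical)."""
    weights: List[float]
    trainable: bool = True


def build_reader(saved_skills: Dict[str, List[float]], fine_tune: bool) -> Dict[str, Encoder]:
    """Step 2: reuse saved skill encoder weights inside the RC model."""
    # Main RC encoder: freshly initialized (zeros here for determinism).
    encoders = {"rc": Encoder(weights=[0.0, 0.0])}
    for name, weights in saved_skills.items():
        # Copy so the saved skill weights are never modified in place.
        encoders[name] = Encoder(weights=copy.deepcopy(weights), trainable=fine_tune)
    return encoders


def sgd_step(encoders: Dict[str, Encoder], grads: Dict[str, List[float]], lr: float = 0.1) -> None:
    """One gradient step that skips frozen (non-trainable) encoders."""
    for name, enc in encoders.items():
        if not enc.trainable:
            continue
        enc.weights = [w - lr * g for w, g in zip(enc.weights, grads[name])]


# Step 1 (skill learning) is assumed to have produced these saved weights.
saved = {"ner": [1.0, 2.0]}
grads = {"rc": [1.0, 1.0], "ner": [1.0, 1.0]}

frozen = build_reader(saved, fine_tune=False)
sgd_step(frozen, grads)

tuned = build_reader(saved, fine_tune=True)
sgd_step(tuned, grads)
```

In the frozen setup only the main encoder moves during RC training, while in the fine-tuned setup the skill encoder weights are updated as well.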

2.1 Skill Learning

For encoding the skill knowledge from lower-level tasks we first implement simple context encoder models for each low-level task. In this work we implement three types of models for encoding language skill tasks: Sequence Labeling, Text Classification, and Relation Classification.

Sequence Labeling is applied for labeling each token of a given text with a specific category. For this type of encoder model we use a vanilla bi-directional Long Short-Term Memory (Bi-LSTM) [9] architecture that takes word embeddings as input and uses a label projection layer with softmax to predict the sequence labels (Figure 2a). While this does not lead to state-of-the-art performance on any sequence labeling task, it is a stable baseline [23, 19]. We hypothesize that by using a simple architecture for the skill model, we can encode the skill knowledge in the context layer. As a sequence labeling skill, we choose the task of Named Entity Recognition (NER) based on the CoNLL 2012 NER dataset. We use the BIO schema for label encoding, as shown in Figure 2a.
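As a small illustration of the BIO schema, the sketch below converts entity spans into per-token tags: `B-` marks the first token of an entity, `I-` a continuation, and `O` a token outside any entity. The function name `spans_to_bio`, the example sentence, and the `PER`/`LOC` label names are illustrative assumptions, not taken from the dataset.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert entity spans (start, end_exclusive, type) into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags


tokens = ["Barack", "Obama", "visited", "Berlin", "."]
spans = [(0, 2, "PER"), (3, 4, "LOC")]
bio_tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, bio_tags)))
```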

Figure 2: a) Vanilla Bi-LSTM for sequence labeling (NER). b) Text classification (Question Type Classification) with Bi-LSTM context encoder and token-wise label supervision.

Text Classification is applied in order to categorize a given word token sequence. Given that our RC task is cast as a QA problem, we propose to employ the skill of Question Type Classification, using the TREC Question Classification dataset [21] with 50 classes for training. The task is to classify a given question according to the type of the answer phrase. To learn text classification skills we employ a simple model with a Bi-LSTM context encoder, where we apply label supervision on the token level. The model is shown in Figure 2b. That is, instead of retrieving a single vector representation of the sentence (with avg- or max-pooling, etc.) and predicting the label, we project the context representation of each token to the label space (50 classes) and sum the label representations predicted for all tokens to obtain the label for the sentence. We hypothesize that with lower-level label supervision we can propagate the knowledge expressed by the label to the context representations of specific tokens. This is a form of deep supervision [20], similar to [22].
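The token-wise label supervision described above can be sketched as follows. This is a minimal illustration under stated assumptions: the per-token projection is taken to be affine, the raw per-token label scores are summed before a single softmax (the source does not say whether a softmax is applied per token first), and the tiny dimensions (2 hidden units, 3 classes) replace the real encoder size and the 50 TREC classes.

```python
import math
from typing import List, Tuple


def project(state: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Affine projection of one token's context state to the label space."""
    return [sum(w * x for w, x in zip(row, state)) + b for row, b in zip(weights, bias)]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sentence_label_distribution(
    token_states: List[List[float]], weights: List[List[float]], bias: List[float]
) -> Tuple[List[float], List[float]]:
    """Sum per-token label scores, then normalize to a sentence-level distribution."""
    token_scores = [project(h, weights, bias) for h in token_states]
    summed = [sum(col) for col in zip(*token_scores)]
    return softmax(summed), summed


# Three tokens with 2-d context states, projected to 3 labels.
token_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
bias = [0.0, 0.0, 0.0]

probs, summed_scores = sentence_label_distribution(token_states, weights, bias)
predicted_label = probs.index(max(probs))
```

Because the sentence-level loss is computed from the summed token scores, the gradient reaches every token's projection directly, which is the intuition behind propagating label knowledge to individual token representations.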

Relation Classification is used to classify the relation between two arguments represented as text. We implement relation classification skills following the exact Bi-LSTM max-out model from Conneau et al. [7], which has been shown to be successful for learning sentence representations.

As a relation classification skill we employ the task of Textual Entailment (TE), learned from the Stanford Natural Language Inference (SNLI) corpus [3]. TE is a task that requires a model to classify the entailment relation between two sentences: a premise and a hypothesis. For instance, the premise ‘Dogs like eating food.’ entails the hypothesis ‘Animals like eating.’. Another task that we consider useful for our target task is paraphrase detection over PPDB 2.0 [33], where the model is required to classify the relation between two phrases into one of 6 fine-grained paraphrase classes.

2.2 Model for Reading Comprehension with Skills

We build a simple neural model that uses pre-trained embeddings and word-matching features as input to a bi-directional LSTM context-encoder of document and question and two Bi-LSTM layers for predicting start and end of the answer span. The architecture of the model is shown in Figure 1.

Word embedding input. As input to the neural model, we use pre-trained 100d GloVe [34] word embeddings (WE). We also use two features for each token: the exact word matching feature (em) [47][5] between each token in the document and the question, and the maximum similarity between the word embedding vector of each document token and each token in the question (maxsim). The WE maxsim between two texts has been shown to be helpful for community question answering [27] and discourse relation sense classification [26]. For each token we concatenate the WE and the two features to form the input representation r of a token sequence p, where p can be d (document) or q (question), and use it as input to the context encoder. For the question, the two features above are set to 1 as in [47].
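The two token features and the concatenated input representation can be sketched as below. Several details are assumptions made for illustration only: similarity is taken to be cosine similarity, matching is case-insensitive, out-of-vocabulary tokens get a zero vector, and the 2-d toy embeddings stand in for the 100d pre-trained vectors. The function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def document_features(
    doc: List[str], question: List[str], emb: Dict[str, List[float]]
) -> List[Tuple[float, float]]:
    """Per document token: (exact match with question, max similarity to question)."""
    q_words = {t.lower() for t in question}
    q_vecs = [emb[t.lower()] for t in question if t.lower() in emb]
    feats = []
    for tok in doc:
        key = tok.lower()
        em = 1.0 if key in q_words else 0.0
        vec = emb.get(key)
        maxsim = max((cosine(vec, q) for q in q_vecs), default=0.0) if vec else 0.0
        feats.append((em, maxsim))
    return feats


def input_representation(
    tokens: List[str], feats: List[Tuple[float, float]], emb: Dict[str, List[float]], dim: int
) -> List[List[float]]:
    """Concatenate word embedding, em, and maxsim for each token."""
    return [emb.get(t.lower(), [0.0] * dim) + [em, ms] for t, (em, ms) in zip(tokens, feats)]


emb = {"paris": [1.0, 0.0], "capital": [0.6, 0.8], "france": [0.0, 1.0]}
doc = ["Paris", "capital"]
question = ["capital", "France"]

doc_feats = document_features(doc, question, emb)
doc_inputs = input_representation(doc, doc_feats, emb, dim=2)

# For question tokens both features are fixed to 1, as stated in the text.
question_feats = [(1.0, 1.0)] * len(question)
question_inputs = input_representation(question, question_feats, emb, dim=2)
```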

Context encoding. We use a Bi-LSTM context encoder that reads the input representation of each token in the document and the question. Each task has its own task-specific context encoder.

Context encoder for the current (main) task. For the target task of reading comprehension, we initialize an encoder with random weights.

Skill task context encoders. For each skill task, we train a context-encoder model as described in Sec. 2.1. We use the learned weights to initialize the corresponding task-specific encoders. For the tasks where we employ token label prediction (NER and Question Type Classification), we also concatenate the soft label prediction vector of each token with its context encoder state.

Adapted representations. Each output from a skill context encoder is projected to a lower dimension using adapter weights [38], which consist of a weight matrix for the current skill and a bias vector.

Ensemble representation. For each token in the document and question we concatenate all adapted skill representations to the main task representation to obtain the ensemble representation of that token. We represent the question by a weighted combination of its ensemble token vectors, with the weights computed using a learned weight matrix. We then model the interaction between this question representation and each document token.
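The adapter projection and the concatenation into an ensemble representation can be sketched for a single token as follows. This is a rough illustration: the adapter is assumed to be purely affine (the source does not state whether a nonlinearity follows), the function names are hypothetical, and the small dimensions replace the real encoder and adapter sizes.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]


def adapt(state: List[float], weights: Matrix, bias: List[float]) -> List[float]:
    """Project one skill encoder state to a lower dimension (affine adapter)."""
    return [sum(w * x for w, x in zip(row, state)) + b for row, b in zip(weights, bias)]


def ensemble(
    main_state: List[float],
    skill_states: Dict[str, List[float]],
    adapters: Dict[str, Tuple[Matrix, List[float]]],
) -> List[float]:
    """Concatenate the main task state with every adapted skill state."""
    vec = list(main_state)
    for name, state in skill_states.items():
        weights, bias = adapters[name]
        vec.extend(adapt(state, weights, bias))
    return vec


# One token: a 3-d main encoder state and two 4-d skill encoder states,
# each adapted down to 2 dimensions.
main_state = [0.1, 0.2, 0.3]
skill_states = {"ner": [1.0, 0.0, 0.0, 0.0], "te": [0.0, 1.0, 0.0, 0.0]}
adapters = {
    "ner": ([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]], [0.0, 0.0]),
    "te": ([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]], [0.5, 0.5]),
}

token_ensemble = ensemble(main_state, skill_states, adapters)
```

The ensemble size is the main encoder size plus the adapter size times the number of skills, so adding a skill widens every token vector by one adapter output.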

Answer span prediction. To predict the answer span we predict start and end pointers in the document context. We model the probability of each document token being the start of the answer span using a weight matrix and a bias, and the probability of each document token being the end of the answer span using a separate weight matrix and bias vector. We use dynamic programming to find the answer span (i, j) that maximizes the combined probability of token i being the start and token j being the end of the answer.
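A minimal sketch of the span search is shown below. It assumes the objective is the product of the start probability at i and the end probability at j, subject to i ≤ j; the exact objective is not recoverable from the text, so this is a common choice rather than a confirmed detail. The function name `best_span` and the example probabilities are hypothetical.

```python
from typing import List, Tuple


def best_span(p_start: List[float], p_end: List[float]) -> Tuple[int, int, float]:
    """Return (i, j, score) maximizing p_start[i] * p_end[j] subject to i <= j.

    A single left-to-right pass keeps the best start seen so far, so the
    search is linear in the document length.
    """
    best_i = 0
    best = (0, 0, p_start[0] * p_end[0])
    for j in range(len(p_end)):
        if p_start[j] > p_start[best_i]:
            best_i = j                      # best start among positions 0..j
        score = p_start[best_i] * p_end[j]
        if score > best[2]:
            best = (best_i, j, score)
    return best


p_start = [0.1, 0.6, 0.2, 0.1]
p_end = [0.5, 0.1, 0.3, 0.1]
span_i, span_j, span_score = best_span(p_start, p_end)
```

Note that picking the start and end independently would give start 1 and end 0 here, which is not a valid span; the constrained search avoids that.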

Training details. For all skill tasks and the RC task we use pre-trained GloVe word embeddings of size 100. For all tasks, including the target RC task, we train the bi-directional LSTM encoder with output size 256. For the skill adaptation layer we use an output size of 100.

3 Related work

Reading comprehension [14] has gained a lot of attention in recent years thanks to large-scale datasets [12][13][31]. More recently, the SQuAD [35] dataset offered over 100 thousand crowd-sourced questions about Wikipedia articles. Some of the best-performing single models (F1 75-84) on the SQuAD dataset propose token-wise interaction between document and question: Bi-DAF [39], Dynamic Coattention Networks [50], and R-NET [46]. Some models [40][29][43] try to perform reasoning more explicitly using an approach based on memory networks [49, 10]. Some simple neural models [5][47][8] incorporate features to achieve better performance. It has been shown that a big enough dataset [1] can provide enough knowledge to allow a simple neural model [17] to achieve human performance. However, in practice, having a huge dataset is not always an option. Another approach is therefore to transfer knowledge [16] from another dataset of the same task or from a less related task such as machine translation [25]. Indeed, almost all recent neural models use a form of transfer learning by incorporating word embeddings, such as [28][34], as input. Some recent models [32] even use the task of question answering to learn better embeddings. Transfer learning with neural models was initially proposed in NLP by [6] and has been encouraged as a way of sharing representations between tasks [2]. It can be performed jointly on multiple tasks [37], which includes learning linguistic tasks in a hierarchical fashion [41] on many levels [11], or even performing knowledge transfer between tasks from different modalities [18]. In this work we propose a generic and modular approach to learning a set of relevant ‘skill’ tasks and transferring this knowledge to a target task, here the problem of reading comprehension.

4 Experiments and results

In this work we examine the impact of transferring knowledge from several ‘skill’ tasks to the task of reading comprehension. The assumption is that the transfer of skill knowledge should improve the learning of the target task (RC) and allow for smaller training sets and fewer training steps. To examine this impact we run several experiments: adding single skill tasks to the RC task, adding all tasks, and ablation of tasks.

Training on the full training set. We use the SQuAD [35] train dataset for training and the publicly available dev set for evaluation. We do not aim for state-of-the-art performance but focus on the impact of injecting skill knowledge. In Table 1 we show evaluation results with single tasks and ablation of tasks, with and without fine-tuning of the skill parameters. Figure 3 shows the results at different training steps, with different skills. It shows that individual skills and all skills jointly have a noticeable impact in the early training stages compared to a model without skills.

                 Fine-tuning          No fine-tuning
Setup            F-score    EM        F-score    EM
no skills        59.41      46.90     59.66      46.80
only PPDB        60.82      48.71     58.23      45.25
only TE          61.67      49.12     59.40      46.47
only NER         60.65      48.45     58.17      44.70
only QC          60.94      48.68     57.80      44.39
all skills       60.92      48.70     58.30      45.51
all - QC         60.91      48.52     57.28      45.17
all - NER        60.81      48.55     56.99      44.07
all - TE         60.86      48.81     58.48      45.73
all - PPDB       61.11      48.83     57.87      45.19

Table 1: Results for transferring knowledge from skill tasks. ‘Fine-tuning’: parameters of the skill tasks are fine-tuned during training. ‘EM’ shows the results for Exact Match with the gold answers. We show evaluation results on the dev set of SQuAD [35].
Figure 3: Results for single skill tasks combined with the QA-encoder. (w/ skill fine-tuning)
Figure 4: Results with QTC skill, w/ and w/o token label supervision. (w/o fine-tuning)
Figure 5: Results for training with different sizes of the training data (2%, 5%, 10%, 25%) and evaluated on the dev set. ‘Rand. skill weights’ is ‘All skill tasks’ model with random weights. (w/ fine-tuning)
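Table 1 reports F-score and EM without spelling out how they are computed. The sketch below assumes the usual SQuAD-style definitions: answers are normalized (lowercased, punctuation and English articles removed, whitespace collapsed), EM checks string equality, and F-score is the harmonic mean of token-level precision and recall. The function names are hypothetical and this is not the official evaluation script.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


em_same = exact_match("The Eiffel Tower!", "Eiffel Tower")
em_longer = exact_match("Eiffel Tower in Paris", "Eiffel Tower")
f1_longer = token_f1("Eiffel Tower in Paris", "Eiffel Tower")
```

A prediction that contains the gold answer plus extra words gets no EM credit but partial F-score credit, which is why the F-score column in Table 1 is consistently higher than the EM column.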

Training on small parts of the train data. Figure 5 shows results for training on subsets of the training data of different sizes. 2% of the training data contains 378 paragraphs and 2512 questions, with 88k tokens in total. We show that with less data (2% or 5% of the full training set), employing skill tasks has a large impact, reaching a better result than the ‘No skills’ or ‘Random skill weights’ setups in only 1000 steps.

Token-wise label supervision. Figure 4 analyzes the impact of token-wise label prediction vs. sentence-wise label prediction with Question Type Classification. We show that token supervision clearly outperforms sentence label supervision in early training phases.

5 Conclusion and future work

In this work, we show the impact of injecting knowledge from supervised language skill tasks into a reading comprehension model. We observe noticeable performance gains both in early training stages and when using small training data. While large training sets are currently being built for some domains, for others, such as [36], this is not the case. Beyond performance, using skill tasks as proposed in this work can serve as a tool for analyzing which specific skills are required for reading comprehension (or other tasks), and how much specific skills contribute for a particular dataset and problem formulation, without having to conduct manual annotation as in [42]. Another finding is that token-wise deep label supervision for QTC is profitable for reading comprehension in a QA setting. In future work we plan to transfer knowledge from other tasks, such as Discourse Relations [15][30] and Semantic Role Labeling [24]. We also want to experiment with different ways of integrating the learned skills and to apply the approach to other target tasks. Finally, we plan to train all the tasks jointly, in a multi-task fashion, where shared parameters are fine-tuned on the skill tasks and the target task.