Learning to Selectively Transfer: Reinforced Transfer Learning for Deep Text Matching

12/30/2018 ∙ by Chen Qu, et al. ∙ University of Massachusetts Amherst ∙ Carnegie Mellon University

Deep text matching approaches have been widely studied for many applications including question answering and information retrieval systems. To deal with a domain that has insufficient labeled data, these approaches can be used in a Transfer Learning (TL) setting to leverage labeled data from a resource-rich source domain. To achieve better performance, source domain data selection is essential in this process to prevent the "negative transfer" problem. However, the emerging deep transfer models do not fit well with most existing data selection methods, because the data selection policy and the transfer learning model are not jointly trained, leading to sub-optimal training efficiency. In this paper, we propose a novel reinforced data selector to select high-quality source domain data to help the TL model. Specifically, the data selector "acts" on the source domain data to find a subset for optimization of the TL model, and the performance of the TL model can provide "rewards" in turn to update the selector. We build the reinforced data selector based on the actor-critic framework and integrate it into a DNN based transfer learning model, resulting in a Reinforced Transfer Learning (RTL) method. We perform a thorough experimental evaluation on two major tasks for text matching, namely, paraphrase identification and natural language inference. Experimental results show that the proposed RTL method can significantly improve the performance of the TL model. We further investigate different settings of states, rewards, and policy optimization methods to examine the robustness of our method. Finally, we conduct a case study on the selected data and find that our method is able to select source domain data that is close to the target domain data in terms of Wasserstein distance. This is reasonable and intuitive as such source domain data can provide more transferability power to the model.


1. Introduction

Text matching is an important problem in both information retrieval and natural language processing. Typical examples of text matching include paraphrase identification [R. Socher and E. H. Huang and J. Pennington and A. Y. Ng and C. D. Manning, 2011], natural language inference [Bowman et al., 2015], document retrieval [Guo et al., 2016], question answering (QA) [Yang et al., 2016], and conversational response ranking [Yang et al., 2018]. In particular, text matching plays a key role in conversational assistant systems to answer customer questions automatically. For example, the Contact Center AI (https://cloud.google.com/solutions/contact-center/) recently launched by Google and the AliMe [Li et al., 2017] built by Alibaba Group are both capable of handling informational requests by retrieving potential answers from a knowledge base.

We illustrate the importance of text matching by describing the role it plays in a retrieval-based QA system. Typically, for a given user query, the system measures its similarity with the questions in the knowledge base and returns the answer of the best matched question [Yan et al., 2016, Yu et al., 2018]. The query-question matching problem can be modeled as a paraphrase identification (PI) or natural language inference (NLI) task, both of which are typical text matching tasks. Thus, in this work, we focus on PI and NLI tasks to evaluate the performance of our method on text matching. We believe improvements in text matching methods can benefit end tasks such as question answering. PI and NLI problems have been widely studied in previous work [Bowman et al., 2015, Yin et al., 2016, 2018, R. Socher and E. H. Huang and J. Pennington and A. Y. Ng and C. D. Manning, 2011]. However, when applied to real world applications, such methods face the challenge of insufficient labeled data in different domains. For example, in the E-commerce industry, a QA system has to handle each small domain of products such as books, electronics, clothes, etc. It is unrealistic to obtain a large amount of labeled training data for every small domain. As a promising approach to bridge the domain discrepancy, Transfer Learning (TL) has become an important research direction in the past several years [Yu et al., 2018, Shen et al., 2018, Ruder and Plank, 2017, Yosinski et al., 2014, Liu et al., 2017].

Due to the domain shift between the source and target domains, directly applying TL approaches may result in the "negative transfer" problem. To prevent this problem, we argue that source domain data selection is necessary for TL approaches. Table 1 gives an example of negative transfer in the PI task. "Order" typically means to place an order for a product in the E-commerce domain (target domain). However, in an open domain (source domain) dataset, "order" can be used to denote a succession or sequence. Hence, in such cases, TL without source domain data selection might result in negative transfer.

Domain | Sentence 1 | Sentence 2
Source (Open Domain) | Which answers does Quora show first for each question? | How does Quora decide the order of the answers to a question?
Source (Open Domain) | What order should the Matrix movies be watched in | Is there any particular order in which I should watch the Madea movies
Source (Open Domain) | How can I order a cake from Walmart online? | How do I order a cake from Walmart?
Target (E-commerce Domain) | How long is my order arriving? Is it over? Will I have the refund? | I have escalated an order and have not been updated in over a week.
Target (E-commerce Domain) | How can i get an order receipt or invoice? | How do I get an invoice to pay?
Target (E-commerce Domain) | I need to understand why my orders have been cancelled | Why my order have been closed?
Table 1. An example of negative transfer in the PI task. This table is best viewed in color. "Order" in blue means to place an order for a product, which is typical in the E-commerce domain. "Order" in red means a succession or sequence, which might appear in the source open domain. Transfer learning without source domain data selection might result in negative transfer.

Recently, neural architectures have been employed to leverage a large amount of source domain data and a small amount of target domain data in a multi-task learning manner [Mou et al., 2016, Yang et al., 2017], which can be described as Deep Neural Network (DNN) based supervised transfer learning. The DNN based TL framework has been proven to be effective in deep text matching tasks for question answering systems [Yu et al., 2018]. Although various data selection methods [Ruder and Plank, 2017, Chen et al., 2011, Huang et al., 2006, Patel et al., 2018] have been proposed for TL settings, most of them do not fit well with neural transfer models, because the data selector/reweighter is not jointly trained with the TL model. Specifically, the TL task model is considered as a sub-module of the data selection framework, so the TL task model needs to be retrained repeatedly to provide sufficient updates to the data selection framework. Due to the relatively long training time of neural models, such data selection methods become inefficient when applied to neural TL models. Therefore, we argue that data selection methods for transfer learning need to be revisited under the DNN based TL setting.

In the setting of DNN based transfer learning, the TL model is updated with mini-batch gradient descent in an iterative manner. In order to learn a universal data selection policy in this setting, we model the problem of source domain data selection as a Markov Decision Process (MDP) [Puterman, 1994]. Specifically, at each time step $t$ (mini-batch/iteration), the TL model is at a certain state $s_t$; the decision maker (data selector) chooses an action $a_t$ to select samples from the current source batch to optimize the TL model. The TL model gives the data selector a reward $r_t$ and moves on to the next state $s_{t+1}$. The state $s_{t+1}$ depends on the current state $s_t$ and the action $a_t$ made by the data selector. To solve this problem, it is intuitive to employ reinforcement learning, where the decision maker is the data selection policy that needs to be learned.

In this paper, we propose a novel reinforced data selector to select high-quality source data to help the TL model. Specifically, we build our data selector based on the actor-critic framework and integrate it into a DNN based TL model, resulting in a Reinforced Transfer Learning (RTL) method. To improve the model training efficiency, the instance based decisions are made in a batch. Rewards are also generated on a batch level. Extensive experiments on PI and NLI tasks show that our RTL method significantly outperforms existing methods. Finally, we use the Wasserstein distance to measure the distance between the target and source domains before and after the data selection. We find that our method tends to select source data that is close to the target domain data in terms of Wasserstein distance. This is reasonable and intuitive as such source domain data can provide more transferability power to the model.

Our contributions can be summarized as follows. (1) To the best of our knowledge, we propose the first reinforcement learning based data selector to select high-quality source data to help a DNN based TL model. (2) In contrast to conducting data selection instance by instance, we propose a batch based strategy to sample the actions in order to improve the model training efficiency. (3) We perform a thorough experimental evaluation on PI and NLI tasks involving four benchmark datasets. We find that the proposed reinforced data selector can effectively improve the performance of the TL model and outperform several existing baseline methods. We also use Wasserstein distance to interpret the model performance.

2. Related Work

Paraphrase Identification and Natural Language Inference. PI and NLI problems have been extensively studied in previous work. Existing methods include using convolutional, recurrent, or recursive neural networks to model the sentence interactions, attentions, or encodings of a pair of input sentences [Bowman et al., 2015, Yin et al., 2016, 2018, R. Socher and E. H. Huang and J. Pennington and A. Y. Ng and C. D. Manning, 2011]. These methods have been shown to be highly effective when given enough labeled training data. However, in real world applications, obtaining a large amount of labeled data by human annotation is not always affordable in terms of time and expense. Therefore, we focus on PI and NLI tasks in a transfer learning setting in this paper.

Transfer Learning. Transfer learning has been widely studied in the past years [Pan and Yang, 2010]. Existing work can be mainly classified into two categories. The first category makes the assumption that labeled data from both the source and target domains are available, though the amounts may differ [Daume III, 2007, Yu et al., 2018]. The second category assumes that no labeled data from the target domain is available in addition to the labeled source domain data [Shen et al., 2018, Ruder and Plank, 2017]. Our work falls into the first category. In addition, an alternative taxonomy of transfer learning focuses on methods, which again yields two categories. The first is instance based methods, which select or reweight the source domain training samples so that data from the source domain and the target domain share a similar data distribution [Ruder and Plank, 2017, Chen et al., 2011, Huang et al., 2006]. The second category is feature based methods, which aim to locate a common feature space that can reduce the differences between the source and target domains. This goal is accomplished either by transforming the features of one domain to be closer to the other domain, or by projecting both domains into a common latent space [Yu et al., 2018, Shen et al., 2018].

In terms of instance based methods and feature based methods, our work falls into the first category, and we select data from the source domain to benefit the task performance in the target domain. In this line of work, data selectors/reweighters are typically not jointly trained with the TL model, which can lead to negative impacts on training efficiency [Ruder and Plank, 2017, Chen et al., 2011, Patel et al., 2018]. Specifically, the TL model is considered as a sub-module of the data selection framework and the data selection policy is updated based on the final performance of the TL model. Due to the relatively long training time of neural models, such data selection methods suffer from poor training efficiency if applied to neural transfer learning models.

Reinforcement Learning. The concept of reinforcement learning (RL) dates back several decades [Sutton and Barto, 1998, Arulkumaran et al., 2017]. Recent advances in deep learning have made it possible for RL agents to generate dialogs [Li et al., 2016], play video games [Mnih et al., 2013], and even outperform human experts in the game of Go [Silver et al., 2017]. Reinforcement learning algorithms can be categorized into two types: value based methods and policy based methods. Value based methods estimate the expected total return given a state. This type of method includes SARSA [Rummery and Niranjan, 1994] and the Deep Q Network [Mnih et al., 2015]. Policy based methods try to find a policy directly instead of maintaining a value function, such as the REINFORCE algorithm [Williams, 1992]. Such methods can provide strong learning signals to update the policy. Finally, it is also possible to combine the value based and policy based approaches into a hybrid solution, such as the actor-critic algorithm [Konda and Tsitsiklis, 2003], which employs a learned value estimator to provide a baseline for the policy network for variance reduction. We experiment with policy based methods and hybrid methods in our model.

Given the dynamic nature of reinforcement learning, researchers found it useful to employ RL in data selection problems, because data selection during training can be modeled as a sequential decision making process. So far, RL has been applied to data selection in active learning [Fang et al., 2017], co-training [Wu et al., 2018], and other applications of supervised learning, including computer vision [Fan et al., 2017, Patel et al., 2018], machine reading comprehension [Wang et al., 2018], and entity relation classification [Feng et al., 2018]. However, there is a lack of reinforced data selection methods under a DNN based transfer learning setting.

3. Our Approach

In this section, we present our reinforced transfer learning (RTL) framework under a DNN based transfer learning setting. The reinforced data selector is integrated into a TL model to select source domain data to prevent negative transfer.

3.1. Task Definition

We formulate our task into three subtasks: a text matching task, a transfer learning task, and a data selection task.

3.1.1. Text Matching.

Both paraphrase identification and natural language inference tasks can be unified as a text matching problem, defined as follows. Given two sentences $a = (a_1, \dots, a_{\ell_a})$ and $b = (b_1, \dots, b_{\ell_b})$, where each $a_i, b_j \in \mathbb{R}^d$ denotes a word embedding vector either randomly initialized or retrieved from a pre-trained global vector look-up table (such as GloVe [Pennington et al., 2014]), and $\ell_a$ and $\ell_b$ denote the lengths of $a$ and $b$ respectively. The goal is to predict a binary label $y \in \{0, 1\}$ that denotes whether $a$ and $b$ are semantically related. For PI, $y = 1$ denotes that the two sentences are semantically identical (PARAPHRASE). For NLI, $y = 1$ denotes that the hypothesis can be inferred from the premise (ENTAILMENT). (The common NLI task contains a third label, CONTRADICTION, which denotes that the premise and the hypothesis contradict each other; the SciTail [Khot et al., 2018] dataset in our experiment does not come with this label.)
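
To make this input format concrete, the following NumPy sketch (not from the paper) builds the embedding matrices for a toy sentence pair; the vocabulary, the randomly initialized embedding table standing in for GloVe, and the embedding dimension are illustrative assumptions.

```python
import numpy as np

# Toy vocabulary and a randomly initialized embedding table (stand-in for GloVe).
vocab = {"how": 0, "do": 1, "i": 2, "order": 3, "a": 4, "cake": 5, "can": 6}
d = 8                                    # embedding dimension (illustrative; the paper uses 300)
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), d))

def embed(sentence):
    """Map a tokenized sentence to its sequence of word embedding vectors."""
    return np.stack([embeddings[vocab[w]] for w in sentence])

a = embed(["how", "do", "i", "order", "a", "cake"])   # shape (l_a, d)
b = embed(["can", "i", "order", "a", "cake"])         # shape (l_b, d)
y = 1                                                 # binary label: 1 = PARAPHRASE / ENTAILMENT
print(a.shape, b.shape, y)
```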

3.1.2. Transfer Learning.

We consider the transductive transfer learning setting, where the source and target tasks are the same, while the source and target domains are different [Pan and Yang, 2010]. In contrast to conventional transductive TL where no labeled target domain data is available, we assume some target domain training data is available to perform supervised training of a base model. However, we expect a significantly larger amount of source domain training data can boost the performance of the aforementioned base model under a transfer learning setting. Given the labeled source domain data $D_S$ and labeled target domain data $D_T$ for the same task, where $|D_S| \gg |D_T|$, the TL model leverages both $D_S$ and $D_T$ to improve the performance of the base model in the target domain.

3.1.3. Data Selection.

The data selection task under a transfer learning setting is formulated as follows. A transfer learning algorithm updates the TL model with a batch of source data $B_S$ and a batch of target data $B_T$ iteratively. $B_S$ and $B_T$ are drawn from $D_S$ and $D_T$ respectively. The data selection module intervenes before the update of every iteration. Specifically, the data selection module selects a part of the data $B_S'$ from $B_S$ according to a policy $\pi_\Theta$. The selected $B_S'$ is expected to produce better performance than $B_S$ after this single iteration as well as future iterations.

3.2. Overview

The proposed framework consists of three components: a base model, a transfer learning model, and a reinforced data selector, corresponding to the above three subtasks respectively. The base model tackles the basic problem of text matching. It takes in a pair of sentences and generates a hidden representation of the sentence pair for the final prediction. The TL model is built on top of the base model to leverage a large amount of source domain data. Finally, the reinforced data selector is a compartmentalized part of the transfer learning framework that handles the data selection for source domain data. The reinforced data selector is designed to prevent negative transfer and thus maximize the effectiveness of the TL model. Figure 1 gives an overview of our model.

Figure 1. Architecture of the proposed RTL framework, which consists of two major parts: a reinforced data selector and a TL model. This figure is best viewed in color. The “Shared Encoder” refers to the base model embedded in the TL model. The reinforced data selector selects a part of the source batch (blue) and feeds them into the TL model at each iteration. The TL model generates a reward on the target domain validation data for the data selector. Target batches (orange/pink) are fed into the TL model without data selection.

3.3. Base Model

As illustrated in Figure 1, the base model in our method is a shared encoder (shared neural network). Note that our method is a general framework which can integrate different base models. Any implementation of text matching models can be adapted to our framework. However, for real-world applications, it is a common practice to consider the efficiency at both training and testing time [Yu et al., 2018]. Thus, we use the Decomposable Attention Model (DAM) [Parikh et al., 2016] as our base model, as DAM has effective performance with remarkable efficiency in text matching.

DAM consists of three jointly trained components: "attend", "compare", and "aggregate". First, the "attend" module softly aligns the input pair of sentences by obtaining the unnormalized attention weights $e_{ij}$. Formally,

$e_{ij} = F(a_i)^{\top} F(b_j)$    (1)

where $F$ is a feed-forward network. Then the attention weights are normalized as follows:

$\beta_i = \sum_{j=1}^{\ell_b} \mathrm{softmax}_j(e_{ij})\, b_j, \qquad \alpha_j = \sum_{i=1}^{\ell_a} \mathrm{softmax}_i(e_{ij})\, a_i$    (2)

where $\mathrm{softmax}_j$ (resp. $\mathrm{softmax}_i$) performs the softmax function on dimension $j$ (resp. $i$) of $e$. $\beta_i$ is interpreted as the subphrase in $b$ that is softly aligned to $a_i$, and $\alpha_j$ as the subphrase in $a$ aligned to $b_j$.

Then the "compare" module compares the aligned subphrases separately and produces a set of matching vectors as follows:

$v_{1,i} = G([a_i; \beta_i]), \qquad v_{2,j} = G([b_j; \alpha_j])$    (3)

where $[\cdot;\cdot]$ means concatenation and $G$ is a feed-forward network.

Finally, the "aggregate" module combines the matching vectors to produce a representation of the sentence pair for a final prediction. Formally, the aggregated vectors are computed as follows:

$v_1^{\mathrm{sum}} = \sum_{i=1}^{\ell_a} v_{1,i}, \qquad v_2^{\mathrm{sum}} = \sum_{j=1}^{\ell_b} v_{2,j}$    (4)

Here Equation (4) can be viewed as performing a sum pooling over the matrix of matching vectors. In practice, it is also beneficial to consider a max pooling over the matrix [Chen et al., 2017], i.e.:

$v_1^{\max} = \max_{i=1,\dots,\ell_a} v_{1,i}, \qquad v_2^{\max} = \max_{j=1,\dots,\ell_b} v_{2,j}$    (5)

The aggregated vectors are concatenated to form the output $o$:

$o = [v_1^{\mathrm{sum}}; v_2^{\mathrm{sum}}; v_1^{\max}; v_2^{\max}]$    (6)

The base model can be formulated as a transformation function that takes in a pair of sentences as input and produces a hidden representation $h = o$. If the base model is used alone to make predictions, a classification (fully-connected) layer will be added after obtaining the hidden representation.
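
The following NumPy sketch illustrates the attend/compare/aggregate computation of Equations (1)-(6) on random toy inputs. The one-layer ReLU projections standing in for $F$ and $G$, the sizes, and all weights are illustrative assumptions rather than the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_h = 8, 16                       # embedding and hidden sizes (illustrative)
l_a, l_b = 6, 5                      # toy sentence lengths
a = rng.normal(size=(l_a, d))        # word embeddings of sentence a
b = rng.normal(size=(l_b, d))        # word embeddings of sentence b

def mlp(x, W, bias):
    """One-layer feed-forward network with ReLU (stand-in for F and G)."""
    return np.maximum(x @ W + bias, 0.0)

W_F, b_F = rng.normal(size=(d, d_h)), np.zeros(d_h)
W_G, b_G = rng.normal(size=(2 * d, d_h)), np.zeros(d_h)

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# "Attend": unnormalized weights e_ij = F(a_i) . F(b_j), Eq. (1)-(2)
e = mlp(a, W_F, b_F) @ mlp(b, W_F, b_F).T          # shape (l_a, l_b)
beta = softmax(e, axis=1) @ b                      # subphrases of b aligned to each a_i
alpha = softmax(e, axis=0).T @ a                   # subphrases of a aligned to each b_j

# "Compare": matching vectors, Eq. (3)
v1 = mlp(np.concatenate([a, beta], axis=1), W_G, b_G)   # (l_a, d_h)
v2 = mlp(np.concatenate([b, alpha], axis=1), W_G, b_G)  # (l_b, d_h)

# "Aggregate": sum and max pooling, Eq. (4)-(6)
o = np.concatenate([v1.sum(0), v2.sum(0), v1.max(0), v2.max(0)])
print(o.shape)   # hidden representation of the sentence pair: (4 * d_h,)
```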

3.4. Transfer Learning Model

As shown in Figure 1, we consider a DNN based transfer learning framework with a fully-shared encoder [Mou et al., 2016, Yang et al., 2017]. The proposed data selection method is a general method that can be adapted to other TL frameworks, including fully-shared and specific-shared models [Yu et al., 2018]. We only consider the fully-shared model because we would like to keep the TL model simple and focus on the reinforced data selector. The base model serves as a shared encoder for sentence pairs from both the source domain and the target domain. For each given pair of sentences from domain $k$ ($k \in \{s, t\}$), the shared encoder maps the pair to a hidden representation $h^k$. Then a classification layer maps $h^k$ to a predicted label $\hat{y}^k$. The classification layers are learned separately for different domains. Formally,

$\hat{y}^k = \mathrm{softmax}(W^k h^k + b^k)$    (7)

where $W^k$ and $b^k$ are the weight matrix and bias vector of the classification layer for domain $k$. Thus, the training objective is to minimize the training loss for both source and target domains. For training, we use the cross-entropy loss as follows:

$\mathcal{L}^k = - \sum_{i} \big[\, y_i^k \log \hat{y}_i^k + (1 - y_i^k) \log (1 - \hat{y}_i^k) \,\big]$    (8)

where $y_i^k$ is the ground truth label for the $i$-th data pair.

The fully-shared encoder and the source classification layer are considered as the source model. Similarly, the fully-shared encoder and the target classification layer are considered as the target model.
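
A minimal NumPy sketch of the fully-shared setup of Equations (7)-(8): one shared encoder output feeds two domain-specific classification heads, each with its own binary cross-entropy loss. The sigmoid binary heads (used instead of a two-class softmax, which is equivalent for binary labels), the sizes, and the random placeholder encoder outputs are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
d_h = 64                                   # size of the shared hidden representation

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y_true, y_prob, eps=1e-9):
    """Binary cross-entropy loss of Eq. (8), averaged over a batch."""
    return -np.mean(y_true * np.log(y_prob + eps) + (1 - y_true) * np.log(1 - y_prob + eps))

# Domain-specific classification layers on top of the shared encoder (Eq. 7).
heads = {k: (rng.normal(scale=0.1, size=d_h), 0.0) for k in ("source", "target")}

def predict(h, domain):
    W, b = heads[domain]
    return sigmoid(h @ W + b)              # probability of the positive class

# Placeholder encoder outputs for one source batch and one target batch.
h_src, y_src = rng.normal(size=(32, d_h)), rng.integers(0, 2, size=32)
h_tgt, y_tgt = rng.normal(size=(16, d_h)), rng.integers(0, 2, size=16)

loss_src = bce(y_src, predict(h_src, "source"))
loss_tgt = bce(y_tgt, predict(h_tgt, "target"))
print(loss_src, loss_tgt)
```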

3.5. Reinforced Data Selector

3.5.1. Overview

We cast the source domain data selection in a transfer learning setting as a Markov Decision Process, which can be solved by reinforcement learning. The reinforced data selector is an agent that interacts with the environment constructed by the TL model. The agent takes actions of keeping or dropping a given source sample (a sentence pair) according to a learned policy. The agent bases its decision on a state representation that describes several features of the given sample. The TL model evaluates the agent’s actions and produces rewards to guide the learning of the agent. The agent’s goal is to maximize the expected future total rewards it receives.

Reinforcement learning is commonly used for policy learning of agents in video games or chess games. In the context of video games, an episode commonly refers to a round of the game where the player either passes or fails the game at the end. The player would take a sequence of actions to reach the terminal state. In a neural transfer learning setting, the TL model is updated batch by batch for several epochs. It is natural to consider an epoch as an episode and a batch as a step to take actions.

As shown in Figure 1, given each batch of source domain sentence pairs $B_S^k = \{x_1, x_2, \dots, x_n\}$, where $k$ denotes the batch ID and $n$ denotes the batch size, we obtain a batch of states $S^k = \{s_1, s_2, \dots, s_n\}$, where $s_i$ denotes the state of the $i$-th sentence pair $x_i$. Then the reinforced data selector makes a decision for each sample according to a learned policy $\pi_\Theta$. Actions are also made in a batch, denoted as $A^k = \{a_1, a_2, \dots, a_n\}$, where $a_i \in \{0, 1\}$. $a_i = 0$ means to drop $x_i$ from $B_S^k$. Thus, we obtain a new source domain batch $B_S'^k$ that contains the selected source samples only. Finally, the transfer model is updated with $B_S'^k$ and produces a reward $r_k$ according to the performance on the target domain validation data.

We will introduce the state, action, and reward in the following sections. The batch ID $k$ is omitted in some cases for simplicity.

3.5.2. State

The state of a given source domain sentence pair $x_i$ is denoted as a continuous real-valued vector $s_i \in \mathbb{R}^{d_s}$, where $d_s$ is the dimension of the state vector. $s_i$ represents the concatenation of the following features:

  1. A hidden representation $h_i$, which is the output of the shared encoder given $x_i$.

  2. The training loss of $x_i$ on the source model.

  3. The testing loss of $x_i$ on the target model.

  4. The predicted class probabilities of $x_i$ on the source model.

  5. The predicted class probabilities of $x_i$ on the target model.

The first feature is designed to present the raw content to the data selector. Feature (3) and feature (5) are based on the intuition that helpful source domain training data would be classified with relatively high confidence on the target model. Feature (2) and feature (4) are provided as feature (3) and feature (5)'s counterparts on the source model.
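
As an illustration, the sketch below assembles such a state vector, assuming the per-sample quantities (encoder output, losses, predicted probabilities) have already been produced by the TL model; the helper name and all toy values are hypothetical.

```python
import numpy as np

def build_state(h, loss_src, loss_tgt, probs_src, probs_tgt):
    """Concatenate the five features of Section 3.5.2 into one state vector s_i."""
    return np.concatenate([
        h,                                   # (1) shared-encoder representation of the pair
        [loss_src],                          # (2) training loss on the source model
        [loss_tgt],                          # (3) testing loss on the target model
        probs_src,                           # (4) predicted class probabilities, source model
        probs_tgt,                           # (5) predicted class probabilities, target model
    ])

# Toy values standing in for quantities produced by the TL model.
rng = np.random.default_rng(2)
s_i = build_state(rng.normal(size=64), 0.41, 0.87,
                  np.array([0.7, 0.3]), np.array([0.4, 0.6]))
print(s_i.shape)   # d_s = 64 + 1 + 1 + 2 + 2
```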

3.5.3. Action

An action is denoted as $a_i \in \{0, 1\}$, which indicates whether to drop or keep $x_i$ in the source batch. $a_i$ is sampled according to a probability distribution produced by a learned policy function $\pi_\Theta(s_i, a_i)$, which is approximated with a policy network that consists of two fully-connected layers. Formally, $\pi_\Theta$ is defined as follows:

$z_i = \tanh(W_1 s_i + b_1), \qquad \pi_\Theta(s_i, a_i) = a_i\, \sigma(W_2 z_i + b_2) + (1 - a_i)\,\big(1 - \sigma(W_2 z_i + b_2)\big)$    (9)

where $W_l$ and $b_l$ are the weight matrix and bias vector respectively for the $l$-th layer in the policy network, $\sigma$ is the sigmoid function, and $z_i$ is an intermediate hidden state.
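
The sketch below shows one way such a two-layer policy network could score and sample a batch of actions, following the form of Equation (9); the weights, sizes, and Bernoulli sampling details are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d_s, d_p = 70, 128                       # state size and policy hidden size (illustrative)
W1, b1 = rng.normal(scale=0.1, size=(d_s, d_p)), np.zeros(d_p)
W2, b2 = rng.normal(scale=0.1, size=d_p), 0.0

def keep_probability(S):
    """Two-layer policy network: P(a_i = 1 | s_i) for each state in the batch S."""
    z = np.tanh(S @ W1 + b1)                         # intermediate hidden states
    return 1.0 / (1.0 + np.exp(-(z @ W2 + b2)))      # sigmoid output layer

S = rng.normal(size=(32, d_s))                       # states of one source batch
p_keep = keep_probability(S)
actions = (rng.random(32) < p_keep).astype(int)      # a_i ~ Bernoulli(p_keep)
selected = np.flatnonzero(actions)                   # indices kept to form B'_S
print(len(selected), "of 32 source samples kept")
```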

3.5.4. Reward

The data selector takes actions to select data from $B_S^k$ and form a new batch of source data $B_S'^k$. We use $B_S'^k$ to update the source model and obtain an immediate reward $r_k$ with a reward function. In contrast to conventional reinforcement learning, where one action is sampled based on one state and one reward is obtained from the environment, our actions are sampled in a batch based on a batch of states, and one reward is obtained per batch in order to improve model training efficiency.

The reward $r_k$ is set to the prediction accuracy on the target domain validation data for each batch. Other metrics generated on the target validation data could also be applicable. To accurately evaluate the utility of $B_S'^k$, $r_k$ is obtained after the source model is updated and before the target model is updated. For the extremely rare case of $B_S'^k = \emptyset$, we skip the update of the source model for this step.

We compute the future total reward for each batch after an episode. Formally, the future total reward $R'_k$ for batch $k$ is computed as follows:

$R'_k = \sum_{k'=k}^{N} \gamma^{\,k'-k}\, r_{k'}$    (10)

where $N$ is the number of batches in this episode, $R'_k$ is the future total reward for batch $k$, and $\gamma$ is the reward discount factor.
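
Equation (10) can be computed with a single backward pass over the per-batch rewards of an episode, as in the following sketch (the reward values are toy numbers):

```python
def future_total_rewards(rewards, gamma=0.95):
    """Eq. (10): R'_k = sum_{k' >= k} gamma^(k'-k) * r_k', computed for every batch k."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for k in reversed(range(len(rewards))):      # accumulate from the last batch backwards
        running = rewards[k] + gamma * running
        returns[k] = running
    return returns

# Toy per-batch rewards (validation accuracies) for one episode.
rewards = [0.71, 0.73, 0.72, 0.75]
print(future_total_rewards(rewards, gamma=0.95))
```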

3.5.5. Optimization

We experiment with two methods to update the policy network: the REINFORCE algorithm [Williams, 1992] and the actor-critic algorithm [Konda and Tsitsiklis, 2003]. Our model is mainly based on the actor-critic framework since it can help to reduce the variance so that the learning is more stable [Konda and Tsitsiklis, 2003].

For any given episode, we aim to maximize the expected total reward. Formally, we define the objective function as follows:

$J(\Theta) = \mathbb{E}_{\pi_\Theta}\Big[\sum_{k=1}^{N} r_k\Big]$    (11)

where the policy function $\pi_\Theta$ is parameterized by $\Theta$. We further compute the gradient to make a step of update as follows:

$\Theta \leftarrow \Theta + \alpha\, \frac{1}{n} \sum_{i=1}^{n} v_i\, \nabla_\Theta \log \pi_\Theta(s_i, a_i)$    (12)

where $\alpha$ is the learning rate, $n$ is the batch size, and $v_i$ is the target that guides the update of the policy network. The actor-critic algorithm combines policy based methods and value based methods for stable updates. We employ a value estimator network $V_\Omega$ as the value function to estimate the future total reward for each state in a given batch. Thus, $v_i$ is computed as follows:

$v_i = R'_k - V_\Omega(s_i)$    (13)

The structure of the value estimator network is similar to the policy network except that the output layer is a regression function. Formally, the value network is optimized to approximate the real future total reward $R'_k$, i.e. to minimize the Mean Squared Error (MSE) between $V_\Omega(s_i)$ and $R'_k$:

$\mathcal{L}(\Omega) = \frac{1}{n} \sum_{i=1}^{n} \big(V_\Omega(s_i) - R'_k\big)^2$    (14)

where the value function $V_\Omega$ is parameterized by $\Omega$.

In addition to the actor-critic algorithm described above, we also experiment with a Policy Gradient method, namely the REINFORCE algorithm. In this case, $v_i$ is simply set to $R'_k$ in Equation 12, which means that every action in this batch shares the same reward $R'_k$.
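
To make the update of Equations (12)-(13) explicit, the sketch below uses a single-layer logistic policy (a simplification of the two-layer network above) so that the gradient of $\log \pi$ can be written in closed form. All tensors are toy placeholders, and the REINFORCE variant is obtained by setting the target $v$ to $R'_k$.

```python
import numpy as np

rng = np.random.default_rng(4)
d_s, n, lr = 70, 32, 0.02

w = np.zeros(d_s)                      # single-layer logistic policy (simplified)
S = rng.normal(size=(n, d_s))          # states of one stored batch
A = rng.integers(0, 2, size=n)         # actions taken for that batch
R_k = 0.74 * np.ones(n)                # future total reward R'_k of the batch (Eq. 10)
V = rng.uniform(0.6, 0.8, size=n)      # critic estimates V_Omega(s_i) for each state

p = 1.0 / (1.0 + np.exp(-(S @ w)))     # P(a_i = 1 | s_i)
v = R_k - V                            # actor-critic target of Eq. (13); use v = R_k for REINFORCE

# d/dw log pi(a_i | s_i) = (a_i - p_i) * s_i for a Bernoulli-sigmoid policy,
# so the policy-gradient step of Eq. (12) becomes:
grad = ((v * (A - p))[:, None] * S).mean(axis=0)
w += lr * grad

# The critic itself would be updated by moving V_Omega(s_i) towards R'_k,
# minimizing the MSE of Eq. (14) (omitted here; the paper uses a separate small network).
```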

3.6. Model Training

The TL model and the reinforced data selector are learned jointly as they interact with each other closely during training. To optimize the policy network, we use the actor-critic algorithm described in Section 3.5.5. To optimize the TL model, we use a gradient descent method to minimize the loss function in Equation 8. We first pre-train the TL model for a number of iterations and then start the joint training process. We use such a procedure following previous work [Feng et al., 2018, Bahdanau et al., 2017].

The details of the joint learning process are described in Algorithm 1. When optimizing the TL model, the gradient is computed based on a batch of training data. The TL model leverages training data in both source and target domains for better model performance. The reinforced data selector intervenes before every iteration of the source model update by selecting helpful source domain data. Therefore, the intervention process has an impact on the gradient computed for the source model update, which includes the update for the shared encoder. The TL model provides a reward in turn to evaluate the utility of the data selection. After each epoch/episode, the policy network is updated with the actor-critic algorithm using the stored (states, actions, reward) triplets.

Input: the number of episodes $L$, source domain training data $D_S$, target domain training data $D_T$, and target domain validation data $D_V$
Initialize the pre-trained source and target models in the TL model;
Initialize the policy network $\pi_\Theta$ and the value estimator network $V_\Omega$;
for episode $l = 1$ to $L$ do
        Obtain the random batch sequences $\{B_S^k\}$ from $D_S$ and $\{B_T^k\}$ from $D_T$;
        foreach $(B_S^k, B_T^k)$ in $(\{B_S^k\}, \{B_T^k\})$ do
               Obtain the states $S^k$ for $B_S^k$;
               Sample actions $A^k$ according to the policy $\pi_\Theta$;
               Obtain the filtered source training batch $B_S'^k$;
               Update the source model with $B_S'^k$;
               Obtain the reward $r_k$ on the target model with the validation data $D_V$;
               Update the target model with $B_T^k$;
               Store $(S^k, A^k, r_k)$ to an episode history $H$;
        end foreach
        foreach $(S^k, A^k, r_k)$ in $H$ do
               Obtain the future total reward $R'_k$ as in Eq. 10;
               Obtain the estimated future total rewards $V_\Omega(s_i)$;
               Update the policy network following Eq. 12;
        end foreach
        foreach $(S^k, A^k, r_k)$ in $H$ do
               Obtain the future total reward $R'_k$ as in Eq. 10;
               Update the value estimator network as in Eq. 14;
        end foreach
        Empty $H$;
end for
Algorithm 1 Joint learning of the transfer learning model and the reinforced data selector

4. Experiments

4.1. Data Description

In this paper, we follow previous work [Yu et al., 2018] and use paraphrase identification and natural language inference data to evaluate the performance of our RTL model on text matching. Four benchmark datasets are used in the PI and NLI tasks. Both task settings are designed to transfer from a relatively open domain to a closed domain. Statistics for all datasets are presented in Table 2.

4.1.1. Natural Language Inference (NLI)

We use MultiNLI [Williams et al., 2018] as the source domain data and SciTail [Khot et al., 2018] as the target domain data. MultiNLI is a large crowdsourced corpus for textual entailment recognition. Each sample is a (premise, hypothesis, label) triplet, where the label is one of ENTAILMENT, NEUTRAL, or CONTRADICTION. In contrast to another widely used NLI dataset, SNLI [Bowman et al., 2015], where all premise sentences are derived from image captions, MultiNLI has more diverse text sources and thus is more suitable to serve as the source domain in a TL setting. We use the 1.0 version of MultiNLI with the training data from all five domains. We discard the samples with no gold labels. SciTail is a recently released textual entailment dataset in the science domain. In contrast to SNLI and MultiNLI, the premises and hypotheses in SciTail are generated with no awareness of each other. Therefore SciTail is more diverse in terms of linguistic variations and thus more challenging than other entailment datasets [Khot et al., 2018]. However, the labels in SciTail consist only of ENTAILMENT and NEUTRAL. Therefore, we remove the CONTRADICTION samples from MultiNLI.

4.1.2. Paraphrase Identification (PI)

We use the Quora Question Pairs (https://www.kaggle.com/c/quora-question-pairs) as the source domain data and a paraphrase dataset made available in CIKM AnalytiCup 2018 (https://tianchi.aliyun.com/competition/introduction.htm?raceId=231661) as the target domain data. Quora Question Pairs (Quora QP) is a large paraphrase dataset released by Quora (https://www.quora.com/). Quora is a knowledge sharing website where users post questions and write answers to other users' questions. Due to the large number of visitors, the user-generated questions contain duplicates. Thus Quora released this dataset to encourage research on paraphrase identification. The AnalytiCup data consists of question pairs in the E-commerce domain. It was released with CIKM AnalytiCup 2018, a competition targeting the research problem of cross-lingual text matching. The dataset contains labeled English training data and unlabeled Spanish data; in this work, we only deal with the labeled English data. We sample the training, validation, and testing data for the AnalytiCup data since no pre-defined data partitions are available.

Task | Domain | Data | Train | Validation | Test
PI | Source | Quora QP | 404,287/149,263 | N/A | N/A
PI | Target | AnalytiCup | 6,668/1,731 | 3,334/830 | 3,330/820
NLI | Source | MultiNLI | 261,799/130,899 | N/A | N/A
NLI | Target | SciTail | 23,596/8,602 | 1,304/657 | 2,126/842
Table 2. Data statistics. For source domains, only training data is used. The numbers before and after the "/" are the number of all examples and the number of positive examples. "Positive" refers to PARAPHRASE in PI and ENTAILMENT in NLI.

4.2. Experimental Setup

4.2.1. Baselines

We consider the following baselines:

  • Base model [Parikh et al., 2016]: we use the shared encoder described in Section 3.3 with a classification layer to form a decomposable attention model. This base model is trained with the target domain data.

  • Transfer baseline: we use the TL model described in Section 3.4 to provide a stronger baseline. The transfer baseline leverages training data in both the source and target domains.

  • Ruder and Plank [Ruder and Plank, 2017]: a data selection method with Bayesian optimization for transfer learning. This data selection approach is model-independent, so we use it on top of our transfer learning model to keep the comparisons fair.

4.2.2. Evaluation Metrics

For both tasks, we adopt accuracy (Acc) and the Area under the ROC curve (AUC) as evaluation metrics. Significance tests can only be performed on accuracy.

4.2.3. Implementation Details

We present the parameter settings and implementation details as follows. All models are implemented with TensorFlow (https://www.tensorflow.org/). The size of the hidden layers of the decomposable attention model is 200. The max sequence length is 40 for PI and 50 for NLI. The padding is masked to avoid affecting the gradient. Hyper-parameters, including the size of the hidden layer of the policy network and the reward discount factor, are tuned with the target domain validation data. Checkpoints are saved at the end of every epoch and produce an evaluation on the test set. All models are trained on an NVIDIA Titan X GPU using Adam [Kingma and Ba, 2015]. The initial learning rate is 0.001 for the transfer model and 0.02 for the policy network. The Adam parameters $\beta_1$ and $\beta_2$ are 0.9 and 0.999 respectively. The hidden layer size and optimization method for the value estimator network are the same as those of the policy network.

The transfer learning model is pre-trained for 50 iterations for both tasks before the reinforced data selector is applied. For the word embedding layer, we use GloVe [Pennington et al., 2014] (840B tokens) to initialize the embedding look-up table. The dimension of word embeddings is 300. Word vectors are set to trainable.

4.3. Evaluation Results

We present the evaluation results in Table 3. Models are tuned with the target domain validation data and results are reported on the target domain testing data.

Methods | PI Acc | PI AUC | NLI Acc | NLI AUC
Base Model [Parikh et al., 2016] | 0.8393 | 0.8548 | 0.7300 | 0.7663
Transfer Learning Model | 0.8488 | 0.8706 | 0.7453 | 0.8044
Ruder and Plank [Ruder and Plank, 2017] | 0.8458 | 0.8680 | 0.7521 | 0.8062
RTL | 0.8616 | 0.8829 | 0.7672 | 0.8163
Table 3. Testing performance in the target domain for the PI and NLI tasks. Our model is referred to as RTL. Significance tests can only be performed on accuracy; statistically significant differences over the strongest baseline are measured by the Student's paired t-test.

For paraphrase identification, we observe that the base model alone achieves relatively good performance. The TL model achieves a minor improvement over the base model. However, the data selection method with Bayesian optimization by Ruder and Plank [Ruder and Plank, 2017] fails to make a further improvement over the TL model. Based on the base model performance, we speculate that the PI task on the AnalytiCup data is relatively straightforward; therefore, it is possible that sophisticated models do not always boost the performance. Under these circumstances, our model (RTL) still manages to produce a statistically significant improvement over the strongest baseline.

For natural language inference, the performance of all models is lower in general compared to the PI task. This is due to the fact that SciTail is very challenging, as described in Section 4.1. The base model has moderate performance on this task. The TL model improves the performance thanks to the source domain data. The data selection method with Bayesian optimization manages to make a further improvement, indicating the large potential of data selection in this setting. Moreover, our RTL model outperforms the strongest baseline by a large margin with statistical significance. These results demonstrate the effectiveness of the RTL model.

In Ruder and Plank [Ruder and Plank, 2017], the TL model is considered as a sub-module of the Bayesian optimization based data selection framework. This framework evaluates the utility of the data selection based on the final performance of the TL model. In our RTL framework, the TL model and the reinforced data selector are trained jointly and thus the data selection policy is updated more efficiently and effectively. This could be the reason behind the improvement of our model over Ruder and Plank [Ruder and Plank, 2017].

4.4. Ablation Analysis

In addition to the best performing model in Section 4.3, we also investigate different variations of the RTL model. The variations are made in terms of three aspects: the reward functions, optimization methods for the policy network, and state representations.

4.4.1. Reward Functions and Policy Optimization Methods.

We consider various reward functions and policy optimization methods as the main settings for our ablation tests. The results are presented in Table 4.

Reward | RL Algorithm | PI Acc | PI AUC | NLI Acc | NLI AUC
AUC | REINFORCE | 0.8557 | 0.8818 | 0.7486 | 0.8070
AUC | Actor-Critic | 0.8545 | 0.8793 | 0.7613 | 0.8067
Acc | REINFORCE | 0.8428 | 0.8788 | 0.7587 | 0.8121
Acc | Actor-Critic | 0.8616 | 0.8829 | 0.7672 | 0.8163
Table 4. Testing performance of RTL with different variations. The last entry is the final RTL model in Table 3.

As shown in Table 4, we experiment with two reward functions, using accuracy or AUC. Also, we use two algorithms to optimize the policy network: the REINFORCE algorithm and the actor-critic algorithm. We observe that policy networks optimized with the actor-critic algorithm generally produce similar or better performance. On the other hand, when using the same algorithm to optimize the policy network, using accuracy as the reward tends to generate better results. The best setting is to use accuracy as the reward and actor-critic for policy optimization. Thus, we adopt this setting for our final RTL model.

4.4.2. State Features.

In addition to the model variations on main settings of reward functions and policy optimization methods, we also perform a state feature ablation test under the best main setting. We have five state features in total as mentioned in Section 3.5.2. Four of them (feature 2, 3, 4, 5) can be considered as a feature group because they are all designed to evaluate whether a source sample can be easily classified by the TL model. Thus, we perform a feature ablation test on two feature groups.

State Features | PI Acc | PI AUC | NLI Acc | NLI AUC
Transfer Learning Model | 0.8488 | 0.8706 | 0.7521 | 0.8044
(1) | 0.8539 | 0.8813 | 0.7594 | 0.8135
(2) (3) (4) (5) | 0.8529 | 0.8778 | 0.7507 | 0.7916
(1) (2) (3) (4) (5) | 0.8616 | 0.8829 | 0.7672 | 0.8163
Table 5. Testing performance of RTL with different state features under the best main setting of (Acc, Actor-Critic). The TL model without data selection is also included for comparison. The last entry is the final RTL model in Table 3.

As shown in Table 5, the reinforced data selector with the second feature group (features 2 - 5) achieves results similar to or higher than the transfer baseline. This indicates that the hand-crafted features in the second feature group have limited capacity for the state representation. On the other hand, feature 1 achieves relatively good performance when used alone. This suggests that the hidden representations of source samples learned by the shared encoder are capable of providing good descriptions of the model's states. Moreover, the model gives the best performance if we combine the two feature groups, confirming that all features contribute to the model performance.

4.5. Impact of Hyper-parameters

We use the NLI task to demonstrate the impact of two hyper-parameters on our model: the size of the hidden layer of the policy network and the reward discount factor. Both hyper-parameters are related to the reinforced data selector. The number of units in the hidden layer of the policy network is tuned over {32, 64, 128, 256, 512}. The choices for the reward discount factor are {0, 0.2, 0.4, 0.6, 0.8, 0.95, 1}. A reward discount factor of 0 denotes that future rewards are not considered when updating the policy, while a reward discount factor of 1 means fully considering future rewards without any discount. Figure 2 presents the validation performance with different hyper-parameters. The performance of the transfer model has a peak value when the hidden layer of the policy network has 128 units. This indicates that too small or too large a capacity of the policy network cannot benefit the data selection process. In addition, the model performance does not seem to be heavily influenced by small reward discount factors. The trend suggests that the reinforced data selector benefits more when the contributions of previous actions are properly emphasized with relatively large, but not excessive, reward discount factors.

Figure 2. Impact of hyper-parameters of the reinforced data selector on the validation data of the NLI task.

4.6. Case Study and Performance Interpretation

The results in Section 4.3 demonstrate the effectiveness of our method. However, due to the lack of interpretability of neural architectures, it is difficult to interpret the reasons behind the decisions made by the reinforced data selector. Therefore, instead of trying to interpret specific cases, we present an overall performance interpretation of our method by measuring the distance between the domains of interest. Specifically, we compute the Wasserstein distance between the term distributions of the target domain and the source domains. The source domains include the source domain data selected or dropped by the reinforced data selector and the randomly selected source domain data.

Wasserstein distance is also known as the earth mover's distance. It measures the distance between two probability distributions on a given metric space $(M, d)$. Intuitively, it can be considered as the minimum amount of work required to transform one distribution $P_1$ into another distribution $P_2$, which can be computed as the amount of earth that has to be moved multiplied by the distance it has to be moved. Formally, the 1st Wasserstein distance is defined as:

$W_1(P_1, P_2) = \inf_{\mu \in \Gamma(P_1, P_2)} \int_{M \times M} d(x, y)\, \mathrm{d}\mu(x, y)$    (15)

where $\Gamma(P_1, P_2)$ is the set of joint probability distributions on $M \times M$ whose marginals are $P_1$ and $P_2$. In our case, $P_1$ and $P_2$ are defined as the term frequency distributions of two domains respectively.

Besides, Wasserstein distance can handle distributions where some events have a probability of 0 (a certain word is present in one domain but not in the other). Also, Wasserstein distance is a symmetric measure, meaning that $W_1(P_1, P_2)$ is equal to $W_1(P_2, P_1)$. These properties make it suitable for our task.
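
As an illustration of Equation (15) on term distributions, the sketch below solves the exact transport linear program for two toy term-frequency vectors. The 0/1 ground metric between distinct terms is our own simplifying assumption (the paper does not state which ground metric it uses), under which $W_1$ reduces to half the L1 distance.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein_1(p, q, cost):
    """Exact W1 between two discrete distributions p, q for a given ground cost matrix,
    solved as the transport linear program of Eq. (15)."""
    n, m = cost.shape
    # Row-sum constraints enforce marginal p; column-sum constraints enforce marginal q.
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([p, q])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun

# Toy term-frequency distributions over a shared 4-word vocabulary,
# with the simple 0/1 metric between distinct terms (an illustrative assumption).
p = np.array([0.4, 0.3, 0.2, 0.1])     # e.g. target domain
q = np.array([0.1, 0.3, 0.3, 0.3])     # e.g. selected source domain data
cost = 1.0 - np.eye(4)
print(wasserstein_1(p, q, cost))       # equals 0.5 * L1 distance = 0.3 under the 0/1 metric
```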

We keep track of the actions taken by the reinforced data selector. The data selected in the final episode is considered as the final selected data. We compute the Wasserstein distance between the target domain data and the source domain data, including the selected and dropped source domain data. We also include randomly selected source domain data for comparison. The number of randomly selected instances is identical to the number of instances selected by the reinforced data selector. The results are presented in Table 6. Due to the large number of tokens in the source domain data, the normalized term frequency for any given term is relatively small, and thus the Wasserstein distances are small in terms of the order of magnitude.

Name | Domains in Comparison | PI | NLI
d_source | Target vs. Source | 5.250E-06 | 3.256E-06
d_selected | Target vs. Source (Selected) | 4.963E-06 | 3.190E-06
d_dropped | Target vs. Source (Dropped) | 5.320E-06 | 3.290E-06
d_random | Target vs. Source (Random) | 5.232E-06 | 3.243E-06
Table 6. The Wasserstein distances between the term distributions of different domains.

We observe the exact same patterns for the PI and NLI tasks: (1) d_random ≈ d_source, which means that random selection only influences the term distribution slightly; this sets a baseline for the other distances. (2) d_selected < d_source, which means that the source domain data selected by the reinforced data selector is closer to the target domain data and thus may contribute to the transfer learning process. (3) d_dropped > d_source, which means that the source domain data dropped by the reinforced data selector is not very similar to the target domain data and thus may cause negative transfer. These findings indicate that our method is able to select source domain data that is close to the target domain data in terms of Wasserstein distance. This is reasonable as such source domain data could be more transferable and helpful for the target domain.

5. Conclusions and Future work

In this paper, we proposed a reinforced data selection method in a DNN based transfer learning setting. Specifically, we used reinforcement learning to train a data selection policy to select high-quality source domain data with the purpose of preventing negative transfer. We investigated different settings of states, rewards, and policy optimization methods to test the robustness of our model. Extensive experiments on PI and NLI tasks indicate that our model can outperform existing methods with statistically significant improvements. Finally, we used Wasserstein distance to measure the source and target domain distances before and after the data selection. This study indicates that our method is capable of selecting source domain data that has a similar probability distribution to the target domain data. Future work will explore more effective state representations and adapt our method to other tasks.

Acknowledgements.
This work was supported in part by the Center for Intelligent Information Retrieval. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References

  • Arulkumaran et al. [2017] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Processing Magazine, 2017.
  • Bahdanau et al. [2017] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. C. Courville, and Y. Bengio. An Actor-Critic Algorithm for Sequence Prediction. In ICLR, 2017.
  • Bowman et al. [2015] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning. A Large Annotated Corpus For Learning Natural Language Inference. In EMNLP, 2015.
  • Chen et al. [2011] M. Chen, K. Q. Weinberger, and J. C. Blitzer. Co-training for Domain Adaptation. In NIPS, 2011.
  • Chen et al. [2017] Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen. Enhanced LSTM for Natural Language Inference. In ACL, 2017.
  • Daume III [2007] H. Daume III. Frustratingly Easy Domain Adaptation. In ACL, 2007.
  • Fan et al. [2017] Y. Fan, F. Tian, T. Qin, J. Bian, and T. Liu. Learning What Data to Learn. CoRR, 2017.
  • Fang et al. [2017] M. Fang, Y. Li, and T. Cohn. Learning how to Active Learn: A Deep Reinforcement Learning Approach. In EMNLP, 2017.
  • Feng et al. [2018] J. Feng, M. Huang, L. Zhao, Y. Yang, and X. Zhu. Reinforcement Learning for Relation Classification From Noisy Data. In AAAI, 2018.
  • Guo et al. [2016] J. Guo, Y. Fan, Q. Ai, and W. B. Croft. A Deep Relevance Matching Model for Ad-hoc Retrieval. In CIKM, 2016.
  • Huang et al. [2006] J. Huang, A. J. Smola, A. Gretton, K. M. Borgwardt, and B. Scholkopf. Correcting Sample Selection Bias by Unlabeled Data. In NIPS, 2006.
  • Khot et al. [2018] T. Khot, A. Sabharwal, and P. Clark. SciTail: A Textual Entailment Dataset from Science Question Answering. In AAAI, 2018.
  • Kingma and Ba [2015] D. P. Kingma and J. L. Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2015.
  • Konda and Tsitsiklis [2003] V. R. Konda and J. N. Tsitsiklis. On Actor-Critic Algorithms. SIAM J. Control Optim., 2003.
  • Li et al. [2017] F. Li, M. Qiu, H. Chen, X. Wang, X. Gao, J. Huang, J. Ren, Z. Zhao, W. Zhao, L. Wang, G. Jin, and W. Chu. AliMe Assist : An Intelligent Assistant for Creating an Innovative E-commerce Experience. In CIKM, 2017.
  • Li et al. [2016] J. Li, W. Monroe, A. Ritter, D. Jurafsky, M. Galley, and J. Gao. Deep Reinforcement Learning for Dialogue Generation. In EMNLP, 2016.
  • Liu et al. [2017] P. Liu, X. Qiu, and X. Huang. Adversarial Multi-task Learning for Text Classification. In ACL, 2017.
  • Mnih et al. [2013] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. A. Riedmiller. Playing Atari with Deep Reinforcement Learning. CoRR, 2013.
  • Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 2015.
  • Mou et al. [2016] L. Mou, Z. Meng, R. Yan, G. Li, Y. Xu, L. Zhang, and Z. Jin. How Transferable are Neural Networks in NLP Applications? In EMNLP, 2016.
  • Pan and Yang [2010] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 2010.
  • Parikh et al. [2016] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit. A Decomposable Attention Model for Natural Language Inference. In EMNLP, 2016.
  • Patel et al. [2018] Y. Patel, K. Chitta, and B. Jasani. Learning Sampling Policies for Domain Adaptation. CoRR, 2018.
  • Pennington et al. [2014] J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. In EMNLP, 2014.
  • Puterman [1994] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1994.
  • R. Socher and E. H. Huang and J. Pennington and A. Y. Ng and C. D. Manning [2011] R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning. Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS, 2011.
  • Ruder and Plank [2017] S. Ruder and B. Plank. Learning to Select Data for Transfer Learning with Bayesian Optimization. In EMNLP, 2017.
  • Rummery and Niranjan [1994] G. A. Rummery and M. Niranjan. On-Line Q-Learning Using Connectionist Systems. Technical report, University of Cambridge, 1994.
  • Shen et al. [2018] J. Shen, Y. Qu, W. Zhang, and Y. Yu. Wasserstein Distance Guided Representation Learning for Domain Adaptation. In AAAI, 2018.
  • Silver et al. [2017] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. R. Baker, M. Lai, A. Bolton, Y. Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 2017.
  • Sutton and Barto [1998] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 1998.

  • Wang et al. [2018] S. Wang, M. Yu, X. Guo, Z. Wang, T. Klinger, W. Zhang, S. Chang, G. Tesauro, B. Zhou, and J. Jiang. R^3: Reinforced Ranker-Reader for Open-Domain Question Answering. In AAAI, 2018.
  • Williams et al. [2018] A. Williams, N. Nangia, and S. Bowman. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In NAACL, 2018.
  • Williams [1992] R. J. Williams. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 1992.
  • Wu et al. [2018] J. Wu, L. Li, and W. Y. Wang. Reinforced Co-Training. In NAACL, 2018.
  • Yan et al. [2016] R. Yan, Y. Song, and H. Wu. Learning to Respond with Deep Neural Networks for Retrieval-Based Human-Computer Conversation System. In SIGIR, 2016.
  • Yang et al. [2016] L. Yang, Q. Ai, J. Guo, and W. B. Croft. aNMM: Ranking Short Answer Texts with Attention-Based Neural Matching Model. In CIKM, 2016.
  • Yang et al. [2018] L. Yang, M. Qiu, C. Qu, J. Guo, Y. Zhang, W. B. Croft, J. Huang, and H. Chen. Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems. In SIGIR, 2018.
  • Yang et al. [2017] Z. Yang, R. Salakhutdinov, and W. W. Cohen. Transfer Learning for Sequence Tagging with Hierarchical Recurrent Networks. In ICLR, 2017.
  • Yin et al. [2016] W. Yin, H. Schütze, B. Xiang, and B. Zhou. ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs. TACL, 2016.
  • Yin et al. [2018] W. Yin, D. Roth, and H. Schütze. End-Task Oriented Textual Entailment via Deep Explorations of Inter-Sentence Interactions. In ACL, 2018.
  • Yosinski et al. [2014] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How Transferable Are Features in Deep Neural Networks? In NIPS, 2014.
  • Yu et al. [2018] J. Yu, M. Qiu, J. Jiang, J. Huang, S. Song, W. Chu, and H. Chen. Modelling Domain Relationships for Transfer Learning on Retrieval-based Question Answering Systems in E-commerce. In WSDM, 2018.