Question Answering (QA) systems have been widely developed and used in many domains. Examples of industry applications include Alibaba’s AliME (qiu:acl17; alime-demo), Microsoft’s SuperAgent (cui:acl17), Apple’s Siri and Google’s Google Assistant. Generally speaking, there are two kinds of commonly-used techniques behind most QA systems: Information Retrieval (IR)-based models (yan:sigir2016) and generation-based models (vinyals:arxiv2015). In this work, we focus on building up an IR-based QA system for automatically answering frequently asked questions (FAQs) in the E-commerce industry.
Fig. 1 illustrates the workflow of IR-based chatbot systems, where a key component is the Question Rerank module which reranks candidate questions in a question-answering knowledge base (KB) to find the best matching question given a question from a user. This task can be reduced to a paraphrase identification or a natural language inference problem. Take the query question and knowledge base shown in Fig. 1 for example. If we can detect that question C1 in the KB is a paraphrase of the query question, then we can take its answer as the answer for the query. In some cases, if we allow the matching question to be more general than the query question (i.e., entailed by the query question), we can also take the answer for question C2 in the KB as the query question’s answer.
In the literature, paraphrase identification (PI) and natural language inference (NLI) have been extensively studied in the last decade (socher:nips2011; snli:emnlp2015; yin:tacl2016; bowman:acl2016; chen:acl2017). However, when applying existing PI and NLI solutions to chatbot systems in the E-commerce industry, we face at least two major challenges: (1) Lack of rich training data: All these solutions rely on a large amount of labeled data. However, it is generally time-consuming and costly to manually annotate sufficient labeled data for each domain. For example, different product categories might need different training data. (2) Hard to reach a high QPS (Queries Per Second): Most of the existing methods focus on improving effectiveness or accuracy without paying much attention to efficiency. For real industry applications, where real-time responses are expected and a large number of customers are served simultaneously, we need an efficient method to support a high QPS.
In this paper, we try to address the two challenges above. Specifically, we first make an empirical comparison of both the effectiveness and efficiency of several representative methods for modeling sentence pairs and propose an effective and efficient hybrid model as our base model. This ensures that we can achieve a high QPS. On top of the base model, we further design a new transfer learning (TL) framework, which is able to efficiently improve the performance on a resource-poor target domain by leveraging knowledge from a resource-rich source domain.
A Hybrid Base Model. Observing that LSTM-based methods (snli:emnlp2015; chen:acl2017) are much more time-consuming than CNN-based methods (yin:tacl2016; mou:acl2016), we focus on CNN-based methods in this study. Meanwhile, there are typically two types of CNN-based methods for the task, namely sentence encoding (SE)-based methods (yin:naacl2015; mou:EMNLP2016) and sentence interaction (SI)-based methods (hu:nips2014; pang:aaai2016). We argue that these two types of methods may highly complement each other, and thus we propose a hybrid CNN model by combining an SE-based method (yin:tacl2016) and an SI-based method (pang:aaai2016). Specifically, we modify the SE-based method using two element-wise comparison functions inspired by (mou:acl2016; wang:iclr2017) to match the two sentence embeddings, and then concatenate them together with sentence embeddings from the SI-based method.
Transfer Learning Framework. Transfer learning aims to apply knowledge gained in a source domain to help a target domain (pan:tkde2010). The key issue is how to transfer the shared knowledge from the source domain to the target domain while excluding the source-specific knowledge, based on the domain relationship. Most recent studies for TL in NLP perform multi-task feature learning by exploiting different NN models to capture a shared feature space across domains. As illustrated in Fig. 2a and Fig. 2b, one line of work employs a fully-shared framework to learn a shared representation, followed by a separate fully connected layer for each domain (mou:EMNLP2016; yang:iclr2017), while another line of work uses a specific-shared framework to learn not only a shared representation for both domains but also a domain-specific representation for each domain (liu:acl2017).
However, the first line of work simply assumes that two domains share the same feature space but ignores the domain-specific feature space. Although the latter one is capable of capturing both the shared and the domain-specific representations, it does not consider any relationships between the weights of the final output layer. Generally speaking, the weights on the output layer should capture both the inter-domain and the intra-domain relationships: (1) For the shared feature space across domains, since it is expected to be domain-independent, the weights corresponding to this feature space in the two domains should be positively related to each other; (2) For the shared and the domain-specific feature spaces in each domain, since they are expected to respectively capture domain-independent and domain-dependent features, their corresponding weights should be unrelated to each other. Motivated by this intuition, in this paper, we propose a new transfer learning method that explicitly models the domain relationships via a covariance matrix, which imposes a regularization term on the weights of the output layer to uncover both the inter-domain and the intra-domain relationships. Besides, to make the shared representation more invariant across domains, we follow some recent work on adversarial networks (ganin:jmlr2016; liu:acl2017) and introduce an adversarial loss on the shared feature space in our method. Fig. 3 gives an outline of our full model.
To evaluate our proposed method, we conduct both intrinsic evaluation and extrinsic evaluation.
Intrinsic Evaluation. We conduct extensive experiments on both a benchmark dataset and our own dataset. (1) The hybrid CNN model is shown to be not only efficient but also effective, in comparison with several representative methods; (2) Our proposed transfer learning method can bring significant improvements over the base model without transfer learning, and outperform existing TL frameworks including the widely used fully-shared model and the recently proposed specific-shared model; (3) Further analysis on our learned correlation matrix shows that our method is able to capture the inter-domain and intra-domain relationships.
Extrinsic Evaluation. We deploy our proposed hybrid CNN-based transfer learning model in our online chatbot system, which runs on a real E-commerce site, AliExpress. Both the offline and the online evaluations show that our new system can significantly outperform the existing online chatbot system. Finally, we launch our new system on Eva, a chatbot platform in AliExpress (Eva can be accessed via the following link: https://gcx.aliexpress.com/ae/evaenglish/portal.htm?pageId=195440).
In this section, we present our general model for paraphrase identification and natural language inference, which will be used for question reranking in our chatbot-based QA system.
2.1. Problem Formulation and Notation
Our model is designed to address the following general problem. Given a pair of sentences, we would like to identify their semantic relation. For paraphrase identification (PI), the semantic relation indicates whether or not the two sentences express the same meaning (yin:naacl2015); for natural language inference (NLI), it indicates whether a hypothesis sentence can be inferred from a premise sentence (snli:emnlp2015).
Formally, assume there are two input sentences, each represented as a sequence of word embeddings, where every embedding is a dense vector of fixed dimension retrieved from a lookup table for all the words in the vocabulary. Our task is to predict the semantic label, which indicates the relation between the two sentences. For PI, we assume the label to be either paraphrase or not paraphrase; for NLI, we assume it to be either neutral, entailment or contradiction.
We consider a transfer learning setting, where we have one set of labeled sentence pairs from a source domain and another from a target domain. Note that the source set is assumed to be much larger than the target set. We seek to use both sets to train a good model so that it can work well in the target domain.
To solve such a problem, a widely used transfer learning method (as illustrated in Fig. 2a) is to use the same NN model to transform every pair of input sentences in both domains into a fixed-size hidden representation, that is, to apply a single shared, parameterized transformation function to each sentence pair. Next, for the source and the target domains, we assume that two fully connected layers, each with its own weight matrix and bias vector, are separately learned to map the hidden representation to the label.
Besides, another transfer learning approach (liu:acl2017) was recently proposed, which uses a domain-shared NN model and two domain-specific NN models to obtain a shared embedding and two domain-specific embeddings, one for each domain. The output layer of each domain therefore takes both the shared embedding and that domain’s specific embedding as input.
The main limitation of the fully-shared framework is that it ignores source-specific and target-specific features, while the specific-shared framework fails to consider any inherent correlations between the weights on the output layers. Therefore, in the next section we introduce our proposed method, which explicitly incorporates such correlations into the specific-shared framework.
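For reference, the two output-layer designs discussed in this subsection can be written compactly as follows. The notation is an assumption introduced here for illustration: $z$ is the hidden representation of the fully-shared model, $z_c$ is the shared representation and $z_s$, $z_t$ are the domain-specific representations of the specific-shared model, and a softmax output is assumed.

```latex
% Fully-shared framework: one shared representation z, one output layer per domain.
\hat{y}^{s} = \mathrm{softmax}\left(W^{s} z + b^{s}\right), \qquad
\hat{y}^{t} = \mathrm{softmax}\left(W^{t} z + b^{t}\right)

% Specific-shared framework: each domain combines the shared representation z_c
% with its own domain-specific representation (z_s or z_t).
\hat{y}^{s} = \mathrm{softmax}\left(W^{sc} z_c + W^{s} z_s + b^{s}\right), \qquad
\hat{y}^{t} = \mathrm{softmax}\left(W^{tc} z_c + W^{t} z_t + b^{t}\right)
```

Under this notation, the specific-shared framework has four output-layer weight matrices, two per domain, which is equivalent to applying a single fully connected layer to the concatenation of the shared and the domain-specific representations.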
2.2. Proposed Transfer Learning Method
Our goal is to model the inter-domain relationship between the source and the target weights on the shared representation, and the intra-domain relationship between the shared and the domain-specific weights within the source domain as well as within the target domain. Hence, we first reshape each weight matrix into a vector, followed by concatenating all the four reshaped vectors to form a new matrix, where each column corresponds to one weight matrix of the output layer.
Next, to capture the domain relationships mentioned above, we introduce a covariance matrix over these four columns. Note that each element of the covariance matrix indicates the correlation between a pair of columns, where each column is one of the four reshaped weight matrices. Inspired by a general multi-task relationship learning framework as introduced in (zhang:uai2010), we consider constraining the output layer’s weights with the covariance matrix through a regularization term defined via the trace of a square matrix. This means that if an element of the covariance matrix is a large positive/negative value, the corresponding pair of weight vectors will be positively/negatively related to each other; otherwise, if it is close to zero, the two weight vectors will be unrelated to each other. Note that to the best of our knowledge, we are the first to apply the multi-task relationship learning framework to NN-based transfer learning methods.
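To make the role of the covariance matrix concrete, the regularizer can be expanded element by element. This is a sketch that assumes the regularizer takes the trace form used in the multi-task relationship learning framework of (zhang:uai2010); the symbols $W$, $w_i$ and $\Omega$ are notation assumed here, with $w_i$ the $i$-th column of $W$ (one reshaped output-layer weight matrix).

```latex
% W = [w_1, w_2, w_3, w_4]: one column per reshaped output-layer weight matrix.
% Omega: 4 x 4 covariance matrix over the columns of W.
\mathrm{tr}\left(W \Omega^{-1} W^{\top}\right)
  = \sum_{i=1}^{4} \sum_{j=1}^{4} \left(\Omega^{-1}\right)_{ij} \, w_i^{\top} w_j
```

Each pair of weight vectors therefore contributes its inner product, weighted by the corresponding entry of the inverse covariance matrix, which is how the learned matrix couples or decouples the four sets of weights.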
In order to simultaneously learn our model parameters and the domain relationships in a unified framework, we formulate a loss function with a set of regularization parameters, in which the covariance matrix is required to be positive semi-definite and its trace is required to be 1 without loss of generality. In this formulation, the first term refers to the cross-entropy loss for both domains, and the second term serves as a domain-relationship regularizer to constrain the weights on the output layer. The remaining terms are standard L2-regularization terms.
2.3. Adversarial Loss
Our transfer learning method above is based on the specific-shared framework, which is assumed to capture the shared and domain-specific feature spaces well. However, as suggested by (liu:acl2017), the shared representation learned in this framework may still contain noisy domain-specific features. Therefore, to eliminate the noisy features, we also incorporate an adversarial loss on the shared feature space so that the trained model cannot distinguish between the source and target domains on it (ganin:jmlr2016).
First, we assume that the shared layer is mapped to a binary domain label, which indicates whether the input comes from the source or the target domain.
Since the goal of adversarial training is to make the shared feature space indistinguishable across the two domains, we define the adversarial loss as the negative entropy of the predicted domain distribution, which is to be minimized; this differs from maximizing the negative cross-entropy as in (ganin:jmlr2016; liu:acl2017).
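The negative-entropy adversarial loss can be illustrated with a minimal sketch. The function names and the two-way softmax domain classifier are assumptions made for illustration; the sketch only shows that the loss is lowest when the predicted domain distribution is uniform, i.e. when the shared features do not reveal the domain.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_entropy(probs, eps=1e-12):
    """Return sum_k p_k * log(p_k), the negative entropy of a distribution."""
    return sum(p * math.log(p + eps) for p in probs)


def adversarial_loss(domain_logits_batch):
    """Average negative entropy of the predicted domain distributions."""
    losses = [negative_entropy(softmax(logits)) for logits in domain_logits_batch]
    return sum(losses) / len(losses)


# A confident domain prediction (hypothetical logits) gives a loss near 0,
# while an undecided prediction gives the minimum value, -log(2).
confident = adversarial_loss([[4.0, -4.0]])
undecided = adversarial_loss([[0.0, 0.0]])
print(confident, undecided)
```

Minimizing this quantity with respect to the shared network pushes the domain classifier's output toward the uniform distribution.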
Finally, we obtain a combined objective function, in which a hyper-parameter tunes the importance of the adversarial loss. As suggested by (zhang:uai2010), it is not easy to optimize such a semi-definite programming problem. We will present an alternating training approach in Section 2.5 for solving it efficiently.
2.4. Base Model
Although the proposed transfer learning method is general and any neural network for modeling a pair of sentences can be applied to it, we further aim to propose an efficient and effective base model for encoding a pair of sentences.
On one hand, although various attention-based LSTM architectures have been proposed to achieve a superior performance on both PI and NLI (rock:iclr2015; parikh:emnlp2016; wang:iclr2017; chen:acl2017), these models are very time-consuming due to the computation of memory cells and attention weights in each time step, which may not satisfy the industry demand, especially when QPS is high. On the other hand, CNN-based models are proven to be efficient, hence are the focus of our study. Most existing CNN-based models can be categorized into two groups: sentence encoding (SE)-based methods and sentence interaction (SI)-based methods. The former aims to first learn good representations for each sentence, followed by using a comparison function to transform them into a single representation (yin:naacl2015; mou:EMNLP2016), while the latter tries to directly model the interaction between two sentences at the beginning and then makes abstractions on top of the interaction output (hu:nips2014; pang:aaai2016). Observing that the two lines of methods focus on different perspectives to model sentence pairs, we expect that a combination of them can capture both good sentence representations and rich interaction structures.
Hence, we propose a hybrid CNN (hCNN) model, which is based on minor modifications of two existing models: an SE-based BCNN model (yin:tacl2016) and an SI-based Pyramid model (pang:aaai2016). Fig. 3 depicts our full transfer learning framework, which contains one shared hCNN and two domain-specific hCNNs. Below we briefly go through the architecture of hCNN. Note that in our implementation and the model description below, we pad the two input sentences to the same length.
Modified BCNN: Following the original BCNN (yin:tacl2016), we first use two separate 1-D convolutional (conv) and 1-D max-pooling layers to encode the two input sentences into two sentence embeddings.
Furthermore, as (mou:acl2016; wang:iclr2017) suggest that element-wise comparison can work well on the problem, we use two comparison functions, element-wise subtraction and element-wise multiplication, to match the two sentence embeddings, and then concatenate their outputs together with the sentence embeddings as the sentence pair representation. Note that this setting is different from the original BCNN, and it yields better performance in our empirical experiments.
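The comparison-and-concatenation step can be sketched as follows. This is a minimal illustration on plain Python lists; the function name and the order of the concatenated parts are assumptions, since the text only states which pieces are combined.

```python
def pair_representation(h1, h2):
    """Concatenate two sentence embeddings with their element-wise
    difference and element-wise product (order of parts is assumed)."""
    if len(h1) != len(h2):
        raise ValueError("sentence embeddings must have the same size")
    difference = [a - b for a, b in zip(h1, h2)]
    product = [a * b for a, b in zip(h1, h2)]
    return list(h1) + list(h2) + difference + product


# Toy 2-dimensional sentence embeddings (hypothetical values).
rep = pair_representation([1.0, 2.0], [3.0, 5.0])
print(rep)
```

For sentence embeddings of size d, the resulting pair representation has size 4d.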
Pyramid: As shown in the rightmost part of Fig. 3, we first produce an interaction matrix, where each entry denotes the similarity score between a word in the first sentence and a word in the second sentence. Following (pang:aaai2016), we use dot-product to compute the similarity score.
Next, by viewing the interaction matrix as an image, we stack two 2-D convolutional layers and two 2-D max-pooling layers on it to obtain the hidden representation.
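The interaction matrix that the 2-D convolutions operate on can be sketched in a few lines. The function name and the toy 2-dimensional word embeddings are assumptions for illustration; only the dot-product similarity comes from the text.

```python
def interaction_matrix(sentence1, sentence2):
    """Entry (i, j) is the dot product between the embedding of word i in
    the first sentence and the embedding of word j in the second sentence."""
    return [
        [sum(a * b for a, b in zip(w1, w2)) for w2 in sentence2]
        for w1 in sentence1
    ]


# Two toy sentences given as lists of word embeddings (hypothetical values).
s1 = [[1, 0], [0, 1]]
s2 = [[1, 0], [1, 1], [0, 2]]
matrix = interaction_matrix(s1, s2)
print(matrix)
```

With both sentences padded to the same length, the matrix is square and can be treated as a single-channel image.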
Finally, we concatenate the two hidden representations, one from the modified BCNN and one from Pyramid, as the final representation for each input sentence pair.
In our combined objective function, we have nine parameters, namely eight model parameters and the covariance matrix, and it is not easy to optimize them at the same time. Following the practice in (zhang:uai2010), we employ an alternating stochastic method, i.e., in each iteration we first optimize the eight model parameters with the covariance matrix fixed, and then optimize the covariance matrix with the others fixed. The details are given below.
Updating the eight model parameters. With the covariance matrix fixed, the optimization problem reduces to minimizing the combined objective over the remaining parameters. Since it is a smooth function, we can easily compute its partial derivatives with respect to the eight parameters.
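Updating the covariance matrix. With the eight model parameters fixed, the optimization problem reduces to minimizing the domain-relationship regularizer over the covariance matrix, subject to the positive semi-definiteness and unit-trace constraints. As proved by (zhang:uai2010), this optimization problem has an analytical solution.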
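For reference, the closed-form update from the multi-task relationship learning framework of (zhang:uai2010) can be stated as follows. The symbols are notation assumed here: $W$ is the matrix whose four columns are the reshaped output-layer weights, and $\Omega$ is the covariance matrix.

```latex
% Subproblem solved with all other parameters fixed:
\min_{\Omega}\ \mathrm{tr}\left(W \Omega^{-1} W^{\top}\right)
\quad \text{s.t.} \quad \Omega \succeq 0,\ \ \mathrm{tr}(\Omega) = 1

% Analytical solution:
\Omega^{\star} = \frac{\left(W^{\top} W\right)^{1/2}}
                      {\mathrm{tr}\left(\left(W^{\top} W\right)^{1/2}\right)}
```

Since $W^{\top} W$ is only a 4 x 4 matrix in this setting, computing its matrix square root is cheap compared with a gradient step on the networks.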
Finally, we present the whole procedure for training our full model in Algorithm 1. Note that we only update the covariance matrix after each full pass over the target training instances.
2.6. Implementation Details
In our full transfer learning model, we initialize the lookup table with the pre-trained vectors from GloVe (pennington:emnlp2014), setting the embedding dimension to 300. For BCNNPyramid in both PI and NLI, the feature map sizes are set to be 8 and 16, the strides are set to be 1 and 3, and the kernel sizes are set to be and ; for the two max-pooling layers of Pyramid, the strides are set to be 4 and 2, and the pooling sizes are set to be and . Besides, for and , we set them as 0.05 and 0.0008; while for , , and , we set them as 0.0004. AdaGrad (duchi:jmlr2011) is used to train our model with an initial learning rate of 0.08.
3. Online System
As introduced in Section 1, our online chatbot system is based on traditional information retrieval techniques, where the goal is to obtain the nearest question in the knowledge base for a given customer question (jeon:cikm2005). Fig. 1 depicts the whole system architecture.
Specifically, we first build an index for all the questions in our knowledge base (KB) using Apache Lucene (https://lucene.apache.org/core/). Next, given a query question, we employ the TF-IDF ranking algorithm (wu:tois2008) in Lucene to compute its similarities to all the questions in the KB, and recall the top candidate questions. We then use a reranking algorithm to compute the similarities between the query and the candidates, and obtain the most similar candidate. Finally, we return the answer of the selected candidate to answer the query question. Note that in this paper, we only consider formulating our question rerank module as a PI task, but one can also model it as an NLI task.
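The retrieve-then-rerank flow described above can be sketched in plain Python. This is not the Lucene implementation: the smoothed TF-IDF weighting, the cosine scoring, the function names, the toy knowledge base and the token-overlap reranker are all assumptions standing in for the real index and the learned reranking model.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_index(questions):
    """Return one sparse TF-IDF vector per KB question, plus the IDF table."""
    docs = [tokenize(q) for q in questions]
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def answer_query(query, kb, rerank_score, top_k=30):
    """Retrieve top_k candidates by TF-IDF, rerank them, return the best answer."""
    vectors, idf = build_index([question for question, _ in kb])
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    by_tfidf = sorted(range(len(kb)), key=lambda i: cosine(q_vec, vectors[i]), reverse=True)
    candidates = by_tfidf[:top_k]
    best = max(candidates, key=lambda i: rerank_score(query, kb[i][0]))
    return kb[best][1]


def token_overlap(a, b):
    """Toy reranker (Jaccard overlap); a paraphrase model would go here."""
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


kb = [
    ("how do I track my order", "Use the order page."),
    ("how can I get a refund", "Open a dispute."),
]
reply = answer_query("where can I track my order", kb, token_overlap)
print(reply)
```

The `rerank_score` argument is the only part that changes when a different reranker is plugged in, which mirrors how the question rerank module sits behind the retrieval step.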
Our existing reranking method is based on an ensemble method for the Answer Selection task (wang:acl2015). But instead of using the output of the time-consuming LSTM model, we feed another three features, namely, Word Mover’s Distance (kusner:icml2015), keywords features (wang:acl2015) and the cosine distance of sentence embeddings (wieting:iclr2016), to a gradient boosted regression tree (GBDT).
To combine our model with the existing ranking method, we treat the probability of being paraphrases predicted by our model as an additional feature, and feed all features to GBDT for reranking.
In this section, we describe a qualitative evaluation of our proposed methods from the following perspectives: (1) From Section 4.1 to Section 4.4, we perform an intrinsic evaluation by utilizing a benchmark dataset and our own dataset to show the efficiency and effectiveness of our proposed base model and transfer learning framework; (2) In Section 4.5, we deploy our full model into our chatbot system, and conduct an extrinsic evaluation to show that our full model can bring in significant improvements to our existing online chatbots.
4.1. Experiment Settings
Datasets: In this section, we evaluate our methods on both Paraphrase Identification (PI) and Natural Language Inference (NLI).
For PI, we used a recently released large-scale dataset (https://www.kaggle.com/c/quora-question-pairs) by Quora as the source domain, and our E-commerce dataset as the target domain. Based on our historical data, we constructed a question answering KB, which consists of around 15,000 frequently asked QA pairs. To create labeled question pairs, we first collected all the query questions from the chat log of conversations between clients and our staff from May 22 to May 28, 2017. For each query question, we then used Lucene indexing to retrieve several of its similar questions, and obtained 45,075 question pairs. Finally, we asked a business analyst to annotate all the question pairs.
For NLI, we employed a large-scale multi-genre corpus (williams:arxiv2017), which contains an image captioning domain (SNLI) and another five distinct genres/domains about written and spoken English (MultiNLI). For the MultiNLI dataset, we use Version 0.9 in this paper. Note that since the labels of the original test set are unavailable, we treat its development set as our test set, and randomly choose 2000 sentence pairs from its training set as our development set. Since the number of sentence pairs in SNLI is much larger than that in the other five domains, we took SNLI as the source domain, and the others as the target domains.
Table 1 and Table 2 summarize the statistics of our datasets. Note that in Table 2, the numbers before and after the slash for Q-Q pairs denote, respectively, the total number of question pairs and the number of positive question pairs (i.e., paraphrases), while the two numbers for #Query-Q denote, respectively, the total number of query questions and the number of questions with paraphrasing candidates. Besides, #Candi-Q refers to the average number of candidate questions for each query.
Compared Methods: For base models, we compared our hCNN model with the following models:
BCNN is the left component of our hCNN model, which incorporates element-wise comparisons on top of the base model proposed in (yin:tacl2016).
Pyramid is the right component of our hCNN model based on sentence interactions as in (pang:aaai2016).
ABCNN is the attention-based CNN model by (yin:tacl2016).
BiLSTM is similar to BCNN, but uses LSTM instead of CNN to encode each sentence as in (snli:emnlp2015).
ESIM is one of the state-of-the-art attention-based LSTM models on SNLI proposed by (chen:acl2017).
hCNN is our hybrid CNN model as introduced in Section 2.4.
For evaluating the proposed transfer learning framework, we employed the following compared systems:
Tgt-Only is the baseline trained in the target domain.
Src-Only is another baseline trained in the source domain.
Mixed is to simply combine the labeled data in the two domains to train the hCNN model.
Fine-Tune is a widely used TL method, where we first train a model on the source data, and then use the learned parameters to initialize the model parameters for training another model on the target data.
FS and SS are the fully-shared and specific-shared frameworks as detailed in Section 2.1.
DRSS is our proposed model of learning domain relationships based on SS as in Section 2.2.
SS-Adv and DRSS-Adv denote adding the adversarial loss into SS and DRSS as in Section 2.3.
All the methods in this paper are implemented with TensorFlow and are trained on machines with NVIDIA Tesla K40m GPUs.
Evaluation Metrics: For PI, since our goal is to retrieve the most similar candidate for each query question, we use our model to predict each candidate’s probability of being a paraphrase as its similarity score, and then rank all the candidates. To evaluate the ranking performance, we use Precision@1, Recall@1 and F1@1 as metrics; to evaluate the classification performance for all question pairs, we employ two metrics: the Area under the Receiver Operating Characteristic curve (AUC) score (bradley:pr1997) and the classification accuracy (ACC). For NLI, we only use ACC as the evaluation metric.
4.2. Comparisons Between Base Models
Table 3. Comparison of base models. Columns: AUC, ACC and Test Time (ms) for PI; ACC and Test Time (ms) for NLI.
In Table 3, we compare different models for classifying sentence pairs with hCNN in both efficiency and effectiveness. Note that to fairly evaluate the efficiency of each model, we compute the total time of predicting all the test sentence pairs on CPU by setting the mini-batch size to 1, and report the average time. Also, for feature map sizes in BCNN, ABCNN and the BCNN component in hCNN, we set them as 50 for PI and 300 for NLI.
First, we can find that LSTM-based methods are generally much slower than CNN-based methods. In particular, although ESIM can outperform all CNN-based models, its computational time for each sentence pair is 32.2ms for our dataset and 79.5ms for SNLI, which is 6 to 11 times that of CNN-based models. This means that most existing state-of-the-art models can only support a low QPS, and are therefore hard to apply in industry. Second, for both tasks, hCNN clearly performs better than the other CNN-based methods, which indicates that BCNN and Pyramid are complementary to each other, and can work better when combined. Moreover, we verified that the improvements of hCNN over the other methods are significant based on McNemar’s paired significance test (gillick:icassp1989). Finally, while the computational cost of hCNN is slightly higher than that of BCNN and Pyramid, it can still serve 233 question pairs per second, which is able to satisfy the current demand of our industrial bot.
4.3. Comparisons Between TL Methods
We can observe from Table 5 that for all the five target domains, Src-Only performs much worse than Tgt-Only, and the average performance of Mixed is even worse than Tgt-Only. This implies that the source domain is quite different from all the target domains, and simply mixing the training data in two domains may lead to the model overfitting the source data, since the source training set is much larger than the target one. In addition, it is observed that the widely used Fine-Tune method can perform slightly better than Tgt-Only in most cases, which shows that pre-training the model parameters on a related source domain is better than randomly initializing them. Moreover, in all the five domains, the two existing transfer learning frameworks FS and SS both perform 1.9% better than Tgt-Only, which shows their usefulness. Furthermore, our proposed method DRSS improves the average performance of SS to 0.665, and the improvements are significant over all the tasks based on McNemar’s paired significance test. This suggests that capturing the relationship between domains is generally useful for transfer learning. Finally, we can see that the incorporation of adversarial loss into SS and DRSS further boosts their performance, and DRSS-Adv achieves the best accuracy across all the methods. Similar trends can also be observed for the PI task from Table 5. Interestingly, by comparing Table 3 and Table 5, we find that with the help of training data from the source domain, the performance of DRSS-Adv is even better than that of ESIM. These observations demonstrate the effectiveness of our transfer learning method.
Apart from the effectiveness, we also measure the efficiency of each method. Since the first five methods only use a single hCNN model for prediction, the computational time is the same as hCNN. As for SS, DRSS and their adversarial extensions, the computational time is 6.9ms, which is slightly longer than the other five methods but still much shorter than LSTM-based methods as in Table 3.
4.4. Domain Relationships
After obtaining the covariance matrix for each source/target pair, we can derive their corresponding correlation matrices. For better comparison, here we show the square root of the correlation matrices for DRSS.
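Deriving a correlation matrix from a covariance matrix is a standard normalization, sketched below. The function name and the toy 2 x 2 covariance values are assumptions for illustration; the additional square-root transform mentioned above is not applied here.

```python
import math


def correlation_from_covariance(cov):
    """Normalize each covariance entry by the two standard deviations:
    corr[i][j] = cov[i][j] / sqrt(cov[i][i] * cov[j][j])."""
    size = len(cov)
    std = [math.sqrt(cov[i][i]) for i in range(size)]
    return [
        [cov[i][j] / (std[i] * std[j]) for j in range(size)]
        for i in range(size)
    ]


# Toy covariance matrix (hypothetical values).
cov = [[4.0, 2.0],
       [2.0, 9.0]]
corr = correlation_from_covariance(cov)
print(corr)
```

The diagonal of the result is always 1, so only the off-diagonal entries carry information about how two sets of weights relate.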
As shown in Fig. 4, across all the six source/target pairs, the source and the target output-layer weights on the shared representation are positively related to each other. This is intuitive, as the shared network is supposed to learn shared features between the source and the target domains, so these two learned sets of weights should be close to each other. This also shows that the learned correlation matrix helps to capture the inter-domain relationship between them.
In Fig. 4, we can also see that for most source/target pairs except SNLI→Fict., the correlations between the shared and the domain-specific weights within each domain learnt by our model are small. This indicates that in most cases, the shared feature space and the domain-specific feature space learnt by SS tend to be different from each other, and our model can help to reveal such intra-domain relationships.
Finally, to gain a deeper insight into the helpfulness of the adversarial training, we compare the correlation matrices learnt by DRSS and DRSS-Adv. We first show the result of SNLI→Fict. in Table 6. As we can see, for DRSS, the correlation between the shared and the domain-specific weights in the source (or target) domain is relatively large, while for DRSS-Adv, the correlation is relatively small. For the other subtasks, we find that the learnt matrices of DRSS-Adv are similar to those of DRSS, but we still observe that the intra-domain correlations of DRSS-Adv are generally smaller than those of DRSS. This shows that adding the adversarial loss can encourage the shared feature space to capture more domain-independent features, and further make the shared and domain-specific feature spaces more different. Therefore, the adversarial training can lead our model to better satisfy our assumption on the domain relationships, and finally improve the performance. All the above observations demonstrate that our model can capture the inter-domain and intra-domain relationships as mentioned in Section 2.2.
4.5. Extrinsic Evaluations
As mentioned in Section 3, for the online reranking algorithm, we propose to train GBDT by treating the prediction score of our DRSS model as another feature. To achieve this, we first took out the prediction scores of our DRSS model on the validation set. Then, we combined them with the other features introduced in Section 3, and trained GBDT on the validation set. The model performance on the test set is reported in Table 7. Note that the test time here denotes the average serving time (including the Response Time), which is different from the reported test time in Table 3.
As we can see from the offline test, the GBDT model with the feature derived from our DRSS model (referred to as GBDT-DRSS) is respectively 26.3% and 7.1% better than our existing online model (referred to as GBDT) and the GBDT model with the feature derived from hCNN (referred to as GBDT-hCNN) in F1@1. Although adding our DRSS feature leads to more computational time, the total prediction time is 80.7ms for each query question (i.e., QPS of 12), which is acceptable for our chatbots.
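The relation between per-request latency and QPS used in this section is simple arithmetic, sketched below. It assumes a single worker that handles one request at a time; the function names are hypothetical.

```python
def qps_from_latency_ms(latency_ms):
    """Queries per second for a worker that handles one request at a time."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


def latency_ms_from_qps(qps):
    """Average time budget per request, in milliseconds, for a given QPS."""
    if qps <= 0:
        raise ValueError("qps must be positive")
    return 1000.0 / qps


# 80.7 ms per query (GBDT-DRSS) and 233 question pairs per second (hCNN).
reranker_qps = qps_from_latency_ms(80.7)
hcnn_latency = latency_ms_from_qps(233)
print(f"{reranker_qps:.1f} queries/s, {hcnn_latency:.2f} ms per pair")
```

Under this assumption, 80.7ms per query corresponds to about 12.4 queries per second, which rounds down to the reported QPS of 12, and 233 pairs per second corresponds to roughly 4.3ms per sentence pair.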
For online serving, to accelerate the computation, we set the number of candidates returned by Lucene as 30, and bundle the 30 candidates into a mini-batch to feed into our model for prediction. For online evaluation, we randomly sampled 2750 questions, where 1317 questions are answered by GBDT and 1433 questions are answered by GBDT-DRSS. Then, we asked one business analyst to annotate if the nearest question returned by models expresses the same meaning as the query question, and compared their precision at top-1. As shown in Table 7, the Prec@1 of GBDT-DRSS is 18.8% higher than that of GBDT.
Table 7. Columns: F1@1 and Time (ms per query) for the offline test; Prec@1 for the online evaluation.
5. Related Work
Paraphrase Identification and Natural Language Inference:
Recent years have witnessed great successes in applying different neural networks, including Recursive Neural Networks (ReNN), Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN), to Paraphrase Identification and Natural Language Inference (socher:nips2011; snli:emnlp2015; yin:tacl2016). Although all these models have been shown to significantly outperform the traditional methods without deep learning, most of them only focus on improving the performance in the standard in-domain setting, and therefore require a large amount of labeled data to train a robust model. However, in practice, it is time-consuming and costly to manually annotate much labeled data for each target domain we are interested in. Hence, in this paper, our focus is to apply an efficient and effective NN model in a transfer learning framework so that we can leverage the large amount of labeled data from a related source domain to train a robust model for a resource-poor target domain, which can benefit our chatbot-based question answering system.
Transfer Learning: Transfer learning (TL) has been extensively studied in the last decade (long:tkde2014). Most existing studies on TL can be categorized into two groups. The first line of work assumes that we have enough labeled data from a source domain and also a small amount of labeled data from a target domain (daumeiii:acl2007), and the second line assumes that we only have labeled data from a source domain but may also have some unlabeled data from a target domain (blitzer:EMNLP2006; yu:EMNLP2016). Our study belongs to the first line of work, which is also referred to as supervised domain adaptation.
For supervised domain adaptation, the majority of previous work falls into two clusters: instance-based and feature-based transfer learning. The former focuses on mining the source labeled data for instances that are close to the distribution of the target domain, and combining them with the target labeled data (dai:icml2007; jiang:acl2007). The core idea of the latter is to find a shared feature space that reduces the divergence between the distributions of the source and target domains (argyriou:nips2007; lee:icml2007; wang:icml2008; qiu:icdm2017). Our work follows the latter, and leverages NN models to learn a shared hidden representation for sentence pairs across domains.
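To make the feature-based idea concrete, the sketch below shows feature augmentation, a classical and very simple scheme for supervised domain adaptation with linear models. It is given only as an illustration of a shared feature space plus domain-specific feature spaces; it is not the NN-based method of this paper, and the function name `augment` is an assumption made for this example.

```python
from typing import List


def augment(features: List[float], domain: str) -> List[float]:
    """Map x to [shared copy, source-only copy, target-only copy].

    A source example becomes [x, x, 0] and a target example becomes [x, 0, x],
    so a linear model trained on the union of both domains can learn shared
    weights (first block) as well as domain-specific corrections.
    """
    zeros = [0.0] * len(features)
    if domain == "source":
        return features + features + zeros
    if domain == "target":
        return features + zeros + features
    raise ValueError("domain must be 'source' or 'target'")


# Invented feature vector for illustration.
print(augment([1.0, 2.0], "source"))
print(augment([1.0, 2.0], "target"))
```

The shared block plays the role that the shared hidden representation plays in NN-based approaches, while the two remaining blocks capture what is specific to each domain.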
Deep Transfer Learning: With the recent advances in deep learning, different NN-based TL frameworks have been proposed for image processing (yosinski:nips2014) and speech recognition (wang:apsipa2015) as well as NLP (mou:EMNLP2016; yang:iclr2017; liu:acl2017). A simple but widely used framework is fine-tuning, which first uses the parameters of a model well trained on the source domain to initialize the model parameters for the target domain, and then fine-tunes these parameters on labeled data in the target domain (yosinski:nips2014; mou:EMNLP2016). Another line of work can be referred to as multi-task feature learning, which shares the intuition behind the feature-based transfer methods mentioned above. Within this line of work, one typical framework simply uses a shared NN to learn a shared feature space (mou:EMNLP2016; yang:iclr2017), while another representative framework employs a shared NN and two domain-specific NNs to derive a shared feature space and two domain-specific feature spaces, respectively (liu:acl2017). Motivated by the observation that both frameworks fail to consider the domain relationship, in this paper we propose to jointly learn the shared feature representations and domain relationships in a unified model. Moreover, inspired by the recent success of applying adversarial networks to unsupervised domain adaptation (ganin:jmlr2016; taigman:iclr2017) and multi-task learning (liu:acl2017), we also incorporate adversarial training into our transfer learning model in order to learn a more robust shared feature space across domains.
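The fine-tuning recipe described above (train on the source domain, initialize the target model with the resulting parameters, then continue training on the small target set) can be sketched with a toy model. The example below uses a plain logistic regression trained by SGD instead of a neural network, and all data, names and hyperparameters are invented for illustration; it is not the model proposed in this paper.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (feature vector, label in {0, 1})


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(weights: Sequence[float], data: Sequence[Example], lr: float, epochs: int) -> List[float]:
    """Run SGD on the logistic loss, starting from the given weights."""
    w = list(weights)  # copy, so the caller's parameters are left untouched
    for _ in range(epochs):
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
            for i, xi in enumerate(x):
                w[i] -= lr * (p - y) * xi
    return w


def predict(w: Sequence[float], x: Sequence[float]) -> int:
    return int(sum(wi * xi for wi, xi in zip(w, x)) > 0.0)


# Toy data: a larger source set and a tiny target set (first feature is a bias term).
source_data = [
    ([1.0, 2.0, 1.0], 1), ([1.0, -2.0, 0.5], 0),
    ([1.0, 1.5, -1.0], 1), ([1.0, -1.0, -0.5], 0),
] * 10
target_data = [([1.0, 1.5, 0.5], 1), ([1.0, -1.5, 0.2], 0)]

# Step 1: train on the source domain from scratch.
source_weights = train([0.0, 0.0, 0.0], source_data, lr=0.1, epochs=20)

# Step 2: initialize the target model with the source parameters,
# then fine-tune on the small labeled target set with a smaller learning rate.
target_weights = train(source_weights, target_data, lr=0.05, epochs=5)

print([predict(target_weights, x) for x, _ in target_data])
```

The smaller learning rate and fewer epochs in the second step are a common heuristic to avoid drifting too far from the source solution when target data is scarce; they are choices made for this sketch, not settings taken from the experiments.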
In this paper, we systematically evaluated different base methods and transfer learning techniques for modelling sentence pairs, with the goal of proposing an effective and efficient transfer learning framework for PI and NLI. Specifically, we first proposed a hybrid CNN model on the basis of two existing models, and then proposed a general transfer learning framework that simultaneously performs shared feature learning and domain relationship learning in an end-to-end manner. Evaluations on both a benchmark dataset and our own dataset showed that (1) our hybrid CNN model is both effective and efficient in comparison with several representative models; and (2) our transfer learning framework outperforms all the existing frameworks across six source/target pairs. We further deployed our transfer learning model in our online chatbot system, and showed that it improves the performance of the existing system by a large margin.
The authors would like to thank Feng Ji, Wei Zhou, Weipeng Zhao and Xu Hu for their help during the project, and the anonymous reviewers for their constructive comments.