Correction of Faulty Background Knowledge based on Condition Aware and Revise Transformer for Question Answering

06/30/2020 ∙ by Xinyan Zhao, et al. ∙ USTC

The study of question answering has received increasing attention in recent years. This work focuses on providing an answer that is compatible with both the user intent and the conditioning information corresponding to the question, such as delivery status and stock information in e-commerce. However, these conditions may be wrong or incomplete in real-world applications. Although existing question answering systems have considered external information, such as categorical attributes and triples in a knowledge base, they all assume that the external information is correct and complete. To alleviate the effect of defective condition values, this paper proposes the condition aware and revise Transformer (CAR-Transformer). The CAR-Transformer (1) revises each condition value based on the whole conversation and the original condition values, and (2) encodes the revised conditions and utilizes the conditions embedding to select an answer. Experimental results on a real-world customer service dataset demonstrate that the CAR-Transformer can still select an appropriate reply when the conditions corresponding to the question contain wrong or missing values, and it substantially outperforms baseline models on automatic and human evaluations. The proposed CAR-Transformer can be extended to other NLP tasks that need to consider conditioning information.




1 Introduction

Question answering (QA), which aims to provide concise, direct answers to user queries based on abundant external information, has recently become a major focus of natural language processing (NLP) research. According to the type of external information, QA tasks can be roughly divided into two groups: text-QA and knowledge-based QA (KBQA). The former utilizes information in the form of text, such as web documents, while the latter utilizes knowledge from structured knowledge bases (KBs) such as Freebase [3] and sales and marketing databases [18]. Recently, neural approaches have been exploited to represent questions and external information as continuous vectors so that inference can be performed in a hidden semantic space [4, 49, 17, 51, 52, 41, 37, 31]. Neural approaches have an advantage in semantic understanding, since various expressions with similar semantic meanings have similar continuous representations in the hidden semantic space.

Although previous studies have made substantial progress, a major drawback of previous QA agents is that they assume the information from text or KBs is correct and complete, and thus they focus on how to better utilize that information. However, it may be unreasonable to suppose that the information is perfect in real-world applications. For example, many KBs are constructed by automatic extraction and may contain mistakes due to the lack of manual correction. In personalized dialog systems, user profile attributes may be incomplete due to privacy policies. In addition, delayed, inaccurate, and missing information about logistics, inventory, and transactions is common in e-commerce applications.

Question | Conditions | Answer
I need this item by Friday morning. Please could you send it ASAP? | Unshipped, expedite delivery | We have now marked your order as URGENT ORDER in our system and we will try our best to deliver it to you on time.
(same question as above) | Shipped, normal delivery | We will try our best but we are not able to guarantee it.
Thank you for the update, yes I will wait for the new stock to come in. | In stock | Dear Customer, many thanks for your reply. We shall ship your order as soon as the new stock arrives.
Shirt arrived looks good, but too big as have another shirt similar. I need 17” collar, so XL would be better. Not opened package, so can I swap it for a smaller one? | Shipped: unknown, delivery status: unknown | Dear customer, to receive the order in time, we suggest you return the existing order for a refund and place a new order for the right size so that you don't have to wait for the return to be processed. Please request your return from here: <link>. We shall process your return request within two working days.
Table 1: Questions with different conditions responded to by the CAR-Transformer.

This work attempts to alleviate the impact of defective conditions corresponding to the question and to provide a valid answer that is compatible with both the user intent and the conditions. The conditions can be the delivery status, delivery service type, or stock information in e-commerce, or user profile attributes in customer service. For clarity, Table 1 shows several representative examples of the proposed model for customer service in e-commerce. The conditions in this application are order states. The first question in Table 1 has different answers corresponding to different conditions. The second question has the wrong condition value “In stock” and the third question has unknown condition values, but our model still gives appropriate responses. In general, there are two main challenges in our task.

  1. How to incorporate conditions into an end-to-end model. Typical KBQA systems leverage knowledge in an explicit way: the answer is what is retrieved from the knowledge base. In this study, however, the conditions serve to help answer selection instead of being given to the user directly.

  2. How to deal with missing or wrong conditions. Although defective condition values exist, it is not worthwhile to abandon the conditioning information completely. The model is expected not only to utilize the conditions but also to reduce the effect of inferior condition values. Note that we do not know in advance which condition of which sample is wrong.

To tackle these challenges, this paper proposes the condition aware and revise Transformer (CAR-Transformer). An overview of the CAR-Transformer is shown in Figure 1. The Transformer [40] is an encoder-decoder framework that adopts a self-attention mechanism to encode text instead of an RNN or CNN structure. Since the self-attention mechanism better captures long-distance dependencies, the Transformer is chosen as our base model. The CAR-Transformer consists of four parts: a conditions encoder, a conditions reviser, a dialogue encoder, and a classifier. We model the problem of conditions inference as a sequence generation problem and adapt the Transformer architecture into the dialogue encoder and the conditions reviser. The dialogue encoder transforms the conversation history and question into hidden representations. The conditions reviser then generates revised conditions based on all original condition values and the dialogue representations. After that, each revised condition value, which is discrete, is represented as a one-hot vector. These one-hot vectors are fed into the conditions encoder to obtain the conditions embedding. Finally, the concatenation of the conditions embedding and the dialogue embedding is fed into the classifier to make a prediction over the candidate responses. To capture the sophisticated interactions between features, the proposed conditions encoder and classifier adopt multi-layer neural networks, taking advantage of the strong representation and generalization ability of deep learning. Experiments on a real-world customer service dataset for e-commerce show that the CAR-Transformer can revise missing or wrong condition values to a certain degree and choose answers that accord with the conditions. Automatic and human evaluations demonstrate that the CAR-Transformer achieves significant improvements over all baseline methods. We also conduct experiments on the personalized bAbI dataset to further verify the effectiveness of the CAR-Transformer. In summary, this paper makes the following contributions.

  1. It proposes the CAR-Transformer, a highly effective Transformer-based conversational QA system. By revising and integrating the condition values of a question, the CAR-Transformer outperforms several strong baselines.

  2. By proposing the conditions reviser, the CAR-Transformer is capable of revising wrong or incomplete condition values of a question, which makes it more robust and practical in real-world applications compared to other QA systems. To our knowledge, this work is the first to discuss the treatment of defective external information in QA research.

  3. This paper explores how to represent and integrate categorical information into the Transformer framework, so that the prediction is compatible with both the categorical attributes and the context.

The remainder of this paper is organized as follows. A review of related work is provided in Section 2. Section 3 formulates the problem to be solved and presents the CAR-Transformer. Section 4 introduces the dataset used in this paper and reports extensive experiments. Finally, conclusions and prospects for future work are provided in Section 5.

2 Related Work

This section reviews the related work on sequence-to-sequence models and personalized dialog systems, which inspire the CAR-Transformer.

2.1 Sequence-to-Sequence Models

The sequence-to-sequence (Seq2Seq) model [14] was first proposed for the machine translation task and is also widely used in sequence generation tasks. Given a source sequence X and a target sequence Y, the model maximizes the conditional probability p(Y | X). The Seq2Seq model has an encoder-decoder structure. The encoder summarizes a variable-length input sentence into a fixed-size vector representation, and the decoder generates the target sequence token by token based on the representation from the encoder and its previous outputs. The encoder and decoder can be instantiated by RNNs [14], [39], [2], CNNs [19], the Transformer [40], and so on. RNNs are able to process temporal sequences; advanced temporal analysis models include learning in the model space [7, 11, 6] and its variants [22, 5, 23, 21]. Unlike RNNs, however, the Transformer does not require the sequence to be processed in order. In particular, the Transformer employs only a self-attention mechanism, which computes a weighted sum using dot-products between elements of the input sequence [30], [2], [35], [28], and achieves state-of-the-art results on many natural language generation tasks. The difference between the CAR-Transformer and prior work lies in that there is no temporal relationship between the order condition values, which form the target sequence in this study. Therefore, the condition reviser utilizes all original conditions when revising one condition value, and the revised condition values can be generated in parallel, whereas most Seq2Seq-based models generate words one by one, consuming the previously generated words as input.

2.2 Attributes-aware Dialog Systems

Learning the inherent attributes of dialogues explicitly is one way to improve the diversity and effectiveness of dialogue systems. Among the different attributes, topic and personality are the most widely studied. Xing et al. [48] use the Twitter LDA model to obtain the topic of a dialog and then feed the topic and input representations into a joint attention module to generate a topic-related response. Choudhary et al. [15] assign each utterance in a dialogue to a domain and generate the domain and content of the next utterance accordingly. However, these studies require an extra component to infer the topic and suffer from errors in topic inference. For personalized dialog systems, Li et al. [29] and Herzig et al. [25] project personal information into an embedding space and then input the embedding vector into the decoder. Qian et al. [36] use an extra component to learn when to employ the user profile. Yang et al. [50] and Zhang et al. [53] attempt to introduce personalized information into dialogs by transfer learning. These methods are all LSTM-based, which puts them at a disadvantage when dealing with long sequences. To better capture long-term dependencies, the proposed model is Transformer-based. Besides, Joshi et al. [27] and Luo et al. [32] use an end-to-end memory network to select a response from a candidate set. It is worth noting that none of the foregoing methods takes into account whether flaws exist in the personal information.

3 The Proposed Method

This section first formulates the problem investigated in this paper. Section 3.2 then introduces the structure of the transformer layer. Finally, a detailed description of the CAR-Transformer is provided in Section 3.3.

3.1 Problem Formulation

Formally, let (D, C, y) denote a training sample, where D = (u_1, …, u_T) denotes a conversation with T utterances. The last utterance u_T is the question that needs to be responded to, C = (c_1, …, c_m) denotes the condition values, and y denotes a candidate response. The goal of this work is to learn a classification model that selects an answer from the candidate set for a dialogue-conditions pair (D, C) by computing the probability distribution over the candidate answers.
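The formulation above can be made concrete with a minimal data container. This is an illustrative sketch: the class name, field names, and example values below are hypothetical, not taken from the paper's dataset.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical container for one training sample, following the formulation
# above: a dialogue of T utterances (the last one is the question), a list of
# categorical condition values, and the index of the true candidate reply.
@dataclass
class QASample:
    utterances: List[str]   # u_1 ... u_T; utterances[-1] is the question
    conditions: List[str]   # c_1 ... c_m, e.g. ["Shipped: no", ...]
    answer_id: int          # index into the fixed candidate-answer set

sample = QASample(
    utterances=["I need this item by Friday.", "Please send it ASAP."],
    conditions=["Shipped: no", "Delivery service: expedite"],
    answer_id=3,
)
question = sample.utterances[-1]  # the utterance to be responded to
```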

3.2 Transformer layer

The Transformer [40] is composed of a stack of identical layers, called transformer layers. Each transformer layer consists of a self-attention sub-layer followed by a feedforward sub-layer. Each sub-layer employs a residual connection [24] followed by layer normalization [1]. Let H denote the input of a transformer layer, where H is a sequence of n d-dimensional vectors. The self-attention sub-layer first transforms H into queries Q = H W^Q, keys K = H W^K, and values V = H W^V, where W^Q, W^K, and W^V are trainable matrices. Each query, key, and value matrix can be split into h parts called attention heads, indexed by i, each with dimension d_k = d / h. This multi-head attention mechanism allows the model to focus on different parts of the input sequence. The output of each head is then calculated as:

head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i

The outputs of all heads are concatenated and linearly transformed back into an n-by-d matrix before being passed to the feedforward sub-layer:

MultiHead(H) = Concat(head_1, …, head_h) W^O
Since the self-attention mechanism has no recurrent operation like an RNN, the Transformer keeps track of the absolute or relative position of a word by adding position encodings to the input embeddings of the bottom transformer layer. The position encoding is formulated as:

PE(pos, 2i) = sin(pos / 10000^(2i/d)),  PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

where pos is the position of a word and i is the index of the input dimension.
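As an illustration, the attention-head computation and the sinusoidal position encoding described above can be sketched in a few lines of NumPy. The dimensions below are arbitrary toy values, not the paper's settings.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    # head = softmax(Q K^T / sqrt(d_k)) V  -- a single attention head
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def position_encoding(n, d):
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(...)
    pe = np.zeros((n, d))
    pos = np.arange(n)[:, None]
    i = np.arange(0, d, 2)[None, :]
    pe[:, 0::2] = np.sin(pos / 10000 ** (i / d))
    pe[:, 1::2] = np.cos(pos / 10000 ** (i / d))
    return pe

n, d = 5, 8                               # toy sequence length and model dim
H = np.random.randn(n, d)                 # input of the transformer layer
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
out = attention_head(H @ Wq, H @ Wk, H @ Wv)   # shape (n, d)
pe = position_encoding(n, d)                   # added to the input embeddings
```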

3.3 CAR-Transformer

Figure 1: Overview of the proposed CAR-Transformer. The CAR-Transformer first revises the conditions according to the original condition values and the dialogue. For example, the original order condition value “No” of “Return goods received” is wrong, and the model revises it correctly. The CAR-Transformer then selects an answer in line with the dialogue content and the conditions.

An overview of the CAR-Transformer is shown in Figure 1. The CAR-Transformer first revises the conditions of the question according to the original condition values and the dialogue context. It then selects an answer in line with the dialogue content and the revised conditions. The CAR-Transformer consists of four components:

  • conditions encoder encodes the conditions into a vector representation;

  • dialogue encoder summarizes the whole conversation into a sequence of vector representations;

  • conditions reviser generates revised condition values based on the original condition values and the dialogue representations;

  • classifier selects a candidate response based on the revised conditions embedding and the dialogue representations.

Since the CAR-Transformer has two outputs, the revised condition values and the predicted response label, the overall objective function is defined as follows:

L = λ · L_cond + L_ans

where L_cond denotes the sum of the cross-entropy losses of the condition values, L_ans denotes the cross-entropy between the predictive distribution and the true candidate response label, and λ is a scalar balancing the two terms.
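A minimal sketch of the combined objective, assuming the two cross-entropy terms are mixed with a balance scalar lam as described above (the experiments set it to 0.2). All logits and targets below are made-up toy values, and the exact weighting of the two terms is our reconstruction, not the paper's verbatim equation.

```python
import numpy as np

def cross_entropy(logits, target):
    # standard softmax cross-entropy for a single example
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

# Hypothetical logits: one output head per condition, plus answer logits.
cond_logits = [np.array([2.0, 0.1]), np.array([0.5, 1.5, 0.2])]
cond_targets = [0, 1]                      # true (revised) condition values
ans_logits = np.array([0.2, 3.0, 0.1, 0.4])
ans_target = 1                             # true candidate-response label

lam = 0.2  # balance scalar between the two loss terms
L_cond = sum(cross_entropy(l, t) for l, t in zip(cond_logits, cond_targets))
L_ans = cross_entropy(ans_logits, ans_target)
total = lam * L_cond + L_ans
```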

Figure 2 highlights the structure of conditions encoder. Figure 3 illustrates the structure of dialogue encoder and conditions reviser. Details of each component will be described in the following subsections.

3.3.1 Conditions Encoder

Figure 2: The structure of conditions encoder.

The objective of this component is to represent the conditions as a vector and capture the interactions between conditions. Each condition value c_i is first converted into a one-hot vector o_i. Each one-hot vector is then compressed into an embedding space, and the embeddings are concatenated as the output of the embedding layer:

e = [e_1; e_2; …; e_m]

where e_i is the embedding of the i-th condition. Although the lengths of the one-hot vectors of different conditions can differ, the embeddings of all conditions are of the same size d_e, so the size of e is m · d_e. Finally, e is fed into a fully connected layer:

v = σ(W e + b)

where v is the final representation of the conditions, σ is an activation function, and W and b are the weight and bias of the fully connected layer, respectively. The size of v is a hyperparameter that determines the dimension of the conditions embedding.
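The one-hot, embed, concatenate, and fully-connected steps above can be sketched as follows. The condition vocabulary sizes, embedding dimensions, and ReLU activation are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

# Toy vocabulary sizes for three conditions (e.g. Shipped, Delivery status,
# Stock information); all dimensions here are illustrative.
cond_sizes = [2, 7, 2]
d_e, d_c = 4, 6                 # per-condition embedding size, output size
E = [rng.normal(size=(s, d_e)) for s in cond_sizes]   # embedding matrices
W = rng.normal(size=(len(cond_sizes) * d_e, d_c))
b = np.zeros(d_c)

def encode_conditions(values):
    # one-hot -> embed -> concatenate -> fully connected layer with ReLU
    embs = [one_hot(v, s) @ Ei for v, s, Ei in zip(values, cond_sizes, E)]
    e = np.concatenate(embs)    # e = [e_1; ...; e_m], size m * d_e
    return np.maximum(0.0, e @ W + b)

v_cond = encode_conditions([1, 3, 0])   # conditions embedding, size d_c
```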

3.3.2 Dialogue Encoder

The dialogue in each sample is split into two parts: the dialogue history, which contains the first T − 1 utterances, and the T-th utterance u_T, which is the question that needs to be responded to. All tokens are concatenated, and an end-of-utterance delimiter is inserted between every two utterances. For each token w, the input embedding is the sum of its word embedding, position embedding, and turn embedding:

E(w) = E_word(w) + E_pos(w) + E_turn(w)

The dialogue encoder computes the word embedding E_word and the position embedding E_pos in the same way as the normal Transformer [40]. Tokens from the dialogue history share one turn embedding, and tokens from the question share another. The input embeddings are then forwarded through the dialogue encoder to obtain the hidden representations, which have the same length and dimension as the input. The dialogue encoder is composed of a stack of 6 transformer layers; readers can refer to Section 3.2 for details of the transformer layer.
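The input-embedding sum (word + position + turn) can be sketched as below, with a two-row turn-embedding table distinguishing history tokens from question tokens as described above. Vocabulary size and dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, d = 10, 8
word_emb = rng.normal(size=(vocab, d))
turn_emb = rng.normal(size=(2, d))   # row 0: history tokens, row 1: question tokens

def pos_enc(pos, d):
    # sinusoidal position encoding for a single position
    i = np.arange(0, d, 2)
    pe = np.empty(d)
    pe[0::2] = np.sin(pos / 10000 ** (i / d))
    pe[1::2] = np.cos(pos / 10000 ** (i / d))
    return pe

def input_embeddings(token_ids, turn_ids):
    # embedding of each token = word embedding + position encoding + turn embedding
    return np.stack([
        word_emb[t] + pos_enc(p, d) + turn_emb[s]
        for p, (t, s) in enumerate(zip(token_ids, turn_ids))
    ])

# two history tokens followed by two question tokens
X = input_embeddings([4, 2, 7, 1], [0, 0, 1, 1])
```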

3.3.3 Conditions Reviser

Figure 3: Details of dialogue encoder (left part) and conditions reviser (right part).

This component generates condition values based on the dialogue hidden representations (the hidden representations of the conversation history and the current question) and the original condition values. The condition reviser is also composed of a stack of 6 identical layers. Each layer has a self-attention sub-layer and a feedforward sub-layer, with an extra cross-attention sub-layer between them. The structure of each sub-layer is now described in detail. In the normal Transformer, the self-attention sub-layer in the decoder is a masked self-attention sub-layer that prevents positions from attending to subsequent positions. In contrast, the self-attention sub-layer in the condition reviser attends over all positions by removing the mask, because there is no temporal relationship between condition values and a condition may be correlated with other conditions. Inspired by the attention mechanism in sequence-to-sequence models such as [47], [2], [42], [19], the cross-attention sub-layer employs multi-head attention over all dialogue hidden representations. The cross-attention sub-layer follows the same structure as a normal self-attention sub-layer, except that the query vectors are transformed from the outputs of the previous self-attention sub-layer while the key and value vectors are transformed from the dialogue hidden representations. Finally, the outputs of the cross-attention sub-layer are fed to the feedforward sub-layer to obtain the final hidden representations of the layer. At the top of the condition reviser, a linear transformation and a softmax function convert each hidden representation into a predicted value distribution for the corresponding condition. In general, one condition value is revised by referring to the whole dialogue and all original conditions:

ĉ_i = f(H, c_1, …, c_m)

where H denotes the dialogue hidden representations, ĉ_i is the revised value of the i-th condition, and f is a nonlinear function.
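The cross-attention sub-layer can be sketched as below: queries come from the condition representations, while keys and values come from the dialogue hidden states, so every condition attends over the whole dialogue and all conditions are processed in parallel. Shapes and weights are toy values, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(C, H, Wq, Wk, Wv):
    # Queries from condition representations C (one row per condition);
    # keys/values from dialogue hidden representations H, so each condition
    # is revised by attending over the whole dialogue, in parallel.
    Q, K, V = C @ Wq, H @ Wk, H @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(2)
m, n, d = 3, 6, 8        # 3 conditions, 6 dialogue positions, model dim 8
C = rng.normal(size=(m, d))
H = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
R = cross_attention(C, H, Wq, Wk, Wv)   # one revised representation per condition
```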

3.3.4 Classifier

The classifier is constructed as a multilayer perceptron (MLP). The input of the MLP is the concatenation of the conditions embedding, the hidden representation of the first token of the conversation history, and the hidden representation of the first token of the question. The classifier outputs the predictive distribution over candidate responses. In the future, more elegant probabilistic classifiers, such as the probabilistic classification vector machine and its variants [9, 10, 33, 26], could be employed to produce truly probabilistic outputs. Another direction is to employ neural network ensemble algorithms [13, 38] for possibly better performance, whose probabilistic outputs could be achieved by incorporating Bayesian methods [8, 12].
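A minimal sketch of the classifier: concatenate the conditions embedding with the two first-token hidden states and apply a two-layer MLP with a softmax output. The hidden-layer size is illustrative; 35 matches the number of candidate replies in the dataset.

```python
import numpy as np

rng = np.random.default_rng(3)
d_c, d, n_answers = 6, 8, 35   # conditions emb size, token hidden size, candidates

W1 = rng.normal(size=(d_c + 2 * d, 16))
W2 = rng.normal(size=(16, n_answers))

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def classify(cond_emb, h_history, h_question):
    # Concatenate the conditions embedding with the hidden states of the
    # first token of the history and of the question, then apply a 2-layer MLP.
    x = np.concatenate([cond_emb, h_history, h_question])
    return softmax(np.maximum(0.0, x @ W1) @ W2)

p = classify(rng.normal(size=d_c), rng.normal(size=d), rng.normal(size=d))
```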

4 Experiments and Analysis

4.1 Dataset Description

Conditions | Possible values | Description
Shipped | Yes / No | Has this order been shipped?
Delivery status | Null / Normal / Delay / Deliver failed / Redelivery / Missing / Unknown | If the order has been shipped, the status of delivery.
Consignee's area | US / NG / GB / Other site | Area of the consignee.
Delivery service type | Null / Expedite service / Normal | Type of delivery service chosen by the buyer.
Stock information | In stock / Out of stock | Are the items in the order still available?
Return goods received | Null / Yes / No | Has the seller received the goods sent back by the buyer?
Refund processing status | Null / Unprocessed / Refunded | Has the buyer's refund application been approved?
Table 2: Order conditions and their possible values

The dialogs used in this paper were collected from an online customer service system for e-commerce provided by our collaborator. The final dataset includes 35,928 dialogs, randomly divided into training (25,150), validation (3,593), and testing (7,185) sets. Each dialog has seven order condition values associated with its order ID. The conditions and their possible values are shown in Table 2. “Null” means this condition has no value; for example, if an order has not been shipped, it naturally has no delivery status. “Unknown” means the value of this condition is missing for some reason. Every dialog has 35 candidate replies, and abbreviations are used to represent each candidate reply in the following parts. The distribution of candidate responses in this dataset is shown in Figure 4. The dataset statistics are shown in Table 3; one can see that the utterances in this dataset are relatively long.

Figure 4: Distribution of candidate responses.

Dataset | Average utterance length | Questions longer than 40 | Questions with defective condition values
Training set | 27.10 | 11.72% | 19.63%
Validation set | 27.35 | 11.82% | 20.78%
Testing set | 28.92 | 12.22% | 20.53%
Table 3: Dataset statistics

4.2 Metrics

4.2.1 Automatic Evaluations

Bilingual Evaluation Understudy (BLEU) [34] is widely used for the automatic evaluation of text generation systems. The BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores are taken as the automatic evaluation metrics for the answers given by different methods:

BLEU-n = BP · exp( (1/n) Σ_{i=1}^{n} log p_i )

where BP is the brevity penalty, which equals 1 if the total length of the resulting response is longer than that of the reference response and otherwise equals the ratio between the two lengths. p_i measures the overlap between the bag of i-grams appearing in the resulting sentences and that of the i-grams appearing in the reference sentences. The BLEU score lies between 0 and 1; the higher it is, the higher the precision of the n-grams. For classification models, answer selection accuracy is also used as an evaluation metric.
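For illustration, a simplified BLEU-n with clipped n-gram precision can be written in pure Python. This is a sketch, not the exact evaluation script used in the paper; in particular, the exponential brevity penalty below follows the standard BLEU definition, and the epsilon floor avoids log(0).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    # BLEU-n = BP * exp((1/n) * sum_i log p_i), with clipped n-gram precision p_i
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        log_p += math.log(max(overlap, 1e-9) / total)
    # brevity penalty: 1 if the candidate is at least as long as the reference
    bp = 1.0 if len(candidate) >= len(reference) \
        else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_p / max_n)

score = bleu("we will ship your order soon".split(),
             "we will ship your order soon".split())
```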

4.2.2 Human Evaluations

Five customer service staff members were asked to label whether a reply accords with the order conditions and to score their satisfaction with the reply. Whether a reply accords with the order conditions is assessed by comparing the reply with the ground-truth condition values, with two options (“Yes” and “No”). Reply satisfaction is an overall measurement reflecting the response's fluency, availability, and so on, with five possible ratings. 300 samples were chosen randomly from the test set for human assessment, and all categories of questions are guaranteed to appear in this set.

4.3 Baselines

The CAR-Transformer is compared with the following methods. Constant replies “Dear Customer, we have updated your order information to our fulfillment team.” to all questions. SC-LSTM [43] is a language generation model based on a semantically controlled Long Short-Term Memory structure. By incorporating dialogue act one-hot vectors into the original LSTM [20] cell, SC-LSTM enables the generator to output act-related text; the dialogue act one-hot vectors are replaced with order condition one-hot vectors to adapt it to the task of this paper. TA-seq2seq [48] utilizes topic information in chatbots through a joint attention mechanism; the topic words in [48] are replaced with our order conditions. Split Memory Network [27] is a modification of the Memory Network [44] that enables personalization; profile attributes and dialogue history are modeled in two separate memories, and the profile attributes are replaced with order conditions. BERT [16] is a Transformer-based pre-training model for language understanding; BERT summarizes the input question as a fixed-dimensional pooled representation, which is transformed into label probabilities. CA-Transformer is a degraded version of the CAR-Transformer without the conditions reviser component.

The maximum length of the input question is set to 50 and the maximum number of turns of history conversations is set to 2 for all models. The optimal setting of each model was selected by the BLEU scores on the validation set. The balance hyperparameter in the overall objective function is empirically set to 0.2. The dimensions of both the word embeddings and the conditions embedding are 300 for all models.

4.4 Results

Models | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Accuracy | Rate of accord with conditions | Satisfaction
Constant | 5.73% | 3.70% | 2.49% | 2.10% | 1.55% | 58.08% (0.26) | 1.38 (0.36)
SC-LSTM | 37.97% | 30.86% | 27.64% | 25.42% | - | 39.93% (0.21) | 1.26 (0.20)
TA-seq2seq | 58.95% | 54.02% | 51.64% | 49.49% | - | 71.33% (0.10) | 3.26 (0.19)
Split Memory Network | 52.55% | 51.59% | 50.88% | 50.69% | 50.60% | 69.13% (0.16) | 3.39 (0.26)
BERT | 67.70% | 64.67% | 62.64% | 61.32% | 59.29% | 59.34% (0.15) | 3.59 (0.31)
CA-Transformer | 75.23% | 74.30% | 73.45% | 73.24% | 73.15% | 75.27% (0.11) | 3.76 (0.07)
CAR-Transformer | 86.53% | 85.58% | 84.88% | 84.68% | 85.60% | 85.90% (0.09) | 4.20 (0.07)
Table 4: Overall comparison results of different models (BLEU and Accuracy are automatic evaluations; the last two columns are human evaluations)

The automatic and human evaluation results of the different models are reported in Table 4. For the human evaluations, the mean and standard error of the results given by the five staff members are reported. Based on the results, the CAR-Transformer delivers the best performance among all compared methods on all metrics. Furthermore, the experimental results reveal the following observations.

4.4.1 CAR-Transformer/CA-Transformer Can Better Leverage Conditions Information

Figure 5: Confusion matrix of CA-Transformer
Figure 6: Confusion matrix of CAR-Transformer
Figure 7: Confusion matrix of BERT
Figure 8: Confusion matrix of Split Memory Network

As shown in Table 4, the gap between the CA-Transformer and BERT demonstrates the importance of external knowledge; in other words, it is not enough to rely solely on a powerful language model. Moreover, the answers given by the CAR-Transformer and CA-Transformer have overwhelming advantages over the other methods in terms of the rate of accord with the order conditions, suggesting that they have a better capacity for leveraging the order conditions information. The CAR-Transformer utilizes conditions information in two components: the conditions encoder and the classifier. Although both the CA-Transformer and TA-seq2seq are sequential language models, the fact that the CA-Transformer performs better than TA-seq2seq suggests the superiority of the proposed conditions encoder. To further understand the utilization of conditions, the confusion matrices of the classification models are displayed in Figures 5-8. For questions that do not need conditions information to reply, such as questions whose answers are “Howtorequest” or “Invoiceleadtime”, BERT performs well. However, for most dialogues that need conditions information to reply, for example dialogues that consult about delivery, the models without knowledge awareness perform badly and tend to choose candidate answers with large numbers of samples, such as “Missing”, “Notracking”, and so on. Additionally, the Split Memory Network tends to overfit the conditions information. It can be observed from Figure 8 that all “Instock” samples are classified as “Wishtowait” and 75% of “when_instock” samples are classified as “Wishtochange”. Since training samples whose answers are “Instock”, “when_instock”, “Wishtowait”, or “Wishtochange” share the same condition value “out of stock”, their conditions embeddings may be very similar. The CAR-Transformer can usually strike a balance between the buyer's intent and the conditions information.
However, Figure 6 also indicates that the CAR-Transformer is confused by questions whose answers are “Notacceptedreturn”, “ShippingCharge1”, “ShippingCharge2”, and so on. This is apparently because these questions have few training samples.

4.4.2 The Effectiveness of Conditions Reviser

Question | Defective conditions | True conditions | Revised conditions | Prediction of CAR-Transformer | Prediction of CA-Transformer
Thank you for the update, I will wait for the new stock to come in | In stock | Out of stock | Out of stock | Wishtowait | Refund
Hi, could you please advise whether you have received my returned order? It's nearly two weeks since i posted it off to you. Thanks | Express missing, no return status | Express status is normal, return goods not received | Express status is normal, return goods not received | Plswait | Missing
Hi, I have 5 t-shirts on order and they haven't arrived as yet. Could you provide an update please. | Normal express, express status is redelivery | Expedite express, express status is redelivery | Expedite express, express status is redelivery | Redelivery | Notracking
Shirt arrived looks good, but too big as have another shirt similar. I need 17” collar, so XL would be better. Not opened package, so can I swap it for a smaller one? | Shipped: unknown, delivery status: unknown | Shipped: yes, delivery status: normal | Shipped: yes, delivery status: normal | Adviserefund | Plswait
Cancelled this order due to posting and packing charges when free delivery in UK was stated. I am disappointed that this order has been dispatched already | Shipped | Shipped | Shipped | Cannotbechanged | Cancellation
My daughter needs the t-shirt for her show next Saturday 24th November, could you please dispatch asap, I'd be very grateful | Shipped | Unshipped | Unshipped | Mark as urgent | Notracking
I really need the high visible vest for the 6th of October this month … if is possible it would it be great !! | Shipped | Shipped | Shipped | Faster1 | Notracking
Table 5: Some representative cases of the CAR-Transformer and CA-Transformer
Models | BLEU-1 | BLEU-2 | BLEU-3 | BLEU-4 | Accuracy | Rate of accord with conditions | Satisfaction
SC-LSTM | 36.57% | 29.33% | 26.56% | 24.32% | - | 38.85% (0.17) | 1.20 (0.30)
TA-seq2seq | 55.89% | 51.11% | 48.34% | 46.46% | - | 58.83% (0.12) | 3.39 (0.17)
Split Memory Network | 41.77% | 40.81% | 40.10% | 39.95% | 39.86% | 57.76% (0.14) | 3.26 (0.24)
CA-Transformer | 58.30% | 58.21% | 57.52% | 57.32% | 57.25% | 60.37% (0.12) | 3.55 (0.08)
CAR-Transformer | 64.93% | 64.24% | 64.02% | 63.68% | 63.60% | 68.82% (0.11) | 4.01 (0.10)
Table 6: Overall comparison results of condition-aware models on examples with defective conditions information
Figure 9: Principal component analysis of conditions embedding vectors

As can be observed in Table 4, the CAR-Transformer has substantial gains over the CA-Transformer. This suggests that defective conditions information may affect the model's accuracy and that the conditions reviser may reduce this impact. To provide some qualitative insights into the conditions reviser, some representative cases of the CAR-Transformer and CA-Transformer are displayed in Table 5. Note that the dialogue history and the condition values that are unrelated to the question are omitted owing to space limitations. The samples in rows 1 to 4 have wrong or unknown condition values, and only the CAR-Transformer predicts correctly, by revising the condition values. The samples in rows 5 to 7 have correct condition values, yet still only the CAR-Transformer predicts correctly. As observed from Table 5 and Figure 5, the CA-Transformer tends to weaken the role of the conditions information. A possible explanation is that the CA-Transformer pays less attention to the conditions embedding because of the existence of defective condition values. In contrast, the CAR-Transformer correctly revises 71.13% of the wrong or missing condition values in the test set, and thus it can make good use of the conditions information. Figure 9 shows a principal component analysis of three different types of conditions embedding vectors. The revised conditions embeddings tend toward the same distribution as the correct conditions embeddings. The results of the condition-aware models on examples with defective conditions are also provided in Table 6. The obtained results indicate that defective condition values can be revised by the proposed conditions reviser to a certain degree.

4.5 Supplementary experiment on personalized bAbI dialog dataset

Tasks Supervised Embeddings Memory Network CAR-Transformer
PT1 17.34% 20.01% 80.66%
PT2 11.07% 18.76% 85.34%
PT3 9.04% 44.16% 44.84%
PT4 4.47% 43.23% 50.27%
PT5 10.36% 29.57% 76.19%
Table 7: Test results across models and tasks on the personalized bAbI dialog dataset when user profiles are masked
Task Best Accuracy Number of utterances used
PT1 96.39% 2
PT2 96.53% 2
PT3 64.02% 10
PT4 95.66% 2
PT5 63.27% 14
Table 8: User profile inference results on the personalized bAbI dialog dataset

In this subsection, the proposed CAR-Transformer is extended to the personalized bAbI dialog dataset [27], a public multi-turn dialog corpus in a restaurant reservation scenario. It introduces four additional user profile attributes (gender, age, dietary preference, and favorite food), and the utterances are relevant to these profiles. The bot is required to select an appropriate response from a candidate set. Five separate tasks are introduced along with the dataset. Tasks 1 and 2 test the model’s ability to track dialog state indirectly. Tasks 3 and 4 check whether the model can sort and use facts about restaurants. Task 5 tests all of the above capabilities together. Tasks 1, 2, and 4 only provide the gender and age of users, while tasks 3 and 5 provide all attributes. More details of this dataset can be found in [27]. Instead of giving user profiles directly, we assume that all user profiles are unknown and use the proposed conditions reviser to infer them from the dialogue history. The main difference between our customer service dataset and the personalized bAbI dialog dataset is that the latter has no correlation between profile attributes, since they are randomly sampled from lists of possible values. Moreover, the user profiles “dietary preference” and “favorite food” in the personalized bAbI dialog dataset are provided for the bot to choose between restaurants, rather than being directly associated with the dialogue content.
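The experimental setup above, where all user profiles are treated as unknown, can be sketched as a preprocessing step. This is a hypothetical illustration: the field names (`history`, `profile`) and the placeholder token are assumptions, not the dataset's actual schema.

```python
UNKNOWN = "<unk>"  # hypothetical placeholder for a masked attribute
PROFILE_KEYS = ("gender", "age", "dietary_preference", "favorite_food")

def mask_profiles(example):
    """Return a copy of a dialog example whose profile attributes are all
    replaced by an unknown token, so the conditions reviser must infer
    them from the dialogue history alone."""
    masked = dict(example)
    masked["profile"] = {k: UNKNOWN for k in PROFILE_KEYS}
    return masked

example = {
    "history": ["good morning", "hello what can i help you with today"],
    "profile": {"gender": "male", "age": "middle-aged",
                "dietary_preference": "non-veg", "favorite_food": "fish"},
}
masked = mask_profiles(example)
print(masked["profile"]["gender"])  # <unk>
```

The original example is left untouched, so the gold profile values remain available for scoring the reviser's predictions.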

Given the above factors, the personalized bAbI dialog dataset is used only as a supplementary experiment. CAR-Transformer is compared with supervised embeddings and Memory Network [44], as in [27]. For CAR-Transformer, the maximum length of an input utterance is set to 50 and only the last two turns of the conversation history are considered. The hyperparameters of the models were selected on the validation sets.

Per-response accuracy (the percentage of responses for which the correct one is chosen out of all candidates) for all models is reported in Table 7. Table 8 shows the best profile prediction accuracy (the percentage of users for whom all profile attributes are inferred correctly) of the conditions reviser on each task, together with the minimum number of utterances required. The performance of CAR-Transformer is significantly higher than that of the other models, which indicates that CAR-Transformer is able to infer and leverage user profiles to a certain extent. As can be observed in Table 8, gender and age can easily be inferred by the conditions reviser from the style of the language; dietary preference and favorite food, however, are harder and need more conversation history to infer, because of the diversity of choices (there are 2 types of dietary preference and 14 types of favorite food) and the implicit correspondence between utterances and attributes.
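The two evaluation metrics defined above can be sketched directly from their definitions. This is an illustrative implementation, not the paper's evaluation code; candidate IDs and profile tuples are placeholder inputs.

```python
def per_response_accuracy(predicted_ids, gold_ids):
    """Fraction of turns where the selected candidate matches the gold response."""
    correct = sum(p == g for p, g in zip(predicted_ids, gold_ids))
    return correct / len(gold_ids)

def profile_accuracy(predicted_profiles, gold_profiles):
    """Fraction of users for whom ALL profile attributes are inferred correctly."""
    correct = sum(p == g for p, g in zip(predicted_profiles, gold_profiles))
    return correct / len(gold_profiles)

print(per_response_accuracy([3, 1, 2, 0], [3, 1, 0, 0]))  # 0.75
print(profile_accuracy([("male", "young"), ("female", "elderly")],
                       [("male", "young"), ("female", "middle-aged")]))  # 0.5
```

Note that profile accuracy is all-or-nothing per user: a single wrong attribute (e.g. favorite food out of 14 choices) counts the whole profile as incorrect, which is consistent with the strictness of the numbers reported in Table 8.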

5 Conclusion

In this paper, CAR-Transformer is proposed to select an appropriate answer that is compatible with both the user intent and the conditions of the question. Specifically, this paper considers the more general and realistic situation where condition values are wrong or incomplete. The proposed conditions reviser can revise wrong or incomplete condition values without knowing beforehand which ones are defective. We perform extensive experimental evaluations of the proposed approach on a real-world dataset and extend CAR-Transformer to infer user profiles on the personalized bAbI dialog dataset. The experimental results show the effectiveness of the proposed CAR-Transformer. In future work, explicit knowledge will be investigated and incorporated into the learning model for more effective knowledge correction [46, 45].


  • [1] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.2.
  • [2] D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In Proc. ICLR, Cited by: §2.1, §3.3.3.
  • [3] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proc. SIGMOD, pp. 1247–1250. Cited by: §1.
  • [4] A. Bordes, S. Chopra, and J. Weston (2014) Question answering with subgraph embeddings. In Proc. EMNLP, pp. 615–620. Cited by: §1.
  • [5] H. Chen, F. Tang, P. Tino, A. G. Cohn, and X. Yao (2015) Model metric co-learning for time series classification. In Proc. IJCAI. Cited by: §2.1.
  • [6] H. Chen, F. Tang, P. Tino, and X. Yao (2013) Model-based kernel for efficient time series analysis. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 392–400. Cited by: §2.1.
  • [7] H. Chen, P. Tiño, A. Rodan, and X. Yao (2014) Learning in the model space for cognitive fault diagnosis. IEEE Transactions Neural Networks Learning System 25 (1), pp. 124–136. Cited by: §2.1.
  • [8] H. Chen, P. Tiňo, and X. Yao (2009) Predictive ensemble pruning by expectation propagation. IEEE Transactions on Knowledge and Data Engineering 21 (7), pp. 999–1013. Cited by: §3.3.4.
  • [9] H. Chen, P. Tino, and X. Yao (2009) Probabilistic classification vector machines. IEEE Transactions on Neural Networks 20 (6), pp. 901–914. Cited by: §3.3.4.
  • [10] H. Chen, P. Tiňo, and X. Yao (2013) Efficient probabilistic classification vector machine with incremental basis function selection. IEEE transactions on neural networks and learning systems 25 (2), pp. 356–369. Cited by: §3.3.4.
  • [11] H. Chen, P. Tiňo, and X. Yao (2014) Cognitive fault diagnosis in tennessee eastman process using learning in the model space. Computers & chemical engineering 67, pp. 33–42. Cited by: §2.1.
  • [12] H. Chen and X. Yao (2009) Regularized negative correlation learning for neural network ensembles. IEEE Transactions on Neural Networks 20 (12), pp. 1962–1979. Cited by: §3.3.4.
  • [13] H. Chen and X. Yao (2010) Multiobjective neural network ensembles based on regularized negative correlation learning. IEEE Transactions on Knowledge and Data Engineering 22 (12), pp. 1738–1751. Cited by: §3.3.4.
  • [14] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proc. EMNLP, pp. 1724–1734. Cited by: §2.1.
  • [15] S. Choudhary, P. Srivastava, L. Ungar, and J. Sedoc (2017) Domain aware neural dialog system. arXiv preprint arXiv:1708.00897. Cited by: §2.2.
  • [16] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL, pp. 4171–4186. Cited by: §4.3.
  • [17] L. Dong, F. Wei, M. Zhou, and K. Xu (2015) Question answering over freebase with multi-column convolutional neural networks. In Proc. ACL, pp. 260–269. Cited by: §1.
  • [18] J. Gao, M. Galley, and L. Li (2019) Neural approaches to conversational ai: question answering, task-oriented dialogues and social chatbots. Now Foundations and Trends. Cited by: §1.
  • [19] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proc. ICML, pp. 1243–1252. Cited by: §2.1, §3.3.3.
  • [20] F. A. Gers, J. Schmidhuber, and F. Cummins (2000) Learning to forget: continual prediction with lstm. Neural Computation 12 (10), pp. 2451–2471. Cited by: §4.3.
  • [21] Z. Gong, H. Chen, B. Yuan, and X. Yao (2018) Multiobjective learning in the model space for time series classification. IEEE Transactions on Cybernetics 49 (3), pp. 918–932. Cited by: §2.1.
  • [22] Z. Gong and H. Chen (2016) Model-based oversampling for imbalanced sequence classification. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 1009–1018. Cited by: §2.1.
  • [23] Z. Gong and H. Chen (2018) Sequential data classification by dynamic state warping. Knowledge and Information Systems 57 (3), pp. 545–570. Cited by: §2.1.
  • [24] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proc. CVPR, pp. 770–778. Cited by: §3.2.
  • [25] J. Herzig, M. Shmueli-Scheuer, T. Sandbank, and D. Konopnicki (2017) Neural response generation for customer service based on personality traits. In Proc. INLG, pp. 252–256. Cited by: §2.2.
  • [26] B. Jiang, H. Chen, B. Yuan, and X. Yao (2017) Scalable graph-based semi-supervised learning through sparse bayesian model. IEEE Transactions on Knowledge and Data Engineering 29 (12), pp. 2758–2771. Cited by: §3.3.4.
  • [27] C. K. Joshi, F. Mi, and B. Faltings (2017) Personalization in goal-oriented dialog. arXiv preprint arXiv:1706.07503. Cited by: §2.2, §4.3, §4.5, §4.5.
  • [28] Y. Kim, C. Denton, L. Hoang, and A. M. Rush (2017) Structured attention networks. Cited by: §2.1.
  • [29] J. Li, M. Galley, C. Brockett, G. Spithourakis, J. Gao, and B. Dolan (2016) A persona-based neural conversation model. In Proc. ACL, pp. 994–1003. Cited by: §2.2.
  • [30] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio (2017) A structured self-attentive sentence embedding. In Proc. ICLR, Cited by: §2.1.
  • [31] X. Liu, Y. Shen, K. Duh, and J. Gao (2018) Stochastic answer networks for machine reading comprehension. In Proc. ACL, pp. 1694–1704. Cited by: §1.
  • [32] L. Luo, W. Huang, Q. Zeng, Z. Nie, and X. Sun (2019) Learning personalized end-to-end goal-oriented dialog. In Proc. AAAI, Vol. 33, pp. 6794–6801. Cited by: §2.2.
  • [33] S. Lyu, X. Tian, Y. Li, B. Jiang, and H. Chen (2019) Multiclass probabilistic classification vector machine. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §3.3.4.
  • [34] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proc. ACL, pp. 311–318. Cited by: §4.2.1.
  • [35] A. Parikh, O. Täckström, D. Das, and J. Uszkoreit (2016) A decomposable attention model for natural language inference. In Proc. EMNLP, pp. 2249–2255. Cited by: §2.1.
  • [36] Q. Qian, M. Huang, H. Zhao, J. Xu, and X. Zhu (2018) Assigning personality/identity to a chatting machine for coherent conversation generation. In Proc. IJCAI, pp. 4279–4285. Cited by: §2.2.
  • [37] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi (2017) Bidirectional attention flow for machine comprehension. In Proc. ICLR, Cited by: §1.
  • [38] R. G. F. Soares, H. Chen, and X. Yao (2012) Semisupervised classification with cluster regularization. IEEE Transactions Neural Networks and Learning Systems 23 (11), pp. 1779–1792. Cited by: §3.3.4.
  • [39] I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Proc. NIPS, pp. 3104–3112. Cited by: §2.1.
  • [40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proc. NIPS, pp. 5998–6008. Cited by: §1, §2.1, §3.2, §3.3.2.
  • [41] Y. Wang, R. Zhang, C. Xu, and Y. Mao (2018) The apva-turbo approach to question answering in knowledge base. In Proc. COLING, pp. 1998–2009. Cited by: §1.
  • [42] Z. Wang, W. He, H. Wu, H. Wu, W. Li, H. Wang, and E. Chen (2016) Chinese poetry generation with planning based neural network. In Proc. COLING, pp. 1051–1060. Cited by: §3.3.3.
  • [43] T. Wen, M. Gasic, N. Mrkšić, P. Su, D. Vandyke, and S. Young (2015) Semantically conditioned lstm-based natural language generation for spoken dialogue systems. In Proc. EMNLP, pp. 1711–1721. Cited by: §4.3.
  • [44] J. Weston, S. Chopra, and A. Bordes (2015) Memory networks. In Proc. ICLR, Cited by: §4.3, §4.5.
  • [45] X. Wu, H. Chen, J. Liu, G. Wu, R. Lu, and N. Zheng (2017) Knowledge engineering with big data (bigke): a 54-month, 45-million rmb, 15-institution national grand project. IEEE Access 5, pp. 12696–12701. Cited by: §5.
  • [46] X. Wu, H. Chen, G. Wu, J. Liu, Q. Zheng, X. He, A. Zhou, Z. Zhao, B. Wei, M. Gao, et al. (2015) Knowledge engineering with big data. IEEE Intelligent Systems 30 (5), pp. 46–55. Cited by: §5.
  • [47] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §3.3.3.
  • [48] C. Xing, W. Wu, Y. Wu, J. Liu, Y. Huang, M. Zhou, and W. Ma (2017) Topic aware neural response generation. In Proc. AAAI, pp. 3351–3357. Cited by: §2.2, §4.3.
  • [49] B. Yang, W. Yih, X. He, J. Gao, and L. Deng (2014) Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575. Cited by: §1.
  • [50] M. Yang, Z. Zhao, W. Zhao, X. Chen, J. Zhu, L. Zhou, and Z. Cao (2017) Personalized response generation via domain adaptation. In Proc. SIGIR, pp. 1021–1024. Cited by: §2.2.
  • [51] W. Yin, M. Yu, B. Xiang, B. Zhou, and H. Schütze (2016) Simple question answering by attentive convolutional neural network. In Proc. COLING, pp. 1746–1756. Cited by: §1.
  • [52] M. Yu, W. Yin, K. S. Hasan, C. dos Santos, B. Xiang, and B. Zhou (2017) Improved neural relation detection for knowledge base question answering. In Proc. ACL, pp. 571–581. Cited by: §1.
  • [53] W. Zhang, Q. Zhu, Y. Wang, Y. Zhao, and T. Liu (2019) Neural personalized response generation as domain adaptation. World Wide Web 22 (4), pp. 1427–1446. Cited by: §2.2.