Question answering (QA), which aims to provide concise, direct answers to user queries based on abundant external information, has recently become a major focus of natural language processing (NLP) research. According to the type of external information, QA tasks can be roughly divided into two groups: text-QA and knowledge-based QA (KBQA). The former utilizes information in the form of text, such as web documents, while the latter utilizes knowledge from structured knowledge bases (KBs) such as Freebase, sales and marketing databases, and so on. Recently, neural approaches have been exploited to represent questions and external information as continuous vectors so that inference can be performed in a hidden semantic space [4, 49, 17, 51, 52, 41, 37, 31]. Neural approaches have an advantage in semantic understanding, since various expressions with similar semantic meaning have similar continuous representations in the hidden semantic space.
|Question||Conditions||Response|
|I need this item by Friday morning. Please could you send it ASAP?||Unshipped, expedite delivery||We have now marked your order as URGENT ORDER in our system and we will try our best to deliver it to you on time.|
|I need this item by Friday morning. Please could you send it ASAP?||Shipped, normal delivery||We will try our best but we are not able to guarantee.|
|Thank you for the update, yes I will wait for the new stock to come in.||Instock||Dear Customer, many thanks for your reply. We shall ship your order as soon as the new stock arrives.|
|Shirt arrived looks good, but too big as have another shirt similar. I need 17” collar, so XL would be better. Not opened package, so can I swap it for a smaller one?||Shipped: unknown, delivery status: unknown||Dear customer, to receive the order in time, we suggest you return the existing order for a refund and place a new order for the right size so that you don’t have to wait for the return to be processed. Please request your return here: <link>. We shall process your return request within two working days.|
This work attempts to alleviate the impact of defective conditions corresponding to the question and to provide a valid answer that is compatible with both the user intent and the conditions. The conditions can be the delivery status, delivery service type, or stock information in e-commerce, or user profile attributes in customer service. For clarity, Table 1 shows several representative examples of the proposed model for customer service in e-commerce; the conditions in this application are order statuses. The first question in Table 1 has different answers corresponding to different conditions. The second question has the wrong condition value “Instock” and the third question has unknown condition values, yet our model still gives appropriate responses. In general, there are two main challenges in our task.
How to incorporate conditions into an end-to-end model. Typical KBQA systems leverage knowledge in an explicit way: the answer is what is retrieved from the knowledge base. In this study, however, the conditions are used to help answer selection instead of being given to the user directly.
How to deal with missing or wrong conditions. Although defective condition values exist, it is not worthwhile to abandon the conditioning information completely. The model is expected not only to utilize the conditions but also to reduce the effect of inferior condition values. Note that we do not know in advance which condition of which sample is wrong.
The Transformer is an encoder-decoder framework that adopts a self-attention mechanism to encode text instead of an RNN or CNN structure. Since the self-attention mechanism captures long-distance dependencies better, the Transformer is chosen as our baseline model. CAR-Transformer consists of four parts: a conditions encoder, a conditions reviser, a dialogue encoder and a classifier. We model the problem of conditions inference as a sequence generation problem and adapt the Transformer architecture into the dialogue encoder and the conditions reviser. The dialogue encoder transforms the conversation history and question into hidden representations. Then the conditions reviser generates revised conditions based on all original condition values and the dialogue representations. After that, each revised condition value is discretized and represented as a one-hot vector. The one-hot vectors are fed into the conditions encoder to obtain the conditions embedding. Finally, the concatenation of the conditions embedding and the dialogue embedding is fed into the classifier to make a prediction over candidate responses. To capture the sophisticated interactions between features, the proposed conditions encoder and classifier adopt multi-layer neural networks to take advantage of the strong representation and generalization ability of deep learning. Experiments on a real-world dataset in customer service for e-commerce show that the CAR-Transformer can revise missing or wrong condition values to a certain degree and choose answers that accord with the conditions. Automatic and human evaluations demonstrate that CAR-Transformer achieves significant improvement compared to all baseline methods. We also conduct experiments on the personalized bAbI dataset to further verify the effectiveness of the CAR-Transformer. In summary, this paper makes the following contributions.
It proposes CAR-Transformer, a highly effective Transformer-based conversational QA system. By revising and integrating the condition values of a question, CAR-Transformer outperforms several strong baselines.
By proposing the conditions reviser, CAR-Transformer is capable of revising wrong or incomplete condition values of a question, which makes it more robust and practical in real-world applications compared to other QA systems. To our knowledge, this work is the first to discuss the treatment of defective external information in QA research.
This paper explores how to represent and integrate categorical information into the Transformer framework, so that the prediction is compatible with both the categorical attributes and the context.
The remainder of this paper is organized as follows. A review of related work is provided in Section 2. Section 3 formulates the problem to be solved and presents CAR-Transformer. Section 4 introduces the dataset used in this paper and presents extensive experiments. Finally, conclusions and prospects for future work are provided in Section 5.
2 Related Work
This section reviews the related work on sequence-to-sequence models and personalized dialog systems, which inspire the CAR-Transformer.
2.1 Sequence-to-Sequence Models
The sequence-to-sequence (Seq2Seq) model was first proposed for the machine translation task and is also widely used in sequence generation tasks. Given a source sequence $x = (x_1, \dots, x_n)$ and a target sequence $y = (y_1, \dots, y_m)$, the model maximizes the conditional probability $p(y \mid x) = \prod_{t=1}^{m} p(y_t \mid y_{<t}, x)$. The Seq2Seq model has an encoder-decoder structure: the encoder summarizes a variable-length input sentence into a fixed-size vector representation, and the decoder generates the output sequence token by token based on the representation from the encoder and its previous outputs. The encoder and decoder can be instantiated by RNNs, CNNs, the Transformer and so on. RNNs are able to process temporal sequences; advanced temporal analysis models include learning in the model space [7, 11, 6] and its variants [22, 5, 23, 21]. However, unlike RNNs, Transformers do not require that the sequence be processed in order. In particular, the Transformer employs only a self-attention mechanism, which computes a weighted sum using dot-products between elements of the input sequence,
and achieves state-of-the-art results on many natural language generation tasks. The difference between CAR-Transformer and prior work lies in the fact that there is no temporal relationship between the order condition values, which form the target sequence in this study. Therefore, the conditions reviser utilizes all original conditions when revising one condition value, and the revised condition values can be generated in parallel, whereas most Seq2Seq-based models generate words one by one by consuming the previously generated words as input.
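The contrast between autoregressive decoding and the parallel generation used for condition values can be sketched as follows; the toy model, shapes, and random weights are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

vocab, hidden, steps = 5, 8, 3
W_out = rng.normal(size=(hidden, vocab))   # shared output projection
W_tok = rng.normal(size=(vocab, hidden))   # toy token feedback weights

# Autoregressive (standard Seq2Seq): each step consumes the token
# generated at the previous step, so decoding is inherently sequential.
state = rng.normal(size=hidden)
autoregressive = []
for _ in range(steps):
    probs = softmax(state @ W_out)
    tok = int(probs.argmax())
    autoregressive.append(tok)
    state = np.tanh(state + W_tok[tok])    # feed the generated token back in

# Parallel (no temporal order between outputs): every position is predicted
# from its own hidden state in one shot, as the conditions reviser does.
states = rng.normal(size=(steps, hidden))
parallel = softmax(states @ W_out).argmax(axis=-1).tolist()
```

The parallel path needs no loop because no output depends on a previously generated output.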
2.2 Attributes-aware Dialog Systems
Learning the inherent attributes of dialogues explicitly is one way to improve the diversity and effectiveness of dialogue systems. Among the various attributes, topic and personality are the most widely studied. Xing et al. use the Twitter LDA model to obtain the topic of a dialog and then feed the topic and input representations into a joint attention module to generate a topic-related response. Choudhary et al. divide each utterance in the dialogue into different domains and generate the domain and content of the next utterance accordingly. However, these studies require an extra component to infer the topic and suffer from errors in topic inference. For personalized dialog systems, Li et al. and Herzig et al. project personal information into an embedding space and then input the embedding vector into the decoder. Qian et al. use an extra component to learn when to employ the user profile. Yang et al. and Zhang et al. attempt to introduce personalized information into dialogs by transfer learning. These methods are all LSTM-based, which puts them at a disadvantage when dealing with long sequences. To better capture long-term dependencies, the proposed model is Transformer-based. Besides, Joshi et al. and Luo et al. use an end-to-end memory network to select a response from candidate sets. It is worth noting that none of the foregoing methods take into account whether flaws exist in the personal information.
3 The Proposed Method
This section first formulates the problem investigated in this paper. Then Section 3.2 introduces the structure of the transformer layer. Finally, a detailed description of CAR-Transformer is provided in Section 3.3.
3.1 Problem Formulation
Formally, let $(D, C, r)$ denote each training sample, where $D = \{u_1, \dots, u_n\}$ denotes a conversation with $n$ utterances. The last utterance $u_n$ is the question that needs to be responded to. $C = \{c_1, \dots, c_m\}$ denotes the condition values and $r$ denotes a candidate response. The goal of this work is to learn a classification model that selects an answer from the candidate set for each dialogue-conditions pair $(D, C)$ by computing the probability distribution over candidate answers.
3.2 Transformer Layer
The Transformer is composed of a stack of identical layers, called transformer layers. Each transformer layer consists of a self-attention sub-layer followed by a feedforward sub-layer. Each sub-layer employs a residual connection followed by layer normalization. Let $X \in \mathbb{R}^{n \times d}$ denote the input of a transformer layer, where $X$ is a sequence of $n$ $d$-dimensional vectors. The self-attention sub-layer first transforms $X$ into queries $Q = X W^Q$, keys $K = X W^K$ and values $V = X W^V$, where $W^Q$, $W^K$, $W^V$ are trainable matrices. Each query, key, and value matrix can be split into $h$ parts called attention heads, indexed by $i$, each with dimension $d_k = d / h$. This multi-head attention mechanism allows the model to focus on different parts of the input sequence. The output of each head is then calculated as:

$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$
The outputs of all heads are concatenated and linearly transformed into an $n \times d$ matrix before being passed to the feedforward sub-layer:

$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$
Since the self-attention mechanism has no recurrent operation like an RNN, the Transformer keeps track of the absolute or relative position of a word by adding position encodings to the input embeddings of the bottom transformer layer. The position encoding is formulated as:

$$PE_{(pos,\,2i)} = \sin\!\left(pos / 10000^{2i/d}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(pos / 10000^{2i/d}\right)$$

where $pos$ is the position of a word and $i$ is the index of the input dimension.
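The attention head and sinusoidal position encoding above can be illustrated with a minimal NumPy sketch; the dimensions and random weights are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """Scaled dot-product attention for one head over input X (n x d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    dk = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(dk))   # n x n attention matrix
    return weights @ V, weights

def positional_encoding(n, d):
    """Sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d)) + positional_encoding(n, d)   # add positions
out, w = attention_head(X,
                        rng.normal(size=(d, d)),
                        rng.normal(size=(d, d)),
                        rng.normal(size=(d, d)))
```

Each row of `w` is a distribution over input positions, so it sums to 1.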
3.3 CAR-Transformer
An overview of CAR-Transformer is shown in Figure 1. CAR-Transformer first revises the conditions of the question according to the original condition values and the dialogue context. Then CAR-Transformer selects an answer in line with the dialogue content and the revised conditions. CAR-Transformer consists of four components:
conditions encoder encodes the conditions into a vector representation;
dialogue encoder summarizes the whole conversation into a sequence of vector representations;
conditions reviser generates revised condition values based on the original condition values and the dialogue representations;
classifier selects a candidate response based on the revised conditions embedding and the dialogue representations.
Since CAR-Transformer has two outputs, the revised condition values and the predicted response label, the overall objective function is defined as follows:

$$\mathcal{L} = \mathcal{L}_{c} + \lambda\, \mathcal{L}_{a}$$

where $\mathcal{L}_{c}$ denotes the sum of the cross-entropy losses of the individual condition values, $\mathcal{L}_{a}$ denotes the cross entropy between the predictive distribution and the true candidate response label, and $\lambda$ is a scalar that balances the two terms.
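A minimal sketch of this joint objective follows; the probability values are fabricated for illustration, and which of the two terms the balancing scalar multiplies is an assumption here.

```python
import numpy as np

def cross_entropy(probs, label):
    """Cross-entropy of a single predicted distribution against a true label."""
    return -np.log(probs[label])

# Assumed model outputs (illustrative values only).
cond_probs = [np.array([0.7, 0.2, 0.1]),     # condition 1: 3 possible values
              np.array([0.1, 0.9])]           # condition 2: 2 possible values
cond_labels = [0, 1]                          # true condition values
answer_probs = np.array([0.05, 0.8, 0.15])    # distribution over 3 candidates
answer_label = 1                              # true candidate response

lam = 0.2   # balancing scalar (0.2 in the paper's experiments)
L_cond = sum(cross_entropy(p, y) for p, y in zip(cond_probs, cond_labels))
L_answer = cross_entropy(answer_probs, answer_label)
loss = L_cond + lam * L_answer                # overall objective
```

The condition term sums one cross-entropy per condition, since each condition has its own softmax output.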
Figure 2 highlights the structure of conditions encoder. Figure 3 illustrates the structure of dialogue encoder and conditions reviser. Details of each component will be described in the following subsections.
3.3.1 Conditions Encoder
The objective of this component is to represent the conditions as a vector and to capture the interactions between conditions. Each condition value $c_i$ is first converted into a one-hot vector and then compressed into an embedding space. The embeddings are concatenated as the output of the embedding layer:

$$e = [e_1; e_2; \dots; e_m]$$

where $e_i$ is the embedding of the $i$-th condition. Although the lengths of the one-hot condition vectors can differ, the embeddings of the different conditions are all of the same size $d_e$, so the size of $e$ is $m \times d_e$. Finally, $e$ is fed into a fully connected layer:

$$v_c = \sigma(W e + b)$$

where $v_c$ is the final representation of the conditions, $\sigma$ is an activation function, and $W$ and $b$ are the weight and bias of the fully connected layer, respectively. The size of $v_c$ is a hyperparameter that determines the dimension of the conditions embedding.
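The conditions encoder can be sketched as follows; the two condition vocabulary sizes echo Table 2 (Shipped has 2 values, Delivery status has 7), while the embedding sizes and weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_hot(index, size):
    v = np.zeros(size)
    v[index] = 1.0
    return v

cond_sizes = [2, 7]      # values per condition, e.g. Shipped, Delivery status
d_e = 4                  # shared embedding size per condition
emb_tables = [rng.normal(size=(s, d_e)) for s in cond_sizes]

values = [0, 3]          # current condition values, as indices
embs = [one_hot(v, s) @ E for v, s, E in zip(values, cond_sizes, emb_tables)]
e = np.concatenate(embs)                 # embedding-layer output, size m * d_e

d_c = 6                  # conditions-embedding dimension (a hyperparameter)
W, b = rng.normal(size=(e.size, d_c)), np.zeros(d_c)
v_c = np.tanh(e @ W + b)                 # final conditions representation
```

The fully connected layer lets embeddings of different conditions interact, which a plain concatenation would not.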
3.3.2 Dialogue Encoder
The dialogue in each sample is split into two parts: the dialogue history $H$, which contains the first $n-1$ utterances, and the $n$-th utterance $u_n$, which is the question that needs to be responded to. All tokens in $H$ and $u_n$ are concatenated, and an end-of-utterance delimiter is inserted between every two utterances. For each token, the input embedding is the sum of its word embedding, position embedding and turn embedding:

$$E = E_{word} + E_{pos} + E_{turn}$$

The dialogue encoder computes the word embedding $E_{word}$ and position embedding $E_{pos}$ in the same way as the standard Transformer. Tokens from $H$ share one turn embedding and tokens from $u_n$ share another. The input embeddings are then forwarded into the dialogue encoder to obtain the hidden representation, which has the same length and dimension as the input. The dialogue encoder is composed of a stack of 6 transformer layers; readers can refer to Section 3.2 for details of the transformer layer.
3.3.3 Conditions Reviser
This component generates condition values based on the dialogue hidden representations (hidden representations of the conversation history and the current question) and the original condition values. The conditions reviser is also composed of a stack of 6 identical layers. Each layer has a self-attention sub-layer, a feedforward sub-layer, and an extra cross-attention sub-layer between them. In the standard Transformer, the self-attention sub-layer in the decoder is modified into a masked self-attention sub-layer to prevent positions from attending to subsequent positions. Conversely, the self-attention sub-layer in the conditions reviser attends over all positions by removing the mask encoding. This is because there is no temporal relationship between condition values, and a condition may be correlated with other conditions. Inspired by the attention mechanism in sequence-to-sequence models, the cross-attention sub-layer employs multi-head attention over all dialogue hidden representations. The cross-attention sub-layer follows the same structure as the normal self-attention sub-layer: the query vectors are transformed from the outputs of the previous self-attention sub-layer, while the key and value vectors are transformed from the dialogue hidden representations. Finally, the outputs of the cross-attention sub-layer are fed to the feedforward sub-layer to obtain the final hidden representation vectors of the layer. At the top of the conditions reviser, a linear transformation and softmax function are applied to convert each hidden representation into a predicted value distribution for the corresponding condition. In general, one condition value is revised by referring to the whole dialogue and all original conditions:

$$\hat{c}_i = f(H_d, c_1, \dots, c_m)$$

where $H_d$ denotes the dialogue hidden representations, $\hat{c}_i$ is the revised value of the $i$-th condition, and $f$ is a nonlinear function.
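A single-head sketch of the cross-attention step, with queries from the condition states and keys/values from the dialogue representations, might look as follows; all shapes, weights, and condition vocabulary sizes are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 8
n_tokens, n_conds = 6, 3
H = rng.normal(size=(n_tokens, d))   # dialogue hidden representations
C = rng.normal(size=(n_conds, d))    # condition states after (unmasked) self-attention

# Cross-attention: queries from condition states, keys/values from dialogue.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = C @ Wq, H @ Wk, H @ Wv
attended = softmax(Q @ K.T / np.sqrt(d)) @ V      # n_conds x d

# Per-condition output heads: a softmax over that condition's possible values.
cond_sizes = [2, 7, 4]
revised = [softmax(attended[i] @ rng.normal(size=(d, s)))
           for i, s in enumerate(cond_sizes)]
```

Because the self-attention over condition states is unmasked, all revised distributions are produced in one pass rather than sequentially.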
3.3.4 Classifier
The classifier is constructed as a multilayer perceptron (MLP). The input of the MLP is the concatenation of the conditions embedding, the hidden representation of the first token of the conversation history, and the hidden representation of the first token of the question. The classifier outputs the predictive distribution over candidate responses. In the future, more elegant probabilistic classifiers, such as the probabilistic classification vector machine and its variants [9, 10, 33, 26], could be employed to produce real probabilistic outputs. Another direction is to employ neural network ensemble algorithms [13, 38] for possibly better performance, whose probabilistic outputs could be achieved by incorporating Bayesian methods [8, 12].
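A minimal sketch of such a classifier is given below; the hidden sizes and weights are illustrative assumptions, with 35 outputs matching the number of candidate replies in the dataset.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, d_c, n_answers = 8, 6, 35

v_c = rng.normal(size=d_c)       # conditions embedding
h_hist = rng.normal(size=d)      # first-token representation of the history
h_q = rng.normal(size=d)         # first-token representation of the question

x = np.concatenate([v_c, h_hist, h_q])            # MLP input
W1, b1 = rng.normal(size=(x.size, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, n_answers)), np.zeros(n_answers)
hidden = np.maximum(x @ W1 + b1, 0)               # ReLU hidden layer
probs = softmax(hidden @ W2 + b2)                 # distribution over replies
```

The predicted reply is simply the argmax of `probs`.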
4 Experiments and Analysis
4.1 Dataset Description
|Condition||Possible values||Description|
|Shipped||Yes/No||Has this order been shipped?|
|Delivery status||Null/Normal/Delay/Deliver failed/Redelivery/Missing/Unknown||If the order has been shipped, the status of delivery.|
|Consignee’s area||US/NG/GB/Other site||Area of the consignee.|
|Delivery service type||Null/Expedite service/Normal||Type of delivery service chosen by the buyer.|
|Stock information||In stock/Out of stock||Are the items in the order still available?|
|Return goods received||Null/Yes/No||Has the seller received the goods sent back by the buyer?|
|Refund processing status||Null/Unprocessed/Refunded||Whether the buyer’s refund application has been approved.|
The dialogs used in this section were collected from an online customer service system for e-commerce provided by our collaborator. The final dataset includes 35928 dialogs, randomly divided into training (25150), validation (3593) and testing (7185) sets. Each dialog has seven order condition values associated with its order ID. The conditions and their possible values are shown in Table 2. “Null” means the absence of a value for this condition; for example, if an order has not been shipped, it naturally has no delivery status. “Unknown” means that the value of this condition is missing for some reason. Every dialog has 35 candidate replies, and abbreviations are used to represent each candidate reply in the following parts. The distribution of candidate responses in this dataset is shown in Figure 4. The statistics of the dataset are shown in Table 3; one can see that the utterances in this dataset are relatively long.
|Dataset||Average utterance length||Questions longer than 40||Questions with defective condition values|
4.2 Evaluation Metrics
4.2.1 Automatic Evaluations
Bilingual Evaluation Understudy (BLEU) is widely used as an automatic evaluation of text generation systems. BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores are taken as the automatic evaluation metrics for the answers given by the different methods:

$$\mathrm{BLEU}\text{-}n = \mathrm{BP} \cdot \exp\!\left(\sum_{i=1}^{n} w_i \log p_i\right)$$

where $\mathrm{BP}$ is the brevity penalty value, which equals 1 if the total length of the resulting response is longer than that of the reference response and otherwise equals the ratio between those two lengths. $p_i$ measures the overlap between the bag of $i$-grams appearing in the resulting sentences and that of $i$-grams appearing in the reference sentences, and $w_i$ is the weight of the $i$-gram precision. The value of the BLEU score is between 0 and 1; the higher it is, the higher the precision of the n-grams. For classification models, answer selection accuracy is also used as an evaluation metric.
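A simplified BLEU computation along these lines can be sketched as follows; it assumes the standard exponential brevity penalty and uniform n-gram weights, which may differ from the exact variant used in the evaluation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    """Illustrative BLEU-n: clipped n-gram precision with a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap / total, 1e-9))           # avoid log(0)
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # otherwise a penalty that shrinks with the length ratio.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

An identical candidate and reference scores 1.0; a shorter candidate is penalized even when its n-gram precisions are perfect.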
4.2.2 Human Evaluations
Five customer service staff members were asked to label whether a reply accords with the order conditions and to score the satisfaction of the reply. Whether a reply accords with the order conditions is assessed by comparing the reply with the ground-truth condition values, with two options (“Yes” and “No”) available. Reply satisfaction is an overall measurement reflecting the response’s fluency, usefulness and so on; there are five ratings for reply satisfaction. 300 samples were chosen randomly from the test set for human assessment, and all the different categories of questions are guaranteed to appear in this set.
4.3 Baseline Methods
CAR-Transformer is compared with the following methods. Constant replies “Dear Customer, we have updated your order information to our fulfillment team.” to all questions. SC-LSTM is a language generation model based on a semantically controlled Long Short-Term Memory structure. By incorporating dialogue-act one-hot vectors into the original LSTM cell, SC-LSTM enables the generator to output act-related text; the dialogue-act one-hot vectors are replaced with order-conditions one-hot vectors to adapt it to the task of this paper. TA-seq2seq utilizes topic information in chatbots through a joint attention mechanism; its topic words are replaced with our order conditions. Split Memory Network is a modification of the Memory Network that enables personalization: profile attributes and dialogue history are modeled in two separate memories, and the profile attributes are replaced with order conditions. BERT is a Transformer-based pre-training model for language understanding; it summarizes the input question as a fixed-dimensional pooled representation, which is transformed into label probabilities. CA-Transformer is a degraded version of CAR-Transformer without the conditions reviser component.
The maximum length of the input question is set to 50 and the maximum number of history conversation turns is set to 2 for all models. The optimal setting of each model was selected by the BLEU scores on the validation set. We empirically set the balancing hyperparameter in Equation 4 to 0.2. The dimensions of both the word embedding and the conditions embedding are 300 for all models.
|Models||Automatic Evaluations||Human Evaluations|
|BLEU-1||BLEU-2||BLEU-3||BLEU-4||Accuracy||Rate of accord with conditions||Satisfaction|
|Split Memory Network||52.55%||51.59%||50.88%||50.69%||50.60%||69.13%(0.16)||3.39(0.26)|
4.4 Overall Comparison
The automatic evaluation and human evaluation results of the different models are reported in Table 4. For the human evaluations, the mean and standard error of the results given by the five staff members are reported. Based on the results, CAR-Transformer delivers the best performance among all the compared methods on all metrics. Furthermore, the experimental results reveal the following observations.
4.4.1 CAR-Transformer/CA-Transformer Can Better Leverage Conditions Information
As shown in Table 4, the gap between CA-Transformer and BERT demonstrates the importance of external knowledge; in other words, it is not enough to rely solely on a powerful language model. Moreover, the answers given by CAR-Transformer and CA-Transformer have an overwhelming advantage over the other methods in terms of the rate of accordance with order conditions. This suggests that CAR-Transformer and CA-Transformer have a better capacity for leveraging the order-conditions information. CAR-Transformer utilizes conditions information in two components, the conditions encoder and the classifier. Although both CA-Transformer and TA-seq2seq are sequential language models, CA-Transformer performing better than TA-seq2seq suggests the superiority of the proposed conditions encoder. To further understand the utilization of conditions, confusion matrices for the classification models are displayed in Figures 5-8. For questions that do not need conditions information to answer, such as questions whose answers are “Howtorequest” or “Invoiceleadtime”, BERT performs well. But for most dialogues that do need conditions information, for example dialogues that ask about delivery, one can find that the models without knowledge awareness perform badly and tend to choose candidate answers with large numbers of samples, such as “Missing”, “Notracking” and so on. Additionally, Split Memory Network tends to overfit the conditions information. It can be observed from Figure 8 that all “Instock” samples are classified as “Wishtowait” and 75% of “when_instock” samples are classified as “Wishtochange”. Since the training samples whose answers are “Instock”, “when_instock”, “Wishtowait” or “Wishtochange” share the same condition value “out of stock”, their conditions embeddings may be very similar. CAR-Transformer can usually strike a balance between the buyer’s intent and the conditions information.
However, Figure 6 also indicates that CAR-Transformer is confused by the questions whose answers are “Notacceptedreturn”, “ShippingCharge1”, “ShippingCharge2” and so on. This is apparently because these questions have few training samples.
4.4.2 The Effectiveness of Conditions Reviser
|question||defective conditions||true conditions||revised conditions||prediction of CAR-Transformer||prediction of CA-Transformer|
|Thank you for the update, I will wait for the new stock to come in||In stock||Out of stock||Out of stock||Wishtowait||Refund|
|Hi, could you please advise whether you have received my returned order? It’s nearly two weeks since i posted it off to you. Thanks||Express missing, no return status||Express status is normal, not receive return goods||Express status is normal, not receive return goods||Plswait||Missing|
|Hi, I have 5 t-shirts on order and they haven’t arrived as yet. Could you provide an update please.||Normal express, express status is redelivery||Expedite express, express status is redelivery||Expedite express, express status is redelivery||Redelivery||Notracking|
|Shirt arrived looks good, but too big as have another shirt similar. I need 17” collar, so XL would be better. Not opened package, so can I swap it for a smaller one?||Shipped: unknown, delivery status: unknown||Shipped: yes, delivery status: normal||Shipped: yes, delivery status: normal||Adviserefund||Plswait|
|Cancelled this order due to posting and packing charges when free delivery in UK was stated. I am disappointed that this order has been dispatched already||shipped||shipped||shipped||Cannotbechanged||Cancellation|
|My daughter needs the t-shirt for her show next Saturday 24th November, could you please dispatch asap, I’d be very grateful||shipped||Unshipped||Unshipped||mark as urgent||Notracking|
|I really need the high visible vest for the 6th of October this month … if is possible it would it be great !!||shipped||shipped||shipped||Faster1||Notracking|
|Models||Automatic Evaluations||Human Evaluations|
|BLEU-1||BLEU-2||BLEU-3||BLEU-4||Accuracy||Rate of accord with conditions||Satisfaction|
|Split Memory Network||41.77%||40.81%||40.10%||39.95%||39.86%||57.76%(0.14)||3.26(0.24)|
As can be observed in Table 4, CAR-Transformer shows substantial gains over CA-Transformer. This suggests that defective conditions information may affect the model’s accuracy and that the conditions reviser may reduce this impact. To provide some qualitative insight into the conditions reviser, some representative cases of CAR-Transformer and CA-Transformer are displayed in Table 5. Note that the dialogue history and condition values that are unrelated to answering the question are omitted due to space limitations. The samples in rows 1 to 4 have wrong or unknown condition values, and only CAR-Transformer predicts correctly, by revising the condition values. The samples in rows 5 to 7 have correct condition values, yet still only CAR-Transformer predicts correctly. As observed in Table 5 and Figure 5, CA-Transformer tends to weaken the function of the conditions information. A possible explanation is that CA-Transformer pays less attention to the conditions embedding because of the existence of defective condition values. In contrast, CAR-Transformer correctly revises 71.13% of the wrong or missing condition values in the test set, and thus can make good use of the conditions information. Figure 9 shows a principal component analysis of three different types of conditions embedding vectors; the revised conditions embeddings tend toward the same distribution as the correct conditions embeddings. The results of the condition-aware models on examples with defective conditions are also provided in Table 6. The obtained results indicate that defective condition values can be revised by the proposed conditions reviser to a certain degree.
4.5 Supplementary Experiment on the Personalized bAbI Dialog Dataset
|Tasks||Supervised Embeddings||Memory Network||CAR-Transformer|
|Task||Best Accuracy||Number of utterances used|
In this subsection, the proposed CAR-Transformer is extended to the personalized bAbI dialog dataset, a public multi-turn dialog corpus in a restaurant reservation scenario. It introduces four additional profile attributes (gender, age, dietary preference and favorite food), and the utterances are relevant to the user profiles. The bot is required to select an appropriate response from a candidate set. Five separate tasks are introduced along with the dataset. Tasks 1 and 2 test the model’s ability to indirectly track dialog state; Tasks 3 and 4 check whether the model can sort and use facts about restaurants; Task 5 tests all of the above capabilities together. Tasks 1, 2 and 4 only have the gender and age information of users, while Tasks 3 and 5 have all attributes. More details of this dataset can be found in the original paper. Instead of being given the user profiles directly, we assume that all user profiles are unknown and use the proposed conditions reviser to infer the user profiles from the dialogue history. The main differences between our customer service dataset and the personalized bAbI dialog dataset are that there is no correlation between the profile attributes in the personalized bAbI dialog dataset, since the profile attributes are randomly sampled from a list of possible values, and that the user profiles “dietary preference” and “favorite food” are provided for the bot to choose between restaurants rather than being directly associated with the content of the dialogue.
Based on the above factors, the personalized bAbI dialog dataset is used only as a supplementary experiment. CAR-Transformer is compared with supervised embeddings and the Memory Network, as in the original benchmark. For CAR-Transformer, the maximum length of an input utterance is set to 50 and only the last two turns of conversation history are considered. The hyperparameters of the models were selected on the validation sets.
Per-response accuracy (the percentage of responses for which the correct one is chosen out of all candidates) for all models is reported in Table 7. Table 8 shows the best profile prediction accuracy (the percentage of correct inferences over all user profiles) of the conditions reviser on each task and the minimum number of utterances used. The performance of CAR-Transformer is significantly higher than that of the other models, which indicates that the CAR-Transformer is able to infer and leverage user profiles to a certain extent. As can be observed in Table 8, gender and age can easily be inferred by the conditions reviser from the style of language; but dietary preference and favorite food, because of the diversity of choices (there are 2 types of dietary preference and 14 types of favorite food) and the implicit correspondence between utterances and attributes, are harder and need more conversation history to infer.
5 Conclusion
In this paper, CAR-Transformer is proposed to select appropriate answers that are compatible with both the user intent and the conditions of the question. Specifically, this paper considers the more general and realistic situation where the condition values are wrong or incomplete. The proposed conditions reviser can revise wrong or incomplete condition values without knowing beforehand which ones are wrong. We perform extensive experimental evaluations of the proposed approach on a real-world dataset and extend the CAR-Transformer to infer user profiles on the personalized bAbI dialog dataset. The experimental results show the effectiveness of the proposed CAR-Transformer. In future work, explicit knowledge will be incorporated into the learning model for more effective knowledge correction [46, 45].
-  (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §3.2.
-  (2015) Neural machine translation by jointly learning to align and translate. In Proc. ICLR, Cited by: §2.1, §3.3.3.
-  (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proc. SIGMOD, pp. 1247–1250. Cited by: §1.
-  (2014) Question answering with subgraph embeddings. In Proc. EMNLP, pp. 615–620. Cited by: §1.
-  (2015) Model metric co-learning for time series classification. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §2.1.
-  (2013) Model-based kernel for efficient time series analysis. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 392–400. Cited by: §2.1.
-  (2014) Learning in the model space for cognitive fault diagnosis. IEEE Transactions on Neural Networks and Learning Systems 25 (1), pp. 124–136. Cited by: §2.1.
-  (2009) Predictive ensemble pruning by expectation propagation. IEEE Transactions on Knowledge and Data Engineering 21 (7), pp. 999–1013. Cited by: §3.3.4.
-  (2009) Probabilistic classification vector machines. IEEE Transactions on Neural Networks 20 (6), pp. 901–914. Cited by: §3.3.4.
-  (2013) Efficient probabilistic classification vector machine with incremental basis function selection. IEEE Transactions on Neural Networks and Learning Systems 25 (2), pp. 356–369. Cited by: §3.3.4.
-  (2014) Cognitive fault diagnosis in tennessee eastman process using learning in the model space. Computers & chemical engineering 67, pp. 33–42. Cited by: §2.1.
-  (2009) Regularized negative correlation learning for neural network ensembles. IEEE Transactions on Neural Networks 20 (12), pp. 1962–1979. Cited by: §3.3.4.
-  (2010) Multiobjective neural network ensembles based on regularized negative correlation learning. IEEE Transactions on Knowledge and Data Engineering 22 (12), pp. 1738–1751. Cited by: §3.3.4.
-  (2014) Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proc. EMNLP, pp. 1724–1734. Cited by: §2.1.
-  (2017) Domain aware neural dialog system. arXiv preprint arXiv:1708.00897. Cited by: §2.2.
-  (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL, pp. 4171–4186. Cited by: §4.3.
-  (2015) Question answering over freebase with multi-column convolutional neural networks. In Proc. ACL, pp. 260–269. Cited by: §1.
-  (2019) Neural approaches to conversational ai: question answering, task-oriented dialogues and social chatbots. Now Foundations and Trends. Cited by: §1.
-  (2017) Convolutional sequence to sequence learning. In Proc. ICML, pp. 1243–1252. Cited by: §2.1, §3.3.3.
-  (2000) Learning to forget: continual prediction with lstm. Neural Computation 12 (10), pp. 2451–2471. Cited by: §4.3.
-  (2018) Multiobjective learning in the model space for time series classification. IEEE Transactions on Cybernetics 49 (3), pp. 918–932. Cited by: §2.1.
-  (2016) Model-based oversampling for imbalanced sequence classification. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pp. 1009–1018. Cited by: §2.1.
-  (2018) Sequential data classification by dynamic state warping. Knowledge and Information Systems 57 (3), pp. 545–570. Cited by: §2.1.
-  (2016) Deep residual learning for image recognition. In Proc. CVPR, pp. 770–778. Cited by: §3.2.
-  (2017) Neural response generation for customer service based on personality traits. In Proc. INLG, pp. 252–256. Cited by: §2.2.
-  (2017) Scalable graph-based semi-supervised learning through sparse bayesian model. IEEE Transactions on Knowledge and Data Engineering 29 (12), pp. 2758–2771. Cited by: §3.3.4.
-  (2017) Personalization in goal-oriented dialog. arXiv preprint arXiv:1706.07503. Cited by: §2.2, §4.3, §4.5, §4.5.
-  (2017) Structured attention networks. Cited by: §2.1.
-  (2016) A persona-based neural conversation model. In Proc. ACL, pp. 994–1003. Cited by: §2.2.
-  (2017) A structured self-attentive sentence embedding. In Proc. ICLR, Cited by: §2.1.
-  (2018) Stochastic answer networks for machine reading comprehension. In Proc. ACL, pp. 1694–1704. Cited by: §1.
-  (2019) Learning personalized end-to-end goal-oriented dialog. In Proc. AAAI, Vol. 33, pp. 6794–6801. Cited by: §2.2.
-  (2019) Multiclass probabilistic classification vector machine. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §3.3.4.
-  (2002) BLEU: a method for automatic evaluation of machine translation. In Proc. ACL, pp. 311–318. Cited by: §4.2.1.
-  (2016) A decomposable attention model for natural language inference. In Proc. EMNLP, pp. 2249–2255. Cited by: §2.1.
-  (2018) Assigning personality/identity to a chatting machine for coherent conversation generation. In Proc. IJCAI, pp. 4279–4285. Cited by: §2.2.
-  (2017) Bidirectional attention flow for machine comprehension. In Proc. ICLR, Cited by: §1.
-  (2012) Semisupervised classification with cluster regularization. IEEE Transactions on Neural Networks and Learning Systems 23 (11), pp. 1779–1792. Cited by: §3.3.4.
-  (2014) Sequence to sequence learning with neural networks. In Proc. NIPS, pp. 3104–3112. Cited by: §2.1.
-  (2017) Attention is all you need. In Proc. NIPS, pp. 5998–6008. Cited by: §1, §2.1, §3.2, §3.3.2.
-  (2018) The apva-turbo approach to question answering in knowledge base. In Proc. COLING, pp. 1998–2009. Cited by: §1.
-  (2016) Chinese poetry generation with planning based neural network. In Proc. COLING, pp. 1051–1060. Cited by: §3.3.3.
-  (2015) Semantically conditioned lstm-based natural language generation for spoken dialogue systems. In Proc. EMNLP, pp. 1711–1721. Cited by: §4.3.
-  (2015) Memory networks. In Proc. ICLR, Cited by: §4.3, §4.5.
-  (2017) Knowledge engineering with big data (bigke): a 54-month, 45-million rmb, 15-institution national grand project. IEEE Access 5, pp. 12696–12701. Cited by: §5.
-  (2015) Knowledge engineering with big data. IEEE Intelligent Systems 30 (5), pp. 46–55. Cited by: §5.
-  (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144. Cited by: §3.3.3.
-  (2017) Topic aware neural response generation. In Proc. AAAI, pp. 3351–3357. Cited by: §2.2, §4.3.
-  (2014) Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575. Cited by: §1.
-  (2017) Personalized response generation via domain adaptation. In Proc. SIGIR, pp. 1021–1024. Cited by: §2.2.
-  (2016) Simple question answering by attentive convolutional neural network. In Proc. COLING, pp. 1746–1756. Cited by: §1.
-  (2017) Improved neural relation detection for knowledge base question answering. In Proc. ACL, pp. 571–581. Cited by: §1.
-  (2019) Neural personalized response generation as domain adaptation. World Wide Web 22 (4), pp. 1427–1446. Cited by: §2.2.