Augmenting End-to-End Dialog Systems with Commonsense Knowledge

Building dialog agents that can converse naturally with humans is a challenging yet intriguing problem of artificial intelligence. In open-domain human-computer conversation, where the conversational agent is expected to respond to human utterances in an interesting and engaging way, commonsense knowledge has to be integrated into the model effectively. In this paper, we investigate the impact of providing commonsense knowledge about the concepts covered in the dialog. Our model represents the first attempt at integrating a large commonsense knowledge base into end-to-end conversational models. In the retrieval-based scenario, we propose the Tri-LSTM model to jointly take into account message and commonsense for selecting an appropriate response. Our experiments suggest that the knowledge-augmented models are superior to their knowledge-free counterparts in automatic evaluation.



In recent years, data-driven approaches to building conversation models have been made possible by the proliferation of social media conversation data and the increase of computing power. By relying on a large number of message-response pairs, the Seq2Seq framework [Sutskever, Vinyals, and Le2014] attempts to produce an appropriate response based solely on the message itself, without any memory module.

In human-to-human conversations, however, people respond to each other’s utterances in a meaningful way not only by paying attention to the latest utterance of the conversational partner itself, but also by recalling relevant information about the concepts covered in the dialogue and integrating it into their responses. Such information may contain personal experience, recent events, commonsense knowledge and more (Figure 1). As a result, it is speculated that a conversational model with a “memory look-up” module can mimic human conversations more closely [Ghazvininejad et al.2017, Bordes and Weston2016]. In open-domain human-computer conversation, where the model is expected to respond to human utterances in an interesting and engaging way, commonsense knowledge has to be integrated into the model effectively.

In the context of artificial intelligence (AI), commonsense knowledge is the set of background information that an individual is intended to know or assume and the ability to use it when appropriate [Minsky1986, Cambria et al.2009, Cambria and Hussain2015]. Because such knowledge is vast, we speculate that this goal is better served by employing an external memory module containing commonsense knowledge than by forcing the system to encode it in model parameters, as in traditional methods.

In this paper, we investigate how to improve end-to-end dialogue systems by augmenting them with commonsense knowledge, integrated in the form of external memory. The remainder of this paper is organized as follows: the next section reviews related work on conversational models and commonsense knowledge; the following section describes the proposed model in detail; a later section presents experimental results; finally, the last section offers concluding remarks and future work.

Figure 1: Left: In traditional dialogue systems, the response is determined solely by the message itself (arrows denote dependencies). Right: The responder recalls relevant information from memory; memory and message content jointly determine the response. In the illustrated example, the responder retrieves the event “Left_dictionary_on_book_shelf” from memory, which triggers a meaningful response.

Related Work

Conversational Models

Data-driven conversational models generally fall into two categories: retrieval-based methods [Lowe et al.2015b, Lowe et al.2016a, Zhou et al.2016], which select a response from a predefined repository, and generation-based methods [Ritter, Cherry, and Dolan2011, Serban et al.2016, Vinyals and Le2015], which employ an encoder-decoder framework where the message is encoded into a vector representation and then fed to the decoder to generate the response. The latter is more natural (as it does not require a response repository) yet tends to generate dull or vague responses and generally needs a large amount of training data.

The use of an external memory module in natural language processing (NLP) tasks has received considerable attention recently, such as in question answering [Weston et al.2015] and language modeling [Sukhbaatar et al.2015]. It has also been employed in dialogue modeling in several limited settings. With memory networks, [Dodge et al.2015] used a set of fact triples about movies as long-term memory when modeling Reddit dialogues, movie recommendation and factoid question answering. Similarly, in a restaurant reservation setting, [Bordes and Weston2016] provided local restaurant information to the conversational model.

Researchers have also proposed several methods to incorporate knowledge as external memory into the Seq2Seq framework. [Xing et al.2016] incorporated the topic words of the message obtained from a pre-trained latent Dirichlet allocation (LDA) model into the context vector through a joint attention mechanism. [Ghazvininejad et al.2017] mined Foursquare tips to be searched by an input message in the food domain and encoded such tips into the context vector through a one-turn hop. The model we propose in this work shares similarities with [Lowe et al.2015a], which encoded unstructured textual knowledge with a recurrent neural network (RNN). Our work distinguishes itself from previous research in that we consider a large heterogeneous commonsense knowledge base in an open-domain retrieval-based dialogue setting.

Commonsense Knowledge

Several commonsense knowledge bases have been constructed during the past decade, such as ConceptNet [Speer and Havasi2012] and SenticNet [Cambria et al.2016]. The aim of commonsense knowledge representation and reasoning is to give a foundation of real-world knowledge to a variety of AI applications, e.g., sentiment analysis [Poria et al.2015], handwriting recognition [Wang et al.2013], e-health [Cambria et al.2010], aspect extraction [Poria et al.2016], and many more. Typically, a commonsense knowledge base can be seen as a semantic network where concepts are nodes in the graph and relations are edges (Figure 2). Each triple is termed an assertion.

Figure 2: A sketch of SenticNet semantic network.

Based on the Open Mind Common Sense project [Singh et al.2002], ConceptNet not only contains objective facts such as “Paris is the capital of France” that are constantly true, but also captures informal relations between common concepts that are part of everyday knowledge such as “A dog is a pet”. This feature of ConceptNet is desirable in our experiments, because the ability to recognize the informal relations between common concepts is necessary in the open-domain conversation setting we are considering in this paper.

Model Description

Task Definition

In this work, we concentrate on integrating commonsense knowledge into retrieval-based conversational models, because they are easier to evaluate [Liu et al.2016, Lowe et al.2016a] and generally take a lot less data to train. We leave the generation-based scenario to future work.

Message (context) $x$ and response $y$ are sequences of tokens from vocabulary $V$. Given $x$ and a set of response candidates $Y = [y_1, y_2, \dots, y_K]$, the model chooses the most appropriate response $\hat{y}$ according to:

$$\hat{y} = \arg\max_{y \in Y} f(x, y),$$

where $f(x, y)$ is a scoring function measuring the “compatibility” of $x$ and $y$. The model is trained on $\langle x, y, \mathit{label} \rangle$ triples with cross entropy loss, where $\mathit{label}$ is binary, indicating whether the $\langle x, y \rangle$ pair comes from real data or is randomly combined.
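The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `select_response` and `toy_score` are hypothetical names, and `toy_score` is a word-overlap stand-in for the learned scoring function, not the model used in this paper.

```python
from typing import Callable, Sequence


def select_response(score: Callable[[str, str], float],
                    message: str,
                    candidates: Sequence[str]) -> str:
    """Return the candidate with the highest compatibility score."""
    return max(candidates, key=lambda response: score(message, response))


def toy_score(message: str, response: str) -> float:
    # Hypothetical stand-in scorer: Jaccard overlap between token sets.
    # A trained model would replace this function.
    m = set(message.lower().split())
    r = set(response.lower().split())
    union = m | r
    return len(m & r) / len(union) if union else 0.0


best = select_response(
    toy_score,
    "i love my dog",
    ["my dog is a pet", "the weather is nice"],
)
print(best)
```

Any callable that maps a (message, response) pair to a real number can be plugged in as `score`, which is what makes retrieval-based models straightforward to compare.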

Dual-LSTM Encoder

As a variation of vanilla RNN, a long short-term memory (LSTM) network [Hochreiter and Schmidhuber1997] is good at handling long-term dependencies and can be used to map an utterance to its last hidden state as a fixed-size embedding representation. The Dual-LSTM encoder [Lowe et al.2015b] represents the message $x$ and response $y$ as fixed-size embeddings $\vec{x}$ and $\vec{y}$ with the last hidden states of the same LSTM. The compatibility function of the two is thus defined by:

$$f(x, y) = \sigma(\vec{x}^{\top} W \vec{y}),$$

where matrix $W$ is learned during training.
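As a minimal sketch of the bilinear compatibility function, assuming the two utterance embeddings have already been produced by the shared LSTM, the score is a sigmoid over a bilinear form. The function names and the toy vectors below are illustrative assumptions; the LSTM itself is not implemented here.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bilinear(u: Vector, W: Matrix, v: Vector) -> float:
    """Compute the bilinear form u^T W v."""
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u))
               for j in range(len(v)))


def dual_score(x_vec: Vector, W: Matrix, y_vec: Vector) -> float:
    """Compatibility of a message embedding and a response embedding."""
    return sigmoid(bilinear(x_vec, W, y_vec))


# Toy 2-dimensional embeddings and an identity matrix as placeholder weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
score = dual_score([1.0, 0.0], identity, [1.0, 0.0])
print(round(score, 4))
```

With the identity matrix the bilinear form reduces to a dot product, so orthogonal embeddings score exactly 0.5.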

Figure 3: Tri-LSTM encoder. We use LSTM to encode message, response and commonsense assertions. LSTM weights for message and response are tied. The lower box is equal to a Dual-LSTM encoder. The upper box is the memory module encoding all commonsense assertions.

Commonsense Knowledge Retrieval

In this paper, we assume that a commonsense knowledge base is composed of assertions $A$ about concepts $C$. Each assertion $a \in A$ takes the form of a triple $\langle c_1, r, c_2 \rangle$, where $r \in R$ is a relation between $c_1$ and $c_2$, such as IsA, CapableOf, etc., and $c_1, c_2$ are concepts in $C$. The relation set $R$ is typically much smaller than $C$. A concept $c$ can either be a single word (e.g., “dog” and “book”) or a multi-word expression (e.g., “take_a_stand” and “go_shopping”). We build a dictionary $H$ out of $A$ where every concept $c$ is a key and a list of all assertions in $A$ concerning $c$, i.e., those with $c = c_1$ or $c = c_2$, is the value. Our goal is to retrieve commonsense knowledge about every concept covered in the message.

We define $A_x$ as the set of commonsense assertions concerned with message $x$. To recover concepts in message $x$, we use simple $n$-gram matching ($n \le N$, where $N$ is set to 5). More sophisticated methods such as that of [Rajagopal et al.2013] are also possible; here, we chose $n$-gram matching for better speed and recall. Every $n$-gram in $x$ is considered a potential concept. For unigrams, we exclude a set of stopwords, and both the original version and stemmed version of every word are considered. If the $n$-gram is a key in $H$, the corresponding value, i.e., all assertions in $A$ concerning the concept, is added to $A_x$ (Figure 4).
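The retrieval step described above can be sketched as a dictionary lookup over n-grams. The sketch below makes several assumptions that are not specified in the text: whitespace tokenization, lowercasing, multi-word concepts joined by underscores (as in “go_shopping”), and no stemming. The function names and the third toy assertion are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Assertion = Tuple[str, str, str]  # (concept_1, relation, concept_2)


def build_index(assertions: List[Assertion]) -> Dict[str, List[Assertion]]:
    """Map every concept to the list of assertions that mention it."""
    index: Dict[str, List[Assertion]] = defaultdict(list)
    for c1, rel, c2 in assertions:
        index[c1].append((c1, rel, c2))
        if c2 != c1:
            index[c2].append((c1, rel, c2))
    return dict(index)


def retrieve(message: str,
             index: Dict[str, List[Assertion]],
             max_n: int = 5,
             stopwords: FrozenSet[str] = frozenset()) -> List[Assertion]:
    """Collect assertions for every n-gram (n <= max_n) that is a known concept."""
    tokens = message.lower().split()
    found: List[Assertion] = []
    seen = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if n == 1 and gram[0] in stopwords:
                continue  # stopwords are skipped for unigrams only
            for assertion in index.get("_".join(gram), []):
                if assertion not in seen:
                    seen.add(assertion)
                    found.append(assertion)
    return found


kb = [
    ("insomnia", "IsA", "sleepproblem"),
    ("hawaii", "UsedFor", "tourism"),
    ("go_shopping", "RelatedTo", "mall"),  # hypothetical toy assertion
]
index = build_index(kb)
hits = retrieve("i want to go shopping in hawaii", index,
                stopwords=frozenset({"i", "to", "in"}))
print(hits)
```

A dictionary keyed by concept keeps each lookup constant-time, so retrieval cost grows with message length rather than with the size of the knowledge base.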

Figure 4: In the illustrated case, five concepts are identified in the message. All assertions associated with the five concepts constitute $A_x$. We show three appropriate responses for this single message. Each of them is associated with (same color) only one or two commonsense assertions, which is typical in open-domain conversation and provides ground for our max-pooling strategy. It is also possible that an appropriate response is not relevant to any of the commonsense assertions in $A_x$ at all, in which case our method falls back to Dual-LSTM.

Tri-LSTM Encoder

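Our main approach to integrating commonsense knowledge into the conversational model involves using another LSTM for encoding all assertions $a$ in $A_x$, as illustrated in Figure 3. Each $a$, originally in the form of $\langle c_1, r, c_2 \rangle$, is transformed into a sequence of tokens by chunking $c_1$ and $c_2$, concepts which are potentially multi-word phrases, into $[c_{11}, c_{12}, c_{13}, \dots]$ and $[c_{21}, c_{22}, c_{23}, \dots]$. Thus, $a = [c_{11}, c_{12}, c_{13}, \dots, r, c_{21}, c_{22}, c_{23}, \dots]$.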
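A small sketch of this flattening step, assuming multi-word concepts are stored with underscores between words (as in “hello_in_french”); the function name is hypothetical:

```python
from typing import List, Tuple


def assertion_to_tokens(assertion: Tuple[str, str, str]) -> List[str]:
    """Flatten (concept_1, relation, concept_2) into one token sequence."""
    c1, rel, c2 = assertion
    # Concepts are split into words; the relation stays a single token.
    return c1.split("_") + [rel] + c2.split("_")


tokens = assertion_to_tokens(("bonjour", "IsA", "hello_in_french"))
print(tokens)
```

The resulting sequence can be fed to an LSTM like any other utterance, with the relation name acting as one extra vocabulary item.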

We add $R$ to vocabulary $V$, that is, each $r$ in $R$ will be treated like any regular word in $V$ during encoding. We decide not to use each concept $c$ as a unit for encoding $a$ because $C$ is typically too large (more than 1M). $a$ is encoded as embedding representation $\vec{a}$ using another LSTM. Note that this encoding scheme is suitable for any natural utterances containing commonsense knowledge (termed surface text in ConceptNet) in addition to well-structured assertions. We define the match score of assertion $a$ and response $y$ as:

$$m(a, y) = \vec{a}^{\top} W_a \vec{y},$$

where $W_a$ is learned during training. The set of commonsense assertions $A_x$ associated with a message is usually large (more than 100 in our experiment). We observe that in many cases of open-domain conversation, response $y$ can be seen as triggered by a certain perception of message $x$ defined by one or more assertions in $A_x$, as illustrated in Figure 4, which also shows how message-response matching changes when commonsense knowledge is used. For example, the word ‘Insomnia’ in the message is mapped to the commonsense assertion ‘Insomnia, IsA, sleepproblem’, and the appropriate response, ‘go to bed’, is then matched through ‘sleepproblem’. Similarly, the word ‘Hawaii’ in the message is mapped to the commonsense assertion ‘Hawaii, UsedFor, tourism’, and the appropriate response, ‘enjoy vacation’, is matched through ‘tourism’. In this way, new words can be mapped to commonly used vocabulary, improving response accuracy.

Our assumption is that $A_x$ is helpful in selecting an appropriate response $y$. However, usually very few assertions in $A_x$ are related to a particular response $y$ in the open-domain setting. As a result, we define the match score of $A_x$ and $y$ as

$$m(A_x, y) = \max_{a \in A_x} m(a, y),$$

that is, we only consider the commonsense assertion $a$ with the highest match score with $y$, as most of $A_x$ is not relevant to $y$. Incorporating $m(A_x, y)$ into the Dual-LSTM encoder, our Tri-LSTM encoder model is thus defined as:

$$f(x, y) = \sigma(\vec{x}^{\top} W \vec{y} + m(A_x, y)),$$

i.e., we use simple addition to supplement $x$ with $A_x$, without introducing a mechanism for any further interaction between $x$ and $A_x$. This simple approach is suitable for response selection and proves effective in practice.

The intuition we are trying to capture here is that an appropriate response $y$ should not only be compatible with $x$, but also related to a certain memory recall triggered by $x$, as captured by $m(A_x, y)$. In our case, the memory is commonsense knowledge about the world. In cases where $A_x = \emptyset$, i.e., no commonsense knowledge is recalled, $m(A_x, y) = 0$ and the model degenerates to the Dual-LSTM encoder.
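The scoring rule of the Tri-LSTM encoder can be sketched as follows, assuming the message, response and assertion embeddings are already available as plain vectors (the LSTM encoders are not implemented here). All function names and the toy numbers are illustrative assumptions.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bilinear(u: Vector, W: Matrix, v: Vector) -> float:
    """Compute the bilinear form u^T W v."""
    return sum(u[i] * W[i][j] * v[j]
               for i in range(len(u))
               for j in range(len(v)))


def pooled_match(assertion_vecs: List[Vector], W_a: Matrix, y_vec: Vector) -> float:
    """Max-pool the assertion-response match scores; 0 if nothing was retrieved."""
    if not assertion_vecs:
        return 0.0
    return max(bilinear(a, W_a, y_vec) for a in assertion_vecs)


def tri_score(x_vec: Vector, W: Matrix, y_vec: Vector,
              assertion_vecs: List[Vector], W_a: Matrix) -> float:
    """Message-response score plus the best assertion-response match."""
    return sigmoid(bilinear(x_vec, W, y_vec)
                   + pooled_match(assertion_vecs, W_a, y_vec))


identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned weights
x_vec, y_vec = [1.0, 0.0], [1.0, 0.0]
with_memory = tri_score(x_vec, identity, y_vec, [[0.0, 1.0], [2.0, 0.0]], identity)
without_memory = tri_score(x_vec, identity, y_vec, [], identity)
print(with_memory, without_memory)
```

In this toy setting only the second assertion contributes, since max-pooling discards the irrelevant one; with an empty memory the score equals the plain message-response score.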

Comparison Approaches

Supervised Word Embeddings

We follow [Bordes and Weston2016, Dodge et al.2015] and use supervised word embeddings as a baseline. Word embeddings are best known in the context of unsupervised training on raw text as in [Mikolov et al.2013], yet they can also be used to score message-response pairs. The embedding vectors are trained directly for this goal. In this setting, the “compatibility” function of $x$ and $y$ is defined as:

$$f(x, y) = \sigma(\vec{x}^{\top} \vec{y}).$$

In this setting, $\vec{x}$ and $\vec{y}$ are bag-of-words embeddings. With retrieved commonsense assertions $A_x$, we embed each $a \in A_x$ to bag-of-words representation $\vec{a}$ and have:

$$f(x, y) = \sigma\Big(\vec{x}^{\top} \vec{y} + \max_{a \in A_x} \vec{a}^{\top} \vec{y}\Big).$$

This linear model differs from the Tri-LSTM encoder in that it represents an utterance with its bag-of-words embedding instead of RNNs.

Memory Networks

Memory networks [Sukhbaatar et al.2015, Weston, Chopra, and Bordes2014] are a class of models that perform language understanding by incorporating a memory component. They perform attention over memory to retrieve all relevant information that may help with the task. In our dialogue modeling setting, we use $A_x$ as the memory component. Our implementation of memory networks, similar to [Bordes and Weston2016, Dodge et al.2015], differs from supervised word embeddings described above in only one aspect: how to treat multiple entries in memory. In memory networks, the output memory representation is $\vec{o} = \sum_i p_i \vec{a}_i$, where $\vec{a}_i$ is the bag-of-words embedding of $a_i \in A_x$ and $p_i$ is the attention signal over memory $A_x$ calculated by $p_i = \mathrm{softmax}(\vec{x}^{\top} \vec{a}_i)$. The “compatibility” function of $x$ and $y$ is defined as:

$$f(x, y) = \sigma\big((\vec{x} + \vec{o})^{\top} \vec{y}\big) = \sigma\Big(\vec{x}^{\top} \vec{y} + \big(\textstyle\sum_i p_i \vec{a}_i\big)^{\top} \vec{y}\Big).$$

In contrast to supervised word embeddings described above, attention over memory is determined by message $x$. This mechanism was originally designed to retrieve information from memory that is relevant to the context, which in our setting is already achieved during commonsense knowledge retrieval. We therefore speculate that the attention over multiple memory entries is better determined by response $y$ in our setting. We verify this point empirically below.
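To make the contrast concrete, the sketch below places message-driven attention over memory next to response-driven max-pooling. It assumes bag-of-words embeddings are given as plain vectors and that the attended memory output is added to the message embedding before scoring; both the function names and the toy vectors are illustrative assumptions.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: List[float]) -> List[float]:
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memnet_score(x_vec: Vector, y_vec: Vector, memory: List[Vector]) -> float:
    """Attention weights depend on the message; the output is a weighted sum."""
    if not memory:
        return sigmoid(dot(x_vec, y_vec))
    weights = softmax([dot(x_vec, a) for a in memory])
    output = [sum(w * a[d] for w, a in zip(weights, memory))
              for d in range(len(x_vec))]
    return sigmoid(dot([xi + oi for xi, oi in zip(x_vec, output)], y_vec))


def maxpool_score(x_vec: Vector, y_vec: Vector, memory: List[Vector]) -> float:
    """Only the memory entry that best matches the response contributes."""
    best = max((dot(a, y_vec) for a in memory), default=0.0)
    return sigmoid(dot(x_vec, y_vec) + best)


memory = [[1.0, 0.0], [0.0, 1.0]]
x_vec, y_vec = [0.0, 0.0], [1.0, 0.0]
attended = memnet_score(x_vec, y_vec, memory)
pooled = maxpool_score(x_vec, y_vec, memory)
print(attended, pooled)
```

In this toy case the message gives no preference between the two memory entries, so attention averages them, while max-pooling keeps only the entry aligned with the response.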


Twitter Dialogue Dataset

To the best of our knowledge, there is currently no well-established open-domain response selection benchmark dataset available, although certain Twitter datasets have been used in the response generation setting [Li et al.2015, Li et al.2016]. We thus evaluate our method against state-of-the-art approaches in the response selection task on Twitter dialogues.

1.4M Twitter $\langle$message, response$\rangle$ pairs are used for our experiments. They were extracted over a 5-month period, from February through July in 2011. 1M Twitter $\langle$message, response$\rangle$ pairs are used for training. With the original response as ground truth, we construct 1M $\langle$message, response, label=1$\rangle$ triples as positive instances. Another 1M negative instances $\langle$message, response, label=0$\rangle$ are constructed by replacing the ground truth response with a random response in the training set.
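The construction of positive and negative training instances can be sketched as below. This is a minimal illustration with hypothetical names and toy data; it samples the negative response uniformly from all training responses and does not filter out the rare case where the sampled response coincides with the true one.

```python
import random
from typing import List, Tuple

Pair = Tuple[str, str]
Triple = Tuple[str, str, int]


def build_training_triples(pairs: List[Pair], seed: int = 0) -> List[Triple]:
    """Create one positive and one randomly corrupted negative per pair."""
    rng = random.Random(seed)
    responses = [response for _, response in pairs]
    triples: List[Triple] = []
    for message, response in pairs:
        triples.append((message, response, 1))               # real pair
        triples.append((message, rng.choice(responses), 0))  # random response
    return triples


pairs = [
    ("helping mum paint my bedroom.", "what color are you going for ?"),
    ("bonjour madame, quoi de neuf.", "you can stick with english"),
    ("i cannot sleep", "go to bed"),
]
triples = build_training_triples(pairs)
print(len(triples))
```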

For tuning and evaluation, we use 20K $\langle$message, response$\rangle$ pairs that constitute the validation set (10K) and test set (10K). They are selected by a criterion that encourages interestingness and relevance: both the message and response have to be at least 3 tokens long and contain at least one non-stopword. For every message, at least one concept has to be found in the commonsense knowledge base. For each instance, we collect another 9 random responses from elsewhere to constitute the response candidates.

Preprocessing of the dataset includes normalizing hashtags, “@User” mentions, URLs, and emoticons. The vocabulary is built from the training set with a minimum word frequency of 5; it contains 62535 words and an extra token representing all unknown words.
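A frequency-thresholded vocabulary of this kind can be sketched with a counter. The names below, including the `[UNK]` placeholder string, are assumptions for illustration; the text does not specify how the unknown-word token is written.

```python
from collections import Counter
from typing import Dict, List

UNK = "[UNK]"  # hypothetical name for the unknown-word token


def build_vocab(tokenized_corpus: List[List[str]], min_freq: int = 5) -> Dict[str, int]:
    """Keep words seen at least min_freq times; index 0 is the unknown token."""
    counts = Counter(tok for sentence in tokenized_corpus for tok in sentence)
    kept = sorted(tok for tok, count in counts.items() if count >= min_freq)
    return {tok: idx for idx, tok in enumerate([UNK] + kept)}


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map tokens to ids, sending out-of-vocabulary words to the unknown token."""
    return [vocab.get(tok, vocab[UNK]) for tok in tokens]


corpus = [["hello", "world"]] * 5 + [["rare"]]
vocab = build_vocab(corpus)
ids = encode(["hello", "rare"], vocab)
print(vocab, ids)
```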


In our experiment, ConceptNet is used as the commonsense knowledge base. Preprocessing of this knowledge base involves removing assertions containing non-English characters or any word outside vocabulary $V$. 1.4M concepts remain: 0.8M are unigrams, 0.43M are bi-grams and the other 0.17M are tri-grams or longer. Each concept is associated with an average of 4.3 assertions. More than half of the concepts are associated with only one assertion.

An average of 2.8 concepts can be found in ConceptNet for each message in our Twitter Dialogue Dataset, yielding an average of 150 commonsense assertions (the size of $A_x$). Unsurprisingly, common concepts with more associated assertions are favored in actual human conversations.

It is worth noting that ConceptNet is also noisy due to uncertainties in the construction process, where 15.5% of all assertions are considered “false” or “vague” by human evaluators [Speer and Havasi2012]. Our max-pooling strategy used in the Tri-LSTM encoder and supervised word embeddings is partly designed to alleviate this weakness.

Parameter Settings

In all our models excluding term frequency–inverse document frequency (TF-IDF) [Ramos and others2003], we initialize word embeddings with pretrained GloVe embedding vectors [Pennington, Socher, and Manning2014]. The number of hidden units in LSTM models is set to 256 and the word embedding dimension is 100. We use stochastic gradient descent (SGD) for optimization with a batch size of 64. We fix the learning rate at 0.001.

Results and Analysis

The main results for TF-IDF, word embeddings, memory networks and LSTM models are summarized in Table 1. We observe that:

(1) LSTMs perform better at modeling dialogues than word embeddings on our dataset, as shown by the comparison between Tri-LSTM and word embeddings.

(2) Integrating commonsense knowledge into conversational models boosts model performance, as Tri-LSTM outperforms Dual-LSTM on every metric (e.g., 77.5% vs. 73.6% Recall@1).

(3) Max-pooling over all commonsense assertions depending on response is a better method for utilizing commonsense knowledge than attention over memory in our setting, as demonstrated by the gain of performance of word embeddings over memory networks.

| Recall@k | TF-IDF | Word Embeddings* | Memory Networks* | Dual-LSTM | Tri-LSTM* | Human |
|---|---|---|---|---|---|---|
| Recall@1 | 32.6% | 73.5% | 72.1% | 73.6% | 77.5% | 87.0% |
| Recall@2 | 47.3% | 84.0% | 83.6% | 85.6% | 88.0% | - |
| Recall@5 | 68.0% | 95.5% | 94.2% | 95.9% | 96.6% | - |

Table 1: Model evaluation. * indicates models with commonsense knowledge integrated. The TF-IDF model is trained following [Lowe et al.2015b]. The “Recall@k” method is used for evaluation [Lowe et al.2016b]. The model is asked to rank a total of $N$ responses containing one positive response and $N-1$ negative responses ($N = 10$ according to our test set). If the ranking of the positive response is not larger than $k$, Recall@k is positive for that instance.
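The Recall@k metric described in the caption can be sketched as follows. The input format (a list of candidate scores plus the index of the positive response) and the optimistic handling of ties are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple

Instance = Tuple[Sequence[float], int]  # (candidate scores, index of the positive)


def recall_at_k(instances: List[Instance], k: int) -> float:
    """Fraction of instances whose positive response is ranked within the top k."""
    hits = 0
    for scores, positive in instances:
        # Rank = 1 + number of candidates scored strictly higher than the positive.
        rank = 1 + sum(1 for s in scores if s > scores[positive])
        if rank <= k:
            hits += 1
    return hits / len(instances)


instances = [
    ([0.9, 0.1, 0.5], 0),  # positive ranked first
    ([0.2, 0.8, 0.5], 0),  # positive ranked third
]
print(recall_at_k(instances, 1), recall_at_k(instances, 3))
```

With ten candidates per instance, as in the test set here, a random ranker would be expected to reach about 10% Recall@1, which gives a sense of scale for the numbers in Table 1.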
| Instance | Message | Response selected by Dual-LSTM | Response selected by Tri-LSTM | Activated Assertion (total size of $A_x$) |
|---|---|---|---|---|
| 1 | i was helping my brother with his chinese. | did yoga help? | the language sounds interesting! i really gotta learn it ! ✓ | chinese, IsA, human_language (755) |
| 2 | bonjour madame, quoi de neuf. | yeah me too ! | loool . you can stick with english , its all good unless you want to improve your french . ✓ | bonjour, IsA, hello_in_french (9) |
| 3 | help what colour shoes can i wear with my dress to the wedding? | very pale pink or black. ✓ | very pale pink or black. ✓ | pink, RelatedTo, colour (1570) |
| 4 | helping mum paint my bedroom. | shouldn’t it be your mum helping you? what color are you going for ? ✓ | shouldn’t it be your mum helping you? what color are you going for ? ✓ | paint, RelatedTo, household_color (959) |

Table 2: Case studies for the impact of commonsense assertions. “Activated Assertion” is the commonsense assertion entry in $A_x$ chosen by max-pooling. ✓ indicates correct selection. All 4 instances displayed are taken from the test set.

We also analyze samples from the test set to gain insight into how commonsense knowledge supplements the message itself in response selection by comparing the Tri-LSTM encoder and the Dual-LSTM encoder.

As illustrated in Table 2, instances 1 and 2 represent cases where commonsense assertions as an external memory module provide certain clues that the other model failed to capture. For example, in instance 2, Tri-LSTM selects the response “…improve your french” to message “bonjour madame” based on the retrieved assertion “bonjour, IsA, hello_in_french”, while Dual-LSTM selects an irrelevant response. Unsurprisingly, Dual-LSTM is also able to select the correct response in some cases where certain commonsense knowledge is necessary, as illustrated in instance 3. Both models select “… pink or black” in response to message “…what colour shoes…”, even though Dual-LSTM does not have access to the helpful assertion “pink, RelatedTo, colour”.

Informally speaking, such cases suggest that to some extent, Dual-LSTM (models with no memory) is able to encode certain commonsense knowledge in model parameters (e.g., word embeddings) in an implicit way. In other cases, e.g., instance 4, the message itself is enough for the selection of the correct response, where both models do equally well.

Conclusion and Future Work

In this paper, we emphasized the role of memory in conversational models. In the open-domain chit-chat setting, we experimented with commonsense knowledge as external memory and proposed to exploit LSTM to encode commonsense assertions to enhance response selection.

In the other research line of response generation, such knowledge can potentially be used to condition the decoder in favor of more interesting and relevant responses. Although the gains presented by our new method are not spectacular according to Recall@k, our work represents a promising attempt at integrating a large heterogeneous knowledge base that potentially describes the world into conversational models as a memory component.

Our future work includes extending the commonsense knowledge with common (or factual) knowledge, e.g., to extend the knowledge base coverage by linking more named entities to commonsense knowledge concepts [Cambria et al.2014], and developing a better mechanism for utilizing such knowledge instead of the simple max-pooling scheme used in this paper. We would also like to explore the memory of the model for multiple message response pairs in a long conversation.

Lastly, we plan to integrate affective knowledge from SenticNet in the dialogue system in order to enhance its emotional intelligence and, hence, achieve a more human-like interaction. The question, after all, is not whether intelligent machines can have any emotions, but whether machines can be intelligent without any emotions [Minsky2006].


We gratefully acknowledge the help of Alan Ritter for sharing the Twitter dialogue dataset and the NTU PDCC center for providing computing resources.