Dialog state tracking, a machine reading approach using Memory Network

06/13/2016 ∙ by Julien Perez, et al. ∙ Xerox ∙ The University of Melbourne

In an end-to-end dialog system, the aim of dialog state tracking is to accurately estimate a compact representation of the current dialog status from a sequence of noisy observations produced by the speech recognition and the natural language understanding modules. This paper introduces a novel method of dialog state tracking based on the general paradigm of machine reading and proposes to solve it using an End-to-End Memory Network, MemN2N, a memory-enhanced neural network architecture. We evaluate the proposed approach on the second Dialog State Tracking Challenge (DSTC-2) dataset. The corpus has been converted in order to frame the hidden state variable inference as a question-answering task based on a sequence of utterances extracted from a dialog. We show that the proposed tracker gives encouraging results. Then, we propose to extend the DSTC-2 dataset with specific reasoning requirements such as counting, list maintenance, yes-no question answering and indefinite knowledge management. Finally, we present encouraging results on these tasks using our proposed MemN2N based tracking model.




1 Introduction

One of the core components of state-of-the-art and industrially deployed dialog systems is a dialog state tracker. Its purpose is to provide a compact representation of a dialog, called the dialog state, produced from past user inputs and system outputs. The dialog state summarizes the information needed to successfully maintain and finish a dialog, such as users’ goals or requests. In the simplest case of a so-called slot-filling schema, the state is composed of a predefined set of variables with a predefined domain of expression for each of them. Even in the recent context of end-to-end trainable, machine-learnt dialog systems, state tracking remains a central element of such architectures [WenGMRSUVY16]. Current models, mainly based on the principle of discriminative learning, tend to share three common limitations. First, the tracking task is performed using a fixed window of the past dialog utterances as support for decision. Second, the possible correlations between the set of tracked variables are not leveraged and individual trackers tend to be learnt independently. Third, the tracking task is reduced to the capability of inferring values for a predefined set of latent variables. Starting from these observations, we propose to formalize the task of state tracking as a particular instance of a machine reading problem. This formalization and the proposed resolution model, called MemN2N [WestonBCM15], make it possible to define a tracker that is able to decide at the utterance level on the basis of the entire dialog so far. The model learns to focus its attention on the meaningful parts of the dialog regarding the currently asked slot and can capture possible correlations between slots. To the best of our knowledge, it is the first attempt to explicitly frame the task of dialog state tracking as a machine reading problem. Finally, such a formalization allows for the implementation of approximate reasoning capabilities, which have been shown to be crucial for any machine reading task [WestonBCM15], while extending the task from slot instantiation to question answering.

This paper is structured as follows. Section 2 recalls the main definitions associated with transactional dialogs and describes the associated problem of statistical dialog state tracking with both the generative and discriminative approaches. At the end of that section, the limitations of the current models in terms of necessary annotations and reasoning capabilities are addressed. Then, Section 3 presents the proposed machine reading model for dialog state tracking and proposes to extend a state of the art dialog state tracking dataset, DSTC-2, with several simple reasoning capabilities. Section 4 illustrates the approach with experimental results obtained using a state of the art benchmark for dialog state tracking.

2 Dialog state tracking

2.1 Main Definitions

A dialog state tracking task is formalized as follows: at each turn $t$ of a dyadic dialog, the dialog agent chooses a dialog act $d_t$ to express and the user answers with an utterance $u_t$. In the simplest case, the dialog state $s_t$ at each turn is defined as a distribution over a set of predefined variables, which define the structure of the state [Williams05]. This classic state structure is commonly called slot filling or semantic frame. In this context, the state tracking task consists of estimating the value of a set of predefined variables in order to perform a procedure or transaction which is the purpose of the dialog. Typically, a natural language understanding (NLU) module processes the user utterance and generates an N-best list $o_t = \{\langle d_1, f_1 \rangle, \ldots, \langle d_n, f_n \rangle\}$, where $d_i$ is the hypothesized user dialog act and $f_i$ is its confidence score. Various approaches have been proposed to define dialog state trackers. The traditional methods used in most commercial implementations use hand-crafted rules that typically rely on the most likely result from an NLU module [YehDJRRPLTBM14] and hardly model uncertainty. However, these rule-based systems are prone to frequent errors as the most likely result is not always the correct one.


More recent methods employ statistical approaches to estimate the posterior distribution over the dialog states, allowing them to leverage the uncertainty of the results of the NLU module. In the simplest case where no ASR and NLU modules are employed, as in a text based dialog system [Henderson13a], the utterance is taken as the observation using a so-called bag of words representation. If an NLU module is available, standardized dialog act schemas can be considered as observations [bunt10]. Furthermore, if prosodic information is available from the ASR component of the dialog system [MiloneR03], it can also be considered as part of the observation definition. A statistical dialog state tracker maintains, at each discrete time step $t$, the probability distribution over states, $b(s_t)$, which is the system’s belief over the state. The actual slot filling process is composed of the cyclic tasks of information gathering and integration, in other words, dialog state tracking. In such a framework, the purpose is to estimate the correct instantiation of each variable as early as possible in the course of a given dialog. In the following, we will assume the state is represented as a set of variables with a set of known possible values associated to each of them. Furthermore, in the context of this paper, only the bag of words has been considered as an observation at a given turn, but dialog acts or detected named entities provided by an SLU module could also have been incorporated.

Two statistical approaches have been considered for maintaining the distribution over a state given sequential NLU output. First, the discriminative approach aims to model the posterior probability distribution of the state at time $t+1$ with regard to the state at time $t$ and the observations $o_{1:t}$. Second, the generative approach attempts to model the transition probability and the observation probability in order to exploit possible interdependencies between the hidden variables that comprise the dialog state.

2.2 Generative Dialog State Tracking

A generative approach to dialog state tracking computes the belief over the state using Bayes’ rule, using the belief from the last turn $b(s_{t-1})$ as a prior and the likelihood given the user utterance hypotheses $p(o_t \mid s_t)$, with $o_t$ the observation gathered at time $t$. In prior works [Williams05], the likelihood is factored and some independence assumptions are made. A typical generative model uses a factorial hidden Markov model [gj97]. In this family of approaches, scalability is considered one of the main issues. One way to reduce the amount of computation is to group the states into partitions, as proposed in the Hidden Information State (HIS) model [GasicY11]. Another approach to cope with the scalability problem in dialog state tracking is to adopt a factored dynamic Bayesian network by making conditional independence assumptions among dialog state components, and then to use approximate inference algorithms such as loopy belief propagation [ThomsonY10] or blocked Gibbs sampling [RauxM11]. To cope with such limitations, the discriminative methods of state tracking presented in the next part of this section aim to directly model the posterior distribution of the tracked state using a chosen parametric form.
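As an illustration of the Bayesian update described in this section, a generic recursive belief update can be written as below. This is the standard Bayes filter under a first-order Markov assumption, offered as a sketch and not as the exact factorization used in [Williams05]. The notation is an assumption made here: $b(\cdot)$ is the belief, $s_t$ the state, $o_t$ the observation and $d_{t-1}$ the previous system dialog act.

```latex
b(s_t) \;\propto\; p(o_t \mid s_t) \sum_{s_{t-1}} p(s_t \mid s_{t-1}, d_{t-1}) \, b(s_{t-1})
```

The sum over all previous states $s_{t-1}$ is what makes exact inference expensive when the state space is large, which is the scalability issue that partitioning and factored approximations try to reduce.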

2.3 Discriminative Dialog State Tracking

The discriminative approach to dialog state tracking computes the belief over a state via a parametric model that directly represents the belief $b(s_{t+1}) = p(s_{t+1} \mid s_t, o_{1:t})$. For example, Maximum Entropy has been widely used in the discriminative approach [MetallinouBW13]. It formulates the belief as follows: $b(s) = P(s \mid x) = \eta \cdot e^{w^\top \phi(x, s)}$, where $\eta$ is the normalizing constant, $x = (d^u_1, d^s_1, s_1, \ldots, d^u_t, d^s_t, s_t)$ is the history of user dialog acts, $d^u_i, i \in \{1, \ldots, t\}$, the system dialog acts, $d^s_i, i \in \{1, \ldots, t\}$, and the sequence of states leading to the current dialog turn at time $t$. Then, $\phi(\cdot)$ is a vector of feature functions on $x$ and $s$. Finally, $w$ is the set of model parameters to be learned from annotated dialog data. Deep neural models, operating on a sliding window of features extracted from previous user turns, have also been proposed in [Henderson14, SWTY16]. In the current literature, this family of approaches has proven to be the most effective on publicly available state tracking datasets. Recently, deep learning based models implementing this strategy [SWTY16, Henderson2014d, WilliamsRH16a] have shown state of the art results. These approaches tend to leverage word representations learnt without supervision [MikolovSCCD13].
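To make the Maximum Entropy formulation more concrete, the following minimal sketch computes a belief of the form $b(s) = \eta \cdot e^{w^\top \phi(x, s)}$ over a small set of candidate values. The feature function `toy_features`, the weights and the example history are hypothetical and chosen only for illustration; they are not taken from [MetallinouBW13].

```python
import math


def maxent_belief(weights, feature_fn, history, candidate_states):
    """Return b(s) = eta * exp(w . phi(x, s)) for every candidate state s."""
    scores = {}
    for state in candidate_states:
        phi = feature_fn(history, state)
        scores[state] = sum(w * f for w, f in zip(weights, phi))
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(scores.values())
    exp_scores = {s: math.exp(v - top) for s, v in scores.items()}
    norm = sum(exp_scores.values())  # plays the role of 1 / eta
    return {s: v / norm for s, v in exp_scores.items()}


def toy_features(history, state):
    """Hypothetical features: [value in last user turn, value anywhere in history]."""
    last_turn = history[-1].lower().split()
    all_words = " ".join(history).lower().split()
    return [1.0 if state in last_turn else 0.0,
            1.0 if state in all_words else 0.0]


history = ["i want cheap food", "actually make it expensive"]
belief = maxent_belief([2.0, 1.0], toy_features, history,
                       ["cheap", "moderate", "expensive"])
```

In this toy setting the value mentioned in the most recent turn receives the highest belief, while a value mentioned earlier still ranks above one never mentioned.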

2.4 Current Limitations

Using error analysis [HendersonTW14], three limitations can be observed in the application of these inference approaches. First, current models tend to fail at considering the long-range dependencies that occur in dialogs. For example, coreferences, inter-utterance information and correlations between slots have been shown to be difficult to handle even with the usage of recurrent network models [Henderson2014d]. To illustrate the inter-slot correlation, Figure 1 depicts the t-SNE [Maaten2008] projection of the final dialog states of the DSTC-2 training set. Second, reasoning capabilities, as required in machine reading applications [PoonD10, EtzioniBC07, BerantSCLHHCM14, WestonBCM15], remain absent in these classic formalizations of dialog state tracking. Finally, the tracking definition is limited to the capability to instantiate a predefined set of slots. In the next section, we present a model of dialog state tracking that aims at leveraging the current advances of MemN2N, a memory-enhanced neural network, and its approximate reasoning capabilities, which seem particularly adapted to the sequential, long-range-dependent and sparse nature of complex dialog state tracking tasks. Furthermore, this model makes it possible to relax the hypothesis of strict utterance-level annotation, which does not correspond to common practices in industrial applications of transactional conversational user interfaces, where annotations tend to be placed at a multi-utterance level or full-dialog level only.

Figure 1: T-SNE transformation of the final state of DSTC-2 train set.

3 Machine Reading Formulation of Dialog State Tracking

We propose to formalize the dialog state tracking task as a machine reading problem [EtzioniBC07, BerantSCLHHCM14]. In this section, we recall the main definitions of the task of machine reading, then describe MemN2N, a memory-enhanced neural network architecture proposed to handle such tasks in the context of dialogs. Finally, we formalize the task of dialog state tracking as a machine reading problem and propose to solve it using a memory-enhanced neural architecture of inference.

3.1 Machine Reading

The task of textual understanding has recently been formulated as a supervised learning problem [KumarISBEPOGS15, HermannKGEKSB15]. This task consists in estimating the conditional probability $p(a \mid d, q)$ of an answer $a$ to a question $q$, where $d$ is a document. Such an approach requires a large training corpus of {Document - Query - Answer} triples, and until now such corpora have been limited to hundreds of examples [RichardsonBR13]. In the context of dialog state tracking, it can be understood as the capability of inferring a set of latent values $l$ associated with a set of variables $v$ related to a given dyadic or multi-party conversation $d$, from direct correlation and/or reasoning, using the course of exchanges of utterances, $p(l \mid d, v)$.

State updates at an utterance level are rarely provided off-the-shelf by a production environment. In these environments, annotation is often performed afterwards for the purpose of logging, monitoring or quality assessment. In the limit case, as in human-to-human dialog systems, dialog-level annotation remains a common practice, especially in personal assistance, customer care dialogs and, in a more general sense, industrial applications of transactional conversational user interfaces. Another frequent setting consists of informing the state after a given number of utterance exchanges between the locutors. An additional effort of specific annotation is therefore often needed in order to train a state of the art statistical state tracking model [HendersonTW14]. In that sense, formalizing dialog state tracking at a subdialog level, in order to infer hidden state variables with respect to a list of utterances running from the first one to any given utterance of a given dialog, seems particularly appropriate. In the context of dialog state tracking challenges, the DSTC-4 dialog corpus has been designed for this purpose but only consists of 22 dialogs. Concerning the DSTC-2 corpus, the training data contains 2207 dialogues (15611 turns) and the test set consists of 1117 dialogues [WilliamsRH16a]. This dataset is more suitable for our experiments.

For these reasons, the machine reading paradigm becomes a promising formulation for the general problem of dialog state tracking. Furthermore, current approaches and available datasets for state tracking do not explicitly cover reasoning capabilities such as temporal and spatial reasoning, counting, sorting and deduction. We suggest that future datasets should include dialogs expressing such specific abilities. In the remainder of this section, several reasoning enhancements to the DSTC-2 dataset are suggested.

3.2 End-to-End Memory Networks

The MemN2N architecture, introduced by [WestonBCM15], consists of two main components: supporting memories and final answer prediction. Supporting memories are in turn comprised of a set of input and output memory representations with memory cells. The input and output memory cells, denoted by $m_i$ and $c_i$, are obtained by transforming the input context $x_1, \ldots, x_n$ (i.e. a set of utterances) using two embedding matrices $A$ and $C$ (both of size $d \times |V|$, where $d$ is the embedding size and $|V|$ the vocabulary size) such that $m_i = A\Phi(x_i)$ and $c_i = C\Phi(x_i)$, where $\Phi(\cdot)$ is a function that maps the input into a bag of dimension $|V|$.

Similarly, the question $q$ is encoded using another embedding matrix $B \in \mathbb{R}^{d \times |V|}$, resulting in a question embedding $u = B\Phi(q)$. The input memories $\{m_i\}$, together with the embedding of the question $u$, are utilized to determine the relevance of each of the stories in the context, yielding a vector of attention weights

$$p_i = \mathrm{softmax}(u^\top m_i) \qquad (1)$$

where $\mathrm{softmax}(a_i) = \frac{e^{a_i}}{\sum_j e^{a_j}}$. Subsequently, the response $o$ from the output memory is constructed by the weighted sum:

$$o = \sum_i p_i c_i \qquad (2)$$

Other models of parametric encoding for the question and the document have been proposed in [KumarISBEPOGS15]. For the purpose of this presentation, we will keep to the definition of $\Phi$ given above.

For more difficult tasks requiring multiple supporting memories, the model can be extended to include more than one set of input/output memories by stacking a number of memory layers. In this setting, each memory layer is named a hop and the $(k+1)^{th}$ hop takes as input the output of the $k^{th}$ hop:

$$u^{k+1} = o^k + u^k \qquad (3)$$

Lastly, the final step, the prediction of the answer $a$ to the question $q$, is performed by

$$\hat{a} = \mathrm{softmax}(W(o^K + u^K)) \qquad (4)$$

where $\hat{a}$ is the predicted answer distribution, $W \in \mathbb{R}^{|V| \times d}$ is a parameter matrix for the model to learn and $K$ the total number of hops.
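The forward pass described in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the same matrices `A` and `C` are reused at every hop, the weights are random and untrained, and the toy vocabulary, answer list and helper names (`bow`, `matvec`, `rand_matrix`, `memn2n_forward`) are hypothetical.

```python
import math
import random


def bow(tokens, vocab):
    """Phi: map a token list to a bag-of-words count vector of size |V|."""
    vec = [0.0] * len(vocab)
    for tok in tokens:
        vec[vocab[tok]] += 1.0
    return vec


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memn2n_forward(sentences, question, A, C, B, W, vocab, hops):
    """Return the answer distribution and the attention weights of each hop."""
    mem_in = [matvec(A, bow(s, vocab)) for s in sentences]   # m_i = A Phi(x_i)
    mem_out = [matvec(C, bow(s, vocab)) for s in sentences]  # c_i = C Phi(x_i)
    u = matvec(B, bow(question, vocab))                      # u = B Phi(q)
    attention = []
    for _ in range(hops):
        # p_i = softmax(u . m_i)
        p = softmax([sum(a * b for a, b in zip(u, m)) for m in mem_in])
        # o = sum_i p_i c_i
        o = [sum(p_i * c[k] for p_i, c in zip(p, mem_out)) for k in range(len(u))]
        # u^{k+1} = o^k + u^k
        u = [o_k + u_k for o_k, u_k in zip(o, u)]
        attention.append(p)
    # After the last hop, u already equals o^K + u^K.
    return softmax(matvec(W, u)), attention


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
words = "im looking for a cheap restaurant what part of town i dont care is the area".split()
vocab = {w: i for i, w in enumerate(sorted(set(words)))}
answers = ["north", "south", "dontcare"]  # hypothetical answer vocabulary
dim = 4
A, C, B = (rand_matrix(dim, len(vocab)) for _ in range(3))
W = rand_matrix(len(answers), dim)
dialog = [["im", "looking", "for", "a", "cheap", "restaurant"],
          ["what", "part", "of", "town"],
          ["i", "dont", "care"]]
probs, attention = memn2n_forward(dialog, ["what", "is", "the", "area"],
                                  A, C, B, W, vocab, hops=3)
```

Here `probs` is a distribution over the answer list and `attention` holds one weight vector per hop, one weight per utterance in memory, which is the quantity a per-hop attention table would report.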

Two weight tying schemes of the embedding matrices have been introduced in [WestonBCM15]:

  1. Adjacent: the output embedding matrix in the $k^{th}$ hop is shared with the input embedding matrix in the $(k+1)^{th}$ hop, i.e., $A^{k+1} = C^k$ for $k \in \{1, \ldots, K-1\}$. Also, the weight matrix $W$ in Equation (4) is shared with the output embedding matrix in the last memory hop such that $W^\top = C^K$.

  2. Layer-wise: all the weight matrices $A^k$ and $C^k$ are shared across different hops, i.e., $A^1 = A^2 = \ldots = A^K$ and $C^1 = C^2 = \ldots = C^K$.
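The two tying schemes can be pictured as different ways of sharing parameter objects across hops. The sketch below is only an illustration: `new_matrix` is a hypothetical factory that returns a fresh parameter matrix, and sharing is expressed by reusing the same Python object. The additional sharing of the final prediction matrix with the last output embedding in the adjacent scheme is noted in a comment but not implemented.

```python
def adjacent_tying(new_matrix, hops):
    """Adjacent scheme: the input matrix of hop k+1 is the output matrix of hop k."""
    A = [new_matrix()]
    C = []
    for k in range(hops):
        C.append(new_matrix())
        if k + 1 < hops:
            A.append(C[k])  # same object, so the parameters are shared
    # In the adjacent scheme the prediction matrix would also be tied to C[-1]
    # (as its transpose); that step is left out of this sketch.
    return A, C


def layerwise_tying(new_matrix, hops):
    """Layer-wise scheme: one input matrix and one output matrix for all hops."""
    a, c = new_matrix(), new_matrix()
    return [a] * hops, [c] * hops


adj_A, adj_C = adjacent_tying(lambda: [[0.0]], hops=3)
lay_A, lay_C = layerwise_tying(lambda: [[0.0]], hops=3)
```

With three hops, the adjacent scheme allocates four distinct embedding matrices, whereas the layer-wise scheme allocates two.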





Figure 2: Illustration of the proposed MemN2N based state dialog tracker model with hops.

In the next section, we show how the task of dialog state tracking can be formalized as a machine reading task and solved using such a memory-enhanced model.

3.3 Dialog Reading Model for State Tracking

In this section, we formalize dialog state tracking using the paradigm of machine reading. To the best of our knowledge, it is the first attempt to apply this approach and to develop a specific dataset format, detailed in Section 4, from an existing and publicly available dialog state tracking challenge dataset to fulfill this task. We assume (1) a dyadic dialog $d$ composed of a list of utterances and (2) a state composed of (2a) a set of variables $v_i$ with $i \in \{1, \ldots, n_v\}$ and (2b) a set of corresponding assigned values $l_i$. One can define a question $q_v$ that corresponds to the specific querying of a variable in the context of a dialog, $p(l_i \mid q_{v_i}, d)$. In such a context, a dialog state tracking task consists in determining for each variable $v$, $l^* = \arg\max_{l_i \in L} p(l_i \mid q_{v_i}, d)$, with $L$ the specific domain of expression of a variable $v_i$.

In addition to the actual dataset, we propose to investigate four general reasoning tasks using the DSTC-2 dataset as a starting point. In this way, we leverage the DSTC-2 dataset to create more complex reasoning tasks than the ones present in the original dialogs by performing rule-based modifications over the corpus. The goal is to develop resolution algorithms that are not dedicated to a specific reasoning task but inference models that are as generic as possible. In the rest of the section, each of the reasoning tasks associated with dialog state tracking is described and the generation protocol is explained with examples.

Factoid Questions : This first task corresponds to the current formulation of dialog state tracking. It consists of questions where a previously given set of supporting facts, potentially amongst a set of other irrelevant facts, provides the answer. This kind of task was already employed in [WestonCB14] in the context of a virtual world. In that sense, the results obtained on this task are comparable with the state of the art approaches.

Yes/No Questions : This task tests the ability of a model to answer true/false type questions like “Is the food italian ?”. The conversion of a dialog to such format is deterministic regarding the fact that the utterances and corresponding true states are known at each utterance of a given dialog.

Indefinite Knowledge : This task tests a more complex natural language construction. It tests whether statements can be modeled in order to describe possibilities rather than certainties, as proposed in [WestonCB14]. In our case, the answer will be “maybe” to the question “Is the price-range required moderate ?” if the slot hasn’t been mentioned yet throughout the current dialog. In the case of state tracking, this makes it possible to deal seamlessly with unknown information about the dialog state. Concretely, this set of questions and answers is generated as a superset of the Yes-No Questions set. First, sub-dialogs starting from the first utterance of a given dialog are extracted under the condition that a given slot is not informed in the corresponding annotation. Then, a question-answer pair is generated.

Counting and Lists/Sets : This last task tests the capacity of the model to perform simple counting operations, by asking about the number of objects with a certain property, e.g. “How many area are requested ?”. Similarly, the ability to produce a set of single word answers in the form of a list, e.g. “What are the area requested ?”, is investigated. Table 1 gives an example of each of the question types presented above on a dialog sample of the DSTC-2 corpus.

Inference procedure: Concretely, the current set of utterances of a dialog will be placed into the memory using sentence based encoding and the question will be encoded as the initial controller state. The answer will be produced using a softmax operation over the answer vocabulary, which is assumed to be fixed. We consider this hypothesis valid in the case of factoid and list questions because the set of values for a given variable is often considered known. In the cases of Yes/No and Indefinite Knowledge questions, {Yes, No, Maybe} are added to the output vocabulary. Following [WestonCB14], a list-task answer will be considered as a single element in the answer set, as will the answer to a count question. A possible alternative would be to change the activation function used at the output of the MemN2N from a softmax activation function to a logistic one and to use a categorical cross entropy loss. A drawback of such an alternative would be the necessity of cross-validating a decision threshold in order to select the eligible answers. Concerning the individual numbers for the count question set, the numbers found in the training set are added to the vocabulary.

| Index | Actor | Utterance |
|---|---|---|
| 1 | Cust | Im looking for a cheap restaurant in the west or east part of town. |
| 2 | Agent | Thanh Binh is a nice restaurant in the west of town in the cheap price range. |
| 3 | Cust | What is the address and post code. |
| 4 | Agent | Thanh Binh is on magdalene street city centre. |
| 5 | Cust | Thank you goodbye. |
| 6 | Factoid Question | What is the pricerange ? Answer: {Cheap} |
| 7 | Yes/No Question | Is the Pricerange Expensive ? Answer: {No} |
| 8 | Indefinite Knowledge | Is the FoodType chinese ? Answer: {Maybe} |
| 9 | Listing task | What are the areas ? Answer: {West,East} |

Table 1: Dialog state tracking question-answering examples from the DSTC-2 dataset
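The conversion of an annotated state into question-answer pairs for the four tasks can be sketched as follows. This is an illustrative assumption about how such a rule-based generator might look, not the authors' script: the question templates, the answer strings and the function name `make_questions` are hypothetical, and the small ontology is a toy subset.

```python
def make_questions(state, slot_values):
    """Generate (task, question, answer) triples from an annotated dialog state.

    state: dict mapping each slot to the list of values informed so far
           (an empty list means the slot has not been mentioned yet).
    slot_values: dict mapping each slot to its possible values.
    """
    qa = []
    for slot, informed in state.items():
        if len(informed) == 1:
            qa.append(("factoid", f"What is the {slot} ?", informed[0]))
        if informed:
            qa.append(("count", f"How many {slot} are requested ?", str(len(informed))))
            # The whole list is treated as a single answer symbol.
            qa.append(("list", f"What are the {slot} requested ?", ",".join(informed)))
        for value in slot_values[slot]:
            if not informed:
                # Slot never mentioned: the answer is indefinite.
                qa.append(("indefinite", f"Is the {slot} {value} ?", "maybe"))
            else:
                answer = "yes" if value in informed else "no"
                qa.append(("yes-no", f"Is the {slot} {value} ?", answer))
    return qa


toy_ontology = {"pricerange": ["cheap", "moderate", "expensive"],
                "area": ["west", "east", "north"],
                "food": ["chinese", "italian"]}
toy_state = {"pricerange": ["cheap"], "area": ["west", "east"], "food": []}
qa_pairs = make_questions(toy_state, toy_ontology)
```

Applied to the state of the sample dialog in Table 1, this produces a factoid question for the price range, yes/no questions for informed slots, "maybe" answers for the unmentioned food slot, and count and list questions for the area.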

We believe more reasoning capabilities need to be explored in the future, like spatial and temporal reasoning or deduction as suggested in [WestonBCM15]. However, this will probably require the development of a new dedicated resource. Another alternative could be to develop a question-answering annotation task based on a dialog corpus where reasoning tasks are present. The closest work to our proposal is [BordesW16]. In that paper, the authors define a so-called End-to-End learnable dialog system to infer an answer from a finite set of eligible answers with respect to the current list of utterances of the dialog. The authors generate artificial dialog tasks. However, the reasoning capabilities are not explicitly addressed and the authors explicitly state that the resulting dialog system is not satisfactory yet. We believe that having a proper dialog state tracker on top of which a policy is built can help ensure dialog achievement by properly optimizing a reward function through an explicitly learnt dialog policy. In the case of proper end-to-end systems, the objective function is still not explicitly defined [SerbanLCP15] and the resulting systems tend to be used in the context of chat-oriented and non-goal-oriented dialog systems. In the next section, we present experimental details and results obtained on the basis of the DSTC-2 dataset and its conversion to the four mentioned reasoning tasks.

4 Experiments

4.1 Dataset and Data Preprocessing

In the DSTC-2 dialog corpus, a user queries a database of local restaurants by interacting with a dialog system. A dialog proceeds as follows: first, the user specifies constraints concerning the restaurant. Then, the system offers the name of a restaurant that satisfies the constraints. Finally, the user accepts the offer and requests additional information about the accepted restaurant. In this context, the dialog state tracker should be able to track several types of information that compose the state, like the geographic area, the food type and the price range slots. In order to make comparable experiments, subdialogs running from the first utterance to each utterance of each dialog of the corpus have been generated. The corresponding question-answer pairs have been generated using the annotated state for each subdialog. In the case of factoid questions, this setting allows for a fair comparison with the prior art at the utterance level. The same protocol has been adopted for the generated reasoning tasks. In that sense, the tracker task consists in finding the value as defined in Section 3.3. In the overall dialog corpus, the Area slot has 5 possible values, the Food slot has 91 possible values and the Pricerange slot has 3 possible values.

In order to exhibit the reasoning capability of the proposed model in the context of dialog state tracking, three other datasets have been automatically generated from the dialog corpus in order to support the reasoning capabilities described in Section 3.3. Dialog modification has been required for two reasoning tasks, List and Count. Two types of rules have been developed to automatically produce modified dialogs. First, string matching has been performed to determine the position of a slot value in a given utterance and an alternative statement has been produced as a substitution. For example, the utterance “I’m looking for a chinese restaurant in the north” can be replaced by “I’m looking for a chinese restaurant in the north or the west of town”. A second type of modification has been performed in an inter-utterance fashion. For example, assuming a given value “north” has been informed in the current state of a given dialog, one can add later in the dialog a remark like “I would also accept a place east side of town”. This kind of statement tends not to affect the overall flow of the dialog and adds richer semantics to it. In the future, we plan to develop a richer set of generation procedures to augment the dataset. Nevertheless, we believe this simple dialog augmentation strategy makes it possible to exhibit the competency of the proposed model beyond factoid questions.
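The two preprocessing steps described in this section, subdialog generation and string-matching substitution, can be sketched as below. This is a simplified illustration under assumptions: the function names are hypothetical, the substitution rule only handles the pattern "the <value>", and the real rules used to build the datasets are not specified in the text.

```python
def subdialogs(utterances, states):
    """Pair every prefix of the dialog with the state annotated at its last utterance."""
    return [(utterances[: i + 1], states[i]) for i in range(len(utterances))]


def add_alternative(utterance, value, alternative, informed_values):
    """Intra-utterance rule: turn 'the <value>' into 'the <value> or the <alternative>'.

    Returns the modified utterance and the updated list of informed values.
    If the value is not found, both are returned unchanged.
    """
    needle = f"the {value}"
    if needle not in utterance:
        return utterance, list(informed_values)
    modified = utterance.replace(needle, f"the {value} or the {alternative}", 1)
    return modified, list(informed_values) + [alternative]


prefixes = subdialogs(
    ["im looking for a chinese restaurant in the north", "thank you goodbye"],
    [{"area": ["north"]}, {"area": ["north"]}],
)
new_utterance, new_values = add_alternative(
    "im looking for a chinese restaurant in the north", "north", "west", ["north"]
)
```

Each prefix produced by `subdialogs` becomes one reading "document", and the updated value list returned by `add_alternative` is what List and Count questions would be generated from.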

4.2 Training Details

As suggested in [SukhbaatarSWF15], a portion of the training set was held out to form a validation set for hyperparameter tuning. Concerning the utterance encoding, we use the so-called Temporal Encoding technique, since reading tasks require some notion of temporal context. To enable the model to address it, the memory vector is modified such that $m_i = A\Phi(x_i) + T_A(i)$, where $T_A(i)$ is the $i^{th}$ row of a dedicated matrix $T_A$ that encodes temporal information. The output embedding is augmented in the same way with a matrix $T_C$ (e.g. $c_i = C\Phi(x_i) + T_C(i)$). Both $T_A$ and $T_C$ are learned during training in an end-to-end fashion. They are also subject to the same sharing constraints as $A$ and $C$. The embedding matrices are initialized using the GoogleNews word2vec embedding model [MikolovSCCD13]. As also suggested in [SukhbaatarSWF15], utterances are indexed in reverse order, reflecting their relative distance from the question, so that $x_1$ is the last sentence of the dialog. Furthermore, the adjacent weight tying scheme has been adopted. The learning rate is decayed exponentially at regular epoch intervals until a maximum number of epochs is reached. Linear start is used in all our experiments, as proposed by [SukhbaatarSWF15]: the softmax function in each memory layer is removed at the start of training and re-inserted after a fixed number of epochs. Training uses mini-batches, and gradients whose norm exceeds a fixed threshold are rescaled so that their norm equals that threshold. All weights are initialized randomly from a Gaussian distribution with zero mean and a fixed standard deviation. In all our experiments, we have tested a set of embedding sizes $d \in \{20, 40, 60\}$. After validation, each model uses a 5-hop depth configuration.
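Two of the training details above, temporal encoding with reverse indexing and gradient rescaling by norm, lend themselves to a short sketch. The code is illustrative only: the function names are hypothetical, the Euclidean norm is an assumption, and the numeric threshold used in the usage lines is a placeholder, not the value used in the experiments.

```python
import math


def reverse_index(num_utterances):
    """1-based indices such that the last utterance of the dialog gets index 1."""
    return [num_utterances - pos for pos in range(num_utterances)]


def add_temporal(memory_vectors, T):
    """Add to each memory vector the row of T selected by its reverse index."""
    n = len(memory_vectors)
    encoded = []
    for pos, memory in enumerate(memory_vectors):
        t_row = T[n - pos - 1]  # row 0 of T is used by the last utterance
        encoded.append([m + t for m, t in zip(memory, t_row)])
    return encoded


def clip_by_norm(gradient, max_norm):
    """Rescale the gradient so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    return [g * max_norm / norm for g in gradient]


indices = reverse_index(3)
encoded = add_temporal([[0.0], [0.0], [0.0]], [[1.0], [2.0], [3.0]])
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)  # placeholder threshold
```

Because the temporal rows are indexed from the end of the dialog, the most recent utterance always receives the same temporal vector regardless of dialog length.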

4.3 Experimental results

| Locutor | Utterance | Hop 1 | Hop 2 | Hop 3 | Hop 4 | Hop 5 |
|---|---|---|---|---|---|---|
| Cust | Im looking for a cheap restaurant that serves chinese food | 0.00 | 0.18 | 0.11 | 0.04 | 0.00 |
| Agent | What part of town do you have in mind | 0.33 | 0.30 | 0.00 | 0.00 | 0.00 |
| Cust | I dont care | 0.00 | 0.00 | 0.17 | 0.37 | 1.00 |
| Agent | Rice house serves chinese food in the cheap price range | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| Cust | What is the address and telephone number | 0.58 | 0.09 | 0.01 | 0.00 | 0.00 |
| Agent | Sure rice house is on mill road city centre | 0.03 | 0.00 | 0.00 | 0.00 | 0.00 |
| Cust | Phone number | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| Agent | The phone number of rice house is 765-239-09 | 0.02 | 0.01 | 0.00 | 0.00 | 0.00 |
| Cust | Thank you good bye | 0.02 | 0.42 | 0.71 | 0.59 | 0.00 |

Question: What is the area ? Answer: dontcare

Table 2: Attention shifting example for the Area slot from the DSTC-2 dataset. The values correspond to the attention weights assigned to each memory block at each hop of the MemN2N.

Table 3 presents the tracking accuracy obtained for three variables of the DSTC-2 dataset formulated as a Factoid Question task. We compare with two established utterance-level discriminative neural trackers, a Recurrent Neural Network (RNN) model [Henderson2014d] and the Neural Belief Tracker [SWTY16]. As suggested in this last work, the first RNN baseline model uses no semantic (i.e. synonym) dictionary, while the improved baseline uses a hand-crafted semantic dictionary designed for the DSTC-2 ontology. In this context, a MemN2N model obtains results competitive with the closest non-memory-enhanced state of the art approach, a recurrent neural network with word embeddings as prior knowledge.

| Model | Area | Food | Price | Joint |
|---|---|---|---|---|
| RNN - no dict. | 0.92 | 0.86 | 0.86 | 0.69 |
| RNN + sem. dict. | 0.91 | 0.86 | 0.93 | 0.73 |
| NBT-DNN | 0.90 | 0.84 | 0.94 | 0.72 |
| NBT-CNN | 0.90 | 0.83 | 0.93 | 0.72 |
| MemN2N | 0.89 | 0.88 | 0.95 | 0.74 |

Table 3: One supporting fact task: accuracy obtained on the DSTC-2 test set
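For readers unfamiliar with the columns of Table 3, per-slot and joint accuracy can be computed as in the sketch below. The definition of joint accuracy used here, the fraction of turns for which every slot is predicted correctly, is the usual convention and is an assumption, since the text does not define it; the toy predictions are hypothetical.

```python
def slot_and_joint_accuracy(predictions, gold, slots):
    """Return per-slot accuracy and the fraction of turns with all slots correct."""
    total = len(gold)
    per_slot = {
        slot: sum(p[slot] == g[slot] for p, g in zip(predictions, gold)) / total
        for slot in slots
    }
    joint = sum(
        all(p[slot] == g[slot] for slot in slots)
        for p, g in zip(predictions, gold)
    ) / total
    return per_slot, joint


gold_states = [{"area": "north", "food": "chinese"},
               {"area": "west", "food": "thai"}]
predicted_states = [{"area": "north", "food": "chinese"},
                    {"area": "west", "food": "italian"}]
per_slot, joint = slot_and_joint_accuracy(predicted_states, gold_states, ["area", "food"])
```

Under this definition the joint score can never exceed the lowest per-slot score, which is consistent with the Joint column being lower than the individual slot columns.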

As a second result, Table 4 presents the performance obtained on the reasoning tasks. These results suggest that MemN2N models are a competitive alternative for the task of dialog state tracking and also broaden the definition of the general dialog state tracking task to machine reading and reasoning. In the future, we believe new reasoning capabilities like spatial and temporal reasoning and deduction should be explored on the basis of a specifically designed dataset.

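| Variable | d | Yes-No | I.K. | Count. | List. |
|---|---|---|---|---|---|
| Food | 20 | 0.85 | 0.79 | 0.89 | 0.41 |
| Food | 40 | 0.83 | 0.84 | 0.88 | 0.42 |
| Food | 60 | 0.82 | 0.82 | 0.90 | 0.39 |
| Area | 20 | 0.86 | 0.83 | 0.94 | 0.79 |
| Area | 40 | 0.90 | 0.89 | 0.96 | 0.75 |
| Area | 60 | 0.88 | 0.90 | 0.95 | 0.78 |
| PriceRange | 20 | 0.93 | 0.86 | 0.93 | 0.83 |
| PriceRange | 40 | 0.92 | 0.85 | 0.90 | 0.80 |
| PriceRange | 60 | 0.91 | 0.85 | 0.91 | 0.81 |

Table 4: Reasoning tasks: accuracy on the DSTC-2 reasoning datasets (d is the embedding size; I.K. stands for Indefinite Knowledge)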

5 Conclusion and Further Work

This paper describes a novel method of dialog state tracking based on the paradigm of machine reading and solved using MemN2N, a memory-enhanced neural network architecture. In this context, a dataset format inspired by the current datasets of machine reading tasks has been developed for this task. It is the first attempt to solve this classic sub-problem of dialog management in such a way. Beyond the experimental results presented in Section 4, the proposed approach offers several advantages compared to state of the art methods of tracking. First, the proposed method makes it possible to perform tracking on the basis of segment-level or dialog-level annotation instead of the utterance-level annotation that is commonly assumed in academic datasets but tedious to produce in a large scale industrial environment. Second, we propose to develop dialog corpora requiring reasoning capabilities to exhibit the potential of the proposed model. In future work, we plan to address more complex tasks like spatial and temporal reasoning, sorting or deduction and to experiment with other memory-enhanced inference models. In particular, we plan to experiment with and compare the same approach using the Stack-Augmented Recurrent Neural Network [JoulinM15] and the Neural Turing Machine [GravesWD14], which also appear promising for this family of reasoning tasks.