Deep Reinforcement Learning for Inquiry Dialog Policies with Logical Formula Embeddings

08/02/2017 ∙ by Takuya Hiraoka, et al. ∙ PKSHA Technology nec global 0

This paper is the first attempt to learn the policy of an inquiry dialog system (IDS) by using deep reinforcement learning (DRL). Most IDS frameworks represent dialog states and dialog acts with logical formulae. In order to make learning inquiry dialog policies more effective, we introduce a logical formula embedding framework based on a recursive neural network. The results of experiments to evaluate the effect of 1) the DRL and 2) the logical formula embedding framework show that the combination of the two are as effective or even better than existing rule-based methods for inquiry dialog policies.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The objective of inquiry dialogs is to cooperatively answer questions (or problems) shared by participants [1]. In inquiry dialogs, participants (including the dialog system) do not have complete domain knowledge, so they share their own knowledge with their partners. This setting is different from slot-filling dialog settings (e.g., [2, 3, 4]), where the systems are required to have complete domain knowledge. Thus, the realization of practical IDSs extends the capabilities of current dialog systems. In addition, IDSs can actively expand their own knowledge bases through dialogs, which helps to reduce the costs of manual knowledge base expansion.

Although there has been previous research on IDS, the focus has not been on learning its policies. Amgoud, Parsons and McBurney have discussed the fundamental principles of inquiry dialog [5, 6, 7, 8], and Black and Hunter proposed policies for inquiry dialogs [9, 10]. In addition, Fan and Toni proposed policies for inquiry and information-seeking dialogs [11, 12]. These studies dealt with rule-based policies that only work efficiently in limited situations; learning efficient policies in various conditions remains still an open problem.

In this work, we apply DRL to learn the policies for IDS. In addition, in order to make learning these policies more effective, we propose a logical formula embedding framework. In Section 2, we introduce our inquiry dialog domain and IDS framework. We show how dialog states and dialog acts are represented by logical formulae in the framework. In Section 3, we explain how to apply DRL to inquiry dialogs. Specifically, we use “Deep Q-Learning with experience replay” (DQL) [13, 14] for DRL and introduce a logical formula embedding framework to it. In Section 4, we evaluate the effectiveness of the DRL and the logical formula embedding framework for producing good policies. We conclude the paper in Section 5 with a brief summary.

Our research contribution is two-fold: first, to our knowledge, this is the first study that applies reinforcement learning to learning inquiry dialog policies, and second, we introduce logical formula embedding frameworks for DRL in order to make learning inquiry dialog policies more effective.

2 Inquiry Dialog

2.1 Inquiry Dialog and its Domain

In inquiry dialogs, an IDS and its user collaborate in order to answer their shared questions. They start the dialog with shared questions (queries), and then, in order to come up with an answer, they share what they believe in (beliefs) by reasonable assertion (arguments) and then make new arguments reflecting those shared beliefs. Sharing their beliefs is continued until they find an answer (or find that there is no possible answer). The shared beliefs are stored in the commitment store, and participant’s beliefs are stored in the belief base.

As an example of inquiry dialog domains, we discuss a system and its user working on Compliance Violation Detection. In this setting, the system and the user play detectives. They are given different information sources from which they extract information, and then start discussing to answer the query “Is there a compliance violation?”. An example of one participant’s beliefs and a dialog is shown in Table 1. The beliefs pertain to information (such as the e-mail contents of suspects) that they surveyed and to compliance violation. An example of beliefs is shown in Table 1a. The belief “Company A proposed a price to another company” is contained in the user’s belief base, and the beliefs “Company B accepted a proposal” and “If company A proposed a price and company B accepted it, it is a compliance violation” are contained in the system’s belief base. They exchange their arguments in order to answer the given query. Examples of arguments and how to share beliefs are shown in Table 1b. The user conveys his belief “Company A proposed a price in the email.” to the system, with the argument “Company A proposed a price and company B accepted it.” After that, the system argues “Company B accepted a proposal. If the proposal is made by company A, it is a compliance violation.” reflecting the belief shared by the user. The system’s argument is the answer to the query. After that, both participants express that they have nothing to discuss anymore, and the dialog is closed.

Transcription of belief Logical formula representation

System’s belief base
Company B accepted a proposal.
If company A proposed a price and company B accepted it,
it is a compliance violation.

User’s belief base
Company A proposed a price to another company.

Is there a compliance violation?
i sp Natural language transcription / dialog act Commitment store CS Current focus of query store cQS
1 Sys. If company A proposed a price and company B accepted it,
it is a compliance violation. Let’s discuss it.
2 Usr. Company A proposed a price in the e-mail.

Sys. Company B accepted a proposal from another company.
If the proposal is made by company A, it is a compliance violation.
4 Usr. True. I don’t have anything to discuss on the current topic. / Close()
5 Sys. Me neither. / Close()
6 Usr. Do you still have anything to say about the topic
”Is there a compliance violation?”
I don’t have anything further to say. / Close()
7 Sys. No. / Close()
Table 1: An example of an inquiry dialog for Compliance Violation Detection. The upper part of the table (Table 1a) lists the participant’s beliefs and the given query, and the lower part (Table 1b) gives an example dialog.

2.2 Framework for Inquiry Dialog Systems

Our framework basically follows the IDS framework Black and Hunter proposed [9, 10] because it is clearly defined enough to be implemented easily.

In our framework, beliefs are represented with (first-order predicate) logical formulae. In first-order predicate logic, a logical formula is composed of an atom , a ground atom , or their negations . Beliefs are categorized into two types according to the form of the logical formula:

State belief

: represented by conjunction of ground atoms: ,

Domain belief

: represented by an inference rule:

where represents logical conjunction and represents implication. The set of all possible beliefs is denoted by . The third column of Table 1a shows examples of beliefs. Representing beliefs with logical formulae is commonly seen in most of previous research on IDSs.

Dialog states are composed of 1) system beliefs, 2) a commitment store, and 3) a query store. System belief and commitment store are subsets of .

Dialog acts are Assert, Open, and Close:


: represents an asserting argument (). is a domain belief, called a claim. is a set of beliefs, called supports of the claim. should be a minimal and consistent belief set for deriving the claim .


: represents the intention to start a discussion on agenda . is a domain belief.


: represents the intention to close the discussion on the current agenda.

In the dialog example in Table 1b, Assert is shown at , Open is shown at , and Close is shown at .

The goal of the dialog is to generate arguments to answer the query. More concretely, given a query , the dialog succeeds if either of the participants performs , s.t. . In the example given in Table 1b, the query is given as “.” The Assert at addresses the query, and thus achieves the goal of the dialog.

If a participant performs Assert() at time point , the commitment store is updated (). In the example of Table 1b, is updated by an Assert at .

We introduce a stack of query stores in order to manage the current agenda of a dialog. A query store is a list of atoms that represent the current agenda of the dialog. is a stack of query stores and it is updated when participants perform Open or Close. If either of the participants perform Open, a new query store is stacked to . In addition, if all participants perform Close, the top of is removed. is initialized with the query. In Table 1b, is initialized with the query, and “” is stacked to . In addition, a query store of five atoms is stacked to by Open at and is removed when both the system and the user perform Close ().

The set of valid dialog acts in the given dialog state is called legal moves. In inquiry dialog, Assert and Open should be related to the current agenda. More concretely, at time point i, the legal moves of Assert , Open , Close , for the system are defined as follows:

1) is a ground atom corresponding to an atom in the list at top,

1) ,

The system can not utilize an Assert or Open that has already been performed in the past.

3 Methodology for Learning Inquiry Dialog Policies

3.1 Learner’s Model

We define rewards, actions, and states in Markov decision processes (MDPs) 

[15] to apply DRL for learning IDS policies.

A reward is structured in order for the system to answer the given queries as fast as possible. The reward at each system’s turn is fed with the following equation:

where and are numerical values . The positive reward is fed if either the system or its user assert the answer to the given question. In addition, the negative reward representing time pressure is fed at each turn’s end111Namely, a time point immediately after the system updates its dialog state with a user’s dialog act. . We set to 20 and to 1.

Action is represented by a dialog act from the legal moves, and the dialog state represents the state (described in Section 2.2). There are multiple ways to encode dialog state/act to state/action (in MDPs). A naive one is the bag of logical formulae

, namely, state and action are represented by binary vectors whose element is set to 1 if corresponding logical formula is contained in a dialog act or dialog state. However, the problem with this representation is that state/action space easily becomes very sparse because the size of the vectors is determined by that of all possible beliefs


3.2 Logical Formula Embedding Framework

If we want DRL to efficiently produce better policies, it makes sense to represent dialog states and dialog acts compactly as state and action in MDPs. As seen in the Section 2.2, dialog states and dialog acts are represented with beliefs (i.e., logical formulae) in typical inquiry dialog frameworks.

We propose a recursive neural network [16, 17] based framework for injecting logical formula into embeddings (“EmbF” in Fig. 1). The proposed framework (“EmbF”) calculates the compositional vector of the tree representation of a logical formula in a bottom up manner. is an abstract syntax tree of the truth assignment for  [18, 19]. The leaf nodes in represent arguments of predicates, and internal nodes represent either predicates or logical operators. Note that parent nodes of leaf nodes must represent predicates. In Fig. 1, “” is parsed into tree representation. The leaf nodes represent the arguments “X and Y”, the dashed internal nodes represent the predicates “A, B and Competitor”, and the other internal nodes represent logical operators “.”

Traversing , EmbF injects atoms into embeddings and composes them recursively. First, EmbF calculates atom (i.e., leaves and their parent nodes) embeddings on the basis of

where and represent weight matrices, is the number of dimensions of the embedding vector, is that of , corresponds to , represents the vector of the predicate, and represent the vectors of the arguments. We use one-hot representation for and . Next, on the basis of the internal nodes representing logical operators ( or ) and their child nodes labeled , they compose children’s vectors with

where represent weight matrices222As truth assignments and are identical, we could use other equations that calculate , s.t. . Comparisons with the case of using these equations are left for future work. . are vector representations of the children’s vector.

Figure 1: The framework of the logical formula embedding.

The vector of the top node is then used as compact belief representation for encoding the dialog state/act.

We implement a Q-function that evaluates the expectation of the cumulative future reward (Q-value) of the dialog act and dialog state pair (Fig. 2). takes as input a pair consisting of the dialog state and the dialog act , and the input is forwarded to the embedding framework for the dialog state (EmbDs) and that for the dialog act (EmbDa). EmbDs and EmbDa calculate vector representation of and , respectively. These vectors are compositions of vectors of logical formula

produced by EmbF (discussed above). In both EmbDs and EmbDa, the logical formulae in the dialog state and the dialog act are embedded into numerical vectors by EmbF. Finally, the Q-function estimates the Q-value with these embedded vectors. Parameters of the Q-function (i.e.,

, , and weights for “Lin”) are learned with DQL333Note that our logical formula embedding framework can be applied to other DRLs (e.g.,  [20, 21, 22, 23]) as well. .

Figure 2: Implementation of the Q-function. Rounded rectangles represent functions, normal rectangles represent data, and arrows represent data flows. “Lin” represents a linear function. “Sum” represents the element-wise sum of vectors. “” represents the concatenation of vectors. “EmbClose” is an one hot vector showing the appearance of Close in the dialog act.

3.3 User Simulator

We utilize user simulators instead of real human users to train the system. The user simulator selects a dialog act

in accordance with its policy, which is a hybrid combining a rule-based policy and a random policy. With probability

, the policy follows the ruled-based one, proposed in previous research [9, 10] (Algorithm 1). The rule-based policy is designed such that the user shares all of his or her beliefs exhaustively with the partner. Given legal moves , the policy checks if each of the legal moves is empty or not in the order . If the legal moves are not empty, it selects from the legal moves. In contrast, with the probability , the random policy is followed. The random policy selects from a set of dialog acts that do not conflict with the user belief base and commitment store.

0:  Legal moves
  if  then
     randomly select a response from
  else if  then
     randomly select a response from
     randomly select a response from
  end if
Algorithm 1 Black and Hunter’s exhaustive policy

4 Experimental Evaluation

In this section, we elucidate the effects of DRL and the logical formula embeddings proposed in Section 3.2. Four policies in six different experimental setups are compared. These four policies are as follows:


: The rule-based policy that follows Algorithm 1.


: The policy produced by the DQL without logical formula embedding proposed in Section 3.2. It follows the bag of logical formulae (Section 3.1). Also, following the previous work on DRL for dialog systems [24], it utilizes a multilayer neural net for the Q-function.


: The policy produced by the DQL with logical formula embedding proposed in Section 3.2, with .


: The same as DQLwE-5d except that .

We consider 2000 dialogs as one epoch, and learning is finished when the number of epochs becomes 100 (200,000 dialogs). The policy at the end of learning is used in the evaluation. We set the discount rate to 0.99, and use an

-greedy policy. is linearly annealed from 1.0 to 0.05 during the first 50 epochs.

The six experimental setups differ in 1) user behavior and 2) the initial condition of the system belief. We experiment with two types of user simulator proposed in Section 3.3:


: User simulator with .


: User simulator with .

In addition, we experiment with three different initial conditions of the system belief. We prepared a set of 13 beliefs related to the Compliance Violation Detection domain. Seven of these beliefs are about the violation of compliance and the remaining six are about the content of the e-mail threads. In all experiments, the beliefs about the content of the e-mail threads are randomly assigned to either the system or the user. We assign the beliefs of compliance violation in three different ways:


: Each belief is assigned to either the system or the user randomly.


: All beliefs are assigned to the user.


: All beliefs are assigned to the system

Note that these scenarios vary in the amount of knowledge that the system possesses about compliance violation.

The performance of the learned policies is evaluated in the success rate at each turn. The success rate represents the percentage of dialogs where either the system or the user asserts the answer to the query. The success rate is calculated on the basis of 2000 simulated dialogs.

From the results (Fig. 3), we see that the policies produced by the DRL with logical formula embeddings (DQLwE-5d and DQLwE-10d) performed better than the other policies (Baseline and DQLwoE), where the system has control over the dialog. With the exception of the UB cases, DQLwE-5d and DQLwE-10d achieved high success rates in fewer turns than the other policies. However, in the UB cases, performances of all policies were more or less the same. The main reason is that in the UB cases, the system does not have any beliefs concerning compliance violation (i.e., domain beliefs, as described in Section 2.2). If the system does not have domain beliefs, it cannot perform Open (i.e., condition 2) in is never satisfied). Therefore, the system cannot control the dialog at all.

(a) RBRandU
(b) RBRuleU
(c) UBRandU
(d) UBRuleU
(e) SBRandU
(f) SBRuleU
Figure 3: Results of each setup. Vertical axis represents the success rate and horizontal axis represents the number of turns.

5 Conclusion and Future Work

We proposed a method for learning IDS policies and logical formula embeddings in a common framework and found that the combination of the DRL and the logical formula embedding framework performed as effectively or even better than the policies of the hand-crafted baseline. In the future, we plan to evaluate our proposed framework in domains in which the system works with users having thousands of beliefs. Further, we will evaluate our framework in a more realistic setting where users are real humans and where speech recognition and language understanding errors are included.

Acknowledge: We gratefully thank Ayako Hoshino, Daniel Andrade, Khurana Jatin, and Vera Götzmann for their insightful comments.


  • [1] D. Walton and E. C. Krabbe, Commitment in dialogue: Basic concepts of interpersonal reasoning.   SUNY press, 1995.
  • [2] J. D. Williams and S. Young, “Partially observable markov decision processes for spoken dialog systems,” Computer Speech & Language, vol. 21, no. 2, pp. 393–422, 2007.
  • [3] E. Levin, R. Pieraccini, and W. Eckert, “A stochastic model of human-machine interaction for learning dialog strategies,” IEEE Transactions on speech and audio processing, vol. 8, no. 1, pp. 11–23, 2000.
  • [4] M. A. Walker, “An application of reinforcement learning to dialogue strategy selection in a spoken dialogue system for email,”

    Journal of Artificial Intelligence Research

    , vol. 12, pp. 387–416, 2000.
  • [5] L. Amgoud, N. Maudet, and S. Parsons, “Modelling dialogues using argumentation,” in MultiAgent Systems, 2000. Proceedings. Fourth International Conference on.   IEEE, 2000, pp. 31–38.
  • [6] S. Parsons, M. Wooldridge, and L. Amgoud, “An analysis of formal inter-agent dialogues,” in Proceedings of the first international joint conference on Autonomous agents and multiagent systems: part 1.   ACM, 2002, pp. 394–401.
  • [7] ——, “On the outcomes of formal inter-agent dialogues,” in Proceedings of the second international joint conference on Autonomous agents and multiagent systems.   ACM, 2003, pp. 616–623.
  • [8] P. McBurney and S. Parsons, “Representing epistemic uncertainty by means of dialectical argumentation,” Annals of Mathematics and Artificial Intelligence, vol. 32, no. 1-4, pp. 125–169, 2001.
  • [9] E. Black and A. Hunter, “A generative inquiry dialogue system,” in Proceedings of the 6th international joint conference on Autonomous agents and multiagent systems.   ACM, 2007, p. 241.
  • [10] ——, “An inquiry dialogue system,” Autonomous Agents and Multi-Agent Systems, vol. 19, no. 2, pp. 173–209, 2009.
  • [11] X. Fan and F. Toni, “Agent strategies for aba-based information-seeking and inquiry dialogues.” in ECAI, 2012, pp. 324–329.
  • [12] ——, “Mechanism design for argumentation-based information-seeking and inquiry,” in International Conference on Principles and Practice of Multi-Agent Systems.   Springer, 2015, pp. 519–527.
  • [13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
  • [14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [15] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press Cambridge, 1998, vol. 1, no. 1.
  • [16] R. Socher, C. C. Lin, C. Manning, and A. Y. Ng, “Parsing natural scenes and natural language with recursive neural networks,” in

    Proceedings of the 28th international conference on machine learning (ICML-11)

    , 2011, pp. 129–136.
  • [17] R. Socher, A. Perelygin, J. Y. Wu, J. Chuang, C. D. Manning, A. Y. Ng, C. Potts et al., “Recursive deep models for semantic compositionality over a sentiment treebank,” in

    Proceedings of the conference on empirical methods in natural language processing (EMNLP)

    , vol. 1631.   Citeseer, 2013, p. 1642.
  • [18] T. Rocktäschel, S. Singh, and S. Riedel, “Injecting logical background knowledge into embeddings for relation extraction.” in HLT-NAACL, 2015, pp. 1119–1129.
  • [19]

    S. Guo, Q. Wang, L. Wang, B. Wang, and L. Guo, “Jointly embedding knowledge graphs and logical rules,” in

    Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 1488–1498.
  • [20] G. Lever, “Deterministic policy gradient algorithms,” 2014.
  • [21] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
  • [22] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [23] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning.” in AAAI, 2016, pp. 2094–2100.
  • [24] H. Cuayáhuitl, “Simpleds: A simple deep reinforcement learning dialogue system,” in Dialogues with Social Robots.   Springer, 2017, pp. 109–118.