Modeling question asking using neural program generation

07/23/2019 ∙ by ZiYun Wang, et al. ∙ NYU college

People ask questions that are far richer, more informative, and more creative than those of current AI systems. We propose a neural program generation framework for modeling human question asking, which represents questions as formal programs and generates programs with an encoder-decoder based deep neural network. From extensive experiments using an information-search game, we show that our method can ask optimal questions in synthetic settings, and predict which questions humans are likely to ask in unconstrained settings. We also propose a novel grammar-based question generation framework trained with reinforcement learning, which is able to generate creative questions without supervised data.




1 Introduction

People can ask rich, creative questions to learn efficiently about their environment. Question asking is central to human learning yet it is a tremendous challenge for computational models. There is always an infinite set of possible questions that one can ask, leading to challenges both in representing the space of questions and in searching for the right question to ask.

Machine learning has been used to address aspects of this challenge. Traditional methods have used heuristic rules designed by humans [7, 3], which are usually restricted to a specific domain. Recently, neural network approaches have also been proposed, including retrieval methods which select the best question from past experience [12] and encoder-decoder frameworks which map visual or linguistic inputs to questions [18, 12, 23, 21]. While effective in some settings, these approaches are heavily data-driven, limiting the diversity of generated questions and requiring large training sets for different goals and contexts. There is still a large gap between how people and machines ask questions.

Recent work has aimed to narrow this gap by taking inspiration from cognitive science. For instance, Lee et al. [10] incorporate aspects of “theory of mind” [13] in question asking by simulating potential answers to the questions, but the approach relies on imperfect agents for natural language understanding, which may lead to error propagation. Related to our approach, Rothe et al. [16] proposed a powerful question-asking framework by modeling questions as symbolic programs, but their algorithm relies on hand-designed program features and requires expensive calculations to ask questions.

We use “neural program generation” to bridge symbolic program generation and deep neural networks, bringing together some of the best qualities of both approaches. Symbolic programs provide a compositional “language of thought” [4] for creatively synthesizing which questions to ask, allowing the model to construct new ideas based on familiar building blocks. Compared to natural language, programs are precise in their semantics, have clearer internal structure, and require a much smaller vocabulary, making them an attractive representation for question answering systems as well [8, 11]. Deep neural networks allow for rapid question-synthesis using encoder-decoder modeling, eliminating the need for the expensive symbolic search and feature evaluations in Rothe et al. [16]. Together, the questions can be synthesized quickly and evaluated formally for quality (e.g. the expected information gain), which as we show can be used to train question asking systems using reinforcement learning.

In this paper, we develop a neural program generation model for asking questions in an information-search game similar to “Battleship” used in previous work [6, 17, 16]. The model uses a convolutional encoder to represent the game state, and a Transformer decoder [19] for generating questions. Building on the work of [16], the model uses a grammar-enhanced question asking framework, such that questions as programs are formed through derivation using a context-free grammar. We evaluate the model on experiments exploring several different aspects of human question asking, including reasoning tasks, a density estimation task, and a generation task. The last experiment shows our method can generate novel and creative human-like questions without supervised examples of known human questions.

To summarize, our paper makes three main contributions: 1) We propose a neural network for modeling human question-asking behavior, 2) We propose a novel reinforcement learning framework for generating creative human-like questions by exploiting the benefits of programs, and 3) We evaluate our methods extensively through three different experiments exploring different qualities of human question asking.

Figure 1: The Battleship task. Blue, red, and purple tiles are ships, dark gray tiles are water, and light gray tiles are hidden. The agent can see a partly revealed board, and should ask a question to seek more information about the hidden board. Example questions and their program format counterparts are shown on the right. We recommend viewing this figure in color.

2 Related work

Question generation has attracted attention from the machine learning community. Early research mostly explored rule-based methods which strongly depend on human-designed rules [7, 3]. Recent methods for question generation adopt deep neural networks, especially using the encoder-decoder framework, and can generate questions without hand-crafted rules. These methods are mostly data-driven and use pattern recognition to map inputs to questions. Researchers have worked on generating questions from different types of inputs such as knowledge base facts [18], pictures [23], and text for reading comprehension [12, 21]. However, aspects of human question-asking remain beyond reach, including the goal-directed and flexible qualities that people demonstrate when asking new questions.

Several recent papers draw inspiration from cognitive science to generate more human-like, goal-oriented questions. Rothe et al. [16] and Lee et al. [10] generate questions by sampling from a candidate set based on goal-oriented metrics. This paper extends the work of Rothe et al. [16] to overcome the limitation of the candidate set, and generates creative, goal-oriented programs with neural networks.

Our work also builds on neural network approaches to program synthesis. This research draws inspiration from computer architecture, using neural networks to simulate stacks, memory, and controllers in differentiable form [15, 5]. Other models incorporate Deep Reinforcement Learning (DRL) to optimize the generated programs in a goal-oriented environment, such as generating SQL queries which can correctly perform a specific database processing task [24]. Recent work has also proposed ways to incorporate explicit grammar information into the program synthesis process. Yin and Neubig [22] design a special module to capture the grammar information as a prior, which can be used during generation. Bunel et al. [2] use DRL to explicitly encourage the generation of semantically correct programs. Our work differs from these in two aspects. First, our goal is to generate informative human-like questions instead of simply correct programs. Second, we more deeply integrate grammar information in our framework, which directly generates programs based on the grammar.

3 Battleship Task

In this paper, we work with a task used in previous work for studying human information search [6] as well as question asking [17]. The task is based on an information search game called “Battleship”, in which a player aims to resolve the hidden layout of the game board based on the revealed information (Figure 1). There are three ships with different colors (blue, red, and purple) placed on a game board consisting of a 6x6 grid of tiles. Each ship can be either horizontal or vertical, and is 2, 3, or 4 tiles long. All tiles are initially turned over (light grey in Figure 1), and the player can flip one tile at a time to reveal an underlying color (either a ship color, or dark grey for water). The goal of the player is to determine the configuration of the ships (positions, sizes, orientations) in the fewest flips.

In the modified version of this task studied in previous work [17, 16], the player is presented with a partly revealed game board, and is required to ask a natural language question to gain information about the underlying configuration. As shown in Figure 1, the player can only see the partly revealed board, and might ask questions such as “How long is the red ship?” In this paper, we present this task to our computational models, and ask the models to generate questions about the game board.

Rothe et al. [16] designed a powerful context-free grammar (CFG) to describe all of the questions asked by people [17] in the Battleship domain. The grammar represents questions in a LISP-like program format, which consists of a set of primitives (such as numbers and colors) and a set of functions over primitives (such as arithmetic operators, comparison operators, and other functions related to the game board). Figure 1 provides some examples of programs produced from this grammar.
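To make the program format concrete, the following minimal sketch parses a LISP-like question and evaluates it against a fully known ship layout. It is an illustration only: the `ships` dictionary layout, the `"H"`/`"V"` orientation codes, and the handful of supported functions (`size`, `orient`, `>`, `==`, `+`) are assumptions chosen for this example and do not reproduce the full grammar of Rothe et al. [16].

```python
def tokenize(src):
    """Split a LISP-like program string into parentheses and atoms."""
    return src.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Recursively build a nested list from a token list (consumes tokens)."""
    tok = tokens.pop(0)
    if tok == "(":
        expr = []
        while tokens[0] != ")":
            expr.append(parse(tokens))
        tokens.pop(0)  # discard the closing ")"
        return expr
    return int(tok) if tok.isdigit() else tok


def evaluate(expr, ships):
    """Evaluate a parsed question against a known layout.

    `ships` maps a color name to a dict with the hypothetical keys
    "size" and "orient"; only a few functions are covered.
    """
    if isinstance(expr, (int, str)):
        return expr  # a number or a color primitive such as "Blue"
    op, *args = expr
    vals = [evaluate(a, ships) for a in args]
    if op == "size":
        return ships[vals[0]]["size"]
    if op == "orient":
        return ships[vals[0]]["orient"]
    if op == ">":
        return vals[0] > vals[1]
    if op == "==":
        return vals[0] == vals[1]
    if op == "+":
        return vals[0] + vals[1]
    raise ValueError(f"unsupported function: {op}")


# Made-up layout used only for illustration.
ships = {
    "Blue": {"size": 4, "orient": "H"},
    "Red": {"size": 2, "orient": "V"},
    "Purple": {"size": 3, "orient": "V"},
}
program = parse(tokenize("(> (size Blue) 3)"))
answer = evaluate(program, ships)
```

Because every program has a formal meaning, the same evaluator can answer a question on any hypothesized board, which is what makes questions scorable by quantities such as expected information gain.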

4 Neural program generation

This section introduces our approach to program generation with neural networks. The network includes a Convolutional Neural Network (CNN) for encoding the input board, and a Transformer [19] decoder for generating the sequence. The network architecture is shown in Figure 2.

Figure 2: Neural program generation. The board is represented as a grid of one-hot vectors and is embedded with a convolutional neural network. The board embedding is passed to a Transformer decoder [19] which generates a question one symbol at a time.

The game board is a 6x6 grid with five channels, one for each tile color, with the color encoded as a one-hot vector in each grid location. A simple CNN with one layer of filters is used to encode the board. Intuitively, many questions are related to specific positions, thus the position information should be recoverable from the encoding. On the other hand, some features of the board are translation-invariant, such as whether a ship is blocked by another ship. In order to capture the position-sensitive information as well as the translation-invariant patterns, three convolution operations with different filter sizes (1x1, 3x3, and 5x5) are performed in parallel on the same input. The inputs are padded accordingly to make sure the feature maps have the same width and height. Then the three feature maps are concatenated along the dimension of output channels and passed through a linear projection.

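Formally, the outputs of the convolutions can be obtained by

$$x_o = [\mathrm{ReLU}(\mathrm{Conv}_1(x));\ \mathrm{ReLU}(\mathrm{Conv}_3(x));\ \mathrm{ReLU}(\mathrm{Conv}_5(x))]$$

where $x$ is the input board, $\mathrm{Conv}_k$ denotes a convolution operation on a $k \times k$ filter, ReLU means applying a ReLU activation, and $[A; B]$ means the concatenation of matrices $A$ and $B$. Then $x_o \in \mathbb{R}^{6 \times 6 \times 3C_{out}}$ is projected to the encoder output $y \in \mathbb{R}^{6 \times 6 \times L_e}$ by matrix $W_{eo} \in \mathbb{R}^{3C_{out} \times L_e}$, where $C_{out}$ is the number of out channels of each convolution, and $L_e$ is the length of encoded vectors.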
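The encoder can be sketched in plain Python to show how padding keeps every feature map at 6x6 so that the maps can be concatenated channel-wise. This is a shape-level illustration under stated assumptions, not the trained model: the channel order in `COLORS`, the values `c_out=4` and `enc_len=8`, the filter sizes 1, 3, and 5, and the random weights are all chosen only for the example.

```python
import random

COLORS = ["hidden", "water", "blue", "red", "purple"]  # assumed channel order


def one_hot_board(board):
    """Turn a grid of color names into a [row][col][channel] one-hot tensor."""
    return [[[1.0 if c == COLORS.index(cell) else 0.0 for c in range(len(COLORS))]
             for cell in row] for row in board]


def conv_same(x, weights, k):
    """k x k convolution plus ReLU, zero-padded so height and width are kept.

    x: [H][W][C_in]; weights: [C_out][k][k][C_in]; returns [H][W][C_out].
    """
    h, w, pad = len(x), len(x[0]), k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            feats = []
            for filt in weights:
                total = 0.0
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < h and 0 <= jj < w:  # zero padding outside
                            total += sum(a * b for a, b in zip(x[ii][jj], filt[di][dj]))
                feats.append(max(total, 0.0))  # ReLU
            row.append(feats)
        out.append(row)
    return out


def random_filters(c_out, k, c_in, rng):
    """Random stand-in weights with shape [c_out][k][k][c_in]."""
    return [[[[rng.uniform(-1, 1) for _ in range(c_in)] for _ in range(k)]
             for _ in range(k)] for _ in range(c_out)]


def encode(board, c_out=4, enc_len=8, seed=0):
    """Parallel convolutions, channel concatenation, then a linear projection."""
    rng = random.Random(seed)
    x = one_hot_board(board)
    maps = [conv_same(x, random_filters(c_out, k, len(COLORS), rng), k)
            for k in (1, 3, 5)]
    proj = [[rng.uniform(-1, 1) for _ in range(enc_len)] for _ in range(3 * c_out)]
    encoded = []
    for i in range(len(x)):
        row = []
        for j in range(len(x[0])):
            cat = maps[0][i][j] + maps[1][i][j] + maps[2][i][j]  # 3 * c_out values
            row.append([sum(cat[c] * proj[c][e] for c in range(len(cat)))
                        for e in range(enc_len)])
        encoded.append(row)
    return encoded


board = [["hidden"] * 6 for _ in range(6)]
board[2][3] = "blue"
y = encode(board)  # one encoded vector of length enc_len per grid location
```

Keeping one encoded vector per tile is what lets the decoder attend to specific board positions.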


We use the decoder from the Transformer [19] model for generating the program as a sequence of symbols. With an input sequence of length $n$, the decoder computes the hidden states through several stacked Decoder Attention Layers. Each layer is composed of three sub-layers: a self-attention module, an attention over the encoded board, and a fully connected feed-forward network. Residual connections are employed around each sub-layer, followed by a layer normalization [1]. After $N$ layers of attention modules, a final output layer transforms the hidden states to the predicted next-token probabilities at every position from $1$ to $n$, e.g. the predicted result at position $i$ is used to determine the input symbol at position $i+1$. During generation, the predicted symbol $w_{n+1}$ is appended to the input sequence, and the sequence is input to the model again to determine $w_{n+2}$. The process is repeated until an <EOS> token is generated indicating the end of sequence.
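The generation loop described above can be sketched as follows. The decoder is replaced by a stand-in function, and the `<SOS>` start token, the greedy (arg-max) choice, and the `max_len` cap are assumptions made for this illustration; the text only specifies that symbols are appended one at a time until `<EOS>` appears.

```python
def greedy_decode(next_token_probs, board_encoding, max_len=20):
    """Generate symbols one at a time until <EOS> (or max_len) is reached.

    `next_token_probs(board_encoding, prefix)` stands in for the decoder:
    it returns a dict mapping each candidate symbol to its probability.
    """
    sequence = ["<SOS>"]  # assumed start-of-sequence token
    while len(sequence) < max_len:
        probs = next_token_probs(board_encoding, sequence)
        symbol = max(probs, key=probs.get)  # greedy choice of the next symbol
        if symbol == "<EOS>":
            break
        sequence.append(symbol)  # feed the prediction back in as input
    return sequence[1:]


def toy_decoder(board_encoding, prefix):
    """Hypothetical decoder that always spells out one fixed program."""
    target = ["(", "size", "Blue", ")", "<EOS>"]
    step = len(prefix) - 1
    return {tok: (0.9 if tok == target[step] else 0.025) for tok in set(target)}


question = greedy_decode(toy_decoder, board_encoding=None)
```

Sampling from `probs` instead of taking the arg-max would give a stochastic generator; the loop structure stays the same.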

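Given the output from the encoder $y$ and the hidden representation $h^{(l-1)}$ from Decoder Attention Layer $l-1$, each layer computes the hidden representation as

$$\begin{aligned}
u^{(l)} &= \mathrm{LN}\big(h^{(l-1)} + \text{Self-ATT}(h^{(l-1)})\big) \\
v^{(l)} &= \mathrm{LN}\big(u^{(l)} + \mathrm{ATT}(u^{(l)}, y)\big) \\
h^{(l)} &= \mathrm{LN}\big(v^{(l)} + \mathrm{FC}(v^{(l)})\big)
\end{aligned}$$

where LN means layer normalization [1], FC is a fully connected layer, and ATT and Self-ATT are multi-head attention mechanisms that compute the attention over the output of the encoder $y$ and the attention over the input itself, respectively. They are defined as follows:

$$\mathrm{ATT}(h, y) = \text{Multi-ATT}(h, y, y), \qquad \text{Self-ATT}(h) = \text{Multi-ATT}(h, h, h)$$

Multi-ATT is the multi-head attention mechanism described in the paper by Vaswani et al. [19], which is a concatenation of multiple standard attention mechanisms with inputs projected using different matrices. A multi-head attention with $H$ heads is defined as

$$\text{Multi-ATT}(Q, K, V) = [\mathrm{head}_1; \dots; \mathrm{head}_H]\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$

where

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V$$

is the scaled dot-product attention operation. $Q$, $K$, and $V$ are sets of vectors called queries, keys, and values, respectively, and $d_k$ is the dimension of queries and keys.

After $N$ layers, we apply a linear projection and a softmax activation to $h^{(N)}$ to get the output probabilities.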
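As a small worked illustration of the scaled dot-product attention used inside each head, the sketch below computes softmax(QK^T / sqrt(d_k)) V with plain Python lists. The toy keys, values, and query are made up for the example; multi-head projection, residual connections, and layer normalization are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention([[5.0, 0.0]], keys, values)  # query aligned with the first key
```

A query aligned with the first key places most of its weight on the first value, while an all-zero query attends uniformly and returns the average of the values.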

5 Experiments

5.1 Reasoning in synthetic settings

In the first experiment, we designed three tasks to evaluate whether the model can learn simple rules and reasoning strategies. These tasks include counting the number of visible ship tiles, locating a missing ship tile, and generalizing both strategies to unseen scenario types using compositionality. Figure 3 illustrates the three tasks we designed in this experiment by providing some examples of each task.

Figure 3: Design of the 3 tasks in experiment 1. The goal of task (a) is to find the color which has the fewest visible tiles; the goal of task (b) is to find the location and color of the missing tile; (c) is the compositionality task with 5 questions as known question types, and another one (in dotted box) as the held out question type.

5.1.1 Task descriptions

The three tasks are defined as follows.

  • Counting task. Models must select the ship color with the fewest visible tiles on the board. Each board has a unique answer, and models respond by generating a program “(topleft (coloredTiles X))” where X is a ship color. Separate sets of examples are used for training and for testing.

  • Missing tile task. Models must select the ship that is missing a tile and identify which tile is missing. All ships are completely revealed except one, which is missing exactly one tile. Models respond by generating “(== (color Y) X)” where X is a color and Y is a location on the board. The numbers of training and test examples are the same as in the counting task.

  • Compositionality task. Models must combine both of the above strategies to find the missing tile of the ship with the fewest visible tiles. Outputs are produced as “(Z (coloredTiles X))” where X is a color and Z is either topleft or bottomright. Each board has a unique answer.

    This task further evaluates compositionality by withholding question types from training. With three values for X and two for Z, there are six possible question types and one is picked as the “held out” type. The other five “known” question types are all represented in the training set. For the held out question type, the number of training examples is varied from 0 to 800 (Table 1(c)), to test how much data is needed for generalization. A separate set of new boards of each question type is used for evaluation.
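For the counting task, the target program follows mechanically from the visible board, as the following minimal sketch shows. The board representation (a grid of strings with `"Hidden"` for unrevealed tiles) and the color names `Red` and `Purple` are assumptions for illustration; only the program template comes from the task description.

```python
def counting_target(board):
    """Return the counting-task program for a board of color names.

    Assumes exactly one ship color has the fewest visible tiles, as the
    task description guarantees.
    """
    ship_colors = ["Blue", "Red", "Purple"]
    counts = {c: sum(row.count(c) for row in board) for c in ship_colors}
    rarest = min(counts, key=counts.get)
    return f"(topleft (coloredTiles {rarest}))"


# Made-up partly revealed board: 3 blue, 2 red, and 1 purple tile visible.
board = [["Hidden"] * 6 for _ in range(6)]
board[0][0:3] = ["Blue", "Blue", "Blue"]
board[2][1:3] = ["Red", "Red"]
board[4][5] = "Purple"
program = counting_target(board)
```

The neural model receives no such rule; it has to induce the counting strategy from board and program pairs alone.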

More information about the model hyperparameters and training procedures is provided in Appendix A.1.

5.1.2 Results and discussion

Accuracy for the counting and missing tile tasks is summarized in Table 1(a) for the full model and lesioned variants. The full neural program generation model shows strong reasoning abilities, achieving an accuracy of 99.80% and 95.50% for the counting and missing tile tasks, respectively. The full model is compared to weakened variants with only one filter size in the encoder, either “3x3 conv only” or “1x1 conv only,” and the performance of the weakened models drops dramatically on the missing tile task.

To better understand the role of different filter sizes, Table 1(b) breaks down the errors in the missing tile task by whether the question picks the right ship (color acc.) and whether it selects the right location (location acc.). The 3x3 convolution filters can accurately select the correct color, but often fail to choose the right tile. The model with 1x1 convolution filters has poor performance for both color and location. In the current architecture, predicting the correct location requires precise information that seems to be lost without filters of different sizes.

| Model | Counting | Missing tile |
| --- | --- | --- |
| Full model | 99.80% | 95.50% |
| 3x3 conv only | 99.30% | 51.50% |
| 1x1 conv only | 98.60% | 1.90% |

(a) Accuracy of different models on the counting and missing tile tasks

| Model | Location acc. | Color acc. |
| --- | --- | --- |
| Full model | 97.80% | 97.60% |
| 3x3 conv only | 53.00% | 96.20% |
| 1x1 conv only | 3.90% | 45.50% |

(b) Accuracy for selecting the right tile location and color for the missing tile

| # of training examples | 0 | 10 | 50 | 100 | 200 | 400 | 800 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Acc. full model on held out question type | 0.0 | 2.0 | 39.0 | 69.5 | 81.0 | 92.0 | 96.0 |
| Acc. full model on known question types | 96.6 | 97.3 | 97.1 | 96.0 | 96.3 | 97.8 | 96.1 |
| Acc. classify on held out question type | 33.0 | 37.0 | 49.0 | 75.5 | 88.0 | 94.0 | 99.5 |

(c) Accuracy (%) on the compositionality task using different numbers of training examples from the held out question type.

Table 1: Results of the synthetic reasoning tasks.

The results for the compositionality task are summarized in Table 1(c). When no training data for the held out question type is provided, the model cannot generalize to situations systematically different from the training data, exactly as pointed out in previous work on the compositional skills of encoder-decoder models [9]. However, as the amount of additional training data increases, the model quickly incorporates the new question type while maintaining high accuracy on the familiar question types. In the last row of Table 1(c), we compare our model with another version where the decoder is replaced by two linear transformation operations which directly classify the ship type and location (details in Appendix A.1). This model has 33.0% transfer accuracy on compositional scenarios never seen during training. This suggests that the model has the potential to generalize to unseen scenarios if the task can be decomposed into subtasks that are then combined.

5.2 Estimating the distribution of human questions

In this experiment, we examine if neural program generation can model the empirical distribution of human questions.

5.2.1 Data collection

To train the model, we need to construct a training set of paired game boards and questions. Instead of laboriously collecting a large number of real human questions, and translating them into programs by hand, we construct the dataset by sampling from a previous computational model of human question asking [16]. More precisely, we randomly generate a large number of game boards and sample questions given each board. For generating the boards, we first uniformly sample the configuration of three ships, and randomly cover an arbitrary number of tiles, with the restriction that at least one ship tile is observable. Next we randomly sample programs for each board with importance sampling based on the cognitive model proposed by Rothe et al. [16], which models the probability of a question $q$ under a given context $c$ as

$$p(q \mid c) = \frac{\exp(-\varepsilon(q, c))}{Z}$$

where $\varepsilon(\cdot)$ is a parameterized energy function for estimating the likelihood of a question being asked by a human, which considers multiple features such as question informativeness, complexity, answer type, etc., and $Z$ is a normalization constant.
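To illustrate how an energy function induces a distribution over questions, the sketch below normalizes exp(-energy) over a small, explicitly enumerated candidate set and samples from it. This is a simplification: the actual question space is unbounded and is handled with importance sampling in the text, and the candidate programs and energy values here are invented for the example.

```python
import math
import random


def question_distribution(energies):
    """Convert energies into probabilities p(q) = exp(-E(q)) / Z."""
    low = min(energies.values())
    # Subtracting the minimum energy leaves p unchanged but avoids overflow.
    unnorm = {q: math.exp(-(e - low)) for q, e in energies.items()}
    z = sum(unnorm.values())  # normalization constant over the candidate set
    return {q: u / z for q, u in unnorm.items()}


def sample_question(energies, rng):
    """Draw one question in proportion to its probability."""
    probs = question_distribution(energies)
    questions = list(probs)
    return rng.choices(questions, weights=[probs[q] for q in questions], k=1)[0]


# Hypothetical energies: lower energy means a more likely question.
energies = {
    "(orient Blue)": 1.0,
    "(size Red)": 2.0,
    "(== (size Blue) (size Red))": 4.0,
}
probs = question_distribution(energies)
sampled = sample_question(energies, random.Random(0))
```

Two questions whose energies differ by 1 differ in probability by a factor of e, so lower-energy questions are sampled more often.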

We also randomly generate a larger set of questions to pretrain the decoder component of the model as a “language model” over questions, enabling it to better capture the grammatical rules. Details regarding the model hyperparameters, training procedure, and pre-training procedure are provided in Appendix A.2.

| Model | LL on sampled data | LL on human data |
| --- | --- | --- |
| Full model | -3.197 | -7.124 |
| no pretrain | -3.217 | -7.280 |
| LSTM decoder | -3.222 | -9.013 |
| MLP encoder | -3.385 | -7.475 |
| decoder only | -3.401 | -8.434 |

Table 2: Log-likelihood (LL) of different model variants on the two evaluation sets.

5.2.2 Results and discussion

We evaluate the log-likelihood of reference questions under our full model as well as some lesioned variants of the full model, including a model without pretraining, a model with the Transformer decoder replaced by an LSTM decoder, a model with the convolutional encoder replaced by a simple MLP encoder, and a model that only has a decoder (unconditional language model).

Two different evaluation sets are used: one is sampled from the same process on new boards, and the other is a small set of questions collected from human annotators. In order to calculate the log-likelihood of human questions, we use translated versions of these questions that were used in previous work [16], and filter out some human questions that score poorly according to the generative model used for training the neural network (Appendix A.2).

A summary of the results is shown in Table 2. The full model performs best on both datasets, suggesting that pretraining, the Transformer decoder, and the convolutional encoder are all important components of the approach. However, we find that the model without an encoder performs reasonably well too, even outperforming the full model with an LSTM decoder on the human-produced questions. This suggests that while contextual information from the board leads to improvements, it is not the most important factor for predicting human questions. To further investigate the role of contextual information, we conduct another analysis to determine whether or not the model can utilize board information effectively.

Intuitively, if there is little uncertainty about the locations of the ships, observing the board is critical since there are fewer good questions to ask. To examine this factor, we divide the scenarios based on the entropy of the hypothesis space of possible ship locations into a low entropy set (bottom 30%), medium entropy set (40% in the middle), and high entropy set (top 30%). We evaluate different models on the split sets of sampled data and report the results in Table 3. When entropy is low, the model with access to the board has substantially higher log-likelihood than the model without an encoder. When entropy is high, the importance of the encoder is reduced. Together, this implies that our model can capture important context-sensitive characteristics of how people ask questions.
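The entropy-based split can be sketched as below. The sketch assumes a uniform distribution over the ship layouts consistent with a board, so that the entropy is the base-2 logarithm of the number of consistent layouts; both this assumption and the made-up layout counts are for illustration only.

```python
import math


def hypothesis_entropy(num_consistent_layouts):
    """Entropy in bits of a uniform distribution over consistent layouts."""
    return math.log2(num_consistent_layouts)


def split_by_entropy(boards):
    """Split {board_id: entropy} into low (bottom 30%), mid (40%), high (top 30%)."""
    ranked = sorted(boards, key=boards.get)
    n = len(ranked)
    low_end, high_start = (3 * n) // 10, (7 * n) // 10
    return ranked[:low_end], ranked[low_end:high_start], ranked[high_start:]


counts = {f"board{i}": 2 ** i for i in range(10)}  # made-up layout counts
entropies = {b: hypothesis_entropy(c) for b, c in counts.items()}
low, mid, high = split_by_entropy(entropies)
```

Boards in `low` admit few remaining layouts, which is the regime where access to the board matters most for predicting questions.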

| Model | LL on low entropy | LL on mid entropy | LL on high entropy |
| --- | --- | --- | --- |
| Full model | -2.990 | -3.190 | -3.414 |
| decoder only | -3.312 | -3.397 | -3.494 |

Table 3: Log-likelihood (LL) on different splits of the sampled evaluation set based on the uncertainty of the board. More comparisons are provided in Appendix B Table 5.

5.3 Question generation

Beyond predicting which questions people ask, we also explore the ability of our model to generate novel questions from scratch. To alleviate the issue of sparse training signal in typical program induction tasks, we propose a novel grammar-enhanced method in this experiment. We formalize the task of asking questions as a Markov Decision Process (MDP) of generating strings using the context-free grammar, and solve it with reinforcement learning.

5.3.1 Grammar-enhanced question asking

The neural program generation system uses the context-free grammar specified in [16]. The start state is the start symbol “A”, and generation proceeds by expanding the left-most non-terminal symbol. At each step, one production rule is chosen and applied to the first non-terminal. This procedure is repeated on the resulting string until there are no non-terminals left. For example, the program “(> (size Blue) 3)” can be generated from the start symbol as follows: A → B → (> N N) → (> (size S) N) → (> (size Blue) N) → (> (size Blue) 3), where A, B, N, and S are non-terminals.
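The left-most derivation in this example can be reproduced with a few lines of Python. The sketch below hard-codes the sequence of chosen right-hand sides and only handles the four non-terminals named in the text; it does not check that each chosen rule actually belongs to the non-terminal being expanded, which a full implementation of the grammar in [16] would need to do.

```python
import re

NONTERMINALS = {"A", "B", "N", "S"}
TOKEN = re.compile(r"[()]|[^\s()]+")  # parentheses or runs of other characters


def expand_leftmost(tokens, rule_rhs):
    """Replace the left-most non-terminal in `tokens` with the tokens of `rule_rhs`."""
    for i, tok in enumerate(tokens):
        if tok in NONTERMINALS:
            return tokens[:i] + TOKEN.findall(rule_rhs) + tokens[i + 1:]
    raise ValueError("no non-terminal left to expand")


def render(tokens):
    """Join tokens back into LISP-style text."""
    text = " ".join(tokens)
    return text.replace("( ", "(").replace(" )", ")")


# Right-hand sides chosen at each step of the derivation from the text.
chosen = ["B", "(> N N)", "(size S)", "Blue", "3"]
state = ["A"]
steps = [render(state)]
for rhs in chosen:
    state = expand_leftmost(state, rhs)
    steps.append(render(state))
```

Each element of `steps` is one intermediate string of the derivation, ending in the finished program once no non-terminals remain.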

Neural program generation is used as the stochastic agent for choosing production rules at each step. Instead of outputting the probabilities of symbols, as in the previous sections, we use the hidden vector corresponding to the first input token as the probabilities of all actions. At each step, a mask is applied on the probabilities so that only valid actions can be chosen.
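One common way to realize such a mask is to exclude invalid rules before normalizing, so that the remaining probabilities still sum to one. The sketch below takes that route as an assumption; the rule list and the logits are invented for illustration, and a rule counts as valid when its left-hand side matches the non-terminal being expanded.

```python
import math


def masked_policy(logits, valid):
    """Softmax over `logits`, with invalid actions forced to probability 0."""
    best = max(l for l, ok in zip(logits, valid) if ok)
    exps = [math.exp(l - best) if ok else 0.0 for l, ok in zip(logits, valid)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical production rules as (left-hand side, right-hand side) pairs.
rules = [("A", "B"), ("B", "(> N N)"), ("N", "(size S)"), ("N", "3"), ("S", "Blue")]
logits = [2.0, 1.0, 0.5, 0.5, 3.0]  # made-up network outputs, one per rule

leftmost_nonterminal = "N"
valid = [lhs == leftmost_nonterminal for lhs, _ in rules]
probs = masked_policy(logits, valid)
```

Because invalid rules can never be selected, every completed derivation is a grammatical program, which removes the need to learn the syntax from reward alone.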

Once a string with no non-terminals is produced, we calculate the reward based on the energy value of the question. We rescale the energy value to a suitable range for the reward and clamp it to a fixed interval. The model is optimized with the REINFORCE algorithm [20]. In order to produce higher-quality questions, we manually tune the information-related parameter of the energy function from [16] to make it more information-seeking in this experiment. This process is described in Appendix A.2.
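For reference, the REINFORCE gradient estimator in its standard form is shown below, under the assumption that a derivation is a sequence of rule choices $a_1, \dots, a_T$ made in partial-derivation states $s_1, \dots, s_T$ and that a single reward $R$ is received when the program is complete. The baseline $b$ is a common variance-reduction term and is an addition for illustration; the text does not say whether one is used.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{a_{1:T} \sim \pi_\theta}
    \left[ (R - b) \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \right]
```

Intuitively, rule choices that lead to a higher-than-baseline reward have their log-probabilities pushed up, and choices that lead to a lower reward are pushed down.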

| Model | avg. EIG | EIG>0.9 | EIG>0 | #unique |
| --- | --- | --- | --- | --- |
| supervised | 0.940 | 79.50% | 99.90% | 5 |
| sequence RL | 0.833 | 69.95% | 87.65% | 18 |
| grammar enhanced RL | 1.266 | 84.75% | 91.35% | 141 |

Table 4: Evaluation results of experiment 3. Our grammar enhanced model is compared with a supervised baseline from experiment 2 and a sequence-generation RL baseline. The models are compared in terms of average expected information gain (EIG), the fraction of questions with EIG greater than 0.9 and greater than 0, and the number of unique questions generated.

5.3.2 Results and discussion

We compare the models on their ability to generate questions with high expected information gain (EIG), low energy, and high creativity. We use the full model trained in the last experiment as a supervised baseline. We also implement a reinforcement learning agent that specifies the program symbol by symbol without direct access to the grammatical rules. We first pretrain this alternative agent on the same dataset as in Experiment 2, because we find the agent struggles to generate grammatical questions if trained from scratch.
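The EIG metric used in this comparison can be illustrated with a small sketch. It follows the standard definition (prior entropy minus expected posterior entropy over answers) and assumes a uniform prior over a finite hypothesis set and deterministic answers; the four hypotheses about the blue ship's size are invented for the example.

```python
import math
from collections import Counter


def expected_information_gain(hypotheses, answer_fn):
    """EIG in bits of a question under a uniform prior over `hypotheses`.

    `answer_fn(h)` returns the answer the question would receive if `h`
    were the true board. Each answer leaves a uniform posterior over the
    hypotheses consistent with it.
    """
    n = len(hypotheses)
    groups = Counter(answer_fn(h) for h in hypotheses)
    prior_entropy = math.log2(n)
    expected_posterior = sum((size / n) * math.log2(size) for size in groups.values())
    return prior_entropy - expected_posterior


# Made-up hypothesis space: four equally likely layouts, described by blue ship size.
hypotheses = [{"blue_size": 2}, {"blue_size": 3}, {"blue_size": 4}, {"blue_size": 4}]

eig_compare = expected_information_gain(hypotheses, lambda h: h["blue_size"] > 3)
eig_size = expected_information_gain(hypotheses, lambda h: h["blue_size"])
eig_trivial = expected_information_gain(hypotheses, lambda h: h["blue_size"] >= h["blue_size"])
```

A question whose answer is the same under every hypothesis, such as asking whether a ship is at least as long as itself, has zero EIG, which is why the EIG>0 column measures how often a model avoids uninformative questions.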

The models are evaluated on randomly sampled boards, and the results are shown in Table 4. From the table, the grammar-enhanced RL model generates informative and creative questions. The number of unique questions generated on the evaluation boards by the grammar-enhanced model is much larger than for the other two models, with a much higher average EIG. Surprisingly, the questions generated by the supervised model have the highest non-zero-EIG ratio. We find that the supervised model discovers a trivial pattern in the training data that scores well under the energy function in general, which is to simply ask the orientation of one ship (“(orient X)”). Although the supervised model can achieve high log-likelihood on human questions, it cannot generate creative and diverse questions given new boards. A possible reason is that since multiple reference questions are provided along with an input board in the training data, simple and effective questions such as “(orient X)” are sampled frequently, and they become a common pattern for the supervised model to recognize. With sufficient exploration through reinforcement learning, neural program generation can learn to generate more interesting and creative questions.

We also provide example questions generated by our models in Figure 4a, which include clever questions such as “Where is the bottom right of all the purple and blue tiles?” or “What is the size of the blue ship minus the purple ship?” The model also sometimes generates meaningless questions such as “Is the blue ship shorter than itself?” Additional examples of generated questions are provided in Appendix B.

With the grammar enhanced framework, we can also guide the model to ask different types of questions, consistent with the goal-directed nature and flexibility of human question asking. The model can be queried for certain types of questions by providing different start conditions to the model. Instead of starting derivation from the start symbol “A”, we can start derivation from an intermediate state such as “B” for Boolean questions or a more complicated “(and B B)” for the composition of two Boolean questions. In Figure 4b, we show examples where the model is asked to generate four specific types of questions: true/false questions, number questions, location-related questions, and compositional true/false questions. From these examples we see the model can flexibly adapt to new situations and generate meaningful questions.

We also compare the questions generated by our models with human questions, and two randomly-picked examples are shown in Figure 4c. These examples also demonstrate that our models, especially the grammar enhanced model, are able to generate clever and human-like questions. More examples are provided in the supplementary materials.

Figure 4: Examples of model-generated questions. The natural language translations of the question programs are provided for interpretation. (a) shows three novel questions generated by the grammar enhanced model, (b) shows an example of how the model generates different types of questions by conditioning the input to the decoder, (c) shows questions generated by different models as well as human annotators.

6 Conclusion

This paper introduced a neural program generation framework for modeling human behavior in a rich question asking task, and generating creative human-like questions with grammar-enhanced reinforcement learning. Programs provide models with a “machine language of thought” for compositional thinking, and neural networks provide an efficient means of question generation. We demonstrate the effectiveness of our method in extensive experiments covering a range of human question asking abilities.

The current model has important limitations. It cannot generalize to systematically different scenarios, and it sometimes generates meaningless questions. We plan to further explore the model’s compositional abilities in future work. Another promising direction is to model question asking and question answering jointly within one framework, which could guide the model to a richer sense of the question semantics. As another future direction, we would like to extend our framework to real-world scenarios such as dialog systems and more open-ended human question asking.


This work was supported by Huawei. We are grateful to Todd Gureckis and Anselm Rothe for helpful comments and conversations. We thank Jimin Tan for his work on the initial version of the RL-based training procedure.


  • Ba et al. [2016] Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
  • Bunel et al. [2018] Rudy Bunel, Matthew Hausknecht, Jacob Devlin, Rishabh Singh, and Pushmeet Kohli. Leveraging grammar and reinforcement learning for neural program synthesis. In International Conference on Learning Representations, 2018.
  • Chali and Hasan [2015] Yllias Chali and Sadid A Hasan. Towards topic-to-question generation. Computational Linguistics, 41(1):1–20, 2015.
  • Fodor [1975] Jerry A. Fodor. The Language of Thought. Harvard University Press, 1975.
  • Graves et al. [2014] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
  • Gureckis and Markant [2009] Todd Gureckis and Doug Markant. Active learning strategies in a spatial concept learning game. In Proceedings of the Annual Meeting of the Cognitive Science Society, volume 31, 2009.
  • Heilman and Smith [2010] Michael Heilman and Noah A Smith. Good question! statistical ranking for question generation. In Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 609–617. Association for Computational Linguistics, 2010.
  • Johnson et al. [2017] Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2989–2998, 2017.
  • Lake and Baroni [2018] Brenden M Lake and Marco Baroni. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning, 2018.
  • Lee et al. [2018] Sang-Woo Lee, Youngjoo Heo, and Byoung-Tak Zhang. Answerer in questioner’s mind: Information theoretic approach to goal-oriented visual dialog. In Advances in neural information processing systems, 2018.
  • Mao et al. [2019] Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B Tenenbaum, and Jiajun Wu. The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations, 2019.
  • Mostafazadeh et al. [2016] Nasrin Mostafazadeh, Ishan Misra, Jacob Devlin, C. Lawrence Zitnick, Margaret Mitchell, Xiaodong He, and Lucy Vanderwende. Generating natural questions about an image. In Annual Meeting of the Association for Computational Linguistics, pages 1802–1813, 2016.
  • Premack and Woodruff [1978] David Premack and Guy Woodruff. Does the chimpanzee have a theory of mind? Behavioral and brain sciences, 1(4):515–526, 1978.
  • Ranzato et al. [2016] Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. In International Conference on Learning Representations, 2016.
  • Reed and De Freitas [2016] Scott Reed and Nando De Freitas. Neural programmer-interpreters. In International Conference on Learning Representations, 2016.
  • Rothe et al. [2017] Anselm Rothe, Brenden M Lake, and Todd Gureckis. Question asking as program generation. In Advances in Neural Information Processing Systems, pages 1046–1055, 2017.
  • Rothe et al. [2018] Anselm Rothe, Brenden M Lake, and Todd M Gureckis. Do people ask good questions? Computational Brain & Behavior, 1(1):69–89, 2018.
  • Serban et al. [2016] Iulian Vlad Serban, Alberto García-Durán, Caglar Gulcehre, Sungjin Ahn, Sarath Chandar, Aaron Courville, and Yoshua Bengio. Generating factoid questions with recurrent neural networks: The 30m factoid question-answer corpus. In Annual Meeting of the Association for Computational Linguistics, pages 588–598, 2016.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • Williams [1992] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Yao et al. [2018] Kaichun Yao, Libo Zhang, Tiejian Luo, Lili Tao, and Yanjun Wu. Teaching machines to ask questions. In International Joint Conferences on Artificial Intelligence, pages 4546–4552, 2018.
  • Yin and Neubig [2017] Pengcheng Yin and Graham Neubig. A syntactic neural model for general-purpose code generation. In Annual Meeting of the Association for Computational Linguistics, pages 440–450, 2017.
  • Yuan et al. [2017] Xingdi Yuan, Tong Wang, Caglar Gulcehre, Alessandro Sordoni, Philip Bachman, Sandeep Subramanian, Saizheng Zhang, and Adam Trischler. Machine comprehension by text-to-text neural question generation. In Workshop on Representation Learning for NLP, 2017.
  • Zhong et al. [2018] Victor Zhong, Caiming Xiong, and Richard Socher. Seq2sql: Generating structured queries from natural language using reinforcement learning. CoRR, abs/1709.00103, 2018.

Appendix A Experimental Settings

A.1 Reasoning in Synthetic Settings

In this experiment, we use for the model encoder. Each word is embedded with dimension vectors in the decoder. The decoder has layers, each multi-head attention module has heads, and . The model is trained for epochs using Adam optimizer with initial learning rate at and a batch size is set as .

To further examine the model’s ability on compositionality task, we evaluate another version of the model which replaces the decoder with two linear transformations to directly predict {topleft, bottomright}, and {Blue, Red, Color}. With the hidden representation of the encoder in equation 1, and are calculated as


where , and is the flattened vector of .
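The decoder-free variant described above can be sketched as two independent linear heads applied to the flattened encoder output. This is only an illustration: the encoder output shape, the use of bias terms, the softmax normalization, and all names (`predict_heads`, `W_loc`, `W_col`) are assumptions, since the original equation is not recoverable here.

```python
import numpy as np

# Label sets taken from the text; everything else below is assumed.
LOCATIONS = ["topleft", "bottomright"]
COLORS = ["Blue", "Red", "Color"]

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_heads(H, W_loc, b_loc, W_col, b_col):
    """Flatten the encoder output H and apply two separate linear heads."""
    h = H.reshape(-1)                    # flattened encoder representation
    p_loc = softmax(W_loc @ h + b_loc)   # distribution over LOCATIONS
    p_col = softmax(W_col @ h + b_col)   # distribution over COLORS
    return p_loc, p_col

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 6, 8))           # hypothetical encoder output shape
d = H.size
W_loc, b_loc = 0.01 * rng.normal(size=(len(LOCATIONS), d)), np.zeros(len(LOCATIONS))
W_col, b_col = 0.01 * rng.normal(size=(len(COLORS), d)), np.zeros(len(COLORS))
p_loc, p_col = predict_heads(H, W_loc, b_loc, W_col, b_col)
```

Because each head predicts its label independently, this variant has no mechanism for composing program tokens, which is what makes it a useful contrast for the compositionality test.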

A.2 Estimating the Distribution of Human Questions

In this experiment, the model encoder has the same hyper-parameters as in the first experiment. We increase the size of the decoder by setting number of layers to , number of heads to , and set . The model is also trained for epochs using Adam optimizer with the same initial learning rate at . In order to better familiarize the model with grammar, we also pretrain the decoder for epochs on a larger set of question programs. This pretraining corpus is first uniformly sampled from the PCFG grammar defining questions, then we calculate the average energy value of each program on boards, and keep the top unique questions.
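The construction of the pretraining corpus (sample from a grammar, score each program averaged over boards, keep the best unique programs) can be sketched as follows. The grammar here is a toy stand-in, not the question grammar used in the paper; the uniform rule probabilities, the depth limit, the convention that a higher score is better, and the names `TOY_GRAMMAR`, `sample`, `build_corpus`, and `score_fn` are all hypothetical.

```python
import random

# Toy grammar (an assumption): each nonterminal maps to a list of expansions.
TOY_GRAMMAR = {
    "A": [["(size ", "S", ")"], ["(color ", "L", ")"], ["(== ", "A", " ", "A", ")"]],
    "S": [["Blue"], ["Red"], ["Purple"]],
    "L": [["1A"], ["2B"], ["3C"]],
}

def sample(symbol, rng, depth=0, max_depth=3):
    """Expand `symbol` by choosing rules uniformly at random."""
    if symbol not in TOY_GRAMMAR:
        return symbol                       # terminal string
    rules = TOY_GRAMMAR[symbol]
    if depth >= max_depth:
        # At the depth limit, drop self-recursive rules so sampling terminates.
        rules = [r for r in rules if symbol not in r]
    rule = rng.choice(rules)
    return "".join(sample(s, rng, depth + 1, max_depth) for s in rule)

def build_corpus(n_samples, top_k, score_fn, boards, seed=0):
    """Sample programs, deduplicate, and keep the top_k by average score."""
    rng = random.Random(seed)
    unique = {sample("A", rng) for _ in range(n_samples)}

    def avg_score(program):
        return sum(score_fn(program, b) for b in boards) / len(boards)

    # Sort by descending average score; ties are broken alphabetically.
    ranked = sorted(unique, key=lambda p: (-avg_score(p), p))
    return ranked[:top_k]

# Placeholder scorer: prefers shorter programs and ignores the board.
corpus = build_corpus(200, 5, lambda p, b: -len(p), boards=[None, None])
```

In the actual setup the scorer would be the energy model evaluated on real boards; the length-based placeholder only makes the sketch runnable.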

For the evaluation set of human questions, we found that some simple questions become complicated in program form. For example, question “How many ship pieces are there in the second column?” will be translated to “(++ (map (lambda y (and (== (colL y) 2) (not (== (color y) Water)))) (set All_Tiles)))”. Such complicated programs score very poorly according to the energy function, so they do not appear in the training set. As a result, the average log-likelihood is extremely skewed by these unseen questions. For a more robust evaluation, we remove the last human questions with low energy values.
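To make the example program concrete, here is a rough Python analogue of what it computes: it maps a predicate over every tile and sums the results. The board layout, the 1-indexed columns, and the reading of `++` as a sum over booleans are assumptions made for illustration.

```python
WATER = "Water"

# Hypothetical 3x3 board; each cell holds the tile color.
board = [
    ["Water", "Blue", "Water"],
    ["Water", "Blue", "Red"],
    ["Water", "Water", "Red"],
]

def count_ship_tiles_in_column(board, col):
    """Analogue of
    (++ (map (lambda y (and (== (colL y) col) (not (== (color y) Water))))
             (set All_Tiles)))
    """
    # (set All_Tiles): enumerate every tile with 1-indexed row and column.
    all_tiles = [
        (r, c, color)
        for r, row in enumerate(board, start=1)
        for c, color in enumerate(row, start=1)
    ]
    # map + lambda: True when the tile is in `col` and is not water.
    flags = [c == col and color != WATER for (_, c, color) in all_tiles]
    # ++ : sum the boolean results.
    return sum(flags)

answer = count_ship_tiles_in_column(board, 2)
```

A natural-language question of eight words thus needs a nested map, lambda, and conjunction, which is why such programs receive poor scores from a model that penalizes complexity.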

A.3 Question Generation

The neural model for this experiment has the same hyper-parameters as in the last experiment, and is optimized with the REINFORCE algorithm [20] with initial learning rate 0.0001. A baseline for REINFORCE is established simply as the average of the rewards in a batch of size . To encourage exploration, we also apply an ε-greedy strategy with ε set to at the beginning, and gradually decreased to as training continues. This model is trained for epochs, within each epoch the model passes different boards.
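The training signal described in this paragraph can be sketched in a few lines: a REINFORCE surrogate loss whose baseline is the batch-mean reward, plus an exploration step with a decaying rate. The linear decay, the choice to sample from the policy (rather than take the argmax) in the non-exploring branch, and all function names and numeric values are assumptions, not details given in the text.

```python
import numpy as np

def epsilon_schedule(step, total_steps, eps_start, eps_end):
    """Linearly anneal the exploration rate from eps_start to eps_end."""
    frac = min(step / max(total_steps, 1), 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy_sample(probs, eps, rng):
    """With probability eps pick a uniformly random token;
    otherwise sample a token from the policy distribution."""
    if rng.random() < eps:
        return int(rng.integers(len(probs)))
    return int(rng.choice(len(probs), p=probs))

def reinforce_loss(log_probs, rewards):
    """REINFORCE surrogate loss with a batch-mean baseline.

    log_probs: shape (B,), summed log-probability of each sampled program.
    rewards:   shape (B,), reward of each sampled program.
    """
    advantages = rewards - rewards.mean()    # subtract the batch-average baseline
    return -(advantages * log_probs).mean()

# Tiny worked example with made-up numbers.
rewards = np.array([1.0, 2.0, 3.0])
log_probs = np.array([-1.0, -2.0, -3.0])
loss = reinforce_loss(log_probs, rewards)
```

In a real training loop `log_probs` would come from an autodiff framework so that the gradient of this loss flows into the decoder parameters; NumPy is used here only to show the arithmetic.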

From some preliminary experiments, we find that the models have a strong preference for generating similar programs of relatively low complexity when the original energy values are used as rewards. Thus, we tune two parameters of the energy model as mentioned in section 5.3.1, which are the two parameters corresponding to information seeking features (denoted as and in the original paper Rothe et al. [16]). We increase these two parameters from until the reinforcement learning models are able to generate a diverse set of questions.

The sequence RL baseline, which directly generates sequences with the decoder, is trained with the MIXER algorithm [14], which is a variant of the REINFORCE algorithm widely used in sequence generation tasks. MIXER provides a smooth transition from supervised learning to reinforcement learning. This model is pretrained for epochs, and trained with RL for epochs.
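The "smooth transition" in MIXER comes from an annealing schedule: after supervised pretraining, the first s tokens of each sequence are trained with cross-entropy and the remaining tokens with REINFORCE, and s shrinks over time. A minimal sketch of that schedule follows; the function names and every numeric value in the example are hypothetical, since the hyper-parameters used for this baseline are not recoverable here.

```python
def mixer_xent_steps(epoch, n_pretrain, seq_len, delta, epochs_per_stage):
    """Number of leading tokens trained with cross-entropy at `epoch`."""
    if epoch < n_pretrain:
        return seq_len                       # pure supervised pretraining
    stage = (epoch - n_pretrain) // epochs_per_stage + 1
    return max(seq_len - stage * delta, 0)   # shrink the supervised prefix by delta per stage

def loss_types(epoch, n_pretrain, seq_len, delta, epochs_per_stage):
    """Per-token loss assignment: cross-entropy prefix, REINFORCE suffix."""
    s = mixer_xent_steps(epoch, n_pretrain, seq_len, delta, epochs_per_stage)
    return ["xent"] * s + ["reinforce"] * (seq_len - s)

# Example with made-up settings: 2 pretraining epochs, sequences of 6 tokens,
# and the supervised prefix shortened by 2 tokens after every epoch.
cfg = dict(n_pretrain=2, seq_len=6, delta=2, epochs_per_stage=1)
schedule = [mixer_xent_steps(e, **cfg) for e in range(6)]
```

Once the supervised prefix reaches length zero, every token is trained with the reinforcement signal alone.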

Appendix B Additional results

For the experiment on estimating the distribution of human questions (Experiment 5.2), Table 5 reports the log-likelihood of each model on evaluation sets with different levels of uncertainty.

| Model | LL on low entropy | LL on mid entropy | LL on high entropy |
|---|---|---|---|
| Full model | -2.990 | -3.190 | -3.414 |
| no pretrain | -3.015 | -3.210 | -3.428 |
| LSTM decoder | -3.044 | -3.209 | -3.416 |
| MLP encoder | -3.293 | -3.383 | -3.477 |
| decoder only | -3.312 | -3.397 | -3.494 |

Table 5: Log-likelihood (LL) on different splits of the sampled evaluations based on the uncertainty of the board.

Here we provide more examples of questions generated by our models in the generation experiment (Experiment 5.3). Figures 5, 6, and 7 show the contexts and the generated questions.

Figure 5: Novel questions generated by the grammar enhanced model.
Figure 6: Generated questions of different types by controlling the start condition.
Figure 7: Comparisons of questions generated by our models with human questions.