Interactive Classification by Asking Informative Questions

11/09/2019 ∙ by Lili Yu, et al. ∙ Princeton University, ASAPP Inc., Cornell University

Natural language systems often rely on a single, potentially ambiguous input to make one final prediction, which may simplify the problem but degrade end user experience. Instead of making predictions with the natural language query only, we ask the user for additional information using a small number of binary and multiple-choice questions in order to better help users accomplish their goals while minimizing their effort. At each turn, our system decides between asking the most informative question or making the final classification prediction. Our approach enables bootstrapping the system using simple crowdsourcing annotations without expensive human-to-human interaction data. Evaluation demonstrates that our method substantially increases classification accuracy, while effectively balancing the number of questions with the improvement to final accuracy.




1 Introduction

Responding to natural language queries through simple, single-step classification has been studied extensively in many applications, including user intent prediction (csbot; intentprediction), and information retrieval (kang2003query; rose2004understanding). Typical methods rely on a single user input to produce an output, missing an opportunity to interact with the user to reduce ambiguity and improve the final prediction. For example, users may under-specify a request due to incomplete understanding of the domain; or the system may fail to correctly interpret the nuances of the input query. In both cases, a low quality input could be mitigated by further interaction with the user.

Figure 1: Two example use cases of the interactive classification system: providing customers with the best troubleshooting suggestion (left) and helping users identify bird species from text interactions (right). The top parts show example classification labels: FAQ documents or bird species, where the ground-truth label of each interaction example is shaded. The lower parts show a typical user interaction with the system. The user starts with an initial natural language query. At each step, the system asks a clarification question. The interaction ends when the system returns an output label.

In this paper we propose a simple but effective interaction paradigm consisting of a sequence of binary and multiple-choice questions that allow the system to ask the user for more information. Figure 1 illustrates the types of interaction supported by this method, showcasing the opportunity for clarification while avoiding much of the complexity involved in unrestricted natural language interactions. Following a natural language query from the user, our system decides between posing another question to obtain more information or finalizing the current prediction. Unlike previous work, which assumes access to full interaction data (Wu18:q20rinna; 2018arXiv180807645H; LeeAnswererDialogue; rao2018), we are interested in bootstrapping an interaction system with simple and relatively inexpensive annotation. This is particularly important in real-world applications, such as virtual assistants, where the supported classification labels are subject to change and would therefore require frequent re-annotation.

We propose an effective approach designed for interaction efficiency and simple system bootstrapping. Our approach adopts a Bayesian decomposition of the posterior distributions over classification labels and user responses through the interaction process. Due to the decomposition, we can efficiently compute and select the next question that provides the maximal expected information based on the posteriors. To further balance the potential increase in accuracy with the cost of asking additional questions, we train a policy controller to decide whether to ask additional questions or return a final prediction. Our method also enables separately collecting natural language annotations to model the distributions of class labels and user responses. Specifically, we crowdsource initial natural language queries and question-answer pairs for each class label, alleviating the need for Wizard-of-Oz style dialog annotations (Kelley:1984:IDM:357417.357420; 1604.04562). Furthermore, we leverage the natural language descriptions of class labels, questions and answers to help estimate their correlation and reduce the need for heavy annotation.

We evaluate our method on two public tasks: FAQ suggestion (Shah2018AdversarialDetection) and bird identification using the text and attribute annotations of the Caltech-UCSD Birds dataset (WahCUB_200_2011). The first task represents a virtual assistant application in a trouble-shooting domain, while the second task provides well-defined multiple-choice question annotations and naturally noisy language inputs. Our experiments show that adding user interactions significantly increases the classification accuracy, when evaluating against both a simulator and a real human. With one clarification question, our system obtains a relative accuracy boost of and for FAQ suggestion and bird identification compared to no-interaction baselines on simulated evaluation. Given at most five turns of interaction, our approach improves accuracy by over on both tasks in both simulated and human evaluation.

2 Technical Overview


Our goal is to classify a natural language query as one class label through interaction. To this end, we treat the classification label $y$, the interaction question $q$ and the user response $r$ as random variables. We denote one possible value or assignment of a random variable using subscripts, such as $y_i$ and $q_j$. We use superscripts for the observed value of a random variable at a given time step; for example, $q^t$ is the question asked at time step $t$. Whenever it is clear from the context, we write $p(y_i)$ as short notation for $p(y = y_i)$. For example, $p(r \mid q, y)$ denotes the conditional distribution of $r$ given $q$ and $y$, and $p(r_k \mid q_j, y_i)$ further specifies the corresponding probability when $r = r_k$, $q = q_j$ and $y = y_i$.

An interaction starts with the user providing an initial query $x$. At each turn $t$, the system can choose to select a question $q^t$, to which the user responds with $r^t$. We consider two types of questions in this work: binary and multiple-choice questions. The predefined set of possible answers for a question $q$ is $\mathcal{R}(q)$: {yes, no} for binary questions, or a predefined set of question-specific values for multiple-choice questions. We denote an interaction up to turn $t$ as $X^t = (x, \langle q^1, r^1 \rangle, \dots, \langle q^t, r^t \rangle)$, and the set of possible class labels as $\mathcal{Y}$. Figure 1 shows example interactions in our two evaluation domains.


We model the interactive process using a parameterized distribution over class labels that is conditioned on the observed interaction (Section 4.1), a question selection criterion (Section 4.2), and a parameterized policy controller (Section 4.5). At each time step $t$, we compute the belief of each label $y \in \mathcal{Y}$ conditioned on the interaction observed so far. The trained policy controller decides between two actions: to return the current best possible label or to obtain additional information by asking a question. The model selects the question that maximizes the information gain. After receiving the user response, the model updates the beliefs over the classification labels.


We use crowdsourced non-interactive data to bootstrap model learning. The crowdsourced data collection consists of two sub-tasks. First, we obtain a set of initial user queries for each label $y$. For example, for the FAQ 'How do I sign up for Sprint Global Roaming', an annotator can come up with an initial query such as 'Travel out of country'. Second, we ask annotators to assign text tags to each label, and convert these tags into a set of question-answer pairs $\langle q, r \rangle$, where $q$ denotes a templated question and $r$ denotes the answer. For example, the question 'What is your phone operating system?' can pair with one of the following answers: 'IOS', 'Android operating system', 'Windows operating system' or 'Not applicable'. We denote this dataset as $\mathcal{D}$. We describe the data collection process in Section 5. We use this data to train our text embedding model (Section 4.3), to create a user simulator (Section 4.4), and to train the policy controller to minimize the number of interaction turns while achieving high classification accuracy (Section 4.5).


We report classification accuracy of the model, and study the trade-off between accuracy and the number of the turns that the system takes. We run our system against a user simulator and real human users. When performing human evaluation, we additionally collect qualitative ratings of the interactions.

3 Related Work

Learning from Feedback

Much recent work has leveraged human feedback to train natural language processing models, including dialogue learning (DBLP:journals/corr/LiMCRW16), semantic parsing (Artzi:11; 1606.02447; DBLP:journals/corr/IyerKCKZ17) and text classification (DBLP:journals/corr/abs-1805-03818). These works collect user feedback after the model-prediction stage and treat the feedback as extra offline training data to improve the model. In contrast, our model leverages user interaction and makes its prediction in an online fashion. Human feedback has also been incorporated into reinforcement learning (rlhuman; visionnavigate). For instance, (rlhuman) learns a reward function from human preferences to provide richer rewards, and (visionnavigate) uses language-illustrated subgoals as indirect interventions instead of conventional expert demonstrations.

Modeling Interaction

Language-based interaction has recently attracted a lot of attention in areas such as visual question answering  (DeVriesGuessWhatDialogue; LeeAnswererDialogue; ChattopadhyayEvaluatingGames; DasLearningLearning; 1902.08355; MIGRL), SQL generation (P18-1124; 1910.05389), information retrieval (SpokenContentRetrieval; clarificatonIR) and multi-turn text-based question answering  (rao2018; ReddyCoQA:Challenge; quac). Many of these works require learning from full-fledged dialogues (Wu18:q20rinna; 2018arXiv180807645H; LeeAnswererDialogue; rao2018) or conducting Wizard-of-Oz dialog annotations (Kelley:1984:IDM:357417.357420; 1604.04562). Instead of utilizing unrestricted but expensive conversations, we limit ourselves to a simplified type of interaction consisting of multiple-choice and binary questions. This allows us to reduce the complexity of data annotation while still achieving efficient interaction and high accuracy.

Our question selection method is closely related to prior work (LeeAnswererDialogue; rao2018; kovashka2013attribute; ferecatu2007interactive; 1805.00145). For instance, (kovashka2013attribute) refine image search by asking users to compare visual qualities against selected reference images, and (LeeAnswererDialogue) perform object identification in images by posing binary questions about the object or its location. Both works, like our system, use an entropy reduction criterion to select the best question. Our work makes use of a Bayesian decomposition of the joint distribution and can be easily extended to other model-driven selection criteria. We also highlight the modeling of natural language to estimate information gain and probabilities. More recently, (rao2018) propose a learning-to-ask model that estimates the expected utility obtained by a question. Our selection method can be considered a special case where entropy is used as the utility. In contrast to (rao2018), our work models the entire interaction history instead of a single turn of follow-up questioning. Our model is trained using crowdsourced annotations, while (rao2018) use real user-user interaction data.

Learning 20Q Game

Our task can be viewed as an instance of the popular 20-question game (20Q), which has been applied to a knowledge base of celebrities (chen2018learning; 2018arXiv180807645H). Our work differs in two ways. First, our method models the natural language descriptions of classification targets, questions and answers, instead of treating them as categorical or structured data as in a knowledge base. Second, instead of focusing on the 20Q game (on celebrities) itself, we aim to help users accomplish realistic goals with minimal interaction effort.

4 Method

We maintain a probability distribution $p(y \mid X^t)$ over the set of labels $\mathcal{Y}$. At each interaction step, we first update this belief, decide whether to ask a question or to return the classification output using a policy controller, and, if needed, select a question to ask using information gain.

4.1 Belief Probability Decomposition

We decompose the conditional probability using Bayes rule:

$$p(y \mid X^t) = \frac{p(r^t \mid X^{t-1}, q^t, y)\; p(q^t \mid X^{t-1}, y)\; p(y \mid X^{t-1})}{p(q^t, r^t \mid X^{t-1})} \quad (1)$$

We make two simplifying assumptions for Eq. (1). First, we assume the user response depends only on the provided question $q^t$ and the underlying target label $y$, and is independent of past interactions. This assumption simplifies $p(r^t \mid X^{t-1}, q^t, y)$ to $p(r^t \mid q^t, y)$. Second, we deterministically select the next question given the interaction history (described in Section 4.2). As a result, $p(q^t \mid X^{t-1}, y) = 1$ for the selected question and zero otherwise. These two assumptions allow us to rewrite the decomposition as:

$$p(y \mid X^t) \propto p(y \mid x)\, \prod_{j=1}^{t} p(r^j \mid q^j, y) \quad (2)$$

That is, predicting the classification label given the observed interaction reduces to modeling $p(y \mid x)$ and $p(r \mid q, y)$, i.e., the label probability given the initial query only and the probability of a user response conditioned on the chosen question and class label. This factorization enables us to leverage separate annotations to learn these two components directly, alleviating the need for collecting costly full user interactions.
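Under this factorization, the belief update at each turn is a single multiply-and-renormalize step. The sketch below illustrates it with plain dictionaries; the data layout is our own illustrative assumption, not the paper's implementation.

```python
def update_belief(belief, answer_probs, response):
    """One Bayesian belief update: multiply the prior belief over labels
    by p(response | question, label) and renormalize."""
    # belief: label -> p(label | interaction so far)
    # answer_probs: label -> {answer -> p(answer | question, label)}
    new_belief = {y: belief[y] * answer_probs[y].get(response, 0.0) for y in belief}
    z = sum(new_belief.values())
    return {y: p / z for y, p in new_belief.items()}
```

For example, with two equally likely labels and a question whose 'yes' answer has probability 0.9 under the first label and 0.1 under the second, observing 'yes' shifts the belief to 0.9 versus 0.1.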

4.2 Question Selection using Information Gain

The system selects the question to ask at turn $t$ to maximize the efficiency of the interaction. We use a maximum information gain criterion. Given $X^{t-1}$, we compute the information gain on the classification label $y$ as the decrease in entropy from observing possible answers to a question $q$:

$$IG(y; r \mid q, X^{t-1}) = H(y \mid X^{t-1}) - H(y \mid r, q, X^{t-1})$$

where $H(\cdot \mid \cdot)$ denotes the conditional entropy. Intuitively, the information gain measures the amount of information obtained about the variable $y$ by observing the value of another variable $r$. Because the first entropy term is a constant regardless of the choice of $q$, the selection of $q^t$ is equivalent to

$$q^t = \arg\min_{q} H(y \mid r, q, X^{t-1}) = \arg\min_{q} \sum_{r} p(r \mid q, X^{t-1})\; H(y \mid X^{t-1}, q, r)$$

Here we use the independence assumption stated in Section 4.1 to calculate $p(r \mid q, X^{t-1}) = \sum_{y} p(r \mid q, y)\, p(y \mid X^{t-1})$. Both $p(y \mid X^{t-1})$ and $p(r \mid q, X^{t-1})$ can be iteratively updated using $p(y \mid x)$ and $p(r \mid q, y)$ as the interaction progresses (Eq. 2), resulting in efficient computation of the information gain.
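The selection criterion can be sketched directly from this marginalization. In the snippet below, `answer_model(q, y)` stands in for $p(r \mid q, y)$; the callable interface and dictionary-based belief are illustrative assumptions rather than the paper's implementation.

```python
import math

def entropy(dist):
    """Shannon entropy of a {value: probability} distribution."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def expected_entropy(belief, question, answer_model):
    """Expected posterior entropy of the label after asking `question`,
    marginalizing over possible answers r."""
    answers = set()
    for y in belief:
        answers |= set(answer_model(question, y))
    exp_h = 0.0
    for r in answers:
        # p(r | q, interaction) = sum_y p(r | q, y) p(y | interaction)
        p_r = sum(answer_model(question, y).get(r, 0.0) * belief[y] for y in belief)
        if p_r == 0.0:
            continue
        # posterior belief over labels after observing answer r
        post = {y: answer_model(question, y).get(r, 0.0) * belief[y] / p_r
                for y in belief}
        exp_h += p_r * entropy(post)
    return exp_h

def select_question(belief, questions, answer_model):
    # maximizing information gain == minimizing expected conditional entropy
    return min(questions, key=lambda q: expected_entropy(belief, q, answer_model))
```

A question whose answer distribution differs sharply across labels yields a lower expected entropy and is therefore preferred over an uninformative one.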

4.3 Modeling the Distributions

We model $p(y \mid x)$ and $p(r \mid q, y)$ by encoding the natural language descriptions of questions, answers and classification labels (the text representation of a label is, for instance, the FAQ document or the bird name). That is, we do not simply treat the labels, questions and answers as categorical variables in our model. Instead, we leverage their natural language aspects to better estimate their correlation, reduce the need for heavy annotation and improve our model in low-resource (and zero-shot) scenarios. Specifically, we use a shared neural encoder $\mathrm{enc}(\cdot)$ to encode all texts. Both probability distributions are computed using the dot-product score $s(u, v) = \mathrm{enc}(u) \cdot \mathrm{enc}(v)$, where $u$ and $v$ are two pieces of text.

The probability of predicting the label $y$ given an initial query $x$ is:

$$p(y \mid x) \propto \exp(s(x, y))$$

The probability of an answer $r$ given a question $q$ and label $y$ is a linear combination of the observed empirical probability $\hat{p}(r \mid q, y)$ and a parameterized estimation $p_\theta(r \mid q, y)$:

$$p(r \mid q, y) = \lambda\, \hat{p}(r \mid q, y) + (1 - \lambda)\, p_\theta(r \mid q, y)$$

where $\lambda$ is a hyper-parameter. We use the question-answer annotations for each label to estimate $\hat{p}(r \mid q, y)$ using the empirical count. For example, in the FAQ suggestion task, we collect multiple user responses for each question and class label, and average across annotators to estimate $\hat{p}(r \mid q, y)$ (Section 5). The second term is estimated using texts:

$$p_\theta(r \mid q, y) \propto \exp(w\, s(x_{qr}, y) + b)$$

where $x_{qr}$ is a concatenation of the question $q$ and the answer $r$ (for example, for the templated question 'What is your phone operating system?' and the answer 'IOS', $x_{qr}$ = 'phone operating system IOS'), and $w$ and $b$ are scalar parameters. Because we do not collect complete annotations to cover every label-question pair, $p_\theta$ provides a smoothing of the partially observed counts using the learned encodings.
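The mixture above can be sketched as follows; the plain-list encodings and default parameter values are illustrative assumptions, and a real system would use the neural encoder's vectors.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def answer_probability(empirical, enc_label, enc_qa_by_answer, lam=0.5, w=1.0, b=0.0):
    """p(r | q, y) as a mixture of the empirical answer distribution and a
    softmax over dot-product scores between the label encoding and the
    encoding of each concatenated question+answer text."""
    scores = {r: w * dot(enc_label, e) + b for r, e in enc_qa_by_answer.items()}
    z = sum(math.exp(s) for s in scores.values())
    param = {r: math.exp(s) / z for r, s in scores.items()}
    return {r: lam * empirical.get(r, 0.0) + (1 - lam) * param[r]
            for r in enc_qa_by_answer}
```

Setting `lam=1.0` recovers the purely empirical estimate, while `lam=0.0` relies entirely on the text-based smoothing, which is what covers label-question pairs never annotated.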

We estimate the encoder parameters by pre-training on the dataset $\mathcal{D}$ described earlier. We use this data to create a set of text pairs $(u, y)$ to train the scoring function $s$. For each label $y$, we create pairs with all its initial queries. We also create a pair for each question-answer pair annotated with the label $y$. We perform gradient descent to minimize the cross-entropy loss:

$$\mathcal{L} = \sum_{(u, y)} \Big[ -s(u, y) + \log \sum_{y' \in \mathcal{Y}} \exp(s(u, y')) \Big]$$

The second term requires a summation over all labels $y' \in \mathcal{Y}$. We approximate this sum using negative sampling, which replaces the full set with a sampled subset in each training batch. The parameters $\lambda$, $w$ and $b$ are fine-tuned using reinforcement learning during training of the policy controller (Section 4.5).
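The negative-sampling approximation can be illustrated with a scalar version of this loss; scores here are raw numbers rather than encoder outputs, an assumption made for brevity.

```python
import math

def sampled_softmax_loss(score_pos, scores_neg):
    """Cross-entropy of the positive pair against sampled negatives:
    -log softmax(score_pos), with the partition function approximated
    by the positive score plus the sampled negative scores."""
    z = math.exp(score_pos) + sum(math.exp(s) for s in scores_neg)
    return -math.log(math.exp(score_pos) / z)
```

Increasing the positive score relative to the negatives lowers the loss, which is the gradient signal that pulls matching query-label encodings together.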

4.4 User Simulator

The user simulator provides initial queries to the system, responds to system-initiated clarification questions and judges whether the returned label is correct. The simulator is based on a held-out dataset of tuples of a goal label, a set of initial queries, and a set of question-answer pairs. We estimate the simulator response distribution using smoothed empirical counts from the data. While this data is identical in structure to our training data, we keep it separate from the data used to estimate $p(y \mid x)$ and $p(r \mid q, y)$ (Section 4.3).

At the beginning of an interaction, the simulator selects a target label and samples a query from the associated query set to start the interaction. Given a system clarification question at turn $t$, the simulator responds with an answer sampled from its response distribution. Sampling provides natural noise in the interaction, and our model has no knowledge of the simulator's internals. The interaction ends when the system returns a target. This setup is flexible in that the user simulator can easily be replaced or extended by a real human, and the system can be further trained in a human-in-the-loop setup.
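A minimal simulator with this interface might look as follows; the data layout and field names are illustrative assumptions, not the paper's implementation.

```python
import random

class SimulatedUser:
    """Minimal simulator: samples a target label and an initial query,
    answers questions from a held-out response distribution, and judges
    the system's final prediction."""
    def __init__(self, goals, seed=0):
        # goals: label -> {"queries": [...], "answers": {question: {answer: prob}}}
        self.goals = goals
        self.rng = random.Random(seed)
        self.target = None

    def start(self):
        """Pick a hidden target label and return an initial query for it."""
        self.target = self.rng.choice(list(self.goals))
        return self.rng.choice(self.goals[self.target]["queries"])

    def respond(self, question):
        """Sample an answer from the target's response distribution (noisy)."""
        dist = self.goals[self.target]["answers"][question]
        answers, probs = zip(*dist.items())
        return self.rng.choices(answers, weights=probs)[0]

    def judge(self, label):
        return label == self.target
```

Because the system only sees `start`, `respond` and `judge`, a real human can be swapped in behind the same interface.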

4.5 Policy Controller

The policy controller decides at each turn either to select another question to query the user or to conclude the interaction. This provides a trade-off between exploration by asking questions and exploitation by returning the most probable classification label. The policy controller is a feed-forward network that takes the top-$k$ belief values and the current turn $t$ as its input state. It generates two possible actions, STOP or ASK. When the action is ASK, a question is selected to maximize the information gain; when the action is STOP, the label with the highest belief probability is returned and the interaction ends.

We tune the policy controller using the user simulator (Section 4.4). Algorithm 1 describes the training process. During learning, we use a reward function that provides a positive reward for predicting the correct target at the end of the interaction, a negative reward for predicting the wrong target, and a small negative reward for every question asked. We learn the policy controller, and fine-tune $\lambda$, $w$ and $b$ by back-propagating through the policy gradient. We keep the encoder parameters fixed during this process.

Initialize the text encoder, the models for $p(y \mid x)$ and $p(r \mid q, y)$, and the user simulator
for episode = 1 .. M do
        Sample a target label and an initial query from the dataset
        for t = 1 .. T do
               Compute the belief $p(y \mid X^{t-1})$ (Equation 2)
               if the policy action is STOP then return the top-ranked label, receive the final reward and end the episode
               else if the policy action is ASK then select the question maximizing information gain and observe the simulator's answer
        end for
        Update the policy parameters and $\lambda$, $w$, $b$ using the policy gradient
end for
Algorithm 1 Full model training
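The reward shaping and the stopping decision in Algorithm 1 can be sketched as below; the logistic stand-in for the feed-forward controller and the specific reward values are illustrative assumptions, not the paper's tuned settings.

```python
import math

def stop_probability(topk_probs, turn, weights, bias):
    """Tiny stand-in for the feed-forward controller: a logistic function
    of the top-k belief values and the turn index."""
    features = list(topk_probs) + [turn]
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def episode_reward(correct, num_questions,
                   r_correct=1.0, r_wrong=-1.0, turn_penalty=0.05):
    """Episode reward: terminal reward for the prediction plus a small
    penalty per question asked (values are illustrative)."""
    terminal = r_correct if correct else r_wrong
    return terminal - turn_penalty * num_questions
```

A larger `turn_penalty` pushes the learned policy toward shorter interactions, which is the trade-off explored in Figure 3.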

5 Data Collection

We design a crowdsourcing process to collect data for the FAQ task using Amazon Mechanical Turk; for the Birds domain, we re-purpose an existing dataset. We collect initial queries and tags for each FAQ document.

Initial Query Collection

We ask workers to consider the scenario of searching for an FAQ supporting document using an interactive system. Given a target FAQ, we ask for an initial query that they would provide to such a system, yielding a set of initial queries for each document. We encourage workers to provide incomplete information and to avoid writing a simple paraphrase of the FAQ. This results in more realistic and diverse utterances, because users have limited knowledge of the system and the domain.

Tag Collection

We collect natural language tag annotations for the FAQ documents. First, we use domain experts to define the set of possible free-form tags. The tags are not restricted to a pre-defined ontology and can be a phrase or a single word describing the topic of the document. We heuristically remove duplicate tags to finalize the set. Next, experts heuristically combine some tags into categorical tags, while leaving the rest as binary tags. For example, the tags 'IOS', 'Android operating system' and 'Windows operating system' are combined to form a categorical tag 'phone operating system'. We then use a small set of deterministic, heuristically-designed templates to convert tags into questions. For example, the tag 'international roaming' is converted into the binary question 'Is it about international roaming?'; the categorical tag 'phone operating system' is converted into the multiple-choice question 'What is your phone operating system?'. Finally, we use non-experts to collect user responses to the questions for each FAQ. For binary questions, we ask workers to associate a tag with the FAQ target if they would respond 'yes' to its question. We show the workers a list of ten tags for a given target as well as a 'none of the above' option. Annotating all target-tag combinations is excessively expensive and most pairings are negative, so we rank the tags by relevance to the target using the pre-trained scoring model and show only the top 50 to the workers. For multiple-choice tags, we show the workers the list of possible answers to a tag-generated question for a given FAQ. The workers choose the one answer that they think best applies; they also have the option of choosing 'not applicable'. We provide more data collection statistics in Appendix A.1. The workers do not engage in a multi-round interactive process, which allows for cheap and scalable collection.
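The template step can be sketched as a small function; the wording mirrors the examples above, and the function name and signature are our own illustrative choices.

```python
def tag_to_question(tag, categorical_values=None):
    """Convert a tag into a clarification question via deterministic templates:
    binary tags become yes/no questions, categorical tags become
    multiple-choice questions with a 'Not applicable' fallback."""
    if categorical_values:
        question = "What is your {}?".format(tag)
        return question, list(categorical_values) + ["Not applicable"]
    return "Is it about {}?".format(tag), ["yes", "no"]
```

Because the templates are deterministic, the same tag set always yields the same question bank, so no extra annotation is needed when tags change.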

6 Experimental Setup

Task I: FAQ Suggestion

We use the FAQ dataset from (Shah2018AdversarialDetection). The dataset contains troubleshooting documents obtained by crawling Sprint's technical website. In addition, we collect initial queries and tag annotations using the setup described in Section 5. We split the documents into training, development, and test sets. Only the queries and tag annotations of the training documents are used for pre-training and learning the policy controller. We use the queries and tag annotations of the development and test documents for evaluation only. The classification targets contain all documents during evaluation (the target classes from the development and test sets are hidden from the model during training; this split ensures the model generalizes to newly added or unseen FAQs).

Task II: Bird Identification

Our second set of experiments uses the Caltech-UCSD Birds (CUB-200) dataset (WahCUB_200_2011). The dataset contains bird images for different bird species. Each bird image is annotated with a subset of visual attributes and attribute values pertaining to the color or shape of a particular part of the bird. We take attributes with value count less than 5 as categorical tags, leaving us 8 categorical questions in total. The remaining attributes are treated as binary tags and converted to binary questions. In addition, each image is annotated with image captions describing the bird in the image (reed2016learning). We use the image captions as initial user queries and bird species as labels. Since each image often contains only partial information about the bird species, the data is naturally noisy and provides challenging user interactions. We do not use the images from the dataset for model training. The images are only provided for grounding during human evaluation.


We compare our full model against the following baseline methods:

  • No Interact: the best classification label is predicted using only the initial query. We consider two possible implementations: BM25, a common keyword-based scoring model for retrieval methods (INR-019), and a neural model described in Section 4.3.

  • Random Interact: at each turn, a random question is chosen and presented to the user. After the allotted turns, the classification label is chosen according to the belief.

  • Static Interact: questions are picked without conditioning on the initial user query, using the maximum information criterion. This is equivalent to using a static decision tree to pick the question, leading to the same first question for every interaction, similar to (Utgoff1989; Ling:2004:DTM:1015330.1015369).

  • Variants of Ours: we consider several variants of our full model. First, we replace the policy controller with two termination strategies: one that ends the interaction when the maximum belief probability passes a threshold, and one that ends the interaction after a designated number of turns. Second, we disable the parameterized estimator $p_\theta$ by setting $\lambda$ to 1.
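As a concrete reference point for the keyword-based No Interact baseline, here is a minimal Okapi BM25 scorer. This is the textbook formulation, not necessarily the exact implementation evaluated in the paper.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / n
    df = Counter()                     # document frequency of each term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)                # term frequency within this document
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

Ranking the labels by these scores against the initial query gives the keyword baseline's top-k predictions, with no interaction.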


We evaluate our model by running it against both the user simulator and real humans. Against the user simulator, we evaluate the classification performance of our model and the baselines using Accuracy@k, the percentage of time the correct target appears among the top-k predictions of the model. During human evaluation, we ask annotators to interact with our proposed or baseline models through a web-based interface. Each interaction session starts with a user scenario presented to an annotator (e.g., a bird image or a device-troubleshooting scenario described in text); each scenario is related to a single ground-truth label and serves as the grounding of the user interaction. The annotator inputs an initial query accordingly and then answers the follow-up questions selected by the system. Once the system returns its prediction, the annotator is asked to provide a few ratings of the interaction, such as rationality ('Do you feel that you were understood by the system?'). We present more details of the human evaluation in Appendix A.3.

Implementation Details

We use a fast recurrent neural network to encode texts (Lei:18sru). The policy controller receives three different rewards: a positive reward for returning the correct target, a negative reward for providing the wrong target, and a turn penalty for each question asked. We report averaged results over 3 independent runs for each model variant and baseline. More details about the model implementation and training procedure can be found in Appendix A.2.

FAQ Suggestion Bird Identification
Acc@1 Acc@3 Acc@1 Acc@3
No Interact  (BM25) N.A. N.A.
No Interact  (Neural)
Random Interact
Static Interact
      w/ threshold
      w/ fixed turn
Table 1: Performance of our system against various baselines, evaluated using Accuracy@{1, 3}. All interactive baselines use 5 clarification questions. Best performances are in bold. We report averaged results and standard deviations from 3 independent runs for each model variant and baseline.

Figure 2: Accuracy@1 (y-axis) against turns of interactions (x-axis) for FAQ Suggestion (left) and Bird Identification (right) tasks
FAQ Suggestion Bird Identification
Count Acc@1 Rationality Count Acc@1 Rationality
Ours 60 59% 0.81 60 45% 0.95
Ours (Fixed Turn) 55 56% 0.45 55 37% 0.55
Static Interact 55 45% 0.41 55 28% 0.85
Table 2: Human evaluation results. Count is the total number of interaction examples. The system is evaluated with Accuracy@1 and the rationality score ranging from (strongly disagree) to (strongly agree).

7 Results

Simulated Evaluation

Table 1 shows the performance of our model against the baselines on both tasks when evaluated against the user simulator. The No Interact (BM25) baseline performs worst. The Random Interact and Static Interact baselines barely improve over No Interact (Neural), illustrating the challenge of building an effective interactive model. In contrast, our model and its variants obtain substantial gains in accuracy given a few turns of interaction, with our full model outperforming the No Interact (Neural) baseline by a wide margin using fewer than 5 turns. The two baselines with alternative termination strategies underperform the full model, indicating the effectiveness of the policy controller trained with reinforcement learning. The model variant with $\lambda = 1$, which has fewer probability components leveraging natural language than our full model, achieves worse Accuracy@1. This result, together with the fact that our model outperforms the Static Interact baseline, confirms the importance of modeling natural language for efficient interaction.

Figure 2 shows the trade-off between classification accuracy and the number of turns the system takes. The number of interactions changes as we vary the model termination strategy, which includes varying the turn penalty, the prediction threshold, and the predefined number of turns. Our model with the policy controller or the threshold strategy does not explicitly control the number of turns, so we report the average number of turns across multiple runs for these two models. With only one clarification question, we achieve a relative Accuracy@1 boost over the no-interaction baselines on both tasks. This highlights the value of leveraging human feedback to improve model accuracy in classification tasks. Our approach outperforms the baselines across all numbers of interactions.

Human Evaluation

Table 2 shows the human evaluation results of our full model and two baselines on the FAQ Suggestion and Bird Identification tasks. All three models improve the classification result through interaction. Our full model achieves the best performance: an Accuracy@1 of 59% for FAQ Suggestion and 45% for Bird Identification. Qualitatively, users rate our full model as more rational. The human evaluation demonstrates that our model handles real user interaction more effectively despite being trained with only non-interactive data. Appendix A.3 includes additional details of the human evaluation and example interactions.

Learning behavior

Figure 3 shows the learning curves of our model with the policy controller trained with different turn penalty . We observe interesting exploration behavior during the first training episodes in the middle and the right plots. The models achieve relatively stable accuracy after the early exploration stage. As expected, the three runs end up using different numbers of expected turns due to the choice of different values.

Figure 3: Learning curves of our full model. We show accumulative reward (left), interaction turns (middle), and Accuracy@1 (right) on the test set, where x-axis is the number of episodes run ( trials per episode). The results are compared on different turn penalty .

8 Conclusion

We propose an approach for interactive classification, where users can provide under-specified natural language queries and the system can inquire missing information through a sequence of simple binary or multi-choice questions. Our method uses information theory to select the best question at every turn, and a lightweight policy to efficiently control the interaction. We show how we can bootstrap the system without any interaction data. We demonstrate the effectiveness of our approach on two tasks with different characteristics. Our results show that our approach outperforms multiple baselines by a large margin. In addition, we provide a new annotated dataset for future work on bootstrapping interactive classification systems.


We thank Yi Yang, Nicholas Matthews, Alex Lin and Derek Chen for providing valuable feedback on the paper. We would also like to thank Anna Folinsky and the ASAPP annotation team for their help performing the human evaluation, and Hugh Perkins for his support on the experimental environment setup.


Appendix A Appendices

a.1 Data collection

Query collection qualification

One main challenge of the collection process lies in familiarizing the workers with the set of target documents. To ensure high-quality annotation, we set up a two-step qualification task. The first task is to write paraphrases with complete information. After it, we reduce the pool of workers to . These workers then generate paraphrase queries, familiarizing themselves with the set of documents in the process. We then post the second task (two rounds), where the workers provide initial queries with possibly insufficient information. We select workers after the second qualification task and collect initial queries.

Attribute Collection Qualification

To ensure the quality of the target-tag annotation, we use the pre-trained model to rank the tags and pick the highest-ranked tags (as positives) and the lowest-ranked tags (as negatives) for each target. The worker sees ten tags in total without knowing which ones are negatives. To pass the qualifier, a worker must annotate three targets without selecting any of the negative tags. To make the annotation efficient, we rank tag-document relevance using the model trained on the previously collected query data. We then take the top possible tags for each document and split them into five non-overlapping lists (i.e., ten tags per list). Each list is assigned to four separate workers to annotate. In the FAQ task, we observe that showing only the top-50 tags out of 813 is sufficient. Figure A.1 illustrates this: after the top-50 tags, the curve plateaus and no new tags are assigned to a target. Annotator agreement is reported in Table A.1 as the Cohen's κ score.
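The split of the top-ranked tags into five non-overlapping annotation lists can be sketched as below. The function name and the default sizes (five lists of ten, matching the setup above) are our own; the ranking itself comes from the pre-trained model.

```python
def split_into_annotation_lists(ranked_tags, n_lists=5, tags_per_list=10):
    """Split the top-ranked tags for one document into non-overlapping
    lists, one list per annotation task.

    ranked_tags: tags sorted by model relevance, most relevant first.
    """
    top = ranked_tags[:n_lists * tags_per_list]
    return [top[i * tags_per_list:(i + 1) * tags_per_list]
            for i in range(n_lists)]
```

Each resulting list would then be assigned to four separate workers, as described above.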

Figure A.1: Accumulated number of tags assigned to the targets (y-axis) by AMT workers against tag ranking (x-axis). The ranking indicates the relevance of the target-tag pairs under the pre-trained model. The curve plateaus, suggesting that lower-ranked tags are unlikely to be assigned to the target by the crowdsourcing workers.
Tag Ranks        1-10   11-20  21-30  31-40  41-50
Mean # of tags   3.31   1.45   0.98   0.61   0.48
N.A. (%)         1.9    30.7   43.6   62.1   65.2
Mean κ           0.62   0.54   0.53   0.61   0.61
Table A.1: Target-tag annotation statistics. We show five sets of tags to the annotators; the higher-ranked tags are more likely to be related to the given target. The row "mean # of tags" is the mean number of tags annotated to a target, "N.A." is the percentage of tasks annotated as "none of the above", and "mean κ" is the mean pairwise Cohen's κ score.

a.2 Implementation Details

Learning Components

Here we describe the detailed implementation of the text encoder and the policy controller network. We use a single-layer bidirectional Simple Recurrent Unit (SRU) as the encoder for the FAQ suggestion task and a two-layer bidirectional SRU for the bird identification task. The encoder uses pre-trained fastText (bojanowski2016) word embeddings of size (fixed during training), hidden size , batch size , and dropout rate . We use the Noam learning rate scheduler with initial learning rate , warm-up step , and Noam scaling factor . The policy controller is a two-layer feed-forward network with a hidden layer of dimensions and a ReLU activation; it takes the current step and the top-k belief probabilities as input. We choose and allow a maximum of interaction turns.
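The policy controller described above can be sketched as follows. Since the exact hidden size and k are not reproduced here, the values below are placeholders, and the NumPy implementation is an illustration rather than the paper's training code.

```python
import numpy as np

class PolicyController:
    """Two-layer feed-forward policy: at each turn, decide whether to
    keep asking questions or to stop and predict. Hidden size, k, and
    the random initialization are placeholder assumptions."""

    def __init__(self, k=20, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = k + 1  # current turn index + top-k belief probabilities
        self.w1 = rng.normal(scale=0.1, size=(in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(scale=0.1, size=(hidden, 2))  # {ask, stop}
        self.b2 = np.zeros(2)
        self.k = k

    def forward(self, turn, belief):
        """belief: 1-D array of label probabilities."""
        topk = np.sort(belief)[::-1][:self.k]
        topk = np.pad(topk, (0, self.k - len(topk)))  # pad if fewer labels
        x = np.concatenate(([float(turn)], topk))
        h = np.maximum(0.0, x @ self.w1 + self.b1)  # ReLU hidden layer
        logits = h @ self.w2 + self.b2
        exp = np.exp(logits - logits.max())
        return exp / exp.sum()  # softmax over {ask, stop}
```

Feeding only the top-k belief probabilities keeps the input dimension fixed regardless of the number of classes.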

We use initial queries as well as paraphrase queries to train the encoder, giving around K target-query examples. A breakdown analysis is shown in Table A.2. To assess the value of tags in addition to the initial query, we generate pseudo-queries by combining existing queries with sampled subsets of tags from the targets. This augmentation strategy improves classification performance. Moreover, when we use the set of tags instead of the initial query as the text input for a specific target label, classification performance improves, indicating that the designed tags capture the target label well. Finally, when we concatenate the user's initial query and the tags as the classifier input, we achieve Accuracy@1 of . Our full model achieves while querying only about 5 tags, indicating the effectiveness of our modeling.
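The pseudo-query augmentation above can be sketched as pairing a query with random subsets of its target's tags. The function name, the number of augmented copies, and joining with plain spaces are illustrative choices, not the paper's exact procedure.

```python
import random

def make_pseudo_queries(query, target_tags, n_augmented=3, seed=0):
    """Augment one (query, target) example by appending random subsets
    of the target's annotated tags to the query text."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_augmented):
        k = rng.randint(1, len(target_tags))      # subset size
        subset = rng.sample(target_tags, k)       # tags without replacement
        augmented.append(query + " " + " ".join(subset))
    return augmented
```

Each pseudo-query keeps the original target label, so the classifier sees queries both with and without tag information during training.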

Text Input Init Query Init Query + Tags Init + Paraphrase Query Full Data
init query tags Acc@1 Acc@3    Acc@1 Acc@3      Acc@1 Acc@3 Acc@1 Acc@3
Table A.2: Comparison of suggestion modules trained with different training data. Each model is evaluated on three tasks: first, using initial queries to predict targets; second, using all attribute tags to predict targets; third, using both initial queries and tags as text input to predict targets. Each model is evaluated with Accuracy@{1, 3}.

a.3 Human Evaluation

Each interaction session starts by presenting the annotator with a user scenario (e.g., a bird image or an issue with a phone). The annotator inputs an initial query accordingly and then answers the follow-up questions selected by the system.

FAQ Suggestion

We evaluate prediction accuracy, system rationality, and the number of interaction turns by letting the system interact with human judges. We design a user scenario for each target to present to the worker. At the end of each interaction, the predicted FAQ and the ground truth are presented to the user, as shown in the top right panel of Figure A.2. The user then answers the questions "How natural is the interaction?" and "Do you feel understood by the system during the interactions?" on a scale from (strongly disagree) to (strongly agree), which we record as naturalness and rationality in Table A.3. Our full model performs best on Accuracy@1, naturalness, and rationality. We show human evaluation examples in Table A.4.

Bird Identification

The interface for the bird identification task is similar to that of the FAQ suggestion task. Instead of presenting a scenario, we show a bird image to the user, who needs to describe the bird to find out its category; this is analogous to writing an initial query. We allow the user to reply 'not visible' if a part of the bird is hidden or occluded. Given such a reply, the system stops asking binary questions from the same label group. For example, if the user replies 'not visible' to the question 'does the bird have a black tail?', then questions such as 'does the bird have a yellow tail?' or 'does the bird have a red tail?' will not be asked. At the end of the interaction, the predicted and ground-truth bird images, along with their categories, are presented to the user, as shown in the bottom right panel of Figure A.2. Again, the user fills out a questionnaire similar to the one in the FAQ suggestion task. The bird identification task is very challenging due to its fine-grained categories, where many bird images look almost identical while belonging to different classes. Our full system improves Accuracy@1 from to

against non-interactive baselines after fewer than 3 turns of interaction. For Bird Identification, the annotators sometimes reported that the predicted image is almost identical to the ground-truth image. To better understand the task and the model behavior, we show the confusion matrix of the final model prediction after interaction in Figure A.3. Among the 200 bird classes, there are 21 different kinds of sparrows and 25 different kinds of warblers; identifying these fine-grained bird classes induces most of the model errors. Figure A.4 shows how the confusion matrix changes once interactions are added: the model improves on both distinct and similar bird types.
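The 'not visible' handling described above can be sketched as maintaining a set of hidden label groups and filtering candidate questions against it. The group representation (a question-to-group mapping) is our own assumption for illustration.

```python
def on_answer(answer, question, question_group, hidden_groups):
    """Update the hidden-group set after a user reply. A 'not visible'
    reply hides the question's entire label group."""
    if answer == "not visible":
        hidden_groups.add(question_group[question])
    return hidden_groups

def remaining_questions(questions, question_group, hidden_groups):
    """Drop every question whose label group was marked 'not visible'.

    questions:      list of question ids still available
    question_group: dict question id -> label group (e.g. 'tail color')
    hidden_groups:  set of groups answered 'not visible'
    """
    return [q for q in questions if question_group[q] not in hidden_groups]
```

So once 'does the bird have a black tail?' is answered 'not visible', every other tail-color question is removed from the candidate pool for the rest of the session.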

Figure A.2: User interface for the FAQ suggestion (top) and bird identification (bottom) tasks. The left panels show the interface at the beginning of the interaction and the right panels show the interface at the end of the interaction.
Dataset  Models             Count  Initial Acc@1  Final Acc@1  Naturalness  Rationality
FAQ      Ours               60     30%            59%           0.40        0.81
         Ours (Fixed Turn)  55     30%            56%          -0.17        0.45
         Static Interact    55     28%            45%           0.03        0.41
Bird     Ours               60     21%            45%           0.21        0.95
         Ours (Fixed Turn)  55     13%            37%          -0.50        0.55
         Static Interact    55     25%            28%           0.10        0.85
Table A.3: Human evaluation results on FAQ Suggestion and Bird Identification for our proposed model and several baselines. The three FAQ systems ask 2.8, 3, and 3 turns of questions, respectively; the three Bird systems ask 3.3, 4, and 4 turns. Each system is evaluated on both performance and user experience. Performance includes the initial and final Accuracy@1; the user experience scores include naturalness and rationality for both tasks.
U: How can I text my friends house phone?
S: Text message to landline?
U: Yes.
S: Do you want to send or receive text?
U: Yes.
S: Please read: How do I send a text message to landline?
U: I’m in sprint’s one up and want to upgrade my phone.
S: Do you want annual upgrade option?
U: Yes.
S: Do you want to exercise an option?
U: Yes.
S: Please read: I am enrolled in Sprint One Up. How do I exercise my annual upgrade option?
U: Battery ran out immediately.
S: How can I help you with your phone device?
App or feature information; Fee and charge; Troubleshooting; Not applicable
U: Troubleshooting.
S: Are you using mobile wifi hotspot?
U: Yes.
S: Please read: Does using my phone as a mobile Wi-Fi hotspot drain my battery?
Table A.4: Examples of user interactions for FAQ suggestion human evaluation.
Figure A.3: Confusion matrix of our final output for bird identification task.
Figure A.4: Confusion matrix difference between using the initial query with and without the interactions. High values on the diagonal and low values elsewhere are desirable.