Dialog Policy Learning for Joint Clarification and Active Learning Queries

06/09/2020 ∙ by Aishwarya Padmakumar, et al. ∙ The University of Texas at Austin

Intelligent systems need to be able to recover from mistakes, resolve uncertainty, and adapt to novel concepts not seen during training. Dialog interaction can enable this through clarifications for correction and resolving uncertainty, and through active learning queries to learn new concepts encountered during operation. Prior work on dialog systems has focused exclusively either on learning how to perform clarification/information seeking or on performing active learning. In this work, we train a hierarchical dialog policy to jointly perform both clarification and active learning in the context of an interactive language-based image retrieval task motivated by an online shopping application, and demonstrate that jointly learning dialog policies for clarification and active learning is more effective than the use of static dialog policies for one or both of these functions.




1 Introduction

Figure 1: A stylized sample interaction. In this work, we use simulated dialog acts but the natural language glosses represent how such a dialog could look in an end application.

The ability to understand and communicate in natural language can improve the accessibility of systems such as robots, home devices and computers to non-expert users. Voice assistant applications that understand high level instructions in natural language are increasingly becoming a part of a variety of devices. Since language can often be ambiguous, it would be desirable for such systems to engage in a dialog with the user to clarify their intentions and obtain missing information.

The specific environment in which a system is deployed may also contain domain specific vocabulary or concepts that were not encountered during training. For example, a dialog system in a shopping domain may need to be updated with the introduction of new clothing styles. Hence it is desirable for a system to adapt its models to the operating environment using information from user interactions.

Prior work on dialog and user interaction typically focuses either exclusively on clarification/information-seeking settings young:procieee13; padmakumar:eacl17, or building/improving models through active learning woodward:arxiv2017; padmakumar:emnlp18. In this work, we train a hierarchical dialog policy to jointly learn to choose clarification and active learning queries in an interactive image retrieval task in the fashion domain. A sample interaction is shown in Figure 1. We consider an application where a dialog system is combined with a retrieval system to help a customer find an article of clothing. Instead of showing a large number of results obtained from the retrieval system, the dialog system attempts to use clarifications to refine the search query, and active learning questions to obtain labelled examples for concepts it has not been trained on.

Most task-oriented dialog tasks require the system to identify one or more goals specified by a user, using a slot-filling model young:procieee13. These can be considered as learning to choose between a set of clarification questions that can confirm or obtain the value of various slots. However, for tasks such as natural language image retrieval, it is non-trivial to extend the slot-filling paradigm to perform clarification, as there is no standard set of slots that natural language descriptions of images can be divided into. Also, learned models are needed to identify components such as objects or attributes, and it is difficult to enumerate all expected types of these.

Some tasks such as GuessWhat?! de:cvpr2017 or discriminative question generation li:iccv17 allow the system to ask unconstrained natural language clarification questions. However, in these settings, specially designed models are still needed to ensure that learned questions actually decrease the size of the search space lee:vigil18; zhang:eccv18. Such open-ended questions are also difficult to answer in simulation, which is often necessary for learning good dialog policies. Hence, in these tasks, the system often learns to ask “easy” questions that can be reliably answered by a learned answering module zhu:arxiv17.

In this work, we explore a middle-ground approach with a form of attribute-based clarification farhadi:cvpr09. We use the term “attribute” to refer to a mix of concepts including categories such as “shirt” or “dress”, more conventional attributes such as colors, and domain specific attributes such as “sleeveless” and “V-neck”. Although we work with a dataset that contains a fixed set of attributes annotated for each image, we simulate the setting where novel visual attributes are encountered at test time.

Dialog interaction can also be used to improve an underlying model using opportunistic active learning padmakumar:emnlp18. Active learning allows a system to identify unlabeled examples that, if labeled, are most likely to improve the underlying model. Opportunistic active learning thomason:corl17 incorporates such queries into an interactive task in which an agent may ask ordinary users questions that are irrelevant to the current dialog interaction to improve performance in future dialog interactions. Opportunistic queries are more expensive than traditional active learning queries as they may distract from the task at hand, but they can allow the system to perform more effective lifelong learning. Such queries have been shown to improve performance in interactive object retrieval padmakumar:emnlp18. However, this and other work on learning reinforcement learning (RL) policies for active learning fang:emnlp2017 does not account for the presence of other interactive actions such as clarification.

In this work, we set up a dialog task that combines natural language image retrieval with both opportunistic active learning and attribute-based clarification. We then learn hierarchical dialog policies that jointly choose appropriate clarification and active learning questions in a setting containing both uncertain visual classifiers and novel concepts not seen during training. We observe that in our challenging setup, it is necessary to jointly learn dialog policies for choosing clarification and active learning questions to improve performance over employing one-shot retrieval with no interaction.

2 Related Work

Slot-filling style clarifications young:procieee13 have been shown to be useful for a variety of domains including restaurant recommendation williams:dd16, restaurant reservation bordes:iclr17, movie recommendation and question answering dodge:iclr16, issuing commands to robots deits:jhri13; thomason:ijcai15 and converting natural language instructions to code chaurasia:ijcnlp17. Other tasks such as GuessWhat?! de:cvpr2017, playing 20 questions hu:emnlp2018, relative captioning guo:nips18 and discriminative question generation li:iccv17 enable very open-ended clarification. In this work, we choose an intermediate class of clarification questions that allow finer-grained clarification than the slot filling model, but are still somewhat constrained so that reasonably correct answers can be provided in simulation.

Most of the above works learn dialog policies for clarification using RL padmakumar:eacl17; Wen:naacl16; strub:ijcai17; hu:emnlp2018. Some of the above systems use information from clarifications to improve the underlying language-understanding model thomason:ijcai15; padmakumar:eacl17. Such improvement is implicit in end-to-end dialog systems Wen:naacl16; de:cvpr2017; hu:emnlp2018. However, we use explicit active learning questions to improve the underlying perceptual model used for language grounding. There exist some previous works that use visual attributes for clarification dindo:rsj10; parde:ijcai15 but these do not use this information for improving the underlying language understanding model.

Active learning has traditionally been done using various hand-coded sample-selection metrics, such as uncertainty sampling settles:2010. Recent work on using active learning in dialog/RL setups includes using slot-filling style active learning questions to learn new names for known perceptual concepts yu:robonlp17, sequentially examining examples and deciding whether or not to query for their labels, up to a budget fang:emnlp2017, deciding between predicting a label for a specific example or requesting for it to be labelled woodward:arxiv2017, and jointly learning a data selection heuristic, data representation, and prediction function bachman:icml17. However, most of these (except woodward:arxiv2017) do not involve a trade-off between active learning and task completion. None of them incorporate clarification questions.

Most similar to our work is padmakumar:eacl17, which concerns learning a policy to trade off opportunistic active learning questions to improve classifiers, and using these to ground natural-language descriptions of objects. However, instead of considering the cold-start condition where the system cannot initially ground any descriptions before some questions are asked, we consider a warm-start condition which is closer to most real-world scenarios. We use a pretrained classifier and expect active learning to primarily aid generalization to novel concepts not seen during training. We also extend the task to include clarification questions.

Also related is work on interactive image retrieval such as allowing a user to mark relevant and irrelevant results nastar:cvpr98; tieu:ijcv04, which acts as a form of clarification. Recent works allow users to provide additional feedback using language to refine search results guo:nips18; bhattacharya:mmr19; saha:aaai18. These directions are complementary to our work and can potentially be combined with it in the future.

3 Task Setup

We consider an interactive task of retrieving an image of a product based on a natural-language description. Given a set of candidate images and a natural-language description, the goal of the system is to identify the image being referred to. Before trying to identify the target image, the system can ask the user a combination of both clarification and active learning questions. The goal of the system is to maximize the number of correct product identifications across interactions, while also keeping dialogs as short as possible.

Since we want to ensure that active learning questions are used to learn a generalizable classifier, we follow the setup of  padmakumar:emnlp18 and in each interaction we present the system with two sets of images:

  • An active test set consisting of the candidate images that the description could refer to.

  • An active training set which is the set of images that can be queried for active learning.

It is also presented with a description of the target image. Before attempting to identify the target, the system can ask clarification or active learning questions. We assume the system has access to a set of attributes that can be used in natural language descriptions of products. Given these attributes, the types of questions the system can ask are as follows (see Figure 1 for examples of each):

  • Clarification query: A yes/no query about whether an attribute is applicable to the target.

  • Label query: A yes/no query about whether an attribute is applicable to a specific image in the active training set.

  • Example query: A request for a positive example of an attribute from the active training set.

The dialog ends when the system makes a guess about the identity of the target, and is considered successful if this guess is correct. The goal of the task is to maximize the number of successful guesses, while also keeping dialogs as short as possible. As in padmakumar:emnlp18, we allow label and example queries that are either on-topic – queries about attributes in the current description – or opportunistic – queries that are not relevant to the current description but may be useful for future interactions, which have been shown to be helpful for interactive object retrieval thomason:corl17 (see Figure 1 for examples of each).

4 Methodology

4.1 Visual Attribute Classifier

We train a multilabel classifier for predicting visual attributes given an image. The network structure for the classifier is shown in Figure 2. We extract features for the images using the penultimate layer of an Inception-V3 network szegedy:cvpr16 pretrained on ImageNet. These are passed through two separate fully connected (FC) layers with ReLU activations, which are summed to produce the final layer used for classification. This is converted into per-class probabilities $p(i)$ using a sigmoid layer with temperature correction guo:icml17. We obtain another set of per-class probabilities $p^{+}(i)$ by passing the output of one of the FC layers through a sigmoid layer with temperature correction. Mathematically, given features $f(i)$ for image $i$, we have,

$$h_1(i) = \mathrm{ReLU}\big(W_1 f(i) + b_1\big), \qquad h_2(i) = \mathrm{ReLU}\big(W_2 f(i) + b_2\big)$$

$$p(i) = \sigma\big((h_1(i) + h_2(i)) / T\big), \qquad p^{+}(i) = \sigma\big(h_2(i) / T\big)$$

where $W_1$, $W_2$, and $T$ are learned, and $b_1$ and $b_2$ are learned biases.

Figure 2: Visual Attribute Classifier
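As a rough illustration of the forward pass described in this section, the following pure-Python sketch applies two FC+ReLU branches to a toy feature vector, sums them, and applies a temperature-scaled sigmoid. The function names, the single shared `temperature`, the toy weights, and the choice of the second branch for the additional output are all assumptions made for illustration, not the authors' implementation. Read literally, a ReLU followed by a sigmoid yields probabilities of at least 0.5, so the exact parameterization used in the paper may differ.

```python
import math

def fc_relu(weights, bias, x):
    """One fully connected layer followed by a ReLU activation."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def sigmoid_with_temperature(z, temperature):
    """Sigmoid applied to z / temperature (temperature scaling)."""
    return 1.0 / (1.0 + math.exp(-z / temperature))

def forward(features, w1, b1, w2, b2, temperature=1.0):
    """Return (p, p_pos): per-attribute probabilities from the summed
    branches and from a single branch (assumed here to be the second)."""
    h1 = fc_relu(w1, b1, features)
    h2 = fc_relu(w2, b2, features)
    summed = [a + b for a, b in zip(h1, h2)]
    p = [sigmoid_with_temperature(z, temperature) for z in summed]
    p_pos = [sigmoid_with_temperature(z, temperature) for z in h2]
    return p, p_pos

# Toy example: 3-dimensional features, 2 attributes (all values made up).
features = [1.0, 0.5, -1.0]
w1 = [[0.2, 0.1, 0.0], [0.0, -0.3, 0.4]]
b1 = [0.0, 0.1]
w2 = [[0.5, 0.0, 0.1], [0.1, 0.1, 0.1]]
b2 = [0.0, 0.0]
p, p_pos = forward(features, w1, b1, w2, b2, temperature=2.0)
print(p, p_pos)
```

In practice the features would come from the pretrained Inception-V3 network and the weights would be trained rather than hand-set.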

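We train the network using a loss function that combines the cross-entropy loss on the probabilities $p(i)$ from the summed layer over all examples with the cross-entropy loss on the probabilities $p^{+}(i)$ from the single FC layer, computed only for positive labels. That is,

$$\mathcal{L} = -\sum_i \sum_a \Big[ y_a(i) \log p_a(i) + \big(1 - y_a(i)\big) \log\big(1 - p_a(i)\big) \Big] - \lambda \sum_i \sum_a y_a(i) \log p^{+}_a(i)$$

where $y(i)$ is the label vector for image $i$, $a$ indexes attributes, and $\lambda$ weights the positive-only term. This forces part of the network to focus on positive examples for each class. This is required because we use a heavily imbalanced dataset where most classes have very few positive examples. We find this more effective than a standard weighted cross entropy loss, and the results in this paper use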

. We also maintain a validation set of images labeled with attributes, that can be extended using active learning queries. Using this, we can estimate per-attribute precision, recall and F1. These metrics are used for tuning classifier hyperparameters and for dialog policy learning.
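The combined loss described above can be sketched as follows. This is a minimal illustration, assuming binary label vectors and two sets of per-attribute probabilities; `combined_loss`, `pos_weight`, and `eps` are hypothetical names and knobs introduced here, and the averaging over images is a choice made for the sketch.

```python
import math

def combined_loss(p, p_pos, labels, pos_weight=1.0, eps=1e-12):
    """Binary cross entropy on p over all labels, plus cross entropy on
    p_pos restricted to positive labels, averaged over images.
    pos_weight is a hypothetical weighting of the positive-only term."""
    total = 0.0
    for p_i, q_i, y_i in zip(p, p_pos, labels):
        for p_a, q_a, y in zip(p_i, q_i, y_i):
            # Standard binary cross entropy term for this attribute.
            total -= y * math.log(p_a + eps) + (1 - y) * math.log(1 - p_a + eps)
            # Extra term that only fires for positive labels.
            if y == 1:
                total -= pos_weight * math.log(q_a + eps)
    return total / len(labels)

# One image, two attributes: the first is positive, the second negative.
loss = combined_loss(p=[[0.5, 0.5]], p_pos=[[0.5, 0.5]], labels=[[1, 0]])
print(loss)
```

Because negatives contribute nothing to the second term, the branch producing `p_pos` receives gradient only from positive examples, which is the stated motivation for handling the class imbalance.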

4.2 Grounding Model

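We assume that a description is a conjunction of attributes, and use heuristics based on string matching to determine the set of attributes referenced by the natural language description. Let the subset of attributes referenced in the description be $A_d$.

Suppose we additionally learn from clarifications that attributes $A_p$ apply to the target image and attributes $A_n$ do not. Assuming independence of attributes, and writing $p_a(i)$ for the classifier's probability that attribute $a$ applies to image $i$, the probability that $i$ is the target image is:

$$b(i) \propto \prod_{a \in A_d \cup A_p} p_a(i) \prod_{a \in A_n} \big(1 - p_a(i)\big) \qquad (1)$$

At any stage, the best guess the system can make is the image with max belief, that is

$$\hat{i} = \arg\max_i \, b(i) \qquad (2)$$

Also, we estimate the information gain of a clarification about attribute $q$ as follows. This is based on the formulation used in lee:vigil18 but we additionally make a Markov assumption.

$$I(q) = \sum_i \sum_{r \in \{0, 1\}} b(i) \, P(r \mid i, q) \log \frac{P(r \mid i, q)}{P(r \mid q)} \qquad (3)$$

where $P(1 \mid i, q) = p_q(i)$ and $P(0 \mid i, q) = 1 - p_q(i)$ and $P(r \mid q) = \sum_i b(i) \, P(r \mid i, q)$.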
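The grounding model can be illustrated with a small sketch: beliefs are products of per-attribute classifier probabilities under the independence assumption, the guess is the image with maximum belief, and the information gain of a yes/no clarification is computed here as the mutual information between the target identity and the answer. The function names, the toy probabilities, and this particular form of information gain are assumptions for illustration; the paper's exact formulation may differ.

```python
import math

def beliefs(probs, positive_attrs, negative_attrs):
    """probs[i][a]: classifier probability that attribute a applies to image i.
    positive_attrs: attributes in the description or confirmed by the user.
    negative_attrs: attributes the user said do not apply."""
    scores = []
    for image_probs in probs:
        s = 1.0
        for a in positive_attrs:
            s *= image_probs[a]
        for a in negative_attrs:
            s *= 1.0 - image_probs[a]
        scores.append(s)
    z = sum(scores)
    n = len(scores)
    return [s / z for s in scores] if z > 0 else [1.0 / n] * n

def best_guess(belief):
    """Index of the image with maximum belief."""
    return max(range(len(belief)), key=belief.__getitem__)

def information_gain(belief, probs, attr):
    """Mutual information between the target image and the yes/no answer."""
    p_yes = sum(b * image_probs[attr] for b, image_probs in zip(belief, probs))
    gain = 0.0
    for b, image_probs in zip(belief, probs):
        for p_ans_img, p_ans in ((image_probs[attr], p_yes),
                                 (1.0 - image_probs[attr], 1.0 - p_yes)):
            if b > 0 and p_ans_img > 0 and p_ans > 0:
                gain += b * p_ans_img * math.log(p_ans_img / p_ans)
    return gain

# Toy candidate set of three images (probabilities are made up).
probs = [
    {"red": 0.9, "dress": 0.8, "sleeveless": 0.5},
    {"red": 0.2, "dress": 0.9, "sleeveless": 0.5},
    {"red": 0.9, "dress": 0.1, "sleeveless": 0.5},
]
belief = beliefs(probs, positive_attrs=["dress"], negative_attrs=[])
guess = best_guess(belief)
gain_red = information_gain(belief, probs, "red")
gain_sleeveless = information_gain(belief, probs, "sleeveless")
print(guess, gain_red, gain_sleeveless)
```

In this toy example, asking about "sleeveless" is uninformative because the classifier gives every image the same probability, whereas asking about "red" separates the candidates.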

4.3 MDP Formulation

We model each interaction as an episode in a Markov Decision Process (MDP) where the state consists of the images in the active training and test sets, the attributes mentioned in the target description, the current parameters of the classifier, and the set of queries asked and their responses. At each state, the agent has the following available actions:

  • A special action for guessing – the image is chosen using Equation 2.

  • One clarification query per attribute.

  • A set of actions corresponding to possible active learning queries – one example query per attribute and one label query for each pair $(a, i)$ of an attribute $a$ and an image $i$ in the active training set.

We do not allow actions to be repeated. We learn a hierarchical dialog policy composed of 3 parts – clarification and active learning policies to respectively choose the best clarification and active learning query in the current state, and a decision policy to choose between clarification, active learning, and guessing. An episode ends either when the guess action is chosen, or when a dialog length limit is reached, at which point the system is forced to make a guess. If the episode ends with a correct guess, the agent gets a large positive reward. Otherwise the agent gets a large negative reward at the end of the episode. Additionally, the agent gets a small negative reward for each query to encourage shorter dialogs. In our experiments, we treat these rewards as tunable hyperparameters.
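The episode structure above can be sketched as a loop in which a decision policy picks between guessing, clarification, and active learning, and the corresponding sub-policy picks the concrete query. The reward values, the turn limit, `ToyEnv`, and the toy policies below are all hypothetical placeholders; the paper treats the rewards as tunable hyperparameters and does not state their values here.

```python
# Hypothetical reward values and dialog length limit.
REWARD_CORRECT = 100.0
REWARD_WRONG = -100.0
REWARD_PER_QUERY = -1.0
MAX_TURNS = 20

def run_episode(decision_policy, clarification_policy, active_learning_policy,
                env, max_turns=MAX_TURNS):
    """Run one dialog; return (total reward, number of queries asked)."""
    asked = set()  # actions may not be repeated
    total_reward = 0.0
    for turn in range(max_turns):
        choice = decision_policy(env, turn)
        if choice == "guess":
            break
        sub_policy = (clarification_policy if choice == "clarify"
                      else active_learning_policy)
        query = sub_policy(env, asked)
        if query is None:  # nothing left to ask
            break
        asked.add(query)
        env.answer(query)
        total_reward += REWARD_PER_QUERY
    # Reaching the turn limit also forces a guess.
    total_reward += REWARD_CORRECT if env.guess_is_correct() else REWARD_WRONG
    return total_reward, len(asked)

class ToyEnv:
    """Stand-in environment: the guess succeeds after two answered queries."""
    def __init__(self):
        self.answers = 0
    def answer(self, query):
        self.answers += 1
    def guess_is_correct(self):
        return self.answers >= 2

def toy_decision(env, turn):
    if turn >= 3:
        return "guess"
    return "clarify" if turn % 2 == 0 else "active"

def toy_clarification(env, asked):
    options = [("clarify", "red"), ("clarify", "dress")]
    return next((q for q in options if q not in asked), None)

def toy_active_learning(env, asked):
    options = [("label", "red", 0)]
    return next((q for q in options if q not in asked), None)

reward, num_queries = run_episode(toy_decision, toy_clarification,
                                  toy_active_learning, ToyEnv())
print(reward, num_queries)
```

Keeping the three policies as separate callables mirrors the hierarchical decomposition: each can be swapped between a static and a learned version independently.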

4.4 Policy Learning

We experimented with using both Q-learning and A3C mnih:icml16 for policy learning, both trained to maximize the discounted reward. Since the classifier has a large number of parameters, it is necessary to extract task-relevant features to represent state-action pairs. The features provided to each policy need to capture information from the current state that enables the system to identify useful clarifications and active learning queries, and to trade off between these and guessing. The features used include:

4.5 Clarification policy features

  • Metrics about the current beliefs and what they would be for each possible answer, if the question were asked:

    • Entropy: A higher entropy suggests that the agent is more uncertain. A decrease in entropy could indicate a good clarification.

    • Top two highest beliefs and their difference: A high value of the maximum belief, or a high difference between the top two beliefs could indicate that the agent is more confident about its guess. An increase in these could indicate a good clarification.

    • Difference between the maximum and average beliefs: A large difference suggests that the agent is more confident about its guess. An increase in these could indicate a good clarification.

  • Information gain of the query as calculated in section 4.2.

  • Current F1 of the attribute associated with the query: The system is likely to make better clarifications using attributes with high predictive accuracy.
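The belief-based features listed above could be computed along the following lines. This is a sketch under assumptions: the function names, the feature ordering, and the concatenation into a flat list are illustrative choices, and the hypothetical posteriors `belief_if_yes` and `belief_if_no` are assumed to have been computed by updating the belief for each possible answer.

```python
import math

def belief_features(belief):
    """Entropy, top two beliefs, their gap, and max minus mean."""
    entropy = -sum(b * math.log(b) for b in belief if b > 0)
    ranked = sorted(belief, reverse=True)
    top1 = ranked[0]
    top2 = ranked[1] if len(ranked) > 1 else 0.0
    mean = sum(belief) / len(belief)
    return [entropy, top1, top2, top1 - top2, top1 - mean]

def clarification_features(belief, belief_if_yes, belief_if_no,
                           info_gain, attr_f1):
    """Features for one candidate clarification query."""
    return (belief_features(belief)
            + belief_features(belief_if_yes)
            + belief_features(belief_if_no)
            + [info_gain, attr_f1])

# Toy beliefs over three candidate images (values are made up).
feats = clarification_features(
    belief=[0.5, 0.25, 0.25],
    belief_if_yes=[0.8, 0.1, 0.1],
    belief_if_no=[0.2, 0.4, 0.4],
    info_gain=0.12,
    attr_f1=0.6,
)
print(feats)
```

Here a "yes" answer lowers the entropy and raises the top belief, which is the kind of signal the policy can use to recognize a good clarification.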

4.6 Active learning policy features

  • Current F1 of the attribute associated with the query, since the system is likely to benefit more from improving an attribute whose current predictive accuracy is not high.

  • Fraction of previous dialogs in which the attribute has been used, since it is beneficial to focus on frequently used attributes that will likely benefit future dialogs.

  • Fraction of previous dialogs using the attribute that have been successful, since this suggests that the attribute may be modelled well enough already.

  • Whether the query is off-topic (i.e. opportunistic), since this would not benefit the current dialog.

Additionally in label queries,

  • For query $(a, i)$, the margin of the classifier's prediction for attribute $a$ on image $i$, as a measure of (un)certainty.

  • Average cosine distance of the image to others in the dataset; this is motivated by density weighting to avoid selecting outliers.

  • Fraction of k-nearest neighbors of the image that are unlabelled for this attribute, since a higher value suggests that the query could benefit multiple images.
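The three label-query features can be sketched as below. The margin is taken here to be the distance of the predicted probability from 0.5, which is an assumption, as are the function names, the value of `k`, and the toy embeddings.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def label_query_features(index, embeddings, prob, labeled, k=2):
    """Features for a label query about one attribute on image `index`.
    prob: classifier probability of the attribute for this image.
    labeled: indices of images already labeled for this attribute."""
    margin = abs(prob - 0.5)  # assumed definition of the margin
    others = [j for j in range(len(embeddings)) if j != index]
    dists = {j: cosine_distance(embeddings[index], embeddings[j])
             for j in others}
    avg_distance = sum(dists.values()) / len(others)
    nearest = sorted(others, key=dists.get)[:k]
    frac_unlabeled = sum(1 for j in nearest if j not in labeled) / len(nearest)
    return margin, avg_distance, frac_unlabeled

# Toy 2-d embeddings for four images (values are made up).
embeddings = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
margin, avg_distance, frac_unlabeled = label_query_features(
    index=0, embeddings=embeddings, prob=0.6, labeled={1})
print(margin, avg_distance, frac_unlabeled)
```

A small margin flags an uncertain prediction, a large average distance flags a likely outlier, and a high unlabeled fraction suggests that the answer could inform many nearby images.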

4.7 Decision policy features

  • Features of the current belief as in section 4.5. These can be used to determine whether a guess is likely to be successful.

  • Information gain of the best clarification action – to decide the utility of the clarification.

  • Margin from the best active learning query if it is a label query – to decide the utility of the label query.

  • F1 of attributes in clarification and active learning queries. High F1 is desirable for clarification and low F1 for active learning.

  • Mean F1 of attributes in the description. A high value suggests that the belief is more reliable.

  • Number of dialog turns completed.
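To show how feature vectors like those in sections 4.5 to 4.7 could feed policy learning, here is a minimal sketch of a Q-learning update with a linear function approximator over state-action features. The source does not specify the function approximator, so the linear form, the learning rate `alpha`, and the discount `gamma` are all assumptions.

```python
def q_value(weights, features):
    """Linear estimate of Q(s, a) from state-action features."""
    return sum(w * f for w, f in zip(weights, features))

def q_learning_update(weights, features, reward, next_candidates, done,
                      alpha=0.1, gamma=0.9):
    """One temporal-difference update.
    next_candidates: feature vectors of the actions available next."""
    target = reward
    if not done and next_candidates:
        target += gamma * max(q_value(weights, f) for f in next_candidates)
    td_error = target - q_value(weights, features)
    return [w + alpha * td_error * f for w, f in zip(weights, features)]

# Terminal transition with reward 1 from zero-initialized weights.
new_weights = q_learning_update(
    weights=[0.0, 0.0], features=[1.0, 2.0], reward=1.0,
    next_candidates=[], done=True)
print(new_weights)
```

The same update could be applied separately to the clarification, active learning, and decision policies, each with its own feature set.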

4.8 Baseline static policy

As a baseline, we use an intuitive manually-designed static policy that is also hierarchical and was tailored to perform well in preliminary experiments. The static clarification policy chooses the query (among those with F1 > 0) with maximum information gain. Ties are broken using F1 of the attribute in the query. The static active learning policy has a fixed probability of choosing label queries and example queries. Uncertainty sampling is used to select the label query with minimum classifier margin. An example query is chosen uniformly at random from the candidates. The decision policy initially chooses clarification if the information gain is above a minimum threshold, and the highest belief is below a confidence threshold. After a maximum number of clarifications, it chooses active learning until another threshold on the dialog length before guessing.
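One possible reading of the static clarification and decision policies is sketched below. All threshold values and names are hypothetical, and the fall-through from clarification to active learning when the clarification conditions fail is an interpretation of the description rather than a confirmed detail.

```python
def static_clarification(candidates):
    """candidates: list of (attribute, info_gain, f1) tuples.
    Pick max information gain among attributes with F1 > 0; break ties by F1."""
    usable = [c for c in candidates if c[2] > 0]
    if not usable:
        return None
    return max(usable, key=lambda c: (c[1], c[2]))[0]

def static_decision(best_info_gain, max_belief, num_clarifications, turn,
                    min_info_gain=0.05, confidence=0.8,
                    max_clarifications=5, max_turns_before_guess=10):
    """Threshold-based choice between clarifying, active learning, and guessing."""
    if (num_clarifications < max_clarifications
            and best_info_gain > min_info_gain
            and max_belief < confidence):
        return "clarify"
    if turn < max_turns_before_guess:
        return "active_learning"
    return "guess"

choice = static_clarification([
    ("red", 0.3, 0.5),
    ("dress", 0.3, 0.7),   # same gain as "red", higher F1
    ("lace", 0.9, 0.0),    # excluded because F1 is 0
])
print(choice)
print(static_decision(0.3, 0.4, num_clarifications=0, turn=0))
```

Because every threshold is fixed in advance, this policy cannot adapt to a poorly calibrated classifier.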

5 Experimental Setup

5.1 Dataset

To address a potential shopping application, we simulate dialogs using the iMaterialist Fashion Attribute dataset guo:iccvworkshop19, consisting of images from the shopping website Wish (https://www.wish.com/), each annotated for a set of 228 attributes. We scraped product descriptions for the images in the train and validation splits of the dataset for which attribute annotations are publicly available. After removing products whose images or descriptions were unavailable, we had 648,288 images with associated product descriptions and annotations for the 228 attributes.

We create a new data split following the protocol of padmakumar:emnlp18 to ensure that the learned dialog policy generalizes to attributes not seen during policy training. We divided the dataset into 4 splits, policy_pretrain, policy_train, policy_val and policy_test, such that each contains images that have attributes for which positive examples are not present in earlier splits. More details are included in Appendix B. Each of these is then split into subsets classifier_training and classifier_test by a uniform 60-40 split.

The policy_pretrain data is used to pretrain the multilabel attribute classifier. We use its classifier_training subset of images for training and its classifier_test subset to tune hyperparameters. The policy_train data is then used to learn the dialog policy. The policy_val data is used to tune hyperparameters as well as choose between RL algorithms. Results are reported for policy_test data.

We wish to simulate dialogs as refinements of an initial retrieval based on the product description. For the description of each image in the current classifier_test subset, we rank all other images in this subset according to a simplified version of the score in equation 1 (details in Appendix C). From the images which get ranked within the top 1000 for their corresponding description, we sample target images for each interaction. The active test set for the interaction consists of the top 1000 images as ranked for that description. We randomly sample 1000 images from the appropriate classifier_training subset to form the active training set.

In each interaction, the description of the target image is provided to the agent to start the interaction. The annotated attributes are used to answer queries from the system. This simulation procedure is similar to that of  padmakumar:emnlp18 but the answers to questions are less noisy in our simulation as all attributes are annotated as positive or negative for all images.
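A simulated user of this kind can be sketched as follows, answering each of the three query types from the annotated attributes. The class and field names, the toy annotations, and the random choice among positive examples are assumptions made for illustration.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class ClarificationQuery:
    attribute: str

@dataclass(frozen=True)
class LabelQuery:
    attribute: str
    image_id: str

@dataclass(frozen=True)
class ExampleQuery:
    attribute: str

class SimulatedUser:
    """Answers queries by looking up annotated attributes."""
    def __init__(self, annotations, target_id, active_training_ids, seed=0):
        self.annotations = annotations  # image id -> set of attributes
        self.target_id = target_id
        self.active_training_ids = list(active_training_ids)
        self.rng = random.Random(seed)

    def answer(self, query):
        if isinstance(query, ClarificationQuery):
            return query.attribute in self.annotations[self.target_id]
        if isinstance(query, LabelQuery):
            return query.attribute in self.annotations[query.image_id]
        if isinstance(query, ExampleQuery):
            positives = [i for i in self.active_training_ids
                         if query.attribute in self.annotations[i]]
            # None signals that no positive example is available.
            return self.rng.choice(positives) if positives else None
        raise TypeError(f"unsupported query type: {type(query).__name__}")

# Toy annotations: one target image and two active training images.
annotations = {
    "t1": {"red", "dress"},
    "a1": {"blue", "shirt"},
    "a2": {"red", "shirt"},
}
user = SimulatedUser(annotations, target_id="t1",
                     active_training_ids=["a1", "a2"])
print(user.answer(ClarificationQuery("red")))
print(user.answer(ExampleQuery("red")))
```

Since every attribute is annotated as positive or negative for every image, these simulated answers are deterministic apart from the choice among multiple positive examples.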

5.2 Experiment Phases

We run dialogs in batches of 100 and update the classifier and policies at the end of each batch. This is followed by repeating the retrieval step for all descriptions in the classifier_test subset before choosing target images for the next batch of dialogs. The experiment has the following phases:

  • Classifier pretraining: We pretrain the classifier using annotated attribute labels for images in the classifier_training subset of the policy_pretrain set. This ensures that we have some reasonable clarifications at the start of dialog policy learning.

  • Policy initialization: We initialize the dialog policy using experience collected using the baseline static policies (section 4.8) for the decision and active learning policies, and an oracle to choose clarifications (the oracle tries each candidate clarification and returns the one that maximally increases the belief of the target image). This is done to speed up policy learning. The dialogs for this phase are sampled from the set of policy_train images.

  • Policy training: This phase consists of training the policy using on-policy experience, with dialogs again sampled from the set of policy_train images.

  • Policy testing: We reset the classifier to the state at the end of pretraining. This is done to ensure that any performance improvements seen during testing are due to queries made in the testing phase. This is needed both for fair comparison with the baseline and to confirm that the system can generalize to novel attributes not seen during any stage of training. Dialogs are sampled for this from the policy_val set for hyperparameter tuning and from the policy_test set for reported results.
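The phases after classifier pretraining can be summarized as a simple driver loop. This sketch assumes hypothetical callables `run_batch` and `reset_classifier`; the default batch counts follow the numbers reported in section 6, and everything else (names, the policy labels, the returned log) is illustrative.

```python
def run_experiment(run_batch, reset_classifier,
                   init_batches=4, train_batches=4, test_batches=5,
                   batch_size=100):
    """Run initialization, training, and testing phases in order.
    run_batch(policy, split, n) is assumed to run n dialogs, update the
    classifier and policies, and return the batch success rate."""
    log = []
    for _ in range(init_batches):
        # Static decision/active learning policies plus oracle clarifications.
        log.append(("init", run_batch("static+oracle", "policy_train", batch_size)))
    for _ in range(train_batches):
        # On-policy experience from the learned policy.
        log.append(("train", run_batch("learned", "policy_train", batch_size)))
    # Return the classifier to its state at the end of pretraining.
    reset_classifier()
    for _ in range(test_batches):
        log.append(("test", run_batch("learned", "policy_test", batch_size)))
    return log

# Stubs that only record calls, to show the control flow.
calls = {"reset": 0}

def fake_run_batch(policy, split, n):
    return 0.0

def fake_reset():
    calls["reset"] += 1

log = run_experiment(fake_run_batch, fake_reset)
print(len(log), calls["reset"])
```

Resetting before the test phase is what lets any improvement across test batches be attributed to queries asked during testing.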

6 Results and Discussion

We initialize the policy with 4 batches of dialogs, followed by 4 batches of dialogs for the training phase, and 5 batches of dialogs in the testing phase. We compare the fully learned policy with hierarchical policies that consist of keeping one or more of the components static. We also compare the choice of Q-Learning or A3C mnih:icml16 as the policy learning algorithm for each learned policy. Table 1 shows the performance in the final test batch of the best fully learned policy, as well as a selected subset of the baselines (all conditions are included in Appendix A). We evaluate policies on the fraction of successful episodes in the final test batch, and the average dialog length.

| Decision Policy Type | Clarification Policy Type | Active Learning Policy Type | Fraction of Successful Dialogs | Average Dialog Length |
|---|---|---|---|---|
| Q-Learning | A3C | A3C | 0.33 | 9.40 |
| Q-Learning | A3C | Static | 0.15 | 14.16 |
| Q-Learning | Static | A3C | 0.09 | 1.00 |
| Static | A3C | A3C | 0.27 | 20.00 |
| Static | Static | Static | 0.17 | 20.00 |

Table 1: Results from the final batch of the test phase.

Ideally, we would like the system to have a high dialog success rate while having as low a dialog length as possible. We observe that using a learned policy for all three functions results in a significantly more successful dialog system (according to an unpaired Welch t-test with p < 0.05) than most conditions in which one or more of the policies are static. The exception is the case when the decision policy is static and the clarification and active learning policies are learned, in which case the difference is not statistically significant. The fully learned policy also uses significantly shorter dialogs than all conditions with a static decision policy. Some other conditions result in shorter dialogs, but these are unable to exploit the clarification and active learning actions enough to result in a success rate comparable to the fully learned policy.
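For readers unfamiliar with the Welch t-test, the sketch below computes the test statistic and the Welch-Satterthwaite degrees of freedom for two batches of binary dialog outcomes. It assumes, for illustration only, that the test is applied to per-dialog success indicators from batches of 100 dialogs, and it uses the success fractions of the fully learned and fully static rows of Table 1; how the authors actually set up the test is not stated in this section.

```python
import math

def welch_t(sample_a, sample_b):
    """Welch's unequal-variance t statistic and degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    se_a, se_b = var_a / n_a, var_b / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof

# Assumed per-dialog outcomes: 33 of 100 successes vs. 17 of 100.
fully_learned = [1] * 33 + [0] * 67
fully_static = [1] * 17 + [0] * 83
t_stat, dof = welch_t(fully_learned, fully_static)
print(t_stat, dof)
```

Turning the statistic into a p-value requires the t-distribution; with SciPy installed, `scipy.stats.ttest_ind(fully_learned, fully_static, equal_var=False)` performs the same test and returns the p-value directly.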

Figure 3 plots the success rate across test batches, and the expected success rate if the system was forced to guess without clarification, for the fully learned, and fully static policies. We find that in the case of the fully static policy, there is no statistically significant improvement, either in the expected initial success rate without clarifications, or in the final success rate, between the first and last test batch. This suggests that neither the static active learning policy, nor its combination with the static clarification policy are capable of improving the system’s performance.

However, in the case of the fully learned policy, we observe a statistically significant improvement in the final success rate, but not the initial success rate without clarifications. This suggests that while a learned active learning policy by itself is not sufficient to improve the system’s success rate, the combination of learned active learning and clarification policies is sufficient to improve the system’s success rate. We also observe that while the difference between the initial and final success rate is initially not significant, it increases across batches, and becomes significant in the last two batches. This suggests that the clarification policy by itself is also insufficient for improvement, and the combination of the two is required to improve the system’s success rate.

Figure 3: Comparison of guess success rate with, and without clarifications across test batches

We believe that the reason for the relatively poor performance of the static clarification and active learning policies is that the classifier is not sufficiently accurate, and does not produce well calibrated probabilities, due to the heavy imbalance in the dataset. On the other hand, the learned policies are able to learn to properly adjust for this miscalibration. We also believe that with more training dialogs or a different state-action representation, it may be possible to also learn a decision policy that outperforms the static decision policy.

7 Conclusion

We demonstrate how a combination of RL learned policies for choosing attribute-based clarification and active learning queries can be used to improve an interactive system that needs to retrieve images based on a natural language description, while encountering novel attributes at test time not seen during training. Our experiments show that in challenging datasets where it is difficult to obtain an accurate attribute classifier, learned policies for choosing clarification and active learning queries outperform strong static baselines. We further show that in this challenging setup, a combination of learned clarification and active learning policies is necessary to obtain improvement over directly performing retrieval without interaction.

Broader Impact

Natural language interfaces such as language-based search and intelligent personal assistants have the potential to make various forms of technology, ranging from mobile phones and computers to robots and other machines such as ATMs or self-checkout counters, more accessible and less intimidating to users who are unfamiliar or uncomfortable with other interfaces on such devices, such as command shells, button-based interfaces, or changing visual user interfaces. Spoken language interfaces can also be used to make such devices more accessible for the visually impaired or users who have difficulty with fine motor control.

However, the use of these interfaces does involve concerns over privacy and data security. This is especially the case with devices based on spoken language interfaces as they need to analyze every conversation for potential codewords lackes:19. Thus, users need to trust that these extraneous conversations will not be stored, or analyzed for other information. This is particularly problematic in environments such as hospitals or lawyers’ offices where confidentiality is expected.

Another concern is that transactions on these devices may be triggered by casual conversation or voices on television liptak:17, that were not intended to activate the dialog system. A related concern is that the ambiguity of language or mistakes made by the system may trigger unintended actions. In most applications, these can be handled by setting up appropriate confirmation or cancellation procedures for sensitive actions. Increased use of clarification steps before execution of an action may provide an additional opportunity for users to cancel such actions before they take place.

Using active learning, or any form of continuous learning with user data, can make machine learning systems more useful due to increased exposure to the data distribution that such systems need to operate on in practice. However, most machine learning algorithms assume that the input data is complete and correct, both of which may be violated by systems that train on user-generated data. It is also possible for such data to be biased in a variety of ways – ranging from potential absence of representation or misrepresentation of some groups of people who do not use the system as frequently, to filter-bubble-like effects when many users provide a few frequent examples as training data to the system baeza:16. Explicit active learning questions also allow users to deliberately provide misinformation to machine learning systems. Practical systems using active learning need to incorporate methods for handling noisy data, and need to have tests in place for undesirable learned biases.

We would like to thank Prasoon Goyal, Peter Stone, Joydeep Biswas and the UT Austin Building Wide Intelligence group for helpful discussions.

This work was funded by a Google Faculty Research Award (2019-2020) that was awarded to Prof Ray Mooney, an NSF NRI grant (IIS-1637736): Robots that Learn to Communicate through Natural Human Dialog, and an NSF NRI 2.0 grant (IIS-1925082): Improving Robot Learning from Feedback and Demonstration using Natural Language.


Appendix A Complete Results

Here, we include the complete set of results of which Table 1 is a part. For each of the three tasks (choosing clarification questions, choosing active learning questions, and deciding between these and guessing), we compare the use of a static policy with policies learned using Q-Learning and A3C.

| Decision Policy Type | Clarification Policy Type | Active Learning Policy Type | Fraction of Successful Dialogs | Average Dialog Length |
|---|---|---|---|---|
| Q-Learning | Q-Learning | Q-Learning | 0.12 | 15.45 |
| Q-Learning | Q-Learning | A3C | 0.18 | 16.96 |
| Q-Learning | Q-Learning | Static | 0.14 | 11.83 |
| Q-Learning | A3C | Q-Learning | 0.06 | 1.0 |
| **Q-Learning** | **A3C** | **A3C** | **0.33** | **9.4** |
| Q-Learning | A3C | Static | 0.15 | 14.16 |
| Q-Learning | Static | Q-Learning | 0.17 | 13.96 |
| Q-Learning | Static | A3C | 0.09 | 1.0 |
| Q-Learning | Static | Static | 0.17 | 3.81 |
| A3C | Q-Learning | Q-Learning | 0.09 | 20.0 |
| A3C | Q-Learning | A3C | 0.19 | 20.0 |
| A3C | Q-Learning | Static | 0.17 | 20.0 |
| A3C | A3C | Q-Learning | 0.13 | 20.0 |
| A3C | A3C | A3C | 0.09 | 20.0 |
| A3C | A3C | Static | 0.15 | 20.0 |
| A3C | Static | Q-Learning | 0.12 | 20.0 |
| A3C | Static | A3C | 0.13 | 20.0 |
| A3C | Static | Static | 0.16 | 20.0 |
| Static | Q-Learning | Q-Learning | 0.29† | 20.0 |
| Static | Q-Learning | A3C | 0.24† | 20.0 |
| Static | Q-Learning | Static | 0.1 | 20.0 |
| Static | A3C | Q-Learning | 0.24† | 20.0 |
| Static | A3C | A3C | 0.27† | 20.0 |
| Static | A3C | Static | 0.14 | 20.0 |
| Static | Static | Q-Learning | 0.15 | 20.0 |
| Static | Static | A3C | 0.16 | 20.0 |
| Static | Static | Static | 0.17 | 20.0 |

Table 2: Unabridged results from the final batch of the test phase. † indicates the conditions whose performance is comparable to the best condition (in bold).

We evaluate policies on the fraction of successful episodes in the final test batch, and the average dialog length. An episode is considered successful if it ends with the system guessing the correct target item. Ideally, we would like the system to have a high dialog success rate while having as low a dialog length as possible.

We observe that using a learned policy for all three functions, with A3C for choosing clarification and active learning queries and Q-learning for deciding between clarification, active learning and guessing, results in a significantly more successful dialog system (according to an unpaired Welch t-test with p < 0.05) than most other conditions. The exceptions are the conditions in which the decision policy is static and the clarification and active learning policies are both learned (the marked conditions in Table 2), for which the difference is not statistically significant. However, the best fully learned policy uses significantly shorter dialogs than the other policies with a comparable success rate, making it more desirable overall.
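As an illustration of the significance test mentioned above, the following minimal sketch computes the Welch t statistic and the Welch-Satterthwaite degrees of freedom for two sets of per-dialog outcomes. The outcome lists are hypothetical and are not data from our experiments; the function name `welch_t` is ours. Obtaining a p-value additionally requires the CDF of the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)`.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    # Squared standard errors of the two sample means (unequal variances allowed).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical per-dialog outcomes (1 = successful dialog), for illustration only.
cond_a = [1] * 33 + [0] * 67
cond_b = [1] * 17 + [0] * 83
t_stat, dof = welch_t(cond_a, cond_b)
```

A positive `t_stat` here means that the first condition has the higher mean success rate.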

Appendix B Data Split

We divide the set of attributes into 4 subsets – policy_pretrain, policy_train, policy_val and policy_test. Using these, we divide the images into 4 subsets as follows:

  • All images having a positive label for any of the attributes in the policy_test subset form the policy_test set of images.

  • Of the remaining images, the images with a positive label for any attribute in the policy_val subset form the policy_val set of images.

  • Of the remaining images, the images with a positive label for any attribute in the policy_train subset form the policy_train set of images.

  • The remaining images form the set of policy_pretrain images.

This iterative procedure ensures that the images in each of the policy_train, policy_val and policy_test sets introduce new attributes for which the classifier has not already been trained. The data split will be included in the code release.
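The iterative assignment above can be sketched as follows. This is a minimal illustration rather than the released code: the function name `split_images`, the dictionary layout and the toy attribute names are assumptions made for the example.

```python
def split_images(image_attrs, attr_subsets):
    """Assign each image to one of the four image sets.

    image_attrs: dict mapping image id -> set of positively labeled attributes.
    attr_subsets: dict mapping subset name -> set of attributes in that subset.
    """
    assignment = {}
    # Order matters: each later subset only sees images not yet assigned.
    for name in ("policy_test", "policy_val", "policy_train"):
        for image_id, attrs in image_attrs.items():
            if image_id not in assignment and attrs & attr_subsets[name]:
                assignment[image_id] = name
    # Everything left over goes to policy_pretrain.
    for image_id in image_attrs:
        assignment.setdefault(image_id, "policy_pretrain")
    return assignment


# Toy example with hypothetical attributes and image ids.
attr_subsets = {
    "policy_pretrain": {"red"},
    "policy_train": {"striped"},
    "policy_val": {"floral"},
    "policy_test": {"sequin"},
}
image_attrs = {
    "img0": {"red"},
    "img1": {"red", "striped"},
    "img2": {"striped", "floral"},
    "img3": {"red", "sequin", "floral"},
    "img4": set(),
}
split = split_images(image_attrs, attr_subsets)
```

In the toy example, `img3` lands in policy_test even though it also has a policy_val attribute, because the policy_test check runs first.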

Appendix C Initial Retrieval

We wish to simulate dialogs as refinements of an initial retrieval based on the product description. At the start of each batch of interactions, for each description corresponding to an image in the current classifier_test subset, we rank all images in this subset according to a variant of the score in Equation 1. Instead of directly using classifier probabilities, we threshold the probabilities to obtain binary decisions. The threshold for each attribute is chosen to maximize the F1 score for that attribute on the current set of validation images and labels. The validation set is initially the classifier_test subset of the policy_pretrain set, and is expanded with a fraction of the labels obtained using active learning queries. This F1 score is also used in the baseline static policy (section 4.8) and in the features for the learned dialog policies (sections 4.5, 4.6 and 4.7).
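A minimal sketch of per-attribute threshold selection is given below. It assumes that candidate thresholds are the observed probabilities and that a probability at or above the threshold counts as a positive decision; both conventions, and the function names, are assumptions made for this example.

```python
def f1_score(labels, preds):
    """F1 of binary predictions against binary labels."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(probs, labels):
    """Return (threshold, F1) maximizing F1 for one attribute."""
    best_t, best_f1 = 0.5, -1.0
    # Try each observed probability as a threshold; ties keep the lowest one.
    for t in sorted(set(probs)):
        preds = [p >= t for p in probs]
        score = f1_score(labels, preds)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Hypothetical validation probabilities and labels for a single attribute.
threshold, f1_at_threshold = best_threshold([0.1, 0.3, 0.35, 0.8], [0, 1, 1, 1])
```

The returned F1 value is the quantity that would be reused as a feature or in the static policy, under the assumptions stated above.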

Also, while the initial belief in the dialogs (Equation 1) only assumes that attributes mentioned in the description are positive for the target image, in the retrieval phase, we additionally assume that attributes not mentioned in the description are negative. Then,

We use the score to rank images, where and are hyperparameters tuned on the classifier_test subset of the policy_pretrain set. Our reported results use and .

Appendix D Policy Representation and Learning

In any state, the agent can take one of the following actions:

  • A special action for guessing – the image is chosen using Equation 2.

  • A set of clarification actions – one for each attribute.

  • A set of actions corresponding to possible active learning queries – one example query per attribute and one label query for each pair of an attribute and an image.

We learn a hierarchical dialog policy that consists of three parts as described below. For each policy, we obtain a feature representation of a state-action pair, as outlined in sections 4.5, 4.6 and 4.7.

  • A clarification policy to choose the best possible clarification action in the current state, using features of each candidate clarification action in that state.

  • An active learning policy to choose the best possible active learning query in the current state, using features of each candidate query in that state. We reduce the action space of the active learning policy to one example query action per attribute, and one label query action per attribute corresponding to the image with probability closest to 0.5 for that attribute.

  • A decision policy that chooses between guessing, the selected clarification action and the selected active learning action, using features of each of these actions in the current state.
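The hierarchy described above can be sketched as follows. The scoring functions are hypothetical stand-ins for the learned policies (for example, Q-values computed from state-action features), and the action names in the example are invented for illustration.

```python
def choose_action(state, clarification_actions, active_learning_actions,
                  clar_score, al_score, decision_score):
    """Pick one action using three separate scorers.

    Each *_score argument is a function (state, action) -> float.
    """
    # Lower level: each sub-policy proposes its best action.
    best_clar = max(clarification_actions, key=lambda a: clar_score(state, a))
    best_al = max(active_learning_actions, key=lambda a: al_score(state, a))
    # Upper level: the decision policy picks among guessing and the two proposals.
    options = ["guess", best_clar, best_al]
    return max(options, key=lambda a: decision_score(state, a))


# Toy scores, purely illustrative.
clar_q = {"color": 0.2, "pattern": 0.7}
al_q = {"example:floral": 0.4, "label:img3:striped": 0.1}
decision_q = {"guess": 0.3, "pattern": 0.5, "example:floral": 0.6}
chosen = choose_action(
    None, list(clar_q), list(al_q),
    lambda s, a: clar_q[a], lambda s, a: al_q[a], lambda s, a: decision_q[a],
)
```

Keeping the three scorers separate mirrors the fact that the policies share a model structure but no parameters.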

We experiment with using both Q-learning and A3C [mnih:icml16] for policy learning. We use the same model structure for all three policies but no shared parameters. In the following discussion of the model structure, the state-action features and the policy stand for the appropriate input and output of whichever of the three policies is being described.

For Q-learning, we use a single-layer neural network with hidden layer size 100, whose input is the state-action feature vector and whose output is the Q-value of the action in that state under the policy. Suppose an action is taken in a state, resulting in a reward and a next state. We then update the network toward a new target for the Q-value of that state-action pair.

In the policy training phase, we choose -greedy actions with and in the policy validation and testing phases, we choose actions greedily. We use .
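For reference, a minimal sketch of ε-greedy action selection and the standard one-step Q-learning target is shown below. It assumes the textbook target (reward plus discounted maximum next-state Q-value); the function names are ours, and the values of ε and the discount factor in the example are placeholders, not the settings used in our experiments.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """q_values: dict mapping action -> estimated Q-value."""
    # With probability epsilon explore uniformly, otherwise act greedily.
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


def q_learning_target(reward, next_q_values, gamma, terminal=False):
    """Standard one-step target: r + gamma * max_a' Q(s', a')."""
    if terminal or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values.values())


# Placeholder values, for illustration only.
greedy_action = epsilon_greedy({"guess": 1.0, "clarify": 2.0}, epsilon=0.0)
target = q_learning_target(1.0, {"guess": 0.5, "clarify": 2.0}, gamma=0.9)
```

Setting `epsilon=0.0` recovers the greedy behavior used in the validation and testing phases.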

For A3C, as a critic, we use a network similar to Q-learning, predicting . The actor uses a policy representation:

where is a learned parameter vector. Suppose action is taken in state resulting in reward and next state , the critic network is updated similar to Q-learning and the actor weights are updated as:


where is the estimate from the critic network. We use and .
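As an illustration only, the sketch below shows a generic one-step advantage actor-critic update for a single worker, assuming a softmax actor over linear scores of the state-action features and critic estimates passed in as plain numbers. This is a textbook form and not necessarily the exact parameterization or update used here; the asynchronous multi-worker aspect of A3C is omitted, and all names and numeric values are placeholders.

```python
import math


def softmax_policy(theta, features):
    """features: dict mapping action -> feature vector (list of floats)."""
    scores = {a: sum(w * x for w, x in zip(theta, f)) for a, f in features.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {a: math.exp(s - m) for a, s in scores.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}


def actor_update(theta, features, action, reward, v_s, v_next, gamma, lr):
    """One policy-gradient step weighted by the one-step advantage."""
    probs = softmax_policy(theta, features)
    advantage = reward + gamma * v_next - v_s
    # Gradient of log softmax with linear scores: f(s, a) - E_pi[f(s, .)].
    expected = [sum(probs[a] * features[a][k] for a in features)
                for k in range(len(theta))]
    grad_log_pi = [features[action][k] - expected[k] for k in range(len(theta))]
    return [w + lr * advantage * g for w, g in zip(theta, grad_log_pi)]


# Placeholder features and hyperparameters, for illustration only.
feats = {"x": [1.0, 0.0], "y": [0.0, 1.0]}
new_theta = actor_update([0.0, 0.0], feats, "x", reward=1.0,
                         v_s=0.0, v_next=0.0, gamma=0.9, lr=0.1)
```

With a positive advantage, the update raises the probability of the action that was taken.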

Appendix E Estimation of Information Gain

Our static baseline for choosing clarification questions is based on prior work in goal-oriented dialog that attempts to estimate the information gain of a clarification question [lee:vigil18]. In this setting, the agent asking questions needs to identify a target object among a set of candidate objects, and can ask clarification questions to help identify the target. Let $C$, $Q_t$ and $A_t$ be random variables corresponding to the target object, question in turn $t$ and answer in turn $t$ respectively, and $c$, $q_t$ and $a_t$ represent specific values of these variables. Then the information gain from asking question $q_t$, given previous questions $q_{1:t-1}$ and their answers $a_{1:t-1}$, is

$$I[C, A_t; q_t, a_{1:t-1}, q_{1:t-1}] = \sum_{a_t} \sum_{c} p(c \mid a_{1:t-1}, q_{1:t-1}) \, p(a_t \mid c, q_t, a_{1:t-1}, q_{1:t-1}) \ln \frac{p(a_t \mid c, q_t, a_{1:t-1}, q_{1:t-1})}{p(a_t \mid q_t, a_{1:t-1}, q_{1:t-1})}$$

In our case, possible targets $c$ correspond to possible images. As in prior work [lee:vigil18], $p(c \mid a_{1:t-1}, q_{1:t-1})$ corresponds to the estimated likelihood of target $c$ given the conversation history, which in our case is the current dialog belief over images. We also make an additional assumption that the answer to question $q_t$ depends only on the target image and not on prior questions and answers. Hence:

$$p(a_t \mid c, q_t, a_{1:t-1}, q_{1:t-1}) = p(a_t \mid c, q_t)$$

Since our questions are attributes and the attribute classifier is expected to provide the probability that attribute $q_t$ is true for image $c$, we take $p(a_t = 1 \mid c, q_t)$ to be this classifier probability and $p(a_t = 0 \mid c, q_t)$ to be its complement. In practice, we observe that the classifier does not produce well-calibrated probabilities despite the use of temperature correction, and we believe that this contributes to the poor performance of the static clarification policy.

Substituting these, we get the information gain for question $q_t$ as:

$$\sum_{a_t \in \{0, 1\}} \sum_{c} p(c \mid a_{1:t-1}, q_{1:t-1}) \, p(a_t \mid c, q_t) \ln \frac{p(a_t \mid c, q_t)}{\sum_{c'} p(c' \mid a_{1:t-1}, q_{1:t-1}) \, p(a_t \mid c', q_t)}$$
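The information gain estimate described in this appendix can be sketched for a single yes/no attribute question as follows. The sketch assumes that the belief is a normalized probability over candidate images and that the classifier probability of the attribute for each image is used directly as the likelihood of a "yes" answer; the function name and example values are ours.

```python
import math


def information_gain(belief, attr_probs):
    """Expected information gain of asking about one attribute.

    belief: list of P(image is the target), summing to 1.
    attr_probs: list of classifier P(attribute is true | image), same order.
    """
    gain = 0.0
    for answer in (1, 0):
        # Likelihood of this answer for each candidate image.
        lik = [p if answer else 1.0 - p for p in attr_probs]
        # Marginal probability of the answer under the current belief.
        marginal = sum(b * l for b, l in zip(belief, lik))
        for b, l in zip(belief, lik):
            if b > 0 and l > 0:  # zero-probability terms contribute nothing
                gain += b * l * math.log(l / marginal)
    return gain


# A perfectly discriminating attribute versus an uninformative one.
gain_split = information_gain([0.5, 0.5], [1.0, 0.0])
gain_none = information_gain([0.5, 0.5], [0.5, 0.5])
```

An attribute that separates two equally likely candidates yields ln 2 nats, while an attribute with the same probability for every candidate yields zero.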

Appendix F Classifier Design and Hyperparameters

For the attribute classifier, we initially experimented with alternate classifier designs, such as binary SVMs using features extracted from Inception-V3, and fine-tuning Inception-V3 after altering the number of classes. We also experimented with alternate loss functions, such as weighted cross entropy and a ranking loss that maximizes the difference between the predicted probabilities of positive and negative attributes, both for fine-tuning Inception-V3 and for the design in section 4.1. Additionally, we compared fine-tuning all layers of Inception-V3 with training/fine-tuning only the extra/final layers. We used Inception-V3 as the backbone network due to the results reported in the original paper [guo:iccvworkshop19].

However, in contrast to the original paper, we found that our particular network design and loss function were required to obtain reasonable classifier performance. Additionally, we found that it was necessary to initialize Inception-V3 with weights pretrained on ImageNet and train only the new layers on the iMaterialist dataset. These differences could be due to differences in the data split: our choice of data split results in many attributes always having a negative label during the training phase.

We also found that it was sometimes possible to obtain increases in the multilabel F1 metric proposed in the original paper [guo:iccvworkshop19] without any improvement in per-attribute F1. For example, it is possible to obtain a multilabel F1 of 36.0 on the original validation set by identifying the 13 attributes with the largest number of positive examples, always predicting 1 for these, and always predicting 0 for the other attributes. Hence, we used the average per-attribute F1 to choose the design and hyperparameters of the classifier.
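The gap between the two metrics can be illustrated with a toy example. The sketch assumes that the multilabel F1 is micro-averaged over all image-attribute pairs, which is our reading of the metric rather than a definition given here; the labels are invented and the function names are ours.

```python
def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_and_macro_f1(labels, preds):
    """labels, preds: lists of rows with one 0/1 entry per attribute."""
    n_attrs = len(labels[0])
    tp, fp, fn = [0] * n_attrs, [0] * n_attrs, [0] * n_attrs
    for y_row, p_row in zip(labels, preds):
        for k in range(n_attrs):
            if y_row[k] and p_row[k]:
                tp[k] += 1
            elif p_row[k]:
                fp[k] += 1
            elif y_row[k]:
                fn[k] += 1
    # Micro: pool counts over attributes. Macro: average per-attribute F1.
    micro = f1(sum(tp), sum(fp), sum(fn))
    macro = sum(f1(tp[k], fp[k], fn[k]) for k in range(n_attrs)) / n_attrs
    return micro, macro


# Toy labels: attribute 0 is frequent, attributes 1 and 2 are rare.
labels = [[1, 0, 0], [1, 1, 0], [1, 0, 0], [1, 0, 1]]
# Degenerate predictor: always predict only the frequent attribute.
preds = [[1, 0, 0] for _ in labels]
micro_f1, macro_f1 = micro_and_macro_f1(labels, preds)
```

In this toy case the degenerate predictor reaches a pooled F1 of 0.8 while the average per-attribute F1 is only one third, which is the kind of mismatch that motivates selecting on per-attribute F1.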

To initialize the classifier, we train for 100 epochs with a batch size of 8,192, using RMSProp for optimization. We start with a learning rate of 0.1, which is decayed exponentially with a decay rate of 0.9 every 400 steps. To update the classifier between dialog batches, we use a batch size of 128 and perform a single epoch over the images for which the label of at least one attribute has been updated.
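The learning rate schedule can be written as a small function. The sketch assumes a staircase decay, in which the rate drops by a factor of 0.9 once every 400 steps; a smooth decay with a fractional exponent is an equally plausible reading, so it is exposed as an option. The function name is ours.

```python
def learning_rate(step, base_lr=0.1, decay_rate=0.9, decay_steps=400, staircase=True):
    """Exponentially decayed learning rate at a given optimizer step."""
    # Staircase: integer number of completed decay periods; otherwise fractional.
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay_rate ** exponent


lr_start = learning_rate(0)
lr_after_400 = learning_rate(400)
lr_after_800 = learning_rate(800)
```

Under the staircase assumption the rate stays at 0.1 for the first 400 steps before the first decay is applied.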