We introduce an inference technique to produce discriminative context-aware image captions (captions that describe differences between images or visual concepts) using only generic context-agnostic training data (captions that describe a concept or an image in isolation). For example, given images and captions of "siamese cat" and "tiger cat", we generate language that describes the "siamese cat" in a way that distinguishes it from "tiger cat". Our key novelty is that we show how to do joint inference over a language model that is context-agnostic and a listener which distinguishes closely-related concepts. We first apply our technique to a justification task, namely to describe why an image contains a particular fine-grained category as opposed to another closely-related category of the CUB-200-2011 dataset. We then study discriminative image captioning to generate language that uniquely refers to one of two semantically-similar images in the COCO dataset. Evaluations with discriminative ground truth for justification and human studies for discriminative image captioning reveal that our approach outperforms baseline generative and speaker-listener approaches for discrimination.
Language is the primary modality for communicating, and representing knowledge. To convey relevant information, we often use language in a way that takes into account context. For example, instead of describing a situation in a “literal” way, one might pragmatically emphasize selected aspects in order to be persuasive, impactful or effective. Consider the target image at the bottom left in Fig. 1. A literal description “An airplane is flying in the sky” conveys the semantics of the image, but would be inadequate if the goal was to disambiguate this image from the distractor image (bottom right). For this purpose, a more pragmatic description would be, “A large passenger jet flying through a blue sky”. This description is aware of context, namely, that the distractor image also has an airplane. People use such pragmatic considerations continuously, and effortlessly in teaching, conversation and discussions.
In this vein, it is desirable to endow machines with pragmatic reasoning. One approach would be to collect training data of language used in context, for example, discriminative ground truth utterances from people describing images in the context of other images, or justifications explaining why an image contains a target class as opposed to a distractor class (Fig. 1). Unfortunately, collecting such data has a prohibitive cost, since the space of objects in possible contexts is often too large. Furthermore, in some cases the context in which we wish to be pragmatic may be unknown a priori. For example, a free-form conversational agent may have to respond in a context-aware or discriminative fashion depending upon the history of a conversation. Such scenarios also arise in human-robot interaction, as when a robot must reason about which spoon a person is asking for. Thus, in this paper, we focus on deriving pragmatic (context-aware) behavior given access only to generic (context-agnostic) ground truth.
We study two qualitatively different real-world vision tasks that require pragmatic reasoning. The first is justification, where the model needs to justify why an image corresponds to one fine-grained object category, as opposed to a closely related, yet undepicted, category. Justification is important for hobbyists and domain experts: ornithologists and botanists often need to explain why an image depicts a particular species as opposed to a closely-related species. Another potential application for justification is in machine teaching, where an algorithm instructs non-expert humans about new concepts.
Our second task is discriminative image captioning, where the goal is to generate a sentence that describes an image in the context of other semantically similar images. This task is not only grounded in pragmatics, but is also interesting as a scene-understanding task that probes fine-grained image understanding. It also has potential applications to human-robot interaction.
Recent work by Andreas and Klein derives pragmatic behavior in neural language models using only context-free data. While we are motivated by similar considerations, the key algorithmic novelty of our work over theirs is a unified inference procedure which leads to a more efficient search for discriminative sentences (Sec. 5). Our approach is based on the realization that one may simply re-use the sampling distribution from the generative model, instead of training a separate model to assess discriminativeness. This also has important implications for practitioners, since one can easily adapt existing context-free captioning models for context-aware captioning without additional training. Furthermore, while their model was applied to an abstract scenes dataset, we apply ours to two qualitatively different real-image datasets: the fine-grained birds dataset CUB-200-2011, and the COCO dataset, which contains real-life scenes with common objects.
In summary, the key contributions of this paper are:
A novel inference procedure that models an introspective speaker (IS), allowing a speaker (S) (say a generic image captioning model) to reason about pragmatic behavior without additional training.
Two new tasks for studying discriminative behavior and pragmatics, grounded in vision: justification, and discriminative image captioning.
A new dataset (CUB-Justify) to evaluate justification systems on fine-grained bird images with 5 captions for 3161 (image, target class, distractor class) triplets.
Our evaluations on CUB-Justify, and human evaluation on COCO show that our approach outperforms baseline approaches at inducing discrimination.
Pragmatics: The study of pragmatics – how context influences the usage of language – stems from the foundational work of Grice, who analyzed how cooperative linguistic agents could model each other's behavior to achieve a common objective. Consequently, much of the pragmatics literature has studied higher-level behavior in agents, including conversational implicature and the Gricean maxims. These works aim to derive pragmatic behavior given minimal assumptions on individual agents, and typically use hand-tuned lexicons and rules.
More recently, there have been exciting developments in applying reinforcement learning (RL) techniques to these problems [27, 8, 21], requiring less manual tuning.
We are also interested in deriving pragmatic behavior, but our focus is on scaling context-sensitive behavior to vision tasks. Other works model ideas from pragmatics to learn language via games played online  or for human-robot collaboration . In a similar spirit, here we are interested in applying ideas from pragmatics to build systems that can provide justifications (Sec. 4.1) and provide discriminative image captions (Sec. 4.2).
Most relevant to our work is the recent work on deriving pragmatic behavior in abstract scenes made with clipart, by Andreas and Klein. Unlike their technique, our proposed approach does not require training a second listener model, and supports more efficient inference (Sec. 3.3). More details are provided in Sec. 3.1.
Beyond Image Captioning: Image captioning, the task of generating a natural language description for an image, has seen rapid progress [11, 12, 39, 43]. Recently, research has shifted beyond image captioning, addressing tasks like visual question answering [3, 14, 25, 45], referring expression generation [20, 26, 28, 33], and fill-in-the-blanks. In a similar spirit, the two tasks we introduce here, justification and discriminative image captioning, can be viewed as “beyond image captioning” tasks. Sadovnik et al. first studied a discriminative image description task, with the goal of distinguishing one image from a set of images. Their approach incorporates cues such as discriminability and saliency, and uses hand-designed rules for constructing sentences. In contrast, we develop inference techniques to induce discriminative behavior in neural models. The reference game of Andreas and Klein can also be seen as a discriminative image captioning task on abstract scenes made from clipart, while we are interested in the domain of real images. The work on generating referring expressions by Mao et al. generates discriminative captions which refer to particular objects in an image, given context-aware supervision. Our work is different in that we address an instance of pragmatic reasoning in the common case where context-dependent data is not available for training.
Rationales: Several works have studied how machines can understand human rationales, including enriching classification by asking for explanations from humans, and incorporating human rationales in active learning [7, 29]. In contrast, we focus on machines providing justifications to humans. This could potentially allow machines to teach new concepts to humans (machine teaching). Other recent work looks at post-hoc explanations for classification decisions: instead of explaining why a model thinks an image belongs to a particular class, it describes why an image is of the class predicted by the classifier. Unlike this task, our justification task requires reasoning about explicit context from the distractor class. Further, we are not interested in providing rationalizations for classification decisions, but in explaining the differences between confusing concepts to humans. We show a comparison to this approach in the appendix, demonstrating the importance of context for justification.
We describe our approach for inducing context-aware language for: 1) justification, where the context is another class, and 2) discriminative image captioning, where the context is a semantically similar image. For clarity, we first describe the formulation for justification, and then discuss a modification for discriminative image captioning.
In the justification task (Fig. 1 top), we wish to produce a sentence s, comprised of a sequence of words s_1, …, s_T, based on a given image I of a target concept c_t, in the context of a distractor concept c_d. The produced justification should capture aspects of the image that discriminate between the target and the distractor concepts. Note that images of the distractor class are not provided to the algorithm.
We first train a generic context-agnostic image captioning model (from here on referred to as the speaker) using training data from Reed et al., who collected captions describing bird images in the CUB-200-2011 dataset. We condition the model on the class c in addition to the image I. That is, we model p(s | I, c). This not only helps produce better sentences (providing the model access to more information), but is also the cornerstone of our approach for discrimination (Sec. 3.2). Our language models are recurrent neural networks, which represent the state of the art for language modeling across a range of popular tasks like image captioning [39, 43], machine translation, etc.
To induce discrimination in the utterances from a language model, it is natural to consider using a generator, or speaker, which models p(s | I, c_t), in conjunction with a listener function f(s, c_t, c_d) that scores how discriminative an utterance s is. The task of a pragmatic reasoning speaker, then, is to select utterances which are good sentences as per the generative model, and are discriminative as per the listener:

    argmax_s  λ log p(s | I, c_t) + (1 − λ) f(s, c_t, c_d)        (1)

where λ ∈ [0, 1] controls the tradeoff between the linguistic adequacy of the sentence and its discriminativeness.
A similar reasoning speaker model forms the core of the approach of Andreas and Klein, where the speaker p and the listener f are implemented using multi-layer perceptrons (MLPs). As noted in their work, selecting utterances from such a reasoning speaker poses several challenges. First, exact inference in this model over the exponentially large space of sentences is intractable. Second, in general one would not expect the discriminator function f to factorize across words, making joint optimization of the reasoning speaker objective difficult. Thus, Andreas and Klein adopt a sampling-based strategy, where the speaker p serves as the proposal distribution, and its samples are ranked by a linear combination of the speaker and listener scores (Eq. 1). Importantly, this proposal distribution is over full sentences, hence the effectiveness of this formulation depends heavily on the distribution captured by p, since the search over the space of all strings is driven solely by the speaker. This is inefficient, especially when there is a mismatch between the statistics of the context-free (generative) and the unknown context-aware (discriminative) sentence distributions. In such cases, one must resort to drawing many samples to find good discriminative sentences.
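As a concrete sketch of this sample-and-rerank scheme, the reasoning speaker of Eq. 1 reduces to scoring sampled sentences with a weighted sum; the sentences, log-probabilities, and listener scores below are illustrative, not from a trained model:

```python
def rerank_samples(samples, lam):
    """Reasoning speaker (Eq. 1) as used in sample-and-rerank schemes:
    sentences drawn from the context-agnostic speaker are ranked by
    lam * log p(s | I, c_t) + (1 - lam) * f(s, c_t, c_d)."""
    return max(samples, key=lambda s: lam * s["logp"] + (1 - lam) * s["f"])

# Hypothetical samples with speaker log-probabilities and listener scores.
samples = [
    {"sent": "a bird with a long bill", "logp": -4.0, "f": 0.1},   # fluent, generic
    {"sent": "a bird with a red throat", "logp": -5.0, "f": 2.0},  # discriminative
]
best = rerank_samples(samples, lam=0.5)  # the discriminative sample wins
```

Note that the search is confined to whatever the speaker happens to sample; the listener can only re-order candidates, which is what makes the scheme sample-hungry.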
Our approach for incorporating contextual behavior is based on a simple modification to the listener f (Eq. 1). Given the generator p, we construct a listener module that discriminates between c_t and c_d using the following log-likelihood ratio:

    f(s, c_t, c_d) = log [ p(s | I, c_t) / p(s | I, c_d) ]        (2)
This listener only depends on the generative model p for the two classes c_t and c_d. We name it the “introspector” to emphasize that this step re-uses the generative model, and does not need to train an explicit listener model. Substituting the introspector into Eq. 1 induces the following introspective speaker model for discrimination:

    argmax_s  λ log p(s | I, c_t) + (1 − λ) log [ p(s | I, c_t) / p(s | I, c_d) ]        (3)
with λ ∈ [0, 1] trading off the weight given to generation and introspection (similar to Eq. 1). In general, we expect this approach to provide sensible results when c_t and c_d are similar. That is, we expect humans to describe similar concepts in similar ways, hence p(s | I, c_t) should not be too different from p(s | I, c_d). Thus, the introspector is less likely to overpower the speaker in Eq. 3 in such cases (for a given λ). Note that for sufficiently different concepts the speaker alone is likely to be sufficient for discrimination. That is, describing the concept in isolation is likely to be enough to discriminate against a different or unrelated concept.
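Because the introspector is just a ratio of speaker likelihoods, the full-sentence objective of Eq. 3 can be scored with nothing but the generative model's log-probabilities. A minimal sketch, with made-up log-probabilities standing in for an actual captioning model:

```python
def introspective_score(lp_target, lp_distractor, lam):
    """Introspective speaker objective (Eq. 3) for a full sentence s:
    lam * log p(s|I,c_t) + (1 - lam) * log[p(s|I,c_t) / p(s|I,c_d)]."""
    return lam * lp_target + (1.0 - lam) * (lp_target - lp_distractor)

# A sentence equally likely under both classes gets no boost from the ratio,
# while one that is unlikely under the distractor class is preferred.
generic = introspective_score(-5.0, -5.0, lam=0.5)         # e.g. "this is a bird"
discriminative = introspective_score(-5.0, -9.0, lam=0.5)  # e.g. "... red throat"
```

At lam = 1 the score collapses to the plain speaker log-probability, recovering context-agnostic captioning.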
A careful inspection of the introspective speaker model reveals two desirable properties over previous work. First, the introspector does not need training, since it only depends on p, the original generative model. Thus, existing language models can be readily re-used to produce context-aware outputs by conditioning on the appropriate concept. We demonstrate empirical validation of this in Sec. 5. This helps scale the approach to scenarios where it is not known a priori which concepts need to be discriminated, in contrast to approaches which train a separate listener module. Second, it leads to a unified and efficient inference procedure for the introspective speaker (Eq. 3), which we describe next.
We maximize the introspective speaker objective using a modification of the beam search algorithm, a heuristic graph-search algorithm commonly used for inference in recurrent neural networks [17, 38].
We first factorize the posterior log-probability terms in the introspective speaker equation (Eq. 3) over time steps, writing s_{1:τ} for the first τ words of the sentence (s_{1:0} corresponds to a null string), where T is the length of the sentence. We then combine terms from Eq. 3, yielding the following emitter-suppressor (ES) objective for the introspective speaker:

    argmax_s  Σ_{τ=1}^{T} log [ p(s_τ | s_{1:τ−1}, I, c_t) / p(s_τ | s_{1:τ−1}, I, c_d)^{(1−λ)} ]        (4)
The emitter (the numerator in Eq. 4) is the generative model conditioned on the target concept c_t, deciding which token to select at a given timestep. The suppressor (the denominator in Eq. 4) is conditioned on the distractor concept c_d, providing signals to the emitter on which tokens to avoid. This is intuitive – to be discriminative, we want to emit words that match c_t, but avoid emitting words that match c_d.
We maximize the emitter-suppressor objective (Eq. 4) using beam search. Vanilla beam search, as typically used in language models, prunes the output space at every time-step keeping the top-B (usually incomplete) sentences with highest log-probabilities so far (speaker in Eq. 3). Instead, we run beam search to keep the top-B sentences with highest ES ratio in Eq. 4. Fig. 2 illustrates this ES beam search for a beam size of 1.
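The ES beam search can be sketched as follows; `step_logp` is a stand-in for the trained recurrent speaker, returning per-token conditional log-probabilities under either the target or the distractor concept, and the toy unigram distributions are purely illustrative:

```python
import math

def es_beam_search(step_logp, vocab, beam_size, max_len, lam, eos="</s>"):
    """Emitter-suppressor beam search (Eq. 4): beams are ranked by the running
    sum of log p(tok|prefix,target) - (1 - lam) * log p(tok|prefix,distractor)."""
    beams = [([], 0.0)]  # (token prefix, ES score so far)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix and prefix[-1] == eos:  # completed sentence: keep as-is
                candidates.append((prefix, score))
                continue
            for tok in vocab:
                emit = step_logp(prefix, "target", tok)      # emitter
                supp = step_logp(prefix, "distractor", tok)  # suppressor
                candidates.append((prefix + [tok], score + emit - (1.0 - lam) * supp))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_size]
    return beams[0][0]

# Toy unigram speaker: both concepts like "long"; only the target likes "red".
DISTS = {"target":     {"long": 0.45, "red": 0.45, "</s>": 0.10},
         "distractor": {"long": 0.80, "red": 0.10, "</s>": 0.10}}

def toy_step_logp(prefix, concept, tok):
    return math.log(DISTS[concept][tok])

caption = es_beam_search(toy_step_logp, ["long", "red", "</s>"],
                         beam_size=2, max_len=3, lam=0.5)
```

With the suppressor active (lam < 1), the search is steered toward the distinctive token "red" even though "long" and "red" are equally likely under the target alone.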
It is important to consider how the trade-off parameter λ affects the produced sentences. For λ = 1, the model generates descriptions that ignore the context. At the other extreme, very low λ values are likely to make the produced sentences very different from any sentence in the training set (repeated words, ungrammatical sentences). It is not trivial to assume that there exists a wide enough range of λ producing sentences that are both discriminative and well-formed. However, our results (Sec. 5) indicate that such a range of λ exists in practice.
We are given a target image I_t and a distractor image I_d that we wish to distinguish, analogous to the two classes in the justification task. We construct a speaker (or generator) for this task by training a standard image captioning model. Given this speaker, we construct an emitter-suppressor objective (as in Eq. 4):

    argmax_s  Σ_{τ=1}^{T} log [ p(s_τ | s_{1:τ−1}, I_t) / p(s_τ | s_{1:τ−1}, I_d)^{(1−λ)} ]        (5)
We re-use the mechanics of the emitter-suppressor beam search from Sec. 3.3, conditioning the emitter on the target image I_t, and the suppressor on the distractor image I_d.
We provide details of the CUB dataset, of our CUB-Justify dataset used for evaluation, and of the speaker-training setup for the justification task. We then discuss the experimental protocols for discriminative image captioning.
The Caltech UCSD birds (CUB) dataset  contains 11788 images for 200 species of North American birds.
Each image in the dataset has been annotated with 5 fine-grained captions by Reed et al. . These captions mention various details about the bird (“This is a white spotted bird with a long pointed black beak.”) while not mentioning the name of the bird species.
CUB-Justify Dataset: We collect a new dataset (CUB-Justify) with ground truth justifications for evaluating justification. We first sample the target and distractor classes from within a hyper-category created based on the last word of the folk names of the 200 species in CUB. For instance, “rufous hummingbird” and “ruby throated hummingbird” both fall in the hyper-category “hummingbird”. We induce 37 such hyper-categories. The largest single hyper-category is “Warbler”, with 25 categories. We then select a subset of approximately 15 images from the test set of CUB-200-2011 for each of the 200 classes, to form a CUB-Justify test split. We use the rest for speaker training (CUB-Justify train split).
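The hyper-category induction step amounts to grouping class names by their final word; a minimal sketch with a few illustrative CUB folk names:

```python
def induce_hypercategories(species_names):
    """Group fine-grained classes by the last word of their folk name;
    confusing (target, distractor) pairs are then sampled within a group."""
    groups = {}
    for name in species_names:
        groups.setdefault(name.split()[-1].lower(), []).append(name)
    return groups

birds = ["Rufous Hummingbird", "Ruby Throated Hummingbird",
         "Hooded Warbler", "Kentucky Warbler", "Painted Bunting"]
hypercats = induce_hypercategories(birds)
# e.g. the "hummingbird" group holds both hummingbird classes
```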
Workers were then shown an image of the “rufous hummingbird”, for instance, and a set of 6 other images (from CUB-Justify test split) all belonging to the distractor class “ruby throated hummingbird”, to form the visual notion of the distractor class.
They were also shown a diagram of the morphology of birds indicating various parts such as tarsus, rump, wingbars etc. (similar to Reed et al. ). The instruction was to describe the target image such that it is not confused with images from the distractor class. Some birds are best distinguished by non-visual cues such as their call, or their migration patterns.
Thus, we drop the categories of birds from the original list of triplets which were labeled as too hard to distinguish by the workers.
At the end of this process we are left with 3161 triplets with 5 captions each. We split this dataset into 1070 validation (for selecting the best value of λ) and 2091 test examples. More details on the interface can be found in the appendix.
Speaker Training: We implement a model similar to “Show, Attend, and Tell” from Xu et al., modifying the original model to provide the class as input. Exact details of our model architecture are given in the appendix. We train the model on the CUB-Justify train split. Recall that this split only contains the context-agnostic captions from Reed et al.
To evaluate the quality of our speaker model, we report numbers here using the CIDEr-D metric commonly used for image captioning [16, 19, 39], computed on the context-agnostic captions from Reed et al. Our captioning model with both the image and the class as input reaches a validation score of 50.2 CIDEr-D, while the original image-only captioning model reaches a CIDEr-D of 49.1. These scores are in a similar range as existing CUB captioning approaches.
We measure the performance of the (context-aware) justification captions against the CUB-Justify discriminative captions using the CIDEr-D metric. CIDEr-D weighs n-grams by their inverse document frequencies (IDF), giving higher weight to sentences containing “content” n-grams (“red beak”) than generic n-grams (“this bird”). Further, CIDEr-D captures the importance of an n-gram for the image. For instance, it emphasizes “red beak” over, say, “black belly” if “red beak” is used more often in human justifications. We also report METEOR scores for completeness. A more detailed discussion of metrics can be found in the appendix.
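To see why IDF weighting emphasizes content n-grams, consider a toy unigram version of the weighting; the reference captions below are illustrative, not drawn from the dataset:

```python
import math
from collections import Counter

def idf_weights(reference_corpus):
    """Toy unigram IDF as used in CIDEr-style metrics: words occurring in
    every reference caption (e.g. "this", "bird") get weight ~0, while
    rarer content words (e.g. "red") get higher weight."""
    n = len(reference_corpus)
    doc_freq = Counter(w for cap in reference_corpus for w in set(cap.split()))
    return {w: math.log(n / df) for w, df in doc_freq.items()}

corpus = ["this bird has a red beak",
          "this bird has a black belly",
          "this bird is small"]
w = idf_weights(corpus)
# "this"/"bird" appear everywhere -> weight 0; "red" is rare -> high weight
```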
Dataset: We want to test if reasoning about context with an introspective speaker can help discriminate between pairs of very similar images from the COCO dataset. To construct a set of confusing image pairs, we follow two strategies. First, easy confusion: For each image in the validation (test) set, we find its nearest neighbor in the FC7 space of a pre-trained VGG-16 CNN , and repeat this process of neighbor finding for 1000 randomly chosen source images. Second, hard confusion: To further narrow down to a list of semantically similar confusing images, we then run the speaker model on the nearest neighbor images, and compute word-level overlap (intersection over union) of their generated sentences. We then pick the top 1000 pairs with most overlap. Interestingly, the top 539 pairs had identical captions. This reflects the issue of the output of image captioning models lacking diversity, and seeming templated [9, 39].
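The hard-confusion mining step reduces to a word-level IoU between generated captions; a small sketch with hypothetical caption pairs:

```python
def caption_iou(cap_a, cap_b):
    """Word-level intersection-over-union between two generated captions;
    pairs with high overlap (identical captions give 1.0) form the
    'hard confusion' split."""
    a, b = set(cap_a.lower().split()), set(cap_b.lower().split())
    return len(a & b) / len(a | b)

pairs = [
    ("a large passenger jet flying through a blue sky",
     "a large passenger jet flying through a blue sky"),  # identical captions
    ("a man riding a wave on a surfboard",
     "a plate of food with broccoli and meat"),
]
# Rank nearest-neighbor pairs by caption overlap; the top is "hardest".
hardest = max(pairs, key=lambda p: caption_iou(*p))
```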
Speaker Training and Evaluation: We train our generative speaker for use in emitter-suppressor beam search using the model implemented in the neuraltalk2 project, with standard train/val/test splits. Our trained and finetuned speaker model achieves a performance of 91 CIDEr-D on the test set. As seen in Eq. 5, no category information is used for this task. We evaluate approaches for discriminative image captioning based on how often they help humans select the correct image out of a pair of images.
Methods and Baselines: We evaluate the following models:
1. IS(λ): the introspective speaker from Eq. 3;
2. IS(1): a standard literal speaker, which generates a caption conditioned on the image and target class, but ignores the distractor class;
3. semi-blind-IS(λ): an introspective speaker in which the listener does not have access to the image, but the speaker does;
4. blind-IS(λ): an introspective speaker without access to the image, conditioned only on the classes;
5. RS(λ): our implementation of Andreas and Klein, but using our (more powerful) language model, and ranking samples with Eq. 3 using a listener without access to the image (similar to semi-blind-IS(λ)), as opposed to a trained MLP, to keep things comparable.
All approaches use 10 beams/samples (which is better than lower values) unless stated otherwise.
Validation Performance: Fig. 3 shows the performance on the CUB-Justify validation set as a function of λ, the hyperparameter controlling the tradeoff between the speaker and the introspector (Eq. 3). For the RS(λ) baseline, λ stands for the tradeoff between the log-probability of the sentence and the score from the discriminator function for sample re-ranking. A few interesting observations emerge. First, both our IS(λ) and semi-blind-IS(λ) models outperform the baselines for the mid range of λ values. The IS(λ) model does better overall, but semi-blind-IS(λ) has a more stable performance over a wider range of λ. This indicates that when conditioned on the image, the introspector has to be highly discriminative (low λ values) to overcome the signals from the image, since discrimination is between classes.
Second, as λ is decreased from 1, most methods improve as the sentences become more discriminative, but then get worse again as λ becomes too low. This likely happens because when λ is too low, the model explores rare tokens and parts of the output space that have not been seen during training, leading to badly-formed sentences (Fig. 4). This effect is stronger for IS(λ) models than for RS(λ), since RS(λ) searches the output space over samples from the generator and only ranks using the joint reasoning speaker objective (Eq. 1). Interestingly, at λ = 1 (no discrimination), the RS(λ) approach, which samples from the generator, also performs better than the other approaches, which use beam search to select high log-probability (context-agnostic) sentences. This indicates that in the absence of ground truth justifications, there is indeed a discrepancy between searching for discriminativeness and searching for a highly likely context-agnostic sentence.
We perform more comparisons with the RS(λ) baseline, sweeping over the number of samples drawn from the generator for listener re-ranking (Eq. 1). We find that with 100 samples, RS(λ) reaches CIDEr-D scores (18.8) comparable to (but METEOR scores lower than) our semi-blind-IS(λ) approach with a beam size of 10. This suggests that our semi-blind-IS(λ) approach is more computationally efficient at exploring the output space, because our emitter-suppressor beam search performs joint greedy inference over speaker and introspector, leading to more meaningful local decisions. For completeness, we also trained a listener module discriminatively, and used it as a ranker for RS(λ). We found that this reaches 16.2 ± 0.3 CIDEr-D on validation (at the best value of λ), which is lower than IS(λ), showing that the bottleneck for performance is sampling, rather than the discriminativeness of the listener. More details can be found in the appendix.
Test Performance: Table 1 details the performance of the above models on the test set of CUB-Justify, with each model using its best-performing λ on the validation set (Fig. 3). Both introspective speaker models strongly outperform the baselines, with semi-blind-IS(λ) slightly outperforming the IS(λ) model. This could be because the performance of semi-blind-IS(λ) is less sensitive to the exact choice of λ (Fig. 3). Among the baselines, the best performing method is the blind-IS(λ) model, presumably because this model performs emitter-suppressor beam search, while the other two baseline approaches rely on sampling and regular beam search respectively.
Qualitative Results: We next showcase some qualitative results that demonstrate 1) aspects of pragmatics, and 2) the context dependence captured by our best-performing semi-blind-IS(λ) model. Fig. 4 demonstrates how sentences uttered by the introspective speaker change with λ. At λ = 1 the sentence describes the image well, but is oblivious to the context (distractor class). The sentence “A small sized bird has a very long and pointed bill.” is discriminative of hummingbirds against other birds, but not among hummingbirds (many of which tend to have long beaks/bills). At intermediate values of λ, the model captures discriminative features such as the “red neck”, “white belly”, and “red throat”. Interestingly, at these values the model avoids saying “long beak”, a feature shared by both birds. Next, Fig. 5 demonstrates how the selected utterances change based on the context. A limitation of our approach is that, since the model never sees discriminative training data, in some cases it produces repeated words (“green green green”) when encouraged to be discriminative at inference time.
Finally, Fig. 6 illustrates the importance of visual reasoning for the justification task. Fine-grained species often have large intra-class variances, which a blind approach to justification would ignore. Thus, a good justification approach needs to be grounded in the image signal to pick the discriminative cues appropriate for the given instance.
As explained in Sec. 4.2, we create two sets of semantically similar target and distractor images: easy confusion, based on FC7 features alone, and hard confusion, based on both FC7 features and the sentences generated by the speaker (the image captioning model). We are interested in understanding whether emitter-suppressor inference helps identify the target image better than the generative speaker baseline. The two approaches are thus the speaker (S) (baseline) and the introspective speaker (IS) (our approach). We set λ based on our results on the CUB dataset.
We run all approaches at a beam size of 2 (typically best for COCO ).
Human Studies: We set up a two-alternative forced choice (2AFC) study where we show a caption to raters, asking them to “pick the image that the sentence is more likely to be describing”. Each target-distractor image pair is tested against the generated captions. We check the fraction of times a method caused the target image to be picked by a human. A discriminative image captioning method is considered better if it enables humans to identify the target image more often. Results of the study are summarized in Table 2. We find that our approach outperforms the baseline speaker (S) on the easy confusion as well as the hard confusion splits. However, the gains from our approach are larger on the hard confusion split, which is intuitive.
Qualitative Results: Qualitative results from our COCO experiments are shown in Fig. 7. The target image, when successfully identified, is shown with a green border. We show examples where our model identifies the target image better in the first two rows, and some failure cases in the third row. Notice how the model is able to modify its utterances to account for context and pragmatics when going from the speaker (S) to the introspective speaker (IS). Note that the sentences typically respect grammatical constructs despite being forced to be discriminative.
Table 2: 2AFC accuracy (%) of each approach on the easy confusion and hard confusion splits.
Describing the absence of concepts and inducing comparative language are exciting directions for future work on justification. For instance, when justifying why an image is a lion and not a tiger, it would be useful to be able to say “because it does not have stripes”, or “because it has more hair on its face”. Beyond pragmatics, the justification task also has interesting relations to human learning. Indeed, we all experience that we learn better when someone takes time out to justify or explain their point of view. One can imagine such justifications being helpful for “machine teaching”, where a teacher (machine) can provide justifications to a human learner explaining the rationale for an image belonging to a particular fine-grained category as opposed to a different, possibly mistaken, or confusing fine-grained category.
There are some fundamental limitations to inducing context-aware captions from context-agnostic supervision. For instance, if two distinct concepts are very similar, human-generated context-free descriptions may be identical, and our model (as well as baselines) would fail to extract any discriminative signal. Indeed, it is hard to address such situations without context-aware ground truth.
We believe modeling higher-order reasoning (such as pragmatics) by reusing the sampling distribution from language models can be a powerful tool. It may be applicable to other higher-order reasoning tasks, without necessarily setting up policy gradient estimators on reward functions. Indeed, our inference objective can also be formulated for training. However, initial experiments on this did not yield significant performance improvements.
We introduce a novel technique for deriving pragmatic language from recurrent neural network language models, namely, an image-captioning model that takes into account the context of a distractor class or a distractor image. Our technique can be used at inference time to better discriminate between concepts, without having seen discriminative training data. We study two tasks in the vision, and language domain which require pragmatic reasoning: justification – explaining why an image belongs to one category as opposed to another, and discriminative image captioning – describing an image so that one can distinguish it from a closely related image. Our experiments demonstrate the strength of our method over generative baselines, as well as adaptations of previous work to our setting. We will make the code, and datasets available online.
Acknowledgements: We thank Tom Duerig for his support, and guidance in shaping this project. We thank David Rolnick, Bharadwaja Ghali, Vahid Kazemi for help with CUB-Justify dataset. We thank Ashwin Kalyan for sharing a trained checkpoint for the discriminative image captioning experiments. We also thank Stefan Lee, Andreas Veit, Chris Shallue. This work was funded in part by an NSF CAREER, ONR Grant N00014-16-1-2713, ONR YIP, Sloan Fellowship, ARO YIP, Allen Distinguished Investigator, Google Faculty Research Award, Amazon Academic Research Award to DP.
We organize the appendix as follows:
Sec. 1: Analysis of performance as we consider unrelated images as distractors.
Sec. 4: Optimization details for justification speaker model.
Sec. 5: Choice of metrics for evaluating justification.
Sec. 6: CUB-Justify data collection details.
Sec. 7: Analysis of the RS() baseline in more detail.
Sec. 8: Comparison of our approach to a baseline with a discriminatively trained listener used for reranking in RS() model.
COCO Qualitative Examples: Fig. 8 shows more qualitative results on discriminative image captioning on the hard confusion split of the COCO dataset. Notice how our introspective speaker captions (denoted IS), which model the context (distractor image) explicitly, are often more discriminative, helping identify the target image more clearly than the baseline speaker approach (denoted S). For example, in the second row, our IS model generates the caption “a delta passenger jet flying through a clear blue sky”, which is a more discriminative (and accurate) caption than the baseline caption “a large passenger jet flying through a blue sky”, which applies to both the target and distractor images.
Effect of increasing distance: We illustrate how the quality of the discriminative captions from the introspective speaker (IS) approach varies as the distractor image becomes less relevant to the target image (Fig. 9). For the target image on the left, we show the 1-nearest neighbor (which has a very similar caption to the target image), the 10th-nearest neighbor, and a randomly selected distractor image. When we pick a random image as the distractor, the generated discriminative captions become less comprehensible, losing relevance as well as grammatical structure. This is consistent with our understanding of the introspective speaker (IS) formulation from Sec. 3.2: modeling the context explicitly during inference helps discrimination when the context is relevant. When the context is not relevant, as with the randomly picked images, the original speaker model (S) is likely sufficient for discrimination.
Hendricks et al. propose a method to explain classification decisions to an end user by providing post-hoc rationalizations. Given a prediction from a classifier, this work generates a caption conditioned on the predicted class and the original image. While Hendricks et al. aim to provide a rationale for a classification, we focus on a related but different problem of concept justification. Namely, we want to explain why an image contains a target class as opposed to a specific distractor class, while Hendricks et al. want to explain why a classifier thought an image contains a particular class. Thus, unlike the visual explanation task, it is intuitive that the justification task requires explicit reasoning about context. We verify this hypothesis by first adapting the work of  to our justification task, using it as a speaker, and then augmenting the speaker with our approach to construct an introspective speaker which accounts for context. Interestingly, we find that our introspective speaker approach helps improve the performance of generating visual explanations on justification.
The approach of Hendricks et al. differs from our setup in two important ways. First, it uses a stronger CNN, namely the fine-grained compact-bilinear pooling CNN, which provides state-of-the-art performance on the CUB dataset. Second, to make the explanations more grounded in the class information, they also add a constraint to induce captions that are more specific to the class. This is achieved by using a policy gradient on a reward function that models the probability of the class given the sentence. Thus, in some sense, the approach encourages the model to produce sentences that are highly discriminative of a given class against all other classes, as opposed to the particular distractor class we are interested in for justification. Finally, the policy gradient is used in conjunction with standard maximum-likelihood training to train the explanation model. At inference, the explanation model is run by conditioning the caption generation on the predicted class.
We modify the inference setup of  slightly to condition the caption generation on the target class for justification, as opposed to the predicted class for explanation. We call this the vis-exp approach. We then apply the emitter-suppressor beam search (at a beam size of 1, to be consistent with ) to account for context, giving us an introspective visual explanation model (vis-exp-IS). Given the stronger image features and a more complicated training procedure involving policy gradients (hard to implement and tune in practice), the vis-exp approach achieves a strong CIDEr-D score of 20.36 with a standard error of 0.16 on our CUB-Justify test set. Note that this CUB-Justify test set is a strict subset of the test set from . These results are better than those achieved with our semi-blind-IS() CUB model, which is based on regular image features from VGG-16 implemented in the “Show, Attend and Tell” framework and uses standard log-likelihood training (Table 1).
However, as mentioned before, the approach of , similar to a baseline speaker S, cannot explicitly model context from a specific distractor class at inference. That is, while the approach reasons (through its training procedure) that given an image of a hummingbird, one should talk about its long beak (a discriminating feature for a hummingbird against all other birds), it cannot reason about a specific distractor class presented at inference. If the distractor class is another hummingbird with a long beak, we would want to avoid talking about the long beak in our justification. On the other hand, if the distractor class were a hummingbird with a shorter beak and there do exist such hummingbirds, then the long beak would be an important feature to mention in a justification. Clearly, this is non-trivial to realize without explicitly modeling context. Hence, intuitively, one would expect that incorporating context from the distractor class should help the justification task.
| Method | CIDEr-D |
| vis-exp | 20.36 ± 0.16 |
| vis-exp-IS (ours) | 21.52 ± 0.17 |
As explained previously, we implement our emitter-suppressor inference (Eq. 3) on top of the vis-exp approach, yielding the vis-exp-IS approach. We sweep over values of the tradeoff parameter on validation and select the best-performing value. Plugging this value in and evaluating on test, our vis-exp-IS approach achieves a CIDEr-D score of 21.52 with a standard error of 0.17 (Table 3). This is an improvement of 1.16 CIDEr-D over vis-exp. Our gains over vis-exp are lower than the gains of the IS(1) approach (reported in Table 1), presumably because the vis-exp approach already captures many of the context-independent discriminative signals (e.g., the long beak of a hummingbird), due to policy-gradient training. Overall though, these results provide further evidence that our emitter-suppressor inference scheme can be adapted to a variety of context-agnostic captioning models to effectively induce context awareness during inference.
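To make the emitter-suppressor inference concrete, here is a minimal greedy-decoding sketch (beam size 1, matching the vis-exp-IS setting). The function names, the toy probability tables, and the exact mixing of the two log-probabilities are illustrative assumptions on our part; the precise objective is Eq. 3 of the main paper.

```python
import math

def emitter_suppressor_greedy(p_target, p_distractor, vocab, max_len=20,
                              lam=0.7, eos="</s>"):
    """Greedy (beam size 1) emitter-suppressor decoding.

    p_target(prefix) and p_distractor(prefix) return dicts mapping each
    word in `vocab` to its conditional probability given the prefix,
    under the context-agnostic captioning model conditioned on the
    target and the distractor concept respectively.
    """
    prefix = []
    for _ in range(max_len):
        pt, pd = p_target(prefix), p_distractor(prefix)
        # Trade off fluency on the target (emitter) against likelihood
        # under the distractor (suppressor); lam = 1 recovers the plain
        # speaker S, smaller lam emphasizes discrimination.
        best = max(vocab,
                   key=lambda w: lam * math.log(pt[w])
                                 + (1 - lam) * math.log(pt[w] / pd[w]))
        prefix.append(best)
        if best == eos:
            break
    return prefix
```

With λ = 1 the suppressor term vanishes and the procedure reduces to the baseline speaker; smaller λ favors words that are likely for the target but unlikely for the distractor.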
We explain some minor modifications to the “Show, Attend and Tell” image captioning model to condition it on the class label in addition to the image, for our experiments on CUB. Note that the explanation in this section applies only to CUB – our COCO models are trained using the neuraltalk2 package (https://github.com/karpathy/neuraltalk2), which implements the “Show and Tell” captioning model from Vinyals et al. Our changes can be understood as three simple modifications aimed at using class information in the model. We first embed the class label (1 out of 200 classes for CUB) into a continuous vector, the class embedding. The three changes, on top of the Show, Attend, and Tell model, are as follows:
Changes to the initial LSTM state: The original Show, Attend, and Tell model uses image annotation vectors (indexed by spatial location), which are the outputs of a convolutional feature map, to compute the initial cell and hidden states of the long short-term memory (LSTM). The image annotation vector is averaged across spatial locations and used to compute the initial state.
We modify this to also use the class embedding to predict the initial state of the LSTM, by concatenating it with the averaged annotation vector.
Changes to the LSTM recurrence: “Show, Attend and Tell” computes a scalar attention weight at each location of the feature map and uses it to compute a context vector at every timestep by attending over the image annotations. It also embeds each input word using an embedding matrix and uses the previous hidden state to compute the LSTM recurrence at every timestep, producing the input gate, forget gate, output gate, and cell input (Eqns. 1, 2, 3 from ).
We use the class embedding in addition to the context vector in Eqn. 1.
The remaining equations for the LSTM recurrence remain the same (Eqn. 2, 3).
Adding class information to the deep output layer: “Show, Attend and Tell” uses a deep output layer to compute the output word distribution at every timestep, incorporating signals from the LSTM hidden state, the context vector, and the input word.
Here, projection matrices map the hidden state and the context vector to the dimensionality of the word embeddings, and an output layer produces scores of the size of the vocabulary. Similar to the previous two adaptations, we use the class embedding in addition to the context vector to predict the output at every timestep.
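The three modifications above can be summarized in a small NumPy sketch. The dimensions are toy values (the real model uses, e.g., 1800-d LSTM states), the weight names are ours, and we feed the mean annotation in place of the learned attention context for brevity – all illustrative assumptions, not the exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): vocab, word-embedding, hidden,
# number of spatial locations, annotation dim, number of CUB classes.
V, E, H, L, A, C = 30, 16, 32, 9, 24, 200

class ClassConditionedCaptioner:
    """Sketch of the three class-conditioning changes to Show, Attend and Tell."""
    def __init__(self):
        w = lambda *shape: rng.normal(0, 0.1, shape)
        self.E_word, self.E_class = w(V, E), w(C, E)
        # (1) initial state predicted from [mean annotation; class embedding]
        self.W_init_h, self.W_init_c = w(A + E, H), w(A + E, H)
        # (2) LSTM step over [word emb; context vector; class emb; prev hidden]
        self.W_lstm = w(E + A + E + H, 4 * H)
        # (3) deep output layer also sees the class embedding
        self.W_out_h, self.W_out_z = w(H, E), w(A, E)
        self.W_out_c, self.W_vocab = w(E, E), w(E, V)

    def init_state(self, annotations, cls):
        a_mean = annotations.mean(axis=0)            # average over locations
        x = np.concatenate([a_mean, self.E_class[cls]])
        return np.tanh(x @ self.W_init_h), np.tanh(x @ self.W_init_c)

    def step(self, word, z, cls, h, c):
        x = np.concatenate([self.E_word[word], z, self.E_class[cls], h])
        i, f, o, g = np.split(x @ self.W_lstm, 4)
        sig = lambda u: 1 / (1 + np.exp(-u))
        c = sig(f) * c + sig(i) * np.tanh(g)
        h = sig(o) * np.tanh(c)
        # deep output layer: word logits from hidden state, context,
        # input word, and the class embedding
        logits = (self.E_word[word] + h @ self.W_out_h + z @ self.W_out_z
                  + self.E_class[cls] @ self.W_out_c) @ self.W_vocab
        return h, c, logits
```

Dropping the annotation and context inputs (zeroing `z` and the mean annotation) yields the class-only blind variant described next.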
Blind models: To implement our class-only blind-IS() model, we need to train a model that uses only the class to produce a sentence. For this, we drop the attention component from the model – equivalent to zeroing out the annotation and context vectors in all the equations above – and run the model using the class embedding alone.
Our CUB captioning network is trained using RMSProp with a batch size of and a learning rate of . We decayed the learning rate on every epochs of cycling through the training data. Our word embedding embeds words into a dimensional vector, and we set the LSTM hidden and cell state sizes to 1800, as in the “Show, Attend, and Tell” model on COCO. The rest of our design choices closely mirror the original work of , based on their implementation available at https://github.com/kelvinxu/arctic-captions. We will make our TensorFlow implementation of “Show, Attend, and Tell” publicly available.
In this section, we expand on our discussion of the choice of metrics for evaluating justification (Sec. 4.1). In addition to the metrics we report in the main paper, namely CIDEr-D and METEOR, we also considered the recently introduced SPICE. The SPICE metric uses a dependency parser to extract a scene-graph representation for the candidate and reference sentences and computes an F-measure between the scene-graph representations. Given that the metric uses a dependency parser as an intermediate step, it is unclear how well it scales to our justification task: some of the sentences from our model might be good justifications but may not be perfectly grammatical. This is because our discriminative justifications emerge from a tradeoff between high-likelihood sentences and discrimination (Eq. 3). Note that this tradeoff is inherent, since we don’t have ground-truth (well-formed) discriminative training data. Thus SPICE can be a problematic metric to use in our context. However, for the sake of completeness, we report SPICE numbers on validation, giving each approach access to its best value, in Table 4.
Although we outperform the baselines using the SPICE metric, in some corner cases we found SPICE scores hard to interpret. For example, for the candidate sentence “this bird has a speckled belly and breast with a short pointy bill.”, and reference sentences “This bird has a yellow eyebrow and grey auriculars” and “This is a bird with yellow supercilium and white throat”, the SPICE score was higher than one would expect (0.30). For reference, an intuitively more related sentence, “this is a grey and yellow bird with a yellow eyebrow.”, obtains a lower SPICE score of 0.28 for the same reference sentences. Further investigation revealed that the relation F-measure, which roughly measures whether two sentences encode the same relations, had a high score in these corner cases. We hypothesize that this inconsistency in scores might be because SPICE uses soft similarity from WordNet when computing the F-measure, which might not be calibrated for this fine-grained domain with specialized words such as supercilium, auriculars, etc. As a result of these observations, we decided not to perform key evaluations with the SPICE metric.
We provide more details on the collection of the CUB-Justify dataset (Sec. 4.1). We presented a target image from a selected target class to the workers, along with a set of six distractor images, all belonging to one other distractor class. The distractor images were chosen at random from the validation and test splits of the CUB dataset we created for justification. Non-expert workers are unlikely to have an explicit visual model of a given distractor category, say Indigo Bunting. Thus the distractor images were shown to convey the concept of the distractor class for justification. As explained in Sec. 4.1, the choice of the distractor classes is based on the hierarchy we induce using the folk names of the birds. Given the target class and the distractor class images, workers were asked to describe the target image in a manner such that the sentence could not be confused with the distractor images. Further, the workers were instructed that someone who reads the sentence should be able to recognize the target image, distinguishing it from the set of distractor images. To get workers to pay attention to all the images (and the intra-class invariances), they were not told explicitly that the distractor images all belonged to one other, unique, distractor class. To help workers identify minute differences between images of birds, and to enable them to write more accurate captions, we also showed them a diagram of the morphology of a bird (Fig. 10), along with a list of some other parts not shown in the diagram, with examples, such as eyeline, rump, eyering, etc. The list of these words and examples, as well as the morphology diagram, was chosen in consultation with an ornithology hobbyist. The workers were also explicitly instructed to describe only the target image, in an accurate manner, mentioning details that are present in the target image, as opposed to providing justifications that mention absent features.
The initial rounds of data collection revealed some interesting corner cases that caused ambiguity. For example, some workers were unsure whether a part of the bird should be called gray or white, since it could appear gray either because the part was white but in shadow, or because it was actually gray. After these initial rounds of feedback, we proceeded to collect the entire dataset.
In this section, we provide more details on how the performance of our adaptation of Andreas and Klein, namely the RS() approach, varies as we sweep over the number of samples we draw from the model, for several values of the tradeoff parameter. We note that the RS() approach approaches the best performance of our IS() approach as we draw 100 samples from the model (Fig. 11). Interestingly, our IS() model is evaluated with a beam size of only 10. Thus our model is able to perform a more efficient search for discriminative sentences than a sampling-and-re-ranking approach like RS(). Note that if we were willing to enumerate all exponentially many sentences, we would always find the optimal solution, at worst-case exponential cost; most approximate inference techniques in such a setting offer a time-vs-optimality tradeoff. Based on this empirical evidence, our approach fits this tradeoff better than the RS() approach.
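For concreteness, the sampling-and-re-ranking procedure can be sketched as below. The helper names and the λ-weighted combination of speaker score and listener log-likelihood ratio are our illustrative assumptions; the exact ranking objective is Eqn. 1 of the main paper.

```python
def reasoning_speaker_rerank(sample_caption, caption_logprob, target, distractor,
                             num_samples=100, lam=0.3):
    """RS(lambda)-style baseline: draw samples from the context-agnostic
    speaker, then re-rank by a convex combination of speaker fluency and
    the listener's log-likelihood ratio (the Eq. 2-style listener).

    sample_caption(image) draws one caption from the speaker;
    caption_logprob(caption, image) scores a caption under the speaker.
    """
    def score(cap):
        lp_t = caption_logprob(cap, target)      # speaker: log p(caption | target)
        lp_d = caption_logprob(cap, distractor)  # log p(caption | distractor)
        # lam = 0 ranks by the listener alone; lam = 1 by the speaker alone.
        return lam * lp_t + (1 - lam) * (lp_t - lp_d)
    samples = [sample_caption(target) for _ in range(num_samples)]
    return max(samples, key=score)
```

Note that the search effort here is spent on independent samples, whereas emitter-suppressor beam search steers every decoding step toward discriminative words, which is one intuition for why it needs a smaller search budget.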
In the main paper we showed comparisons to the RS() baseline with the same listener model as our approach, which uses the log-likelihood ratio (Eq. 2) to assess discriminativeness (and thus needs no further training). We did this as an apples-to-apples comparison, to systematically study the benefit of joint inference over the speaker and listener as opposed to sampling and re-ranking.
For completeness, we report results here using a trained listener following the architecture choices of  for the justification task. We construct a reasoning speaker RS() with this trained listener (RS()-TL) and perform sampling (sample size 10) and re-ranking, based on log-odds from the listener plugged into Eqn. 1 of the main paper. As reported in the main paper, RS()-TL reaches a best CIDEr-D of 16.2 ± 0.3 on CUB-Justify validation, which is lower than our approach (semi-blind-IS() reaches 18.4 ± 0.2) – this illustrates the benefit of joint inference.
For comparison, we also evaluate against two other reasoning speaker approaches with different rankers (at the setting which uses only the listener, as opposed to listener+speaker, to directly compare listeners for ranking): 1) the introspector (same as RS(0) in the main paper), and 2) a chance ranker (RS(0)-R), which randomly scores a class for a sentence. We find that the trained listener (14.7 ± 0.2) does marginally better than RS(0) (13.9 ± 0.2), which is in turn better than RS(0)-R (13.4 ± 0.2). Thus a trained listener does have a marginal impact on performance, but the larger factor affecting performance is sampling, which our semi-blind-IS() approach is able to do more effectively.