People often refer to entities in an image in terms of their relationships with other entities. For example, "the black cat sitting under the table" refers to both a "black cat" entity and its relationship with another "table" entity. Understanding these relationships is essential for interpreting and grounding such natural language expressions. Most prior work focuses on either grounding entire referential expressions holistically to one region, or localizing relationships based on a fixed set of categories. In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene. We call this approach Compositional Modular Networks (CMNs): a novel architecture that learns linguistic analysis and visual inference end-to-end. Our approach is built around two types of neural modules that inspect local regions and pairwise interactions between regions. We evaluate CMNs on multiple referential expression datasets, outperforming state-of-the-art approaches on all tasks.
Great progress has been made on object detection, the task of localizing visual entities belonging to a pre-defined set of categories [8, 24, 23, 6, 17]. But the more general and challenging task of localizing entities based on arbitrary natural language expressions remains far from solved. This task, sometimes known as grounding or referential expression comprehension, has been explored in recent work [20, 11, 25]. Given an image and a natural language expression referring to a visual entity, such as “the young man wearing green shirt and riding a black bicycle”, these approaches localize the image region corresponding to the entity that the expression refers to with a bounding box.
Referential expressions often describe relationships between multiple entities in an image. In Figure 1, for example, the expression “the woman holding a grey umbrella” describes a woman entity that participates in a holding relationship with a grey umbrella entity. Because there are multiple women in the image, resolving this referential expression requires both finding a bounding box that contains a person, and ensuring that this bounding box relates in the right way to other objects in the scene. Previous work on grounding referential expressions either (1) treats referential expressions holistically, thus failing to model explicit correspondence between textual components and visual entities in the image [20, 11, 25, 30, 21], or else (2) relies on a fixed set of entity and relationship categories defined a priori.
In this paper, we present a joint approach that explicitly models the compositional linguistic structure of referential expressions and their groundings, but which nonetheless supports interpretation of arbitrary language. We focus on referential expressions involving inter-object relationships that can be represented as a subject entity, a relationship and an object entity. We propose Compositional Modular Networks (CMNs), an end-to-end trained model that learns language representation and image region localization jointly, as shown in Figure 1. Our model differentiably parses the referential expression into a subject, relationship and object with three soft attention maps, and aligns the extracted textual representations with image regions using a modular neural architecture. There are two types of modules in our model: one for localizing specific textual components by outputting unary scores over regions for that component, and one for determining the relationship between a pair of bounding boxes by outputting pairwise scores over region-region pairs. We evaluate our model on multiple datasets containing referential expressions, and show that our model outperforms both natural baselines and previous work.
Grounding referential expressions. The problem of grounding referential expressions can be naturally formulated as a retrieval problem over image regions [20, 11, 25, 7, 30, 21]. First, a set of candidate regions is extracted (e.g. via object proposal methods like [28, 4, 13, 33]). Next, each candidate region is scored by a model with respect to the query expression, returning the highest scoring candidate as the grounding result. In [20, 11], each region is scored based on its local visual features and some global contextual features from the whole image. However, local visual features and global contextual features from the whole image are often insufficient to determine whether a region matches an expression, as relationships with other regions in the image must also be considered. Two recent methods [30, 21] go beyond local visual features in a single region, and consider multiple regions at the same time. The first adds contextual features extracted from other regions in the image, and the second proposes a model that grounds a referential expression into a pair of regions. All these methods represent language holistically using a recurrent neural network: either generatively, by predicting a distribution over referential expressions [20, 11, 30, 21], or discriminatively, by encoding expressions into a vector representation [25, 7]. This makes it difficult to learn explicit correspondences between the components in the textual expression and entities in the image. In this work, we learn to parse the language expression into textual components instead of treating it as a whole, and align these components with image regions end-to-end.
Handling inter-object relationships. Recent work trains detectors based on RCNN and uses a linguistic prior to detect visual relationships. However, this work relies on fixed, predefined categories for subjects, relations, and objects, treating entities like “bicycle” and relationships like “riding” as discrete classes. Instead of building upon a fixed inventory of classes, our model handles relationships specified by arbitrary natural language phrases, and jointly learns expression parsing and visual entity localization. Although other work also learns language parsing and perception, it is directly based on logic (λ-calculus) and requires additional classifiers trained for each predicate class.
Compositional structure with modules. Neural Module Networks address visual question answering by decomposing the questions into textual components and dynamically assembling a specific network architecture for the question from a few network modules based on the textual components. However, this method relies on an external language parser for textual analysis instead of end-to-end learned language representation, and is not directly applicable to the task of grounding referential expressions into bounding boxes, since it does not explicitly output bounding boxes as results. More recent work improves over Neural Module Networks by learning to re-rank parsing outputs from the external parser, but it is still not end-to-end learned since the parser is fixed and not optimized for the task. Inspired by Neural Module Networks, our model also uses a modular structure, but learns the language representation end-to-end from words.
We propose Compositional Modular Networks (CMNs) to localize visual entities described by a query referential expression. Our model is compositional in the sense that it localizes a referential expression by grounding the components in the expression and exploiting their interactions, in accordance with the principle of compositionality of natural language: the meaning of a complex expression is determined by the meanings of its constituent expressions and the rules used to combine them. Our model works in a retrieval setting: given an image $I$, a referential expression $Q$ as query and a set of candidate region bounding boxes $B = \{b_i\}$ for the image $I$ (e.g. extracted through object proposal methods), our model outputs a score for each bounding box $b_i$, and returns the bounding box with the highest score as grounding (localization) result. Unlike state-of-the-art methods [25, 7], the score for each region bounding box $b_i \in B$ is not predicted only from the local feature of $b_i$, but also based on other regions in the image. In our model, we focus on the relationships in referential expressions that can be represented as a 3-component triplet (subject, relationship, object), and learn to parse the expressions into these components with attention. For example, “a young man wearing a blue shirt” can be parsed as the triplet (a young man, wearing, a blue shirt). The score of a region is determined by simultaneously looking at whether it matches the description of the subject entity and whether it matches the relationship with another interacting object entity mentioned in the expression.
Our model handles such inter-object relationships by looking at pairs of regions $(b_i, b_j)$. For referential expressions like “the red apple on top of the bookshelf”, we want to find a region pair $(b_i, b_j)$ such that $b_i$ matches the subject entity “red apple”, $b_j$ matches the object entity “bookshelf”, and the configuration of $(b_i, b_j)$ matches the relationship “on top of”. To achieve this goal, our model is based on a compositional modular structure, composed of two modules assembled in a pipeline for different sub-tasks: one localization module $f_{loc}(\cdot, q_{loc}; \Theta_{loc})$ for deciding whether a region matches the subject or object in the expression, where $q_{loc}$ is the textual vector representation of the subject component “red apple” or the object component “bookshelf”, and one relationship module $f_{rel}(\cdot, \cdot, q_{rel}; \Theta_{rel})$ for deciding whether a pair of regions matches the relationship described in the expression represented by $q_{rel}$, the textual vector representation of the relationship “on top of”. The representations $q_{subj}$, $q_{rel}$ and $q_{obj}$ are learned jointly in our model in Sec. 3.1.
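We define the pairwise score $s_{pair}(b_i, b_j)$ over a pair of image regions $(b_i, b_j)$ matching an input referential expression $Q$ as the sum of three components:
$$s_{pair}(b_i, b_j) = f_{loc}(b_i, q_{subj}; \Theta_{loc}) + f_{loc}(b_j, q_{obj}; \Theta_{loc}) + f_{rel}(b_i, b_j, q_{rel}; \Theta_{rel}) \quad (1)$$
where $q_{subj}$, $q_{rel}$ and $q_{obj}$ are vector representations of subject, relationship and object, respectively.
For inference, we define the final subject unary score $s_{subj}(b_i)$ of a bounding box $b_i$ corresponding to the subject (e.g. “the red apple” in “the red apple on top of the bookshelf”) as the score of the best possible pair $(b_i, b_j)$ that matches the entire expression:
$$s_{subj}(b_i) = \max_{b_j \in B} s_{pair}(b_i, b_j) \quad (2)$$
The subject is ultimately grounded (localized) to the highest scoring region as
$$b^*_{subj} = \arg\max_{b_i \in B} s_{subj}(b_i) \quad (3)$$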
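The inference rule described above (sum three scores per region pair, maximize over the object region, then take the best subject region) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function name `ground_subject` and the toy score values are assumptions, and the unary and pairwise scores are taken as given arrays rather than computed by the modules.

```python
import numpy as np

def ground_subject(loc_subj, loc_obj, rel):
    """Combine module outputs into a grounding decision.

    loc_subj: (N,) unary scores of each region against the subject phrase.
    loc_obj:  (N,) unary scores of each region against the object phrase.
    rel:      (N, N) scores, rel[i, j] for region i (subject) and j (object).
    """
    # Pairwise score: subject term + object term + relationship term.
    pair = loc_subj[:, None] + loc_obj[None, :] + rel
    # Subject unary score: best pair containing region i as the subject.
    subj = pair.max(axis=1)
    best_subj = int(subj.argmax())
    # The maximizing object region comes for free from the same matrix.
    best_obj = int(pair[best_subj].argmax())
    return pair, subj, best_subj, best_obj

# Toy scores (hypothetical): regions 0 and 1 both look like the subject,
# region 2 looks like the object, and only the pair (1, 2) matches the relation.
loc_subj = np.array([2.0, 2.0, -1.0])
loc_obj = np.array([-1.0, -1.0, 3.0])
rel = np.zeros((3, 3))
rel[1, 2] = 1.5
pair, subj, best_subj, best_obj = ground_subject(loc_subj, loc_obj, rel)
```

In the toy example the unary subject scores alone cannot separate regions 0 and 1; the relationship term breaks the tie in favor of region 1.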
Given a referential expression like “the tall woman carrying a red bag”, how can we decide which substrings correspond to the subject, the relationship, and the object, and extract three vector representations $q_{subj}$, $q_{rel}$ and $q_{obj}$ corresponding to these three components? One possible approach is to use an external language parser to parse the referential expression into the triplet format (subject, relationship, object) and then process each component with an encoder (e.g. a recurrent neural network) to extract $q_{subj}$, $q_{rel}$ and $q_{obj}$. However, the formal representations of language produced by syntactic parsers do not always correspond to intuitive visual representations. As a simple example, “the apple on top of the bookshelf” is analyzed as having a subject phrase “the apple”, a relationship “on”, and an object phrase “top of the bookshelf”, when in fact the visually salient objects are simply the apple and the bookshelf, while the complete expression “on top of” describes the relationship between them.
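Therefore, in this work we learn to decompose the input expression $Q$ into the above 3 components, and generate vector representations $q_{subj}$, $q_{rel}$ and $q_{obj}$ from $Q$ through a soft attention mechanism over the word sequence, as shown in Figure 2 (a). For a referential expression $Q$ that is a sequence of $T$ words $\{w_t\}_{t=1}^{T}$, we first embed each word $w_t$ to a vector $e_t$ using GloVe, and then scan through the word embedding sequence $\{e_t\}_{t=1}^{T}$ with a 2-layer bi-directional LSTM network. The first layer takes as input the sequence $\{e_t\}$ and outputs a forward hidden state $h_t^{(1,fw)}$ and a backward hidden state $h_t^{(1,bw)}$ at each time step, which are concatenated into $h_t^{(1)}$. The second layer then takes the first layer’s output sequence $\{h_t^{(1)}\}$ as input and outputs forward and backward hidden states $h_t^{(2,fw)}$ and $h_t^{(2,bw)}$ at each time step. All the hidden states in the first layer and second layer are concatenated into a single vector $h_t$:
$$h_t = \left[h_t^{(1,fw)}; h_t^{(1,bw)}; h_t^{(2,fw)}; h_t^{(2,bw)}\right] \quad (4)$$
The concatenated state $h_t$ contains information from word $w_t$ itself and also context from words before and after $w_t$. Then the attention weights $a_{t,subj}$, $a_{t,rel}$ and $a_{t,obj}$ for subject, relationship and object over each word $w_t$ are obtained by three linear predictions over $h_t$ followed by a softmax as
$$a_{t,subj} = \frac{\exp(\beta_{subj}^\top h_t)}{\sum_{\tau=1}^{T} \exp(\beta_{subj}^\top h_\tau)} \quad (5)$$
$$a_{t,rel} = \frac{\exp(\beta_{rel}^\top h_t)}{\sum_{\tau=1}^{T} \exp(\beta_{rel}^\top h_\tau)} \quad (6)$$
$$a_{t,obj} = \frac{\exp(\beta_{obj}^\top h_t)}{\sum_{\tau=1}^{T} \exp(\beta_{obj}^\top h_\tau)} \quad (7)$$
and the language representations of the subject $q_{subj}$, relationship $q_{rel}$ and object $q_{obj}$ are extracted as weighted averages of word embedding vectors $\{e_t\}$ with attention weights, as follows:
$$q_{subj} = \sum_{t=1}^{T} a_{t,subj} e_t \quad (8)$$
$$q_{rel} = \sum_{t=1}^{T} a_{t,rel} e_t \quad (9)$$
$$q_{obj} = \sum_{t=1}^{T} a_{t,obj} e_t \quad (10)$$
In our implementation, both the forward and the backward LSTM in each layer of the bi-directional LSTM network have 1000-dimensional hidden states, so the final $h_t$ is 4000-dimensional. During training, dropout is added on top of $h_t$ as regularization.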
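To make the attention step concrete, the sketch below computes the three attention distributions and the resulting phrase vectors from precomputed LSTM states and word embeddings. It is a minimal NumPy illustration, not the released implementation: the function name `attend`, the weight dictionary `betas`, and the small dimensions are assumptions chosen for readability (the text uses 4000-dimensional states), and the bi-directional LSTM itself is replaced by random states.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def attend(h, e, betas):
    """Soft-parse an expression into phrase vectors.

    h:     (T, D_h) concatenated LSTM states, one row per word.
    e:     (T, D_e) word embeddings, one row per word.
    betas: dict mapping a component name to a (D_h,) weight vector.
    Returns a dict: name -> (attention weights (T,), phrase vector (D_e,)).
    """
    result = {}
    for name, beta in betas.items():
        a = softmax(h @ beta)      # one linear prediction per word, then softmax
        result[name] = (a, a @ e)  # attention-weighted average of embeddings
    return result

rng = np.random.default_rng(0)
T, D_h, D_e = 5, 8, 4              # toy sizes, far smaller than in the text
h = rng.normal(size=(T, D_h))      # stand-in for the bi-LSTM output
e = rng.normal(size=(T, D_e))      # stand-in for GloVe embeddings
betas = {k: rng.normal(size=D_h) for k in ("subj", "rel", "obj")}
parsed = attend(h, e, betas)
uniform = attend(np.zeros((T, D_h)), e, betas)
```

With all-zero states the attention is uniform, so each phrase vector reduces to the mean word embedding; training has to move the weights away from this degenerate parse.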
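As shown in Figure 2 (b), the localization module outputs a score $s_{loc} = f_{loc}(b, q_{loc}; \Theta_{loc})$ representing how likely a region bounding box $b$ matches $q_{loc}$, which is either the subject textual vector $q_{subj}$ in Eqn. 8 or the object textual vector $q_{obj}$ in Eqn. 10.
This module takes the local visual feature $x_v$ and spatial feature $x_s$ of image region $b$. We extract the visual feature $x_v$ from image region $b$ using a convolutional neural network, and extract a 5-dimensional spatial feature $x_s = \left[\frac{x_{min}}{W_I}, \frac{y_{min}}{H_I}, \frac{x_{max}}{W_I}, \frac{y_{max}}{H_I}, \frac{S_b}{S_I}\right]$ from $b$ following prior work, where $[x_{min}, y_{min}, x_{max}, y_{max}]$ and $S_b$ are the bounding box coordinates and area of $b$, and $W_I$, $H_I$ and $S_I$ are the width, height and area of the image $I$. Then, $x_v$ and $x_s$ are concatenated into a vector $x_{v,s} = [x_v; x_s]$ as the representation of region $b$.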
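As a small worked example of the 5-dimensional spatial feature, the sketch below normalizes the box corners by the image width and height and the box area by the image area. The function name `spatial_feature`, the `(x_min, y_min, x_max, y_max)` pixel ordering, and the ordering of the five entries are assumptions made for illustration.

```python
import numpy as np

def spatial_feature(box, img_w, img_h):
    """5-d spatial feature of a box given as (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    box_area = (x_max - x_min) * (y_max - y_min)
    return np.array([
        x_min / img_w,                  # left edge, relative to image width
        y_min / img_h,                  # top edge, relative to image height
        x_max / img_w,                  # right edge
        y_max / img_h,                  # bottom edge
        box_area / (img_w * img_h),     # fraction of the image covered
    ])

# A 50x100 box in the top-left corner of a 100x200 image.
x_s = spatial_feature((0, 0, 50, 100), img_w=100, img_h=200)
```

All five entries lie in [0, 1] for boxes inside the image, which keeps the spatial feature on a comparable scale across images of different sizes.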
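Since element-wise multiplication is shown to be a powerful way to combine representations from different modalities, we adopt it here to obtain a joint vision and language representation. In our implementation, $x_{v,s}$ is first embedded to a new vector $\tilde{x}_{v,s}$ that has the same dimension as $q_{loc}$ (which is either $q_{subj}$ in Eqn. 8 or $q_{obj}$ in Eqn. 10) through a linear transform, and then element-wise multiplied with $q_{loc}$ to obtain a vector $z_{loc}$, which is L2-normalized into $\hat{z}_{loc}$ to obtain a more robust representation, as follows:
$$\tilde{x}_{v,s} = W_{v,s} x_{v,s} + b_{v,s} \quad (11)$$
$$z_{loc} = \tilde{x}_{v,s} \odot q_{loc} \quad (12)$$
$$\hat{z}_{loc} = z_{loc} / \|z_{loc}\|_2 \quad (13)$$
where $\odot$ is element-wise multiplication between two vectors. Then the score $s_{loc}$ is predicted linearly from $\hat{z}_{loc}$ as
$$s_{loc} = w_{loc}^\top \hat{z}_{loc} + b_{loc} \quad (14)$$
The parameters in $\Theta_{loc}$ are $(W_{v,s}, b_{v,s}, w_{loc}, b_{loc})$.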
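The localization module described above (linear embedding of the region feature, element-wise product with the phrase vector, L2 normalization, linear scoring) might be sketched as follows. The function and argument names, the random parameters, and the `eps` guard against a zero norm are illustrative assumptions; the feature sizes are toy values.

```python
import numpy as np

def localization_score(x_vs, q, W, b, w, b_out, eps=1e-12):
    """Score how well one region matches one phrase vector.

    x_vs: (D_x,) concatenated visual + spatial feature of the region.
    q:    (D_q,) phrase vector (subject or object).
    W, b: linear map from D_x to D_q.
    w, b_out: final linear scoring layer on the normalized joint vector.
    """
    x_tilde = W @ x_vs + b                   # embed region feature into phrase space
    z = x_tilde * q                          # element-wise multimodal fusion
    z_hat = z / max(np.linalg.norm(z), eps)  # L2 normalization (eps avoids 0/0)
    return float(w @ z_hat + b_out)

rng = np.random.default_rng(0)
D_x, D_q = 12, 6                             # toy sizes (hypothetical)
x_vs = rng.normal(size=D_x)
q = rng.normal(size=D_q)
W, b = rng.normal(size=(D_q, D_x)), rng.normal(size=D_q)
w, b_out = rng.normal(size=D_q), 0.3
score = localization_score(x_vs, q, W, b, w, b_out)
score_scaled = localization_score(x_vs, 10.0 * q, W, b, w, b_out)
```

One consequence of the normalization step is visible here: rescaling the phrase vector by a positive constant leaves the score unchanged, so only the direction of the fused vector matters.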
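As shown in Figure 2 (c), the relationship module outputs a score $s_{rel} = f_{rel}(b_1, b_2, q_{rel}; \Theta_{rel})$ representing how likely a pair of region bounding boxes $(b_1, b_2)$ matches $q_{rel}$, the representation of the relationship in the expression.
In our implementation, we use the spatial features $x_{s1}$ and $x_{s2}$ of the two regions $b_1$ and $b_2$, extracted in the same way as in the localization module (we empirically find that adding visual features of $b_1$ and $b_2$ leads to no noticeable performance boost while slowing training significantly). Then $x_{s1}$ and $x_{s2}$ are concatenated as $x_{s1,s2} = [x_{s1}; x_{s2}]$, and then processed in a similar way as in the localization module to obtain $s_{rel}$, as shown below:
$$\tilde{x}_{s1,s2} = W_{s1,s2} x_{s1,s2} + b_{s1,s2} \quad (15)$$
$$z_{rel} = \tilde{x}_{s1,s2} \odot q_{rel} \quad (16)$$
$$\hat{z}_{rel} = z_{rel} / \|z_{rel}\|_2 \quad (17)$$
$$s_{rel} = w_{rel}^\top \hat{z}_{rel} + b_{rel} \quad (18)$$
The parameters in $\Theta_{rel}$ are $(W_{s1,s2}, b_{s1,s2}, w_{rel}, b_{rel})$.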
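Because the relationship module only uses spatial features, it is cheap to evaluate for every ordered pair of regions at once. The sketch below builds the concatenated pair features for all pairs by broadcasting and returns a full score matrix. The name `relationship_scores`, the random parameters and the toy sizes are assumptions for illustration.

```python
import numpy as np

def relationship_scores(x_s, q_rel, W, b, w, b_out, eps=1e-12):
    """Score every ordered region pair against the relationship phrase.

    x_s:   (N, 5) spatial features of the N candidate regions.
    q_rel: (D,) relationship phrase vector.
    W, b:  linear map from the 10-d pair feature to D dimensions.
    Returns an (N, N) matrix with entry [i, j] for subject i and object j.
    """
    n = x_s.shape[0]
    first = np.repeat(x_s[:, None, :], n, axis=1)    # [i, j] -> x_s[i]
    second = np.repeat(x_s[None, :, :], n, axis=0)   # [i, j] -> x_s[j]
    pairs = np.concatenate([first, second], axis=-1) # (N, N, 10)
    x_tilde = pairs @ W.T + b                        # (N, N, D)
    z = x_tilde * q_rel                              # element-wise fusion
    norm = np.linalg.norm(z, axis=-1, keepdims=True)
    z_hat = z / np.maximum(norm, eps)                # L2 normalization per pair
    return z_hat @ w + b_out                         # (N, N)

rng = np.random.default_rng(0)
N, D = 3, 6                                          # toy sizes (hypothetical)
x_s = rng.uniform(size=(N, 5))
q_rel = rng.normal(size=D)
W, b = rng.normal(size=(D, 10)), rng.normal(size=D)
w, b_out = rng.normal(size=D), 0.1
scores = relationship_scores(x_s, q_rel, W, b, w, b_out)
```

The matrix is generally not symmetric, since "A on top of B" and "B on top of A" correspond to different concatenation orders.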
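During training, for an image $I$, a referential expression $Q$ and a set of candidate regions $B$ extracted from $I$, if the ground-truth regions $b_{subj\_gt}$ of the subject entity and $b_{obj\_gt}$ of the object entity are both available, then we can optimize the pairwise score $s_{pair}$ in Eqn. 1 with strong supervision using the softmax loss $Loss_{strong}$:
$$Loss_{strong} = -\log\left(\frac{\exp\left(s_{pair}(b_{subj\_gt}, b_{obj\_gt})\right)}{\sum_{(b_i, b_j) \in B \times B} \exp\left(s_{pair}(b_i, b_j)\right)}\right) \quad (19)$$
However, it is often hard to obtain ground-truth regions for both the subject entity and the object entity. For referential expressions like “a red vase on top of the table”, often there is only a ground-truth bounding box annotation $b_{subj\_gt}$ for the subject (vase) in the expression, but no bounding box annotation for the object (table), so one cannot directly optimize the pairwise score $s_{pair}(b_{subj\_gt}, b_{obj\_gt})$. To address this issue, we treat the object region as a latent variable, and optimize the unary score $s_{subj}$ in Eqn. 2. Since $s_{subj}(b_i)$ is obtained by maximizing over all possible regions $b_j \in B$ in $s_{pair}(b_i, b_j)$, this can be regarded as a weakly supervised Multiple Instance Learning (MIL) approach. The unary score $s_{subj}$ can be optimized with weak supervision using the softmax loss $Loss_{weak}$:
$$Loss_{weak} = -\log\left(\frac{\exp\left(s_{subj}(b_{subj\_gt})\right)}{\sum_{b_i \in B} \exp\left(s_{subj}(b_i)\right)}\right) \quad (20)$$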
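The two supervision settings differ only in what the softmax is taken over: all region pairs in the strong case, and the max-pooled subject scores in the weak case. A minimal NumPy sketch, assuming the pairwise score matrix is already available (the function names are hypothetical):

```python
import numpy as np

def logsumexp(x):
    # Numerically stable log(sum(exp(x))) over all entries of x.
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def strong_loss(pair, i_gt, j_gt):
    """Softmax loss over all (subject, object) pairs; needs both ground truths."""
    return logsumexp(pair) - pair[i_gt, j_gt]

def weak_loss(pair, i_gt):
    """Softmax loss over subject scores; the object region is latent (max-pooled)."""
    subj = pair.max(axis=1)
    return logsumexp(subj) - subj[i_gt]

# With uninformative (all-zero) scores, the losses are the log of the number
# of alternatives: 4 pairs in the strong case, 2 subjects in the weak case.
flat = np.zeros((2, 2))
loss_strong = strong_loss(flat, 0, 1)
loss_weak = weak_loss(flat, 0)
```

In the weak setting, gradients flow only through the object region that currently maximizes the pairwise score for each subject candidate, which is what makes the object grounding a latent choice.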
The whole system is trained end-to-end with backpropagation. In our experiments, we train for 300000 iterations, with 0.95 momentum and an initial learning rate of 0.005, multiplied by 0.1 after every 120000 iterations. Each batch contains one image with all referential expressions annotated over that image. Parameters in the localization module, the relationship module and the language representation in our model are initialized randomly with Xavier initializer. Our model is implemented using TensorFlow and we plan to release our code and data to facilitate reproduction of our results.
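The step learning-rate schedule stated above can be written as a one-line function. This is a sketch under the assumption that iterations are counted from zero and that the decay takes effect at exactly iteration 120000 and 240000; the function name is hypothetical.

```python
def learning_rate(iteration, base_lr=0.005, decay=0.1, step=120000):
    """Step schedule: multiply the rate by `decay` after every `step` iterations."""
    return base_lr * decay ** (iteration // step)

# Rates at the start, after the first drop, and at the end of a 300000-iteration run.
schedule = [learning_rate(it) for it in (0, 120000, 299999)]
```

Over the 300000 training iterations this gives three plateaus: 0.005, then 0.0005, then 0.00005.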
We first evaluate our model on a synthetic dataset to verify its ability to handle inter-object relationships in referential expressions. Next we apply our method to real images and expressions in the Visual Genome dataset and the Google-Ref dataset. Since the task of answering pointing questions in visual question answering is similar to grounding referential expressions, we also evaluate our model on the pointing questions in the Visual-7W dataset.
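|Method||Accuracy|
|baseline (loc module)||46.27%|
|our full model||99.99%|
Table 1: Accuracy of our model and the baseline on the synthetic shape dataset.
Figure 3: An example from the synthetic shape dataset, with expression=“the green square right of a red circle”.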
Inspired by prior work, we first perform a simulation experiment on a synthetic shape dataset. The dataset consists of 30000 images with simple circles, squares and triangles of different sizes and colors on a 5 by 5 grid, and referential expressions constructed using a template of the form [subj] [relationship] [obj], where [subj] and [obj] involve both shape classes and attributes and [relationship] is a spatial relationship such as “above”. The task is to localize the corresponding shape region described by the expression on the 5 by 5 grid. Figure 3 (a) shows an example in this dataset with the synthetic expression “the green square right of a red circle”. In the synthesizing procedure, we make sure that the shape region being referred to cannot be inferred simply from [subj], as there will be multiple matching regions, and the relationship with another region described by [obj] has to be taken into consideration.
On this dataset, we train our model with weak supervision by Eqn. 20 using the ground-truth region $b_{subj\_gt}$ of the subject shape described in the expression. Here the candidate region set $B$ consists of the 25 possible locations on the 5 by 5 grid, and visual features are extracted from the corresponding cropped image region with a VGG-16 network pretrained on ImageNet classification. As a comparison, we also train a baseline model using only the localization module, with a softmax loss on its output $s_{loc}$ in Eqn. 14 over all 25 locations on the grid, and a language representation obtained by scanning through the word embedding sequence with a single LSTM network and taking the hidden state at the last time step, the same as in [25, 10]. This baseline method resembles the supervised version of GroundeR, and the main difference between this baseline and our model is that the baseline only looks at a region’s appearance and spatial property but ignores pairwise relationships with other regions.
We evaluate with the accuracy on whether the predicted subject region $b^*_{subj}$ matches the ground-truth region $b_{subj\_gt}$. Table 1 shows the results on this dataset, where our model trained with weak supervision (the same as the supervision given to the baseline) achieves nearly perfect accuracy, significantly outperforming the baseline using a localization module only. Figure 3 shows an example, where the baseline can localize green squares but fails to distinguish the exact green square right of a red circle, while our model successfully finds the subject-object pair, although it has never seen the ground-truth location for the object entity during training.
We also evaluate our method on the Visual Genome dataset, which contains relationship expressions annotated over pairs of objects, such as “computer on top of table” and “person wearing shirt”.
On the relationship annotations in Visual Genome, given an image and an expression like “man wearing hat”, we evaluate our method in two test scenarios: retrieving the subject region (“man”) and retrieving the subject-object pair (both “man” and “hat”). In our experiment, we take the bounding boxes of all the annotated entities in each image (around 35 per image) as the candidate region set $B$ at both training and test time, and extract visual features for each region from the fc7 output of a Faster-RCNN VGG-16 network pretrained on the MSCOCO detection dataset. The input images are first forwarded through the convolutional layers of the network, and the features of each image region are extracted by ROI-pooling over the convolutional feature map, followed by subsequent fully connected layers. We use the same training, validation and test split as in prior work.
Since there are ground-truth annotations for both subject region and object region in this dataset, we experiment with two training supervision settings: (1) weak supervision by only providing the ground-truth region of the subject entity at training time (subject-GT in Table 2) and optimizing unary subject score with Eqn. 20 and (2) strong supervision by providing the ground-truth region pair of both subject and object entities at training time (subject-object-GT in Table 2) and optimizing pairwise score with Eqn. 19.
Similar to the experiment on the synthetic dataset in Sec. 4.1, we also train a baseline model that only looks at local appearance and spatial properties but ignores pairwise relationships. For the first evaluation scenario of retrieving the subject region, we train a baseline model using a localization module only by optimizing its output for the ground-truth subject region with softmax loss (the same training supervision as subject-GT). For the second scenario of retrieving the subject-object pair, we train two such baseline models, optimized with subject ground-truth and object ground-truth respectively, to localize the subject region and the object region separately; at test time, we combine the predicted subject region and the predicted object region from the two models into the subject-object pair (same training supervision as subject-object-GT).
|Method||training supervision||P@1-subj||P@1-pair|
|our full model||subject-GT||43.81%||26.56%|
|our full model||subject-object-GT||44.24%||28.52%|
[Figure: Visual Genome examples showing ground-truth, our prediction and attention weights for the expressions “tennis player wears shorts”, “building behind bus”, “car has tail light”, “window on front of building”, “business name on sign”, “board on top of store”, “wine bottle next to glasses”, “chairs around table”, “marker on top of ledge” and “chair next to table”.]
We evaluate with top-1 precision (P@1), which is the percentage of test instances where the top scoring prediction matches the ground-truth in each image (P@1-subj for predicted subject regions matching subject ground-truth in the first scenario, and P@1-pair for predicted subject and object regions both matching the ground-truth in the second scenario). The results are summarized in Table 2, where it can be seen that our full model outperforms the baseline using only localization modules in both evaluation scenarios. Note that in the second evaluation scenario of retrieving subject-object pairs, our weakly supervised model still outperforms the baseline trained with strong supervision.
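The two metrics can be sketched as below, assuming each test instance comes with a score array over the candidate regions and that ground truth is given as candidate indices (which holds when the candidates are the annotated boxes). The function names and the toy scores are hypothetical.

```python
import numpy as np

def precision_at_1_subj(subj_scores, gt_subj):
    """Fraction of instances whose top-scoring subject region is the ground truth.

    subj_scores: list of (N_k,) arrays, one per test instance.
    gt_subj:     list of ground-truth subject indices.
    """
    hits = [int(np.argmax(s)) == g for s, g in zip(subj_scores, gt_subj)]
    return float(np.mean(hits))

def precision_at_1_pair(pair_scores, gt_pairs):
    """Fraction of instances whose top-scoring (subject, object) pair is correct.

    pair_scores: list of (N_k, N_k) arrays, one per test instance.
    gt_pairs:    list of (subject index, object index) tuples.
    """
    hits = []
    for s, gt in zip(pair_scores, gt_pairs):
        i, j = np.unravel_index(np.argmax(s), s.shape)
        hits.append((int(i), int(j)) == tuple(gt))
    return float(np.mean(hits))

# Two toy subject instances (one hit, one miss) and one toy pair instance (hit).
p_subj = precision_at_1_subj([np.array([0.1, 0.9]), np.array([0.8, 0.2])], [1, 1])
p_pair = precision_at_1_pair([np.array([[0.0, 1.0], [2.0, 0.0]])], [(1, 0)])
```

P@1-pair is the stricter metric, since both indices of the predicted pair must be correct for an instance to count as a hit.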
We apply our model to the Google-Ref dataset, a benchmark dataset for grounding referential expressions. As this dataset does not explicitly contain subject-object pair annotations for the referential expressions, we train our model with weak supervision (Eqn. 20) by optimizing the subject score $s_{subj}$ using the expression-level region ground-truth. The candidate bounding box set $B$ at both training and test time consists of all the annotated entities in the image (the “Ground-Truth” evaluation setting in prior work). As in Sec. 4.2, the fc7 output of a MSCOCO-pretrained Faster-RCNN VGG-16 network is used for visual feature extraction. Similar to Sec. 4.1, we also train a GroundeR-like baseline model with a localization module which looks only at a region’s local features.
|Method||P@1|
|Mao et al.||60.7%|
|Yu et al.||64.0%|
|Nagaraja et al.||68.4%|
|baseline (loc module)||66.5%|
|our model (w/ external parser)||53.5%|
|our full model||69.3%|
In addition, instead of learning a linguistic analysis end-to-end as in Sec. 3.1, we also experiment with parsing the expression using the Stanford Parser [31, 19]. An expression is parsed into subject, relationship and object components according to the constituency tree, and the components are encoded into vectors $q_{subj}$, $q_{rel}$ and $q_{obj}$ using three separate LSTM encoders, similar to the baseline and prior work.
Following prior work, we evaluate on this dataset using the top-1 precision (P@1) metric, which is the fraction of expressions for which the highest scoring subject region matches the ground-truth. Table 3 shows the performance of our model, the baseline model and previous work. Note that all the methods are trained with the same weak supervision (only a ground-truth subject region). It can be seen that by incorporating inter-object relationships, our full model outperforms the baseline using only localization modules, and works better than previous state-of-the-art methods.
Additionally, replacing the learned expression parsing and language representation in Sec. 3.1 with an external parser (“our model w/ external parser” in Table 3) leads to a significant performance drop. We find that this is mainly because existing parsers are not specifically tuned for the referring expression task: as noted in Sec. 3.1, expressions like “chair on the left of the table” are parsed as (chair, on, the left of the table) rather than the desired triplet (chair, on the left of, the table). In our full model, the language representation is end-to-end optimized with other parts, while it is hard to jointly optimize an external language parser for this task.
Figure 6 shows some example results on this dataset. It can be seen that although weakly supervised, our model not only grounds the subject region correctly (solid box), but also finds reasonable regions (dashed box) for the object entity.
|Method||Accuracy|
|Zhu et al.||56.10%|
|baseline (loc module)||71.61%|
|our model (w/ external parser)||61.66%|
|our full model||72.53%|
Finally, we evaluate our method on the multiple choice pointing questions (i.e. “which” questions) in visual question answering on the Visual-7W dataset. Given an image and a question like “which tomato slice is under the knife”, the task is to select the corresponding region from a few choice regions (4 choices in this dataset) as the answer. Since this task is closely related to grounding referential expressions, our model can be trained in the same way as in Sec. 4.3 to score each choice region using the subject score $s_{subj}$ and pick the highest scoring choice as the answer.
As before, we train our model with weak supervision through Eqn. 20 and use a MSCOCO-pretrained Faster-RCNN VGG-16 network for visual feature extraction. Here we use two different candidate bounding box sets $B_{subj}$ and $B_{obj}$ for the subject regions (the choices) and the object regions, where $B_{subj}$ is the set of 4 choice bounding boxes, and $B_{obj}$ is the set of 300 proposal bounding boxes extracted using the RPN in Faster-RCNN. Similar to Sec. 4.3, we also train a baseline model using only a localization module to score each choice based only on its local appearance and spatial properties, and a truncated model that uses the Stanford parser [31, 19] for expression parsing and language representation.
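With separate candidate sets, the pairwise score matrix becomes rectangular (choices by proposals), but the answer is still chosen by max-pooling over the object candidates. A brief sketch with hypothetical names and toy all-zero scores:

```python
import numpy as np

def answer_pointing_question(loc_subj, loc_obj, rel):
    """Pick one of the answer choices for a pointing question.

    loc_subj: (C,) unary scores of the C choice boxes against the subject phrase.
    loc_obj:  (M,) unary scores of the M proposal boxes against the object phrase.
    rel:      (C, M) relationship scores between each choice and each proposal.
    """
    pair = loc_subj[:, None] + loc_obj[None, :] + rel  # (C, M) pairwise scores
    subj = pair.max(axis=1)                            # best supporting proposal
    return int(subj.argmax()), subj

# 4 choices and 300 proposals; only choice 2 has a proposal in the right relation.
rel = np.zeros((4, 300))
rel[2, 17] = 1.0
answer, choice_scores = answer_pointing_question(np.zeros(4), np.zeros(300), rel)
```

The choices themselves do not need to appear among the proposals; the proposals only serve as candidate groundings for the object mentioned in the question.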
The results are shown in Table 4. It can be seen that our full model outperforms the baseline and the truncated model with an external parser, and achieves much higher accuracy than previous work. Figure 6 shows some question answering examples on this dataset.
[Figure: Google-Ref examples showing ground-truth and our prediction for the expressions “a bear lying to the right of another bear”, “man in sunglasses walking towards two talking men”, “a picnic table that has a bottle of water sitting on it”, “woman in a cream colored wedding dress cutting cake”, “a man going before a lady carrying a cellphone”, “pizza slice not eaten”, “a full grown brown bear near a young bear”, “black dog standing on all four legs” and “chair being sat in by a man”.]
[Figure: Visual-7W examples showing ground-truth and our prediction for the questions “Which wine glass is in the man’s hand?”, “Which person is wearing a helmet?”, “Which mouse is on a pad by computer?”, “Which head is that of an adult giraffe?”, “Which pants belong to the man closest to the train?”, “Which white pillow is leftmost on the bed?”, “Which red shape is on a large white sign?”, “Which is not a pair of a living canine?” and “Which hand can be seen from under the umbrella?”.]
We have proposed Compositional Modular Networks, a novel end-to-end trainable model for handling relationships in referential expressions. Our model learns to parse input expressions with soft attention, and incorporates two types of modules that consider a region’s local features and pairwise interaction between regions respectively. The model induces intuitive linguistic and visual analyses of referential expressions from only weak supervision, and experimental results demonstrate that our approach outperforms both natural baselines and state-of-the-art methods on multiple datasets.
This work was supported by DARPA, AFRL, DoD MURI award N000141110688, NSF awards IIS-1427425, IIS-1212798 and IIS-1212928, NGA and the Berkeley Artificial Intelligence Research (BAIR) Lab. Jacob Andreas is supported by a Facebook graduate fellowship and a Huawei / Berkeley AI fellowship.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.