TensorFlow implementation of Dual Attention Network
We propose Dual Attention Networks (DANs) which jointly leverage visual and textual attention mechanisms to capture fine-grained interplay between vision and language. DANs attend to specific regions in images and words in text through multiple steps and gather essential information from both modalities. Based on this framework, we introduce two types of DANs for multimodal reasoning and matching, respectively. The reasoning model allows visual and textual attentions to steer each other during collaborative inference, which is useful for tasks such as Visual Question Answering (VQA). In addition, the matching model exploits the two attention mechanisms to estimate the similarity between images and sentences by focusing on their shared semantics. Our extensive experiments validate the effectiveness of DANs in combining vision and language, achieving the state-of-the-art performance on public benchmarks for VQA and image-text matching.
Vision and language are two central parts of human intelligence to understand the real world. They are also fundamental components in achieving artificial intelligence, and a tremendous amount of research has been done for decades in each area. Recently, dramatic advances in deep learning have broken the boundaries between vision and language, drawing growing interest in their intersection, such as visual question answering (VQA) [3, 37, 23, 35], image captioning [33, 2], image-text matching [8, 11, 20, 30], visual grounding [24, 9], etc.
One of the recent advances in neural networks is the attention mechanism [21, 4, 33]. It aims to focus on certain aspects of data sequentially and aggregate essential information over time to infer the results, and has been successfully applied to both areas of vision and language. In computer vision, attention-based methods adaptively select a sequence of image regions to extract necessary features [21, 6, 33]. Similarly, attention models for natural language processing highlight specific words or sentences to distill information from input text [4, 25, 15].
Despite the effectiveness of attention in handling both visual and textual data, few attempts have been made to establish a connection between visual and textual attention models, which can be highly beneficial in various scenarios. For example, the VQA problem in Figure 1(a) with the question "What color is the umbrella?" can be efficiently solved by simultaneously focusing on the region of the umbrella and the word "color". In the example of image-text matching in Figure 1(b), the similarity between the image and sentence can be effectively measured by attending to the specific regions and words sharing common semantics, such as "girl" and "pool".
In this paper, we propose Dual Attention Networks (DANs) which jointly learn visual and textual attention models to explore the fine-grained interaction between vision and language. We investigate two variants of DANs illustrated in Figure 1, referred to as reasoning-DAN (r-DAN) and matching-DAN (m-DAN), respectively. The r-DAN collaboratively performs visual and textual attentions using a joint memory which assembles the previous attention results and guides the next attentions. It is suited to the tasks requiring multimodal reasoning such as VQA. On the other hand, the m-DAN separates visual and textual attention models with distinct memories but jointly trains them to capture the shared semantics between images and sentences. This approach eventually finds a joint embedding space which facilitates efficient cross-modal matching and retrieval. Both proposed algorithms closely connect visual and textual attention mechanisms into a unified framework, achieving outstanding performance in VQA and image-text matching problems.
To summarize, the main contributions of our work are as follows:
We propose an integrated framework of visual and textual attentions, where critical regions and words are jointly located through multiple steps.
Two variants of the proposed framework are implemented for multimodal reasoning and matching, and applied to VQA and image-text matching.
Detailed visualization of the attention results validates that our models effectively focus on vital portions of visual and textual data for the given task.
Attention mechanisms allow models to focus on necessary parts of visual or textual inputs at each step of a task. Visual attention models selectively pay attention to small regions in an image to extract core features as well as reduce the amount of information to process. A number of methods have recently adopted visual attention to benefit image classification [21, 28], image generation, image captioning, visual question answering [35, 26, 32], etc. On the other hand, textual attention mechanisms generally aim to find semantic or syntactic input-output alignments under an encoder-decoder framework, which is especially effective in handling long-term dependencies. This approach has been successfully applied to various tasks including machine translation, text generation [16], sentence summarization, and question answering [15, 32].
VQA is the task of answering a natural-language question about a given image, which requires multimodal reasoning over visual and textual data. It has received a surge of interest since Antol et al. presented a large-scale dataset with free-form and open-ended questions. A simple baseline by Zhou et al. predicts the answer from a concatenation of CNN image features and bag-of-words question features. Several methods adaptively construct a deep architecture depending on the given question. For example, Noh et al. impose a dynamic parameter layer on a CNN which is learned by the question, while Andreas et al. utilize a compositional structure of the question to assemble a collection of neural modules.
One limitation of the above approaches is that they resort to a global image representation which contains noisy or unnecessary information. To address this problem, Yang et al. propose stacked attention networks which perform multi-step visual attention, and Shih et al. use object proposals to identify regions relevant to the given question. Recently, dynamic memory networks integrate an attention mechanism with a memory module, and multimodal compact bilinear pooling is exploited to expressively combine multimodal features and predict attention over the image. These methods commonly employ visual attention to find critical regions, but textual attention has rarely been incorporated into VQA. Although HieCoAtt applies both visual and textual attentions, it independently performs each step of co-attention without reasoning over previous co-attention outputs. In contrast, our method moves and refines both attentions via multiple reasoning steps based on the memory of previous attentions, which facilitates close interplay between visual and textual data.
The core issue with image-text matching is measuring the semantic similarity between visual and textual inputs. It is commonly addressed by learning a joint space where image and sentence feature vectors are directly comparable. Hodosh et al. apply canonical correlation analysis (CCA) to find embeddings that maximize the correlation between images and sentences, which is further improved by incorporating deep neural networks [14, 34]. A recent approach by Wang et al. includes structure-preserving constraints within a bidirectional loss function to make the joint space more discriminative. In contrast, Ma et al. construct a CNN to combine an image and sentence fragments into a joint representation, from which the matching score is directly inferred. Image captioning frameworks are also exploited to estimate the similarity based on the inverse probability of sentences given a query image [20, 29].
To the best of our knowledge, no study has attempted to learn multimodal attention models for image-text matching. Even though Karpathy et al. [11, 10] propose to find the alignments between image regions and sentence fragments, they explicitly compute all pairwise distances between them and estimate the average or best alignment score, which leads to inefficiency. On the other hand, our method automatically attends to the shared concepts between images and sentences while embedding them into a joint space, where cross-modal similarity is directly obtained by a single inner product operation.
We present two structures of DANs to consolidate visual and textual attention mechanisms: r-DAN for multimodal reasoning and m-DAN for multimodal matching. They share a common framework but differ in their ways of associating visual and textual attentions. We first describe the common framework including input representation (Section 3.1) and attention mechanisms (Section 3.2). Then we illustrate the details of r-DAN (Section 3.3) and m-DAN (Section 3.4) applied to VQA and image-text matching, respectively.
The image features are extracted from the 19-layer VGGNet or the 152-layer ResNet. We first rescale images to 448×448 and feed them into the CNNs. In order to obtain feature vectors for different regions, we take the last pooling layer of VGGNet (pool5) or the layer beneath the last pooling layer of ResNet (res5c). Finally, the input image is represented by $\{v_1, \dots, v_N\}$, where $N$ is the number of image regions and $v_n$ is a 512 (VGGNet) or 2048 (ResNet) dimensional feature vector corresponding to the $n$-th region.
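As a rough illustration of this region representation, the sketch below flattens a convolutional feature map into one feature vector per region. The function name `feature_map_to_regions`, the channels-last layout, and the random array standing in for a pool5 output are assumptions made for illustration only. The 14×14 grid follows from the total stride of 32 at pool5 (and at res5c) for a 448×448 input.

```python
import numpy as np

def feature_map_to_regions(feature_map):
    """Flatten an (H, W, D) feature map into an (N, D) matrix with N = H * W region vectors."""
    h, w, d = feature_map.shape
    return feature_map.reshape(h * w, d)

# Random stand-in for a VGGNet pool5 output of a 448x448 image (assumption: channels-last).
rng = np.random.default_rng(0)
grid = 448 // 32                      # five 2x2 poolings give a total stride of 32
pool5 = rng.standard_normal((grid, grid, 512))

regions = feature_map_to_regions(pool5)
print(regions.shape)                  # (196, 512): N = 196 regions, 512 dimensions each
```

Each row of `regions` then corresponds to one spatial cell of the grid, in row-major order.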
We employ bidirectional LSTMs to generate text features as depicted in Figure 2. Given one-hot encoding of $T$ input words $\{w_1, \dots, w_T\}$, we first embed the words into a vector space by $x_t = M w_t$, where $M$ is an embedding matrix. Then we feed the vectors into the bidirectional LSTMs:

$$h_t^{(f)} = \mathrm{LSTM}^{(f)}\big(x_t,\, h_{t-1}^{(f)}\big), \tag{1}$$

$$h_t^{(b)} = \mathrm{LSTM}^{(b)}\big(x_t,\, h_{t+1}^{(b)}\big), \tag{2}$$

where $h_t^{(f)}$ and $h_t^{(b)}$ represent the hidden states at time $t$ from the forward and backward LSTMs, respectively. By adding the two hidden states at each time step, i.e. $u_t = h_t^{(f)} + h_t^{(b)}$, we construct a set of feature vectors $\{u_1, \dots, u_T\}$ where $u_t$ encodes the semantics of the $t$-th word in the context of the entire sentence. Note that the models discussed here, including the word embedding matrix and the LSTMs, are trained end-to-end.
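The text pipeline described above (embedding lookup, a forward and a backward LSTM, and summation of their hidden states) can be sketched in NumPy as follows. This is a minimal sketch, not the authors' implementation: the standard LSTM cell, the gate ordering, the zero initial states, the random weights, and all function names are assumptions, and the toy sizes replace the 512-dimensional layers used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell. Assumed gate order: input, forget, output, candidate."""
    d = h.shape[0]
    z = W @ x + U @ h + b
    i, f, o = sigmoid(z[:d]), sigmoid(z[d:2 * d]), sigmoid(z[2 * d:3 * d])
    g = np.tanh(z[3 * d:])
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def run_lstm(xs, params, d):
    """Run an LSTM over the rows of xs from a zero state; returns hidden states of shape (T, d)."""
    h, c = np.zeros(d), np.zeros(d)
    states = []
    for x in xs:
        h, c = lstm_step(x, h, c, *params)
        states.append(h)
    return np.stack(states)

def text_features(word_ids, M, fwd, bwd, d):
    """Per-word features: sum of forward and backward hidden states."""
    xs = M[:, word_ids].T                     # M times a one-hot vector selects a column of M
    h_f = run_lstm(xs, fwd, d)                # left to right
    h_b = run_lstm(xs[::-1], bwd, d)[::-1]    # right to left, re-aligned to word order
    return h_f + h_b

def init_lstm(rng, e, d):
    """Random stand-in LSTM weights (hypothetical initialisation)."""
    return (0.1 * rng.standard_normal((4 * d, e)),
            0.1 * rng.standard_normal((4 * d, d)),
            np.zeros(4 * d))

rng = np.random.default_rng(0)
vocab, e, d = 20, 6, 8                        # toy sizes for illustration
M = 0.1 * rng.standard_normal((e, vocab))
fwd, bwd = init_lstm(rng, e, d), init_lstm(rng, e, d)
word_ids = [3, 7, 1, 12, 5]
u = text_features(word_ids, M, fwd, bwd, d)
print(u.shape)                                # (5, 8): one feature vector per word
```

Because the backward pass reads the sentence from the end, the feature of the first word already depends on the last word, which is what lets each vector encode its word in the context of the whole sentence.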
Our method performs visual and textual attentions simultaneously through multiple steps and gathers necessary information from both modalities. In this section, we explain the underlying attention mechanisms employed at each step, which serve as building blocks to compose the entire DANs. For simplicity, we shall omit the bias term in the following equations.
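Visual attention aims to generate a context vector by attending to certain parts of the input image. At step $k$, the visual context vector $v^{(k)}$ is given by

$$v^{(k)} = \mathrm{V\_Att}\big(\{v_n\}_{n=1}^{N},\, m_v^{(k-1)}\big), \tag{3}$$

where $m_v^{(k-1)}$ is a memory vector encoding the information that has been attended until step $k-1$. Specifically, we employ the soft attention mechanism where the context vector is obtained from a weighted average of input feature vectors. The attention weights $\{\alpha_{v,n}^{(k)}\}_{n=1}^{N}$ are computed by a 2-layer feed-forward neural network (FNN) and the softmax function:

$$h_{v,n}^{(k)} = \tanh\big(W_v^{(k)} v_n\big) \odot \tanh\big(W_{v,m}^{(k)} m_v^{(k-1)}\big), \tag{4}$$

$$\alpha_{v,n}^{(k)} = \mathrm{softmax}\big(W_{v,h}^{(k)} h_{v,n}^{(k)}\big), \tag{5}$$

$$v^{(k)} = \tanh\Big(P^{(k)} \sum_{n=1}^{N} \alpha_{v,n}^{(k)} v_n\Big), \tag{6}$$

where $W_v^{(k)}$, $W_{v,m}^{(k)}$, and $W_{v,h}^{(k)}$ are the network parameters, $h_{v,n}^{(k)}$ is a hidden state, and $\odot$ is element-wise multiplication. In Equation 6, we introduce an additional layer with the weight matrix $P^{(k)}$ in order to embed visual context vectors into a compatible space with textual context vectors, as we use pretrained image features $v_n$.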
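A single visual attention step of this kind can be sketched in NumPy as below. The sketch assumes the form described in the text: a hidden state formed by the element-wise product of a projection of each region and a projection of the memory, a softmax over regions, a weighted average, and one extra projection layer. The `tanh` nonlinearities, the parameter names, the shapes, and the random inputs are assumptions for illustration, and biases are omitted as in the text.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))                  # subtract the max for numerical stability
    return e / e.sum()

def visual_attention(V, m, W_v, W_m, w_h, P):
    """One soft-attention step over image regions.

    V: (N, D) region features, m: (d,) memory vector.
    Returns the context vector (d,) and the attention weights (N,).
    """
    h = np.tanh(V @ W_v.T) * np.tanh(W_m @ m)  # (N, d): region projection gated by the memory
    alpha = softmax(h @ w_h)                   # (N,): one weight per region, summing to 1
    context = np.tanh(P @ (alpha @ V))         # weighted average, then the extra projection layer
    return context, alpha

rng = np.random.default_rng(0)
N, D, d = 196, 512, 64                         # toy hidden size; the paper uses 512
V = rng.standard_normal((N, D))
m = rng.standard_normal(d)
W_v = 0.05 * rng.standard_normal((d, D))
W_m = 0.05 * rng.standard_normal((d, d))
w_h = rng.standard_normal(d)
P = 0.05 * rng.standard_normal((d, D))

context, alpha = visual_attention(V, m, W_v, W_m, w_h, P)
print(context.shape, alpha.shape)              # (64,) (196,)
```

Reshaping `alpha` to the 14×14 grid gives the kind of attention map that is visualized in the qualitative figures.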
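Textual attention computes a textual context vector $u^{(k)}$ by focusing on specific words in the input sentence at every step:

$$u^{(k)} = \mathrm{T\_Att}\big(\{u_t\}_{t=1}^{T},\, m_u^{(k-1)}\big), \tag{7}$$

where $m_u^{(k-1)}$ is a memory vector. The textual attention mechanism is almost identical to the visual attention mechanism. In other words, the attention weights $\{\alpha_{u,t}^{(k)}\}_{t=1}^{T}$ are obtained from a 2-layer FNN and the context vector $u^{(k)}$ is calculated by weighted averaging:

$$h_{u,t}^{(k)} = \tanh\big(W_u^{(k)} u_t\big) \odot \tanh\big(W_{u,m}^{(k)} m_u^{(k-1)}\big), \tag{8}$$

$$\alpha_{u,t}^{(k)} = \mathrm{softmax}\big(W_{u,h}^{(k)} h_{u,t}^{(k)}\big), \tag{9}$$

$$u^{(k)} = \sum_{t=1}^{T} \alpha_{u,t}^{(k)} u_t, \tag{10}$$

where $W_u^{(k)}$, $W_{u,m}^{(k)}$, and $W_{u,h}^{(k)}$ are the network parameters, and $h_{u,t}^{(k)}$ is a hidden state. Unlike the visual attention, it does not need an additional layer after the last weighted averaging because the text features are already trained end-to-end.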
VQA is a representative problem which requires joint reasoning over multimodal data. For this purpose, the r-DAN maintains a joint memory vector $m^{(k)}$ which accumulates the visual and textual information that has been attended until step $k$. It is recursively updated by

$$m^{(k)} = m^{(k-1)} + v^{(k)} \odot u^{(k)}, \tag{11}$$

where $v^{(k)}$ and $u^{(k)}$ are the visual and textual context vectors obtained from Equation 6 and 10, respectively. This joint representation concurrently guides the visual and textual attentions, i.e. $m^{(k)} = m_v^{(k)} = m_u^{(k)}$, which allows the two attention mechanisms to closely cooperate with each other. The initial memory vector $m^{(0)}$ is defined based on global context vectors $v^{(0)}$ and $u^{(0)}$ as

$$m^{(0)} = v^{(0)} \odot u^{(0)}, \tag{12}$$

$$v^{(0)} = \tanh\Big(P^{(0)} \frac{1}{N} \sum_{n=1}^{N} v_n\Big), \tag{13}$$

$$u^{(0)} = \frac{1}{T} \sum_{t=1}^{T} u_t. \tag{14}$$

By repeating the dual attention (Equation 3 and 7) and memory update (Equation 11) for $K$ steps, we effectively focus on the key portions in the image and question, and gather relevant information for answering the question. Figure 3 illustrates the overall architecture of r-DAN in case of $K = 2$.
The final answer is predicted by multi-way classification to the top $C$ frequent answers. We employ a single-layer softmax classifier with cross-entropy loss where the input is the final memory $m^{(K)}$:

$$p_{\mathrm{ans}} = \mathrm{softmax}\big(W_{\mathrm{ans}}\, m^{(K)}\big), \tag{15}$$

where $p_{\mathrm{ans}}$ represents the probability over the candidate answers.
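Putting the pieces together, the forward pass of the reasoning model can be sketched as below: a joint memory is initialised from global image and question contexts, both attentions are guided by that memory at every step, the memory accumulates the element-wise product of the two context vectors, and a softmax classifier reads the final memory. This is a sketch under stated assumptions rather than the authors' code: the initial contexts are taken as plain averages of the inputs (with the visual one projected), and the parameter layout, names, random weights, and toy sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(X, m, W_x, W_m, w_h):
    """Soft attention over the rows of X guided by memory m; returns the weighted average of X."""
    h = np.tanh(X @ W_x.T) * np.tanh(W_m @ m)
    alpha = softmax(h @ w_h)
    return alpha @ X

def r_dan(V, U, params, K=2):
    """V: (N, D) region features, U: (T, d) word features. Returns answer probabilities (C,)."""
    v = np.tanh(params["P"][0] @ V.mean(axis=0))      # global visual context (assumed: mean of regions)
    u = U.mean(axis=0)                                 # global textual context (assumed: mean of words)
    m = v * u                                          # initial joint memory
    for k in range(1, K + 1):
        v = np.tanh(params["P"][k] @ attend(V, m, *params["vis"][k - 1]))
        u = attend(U, m, *params["txt"][k - 1])        # both attentions read the same memory
        m = m + v * u                                  # accumulate what was attended at this step
    return softmax(params["W_ans"] @ m)                # single-layer classifier on the final memory

def init_params(rng, D, d, C, K):
    """Random stand-in parameters (hypothetical layout)."""
    def att(in_dim):
        return (0.05 * rng.standard_normal((d, in_dim)),
                0.05 * rng.standard_normal((d, d)),
                rng.standard_normal(d))
    return {
        "P": [0.05 * rng.standard_normal((d, D)) for _ in range(K + 1)],
        "vis": [att(D) for _ in range(K)],
        "txt": [att(d) for _ in range(K)],
        "W_ans": 0.05 * rng.standard_normal((C, d)),
    }

rng = np.random.default_rng(0)
N, T, D, d, C, K = 196, 7, 512, 64, 10, 2              # toy sizes for illustration
params = init_params(rng, D, d, C, K)
V = rng.standard_normal((N, D))
U = rng.standard_normal((T, d))
p_ans = r_dan(V, U, params, K)
print(p_ans.shape, int(np.argmax(p_ans)))
```

The key design point is that the image and the question never attend independently: whatever one modality contributed at earlier steps is already folded into the memory that steers the other.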
Image-text matching tasks usually involve comparison between numerous images and sentences, where effective and efficient computation of cross-modal similarities is crucial. To achieve this, we aim to learn a joint embedding space which satisfies the following properties. First, the embedding space encodes the shared concepts that frequently co-occur in image and sentence domains. Moreover, images and sentences are autonomously embedded into the joint space without being paired, so that arbitrary image and sentence vectors in the space are directly comparable.
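Our m-DAN jointly learns visual and textual attention models to capture the shared concepts between the two modalities, but separates them at inference time to provide generally comparable representations in the embedding space. Contrary to the r-DAN which uses a joint memory, the m-DAN maintains separate memory vectors for visual and textual attentions as follows:

$$m_v^{(k)} = m_v^{(k-1)} + v^{(k)}, \tag{16}$$

$$m_u^{(k)} = m_u^{(k-1)} + u^{(k)}, \tag{17}$$

which are initialized to the global context vectors $v^{(0)}$ and $u^{(0)}$, respectively. At each step, the similarity between the visual and textual context vectors is given by their inner product:

$$s^{(k)} = v^{(k)} \cdot u^{(k)}. \tag{18}$$

After performing $K$ steps of the dual attention and memory update, the final similarity between the given image and sentence becomes

$$S = \sum_{k=0}^{K} s^{(k)}. \tag{19}$$

The overall architecture of this model when $K = 2$ is depicted in Figure 4.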
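In contrast to the joint memory of the reasoning model, the matching model can be sketched with two independent attention pipelines whose context vectors are compared step by step. The sketch below assumes that each memory starts from a global average of its own inputs, that it accumulates its own context vectors, and that the final score sums the inner products of the per-step contexts. The function names, parameter layout, and random weights are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(X, m, W_x, W_m, w_h):
    """Soft attention over the rows of X guided by memory m; returns the weighted average of X."""
    h = np.tanh(X @ W_x.T) * np.tanh(W_m @ m)
    return softmax(h @ w_h) @ X

def modality_contexts(X, steps, P=None):
    """Context vectors [x^(0), ..., x^(K)] of one modality, using only that modality's memory.

    steps: list of K tuples (W_x, W_m, w_h).
    P: optional list of K + 1 projection matrices (used for the visual side only).
    """
    project = (lambda k, x: np.tanh(P[k] @ x)) if P is not None else (lambda k, x: x)
    x = project(0, X.mean(axis=0))             # global context (assumed: mean of the inputs)
    m = x.copy()                               # modality-specific memory
    contexts = [x]
    for k, (W_x, W_m, w_h) in enumerate(steps, start=1):
        x = project(k, attend(X, m, W_x, W_m, w_h))
        m = m + x                              # the memory never sees the other modality
        contexts.append(x)
    return contexts

def similarity(v_ctx, u_ctx):
    """Sum over steps of the inner products between visual and textual context vectors."""
    return sum(float(v @ u) for v, u in zip(v_ctx, u_ctx))

rng = np.random.default_rng(0)
N, T, D, d, K = 196, 9, 512, 64, 2             # toy sizes for illustration

def att(in_dim):
    return (0.05 * rng.standard_normal((d, in_dim)),
            0.05 * rng.standard_normal((d, d)),
            rng.standard_normal(d))

V, U = rng.standard_normal((N, D)), rng.standard_normal((T, d))
P = [0.05 * rng.standard_normal((d, D)) for _ in range(K + 1)]
v_ctx = modality_contexts(V, [att(D) for _ in range(K)], P)
u_ctx = modality_contexts(U, [att(d) for _ in range(K)])
S = similarity(v_ctx, u_ctx)
print(len(v_ctx), len(u_ctx), S)
```

Since `modality_contexts` takes a single modality as input, image and sentence representations can be computed once and reused against any number of candidates.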
This network is trained with bidirectional max-margin ranking loss, which is widely adopted for multimodal similarity learning [11, 10, 13, 30]. For each correct pair of an image and a sentence $(v, u)$, we additionally sample a negative image $v^-$ and a negative sentence $u^-$ to construct two negative pairs $(v^-, u)$ and $(v, u^-)$. Then, the loss function becomes:

$$L = \sum_{(v,u)} \Big\{ \max\big(0,\, m - S(v,u) + S(v^-,u)\big) + \max\big(0,\, m - S(v,u) + S(v,u^-)\big) \Big\}, \tag{20}$$

where $m$ is a margin constraint. By minimizing this function, the network is trained to focus on the common semantics that only appear in correct image-sentence pairs through visual and textual attention mechanisms.
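The bidirectional max-margin ranking loss can be illustrated with plain similarity scores. The sketch below assumes the usual hinge form, in which the correct pair must outscore each of the two negative pairs by at least the margin; the function names and the example scores are made up for illustration, and the default margin of 100 is the value reported for the experiments.

```python
def pair_loss(s_pos, s_neg_image, s_neg_sentence, margin=100.0):
    """Hinge loss for one correct pair and its two negative pairs.

    s_pos:           similarity of the correct (image, sentence) pair
    s_neg_image:     similarity of (negative image, sentence)
    s_neg_sentence:  similarity of (image, negative sentence)
    """
    return (max(0.0, margin - s_pos + s_neg_image)
            + max(0.0, margin - s_pos + s_neg_sentence))

def batch_loss(triples, margin=100.0):
    """Sum of pair losses over a minibatch of (s_pos, s_neg_image, s_neg_sentence) triples."""
    return sum(pair_loss(*t, margin=margin) for t in triples)

# Both negatives are beaten by more than the margin: no loss.
print(pair_loss(250.0, 100.0, 120.0))   # 0.0
# The negative image is only 50 below the correct pair: loss of 50 from that term.
print(pair_loss(150.0, 100.0, 20.0))    # 50.0
```

The loss is zero only when both directions are satisfied, which is why a pair can still be penalized when just one of its negatives is too close.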
At inference time, an arbitrary image or sentence is embedded into the joint space by concatenating its context vectors:

$$z_v = \big[v^{(0)}; \dots; v^{(K)}\big], \tag{21}$$

$$z_u = \big[u^{(0)}; \dots; u^{(K)}\big], \tag{22}$$

where $z_v$ and $z_u$ are the representations for image $v$ and sentence $u$, respectively. Note that these vectors are obtained via separate pipelines of visual and textual attentions, i.e. learned shared concepts are revealed from an image or sentence itself, not from an image-sentence pair. The similarity between two vectors in the joint space is simply computed by their inner product, e.g. $S(v, u) = z_v \cdot z_u$, which is equivalent to the output of the network in Equation 19.
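The equivalence between the inner product of concatenated context vectors and the sum of per-step inner products is easy to check numerically, and it is what makes large-scale retrieval a single matrix product. In the sketch below, random arrays stand in for context vectors that would come from the attention pipelines; the array layout and the names `embed`, `img_ctx`, and `sent_ctx` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 2, 4
# Stand-in context vectors for 3 images and 5 sentences, shape (num_items, K + 1, d).
img_ctx = rng.standard_normal((3, K + 1, d))
sent_ctx = rng.standard_normal((5, K + 1, d))

def embed(ctx):
    """Concatenate the K + 1 context vectors of each item into one joint-space vector."""
    return ctx.reshape(ctx.shape[0], -1)

Z_v, Z_u = embed(img_ctx), embed(sent_ctx)

# All image-sentence similarities at once: one inner product per pair.
scores = Z_v @ Z_u.T                                          # (3, 5)

# The same scores computed step by step and summed over the K + 1 steps.
stepwise = np.einsum("ikd,jkd->ij", img_ctx, sent_ctx)

print(np.allclose(scores, stepwise))                          # True
ranking = np.argsort(-scores, axis=1)                         # sentences ranked per image query
```

Each row of `ranking` lists sentence indices from most to least similar for one image, and transposing `scores` gives the text-to-image direction.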
We fix all the hyper-parameters applied to both r-DAN and m-DAN. The number of attention steps $K$ is set to 2, which empirically shows the best performance. The dimension of every hidden layer (including word embedding, LSTMs, and attention models) is set to 512. We train our networks by stochastic gradient descent with a learning rate of 0.1, momentum 0.9, weight decay 0.0005, dropout ratio 0.5, and gradient clipping at 0.1. The network is trained for 60 epochs, where the learning rate is dropped to 0.01 after 30 epochs. A minibatch for r-DAN and m-DAN consists of 128 pairs of ⟨image, question⟩ and 128 quadruplets of ⟨positive image, positive sentence, negative image, negative sentence⟩, respectively. The number of possible answers for VQA is set to 2000, and the margin for the loss function in Equation 20 is set to 100.
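For reference, the reported training settings can be collected in a small configuration with a step learning-rate schedule. The dictionary keys and the `learning_rate` helper are hypothetical names, and the sketch assumes zero-based epoch indices, so that the first 30 epochs use the initial rate and the remaining 30 use the reduced one.

```python
CONFIG = {
    "attention_steps": 2,
    "hidden_dim": 512,
    "optimizer": "sgd",
    "lr_initial": 0.1,
    "lr_after_drop": 0.01,
    "lr_drop_epoch": 30,
    "epochs": 60,
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "dropout": 0.5,
    "grad_clip": 0.1,
    "batch_size": 128,
    "num_answers": 2000,     # r-DAN only
    "margin": 100.0,         # m-DAN only
}

def learning_rate(epoch, cfg=CONFIG):
    """Step schedule: the initial rate before the drop epoch, the reduced rate afterwards."""
    return cfg["lr_initial"] if epoch < cfg["lr_drop_epoch"] else cfg["lr_after_drop"]

schedule = [learning_rate(e) for e in range(CONFIG["epochs"])]
print(schedule[0], schedule[29], schedule[30], schedule[59])   # 0.1 0.1 0.01 0.01
```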
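Table 1: Accuracy (%) on the VQA dataset. OE: open-ended task, MC: multiple-choice task.
| Method | Test-dev OE Y/N | Test-dev OE Num | Test-dev OE Other | Test-dev OE All | Test-dev MC All | Test-std OE Y/N | Test-std OE Num | Test-std OE Other | Test-std OE All | Test-std MC All |
|---|---|---|---|---|---|---|---|---|---|---|
| VQA team | 80.5 | 36.8 | 43.1 | 57.8 | 62.7 | 80.6 | 36.5 | 43.7 | 58.2 | 63.1 |
| MRN (ResNet) | 82.3 | 38.8 | 49.3 | 61.7 | 66.2 | 82.4 | 38.2 | 49.4 | 61.8 | 66.3 |
| HieCoAtt (ResNet) | 79.7 | 38.7 | 51.7 | 61.8 | 65.8 | - | - | - | 62.1 | 66.1 |
| RAU (ResNet) | 81.9 | 39.0 | 53.0 | 63.3 | 67.7 | 81.7 | 38.2 | 52.8 | 63.2 | 67.3 |
| MCB (ResNet) | 82.2 | 37.7 | 54.8 | 64.2 | 68.6 | - | - | - | - | - |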
Figure 5: Qualitative results on the VQA dataset with attention visualization. Example questions:
Q: What is the man on the bike holding on his right hand?
Q: How many horses are in the picture?
Q: What color are the cows? A: brown and white
Q: What is on his wrist?
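Table 2: Bidirectional retrieval results on the Flickr30K dataset. R@K: Recall@K (%), MR: median rank.
| Method | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 | Image-to-Text MR | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 | Text-to-Image MR |
|---|---|---|---|---|---|---|---|---|
| GMM+HGLMM FV | 35.0 | 62.0 | 73.8 | 3 | 25.0 | 52.7 | 66.0 | 5 |
| HGLMM FV | 36.5 | 62.2 | 73.3 | - | 24.7 | 53.4 | 66.8 | - |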
Figure 6: Image-to-text retrieval examples with attention visualization. Retrieved sentences:
|(+) A woman in a brown vest is working on the computer.|(+) A man in a white shirt stands high up on scaffolding.|
|(+) A woman in a red vest working at a computer.|(+) Man works on top of scaffolding.|
|(+) Two boys playing together at a playground.|(-) A man wearing a red t shirt sweeps the sidewalk in front of a brick building.|
|(-) The two kids are playing at the playground.|(+) Boy in red shirt and black shorts sweeps driveway.|
Figure 7: Text-to-image retrieval examples with attention visualization. Query sentences:
|A woman in a cap at a coffee shop.|A boy is hanging out of the window of a yellow taxi.|
|A woman in a striped outfit on a bike.|A group of people standing on a sidewalk under some trees.|
We evaluate the r-DAN on the Visual Question Answering (VQA) dataset, which contains approximately 200K real images from the MSCOCO dataset. Each image is associated with three questions, and each question is labeled with ten answers by human annotators. The dataset is typically divided into four splits: train (80K images), val (40K images), test-dev (20K images), and test-std (20K images). We train our model using train and val, validate with test-dev, and evaluate on test-std. There are two forms of tasks, open-ended and multiple-choice, which require answering each question without and with a set of candidate answers, respectively. For both tasks, we follow the evaluation metric of the VQA dataset:

$$\mathrm{Acc}(\hat{a}) = \min\left\{ \frac{\#\,\text{humans that labeled } \hat{a}}{3},\; 1 \right\}, \tag{23}$$

where $\hat{a}$ is a predicted answer.
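The metric can be illustrated with a short function. The sketch assumes the standard VQA accuracy, under which a predicted answer receives full credit once at least three of the ten annotators gave it and partial credit otherwise. Exact string comparison is a simplifying assumption here, and the annotator answers in the example are made up.

```python
def vqa_accuracy(predicted, human_answers):
    """Accuracy of one predicted answer given the human-provided answers for the question."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)

# Hypothetical annotations for one question (ten annotators).
humans = ["brown and white"] * 2 + ["brown"] * 5 + ["white"] * 3

print(vqa_accuracy("brown", humans))            # 1.0  (five annotators agree)
print(vqa_accuracy("brown and white", humans))  # two annotators agree: 2/3 credit
print(vqa_accuracy("black", humans))            # 0.0
```

The reported accuracy of a model is the average of this score over all test questions.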
The performance of r-DAN compared with state-of-the-art VQA systems is presented in Table 1, where our method achieves the best performance in both open-ended and multiple-choice tasks. For fair evaluation, we compare single-model accuracies obtained without data augmentation, even though better performance has been reported using model ensembles and additional training data. Figure 5 shows qualitative results from our approach with visualization of the attention weights. Our method produces correct answers to challenging problems which require fine-grained reasoning, and successfully attends to the specific regions and words which facilitate answering the questions. Specifically, the first and fourth examples in Figure 5 illustrate that the r-DAN moves its visual attention to the proper regions indicated by the attended words, while the second and third examples show that it moves its textual attention to divide a complex task into sequential subtasks: finding target objects and extracting certain attributes.
We employ the Flickr30K dataset to evaluate the m-DAN for multimodal matching. It consists of 31,783 real images with five descriptive sentences for each, and we follow the public splits: 29,783 training, 1,000 validation, and 1,000 test images. We report the performance of m-DAN in bidirectional image and sentence retrieval using the same metrics as previous work [34, 19, 20, 30]. Recall@K (K = 1, 5, 10) represents the percentage of queries for which at least one ground truth is retrieved among the top K results, and MR measures the median rank of the top-ranked ground truth.
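Both retrieval metrics depend only on the rank of the best-placed ground-truth item for each query. The sketch below shows one way to compute them; the function names are hypothetical, ranks are 1-based, and the scores and ranks in the example are made up.

```python
from statistics import median

def best_rank(scores, ground_truth):
    """1-based rank of the top-ranked ground-truth item when candidates are sorted by score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return min(order.index(g) for g in ground_truth) + 1

def recall_at_k(ranks, k):
    """Percentage of queries whose best ground-truth item appears in the top k results."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    """Median over queries of the rank of the top-ranked ground-truth item."""
    return median(ranks)

# One image query with four candidate sentences; sentences 2 and 3 are ground truth.
print(best_rank([0.1, 0.9, 0.5, 0.3], {2, 3}))   # 2

# Best ground-truth ranks for five hypothetical queries.
ranks = [1, 3, 7, 12, 2]
print(recall_at_k(ranks, 1), recall_at_k(ranks, 5), recall_at_k(ranks, 10))  # 20.0 60.0 80.0
print(median_rank(ranks))                        # 3
```

For image-to-text retrieval each image has five ground-truth sentences, which is why `best_rank` takes a set of indices rather than a single one.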
Table 2 presents the quantitative results on the Flickr30K dataset, where the proposed method outperforms other recent approaches in all measures. The qualitative results from image-to-text and text-to-image retrieval are also illustrated in Figure 6 and Figure 7, respectively, with visualization of attention outputs. At each step of attention, the m-DAN effectively discovers the essential semantics appearing in both modalities. It tends to capture the main subjects (e.g. woman, boy, people, etc.) at the first step, and figure out relevant objects, backgrounds or actions (e.g. computer, scaffolding, sweeps, etc.) at the second step. Note that this property solely comes from the training stage where visual and textual attention models are jointly learned, while images and sentences are processed independently at inference time.
We propose Dual Attention Networks (DANs) to bridge visual and textual attention mechanisms. We present two architectures of DANs for multimodal reasoning and matching. The first model infers the answers collaboratively from images and sentences, while the other one embeds them into a common space by capturing their shared semantics. These models demonstrate the state-of-the-art performance in VQA and image-text matching, showing their effectiveness in extracting essential information via the dual attention mechanism. The proposed framework can be potentially generalized to various tasks at the intersection of vision and language, such as image captioning, visual grounding, video question answering, etc.
A hierarchical neural autoencoder for paragraphs and documents. In ACL, 2015.