ParNet: Position-aware Aggregated Relation Network for Image-Text Matching

06/17/2019 · Yaxian Xia, et al. · Peking University

Exploring fine-grained relationships between entities (e.g., objects in an image or words in a sentence) contributes greatly to understanding multimedia content precisely. Previous attention mechanisms employed in image-text matching either take multiple self-attention steps to gather correspondences or use image objects (or words) as context to infer image-text similarity. However, they only take advantage of semantic information without considering that objects' relative positions also contribute to image understanding. To this end, we introduce a novel position-aware relation module that models semantic and spatial relationships simultaneously for image-text matching. Given an image, our method utilizes the locations of different objects to capture spatial relationships innovatively. With the combination of semantic and spatial relationships, it is easier to understand the content of different modalities (images and sentences) and to capture fine-grained latent correspondences of image-text pairs. Besides, we employ a two-step aggregated relation module to capture interpretable alignments of image-text pairs. In the first step, the intra-modal relation mechanism, we compute responses between different objects in an image or different words in a sentence separately; in the second step, the inter-modal relation mechanism, the query plays the role of textual context to refine the relationships among object proposals in an image. In this way, our position-aware aggregated relation network (ParNet) not only knows which entities are relevant by attending to different objects (words) adaptively, but also adjusts the inter-modal correspondence according to the latent alignments with the query's content. Our approach achieves state-of-the-art results on the MS-COCO dataset.


1. Introduction

Exploring the relationship between images and natural language is of central importance to various tasks, such as image-text matching, image captioning, and visual question answering (VQA).

Figure 1. An image with the description "A shelving unit is between a toilet and a bathroom sink". The sentence description makes reference both to objects (e.g., 'toilet', 'shelving unit', and 'sink') and to the relative position of objects (e.g., 'between'). Identifying the spatial relations of objects contributes to image-text matching.

The image-text matching task requires a combination of concepts from computer vision and natural language processing. Methods for image-text matching must understand the content of both the sentence and the image correctly, as well as model their interactions. Because of the heterogeneity gap between different media types, learning the intrinsic correspondences of image-text pairs is quite challenging.

The information distributions of different modalities are inconsistent. For instance, text contains abundant prepositions to express locative and spatial relationships. As shown in Fig. 1, consider an example sentence such as "A shelving unit is between a toilet and a bathroom sink". In order to retrieve the corresponding image correctly, all entities must be identified precisely, as well as the relationships present in the sentence ('between'). Although prepositions in a sentence, e.g., 'in', 'between', and 'on', correspond to none of the entities in the image, they strongly influence the relevance among different image objects. For instance, 'A shelving unit is between a toilet and a sink' and 'A toilet is between a shelving unit and a sink' contain the same objects but correspond to different image scenes. Positional relationships are crucial for understanding such rich image content, so it is important to model semantic and spatial relationships between object features effectively and efficiently.

Some pioneering methods focus on objects in an image and phrases in a sentence separately, capturing finer-level alignments than methods that directly map a whole image or sentence into a common representation space. Karpathy et al. (Karpathy et al., 2014) broke down and embedded objects in an image and phrases in a sentence into a common embedding space and calculated the similarity of object-phrase pairs. However, the image is described as a set of object representations without location information.

Most of the methods mentioned above ignore the spatial positions of image objects, which are important for correctly understanding image descriptions (e.g., 'A shelving unit is between a toilet and a bathroom sink' means something totally different from 'A toilet is between a shelving unit and a bathroom sink'). The interactions and spatial relationships among objects are almost ignored, the effect of spatial relations between objects in an image is suppressed, and objects in the image are treated as spatially isolated, which makes it difficult to capture the spatial relationships described by the corresponding words.

In our proposed approach, a position-aware module is employed to capture both relative position information and semantic information for the image. Our method utilizes the locations of different objects in an image to capture their geometric distribution innovatively. With the combination of semantic and spatial information, it is easier to understand the content of different modalities and to capture interpretable latent correspondences of image-text pairs.

Besides, attention mechanisms have achieved great success in neural machine translation. They have the strength to model dependencies between relevant aspects of data regardless of their distance. Motivated by this success in NLP, several state-of-the-art approaches (Nam et al., 2017)(Lee et al., 2018) have improved performance by employing attention mechanisms to capture the latent semantic correspondences between object-word pairs.

Inspired by the success of attention mechanisms, we aggregate intra-modal relations and inter-modal relations into a two-step relation module.

In the first step, which we call the intra-modal relation mechanism, we compute responses between different objects in an image or different words in a sentence separately and attend to the informative parts within each modality. Intra-modal relations between objects contribute to fine-grained content understanding of a given image or sentence with a sense of global context.

In the second step, which we call the inter-modal relation mechanism, the query plays the role of textual context to refine the relationships among object proposals in an image. Note that, given different queries, not all entities are equally informative; the relevance between the modality and the query's intent is evaluated adaptively.

The aggregation of the two relation mechanisms is expected to perform better than either one alone. In this way, our aggregated relation mechanism not only knows which entities are relevant by attending to different objects (words) adaptively, but also adjusts the inter-modal relationships according to the latent alignments with the corresponding sentence (image).

To sum up, Figure 2 illustrates the structure of the position-aware aggregated relation network (ParNet). There are three parts in our proposed method: an intra-modal relation network with a position-aware module for images, an intra-modal relation network for text, and an inter-modal relation network. Our main contributions are as follows:

  • We propose a position-aware relation module to capture spatial and semantic relationships simultaneously among the objects in an image.

  • A two-step relation mechanism is aggregated in our method. First, correlations among different entities of the inputs (image and text) are explored for each modality separately; second, an image (sentence) search system calculates the importance of each modality according to different queries.

  • Extensive experiments validate our approach. Our ParNet achieves state-of-the-art performance on image-text matching tasks on the MS-COCO dataset.

Figure 2. Overview of our two-step aggregated relation network with a position-aware module. In the first step, the intra-modal relation mechanism computes responses between different entities in an image or a sentence separately, and a position-aware relation module is employed for the image to capture semantic and spatial relations simultaneously. In the second step, the inter-modal relation mechanism, the text plays the role of a weak textual annotation that contributes to image understanding given the corresponding description.

2. Related Work

2.1. Attention Mechanisms

Attention mechanisms have recently been successfully applied in the natural language processing field (Vaswani et al., 2017)(Dehghani et al., 2018)(Devlin et al., 2018). The attention module can capture long-term dependencies at a position, which allows models to automatically focus on the important parts of the inputs. Another strength is that it lends itself to parallel implementation. In NLP, there is a recent trend of replacing recurrent neural networks with attention models. The Transformer (Vaswani et al., 2017) proposed a multi-head attention module, which outperforms state-of-the-art methods in neural machine translation; it has the ability to focus on different relations from different heads at different words.

A self-attention mechanism computes the importance among a sequence of inputs (e.g., a set of words or a set of object proposals) by attending to all elements and returns a weighted vector in a representation space. There are two widely used scoring functions, additive attention and dot-product attention. Additive attention computes the dependence function through a feed-forward layer. Theoretically, the complexity of the two methods is similar, but dot-product attention is more time- and space-efficient in practice.
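The sketch below (PyTorch; not from the paper, with illustrative shapes and layer names) contrasts the two scoring functions on the same queries and keys:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 256
q = torch.randn(1, 10, d)   # 10 query positions
k = torch.randn(1, 20, d)   # 20 key positions

# Scaled dot-product scores: a single matrix product, shape [1, 10, 20].
dot_scores = q @ k.transpose(1, 2) / d ** 0.5

# Additive scores: project queries and keys, combine with tanh, reduce with a vector.
W_q, W_k = nn.Linear(d, d, bias=False), nn.Linear(d, d, bias=False)
v = nn.Linear(d, 1, bias=False)
add_scores = v(torch.tanh(W_q(q).unsqueeze(2) + W_k(k).unsqueeze(1))).squeeze(-1)

# Either score matrix is normalized the same way before weighting the values.
attn = F.softmax(dot_scores, dim=-1)
```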

Recently, attention mechanisms have been successfully applied to optimize object detectors (Hu et al., 2018)(Wang et al., 2017a) and to understand multimedia content, e.g., image captioning, cross-modal retrieval (Lee et al., 2018)(Nam et al., 2017), and visual question answering. Nam et al. (Nam et al., 2017) proposed DAN to capture fine-grained interplay between vision and language through multiple self-attention steps.

Inspired by the success of attention mechanisms, we propose a two-step aggregated relation network with a position-aware module for image-text matching, which can explore intra-modal and inter-modal relationships simultaneously.

2.2. Image-Text Matching

The core challenge of image-text matching is that data from different modalities have inconsistent semantic distributions, and learning the intrinsic correlation between them is quite elusive. It is crucial to find accurate and fine-grained correspondences for image-text pairs.

The mainstream methods for this task learn comparable representations for images and sentences in a single subspace (Frome et al., 2013)(Tsai et al., 2017). Canonical correlation analysis (CCA) (Hardoon et al., 2004) is employed by Hardoon et al. to learn a common space that maximizes the correlation between query and image. Word2VisualVec (Dong et al., 2016) learns to predict visual representations of textual items based on a DNN architecture. Karpathy et al. (Karpathy et al., 2014) embed image objects and fragments of sentences into a common space to explore object-level alignment. Wang et al. (Wang et al., 2017b) proposed ACMR, which is built around the concept of adversarial learning. Nam et al. (Nam et al., 2017) take multiple self-attention steps to gather correspondences. Lee et al. (Lee et al., 2018) use image objects (or words) as context to infer image-text similarity. To the best of our knowledge, few studies have attempted to explore spatial relationships: a great deal of position information is discarded while processing object features, and the attention mechanisms employed for cross-modal retrieval are not flexible enough.

2.3. Kernel Functions

Kernel functions transform data into another dimension that has a clear margin between classes of data (Monti et al., 2017)(Scholkopf and Smola, 2001), so that low-dimensional data can be classified in a higher dimension without requiring an explicit mapping function. The most popular kernel function is the radial basis function (RBF) kernel. It is commonly used in support vector machine classification and has been proved effective by previous work. In our method, RBF kernels are employed to map the polar coordinates of objects into a higher-dimensional, more separable space.
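A minimal PyTorch sketch of the RBF idea reused later for position embedding: a scalar input is compared against a bank of Gaussian kernel centers, yielding a higher-dimensional, more separable representation. The centers and width below are illustrative placeholders, not values from the paper.

```python
import torch

def rbf_embed(x, centers, sigma):
    """Map scalars x [n] to [n, K] via Gaussian kernels around K fixed centers."""
    return torch.exp(-(x.unsqueeze(-1) - centers) ** 2 / (2 * sigma ** 2))

x = torch.tensor([0.1, 0.5, 0.9])
centers = torch.linspace(0.0, 1.0, steps=8)        # 8 kernel centers
print(rbf_embed(x, centers, sigma=0.2).shape)      # torch.Size([3, 8])
```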

3. Position-aware Aggregated Relation Network

One significant goal of cross-modal retrieval is to learn a comparable common space for image and text descriptions. In this paper, we propose a two-step relation network with a position-aware module for image-text matching. This module is used to capture both the latent semantic and spatial object relations. In this section, our proposed position-aware aggregated relation network (ParNet) is described in detail.

On the one hand, our method extracts a set of word representations through word embeddings and bidirectional Gated Recurrent Units (GRUs) (Schuster and Paliwal, 1997). On the other hand, it extracts a set of image object descriptions consisting of bounding-box coordinates and object features generated by an object detector (Faster-RCNN (Ren et al., 2015)).

A two-step aggregated relation module is then constructed to capture the intra-modal and inter-modal relationships simultaneously. The intra-modal relation module is employed to capture query-unrelated relationships within each modality and to attend to the most informative parts of each modality (as illustrated in Figure 1, the object 'sink' is relevant to the objects 'toilet' and 'shelving unit' in the surroundings of a bathroom). Subsequently, given different queries, the information contributions of the different modalities vary; a retrieval system should balance the importance of each modality according to the query's intent (inter-modal relation). Notably, the relative positions of different image objects are considered in our method, and latent semantic and spatial relationships are explored for the image.

The image-text similarity is measured by aligning object-word pairs.

3.1. Input Representation

We describe the embedding vectors computed for both the input image and sentence in this subsection.

3.1.1. Image Representations

For images, a set of object proposals is detected to represent each image. The corresponding area of the convolutional feature map is extracted by the object detector for each proposed bounding box.

In particular, we detect bottom-up attention features corresponding to salient image objects in each image with a Faster-RCNN (Ren et al., 2015), which is pretrained by Anderson et al. (Anderson et al., 2018) on Visual Genome (Krishna et al., 2016). Each detected region from Faster-RCNN covers a complete object, while a traditional CNN (e.g., VGG (Simonyan and Zisserman, 2014)) produces a grid of features without clear boundaries. Besides, the bounding-box coordinates are easy to obtain from Faster-RCNN.

An image is represented as $V = \{v_1, v_2, \dots, v_N\}$, where $v_k$ encodes the $k$-th detected object in the image (bounding-box coordinates and the associated feature vector) and $N$ is the number of image objects. $v_k$ is the concatenation of a 2048-dimensional object feature $f_k$ and the 4-dimensional bounding-box coordinates $b_k$. The whole image is written as $V$.
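A minimal sketch of the image-side input just described, with random tensors standing in for real Faster-RCNN outputs:

```python
import torch

N = 36                                       # number of detected objects (illustrative)
obj_feats = torch.randn(N, 2048)             # 2048-d region features from the detector
bboxes = torch.rand(N, 4)                    # normalized bounding-box coordinates
V = torch.cat([obj_feats, bboxes], dim=1)    # per-object descriptors v_k, [N, 2052]
```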

3.1.2. Text Representations

Text features are generated by a bidirectional Gated Recurrent Unit (GRU), as depicted in Fig. 2. First, word embeddings are employed to convert the one-hot encoding of an input sentence into a dense vector space. Then we feed the sequence of word embeddings into bidirectional GRUs to construct a set of word vectors.

A GRU can keep track of arbitrary long-term dependencies in the input sequences without the vanishing gradient problem; it performs like a long short-term memory (LSTM) but has fewer parameters and is faster to train with little loss of accuracy. The text is described as $E = \{e_1, e_2, \dots, e_L\}$, where $L$ is the length of the sentence and $e_k$ encodes the representation of the $k$-th word in the context of the entire sentence.
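A minimal sketch of the text encoder described above; the vocabulary size, the dimensions, and the averaging of the forward and backward GRU states are illustrative assumptions rather than the paper's exact settings:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hid_dim = 10000, 300, 1024    # illustrative sizes
embed = nn.Embedding(vocab_size, emb_dim)
gru = nn.GRU(emb_dim, hid_dim, batch_first=True, bidirectional=True)

tokens = torch.randint(0, vocab_size, (1, 12))     # one sentence of 12 word ids
out, _ = gru(embed(tokens))                        # [1, 12, 2 * hid_dim]
E = (out[..., :hid_dim] + out[..., hid_dim:]) / 2  # per-word vectors e_k, [1, 12, hid_dim]
```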

3.2. Position-aware relation module for image

As shown in Fig. 3, a position-aware relation module is proposed to capture semantic and spatial object-level relations simultaneously. It aims to enrich the representation of objects by adaptively focusing on both spatially relevant and semantically relevant parts of the input image.

Figure 3. Position-aware relation module for image

There are two branches. In the spatial relation branch, for each object $v_{m}$, a polar coordinate system is defined centered at its bounding-box center $c_{m}$, and the 4-dimensional bounding-box position of every other object $v_{n}$ is transformed into a polar coordinate vector $(\rho_{mn}, \theta_{mn})$, because this is an efficient way to express the spatial orientation between the centers of bounding boxes $m$ and $n$. In image descriptions, absolute positions are rarely used to describe the spatial relationships of objects; on the contrary, relative positions are widely used (e.g., 'on', 'below', and so on). Our method therefore focuses on the relative positions of objects rather than their absolute positions: we pay more attention to "What are the neighborhoods around a particular object and where are they?". The relative distances and angles of object centers indicate spatial relations intuitively.

Low-dimensional relative positions are embedded into a higher-dimensional space through a set of Gaussian kernels (Monti et al., 2017) with learnable means and covariances, so that the spatial relations between objects $m$ and $n$ become easily separable. The spatial relation dimension $d_{s}$ after embedding is chosen experimentally (see the ablation in Section 4.3). The kernel operators for object $v_{n}$, expressed in the polar coordinate system centered at $c_{m}$, are defined as follows:

$k^{\rho}_{mn,i} = \exp\!\left(-\frac{(\rho_{mn} - \mu^{\rho}_{i})^{2}}{2\,(\sigma^{\rho}_{i})^{2}}\right), \quad i = 1, \dots, d_{s}$   (1)
$k^{\theta}_{mn,i} = \exp\!\left(-\frac{(\theta_{mn} - \mu^{\theta}_{i})^{2}}{2\,(\sigma^{\theta}_{i})^{2}}\right), \quad i = 1, \dots, d_{s}$   (2)

where $\mu^{\rho}_{i}$, $(\sigma^{\rho}_{i})^{2}$ are the learnable means and variances of the Gaussian distributions for the relative distance, and $\mu^{\theta}_{i}$, $(\sigma^{\theta}_{i})^{2}$ are the learnable means and variances of the Gaussian distributions for the relative angle.

We aggregate the relative distance and angle kernel responses of objects $m$ and $n$ with a scaling function $f_{s}$, so that the strength of the spatial relationship between objects is weighted by their spatial orientation. The aggregated spatial weight for the image is represented as $W^{G} \in \mathbb{R}^{N \times N}$:

$W^{G}_{mn} = f_{s}\!\left([\,k^{\rho}_{mn};\,k^{\theta}_{mn}\,]\right)$, where $[\cdot;\cdot]$ denotes concatenation of the distance and angle kernel responses,   (3)

where $N$ is the number of objects in an image.
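The following PyTorch sketch illustrates the spatial branch under stated assumptions: box centers are converted into pairwise relative distances and angles, each is embedded with a bank of learnable Gaussian kernels (Eqs. 1-2), and the two are combined into a single weight per object pair. The final combination (a learned linear layer followed by a ReLU to keep the weights non-negative) is an assumption standing in for the scaling function of Eq. 3.

```python
import torch
import torch.nn as nn

class SpatialRelation(nn.Module):
    """Pairwise spatial weights from bounding boxes (illustrative dimensions)."""
    def __init__(self, n_kernels=32):
        super().__init__()
        # Learnable means / widths of the Gaussian kernels for distance and angle.
        self.mu_rho = nn.Parameter(torch.rand(n_kernels))
        self.sigma_rho = nn.Parameter(torch.full((n_kernels,), 0.2))
        self.mu_theta = nn.Parameter(torch.rand(n_kernels) * 6.28 - 3.14)
        self.sigma_theta = nn.Parameter(torch.full((n_kernels,), 0.5))
        self.scale = nn.Linear(2 * n_kernels, 1)   # assumed form of the scaling function

    def forward(self, boxes):                      # boxes: [N, 4] as (x1, y1, x2, y2)
        cx = (boxes[:, 0] + boxes[:, 2]) / 2
        cy = (boxes[:, 1] + boxes[:, 3]) / 2
        dx = cx.unsqueeze(0) - cx.unsqueeze(1)     # [N, N] pairwise center offsets
        dy = cy.unsqueeze(0) - cy.unsqueeze(1)
        rho = torch.sqrt(dx ** 2 + dy ** 2)        # relative distance
        theta = torch.atan2(dy, dx)                # relative angle
        k_rho = torch.exp(-(rho.unsqueeze(-1) - self.mu_rho) ** 2
                          / (2 * self.sigma_rho ** 2))        # [N, N, K] distance kernels
        k_theta = torch.exp(-(theta.unsqueeze(-1) - self.mu_theta) ** 2
                            / (2 * self.sigma_theta ** 2))    # [N, N, K] angle kernels
        w_g = self.scale(torch.cat([k_rho, k_theta], dim=-1)).squeeze(-1)
        return torch.relu(w_g)                     # [N, N] spatial weight, W^G-like

w_g = SpatialRelation()(torch.rand(36, 4))         # e.g. 36 detected objects
```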

In the other branch, the image features $V$ are linearly transformed by learnable projection matrices $W_{q}$ and $W_{k}$. The semantic relation weight is computed as in Eq. 4, where dot-product attention is employed with a scaling factor:

$W^{A}_{mn} = \frac{(W_{q} v_{m})^{\top}(W_{k} v_{n})}{\sqrt{d_{v}}}$   (4)

where $d_{v}$ is the dimension of the object features $v_{k}$.

The intra-image relation weight $W_{mn}$ indicates both the semantic and the spatial impact of object $v_{n}$ on object $v_{m}$. The spatial relationship between objects is fused with the semantic relationship through Eq. 5; the result is scaled into the range (0, 1) and can be regarded as a variant of softmax. $W_{mn}$ is computed as follows:

$W_{mn} = \frac{W^{G}_{mn}\,\exp\!\left(W^{A}_{mn}\right)}{\sum_{k=1}^{N} W^{G}_{mk}\,\exp\!\left(W^{A}_{mk}\right)}$   (5)

Multi-head attention (Vaswani et al., 2017) is employed to adapt to flexible relationships, since different heads can focus on different aspects of a relation. The relation features from the multiple heads are aggregated as follows:

$\hat{v}_{m} = v_{m} + \operatorname{Concat}\!\left[\sum_{n=1}^{N} W^{1}_{mn}\,(W^{1}_{V} v_{n}),\; \dots,\; \sum_{n=1}^{N} W^{K}_{mn}\,(W^{K}_{V} v_{n})\right]$, where $W^{k}_{V}$ is the value projection of the $k$-th head.   (6)

$K$ is the number of relation heads, which is set to 6, following the multi-head design of the Transformer (Vaswani et al., 2017).
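A hedged sketch of one position-aware relation head and its multi-head aggregation: a scaled dot-product semantic weight is fused with the spatial weight from the previous sketch, and the resulting relation features are concatenated across heads and added back to the input. The fusion follows the relation-network form of Hu et al. (2018), which this paper cites; treat the details as assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PositionAwareHead(nn.Module):
    """One relation head: semantic dot-product weights fused with spatial weights."""
    def __init__(self, d_in, d_head):
        super().__init__()
        self.q = nn.Linear(d_in, d_head, bias=False)   # query projection
        self.k = nn.Linear(d_in, d_head, bias=False)   # key projection
        self.v = nn.Linear(d_in, d_head, bias=False)   # value projection
        self.d_head = d_head

    def forward(self, feats, w_g):                     # feats: [N, d_in], w_g: [N, N]
        w_a = self.q(feats) @ self.k(feats).t() / self.d_head ** 0.5   # semantic weight
        fused = w_g.clamp(min=1e-6) * torch.exp(w_a)                   # spatial * exp(semantic)
        fused = fused / fused.sum(dim=-1, keepdim=True)                # row normalization
        return fused @ self.v(feats)                   # relation feature per object

# Multi-head aggregation: concatenate the heads and add back to the input features.
feats, w_g = torch.randn(36, 2052), torch.rand(36, 36)
heads = nn.ModuleList([PositionAwareHead(2052, 2052 // 6) for _ in range(6)])
v_hat = feats + torch.cat([h(feats, w_g) for h in heads], dim=-1)
```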

3.3. Two-step Relation Network

Image-text matching task aims to retrieve relevant images given a sentence query, and vice versa.

In order to capture interpretable alignments of image-text pairs, and inspired by the effectiveness of attention mechanisms, a two-step relation module is designed in our proposed ParNet.

In the first step, the intra-modal relation mechanism computes responses between different entities in an image or a sentence separately, so that the image (sentence) representation is able to attend to the informative parts within each modality.

In the second step, the query plays the role of textual context to refine the relationships among image objects while inferring the similarity. The inter-modal relation mechanism balances the importance of each entity according to the query's intent.

3.3.1. Intra-modal relation module

The intra-modal relation module acts as a self-attention mechanism that generates context-attended vectors by focusing on relevant parts of the image and the text separately.

The inputs of this module are the object features extracted by Faster-RCNN and the word vectors constructed by the bidirectional GRU.

Visual intra-modal relation module. For images, the position-aware relation module is employed here; it generates a context vector by focusing on the relevant parts of the input image, as decided by the intra-image relation weight $W_{mn}$.

Through the intra-modal relation matrix of Section 3.2, the image representation becomes aware of the accurate context of each object in a particular scene, which contributes to image content understanding.

The encoded image representation $\hat{V} = \{\hat{v}_{1}, \dots, \hat{v}_{N}\}$ is computed as described in Section 3.2.

Textual intra-modal relation module. Likewise, we apply an adaptive relation module to compute context features by focusing on the relevant words of the input sentence. We apply scaled dot-product attention (Vaswani et al., 2017) to compute the weighted average vectors.

First, the text features $E$ are linearly transformed by learnable projection matrices $U_{q}$ and $U_{k}$. The textual intra-modal relation weight matrix $W^{T}$ is computed as in Eq. 7 with a scaling factor $\sqrt{d_{e}}$, where $d_{e}$ is the dimension of the word features $e_{k}$. A softmax function is applied to normalize the intra-sentence weight matrix $W^{T}$.

$W^{T} = \operatorname{softmax}\!\left(\frac{(E U_{q})(E U_{k})^{\top}}{\sqrt{d_{e}}}\right)$   (7)

The $K$-head relation features of the input sentence are aggregated through sublayer addition; $K$ is set to 6.

$\hat{e}_{m} = e_{m} + \operatorname{Concat}\!\left[\sum_{n=1}^{L} W^{T,1}_{mn}\,(U^{1}_{V} e_{n}),\; \dots,\; \sum_{n=1}^{L} W^{T,K}_{mn}\,(U^{K}_{V} e_{n})\right]$, where $U^{k}_{V}$ is the value projection of the $k$-th head.   (8)
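For comparison, the textual intra-modal step reduces to scaled dot-product self-attention over the word vectors; the single-head sketch below (projection matrices are illustrative) shows the weight computation and the sublayer addition:

```python
import torch
import torch.nn.functional as F

def text_self_attention(E, U_q, U_k, U_v):
    """Single-head scaled dot-product self-attention over word vectors, E: [L, d]."""
    scores = (E @ U_q) @ (E @ U_k).t() / E.size(1) ** 0.5   # [L, L] word-word weights
    return E + F.softmax(scores, dim=-1) @ (E @ U_v)        # sublayer addition

d = 1024
E = torch.randn(12, d)                                      # 12 word vectors
U_q, U_k, U_v = (torch.randn(d, d) * 0.02 for _ in range(3))
E_hat = text_self_attention(E, U_q, U_k, U_v)
```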

3.3.2. Inter-modal relation module

For image-text matching, efficient and effective computation of cross-modal similarities is crucial. Given different queries, the importance of different image objects and the relevance between them are quite different.

Our inter-modal relation module considers the fact that the content of the sentence has an influence on the visual context. Sentences are considered weak annotations for image understanding.

While inferring the similarity, we employ inter-modal relation module to compare each image object to the corresponding sentence vector in order to determine the importance of image objects.

For each image-text pair, we focus on the relevant objects in an image with respect to each word in the text. First, we measure the object-word relevance as a weight matrix $W^{C}$ as follows:

$W^{C}_{kj} = \frac{\hat{v}_{k}^{\top}\hat{e}_{j}}{\lVert\hat{v}_{k}\rVert\,\lVert\hat{e}_{j}\rVert}$   (9)
$\tilde{v}_{k} = \alpha_{k}\,\hat{v}_{k}, \quad \alpha_{k} = \operatorname{softmax}_{k}\!\left(\sum_{j=1}^{L} W^{C}_{kj}\right)$   (10)

where $\hat{v}_{k}$ represents the $k$-th image object and $\hat{e}_{j}$ the $j$-th word from Section 3.3.1, and $W^{C}_{kj}$ is the inter-modal relevance between the object and the word. The output of this module is a set of image objects $\tilde{V} = \{\tilde{v}_{1}, \dots, \tilde{v}_{N}\}$ under the constraint of the corresponding sentence annotation, where $N$ is the number of image objects.

The importance distribution of the image objects is balanced according to the corresponding sentence's intent. The object-word alignments inferred after this step are more comprehensive.
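A hedged sketch of the inter-modal step: a cosine relevance matrix between objects and words is collapsed into one weight per object, which then re-weights the image objects. The max-pooling over words and the softmax normalization are assumptions; the paper specifies only a relevance matrix followed by sentence-conditioned object weighting.

```python
import torch
import torch.nn.functional as F

def inter_modal(V, E):
    """Re-weight image objects V [N, d] using sentence words E [L, d]."""
    rel = F.normalize(V, dim=-1) @ F.normalize(E, dim=-1).t()   # [N, L] cosine relevance
    alpha = F.softmax(rel.max(dim=1).values, dim=0)             # one weight per object
    return alpha.unsqueeze(-1) * V                              # sentence-conditioned objects

V_hat = inter_modal(torch.randn(36, 1024), torch.randn(12, 1024))
```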

3.4. Similarity Alignments

We compute the cosine similarity between each attended image object $\tilde{v}_{k}$ and the representation $e^{s}$ of the corresponding sentence to obtain the retrieval scores of an input image-text pair:

$R_{k} = \frac{\tilde{v}_{k}^{\top}\, e^{s}}{\lVert\tilde{v}_{k}\rVert\,\lVert e^{s}\rVert}$   (11)

In addition, both the intra-modal and inter-modal relation modules discover all possible alignments simultaneously, which is quite time-efficient. The final similarity is summarized with average pooling (AVG):

$S(I, T) = \frac{1}{N}\sum_{k=1}^{N} R_{k}$   (12)

Our method is trained with a bidirectional max-margin ranking loss, which is widely used in multi-modal matching tasks. For each positive image-text pair $(I, T)$, negative image-text pairs $(I, \hat{T})$ and $(\hat{I}, T)$ are sampled. In practice, for computational efficiency, rather than summing over all the negative samples, we consider only the hard negatives in a mini-batch of stochastic gradient descent. The network is trained to minimize the distance of positive pairs while maximizing that of negative pairs. The loss function is defined as Eq. 13.

$\mathcal{L}(I, T) = \left[\alpha - S(I, T) + S(I, \hat{T})\right]_{+} + \left[\alpha - S(I, T) + S(\hat{I}, T)\right]_{+}$   (13)

where $\alpha$ is the margin constraint, $[x]_{+} = \max(x, 0)$, and $\hat{T}$ and $\hat{I}$ denote the hardest negative sentence and image in the mini-batch. By minimizing this function, our network can focus on the important entities that appear in correct image-text pairs through the intra-modal and inter-modal relation modules.
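This matches the hard-negative triplet formulation popularized by VSE++; a minimal sketch, assuming a batch similarity matrix with the positive pairs on the diagonal:

```python
import torch

def hard_negative_triplet_loss(sims, margin=0.2):
    """sims[i, j] = S(image_i, text_j); matched pairs lie on the diagonal."""
    pos = sims.diag().view(-1, 1)                       # S(I, T) for positive pairs
    cost_t = (margin + sims - pos).clamp(min=0)         # image anchored, wrong texts
    cost_i = (margin + sims - pos.t()).clamp(min=0)     # text anchored, wrong images
    mask = torch.eye(sims.size(0), dtype=torch.bool)
    cost_t = cost_t.masked_fill(mask, 0)
    cost_i = cost_i.masked_fill(mask, 0)
    # Keep only the hardest negative per anchor, as described above.
    return cost_t.max(dim=1).values.mean() + cost_i.max(dim=0).values.mean()

loss = hard_negative_triplet_loss(torch.randn(128, 128))   # e.g. a mini-batch of 128 pairs
```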

Figure 4. The qualitative results of image-to-text retrieval. The first column is image queries; the second column shows the top-5 retrieved results. Green for correct answers, and red for wrong answers.

Figure 5. The qualitative results of text-to-image retrieval. The first line is text queries; the second line shows the top-3 retrieved results. Green for correct answers, and red for wrong answers.

4. Experiments

4.1. Datasets and Evaluation

4.1.1. Datasets

We evaluate our image-text matching model on the Microsoft COCO dataset (Lin et al., 2014), which contains 123,287 images with five descriptive sentences each. We use the same split as (Karpathy and Fei-Fei, 2015): 82,783 images are used for training, 5,000 for validation, and 5,000 for testing. Following (Faghri et al., 2017), 30,504 images that were originally in the validation set are added to the training set to improve accuracy, so there are 113,287 images in the training set in total. Our results are reported by either averaging over 5 folds of 1K test images or testing on the full 5K test images.

4.1.2. Evaluation Metrics

We adopt Recall@K (K=1,5,10) for retrieval evaluation. Recall@K represents the percentage of the queries where at least one ground-truth is retrieved among the top K results.
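A minimal sketch of Recall@K under the simplifying assumption of one ground-truth item per query (MS-COCO in fact has five captions per image):

```python
import torch

def recall_at_k(sims, k):
    """sims[i, j]: similarity of query i to candidate j; candidate i is the ground truth."""
    ranks = sims.argsort(dim=1, descending=True)             # candidates sorted by score
    gt = torch.arange(sims.size(0)).unsqueeze(1)
    return (ranks[:, :k] == gt).any(dim=1).float().mean().item()

sims = torch.randn(1000, 1000)                               # placeholder score matrix
print(recall_at_k(sims, 1), recall_at_k(sims, 5), recall_at_k(sims, 10))
```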

4.2. Implementation Details

The whole system is trained with the Adam optimizer (Kingma and Ba, 2015). Models are trained for 25 epochs. The initial learning rate is set to 0.0005. We use a mini-batch size of 128 for all experiments. The margin constraint $\alpha$ is set to 0.2.

4.3. Results on MS-COCO

Methods Image-to-Text Text-to-Image
R@1 R@5 R@10 R@1 R@5 R@10
DVSA (Karpathy and Fei-Fei, 2015) 38.4 69.9 80.5 27.4 60.2 74.8
mCNN(Ma et al., 2015) 42.8 73.1 84.1 32.6 68.6 82.8
HM-LSTM(Niu et al., 2017) 43.9 - 87.8 36.1 - 86.7
DSPE(Wang et al., 2015) 50.1 89.2 - 39.6 86.9 -
VSE++(Faghri et al., 2017) 64.6 - 95.7 52.0 - 92.0
DPC (Zheng et al., 2017) 65.6 89.8 95.5 47.1 79.9 90.0
Gen-GXN (Gu et al., 2018) 68.5 - 97.9 56.6 - 94.5
SCO (Yan et al., 2017) 69.9 92.9 97.5 56.7 87.5 94.8
SCAN (Lee et al., 2018) 70.9 94.5 97.8 56.4 87.0 93.9

ParNet (NP) 72.8 94.9 97.9 57.9 87.4 94.0
ParNet (P) 73.5 94.5 98.3 58.3 88.2 94.1
Table 1. Comparison of image-text matching results on the MS-COCO dataset in terms of Recall@K (R@K). ParNet (NP) denotes our model without the position-aware module and ParNet (P) the full model.
Methods Image-to-Text Text-to-Image
R@1 R@5 R@10 R@1 R@5 R@10
base 70.9 94.5 97.8 56.4 87.0 93.9
+intra-modal relation(only T) 72.3 95.3 98.2 57.5 87.1 93.8
+intra-modal relation(only I) 71.8 95.1 97.8 58.0 87.5 93.9
+intra & inter modal relation 72.8 94.9 97.9 57.9 87.4 94.0
+ position ($d_{s}$ = 32) 69.3 94.8 98.4 56.6 87.6 94.0
+ position ($d_{s}$ = 64) 73.5 94.5 98.3 58.3 88.2 94.1
+ position ($d_{s}$ = 128) 71.8 95.2 98.2 58.7 87.6 93.9
ParNet 73.5 94.5 98.3 58.3 88.2 94.1
Table 2. Ablation study of ParNet for image-text matching results on MS-COCO dataset
Figure 6. Visualization of the attention outputs in the second relation step of text-to-image retrieval for the query "Butter or cheese on top of some vegetables in a pan." (a) is the top-1 retrieval result of ParNet with the position-aware module, which is correct. (b) is the top-1 retrieval result of ParNet without the position-aware module, which is wrong. While both (a) and (b) detect the entities corresponding to nouns, e.g., 'vegetables' and 'pan', the result of ParNet without the position-aware module fails to focus on the relative position, e.g., 'on', and returns a wrong answer.

We compare our method with recently developed methods on the image-to-text and text-to-image retrieval tasks. The quantitative performance of our method on the MS-COCO test set, compared with state-of-the-art methods, is presented in Table 1. The compared methods are as follows:

DVSA (Karpathy and Fei-Fei, 2015) develops a deep neural network model that infers the latent alignment between segments of sentences and the objects of images; it is a combination of a CNN and a bidirectional RNN.

mCNN(Ma et al., 2015) consists of one image CNN encoding the image content and one matching CNN modeling the joint representation of image and sentence.

HM-LSTM(Niu et al., 2017) exploits the hierarchical relations between sentences and phrases, and between whole images and image objects, to jointly establish their representations.

DSPE(Wang et al., 2015) uses a two-branch neural network with multiple layers of linear projections followed by nonlinearities to learn joint embeddings of image-text pairs.

VSE++(Faghri et al., 2017) embeds whole images and sentences to a common space without the use of attention mechanism and also leveraged hard negatives sampling.

DPC (Zheng et al., 2017) proposes an instance loss that explicitly considers the intra-modal data distribution, based on the unsupervised assumption that each image/text group can be viewed as a class.

Gen-GXN (Gu et al., 2018) incorporates generative processes into the cross-modal feature embedding to learn global abstract features and local grounded features simultaneously.

SCO (Yan et al., 2017) improves the image representation by learning semantic concepts and then organizing them in a correct semantic order.

SCAN (Lee et al., 2018) considers the fact that the importance of words can depend on the visual context and employs cross attention to discover possible alignments.

For fair evaluation, we compare single-model accuracy obtained without model ensembles or fusion. We can see that our ParNet outperforms previous approaches on the MS-COCO dataset. R@1 reaches 73.5% for image-to-text retrieval, which is improved by 3.67% compared with SCAN (single model) (Lee et al., 2018). R@1 for text-to-image retrieval reaches 58.3%, 2.84% higher than SCAN. These improvements show the effectiveness of inferring the latent relationships.

In Table 2, our method without the position-aware relation module achieves R@1 = 72.8%, which is superior to our baseline; this confirms the effectiveness of our aggregated two-step relation module. Our ParNet with the position-aware module achieves 73.5% R@1. It is clear that the added spatial information helps capture the correspondence between image and text.

The qualitative results of image-to-text retrieval and text-to-image retrieval are illustrated in Figure 4 and Figure 5, respectively. In Figure 4, the first column is the image queries and the second column is the top-5 retrieved results. The green results represent correct answers, and the red ones represent wrong answers. We can see that some of the wrong answers also have a similar meaning to the corresponding query (e.g., 'A toilet and sink in a small room.'). In Figure 5 (b), the third result captures the spatial relation 'in' correctly, but mistakes a teddy bear for a cat.

4.4. Visualization Analysis of the Position-aware Module

Figure 6 visualizes the attention outputs in the second relation step of text-to-image retrieval for the query "Butter or cheese on top of some vegetables in a pan." Figure 6 (a) is the top-1 retrieval result of ParNet with the position-aware module, which is correct. Figure 6 (b) is the top-1 retrieval result of ParNet without the position-aware module, which is wrong. While both Figure 6 (a) and Figure 6 (b) detect the entities corresponding to nouns, e.g., 'vegetables' and 'pan', the result of ParNet without the position-aware relation module fails to focus on the relative position, e.g., 'on', and returns a wrong answer. This confirms the effectiveness of the position-aware relation module. Moreover, the correct correspondence between attended objects and words (e.g., 'butter', 'vegetables', 'pan') in the query shows that our two-step aggregated relation module is well suited to capturing latent image-text alignments between modalities.

5. Conclusion

In this paper, we propose a position-aware aggregated relation network to bridge the semantic gap between image and text. We propose a position-aware relation module for images to capture both semantic and spatial relationships. We then combine intra-modal and inter-modal relations to capture the alignments of entities in images (and sentences). Our model achieves state-of-the-art performance on the image-text matching task, showing its effectiveness in extracting latent alignments between image-text pairs.

References