Target-Oriented Deformation of Visual-Semantic Embedding Space

10/15/2019, by Takashi Matsubara et al.

Multimodal embedding is a crucial research topic for cross-modal understanding, data mining, and translation. Many studies have attempted to extract representations from given entities and align them in a shared embedding space. However, because entities in different modalities exhibit different abstraction levels and modality-specific information, it is insufficient to embed related entities close to each other. In this study, we propose the Target-Oriented Deformation Network (TOD-Net), a novel module that continuously deforms the embedding space into a new space under a given condition, thereby adjusting the similarities between entities. Unlike methods based on cross-modal attention, TOD-Net is a post-process applied to the embedding space learned by existing embedding systems and improves their retrieval performance. In particular, when combined with cutting-edge models, TOD-Net yields a state-of-the-art cross-modal retrieval model on the MSCOCO dataset. Qualitative analysis reveals that TOD-Net successfully emphasizes entity-specific concepts and retrieves diverse targets, handling a higher level of diversity than existing models.


I Introduction

Our society is immersed in the age of big data, which includes diverse modalities such as text, images, audio, and video. Entities in different modalities exhibit different abstraction levels and modality-specific information. This property increases the difficulty of cross-modal understanding and data mining. Existing works have mainly employed multimodal embedding, which maps entities in different modalities to vectors in a common space [1, 2, 3, 4, 5, 6, 7]. Euclidean measures then evaluate the similarity between entities in order to facilitate tasks such as retrieval and translation.

However, as shown in Fig. 1, Euclidean embedding has a limitation. Consider visual-semantic embedding, and imagine an image that depicts a person dressed in green and engaging in kite snowboarding. One caption simply describes the activity as snowboarding; with this caption, other images could also be retrieved, such as one depicting a snowboarder dressed in black. Another caption specifically describes the activity as kite snowboarding and can therefore retrieve the first image selectively. Yet another caption focuses on the appearance (green clothing) and likewise retrieves only the first image. The latter two captions are dissimilar to each other and should not retrieve the second image. Captions often reference a specific aspect of an image; hence, it is insufficient to embed these entities close to each other.

Fig. 1: Conceptual diagram of the problem of interest in this study. Three captions reference specific aspects of the same image, and hence, they are dissimilar to one another. One of the captions can also match another image, but the other two should not. Therefore, it is insufficient to embed them close to each other in the Euclidean space.

Several studies have employed an ordered vector space or a hyperbolic space to capture the hierarchical relationship between words (hypernym and hyponym) or tree nodes [8, 9, 10]. Conversely, visual-semantic embedding has only two hierarchies (caption and image) and cannot benefit from the constraints of hierarchical relationships. Recent studies that focused specifically on image-caption retrieval replaced embedding with cross-modal attention to obtain similarity scores directly [11, 12]. Despite the improved performance, their generality to other tasks and modalities is limited. Recently, in the context of fashion recommendations, a conditional similarity network (CSN) was proposed [13, 14]. For retrieval, the CSN focuses on a target aspect (e.g., color) and ignores others (e.g., category and occasion) by disregarding the dimensions of no interest. Similarly, manipulating the Euclidean space for visual-semantic embedding potentially overcomes the gap between the modalities.

In this paper, we propose the Target-Oriented Deformation Network (TOD-Net). TOD-Net is constructed using a flow-based model, namely a conditional version of the Real-NVP network [15]. Flow-based models theoretically approximate only continuous bijective functions. TOD-Net is installed on top of an existing visual-semantic embedding system and, by virtue of the bijective property, continuously deforms the embedding space into a new space. Through the deformation, the embedding space becomes specialized to the specific concept in the condition and adjusts the similarity accordingly. For example, if the condition is a caption describing the appearance, TOD-Net emphasizes the concept describing the appearance in the embedded images and avoids false positives that show other appearance aspects.

To end the Introduction, we summarize the contributions of this paper.

  • We propose TOD-Net, which is installed on top of a visual-semantic embedding system. TOD-Net learns a conditional bijective mapping and deforms the shared embedding space into a new space. By that means, TOD-Net adjusts the similarities between entities under a condition while preserving the topological relations between them.

  • Unlike existing methods based on an object detector and cross-modal attention [16, 11, 12, 17, 18, 19], TOD-Net is applied to a fixed embedding space and improves the retrieval performance even when using a single-image encoder. This fact indicates that a single-image encoder already extracts detailed concepts from entities but encounters difficulty in expressing their relations in the embedding space. Because no object detector is needed and TOD-Net is applied only at the very last phase, the computational cost is greatly reduced.

  • We combine TOD-Net with existing models and conduct extensive experiments. The numerical results demonstrate that TOD-Net generally improves the performance of existing models based on visual-semantic embedding, thus achieving a state-of-the-art model for image-caption retrieval.

  • A qualitative analysis demonstrates that TOD-Net successfully captures entity-specific concepts, which are often suppressed by existing models because of the diversity among entities belonging to the same group.

II Related work

II-A Conventional and hierarchical embedding

An embedding system maps entities such as words, text, and images to vectors in a Euclidean embedding space. The similarity between two entities is defined through the negative Euclidean distance, inner product, or cosine similarity in the embedding space. Numerous studies on visual-semantic embedding have investigated network architectures and objective functions. Typically, a pretrained convolutional neural network (CNN) has been employed for encoding images [1, 2, 6, 7, 20]. For captions, a recurrent neural network (RNN) following a word embedding model has been a common choice [21].

Captions often focus on a specific aspect of an image while ignoring others, as depicted in Fig. 1. This means that a caption is an abstraction of an image and there exists a hierarchical relationship between them. Words and graphs also have hierarchical relationships, between hypernym and hyponym or between a parent node and its children in a tree. Hence, point embedding in the Euclidean space has a limitation. Order-embedding and related works have tackled this difficulty by embedding an entity as a vector in an ordered vector space [8], as a vector or a cone in a hyperbolic space [9, 22], or as a Gaussian distribution [23, 10]. These studies embed entities so that two hyponyms are less similar to each other than to their hypernym, and have exhibited remarkable results in word and graph embeddings. However, visual-semantic embedding has only two hierarchies (image and caption) and cannot benefit from the constraints of hierarchical relationships. In the original study on order-embedding, entities were embedded on a hypersphere for visual-semantic embedding even though such an embedding cannot express hierarchical relationships [8, 6].

II-B Adaptive embedding

Recently, in the context of fashion recommendations, a conditional similarity network (CSN) was proposed [13]. There are many aspects of similarity to consider when retrieving a fashion item (e.g., occasion, category, and color). For example, white sneakers are similar to jeans from the aspect of occasion, to court shoes from the aspect of category, and to a white jacket from the aspect of color. CSN learns a template for each aspect in order to rescale the dimensions of the embedded vectors, thereby emphasizing a given aspect. In another example, SCE-Net [14] extended CSN by inferring an appropriate aspect from a given input pair.
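The following PyTorch snippet is a minimal sketch of this rescaling idea, assuming a fixed set of aspects; the class and parameter names are illustrative and do not reproduce the actual CSN implementation of [13].

```python
import torch
import torch.nn as nn

class AspectMask(nn.Module):
    """CSN-style conditional rescaling of an embedding (an illustrative sketch).

    One learned template per aspect rescales the dimensions of an embedded
    vector so that only dimensions relevant to that aspect contribute to the
    similarity. `num_aspects` and `dim` are hypothetical parameters, not the
    configuration of [13].
    """

    def __init__(self, num_aspects: int, dim: int):
        super().__init__()
        self.masks = nn.Parameter(torch.rand(num_aspects, dim))

    def forward(self, embedding: torch.Tensor, aspect: int) -> torch.Tensor:
        # Dimensions of no interest are driven toward zero during training.
        return embedding * torch.relu(self.masks[aspect])
```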

As in these studies, our work adopts the concept of conditional adjustment of the embedding space.

II-C Cross-modal attention

Cutting-edge methods for image-caption retrieval sometimes employ an object detector and cross-modal attention [16, 11, 12, 17, 18, 19]. A pretrained object detector crops multiple subregions of an image, and then cross-modal attention is performed over the cropped regions and the words in a caption. The cross-modal attention attends to the subset of cropped regions related to the words when evaluating similarity, while discarding the remaining regions. Thereby, the cross-modal attention handles hierarchy and polysemy.

The main drawback of these methods is computational time. Object detectors require a larger input image than single-image encoders and perform an additional region-proposal step. Cross-modal attention is performed for every possible pair of regions and words, and its computational cost is proportional to the numbers of entities, regions, and words. Conversely, TOD-Net is a tiny neural network that receives only embedded vectors; its computational cost is much less than that of cross-modal attention, even though it is still higher than that of the cosine similarity. Moreover, an attention module is designed specifically for image-caption retrieval and its generality to other modalities and tasks is limited, whereas the output of TOD-Net is still an embedded vector that is potentially applicable to other tasks.

Fig. 2: Conceptual diagram of the proposed Target-Oriented Deformation Network (TOD-Net). A visual-semantic embedding system has an image encoder and a text encoder that map images and captions to vectors in a shared embedding space. However, captions often reference a specific aspect of an image, and their hierarchical relationship is never evaluated appropriately as long as the embedding space is Euclidean. TOD-Net deforms the embedding space under a condition and provides a new embedding space for each condition. By that means, entity-specific detailed concepts such as appearance, activity, or background are emphasized, and diverse targets can be retrieved by a single query.

III Methods

III-A Preliminaries

We assume that a backbone model is composed of an image encoder $f_{\mathrm{img}}$ and a text encoder $f_{\mathrm{txt}}$. The image encoder $f_{\mathrm{img}}$ is composed of a convolutional neural network (CNN), a pooling layer, and a fully connected layer [6, 7]. Given an image $x$ in an image data space $\mathcal{X}$, the image encoder maps it to a vector $z_x$ in a $D$-dimensional embedding space $\mathcal{E}$. The text encoder $f_{\mathrm{txt}}$ is a recurrent neural network (RNN) or a transformer network [24]. It maps a given caption $y$ in a caption data space $\mathcal{Y}$ to a vector $z_y$ in the same embedding space $\mathcal{E}$. In other words, $z_x = f_{\mathrm{img}}(x)$ and $z_y = f_{\mathrm{txt}}(y)$. Note that these maps are sometimes stochastic in the training phase due to stochastic components such as dropout and batch normalization [25]. The encoders are trained under an objective function that evaluates the similarity between two entities using the cosine similarity

$$ s(z_x, z_y) = \frac{\langle z_x, z_y \rangle}{\| z_x \| \, \| z_y \|}, $$

where $\langle \cdot, \cdot \rangle$ and $\| \cdot \|$ denote the inner product and the Euclidean norm, respectively. Other similarities defined in a vector space are acceptable.
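As a concrete illustration of this setup, the PyTorch sketch below computes pairwise cosine similarities between a batch of image embeddings and a batch of caption embeddings; the function name and tensor shapes are illustrative, not the backbone models' actual code.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities s(z_x, z_y) between embedded images and captions.

    img_emb: (N, D) image embeddings from the image encoder.
    txt_emb: (M, D) caption embeddings from the text encoder.
    Returns an (N, M) matrix of similarities.
    """
    img_emb = F.normalize(img_emb, dim=-1)  # divide by the Euclidean norm
    txt_emb = F.normalize(txt_emb, dim=-1)
    return img_emb @ txt_emb.t()            # inner products of unit vectors
```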

The quality of the embedding is evaluated using image-caption retrieval, which receives a query and ranks targets according to the similarity [1, 2, 3, 4, 5, 6, 7]. As shown in Fig. 1, a single image can be retrieved by two distinct captions. The image depicts a person dressed in green and engaging in kite snowboarding. One caption describes the activity in detail, and the other focuses on the appearance; thus, each caption references a different aspect of the image. When a model focuses on appearance, the latter caption comes closer to the image in the embedding space, while the former moves further away.

III-B Target-oriented deformation network

To adjust the embedded vectors, we propose the target-oriented deformation network (TOD-Net) $g$ for visual-semantic embedding. Its usage is summarized in Fig. 2. TOD-Net is defined with a condition $c$, which is also an embedded vector. TOD-Net receives an embedded vector $z \in \mathcal{E}$ and outputs a vector $g(z; c)$ of the same size. When fed a set of embedded vectors in $\mathcal{E}$, the outputs of TOD-Net form a new embedding space $\mathcal{E}_c = \{ g(z; c) \mid z \in \mathcal{E} \}$ under the condition $c$. TOD-Net is then trained according to the similarity in the new embedding space $\mathcal{E}_c$, while the original embedding space $\mathcal{E}$ is unchanged.

To obtain the similarity between a pair consisting of a query and one of the targets for image-caption retrieval, the target is fed to TOD-Net as the condition $c$. Then, both elements of the pair are mapped into the new embedding space $\mathcal{E}_c$ by TOD-Net, and their similarity is calculated as the cosine similarity in $\mathcal{E}_c$. For example, the similarity for caption retrieval is

$$ \mathrm{sim}(y, x) = s\bigl( g(z_y; z_y),\; g(z_x; z_y) \bigr), $$

where $y$ and $x$ denote a caption and an image, respectively, $z_y$ and $z_x$ denote their embedded vectors, and the condition is the embedded vector $z_y$ of the target caption. Hence, we refer to the proposed method as target-oriented. From the viewpoint of neural networks, TOD-Net is installed on top of an existing visual-semantic embedding system.

Through this process, TOD-Net is expected to deform the embedding space such that the concepts indicated by the condition $c$ are emphasized while the others are ignored. For example, when a caption related to appearance is fed to TOD-Net as the condition $c$, TOD-Net pays attention to the appearance in the image and suppresses other concepts such as activity, weather, and background.
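The following PyTorch sketch illustrates this target-oriented similarity, assuming a hypothetical callable `tod_net(z, cond)` for the conditional mapping $g$; it is a sketch of the idea rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def target_oriented_similarity(query_emb: torch.Tensor,
                               target_emb: torch.Tensor,
                               tod_net) -> torch.Tensor:
    """Similarity of a query/target pair in the deformed space (illustrative).

    The target embedding is reused as the condition c; both vectors are mapped
    by TOD-Net g(.; c) and compared by cosine similarity in the new space.
    `tod_net(z, cond)` is a hypothetical callable for the conditional mapping.
    """
    cond = target_emb                      # target-oriented: condition = target
    q = tod_net(query_emb, cond)           # g(z_query; c)
    t = tod_net(target_emb, cond)          # g(z_target; c)
    return F.cosine_similarity(q, t, dim=-1)
```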

Fig. 3: Diagram of the forward path of a coupling layer of the conditional Real-NVP network [15]. In this paper, we do not use the backward path.

III-C Construction of TOD-Net

If TOD-Net $g$ were a basic multilayer perceptron (MLP), it could approximate arbitrary continuous functions [26]. However, such a network risks disturbing the relations between embedded vectors in the original embedding space $\mathcal{E}$ that the embedding system has learned. Instead, we employ a conditional version of the Real-NVP network [15] for TOD-Net $g$, as depicted in Fig. 3. The Real-NVP network is a neural network architecture categorized as a flow-based model, which theoretically approximates only continuous bijective functions. Flow-based models have been investigated as generative models in which the sample likelihood is calculable by the change of variables. Owing to its continuous bijective nature, TOD-Net continuously deforms the original embedding space $\mathcal{E}$ into $\mathcal{E}_c$ and is expected to adjust only the similarity (distance, metric) while conserving the topological relations between embedded vectors.

The conditional Real-NVP network is composed of multiple coupling layers, each of which contains an MLP in our work. An embedded vector $z$ is divided into two vectors $z_1$ and $z_2$, each of $D/2$ dimensions. The vector $z_1$ and the condition $c$ are jointly fed to the coupling layer's MLP, which produces two $D/2$-dimensional vectors $a$ and $b$. With the vectors $a$ and $b$, the remaining vector $z_2$ is transformed element-wise as

$$ z_2' = z_2 \odot \exp(a) + b, \qquad z_1' = z_1, $$

where $\odot$ denotes the element-wise product and the input $z_1$ is kept as it is. The pair of resultant vectors $(z_1', z_2')$ is the output of the coupling layer. Intuitively, the vector $z_2$ is ciphered with the key vectors $a$ and $b$. The coupling layer is continuous as long as the MLP is continuous. In the following coupling layer, the roles are swapped and $z_1'$ is ciphered by using the pair of $z_2'$ and $c$ as a key. Each coupling layer forms a bijective function because its inverse can be obtained as

$$ z_2 = \bigl(z_2' - b\bigr) \odot \exp(-a), \qquad z_1 = z_1'. $$

The conditional Real-NVP network is a composition of coupling layers, and thus it also forms a bijective function. The hyperparameters are the number of coupling layers, the number of hidden layers in each coupling layer, and the number of units in each hidden layer.
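The sketch below shows one conditional coupling layer in the spirit of the construction described above; the exponential scale is the standard choice in [15] to guarantee invertibility, and the layer sizes and names are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """One conditional coupling layer in the spirit of Real-NVP [15] (a sketch).

    The embedded vector is split into halves (z1, z2); an MLP fed with [z1, c]
    produces a scale a and a shift b, and z2 is transformed element-wise while
    z1 passes through unchanged, which keeps the map invertible.
    The layer sizes are illustrative, not the paper's exact configuration.
    """

    def __init__(self, dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        assert dim % 2 == 0
        self.half = dim // 2
        self.mlp = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, dim),  # outputs [a, b], each of size dim // 2
        )

    def forward(self, z: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        z1, z2 = z[..., :self.half], z[..., self.half:]
        a, b = self.mlp(torch.cat([z1, c], dim=-1)).chunk(2, dim=-1)
        z2 = z2 * torch.exp(a) + b   # element-wise "ciphering" of z2
        # A subsequent coupling layer would swap the roles of z1 and z2.
        return torch.cat([z1, z2], dim=-1)
```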

IV Experiments

IV-A Backbone models

To evaluate our proposed TOD-Net, we employed VSE++ [6], DSVE-loc [7], and the BERT model [24] as backbone models. We followed the experimental settings shown in the original studies unless otherwise stated.

VSE++ is a commonly used baseline. The image encoder is a 152-layer ResNet [27, 28] pretrained on the ImageNet dataset [29], with the final fully connected layer replaced for embedding. The text encoder is a single-layer GRU network [30] trained from scratch. The dimensionality of the embedding space is 1024. The source code is available in the original repository (https://github.com/fartashf/vsepp).

DSVE-loc is a state-of-the-art model for image-caption retrieval. The image encoder is a modified 152-layer ResNet that has a convolution layer and a special pooling layer instead of a global pooling layer [31], and is pretrained on the ImageNet dataset [29]. The text encoder is composed of a word embedding model pretrained by Skip-Thought [32] and a four-layer SRU network [33] trained from scratch. The dimensionality of the embedding space is 2400.

We also report results obtained using the BERT model as the text encoder. It is based on a transformer network and is pretrained on various language processing tasks [24]; we used the pretrained model posted at https://github.com/huggingface/pytorch-transformers. After the transformer layers, the outputs are averaged over a sentence, and a fully connected layer is added for embedding. For the image encoder, we employed a modified and pretrained ResNet [31]. This model is not an existing work.

IV-B Dataset and training procedure

We evaluated our proposal on the MS COCO dataset [34] using the splits employed by VSE++ [6]. We used 113,287 images for training, 5,000 images for validation, and 5,000 images for testing. Each image has five captions as positive targets.

For image retrieval, each backbone model was trained so that the similarity between a query caption and its designated target image (called a positive target) would be larger than that to any other target image (called a negative target). The loss function is a hinge rank loss,

$$ \bigl[\, \alpha - s(z_y, z_x) + s(z_y, z_{\bar{x}}) \,\bigr]_+ , $$

where $[\,\cdot\,]_+ = \max(\cdot, 0)$ is the positive part, $\alpha$ is a margin parameter, $x$ is the positive target image, and $\bar{x}$ is a negative target image. In a mini-batch, all images except the positive target image are treated as negatives. Following VSE++ [6], we minimized the loss function only with the hardest negative target image in the mini-batch. Specifically,

$$ \ell(y, x) = \max_{\bar{x}} \bigl[\, \alpha - s(z_y, z_x) + s(z_y, z_{\bar{x}}) \,\bigr]_+ . \tag{1} $$

The loss function for caption retrieval was defined in the same way. The similarity function was the cosine similarity. The batch size was 128 for VSE++ and 160 for DSVE-loc and the BERT model.
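A compact PyTorch sketch of this hardest-negative hinge loss, computed from an in-batch similarity matrix, is shown below; the margin value and the convention that the i-th target is the positive for the i-th query are illustrative assumptions, not the paper's exact settings.

```python
import torch

def max_hinge_loss(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Hinge rank loss with the hardest in-batch negative, in the spirit of Eq. (1).

    sim: (B, B) matrix of similarities for a mini-batch, where sim[i, i] is
    assumed to be the positive pair. The margin value is illustrative.
    """
    pos = sim.diag().view(-1, 1)                        # s(z_y, z_x) for positives
    cost = (margin - pos + sim).clamp(min=0)            # [alpha - s_pos + s_neg]_+
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost = cost.masked_fill(eye, 0.0)                   # exclude the positive pair
    return cost.max(dim=1).values.mean()                # hardest negative per query
```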

Each backbone model was trained using the Adam optimizer [35]. Note that the source code of VSE++ in the original repository has been updated; we trained it from scratch. We trained its text encoder and the embedding layer of its image encoder for 30 epochs and then the whole model for 15 epochs. The learning rate was multiplied by 0.1 at the end of the 15th epoch. For DSVE-loc, we used the pretrained model posted in the original repository. For the BERT model, we trained the embedding layers of both encoders for one epoch and the whole model for 29 epochs (i.e., 30 epochs in total). The learning rate was multiplied by 0.1 at the ends of the first and 15th epochs. The first 16 of the 24 layers of the BERT model were frozen in order to retain their pretrained features and to reduce memory consumption. We used a data augmentation strategy similar to that of DSVE-loc: a random resize-and-crop to 256 pixels and a random horizontal flip were applied to the training images, and a resize to 350 pixels was applied to the validation and test images.
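One possible torchvision expression of this augmentation pipeline is sketched below; the transform choices are our reading of the description above, and normalization or other preprocessing steps from the original repositories are omitted.

```python
from torchvision import transforms

# Training-time augmentation as described above: random resize-and-crop to 256
# pixels and a random horizontal flip. Normalization and any further steps
# from the original repositories are omitted here.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(256),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Validation and test images are only resized to 350 pixels.
eval_tf = transforms.Compose([
    transforms.Resize(350),
    transforms.ToTensor(),
])
```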

IV-C Training of TOD-Net

After pretraining each backbone model, we installed TOD-Net on top of it. For the conditional Real-NVP network, we set the number of coupling layers to 3 and the number of hidden layers in a coupling layer to 2; the number of hidden units was set separately for VSE++ and the BERT model on the one hand and for DSVE-loc on the other. We used ReLU activation functions [36]. We trained TOD-Net for 30 epochs while freezing the feature extractors, using the Adam optimizer [35]. The learning rate was multiplied by 0.1 at the end of the 15th epoch.

We report results averaged over three runs. A typical measure of retrieval performance is recall at K, the fraction of queries for which a positive target is ranked within the top K candidates. R@K denotes the performance of caption retrieval, and Ri@K denotes that of image retrieval. In particular, R@1, R@5, R@10, Ri@1, Ri@5, Ri@10, and their average (mean recall; mR) are commonly used for model evaluation. After every epoch, the model was evaluated using mR on five folds of the validation set, and the best snapshot was saved. We omit the median rank (Med r), the median rank of the first positive target, because it saturates at 1.0 for all our methods and for almost all existing state-of-the-art methods.
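For reference, the sketch below computes R@K from a similarity matrix; it assumes, for simplicity, a one-to-one pairing of queries and positive targets, whereas the MS COCO protocol pairs each image with five captions and needs slightly different bookkeeping.

```python
import torch

def recall_at_k(sim: torch.Tensor, k: int) -> float:
    """R@K from a (num_queries, num_targets) similarity matrix (a sketch).

    Assumes, for simplicity, that target i is the positive for query i; the
    MS COCO protocol with five captions per image needs extra bookkeeping.
    """
    ranked = sim.argsort(dim=1, descending=True)         # targets sorted per query
    positives = torch.arange(sim.size(0)).unsqueeze(1)   # index of each positive
    hits = (ranked[:, :k] == positives).any(dim=1)       # positive within top K?
    return hits.float().mean().item()
```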

V Results and discussion

V-A Combined with backbone models

We combined TOD-Net with the backbone models VSE++, DSVE-loc, and the BERT model, and report the results in Table I. In all three cases, TOD-Net improves the retrieval performances R@1, R@5, Ri@1, and Ri@5 by significant margins. In particular, TOD-Net improves the performance of DSVE-loc and the BERT model even though they are state-of-the-art methods. Recall that the feature extractors are frozen while training TOD-Net. This indicates that TOD-Net does not owe its performance improvements to longer training but to the deformed embedding space, which potentially resolves the limitation of the fixed Euclidean space. TOD-Net is thus considered universally applicable to visual-semantic embedding methods based on point embedding. On the other hand, the improvements in R@10 and Ri@10 are limited. When a backbone model extracts the important concepts from an entity but fails to embed them properly, a positive target is ranked near but not at the top. In this case, TOD-Net deforms the embedding space to adjust the embedded vectors and ranks the positive target at the top, improving R@1 and Ri@1. When a backbone model fails in concept extraction, a positive target is ranked far from the top, and TOD-Net cannot adjust it. Hence, TOD-Net is good at improving R@1 and Ri@1 but less effective at improving R@10 and Ri@10.

Model          | R@1  | R@5  | R@10 | Ri@1 | Ri@5 | Ri@10 | mR
VSE++ [6]      | 65.9 | 90.7 | 96.2 | 52.9 | 84.6 | 92.4  | 80.5
  + TOD-Net    | 68.6 | 92.0 | 96.9 | 54.5 | 85.3 | 92.4  | 81.6
DSVE-loc [7]   | 69.8 | 91.9 | 96.6 | 55.9 | 86.9 | 94.0  | 82.5
  + TOD-Net    | 72.3 | 93.4 | 97.4 | 58.5 | 88.3 | 94.6  | 84.1
BERT [24]      | 74.1 | 94.9 | 98.2 | 60.8 | 89.2 | 94.9  | 85.4
  + TOD-Net    | 75.8 | 95.3 | 98.4 | 61.8 | 89.6 | 95.0  | 86.0

TABLE I: Results of TOD-Net Combined with Backbone Models. R@K: caption retrieval; Ri@K: image retrieval; mR: average of the six recalls.

Query 1: an image of a person dressed in green, kite snowboarding. Query 2: an image of a bowl of fruit topped by a pineapple.

Top 5 retrieved captions for each model and query (✓ denotes a positive target):

VSE++ (Query 1):
  a person jumping a snow board in the air
  A man kite snowboarding on a sunny day
  A man on a snowboard does an air trick.
  A snowboarder is is the air over the snow.
  a person flying in the air while on a ski board.

VSE++ + TOD-Net (Query 1):
  a person with green clothes and green board snowboarding.
  A man kite snowboarding on a sunny day
  a person jumping a snow board in the air
  a person riding a snow board in the air
  A person flying through the air on the kite board.

VSE++ (Query 2):
  A lot of fruits that are in a bowl.
  A fruit and vegetable stand has hanging fruit.
  there are many crates filled with fruits and vegetable
  A number of fruits and nuts on a stone
  A pile of wooden boxes filled with fruits and vegetables.

VSE++ + TOD-Net (Query 2):
  A pine apple on top of a pile of mixed fruit.
  A lot of fruits that are in a bowl.
  A bowl of assorted fruit with a huge pineapple on top.
  A number of fruits and nuts on a stone
  A fruit bowl containing a pineapple, an orange and several pears.

TABLE II: Typical Successful Caption Retrieval by TOD-Net.

V-B Interpretation of results

To visualize the contribution of TOD-Net, we combined TOD-Net with VSE++ and provide an example image (see Query 1 in Table II). With this image as a query, the unmodified VSE++ did not rank a positive caption at the top in any of the three trials, whereas VSE++ with TOD-Net succeeded in all three trials. The unmodified VSE++ ranked captions such as “a person jumping a snow board in the air” at the top. These captions appear to be positive targets but are not the designated ones. One of the actual positive captions is “a person with green clothes and green board snowboarding,” which focuses on the person's appearance. Another positive caption is “A man kite snowboarding on a sunny day,” which describes the activity more specifically as kite snowboarding. If the embedded vector of the image contained the concept relating to the appearance, it would retrieve the former caption but fail to retrieve other captions that do not mention the appearance. An embedded vector that focuses on the specific activity would likewise retrieve nothing but the latter caption. Hence, VSE++ extracts minimal concepts from the image to accept both cases, and thus retrieves many false positives.

In fact, VSE++ does not completely lose the detailed concepts. Through the deformation of the original embedding space, as shown in Fig. 2, TOD-Net emphasizes the remaining specific concepts depending on the condition (i.e., the target). As a result, a single image retrieves diverse captions, as summarized in the lower portion of Table II.

Query 2 in Table II is another example. The query image depicts a bowl full of fruit, among which the pineapple has the greatest presence. To avoid failing to retrieve captions that do not mention the pineapple, the image encoder puts little focus on the pineapple, which results in false positives that do not mention the pineapple. Thanks to TOD-Net, which accepts a target caption as the condition, the image query can retrieve the positive captions that mention the pineapple.

A similar result can be found in image retrieval, as shown in Table III. The query caption mentions the stairwell as well as the bench. However, another caption of the same image instead mentions the artwork, and yet another mentions the hallway. Hence, the embedded vectors cannot afford to focus on the stairwell or the hallway, and this leads to false-positive images that depict only benches. TOD-Net emphasizes the concept relating to the stairwell in both embedded vectors and retrieves the positive target selectively.

Query:
      A wooden bench sits next to a stairwell.
Other captions of the same image:
      A brown wooden bench sitting up against a wall.
      A bench next to a wall with a staircase behind it.
      Single wooden bench in corridor with artwork displayed above.
      The bench in the hallway of the building is empty.
[Top 5 retrieved images for VSE++ and for VSE++ + TOD-Net; a red border denotes the positive target.]
TABLE III: Typical Successful Image Retrieval by TOD-Net.

Recent studies on image-caption retrieval have employed cross-modal attention to focus on concepts shared by a query and a target [16, 11, 12, 17, 18, 19]. Cross-modal attention is performed at an early phase over multiple cropped regions of an image and the words of a caption. Conversely, TOD-Net is applied to the learned embedding space as the last step and retrieves diverse targets for a query. The results indicate that the backbone models successfully extract detailed concepts from entities even though they are single-image encoders with neither object detectors nor cross-modal attention; rather, they encounter a difficulty in aligning the entities in a single Euclidean space. TOD-Net resolves this difficulty with a minimal modification.

V-C Ablation study

We performed an ablation study on TOD-Net using VSE++. We again report the performance of VSE++ with and without TOD-Net in the first two rows of Table IV. In the third row, we report how TOD-Net performs without the condition. Performance is slightly improved compared to the scenario without TOD-Net in the second row, simply owing to a deeper network. However, the improvement is limited, suggesting the importance of the condition.

As described in Section III-B, the condition of TOD-Net is a target: a caption for caption retrieval and an image for image retrieval. Alternatively, one can consider other conditions, such as the query. One can also use the caption as the condition for both caption and image retrieval, or likewise use the image. We report these scenarios in the fourth through sixth rows of Table IV. Performance is improved in many scenarios, but none of them is superior to the case in which the condition is a target (see the first row). This is because TOD-Net can easily determine which concept to emphasize when given a target, whereas doing so becomes more difficult under other conditions.

TOD-Net   | Condition | R@1  | R@5  | R@10 | Ri@1 | Ri@5 | Ri@10 | mR
Real-NVP  | target    | 68.6 | 92.0 | 96.9 | 54.5 | 85.3 | 92.4  | 81.6
No        | –         | 65.9 | 90.7 | 96.2 | 52.9 | 84.6 | 92.4  | 80.5
Real-NVP  | no        | 66.6 | 91.5 | 96.8 | 54.0 | 85.1 | 92.6  | 81.1
Real-NVP  | caption   | 68.5 | 91.5 | 96.7 | 53.6 | 84.7 | 91.9  | 81.2
Real-NVP  | image     | 67.9 | 91.7 | 96.7 | 54.2 | 84.8 | 91.7  | 81.2
Real-NVP  | query     | 68.2 | 91.8 | 96.8 | 53.8 | 84.8 | 92.1  | 81.3
MLP       | target    | 68.3 | 91.8 | 96.5 | 53.8 | 84.8 | 91.9  | 81.2

TABLE IV: Ablation Study of TOD-Net Combined with VSE++ [6]. R@K: caption retrieval; Ri@K: image retrieval.

We also evaluated the scenario in which TOD-Net is composed of a conditional MLP (see the bottom row). The MLP has four hidden layers. The performance improves over the baseline but is not superior to that of the conditional Real-NVP network. We found the same results with two and six hidden layers. The MLP approximates arbitrary functions, whereas the Real-NVP network approximates bijective functions. The Real-NVP network preserves the topological relations between embedded vectors in the original embedding space, whereas the MLP could disturb them.

V-D Comparison with state-of-the-art models

In Table V, we compare the experimental results against state-of-the-art methods that employ CNN-based single-image encoders. TOD-Net with DSVE-loc achieved the best results on six of the seven criteria, and TOD-Net with the BERT model outperformed all other methods on all seven criteria.

Several recent methods for image-caption retrieval employ an object detector and cross-modal attention. For example, SCO [16], SCAN [11], and MTFN [37] crop 10, 24, and 36 regions, respectively, and merge them nonlinearly by cross-modal attention. It is known that a simple average over multiple regions already improves performance significantly [6]. We summarize the results in Table VI; in each row, the number of cropped regions is indicated in parentheses. (MTFN [37] also proposed a re-ranking algorithm, which finds the best match between a set of queries and a set of targets rather than between a single query and a set of targets; we omit this result because its problem setting differs completely from those of the other studies.) Although TOD-Net takes only one region, its performance is already comparable or superior to that of the state-of-the-art methods based on object detectors. In particular, TOD-Net with DSVE-loc outperforms the methods of other studies in terms of Ri@1, and TOD-Net with the BERT model achieves the best results for all seven criteria.

Moreover, recent studies involving SCAN [11], GVSE [19], and VSRN [38] reported the results of a two-model ensemble. In this experimental setting, TOD-Net also improves performance significantly. In particular, TOD-Net with the BERT model achieves the best results for all seven criteria. TOD-Net with the DSVE-loc model achieves the second best results for five of seven criteria.

As in the previous section, these results also suggest that the cross-modal attention over multiple cropped regions is not the sole solution, but the adjustment of the embedding space works as a promising alternative.

Model            | R@1  | R@5  | R@10 | Ri@1 | Ri@5 | Ri@10 | mR
m-CNN [39]       | 42.8 | 73.1 | 84.1 | 32.6 | 68.6 | 82.8  | 64.0
Order emb. [8]   | 46.7 | –    | 88.9 | 37.9 | –    | 85.9  | 64.9
DSPE+FV [40]     | 50.1 | 79.7 | 89.2 | 39.6 | 75.2 | 86.9  | 70.1
sm-LSTM [41]     | 53.2 | 83.1 | 91.5 | 40.7 | 75.8 | 87.4  | 72.0
2WayNet [42]     | 55.8 | 75.2 | –    | 39.7 | 63.3 | –     | –
DPC [3]          | 65.6 | 89.8 | 95.5 | 47.1 | 79.9 | 90.0  | 78.0
VSE++ [6]        | 64.6 | 90.0 | 95.7 | 52.0 | 84.3 | 92.0  | 79.8
GXN [5]          | 68.5 | –    | 97.9 | 56.6 | –    | 94.5  | –
PVSE [43]        | 69.2 | 91.6 | 96.6 | 55.2 | 86.5 | 93.7  | 82.1
DSVE-loc [7]     | 69.8 | 91.9 | 96.6 | 55.9 | 86.9 | 94.0  | 82.5
soDeep [20]      | 71.5 | 92.8 | 97.1 | 56.2 | 87.0 | 94.3  | 83.2
TOD-Net + (ours)
  VSE++          | 68.6 | 92.0 | 96.9 | 54.5 | 85.3 | 92.4  | 81.6
  DSVE-loc       | 72.6 | 93.4 | 97.3 | 58.6 | 88.4 | 94.6  | 84.2
  BERT (1)       | 75.8 | 95.3 | 98.4 | 61.8 | 89.6 | 95.0  | 86.0

TABLE V: Comparison with State-of-the-Art Methods using Single-Image Encoders. R@K: caption retrieval; Ri@K: image retrieval.

Model                | R@1  | R@5  | R@10 | Ri@1 | Ri@5 | Ri@10 | mR
SCAN t-i (24) [11]   | 67.5 | 92.9 | 97.6 | 53.0 | 85.4 | 92.9  | 81.6
SCAN i-t (24) [11]   | 69.2 | 93.2 | 97.5 | 54.4 | 86.0 | 93.6  | 82.3
SCO (10) [16]        | 69.9 | 92.9 | 97.5 | 56.7 | 87.5 | 94.8  | 83.2
R-SCAN (36) [18]     | 70.3 | 94.5 | 98.1 | 57.6 | 87.3 | 93.7  | 83.6
SGM (36) [44]        | 73.4 | 93.8 | 97.8 | 57.5 | 87.3 | 94.3  | 84.0
MTFN (36) [37]       | 71.9 | 94.2 | 97.9 | 57.3 | 88.6 | 95.0  | 84.2
CAMP (36) [12]       | 72.3 | 94.8 | 98.3 | 58.5 | 87.9 | 95.0  | 84.5
BFAN [17]            | 73.7 | 94.9 | –    | 58.3 | 87.5 | –     | –
TOD-Net + (ours)
  DSVE-loc (1)       | 72.6 | 93.4 | 97.3 | 58.6 | 88.4 | 94.6  | 84.2
  BERT (1)           | 75.8 | 95.3 | 98.4 | 61.8 | 89.6 | 95.0  | 86.0
2-model ensemble
  SCAN [11]          | 72.7 | 94.8 | 98.4 | 58.8 | 88.4 | 94.8  | 84.7
  GVSE [19]          | 72.2 | 94.1 | 98.1 | 60.5 | 89.4 | 95.8  | 85.0
  VSRN [38]          | 76.2 | 94.8 | 98.2 | 62.8 | 89.7 | 95.1  | 86.1
  BFAN [17]          | 74.9 | 95.2 | –    | 59.4 | 88.4 | –     | –
TOD-Net + (ours)
  DSVE-loc           | 75.4 | 94.4 | 97.8 | 60.9 | 89.6 | 95.3  | 85.6
  BERT               | 78.1 | 96.0 | 98.6 | 63.6 | 90.6 | 95.8  | 87.1

TABLE VI: Comparison with State-of-the-Art Methods using Object Detectors. R@K: caption retrieval; Ri@K: image retrieval; the number of cropped regions is given in parentheses.

VI Conclusion

In this paper, we have proposed TOD-Net, a novel module for embedding systems. TOD-Net is installed on top of a pretrained embedding system and deforms the embedding space under a given condition. Through the deformation, TOD-Net successfully emphasizes entity-specific concepts that are often suppressed owing to the diversity between entities belonging to the same group. TOD-Net significantly outperforms state-of-the-art methods based on visual-semantic embedding for cross-modal retrieval on the MSCOCO dataset. Moreover, although TOD-Net takes only one region, it rivals or surpasses the performance of image-caption retrieval models based on object detectors. Future research will focus on designing retrieval and translation systems that employ the embedding internally.

References

  • [1] A. Frome et al., “DeViSE: A Deep Visual-Semantic Embedding Model,” Advances in Neural Information Processing Systems (NIPS), pp. 2121–2129, 2013.
  • [2] R. Kiros, R. Salakhutdinov, and R. S. Zemel, “Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models,” in NIPS Workshop, 2014, pp. 1–13.
  • [3] Z. Zheng et al., “Dual-Path Convolutional Image-Text Embedding with Instance Loss,” arXiv, vol. 14, no. 8, pp. 1–15, 2017.
  • [4] Y. Peng, J. Qi, and Y. Yuan, “CM-GANs: Cross-modal Generative Adversarial Networks for Common Representation Learning,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 15, no. 1, pp. 1–24, 2017.
  • [5] J. Gu et al., “Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [6] F. Faghri et al., “VSE++: Improving Visual-Semantic Embeddings with Hard Negatives,” in British Machine Vision Conference (BMVC), 2018.
  • [7] M. Engilberge et al., “Finding Beans in Burgers: Deep Semantic-Visual Embedding with Localization,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2018, pp. 3984–3993.
  • [8] I. Vendrov et al., “Order-Embeddings of Images and Language,” International Conference on Learning Representations (ICLR), 2015.
  • [9] M. Nickel and D. Kiela, “Poincaré Embeddings for Learning Hierarchical Representations,” in Advances in Neural Information Processing Systems (NIPS), 2017.
  • [10] B. Athiwaratkun and A. G. Wilson, “Hierarchical Density Order Embeddings,” in International Conference on Learning Representations (ICLR), 2018, pp. 1–15.
  • [11] K. H. Lee et al., “Stacked Cross Attention for Image-Text Matching,” in European Conference on Computer Vision (ECCV), 2018, pp. 212–228.
  • [12] Z. Wang et al., “CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval,” in International Conference on Computer Vision (ICCV), 2019.
  • [13] A. Veit, S. Belongie, and T. Karaletsos, “Conditional similarity networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1781–1789.
  • [14] R. Tan et al., “Learning Similarity Conditions Without Explicit Supervision,” in International Conference on Computer Vision (ICCV), 2019.
  • [15] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using Real NVP,” in International Conference on Learning Representations (ICLR), 2017.
  • [16] Y. Huang et al., “Learning Semantic Concepts and Order for Image and Sentence Matching,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6163–6171, 2018.
  • [17] C. Liu et al., “Focus Your Attention: A Bidirectional Focal Attention Network for Image-Text Matching,” in ACM International Conference on Multimedia (ACMMM), 2019.
  • [18] K.-H. Lee et al., “Learning Visual Relation Priors for Image-Text Matching and Image Captioning with Neural Scene Graph Generators,” arXiv, 2019.
  • [19] Y. Huang, Y. Long, and L. Wang, “Few-Shot Image and Sentence Matching via Gated Visual-Semantic Embedding,” in AAAI Conference on Artificial Intelligence (AAAI), vol. 33, 2019, pp. 8489–8496.
  • [20] M. Engilberge et al., “SoDeep: a Sorting Deep net to learn ranking loss surrogates,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [21] T. Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” in International Conference on Learning Representations (ICLR), 2013, pp. 1–12.
  • [22] O.-E. Ganea, G. Bécigneul, and T. Hofmann, “Hyperbolic Entailment Cones for Learning Hierarchical Embeddings,” in International Conference on Machine Learning (ICML), 2018.
  • [23] C. Sun et al., “Gaussian Word Embedding with a Wasserstein Distance Loss,” 2018.
  • [24] J. Devlin et al., “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” arXiv, pp. 1–15, 2018.
  • [25] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” in International Conference on Machine Learning (ICML), 2015.
  • [26] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals and Systems, 1989, pp. 303–314.
  • [27] K. He et al., “Deep Residual Learning for Image Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [28] ——, “Identity Mappings in Deep Residual Networks,” in European Conference on Computer Vision (ECCV), 2016.
  • [29] J. Deng et al., “ImageNet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
  • [30] K. Cho et al., “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
  • [31] T. Durand, N. Thome, and M. Cord, “WELDON: Weakly supervised learning of deep convolutional neural networks,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4743–4752, 2016.
  • [32] R. Kiros et al., “Skip-Thought Vectors,” in Advances in Neural Information Processing Systems (NIPS), pp. 1–9.
  • [33] T. Lei et al., “Simple Recurrent Units for Highly Parallelizable Recurrence,” in Empirical Methods in Natural Language Processing (EMNLP), 2018.
  • [34] T. Y. Lin et al., “Microsoft COCO: Common objects in context,” in European Conference on Computer Vision (ECCV), 2014, pp. 740–755.
  • [35] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in International Conference on Learning Representations (ICLR), 2015, pp. 1–15.
  • [36] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in International Conference on Machine Learning (ICML), 2010, pp. 807–814.
  • [37] T. Wang et al., “Matching Images and Text with Multi-modal Tensor Fusion and Re-ranking,” in ACM International Conference on Multimedia (ACMMM), 2019.
  • [38] K. Li et al., “Visual Semantic Reasoning for Image-Text Matching,” in International Conference on Computer Vision (ICCV), 2019.
  • [39] L. Ma et al., “Multimodal convolutional neural networks for matching image and sentence,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2623–2631.
  • [40] L. Wang, Y. Li, and S. Lazebnik, “Learning Deep Structure-Preserving Image-Text Embeddings,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5005–5013.
  • [41] Y. Huang, W. Wang, and L. Wang, “Instance-aware Image and Sentence Matching with Selective Multimodal LSTM,” 2016, pp. 2310–2318.
  • [42] A. Eisenschtat and L. Wolf, “Linking image and text with 2-way nets,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1855–1865.
  • [43] Y. Song and M. Soleymani, “Polysemous Visual-Semantic Embedding for Cross-Modal Retrieval,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [44] S. Wang et al., “Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2020.