HAL: Improved Text-Image Matching by Mitigating Visual Semantic Hubs

11/22/2019 ∙ by Fangyu Liu, et al. ∙ University of Cambridge

The hubness problem widely exists in high-dimensional embedding space and is a fundamental source of error for cross-modal matching tasks. In this work, we study the emergence of hubs in Visual Semantic Embeddings (VSE) with application to text-image matching. We analyze the pros and cons of two widely adopted optimization objectives for training VSE and propose a novel hubness-aware loss function (HAL) that addresses previous methods' defects. Unlike (Faghri et al. 2018), which simply takes the hardest sample within a mini-batch, HAL takes all samples into account, using both local and global statistics to scale up the weights of "hubs". We evaluate our method with various configurations of model architectures and datasets. The method exhibits exceptionally good robustness and brings consistent improvement on the task of text-image matching across all settings. Specifically, under the same model architectures as (Faghri et al. 2018) and (Lee et al. 2018), by switching only the learning objective, we report a maximum R@1 improvement of 7.4 on MS-COCO and 8.3 on Flickr30k.





The hubness problem is a general phenomenon in high-dimensional space where a small set of source vectors, dubbed hubs, appear too frequently in the neighborhood of target vectors [27]. As embedding learning goes deeper, it has been a concern in various contexts, including object classification [39] and image feature matching [16] in Computer Vision, and word embedding evaluation [29, 6] and word translation [4, 20] in NLP. It is described as “a new aspect of the dimensionality curse” [1, 30].
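As a minimal illustration of what "appearing too frequently in the neighborhood" means, the sketch below counts how often each item lands in the top-k list of a set of queries. The function name `k_occurrence`, the toy vectors, and the use of a plain dot product as the similarity score are assumptions made for this example only.

```python
from collections import Counter


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def k_occurrence(queries, items, k):
    """For each item, count how many queries rank it among their top-k items."""
    counts = Counter()
    for q in queries:
        ranked = sorted(range(len(items)), key=lambda j: dot(q, items[j]), reverse=True)
        counts.update(ranked[:k])
    return [counts[j] for j in range(len(items))]


# Toy data: item 0 is the nearest neighbour of every query, i.e. it acts as a hub.
queries = [(1.0, 0.1), (0.9, 0.3), (0.8, -0.2), (0.7, 0.6)]
items = [(1.0, 0.0), (0.0, 1.0), (0.0, -1.0), (-1.0, 0.0)]
occ = k_occurrence(queries, items, k=1)
print(occ)
```

A heavily skewed count vector like this one is the signature of a hub; with a one-to-one ground truth, at most one of those four top-1 matches can be correct.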

In this work, we study the hubness problem in the task of text-image matching. In recent years, deep neural models have gained a significant edge over non-neural methods in cross-modal matching tasks [41]. Text-image matching has been one of the most popular among them. Most deep methods involve two phases: 1) training: two neural encoders (one for image and one for text) are learned end-to-end, mapping texts and images into a joint space, where items (either texts or images) with similar meanings are close to each other; 2) inference: for a query vector in modality A, a nearest neighbor search is performed to match the query vector against all item vectors in modality B. As the embedding space is learned through jointly modeling vision and language, it is often referred to as Visual Semantic Embeddings (VSE). Recent work on VSE has shown a clear trend of growing dimensions in order to obtain better embedding quality [44]. With deeper embeddings, visual semantic hubs increase dramatically. Such a property is undesirable, as the data is structured in the form of text-image pairs and a one-to-one mapping firmly exists among all text and image points.

Figure 1: Visualization of our proposed objective, which is to leverage both local and global negative samples to identify hubs in high-dimensional embeddings and learn to avoid them. Local negatives are the ones within mini-batch while global ones are sampled from the whole training set.

However, the hubness problem is neither well noticed nor well addressed by current methods of training VSE. Since the start of this line of work [7, 18], VSE models have used either a sum-margin (Sum, Eq. (2)) or a max-margin (Max, Eq. (3)) ranking loss (both are triplet based) to cluster the positive pairs and push away the negative pairs. Sum is robust across various settings but treats all triplets equivalently and utilizes no information from hard samples, and thus does not address the hubness problem at all. Max excels at mining hard samples and achieved state-of-the-art results on MS-COCO [5]. However, it does not explicitly consider the hubness problem, nor does it resist noise well. New methods for training VSE have been proposed consistently in recent years. They include incorporating extra knowledge to augment the original data, e.g. generating adversarial samples [32], and designing high-level objectives that utilize pre-trained models to align salient entities across modalities [22, 47]. However, ever since [5], the basic scheme of training VSE has not been enhanced. In this work we show that exploiting the data per se has not yet reached its limit.

To fully extract the information buried within, we combine robustness with hard sample mining, proposing a self-adjustable hubness-aware loss called Hal. Hal takes both global statistics (sampled from the whole training set) and local statistics (obtained from the mini-batch) into account, leveraging information about hubs to automatically adjust the weights of samples. It learns from hard samples and is robust to noise at the same time by taking multiple samples into account. Specifically, to decide a sample’s weight, we exploit its relationship to 1) other samples within the mini-batch and 2) its $k$-nearest neighbor queries in a memory bank. The larger a hub is, the more it should contribute to the loss, resulting in a mitigation of hubs and an improvement of embedding quality. Through a thorough empirical comparison, we show that our method outperforms the Sum and Max losses on various datasets and architectures by large margins.

The major contribution of this work is a novel training objective (Hal) that utilizes both local and global statistics to identify hubs in high-dimensional embeddings. Compared with the strong baselines [5] and [22], Hal improves R@1 by a maximum of 7.4 on MS-COCO and 8.3 on Flickr30k.


We first introduce the basic formulation of the VSE model, then review the widely adopted methods that we compare against, and finally propose our loss function.

Basic Formulation

The bidirectional text-image matching framework consists of a text encoder and an image encoder. The text encoder is composed of word embeddings, a GRU [2] (or other sequential model) layer and a temporal pooling layer. The image encoder is usually a deep CNN followed by a linear layer. We use ResNet152 [11], Inception-ResNet-v2 (IRv2) [38] and VGG19 [34] pre-trained on ImageNet [3] in our models. We denote the two encoders as functions $f$ and $g$, which map text and image respectively to vectors of size $d$.

For a text-image pair $(t, i)$, the similarity of $f(t)$ and $g(i)$ is measured by cosine similarity:

$$s(t, i) = \frac{f(t)^\top g(i)}{\lVert f(t) \rVert \, \lVert g(i) \rVert} \tag{1}$$
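The inference step that follows from this similarity can be sketched in a few lines. This is an illustrative example only: the helper names `cosine` and `retrieve` and the 2-d toy vectors are assumptions, standing in for the outputs of the text and image encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def retrieve(query, candidates):
    """Index of the candidate with the highest cosine similarity to the query."""
    return max(range(len(candidates)), key=lambda j: cosine(query, candidates[j]))


# Hypothetical encoder outputs: one text embedding, three image embeddings.
text_vec = [3.0, 4.0]
image_vecs = [[4.0, 3.0], [-3.0, 4.0], [6.0, 8.0]]

sims = [cosine(text_vec, v) for v in image_vecs]
best = retrieve(text_vec, image_vecs)
print(sims, best)
```

Because cosine similarity ignores vector length, the third image (a scaled copy of the text vector) is the best match even though it is the farthest in Euclidean terms.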


During training, a margin based triplet ranking loss is usually adopted to cluster positive pairs and push negative pairs away from each other. There are mainly two prevalent choices which are Sum and Max. We introduce them in the next section along with our newly proposed non-triplet-based loss Hal.

Revisit Two Triplet-based Loss Functions

In this section we review the two popular loss functions that have been adopted for training VSE and analyze their pros and cons.

Sum-margin Loss (Sum).

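Sum is a standard triplet loss adopted from the metric learning literature and has been used for training VSE since the start of this line of work [7, 18]. An early form, used for training joint word-image embeddings, can be found in [45]. Formally, Sum is defined as:

$$\mathcal{L}_{\text{Sum}} = \sum_{i \in I} \sum_{\bar{t} \in T \setminus \{t\}} \left[ \alpha - s(i, t) + s(i, \bar{t}) \right]_+ + \sum_{t \in T} \sum_{\bar{i} \in I \setminus \{i\}} \left[ \alpha - s(t, i) + s(t, \bar{i}) \right]_+ \tag{2}$$

where $[\cdot]_+ = \max(0, \cdot)$; $s(\cdot, \cdot)$ is the (symmetric) cosine similarity between a text and an image encoding; $\alpha$ is a preset margin; $T$ and $I$ are text and image encodings in a mini-batch; $t$ is the descriptive text for image $i$ and vice versa; $\bar{t}$ denotes non-descriptive texts for $i$ while $\bar{i}$ denotes non-descriptive images for $t$.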

The major shortcoming of Sum lies in the fact that it views all valid triplets within a mini-batch as equal and assigns identical weights to all, leading to a failure of identifying informative pairs. As we will detail in the following, a simple “hard” weighting by taking only the hardest triplet can greatly enhance a triplet-based loss’s performance in training VSE.
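The equal treatment of all triplets is easy to see when the sum-margin loss is written over a similarity matrix. The sketch below is not taken from the authors' code release; it assumes a square matrix `S` where `S[i][j]` is the similarity between image `i` and text `j`, with matched pairs on the diagonal.

```python
def sum_margin_loss(S, margin):
    """Sum-margin triplet loss over an n x n similarity matrix.

    Every off-diagonal entry contributes one hinge term per retrieval
    direction, all with the same weight.
    """
    n = len(S)
    loss = 0.0
    for i in range(n):
        positive = S[i][i]
        for j in range(n):
            if j == i:
                continue
            loss += max(0.0, margin - positive + S[i][j])  # image i vs. wrong text j
            loss += max(0.0, margin - positive + S[j][i])  # text i vs. wrong image j
    return loss


# Toy mini-batch of three image-text pairs (values are made up).
S = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.4, 0.6, 0.5],
]
loss_sum = sum_margin_loss(S, margin=0.2)
print(round(loss_sum, 6))
```

In this toy batch five triplets violate the margin, and each one adds to the loss with identical weight regardless of how hard it is.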

Max-margin Loss (Max).

Faghri et al. (2018) proposed Max fairly recently. Max differs from Sum by considering only the largest violation of the margin within the mini-batch instead of summing over all margins:

$$\mathcal{L}_{\text{Max}} = \sum_{i \in I} \max_{\bar{t} \in T \setminus \{t\}} \left[ \alpha - s(i, t) + s(i, \bar{t}) \right]_+ + \sum_{t \in T} \max_{\bar{i} \in I \setminus \{i\}} \left[ \alpha - s(t, i) + s(t, \bar{i}) \right]_+ \tag{3}$$

Here, as in Sum, $\alpha$ is the margin, $s(\cdot, \cdot)$ is the cosine similarity, $[\cdot]_+ = \max(0, \cdot)$, and $\bar{t}$, $\bar{i}$ range over the non-descriptive texts and images in the mini-batch.

We refer to Max as a “hard” weighting strategy as it implicitly assigns a weight of 1 to the hardest triplet and 0 to all other triplets. Though Max was not used in the context of VSE before, it was thoroughly exploited in other embedding learning tasks [46]. As analyzed by [46], a rigid stress on hard negatives like Max makes its gradient easily dominated by noise, which results from either a deficiency of the model architecture or the structure of the data per se. Through error analysis, we notice that the existence of pseudo hardest negatives in training data is a major source of noise for Max. During training, only the hardest negative in a mini-batch is considered. If that sample happens to be incorrectly labeled or inaccurate, misleading gradients are imposed on the network. Notice that Sum eases such label noise by taking all of the mini-batch’s samples into account. When a small set of samples have false labels, their false gradients are canceled out by the other, correct negatives within the mini-batch, preventing the model from an optimization failure or from overfitting to incorrect labels. That being said, Sum fails to make use of hard samples and does not address the hubness problem at all. It thus performs poorly on a well-labeled dataset like MS-COCO.
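The "hard" weighting can be made concrete with a small sketch over a similarity matrix. As an assumption for this example, `S[i][j]` holds the similarity between image `i` and text `j`, matched pairs sit on the diagonal, and the batch contains at least two pairs; the function name is hypothetical.

```python
def max_margin_loss(S, margin):
    """Max-margin loss: only the hardest negative per query and direction counts."""
    n = len(S)
    loss = 0.0
    for i in range(n):
        positive = S[i][i]
        hardest_text = max(S[i][j] for j in range(n) if j != i)   # for image i
        hardest_image = max(S[j][i] for j in range(n) if j != i)  # for text i
        loss += max(0.0, margin - positive + hardest_text)
        loss += max(0.0, margin - positive + hardest_image)
    return loss


# Toy mini-batch of three image-text pairs (values are made up).
S = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.4, 0.6, 0.5],
]
loss_max = max_margin_loss(S, margin=0.2)
print(round(loss_max, 6))
```

Only three hinge terms are non-zero here, one per selected hardest negative. If any of those three entries came from a mislabeled pair, the whole gradient for that query would be driven by it, which is the noise sensitivity discussed above.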

Besides, both Sum and Max are triplet based, considering only one positive pair and one negative pair at a time. Such a sampling manner isolates each triplet and disregards the overall distribution of data points. What’s more, the triplet-style constraint is easy for selected triplets to satisfy after the early stage of training, leaving very little information in the gradients in the late stage [52]. As opposed to the triplet loss, our proposed NCA-based loss, to be introduced in the next section, characterizes the whole local neighborhood and takes the affinities among all pairs into consideration.

The Hubness-Aware Loss (Hal)

On the one hand, we obtain the greatest possible robustness by considering multiple samples; on the other hand, we try to make sure the samples being considered are hard enough that training is effective. We tackle this problem by leveraging information from visual semantic hubs. Inspired by Neighborhood Component Analysis (NCA) [9], originally used for classification, we propose a self-adaptive Hubness-Aware Loss (Hal) that weights samples within a mini-batch according to both local and global statistics. More specifically, Hal assigns more weight to samples that appear to be hubs (being close neighbors to multiple queries), judging from both the current mini-batch and a memory bank sampled from the whole training set.

How global and local information are used will be detailed shortly. Before that, we briefly explain NCA and discuss why it is a natural choice for addressing the hubness problem. In the classification context, NCA is formulated as:


where is the number of samples. And the gradient of w.r.t. positive and negative samples are computed as:


When a sample is a close neighbor to multiple items in the search space, i.e. it is a hub, its weight as a positive is reduced and its weight as a negative is scaled up, meaning that it receives more attention during training. This basic philosophy of NCA is used in both the local and global weighting schemes that follow.
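The automatic weighting can be illustrated with a generic log-sum-exp term over negative similarities. This is a simplified stand-in chosen for the example, not the paper's exact NCA or Hal formulation; the temperature value and function names are assumptions. The gradient of such a term with respect to each similarity is a softmax weight, so a negative that sits close to the query receives most of the gradient.

```python
import math


def logsumexp_loss(neg_sims, temperature):
    """Generic soft penalty on negatives: (1/T) * log(sum_j exp(T * s_j))."""
    return math.log(sum(math.exp(temperature * s) for s in neg_sims)) / temperature


def softmax_weights(neg_sims, temperature):
    """Analytic gradient of logsumexp_loss with respect to each similarity."""
    top = max(neg_sims)
    exps = [math.exp(temperature * (s - top)) for s in neg_sims]
    total = sum(exps)
    return [e / total for e in exps]


neg_sims = [0.8, 0.3, 0.1]  # the first negative is a close neighbour of the query
T = 10.0
weights = softmax_weights(neg_sims, T)

# Finite-difference check that the weight equals the gradient for the first negative.
eps = 1e-6
bumped = [neg_sims[0] + eps] + neg_sims[1:]
numeric_grad = (logsumexp_loss(bumped, T) - logsumexp_loss(neg_sims, T)) / eps
print(weights, numeric_grad)
```

No explicit mining step is needed: the weights fall out of back-propagation, and every negative keeps a non-zero share.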

a) Global weighting through Memory Bank (Mb).

One of the most desirable properties of an NCA-based loss is that it automatically assigns weights to all samples in one batch of back-propagation through the gradient computation suggested above. The more data points we have, the more reliably a hub can be identified. The ideal approach to leveraging hubs is to apply the idea of NCA and search for hubs across the whole training set, so that all samples are compared against each other and the information is fully used. However, it is computationally infeasible to minimize such an objective function on a global scale, especially when it comes to computing gradients for all training samples [48]. We thus design hand-crafted criteria that follow NCA’s idea to explicitly compute the weights of samples without requiring gradient computation. Specifically, at the beginning of each epoch, we sample from the whole training set and compute the embeddings of the samples to create a memory bank that approximates the global distribution of training data. We then use the relationships between the mini-batch and the memory bank to compute a global weight for each sample in the batch, highlighting hubs and passing the weight on to the next stage of local weighting.

We define a function that returns the $k$ closest points (measured by distance) to a given point within a point set, and the global weighting of Hal can be formulated as:


where represent weight of positive and negative samples respectively; ; are temperature scales and are margins. For positive weighting, when the anchor’s neighborhood is dense, the denominator of the second term gets larger and so does . As will be shown in gradient computation (Eq. (8)), a large scales up positive sample’s gradient. Analogously, for negative weighting, a dense neighborhood leads to a large and increases the gradient of that negative sample in local weighting.
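The two ingredients of the global stage, a memory bank sampled from the training set and a k-nearest-neighbor lookup, can be sketched as follows. This covers only the data handling, not the weighting formula itself. The names `build_memory_bank` and `k_nearest`, the uniform sampling, and the use of cosine similarity as the closeness measure are all assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def build_memory_bank(embeddings, size, seed=0):
    """Uniformly sample `size` embeddings (without replacement) as a memory bank."""
    rng = random.Random(seed)
    return rng.sample(embeddings, size)


def k_nearest(x, bank, k):
    """Return the k points in `bank` that are most similar to `x`."""
    return sorted(bank, key=lambda p: cosine(x, p), reverse=True)[:k]


# Toy "training set" of precomputed 2-d embeddings.
train_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
bank = build_memory_bank(train_embeddings, size=3)
neighbours = k_nearest([1.0, 0.1], train_embeddings, k=2)
print(bank, neighbours)
```

In a training loop, the bank would be rebuilt at the start of each epoch as described above, and a sample with many high-similarity neighbors in the bank would be treated as having a dense neighborhood.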

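| # | architecture | loss | image→text R@1 | image→text R@5 | image→text R@10 | image→text Med r | image→text Mean r | text→image R@1 | text→image R@5 | text→image R@10 | text→image Med r | text→image Mean r | rsum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.1 | GRU+VGG19 | Sum | 30.0 | 59.6 | 67.7 | 4.0 | 34.7 | 22.8 | 49.4 | 61.4 | 6.0 | 47.5 | 291.0 |
| 1.2 | | Max | 30.1 | 56.3 | 67.9 | 4.0 | 30.5 | 21.3 | 47.1 | 58.7 | 6.0 | 40.2 | 281.4 |
| 1.3 | | Hal | 38.4 | 63.3 | 73.4 | 3.0 | 20.1 | 26.7 | 53.3 | 64.9 | 5.0 | 32.1 | 320.0 |
| 1.4 | Order (VGG19, ours) [40] | Sum | 31.4 | 58.3 | 69.4 | 4.0 | 26.9 | 24.2 | 50.9 | 62.9 | 5.0 | 34.3 | 297.1 |
| 1.5 | | Max | 32.1 | 58.0 | 69.9 | 4.0 | 23.1 | 22.7 | 49.4 | 61.3 | 6.0 | 32.9 | 293.4 |
| 1.6 | | Hal | 36.4 | 62.2 | 73.0 | 3.0 | 20.4 | 26.6 | 54.4 | 65.6 | 4.0 | 31.0 | 318.3 |
| 1.7 | SCAN [22] | Max | 67.9 | 89.0 | 94.4 | - | - | 43.9 | 74.2 | 82.8 | - | - | 452.2 |
| 1.8 | | Hal | 68.6 | 89.9 | 94.7 | 1.0 | 3.3 | 46.0 | 74.0 | 82.3 | 2.0 | 14.3 | 455.5 |

Table 1: Quantitative results on Flickr30k [51]. “ours” means our own implementation.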

b) Local weighting through loss function.

Here we adapt the NCA loss for classification for our context of producing a matching among two sets of points:


where is a temperature scale; is a margin; is number of samples within the mini-batch. And the gradients with respect to negative and positive samples are computed as:


Unlike a naive NCA that classifies samples in only one direction, the first and second terms of the loss punish mistakes made when searching for targets across the two modalities in both directions. As shown in the gradients, a sample is weighted according to its significance as a hub in both modalities.

Hal vs Max. As pointed out by [20], Max actually implicitly mitigates the hubness problem by targeting the hardest triplet only. A hub, by definition, is a close (potentially nearest) neighbor to multiple queries and would thus be punished by Max multiple times (in different batches). The experiments in [20] also verified this theory empirically. However, it is a risky choice, as the hardest sample within a mini-batch can easily be a pseudo hardest negative, as analyzed above. As we will show in experiments, Hal prevails in a broader range of data and model configurations, while Max only performs well in specific circumstances where both training data and encoders are of ideal quality. Also, Hal essentially leverages more information than Max. In Max, only the hub that violates the margin the most gets to impose a gradient on the network’s parameters, while Hal softly considers all hubs, big or small, by assigning them weights.
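The contrast between hard and soft weighting can be sketched with a temperature-scaled softmax over negative similarities. This is a generic illustration rather than the exact Hal weights; the similarity values and temperatures are made up for the example.

```python
import math


def hard_weights(sims):
    """Max-style weighting: all weight on the single hardest negative."""
    hardest = max(range(len(sims)), key=sims.__getitem__)
    return [1.0 if j == hardest else 0.0 for j in range(len(sims))]


def soft_weights(sims, temperature):
    """Softmax weighting: every negative contributes, harder ones contribute more."""
    top = max(sims)
    exps = [math.exp(temperature * (s - top)) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


# Two nearly tied hard negatives and one easy negative.
sims = [0.62, 0.60, 0.20]

hard = hard_weights(sims)
soft_low = soft_weights(sims, 1.0)     # close to uniform weighting
soft_mid = soft_weights(sims, 30.0)    # both hard negatives matter
soft_high = soft_weights(sims, 200.0)  # approaches the hard, one-hot weighting
print(hard, soft_low, soft_mid, soft_high)
```

With a moderate temperature the runner-up negative keeps roughly a third of the weight, so a single mislabeled "hardest" sample cannot dominate the update the way it does under one-hot weighting.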


This section is divided into 1) Experimental Setups and 2) Main Results, where detailed configurations of experiments are introduced in 1) and comparison & analysis of main results are in 2).

Experimental Setups

Dataset. We use MS-COCO [23] and Flickr30k [51] as our experimental datasets. For MS-COCO, several different splitting protocols have been used in the community. We use the same split as [17]: 113,287 images for training, 5,000 for validation and 5,000 for testing. (Note that each image in MS-COCO and Flickr30k has 5 captions, so 5 text-image pairs are used for every image.) During testing, scores are computed as the average of 5 folds of 1k images. As many previous works report test results on a 1k test set (a subset of the 5k one), we experiment with both protocols and distinguish the 1k and 5k test sets when reporting results. Flickr30k has 30,000 images for training; 1,000 for validation; 1,000 for testing.

Evaluation metrics. We use R@$k$s (recall at $k$), Med r, Mean r and rsum to evaluate the results. R@$k$: the ratio of “# of queries for which the ground-truth item is ranked in the top $k$” to “total # of queries” (we use $k \in \{1, 5, 10\}$); Med r: the median of the ground-truth ranking; Mean r: the mean of the ground-truth ranking; rsum: the sum of R@$k$ over both text→image and image→text. For R@$k$s and rsum, higher is better; for Med r and Mean r, lower is better. We compute all metrics for both text→image and image→text retrieval. During training, we follow the convention of taking the model with the maximum rsum on the validation set as the best model for testing.
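These metrics are simple to compute once the rank of the ground-truth item is known for every query. The sketch below assumes 1-based ranks and reports recall as a percentage, as in the tables; the rank lists are made-up toy values and the helper names are hypothetical.

```python
from statistics import mean, median


def recall_at_k(ranks, k):
    """Percentage of queries whose ground-truth item is ranked in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def summarize(ranks):
    """R@1, R@5, R@10, Med r and Mean r for one retrieval direction."""
    return {
        "R@1": recall_at_k(ranks, 1),
        "R@5": recall_at_k(ranks, 5),
        "R@10": recall_at_k(ranks, 10),
        "Med r": median(ranks),
        "Mean r": mean(ranks),
    }


# Toy 1-based ranks of the ground-truth item for ten queries per direction.
i2t_ranks = [1, 2, 1, 7, 12, 1, 3, 1, 30, 5]
t2i_ranks = [2, 1, 4, 1, 9, 15, 1, 6, 3, 8]

i2t = summarize(i2t_ranks)
t2i = summarize(t2i_ranks)
rsum = sum(m[key] for m in (i2t, t2i) for key in ("R@1", "R@5", "R@10"))
print(i2t, t2i, rsum)
```

Model selection then reduces to keeping the checkpoint with the largest `rsum` on the validation set.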

Model and training details. The word embeddings and the internal states of the GRU text encoder are all randomly initialized with Xavier init. [8]; all image encodings are obtained from image encoders pre-trained on ImageNet (for fair comparison, we do not fine-tune any image encoders); text and image embeddings share the same dimensionality. For more details about hyperparameters and training configurations, please refer to Table 3 and the code release: https://github.com/hardyqr/HAL.

Main Results

Here we present the major quantitative and qualitative findings with analysis regarding Hal’s performance, hyperparameters’ choice and hubs’ distributions.

| # | architecture | loss | image→text R@1 | image→text R@5 | image→text R@10 | image→text Med r | image→text Mean r | text→image R@1 | text→image R@5 | text→image R@10 | text→image Med r | text→image Mean r | rsum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2.1 | GRU+VGG19 | Sum | 46.9 | 79.7 | 89.5 | 2.0 | 5.9 | 37.0 | 73.1 | 85.3 | 2.0 | 11.1 | 411.5 |
| 2.2 | | Max | 51.8 | 82.1 | 90.5 | 1.0 | 5.1 | 39.0 | 73.9 | 84.7 | 2.0 | 12.0 | 421.9 |
| 2.3 | | Hal | 55.5 | 84.3 | 92.3 | 1.0 | 4.2 | 41.9 | 75.6 | 86.7 | 2.0 | 7.8 | 436.1 |
| 2.4 | | Hal+Mb | 56.7 | 84.9 | 93.0 | 1.0 | 4.0 | 41.9 | 75.9 | 87.1 | 2.0 | 7.2 | 439.5 |
| 2.5 | GRU+IRv2 | Sum | 50.9 | 82.7 | 92.2 | 1.4 | 4.1 | 39.5 | 75.8 | 87.2 | 2.0 | 9.4 | 428.3 |
| 2.6 | | Max | 57.0 | 86.2 | 93.8 | 1.0 | 3.5 | 43.3 | 77.9 | 87.9 | 2.0 | 8.6 | 446.0 |
| 2.7 | | Hal | 60.2 | 87.3 | 94.4 | 1.0 | 3.3 | 44.8 | 78.2 | 88.3 | 2.0 | 7.7 | 453.2 |
| 2.8 | | Hal+Mb | 62.7 | 88.0 | 94.6 | 1.0 | 3.1 | 45.3 | 78.8 | 89.0 | 2.0 | 6.3 | 458.5 |
| 2.9 | GRU+ResNet152 | Sum | 53.2 | 85.0 | 93.0 | 1.0 | 3.9 | 41.9 | 77.2 | 88.0 | 2.0 | 8.7 | 438.3 |
| 2.10 | | Max | 58.7 | 88.2 | 94.8 | 1.0 | 3.2 | 45.0 | 78.9 | 88.6 | 2.0 | 8.6 | 454.2 |
| 2.11 | | Hal | 64.4 | 89.2 | 94.9 | 1.0 | 3.0 | 46.3 | 78.8 | 88.3 | 2.0 | 7.9 | 462.0 |
| 2.12 | | Hal+Mb | 64.0 | 89.9 | 95.7 | 1.0 | 2.8 | 46.9 | 80.4 | 89.9 | 2.0 | 6.1 | 466.7 |
| 2.13 | [18] (ours) | | 49.9 | 79.4 | 90.1 | 2.0 | 5.2 | 37.3 | 74.3 | 85.9 | 2.0 | 10.8 | 416.8 |
| 2.14 | [40] | | 46.7 | - | 88.9 | 2.0 | 5.7 | 37.9 | - | 85.9 | 2.0 | 8.1 | - |
| 2.15 | [13] | | 53.2 | 83.1 | 91.5 | 1.0 | - | 40.7 | 75.8 | 87.4 | 2.0 | - | 431.8 |
| 2.16 | [25] | | 56.4 | 85.3 | 91.5 | - | - | 43.9 | 78.1 | 88.6 | - | - | 443.8 |
| 2.17 | [50] | | 56.3 | 84.4 | 92.2 | 1.0 | - | 45.7 | 81.2 | 90.6 | 2.0 | - | 450.4 |
| 2.18 | [44] (d=1024) | | 57.8 | 87.9 | 95.6 | 1.0 | 3.3 | 44.2 | 80.4 | 90.7 | 2.0 | 5.4 | 456.6 |
| 2.19 | [5] | | 58.3 | 86.1 | 93.3 | 1.0 | - | 43.6 | 77.6 | 87.8 | 2.0 | - | 446.7 |
| 2.20 | [5] (ours) | | 60.5 | 89.6 | 94.9 | 1.0 | 3.1 | 46.1 | 79.5 | 88.7 | 2.0 | 8.5 | 459.3 |
| 2.21 | [24] | | 58.3 | 89.2 | 95.4 | 1.0 | 3.1 | 45.0 | 80.4 | 89.6 | 2.0 | 7.2 | 457.9 |
| 2.22 | [47] | | 64.3 | 89.2 | 94.8 | 1.0 | - | 48.3 | 81.7 | 91.2 | 2.0 | - | 469.5 |
| 2.23 | GRU+ResNet152 + Hal | | 65.4 | 90.4 | 96.4 | 1.0 | 2.5 | 47.4 | 80.6 | 89.0 | 2.0 | 7.3 | 469.2 |
| 2.24 | GRU+ResNet152 + Hal + Mb | | 66.3 | 91.7 | 97.0 | 1.0 | 2.4 | 48.7 | 82.1 | 90.8 | 2.0 | 5.6 | 476.6 |
| 2.25 | [22] (t-i AVG) | | 70.9 | 94.5 | 97.8 | - | - | 56.4 | 87.0 | 93.9 | - | - | 500.5 |
| 2.26 | [22] (t-i AVG) + Hal | | 78.3 | 96.3 | 98.5 | 1.0 | 2.6 | 60.1 | 86.7 | 92.8 | 1.0 | 5.8 | 512.7 |

Table 2: Quantitative results on MS-COCO [23]. The first three blocks (lines 2.1-2.12) use the 5k test set protocol; the last two blocks (lines 2.13-2.24) use the 1k test set, for convenience of comparison with results reported in previous works. Mb means memory bank.

Comparing Hal, Sum and Max. Tables 1 and 2 present our quantitative results on Flickr30k and MS-COCO respectively. On Flickr30k, we experiment with three models, and Hal achieves significantly better performance than Max and Sum on the first two configurations. (We do not include Hal+Mb for [40] as it demands GPU memory exceeding 11GB, which is the limit of the GTX 2080Ti we use. The same reason applies to SCAN+Hal+Mb.) On MS-COCO, Hal also beats both triplet loss functions. Interestingly, while Max fails badly on Flickr30k, it becomes very competitive on MS-COCO. This is evidence that Max easily overfits to small datasets. ([5] showed that data augmentation techniques like random crop applied to input images can improve Max’s performance on small datasets.) In conclusion, Hal maintains its edge over Max and Sum regardless of data and architecture configurations. Even without global weighting (memory bank), Hal still beats the two triplet losses by a large margin. Adding the memory bank usually boosts rsum further, by roughly 3 to 7 points in Table 2. It is also worth noting that Hal converges significantly faster than Max and Sum, stabilizing after markedly fewer epochs (Figure 2).

Figure 2: Plotting epoch against rsum on validation set for comparing convergence time. All models are using GRU+ResNet152, trained & validated on MS-COCO.

Hal vs. State-of-the-art. Table 2 lines 2.13-2.24 list quantitative results of both our proposed method (2.23, 2.24) and numbers reported in previous works (2.13-2.22). For fair comparison with [5], we only use routine encoder architectures (GRU+ResNet152). Unlike [32, 47], we also do not bring in any extra information to help training. With a trivial configuration of model & data, our method is still ahead of the state-of-the-art on MS-COCO [47] by a decent margin for most metrics. Note that we are comparing against works that use a frozen image encoder (as we do). For those that fine-tune image features, better performance is achievable [37, 33]. In Table 2 lines 2.25 and 2.26, we list SCAN [22] separately, as it incorporates additional knowledge, i.e. bottom-up attention information from a Faster R-CNN [28], to refine the visual-semantic alignment. With such a prior, it is a well-established state-of-the-art on the text-image matching task, having much higher rsums than previous works. For SCAN, we pick the configurations with the best rsums on both MS-COCO and Flickr30k, switching the learning objective from Max to Hal. (An ensemble model is able to achieve an even higher rsum, but for clear comparison we do not discuss the ensemble case.) On Flickr30k, Max and Hal deliver comparable results. On MS-COCO, Hal is significantly stronger: rsum is further improved to 512.7, with R@1 improved by 7.4 and 3.7 for image→text and text→image respectively. We did not experiment with Mb due to GPU memory limits.

The impact of batch size. In contrast to loss functions that treat each sample equivalently, batch size matters to Hal, as it defines the size of the neighborhood in which relative similarity is considered during local weighting. Hal benefits from a larger batch size, as it means an expanded neighborhood. As suggested in Figure 3, on MS-COCO, Hal reaches a maximum rsum with a batch size of 512 (the value listed in Table 3). Note that in the NCA context, batch size is a relative concept. For Flickr30k, which is only roughly a quarter of the size of MS-COCO, we maintain the original batch size of 128 (Table 3) to cover roughly the same range of neighborhood.

Figure 3: Plotting batch size used by Hal against rsum. All models are using GRU+ResNet152, trained & tested on MS-COCO .

The impact of size of memory bank. The Mb in Hal has two hyperparameters: 1) $k$, which characterizes the scope of the neighborhood considered for global statistics, and 2) the memory bank’s size. Their relative scales matter for mining informative samples in the top-$k$ neighborhood. With $k$ fixed, we search for the most appropriate memory bank size; Figure 4 suggests that a bank covering only part of the training data is ideal. The top-$k$ neighborhood of an overly large memory bank might be filled with noisy samples (potentially incorrectly labeled).

Figure 4: Plotting rsum against Hal’s memory bank size. Hal without memory bank is also provided as a baseline. All data points are produced with GRU+IRv2 as the base model and are trained & tested on MS-COCO .

| # | Datasets | models | hyperparameters |
|---|---|---|---|
| 3.1 | MS-COCO | 2.1, 2.5, 2.9, 2.13 | margin=0.2, lr=0.001, lr_update=10, bs=128, epoch=30 |
| 3.2 | | 2.2, 2.6, 2.10, 2.20 | margin=0.2, lr=0.0002, lr_update=10, bs=128, epoch=30 |
| 3.3 | | 2.11, 2.23 | =30, =0.3, lr=0.001, lr_update=10, bs=512, epoch=15 |
| 3.4 | | 2.12, 2.24 | =30, =0.3, =40, =40, =0.2, =0.1, lr=0.001, lr_update=10, bs=512, epoch=15 |
| 3.5 | | 2.26 | =100, =1.0, lr=0.0005, lr_update=10, bs=256, epoch=20 |
| 3.6 | Flickr30k | 1.1, 1.4 | margin=0.05, lr=0.001, lr_update=10, bs=128, epoch=30 |
| 3.7 | | 1.2, 1.5 | margin=0.05, lr=0.0002, lr_update=15, bs=128, epoch=30 |
| 3.8 | | 1.3 | =60, =0.7, lr=0.001, lr_update=10, bs=128, epoch=15 |
| 3.9 | | 1.8 | =70, =0.6, lr=0.0005, lr_update=10, bs=128, epoch=30 |

Table 3: Experiment configurations.

Related Work

In this section, we introduce works from three fields that are highly-related to our work: 1) text-image matching and VSE; 2) deep metric learning; 3) tackling the hubness problem in various contexts.

Text-image Matching and VSE.

Since the dawn of deep learning, works have emerged using a two-branch architecture to connect language and vision. Weston et al. (2010) trained a shallow neural network to map word-image pairs into a joint space for image annotation. Frome et al. (2013) brought up the term VSE and trained joint embeddings for sentence-image pairs. Later works extended VSE to the task of text-image matching [12, 18, 10, 40, 14, 5, 42], which is also our task of interest. Note that text-image matching differs from generating novel captions for images [21, 17]: the goal is to retrieve existing descriptive texts or images in a database.

While many of these works improve model architectures for training VSE, few have tackled the shortcomings in learning objectives. Faghri et al. (2018) made the latest attempt to reform the long-used Sum loss. Their proposed Max loss is indeed a much stronger baseline than Sum in most data and model configurations, but it fails significantly when the dataset is small or contains noise. Liu and Ye (2019) eased this deficiency by relaxing Max into a top-K triplet loss. Shekhar et al. (2017) and Shi et al. (2018) raised similar concerns. They mainly focused on creating better training data, while we target the training objective itself.

Deep Metric Learning.

Text-image matching is an open-set task where matching results are identified based on the similarity of pairs, instead of assigning probabilities to specific labels in a closed set. Such a property coincides with the idea of metric learning, which utilizes relative similarities among pairs to cluster samples of the same class in embedding space. In the deep learning age, deep neural net based metric learning is widely applied in various tasks including image retrieval [26, 43], face recognition [31], person re-identification [49], etc. We use a kindred philosophy in our context of matching two sets of data points. Works on deep metric learning that inspired our model are discussed here.

Neighborhood Component Analysis (NCA) [9] introduced the foundational philosophy for metric learning where a stochastic variant of K-Nearest-Neighbor score is directly maximized. [49, 26, 36, 43] further developed the idea, leveraging the gradient of NCA-based loss to discriminatively learn from samples of different importance. [48] proposed a method that computes only part of NCA-based loss’s gradient, so that NCA on a large scale is computationally feasible.

Tackling the Hubness Problem.

We have stated what the hubness problem is in the introduction. We now introduce several efforts tackling it in various contexts. [53] pointed out the wide existence of hubs in text-image embeddings but did not address them. Though it has not received enough attention in the VSE literature, the hubness problem has recently been extensively explored in Bilingual Lexicon Induction (BLI). BLI is the task of inducing word translations from monolingual corpora in two languages [15]. In terms of finding correspondence between two sets of vectors, it is analogous to our task of interest. [35, 19] proposed to first conduct a direct Procrustes Analysis and then use criteria that heavily punish hubs during inference to reduce the hubness problem. While this is indeed efficient in finding a better matching, the actual quality of the embeddings is not improved. Joulin et al. (2018) integrated the inference criterion CSLS from [19] into a least-square loss and trained a transformation matrix end-to-end to mitigate the hubness problem. Though this work has a similar philosophy to ours, it is specifically designed for BLI and only trains one linear layer over two sets of word vectors. When CSLS is appended to a triplet loss, it is merely a resampling of hard samples, making it non-special in terms of both form and intuition.
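To make the inference-time criterion concrete, the sketch below rescoring a similarity matrix with CSLS in its commonly used form, 2·sim(q, t) minus the mean similarity of q to its k nearest targets minus the mean similarity of t to its k nearest queries. The matrix values, k, and function names are assumptions chosen for illustration.

```python
def mean_top_k(values, k):
    """Mean of the k largest values."""
    return sum(sorted(values, reverse=True)[:k]) / k


def csls_scores(sim, k):
    """CSLS rescoring of a (queries x targets) similarity matrix."""
    n_q, n_t = len(sim), len(sim[0])
    r_query = [mean_top_k(row, k) for row in sim]
    r_target = [mean_top_k([sim[q][t] for q in range(n_q)], k) for t in range(n_t)]
    return [
        [2 * sim[q][t] - r_query[q] - r_target[t] for t in range(n_t)]
        for q in range(n_q)
    ]


def argmax_rows(matrix):
    """Index of the best-scoring target for every query."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


# Toy similarities: target 0 is a hub, the nearest neighbour of every query.
sim = [
    [0.90, 0.50, 0.10],
    [0.80, 0.75, 0.20],
    [0.70, 0.30, 0.65],
]
raw_match = argmax_rows(sim)
csls_match = argmax_rows(csls_scores(sim, k=2))
print(raw_match, csls_match)
```

The penalty on the hub's dense neighborhood lets the second and third queries recover distinct targets. The embeddings themselves are unchanged, which is the limitation noted above and the reason Hal addresses hubs during training instead.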


We introduce a novel loss, Hal, for mitigating visual semantic hubs when training text-image matching models. The self-adaptive loss Hal leverages the inherent nature of Neighborhood Component Analysis (NCA) to identify information about hubs from both a global and a local perspective, giving consideration to robustness and hard sample mining at the same time. Our method beats two prevalent triplet-based objectives across different datasets and model architectures by large margins. Though our method has only been evaluated on the task of text-image matching, there remain other cross-modal mapping tasks that require obtaining a matching, e.g. content-based image retrieval, document retrieval, document semantic relevance, Bilingual Lexicon Induction, etc. Hal can presumably be used in such settings as well.


We thank the anonymous reviewers for their careful feedback, based on which we were able to enhance the work. We thank our family members for unconditionally supporting our independent research. The author Fangyu Liu gives special thanks to 1) Prof. Lili Mou, who voluntarily spent time reading and discussing the rough ideas with him at the very beginning; 2) his aunt Qiu Wang, who supplied him with GPU machines; 3) his labmates Yi Zhu and Qianchu Liu from the Language Technology Lab for proofreading the camera-ready version.


  • [1] R. Bellman (1961) Adaptive control processes: a guided tour princeton university press. Princeton, New Jersey, USA. Cited by: Introduction.
  • [2] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014)

    Empirical evaluation of gated recurrent neural networks on sequence modeling

    arXiv preprint arXiv:1412.3555. Cited by: Basic Formulation.
  • [3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In

    Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)

    pp. 248–255. Cited by: Basic Formulation.
  • [4] G. Dinu, A. Lazaridou, and M. Baroni (2015) Improving zero-shot learning by mitigating the hubness problem. ICLR worshop. Cited by: Introduction.
  • [5] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler (2018) VSE++: improving visual-semantic embeddings with hard negatives. External Links: Link Cited by: Hal: Improved Text-Image Matching by Mitigating Visual Semantic Hubs, Introduction, Introduction, Main Results, Table 2, Text-image Matching and VSE., footnote 4.
  • [6] M. Faruqui, Y. Tsvetkov, P. Rastogi, and C. Dyer (2016) Problems with evaluation of word embeddings using word similarity tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pp. 30–35. Cited by: Introduction.
  • [7] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, and T. Mikolov (2013) Devise: a deep visual-semantic embedding model. In NIPS, pp. 2121–2129. Cited by: Introduction, Sum-margin Loss (Sum)..
  • [8] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256. Cited by: Experimental Setups.
  • [9] J. Goldberger, G. E. Hinton, S. T. Roweis, and R. R. Salakhutdinov (2005) Neighbourhood components analysis. In Advances in neural information processing systems, pp. 513–520. Cited by: The Hubness-Aware Loss (Hal), Deep Metric Learning..
  • [10] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik (2014) Improving image-sentence embeddings using large weakly annotated photo collections. In European Conference on Computer Vision (ECCV), pp. 529–545. Cited by: Text-image Matching and VSE..
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: Basic Formulation.
  • [12] M. Hodosh, P. Young, and J. Hockenmaier (2013) Framing image description as a ranking task: data, models and evaluation metrics. Journal of Artificial Intelligence Research 47, pp. 853–899. Cited by: Text-image Matching and VSE..
  • [13] Y. Huang, W. Wang, and L. Wang (2017) Instance-aware image and sentence matching with selective multimodal lstm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2310–2318. Cited by: Table 2.
  • [14] Y. Hubert Tsai, L. Huang, and R. Salakhutdinov (2017) Learning robust visual-semantic embeddings. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 3571–3580. Cited by: Text-image Matching and VSE..
  • [15] A. Irvine and C. Callison-Burch (2017) A comprehensive analysis of bilingual lexicon induction. Computational Linguistics 43 (2), pp. 273–310. Cited by: Tackling the Hubness Problem..
  • [16] H. Jegou, C. Schmid, H. Harzallah, and J. Verbeek (2008) Accurate image search using the contextual dissimilarity measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (1), pp. 2–11. Cited by: Introduction.
  • [17] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: Experimental Setups, Text-image Matching and VSE..
  • [18] R. Kiros, R. Salakhutdinov, and R. S. Zemel (2015) Unifying visual-semantic embeddings with multimodal neural language models. Transactions of the Association for Computational Linguistics (TACL). Cited by: Introduction, Sum-margin Loss (Sum)., Table 2, Text-image Matching and VSE..
  • [19] G. Lample, A. Conneau, M. Ranzato, L. Denoyer, and H. Jégou (2018) Word translation without parallel data. In International Conference on Learning Representations. Cited by: Tackling the Hubness Problem..
  • [20] A. Lazaridou, G. Dinu, and M. Baroni (2015) Hubness and pollution: delving into cross-space mapping for zero-shot learning. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 270–280. Cited by: Introduction, b) Local weighting through loss function..
  • [21] R. Lebret, P. O. Pinheiro, and R. Collobert (2015) Phrase-based image captioning. In Proceedings of the 32nd International Conference on Machine Learning (ICML), Volume 37, pp. 2085–2094. Cited by: Text-image Matching and VSE..
  • [22] K. Lee, X. Chen, G. Hua, H. Hu, and X. He (2018) Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 201–216. Cited by: Hal: Improved Text-Image Matching by Mitigating Visual Semantic Hubs, Introduction, Introduction, Table 1, Main Results, Table 2.
  • [23] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision (ECCV), pp. 740–755. Cited by: Experimental Setups, Table 2.
  • [24] F. Liu and R. Ye (2019-07) A strong and robust baseline for text-image matching. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Florence, Italy, pp. 169–176. Cited by: Table 2.
  • [25] Y. Liu, Y. Guo, E. M. Bakker, and M. S. Lew (2017) Learning a recurrent residual fusion network for multimodal matching. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4107–4116. Cited by: Table 2.
  • [26] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012. Cited by: Deep Metric Learning., Deep Metric Learning..
  • [27] M. Radovanović, A. Nanopoulos, and M. Ivanović (2010) Hubs in space: popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research 11 (Sep), pp. 2487–2531. Cited by: Introduction.
  • [28] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: Main Results.
  • [29] T. Schnabel, I. Labutov, D. Mimno, and T. Joachims (2015) Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 298–307. Cited by: Introduction.
  • [30] D. Schnitzer, A. Flexer, M. Schedl, and G. Widmer (2012) Local and global scaling reduce hubs in space. Journal of Machine Learning Research 13 (Oct), pp. 2871–2902. Cited by: Introduction.
  • [31] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: Deep Metric Learning..
  • [32] H. Shi, J. Mao, T. Xiao, Y. Jiang, and J. Sun (2018) Learning visually-grounded semantics from contrastive adversarial samples. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 3715–3727. Cited by: Introduction, Main Results.
  • [33] K. Shuster, S. Humeau, H. Hu, A. Bordes, and J. Weston (2019) Engaging image captioning via personality. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12516–12526. Cited by: Main Results.
  • [34] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Basic Formulation.
  • [35] S. L. Smith, D. H. Turban, S. Hamblin, and N. Y. Hammerla (2017) Offline bilingual word vectors, orthogonal transformations and the inverted softmax. ICLR. Cited by: Tackling the Hubness Problem..
  • [36] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: Deep Metric Learning..
  • [37] Y. Song and M. Soleymani (2019) Polysemous visual-semantic embedding for cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1979–1988. Cited by: Main Results.
  • [38] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: Basic Formulation.
  • [39] N. Tomašev, R. Brehar, D. Mladenić, and S. Nedevschi (2011) The influence of hubness on nearest-neighbor methods in object recognition. In 2011 IEEE 7th International Conference on Intelligent Computer Communication and Processing, pp. 367–374. Cited by: Introduction.
  • [40] I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun (2016) Order-embeddings of images and language. ICLR. Cited by: Table 1, Table 2, Text-image Matching and VSE., footnote 3.
  • [41] K. Wang, Q. Yin, W. Wang, S. Wu, and L. Wang (2016) A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215. Cited by: Introduction.
  • [42] L. Wang, Y. Li, J. Huang, and S. Lazebnik (2019) Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 394–407. Cited by: Text-image Matching and VSE..
  • [43] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott (2019) Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5022–5030. Cited by: Deep Metric Learning., Deep Metric Learning..
  • [44] Wehrmann (2018) Bidirectional retrieval made simple. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7718–7726. Cited by: Introduction, Table 2.
  • [45] J. Weston, S. Bengio, and N. Usunier (2010) Large scale image annotation: learning to rank with joint word-image embeddings. Machine learning 81 (1), pp. 21–35. Cited by: Sum-margin Loss (Sum)..
  • [46] C. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl (2017) Sampling matters in deep embedding learning. In Proc. IEEE International Conference on Computer Vision (ICCV). Cited by: Max-margin Loss (Max)..
  • [47] H. Wu, J. Mao, Y. Zhang, Y. Jiang, L. Li, W. Sun, and W. Ma (2019) Unified visual-semantic embeddings: bridging vision and language with structured meaning representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6609–6618. Cited by: Introduction, Main Results, Table 2.
  • [48] Z. Wu, A. A. Efros, and S. X. Yu (2018) Improving generalization via scalable neighborhood component analysis. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 685–701. Cited by: a) Global weighting through Memory Bank (Mb)., Deep Metric Learning..
  • [49] D. Yi, Z. Lei, S. Liao, and S. Z. Li (2014-08) Deep metric learning for person re-identification. In 2014 22nd International Conference on Pattern Recognition, pp. 34–39. ISSN 1051-4651. Cited by: Deep Metric Learning., Deep Metric Learning..
  • [50] Q. You, Z. Zhang, and J. Luo (2018) End-to-end convolutional semantic embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5735–5744. Cited by: Table 2.
  • [51] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, pp. 67–78. Cited by: Table 1, Experimental Setups.
  • [52] R. Yu, Z. Dou, S. Bai, Z. Zhang, Y. Xu, and X. Bai (2018) Hard-aware point-to-set deep metric for person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 188–204. Cited by: Max-margin Loss (Max)..
  • [53] L. Zhang, T. Xiang, and S. Gong (2017) Learning a deep embedding model for zero-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2021–2030. Cited by: Tackling the Hubness Problem..