VSE-ens: Visual-Semantic Embeddings with Efficient Negative Sampling

Joint visual-semantic embeddings (VSE) have become a research hotspot for the task of image annotation, which suffers from the issue of the semantic gap, i.e., the gap between images' visual features (low-level) and labels' semantic features (high-level). This issue becomes even more challenging if visual features cannot be retrieved from images, that is, when images are only denoted by numerical IDs, as in some real datasets. Existing VSE methods typically perform uniform sampling of negative examples that violate the ranking order against positive examples, which requires a time-consuming search over the whole label space. In this paper, we propose a fast adaptive negative sampler that works well in settings where no image pixels are available. Our sampling strategy is to choose the negative examples that are most likely to meet the requirements of violation according to the latent factors of images. In this way, our approach can linearly scale up to large datasets. The experiments demonstrate that our approach converges 5.02x faster than the state-of-the-art approaches on OpenImages, 2.5x on IAPR-TC12 and 2.06x on NUS-WIDE, while also achieving better ranking accuracy across datasets.







Automatic image annotation is an important task for indexing and searching images of interest from the overwhelming volume of images produced by digital devices. It aims to select a small set of appropriate labels or keywords (i.e., annotations) from a given dictionary that help describe the content of a target image. However, it is not trivial to handle the differences between low-level visual features of images and high-level semantic features of annotations, which is well recognized as the problem of the semantic gap. This issue becomes even more challenging if no visual features can be extracted from image pixels, that is, when images are only represented by numerical IDs rather than pixel values. This problem setting can be observed in some real datasets, and it is the target scenario of this paper.

A promising way to resolve this issue is to jointly embed images and annotations into the same latent feature space, a.k.a. visual-semantic embeddings (VSE) [Weston, Bengio, and Usunier 2011; Faghri et al. 2017]. Since both images and annotations are represented by the same set of latent features, their semantic differences can be computed in the same space. Existing VSE methods are derived in the form of pairwise learning approaches. That is, for each image, a set of pairwise (positive, negative) annotations is retrieved to learn a proper pattern to represent the image. Due to the large volume of negative candidates, sampling strategies are necessary to form balanced training data. The most frequently adopted strategy, e.g., in [Weston, Bengio, and Usunier 2011], is to repeatedly sample negative labels from the dictionary until one violates the ranking order against the positive example. However, the whole annotation space may need to be traversed before a good negative example is found. In short, it is time-consuming and thus cannot be applied to large-scale datasets.

In this paper, we propose a fast adaptive negative sampler for the task of image annotation based on joint visual-semantic embeddings (VSE). It functions well in problem settings where no image pixels are available. Instead of traversing the whole annotation set to find good negative examples, we selectively choose those labels that are most likely to meet the requirements of violation according to the latent factors of images and annotations. Specifically, our proposed sampler adopts a rank-invariant transformation to dynamically select the required high-ranked negative labels without conducting inner product operations on the embedding vectors. In this way, the running time of negative sampling can be dramatically reduced. We conduct extensive experiments on three real datasets (OpenImages: https://github.com/openimages/dataset, IAPR-TC12: http://www.imageclef.org/photodata, NUS-WIDE: http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm) to demonstrate the efficiency of our approach. The results show that our method is 5.02 times faster than other state-of-the-art approaches on OpenImages, around 2.5 times on IAPR-TC12 and 2.06 times on NUS-WIDE, at no expense of ranking accuracy.

The main contributions of this paper are as follows.

  • We propose a fast adaptive sampler to select good negative examples for the task of image annotation. It adopts a rank-invariant transformation to dynamically choose highly ranked negative labels, whereby the time complexity can be greatly reduced.

  • We provide the corresponding proof to show that the proposed sampling is theoretically equivalent to inner product based negative sampling, which ensures comparable and even better ranking accuracy.

  • We conduct a series of experiments on three real image-annotation datasets. The results further confirm that our approach runs much faster than its counterparts while achieving comparable or better ranking accuracy.


In what follows, we first introduce visual-semantic embeddings. Then we summarize the typical negative sampling algorithm used in WARP [Weston, Bengio, and Usunier 2011] and point out its inefficiency issue.

Visual-Semantic Embedding

Following WARP, we start with a representation of images $x \in \mathcal{X}$ and a representation of annotations $a \in \mathcal{Y}$, where $a$ indicates an annotation of a dictionary $\mathcal{Y}$. Let $S = \{(x, a)\}$ denote a training set of image-annotation pairs. We refer to $(x, a) \in S$ as positive pairs, and to $(x, a')$ with $(x, a') \notin S$ as negative pairs (that is, the annotation $a'$ is not labeled on image $x$). $s(x, a)$ is an inner product function that calculates a relevance score of an annotation for a given image under the VSE space. $W$ denotes the embedding matrix of both images and annotations, where $U \in \mathbb{R}^{|\mathcal{X}| \times k}$ corresponds to the image embedding matrix, $V \in \mathbb{R}^{|\mathcal{Y}| \times k}$ corresponds to the annotation embedding matrix, and $k$ is the embedding dimension. Meanwhile, we have the function $\Phi_I$ that maps the image feature space $\mathcal{X}$ to the embedding space $\mathbb{R}^k$, and $\Phi_A$ that jointly maps the annotation space from $\mathcal{Y}$ to $\mathbb{R}^k$. Assuming a linear map is chosen for $\Phi_I$ and $\Phi_A$, we have $\Phi_I(x) = U_x$ and $\Phi_A(a) = V_a$, where $U_x$ and $V_a$ are the $x$-th row of $U$ and the $a$-th row of $V$, respectively.

Hence, we consider the scoring function as follows:

$$s(x, a) = \langle U_x, V_a \rangle = \sum_{f=1}^{k} U_{x,f} \, V_{a,f} \qquad (1)$$

where $f$ indexes the embedding factors, and the magnitude of $s(x, a)$ denotes the relevance between $x$ and $a$. The goal of VSE is to score the positive pairs higher than the negative pairs. With this in mind, we consider the task of image annotation as a standard ranking problem.
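As a small illustration, the scoring function amounts to a single inner product between two embedding rows. A minimal sketch in Python/NumPy follows, with the array names (`U`, `V`) being illustrative rather than taken from the paper's code:

```python
import numpy as np

def score(U, V, x, a):
    """Relevance of annotation a for image x: the inner product of the
    image's and the annotation's k-dimensional embedding rows."""
    return float(np.dot(U[x], V[a]))

# Toy embeddings: 3 images, 4 annotations, k = 2 latent factors.
rng = np.random.default_rng(0)
U = rng.standard_normal((3, 2))   # image embedding matrix
V = rng.standard_normal((4, 2))   # annotation embedding matrix
s = score(U, V, 0, 1)             # relevance of annotation 1 for image 0
```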

The WARP Model

WARP [Weston, Bengio, and Usunier 2011] is known as a classical optimization approach for joint visual-semantic embeddings, where a weighted approximate-rank pairwise loss is applied. The loss function is generally defined by Eq. 2, which enables the optimization of precision at the top positions by stochastic gradient descent (SGD):

$$\sum_{(x,a) \in S} L\big(\mathrm{rank}_a(x)\big) \qquad (2)$$

where $\mathrm{rank}_a(x)$ is a function to measure how many negative annotations $a'$ are 'wrongly' ranked higher than the positive one $a$, given by:

$$\mathrm{rank}_a(x) = \sum_{a' \neq a} I\big[s(x, a') + 1 \ge s(x, a)\big]$$

where $I[\cdot]$ is an indicator function. The function $L(\cdot)$ transforms the rank into a loss, defined by:

$$L(r) = \sum_{j=1}^{r} \alpha_j, \quad \text{with } \alpha_1 \ge \alpha_2 \ge \cdots \ge 0 \qquad (3)$$

where $\alpha_j$ defines the importance of the relative position of the positive example in the ranked list, e.g., a constant $\alpha_j$ is used to optimize the mean rank.

The overall risk that needs to be minimized is given by:

$$R = \sum_{(x,a)} p(x, a) \, L\big(\mathrm{rank}_a(x)\big)$$

where $p(x, a)$ indicates the probability distribution of the positive image-annotation pair $(x, a)$, which is a uniform distribution in WARP.

Figure 1: The number of required negative samples of the WARP model as the SGD iterations increase.

Negative Sampling

An unbiased estimator of the above risk can be obtained by stochastically sampling in the following steps:

  1. Sample a positive pair $(x, a)$ according to $p(x, a)$.

  2. Repeatedly sample an annotation $a' \in \mathcal{Y}$ uniformly until it satisfies:

$$s(x, a') + 1 > s(x, a) \qquad (4)$$

The chosen triplet $(x, a, a')$ then contributes to the total risk:

$$L\Big(\Big\lfloor \tfrac{|\mathcal{Y}| - 1}{T} \Big\rfloor\Big) \big(1 + s(x, a') - s(x, a)\big)$$

where $T$ is the number of sampling trials needed to find $a'$. The sampling strategy in step 2 generally implies that the learning algorithm concentrates merely on negative annotations with higher scores, i.e., those satisfying Eq. 4. The idea is intuitively correct, since negative examples with higher scores are more likely to be ranked higher than positive ones, and thus result in a larger loss [Yuan et al. 2016, Yuan et al. 2017]. Hence, as long as the learning algorithm can distinguish these higher-scored negative examples, the loss is expected to diminish to a large extent.
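The rejection loop in step 2 can be sketched as follows. This is a hedged illustration of the WARP-style sampler rather than the authors' code; the margin of 1 follows the violation condition above, and the trial cap is our own safeguard:

```python
import numpy as np

def warp_sample_negative(U, V, x, a_pos, rng, margin=1.0):
    """Draw annotations uniformly until one violates the margin, i.e.
    score(x, a') + margin > score(x, a_pos).  Returns (a_neg, trials);
    `trials` is the quantity Figure 1 counts, and it grows as training
    makes violations rarer."""
    n_labels = V.shape[0]
    s_pos = float(U[x] @ V[a_pos])
    trials = 0
    while trials < n_labels:                 # give up after |Y| trials
        a_neg = int(rng.integers(n_labels))  # uniform draw from the dictionary
        trials += 1
        if a_neg != a_pos and float(U[x] @ V[a_neg]) + margin > s_pos:
            return a_neg, trials
    return None, trials
```

Each loop iteration costs one inner product, which is exactly the O(k)-per-trial cost analyzed below.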

Efficiency Issue of the WARP Sampler

Even though WARP has been successfully applied in various VSE scenarios, we show in the following that the computational cost of WARP sampling is expensive, in particular after the model has been trained for several iterations. As depicted in step 2, a repeated sampling procedure has to be performed until a required negative example is observed. The computational complexity of scoring a negative pair in Eq. 4 is $O(k)$. In the beginning, since the model is not well trained, it is easy to find a violating negative example that has a higher score than the positive one, which leads to a complexity of $O(Tk)$, where $T$ is the average number of sampling trials. However, after several training iterations, most positive pairs are likely to have higher scores than negative ones, and thus $T$ becomes much bigger, with the complexity up to $O(|\mathcal{Y}|k)$, where $|\mathcal{Y}|$ is the size of the whole annotation set. For each SGD update, the sampler may have to iterate over all negative examples in the whole annotation collection, which is computationally prohibitive for large-scale datasets.

Experimentation on the Efficiency Issue

According to Eq. 4, the WARP sampler always attempts to find a violating annotation for a given image-annotation pair. Along with the convergence of WARP training, most annotations no longer meet the violation condition (i.e., $s(x, a') + 1 \le s(x, a)$ for most $a'$), and thus it takes a longer time per iteration to find the expected violating annotation. To verify our analysis, we conduct experiments on three datasets to count the number of required negative samples in the WARP model, as illustrated in Figure 1. We defer the detailed description of the datasets to the evaluation section.

Specifically, the number of required negative samples increases drastically before the 13th iteration, taking over 2,000 repeated samplings to find an appropriate example. After that, the number stays high at about 2,100 on the OpenImages dataset. For the NUS-WIDE dataset, before the 15th iteration, the required number of samplings grows rapidly up to 870 and then stabilizes at around 900. Analogously, the number of negative samplings quickly increases at the beginning and then stays stable at a high value of around 1,600 on the IAPR-TC12 dataset.

To sum up, the WARP sampler will become slower and slower as the SGD update iterations accumulate. Hence, we aim to resolve this issue in this paper by proposing a novel and efficient negative sampling method for the VSE field.

Fast Sampling Algorithm

In fact, similar samplers have been adopted not only in the visual-semantic embedding task but also in many other fields. For example, [Weston et al. 2012], [Hsiao, Kulesza, and Hero 2014] and [Li et al. 2015] successfully applied the WARP loss function to collaborative retrieval/filtering tasks and achieved state-of-the-art results. Inspired by this, we attempt to adapt the sampling strategy of [Rendle and Freudenthaler 2014] to solve the above-mentioned inefficiency issue of the original sampler in our visual-semantic embedding task, which is a different research domain from [Rendle and Freudenthaler 2014]. (Though our previous AAAI version [Guo et al. 2018] cited [Rendle and Freudenthaler 2014] for the purpose of relevance, we would like to make a further clarification here to avoid potential misunderstanding. Also note that the sampling idea in [Rendle and Freudenthaler 2014] was claimed to improve only BPR-style [Rendle et al. 2009] learners, which are based on the negative log-likelihood loss, whereas WARP is actually a different one (see [Gao and Zhou 2014]), with a non-smooth, non-differentiable loss and different gradients.) In this work, we aim to study the effectiveness of this alternative sampling strategy in speeding up the sampling process and improving the performance boundaries.

Naive Sampling

As aforementioned, the major computational cost of WARP is caused by the repeated inner product operations in Eq. 4, each of which has a complexity of $O(k)$.

In the following, an alternative sampler with a fast sampling rate is derived, which has the same intuition as the negative sampler in WARP: considering a negative example $a'$ for a given positive pair $(x, a)$, the higher the score $s(x, a')$, the more chance $a'$ should be sampled. Instead of using the notion of a large score, we opt to formalize a small predicted rank, because the largeness of scores is relative to other examples, whereas ranks are absolute values. This allows us to formulate a sampling distribution based on annotation ranks such that higher-ranked annotations have a larger chance of being selected, e.g., an exponential distribution:

$$p(r) \propto \exp(-r/\lambda), \quad \lambda \in \mathbb{R}^+ \qquad (5)$$

In practice, this distribution can be replaced with other analytic distributions, such as geometric and linear distributions.

Hence, a naive sampling algorithm can be easily implemented by:

  1. Adopting the exponential distribution in Eq. 5 to sample a rank $r$.

  2. Returning the annotation currently at ranking position $r$, i.e., finding $a'$ with $\mathrm{rank}(a' \mid x) = r$, or $a' = \mathrm{rank}^{-1}(r \mid x)$.

However, it should be noted that this trivial sampling method has to compute $s(x, a')$ for all $a' \in \mathcal{Y}$, then sort the annotations by their scores and return the one at place $r$. This algorithm has a complexity of $O(|\mathcal{Y}| k + |\mathcal{Y}| \log |\mathcal{Y}|)$ for each SGD step, which is clearly infeasible in practice.
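For concreteness, the naive sampler might look like the following sketch; the names and the truncation of the exponential draw are our own illustrative choices:

```python
import numpy as np

def naive_sample_negative(U, V, x, lam, rng):
    """Naive rank-based sampler: score and sort ALL annotations, then draw
    a rank from a (truncated) exponential distribution so that highly
    scored annotations are picked more often.
    Costs O(|Y|k + |Y|log|Y|) per draw, which motivates the fast sampler."""
    scores = V @ U[x]                      # O(|Y| k): score every annotation
    order = np.argsort(-scores)            # O(|Y| log |Y|): descending sort
    r = min(int(rng.exponential(lam)), len(order) - 1)  # rank ~ Exp, truncated
    return int(order[r])
```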

Motivated by this, we introduce a more efficient sampling method in the following. The basic idea of our proposed sampler is to formalize Eq. 5 as a mixture of ranking distributions over normalized embedding factors, such that the expensive inner product operations can be avoided. The mixture probability is derived from a normalized version of the inner product operation in Eq. 1.

Rank-Invariant Transformation

According to Eq. 1, a transformation of $s(x, a)$ can be defined by:

$$\tilde{s}(x, a) = \sum_{f=1}^{k} p(f \mid x) \cdot \mathrm{sgn}(U_{x,f}) \cdot \tilde{V}_{a,f} \qquad (6)$$

where $p(f \mid x)$ is the probability function that denotes the importance of the latent dimension $f$ for the image $x$ (the larger $|U_{x,f}|$ and $\sigma_f$, the more important dimension $f$):

$$p(f \mid x) = \frac{|U_{x,f}| \, \sigma_f}{\sum_{f'=1}^{k} |U_{x,f'}| \, \sigma_{f'}} \qquad (7)$$

and $\tilde{V}_{a,f}$ is a standardized label factor, if we assume $V_{\cdot,f}$ corresponds to the normal distribution $\mathcal{N}(\mu_f, \sigma_f^2)$:

$$\tilde{V}_{a,f} = \frac{V_{a,f} - \mu_f}{\sigma_f} \qquad (8)$$

where $\mu_f$ and $\sigma_f$ are the empirical mean and standard deviation over all labels' factors, given by:

$$\mu_f = \frac{1}{|\mathcal{Y}|} \sum_{a \in \mathcal{Y}} V_{a,f}, \qquad \sigma_f = \sqrt{\frac{1}{|\mathcal{Y}|} \sum_{a \in \mathcal{Y}} \big(V_{a,f} - \mu_f\big)^2}$$

The main idea is that the ranking derived from scoring with $s(x, a)$ has the same effect as the ranking from $\tilde{s}(x, a)$. We can prove this as follows:

$$s(x, a) = \sum_{f} U_{x,f} \, V_{a,f} = \sum_{f} U_{x,f} \big(\sigma_f \tilde{V}_{a,f} + \mu_f\big) = \sum_{f} |U_{x,f}| \, \sigma_f \, \mathrm{sgn}(U_{x,f}) \, \tilde{V}_{a,f} + \sum_{f} U_{x,f} \, \mu_f$$

Note that the second term is independent of the label $a$, so we can treat it as a constant; the first term equals $\tilde{s}(x, a)$ up to the positive normalizing constant $\sum_{f'} |U_{x,f'}| \sigma_{f'}$. In other words, the ranks generated by $s(x, a)$ will be equal to those generated by $\tilde{s}(x, a)$, i.e., $\mathrm{rank}_s(a) = \mathrm{rank}_{\tilde{s}}(a)$.
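The rank-invariance is easy to check numerically. The sketch below follows the definitions above with illustrative NumPy names; since the transformed score equals the original score up to a positive scaling and a label-independent shift, the two orderings coincide:

```python
import numpy as np

rng = np.random.default_rng(1)
n_labels, k = 50, 8
u = rng.standard_normal(k)               # latent factors of one image (a row of U)
V = rng.standard_normal((n_labels, k))   # annotation embedding matrix

mu = V.mean(axis=0)                      # empirical mean per factor
sigma = V.std(axis=0)                    # empirical std per factor
V_std = (V - mu) / sigma                 # standardized label factors

p = np.abs(u) * sigma
p = p / p.sum()                          # p(f | x): importance of dimension f

s = V @ u                                # original scores (inner products)
s_tilde = V_std @ (p * np.sign(u))       # transformed scores (no raw inner product)

# Both scores induce exactly the same ordering over annotations.
same_ranking = bool((np.argsort(-s) == np.argsort(-s_tilde)).all())
```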

Sampler Function.

Since the ranks generated by $\tilde{s}(x, a)$ also work for $s(x, a)$, we can define our sampler function according to this characteristic. The representation of $p(f \mid x)$ in Eq. 6 indicates that the larger $|U_{x,f}| \, \sigma_f$ is, the more important dimension $f$ is for the specific image $x$. We can define the sampling distribution as follows:

$$p(a' \mid x) = \sum_{f=1}^{k} p(a' \mid x, f) \, p(f \mid x)$$

As $\tilde{V}_{a,f}$ has been standardized, we may define $p(a' \mid x, f)$ in the same manner as Eq. 5:

$$p(a' \mid x, f) \propto \exp\big(-\mathrm{rank}(a' \mid x, f)/\lambda\big)$$

Following Eq. 6, the scoring function under the given image $x$ and dimension $f$ can be defined by:

$$s(a' \mid x, f) = \mathrm{sgn}(U_{x,f}) \cdot \tilde{V}_{a',f}$$

According to the inference aforementioned, the above sampler function can be written as follows:

$$p(a' \mid x) \propto \sum_{f=1}^{k} \exp\big(-\mathrm{rank}(a' \mid x, f)/\lambda\big) \cdot \frac{|U_{x,f}| \, \sigma_f}{\sum_{f'} |U_{x,f'}| \, \sigma_{f'}} \qquad (9)$$

From our sampler function, we can observe an intuitive relation between the rank $r$ and the factor values: the label on rank $r$ has the $r$-th largest factor $V_{a',f}$ if $U_{x,f}$ is positive; otherwise it has the $r$-th largest negative (i.e., $r$-th smallest) factor.

Process of Sampling

According to our sampler function (Eq. 9), the process of sampling negative labels is elaborated as follows:

  1. Draw a rank $r$ from an exponential distribution, e.g., $p(r) \propto \exp(-r/\lambda)$.

  2. Draw the embedding dimension $f$ from $p(f \mid x)$.

  3. Sort labels according to $V_{\cdot,f}$. Due to the rank-invariant property, this is equivalent to an inverse ranking function $\mathrm{rank}^{-1}(\cdot \mid x, f)$.

  4. Return the label at position $r$ in the sorted list according to the sign of $U_{x,f}$, i.e., the label with the $r$-th largest $V_{\cdot,f}$ if $U_{x,f} > 0$, or the $r$-th smallest if $U_{x,f} < 0$.
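The four steps above can be sketched as follows, with the per-factor statistics and sorted lists precomputed so that no inner products over the whole label set are needed at sampling time. The names and the truncated exponential draw are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def precompute(V):
    """Per-factor mean/std and label lists sorted by each factor.
    Recomputed only occasionally, so its cost is amortized over many draws."""
    mu = V.mean(axis=0)
    sigma = V.std(axis=0)
    order = np.argsort(-V, axis=0)   # order[r, f] = label with r-th largest V[:, f]
    return mu, sigma, order

def fast_sample_negative(u, sigma, order, lam, rng):
    """Steps 1-4: draw a rank and a dimension, then read the pre-sorted list."""
    n_labels = order.shape[0]
    r = min(int(rng.exponential(lam)), n_labels - 1)   # step 1: rank ~ Exp
    p = np.abs(u) * sigma
    f = int(rng.choice(len(u), p=p / p.sum()))         # step 2: f ~ p(f | x)
    if u[f] >= 0:                                      # steps 3-4: read sorted list
        return int(order[r, f])                        # r-th largest factor value
    return int(order[n_labels - 1 - r, f])             # r-th smallest factor value
```

Note that a single draw touches only one row of the pre-sorted table, instead of scoring all of the labels.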

1:  Randomly initialize U, V; t ← 0
2:  repeat
3:     t ← t + 1
4:     if t mod |Y| = 1 then
5:        // refresh the pre-computed statistics and rankings
6:        for f = 1, ..., k do
7:           Compute μ_f and σ_f
8:           Compute rank⁻¹(· | f) by sorting labels on V_{·,f}
9:        end for
10:     end if
11:     Draw a positive pair (x, a) from p(x, a)
12:     Draw r from p(r) ∝ exp(−r/λ)
13:     Draw f from p(f | x)
14:     if sgn(U_{x,f}) = 1 then
15:        a′ ← label with the r-th largest V_{·,f}
16:     else
17:        a′ ← label with the r-th smallest V_{·,f}
18:     end if
19:     for θ ∈ {U_x, V_a, V_{a′}} do
20:        Update θ by an SGD step on the triplet (x, a, a′)
21:     end for
22:  until convergence
23:  return U, V
Algorithm 1 VSE-ens with fast negative sampling

In the process, it takes $O(1)$ to perform steps 1 and 4, and only $O(k)$ to compute $p(f \mid x)$ in step 2. However, step 3 is computationally expensive to perform, since the factors are sorted in $O(|\mathcal{Y}| \log |\mathcal{Y}|)$.

It would take too much time if we had to re-sort the ranks for every dimension $f$ at every step in order to get the negative examples. Instead, we opt to further reduce the complexity by pre-computing the rankings once every $|\mathcal{Y}|$ iterations. This is reasonable because the ordering changes only a little per update, and many update steps are necessary to change the pre-computed ranking considerably. As a result, the overall sorting cost can be amortized over $|\mathcal{Y}|$ iterations. In other words, the additional complexity is just $O(k \log |\mathcal{Y}|)$ for each SGD update.

To sum up, the sampling algorithm takes an amortized amount of computational time to sample a negative annotation that is comparable to the cost of a single SGD step. As a result, the proposed sampling and SGD process together do not increase the computational cost by much.

Algorithm 1 sketches the pseudocode of the improved learning algorithm. To explain, several arguments are taken as input, including the model parameters $U$ and $V$, the collection of images $\mathcal{X}$, the collection of annotations $\mathcal{Y}$, and the distribution parameter $\lambda$. Firstly, we precompute $\mu_f$, $\sigma_f$ and the sorted rankings with an amortized constant time (lines 7 and 8). Then, we sample an image-annotation pair (line 11) and draw a rank $r$ for the negative annotation (line 12). Next, we choose a factor $f$ of the annotation embedding space according to $p(f \mid x)$ (line 13) and get a negative annotation according to the sign of $U_{x,f}$ (lines 14-18). Finally, we adopt the popular stochastic gradient descent (SGD) algorithm to train our model and update the model parameters (lines 19-21) until convergence.

Example of Negative Sampling: As shown in Figure 2, suppose we have 5 images with 10 annotations in the training dataset, and set the number of embedding factors to 5. Following Algorithm 1, our model first ranks these annotations according to $V_{\cdot,f}$ for each dimension $f$, and computes the values of $\mu_f$ and $\sigma_f$ at the first iteration. Then, it randomly chooses a positive image-annotation pair, e.g., the 1st image and the 2nd annotation, denoted as $(x_1, a_2)$. After this, the negative sampler samples a rank $r$ according to the designed distribution, and a dimension $f$ according to $p(f \mid x_1)$. Finally, we return the negative example according to the sign of $U_{x_1,f}$, i.e., choosing from the ranked list the annotation with the $r$-th largest $V_{\cdot,f}$ if $U_{x_1,f} > 0$, and the $r$-th smallest if $U_{x_1,f} < 0$.

Figure 2: Example of our adaptive negative sampling

Experiments and Results


Three real datasets are used in our evaluation, namely OpenImages, NUS-WIDE and IAPR-TC12. OpenImages was introduced by [Krasin, Duerig, and Alldrin 2017] and contains 9 million URLs to images that have been annotated with image-level labels. NUS-WIDE [Chua et al. 2009] was collected at the National University of Singapore and is composed of 269,648 images annotated with 81 ground-truth concept labels and more than 5,000 labels in total. IAPR-TC12, produced by [Grubinger et al. 2006], has 19,627 images of natural scenes such as sports, people, animals, cities and other contemporary scenes. Each image is annotated with an average of 5.7 labels out of 291 candidates. The statistics of the three datasets are presented in Table 1, where the rows 'Train' and 'Test' indicate the number of image-annotation pairs in the training and test sets, respectively.

Feature OpenImages NUS-WIDE IAPR-TC12
Images 112,247 269,648 19,627
Labels 6,000 5,108 291
Train 887,752 2,018,879 79,527
Test 112,247 267,642 20,000
Table 1: The statistics of our datasets

Experimental Setup

We implemented our approach and compared it with the following two strong baselines.

  • WARP [Weston, Bengio, and Usunier2011] uses a negative sampling based weighting approximation (see Eq. 3) to optimize standard ranking metrics, such as precision.

  • Opt-AUC optimizes the Area Under the ROC Curve (AUC), using the logistic loss as a smoothed AUC surrogate.

We adopt the leave-one-out evaluation protocol. That is, we randomly select an annotation from each image for evaluation and leave the rest for training. All reported results use the same embedding dimension of k = 100. For VSE-ens, the model variables are initialized by a normal distribution on all three datasets, and the distribution parameter $\lambda$ is tuned from 0.001 to 1.0 to find the best value. The learning rate and regularization settings of the models are tuned from 0.001 to 0.1 to search for the optimal values.

Dataset Model Pre@5 Rec@5 Pre@10 Rec@10 MAP AUC
OpenImages VSE-ens 0.0574 0.2869 0.0434 0.4342 0.1762 0.7168
OpenImages WARP 0.0526 0.2628 0.0390 0.3900 0.1676 0.6948
OpenImages Opt-AUC 0.0188 0.0938 0.0147 0.1465 0.0564 0.5732
OpenImages Improve 9.13% 9.17% 11.28% 11.33% 5.13% 3.17%
NUS-WIDE VSE-ens 0.0278 0.1391 0.0198 0.1982 0.0893 0.5990
NUS-WIDE WARP 0.0107 0.0533 0.0083 0.0830 0.0336 0.5415
NUS-WIDE Opt-AUC 0.0035 0.0177 0.0028 0.0279 0.0113 0.5139
NUS-WIDE Improve 159.81% 160.98% 138.55% 138.80% 165.77% 10.62%
IAPR-TC12 VSE-ens 0.0598 0.2990 0.0436 0.4364 0.1836 0.7126
IAPR-TC12 WARP 0.0595 0.2976 0.0428 0.4278 0.1796 0.7086
IAPR-TC12 Opt-AUC 0.0543 0.2713 0.0414 0.4136 0.1629 0.7011
IAPR-TC12 Improve 0.50% 0.47% 1.87% 2.01% 2.23% 0.56%
Table 2: The ranking accuracy of the comparison methods, where the 'Improve' row of each dataset indicates the improvement our approach achieves relative to WARP.

Evaluation Metrics

We use four widely used ranking metrics to evaluate the performance of all comparison methods; generally, the higher the metrics, the better the performance. The first two ranking metrics are precision@N and recall@N (denoted by Pre@N and Rec@N). We set N = 5 and N = 10 for ease of comparison in our experiments.

$$Pre@N = \frac{TP}{TP + FP}, \qquad Rec@N = \frac{TP}{TP + FN}$$

where $TP$ is the number of annotations contained in both the ground truth and the top-N results produced by the algorithm; $FP$ is the number of annotations in the top-N results but not in the ground truth; and $FN$ is the number of annotations contained in the ground truth but not in the top-N results.
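These two metrics reduce to simple set intersections. A small sketch in Python (a hypothetical helper, not the paper's evaluation code), where `ranked` is the algorithm's ranked annotation list and `ground_truth` the held-out annotations:

```python
def precision_recall_at_n(ranked, ground_truth, n):
    """Pre@N = TP / N and Rec@N = TP / |ground truth|, where TP is the
    number of annotations shared by the top-N list and the ground truth."""
    tp = len(set(ranked[:n]) & set(ground_truth))
    return tp / n, tp / len(ground_truth)
```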

We also report the results in Mean Average Precision (MAP) and Area Under the Curve (AUC), which take into account all the image labels to evaluate the full ranking.

$$MAP = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} AP(x)$$

where $\mathcal{D}$ denotes the sample space and $x$ is an example of $\mathcal{D}$; $AP(x)$ is the average precision for $x$, computed from the Precision and Recall defined above.

$$AUC = \frac{1}{|\mathcal{T}|} \sum_{(x, a, a') \in \mathcal{T}} \sigma\big(s(x, a) - s(x, a')\big)$$

where $\mathcal{T}$ denotes the set of evaluation triplets; $\sigma(\cdot)$ is a sigmoid function, and the difference $s(x, a) - s(x, a')$ aims to capture the relationship between the positive annotation $a$ and the negative annotation $a'$ for image $x$.

Comparison in Training Time

We compare the different models in terms of training time. Specifically, Table 3 summarizes the theoretical time complexity of all comparison methods per SGD update, and Table 4 shows the actual training time on the OpenImages, NUS-WIDE and IAPR-TC12 datasets. The results show that our approach gains up to 5.02 times improvement in training time over the other comparison methods on the OpenImages dataset.

Model Time Complexity
VSE-ens O(k log |Y|)
WARP O(Tk)
Opt-AUC O(k)
Table 3: The theoretical time complexity of the comparison models for each SGD update, where k is the size of the embedding space and T is the average number of trials for negative sampling.
Model OpenImages NUS-WIDE IAPR-TC12
VSE-ens 7.1h 24.83h 0.95h
WARP 35.65h 51.46h 2.38h
Opt-AUC 10.13h 25.06h 1.82h
Table 4: Training time comparison on the three datasets

In Table 3, our model precomputes the rankings once every $|\mathcal{Y}|$ SGD updates (as described in Algorithm 1), which can be finished in $O(k \log |\mathcal{Y}|)$ amortized runtime. It then draws a rank $r$ for the negative sample in $O(1)$ and a latent factor $f$ in $O(k)$, resulting in an additional time complexity of around $O(k \log |\mathcal{Y}|)$. For the WARP model, most time is consumed by, and determined by, the negative sampling, which can be noted as $O(Tk)$. For Opt-AUC, although the time complexity of each SGD update is the lowest among these models, it takes more training iterations to converge, since most negative examples selected by the uniform sampler are not informative.

In Table 4, our VSE-ens spends 7.1 hours training on the OpenImages dataset, whereas WARP costs about 5 times more training time. On the NUS-WIDE dataset, our model is about 2x faster than WARP, and a similar observation can be made on the IAPR-TC12 dataset. Besides, our proposed sampling also consistently takes less time than Opt-AUC, because VSE-ens requires far fewer iterations to reach convergence. More specifically, our approach reaches a stable status and converges at around 200 iterations, WARP costs 150 iterations (but is more costly per iteration), and Opt-AUC takes around 800 iterations to complete the optimization in our experiments.

Comparison in Ranking Accuracy

The ranking accuracy of all comparison models is shown in Table 2, where the percentage improvement that our approach gains relative to WARP is also presented in the last row of each dataset. In general, our model achieves the best ranking accuracy. Specifically, WARP is a stronger baseline than Opt-AUC, as it achieves higher ranking accuracy across all datasets. Our VSE-ens model outperforms WARP on all testing datasets, often by a large margin. In particular, the improvements on NUS-WIDE are the most significant, reaching around 166% in terms of MAP. This implies that our adaptive negative sampler is more effective than the uniform samplers used by WARP and Opt-AUC. Note that the amount of improvement varies considerably across datasets, which may be due to their different statistics and requires further study as part of our future work.

In conclusion, our VSE-ens approach can not only greatly reduce the training time of sampling positive-negative annotation pairs for each image, but also effectively improve the performance of image annotation in comparison with other counterparts across a number of real datasets.

Related Work

Many approaches have been proposed in the literature to resolve the issue of the semantic gap in the task of image annotation. In general, these approaches can be roughly classified into three types, namely (1) manual annotation, (2) semi-automatic annotation and (3) automatic annotation. Manual annotation requires users to provide descriptive keywords for the browsed images, which are often regarded as the ground truth of the corresponding datasets. However, manpower is often very expensive, and manual annotation becomes intractable when facing a huge number of images.

Semi-automatic annotation can produce automatic annotations to some extent, but also requires building fundamental structures with human involvement. For example, [Marques and Barman 2003] propose a layered structure to build an image ontology for annotations, where low-level features of images are selected by the bottom layer. By abstracting low-level features up to high-level features, it connects the semantic features of images with appropriate annotations. However, building an image ontology requires expert knowledge and may be domain-specific. [Zhang, Li, and Xue 2010] formulate image annotation as a multi-label learning problem and develop a semi-automatic annotation system. For a given image, their system initially chooses some keywords from a vocabulary as labels, and then refines these labels in light of user feedback.

Most existing works follow the direction of automatic image annotation, which provides the greatest flexibility and the least involvement of human users. To this end, some researchers make use of textual information for image annotation. [Deschacht, Moens, and others 2007] present a novel approach to annotate images by their associated text. It first determines the salient and attractive parts of the text, from which semantic entities (e.g., persons and objects) are then extracted and classified. [Verma and Jawahar 2012] propose a two-step variant of the K-nearest neighbor approach, where the first step learns image-to-label similarities and the second learns image-to-image similarities. Both kinds of similarities are combined to help annotate an image with proper labels. [Uricchio et al. 2017] propose a label propagation framework based on kernel canonical correlation analysis. It builds a latent semantic space where correlations of visual and textual features are well preserved.

For visual-semantic embeddings, [Frome et al. 2013] develop a deep visual-semantic embedding model which transfers the semantic knowledge learned in a textual domain to a deep neural network trained for visual object recognition. [Yu, Pedrycz, and Miao 2013] propose a multi-label classification method for automatic image annotation. It takes into consideration the uncertainty of mapping the visual feature space to the semantic concept space based on neighborhood rough sets, and the label set of a given image is determined by maximum a posteriori (MAP) principles. [Ren et al. 2015] introduce a multi-instance visual-semantic embedding model to embed images with a single or multiple labels. This approach first constructs the image subregion set and then builds the region-to-label correspondence. [Kiros, Salakhutdinov, and Zemel 2014] describe a framework of encoder-decoder models to address the problem of image caption generation, where the encoder learns a joint image-sentence embedding using a long short-term memory (LSTM) network and the decoder generates novel descriptions from scratch with a new neural language model.

Different from the above works, our problem settings do not have associated text or content to describe images. Besides, our main focus is not to better model images, but to provide a better solution to find appropriate annotation pairs in shorter time, which may be beneficial for other models.


In this paper, we aimed to resolve the problem of slow negative sampling for visual-semantic embeddings. Specifically, we proposed an adaptive sampler to select highly ranked negative annotations by adopting a rank-invariant transformation, through which the time complexity can be greatly reduced. We showed that our proposed sampling was theoretically comparable with traditional negative sampling based on time-consuming inner products. Experimental results demonstrated that our approach outperformed other counterparts both in training time and ranking accuracy.


This work was supported by the National Natural Science Foundation for Young Scientists of China under Grant No. 61702084 and the Fundamental Research Funds for the Central Universities under Grant No. N161704001. We would like to thank Fartash Faghri for his insightful suggestions on visual-semantic embeddings.


  • [Chua et al.2009] Chua, T.-S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; and Zheng, Y. 2009. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, 48. ACM.
  • [Deschacht, Moens, and others2007] Deschacht, K.; Moens, M.-F.; et al. 2007. Text analysis for automatic image annotation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL), volume 7, 1000–1007.
  • [Faghri et al.2017] Faghri, F.; Fleet, D. J.; Kiros, R.; and Fidler, S. 2017. VSE++: improved visual-semantic embeddings. CoRR abs/1707.05612.
  • [Frome et al.2013] Frome, A.; Corrado, G. S.; Shlens, J.; Bengio, S.; Dean, J.; Mikolov, T.; et al. 2013. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (NIPS), 2121–2129.
  • [Gao and Zhou2014] Gao, W., and Zhou, Z.-H. 2014. On the consistency of AUC pairwise optimization.
  • [Grubinger et al.2006] Grubinger, M.; Clough, P.; Müller, H.; and Deselaers, T. 2006. The IAPR TC-12 benchmark: A new evaluation resource for visual information systems. In International Workshop OntoImage, volume 5, 10.
  • [Guo et al.2018] Guo, G.; Zhai, S.; Yuan, F.; Liu, Y.; and Wang, X. 2018. VSE-ens: Visual-semantic embeddings with efficient negative sampling. In AAAI.
  • [Hsiao, Kulesza, and Hero2014] Hsiao, K.-J.; Kulesza, A.; and Hero, A. 2014. Social collaborative retrieval. In Proceedings of the 7th ACM international conference on Web search and data mining, 293–302. ACM.
  • [Kiros, Salakhutdinov, and Zemel2014] Kiros, R.; Salakhutdinov, R.; and Zemel, R. S. 2014. Unifying visual-semantic embeddings with multimodal neural language models. CoRR abs/1411.2539.
  • [Krasin, Duerig, and Alldrin2017] Krasin, I.; Duerig, T.; and Alldrin, N. 2017. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages.
  • [Li et al.2015] Li, X.; Cong, G.; Li, X.-L.; Pham, T.-A. N.; and Krishnaswamy, S. 2015. Rank-geofm: A ranking based geographical factorization method for point of interest recommendation. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, 433–442. ACM.
  • [Marques and Barman2003] Marques, O., and Barman, N. 2003. Semi-automatic semantic annotation of images using machine learning techniques. In International Semantic Web Conference, 550–565. Springer.
  • [Ren et al.2015] Ren, Z.; Jin, H.; Lin, Z. L.; Fang, C.; and Yuille, A. L. 2015. Multi-instance visual-semantic embedding. CoRR abs/1512.06963.
  • [Rendle and Freudenthaler2014] Rendle, S., and Freudenthaler, C. 2014. Improving pairwise learning for item recommendation from implicit feedback. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining (WSDM), 273–282.
  • [Rendle et al.2009] Rendle, S.; Freudenthaler, C.; Gantner, Z.; and Schmidt-Thieme, L. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI), 452–461.
  • [Uricchio et al.2017] Uricchio, T.; Ballan, L.; Seidenari, L.; and Del Bimbo, A. 2017. Automatic image annotation via label transfer in the semantic space. Pattern Recognition 71:144–157.
  • [Verma and Jawahar2012] Verma, Y., and Jawahar, C. 2012. Image annotation using metric learning in semantic neighbourhoods. In Proceedings of the 12th European Conference on Computer Vision (ECCV), 836–849.
  • [Weston, Bengio, and Usunier2011] Weston, J.; Bengio, S.; and Usunier, N. 2011. WSABIE: Scaling up to large vocabulary image annotation. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI), volume 11, 2764–2770.
  • [Weston et al.2012] Weston, J.; Wang, C.; Weiss, R.; and Berenzweig, A. 2012. Latent collaborative retrieval. arXiv preprint arXiv:1206.4603.
  • [Yu, Pedrycz, and Miao2013] Yu, Y.; Pedrycz, W.; and Miao, D. 2013. Neighborhood rough sets based multi-label classification for automatic image annotation. International Journal of Approximate Reasoning 54(9):1373–1387.
  • [Yuan et al.2016] Yuan, F.; Guo, G.; Jose, J. M.; Chen, L.; Yu, H.; and Zhang, W. 2016. LambdaFM: Learning optimal ranking with factorization machines using lambda surrogates. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM), 227–236.
  • [Yuan et al.2017] Yuan, F.; Guo, G.; Jose, J. M.; Chen, L.; Yu, H.; and Zhang, W. 2017. BoostFM: Boosted factorization machines for top-n feature-based recommendation. In Proceedings of the 22nd International Conference on Intelligent User Interfaces (IUI), 45–54. ACM.
  • [Zhang, Li, and Xue2010] Zhang, S.; Li, B.; and Xue, X. 2010. Semi-automatic dynamic auxiliary-tag-aided image annotation. Pattern Recognition 43(2):470–477.