The large, hierarchical ImageNet dataset [1] presents new challenges and opportunities for computer vision. Although the dataset contains over 14 million images, only a fraction of them have bounding-box annotations and none have segmentations (object outlines). Automatically populating ImageNet with bounding-boxes or segmentations is a challenging problem, which has recently drawn attention [2, 3]. These annotations could be used as training data for problems such as object class detection [4], tracking [5] and pose estimation [6]. This use case makes it important for the annotations to be of high quality, otherwise models trained on them will fit their errors. Auto-annotation methods must therefore be capable of self-assessment, i.e. estimating the quality of their own outputs. Self-assessment makes it possible to return only accurate annotations, automatically discarding the rest.
In this paper we propose a method for transferring knowledge of object appearance from source images manually annotated with bounding-boxes to target images without them. Source and target images may come from different, but semantically related classes (sec. 2). We model the overlap of an image window with an object using Gaussian Process (GP) [7] regression. GPs are able to model highly non-linear dependencies in a non-parametric way. Thanks to the probabilistic nature of GPs, we are also able to infer the probability distribution of the overlap value for a given window. This makes it possible to take the uncertainty of the prediction into account. We then tackle the localization problem in a target image by picking a window that, with high probability, has high overlap with an object. The same principle is used for self-assessment.
Doing GP inference directly in a high-dimensional feature space such as HOG [4] is computationally infeasible at the large scale of ImageNet. Moreover, we would have to estimate thousands of hyper-parameters, risking overfitting to the source. Instead, we devise a new representation, coined Associative Embedding (AE), which embeds image windows in a low-dimensional space, where dimensions tend to correspond to aspects of object appearance (e.g. front view, close up, background). The AE representation is very compact: it embeds windows originally described by HOG [4] or Bag-of-Words [8] histograms into just 3 dimensions. This enables very efficient learning of GP hyper-parameters and inference. It also facilitates knowledge transfer by allowing the source annotations to spread directly along aspects of appearance.
A large scale experiment on a subset of ImageNet, containing 219 classes and million images, shows that our method outperforms state-of-the-art competitors [2, 3] and various baselines for object localization.
The remainder of the paper is organized as follows. Sec. 2 describes the configurations of source and target sets we consider. Sec. 3 formally introduces the setup and gives an overview of our method. Sec. 4 presents Associative Embedding, sec. 5 describes how we estimate window overlap with Gaussian Processes, and sec. 6 explains how the resulting score is used for object localization. Sec. 7 reports some implementation details. Related work is reviewed in sec. 8, followed by experimental results and conclusion in sec. 9.
2 Knowledge sources
The goal of this paper is to localize objects in a target set of images, which do not have manual bounding-box annotations. We tackle this by transferring knowledge from a source set of images that do have them. The target set is composed of all images of a certain target class without bounding-boxes. For some classes, ImageNet offers a rather large subset of images with annotations, so both the source and target sets can come from the same class. When a target class has few or no manual bounding-boxes, we can construct the source set from images of semantically related classes. Different combinations of source and target sets lead to different learning problems. This paper explores the following setups (fig. 1):
Self: source and target sets of images come from the same class, e.g. from koalas to koalas. This setup is close to a classic object detection problem . Here we have the additional knowledge that every target image contains an object of the class and we are given all target images at training time (transductive learning).
Siblings: the source set consists of images from classes semantically related to the target class, while the target class itself contributes no manual bounding-boxes to the source.
Family = self + siblings: source and target sets consist of images coming from the same mix of semantically related classes, e.g. from kangaroos and koalas to kangaroos and koalas. This setup aims at improving performance on related classes when both have some manual annotations by processing them simultaneously (multitask learning).
These source/target setups cover a broad range of knowledge transfer scenarios. Below we devise a model that covers all these scenarios in a unified manner. We expect an adequate model to perform better as more supervision is provided (with siblings having the worst and family having the best performance).
3 Overview of our method
In this section we define the notation and introduce our method at a high level, providing the details for each part later (sec. 4, 5, 6). Every image, in both source and target sets, is decomposed into a set of windows. To avoid considering every possible image window, we use the objectness technique [10] to sample windows per image. As shown in [10], these are enough to cover the vast majority of objects. Each source window $w$ is annotated with its overlap $L_w$, defined as its intersection-over-union (IoU) [9] with the ground-truth bounding-box of an object.
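As an illustration of the overlap annotation described above, the following minimal sketch computes the IoU of each window with a ground-truth box. The box layout `(x_min, y_min, x_max, y_max)` and the function names are assumptions made for this example, not part of the original method description.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Width and height of the intersection rectangle (zero if disjoint).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def annotate_overlaps(windows: List[Box], gt_box: Box) -> List[float]:
    """Overlap annotation of every source window with the ground-truth box."""
    return [iou(w, gt_box) for w in windows]
```

A window identical to the ground-truth box receives overlap 1, and a window disjoint from it receives overlap 0.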
The key idea of our method is to use GP regression to transfer overlap annotations from source windows to target ones. GP infers a probability distribution $P(L_w)$ over the overlap $L_w$ of a target window $w$ with the object (we slightly abuse notation here, dropping the conditioning on source data). Having a full distribution makes it possible to measure and control uncertainty. Consider the score function (corrected thanks to O. Russakovsky):
$$z(w, \lambda) = \max \gamma \quad \text{s.t.} \quad P(L_w \geq \gamma) \geq \lambda. \tag{1}$$
This score is the largest overlap that a window is predicted to have with at least probability $\lambda$. The higher the $\lambda$, the more certain we need to be before assigning a window a high score.
Performing GP inference in the high-dimensional space typical for common visual descriptors (e.g. HOG, bag-of-words) is computationally infeasible at the large scale of ImageNet. Moreover, it would require estimating thousands of hyper-parameters, risking overfitting. To reduce the dimensionality of the feature space without losing descriptive power, we introduce Associative Embedding (AE), a low-dimensional representation of windows (sec. 4). Consider an object exemplar (ground-truth bounding-box) from the source set and a window, both represented in the same feature space (e.g. HOG). First we describe the window in terms of its similarity to the exemplars in the source set. We quantify how similar the window is to an exemplar by the output of an Exemplar-SVM (E-SVM) [11] trained on that exemplar. Now every window is described by a vector of E-SVM outputs, one for each exemplar in the source set. Intuitively, this intermediate representation describes what a window looks like, rather than how it looks. The advantage of using E-SVMs over a standard similarity measure is their ability to suppress the influence of the background present in the window and to focus on exemplar-specific features.
Next, we embed all windows and exemplars into a low-dimensional space. We optimize the embedding to minimize the difference between scalar products in the original and embedded spaces. We call the resulting space Associative Embedding (AE). The dimensions of the AE space tend to correspond to aspects of the appearance of classes (e.g. close up, side view, partially occluded). The AE space enables the GP to transfer knowledge directly along these aspects, which greatly facilitates the process. For HOG and Bag-of-Words descriptors, an AE space with just 3 dimensions is sufficient to support this process.
Fig. 2 depicts the procedural flow of our method. Having embedded source and target windows into AE space, we use GP to infer the probability distribution of the overlap of each target window and to compute the score (eq. (1)). We employ this score to tackle the localization task in sec. 6: we simply return the highest scored window in each target image. We also reuse the score for self-assessment.
4 Associative embedding
This section details the proposed AE representation (fig. 4). The method proceeds by first training E-SVMs for object exemplars in the source set, then representing all windows from both source and target sets as the output of E-SVMs applied to their feature descriptors, and finally embedding these representations in a lower dimensional space.
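Training E-SVMs.
For each exemplar $e$ in the source set we train an E-SVM [11] with parameters $\theta_e$ by minimizing
$$\|\theta_e\|^2 + C_1\, h(\theta_e^\top e) + C_2 \sum_{x \in \mathcal{N}} h(-\theta_e^\top x), \tag{2}$$
where $h(t) = \max(0, 1 - t)$ is the hinge loss and $\mathcal{N}$ is a set of negative windows from the source pool. It contains a few thousand windows with overlap smaller than 0.5 with the object's bounding-box. The weights $C_1$ and $C_2$ regulate the relative importance of the terms.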
Encoding windows as E-SVM outputs.
We encode every window in both the source and target sets as a vector whose elements are the responses of the E-SVMs, one per exemplar. This representation is much richer than the original feature space, as each entry encodes how much the window looks like one of the exemplars.
Let $\Phi$ be a matrix with a row for each window in both the source and target sets and a column for each exemplar, holding the E-SVM outputs. Let $W$ and $E$ be matrices whose rows are the embedded representations of the windows and of the exemplars, respectively, which we are trying to produce. We seek the embedding that minimizes the reconstruction error
$$\min_{W, E} \|\Phi - W E^\top\|_F^2, \tag{3}$$
where $\|\cdot\|_F$ is the Frobenius norm. This can be done by a truncated SVD [12]: $\Phi \approx U S V^\top$. We encapsulate the singular values into the E-SVM representation. Elements of a common visual descriptor (e.g. HOG) correspond to local statistics of an image window. In contrast, the elements of a row of $\Phi$ are the outputs of a battery of E-SVMs on the same image window. These outputs are individually much more informative, yet their collection is redundant, enabling compression to a few dimensions. For instance, the representation of a background window will contain similar negative values across all the E-SVMs. Moreover, E-SVMs that are learnt from similar exemplars coming from the same aspect of an object's appearance (e.g., a gorilla's face, close up) will produce correlated outputs uniformly for all windows. Altogether, this dramatically collapses variability, which allows SVD to achieve low reconstruction error with just a few dimensions. These tend to correspond to object/background discrimination and the aspects of object appearance (fig. 3).
Solving the optimization problem for all windows in the source and target sets at once is computationally very expensive. Therefore, in a first step we use only a sub-sample of the windows, but all exemplars, to minimize (3). This results in an embedded representation of all exemplars, which we stack as rows of a matrix $E$. In a second step we keep $E$ fixed and embed any remaining window, with vector of E-SVM outputs $\phi$, by
$$\tilde{w} = \arg\min_{v} \|\phi - E v\|^2. \tag{4}$$
We solve the above optimization using least-squares. Fig. 5(a) depicts the embedding of Koala class SURF Bag-of-Words window descriptors in AE space.
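The two-step embedding can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the E-SVM outputs are assumed to be already collected in a matrix with one row per window and one column per exemplar, and the singular values are folded into the exemplar side, which is one possible reading of the text.

```python
import numpy as np


def associative_embedding(phi_sub: np.ndarray, d: int = 3):
    """Step 1: truncated SVD on a sub-sample of windows.

    phi_sub has shape (n_windows, n_exemplars) and holds E-SVM outputs.
    Returns W (n_windows, d) and E (n_exemplars, d) with phi_sub ~ W @ E.T.
    """
    U, s, Vt = np.linalg.svd(phi_sub, full_matrices=False)
    W = U[:, :d]
    # Assumption of this sketch: singular values go to the exemplar side.
    E = Vt[:d].T * s[:d]
    return W, E


def embed_remaining(phi_new: np.ndarray, E: np.ndarray) -> np.ndarray:
    """Step 2: keep E fixed and solve argmin_v ||phi - E v||^2 per window."""
    # lstsq solves E @ V = phi_new.T for all windows at once.
    V, *_ = np.linalg.lstsq(E, phi_new.T, rcond=None)
    return V.T
```

Because the second step is an ordinary least-squares problem with a small fixed matrix, each remaining window can be embedded independently of the others.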
5 Estimating window overlap with Gaussian Processes
This section describes how to infer a probability distribution of the overlap of a target window with an object and calculate the final score (eq. 1). We first construct an extended representation of a window using AE, its objectness score  and its position and scale in the image. Then we define a GP over that representation and estimate its hyper-parameters. We also show how to speed up inference at prediction time in sec. 5.2.
5.1 GP construction
Let $\tilde{w}^{S}$ and $\tilde{w}^{H}$ be the AE of the SURF bag-of-words and HOG appearance descriptors of a window $w$. The final representation of $w$ is
$$x_w = [\tilde{w}^{S}, \tilde{w}^{H}, p_w, o_w], \tag{5}$$
where $p_w$ is a position and scale descriptor of the window, built from the coordinates of the center of the window and its width and height (all normalized by the image size), and $o_w$ is the objectness score of $w$, which estimates how likely it is to contain an object rather than background [10].
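We propose to model the overlap $L_w$ of a window with a true object bounding-box as a Gaussian Process (GP) over the window representation $x_w$:
$$L_w \sim \mathcal{GP}\big(m(x_w), k(x_w, x_{w'})\big), \tag{6}$$
where $m$ is a mean function and $k$ is a covariance function. Here the mean function is zero and the covariance (kernel) is the squared exponential
$$k(x, x') = \exp\Big(-\tfrac{1}{2} \sum_d \theta_d\, (x_d - x'_d)^2\Big), \tag{7}$$
where $\theta$ is a vector of hyper-parameters regulating the influence of each element of $x$ on the output of the kernel.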
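Let $\{(x_i, L_i)\}_{i=1}^{N}$ be a set of windows from the source set and their respective ground-truth overlaps (inducing points), with $\mathbf{L} = (L_1, \dots, L_N)^\top$. Let $K$ be the kernel matrix with entries $K_{ij} = k(x_i, x_j)$. Consider now a target window with representation $x_*$, for which the overlap $L_*$ is unknown. Let $\mathbf{k}_* = (k(x_*, x_1), \dots, k(x_*, x_N))^\top$. Then the predictive distribution for $L_*$ is
$$P(L_*) = \mathcal{N}(\mu_*, \sigma_*^2), \quad \mu_* = \mathbf{k}_*^\top K^{-1} \mathbf{L}, \quad \sigma_*^2 = k(x_*, x_*) - \mathbf{k}_*^\top K^{-1} \mathbf{k}_*. \tag{8}$$
In practice, inference involves the inversion of the kernel matrix and a series of scalar products. Matrix inversion for a large set of inducing points poses a computational problem, which we address in sec. 5.2.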
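A minimal NumPy sketch of exact GP prediction with a zero mean and a squared-exponential kernel is given here for illustration. The kernel parametrisation (one multiplicative weight per input dimension), the small `jitter` added to the diagonal for numerical stability, and all function names are assumptions of this sketch. It performs exact inference, not the approximate inference used at large scale.

```python
import numpy as np


def se_kernel(A: np.ndarray, B: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """k(x, x') = exp(-0.5 * sum_d theta_d * (x_d - x'_d)^2) for all pairs."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-0.5 * np.einsum("ijd,d->ij", diff ** 2, theta))


def gp_predict(X, L, X_star, theta, jitter=1e-6):
    """Predictive mean and variance of the overlap for each row of X_star.

    X: (n, d) source window representations, L: (n,) their overlaps.
    """
    K = se_kernel(X, X, theta) + jitter * np.eye(len(X))
    K_star = se_kernel(X_star, X, theta)  # shape (m, n)
    chol = np.linalg.cholesky(K)
    # alpha = K^{-1} L via two triangular solves.
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, L))
    mu = K_star @ alpha
    v = np.linalg.solve(chol, K_star.T)  # shape (n, m)
    # k(x*, x*) = 1 for this kernel.
    var = 1.0 - np.sum(v ** 2, axis=0)
    return mu, np.maximum(var, 0.0)
```

The Cholesky factorisation replaces an explicit matrix inverse, which is the usual numerically safer choice; its cubic cost in the number of source windows is what motivates the speed-ups discussed in sec. 5.2.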
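We can now compute the score from eq. (1). Since the predictive distribution of the overlap of a window $w$ is Gaussian with mean $\mu_w$ and standard deviation $\sigma_w$,
$$z(w, \lambda) = \mu_w - c(\lambda)\, \sigma_w, \tag{9}$$
where $c(\lambda)$ is a constant which depends on $\lambda$ and can be analytically computed by integrating the Gaussian density. The score equals the maximum overlap that the window may have with at least probability $\lambda$.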
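For a Gaussian predictive distribution with mean $\mu$ and standard deviation $\sigma$, the condition $P(L \geq \gamma) \geq \lambda$ is equivalent to $\gamma \leq \mu - \Phi^{-1}(\lambda)\,\sigma$, where $\Phi^{-1}$ is the inverse standard normal CDF, so the constant is the $\lambda$-quantile of the standard normal. The sketch below computes the score this way using only the Python standard library; the function name is hypothetical.

```python
from statistics import NormalDist


def overlap_score(mu: float, sigma: float, lam: float) -> float:
    """Largest gamma such that P(L >= gamma) >= lam for L ~ N(mu, sigma^2).

    lam must lie strictly between 0 and 1.
    """
    c = NormalDist().inv_cdf(lam)  # standard normal quantile of lam
    return mu - c * sigma
```

With `lam = 0.5` the score reduces to the predictive mean; larger values of `lam` penalise windows whose prediction is uncertain.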
We estimate the hyper-parameters by minimizing the negative log-likelihood of the overlaps of the windows in the source set, regularized by a penalty on the hyper-parameters whose weight is set by a regularization strength. While it is not common practice to put a regularizer on the GP hyper-parameters, we experimentally found that it significantly improves the result.
5.2 Fast inference for large-scale datasets
The source set can contain millions of windows, which poses a computational problem for standard GP inference methods [7]. Instead of exact inference we use an approximation (infFITC from the GPML toolbox [14]) during both training and test. For training hyper-parameters we also sub-sample the windows from the source set. Since the dimensionality of the hyper-parameter vector is very low, it can be reliably estimated from modest amounts of data.
To speed up prediction on windows in a target image, we use as inducing points of the GP only windows from source images that have a similar global appearance. The idea is that globally similar images are more likely to contain objects and backgrounds related to those in the target image, and therefore are the most relevant source for transfer. This is related to ideas proposed before for scene parsing [15, 16]. However, while those works used a simple monolithic global image descriptor (bag-of-words or GIST) for this task, we directly use the set of window descriptors to construct the global similarity measure. In the feature space spanned by the window representation, every image has an empirical distribution, i.e. a cloud of points corresponding to the windows it contains (fig. 5(b)). We use per-coordinate kernel density estimation to represent the cloud of an image. The global image similarity between a source and a target image is then defined as the average per-coordinate KL-divergence between their distributions.
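One way to realise this image similarity is sketched below with NumPy. Several choices here are assumptions for illustration and are not specified in the text: a Gaussian kernel with a fixed bandwidth, evaluation of the densities on a shared regular grid, the direction of the KL-divergence, and the function names. Lower values indicate more similar images.

```python
import numpy as np


def kde_1d(samples: np.ndarray, grid: np.ndarray, bandwidth: float) -> np.ndarray:
    """Gaussian kernel density estimate on a grid, normalised to sum to 1."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    dens = np.exp(-0.5 * z ** 2).sum(axis=1)
    dens += 1e-12  # keep the logarithm finite where the density underflows
    return dens / dens.sum()


def image_divergence(win_a: np.ndarray, win_b: np.ndarray,
                     n_grid: int = 64, bandwidth: float = 0.25) -> float:
    """Average per-coordinate KL(a || b) between two clouds of windows.

    win_a and win_b have shape (n_windows, n_dims).
    """
    kls = []
    for d in range(win_a.shape[1]):
        lo = min(win_a[:, d].min(), win_b[:, d].min()) - 3 * bandwidth
        hi = max(win_a[:, d].max(), win_b[:, d].max()) + 3 * bandwidth
        grid = np.linspace(lo, hi, n_grid)
        p = kde_1d(win_a[:, d], grid, bandwidth)
        q = kde_1d(win_b[:, d], grid, bandwidth)
        kls.append(np.sum(p * np.log(p / q)))
    return float(np.mean(kls))
```

Source images can then be ranked by this divergence to the target image, and only the windows of the closest ones used as inducing points.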
All the modifications above reduce full inference for all windows in a test image to approximately 4 seconds on a standard desktop computer. Hence our method is suitable for large-scale datasets like ImageNet.
6 Object localization
The technique presented above produces a score $z(w, \lambda)$ for each window $w$ in a target image. This section explains how to use this score for object localization. We simply select the window with the highest score out of all windows in the target image. This window is the final output of the system, which returns one window for each target image.
The score can also be used for self-assessment. For example, we can retrieve all bounding-boxes that have overlap higher than $\gamma$ with probability $\lambda$ by taking only the windows such that $z(w, \lambda) \geq \gamma$.
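Both uses of the score amount to a few lines of code. The sketch below assumes that scores have already been computed for every window; the data layout (pairs of window and score, and a dictionary keyed by image identifier) and the function names are hypothetical.

```python
def localize(scored_windows):
    """Return the (window, score) pair with the highest score in one image."""
    return max(scored_windows, key=lambda pair: pair[1])


def self_assess(localizations, gamma):
    """Keep only the images whose output window has a score of at least gamma.

    localizations maps an image identifier to its (window, score) output.
    """
    return {img: (w, z) for img, (w, z) in localizations.items() if z >= gamma}
```

Raising `gamma` returns fewer localizations, each predicted to have a higher overlap with the object.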
7 Implementation details
We use AE in SURF  bag-of-words  and HOG feature spaces. We build quantized SURF histograms with a codebook of visual words. For computational efficiency, we approximate the kernel with the expansion technique of  and train linear E-SVMs. HOG descriptors  are computed over a grid and the associated E-SVMs are linear, as in . The soft margin parameters and are set to and . We create AE of a dimensionality (for each of HOG and SURF bag-of-words).
To learn GP hyper-parameters (sec. 5) we sub-sample windows from the source set. Since the dimensionality of GP hyper-parameters is low ( in our experiments), this is sufficient. We set the regularizer to in eq. (10). For the scoring function we set for object localization, and for self-assessment.
For prediction on a target image, we retrieve the most similar source images as described in sec. 5.2, and then infer the distribution of windows overlap using them.
8 Related work
Guillaumin and Ferrari [3] (GF) train a series of monolithic window classifiers, one for each source class, using different cues (HOG, colour, location, etc.). These are combined into a final window score by a discriminatively trained weighting function and applied to windows in the target set. As we show in sec. 9, GF is not suited for self-assessment. Moreover, our method produces better localizations overall.
The work [2] populates ImageNet with segmentations. It propagates ground-truth segmentations from PASCAL VOC [9] onto ImageNet, using a nearest neighbour technique [19] to transfer segmentations from a given source set to a target image. We compare to this method experimentally (sec. 9), by putting a bounding-box over their segmentations.
Transfer learning is used in computer vision to facilitate learning a new target class with the help of labelled examples from related source classes. Transfer is typically done through regularization of model parameters [20, 21], an intermediate attribute layer [22] (e.g. yellow, furry), or by sharing parts [23]. In GP [24], transfer learning is usually based on sharing hyper-parameters between tasks. In this work we share not only hyper-parameters, but the inducing points as well. Also, our GP kernel is defined over an augmented AE space, which is constructed specifically for a particular combination of source and target classes. In principle, one could view AE as a kernel learning method for GP, which exploits the specifics of visual data.
Exemplar-SVMs were first proposed by [11] and are rapidly gaining popularity as a better way to measure visual similarity [25, 26, 27]. Their main advantage is the ability to select features within a window that are relevant to the object, down-weighting background clutter. Some authors [26, 27] propose to use E-SVMs as a similarity measure for discovering different aspects of object appearance. They explicitly group training objects into clusters according to their aspects of appearance, producing a set of clean aspect-specific object detectors, whose responses on a test image are merged together for the final result. Instead, we embed image windows into a low-dimensional space, where aspects of appearance are expressed smoothly in the space's dimensions.
9 Experiments and conclusion
We perform experiments on the same subset of ImageNet as [2, 3], which allows direct comparison to their results. The subset consists of 219 classes (e.g. phalanger, beagle, etc.) spanning over million images. Classes are selected such that each has less than images and has siblings with some manual bounding-box annotations . For a given target class we consider the following source sets: i) self — the class itself; ii) siblings of the class (as in ); iii) family — the class itself plus its siblings (see sec. 2 and fig. 1).
We use 92K images with ground-truth bounding-boxes in total. We split them into two disjoint sets of 60K and 32K. We use the first set exclusively as source and the second one exclusively as target. This allows us to compare transfer from different source types (siblings, self, family) on exactly the same target images. The first batch of 60K images is the same as used in [3]. This ensures proper comparison to [3]: when using siblings as source, our method transfers knowledge from exactly the same images as [3].
Baselines and [3].
For localization, we compare against the following.
MidWindow: a window in the center of the image, occupying of its area.
TopObj: the window with the highest objectness score [10].
MKL-SVM: this represents a standard, discriminative approach to object localization, similar to [28]. On the source set we train an SVM on HOG (linear) and an SVM on SURF bag-of-words (linear with expansion [18]) using of the data. To combine them, we train a linear SVM over their outputs using the held-out data. We improve the baseline by adding the objectness score, location and scale cues (as in sec. 5) of a window as features for the second-level SVM.
GF: we compare to [3], when using siblings as the source. We use the output of [3] as provided by the authors on their website.
Note that the MKL-SVM baseline and the competitor GF [3] are defined on the same object proposals [10] and features as our AE-GP. This ensures that any improvement comes from better modelling.
To measure the quality of localizations we use the intersection-over-union criterion (IoU) as defined in the PASCAL VOC . We also measure detection rate: the percentage of images where the output has IoU .
To measure the quality of self-assessment we evaluate how well the score (1) ranks the output bounding-boxes. We sort the outputs by their scores and measure the mean IoU of the top outputs. To compare to  we use their scores released along with their output bounding-boxes. For MKL-SVM we use the score output by the second-level SVM.
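The two evaluation measures can be sketched as follows. The function names are hypothetical, and the detection threshold is left as a parameter because its value is not stated in this passage; whether the comparison is strict is also an assumption of this sketch.

```python
def mean_iou_of_top(scores, ious, fraction):
    """Mean IoU of the top-scoring fraction of outputs (self-assessment curve).

    scores[i] is the self-assessment score of output i and ious[i] its true IoU.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    k = max(1, round(fraction * len(order)))
    return sum(ious[i] for i in order[:k]) / k


def detection_rate(ious, threshold):
    """Fraction of images whose output reaches the IoU threshold."""
    return sum(1 for v in ious if v >= threshold) / len(ious)
```

Evaluating `mean_iou_of_top` over a range of fractions traces a self-assessment curve: a score that ranks outputs well yields a high mean IoU for small fractions, decreasing towards the overall mean IoU as the fraction approaches 1.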
Comparison to baselines and [3].
Table 1 summarizes localization results, averaged over all 32K target images for which we have evaluation ground-truth. The results of our method steadily improve as the source set changes from siblings to self to family, unlike the MKL-SVM baseline, whose performance decreases from self to family. This behaviour shows that our method is applicable to a wide range of knowledge transfer scenarios. Using only siblings as source, we outperform GF by the same margin as GF outperforms the trivial baselines TopObj and MidWindow in terms of mean IoU. Overall, our method delivers the best results for all kinds of source sets and metrics, compared to both competitors and baselines. While GF  sometimes produces a bounding-box occupying most of an image, our localizations are typically more specific to the object.
Fig. 7 presents self-assessment curves. Our method nicely trades off the amount of returned localizations for their quality, as demonstrated by the visible slope of the solid curves. For all sources, our method outperforms MKL-SVM and GF over the entire range of the curve. The advantage over MKL-SVM is greater especially in the left part of the curves, where self-assessment plays a bigger role. Note how both MKL-SVM and GF have cusps in their curves. This means that many high quality localizations get a low score (GF, right half of the curve) or some high scoring localizations are poor (MKL-SVM, left half). Interestingly, our MKL-SVM baseline performs similarly to GF when evaluated on all images (tab. 1), but GF is better at self-assessment (fig. 7).
Table 1: Localization results. Columns: Method, Source, IoU%, IoU at 50%, Detection%.
Analysis of AE.
We validate here the importance of AE and whether it can reduce dimensionality without loss in representation power (fig. 8).
In PCA-GP we substitute AE with a standard PCA in each original feature space (HOG, SURF bag-of-words), keeping the rest of the method exactly the same. PCA-GP performs significantly worse than AE-GP, highlighting the importance and power of AE.
AE-GP-d100 increases the dimensionality of AE to 100 for each feature space. This drastic increase in dimensionality makes little difference to the results. This shows that AE does not lose much representation power even when reducing the dimensionality to just 3.
Better features and proposals.
The experiments above demonstrate that our AE-GP outperforms baselines and , given the same features and object proposals. To push performance even further we introduced here a version of our method, coined AE-GP+, that uses state-of-the-art features  and object proposals . To accommodate very high dimensional features  we slightly increase the dimensionality of AE to . We use the initial features (sec. 7) as well, improving SURF bag-of-words by adding spatial binning. Results in fig. 8 and tab. 1 show that AE-GP+ delivers the best results, improving over AE-GP by in detection rate. Fig. 6 demonstrates localizations by AE-GP+.
Comparison to [2].
We compare here to the state-of-the-art segmentation method KGF [2] by putting a bounding-box around their output segmentation. AE-GP+ moderately outperforms KGF [2] by in detection rate (tab. 1). Most importantly, as KGF [2] assigns no score to its output, self-assessment is impossible: the user cannot automatically retrieve high quality localizations from KGF [2]. In contrast, AE-GP+ has the ability to select the localizations that have high IoU. Also, unlike the scoring schemes in GF and MKL-SVM, ours allows the user to retrieve localizations that are predicted to have an overlap higher than, say, with a probability. Using AE-GP+ with self as source, this fully automatically returns of all localizations with a mean IoU of (which is very accurate, cf. the PASCAL detection criterion is IoU). This means about 251K images in the dataset we processed! The results are released online (http://groups.inf.ed.ac.uk/calvin/proj-imagenet/page/).
Knowledge transfer in ImageNet is motivated by the fact that semantically related objects look similar (e.g. police car and taxi). Hence, in a latent space of appearance variation, image windows that contain objects of related classes are close to each other, while background windows are far away from them. Our work formalizes this intuition with the AE+GP method. AE recovers the latent space of appearance variation and embeds windows into it. Next, we construct a GP over the AE space to transfer localization annotations from the source to the target image set. Thanks to the probabilistic nature of GP, our model is capable of self-assessment. Large-scale experiments demonstrate that our method outperforms state-of-the-art techniques [2, 3] for populating ImageNet with bounding-boxes and segmentations, as well as a strong MKL-SVM baseline defined on the same features.
This work was supported by ERC VisCul starting grant. A. Vezhnevets is also supported by SNSF fellowship PBEZP-2142889.
- [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
- [2] M. Guillaumin, D. Kuettel, and V. Ferrari, “ImageNet Auto-annotation with Segmentation Propagation,” IJCV, 2014, to appear.
- [3] M. Guillaumin and V. Ferrari, “Large-scale knowledge transfer for object localization in ImageNet,” in CVPR, Jun 2012.
- [4] N. Dalal and B. Triggs, “Histogram of Oriented Gradients for human detection,” in CVPR, 2005.
- [5] B. Leibe, K. Schindler, and L. Van Gool, “Coupled detection and trajectory estimation for multi-object tracking,” in ICCV, 2007.
- [6] M. Andriluka, S. Roth, and B. Schiele, “Monocular 3d pose estimation and tracking by detection,” in CVPR, 2010.
- [7] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
- [8] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local features and kernels for classification of texture and object categories: a comprehensive study,” IJCV, 2007.
- [9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” IJCV, 2010.
- [10] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of image windows,” IEEE Trans. on PAMI, 2012.
- [11] T. Malisiewicz, A. Gupta, and A. A. Efros, “Ensemble of exemplar-svms for object detection and beyond,” in ICCV, 2011.
- [12] S. C. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. A. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society of Information Science, 1990.
- [13] B. Alexe, T. Deselaers, and V. Ferrari, “What is an object?,” in CVPR, 2010.
- [14] C. E. Rasmussen and H. Nickisch, “Gaussian processes for machine learning (gpml) toolbox,” The Journal of Machine Learning Research, 2010.
- [15] C. Liu, J. Yuen, and A. Torralba, “Nonparametric scene parsing: Label transfer via dense scene alignment,” in CVPR, 2009.
- [16] J. Tighe and S. Lazebnik, “Superparsing: Scalable nonparametric image parsing with superpixels,” in ECCV, 2010.
- [17] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool, “SURF: Speeded up robust features,” CVIU, 2008.
- [18] A. Vedaldi and A. Zisserman, “Efficient additive kernels via explicit feature maps,” in CVPR, 2010.
- [19] D. Kuettel and V. Ferrari, “Figure-ground segmentation by transferring window masks,” in CVPR, 2012.
- [20] L. Fei-Fei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” CVIU, 2007.
- [21] T. Tommasi, F. Orabona, and B. Caputo, “Safety in numbers: Learning categories from few examples with multi model knowledge transfer,” in CVPR, IEEE, 2010.
- [22] C. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in CVPR, 2009.
- [23] P. Ott and M. Everingham, “Shared parts for deformable part-based models,” in CVPR, 2011.
- [24] E. Bonilla, K. M. Chai, and C. Williams, “Multi-task gaussian process prediction,” 2008.
- [25] I. Endres, K. J. Shih, J. Jiaa, and D. Hoiem, “Learning collections of part models for object recognition,” in CVPR, 2013.
- [26] O. Aghazadeh, H. Azizpour, J. Sullivan, and S. Carlsson, “Mixture component identification and learning for visual recognition,” in ECCV, Springer, 2012.
- [27] J. Dong, W. Xia, Q. Chen, J. Feng, Z. Huang, and S. Yan, “Subcategory-aware object classification,” in CVPR, 2013.
- [28] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, “Multiple kernels for object detection,” in ICCV, 2009.
- [29] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, “Selective search for object recognition,” IJCV, 2013.
- [30] S. Manén, M. Guillaumin, and L. Van Gool, “Prime Object Proposals with Randomized Prim’s Algorithm,” in ICCV, 2013.