1 Introduction
The large, hierarchical ImageNet dataset [1] presents new challenges and opportunities for computer vision. Although the dataset contains over 14 million images, only a fraction of them have bounding-box annotations and none have segmentations (object outlines). Automatically populating ImageNet with bounding-boxes or segmentations is a challenging problem, which has recently drawn attention [2, 3]. These annotations could be used as training data for problems such as object class detection [4], tracking [5] and pose estimation [6]. This use case makes it important for the annotations to be of high quality, otherwise they will lead to models fit to their errors. This requires auto-annotation methods to be capable of self-assessment, i.e. estimating the quality of their own outputs. Self-assessment would allow returning only accurate annotations, automatically discarding the rest.

In this paper we propose a method for transferring knowledge of object appearance from source images manually annotated with bounding-boxes to target images without them. Source and target images may come from different, but semantically related classes (sec. 2). We model the overlap of an image window with an object using Gaussian Process (GP) [7]
regression. GPs are able to model highly non-linear dependencies in a non-parametric way. Thanks to the probabilistic nature of GPs, we can also infer the probability distribution of the overlap value for a given window. This enables taking the uncertainty of the prediction into account. We then tackle the localization problem in a target image by picking a window that has high overlap with an object
with high probability. The same principle is used for self-assessment.

Doing GP inference directly in a high-dimensional feature space such as HOG [4] is computationally infeasible at the large scale of ImageNet. Moreover, we would have to estimate thousands of hyperparameters, risking overfitting to the source. Instead, we devise a new representation, coined Associative Embedding (AE), which embeds image windows in a low-dimensional space whose dimensions tend to correspond to aspects of object appearance (e.g. front view, close-up, background). The AE representation is very compact: it embeds windows originally described by HOG [4] or bag-of-words [8] histograms into just 3 dimensions. This enables very efficient learning of GP hyperparameters and efficient inference. It also facilitates knowledge transfer by allowing the source annotations to spread directly along aspects of appearance.
A large-scale experiment on a subset of ImageNet containing 219 classes shows that our method outperforms state-of-the-art competitors [2, 3] and various baselines for object localization.
The remainder of the paper is organized as follows. Sec. 2 describes the configurations of source and target sets we consider. Sec. 3 formally introduces the setup and gives an overview of our method. Sec. 4 presents Associative Embedding and sec. 5 describes how we estimate window overlap with Gaussian Processes. Sec. 6 explains how we use these estimates for object localization. Sec. 7 reports some implementation details. Related work is reviewed in sec. 8, followed by experimental results and conclusions in sec. 9.
2 Knowledge sources
The goal of this paper is to localize objects in a target set of images, which do not have manual bounding-box annotations. We tackle this by transferring knowledge from a source set of images that do have them. The target set is composed of all images of a certain target class without bounding-boxes. For some classes, ImageNet offers a rather large subset of images with annotations, so both the source and target sets can come from the same class. When a target class has few or no manual bounding-boxes, we can construct the source set from images of semantically related classes. Different combinations of source and target sets lead to different learning problems. This paper explores the following setups (fig. 1):

Self: source and target sets of images come from the same class, e.g. from koalas to koalas. This setup is close to a classic object detection problem [9]. Here we have the additional knowledge that every target image contains an object of the class and we are given all target images at training time (transductive learning).

Siblings: the source set consists of images from classes semantically related to the target, e.g. from kangaroos to koalas. This is the setup considered in [3], which can be used to produce annotations even for a target class without any initial manual annotations (transfer learning).

Family = self + siblings: source and target sets consist of images coming from the same mix of semantically related classes, e.g. from kangaroos and koalas to kangaroos and koalas. This setup aims at improving performance on related classes when both have some manual annotations by processing them simultaneously (multitask learning).
These source/target setups cover a broad range of knowledge transfer scenarios. Below we devise a model that covers all these scenarios in a unified manner. We expect an adequate model to show increasing performance as more supervision is provided (with siblings performing worst and family performing best).
3 Overview of our method
In this section we define the notation and introduce our method at a high level, providing the details of each part later (secs. 4, 5, 6). Every image, in both the source and target sets, is decomposed into a set of windows. To avoid considering every possible image window, we use the objectness technique [10] to sample a fixed number of windows per image. As shown in [10], these are enough to cover the vast majority of objects. Source windows are annotated with their overlap, defined as the intersection-over-union (IoU) [9] with the ground-truth bounding-box of an object.
The key idea of our method is to use GP regression to transfer overlap annotations from source windows to target ones. The GP infers a probability distribution over the overlap of a target window with the object (we slightly abuse notation here, dropping the conditioning on the source data). Having a full distribution enables measuring and controlling uncertainty. Consider the score function (corrected thanks to O. Russakovsky):
s_θ(w) = max { o : P(o_w ≥ o) ≥ θ }   (1)
This score is the largest overlap o that a window w is predicted to have with probability at least θ. The higher the θ, the more certain we need to be before assigning a window a high score.
Performing GP inference in the high-dimensional spaces typical of common visual descriptors (e.g. HOG, bag-of-words) is computationally infeasible at the large scale of ImageNet. Moreover, it would require estimating thousands of hyperparameters, risking overfitting. To reduce the dimensionality of the feature space without losing descriptive power, we introduce Associative Embedding (AE), a low-dimensional representation of windows (sec. 4). Let x_E be the representation of an object exemplar E (a ground-truth bounding-box) from the source set in some feature space (e.g. HOG). Let x_w be the representation of a window w in the same feature space. First we describe w in terms of its similarity to the exemplars in the source set: we quantify how similar the window is to an exemplar E by the output of an Exemplar-SVM (E-SVM) [11] trained on x_E. Now every window is described by a vector of E-SVM outputs, one for each exemplar in the source set. Intuitively, this intermediate representation describes what a window looks like, rather than how it looks. The advantage of using E-SVMs over a standard similarity measure is their ability to suppress the influence of the background present in the window and to focus on exemplar-specific features [11].

Next, we embed all windows and exemplars into a low-dimensional space. We optimize the embedding to minimize the difference between scalar products in the original and embedded spaces. We call the resulting space Associative Embedding (AE). The dimensions of the AE space tend to correspond to aspects of the appearance of classes (e.g. close-up, side view, partially occluded). The AE space enables the GP to transfer knowledge directly along these aspects, which greatly facilitates the process. For HOG and bag-of-words descriptors, an AE space with just 3 dimensions is sufficient to support it.
Fig. 2 depicts the procedural flow of our method. Having embedded source and target windows into the AE space, we use a GP to infer probability distributions of target window overlaps and to compute the score (eq. (1)). We employ this score to tackle the localization task in sec. 6: we simply return the highest-scored window in each target image. We also reuse the score for self-assessment.
4 Associative embedding
This section details the proposed AE representation (fig. 4). The method proceeds by first training an E-SVM for each object exemplar in the source set, then representing all windows from both the source and target sets by the outputs of these E-SVMs applied to their feature descriptors, and finally embedding these representations in a lower-dimensional space.
Training E-SVMs.
For each object exemplar E in a given feature space, we train an E-SVM [11] hyperplane w_E to minimize the following loss
L(w_E) = ‖w_E‖² + C₁ h(w_Eᵀ x_E) + C₂ Σ_{x ∈ N} h(−w_Eᵀ x)   (2)
where h is the hinge loss and N is a set of negative windows from the source pool, containing a few thousand windows with overlap smaller than 0.5 with the object's bounding-box. The weights C₁ and C₂ regulate the relative importance of the terms.
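As a sanity check of the loss above, here is a minimal NumPy sketch. The weights `c1`, `c2` and all data below are illustrative placeholders, not the values used in the paper, and the bias term is omitted for brevity:

```python
import numpy as np

def esvm_loss(w, x_exemplar, negatives, c1=1.0, c2=1.0):
    """Exemplar-SVM objective of eq. (2), sketched without a bias term:
    squared-norm regularizer, hinge loss h on the single positive
    exemplar, and hinge losses on the negative windows N.
    c1, c2 stand in for the weights C1, C2 (values assumed)."""
    hinge = lambda z: np.maximum(0.0, 1.0 - z)
    pos_term = hinge(w @ x_exemplar)            # exemplar should score >= 1
    neg_term = hinge(-(negatives @ w)).sum()    # negatives should score <= -1
    return float(w @ w + c1 * pos_term + c2 * neg_term)
```

With a hyperplane that separates the exemplar from the negatives with margin, only the regularizer contributes to the loss.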
Encoding windows as E-SVM outputs.
We encode every window w in both the source and target sets as a vector a_w, where each element is the response of one E-SVM. This representation is much richer than the original feature space, as each entry in a_w encodes how much the window looks like one of the exemplars.
Embedding.
Let A be a matrix with a row a_w for each window w in both the source and target sets. Let v_w and e_E be the embedded representations of window w and exemplar E, which we are trying to produce. Let the matrices V and E have v_w and e_E as rows, respectively. We seek the embedding that minimizes the reconstruction error:
min_{V, E}  ‖A − V Eᵀ‖_F²   (3)
where ‖·‖_F is the Frobenius norm. This can be done by a truncated SVD decomposition [12]: A ≈ U Σ Eᵀ. We fold the singular values into the window representation: V = U Σ.

Elements of a common visual descriptor (e.g. HOG) correspond to local statistics of an image window. In contrast, each element of a_w corresponds to the output of one of a battery of E-SVMs on the same window. These outputs are individually much more informative, yet their collection is redundant, enabling compression to a few dimensions. For instance, the representation of a background window will contain similar negative values across all E-SVMs. Moreover, E-SVMs learnt from similar exemplars belonging to the same aspect of the object's appearance (e.g. gorilla's face, close-up) will produce correlated outputs uniformly over all windows. Altogether, this dramatically collapses variability, allowing SVD to achieve a low reconstruction error with just a few dimensions; these tend to correspond to object/background discrimination and to aspects of object appearance (fig. 3).

Solving the optimization problem for all windows in the source and target sets at once is computationally very expensive. Therefore, in a first step we use only a subsample of the windows, but all exemplars, to minimize (3). This yields the embedded representations of all exemplars. In a second step we keep these fixed and embed any remaining window w with
v_w = argmin_v ‖a_w − E v‖²   (4)
We solve the above optimization using least-squares. Fig. 5(a) depicts the embedding of SURF bag-of-words window descriptors of the koala class in AE space.
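The two-step embedding above can be sketched in a few lines of NumPy. This is an illustrative toy reconstruction, not the paper's implementation; the matrix names follow eqs. (3)–(4) and the data below is synthetic:

```python
import numpy as np

def associative_embedding(A, d=3):
    """Sketch of the AE construction (eq. 3): A holds one row of E-SVM
    responses per window, one column per exemplar.  Truncated SVD gives
    window embeddings V (singular values folded in) and exemplar
    embeddings E."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    V = U[:, :d] * S[:d]      # window embeddings
    E = Vt[:d].T              # exemplar embeddings, one row per exemplar
    return V, E

def embed_window(a_w, E):
    """Second step (eq. 4): embed a remaining window by least squares,
    v = argmin_v ||a_w - E v||^2, with exemplar embeddings E fixed."""
    v, *_ = np.linalg.lstsq(E, a_w, rcond=None)
    return v

# toy data: 50 windows responding to 10 exemplars, intrinsic rank 2
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 10))
V, E = associative_embedding(A, d=3)
recon_err = np.linalg.norm(A - V @ E.T)   # tiny: 3 dims suffice for rank-2 data
v_new = embed_window(A[0], E)             # re-embedding a known window
```

The low reconstruction error on redundant (low-rank) data mirrors the paper's observation that E-SVM response vectors compress to very few dimensions.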
5 Estimating window overlap with Gaussian Processes
This section describes how to infer a probability distribution of the overlap of a target window with an object and calculate the final score (eq. 1). We first construct an extended representation of a window using AE, its objectness score [13] and its position and scale in the image. Then we define a GP over that representation and estimate its hyperparameters. We also show how to speed up inference at prediction time in sec. 5.2.
5.1 GP construction
Let x_w^S and x_w^H be the SURF bag-of-words and HOG appearance descriptors of a window w, and v_w^S and v_w^H their AE representations. The final representation of w is
z_w = [ v_w^S , v_w^H , l_w , obj(w) ]   (5)
where l_w is a position and scale descriptor of the window. Following [3], it is built from the coordinates (x, y) of the window center and the width w and height h of the window (all normalized by the image size). obj(w) is the objectness score of w, which estimates how likely the window is to contain an object rather than background [10].
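Assembling this extended representation is a simple concatenation. A hedged sketch: the exact location/scale parametrization of [3] is not reproduced here, so normalized (x, y, w, h) is used as a stand-in:

```python
import numpy as np

def window_descriptor(ae_surf, ae_hog, x, y, w, h, objectness):
    """Assemble the extended window representation of eq. (5): the AE
    coordinates of both feature spaces, a position/scale descriptor,
    and the objectness score.  The location descriptor below is an
    assumed stand-in for the exact form used in [3]."""
    loc = [x, y, w, h]   # center and size, normalized by image size
    return np.concatenate([ae_surf, ae_hog, loc, [objectness]])
```

With 3-dimensional AEs for both feature spaces, this yields an 11-dimensional vector, in line with the "very low dimensionality" the GP relies on.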
We propose to model the overlap o_w of a window with the true object bounding-box as a Gaussian Process (GP)
o_w ∼ GP( m(z), k(z, z′) )   (6)
where m is the mean function and k is the covariance function. Here the mean function is zero and the covariance (kernel) is the squared exponential
k(z, z′) = exp( −(1/2) Σ_d (z_d − z′_d)² / λ_d² )   (7)
where λ = (λ_1, …, λ_D) is a vector of hyperparameters regulating the influence of each element of z on the output of the kernel.
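This kernel (a squared exponential with one lengthscale per dimension, often called ARD) is a standard construction; a minimal sketch:

```python
import numpy as np

def sq_exp_kernel(z1, z2, lengthscales):
    """ARD squared-exponential covariance of eq. (7): the per-dimension
    lengthscales regulate how much each element of the window
    representation influences the kernel output."""
    d = (np.asarray(z1, float) - np.asarray(z2, float)) / np.asarray(lengthscales, float)
    return float(np.exp(-0.5 * d @ d))
```

A very large lengthscale effectively switches a dimension off, which is how the GP learns which parts of the representation matter.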
Let Z = {z_1, …, z_n} be a set of windows from the source set and o = (o_1, …, o_n) their respective ground-truth overlaps (the inducing points). Let K be the n × n kernel matrix with entries K_ij = k(z_i, z_j). Consider now a target window z_*, for which the overlap o_* is unknown, and let k_* = (k(z_*, z_1), …, k(z_*, z_n)). Then the predictive distribution for o_* is
p(o_* | z_*) = N( o_* ; μ_*, σ_*² )   (8)
where
μ_* = k_*ᵀ K⁻¹ o,   σ_*² = k(z_*, z_*) − k_*ᵀ K⁻¹ k_*   (9)
In practice, inference involves the inversion of the kernel matrix K and a series of scalar products. Inverting K for a large set of inducing points poses a computational problem, which we deal with in sec. 5.2.
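These are the textbook GP regression equations; a minimal sketch follows. The jitter term added to K for numerical stability is an implementation assumption, not part of the paper's formulation:

```python
import numpy as np

def gp_predict(K, k_star, k_ss, o, jitter=1e-9):
    """GP predictive distribution of eqs. (8)-(9):
    K      -- kernel matrix over the source (inducing) windows,
    k_star -- kernel vector between target window and inducing windows,
    k_ss   -- k(z*, z*),
    o      -- ground-truth overlaps of the inducing windows.
    Returns the predictive mean and variance of the target overlap."""
    K_inv = np.linalg.inv(K + jitter * np.eye(len(K)))
    mu = k_star @ K_inv @ o
    var = k_ss - k_star @ K_inv @ k_star
    return float(mu), float(var)
```

When the target window coincides with an inducing window, the prediction collapses onto its known overlap with near-zero variance; far from all inducing points, the mean reverts to zero and the variance to the prior.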
Overlap estimation.
We can now compute the score from eq. (1).
s_θ(w) = μ_* + c(θ) σ_*   (10)
where c(θ) is a constant which depends on θ and can be computed analytically by integrating the Gaussian density. The score equals the maximum overlap that the window may have with probability at least θ.
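For a Gaussian predictive distribution, the constant is a standard normal quantile, so the score has a closed form. A stdlib-only sketch (the θ value in the example is illustrative, not the one used in the paper):

```python
from statistics import NormalDist

def overlap_score(mu, sigma, theta):
    """Score of eqs. (1)/(10) under a Gaussian predictive distribution
    N(mu, sigma^2): the largest overlap o with P(o_w >= o) >= theta,
    i.e. mu + c(theta) * sigma where c(theta) is the (1 - theta)
    quantile of the standard normal (negative for theta > 0.5)."""
    c = NormalDist().inv_cdf(1.0 - theta)
    return mu + c * sigma
```

For theta > 0.5 the score penalizes uncertain predictions: the larger the predictive standard deviation, the lower the score.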
Estimating hyperparameters.
We estimate the hyperparameters λ by minimizing the regularized negative log-likelihood of the overlaps of the windows in the source set
λ* = argmin_λ  −log p(o | Z, λ) + r ‖λ‖²   (11)
where r is the regularization strength. While it is not common practice to put a regularizer on the kernel hyperparameters, we experimentally found that it significantly improves the results.
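The objective of eq. (11) can be sketched directly from the standard GP marginal likelihood. This is an illustrative NumPy reconstruction, not the paper's code; the regularization strength and jitter values are assumed, and in practice this objective would be handed to an off-the-shelf optimizer:

```python
import numpy as np

def reg_neg_log_lik(lam, Z, o, reg=1.0, jitter=1e-6):
    """Regularized negative log marginal likelihood of eq. (11) for the
    ARD lengthscales lam.  Z holds one source window representation per
    row, o their ground-truth overlaps.  `reg` stands in for r."""
    D = (Z[:, None, :] - Z[None, :, :]) / lam          # pairwise scaled differences
    K = np.exp(-0.5 * (D ** 2).sum(-1)) + jitter * np.eye(len(Z))
    _, logdet = np.linalg.slogdet(K)
    nll = 0.5 * (o @ np.linalg.solve(K, o) + logdet + len(o) * np.log(2 * np.pi))
    return float(nll + reg * np.sum(lam ** 2))
```

Since the representation has very few dimensions, λ has very few entries and this objective can be minimized reliably from modest amounts of data.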
5.2 Fast inference for largescale datasets
The source set can contain millions of windows, which poses a computational problem for standard GP inference methods [7]. Instead of exact inference we use the FITC approximation (infFITC from the GPML toolbox [14]) during both training and testing. For training the hyperparameters we also subsample the windows from the source set. Since the dimensionality of λ is very low, it can be reliably estimated from modest amounts of data.
To speed up prediction on windows in a target image, we use as inducing points of the GP only windows from source images that have a similar global appearance. The idea is that globally similar images are more likely to contain objects and backgrounds related to those in the target image, and are therefore the most relevant sources for transfer. This is related to ideas proposed before for scene parsing [15, 16]. However, while those works used a simple monolithic global image descriptor (bag-of-words or GIST) for this task, we directly use the set of window descriptors to construct the global similarity measure. In the window feature space, every image has an empirical distribution, i.e. a cloud of points corresponding to the windows it contains (fig. 5(b)). We use per-coordinate kernel density estimation to represent the cloud of an image. The global image similarity between a source and a target image is then defined through the average per-coordinate KL-divergence between their distributions.
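A simplified sketch of such a similarity measure follows. Per-coordinate histograms stand in for the kernel density estimates, and a symmetrized KL-divergence is assumed (the paper does not specify the direction), so this is an illustration of the idea rather than the exact formulation:

```python
import numpy as np

def image_similarity(windows_a, windows_b, bins=10, eps=1e-9):
    """Global similarity between two images, each given as a
    (num_windows, d) array of window descriptors: negated average
    per-coordinate symmetrized KL-divergence between their per-image
    window distributions (histogram approximation, assumptions above)."""
    kls = []
    for j in range(windows_a.shape[1]):
        lo = min(windows_a[:, j].min(), windows_b[:, j].min())
        hi = max(windows_a[:, j].max(), windows_b[:, j].max())
        p, _ = np.histogram(windows_a[:, j], bins=bins, range=(lo, hi))
        q, _ = np.histogram(windows_b[:, j], bins=bins, range=(lo, hi))
        p = p / p.sum() + eps   # normalize; eps avoids log(0)
        q = q / q.sum() + eps
        kls.append(0.5 * np.sum(p * np.log(p / q) + q * np.log(q / p)))
    return -float(np.mean(kls))
```

An image is maximally similar to itself (similarity 0), and similarity decreases as the two window clouds diverge.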
All the modifications above reduce full inference for all windows in a test image to approximately 4 seconds on a standard desktop computer. Hence our method is suitable for large-scale datasets like ImageNet.
6 Object localization
The technique presented above produces a score s_θ(w) for each window w in a target image. This section explains how to use this score for object localization: we simply select the highest-scoring window out of all windows in the target image. This window is the final output of the system, which returns one window per target image.
The score can also be used for self-assessment. For example, we can retrieve all bounding-boxes that have overlap higher than some value o with probability θ by returning only the windows w with s_θ(w) ≥ o.
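The retrieval rule above reduces to a simple threshold on the score; a tiny sketch (the (box, score) pair format is a hypothetical convention for the example):

```python
def confident_windows(scored_boxes, min_overlap):
    """Self-assessment retrieval: keep only output bounding-boxes whose
    score s_theta(w) is at least the desired overlap, so every returned
    box has overlap >= min_overlap with probability >= theta.
    `scored_boxes` is a list of (box, score) pairs (assumed format)."""
    return [box for box, score in scored_boxes if score >= min_overlap]
```

Raising `min_overlap` returns fewer but more accurate localizations, which is exactly the trade-off the self-assessment curves measure.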
7 Implementation details
Associative Embedding.
We use AE in the SURF [17] bag-of-words [8] and HOG feature spaces. We build quantized SURF histograms over a codebook of visual words. For computational efficiency, we approximate the additive kernel with the explicit feature map technique of [18] and train linear E-SVMs. HOG descriptors [4] are computed over a grid and the associated E-SVMs are linear, as in [11]. The soft margin parameters C₁ and C₂ of eq. (2) are set to fixed values. We create an AE of dimensionality 3 for each of HOG and SURF bag-of-words.
Gaussian Processes.
To learn the GP hyperparameters (sec. 5) we subsample windows from the source set. Since the dimensionality of the GP hyperparameters is low, this is sufficient. We set the regularization strength r of eq. (11) to a fixed value. For the scoring function we use different values of θ for object localization and for self-assessment.
For prediction on a target image, we retrieve the most similar source images as described in sec. 5.2, and then infer the distributions of window overlaps using them.
8 Related work
Populating ImageNet.
The two most related works are [2, 3]. Like us, they address the problem of populating ImageNet with automatic annotations (boundingboxes [3] and segmentations [2]). Guillaumin and Ferrari (GF) [3]
train a series of monolithic window classifiers, one for each source class, using different cues (HOG, colour, location, etc.). These are combined into a final window score by a discriminatively trained weighting function and applied to windows in the target set. As we show in sec. 9, GF is not suited for self-assessment. Moreover, our method produces better localizations overall.

The work [2] populates ImageNet with segmentations. It propagates ground-truth segmentations from PASCAL VOC [9] onto ImageNet, using a nearest-neighbour technique [19] to transfer segmentations from a given source set to a target image. We compare to this method experimentally (sec. 9) by putting a bounding-box around their segmentations.
Transfer learning
is used in computer vision to facilitate learning a new target class with the help of labelled examples from related source classes. Transfer is typically done through regularization of model parameters [20, 21], an intermediate attribute layer [22] (e.g. yellow, furry), or by sharing parts [23]. In GPs [24], transfer learning is usually based on sharing hyperparameters between tasks. In this work we share not only the hyperparameters but the inducing points as well. Also, our GP kernel is defined over an augmented AE space, which is constructed specifically for a particular combination of source and target classes. In principle, one could view AE as a kernel learning method for GPs that exploits the specifics of visual data.
Exemplar SVMs
were first proposed by [11] and are rapidly gaining popularity as a better way to measure visual similarity [25, 26, 27]. Their main advantage is the ability to select features within a window that are relevant to the object, down-weighting background clutter. Some authors [26, 27] propose to use E-SVMs as a similarity measure for discovering different aspects of object appearance. They explicitly group training objects into clusters according to their aspects of appearance, producing a set of clean aspect-specific object detectors whose responses on a test image are merged together for the final result. Instead, we embed image windows into a low-dimensional space, where aspects of appearance are expressed smoothly along the space's dimensions.
9 Experiments and conclusion
We perform experiments on the same subset of ImageNet as [2, 3], which allows direct comparison to their results. The subset consists of 219 classes (e.g. phalanger, beagle, etc.). Classes are selected such that each has siblings with some manual bounding-box annotations [3]. For a given target class we consider the following source sets: i) self — the class itself; ii) siblings of the class (as in [3]); iii) family — the class itself plus its siblings (see sec. 2 and fig. 1).
Ground-truth.
We use 92K images with ground-truth bounding-boxes in total, split into two disjoint sets of 60K and 32K. We use the first set exclusively as source and the second exclusively as target. This allows us to compare transfer from different source types (siblings, self, family) on exactly the same target images. The first batch of 60K images is the same as used in [3]. This ensures a proper comparison to [3]: when using siblings as source, our method transfers knowledge from exactly the same images as [3].
Baselines and [3].
For localization, we compare against the following. MidWindow: a window in the center of the image, occupying a fixed fraction of its area. TopObj: the window with the highest objectness score [10]. MKL-SVM: this represents a standard, discriminative approach to object localization, similar to [28]. On the source set we train an SVM on HOG (linear) and an SVM on SURF bag-of-words (linear with the expansion of [18]), using part of the data. To combine them, we train a linear SVM over their outputs using the held-out data. We strengthen this baseline by adding the objectness score, location and scale cues (as in sec. 5) of a window as features for the second-level SVM. GF: we compare to [3], using siblings as the source, with the output of [3] as provided by the authors on their website. Notice that the MKL-SVM baseline and the competitor GF [3] are defined on the same object proposals [10] and features as our AE+GP, which ensures that any improvement comes from better modelling.
Metrics.
To measure the quality of localizations we use the intersection-over-union (IoU) criterion as defined in the PASCAL VOC [9]. We also measure the detection rate: the percentage of images where the output has IoU above 0.5.
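The IoU criterion is standard; a minimal sketch for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
def iou(box_a, box_b):
    """PASCAL-style intersection-over-union of two axis-aligned boxes
    given as (x1, y1, x2, y2) tuples."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```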
To measure the quality of self-assessment we evaluate how well the score (1) ranks the output bounding-boxes. We sort the outputs by their scores and measure the mean IoU of the top-ranked outputs. To compare to [3] we use their scores, released along with their output bounding-boxes. For MKL-SVM we use the score output by the second-level SVM.
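The ranking metric just described can be sketched in a few lines (the data in the example is illustrative, not from the experiments):

```python
def mean_iou_at_top(scores, ious, k):
    """Self-assessment metric: rank the outputs by their score and
    report the mean IoU of the k highest-scored ones."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sum(ious[i] for i in order[:k]) / k
```

A method with good self-assessment has high mean IoU for small k, i.e. its most confident outputs really are its most accurate ones.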
Comparison to baselines and [3].
Table 1 summarizes localization results, averaged over all 32K target images for which we have evaluation ground-truth. The results of our method steadily improve as the source set changes from siblings to self to family, unlike the MKL-SVM baseline, whose performance decreases from self to family. This behaviour shows that our method is applicable to a wide range of knowledge transfer scenarios. Using only siblings as source, we outperform GF by the same margin as GF outperforms the trivial baselines TopObj and MidWindow in terms of mean IoU. Overall, our method delivers the best results for all kinds of source sets and metrics, compared to both competitors and baselines. While GF [3] sometimes produces a bounding-box occupying most of an image, our localizations are typically more specific to the object.

Fig. 7 presents self-assessment curves. Our method nicely trades off the number of returned localizations against their quality, as demonstrated by the visible slope of the solid curves. For all sources, our method outperforms MKL-SVM and GF over the entire range of the curve. The advantage over MKL-SVM is greatest in the left part of the curves, where self-assessment plays a bigger role. Note how both MKL-SVM and GF have cusps in their curves: many high-quality localizations get a low score (GF, right half of the curve) or some high-scoring localizations are poor (MKL-SVM, left half). Interestingly, our MKL-SVM baseline performs similarly to GF when evaluated on all images (tab. 1), but GF is better at self-assessment (fig. 7).
Method     Source    IoU%   IoU at 50%   Detection%
AE+GP      Siblings  56.4   66.6         63.5
GF [3]     Siblings  53.7   63.0         58.5
MKL-SVM    Siblings  54.1   55.8         59.7
AE+GP      Self      58.2   69.0         66.5
AE+GP+     Self      59.9   71.0         68.3
MKL-SVM    Self      56.7   61.8         63.8
AE+GP      Family    58.3   69.4         66.7
MKL-SVM    Family    55.1   58.9         60.8
TopObj     —         50.1   55.3         52.7
MidWindow  —         49.3   —            60.7
KGF [2]    —         59.9   —            66.9
Analysis of AE.
We validate here the importance of AE and whether it can reduce dimensionality without loss in representation power (fig. 8).
In PCA+GP we substitute AE with a standard PCA in each original feature space (HOG, SURF bag-of-words), keeping the rest of the method exactly the same. PCA+GP performs significantly worse than AE+GP, highlighting the importance and power of AE.
AE+GP-d100 increases the dimensionality of AE to 100 for each feature space. This drastic increase makes little difference in the results, showing that AE loses little representation power even when reducing the dimensionality to just 3.
Better features and proposals.
The experiments above demonstrate that our AE+GP outperforms the baselines and [3], given the same features and object proposals. To push performance further we introduce here a version of our method, coined AE+GP+, that uses state-of-the-art features [29] and object proposals [30]. To accommodate the very high-dimensional features of [29] we slightly increase the dimensionality of AE. We also use the initial features (sec. 7), improving SURF bag-of-words by adding spatial binning. Results in fig. 8 and tab. 1 show that AE+GP+ delivers the best results, improving over AE+GP by 1.8% in detection rate. Fig. 6 demonstrates localizations by AE+GP+.
Comparison to [2].
We compare here to the state-of-the-art segmentation method KGF [2] by putting a bounding-box around their output segmentation. AE+GP+ moderately outperforms KGF [2] by 1.4% in detection rate (tab. 1). Most importantly, as KGF [2] assigns no score to its output, self-assessment is impossible: the user cannot automatically retrieve high-quality localizations from KGF [2]. In contrast, AE+GP+ can select the localizations that have high IoU. Also, unlike the scoring schemes in GF and MKL-SVM, ours allows the user to retrieve localizations that are predicted to have an overlap higher than a desired threshold with a desired probability. Using AE+GP+ with self as source, this fully automatically returns a large portion of all localizations with a high mean IoU (very accurate, cf. the PASCAL detection criterion of IoU > 0.5). This amounts to about 251K images in the dataset we processed! The results are released online at http://groups.inf.ed.ac.uk/calvin/projimagenet/page/ .
Conclusion.
Knowledge transfer in ImageNet is motivated by the fact that semantically related objects look similar (e.g. police car and taxi). Hence, in a latent space of appearance variation, image windows that contain objects of related classes are close to each other, while background windows are far away from them. Our work formalizes this intuition with the AE+GP method. AE recovers the latent space of appearance variation and embeds windows into it. Next, we construct a GP over the AE space to transfer localization annotations from the source to the target image set. Thanks to the probabilistic nature of GPs, our model is capable of self-assessment. Large-scale experiments demonstrate that our method outperforms state-of-the-art techniques [2, 3] for populating ImageNet with bounding-boxes and segmentations, as well as a strong MKL-SVM baseline defined on the same features.
Acknowledgements
This work was supported by ERC VisCul starting grant. A. Vezhnevets is also supported by SNSF fellowship PBEZP2142889.
References
 [1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in CVPR, 2009.
 [2] M. Guillaumin, D. Kuettel, and V. Ferrari, “ImageNet Autoannotation with Segmentation Propagation,” IJCV, 2014, to appear.
 [3] M. Guillaumin and V. Ferrari, “Large-scale knowledge transfer for object localization in ImageNet,” in CVPR, Jun 2012.
 [4] N. Dalal and B. Triggs, “Histogram of Oriented Gradients for human detection,” in CVPR, 2005.
 [5] B. Leibe, K. Schindler, and L. Van Gool, “Coupled detection and trajectory estimation for multiobject tracking,” in ICCV, 2007.
 [6] M. Andriluka, S. Roth, and B. Schiele, “Monocular 3d pose estimation and tracking by detection,” in CVPR, 2010.

 [7] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning, The MIT Press, 2005.
 [8] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, “Local features and kernels for classification of texture and object categories: a comprehensive study,” IJCV, 2007.
 [9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” IJCV, 2010.
 [10] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the objectness of image windows,” IEEE Trans. on PAMI, 2012.
 [11] T. Malisiewicz, A. Gupta, and A. A. Efros, “Ensemble of exemplarsvms for object detection and beyond,” in ICCV, 2011.
 [12] S. C. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. A. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society of Information Science, 1990.
 [13] B. Alexe, T. Deselaers, and V. Ferrari, “What is an object?,” in CVPR, 2010.
 [14] C. E. Rasmussen and H. Nickisch, “Gaussian processes for machine learning (gpml) toolbox,” The Journal of Machine Learning Research, 2010.
 [15] C. Liu, J. Yuen, and A. Torralba, “Nonparametric scene parsing: Label transfer via dense scene alignment,” in CVPR, 2009.
 [16] J. Tighe and S. Lazebnik, “Superparsing: Scalable nonparametric image parsing with superpixels,” in ECCV, 2010.
 [17] H. Bay, A. Ess, T. Tuytelaars, and L. van Gool, “SURF: Speeded up robust features,” CVIU, 2008.
 [18] A. Vedaldi and A. Zisserman, “Efficient additive kernels via explicit feature maps,” in CVPR, 2010.
 [19] D. Kuettel and V. Ferrari, “Figureground segmentation by transferring window masks,” in CVPR, 2012.
 [20] L. FeiFei, R. Fergus, and P. Perona, “Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories,” CVIU, 2007.
 [21] T. Tommasi, F. Orabona, and B. Caputo, “Safety in numbers: Learning categories from few examples with multi model knowledge transfer,” in CVPR, IEEE, 2010.
 [22] C. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by betweenclass attribute transfer,” in CVPR, 2009.
 [23] P. Ott and M. Everingham, “Shared parts for deformable partbased models,” in CVPR, 2011.
 [24] E. Bonilla, K. M. Chai, and C. Williams, “Multitask gaussian process prediction,” 2008.
 [25] I. Endres, K. J. Shih, J. Jiaa, and D. Hoiem, “Learning collections of part models for object recognition,” in CVPR, 2013.
 [26] O. Aghazadeh, H. Azizpour, J. Sullivan, and S. Carlsson, “Mixture component identification and learning for visual recognition,” in ECCV, Springer, 2012.
 [27] J. Dong, W. Xia, Q. Chen, J. Feng, Z. Huang, and S. Yan, “Subcategoryaware object classification,” in CVPR, 2013.
 [28] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, “Multiple kernels for object detection,” in ICCV, 2009.
 [29] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders, “Selective search for object recognition,” IJCV, 2013.
 [30] S. Manén, M. Guillaumin, and L. Van Gool, “Prime Object Proposals with Randomized Prim’s Algorithm,” in ICCV, 2013.