Object detectors [Viola and Jones (2001), Felzenszwalb et al. (2010)] locate objects of interest in images and have many applications, including image tagging, consumer photography, and surveillance. Most existing object detectors take a fully supervised learning (FSL) approach, where all the training images are manually annotated with the object location. However, manual annotation of hundreds of object categories is time-consuming, laborious, and subject to human bias. To reduce the amount of manual annotation, a weakly supervised learning (WSL) approach [Deselaers et al. (2010), Siva and Xiang (2011), Nguyen et al. (2009), Pandey and Lazebnik (2011)] is desired. In WSL, the training set is annotated only with a binary label indicating the presence or absence of the object of interest, not the location or extent of the object (Fig. 1). WSL approaches first locate the object of interest in the training images and then use the location information to train a detector in a fully supervised fashion.
Typically three information cues, saliency, inter-class, and intra-class, are used to locate or annotate the object of interest in images known to contain it (positive images). Saliency information ensures that the annotated region is a foreground region. Inter-class information ensures that the annotated regions look dissimilar to all images without the object of interest (negative images). Intra-class information ensures that the annotated regions in all positive images look similar to each other. Methods that use saliency alone (Alexe et al., 2010) select salient regions in each positive image independently. Methods that use inter-class and intra-class information (Deselaers et al., 2010; Siva and Xiang, 2011) typically use saliency to limit the search space of each image by considering only the most salient regions; they then select one of these salient regions by maximising the inter-class and intra-class information.
In this paper we utilise a fourth information cue (Fig. 1) which is typically neglected by other approaches: an auxiliary fully annotated dataset. While we want to reduce manual annotation when learning new object categories, we cannot ignore the fact that many datasets already have manual annotation of object locations (Everingham et al.). However, these auxiliary datasets seem unhelpful at first, since they often contain object categories unrelated to the target object category we wish to annotate. For example, an auxiliary dataset might contain annotations of cars, birds, boats and people, while the target object categories might be cats and buses (Fig. 1). So what information can we actually transfer? When adopting the strategy of selecting the optimal object location from a set of candidate salient regions (Deselaers et al., 2010; Siva and Xiang, 2011), the performance of the selection can be measured by the degree of overlap between the selected region and the ground truth region (Fig. 1). One can safely assume that the more a salient region overlaps the ground truth region, the more similar their appearances are. In other words, there exists a mapping relationship between the degree of overlap (hence the accuracy of annotation) and the appearance similarity. This relationship should hold regardless of the object category, and it is what we propose to learn and transfer to the target data. To quantify this mapping relationship, one must take into consideration the high dimensionality typical of object appearance representations and the inevitable noise. To this end, we formulate a ranking based transfer learning model which, once learned, takes appearance similarity as input and predicts the ranking order of all the candidate salient regions according to their degree of overlap with the (unknown) true object location.
We show that our novel transfer learning model outperforms the state-of-the-art WSL approaches on the challenging PASCAL VOC 2007 dataset.
Related Work. Early works on weakly supervised annotation mainly focused on saliency based approaches (Alexe et al., 2010; Chum and Zisserman, 2007; Russell et al., 2006). While these methods provide a set of potential salient object locations, they were shown to perform poorly for automatic annotation of objects in challenging cluttered images (Deselaers et al., 2010). Recently, many methods (Nguyen et al., 2009; Deselaers et al., 2010; Siva and Xiang, 2011) re-cast the weakly supervised problem as a multiple-instance learning (MIL) problem. In a MIL formulation, each image with the object of interest is treated as a positive bag with many instances (potential object locations), of which at least one is positive, and images without the object of interest are treated as negative bags with only negative instances. MIL based algorithms iteratively select the positive instances in each positive bag using inter-class and/or intra-class information. The approach of Nguyen et al. (2009) is an inter-class method that defines the entire positive image as the initial positive instance and then trains a support vector machine (SVM) to separate these initial positive instances from the negative images. The trained SVM is then used as a detector on the positive training images to refine the object location. However, the initial assumption that the entire image is a good representation of the object is not always true. Pandey and Lazebnik (2011) relaxed this assumption by using a latent SVM that treats the actual location of the object as a latent variable, constrained to overlap the entire image by at least 40%. Unlike the inter-class information based methods (Nguyen et al., 2009; Pandey and Lazebnik, 2011), Deselaers et al. (2010) and Siva and Xiang (2011) use saliency (Alexe et al., 2010) to define the initial instances in each image. The positive instances are then iteratively selected by optimising a cost function based on both inter-class and intra-class information. We show that our transfer learning approach, using an auxiliary dataset, can outperform the annotation accuracy of these MIL methods without using either intra-class or inter-class information from the target data.
Most existing transfer learning methods in computer vision address the classification problem, not the WSL annotation problem. They typically require the target and auxiliary classes to be either the same class across different domains (Pan et al., 2008; Yang et al., 2007; Pan et al., 2011), such as news video from different countries, or different but related classes (Zweig and Weinshall, 2007), such as giraffe and horse. In comparison, our model does not assume that the auxiliary data and target data are related; it is thus more generally applicable. There have been a couple of recent efforts on transferring knowledge between unrelated categories (Zheng et al., 2011; Raina et al., 2007), but they focus on image categorisation, where each image is dominated by a single object. Note that Deselaers et al. (2010) also use an auxiliary dataset for weakly supervised annotation; however, in their work the auxiliary data is used for parameter tuning rather than for learning a transferable model. In contrast, we use a ranking model to learn and transfer knowledge between unrelated categories, which to our knowledge has not been attempted before.
2 Proposed Approach
In our approach we have an auxiliary dataset with a set of fully annotated images (Fig. 1); each image contains an object whose location is manually annotated by a bounding box. Given a target image containing an object from a class not present in the auxiliary dataset, our goal is to annotate the object location in that image. As in Deselaers et al. (2010); Siva and Xiang (2011), for all images in the auxiliary dataset and for the target image we select the top salient regions returned by the generic object detector proposed in Alexe et al. (2010) as potential object locations. Then our goal is to select one of the salient regions in the target image as the object annotation. We can think of selecting the best salient region as a ranking problem: we wish to learn a model from the auxiliary dataset and use it to rank the salient regions in the target image such that the highest ranked region has the greatest overlap with the unknown true location of the object (Fig. 1).
Given that the object in the target image is unrelated to the objects in the auxiliary dataset, we need a feature that is independent of object category yet can still be used to learn the ranking of salient regions by their degree of ground truth overlap. To this end, we extract appearance features and compute the absolute feature difference between candidate regions and the ground truth as the input for learning a ranking model.
More specifically, for each image we represent each of the n salient regions with an unnormalised BoW histogram h_i, where i = 1, ..., n. To compute from h_i a feature that is independent of the object category, we define a difference vector as the feature of interest:

    d_i = |h_i - h_g| / ||h_i - h_g||_1,    (1)

where |.| indicates the element-wise absolute difference, ||.||_1 is the L1 norm, and h_g is the unnormalised BoW histogram of the manually annotated ground truth region in the image. However, by this definition the target image has a difference vector

    d_i^t = |h_i^t - h_g^t| / ||h_i^t - h_g^t||_1,    (2)

which requires us to know h_g^t, the ground truth BoW histogram. Since the true location of the object in the target image is exactly what we are after, we do not know h_g^t in the target class images. To overcome this problem we approximate the ground truth BoW histogram in Eq. 2 by the average BoW histogram of the n salient regions in the target image:

    h_a^t = (1/n) * sum_{i=1}^{n} h_i^t.    (3)

The resulting feature vector is

    d_i^t = |h_i^t - h_a^t| / ||h_i^t - h_a^t||_1.    (4)

However, now there is a discrepancy between training, where we directly use the ground truth BoW histogram, and testing, where we approximate it. To unify the training and testing process, we also approximate the ground truth BoW histogram when computing the training features. That is, for the auxiliary dataset we replace Eq. 1 with:

    d_i = |h_i - h_a| / ||h_i - h_a||_1,  where  h_a = (1/n) * sum_{i=1}^{n} h_i.    (5)
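To make the feature construction concrete, here is a minimal NumPy sketch of the difference-vector computation with the averaged-histogram approximation (Eqs. 3 and 4). The function and array names are illustrative, not from the original implementation, and the toy histograms stand in for real 2000-bin BoW vectors.

```python
import numpy as np

def difference_vector(h_region, h_ref):
    """L1-normalised element-wise absolute difference between
    two unnormalised BoW histograms."""
    d = np.abs(h_region - h_ref)
    n = d.sum()
    return d / n if n > 0 else d

def target_features(histograms):
    """For one target image: approximate the unknown ground-truth
    histogram by the mean of the salient-region histograms, then
    compute one difference vector per region."""
    h_approx = histograms.mean(axis=0)
    return np.array([difference_vector(h, h_approx) for h in histograms])

# toy example: 3 salient regions, a 5-word vocabulary
H = np.array([[4., 0., 1., 0., 0.],
              [0., 3., 0., 2., 0.],
              [1., 1., 1., 1., 1.]])
D = target_features(H)  # each row sums to 1 by construction
```

Because each difference vector is L1-normalised, regions are compared by the *shape* of their deviation from the reference histogram rather than its magnitude.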
2.2 Ranking Model
For each image we now have n salient regions, each represented by its corresponding difference vector d_i, i = 1, ..., n. All salient regions are sorted by their overlap with the ground truth bounding box, where overlap is defined, following Everingham et al., as the intersection area divided by the union area. The sorted index of a salient region is used as the rank of its difference vector d_i, with the salient region that overlaps the ground truth bounding box most given a rank of 1. Salient regions with no overlap are ranked by their distance to the ground truth bounding box, because regions nearer the ground truth location are likely to contain more relevant contextual information than regions farther away (Fig. 2).
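The rank assignment above can be sketched as follows, assuming boxes are (x1, y1, x2, y2) tuples. The text does not specify which distance is used for zero-overlap regions, so centre-to-centre distance is used here as an assumed stand-in.

```python
import numpy as np

def iou(a, b):
    """PASCAL VOC overlap: intersection area / union area."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def centre_dist(a, b):
    ca = ((a[0] + a[2]) / 2, (a[1] + a[3]) / 2)
    cb = ((b[0] + b[2]) / 2, (b[1] + b[3]) / 2)
    return ((ca[0] - cb[0]) ** 2 + (ca[1] - cb[1]) ** 2) ** 0.5

def rank_regions(boxes, gt):
    """Rank 1 = greatest overlap with the ground truth; regions with
    zero overlap are ordered by their distance to the ground truth."""
    keys = [(-iou(b, gt), centre_dist(b, gt)) for b in boxes]
    order = sorted(range(len(boxes)), key=lambda i: keys[i])
    ranks = [0] * len(boxes)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    return ranks
```

Sorting by the pair (-overlap, distance) implements both rules at once: overlap decides first, and distance only breaks the tie among zero-overlap regions.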
The difference vector and rank order pairs are used to train a RankSVM Joachims (2002). RankSVM is an ideal choice because it can cope with high dimensional feature spaces and large scale learning, exactly the problems faced in learning an object annotation ranking model. RankSVM was originally introduced to improve the performance of Internet search engines. The text document search/retrieval problem is similar to ours in that it ranks the search results for a query based on how relevant each result is to the query. In our case we wish to rank the salient regions based on how similar (relevant) each is to the ground truth in terms of visual appearance. In this context we have a set of queries (images) and for each image a set of preference pairs P, indicating the ranking relationships between pairs of salient regions in that image. The number of preference pairs becomes enormous even for a moderate number of images and salient regions per image. For efficient learning, we use the primal-based pairwise RankSVM algorithm proposed in Chapelle and Keerthi (2010) to minimise the objective function:

    (1/2) ||w||^2 + C * sum_{(i,j) in P} L(w^T d_i - w^T d_j),    (6)

where L is the loss function and the weight C is obtained by cross-validation.
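A toy version of the pairwise objective can be minimised with plain gradient descent. The sketch below is a didactic stand-in for the primal RankSVM solver of Chapelle and Keerthi (2010), not their algorithm: it assumes a squared hinge loss and forms preference pairs only within each image (group), as described above.

```python
import numpy as np

def rank_svm_fit(X, ranks, groups, C=1.0, lr=0.01, epochs=300):
    """Primal pairwise RankSVM with squared hinge loss, trained by
    gradient descent. A pair (i, j) with rank[i] < rank[j] (same
    group) encodes that region i should score above region j."""
    X = np.asarray(X, dtype=float)
    pairs = [(i, j)
             for i in range(len(X))
             for j in range(len(X))
             if groups[i] == groups[j] and ranks[i] < ranks[j]]
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = w.copy()                 # gradient of 0.5 * ||w||^2
        for i, j in pairs:
            m = w @ (X[i] - X[j])       # pairwise margin
            if m < 1:                   # squared hinge (1 - m)^2
                grad -= 2 * C * (1 - m) * (X[i] - X[j])
        w -= lr * grad
    return w
```

On real data one would use the batched primal solver (or stochastic updates) rather than looping over all pairs, since the number of pairs grows quadratically per image.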
2.3 Annotating Target Image
Given a target image, we first obtain its salient regions using (Alexe et al., 2010). For each salient region, we compute the difference vector per Eq. 4. All difference vectors are then ranked by the RankSVM trained on the auxiliary dataset, and the salient region with the highest rank is selected as the annotation for the target image. Note that we do not even use the weak annotation information, i.e. whether the target image contains the object of interest or not.
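Putting the pieces together, annotating one target image reduces to scoring each region's difference vector with the learned linear RankSVM weight vector and taking the argmax. This is a minimal sketch; the function name and inputs are illustrative.

```python
import numpy as np

def annotate(histograms, boxes, w):
    """Select the top-ranked salient region of a target image.
    histograms: (n, vocab) unnormalised BoW histograms of the regions,
    boxes: the n candidate regions, w: learned RankSVM weights."""
    h_approx = histograms.mean(axis=0)                    # Eq. 3
    D = np.abs(histograms - h_approx)                     # Eq. 4
    D = D / np.maximum(D.sum(axis=1, keepdims=True), 1e-12)
    scores = D @ w                                        # linear ranking score
    best = int(np.argmax(scores))
    return boxes[best], scores

# toy example: 3 regions, 3 visual words, a hand-set weight vector
H = np.array([[4., 0., 0.],
              [0., 4., 0.],
              [0., 0., 4.]])
best, scores = annotate(H, ["box-a", "box-b", "box-c"], np.array([0., 1., 0.]))
```

Note that no class label of the target image is needed anywhere in this step, matching the claim that even the weak annotation is not used.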
3 Experiments
Dataset – All experiments are conducted on the challenging VOC 2007 dataset Everingham et al. We use the AllView dataset defined in Siva and Xiang (2011) for comparison; it consists of all 20 classes from the VOC 2007 training and validation sets, with no pose annotation. For our transfer learning model, we randomly choose 10 classes as the auxiliary dataset. After training the ranking model on these 10 classes, we annotate the remaining 10 target classes with our model. The random selection is repeated 10 times and we report the average result over all 20 classes. As in Pandey and Lazebnik (2011); Siva and Xiang (2011), for all 20 classes we exclude images annotated as difficult. We consider a predicted annotation (bounding box) correct if it has more than 50% overlap with the ground truth bounding box, as defined in Everingham et al., and report the percentage of correctly annotated images as the performance measure.
Regular-grid SIFT descriptors are computed on all images in the training data and quantised into 2000 words using k-means clustering. Each of the salient bounding boxes in an image, obtained by Alexe et al. (2010), is then described by a BoW histogram with 2000 bins.
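The BoW step amounts to nearest-centroid quantisation against a precomputed visual vocabulary. A minimal sketch, where the two-word toy vocabulary stands in for the 2000 k-means centroids used in the paper:

```python
import numpy as np

def bow_histogram(descriptors, vocabulary):
    """Assign each local descriptor to its nearest visual word
    (vocabulary row) and count hits per word."""
    # squared distance from every descriptor to every centroid
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=len(vocabulary))

# toy example: 2-word vocabulary, 3 descriptors in 2-D
vocab = np.array([[0.0, 0.0], [10.0, 10.0]])
desc = np.array([[1.0, 1.0], [9.0, 9.0], [0.0, 1.0]])
hist = bow_histogram(desc, vocab)
```

The histograms are deliberately left unnormalised here, since the normalisation happens later inside the difference-vector computation.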
Table 1: Annotation accuracy on the AllView dataset (20 classes).

| Method | AllView 20 Class |
| --- | --- |
| Siva and Xiang (2011) | 29.29%* |
| MI-SVM Andrews et al. (2003) | 25.12% |
| Objectness Alexe et al. (2010) | 24.12% |

*This number is computed from the per-class accuracy reported in Siva and Xiang (2011).
Table 2: Per-class annotation accuracy on AllView.

| Class | Ranking Model | Siva and Xiang (2011) | MI-SVM Andrews et al. (2003) |
| --- | --- | --- | --- |
| aeroplane | 54.74% | 45.40% | 37.80% |
| bicycle | 22.65% | 20.60% | 17.70% |
| bird | 33.71% | 29.70% | 26.70% |
| boat | 24.45% | 12.20% | 13.80% |
| bottle | 4.62% | 4.10% | 4.90% |
| bus | 33.90% | 37.10% | 34.40% |
| car | 42.48% | 41% | 33.70% |
| cat | 57.04% | 53.40% | 46.60% |
| chair | 7.30% | 6.5% | 5.4% |
| cow | 39.05% | 31.90% | 29.80% |
| diningtable | 24.13% | 20.5% | 14.5% |
| dog | 43.32% | 40.9% | 32.80% |
| horse | 41.30% | 37.3% | 34.80% |
| motorbike | 51.49% | 46.50% | 41.60% |
| person | 25.34% | 22.3% | 19.9% |
| pottedplant | 13.26% | 10.2% | 11.4% |
| sheep | 27.97% | 27.1% | 25% |
| sofa | 29.51% | 32.30% | 23.60% |
| train | 54.55% | 49.00% | 45.20% |
| tvmonitor | 11.76% | 9.8% | 8.6% |
Ranking model vs. state-of-the-art WSL approaches – We compare our ranking model with Objectness (Alexe et al., 2010), MI-SVM Andrews et al. (2003) and the method of Siva and Xiang (2011). Table 1 shows that our ranking model outperforms all of these competitors. Anecdotal results are shown in Fig. 4. Objectness Alexe et al. (2010) is a generic object detector which provides bounding boxes of generic foreground objects, each with a score indicating its "objectness", i.e. the degree to which it is a foreground region. The region with the highest "objectness" is selected as the annotation. Our ranking model, as well as MI-SVM (Andrews et al., 2003) and Siva and Xiang (2011), re-weights the top boxes chosen by Alexe et al. (2010) and as such is closely related to objectness. From Table 1 we can see that our ranking model outperforms the objectness measure by 8%, because it uses the auxiliary dataset to learn which salient regions best overlap the ground truth. MI-SVM Andrews et al. (2003), like our method, starts by approximating the positive instance of each image as the average of the top instances, then iteratively trains an SVM and re-annotates the target images. MI-SVM uses only the weak annotation information from the target classes, whereas our method uses a strongly annotated auxiliary dataset. The superior performance of our model suggests that the transferable information learned from the strongly annotated dataset is more useful, even though that dataset contains object classes unrelated to the target class. The most recent work on weakly supervised annotation is the method of Siva and Xiang (2011), which fuses results from inter-class, intra-class and saliency information to select the best annotation for each target image, and achieves the best annotation accuracy to date on the VOC 2007 data. Our ranking model, without using the inter-class and intra-class information, achieves nearly a 3% improvement over their fused results. Again this shows the effectiveness of our model in extracting transferable information from strongly annotated auxiliary data.
Further evaluations of our method – Table 2 shows the annotation accuracy of our method on each of the 20 classes and gives some insight into where the strength of our model lies. Our method wins on all but three classes, and its performance is particularly strong on the boat (almost double the accuracy of the state-of-the-art methods) and aeroplane classes. Both classes have a very consistent background, i.e. strong contextual information: boats mostly appear on water and aeroplanes often appear in the sky. This poses a stern challenge to the existing MIL based approaches. Specifically, methods using intra-class information will tend to select water or sky regions, which appear more similar to one another than the actual boats or aeroplanes, which have higher within-class variance. Methods using inter-class information will struggle to eliminate water and sky (see Fig. 4) because these occur rarely in the other classes. Inter-class methods will also have trouble with regions containing parts of the object together with parts of the background. Unlike intra-class and inter-class methods, our method explicitly tries to rank non-overlapped and partially overlapped regions lower than highly overlapped regions (the aeroplane examples in Fig. 4 show clearly that those partially overlapped regions, selected wrongly by the alternative methods, are ranked lower by our model). This explains the superior performance of our method on these classes.
The annotation accuracy reported above was obtained by approximating the ground truth feature with the average of the salient bounding box features for both the auxiliary and target datasets (Eqs. 3 and 5). Alternatively, one could use the actual ground truth feature for training (Eq. 1), but this gives a slight drop in accuracy. This suggests that, to make the learned model generalise better to target classes, using the same approximation for the auxiliary data, even though it is not necessary there, can be beneficial. We also notice that the annotation accuracy of our method varies little across the different random auxiliary/target splits (low standard deviation). This supports our assumption that the information we transfer is not affected by the relationship between auxiliary and target object classes.
Table 3: Score-level fusion of our ranking model with existing methods.

| Method | By Itself | + Ranking Model |
| --- | --- | --- |
| Objectness Alexe et al. (2010) | 24.12% | 32.60% |
| MI-SVM Andrews et al. (2003) | 25.12% | 31.23% |
| Siva and Xiang (2011) | 29.29% | 30.63% |
Fusion with existing methods – Since the information we use is very different from that used by existing methods, we combine our ranking model with existing methods, including objectness (Alexe et al., 2010), MI-SVM Andrews et al. (2003) and Siva and Xiang (2011), by score level fusion. The results are shown in Table 3. Combining our ranking model with an existing method consistently improves that method's performance. The best overall performance is obtained by combining our ranking model with the objectness score of (Alexe et al., 2010). The ranking model plus objectness performs best because the two use related but complementary cues: objectness predicts saliency from cues such as image frequency and colour contrast, whereas our ranking model learns how salient regions tend to overlap with the actual ground truth.
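The text does not detail the fusion scheme; a common choice, shown here purely as an assumption, is to min-max normalise each method's scores over the candidate regions of an image and then take a weighted sum.

```python
import numpy as np

def fuse_scores(score_lists, weights=None):
    """Score-level fusion over one image's candidate regions:
    min-max normalise each method's scores, then weighted sum."""
    fused = np.zeros(len(score_lists[0]))
    weights = weights or [1.0] * len(score_lists)
    for s, wgt in zip(score_lists, weights):
        s = np.asarray(s, dtype=float)
        rng = s.max() - s.min()
        fused += wgt * ((s - s.min()) / rng if rng > 0 else np.zeros_like(s))
    return fused

# toy example: objectness scores and ranking-model scores for 3 regions
obj = [0.1, 0.5, 0.9]
rnk = [2.0, 8.0, 4.0]
fused = fuse_scores([obj, rnk])
```

Normalising per image puts the two score scales on a common footing before summing, so neither method dominates simply by having larger raw values.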
Table 4: Annotation accuracy of alternative transfer learning models on AllView.

| Method | AllView 20 Class |
| --- | --- |
| Generic Object Detector | 7.42% |
Alternative transfer learning methods – The strongly annotated auxiliary dataset can be used to learn alternative transfer learning models. First, we consider learning a generic object detector. All salient regions in the auxiliary dataset with greater than 50% overlap with the ground truth are put in the positive class and the rest in the negative class. A linear SVM (SVMs with non-linear kernels are too expensive to run on this dataset; note that the RankSVM implementation we adopt also uses a linear kernel) is then trained as a classifier on the raw BoW histograms. The performance of this generic object detector is reported in Table 4. A generic object class has huge intra-class variance and is multi-modal in nature; more importantly, the auxiliary objects are visually very different from the target object classes. This basic transfer learning approach thus performs poorly.
Table 5: Annotation accuracy on the SingleView dataset.

| Method | SingleView 6 Class |
| --- | --- |
| Ranking Model + Objectness | 39.74% |
| Siva and Xiang (Fused) (2011) | 39.60% |
| Deselaers et al. (2010) | 35% |
| Pandey and Lazebnik (2011), before cropping | 36.72% |
| Pandey and Lazebnik (2011), after cropping* | 43.73% |

*The cropping method can be applied to any other method to improve performance.
Second, we compare our ranking method with a non-ranking transfer learning model, to show the importance of formulating this as a ranking problem. For an easier comparison, we consider the same setup as our ranking model but limit the number of ranks to two: instead of using the full set of ranks, all salient regions are ranked either 1 or 2 during training, depending on whether they overlap the ground truth by more than 50%. This model (named the 2Rank model to distinguish it from our full ranking model) is compared with a non-ranking model, which is essentially a binary linear SVM with the two ranks used as class labels. Both models use identical absolute difference features as input and differ only in the formulation of their cost functions. Table 4 shows that the ranking formulation leads to significantly better performance.
Comparison to methods using single pose data – As a last comparison, we evaluate the performance of our ranking model on the much simpler and smaller SingleView dataset used in Deselaers et al. (2010); Pandey and Lazebnik (2011); Siva and Xiang (2011). SingleView is a subset of VOC 2007 that consists of six classes (aeroplane, bicycle, boat, bus, horse, and motorbike), where the left and right poses are treated as separate classes for a total of 12 classes; that is, both the presence of the object and the object pose need to be manually annotated. For SingleView we use the classes bird, car, cat, cow, dog and sheep as the auxiliary dataset, per Deselaers et al. (2010). The results are presented in Table 5. Our method is comparable to the competitors on this much easier task.
4 Conclusion
In this paper we presented a novel ranking based transfer learning method for object annotation, which effectively transfers a model for automatic annotation of object location from an auxiliary dataset to a target dataset with completely unrelated object categories. Our experiments demonstrate that our transfer learning based approach achieves higher accuracy than the state-of-the-art weakly supervised approaches.
References

- Alexe et al. (2010) B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In IEEE Conference on Computer Vision and Pattern Recognition, pages 73–80, June 2010.
- Andrews et al. (2003) S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Advances in Neural Information Processing Systems 15, pages 561–568. MIT Press, 2003.
- Chapelle and Keerthi (2010) O. Chapelle and S. S. Keerthi. Efficient algorithms for ranking with SVMs. Information Retrieval, 13(3):201–215, June 2010.
- Chum and Zisserman (2007) O. Chum and A. Zisserman. An exemplar model for learning object classes. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, June 2007.
- Deselaers et al. (2010) T. Deselaers, B. Alexe, and V. Ferrari. Localizing objects while learning their appearance. In European Conference on Computer Vision, pages 452–466, Berlin, Heidelberg, 2010. Springer-Verlag.
- Everingham et al. M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html.
- Felzenszwalb et al. (2010) P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, September 2010.
- Joachims (2002) T. Joachims. Optimizing search engines using clickthrough data. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142, New York, NY, USA, 2002. ACM.
- Nguyen et al. (2009) M. H. Nguyen, L. Torresani, F. de la Torre, and C. Rother. Weakly supervised discriminative localization and classification: a joint learning process. In International Conference on Computer Vision, pages 1925–1932, 2009.
- Pan et al. (2008) S. J. Pan, J. T. Kwok, and Q. Yang. Transfer learning via dimensionality reduction. In International Conference on Artificial Intelligence, pages 677–682, 2008.
- Pan et al. (2011) S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210, February 2011.
- Pandey and Lazebnik (2011) M. Pandey and S. Lazebnik. Scene recognition and weakly supervised object localization with deformable part-based models. In IEEE International Conference on Computer Vision, pages 1307–1314, 2011.
- Raina et al. (2007) R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: transfer learning from unlabeled data. In International Conference on Machine Learning, pages 759–766, New York, NY, USA, 2007. ACM.
- Russell et al. (2006) B. C. Russell, W. T. Freeman, A. A. Efros, J. Sivic, and A. Zisserman. Using multiple segmentations to discover objects and their extent in image collections. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1605–1614, 2006.
- Siva and Xiang (2011) P. Siva and T. Xiang. Weakly supervised object detector learning with model drift detection. In IEEE International Conference on Computer Vision, pages 343–350, November 2011.
- Viola and Jones (2001) P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages I-511–I-518, 2001.
- Yang et al. (2007) J. Yang, R. Yan, and A. G. Hauptmann. Cross-domain video concept detection using adaptive SVMs. In International Conference on Multimedia, pages 188–197, New York, NY, USA, 2007. ACM.
- Zheng et al. (2011) W. S. Zheng, S. G. Gong, and T. Xiang. Unsupervised selective transfer learning for object recognition. In Asian Conference on Computer Vision, pages 527–541, Berlin, Heidelberg, 2011. Springer-Verlag.
- Zweig and Weinshall (2007) A. Zweig and D. Weinshall. Exploiting object hierarchy: combining models from different category levels. In IEEE International Conference on Computer Vision, 2007.