Reusable model design becomes desirable with the rapid expansion of computer vision and machine learning applications. In this paper, we focus on the reusability of pre-trained deep convolutional models. Specifically, different from treating pre-trained models as feature extractors, we reveal more treasures beneath convolutional layers, i.e., the convolutional activations could act as a detector for the common object in the image co-localization problem. We propose a simple yet effective method, termed Deep Descriptor Transforming (DDT), for evaluating the correlations of descriptors and then obtaining the category-consistent regions, which can accurately locate the common object in a set of unlabeled images, i.e., unsupervised object discovery. Empirical studies validate the effectiveness of the proposed DDT method. On benchmark image co-localization datasets, DDT consistently outperforms existing state-of-the-art methods by a large margin. Moreover, DDT also demonstrates good generalization ability for unseen categories and robustness for dealing with noisy data. Beyond those, DDT can also be employed for harvesting web images into valid external data sources for improving the performance of both image recognition and object detection.
Model reuse (Zhou, 2016) attempts to construct a model by utilizing existing available models, mostly trained for other tasks, rather than building a model from scratch. In deep learning in particular, since deep convolutional neural networks have achieved great success in various tasks involving images, videos, text and more, several studies reuse deep models pre-trained on ImageNet (Russakovsky et al, 2015).
In computer vision, models pre-trained on ImageNet have been successfully adapted to various uses, e.g., as universal feature extractors (Cimpoi et al, 2016; Wang et al, 2015; Li et al, 2016), object proposal generators (Ghodrati et al, 2015), etc. In particular, Wei et al (2017a) proposed the SCDA (Selective Convolutional Descriptor Aggregation) method to utilize pre-trained models for both localizing a single fine-grained object (e.g., birds of different species) in each image and retrieving fine-grained images of the same classes/species in an unsupervised fashion.
In this paper, we reveal that the convolutional activations can be used as a detector for the common object in image co-localization. Image co-localization (a.k.a. unsupervised object discovery) is a fundamental computer vision problem, which simultaneously localizes objects of the same category across a set of distinct images. Specifically, we propose a simple but effective method termed Deep Descriptor Transforming (DDT) for image co-localization. In DDT, the deep convolutional descriptors extracted from pre-trained deep convolutional models are transformed into a new space, where the correlations between these descriptors can be evaluated. By leveraging the correlations among images in the image set, the common object inside these images can be located automatically without additional supervision signals. The pipeline of DDT is shown in Fig. 1. To our knowledge, this is the first work to demonstrate that convolutional activations/descriptors in pre-trained models can act as a detector for the common object.
Experimental results show that DDT significantly outperforms existing state-of-the-art methods, including image co-localization and weakly supervised object localization, in both the deep learning and hand-crafted feature scenarios. Besides, we empirically show that DDT has a good generalization ability for unseen images apart from ImageNet. More importantly, the proposed method is robust, because DDT can also detect the noisy images which do not contain the common object. Thanks to these advantages, our method could be used as a tool to harvest easy-to-obtain but noisy web images. We can employ DDT to remove noisy images from web datasets for improving image recognition accuracy. Moreover, it can also be utilized to supply object bounding boxes of web images. Then, we use these images with automatically labeled object boxes as valid external data sources to enhance object detection performance.
Our main contributions are as follows:
We propose a simple yet effective method, i.e., Deep Descriptor Transforming, for unsupervised object discovery and co-localization. Besides, DDT reveals another possibility of reusing deep pre-trained networks, i.e., convolutional activations/descriptors can play a role as a common object detector.
The co-localization process of DDT is both effective and efficient, and does not require image labels, negative images or redundant object proposals. DDT consistently outperforms state-of-the-art image co-localization methods and weakly supervised object localization methods. With the ensemble of multiple CNN layers, DDT could further improve its co-localization performance.
DDT has a good generalization ability for unseen categories and robustness for dealing with noisy data. Thanks to these advantages, DDT can be employed beyond the narrow co-localization task. Specifically, it can be used as a generalized tool for exploiting noisy but free web images. By removing noisy images and automatically supplying object bounding boxes, these web images processed by DDT could become valid external data sources for improving both recognition and detection performance. We thus provide a very useful tool for automatically annotating images. The effectiveness of DDT augmentation on recognition and detection is validated in Sec. 4.6.
Based on the previous point, we also collect an object detection dataset from web images, named WebVOC. It shares the same 20 categories as the PASCAL VOC dataset (Everingham et al, 2015), and has a similar dataset scale (10k images) compared with PASCAL VOC. We also release the WebVOC dataset with the bounding boxes automatically generated by DDT for further study.
This paper extends our preliminary work (Wei et al, 2017b). Compared with it, we now further introduce the multiple layer ensemble strategy for improving co-localization performance, provide DDT augmentation for handling web images, apply the proposed method on webly-supervised learning tasks (i.e., both recognition and detection), and supply our DDT based webly object detection dataset.
The remainder of the paper is organized as follows. In Sec. 2, we briefly review related literature of CNN model reuse, image co-localization and webly-supervised learning. In Sec. 3, we introduce our proposed method (DDT and its variant DDT$^{+}$). Sec. 4 reports the image co-localization results and the results of webly-supervised learning tasks. Finally, we conclude the paper in Sec. 5.
We briefly review three lines of related work: model reuse of CNNs, research on image co-localization and webly-supervised learning.
Reusability has been emphasized by Zhou (2016) as a crucial characteristic of the new concept of learnware. It would be ideal if models can be reused in scenarios that are very different from their original training scenarios. Particularly, with the breakthrough in image classification using Convolutional Neural Networks (CNN), pre-trained CNN models trained for one task (e.g., recognition) have also been applied to domains different from their original purposes (e.g., for describing texture (Cimpoi et al, 2016) or finding object proposals (Ghodrati et al, 2015)). However, such adaptations of pre-trained models still require further annotations in the new domain (e.g., image labels). In contrast, DDT deals with the image co-localization problem in an unsupervised setting.
Coincidentally, several recent works also shed light on CNN pre-trained model reuse in the unsupervised setting, e.g., SCDA (Selective Convolutional Descriptor Aggregation) (Wei et al, 2017a). SCDA is proposed for handling the fine-grained image retrieval task, where it uses pre-trained models (from ImageNet) to locate main objects in fine-grained images. It is the most related work to ours, even though SCDA is not for image co-localization. Different from our DDT, SCDA assumes that each image contains only one object of interest and that objects from other categories do not exist. Thus, SCDA locates the object using cues from this single image assumption. Clearly, it cannot work well for images containing diverse objects (cf. Table 2 and Table 3), and also cannot handle data noise (cf. Sec. 4.5).
Image co-localization, a.k.a. unsupervised object discovery (Cho et al, 2015; Wang et al, 2015), is a fundamental problem in computer vision, which requires discovering the common object that appears in sets of positive example images only (without any negative examples or further supervision). Image co-localization shares some similarities with image co-segmentation (Zhao and Fu, 2015; Kim et al, 2011; Joulin et al, 2012). Instead of generating a precise segmentation of the related objects in each image, co-localization methods aim to return a bounding box around the object. Moreover, it also allows us to extract rich features from within the boxes to compare across images, which has been shown to be very helpful for detection (Tang et al, 2014).
Additionally, co-localization is also related to weakly supervised object localization (WSOL) (Zhang et al, 2016; Bilen et al, 2015; Wang et al, 2014; Siva and Xiang, 2011). But the key difference between them is that WSOL requires manually-labeled negative images whereas co-localization does not. Thus, WSOL methods could achieve better localization performance than co-localization methods. However, our proposed methods perform comparably with state-of-the-art WSOL methods and even outperform them (cf. Table 4).
In the literature, some representative co-localization methods are based on low-level visual cues and optimization algorithms. Tang et al (2014) formulated co-localization as a boolean constrained quadratic program which can be relaxed to a convex problem. Then, it was further accelerated by the Frank-Wolfe algorithm (Joulin et al, 2014). After that, Cho et al (2015) proposed a Probabilistic Hough Matching algorithm to match object proposals across images and then dominant objects are localized by selecting proposals based on matching scores.
Recently, several co-localization methods based on pre-trained deep convolutional models have also emerged, e.g., Li et al (2016); Wang et al (2014). Unfortunately, these methods just treated pre-trained models as simple feature extractors to extract the fully connected representations, and did not sufficiently mine the treasures beneath the convolutional layers (i.e., leveraging the original correlations between deep descriptors among convolutional layers). Moreover, these methods also require object proposals as a part of their object discovery, which not only makes them highly dependent on the quality of object proposals, but may also lead to huge computational costs. In addition, almost all the previous co-localization methods cannot handle noisy data, except for Tang et al (2014).
Compared with previous works, our DDT is unsupervised, without utilizing bounding boxes, additional image labels or redundant object proposals. Images only need one forward run through a pre-trained model. Then, efficient deep descriptor transforming is employed for obtaining the category-consistent image regions. DDT is very easy to implement, and surprisingly has good generalization ability and robustness. Furthermore, DDT can be used as a valid data augmentation tool for handling noisy but free web images.
Recent development of deep CNNs has led to great success in a variety of computer vision tasks. This success is largely driven by the availability of large scale well-annotated image datasets, e.g., ImageNet (Russakovsky et al, 2015), MS COCO (Lin et al, 2014) and PASCAL VOC (Everingham et al, 2015). However, annotating a massive number of images is extremely labor-intensive and costly. To reduce the annotation labor costs, an alternative approach is to obtain the image annotations directly from Internet image search engines, e.g., Google or Bing.
However, the annotations of web images returned by a search engine will inevitably be noisy since the query keywords may not be consistent with the visual content of target images. Thus, webly-supervised learning methods are proposed for overcoming this issue.
There are two main branches of webly-supervised learning. The first branch attempts to boost existing object recognition task performance using web resources (Zhuang et al, 2017; Papandreou et al, 2015; Xiao et al, 2015). Some work was implemented as semi-supervised frameworks by first generating a small group of labeled seed images and then enlarging the dataset from these seeds via web data, e.g., Papandreou et al (2015); Xiao et al (2015). Very recently, Zhuang et al (2017) proposed a two-level attention framework for dealing with webly-supervised classification, which achieves a new state-of-the-art. Specifically, they not only used a high-level attention focusing on a group of images for filtering out noisy images, but also employed a low-level attention for capturing the discriminative image regions on the single image level.
The second branch is learning visual concepts directly from the web, e.g., Fergus et al (2014); Wang et al (2008). Methods belonging to this category usually collected a large image pool from image search engines and then performed a filtering operation to remove noise and discover visual concepts. Our strategy for handling web data based on DDT naturally falls into the second category. In practice, since DDT could (1) recognize noisy images and also (2) supply bounding boxes of objects, we leverage the first usage of DDT to handle webly-supervised classification (cf. Table 6 and Table 7), and leverage both usages to deal with webly-supervised detection (cf. Table 8 and Table 9).
In this section, we propose the Deep Descriptor Transforming (DDT) method. Firstly, we introduce notations used in this paper. Then, we present the DDT process followed by discussions and analyses. Finally, in order to further improve the image co-localization performance, the multiple layer ensemble strategy is utilized in DDT.
The following notation is used in the rest of this paper. The term “feature map” indicates the convolution results of one channel; the term “activations” indicates feature maps of all channels in a convolution layer; and the term “descriptor” indicates the $d$-dimensional component vector of activations.
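Given an input image of size $H \times W$, the activations of a convolution layer are formulated as an order-3 tensor $T$ with $h \times w \times d$ elements. $T$ can be considered as having $h \times w$ cells and each cell contains one $d$-dimensional deep descriptor. For the $n$-th image in the image set, we denote its corresponding deep descriptors as $X^n = \{\mathbf{x}^n_{(i,j)} \in \mathbb{R}^d\}$, where $(i,j)$ is a particular cell ($i \in \{1, \ldots, h\}$, $j \in \{1, \ldots, w\}$) and $n \in \{1, \ldots, N\}$.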
Since SCDA (Selective Convolutional Descriptor Aggregation) (Wei et al, 2017a) is the most related work to ours, we hereby present a recap of this method. SCDA is proposed for dealing with the fine-grained image retrieval problem. It employs pre-trained CNN models to select the meaningful deep descriptors by localizing the main object in fine-grained images in an unsupervised manner. SCDA assumes that each image contains only one main object of interest and no objects of other categories. Thus, the object localization strategy is based on the activation tensor of a single image.
Concretely, for an image, the activation tensor is added up through the depth direction. Thus, the 3-D tensor becomes a 2-D matrix, which is called the “aggregation map” in SCDA. Then, the mean value $\bar{a}$ of the aggregation map is regarded as the threshold for localizing the object. If the activation response in the position $(i,j)$ of the aggregation map is larger than $\bar{a}$, it indicates the object might appear in that position.
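The SCDA recap above can be illustrated with a minimal sketch. It assumes the activations of one image are given as a nested list of shape height x width x depth; the function name `scda_mask` and the toy values are hypothetical and are not taken from the SCDA implementation.

```python
def scda_mask(activations):
    """Threshold the SCDA-style aggregation map of one image.

    activations: nested list of shape h x w x d (one descriptor per cell).
    Returns (aggregation_map, mask), where mask[i][j] is True when the
    summed response at cell (i, j) is larger than the map's mean value.
    """
    # Add the activation tensor up through the depth direction.
    agg = [[sum(cell) for cell in row] for row in activations]
    # The mean value of the aggregation map is the localization threshold.
    values = [v for row in agg for v in row]
    mean = sum(values) / len(values)
    mask = [[v > mean for v in row] for row in agg]
    return agg, mask


# Toy 2 x 2 x 3 activation tensor (made-up, non-negative values).
toy = [
    [[0.0, 0.1, 0.0], [2.0, 1.0, 3.0]],
    [[0.2, 0.0, 0.1], [1.5, 2.5, 0.5]],
]
agg_map, obj_mask = scda_mask(toy)
```

Because the threshold is computed from a single image, every strongly responding region passes it, regardless of whether it belongs to a category shared with other images.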
What distinguishes DDT from SCDA is that we can leverage the correlations beneath the whole image set, instead of a single image. Additionally, different from weakly supervised object localization, we have neither the image labels nor the negative image sets used in WSOL, so the only information we can use comes from the pre-trained models. Here, we transform the deep descriptors in convolutional layers to mine the hidden cues for co-localizing common objects.
Principal component analysis (PCA) is a statistical procedure, which uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables (i.e., the principal components). This transformation is defined in such a way that the first principal component has the largest possible variance, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to all the preceding components.
PCA is widely used in computer vision and machine learning for image denoising (Jiao et al, 2017), 3D object retrieval (Sfikas et al, 2011), statistical shape modeling (Zhang et al, 2015), subspace learning (Garg et al, 2013; De la Torre and Black, 2003), and so on. Specifically, in this paper, we utilize PCA as projection directions for transforming these deep descriptors to evaluate their correlations. Then, on each projection direction, the corresponding principal component’s values are treated as the cues for image co-localization, especially the first principal component. Thanks to the property of this kind of transforming, DDT is also able to handle data noise.
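In DDT, for a set of $N$ images containing objects from the same category, we first collect the corresponding convolutional descriptors ($X^1, \ldots, X^N$) from the last convolutional layer by feeding the images into a pre-trained CNN model. Then, the mean vector of all the descriptors is calculated by:
$$\bar{\mathbf{x}} = \frac{1}{K} \sum_{n} \sum_{i,j} \mathbf{x}^n_{(i,j)},$$
where $K = h \times w \times N$. Note that, here we assume each image has the same number of deep descriptors (i.e., $h \times w$) for presentation clarity. Our proposed method, however, can handle input images with arbitrary resolutions.
Then, after obtaining the covariance matrix:
$$\mathrm{Cov}(\mathbf{x}) = \frac{1}{K} \sum_{n} \sum_{i,j} \left(\mathbf{x}^n_{(i,j)} - \bar{\mathbf{x}}\right)\left(\mathbf{x}^n_{(i,j)} - \bar{\mathbf{x}}\right)^{\top},$$
we can get the eigenvectors $\xi_1, \ldots, \xi_d$ of $\mathrm{Cov}(\mathbf{x})$,
which correspond to the sorted eigenvalues $\lambda_1 \geq \cdots \geq \lambda_d \geq 0$.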
As aforementioned, since the first principal component has the largest variance, we take the eigenvector $\xi_1$ corresponding to the largest eigenvalue as the main projection direction. For the deep descriptor $\mathbf{x}_{(i,j)}$ at a particular position $(i,j)$ of an image, its first principal component $p^1_{(i,j)}$ is calculated as follows:
$$p^1_{(i,j)} = \xi_1^{\top} \left(\mathbf{x}_{(i,j)} - \bar{\mathbf{x}}\right),$$
where $\bar{\mathbf{x}}$ is the mean vector of all the descriptors.
According to their spatial locations, all $p^1_{(i,j)}$ from an image are formed into a 2-D matrix whose dimensions are $h \times w$. We call that matrix the indicator matrix:
$$P^1 = \begin{bmatrix} p^1_{(1,1)} & \cdots & p^1_{(1,w)} \\ \vdots & \ddots & \vdots \\ p^1_{(h,1)} & \cdots & p^1_{(h,w)} \end{bmatrix} \in \mathbb{R}^{h \times w}.$$
$P^1$ contains positive (negative) values which can reflect the positive (negative) correlations of these deep descriptors. The larger the absolute value is, the higher the positive (negative) correlation will be. Because $\xi_1$ is obtained through all images in that image set, the positive correlation could indicate the common characteristic through images. Specifically, in the object co-localization scenario, the corresponding positive correlation indeed indicates the common object inside these images.
Therefore, the value zero could be used as a natural threshold for dividing $P^1$ of one image into two parts: one part has positive values indicating the common object, and the other part has negative values representing background or objects that rarely appear. Additionally, if $P^1$ of an image has no positive value, it indicates that no common object exists in that image, which can be used for detecting noisy images.
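The transforming step described above can be sketched in a few lines of standard-library Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: descriptors are plain nested lists, the leading eigenvector is approximated by power iteration (a swapped-in technique, chosen because only the first component is needed), and all function names and toy values are hypothetical. Note also that the sign of an eigenvector is arbitrary in general; the sketch relies on toy data in which object descriptors have larger activations than background ones, so the positive side corresponds to the common object.

```python
import math


def first_principal_direction(descriptors, iters=200):
    """Return (mean, xi1) for a list of d-dimensional descriptors.

    xi1 approximates the eigenvector of the covariance matrix with the
    largest eigenvalue, found by power iteration.
    """
    k, d = len(descriptors), len(descriptors[0])
    mean = [sum(x[c] for x in descriptors) / k for c in range(d)]
    centered = [[x[c] - mean[c] for c in range(d)] for x in descriptors]
    cov = [[sum(z[a] * z[b] for z in centered) / k for b in range(d)]
           for a in range(d)]
    # Fixed start vector; it must not be orthogonal to the true direction.
    v = [1.0 + 0.1 * c for c in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(t * t for t in w))
        if norm == 0.0:  # all descriptors identical: no direction to find
            break
        v = [t / norm for t in w]
    return mean, v


def indicator_matrices(image_set):
    """image_set: N images, each an h x w grid (nested lists) of descriptors.

    Returns one h x w indicator matrix per image, computed from a single
    mean and projection direction shared by the whole image set.
    """
    pooled = [cell for image in image_set for row in image for cell in row]
    mean, xi1 = first_principal_direction(pooled)
    return [[[sum(e * (x - m) for e, x, m in zip(xi1, cell, mean))
              for cell in row] for row in image] for image in image_set]


def contains_common_object(indicator):
    """An image whose indicator matrix has no positive value is noisy."""
    return any(p > 0 for row in indicator for p in row)


# Toy image set: 2 x 2 cells, 2-d descriptors (made-up values).
# Images 0 and 1 share an "object" with large activations; image 2 does not.
images = [
    [[[0.1, 0.0], [4.0, 4.2]], [[0.0, 0.2], [3.9, 4.1]]],
    [[[4.1, 3.8], [0.2, 0.1]], [[0.0, 0.1], [0.1, 0.0]]],
    [[[0.1, 0.1], [0.0, 0.2]], [[0.2, 0.0], [0.1, 0.1]]],
]
indicators = indicator_matrices(images)
masks = [[[p > 0 for p in row] for row in ind] for ind in indicators]
flags = [contains_common_object(ind) for ind in indicators]
```

In this toy set the third image has no positive entry in its indicator matrix, so it would be flagged as a noisy image, while the positive cells of the first two images mark the shared object.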
In practice, for localizing objects, $P^1$ is resized by the nearest interpolation, such that its size is the same as that of the input image. Since the nearest interpolation is the zero-order interpolation method, it will not change the signs of the numbers in $P^1$. Thus, the resized $P^1$ can be used for localizing the common object according to the aforementioned principle with the natural threshold (i.e., the value zero). Meanwhile, we employ the algorithm described in Algo. 1 to collect the largest connected component of the positive regions in the resized $P^1$ to remove several small noisy positive parts. Then, the minimum rectangle bounding box which contains the largest connected component of positive regions is returned as our object co-localization prediction for each image. The whole procedure of the proposed DDT method is shown in Algo. 2.
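The post-processing described above (nearest-neighbor resizing, largest connected component, minimum bounding box) can be sketched as follows. Algo. 1 is not reproduced in this text, so the breadth-first flood fill and the choice of 4-connectivity are assumptions of this sketch, and the function names and toy matrix are hypothetical.

```python
from collections import deque


def resize_nearest(mat, out_h, out_w):
    """Zero-order (nearest) resize; every output value is copied from the
    input, so the signs of the entries are preserved."""
    h, w = len(mat), len(mat[0])
    return [[mat[r * h // out_h][c * w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def largest_positive_component_box(mat):
    """Return (top, left, bottom, right), inclusive, of the largest
    4-connected component of positive entries, or None if none exists."""
    h, w = len(mat), len(mat[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for r in range(h):
        for c in range(w):
            if mat[r][c] <= 0 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited positive cell.
            comp, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx] and mat[ny][nx] > 0):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(comp) > len(best):
                best = comp
    if not best:
        return None
    ys = [y for y, _ in best]
    xs = [x for _, x in best]
    return (min(ys), min(xs), max(ys), max(xs))


# Toy 3 x 4 indicator matrix (made-up values) resized to a 6 x 8 "image".
indicator = [
    [0.5, -1.0, -1.0, 0.8],
    [-1.0, -1.0, 0.9, 0.7],
    [-1.0, -1.0, 0.6, -1.0],
]
resized = resize_nearest(indicator, 6, 8)
box = largest_positive_component_box(resized)
```

The isolated positive cell in the top-left corner forms a smaller component and is discarded, which mirrors the removal of small noisy positive parts.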
In this section, we investigate the effectiveness of DDT by comparing with SCDA.
As shown in Fig. 2, the object localization regions of SCDA and DDT are highlighted in red. Because SCDA only considers the information from a single image, for example, in Fig. 2 (2), “bike”, “person” and even “guide-board” are all detected as main objects. Similar observations could be found in Fig. 2 (5), (13), (17), (18), etc.
Furthermore, we normalize the values (all positive) of the aggregation map of SCDA into the scale of $[0, 1]$, and calculate the mean value (which is taken as the object localization threshold in SCDA). The histogram of the normalized values in the aggregation map is also shown in the corresponding sub-figure in Fig. 2. The red vertical line corresponds to the threshold. We can find that there are still many values beyond the threshold, which explains why SCDA highlights more regions.
By contrast, DDT leverages the whole image set to transform these deep descriptors into $P^1$. Thus, for the bicycle class (cf. Fig. 2 (2)), DDT can accurately locate the “bicycle” object. The histogram of DDT is also drawn. However, $P^1$ has both positive and negative values, so this time we normalize $P^1$ into the $[-1, 1]$ scale. Apparently, few values are larger than the DDT threshold (i.e., the value zero). More importantly, many values are close to $-1$, which indicates a strong negative correlation. This observation validates the effectiveness of DDT in image co-localization. As another example shown in Fig. 2 (11), SCDA even wrongly locates “person” in the image belonging to the diningtable class, whereas DDT can correctly and accurately locate the “diningtable” image region. More examples are presented in Fig. 2. In that figure, some failure cases can also be found, e.g., the chair class in Fig. 2 (9).
In addition, the normalized $P^1$ can also be used as localization probability scores. Combining it with conditional random field techniques might produce more accurate object boundaries. Thus, DDT can be modified slightly in that way, and then applied to the co-segmentation problem.
As is well known, CNNs are composed of multiple processing layers to learn representations of images with multiple levels of abstraction. Different layers learn visual information at different levels (Zeiler and Fergus, 2014). Lower layers have more general representations (e.g., textures and shapes), and they can capture more detailed visual cues. By contrast, the learned representations of deeper layers contain more semantic information (i.e., high-level concepts). Thus, deeper layers are good at abstraction, but they lack visual details. Apparently, lower layers and deeper layers are complementary to each other. Based on this, several previous works, e.g., Hariharan et al (2015); Long et al (2015), aggregate the information of multiple layers to boost the final performance on their computer vision tasks.
Inspired by them, we also incorporate the lower convolutional layer in pre-trained CNNs to supply finer detailed information for object co-localization, which is named DDT$^{+}$.
Concretely, as aforementioned in Algo. 2, we can obtain $\hat{P}^1$, the resized $P^1$, for each image from the last convolutional layer by our DDT. Several visualization examples of $\hat{P}^1$ are shown in the first column of Fig. 3. In DDT$^{+}$, beyond that, the deep descriptors from the convolutional layer preceding the last one are also used for generating its corresponding resized $P^1$, which is denoted as $\hat{P}^1_{\text{prev}}$. For $\hat{P}^1_{\text{prev}}$, we directly transform it into a binary map $\hat{B}^1_{\text{prev}}$. In the middle column of Fig. 3, the red highlighted regions represent the co-localization results by $\hat{B}^1_{\text{prev}}$. Since the activations from the previous convolutional layer are less related to the high-level semantic meaning than those from the last convolutional layer, other objects not belonging to the common object category are also detected. However, the localization boundaries are much finer than those of $\hat{P}^1$. Therefore, we combine $\hat{P}^1$ and $\hat{B}^1_{\text{prev}}$ together to obtain the final co-localization prediction as follows:
As shown in the last column of Fig. 3, the co-localization visualization results of DDT$^{+}$ are better than the results of DDT, especially for the bottle class. In addition, from the quantitative perspective, DDT$^{+}$ brings on average 1.5% improvements on image co-localization (cf. Table 2, Table 3 and Table 5).
In this section, we first introduce the evaluation metric and datasets used in image co-localization. Then, we compare the empirical results of our DDT and DDT$^{+}$ with other state-of-the-art methods on these datasets. The computational cost is reported too. Moreover, the results in Sec. 4.4 and Sec. 4.5 illustrate the generalization ability and robustness of the proposed method. Furthermore, we will discuss the ability of DDT to utilize web data as valid augmentation for improving the accuracy of traditional image recognition and object detection tasks. Finally, the further study in Sec. 4.7 reveals DDT might deal with part-based image co-localization, which is a novel and challenging problem.
In our experiments, the images keep the original image resolutions. For the pre-trained deep model, the publicly available VGG-19 model (Simonyan and Zisserman, 2015) is employed to perform DDT by extracting deep convolution descriptors from the last convolution layer (i.e., the relu5_4 layer) and to perform DDT$^{+}$ by using both the last convolution layer (i.e., the relu5_4 layer) and its previous layer (i.e., the relu5_3 layer). We use the open-source library MatConvNet (Vedaldi and Lenc, 2015) for conducting experiments. All the experiments are run on a computer with Intel Xeon E5-2660 v3, 500G main memory, and a K80 GPU.
Following previous image co-localization works (Li et al, 2016; Cho et al, 2015; Tang et al, 2014), we take the correct localization (CorLoc) metric for evaluating the proposed method. CorLoc is defined as the percentage of images correctly localized according to the PASCAL-criterion (Everingham et al, 2015):
$$\frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})} > 0.5,$$
where $B_p$ is the predicted bounding box and $B_{gt}$ is the ground-truth bounding box. All CorLoc results are reported in percentages.
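As an illustration of the metric, the following minimal sketch computes the intersection-over-union of two boxes and the resulting CorLoc percentage. It makes several assumptions that are not specified here: boxes are `(x_min, y_min, x_max, y_max)` tuples in continuous coordinates, an image counts as correctly localized if its prediction matches any of its ground-truth boxes, and an image with no prediction counts as incorrect. The function names and toy boxes are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def corloc(predictions, ground_truths, threshold=0.5):
    """Percentage of images whose predicted box overlaps a ground-truth box
    with IoU above the threshold.

    predictions: one box, or None, per image.
    ground_truths: one list of boxes per image.
    """
    correct = 0
    for pred, gts in zip(predictions, ground_truths):
        if pred is not None and any(iou(pred, gt) > threshold for gt in gts):
            correct += 1
    return 100.0 * correct / len(predictions)


# Toy example (made-up boxes): one hit, one miss, one image without a box.
preds = [(0, 0, 10, 10), (0, 0, 10, 10), None]
gts = [[(0, 0, 10, 8)], [(5, 5, 15, 15)], [(0, 0, 4, 4)]]
score = corloc(preds, gts)
```

In the toy example the first prediction has an overlap ratio of 0.8 and counts as correct, the second has a ratio of about 0.14 and does not, so one image out of three is correctly localized.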
Our experiments are conducted on four challenging datasets commonly used in image co-localization, i.e., the Object Discovery dataset (Rubinstein et al, 2013), the PASCAL VOC 2007/VOC 2012 dataset (Everingham et al, 2015) and the ImageNet Subsets dataset (Li et al, 2016).
For experiments on the PASCAL VOC datasets, we follow Cho et al (2015); Li et al (2016); Joulin et al (2014) to use all images in the trainval set (excluding images that only contain object instances annotated as difficult or truncated). For Object Discovery, we use the 100-image subset following Rubinstein et al (2013); Cho et al (2015) in order to make an appropriate comparison with other methods.
In addition, Object Discovery has 18%, 11% and 7% noisy images in the Airplane, Car and Horse categories, respectively. These noisy images contain no object belonging to their category, as the third image shown in Fig. 1. Particularly, in Sec. 4.5, we quantitatively measure the ability of our proposed DDT to identify these noisy images.
To further investigate the generalization ability of DDT, ImageNet Subsets (Li et al, 2016) are used, which contain six subsets/categories. These subsets are held-out categories from the 1000-label ILSVRC classification (Russakovsky et al, 2015). That is to say, these subsets are “unseen” by pre-trained CNN models. Experimental results in Sec. 4.4 show that our proposed methods are insensitive to the object category.
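Table 1: CorLoc (%) on the Object Discovery dataset.
| Methods | Airplane | Car | Horse | Mean |
|---|---|---|---|---|
| Joulin et al (2010) | 32.93 | 66.29 | 54.84 | 51.35 |
| Joulin et al (2012) | 57.32 | 64.04 | 52.69 | 58.02 |
| Rubinstein et al (2013) | 74.39 | 87.64 | 63.44 | 75.16 |
| Tang et al (2014) | 71.95 | 93.26 | 64.52 | 76.58 |
| Cho et al (2015) | 82.93 | 94.38 | 75.27 | 84.19 |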
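Table 2: CorLoc (%) on VOC 2007.
| Methods | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Joulin et al (2014) | 32.8 | 17.3 | 20.9 | 18.2 | 4.5 | 26.9 | 32.7 | 41.0 | 5.8 | 29.1 | 34.5 | 31.6 | 26.1 | 40.4 | 17.9 | 11.8 | 25.0 | 27.5 | 35.6 | 12.1 | 24.6 |
| Cho et al (2015) | 50.3 | 42.8 | 30.0 | 18.5 | 4.0 | 62.3 | 64.5 | 42.5 | 8.6 | 49.0 | 12.2 | 44.0 | 64.1 | 57.2 | 15.3 | 9.4 | 30.9 | 34.0 | 61.6 | 31.5 | 36.6 |
| Li et al (2016) | 73.1 | 45.0 | 43.4 | 27.7 | 6.8 | 53.3 | 58.3 | 45.0 | 6.2 | 48.0 | 14.3 | 47.3 | 69.4 | 66.8 | 24.3 | 12.8 | 51.5 | 25.5 | 65.2 | 16.8 | 40.0 |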
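Table 3: CorLoc (%) on VOC 2012.
| Methods | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Cho et al (2015) | 57.0 | 41.2 | 36.0 | 26.9 | 5.0 | 81.1 | 54.6 | 50.9 | 18.2 | 54.0 | 31.2 | 44.9 | 61.8 | 48.0 | 13.0 | 11.7 | 51.4 | 45.3 | 64.6 | 39.2 | 41.8 |
| Li et al (2016) | 65.7 | 57.8 | 47.9 | 28.9 | 6.0 | 74.9 | 48.4 | 48.4 | 14.6 | 54.4 | 23.9 | 50.2 | 69.9 | 68.4 | 24.0 | 14.2 | 52.7 | 30.9 | 72.4 | 21.6 | 43.8 |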
In this section, we compare the image co-localization performance of our methods with state-of-the-art methods including both image co-localization and weakly supervised object localization.
We first compare the results of DDT to state-of-the-art methods (including SCDA) on Object Discovery in Table 1. For SCDA, we also use VGG-19 to extract the convolution descriptors and perform experiments. As shown in that table, DDT outperforms other methods by about 4% in the mean CorLoc metric. Especially for the airplane class, it is about 10% higher than that of Cho et al (2015). In addition, note that the images of each category in this dataset contain only one object; thus, SCDA can perform well. However, our DDT$^{+}$ gets a slightly lower CorLoc score than DDT, which is an exception among all the image co-localization datasets. In fact, for car and horse of the Object Discovery dataset, DDT$^{+}$ only returns one more wrong prediction than DDT for each category.
PASCAL VOC 2007 and 2012 contain diverse objects per image, which makes them more challenging than Object Discovery. The comparisons of the CorLoc metric on these two datasets are reported in Table 2 and Table 3, respectively. It is clear that on average our DDT and DDT$^{+}$ outperform the previous state-of-the-art methods (based on deep learning) by a large margin on both datasets. Moreover, our methods work well on localizing small common objects, e.g., “bottle” and “chair”. In addition, because most images of these datasets have multiple objects, which do not obey SCDA’s assumption, SCDA performs poorly in this complicated environment. For fair comparisons, we also use VGG-19 to extract the fully connected representations of the object proposals in (Li et al, 2016), and then perform the remaining processes of their method (the source code is provided by the authors). As aforementioned, due to the high dependence on the quality of object proposals, their mean CorLoc metric with VGG-19 is 41.9% and 45.6% on VOC 2007 and 2012, respectively. The improvements are limited, and the performance is still significantly worse than ours.
To further verify the effectiveness of our methods, we also compare DDT and DDT$^{+}$ with some state-of-the-art methods for weakly supervised object localization. Table 4 illustrates these empirical results on VOC 2007. Particularly, DDT achieves 46.9% on average, which is higher than most WSOL methods in the literature. DDT$^{+}$ achieves 48.5% on average, and it even performs better than the state-of-the-art in WSOL (i.e., Wang et al (2014)), which is also a deep learning based approach. Meanwhile, note that our methods do not use any negative data for co-localization. Moreover, our methods can handle noisy data (cf. Sec. 4.5), whereas existing WSOL methods are not designed to deal with noise.
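|Methods||aero||bike||bird||boat||bottle||bus||car||cat||chair||cow||table||dog||horse||mbike||person||plant||sheep||sofa||train||tv||Mean|
|Siva and Xiang (2011)||42.4||46.5||18.2||8.8||2.9||40.9||73.2||44.8||5.4||30.5||19.0||34.0||48.8||65.3||8.2||9.4||16.7||32.3||54.8||5.5||30.4|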
|Shi et al (2013)||67.3||54.4||34.3||17.8||1.3||46.6||60.7||68.9||2.5||32.4||16.2||58.9||51.5||64.6||18.2||3.1||20.9||34.7||63.4||5.9||36.2|
|Cinbis et al (2015)||56.6||58.3||28.4||20.7||6.8||54.9||69.1||20.8||9.2||50.5||10.2||29.0||58.0||64.9||36.7||18.7||56.5||13.2||54.9||59.4||38.8|
|Wang et al (2015)||37.7||58.8||39.0||4.7||4.0||48.4||70.0||63.7||9.0||54.2||33.3||37.4||61.6||57.6||30.1||31.7||32.4||52.8||49.0||27.8||40.2|
|Bilen et al (2015)||66.4||59.3||42.7||20.4||21.3||63.4||74.3||59.6||21.1||58.2||14.0||38.5||49.5||60.0||19.8||39.2||41.7||30.1||50.2||44.1||43.7|
|Ren et al (2016)||79.2||56.9||46.0||12.2||15.7||58.4||71.4||48.6||7.2||69.9||16.7||47.4||44.2||75.5||41.2||39.6||47.4||32.2||49.8||18.6||43.9|
|Wang et al (2014)||80.1||63.9||51.5||14.9||21.0||55.7||74.2||43.5||26.2||53.4||16.3||56.7||58.3||69.5||14.1||38.3||58.8||47.2||49.1||60.9||47.7|
Here, we take the total 171 images in the aeroplane category of VOC 2007 as examples to report the computational costs. The average image resolution of the 171 images is
. The computational time of DDT has two main components: feature extraction and deep descriptor transforming (cf. Algo. 2). Because we only need the first principal component, the transforming time on all the 120,941 descriptors of 512-d is only 5.7 seconds. The average descriptor extraction time is 0.18 second/image on GPU and 0.86 second/image on CPU. DDT+ has the same deep descriptor extraction time: although it needs descriptors from two convolutional layers, it requires only a single feed-forward pass. The deep descriptor transforming time of DDT+ is only 11.9 seconds for these 171 images. These numbers indicate that the proposed methods are efficient enough for real-world applications.
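The reason the transforming step is cheap is that only the leading principal component is required, so a full eigendecomposition can be avoided. The sketch below illustrates this with plain power iteration on the centered descriptors; it is not the authors' implementation, and the function name, the iteration count, and the uniform starting vector are our assumptions.

```python
import math


def first_principal_projection(descriptors, iterations=50):
    """Project descriptors onto their first principal component.

    descriptors: list of equal-length lists of floats (one per spatial cell).
    Returns (projections, direction). The sign of an eigenvector is arbitrary,
    and this sketch does not try to resolve it.
    """
    n, d = len(descriptors), len(descriptors[0])
    mean = [sum(x[k] for x in descriptors) / n for k in range(d)]
    centered = [[x[k] - mean[k] for k in range(d)] for x in descriptors]

    # Uniform start; power iteration fails if it is orthogonal to the target.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        # Covariance-times-vector without forming the d x d covariance matrix.
        scores = [sum(ci * vi for ci, vi in zip(c, v)) for c in centered]
        w = [sum(s * c[k] for s, c in zip(scores, centered)) / n for k in range(d)]
        norm = math.sqrt(sum(wk * wk for wk in w))
        if norm == 0.0:
            break
        v = [wk / norm for wk in w]

    projections = [sum(ci * vi for ci, vi in zip(c, v)) for c in centered]
    return projections, v
```

Each iteration costs time linear in the number of descriptors times their dimension, which is consistent with the transform taking seconds rather than minutes for roughly 10^5 descriptors of 512-d.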
To verify the generalization ability of the proposed methods, we also conduct experiments on six image subsets that are disjoint from ImageNet. Note that the six categories of these images (i.e., “chipmunk”, “rhino”, “stoat”, “racoon”, “rake” and “wheelchair”) are unseen by the pre-trained models. The six subsets were provided in (Li et al, 2016). Table 5 presents the CorLoc metric on these subsets. Our DDT (69.1% on average) and DDT+ (70.4% on average) still significantly outperform other methods on all categories, especially on some difficult object categories, e.g., rake and wheelchair. In addition, the mean CorLoc metric of (Li et al, 2016) based on VGG-19 is only 51.6% on this dataset.
|Methods||chipmunk||rhino||stoat||racoon||rake||wheelchair||Mean|
|Cho et al (2015)||26.6||81.8||44.2||30.1||8.3||35.3||37.7|
|Li et al (2016)||44.9||81.8||67.3||41.8||14.5||39.3||48.3|
Furthermore, in Fig. 4, several successful predictions by DDT and also some failure cases on this dataset are provided. In particular, for “rake” (“wheelchair”), even though a large portion of images in these two categories contain both people and rakes (wheelchairs), our DDT could still accurately locate the common object in all the images, i.e., rakes (wheelchairs), and ignore people. This observation validates the effectiveness (especially for the high CorLoc metric on rake and wheelchair) of our method from the qualitative perspective.
In this section, we quantitatively present the ability of the proposed DDT method to identify noisy images. As aforementioned, in Object Discovery, there are 18%, 11% and 7% noisy images in the corresponding categories. In our DDT, the number of positive values in $P^1$ can be interpreted as a detection score: the lower the number, the more likely the image is noisy. In particular, an image with no positive value at all in $P^1$ is regarded as definitely noisy. For each category in that dataset, the ROC curve is shown in Fig. 5, which measures how well the methods detect noisy images. In the literature, only the method in (Tang et al, 2014) (i.e., the Image-Box model in that paper) could solve image co-localization with noisy data. These figures show that, in image co-localization, our DDT detects noisy images significantly better than Image-Box (whose noise detection results are obtained by re-running the publicly available code released by the authors). Meanwhile, our mean CorLoc metric without noise is about 12% higher than theirs on Object Discovery, cf. Table 1.
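To make the evaluation protocol concrete, the area under such a ROC curve can be computed directly from the detection scores. The sketch below uses the standard rank-based (Mann-Whitney) formulation of AUC and assumes that a lower score means a more likely noisy image, as described above; the function name and the input layout are illustrative only.

```python
def noisy_image_auc(scores, is_noisy):
    """Area under the ROC curve for noisy-image detection.

    scores:   detection score per image (number of positive values; lower = noisier)
    is_noisy: ground-truth flag per image
    Equals the probability that a random noisy image scores below a random
    clean one, counting ties as one half.
    """
    noisy = [s for s, flag in zip(scores, is_noisy) if flag]
    clean = [s for s, flag in zip(scores, is_noisy) if not flag]
    wins = 0.0
    for s_noisy in noisy:
        for s_clean in clean:
            if s_noisy < s_clean:
                wins += 1.0
            elif s_noisy == s_clean:
                wins += 0.5
    return wins / (len(noisy) * len(clean))
```

A value of 1.0 means every noisy image receives a lower score than every clean image, while 0.5 corresponds to chance.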
As validated by previous experiments, DDT can accurately detect noisy images and meanwhile supply object bounding boxes of images (except for noisy images). Therefore, we can use DDT to process web images. In this section, we report the results of both image classification and object detection when using DDT as a tool for generating valid external data sources from free but noisy web data. This DDT based strategy is denoted as DDT augmentation.
For web based image classification, we compare DDT augmentation with the current state-of-the-art webly-supervised classification method proposed by Zhuang et al (2017). As discussed in the related work, Zhuang et al (2017) proposed a group attention framework for handling web data. Their method employs two levels of attention: the first level is a group attention for filtering out noise, and the second level is a single-image attention for capturing the discriminative regions of each image.
In the experiments, we test the methods on the WebCars and WebImageNet datasets, which were also proposed by Zhuang et al (2017). WebCars contains 213,072 car images from 431 car model categories collected from the web. For WebImageNet, Zhuang et al (2017) used 100 sub-categories of the original ImageNet as the categories, and collected 61,639 web images in total for these 100 sub-categories.
In our DDT augmentation, as in Sec. 4.5, we first use DDT to obtain the number of positive values in $P^1$ as the detection score for each image in every category. We then divide the detection score by the total number of values in $P^1$ to obtain the noise rate, which lies in the range $[0, 1]$. The closer the noise rate is to zero, the more likely the image is noisy. In the following, we conduct experiments with two thresholds (i.e., 0 or 0.1) on the noise rate. If the noise rate of an image is smaller than or equal to the threshold, that image is regarded as noisy and is removed from the original webly dataset. After applying this processing to every category, we obtain a relatively clean training dataset. Finally, we train deep CNN networks on that clean dataset. The other experimental settings for these two webly datasets follow Zhuang et al (2017).
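The filtering rule above amounts to a few lines of code. The following is a minimal sketch, assuming each image comes with the flattened list of its first-principal-component projection values; the names `noise_rate` and `filter_noisy`, and the dictionary layout, are hypothetical.

```python
def noise_rate(projection_values):
    """Fraction of positive projection values for one image (in [0, 1])."""
    positives = sum(1 for p in projection_values if p > 0)
    return positives / len(projection_values)


def filter_noisy(images, threshold=0.1):
    """Return the names of images kept after noise removal.

    images:    dict mapping image name -> list of projection values
    threshold: an image is treated as noisy when its noise rate is
               smaller than or equal to this value (0 or 0.1 in the text)
    """
    return [name for name, values in images.items()
            if noise_rate(values) > threshold]
```

With a threshold of 0, only images without any positive value are discarded; raising it to 0.1 additionally discards images in which just a small region responds positively.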
Two kinds of deep CNN networks are used as test beds for evaluating the classification performance on the two webly datasets:
“Attention” represents the CNN model with the attention mechanism on the single image level. Because the method proposed in Zhuang et al (2017) is equipped with the single image attention strategy, we also compare our method based on this baseline model for fair comparisons.
The quantitative comparisons of our DDT augmentation with Zhuang et al (2017) are shown in Table 6 and Table 7. In these tables, for example, “DDT GAP” denotes that we first apply DDT augmentation and then use the GAP model for classification. As shown in these two tables, for both base models (i.e., “GAP” and “Attention”), DDT augmentation with the 0.1 threshold performs better than DDT augmentation with the 0 threshold. This is reasonable: in many cases, noisy images still contain several regions of related concepts, and these (small) regions might be detected as part of the common object. With the threshold set to 0.1, such noisy images are also removed, which yields better classification accuracy. Several noisy images of WebCars detected by DDT are shown in Fig. 6.
Compared with the state-of-the-art (i.e., Zhuang et al (2017)), our DDT augmentation with the 0.1 threshold clearly outperforms both it and the GAP baseline, which validates the generalization ability and the effectiveness of the proposed DDT in real-life computer vision tasks, i.e., DDT augmentation in webly-supervised classification. Meanwhile, our DDT method is easy to implement and has low computational cost, which ensures its scalability and usability in real-world scenarios.
|Zhuang et al (2017)||Attention||76.58|
|Ours (thr=0)||DDT GAP||69.79|
|Ours (thr=0)||DDT Attention||76.18|
|Ours (thr=0.1)||DDT GAP||71.66|
|Ours (thr=0.1)||DDT Attention||78.92|
|Zhuang et al (2017)||Attention+Neg||71.24|
|Ours (thr=0)||DDT GAP||62.31|
|Ours (thr=0)||DDT Attention||69.50|
|Ours (thr=0.1)||DDT GAP||65.59|
|Ours (thr=0.1)||DDT Attention||73.06|
In the experiments on WebImageNet of Zhuang et al (2017), beyond attention, they also incorporated 5,000 negative class web images for reducing noise. However, we do not require any negative images.
For web based object detection, we first collect an external dataset, named WebVOC, from the Internet with the Google image search engine, using the categories of the PASCAL VOC dataset (Everingham et al, 2015). In total, we collect 12,776 noisy web images, which is a similar scale to the original PASCAL VOC dataset. As the results in webly-supervised classification suggest, DDT with the 0.1 threshold is a suitable option for noisy web images. Thus, we first use DDT with the 0.1 threshold to remove the noisy images from the 20 categories in WebVOC, after which 10,081 images remain as valid images. Furthermore, DDT is used to automatically generate the corresponding object bounding box for each image. The bounding boxes generated by DDT are regarded as the object “ground truth” bounding boxes for our WebVOC detection dataset. Several random samples of our WebVOC dataset with the corresponding bounding boxes generated by DDT are shown in Fig. 7.
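As an illustration of how a bounding box can be derived from a map of projection values, the sketch below keeps the largest 4-connected region of positive values and returns its tight box in map coordinates. Using the largest connected component, 4-connectivity, and map (rather than image) coordinates are assumptions of this sketch, and the function name is hypothetical.

```python
from collections import deque


def largest_positive_component_box(score_map):
    """Tight box (row_min, col_min, row_max, col_max) around the largest
    4-connected region of positive values, or None if no value is positive.

    score_map: 2-D list of floats (rows of equal length).
    """
    rows, cols = len(score_map), len(score_map[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if score_map[r][c] <= 0 or seen[r][c]:
                continue
            # Breadth-first search collects one connected positive region.
            component = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and score_map[ny][nx] > 0):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) > len(best):
                best = component
    if not best:
        return None
    ys = [y for y, _ in best]
    xs = [x for _, x in best]
    return (min(ys), min(xs), max(ys), max(xs))
```

In practice the box would still have to be rescaled from the resolution of the activation map to that of the input image before being used as a detection label.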
Note that the COCO trainval set contains 120k human-labeled images covering 80 object categories, whereas our DDT augmentation depends on only 10k images of 20 object categories, and these images are labeled automatically by the proposed DDT method.
After that, a state-of-the-art object detection method, i.e., Faster RCNN (Ren et al, 2017), is trained as the base model on different training data to validate the effectiveness of DDT augmentation on the object detection task. For the test sets of detection, we employ the VOC 2007 and VOC 2012 test set and report the results in Table 8 and Table 9, respectively.
For testing on VOC 2007, following Ren et al (2017), Faster RCNN is trained on “07+12” and “COCO+07+12”. “07+12” denotes that the training data is the union of VOC 2007 trainval and VOC 2012 trainval. “COCO+07+12” denotes that, in addition to VOC 2007 and VOC 2012, the COCO trainval set is also used for training. “DDT+07+12” is our proposal, which uses DDT to process the web images as described above and then combines the processed web data with “07+12” as the final training data.
As shown in Table 8, our proposal outperforms “07+12” by 2.8% on VOC 2007, which is a large margin on the object detection task. In addition, the detection mAP of DDT augmentation is 4% better than “07++12” on the VOC 2012 test set, cf. Table 9. Note that our DDT augmentation depends on only 10k images of 20 object categories and, in particular, these images are labeled automatically by the proposed DDT method.
On the other hand, our mAP is comparable with the mAP obtained by training on “COCO+07+12” in Table 8 (or “COCO+07++12” in Table 9). Here, we would like to point out that the COCO trainval set contains 120k human-labeled images covering 80 object categories, which requires far more human labor, money and time than our DDT augmentation. Therefore, these detection results validate the effectiveness of DDT augmentation on the object detection task.
In the above, DDT only utilizes the information of the first principal component, i.e., $P^1$. How about the others, e.g., the second principal component $P^2$? In Fig. 8, we show four images from each of three categories (i.e., dogs, airplanes and trains) in PASCAL VOC with the visualization of their $P^1$ and $P^2$. These figures show that $P^1$ can locate the whole common object. Interestingly, however, $P^2$ separates a part region from the main object region, e.g., the head region from the torso region for dogs, the wheel and engine regions from the fuselage region for airplanes, and the wheel region from the train body region for trains. Meanwhile, these two meaningful regions can be easily distinguished from the background. These observations inspire us to use DDT for the more challenging part-based image co-localization task in the future, which has not been explored in the literature.
Pre-trained models are widely used in diverse applications in computer vision. However, the treasures beneath pre-trained models are not exploited sufficiently. In this paper, we proposed Deep Descriptor Transforming (DDT) for image co-localization. DDT indeed revealed another reusability of deep pre-trained networks, i.e., convolutional activations/descriptors can play a role as a common object detector. It offered further understanding and insights about CNNs. Besides, our proposed DDT method is easy to implement, and it achieved great image co-localization performance. Moreover, the generalization ability and robustness of DDT ensure its effectiveness and powerful reusability in real-world applications. Thus, DDT can be used to handle free but noisy web images and further generate valid data sources for improving both recognition and detection accuracy.
Additionally, DDT has potential for video-based unsupervised object discovery. Furthermore, the interesting observations in Sec. 4.7 make the more challenging but intriguing part-based image co-localization problem a direction for future work.
The authors want to thank Yao Li and Bohan Zhuang for conducting part of experiments, and thank Chen-Wei Xie and Hong-Yu Zhou for helpful discussions.
Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pp 1081–1089
Jiao J, Yang Q, He S, Gu S, Zhang L, Lau RWH (2017) Joint image denoising and disparity estimation via stereo structure PCA and noise-tolerant cost. International Journal of Computer Vision in press:1–19
Papandreou G, Chen LC, Murphy K, Yuille AL (2015) Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. In: Proceedings of IEEE International Conference on Computer Vision, pp 1742–1750
Proceedings of International Joint Conference on Artificial Intelligence, in press