Learning from Web Data: the Benefit of Unsupervised Object Localization

12/21/2018 ∙ by Xiaoxiao Sun, et al. ∙ Cardiff University ∙ NetEase, Inc

Annotating a large number of training images is very time-consuming. Against this background, this paper focuses on learning from easy-to-acquire web data and utilizes the learned model for fine-grained image classification on labeled datasets. Currently, the performance gain from training with web data is incremental, as in the common saying "better than nothing, but not by much". Conventionally, the community looks to correcting noisy web labels to select informative samples. In this work, we first systematically study the built-in gap between web and standard datasets, i.e., the different data distributions of the two kinds of data. Then, in addition to using web labels, we present an unsupervised object localization method, which provides critical insights into the object density and scale in web images. Specifically, we design two constraints on web data to substantially reduce the difference between the data distributions of the web and standard datasets. First, we present a method to control the scale, localization and number of objects in the detected region. Second, we propose to select the regions containing objects that are consistent with the web tag. Based on the two constraints, we are able to process web images to reduce the gap, and the processed web data is used to better assist the standard dataset in training CNNs. Experiments on several fine-grained image classification datasets confirm that our method performs favorably against state-of-the-art methods.


I Introduction

Deep Convolutional Neural Networks (CNNs) have achieved great success in solving recognition challenges including image classification [1], scene recognition [2, 3] and fine-grained image classification [4, 5, 6]. However, the commonly used CNN models such as AlexNet [1], VGGNet [7], GoogleNet-Inception [8] and ResNet [9] have a huge number of parameters, and their performance heavily depends on the availability of a large number of labeled training examples. In practice, getting reliably annotated images at a large scale is usually expensive and time-consuming, which prevents CNNs from being sufficiently trained for new image recognition tasks.

This paper considers the task of leveraging web data to improve recognition accuracy. The advantage of web data is primarily in its large-scale availability, such as millions of images with user-supplied tags from social websites or image search engines [10, 4, 11, 12].

Fig. 1: Sample images of the standard and web datasets. The image pairs at each row of (a), (b) or (c) have the same label. The objects in Web Dog (WD-) and Web Food (WF-) images are selected by the boxes. Red boxes on images mean that the contained objects are consistent with the web labels, and yellow boxes denote that the objects are not consistent with the labels. For (c), the images are from the MIT Indoor 67 dataset (I-) and the Web Indoor (WI-) scene images.

Intuitively, more training data leads to higher classification ability of CNN models. However, a saying applies when employing web data: “better than nothing, but not by much”. Existing works suggest that the labeling noise contained in webly-labeled images usually negatively influences the performance of CNN models. However, we believe the crux of the problem is that the content of web images is more complex than that of images from standard datasets. Fig. 1 shows images from the Food-101 [13], Stanford Dogs [14] and MIT Indoor67 [15] datasets and from web data. The objects of Web Dog (WD-) and Web Food (WF-) images are marked by bounding boxes. Objects in red boxes are consistent with the web tag, and objects in yellow boxes are not. The objects of Dog (D1-D3) and Food (F1-F3) images from standard datasets are usually located in the center of the images and have a regular scale (size), but web images do not have these characteristics, such as WD3 and WF3. Meanwhile, web dog and food images often contain more than one object, some of which are from different categories, such as WD2 (two breeds of dogs) and WF2 (three different kinds of food). On the other hand, analyzing Indoor (I1-I3) scenes requires knowledge about both scenes and various objects, but the objects in web indoor scene images are often more cluttered than in standard images, e.g., WI2 and WI3. Based on these observations, we believe that the aforementioned factors, i.e., the “built-in gap” between web and standard datasets, limit the effectiveness of utilizing web data.

As discussed in [16], performance on the test set of a target task often decreases when the training data is augmented with data from other datasets. Torralba et al. [16, 17] argue that the reason for this phenomenon is that the input space of each image dataset is different, i.e., the datasets are biased. Regarding web data, we predict the label probabilities of sampled web images and test images from the Food-101 dataset using a benchmark model (fine-tuned on the Food-101 dataset with a recognition accuracy of 84.31%). Fig. 2 shows the predicted label probabilities for two images from the web (red) and standard (green) datasets, which are unexpectedly and significantly different. For example, the baby-back-ribs image from Food-101 has a salient rib close to the center of the image. In contrast, the web baby-back-ribs image contains four dishes, but only the one in the top left contains ribs. Hence, the localization and scale of the labeled content, as well as the number of relevant objects, account for the difference between the data from the two sources. To further illustrate that bias between web and standard datasets exists generally rather than occasionally, we conduct rigorous experiments to quantitatively analyze the difference between standard and web datasets based on 5 criteria: relative data bias, cross-dataset generalization [16, 17], the scale of related content, the density of domain information, and label quality (see Section IV-C).

Fig. 2: Comparison of probabilities of example images from web and standard data belonging to different classes. Here, we show the web data (red border) and standard data (green border) from the two classes with their class names given underneath, as well as the label probabilities (P) for 101 classes predicted by the benchmark recognition model.

In this paper, we design abundant experiments to evaluate the web and standard datasets, and the experimental results (in Section IV-C) confirm our conjecture that the complexity of web data causes the bias between the web and standard datasets. Furthermore, we address the problem of learning from web images by reducing dataset bias in two aspects: (1) normality of the content/information distribution on images (referred to as Form bias, i.e., whether the location, size and number of classification objects/content in web data are similar to those in standard data); (2) label consistency (referred to as Label bias, i.e., whether the label accurately describes the web image). Specifically, we present an unsupervised object localization method to generate task-related content regions on web images. Since standard datasets lack bounding box (bbox) annotations, we employ the “labeled-content-centric” prior to generate the bboxes, and use the bboxes to train a region proposal network (RPN). Then, we generate object proposals of web images and add two constraints on the regions: (1) a form constraint to control the scale and density of objects based on the bbox of the region proposal, and (2) a label constraint to control the consistency of the subject and the label based on the score from the RPN and the probability from the benchmark recognition model. Finally, the eligible regions are collected as the de-biased web dataset. The evaluation results on the Food-101, Stanford Dogs (Dog-120) [14] and MIT Indoor 67 [15] datasets, as well as on images retrieved from the web, show that the designed framework can substantially reduce bias. Meanwhile, extensive experimental results show that reducing bias effectively improves the efficiency of using web data.

The contributions of this paper are three-fold.

  1. We systematically study the built-in gap between web and standard datasets, and perform quantitative analysis on dataset bias from various aspects. As far as we know, there is no existing work that processes web data from this perspective.

  2. We propose an unsupervised object localization method and design two constraints based on the built-in gap between web and standard datasets, which adjust the scale and density of objects in web images to approximate standard data.

  3. We perform experimental evaluation on three publicly available datasets, i.e., Stanford Dogs, Food-101, MIT67. The experiments demonstrate that the proposed framework is able to improve the performance of learning from web images.

II Related Work

II-A Deep Learning from Web Data

Leveraging web data [18, 19, 20, 21, 22] has been an alternative method to satisfy the data requirements of deep models. Previous work explores utilizing web data by reducing the label noise level or designing a noise-aware model [23, 10, 4, 12]. Different from these works, we propose to reduce dataset bias when employing web data.

The ever-growing volume of web images has become an important data source for computer vision tasks. However, label noise is ubiquitous in real-world data, which influences the performance of the model. Efforts have been made to improve results when using web data [24, 25, 26, 27, 28]. Recent works [29] demonstrate that web data can be used to improve the performance of deep learning. To readily scale to ImageNet-sized problems, Izadinia et al. [30] present a method to directly learn from image tags in the wild and show that web data is useful for training CNN models. Chen and Gupta [31] propose a two-stage approach to training CNNs by exploiting noisy web data and the transferability of CNNs. This work assumes that label noise is conditionally independent of the image, and is directly affected by the noisy annotations produced by humans. Xiao et al. [10] use a probabilistic framework to handle noisy labels and train a classifier in an end-to-end learning procedure. Furthermore, Krause et al. [4] demonstrate that using massive web data for training CNN models can improve the performance of deep models on fine-grained datasets. Joulin et al. [11] argue that ConvNets can learn from scratch in a weakly-supervised way, by utilizing 100M Flickr images annotated with noisy captions.

However, a phenomenon arises when employing web data for training CNN models, namely “better than nothing, but not by much”. In previous work, there is no significant improvement in the final classification accuracy, except for cases that use many times more images than the standard datasets [4]. Different from the above work, a new perspective is considered here, and systematic experiments show that the dataset bias between web and standard datasets of the target task is the main reason for the limited benefit of web data. Therefore, we explore the potential of web data for training CNN models by reducing bias.

II-B Built-in Gap between Datasets

With the recent development of deep learning, transfer learning has been successfully applied to object recognition problems. Based on the observed transferability of deep neural networks, the works [32, 1] propose to address this problem by first initializing CNN parameters with a model pre-trained on a larger yet related dataset, and then fine-tuning it on the smaller dataset of the target task. However, the data distribution of the source task often fails to match the target task exactly (and vice versa). We define this mismatch as dataset bias, which has been researched in [16, 33]. These works prove that bias exists in visual datasets, and [34] further shows that the contents of images are often biased with human-centric labels. Moreover, [35] transfers human biases into machine systems to help with object recognition.

We believe the bottleneck of employing web data for training CNN models is the bias between web and standard datasets, so we present a method to reduce the bias for improving the performance of using web data. In [16], biases are summarized as selection bias, capture bias, category or label bias and negative set bias. [34] reveals the bias of human-centric annotations with images as keyword reporting bias. [36] considers the scale bias of object and scene datasets. Based on existing summarized biases, we systematically study dataset bias between webly-labeled data and standard data. Meanwhile, we present an unsupervised object localization method to reduce the bias, which can improve the effectiveness of using web data.

II-C Object Localization Methods

II-C1 Traditional Object Detection Methods

The sliding-window approach, in which a classifier is applied on a dense image grid, has been used for a long time. Among such methods [37, 38], LeCun et al. applied convolutional neural networks to handwritten digit recognition and improved the final results. Viola and Jones [39] used improved object detectors to detect faces, leading to widespread adoption of such models. The introduction of Deformable Part Models (DPMs) [40] helped extend dense detectors to more general object categories and achieved the best results on the PASCAL dataset [41]. However, with the development of deep learning [1], new detectors based on deep models quickly became the main approach for object detection.

II-C2 Deep Object Detection Methods

The commonly used framework in modern object detection is a two-stage approach. Building on selective search [42], the first stage generates a set of candidate object proposals that contain all objects while filtering out the majority of negative locations. Then, the second stage classifies the proposals into foreground classes and background. On the other hand, some recent research focuses on developing one-stage methods, such as OverFeat [43], one of the first end-to-end object detectors based on deep networks, as well as SSD [44] and YOLO [45]. These methods improve the processing speed, but their accuracy falls short of two-stage methods such as R-CNN [46]. Different from these works, we emphasize that our simple detector achieves top results not through innovations in network design but through our novel strategies.

II-C3 Unsupervised Object Detection Methods

Unsupervised object discovery (also called image co-localization) [47, 48] is a fundamental problem in computer vision, where the common object emerging in a positive set of example images must be discovered without any negative examples or further supervision. Recently, several co-localization methods based on pre-trained deep convolutional models have also appeared, e.g., Li et al. [49]. These methods treat pre-trained models simply as feature extractors for the fully connected representations, and do not sufficiently mine the information beneath the convolutional layers. Zhang et al. [18] propose a two-level attention framework for webly-supervised classification, which achieves a new state of the art. Our method handles web data based on an RPN, which can (1) recognize noisy images and (2) supply bounding boxes of objects.

Fig. 3: Overview of the proposed method. (a) Training of RPNs: the region proposal networks are trained based on the “labeled-content-centric” prior. The unsupervisedly generated red rectangles are used to train the RPN for region detection, which generates region proposals; the smaller blue rectangles are used to train another RPN for object detection. This is the first step of the proposed method. (b) Prediction: the second step processes the image and calculates the scores as well as the label probabilities of the proposal regions. First, web data is input into the region detection RPN (the RPN with size ratio s_r) to generate region proposals and their objectness scores. Then, each region is re-input into the benchmark classification model (fine-tuned on the standard dataset) to get the predicted label probabilities, and into the RPN with size ratio s_o for object detection to generate smaller regions containing objects. (c) Output: based on the objectness score, the label probability and the approximate IoU (Intersection-over-Union), the constraints are applied to control the object localization and label consistency of the regions. The satisfactory regions are combined to form the de-biased dataset D_w'. Our method benefits from region proposal networks (Section III-B) and the constraints (Section III-C).

III Method

We aim to eliminate the dataset bias between the web and standard datasets, and then utilize such data to train a CNN model for classification. The causes of built-in bias are summarized in two aspects in this work: Form bias and Label bias (Section IV-C2). In our method, we mainly address the form and label biases under the following assumptions: (1) web data may follow different distributions of, e.g., the size, number and localization of classification objects/contents from the standard dataset (e.g., classification objects/contents for the dog recognition task refer to dogs appearing in the images); (2) labels of web images may be inconsistent with the contents of the images. Specifically, we have a web dataset D_w = {(x_i, y_i)}, where the i-th image x_i has label y_i ∈ {1, ..., C}, N_w is the number of images and C is the number of classes. Meanwhile, we have a standard dataset D_s. Our target is to transform the web dataset D_w into a new dataset D_w', which is de-biased with respect to D_s.

III-A Framework of the Presented Method

First, we capture objects related to the target task to normalize the scale, density and location of objects, which are important measures of form bias. Specifically, we train a detection model on the standard dataset to detect target objects in web images, and then assess the relevance of the detected regions to the tags of the web images: for a web image x with label y, let R(x) = {R_1, ..., R_m} denote the set of detected regions of x, where R_i is one of the regions and m is the number of regions. Second, we have the following two considerations for the tags of web images: (1) whether target objects appear in the web image (e.g., a web image used for dog classification contains dogs) and (2) whether the label is consistent with the object present in the web image (e.g., the breed of dog in a web image is the same as the label of this image). To eliminate label bias, we couple the two considerations with the detected object proposals of web images, so that we can select object proposals of web images accurately.

Fig. 3 shows the framework of our method, in which the web dataset D_w passes through the region proposal networks and the constraints, and then becomes D_w'. (a) Training of the region proposal networks uses two levels of size parameters to control the size of the bboxes, corresponding to the red (region proposal) and blue (object detection) rectangles on the images. During generation of the proposals on the web dataset (see (b) Prediction of RPNs), there is a two-stage selection based on the above two RPNs trained on bboxes of two sizes. Meanwhile, the basic model predicts the probability of the proposals, which can be used to remove noisy regions. Finally, we take the output D_w' together with D_s as the training set to train the classification CNN model. The RPNs and the constraints are introduced as follows.

III-B Unsupervised Region Proposal Networks

As discussed above, candidate object proposals of web images are not only crucial for reducing form bias but also helpful for selecting regions consistent with the tags of web images. However, since there are no ground-truth annotations to locate object regions, we present a new unsupervised object localization method through joint region proposal and feature extraction via CNN layers, without needing bounding box annotations. Given a standard dataset D_s = {I_j} of N_s images, based on the assumption that standard images typically have one object in the center, we define the weak bounding box of an image I as:

b(I) = (x, y, w, h),  w = s * W,  h = s * H,  x = (W - w)/2,  y = (H - h)/2,    (1)

where W and H are the width and height of I, and w and h are the width and height of its weak bounding box, defined as proportion s of W and H. (x, y) are the coordinates of the top-left corner of the bounding box, and s is the proportion used to set the size of the bounding box.
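As a minimal Python sketch (the function name and the example proportion below are ours, not the paper's), the centered weak bounding box of Eq. 1 can be computed as:

```python
def weak_bbox(width, height, s):
    """Centered 'weak' bounding box (Eq. 1): the box sides are a
    proportion s of the image sides, and the box is centered."""
    w, h = s * width, s * height
    x, y = (width - w) / 2.0, (height - h) / 2.0  # top-left corner
    return x, y, w, h
```

For a 100×200 image with s = 0.5, this gives the box (25, 50, 50, 100), i.e., a centered box covering a quarter of the image area.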

We use the standard dataset and the weak bounding boxes defined above to train a region proposal network (RPN), with the proportion set to a size ratio s_r for region proposal [50]. Fig. 3 shows the framework of the unsupervised region proposal network. Applying this RPN to web images generates region proposals. Given an image x, the RPN outputs a set of rectangular object proposals R = {R_1, ..., R_m}, each with an objectness score s_i that estimates the probability of R_i containing a domain object. The regions produced by the RPN are not accurate enough because of the imprecise bounding boxes used for training. This is however sufficient, as our goal is to generate candidate regions containing objects from web images rather than to locate objects precisely.

To further select appropriate proposals, we use a size ratio s_o for training the object detection RPN, which is smaller than the ratio s_r used to train the RPN for generating region proposals (as shown in (a) Training of RPNs in Fig. 3). We thus train two RPNs using two different sizes of bboxes for the generation of content (red border) and object (blue border) region proposals, respectively.

So far we have described how to get a set of candidate proposals, most of which contain objects located close to the center. This makes the scale of related content and the density of domain information similar to the standard dataset. During the training of the RPNs, one mini-batch is formed by 256 proposals, and the loss is calculated based on a classification loss and a regression loss, which ensure that the proposal contains the object and that its location and size are close to the ground truth.

III-C Form and Label Constraints

The detection network maps the whole image to domain regions with associated probabilities, but the category of the object in the image has not yet been considered. Therefore, label bias still exists. We handle it with convolutional layers followed by a softmax layer. The convolutional features for detection and classification are shared, so that the classification network can be easily trained.

Given an image x, let R = {R_1, ..., R_m} be the proposal regions in the image. Note that we have applied non-maximum suppression (NMS) [50] to all regions to reduce redundancy. In order to select the distinguishing regions from the proposals, we combine two score constraints by solving the following optimization problem:

R* = argmax_{R_i ∈ R} S(R_i),    (2)

where S is defined as a scoring function over two constraints as follows:

S(R_i) = C_f(R_i) * C_l(R_i),    (3)

in which C_f denotes the form constraint and C_l denotes the label constraint.
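Assuming the two constraints are binary indicators combined multiplicatively (our reading of the selection rule in Eq. 3), region selection reduces to keeping the proposals on which both constraints fire. A hypothetical sketch:

```python
def select_regions(regions, c_form, c_label):
    """Keep the regions R_i for which S(R_i) = C_f(R_i) * C_l(R_i) = 1,
    i.e., both the form and the label constraint are satisfied."""
    return [r for r, cf, cl in zip(regions, c_form, c_label) if cf * cl == 1]
```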

Form constraint. The form constraint controls the scale of the regions: we want the objects not to be “too small” compared to the size of the regions. To identify the object in a region, as described before, in addition to training an RPN with size ratio s_r for regions containing relevant content, we also train another RPN with size ratio s_o for object detection. We then send each region proposal R_i generated by the RPN model with s_r into the object detection model (trained with size ratio s_o) and retain the proposal O_i with the maximum object score as the object proposal. Hence, we define the constraint as:

C_f(R_i) = 1 if IoU(O_i, R_i) ≥ τ, and 0 otherwise,    (4)

where IoU(O_i, R_i) denotes the Intersection-over-Union overlap between the object proposal and the region, and τ is the threshold value of the IoU. The object proposals are generated by the detection model trained on the standard dataset.
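A small Python sketch of the form constraint (Eq. 4); the (x, y, w, h) box format and the example threshold value are illustrative assumptions:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # intersection width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def form_constraint(obj_box, region_box, tau):
    """Eq. 4: the region passes only if the detected object is not
    'too small' relative to the region, measured by IoU >= tau."""
    return 1 if iou(obj_box, region_box) >= tau else 0
```

For example, a 10×10 object inside a 20×20 region yields IoU = 100/400 = 0.25, so it passes a threshold of τ = 0.25 but fails τ = 0.3.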

Label constraint. The labels of web images may contain errors, so we select proposals by adding a label constraint:

ŷ_i = argmax_c p(c | R_i),  c ∈ {1, ..., C},    (5)

C_l(R_i) = 1 if s_i ≥ ε and ŷ_i = y, and 0 otherwise,    (6)

where p(c | R_i) is the probability of region R_i belonging to category c, predicted by the benchmark model (fine-tuned on the standard dataset), ε is the threshold value controlling the score of the detected objects, ŷ_i is the predicted label of the image region, and s_i is the objectness score, i.e., the probability that R_i contains the classification content. The predicted label ŷ_i of the region is required to be the same as the web label y.
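The label constraint (Eqs. 5-6) can be sketched as follows; the argument names and the example values are ours, not the paper's:

```python
def label_constraint(class_probs, web_label, obj_score, eps):
    """Eqs. 5-6: the region passes if its objectness score clears the
    threshold eps and its predicted class (argmax over the benchmark
    model's class probabilities) matches the web label."""
    pred = max(range(len(class_probs)), key=lambda c: class_probs[c])
    return 1 if obj_score >= eps and pred == web_label else 0
```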

0:  Input: standard dataset D_s; web dataset D_w; the initialized network model.
1:  Define the weak bounding boxes of images in D_s by Eq. 1;
2:  Train an RPN using the weak bounding boxes of images in D_s;
3:  Generate region proposals R = {R_1, ..., R_m} for each image x of D_w;
4:  Predict the objectness score s_i and label probability p(c | R_i) of each region;
5:  for i = 1 to m do
6:     calculate C_f(R_i) by Eq. 4;
7:     calculate C_l(R_i) by Eq. 5 and Eq. 6;
8:     get S(R_i) by Eq. 3;
9:     select regions according to Eq. 2;
10:  end for
11:  Collect the selected regions into D_w';
12:  Use D_w' to aid the training of the CNN model;
13:  Output: CNN model trained on D_s and D_w'.
Algorithm 1 Unsupervised Detection for Web Data

Finally, for each web image, we obtain regions that are de-biased with respect to the standard data. That is, by combining scale and label information, we map D_w to a new processed dataset D_w', which is used to assist the target task.

III-D Training Process

We use D_w' as the auxiliary data and input it along with D_s into the CNN model. The parameters of the model are initialized with those of the pre-trained model (trained on ImageNet), and then updated using stochastic gradient descent (SGD). An iteration of SGD updates the current parameters W_t as:

W_{t+1} = W_t - η_t ∇_W L(B_t; W_t),    (7)

where L is a loss function, e.g., the softmax (cross-entropy) loss, ∇_W L is computed by gradient back-propagation, B_t is a mini-batch randomly drawn from the mixed training dataset D_s ∪ D_w', and η_t is the learning rate, which is reduced during training. The method is summarized in Algorithm 1.
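For illustration, a single parameter update of Eq. 7 on a flat list of weights can be sketched as follows (a toy sketch; real training would use a framework such as Caffe, as in the paper):

```python
def sgd_step(weights, grads, lr):
    """One SGD update (Eq. 7): each weight moves against its
    back-propagated mini-batch gradient, scaled by the learning rate."""
    return [w - lr * g for w, g in zip(weights, grads)]
```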

IV Experiments

IV-A Experimental Setup

Dataset. We employ three standard datasets (where “standard” refers to public datasets of the target task), namely Food-101, Stanford Dogs and MIT Indoor67, to evaluate our method. We use these datasets for two reasons: 1) the diverse characteristics of both web-derived and well-labeled datasets, such as the number of instances in each image and the object size and location, result in a significant gap which is under-researched in the literature; 2) their coverage of various classification tasks, i.e., a foreground-oriented shape-based task (dog), a foreground-oriented non-shape task (food) and a hybrid-content-based task (indoor scene), can evaluate the robustness of the proposed method.

For each task, the web datasets are collected by keyword search from Google Images, Flickr and Twitter [51]. After downloading web images, images that are near duplicates of any image in the test set are removed following [52], which ensures the fairness of our tests. Finally, 214,743 food images, 112,018 dog images and 76,907 indoor scene web images are used for the experiments, based on the dataset of [51]. For each source, we use a similar amount of data as the well-labeled images for the food and dog datasets. For indoor scenes, we found the downloaded web data to be more complex (e.g., the contents of web images are richer and there are more similar images) than for the other two tasks, so we use more web images (about twice as many as the standard data) from each source for this task.

Implementation. The main model in the experiments is ResNet [9], which performs well on many classification tasks. For generating object proposals, we use our data to train the architecture used in [53]. For classification, the first step of our experiments is to fine-tune CNN models on the target datasets as the basic models. We use Caffe [54] in our experiments, and our models are trained on NVIDIA TITAN X GPUs. We use different experimental settings on the three datasets because the scales of the datasets are different. The learning rate is initialized to a fixed value and divided by 10 after 10 epochs. The total number of iterations is 30 epochs, with a mini-batch of 20 images of size 224×224 on a Titan X (for ResNet110, the batch size is set to 13). Since our goal is to obtain content regions related to the label rather than to locate objects precisely, the settings of s_r and s_o for region detection and object detection do not need to be precise, as discussed in Section IV-B. Detailed discussions of τ in Eq. 4 and ε in Eq. 5 are also given in Section IV-B. None of these parameters seriously affects the results unless they are set to very small or very large values.

IV-B Parameters

In this section, we discuss the parameter settings of Eq. 1, Eq. 4 and Eq. 5. Here, we use “Name That Dataset” to evaluate how the parameters affect the gap between Food-101 and the processed data. First, there are two size-ratio values, s_r for region detection and s_o for object detection. Fig. 4 shows the classification precision of “Name That Dataset” on Food-101 and Food-debiased when setting s_r and s_o to different values. The best setting for region detection is the one at which the binary classification precision reaches its minimum, which means the gap is smaller than under other settings. To make the experiment manageable, we fixed s_r for region detection when discussing s_o for object detection, and chose the value of s_o that minimizes the precision of classifying the two datasets. Second, Fig. 5 shows the influence of the IoU threshold τ in Eq. 4 for generating object proposals and the object score threshold ε in Eq. 5.

Fig. 4: Precision of “Name That Dataset” for classifying Food-101 and de-biased web images, with respect to the changing parameters s_r and s_o, used for region detection and object detection, respectively. One setting of s_r for region detection corresponds to using the pre-trained RPN model to generate regions.
Fig. 5: Precision of “Name That Dataset” for classifying Food-101 and de-biased web images, with respect to the changing parameters: the IoU threshold τ and the objectness score threshold ε.

The classification precision of “Name That Dataset” on Food-101 and Food-debiased with different parameter settings is reported. It can be seen that none of these parameters affects the results significantly unless they are set to very small or very large values. The noise level affects the number of images, selected by our label constraint module, that are fed into the CNN model. For some images, the proposed method can retain more than one region proposal per image.

IV-C Evaluation of Built-in Gap Between Datasets

To analyze the bias between web and standard datasets, we measure dataset bias from five aspects: (1) Relative data bias: following [16, 17], we define relative data bias as the uniformity between web and standard data, i.e., whether they are misaligned. (2) Cross-dataset generalization: similar to the definition proposed in [16, 17], we use it to evaluate the generalization of different types of data, especially web data. (3) Scale of related content: designed based on [36], this originally means the scale of objects in an image, but in this paper it represents the scale of the object relevant to the web label. (4) Density of domain information: this originally indicates the density of objects in a scene image, but here we use it to represent the number of classification objects occurring in a web image. (5) Label quality: determined by the noisy labels of web data, it represents the relationship between the web label and the web image content. Specifically, we measure the dataset bias as follows:

Fig. 6: A CNN plays “Name That Dataset”, where the task is to separate images from the standard and web datasets specified by each curve. The ‘-debiased’ suffix means classifying standard data against data obtained using our method. “Ratio of used training set” is the proportion of the training dataset actually used for training.

IV-C1 Measuring Dataset Bias

In this section, we report quantitative evaluations to demonstrate the existence of bias.

First, we play the game “Name That Dataset” from [16] on each of the three datasets as a measure of relative data bias. As shown in Fig. 6, for each dataset (Food-101, Dog-120 and Indoor-67), we create a mixed dataset (indicated by the suffix -mix) by combining the same number of training images from the standard training set and from the web, and we gradually increase the number of images used up to the full set. The mixed set is used to train a CNN model for binary classification of whether an image comes from the standard dataset or from the web. With increasing training data, web and standard images are easily separated by the model. For comparison, we also conduct the experiment on each standard dataset itself by labeling half of the images as class 1 and the rest as class 2; in this case the classification accuracy is stable at around 50% (chance level). The line chart illustrates that web and standard datasets have strong relative data bias and are not aligned, i.e., each of them is distinctive and identifiable from the other.
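The “Name That Dataset” measurement can be sketched with a toy stand-in for the CNN: a simple threshold classifier on one-dimensional features. The function below is illustrative only (the paper trains a full CNN on images); the point it demonstrates is that held-out accuracy near the 50% chance level indicates low relative data bias.

```python
import random

def name_that_dataset(feats_a, feats_b, train_frac=0.8):
    """Toy "Name That Dataset" game on 1-D features (a sketch).

    Trains a threshold classifier (standing in for the CNN used in
    the paper) to decide which source a feature came from, and
    returns the held-out accuracy. Accuracy near 0.5 means the two
    sources are hard to tell apart, i.e. low relative data bias.
    """
    data = [(x, 0) for x in feats_a] + [(x, 1) for x in feats_b]
    random.shuffle(data)
    cut = int(len(data) * train_frac)
    train, test = data[:cut], data[cut:]

    def class_mean(lbl):
        members = [x for x, y in train if y == lbl]
        return sum(members) / max(len(members), 1)

    # "Training": place the decision threshold between the class means.
    m0, m1 = class_mean(0), class_mean(1)
    thr = (m0 + m1) / 2

    def predict(x):
        return int((x > thr) == (m1 > m0))

    return sum(predict(x) == y for x, y in test) / len(test)
```

Two well-separated feature distributions (strong bias) give accuracy near 1, while identical distributions (no bias) hover around chance, mirroring the curves in Fig. 6.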

Second, the cross-dataset generalization experiments on web and standard datasets are designed following [16]. Table I shows the classification performance on the food, dog and indoor scene datasets and the corresponding web data. Note that the same number of web images as in the standard datasets is used in this experiment.

Training \ Testing       Standard   Web     Mean
Standard (Food-101)      84.31      52.49   68.40
Web (Food)               74.01      81.72   77.87
Standard (Dog-120)       81.26      58.41   69.84
Web (Dog)                67.58      74.45   71.02
Standard (Dog-120)       81.26      62.79   72.03
Web (L-Dog [4])          68.09      73.36   70.73
Standard (Indoor-67)     79.64      60.40   70.02
Web (Indoor)             65.97      72.39   69.18
TABLE I: Cross-dataset generalization: classification accuracies when training on one dataset (row) and testing on the Standard and Web datasets, as well as their mean.

Results in each row are obtained by training on the specified dataset, and the columns show accuracies when testing on the standard and web datasets, as well as the mean of the two accuracies. For example, in the first row, 84.31 is obtained by training and testing on Standard (Food-101), and 52.49 is obtained by training on Standard (Food-101) and testing on Web (Food). Furthermore, to show that web dataset bias is universal rather than caused by the collection method used in our work, we conduct the same experiment using web data from L-Dog collected by Krause et al. [4]. In this experiment, we use 21,827 images (13,158 for training, 8,669 for testing) from the L-Dog dataset with the same classes as Stanford Dogs for a fair comparison.

As discussed in [16], the absolute accuracies are not important; it is the differences in performance that matter. The performance when training on one source and testing on the other is substantially worse than when training and testing on the same source (bold), so the two sources do not generalize to each other. Meanwhile, the smaller drop between same-source and cross-source accuracy shows that web data is more generalizable than standard data, which conforms to the fact that standard datasets have more restrictions than web data.
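The per-source comparison above can be summarized with a small helper. The `drop` field is our shorthand for the same-source minus cross-source gap discussed in the text; the worked example in the test uses the Food-101 numbers from Table I.

```python
def generalization_summary(acc_same, acc_cross):
    """Summarize cross-dataset generalization for one training source.

    acc_same  -- accuracy when testing on the training source itself
    acc_cross -- accuracy when testing on the other source
    Returns the mean accuracy and the cross-source drop; a smaller
    drop means the training source generalizes better, which is the
    criterion used when comparing web and standard data in Table I.
    """
    return {"mean": (acc_same + acc_cross) / 2,
            "drop": acc_same - acc_cross}
```

For Food-101, standard data drops 84.31 → 52.49 (a 31.82-point drop) while web data drops only 81.72 → 74.01 (7.71 points), quantifying why web data is called the more generalizable source.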

Fig. 7: Examples of the working mechanism of unsupervised object localization. We show web images and the corresponding de-biased images. The web images contain form bias and label bias; many have a density greater than one, and our method handles them well.

Third, we further perform two experiments to measure the scale of related content and the density of domain information. The scale of related content measures the size of the detected region containing relevant content relative to the size of the entire image, and the density of domain information is the number of detected regions related to the subject in the image. The scale is defined as

scale = (1/n) · Σ_{i=1}^{n} s_i / S,    (8)

where n is the number of detected regions related to the subject of the domain task in the image (so the density is n), s_i is the size of the i-th region containing relevant objects, and S is the size of the image. In Table II, we report the average scales and densities of images in the Food-101 and Stanford Dogs datasets. We can see that the densities of standard images are almost always 1, but they become larger for web images. Meanwhile, the scale of related content is usually smaller in web images than in standard images. For example, Food-101 has a scale of 0.8536 and a density of 1.16, whereas Food-web has a scale of 0.6218 and a density of 1.94, clearly distinguishing the two. Different from object classification, indoor scene recognition requires knowledge about both scenes and objects. Calculating the scale and density of a scene image is not meaningful, but we can treat a web scene image as a combination of several parts (some containing useful information and some redundant, e.g., people). Therefore, we directly apply the same process to scene data as to food and dog images, without measuring the changes of these metrics.
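As a concrete illustration of these two measures, the sketch below computes the scale and density for one image from the areas of its label-consistent regions; the exact normalization of Eq. (8) (averaging the relative region sizes) is an assumption here.

```python
def scale_and_density(regions, image_area):
    """Compute Eq. (8)-style scale and density of related content.

    regions    -- areas (in pixels) of detected regions whose
                  predicted label matches the web tag
    image_area -- area of the whole image
    Density is the number of such regions; scale is their average
    relative size. A sketch consistent with the definitions in the
    text, not the paper's exact implementation.
    """
    n = len(regions)
    if n == 0:
        return 0.0, 0  # no related content found (an outlier image)
    scale = sum(s / image_area for s in regions) / n
    return scale, n
```

A standard-dataset image typically yields something like (0.85, 1), a single large centered object, while a web image with several small objects yields a smaller scale and a density above 1, matching the averages in Table II.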

Finally, for the label quality, we use the region proposal model (see Section III-B) to generate content regions for web images and find that a considerable fraction of webly-labeled food images have no detected regions; these images are outliers.

Overall, the above results demonstrate the existence of dataset bias between web and standard datasets.

Fig. 8: Cross-dataset generalization. Green, Red and Blue indicate Standard, Web and Processed (de-biased) web images respectively. The values are the recognition accuracies of models trained on Standard (S), Web (W) and Processed (P) images and tested on Standard (S), Web (W) and Processed (P) data (columns). We also show some example images for each kind of data. For each column, the image with a green border comes from the standard dataset, and the two images in the second and third rows form a pair: a web image (red) and the corresponding image after processing (blue).

IV-C2 Estimating the Culprits of Built-in Gap

We now investigate the causes of the built-in gap between web and standard datasets.

Forming of datasets. This is a more generic version of the “capture bias” of [16], relating to focal length, finder frame, viewpoint, etc. during shooting: objects in images from the standard dataset are almost always in the center of the image, whereas web images are not always single-object-in-the-center. Moreover, as shown in Fig. 7, the contents of web images are rich, not just one major object at a large scale. These factors result in the form bias.

Label noise level of datasets. This comes from the fact that images of a domain task (fine-grained classification) are often difficult for people to label because of the large intra-class and small inter-class variations among objects. This is particularly problematic for web data, which are usually labeled by non-experts without moderation. Image labels may be inaccurate or even entirely wrong, i.e., irrelevant to the image content. These “wrong” labels also contribute to the bias. The three kinds of biases concern the information distribution, content and label of web images. For the form and label biases, we show the results of reducing their influence as follows.

IV-D Performance of Reducing Bias

The effects of the proposed method on reducing the bias between web and standard datasets are shown below.

To show the change of dataset bias quantitatively, we play the game “Name That Dataset” again. We process the web data with our method and select the same number of images from the different sources for the experiments. The results are shown in Fig. 6: after de-biasing, the relative dataset bias becomes much lower (the “-debiased” curves), comparable to the results on the two halves of standard data (around 50%). Furthermore, as more processed training data (-debiased) is added, the classification accuracy stays level at around the 50% chance level. In contrast, the classification accuracies on “-web” data tend to rise with more training data. Because the processed images are similar to images from the standard dataset, the bias between web and standard datasets is reduced. Fig. 7 shows some examples of images before and after processing, in which the scale and density of objects are changed. Moreover, noisy images are removed during processing.

Dataset Scale Density
Food-101 0.8536 1.16
Food-web 0.6218 1.94
Food-debiased 0.7775 1.23
Dog-120 0.7381 1.00
Dog-web 0.5939 1.45
Dog-debiased 0.7465 1.18
TABLE II: Average scale and density values of standard, web and de-biased web data. The regions of standard data are produced by the RPN and have correct predicted labels. For web data, the regions and the scale and density calculations are based on the RPN. For de-biased web data, regions come from the RPN trained on the standard dataset and have correct predicted labels, and the scale and density are calculated from the output of the RPN with a smaller parameter value.

As shown in Fig. 8, the cross-dataset generalization results show that de-biased web datasets become significantly more generalizable: training on standard datasets and testing on de-biased web data, and vice versa, both perform significantly better than with the original web data.

For the scale of related content and the density of domain information, we report the measures before and after bias reduction in Table II. Our method simultaneously removes different biases: the label constraint removes noisy web images, while the form constraint controls the scale of the related object, so the density of domain information becomes more consistent with the target dataset. 95,672 (214,743), 68,355 (112,018) and 37,496 (76,970) are the numbers of images after (before) elimination for the three tasks. The number of region proposals for web food, dog and indoor scene images is also reduced accordingly after processing.
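The two constraints can be sketched as a per-image filtering step. Here `min_scale` is a hypothetical form-constraint threshold (the paper's actual constraint parameters are not reproduced), and an empty result marks the image as an outlier removed by the label constraint.

```python
def debias_web_image(proposals, web_label, min_scale=0.3, image_area=1.0):
    """Apply the two constraints to one web image (a sketch).

    proposals -- list of (area, predicted_label, confidence) tuples
                 for the regions produced by the RPN
    web_label -- the (possibly noisy) tag the image was crawled with
    Keeps regions that (i) are consistent with the web tag (label
    constraint) and (ii) are large enough relative to the image
    (form constraint). `min_scale` is a hypothetical threshold.
    Returns the kept regions; an empty list marks the image as an
    outlier to be discarded.
    """
    return [(a, lbl, c) for a, lbl, c in proposals
            if lbl == web_label                 # label constraint
            and a / image_area >= min_scale]    # form constraint
```

Filtering at both the image level (discard outliers) and the region level (discard small or off-topic regions) is what shrinks both the image counts and the proposal counts reported above.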

Fig. 9: Performance comparison demonstrating the gain of adding web data and processed data. “Mix” means the web and standard data are combined, so the number of images is twice that of using Food-101 or Food-web alone.

To analyze the effect of using web data, we evaluate models trained on images from the following sources: the Food-101 dataset [13], web food images (Food-web), web images after manual filtering (Human-Filter), and web images processed by the Pseudo-label method [55]. The test set is the standard test set of Food-101. As shown in Fig. 9, the performance improvements are disparate when adding the same amount of standard data (red border) and webly-labeled data (green border). This may be due to the fact that web data contains noisy labels.

Amount of processed data (increasing from left to right)
Food-101 Acc (%)        85.24   86.23   88.56   90.41
Stanford Dogs Acc (%)   81.20   83.49   85.13   87.95
Indoor-67 Acc (%)       80.82   81.63   83.25   84.93
TABLE III: Performance gain of adding increasing amounts of processed web data. Acc means classification accuracy.
Method              Model     Clean-ft Acc (%)  Mix-ft Acc (%)  Method Acc (%)  Our Acc (%)
Bottom-up [56]      AlexNet   66.49             68.72           70.29 (+1.57)   75.31 (+6.59)
Pseudo-label [55]   AlexNet                                     69.36 (+0.64)
Weakly [11]         AlexNet                                     71.10 (+4.61)
Boosting [23]       CaffeNet  67.43             69.00           72.53 (+3.53)   75.58 (+6.58)
PGM [10]            CaffeNet                                    73.14 (+4.14)
WSL [31]            CaffeNet                                    73.21 (+4.21)
Harnessing [24]     VGGNet    74.32             76.98           79.02 (+2.04)   81.93 (+4.95)
Goldfince [4]       ResNet50  84.31             84.89           86.75 (+2.86)   90.41 (+5.42)
DPTL+subA [5]       ResNet50                                    88.78 (+3.89)
TABLE IV: Experimental results on the Food-101 dataset. “Clean-ft” means fine-tuning a model pre-trained on ImageNet with Food-101, and “Mix-ft” means fine-tuning with both standard and web data; these values are shared by all methods using the same backbone. “Method Acc” and “Our Acc” are the accuracies of the listed method and of the proposed method, respectively. Numbers in brackets show the performance gain over the “Mix-ft” baseline.

However, while the filtered web data (blue and black) has a much lower noise level, the performance is still significantly worse than with the standard data (red) of Food-101. We also use mixed data (Mix) of web and Food-101 images, as well as filtered web data (Mix-filter), to train CNNs. At the same coordinate in the graph, the amount of mixed data is double that of Food-101 or Food-web, yet there is no significant improvement after filtering the web data. The reason is that the web data and the filtered data have poor cross-dataset generalization. In contrast, the results of using the processed data are similar to those of using the standard data (red). Furthermore, mixing the processed data with the standard data (Mix-pro-Data) yields better results than all the other methods. For all three tasks, the proposed method achieves consistently increasing performance with more processed images, and the results are shown in Table III.

Group              Method                                 Food   Dog    Indoor
Object-Detection   Grad-CAM [57]                          82.94  84.92  78.55
                   Cascade R-CNN [58]                     82.45  82.04  78.66
                   RefineDet [59]                         83.20  82.68  79.91
                   DSOD [60]                              84.75  81.93  78.51
                   Cross-Domain WSD [61]                  84.97  83.26  79.43
                   ResNet-50+EdgeBox [62]                 85.37  81.74  78.16
                   ResNet-50+Selective Search [42]        84.82  81.43  77.69
                   ResNet-50+DeepMask [63]                86.59  83.31  78.94
                   ResNet-50+DeepMask II [64]             86.34  83.86  79.25
                   Mimic [49]                             86.70  84.17  76.47
Baseline           ResNet-50+ft                           84.31  81.26  79.63
                   ResNet-110+ft                          84.88  81.75  80.82
                   ResNet-50+(web)+ft                     84.89  82.72  82.31
                   ResNet-110+(web)+ft                    85.37  82.89  83.73
Ablation           ResNet-50+(web+filter)+ft              86.10  83.41  80.26
                   ResNet-50+(web+RPN)+ft                 86.64  82.65  79.52
                   ResNet-50+(web+RPN+form)+ft            88.15  85.71  80.66
                   ResNet-50+(web+RPN+form+label)+ft      89.32  85.24  82.09
Ours               ResNet-50+(debiased web)+ft            90.41  87.95  84.93
                   ResNet-110+(debiased web)+ft           91.63  88.62  85.22
TABLE V: Experimental results on the Food-101, Stanford Dogs and Indoor-67 datasets. “ft” means fine-tuning a model pre-trained on ImageNet. “+web” means adding the web data to the standard training set. “form” and “label” denote the form and label constraints, and “debiased web” means the de-biased web data produced by our method is used. Values are classification accuracies (%).

IV-E Training CNN Models for Classification

We evaluate our method on three datasets, which contain objects with specific forms (dogs), objects with irregular shapes (food), and even scenes, which have complex structures. The detailed results are as follows.

Method on Food-101        Acc (%)  Method on Dog-120     Acc (%)  Method on Indoor-67         Acc (%)
Random Forest [13]        50.76    NAC [65]              68.61    IFV+DMS [66]                66.87
SNN [67]                  69.90    FoF-Weakly [68]       71.40    FB/REF [24]                 61.60
DNNFM [69]                58.49    PDFS [70]             71.96    MPP [71]                    75.67
DCNN+ft [13]              68.44    FB/REF [24]           73.10    MetaObject-CNN [72]         78.90
PTFT [73]                 70.41    FOAF+ft [74]          74.49    SFV+place [75]              79.00
Im2Calories [76]          79.00    Weakly-S [77]         80.43    Places205-VGG [78]          79.76
Inception-v3 [79]         88.28    ZSL-WL [80]           85.16    MPP+DSFL [71]               80.78
DLA [81]                  89.70    Goldfince [4]         85.90    Double fully hybrid [36]    80.97
Progressive filter [51]   89.77    Progressive filter [51]  87.36  HP-Net [82]                83.10
DPTL [5]                  90.40    DPTL [5]              88.00    Progressive filter [51]     84.78
Ours                      91.63    Ours                  88.62    Ours                        85.22
TABLE VI: Comparison with image classification methods on the Food-101, Stanford Dogs and Indoor-67 datasets. “Acc” is classification accuracy. Some of these works also use web data, such as Goldfince [4] and Progressive filter [51].

Food-101. This dataset contains 1,000 images per class, split 3:1 into training and testing sets. We use 214,743 web images. First, Table IV shows comparisons with other works that utilize web images or learn weakly from web labels. To compare with these methods, we conduct experiments on Food-101 using the same backbone models as those works, e.g., AlexNet and VGGNet. “Clean-ft Acc” is the result of training the deep model on the standard dataset, and “Mix-ft Acc” is the performance of the model trained on the web and standard datasets together. Combining such methods with our bias removal significantly improves classification performance. “Method Acc” is the result of using the method in the first column to learn from web food images, and “Our Acc” is the result of the proposed method. Compared with other methods using web data, the proposed method further improves the performance. Moreover, no matter which deep model (AlexNet, CaffeNet, VGGNet, ResNet) is employed, the proposed algorithm shows its superiority consistently.

Furthermore, Table V presents results on the Food-101 dataset for several detection methods, together with the baseline and ablation results on ResNet-50 and ResNet-110 with web data. As can be seen, the result of ResNet-50+(web+RPN)+ft is similar to that of ResNet-50+(web+filter)+ft, because the regions generated by the RPN without the constraints still contain many noisy and redundant regions, which influence the final results. Adding the form constraint helps remove redundant regions, so the result improves by about 2%. Meanwhile, introducing the label constraint further filters out the noisy regions to improve the performance. The ablation results illustrate that each part of our method is effective. Our result is 6.25% higher than that of directly using web data on ResNet-110. Meanwhile, we also compare with existing unsupervised methods, such as [49] and [61]. Since the proposed method considers the characteristics of web images and adds constraints on the object proposals, the final generated regions are more similar to standard data than those of other localization methods.

Stanford Dogs. We also evaluate our method on Stanford Dogs to demonstrate its robustness. Stanford Dogs contains 120 dog categories, with 12,000 images for training and 8,580 images for testing. We use 112,018 web images in this work. Table V shows that fine-tuning ResNet-50 on Stanford Dogs without web data achieves an accuracy of 81.26%. After adding web data without any pre-processing, the accuracy improves by only 1.46%. However, by employing de-biased web data, the performance improves substantially, by 6.69% over the baseline. For the dog dataset, the first-round filtering removes almost 40% of the web images, which include many useful images with abundant domain information.

We also conduct an experiment on another web dataset, L-Dog, to illustrate the universality of the bias. The L-Dog dataset contains the same 120 classes and was collected from Google Images by Krause et al. [4]. Their result of 85.90% for Goldfince [4] is achieved using 342,632 images from 515 dog categories, whereas our method achieves higher accuracy using only about a third of the data used in that work. The trend of the results on L-Dog is consistent with that on our collected web data, e.g., 82.59% for ResNet-110+(L-Dog)+ft and 86.55% for ResNet-110+(de-biased L-Dog)+ft.

MIT Indoor67. MIT Indoor67 contains 67 categories of indoor images, with 5,360 for training and 1,340 for testing. Scene recognition requires knowledge about both scenes and various objects, making the task more challenging. As shown in Table V, the basic performance of ResNet-50 is 79.63%, the improvement from adding web data is 2.68%, and after removing the bias, the accuracy improves by 5.30% over the baseline. Since the scale and density of domain information in scene images are more difficult to define than in object images, the improvement is not as significant as on the object recognition datasets. Different from object classification, intra-class variation is more pronounced for scenes, so web indoor scene images are more complex. Meanwhile, the results of the other object detection methods also show that the labeled information is difficult to localize in scene images, so more noisy regions (focusing on objects rather than the scene) are introduced and influence the final results. Nevertheless, the classification accuracy of the proposed method is still higher than that of existing scene recognition methods, because the form constraint is a soft constraint on the image content and does not break the wholeness of the scene.

In addition, Table VI compares with related work on each task. Among these works, Goldfince [4] and DPTL [5] also use web data, and the amount of data they use is larger than ours. Nevertheless, the proposed method achieves better results by reducing the bias between the web and standard datasets. Compared with Progressive filter [51], our method also requires less training time, because Progressive filter needs to train the model over several rounds.

In summary, the proposed method effectively uses web data to improve the performance of classification models on all three tasks.

V Conclusion

In contrast to previous works, this paper reveals the phenomenon of web dataset bias and carries out a rigorous quantitative analysis of it from various aspects. Extensive experiments demonstrate that dataset bias causes the limited benefit of using web data. To address this problem, we present an unsupervised object localization method that provides critical insights into object density and scale. Experiments show that the proposed method effectively reduces dataset bias. By employing de-biased web data, our method performs favorably against the state-of-the-art on multiple classification tasks. Web images are easy and cheap to acquire; although eliminating bias is not yet fully solved, this work shows promising benefits in this direction. How to use astronomically large webly-labeled data for a specific target learning task remains an open problem for future investigation.

References

  • [1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [2] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, 2014.
  • [3] C. Farabet, C. Couprie, L. Najman, and Y. LeCun, “Learning hierarchical features for scene labeling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
  • [4] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei, “The unreasonable effectiveness of noisy data for fine-grained recognition,” in ECCV, 2016.
  • [5] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie, “Large scale fine-grained categorization and domain-specific transfer learning,” in CVPR, 2018.
  • [6] Y. Peng, X. He, and J. Zhao, “Object-part attention model for fine-grained image classification,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1487–1500, 2018.
  • [7] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in CVPR, 2015.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [10] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from massive noisy labeled data for image classification,” in ICCV, 2015.
  • [11] A. Joulin, L. van der Maaten, A. Jabri, and N. Vasilache, “Learning visual features from large weakly supervised data,” in ECCV, 2016.
  • [12] S. Azadi, J. Feng, S. Jegelka, and T. Darrell, “Auxiliary image regularization for deep cnns with noisy labels,” in ICLR, 2016.
  • [13] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining discriminative components with random forests,” in ECCV, 2014.
  • [14] A. Khosla, N. Jayadevaprakash, B. Yao, and F.-F. Li, “Novel dataset for fine-grained image categorization: Stanford dogs,” in CVPR Workshop, 2011.
  • [15] A. T. Ariadna Quattoni, “Recognizing indoor scenes,” in CVPR, 2009.
  • [16] A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” in CVPR, 2011.
  • [17] A. Khosla, T. Zhou, T. Malisiewicz, A. A. Efros, and A. Torralba, “Undoing the damage of dataset bias,” in ECCV, 2012.
  • [18] B. Zhuang, L. Liu, Y. Li, C. Shen, and I. Reid, “Attend in groups: a weakly-supervised deep learning framework for learning from web data,” in CVPR, 2017.
  • [19] S. Yeung, V. Ramanathan, O. Russakovsky, L. Shen, G. Mori, and L. Fei-Fei, “Learning to learn from noisy web videos,” in CVPR, 2017.
  • [20] S. Hong, D. Yeo, S. Kwak, H. Lee, and B. Han, “Weakly supervised semantic segmentation using web-crawled videos,” in CVPR, 2017.
  • [21] A. Li, A. Jabri, A. Joulin, and L. van der Maaten, “Learning visual n-grams from web data,” in ICCV, 2017.
  • [22] S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott, and D. Huang, “Curriculumnet: Weakly supervised learning from large-scale web images,” in ECCV, 2018.
  • [23] S. Sukhbaatar, J. Bruna, M. Paluri, L. Bourdev, and R. Fergus, “Training convolutional networks with noisy labels,” in ICLRW, 2015.
  • [24] P. D. Vo, A. Ginsca, H. Le Borgne, and A. Popescu, “Harnessing noisy web images for deep representation,” Computer Vision and Image Understanding, 2017.
  • [25] M. Hu, Y. Yang, F. Shen, L. Zhang, H. T. Shen, and X. Li, “Robust web image annotation via exploring multi-facet and structural knowledge,” IEEE Transactions on Image Processing, vol. 26, no. 10, pp. 4871–4884, 2017.
  • [26] Y. Yao, J. Zhang, F. Shen, X.-S. Hua, J. Xu, and Z. Tang, “Exploiting web images for dataset construction: A domain robust approach,” IEEE Transactions on Multimedia, vol. 19, no. 8, pp. 1771–1784, 2017.
  • [27] L. Niu, X. Xu, L. Chen, L. Duan, and D. Xu, “Action and event recognition in videos by learning from heterogeneous web sources,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 6, pp. 1290–1304, 2017.
  • [28] L. Niu, W. Li, D. Xu, and J. Cai, “Visual recognition by learning from web data via weakly supervised domain generalization,” IEEE Transactions on Neural Networks and Learning Systems, vol. 28, no. 9, pp. 1985–1999, 2017.
  • [29] S. E. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” CoRR, vol. abs/1412.6596, 2014.
  • [30] H. Izadinia, B. C. Russell, A. Farhadi, M. D. Hoffman, and A. Hertzmann, “Deep classifiers from image tags in the wild,” in MMCommons Workshop, 2015.
  • [31] X. Chen and A. Gupta, “Webly supervised learning of convolutional networks,” in ICCV, 2015.
  • [32] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in CVPR, 2014.
  • [33] A. Khosla, T. Zhou, T. Malisiewicz, A. Efros, and A. Torralba, “Undoing the damage of dataset bias,” in ECCV, 2012.
  • [34] I. Misra, C. L. Zitnick, M. Mitchell, and R. Girshick, “Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels,” in CVPR, 2016.
  • [35] C. Vondrick, H. Pirsiavash, A. Oliva, and A. Torralba, “Learning visual biases from human imagination,” in NIPS, 2015.
  • [36] L. Herranz, S. Jiang, and X. Li, “Scene recognition with cnns: objects, scales and dataset bias,” in CVPR, 2016.
  • [37] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
  • [38] R. Vaillant, C. Monrocq, and Y. Le Cun, “Original approach for the localisation of objects in images,” IEE Proceedings-Vision, Image and Signal Processing, vol. 141, no. 4, pp. 245–250, 1994.
  • [39] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR, vol. 1, 2001, pp. I–I.
  • [40] P. F. Felzenszwalb, R. B. Girshick, and D. McAllester, “Cascade object detection with deformable part models,” in CVPR, 2010, pp. 2241–2248.
  • [41] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [42] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International journal of computer vision, vol. 104, no. 2, pp. 154–171, 2013.
  • [43] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “Overfeat: Integrated recognition, localization and detection using convolutional networks,” in ICLR, 2014.
  • [44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision, 2016, pp. 21–37.
  • [45] J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in CVPR, 2017, pp. 6517–6525.
  • [46] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in CVPR, 2014.
  • [47] M. Cho, S. Kwak, C. Schmid, and J. Ponce, “Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals,” in CVPR, 2015, pp. 1201–1210.
  • [48] X. Wang, Z. Zhu, C. Yao, and X. Bai, “Relaxed multiple-instance svm with application to object discovery,” in ICCV, 2015, pp. 1224–1232.
  • [49] Y. Li, L. Liu, C. Shen, and A. van den Hengel, “Image co-localization by mimicking a good detector’s confidence score distribution,” in ECCV, 2016.
  • [50] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in NIPS, 2015.
  • [51] J. Yang, X. Sun, Y.-K. Lai, L. Zheng, and M.-M. Cheng, “Recognition from web data: A progressive filtering approach,” IEEE Transactions on Image Processing, vol. 27, no. 11, pp. 5303–5315, 2018.
  • [52] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, and Y. Wu, “Learning fine-grained image similarity with deep ranking,” in CVPR, 2014.
  • [53] S. Ren, K. He, R. Girshick, X. Zhang, and J. Sun, “Object detection networks on convolutional feature maps,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 7, pp. 1476–1481, 2017.
  • [54] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM MM, 2014.
  • [55] D.-H. Lee, “Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks,” in ICMLW, 2013.
  • [56] S. Sukhbaatar and R. Fergus, “Learning from noisy labels with deep neural networks,” arXiv preprint arXiv:1406.2080, 2014.
  • [57] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra et al., “Grad-cam: Visual explanations from deep networks via gradient-based localization.” in ICCV, 2017.
  • [58] Z. Cai and N. Vasconcelos, “Cascade r-cnn: Delving into high quality object detection,” in CVPR, 2018.
  • [59] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-shot refinement neural network for object detection,” in CVPR, 2018.
  • [60] Z. Shen, Z. Liu, J. Li, Y.-G. Jiang, Y. Chen, and X. Xue, “Dsod: Learning deeply supervised object detectors from scratch,” in ICCV, 2017.
  • [61] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa, “Cross-domain weakly-supervised object detection through progressive domain adaptation,” in CVPR, 2018.
  • [62] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object proposals from edges,” in ECCV, 2014, pp. 391–405.
  • [63] P. O. Pinheiro, R. Collobert, and P. Dollár, “Learning to segment object candidates,” in NIPS, 2015, pp. 1990–1998.
  • [64] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár, “Learning to refine object segments,” in ECCV, 2016, pp. 75–91.
  • [65] M. Simon and E. Rodner, “Neural activation constellations: Unsupervised part model discovery with convolutional networks,” in ICCV, 2015.
  • [66] C. Doersch, A. Gupta, and A. A. Efros, “Mid-level visual element discovery as discriminative mode seeking,” in NIPS, 2013.
  • [67] M. Mohammadi and S. Das, “SNN: stacked neural networks,” CoRR, vol. abs/1605.08512, 2016.
  • [68] Z. Xu, D. Tao, S. Huang, and Y. Zhang, “Friend or foe: Fine-grained categorization with weak supervision,” IEEE Transactions on Image Processing, 2017.
  • [69] A. Tatsuma and M. Aono, “Food image recognition using covariance of convolutional layer feature maps,” IEICE Transactions on Information and Systems, 2016.
  • [70] X. Zhang, H. Xiong, W. Zhou, W. Lin, and Q. Tian, “Picking deep filter responses for fine-grained image recognition,” in CVPR, 2016.
  • [71] D. Yoo, S. Park, J.-Y. Lee, and I. So Kweon, “Multi-scale pyramid pooling for deep convolutional representation,” in CVPR, 2015.
  • [72] R. Wu, B. Wang, W. Wang, and Y. Yu, “Harvesting discriminative meta objects with deep CNN features for scene classification,” in CVPR, 2015.
  • [73] K. Yanai and Y. Kawano, “Food image recognition using deep convolutional network with pre-training and fine-tuning,” in ICMEW, 2015.
  • [74] X. Zhang, H. Xiong, W. Zhou, and Q. Tian, “Fused one-vs-all features with semantic alignments for fine-grained visual categorization,” IEEE Transactions on Image Processing, 2016.
  • [75] M. Dixit, S. Chen, D. Gao, N. Rasiwasia, and N. Vasconcelos, “Scene classification with semantic fisher vectors,” in CVPR, 2015.
  • [76] A. Meyers, N. Johnston, V. Rathod, A. Korattikara, A. Gorban, N. Silberman, S. Guadarrama, G. Papandreou, J. Huang, and K. P. Murphy, “Im2calories: towards an automated mobile vision food diary,” in CVPR, 2015.
  • [77] Y. Zhang, X. S. Wei, J. Wu, and J. Cai, “Weakly supervised fine-grained categorization with part-based image representation,” IEEE Transactions on Image Processing, 2016.
  • [78] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, “Places: A 10 million image database for scene recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, 2018.
  • [79] H. Hassannejad, G. Matrella, P. Ciampolini, I. De Munari, M. Mordonini, and S. Cagnoni, “Food image recognition using very deep convolutional networks,” in International Workshop on Multimedia Assisted Dietary Management, 2016.
  • [80] L. Niu, A. Veeraraghavan, and A. Sabharwal, “Webly supervised learning meets zero-shot learning: A hybrid approach for fine-grained classification,” in CVPR, 2018.
  • [81] F. Yu, D. Wang, E. Shelhamer, and T. Darrell, “Deep layer aggregation,” in CVPR, 2018.
  • [82] P. Wang and N. Vasconcelos, “Towards realistic predictors,” in ECCV, 2018.