Fine-grained recognition refers to the task of distinguishing very similar categories, such as breeds of dogs [27, 37], species of birds [60, 58, 5, 4], or models of cars [70, 30]. Since the task's introduction, great progress has been made, with accuracies on the popular CUB-200-2011 bird dataset steadily increasing from 10.3% to 84.6%.
The predominant approach in fine-grained recognition today consists of two steps. First, a dataset is collected. Since fine-grained recognition is a task inherently difficult for humans, this typically requires either recruiting a team of experts [58, 38] or extensive crowd-sourcing pipelines [30, 4]. Second, a method for recognition is trained using these expert-annotated labels, possibly also requiring additional annotations in the form of parts, attributes, or relationships [75, 26, 36, 5]. While methods following this approach have shown some success [5, 75, 36, 28], their performance and scalability are constrained by the paucity of data available under these annotation requirements. With this traditional approach it is prohibitive to scale up to all 14,000 species of birds in the world (Fig. 1), 278,000 species of butterflies and moths, or 941,000 species of insects.
In this paper, we show that it is possible to train effective models for fine-grained recognition using noisy data from the web and simple, generic methods of recognition [55, 54]. We demonstrate recognition abilities that greatly exceed the current state of the art, achieving top-1 accuracies beyond all prior methods on CUB-200-2011, Birdsnap, FGVC-Aircraft, and Stanford Dogs without using a single manually-annotated training label from the respective datasets. On CUB, this is nearly at the level of human experts [6, 58]. Building upon this, we scale up the number of fine-grained classes recognized, reporting first results on over 10,000 species of birds and 14,000 species of butterflies and moths.
The rest of this paper proceeds as follows: After an overview of related work in Sec. 2, we provide an analysis of publicly-available noisy data for fine-grained recognition in Sec. 3, analyzing its quantity and quality. We describe a more traditional active learning approach for obtaining larger quantities of fine-grained data in Sec. 4, which serves as a comparison to purely using noisy data. We present extensive experiments in Sec. 5, and conclude with discussion in Sec. 6.
2 Related Work
2.0.1 Fine-Grained Recognition.
The majority of research in fine-grained recognition has focused on developing improved models for classification [1, 3, 5, 7, 9, 8, 14, 16, 18, 20, 21, 22, 28, 29, 36, 37, 41, 42, 49, 51, 50, 66, 68, 69, 71, 73, 72, 76, 77, 75, 78]. While these works have made great progress in modeling fine-grained categories given the limited data available, very few works have considered the impact of that data [69, 68, 58]. Xu et al.  augment datasets annotated with category labels and parts with web images in a multiple instance learning framework, and Xie et al.  do multitask training, where one task uses a ground truth fine-grained dataset and the other does not require fine-grained labels. While both of these methods have shown that augmenting fine-grained datasets with additional data can help, in our work we present results which completely forgo the use of any curated ground truth dataset. In one experiment hinting at the use of noisy data, Van Horn et al.  show the possibility of learning 40 bird classes from Flickr images. Our work validates and extends this idea, using similar intuition to significantly improve performance on existing fine-grained datasets and scale fine-grained recognition to over ten thousand categories, which we believe is necessary in order to fully explore the research direction.
Considerable work has also gone into the challenging task of curating fine-grained datasets [4, 58, 27, 30, 31, 59, 65, 60, 70] and developing interactive methods for recognition with a human in the loop [6, 62, 61, 63]. While these works have demonstrated effective strategies for collecting images of fine-grained categories, their scalability is ultimately limited by the requirement of manual annotation. Our work provides an alternative to these approaches.
2.0.2 Learning from Noisy Data.
Our work is also inspired by methods that propose to learn from web data [15, 10, 11, 45, 34, 19] or reason about label noise [39, 67, 58, 52, 43]. Works that use web data typically focus on detection and classification of a set of coarse-grained categories, but have not yet examined the fine-grained setting. Methods that reason about label noise have been divided in their results: some have shown that reasoning about label noise can have a substantial effect on recognition performance , while others demonstrate little change from reducing the noise level or having a noise-aware model [52, 43, 58]. In our work, we demonstrate that noisy data can be surprisingly effective for fine-grained recognition, providing evidence in support of the latter hypothesis.
3 Noisy Fine-Grained Data
In this section we provide an analysis of the imagery publicly available for fine-grained recognition, which we collect via web search (Google image search: http://images.google.com). We describe its quantity, distribution, and levels of noise, reporting each on multiple fine-grained domains.
We consider four domains of fine-grained categories: birds, aircraft, Lepidoptera (a taxonomic order including butterflies and moths), and dogs. For birds and Lepidoptera, we obtained lists of fine-grained categories from Wikipedia, resulting in 10,982 species of birds and 14,553 species of Lepidoptera, denoted L-Bird (“Large Bird”) and L-Butterfly. For aircraft, we assembled a list of 409 types of aircraft by hand (including aircraft in the FGVC-Aircraft  dataset, abbreviated FGVC). For dogs, we combine the 120 dog breeds in Stanford Dogs  with 395 other categories to obtain the 515-category L-Dog. We evaluate on two other fine-grained datasets in addition to FGVC and Stanford Dogs: CUB-200-2011  and Birdsnap , for a total of four evaluation datasets. CUB and Birdsnap include 200 and 500 species of common birds, respectively, FGVC has 100 aircraft variants, and Stanford Dogs contains 120 breeds of dogs. In this section we focus our analysis on the categories in L-Bird, L-Butterfly, and L-Aircraft in addition to the categories in their evaluation datasets.
3.2 Images from the Web
We obtain imagery via Google image search results, using all returned images as images for a given category. For L-Bird and L-Butterfly, queries are for the scientific name of the category, and for L-Aircraft and L-Dog queries are simply for the category name (e.g. “Boeing 737-200” or “Pembroke Welsh Corgi”).
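As a minimal illustration, the query construction just described can be sketched as below; the dictionary format and function name are ours, not part of any released pipeline:

```python
def build_query(category: dict, domain: str) -> str:
    """Form the image-search query string for one fine-grained category.

    Following the text: bird and Lepidoptera categories are queried by
    scientific name, while aircraft and dog categories are queried by
    their plain name (e.g. "Boeing 737-200" or "Pembroke Welsh Corgi").
    """
    if domain in ("bird", "butterfly"):
        return category["scientific_name"]
    return category["name"]
```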
3.2.1 Quantifying the Data.
How much fine-grained data is available? In Fig. 2 we plot distributions of the number of images retrieved for each category and report aggregates across each set of categories. We note several trends: Categories in existing datasets, which are typically common within their fine-grained domain, have more images per category than the long-tail of categories present in the larger L-Bird, L-Aircraft, or L-Butterfly, with the effect most pronounced in L-Bird and L-Butterfly. Further, domains of fine-grained categories have substantially different distributions, i.e. L-Bird and L-Aircraft have more images per category than L-Butterfly. This makes sense – fine-grained categories and domains of categories that are more common and have a larger enthusiast base will have more imagery since more photos are taken of them. We also note that results tend to be limited to roughly 800 images per category, even for the most common categories, which is likely a restriction placed on public search results.
Most striking is the large difference between the number of images available via web search and in existing fine-grained datasets: even Birdsnap, which has an average of 94.8 images per category, contains only 13% as many images as can be obtained with a simple image search. Though their labels are noisy, web searches unveil an order of magnitude more data which can be used to learn fine-grained categories.
In total, for all four domains, we obtained 9.8 million images for 26,458 categories, requiring 151.8GB of disk space (image URLs are available at https://github.com/google/goldfinch).
Though large amounts of imagery are freely available for fine-grained categories, focusing only on scale ignores a key issue: noise. We consider two types of label noise, which we call cross-domain noise and cross-category noise. We define cross-domain noise to be the portion of images that are not of any category in the same fine-grained domain, i.e. for birds, it is the fraction of images that do not contain a bird (examples in Fig. 3). In contrast, cross-category noise is the portion of images that have the wrong label within a fine-grained domain, i.e. an image of a bird with the wrong species label.
To quantify levels of cross-domain noise, we manually label a 1,000 image sample from each set of search results, with results in Fig. 5. Although levels of noise are not too high for any set of categories (max. 34.2% for L-Butterfly), we notice an interesting correlation: cross-domain noise decreases moderately as the number of images per category (Fig. 2) increases. We hypothesize that categories with many search results have a corresponding large pool of images to draw results from, and thus actual search results will tend to be higher-precision.
In contrast to cross-domain noise, cross-category noise is much harder to quantify, since doing so effectively requires ground truth fine-grained labels of query results. To examine cross-category noise from at least one vantage point, we show confusion matrices of given versus predicted labels on 30 categories in the CUB test set (Fig. 7, left) and their web images (Fig. 7, right), generated via a classifier trained on the CUB training set, which acts as a noisy proxy for ground truth labels. In these confusion matrices, cross-category noise is reflected as a strong off-diagonal pattern, while cross-domain noise would manifest as a diffuse pattern of noise, since images from outside the domain are an equally bad fit to all categories. Under this interpretation, the web images show moderately more cross-category noise than the clean CUB test set, though the general confusion pattern is similar.
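A sketch of this proxy analysis in Python: build the given-versus-predicted confusion matrix and measure its off-diagonal mass as a rough indicator of cross-category noise. The helper names and the off-diagonal summary statistic are our illustration, not the paper's exact procedure:

```python
def confusion_matrix(given, predicted, num_classes):
    """M[i][j] counts images with given label i predicted as class j,
    where predictions come from a classifier trained on clean data
    (a noisy proxy for ground truth labels)."""
    M = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(given, predicted):
        M[g][p] += 1
    return M

def off_diagonal_fraction(M):
    """Fraction of examples falling off the diagonal. Under the
    interpretation in the text, a strong off-diagonal pattern reflects
    cross-category noise."""
    total = sum(sum(row) for row in M)
    correct = sum(M[i][i] for i in range(len(M)))
    return (total - correct) / total
```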
We propose a simple, yet effective strategy to reduce the effects of cross-category noise: exclude images that appear in search results for more than one category. This approach, which we refer to as filtering, specifically targets images for which there is explicit ambiguity in the category label (examples in Fig. 7). As we demonstrate experimentally, filtering can improve results while reducing training time via the use of a more compact training set – we show the portion of images kept after filtering in Fig. 5. Agreeing with intuition, filtering removes more images when there are more categories. Anecdotally, we have also tried a few techniques to combat cross-domain noise, but initial experiments did not see any improvement in recognition so we do not expand upon them here. While reducing cross-domain noise should be beneficial, we believe that it is not as important as cross-category noise in fine-grained recognition due to the absence of out-of-domain classes during testing.
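The filtering strategy is simple enough to state in a few lines; this is a sketch assuming search results are held as a mapping from category name to image URLs:

```python
from collections import defaultdict

def filter_cross_category(search_results):
    """Drop images that appear in the search results of more than one
    category -- the simple filtering strategy against cross-category
    noise described in the text.

    search_results: dict mapping category name -> list of image URLs.
    Returns a dict of the same shape with ambiguous images removed.
    """
    # Record which categories each image URL appears under.
    categories_per_url = defaultdict(set)
    for category, urls in search_results.items():
        for url in urls:
            categories_per_url[url].add(category)

    # Keep only images unambiguously associated with a single category.
    return {
        category: [u for u in urls if len(categories_per_url[u]) == 1]
        for category, urls in search_results.items()
    }
```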
4 Data via Active Learning
In this section we briefly describe an active learning-based approach for collecting large quantities of fine-grained data. Active learning and other human-in-the-loop systems have previously been used to create datasets in a more cost-efficient way than manual annotation [74, 12, 47], and our goal is to compare this more traditional approach with simply using noisy data, particularly when considering the application of fine-grained recognition. In this paper, we apply active learning to the 120 dog breeds in the Stanford Dogs  dataset.
Our system for active learning begins by training a classifier on a seed set of input images and labels (i.e. the Stanford Dogs training set), then proceeds by iteratively picking a set of images to annotate, obtaining labels with human annotators, and re-training the classifier. We use a convolutional neural network [32, 54, 25] for the classifier, and now describe the key steps of sample selection and human annotation in more detail.
4.0.1 Sample Selection.
There are many possible criteria for sample selection. We employ confidence-based sampling: for each category c, we select the b·P(c) images with the top class scores f_c(x) as determined by our current model, where P(c) is a desired prior distribution over classes, b is a budget on the number of images to annotate, and f_c(x) is the output of the classifier for class c on image x. The intuition is as follows: even when f_c(x) is large, false positives still occur quite frequently – in Fig. 8 left, observe that the false positive rate is substantial even at the highest confidence range, which might have a large impact on the model. This contrasts with approaches that focus sampling in uncertain regions [33, 2, 40, 17]. We find that images sampled with uncertainty criteria are typically ambiguous and difficult or even impossible for both models and humans to annotate correctly, as demonstrated in Fig. 8 bottom row: unconfident samples are often heavily occluded, at unusual viewpoints, or of mixed, ambiguous breeds, making it unlikely that they can be annotated effectively. This strategy is similar to the "expected model change" sampling criterion, but done for each class independently.
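Confidence-based sampling can be sketched as follows; the exact scoring and bookkeeping in the full system may differ, and the names b and P(c) follow the description above:

```python
import numpy as np

def confidence_based_sample(scores, prior, budget):
    """Select, for each class c, the b * P(c) images with the highest
    classifier score for that class (most confident first).

    scores: (num_images, num_classes) array of classifier outputs f_c(x).
    prior:  desired class prior P(c), shape (num_classes,), sums to 1.
    budget: total annotation budget b.
    Returns a dict mapping class index -> list of selected image indices.
    """
    selected = {}
    for c in range(scores.shape[1]):
        k = int(round(budget * prior[c]))
        # Rank images by descending confidence for class c.
        ranked = np.argsort(scores[:, c])[::-1]
        selected[c] = ranked[:k].tolist()
    return selected
```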
4.0.2 Human Annotation.
Our interface for human annotation of the selected images is shown in Fig. 9. Careful construction of the interface, including the addition of both positive and negative examples, as well as hidden “gold standard” images for immediate feedback, improves annotation accuracy considerably (see Sec. 0.A.2 for quantitative results). Final category decisions are made via majority vote of three annotators.
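The final category decision is a plain majority vote over three annotators; as a minimal sketch (the function name and input format are illustrative):

```python
from collections import Counter

def final_decision(votes):
    """Majority vote over three annotators' category judgments.

    votes: list of three labels (e.g. ["corgi", "corgi", "beagle"]).
    Returns the label chosen by at least two annotators, or None if
    all three disagree.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None
```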
5 Experiments

5.1 Implementation Details
The base classifier we use in all noisy data experiments is the Inception-v3 convolutional neural network architecture, which is among the state of the art methods for generic object recognition [44, 53, 23]. Learning rate schedules are determined by performance on a holdout subset of the training data, which is 10% of the training data for control experiments training on ground truth datasets, or 1% when training on the larger noisy web data. Unless otherwise noted, all recognition results use as input a single crop in the center of the image.
Our active learning comparison uses the Yahoo Flickr Creative Commons 100M dataset as its pool of unlabeled images, which we first pre-filter with a binary dog classifier and localizer, resulting in 1.71 million candidate dogs. We perform up to two rounds of active learning, with a per-round sampling budget comparable to the original dataset size (annotations to be released). For experiments on Stanford Dogs, we use a CNN pre-trained on a version of ILSVRC [44, 13] with the dog data removed, since Stanford Dogs is a subset of the ILSVRC training data.
5.2 Removing Ground Truth from Web Images
One subtle point to be cautious about when using web images is the risk of inadvertently including images from ground truth test sets in the web training data. To address this concern, we performed an aggressive deduplication procedure between all ground truth test sets and their corresponding web images. This process follows Wang et al., a state-of-the-art method for learning a similarity metric between images. We tuned this procedure for high near-duplicate recall, manually verifying its quality. More details are included in Sec. 0.B.
5.3 Main Results
| Dataset       | Training Data  | Acc. |
|---------------|----------------|------|
| CUB           | CUB-GT         | 84.4 |
| CUB           | Web (raw)      | 87.7 |
| CUB           | Web (filtered) | 89.0 |
| FGVC          | FGVC-GT        | 88.1 |
| FGVC          | Web (raw)      | 90.7 |
| FGVC          | Web (filtered) | 91.1 |
| Birdsnap      | Birdsnap-GT    | 78.2 |
| Birdsnap      | Web (raw)      | 76.1 |
| Birdsnap      | Web (filtered) | 78.2 |
| Stanford Dogs | Stanford-GT    | 80.6 |
| Stanford Dogs | Web (raw)      | 78.5 |
| Stanford Dogs | Web (filtered) | 78.4 |
We present our main recognition results in Tab. 1, where we compare performance when the training set consists of either the ground truth training set, raw web images of the categories in the corresponding evaluation dataset, web images after applying our filtering strategy, all web images of a particular domain, or all images including even the ground truth training set.
On CUB-200-2011, the smallest dataset we consider, even using raw search results as training data yields a better model than the annotated training set, with filtering further improving results by 1.3%. For Birdsnap, the largest of the ground truth datasets we evaluate on, raw data mildly underperforms the ground truth training set, though filtering improves results to be on par. On both CUB and Birdsnap, training first on the very large set of categories in L-Bird results in dramatic improvements, improving performance on CUB by a further 2.9% and on Birdsnap by 4.6%. This is an important point: even if the end task consists of classifying only a small number of categories, training with more fine-grained categories yields significantly more effective networks. This can also be thought of as a form of transfer learning within the same fine-grained domain, allowing features learned on a related task to be useful for the final classification problem. When permitted access to the annotated ground truth training sets for additional fine-tuning and domain transfer, results increase further on both CUB and Birdsnap.
For the aircraft categories in FGVC, results are largely similar but weaker in magnitude. Training on raw web data results in a significant gain of 2.6% compared to using the curated training set, and filtering, which did not affect the size of the training set much (Fig. 5), changes results only slightly in a positive direction. Counterintuitively, pre-training on a larger set of aircraft does not improve results on FGVC. Our hypothesis for the difference between birds and aircraft in this regard is this: since there are many more species of birds in L-Bird than there are aircraft in L-Aircraft (10,982 vs 409), not only is the training size of L-Bird larger, but each training example provides stronger information because it distinguishes between a larger set of mutually-exclusive categories. Nonetheless, when access to the curated training set is available for fine-tuning, performance dramatically increases to 94.5%. On Stanford Dogs we see results similar to FGVC, though for dogs we happen to see a mild loss when comparing to the ground truth training set, not much difference with filtering or using L-Dog, and a large boost from adding in the ground truth training set.
An additional factor that can influence performance of web models is domain shift – if images in the ground truth test set have very different visual properties compared to web images, performance will naturally differ. Similarly, if category names or definitions within a dataset are even mildly off, web-based methods will be at a disadvantage without access to the ground truth training set. Adding the ground truth training data fixes this domain shift, making web-trained models quickly recover, with a particularly large gain if the network has already learned a good representation, matching the pattern of results for Stanford Dogs.
5.3.1 Limits of Web-Trained Models.
To push our models to their limits, we additionally evaluate using 144 image crops at test time, averaging predictions across the crops, denoted "(MC)" in Tab. 1. This brings results up to 92.3%/92.8% on CUB (without/with CUB training data), 85.4%/85.4% on Birdsnap, 93.4%/95.9% on FGVC, and 80.8%/85.9% on Stanford Dogs. We note that this is close to estimates of human expert performance on CUB.
| Method         | Annotations | Acc. |
|----------------|-------------|------|
| PB R-CNN       | GT+BB+Parts | 73.9 |
| Weak Sup.      | GT          | 75.0 |
| Noisy Data+CNN | Web         | 92.3 |
5.3.2 Comparison with Prior Work.
We compare our results to prior work on CUB, the most competitive fine-grained dataset, in Tab. 2. While even our baseline model using only ground truth data from Tab. 1 was at state-of-the-art levels, by forgoing the CUB training set and training only on noisy data from the web, our models greatly outperform all prior work. On FGVC, which is more recent and on which fewer works have evaluated, the best prior method we are aware of is the Bilinear CNN model of Lin et al., with an accuracy of 84.1% (ours is 93.4% without FGVC training data, 95.9% with), and on Birdsnap, which is even more recent, the best method we are aware of that uses no extra annotations at test time is the original 66.6% by Berg et al. (ours is 85.4%). On Stanford Dogs, the most competitive related work uses an attention-based recurrent neural network; our results compare favorably both without and with the ground truth training data.
We identify two key reasons for these large improvements. The first is the use of a strong generic classifier. A number of prior works have identified the importance of well-trained CNNs as components in their systems for fine-grained recognition [36, 26, 29, 75, 5], for which our work provides strong evidence. On all four evaluation datasets, our CNN of choice, trained on the ground truth training set alone and without any architectural modifications, performs at or above the previous state of the art. The second reason for improvement is the large utility of noisy web data for fine-grained recognition, which is the focus of this work.
We finally remind the reader that our work focuses on the application-level problem of recognizing a given set of fine-grained categories, which might not come with their own expert-annotated training images. The use of existing test sets serves to provide an accurate measure of performance and put our work in a larger context, but results may not be strictly comparable with prior work that operates within a single given dataset.
5.3.3 Comparison with Active Learning.
We compare using noisy web data with a more traditional active learning-based approach (Sec. 4) under several different settings in Tab. 3. We first verify the efficacy of active learning itself: when training the network from scratch (i.e. no fine-tuning), active learning substantially improves performance, and when fine-tuning, results still improve, albeit more modestly.
How does active learning compare to using web data? Purely using filtered web data compares favorably to the non-fine-tuned active learning models, though it lags somewhat behind the fine-tuned models. To better compare active learning and noisy web data, we factor out the difference in scale by performing an experiment with subsampled active learning data, set to the same size as the filtered web data. Surprisingly, performance is very similar, with only a small advantage for the cleaner, annotated active learning data, highlighting the effectiveness of noisy web data despite the lack of manual annotation. If we furthermore augment the filtered web images with the Stanford Dogs training set, which the active learning method notably used both as training data and as its seed set of images, performance improves to be even slightly better than with the manually-annotated active learning data.
| Training Procedure             | Acc. |
|--------------------------------|------|
| A.L., one round (scratch)      | 65.8 |
| A.L., two rounds (scratch)     | 74.0 |
| A.L., one round (ft)           | 81.6 |
| A.L., one round (ft, subsample)| 78.8 |
| A.L., two rounds (ft)          | 82.1 |
| Web (filtered) + Stanford-GT   | 82.6 |
These experiments indicate that, while more traditional active learning-based approaches towards expanding datasets are effective ways to improve recognition performance given a suitable budget, simply using noisy images retrieved from the web can be nearly as good, if not better. As web images require no manual annotation and are openly available, we believe this is strong evidence for their use in solving fine-grained recognition.
5.3.4 Very Large-Scale Fine-Grained Recognition.
A key advantage of using noisy data is the ability to scale to large numbers of fine-grained classes. However, this poses a challenge for evaluation – it is infeasible to manually annotate images with one of the 10,982 categories in L-Bird or the 14,553 categories in L-Butterfly, and it would be very time-consuming to annotate images even with the 409 categories in L-Aircraft. Therefore, we turn to an approximate evaluation, establishing a rough estimate of true performance. Specifically, we query Flickr for up to 25 images of each category, keeping only those images whose title strictly contains the name of the category, and aggressively deduplicate these images against our training set in order to ensure a fair evaluation. Although this is not a perfect evaluation set, and is thus an area where annotation of fine-grained datasets is particularly valuable, we find that it is remarkably clean on the surface: based on a 1,000-image estimate, we measure the cross-domain noise of L-Bird at only 1%, L-Butterfly at 2.3%, and L-Aircraft at 4.5%. An independent evaluation further measures all sources of noise combined to be only 16% when searching for bird species. In total, this yields 42,115 test images for L-Bird, 42,046 for L-Butterfly, and 3,131 for L-Aircraft.
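The evaluation-set construction described above can be sketched as follows; the data format is an assumption, the case-insensitive title match is our reading of "strictly contains", and the deduplication against training images is omitted for brevity:

```python
def build_eval_set(categories, flickr_photos, per_category_cap=25):
    """Approximate evaluation-set construction: for each category keep
    up to `per_category_cap` Flickr photos whose title contains the
    category name.

    flickr_photos: dict mapping category -> list of (title, url) results.
    Returns a dict mapping category -> list of kept URLs.
    """
    eval_set = {}
    for category in categories:
        kept = [
            url
            for title, url in flickr_photos.get(category, [])
            if category.lower() in title.lower()
        ][:per_category_cap]
        eval_set[category] = kept
    return eval_set
```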
Given the difficulty and noise, performance is surprisingly high: On L-Bird top-1 accuracy is 73.1%/75.8% (1/144 crops), for L-Butterfly it is 65.9%/68.1%, and for L-Aircraft it is 72.7%/77.5%. Corresponding mAP numbers, which are better suited for handling class imbalance, are 61.9, 54.8, and 70.5, reported for the single crop setting. We show qualitative results in Fig. 10. These categories span multiple continents in space (birds, butterflies) and decades in time (aircraft), demonstrating the breadth of categories in the world that can be recognized using only public sources of noisy fine-grained data. To the best of our knowledge, these results represent the largest number of fine-grained categories distinguished by any single system to date.
5.3.5 How Much Data is Really Necessary?
In order to better understand the utility of noisy web data for fine-grained recognition, we perform a control experiment on the web data for CUB. Using the filtered web images as a base, we train models using progressively larger subsets of the results as training data, taking the top-ranked images across categories for each experiment. Performance versus the amount of training data is shown in Fig. 12. Surprisingly, relatively few web images are required to do as well as training on the CUB training set, and adding more noisy web images always helps, even at the limit of search results. Based on this analysis, we estimate that one noisy web image for CUB categories is "worth" 0.507 ground truth training images.
5.3.6 Error Analysis.
Given the high performance of these models, what room is left for improvement? In Fig. 12 we show the taxonomic distribution of the remaining errors on L-Bird. The vast majority of errors (74.3%) are made between very similar classes at the genus level, indicating that most of the remaining errors are indeed between extremely similar categories, and only very few errors (7.4%) are made between dissimilar classes, whose least common ancestor is the “Aves” (i.e. Bird) taxonomic class. This suggests that most errors still made by the models are fairly reasonable, corroborating the qualitative results of Fig. 10.
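The taxonomic bucketing of errors can be sketched with a small helper; the tuple-based taxonomy encoding and the function name are our assumptions:

```python
def lca_rank(taxonomy, species_a, species_b):
    """Rank of the least common ancestor of two species.

    taxonomy: dict mapping species -> (genus, family, order, class)
    tuples, ordered from most to least specific. Returns the first rank
    at which the two species agree ("genus", "family", "order", or
    "class"), or None if they share no listed ancestor. Errors bucketed
    at "genus" are between extremely similar categories; errors at
    "class" (e.g. Aves) are between dissimilar ones.
    """
    ranks = ("genus", "family", "order", "class")
    for rank, a, b in zip(ranks, taxonomy[species_a], taxonomy[species_b]):
        if a == b:
            return rank
    return None
```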
6 Discussion

In this work we have demonstrated the utility of noisy data toward solving the problem of fine-grained recognition. We found that the combination of a generic classification model and web data, filtered with a simple strategy, was surprisingly effective at discriminating fine-grained categories. This approach performs favorably when compared to a more traditional active learning method for expanding datasets, but is even more scalable, which we demonstrated experimentally on up to 14,553 fine-grained categories. One potential limitation of the approach is the availability of imagery for categories either not found or not described in the public domain, for which an alternative method such as active learning may be better suited. Another limitation is the current focus on classification, which may be problematic if applications arise where multiple objects are present or localization is otherwise required. Nonetheless, with these insights on the unreasonable effectiveness of noisy data, we are optimistic for applications of fine-grained recognition in the near future.
We thank Gal Chechik, Chuck Rosenberg, Zhen Li, Timnit Gebru, Vignesh Ramanathan, Oliver Groth, and the anonymous reviewers for valuable feedback.
Appendix 0.A Active Learning Details
Here we provide additional details for our active learning baseline, including further description of the interface, improvements in rater quality as a result of this interface, statistics of the number of positives obtained per class in each round of active learning, and qualitative examples of images obtained.
Designing an effective rater tool is of critical importance when getting non-experts to rate fine-grained categories. We seek to give the raters simple decisions and to provide them with as much information as possible to make the correct decision in a generic and scalable way. Fig. 13 shows our rater interface, which includes the following components to serve this purpose:
0.a.1.1 Instructional positive images
inform the rater of within-class variation. These images are obtained from the seed dataset input to active learning. Many rater tools provide only this, which does not convey a clear class-boundary concept on its own. We also provide links to Google Image Search and encourage raters to research the full space of examples of the class concept.
0.a.1.2 Instructional negative images
help raters define the decision boundary between the right class and easily confused other classes. We show the top two most confused categories, determined by the active learning’s current model. This aids in classification: in Fig. 13, if the rater studies the positive class “Bernese mountain dog”, they may form a mental decision rule based on fur color pattern alone. However, when studying the negative, easily confused classes “Entlebucher” and “Appenzeller”, the rater can refine the decision on more appropriate fine-grained distinctions – in this case, hair length is a key discriminative attribute.
0.a.1.3 Batching questions by class
has the benefit of allowing raters to learn about and focus on one fine-grained category at a time. Batching questions may also allow raters to build a better mental model of the class via a human form of semi-supervised learning, although this phenomenon is more difficult to isolate and measure.
0.a.1.4 Golden questions for rater feedback and quality control.
We use the original supervised seed dataset to add a number of known correct and incorrect images to the batch being rated, which we use to give short- and long-term feedback to raters. Short-term feedback comes in the form of a pop-up window informing the rater the moment they make an incorrect judgment, allowing them to update their mental model while working on the task. Long-term feedback summarizes a day's worth of rating to give the rater a summary of overall performance.
0.a.2 Rater Quality Improvements
To determine the impact of our annotation framework improvements for fine-grained categories, we performed a control experiment with a more standard crowdsourcing interface, which provides only a category name, description, and image search link. Annotation quality is determined on a set of difficult binary questions (images mistaken by a classifier on the Stanford Dogs test set). Using our interface, annotators were both more accurate and faster, with a 16.5% relative reduction in error (from 28.5% to 23.8%) and a more than twofold improvement in speed (from 4.1 to 1.68 seconds per image).
0.a.3 Annotation Statistics and Examples
In Fig. 14 we show the distribution of images judged correct by human annotators after active learning selection of 1000 images per class for Stanford Dogs classes. The categories are sorted by the number of positive training examples collected in the first iteration of active learning. The 10 categories with the most positive training examples collected after both rounds of mining are: Pug, Golden Retriever, Boston Terrier, West Highland White Terrier, Labrador Retriever, Boxer, Maltese, German Shepherd, Pembroke Welsh Corgi, and Beagle. The 10 categories with the fewest positive training examples are: Kerry Blue Terrier, Komondor, Irish Water Spaniel, Curly Coated Retriever, Bouvier des Flandres, Clumber Spaniel, Bedlington Terrier, Afghan Hound, Affenpinscher, and Sealyham Terrier. These counts are influenced by the true counts of categories in the YFCC100M  dataset and our active learner’s ability to find them.
In Fig. 15, we show positive training examples obtained from active learning for select categories, comparing examples obtained in iterations 1 and 2.
Appendix 0.B Deduplication Details
Here we provide more details on our method for removing ground truth test images from web search results, a step we took great care in performing. Our general approach follows Wang et al., a state-of-the-art method for learning a similarity metric between images. To scale to the millions of images considered in this work, we binarize the output embedding for an efficient hashing-based exact search. Hamming distance then corresponds to dissimilarity: identical images have distance 0; images with different resolutions, aspect ratios, or slightly different crops tend to have distances between roughly 4 and 8; more substantial variations, e.g. images of different views from the same photographer, or very different crops, have distances up to roughly 10; beyond this, the vast majority of image pairs are actually distinct. Qualitative examples are provided in Fig. 16. We tuned our dissimilarity threshold for recall and manually verified it; the goal is to ensure that images with even a moderate degree of similarity to test images did not appear in our training set. For example, of a sample of 183 image pairs at distance 16 in the large-scale bird experiments, zero were judged by a human to be too similar, and we nonetheless used a still more conservative threshold of 18. In the case of L-Bird, 2,996 images were removed as too similar to an image in either the CUB or Birdsnap test set.
Appendix 0.C Remaining Errors: Qualitative
Here we highlight one type of error our web-trained model made on CUB: finding errors in the test set itself. We show an example in Fig. 17, where the true species of each image is actually a bird species not among the 200 CUB species. This highlights one potential advantage of our approach: by relying on category names, web training data is tied more strongly to the semantic meaning of a category rather than simply a 1-of-K label. This also provides evidence for the “domain shift” hypothesis when fine-tuning on ground truth datasets, as irregularities like this can be learned, resulting in higher performance on the benchmark dataset under consideration.
Appendix 0.D Network Visualization
In order to examine the impact of web-trained models of fine-grained recognition from another vantage point, here we present one visualization of network internals. Specifically, in Fig. 18
we visualize the gradient of the squared norm of the last convolutional layer’s activations with respect to the input image, shown as a function of training data. This provides some indication of the importance of each pixel to the overall network activation. Though these examples are only qualitative, we observe that gradients for the network trained on L-Bird are generally more focused on the bird than gradients for the network trained on CUB, indicating that the L-Bird network has learned a better representation of which parts of an image are discriminative.
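The quantity visualized in Fig. 18 can be sketched generically. Here a finite-difference estimate of the gradient of the squared feature norm stands in for backpropagation through the actual network, and the linear feature map in the test below is purely illustrative:

```python
import numpy as np

def saliency(feature_map, x, eps=1e-6):
    """Gradient of ||f(x)||^2 with respect to the input x, estimated by
    central finite differences so it works for any black-box feature map.
    The per-pixel magnitude of this gradient is the quantity visualized;
    in practice it is computed in one backward pass, not by differencing."""
    grad = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        xp, xm = x.copy().ravel(), x.copy().ravel()
        xp[i] += eps
        xm[i] -= eps
        fp = np.sum(feature_map(xp.reshape(x.shape)) ** 2)
        fm = np.sum(feature_map(xm.reshape(x.shape)) ** 2)
        grad.ravel()[i] = (fp - fm) / (2 * eps)
    return grad
```

For a linear feature map f(x) = Ax, this recovers the analytic gradient 2AᵀAx, which is a quick way to validate the estimate.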
References
-  Angelova, A., Zhu, S., Lin, Y.: Image segmentation for large-scale subcategory flower recognition. In: Workshop on Applications of Computer Vision (WACV). pp. 39–45. IEEE (2013)
-  Balcan, M.F., Broder, A., Zhang, T.: Margin based active learning. In: Learning Theory, pp. 35–50. Springer (2007)
-  Berg, T., Belhumeur, P.N.: POOF: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In: Computer Vision and Pattern Recognition (CVPR). pp. 955–962. IEEE (2013)
-  Berg, T., Liu, J., Lee, S.W., Alexander, M.L., Jacobs, D.W., Belhumeur, P.N.: Birdsnap: Large-scale fine-grained visual categorization of birds. In: Computer Vision and Pattern Recognition (CVPR) (June 2014)
-  Branson, S., Van Horn, G., Perona, P., Belongie, S.: Improved bird species recognition using pose normalized deep convolutional nets. In: British Machine Vision Conference (BMVC) (2014)
-  Branson, S., Van Horn, G., Wah, C., Perona, P., Belongie, S.: The ignorant led by the blind: A hybrid human–machine vision system for fine-grained categorization. International Journal of Computer Vision (IJCV) pp. 1–27 (2014)
-  Chai, Y., Lempitsky, V., Zisserman, A.: Bicos: A bi-level co-segmentation method for image classification. In: International Conference on Computer Vision (ICCV). IEEE (2011)
-  Chai, Y., Lempitsky, V., Zisserman, A.: Symbiotic segmentation and part localization for fine-grained categorization. In: International Conference on Computer Vision (ICCV). pp. 321–328. IEEE (2013)
-  Chai, Y., Rahtu, E., Lempitsky, V., Van Gool, L., Zisserman, A.: Tricos: A tri-level class-discriminative co-segmentation method for image classification. In: European Conference on Computer Vision (ECCV), pp. 794–807. Springer (2012)
-  Chen, X., Gupta, A.: Webly supervised learning of convolutional networks. In: International Conference on Computer Vision (ICCV). IEEE (2015)
-  Chen, X., Shrivastava, A., Gupta, A.: Neil: Extracting visual knowledge from web data. In: International Conference on Computer Vision (ICCV). pp. 1409–1416. IEEE (2013)
-  Collins, B., Deng, J., Li, K., Fei-Fei, L.: Towards scalable dataset construction: An active learning approach. In: European Conference on Computer Vision (ECCV), pp. 86–98. Springer (2008)
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A Large-Scale Hierarchical Image Database. In: Computer Vision and Pattern Recognition (CVPR) (2009)
-  Deng, J., Krause, J., Fei-Fei, L.: Fine-grained crowdsourcing for fine-grained recognition. In: Computer Vision and Pattern Recognition (CVPR). pp. 580–587 (2013)
-  Divvala, S.K., Farhadi, A., Guestrin, C.: Learning everything about anything: Webly-supervised visual concept learning. In: Computer Vision and Pattern Recognition (CVPR). pp. 3270–3277. IEEE (2014)
-  Duan, K., Parikh, D., Crandall, D., Grauman, K.: Discovering localized attributes for fine-grained recognition. In: Computer Vision and Pattern Recognition (CVPR). pp. 3474–3481. IEEE (2012)
-  Erkan, A.N.: Semi-supervised learning via generalized maximum entropy. Ph.D. thesis, New York University (2010)
-  Farrell, R., Oza, O., Zhang, N., Morariu, V.I., Darrell, T., Davis, L.S.: Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance. In: International Conference on Computer Vision (ICCV). pp. 161–168. IEEE (2011)
-  Fergus, R., Fei-Fei, L., Perona, P., Zisserman, A.: Learning object categories from internet image searches. Proceedings of the IEEE 98(8), 1453–1466 (2010)
-  Gavves, E., Fernando, B., Snoek, C.G., Smeulders, A.W., Tuytelaars, T.: Fine-grained categorization by alignments. In: International Conference on Computer Vision (ICCV). pp. 1713–1720. IEEE (2013)
-  Gavves, E., Fernando, B., Snoek, C.G., Smeulders, A.W., Tuytelaars, T.: Local alignments for fine-grained categorization. International Journal of Computer Vision (IJCV) pp. 1–22 (2014)
-  Goering, C., Rodner, E., Freytag, A., Denzler, J.: Nonparametric part transfer for fine-grained recognition. In: Computer Vision and Pattern Recognition (CVPR). pp. 2489–2496. IEEE (2014)
-  He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition (CVPR). IEEE (2016)
-  Hinchliff, C.E., Smith, S.A., Allman, J.F., Burleigh, J.G., Chaudhary, R., Coghill, L.M., Crandall, K.A., Deng, J., Drew, B.T., Gazis, R., Gude, K., Hibbett, D.S., Katz, L.A., Laughinghouse, H.D., McTavish, E.J., Midford, P.E., Owen, C.L., Ree, R.H., Rees, J.A., Soltis, D.E., Williams, T., Cranston, K.A.: Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proceedings of the National Academy of Sciences (2015), http://www.pnas.org/content/early/2015/09/16/1423041112.abstract
-  Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: Neural Information Processing Systems (NIPS) (2015)
-  Khosla, A., Jayadevaprakash, N., Yao, B., Fei-Fei, L.: Novel dataset for fine-grained image categorization. In: First Workshop on Fine-Grained Visual Categorization, Conference on Computer Vision and Pattern Recognition (CVPR). Colorado Springs, CO (June 2011)
-  Krause, J., Gebru, T., Deng, J., Li, L.J., Fei-Fei, L.: Learning features and parts for fine-grained recognition. In: International Conference on Pattern Recognition (ICPR). Stockholm, Sweden (August 2014)
-  Krause, J., Jin, H., Yang, J., Fei-Fei, L.: Fine-grained recognition without part annotations. In: Computer Vision and Pattern Recognition (CVPR). IEEE (2015)
-  Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13). IEEE (2013)
-  Kumar, N., Belhumeur, P.N., Biswas, A., Jacobs, D.W., Kress, W.J., Lopez, I.C., Soares, J.V.: Leafsnap: A computer vision system for automatic plant species identification. In: European Conference on Computer Vision (ECCV), pp. 502–516. Springer (2012)
-  LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
-  Lewis, D.D., Catlett, J.: Heterogeneous uncertainty sampling for supervised learning. In: International Conference on Machine Learning (ICML). pp. 148–156 (1994)
-  Li, L.J., Fei-Fei, L.: Optimol: automatic online picture collection via incremental model learning. International Journal of Computer Vision (IJCV) 88(2), 147–168 (2010)
-  Lin, T., Maire, M., Belongie, S., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014), http://arxiv.org/abs/1405.0312
-  Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear CNN models for fine-grained visual recognition. In: International Conference on Computer Vision (ICCV). IEEE (2015)
-  Liu, J., Kanazawa, A., Jacobs, D., Belhumeur, P.: Dog breed classification using part localization. In: European Conference on Computer Vision (ECCV), pp. 172–185. Springer (2012)
-  Maji, S., Kannala, J., Rahtu, E., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. Tech. rep. (2013)
-  Mnih, V., Hinton, G.E.: Learning to label aerial images from noisy data. In: International Conference on Machine Learning (ICML). pp. 567–574 (2012)
-  Mozafari, B., Sarkar, P., Franklin, M., Jordan, M., Madden, S.: Scaling up crowd-sourcing to very large datasets: a case for active learning. Proceedings of the VLDB Endowment 8(2), 125–136 (2014)
-  Nilsback, M.E., Zisserman, A.: A visual vocabulary for flower classification. In: Computer Vision and Pattern Recognition (CVPR). vol. 2, pp. 1447–1454. IEEE (2006)
-  Pu, J., Jiang, Y.G., Wang, J., Xue, X.: Which looks like which: Exploring inter-class relationships in fine-grained visual categorization. In: European Conference on Computer Vision (ECCV), pp. 425–440. Springer (2014)
-  Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., Rabinovich, A.: Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596 (2014)
-  Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) pp. 1–42 (April 2015)
-  Schroff, F., Criminisi, A., Zisserman, A.: Harvesting image databases from the web. Pattern Analysis and Machine Intelligence (PAMI) 33(4), 754–766 (2011)
-  Sermanet, P., Frome, A., Real, E.: Attention for fine-grained categorization. arXiv preprint arXiv:1412.7054 (2014)
-  Settles, B.: Active learning literature survey. Tech. rep., University of Wisconsin–Madison (2010)
-  Settles, B., Craven, M., Ray, S.: Multiple-instance active learning. In: Advances in Neural Information Processing Systems (NIPS). pp. 1289–1296 (2008)
-  Shih, K.J., Mallya, A., Singh, S., Hoiem, D.: Part localization using multi-proposal consensus for fine-grained categorization. In: British Machine Vision Conference (BMVC) (2015)
-  Simon, M., Rodner, E.: Neural activation constellations: Unsupervised part model discovery with convolutional networks. In: International Conference on Computer Vision (ICCV) (2015)
-  Simon, M., Rodner, E., Denzler, J.: Part detector discovery in deep convolutional neural networks. In: Asian Conference on Computer Vision (ACCV). vol. 2, pp. 162–177 (2014)
-  Sukhbaatar, S., Fergus, R.: Learning from noisy labels with deep neural networks. arXiv preprint arXiv:1406.2080 (2014)
-  Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261 (2016)
-  Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Computer Vision and Pattern Recognition (CVPR) (2015)
-  Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Computer Vision and Pattern Recognition (CVPR). IEEE (2016)
-  Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: The new data and new challenges in multimedia research. arXiv preprint arXiv:1503.01817 (2015)
-  Torralba, A., Efros, A., et al.: Unbiased look at dataset bias. In: Computer Vision and Pattern Recognition (CVPR). pp. 1521–1528. IEEE (2011)
-  Van Horn, G., Branson, S., Farrell, R., Haber, S., Barry, J., Ipeirotis, P., Perona, P., Belongie, S.: Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In: Computer Vision and Pattern Recognition (CVPR). IEEE (2015)
-  Vedaldi, A., Mahendran, S., Tsogkas, S., Maji, S., Girshick, B., Kannala, J., Rahtu, E., Kokkinos, I., Blaschko, M.B., Weiss, D., Taskar, B., Simonyan, K., Saphra, N., Mohamed, S.: Understanding objects in detail with fine-grained attributes. In: Computer Vision and Pattern Recognition (CVPR) (2014)
-  Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The Caltech-UCSD Birds-200-2011 Dataset. Tech. Rep. CNS-TR-2011-001, California Institute of Technology (2011)
-  Wah, C., Belongie, S.: Attribute-based detection of unfamiliar classes with humans in the loop. In: Computer Vision and Pattern Recognition (CVPR). pp. 779–786. IEEE (2013)
-  Wah, C., Branson, S., Perona, P., Belongie, S.: Multiclass recognition and part localization with humans in the loop. In: International Conference on Computer Vision (ICCV). pp. 2524–2531. IEEE (2011)
-  Wah, C., Horn, G., Branson, S., Maji, S., Perona, P., Belongie, S.: Similarity comparisons for interactive fine-grained categorization. In: Computer Vision and Pattern Recognition (CVPR) (2014)
-  Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., Wu, Y.: Learning fine-grained image similarity with deep ranking. In: Computer Vision and Pattern Recognition (CVPR). pp. 1386–1393. IEEE (2014)
-  Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-UCSD Birds 200. Tech. Rep. CNS-TR-2010-001, California Institute of Technology (2010)
-  Xiao, T., Xu, Y., Yang, K., Zhang, J., Peng, Y., Zhang, Z.: The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In: Computer Vision and Pattern Recognition (CVPR). IEEE (2015)
-  Xiao, T., Xia, T., Yang, Y., Huang, C., Wang, X.: Learning from massive noisy labeled data for image classification. In: Computer Vision and Pattern Recognition (CVPR). IEEE (2015)
-  Xie, S., Yang, T., Wang, X., Lin, Y.: Hyper-class augmented and regularized deep learning for fine-grained image classification. In: Computer Vision and Pattern Recognition (CVPR). IEEE (2015)
-  Xu, Z., Huang, S., Zhang, Y., Tao, D.: Augmenting strong supervision using web data for fine-grained categorization. In: International Conference on Computer Vision (ICCV) (2015)
-  Yang, L., Luo, P., Loy, C.C., Tang, X.: A large-scale car dataset for fine-grained categorization and verification. In: Computer Vision and Pattern Recognition (CVPR). IEEE (2015)
-  Yang, S., Bo, L., Wang, J., Shapiro, L.G.: Unsupervised template learning for fine-grained object recognition. In: Advances in Neural Information Processing Systems (NIPS). pp. 3122–3130 (2012)
-  Yao, B., Bradski, G., Fei-Fei, L.: A codebook-free and annotation-free approach for fine-grained image categorization. In: Computer Vision and Pattern Recognition (CVPR). pp. 3466–3473. IEEE (2012)
-  Yao, B., Khosla, A., Fei-Fei, L.: Combining randomization and discrimination for fine-grained image categorization. In: Computer Vision and Pattern Recognition (CVPR). pp. 1577–1584. IEEE (2011)
-  Yu, F., Zhang, Y., Song, S., Seff, A., Xiao, J.: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365 (2015)
-  Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based r-cnns for fine-grained category detection. In: European Conference on Computer Vision (ECCV), pp. 834–849. Springer (2014)
-  Zhang, N., Farrell, R., Darrell, T.: Pose pooling kernels for sub-category recognition. In: Computer Vision and Pattern Recognition (CVPR). pp. 3665–3672. IEEE (2012)
-  Zhang, N., Farrell, R., Iandola, F., Darrell, T.: Deformable part descriptors for fine-grained recognition and attribute prediction. In: International Conference on Computer Vision (ICCV). pp. 729–736. IEEE (2013)
-  Zhang, Y., Wei, X.s., Wu, J., Cai, J., Lu, J., Nguyen, V.A., Do, M.N.: Weakly supervised fine-grained image categorization. arXiv preprint arXiv:1504.04943 (2015)