Geo-Aware Networks for Fine Grained Recognition

06/04/2019 ∙ by Grace Chu, et al. ∙ Google

Fine grained recognition distinguishes among categories with subtle visual differences. To help identify fine grained categories, information beyond the image itself has been used. However, there has been little effort on using geolocation information to improve fine grained classification accuracy. Our contributions to this field are twofold. First, to the best of our knowledge, this is the first paper to systematically examine various ways of incorporating geolocation information into fine grained image classification - from geolocation priors, to post-processing, to feature modulation. Second, since no existing fine grained dataset has complete geolocation information, we introduce, and will make public, two fine grained datasets with geolocation by providing complementary information to existing popular datasets - iNaturalist and YFCC100M. Results on these datasets show that the best geo-aware network achieves an 8.9% top-1 accuracy increase on iNaturalist and a 5.9% increase on YFCC100M, compared with image only models. In addition, for a small image baseline model like Mobilenet V2, the best geo-aware network gives a 12.6% top-1 accuracy increase, achieving even higher performance than Inception V3 models without geolocation. Our work gives incentive to use geolocation information to improve fine grained recognition for both server and on-device models.


1 Introduction

Fine grained recognition identifies the subordinate categories of an object, e.g. recognizing the species of cats, dogs, flowers and birds [28, 35, 49]. Instead of just recognizing a tree, it recognizes the type of tree [53]. Besides natural world verticals [12], man-made objects (e.g. makes of cars) [30, 58] and fashion styles [16] also need fine grained recognition techniques. Fine grained recognition can also benefit other research areas, such as generating images based on fine grained categories [6] or text descriptions [55].

Fine grained recognition is very challenging because the visual differences among fine grained categories are subtle [14]. Moreover, images are often photographed at angles that fail to capture those subtle differences. To overcome this difficulty, researchers have used various kinds of complementary information about the fine grained categories, such as attributes, object pose [8, 62], object parts [61], and text descriptions [38, 24], among others [52, 21].

Geolocation has been used in coarse grained object classification [45, 57], clustering and retrieval [17, 41, 4], study of human mobility [11, 46], and social network user profile studies [65]. However, there is little work on using geolocation to improve fine grained categorization. To the best of our knowledge, this is the first paper systematically examining the effects of using geolocation information in fine grained recognition.

Figure 1: Western gray squirrel and its habitat heatmap.

Intuitively, geolocation can be informative when a fine grained category has a certain territory. Take Figure 1 as an example: the western gray squirrel is a species of squirrel that lives only on the west coast of the United States. Therefore, if a squirrel image is taken on the east coast, for example in New York, it should not be identified as a western gray squirrel.

In this paper, we examine the effectiveness of using geolocation on fine grained recognition problems. Unlike many geolocation related works which explore different derived geo features [45] or combine geo features with other metadata [22], we focus on geo-aware model architectures which can best use geo features. For this purpose, we use the most common geo information, i.e. latitude and longitude, and study different ways of incorporating it into the image model. We first examine an intuitive way of using geolocation priors, where we discuss both a Bayesian approach and a whitelist-based method. Then, we examine a post-processing method where a geolocation network is combined with a pre-trained and frozen image network at the logits layer; significant improvement is observed with this model. Finally, to see whether geolocation can affect image feature learning, we examine a feature modulation model, which yields results comparable to the post-processing model.

In order to demonstrate the effectiveness of our geo-aware models, we introduce two fine grained datasets with geolocation information. We obtained these datasets by integrating the geolocation information contained in material complementary to the existing image only datasets. The complementary material and the code necessary to process it will be released with this paper.

The rest of the paper is organized as follows. Section 2 gives an overview of related work. Section 3 presents the three geo-aware networks we examine in this paper. Two fine grained datasets with geolocation are introduced in Section 4. Experimental results are presented in Section 5, and Section 6 concludes the paper.

2 Related Works

In this section, we give an overview of common techniques used in the fine grained recognition field and of how geolocation information is used in image classification.

2.1 Fine Grained Recognition

Fine grained recognition differs from general visual recognition mainly in two aspects: different fine grained categories usually have so little visual difference that only domain experts can tell them apart; and rare subordinate objects are observed less often while commonly seen ones dominate the dataset, leading to a long-tail label frequency distribution in such problems [64, 25]. Therefore, although advances in general convolutional neural networks (CNNs) [31, 40, 43, 44, 42, 23, 26, 39] can lead to progress in fine grained recognition, more research effort is still needed.

To deal with the subtle visual differences of fine grained recognition, researchers have tried various directions. For architecture changes, recent work on bilinear CNNs has proven successful at learning localized feature interactions [34, 20, 13]. To locate the subtle differences, researchers have used attention networks [63, 19, 36] and also tried separating objects into body parts [61]. As different poses of an object make it harder to find the subtle differences, work on normalizing or unifying poses has been examined [8, 62]. Additional information about the objects, such as attributes [52, 21] and text descriptions [38, 24], is also important for distinguishing fine grained categories. Researchers have also tried using easily accessible web images to augment datasets [29, 56] and human verification to augment label annotations [9, 15]. In addition, transfer learning has been used to address the long tail problem [12].

In this paper, we focus on the difficulty of subtle visual differences and show that geolocation information can help identify fine grained categories.

2.2 Geolocation in Image Classification

Geolocation has been used in various research areas such as image categorization, search, and retrieval [17, 18, 4] and human movement [11, 46]. For the purpose of this paper, we focus on works related to image classification.

As its most direct application, geolocation has been widely used in place related classification, but mainly as a prediction target from image input [54, 5, 32, 60]. Yet, there are also some works in this field which use geolocation to classify places [57, 59]. Specifically, Yu et al. use geolocation and season information to help identify scenes like snow and sand [59]. Yan et al. use geolocation information to find place types that tend to occur nearby, and then use them to re-weight the classification results on public places like airports, museums, and universities [57].

For coarse grained classification of concepts like snow, monuments, and waves, Tang et al. used 6 geolocation related features and concatenated them with the image model output before the softmax [45]. One of the 6 geolocation related features in this work is latitude and longitude, while the other features incorporated extra information, such as geographic maps and hashtags from Instagram. To solve similar problems, Liao et al. found neighbor images taken near the target image, and then used the tag distribution of the neighbor images as a feature to feed into support vector machine (SVM) classifiers [33].

There are, however, only a few works in fine grained recognition that have tried using geolocation to improve accuracy. Berg et al. [7] made a simulated geolocation fine grained dataset by combining an image only dataset and a geo only dataset; Bayesian geolocation priors were then used to improve classification accuracy. A few participants of the PlantCLEF2016 competition [22] tried using geolocation information. The competition contains plant species in and around France, where only a minority of the images contain geolocation. A few non neural network based methods were tried, but with no obvious improvements [10, 48].

3 Geo-Aware Networks

In this section, we study three ways of incorporating geolocation into image feature based fine-grained models.

3.1 Geolocation Priors

As discussed in the introduction, objects in natural world verticals are distributed over the earth according to certain geographical traits. Assuming the data samples containing geographical information are observed independently in both the training and test datasets, we can extract the geolocation based distribution from the training data. There are two intuitive ways of utilizing this distribution without additional model training or any change to the image only classifier.

Bayesian Priors:

From a Bayesian inference point of view, given an image observation without additional information, traditional fine-grained recognition can be viewed as maximum likelihood estimation (MLE):

$\hat{y} = \arg\max_{y} p(x \mid y)$ (1)

where $y$ denotes the image label, $x$ denotes the image observation, and $p(x \mid y)$ denotes the likelihood of an observation $x$ given the label $y$.

Now assume that a prior distribution $p(y \mid G)$ over the fine-grained labels exists and follows some geographical traits, where $G$ denotes the geolocation of the examined image. Then, it allows us to make a maximum a posteriori (MAP) estimation:

$\hat{y} = \arg\max_{y} p(x \mid y)\, p(y \mid G)$ (2)

Label Whitelisting: A different way of utilizing the geographical information is to restrict the inference result with a geo-restricted whitelist. The whitelist works as a gating function which restricts the output label to be one of the labels that have data observations within a geo-restricted radius $r$ of the geolocation $G$:

$\hat{y} = \arg\max_{y} p(x \mid y)\, \mathbb{1}_G(y)$ (3)

where $\mathbb{1}_G(y)$ is an indicator function which equals one when label $y$ has observations within the geo-restricted radius $r$ of $G$, and zero otherwise. Similar to the Bayesian approach, the whitelist can be inferred from the geolocation based histogram of the training data.
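To make the two schemes concrete, below is a minimal NumPy sketch of how Eqs. (2) and (3) could be applied at inference time. The names image_probs (per-label likelihood scores from the image classifier) and geo_counts (the label histogram of training observations within radius r of the test geolocation) are our own, and the numbers are toy values, not from the paper.

```python
import numpy as np

def bayesian_prior_prediction(image_probs, geo_counts):
    """MAP estimate of Eq. (2): weight image likelihoods by the geolocation prior."""
    prior = geo_counts / geo_counts.sum()        # normalize histogram into p(y | G)
    return int(np.argmax(image_probs * prior))

def whitelist_prediction(image_probs, geo_counts):
    """Gated estimate of Eq. (3): zero out labels never observed near this geolocation."""
    whitelist = (geo_counts > 0).astype(float)   # indicator 1_G(y)
    return int(np.argmax(image_probs * whitelist))

# Toy example with 4 labels: label 2 looks most likely visually,
# but has never been observed within radius r of the query geolocation.
image_probs = np.array([0.10, 0.20, 0.45, 0.25])
geo_counts = np.array([3, 12, 0, 5])             # training observations within radius r
print(bayesian_prior_prediction(image_probs, geo_counts))  # -> 1 (prior favors label 1)
print(whitelist_prediction(image_probs, geo_counts))       # -> 3 (label 2 is gated out)
```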

3.2 Post-Processing Models

We consider a post-processing model to be any model that does not touch pixels, but instead consumes one or more image classifiers or embeddings. Here, we trained a post-processing model that consumes the output of the baseline image classifier together with geolocation coordinates.

The model evaluated below accepts geolocation in its simplest form: a vector of length two containing latitude and longitude, normalized by dividing by constant values. We also experimented with Earth-centered, Earth-fixed rectangular coordinates and multi-scale one-hot S2 cell representations [51]; these made little difference in performance.

Figure 2: Network architecture for post-processing models. Logistic⁻¹ is the inverse of the logistic function. “FCL” denotes fully connected layer. The last FCL outside the geo net box is the logits layer.

Geolocation is then processed by three fully connected layers of sizes 256, 128, and 128, followed by a layer of logits with size equal to the output label map. These are then added to the logits of the image classifier, or $l(y) = l_i(y) + l_g(y)$, where $l_i(y) = \ln \frac{p_i(y)}{1 - p_i(y)}$ is recovered from the image classifier's output probabilities. Figure 2 shows a diagram of the architecture. In this late-fusion architecture, no units jointly encode appearance and location. We also experimented with models where the output of the image classifier or an image embedding was concatenated with one of the fully-connected layers in the geolocation network; in these models, units do jointly encode appearance and location. However, adding these visual inputs did not affect the performance of the post-processing model. This may suggest that appearance and location are not tightly interdependent.
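The architecture maps directly onto code. The Keras sketch below follows the layer sizes above; the builder name, the ReLU activations on the geo FCLs, and the choice to feed the frozen image classifier's logits in as a precomputed second input (so no gradients reach the image model) are our own assumptions.

```python
import tensorflow as tf

NUM_LABELS = 5089  # size of the iNaturalist label map

def build_post_processing_model(num_labels=NUM_LABELS):
    """Late-fusion geo net: add geolocation logits to frozen image-classifier logits."""
    geo_in = tf.keras.Input(shape=(2,), name="latlng")  # normalized latitude/longitude
    # Logits recovered from the frozen image classifier, logit(p) = ln(p / (1 - p)).
    image_logits_in = tf.keras.Input(shape=(num_labels,), name="image_logits")

    x = geo_in
    for units in (256, 128, 128):                       # FCL sizes from Section 3.2
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    geo_logits = tf.keras.layers.Dense(num_labels)(x)   # logits layer, no activation

    fused = tf.keras.layers.Add()([image_logits_in, geo_logits])
    probs = tf.keras.layers.Activation("sigmoid")(fused)  # inverse of the logit transform
    return tf.keras.Model(inputs=[geo_in, image_logits_in], outputs=probs)
```

Because the image logits arrive as an input rather than as a sub-network, the image classifier's weights cannot be perturbed by training this model.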

During training, the weights of the baseline image classifier are fixed, and gradients are not pushed back through the image classifier. One disadvantage of this is that the image classifier may waste effort attempting to distinguish two visually-similar labels that could have been easily distinguished by geolocation alone.

Post-processing models offer some practical advantages over jointly training over pixels and geolocation. Learning rates, hyperparameters, and loss functions are decoupled between the two models, and can be tuned separately. Similarly, the selection and balancing of training data can be performed independently for image models versus geolocation models. If labeled training images without geolocation are available, they can be used to train the image classifier but omitted when training the post-processing model. If label noise is correlated with appearance but not with location (e.g. mustang cars mixed in with horses), then the post-processing model may benefit from the inclusion of noisier training data sources that harm image classifier performance.

Another feature of the post-processing model is that the geolocation network only needs to learn the residual between the baseline image classifier output and the ground-truth. The strength of the gradients that get pushed back into the post-processing geolocation network is proportional to the error of the baseline classifier. If the baseline image classifier already classifies a label perfectly, then no geolocation model will be learned for that label, since none is needed. Thus, the post-processing model minimizes its reliance on geolocation cues: it relies on geolocation only in proportion to how much it improves an image classifier that was previously trained to maximize performance.

Adding the logits of the geolocation and image networks has some theoretical basis. Suppose appearance and location were conditionally independent of each other given the ground-truth label. Then $P(I, G \mid y) = P(I \mid y)\, P(G \mid y)$, where $I$ is the image, $G$ is the geolocation, and $y$ is the ground-truth label. For convenience, define the likelihood ratio $L(y \mid X) = P(X \mid y) / P(X \mid \bar{y})$, where $\bar{y}$ denotes the condition that label $y$ is false. Then:

$\operatorname{logit} P(y \mid I, G) = \operatorname{logit} P(y \mid I) + \ln L(y \mid G)$

Thus, if conditional independence holds, then the post-processing network is optimal in the sense that it outputs the exact posterior probability $P(y \mid I, G)$ if the output of the image classifier equals $P(y \mid I)$ and the geolocation logits activation equals $\ln L(y \mid G)$. In this case, conditional independence implies that the appearance of a label does not change depending on its location. Note that conditional independence is a sufficient condition for the model to behave optimally for some set of weights. It is not a necessary condition; certain violations of conditional independence may still be captured by this model. For example, suppose geolocation could sometimes be estimated from the background of the image. Location and appearance are not conditionally independent in this case. In this scenario, models using Bayesian priors would double-count location evidence, adjusting scores based on location even though the image classifier had already factored it in. In contrast, since the post-processing model trains the geolocation network on the residual between the baseline image classifier output and the ground-truth, no double-counting occurs; the learned geolocation model is only as strong as the geolocation evidence not already captured by the baseline image classifier.
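As a sanity check of this optimality claim, the short script below (with toy probabilities of our own choosing) computes the posterior both directly from Bayes' rule under conditional independence and via the logit-plus-log-likelihood-ratio sum; the two agree.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy label y with prior P(y), plus class-conditional evidence for image I and geolocation G.
p_y = 0.2
p_I_given_y, p_I_given_not_y = 0.6, 0.3   # appearance evidence
p_G_given_y, p_G_given_not_y = 0.5, 0.1   # location evidence

# Direct posterior under conditional independence: P(I, G | y) = P(I | y) P(G | y).
num = p_I_given_y * p_G_given_y * p_y
den = num + p_I_given_not_y * p_G_given_not_y * (1 - p_y)
posterior_direct = num / den

# Post-processing view: image posterior logit plus geolocation log-likelihood ratio.
p_y_given_I = (p_I_given_y * p_y) / (p_I_given_y * p_y + p_I_given_not_y * (1 - p_y))
log_lr_geo = math.log(p_G_given_y / p_G_given_not_y)
posterior_fused = sigmoid(logit(p_y_given_I) + log_lr_geo)

print(posterior_direct, posterior_fused)  # both ~0.714
```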

3.3 Feature Modulation Models

To examine whether geolocation can have a deeper effect on image feature learning, we built networks that integrate geolocation information into the image features.

Figure 3: Network architecture of using geolocation to affect image features. FCL* represents an FCL without activation, followed by a reshape operation to match the dimension of the feature it is added to.

Similar to the post-processing models, we use addition to modulate image features with geolocation features. As shown in Figure 3, latitude and longitude first go through a set of fully connected layers. Then, depending on the shape of each image feature, the output of the geolocation network goes through a different sized fully connected layer (without activation) and is reshaped before being added to the image feature. Mathematically,

$\hat{f}_{\text{post-act}} = f_{\text{post-act}} + g_r$ (4)

where $f_{\text{post-act}}$ and $\hat{f}_{\text{post-act}}$ are image features before and after modulation, the subscript "post-act" specifies that the features are taken after activation, and $g_r$ denotes the reshaped geolocation features.

Not all image features from every layer are modulated by geolocation features. The lowest level image features are general features describing lines or edges of the object, which convey little information about species level distinctions. Thus, we modulate middle and higher level image features instead of lower ones. Note that we also tried modulating early features, but this did not give better results than modulating only the Inception modules.
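A minimal sketch of the additive modulation in Eq. (4) as a Keras layer follows. We assume (our choice; the paper only states that the FCL* output is reshaped to match the feature) that the projected geolocation features are broadcast per channel across spatial positions.

```python
import tensorflow as tf

class AdditiveGeoModulation(tf.keras.layers.Layer):
    """Additive modulation of one image feature map by geolocation features (Eq. 4)."""

    def __init__(self, channels):
        super().__init__()
        # FCL* from Figure 3: fully connected, no activation, sized to the modulated feature.
        self.project = tf.keras.layers.Dense(channels, activation=None)

    def call(self, image_feature, geo_feature):
        # geo_feature: (batch, d), the output of the shared geolocation FCL stack.
        g = self.project(geo_feature)           # (batch, channels)
        g = g[:, None, None, :]                 # reshape to broadcast over height and width
        return image_feature + g                # f_hat_post-act = f_post-act + g_r

# Example: modulate a (batch, 17, 17, 768) Inception-style feature map.
mod = AdditiveGeoModulation(channels=768)
out = mod(tf.zeros([8, 17, 17, 768]), tf.zeros([8, 256]))
```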

General image feature modulation has been discussed by Perez et al. in [37]. Specifically, they modulated image features by both multiplication and addition, as follows:

$\hat{f}_{\text{pre-act}} = g_1 \cdot f_{\text{pre-act}} + g_2$ (5)

where $f_{\text{pre-act}}$ and $\hat{f}_{\text{pre-act}}$ are image features before and after modulation, the subscript "pre-act" specifies that the features are taken before activation, and $g_1$ and $g_2$ are modulation features. In Section 5.3, we will show that, for geo-aware networks, using only addition is the best way to modulate image features.

Note that we also considered concatenating image features and geolocation features and jointly training the networks. However, the results were not as good as with feature modulation. This may be due to the fact that image features and geolocation features are different types of features, so feature modulation is more appropriate than concatenation.

4 Fine Grained Datasets with Geolocation

One challenge of using geolocation in fine grained recognition is the lack of fine grained datasets with geolocation information. To the best of our knowledge, there are only two fine grained datasets that have been used in geolocation related research in this field [7, 22]. In [7], the authors simulated a geolocation fine grained dataset by randomly pairing images and geolocations with the same ground-truth label from an image only dataset and a geolocation only dataset. In one of the ImageCLEF/LifeCLEF competitions [2], the PlantCLEF2016 [22] data contains partial geolocation information - less than half of the data - where all images were taken in France.

In this section, we will introduce two fine grained datasets with geolocation, one for both training and evaluation; the other for evaluation only. Both datasets contain genuine (not simulated) and worldwide geolocation. Both datasets will be made public with this paper.

4.1 iNaturalist Dataset with Geolocation

We introduce the iNaturalist fine grained dataset with geolocation, based on the data from the iNaturalist challenge at FGVC (fine grained visual categorization) 2017. The challenge data, without geolocation, was published in [50] and is available on the challenge page [3]. The state-of-the-art classification results on this dataset were presented in [12]. The dataset contains 5089 fine grained labels spanning 13 super categories, such as Plantae (plants), Insecta (insects), and Aves (birds). To be comparable with existing results, we use the same train/test split as in [12], with 665,473 training images and 9,697 test images.

To obtain geolocation information for the above dataset, we first map the image key in [3] to an observation id. Then, we utilize the iNaturalist observation data from the Global Biodiversity Information Facility (GBIF) [27], which contains observation ids and geolocation data. Following the chain from image key to observation id to geolocation, we can find the corresponding geolocation for existing iNaturalist challenge images.

During the mapping process, about 4% of images could not be matched to geolocation information, due to either missing observation ids in [3] or missing geolocation in the GBIF observation data. The final fine grained dataset, where each image has a corresponding geolocation, contains 645,424 training images and 9,394 test images. Figure 4(a) shows the heatmap of the geolocation distribution of the obtained dataset, including both training and test data, indicating the worldwide coverage of our iNaturalist based fine grained dataset.
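A minimal pandas sketch of this join follows. The file names and the mapping-file columns are hypothetical (the released complementary material defines the real ones), while decimalLatitude/decimalLongitude are the Darwin Core column names used in GBIF exports.

```python
import pandas as pd

# Hypothetical file names; the released complementary material defines the real ones.
key_to_obs = pd.read_csv("image_key_to_observation_id.csv")  # columns: image_key, observation_id
gbif = pd.read_csv(
    "gbif_observations.csv",
    usecols=["observation_id", "decimalLatitude", "decimalLongitude"])

# Chain image key -> observation id -> geolocation; the inner join drops the ~4%
# of images with a missing observation id or missing geolocation.
geo = (key_to_obs
       .merge(gbif, on="observation_id", how="inner")
       .dropna(subset=["decimalLatitude", "decimalLongitude"]))
geo.to_csv("inat_with_geolocation.csv", index=False)
```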

4.2 YFCC100M Fine Grained Evaluation Dataset with Geolocation

The YFCC100M dataset consists of 100 million Flickr images and videos with Creative Commons licenses [47]. Of its 99.2 million images, 48.5 million have geolocation. For each image, we identified Flickr tags or image titles containing labels corresponding to one of the 5089 fine-grained plant and animal species labels from the iNaturalist dataset in Section 4.1. Since iNaturalist labels are all species-level, images with multiple matching labels were omitted. In total, 1,362,447 geolocated images had a single matching label.

iNaturalist labels in the YFCC100M dataset are highly skewed towards popular species like domestic animals, cut flowers, and zoo animals. For example, YFCC100M had 207,575 geolocated images labeled “felis catus” (house cat), accounting for about 15% of all labels. By comparison, the iNaturalist evaluation set has only three cat examples. To mitigate the impact of highly common labels, we limited our evaluation to at most 10 examples per label. Of the 4721 labels represented in YFCC100M, 3553 labels had at least 10 examples. 36,146 labeled geolocated images were used in total.
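A sketch of the per-label cap, assuming the input is a list of (image_id, label) pairs for the geolocated, single-label matches; the function name and the seeded sampling for reproducibility are our own additions.

```python
import collections
import random

def cap_per_label(examples, max_per_label=10, seed=0):
    """Keep at most `max_per_label` examples per label.

    `examples`: iterable of (image_id, label) pairs for geolocated, single-label images.
    """
    by_label = collections.defaultdict(list)
    for image_id, label in examples:
        by_label[label].append(image_id)

    rng = random.Random(seed)
    kept = []
    for label, ids in by_label.items():
        # Subsample labels that exceed the cap; keep small labels intact.
        sampled = ids if len(ids) <= max_per_label else rng.sample(ids, max_per_label)
        kept.extend((image_id, label) for image_id in sampled)
    return kept
```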

As a complementary evaluation dataset for geo-aware networks trained on the iNaturalist dataset, the geolocation distribution of this evaluation set should be similar to that of the iNaturalist dataset. Figure 4(b) shows the distribution of the YFCC100M fine grained evaluation dataset, which is distributed worldwide like the iNaturalist dataset in Figure 4(a).

Figure 4: Geolocation distribution of (a) the iNaturalist dataset, and (b) the YFCC100M fine grained evaluation dataset.

5 Experimental Results

In this section, we present experimental results for the examined geo-aware networks. To show the effectiveness of using geolocation, we compare geo-aware networks with the state-of-the-art image only model presented in [12]. Specifically, we take Inception V3 with 299x299 input size as the image baseline classifier. From this initial checkpoint, we train or compute results for our geo-aware networks on the iNaturalist dataset with geolocation, and perform comprehensive analysis under these settings. Lastly, a smaller image baseline classifier and a different evaluation dataset are examined to show the generalization of the geo-aware networks.

5.1 Geolocation Priors

As discussed in Section 3.1, we assume that the geolocation priors follow certain geographical traits, so the prior distribution differs as geolocation changes. To use the geolocation based prior distribution for inference, we treat the geolocation where a test sample is observed as a reference point and extract the label prior distribution at that reference point. For each reference point, all training data points within a certain radius of the reference geolocation are counted with equal weight, and a histogram of class labels is computed for that geolocation, as sketched below. After this, we either use the histogram as a whitelist of labels or normalize it and treat it as a prior probability distribution.
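The following NumPy sketch shows one way the histogram extraction could be implemented, using the haversine great-circle distance to select training observations within the radius; the function names are ours, and the inputs are assumed to be NumPy arrays of training latitudes, longitudes, and integer label ids.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles; inputs in degrees, lat2/lng2 may be arrays."""
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

def label_histogram(ref_lat, ref_lng, train_lats, train_lngs, train_labels,
                    num_labels, radius_miles=100.0):
    """Equal-weight histogram of training labels within `radius_miles` of a reference point."""
    within = haversine_miles(ref_lat, ref_lng, train_lats, train_lngs) <= radius_miles
    return np.bincount(train_labels[within], minlength=num_labels)
```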

We empirically pick the radius that yields the best accuracy when using geolocation priors. We swept a few radii in the range from 50 miles to 5000 miles and found 100 miles to work best for the iNaturalist dataset. We also found that using geolocation based Bayesian priors produces worse results on iNaturalist, likely because the geolocation based prior distributions in the test set are more uniform and mismatch the ones estimated from the training set. Using a label whitelist mitigates the disparity between the prior distribution on the training set and that on the test set, which gives better results. Quantitative results are given in Table 1.

Radius (miles)    50      100     500     1000
Bayesian Priors   68.5%   69.4%   67.8%   66.5%
Whitelisting      71.3%   72.6%   72.3%   71.8%
Table 1: Top-1 accuracy using geolocation based Bayesian priors and whitelisting with different radii (miles) at each test location on the iNaturalist dataset, where the image only baseline model gives 70.1% [12].

5.2 Post-Processing Models

The post-processing model was trained on the iNaturalist training partition at a learning rate of 0.02 without decay. It consumed the output of the Inception V3 model described in Section 5, without touching pixels. Evaluated on the iNaturalist evaluation set, it achieved 79.0% top-1 accuracy, an increase of 8.9% over the baseline model.

5.3 Feature Modulation Models

In this section, we demonstrate the performance of the geo-aware network shown in Figure 3. The FCL layers inside the geolocation network have output sizes of 128 and then 256. Taking Inception V3 as the image CNN, we apply feature modulation to all image features output by the Inception modules [44]. Taking the Inception V3 image baseline classifier as the initial checkpoint for the image CNN part of the geo-aware network in Figure 3, we train the whole network, including the image CNN and geo net, end to end. Specifically, we use the RMSprop optimizer with initial learning rate 0.0045, decayed every 4 epochs with decay rate 0.94.
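This schedule maps directly onto a staircase exponential decay. In the Keras sketch below, the steps-per-epoch value is hypothetical, since it depends on the batch size, which the paper does not state.

```python
import tensorflow as tf

STEPS_PER_EPOCH = 5043  # hypothetical: ~645,424 training images / assumed batch size of 128

# Initial learning rate 0.0045, decayed by a factor of 0.94 every 4 epochs (Section 5.3).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.0045,
    decay_steps=4 * STEPS_PER_EPOCH,
    decay_rate=0.94,
    staircase=True)
optimizer = tf.keras.optimizers.RMSprop(learning_rate=schedule)
```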

Feature Modulation                        Top-1 Accuracy   Top-5 Accuracy
None [12]                                 70.1%            89.4%
FiLM [37]                                 72.5%            90.8%
variant (see text)                        65.6%            87.1%
variant (see text)                        76.8%            93.1%
variant (see text)                        76.2%            92.9%
variant (see text)                        77.2%            93.1%
Ours: addition after activation (Eq. 4)   78.2%            93.3%
Table 2: Top-k accuracies for different feature transforms. $f_{\text{pre-act}}$ denotes the image feature before activation; ReLU and $\sigma$ denote the ReLU and sigmoid functions, respectively; $g_1$ and $g_2$ are two functions of latitude and longitude. Rows marked "variant" are the modulation variants described in the text.

We trained and evaluated on the iNaturalist train/evaluation split. The last line in Table 2 shows the Top-1 and Top-5 accuracies of the proposed feature modulation model. Compared with the image only model (first line), our geo-aware network achieves an 8.1% increase in top-1 accuracy and a 3.9% increase in top-5 accuracy.

The second line in Table 2 shows the results of the general feature modulation scheme proposed in [37]. Its improvement over the image only model, 2.4%, is less than one third of that obtained by our customized feature modulation model. In addition, we also tried other variations of feature modulation, including modulating the features before versus after activation, using multiplication and/or addition as the modulation operation, and applying different activation functions to the modulator before combining it with the features. Some of these results are also listed in Table 2. However, none of them gives better results than our proposed method. This indicates that addition is the best way to affect image features when using geolocation as the modulator.

5.4 Comparison of Different Geo-Aware Networks

Figure 5: Examples where geo-aware networks corrected the prediction results using geolocation information. Distribution heatmaps are obtained by searching for the particular species/taxonomy on iNaturalist.org [1].

We summarize the best result from each network in Table 3. While all geo-aware networks achieve better results than image only models, the post-processing and feature modulation models give much better results than geolocation priors. Among all models, the post-processing model performs best. The high performance of post-processing models is informative because, while feature modulation networks can capture arbitrary relationships $P(y \mid I, G)$, post-processing models are severely restricted in the relationships between appearance and location they can capture, expressing only additive combinations of the form $\operatorname{logit} P(y \mid I) + l_g(y, G)$. This suggests that dependencies between appearance and location that cannot be expressed by post-processing models may be rare in nature for fine-grained plants and animals.

Geo-aware Model    Top-1 Accu.   Head: ≥100 im   Tail: <100 im
Image Only         70.1% [12]    76.5%           66.2%
Whitelisting       72.6%         77.2%           68.6%
Post-Process       79.0%         81.0%           77.2%
Feature Modulate   78.2%         81.1%           75.6%
Table 3: Top-1 accuracy of different geo-aware networks, together with head and tail top-1 accuracy. Head and tail images are images whose labels have ≥100 images per label and <100 images per label in the training set, respectively.

As fine grained datasets usually have a long tail distribution [25], we also show results on head and tail images in Table 3. Head images are those whose labels have at least 100 images per label in the training dataset, and tail images are the rest. All geo-aware networks achieve larger improvements on tail images than on head images. For example, the best post-processing model gives a 4.5% increase on head images but an 11% increase on tail images, 2.4 times the improvement on head images. In addition, even though the post-processing model has the highest overall accuracy, it is slightly worse than the feature modulation model on head images.

Label Category (number of labels)   Image Only   Post-Process   Feature Mod
Overall (5089)                      70.1%        79.0%          78.2%
Plantae (2101)                      75.2%        81.7%          81.0%
Insecta (1021)                      77.3%        82.2%          81.4%
Aves (964)                          68.0%        77.2%          76.0%
Reptilia (289)                      50.9%        63.7%          64.9%
Mammalia (186)                      59.1%        72.0%          73.1%
Fungi (121)                         72.1%        78.2%          78.8%
Amphibia (115)                      51.0%        74.9%          72.1%
Mollusca (93)                       65.6%        74.8%          76.8%
Animalia (77)                       71.5%        77.7%          73.1%
Arachnida (56)                      72.5%        82.6%          77.1%
Actinopterygii (53)                 65.4%        75.0%          67.3%
Chromista (9)                       63.2%        84.2%          68.4%
Protozoa (4)                        100%         83.3%          100%
Table 4: Top-1 accuracy of different geo-aware models on each label category.

Table 4 shows performance on each iNaturalist super category (e.g. plants, birds, etc.). Both neural network based models improve classification accuracy for all categories (except Protozoa, which has too few samples to draw conclusions).

To better understand how geolocation improves classification, Figure 5 shows examples where geo-aware networks correct wrong labels predicted by the image only model. The columns in this figure are, from left to right: the image; the top-1 result from the image only model, with the geographic distribution of the wrongly classified label; the geolocation of the image; and the top-1 result from the geo-aware networks (both the post-processing and the feature modulation models give the same results on these examples), with the geographic distribution of the correctly classified label.

Take the first row as an example: the image captures a Nuttall's woodpecker, but the image only model predicts red-bellied woodpecker. Looking only at the sample images of the two species (second column from left and rightmost column), they are hard to distinguish. However, they have completely different geolocation distributions, indicating different habitats: the correct species, Nuttall's woodpecker, is found only on the west coast of the United States, while the wrong species, red-bellied woodpecker, lives mainly in the central and eastern United States. Such differences can only be captured and used by geo-aware networks: given the latitude and longitude of the image, and knowing that it was taken on the west coast of the United States (third column from left), the geo-aware networks abandon the wrong species label, which has never been observed at that geolocation, and give the correct species label.

5.5 Results on Mobile Image Networks

While Inception V3 is a server sized model, we also examined Mobilenet V2 as the image baseline classifier to see how geo-aware networks perform with a smaller on-device model. We used the same settings as for Inception V3 in [12] to train the Mobilenet V2 image only model on the iNaturalist dataset. Then, the three geo-aware networks were computed or trained on top of this image baseline classifier. For the feature modulation model, feature modulation is applied to all blocks with inverted bottlenecks. Table 5 compares the results on Mobilenet V2 with those on Inception V3.

Geo-aware Model   Inception V3   Mobilenet V2
Image Only        70.1% [12]     59.6%
Whitelisting      72.6%          62.1%
Post-Process      79.0%          70.7%
Feature Mod.      78.2%          72.2%
Table 5: Top-1 accuracy of different geo-aware networks applied to different image baseline classifiers.

Since the image baseline classifier is smaller, it has more room to improve. The best geo-aware network achieves a 12.6% top-1 accuracy increase over the image only model, compared with the 8.1% increase for the bigger image baseline classifier. Importantly, the best geo-aware network based on Mobilenet V2 achieves even better performance than the image only network based on Inception V3. This provides incentive to apply geo-aware networks to on-device fine grained image models.

Unlike with the Inception V3 image baseline classifier, feature modulation models outperform post-processing models with Mobilenet V2. Recall that one disadvantage of the post-processing models is that the baseline image classifier they rely on must expend effort to distinguish visually-similar labels that could easily be disambiguated using geolocation. For a larger Inception model, this may be a small penalty. However, wasting capacity to visually distinguish, for example, American and European magpies may be especially costly for a smaller on-device model.

5.6 Results on YFCC100M Evaluation Data

To demonstrate the generalization of the geo-aware models, we evaluate them on another dataset: the YFCC100M fine grained evaluation dataset. Table 6 shows the evaluation results of the two successful neural network based geo-aware models presented in Section 5.4 on this new evaluation dataset, compared with the original iNaturalist evaluation split.

Evaluation Dataset   Image Only   Post-Process   Feature Mod.
iNaturalist Eval     70.1%        79.0%          78.2%
YFCC100M FG Eval     54.6%        60.5%          58.7%
Table 6: Top-1 accuracy of geo-aware networks, on top of the Inception V3 image baseline network, when evaluated on different evaluation data. FG denotes fine grained.

As shown in Table 6, the post-processing model achieves a 5.9% gain over the image only model, while the feature modulation model achieves a 4.1% gain. The improvements are smaller than those on the iNaturalist evaluation set because the quality of this dataset is not as good as that of iNaturalist, whose labels have been verified by domain experts. For example, some images in the YFCC100M dataset contain a sculpture of an animal instead of a real animal; in other cases, the animal or plant does not actually appear in the image, or is too small to see.

6 Conclusion

We have given a systematic overview of geo-aware networks for fine grained recognition. To address the lack of geolocation datasets in the fine grained research field, we introduced the iNaturalist dataset with geolocation and the YFCC100M fine grained evaluation dataset with geolocation. Experimental results show that all geo-aware networks achieve improvements over image only models. Specifically, the post-processing model achieves an 8.9% increase over the state-of-the-art Inception V3 based image only model on iNaturalist data and a 5.9% increase on YFCC100M evaluation data. The feature modulation model achieves a 12.6% increase over the Mobilenet V2 based image only model, performing even better than the Inception V3 based image only model. We believe our results demonstrate the effectiveness of using geolocation for fine grained recognition and provide incentive to use geo-aware networks for fine grained recognition problems in both server side and on-device models.

Acknowledgements We would like to thank Yanan Qian, Fred Fung, Christine Kaeser-Chen, Professor Serge Belongie, and Chenyang Zhang for their useful discussions on this topic and help in geolocation dataset preparation.

References

  • [1] A Community for Naturalists - iNaturalist.org. https://www.inaturalist.org.
  • [2] ImageCLEF / LifeCLEF - Multimedia Retrieval in CLEF. https://www.imageclef.org/.
  • [3] iNaturalist Challenge at FGVC 2017. https://www.kaggle.com/c/inaturalist-challenge-at-fgvc-2017.
  • [4] K. Amlacher, G. Fritz, P. Luley, A. Almer, and L. Paletta. Geo-contextual priors for attentive urban object recognition. In 2009 IEEE International Conference on Robotics and Automation, pages 1214–1219, May 2009.
  • [5] G. Baatz, O. Saurer, K. Köser, and M. Pollefeys. Large scale visual geo-localization of images in mountainous terrain. In A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, and C. Schmid, editors, Computer Vision – ECCV 2012, pages 517–530, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
  • [6] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. Cvae-gan: Fine-grained image generation through asymmetric training. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [7] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 2019–2026, June 2014.
  • [8] S. Branson, G. V. Horn, S. J. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. CoRR, abs/1406.2952, 2014.
  • [9] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie. Visual recognition with humans in the loop. In K. Daniilidis, P. Maragos, and N. Paragios, editors, Computer Vision – ECCV 2010, pages 438–451, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
  • [10] J. Champ, H. Goëau, and A. Joly. Floristic participation at LifeCLEF 2016 Plant Identification Task. In CLEF 2016 - Conference and Labs of the Evaluation forum, pages 450–458, Évora, Portugal, Sept. 2016.
  • [11] C. Comito, D. Falcone, and D. Talia. Mining human mobility patterns from social geo-tagged data. Pervasive and Mobile Computing, 33:91 – 107, 2016.
  • [12] Y. Cui, Y. Song, C. Sun, A. Howard, and S. Belongie. Large scale fine-grained categorization and domain-specific transfer learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [13] Y. Cui, F. Zhou, J. Wang, X. Liu, Y. Lin, and S. Belongie. Kernel pooling for convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [14] J. Deng, J. Krause, and L. Fei-Fei. Fine-grained crowdsourcing for fine-grained recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
  • [15] J. Deng, J. Krause, M. Stark, and L. Fei-Fei. Leveraging the wisdom of the crowd for fine-grained recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(4):666–676, April 2016.
  • [16] W. Di, C. Wah, A. Bhardwaj, R. Piramuthu, and N. Sundaresan. Style finder: Fine-grained clothing style detection and retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2013.
  • [17] A. Ennis, L. Chen, C. Nugent, G. Ioannidis, and A. Stan. A system for real-time high-level geo-information extraction and fusion for geocoded photos. In Proceedings of International Conference on Advances in Mobile Computing & Multimedia, MoMM ’13, pages 75:75–75:84, New York, NY, USA, 2013. ACM.
  • [18] A. Ennis, C. Nugent, P. Morrow, L. Chen, G. Ioannidis, and A. Stan. Evaluation of mediaplace: a geospatial semantic enrichment system for photographs. pages 19–25, Dec. 2015.
  • [19] J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [20] Y. Gao, O. Beijbom, N. Zhang, and T. Darrell. Compact bilinear pooling. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [21] T. Gebru, J. Hoffman, and L. Fei-Fei. Fine-grained recognition in the wild: A multi-task domain adaptation approach. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [22] H. Goëau, P. Bonnet, and A. Joly. Plant Identification in an Open-world (LifeCLEF 2016). In CLEF 2016 - Conference and Labs of the Evaluation forum, pages 428–439, Évora, Portugal, Sept. 2016.
  • [23] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [24] X. He and Y. Peng. Fine-grained image classification via combining vision and language. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [25] G. V. Horn and P. Perona. The devil is in the tails: Fine-grained classification in the wild. CoRR, abs/1709.01450, 2017.
  • [26] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
  • [27] iNaturalist.org (2018). iNaturalist research-grade observations. Occurrence dataset accessed via GBIF.org on 2018-02-28.
  • [28] A. Khosla, N. Jayadevaprakash, B. Yao, and F. fei Li. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, CVPR, 2011.
  • [29] J. Krause, B. Sapp, A. Howard, H. Zhou, A. Toshev, T. Duerig, J. Philbin, and L. Fei-Fei. The unreasonable effectiveness of noisy data for fine-grained recognition. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision – ECCV 2016, pages 301–320, Cham, 2016. Springer International Publishing.
  • [30] J. Krause, M. Stark, J. Deng, and L. Fei-Fei. 3d object representations for fine-grained categorization. In The IEEE International Conference on Computer Vision (ICCV) Workshops, June 2013.
  • [31] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
  • [32] X. Li, M. Larson, and A. Hanjalic. Geo-distinctive visual element matching for location estimation of images. IEEE Transactions on Multimedia, 20(5):1179–1194, May 2018.
  • [33] S. Liao, X. Li, H. T. Shen, Y. Yang, and X. Du. Tag features for geo-aware image classification. IEEE Transactions on Multimedia, 17(7):1058–1067, July 2015.
  • [34] T. Lin, A. RoyChowdhury, and S. Maji. Bilinear convolutional neural networks for fine-grained visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1309–1322, June 2018.
  • [35] M. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics Image Processing, pages 722–729, Dec 2008.
  • [36] Y. Peng, X. He, and J. Zhao. Object-part attention model for fine-grained image classification. IEEE Transactions on Image Processing, 27(3):1487–1500, March 2018.
  • [37] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville. Film: Visual reasoning with a general conditioning layer. CoRR, abs/1709.07871, 2017.
  • [38] S. Reed, Z. Akata, H. Lee, and B. Schiele. Learning deep representations of fine-grained visual descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [39] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • [41] J. Stutzki and M. Schubert. Geodata supported classification of patent applications. In Proceedings of the Third International ACM SIGMOD Workshop on Managing and Mining Enriched Geo-Spatial Data, GeoRich ’16, pages 4:1–4:6, New York, NY, USA, 2016. ACM.
  • [42] C. Szegedy, S. Ioffe, and V. Vanhoucke. Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
  • [43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [44] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [45] K. Tang, M. Paluri, L. Fei-Fei, R. Fergus, and L. Bourdev. Improving image classification with location context. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [46] G. Thakur, K. Sims, H. Mao, J. Piburn, K. Sparks, M. Urban, R. Stewart, E. Weber, and B. Bhaduri. Utilizing Geo-located Sensors and Social Media for Studying Population Dynamics and Land Classification, pages 13–40. Springer International Publishing, Cham, 2018.
  • [47] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li. YFCC100M: The new data in multimedia research. Commun. ACM, 59(2):64–73, Jan. 2016.
  • [48] B. Tóth, M. J. Tóth, D. Papp, and G. Szücs. Deep learning and svm classification for plant recognition in content-based large scale image retrieval. In CLEF, 2016.
  • [49] G. Van Horn, S. Branson, R. Farrell, S. Haber, J. Barry, P. Ipeirotis, P. Perona, and S. Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 595–604, June 2015.
  • [50] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie. The inaturalist species classification and detection dataset. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [51] E. Veach, J. Rosenstock, E. Engle, R. Snedegar, J. Basch, and T. Manshreck. S2 geometry. http://s2geometry.io/.
  • [52] A. Vedaldi, S. Mahendran, S. Tsogkas, S. Maji, R. Girshick, J. Kannala, E. Rahtu, I. Kokkinos, M. B. Blaschko, D. Weiss, B. Taskar, K. Simonyan, N. Saphra, and S. Mohamed. Understanding objects in detail with fine-grained attributes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
  • [53] J. D. Wegner, S. Branson, D. Hall, K. Schindler, and P. Perona. Cataloging public objects using aerial and street-level images - urban trees. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6014–6023, June 2016.
  • [54] T. Weyand, I. Kostrikov, and J. Philbin. Planet - photo geolocation with convolutional neural networks. In B. Leibe, J. Matas, N. Sebe, and M. Welling, editors, Computer Vision – ECCV 2016, pages 37–55, Cham, 2016. Springer International Publishing.
  • [55] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [56] Z. Xu, S. Huang, Y. Zhang, and D. Tao. Webly-supervised fine-grained visual categorization via deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(5):1100–1113, May 2018.
  • [57] B. Yan, K. Janowicz, G. Mai, and R. Zhu. xnet+sc: Classifying places based on images by incorporating spatial contexts. In GIScience, 2018.
  • [58] L. Yang, P. Luo, C. Change Loy, and X. Tang. A large-scale car dataset for fine-grained categorization and verification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [59] J. Yu and J. Luo. Leveraging probabilistic season and location context models for scene understanding. In Proceedings of the 2008 International Conference on Content-based Image and Video Retrieval, CIVR ’08, pages 169–178, New York, NY, USA, 2008. ACM.
  • [60] E. Zemene, Y. T. Tesfaye, H. Idrees, A. Prati, M. Pelillo, and M. Shah. Large-scale image geo-localization using dominant sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1):148–161, Jan 2019.
  • [61] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based r-cnns for fine-grained category detection. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014, pages 834–849, Cham, 2014. Springer International Publishing.
  • [62] N. Zhang, E. Shelhamer, Y. Gao, and T. Darrell. Fine-grained pose prediction, normalization, and recognition. CoRR, abs/1511.07063, 2015.
  • [63] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [64] X. Zhu, D. Anguelov, and D. Ramanan. Capturing long-tail distributions of object subcategories. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
  • [65] C. Zhuang, Q. Ma, and M. Yoshikawa. Sns user classification and its application to obscure poi discovery. Multimedia Tools and Applications, 76(4):5461–5487, Feb 2017.