1 Introduction
A large portion of the images uploaded daily to the Internet are taken with smartphones and therefore contain geotags that record their geographic location. However, there are numerous cases, such as social media photos, where this information is missing. Therefore, the ability to estimate the geographic location of these images, known as location estimation or geolocation, is crucial for a number of applications ranging from social media mining to media verification. More formally, image geolocation is the process of inferring the GPS coordinates of the depicted scene based solely on its visual elements.
State-of-the-art approaches for geolocation employ the latest advances in Computer Vision, such as Convolutional Neural Networks (CNNs), to extract representations of the depicted scenes and utilize huge databases of images taken worldwide either for training a classifier [5, 6, 7] or for retrieval [2, 3, 4, 11]. Classification solutions partition the earth into a set of geographic cells, and the images are passed through a CNN to be classified into a single cell. Retrieval solutions compare the test images against the ones from a large-scale database in order to retrieve the most similar images and derive a single estimation by aggregating their locations. In both cases, performance is measured by the percentage of images localized within a certain distance $s$ from their ground-truth location, denoted as geolocation accuracy @ $s$ km. Ongoing research focuses mainly on improving geolocation accuracy at different granularities (e.g. $s$ = 1, 25, 200, 750 and 2500 km). However, contrary to most image classification datasets, where images usually contain enough visual cues to be classified in a unique class, a good portion of images in geolocation datasets depict scenes with no apparent visual cues mapping to the image's location (e.g. indoor spaces, portraits). Such images could have been captured anywhere on the globe, and therefore attempting to localize them would most likely result in erroneous or unreliable estimations. Hence, we deem it crucial for the reliability of geolocation systems to estimate not only the geolocation of input images but also their localizability.
In a general sense, we consider localizable the images that contain enough visual cues for their accurate geolocation. However, localizability is better approached considering a granularity scale (i.e. the range within which an image can be correctly placed from its true location) and a geolocation model (i.e. a mechanism that derives the image's location from its visual content). To this end, we introduce the problem of image localizability detection, building upon the foundation of selective prediction [12]. We propose a methodology that utilizes state-of-the-art geolocation systems to infer localizability at different scales. More precisely, we re-implement an image geolocation model [5] and, instead of interpreting the output probability distribution as pure categorical data and predicting the most probable location, we visualize the whole distribution on the world map. We then devise two novel selection functions, i.e. spatial entropy and prediction density, which measure cell probability dispersion and concentration over the globe. Unlike current state-of-the-art selection functions, they exploit intrinsic properties of geolocation systems. We extensively evaluate our methodology on the two most widely used evaluation datasets, i.e. Im2GPS [2] and Im2GPS3k [11], and highlight the effectiveness of our proposed selection functions compared to state-of-the-art approaches in selective prediction. By discarding images considered non-localizable at city scale, we boost the accuracy of our base geolocation model at city scale from 27.8% to 70.5%, making it reliable for real-world applications. To the best of our knowledge, we are the first to propose the task of image localizability detection and leverage selective prediction to address it. Therefore, our work makes the following contributions:

We introduce the problem of image localizability detection and frame it under a selective prediction framework. This formulation allows current classification models to infer localizability, and hence abstain from predicting non-localizable images.

We propose two novel selection functions, specifically designed for geolocation, that outperform current state-of-the-art selection functions, which do not consider spatial information.

We extensively evaluate our methodology on the two most widely used datasets, achieving good separation between localizable and non-localizable images, and making current geolocation systems more reliable.
2 Related Work
This section gives an overview of some of the fundamental works that have contributed to geolocation estimation and selective prediction.
Geolocation Estimation: Hays and Efros [2] introduced the problem of planet-scale image location estimation. They used hand-crafted features to retrieve images similar to a query image and infer its location based on theirs. Weyand et al. [5] took advantage of the advances in deep learning and trained a Convolutional Neural Network (CNN) to extract features from images. Additionally, they formulated the geolocation problem as a classification task and divided the earth's surface, using Google's S2 Geometry library (https://s2geometry.io), to create a set of classes for the training and test images. More recent works modify the classification pipeline using a hierarchical partitioning of the earth [6], or novel loss functions [7]. Recently, a hybrid scheme called Search within Cell [8] was proposed, which combines a classification and a retrieval approach for the final location estimation.
Selective prediction: Works in this area focus on machine learning systems that are not only able to make predictions but also to know when to abstain from predicting. Although the field has existed for several decades, it was not until recently that a unified formulation was introduced by El-Yaniv and Wiener [12] and approaches for deep architectures were proposed by Geifman and El-Yaniv [15, 16]. In [15], Softmax Response (the maximum output after the softmax layer) and MC-Dropout [13] were used as confidence functions, and an algorithm that finds the appropriate threshold given a desired risk was proposed. In [16], an algorithm for jointly training the classification network and the selection function was proposed.
3 Methodology
In this section, we present the proposed methodology for the selection of localizable images; this is illustrated in Fig. 1.
3.1 Geolocation Estimation
A geolocation model takes as input an image and returns the estimated GPS coordinates of the location where it was captured. Following the classification approach for geolocation, we first divide the earth's surface into a grid of geographic cells $C = \{c_1, \dots, c_N\}$, and then we employ a CNN with $N$ outputs in the final layer, corresponding to the cells of the grid. For each input image $x$, the CNN produces a probability distribution $p(x) = (p_1, \dots, p_N)$ over the grid of cells. The predicted location is the mean coordinates of the cell with the highest probability.
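The prediction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `predict_location`, the toy logits and the three cell centers are all hypothetical, and the cell centers stand in for the mean coordinates of each cell.

```python
import numpy as np

def predict_location(logits, cell_centers):
    """Return the (lat, lon) of the most probable cell and the full cell distribution."""
    z = logits - np.max(logits)            # subtract the max for a numerically stable softmax
    probs = np.exp(z) / np.sum(np.exp(z))  # probability of each cell
    best = int(np.argmax(probs))           # index of the most probable cell
    return tuple(cell_centers[best]), probs

# Toy grid of three hypothetical cells, given as (lat, lon) in degrees.
cell_centers = np.array([[48.85, 2.35], [40.71, -74.00], [35.68, 139.69]])
logits = np.array([0.2, 2.5, 0.1])
location, probs = predict_location(logits, cell_centers)
```

Returning the whole distribution alongside the arg-max location keeps the information that the later localizability measures rely on.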
Most geolocation approaches [5, 6, 7] consider only the cell with the highest probability, ignoring all the information provided by the cell probability distribution over the grid. We found that this probability distribution can provide valuable insights into the localizability of images. More precisely, a trained network has learned both to estimate the location of images and to associate concepts with locations. For example, when the network is presented with the image of Fig. 1(a), the probability of several cells near the sea is high. However, when presented with the image from Fig. 1(b), all cells around Egypt are activated, since the image contains many visual cues that map to that area. Thus, even though it is challenging to predict the exact location of those images, the model generates reasonable estimates of candidate locations.
By inspecting the map, it is evident that the network is more confident for the estimation of Fig. 1(b)’s location than Fig. 1(a)’s since the probability distribution is more concentrated on a specific area. Thus, the estimation of the former can be considered more reliable than the latter. The spatial distribution of cells is essential information for the geolocation estimation problem, differentiating it from the general image classification task. Hence, our goal in this paper is to exploit this information to improve the reliability of the model’s predictions.
3.2 Localizability
To develop and evaluate a methodology for image localizability, we have to associate all images in a dataset with ground-truth labels that indicate localizability. This raises two issues: localizability is not meaningful without a distance tolerance, and labeling images as localizable or not is highly subjective and depends on the collection of images recognized by the prospective annotator. To address the former issue, we define localizability at a certain scale (distance tolerance from the ground-truth location). To address the latter issue, we approximate localizability in terms of our model's ability to infer location from the input image, i.e. we assess which images our employed model is able to predict correctly. Therefore, all images that our model is able to predict within a certain distance from their ground-truth location are labeled as localizable, and all other images are labeled as non-localizable.
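More formally, given a geolocation estimation model $f$, the localizability of an image $x$ at distance $s$ is defined as:
$$L_s(x \mid f) = \begin{cases} 1, & \text{if } GCD(f(x), y) \le s \\ 0, & \text{otherwise} \end{cases} \tag{1}$$
where GCD is the Great Circle Distance between two locations, and $y$ is the ground-truth location of $x$.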
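A minimal sketch of this labeling rule is given below. It assumes the haversine formula on a spherical earth of radius 6371 km for the Great Circle Distance; the names `gcd_km` and `localizability` and the example coordinates are illustrative and do not come from the paper.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean earth radius; an assumption of this sketch

def gcd_km(a, b):
    """Great Circle Distance in km between two (lat, lon) points in degrees (haversine)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))

def localizability(predicted, ground_truth, scale_km):
    """Label an image 1 (localizable) if the prediction lies within scale_km of the truth."""
    return 1 if gcd_km(predicted, ground_truth) <= scale_km else 0

# Hypothetical example: the model predicts Paris for an image taken in London.
paris, london = (48.8566, 2.3522), (51.5074, -0.1278)
label_region = localizability(paris, london, 200)   # not localizable at region scale
label_country = localizability(paris, london, 750)  # localizable at country scale
```

The same image can therefore be non-localizable at a fine scale and localizable at a coarser one, which is why the label is always tied to a scale.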
3.3 Selective Prediction
Predicting which images are localizable according to our model's geolocation capability can be formulated as a selective prediction scheme following the formulation in [12]. Here, we adapt their definitions to fit the needs of the geolocation estimation task. Our aim is to build a selective geolocation system $(f, g)$ such that $f$ is the base geolocation module, as described in Section 3.1, and $g$ is the selection function. Then our selective geolocation system is defined as:
$$(f, g)(x) = \begin{cases} f(x), & \text{if } g(x) = 1 \\ \text{abstain}, & \text{if } g(x) = 0 \end{cases} \tag{2}$$
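The selection function $g$ is usually modeled based on a confidence function $\kappa$ (which measures our model's confidence or uncertainty), a scale $s$ and a tunable threshold $\theta$. For $\kappa$ measuring confidence, $g$ is defined as follows:
$$g_\theta(x \mid \kappa, s) = \begin{cases} 1, & \text{if } \kappa(x, s) \ge \theta \\ 0, & \text{otherwise} \end{cases} \tag{3}$$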
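Let $P(X, Y)$ be the distribution over $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the image space and $\mathcal{Y}$ the coordinate space, characterizing the probability of image $x$ being captured at geographical coordinates $y$. Given an underlying distribution $P$, a confidence function $\kappa$ and a scale $s$, varying the parameter $\theta$ determines the performance of our selective geolocation system, which can be expressed using coverage and risk, as follows:
Coverage is the probability mass of the non-rejected region in $\mathcal{X}$, and can be approximated given enough i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{m}$ from $P$ as follows:
$$\hat{\phi}(f, g) = \frac{1}{m} \sum_{i=1}^{m} g(x_i) \tag{4}$$
Risk is the expected percentage of the kept images that will be predicted outside a radius $s$, and can be approximated given enough i.i.d. samples $\{(x_i, y_i)\}_{i=1}^{m}$ from $P$ as follows:
$$\hat{r}(f, g) = \frac{\frac{1}{m} \sum_{i=1}^{m} \ell(f(x_i), y_i) \, g(x_i)}{\hat{\phi}(f, g)} \tag{5}$$
where $\ell$ is a loss function defined as:
$$\ell(f(x), y) = \begin{cases} 1, & \text{if } GCD(f(x), y) > s \\ 0, & \text{otherwise} \end{cases} \tag{6}$$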
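The empirical coverage and risk can be computed in a few lines. The sketch below assumes that the confidence values and the geolocation errors (in km) of the base model have already been computed for each sample; the function name `selective_metrics` and the toy numbers are hypothetical.

```python
import numpy as np

def selective_metrics(confidence, errors_km, scale_km, theta):
    """Empirical coverage and risk of a thresholded selection function."""
    confidence = np.asarray(confidence, dtype=float)
    errors_km = np.asarray(errors_km, dtype=float)
    selected = confidence >= theta      # selection function of Eq. (3)
    loss = errors_km > scale_km         # 0/1 loss of Eq. (6)
    coverage = selected.mean()          # Eq. (4)
    if coverage == 0:
        return 0.0, float("nan")        # risk is undefined when every image is rejected
    risk = (loss & selected).mean() / coverage  # Eq. (5)
    return float(coverage), float(risk)

# Four toy samples evaluated at city scale (25 km) with threshold 0.5.
coverage, risk = selective_metrics(
    confidence=[0.9, 0.8, 0.3, 0.1],
    errors_km=[5.0, 400.0, 12.0, 3000.0],
    scale_km=25.0,
    theta=0.5,
)
```

In this toy case two of the four images are kept, and one of the two kept predictions falls outside the 25 km radius, so both coverage and risk equal 0.5.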
3.4 Estimating Image Localizability
Our main goal in this work is to find good confidence functions $\kappa$ and thresholds $\theta$ for the selection function $g$. For the base model $f$, we employ a geolocation system similar to [5]; however, any geolocation system that tackles geolocation as a classification problem can be used.
3.4.1 Spatial Entropy (SE)
Spatial Entropy measures the dispersion of cell probabilities around the globe. To calculate the SE of the cell distribution at scale $s$, we initially select the most probable cell and merge all cells within distance $s$ from it to form a supercell. The probability of the supercell is the sum of the probabilities of the individual cells. Then, we ignore all cells merged into the supercell and find the next most probable cell among the remaining ones. Similarly, we merge it with its neighboring cells that are inside a radius $s$. This process is repeated until the cumulative probability of the supercells accounts for 90% of the total confidence (we empirically found that this removes noise from the cell distributions, compared to considering cells accounting for 100% of the total confidence). We denote by $p' = (p'_1, \dots, p'_K)$ the new probability distribution over the $K$ supercells; hence, SE is defined as:
$$SE(x, s) = -\sum_{j=1}^{K} p'_j \log p'_j \tag{7}$$
Higher Spatial Entropy indicates lower confidence; therefore, we devise the selection function as:
$$g_\theta(x \mid SE, s) = \begin{cases} 1, & \text{if } SE(x, s) \le \theta \\ 0, & \text{otherwise} \end{cases} \tag{8}$$
where $\theta$ is a tunable threshold.
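The greedy supercell construction can be sketched as below. Several details are assumptions of this sketch rather than statements from the paper: distances are measured between cell centers with the haversine formula, the supercell probabilities are renormalized to sum to one before the entropy is taken, and the natural logarithm is used. The names `gcd_km`, `spatial_entropy` and the toy cells are illustrative.

```python
import math

def gcd_km(a, b, radius_km=6371.0):
    """Great Circle Distance in km between two (lat, lon) points in degrees (haversine)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))

def spatial_entropy(probs, centers, scale_km, mass=0.9):
    """Entropy of greedily merged supercells that together cover `mass` of the probability."""
    remaining = set(range(len(probs)))
    supercells, covered = [], 0.0
    while remaining and covered < mass:
        seed = max(remaining, key=lambda i: probs[i])  # most probable unmerged cell
        members = [i for i in remaining
                   if gcd_km(centers[seed], centers[i]) <= scale_km]
        p = sum(probs[i] for i in members)             # supercell probability
        supercells.append(p)
        covered += p
        remaining.difference_update(members)           # merged cells are ignored from now on
    total = sum(supercells)
    if total == 0:
        return 0.0
    # Renormalization over the kept supercells is an assumption of this sketch.
    return sum(-(p / total) * math.log(p / total) for p in supercells if p > 0)

# Two neighbouring cells near Paris plus one in Tokyo, against three far-apart cells.
se_concentrated = spatial_entropy([0.60, 0.35, 0.05],
                                  [(48.85, 2.35), (48.90, 2.40), (35.68, 139.69)], 25.0)
se_dispersed = spatial_entropy([0.4, 0.3, 0.3],
                               [(48.85, 2.35), (40.71, -74.00), (35.68, 139.69)], 25.0)
```

In the concentrated case a single supercell already exceeds the 90% mass, so the entropy vanishes, whereas the dispersed case needs three supercells and yields a clearly higher value.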
3.4.2 Prediction Density (PD)
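Prediction Density measures the concentration of cell probability in a particular region instead of its dispersion around the globe. To calculate the PD of the cell distribution at scale $s$, we accumulate the model's cell probabilities in a radius $s$ around the most probable cell, which can be considered as the model's confidence that an input image can be localized at scale $s$. This is formulated as follows:
$$PD(x, s) = \sum_{i=1}^{N} p_i \, \mathbb{1}\left[ GCD(c_i, c^*) \le s \right] \tag{9}$$
where $p_i$ is the probability assigned to cell $c_i$, $c^*$ is the most probable cell, and $\mathbb{1}[\cdot]$ is the indicator function. Higher Prediction Density denotes higher confidence; therefore, we devise the selection function as:
$$g_\theta(x \mid PD, s) = \begin{cases} 1, & \text{if } PD(x, s) \ge \theta \\ 0, & \text{otherwise} \end{cases} \tag{10}$$
where $\theta$ is a tunable threshold.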
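Prediction Density reduces to a single masked sum over the cell probabilities. As before, the sketch assumes that the distance between cells is the haversine distance between their centers; `gcd_km`, `prediction_density` and the toy cells are hypothetical names and values.

```python
import math

def gcd_km(a, b, radius_km=6371.0):
    """Great Circle Distance in km between two (lat, lon) points in degrees (haversine)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))

def prediction_density(probs, centers, scale_km):
    """Probability mass within scale_km of the most probable cell."""
    top = max(range(len(probs)), key=lambda i: probs[i])
    return sum(p for p, c in zip(probs, centers)
               if gcd_km(centers[top], c) <= scale_km)

centers_near = [(48.85, 2.35), (48.90, 2.40), (35.68, 139.69)]
centers_far = [(48.85, 2.35), (40.71, -74.00), (35.68, 139.69)]
pd_concentrated = prediction_density([0.60, 0.35, 0.05], centers_near, 25.0)
pd_dispersed = prediction_density([0.4, 0.3, 0.3], centers_far, 25.0)
```

The concentrated distribution keeps about 0.95 of its mass within 25 km of the top cell, while the dispersed one keeps only 0.4, so a threshold between the two would accept the first image and reject the second.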
4 Evaluation Setup
4.1 Datasets
To train our geolocation model, we use the training split of the MediaEval Placing Task 2016 dataset (MP16 train) [9], which is a subset of the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset [10]. It consists of 4,723,695 images posted on Flickr together with their metadata, including geographical coordinates. For validation, we use the YFCC25k dataset from [6], composed of 25,600 randomly selected images from YFCC100M (excluding images from MP16 train). Due to the unavailability of several images, we end up with a total of 23,007 images. Finally, for evaluation, we use the Im2GPS [2] and Im2GPS3k [11] datasets, provided by the original authors, consisting of 237 and 3,000 images, respectively.
4.2 Implementation Details
For the cell partitioning described in Section 3.1, we adopt the fine partitioning from [6] and terminate cell splitting when each cell contains between 50 and 1,000 images from MP16 train. We discard all cells that end up with fewer than 50 photos. This results in 13,662 cells and 4,071,346 images for training. Although the particular partitioning implementation could affect the geolocation estimation performance, we are primarily interested in the performance of our localizability methods, and hence we do not consider alternate partitionings.
For the geolocation model $f$, we use EfficientNet-B4 [1] as our backbone CNN and replace its last layer with a linear layer consisting of 13,662 neurons, corresponding to our total number of cells. We replicate the preprocessing and training pipeline of Kordopatis-Zilos et al. [8], and we do not use any further additions such as hierarchical partitioning [6] or the MvMF loss [7].
4.3 Competing Approaches
In Section 5, we compare our selection functions against two baseline runs (which serve to visualize selective performance limits) and two state-of-the-art methods in selective prediction, briefly described below:
Random selection function: randomly selects whether to predict or abstain from providing a prediction, with an equal probability of 50%.
Ideal selection function: selects images based on their ground-truth localizability values, prioritizing the images considered localizable.
Softmax Response (SR) [14]: uses the maximum probability after the final softmax layer, i.e. the maximum cell probability, as confidence function. It has been employed for selective prediction in [15]. Note that this selection function cannot be intrinsically adapted to the different scales $s$.
Monte-Carlo Dropout (MC) [13]: uses as confidence function the variance of the softmax response over multiple forward passes of an input image with dropout applied in the final layer. This has been shown to be a good approximation of a Bayesian Neural Network with Gaussian parameter priors [13] and is a state-of-the-art method in selective prediction [15]. Following [15], we use a dropout rate of 0.5. Note that, again, this selection function cannot be intrinsically adapted to the different scales $s$.
5 Experiments
5.1 Selective Geolocation Performance
We benchmark the selective prediction performance of the proposed selection functions SE and PD against the competing approaches. We evaluate them on both Im2GPS and Im2GPS3k at scales of 1, 25, 200, 750 and 2500 km, which are the most widely reported scales and correspond to street, city, region, country and continent level granularity.
First, we present the Risk-Coverage (RC) curves for each dataset, illustrated in Fig. 3. These curves are obtained by computing the risk and coverage of each method for different values of the threshold $\theta$. Both our selection functions achieve state-of-the-art performance, yielding lower risk at every coverage level in all datasets and scales. SE and PD perform similarly, with PD consistently outperforming SE by a small margin. Moreover, at coarser granularity scales, the performance gap between our selection functions and the competing SR and MC widens considerably, with our selection functions coming close to the ideal. This can probably be attributed to their intrinsic adaptation to different scales. Finally, it is evident that RC curves on Im2GPS3k are smoother and more monotonic, which is expected due to its greater size and variety of images.
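A Risk-Coverage curve of this kind can be traced by sorting the samples by confidence and lowering the threshold one sample at a time. The sketch below is one possible way to do this, not necessarily the procedure used for the reported curves; it assumes a confidence measure where higher is better (an uncertainty measure such as SE would be negated first) and breaks ties by the original sample order. The function name and toy values are hypothetical.

```python
import numpy as np

def risk_coverage_curve(confidence, errors_km, scale_km):
    """Coverage and risk obtained by keeping the k most confident samples, for k = 1..m."""
    confidence = np.asarray(confidence, dtype=float)
    wrong = np.asarray(errors_km, dtype=float) > scale_km  # 0/1 loss at this scale
    order = np.argsort(-confidence, kind="stable")         # most confident first
    kept = np.arange(1, len(confidence) + 1)               # number of samples kept
    coverage = kept / len(confidence)
    risk = np.cumsum(wrong[order]) / kept                  # share of kept samples that are wrong
    return coverage, risk

coverage, risk = risk_coverage_curve(
    confidence=[0.9, 0.8, 0.3, 0.1],
    errors_km=[5.0, 400.0, 12.0, 3000.0],
    scale_km=25.0,
)
```

At full coverage the risk equals the error rate of the base model at that scale, which gives a simple sanity check for any curve produced this way.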
Although risk-coverage curves give a comprehensive insight into the selective prediction performance, we need to determine a specific threshold that separates localizable and non-localizable images given a selection function. To do so, we choose the value of $\theta$ whose coverage equals the percentage of images that the base model $f$ can successfully localize. We learn this value on the YFCC25k validation dataset for each selection function and each granularity scale. We call the risk and coverage at this threshold Optimal Risk (OR) and Optimal Coverage (OC), respectively. We also report the classification accuracy and the F1-score of the positive class.
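One way to realize this threshold choice is sketched below, under the assumptions that higher values of the selection measure mean higher confidence and that the target coverage is matched by keeping the corresponding number of top-ranked validation images. With tied confidence values the achieved coverage can exceed the target. The function name and the toy validation set are hypothetical.

```python
import numpy as np

def threshold_for_target_coverage(val_confidence, val_errors_km, scale_km):
    """Pick theta so that coverage on the validation set matches the base model's accuracy."""
    conf = np.asarray(val_confidence, dtype=float)
    errors = np.asarray(val_errors_km, dtype=float)
    target = float(np.mean(errors <= scale_km))  # share of images localized at this scale
    k = int(round(target * len(conf)))           # number of images to keep
    if k == 0:
        return float("inf"), target              # no image is localizable: abstain on all
    theta = np.sort(conf)[::-1][k - 1]           # k-th highest confidence value
    return float(theta), target

theta, target = threshold_for_target_coverage(
    val_confidence=[0.9, 0.2, 0.7, 0.4, 0.1],
    val_errors_km=[3.0, 900.0, 20.0, 60.0, 4000.0],
    scale_km=25.0,
)
achieved = float(np.mean(np.asarray([0.9, 0.2, 0.7, 0.4, 0.1]) >= theta))
```

Here two of the five validation images are localized within 25 km, so the threshold is set at the second highest confidence value and the achieved coverage matches the 40% target.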
Tables 1 and 2 display the results of the selection functions on the two evaluation datasets. We note that high accuracy at finer scales is not indicative of good separation between localizable and non-localizable images due to the class imbalance; however, combined with the F1-score, it provides useful insights. In particular, it is evident that the proposed SE and PD achieve better class separation than SR and MC, with PD slightly surpassing SE. Moreover, in most cases, the selected threshold for our methods leads to lower risk and wider coverage compared to the competition.
5.2 Selective Geolocation Reliability
We present a quantitative and qualitative assessment of the performance of our selective models based on SE and PD compared to the base geolocation model $f$.
We split both Im2GPS and Im2GPS3k into a localizable and a non-localizable subset using our selective models at city scale. Table 3 displays the geolocation accuracies on these splits at all granularity scales, compared to the performance of the base model $f$ without a selection scheme. For fine and medium granularity scales, our selective models achieve more than double the geolocation accuracy of the base model $f$, with only a tiny portion of localizable images rejected by our functions. In particular, prediction density increased the geolocation accuracy on Im2GPS3k from 27.8% to 70.5% by discarding non-localizable images, of which only 8.2% could have been successfully localized. This highlights the reliability that current image geolocation models can achieve using the selective prediction mechanisms presented.
Fig. 4 depicts image samples randomly selected from Im2GPS3k for qualitative evaluation of our methodology. Images are grouped by their predicted and ground-truth localizability into (a) true positive, (b) false positive, (c) true negative and (d) false negative, using PD at city scale as the selection function. True positive samples depict either landmarks or characteristic elements that hint at very specific locations (e.g. the Golden Gate Bridge). True negative samples contain mostly generic scenes for which localization should not even be attempted. The two images with the car and the lights could have been localized if they were in our training dataset, but even in that case, a similar scene can easily exist in multiple cities. False positive samples contain enough visual cues to be worthy of geolocation, but not enough for the required granularity. Finally, all false negative samples besides the Edificio Meneses picture are not localizable, and their correct geolocation by our geolocation model could be attributed to the presence of very similar images in the training dataset.
6 Conclusions
In this paper, we introduced the problem of image localizability detection and used it as a foundation for reliable image geolocation. We adapted a selective prediction methodology to the context of geolocation and presented two novel selection functions, Spatial Entropy and Prediction Density, tailored to the needs of the geolocation task. Our functions achieved superior selective performance compared to the state of the art on the two widely used evaluation datasets. We also demonstrated how they can be exploited to abstain from geolocating non-localizable images, significantly boosting the geolocation performance at all granularity scales, and thus making current geolocation models more reliable. In the future, we plan to explore the design and evaluation of more sophisticated and trainable selection functions.
Acknowledgments: This work has been supported by the projects WeVerify and MediaVerse, partially funded by the European Commission under contract numbers 825297 and 957252, respectively.
References
[1] Tan M, Le Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, 2019.
[2] Hays J, Efros AA. Im2gps: estimating geographic information from a single image. In IEEE Computer Vision and Pattern Recognition, 2008.
[3] Hays J, Efros AA. Large-scale image geolocalization. In Multimodal location estimation of videos and images, 2015.
[4] Kordopatis-Zilos G, Popescu A, Papadopoulos S, Kompatsiaris Y. Placing Images with Refined Language Models and Similarity Search with PCA-reduced VGG Features. In MediaEval, 2016.
[5] Weyand T, Kostrikov I, Philbin J. PlaNet - Photo geolocation with convolutional neural networks. In European Conference on Computer Vision, 2016.
[6] Müller-Budack E, Pustu-Iren K, Ewerth R. Geolocation estimation of photos using a hierarchical model and scene classification. In European Conference on Computer Vision, 2018.
[7] Izbicki M, Papalexakis EE, Tsotras VJ. Exploiting the earth's spherical geometry to geolocate images. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2019.
[8] Kordopatis-Zilos G, Galopoulos P, Papadopoulos S, Kompatsiaris I. Leveraging EfficientNet and Contrastive Learning for Accurate Global-scale Location Estimation. In ACM International Conference on Multimedia Retrieval, 2021.
[9] Larson M, Soleymani M, Gravier G, Ionescu B, Jones GJ. The benchmarking initiative for multimedia evaluation: MediaEval 2016. IEEE MultiMedia. 2017 Feb 9;24(1):93-96.
[10] Thomee B, Shamma DA, Friedland G, Elizalde B, Ni K, Poland D, Borth D, Li LJ. YFCC100M: The new data in multimedia research. Communications of the ACM. 2016 Jan 25;59(2):64-73.
[11] Vo N, Jacobs N, Hays J. Revisiting im2gps in the deep learning era. In IEEE International Conference on Computer Vision, 2017.
[12] El-Yaniv R, Wiener Y. On the Foundations of Noise-free Selective Classification. Journal of Machine Learning Research. 2010 May 1;11(5).
[13] Gal Y, Ghahramani Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, 2016.
[14] Hendrycks D, Gimpel K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. 2016.
[15] Geifman Y, El-Yaniv R. Selective Classification for Deep Neural Networks. In Advances in Neural Information Processing Systems, 2017.
[16] Geifman Y, El-Yaniv R. SelectiveNet: A deep neural network with an integrated reject option. In International Conference on Machine Learning, 2019.