The ability to accurately recognize different categories of objects from aerial imagery, such as roads and buildings, is of great importance in understanding the world from above, with many useful applications ranging from mapping and urban planning to environment monitoring. This domain is entering a flourishing period, as the technological and computational aspects involved, at both the hardware and algorithm levels, combine into very powerful systems that are suitable for practical, real-world tasks. In this paper we address two important problems that are not sufficiently studied in the literature. We are among the first, to the best of our knowledge, to propose a method for automatic geo-localization in aerial images without GPS information, by putting in correspondence the real world images with the publicly available, manually labeled maps from the OpenStreetMap (OSM) project (https://www.openstreetmap.org/). We solve the task by first learning to detect roads and intersections in aerial images, and then learning to identify specific intersections based on a high level descriptor that puts in correspondence the detected intersections from real world images with intersections detected in the manually labeled OSM maps. Accurate localization is then obtained at the final step by the geometric alignment of the two road maps - the detected ones and the OSM annotations. We present how the alignment to the OSM maps can be used to improve the quality of the detected roads and intersections. We also show that the accurate geometric registration of roads and intersections can improve both recognition of the roads and the initial localization. A key insight of our approach is the observation that intersections tend to have a unique road pattern surrounding them and thus can play a key role in localization, by reducing this difficult task to a sparse feature matching problem followed by a local refined roadmap alignment.
For the accurate detection of roads we use a recent state-of-the-art method [1] that is based on a dual stream local-global deep CNN, which takes advantage of both the local appearance of an object and the larger contextual region around it, in order to augment its local appearance and thus improve recognition performance.
2 Related work on road detection and localization
The recent success of convolutional neural networks [14, 25] has led to greatly improved accuracy and robustness in road detection [22, 24]. As shown in , the lack of good quality aerial images, as well as clutter and occlusion, can significantly degrade learning and performance even for state-of-the-art architectures. Post-processing is often required in aerial image analysis , but it is not expected to solve the most difficult cases. There are many approaches proposed for road detection, such as following road tracks , local context modeling with CRFs , minimum path methods  or using neural networks .
Arguably, free road vectors are widely available for most of the planet. However, they are sometimes misaligned and have a poor level of detail. Therefore some methods attempt to correct these road vectors by aligning them to real rectified aerial images. Topological road improvement methods trace back to . A more recent approach  uses Conditional Random Fields in conjunction with a minimum cost path algorithm for improving topology. The authors take into account various cues, such as context, cars and smoothness between road widths, in order to offset road vertices to their real location. The same authors previously proposed a metric for topology measurement .
There are several methods related to automatic geolocalization from aerial images, but the tasks they address differ from ours. Some use known landmarks, others ground-level images or extra GPS or IMU measurements. Most employ sparse, manually designed features - ours being the first, to the best of our knowledge, to automatically localize aerial images from recognition and matching of semantic categories, such as roads and intersections, in the context of deep neural networks. More specifically, related to our work, geolocalization for unmanned aerial vehicles (UAVs) using sparse manually designed features has been proposed in , while accurate, sub-pixel manhole localization has been proposed using known landmarks . A road following strategy for UAVs with lost GPS signal is described in . Other authors augment a feature-based approach by fusing camera input with GPS and inertial measurement unit (IMU) outputs. They propose a monocular SLAM approach without visual beacons [12, 5], which yields an error of about 5 m. Given the global coverage of aerial images, there has been interest in geolocalizing a ground image using aerial images at training time [18, 29, 17]. Geolocalizing single ground images has also been recently explored in . An approach loosely related to geolocalization proposed the study of street patterns in order to identify the city class .
3 Our approach
Our method has several stages: 1) pixelwise road classification in a given aerial image; 2) detection of intersections based on the detected roads; 3) identification of a given intersection by matching its surrounding region to regions from a stored dataset of OpenStreetMap (OSM) road and intersection maps. At this stage we keep, for each test intersection, a list of the closest OSM intersections in the intersection descriptor space; 4) accurate geometric alignment for improved localization and road detection enhancement. At this stage we keep, from the list of candidate intersection matches, the one with minimum geometric alignment error. In this work we focus on recognition and localization of a given detected intersection. We use intersections as anchors for localization for three reasons. First, once intersections are found and images are aligned to known roadmaps, the location of any given point in the image follows immediately. Second, intersections are sparse and require very little computational and storage costs for recognition and matching. Third, they are also sufficiently discriminative for localization when their surrounding area is taken into account. They tend to have a unique pattern of roads in the neighborhood region, which acts as a unique fingerprint that is useful for location recognition. We present an overview of our approach in Figure 1. Note that while we did not use any GPS information for localization, we assumed that we know the orientation of the image with respect to the cardinal points - information that is easily obtained with a compass in a real world situation. To account for small errors in orientation estimation we added random Gaussian noise, with zero mean and a standard deviation of 5 degrees, to the test image rotation angle. While the added noise slightly affected the performance of intersection recognition, it did not influence the final geometric alignment stage, which is affine invariant. We detail the stages of our pipeline next.
3.1 Finding roads and intersections
Detection of roads:
We train a state-of-the-art dual stream local-global Convolutional Neural Network [1] (LG-Net) on the task of road detection (Figure 2). The network combines two pathways, one based on an adjusted VGG-Net [25] that uses local appearance information (a local 64x64 patch surrounding the road region) and the other based on an adjusted AlexNet [14], which takes as input a significantly larger neighborhood (256x256) for contextual reasoning. The two pathways are joined in the last FC layers and the output is a small 16x16 center patch having ones for road pixels and zeros otherwise. The final road map is obtained by dividing the larger aerial images into disjoint 16x16 patches, which are classified independently. In the experiments presented in [1] the local-global network achieves an F-measure that is consistently superior to a network that has only the local pathway. Also, compared to previous contextual approaches to road detection, ours avoids hand-crafted cues, such as the nearby cars and consistent road width  or nearby lines , and effectively learns to reason about context by considering the larger area containing the road.
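The tiling described above can be sketched as follows. This is a minimal illustration, not the LG-Net implementation: it only computes, for each disjoint 16x16 output patch, the 64x64 local and 256x256 global windows centered on it. The function name `context_windows`, the box layout and the choice to let windows extend past the image border (to be padded by the caller) are assumptions.

```python
def context_windows(height, width, out=16, local=64, glob=256):
    """Yield (target, local_win, global_win) boxes for every output patch.

    Each box is (top, left, bottom, right) in pixel coordinates. The local
    and global windows share the center of the target patch and may extend
    past the image border; padding is left to the caller (an assumption).
    """
    def centered(cy, cx, size):
        half = size // 2
        return (cy - half, cx - half, cy + half, cx + half)

    # Disjoint output patches: the stride equals the patch size.
    for top in range(0, height - out + 1, out):
        for left in range(0, width - out + 1, out):
            cy, cx = top + out // 2, left + out // 2
            yield ((top, left, top + out, left + out),
                   centered(cy, cx, local),
                   centered(cy, cx, glob))


# Number of independently classified patches in a 600x600 image.
n_patches = sum(1 for _ in context_windows(600, 600))
```

Since 600 is not a multiple of 16, this simple sketch leaves the trailing 8 pixels on each axis uncovered; how the border is handled in practice is not stated in the text.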
Detection of intersections:
For the detection of intersections we trained an adjusted AlexNet architecture, modified to output a single class to signal the presence or absence of an intersection at a given point in the image. We considered as input several channels containing the original RGB image as well as the estimated roadmap provided by the LG-Net. Including the channels with the original RGB low level signal improved the maximum detection F-measure from to , in our experiments, using a scanning window approach with non-maxima suppression. The more relevant of the two types of input is the estimated roadmap, which represents signal at a higher, semantic level of image interpretation. Note that intersections, by definition, are directly related to the existence of at least two roads that intersect. In order to speed up the detection of intersections we classified pixels on a grid (with a step of 10 pixels) and obtained the final dense intersection map by interpolation. This resulted in a speedup of two orders of magnitude at the cost of a relatively small decrease in detection quality. In Figure 2, we also present the system for intersection detection with an example estimated map of intersections. We notice that most intersections are detected, while, in some cases, intersections seem to be correctly detected in the image but are not present in OSM, which we considered as ground truth. Note that such inconsistencies between images and manually labeled roads are not uncommon in OSM.
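A minimal sketch of the grid-based speedup: the classifier is evaluated only at grid nodes and the dense map is filled in by interpolation. Bilinear interpolation, the function name `dense_from_grid` and the callable `score_fn` (standing in for the intersection classifier) are assumptions; the text only states that a 10-pixel step and interpolation were used. With a step of 10, roughly one pixel in 100 is classified, which is in line with the reported speedup of two orders of magnitude.

```python
def dense_from_grid(score_fn, height, width, step=10):
    """Score only grid nodes, then bilinearly interpolate a dense map.

    score_fn(y, x) is a placeholder for the per-pixel classifier.
    """
    ys = list(range(0, height, step))
    xs = list(range(0, width, step))
    # One classifier call per grid node instead of one per pixel.
    coarse = [[score_fn(y, x) for x in xs] for y in ys]

    dense = [[0.0] * width for _ in range(height)]
    for y in range(height):
        i0 = y // step
        i1 = min(i0 + 1, len(ys) - 1)      # clamp at the bottom border
        ty = (y - ys[i0]) / step
        for x in range(width):
            j0 = x // step
            j1 = min(j0 + 1, len(xs) - 1)  # clamp at the right border
            tx = (x - xs[j0]) / step
            top = (1 - tx) * coarse[i0][j0] + tx * coarse[i0][j1]
            bottom = (1 - tx) * coarse[i1][j0] + tx * coarse[i1][j1]
            dense[y][x] = (1 - ty) * top + ty * bottom
    return dense
```

Peaks of the resulting dense map would then be selected with non-maxima suppression, which is not shown here.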
3.2 Automatic geolocalization
We represent each intersection by a descriptor which is learned such that identical intersections from detected roads and OSM roads have similar descriptors, while descriptors for different intersections are separated as far as possible. For extracting the intersection descriptors we start from the modified AlexNet trained for intersection detection, such that the last FC layer of 4096 elements is used as a descriptor. Intersections from the detected road maps are matched against a database from OSM using Euclidean distances in descriptor space. While this approach proves to be very effective, we further improve matching performance by fine-tuning the network with backpropagation to adjust distances in descriptor space (Figure 3). Localization is further refined by the geometric alignment between the estimated roads and the OSM roads in the regions centered at the intersections that have been put in correspondence. We detail next the algorithms for matching and localization.
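Matching in descriptor space amounts to a nearest-neighbor query. A minimal sketch, assuming descriptors are plain sequences of floats and using a brute-force search; the function name `k_nearest` and the value of `k` are illustrative, and toy 2-D vectors stand in for the 4096-element descriptors.

```python
import math


def k_nearest(query, database, k=3):
    """Return indices of the k database descriptors closest to `query`
    under Euclidean distance, nearest first."""
    ranked = sorted(range(len(database)),
                    key=lambda i: math.dist(query, database[i]))
    return ranked[:k]


# Toy example: four stored OSM descriptors and one detected intersection.
osm_descriptors = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.9, 1.2)]
candidates = k_nearest((1.0, 1.1), osm_descriptors, k=2)
```

Keeping several candidates rather than only the nearest one leaves room for the later geometric alignment stage to reject wrong matches.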
Descriptor extraction and learning:
We extract descriptors for intersection images in a way that is similar to . Moreover, we fine-tune the descriptor extracted for intersections from the neural network, so as to minimize the distance between identical intersections and maximize the distances between dissimilar ones. First, we train the modified AlexNet for intersection detection. Second, we fine-tune the network weights in a Siamese-like fashion, with corresponding intersection pairs from estimated roadmaps and OSM, respectively, marked as positive and different intersection pairs marked as negative. See [10] for details on this type of training. The robust loss formula we use takes into consideration the ground truth label , which is if the intersections are the same and otherwise, the squared Euclidean distance between pairs of intersection descriptors and a margin , which gives zero penalty to descriptors and from different intersections that are at a distance of at least in descriptor space:
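A minimal sketch of a contrastive loss of the kind described above, following the general form of Hadsell, Chopra and LeCun. The 0.5 scaling, the boolean `same` flag used as the label, the default margin and the function name are assumptions, not necessarily the exact formula used in our experiments.

```python
import math


def contrastive_loss(desc_a, desc_b, same, margin=1.0):
    """Contrastive loss for one pair of descriptors.

    same=True  -> penalize the squared distance (pull the pair together).
    same=False -> penalize only pairs closer than `margin` (push apart);
                  pairs at distance >= margin get zero penalty.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(desc_a, desc_b))
    if same:
        return 0.5 * sq_dist
    shortfall = max(0.0, margin - math.sqrt(sq_dist))
    return 0.5 * shortfall ** 2
```

In training, this per-pair value would be averaged over a batch of positive and negative pairs and minimized by backpropagation through both branches of the Siamese network.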
The learning phase creates a descriptor for each intersection image. Similar images will correspond to descriptors that are close in Euclidean space. When matching two regions centered at two candidate intersection matches, we also consider the descriptors of the nearby intersections. This results in a bipartite graph matching problem for matching two sets of descriptors. Since nearby intersections usually have similar regions, it is possible to wrongly match detected intersections to their neighboring OSM intersections , but such local misplacements are most often fixed at the final geometric alignment step, when all the road details in a region are taken into account. Next we present our method for finding correspondences between detected intersections and the ones from OSM, by matching sets of intersections from their corresponding regions. These regions are neighborhoods of a certain radius centered at the intersections of interest. As our experiments show, the larger this radius, the more accurate the intersection identification. This is expected, as larger regions include more road structures that are unique to a specific urban area.
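The set-to-set comparison can be illustrated with a brute-force bipartite matching. A minimal sketch, assuming the cost of a pairing is the Euclidean distance between descriptors and that the region score is the mean cost of the best one-to-one assignment; the text does not specify the solver, and exhaustive search over permutations is only practical for a handful of intersections per region. The name `set_distance` is hypothetical.

```python
import math
from itertools import permutations


def set_distance(set_a, set_b):
    """Mean cost of the cheapest one-to-one assignment between two
    descriptor sets (every element of the smaller set gets a partner)."""
    if len(set_a) > len(set_b):
        set_a, set_b = set_b, set_a
    best = math.inf
    # Try every way of assigning distinct elements of set_b to set_a.
    for perm in permutations(range(len(set_b)), len(set_a)):
        cost = sum(math.dist(a, set_b[j]) for a, j in zip(set_a, perm))
        best = min(best, cost)
    return best / max(len(set_a), 1)
```

For larger regions, a polynomial-time assignment solver such as the Hungarian algorithm would be the natural replacement for the exhaustive loop.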
Although a location can be theoretically determined by a single correctly identified intersection and a correct rotation with respect to the cardinal points , in order to have a robust match and further improve the initial localization (which could be off due to intersection detection misalignments), we also estimate, for a given pair of candidate intersection matches , a geometric affine transformation between the roads in regions and . Then, a misalignment measure is computed such that most outlier candidates in the list of a given test intersection (found using Algorithm 1) are removed. The 2D registration procedure is performed by sampling road points from the test and query images and computing Shape Context descriptors [3] at sampled locations. Using kNN with Shape Context descriptors, a list of candidate correspondences is found and an affine transform is robustly estimated using RANSAC. Then, the Euclidean distance transform (Matlab function) is used in order to compute the symmetrized Chamfer distance between the two registered roadmaps, as a measure of misalignment - which, in practice, yields significantly better results. Other approaches (such as ) also proposed road alignment. Ours is fast and very effective for rejection of outlier intersection matches, improving localization and road enhancement (next Section). A more detailed overview of our localization algorithm is presented below:
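Separately from the full localization algorithm, the misalignment measure alone can be sketched as follows. This is a minimal illustration under stated assumptions: roads are given as lists of sampled (x, y) points, the nearest road point is found by brute force instead of a distance transform, the symmetrization is a plain average of the two directed distances, and the affine parameters are taken as already estimated (the Shape Context and RANSAC steps are not shown). All function names are hypothetical.

```python
import math


def apply_affine(points, a, b, c, d, tx, ty):
    """Map each (x, y) to (a*x + b*y + tx, c*x + d*y + ty)."""
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


def directed_chamfer(src, dst):
    """Mean distance from each point in `src` to its nearest point in `dst`."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def symmetric_chamfer(roads_a, roads_b):
    """Average of the two directed Chamfer distances."""
    return 0.5 * (directed_chamfer(roads_a, roads_b)
                  + directed_chamfer(roads_b, roads_a))


def best_candidate(test_roads, candidates):
    """Pick the (name, road_points) candidate with the lowest misalignment."""
    return min(candidates, key=lambda c: symmetric_chamfer(test_roads, c[1]))
```

A distance transform computes the same nearest-road distances for every pixel at once, which is why it is preferred over the quadratic loop above when the roadmaps are dense images.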
3.3 Enhancing the road map
We can use the aligned OSM roadmaps to improve the detected roads and vice-versa - since OSM roadmaps sometimes contain wrongly labeled roads, or do not reflect recent road changes. Here we present a simple but effective method: 1) we apply a soft dilation procedure on the estimated roadmap and multiply it, pixel by pixel, with the aligned OSM map; 2) the resulting soft output is then smoothed with a Gaussian filter and the result is thinned using a standard non-maximum suppression method for boundary detection; 3) after thinning, the roads are dilated back to their initial thickness. The results are substantially better, as expected, greatly improving the similarity between the roads found and the OSM roads - the F-measure in road detection improved from to . Important note: this procedure does not use ground truth localization, but only the entire OSM dataset, and relies on the accuracy of the automatic matching and alignment algorithms. It has proved generally effective even when the localization was wrong but the road structure between the matched OSM region and the test image was similar. We present qualitative results in Figure 4.
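Step 1 of the procedure above can be illustrated on a toy grid. A minimal sketch, assuming the soft dilation is a grayscale maximum filter over a square window (the text does not specify the exact operator); the Gaussian smoothing, thinning and final dilation of steps 2 and 3 are not covered. The names `dilate` and `fuse` are hypothetical.

```python
def dilate(prob, radius=1):
    """Grayscale dilation: each pixel takes the maximum road probability
    found in its (2*radius + 1)-wide square neighborhood."""
    h, w = len(prob), len(prob[0])
    return [[max(prob[yy][xx]
                 for yy in range(max(0, y - radius), min(h, y + radius + 1))
                 for xx in range(max(0, x - radius), min(w, x + radius + 1)))
             for x in range(w)]
            for y in range(h)]


def fuse(detected, osm, radius=1):
    """Dilate the detected road map, then multiply it pixel by pixel
    with the aligned OSM map."""
    d = dilate(detected, radius)
    return [[d[y][x] * osm[y][x] for x in range(len(osm[0]))]
            for y in range(len(osm))]


# Detected road in column 1, OSM road in column 2 (one pixel off).
detected = [[0.0, 0.8, 0.0, 0.0] for _ in range(3)]
osm = [[0, 0, 1, 0] for _ in range(3)]
fused = fuse(detected, osm, radius=1)
```

The dilation is what makes the product tolerant to small residual misalignments: without it, a one-pixel offset between the two maps would zero out the road entirely.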
4 Experimental analysis
Two Cities Dataset:
We collected aerial images of two European cities (termed A and B) and automatically aligned them with the OSM road maps for training and evaluation. We plan to make the dataset public. The images are 600x600 px and have a spatial resolution of 1 m/pixel; together they cover an area of about 70 sq. km for each city. We use city A for training and validation and images from city B for testing. The quality of the images is fairly low, which makes the task of road detection and localization very challenging, even for the human eye (see example images in Figure 4).
Figure 3 presents the average performance measures after geolocalizing all 3177 intersections from city B. We present intersection identification (recognition) rates versus the region radius (top left plot). As expected, performance increases as the region radius increases, at the cost of more computation and data being required. We also demonstrate that the geometric alignment phase significantly increases performance, bringing it close to the mark even when the region radius is small. The plot also presents the consistent improvement brought by fine-tuning the descriptors to optimize intersection matching. The other three plots present the distribution of localization errors in meters. We notice that most errors (around or above of them) are below 2.5 meters, that is, below 3 pixels for the image resolution available in our experiments. This error is very small considering the poor image quality and the errors present in OSM itself, which was considered as ground truth. For these reasons we believe that our results demonstrate a high level of localization accuracy for our system, which could be very effective in most cases when the GPS signal is lost.
Training for road detection and intersection descriptor learning took between 3 and 5 days on a GeForce GTX 970 GPU with 4 GB memory and 1664 CUDA cores. At test time, road extraction speed is 5 km²/s at a spatial resolution of 1 m/pixel, and it represents the most expensive task for geolocalization. Intersection detection takes 0.7 km²/s, while localization by means of kNN in intersection descriptor space and geometric alignment is an order of magnitude faster in the context of searching within the limits of a sq. km city.
5 Discussion and Conclusions
We have presented a complete system for geo-localization from aerial images in the absence of GPS information. Our proposed pipeline includes several contributions, with efficient methods for road and intersection detection, intersection recognition with geometric alignment for accurate localization, followed by road detection enhancement. There are many potential applications for our approach in areas such as urban planning, tracking structural changes, updating of existing maps and environment monitoring. Our system could also be used in the context of unmanned aerial vehicles, in order to correct their GPS localization or to make their flight possible even when the GPS signal is lost. We estimate that if the search area was only times smaller than in our experiments, the automatic localization would be tractable for onboard processing, in near real-time, for the current generation of NVIDIA's embedded GPUs (Jetson TX1). For nighttime use, for example, the roads are generally 'extracted' by means of street lighting, which makes the problem of road and intersection detection easier - thus even more accessible for on-board processing. We have shown that geolocalization from images alone, using learned high level features, is feasible and can achieve a high level of accuracy. It can be used as a GPS alternative or in conjunction with GPS, bringing valuable contributions to the literature and also to many applications that require offline or online, real-time processing.
The authors would like to thank Alina Marcu for dedicated assistance with some of our experiments. Marius Leordeanu was supported in part by CNCS-UEFISCDI, under project PNII PCE-2012-4-0581.
- [1] Anonymous, ‘Object contra context: Dual local-global semantic segmentation in aerial imagery’, in submitted to ECAI, (2016).
- [2] Marc Barthélemy and Alessandro Flammini, ‘Modeling urban street patterns’, Physical review letters, 100(13), 138702, (2008).
- [3] Serge Belongie, Jitendra Malik, and Jan Puzicha, ‘Shape context: A new descriptor for shape matching and object recognition’, in NIPS, volume 2, p. 3, (2000).
- [4] Fernando Caballero, Luis Merino, Joaquín Ferruz, and Aníbal Ollero, ‘Unmanned aerial vehicle localization based on monocular vision and online mosaicking’, Journal of Intelligent and Robotic Systems, 55(4-5), 323–343, (2009).
- [5] Fernando Caballero, Luis Merino, Joaquín Ferruz, and Aníbal Ollero, ‘Vision-based odometry and slam for medium and high altitude flying uavs’, Journal of Intelligent and Robotic Systems, 54(1-3), 137–161, (2009).
- [6] Christian Drewniok and Karl Rohr, ‘High-precision localization of circular landmarks in aerial images’, in Mustererkennung 1995, 594–601, Springer, (1995).
- [7] Eric Frew, Tim McGee, ZuWhan Kim, Xiao Xiao, Stephen Jackson, Michael Morimoto, Sivakumar Rathinam, Jose Padial, and Raja Sengupta, ‘Vision-based road-following using a small autonomous aircraft’, in Aerospace Conference, 2004. Proceedings. 2004 IEEE, volume 5, pp. 3006–3015. IEEE, (2004).
- [8] Paolo Gamba, Fabio Dell’Acqua, and Gianni Lisini, ‘Improving urban road extraction in high-resolution images exploiting directional filtering, perceptual grouping, and simple topological concepts’, Geoscience and Remote Sensing Letters, IEEE, 3(3), 387–391, (2006).
- [9] Armin Gruen and Haihong Li, ‘Road extraction from aerial and satellite images by dynamic programming’, ISPRS Journal of Photogrammetry and Remote Sensing, 50(4), 11–20, (1995).
- [10] Raia Hadsell, Sumit Chopra, and Yann LeCun, ‘Dimensionality reduction by learning an invariant mapping’, in Computer vision and pattern recognition, 2006 IEEE computer society conference on, volume 2, pp. 1735–1742. IEEE, (2006).
- [11] Jiuxiang Hu, Anshuman Razdan, John C Femiani, Ming Cui, and Peter Wonka, ‘Road network extraction and intersection detection from aerial images by tracking road footprints’, Geoscience and Remote Sensing, IEEE Transactions on, 45(12), 4144–4157, (2007).
- [12] Jonghyuk Kim and Salah Sukkarieh, ‘Real-time implementation of airborne inertial-slam’, Robotics and Autonomous Systems, 55(1), 62–71, (2007).
- [13] Dan Klang, ‘Automatic detection of changes in road data bases using satellite imagery’, International Archives of Photogrammetry and Remote Sensing, 32, 293–298, (1998).
- [14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, ‘Imagenet classification with deep convolutional neural networks’, in Advances in neural information processing systems, pp. 1097–1105, (2012).
- [15] Ivan Laptev, Helmut Mayer, Tony Lindeberg, Wolfgang Eckstein, Carsten Steger, and Albert Baumgartner, ‘Automatic extraction of roads from aerial images based on scale space and snakes’, Machine Vision and Applications, 12(1), 23–31, (2000).
- [16] Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 27–35, (2015).
- [17] Tsung-Yi Lin, Serge Belongie, and James Hays, ‘Cross-view image geolocalization’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 891–898, (2013).
- [18] Tsung-Yi Lin, Yin Cui, Serge Belongie, and James Hays, ‘Learning deep representations for ground-to-aerial geolocalization’, in Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pp. 5007–5015. IEEE, (2015).
- [19] Yucong Lin and Srikanth Saripalli, ‘Road detection and tracking from aerial desert imagery’, Journal of Intelligent & Robotic Systems, 65(1-4), 345–359, (2012).
- [20] Gellert Mattyus, Shenlong Wang, Sanja Fidler, and Raquel Urtasun, ‘Enhancing road maps by parsing aerial images around the world’, in The IEEE International Conference on Computer Vision (ICCV), (December 2015).
- [21] Helmut Mayer, Stefan Hinz, Uwe Bacher, and Emmanuel Baltsavias, ‘A test of automatic road extraction approaches’, International Archives of Photogrammetry, Remote Sensing, and Spatial Information Sciences, 36(3), 209–214, (2006).
- [22] Volodymyr Mnih and Geoffrey E Hinton, ‘Learning to detect roads in high-resolution aerial images’, in Computer Vision–ECCV 2010, 210–223, Springer, (2010).
- [23] Javier A Montoya-Zegarra, Jan D Wegner, L’ubor Ladickỳ, and Konrad Schindler, ‘Mind the gap: modeling local and global context in (road) networks’, in Pattern Recognition, 212–223, Springer, (2014).
- [24] Shunta Saito and Yoshimitsu Aoki, ‘Building and road detection from large aerial imagery’, in IS&T/SPIE Electronic Imaging, pp. 94050K–94050K. International Society for Optics and Photonics, (2015).
- [25] Karen Simonyan and Andrew Zisserman, ‘Very deep convolutional networks for large-scale image recognition’, arXiv preprint arXiv:1409.1556, (2014).
- [26] Engin Türetken, Fethallah Benmansour, and Pascal Fua, ‘Automated reconstruction of tree structures using path classifiers and mixed integer programming’, in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 566–573. IEEE, (2012).
- [27] Jan Wegner, Javier Montoya-Zegarra, and Konrad Schindler, ‘A higher-order crf model for road network extraction’, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1698–1705, (2013).
- [28] Tobias Weyand, Ilya Kostrikov, and James Philbin, ‘Planet-photo geolocation with convolutional neural networks’, arXiv preprint arXiv:1602.05314, (2016).
- [29] Scott Workman, Richard Souvenir, and Nathan Jacobs, ‘Wide-area image geolocalization with aerial reference imagery’, in Proceedings of the IEEE International Conference on Computer Vision, pp. 3961–3969, (2015).
- [30] Jiangye Yuan and Anil M Cheriyadat, ‘Road segmentation in aerial images by exploiting road vector data’, in Computing for Geospatial Research and Application (COM. Geo), 2013 Fourth International Conference on, pp. 16–23. IEEE, (2013).