Bridging the Domain Gap for Ground-to-Aerial Image Matching

04/24/2019 ∙ by Krishna Regmi, et al. ∙ University of Central Florida

The visual entities in cross-view images exhibit drastic domain changes due to the difference in viewpoints each set of images is captured from. Existing state-of-the-art methods address the problem by learning view-invariant descriptors for the images. We propose a novel method for solving this task by exploiting the generative powers of conditional GANs to synthesize an aerial representation of a ground-level panorama and use it to minimize the domain gap between the two views. Because the synthesized image is from the same view as the target image, it helps the network preserve important cues in aerial images under our Joint Feature Learning approach. Our Feature Fusion method combines the complementary features from a synthesized aerial image with the corresponding ground features to obtain a robust query representation. In addition, multi-scale feature aggregation preserves image representations at different feature scales useful for solving this complex task. Experimental results show that our proposed approach performs significantly better than the state-of-the-art methods on the challenging CVUSA dataset in terms of top-1 and top-1% retrieval accuracies. Furthermore, to evaluate the generalization of our method on urban landscapes, we collected a new cross-view localization dataset with geo-reference information.




1 Introduction

Estimating the geo-location of an image has been tackled as an image-matching task where the query image is compared against a database of reference images with known locations. Traditionally, the matching has been conducted between images taken from the same view, primarily street-view [13, 43, 34], which have a high degree of visual similarity in terms of scene contents. Since these ground-level reference images are typically concentrated around urban areas with more human accessibility, the applicability of the method is limited to those regions. With the availability of aerial images from Google Maps, Bing Maps, etc. that cover the earth's surface densely, researchers have lately explored the prospect of cross-view image matching [23, 40, 17], where the query ground image is matched against aerial images. This comes with additional challenges due to the variation in viewpoints between the ground and aerial images, which capture the same scene differently in the two views. This motivates us to explore transforming the query street-view image into the aerial view, so that the transformed image has scene representations similar to the images it is matched against.

Figure 1: Given a query ground panorama $I_g$, we synthesize an aerial image $I_{a'}$ using a conditional GAN. We jointly learn the representations for image triads ($I_g$, $I_a$ and $I_{a'}$), where $I_a$ is the real aerial image. The features $f_g$ and $f_{a'}$ are ultimately fused to obtain a robust query representation. The aerial image (illustrated in the green box on the right panel) with features closest to the query image feature is its correct match.

The recent success of Generative Adversarial Networks (GANs) [11] in synthesizing realistic images from randomly sampled noise vectors [29] or conditional variables such as text [30, 46], images [18, 31], labels [27], etc. has inspired us to frame the problem as viewpoint translation followed by feature matching. Moreover, GANs have been used for domain transfer problems as in [47, 19] to learn the mapping between different domain representations. Recent cross-view synthesis works by [31, 9, 32, 48] are successful in transforming the images between aerial and street views. In this paper, we take a next step to synthesize aerial images from ground level panorama and use them for image retrieval.

The complexity of the cross-view image synthesis problem and its challenges are well-known. Thus, the synthesized images cannot be relied on to completely replace the query ground-level image to solve the matching task. Therefore, we propose a framework to incorporate the synthesized image into the matching pipeline as auxiliary information in order to bridge the existing domain gap between aerial and ground views. We attempt to learn representations for aerial reference images that are similar to their corresponding ground level images, as well as the synthesized aerial images. Since the synthesized aerial images are transformed representations of street-view images, we expect them to contain representative features. By learning representations in this manner, the synthesized aerial images force the network to minimize the distance between feature representations of aerial images and street-view images. Additionally, we hypothesize that some features of aerial images are better learned by looking at synthesized aerial images rather than street-view images. Thus, the joint training of these image triads will help the aerial stream retain important cues that would have otherwise been lost in cross-view training. We fuse the learned complementary feature representations of synthesized images with query image features to obtain a robust representation that we use for our image matching task.

The features extracted at different layers of deep neural networks capture varying levels of semantic information of the input image. For the image matching task, which is considerably more challenging than a standard classification problem, we exploit the inherent multi-scale pyramidal structure of features at multiple layers of deep neural networks and aggregate them to obtain a better image representation.

In summary, this paper makes the following contributions. We propose a novel approach to leverage aerial images synthesized using GANs to extract complementary features for cross-view image matching. We incorporate edgemaps, in addition to the semantic segmentation that is typically used, together with the input images to improve the cross-view synthesis by providing the network with cues on object shapes and boundaries. The synthesized images bridge the domain gap between cross-view images. The joint training of image triads using an auxiliary loss helps improve the network training. The proposed feature fusion strategy demonstrates the capabilities of GANs for constructive training and complementary feature learning. Lastly, we show that aggregating features from multiple convolutional layers at different resolutions greatly helps preserve the coarse-to-fine latent representations necessary for the complex cross-view matching task. Our extensive experiments show that the proposed joint feature learning method outperforms the state-of-the-art methods on the CVUSA dataset [45], and with feature fusion we obtain significant improvements in top-1 and top-10 retrieval accuracies.

2 Related Works

2.1 Domain Transfer and GANs

GANs are very popular in domain transfer tasks. The works reported in [18, 47, 19, 41] learn an image mapping between two domains, a source domain and a target domain. Augmented CycleGAN [2] and StarGAN [7] have explored many-to-many cross-domain mappings.

Cross-view relations have been explored in [45, 31, 10] in the more challenging setting of aerial and ground views, where there is very little semantic and viewpoint overlap between the objects in the images. Cross-view image synthesis between these contrasting domains has attracted wide interest lately [31, 32, 9, 48] with the popularity of GANs.

2.2 Multi-scale Feature Aggregation

Features at different layers of deep neural networks are essentially multi-resolution features of the same image. An abundance of literature has explored features at multiple scales [16, 28, 33, 25, 21] for applications like key-point detection, human pose estimation and semantic segmentation. FPN [24], HyperNet [20] and ION [5] explored multi-scale features for object detection. Earlier, Hypercolumns [12] were created from multi-layer features and used for object segmentation and localization. We also aggregate the features at multiple scales to efficiently obtain a robust representation of the images.

2.3 Image Geolocalization

Image geolocalization has been tackled as an image matching task [3, 14, 42] in the computer vision community. Early works in geolocalization [43, 37, 34, 44] matched images in the same view: a query street-view image is compared against the reference street-view images using hand-crafted features. Hays [13] proposed a data-driven approach to estimate the distribution over geographical location from a single image. Bansal [4] explored facade matching. Tian [36] matched building features in oblique views.

Cross-view matching has been explored by several recent works [23, 35, 17, 39, 38] using both hand-crafted features and deep networks. Recent work by [17] exploits NetVLAD [3] to obtain view-invariant descriptors for cross-view pairs and uses them for matching.

In this work, we exploit the synthesized aerial images as a complementary source of information for better scene understanding of street-view images to solve the cross-view matching task, rather than just learning view-invariant features as in the previous approaches.

3 Method

We propose a novel method to bridge the domain gap between street-view and aerial images by leveraging the synthesized aerial images using GANs. We learn the representations for synthesized aerial images jointly with ground and aerial image representations. Additionally, we fuse the complementary representations for ground images with the representations for their corresponding synthesized aerial images to learn robust query representations for ground images. Also, we exploit the edgemaps of input images to provide GANs with the notion of object shapes and boundaries and facilitate the cross-view image synthesis.

The organization of the rest of this section is as follows. In the next subsection, we briefly describe how GANs are used for cross-view image synthesis, followed by joint feature learning, and finally feature fusion is described.

3.1 Cross-View Image Synthesis

We synthesize the aerial images from ground-level panorama to use them during joint feature learning to bridge the domain gap between cross-view images. Previous works [31, 32, 9, 48] have been successful in image translation between aerial and ground-level cropped (single camera) images. Zhai et al. [45] explored the possibility of synthesizing ground-level panorama from aerial images. Here, we transform the ground level panorama to aerial view.

We adopt the X-Fork generator architecture of [31] to train the GAN. X-Fork is a multi-task learning architecture that synthesizes a cross-view image as well as a semantic segmentation map. We make the following modifications to the X-Fork architecture. Since our input is a panorama (rectangular in shape), the feature maps at the bottleneck are also rectangular (1 x 4). We reshape the features into squares (2 x 2), and then apply multiple upconvolution operations to generate 512 x 512 resolution aerial images. Next, we exploit the edgemaps of input images to provide the network with a notion of object shapes and boundaries. We employ Canny edge detection [6] to obtain the edgemaps of the inputs. We stack an edgemap together with the panorama along the channels to create a 4-channel input: 3 channels for the RGB image and 1 channel for the edgemap. The output is an RGB image and its segmentation map in aerial view. We utilize the synthesized aerial images in joint feature learning experiments.
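The two input-side modifications can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the toy panorama size and the channel count of 512 are assumptions, the edgemap is a placeholder array standing in for the output of a Canny detector, and the row-major ordering used when folding the 1 x 4 bottleneck into 2 x 2 is only one possible choice.

```python
import numpy as np

def stack_edgemap(panorama_rgb, edgemap):
    """Stack an H x W x 3 RGB panorama and an H x W edgemap into H x W x 4."""
    if panorama_rgb.shape[:2] != edgemap.shape:
        raise ValueError("panorama and edgemap must share height and width")
    return np.concatenate([panorama_rgb, edgemap[..., None]], axis=-1)

def reshape_bottleneck(features):
    """Fold a 1 x 4 x C bottleneck feature map into 2 x 2 x C (row-major order)."""
    h, w, c = features.shape
    if (h, w) != (1, 4):
        raise ValueError("expected a 1 x 4 spatial bottleneck")
    return features.reshape(2, 2, c)

# Toy shapes only; the panorama size and C = 512 are assumptions.
panorama = np.zeros((256, 1024, 3), dtype=np.float32)
edges = np.zeros((256, 1024), dtype=np.float32)  # placeholder for a Canny edgemap
net_input = stack_edgemap(panorama, edges)       # 4-channel generator input

bottleneck = np.arange(4 * 512, dtype=np.float32).reshape(1, 4, 512)
square = reshape_bottleneck(bottleneck)          # square map fed to the decoders
```

With this ordering, the first two panorama positions form the top row of the square map and the last two form the bottom row.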

(a) Joint Feature Learning. Jointly training on the ground image $I_g$ and the synthesized aerial image $I_{a'}$ by matching them against the real aerial image $I_a$, with an auxiliary loss between $I_{a'}$ and $I_a$, helps pull the features $f_g$ and $f_a$ closer and reduces the domain gap between the two features more than training a two-stream network on ($I_g$, $I_a$) pairs.
(b) Feature Fusion. We fuse the features $f_g$ and $f_{a'}$ obtained from 2(a) and find their common representation $f_g^*$ using a fully-connected layer (FC). Similarly, $f_a$ is mapped to $f_a^*$, a representation closer to $f_g^*$. The fully-connected layers are trained, whereas the rest of the network is frozen.
Figure 2: Architectures for our proposed approaches. The green and blue triangles are encoders with the same architecture as shown in Figure 3.
Figure 3: Our multi-scale feature aggregation architecture. The concatenation symbol in the figure represents channel-wise concatenation of feature maps.

3.2 Joint Feature Learning

In this work, we utilize the synthesized aerial images constructively as complementary information for the network to exploit during joint feature learning. We build upon the generative power of conditional GANs to generate an aerial image for a ground-level panorama. Ideally, the synthesized aerial image alone should be able to retrieve its matching gallery image, which would show that cross-view image matching can be solved as cross-view image synthesis followed by same-view matching. However, cross-view image synthesis has its own challenges, as explained in [31], and the synthesized images are noisy; hence they cannot be relied upon wholly. Therefore, we propose to learn the representations for image triads ($I_g$, $I_a$ and $I_{a'}$: the ground image, the real aerial image and the synthesized aerial image) jointly, so that the synthesized aerial image representations $f_{a'}$ push the image representations $f_g$ and $f_a$ closer to each other.

The joint feature learning architecture is shown in Figure 2(a). The encoder blocks are shown as green (for the ground image) and blue (for aerial images) triangles. Each encoder consists of a deep convolutional architecture as shown in Figure 3. We employ multi-scale feature aggregation by preserving the features at three scales, exploiting the inherent multi-scale pyramidal structure of features at multiple layers of deep neural networks. We consider the features from the final three convolutional layers: conv_6, conv_7 and conv_8. These features are aggregated and followed by a fully-connected layer to obtain the feature representation for images in each view.
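A minimal NumPy sketch of this aggregation step is given here, assuming the three feature maps are flattened and concatenated before a single fully-connected layer. The feature-map shapes, the random weights and the function name are hypothetical; only the 1,000-dimensional output size comes from the implementation details in Section 4.2.

```python
import numpy as np

def aggregate_multiscale(feature_maps, fc_weight, fc_bias):
    """Flatten each feature map, concatenate them, and apply one FC layer."""
    flat = np.concatenate([f.reshape(-1) for f in feature_maps])
    return flat @ fc_weight + fc_bias

rng = np.random.default_rng(0)

# Hypothetical H x W x C outputs of conv_6, conv_7 and conv_8.
conv6 = rng.standard_normal((4, 16, 8))
conv7 = rng.standard_normal((2, 8, 8))
conv8 = rng.standard_normal((1, 4, 8))

in_dim = conv6.size + conv7.size + conv8.size   # 512 + 128 + 32
out_dim = 1000                                  # descriptor size from Section 4.2
W = rng.standard_normal((in_dim, out_dim)) * 0.01
b = np.zeros(out_dim)

descriptor = aggregate_multiscale([conv6, conv7, conv8], W, b)
```

Flattening keeps the spatial layout of each scale in the descriptor, which is the property the pooling ablation in Section 5.4 is concerned with.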

The encoders for aerial and street-view images do not share weights. Since the cross-view images are captured from different viewpoints, the visual entities exhibit drastic domain changes. The two encoders operate on these sets of diverse images, so weight sharing is not a good choice. On the other hand, the encoders for the real aerial image $I_a$ and the synthesized aerial image $I_{a'}$ share weights, since both images represent the aerial domain. This way, the aerial encoders learn weights suitable for the synthesized image $I_{a'}$ as well as the real image $I_a$. Thus, $f_{a'}$ effectively forces the features $f_a$ to be closer to $f_g$ and bridges the domain gap between the two views. This is possible because the transformed image $I_{a'}$ captures representations of the ground image $I_g$ that are easier for the network to learn from than when learning directly from $I_g$.

This strategy leverages the synthesized images at training time, but does not require them during testing. The auxiliary loss between $I_{a'}$ and $I_a$ influences the aerial image encoder to learn representations for aerial images by looking at the synthesized aerial image. We train our network jointly on these image triads ($I_g$, $I_a$ and $I_{a'}$) using the weighted soft-margin ranking loss [17], which is explained next.

3.2.1 Weighted Soft-margin Triplet Loss

Consider a feature embedding $f_g$ for a ground-level image, $f_{a\text{-}pos}$ for the corresponding matching aerial image and a non-matching aerial image feature $f_{a\text{-}neg}$. The triplet loss [15] aims to bring the matching feature $f_{a\text{-}pos}$ closer to $f_g$ while at the same time pushing $f_{a\text{-}neg}$ away. Here, if $d_p$ is the Euclidean distance between positive samples ($f_g$, $f_{a\text{-}pos}$) and $d_n$ is the Euclidean distance between negative/non-matching samples ($f_g$, $f_{a\text{-}neg}$), we try to minimize $d_p$ as well as maximize $d_n$. The triplet loss is expressed as shown below:

$$L_{triplet} = \max(0,\; m + d_p - d_n) \tag{1}$$

where $m$ is a margin that specifies a minimum distance between non-matching pairs.

In order to avoid the necessity of explicitly deciding the margin for the triplet loss, the soft-margin triplet loss is popular and is expressed as shown in Equation 2.

$$L_{soft} = \ln(1 + e^{d}) \tag{2}$$

where $d = d_p - d_n$.

We use the weighted soft-margin triplet loss [17] as shown in Equation 3 in our work. We use $\alpha$ = 10, which results in better convergence than $\alpha$ = 1.

$$L_{weighted} = \ln(1 + e^{\alpha d}) \tag{3}$$

We incorporate the auxiliary loss between the synthesized aerial images and the real aerial images along with the loss between the aerial and the ground images for joint feature learning using Equation 4. Here, $\lambda_1$ and $\lambda_2$ are balancing factors between the losses for ($I_g$, $I_a$) and ($I_{a'}$, $I_a$) pairs respectively.

$$L_{joint} = \lambda_1 L_{weighted}(I_g, I_a) + \lambda_2 L_{weighted}(I_{a'}, I_a) \tag{4}$$
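The losses in this subsection can be written compactly in NumPy. The sketch below assumes the standard forms from [15] and [17]: a hinge on margin + d_p - d_n for the triplet loss and ln(1 + exp(alpha * (d_p - d_n))) for the weighted soft-margin loss. The function names and the example distances are hypothetical; the defaults alpha = 10, lam1 = 10 and lam2 = 1 are the values stated in the text.

```python
import numpy as np

def triplet_loss(d_pos, d_neg, margin):
    """Hinge triplet loss: max(0, margin + d_pos - d_neg)."""
    return np.maximum(0.0, margin + d_pos - d_neg)

def weighted_soft_margin(d_pos, d_neg, alpha=10.0):
    """ln(1 + exp(alpha * (d_pos - d_neg))), computed stably with logaddexp."""
    return np.logaddexp(0.0, alpha * (d_pos - d_neg))

def joint_loss(d_pos_g, d_neg_g, d_pos_s, d_neg_s,
               lam1=10.0, lam2=1.0, alpha=10.0):
    """Weighted sum of the ground-vs-aerial loss and the auxiliary
    synthesized-vs-aerial loss; the *_g distances use the ground feature as
    anchor and the *_s distances use the synthesized aerial feature."""
    main = weighted_soft_margin(d_pos_g, d_neg_g, alpha)
    aux = weighted_soft_margin(d_pos_s, d_neg_s, alpha)
    return lam1 * main + lam2 * aux

# Illustrative distances only: a well-separated ground triplet and a
# synthesized-image triplet whose negative is slightly closer than its positive.
example = joint_loss(d_pos_g=0.2, d_neg_g=1.0, d_pos_s=0.6, d_neg_s=0.5)
```

Setting alpha = 1 in `weighted_soft_margin` recovers the plain soft-margin loss, so one function covers both variants.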


3.3 Feature Fusion

In the above method, the synthesized aerial image is used only during training, to bridge the domain gap between the real aerial and ground-view images, and is neglected during testing. Since the features of the synthesized image contain complementary information that assists in joint feature learning, we attempt to exploit them further. We fuse the ground image features with the synthesized aerial image features to find a robust representation for the query ground image.

The fusion architecture is shown in Figure 2(b). We use the trained joint feature learning network as the feature extractor for our feature fusion task. We first concatenate the features $f_g$ from the ground query image with the features $f_{a'}$ from the synthesized aerial image. The concatenated features need to be refined to obtain a generalized representation $f_g^*$ for the query image $I_g$. We achieve this by passing them through a fully-connected layer in the upper stream. The features $f_a$ from the lower stream need to be optimized against the refined features from the upper fully-connected layer. So, we add a fully-connected layer in the lower stream that learns the generalized representations, $f_a^*$, for the aerial images. During testing, the fused feature representation $f_g^*$ for the query image $I_g$ is compared against the representations $f_a^*$ for aerial images for image matching.
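The fusion step can be sketched as two linear maps followed by nearest-neighbour search. This is an illustration under stated assumptions rather than the trained model: the weights are random stand-ins for the two trainable fully-connected layers, the toy dimensions are arbitrary, and all function and variable names are hypothetical.

```python
import numpy as np

def fuse_query(f_ground, f_synth, weight, bias):
    """Upper stream: concatenate ground and synthesized-aerial features,
    then apply a fully-connected layer."""
    return np.concatenate([f_ground, f_synth]) @ weight + bias

def map_reference(f_aerial, weight, bias):
    """Lower stream: map aerial features with their own fully-connected layer."""
    return f_aerial @ weight + bias

def retrieve(query, references):
    """Return reference indices sorted by Euclidean distance to the query."""
    dists = np.linalg.norm(references - query, axis=1)
    return np.argsort(dists)

rng = np.random.default_rng(0)
D, D_OUT, N = 8, 6, 5                      # toy sizes, not the paper's
W_q, b_q = rng.standard_normal((2 * D, D_OUT)), np.zeros(D_OUT)
W_r, b_r = rng.standard_normal((D, D_OUT)), np.zeros(D_OUT)

f_ground = rng.standard_normal(D)          # frozen-encoder feature of the query
f_synth = rng.standard_normal(D)           # frozen-encoder feature of its synthesized view
f_aerial = rng.standard_normal((N, D))     # frozen-encoder features of N references

query_star = fuse_query(f_ground, f_synth, W_q, b_q)
refs_star = map_reference(f_aerial, W_r, b_r)
ranking = retrieve(query_star, refs_star)  # ranking[0] is the top-1 match
```

Only the two linear maps would be trained in this setup, which mirrors the statement that the rest of the network is frozen.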

4 Experimental Setup

This section describes the datasets we used and the experimental setups we followed in our work.

4.1 Datasets

We conduct experiments on CVUSA dataset [45] to compare our work with existing methods. We also collect a new dataset from urban areas of Orlando and Pittsburgh with geo-information.

CVUSA: CVUSA is a benchmark dataset for cross-view image matching with 35,532 satellite and ground-panorama image pairs for training and 8,884 pairs for testing. Aerial images have a resolution of 750 x 750 and ground panoramas have a resolution of 224 x 1232. Sample images from this dataset are shown in Figure 4.

Orlando-Pittsburgh (OP) dataset: We collect image pairs with ground-truth geo-locations from two US cities, Orlando and Pittsburgh. We call it the Orlando-Pittsburgh (OP) dataset. The dataset covers urban areas of the cities, entirely different from the rural areas in the CVUSA dataset. Figure 8 shows some example images of this dataset. The dataset contains 1,910 training and 722 testing pairs of aerial and ground-panorama images. The resolutions are 640 x 640 for aerial images and 416 x 832 for panoramas. The primary motivations for collecting this dataset are to evaluate the generalization of the proposed methods in urban locations, to compute matching accuracy in terms of distance (meters), and the lack of such datasets in the public domain. Though small in scale, this dataset will provide some insights for future research in this direction. The dataset will be made publicly available.

4.2 Implementation Details

We explain the implementation details of our cross-view synthesis network and cross-view matching networks next.

Cross-View Synthesis Network:

The generator of the cross-view synthesis network has an encoder and two decoders. The input to the encoder is a 4-channel image: 3 RGB channels and an edgemap, stacked together. The decoders generate the cross-view image and its segmentation map for a given input. The network consists of blocks of Convolution, Batch Normalization and Leaky ReLU layers. Convolutional kernels of size 4 x 4 with a stride of 2 are used, so the feature maps are downsampled after each convolution and upsampled after each upconvolution operation. We reshape the features at the bottleneck to adjust the feature shape and pass them through the decoders. The first six blocks of the decoders share weights, whereas the final two blocks do not. The discriminator network has an architecture similar to the encoder of the generator. We train the GAN end-to-end using a Torch implementation. The weights are initialized from a random Gaussian distribution with zero mean and 0.02 standard deviation.

Cross-View Matching Networks: Each stream (encoder) of our networks consists of eight convolutional layers, each followed by a ReLU activation. We apply dropout on the final three layers after the ReLUs. We flatten the features after these dropouts and concatenate them to obtain a multi-scale representation of the input image. We finally add a fully-connected layer for dimensionality reduction to obtain a 1,000-dimensional feature vector for each input. We train the two-stream baselines from scratch with Xavier initialization. For joint feature learning, we initialize the network with weights from the two-stream network trained on ($I_g$, $I_a$) image pairs (ground and real aerial images) and optimize the loss function shown in Equation 4. We use $\lambda_1$ = 10 and $\lambda_2$ = 1, weighting the loss term for ($I_g$, $I_a$) pairs more heavily because of their superior performance over ($I_{a'}$, $I_a$) pairs in image matching, as reported in Table 1, and because the synthesized aerial image $I_{a'}$ serves only as auxiliary information during training in joint feature learning.

We implement our models using Tensorflow [1] with Adam optimizer (lr = ) and dropout = 0.5. We use a batch size of B = 30 for experiments on two-stream networks and B = 24 for joint feature learning networks. We use the weighted soft-margin triplet loss for training in all experiments. We employ an exhaustive mini-batch strategy [38] to maximize the number of triplets within each batch. In a batch of B image pairs, each ground image has 1 positive pair and (B-1) negative pairs, and each aerial image likewise has (B-1) negative pairs. So, for B pairs, we have B positive pairs and 2 x B x (B-1) negative pairs. We then continue training with in-batch hard negative mining, training each positive pair against its hardest negative sample (i.e. the non-matching sample with the smallest distance) in the batch.
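The triplet count and the hard-negative selection described above can be checked with a short NumPy sketch. It assumes matching pairs share the same index within the batch, uses the weighted soft-margin form ln(1 + exp(alpha * (d_pos - d_neg))), and averages over triplets; the averaging, the feature dimension and all names are assumptions made for illustration.

```python
import numpy as np

def pairwise_distances(f_ground, f_aerial):
    """dist[i, j] = Euclidean distance between ground i and aerial j."""
    diff = f_ground[:, None, :] - f_aerial[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def exhaustive_batch_loss(f_ground, f_aerial, alpha=10.0):
    """Mean weighted soft-margin loss over all in-batch triplets,
    plus the number of triplets used."""
    dist = pairwise_distances(f_ground, f_aerial)
    d_pos = np.diag(dist)                       # matching pairs on the diagonal
    off_diag = ~np.eye(len(dist), dtype=bool)
    # Ground anchor i against every non-matching aerial j.
    g2a = np.logaddexp(0.0, alpha * (d_pos[:, None] - dist))[off_diag]
    # Aerial anchor j against every non-matching ground i.
    a2g = np.logaddexp(0.0, alpha * (d_pos[None, :] - dist))[off_diag]
    losses = np.concatenate([g2a, a2g])
    return losses.mean(), losses.size

def hardest_negatives(dist):
    """For each ground anchor, the index of the closest non-matching aerial."""
    masked = dist + np.diag(np.full(len(dist), np.inf))
    return masked.argmin(axis=1)

rng = np.random.default_rng(0)
B, D = 30, 16                                   # B from the text; D is arbitrary
f_ground = rng.standard_normal((B, D))
f_aerial = f_ground + 0.1 * rng.standard_normal((B, D))

loss, num_triplets = exhaustive_batch_loss(f_ground, f_aerial)
hard_idx = hardest_negatives(pairwise_distances(f_ground, f_aerial))
```

For B = 30 the sketch uses 2 x 30 x 29 = 1,740 triplets per batch, matching the count given in the text.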

We first train the GAN to generate the cross-view image $I_{a'}$ for the ground panorama $I_g$. Next, we use the synthesized images for joint feature learning in our proposed methods.

5 Results

We present an extensive analysis of our proposed method demonstrating the effectiveness of synthesized images for image retrieval to bridge the domain gap between the cross-view images. We also provide the comparison of our work with the state-of-the-art methods on the CVUSA dataset. Finally, we present an evaluation on geo-localization task on the OP dataset.

5.1 Evaluation Metric

The common metric for evaluating image-based matching is recall accuracy. A match is successful for a query street-view image if the correct match lies within the set of $k$ closest images in terms of Euclidean distance between the representative features. We report top-1% accuracy for ease of comparison with previous works. We also report top-1 and top-10 recalls on the CVUSA dataset.
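A small NumPy sketch of this metric follows, assuming query i and reference i form the ground-truth pair. The function names are hypothetical, and taking the floor of 1% of the gallery size for top-1% is an assumed convention that the text does not specify.

```python
import numpy as np

def recall_at_k(query_feats, ref_feats, k):
    """Fraction of queries whose true match (same index) is among the k
    nearest references in Euclidean distance."""
    diff = query_feats[:, None, :] - ref_feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d_true = np.diag(dist)
    # Rank of the true match = number of references strictly closer than it.
    rank = (dist < d_true[:, None]).sum(axis=1)
    return float((rank < k).mean())

def top_one_percent_k(num_refs):
    """Gallery cut-off for top-1% recall (floor is an assumed convention)."""
    return max(1, num_refs // 100)

# Two toy queries whose true matches are both the second-nearest reference.
queries = np.array([[0.0, 0.0], [10.0, 10.0]])
references = np.array([[3.0, 0.0], [1.0, 0.0]])

r1 = recall_at_k(queries, references, k=1)
r2 = recall_at_k(queries, references, k=2)
k_cvusa = top_one_percent_k(8884)   # CVUSA test set size from Section 4.1
```

The dense distance matrix is convenient for a sketch; a test set of several thousand images would normally be processed in chunks.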

Figure 4: Image retrieval examples on CVUSA dataset [45]. For each query ground image, the synthesized aerial image is shown alongside, followed by the five closest aerial images retrieved by proposed Feature Fusion method. The correct matching (ground truth) aerial images are shown in green boxes. Rows 5, 6 and 7 show examples where the ground truth aerial images are retrieved at the second, fourth and fifth positions respectively.

5.2 Comparison of Our Approaches

We evaluate our model variants in terms of retrieval accuracy on the CVUSA dataset [45]. The result is reported in Table 1.

Baseline Comparison: The two-stream networks trained on image pairs ($I_g$, $I_a$) and ($I_{a'}$, $I_a$) are the baselines, where $I_g$ is the ground image, $I_a$ the real aerial image and $I_{a'}$ the synthesized aerial image. Here, either $I_g$ or $I_{a'}$ is trained separately with $I_a$. We observe that the synthesized image $I_{a'}$ as a query performs quite well, with 72.58% top-1% recall, but lower than $I_g$ as query (82.94%). This means that the synthesized images are not yet completely dependable for cross-view image retrieval and we need to consider real ground images as well. This motivated joint feature learning.

Joint Feature Learning: For joint feature learning, as explained earlier, image triads ($I_g$, $I_a$ and $I_{a'}$) are used during training and only ($I_g$, $I_a$) pairs are used during testing. We report an improvement of about 9% in top-1% retrieval accuracy over the two-stream baseline ($I_g$, $I_a$) by joint feature learning. The improvement suggests that the synthesized aerial images have features complementary to ground image features that help the network learn better representations for aerial images during joint feature learning. The synthesized aerial image, acting as auxiliary information between the ground and aerial images, succeeds in bringing their feature representations closer to each other during joint feature learning.

Since the representations for $I_g$, $I_a$ and $I_{a'}$ were learned together during joint feature learning, we were curious to evaluate how well the feature representations for $I_{a'}$ do in image matching. Unsurprisingly, we obtain an improvement of about 14% in top-1% retrieval accuracy over the two-stream baseline ($I_{a'}$, $I_a$). This improvement further consolidated the belief that the learned features for $I_g$ and $I_{a'}$ are complementary to each other and can be fused together to obtain a robust descriptor for the ground image.

Feature Fusion: The Feature Fusion approach fuses the synthesized image features with the ground image features to obtain a representative feature for the query. This provides a further improvement of 3.89% in top-1% accuracy. The significance of feature fusion is reflected in the improvement of about 19% in top-1 retrieval accuracy over joint feature learning. This improvement further signifies that the synthesized image features are complementary to street-view image features and can be exploited to obtain better features for cross-view matching. The qualitative results are shown in Figure 4. The query ground images and the synthesized aerial images along with the five closest images are shown in each row.

| Method | Top-1 | Top-10 | Top-1% |
|---|---|---|---|
| Workman et al. [40] | - | - | 34.3% |
| Zhai et al. [45] | - | - | 43.2% |
| Vo and Hays [38] | - | - | |
| CVM-Net-I [17] | 22.53% | | 91.4% |
| CVM-Net-II [17] | 11.18% | 43.51% | 87.2% |
| Two-stream baseline ($I_{a'}$, $I_a$) | | | 72.58% |
| Two-stream baseline ($I_g$, $I_a$) | 18.45% | 48.98% | 82.94% |
| Joint Feat. Learning ($I_{a'}$, $I_a$) | 14.31% | 48.75% | 86.47% |
| Joint Feat. Learning ($I_g$, $I_a$) | 29.75% | 66.34% | 92.09% |
| Feature Fusion | 48.75% | 81.27% | 95.98% |

Table 1: Comparison of Top-1, Top-10 and Top-1% recall with previous methods on CVUSA Dataset [45].

5.3 Comparison to Existing Methods

We compare our work with the previous approaches by [40, 45, 38, 17] on the CVUSA dataset [45]. We report the top-1, top-10 and top-1% accuracies for the state-of-the-art CVM-Net [17] and our methods. The results are shown in Table 1. We observe that Joint Feature Learning outperforms the previous works and is further boosted by Feature Fusion. We achieve an overall 4.58% improvement over SOTA CVM-Net [17] in top-1% recall accuracy. We obtain significant increments of more than 26% and 18% in top-1 and top-10 accuracies over CVM-Net-I [17]. We also plot top-K recall accuracy for K = 1 to 80 for our methods as compared with previous approaches in Figure 5. It illustrates that our proposed methods outperform the existing state-of-the-art approaches for all values of K.

Figure 5: Comparison of different versions of our methods with CVM-Net I and CVM-Net II [17] on CVUSA dataset [45].

5.4 Ablation Study

We conduct the following ablation studies to understand the impact of different choices made for our architectures. The ablations are conducted on the two-stream network with ground and real aerial image pairs ($I_g$, $I_a$) unless stated otherwise. This is done to avoid complexities during the training and the explanation of the ablations.

| Method | Top-1 | Top-10 | Top-1% |
|---|---|---|---|
| Synthesized Aerial Image | | | |
| Weight Sharing | | | |
| Single Scale Features | 1.41% | 6.53% | 25.05% |
| Two-stream ($I_g$, $I_a$) | 18.45% | 48.98% | 82.94% |

Table 2: Ablation Study on CVUSA Dataset [45].
Figure 6: Example images where synthesized aerial image features, $f_{a'}$, were able to retrieve correct matches and where ground image features, $f_g$, failed. The first two rows show cases where the synthesized aerial image $I_{a'}$ as query successfully retrieved the correct match $I_a$, but the ground image $I_g$ as query failed in two-stream training. $I_g$ was successful after joint feature learning. The second two rows show examples where $I_g$, after joint training, was not able to retrieve the correct match but feature fusion of $f_g$ with $f_{a'}$ was successful in retrieval.

Synthesized Aerial vs. Real Ground Image as query: This ablation evaluates the impact of synthesized aerial images in two respects. First, it quantifies the amount of information that the synthesized images capture from the ground panorama and that can be exploited during joint feature learning. Second, it shows the importance of relying on the real ground-view query image. The retrieval accuracy is provided in Table 2. We observe that the synthesized images preserve a fair amount of semantic information from the ground image that can be used for same-view retrieval. Also, the results on cross-view matching with the true ground image as query signify that the synthesized image cannot be completely relied upon; hence we need a way to exploit the features from these two images to obtain a robust query descriptor.

We also analyze the significance of synthesized images in terms of their contributions to image retrieval. Figure 6 shows instances where synthesized aerial image features were valuable assets in retrieval. The first two rows show examples where the synthesized aerial image $I_{a'}$ as query was able to retrieve the correct match while the ground image feature $f_g$ as query failed for two-stream networks. However, the ground image $I_g$ as query was successful in the retrieval process after joint feature learning. The next two rows show examples where fusion of $f_g$ with the synthesized image feature $f_{a'}$ helped retrieve the correct match while $I_g$ failed even after joint feature learning. These examples support the view that the synthesized images are an integral part of our joint feature learning and feature fusion approaches.

Single Scale vs. Multi-scale Features: For this ablation, we train our two-stream architecture with single scale features. We consider only the features after final convolutional block (conv_8) for single scale features. The scores are reported in Table 2. The results signify that features from conv_6 and conv_7 are also crucial in image matching rather than just using the features from final conv_8 layer only. The results demonstrate that it is very important to aggregate the multi-scale features for cross-view matching.

Weight Sharing vs. No Weight Sharing: We believe that two branches receiving input images from completely different viewpoints, as is the case with aerial and ground-view images, should not share weights. Even though the networks look at the same scene contents, their representations from the two views are drastically different, which suggests that the networks should freely evolve their weights based on the input they receive. The results are reported in Table 2. The numbers clearly suggest that not sharing weights is the better choice.

Pooling vs. No Pooling: We also conduct ablations on using global average pooling [22] in our experiments. Global average pooling is a popular approach to reduce the spatial dimensions of the features and consequently reduce the number of parameters in the network. We experimented with using global average pooling layer before concatenating the features from multiple scales. The result is reported in Table 2. We observe that the loss of spatial information in features severely impacts the retrieval performance.

| Two-stream ($I_g$, $I_a$) | Joint Feat. Learning | Feature Fusion |
Table 3: Top-1 retrieval accuracy on Orlando-Pittsburgh Dataset.
Figure 7: Geo-localization results on the OP dataset with different error thresholds.
Figure 8: Image retrieval examples on the OP dataset. The correct aerial image matches are shown in green borders. The numbers below each aerial image shows its distance in meters from query ground image.

5.5 Cross-view Localization

We use the Orlando-Pittsburgh (OP) dataset for image-based geo-localization. We determine the GPS location of the query image by assigning it the location of the closest retrieved aerial image. The query image is considered correctly geo-localized if the assigned location lies within a threshold distance (in meters) of its ground-truth position.

The recall accuracy as a function of the distance threshold in meters is plotted in Figure 7. We observe that our proposed Feature Fusion method retrieves images close to the query's true geo-location with higher accuracy than the baseline, which can be attributed to its superior Top-1 recall.
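The thresholded evaluation described above can be sketched as follows. The paper does not state how the distance in meters is computed, so the haversine great-circle distance used here is an assumption, as are the function names and the example coordinates.

```python
import math

EARTH_RADIUS_M = 6371000.0  # mean Earth radius, an approximation

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def localization_accuracy(predicted, ground_truth, threshold_m):
    """Fraction of queries whose assigned location is within threshold_m of the truth."""
    hits = sum(
        haversine_m(*p, *g) <= threshold_m
        for p, g in zip(predicted, ground_truth)
    )
    return hits / len(ground_truth)

# Hypothetical coordinates: the first query is assigned its exact location,
# the second is off by 0.001 degrees of latitude (roughly 111 m).
ground_truth = [(28.6024, -81.2001), (40.4406, -79.9959)]
predicted = [(28.6024, -81.2001), (40.4416, -79.9959)]

acc_50 = localization_accuracy(predicted, ground_truth, threshold_m=50.0)
acc_200 = localization_accuracy(predicted, ground_truth, threshold_m=200.0)
print(acc_50, acc_200)
```

Sweeping threshold_m over a range of values and plotting the resulting accuracies would give a curve of the kind shown in Figure 7.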

The image retrieval examples on the OP dataset are shown in Figure 8. Each ground query image is followed by its five closest aerial images. Even though the retrieved images are very similar to each other, we are able to retrieve the correct match at the Top-1 position. The Top-1 recall is reported in Table 3. The results affirm that the proposed methods generalize to the urban cities of the OP dataset as well as to the rural areas of the CVUSA dataset.

6 Conclusion

In this paper, we have presented a novel and practical approach to cross-view image retrieval that transforms the query image to the target view to obtain a better scene understanding. We showed that the synthesized aerial images can be seamlessly incorporated into the cross-view matching pipeline through joint feature learning to bridge the domain gap between aerial and street-view images. In addition, the ground image features and the corresponding synthesized aerial image features are fused to obtain a robust descriptor of the ground image. We obtained significant improvements over state-of-the-art methods on the challenging CVUSA dataset.


  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 265–283, 2016.
  • [2] A. Almahairi, S. Rajeshwar, A. Sordoni, P. Bachman, and A. Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. In International Conference on Machine Learning, pages 195–204, 2018.
  • [3] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [4] M. Bansal, K. Daniilidis, and H. Sawhney. Ultra-wide baseline facade matching for geo-localization. In A. Fusiello, V. Murino, and R. Cucchiara, editors, Computer Vision – ECCV 2012. Workshops and Demonstrations, pages 175–186, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
  • [5] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2874–2883, 2016.
  • [6] J. Canny. A computational approach to edge detection. In Readings in computer vision, pages 184–203. Elsevier, 1987.
  • [7] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [8] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
  • [9] X. Deng, Y. Zhu, and S. Newsam. What is it like down there?: Generating dense ground-level views and image features from overhead imagery using conditional generative adversarial networks. In Proceedings of the 26th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, SIGSPATIAL ’18, pages 43–52, New York, NY, USA, 2018. ACM.
  • [10] N. Ghouaiel and S. Lefèvre. Coupling ground-level panoramas and aerial imagery for change detection. Geo-spatial Information Science, 19(3):222–232, 2016.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [12] B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation and fine-grained localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 447–456, 2015.
  • [13] J. Hays and A. A. Efros. im2gps: estimating geographic information from a single image. In Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2008.
  • [14] J. Hays and A. A. Efros. Large-scale image geolocalization. In Multimodal Location Estimation of Videos and Images, 2015.
  • [15] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. CoRR, abs/1703.07737, 2017.
  • [16] S. Honari, J. Yosinski, P. Vincent, and C. Pal. Recombinator networks: Learning coarse-to-fine feature aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5743–5752, 2016.
  • [17] S. Hu, M. Feng, R. M. H. Nguyen, and G. Hee Lee. Cvm-net: Cross-view matching network for image-based ground-to-aerial geo-localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [18] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [19] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1857–1865, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
  • [20] T. Kong, A. Yao, Y. Chen, and F. Sun. Hypernet: Towards accurate region proposal generation and joint object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 845–853, 2016.
  • [21] G. Lin, A. Milan, C. Shen, and I. Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In CVPR, July 2017.
  • [22] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
  • [23] T.-Y. Lin, S. Belongie, and J. Hays. Cross-view image geolocalization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
  • [24] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
  • [25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [26] L. v. d. Maaten and G. Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605, 2008.
  • [27] M. Mirza and S. Osindero. Conditional generative adversarial nets. CoRR, abs/1411.1784, 2014.
  • [28] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
  • [29] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [30] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1060–1069, New York, New York, USA, 20–22 Jun 2016. PMLR.
  • [31] K. Regmi and A. Borji. Cross-view image synthesis using conditional gans. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [32] K. Regmi and A. Borji. Cross-view image synthesis using geometry-guided conditional gans. arXiv preprint arXiv:1808.05469, 2018.
  • [33] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597, 2015.
  • [34] T. Sattler, M. Havlena, K. Schindler, and M. Pollefeys. Large-scale location recognition and the geometric burstiness problem. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [35] Q. Shan, C. Wu, B. Curless, Y. Furukawa, C. Hernandez, and S. M. Seitz. Accurate geo-registration by ground-to-aerial image matching. In 2014 2nd International Conference on 3D Vision, volume 1, pages 525–532. IEEE, 2014.
  • [36] Y. Tian, C. Chen, and M. Shah. Cross-view image matching for geo-localization in urban environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3616, 2017.
  • [37] A. Torii, J. Sivic, and T. Pajdla. Visual localization by linear combination of image descriptors. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 102–109. IEEE, 2011.
  • [38] N. N. Vo and J. Hays. Localizing and Orienting Street Views Using Overhead Imagery, pages 494–509. Springer International Publishing, Cham, 2016.
  • [39] S. Workman and N. Jacobs. On the location dependence of convolutional neural network features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 70–78, 2015.
  • [40] S. Workman, R. Souvenir, and N. Jacobs. Wide-area image geolocalization with aerial reference imagery. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [41] Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE International Conference on Computer Vision, pages 2849–2857, 2017.
  • [42] A. R. Zamir and M. Shah. Accurate image localization based on google maps street view. In K. Daniilidis, P. Maragos, and N. Paragios, editors, Computer Vision – ECCV 2010, pages 255–268, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
  • [43] A. R. Zamir and M. Shah. Image geo-localization based on multiple nearest neighbor feature matching using generalized graphs. IEEE Trans. Pattern Anal. Mach. Intell., 36(8):1546–1558, 2014.
  • [44] E. Zemene, Y. T. Tesfaye, H. Idrees, A. Prati, M. Pelillo, and M. Shah. Large-scale image geo-localization using dominant sets. IEEE transactions on pattern analysis and machine intelligence, 41(1):148–161, 2019.
  • [45] M. Zhai, Z. Bessinger, S. Workman, and N. Jacobs. Predicting ground-level scene layout from aerial imagery. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [46] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • [47] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
  • [48] X. Zhu, Z. Yin, J. Shi, H. Li, and D. Lin. Generative adversarial frontal view to bird view synthesis. In 2018 International Conference on 3D Vision (3DV), pages 454–463, Sep. 2018.

1 Visualization and Interpretation of Features

In Figure 9, we visualize the aerial and ground image features obtained using the two-stream baseline and the proposed feature fusion method for 100 images from the CVUSA dataset [45]. The feature representation of each image is a 1000-dimensional vector, and we apply t-SNE to learn two-dimensional embeddings for ease of visualization. In the subplots, red and cyan circles represent the features of a ground image and of its corresponding ground-truth aerial image, respectively; circles of a matching pair that lie close to each other or overlap indicate similar features.
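The embedding step can be sketched with scikit-learn's TSNE. The descriptors below are random stand-ins rather than features from the trained networks, and embedding the ground and aerial features jointly, as well as the perplexity and initialization settings, are assumptions made for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in descriptors: 100 ground and 100 aerial features, 1000-D each.
# Each aerial feature is a noisy copy of its ground counterpart.
ground = rng.standard_normal((100, 1000))
aerial = ground + 0.1 * rng.standard_normal((100, 1000))

# Embed both sets together so they share one 2-D coordinate system.
features = np.vstack([ground, aerial])
embedding = TSNE(
    n_components=2, perplexity=30.0, init="pca", random_state=0
).fit_transform(features)

# Rows 0..99 are ground features, rows 100..199 the matching aerial features.
ground_2d, aerial_2d = embedding[:100], embedding[100:]
print(ground_2d.shape, aerial_2d.shape)
```

Plotting ground_2d and aerial_2d as two differently colored scatter series would produce a figure in the style of the subplots described here.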

The scatter plot for the features obtained using the two-stream baseline trained on (ground, aerial) image pairs is shown on the left. We observe that, for each image pair, there is less overlap between the aerial and the ground image features. We also notice that the features from different image pairs are located close to each other, with some instances of red circles overlapping each other.

The scatter plot for the representations obtained using the proposed feature fusion method trained on image triads (ground, synthesized aerial, aerial) is shown in the right subplot. We observe higher overlap between the features of ground images and their corresponding aerial images. At the same time, we observe greater separation between the feature embeddings of different image samples.

Thus, the use of synthesized aerial images in our proposed Feature Fusion method is successful in bringing the feature representations of aerial images closer to those of ground images, bridging the domain gap between these two drastically different views and improving the matching accuracy.

Figure 10: Some examples of failure cases. The numbers on the right show the position where the ground-truth aerial images were retrieved.

2 Failure Examples

We present some failure cases for the proposed Feature Fusion method in Figure 10. Each row shows, in order, the query ground image, the corresponding synthesized aerial image, the image retrieved at the Top-1 position, the ground-truth aerial image, and a number giving the position at which the ground-truth aerial image was retrieved.

Row 1 shows a ground-truth aerial image that contains a water body in its lower right section. The ground image does not provide any information about the water, so the image matching is challenging. The ground truth is retrieved at position 13.

In Row 2, we can observe that the top match and the ground-truth aerial image are very similar to each other. Also, the color distribution of the matched image is more similar to that of the query image than the color distribution of the ground-truth aerial image is. The problem arises because the aerial and ground images of a pair in the dataset are captured at different times and therefore show some visual differences.

Row 3 shows an example where the aerial image contains houses that are not captured in the street-view image because they are occluded by trees. The impact can also be observed in the corresponding synthesized image, which does not contain houses.

Row 4 shows a street-view image that contains a building at a far distance. The building covers a large region of the ground-truth aerial image, which is difficult to infer from the street-view image. Large buildings are also rare in the dataset. As a result, the method fails badly, retrieving the ground-truth image at position 1700.

3 Aerial-to-Ground Image Matching

We conducted experiments for the reverse problem of aerial-to-ground image matching. Here, the aerial image is the query, and we attempt to find the matching ground panorama. First, we use GANs to synthesize ground-level panoramas from the aerial images, and then we use the synthesized images in the proposed joint feature learning and feature fusion methods.

Method Top-1 Top-10 Top-1%
Two-stream baseline (, )
Two-stream baseline (, ) 16.99% 82.11%
Joint Feat. Learning (, ) 16.46% 50.26% 86.26%
Joint Feat. Learning (, ) 27.39% 65.29% 91.46%
Feature Fusion 44.99% 79.37% 95.66%
Table 4: Image matching performance in terms of Top-1, Top-10 and Top-1% recall on CVUSA Dataset [45] for aerial-to-ground matching.
Figure 11: Comparison of our methods with the baselines on CVUSA dataset [45] for reverse problem of aerial-to-ground image matching.
Figure 12: Qualitative Results on CVUSA dataset [45] for aerial-to-ground image matching. Images with green borders are the ground-truth panoramas for the corresponding query images.
Figure 13: Cross-view image retrieval examples on the OP Dataset. Ground-truth aerial images are shown in green boxes. The number below each aerial image is its distance in meters from the query image. The first three rows present the images from Orlando and the next three rows of images are from Pittsburgh.

We conduct experiments for two-stream baselines, joint feature learning and feature fusion methods. The top-1, top-10 and top-1% accuracies are reported in Table 4. We obtain results similar to the numbers reported in the main paper for ground-to-aerial image matching. We also plot the top-K recall for K = 1 to 80 for the proposed method compared to the baselines in Figure 11. This affirms that our method can be applied for image matching in both directions.
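The top-K and top-1% recall metrics used in Table 4 and Figure 11 can be computed as in the sketch below. It assumes Euclidean distance between descriptors and that query i matches reference i; the function names and the toy 2-D descriptors are hypothetical. The dense distance matrix is fine for a small example but would need batching for a full test set.

```python
import numpy as np

def gt_ranks(query_feats, ref_feats):
    """1-based rank of the true match for each query (query i matches reference i)."""
    # Euclidean distance between every query and every reference.
    diffs = query_feats[:, None, :] - ref_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    gt = np.diag(dists)[:, None]
    # Rank = 1 + number of references strictly closer than the true match.
    return 1 + (dists < gt).sum(axis=1)

def recall_at_k(ranks, k):
    """Fraction of queries whose true match is among the k nearest references."""
    return float((ranks <= k).mean())

def recall_at_top_percent(ranks, num_refs, percent=1.0):
    """Recall when the true match must fall within the top `percent` of references."""
    k = max(1, int(np.ceil(num_refs * percent / 100.0)))
    return recall_at_k(ranks, k)

# Toy 2-D descriptors: four references and their four queries.
refs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
queries = np.array([[0.1, 0.0], [0.9, 0.1], [0.4, 0.3], [4.0, 4.0]])

ranks = gt_ranks(queries, refs)
top1 = recall_at_k(ranks, 1)
top3 = recall_at_k(ranks, 3)
top1_percent = recall_at_top_percent(ranks, num_refs=len(refs))
print(ranks, top1, top3, top1_percent)
```

In this toy example the third query is closer to two wrong references than to its true match, so it counts as a miss at K = 1 and a hit at K = 3. Evaluating recall_at_k for K = 1 to 80 gives a curve of the kind plotted in Figure 11.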

The qualitative results for aerial-to-ground image matching are shown in Figure 12. For each example, the query aerial image and the synthesized ground panorama are followed by the three closest matches. The ground-truth panoramas are shown with green borders. We can also observe that the synthesized ground panoramas successfully transfer the semantic information from the aerial to the ground domain.

4 OP Dataset

The existing public datasets for cross-view image matching do not provide geo-information. Also, the images in the CVUSA dataset are collected from rural areas that largely cover land and vegetation, as shown in Figures 10 and 12. To address these issues, we collect a new dataset of cross-view image pairs. The images cover urban areas of Orlando and Pittsburgh. Figure 13 shows example images from this dataset. In contrast to the CVUSA dataset, this dataset mainly contains images of urban areas with buildings and roads and less vegetation.

We conduct experiments on the OP dataset and provide more qualitative results in Figure 13. The number below each aerial image represents its distance in meters from the query ground image. We observe that, although the aerial images look very similar to each other, the proposed feature fusion method is able to retrieve the ground-truth aerial image as the closest match. The quantitative evaluation is provided in the main paper. We obtain similar top-1 accuracies on the CVUSA and OP datasets, which supports the claim that the proposed method generalizes well to both rural and urban data.