Aerial-view geo-localization generally refers to retrieving corresponding images across drone and satellite platforms. It can be deployed in many fields, such as drone navigation, event detection, and aerial photography (Zheng et al., 2020b; Ding et al., 2020; Zeng et al., 2022). In application, given a drone-view image as the query, the retrieval system searches the candidate gallery for the most relevant satellite image. Satellite-view images carry geo-tag information, such as GPS coordinates, so the drone can determine its geographic location from the retrieved match. Compared with ground-view images, which suffer from more occlusion (e.g., trees), drone-view images offer excellent visibility. However, aerial-view geo-localization remains challenging due to the cross-view domain shift caused by changes in viewpoint and environment.
Convolutional neural networks (CNNs) have recently become the dominant choice in aerial-view geo-localization owing to their strong potential to learn invariant image representations. Building on CNNs, a two-branch network (Liu and Li, 2019; Shi et al., 2020b) has been widely employed as the prototype in related studies. Metric learning (Shi et al., 2020a; Hu et al., 2018) and the classification loss (Zheng et al., 2020b; Wang et al., 2021; Zheng et al., 2020a) are the two predominant choices for optimizing this prototype. A number of extensions adjust the spatial layout of image semantics (Shi et al., 2019; Fu et al., 2019) to extract view-invariant features, or deploy part-based matching (Wang et al., 2021; Sun et al., 2018) to roughly align local information. All these existing methods concentrate on mitigating the cross-view domain gap introduced by viewpoint change. One scientific problem arises: how can the model cope with the environmental domain shift? Recently, some researchers (Hendrycks and Dietterich, 2018; Wu et al., 2019) have pointed out that a trained network easily breaks down on unfamiliar input distributions. As a result, existing networks are likely to fail in multi-environment inference for geo-localization. It is well known that bad weather can cause serious flight accidents (Board, 2020; Times, 2020), and a drone is prone to encountering severe weather that was not present before takeoff. Therefore, employing the aerial-view geo-localization system for retrieval in multiple environments is a meaningful and practical topic.
This topic touches on domain generalization (DG) (Blanchard et al., 2011), and a direct approach is to let the geo-localization model 'remember' the distribution of a location in different environments during training. However, many studies (Li et al., 2017; Khosla et al., 2012; Chattopadhyay et al., 2020; Ilse et al., 2020) in domain generalization have demonstrated that forcing the entire model or its features to be domain invariant is challenging. Humans recognize a previously seen location by eliminating the interference caused by environmental changes rather than by remembering how the location looked under different environments. This human cognitive mechanism inspires our research. Specifically, at test time, we hope that the trained drone-to-satellite geo-localization system can adaptively filter out the domain shift caused by environmental changes.
Achieving this goal is non-trivial, and two issues need to be discussed. The first is the reproduction of the environmental style information. We make the reasonable assumption that seen environments can recur in a new scenario. Therefore, we need the system to independently reproduce different environmental style information from the inputs during testing; the style information is then used to balance the environmental domain shift. To do so, we first expand the training data for environmental diversity. Collecting such data is expensive and difficult, since each fixed position calls for images shot in different weather, and, unlike ground-level images, drone-view images must be captured by professionals. Generative adversarial networks (GANs) (Zheng et al., 2017; Goodfellow et al., 2014; Zheng et al., 2019; Tang et al., 2020) and data augmentation are two appealing alternatives to manual dataset production. Recent GANs can synthesize images that are both high in quality and sufficiently diverse. However, the key to our research is not generating high-quality stylized images. Considering speed and flexibility, we forgo GANs and choose an off-the-shelf image-based style transformation library (Jung et al., 2020) to pre-process images.
After processing images, we obtain nine synthetic environmental images for each geographical location, i.e., fog, rain, snow, fog plus rain, fog plus snow, rain plus snow, dark, overexposure, and wind (see Figure 1). We then require the system to extract different environmental knowledge to achieve reproduction. In vanilla CNNs, the image information contained in extracted features is entangled. Some methods (Ilse et al., 2020; Chattopadhyay et al., 2020) suggest that representations can be disentangled into different parts using specific domain labels or domain-related prior knowledge. We follow this idea and encode the environmental information from input images by supervised learning, where the environmental labels are automatically annotated during the image transformation.
The second issue is how to utilize the style information to minimize the environmental style gap. We suppose that, for one location, the identity information is domain-agnostic and the environmental style is domain-specific. Some reference methods retain only the domain-agnostic information for downstream visual tasks (Khosla et al., 2012; Li et al., 2017) or employ instance normalization (Ulyanov et al., 2017) to resist the effect of image style changes (Pan et al., 2018). Unlike these methods, our system aims to align the domain-specific distribution on the fly. To achieve this goal, we first borrow from IBN-Net (Pan et al., 2018) to process images and obtain visual features. IBN-Net integrates batch normalization (BN) (Ioffe and Szegedy, 2015) and instance normalization (IN) into the residual blocks of shallow layers. BN is used to retain the discrimination of features (Huang et al., 2017; He et al., 2016; Xie et al., 2017), and IN is employed to filter out domain-specific style information from the content (Pan et al., 2018). However, IN applies the same treatment to content of multiple styles. To provide dynamic adaptability, we further introduce spatially-adaptive denormalization (SPADE) (Park et al., 2019) and integrate SPADE into a residual structure, called Residual SPADE, in our system. Like SPADE, Residual SPADE is a conditional normalization module that allows flexible modulation of image styles through scales and biases learned from external data.
In light of the above analysis, we propose a two-branch learning framework called Multiple-environment Self-adaptive Network (MuSe-Net). The design of MuSe-Net is based on IBN-Net yet offers more scalability. In particular, we insert Residual SPADE after the instance normalization layer of IBN-Net as the self-adaptive feature extraction network to construct one branch of MuSe-Net. The other branch of MuSe-Net is a multiple-environment style extraction network, which parameterizes the environment information as inputs of Residual SPADE. The two branches share the same inputs, ensuring that the self-adaptive feature extraction network can utilize the corresponding environmental information from the style extraction network to dynamically close the environmental style gap.
The main contributions of this work are summarized as follows.
We identify one key challenge, i.e., the weather and illumination changes, when applying the visual geo-localization system to the real-world scenario. The large visual changes usually compromise the reliability of the existing methods.
To address this limitation, we present an end-to-end learning framework called Multiple-environment Self-adaptive Network (MuSe-Net). MuSe-Net applies a dual-path CNN model to extract the environment-related style information and dynamically minimize the environment-related style gap, such as weather and light changes.
Extensive experiments on two prevailing multi-platform geo-localization benchmarks, i.e., University-1652 (Zheng et al., 2020b) and CVUSA (Zhai et al., 2017), show that our method achieves superior results for geo-localization in multiple environments. Meanwhile, under unseen extreme weather, i.e., a mixture of fog, rain and snow, MuSe-Net still achieves competitive results.
2. Related Work
2.1. CNN-based Cross-view Geo-localization
Existing cross-view geo-localization methods mostly focus on bridging the visual gap caused by appearance changes across viewpoints. To obtain discriminative image representations, some pioneering works (Castaldo et al., 2015; Lin et al., 2013; Senlet and Elgammal, 2011; Bansal et al., 2011) devote much effort to hand-crafted feature matching. Owing to their powerful capability for image representation (Russakovsky et al., 2015), deep convolutional neural networks (CNNs) have become the prevalent choice for feature extraction. Following this line, Workman and Jacobs (Workman and Jacobs, 2015) first attempt to employ an AlexNet (Krizhevsky et al., 2012) pre-trained on ImageNet (Russakovsky et al., 2015) and Places (Zhou et al., 2014) to extract deep features for cross-view geo-localization. They show that the top layers of a CNN contain rich information about geographic location. Further, Workman et al. (Workman et al., 2015) extend their work by minimizing the distance between cross-view features, gaining improved performance. Lin et al. (Lin et al., 2015) treat the matching task as similar to face verification, and deploy the contrastive loss (Hadsell et al., 2006) to optimize a modified Siamese Network (Chopra et al., 2005). Later on, Hu et al. (Hu et al., 2018) insert a NetVLAD layer into the high-level layers of a Siamese-like architecture to aggregate local features, which makes image descriptors more robust to viewpoint changes. Considering the importance of orientation, Liu and Li (Liu and Li, 2019) encode the corresponding coordinate information into the network to obtain discriminative features. Tian et al. (Tian et al., 2020) propose an orientation normalization network to alleviate the effect of orientation variation. Wang et al. (Wang et al., 2021) design a square-ring partition strategy to cope with image rotation. Under a limited field of view (FoV) setting, DSM (Shi et al., 2020a) provides a dynamic similarity matching module to align the orientation of cross-view images. In order to align the spatial layout information, Zhai et al. (Zhai et al., 2017) encode and transfer the semantic information of ground images to aerial images. Regmi and Shah (Regmi and Shah, 2019) apply a generative model to synthesize an aerial image from a panoramic ground image. Shi et al. (Shi et al., 2020b) resort to optimal transport theory to compare and adjust pairwise image distributions at the feature level. Another work of Shi et al. (Shi et al., 2019) directly utilizes the polar transform to accomplish pixel-level alignment of semantic information between ground images and satellite images.
In the aspect of optimizing different training objectives, Vo and Hays (Vo and Hays, 2016) investigate a variety of CNN architectures for cross-view matching and gain the best performance through employing a soft margin triplet loss to optimize a triplet CNN. Hu et al. (Hu et al., 2018) design a weighted soft-margin ranking loss that speeds up the convergence rate. Cai et al. (Cai et al., 2019) introduce a hard exemplar reweighting triplet loss to improve the retrieval. Zheng et al. (Zheng et al., 2020b) imitate the classification tasks and use instance loss (Zheng et al., 2017, 2020c) as the proxy targets to solve the cross-view image retrieval. There are also some works (Cai et al., 2019; Shi et al., 2019) that employ the attention mechanism to locate the interesting areas, which effectively promotes the discrimination of features.
In contrast to existing works, we study a new real-world scenario where the drone may encounter different weather and illumination, which leads to the same location with multiple domain distributions. For this multiple domain problem, we explicitly parameterize environmental information to align the domain distribution. Therefore, the drone is able to localize unseen positions in previously seen environments.
2.2. Domain Generalization
Generally, domain generalization (DG) arises in multi-source domain problems. Similar to domain adaptation (DA), domain generalization aims to address the domain shift introduced by statistical differences among multiple domains. However, whereas domain adaptation employs labeled or unlabeled data from the target domain during training, domain generalization leverages only multi-source data to learn robust representations that remain useful under different marginal distributions, e.g., unseen target domains.
can realize multi-domain alignment. Statistical learning theory (Vapnik, 1999) suggests that the diversity of training samples can boost the generalization of learning models. Therefore, other researchers propose advanced augmentation algorithms (Zhou et al., 2021; Xu et al., 2021; Zhang et al., 2018) to lessen the domain gap. Adversarial approaches (Li et al., 2018b; Shao et al., 2019; Rahman et al., 2020) rely on confusing a domain discriminator to learn domain-invariant features, which can also alleviate the domain shift. Meta-learning (Zhang et al., 2021; Fu et al., 2021) enables the model to learn new concepts and skills quickly from a few training examples. Based on meta-learning, MAML (Finn et al., 2017) has achieved great success in domain generalization, and some improved MAML frameworks have subsequently been presented (Li et al., 2019; Zhao et al., 2021). Disentangled representation learning aims to learn the common and exclusive information from multiple domains and then process this information separately to obtain robust features for future predictions. Accordingly, one group of approaches considers disentanglement at the feature level. Khosla et al. (Khosla et al., 2012) et al., 2017) employ the neural network to re-implement this concept. DMG (Chattopadhyay et al., 2020) adopts a learnable mask to select and balance the domain-invariant and domain-specific features and demonstrates that analysing domain-specific components can assist prediction at test time. Another solution encodes the multiple kinds of knowledge into different latent spaces. Ilse et al. (Ilse et al., 2020) provide a VAE (Kingma and Welling, 2014)-type framework to learn three complementary sub-spaces for domain-invariant classification. There is also literature documenting the use of GANs (Zheng et al., 2019; Wang et al., 2020) to construct two latent spaces: one for identity confirmation and the other serving the domain-related information.
3. Proposed Method
3.1. Problem Definition and Notations
The multiple-environment aerial-view geo-localization task assumes a scenario in which satellite-view images are constant while the style of drone-view images varies with environmental changes. This varying environmental information makes it difficult for existing methods to retrieve target images with the same identity across viewpoints. Our research focuses on reducing the interference of environmental styles when retrieving between two aerial viewpoints. Let , be the inputs and identity labels, respectively, for one multiple environmental aerial-view geo-localization dataset. is the number of images, and , and indicates the number of identities. Conventionally, the inputs consist of domains, and . denotes the satellite-view domain and denotes the drone-view domain. These two domains share the same identity labels . In our setting, we follow the previous definition and add one style space. In particular, holding constant, we expand the original drone-view domain to multi-environment drone-view domain . The subscript , and indicates the number of environmental styles. Since the style of the satellite-view domain is constant, we denote this domain as and . Therefore, the added style label includes styles, i.e., .
3.2. Relevant Technologies Revisit
IBN-Net (Pan et al., 2018). Batch normalization (BN) has become a basic component of CNNs. BN maintains the discrimination of features by using the global statistics (i.e., mean and variance) recorded in training to normalize the testing sample. However, global statistics inevitably contain domain-specific knowledge (e.g., style information). In domain generalization, the domain-specific information induces the domain shift, which causes a model trained in one domain to perform poorly in another. For instance, in scene parsing, due to the different image styles of the two datasets, a CNN-based model trained on Cityscapes (Cordts et al., 2016) exhibits notable performance drops when tested on GTA5 (Richter et al., 2016), even though the two datasets contain similar semantics. In contrast to BN, instance normalization (IN) discards the global statistics. With the learned affine parameters, IN intends to close the style gap between each testing and training sample. Therefore, IN resists the effect of the style discrepancy but simultaneously damages discrimination. Considering the advantages of both, IBN-Net integrates IN and BN as building blocks to extract style-invariant features and achieves a competitive result in the cross-domain scene parsing task.
Spatially-adaptive denormalization (SPADE) (Park et al., 2019). SPADE is a conditional normalization module that first requires external data to generate the learned affine parameters. The normalized activations are then modulated by these learned affine parameters. In simplified form, SPADE can be formulated as:
where is the input feature and is the corresponding style feature. and compute the mean and variance of feature , respectively. and are the learned scale and bias to modulate the normalized feature .
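As a hedged sketch of the formulation described above, one form consistent with these definitions and with the general SPADE design of Park et al. (2019) is given below. The symbol names ($F$ for the input feature, $S$ for the style feature, $\mu$ and $\sigma^{2}$ for the mean and variance, $\gamma$ and $\beta$ for the learned scale and bias, and the small constant $\epsilon$) are assumptions chosen for illustration and may differ from the authors' notation.

```latex
\mathrm{SPADE}(F, S) \;=\; \gamma(S)\,\frac{F - \mu(F)}{\sqrt{\sigma^{2}(F) + \epsilon}} \;+\; \beta(S)
```

Here $\gamma(S)$ and $\beta(S)$ vary with spatial position because they are computed from the style feature rather than stored as fixed per-channel constants.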
Discussion. Instance normalization (IN) plays a crucial role in improving the generalization capability of IBN-Net. However, as mentioned above, IN dilutes the content discrimination. Unlike IBN-Net trained on Cityscapes, our approach deploys images of multiple environmental styles to train MuSe-Net. When there are significant discrepancies between these styles, we suggest that IN may discard more useful information while learning compromise parameters for aligning the styles of features. Therefore, we insert spatially-adaptive denormalization (SPADE) after IN and employ the style information of images as the external condition to dynamically adjust the activation of IN. It is worth noting that we do not directly follow Equation 1 to modulate the normalized activations. We retain the affine parameters of IN and apply a residual structure to adjust the feature. We call the modified SPADE Residual SPADE, which can be formulated as:
where denotes Residual SPADE. Characters , , and have the same meaning as in Equation 1, is the operation of instance normalization. is a generalization of . When the learned parameters and converge to zero, can arrive at . Compared with Equation 1, has a residual structure that has been shown to facilitate the learning of network parameters (He et al., 2016). With the dynamic fine-tuning by , the final depth features tend to retain the discrimination as much as possible while reducing the interference of their respective styles. Experiments in Section 4 demonstrate the effectiveness of MuSe-Net.
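The residual behaviour described above can be illustrated with a minimal, standard-library sketch. It assumes one plausible form, `IN(F) + gamma(S) * IN(F) + beta(S)`, which reduces to plain IN when the scale and bias are zero; the exact expression used by the authors may differ. The function names, the single-channel list-of-lists layout, and the scalar IN affine parameters are all hypothetical simplifications.

```python
import math


def instance_norm(channel, weight=1.0, bias=0.0, eps=1e-5):
    """Normalize one H x W channel by its own mean and variance,
    then apply the (scalar) affine parameters of IN."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var + eps)
    return [[weight * (v - mean) / std + bias for v in row] for row in channel]


def residual_spade(channel, gamma, beta, weight=1.0, bias=0.0):
    """Assumed residual form: IN(F) + gamma * IN(F) + beta, element-wise.
    gamma and beta are H x W maps that would be predicted from the style feature."""
    normed = instance_norm(channel, weight, bias)
    return [
        [n + g * n + b for n, g, b in zip(n_row, g_row, b_row)]
        for n_row, g_row, b_row in zip(normed, gamma, beta)
    ]


feature = [[1.0, 2.0], [3.0, 4.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
# With gamma = beta = 0 the module falls back to plain instance normalization.
assert residual_spade(feature, zeros, zeros) == instance_norm(feature)
```

In a real network the two maps would come from the convolutional layers applied to the interpolated style feature, and the operation would run per channel on tensors rather than on nested lists.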
3.3. Overview of MuSe-Net
The proposed Multiple-environment Self-adaptive Network (MuSe-Net) is illustrated in Figure 2 (I). MuSe-Net consists of two branches: the multiple-environment style extraction branch and the self-adaptive feature extraction branch. These two branches have the same inputs. The multiple-environment style extraction branch extracts features carrying different style information. These style features are then fed into Residual SPADE as control data to conduct a learnable transformation. Subsequently, the self-adaptive feature extraction branch employs the learned affine parameters of Residual SPADE to dynamically align the style information of inputs and pull content features with the same identity together.
The multiple-environment style extraction branch. This branch has two components: a style encoder and a style classifier. The style encoder is truncated from ResNet-50 (He et al., 2016). In particular, ResNet-50 contains four stages of repeated bottlenecks, named stage1, stage2, stage3 and stage4. Following the analysis (Pan et al., 2018) that the style discrepancy is mostly preserved in shallow layers, we take the first two stages of ResNet-50 as the style encoder to extract the style feature, as shown in Figure 2 (II). The style classifier consists of a batch normalization layer (BN), a dropout layer (Dropout), and a fully-connected layer (FC). Given an aerial-view image (i.e., drone or satellite) of size , we first utilize the style encoder to acquire the style feature with the shape of . Then we employ the average pooling layer to transform into a 512-dim feature . Finally, we deploy the style classifier to predict the style of the input and use the cross-entropy loss to optimize this branch. The above process can be formulated as:
where denotes the style classifier. is the predicted probability of belonging to the corresponding style label . In Equation 6, we calculate the cross-entropy loss.
The self-adaptive feature extraction branch. This branch consists of a content encoder with Residual SPADE embedded and an identity classifier. The content encoder builds on IBN-Net (Pan et al., 2018), which has a structure similar to ResNet-50. Specifically, stage1 of IBN-Net contains 3 bottlenecks. We embed Residual SPADE in the second and last bottlenecks of stage1 to form the content encoder (see Figure 2 (II)). Residual SPADE contains two convolutional layers, i.e., Conv_w1 and Conv_b1: one learns the scale and the other learns the bias (see Figure 2 (III)). The identity classifier has the same components as the style classifier, i.e., a batch normalization layer (BN), a dropout layer (Dropout), and a fully-connected layer (FC). The content encoder accepts the same input as the style encoder and is employed to extract the content feature of size . Residual SPADE plays an important role in the content encoder. In Residual SPADE, the style feature is first interpolated to the same size as the activation of instance normalization (IN) in the content encoder. Then the interpolated feature is convolved to produce the scale and bias that modulate the activation of IN; the modulation operation is shown in Figure 2 (III-b). The extracted content feature is further transformed into a 2048-dim feature by an average pooling layer. Finally, we use the identity classification loss as the proxy target to force the content encoder to extract discriminative features, and this loss can be formulated as:
where indicates the identity classifier. is the predicted probability that belongs to the geo-tagged identity label .
Optimization. We train MuSe-Net by jointly employing the style loss and the identity loss :
The style loss forces features with different style information to stay apart, and the identity loss brings matching image pairs of the same geo-tag closer. Also, the identity loss serves the optimization of Residual SPADE.
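The joint objective described above can be sketched with standard-library Python as the sum of two softmax cross-entropy terms, one over style labels and one over identity labels. This is an illustrative sketch only: the function names are hypothetical, and `style_weight` is an assumed knob (set to 1.0 for a plain sum) that the text does not specify.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample, computed stably via log-sum-exp."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def joint_loss(style_logits, style_label, id_logits, id_label, style_weight=1.0):
    """Return (total, style, identity) losses for one sample.
    style_weight is a hypothetical balancing factor; 1.0 gives an unweighted sum."""
    style = cross_entropy(style_logits, style_label)
    identity = cross_entropy(id_logits, id_label)
    return identity + style_weight * style, style, identity


# Toy logits: three style classes, two identity classes.
total, style, identity = joint_loss([0.0, 0.0, 0.0], 1, [2.0, 0.0], 0)
print(f"style={style:.4f} identity={identity:.4f} total={total:.4f}")
```

In training, the style logits would come from the style classifier and the identity logits from the identity classifier, with the losses averaged over the mini-batch.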
4.1. Datasets and Evaluation Protocol
We mainly train and evaluate the proposed method using the University-1652 (Zheng et al., 2020b) since it supports large-scale cross aerial-view images. We also verify the effectiveness of our method in CVUSA (Zhai et al., 2017), which is a street-to-satellite dataset.
University-1652 (Zheng et al., 2020b) is a recently released dataset that focuses on drone-based geo-localization. It consists of data from three platforms, i.e., drones, satellites, and phone cameras. All data are collected from 1,652 buildings of 72 universities around the world. There are 54 drone-view images for each building to guarantee that the drone-view data cover rich information about the target, e.g., scale and viewpoint variations. With one satellite-view image for each building as a reference, the dataset also includes 5,580 street-view images. Due to the limited viewpoint of the phone camera, street-view images cannot cover all facets of a target building. To compensate for this weakness as much as possible, 21,099 common-view images from Google Image are added to University-1652 as an extra training set. The training set includes 701 buildings of 33 universities, and the other 951 buildings, belonging to the remaining 39 universities, are contained in the test set. Universities in the training and test sets do not overlap. The dataset supports two new tasks, i.e., drone-view target localization (Drone → Satellite) and drone navigation (Satellite → Drone). In the drone-view target localization task, the query set contains 37,855 drone-view images, and the gallery set includes 701 true-matched satellite-view images and 250 distractors. For the drone navigation task, with 701 satellite-view images as the query set, there are 37,855 true-matched drone-view images and 13,500 distractors in the gallery. Under this task, one query image has multiple correct matches in the gallery. Drone-view target localization is therefore the more challenging task, since there is only one true-matched satellite-view image for each drone-view query.
CVUSA (Zhai et al., 2017) consists of image pairs from two viewpoints, i.e., the street view and the satellite view. Each viewpoint contains 35,532 images for training and 8,884 images for testing. It is worth noting that street-view images are panoramas, and all the street and satellite images are north aligned.
Evaluation protocol. The performance of our method is evaluated by the Recall@K (R@K) and the average precision (AP). R@K denotes the proportion of correctly localized images in the top-K list, and R@1 is an important indicator. AP is equal to the area under the Precision-Recall curve. Higher scores of R@K and AP indicate better performance of the network.
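The two metrics can be sketched for a single query as follows. The input is a hypothetical list of booleans marking, in ranked order, which gallery items are true matches. The AP here is the common non-interpolated form (mean of precision at each hit), used as an approximation of the area under the Precision-Recall curve; an official evaluation script may interpolate differently.

```python
def recall_at_k(ranked_matches, k):
    """Return 1.0 if any true match appears in the top-k of one query's ranking, else 0.0."""
    return 1.0 if any(ranked_matches[:k]) else 0.0


def average_precision(ranked_matches):
    """Mean of the precision values measured at the ranks where true matches occur."""
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# One query whose true matches sit at ranks 2 and 4 of the gallery ranking.
ranking = [False, True, False, True]
print(recall_at_k(ranking, 1), recall_at_k(ranking, 2), average_precision(ranking))
```

Dataset-level R@K and AP are then the averages of these per-query values over all queries.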
4.2. Implementation Details
The style encoder is truncated from ResNet-50 (He et al., 2016) and initialized with weights pre-trained on ImageNet (Deng et al., 2009). The kernel size of the convolutional layer in Residual SPADE is , and the kernel is initialized with normal initialization. We employ the weights of IBN-Net50-a (Pan et al., 2018), trained on ImageNet (Deng et al., 2009), to initialize the content encoder. Following (Zheng et al., 2020b), the stride of the second convolutional layer and the last down-sample layer of the first bottleneck in stage4 of the content encoder is changed from 2 to 1. We fix the size of input images to pixels during training and inference. In training, we augment satellite-view images with random cropping and flipping. For drone-view images, we first apply the imgaug library (Jung et al., 2020) to change the environmental style of images, and then also apply random cropping and flipping. We adopt stochastic gradient descent (SGD) with momentum 0.9 and weight decay 0.0005 to optimize our model. The training mini-batch size is 16, with 8 images per platform. The initial learning rate is 0.005 for the two classifiers and Residual SPADE, and 0.0005 for the remaining layers. Our model is trained for 210 epochs, and the learning rate is decayed to 0.1 and 0.01 of its initial value at epochs 120 and 180, respectively. During testing, the Euclidean distance is used to measure the similarity between the query and candidate images in the gallery. We implement our code in PyTorch (Paszke et al., 2019), and all experiments are conducted on one NVIDIA RTX 2080Ti GPU.
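Two pieces of the setup above lend themselves to a short standard-library sketch: the step learning-rate schedule and the Euclidean-distance ranking used at test time. The function names are hypothetical, and treating the decay as taking effect from epoch 120 and epoch 180 onward (zero-indexed) is an assumption about the boundary convention.

```python
import math


def learning_rate(base_lr, epoch, milestones=(120, 180), factor=0.1):
    """Step decay: multiply base_lr by `factor` once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** passed


def rank_gallery(query, gallery):
    """Return gallery indices sorted by Euclidean distance to the query, closest first."""
    return sorted(range(len(gallery)), key=lambda i: math.dist(query, gallery[i]))


# Two parameter groups, as in the text: classifiers and Residual SPADE vs. remaining layers.
for epoch in (0, 120, 180):
    print(epoch, learning_rate(0.005, epoch), learning_rate(0.0005, epoch))

# Toy 2-D features: the second gallery item is closest to the query.
print(rank_gallery([0.0, 0.0], [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]))
```

In a PyTorch implementation the same schedule is typically obtained with a multi-step scheduler applied to an optimizer with two parameter groups; the sketch only mirrors the arithmetic.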
Table 1. Results on University-1652 under ten environmental conditions. Each cell reports R@1 / AP.

| Method | Normal | Fog | Rain | Snow | Fog + Rain | Fog + Snow | Rain + Snow | Dark | Over-exposure | Wind | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Drone → Satellite | | | | | | | | | | | |
| VGG16 (Simonyan and Zisserman, 2015) | 59.98 / 64.69 | 56.21 / 61.11 | 53.97 / 58.90 | 50.07 / 55.08 | 50.43 / 55.63 | 42.77 / 48.01 | 51.08 / 56.10 | 39.10 / 44.30 | 45.16 / 50.47 | 50.84 / 56.05 | 49.96 / 55.03 |
| Zheng et al. (Zheng et al., 2020b) | 67.83 / 71.74 | 60.97 / 65.23 | 60.29 / 64.61 | 55.58 / 60.09 | 54.75 / 59.40 | 44.85 / 49.78 | 57.61 / 62.03 | 39.70 / 44.65 | 51.85 / 56.75 | 58.28 / 62.83 | 55.17 / 59.71 |
| DenseNet121 (Huang et al., 2017) | 69.48 / 73.26 | 64.25 / 68.47 | 63.47 / 67.64 | 59.29 / 63.70 | 59.68 / 64.13 | 50.41 / 55.20 | 60.21 / 64.57 | 48.57 / 53.41 | 54.04 / 58.88 | 60.74 / 65.14 | 59.01 / 63.44 |
| Swin-T (Liu et al., 2021) | 69.27 / 73.18 | 66.46 / 70.52 | 65.44 / 69.60 | 61.79 / 66.23 | 63.96 / 68.21 | 56.44 / 61.07 | 62.68 / 67.02 | 50.27 / 55.18 | 55.46 / 60.29 | 63.81 / 68.17 | 61.56 / 65.95 |
| IBN-Net (Pan et al., 2018) | 72.35 / 75.85 | 66.68 / 70.64 | 67.95 / 71.73 | 62.77 / 66.85 | 62.64 / 66.84 | 51.09 / 55.79 | 64.07 / 68.13 | 50.72 / 55.53 | 57.97 / 62.52 | 66.73 / 70.68 | 62.30 / 66.46 |
| Satellite → Drone | | | | | | | | | | | |
| VGG16 (Simonyan and Zisserman, 2015) | 75.89 / 58.50 | 75.18 / 55.42 | 71.61 / 53.03 | 68.19 / 48.29 | 71.18 / 49.34 | 65.48 / 40.87 | 69.47 / 50.03 | 64.34 / 35.74 | 64.91 / 44.20 | 68.90 / 49.53 | 69.52 / 48.50 |
| Zheng et al. (Zheng et al., 2020b) | 83.45 / 67.94 | 79.60 / 61.12 | 77.60 / 59.73 | 73.18 / 55.07 | 75.89 / 54.45 | 70.76 / 43.26 | 74.75 / 56.44 | 69.47 / 39.25 | 72.18 / 51.91 | 76.46 / 57.59 | 75.33 / 54.68 |
| DenseNet121 (Huang et al., 2017) | 83.74 / 70.34 | 82.31 / 66.32 | 81.17 / 65.23 | 78.60 / 60.33 | 79.46 / 61.66 | 74.61 / 51.14 | 78.46 / 61.68 | 74.47 / 47.88 | 74.32 / 55.26 | 78.32 / 61.63 | 78.55 / 60.15 |
| Swin-T (Liu et al., 2021) | 80.74 / 68.94 | 81.03 / 67.46 | 81.17 / 66.39 | 78.46 / 61.33 | 79.17 / 64.65 | 74.89 / 56.57 | 78.89 / 63.49 | 75.61 / 48.43 | 76.60 / 56.57 | 78.74 / 64.45 | 78.53 / 61.83 |
| IBN-Net (Pan et al., 2018) | 86.31 / 73.54 | 84.59 / 67.61 | 84.74 / 69.03 | 80.88 / 64.44 | 83.31 / 63.71 | 77.89 / 52.14 | 83.02 / 65.74 | 78.46 / 50.77 | 79.46 / 58.64 | 84.02 / 67.94 | 82.27 / 63.36 |
Table 2. Results on CVUSA under ten environmental conditions. Each cell reports R@1 / AP.

| Method | Normal | Fog | Rain | Snow | Fog + Rain | Fog + Snow | Rain + Snow | Dark | Over-exposure | Wind | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Zheng et al. (Zheng et al., 2020b) | 63.78 / 67.77 | 60.19 / 64.47 | 61.53 / 65.65 | 61.39 / 65.52 | 58.00 / 62.27 | 55.32 / 59.72 | 60.81 / 65.03 | 51.64 / 55.91 | 59.89 / 64.14 | 60.41 / 64.69 | 59.30 / 63.52 |
| IBN-Net (Pan et al., 2018) | 76.61 / 79.50 | 74.68 / 77.80 | 75.52 / 78.52 | 74.54 / 77.65 | 73.31 / 76.61 | 71.33 / 74.71 | 74.72 / 77.80 | 66.45 / 69.99 | 73.14 / 76.29 | 74.63 / 77.76 | 73.49 / 76.66 |
4.3. Comparisons with Competitive Methods
Results on University-1652. University-1652 (Zheng et al., 2020b) supports two tasks: drone-view target localization (Drone → Satellite) and drone navigation (Satellite → Drone). We re-implement five methods for comparison with our method on both tasks. Each of the five comparison methods includes only the feature extraction branch, which consists of a content encoder and an identity classifier. Their content encoders are VGG16 (Simonyan and Zisserman, 2015), ResNet-50 (He et al., 2016), i.e., the baseline of Zheng et al. (Zheng et al., 2020b), DenseNet121 (Huang et al., 2017), Swin-T (Liu et al., 2021), and IBN-Net50-a (IBN-Net (Pan et al., 2018)). Keeping the style of satellite-view images unchanged, Table 1 shows the results for drone-view images in 10 different conditions. Among the five re-implemented methods, IBN-Net (Pan et al., 2018) achieves a significant improvement in Drone → Satellite. Our method surpasses IBN-Net (Pan et al., 2018) in all environmental conditions and improves both the mean R@1 accuracy and the mean AP (Table 1). Satellite → Drone is an easier task than Drone → Satellite: even VGG16 obtains a higher R@1 in Satellite → Drone than our method does in Drone → Satellite. In Satellite → Drone, our method still keeps a clear advantage over the five comparison methods. In particular, our results outperform those of IBN-Net (Pan et al., 2018) in all 10 environmental conditions, and both the mean R@1 accuracy and the mean AP increase (Table 1). The experimental results on the two tasks demonstrate two points. First, as a multiple-domain method, IBN-Net (Pan et al., 2018) acquires a more robust representation than the other comparison methods, one that contains less of the domain shift caused by different environmental styles. Second, our method, which builds on IBN-Net, learns dynamic parameters to adaptively adjust the style information and further improves the performance, as discussed in Section 3.2.
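For readers unfamiliar with the two metrics reported in the tables, the following is a minimal sketch of how R@1 and AP are commonly computed for a single query in image retrieval. The function names and the dot-product similarity are illustrative assumptions, and the evaluation script actually used for University-1652 may compute AP with a slightly different interpolation.

```python
def rank_gallery(query_feat, gallery_feats, gallery_labels):
    """Sort gallery labels by descending dot-product similarity to the query."""
    scores = [sum(q * g for q, g in zip(query_feat, feat)) for feat in gallery_feats]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [gallery_labels[i] for i in order]


def recall_at_k(ranked_labels, query_label, k=1):
    """Return 1.0 if a true match appears among the top-k results, else 0.0."""
    return 1.0 if query_label in ranked_labels[:k] else 0.0


def average_precision(ranked_labels, query_label):
    """Mean of the precision values measured at the rank of each true match."""
    hits, precisions = 0, []
    for rank, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy example with made-up 2-D features (not data from the paper).
ranked = rank_gallery([1.0, 0.0],
                      [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]],
                      ["a", "b", "a"])
print(ranked, recall_at_k(ranked, "a"), average_precision(ranked, "a"))
```

The values reported per condition would then be the averages of these per-query scores over all queries.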
Results on CVUSA. Street-view and satellite-view images in CVUSA (Zhai et al., 2017) differ drastically in appearance. To give cross-view images a similar pattern, we follow (Shi et al., 2019, 2020a) and pre-process satellite-view images before training and testing. Specifically, we apply the polar transform to warp satellite images so that their appearance is closer to that of street-view panoramas. In the multiple-environment setting, Table 2 details the results of our method and two competitive methods on CVUSA. Our method improves over IBN-Net (Pan et al., 2018) in most environmental conditions, and both the mean R@1 accuracy and the mean AP increase (Table 2).
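The polar transform mentioned above can be sketched as follows. This is an illustrative pure-Python version under assumed conventions (top row of the output maps to the image border, bottom row to the image centre, columns map to the azimuth angle, nearest-neighbour sampling); the function names are hypothetical, and the exact formula and interpolation used by Shi et al. may differ.

```python
import math


def polar_source_coords(row, col, out_h, out_w, size):
    """Map output pixel (row, col) to (row, col) in a size x size satellite image."""
    radius = (size / 2.0) * (out_h - row) / out_h  # top row -> border, bottom -> centre
    theta = 2.0 * math.pi * col / out_w            # column index -> azimuth angle
    src_row = size / 2.0 - radius * math.cos(theta)
    src_col = size / 2.0 + radius * math.sin(theta)
    return src_row, src_col


def polar_transform(image, out_h, out_w):
    """Warp a square image (list of rows) into an out_h x out_w panorama-like image."""
    size = len(image)
    warped = []
    for row in range(out_h):
        line = []
        for col in range(out_w):
            r, c = polar_source_coords(row, col, out_h, out_w, size)
            r = min(max(int(round(r)), 0), size - 1)  # clamp to valid indices
            c = min(max(int(round(c)), 0), size - 1)
            line.append(image[r][c])
        warped.append(line)
    return warped


# Toy 8 x 8 "image" whose pixels store their own coordinates.
toy = [[(r, c) for c in range(8)] for r in range(8)]
panorama_like = polar_transform(toy, out_h=4, out_w=8)
```

In practice a vectorized implementation with bilinear interpolation would be preferable; the loop form is kept here only to make the coordinate mapping explicit.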
Results on unseen weather. In realistic scenarios, an aerial-view geo-localization system often encounters weather that was not seen during training. To explore whether MuSe-Net can cope with such conditions, especially severe weather, we carry out experiments on a mixture of fog, rain and snow. Table 3 reports the R@1 accuracy and AP of MuSe-Net for Drone → Satellite and Satellite → Drone on University-1652 (Zheng et al., 2020b), and for Street → Satellite on CVUSA. The superior performance compared with two competitive methods, i.e., Zheng et al. (Zheng et al., 2020b) and IBN-Net (Pan et al., 2018), suggests that our method has great potential to handle unseen extreme weather. This property can also provide additional assurance of safe flight.
|Method||Fog + Rain + Snow|
|Drone → Satellite (R@1, AP)||Satellite → Drone (R@1, AP)||Street → Satellite (R@1, AP)|
|Zheng et al. (Zheng et al., 2020b)||27.73||32.35||61.34||27.43||46.31||51.03|
|IBN-Net (Pan et al., 2018)||41.19||46.06||73.75||42.57||67.24||70.98|
|Method||Normal||Fog||Rain||Snow||Fog + Rain||Fog + Snow||Rain + Snow||Dark||Over-exposure||Wind||Mean|
4.4. Model Analysis
In this section, we further analyze and discuss our model on the multiple-environment drone-view target localization task (Drone → Satellite).
Which bottleneck(s) should embed Residual SPADE? As mentioned in Section 3.3, we embed Residual SPADE in the second and last bottlenecks of stage1, but other embedding options exist. Prior work (Pan et al., 2018) shows that the style difference mostly lies in the shallow layers. Following this finding, we study all three bottlenecks in stage1. Specifically, we conduct experiments that embed one, two or three Residual SPADE modules in these bottlenecks. Table 4 shows the detailed results. We first observe that deploying Residual SPADE in any bottleneck yields higher mean results than IBN-Net (Pan et al., 2018) in Table 1, which demonstrates the effectiveness of Residual SPADE. We then compare the results of deploying a single Residual SPADE in each of the three bottlenecks (i.e., the first three rows of Table 4). In terms of mean results, embedding Residual SPADE in the second bottleneck achieves the best performance. When employing two or three Residual SPADE modules, we notice that the combinations containing the first bottleneck achieve slightly lower mean results. Finally, considering both the individual results in the 10 conditions and the mean results, we choose the second and last bottlenecks for our method.
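To make the search space of this ablation concrete, the sketch below enumerates the possible placements of Residual SPADE over the three bottlenecks of stage1. The integer indexing of bottlenecks and the function name are assumptions made for illustration; if every non-empty subset is evaluated, this gives seven configurations.

```python
from itertools import combinations

# The three bottlenecks of stage1, indexed from 1 (assumed naming).
BOTTLENECKS = (1, 2, 3)


def embedding_options(bottlenecks=BOTTLENECKS):
    """List every non-empty subset of bottlenecks that could host Residual SPADE."""
    options = []
    for count in range(1, len(bottlenecks) + 1):
        options.extend(combinations(bottlenecks, count))
    return options


for option in embedding_options():
    print(option)
```

Under this indexing, the configuration adopted in the paper (second and last bottlenecks) corresponds to the tuple `(2, 3)`.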
Qualitative result. As shown in Figure 3, we visualise heatmaps and Top-5 retrieval results generated by our method in 10 different environmental conditions. The heatmaps show that our method can activate two regions of the geographic target. We also find subtle differences in the extent and brightness of the activated areas across the 10 heatmaps, which indirectly reflects that geo-localization results can differ across the 10 environmental conditions. From the retrieval results, we observe that our model obtains the true match at Top-1, yet the remaining retrieved results are inconsistent across the 10 conditions, which also indicates that the adjusted features still contain a few discrepancies.
Model complexity. We use FLOPs and PN to evaluate the model complexity of the proposed method and two baselines. FLOPs denotes the number of floating-point operations, and PN denotes the number of parameters. The baseline methods of Zheng et al. (Zheng et al., 2020b) and IBN-Net (Pan et al., 2018) have similar complexity in terms of FLOPs and PN. Our method inevitably yields higher FLOPs and PN because of the additional multiple-environment style extraction branch. However, the growth rates of FLOPs and PN from the baselines to our method are low, which indicates that the complexity of our model remains acceptable.
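One plausible reading of the growth rate is the relative increase of a cost metric over the baseline. The sketch below assumes that definition; the function name and the numbers are hypothetical placeholders, not measurements from the paper.

```python
def growth_rate(baseline, ours):
    """Relative increase of a cost metric (FLOPs or parameter count) over a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (ours - baseline) / baseline


# Hypothetical placeholder values, not figures reported in the paper.
baseline_flops, our_flops = 100.0, 110.0
print(f"FLOPs growth: {growth_rate(baseline_flops, our_flops):.1%}")
```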
5. Conclusion
In this paper, we identify the challenge of deploying aerial-view geo-localization in real-world scenarios where the weather and illumination change. To reduce the domain gaps between different environments, we propose an end-to-end learning network, called Multiple-environment Self-adaptive Network (MuSe-Net), which dynamically adjusts the style difference for inputs with one geo-tag. MuSe-Net consists of two branches. One is a multiple-environment style extraction network that learns environment-related information. The other is a self-adaptive feature extraction network that integrates Residual SPADE into the content encoder to dynamically balance the environmental domain shift. To verify the effectiveness of MuSe-Net, we evaluate the method on a drone-based geo-localization dataset, i.e., University-1652 (Zheng et al., 2020b), and achieve competitive performance. The proposed method also obtains competitive results on a street-to-satellite dataset, i.e., CVUSA (Zhai et al., 2017). In the future, we will continue to study disentangled representation learning and further improve the performance of geo-localization in multiple environments.
References
- Geo-localization of street views with aerial image databases. In ACM International Conference on Multimedia, Cited by: §2.1.
- Generalizing from several related classification tasks to a new unlabeled sample. Advances in Neural Information Processing Systems 24, pp. 2178–2186. Cited by: §1.
- Aircraft accident investigative update. Note: https://www.ntsb.gov/investigations/Documents/DCA20MA059-Investigative-Update.pdf Cited by: §1.
- Ground-to-aerial image geo-localization with a hard exemplar reweighting triplet loss. In IEEE International Conference on Computer Vision, Cited by: §2.1.
- Semantic cross-view matching. In IEEE International Conference on Computer Vision Workshops, Cited by: §2.1.
- Learning to balance specificity and invariance for in and out of domain generalization. In European Conference on Computer Vision, Cited by: §1, §2.2.
- Learning a similarity metric discriminatively, with application to face verification. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
- The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.2.
- Imagenet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §4.2.
- A practical cross-view image matching method between uav and satellite for uav-based geo-localization. Remote Sensing 13 (1), pp. 47. Cited by: §1.
- Model-agnostic meta-learning for fast adaptation of deep networks. In International Conference on Machine Learning, Cited by: §2.2.
- Sta: spatial-temporal attention for large-scale video-based person re-identification. In AAAI Conference on Artificial Intelligence, Cited by: §1.
- Meta-fdmixup: cross-domain few-shot learning guided by labeled target data. In ACM International Conference on Multimedia, Cited by: §2.2.
- Generative adversarial nets. Advances in Neural Information Processing Systems 27. Cited by: §1.
- Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
- Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §3.2, §3.3, §4.2, §4.3.
- Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, Cited by: §1.
- Cvm-net: cross-view matching network for image-based ground-to-aerial geo-localization. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.1.
- Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §4.3, Table 1.
- Diva: domain invariant variational autoencoders. In Medical Imaging with Deep Learning, Cited by: §1, §2.2.
- Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, Cited by: §1.
- imgaug. Note: https://github.com/aleju/imgaug Online; accessed 01-Feb-2020. Cited by: §1, §4.2.
- Undoing the damage of dataset bias. In European Conference on Computer Vision, Cited by: §1, §2.2.
- Auto-encoding variational bayes. In International Conference on Learning Representations, Cited by: §2.2.
- Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105. Cited by: §2.1.
- Deeper, broader and artier domain generalization. In IEEE International Conference on Computer Vision, Cited by: §1, §2.2.
- Episodic training for domain generalization. In IEEE International Conference on Computer Vision, Cited by: §2.2.
- Domain generalization with adversarial feature learning. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
- Domain generalization for medical imaging classification with linear-dependency regularization. In Neural Information Processing Systems, Cited by: §2.2.
- Deep domain generalization via conditional invariant adversarial networks. In European Conference on Computer Vision, Cited by: §2.2.
- Cross-view image geolocalization. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
- Learning deep representations for ground-to-aerial geolocalization. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
- Lending orientation to neural networks for cross-view geo-localization. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.1.
- Swin transformer: hierarchical vision transformer using shifted windows. In IEEE International Conference on Computer Vision, Cited by: §4.3, Table 1.
- Two at once: enhancing learning and generalization capacities via ibn-net. In European Conference on Computer Vision, Cited by: §1, §3.2, §3.3, §3.3, §4.2, §4.3, §4.3, §4.3, §4.4, §4.4, Table 1, Table 2, Table 3.
- Semantic image synthesis with spatially-adaptive normalization. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §3.2.
- Pytorch: an imperative style, high-performance deep learning library. In Neural Information Processing Systems, Cited by: §4.2.
- Correlation-aware adversarial domain adaptation and generalization. Pattern Recognition 100, pp. 107124. Cited by: §2.2.
- Bridging the domain gap for ground-to-aerial image matching. In IEEE International Conference on Computer Vision, Cited by: §2.1.
- Playing for data: ground truth from computer games. In European Conference on Computer Vision, Cited by: §3.2.
- Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §2.1.
- A framework for global vehicle localization using stereo images and satellite and road maps. In IEEE International Conference on Computer Vision Workshops, Cited by: §2.1.
- Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
- Spatial-aware feature aggregation for image based cross-view geo-localization. In Neural Information Processing Systems, Cited by: §1, §2.1, §4.3.
- Where am i looking at? joint location and orientation estimation by cross-view matching. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.1, §4.3.
- Optimal feature transport for cross-view image geo-localization. In AAAI Conference on Artificial Intelligence, Cited by: §1, §2.1.
- Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §4.3, Table 1.
- Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In European Conference on Computer Vision, Cited by: §1.
- Dual attention gans for semantic image synthesis. In ACM International Conference on Multimedia, Cited by: §1.
- Cross-time and orientation-invariant overhead image geolocalization using deep local features. In IEEE Winter Conference on Applications of Computer Vision, Cited by: §2.1.
- Kobe bryant’s death updates: clippers to honor kobe bryant in first game at staples center. Note: https://www.latimes.com/sports/liveblog/kobe-bryant-dies-in-helicopter-crash-in-calabasas Cited by: §1.
- Improved texture networks: maximizing quality and diversity in feed-forward stylization and texture synthesis. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
- The nature of statistical learning theory. Springer Science & Business Media. Cited by: §2.2.
- Localizing and orienting street views using overhead imagery. In European Conference on Computer Vision, Cited by: §2.1.
- Cross-domain face presentation attack detection via multi-domain disentangled representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
- Each part matters: local patterns facilitate cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology. Note: doi:10.1109/TCSVT.2021.3061265 Cited by: §1, §2.1.
- On the location dependence of convolutional neural network features. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.1.
- Wide-area image geolocalization with aerial reference imagery. In IEEE International Conference on Computer Vision, Cited by: §2.1.
- Ace: adapting to changing environments for semantic segmentation. In IEEE International Conference on Computer Vision, Cited by: §1.
- Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1.
- Robust and generalizable visual representation learning via random convolutions. In International Conference on Learning Representations, Cited by: §2.2.
- Geo-localization via ground-to-satellite cross-view image retrieval. IEEE Transactions on Multimedia. Cited by: §1.
- Predicting ground-level scene layout from aerial imagery. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: 3rd item, §2.1, §4.1, §4.1, §4.3, §5.
- Mixup: beyond empirical risk minimization. In International Conference on Learning Representations, Cited by: §2.2.
- Curriculum-based meta-learning. In ACM International Conference on Multimedia, Cited by: §2.2.
- Learning to generalize unseen domains via memory-based multi-source meta-learning for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.2.
- VehicleNet: learning robust visual representation for vehicle re-identification. IEEE Transactions on Multimedia (TMM). Note: doi: 10.1109/TMM.2020.3014488 Cited by: §1.
- University-1652: a multi-view multi-source benchmark for drone-based geo-localization. In ACM International Conference on Multimedia, Note: doi: 10.1145/3394171.3413896 Cited by: Figure 1, 3rd item, §1, §1, §2.1, §4.1, §4.1, §4.2, §4.3, §4.3, §4.4, Table 1, Table 2, Table 3, §5.
- Joint discriminative and generative learning for person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.2.
- Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 16 (2), pp. 1–23. Note: doi: 10.1145/3383184 Cited by: §2.1.
- Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In IEEE International Conference on Computer Vision, Note: doi: 10.1109/ICCV.2017.405 Cited by: §1, §2.1.
- Learning deep features for scene recognition using places database. Neural Information Processing Systems Foundation, pp. 487–495. Cited by: §2.1.
- Domain generalization with mixstyle. In International Conference on Learning Representations, Cited by: §2.2.