1 Introduction
Localization is one of the most fundamental problems for mobile robots. After a decade of research, localization in which the measurement and the map are built by the same sensor is relatively mature, but matching measurements from heterogeneous sensor modalities remains an open problem. The problem is practical given the effort required to build a map: we would like a map to be sharable among multiple robot users, even when they are equipped with heterogeneous sensors. Given the great progress in visual-inertial navigation and mapping [11], the solution space of localization can be reduced to 3 or 4 degrees of freedom, namely the translation and the heading angle. By exploiting bird's-eye view measurements, the problem is reformulated as an image matching problem with a warping space built on the similarity transform. In this paper, we focus on this scenario, with bird's-eye view images acquired by ground vehicles, satellites and UAVs with vision and LiDAR, as shown in Figure 1.
Previous research on homogeneous image matching can be divided into two categories: methods that rely on point feature correspondences to localize in specific setups [12, 9, 1, 3], and methods that apply dense correlation to find the best pose candidate in the solution space [16, 4, 17, 7, 2]. However, none of these approaches performs well when heterogeneous measurements are given. For heterogeneous image matching, [8, 15] use handcrafted features to localize LiDAR against satellite maps. These frameworks rely heavily on the coverage and optimality of the handcrafted features over the variations in real environments. Learning-based methods, which have shown a certain generalization ability, are intuitively appropriate for heterogeneous image matching. [10, 18] learn embeddings for heterogeneous observations and exhaustively search for the optimal pose in the discrete solution space, which is more interpretable than regressing the pose with an end-to-end network as in [6, 5]. However, the former suffers from low efficiency and a limited pose range due to the exhaustive evaluation over the large pose space, which restricts its application to cases with known scale.
Given the problems encountered by prior works, one can think of an ideal matcher that obtains the solution without exhaustive evaluation while retaining good interpretability and generalization. In our work, we set out to propose such a learnable matcher, the essence of which is a differentiable phase correlation. Phase correlation [16] is a similarity-based matcher that performs well for inputs of the same modality but only tolerates small high-frequency noise. We make phase correlation differentiable and embed it into our end-to-end framework. This architecture allows our system to find feature extractors that are optimal with respect to the resulting pose of the image matching. Specifically, we adopt the conventional phase correlation pipeline proposed by [16] and explicitly endow the Discrete Fourier Transform (DFT) layer, the log-polar transformation (LPT) layer, and the differentiable correlation (DC) layer with differentiability, which makes the pipeline trainable within our end-to-end matching network, as shown in Figure 2. Our experiments show the robustness and efficiency of the proposed method on matching heterogeneous sensor measurements.
2 Deep Phase Correlation Network
With known gravity direction, the relative pose between the two observed bird's-eye view images $I_1$ and $I_2$ can be simplified as a similarity transform $T$:
$$T = \begin{bmatrix} s R_\theta & t \\ 0 & 1 \end{bmatrix} \tag{1}$$
where $s$ is the scale, $R_\theta$ is the rotation matrix generated by the heading angle $\theta$, and $t$ is the translation vector. In general, the two images are disturbed by illumination, shadow and occlusion, or even acquired by heterogeneous sensors. For example, $I_1$ is acquired by a bird's-eye camera of a UAV in the morning, while $I_2$ is a local elevation map constructed by a ground robot with a LiDAR. To address this issue, a general process is to extract features from the two images and to estimate the relative pose from the features instead of the original sensor measurements. By applying template matching to the features, we derive the optimal feature extractors $f_1$ and $f_2$ for the two images respectively by solving
$$f_1^*, f_2^* = \arg\min_{f_1, f_2} \left\| T^* - \arg\max_{T \in \mathcal{T}} g\big(f_1(I_1),\, h(f_2(I_2), T)\big) \right\| \tag{2}$$
where $T^*$ is the ground truth, $g$ is a scoring function that measures similarity, and $h$ transforms $f_2(I_2)$ by the relative pose $T$. The inner product is often used as the scoring function. Then we have
$$f_1^*, f_2^* = \arg\min_{f_1, f_2} \left\| T^* - \arg\max_{T \in \mathcal{T}} \big\langle f_1(I_1),\, h(f_2(I_2), T) \big\rangle \right\| \tag{3}$$
There are two problems. First, an exhaustive evaluation is required for all elements in $\mathcal{T}$, which is extremely time consuming considering the 4-dimensional space. Second, $\arg\max$ is not differentiable, so (3) is hard to optimize. If $f_1$ and $f_2$ are handcrafted processes, the optimization is even harder. In this paper, we set out to make (3) differentiable and to eliminate the exhaustive evaluation in order to find efficient data-driven feature extractors $f_1$ and $f_2$.
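As a concrete reading of the pose in (1), the following minimal sketch builds the 3×3 homogeneous similarity transform from a scale, a heading angle and a translation, and applies it to one point. The function name `sim2_matrix` and its parameter names are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def sim2_matrix(scale, theta, tx, ty):
    """Build a 3x3 homogeneous similarity transform (hypothetical helper).

    scale: isotropic scale, theta: heading angle in radians,
    (tx, ty): translation vector.
    """
    c, s = np.cos(theta), np.sin(theta)
    return np.array([
        [scale * c, -scale * s, tx],
        [scale * s,  scale * c, ty],
        [0.0,        0.0,       1.0],
    ])

# Map the point (1, 0) with scale 2, a 90-degree heading and translation (3, 4).
T = sim2_matrix(2.0, np.pi / 2, 3.0, 4.0)
p = T @ np.array([1.0, 0.0, 1.0])  # homogeneous coordinates
```

The four free parameters (scale, heading, and two translation components) are the 4 degrees of freedom that the rest of the section estimates.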
2.1 Decoupled correlation based pose estimator
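Cross-correlation: We begin with known scale $s$ and rotation $\theta$, reducing the unknown parameters to the translation $t$. Denoting $F_1 = f_1(I_1)$ and $F_2$ as $f_2(I_2)$ warped by the known $s$ and $\theta$, we have
$$\hat{t} = \arg\max_{t} \sum_{x} F_1(x)\, F_2(x - t) \tag{4}$$
where $x$ is a position in the feature map. Note that the term inside the $\arg\max$ is the cross-correlation function, parameterized by $t$, between $F_1$ and $F_2$, which can be evaluated very efficiently using convolution.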
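The efficient evaluation mentioned above can be sketched with the correlation theorem: the full cross-correlation map is obtained from two forward FFTs and one inverse FFT, and its peak gives the shift. This is a minimal NumPy illustration under the assumption of circular (wrap-around) shifts; `estimate_shift` is a hypothetical helper name and the random test image is synthetic.

```python
import numpy as np

def estimate_shift(template, source):
    """Return the circular shift (dy, dx) that best aligns `source` to `template`."""
    # Correlation theorem: corr = IFFT( FFT(template) * conj(FFT(source)) ).
    spectrum = np.fft.fft2(template) * np.conj(np.fft.fft2(source))
    corr = np.fft.ifft2(spectrum).real
    dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
    # Peak indices above half the image size correspond to negative shifts.
    h, w = corr.shape
    if dy > h // 2:
        dy -= h
    if dx > w // 2:
        dx -= w
    return int(dy), int(dx)

rng = np.random.default_rng(0)
source = rng.random((64, 64))
# Synthetic template: the source shifted by 5 rows and -7 columns (circularly).
template = np.roll(source, shift=(5, -7), axis=(0, 1))
shift = estimate_shift(template, source)
```

The `argmax` used here is exactly the non-differentiable step that the learnable pipeline later replaces with an expectation.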
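Translation invariance: In general, if we estimate scale and rotation by assuming known translation, the problem becomes a chicken-and-egg one. We therefore introduce a representation that is invariant to translation but variant to scale and rotation, so that the translation can be ignored when solving for scale and rotation. Specifically, we refer to the magnitude of the frequency spectrum of the two features. According to the shift property of the Fourier transform, for a feature $F$ we have
$$\left| \mathfrak{F}\{F(x - t)\}(u) \right| = \left| e^{-j 2\pi u^\top t}\, \mathfrak{F}\{F(x)\}(u) \right| = \left| \mathfrak{F}\{F(x)\}(u) \right| \tag{5}$$
where $\mathfrak{F}$ is the Fourier transform. This means that only rotation and scale affect the magnitude of the frequency spectrum. Writing $M_1$ and $M_2$ for the magnitude spectra of the two features and $u$ for a position in the frequency spectrum, scale and rotation can therefore be estimated from the magnitudes alone:
$$\hat{s}, \hat{\theta} = \arg\max_{s, \theta} \sum_{u} M_2(u)\, M_1\!\left(s^{-1} R_\theta u\right) \tag{6}$$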
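The translation invariance stated in (5) is easy to check numerically. The sketch below, which assumes a circular shift of a synthetic random image, compares the magnitude spectra before and after shifting; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
image = rng.random((32, 32))
# Circularly shift the image by 4 rows and 9 columns.
shifted = np.roll(image, shift=(4, 9), axis=(0, 1))

# A shift only changes the phase of the spectrum, not its magnitude.
mag_original = np.abs(np.fft.fft2(image))
mag_shifted = np.abs(np.fft.fft2(shifted))
max_diff = float(np.max(np.abs(mag_original - mag_shifted)))
```

For real sensor images the shift is not circular, so content entering and leaving the field of view makes the equality approximate rather than exact.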
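Cartesian to log-polar: We decouple rotation and scale from the relative pose using the frequency spectrum. However, exhaustive evaluation is still required for rotation and scale. Looking into the frequency domain, for two features related by scale $s$ and rotation $\theta$, the magnitude spectra $M_1$ and $M_2$ satisfy
$$M_2(u) = s^{-2}\, M_1\!\left(s^{-1} R_\theta u\right) \tag{7}$$
where $u$ is a position in the frequency spectrum. Representing it in polar coordinates, we have
$$M_2(\rho, \phi) = s^{-2}\, M_1\!\left(s^{-1} \rho,\ \phi + \theta\right) \tag{8}$$
where $\rho = \|u\|$ and $\phi = \operatorname{atan2}(u_y, u_x)$. To deal with the scale, we further apply the logarithm to the radial position:
$$M_2(\log\rho, \phi) = s^{-2}\, M_1\!\left(\log\rho - \log s,\ \phi + \theta\right) \tag{9}$$
Now we finally arrive at a correlation form with respect to scale and rotation, which eliminates all exhaustive evaluation in the whole pose estimator:
$$\hat{s}, \hat{\theta} = \arg\max_{s, \theta} \sum_{\rho, \phi} M_2(\log\rho, \phi)\, M_1\!\left(\log\rho - \log s,\ \phi + \theta\right) \tag{10}$$
When there is no feature extractor, or equivalently when the original sensor measurements are fed in directly, the process is called phase correlation, which estimates the relative pose very efficiently. However, when there is variation between the pair of inputs, the feature extractor is indispensable.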
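The reason the log-polar mapping removes the exhaustive search can be checked on a single frequency-plane position: scaling and rotating the position becomes a pure shift in (log-radius, angle) coordinates, which a cross-correlation can then recover. The sketch below is a coordinate-level illustration only, not an image resampler; `to_log_polar` and `rotate_scale` are hypothetical helper names and the numeric values are arbitrary.

```python
import numpy as np

def to_log_polar(u, v):
    """Map a frequency-plane position (u, v) to (log-radius, angle in radians)."""
    return np.log(np.hypot(u, v)), np.arctan2(v, u)

def rotate_scale(u, v, scale, theta):
    """Apply a counter-clockwise rotation by theta followed by isotropic scaling."""
    c, s = np.cos(theta), np.sin(theta)
    return scale * (c * u - s * v), scale * (s * u + c * v)

u, v = 3.0, 1.0
scale, theta = 1.5, 0.4

log_r0, phi0 = to_log_polar(u, v)
log_r1, phi1 = to_log_polar(*rotate_scale(u, v, scale, theta))

d_log_r = log_r1 - log_r0  # shift along the log-radius axis, equals log(scale)
d_phi = phi1 - phi0        # shift along the angle axis, equals theta
```

In a full pipeline the magnitude spectrum would be resampled on a (log-radius, angle) grid, so that these two shifts show up as row and column offsets of the correlation peak.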
2.2 Endtoend learnable feature extractor
Expectation as differentiable $\arg\max$: To find the feature extractor that is optimal with respect to the training data, we have to approximate the $\arg\max$ in (4) and (10) with a differentiable function. We connect the cross-correlation to a softmax function, which maps the input to a discrete probability density function $p$. Taking the translation part as an example, the cross-correlation in (4) becomes
$$p(t) = \frac{\exp\big(c(t)\big)}{\sum_{t'} \exp\big(c(t')\big)}, \qquad c(t) = \sum_{x} F_1(x)\, F_2(x - t) \tag{11}$$
By keeping the features positive, we do not need to care about negative correlation. Still referring to the translation part, given the probability function, we take the expectation as the optimal translation estimator and substitute it for the $\arg\max$ in (4), yielding the differentiable form
$$f_1^*, f_2^* = \arg\min_{f_1, f_2} \left\| t^* - \sum_{t} t\, p(t) \right\| \tag{12}$$
where $t^*$ is the ground-truth translation. For rotation and scale (10), the same expectation estimator can be applied to approximate the non-differentiable $\arg\max$-based estimator. When learning, we assign a temperature coefficient to the softmax to tune the range of the feature input, which accelerates convergence but makes no difference to the theoretical derivation. The whole pose estimator can be regarded as a differentiable phase correlation (DPC) with backpropagated gradients that enforce the learning of the feature extractor.
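The expectation-based estimator can be sketched as a soft-argmax over a correlation map. This is a minimal NumPy illustration rather than the authors' implementation: the function name `soft_argmax_2d` is hypothetical, and dividing the correlation by the temperature is an assumed convention (the text only states that a temperature coefficient is used).

```python
import numpy as np

def soft_argmax_2d(corr, temperature=1.0):
    """Expected (row, col) under a softmax over a 2D correlation map."""
    logits = corr / temperature      # assumed convention: smaller T sharpens the peak
    logits = logits - logits.max()   # subtract the max for numerical stability
    prob = np.exp(logits)
    prob /= prob.sum()               # discrete probability density over positions
    rows, cols = np.indices(corr.shape)
    return float((prob * rows).sum()), float((prob * cols).sum())

# Toy correlation map with a single peak at row 6, column 2.
corr = np.zeros((9, 9))
corr[6, 2] = 1.0
sharp = soft_argmax_2d(corr, temperature=0.01)  # close to the peak position
flat = soft_argmax_2d(corr, temperature=100.0)  # close to the map centre
```

With a small temperature the expectation approaches the hard peak location while remaining differentiable with respect to the correlation values; with a large temperature it collapses toward the centre of the map, which is why the temperature affects how useful the gradients are.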
Deep feature extractor: The conventional phase correlation [16] uses high-pass filters to suppress the high-frequency random noise of the two inputs, which can be seen as a feature extractor. For more distinct variation between the pair of inputs, a single high-pass filter is far from sufficient. Considering that there is no common feature with which to directly supervise the feature extractor, we use end-to-end learning to address the problem.
We adopt UNet for feature extraction, aiming to learn the common features of the two images implicitly. We construct 4 separate UNets with the input and output size of respectively for the template image and the source image in the rotation phase and the translation phase, as shown in Figure 2. Each UNet is constructed with 4 down-sampling encoder layers and 4 up-sampling decoder layers to extract features. As training progresses, the parameters of the four UNets are tuned. Note that this network is lightweight, so that it is efficient enough for real-time execution. Combining the feature extractor and the DPC, we name the whole network DPCN.
2.3 Data preparation for learning
One question is why we supervise DPCN on the estimated pose rather than supervising the correlation matrix with a one-peak matrix centered at the correct position, which would make the expectation-based estimator unnecessary. We argue that enforcing the correlation matrix to be one-peak is over-supervision. In the theory of phase correlation, the position of maximum correlation does not necessarily imply zero correlation elsewhere, which can be explained by resonance in physics. However, when the correlation map is passed through the softmax estimator, the temperature coefficient suppresses the non-maximal part of the correlation matrix to be close to 0, resulting in a normalized probability density function. In addition to the pose supervision shown in (12), we therefore also supervise the probability density functions of translation and of scale/rotation to be one-peak. Again taking the translation as an example, we do so by applying the KL-divergence:
$$f_1^*, f_2^* = \arg\min_{f_1, f_2} \mathrm{KL}\big(\bar{p} \,\|\, p\big) = \arg\min_{f_1, f_2} \sum_{t} \bar{p}(t) \log \frac{\bar{p}(t)}{p(t)} \tag{13}$$
where $\bar{p}$ is a normalized one-peak function centered at the ground-truth translation $t^*$. In practice, we slightly expand the one-peak function with Gaussian smoothing. Theoretically, the resulting distribution can be multimodal in some cases, e.g. in a repetitive local environment. Nevertheless, given massive training data with similar inputs and different outputs, the optimal solution to (13) is a multimodal probability function, i.e. a multimodal $p$. The two losses (13) and (12) are combined by weighting. For the translation part, multimodal results occur more often, so we increase the weight of (13) in the total loss for the translation phase.
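The distribution-level supervision can be sketched as follows: a Gaussian-smoothed one-peak target is compared with the softmax of a correlation map through the KL-divergence. This is a hedged NumPy illustration; the helper names, the value of `sigma`, the direction KL(target || prediction), and the small `eps` added for numerical safety are all assumptions that the text does not specify.

```python
import numpy as np

def gaussian_target(shape, center, sigma=1.0):
    """Normalized one-peak target: a Gaussian bump centred at `center` (row, col)."""
    rows, cols = np.indices(shape)
    sq_dist = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    target = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return target / target.sum()

def softmax_2d(corr):
    """Softmax over all entries of a 2D correlation map."""
    e = np.exp(corr - corr.max())
    return e / e.sum()

def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions on the same grid."""
    return float(np.sum(target * (np.log(target + eps) - np.log(pred + eps))))

target = gaussian_target((16, 16), center=(5, 11))
# A prediction identical to the target gives zero divergence.
matched = kl_divergence(target, target)
# A flat correlation map (uniform softmax) is penalized.
mismatched = kl_divergence(target, softmax_2d(np.zeros((16, 16))))
```

In training, a term of this kind would be added to the pose loss with a weight, in line with the weighting of (13) and (12) described above.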
3 Experimental Results
In this section, we evaluate the performance of our approach. By applying phase correlation with DFT and log-polar transform to the learned representations of two images, we are able to estimate the rotation and scale between the two heterogeneous images, and eventually to estimate the full 4-DoF relative pose.
Dataset & Metrics: Our approach is evaluated both on a randomly generated simulation dataset and on the real-world “AeroGround Dataset” [19], which contains the following types of image pairs:

l2d: “LiDAR Local Map” to “Drone’s Bird’s-eye Camera View”;

l2sat: “LiDAR Local Map” to “Satellite Map”;

s2d: “Stereo Local Map” to “Drone’s Bird’s-eye Camera View”;

s2sat: “Stereo Local Map” to “Satellite Map”.
On the simulation dataset, we evaluate our work on homogeneous image pairs, heterogeneous image pairs, and heterogeneous image pairs with dynamic obstacles, whereas on the AeroGround Dataset we evaluate our work in the application of a cooperative SLAM system between ground mobile robots, the MAV and the satellite. Examples from both datasets are shown in Figures 3 and 4. Finally, we apply our method to Monte Carlo Localization to demonstrate its real-time capability and robustness. In all datasets, we constrained translations of both and , rotation changes and scale changes in the range of pixels, and respectively with images shapes of .
To evaluate the accuracy and error rate of the estimation, we use “Accuracy in Units” and the Mean Square Error (MSE) between the estimated result and the ground truth as the main indicators:
$$\mathrm{Acc}_{i,a} = \frac{100\%}{N} \sum_{n=1}^{N} \mathbb{1}\big( \left| \hat{v}_{i,n} - v^*_{i,n} \right| < a \big) \tag{14}$$
$$\mathrm{MSE}_{i} = \frac{1}{N} \sum_{n=1}^{N} \big( \hat{v}_{i,n} - v^*_{i,n} \big)^2 \tag{15}$$
where $i$ is the output type (x, y, rotation, and scale), $a$ is the accuracy threshold (in pixels for translation, degrees for rotation and a multiplier for scale), $N$ is the number of image pairs, $\hat{v}_{i,n}$ is the estimated result for the $n$th image pair and $v^*_{i,n}$ is the corresponding ground truth. “Accuracy in Units” is calculated as the percentage of estimates with an error lower than the threshold.
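The two indicators can be computed as in the sketch below. The function names and the toy numbers are illustrative assumptions; the strict inequality follows the description of accuracy as the percentage of estimates with an error lower than the threshold.

```python
import numpy as np

def mse(estimates, ground_truth):
    """Mean square error between estimates and ground truth."""
    est = np.asarray(estimates, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    return float(np.mean((est - gt) ** 2))

def accuracy_within(estimates, ground_truth, threshold):
    """Percentage of estimates whose absolute error is below `threshold`."""
    est = np.asarray(estimates, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    return float(100.0 * np.mean(np.abs(est - gt) < threshold))

# Toy example for one output type (e.g. x translation in pixels).
est = [10.0, 12.5, 7.0, 30.0]
gt = [10.0, 12.0, 9.0, 20.0]
x_mse = mse(est, gt)                    # errors 0, 0.5, -2, 10
x_acc = accuracy_within(est, gt, 1.0)   # two of four errors are below 1
```

A single large outlier dominates the MSE while barely moving the accuracy, which is why the tables report both numbers side by side.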
Comparative Methods: Benchmarks in the experiments include conventional Phase Correlation [16], the deep learning based QATM [3] and DAM [14], Relative Pose Regression (RPR), and Dense Search (DS). Phase Correlation is the baseline for registering two homogeneous images, and its pipeline is also partly adopted in our approach. We select it as a benchmark for evaluating the performance of our approach in matching homogeneous images. QATM is a representative work in image matching that applies deep learning to learn features for matching and NMS for match selection. It handles translation displacement with high accuracy, and we therefore select it as the benchmark for evaluating translation estimation on heterogeneous images. Unfortunately, the authors of QATM only provide a pretrained model without a training script, so we could only evaluate its performance with the provided model. DAM learns affine transformations including translation, rotation and scale changes; we therefore trained DAM on the same dataset and compare our method with it on all four aspects. Relative Pose Regression (RPR) is adopted by multiple methods [6, 5] for pose regression, outputting the desired estimates through several fully connected layers without an analytical solution. Finally, Dense Search (DS) is the methodology adopted by [2], which rotates images by brute force to a set of discrete angles to estimate the relative shifts. We trained DS in a smaller rotation range [0,15] due to its discrete nature and time consumption. It is selected as a benchmark to demonstrate the advantage of our approach in estimation continuity and speed.
3.1 Simulation Dataset
To verify the feasibility of our approach under several different situations and the generalization capability of the fully differentiable methodology, we conduct a variety of experiments on the simulation dataset. The experiments on homogeneous image pairs are conducted to verify the equivalence to conventional phase correlation when dealing with images of the same style. Moreover, with experiments on heterogeneous image pairs, we show the unique capability of our approach to estimate the 4-DoF pose of two images with drastic style changes. Finally, we introduce dynamic obstacles to the heterogeneous image pairs to demonstrate that our approach is insensitive to dynamic obstacles.
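Table 1: Results on the simulation dataset.

| Benchmark | Exp. | x MSE | x Acc (%) | y MSE | y Acc (%) | Rotation MSE | Rotation Acc (%) | Scale MSE | Scale Acc (%) | Runtime (ms) |
|---|---|---|---|---|---|---|---|---|---|---|
| Phase Correlation | 1 | 0.6635 | 99.2 | 0.9231 | 99.5 | 0.0663 | 99.7 | 0.0710 | 98.9 | 141.4 |
| Phase Correlation | 2 | 1774.1592 | 52.3 | 3233.8133 | 33.2 | 145.8561 | 72.3 | 0.1992 | 97.6 | 138.8 |
| Phase Correlation | 3 | 2319.5537 | 49.2 | 2945.3017 | 42.6 | 121.5026 | 67.9 | 0.1218 | 96.7 | 137.0 |
| QATM | 1 | 15.4820 | 95.4 | 8.9192 | 96.3 | - | - | - | - | 108.3 |
| QATM | 2 | 2999.3710 | 31.4 | 4286.4810 | 26.4 | - | - | - | - | 108.9 |
| QATM | 3 | 3651.4691 | 25.6 | 4901.7201 | 21.5 | - | - | - | - | 109.1 |
| DAM | 1 | 53.7597 | 90.6 | 28.6825 | 95.9 | 19.2243 | 81.7 | 0.1452 | 90.5 | 111.7 |
| DAM | 2 | 2.5816 | 98.4 | 4.6234 | 97.9 | 28.9341 | 80.8 | 0.1432 | 90.9 | 114.2 |
| DAM | 3 | 46.1165 | 71.3 | 89.6835 | 68.2 | 36.5608 | 77.8 | 0.3625 | 87.3 | 110.4 |
| RPR | 1 | 8.6754 | 96.9 | 2.2201 | 97.3 | 16.2723 | 90.2 | 0.0805 | 95.2 | 64.1 |
| RPR | 2 | 22.3634 | 39.9 | 32.8334 | 41.2 | 97.8517 | 78.3 | 0.1367 | 96.7 | 64.7 |
| RPR | 3 | 14.6842 | 51.1 | 11.1322 | 56.8 | 101.3329 | 76.8 | 0.1846 | 96.1 | 63.5 |
| DS | 1 | 7.4092 | 85.1 | 11.1108 | 79.3 | 26.2656 | 33.4 | - | - | 304.5 |
| DS | 2 | 5.4970 | 86.6 | 7.3333 | 86.5 | 31.3891 | 23.9 | - | - | 301.3 |
| DS | 3 | 6.2850 | 83.9 | 7.8176 | 79.6 | 25.6643 | 27.1 | - | - | 301.4 |
| DPCN (Ours) | 1 | 0.1031 | 100 | 0.2162 | 100 | 0.0528 | 100 | 0.0522 | 100 | 71.9 |
| DPCN (Ours) | 2 | 0.0073 | 100 | 0.0172 | 100 | 0.0397 | 100 | 0.0642 | 100 | 72.1 |
| DPCN (Ours) | 3 | 0.0761 | 100 | 0.4671 | 100 | 0.0913 | 100 | 0.0013 | 100 | 72.1 |

Note: Exp. 1, 2, and 3 are conducted on the “Homogeneous”, “Heterogeneous”, and “Dynamic Obstacles” sets respectively.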
Experimental results on the simulation dataset are shown in Table 1. The results show that for homogeneous image pairs, our method matches the accuracy of the conventional phase correlation pipeline at a faster speed and outperforms the rest of the baselines. On the heterogeneous and dynamic obstacle sets, our approach outperforms all baselines in accuracy and is only slightly slower than Relative Pose Regression.
The key to our approach is applying a fully differentiable DFT with log-polar transformation to backpropagate the supervised error and train a multichannel feature-based representation that is optimized for phase correlation to estimate the relative pose. With the learned representation, it outperforms the conventional phase correlation method with a high-pass filter when dealing with images that are not simply cutouts of one another. The learned representation, the result of the DFT, the log-polar transformation, and the correlation map shown in Figure 6 indicate that the network is interpretable: by supervising end-to-end, the network is finally able to predict the transformation between two heterogeneous images from UNet outputs that converge to common features. Note that the simulation dataset is randomly generated to reduce overfitting.
Table 2 and the elaborations in Appendix C verify the generalization ability of our approach. For the generalization experiments, the model is trained on the “Heterogeneous” set and evaluated on the two other sets of the simulation dataset. The results show that even a model not specifically trained on these sets maintains a high accuracy in all 4 DoF.
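Table 2: Generalization results on the simulation dataset (model trained on the “Heterogeneous” set).

| Exp. | x MSE | x Acc (%) | y MSE | y Acc (%) | Rotation MSE | Rotation Acc (%) | Scale MSE | Scale Acc (%) |
|---|---|---|---|---|---|---|---|---|
| Homogeneous | 0.4121 | 100 | 0.705 | 100 | 0.0432 | 100 | 0.015 | 100 |
| Dynamic Obstacle | 0.0276 | 100 | 0.105 | 100 | 0.039 | 100 | 0.003 | 100 |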
3.2 AeroGround Dataset
To evaluate the applicability of our approach in the real world, we conduct experiments on the AeroGround Dataset to match images across sensors. Few baselines support 4-DoF (or more) pose estimation across sensors, so we relax the condition by providing the ground-truth rotation and scale as initial values to several benchmarks for comparison, namely DAM with the scale initialized and QATM with both rotation and scale initialized. We train and validate our models in two different scenes (Figure 4(a) and (b)) separately and test their generalization in the third scene (Figure 4(c)).
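Table 3: Validation results on scenes (a) and (b) of the AeroGround Dataset.

| Scene | Benchmark | Exp. | x MSE | x Acc (%) | y MSE | y Acc (%) | Rotation MSE | Rotation Acc (%) | Scale MSE | Scale Acc (%) | Runtime (ms) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (a) | QATM | l2sat | 6533.8164 | 21.5 | 4082.8421 | 57.1 | - | - | - | - | 109.2 |
| (a) | QATM | l2d | 8168.9537 | 34.9 | 4160.1207 | 21.9 | - | - | - | - | 108.8 |
| (a) | QATM | s2sat | 8268.8350 | 24.3 | 7135.4707 | 19.5 | - | - | - | - | 108.9 |
| (a) | QATM | s2d | 5288.4170 | 33.6 | 5309.1103 | 31.7 | - | - | - | - | 109.7 |
| (a) | DAM | l2sat | 507.1945 | 55.4 | 208.8668 | 70.8 | 44.2139 | 37.8 | - | - | 110.6 |
| (a) | DAM | l2d | 690.1782 | 39.4 | 301.1191 | 66.8 | 96.5603 | 22.5 | - | - | 117.3 |
| (a) | DAM | s2sat | 740.6881 | 35.2 | 732.4164 | 33.6 | 105.1678 | 24.1 | - | - | 114.4 |
| (a) | DAM | s2d | 536.5027 | 51.5 | 616.4043 | 43.9 | 68.1288 | 33.9 | - | - | 114.2 |
| (a) | DPCN (Ours) | l2sat | 40.5561 | 96.9 | 4.8175 | 98.0 | 0.1172 | 99.2 | 0.00345 | 95.5 | 74.39 |
| (a) | DPCN (Ours) | l2d | 15.53 | 98.2 | 6.4531 | 94.0 | 0.0412 | 99.2 | 0.0122 | 94.2 | 74.63 |
| (a) | DPCN (Ours) | s2sat | 65.373 | 90.9 | 15.5920 | 97.8 | 0.1078 | 97.4 | 0.0055 | 93.7 | 75.38 |
| (a) | DPCN (Ours) | s2d | 327.31 | 91.3 | 14.493 | 92.6 | 0.2274 | 99.3 | 0.0070 | 93.5 | 73.44 |
| (b) | QATM | l2d | 3603.8648 | 37.5 | 4018.3337 | 35.9 | - | - | - | - | 109.2 |
| (b) | QATM | s2d | 2808.5308 | 36.3 | 2878.3589 | 31.0 | - | - | - | - | 108.5 |
| (b) | DAM | l2d | 972.8225 | 30.1 | 588.4123 | 42.2 | 61.3341 | 35.1 | - | - | 113.9 |
| (b) | DAM | s2d | 633.2790 | 40.9 | 484.3626 | 49.6 | 85.3438 | 27.4 | - | - | 116.5 |
| (b) | DPCN (Ours) | l2d | 8.0043 | 96.2 | 102.359 | 89.2 | 0.0059 | 99.7 | 0.0005 | 99.7 | 75.34 |
| (b) | DPCN (Ours) | s2d | 88.7428 | 91.6 | 61.0860 | 90.6 | 0.7634 | 99.4 | 0.0035 | 95.0 | 75.96 |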
Table 3 shows validation results on scenes (a) and (b) with the threshold error of for translation, for rotation and for scale. Acknowledging that each ground image is generated with a scale of in the AeroGround Dataset, the threshold error of for translation could be transformed to in the real world. The results show that when estimating 4-DoF poses across real-world sensors, our approach is the first to accomplish this task with an accuracy of at least when considering error lower than . Even when we relax the conditions and provide initial values to the rest of the benchmarks, our approach still outperforms them in 2 DoF (QATM) and 3 DoF (DAM). Additional demonstrations comparing our approach with conventional phase correlation are shown in Appendix B.
To evaluate the generalization capability of our approach, we conduct experiments on scene (c) with DPCN models trained on scenes (a) and (b) and a DAM model trained on scene (c). The results shown in Figure 6 and Table 4 (Appendix C) show that, with the types of source sensors given and fixed, our approach is capable of estimating poses with similar accuracy regardless of scene and illumination changes, and that it still outperforms DAM, which is specifically trained on (c). This demonstrates the robustness of our approach in real-world applications.
3.3 Application in satellite map based localization
In this section, we demonstrate the application in localization by introducing our approach to Monte Carlo Localization (MCL). It shows that with corresponding maps as the output, our model is capable of real-time air-ground localization, e.g. in scene (a). A demonstration is shown in Figure 7, where the green dashed line is the odometry estimated by “VINS” through a stereo camera, the red dashed line is generated by MCL by matching 4-DoF poses between the LiDAR intensity map and the satellite map, and the yellow line is the ground truth. As the cumulative error of the odometry gradually increases, the corrective effect of our method is clearly demonstrated.
4 Conclusion
We present an approach for precise multi-sensor pose matching, which greatly facilitates multi-agent collaborative exploration. We achieve this by training pairs of individual UNets with end-to-end pose supervision to learn representations of heterogeneous images that can be matched by differentiable phase correlation (DFT+LPT+Correlation). We show that by training the network end-to-end with a fully differentiable pipeline, the network is easy and fast to train, precise in matching and capable of running in real time. We also show that, because every estimation step is analytical, the network is fully interpretable and has the capability of generalization.
References
[1] (2017-11) Driven to Distraction: Self-Supervised Distractor Learning for Robust Monocular Visual Odometry in Urban Environments. arXiv:1711.06623, ISBN 9781538630808.
[2] (2019) Masking by Moving: Learning Distraction-Free Radar Odometry from Pose Information. (CoRL). arXiv:1909.03752.
[3] (2019) QATM: Quality-aware template matching for deep learning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2019-June, pp. 11545–11554. arXiv:1903.07254v2, ISBN 9781728132938, ISSN 10636919.
[4] (2016) Collaborative localization of aerial and ground robots through elevation maps. SSRR 2016 - International Symposium on Safety, Security and Rescue Robotics, pp. 284–290. ISBN 9781509043491.
[5] (2017) Geometric loss functions for camera pose regression with deep learning. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017. arXiv:1704.00390, ISBN 9781538604571.
[6] (2015) PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision. ISBN 9781467383912, ISSN 15505499.
[7] (2018) Robust template matching using scale-adaptive deep convolutional features. Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 2018-Febru (December), pp. 708–711. ISBN 9781538615423.
[8] (2010) Large scale graph-based SLAM using aerial images as prior information. Robotics: Science and Systems 5, pp. 297–304. ISBN 9780262514637, ISSN 2330765X.
[9] (2015) Learning deep representations for ground-to-aerial geolocalization. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. ISBN 9781467369640, ISSN 10636919.
[10] (2019) L3-Net: Towards learning based LiDAR localization for autonomous driving. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. ISBN 9781728132938, ISSN 10636919.
[11] (2007) A multi-state constraint Kalman filter for vision-aided inertial navigation. In Proceedings - IEEE International Conference on Robotics and Automation. ISBN 1424406021, ISSN 10504729.
[12] (2011) Vehicle ego-localization by matching in-vehicle camera images to an aerial image. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6469 LNCS (PART 2), pp. 163–173. ISBN 9783642228186, ISSN 03029743.
[13] (2019) GPU accelerated real-time traversability mapping. In IEEE International Conference on Robotics and Biomimetics, ROBIO 2019. ISBN 9781728163215.
[14] (2020) A two-stream symmetric network with bidirectional ensemble for aerial image matching. Remote Sensing. ISSN 20724292.
[15] (2015) Localization on OpenStreetMap data using a 3D laser scanner. In Proceedings - IEEE International Conference on Robotics and Automation. ISSN 10504729.
[16] (1996) An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Transactions on Image Processing. ISSN 10577149.
[17] (2017) Template matching with deformable diversity similarity. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017. arXiv:1612.02190, ISBN 9781538604571.
[18] (2020) RSL-Net: Localising in Satellite Images from a Radar on the Ground. IEEE Robotics and Automation Letters 5 (2), pp. 1087–1094. arXiv:2001.03233, ISSN 23773766.
[19] (2020) (Website).
Appendix A Network Structure and Experimental Setup
A.1 Network Structure
We train the DPCN network with pairs of heterogeneous images as input and the ground truth of their relative pose as supervision. The training consists of two phases.
Phase 1: Rotation and scale changes are trained and estimated in this phase. The pair of images goes through a pair of UNets, and their outputs go through the DFT layer to remove the effect of translation. The log-polar transform remaps the DFT output to log-polar coordinates, so that rotation and scale changes appear as shifts along columns and rows. Afterward, phase correlation estimates the rotation and scale, with a relative power spectrum as the output. Finally, we supervise the UNets using the cross-entropy loss between the estimated rotation and scale and their ground truth.
Phase 2: The translation of the image is trained and estimated in this phase. The source image of the image pair is rotated and scaled by the result of phase 1, and the new image pair (the original template image and the new source image) goes through a new pair of UNets. The translation is then estimated by phase correlation, and the UNets are supervised using the cross-entropy loss and the L1 loss between the estimated transformation and its ground truth.
A.2 Experimental Setup
Hardware: All models for the full method and the comparison models were trained on a single server with an Intel i9-9900X CPU @ 3.5 GHz × 20 and RTX 2080 Ti GPUs × 4. The validation experiments were all conducted on a single desktop computer (AMD Ryzen 3700X @ 3.7 GHz × 8) with an RTX 2060 Super GPU. The AeroGround Dataset was recorded by a ground robot and a DJI M100 drone. The ground robot is equipped with three 32-wire Velodyne LiDARs, four pairs of stereo cameras by Flir and one RTK dGPS provided by QX. The DJI drone records videos with a pair of Intel RealSense D435i stereo cameras, one facing forward and one facing downward.
Software: The bird's-eye global map from the drone used in training and validation is constructed with the fully licensed software Metashape from video clips in the AeroGround dataset. The local maps from the ground robot, in both LiDAR intensity style and stereo style, are constructed with Elevation Map [13]. All of the maps constructed above and the satellite map obtained from Google Maps have a resolution of 0.1 MPP.
Appendix B Additional Demonstration
Demonstrations comparing DPCN and conventional phase correlation on the real-world AeroGround Dataset are shown in Figure 9 and Figure 10. They show that when matching heterogeneous images from different sensors, the trainable feature extractors in DPCN play an important role in outperforming conventional phase correlation. However, the core of the entire DPCN is the differentiable DFT, LPT and phase correlation, which backpropagate the losses and eventually train the feature extractors. The demonstrations also show that by learning features through the supervision of end-to-end poses, the approach is capable of reducing hollows, rejecting noise and assimilating different styles to estimate the 4-DoF relative pose.
B.1 DPCN Results
B.2 Conventional Phase Correlation Results
Appendix C Elaboration on Estimation
The threshold of estimation in the experiments in Section 3 is elaborated in this section by means of graphs. Figure 11 shows the of translation estimation in simulation dataset, Figure 12 shows the of translation estimation in AeroGround dataset, Figure 13 shows the of translation estimation in simulation dataset, and Table 4 shows the exact error of the generalization experiments on the AeroGround Dataset.
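Table 4: Generalization results on scene (c) of the AeroGround Dataset.

| Model | x MSE | x Acc (%) | y MSE | y Acc (%) | Rotation MSE | Rotation Acc (%) | Scale MSE | Scale Acc (%) |
|---|---|---|---|---|---|---|---|---|
| DPCN in (a) | 232.4638 | 73.2 | 29.9253 | 92.4 | 89.0943 | 95.7 | 0.0084 | 95.0 |
| DPCN in (b) | 31.2071 | 92.2 | 138.5449 | 88.5 | 2.8793 | 96.3 | 0.0153 | 93.3 |
| DAM | 602.8490 | 40.8 | 720.9244 | 33.1 | 88.7239 | 22.5 | 0.0153 | 93.3 |
| QATM | 3922.6715 | 26.7 | 1103.6291 | 36.9 | - | - | - | - |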