Localization is one of the most fundamental problems for mobile robots. After a decade of research, localization given measurements and a map built by the same sensor is relatively mature. Matching measurements from heterogeneous sensor modalities, however, remains an open problem. The problem is practical considering the effort required to build a map: we would like a map to be shareable by multiple robot users, even when they are equipped with heterogeneous sensors. Given the great progress in visual-inertial navigation and mapping, the solution space of localization can be reduced to 3 or 4 degrees of freedom, namely the translation and the heading angle (plus the scale in the 4-DoF case). By exploiting birds-eye view measurements, the problem is reformulated as an image matching problem whose warping space is that of 2D similarity transforms. In this paper, we focus on this scenario, with birds-eye view images acquired by ground vehicles, satellites and UAVs using vision and LiDAR, as shown in Figure 1.
Previous research on homogeneous image matching can be divided into two categories: methods that rely on point feature correspondences to localize in specific setups [12, 9, 1, 3], and methods that apply dense correlation to find the best pose candidate in the solution space [16, 4, 17, 7, 2]. However, none of these approaches performs well when heterogeneous measurements are given. For heterogeneous image matching, [8, 15] use hand-crafted features to localize LiDAR against satellite maps. These frameworks rely heavily on the coverage and optimality of the hand-crafted features over the variations in real environments. Learning-based methods, which have shown a certain generalization ability, are intuitively appropriate for heterogeneous image matching. [10, 18] learn embeddings for heterogeneous observations and exhaustively search for the optimal pose in the discrete solution space, which is more interpretable than regressing the pose with an end-to-end network as in [6, 5]. However, the former suffers from low efficiency and a limited pose range due to the exhaustive evaluation over the large pose space, which restricts its application to cases with known scale.
These limitations of prior work suggest an ideal matcher: one that obtains the solution without exhaustive evaluation while retaining good interpretability and generalization. In this work, we propose such a learnable matcher, the essence of which is a differentiable phase correlation. Phase correlation is a similarity-based matcher that performs well for inputs of the same modality but tolerates only small high-frequency noise. We make phase correlation differentiable and embed it into our end-to-end framework. This architecture allows our system to find optimal feature extractors with respect to the resulting pose of image matching. Specifically, we adopt the conventional phase correlation pipeline and make the Discrete Fourier Transform (DFT) layer, the log-polar transformation (LPT) layer, and the differentiable correlation (DC) layer explicitly differentiable, so that the pipeline is trainable within our end-to-end matching network, as shown in Figure 2. Our experiments show the robustness and efficiency of the proposed method in matching heterogeneous sensor measurements.
2 Deep Phase Correlation Network
With a known gravity direction, the relative pose between the two observed birds-eye view images can be simplified to a similarity transform, parameterized by the scale, the rotation matrix generated by the heading angle, and the translation vector. In general, the two images are disturbed by illumination, shadow and occlusion, or are even acquired by heterogeneous sensors. For example, one image may be acquired by the birds-eye camera of a UAV in the morning, while the other is a local elevation map constructed by a ground robot with a LiDAR. To address this issue, a general approach is to extract features from the two images and to estimate the relative pose from the features instead of the original sensor measurements. By applying template matching to the features, we derive the optimal feature extractors for the two images by solving an optimization problem defined by the ground-truth relative pose, a scoring function that measures similarity, and a warping operator that transforms a feature map by a candidate relative pose. The inner product is often used as the scoring function, which gives the objective referred to as (3).
There are two problems. First, an exhaustive evaluation is required over all candidate poses, which is extremely time consuming given the dimensionality of the pose space. Second, the arg max over candidate poses is not differentiable, so (3) is hard to optimize. If the feature extractors are hand-crafted processes, the optimization is even harder. In this paper, we set out to make (3) differentiable and to eliminate the exhaustive evaluation, in order to find efficient data-driven feature extractors.
2.1 Decoupled correlation based pose estimator
Cross-correlation: We begin with known scale and rotation, which reduces the unknown parameters to the translation. The objective (4) then sums, over positions in the feature maps, the product of one feature map with the translated other. Note that this term is the cross-correlation function between the two feature maps, parameterized by the translation, which can be evaluated very efficiently using convolution.
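To make the role of the cross-correlation concrete, the following is a minimal sketch that recovers a known circular shift between two small feature maps by brute force. It is an illustration only, not the implementation used in the paper: the helper names (`circular_cross_correlation`, `argmax_2d`, `circular_shift`) and the toy feature map are assumptions, the shift is circular for simplicity, and a practical system would evaluate the correlation with an FFT or a convolution rather than nested loops.

```python
def circular_cross_correlation(template, source):
    """Score every circular shift (dy, dx) by the inner product
    sum_{y,x} template[y][x] * source[(y + dy) % H][(x + dx) % W]."""
    height, width = len(template), len(template[0])
    scores = [[0.0] * width for _ in range(height)]
    for dy in range(height):
        for dx in range(width):
            scores[dy][dx] = sum(
                template[y][x] * source[(y + dy) % height][(x + dx) % width]
                for y in range(height)
                for x in range(width)
            )
    return scores


def argmax_2d(scores):
    """Return the (row, col) index of the largest entry."""
    best = max(
        ((value, (r, c)) for r, row in enumerate(scores) for c, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1]


def circular_shift(image, dy, dx):
    """Shift an image down by dy rows and right by dx columns, wrapping around."""
    height, width = len(image), len(image[0])
    return [
        [image[(y - dy) % height][(x - dx) % width] for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 5x6 feature map with a small asymmetric blob.
template = [
    [0, 0, 0, 0, 0, 0],
    [0, 3, 1, 0, 0, 0],
    [0, 1, 2, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
source = circular_shift(template, dy=2, dx=3)

estimated_shift = argmax_2d(circular_cross_correlation(template, source))
print(estimated_shift)
```

The peak of the correlation map sits at the applied shift, which is the quantity the translation stage estimates. The hard arg max used here is exactly the non-differentiable step that the learnable estimator has to avoid.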
Translation invariance: If we estimated the scale and rotation by assuming a known translation, we would face a chicken-and-egg problem. We therefore introduce a representation that is invariant to translation but sensitive to scale and rotation, so that the translation can be ignored when solving for scale and rotation. Specifically, we use the magnitude of the frequency spectrum of the two feature maps. By the shift property of the Fourier transform, a translation changes only the phase of the spectrum, which means that only rotation and scale affect the magnitude of the frequency spectrum.
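The translation invariance of the magnitude spectrum can be checked numerically. The sketch below is a one-dimensional illustration under stated assumptions: it uses a naive O(n^2) DFT written with the standard library to stay dependency-free, a circular shift instead of a general translation, and a made-up test signal. The function names are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[k] * cmath.exp(-2j * cmath.pi * u * k / n) for k in range(n))
        for u in range(n)
    ]


def magnitude_spectrum(signal):
    """Magnitude of each DFT coefficient; the phase is discarded."""
    return [abs(coefficient) for coefficient in dft(signal)]


signal = [0.0, 1.0, 4.0, 2.0, 0.0, 0.0, 3.0, 0.0]
shifted = signal[-3:] + signal[:-3]  # circular shift by 3 samples

mag_original = magnitude_spectrum(signal)
mag_shifted = magnitude_spectrum(shifted)
max_difference = max(abs(a - b) for a, b in zip(mag_original, mag_shifted))
print(max_difference)
```

The two magnitude spectra agree up to floating-point error, because the shift only multiplies each coefficient by a unit-modulus phase factor.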
Cartesian to log-polar: The frequency spectrum decouples rotation and scale from the relative pose. However, exhaustive evaluation is still required for rotation and scale. In the frequency domain, the magnitude spectrum of one feature map is a rotated and scaled copy of the other. Representing a position in the frequency spectrum in polar coordinates turns the rotation into a shift along the angular axis, while the scale still multiplies the radius. To deal with the scale, we further apply the logarithm to the radius, which turns the scale into a shift along the log-radius axis. We thus arrive at a correlation form with respect to scale and rotation, which eliminates all exhaustive evaluation in the whole pose estimator.
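The coordinate property behind the log-polar step can be illustrated directly on points. In this sketch, which is an assumption-laden illustration rather than the paper's layer, each point is scaled and rotated about the origin, and the offset between its log-polar coordinates before and after is compared. The names `to_log_polar` and `similarity`, the sample points and the chosen scale and angle are all hypothetical. In the paper's setting the same property is applied to positions in the magnitude spectrum, where a spatial scale factor appears inverted.

```python
import math


def to_log_polar(x, y):
    """Map a non-zero Cartesian point to (log radius, angle in radians)."""
    return math.log(math.hypot(x, y)), math.atan2(y, x)


def similarity(x, y, scale, theta):
    """Scale by `scale` and rotate counter-clockwise by `theta` about the origin."""
    c, s = math.cos(theta), math.sin(theta)
    return scale * (c * x - s * y), scale * (s * x + c * y)


scale, theta = 1.5, math.radians(20.0)
points = [(1.0, 0.5), (2.0, 1.0), (0.3, 0.4)]

offsets = []
for x, y in points:
    log_r0, phi0 = to_log_polar(x, y)
    log_r1, phi1 = to_log_polar(*similarity(x, y, scale, theta))
    offsets.append((log_r1 - log_r0, phi1 - phi0))

print(offsets)
```

Every point moves by the same offset, (log of the scale, rotation angle), regardless of where it started. A common shift is exactly what a correlation can locate, which is why no exhaustive search over rotation and scale is needed.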
When there is no feature extractor, or equivalently, when the original sensor measurements are fed in directly, the process is called phase correlation, which estimates the relative pose very efficiently. However, when there is variation between the pair of inputs, the feature extractor is indispensable.
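For reference, classical phase correlation without any feature extractor can be sketched in a few lines. The example below is one-dimensional and uses a naive DFT from the standard library, so it is a toy illustration under those assumptions and not the pipeline used in the experiments. The function names, the `eps` stabilizer and the test signal are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[k] * cmath.exp(-2j * cmath.pi * u * k / n) for k in range(n))
        for u in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[u] * cmath.exp(2j * cmath.pi * u * k / n) for u in range(n)) / n
        for k in range(n)
    ]


def phase_correlation(reference, moved, eps=1e-12):
    """Estimate the circular shift d with moved[k] == reference[(k - d) % n].

    The cross-power spectrum is normalized to unit magnitude so that only
    the phase difference remains; its inverse transform peaks at the shift.
    """
    cross_power = []
    for ref_c, mov_c in zip(dft(reference), dft(moved)):
        product = mov_c * ref_c.conjugate()
        cross_power.append(product / (abs(product) + eps))
    response = [value.real for value in idft(cross_power)]
    shift = max(range(len(response)), key=response.__getitem__)
    return shift, response


reference = [0.0, 1.0, 4.0, 2.0, 0.0, 0.0, 3.0, 0.0]
moved = reference[-3:] + reference[:-3]  # circular shift by 3 samples

estimated_shift, response = phase_correlation(reference, moved)
print(estimated_shift)
```

For identical content the response is close to a single impulse at the true shift. Once the two inputs differ in appearance, the phases no longer agree and the peak degrades, which is the motivation for learning feature extractors in front of this step.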
2.2 End-to-end learnable feature extractor
To make the estimator differentiable, we replace the non-differentiable arg max with a differentiable function. We connect the cross-correlation to a softmax function, which maps its input to a discrete probability distribution. Taking the translation part as an example, the softmax is applied to the cross-correlation in (4).
By keeping the features positive, we do not need to consider negative correlation. Still referring to the translation part, given the probability distribution, we take its expectation as the translation estimator and substitute it into (4), yielding a differentiable form.
For rotation and scale (10), the same expectation estimator can be applied to approximate the non-differentiable arg max based estimator. During learning, we assign a temperature coefficient to the softmax to tune the range of its input, which accelerates convergence but makes no difference to the theoretical derivation. The whole pose estimator can be regarded as a differentiable phase correlation (DPC), whose back-propagated gradients drive the learning of the feature extractors.
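The softmax expectation described above is often called a soft arg max. The following minimal sketch shows how it behaves on a small correlation map; it assumes the temperature divides the correlation values (the paper does not state whether it multiplies or divides), and the map, the function names and the two temperature values are hypothetical.

```python
import math


def softmax(values, temperature):
    """Numerically stable softmax of values / temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_shift(correlation, temperature):
    """Expectation of the (row, col) index under the softmax distribution."""
    width = len(correlation[0])
    flat = [value for row in correlation for value in row]
    probs = softmax(flat, temperature)
    expected_row = sum(p * (i // width) for i, p in enumerate(probs))
    expected_col = sum(p * (i % width) for i, p in enumerate(probs))
    return expected_row, expected_col


# Hypothetical correlation map with a clear peak at row 1, column 1.
correlation = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 5.0, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.0],
]

sharp_row, sharp_col = expected_shift(correlation, temperature=0.05)
soft_row, soft_col = expected_shift(correlation, temperature=10.0)
print((sharp_row, sharp_col), (soft_row, soft_col))
```

With a small temperature the expectation coincides with the arg max location while remaining differentiable with respect to the correlation values. With a large temperature the distribution flattens and the estimate drifts toward the center of the map, which shows why the temperature matters for training even though it does not change the derivation.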
Deep feature extractor: Conventional phase correlation uses high-pass filters to suppress high-frequency random noise in the two inputs, which can be seen as a feature extractor. For more pronounced variation between the pair of inputs, a single high-pass filter is far from sufficient. Since there is no common feature available to directly supervise the feature extractor, we use end-to-end learning to address the problem.
We adopt U-Net for feature extraction, aiming to learn the common features of the two images implicitly. We construct 4 separate U-Nets, for the template image and the source image in the rotation phase and in the translation phase respectively, as shown in Figure 2. Each U-Net consists of 4 down-sampling encoder layers and 4 up-sampling decoder layers. The parameters of all four U-Nets are tuned as training progresses. Note that the network is lightweight, so that it is efficient enough for real-time execution. Combining the feature extractors and the DPC, we name the whole network DPCN.
2.3 Data preparation for learning
One question is why we supervise DPCN on the estimated pose rather than supervising the correlation matrix with a one-peak matrix centered at the correct position, which would make the expectation-based estimator unnecessary. We argue that forcing the correlation matrix to be one-peak is over-supervision. In the theory of phase correlation, a maximum correlation at one position does not imply zero correlation at the others, which can be explained by resonance in physics. However, when the correlation map is passed through the softmax estimator, the temperature coefficient suppresses the non-maximal part of the correlation matrix toward 0, resulting in a normalized probability distribution. In addition to the pose supervision in (12), we therefore also supervise the probability distributions of translation and of scale/rotation to be one-peak. Taking translation as an example again, we do so by applying the KL-divergence (13) between the predicted distribution and a normalized one-peak function centered at the ground truth.
In practice, we slightly widen the one-peak function with Gaussian smoothing. Theoretically, the true distribution can be multi-modal in some cases, e.g. in a repetitive local environment, which a one-peak target seems to contradict. However, given massive training data with similar inputs and different outputs, the optimal solution to (13) is itself a multi-modal probability function. The two losses (13) and (12) are combined with a weight. For the translation part, multi-modal results occur more often, so we increase the weight of (13) in the total loss for the translation phase.
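The distribution supervision can be sketched as follows. This is a minimal illustration under explicit assumptions: the direction of the divergence, KL(target || predicted), is a guess because the source does not spell it out, and the smoothing width `sigma`, the weight `kl_weight`, the grid size and all function names are hypothetical.

```python
import math


def gaussian_one_peak(height, width, center, sigma):
    """Normalized one-peak target: a Gaussian bump centered at `center`."""
    cy, cx = center
    weights = [
        [math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2.0 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) over two distributions on the same grid."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t_row, p_row in zip(target, predicted)
        for t, p in zip(t_row, p_row)
    )


def total_loss(pose_error, kl_term, kl_weight):
    """Weighted mix of the pose loss and the distribution loss."""
    return pose_error + kl_weight * kl_term


target = gaussian_one_peak(5, 7, center=(2, 3), sigma=1.0)
good_prediction = gaussian_one_peak(5, 7, center=(2, 3), sigma=1.0)
bad_prediction = gaussian_one_peak(5, 7, center=(2, 5), sigma=1.0)

kl_good = kl_divergence(target, good_prediction)
kl_bad = kl_divergence(target, bad_prediction)
loss = total_loss(pose_error=2.0, kl_term=kl_bad, kl_weight=0.5)
print(kl_good, kl_bad, loss)
```

The divergence is zero when the predicted distribution matches the smoothed target and grows as the predicted peak moves away from it. Raising `kl_weight` for the translation phase corresponds to the weighting choice described above.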
3 Experimental Results
In this section, we evaluate the performance of our approach. By applying phase correlation with the DFT and the log-polar transform to the learned representations of two images, we are able to estimate the rotation and scale between the two heterogeneous images, and ultimately the full 4-DoF relative pose.
Dataset & Metrics: Our approach is evaluated both on a randomly generated simulation dataset and on the real-world “Aero-Ground Dataset”, which contains the following types of image pairs:
l2d: “LiDAR Local Map” to “Drone’s Birds-eye Camera View”;
l2sat: “LiDAR Local Map” to “Satellite Map”;
s2d: “Stereo Local Map” to “Drone’s Birds-eye Camera View”;
s2sat: “Stereo Local Map” to “Satellite Map”.
On the simulation dataset, we evaluate our work on homogeneous image pairs, heterogeneous image pairs, and heterogeneous image pairs with dynamic obstacles. On the Aero-Ground Dataset, we evaluate our work in the setting of a cooperative SLAM system involving ground mobile robots, the MAV and the satellite. Examples from both datasets are shown in Figures 3 and 4. Finally, we apply our method to Monte Carlo Localization to demonstrate its real-time capability and robustness. In all datasets, the translations along both axes, the rotation changes and the scale changes are constrained to fixed ranges, and all images share the same shape.
To evaluate the accuracy and error of the estimation, we use “Accuracy in Units” and the Mean Square Error (MSE) between the estimated result and the ground truth as the main indicators.
Both metrics are computed separately for each output type (x, y, rotation, and scale) over all image pairs, by comparing the estimated result for each image pair with the corresponding ground truth. The accuracy threshold is expressed in pixels for translation, in degrees for rotation and as a multiplier for scale. “Accuracy in Units” is the percentage of estimates with an error lower than the threshold.
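As a concrete reading of the two indicators, the sketch below computes them for one output type. The sample values, the threshold and the function names are made up for illustration, and the strict comparison follows the wording "an error lower than the threshold".

```python
def mean_square_error(estimates, ground_truth):
    """Mean of squared differences between estimates and ground truth."""
    return sum((e - g) ** 2 for e, g in zip(estimates, ground_truth)) / len(estimates)


def accuracy_in_units(estimates, ground_truth, threshold):
    """Percentage of estimates whose absolute error is below the threshold."""
    hits = sum(1 for e, g in zip(estimates, ground_truth) if abs(e - g) < threshold)
    return 100.0 * hits / len(estimates)


# Hypothetical x-translation estimates (in pixels) for four image pairs.
estimated_x = [10.2, -3.9, 7.0, 25.0]
true_x = [10.0, -4.0, 7.5, 20.0]

mse_x = mean_square_error(estimated_x, true_x)
acc_x = accuracy_in_units(estimated_x, true_x, threshold=1.0)
print(mse_x, acc_x)
```

The example also shows why both indicators are reported: a single large outlier dominates the MSE while barely moving the accuracy, so the two numbers describe different aspects of the error distribution.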
Comparative Methods: Baselines in the experiments include conventional Phase Correlation, the deep learning based QATM and DAM, Relative Pose Regression (RPR), and Dense Search (DS). Phase Correlation is the baseline for registering two homogeneous images, and its pipeline is partly adopted in our approach. We select it to evaluate the performance of our approach in matching homogeneous images. QATM is a representative work in image matching that uses deep learning to learn features for matching and NMS for match selection. It handles translational displacement with high accuracy, so we select it as the baseline for translation estimation on heterogeneous images. Unfortunately, the authors of QATM provide only a pretrained model without a training script, so we could only evaluate its performance with the provided model. DAM learns affine transformations including translation, rotation and scale changes, so we trained DAM on the same dataset and compare our method with it on all four aspects. Relative Pose Regression (RPR) is adopted by multiple methods [6, 5] for pose regression; it outputs the desired estimates through several fully-connected layers without an analytical solution. Finally, Dense Search (DS) is a methodology that rotates images by brute force to a set of discrete angles and estimates the relative shift for each. We trained DS on a smaller rotation range, [0, 15], because of its discrete nature and time consumption. It is selected as a baseline to demonstrate the advantage of our approach in estimation continuity and speed.
3.1 Simulation Dataset
To verify the feasibility of our approach in several different situations and the generalization capability of the fully differentiable methodology, we conduct a variety of experiments on the simulation dataset. The experiments on pairs of homogeneous images verify the equivalence to conventional phase correlation when dealing with images of the same style. The experiments on pairs of heterogeneous images show the ability of our approach to estimate the 4-DoF pose between two images with drastic style changes. Finally, we introduce dynamic obstacles into the heterogeneous image pairs to demonstrate that our approach is insensitive to dynamic obstacles.
Note: the three groups of experiments are conducted on the “Homogeneous”, “Heterogeneous”, and “Dynamic Obstacles” sets respectively. Red marks the best performance and blue the second best.
Experimental results on the simulation dataset are shown in Table 1. For the homogeneous image pairs, our method matches the accuracy of the conventional phase correlation pipeline while running faster, and it outperforms the rest of the baselines. On the heterogeneous and dynamic obstacle sets, our approach outperforms all baselines in accuracy and is only slightly slower than Relative Pose Regression.
The key to our approach is a fully differentiable DFT with log-polar transformation that back-propagates the supervised error to train a multi-channel feature representation optimized for estimating the relative pose by phase correlation. With the learned representation, it outperforms the conventional phase correlation method, which applies a high-pass filter, when dealing with images that are not simply cutouts of one another. The learned representation, the result of the DFT, the log-polar transformation, and the correlation map shown in Figure 6 indicate that the network is interpretable: through end-to-end supervision, the network is ultimately able to predict the transformation between two heterogeneous images from U-Net outputs with convergent features. Note that the simulation dataset is randomly generated to reduce overfitting.
Table 2 and the elaborations in Appendix C verify the generalization ability of our approach. For the generalization experiments, the model is trained on the “Heterogeneous” set and evaluated on the two other sets of the simulation dataset. The results show that, even without being specifically trained on those sets, the model maintains high accuracy in all 4 DoF.
3.2 Aero-Ground Dataset
To evaluate the real-world applicability of our approach, we conduct experiments on the Aero-Ground Dataset to match images across sensors. Few baselines support pose estimation with 4 or more DoF across sensors, so we relax the condition by providing ground-truth rotation and scale as initial values to several baselines for comparison, including DAM with the scale initialized and QATM with both rotation and scale initialized. We train and validate our models on two different scenes (Figure 4(a) and (b)) separately and test their generalization on a third scene (Figure 4(c)).
Table 3 shows validation results on scenes (a) and (b) under fixed error thresholds for translation, rotation and scale. Since each ground image in the Aero-Ground Dataset is generated at a known scale, the pixel threshold for translation corresponds to a metric distance in the real world. The results show that, for estimating 4-DoF poses across real-world sensors, our approach is the first to accomplish this task at the reported accuracy under these thresholds. Even when we relax the conditions and provide initial values to the other baselines, our approach still outperforms them in the 2-DoF (QATM) and 3-DoF (DAM) settings. Additional demonstrations comparing our approach with conventional phase correlation are shown in Appendix B.
To evaluate the generalization capability of our approach, we conduct experiments on scene (c) with DPCN models trained on scenes (a) and (b) and a DAM model trained on scene (c). The results shown in Figure 6 and Table 4 (Appendix C) indicate that, with the types of source sensors given and fixed, our approach estimates poses with similar accuracy regardless of scene and illumination changes, and still outperforms DAM, which is specifically trained on (c). This demonstrates the robustness of our approach in real-world applications.
3.3 Application in satellite map based localization
In this section, we demonstrate the application to localization by integrating our approach into Monte Carlo Localization (MCL). It shows that, given the corresponding maps, our model is capable of real-time air-ground localization, e.g. in scene (a). A demonstration is shown in Figure 7, where the green dashed line is the odometry estimated by “VINS” through a stereo camera, the red dashed line is generated by MCL by matching 4-DoF poses between the LiDAR intensity map and the satellite map, and the yellow line is the ground truth. As the cumulative error of the odometry gradually increases, the corrective effect of our method becomes evident.
We present an approach for precise multi-sensor pose matching, which greatly facilitates multi-agent collaborative exploration. We achieve this by training pairs of individual U-Nets with end-to-end pose supervision to learn representations of heterogeneous images that can be matched by differentiable phase correlation (DFT + LPT + correlation). We show that, by training end-to-end with a fully differentiable pipeline, the network is easy and fast to train, precise in matching and capable of running in real time. We also show that, because every estimation step is analytical, the network is interpretable and generalizes well.
[1] (2017-11) Driven to Distraction: Self-Supervised Distractor Learning for Robust Monocular Visual Odometry in Urban Environments.
[2] (2019) Masking by Moving: Learning Distraction-Free Radar Odometry from Pose Information. CoRL.
[3] (2019) QATM: Quality-aware template matching for deep learning. 2019-June, pp. 11545–11554.
[4] (2016) Collaborative localization of aerial and ground robots through elevation maps. SSRR 2016 - International Symposium on Safety, Security and Rescue Robotics, pp. 284–290.
[5] Geometric loss functions for camera pose regression with deep learning. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017.
[6] (2015) PoseNet: A convolutional network for real-time 6-dof camera relocalization. In Proceedings of the IEEE International Conference on Computer Vision.
[7] (2018) Robust template matching using scale-adaptive deep convolutional features. Proceedings - 9th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, APSIPA ASC 2017 2018-Febru (December), pp. 708–711.
[8] (2010) Large scale graph-based SLAM using aerial images as prior information. Robotics: Science and Systems 5, pp. 297–304.
[9] (2015) Learning deep representations for ground-to-aerial geolocalization. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[10] (2019) L3-net: Towards learning based lidar localization for autonomous driving. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[11] A multi-state constraint Kalman filter for vision-aided inertial navigation. In Proceedings - IEEE International Conference on Robotics and Automation.
[12] Vehicle ego-localization by matching in-vehicle camera images to an aerial image. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6469 LNCS (PART 2), pp. 163–173.
[13] (2019) GPU accelerated real-time traversability mapping. In IEEE International Conference on Robotics and Biomimetics, ROBIO 2019.
[14] (2020) A two-stream symmetric network with bidirectional ensemble for aerial image matching. Remote Sensing.
[15] (2015) Localization on OpenStreetMap data using a 3D laser scanner. In Proceedings - IEEE International Conference on Robotics and Automation.
[16] (1996) An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Transactions on Image Processing.
[17] (2017) Template matching with deformable diversity similarity. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017.
[18] (2020) RSL-Net: Localising in Satellite Images from a Radar on the Ground. IEEE Robotics and Automation Letters 5 (2), pp. 1087–1094.
[19] (2020) (Website).
Appendix A Network Structure and Experimental Setup
A.1 Network Structure
We train the DPCN network on pairs of heterogeneous images together with the ground truth of their relative pose. Training has two phases.
Phase 1: Rotation and scale changes are trained and estimated in this phase. The pair of images goes through a pair of U-Nets, and their outputs go through the DFT layer, which removes the effect of translation. The log-polar transform remaps the DFT output to log-polar coordinates so that rotation and scale changes appear as shifts along columns and rows. Phase correlation then estimates the rotation and scale, with a relative power spectrum as its output. Finally, we supervise the U-Nets using a cross-entropy loss between the estimated rotation and scale and their ground truth.
Phase 2: The translation of the image is trained and estimated in this phase. The source image of the pair is rotated and scaled by the result of phase 1, and the new image pair (the original template image and the new source image) goes through a new pair of encoder-decoder networks. The translation is then estimated by phase correlation, and these networks are supervised using a cross-entropy loss and an L1 loss between the estimated transformation and its ground truth.
A.2 Experimental Setup
Hardware: All models of the full method and of the comparison methods were trained on a single server with an Intel i9-9900X CPU @ 3.5GHz x 20 and four RTX 2080 Ti GPUs. The validation experiments were all conducted on a single desktop computer (AMD Ryzen 3700X @ 3.7GHz x 8) with an RTX 2060 Super GPU. The Aero-Ground Dataset was recorded by a ground robot and a DJI M100 drone. The ground robot is equipped with three 32-beam Velodyne LiDARs, four pairs of Flir stereo cameras and one RTK dGPS provided by QX. The DJI drone records video with a pair of Intel Realsense D435i stereo cameras, one facing forward and one facing downward.
Software: The birds-eye global map from the drone used in training and validation is constructed with the fully licensed software Metashape from video clips in the Aero-Ground Dataset. The local maps from the ground robot, in both LiDAR intensity style and stereo style, are constructed with Elevation Map. All of the maps above, as well as the satellite map obtained from Google Maps, have a resolution of 0.1 MPP.
Appendix B Additional Demonstration
Demonstrations comparing DPCN with conventional phase correlation on the real-world Aero-Ground Dataset are shown in Figure 9 and Figure 10. They show that, when matching heterogeneous images from different sensors, the trainable feature extractors in DPCN play an important role in outperforming conventional phase correlation. However, the core of DPCN is the differentiable DFT, LPT and phase correlation, which back-propagate the losses and thereby train the feature extractors. The figures also show that, by learning features through end-to-end pose supervision, the approach is capable of filling holes, rejecting noise and assimilating different styles in order to estimate the 4-DoF relative pose.
B.1 DPCN Results
B.2 Conventional Phase Correlation Results
Appendix C Elaboration on Estimation
The estimation thresholds used in the experiments of Section 3 are elaborated in this section by means of graphs. Figure 11 covers translation estimation on the simulation dataset, Figure 12 covers translation estimation on the Aero-Ground Dataset, Figure 13 covers translation estimation on the simulation dataset, and Table 4 shows the exact errors of the generalization experiments on the Aero-Ground Dataset.
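| Model | MSE (x) | Acc. (x, %) | MSE (y) | Acc. (y, %) | MSE (rotation) | Acc. (rotation, %) | MSE (scale) | Acc. (scale, %) |
|---|---|---|---|---|---|---|---|---|
| DPCN in (a) | 232.4638 | 73.2 | 29.9253 | 92.4 | 89.0943 | 95.7 | 0.0084 | 95.0 |
| DPCN in (b) | 31.2071 | 92.2 | 138.5449 | 88.5 | 2.8793 | 96.3 | 0.0153 | 93.3 |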