SD-6DoF-ICLK: Sparse and Deep Inverse Compositional Lucas-Kanade Algorithm on SE(3)

by Timo Hinzmann et al.

This paper introduces SD-6DoF-ICLK, a learning-based Inverse Compositional Lucas-Kanade (ICLK) pipeline that uses sparse depth information to optimize the relative pose that best aligns two images on SE(3). To compute this six Degrees-of-Freedom (DoF) relative transformation, the proposed formulation requires only sparse depth information in one of the images, which is often the only available depth source in visual-inertial odometry or Simultaneous Localization and Mapping (SLAM) pipelines. In an optional subsequent step, the framework further refines feature locations and the relative pose using individual feature alignment and bundle adjustment for pose and structure re-alignment. The resulting sparse point correspondences with subpixel-accuracy and refined relative pose can be used for depth map generation, or the image alignment module can be embedded in an odometry or mapping framework. Experiments with rendered imagery show that the forward SD-6DoF-ICLK runs at 145 ms per image pair with a resolution of 752 x 480 pixels each, and vastly outperforms the classical, sparse 6DoF-ICLK algorithm, making it the ideal framework for robust image alignment under severe conditions.






1 Introduction

Robust image alignment under challenging conditions is an important core capability towards safe autonomous navigation of robots in unknown environments. This paper focuses primarily on the ICLK [lucas1981iterative, baker2004lucas] algorithm, which optimizes the alignment between two images on SE(3) by utilizing dense or sparse depth information. One advantage of this approach over feature-based alignment is that a costly outlier rejection step can be avoided. Applications of six-DoF ICLK include rigid stereo extrinsics refinement after shocks, visual-inertial odometry systems [Forster2017], or non-rigid stereo pair tracking [Hinzmann2019]. In this paper, the 6DoF-ICLK is augmented with deep learning (DL) to make parameters in the framework trainable and to make the image alignment robust against, e.g., challenging light conditions.

2 Related Work

Image alignment approaches can be divided into feature-based and direct methods. Feature-based methods have come a long way from computationally expensive hand-crafted feature detectors and descriptors (SIFT [Lowe:IJCV2004], SURF [Bay2006]) to faster binary variants (BRISK [leutenegger2011brisk]) and, recently, trainable approaches (SuperPoint [detone2017superpoint], D2-Net [Dusmanu_2019_CVPR]). Likewise, direct methods have seen many variants and may rely on dense (DTAM [Newcombe2011]) or sparse depth information (SVO [Forster2017]), Mutual-Information-based Lucas-Kanade tracking [DowsonB08], and, more recently, DL variants of ICLK [Lv19cvpr]. In this context, we propose SD-6DoF-ICLK, a learning-based sparse Inverse Compositional Lucas-Kanade (ICLK) algorithm for image alignment on SE(3) using sparse depth estimates. The implemented SD-6DoF-ICLK algorithm is optimized for GPU operations to speed up batch-wise training. The output of the framework consists of sparse feature pairs in both images and the relative pose connecting the two cameras. The feature locations and the estimated relative pose can be further refined using an individual feature alignment step with subsequent pose and optional structure refinement, as proposed in [Forster2017]. For depth map generation, the estimated relative pose can be fed to classical rectification and depth estimation modules, or to DL-based depth-from-multi-view-stereo algorithms.

3 Methodology

Our proposed SD-6DoF-ICLK framework is depicted in Fig. 1: The input is a grayscale or colored reference image with sparse features. For every sparse feature, a depth estimate is assumed to be known. The objective is to find the relative 6DoF pose that best aligns the reference image to the second image, which may also be grayscale or colored but contains no depth information or extracted features.

Figure 1: Overview of SD-6DoF-ICLK with feature, pose, and structure refinement.

To align the images on SE(3) with the described input data, we adapt [Lv19cvpr] to sparse depth information as follows: The two input images are color normalized and fed as single views to the feature encoder described in [Lv19cvpr]. This operation returns four pyramidal images, from level 0 at the input image resolution down to level 3. The sparse image alignment algorithm described in [Forster2017] is designed for CPU operations and explicitly iterates over the pixels of every patch. Instead, we formulate the problem with binary masks to exploit the full advantage of indexing on the GPU and to allow fast batch-wise training and inference. To achieve this, binary masks are created at every level for a fixed-size patch surrounding each sparse feature. Similarly, the sparse inverse depth image for the reference image is generated by setting every pixel within a patch to the feature's inverse depth value.
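As an illustration, the mask construction can be sketched as follows. This is a minimal numpy version with a hypothetical `patch_mask` helper and an assumed patch radius; the actual implementation operates on batched PyTorch tensors on the GPU:

```python
import numpy as np

def patch_mask(features, height, width, radius=3):
    """Boolean mask set to True inside a (2*radius+1)^2 window around each
    sparse feature; multiplying image tensors by this mask selects exactly
    the patch pixels without explicit per-patch loops."""
    mask = np.zeros((height, width), dtype=bool)
    for row, col in features:
        r0, r1 = max(row - radius, 0), min(row + radius + 1, height)
        c0, c1 = max(col - radius, 0), min(col + radius + 1, width)
        mask[r0:r1, c0:c1] = True
    return mask
```

The same mask, broadcast over the batch dimension, also produces the sparse inverse depth image by filling each patch with the feature's inverse depth.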

Fig. 1 shows the pseudo-code of the inverse compositional algorithm that, starting from the highest level, iterates over all pyramidal images. For every pyramidal image, the warp parameters are optimized using the Levenberg-Marquardt update step [marquardt:1963, Lv19cvpr]:

Δξ = (Jᵀ W J + λ diag(Jᵀ W J))⁻¹ Jᵀ W r,

where J and r denote the Jacobian and residual after applying the binary mask of the corresponding pyramidal layer. The convolutional M-Estimator proposed in [Lv19cvpr] is used to learn the weights in W. The damping term λ is set to a fixed value throughout the experiments.
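A minimal sketch of this update step, assuming a masked Jacobian of shape (N, 6), masked residuals of length N, and per-residual M-estimator weights. This is a numpy stand-in for the batched PyTorch implementation, and `lm_step` is a hypothetical name:

```python
import numpy as np

def lm_step(J, r, w, lam=1e-6):
    """One damped Gauss-Newton (Levenberg-Marquardt) update of the 6-DoF
    twist. J: (N, 6) masked Jacobian, r: (N,) masked residuals,
    w: (N,) per-residual weights from the (convolutional) M-estimator."""
    JtW = J.T * w                      # (6, N), rows scaled by weights
    H = JtW @ J                        # weighted Gauss-Newton Hessian, (6, 6)
    H = H + lam * np.diag(np.diag(H))  # Marquardt damping of the diagonal
    g = JtW @ r                        # weighted gradient term, (6,)
    return np.linalg.solve(H, g)       # twist increment
```

In the inverse compositional formulation, the resulting increment is inverted and composed onto the current warp, so J only has to be evaluated once per pyramid level.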

As shown in Fig. 1, the framework continues with a feature alignment step and bundle adjustment step to achieve subpixel accuracy [Forster2017].

4 Experiments and Results


The SD-6DoF-ICLK algorithm and the feature alignment step are implemented in PyTorch [NEURIPS2019_9015] and adapted from [Lv19cvpr, Forster2017]. The pose is then refined with GTSAM [dellaert2012factor] using a Cauchy loss to reduce the influence of outliers and Levenberg-Marquardt for optimization.
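The robust-loss idea can be illustrated in isolation: a Cauchy M-estimator assigns each residual a weight that decays with its squared magnitude, so outliers barely influence the refined pose. A minimal sketch of the generic weight function (not GTSAM's internal implementation; the scale `c` is an assumed parameter):

```python
import numpy as np

def cauchy_weights(residuals, c=1.0):
    """Cauchy M-estimator weights w(r) = 1 / (1 + (r/c)^2). Residuals much
    larger than the scale c receive weights near zero, so gross outliers
    contribute almost nothing to the weighted least-squares refinement."""
    return 1.0 / (1.0 + (residuals / c) ** 2)
```

Inside an iteratively reweighted least-squares loop, these weights play the same role as W in the Levenberg-Marquardt update above.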


To generate a large amount of training, validation, and test data, we implemented a shader program in OpenGL [woo1999opengl] that takes an orthomosaic and elevation map as input and renders an RGB and depth image given a geo-referenced pose, camera intrinsics matrix, and distortion parameters [Hinzmann2020_loc]. No distortion parameters are set in this paper, as the ICLK algorithm expects undistorted images as input. Camera positions and orientations of pose pairs are uniformly sampled from locations above the orthomosaic and rejected if the camera's field of view faces areas outside the map. Training and test data are rendered from two different satellite orthomosaics selected from nearby but non-overlapping locations. The validation set is drawn from the shuffled training data; the validation split is set to 20% of the total set dedicated to training and validation. The OpenGL renderer's task is to augment the training and test data geometrically. Appearance variations are generated using PyTorch's built-in color-jitter functionality, which randomly sets brightness, contrast, saturation, and hue. This emulates challenging light conditions such as over-exposure or lens flares. The augmentation creates new, color-altered images for every original image.

Sparse feature locations (u, v) are drawn randomly from a uniform distribution such that

u ∈ [b + p/2, W − b − p/2],  v ∈ [b + p/2, H − b − p/2],

with border b, patch width p, image width W, and image height H, so that every patch lies fully inside the image. The same number of sparse features is used for the training, validation, and test sets.
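The border-constrained sampling can be sketched as follows, as a numpy illustration with assumed border and patch values; `sample_features` is a hypothetical helper:

```python
import numpy as np

def sample_features(n, width, height, border=20, patch=8, seed=0):
    """Uniformly sample n feature locations (u, v) whose surrounding patch,
    plus a safety border, lies fully inside a width x height image."""
    rng = np.random.default_rng(seed)
    lo = border + patch // 2                       # lower bound on u and v
    u = rng.integers(lo, width - lo, size=n)       # columns
    v = rng.integers(lo, height - lo, size=n)      # rows
    return np.stack([u, v], axis=1)
```

Rejecting locations too close to the image edge guarantees that every binary patch mask is complete, which keeps the masked Jacobian and residual shapes identical for all features.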

Figure 2: Training, validation and test loss (3D End-Point-Error EPE) over epochs.
Param                        Value
Batch size (max. memory)
Epochs
Num. images train/val
Validation split             20%
Num. images test
Input image resolution       752 x 480
Max. iterations              10
Optimizer                    SGD
Initial learning rate
Momentum
Nesterov                     False
Weight decay
Learning rate decay epochs
Learning rate decay ratio
Validation frequency
Test frequency               every 5th epoch
Table 2: Training, validation, and test parameters used for the session visualized in Fig. 2.


Fig. 2 presents the results from a training, validation, and test session. Training parameters are listed in Tab. 2. Analogous to [Lv19cvpr], the training loss is the 3D End-Point-Error (EPE) computed using the rendered dense depth map. After the initial epochs, the training loss (red) falls below the validation loss (green) and continues to decrease. As the appearance of the images is randomly changed, the validation loss may fall below the training loss, as seen here in the initial set of epochs. On every fifth epoch, the test set is evaluated. The loss of the test set given the initial, incorrect relative transformation is illustrated by the black dashed line for reference. The solid black line is the test set evaluated for the classical 6DoF-ICLK without learning (also a maximum of 10 iterations); as expected, it returns roughly the same result independent of the epoch. Evaluating the test set with SD-6DoF-ICLK demonstrates that the classical 6DoF-ICLK is clearly outperformed. Tab. 1 shows the pixel (Euclidean distance), translational, and rotational errors, which decrease continuously over the subsequent image alignment steps. These results are visualized in Fig. 4: the first row shows the current image with the features projected from the reference image based on the current estimate of the relative transformation. The red lines illustrate the error with respect to the ground-truth feature locations. Given the estimated relative transformation, the reference image can be overlaid on the current image, as shown in the second row.

Error                       Initial   SD-6DoF-ICLK   Feature Alignment   Pose Optimization
Pixel (Euclidean distance)   33.986          1.286               0.413               0.121
Translational                 4.934          3.257               3.257               0.089
Rotational                    0.075          0.020               0.020               0.000
Table 1: Pixel, translational, and rotational error for the subsequent image alignment steps.
Figure 4: Qualitative results of the image alignment framework.
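For reference, the 3D End-Point-Error used as the training loss can be sketched as follows, a minimal numpy version under the assumption that the error is the mean distance between scene points mapped by the estimated and ground-truth SE(3) transforms:

```python
import numpy as np

def epe_3d(T_est, T_gt, points):
    """3D End-Point-Error: mean Euclidean distance between points mapped
    by the estimated and ground-truth SE(3) transforms.
    T_est, T_gt: (4, 4) homogeneous transforms; points: (N, 3)."""
    ph = np.hstack([points, np.ones((points.shape[0], 1))])  # homogeneous
    diff = (T_est @ ph.T - T_gt @ ph.T).T[:, :3]             # per-point error
    return np.linalg.norm(diff, axis=1).mean()
```

In the paper's setting, the points come from the rendered dense depth map, so the loss penalizes pose errors in proportion to their effect on the reconstructed geometry.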


The inference time of SD-6DoF-ICLK is 145 ms per image pair on a GeForce RTX 2080 Ti (12GB) for an image resolution of 752 x 480 pixels, four pyramidal layers, and a maximum of 10 iterations of the incremental optimization. Note that the image resolution of 752 x 480 pixels was selected for training and forward inference, as it represents the target camera resolution on our UAV. A smaller resolution, however, could be used to speed up the training process and to increase the batch size if desired.

5 Conclusion

This work introduced SD-6DoF-ICLK, a learning-based sparse Inverse Compositional Lucas-Kanade (ICLK) algorithm that enables robust image alignment on SE(3) with sparse depth information as input and is optimized for GPU operations. A synthetic dataset rendered with OpenGL shows that SD-6DoF-ICLK outperforms the classical sparse 6DoF-ICLK algorithm by a large margin, making the proposed algorithm an ideal choice for robust image alignment. The proposed SD-6DoF-ICLK performs inference in 145 ms on a GeForce RTX 2080 Ti with input images at a resolution of 752 x 480 pixels. To further refine the feature locations and relative pose, individual feature alignment with subsequent pose and, if required, structure refinement is applied as proposed in [Forster2017]. In future work, the framework could be embedded into an odometry or mapping framework, or used for depth image generation.


The authors thank Geodaten © swisstopo for access to the satellite imagery.