Robust image alignment under challenging conditions is a core capability for safe autonomous navigation of robots in unknown environments. This paper focuses on the Inverse Compositional Lucas-Kanade (ICLK) algorithm [lucas1981iterative, baker2004lucas], which optimizes the alignment between two images on SE(3) by utilizing dense or sparse depth information. One advantage of this approach over feature-based alignment is that a costly outlier rejection step can be avoided. Applications of six-DoF ICLK include rigid stereo extrinsics refinement after shocks, visual-inertial odometry systems [Forster2017], and non-rigid stereo pair tracking [Hinzmann2019]. In this paper, the 6DoF-ICLK is combined with deep learning (DL) to make parameters in the framework trainable and the image alignment robust against, e.g., challenging light conditions.
2 Related Work
Image alignment approaches can be divided into feature-based and direct methods. Feature-based methods have come a long way from computationally expensive hand-crafted feature detectors and descriptors (SIFT [Lowe:IJCV2004], SURF [Bay2006]) to faster binary variants (BRISK [leutenegger2011brisk]) and, recently, trainable approaches (SuperPoint [detone2017superpoint], D2-Net [Dusmanu_2019_CVPR]). Likewise, direct methods have seen many variants, relying on dense (DTAM [Newcombe2011]) or sparse depth information (SVO [Forster2017]), Mutual-Information-based Lucas-Kanade tracking [DowsonB08], and, more recently, DL variants of ICLK [Lv19cvpr]. In this context, we propose SD-6DoF-ICLK, a learning-based sparse Inverse Compositional Lucas-Kanade (ICLK) algorithm for image alignment on SE(3) using sparse depth estimates. The implemented SD-6DoF-ICLK algorithm is optimized for GPU operations to speed up batch-wise training. The output of the framework consists of sparse feature pairs in both images and the relative pose connecting the two cameras. The feature locations and the estimated relative pose can be further refined using an individual feature alignment step with subsequent pose and optional structure refinement, as proposed in [Forster2017]. For depth map generation, the estimated relative pose can be fed to classical rectification and depth estimation modules, or to DL-based depth from multi-view stereo algorithms.
Our proposed SD-6DoF-ICLK framework is depicted in Fig. 1: The input is a grayscale or color reference image with sparse features, and a depth estimate is assumed to be known for every sparse feature. The objective is to find the relative 6DoF pose that best aligns the reference image to the current image, which may also be grayscale or color but contains neither depth information nor extracted features.
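The warp parameters optimized by the framework live on SE(3): a 6-vector twist (translational and rotational part) is mapped to a rigid-body transform via the exponential map. The following NumPy sketch of this standard mapping is illustrative only; the function name `se3_exp` and the (v, w) ordering of the twist are assumptions, not the paper's notation.

```python
import numpy as np

def se3_exp(xi):
    """Map a 6-vector twist xi = (v, w) to a 4x4 SE(3) matrix (exponential map).

    v: translational part, w: rotational part (axis-angle). Uses the
    closed-form Rodrigues formula and the associated V matrix for v.
    """
    v, w = xi[:3], xi[3:]
    th = np.linalg.norm(w)
    Wx = np.array([[0.0, -w[2], w[1]],
                   [w[2], 0.0, -w[0]],
                   [-w[1], w[0], 0.0]])
    if th < 1e-9:
        # First-order expansion for very small rotations.
        R, V = np.eye(3) + Wx, np.eye(3)
    else:
        A = np.sin(th) / th
        B = (1.0 - np.cos(th)) / th**2
        C = (th - np.sin(th)) / th**3
        R = np.eye(3) + A * Wx + B * (Wx @ Wx)   # Rodrigues rotation
        V = np.eye(3) + B * Wx + C * (Wx @ Wx)   # maps v to the translation
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = V @ v
    return T
```

A pure translation twist yields an identity rotation, and a twist with only a rotational part yields a rotation about that axis, which makes the parameterization easy to sanity-check.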
To align the images on SE(3) with the described input data, we adapt [Lv19cvpr] to sparse depth information as follows: The input images are color-normalized and fed as single views to the feature encoder described in [Lv19cvpr]. This operation returns four pyramidal images, from the input image resolution at level 0 down to level 3. The sparse image alignment algorithm described in [Forster2017] is designed for CPU operation and explicitly iterates over the pixels of every patch. Instead, we formulate the problem with binary masks to take full advantage of GPU indexing and to allow fast batch-wise training and inference. To achieve this, a binary mask is created at every level by marking a fixed-size patch of pixels surrounding each sparse feature. Similarly, the sparse inverse depth image for the reference image is generated by setting every pixel within a patch to the feature's inverse depth value.
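The mask construction described above can be sketched as follows. This is an illustrative NumPy version (the real implementation is batched PyTorch on the GPU); the function name, patch size, and array layout are assumptions for the example.

```python
import numpy as np

def build_patch_masks(feat_uv, inv_depth, img_shape, patch=8):
    """Build the binary mask and sparse inverse-depth image for one level.

    feat_uv:   (N, 2) integer pixel coordinates (u, v) of the sparse features.
    inv_depth: (N,) inverse depth estimate per feature.
    Returns (mask, inv_depth_img), both of shape img_shape.
    """
    H, W = img_shape
    mask = np.zeros((H, W), dtype=bool)
    inv_depth_img = np.zeros((H, W), dtype=np.float32)
    r = patch // 2
    for (u, v), idz in zip(feat_uv, inv_depth):
        # Mark the patch around the feature, clipped to the image bounds.
        v0, v1 = max(v - r, 0), min(v + r, H)
        u0, u1 = max(u - r, 0), min(u + r, W)
        mask[v0:v1, u0:u1] = True
        inv_depth_img[v0:v1, u0:u1] = idz
    return mask, inv_depth_img
```

With such masks, the per-patch pixel loop of [Forster2017] becomes a single masked tensor operation, which is what enables fast batch-wise training and inference.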
Fig. 1 shows the pseudo-code of the inverse compositional algorithm that, starting from the highest level, iterates over all pyramidal images. For every pyramidal image, the warp parameters are optimized using the Levenberg-Marquardt update step [marquardt:1963, Lv19cvpr]:

$$\Delta\boldsymbol{\xi} = \left(\mathbf{J}^\top \mathbf{W} \mathbf{J} + \lambda\,\operatorname{diag}\!\left(\mathbf{J}^\top \mathbf{W} \mathbf{J}\right)\right)^{-1} \mathbf{J}^\top \mathbf{W}\, \mathbf{r},$$

where $\mathbf{J}$ and $\mathbf{r}$ denote the Jacobian and residual after applying the binary mask of the corresponding pyramidal layer. The convolutional M-estimator proposed in [Lv19cvpr] is used to learn the weights in $\mathbf{W}$. The damping term $\lambda$ is set to a fixed value throughout the experiments.
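A minimal NumPy sketch of this damped update step, assuming the Jacobian and residuals have already been masked and flattened (the sign of the increment depends on the residual convention, which the sketch leaves to the caller):

```python
import numpy as np

def lm_update(J, r, w, lam=1e-4):
    """One Levenberg-Marquardt step on the masked residuals.

    J:   (M, 6) Jacobian rows for the M masked pixels.
    r:   (M,) residuals.
    w:   (M,) per-pixel weights from the (learned) M-estimator.
    lam: damping term applied to the diagonal of the approximate Hessian.
    Returns the 6-vector increment of the warp parameters.
    """
    JtW = J.T * w                       # (6, M): J^T W with diagonal W
    H = JtW @ J                         # Gauss-Newton Hessian approximation
    H += lam * np.diag(np.diag(H))      # LM damping on the diagonal
    return np.linalg.solve(H, JtW @ r)  # (J^T W J + lam diag(.))^{-1} J^T W r
```

In the inverse compositional scheme, the resulting increment is composed with (the inverse of) the current warp rather than added to it, which is what allows the Jacobian to be precomputed per level.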
As shown in Fig. 1, the framework continues with a feature alignment step and bundle adjustment step to achieve subpixel accuracy [Forster2017].
4 Experiments and Results
The SD-6DoF-ICLK algorithm and the feature alignment step are implemented in PyTorch [NEURIPS2019_9015] and adapted from [Lv19cvpr, Forster2017]. The pose is then refined with GTSAM [dellaert2012factor], using a Cauchy loss to reduce the influence of outliers and Levenberg-Marquardt for optimization.
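The effect of the Cauchy loss is easiest to see through its influence function: residuals are re-weighted so that outliers contribute little to the pose refinement. A small illustrative sketch (the function name and scale parameter `c` are assumptions; GTSAM provides this as a built-in robust noise model):

```python
import numpy as np

def cauchy_weight(r, c=1.0):
    """IRLS weight of the Cauchy M-estimator: rho'(r)/r = 1 / (1 + (r/c)^2).

    Small residuals get a weight near one (behave like least squares);
    large residuals (outliers) get a weight near zero and are de-weighted.
    """
    return 1.0 / (1.0 + (np.asarray(r, dtype=float) / c) ** 2)
```

Compared with a plain quadratic loss, whose weight is constant, this bounded influence is what makes the subsequent pose optimization robust to mismatched feature pairs.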
To generate a large amount of training, validation, and test data, we implemented a shader program in OpenGL [woo1999opengl] that takes an orthomosaic and elevation map as input and renders an RGB and depth image for a given geo-referenced pose, camera intrinsics matrix, and distortion parameters [Hinzmann2020_loc]. No distortion parameters are set in this paper, as the ICLK algorithm expects undistorted images as input. Camera positions and orientations of pose pairs are uniformly sampled from locations above the orthomosaic and rejected if the camera's field of view faces areas outside the map. Training and test data are rendered from two different satellite orthomosaics selected from nearby but non-overlapping locations. The validation set (a 20% split of the data dedicated to training and validation) is drawn from the shuffled training data. The OpenGL renderer's task is to augment the training and test data geometrically. Appearance variations are generated using PyTorch's built-in color-jitter functionality, which randomly sets brightness, contrast, saturation, and hue to emulate challenging light conditions such as over-exposure or lens flares. This augmentation creates new, color-altered images for every original image. Sparse feature locations are drawn from a uniform distribution such that every patch, including a border margin, lies fully inside the image, given the patch width and the image dimensions. The same number of sparse features is used for the training, validation, and test sets.
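The border-respecting uniform sampling can be sketched as follows. This is an illustrative NumPy version; the function name and parameter names (`border`, `patch_w`) are assumptions, and the exact constraint in the paper may differ in how the margin is composed.

```python
import numpy as np

def sample_features(num, img_w, img_h, border, patch_w, rng=None):
    """Uniformly sample feature locations whose patch stays inside the image.

    Valid u lies in [border + patch_w/2, img_w - border - patch_w/2),
    and analogously for v, so that no patch touches the image boundary.
    Returns an (num, 2) array of integer (u, v) coordinates.
    """
    rng = rng if rng is not None else np.random.default_rng()
    lo = border + patch_w // 2
    u = rng.integers(lo, img_w - lo, size=num)
    v = rng.integers(lo, img_h - lo, size=num)
    return np.stack([u, v], axis=1)
```

Keeping every patch fully inside the image avoids special-casing clipped patches in the masked GPU formulation.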
Fig. 3 presents the results of a training, validation, and test session. Training parameters are listed in Tab. 3. Analogous to [Lv19cvpr], the training loss is the 3D End-Point-Error (EPE) computed with the rendered dense depth map. After the initial epochs, the training loss (red) drops below the validation loss (green) and continues to decrease. Because the appearance of the images is randomly changed, the validation loss may fall below the training loss, as seen here in the initial set of epochs. On every fifth epoch, the test set is evaluated. The loss of the test set given the initial, incorrect relative transformation is illustrated by the black dashed line for reference. The solid black line is the test set evaluated with the classical 6DoF-ICLK without learning (same maximum number of iterations); as expected, it returns roughly the same result independent of the epoch. Evaluating the test set with SD-6DoF-ICLK demonstrates that the classical 6DoF-ICLK is clearly outperformed. Tab. 1 shows the pixel (Euclidean distance), translational, and rotational errors, which decrease continuously over the subsequent image alignment steps. These results are visualized in Fig. 4: The first row shows the current image with the features projected from the reference image based on the current estimate of the relative transformation. The red lines illustrate the error with respect to the ground-truth feature locations. Given the estimated relative transformation, the reference image can be overlaid on the current image, which is shown in the second row.
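The 3D End-Point-Error used as the training loss can be sketched as the mean distance between scene points transformed by the estimated and by the ground-truth relative pose. The following NumPy version is illustrative (function name and the 4x4 homogeneous-matrix interface are assumptions):

```python
import numpy as np

def epe_3d(points_ref, T_est, T_gt):
    """3D End-Point-Error between an estimated and a ground-truth pose.

    points_ref: (N, 3) 3D points in the reference frame (from the dense
                rendered depth map). T_est, T_gt: 4x4 homogeneous transforms.
    Returns the mean Euclidean distance between the two transformed clouds.
    """
    P = np.hstack([points_ref, np.ones((len(points_ref), 1))])  # (N, 4)
    diff = (P @ T_est.T)[:, :3] - (P @ T_gt.T)[:, :3]
    return np.linalg.norm(diff, axis=1).mean()
```

Because the error is measured in 3D after transformation, it penalizes both translational and rotational deviations of the estimated pose in a single scalar.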
[Tab. 1: errors after each stage — Initial | SD-6DoF-ICLK | Feature Alignment | Pose Optimization]
The inference time of SD-6DoF-ICLK is measured on a GeForce RTX 2080 Ti (12GB) for the full input image resolution, four pyramidal layers, and a fixed maximum number of iterations of the incremental optimization. Note that this image resolution was selected for training and forward inference, as it represents the target camera resolution on our UAV. A smaller resolution, however, could be used to speed up the training process and to increase the batch size if desired.
This work introduced SD-6DoF-ICLK, a learning-based sparse Inverse Compositional Lucas-Kanade (ICLK) algorithm that enables robust image alignment on SE(3) with sparse depth information as input and is optimized for GPU operations. On a synthetic dataset rendered with OpenGL, SD-6DoF-ICLK outperforms the classical sparse 6DoF-ICLK algorithm by a large margin, making the proposed algorithm an ideal choice for robust image alignment. The proposed SD-6DoF-ICLK performs fast inference on a GeForce RTX 2080 Ti. To further refine the feature locations and the relative pose, individual feature alignment with subsequent pose and, if required, structure refinement is applied as proposed in [Forster2017]. In future work, the framework could be embedded into an odometry or mapping framework, or used for depth image generation.
The authors thank Geodaten © swisstopo for access to the satellite imagery.