A Generative Map for Image-based Camera Localization

02/18/2019 ∙ by Mingpan Guo, et al. ∙ fortiss, Technische Universität München

In image-based camera localization systems, information about the environment is usually stored in some representation, which can be referred to as a map. Conventionally, most map representations are built upon hand-crafted features. Recently, neural networks have attracted attention as a data-driven map representation, and have shown promising results in visual localization. However, these neural network maps are generally unreadable and hard to interpret. A readable map is not only accessible to humans, but also provides a way to be verified when the ground truth pose is unavailable. To tackle this problem, we propose Generative Map, a new framework for learning human-readable neural network maps. Our framework can be used for localization as previous learning maps, and also allows us to inspect the map by querying images from specified viewpoints of interest. We combine a generative model with the Kalman filter, which exploits the sequential structure of the localization problem. This also allows our approach to naturally incorporate additional sensor information and a transition model of the system. For evaluation we use real world images from the 7-Scenes dataset. We show that our approach can be used for localization tasks. For readability, we demonstrate that our Generative Map can be queried with poses from the test sequence to generate images, which closely resemble the true images.




1 Introduction

Image-based localization is an important task for many computer vision applications such as autonomous driving, indoor navigation, and augmented or virtual reality. In these applications the environment is usually represented by a map, whereby the approaches differ considerably in the way the map is structured. In classical approaches, human-designed features are extracted from images and stored in a map together with their geometric relations. The same features can then be compared with the recorded ones to determine the camera pose relative to the map. Typical examples of such features include local point-like features [20, 24], image patches [1, 6], and objects [27].

However, these approaches may ignore useful information which is not captured by the employed features. This becomes more problematic if there are not enough rich textures to be extracted from the environment. Furthermore, these approaches typically rely on prescribed structures like point clouds or grids, which are inflexible and grow with the scale of the environment.

Recently, deep neural networks (DNNs) have been considered for the direct prediction of 6-DoF camera poses from images [17, 23, 4, 33, 3]. In this context, Brahmbhatt et al. [3] proposed to treat a neural network as a form of map representation, i.e., an abstract summary of input data, which can be queried to obtain camera poses. The DNN is trained to establish a relationship between images and corresponding poses. At test time, it can be queried for a pose given an input image from that viewpoint. While the performance of these DNN map approaches has significantly improved [17, 16, 3] and is getting close to hybrid approaches, e.g. [2], these maps are typically unreadable for humans.

To solve this problem, we propose a new framework for learning a DNN map, which not only can be used for localization, but also allows queries from the other direction, i.e., given a camera pose, what should the scene look like? We achieve this via a combination of the generative model of Variational Auto-Encoders (VAEs) [19] with a new training objective that is appropriate for this task, and the classic Kalman filter [14]. This makes the map human readable, and hence easier to interpret and verify.

Most research on image generation [31, 29, 10, 13, 30, 9] are either based on VAEs [19], or Generative Adversarial Networks [8]. In our work, we take the VAE approach, due to its capability to infer latent variables from input images. On the other hand, our model relies on the Kalman filter for connecting the sequence with a neural network as the observation model. This also enables our framework to integrate the transition model of the system, and other sources of sensor information, if they are available.

The main contribution of this work can be summarized as follows:

  • Prior works on DNN maps [17, 23, 4, 33, 3] learn the map representation by directly regressing the 6-DoF camera poses from images. In this work, we approach this problem from the other direction via the generative model of the VAE, i.e., by learning the mapping from poses to images. To maintain discriminability, we derive a new training objective, which allows the model to learn pose-specific distributions, instead of a single distribution for the entire dataset as in traditional VAEs [19, 29, 31, 34, 13, 30]. Our map is thus more interpretable, and can be used for querying an image from a particular viewpoint.

  • Image generation models cannot directly produce poses for localization. To solve this, we exploit the sequential structure of the localization problem, and propose a framework to estimate the poses with a Kalman filter [14], where a neural network is used as the observation model of the filtering process. We show that this estimation framework works even with a simple constant transition model, and can be further improved if an accurate transition model is available. In addition, we also show that the same framework can be applied to other DNN map approaches based on direct regression [16], and achieves better performance.

2 Related Works

DNN map for localization

In terms of localization, PoseNet [17] first proposed to directly learn the mapping from images to 6-DoF camera poses. Follow-up works in this direction improved the localization performance by introducing deeper architectures [23], exploiting spatial [4] and temporal [33] structures, and incorporating relative distances between poses in the training objective [3]. Kendall and Cipolla [16] showed that ideas from probabilistic deep learning can be applied, and introduced learnable weights for the translation and rotation error in the loss function, which increased the performance significantly. All of these approaches tackle the learning problem via direct regression of camera poses from images, and focus on improving the accuracy of localization. Instead, we propose to learn the generative process from poses to images. Our focus is to make the DNN map human readable, by providing the capability to query the view of a specific pose.

Image Generation

Generative models based on neural networks were originally designed to capture the image distribution [32, 19, 26, 8, 9]. Recent works in this direction succeeded in generating images of increasingly higher quality. However, these models do not establish geometric relationships between viewpoints and images. In terms of conditional generation of images, many approaches have been proposed for different sources of information, e.g. class labels [25] and attributes [31, 29]. For a map in camera localization, our input source is the camera pose. The generative query network [7] can generate images from different poses for different scenes, but only in simulated environments. Instead, we train and evaluate our framework on a real world localization benchmark dataset [28].

VAE-based training objective

Several recent works [29, 31, 34, 13, 30] discuss VAE-based image generation. Most of them assume a single normal distribution as the prior for the latent representation, and regularize the latent variable of each data point to match this prior [29, 31, 34, 13, 36]. Tolstikhin et al. [30] relaxed this constraint by modeling the latent representations of the entire dataset, instead of each single data point, as one single distribution. However, such a setting is still inappropriate in our case, since restricting latent representations from different pose-image pairs to form a single distribution may reduce their discriminability, which can be critical for localization tasks. Several works have also been proposed for sequence learning with VAEs [21, 36, 15] and for sequential control problems [35], which similarly assume a single prior distribution for the latent variables. Instead, we model the distribution of the latent variables conditioned only on the corresponding poses. By assuming this distribution to be Gaussian, we also make our approach naturally compatible with a Kalman filter. This is explained further in Sections 3.2 and 3.3.

3 Proposed Approach

Figure 1: Proposed architecture of Generative Map, where relevant networks for estimation and generation are shown as trapezoids.

In this paper, we propose a new framework for learning a DNN-based map representation, by learning a generative model. Figure 1 shows our overall framework, which is described in detail in Section 3.1. Our objective function is based on the lower-bound of the conditional log-likelihood of images given poses. In Section 3.2 we derive this objective for training the entire framework with a single loss. Section 3.3 introduces the pose estimation process for our framework. The sequential estimator based on the Kalman filter is crucial for the localization task in our model, and allows us to incorporate the transition model of the system in a principled way. We also show in Section 3.4 how the proposed estimator can be used to increase the performance of previous DNN map approaches as well.

In this work, we denote images by x, poses by s, and latent variables by z. We assume the generative process s → z → x, and follow [19] in using p and q for the generative and inference models, respectively; in particular, q_φ(z|x) denotes the inference model, while p_ψ(z|s) and p_θ(x|z) denote the generative models.

3.1 Framework

Our framework consists of three neural networks, the image encoder q_φ(z|x), the pose encoder p_ψ(z|s), and the image decoder p_θ(x|z), as shown in Figure 1. During training, all three networks are trained jointly with a single objective function, described in Section 3.2. Once trained, different networks are used depending on the task to perform, i.e., pose estimation or image (video) generation. Generating images involves the pose encoder p_ψ(z|s) and the image decoder p_θ(x|z), while pose estimation requires the pose encoder p_ψ(z|s) and the image encoder q_φ(z|x).

For the image encoder, we use a ResNet [11], similar to the one proposed in [3]. For the image decoder, we use a fractional-strided convolution structure, as proposed in DC-GAN [25]. For the pose encoder, we use a common feed-forward neural network. However, there is no restriction on the specific neural network architectures that can be applied in our framework.

3.2 Training Objective

Our objective function is based on the Variational Auto-Encoder, which optimizes the following lower bound of the log-likelihood [19]:

log p(x) ≥ E_{q_φ(z|x)}[ log p_θ(x|z) ] − D_KL( q_φ(z|x) ‖ p(z) ),    (1)

where x represents the data to encode, and z stands for the latent variables that can be inferred from x through q_φ(z|x). The objective can be intuitively interpreted as minimizing a reconstruction error (the expectation term) together with a KL-divergence term for regularization.

To apply VAEs in cases with more than one input data source, e.g. images and poses as in our case, we need to reformulate the above lower bound. We achieve this by optimizing the following lower bound of the conditional log-likelihood,

log p(x|s) ≥ E_{q_φ(z|x)}[ log p_θ(x|z) ] − D_KL( q_φ(z|x) ‖ p_ψ(z|s) ),    (2)

where q_φ(z|x), p_ψ(z|s), and p_θ(x|z) are modeled by DNNs. A detailed derivation can be found in Appendix A. For convenience, we treat the negative right-hand side of Equation (2) as our loss and train our model by minimizing

L(θ, φ, ψ) = −E_{q_φ(z|x)}[ log p_θ(x|z) ] + D_KL( q_φ(z|x) ‖ p_ψ(z|s) ).    (3)
Similar to Equation (1), the first term in our loss function can be seen as a reconstruction error for the image, while the second term serves as a regularizer. Unlike most other extensions of the VAE [19, 31, 34], where the marginal distribution of the latent variable is assumed to be normally distributed, our loss function assumes the distribution of the latent variables to be normal only when conditioning on the corresponding poses or images. We assume that for every pose s, the latent representation follows a normal distribution p_ψ(z|s) with diagonal covariance. Similarly, a normal distribution q_φ(z|x) is assumed for the latent variable conditioned on the image x. The KL-divergence term enforces these two distributions to be close to each other.
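For diagonal Gaussians, the KL-divergence term between the image-conditioned and pose-conditioned distributions has a simple closed form. A minimal numpy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dimensions."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Identical distributions have zero divergence; mismatched ones a positive value.
mu, var = np.zeros(4), np.ones(4)
assert np.isclose(kl_diag_gaussians(mu, var, mu, var), 0.0)
```

During training, `mu_q`/`var_q` would come from the image encoder and `mu_p`/`var_p` from the pose encoder, so minimizing this term pulls the two conditional distributions together.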

One fundamental difference between our loss function (3) and previous works in DNN-based visual localization [17, 16, 3] is that a direct mapping from images to poses does not exist in our framework. Hence, we cannot obtain the poses by direct regression. Instead, we treat the network as an observation model and use the Kalman filter [14] to iteratively estimate the correct pose. This is described in detail in Section 3.3. Another important difference is that the generative process from poses to images is modeled by the networks p_ψ(z|s) and p_θ(x|z). This allows us to query the model with a pose of interest, and obtain a generated RGB image which describes how the scene should look at that pose.

3.3 Kalman Filter for Pose Estimation

As mentioned above, the generative model we propose cannot predict poses directly. However, we can still estimate the pose with the trained model using a Kalman filter, as shown in Figure 2. The network q_φ(z|x) is seen as a sensor, which processes an image at each time step and produces an observation vector z_t based on that image x_t. From the pose s_t we can also obtain an expected observation using the pose encoder p_ψ(z|s), which is compared with the observation from the raw image. By assumption, q_φ(z|x) and p_ψ(z|s) are both normally distributed and regularized to resemble each other. In addition, we also model the pose as normally distributed. Therefore, the generative model naturally fits into the estimation process of a Kalman filter.

Figure 2: Update of the Kalman filter for pose estimation. The pose s and the latent variable z are seen as the state and observation, respectively.

In order to close the update loop, we need a transition function. If the ego-motion is unknown, a simple approach is to assume the pose remains constant over time, i.e., s_{t+1} = s_t. In such a case, the Kalman filter introduces no further information about the system itself, but rather a smoothing effect based on previous inputs. On the other hand, if additional control signals or motion constraints are known, a more sophisticated transition model can be devised. In that case, the transition function becomes s_{t+1} = f(s_t, u_t), where u_t is the control signal for the ego-motion, which can be obtained from other sensors. We show in Section 4.1 that we can estimate the pose with a simple constant transition model, and that a significant improvement in localization performance can be achieved if an accurate transition model from s_t to s_{t+1} is available.
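The two transition variants can be written as simple functions. A hedged sketch (treating the control input as an additive increment in the pose parameterization is an assumption for illustration, not the paper's prescription):

```python
import numpy as np

def transition_constant(s):
    """Constant model: the pose is assumed unchanged from one step to the next."""
    return s.copy()

def transition_with_control(s, u):
    """With ego-motion u from another sensor, the predicted pose is shifted.
    Here u is assumed to be an additive increment in the pose parameterization."""
    return s + u

s = np.array([1.0, 2.0, 3.0])
assert np.allclose(transition_constant(s), s)
```

With the constant model, the filter only smooths the sequence of observations; with a control-driven model, it additionally propagates known motion between frames.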

Training Framework            | Kalman Filter
Pose s                        | State
Mean of latent variable z     | Observation
Variance of latent variable z | Diagonal of observation uncertainty matrix R
Image encoder q_φ(z|x)        | Sensor that produces the observation and its uncertainty
Pose encoder p_ψ(z|s)         | Observation model
Table 1: Corresponding relationship between different components of our training framework and the Kalman filter during pose estimation. Note that, instead of sampling from q_φ(z|x), we directly use the mean as the observation for the Kalman filter, which increases stability.

The corresponding relationship between the different components of our training framework and the Kalman filter is summarized in Table 1. The pose estimation update using the Kalman filter consists of a prediction and a correction step, which are explained in the following.

  • Prediction with transition model
    Let us denote the transition function by f, and its first-order derivative w.r.t. the pose by F_t, which can be obtained by the finite difference method. The prediction step of the estimation can then be written as

    s̄_t = f(s_{t−1}),    P̄_t = F_t P_{t−1} F_tᵀ + Q,

    where P stands for the covariance matrix of the pose and Q for the state transition uncertainty. Q needs to be set to higher values if the transition is inaccurate, e.g. if we are using a constant model, and smaller when an accurate transition model is available.

  • Correction with current observation
    We assume the neural observation model can also be written as a function h of the pose, i.e., h(s) is the mean of p_ψ(z|s); its first-order derivative w.r.t. the pose, obtained by the finite difference method, is denoted by H_t. In each time step, our neural sensor model produces a new observation z_t based on the current image x_t, which is then compared with h(s̄_t). The correction step can be written as

    K_t = P̄_t H_tᵀ ( H_t P̄_t H_tᵀ + R_t )⁻¹,
    s_t = s̄_t + K_t ( z_t − h(s̄_t) ),
    P_t = ( I − K_t H_t ) P̄_t,

    where K_t is the Kalman gain, and R_t is the observation uncertainty. In our case, we can directly use the variance of z inferred by the image encoder to build the diagonal matrix R_t.
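The prediction and correction steps above can be sketched in numpy as follows, with Jacobians obtained by finite differences as in the text. All names are illustrative: `f` stands for the transition model and `h` for the (neural) observation model, here passed in as plain callables.

```python
import numpy as np

def jacobian_fd(fn, x, eps=1e-5):
    """Finite-difference Jacobian of fn at x."""
    y0 = fn(x)
    J = np.zeros((y0.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (fn(x + dx) - y0) / eps
    return J

def kalman_step(s, P, z, R, f, h, Q):
    """One prediction + correction update of the (extended) Kalman filter."""
    # Prediction with the transition model f.
    F = jacobian_fd(f, s)
    s_pred = f(s)
    P_pred = F @ P @ F.T + Q
    # Correction with the current observation z via the observation model h.
    H = jacobian_fd(h, s_pred)
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    s_new = s_pred + K @ (z - h(s_pred))
    P_new = (np.eye(s.size) - K @ H) @ P_pred
    return s_new, P_new
```

With a constant transition model (`f` the identity) and repeated observations, the estimate is pulled from a false initialization toward the observed value, mirroring the behavior reported in Section 4.1.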

3.4 Kalman Filter for Direct Regression Models

In this section, we describe how our Kalman filter technique can be applied to previous DNN map works based on direct regression [17, 23, 4, 33, 3]. In particular, we discuss its relationship with the learnable weighting technique introduced by Kendall and Cipolla [16], and introduce a small modification to make the direct regression approach fully compatible with a Kalman filter.

When we perform regression for poses from images, i.e., minimize the pose error, one important question is how to weigh the translational error L_t(t̂, t), where t stands for the Cartesian coordinates, against the rotational error L_q(q̂, q), where q can be the quaternion. PoseNet [17] first identified this problem, and found that introducing a better balance between translational and rotational error improves the overall optimization. They achieved this with a linear combination of the two errors using a weight factor β in the objective [17], i.e., L = L_t + β L_q. Kendall and Cipolla then approached it from a probabilistic perspective, and proposed to model the pose with a Laplacian distribution, assuming a variance for both the translational and the rotational part [16]. Hence, the loss function for regressing the pose can be written as

L = L_t exp(−ŝ_t) + ŝ_t + L_q exp(−ŝ_q) + ŝ_q,

where L_t and L_q are typically set as e.g. the ℓ1-norm, and the log-variances ŝ_t and ŝ_q are trainable parameters of the model, independent of the input data.

In order to fit into the Kalman filter process described in Section 3.3, we can treat the observation model as the identity mapping. In addition, we also modify the assumed distribution for the poses. Specifically, instead of a Laplacian distribution, we assume a normal distribution for the pose, which yields the loss

L = L_t exp(−ŝ_t) + ŝ_t + L_q exp(−ŝ_q) + ŝ_q,

where we use the squared ℓ2-norm for the losses, i.e., L_t = ‖t − t̂‖₂², and for the rotational part we use log-quaternions, following [3].

3.5 Implementation Details

As explained in Section 3.1, we use a ResNet as our image encoder q_φ(z|x). However, directly training a ResNet on 7-Scenes, which contains fewer than 10,000 images per scene, suffers from overfitting. In our experiments, we found that training a ResNet from scratch on such a small dataset converges on the training set, but leads to unusable results on unseen test sequences. Therefore, we follow PoseNet [17] and use a ResNet pre-trained on ImageNet as the initialization of our image encoder. We replace the last layer of the ResNet with a fully connected layer of size 2048 with ReLU activation and a dropout rate of 0.5, followed by a linear mapping to the latent variable z of size 256. Note that unlike previous works [3] that focus on localization only and use ResNet-34, we use ResNet-18, since our trial experiments showed no obvious performance gain from increasing the depth.

For the image decoder p_θ(x|z), we use the decoder architecture proposed in DC-GAN [25]. The initial features in the decoder contain 1024 channels, and are obtained by a linear mapping from the latent variable z. We use 4 fractional-strided convolutional layers for our experiments on 7-Scenes, with a kernel size of 5 in all layers. For the pose encoder, we use a three-layered fully connected network, where the only hidden layer contains 512 units.
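The output resolution of such a decoder follows from the layer count and stride. A small sanity-check sketch, assuming stride-2 layers (as in DC-GAN) and an initial 4×4 feature map; the initial spatial size is our assumption for illustration, since the text only specifies 1024 initial channels, 4 layers, and kernel size 5:

```python
def output_size(initial_size, num_layers, stride=2):
    """Spatial size after a stack of fractional-strided (transposed) convolutions,
    assuming each stride-2 layer exactly doubles the spatial resolution."""
    size = initial_size
    for _ in range(num_layers):
        size *= stride
    return size

# Under the assumed 4x4 starting map, four stride-2 layers yield 64x64 images.
assert output_size(4, 4) == 64
```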

The input images for the image encoder and the generated images are resized to fixed resolutions. We use the Adam optimizer [18] with a learning rate of 0.0001 without decay. For the localization tests in Section 4.1, we train the direct regression model for 500 epochs in each environment. However, for generation, we do not use a fixed number of training epochs, since the training set of each scene in the 7-Scenes dataset has a different size, and the small number of training samples makes the model prone to overfitting. Instead, the model is trained until the final negative lower bound of the log-likelihood, i.e., L from Equation (3), reaches a fixed target interval. Under this setting, the model is exposed to a similar number of images in each scene and does not overfit. For the learnable variance of the reconstruction distribution, we use a fixed initial value for the Generative Map, and follow [3] for the corresponding settings of the direct regression model in Section 3.4.

4 Experiments

We use the 7-Scenes dataset [28] to evaluate our framework, for both the generation and the localization task. The dataset contains video sequences recorded in seven different indoor environments. Each scene provides 2 to 7 sequences for training, with either 500 or 1000 images per sequence. The corresponding ground-truth poses are provided for training and evaluation. For training generative models, prior approaches often rely on large datasets, e.g. CelebA [22] contains more than 200,000 images. The dataset we use is therefore much more challenging, as each scene contains fewer than 10,000 training samples.

4.1 7-Scenes Localization

Correcting false initialization

Figure 3: Localization result on the chess-seq5 test sequence with a constant transition model. The upper plot shows the estimated position, and the lower one the log-quaternions for the orientation. The solid and dashed lines stand for estimated and true values, respectively.

Our framework does not provide a direct mapping from images to poses, but instead relies on an iterative estimation process based on the Kalman filter for localization. For such a process, an initial pose needs to be provided as the starting point. An effective estimator should be able to converge to an accurate value even when provided with a false initialization. Here, we test our framework by feeding false initializations and using a constant transition model, i.e., assuming no further information from the system, so that s_{t+1} = s_t. The estimator is initialized with a deliberately false pose for both the position and the orientation quaternion. The transition uncertainty Q is set as a diagonal matrix with 0.1 for all its diagonal elements. The result is shown in Figure 3.

It can be observed that, although the initial value is incorrect, our model is able to estimate the pose to a reasonable accuracy, based only on the observations obtained from the neural network sensor q_φ(z|x). For example, the first element of the state is initialized to a wrong value, but after 50 to 100 steps the estimator corrects itself to within a small band around the correct value, and stays there for the entire sequence. Note that the first point of the solid lines shows the estimated value after the first observation z_1, hence the curves do not start directly at the initialization value.

Incorporating transition model

Scene | Generative Map (constant model / accurate model) | PoseNet [16] (ResNet-18), our implementation with a Gaussian training loss as in Section 3.4 (constant model / accurate model / regression) | PoseNet (ResNet-34, reported in [3]) (regression)
Table 2: Median translational error (m) and rotational error (°) on the 7-Scenes dataset. We implemented PoseNet according to [16] with our version of a pre-trained ResNet-18 and the log-quaternion pose parameterization [3].

The Kalman filter provides a principled way to combine sequential information and produce a reasonable estimate. An accurate transition model for predicting the next state is crucial for the performance of the filter. Previously, we have shown that even with a constant model, the estimator is able to correct itself to a reasonable value. On the other hand, if an accurate model is available, i.e., the applied transition model is close enough to the true motion, the performance can be further improved. However, for 7-Scenes we do not have other sensors providing ego-motion information from which to devise a transition model. In order to evaluate this setting, we therefore calculate the per-time-step difference from the ground truth, using Δx_t = x_{t+1} − x_t for the translational motion and Δq_t = q_{t+1} q_t⁻¹ for the rotational transformation, to simulate an accurate transition model.
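Computing per-step increments from consecutive ground-truth poses can be sketched as follows, using the difference of positions and the relative quaternion. This is a sketch with standard Hamilton-product conventions in (w, x, y, z) order; the helper names are ours:

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_conj(q):
    """Conjugate; for unit quaternions this is the inverse."""
    return np.array([q[0], -q[1], -q[2], -q[3]])

def step_transition(x_t, x_next, q_t, q_next):
    """Per-step ground-truth increment: translation difference and
    relative rotation q_next * conj(q_t)."""
    return x_next - x_t, quat_mul(q_next, quat_conj(q_t))

# Applying the computed increment to the current pose recovers the next pose.
q_t = np.array([1.0, 0.0, 0.0, 0.0])
q_next = np.array([np.cos(0.1), np.sin(0.1), 0.0, 0.0])
dx, dq = step_transition(np.zeros(3), np.ones(3), q_t, q_next)
assert np.allclose(quat_mul(dq, q_t), q_next)
```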

In Table 2 we compare the constant model and the accurate model, both for our Generative Map framework as described in Sections 3.1 and 3.2, and for direct regression PoseNet [16] with the normal training loss described in Section 3.4. Again, the transition uncertainty Q is set to be a diagonal matrix, with diagonal elements all equal to 0.01 for the constant models, and 0.0001 for the accurate models. For PoseNet, we also report its direct regression performance. Scores of the state-of-the-art PoseNet reported in [3] are also given in the table, which uses the much deeper ResNet-34 architecture. Note that we use a different pre-trained model with smaller input images.

From the results we can see that incorporating a Kalman filter with a constant model performs comparably to direct regression, while the Kalman filter with an accurate transition model noticeably improves localization performance, both for our Generative Map and for the PoseNet approach.

4.2 7-Scenes Generation

Figure 4: Localization and generation of equidistant sample poses from the unseen test sequences. The left sequence is redkitchen-seq12, and the right one is chess-seq03.
Figure 5: Image generation for equidistant sample poses from the unseen test sequences. Sequences from top to bottom: fire-seq04, stairs-seq04, pumpkin-seq01, office-seq02, redkitchen-seq03, chess-seq05, heads-seq01. For each sequence, the downsampled original images are shown on top, and the generated images at the bottom.

In this section, we evaluate our model by querying images with poses from the test sequences. Ten poses per test sequence are sampled equidistantly in time, i.e., for a sequence of length 1000, we take every 100th pose. We plot both the real images and the images generated by the model in Figure 5.

Despite being trained on a small training set, our model is able to generate meaningful images that are roughly recognizable for each scene. For every sequence, the images generated at different time steps clearly vary. In most cases, the generated image shows the scene from a pose that matches the real image, which demonstrates the readability of our approach.

In Figure 4, we show the evaluation result of our model for both localization and image generation. Here we use an accurate transition model, with the diagonal elements of the transition uncertainty set to 0.00001. The true trajectories are displayed in dashed blue curves, while the red solid curves show the localization result. For each sequence, we mark four equidistant poses, and generate the corresponding images. The real images are shown for comparison in the first and the last rows, while the generated images are shown on the second and third rows. Again, we can observe that the generated images roughly match the real ones. Especially for poses with more accurate localization, the generated images resemble the real images more closely.

5 Conclusion

In image-based localization problems, the map representation plays an important role. Instead of using hand-crafted features, deep neural networks have recently been explored as a way to learn a data-driven map [3]. Despite their success in improving localization accuracy, prior works in this direction [17, 23, 4, 33, 16, 3] produce maps that are unreadable for humans, and hence hard to visualize and verify. In this work, we propose the Generative Map framework for learning a neural network based map representation. Our probabilistic framework tackles the readability problem of prior DNN-based maps by allowing queries for images given poses of interest. Our training objective is derived from the generative model of Variational Auto-Encoders [19] and can be applied to train the entire framework jointly. For localization, our approach relies on the classic Kalman filter [14] and estimates the pose through an iterative process. This also enables us to easily incorporate additional information, e.g. sensor inputs or transition models of the system.

We evaluate our approach on the 7-Scenes benchmark dataset [28], which is challenging as it is small in size compared to other datasets for generative models. Our experimental result shows that, given a pose of interest from the test data, our model is able to generate an image that largely matches the ground truth image from the same pose. Moreover, we also show that our map is suitable for the localization task, and can correct false initialization values based on the input images. We also observe that, if an accurate transition model is available, the estimation accuracy of the approach can be significantly improved.

This leads to several potential directions for future research. First, the generated images may provide a way to visualize and measure the accuracy of the model for each region of the environment. It would be interesting to investigate in depth the correlation between the quality of the generated images and the localization accuracy. Regions with worse generated images may require more training data to be collected; in that way, we would be able to actively search for areas to improve, based on e.g. the image reconstruction error from the environment. Secondly, combining generative and non-generative DNN maps may result in a hybrid model with better readability and localization performance. Finally, it would be meaningful and interesting to extend our framework to a full SLAM scenario [12], which not only localizes itself, but also builds an explicit map in completely new environments.


Appendix A Derivation of the Training Objective (Eq. (2))

log p(x|s)
= ∫ q_φ(z|x) log p(x|s) dz    (first term integrates to 1)
= ∫ q_φ(z|x) log [ p(x, z|s) / p(z|x, s) ] dz    (property of logarithm)
= ∫ q_φ(z|x) log [ (p(x, z|s) / q_φ(z|x)) · (q_φ(z|x) / p(z|x, s)) ] dz    (add the term q_φ(z|x), and subtract it immediately)
= E_{q_φ(z|x)}[ log ( p(x, z|s) / q_φ(z|x) ) ] + D_KL( q_φ(z|x) ‖ p(z|x, s) )
≥ E_{q_φ(z|x)}[ log ( p(x, z|s) / q_φ(z|x) ) ]    (non-negative KL-divergence)
= E_{q_φ(z|x)}[ log p_θ(x|z) ] − D_KL( q_φ(z|x) ‖ p_ψ(z|s) )    (conditional independence assumption p(x|z, s) = p(x|z), property of logarithm)

Intuitively, the KL-divergence term minimizes the difference between q_φ(z|x) and p_ψ(z|s), and the reconstruction term makes sure that we can predict the scene x from the latent variable.