1 Introduction
Image-based localization is an important task for many computer vision applications such as autonomous driving, indoor navigation, and augmented or virtual reality. In these applications the environment is usually represented by a map, and approaches differ considerably in how the map is structured. In classical approaches, human-designed features are extracted from images and stored in a map together with their geometric relations. The same features are then extracted from a query image and compared with the recorded ones to determine the camera pose relative to the map. Typical examples of these features include local point-like features [20, 24], image patches [1, 6], and objects [27]. However, these approaches may ignore useful information that is not captured by the employed features. This becomes more problematic if the environment does not offer enough rich texture to extract. Furthermore, these approaches typically rely on prescribed structures like point clouds or grids, which are inflexible and grow with the scale of the environment.
Recently, deep neural networks (DNNs) have been considered for the direct prediction of 6-DoF camera poses from images [17, 23, 4, 33, 3]. In this context, Brahmbhatt et al. [3] proposed to treat a neural network as a form of map representation, i.e., an abstract summary of input data, which can be queried to get camera poses. The DNN is trained to establish a relationship between images and corresponding poses. At test time, it can be queried for a pose given an input image from that viewpoint. While the performance of these DNN map approaches has significantly improved [17, 16, 3] and is getting close to hybrid approaches, e.g. [2], these maps are typically unreadable for humans.
To solve this problem, we propose a new framework for learning a DNN map, which not only can be used for localization, but also allows queries from the other direction, i.e., given a camera pose, what should the scene look like? We achieve this via a combination of the generative model of Variational AutoEncoders (VAEs) [19] with a new training objective that is appropriate for this task, and the classic Kalman filter [14]. This makes the map human readable, and hence easier to interpret and verify.
Most research on image generation [31, 29, 10, 13, 30, 9] is based either on VAEs [19] or on Generative Adversarial Networks [8]. In our work, we take the VAE approach, due to its capability to infer latent variables from input images. In addition, our model relies on the Kalman filter for connecting the sequence, with a neural network as the observation model. This also enables our framework to integrate the transition model of the system, and other sources of sensor information, if they are available.
The main contribution of this work can be summarized as follows:

Prior works on DNN maps [17, 23, 4, 33, 3] learn the map representation by directly regressing the 6-DoF camera poses from images. In this work, we approach this problem from the other direction via the generative model of the VAE, i.e., by learning the mapping from poses to images. To maintain discriminability, we derive a new training objective, which allows the model to learn pose-specific distributions, instead of a single distribution for the entire dataset as in traditional VAEs [19, 29, 31, 34, 13, 30]. Our map is thus more interpretable, and can be used for querying an image from a particular viewpoint.

Image generation models cannot directly produce poses for localization. To solve this, we exploit the sequential structure of the localization problem, and propose a framework to estimate the poses with a Kalman filter [14], where a neural network is used as the observation model of the filtering process. We show that this estimation framework works even with a simple constant transition model, and can be further improved if an accurate transition model is available. In addition, we also show that the same framework can be applied to other DNN map approaches based on direct regression [16] and achieves better performance.
2 Related Works
DNN map for localization
In terms of localization, PoseNet [17] first proposed to directly learn the mapping from images to the 6-DoF camera poses. Follow-up works in this direction improved the localization performance by introducing deeper architectures [23], exploiting spatial [4] and temporal [33] structures, and incorporating relative distances between poses in the training objective [3]. Kendall and Cipolla [16] showed that the idea of probabilistic deep learning can be applied, and introduced learnable weights for the translation and rotation error in the loss function, which increased the performance significantly. All of these approaches tackle the learning problem via direct regression of camera poses from images, and focus on improving the accuracy of localization. Instead, we propose to learn the generative process from poses to images. Our focus is to make the DNN map human readable, by providing the capability to query the view of a specific pose.
Image Generation
Generative models based on neural networks were originally designed to capture the image distribution [32, 19, 26, 8, 9]. Recent works in this direction succeeded in generating images of increasingly higher quality. However, these models do not establish geometric relationships between viewpoints and images. In terms of conditional generation of images, many approaches have been proposed for different sources of information, e.g. class labels [25] and attributes [31, 29]. For a map in camera localization, our input source is the camera pose. The generative query network [7] can generate images from different poses for different scenes, but only in simulated environments. Instead, we train and evaluate our framework on a real world localization benchmark dataset [28].
VAE-based training objective
Several recent works [29, 31, 34, 13, 30] discuss VAE-based image generation. Most of them assume a single normal distribution as the prior for the latent representation, and regularize the latent variable of each data point to match this prior [29, 31, 34, 13, 36]. Tolstikhin et al. [30] relaxed this constraint by modeling the latent representations of the entire dataset, instead of a single data point, as one single distribution. However, such a setting is still inappropriate in our case, since restricting latent representations from different pose-image pairs to form a single distribution may reduce their discriminability, which can be critical for localization tasks. Several works have also been proposed for sequence learning with VAEs [21, 36, 15] and for sequential control problems [35], which similarly assume a single prior distribution for the latent variables. Instead, we model the distribution of the latent variables only conditioned on poses. By assuming this distribution to be Gaussian, we also make our proposed approach naturally compatible with a Kalman filter. This is explained further in Sections 3.2 and 3.3.
3 Proposed Approach
In this paper, we propose a new framework for learning a DNN-based map representation, by learning a generative model. Figure 1 shows our overall framework, which is described in detail in Section 3.1. Our objective function is based on the lower bound of the conditional log-likelihood of images given poses. In Section 3.2 we derive this objective for training the entire framework with a single loss. Section 3.3 introduces the pose estimation process for our framework. The sequential estimator based on the Kalman filter is crucial for the localization task in our model, and allows us to incorporate the transition model of the system in a principled way. We also show in Section 3.4 how the proposed estimator can be used to increase the performance of previous DNN map approaches.
In this work, we denote images by $x$, poses by $y$, and latent variables by $z$. We assume the generative process $y \rightarrow z \rightarrow x$, and follow [19] in using $p$ and $q$ for the generative and inference models, respectively.
3.1 Framework
Our framework consists of three neural networks, the image encoder $q(z \mid x)$, the pose encoder $p(z \mid y)$, and the image decoder $p(x \mid z)$, as shown in Figure 1. During training, all three networks are trained jointly with a single objective function described in Section 3.2. Once trained, different networks are used depending on the task that we want to perform, i.e., pose estimation or image (video) generation. Generating images involves the pose encoder $p(z \mid y)$ and the image decoder $p(x \mid z)$, while pose estimation requires the pose encoder $p(z \mid y)$ and the image encoder $q(z \mid x)$.
For the image encoder, we use ResNet [11], similar to the one proposed in [3]. For the image decoder, we use a fractionally-strided convolution structure, as proposed in DCGAN [25]. For the pose encoder, we use a common feed-forward neural network. However, there is no restriction on the specific neural network architectures that can be applied in our framework.
3.2 Training Objective
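Our objective function is based on the Variational AutoEncoder, which optimizes the following lower bound of the log-likelihood [19]
$$\log p(x) \ge \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - \mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right) \tag{1}$$
where $x$ represents the data to encode, and $z$ stands for the latent variables that can be inferred from $x$ through $q(z \mid x)$. The objective can be intuitively interpreted as minimizing the reconstruction error $-\mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right]$ together with a KL-divergence term for regularization, $\mathrm{KL}\left(q(z \mid x) \,\|\, p(z)\right)$.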
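To apply VAEs in cases with more than one input data source, e.g. images $x$ and poses $y$ as in our case, we need to reformulate the above lower bound. We achieve this by optimizing the following lower bound,
$$\log p(x \mid y) \ge \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] - \mathrm{KL}\left(q(z \mid x) \,\|\, p(z \mid y)\right) \tag{2}$$
where $q(z \mid x)$, $p(z \mid y)$ and $p(x \mid z)$ are modeled by DNNs. A detailed derivation can be found in Appendix A. For convenience, we treat the negative right-hand side of Equation (2) as our loss and train our model by minimizing
$$\mathcal{L} = \mathbb{E}_{q(z \mid x)}\left[-\log p(x \mid z)\right] + \mathrm{KL}\left(q(z \mid x) \,\|\, p(z \mid y)\right) \tag{3}$$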
Similar to Equation (1), the first term in our loss function can be seen as a reconstruction error for the image, while the second term serves as a regularizer. Unlike most other extensions of the VAE [19, 31, 34], where the marginal distribution of the latent variable $z$ is assumed to be normally distributed, our loss function assumes the distribution of latent variables to be normal only when conditioned on the corresponding poses or images. We assume that for every pose $y$, the latent representation $z$ follows a normal distribution $p(z \mid y)$. Similarly, a normal distribution $q(z \mid x)$ is assumed for the latent variable conditioned on the image $x$. The KL-divergence term enforces these two distributions to be close to each other.
One fundamental difference between our loss function (3) and previous works in DNN-based visual localization [17, 16, 3] is that a direct mapping from images to poses does not exist in our framework. Hence, we cannot obtain the poses by direct regression. Instead, we treat the pose encoder $p(z \mid y)$ as an observation model and use the Kalman filter [14] to iteratively estimate the correct pose. This is described in detail in Section 3.3. Another important difference is that the generative process from poses to images is modeled by the networks $p(z \mid y)$ and $p(x \mid z)$. This allows us to query the model with a pose of interest, and obtain a generated RGB image which describes what the scene should look like at that pose.
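The structure of this loss can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions: both latent distributions are taken to be diagonal Gaussians, the expectation is approximated by a single decoded sample, and the reconstruction distribution is an isotropic Gaussian with variance `recon_var`. The function and argument names are hypothetical and not taken from the original implementation.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between two diagonal Gaussians, summed over latent dimensions."""
    mu_q, var_q, mu_p, var_p = map(np.asarray, (mu_q, var_q, mu_p, var_p))
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

def generative_map_loss(x, x_recon, recon_var, mu_q, var_q, mu_p, var_p):
    """Negative lower bound: reconstruction term plus KL regularizer.

    x        : observed image (any shape)
    x_recon  : image decoded from one latent sample drawn from the image encoder
    recon_var: variance of the assumed Gaussian reconstruction distribution
    mu_q/var_q: latent mean and variance from the image encoder
    mu_p/var_p: latent mean and variance from the pose encoder
    """
    x = np.asarray(x, dtype=float)
    x_recon = np.asarray(x_recon, dtype=float)
    # Negative log-likelihood of x under N(x_recon, recon_var * I).
    recon = 0.5 * np.sum(
        (x - x_recon) ** 2 / recon_var + np.log(2.0 * np.pi * recon_var)
    )
    return recon + kl_diag_gaussians(mu_q, var_q, mu_p, var_p)
```

The KL term vanishes exactly when the image encoder and the pose encoder predict the same latent distribution, which is the property the pose estimation procedure relies on.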
3.3 Kalman Filter for Pose Estimation
As mentioned above, the generative model we propose cannot predict poses directly. However, we can still estimate the pose with the trained model using a Kalman filter, as shown in Figure 2. The image encoder $q(z \mid x)$ is seen as a sensor, which processes an image $x_t$ at each time step and produces an observation vector $z_t$ based on that image. From the pose $y_t$ we can also obtain an expected observation using the pose encoder $p(z \mid y)$, which is compared with the observation from the raw image. By assumption, $p(z \mid y)$ and $q(z \mid x)$ are both normally distributed and regularized to resemble each other. In addition, we also model the pose $y$ as normally distributed. Therefore, the generative model naturally fits into the estimation process of a Kalman filter.
In order to close the update loop, we need a transition function. If the ego-motion is unknown, a simple approach is to assume the pose remains constant over time, i.e., $y_t = y_{t-1}$. In such a case, the Kalman filter introduces no further information about the system itself, but rather a smoothing effect based on previous inputs. On the other hand, if additional control signals or motion constraints are known, a more sophisticated transition model can be devised. In such a case, the transition function becomes $y_t = f(y_{t-1}, u_t)$, where $u_t$ is the control signal for the ego-motion, which can be obtained from other sensors. We show in Section 4.1 that we can estimate the pose with a simple constant transition model, and that a significant improvement in localization performance can be achieved if an accurate transition model from $y_{t-1}$ to $y_t$ is available.
Training Framework             Kalman Filter
Pose                           State
Mean of latent variable        Observation
Variance of latent variable    Diagonal of observation uncertainty matrix
Image encoder                  Sensor that produces observation and observation uncertainty
Pose encoder                   Observation model
The correspondence between the components of our training framework and the Kalman filter is summarized in Table 1. The pose estimation update using the Kalman filter consists of a prediction step and a correction step, which are explained in the following.

Prediction with transition model
Let us denote the transition function by $f$, and its first-order derivative w.r.t. the pose $y$ by $F$, which can be obtained by the finite difference method. An update for the prediction step of the estimation can then be written as
$$\hat{y}_t = f(y_{t-1}) \tag{4}$$
$$\hat{\Sigma}_t = F \Sigma_{t-1} F^\top + Q \tag{5}$$
where $\Sigma$ stands for the covariance matrix of the pose, and $Q$ stands for the state transition uncertainty. $Q$ needs to be set to higher values if the transition is inaccurate, e.g. if we are using a constant model, and to smaller values when an accurate transition model is available.
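As a minimal sketch of the prediction step, the following NumPy code approximates the derivative of the transition function with central finite differences and then propagates the pose estimate and its covariance. The helper names (`finite_difference_jacobian`, `predict`, `constant_model`) and the step size `eps` are illustrative assumptions, not part of the original implementation.

```python
import numpy as np

def finite_difference_jacobian(func, y, eps=1e-5):
    """Approximate the Jacobian of func at y with central differences."""
    y = np.asarray(y, dtype=float)
    out_dim = np.asarray(func(y)).shape[0]
    jac = np.zeros((out_dim, y.shape[0]))
    for i in range(y.shape[0]):
        step = np.zeros_like(y)
        step[i] = eps
        jac[:, i] = (np.asarray(func(y + step)) - np.asarray(func(y - step))) / (2.0 * eps)
    return jac

def predict(y_prev, cov_prev, transition, Q):
    """Prediction step: propagate the pose estimate and its covariance."""
    F = finite_difference_jacobian(transition, y_prev)
    y_pred = np.asarray(transition(y_prev), dtype=float)
    cov_pred = F @ cov_prev @ F.T + Q
    return y_pred, cov_pred

def constant_model(y):
    """Constant transition model: the pose is assumed not to change."""
    return np.asarray(y, dtype=float)
```

With the constant model, the Jacobian is the identity, so the prediction step leaves the pose unchanged and simply adds the transition uncertainty to the covariance.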

Correction with current observation
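We assume the neural observation model can also be written as a function $h$; its first-order derivative w.r.t. the pose $y$, given by the finite difference method, is denoted by $H$. In each time step, our neural sensor model produces a new observation $z_t$ based on the current image $x_t$, which is then compared with $h(\hat{y}_t)$, where $\hat{y}_t$ and $\hat{\Sigma}_t$ denote the pose and covariance obtained from the prediction step. The correction step can be written as
$$K_t = \hat{\Sigma}_t H^\top \left(H \hat{\Sigma}_t H^\top + R\right)^{-1} \tag{6}$$
$$y_t = \hat{y}_t + K_t \left(z_t - h(\hat{y}_t)\right) \tag{7}$$
$$\Sigma_t = \left(I - K_t H\right) \hat{\Sigma}_t \tag{8}$$
where $K_t$ is the Kalman gain, and $R$ is the observation uncertainty. In our case, we can directly use the variance of $z$ inferred by the image encoder to build the matrix $R$.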
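The correction step can be sketched as follows. The sketch assumes that the Jacobian of the observation model is passed in explicitly (in the framework above it would come from finite differences on the pose encoder) and that the observation uncertainty is a full matrix built from the variance predicted by the image encoder. The function name `correct` and its argument names are hypothetical.

```python
import numpy as np

def correct(y_pred, cov_pred, z_obs, R, observe, H):
    """Correction step of an extended Kalman filter.

    y_pred, cov_pred: predicted pose and covariance
    z_obs           : observation produced by the sensor for the current image
    R               : observation uncertainty matrix
    observe         : observation model mapping a pose to an expected observation
    H               : Jacobian of `observe` evaluated at y_pred
    """
    innovation = z_obs - observe(y_pred)
    S = H @ cov_pred @ H.T + R                 # innovation covariance
    K = cov_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    y_new = y_pred + K @ innovation
    cov_new = (np.eye(y_pred.shape[0]) - K @ H) @ cov_pred
    return y_new, cov_new, K
```

A larger observation uncertainty shrinks the gain, so the estimate moves less towards observations that the image encoder itself considers unreliable.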
3.4 Kalman Filter for Direct Regression Models
In this section, we describe how our Kalman filter technique can be applied to previous works on DNN maps with direct regression approaches [17, 23, 4, 33, 3]. In particular, we discuss its relationship with the learnable weighting technique introduced by Kendall and Cipolla [16], and introduce a small modification to make the direct regression approach fully compatible with a Kalman filter.
When we regress poses from images, i.e., minimize the pose error, one important question is how to weigh the translational error, computed on Cartesian coordinates, against the rotational error, computed e.g. on quaternions. PoseNet [17] first identified this problem and found that a better balance between translational and rotational error improves the overall optimization performance. They achieved this with a linear combination of the translational and rotational errors using a weight factor $\beta$ in the objective [17], i.e., $\mathcal{L} = \mathcal{L}_{\text{translation}} + \beta \, \mathcal{L}_{\text{rotation}}$. Kendall and Cipolla then approached it from a probabilistic perspective, and proposed to model the pose with a Laplacian distribution, by assuming a variance for both the translational part and the rotational part [16]. Hence, the loss function for regressing the pose can be written as
(9) 
where and are typically set as e.g. norm, and and are set as trainable parameters of the model, independent of the input data.
In order to fit into the Kalman filter process described in Section 3.3, we can treat the observation model as the identity mapping. In addition, we also modify the assumed distribution for poses. Specifically, instead of a Laplacian distribution, we assume a normal distribution for the pose
(10)  
where we use an norm for the losses, i.e., , and for we use logquaternions, following [3].
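To make the difference between a fixed weighting and a probabilistic weighting concrete, the following sketch contrasts the two. It is an illustration under assumptions and does not claim to reproduce the exact form of Equations (9) and (10): the Gaussian variant is written as a negative log-likelihood up to an additive constant, with one learnable log-variance per pose part, and all function and argument names are hypothetical.

```python
import numpy as np

def weighted_pose_loss(t_pred, t_true, r_pred, r_true, beta):
    """Fixed linear weighting of translational and rotational error."""
    t_err = np.asarray(t_pred, dtype=float) - np.asarray(t_true, dtype=float)
    r_err = np.asarray(r_pred, dtype=float) - np.asarray(r_true, dtype=float)
    return np.linalg.norm(t_err) + beta * np.linalg.norm(r_err)

def gaussian_pose_nll(t_pred, t_true, r_pred, r_true, log_var_t, log_var_r):
    """Gaussian negative log-likelihood (up to a constant) of a pose.

    log_var_t and log_var_r are learnable log-variances for the translational
    and rotational part; they are independent of the input image.
    """
    def part(pred, true, log_var):
        err = np.asarray(pred, dtype=float) - np.asarray(true, dtype=float)
        return 0.5 * (np.sum(err ** 2) * np.exp(-log_var) + err.size * log_var)

    return part(t_pred, t_true, log_var_t) + part(r_pred, r_true, log_var_r)
```

In the probabilistic variant the balance between the two error terms is learned through the variances instead of being tuned by hand, and the learned variances can directly serve as observation uncertainty in a Kalman filter.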
3.5 Implementation Details
As explained in Section 3.1, we use ResNet as our image encoder $q(z \mid x)$. However, directly training ResNet on 7Scenes, which contains fewer than 10,000 images per scene, suffers from overfitting. In our experiments, we found that training ResNet from scratch on such a small dataset converges on the training set, but leads to unusable results on unseen test sequences. Therefore, we follow PoseNet [17] and use a ResNet pretrained on ImageNet [5] as the initialization of our image encoder. We replace the last layer of ResNet with a fully connected layer of size 2048 with ReLU activation and a dropout rate of 0.5, followed by a linear mapping to the latent variable $z$ of size 256. Note that unlike previous work [3], which focuses on localization only and uses ResNet34, we use ResNet18, since our trial experiments show no obvious performance boost from increasing the depth.
For the image decoder $p(x \mid z)$, we use the decoder architecture proposed in DCGAN [25]. The initial features in the decoder contain 1024 channels, and are obtained by a linear mapping from the latent variable $z$. We use four fractionally-strided convolutional layers for our experiments on 7Scenes, with a kernel size of 5 in all layers. For the pose encoder, we use a three-layered fully connected network, where the only hidden layer contains 512 units.
The input images for the image encoder are resized to , while generated images are set to be . We use the Adam optimizer [18] with a learning rate of 0.0001 without decay. For localization tests in Section 4.1, we train the direct regression model for 500 epochs in each environment. However, for generation, we do not use a fixed number of training epochs, since the training sets of the scenes in the 7Scenes dataset have different sizes, and the small number of training samples makes the model prone to overfitting. Instead, the model is trained such that the final negative lower bound of the log-likelihood reaches a value between and , i.e., the loss from Equation (3). Under this setting, the model is exposed to a similar number of images in each scene and does not overfit. For the learnable variance of the reconstruction distribution, we use an initial value of for the Generative Map, and follow [3] to set , for the direct regression model in Section 3.4.
4 Experiments
We use the 7Scenes dataset [28] to evaluate our framework, for both the generation and the localization task. The dataset contains video sequences recorded in seven different indoor environments. Each scene contains 2 to 7 sequences for training, with either 500 or 1000 images per sequence. The corresponding ground truth poses are provided for training and evaluation. For training generative models, prior approaches often rely on a large dataset, e.g. CelebA [22] contains more than 200,000 images. In comparison, the dataset we use is much more challenging, as each scene contains fewer than 10,000 training samples.
4.1 7Scenes Localization
Correcting false initialization
Our framework does not provide a direct mapping from images to poses, but rather relies on an iterative estimation process based on the Kalman filter for localization. For such a process, an initial pose needs to be provided as the starting point. An effective estimator should be able to converge to an accurate value, even when provided with a false initialization. Here, we test our framework by feeding false initializations and using a constant transition model, i.e., assuming no further information from the system, but only $y_t = y_{t-1}$. The estimator is initialized with a false pose of , i.e., position and quaternion for orientation . The transition uncertainty is set as a diagonal matrix with 0.1 for all its diagonal elements. The result is shown in Figure 3.
It can be observed that, although the initial value is not correct, our model is able to estimate the pose to a reasonable accuracy, based only on the observations obtained from the neural network sensor . For example, the first element of the state should be at the beginning, but is initialized as . After 50 to 100 steps, it is able to correct itself within the range of around the correct value for the entire sequence. Note that, the first point of the solid lines shows the estimated value after the first observation of , hence the curve does not start directly at the origin.
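The self-correcting behavior can be reproduced qualitatively on a toy problem. The sketch below is not the experiment reported above: it replaces the neural networks by a random linear observation matrix, uses a synthetic, slowly drifting three-dimensional state instead of real camera poses, and all constants (noise level, number of steps, uncertainties) are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def run_filter(observations, H, y_init, q_diag, r_diag):
    """Kalman filter with a constant transition model and a linear toy sensor."""
    dim = y_init.shape[0]
    Q = q_diag * np.eye(dim)
    R = r_diag * np.eye(H.shape[0])
    y, cov = y_init.astype(float), np.eye(dim)
    estimates = []
    for z in observations:
        # Prediction: the state stays constant, only the uncertainty grows.
        cov = cov + Q
        # Correction: pull the estimate towards the current observation.
        K = cov @ H.T @ np.linalg.inv(H @ cov @ H.T + R)
        y = y + K @ (z - H @ y)
        cov = (np.eye(dim) - K @ H) @ cov
        estimates.append(y.copy())
    return np.array(estimates)

rng = np.random.default_rng(0)
H = rng.normal(size=(8, 3))                      # linear stand-in for the observation model
steps = np.arange(200)[:, None]
true_states = np.array([0.5, -0.2, 1.0]) + 0.005 * steps * np.array([1.0, 0.0, 0.0])
observations = true_states @ H.T + 0.05 * rng.normal(size=(200, 8))

# Deliberately wrong initialization at the origin.
estimates = run_filter(observations, H, np.zeros(3), q_diag=0.1, r_diag=0.05 ** 2)
errors = np.linalg.norm(estimates - true_states, axis=1)
initial_error = np.linalg.norm(true_states[0])
```

Comparing `errors[-1]` with `initial_error` shows how far the estimate has moved from the false initialization towards the true state using observations alone.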
Incorporating transition model
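Table 2: Localization results on 7Scenes, reported per scene (Chess, Fire, Heads, Office, Pumpkin, Kitchen, Stairs) and as an average. Columns: Generative Map with Kalman filter (constant model, accurate model); PoseNet [16] (ResNet18), our implementation with a Gaussian training loss as in Section 3.4, with Kalman filter (constant model, accurate model) and with direct regression; PoseNet (ResNet34, reported in [3]) with direct regression.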
The Kalman filter provides a principled way to combine sequential information, and produces a reasonable estimated value. An accurate transition model for predicting the next state is crucial for the performance of the filter. Previously, we have shown that even with a constant model, the estimator is able to correct itself to a reasonable value. On the other hand, if an accurate model is available, i.e., the applied transition model of is close enough to the true model, the performance can be further improved. However, for 7Scenes, we do not have other sensors to provide egomotion information to devise a transition model. In order to evaluate this, we calculate the per time step difference via for translational motion, and as the rotational transformation to simulate an accurate transition model.
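One possible way to simulate such a transition model from consecutive ground truth poses is sketched below. It assumes unit quaternions in (w, x, y, z) order and composes the rotational increment by left multiplication; this convention, like the function names, is an assumption made for illustration and may differ from the one used in the experiments.

```python
import numpy as np

def quat_conjugate(q):
    """Conjugate of a quaternion (w, x, y, z); the inverse for unit quaternions."""
    w, x, y, z = q
    return np.array([w, -x, -y, -z])

def quat_multiply(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def relative_motion(t_prev, q_prev, t_curr, q_curr):
    """Per-time-step motion between two consecutive ground truth poses."""
    delta_t = np.asarray(t_curr, dtype=float) - np.asarray(t_prev, dtype=float)
    delta_q = quat_multiply(q_curr, quat_conjugate(q_prev))
    return delta_t, delta_q

def apply_motion(t_prev, q_prev, delta_t, delta_q):
    """Simulated transition: apply the recorded motion to the previous pose."""
    return np.asarray(t_prev, dtype=float) + delta_t, quat_multiply(delta_q, q_prev)
```

Applying the recorded motion to the previous ground truth pose recovers the current ground truth pose, so the filter is handed a transition model that is accurate by construction.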
In Table 2 we compare the constant model and the accurate model for both our Generative Map framework as in Sections 3.1 and 3.2, and direct regression PoseNet [16] with the normal training loss described in Section 3.4. Again, the transition uncertainty is set to be a diagonal matrix, with diagonal elements all equal to 0.01 for constant models, and 0.0001 for accurate models. For PoseNet, we also report the direct regression performance. Scores of the state-of-the-art PoseNet reported in [3], which used the much deeper architecture of ResNet34, are also given in the table. Note that we use a different pretrained model with smaller input images.
From the results we can see that incorporating a Kalman filter with a constant model performs comparably to direct regression, while the Kalman filter with an accurate transition model clearly improves the localization performance, both for our Generative Map and for the PoseNet approach.
4.2 7Scenes Generation
In this section, we evaluate our model by querying images with poses from the test sequence. Ten poses for each test sequence are sampled equidistant in time, i.e., for sequence of length 1000, we take the poses from indices of . We plot both the real images and the model generated images in Figure 5.
Despite being trained on a small training set, our model is able to generate meaningful images that are roughly recognizable for each scene. For every sequence, the images generated for different time steps clearly vary. In most cases, we can observe that the generated image shows the scene from a pose that matches the real image, which demonstrates the readability of our approach.
In Figure 4, we show the evaluation result of our model for both localization and image generation. Here we use an accurate transition model, with the diagonal elements of the transition uncertainty set to 0.00001. The true trajectories are displayed in dashed blue curves, while the red solid curves show the localization result. For each sequence, we mark four equidistant poses, and generate the corresponding images. The real images are shown for comparison in the first and the last rows, while the generated images are shown on the second and third rows. Again, we can observe that the generated images roughly match the real ones. Especially for poses with more accurate localization, the generated images resemble the real images more closely.
5 Conclusion
In image-based localization problems, the map representation plays an important role. Instead of using handcrafted features, deep neural networks have recently been explored as a way to learn a data-driven map [3]. Despite their success in improving the localization accuracy, prior works in this direction [17, 23, 4, 33, 16, 3] produce maps that are unreadable for humans, and hence hard to visualize and verify. In this work, we propose the Generative Map framework for learning a neural network based map representation. Our probabilistic framework tackles the problem of readability of prior DNN-based maps, by allowing queries for images given poses of interest. Our training objective is derived from the generative model of Variational AutoEncoders [19] and can be applied to train the entire framework jointly. For localization, our approach relies on the classic Kalman filter [14] and estimates the pose through an iterative process. This also enables us to easily incorporate additional information, e.g. sensor inputs or transition models of the system.
We evaluate our approach on the 7Scenes benchmark dataset [28], which is challenging as it is small in size compared to other datasets for generative models. Our experimental result shows that, given a pose of interest from the test data, our model is able to generate an image that largely matches the ground truth image from the same pose. Moreover, we also show that our map is suitable for the localization task, and can correct false initialization values based on the input images. We also observe that, if an accurate transition model is available, the estimation accuracy of the approach can be significantly improved.
This leads to several potential directions for future research. First, the generated images may provide a way to visualize and measure the accuracy of the model for each region of the environment. It would be interesting to conduct an in-depth investigation of the correlation between the quality of the generated images and the localization accuracy. Regions with worse generated images may require more training data to be collected. In that way, we would be able to actively search for areas to improve based on, e.g., the image reconstruction error in the environment. Secondly, combining generative and non-generative DNN maps may result in a hybrid model with better readability and localization performance. Finally, it would be meaningful and interesting to extend our framework to a full SLAM scenario [12], which can not only localize itself, but also build an explicit map in completely new environments.
References
 [1] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time, Nov 2011.
 [2] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother. DSAC - Differentiable RANSAC for camera localization. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, 2017.
 [3] S. Brahmbhatt, J. Gu, K. Kim, J. Hays, and J. Kautz. Geometry-aware learning of maps for camera localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2616–2625, 2018.
 [4] R. Clark, S. Wang, A. Markham, N. Trigoni, and H. Wen. VidLoc: A deep spatio-temporal model for 6-DoF video-clip relocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, 2017.
 [5] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
 [6] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. September 2014.

 [7] S. A. Eslami, D. J. Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor, et al. Neural scene representation and rendering. Science, 360(6394):1204–1210, 2018.
 [8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 [9] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. Draw: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.
 [10] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville. Pixelvae: A latent variable model for natural images. arXiv preprint arXiv:1611.05013, 2016.
 [11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [12] J. F. Henriques and A. Vedaldi. Mapnet: An allocentric spatial memory for mapping environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8476–8484, 2018.
 [13] I. Higgins, N. Sonnerat, L. Matthey, A. Pal, C. P. Burgess, M. Bošnjak, M. Shanahan, M. Botvinick, D. Hassabis, and A. Lerchner. Scan: Learning hierarchical compositional visual concepts. 2018.
 [14] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of basic Engineering, 82(1):35–45, 1960.
 [15] M. Karl, M. Soelch, J. Bayer, and P. van der Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.
 [16] A. Kendall, R. Cipolla, et al. Geometric loss functions for camera pose regression with deep learning. In Proc. CVPR, volume 3, page 8, 2017.
 [17] A. Kendall, M. Grimes, and R. Cipolla. PoseNet: A convolutional network for real-time 6-DOF camera relocalization. In Proceedings of the IEEE international conference on computer vision, pages 2938–2946, 2015.
 [18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [19] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [20] G. Klein and D. Murray. Parallel tracking and mapping for small ar workspaces. In 2007 6th IEEE and ACM International Symposium on Mixed and Augmented Reality, pages 225–234, Nov 2007.
 [21] R. G. Krishnan, U. Shalit, and D. Sontag. Deep kalman filters. arXiv preprint arXiv:1511.05121, 2015.
 [22] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), 2015.
 [23] I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu. Image-based localization using hourglass networks. arXiv preprint arXiv:1703.07971, 2017.
 [24] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, Oct 2015.
 [25] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

 [26] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning - Volume 32, ICML'14, pages II-1278–II-1286. JMLR.org, 2014.
 [27] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. J. Kelly, and A. J. Davison. SLAM++: Simultaneous localisation and mapping at the level of objects. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 1352–1359, June 2013.
 [28] J. Shotton, B. Glocker, C. Zach, S. Izadi, A. Criminisi, and A. Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.
 [29] M. Suzuki, K. Nakayama, and Y. Matsuo. Improving bidirectional generation between different modalities with variational autoencoders. arXiv preprint arXiv:1801.08702, 2018.
 [30] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein autoencoders. arXiv preprint arXiv:1711.01558, 2017.
 [31] R. Vedantam, I. Fischer, J. Huang, and K. Murphy. Generative models of visually grounded imagination. arXiv preprint arXiv:1705.10762, 2017.

 [32] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 1096–1103, New York, NY, USA, 2008. ACM.
 [33] F. Walch, C. Hazirbas, L. Leal-Taixé, T. Sattler, S. Hilsenbeck, and D. Cremers. Image-based localization using LSTMs for structured feature correlation. In Int. Conf. Comput. Vis. (ICCV), pages 627–637, 2017.
 [34] W. Wang, X. Yan, H. Lee, and K. Livescu. Deep variational canonical correlation analysis. arXiv preprint arXiv:1610.03454, 2016.
 [35] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in neural information processing systems, pages 2746–2754, 2015.
 [36] Y. Yoo, S. Yun, H. J. Chang, Y. Demiris, and J. Y. Choi. Variational autoencoded regression: high dimensional regression of visual data on complex manifold. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3674–3683, 2017.
Appendix A Derivation of the Training Objective (Eq. (2))
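With images $x$, poses $y$ and latent variables $z$, the lower bound follows from
$$\begin{aligned}
\log p(x \mid y) &= \int q(z \mid x)\, dz \cdot \log p(x \mid y) && \text{(first term integrates to 1)} \\
&= \int q(z \mid x) \left[\log p(x, z \mid y) - \log p(z \mid x, y)\right] dz && \text{(property of logarithm)} \\
&= \int q(z \mid x) \log \frac{p(x, z \mid y)}{q(z \mid x)}\, dz + \int q(z \mid x) \log \frac{q(z \mid x)}{p(z \mid x, y)}\, dz && \text{(add the term $\log q(z \mid x)$, and subtract it immediately)} \\
&\ge \int q(z \mid x) \log \frac{p(x, z \mid y)}{q(z \mid x)}\, dz && \text{(nonnegative KL-divergence)} \\
&= -\mathrm{KL}\left(q(z \mid x) \,\|\, p(z \mid y)\right) + \mathbb{E}_{q(z \mid x)}\left[\log p(x \mid z)\right] && \text{(conditional independence assumption $p(x \mid z, y) = p(x \mid z)$, property of logarithm)}
\end{aligned}$$
Intuitively, the first term minimizes the difference between $q(z \mid x)$ and $p(z \mid y)$, and the second term makes sure that we can predict the scene $x$.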