or object pose estimation [do2018deep]. In contrast, novel view synthesis works in the opposite direction, mapping the camera pose and a 3D scene representation back to the 2D image observed from that view [eslami2018neural, sitzmann2019scene]. A fundamental problem in both lines of work is to find effective representations of the camera pose [zhou2019continuity]. Existing methods represent the agent's position in 3D Cartesian coordinates, while the 3D orientation can be represented by Euler angles, axis-angle, SO(3) rotation matrices, quaternions, or log quaternions. These representations are defined in manually designed coordinate systems where each dimension carries highly abstract semantics, which can be suboptimal when involved in optimization with deep neural networks. It is therefore desirable to have learning-based representations for camera poses.
Recently, [gao2018learning] proposed a representational model of grid cells in the entorhinal cortex of mammalian brains. Grid cells have been found to participate in mental self-navigation, and they fire at strikingly regular hexagonal grids of positions when the agent moves within an open field. The representational model in [gao2018learning] consists of a vector representation of the agent's self-position, coupled with a matrix representation of the agent's self-motion. When the agent undergoes a certain self-motion in the 2D space, the vector of self-position is rotated by the matrix of self-motion on a 2D sub-manifold in the mental space. Such a model achieves self-navigation and learns the hexagonal grid patterns of grid cells, and thus holds promise for biological plausibility.
Inspired by [gao2018learning, gao2020representational], we propose an approach towards learning a neural representation of camera pose, coupled with a representation of local camera movement. Specifically, given 2D posed images of a 3D scene and their corresponding camera poses, we assume a shared vector representation for the underlying 3D scene and a distinct vector representation for the camera pose of each 2D image. When the camera has a local displacement, the vector of the 3D scene remains unchanged while the vector of the camera pose is rotated by the matrix representation of the camera movement (Figure 1). We further parametrize the matrix representation by a matrix Lie group and the corresponding matrix Lie algebra. The vector representations of camera poses and matrix representations of camera movements can be shared across multiple scenes, so that they can be learned from multiple scenes to boost performance. The vectors of the 3D scene and the camera pose are concatenated together to generate the 2D image through a decoder network (Figure 2). The model is learned with only posed 2D images and camera poses, without extra knowledge such as depths or shapes. We perform various experiments on synthetic and real datasets in the context of novel view synthesis and camera pose regression.
The contributions of our work include:
We propose a method for learning neural camera pose representation coupled with neural camera movement representation.
We associate this representational model with the agent’s visual input through a generative model.
We demonstrate that the learned neural representation is effective as the target representation in camera pose regression.
2 Related work
2.1 Representing camera orientation and position
The simplest way to represent orientation is by Euler angles. However, as [kendall2015posenet, kendall2017geometric] point out, the Euler angle representation wraps around at 2π and is not injective for 3D rotation, and thus can be difficult to learn. [balntas2018relocnet] uses SO(3) rotation matrices to represent the relative rotation between a pair of images. SO(3) rotation matrices are an over-parameterized representation of rotation with the property of orthonormality; however, it is in general difficult to enforce the orthonormality constraint when learning an SO(3) representation through back-propagation. [ummenhofer2017demon, mahendran20173d] use the axis-angle representation, which represents 3D orientation by the direction of the rotation axis together with the magnitude of the rotation. Similar to Euler angles, this representation also suffers from repetition around 2π radians. PoseNet and its variants [kendall2015posenet, kendall2017geometric] propose to use quaternions. Quaternions, or more specifically quaternions with unit length, are a 4D continuous and smooth representation of rotation. MapNet [brahmbhatt2018geometry] further proposes to use log quaternions to avoid over-parametrization. These quaternion-based methods achieve state-of-the-art results in absolute camera pose regression. [zhou2019continuity] argues that these representations are not continuous and proposes alternative 5D and 6D representations for orientation. All these representations are manually designed and pre-defined. In contrast, [gao2020representational] proposes a neural representation of position and motion to explain the emergence of grid cell patterns. However, [gao2020representational] only considers motion in 2D space and does not take visual input into consideration. Our method can be seen as a generalization of [gao2020representational]: it models both position and orientation and their corresponding changes in 3D environments, and it associates the position representations with visual inputs. The concept of position embedding is also used in other areas such as natural language processing. For example, transformer-based models such as BERT [devlin2018bert] or GPT [radford2018improving] use a high-dimensional embedding of the position of a word in the sentence. These embeddings [vaswani2017attention, shaw2018self, gehring2017convolutional, wang2021position] can be either learnable or predefined. We introduce learnable representations for camera pose in 3D vision. Our rotation loss enforces translation invariance, which serves as a regularization on the learned representations.
2.2 Novel view synthesis
Learning neural 3D scene representation is a fundamental problem in 3D vision, and a compelling way to evaluate the learned representations is by novel view synthesis. One line of work [sitzmann2019scene, mildenhall2020nerf, tung2019learning] incorporates prior knowledge of rendering such as rotation and projection to enforce the consistency between different views, such as NeRF [mildenhall2020nerf]. Another theme [tatarchenko2016multi, worrall2017interpretable, eslami2018neural] learns neural representations purely from the perception of the agent, without extra 3D prior knowledge. Our model belongs to the latter. Different from previous methods, we also learn neural representations of the camera pose and camera movement, and the representations of 3D scene and camera pose are disentangled in an unsupervised manner. [tatarchenko2016multi, worrall2017interpretable] infer the scene representation from a single image or a pair of images, while [eslami2018neural] assumes that the representation can be obtained from a small batch of images. Compared to these methods, our model is able to utilize posed images of various scenes to update the shared camera pose representations.
2.3 Interpretable representation
In generative modeling, learning interpretable latent representations is a long-standing goal. Specifically, the aim is to learn latent vectors such that each dimension or sub-vector is aligned with an independent factor or concept. This can be done either with supervision [kulkarni2015deep, plumerault2020controlling] or without supervision [higgins2016beta, kim2018disentangling, karras2019style]. Besides vector representations, [litany2020representation, worrall2017interpretable, jayaraman2015learning, gao2019learning] learn matrix representations of image transformations that operate on the latent vector representation.
Our model is a combination of both vector and matrix representations. On the one hand, we disentangle the vector representations of each individual scene and camera pose. On the other hand, we model the movement of the camera pose by matrix representation, which is in the form of matrix Lie group and matrix Lie algebra. In terms of parametrization of the matrix representation, [worrall2017interpretable] uses predefined and fixed rotation matrix, [litany2020representation] learns a fixed matrix for each type of variation, and [jayaraman2015learning] parametrizes 2D ego-motion operated on 2D images. Different from these methods, we parametrize the matrix representation of camera movement as a nonlinear function of the movement in 3D that can take continuous values and operate on the vector representations of 3D scenes.
2.4 Deep pose regression models
Deep pose regression models [sattler2019understanding] can be categorized into absolute camera pose regression (APR) [kendall2015posenet, kendall2017geometric, brahmbhatt2018geometry], which directly predicts the camera pose given an input image, and relative camera pose regression (RPR) [saha2018improved, balntas2018relocnet, laskar2017camera], which predicts the pose of a test image relative to one or more training images. In this work, we adopt the APR setting, although the method can also be easily adapted to the RPR setting. Note that our focus is to compare the effectiveness of different camera pose representations, which is orthogonal to other methods that specifically target improving the performance of pose regression.
3 Representational model
Suppose an agent moves in a 3D environment with head rotations. There are at most 6 degrees of freedom (DOF), i.e., the 3D position of the agent and its 3D head orientation. We denote them jointly as the pose of the agent, $x = (x_1, \dots, x_6)$. Following the idea of [gao2020representational], we encode each DOF $x_i$ by a $d$-dimensional sub-vector $v(x_i)$. From the embedding point of view, we essentially embed the 1D domain of $x_i$ as a 1D manifold in the higher-dimensional space $\mathbb{R}^d$. We limit each sub-vector to have unit length, i.e., we further assume the 1D manifold to be a circle. For notational simplicity, we concatenate the sub-vectors into a pose vector $v(x)$. When the camera makes a movement $\delta x$, the camera pose changes from $x$ to $x + \delta x$. See Figure 1 for an illustration of our proposed framework.
3.1 Modeling movement as vector rotation
We start by considering an infinitesimal camera movement $\delta x$. For each DOF $x_i$, we propose the following model:
$$v(x_i + \delta x_i) = M(\delta x_i)\, v(x_i),$$
where $M(\delta x_i)$ is a matrix depending on $\delta x_i$. Given that $\delta x_i$ is infinitesimal, the model can be further parametrized as
$$M(\delta x_i) = I + B_i\, \delta x_i,$$
where $I$ is the identity matrix and $B_i$ is a matrix that needs to be learned. We assume $B_i$ to be skew-symmetric, i.e., $B_i^\top = -B_i$. This assumption guarantees that $M(\delta x_i) M(\delta x_i)^\top = I + O(\delta x_i^2)$, i.e., $M(\delta x_i)$ is approximately an orthogonal matrix. From the geometric perspective, the model maps a movement along the axis of $x_i$ in 1D space to a rotation of the vector $v(x_i)$ in the high-dimensional latent space. In practice, we only need to parametrize the upper triangle of $B_i$ as trainable parameters, and the lower triangle is taken to be the negative of the upper triangle. We further assume $B_i$ to be block-diagonal so that the total number of parameters can be greatly reduced. If there are movements on multiple DOFs, we only need to rotate each sub-vector independently.
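As an illustration, the following NumPy sketch (variable names are ours, not from the paper) assembles such a block-diagonal skew-symmetric generator from free upper-triangular parameters, using the paper's 96-dimensional configuration with six 16 × 16 blocks, and checks that the induced M = I + Bδ is approximately orthogonal for a small δ:

```python
import numpy as np

def make_skew_block_diag(upper_params, block_size, num_blocks):
    """Assemble a block-diagonal skew-symmetric matrix B from free
    upper-triangular parameters (one parameter vector per block)."""
    d = block_size * num_blocks
    B = np.zeros((d, d))
    iu = np.triu_indices(block_size, k=1)  # strict upper triangle of a block
    for b, params in enumerate(upper_params):
        blk = np.zeros((block_size, block_size))
        blk[iu] = params
        blk = blk - blk.T                  # lower triangle = -upper triangle
        s = b * block_size
        B[s:s + block_size, s:s + block_size] = blk
    return B

rng = np.random.default_rng(0)
block_size, num_blocks = 16, 6
params = [rng.normal(size=block_size * (block_size - 1) // 2)
          for _ in range(num_blocks)]
B = make_skew_block_diag(params, block_size, num_blocks)

# Since B is skew-symmetric, M = I + B*delta satisfies
# M M^T = I - B^2 delta^2 + O(delta^3), i.e. approximately orthogonal.
delta = 1e-3
d = block_size * num_blocks
M = np.eye(d) + B * delta
err = np.abs(M @ M.T - np.eye(d)).max()
```

The block-diagonal structure also means a movement rotates each block of the sub-vector independently, which is what keeps the parameter count small.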
As pointed out by [gao2020representational], equations 1 and 2 can be justified as a minimally simple recurrent model. To model the movement in the latent space, the most general form is $v(x_i + \delta x_i) = F(v(x_i), \delta x_i)$, i.e., the pose vector for the new pose is a function of the one for the old pose and the movement. Given that $\delta x_i$ is infinitesimal, we can apply a first-order Taylor expansion: $F(v(x_i), \delta x_i) \approx F(v(x_i), 0) + F'(v(x_i), 0)\, \delta x_i$, where $F'$ denotes the first derivative of $F$ with respect to the movement, i.e., $F'(v, 0) = \partial F(v, \delta) / \partial \delta \,|_{\delta = 0}$. When the movement $\delta x_i = 0$, we should have $F(v(x_i), 0) = v(x_i)$. Then a minimally simple model is to assume $F'(v, 0)$ to be a linear transformation of $v$, i.e., $F'(v, 0) = B_i v$, so that $v(x_i + \delta x_i) = (I + B_i \delta x_i)\, v(x_i)$. As we will discuss in 3.3, for a finite movement $\Delta x_i$, we recurrently apply the model of infinitesimal movements, so that the matrix representation becomes a matrix Lie group.
3.2 Polar system for position change in 2D
If the movement of the agent is constrained to a 2D environment, we follow [gao2020representational] and use a polar coordinate system to model the change of position, which corresponds to the egocentric perspective and could potentially be more biologically plausible. Specifically, let $z = (z_1, z_2)$ be the position of the agent in the 2D space. Instead of using an individual vector for each coordinate, we represent the position by a single vector $v(z)$, and the movement is captured by its direction $\theta$ and distance $\delta r$, so that $\delta z = (\delta r \cos\theta, \delta r \sin\theta)$. The representational model under this polar coordinate system is:
$$v(z + \delta z) = M(\theta, \delta r)\, v(z), \quad M(\theta, \delta r) = I + B(\theta)\, \delta r.$$
The matrix $B(\theta)$ is a function of $\theta$ and is skew-symmetric. $M(\theta, \delta r)$ models the change of position along the direction $\theta$. If the agent changes the direction of movement from $\theta$ to $\theta + \delta\theta$, then we assume
$$B(\theta + \delta\theta) = R(\delta\theta)\, B(\theta)\, R(\delta\theta)^\top, \quad R(\delta\theta) = I + C\, \delta\theta,$$
where $C$ is another skew-symmetric matrix to learn. The geometric interpretation is that if the agent changes direction, $B(\theta)$ is rotated by the matrix $R(\delta\theta)$.
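The geometric claim, that rotating a skew-symmetric generator by an orthogonal matrix yields another valid skew-symmetric generator, can be checked directly. The sketch below is our own illustration (not the paper's exact parametrization): it builds the rotation as a matrix exponential of a small skew-symmetric C via a truncated power series:

```python
import numpy as np

def expm_series(A, terms=30):
    """Matrix exponential via truncated power series (adequate for small A)."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

def skew(A):
    return A - A.T

rng = np.random.default_rng(0)
d = 16
B_theta = 0.1 * skew(rng.normal(size=(d, d)))  # skew-symmetric B(theta)
C = 0.1 * skew(rng.normal(size=(d, d)))        # skew-symmetric C

R = expm_series(C * 0.05)      # rotation induced by a small direction change
B_new = R @ B_theta @ R.T      # B(theta) conjugated ("rotated") by R

# exp of a skew-symmetric matrix is orthogonal, and conjugation by an
# orthogonal matrix preserves skew-symmetry, so B_new is a valid generator.
orth_err = np.abs(R @ R.T - np.eye(d)).max()
skew_err = np.abs(B_new + B_new.T).max()
```

This is why a single learned C suffices to relate the generators for all movement directions.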
For camera movement in a 3D environment, such a coupled representation in polar coordinates would require too many matrix representations to learn. Therefore, we restrict its use to 2D space, and for general 3D movements we use the vector-matrix representations that are disentangled for each DOF, as proposed in 3.1.
3.3 Matrix Lie group for finite movement
So far we have discussed the formulation for infinitesimal movements. In this subsection we generalize to finite movements. Suppose the agent has a finite movement $\Delta x_i$ along the axis of $x_i$. We can divide this movement into $N$ steps, so that as $N \to \infty$, $\delta x_i = \Delta x_i / N \to 0$, and
$$M(\Delta x_i) = \lim_{N \to \infty} \left(I + B_i \frac{\Delta x_i}{N}\right)^N = \exp(B_i \Delta x_i).$$
This underlies the relationship between the matrix Lie algebra and the matrix Lie group. Specifically, the set of $M(\Delta x_i) = \exp(B_i \Delta x_i)$ for all $\Delta x_i$ forms a matrix Lie group. The tangent space of this group at the identity is the corresponding matrix Lie algebra; $B_i$ is the basis of this tangent space and is also called the generator matrix.
For a finite but small $\delta x_i$, $M(\delta x_i)$ can be approximated by a second-order Taylor expansion:
$$M(\delta x_i) \approx I + B_i\, \delta x_i + \frac{1}{2} B_i^2\, \delta x_i^2.$$
For a large finite change along each axis, we can divide it into a series of small finite changes, expand each change using the second-order Taylor expansion, and multiply the resulting matrices together.
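Both facts can be checked numerically. The sketch below (with a random, not learned, generator) verifies that the compounded infinitesimal steps converge to the matrix exponential, and that chaining second-order Taylor expansions of small steps closely approximates a large finite movement:

```python
import numpy as np

def expm_series(A, terms=30):
    """Matrix exponential via truncated power series (adequate for small A)."""
    out = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k
        out = out + term
    return out

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 8))
B = 0.1 * (A - A.T)                      # skew-symmetric generator
delta = 1.0                              # a finite movement
exact = expm_series(B * delta)           # M(delta) = exp(B * delta)

# (I + B*delta/N)^N converges to exp(B*delta) as N grows.
N = 10000
approx = np.linalg.matrix_power(np.eye(8) + B * delta / N, N)
err_limit = np.abs(approx - exact).max()

# Chaining second-order Taylor expansions of many small steps.
n_steps = 100
h = delta / n_steps
step = np.eye(8) + B * h + (B @ B) * h * h / 2
chained = np.linalg.matrix_power(step, n_steps)
err_taylor = np.abs(chained - exact).max()

# exp of a skew-symmetric matrix is exactly orthogonal.
orth_err = np.abs(exact @ exact.T - np.eye(8)).max()
```

The exact exponential of a skew-symmetric generator is orthogonal, so the norm of the rotated pose vector is preserved.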
3.4 Theoretical understanding of our model
A deep theoretical result from mathematics, namely the Peter-Weyl theorem [taylor2002lectures], inspires our work. It states that for a compact Lie group, if we can find an irreducible unitary representation, i.e., each element $g$ of the group is represented by a unitary (or orthogonal) matrix $M(g)$, then the matrix elements $(M_{jk}(g))$ form a set of orthogonal basis functions for general functions of $g$. This is a deep generalization of Fourier analysis. In our case, the learned vector representations are linear combinations of the above basis functions, and their elements serve as a more compact set of basis functions for representing general functions of the pose. Our method can be used to represent the pose of cameras and objects in general. The continuous changes of the pose in the physical space generally form a Lie group, and our learned vector-matrix system forms a representation of the pose and its changes in the neural space.
3.5 Implementation of pose representation
Suppose we want to learn the representation of an axis $x_i$, whose value ranges in a finite interval. For orientation, the angle has range $[0, 2\pi)$, while for position, we can predefine the largest range within which the agent can move. We divide this range into multiple grids and learn an individual vector at each grid point. Given an arbitrary value of $x_i$, we first find its nearest grid point and the corresponding vector representation, and then rotate this vector to the target value by the matrix representation, which depends on the distance between the nearest grid point and the target value. See Figure 2. Since we can set the grid length to be relatively small, the distance between the grid point and the target value is also small, so we can use the second-order Taylor expansion in Equation 6 to approximate the matrix representation.
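A minimal sketch of this lookup-then-rotate scheme (the grid size, dimension, and the random stand-ins for the learned quantities are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_grid = 16, 36                       # e.g. orientation split into 36 grids
grid = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)
V = rng.normal(size=(n_grid, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit-length grid vectors

A = rng.normal(size=(d, d))
B = 0.1 * (A - A.T)                      # stand-in for the learned generator

def pose_vector(x):
    """Nearest grid vector, rotated to x with a 2nd-order Taylor
    expansion of M(delta) = exp(B * delta)."""
    i = int(np.argmin(np.abs(grid - x)))   # nearest grid point (no wrap-around)
    delta = x - grid[i]                    # small: at most half a grid cell
    M = np.eye(d) + B * delta + (B @ B) * delta ** 2 / 2
    return M @ V[i]

v = pose_vector(0.3)
```

Because the residual rotation is nearly orthogonal, the resulting vector stays close to unit length, matching the unit-norm constraint on the sub-vectors.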
3.6 Decoding to posed 2D images
To associate the camera pose representation with visual input, more specifically the posed 2D images, we propose a decoder or emission model. For each 3D scene, suppose we are given multiple 2D images $\{I_k\}$ and the corresponding camera poses $\{x_k\}$. We assume a shared vector representation $s$ of the 3D scene, and obtain the vector representation $v(x_k)$ of each camera pose as described in 3.5. We learn a decoder $D$ that maps $s$ and $v(x_k)$ to the image space to reconstruct the observed image:
$$\hat{I}_k = D(s, v(x_k); \phi),$$
where $\phi$ denotes the parameters of the decoder network.
4 Learning and inference
4.1 Learning through view synthesis
For a general 3D environment, the unknown parameters of the proposed model include (1) the vectors $v(x)$ at the grid positions, (2) the matrices $B_i$ for each DOF, and (3) the parameters of the decoder network. To learn these parameters, we define a loss function with two terms. The first term is the reconstruction loss for view synthesis, which enforces the decoding of the pose and scene representations to reconstruct the observation; its expectation is estimated by Monte Carlo samples. The second term is the rotation loss, which constrains the learned pose representations so that representations of different poses can be transformed into each other according to our representational model (Equation 5). The expectation in the rotation loss can be approximated by randomly sampled pairs of poses $x$ and $x'$ that are relatively close to each other, which means that we have an infinite amount of data for this loss term.
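A sketch of the rotation loss term (all names are ours; for simplicity the Monte Carlo sampling over nearby pose pairs is replaced by a sweep over adjacent grid points):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_grid = 16, 36
h = 2 * np.pi / n_grid                    # spacing between grid poses

def skew(A):
    return A - A.T

B = 0.1 * skew(rng.normal(size=(d, d)))   # stand-in learned generator
M_step = np.eye(d) + B * h + (B @ B) * h * h / 2   # 2nd-order M(h)

# A perfectly consistent pose system: each grid vector is the previous
# one rotated by M(h).
v0 = rng.normal(size=d)
v0 /= np.linalg.norm(v0)
V = np.stack([np.linalg.matrix_power(M_step, i) @ v0 for i in range(n_grid)])

def rotation_loss(V, B, h):
    """Rotation loss over neighboring grid poses: v(x + h) should equal
    M(h) v(x) under the learned generator B."""
    d = V.shape[1]
    M = np.eye(d) + B * h + (B @ B) * h * h / 2
    diffs = V[1:] - V[:-1] @ M.T          # row i holds V[i+1] - M @ V[i]
    return float((diffs ** 2).sum(axis=1).mean())

loss_consistent = rotation_loss(V, B, h)                       # ~0
loss_random = rotation_loss(rng.normal(size=(n_grid, d)), B, h)  # large
```

In training, this term is minimized jointly with the reconstruction loss, pulling the grid vectors and the generator toward mutual consistency.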
If the movement of the camera is in a 2D space and we employ the polar coordinate system, then part (2) of the unknown parameters becomes $B(\theta)$ for any $\theta$, together with $C$. The loss function is defined analogously, where the reconstruction and rotation losses follow Equation 8, adapted to the polar parametrization.
For training, we minimize the total loss by iteratively updating the decoder (as well as the scene representation $s$) and the pose representation system $\{v, B_i\}$ for each DOF. In practice, the decoder is parameterized by a multi-layer deconvolutional neural network. Besides the latent vector at the top of the decoder, we also learn a scene-dependent vector at each of the following layers using AdaIN [huang2017arbitrary]. We normalize the scene vector at the top layer of the decoder to have unit norm so that it has the same magnitude as the pose representation; we find this helps optimization. More details can be found in the Supplementary.
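The scene-dependent modulation via AdaIN can be sketched as follows (a simplified NumPy version; in the actual model the per-channel scale and shift are predicted by learned layers from the scene representation):

```python
import numpy as np

def adain(x, scene_vec, W_gamma, b_gamma, W_beta, b_beta):
    """Scene-conditioned adaptive instance normalization: normalize each
    channel of the feature map x, then scale/shift with parameters
    predicted linearly from the scene vector."""
    mu = x.mean(axis=(1, 2), keepdims=True)           # per-channel mean
    sigma = x.std(axis=(1, 2), keepdims=True) + 1e-5  # per-channel std
    gamma = W_gamma @ scene_vec + b_gamma             # per-channel scale
    beta = W_beta @ scene_vec + b_beta                # per-channel shift
    return gamma[:, None, None] * (x - mu) / sigma + beta[:, None, None]

rng = np.random.default_rng(0)
C, H, W, S = 8, 4, 4, 6
x = rng.normal(size=(C, H, W))                        # decoder feature map
scene_vec = rng.normal(size=S)                        # scene representation

# With zero weights, gamma = 1 and beta = 0: plain instance normalization.
out = adain(x, scene_vec,
            W_gamma=np.zeros((C, S)), b_gamma=np.ones(C),
            W_beta=np.zeros((C, S)), b_beta=np.zeros(C))
```

Conditioning intermediate layers this way lets a single decoder render many scenes while the pose representation stays shared.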
4.2 Inference by pose regression
With the learned pose representation, we can then use it as the target output for camera pose regression. Specifically, for each DOF, we train a separate inference network that maps the observed 2D image to the pose representation $v(x_i)$. The loss function is defined as the distance between the inferred and learned pose representations.
The inference network is parameterized by a convolutional neural network, with some scene-dependent parameters introduced via AdaIN. For different DOFs, the inference networks share common lower layers but have different top fully-connected layers.
For testing, given an unseen image, we obtain the inferred pose representation from our inference model and decode the predicted pose by searching for the pose whose learned representation best matches the inferred one.
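One plausible implementation of this decoding step, sketched in NumPy (the grid, vectors, and names here are illustrative, not the paper's):

```python
import numpy as np

def decode_pose(v_hat, V, grid):
    """Return the grid pose whose learned (unit-length) representation has
    the largest inner product with the inferred vector v_hat."""
    return float(grid[int(np.argmax(V @ v_hat))])

rng = np.random.default_rng(0)
d, n_grid = 32, 64
grid = np.linspace(0.0, 2 * np.pi, n_grid, endpoint=False)
V = rng.normal(size=(n_grid, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)    # learned unit grid vectors

v_hat = V[10] + 0.05 * rng.normal(size=d)        # noisy inferred representation
pred = decode_pose(v_hat, V, grid)
```

Because the grid vectors live in a high-dimensional space and are far apart, the nearest-representation search tolerates moderate inference noise, which is consistent with the robustness results in Section 5.3.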
5 Experiments
In this section, we demonstrate the efficacy of our learned pose representations in both view synthesis and pose regression tasks. For view synthesis, we mainly compare with the Generative Query Network (GQN) [eslami2018neural]. For pose regression, we compare our learned neural representations of camera pose with other commonly used pose representations, including Euler angles, the sinusoidal representation used in GQN, and the quaternion (as well as log quaternion) representations used in PoseNet [kendall2015posenet, kendall2017geometric] and MapNet [brahmbhatt2018geometry], by evaluating pose estimation accuracy. More implementation details can be found in the Supplementary. Our code and pretrained models can be found at https://github.com/AlvinZhuyx/camera_pose_representation.
5.1 Datasets
GQN rooms. GQN [eslami2018neural] introduces a synthetic dataset with 2 million synthetic scenes, where each scene contains various objects, textures, and walls. The agent can navigate in a 2D space and rotate its head horizontally. Each scene comes with 10 rendered RGB images. We use the version of the dataset where the camera moves freely and the objects do not rotate around their axes. We sample a subset of scenes from the dataset; for each scene, we use 9 images for training and hold out the remaining image for testing. Since this dataset contains a large number of simple scenes with a small number of images each, instead of learning an individual scene representation vector for each scene, we learn an encoder to compute the scene representation online, similar to [eslami2018neural]. Since the agent has 2 DOFs for position and 1 DOF for orientation, our pose vector contains one position sub-vector in the polar coordinate system and one orientation sub-vector. Each sub-vector has 96 dimensions. We assume that each $B$ is block-diagonal with six blocks, each of size 16 × 16.
ShapeNet v2. We use the images generated by [sitzmann2019scene] from the car category of the ShapeNet v2 dataset [chang2015shapenet], which contains 2,151 object models. For each scene, the instance is located at the center of a sphere. The virtual agent moves on the surface of this sphere, with its camera pointing at the center. Therefore, the agent has 2 DOFs, and we use 2 orientation angles to denote its position on the sphere. Each instance comes with 500 rendered RGB views, from which we randomly sample 100 images for training and leave the rest for testing. The pose representation contains two sub-vectors for the two orientation angles; each sub-vector has a dimension of 96, and each $B$ has six blocks. We learn an individual scene representation vector for each instance.
Gibson Environment. The Gibson Environment [xia2018gibson] provides tools for rendering images of different views in a room, which we use to generate a synthetic dataset that we refer to as Gibson rooms. Specifically, we select 20 areas of size 2m × 2m from different rooms. For each area, we randomly render about 28k RGB images of different views. We fix the camera height and constrain the camera to rotate only horizontally. Compared to GQN rooms and ShapeNet car, this synthetic dataset contains more realistic and complicated indoor scenes, which makes it more challenging. Moreover, it includes fewer scenes, while for each scene images from abundant views are provided, so incorporating view-based information becomes very important. The agent has 2 DOFs for position and 1 DOF for orientation, corresponding to a position sub-vector in the polar coordinate system and one orientation sub-vector. The dimensions of the sub-vectors and of $B$ are the same as those for the GQN rooms dataset.
7 Scenes Dataset. Microsoft 7 Scenes [shotton2013scene] is a widely used dataset for camera pose estimation. It contains RGB-D images of seven different indoor scenes, each with several trajectories for training and testing. In our experiments, we follow the training and testing split in [shotton2013scene] and use only RGB images, without depth information. We translate and align the position coordinates of the scenes so that all trajectories lie within a 4m × 1.5m × 3m cuboid. The agent has 6 DOFs, so the pose representation vector contains 6 sub-vectors. Each sub-vector has a dimension of 32, and each $B$ has four 8 × 8 blocks. We mainly use this dataset for camera pose regression. We resize the images to 128 × 128 when training the decoder and the pose representation system. We use shared pose representations for all seven scenes and a distinct scene representation for each of them. For pose regression, following [kendall2017geometric, brahmbhatt2018geometry], we train an individual inference model for each scene and resize the input images so that the shortest side has length 256.
5.2 Novel view synthesis
The first question is whether our learned pose representation is meaningful. We answer it by testing the learned representations on the novel view synthesis task. The experimental results demonstrate that our learned representations can generate novel views of high quality. Figure 3 shows the qualitative results, and Figure 4 shows the quantitative results in terms of Peak Signal-to-Noise Ratio (PSNR). We compare with GQN, using the implementation by [GQN:2018] and the same training and testing splits as ours. We use 8 generation layers and set the shared-core option to False, adding extra convolution and de-convolution layers when dealing with larger images. This GQN implementation has 114M parameters in total; in contrast, our model has fewer than 9M.
From Figure 4 (where noise magnitude 0.0 corresponds to the novel view synthesis test result), we see that for the GQN rooms dataset, our model obtains slightly worse but comparable results to the GQN model. For the ShapeNet car dataset, which contains complex instances, our model generates more consistent and clearer results than GQN. For the Gibson rooms dataset, which is more complicated, GQN fails to capture the relationship between views: the reconstruction only captures some specific views and does not generalize to others. In contrast, our learned model is able to generate a query view corresponding to our pose representation. This is probably because the 3D scene representations in our method are learned from all the 2D posed images of the scene.
Figure 4 (caption): if the $i$-th element of the position vector has standard deviation $\sigma_i$, we add Gaussian noise scaled accordingly to the corresponding element. Noise magnitude 0.0 corresponds to the novel view synthesis test result. We compare with GQN on three datasets.
[Table 1: orientation errors (degrees) and position errors (m) of different pose representations, including 3D coordinates combined with axis-angle, quaternions, log quaternions, and Euler angles, versus ours, on the ShapeNet car, GQN rooms, and Gibson rooms datasets.]
5.3 Robustness to pose noise
Next, we answer why we need such a neural representation and what its advantage is over directly using the 6-DOF coordinate representation for novel view synthesis. One critical piece of supporting evidence is that our learned neural pose representation is more robust to noise. Specifically, Figure 4 shows the change in PSNR for our model versus the GQN model when Gaussian noise of various magnitudes is added to the pose representations. We observe that the performance of the GQN model degrades quickly as the magnitude of the added noise increases. This is not surprising, since GQN directly uses the coordinate representation for position and orientation and is thus vulnerable to noise interference. In contrast, our learned representation embeds the camera pose in a high-dimensional space and is further regularized by the rotation loss, and is thus more robust to noise.
[Table 2: per-scene median errors on the 7 Scenes dataset for PoseNet17 [kendall2017geometric], PoseNet + log quaternions [brahmbhatt2018geometry], our reproduction of the latter (*), and our method.]
5.4 Inference results
We further demonstrate that our learned representation serves effectively as the target output of pose regression. In the camera pose regression task, the camera position is usually represented by 3D coordinates, while the camera orientation can be represented in various ways. The most straightforward is the Euler angle; another is the axis-angle representation. In [eslami2018neural], the authors use a sinusoidal representation for each orientation angle. Besides, unit quaternions and the logarithm of unit quaternions are two other popular representations used in pose regression [kendall2015posenet, kendall2017geometric, brahmbhatt2018geometry]. In contrast to those methods, we use learned neural representations for both camera position and orientation. We conduct pose regression experiments on all four datasets mentioned above. For our representation, the Euler angle representation, and the sinusoidal representation, we use the mean squared error loss for regression; on the 7 Scenes dataset, we additionally use a norm loss for our representation. For the quaternion and log quaternion representations, as suggested by [brahmbhatt2018geometry], we use an $\ell_1$ norm loss. For the axis-angle representation, we find that a norm loss leads to better results on the ShapeNet car dataset. For the Gibson rooms and GQN rooms datasets, since the agent can only rotate its head horizontally, the axis-angle representation degenerates to a single Euler angle. For the two quaternion-related baselines, we employ the automatic weight tuning method proposed in [kendall2017geometric] to make a fair comparison. Note that the main focus of this work is to compare different pose representations, and thus we do not include other improvement techniques (e.g., including unlabeled data or relative pose losses between image pairs), as we consider them orthogonal to improving the pose representations. More details can be found in the Supplementary.
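The automatic weight tuning of [kendall2017geometric] balances the position and rotation loss terms with learned log-variances; a one-line sketch of that weighting (the function name is ours; in practice s_x and s_q are learned jointly with the network):

```python
import numpy as np

def weighted_pose_loss(loss_pos, loss_rot, s_x, s_q):
    """Homoscedastic uncertainty weighting of position and rotation
    losses, in the spirit of PoseNet17: s_x, s_q are learnable
    log-variance scalars; exp(-s) down-weights its term while the
    additive +s penalty keeps s from growing unboundedly."""
    return loss_pos * np.exp(-s_x) + s_x + loss_rot * np.exp(-s_q) + s_q
```

For a fixed loss value L, the weighted term L*exp(-s) + s is minimized at s = log L, so the weighting adapts each term's scale automatically during training.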
We first show the comparison results on the GQN rooms, ShapeNet car, and Gibson rooms datasets in Table 1. For a fair comparison, we keep the same network structure for all the representations on each dataset and only change the final output layer. Since the dimension of our learned representation is higher than that of the baseline representations, we add another fully-connected layer to the baseline inference networks so that the inference models have roughly the same number of parameters across different pose representations. According to Table 1, our representation consistently outperforms all the others, especially for orientation regression. For most configurations, our representation yields the best results in both orientation and position prediction. On the GQN rooms dataset, the quaternion and log quaternion representations achieve slightly better results in position prediction, but their orientation predictions are much worse than ours. A possible explanation is that we embed both the camera position and orientation as neural representations, so they are more consistent with each other. Besides, representing the rotation angles on a hypersphere in a high-dimensional space may also make them easier for the model to regress.
We further compare our learned pose representations with the popular quaternion and log quaternion representations on the 7 Scenes dataset using PoseNet. Following [brahmbhatt2018geometry], we use a pre-trained ResNet34 as the feature extractor and 6 parallel fully-connected (FC) layers to predict the 6 pose sub-vectors. We employ color jittering as data augmentation and remove the dropout in the FC layers. The results are shown in Table 2. We compare with [kendall2017geometric, brahmbhatt2018geometry], and we also run the code provided by [brahmbhatt2018geometry] to re-train their model and report the results. The difference between the reported values and our reproduced results is probably due to randomness and different software versions [1].
[1] The code of [brahmbhatt2018geometry] is originally implemented in Python 2.7 and PyTorch 0.4.0; we made minor adaptations to run it in Python 3.6 and PyTorch 1.2.0.
Following the convention on this dataset, we report the median errors of position and orientation predictions. The results show that, on average, our model outperforms all the baselines.
6 Conclusion and Future Work
We propose a framework for learning neural vector representations for both camera poses and 3D scenes, coupled with neural matrix representations for camera movements. The model is learned through novel view synthesis and can be used for camera pose regression. Our learned representation proves to be more robust against pose noise in the novel view synthesis task and works well as the estimation target for camera pose regression. We hope that our work can motivate further interest and study in learning neural representations for camera poses and joint representations for camera poses and 3D scenes. An interesting future direction is how to combine our method with the recent work of NeRF [mildenhall2020nerf], which uses sinusoidal functions of very high frequencies. Our model can be adapted to this new generative model structure and may be able to learn more flexible camera pose representations.
The work is supported by NSF DMS-2015577; DARPA XAI N66001-17-2-4029; ARO W911NF1810296; ONR MURI N00014-16-1-2007.
Appendix A Training details
In this section, we describe the details of our network structures and the hyperparameters used in the experiments. The main differences among the network structures used on different datasets come from: (i) the image size (larger images need more blocks); and (ii) the complexity of the scenes. For the 7Scenes and Gibson rooms datasets, the scenes are highly complex; therefore, besides the vector representation of the scene at the top layer, we apply scene-dependent instance normalization to multiple layers. For the GQN rooms dataset, which includes a huge number of scenes, we employ an encoder to compute the scene representations online. We use Adam [kingma2014adam] as the optimizer for all the experiments. The learning rate for each setting is given in the corresponding section below.
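For concreteness, scene-dependent instance normalization normalizes each feature map and then applies an affine transform whose parameters depend on the scene. A minimal 1-D sketch, our own simplification rather than the actual network code:

```python
import math

def instance_norm(x, gamma, beta, eps=1e-5):
    """Normalize one feature map to zero mean and unit variance,
    then apply scene-dependent affine parameters (gamma, beta)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]
```

In the actual model, gamma and beta are produced per scene (learned directly, or output by the encoder for the GQN rooms dataset), so the same generator can be modulated for different scenes.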
a.1 GQN rooms dataset
Generative experiment. Since this dataset contains a huge number of scenes, and each scene only has a few images, we encode the scene representations online instead of learning an individual vector representation for each scene. The encoder structure is shown in Figure 7(a). Specifically, the encoder encodes each scene as a scene vector that is fed to the top layer of the decoder, and it also encodes the parameters of the instance normalization [huang2017arbitrary] applied to multiple layers of the generator. Following [eslami2018neural], to summarize information across multiple images of the same scene, we sum up the encoded vectors and parameters of these images. The decoder structure is shown in Figure 7(b). We discretize the square space into 20 × 20 grids and learn a position vector at each grid. Similarly, we discretize the orientation into 36 grids (10° per grid) and learn an orientation vector at each grid. The training takes about four days on a single Titan RTX GPU.
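The grid discretization amounts to a simple lookup from a continuous pose to learned vectors. A sketch assuming a square room of side `room_size` (the room extent, table contents, and function names here are illustrative, not the actual implementation):

```python
N_POS, N_ORI = 20, 36  # 20x20 position grids, 36 orientation grids (10 deg each)

def grid_indices(x, y, theta_deg, room_size=2.0):
    # map a continuous pose in a square room to discrete grid indices
    ix = min(int(x / room_size * N_POS), N_POS - 1)
    iy = min(int(y / room_size * N_POS), N_POS - 1)
    it = int((theta_deg % 360.0) / 360.0 * N_ORI)
    return ix, iy, it

def pose_embedding(pos_table, ori_table, x, y, theta_deg):
    # look up the learned vectors; the pose representation is their concatenation
    ix, iy, it = grid_indices(x, y, theta_deg)
    return pos_table[ix][iy] + ori_table[it]  # list concatenation
```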
We train the model for one million iterations. At each iteration, we randomly sample 30 scenes, each containing ten images. We use the first six images of each scene to encode the scene representation and concatenate it with the pose representations of the next three images, which we use as reconstruction targets. We leave the last image for testing. For the rotation loss, we randomly sample 4000 pairs of poses at each iteration. The learning rate for the pose representations and the matrix representations of camera movements is 0.01, and the learning rate for the encoder, decoder, and scene representations is 0.0001. Here, we update all the learnable parameters together. We set the remaining loss hyperparameters to 0.05, 100, and 0.8, respectively.
For the baseline GQN network, we also train the model for one million steps. At each step, we feed in a batch of 64 scenes. The other parameters follow the original implementation.
Inference experiment. We show the inference model structure in Figure 7(c). As in the generative experiment, we use an encoder to encode the scene and the parameters of the instance normalization online. The encoder structure is the same as the one used in the generation task, except that we do not encode a vector representation at the top layer but instead encode another set of instance normalization parameters. We set the learning rate to 0.0001 for all the parameters. We train the inference model for 100,000 steps, feeding in 30 scenes at each iteration. For this dataset, we use the homoscedastic uncertainty method proposed in [kendall2017geometric] to automatically tune the weights between the pose prediction losses of position and orientation, and we set initial guesses for the logarithmic weights of the position and orientation losses accordingly. We use the same inference model structure for the baseline models, except that we add another fully-connected (FC) layer of size 196 so that they have approximately the same number of parameters as the model trained on our representations. We also train these models for 100,000 iterations with the same batch size. We tune the learning rate for each baseline model to make a fair comparison and use the same automatic weight tuning method for the two quaternion-related baselines. The initial guess for the logarithmic weight of the position loss is set to 0.0, and that of the orientation loss to -3.0, as suggested by [kendall2017geometric]. For our model and each of the baseline models, the training takes about 5 hours on a single Titan RTX GPU.
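The automatic weight tuning follows the homoscedastic-uncertainty formulation of [kendall2017geometric]: each task loss is scaled by exp(−s) and regularized by +s, where s is a learnable logarithmic weight. A minimal sketch, with variable names of our own choosing:

```python
import math

def weighted_pose_loss(l_pos, l_ori, s_pos, s_ori):
    # homoscedastic-uncertainty weighting: each task loss is scaled by
    # exp(-s) with a +s regularizer; s_pos and s_ori are learnable
    # log-weights, initialized as described in the text
    return l_pos * math.exp(-s_pos) + s_pos + l_ori * math.exp(-s_ori) + s_ori
```

With the baseline initialization s_pos = 0.0 and s_ori = -3.0, the orientation loss starts with an effective weight of exp(3) ≈ 20 relative to the position loss.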
a.2 ShapeNet car
Generative experiment. This dataset contains 2151 different cars. The heads of the cars are aligned to the same orientation, and the background is blank. Given the simplicity of this dataset, we do not use instance normalization. The vector representation of a scene has 128 dimensions, and we learn a separate vector representation for each scene instead of obtaining it from an encoder. The structure of the generator model is shown in Figure 8(a). For our pose representation system, we discretize the orientation into 36 grids and learn an individual orientation vector at each grid.
For each scene, we randomly sample 50 pairs of images as the training set and leave the others as the test set. The camera poses of the two images in each pair are close to each other, so that the change from one to the other can be approximated by the Taylor expansion of the matrix Lie group discussed in Section 3.3; this means we can apply the camera poses of the two images to the rotation loss. We train our model for 160,000 iterations, i.e., 1500 epochs. We randomly sample 20 scenes at each iteration, and for each scene, we sample 10 images (5 pairs). For the rotation loss, we randomly sample an additional 200 pairs of camera poses to compute the loss. The learning rate is set to 0.0001. We set the two loss hyperparameters to 0.05 and 50. At each iteration, we update the decoder once and the pose representation system three times. The training takes about four days on a single Titan RTX GPU.
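The first-order approximation that justifies training on nearby pose pairs can be illustrated on SO(2): for a small angle δ, the rotation matrix exp(δB) is close to I + δB, where B is the Lie algebra generator. This 2D toy example (the paper's learned rotation matrices are higher-dimensional) shows that the approximation error is second order in δ:

```python
import math

B = [[0.0, -1.0], [1.0, 0.0]]  # generator of SO(2): exp(theta * B) is a rotation

def rot(theta):
    # exact rotation matrix exp(theta * B)
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def rot_taylor(delta):
    # first-order Taylor expansion: I + delta * B
    return [[1.0 + delta * B[0][0], delta * B[0][1]],
            [delta * B[1][0], 1.0 + delta * B[1][1]]]

def max_abs_diff(M, N):
    # largest entrywise deviation between two 2x2 matrices
    return max(abs(M[i][j] - N[i][j]) for i in range(2) for j in range(2))
```

For δ = 0.05 rad the entrywise error is below δ², while for δ = 1 rad the linearization is clearly inadequate, which is why only close pose pairs are used for the rotation loss.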
For the baseline GQN model, we train the model for 500,000 steps. At each step, we randomly sample a batch of 36 scenes. For each scene, we randomly sample 15 images to infer the scene representation and another image as the reconstruction target. We use the same train-test split as our model for each scene.
Inference experiment. Since the head direction of each car is aligned to the same direction, the pose regression task follows the same rule across different scenes. Thus, we do not include scene-related parameters in our inference model. The structure of our inference model is shown in Figure 8(b). For each scene, we randomly sample 250 images as the training set and use the remaining 250 images as the test set. We train our model and all the baseline models for 500 epochs. At each iteration, we use 10 scenes and randomly sample 20 images from each scene. The learning rate is set to 0.001. We simply set the weights of the prediction losses of the two rotation vectors to 1.0 without further automatic tuning. For each baseline representation, we use the same inference model structure and add another fully-connected (FC) layer of size 256. We tune the learning rate carefully to make a fair comparison, and we use the automatic weight tuning method for the two quaternion-related baseline methods. The initial guess for the logarithmic weight of the orientation loss is set to -3.0, as suggested by [kendall2017geometric]. The training of our model and each baseline model takes about 8 hours on a single Titan RTX GPU.
a.3 Gibson rooms dataset
Generative experiment. This dataset contains complex scenes, so we apply scene-dependent instance normalization at multiple layers. The structure is shown in Figure 9(a). The scene vector representation has 768 dimensions, and the dimensions of the instance normalizations are summarized in Figure 9(a). We discretize the 2 m × 2 m square space into 40 × 40 grids, and we discretize the two orientation angles into evenly spaced grids.
For each scene, we randomly sample half of the data as the training set and use the rest as the test set. We train our model for 500k steps. At each iteration, we randomly choose four scenes and randomly sample 50 images from each. For the rotation loss, we randomly sample another 3000 pairs of poses. We use a learning rate of 0.0001 for the generator and 0.01 for the pose representation. At each iteration, we update the generator parameters once and the pose representation twice. We set the remaining loss hyperparameters to 0.01, 100, and 0.8, respectively. The training takes about five days on a single Titan RTX GPU.
For the baseline GQN model, we train the model for 500k steps. At each iteration, we randomly sample and predict 36 images. To predict each image, we randomly pick 15 images from the same scene to infer the scene representations.
Inference experiment. The inference structure is shown in Figure 9(b). We train the inference model for 25,000 steps for both our representation and the baseline representations. At each step, we randomly sample 4 scenes with 50 images from each scene. The learning rate for the model with our representation is 0.001. For this dataset, we find that simply setting the weight of the position prediction loss to 20 and the weight of the orientation prediction loss to 10 works well, so we do not employ the automatic weight tuning mechanism here. We tune the learning rate for each baseline model, and we apply the homoscedastic uncertainty method to automatically tune the weights for the quaternion-related representations; following [kendall2017geometric], the initial guess for the logarithmic weight of the position loss is set to 0.0, and that of the orientation loss to -3.0. For each baseline representation, we use the same inference model structure and add another fully-connected (FC) layer of size 192. For our model and each of the baseline models, the training takes about 5.5 hours on a single Titan RTX GPU.
a.4 7Scenes dataset
Generative experiment. For this dataset, we use the same generator structure as for the Gibson rooms dataset (see Figure 9(a)). Since this dataset contains less data than the Gibson rooms dataset, we set the dimension of the scene vector representation to 96. We discretize the whole region (4 m × 1.5 m × 3 m) into grids of 0.1 m × 0.1 m × 0.1 m each. The orientation is discretized into evenly sized grids.
We update the model for 100,000 steps. At each step, we randomly sample 16 pairs of images from each scene, and we randomly sample 3000 extra pairs of poses to estimate the rotation loss. We use a learning rate of 0.0001 for the generator and 0.001 for the pose representation system. At each iteration, we update the generator once and the pose representation system twice. We set the two loss hyperparameters to 0.009 and 50. The training takes about one day on a single Titan RTX GPU.
Inference experiment. For the inference model, we use the same structure proposed in [brahmbhatt2018geometry], i.e., a pre-trained ResNet34 as the basic feature extractor. We learn a separate module containing several FC layers on top of the extracted features to predict each pose vector. Following [brahmbhatt2018geometry], we train an individual inference model for each scene. We use a learning rate of 0.00005 and train the model of each scene for 60 epochs. To isolate the effect of different representations, we use PoseNet as the model for all the representations, without other techniques such as adding pair losses or unlabeled data; we consider these techniques orthogonal to the improvement in pose representation. We employ the automatic weight tuning method of [brahmbhatt2018geometry] to tune the weights among the three position vectors and three orientation vectors, setting initial guesses for the logarithmic weights of the three position vectors' losses and of the three orientation vectors' losses; the latter is set to -3.0. We employ 0.7 color jitter as data augmentation and remove the dropout in the final FC layer. Our model takes about 3.7 hours to train on all the 7 scenes on a single Titan RTX GPU. For the baseline model, we use the released code of [brahmbhatt2018geometry] with the default setting under Python 3.6 and PyTorch 1.2.0, which trains the model on each scene for 300 epochs with a learning rate of 0.0001. It takes about 13 hours to train the baseline models on the entire 7 scenes on a single Titan RTX GPU.
Appendix B Additional training results
b.1 Generative results
b.2 Reconstructed image under different noise magnitude
In Figure 5, we show the reconstructed images at different noise levels using our model with the learned camera pose representation and using GQN (which uses a predefined low-dimensional sinusoidal function to represent rotation). Our model can reconstruct images with the correct pose even under high noise, while the poses in the images reconstructed by the GQN model change considerably as the noise increases. This agrees with our observations from the PSNR curves and further demonstrates that our learned camera pose representation is more robust to noise.
b.3 Learning the camera pose representation by a fully connected neural network
As a comparison, we replace our proposed camera pose representation with a fully-connected neural network on the ShapeNet car dataset. Specifically, we encode each angle with a 2-layer fully-connected neural network. The first layer has 128 units and the second layer has 96 units (the same size as our embedding). We use leaky ReLU as the activation function. As shown in Figure 6, this embedding is also high-dimensional, but it does not have the translation invariance [wang2021position] of our learned representation. Figure 7 shows the PSNR over the magnitude of noise added to the representations. The representation from a fully-connected neural network is more robust to noise than the plain low-dimensional embedding used in GQN, but it still performs worse than our design, which is regularized by the rotation loss. As for camera pose estimation, the representation from a fully-connected neural network gives a lower testing error than all the other hand-designed representations, but a higher one than our design. The results show that learning a high-dimensional representation is better than the low-dimensional hand-designed ones, and that enforcing translation invariance through the rotation loss further improves the results.
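The translation invariance at issue here means that inner products of angle embeddings depend only on the angular difference, not on the absolute angles. A quick check with a sinusoidal embedding, which has this property by construction (our learned embedding enforces it through the rotation loss, whereas a generic MLP embedding does not):

```python
import math

def sin_embed(theta, freqs=(1, 2, 3)):
    # stack (cos, sin) pairs at several frequencies; the inner product of
    # two such embeddings is sum_k cos(k * (theta1 - theta2)), a function
    # of the angular difference only
    v = []
    for k in freqs:
        v += [math.cos(k * theta), math.sin(k * theta)]
    return v

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

Shifting both angles by the same amount leaves the inner product unchanged, which is the invariance that the rotation loss encourages in the learned representation.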