Relative Camera Pose Estimation Using Convolutional Neural Networks
This paper presents a convolutional neural network based approach for estimating the relative pose between two cameras. The proposed network takes RGB images from both cameras as input and directly produces the relative rotation and translation as output. The system is trained in an end-to-end manner utilizing transfer learning from a large-scale classification dataset. The introduced approach is compared with widely used local feature-based methods (SURF, ORB), and the results indicate a clear improvement over the baseline. In addition, a variant of the proposed architecture containing a spatial pyramid pooling (SPP) layer is evaluated and shown to further improve the performance.
The ability to estimate the relative pose between camera views is essential for many computer vision applications, such as structure from motion (SfM), simultaneous localization and mapping (SLAM), and visual odometry. Due to its practical importance, plenty of research effort has been devoted to the topic over the years. One popular approach is based on detecting and matching local feature points and using the obtained correspondences to determine the relative pose. The performance of such a system is highly dependent on the accuracy of the local feature matches, which are commonly determined using methods like SIFT [1], DAISY [2], or SURF [3]. Unfortunately, there are many practically important cases where these hand-crafted descriptors are not able to find a sufficient number of correspondences. In particular, repetitive structures, textureless objects, and extremely large viewpoint changes are difficult to handle. We highlight such cases in Fig. 1. An alternative solution would be to utilize all photometric information from the images to determine the poses. However, such methods (e.g. [4]) are usually not applicable to wide-baseline settings with large viewpoint changes, or they may be computationally expensive.

Recently, methods based on convolutional neural networks (CNNs) have clearly outperformed previous state-of-the-art results in many computer vision problems, such as image classification, object recognition, and image retrieval. In this work, we show how CNNs can also be applied to estimate relative camera poses. Our contributions are as follows: 1) we propose a CNN-based method which takes RGB images from both cameras as input and directly produces the relative rotation and translation as output; 2) we explore several network architectures and evaluate their performance on the DTU dataset [5]; 3) we study how different training strategies affect the results and make comparisons to popular keypoint-based approaches. In addition, we investigate how spatial pyramid pooling [6] can be applied in the context of the relative camera pose estimation problem.

The rest of the paper is organized as follows. Section 2 describes related work focusing on relative camera pose estimation. The proposed approach and details of the network architectures and objective functions are introduced in Section 3. Finally, Sections 4 and 5 present the baseline methods, experimental setup, evaluation results, discussion, and possible directions for future work.
Over the years, a large variety of local feature-based methods, such as SIFT [1], SURF [3], ORB [7], and BRIEF [8], have been utilized in structure from motion, image-based localization, and visual SLAM contexts for estimating camera poses. The main disadvantage of these methods is their limited ability to cope with nuisance factors such as variations in viewpoint, reflections, and lack of distinguishable texture. As also noted in [9], some recent efforts indicate promise in approaching geometric computer vision tasks with dense, featureless methods based on using full images. Taking this into account, one of the most prominent solutions is to apply convolutional neural networks (CNNs). While they have recently been applied to many computer vision problems, there are only a few works that consider them in the context of relative pose estimation.
Konda et al. [10] proposed a CNN architecture for predicting change in velocity and local change in orientation using short stereo video clips. They used a rather shallow CNN architecture together with unsupervised pre-training of the filters in the early layers. Partly because of the shortage of training data in their path prediction application, they were forced to discretize the space of velocities and local changes for softmax-based classification instead of producing continuous estimates with regression. Mohanty et al. [11] tried to solve the same problem as in [10] using a monocular approach. In detail, they used an architecture based on two AlexNet-like CNN branches acting as inputs to a stack of fully-connected layers coupled with a regression layer.
Ummenhofer et al. [12] proposed a CNN architecture for depth and relative camera motion estimation. They utilized multiple tasks in the learning phase to provide additional supervision in order to obtain more accurate depth maps and camera motion estimates. DeTone et al. [9] proposed a CNN architecture for estimating the relative homography between two images by regressing a 4-point homography parameterization with a Euclidean loss. Finally, instead of relative camera pose, Kendall et al. [13] proposed a CNN-based method for absolute 6-DoF camera pose estimation.
Our proposal is related to all the previously discussed works, but it is the first to investigate the suitability of Siamese network architectures for the relative camera pose estimation problem. Compared with [10, 11], our study aims at a more general treatment of the camera pose estimation problem. That is, our approach is applicable to general unrestricted camera motion and to wide-baseline view pairs, unlike [10, 11]. Compared with [9], we aim to solve relative camera pose, which can be regarded as a more general problem than solving the relative homography between two views. Regarding [13], we adopt the same learning objective but concentrate on solving a different problem. In particular, unlike prediction of absolute pose [13], relative pose estimation provides a means for relation and representation learning on previously unseen scenes and objects. Finally, compared with [12], our study focuses on analyzing the differences between traditional and CNN-based approaches for relative camera pose estimation and does not consider the role of additional supervisory signals. That is, our approach does not require depth maps for training, which is beneficial in practice. Further details of our approach are given in the following sections.
Our goal is to estimate the relative camera pose directly by processing a pair of images captured by two cameras. We propose a convolutional neural network based method that predicts a 7-dimensional relative camera pose vector Δp containing the relative orientation Δq (a 4-dimensional quaternion) and the relative position, i.e. translation, Δt (3-dimensional), so that Δp = [Δq, Δt].

To estimate the relative camera pose between a pair of images, we apply a Siamese network architecture [14] (see Fig. 2). In detail, our network consists of two blocks, termed the representation part and the regression part. The representation part incorporates two identical CNN branches sharing their weights and other parameters. Both of these branches are composed of convolutional layers and rectified linear units (ReLU). The regression part in turn is composed of two fully-connected layers (FC1 and FC2), where FC1 and FC2 have 4 and 3 output connections, respectively, for estimating the 7-dimensional pose vector.
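As a concrete illustration of the 7-dimensional regression target, the ground-truth label for an image pair can be assembled from the two absolute camera poses. The sketch below is only an assumption about the parameterization (Hamilton quaternion convention in [w, x, y, z] order, world-frame translation direction); the paper does not spell out these conventions:

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions in [w, x, y, z] order."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def quat_conj(q):
    """Conjugate (inverse for unit quaternions)."""
    return q * np.array([1.0, -1.0, -1.0, -1.0])

def relative_pose(q1, t1, q2, t2):
    """Build the 7-D target [dq (4), dt (3)] from absolute orientations
    q1, q2 (unit quaternions) and camera positions t1, t2.
    The translation direction is normalized because absolute scale
    is unobservable (see Sec. 3)."""
    dq = quat_mul(q2, quat_conj(q1))   # relative rotation
    dt = t2 - t1                        # relative translation (world frame)
    dq = dq / np.linalg.norm(dq)
    dt = dt / np.linalg.norm(dt)
    return np.concatenate([dq, dt])
```

For example, two cameras with identical orientation separated by a unit step along z yield the target [1, 0, 0, 0, 0, 0, 1].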
Following [13], we apply transfer learning. In detail, we take HybridCNN [15] as the base network for both branches and initialize them with weights learned using large-scale classification data. More specifically, HybridCNN is AlexNet trained on both the ImageNet [16] image classification dataset and the scene-centric Places [15] dataset.
Extracting features from convolutional layers instead of fully-connected layers has been shown to produce more accurate results in image retrieval [17, 18, 19]. Therefore, we removed the last three fully-connected layers (FC6, FC7 and FC8) from the original HybridCNN, preserving only the convolutional layers, max-pooling layers, and ReLUs. More precisely, the network architecture of one branch has the following blocks: convB1, pool, convB2, pool, convB3, convB4, convB5, pool. Here convB denotes a convolutional layer (with a given number of filters, filter size, stride, and padding) followed by a regularization layer (ReLU), and pool denotes a max-pooling layer with a given window size and stride. The last layer of this baseline architecture, dubbed cnnA, is a pooling layer producing a tiny feature map as output. Due to this reduction of spatial resolution, image details that are beneficial for relative camera pose estimation may be lost at this part of the network. In order to limit such information loss, we remove the last max-pooling layer and extract features from convB5, which yields slightly larger feature maps. This modified version of the cnnA architecture is called cnnB.

Each branch of our representation part has an AlexNet-like structure originally designed for a fixed-size input image. Such a limitation may reduce the accuracy of the relative camera pose estimates and degrade the performance of the system in general. To obtain more accurate estimates, it might be beneficial to process larger images in order to extract more information about the scene structure. In principle, the convolutional layers accept arbitrary input image sizes, but they then also produce outputs of variable dimensions, whereas the FC layers of the regression part (see Fig. 2) require fixed-length vectors as input. To make our pipeline accept arbitrary image sizes, we apply a spatial pyramid pooling (SPP) layer, which can maintain spatial information by pooling in local spatial bins [6]. An SPP layer consists of a set of pooling layers; the level producing an n × n grid of bins uses a pooling window of size ⌈a/n⌉ and a stride of ⌊a/n⌋, where a is the spatial size of the input feature map of the SPP layer. Therefore, the number of output bins of the SPP layer is fixed regardless of the image size.
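The SPP bin arithmetic can be sketched as follows. The pyramid levels in the example are illustrative assumptions (the standard SPP-net configuration), not necessarily the exact levels used in the paper:

```python
import math

def spp_pooling_params(a, n):
    """Window size and stride for one SPP level that pools an a x a input
    feature map into an n x n grid of bins (He et al., SPP-net):
    window = ceil(a / n), stride = floor(a / n)."""
    return math.ceil(a / n), math.floor(a / n)

def spp_output_bins(levels):
    """Total number of bins per channel; fixed regardless of input size."""
    return sum(n * n for n in levels)

# For a 13 x 13 conv feature map pooled into a 6 x 6 grid:
win, stride = spp_pooling_params(13, 6)           # window 3, stride 2
out = (13 - win) // stride + 1                    # resulting grid side = 6
# An illustrative 4-level pyramid (assumed levels): 1+4+9+36 = 50 bins
bins = spp_output_bins([1, 2, 3, 6])
```

Because the output length depends only on the pyramid levels and the channel count, the subsequent FC layers always receive a fixed-length vector, whatever the input image size.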
We modified the original architectures cnnA and cnnB by adding an SPP layer to the end of each branch. The obtained networks (cnnAspp and cnnBspp) have 4-level and 5-level pyramid pooling, respectively. The cnnBspp structure is illustrated in Fig. 2. A more detailed evaluation of the proposed network architectures is presented in Sec. 4.2.
To regress the relative pose, the network is trained with a Euclidean loss between the estimated vectors and the ground truth. Following [13], we predict the relative orientation and position together using a single network learnt in a multi-task fashion. Our loss function for training is

L = ‖Δt̂ − Δt‖₂ + β ‖Δq̂ − Δq‖₂,   (1)

where Δq and Δt are the ground-truth relative orientation and translation, Δq̂ and Δt̂ are their estimates, and β is a weighting parameter that keeps the expected magnitudes of the two error terms approximately equal. As described in [13], β must balance the orientation and translation estimates and can be set by grid search. In our experiments we set β equal to 10. The network is trained via backpropagation using stochastic gradient descent. The details of the training are described in Section 4.2.

It should be noted that quaternions have unit length by definition, and therefore ‖Δq‖ = 1 in (1). Further, since the absolute scale of the translation cannot be recovered, we normalize the ground-truth translations to unit length as well, i.e. ‖Δt‖ = 1. However, the norm constraints are not explicitly enforced during training. Instead, the estimates are normalized to unit length as a post-processing step, as in [13].
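A minimal NumPy sketch of this objective and the post-hoc normalization, assuming the PoseNet-style form ‖Δt̂ − Δt‖ + β ‖Δq̂ − Δq‖ with β = 10 as stated:

```python
import numpy as np

def pose_loss(q_pred, t_pred, q_gt, t_gt, beta=10.0):
    """Euclidean pose regression loss, Eq. (1):
    ||dt_pred - dt_gt|| + beta * ||dq_pred - dq_gt||."""
    return (np.linalg.norm(t_pred - t_gt)
            + beta * np.linalg.norm(q_pred - q_gt))

def normalize_estimates(q_pred, t_pred):
    """Post-processing: project the raw network outputs onto the unit
    sphere, since the norm constraints are not enforced during training."""
    return (q_pred / np.linalg.norm(q_pred),
            t_pred / np.linalg.norm(t_pred))
```

The same β weighting applies at every training step; only at test time are the outputs renormalized.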
Thus, at test time, a pair of images is fed to the regression neural network, consisting of two branches, which directly estimates the real-valued parameters of the relative camera pose vector. Finally, the estimated quaternion and translation vectors are normalized to unit length.
The error metrics that we use to evaluate performance are (a) relative orientation error (ROE) in degrees and (b) relative translation error (RTE) in degrees. The latter is the angle between the estimated translation vector and the ground truth. The former is measured by determining the angle of the rotation between the estimated orientation and the ground truth (i.e. we determine the rotation which takes the estimated orientation onto the ground truth).
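These two angular metrics can be computed as follows; the sketch assumes unit quaternions in [w, x, y, z] order, a convention the text does not specify:

```python
import numpy as np

def translation_angle_deg(t_est, t_gt):
    """RTE: angle between the estimated and ground-truth translation
    directions, in degrees."""
    c = np.dot(t_est, t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def rotation_angle_deg(q_est, q_gt):
    """ROE: angle of the rotation taking the estimated orientation onto
    the ground truth, for unit quaternions. Uses |<q1, q2>| = |cos(theta/2)|;
    the absolute value handles the q / -q double cover."""
    d = abs(np.dot(q_est, q_gt))
    return np.degrees(2.0 * np.arccos(np.clip(d, 0.0, 1.0)))
```

For instance, orthogonal translation directions give an RTE of 90 degrees, and a quaternion for a 90-degree rotation about any axis gives an ROE of 90 degrees against the identity.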
It is essential to have a large and consistent dataset for training CNNs for the relative camera pose problem. However, collecting such data may be an expensive and laborious task. We overcome this by utilizing the crowd-sourced image collection provided by [20], where the ground-truth camera poses are obtained using an automatic structure from motion pipeline based on local feature matching. The collection consists of 13 subsets of images representing different landmarks, together with the numerical data of the global structure from motion reconstruction for each subset.
To evaluate the proposed CNN architectures, we construct datasets for training and validation. The training set is composed of samples from five landmarks (Montreal Notre Dame, Piccadilly, Roman Forum, Vienna Cathedral and Gendarmenmarkt), covering altogether 581k image pairs (see examples of each landmark in Fig. 3). For the validation set, we used the Yorkminster subset, covering 22k image pairs in total. The ground-truth labels (relative orientation and translation) are provided by [20] and were computed by applying the SIFT keypoint detector and descriptor followed by structure and motion estimation via RANSAC, triangulation, and iterative optimization.
In order to obtain a fair comparison between our approach and point-based methods, we need an evaluation dataset where the ground-truth orientation and translation are accurate and reliable, and not dependent on the success of traditional point-based motion estimation methods. For this, we utilize the DTU Robot Image Dataset provided by [5], where the ground truth is obtained by attaching the camera to a precisely positioned robot arm. The dataset covers 124 scenes with varying numbers of camera positions. Several object scenes of this dataset are illustrated in Fig. 4. See [5] for further details about the pipeline used to collect the DTU dataset.
We evaluated the performance of our proposal on the DTU dataset, comparing it with two traditional feature-based methods, namely SURF [3] and ORB [7].
As explained, the DTU dataset consists of 124 scenes with varying numbers of camera positions. More specifically, it contains 77 scenes (type-I) with 49 camera positions and 47 scenes (type-II) with 64 camera positions. In order to estimate relative camera poses between pairs of views, we first determine the camera pairs which have overlapping fields of view.

Assuming that the pairwise relationships between cameras are represented in a symmetric adjacency matrix, it is easy to see that the upper bound for the number of overlapping view pairs is N(N − 1)/2, where N is the number of camera positions in the scene (49 or 64). Depending on the scene type, this equals 1176 or 2016 pairs, respectively. However, we compute relative poses only for the subset of pairs that actually have overlapping fields of view. These camera pairs can be determined easily since the intrinsic camera parameters of the dataset are known. In detail, we define the field of view of each camera as a cone, i.e. a frustum, and calculate the intersections of pairs of corresponding frustums using the publicly available OpenMVG [21] library. As a result, the number of overlapping view pairs is 512 for type-I scenes and 753 for type-II scenes.
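The pair-count bound can be checked directly:

```python
def max_view_pairs(n):
    """Upper bound on view pairs among n camera positions: the number of
    off-diagonal entries in the upper triangle of a symmetric n x n
    adjacency matrix, i.e. n * (n - 1) / 2."""
    return n * (n - 1) // 2

# DTU scene types from the paper:
assert max_view_pairs(49) == 1176   # type-I scenes, 49 positions
assert max_view_pairs(64) == 2016   # type-II scenes, 64 positions
```

The frustum-overlap test then prunes these candidates down to 512 and 753 usable pairs, respectively.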
To compare the discussed Siamese network architectures with and without SPP, i.e. cnnA/cnnB and cnnAspp/cnnBspp (see Sec. 3), we created training image pairs by rescaling the input images so that the smaller dimension was fixed to 323 pixels while keeping the original aspect ratio. Depending on the model, we then used random crops for training (with a larger crop size for the architectures applying SPP) and a center crop at test time. To train our networks we used stochastic gradient descent with the Adam solver [22]. The learning rate, weight decay, and mini-batch size were the same for all models during the training phase. We used the publicly available machine learning framework Torch [23] on two NVIDIA Titan X cards, applying data parallelism to speed up training. It took around 60 hours to finish 15 training epochs with the 581k image pairs of the training dataset described in Section 3.4.

Fig. 5 shows a set of normalized cumulative histograms of relative orientation and translation errors for each of the discussed models evaluated on all scenes of the DTU dataset. According to the results, having a larger output feature map before the final FC layers is clearly beneficial. This can be seen especially in the relative translation error (Fig. 5, right), where cnnB and cnnBspp outperform cnnA and cnnAspp. In addition, utilizing SPP yields a further improvement. We dub the top-performing model (cnnBspp) cnnspp and use it for the further experiments reported in the following section.
We compare our best model with a baseline consisting of two feature-based methods, namely SURF and ORB. For both methods, we used the OpenCV implementation with default parameters. The pose was recovered from the essential matrix, estimated using the five-point method and RANSAC.
The results presented in Fig. 5 confirm that our proposed model cnnspp performs better than the baseline feature-based methods. For a fair comparison, we resize all images of the DTU dataset to the same size, transform the internal camera parameters accordingly, and evaluate both the CNNs and the feature-based approaches on this data. Our results show that transfer learning from external datasets can be utilized effectively to train a general relative pose regressor. Further, it should be noted that essential matrix estimation requires knowledge of the internal camera parameters, whereas our CNN-based approach does not use that information.
We illustrate example results in Fig. 6. The yellow lines show matching SURF feature points across the images in a pair. The visualization shows that our method is robust and in some cases produces more accurate relative pose estimates than conventional point-based methods (Fig. 5(a)).
In Sec. 3.4 we described the data used to train our models in the previous experiments. However, the visual characteristics and the distribution of relative poses in this data are quite different from the evaluation data, namely the DTU dataset. Given that many studies [13, 16] show it is essential to use training data as close as possible to the target domain, it is interesting to see how the performance is affected if a subset of DTU scenes is used for training. Thus, we divided the DTU scenes into two sets and used one of them to fine-tune the cnnspp model pre-trained on the Landmarks dataset (Sec. 3.4). The other part of the DTU images was then used as the final testing set. Furthermore, exploiting the properties of the SPP layer, we conducted experiments with different input image sizes for the proposed model. In particular, both low-resolution and high-resolution images were evaluated, where the high-resolution size corresponds to the original images in the test set of the DTU dataset.

The final results, illustrated in Fig. 7, show that the fine-tuned model produces more accurate relative camera orientation estimates than the one trained only on the Landmarks dataset (red and grey curves, respectively). Further, utilizing high-resolution images both during training and evaluation leads to the best performance among the CNN-based methods (yellow curve). On average, the proposed method falls slightly behind the feature-based approaches in estimating the relative translation. However, based on our inspection, in certain cases where the objects are mostly textureless (see for example Fig. 6(b)) the proposed CNN-based camera pose regressor significantly outperforms traditional feature-based approaches (Fig. 6(c)).
We presented an end-to-end trainable convolutional neural network based approach for estimating the relative pose between two cameras. We evaluated several network architectures, all based on the idea of combining two identical network branches with weight sharing, i.e. the Siamese network. We showed that the problem is solvable by this structure and that it is useful to feed a larger feature map into the stack of final fully-connected layers. We further showed that applying spatial pyramid pooling is key to even more accurate relative pose estimates, as it opens the door to larger input images, which, according to our results, is one way to improve accuracy. Overall, our proposal demonstrates very promising results, but there is also room for further improvement.
For future directions, one interesting option is to construct a two-step model. In the first step, a CNN would produce coarse estimates; in the second step, it would refine the final estimate using the preliminary ones. This kind of model is reminiscent of the one presented in [12]. We leave constructing such a model for future work.
Learning deep features for scene recognition using Places database. NIPS (2014).