KFNet
KFNet: Learning Temporal Camera Relocalization using Kalman Filtering (CVPR 2020 Oral)
Temporal camera relocalization estimates the pose with respect to each video frame in sequence, as opposed to one-shot relocalization, which focuses on a still image. Even though the time dependency has been taken into account, current temporal relocalization methods still generally underperform the state-of-the-art one-shot approaches in terms of accuracy. In this work, we improve the temporal relocalization method by using a network architecture that incorporates Kalman filtering (KFNet) for online camera relocalization. In particular, KFNet extends the scene coordinate regression problem to the time domain in order to recursively establish 2D and 3D correspondences for the pose determination. The network architecture design and the loss formulation are based on Kalman filtering in the context of Bayesian learning. Extensive experiments on multiple relocalization benchmarks demonstrate the high accuracy of KFNet, which ranks at the top of both one-shot and temporal relocalization approaches. Our code is released at https://github.com/zlthinker/KFNet.
Camera relocalization serves as the subroutine of applications including SLAM [16], augmented reality [10] and autonomous navigation [48]. It estimates the 6-DoF pose of a query RGB image in a known scene coordinate system. Current relocalization approaches mostly focus on one-shot relocalization for a still image. They can be mainly categorized into three classes [14, 53]: (1) the relative pose regression (RPR) methods which determine the relative pose w.r.t. the database images [4, 30], (2) the absolute pose regression (APR) methods regressing the absolute pose through PoseNet [26] and its variants [24, 25, 64] and (3) the structure-based methods that establish 2D-3D correspondences with Active Search [51, 52] or Scene Coordinate Regression (SCoRe) [56] and then solve the pose by PnP algorithms [19, 45]. Particularly, SCoRe has been widely adopted recently to learn per-pixel scene coordinates from dense training data for a scene, as it can form dense and accurate 2D-3D matches even in texture-less scenes [6, 7]. As extensively evaluated in [6, 7, 53], the structure-based methods generally show better pose accuracy than the RPR and APR methods, because they explicitly exploit the rules of the projective geometry and the scene structures [53].
Apart from one-shot relocalization, temporal relocalization with respect to video frames is also worthy of investigation. However, almost all the temporal relocalization methods are based on PoseNet [26] and, in general, even underperform the structure-based one-shot methods in accuracy. This is mainly because their accuracies are fundamentally limited by the retrieval nature of PoseNet. As analyzed in [53], PoseNet-based methods are essentially analogous to approximate pose estimation via image retrieval, and cannot go beyond the retrieval baseline in accuracy.
In this work, we are motivated by the high accuracy of structure-based relocalization methods and resort to SCoRe to estimate per-pixel scene coordinates for pose computation. Besides, we propose to extend SCoRe to the time domain in a recursive manner to enhance the temporal consistency of 2D-3D matching, thus allowing for more accurate online pose estimations for sequential images. Specifically, a recurrent network named KFNet is proposed in the context of Bayesian learning [40] by embedding SCoRe into the Kalman filter within a deep learning framework. It is composed of the three subsystems below, as illustrated in Fig. 1.
The measurement system features a network termed SCoordNet to derive the maximum likelihood (ML) predictions of the scene coordinates for a single image.
The process system uses OFlowNet that models the optical flow based transition process for image pixels across time steps and yields the prior predictions of scene coordinates. Additionally, the measurement and process systems provide uncertainty predictions [43, 24] to model the noise dynamics over time.
The filtering system fuses both predictions and leads to the maximum a posteriori (MAP) estimations of the final scene coordinates.
Furthermore, we propose probabilistic losses for the three subsystems based on the Bayesian formulation of KFNet, to enable the training of either the subsystems or the full framework. We summarize the contributions as follows.
We are the first to extend the scene coordinate regression problem [56] to the time domain in a learnable way for temporally-consistent 2D-3D matching.
We integrate the traditional Kalman filter [23] into a recurrent CNN network (KFNet) that resolves pixel-level state inference over time-series images.
Lastly, for better practicality, we propose a statistical assessment tool to enable KFNet to self-inspect the potential outlier predictions on the fly.
Camera relocalization. We categorize camera relocalization algorithms into three classes: the relative pose regression (RPR) methods, the absolute pose regression (APR) methods and the structure-based methods.
The RPR methods use a coarse-to-fine strategy which first finds similar images in the database through image retrieval [59, 3] and then computes the relative poses w.r.t. the retrieved images [4, 30, 49]. They generalize well to unseen scenes, but the retrieval process needs to match the query image against all the database images, which can be costly for time-critical applications.
The APR methods include PoseNet [26] and its variants [24, 25, 64] which learn to regress the absolute camera poses from the input images through a CNN. They are simple and efficient, but generally fall behind the structure-based methods in terms of accuracy, as validated by [6, 7, 53]. Theoretically, [53] explains that PoseNet-based methods are more closely related to image retrieval than to accurate pose estimation via 3D geometry.
The structure-based methods explicitly establish the correspondences between 2D image pixels and 3D scene points and then solve camera poses by PnP algorithms [19, 45, 31]. Traditionally, correspondences are searched by matching the patch features against Structure from Motion (SfM) tracks via Active Search [51, 52] and its variants [33, 11, 34, 50], which can be inefficient and fragile in texture-less scenarios. Recently, the correspondence problem has been resolved by predicting the scene coordinates for pixels by training random forests [56, 62, 39] or CNNs [6, 7, 32, 8] with ground truth scene coordinates, which is referred to as Scene Coordinate Regression (SCoRe).
Besides one-shot relocalization, some works have extended PoseNet to the time domain to address temporal relocalization. VidLoc [12] performs offline and batch relocalization for fixed-length video clips by BLSTM [54]. Coskun et al. refine the pose dynamics by embedding LSTM units in the Kalman filters [13]. VLocNet [60] and VLocNet++ [46] propose to learn pose regression and the visual odometry jointly. LSG [67] combines LSTM with visual odometry to further exploit the spatial-temporal consistency. Since all these methods are extensions of PoseNet, their accuracies are fundamentally limited by the retrieval nature of PoseNet, following the analysis of [53].
Temporal processing. When processing time-series image data, ConvLSTM [65] is a standard way of modeling the spatial correlations of local contexts through time [63, 36, 29]. However, some works have pointed out that the implicit convolutional modeling is less suited to discovering the pixel associations between neighboring frames, especially when pixel-level accuracy is desired [22, 42]. Therefore, in later works, the optical flow is highlighted as a more explicit way of delineating the pixel correspondences across sequential steps [44]. For example, [44, 21, 29, 57, 42] commonly predict the optical flow fields to guide the feature map warping across time steps. Then, the warped features are fused by weighting [75, 76] or pooling [41, 44] to aggregate the temporal knowledge. In this work, we follow the practice of flow-guided warping, but the distinction from previous works is that we propose to fuse the predictions by leveraging Kalman filter principles [40].
This section presents the Bayesian formulation of recursive scene coordinate regression in the time domain for temporal camera relocalization. Based on the formulation, the proposed KFNet is built and the probabilistic losses are defined in Secs. 4 to 6. Notations used below are summarized in Table 1 for quick reference.
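Given a stream of RGB images up to time $t$, i.e., $\mathcal{I}_t = \{I_1, \dots, I_{t-1}, I_t\}$, our aim is to predict the latent state for each frame, i.e., the scene coordinate map, which is then used for pose computation. We denote the map as $\boldsymbol{\theta}_t \in \mathbb{R}^{N \times 3}$, where $N$ is the pixel number. By imposing the Gaussian noise assumption on the states, the state $\boldsymbol{\theta}_t$ conditioned on $\mathcal{I}_t$ follows an unknown Gaussian distribution:
$$\boldsymbol{\theta}_t^{+} \triangleq (\boldsymbol{\theta}_t \mid \mathcal{I}_t) \sim \mathcal{N}(\hat{\boldsymbol{\theta}}_t, \boldsymbol{\Sigma}_t), \tag{1}$$
where $\hat{\boldsymbol{\theta}}_t$ and $\boldsymbol{\Sigma}_t$ are the expectation and covariance to be determined. Under the routine of Bayesian theorem, the posterior probability of $\boldsymbol{\theta}_t$ can be factorized as
$$P(\boldsymbol{\theta}_t \mid \mathcal{I}_t) \propto P(\boldsymbol{\theta}_t \mid \mathcal{I}_{t-1}) \, P(I_t \mid \boldsymbol{\theta}_t, \mathcal{I}_{t-1}), \tag{2}$$
where $\mathcal{I}_t = \mathcal{I}_{t-1} \cup \{I_t\}$.
The first factor of the right hand side (RHS) of Eq. 2 indicates the prior belief about $\boldsymbol{\theta}_t$ obtained from time $t-1$ through a process system. Provided that no occlusions or dynamic objects occur, the consecutive coordinate maps can be approximately associated by a linear process equation describing their pixel correspondences, wherein
$$\boldsymbol{\theta}_t = \mathbf{G}_t \boldsymbol{\theta}_{t-1} + \mathbf{w}_t, \tag{3}$$
with $\mathbf{G}_t \in \mathbb{R}^{N \times N}$ being the sparse state transition matrix given by the optical flow fields from time $t-1$ to $t$, and $\mathbf{w}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{W}_t)$, $\mathbf{W}_t \in \mathbb{S}_{++}^{N}$, being the process noise ($\mathbb{S}_{++}^{N}$ denotes the set of $N$-dimensional positive definite matrices). Given $\mathcal{I}_{t-1}$, we already have the probability statement that $(\boldsymbol{\theta}_{t-1} \mid \mathcal{I}_{t-1}) \sim \mathcal{N}(\hat{\boldsymbol{\theta}}_{t-1}, \boldsymbol{\Sigma}_{t-1})$. Then the prior estimation of $\boldsymbol{\theta}_t$ from time $t-1$ can be expressed as
$$\boldsymbol{\theta}_t^{-} \triangleq (\boldsymbol{\theta}_t \mid \mathcal{I}_{t-1}) \sim \mathcal{N}(\hat{\boldsymbol{\theta}}_t^{-}, \mathbf{R}_t), \tag{4}$$
where $\hat{\boldsymbol{\theta}}_t^{-} = \mathbf{G}_t \hat{\boldsymbol{\theta}}_{t-1}$, $\mathbf{R}_t = \mathbf{G}_t \boldsymbol{\Sigma}_{t-1} \mathbf{G}_t^{\top} + \mathbf{W}_t$.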


Table 1: Summary of notations, listing the inputs and outputs of each module.
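The second factor of the RHS of Eq. 2 describes the likelihood of image observations at time $t$ made through a measurement system. The system models how $I_t$ is derived from the latent states $\boldsymbol{\theta}_t$, formally $I_t = h(\boldsymbol{\theta}_t)$. However, the high nonlinearity of $h(\cdot)$ makes the following computation intractable. Alternatively, we map $I_t$ to $\mathbf{z}_t$ via a nonlinear function inspired by [13], so that the system can be approximately expressed by a linear measurement equation:
$$\mathbf{z}_t = \boldsymbol{\theta}_t + \mathbf{v}_t, \tag{5}$$
where $\mathbf{v}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{V}_t)$, $\mathbf{V}_t \in \mathbb{S}_{++}^{N}$, denotes the measurement noise, and $\mathbf{z}_t$ can be interpreted as the noisy observed scene coordinates. In this way, the likelihood can be rewritten as $P(\mathbf{z}_t \mid \boldsymbol{\theta}_t, \mathcal{I}_{t-1})$ by substituting $\mathbf{z}_t$ for $I_t$.
Let $\mathbf{e}_t$ denote the residual of predicting $\mathbf{z}_t$ from time $t-1$; thus
$$\mathbf{e}_t = \mathbf{z}_t - \hat{\boldsymbol{\theta}}_t^{-} = \mathbf{z}_t - \mathbf{G}_t \hat{\boldsymbol{\theta}}_{t-1}. \tag{6}$$
Since $\mathbf{G}_t$ and $\hat{\boldsymbol{\theta}}_{t-1}$ are all known, observing $\mathbf{z}_t$ is equivalent to observing $\mathbf{e}_t$. Hence, the likelihood can be rewritten as $P(\mathbf{e}_t \mid \boldsymbol{\theta}_t, \mathcal{I}_{t-1})$. Substituting Eq. 5 into Eq. 6, we have $\mathbf{e}_t = \boldsymbol{\theta}_t - \hat{\boldsymbol{\theta}}_t^{-} + \mathbf{v}_t$, so that the likelihood can be described by
$$(\mathbf{e}_t \mid \boldsymbol{\theta}_t, \mathcal{I}_{t-1}) \sim \mathcal{N}(\boldsymbol{\theta}_t - \hat{\boldsymbol{\theta}}_t^{-}, \mathbf{V}_t). \tag{7}$$
Based on the theorems in multivariate statistics [1, 40], combining the two distributions of Eqs. 4 & 7 gives the bivariate normal distribution:
$$\left. \begin{bmatrix} \boldsymbol{\theta}_t \\ \mathbf{e}_t \end{bmatrix} \right| \mathcal{I}_{t-1} \sim \mathcal{N}\left( \begin{bmatrix} \hat{\boldsymbol{\theta}}_t^{-} \\ \mathbf{0} \end{bmatrix}, \begin{bmatrix} \mathbf{R}_t & \mathbf{R}_t \\ \mathbf{R}_t & \mathbf{R}_t + \mathbf{V}_t \end{bmatrix} \right). \tag{8}$$
Making $\mathbf{e}_t$ the conditioning variable, the filtering system gives the posterior distribution that writes
$$\boldsymbol{\theta}_t^{+} \triangleq (\boldsymbol{\theta}_t \mid \mathbf{e}_t, \mathcal{I}_{t-1}) \sim \mathcal{N}\left( \hat{\boldsymbol{\theta}}_t^{-} + \mathbf{K}_t \mathbf{e}_t, \; (\mathbf{I} - \mathbf{K}_t) \mathbf{R}_t \right), \tag{9}$$
where $\mathbf{K}_t = \mathbf{R}_t (\mathbf{R}_t + \mathbf{V}_t)^{-1}$ is conceptually referred to as the Kalman gain and $\mathbf{e}_t$ as the innovation [40, 20]. (The derivation of Eqs. 8 & 9 is shown in Appendix B.)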
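The recursion described above (a linear process step followed by fusion with a noisy observation of the state) can be sketched with dense matrices as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names `predict` and `update`, the two-pixel toy state and all numeric values are hypothetical, and the measurement model is taken to be the identity as in the linear measurement equation.

```python
import numpy as np

def predict(theta_prev, sigma_prev, G, W):
    """Process step: push the previous posterior through the transition matrix G."""
    theta_prior = G @ theta_prev               # prior mean
    R = G @ sigma_prev @ G.T + W               # prior covariance
    return theta_prior, R

def update(theta_prior, R, z, V):
    """Filtering step for the identity measurement model z = theta + v."""
    e = z - theta_prior                        # innovation
    K = R @ np.linalg.inv(R + V)               # Kalman gain
    theta_post = theta_prior + K @ e           # posterior mean
    sigma_post = (np.eye(len(z)) - K) @ R      # posterior covariance
    return theta_post, sigma_post

# Toy values (hypothetical): the flow swaps the two pixels between frames.
G = np.array([[0.0, 1.0], [1.0, 0.0]])
theta_prev = np.array([1.0, 2.0])
sigma_prev = 0.04 * np.eye(2)
W = 0.01 * np.eye(2)
V = 0.05 * np.eye(2)
z = np.array([2.2, 0.9])

theta_prior, R = predict(theta_prev, sigma_prev, G, W)
theta_post, sigma_post = update(theta_prior, R, z, V)
print(theta_post)   # fused coordinates lie between the prior and the observation
```

In this toy setting the prior and measurement covariances are equal, so the gain is 0.5 and the posterior mean is the midpoint of the prior and the observation, with half the prior variance.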
As shown in Fig. 1, the inference of the posterior scene coordinates $\hat{\boldsymbol{\theta}}_t$ and covariance $\boldsymbol{\Sigma}_t$ for image pixels proceeds recursively as the time $t$ evolves, and both are then used for online pose determination. Specifically, the pixels with variances greater than an uncertainty threshold are first excluded as outliers. Then, a RANSAC+P3P [19] solver is applied to compute the initial camera pose from the 2D-3D correspondences, followed by a nonlinear optimization for pose refinement.
The measurement system is basically a generative model explaining how the observations $\mathbf{z}_t$ are generated from the latent scene coordinates $\boldsymbol{\theta}_t$, as expressed in Eq. 5. Then, the remaining problem is to learn the underlying mapping from the image $I_t$ to the observations $\mathbf{z}_t$. This is similar to the SCoRe task [56, 6, 7], but differs in the constraint about $\mathbf{z}_t$ imposed by Eq. 5. Below, the architecture of SCoordNet is first introduced, which outputs the scene coordinate predictions, along with the uncertainties, to model the measurement noise $\mathbf{v}_t$. Then, we define the probabilistic loss based on the likelihood of the measurement system.
SCoordNet shares a similar fully convolutional structure with [7], as shown in Fig. 1. However, it is far more lightweight, with fewer than one eighth of the parameters of [7]. It encompasses twelve convolution layers, three of which use a stride of 2 to downsize the input by a factor of 8. ReLU follows each layer except the last one. To simplify computation and avoid the risk of over-parameterization, we postulate the isotropic covariance of the multivariate Gaussian measurement noise, i.e., $\mathbf{V}_i = v_i^2 \mathbf{I}_3$ for each pixel $i$, where $\mathbf{I}_3$ denotes the $3 \times 3$ identity matrix. The output thus has 4 channels, comprising 3-d scene coordinates and a 1-d uncertainty measurement.
According to Eq. 5, the latent scene coordinates $\boldsymbol{\theta}_i$ of pixel $i$ should follow the distribution $\mathcal{N}(\mathbf{z}_i, v_i^2 \mathbf{I}_3)$. Taking the negative logarithm of the probability density function (PDF) of $\boldsymbol{\theta}_i$, we define the loss based on the likelihood which gives rise to the maximum likelihood (ML) estimation for each pixel in the form [24]:
$$\mathcal{L}_{\text{likelihood}} = \sum_{i=1}^{N} \left( 3 \log v_i + \frac{\| \mathbf{z}_i - \mathbf{y}_i \|_2^2}{2 v_i^2} \right), \tag{10}$$
with $\mathbf{y}_i$ being the ground-truth label for $\boldsymbol{\theta}_i$. For numerical stability, we use logarithmic variance for the uncertainty measurements in practice, i.e., $s_i = \log v_i^2$.
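As an illustration of the likelihood loss with logarithmic variance, the sketch below evaluates the negative log-likelihood of an isotropic 3D Gaussian per pixel, dropping the constant term. The function name `likelihood_loss` and the array layout are assumptions made for this example and are not taken from the released code.

```python
import numpy as np

def likelihood_loss(z, s, y):
    """Negative log-likelihood of isotropic 3D Gaussians, constants dropped.

    z: (N, 3) predicted scene coordinates
    s: (N,)   predicted log-variance per pixel, s = log(sigma^2)
    y: (N, 3) ground-truth scene coordinates
    """
    sq_err = np.sum((z - y) ** 2, axis=1)
    # 3 * log(sigma) equals 1.5 * s, and 1 / (2 * sigma^2) equals 0.5 * exp(-s)
    return np.sum(1.5 * s + 0.5 * np.exp(-s) * sq_err)

# A pixel with squared error 3 is penalised least when its variance is 1 (s = 0).
z = np.array([[1.0, 1.0, 1.0]])
y = np.zeros((1, 3))
for s_val in (-0.5, 0.0, 0.5):
    print(s_val, likelihood_loss(z, np.array([s_val]), y))
```

Because the squared error is divided by the predicted variance, a pixel that is hard to fit can lower its loss by reporting a larger uncertainty, at the price of the logarithmic penalty.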
Including uncertainty learning in the loss formulation allows one to quantify the prediction errors stemming not just from the intrinsic noise in the data but also from the defined model [15]. For example, at the boundary with depth discontinuity, a subpixel offset would cause an abrupt coordinate shift which is hard to model. SCoordNet would easily suffer from a significant magnitude of loss in such cases. It is sensible to automatically downplay such errors during training by weighting with the uncertainty measurements. Fig. 2(a) illustrates the uncertainty predictions in such cases.
The process system models the transition process of pixel states from time $t-1$ to $t$, as described by the process equation of Eq. 3. Herein, first, we propose a cost volume based network, OFlowNet, to predict the optical flows and the process noise covariance jointly for each pixel. Once the optical flows are determined, Eq. 3 is equivalent to the flow-guided warping from time $t-1$ towards $t$, as commonly used in [44, 21, 29, 57, 42]. Second, after the warping, the prior distribution of the states, i.e., $\boldsymbol{\theta}_t^{-}$ of Eq. 4, can be evaluated. We then define the probabilistic loss based on the prior to train OFlowNet.
OFlowNet is composed of two components: the cost volume constructor and the flow estimator.
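The cost volume constructor first extracts features from the two input images $I_{t-1}$ and $I_t$ respectively through seven convolutions, three of which have a stride of 2. The output feature maps $\mathbf{F}_{t-1}$ and $\mathbf{F}_t$ have a spatial size of one-eighth of the inputs and a channel number of $c$. Then, we build up a cost volume $\mathbf{C}_i \in \mathbb{R}^{w \times w}$ for each pixel $\mathbf{p}_i$ of the feature map $\mathbf{F}_t$, so that
$$\mathbf{C}_i(\mathbf{o}) = \left\| \mathbf{F}_t(\mathbf{p}_i) - \mathbf{F}_{t-1}(\mathbf{p}_i + \mathbf{o}) \right\|_2, \tag{11}$$
where $w$ is the size of the search window which corresponds to $8w$ pixels in the full-resolution image, and $\mathbf{o}$ is the spatial offset inside the window. We apply L2-normalization to the feature maps along the channel dimension before computing the difference, as in [66, 35].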
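The construction of a per-pixel cost volume from two L2-normalised feature maps can be sketched as below. This is a simplified illustration, not the network's actual layer: the function names, the explicit loops, the zero padding at the borders, and the convention that a pixel of the current frame is compared with the previous frame at the same position plus an offset are all assumptions of this example.

```python
import numpy as np

def l2_normalize(f, eps=1e-8):
    """Normalise features along the channel (last) dimension."""
    return f / (np.linalg.norm(f, axis=-1, keepdims=True) + eps)

def cost_volume(feat_t, feat_prev, radius):
    """Matching cost between feat_t at p and feat_prev at p + o.

    feat_t, feat_prev: (H, W, C) feature maps
    returns: (H, W, 2r+1, 2r+1); entry [y, x, dy, dx] is the cost for the
             offset (dy - r, dx - r)
    """
    H, W, _ = feat_t.shape
    ft, fp = l2_normalize(feat_t), l2_normalize(feat_prev)
    win = 2 * radius + 1
    # Zero padding keeps every offset defined at the borders (an assumption).
    padded = np.pad(fp, ((radius, radius), (radius, radius), (0, 0)))
    cost = np.empty((H, W, win, win))
    for dy in range(win):
        for dx in range(win):
            shifted = padded[dy:dy + H, dx:dx + W]
            cost[:, :, dy, dx] = np.linalg.norm(ft - shifted, axis=-1)
    return cost

# Toy check: the previous frame is the current frame shifted down by one row.
rng = np.random.default_rng(0)
feat_t = rng.normal(size=(6, 6, 4))
feat_prev = np.roll(feat_t, 1, axis=0)
cost = cost_volume(feat_t, feat_prev, radius=2)
best = np.unravel_index(np.argmin(cost[2, 2]), cost[2, 2].shape)
print(best)  # index of the lowest-cost offset for pixel (2, 2)
```

For an interior pixel in the toy example, the lowest cost is found at the offset of one row down, which is the shift that was applied to create the previous frame.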
The following flow estimator operates over the cost volumes for flow inference. We use a U-Net with skip connections [47] as shown in Fig. 1, which first subsamples the cost volume for an enlarged receptive field and then upsamples it to the original resolution. The output is an unbounded $w \times w$ confidence map for each pixel. Related works usually attain flows by hard assignment based on the matching cost encapsulated by the cost volumes [66, 58]. However, it would cause non-differentiability in later steps where the optical flows are to be further used for spatial warping. Thus, we pass the confidence map through the differentiable spatial softmax operator [17] to compute the optical flow as the expectation of the pixel offsets inside the search window. Formally,
$$\hat{\mathbf{o}}_i = \sum_{\mathbf{o}} \frac{\exp\left( f_i(\mathbf{o}) \right)}{\sum_{\mathbf{o}'} \exp\left( f_i(\mathbf{o}') \right)} \, \mathbf{o}, \tag{12}$$
where $f_i(\mathbf{o})$ is the confidence at offset $\mathbf{o}$. To fulfill the process noise modeling, i.e., $\mathbf{w}_t$ in Eq. 3, we append three fully connected layers after the bottleneck of the U-Net to regress the logarithmic variance, as shown in Fig. 1. Sample optical flow predictions are visualized in Fig. 3.
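The spatial softmax step, which turns an unbounded confidence map into a differentiable flow estimate, can be sketched as follows. The function name `soft_argmax_flow`, the (dy, dx) ordering of the output and the centred integer offset grid are assumptions made for this example.

```python
import numpy as np

def soft_argmax_flow(conf, radius):
    """Expected offset under a softmax over the search window.

    conf: (H, W, 2r+1, 2r+1) unbounded confidences per pixel
    returns: (H, W, 2) flow as (dy, dx), each in [-r, r]
    """
    H, W = conf.shape[:2]
    flat = conf.reshape(H, W, -1)
    flat = flat - flat.max(axis=-1, keepdims=True)   # numerical stability
    prob = np.exp(flat)
    prob /= prob.sum(axis=-1, keepdims=True)
    offs = np.arange(-radius, radius + 1)
    dy, dx = np.meshgrid(offs, offs, indexing="ij")
    offsets = np.stack([dy.ravel(), dx.ravel()], axis=-1).astype(float)
    return prob @ offsets                            # expectation of offsets

# A flat confidence map gives zero flow; a sharp peak gives the peak's offset.
flat_conf = np.zeros((1, 1, 5, 5))
peaked_conf = np.zeros((1, 1, 5, 5))
peaked_conf[0, 0, 3, 0] = 50.0                       # offset (dy, dx) = (1, -2)
print(soft_argmax_flow(flat_conf, 2)[0, 0])
print(soft_argmax_flow(peaked_conf, 2)[0, 0])
```

Unlike a hard argmax, the expectation varies smoothly with the confidences, so gradients can flow back from the warped predictions into the flow estimator.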
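Once the optical flows are computed, the state transition matrix $\mathbf{G}_t$ of Eq. 3 can be evaluated. We then complete the linear transition process of Eq. 3 by warping the scene coordinate map and uncertainty map from time $t-1$ towards $t$ through bilinear warping [72]. Let $\hat{\boldsymbol{\theta}}_i^{-}$ and $\tilde{\sigma}_i^2$ be the warped scene coordinates and Gaussian variance, and $w_i^2$ be the Gaussian variance of the process noise of pixel $i$ at time $t$. Then, the prior coordinates of pixel $i$, denoted as $\boldsymbol{\theta}_i^{-}$, should follow the distribution
$$\boldsymbol{\theta}_i^{-} \sim \mathcal{N}(\hat{\boldsymbol{\theta}}_i^{-}, r_i^2 \mathbf{I}_3), \tag{13}$$
where $r_i^2 = \tilde{\sigma}_i^2 + w_i^2$. Taking the negative logarithm of the PDF of $\boldsymbol{\theta}_i^{-}$, we get the loss of the process system as
$$\mathcal{L}_{\text{prior}} = \sum_{i=1}^{N} \left( 3 \log r_i + \frac{\| \hat{\boldsymbol{\theta}}_i^{-} - \mathbf{y}_i \|_2^2}{2 r_i^2} \right). \tag{14}$$
It is noteworthy that the loss definition uses the prior distribution of $\boldsymbol{\theta}_i^{-}$ to provide the weak supervision for training OFlowNet, with no recourse to the optical flow labeling.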
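The flow-guided warping that produces the prior can be sketched as below: the previous coordinate and variance maps are sampled bilinearly at the flow-displaced positions, and the process noise variance is added to the warped variance. The function names, the (dy, dx) flow convention pointing from the current frame into the previous one, and the border clamping are assumptions of this example.

```python
import numpy as np

def bilinear_warp(prev_map, flow):
    """Sample prev_map at p + flow(p) with bilinear interpolation.

    prev_map: (H, W, C) map at time t-1
    flow:     (H, W, 2) offsets (dy, dx) from frame t into frame t-1
    """
    H, W, _ = prev_map.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    y = np.clip(ys + flow[..., 0], 0, H - 1)     # clamp to the image (assumption)
    x = np.clip(xs + flow[..., 1], 0, W - 1)
    y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
    y1, x1 = np.minimum(y0 + 1, H - 1), np.minimum(x0 + 1, W - 1)
    wy, wx = (y - y0)[..., None], (x - x0)[..., None]
    top = (1 - wx) * prev_map[y0, x0] + wx * prev_map[y0, x1]
    bot = (1 - wx) * prev_map[y1, x0] + wx * prev_map[y1, x1]
    return (1 - wy) * top + wy * bot

def prior_from_warp(coords_prev, var_prev, flow, process_var):
    """Prior mean and variance per pixel after the process step."""
    theta_prior = bilinear_warp(coords_prev, flow)
    r2 = bilinear_warp(var_prev[..., None], flow)[..., 0] + process_var
    return theta_prior, r2

# Toy map whose first channel equals the column index.
coords_prev = np.zeros((3, 4, 3))
coords_prev[..., 0] = np.arange(4)[None, :]
var_prev = np.full((3, 4), 0.04)
flow = np.zeros((3, 4, 2))
flow[..., 1] = 0.5                               # half a pixel to the right
theta_prior, r2 = prior_from_warp(coords_prev, var_prev, flow, 0.01)
print(theta_prior[1, 1, 0], r2[1, 1])
```

Each output pixel is a weighted sum of at most four input pixels, which corresponds to one sparse row of the state transition matrix.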
One issue with the proposed process system is that it assumes no occurrence of occlusions or dynamic objects which are two outstanding challenges for tracking problems [28, 77]. Our process system partially addresses the issue by giving the uncertainty measurements of the process noise. As shown in Fig. 2(b), OFlowNet generally produces much larger uncertainty estimations for the pixels from occluded areas and dynamic objects. This helps to give lower weights to these pixels that have incorrect flow predictions in the loss computation.
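The measurement and process systems in the previous two sections have derived the likelihood and prior estimations of the scene coordinates $\boldsymbol{\theta}_t$, respectively. The filtering system aims to fuse both of them based on Eq. 9 to yield the posterior estimation.
For a pixel $i$ at time $t$, $\mathcal{N}(\mathbf{z}_i, v_i^2 \mathbf{I}_3)$ and $\mathcal{N}(\hat{\boldsymbol{\theta}}_i^{-}, r_i^2 \mathbf{I}_3)$ are respectively the likelihood and prior distributions of its scene coordinates. Putting the variables in Eqs. 6 & 9, we evaluate the innovation and the Kalman gain at pixel $i$ as
$$\mathbf{e}_i = \mathbf{z}_i - \hat{\boldsymbol{\theta}}_i^{-}, \qquad k_i = \frac{r_i^2}{v_i^2 + r_i^2}. \tag{15}$$
Imposing the linear Gaussian postulate of the Kalman filter, the fused scene coordinates of pixel $i$ with the least square error follow the posterior distribution below [40]:
$$\boldsymbol{\theta}_i^{+} \sim \mathcal{N}(\hat{\boldsymbol{\theta}}_i^{+}, \sigma_i^2 \mathbf{I}_3), \tag{16}$$
where $\hat{\boldsymbol{\theta}}_i^{+} = \hat{\boldsymbol{\theta}}_i^{-} + k_i \mathbf{e}_i$ and $\sigma_i^2 = r_i^2 (1 - k_i)$. Hence, the Kalman filtering system is parameter-free, with the loss defined based on the posterior distribution:
$$\mathcal{L}_{\text{posterior}} = \sum_{i=1}^{N} \left( 3 \log \sigma_i + \frac{\| \hat{\boldsymbol{\theta}}_i^{+} - \mathbf{y}_i \|_2^2}{2 \sigma_i^2} \right), \tag{17}$$
which is then added to the full loss that allows the end-to-end training of KFNet as below:
$$\mathcal{L}_{\text{full}} = \tau_1 \mathcal{L}_{\text{likelihood}} + \tau_2 \mathcal{L}_{\text{prior}} + \tau_3 \mathcal{L}_{\text{posterior}}. \tag{18}$$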
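Because the covariances are isotropic, the fusion of the prior and the measurement reduces to scalar arithmetic per pixel. A minimal sketch is given below, where the function name `kalman_fuse` and the array layout are assumptions made for illustration.

```python
import numpy as np

def kalman_fuse(theta_prior, r2, z, v2):
    """Per-pixel Kalman fusion with isotropic variances.

    theta_prior: (N, 3) prior scene coordinates, r2: (N,) prior variances
    z:           (N, 3) measured scene coordinates, v2: (N,) measurement variances
    """
    e = z - theta_prior                  # innovation
    k = r2 / (v2 + r2)                   # scalar Kalman gain per pixel
    theta_post = theta_prior + k[:, None] * e
    sigma2 = r2 * (1.0 - k)              # posterior variance
    return theta_post, sigma2, e, k

theta_prior = np.zeros((2, 3))
z = np.ones((2, 3))
r2 = np.array([0.04, 0.04])
v2 = np.array([0.04, 1e12])              # second measurement is very uncertain
theta_post, sigma2, e, k = kalman_fuse(theta_prior, r2, z, v2)
print(theta_post)
print(sigma2)
```

When both variances are equal, the fused estimate is the midpoint and the variance is halved; when the measurement is very uncertain, the posterior stays close to the prior. No learnable parameter is involved.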
In practice, the filter could behave incorrectly due to the outlier estimations caused by the erratic scene coordinate regression or a failure of flow tracking. This would induce accumulated state errors in the long run. Therefore, we use the statistical assessment tool, Normalized Innovation Squared (NIS) [5], to filter the inconsistent predictions during inference.
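A per-pixel NIS check can be sketched as follows. The sketch assumes that the innovation covariance of a pixel is the sum of its prior and measurement variances times the identity, so that the NIS is the squared innovation norm divided by that sum and is compared against a Chi-squared quantile with three degrees of freedom. The 95% quantile used here (about 7.815), the function names and the reset helper are assumptions made for illustration.

```python
import math
import numpy as np

# Assumed critical value: 95% quantile of the Chi-squared distribution, 3 dof.
CHI2_3DOF_95 = 7.815

def chi2_cdf_3dof(x):
    """Closed-form CDF of the Chi-squared distribution with 3 degrees of freedom."""
    return math.erf(math.sqrt(x / 2.0)) - math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

def nis_test(e, r2, v2, critical=CHI2_3DOF_95):
    """Return the NIS value of each pixel and a mask of pixels passing the test.

    e: (N, 3) innovations, r2: (N,) prior variances, v2: (N,) measurement variances
    """
    nis = np.sum(e ** 2, axis=1) / (r2 + v2)
    return nis, nis <= critical

def reset_outliers(sigma2, passed):
    """Give failing pixels infinite variance so they carry no weight later."""
    return np.where(passed, sigma2, np.inf)

e = np.array([[1.0, 1.0, 1.0], [0.1, 0.0, 0.0]])
r2 = np.array([0.05, 0.05])
v2 = np.array([0.05, 0.05])
nis, passed = nis_test(e, r2, v2)
print(nis, passed)
print(reset_outliers(np.array([0.025, 0.025]), passed))
```

A pixel whose variance is reset to infinity contributes a zero gain from the prior side at the next step, so the filter falls back on the measurement for that pixel.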


Table 2: Median translation and rotation errors. Columns 2 to 6 are one-shot relocalization methods and columns 7 to 11 are temporal relocalization methods; "-" marks results that are not reported.
scenes | MapNet [9] | CamNet [14] | AS [52] | DSAC++ [7] | SCoordNet | VidLoc [12] | LSTM-KF [13] | VLocNet++ [46] | LSG [67] | KFNet
7scenes
chess | 0.08m, 3.25° | 0.04m, 1.73° | 0.04m, 1.96° | 0.02m, 0.5° | 0.019m, 0.63° | 0.18m, - | 0.33m, 6.9° | 0.023m, 1.44° | 0.09m, 3.28° | 0.018m, 0.65°
fire | 0.27m, 11.7° | 0.03m, 1.74° | 0.03m, 1.53° | 0.02m, 0.9° | 0.023m, 0.91° | 0.26m, - | 0.41m, 15.7° | 0.018m, 1.39° | 0.26m, 10.92° | 0.023m, 0.90°
heads | 0.18m, 13.3° | 0.05m, 1.98° | 0.02m, 1.45° | 0.01m, 0.8° | 0.018m, 1.26° | 0.21m, - | 0.28m, 13.01° | 0.016m, 0.99° | 0.17m, 12.70° | 0.014m, 0.82°
office | 0.17m, 5.15° | 0.04m, 1.62° | 0.09m, 3.61° | 0.03m, 0.7° | 0.026m, 0.73° | 0.36m, - | 0.43m, 7.65° | 0.024m, 1.14° | 0.18m, 5.45° | 0.025m, 0.69°
pumpkin | 0.22m, 4.02° | 0.04m, 1.64° | 0.08m, 3.10° | 0.04m, 1.1° | 0.039m, 1.09° | 0.31m, - | 0.49m, 10.63° | 0.024m, 1.45° | 0.20m, 3.69° | 0.037m, 1.02°
redkitchen | 0.23m, 4.93° | 0.04m, 1.63° | 0.07m, 3.37° | 0.04m, 1.1° | 0.039m, 1.18° | 0.26m, - | 0.57m, 8.53° | 0.025m, 2.27° | 0.23m, 4.92° | 0.038m, 1.16°
stairs | 0.30m, 12.1° | 0.04m, 1.51° | 0.03m, 2.22° | 0.09m, 2.6° | 0.037m, 1.06° | 0.14m, - | 0.46m, 14.56° | 0.021m, 1.08° | 0.23m, 11.3° | 0.033m, 0.94°
Average | 0.207m, 7.78° | 0.040m, 1.69° | 0.051m, 2.46° | 0.036m, 1.10° | 0.029m, 0.98° | 0.246m, - | 0.424m, 11.00° | 0.022m, 1.39° | 0.190m, 7.47° | 0.027m, 0.88°
Cambridge
GreatCourt | - | - | - | 0.40m, 0.2° | 0.43m, 0.20° | - | - | - | - | 0.42m, 0.21°
KingsCollege | 1.07m, 1.89° | - | 0.42m, 0.55° | 0.18m, 0.3° | 0.16m, 0.29° | - | 2.01m, 5.35° | - | - | 0.16m, 0.27°
OldHospital | 1.94m, 3.91° | - | 0.44m, 1.01° | 0.20m, 0.3° | 0.18m, 0.29° | - | 2.35m, 5.05° | - | - | 0.18m, 0.28°
ShopFacade | 1.49m, 4.22° | - | 0.12m, 0.40° | 0.06m, 0.3° | 0.05m, 0.34° | - | 1.63m, 6.89° | - | - | 0.05m, 0.31°
StMarysChurch | 2.00m, 4.53° | - | 0.19m, 0.54° | 0.13m, 0.4° | 0.12m, 0.36° | - | 2.61m, 8.94° | - | - | 0.12m, 0.35°
Street | - | - | 0.85m, 0.83° | - | - | - | 3.05m, 5.62° | - | - | -
Average* | 1.63m, 3.64° | - | 0.29m, 0.63° | 0.14m, 0.33° | 0.13m, 0.32° | - | 2.15m, 6.56° | - | - | 0.13m, 0.30°
DeepLoc | - | - | 0.010m, 0.04° | - | 0.083m, 0.45° | - | - | 0.320m, 1.48° | - | 0.065m, 0.43°
* The average does not include errors of GreatCourt and Street as some methods do not report results of the two scenes.
Normally, the innovation variable $\mathbf{e}_i$ follows the Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{S}_i)$ as shown by Eq. 8, where $\mathbf{S}_i = (v_i^2 + r_i^2) \mathbf{I}_3$. Then, $\text{NIS} = \mathbf{e}_i^{\top} \mathbf{S}_i^{-1} \mathbf{e}_i$ is supposed to follow the Chi-squared distribution with three degrees of freedom, denoted as $\chi^2(3)$. It is thus reasonable to see a pixel state as an outlier if its NIS value lies outside the acceptance region of $\chi^2(3)$. As illustrated in Fig. 4, we use the critical value at significance level $\alpha$ in the NIS test, which means we have at least $1 - \alpha$ statistical evidence to regard one pixel state as negative. The uncertainties of the pixels failing the test are reset to be infinitely large so that they will have no effect in later steps.
Datasets. Following previous works [26, 6, 7, 46], we use two indoor datasets, 7scenes [56] and 12scenes [61], and two outdoor datasets, DeepLoc [46] and Cambridge [26], for evaluation. Each scene has been split into different strides of sequences for training and testing.
Data processing. Images are downsized to for 7scenes and 12scenes, for DeepLoc and Cambridge. The ground-truth scene coordinates of 7scenes and 12scenes are computed based on given camera poses and depth maps, whereas those of DeepLoc and Cambridge are rendered from surfaces reconstructed with training images.
Training. Our best practice chooses the parameter setting as , , . The ADAM optimizer [27] is used with and . We use an initial learning rate of and then drop it with exponential decay. The training procedure has 3 stages. First, we train SCoordNet for each scene with the likelihood loss (Eq. 10). The iteration number is set to be proportional to the surface area of each scene and the learning rate drops from to . In particular, we use SCoordNet as the one-shot version of the proposed approach. Second, OFlowNet is trained using all the scenes for each dataset with the prior loss (Eq. 14). Its learning rate also decays from to . Each batch is composed of two consecutive frames. The window size of OFlowNet in the original images is set to 64, 128, 192 and 256 for the four datasets mentioned above, respectively, due to the increasing ego-motion through them. Third, we fine-tune all the parameters of KFNet jointly by optimizing the full loss (Eq. 18) with a learning rate going from to . Each batch in the third stage contains four consecutive frames.
Following [6, 7, 12, 60], we use two accuracy metrics: (1) the median rotation and translation error of poses (see Table 2); (2) the 5cm-5deg accuracy (see Table 3), i.e., the mean percentage of the poses with translation and rotation errors less than 5 cm and 5°, respectively. The uncertainty threshold (Sec. 3) is set to 5 cm for 7scenes and 12scenes and 50 cm for DeepLoc and Cambridge.
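The two metrics can be computed from estimated and ground-truth poses as sketched below. The sketch assumes that each pose is given as a rotation matrix and a camera position in metres, and that the rotation error is the angle of the relative rotation; the function names and the toy error list are hypothetical.

```python
import numpy as np

def pose_errors(R_est, t_est, R_gt, t_gt):
    """Return (translation error in metres, rotation error in degrees)."""
    t_err = np.linalg.norm(t_est - t_gt)
    # Angle of the relative rotation, from the trace of R_est^T R_gt.
    cos = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    r_err = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))
    return t_err, r_err

def summarize(errors, t_thresh=0.05, r_thresh=5.0):
    """Median errors and the percentage of poses within both thresholds."""
    errors = np.asarray(errors, dtype=float)
    med_t, med_r = np.median(errors, axis=0)
    within = (errors[:, 0] < t_thresh) & (errors[:, 1] < r_thresh)
    return med_t, med_r, 100.0 * np.mean(within)

def rot_z(deg):
    """Rotation about the z axis, used only to build a toy example."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0, 0.0, 1.0]])

t_err, r_err = pose_errors(rot_z(10.0), np.array([0.03, 0.04, 0.0]),
                           np.eye(3), np.zeros(3))
print(t_err, r_err)
print(summarize([(0.01, 1.0), (0.02, 2.0), (0.10, 8.0)]))
```

A pose counts towards the 5cm-5deg accuracy only if both of its errors are below the thresholds, so this metric is stricter than either median taken alone.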
One-shot relocalization. Our SCoordNet achieves the lowest pose errors on 7scenes and Cambridge, and the highest 5cm-5deg accuracy on 12scenes among the one-shot methods, surpassing CamNet [14] and MapNet [9] which are the state-of-the-art relative and absolute pose regression methods, respectively. Particularly, SCoordNet outperforms the state-of-the-art structure-based methods DSAC++ [7] and ESAC [8], yet with fewer parameters (M vs. M vs. M, respectively). The advantage of SCoordNet should be mainly attributed to the uncertainty modeling, as we will analyze in Appendix C. It also surpasses Active Search (AS) [52] on 7scenes and Cambridge, but underperforms AS on DeepLoc. We find that, in the experiments of AS on DeepLoc [53], AS is tested on an SfM model built with both training and test images. This may explain why AS is surprisingly more accurate on DeepLoc than on other datasets, since the 2D-3D matches between test images and SfM tracks have been established and their geometry has been optimized during the SfM reconstruction.
Temporal relocalization. Our KFNet improves over SCoordNet on all the datasets as shown in Tables 2 & 3. The improvement on Cambridge is marginal as the images are sampled sparsely from videos. The overly large motions between frames make it hard to model the temporal correlations. KFNet obtains much lower pose errors than other temporal methods, except that it has a larger translation error than VLocNet++ [46] on 7scenes. However, the performance of VLocNet++ is inconsistent across different datasets. On DeepLoc, the dataset collected by the authors of VLocNet++, VLocNet++ has a much larger pose error than KFNet, even though it also integrates semantic segmentation into learning. The inconsistency is also observed in [53], which shows that VLocNet++ cannot substantially exceed the accuracy of retrieval-based methods [59, 3].


7scenes  12scenes  DeepLoc  Cambridge  
mean  stddev  mean  stddev  mean  stddev  mean  stddev  
DSAC++ [7]  28.8  33.1  28.8  47.1      467.3  883.7 
SCoordNet  16.8  23.3  9.8  20.0  883.0  1520.8  272.7  497.6 
KFNet  15.3  21.7  7.3  13.7  200.79  398.8  241.5  441.7 

Table 4: The mean and standard deviation of predicted scene coordinate errors in centimeters.
Relocalization methods based on SCoRe [56, 7] can create a mapping result for each view by predicting per-pixel scene coordinates. Hence, relocalization and mapping can be seen as dual problems, as one can be easily resolved once the other is known. Here, we would like to evaluate the mapping accuracy with the mean and the standard deviation (stddev) of scene coordinate errors of the test images.
As shown in Table 4, the mapping accuracy is in accordance with the relocalization accuracy reported in Sec. 7.2.1. SCoordNet reduces the mean and stddev values greatly compared against DSAC++, and KFNet further reduces the mean error over SCoordNet by 8.9%, 25.5%, 77.3% and 11.4% on the four datasets, respectively. The improvements are also reflected in the predicted point clouds, as visualized in Fig. 5. SCoordNet and KFNet predict less noisy scene points with better temporal consistency compared with DSAC++. Additionally, we filter out the points of KFNet with uncertainties greater than a threshold, as displayed in the KFNet-filtered panel of Fig. 5, which helps to give much neater and more accurate 3D point clouds.
Although, in terms of the mean scene coordinate error in Table 4, SCoordNet outperforms DSAC++ by over 41% and KFNet further improves SCoordNet by a range from 8.9% to 77.3%, the improvements in terms of the median pose error in Table 2 are not as significant. The main reason is that the RANSAC-based PnP solver diminishes the benefits brought by the scene coordinate improvements, since only a small subset of accurate scene coordinates selected by RANSAC matters in the pose accuracy. Therefore, to highlight the advantage of KFNet, we conduct more challenging experiments over motion blur images which are quite common in real scenarios. For the test image sequences of 7scenes, we apply a motion blur filter with a kernel size of 30 pixels for every 10 images as shown in Fig. 6(a). In Fig. 6(b)&(c), we plot the cumulative distribution functions of the pose errors before and after applying motion blur. Thanks to the uncertainty reasoning, SCoordNet generally attains smaller pose errors than DSAC++ whether or not motion blur is present. While SCoordNet and DSAC++ show a performance drop after motion blur is applied, KFNet maintains the pose accuracy as shown in Fig. 6(b)&(c), leading to a more notable margin between KFNet and SCoordNet and demonstrating the benefit of the temporal modelling used by KFNet.


Oneshot  Temporal  
SCoordNet  ConvLSTM [65]  TPooler [44]  SWeight [75]  KFNet 
0.029m, 0.98°  0.040m, 1.12°  0.029m, 0.94°  0.029m, 0.95°  0.027m, 0.88° 

This section studies the efficacy of our Kalman filter based framework in comparison with other popular temporal aggregation strategies including ConvLSTM [65, 29], temporal pooler (TPooler) [44] and similarity weighting (SWeight) [75, 76]. KFNet is more related to TPooler and SWeight which also use the flow-guided warping yet within an n-frame neighborhood. For equitable comparison, the same feature network and probabilistic losses as KFNet are applied to all. We use a kernel size of for ConvLSTM to ensure a window size of in images. The same OFlowNet structure and a frame neighborhood are used for TPooler and SWeight for flow-guided warping.
Table 5 shows the comparative results on 7scenes. ConvLSTM largely underperforms SCoordNet and other aggregation methods in pose accuracy, which manifests the necessity of explicitly determining the pixel associations between frames instead of implicit modeling. Although the flow-guided warping is employed, TPooler and SWeight only achieve marginal improvements over SCoordNet compared with KFNet, which justifies the advantage of the Kalman filtering system. Compared with TPooler and SWeight, the Kalman filter behaves as a more disciplined and non-heuristic approach to temporal aggregation that ensures an optimal solution of the linear Gaussian state-space model [18] defined in Sec. 3.
Here, we explore the functionality of the consistency examination which uses NIS testing [5] (see Sec. 6.2). Due to the infrequent occurrence of extreme outlier predictions among the well-built relocalization datasets, we simulate the tracking-lost situations by trimming a sub-sequence off each testing sequence of 7scenes and 12scenes. Let $t_1$ and $t_2$ denote the last frame before and the first frame after the trimming. The discontinuous motion from $t_1$ to $t_2$ would cause outlier scene coordinate predictions for $t_2$ by KFNet. Fig. 7 plots the mean pose and scene coordinate errors of frames around $t_2$ and visualizes the poses of a sample trimmed sequence. With the NIS test, the errors revert to a normal level promptly right after $t_2$, whereas without the NIS test, the accuracy of poses after $t_2$ is affected adversely. NIS testing stops the propagation of the outlier predictions of $t_2$ into later steps by giving them infinitely large uncertainties, so that the following frame will leave out the prior from $t_2$ and reinitialize itself with the predictions of the measurement system.
This work addresses the temporal camera relocalization problem by proposing a recurrent network named KFNet. It extends the scene coordinate regression problem to the time domain for online pose determination. The architecture and the loss definition of KFNet are based on the Kalman filter, which allows a disciplined manner of aggregating the pixel-level predictions through time. The proposed approach yields the top accuracy among the state-of-the-art relocalization methods over multiple benchmarks. Although KFNet is only validated on the camera relocalization task, its immediate application to other tasks like video processing [21, 29], segmentation [63, 42] and object tracking [35, 76] can be anticipated.
Appendices
As a supplement to the main paper, we detail the parameters of the layers of SCoordNet and OFlowNet used for training 7scenes in Table 10 at the end of the appendix.
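Let us denote the bivariate Gaussian distribution of the latent state $\boldsymbol{\theta}_t$ and the innovation $\mathbf{e}_t$ conditional on $\mathcal{I}_{t-1}$ as
$$\left. \begin{bmatrix} \boldsymbol{\theta}_t \\ \mathbf{e}_t \end{bmatrix} \right| \mathcal{I}_{t-1} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}, \begin{bmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{bmatrix} \right), \tag{19}$$
where $\boldsymbol{\Sigma}_{21} = \boldsymbol{\Sigma}_{12}^{\top}$. Based on the multivariate statistics theorems [2], the conditional distribution of $\boldsymbol{\theta}_t$ given $\mathbf{e}_t$ is expressed as
$$(\boldsymbol{\theta}_t \mid \mathbf{e}_t, \mathcal{I}_{t-1}) \sim \mathcal{N}\left( \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{e}_t - \boldsymbol{\mu}_2), \; \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21} \right), \tag{20}$$
and similarly,
$$(\mathbf{e}_t \mid \boldsymbol{\theta}_t, \mathcal{I}_{t-1}) \sim \mathcal{N}\left( \boldsymbol{\mu}_2 + \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1} (\boldsymbol{\theta}_t - \boldsymbol{\mu}_1), \; \boldsymbol{\Sigma}_{22} - \boldsymbol{\Sigma}_{21} \boldsymbol{\Sigma}_{11}^{-1} \boldsymbol{\Sigma}_{12} \right). \tag{21}$$
Conversely, if Eq. 20 holds and $(\mathbf{e}_t \mid \mathcal{I}_{t-1}) \sim \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_{22})$, Eq. 19 will also hold according to [2]. Since we have had $(\boldsymbol{\theta}_t \mid \mathcal{I}_{t-1}) \sim \mathcal{N}(\hat{\boldsymbol{\theta}}_t^{-}, \mathbf{R}_t)$ in Eq. 4 of the main paper, we can note that
$$\boldsymbol{\mu}_1 = \hat{\boldsymbol{\theta}}_t^{-}, \qquad \boldsymbol{\Sigma}_{11} = \mathbf{R}_t. \tag{22}$$
Recalling Eq. 7 of the main paper, we already have
$$(\mathbf{e}_t \mid \boldsymbol{\theta}_t, \mathcal{I}_{t-1}) \sim \mathcal{N}(\boldsymbol{\theta}_t - \hat{\boldsymbol{\theta}}_t^{-}, \mathbf{V}_t). \tag{23}$$
Equalizing Eq. 21 and Eq. 23, we have
$$\boldsymbol{\mu}_2 = \mathbf{0}, \qquad \boldsymbol{\Sigma}_{12} = \boldsymbol{\Sigma}_{21} = \mathbf{R}_t, \qquad \boldsymbol{\Sigma}_{22} = \mathbf{R}_t + \mathbf{V}_t. \tag{24}$$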
Substituting the variables of Eqs. 22 & 24 into Eqs. 19 & 20, we have reached the distributions 8 & 9 in the main paper.
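The conditioning argument above can be checked numerically. The sketch below is illustrative only and rests on stated assumptions: an identity observation model (the measurement equals the state plus noise), a prior covariance `R` and a measurement noise covariance `V`, all with made-up values. It conditions the joint Gaussian of state and innovation on the innovation and compares the result with an independent route, the information-form product of the prior and the measurement likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_spd(n):
    """Random symmetric positive-definite matrix (illustrative values only)."""
    a = rng.normal(size=(n, n))
    return a @ a.T + n * np.eye(n)


n = 3
R = random_spd(n)                 # assumed prior covariance of the state
V = random_spd(n)                 # assumed measurement noise covariance
prior_mean = rng.normal(size=n)   # assumed prior mean of the state
z = rng.normal(size=n)            # one measurement
e = z - prior_mean                # innovation

# Joint Gaussian of (state, innovation): means (prior_mean, 0) and
# covariance blocks [[R, R], [R, R + V]].
S_tt, S_te, S_ee = R, R, R + V
cond_mean = prior_mean + S_te @ np.linalg.solve(S_ee, e)
cond_cov = S_tt - S_te @ np.linalg.solve(S_ee, S_te.T)

# Independent route: multiply prior and likelihood in information form.
R_inv, V_inv = np.linalg.inv(R), np.linalg.inv(V)
bayes_cov = np.linalg.inv(R_inv + V_inv)
bayes_mean = bayes_cov @ (R_inv @ prior_mean + V_inv @ z)

print(np.allclose(cond_mean, bayes_mean), np.allclose(cond_cov, bayes_cov))
```

Both routes give the same posterior mean and covariance, which is the equivalence the derivation relies on.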
The uncertainty modeling, which helps to quantify the measurement and process noise, is an indispensable component of KFNet. In this section, we conduct ablation studies on it.
First, we run the trained KFNet of each scene from 7scenes and 12scenes over the test images of every scene exhaustively and visualize the median uncertainties as a confusion matrix in Fig. 8(a). The uncertainties on the main diagonal, where the model and the test images come from the same scene, are much lower than those between different scenes. This indicates that meaningful uncertainties are learned, which can be used for scene recognition. Second, we qualitatively compare SCoordNet and OFlowNet against their counterparts that are trained with an L2 loss without uncertainty modeling. The cumulative distribution functions (CDFs) of scene coordinate errors tested on 7scenes and 12scenes are shown in Fig. 8(b). The uncertainty modeling leads to more accurate predictions for both SCoordNet and OFlowNet. We attribute the improvements to the fact that the uncertainties apply auto-weighting to the loss term of each pixel as in Eqs. 10 & 14 of the main paper, which prevents the learning from getting stuck on hard or infeasible examples such as the boundary pixels for SCoordNet and the occluded pixels for OFlowNet (see Fig. 2 of the main paper).
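The auto-weighting effect can be illustrated with a per-pixel Gaussian negative log-likelihood. The sketch below assumes an isotropic Gaussian over 3-D scene coordinates with standard deviation `sigma`; the exact losses of Eqs. 10 & 14 are defined in the main paper and are not reproduced here, and the function names are hypothetical.

```python
import numpy as np


def l2_loss(pred, target):
    """Plain squared error per pixel, without uncertainty."""
    diff = np.asarray(pred, dtype=float) - np.asarray(target, dtype=float)
    return np.sum(diff ** 2, axis=-1)


def gaussian_nll(pred, target, sigma):
    """Negative log-likelihood (up to a constant) of an isotropic 3-D Gaussian."""
    sigma = np.asarray(sigma, dtype=float)
    return 3.0 * np.log(sigma) + l2_loss(pred, target) / (2.0 * sigma ** 2)


target = np.zeros((2, 3))
pred = np.array([[0.1, 0.0, 0.0],    # easy pixel: small residual
                 [5.0, 0.0, 0.0]])   # hard pixel: e.g. occluded or on a boundary

# Same uncertainty everywhere versus a larger uncertainty on the hard pixel.
uniform = gaussian_nll(pred, target, sigma=np.array([1.0, 1.0]))
adaptive = gaussian_nll(pred, target, sigma=np.array([1.0, 5.0]))
print(uniform)
print(adaptive)
```

Raising `sigma` on the hard pixel scales its residual term by 1/(2 sigma^2) at the price of the log term, so a network that predicts its own uncertainty can reduce the influence of pixels it cannot fit while leaving the easy pixels untouched.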




Table 6: Kernel size and stride (kernel, stride) of layers L7 to L12 of SCoordNet for each receptive field and downsample rate.
Downsample rate  Receptive field  L7    L8    L9    L10   L11   L12
Varying receptive field:
8                29               1, 2  1, 1  1, 1  1, 1  1, 1  1, 1
8                45               3, 2  1, 1  1, 1  1, 1  1, 1  1, 1
8                61               3, 2  3, 1  1, 1  1, 1  1, 1  1, 1
8                93               3, 2  3, 1  3, 1  3, 1  1, 1  1, 1
8                125              3, 2  3, 1  3, 1  3, 1  3, 1  3, 1
8                157              3, 2  3, 1  5, 1  5, 1  3, 1  3, 1
8                189              3, 2  3, 1  5, 1  5, 1  5, 1  5, 1
8                221              3, 2  3, 1  7, 1  7, 1  5, 1  5, 1
Varying downsample rate:
4                93               3, 1  3, 1  5, 1  5, 1  3, 1  3, 1
8                93               3, 2  3, 1  3, 1  3, 1  1, 1  1, 1
16               93               3, 2  3, 1  3, 2  1, 1  1, 1  1, 1
32               93               3, 2  3, 1  3, 2  1, 1  1, 2  1, 1




Table 7: Relocalization and mapping accuracy of SCoordNet on heads for different receptive fields.
                 Relocalization accuracy       Mapping accuracy
Receptive field  pose error     pose accuracy  mean    stddev
29               0.025m, 0.87°  87.9%          29.6cm  32.3
45               0.023m, 0.88°  93.4%          24.4cm  29.2
61               0.023m, 0.84°  94.0%          17.3cm  23.1
93               0.024m, 0.91°  92.9%          11.5cm  16.4
125              0.026m, 0.95°  88.3%          11.7cm  16.1
157              0.026m, 0.97°  86.6%          10.3cm  15.0
189              0.030m, 1.07°  81.0%          10.3cm  13.9
221              0.031m, 1.22°  71.8%          9.5cm   12.9

The receptive field is an essential factor of Convolutional Neural Network (CNN) design. In our case, it determines how many image observations around a pixel are exposed and used for scene coordinate prediction. Here, we evaluate the impact of the receptive field on the performance of SCoordNet. The SCoordNet presented in the main paper has a receptive field of 93. We change the kernel sizes of the 7th to 12th layers of SCoordNet to adjust the receptive field to 29, 45, 61, 93, 125, 157, 189 and 221, as shown in Table 6. Due to time limitations, the evaluation only runs on the heads scene of the 7scenes dataset [56]. As reported in Table 7, the mean scene coordinate error grows as the receptive field decreases. We illustrate the CDF of scene coordinate errors in Fig. 9. It is noteworthy that a smaller receptive field results in more outlier predictions, which cause a larger mean scene coordinate error. However, a larger mean scene coordinate error does not necessarily lead to a decrease in relocalization accuracy. For example, a receptive field of 61 has worse mapping accuracy than the larger receptive fields, but it achieves a smaller pose error and a better pose accuracy than they do. As we can see from Fig. 9, a smaller receptive field yields a larger portion of precise scene coordinate predictions. These predictions are crucial to the accuracy of pose determination, as the outlier predictions are generally filtered by RANSAC. Nevertheless, when we further reduce the receptive field from 61 to 45 and then 29, a drop in relocalization accuracy is observed. This is because, as the receptive field decreases, the growing number of outlier predictions deteriorates the robustness of pose computation. A mid-sized receptive field therefore respects the tradeoff between precision and robustness; in Table 7, the pose accuracy peaks at a receptive field of 61.
Due to the cost of dense predictions over full-resolution images, we predict scene coordinates for images downsized by a factor of 8 in the main paper, following previous works [7]. In this section, we explore how the downsample rate affects the tradeoff between accuracy and efficiency of SCoordNet. As reported in Table 6, we change the kernel sizes and strides of the 7th to 12th layers to adjust the downsample rate to 4, 8, 16 and 32 with the same receptive field of 93.
The mean accuracy and the average time taken to localize the frames of heads are reported in Table 8. As intuitively expected, a larger downsample rate generally leads to a drop in relocalization and mapping accuracy, as well as an increase in speed. For example, downsample rates of 4 and 8 perform comparably (93.6% and 92.9% pose accuracy), while both outperform a downsample rate of 32 (79.6%) by a large margin. On the upside, a larger downsample rate is appealing because of its higher efficiency, which scales quadratically with the downsample rate. For real-time applications, a downsample rate of 32 allows for a low latency of 34 ms per frame, that is, a frequency of about 29 Hz. (All the experiments of this work run on a machine with an 8-core Intel i7-4770K, 32GB of memory and an NVIDIA GTX 1080 Ti graphics card.)



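Table 8: Relocalization accuracy, mapping accuracy and time per frame of SCoordNet on heads for different downsample rates.
                 Relocalization accuracy       Mapping accuracy
Downsample rate  pose error     pose accuracy  mean    stddev  Time
4                0.024m, 0.97°  93.6%          11.2cm  17.3    1.34s
8                0.024m, 0.91°  92.9%          11.5cm  16.4    0.20s
16               0.025m, 0.92°  89.1%          16.3cm  20.5    0.11s
32               0.029m, 1.06°  79.6%          20.7cm  20.7    0.034s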
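The efficiency trend in Table 8 can be made concrete with a small calculation. The sketch assumes a 640x480 input image (an assumption of this example, not a value stated in this section) and uses the per-frame times from Table 8. It prints the size of the predicted coordinate map and the resulting frame rate for each downsample rate.

```python
WIDTH, HEIGHT = 640, 480  # assumed input resolution

# Per-frame localization time in seconds, copied from Table 8.
time_per_frame = {4: 1.34, 8: 0.20, 16: 0.11, 32: 0.034}

rows = []
for rate, seconds in time_per_frame.items():
    grid_w, grid_h = WIDTH // rate, HEIGHT // rate
    n_pred = grid_w * grid_h      # number of scene coordinates predicted
    fps = 1.0 / seconds           # frames per second at this latency
    rows.append((rate, grid_w, grid_h, n_pred, fps))
    print(f"rate {rate:2d}: {grid_w}x{grid_h} = {n_pred} predictions, {fps:.1f} Hz")
```

Doubling the downsample rate quarters the number of predictions, which is the quadratic scaling mentioned in the text.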

Table 9 reports the mean running time per frame of the measurement, process and filtering systems and the NIS test on an NVIDIA GTX 1080 Ti. Since the measurement and process systems are independent and can run in parallel, the total time per frame is 157.18 ms (the measurement time plus the filtering and NIS times), which means KFNet causes an extra overhead of only 0.58 ms compared to the one-shot SCoordNet. Besides, KFNet is about 3 times faster than the state-of-the-art one-shot relocalization system DSAC++ [7].


Table 9: Mean running time per frame of KFNet and DSAC++.
           KFNet                                          DSAC++
Modules    Measurement  Process  Filtering  NIS   Total   -
Time (ms)  156.60       51.23    0.29       0.29  157.18  486.07
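The total in Table 9 follows from running the measurement and process systems concurrently, so only the slower of the two counts. As a quick check of the arithmetic (the variable names are illustrative):

```python
# Mean per-frame times in milliseconds, copied from Table 9.
measurement, process, filtering, nis_test = 156.60, 51.23, 0.29, 0.29
dsac_pp = 486.07

# Measurement and process systems run in parallel: take the slower one.
kfnet_total = max(measurement, process) + filtering + nis_test
overhead = kfnet_total - measurement   # extra cost over one-shot SCoordNet
speedup = dsac_pp / kfnet_total        # relative to DSAC++

print(f"total {kfnet_total:.2f} ms, overhead {overhead:.2f} ms, speedup {speedup:.2f}x")
```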

As a supplement to Fig. 5 in the main paper, we visualize the point clouds of 7scenes [56], 12scenes [61] and Cambridge [26] predicted by DSAC++ [7] and our KFNet-filtered in Fig. 10. The clean point clouds predicted by KFNet in an end-to-end way provide an efficient alternative to costly 3D reconstruction from scratch [73, 71, 69, 55, 38, 74, 70, 37, 68] in the relocalization setting, which should be valuable to mapping-based applications such as augmented reality.


Table 10: Layer parameters of SCoordNet and OFlowNet used for training 7scenes.
Input        Layer                                 Output      Output Size
SCoordNet
             Conv+ReLU, K=3x3, S=1, F=64           conv1a
conv1a       Conv+ReLU, K=3x3, S=1, F=64           conv1b
conv1b       Conv+ReLU, K=3x3, S=2, F=256          conv2a
conv2a       Conv+ReLU, K=3x3, S=1, F=256          conv2b
conv2b       Conv+ReLU, K=3x3, S=2, F=512          conv3a
conv3a       Conv+ReLU, K=3x3, S=1, F=512          conv3b
conv3b       Conv+ReLU, K=3x3, S=2, F=1024         conv4a
conv4a       Conv+ReLU, K=3x3, S=1, F=1024         conv4b
conv4b       Conv+ReLU, K=3x3, S=1, F=512          conv5
conv5        Conv+ReLU, K=3x3, S=1, F=256          conv6
conv6        Conv+ReLU, K=1x1, S=1, F=128          conv7
conv7        Conv, K=1x1, S=1, F=3
conv7        Conv+Exp, K=1x1, S=1, F=1
OFlowNet
             Conv+ReLU, K=3x3, S=1, F=16           feat1
feat1        Conv+ReLU, K=3x3, S=2, F=32           feat2
feat2        Conv+ReLU, K=3x3, S=1, F=32           feat3
feat3        Conv+ReLU, K=3x3, S=2, F=64           feat4
feat4        Conv+ReLU, K=3x3, S=1, F=64           feat5
feat5        Conv+ReLU, K=3x3, S=2, F=128          feat6
feat6        Conv, K=3x3, S=1, F=32
             Cost Volume Constructor               vol1
vol1         Reshape                               vol2
vol2         Conv+ReLU, K=3x3, S=1, F=32           vol3
vol3         Conv+ReLU, K=3x3, S=2, F=32           vol4
vol4         Conv+ReLU, K=3x3, S=1, F=32           vol5
vol5         Conv+ReLU, K=3x3, S=2, F=64           vol6
vol6         Conv+ReLU, K=3x3, S=1, F=64           vol7
vol7         Conv+ReLU, K=3x3, S=2, F=128          vol8
vol8         Conv+ReLU, K=3x3, S=1, F=128          vol9
vol9         Deconv+ReLU, K=3x3, S=2, F=64         vol10
vol10, vol7  Conv+ReLU, K=3x3, S=1, F=64           vol11
vol11        Deconv+ReLU, K=3x3, S=2, F=32         vol12
vol12, vol5  Conv+ReLU, K=3x3, S=1, F=32           vol13
vol13        Deconv+ReLU, K=3x3, S=2, F=16         vol14
vol14, vol3  Conv+ReLU, K=3x3, S=1, F=16           vol15
vol15        Conv, K=3x3, S=1, F=1                 confidence
confidence   Spatial Softmax [17]                  flow1
flow1        Reshape                               flow2
flow2,       Flow-guided Warping [75, 76, 41, 44]

vol9         Reshape                               fc1
fc1          FC+ReLU, F=64                         fc2
fc2          FC+ReLU, F=32                         fc3
fc3          FC+Exp, F=1                           fc4
fc4          Reshape

Learning visual feature spaces for robotic manipulation with deep spatial autoencoders.
Bayesian model discrimination and Bayes factors for linear Gaussian state space models. Journal of the Royal Statistical Society 57 (1), pp. 237–246.
Spatio-temporal transformer network for video restoration. In ECCV.
Geometric loss functions for camera pose regression with deep learning. In CVPR.
Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In NIPS.