In computer vision, we often have a model that explains an observation with a small set of parameters. For example, our model is the 6D pose (translation and rotation) of a camera, and our observations are images of a known 3D environment. The task of camera re-localization is then to robustly and accurately predict the 6D camera pose given the camera image. However, inferring model parameters from an observation is difficult because many effects are not explained by our model. People might move through the environment, and its appearance varies considerably due to lighting effects such as day versus night. We therefore usually map our observation to a representation from which we can infer model parameters more easily. For example, in camera re-localization we can train a neural network to predict correspondences between the 2D input image and the 3D environment. Inferring the camera pose from these correspondences is much easier, and various geometric solvers for this problem exist [21, 16, 26]. Because some predictions of the network might be erroneous, i.e. we have outlier correspondences, we utilize a robust estimator such as Random Sample Consensus (RANSAC), or its differentiable counterpart Differentiable Sample Consensus (DSAC), or other differentiable estimators [53, 35] for training.
For some tasks, the problem domain is large or ambiguous. In camera re-localization, an environment could feature repeating structures that are unique locally but not globally, e.g. office equipment, radiators or windows. A single feed-forward network cannot predict a correct correspondence for such objects because there are multiple valid solutions. However, if we train an ensemble of networks where each network specializes in a local part of the environment, we can resolve such ambiguities. This strategy is known in machine learning as Mixture of Experts (MoE). Each expert is a network specialized to one part of the problem domain. An additional gating network decides which expert is responsible for a given observation. More specifically, the output of the gating network is a categorical distribution over experts, which guides either the selection of a single expert or a weighted average of all expert outputs.
In this work, we extend Mixture of Experts for fitting parametric models. Each expert specializes to a part of all training observations, and predicts a representation to which we fit model parameters using DSAC. We argue that two realizations of a Mixture of Experts model are not optimal: i) letting the gating network select one expert only [19, 51, 3, 43]; ii) giving as output a weighted average of all experts [20, 1]. In the first case, we ignore that the gating network might attribute substantial probability to more than one expert. We might choose the wrong expert, and get a poor result. In the second case, we calculate an average in model parameter space, which can be unstable in learning. In our realization of a Mixture of Experts model, we integrate the gating network into the hypothesize-and-verify framework of DSAC. To estimate model parameters, DSAC creates many model hypotheses by sampling small subsets of data points, and fitting model parameters to each subset. DSAC scores hypotheses according to their consistency with all data points, i.e. their sample consensus. One hypothesis is selected as the final estimate according to this score. Hypothesis selection is probabilistic, and training aims at minimizing the expected task loss.
Instead of letting the gating network pick one expert, and fitting model parameters only to this expert’s prediction, we distribute model hypotheses among experts. Each expert receives a share of the total number of hypotheses according to the gating network. For the final selection, we score each hypothesis according to sample consensus, irrespective of which expert it came from, see Fig. 1. Therefore, as long as the gating network attributes some probability to the correct expert, we can still get an accurate model parameter estimate. We call this framework Expert Sample Consensus (ESAC). We train the network ensemble jointly and end-to-end by minimizing the expected task loss. We define the expectation over both the distribution of hypotheses according to the gating network and hypothesis selection according to sample consensus.
We demonstrate our method on a toy problem where the gating network has to decide which model to fit to synthetic data - a line or a circle. Compared to naive expert selection, our method proves to be extremely robust regarding the gating network’s ability to assign the correct expert. Our method also achieves state-of-the-art results in camera re-localization where each expert specializes in a separate, small part of a larger indoor environment.
We give the following main contributions:
We present Expert Sample Consensus (ESAC), an ensemble formulation of Differentiable Sample Consensus (DSAC) which we derive from Mixture of Experts (MoE).
We present a method to train ESAC jointly and end-to-end.
We demonstrate the properties of our algorithm on a toy problem of fitting simple parametric models to noisy, synthetic inputs.
Our formulation improves on two real-world aspects of learning-based camera re-localization, scalability and ambiguity. We achieve state-of-the-art results on difficult, public datasets for indoor re-localization.
2 Related Work
Ensemble Methods. To improve the accuracy of machine learning algorithms, one can train multiple base-learners and combine their predictions. A common strategy is averaging, so that errors of individual learners cancel out [10, 25, 45, 18]. To ensure that base-learners produce non-identical predictions, they are trained using random subsets of training data (bagging) or using random initializations of parameters (e.g. network weights). Boosting refers to a weighted average of predictions where the weights emerge from each base-learner’s ability to classify training samples. In these ensemble methods, all base-learners are trained on the full problem domain.
In contrast, Mixture of Experts (MoE) employs a divide-and-conquer strategy where each base-learner, i.e. each expert, specializes in one part of the problem domain. An additional gating network assesses the relevancy of each expert for a given input, and predicts an associated weight. The ensemble prediction is a weighted average of the experts’ outputs. MoE has been trained by minimizing the expected training loss, maximizing the likelihood under a Gaussian mixture model interpretation, or using the expectation-maximization (EM) algorithm.
MoE has been applied to image classification where each expert specializes to a subset of classes [51, 19, 1, 3]. Ahmed et al. find disjoint subsets by an EM-style algorithm. Hinton et al. and Yan et al. find subsets of classes based on class confusion of a generalist base network. Aljundi et al. apply MoE to lifelong multi-task learning. Whenever their system should be extended with a new task (e.g. a new object class), they train a new expert and a new expert gate. Each expert gate measures the similarity of an input with its associated task, and the gate with the highest similarity forwards the input to its expert.
In all aforementioned methods, the experts’ outputs constitute the ensemble output directly. In contrast, we are interested in a scenario where experts output a representation to which we fit parametric models in a robust fashion while maintaining the ability to train the ensemble jointly and end-to-end. To the best of our knowledge, this has not been addressed previously. Some of the aforementioned methods make use of conditional computation, i.e. the gating network selects a subset of experts to evaluate while others stay idle [51, 19, 3]. While this is computationally efficient, routing errors can occur, i.e. selection of the incorrect expert results in catastrophic errors. In this work, we distribute computational budget between experts based on the potentially soft prediction of the gating network. Thereby, we strike a good balance between efficiency and robustness.
Camera Re-Localization. Camera re-localization has been addressed with a very diverse set of methods. Some authors use image-based retrieval systems [41, 11, 4] to map a query image to the nearest neighbor in a set of database images with known pose. Pose regression methods [23, 50, 22, 5, 9] train neural feed-forward networks to predict the 6D pose directly from an input image. Pose regression methods vary in network architecture, pose parametrization, or training loss. Both retrieval-based and pose regression methods are very efficient but limited in accuracy. Feature-based re-localization methods [28, 36, 38, 37, 40, 47] match sparse feature points of the input image to a sparse 3D reconstruction of the environment. The 6D camera pose is estimated from these 2D-3D correspondences using RANSAC. These methods are very accurate and scale well, but have problems with texture-less surfaces and image conditions like motion blur because the feature detectors fail [44, 23].
Scene coordinate regression methods [44, 17, 49, 7, 31, 32, 6, 12, 33, 8] also estimate 2D-3D correspondences between image and environment but do so densely for each pixel of the input image. This circumvents the need for a feature detector with the aforementioned drawbacks of feature-based methods. Brachmann et al. combine a neural network for scene coordinate regression with a differentiable RANSAC for an end-to-end trainable camera re-localization pipeline. Brachmann and Rother improve the pipeline’s initialization and differentiable pose optimization to achieve state-of-the-art results for indoor camera re-localization from single RGB images. We build on and extend [6, 8] by combining them with our ESAC framework. Thereby, we are able to address two real-world problems: scalability and ambiguity in camera re-localization. Some scene coordinate regression methods use an ensemble of base learners, namely random forests [44, 49, 7, 31, 32, 12, 33]. Guzman-Rivera et al. train the random forest in a boosting-like manner to diversify its predictions. Massiceti et al. map an ensemble of decision trees to an ensemble of neural networks. However, in none of these methods do the base-learners specialize in parts of the problem domain.
Brachmann et al. train a joint classification-regression forest for camera re-localization. The forest classifies which part of the environment an input belongs to, and regresses relative scene coordinates for this part. More recently, image retrieval and relative pose regression have been combined in one system for good accuracy. Both works bear some resemblance to our strategy but utilize one large model without the benefit of efficient, conditional computation. Also, their models cannot be trained in an end-to-end fashion.
Model Selection. Sometimes, the model type has to be estimated concurrently with the model parameters. E.g. data points could be explained by a line or higher order polynomials. Methods for model selection implement a trade-off between model expressiveness and fitting error [2, 42]. For illustrative purposes, we introduce ESAC on a toy problem where it learns model selection in a supervised fashion. However, in our main application, camera re-localization, the model type is always known to be a 6D pose.
3 Method
We start by reviewing DSAC for fitting parametric models in Sec. 3.1. Then, in Sec. 3.2, we introduce Mixture of Experts with expert selection. Finally, we present ESAC, an ensemble formulation of DSAC, in Sec. 3.3. We will explain these concepts for a simple toy problem before applying them to camera re-localization in Sec. 4.
3.1 Differentiable Sample Consensus
We are interested in estimating a set of model parameters $\mathbf{h}$ given an observation $I$. For instance, the model could be a 2D line with slope $m$ and intercept $n$, i.e. $\mathbf{h} = (m, n)$. Observation $I$ is an image of the line which also contains noise and distractors which are not explained by our model $\mathbf{h}$. See top of Fig. 2 a) for an example input where the distractors are boxes that partly occlude the line.
Instead of fitting model parameters $\mathbf{h}$ directly to $I$, we deduce an intermediate representation $Y$ from $I$ to which we can fit our model easily. In the case of a line, $Y$ could be a set of 2D points $\mathbf{y} \in Y$ with $\mathbf{y} = (y_0, y_1)$, where each point is explained by our model: $y_1 = m y_0 + n$. We can deduce line parameters $\mathbf{h}$ from $Y$ using linear regression or Deming regression.
Since the image formation process is complicated and/or unknown to us, there is no simple way to infer $Y$ from $I$. Instead, we train a neural network $f$ with learnable parameters $\mathbf{w}$ to predict $Y = f(I; \mathbf{w})$. The neural network can learn to ignore distractors and image noise to some extent. However, it is likely to make some mistakes, e.g. predict some points $\mathbf{y}$ not explained by our model $\mathbf{h}$. Therefore, we employ a robust estimator, namely Random Sample Consensus (RANSAC), and, for neural network training, Differentiable Sample Consensus (DSAC).
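RANSAC. RANSAC robustly estimates model parameters by sampling a pool of model hypotheses $\mathbf{h}_j$ with $j \in \{1, \dots, M\}$. A hypothesis is sampled by randomly choosing a minimal set from $Y$ and fitting model parameters to it. For a 2D line, a minimal set consists of two 2D points which determine slope and intercept. Each hypothesis is scored by measuring its sample consensus or inlier count $s(\mathbf{h}_j, Y)$, i.e. the number of data points that agree with the hypothesis:

$$s(\mathbf{h}_j, Y) = \sum_{\mathbf{y} \in Y} \mathbb{1}\left(\tau - d(\mathbf{h}_j, \mathbf{y})\right), \tag{1}$$

where $d(\mathbf{h}_j, \mathbf{y})$ is a measure of distance between model hypothesis $\mathbf{h}_j$ and data point $\mathbf{y}$, e.g. the point-line distance. Parameter $\tau$ is a threshold that encapsulates our tolerance for inlier errors, and $\mathbb{1}(\cdot)$ denotes the Heaviside step function. Our final estimate $\hat{\mathbf{h}}$ is the model hypothesis with the maximum score:

$$\hat{\mathbf{h}} = \operatorname*{argmax}_{\mathbf{h}_j} \, s(\mathbf{h}_j, Y). \tag{2}$$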
Due to the non-differentiability of the hypothesis selection, we cannot use RANSAC directly in neural network training. However, Brachmann et al. proposed a differentiable version of the algorithm which we will discuss next.
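The hypothesize-and-verify loop described above can be illustrated for the 2D line case. The following is a minimal sketch, not the authors' implementation: the function names, the toy data, the number of hypotheses and the threshold value are assumptions chosen for illustration.

```python
import random


def fit_line(p, q):
    """Fit slope m and intercept n to a minimal set of two 2D points."""
    m = (q[1] - p[1]) / (q[0] - p[0])
    return m, p[1] - m * p[0]


def point_line_distance(h, y):
    """Perpendicular distance between point y and the line y1 = m * y0 + n."""
    m, n = h
    return abs(m * y[0] - y[1] + n) / (m * m + 1.0) ** 0.5


def inlier_count(h, points, tau):
    """Hard sample consensus: number of points closer than tau to the line."""
    return sum(1 for y in points if point_line_distance(h, y) < tau)


def ransac_line(points, num_hypotheses, tau, rng):
    """Return the hypothesis with the maximum inlier count and its score."""
    best_h, best_score = None, -1
    for _ in range(num_hypotheses):
        p, q = rng.sample(points, 2)
        if p[0] == q[0]:
            continue  # vertical line: not representable as slope/intercept
        h = fit_line(p, q)
        score = inlier_count(h, points, tau)
        if score > best_score:
            best_h, best_score = h, score
    return best_h, best_score


# Toy data (assumed): 20 points on y1 = 0.5 * y0 + 0.2 plus 8 random outliers.
rng = random.Random(0)
points = [(i / 20.0, 0.5 * (i / 20.0) + 0.2) for i in range(20)]
points += [(rng.random(), rng.random()) for _ in range(8)]
line_estimate, line_score = ransac_line(points, num_hypotheses=64, tau=0.02, rng=rng)
```

The `if score > best_score` comparison is the deterministic, non-differentiable selection step that the probabilistic selection of DSAC replaces.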
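DSAC. The core idea of Differentiable Sample Consensus is to make hypothesis selection probabilistic. Instead of choosing the hypothesis with maximum score deterministically as in Eq. 2, we choose it randomly according to a softmax distribution over scores:

$$\hat{\mathbf{h}} = \mathbf{h}_j \sim p(j \mid Y) = \frac{\exp s(\mathbf{h}_j, Y)}{\sum_{j'} \exp s(\mathbf{h}_{j'}, Y)}. \tag{3}$$

This allows us to minimize the expected task loss during training:

$$\mathcal{L}(\mathbf{w}) = \mathbb{E}_{j \sim p(j \mid Y)}\left[\ell(\mathbf{h}_j, \mathbf{h}^*)\right], \tag{4}$$

where $\ell(\mathbf{h}_j, \mathbf{h}^*)$ measures the error of a model hypothesis $\mathbf{h}_j$ with respect to some ground truth parameters $\mathbf{h}^*$. Since $\mathcal{L}(\mathbf{w})$ is a weighted sum with a finite number of summands, one for each hypothesis in our pool, we can calculate it and its gradients exactly. As one last consideration, we have to replace the non-differentiable inlier count of Eq. 1 by a soft version:

$$s(\mathbf{h}_j, Y) = \alpha \sum_{\mathbf{y} \in Y} \left[1 - \operatorname{sig}\left(\beta d(\mathbf{h}_j, \mathbf{y}) - \beta \tau\right)\right], \tag{5}$$

where $\operatorname{sig}(\cdot)$ denotes the Sigmoid function, and $\alpha$ and $\beta$ are hyperparameters which control the softness of the score.
By minimizing the expected task loss $\mathcal{L}(\mathbf{w})$, we can train our network in an end-to-end fashion using DSAC. The network learns to predict a representation $Y$ that yields an accurate model estimate $\hat{\mathbf{h}}$, although $Y$ might still contain outliers. For the toy problem of fitting a 2D line, we show an example run of the full pipeline in Fig. 2 a) top.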
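The forward computation of probabilistic selection can be sketched as follows. This is an illustrative sketch under assumptions: the soft score is taken to be a scaled sum of sigmoid-softened inlier indicators, and the distances, losses and hyperparameter values (`tau`, `alpha`, `beta`) are made up. In an actual training setup, an automatic differentiation framework would supply the gradients of the expected loss.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_inlier_score(distances, tau, alpha, beta):
    """Soft inlier count: each point contributes a value in (0, 1), scaled by alpha."""
    return alpha * sum(1.0 - sigmoid(beta * (d - tau)) for d in distances)


def softmax(scores):
    """Selection probabilities over hypotheses (shifted by the max for stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_loss(scores, losses):
    """Expected task loss: a finite weighted sum, one term per hypothesis."""
    return sum(p * l for p, l in zip(softmax(scores), losses))


# Three hypothetical hypotheses: distances of four points to each, and task losses.
distances = [[0.01, 0.02, 0.01, 0.9], [0.3, 0.4, 0.02, 0.5], [0.8, 0.9, 0.7, 0.6]]
losses = [0.1, 1.0, 3.0]
scores = [soft_inlier_score(d, tau=0.05, alpha=2.0, beta=100.0) for d in distances]
selection_probs = softmax(scores)
loss_value = expected_loss(scores, losses)
```

Because the expectation is an explicit sum over the hypothesis pool, no sampling is needed to evaluate it for a single network.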
3.2 Expert Selection
In the following, we introduce the notion of experts for the scenario of parametric model fitting. First, we apply the original formulation of Mixture of Experts (MoE) before extending it in Sec. 3.3.
Instead of training one neural network responsible for all inputs, we train an ensemble of experts $f_e$ with $e \in \{1, \dots, E\}$. We denote the output of each expert with $Y_e$. A gating network $g$ decides for a given input $I$ which expert is responsible, i.e. it predicts a probability distribution over experts: $g(e, I; \mathbf{w}) = p(e \mid I; \mathbf{w})$. For notational simplicity, we stack the learnable parameters of all individual networks in a single parameter vector $\mathbf{w}$.
For illustration, we change the toy problem of the previous section in the following way. Some inputs show a 2D line (as before) while others show a 2D circle. Therefore, we extend our model parameters to $\mathbf{h} = (m, n, r)$. In case of a circle, $(m, n)$ is the circle center and $r$ is its radius. In case of a line, $m$ and $n$ are slope and intercept, respectively, and we set $r = -1$ to indicate it is not a circle.
We train two experts, i.e. $E = 2$, one specialized for fitting lines, one specialized for fitting circles. Additionally, we train a gating network which should decide for an arbitrary input whether it shows a line or a circle, so that we can apply the correct expert. See Fig. 2 for a visualization of all three networks and their respective tasks.
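Given an image $I$, we first choose an expert according to the gating network prediction $p(e \mid I; \mathbf{w})$. We let this expert estimate $Y_e$, and apply DSAC, i.e. we sample a pool of hypotheses $\mathbf{h}_j$ from $Y_e$. We choose our estimate similar to Eq. 3 according to

$$\hat{\mathbf{h}} = \mathbf{h}_j \sim p(j \mid Y_e) \, p(e \mid I; \mathbf{w}). \tag{6}$$

For training, we minimize

$$\mathcal{L}(\mathbf{w}) = \mathbb{E}_{e \sim p(e \mid I; \mathbf{w})} \, \mathbb{E}_{j \sim p(j \mid Y_e)}\left[\ell(\mathbf{h}_j, \mathbf{h}^*)\right], \tag{7}$$

i.e. we minimize the expected loss over choosing the correct expert according to $p(e \mid I; \mathbf{w})$, and selecting a model hypothesis from this expert according to $p(j \mid Y_e)$. Note that we enforce specialization of experts in this training formulation by running the appropriate version of DSAC depending on which expert we chose, i.e. we fit either a circle or a line to $Y_e$.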
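To calculate the outer expectation, we have to sum over all experts and run DSAC each time for the inner expectation. Since DSAC is costly, and in some applications we might have a large number of experts, this can be infeasible. However, we can re-write the gradients of the expectation as an expectation itself. This allows us to efficiently approximate the gradients via sampling:

$$\frac{\partial}{\partial \mathbf{w}} \mathcal{L}(\mathbf{w}) = \mathbb{E}_{e}\left[\mathcal{L}_e \frac{\partial}{\partial \mathbf{w}} \log p(e) + \frac{\partial}{\partial \mathbf{w}} \mathcal{L}_e\right] \approx \frac{1}{K} \sum_{k=1}^{K} \left[\mathcal{L}_{e_k} \frac{\partial}{\partial \mathbf{w}} \log p(e_k) + \frac{\partial}{\partial \mathbf{w}} \mathcal{L}_{e_k}\right], \tag{8}$$

where we sample $e_k \sim p(e)$ $K$ times and average the gradients. We use the abbreviations $\mathbb{E}_e$, $p(e)$ and $\mathcal{L}_e$ for the respective entities in Eq. 7, i.e. the expectation over $e \sim p(e \mid I; \mathbf{w})$, the gating probability $p(e \mid I; \mathbf{w})$, and the inner expectation $\mathbb{E}_{j \sim p(j \mid Y_e)}\left[\ell(\mathbf{h}_j, \mathbf{h}^*)\right]$. In practice, when training with stochastic gradient descent, we can approximate the expectation with $K = 1$ sample, which means that we do one run of DSAC per training input.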
Since we select only one expert at test time, we only have to compute this expert’s forward pass, which is computationally efficient. However, if we choose the wrong expert, i.e. an expert not specialized to the current input $I$, we cannot hope to get a sensible prediction $\hat{\mathbf{h}}$. Therefore, the accuracy of this MoE formulation is limited by the accuracy of the gating network. In the next section, we describe our alternative, new formulation which is more robust to inaccuracies of the gating network.
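The idea of rewriting the gradient of an expectation as an expectation can be checked numerically on a toy gating distribution. The sketch below is an illustration under assumptions: the gating probabilities are a softmax over three made-up logits, and the per-expert losses are fixed constants standing in for the per-expert DSAC losses, so the term that differentiates the expert loss itself vanishes and only the log-probability term remains.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits, expert_losses):
    """Gradient of sum_e p(e) * L_e with respect to the gating logits."""
    p = softmax(logits)
    mean_loss = sum(pe * le for pe, le in zip(p, expert_losses))
    return [pe * (le - mean_loss) for pe, le in zip(p, expert_losses)]


def sampled_gradient(logits, expert_losses, num_samples, rng):
    """Average of L_e * d log p(e) / d logits over experts e sampled from p."""
    p = softmax(logits)
    grad = [0.0] * len(p)
    for _ in range(num_samples):
        e = rng.choices(range(len(p)), weights=p)[0]
        for k in range(len(p)):
            # Derivative of log softmax: indicator(k == e) - p[k].
            dlogp = (1.0 if k == e else 0.0) - p[k]
            grad[k] += expert_losses[e] * dlogp / num_samples
    return grad


gating_logits = [0.5, -0.2, 1.0]   # hypothetical gating network outputs
expert_losses = [2.0, 0.3, 1.0]    # constant stand-ins for per-expert losses
grad_exact = exact_gradient(gating_logits, expert_losses)
grad_sampled = sampled_gradient(gating_logits, expert_losses, 20000, random.Random(0))
```

With many samples the estimate approaches the exact gradient. A single sample per training input gives a noisy but unbiased estimate, which stochastic gradient descent averages over iterations.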
3.3 Expert Sample Consensus
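Instead of having the gating network select one expert with the risk of selecting the wrong one, we distribute our budget of $M$ model hypotheses among experts. We sample $n_e$ hypotheses from each expert’s prediction $Y_e$. For this purpose, we define a vector $\mathbf{n} = (n_1, \dots, n_E)$, with $\sum_e n_e = M$, that expresses how many hypotheses we assign to each expert.
We choose $\mathbf{n}$ for a given input $I$ based on the output of the gating network. More specifically, $\mathbf{n}$ follows a multinomial distribution based on the gating probabilities $p(e \mid I; \mathbf{w})$:

$$p(\mathbf{n} \mid I; \mathbf{w}) = \frac{M!}{\prod_{e} n_e!} \prod_{e} p(e \mid I; \mathbf{w})^{n_e}. \tag{9}$$

Given an image $I$, we first choose $\mathbf{n} \sim p(\mathbf{n} \mid I; \mathbf{w})$, and then, according to $\mathbf{n}$, we sample hypotheses $\mathbf{h}_{e,j}$ with $j \in \{1, \dots, n_e\}$ from each expert prediction $Y_e$. We use an index pair $(e, j)$ to denote which expert a hypothesis belongs to, and which of the $n_e$ hypotheses of this expert it is, specifically. We choose our estimate similar to Eq. 3 and Eq. 6 according to

$$\hat{\mathbf{h}} = \mathbf{h}_{e,j} \sim p(e, j \mid \mathbf{n}) \, p(\mathbf{n} \mid I; \mathbf{w}), \tag{10}$$

with

$$p(e, j \mid \mathbf{n}) = \frac{\exp s(\mathbf{h}_{e,j}, Y_e)}{\sum_{e'} \sum_{j'=1}^{n_{e'}} \exp s(\mathbf{h}_{e',j'}, Y_{e'})}. \tag{11}$$

Note that $p(e, j \mid \mathbf{n})$ is a softmax distribution over all hypotheses, i.e. we choose a hypothesis solely based on its score, irrespective of which expert it came from. In particular, the gating network does not influence hypothesis selection directly, but only guides the distribution of hypotheses among experts. Depending on the prediction of the gating network, some experts with low probability will have no hypotheses assigned ($n_e = 0$). For these experts, we do not need $Y_e$, and hence can save computing the associated forward pass, implementing conditional computation. We visualize our method in Fig. 3 b).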
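The distribution of hypotheses among experts and the joint selection can be sketched as follows. This is a toy illustration, not the authors' implementation: `expert_scorers` is a hypothetical stand-in for running an expert and scoring hypotheses sampled from its prediction, and the score ranges and gating probabilities are made up so that the gating network favors the wrong expert.

```python
import math
import random


def allocate_hypotheses(gating_probs, num_hypotheses, rng):
    """Draw the per-expert hypothesis counts from a multinomial distribution."""
    counts = [0] * len(gating_probs)
    experts = range(len(gating_probs))
    for e in rng.choices(experts, weights=gating_probs, k=num_hypotheses):
        counts[e] += 1
    return counts


def esac_select(gating_probs, expert_scorers, num_hypotheses, rng):
    """Select one (expert, index) pair by a softmax over all hypothesis scores."""
    counts = allocate_hypotheses(gating_probs, num_hypotheses, rng)
    pool = []
    for e, n_e in enumerate(counts):
        if n_e == 0:
            continue  # conditional computation: this expert is never evaluated
        pool += [(e, j, s) for j, s in enumerate(expert_scorers[e](n_e, rng))]
    top = max(s for _, _, s in pool)
    weights = [math.exp(s - top) for _, _, s in pool]
    e, j, _ = rng.choices(pool, weights=weights)[0]
    return (e, j), counts


def wrong_expert(n, rng):
    """Hypothetical expert whose hypotheses reach only a low sample consensus."""
    return [rng.uniform(0.0, 2.0) for _ in range(n)]


def right_expert(n, rng):
    """Hypothetical expert whose hypotheses reach a high sample consensus."""
    return [rng.uniform(20.0, 25.0) for _ in range(n)]


# The gating network prefers the wrong expert (0.7) over the right one (0.3).
(selected_expert, selected_index), counts = esac_select(
    [0.7, 0.3], [wrong_expert, right_expert], num_hypotheses=64, rng=random.Random(0)
)
```

In this toy setting the right expert still receives a share of the hypotheses, and its hypotheses dominate the joint softmax because of their higher scores.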
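For training, we adapt our MoE training objective of Eq. 7 and minimize

$$\mathcal{L}(\mathbf{w}) = \mathbb{E}_{\mathbf{n} \sim p(\mathbf{n} \mid I; \mathbf{w})} \, \mathbb{E}_{e,j \sim p(e,j \mid \mathbf{n})}\left[\ell(\mathbf{h}_{e,j}, \mathbf{h}^*)\right], \tag{12}$$

i.e. we minimize the expected loss over distributing hypotheses, and selecting a final estimate. Since $p(\mathbf{n} \mid I; \mathbf{w})$ is a distribution over all possible vectors $\mathbf{n}$, we again rewrite the gradients of $\mathcal{L}(\mathbf{w})$ as an expectation, and approximate via sampling:

$$\frac{\partial}{\partial \mathbf{w}} \mathcal{L}(\mathbf{w}) = \mathbb{E}_{\mathbf{n}}\left[\mathcal{L}_{\mathbf{n}} \frac{\partial}{\partial \mathbf{w}} \log p(\mathbf{n}) + \frac{\partial}{\partial \mathbf{w}} \mathcal{L}_{\mathbf{n}}\right] \approx \frac{1}{K} \sum_{k=1}^{K} \left[\mathcal{L}_{\mathbf{n}_k} \frac{\partial}{\partial \mathbf{w}} \log p(\mathbf{n}_k) + \frac{\partial}{\partial \mathbf{w}} \mathcal{L}_{\mathbf{n}_k}\right], \tag{13}$$

where $p(\mathbf{n})$ abbreviates $p(\mathbf{n} \mid I; \mathbf{w})$, $\mathcal{L}_{\mathbf{n}}$ abbreviates the inner expectation of Eq. 12, and $\mathbf{n}_k \sim p(\mathbf{n})$.
In practice we found $K = 1$ to suffice. Throughout training, we sample many different hypothesis splits. Whenever a responsible expert receives too few hypotheses, Eq. 12 yields a large loss, and hence a large training signal for the gating network. On the other hand, receiving too many hypotheses will not decrease the loss further, and there will be no training signal to reward it. Therefore, the gating network learns the trade-off between assigning broad distributions in ambiguous cases, and assigning sufficiently many hypotheses to the most likely experts.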
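Calculating the approximate gradients of Eq. 13 involves the derivative of the log probability of a given $\mathbf{n}$, which we calculate as

$$\frac{\partial}{\partial \mathbf{w}} \log p(\mathbf{n} \mid I; \mathbf{w}) = \sum_{e} n_e \frac{\partial}{\partial \mathbf{w}} \log g(e, I; \mathbf{w}). \tag{14}$$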
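Because the multinomial coefficient does not depend on the network parameters, the derivative of the log probability of a hypothesis split reduces to a count-weighted sum of derivatives of the log gating probabilities. The sketch below checks this numerically under the assumption that the gating probabilities are a softmax over logits, in which case the sum simplifies to `n_k - M * p_k` per logit. The counts and logits are made-up values.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_multinomial(counts, logits):
    """Log probability of a hypothesis split under softmax gating probabilities."""
    p = softmax(logits)
    log_prob = math.lgamma(sum(counts) + 1)  # log M!
    for n_e, p_e in zip(counts, p):
        log_prob += n_e * math.log(p_e) - math.lgamma(n_e + 1)
    return log_prob


def grad_log_multinomial(counts, logits):
    """sum_e n_e * d log p(e) / d logit_k, which equals n_k - M * p_k for a softmax."""
    p = softmax(logits)
    total = sum(counts)
    return [n_k - total * p_k for n_k, p_k in zip(counts, p)]


def finite_difference(counts, logits, k, eps=1e-6):
    """Central finite difference of the log probability with respect to logit k."""
    up, down = list(logits), list(logits)
    up[k] += eps
    down[k] -= eps
    return (log_multinomial(counts, up) - log_multinomial(counts, down)) / (2 * eps)


counts = [5, 0, 11]          # hypothetical split of M = 16 hypotheses
logits = [0.2, -1.0, 0.7]    # hypothetical gating logits
analytic = grad_log_multinomial(counts, logits)
numeric = [finite_difference(counts, logits, k) for k in range(len(logits))]
```

An expert with `n_e = 0` still receives a negative gradient on its logit, since lowering its probability makes the sampled split more likely.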
4 ESAC for Camera Re-Localization
We estimate the 6D camera pose $\mathbf{h} = (\mathbf{t}, \boldsymbol{\theta})$, consisting of 3D translation $\mathbf{t}$ and 3D rotation $\boldsymbol{\theta}$, from a single RGB image. Our pipeline is based on DSAC++ of Brachmann and Rother, which itself is based on the scene coordinate regression method of Shotton et al. For each pixel $i$ with 2D position $\mathbf{p}_i$ in an image, we regress a 3D scene coordinate $\mathbf{y}_i$, i.e. the coordinate of the pixel in world space.
Given a minimal set of four 2D-3D correspondences $(\mathbf{p}_i, \mathbf{y}_i)$, we can estimate $\mathbf{h}$ using a perspective-n-point algorithm [16, 26]. We employ a robust estimator as described in Sec. 3. That is, we sample multiple minimal sets to create a pool of pose hypotheses $\mathbf{h}_j$, and select the best one according to a scoring function. We follow DSAC++ and use a soft inlier count as score. See also Eq. 5, where we use the re-projection error of a scene coordinate for $d(\mathbf{h}, \mathbf{y})$.
Once we have chosen a hypothesis, we refine it using the differentiable pose optimization of DSAC++. Refinement iteratively re-solves the perspective-n-point problem on all inliers of a hypothesis. Gradients are approximated via a linearization of the objective function in the last refinement iteration. Our output is the refined, selected hypothesis $\hat{\mathbf{h}}$. As task loss for training, we use $\ell(\mathbf{h}, \mathbf{h}^*) = \angle(\boldsymbol{\theta}, \boldsymbol{\theta}^*) + \gamma \lVert \mathbf{t} - \mathbf{t}^* \rVert$, where $\angle$ denotes angle difference. The hyperparameter $\gamma$ controls the trade-off between rotation and translation errors. We use $\gamma = 100$ when measuring angles in degrees and translation in meters.
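A task loss that trades off rotation and translation errors can be sketched as follows. This sketch assumes the loss is the rotation angle difference in degrees plus a weighted Euclidean translation error in meters, with rotations given as 3x3 matrices. The helper `rot_z`, the example poses and the weight passed as `gamma` are illustrative choices.

```python
import math


def rotation_angle_deg(r_a, r_b):
    """Angle in degrees of the relative rotation between two 3x3 rotation matrices."""
    # trace(r_a^T r_b) equals the sum of element-wise products of the two matrices.
    trace = sum(r_a[i][j] * r_b[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def pose_loss(r_est, t_est, r_gt, t_gt, gamma):
    """Angle difference in degrees plus gamma times translation error in meters."""
    return rotation_angle_deg(r_est, r_gt) + gamma * math.dist(t_est, t_gt)


def rot_z(angle_deg):
    """Rotation about the z axis, used here only to build example poses."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Example: estimate is off by 3 degrees and 2 cm; with gamma = 100, 1 cm weighs like 1 degree.
angle_error = rotation_angle_deg(rot_z(3.0), rot_z(0.0))
loss_value = pose_loss(rot_z(3.0), (1.0, 2.0, 0.52), rot_z(0.0), (1.0, 2.0, 0.5), gamma=100.0)
```

With translation in meters, a weight of 100 expresses the translation error in centimeters, so rotation and translation errors contribute on a comparable scale.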
We estimate scene coordinates using an ensemble of experts and a gating network. When designing the expert network architecture we were inspired by DSAC++. Each expert is an FCN which predicts scene coordinates for a px image. Different from DSAC++, we use a ResNet architecture instead of VGG. We found ResNet to achieve similar accuracy while being more efficient in computation time and memory (28 vs. 210 MB). Each expert has 16 layers, 6M parameters and a px receptive field. The gating network has 10 layers and 100k parameters. The receptive field of the gating network is the complete image, i.e. it incorporates more context when assigning experts. Experts have a small receptive field to be robust to viewpoint changes. Our implementation is based on PyTorch, and we will make it publicly available (vislearn.de/research/scene-understanding/pose-estimation/#ICCV19).
5 Experiments
We evaluate ESAC for the toy problem introduced in Sec. 3, and for camera re-localization from single RGB images.
5.1 Toy Problem
Setup. We generate images of size 64 × 64 px, which show either a line or a circle with 50% probability. We add 4 to 10 distractors to each image, which can occlude the circle or line. Colors of lines, circles and distractors are uniformly random. Finally, we add speckle noise to each image. Difficult example inputs are shown in Fig. 4 b).
We train one expert for lines and one for circles. Each expert is a CNN with 2M parameters that predicts 64 2D points. The gating network is a CNN with 5k parameters that predicts two outputs, corresponding to the probability for a line or a circle. As training loss for lines, we minimize the maximum distance between the estimate and ground truth in the image. For circles, we minimize the distance between centers and absolute difference in radii of the estimate and ground truth. We pre-train each expert using only line or only circle images with DSAC. We pre-train the gating network using both line and circle images with a negative log likelihood classification loss. After pre-training for 50k iterations, we train the ensemble jointly and end-to-end for another 50k iterations, either using Expert Selection (Sec. 3.2) or ESAC (Sec. 3.3). We train with a batch size of 32, using Adam , and sampling model hypotheses. For testing, we generate a set of 10,000 images.
Results. Fig. 4 a) shows the percentage of correctly estimated model parameters (Parameter Accuracy). We accept a line estimate if the maximum distance to the ground truth line in the image is px. We accept a circle estimate if its center and radius is within px of ground truth. We observe a significant advantage of using ESAC over Expert Selection (+3.9%). The gating network confuses images with lines and circles sometimes, and might assign higher probability to the wrong expert. ESAC runs both experts in unclear cases, and selects the final estimate according to sample consensus. Fig. 4 a) also shows the classification accuracy of the ensemble, selecting the correct model type. Here, ESAC outperforms Expert Selection by 11.5%. The good classification accuracy indicates that ESAC might be a suitable method for model selection, although we did not investigate this scenario further.
5.2 Camera Re-Localization
For our main application, each expert predicts the same model type, a 6D camera pose, but specializes in different parts of a potentially large and repetitive environment.
Datasets. The 7Scenes dataset consists of RGB-D images, camera poses and 3D models of seven indoor rooms (ca. 125 m³ total). The images contain texture-less surfaces, motion blur and repeating structures, which makes this dataset challenging despite its limited size. The 12Scenes dataset resembles 7Scenes in structure but features twelve larger rooms (ca. 520 m³ total). The combination of 7Scenes and 12Scenes yields one large environment (19Scenes) comprising 19 rooms (ca. 645 m³ total, see also Fig. 1). The data features multiple kitchens, living rooms and offices, containing ambiguous furniture and office equipment.
Setup. Ignoring depth channels, we estimate camera poses from RGB only. We train one expert per scene, depending on the dataset. We pre-train each expert for 500k iterations, using a regression loss to ground truth scene coordinates obtained by rendering 3D scene models, similar to . Furthermore, we pre-train the gating network to classify scenes using negative log likelihood for 100k iterations. We use Adam with a fixed learning rate of . After pre-training, we train the ensemble of networks jointly and end-to-end using Expert Selection (Sec. 3.2) or ESAC (Sec. 3.3) for 100k iterations. We use a learning rate of for experts, and for the gating network. Otherwise, we keep the hyperparameters of DSAC++ , we sample hypotheses and use an inlier threshold of px.
Results on Individual Scenes. First, we verify our re-implementation of DSAC++, and our choice of network architecture. To this end, we evaluate our expert networks when the scene ID for a test frame is given. That is, we disable the gating network, and always use the correct expert. We achieve an accuracy similar to DSAC++: slightly worse on 7Scenes, slightly better on 12Scenes, see Fig. 5. Note that our networks are 7.5× smaller than those of DSAC++.
Results on Combined Scenes. To evaluate our main contribution, we create three environments of increasing size, combining the scenes of 7Scenes, of 12Scenes, and of both (19Scenes). We compare to DSAC++ by training a single CNN for an environment. For a fair comparison, we use our expert network architecture for DSAC++, and increase its capacity to match that of ESAC’s network ensemble. We also compare to an ensemble with Expert Selection (Sec. 3.2). We show our main results in Fig. 6 a), measuring the percentage of estimated poses with an error below 5° and 5 cm. The accuracy of DSAC++ decreases notably in larger environments, culminating in a moderate accuracy of 53.3% re-localized images on 19Scenes. DSAC++ relies solely on local image context which becomes increasingly ambiguous with a growing number of visually similar scenes. An ensemble with Expert Selection fares even worse despite using global image context in the gating network when disambiguating scenes. Some of the scenes are too similar, and the top-scoring gating prediction is incorrect in many cases. By distributing model hypotheses among experts, ESAC incorporates global image context in a robust fashion, and consistently achieves the best accuracy. The margin is most distinct for 19Scenes, the largest environment, with 88.1% correctly re-localized images. Note that the increased environment scale hardly affects the accuracy of ESAC. Compared to the setting with known scene ID, it loses 3.5% accuracy for 7Scenes and less than 1% for 12Scenes, cf. Fig. 5.
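The accuracy measure used here, the percentage of poses whose rotation and translation errors both fall below a threshold, can be computed as in the following sketch. The function name and the example error pairs are made up for illustration, with errors given as (degrees, meters).

```python
def pose_accuracy(errors, max_angle_deg=5.0, max_translation_m=0.05):
    """Percentage of poses with rotation and translation error both below threshold."""
    if not errors:
        return 0.0
    hits = sum(
        1
        for angle, translation in errors
        if angle < max_angle_deg and translation < max_translation_m
    )
    return 100.0 * hits / len(errors)


# Hypothetical per-image errors: (rotation error in degrees, translation error in meters).
errors = [(1.2, 0.01), (4.9, 0.049), (0.5, 0.20), (7.0, 0.02), (2.0, 0.03)]
accuracy_coarse = pose_accuracy(errors)            # 5 degrees, 5 cm
accuracy_fine = pose_accuracy(errors, 2.0, 0.02)   # 2 degrees, 2 cm
```

A pose counts as re-localized only if both errors are below their thresholds, so a precise rotation cannot compensate for a large translation error.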
Effect of End-To-End Training. See Fig. 7 for the effect of end-to-end training on the overall accuracy. We initialize the experts by optimizing the distance to ground truth scene coordinates, and the gating network by optimizing the negative log likelihood classification loss (denoted Initialization). We then continue to train the gating network (E2E Gating), the experts (E2E Experts) or the entire ensemble (E2E All) using the ESAC objective. End-to-end training of each component increases the average accuracy, and we achieve the best accuracy when we train the entire ensemble jointly. The effect of end-to-end training is significant but not large using the common acceptance threshold of 5 cm and 5°. However, lowering the threshold to 2 cm and 2° reveals a large improvement in accuracy of . End-to-end training improves foremost the precision of re-localization, and less so the re-localization rate under a coarse threshold.
Handling Ambiguities. In Fig. 6 b) we show the average scene classification accuracy of Expert Selection and ESAC. In Appendix A, we provide additional information in the form of scene confusion matrices, and examples of visually similar scenes. Expert Selection is particularly prone to confusing offices which contain ambiguous furniture and office equipment. ESAC can tell these scenes apart reliably by combining global image context when distributing hypotheses with geometric consistency when selecting hypotheses.
Conditional Computation. Because DSAC++ uses a single, monolithic network with large model capacity, its inference takes almost 1 s on 19Scenes. ESAC needs to evaluate only those experts relevant for a given test image. On 19Scenes, it evaluates 6.1 experts in 555 ms on average. Furthermore, we can restrict the maximum number of experts per image, see Fig. 8. For example, using at most the top 2 experts per test image, we gain +19.7% accuracy over Expert Selection with just a minor increase in computation time. At the other end of the spectrum, we could always evaluate all experts and choose the best hypothesis according to sample consensus, see Uniform Gating in Fig. 8. This achieves good accuracy but is computationally intensive. ESAC shows slightly higher accuracy and is much faster. Also, ESAC almost reaches the accuracy of Oracle Gating, which always selects the correct expert via the ground truth scene ID.
Outdoor Re-Localization. We applied ESAC to outdoor re-localization in vast connected spaces, namely to the Dubrovnik dataset and the Aachen Day dataset. Appendix B contains details about the experimental setup. We present the main results in Fig. 9 and a qualitative result in Fig. 10. While we improve over DSAC++ by a large margin, we do not completely close the performance gap to classical sparse feature-based methods like ActiveSearch. Adding more experts (and therefore model capacity) helps only to some degree. This hints at limitations of current scene coordinate regression methods [6, 8] beyond the environment size. For example, the SfM ground truth reconstruction, which we use for training, contains a substantial number of outliers, particularly for Dubrovnik, see Appendix B for a detailed discussion. The training of CNN-based dense regression might be sensitive to such noisy inputs, and developing resilient training strategies might be a promising direction for future research.
6 Conclusion
We have presented ESAC, an ensemble of expert networks for estimating parametric models. ESAC uses a gating network to distribute model hypotheses among experts. This is more robust than formulations where the gating network chooses a single expert only. We applied ESAC to the camera re-localization task in a large indoor environment where each expert specializes to a single room, achieving state-of-the-art accuracy. For large-scale outdoor re-localization, we made progress towards closing the gap to classical, feature-based methods.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 647769). The computations were performed on an HPC Cluster at the Center for Information Services and High Performance Computing (ZIH) at TU Dresden.
Appendix A Scene Classification
In Fig. 11, we show the scene confusion matrices of our ensemble trained with Expert Selection and ESAC. The database contains multiple offices which look similar due to ambiguous office equipment, see Fig. 12 for examples. Expert Selection chooses a scene according to the prediction of the gating network, which is error-prone. ESAC considers multiple experts in ambiguous cases, and chooses the final estimate according to geometric consistency.
Appendix B Large-Scale Outdoor Re-Localization
Datasets. The Dubrovnik dataset is comprised of ca. 6k holiday photographs taken in the old town of Dubrovnik. Stemming from online photo collections, the images were recorded with different cameras, and feature a multitude of different focal lengths, resolutions and aspect ratios. The Aachen Day dataset is comprised of ca. 4.5k images taken in Aachen, Germany. Training and test images were recorded using two separate but comparable camera types. The full Aachen dataset also comes with a small collection of difficult night time query images (Aachen Night) which we omit here. There is no night time training data, and bridging the resulting domain gap is out of scope of ESAC.
ESAC Training. Both datasets represent large connected areas. For initializing the gating network and scene coordinate experts, we divide each area into clusters via kMeans, see Fig. 10 a) for an example. As input for clustering, we use the median scene coordinate (median per dimension) for each training image. To avoid quantization effects at the cluster borders during initialization, we use the following soft assignment of training images to experts. We express the probability of training image belonging to the cluster of expert via a similarity measure :
We define this similarity in terms of the distance between the mean scene coordinate of image , denoted , and the cluster center :
where is an estimate of the cluster size, and controls the softness of the similarity. We use , and the mean squared distance of all images (resp. their median scene coordinates) within a cluster to the cluster center as . When initializing the gating network, we minimize the KL-divergence of gating predictions and probabilities . When initializing an expert network , we randomly choose training images according to and minimize the distance to ground truth scene coordinates for 1M iterations. We obtain ground truth scene coordinates by rendering the sparse SfM reconstruction using the ground truth pose for image . Since ground truth scene coordinates are sparse, we optimize the re-projection error of the dense scene coordinate prediction for another 1M iterations, hence following the two-stage initialization of DSAC++. Finally, we train the entire ensemble jointly and end-to-end for 50k iterations using the ESAC objective. To support generalization to different camera types and lighting conditions, we convert all images to grayscale, and randomly change brightness and contrast during training in the range of 50-110% and 80-120%, respectively.
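As an illustration of the soft assignment described above, the following sketch computes per-cluster size estimates and normalized assignment probabilities. The exact functional form is an assumption: we use a Gaussian-shaped similarity, exp(-d² / (2 · alpha · size)), normalized over experts, which is one form consistent with the description but not necessarily the one used for ESAC. The function names, the argument names, and the value of `alpha` in the usage example are likewise hypothetical.

```python
import numpy as np


def cluster_size_estimates(image_coords, labels, centers):
    """Mean squared distance of the images in each cluster to their cluster center.

    image_coords: (N, 3) one representative scene coordinate per training image.
    labels:       (N,)   hard cluster index per image (e.g. from k-means).
    centers:      (M, 3) cluster centers. Every cluster is assumed to be non-empty.
    """
    sq_dist = np.sum((image_coords - centers[labels]) ** 2, axis=1)
    return np.array([sq_dist[labels == e].mean() for e in range(len(centers))])


def soft_assignment(image_coords, centers, cluster_sizes, alpha):
    """Probability of each image belonging to each expert's cluster.

    Assumed similarity: exp(-||coord - center||^2 / (2 * alpha * cluster_size)).
    Larger alpha yields softer assignments. Returns an (N, M) array whose rows sum to 1.
    """
    diff = image_coords[:, None, :] - centers[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)                      # (N, M)
    logits = -sq_dist / (2.0 * alpha * cluster_sizes[None, :])
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability only
    similarity = np.exp(logits)
    return similarity / similarity.sum(axis=1, keepdims=True)
```

A small usage example with two well-separated clusters on a line:

```python
coords = np.array([[0.0, 0, 0], [1.0, 0, 0], [10.0, 0, 0], [11.0, 0, 0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.5, 0, 0], [10.5, 0, 0]])
sizes = cluster_size_estimates(coords, labels, centers)
probs = soft_assignment(coords, centers, sizes, alpha=1.0)  # rows sum to 1
```

Training images for an expert could then be drawn with probability proportional to the corresponding column of `probs`, and the rows could serve as targets when initializing the gating network.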
Discussion of Results. As stated in the main text, ESAC demonstrates largely improved accuracy on both outdoor datasets compared to DSAC++. However, it does not yet reach the accuracy of ActiveSearch, a classic sparse feature-based re-localization method. Especially on Dubrovnik, ESAC stays far behind, even when using a substantial number of experts.
Upon closer inspection, we find that the structure of these datasets potentially contributes to the exceptional performance of ActiveSearch. Both datasets come with a 3D model of the environment and ground truth training poses created by running a sparse feature-based structure-from-motion reconstruction tool on all images (training and test). Images which are challenging for feature-based approaches (images with little structure or motion blur) are naturally not part of these datasets, since they are filtered out at the reconstruction stage. It might therefore be problematic to compare learning-based approaches to classical feature-based methods on datasets where the ground truth was generated with feature-based reconstruction tools.
Furthermore, the reconstructions are not perfect as they contain a substantial number of outlier points, see Fig. 13 for an illustration. ActiveSearch operates directly on top of this reconstruction, and applies sophisticated outlier rejection schemes. In contrast, scene coordinate regression methods like ESAC try to build geometrically consistent internal representations of a map, encoded in the network weights. Having visually similar image patches associated with very different ground truth scene coordinates (due to outliers) might result in severe overfitting of the network, which tries to tell patches apart that actually show the same location. The poor accuracy of ESAC on Dubrovnik compared to Aachen supports this interpretation, as the re-localization accuracy corresponds well to the general reconstruction quality of both datasets. At the same time, the question arises how meaningful the reported 1m re-localization accuracy for ActiveSearch on the Dubrovnik dataset is, given the ground truth quality. Note that geometry, training poses and test poses were all jointly optimized during the SfM reconstruction. Inaccuracies in the geometry might therefore hint at limited accuracy of the ground truth poses.
[1] Karim Ahmed, Mohammad Haris Baig, and Lorenzo Torresani. Network of experts for large-scale image categorization. In ECCV, 2016.
[2] Hirotugu Akaike. A new look at the statistical model identification. TAC, 1974.
[3] Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In CVPR, 2017.
[4] Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In CVPR, 2016.
[5] Vassileios Balntas, Shuda Li, and Victor Adrian Prisacariu. RelocNet: Continuous metric learning relocalisation using neural nets. In ECCV, 2018.
[6] Eric Brachmann, Alexander Krull, Sebastian Nowozin, Jamie Shotton, Frank Michel, Stefan Gumhold, and Carsten Rother. DSAC - Differentiable RANSAC for camera localization. In CVPR, 2017.
[7] Eric Brachmann, Frank Michel, Alexander Krull, Michael Y. Yang, Stefan Gumhold, and Carsten Rother. Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In CVPR, 2016.
[8] Eric Brachmann and Carsten Rother. Learning less is more - 6D camera localization via 3D surface regression. In CVPR, 2018.
[9] Samarth Brahmbhatt, Jinwei Gu, Kihwan Kim, James Hays, and Jan Kautz. Geometry-aware learning of maps for camera localization. In CVPR, 2018.
[10] Leo Breiman. Random forests. Machine Learning, 2001.
[11] Song Cao and Noah Snavely. Graph-based discriminative learning for location recognition. In CVPR, 2013.
[12] Tommaso Cavallari, Stuart Golodetz, Nicholas A. Lord, Julien Valentin, Luigi Di Stefano, and Philip H. S. Torr. On-the-fly adaptation of regression forests for online camera relocalisation. In CVPR, 2017.
[13] William Edwards Deming. Statistical Adjustment of Data. 1943.
[14] Martin A. Fischler and Robert C. Bolles. Random Sample Consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 1981.
[15] Yoav Freund and Robert E. Schapire. A short introduction to boosting. In IJCAI, 1999.
[16] Xiao-Shan Gao, Xiao-Rong Hou, Jianliang Tang, and Hang-Fei Cheng. Complete solution classification for the perspective-three-point problem. TPAMI, 2003.
[17] Abner Guzman-Rivera, Pushmeet Kohli, Ben Glocker, Jamie Shotton, Toby Sharp, Andrew Fitzgibbon, and Shahram Izadi. Multi-output learning for camera relocalization. In CVPR, 2014.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[19] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. In NIPS Workshops, 2015.
[20] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 1991.
[21] Wolfgang Kabsch. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 1976.
[22] Alex Kendall and Roberto Cipolla. Geometric loss functions for camera pose regression with deep learning. In CVPR, 2017.
[23] Alex Kendall, Matthew Grimes, and Roberto Cipolla. PoseNet: A convolutional network for real-time 6-DoF camera relocalization. In ICCV, 2015.
[24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[25] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[26] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. IJCV, 2009.
[27] Yunpeng Li, Noah Snavely, and Daniel P. Huttenlocher. Location recognition using prioritized feature matching. In ECCV, 2010.
[28] Hyon Lim, Sudipta N. Sinha, Michael F. Cohen, and Matthew Uyttendaele. Real-time image-based 6-DoF localization in large-scale environments. In CVPR, 2012.
[29] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[30] Saeed Masoudnia and Reza Ebrahimpour. Mixture of experts: A literature survey. Artificial Intelligence Review, 2014.
[31] Daniela Massiceti, Alexander Krull, Eric Brachmann, Carsten Rother, and Philip H. S. Torr. Random forests versus neural networks - What’s best for camera localization? In ICRA, 2017.
[32] Lili Meng, Jianhui Chen, Frederick Tung, James J. Little, Julien Valentin, and Clarence W. de Silva. Backtracking regression forests for accurate camera relocalization. In IROS, 2017.
[33] Lili Meng, Frederick Tung, James J. Little, Julien Valentin, and Clarence W. de Silva. Exploiting points and lines in regression forests for RGB-D camera relocalization. In IROS, 2018.
[34] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
[35] René Ranftl and Vladlen Koltun. Deep fundamental matrix estimation. In ECCV, 2018.
[36] Torsten Sattler, Michal Havlena, Filip Radenovic, Konrad Schindler, and Marc Pollefeys. Hyperpoints and fine vocabularies for large-scale location recognition. In ICCV, 2015.
[37] Torsten Sattler, Michal Havlena, Konrad Schindler, and Marc Pollefeys. Large-scale location recognition and the geometric burstiness problem. In CVPR, 2016.
[38] Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization. TPAMI, 2016.
[39] Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, Fredrik Kahl, and Tomas Pajdla. Benchmarking 6DoF outdoor visual localization in changing conditions. In CVPR, 2018.
[40] Torsten Sattler, Akihiko Torii, Josef Sivic, Marc Pollefeys, Hajime Taira, Masatoshi Okutomi, and Tomas Pajdla. Are large-scale 3D models really necessary for accurate visual localization? In CVPR, 2017.
[41] Grant Schindler, Matthew Brown, and Richard Szeliski. City-scale location recognition. In CVPR, 2007.
[42] Gideon Schwarz. Estimating the dimension of a model. Annals of Statistics, 1978.
[43] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.
[44] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In CVPR, 2013.
[45] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, 2014.
[46] Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Akihiko Torii. InLoc: Indoor visual localization with dense matching and view synthesis. In CVPR, 2018.
[47] Carl Toft, Erik Stenborg, Lars Hammarstrand, Lucas Brynte, Marc Pollefeys, Torsten Sattler, and Fredrik Kahl. Semantic match consistency for long-term visual localization. In ECCV, 2018.
[48] Julien Valentin, Angela Dai, Matthias Nießner, Pushmeet Kohli, Philip Torr, Shahram Izadi, and Cem Keskin. Learning to navigate the energy landscape. CoRR, 2016.
[49] Julien Valentin, Matthias Nießner, Jamie Shotton, Andrew Fitzgibbon, Shahram Izadi, and Philip H. S. Torr. Exploiting uncertainty in regression forests for accurate camera relocalization. In CVPR, 2015.
[50] Florian Walch, Caner Hazirbas, Laura Leal-Taixé, Torsten Sattler, Sebastian Hilsenbeck, and Daniel Cremers. Image-based localization with spatial LSTMs. In ICCV, 2017.
[51] Zhicheng Yan, Vignesh Jagadeesh, Dennis DeCoste, Wei Di, and Robinson Piramuthu. HD-CNN: Hierarchical deep convolutional neural network for image classification. In ICCV, 2015.
[52] Bangpeng Yao, Dirk Walther, Diane Beck, and Li Fei-Fei. Hierarchical mixture of classification experts uncovers interactions between brain regions. In NIPS, 2009.
[53] Kwang Moo Yi, Eduard Trulls, Yuki Ono, Vincent Lepetit, Mathieu Salzmann, and Pascal Fua. Learning to find good correspondences. In CVPR, 2018.