1 Introduction
Several problems in computer vision require processing sparse, unstructured collections of vectors. This type of data – clouds – is unordered and permutation-equivariant. Examples include pixel locations, depth measurements, and correspondences across images. Wide-baseline stereo is one of the fundamental problems in computer vision, and lies at the core of Structure-from-Motion (SfM), which, in turn, is the building block of applications such as 3D reconstruction Agarwal09, image-based rendering photosynth, and time-lapse smoothing hyperlapse, to name a few. Wide-baseline stereo has traditionally been solved by extracting small collections of discrete keypoints Lowe04 and finding correspondences among them with robust estimators Fischler81
, a reliable approach used for well over two decades. This has changed over the past few years, with the arrival of deep learning and an abundance of new dense
Zamir16; Ummenhofer17; Zhou17a and sparse Yi18a; Dang18; Ranftl18; Zhao19 methods. In this paper we focus on sparse methods, which have seen many recent developments made possible by the introduction of PointNets Qi17a; Qi17b – neural networks that rely on multi-layer perceptrons and global pooling to process unstructured data in a
permutation-equivariant manner – something which is not feasible with the convolutional or fully-connected layers commonly used by their dense counterparts. These types of networks – hereafter referred to as permutation-equivariant networks – have pioneered the application of deep learning to point clouds. The original PointNet relied on the concatenation of point-wise (context-agnostic) and global (point-agnostic) features to achieve permutation equivariance, yielding a complex architecture. Yi et al. Yi18a proposed Context Normalization
(CN) as an alternative: a strikingly effective, non-parametric solution based on normalizing the feature maps to zero mean and unit variance. Note that contrary to other normalization techniques utilized by neural networks
Ioffe15; Ba16; Ulyanov16; Wu18, whose primary objective is to improve convergence, CN is used to generate contextual information while preserving permutation equivariance. Despite its simplicity, it proved more effective than the PointNet approach on wide-baseline stereo, contributing to a significant relative increase in pose estimation accuracy; see (Yi18a, Fig. 5). Note that CN normalizes the feature maps according to the first- (mean) and second-order (variance) moments of the point cloud. Interestingly, these two quantities can be expressed as the solution of a least-squares problem:
$\boldsymbol{\mu} = \operatorname*{argmin}_{\mathbf{m}} \sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{m}\|_2^2, \qquad \boldsymbol{\sigma}^2 = \tfrac{1}{N} \min_{\mathbf{m}} \sum_{i=1}^{N} \|\mathbf{x}_i - \mathbf{m}\|_2^2.$ (1)
However, it is well known that least-squares optimization is not robust to outliers (regcourse_siga16, Sec. 3), a problem that also afflicts CN. We illustrate this effect in Figure 2, on a toy example of line fitting with noisy data. Note that this is a critical weakness, as the application CN was originally devised for, wide-baseline stereo, is a problem plagued with outliers: high outlier-to-inlier ratios are not uncommon on standard public datasets; see Section 4.3.
To address this issue, we take inspiration from a classical technique in robust optimization: Iteratively Reweighted Least Squares (IRLS) irls. Let us consider the computation of the first-order moment as an example. Rather than minimizing the square of the residuals, we optimize with respect to a robust kernel $\rho$ that allows outliers to be ignored. This optimization can then be converted back into an iterative least-squares form. With iterations indexed by $t$, we write
$\boldsymbol{\mu}^{(t)} = \operatorname*{argmin}_{\mathbf{m}} \sum_{i=1}^{N} w\!\left(\mathbf{x}_i, \boldsymbol{\mu}^{(t-1)}\right) \|\mathbf{x}_i - \mathbf{m}\|_2^2,$ (2)
where $w(\cdot)$ is the so-called penalty function associated with the kernel $\rho$; see mestimator; robust. Inspired by this, we design a network that learns to progressively focus its attention on the inliers, hence operating analogously to the reweighting performed across IRLS iterations.
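To make the contrast with plain least squares concrete, below is a minimal IRLS sketch for the first-order moment. The Cauchy kernel weight is an assumed choice, since the kernel $\rho$ is left unspecified here:

```python
import numpy as np

def irls_mean(x, iters=20, c=1.0):
    """Robust first-order moment via IRLS, using the Cauchy kernel weight
    w(r) = c^2 / (c^2 + r^2) (an assumed choice; the text leaves the
    kernel rho unspecified)."""
    mu = x.mean()                        # plain least-squares initialization
    for _ in range(iters):
        r = x - mu                       # residuals w.r.t. current estimate
        w = c**2 / (c**2 + r**2)         # down-weight large residuals
        mu = (w * x).sum() / w.sum()     # weighted least-squares update
    return mu

# 90% inliers near 0 and 10% outliers near 50: the least-squares mean is
# dragged towards the outliers, while the IRLS estimate stays near the inliers.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 0.1, 90), rng.normal(50.0, 0.1, 10)])
print(round(x.mean(), 2))      # ~5.0
print(round(irls_mean(x), 2))  # ~0.0
```

The reweighting is what the attention network mimics: each iteration trusts points with small residuals more, so the estimate migrates towards the inlier consensus.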
Specifically, we propose to train a perceptron that translates the (intermediate) feature maps into their corresponding attention weights, and normalizes the feature maps accordingly. We thus denote our approach Attentive Context Normalization (ACN), and the networks that rely on this mechanism Attentive Context Networks (ACNe). We consider two types of attention, one that operates on each data point individually (local), and one that estimates the relative importance between data points (global), and demonstrate that using them together yields the best performance. We also evaluate the effect of supervising this attention mechanism when possible. Our work is, to the best of our knowledge, the first to apply attentive mechanisms to point clouds. We verify the effectiveness of our method on (1) robust line fitting, (2) digit classification, and (3) real-world wide-baseline stereo datasets, showing drastic performance improvements over the state-of-the-art.
2 Related work
We discuss recent works on deep networks operating on point clouds, review various normalization methods for deep networks, and briefly discuss attention mechanisms in machine learning.
Deep networks for point clouds.
Several methods have been proposed to process point cloud data with neural networks. These include graph convolutional networks Defferrard16; Kipf17, VoxelNets voxelnet, tangent convolutions Tatarchenko18, and many others. A simpler strategy was introduced by PointNets Qi17a; Qi17b
, which have become a popular solution, as they are easy to implement and train. At their core, they are convolutional neural networks with 1×1 kernels and global pooling operations. Further enhancements to the PointNet architecture include incorporating locality information with kernel correlation Shen18b, and contextual information with LSTMs Liu18f. Another relevant work is Deep Sets Zaheer17, which derives neural network parameterizations that guarantee permutation-equivariance for point clouds.

Permutation-equivariant networks for stereo.
While PointNets were originally introduced for segmentation and classification of 3D point clouds, Yi et al. Yi18a showed that they can also be highly effective for robust matching in stereo, demonstrating a drastic leap in performance over handcrafted methods Fischler81; Torr00; Bian17. Moreover, their solution was simpler, replacing the global feature pooling of Qi17a with CN. While similar to other normalization techniques for deep networks Ioffe15; Ba16; Ulyanov16; Wu18, CN has a drastically different role – to aggregate point-wise feature maps and generate contextual information. Follow-up works include: in Ranftl18, a similar network was applied iteratively to estimate the fundamental matrix; in Dang18, the same architecture was used with a novel loss formulation; in NMNet Zhao19, the notion of locality was incorporated. All of these works rely on CN without further updates – we show how to embed an attention mechanism inside it to improve its performance.
Normalization in deep networks.
In addition to CN, different strategies have been proposed to normalize feature maps in a deep network, starting with the seminal work of Batch Normalization
Ioffe15, which proposed to normalize the feature maps over a mini-batch. Layer Normalization Ba16 transposed this operation by looking at all channels for a single sample in the batch, whereas Group Normalization Wu18 applies it over subsets of channels. Further efforts have proposed normalizing the weights instead of the activations Salimans16, or their eigenvalues
Goodfellow14. The main use of all these normalization techniques is to stabilize the optimization process and speed up convergence. By contrast, Instance Normalization Ulyanov16 proposed to normalize individual image samples for style transfer, and was improved in Huang17 by aligning the mean and standard deviation of content and style. Regardless of the specifics, all of these normalization techniques operate on the entire sample – in other words, they do not consider the presence of outliers or their statistics. While this is not critical in image-based pipelines, it can be extremely harmful for point clouds; see the example in Figure
2.

Attentional methods.
The core idea behind attention mechanisms is to focus
on the crucial parts of the input. There are different forms of attention, and they have been applied to a wide range of machine learning problems, from natural language processing to images. Vaswani et al.
Vaswani17 proposed an attentional model for machine translation that eschews recurrent architectures. Luong et al.
Luong15 blended two forms of attention on sequential inputs, demonstrating performance improvements in text translation. Xu et al. Xu15 showed how to employ soft and hard attention to gaze on salient objects and generate automated image captions. Local response normalization has been used to find salient responses in feature maps Jarrett09; Krizhevsky12, and can be interpreted as a form of lateral inhibition Hartline56. The use of attention in convolutional deep networks was pioneered by Spatial Transformer Networks
Jaderberg15, which introduced a differentiable sampler that allows for spatial manipulation of the image.

3 Attentive Context Normalization
Given a feature map $\mathbf{F} \in \mathbb{R}^{N \times C}$, where $N$ is the number of features (or data points at layer zero), $C$ is the number of channels, and each row corresponds to a data point, we recall that Context Normalization Yi18a is a non-parametric operation that can be written as
$\mathcal{CN}(\mathbf{f}_i) = (\mathbf{f}_i - \boldsymbol{\mu}) \oslash \boldsymbol{\sigma},$ (3)
where $\boldsymbol{\mu}$ is the arithmetic mean, $\boldsymbol{\sigma}$ is the standard deviation of the features across the $N$ points, and $\oslash$ denotes element-wise division. Here we assume a single cloud, but generalizing to multiple clouds (i.e. a batch) is straightforward. Note that to preserve the properties of unstructured clouds, the information in the feature maps needs to be normalized in a permutation-equivariant way. We extend CN by introducing a weight vector $\mathbf{w} \in \mathbb{R}^N$, and indicate with $\boldsymbol{\mu}_{\mathbf{w}}$ and $\boldsymbol{\sigma}_{\mathbf{w}}$ the corresponding weighted mean and standard deviation. In contrast to Context Normalization, we compute the weights with a parametric function $g$ with trainable parameters $\theta$ (for simplicity, we abuse the notation and drop the layer index from all parameters; all the perceptrons in our work operate individually over each data point, with shared parameters within each layer) that takes as input the feature map and returns a unit-norm vector of weights:
$\mathbf{w} = g(\mathbf{F}; \theta),$ (4)
where $\|\mathbf{w}\| = 1$. We then define Attentive Context Normalization as
$\mathcal{ACN}(\mathbf{f}_i) = (\mathbf{f}_i - \boldsymbol{\mu}_{\mathbf{w}}) \oslash \boldsymbol{\sigma}_{\mathbf{w}}.$ (5)
The purpose of the attention network is to compute a weight function that focuses the normalization of the feature maps on a subset of the input features – the inliers. As a result, the network can learn to effectively cluster the features, and therefore separate inliers from outliers.
There are multiple attention functions that we can design, and multiple ways to combine them into a single attention vector $\mathbf{w}$. We will now describe those that we found effective for finding correspondences in wide-baseline stereo, and how to combine and supervise them effectively.
Generating attention.
We leverage two different types of attention mechanisms: local and global. We implement simple, yet effective, forms of each as:
$w^{\text{local}}_i = \operatorname{sigmoid}\!\left(\mathbf{p}^\top \mathbf{f}_i + b\right),$ (6)
$w^{\text{global}}_i = \operatorname{softmax}_i\!\left(\mathbf{p}^\top \mathbf{f}_i + b\right),$ (7)
where $\mathbf{p}$ and $b$ are the parameters of a perceptron, and $\mathbf{f}_i$ denotes the feature vector for data point $i$ – the $i$-th row of the feature map $\mathbf{F}$. Observe that the local attention mechanism (6) acts on each feature vector independently, whereas the global attention mechanism (7) relates the feature vector of each data point to the entire collection through the softmax operation.
Blending attention.
Note that rescaling the weight vector does not change the normalization applied in (5). Therefore, to take multiple types of attention into account simultaneously, we simply merge them through element-wise multiplication: $\mathbf{w} = \mathbf{w}^{\text{local}} \odot \mathbf{w}^{\text{global}}$. One could instead use a parametric form of attention blending; however, it is not trivial to combine the weights in a permutation-equivariant way, and we found this simple strategy effective.
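A minimal numpy sketch of ACN with both attention mechanisms and the element-wise blending described above. The single-perceptron parameterization mirrors Eqs. (6)–(7); the specific parameters below are illustrative, not the trained ones:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def acn(F, p_local, b_local, p_global, b_global, eps=1e-8):
    """Attentive Context Normalization sketch.

    F: (N, C) feature map. Local attention passes each feature vector
    through a perceptron + sigmoid; global attention relates all points
    through a softmax; the two are blended by element-wise product."""
    w_local = 1.0 / (1.0 + np.exp(-(F @ p_local + b_local)))   # (N,)
    w_global = softmax(F @ p_global + b_global)                # (N,)
    w = w_local * w_global
    w = w / w.sum()                  # rescaling w leaves the result unchanged
    mu = (w[:, None] * F).sum(axis=0)                          # weighted mean
    var = (w[:, None] * (F - mu) ** 2).sum(axis=0)             # weighted variance
    return (F - mu) / np.sqrt(var + eps)

# Toy check with random parameters: the output is permutation-equivariant.
rng = np.random.default_rng(1)
F = rng.normal(size=(5, 4))
p_l, p_g = rng.normal(size=4), rng.normal(size=4)
out = acn(F, p_l, 0.0, p_g, 0.0)
perm = rng.permutation(5)
assert np.allclose(acn(F[perm], p_l, 0.0, p_g, 0.0), out[perm])
```

The permutation check passes because both the sigmoid (point-wise) and the softmax (order-independent sum) commute with row permutations, as do the weighted moments.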
Supervising attention.
In some problems, the class of each data point is known a priori and explicit supervision can be performed. In this case, adding a supervised loss on the attention signals can be beneficial. For instance, when finding good correspondences for stereo we can apply a binary cross-entropy loss, using the epipolar distance to generate labels for each putative correspondence, as in Yi18a. Our experiments in Section 6.3.1 show that while this type of supervision can improve performance, our normalization brings benefits to learning even without it.
4 Network architecture and applications
Our network receives as input $\mathbf{F}^0$, the tensor representation of the point cloud, and produces an output tensor $\mathbf{F}^L$. Note that, as the cloud is unstructured, the output must be equivariant with respect to permutations of the rows of the input. This output tensor is then used in different ways according to the task at hand. We model our architecture after Yi18a, which we refer to as Context Network (CNe). It features a series of residual blocks He16 with Context Normalization (CN). Our architecture, which we call Attentive Context Network, or ACNe, is pictured in Figure 1. A key distinction is that within each normalization block (Figure 1; right) we link the individual outputs of each perceptron to our ACN layer. We also replace the Batch Normalization layers Ioffe15 used in Yi18a with Group Normalization Wu18, as we found it performs better – please refer to Section 6.3.1 for ablation tests. We demonstrate that ACNe can be used for multiple applications, ranging from classical problems such as robust line fitting (Section 4.1) and digit classification on MNIST point clouds (Section 4.2), to robust camera pose estimation for wide-baseline stereo (Section 4.3).
4.1 Robust line fitting
We consider the problem of fitting a line to a collection of 2D points riddled with noise and outliers; see Figure 2. This problem can be addressed via smooth (IRLS) or combinatorial (RANSAC) optimization – both methods can be interpreted in terms of sparse optimization, such that inliers and outliers are clustered separately; see sparseicp. Let us parameterize a line as the locus of points $\mathbf{x}$ for which $\mathbf{v}^\top [\mathbf{x}; 1] = 0$. We can then score each row of the input (i.e. each 2D point) by passing the output tensor to an additional weight network – with local and global components – following (4), yielding weights $\mathbf{w}$. Given $\mathbf{w}$, and expressing our points in homogeneous coordinates as $\bar{\mathbf{X}}$, we can compute the covariance matrix $\mathbf{C} = \bar{\mathbf{X}}^\top \operatorname{diag}(\mathbf{w}) \bar{\mathbf{X}}$. Then, denoting $\hat{\mathbf{v}}$ as the eigenvector of $\mathbf{C}$ corresponding to its smallest eigenvalue, $\hat{\mathbf{v}}$ is the estimated line equation that we seek. We therefore minimize the difference between this eigenvector and the ground truth, with additional guidance on $\mathbf{w}$ to help convergence. Specifically, we minimize the following loss:

$\mathcal{L} = \alpha \min\left(\|\hat{\mathbf{v}} - \mathbf{v}\|_2^2,\; \|\hat{\mathbf{v}} + \mathbf{v}\|_2^2\right) + \beta\, \mathcal{H}(\mathbf{w}, \mathbf{y}),$ (8)

where $\mathcal{H}(\mathbf{w}, \mathbf{y})$ is the average binary cross-entropy between $\mathbf{w}$ and $\mathbf{y}$, $\mathbf{y}$ is the ground-truth inlier label vector, and the hyperparameters $\alpha$ and $\beta$ control the influence of each loss. The $\min$ resolves the issue that $\mathbf{v}$ and $-\mathbf{v}$ denote the same line.
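The geometric step – the line as the eigenvector of the weighted covariance of homogeneous points with the smallest eigenvalue – can be sketched as follows; the weights here are plain inputs standing in for the attention network's output:

```python
import numpy as np

def fit_line_weighted(pts, w):
    """Return line parameters v = (a, b, c), with a x + b y + c = 0, as the
    eigenvector associated with the smallest eigenvalue of the weighted
    covariance of homogeneous points. In ACNe the weights w come from the
    attention network; here they are plain inputs."""
    X = np.hstack([pts, np.ones((len(pts), 1))])   # homogeneous coordinates
    C = X.T @ (w[:, None] * X)                     # weighted covariance (3, 3)
    _, eigvecs = np.linalg.eigh(C)                 # ascending eigenvalues
    return eigvecs[:, 0]

# Noise-free points on y = 2x + 1, with uniform weights.
x = np.linspace(-1.0, 1.0, 20)
pts = np.stack([x, 2.0 * x + 1.0], axis=1)
v = fit_line_weighted(pts, np.ones(20) / 20)
v = v / v[1]                                       # fix the scale/sign ambiguity
print(np.round(v, 3))                              # ≈ [-2, 1, -1], i.e. -2x + y - 1 = 0
```

Down-weighting a point shrinks its contribution to the covariance, which is exactly how the learned attention suppresses outliers in this fit.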
4.2 Point cloud classification
We can also apply ACNe to the classification of entire point clouds, rather than reasoning about individual points. As in the previous application, we consider a set of 2D locations as input. To classify each point set, we transform the output tensor into a single vector $\mathbf{z}$, and associate it with the ground-truth one-hot class vector $\mathbf{y}$ through a softmax. Additional weight networks that generate the attention are trained for this task. Thus, the loss that we optimize is the cross-entropy:

$\mathcal{L} = \mathcal{H}\!\left(\operatorname{softmax}(\mathbf{z}), \mathbf{y}\right).$ (9)
Figure 2: Qualitative results for robust line fitting. (a, d) RANSAC Fischler81; (b, e) CNe Yi18a; (c, f) ACNe (ours).
4.3 Wide-baseline stereo
In stereo we are given putative correspondences, so that our input is $\mathbf{F}^0 \in \mathbb{R}^{N \times 4}$, where $N$ is the number of correspondences and each row contains two pixel locations on different images. As in Yi18a; Zheng18; Zhao19, we assume known camera intrinsics and normalize the coordinates accordingly. We obtain the weights $\mathbf{w}$ from the output tensor via (4), as in Section 4.1. The weights indicate which correspondences are considered inliers, and their relative importance. We then apply a weighted variant of the 8-point algorithm Hartley00 to retrieve the essential matrix $\mathbf{E}$, which parameterizes the relative motion between the two cameras. To do so we adopt the differentiable, non-parametric form proposed by Yi18a, and denote this operation $\hat{\mathbf{E}}(\mathbf{w}, \mathbf{F}^0)$. We then train our network to regress the ground-truth essential matrix, as well as providing auxiliary guidance to the final local attention used to construct the output of the network, with per-correspondence labels $\mathbf{y}$ obtained by thresholding the symmetric epipolar distance Hartley00, as in Yi18a. In addition, we also perform auxiliary supervision on the intermediate local attentions within the network, as discussed in Section 3. Note that it is not immediately obvious how to supervise the global attention mechanism. We therefore write:
$\mathcal{L} = \alpha \min\left(\|\hat{\mathbf{E}} - \mathbf{E}\|_F^2,\; \|\hat{\mathbf{E}} + \mathbf{E}\|_F^2\right) + \beta\, \mathcal{H}(\mathbf{w}^{\text{local}}, \mathbf{y}) + \gamma \sum_{l} \mathcal{H}(\mathbf{w}^{\text{local}}_{l}, \mathbf{y}),$ (10)
where $\|\cdot\|_F$ is the Frobenius norm, $\mathcal{H}$ is the binary cross-entropy, and $\mathbf{y}$ denotes the ground-truth inlier labels. Again, the hyperparameters $\alpha$, $\beta$, and $\gamma$ control the influence of each loss. The $\min$ resolves the issue that $\mathbf{E}$ and $-\mathbf{E}$ express the same solution.
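A numpy sketch of the weighted 8-point step. This is a plain eigendecomposition formulation under standard assumptions; the exact differentiable layer of Yi18a may differ in details such as coordinate pre-normalization:

```python
import numpy as np

def weighted_eight_point(x1, x2, w):
    """Weighted 8-point sketch: the (flattened, row-major) essential matrix
    is the eigenvector of X^T diag(w) X with the smallest eigenvalue,
    then projected onto the essential manifold (singular values (1, 1, 0)).

    x1, x2: (N, 2) intrinsics-normalized coordinates; w: (N,) weights."""
    u1, v1 = x1[:, 0], x1[:, 1]
    u2, v2 = x2[:, 0], x2[:, 1]
    # Row i encodes the epipolar constraint x2_i^T E x1_i = 0.
    X = np.stack([u2 * u1, u2 * v1, u2,
                  v2 * u1, v2 * v1, v2,
                  u1, v1, np.ones_like(u1)], axis=1)
    C = X.T @ (w[:, None] * X)
    _, vecs = np.linalg.eigh(C)            # ascending eigenvalues
    E = vecs[:, 0].reshape(3, 3)
    U, _, Vt = np.linalg.svd(E)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt

# Noise-free synthetic check: points under a known rotation + translation
# should satisfy the epipolar constraint with the recovered E.
rng = np.random.default_rng(2)
P = rng.uniform(-1.0, 1.0, size=(20, 3))
P[:, 2] += 5.0                             # keep points in front of the camera
c, s = np.cos(0.1), np.sin(0.1)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
t = np.array([1.0, 0.2, 0.1])
P2 = P @ R.T + t
x1, x2 = P[:, :2] / P[:, 2:3], P2[:, :2] / P2[:, 2:3]
E = weighted_eight_point(x1, x2, np.ones(20) / 20)
x1h = np.concatenate([x1, np.ones((20, 1))], axis=1)
x2h = np.concatenate([x2, np.ones((20, 1))], axis=1)
assert np.abs(np.einsum('ni,ij,nj->n', x2h, E, x1h)).max() < 1e-6
```

Because the weights enter only through `diag(w)`, down-weighted (outlier) correspondences contribute negligibly to the constraint matrix, mirroring the IRLS view of Section 1.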
5 Implementation details
Network architecture.
In all of our experiments we employ a 12-layer structure (excluding the first linear layer that changes the number of channels) for ACNe, with six attentive residual blocks (ARBs) and two perceptron layers in each ARB. We use 32 groups for Group Normalization, as suggested in Wu18. Similarly to Yi18a, we use the same number of channels in every perceptron.
Training setup.
For all applications, we use the ADAM optimizer Kingma15 with default parameters and a fixed learning rate. With the exception of robust line fitting, we always use a validation set to perform early stopping. For robust line fitting, the data is purely synthetic and thus effectively infinite, and we run the training for 50k iterations. For MNIST, we use 70k samples with an 8:1:1 split for training, validation, and testing. For stereo, we adopt a 3:1:1 split, as in Yi18a. For the loss terms involving eigendecomposition (those weighted by $\alpha$ in (8) and (10)), we follow the weighting of Yi18a; all other loss terms are weighted equally, that is, $\beta = \gamma$. For stereo, we follow Yi18a and introduce the term involving the essential matrix – the first term in (10) – after 20k iterations.
Test time RANSAC (stereo only).
As in Yi18a; Zheng18, we evaluate the use of outlier rejection with RANSAC at test time to maximize generalization. To do so, we simply threshold the weights $\mathbf{w}$, select the data points above that threshold as inliers, and run RANSAC with the same symmetric epipolar distance threshold as Yi18a, in order to keep the results comparable. We then compute the essential matrix with the standard (non-weighted) 8-point algorithm on the surviving inliers. We compare these results with those obtained directly from the weighted 8-point formulation, without further processing. As we will discuss in Section 6.3, RANSAC helps when applying trained models to unseen scenes.
6 Results
We first consider a toy example on fitting 2D lines with a large ratio of outliers. We then apply our method to digit classification on MNIST, thresholding the grayscale image and using the locations of the surviving pixels as data points, as in Qi17a; Qi17b. These two experiments illustrate that our attentional method performs better than vanilla Context Normalization in the presence of outliers. We then apply our solution to wide-baseline stereo and demonstrate that this increase in performance holds on challenging real-world applications. Finally, we perform an ablation study and evaluate the effect of supervising the weights used for attention in stereo.
6.1 Robust line fitting – Figure 2 and Table 2
To generate 2D points on a random line, as well as outliers, we first sample 2D points uniformly within a fixed range. We then select two points at random and fit a line through them. Each point is projected onto the line with a probability equal to the desired inlier ratio, forming the inliers. We measure the error as the distance between the estimated and ground-truth line parameters. The results are summarized in Table 2, with qualitative examples in Figure 2. ACNe consistently outperforms CNe Yi18a. Both methods break down at an 85–90% outlier ratio, but the performance of ACNe degrades more gracefully. As illustrated in Figure 2, our method learns to progressively focus on the inliers throughout the different layers of the network in order to weed out the outliers and reach a solution.

Method        60%     70%     80%    85%    90%   (outlier ratio)
CNe Yi18a     .0039   .0051   .041   .147   .433
ACNe (Ours)   .0001   .0045   .033   .117   .389
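The synthetic data protocol of Section 6.1 can be sketched as follows; the [-1, 1] sampling range is an assumption, since the paper only states a fixed range:

```python
import numpy as np

def make_line_data(n=100, inlier_ratio=0.3, rng=None):
    """Synthesize the 2D line-fitting data: sample n points uniformly,
    pick two of them to define a line, then project each point onto the
    line with probability `inlier_ratio` (the [-1, 1] range is an
    assumption; the text only states a fixed sampling range)."""
    rng = rng or np.random.default_rng()
    pts = rng.uniform(-1.0, 1.0, size=(n, 2))
    a, b = pts[rng.choice(n, size=2, replace=False)]
    d = (b - a) / np.linalg.norm(b - a)            # unit direction of the line
    inlier = rng.uniform(size=n) < inlier_ratio    # per-point inlier coin flip
    proj = a + ((pts - a) @ d)[:, None] * d        # orthogonal projection
    pts[inlier] = proj[inlier]
    return pts, inlier, (a, b)

pts, inlier, (a, b) = make_line_data(200, 0.2, np.random.default_rng(0))
```

Points not selected as inliers keep their uniform positions and play the role of outliers, so the inlier ratio directly controls the difficulty of the fit.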
Methods          0%     10%    20%    30%    40%    50%    60%   (outlier ratio)
PointNet Qi17a   98.1   95.1   93.2   79.5   67.7   70.0   54.8
CNe Yi18a        98.0   95.8   94.0   91.0   90.1   87.7   87.2
ACNe (Ours)      98.3   97.2   96.5   95.3   94.7   94.3   93.7
6.2 Point cloud classification – Figure 3 and Table 2
We evaluate our approach on handwritten digit classification on MNIST, which consists of grayscale images. We create a point cloud from each image following the procedure of Qi17a: we threshold the image at 128 and use the coordinates – normalized to the unit bounding box – of the surviving pixel locations as data samples. We subsample with replacement to 512 points, so that all training samples contain the same number of points, and then add small Gaussian noise (standard deviation 0.01) to the coordinates, following Qi17a. For outliers, we sample from a uniform distribution. We compare our method against vanilla PointNet Qi17a and CNe Yi18a. For PointNet, we re-implemented the method under our framework to ensure an identical training setup. We do not apply the initial affine estimation, so that we can isolate the architectural differences between the methods – note that this module could also be added to our method. Table 2 summarizes the results in terms of classification accuracy; our method performs best, with the gap widening as the outlier ratio increases. Note that the results for PointNet differ slightly from those reported in Qi17a, as we do not use the full training set, but split it into training and validation sets to perform early stopping. In addition, we run each model 10 times and report the average results.
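The image-to-point-cloud conversion above can be sketched as follows (a sketch of the stated protocol; the exact normalization details may differ from the original implementation):

```python
import numpy as np

def image_to_point_cloud(img, n_points=512, noise=0.01, rng=None):
    """MNIST digit image -> 2D point cloud, following the protocol above:
    threshold at 128, normalize surviving pixel coordinates to the unit
    bounding box, subsample with replacement to a fixed size, then add
    Gaussian noise with standard deviation 0.01."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(img >= 128)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.min(axis=0)                          # shift bounding box to origin
    pts /= max(pts.max(), 1.0)                      # scale to the unit box
    idx = rng.choice(len(pts), size=n_points, replace=True)
    return pts[idx] + rng.normal(0.0, noise, size=(n_points, 2))

# Toy 28x28 "image" with a bright square standing in for a digit.
img = np.zeros((28, 28))
img[10:18, 10:18] = 255
cloud = image_to_point_cloud(img, rng=np.random.default_rng(0))
print(cloud.shape)    # (512, 2)
```

Subsampling with replacement guarantees a fixed cloud size even for digits with few bright pixels, which keeps batching trivial.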
6.3 Wide-baseline stereo – Figure 4 and Table 6.3
Wide-baseline stereo is an extremely challenging problem, due to the large number of variables to account for – viewpoint, scale, illumination, occlusions, and imaging conditions; see Figure 4. We benchmark the performance of our approach on a real-world dataset against multiple state-of-the-art baselines, following the data and protocols provided by Yi18a, where ground-truth camera poses are obtained with Structure-from-Motion Wu13. We sample pairs of images and estimate the relative camera motion with different methods. The error is measured as the angular difference between the estimated and ground-truth vectors for both rotation and translation, and summarized by the mean Average Precision (mAP) over all image pairs in the test set, with a fixed error threshold.
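The angular error used in this evaluation can be sketched as below; taking the absolute cosine (sign-agnostic comparison) is our assumption, motivated by the sign ambiguity of the translation recovered from an essential matrix:

```python
import numpy as np

def angular_error_deg(v_est, v_gt):
    """Angular difference in degrees between two vectors. The absolute
    cosine (sign-agnostic comparison) is an assumption, since the
    translation recovered from an essential matrix is defined up to sign."""
    c = np.dot(v_est, v_gt) / (np.linalg.norm(v_est) * np.linalg.norm(v_gt))
    return np.degrees(np.arccos(np.clip(abs(c), 0.0, 1.0)))

print(angular_error_deg(np.array([1.0, 0.0, 0.0]),
                        np.array([0.0, 1.0, 0.0])))   # 90.0
```

The mAP is then the fraction of image pairs whose rotation and translation errors both fall under the chosen threshold, averaged over thresholds up to the limit.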
We extract 2k keypoints with SIFT Lowe04 and determine an initial set of correspondences by nearest-neighbour matching of their descriptors. We consider the following methods: LMeds Rousseeuw84, RANSAC Fischler81, MLESAC Torr00, CNe Yi18a, and ACNe (ours). We also evaluate GMS Bian17, a semi-dense method based on a large number (8k) of ORB features, and DeMoN Ummenhofer17, a dense method trained on outdoor images with wide baselines. For CNe and ACNe, we consider the pose estimated with the weighted 8-point algorithm directly, as well as adding a post-processing step with RANSAC followed by the standard 8-point algorithm, as outlined in Section 5. Finally, in addition to SIFT, we consider two learned, state-of-the-art local features: SuperPoint DeTone17b and LFNet Ono18.
Method                        Same   Unseen
Baselines
SIFT+LMeds                     5.9    5.9
SIFT+RANSAC                    9.8    9.8
SIFT+MLESAC                    5.2    5.2
GMS                           14.4   14.4
DeMoN                          4.6    4.6
SIFT+CNe (weighted-8pt)       44.9   20.7
SIFT+CNe (RANSAC)             50.0   27.2
Ours
SIFT+ACNe (weighted-8pt)      64.3   28.7
SIFT+ACNe (RANSAC)            61.5   31.4
LFNet+ACNe (weighted-8pt)     67.9   34.6
LFNet+ACNe (RANSAC)           61.5   37.0
SPoint+ACNe (weighted-8pt)    70.9   37.7
SPoint+ACNe (RANSAC)          61.1   41.1
(Baseline methods are not learned, so a single mAP value, averaged as in Table 5, covers both columns.)
Quantitative results.
We use the following five sequences from Yi18a: Buckingham Palace (BP), Notre Dame (ND), Reichstag (RS), Sacre Coeur (SC), and St. Peter's Square (SP). We train models on each of these five sequences, and report results both on that same sequence and over the remaining four. Due to space constraints, we summarize the results in Table 6.3; see the appendix for full results. We make four fundamental observations: (1) Our method consistently outperforms all of the baselines, including CNe, by a large relative margin both on the same scene and on unseen scenes. The margin between learned and traditional methods is dramatic, with ACNe more than doubling the performance of the closest non-learned baseline (GMS) even in its least favourable setting. (2) Contrary to the findings of Yi18a, we observe that RANSAC may harm performance, particularly on known sequences, and sometimes even on unseen sequences – 2 out of 5 perform better without it. Moreover, the relative gain from RANSAC post-processing on unseen scenes is smaller for ACNe than for CNe – this suggests that our approach generalizes better and produces better estimates without auxiliary help. (3) As observed by Yi et al. Yi18a, while DeMoN was trained on wide-baseline stereo pairs, it performs very poorly on this data. This suggests that dense methods have not yet bridged the gap with sparse methods for this problem. (4) With modern local features, such as LFNet or SuperPoint, we can further increase performance, both on the same sequence and on unseen sequences.
6.3.1 Ablation study – Table 6.3.1
Methods          CNe Yi18a        ACNe (Ours)
                 w/ BN   w/ GN    L      G      L+G    L+G+S
Weighted-8pt     48.4    47.1     62.8   65.7   66.4   70.6
RANSAC           54.3    53.2     63.9   65.5   65.9   66.4
We perform an ablation study to evaluate the effect of the different types of attention (local: L; global: G), as well as supervision (S) on the local component of the attentive mechanism. (1) We confirm that CNe Yi18a performs better with Batch Normalization (BN) Ioffe15 than with Group Normalization (GN) Wu18 – we nevertheless use GN for ACNe, as it performs marginally better than BN alongside our attentive mechanism. (2) We observe that our attentive mechanisms allow ACNe to outperform CNe, and that their combination outperforms their separate use. (3) We demonstrate that applying supervision to the weights boosts performance further.
7 Conclusion
We have proposed Attentive Context Normalization (ACN), and used it to build Attentive Context Networks (ACNe) to solve problems on permutation-equivariant data. Our solution is inspired by IRLS, where one iteratively reweighs the importance of each sample via a soft inlier/outlier assignment. We demonstrated that by learning both local and global attention we are able to outperform state-of-the-art solutions on line fitting, handwritten digit classification, and wide-baseline stereo. Notably, our method thrives under large outlier ratios. An interesting future direction would be to incorporate ACN into general normalization techniques for deep learning, as all of them make use of statistical moments.
Acknowledgements
This work was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant “Deep Visual Geometry Machines” (RGPIN201803788, DGECR201800426), Google, and by systems supplied by Compute Canada.
References
Appendix A Full stereo results
Same sequence  Unseen sequences  
Method  BP  ND  RS  SC  SP  Avg.  BP  ND  RS  SC  SP  Avg. 
Baselines  
SIFT+LMeds  5.3  3.9  8.7  8.5  3.3  5.9  6.1  6.4  5.2  5.3  6.6  5.9 
SIFT+RANSAC  7.2  8.0  10.1  16.1  7.7  9.8  10.4  10.3  9.7  8.2  10.3  9.8 
SIFT+MLESAC  4.7  2.0  9.6  6.7  2.8  5.2  5.3  6.0  4.0  4.8  5.7  5.2 
GMS  12.8  10.4  20.7  14.6  13.4  14.4  14.8  15.4  12.8  14.3  14.6  14.4 
DeMoN  5.2  2.4  10.1  2.3  3.3  4.6  4.5  5.2  3.3  5.2  5.0  4.6 
SIFT+CNe (weighted-8pt)  31.9  33.1  51.0  60.8  47.7  44.9  24.3  21.0  13.8  15.0  29.3  20.7
SIFT+CNe (RANSAC)  35.4  34.7  64.4  62.4  53.2  50.0  26.1  30.1  22.2  26.9  30.6  27.2 
Ours  
SIFT+ACNe (weighted-8pt)  56.2  54.8  62.5  75.5  72.4  64.3  31.2  26.6  22.0  24.4  39.2  28.7
SIFT+ACNe (RANSAC)  50.3  47.8  67.8  73.1  68.2  61.5  33.3  29.5  26.0  28.7  39.3  31.4 
LFNet+ACNe (weighted-8pt)  73.2  49.4  66.8  72.8  77.5  67.9  42.8  30.0  27.8  27.0  45.3  34.6
LFNet+ACNe (RANSAC)  63.4  40.2  68.8  65.8  68.1  61.2  41.6  32.7  33.6  32.3  45.0  37.0 
SPoint+ACNe (weighted-8pt)  75.3  51.0  71.2  76.4  80.9  70.9  44.0  34.8  28.9  32.0  48.6  37.7
SPoint+ACNe (RANSAC)  63.2  41.1  64.9  68.0  68.2  61.1  42.9  37.4  36.3  42.6  46.2  41.1 
We train models on each of the five sequences, and report results both on that same sequence (denoted 'Same sequence') and over the remaining four sequences (averaged; denoted 'Unseen sequences'). Results for all sequences are reported in Table 5. Our models perform best overall, with the model trained on SP providing the best generalization.