Attentive Context Normalization for Robust Permutation-Equivariant Learning

07/04/2019 ∙ by Weiwei Sun, et al. ∙ Google ∙ University of Victoria

Many problems in computer vision require dealing with sparse, unstructured data in the form of point clouds. Permutation-equivariant networks have become a popular solution - they operate on individual data points with simple perceptrons and extract contextual information with global pooling strategies. In this paper, we propose Attentive Context Normalization (ACN), a simple yet effective technique to build permutation-equivariant networks robust to outliers. Specifically, we show how to normalize the feature maps with weights that are estimated within the network so that outliers are excluded from the normalization. We use this mechanism to leverage two types of attention: local and global - by combining them, our method is able to find the essential data points in high-dimensional space in order to solve a given task. We demonstrate through extensive experiments that our approach, which we call Attentive Context Networks (ACNe), provides a significant leap in performance compared to the state-of-the-art on camera pose estimation, robust fitting, and point cloud classification under the presence of noise and outliers.




1 Introduction

Several problems in computer vision require processing sparse, unstructured collections of vectors $\mathcal{P} = \{\mathbf{p}_n \in \mathbb{R}^D\}$. This type of data – clouds – is unordered, and must be processed in a permutation-equivariant manner. Examples include pixel locations ($D=2$), depth sensors ($D=3$), and correspondences across images ($D=4$). Wide-baseline stereo ($D=4$) is one of the fundamental problems in computer vision, and lies at the core of Structure-from-Motion (SfM), which, in turn, is the building block of applications such as 3D reconstruction Agarwal09, image-based rendering photosynth and time-lapse smoothing hyperlapse, to name a few.

Wide-baseline stereo has traditionally been solved by extracting small collections of discrete keypoints Lowe04 and finding correspondences among them with robust estimators Fischler81, a reliable approach used for well over two decades. This has changed over the past few years, with the arrival of deep learning and an abundance of new dense Zamir16; Ummenhofer17; Zhou17a and sparse Yi18a; Dang18; Ranftl18; Zhao19 methods. In this paper we focus on sparse methods, which have seen many recent developments made possible by the introduction of PointNets Qi17a; Qi17b: neural networks that rely on multi-layer perceptrons and global pooling to process unstructured data in a permutation-equivariant manner – something which is not feasible with the convolutional or fully-connected layers commonly used by their dense counterparts.

These types of networks – hereafter referred to as permutation-equivariant networks – have pioneered the application of deep learning to point clouds. The original PointNet relied on the concatenation of point-wise (context-agnostic) and global (point-agnostic) features to achieve permutation equivariance, yielding a complex architecture. Yi et al. Yi18a proposed Context Normalization (CN) as an alternative, a strikingly effective, non-parametric solution based on the normalization of the feature maps to zero mean and unit variance. Note that contrary to other normalization techniques utilized by neural networks Ioffe15; Ba16; Ulyanov16; Wu18, whose primary objective is to improve convergence, CN is used to generate contextual information while preserving permutation equivariance. Despite its simplicity, it proved more effective than the PointNet approach on wide-baseline stereo, contributing to an increase in pose estimation accuracy; see (Yi18a, Fig. 5).

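Note that CN normalizes the feature maps according to the first- (mean) and second- (variance) order moments of the point cloud. Interestingly, these two quantities can be expressed as the solution of a least-squares problem. Writing $\mathbf{f}_n$ for the feature vector of the $n$-th data point and $^{\circ 2}$ for the element-wise square:

$$\mu(\mathbf{f}) = \arg\min_{\mathbf{z}} \sum_{n} \|\mathbf{f}_n - \mathbf{z}\|_2^2, \qquad \sigma^2(\mathbf{f}) = \arg\min_{\mathbf{v}} \sum_{n} \|(\mathbf{f}_n - \mu(\mathbf{f}))^{\circ 2} - \mathbf{v}\|_2^2. \tag{1}$$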


However, it is well known that least-squares optimization is not robust to outliers (regcourse_siga16, Sec. 3), a problem that also afflicts CN. We illustrate this effect in Figure 2, on a toy example on line fitting with noisy data. Note that this is a critical weakness, as the application CN was originally devised for, wide-baseline stereo, is a problem plagued with outliers: high outlier-to-inlier ratios are not uncommon on standard public datasets; see Section 4.3.

To address this issue, we take inspiration from a classical technique used in robust optimization: Iteratively Re-weighted Least Squares (IRLS) irls. Let us consider the computation of the first-order moment as an example. Rather than using the square of the residuals, we optimize with respect to a robust kernel $\kappa(\cdot)$ that allows for outliers to be ignored. This optimization can then be converted back into an iterative least-squares form. With iterations indexed by $t$, we write

$$\mu^{t} = \arg\min_{\mathbf{z}} \sum_{n} \psi^{-1}\left(\|\mathbf{f}_n - \mu^{t-1}\|_2\right) \|\mathbf{f}_n - \mathbf{z}\|_2^2, \tag{2}$$

where $\psi^{-1}(\cdot)$ is the so-called penalty function associated with the kernel $\kappa(\cdot)$; see mestimator; robust. Inspired by this, we design a network that learns to progressively focus its attention on the inliers, hence operating analogously to $\psi^{-1}(\cdot)$ over the IRLS iterations.
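The IRLS idea for the first-order moment can be illustrated with a minimal sketch. The Cauchy-style weight function, the `scale` parameter, and the synthetic data are assumptions made for illustration; the text does not prescribe a specific kernel.

```python
import numpy as np

def irls_mean(f, num_iters=20, scale=1.0, eps=1e-8):
    """Robust mean of the rows of f via Iteratively Re-weighted Least Squares.

    Assumption: the weight 1 / (1 + (r / scale)^2) stands in for the
    kernel-dependent weight; any decreasing function of the residual
    would illustrate the same mechanism.
    """
    mu = f.mean(axis=0)  # iteration 0: the plain least-squares solution
    for _ in range(num_iters):
        r = np.linalg.norm(f - mu, axis=1)      # residuals w.r.t. previous estimate
        w = 1.0 / (1.0 + (r / scale) ** 2)      # down-weight large residuals
        mu = (w[:, None] * f).sum(axis=0) / (w.sum() + eps)  # weighted least squares
    return mu

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=0.1, size=(200, 2))    # cluster around the origin
outliers = rng.uniform(low=5.0, high=10.0, size=(100, 2))  # far-away contamination
f = np.concatenate([inliers, outliers], axis=0)

plain_mean = f.mean(axis=0)   # pulled towards the outliers
robust_mean = irls_mean(f)    # stays close to the inlier cluster
print("plain:", plain_mean, "robust:", robust_mean)
```

Each pass solves a weighted least-squares problem whose weights come from the previous estimate, which is the behaviour the network is designed to imitate across its layers.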

Specifically, we propose to train a perceptron that translates the (intermediate) feature maps into their corresponding attention weights, and normalizes them accordingly. We thus denote our approach as Attentive Context Normalization (ACN), and the networks that rely on this mechanism Attentive Context Networks (ACNe). We consider two types of attention, one that operates on each data point individually (local), and one that estimates the relative importance between data points (global), and demonstrate that using them together yields the best performance. We also evaluate the effect of supervising this attention mechanism when possible. Our work is, to the best of our knowledge, the first to apply attentive mechanisms to point clouds. We verify the effectiveness of our method on (i) robust line fitting, (ii) digit classification, and (iii) real-world wide-baseline stereo datasets, showing drastic performance improvements over the state-of-the-art.

2 Related work

We discuss recent works on deep networks operating on point clouds, review various normalization methods for deep networks, and briefly discuss attention mechanisms in machine learning.

Deep networks for point clouds.

Several methods have been proposed to process point cloud data with neural networks. These include graph convolutional networks Defferrard16; Kipf17, VoxelNets voxelnet, tangent convolutions Tatarchenko18, and many others. A simpler strategy was introduced by PointNets Qi17a; Qi17b, which have become a popular solution, as they are easy to implement and train. At their core, they are convolutional neural networks with $1 \times 1$ kernels and global pooling operations. Further enhancements to the PointNet architecture include incorporating locality information with kernel correlation Shen18b, and contextual information with LSTMs Liu18f. Another relevant work is Deep Sets Zaheer17, which derives neural network parameterizations that guarantee permutation-equivariance of point clouds.

Permutation-equivariant networks for stereo.

While PointNets were originally introduced for segmentation and classification of 3D point clouds, Yi et al. Yi18a showed that they can also be highly effective for robust matching in stereo, showing a drastic leap in performance against hand-crafted methods Fischler81; Torr00; Bian17. Moreover, their solution was simpler, replacing the global feature pooling of Qi17a with CN. While similar to other normalization techniques for deep networks Ioffe15; Ba16; Ulyanov16; Wu18, CN has a drastically different role – to aggregate point-wise feature maps and generate contextual information. Follow-up works include: in Ranftl18, a similar network was applied in an iterative way to estimate the fundamental matrix; in Dang18, the same architecture was used with a novel loss formulation; in NM-Net Zhao19, the notion of locality was incorporated. All of these works rely on CN without further updates – we show how to embed an attention mechanism inside it to improve its performance.

Normalization in deep networks.

In addition to CN, different strategies have been proposed to normalize feature maps in a deep network, starting with the seminal work of Batch Normalization Ioffe15, which proposed to normalize the feature maps over a mini-batch. Layer Normalization Ba16 transposed this operation by looking at all channels for a single sample in the batch, whereas Group Normalization Wu18 applied it over subsets of channels. Further efforts have proposed to normalize the weights instead of the activations Salimans16, or their eigenvalues Goodfellow14. The main use of all these normalization techniques is to stabilize the optimization process and speed up convergence. By contrast, Instance Normalization Ulyanov16 proposed to normalize individual image samples for style transfer, and was improved in Huang17 by aligning the mean and standard deviation of content and style. Regardless of the specifics, all of these normalization techniques operate on the entire sample – in other words, they do not consider the presence of outliers or their statistics. While this is not critical in image-based pipelines, it can be extremely harmful for point clouds; see the example in Figure 2.


Attentional methods.

The core idea behind attention mechanisms is to focus on the crucial parts of the input. There are different forms of attention, and they have been applied to a wide range of machine learning problems, from natural language processing to images. Vaswani et al. proposed an attentional model for machine translation eschewing recurrent architectures. Luong et al. Luong15 blended two forms of attention on sequential inputs, demonstrating performance improvements in text translation. Xu et al. Xu15 showed how to employ soft and hard attention to gaze on salient objects and generate automated image captions. Local response normalization has been used to find salient responses in feature maps Jarrett09; Krizhevsky12, and can be interpreted as a form of lateral inhibition Hartline56. The use of attention in convolutional deep networks was pioneered by Spatial Transformer Networks Jaderberg15, which introduced a differentiable sampler that allows for spatial manipulation of the image.

3 Attentive Context Normalization

Given a feature map $\mathbf{f} \in \mathbb{R}^{N \times C}$, where $N$ is the number of features (or data points at layer zero), $C$ is the number of channels, and each row corresponds to a data point, we recall that Context Normalization Yi18a is a non-parametric operation that can be written as

$$\mathcal{N}_{\text{CN}}(\mathbf{f}) = (\mathbf{f} - \mu(\mathbf{f})) \oslash \sigma(\mathbf{f}), \tag{3}$$

where $\mu(\mathbf{f})$ is the arithmetic mean, $\sigma(\mathbf{f})$ is the standard deviation of the features across $N$, and $\oslash$ denotes the element-wise division. Here we assume a single cloud, but generalizing to multiple clouds (i.e. a batch) is straightforward. Note that to preserve the properties of unstructured clouds, the information in the feature maps needs to be normalized in a permutation-equivariant way. We extend CN by introducing a weight vector $\mathbf{w} \in [0, 1]^N$, and indicate with $\mu_{\mathbf{w}}(\mathbf{f})$ and $\sigma_{\mathbf{w}}(\mathbf{f})$ the corresponding weighted mean and standard deviation. In contrast to Context Normalization, we compute the weights with a parametric function $\mathcal{W}_{\omega}(\cdot)$ with trainable parameters $\omega$ that takes as input the feature map, and returns a unit norm vector of weights:

$$\mathbf{w} = \eta(\mathcal{W}_{\omega}(\mathbf{f})), \tag{4}$$

where $\eta(\mathbf{x}) = \mathbf{x} / \|\mathbf{x}\|_1$ normalizes the weights to unit norm. (Footnote: for simplicity, we abuse the notation and drop the layer index from all parameters. All the perceptrons in our work operate individually over each data point with shared parameters across each layer.) We then define Attentive Context Normalization as

$$\mathcal{N}_{\text{ACN}}(\mathbf{f}; \mathbf{w}) = (\mathbf{f} - \mu_{\mathbf{w}}(\mathbf{f})) \oslash \sigma_{\mathbf{w}}(\mathbf{f}). \tag{5}$$

The purpose of the attention network is to compute a weight function that focuses the normalization of the feature maps on a subset of the input features – the inliers. As a result, the network can learn to effectively cluster the features, and therefore separate inliers from outliers.
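The difference between CN and ACN can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: the weights are taken to be non-negative and normalized to unit L1 norm so that they define a weighted mean, and the "oracle" weights below (one for inliers, zero for outliers) are hypothetical stand-ins for what the attention network would learn.

```python
import numpy as np

def context_norm(f, eps=1e-8):
    """CN: zero mean and unit standard deviation per channel, over the N points."""
    mu = f.mean(axis=0, keepdims=True)
    sigma = f.std(axis=0, keepdims=True)
    return (f - mu) / (sigma + eps)

def attentive_context_norm(f, w, eps=1e-8):
    """ACN: same normalization, but with weighted mean and standard deviation."""
    w = w / (w.sum() + eps)   # unit L1 norm (assumed; weights are non-negative)
    w = w[:, None]            # broadcast one weight per data point
    mu = (w * f).sum(axis=0, keepdims=True)
    sigma = np.sqrt((w * (f - mu) ** 2).sum(axis=0, keepdims=True))
    return (f - mu) / (sigma + eps)

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(80, 4))
outliers = rng.normal(20.0, 5.0, size=(20, 4))
f = np.concatenate([inliers, outliers])           # (N=100, C=4) feature map
w_uniform = np.ones(100)                          # reduces ACN to CN
w_oracle = np.concatenate([np.ones(80), np.zeros(20)])  # hypothetical ideal attention

out_cn = context_norm(f)
out_acn = attentive_context_norm(f, w_oracle)
# With oracle weights, the inlier rows are normalized by inlier statistics only.
print(out_acn[:80].mean(axis=0), out_acn[:80].std(axis=0))
```

With uniform weights the two functions coincide, and permuting the rows of `f` (together with `w`) permutes the rows of the output in the same way.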

There are multiple attention functions that we can design, and multiple ways to combine them into a single attention vector $\mathbf{w}$. We will now describe those that we found effective for finding correspondences in wide-baseline stereo, and how to combine and supervise them effectively.

Generating attention.

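We leverage two different types of attention mechanisms, local and global. We implemented simple, yet effective, forms of local and global attention as:

$$w_i^{local} = \mathrm{sigmoid}(\mathbf{W}\mathbf{f}_i + \mathbf{b}), \tag{6}$$

$$w_i^{global} = \frac{\exp(\mathbf{W}\mathbf{f}_i + \mathbf{b})}{\sum_{j=1}^{N} \exp(\mathbf{W}\mathbf{f}_j + \mathbf{b})}, \tag{7}$$

where $\mathbf{W}$ and $\mathbf{b}$ are the parameters of a perceptron (a separate one for each type of attention), and $\mathbf{f}_i$ denotes the feature vector for data point $i$ – the $i$-th row of the feature map $\mathbf{f}$. Observe that the local attention mechanism (6) acts on each feature vector independently, whereas the global attention mechanism (7) relates the feature vector for each data point to the collection through the softmax operation.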
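A minimal sketch of the two attention types follows. It assumes a sigmoid for the local attention (the text only states that it acts on each point independently) and a softmax over the points for the global attention; the parameter names `W_l`, `b_l`, `W_g`, `b_g` are hypothetical.

```python
import numpy as np

def local_attention(f, W, b):
    """One weight per point, computed from that point's feature vector only."""
    return 1.0 / (1.0 + np.exp(-(f @ W + b)))   # sigmoid, shape (N,)

def global_attention(f, W, b):
    """One weight per point, coupled to all other points through a softmax."""
    logits = f @ W + b
    e = np.exp(logits - logits.max())           # subtract max for numerical stability
    return e / e.sum()                          # shape (N,), sums to one

rng = np.random.default_rng(0)
N, C = 6, 8
f = rng.normal(size=(N, C))
W_l, b_l = rng.normal(size=C), 0.0   # hypothetical local perceptron parameters
W_g, b_g = rng.normal(size=C), 0.0   # hypothetical global perceptron parameters

w_local = local_attention(f, W_l, b_l)
w_global = global_attention(f, W_g, b_g)

# Changing point 1 leaves the local weight of point 0 untouched,
# but lowers its global weight because the softmax denominator grows.
f_shifted = f.copy()
f_shifted[1] += W_g
w_local_shifted = local_attention(f_shifted, W_l, b_l)
w_global_shifted = global_attention(f_shifted, W_g, b_g)
```

The final lines make the distinction concrete: editing one point affects only that point's local weight, while every global weight changes.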

Blending attention.

Note that, because the weights are normalized to unit norm in (4), taking the product of attention vectors does not change the scale of the normalization applied in (5). Therefore, to take into account multiple types of attention simultaneously, we simply merge them together through element-wise multiplication, $\mathbf{w} = \mathbf{w}^{local} \odot \mathbf{w}^{global}$. Clearly one could use a parametric form of attention blending instead; however, it is not trivial to combine the weights in a permutation-equivariant way, and we found this simple strategy effective.
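The scale argument can be checked numerically. The sketch below assumes L1 normalization of the blended weights and uses random stand-ins for the two attention vectors; it shows that rescaling the product leaves the weighted moments, and hence the normalization, unchanged.

```python
import numpy as np

def l1_normalize(w, eps=1e-12):
    """Rescale non-negative weights so that they sum to one."""
    return w / (np.abs(w).sum() + eps)

def blend(w_local, w_global):
    """Merge two attention vectors by element-wise multiplication."""
    return l1_normalize(w_local * w_global)

def weighted_moments(f, w):
    """Weighted mean and standard deviation per channel."""
    w = l1_normalize(w)[:, None]
    mu = (w * f).sum(axis=0)
    sigma = np.sqrt((w * (f - mu) ** 2).sum(axis=0))
    return mu, sigma

rng = np.random.default_rng(0)
N, C = 50, 3
f = rng.normal(size=(N, C))
w_local = rng.uniform(0.0, 1.0, size=N)   # stand-in for sigmoid outputs
logits = rng.normal(size=N)
w_global = np.exp(logits) / np.exp(logits).sum()   # stand-in for softmax outputs

w = blend(w_local, w_global)
mu_a, sigma_a = weighted_moments(f, w_local * w_global)
mu_b, sigma_b = weighted_moments(f, 3.7 * w_local * w_global)  # arbitrary rescaling
```

Because only the relative magnitudes of the weights matter, the product needs no extra calibration before it is used for normalization.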

Supervising attention.

In some problems, the class for each data point is known a priori and explicit supervision can be performed. In this case, adding a supervised loss on the attention signals can be beneficial. For instance, when finding good correspondences for stereo we can apply binary cross-entropy using the epipolar distance to generate labels for each putative correspondence, as in Yi18a. Our experiments in Section 6.3.1 show that while this type of supervision can improve performance, our normalization still brings benefits to learning.

Figure 1: ACNe architecture – (Left) Our permutation-equivariant network receives an input tensor of size $N \times D$, which is processed by a series of Attentive Residual Blocks (ARB). The output of the network is a tensor of size $N \times C$, which is then converted to a representation appropriate for the task at hand. Note that the first perceptron changes the dimensionality from $D$ (input dimensions) to $C$ (feature dimensions). (Middle) Within each residual path of the ARB, we manipulate the feature map with perceptrons with trainable parameters, followed by Attentive Context Normalization (ACN) – we repeat this structure twice. (Right) An ACN module computes local/global attention with two trainable networks, combines them via element-wise multiplication, and normalizes the feature maps with said weights, followed by Group Normalization. Note that the only form of sharing that takes place is that each of the $N$ features shares the same processing path.

4 Network architecture and applications

Our network receives as input $\mathbf{P} \in \mathbb{R}^{N \times D}$, the tensor representation of the cloud $\mathcal{P}$, and produces an output tensor $\mathbf{O} \in \mathbb{R}^{N \times C}$. Note that as $\mathcal{P}$ is unstructured, $\mathbf{O}$ must be equivariant with respect to permutations of the rows of $\mathbf{P}$. This output tensor is then used in different ways according to the task at hand. We model our architecture after Yi18a, which we refer to as Context Network (CNe). It features a series of residual blocks He16 with Context Normalization (CN). Our architecture, which we call Attentive Context Network, or ACNe, is pictured in Figure 1. A key distinction is that within each normalization block (Figure 1; right) we link the individual outputs of each perceptron to our ACN layer. We also replace the Batch Normalization layers Ioffe15 used in Yi18a with Group Normalization Wu18, as we found it performs better – please refer to Section 6.3.1 for ablation tests. We demonstrate that ACNe can be used to solve multiple applications, ranging from classical problems such as robust line fitting (Section 4.1) and digit classification on MNIST with point clouds (Section 4.2), to robust camera pose estimation for wide-baseline stereo (Section 4.3).
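A forward pass through one Attentive Residual Block can be sketched as follows. Several details are assumptions rather than statements from the text: the ReLU non-linearity, the ordering perceptron, ACN, Group Normalization, ReLU, the group statistics being taken over points and channels within a group, and all parameter names and initializations.

```python
import numpy as np

def acn(f, w, eps=1e-6):
    """Attentive Context Normalization with non-negative weights w of shape (N,)."""
    w = (w / (w.sum() + eps))[:, None]
    mu = (w * f).sum(axis=0)
    sigma = np.sqrt((w * (f - mu) ** 2).sum(axis=0))
    return (f - mu) / (sigma + eps)

def group_norm(f, groups, eps=1e-6):
    """Group Normalization; statistics over the points and the channels of each group (assumed)."""
    n, c = f.shape
    g = f.reshape(n, groups, c // groups)
    mu = g.mean(axis=(0, 2), keepdims=True)
    sigma = g.std(axis=(0, 2), keepdims=True)
    return ((g - mu) / (sigma + eps)).reshape(n, c)

def attention(f, p):
    """Local (sigmoid) times global (softmax numerator) attention; acn() renormalizes."""
    local = 1.0 / (1.0 + np.exp(-(f @ p["W_local"] + p["b_local"])))
    logits = f @ p["W_global"] + p["b_global"]
    return local * np.exp(logits - logits.max())

def arb(f, layers, groups=4):
    """Residual block: (perceptron -> ACN -> GN -> ReLU) per layer, plus skip connection."""
    h = f
    for p in layers:
        h = h @ p["W"] + p["b"]               # point-wise perceptron, shared across points
        h = acn(h, attention(h, p))           # attention computed from the perceptron output
        h = np.maximum(group_norm(h, groups), 0.0)
    return f + h

def init_layer(rng, c):
    """Hypothetical random initialization, for illustration only."""
    s = c ** -0.5
    return {"W": rng.normal(scale=s, size=(c, c)), "b": np.zeros(c),
            "W_local": rng.normal(scale=s, size=c), "b_local": 0.0,
            "W_global": rng.normal(scale=s, size=c), "b_global": 0.0}

rng = np.random.default_rng(0)
N, C = 32, 16
layers = [init_layer(rng, C) for _ in range(2)]   # two perceptron layers per block
f = rng.normal(size=(N, C))
out = arb(f, layers)
```

Every operation either acts on one row at a time or aggregates over all rows symmetrically, so permuting the input rows permutes the output rows identically.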

Figure 2: Robust neural line fitting – We learn to fit lines with outliers (80%) via our ACNe, as well as CNe Yi18a. We visualize the ground truth and the network estimates. We color-code the weights learned by the k-th residual layer of ACNe and used to normalize the feature maps – notice that our method, which mimics Iterative Re-weighted Least Squares (IRLS), learns to progressively focus its attention on the inliers. This allows ACNe to find the correct solution where CNe fails.

4.1 Robust line fitting

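We consider the problem of fitting a line to a collection of points that is ridden by noise and outliers; see Figure 2. This problem can be addressed via smooth (IRLS) or combinatorial (RANSAC) optimization – both methods can be interpreted in terms of sparse optimization, such that inliers and outliers are clustered separately; see sparseicp. Let us parameterize a line as the locus of points $[x, y]$ such that $[x, y, 1] \cdot \boldsymbol{\theta} = 0$. We can then score each row of $\mathbf{P}$ (i.e. each 2D point) by passing the output tensor $\mathbf{O}$ to an additional weight network – with local and global components – following (4), yielding weights $\mathbf{w} \in [0, 1]^N$. Given $\mathbf{w}$, and expressing our points in homogeneous coordinates as $\bar{\mathbf{P}} = [\mathbf{P}, \mathbf{1}] \in \mathbb{R}^{N \times 3}$, we can compute our covariance matrix as $\mathbf{C}_{\mathbf{w}}(\mathbf{P}) = \bar{\mathbf{P}}^\top \mathrm{diag}(\mathbf{w}) \bar{\mathbf{P}} \in \mathbb{R}^{3 \times 3}$. Then, denoting $\nu_0(\mathbf{C})$ as the eigenvector of $\mathbf{C}$ corresponding to its smallest eigenvalue, $\nu_0(\mathbf{C}_{\mathbf{w}}(\mathbf{P}))$ is the estimated line equation that we seek to find. We therefore minimize the difference between this eigenvector and the ground truth, with additional guidance to $\mathbf{w}^{local}$ to help convergence. Specifically, we minimize the following loss:

$$\mathcal{L}(\omega) = \alpha \min_{\pm} \left\| \nu_0(\mathbf{C}_{\mathbf{w}}(\mathbf{P})) \pm \boldsymbol{\theta} \right\|_2^2 + \beta\, H(\mathbf{y}, \mathbf{w}^{local}), \tag{8}$$

where $H(\mathbf{a}, \mathbf{b})$ is the average binary cross entropy between $\mathbf{a}$ and $\mathbf{b}$, $\mathbf{y}$ is the ground-truth inlier label, and hyper-parameters $\alpha$ and $\beta$ control the influence of these losses. The $\min_{\pm}$ resolves the issue that $-\boldsymbol{\theta}$ and $\boldsymbol{\theta}$ denote the same line.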
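The weighted eigenvector fit and the sign-invariant comparison can be sketched as below. The example line, the outlier range, and the "oracle" weights are hypothetical; in the method described above the weights would come from the network.

```python
import numpy as np

def weighted_line_fit(points, w):
    """Line parameters as the smallest eigenvector of the weighted 3x3 moment matrix."""
    ones = np.ones((points.shape[0], 1))
    P_h = np.concatenate([points, ones], axis=1)   # homogeneous coordinates, (N, 3)
    C = P_h.T @ (w[:, None] * P_h)                 # P_h^T diag(w) P_h
    _, eigvecs = np.linalg.eigh(C)                 # eigenvalues in ascending order
    return eigvecs[:, 0]                           # unit-norm eigenvector

def sign_invariant_error(theta_est, theta_gt):
    """Squared distance, taking the better of the two sign choices."""
    return min(np.sum((theta_est - theta_gt) ** 2),
               np.sum((theta_est + theta_gt) ** 2))

rng = np.random.default_rng(0)
theta_gt = np.array([2.0, -1.0, 0.5])              # the line y = 2x + 0.5
theta_gt /= np.linalg.norm(theta_gt)

x = rng.uniform(-1.0, 1.0, size=20)
inliers = np.stack([x, 2.0 * x + 0.5], axis=1)
outliers = rng.uniform(-1.0, 1.0, size=(80, 2))    # 80% outliers
points = np.concatenate([inliers, outliers])

w_oracle = np.concatenate([np.ones(20), np.zeros(80)])  # hypothetical ideal weights
w_uniform = np.ones(100)                                # plain least squares

err_oracle = sign_invariant_error(weighted_line_fit(points, w_oracle), theta_gt)
err_uniform = sign_invariant_error(weighted_line_fit(points, w_uniform), theta_gt)
print(err_oracle, err_uniform)
```

With weights that isolate the inliers the line is recovered up to sign, while uniform weights let the outliers dominate the fit.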

Figure 3: Point cloud classification – We add salt-and-pepper noise to MNIST images, and then convert the digits to an unstructured point cloud. The % reports the outlier-to-inlier ratio.

4.2 Point cloud classification

We can also apply ACNe to point cloud classification rather than reasoning about individual points. As in the previous application, we consider a set of 2D locations $\mathbf{P} \in \mathbb{R}^{N \times 2}$ as input. In order to classify each point set, we transform the output tensor $\mathbf{O}$ into a single vector $\mathbf{v}$, and associate it with a ground-truth one-hot vector $\mathbf{y}$ through softmax. Additional weight networks to generate $\mathbf{w}$ are trained for this task. We train with the cross entropy loss. Thus, the loss that we optimize is:

$$\mathcal{L}(\omega) = H(\mathbf{y}, \mathrm{softmax}(\mathbf{v})), \tag{9}$$

where $H$ denotes the cross entropy.

(a) RANSAC Fischler81 (b) CNe Yi18a (c) ACNe (ours) (d) RANSAC Fischler81 (e) CNe Yi18a (f) ACNe (ours)
Figure 4: Wide-baseline stereo – We show the results of different matching algorithms on the dataset of Yi18a. We draw the inliers produced by them, in green if the match is below the epipolar distance threshold (in red otherwise). Note that this may include some false positives, as epipolar constraints map points to lines – perfect ground truth would require dense pixel-to-pixel correspondences.

4.3 Wide-baseline stereo

In stereo we are given correspondences, so that our input is $\mathbf{P} \in \mathbb{R}^{N \times 4}$, where $N$ is the number of correspondences and each row contains two pixel locations on different images. As in Yi18a; Zheng18; Zhao19, we assume we know the camera intrinsics and normalize the coordinates accordingly. We obtain $\mathbf{w}$ from the output tensor $\mathbf{O}$ via (4), as in Section 4.1. The weights indicate which correspondences are considered to be inliers, and their relative importance. We then apply a weighted variant of the 8-point algorithm Hartley00 to retrieve the essential matrix $\hat{\mathbf{E}}$, which parameterizes the relative camera motion between the two cameras. To do so we adopt the differentiable, non-parametric form proposed by Yi18a, and denote this operation as $\hat{\mathbf{E}} = g(\mathbf{P}, \mathbf{w})$. We then train our network to regress to the ground-truth essential matrix $\mathbf{E}$, as well as providing auxiliary guidance to $\mathbf{w}^{local}$ – the final local attention used to construct the output of the network – with per-correspondence labels obtained by thresholding over the symmetric epipolar distance Hartley00 as in Yi18a. In addition, we also perform auxiliary supervision on $\mathbf{w}^{local}_k$ – the intermediate local attentions within the network – as discussed in Section 3. Note that it is not immediately obvious how to supervise the global attention mechanism. We therefore write:

$$\mathcal{L}(\omega) = \alpha \min_{\pm} \left\| \mathbf{E} \pm g(\mathbf{P}, \mathbf{w}) \right\|_F^2 + \beta\, H(\mathbf{y}, \mathbf{w}^{local}) + \gamma\, \frac{1}{K} \sum_{k=1}^{K} H(\mathbf{y}, \mathbf{w}^{local}_k), \tag{10}$$

where $\|\cdot\|_F$ is the Frobenius norm, $H$ is the binary cross-entropy, $K$ is the number of intermediate attention layers, and $\mathbf{y}$ denotes ground truth inlier labels. Again, the hyper-parameters $\alpha$, $\beta$, and $\gamma$ control the influence of each loss. The $\min_{\pm}$ resolves the issue that $-\mathbf{E}$ and $\mathbf{E}$ express the same solution.
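A weighted 8-point solve can be sketched as an eigen-decomposition, mirroring the line-fitting case. The construction below is a common textbook formulation and is only assumed to correspond to the differentiable form referenced above; the synthetic two-view geometry and the "oracle" weights are hypothetical.

```python
import numpy as np

def weighted_eight_point(x1, x2, w):
    """Essential matrix (unit Frobenius norm) from weighted correspondences.

    x1, x2: (N, 2) normalized image coordinates; w: (N,) non-negative weights.
    Each row of X encodes x2_h^T E x1_h = 0 as a linear constraint on vec(E).
    """
    u1, v1 = x1[:, 0], x1[:, 1]
    u2, v2 = x2[:, 0], x2[:, 1]
    ones = np.ones_like(u1)
    X = np.stack([u2 * u1, u2 * v1, u2, v2 * u1, v2 * v1, v2, u1, v1, ones], axis=1)
    M = X.T @ (w[:, None] * X)          # X^T diag(w) X, shape (9, 9)
    _, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
    return eigvecs[:, 0].reshape(3, 3)  # row-major vec(E)

def essential_error(E_est, E_gt):
    """Squared Frobenius distance, taking the better of the two sign choices."""
    return min(np.sum((E_est - E_gt) ** 2), np.sum((E_est + E_gt) ** 2))

rng = np.random.default_rng(0)
a = 0.2                                  # rotation about the y axis, in radians
R = np.array([[np.cos(a), 0.0, np.sin(a)], [0.0, 1.0, 0.0], [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([1.0, 0.2, 0.1])
t_cross = np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])
E_gt = t_cross @ R
E_gt /= np.linalg.norm(E_gt)             # Frobenius normalization

# 3D points in front of both cameras, projected to normalized coordinates.
X1 = np.concatenate([rng.uniform(-2, 2, size=(40, 2)), rng.uniform(4, 8, size=(40, 1))], axis=1)
X2 = X1 @ R.T + t
x1 = X1[:, :2] / X1[:, 2:]
x2 = X2[:, :2] / X2[:, 2:]
x2[15:] = rng.uniform(-0.5, 0.5, size=(25, 2))   # corrupt 25 of the 40 matches

w_oracle = np.concatenate([np.ones(15), np.zeros(25)])  # hypothetical ideal weights
err_oracle = essential_error(weighted_eight_point(x1, x2, w_oracle), E_gt)
err_uniform = essential_error(weighted_eight_point(x1, x2, np.ones(40)), E_gt)
print(err_oracle, err_uniform)
```

Since the solve is just a symmetric eigen-decomposition of a matrix built from the inputs, it can sit at the end of a network and be trained through, which is why the quality of the weights matters so much.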

5 Implementation details

Network architecture.

In all of our experiments we employ a 12-layer structure (excluding the first linear layer that changes the number of channels) for ACNe, with six ARBs and two perceptron layers in each ARB. We also use 32 groups for Group Normalization, as suggested in Wu18. Similarly to Yi18a, we use channels per perceptron.

Training setup.

For all applications, we use the ADAM optimizer Kingma15 with default parameters and a learning rate of . With the exception of robust line fitting, we always use a validation set to perform early stopping. For robust line fitting, the data is purely synthetic and thus infinite, and we run the training for 50k iterations. For MNIST, we use 70k samples with an 8:1:1 split for training, validation and testing. For stereo, we adopt a 3:1:1 split, as in Yi18a. For the loss term involving eigen-decomposition (the terms multiplied by $\alpha$ in (8) and (10)), we use the weight of Yi18a. All other loss terms, that is, those multiplied by $\beta$ and $\gamma$, share a single weight. For stereo, we follow Yi18a and introduce the term involving the essential matrix – the first term in (10) – after 20k iterations.

Test time RANSAC (stereo only).

As in Yi18a; Zheng18, we evaluate the use of outlier rejection with RANSAC at test time to maximize generalization capability. In order to do so, we simply threshold the weights $\mathbf{w}$, select the data points above that threshold as inliers, and run RANSAC with the same threshold on the symmetric epipolar distance as in Yi18a, in order to keep the results comparable. We then compute the essential matrix with the standard (non-weighted) 8-point algorithm with the surviving inliers. We compare these results with those obtained directly from the weighted 8-point formulation, without further processing. As we will discuss in Section 6.3, RANSAC helps when applying trained models to unseen scenes.

6 Results

We first consider a toy example on fitting 2D lines with a large ratio of outliers. We then apply our method to digit classification on MNIST, thresholding the grayscale image and using the location of pixels as data points, as in Qi17a; Qi17b. These two experiments illustrate that our attentional method performs better than vanilla Context Normalization under the presence of outliers. We then apply our solution to wide-baseline stereo, and demonstrate that this increase in performance holds on challenging real-world applications. Finally, we perform an ablation study and evaluate the effect of supervising the weights used for attention in stereo.

6.1 Robust line fitting – Figure 2 and Table 1

To generate 2D points on a random line, as well as outliers, we first sample 2D points uniformly within a fixed range. We then select two points randomly, and fit a line that goes through them. With probability according to the desired inlier ratio, we then project each point onto the line to form inliers. We measure the error in terms of the distance between the estimated and ground truth values for the line parameters. The results are summarized in Table 1, with qualitative examples in Figure 2. ACNe consistently outperforms CNe Yi18a. Both methods break down at an 85-90% outlier ratio, while the performance of ACNe degrades more gracefully. As illustrated in Figure 2, our method learns to progressively focus on the inliers throughout the different layers of the network in order to weed out the outliers and reach a solution.
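The data generation procedure described above can be sketched as follows. The sampling range of [-1, 1], the function name, and the unit-norm convention for the line parameters are assumptions made for illustration.

```python
import numpy as np

def make_line_sample(num_points, inlier_ratio, rng, low=-1.0, high=1.0):
    """Sample points, draw a line through two of them, and project a random subset onto it."""
    pts = rng.uniform(low, high, size=(num_points, 2))
    i, j = rng.choice(num_points, size=2, replace=False)   # two distinct points define the line
    anchor = pts[i]
    direction = (pts[j] - pts[i]) / np.linalg.norm(pts[j] - pts[i])

    # Each point becomes an inlier with probability inlier_ratio.
    is_inlier = rng.random(num_points) < inlier_ratio
    projected = anchor + ((pts - anchor) @ direction)[:, None] * direction
    pts = np.where(is_inlier[:, None], projected, pts)

    # Line parameters theta such that [x, y, 1] . theta = 0, scaled to unit norm.
    normal = np.array([-direction[1], direction[0]])
    theta = np.array([normal[0], normal[1], -normal @ anchor])
    theta /= np.linalg.norm(theta)
    return pts, is_inlier, theta

rng = np.random.default_rng(0)
pts, is_inlier, theta = make_line_sample(num_points=200, inlier_ratio=0.3, rng=rng)
pts_h = np.concatenate([pts, np.ones((200, 1))], axis=1)
residuals = np.abs(pts_h @ theta)   # zero (up to rounding) for inliers
```

Because every sample is generated on the fly, training never revisits the same cloud, which matches the remark that the data is effectively infinite.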

Method          Outlier ratio
                60%      70%      80%      85%      90%
CNe Yi18a       .0039    .0051    .041     .147     .433
ACNe (Ours)     .0001    .0045    .033     .117     .389

Table 1: Robust line fitting – Line fitting results over the test set in terms of the distance (ignoring sign differences) between the ground-truth and the estimates.

Method          Outlier ratio
                0%      10%     20%     30%     40%     50%     60%
PointNet Qi17a  98.1    95.1    93.2    79.5    67.7    70.0    54.8
CNe Yi18a       98.0    95.8    94.0    91.0    90.1    87.7    87.2
ACNe (Ours)     98.3    97.2    96.5    95.3    94.7    94.3    93.7

Table 2: Point cloud classification – We report the classification accuracy on MNIST, under varying outlier ratios (%). Our approach performs best in all cases, and the gap becomes wider with more outliers – while CNe shows some robustness to noise, PointNet quickly breaks down.

6.2 Point cloud classification – Figure 3 and Table 2

We evaluate our approach on handwritten digit classification on MNIST, which consists of grayscale images. We create a point cloud from these images following the procedure of Qi17a: we threshold each image at 128 and use the coordinates – normalized to unit bounding box – of the surviving pixel locations as data samples. We subsample with replacement to 512 points so that all training samples have the same number of points. We further add a small Gaussian noise of 0.01 to the coordinates after sampling, following Qi17a. For outliers, we sample from a uniform random distribution. We compare our method against vanilla PointNet Qi17a and CNe Yi18a. For PointNet, we re-implemented their method under our framework to have an identical training setup. We do not apply the initial affine estimation, so that we can isolate the architectural differences between the methods – note that this module could also be added to our method. Table 2 summarizes the results in terms of classification accuracy; our method performs best, with the gap widening as the outlier ratio increases. Note that the results of PointNet are slightly different from what was reported in Qi17a, as we do not use the full training set, but split it into training and validation sets to perform early stopping. In addition, we run each model 10 times and report the average results.
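The conversion from an image to a noisy point cloud can be sketched as below. Several choices are assumptions: a single isotropic scale for the bounding-box normalization, 0.01 interpreted as the noise standard deviation, `outlier_ratio` interpreted as the fraction of the 512 points replaced by uniform samples, and a synthetic bar image standing in for an MNIST digit.

```python
import numpy as np

def image_to_cloud(img, rng, num_points=512, threshold=128, noise_std=0.01, outlier_ratio=0.0):
    """Turn a grayscale image into a fixed-size 2D point cloud with optional outliers."""
    rows, cols = np.nonzero(img >= threshold)             # surviving pixel locations
    coords = np.stack([cols, rows], axis=1).astype(np.float64)

    # Normalize to the unit bounding box (assumed: one scale for both axes).
    coords -= coords.min(axis=0)
    coords /= max(coords.max(), 1.0)

    idx = rng.integers(0, len(coords), size=num_points)   # sample with replacement
    cloud = coords[idx] + rng.normal(0.0, noise_std, size=(num_points, 2))

    # Replace a fraction of the points with uniform outliers (assumed definition).
    num_outliers = int(round(outlier_ratio * num_points))
    is_outlier = np.zeros(num_points, dtype=bool)
    is_outlier[:num_outliers] = True
    cloud[is_outlier] = rng.uniform(0.0, 1.0, size=(num_outliers, 2))
    return cloud, is_outlier

rng = np.random.default_rng(0)
img = np.zeros((28, 28), dtype=np.uint8)
img[6:22, 12:16] = 255                                    # a vertical bar, roughly a "1"
cloud, is_outlier = image_to_cloud(img, rng, outlier_ratio=0.3)
```

Sampling with replacement keeps the cloud size fixed regardless of how many pixels survive the threshold, which is what allows batching clouds of different digits together.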

6.3 Wide-baseline stereo – Figure 4 and Table 3

Wide-baseline stereo is an extremely challenging problem, due to the large number of variables to account for – viewpoint, scale, illumination, occlusions, imaging; see Figure 4. We benchmark the performance of our approach on a real-world dataset against multiple state-of-the-art baselines, following the data and protocols provided by Yi18a, where ground truth camera poses are obtained from Structure from Motion Wu13. We sample pairs of images and estimate the relative camera motion with different methods. The error is measured as the angular difference between the estimated and ground truth vectors for both rotation and translation, and summarized by mean Average Precision (mAP) over all image pairs on the test set, with an error threshold of 20 degrees.
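The angular error underlying the metric can be sketched as follows. The optional sign invariance is an assumption, and `accuracy_at` is a deliberately simplified summary (the fraction of pairs under the threshold), not the mAP protocol of Yi18a.

```python
import numpy as np

def angular_error_deg(v_est, v_gt, sign_invariant=False):
    """Angle in degrees between two vectors, e.g. translation directions."""
    v_est = np.asarray(v_est, dtype=np.float64)
    v_gt = np.asarray(v_gt, dtype=np.float64)
    c = float(np.dot(v_est, v_gt) / (np.linalg.norm(v_est) * np.linalg.norm(v_gt)))
    if sign_invariant:
        c = abs(c)                      # treat v and -v as the same direction (assumed option)
    return float(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))

def accuracy_at(errors_deg, threshold_deg=20.0):
    """Fraction of image pairs whose error is within the threshold (simplified proxy)."""
    return float(np.mean(np.asarray(errors_deg) <= threshold_deg))

err_orthogonal = angular_error_deg([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
err_flipped = angular_error_deg([0.0, 0.0, 1.0], [0.0, 0.0, -2.0], sign_invariant=True)
acc = accuracy_at([5.0, 15.0, 25.0, 40.0])
```

Clipping the cosine before `arccos` guards against rounding pushing it slightly outside [-1, 1] for nearly parallel vectors.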

We extract 2k keypoints with SIFT Lowe04, and determine an initial set of correspondences by nearest-neighbour matching with their corresponding descriptors. We consider the following methods: LMeds Rousseeuw84, RANSAC Fischler81, MLESAC Torr00, CNe Yi18a, and ACNe (ours). We also evaluate GMS Bian17, a semi-dense method based on a large (8k) number of ORB features; and DeMoN Ummenhofer17, a dense method trained on outdoors images with wide baselines. For CNe and ACNe, we consider the pose estimated with the weighted 8-point algorithm directly, as well as adding a post-processing step with RANSAC followed by the standard 8-point algorithm, as outlined in Section 5. Finally, in addition to SIFT, we consider two learned, state-of-the-art local features: SuperPoint DeTone17b and LF-Net Ono18.

Table 3: Pose estimation accuracy – mAP (in %) at a 20-degree error threshold averaged over the test data. Our method consistently outperforms all others by a significant margin.

Method                        Same     Unseen
SIFT+LMeds                    5.9      5.9
GMS                           14.4     14.4
DeMoN                         4.6      4.6
SIFT+CNe (weighted-8pt)       44.9     20.7
SIFT+CNe (RANSAC)             50.0     27.2
SIFT+ACNe (weighted-8pt)      64.3     28.7
SIFT+ACNe (RANSAC)            61.5     31.4
LF-Net+ACNe (weighted-8pt)    67.9     34.6
LF-Net+ACNe (RANSAC)          61.5     37.0
SPoint+ACNe (weighted-8pt)    70.9     37.7
SPoint+ACNe (RANSAC)          61.1     41.1

Quantitative results.

We use the following five sequences from Yi18a: Buckingham Palace (BP), Notre Dame (ND), Reichstag (RS), Sacre Coeur (SC), and St. Peter’s Square (SP). We train models on each of these five sequences, and report results both on that same sequence and over the remaining four. Due to space constraints, we summarize all of the results in Table 3; see the appendix for full results. We make four fundamental observations: (i) Our method consistently outperforms all of the baselines, including CNe. Taking the best variant of each, ACNe with SIFT reaches 64.3% mAP against 50.0% for its closest competitor, CNe, on the same scene, and 31.4% against 27.2% on unseen scenes. The margin between learned and traditional methods is dramatic, with ACNe roughly doubling the performance of the closest baseline (GMS) in its least favourable setting. (ii) Contrary to the findings of Yi18a, we observe that RANSAC may harm performance, particularly on known sequences, and even on unseen sequences – 2 out of 5 perform better without RANSAC. On unseen scenes, RANSAC post-processing raises the mAP from 20.7% to 27.2% for CNe, but only from 28.7% to 31.4% for ACNe – this means that our approach generalizes better and produces better estimates without auxiliary help. (iii) As observed by Yi et al. Yi18a, while DeMoN was trained on wide-baseline stereo pairs, it performs very poorly on this data. This suggests that dense methods have not yet bridged the gap with sparse methods for this problem. (iv) With modern local features, such as LF-Net or SuperPoint, we can further increase performance: with SuperPoint, the best result rises from 64.3% to 70.9% on the same sequence, and from 31.4% to 41.1% on unseen sequences.
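The relative gains discussed in these observations can be recomputed from the averages in Table 3. The short sketch below does so for RANSAC post-processing on unseen scenes with SIFT features; the helper name `relative_gain` is hypothetical.

```python
def relative_gain(new, old):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

# Average mAP values copied from Table 3 (unseen sequences, SIFT features).
cne_weighted, cne_ransac = 20.7, 27.2
acne_weighted, acne_ransac = 28.7, 31.4

gain_cne = relative_gain(cne_ransac, cne_weighted)
gain_acne = relative_gain(acne_ransac, acne_weighted)
print(f"RANSAC gain on unseen scenes: CNe {gain_cne:.1f}%, ACNe {gain_acne:.1f}%")
# prints "RANSAC gain on unseen scenes: CNe 31.4%, ACNe 9.4%"
```

The much smaller gain for ACNe is what supports the claim that it depends less on RANSAC to generalize.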

6.3.1 Ablation study – Table 4

Table 4: Ablation study – We consider different CNe Yi18a and ACNe (ours) variants on stereo. We report mAP at a 20-degree error threshold on the validation set of the Saint Peter’s Square sequence. The labels indicate: L – Local attention; G – Global attention; S – Supervision.

                CNe Yi18a         ACNe (Ours)
Method          w/ BN    w/ GN    L       G       L+G     L+G+S
Weighted-8pt    48.4     47.1     62.8    65.7    66.4    70.6
RANSAC          54.3     53.2     63.9    65.5    65.9    66.4

We perform an ablation study to evaluate the effect of the different types of attention, as well as the supervision on the local component of the attentive mechanism. (i) We confirm that CNe Yi18a performs better with Batch Normalization (BN) Ioffe15 than with Group Normalization (GN) Wu18 – we use GN for ACNe, as it seems to perform marginally better than BN alongside our attentive mechanism. (ii) We observe that our attentive mechanisms allow ACNe to outperform CNe, and that their combination outperforms their separate use. (iii) We demonstrate that applying supervision on the weights can boost performance further.

7 Conclusion

We have proposed Attentive Context Normalization (ACN), and used it to build Attentive Context Networks (ACNe) to solve problems on permutation-equivariant data. Our solution is inspired by IRLS, where one iteratively re-weights the importance of each sample via a soft inlier/outlier assignment. We demonstrated that by learning both local and global attention we are able to outperform state-of-the-art solutions on line fitting, handwritten digit classification, and wide-baseline stereo. Notably, our method thrives under large outlier ratios. An interesting future direction would be to incorporate ACN into general normalization techniques for deep learning, as all of them make use of statistical moments.


This work was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant “Deep Visual Geometry Machines” (RGPIN-2018-03788, DGECR-2018-00426), Google, and by systems supplied by Compute Canada.


Appendix A Full stereo results

                              Same sequence                            Unseen sequences
Method                        BP    ND    RS    SC    SP    Avg.       BP    ND    RS    SC    SP    Avg.
SIFT+LMeds                    5.3   3.9   8.7   8.5   3.3   5.9        6.1   6.4   5.2   5.3   6.6   5.9
SIFT+RANSAC                   7.2   8.0   10.1  16.1  7.7   9.8        10.4  10.3  9.7   8.2   10.3  9.8
SIFT+MLESAC                   4.7   2.0   9.6   6.7   2.8   5.2        5.3   6.0   4.0   4.8   5.7   5.2
GMS                           12.8  10.4  20.7  14.6  13.4  14.4       14.8  15.4  12.8  14.3  14.6  14.4
DeMoN                         5.2   2.4   10.1  2.3   3.3   4.6        4.5   5.2   3.3   5.2   5.0   4.6
SIFT+CNe (weighted-8pt)       31.9  33.1  51.0  60.8  47.7  44.9       24.3  21.0  13.8  15.0  29.3  20.7
SIFT+CNe (RANSAC)             35.4  34.7  64.4  62.4  53.2  50.0       26.1  30.1  22.2  26.9  30.6  27.2
SIFT+ACNe (weighted-8pt)      56.2  54.8  62.5  75.5  72.4  64.3       31.2  26.6  22.0  24.4  39.2  28.7
SIFT+ACNe (RANSAC)            50.3  47.8  67.8  73.1  68.2  61.5       33.3  29.5  26.0  28.7  39.3  31.4
LF-Net+ACNe (weighted-8pt)    73.2  49.4  66.8  72.8  77.5  67.9       42.8  30.0  27.8  27.0  45.3  34.6
LF-Net+ACNe (RANSAC)          63.4  40.2  68.8  65.8  68.1  61.2       41.6  32.7  33.6  32.3  45.0  37.0
SPoint+ACNe (weighted-8pt)    75.3  51.0  71.2  76.4  80.9  70.9       44.0  34.8  28.9  32.0  48.6  37.7
SPoint+ACNe (RANSAC)          63.2  41.1  64.9  68.0  68.2  61.1       42.9  37.4  36.3  42.6  46.2  41.1

Table 5: Pose estimation accuracy – mAP (in %) at a 20-degree error threshold over the test data. Our method consistently outperforms all others by a significant margin.

We train models on each of these five sequences, and report results both on that same sequence (the “Same sequence” columns), and over the remaining four sequences (averaged; the “Unseen sequences” columns). Results on all sequences are reported in Table 5. Our method performs best, with the model trained on SP providing the best generalization.