Learning Transformation Synchronization

01/27/2019 ∙ by Xiangru Huang, et al. ∙ The University of Texas at Austin

Reconstructing the 3D model of a physical object typically requires us to align the depth scans obtained from different camera poses into the same coordinate system. Solutions to this global alignment problem usually proceed in two steps. The first step estimates relative transformations between pairs of scans using an off-the-shelf technique. Due to limited information presented between pairs of scans, the resulting relative transformations are generally noisy. The second step then jointly optimizes the relative transformations among all input depth scans. A natural constraint used in this step is the cycle-consistency constraint, which allows us to prune incorrect relative transformations by detecting inconsistent cycles. The performance of such approaches, however, heavily relies on the quality of the input relative transformations. Instead of merely using the relative transformations as the input to perform transformation synchronization, we propose to use a neural network to learn the weights associated with each relative transformation. Our approach alternates between transformation synchronization using weighted relative transformations and predicting new weights of the input relative transformations using a neural network. We demonstrate the usefulness of this approach across a wide range of datasets.


1 Introduction

Transformation synchronization, i.e., estimating consistent rigid transformations across a collection of images or depth scans, is a fundamental problem in various computer vision applications, including multi-view structure from motion [5, 30, 39, 37], geometry reconstruction from depth scans [21, 9], image editing via solving jigsaw puzzles [8], simultaneous localization and mapping [4], and reassembling fractured surfaces [15], to name just a few. A common approach to transformation synchronization proceeds in two phases. The first phase establishes the relative rigid transformations between pairs of objects in isolation. Due to incomplete information presented in isolated pairs, the estimated relative transformations are usually quite noisy. The second phase improves the relative transformations by jointly optimizing them across all input objects. This is usually made possible by utilizing the so-called cycle-consistency constraint, which states that the composite transformation along every cycle should be the identity transformation, or, equivalently, that the data matrix that stores pairwise transformations in blocks is low-rank (c.f. [16]). The cycle-consistency constraint allows us to jointly improve relative transformations by either detecting inconsistent cycles [8] or performing low-rank matrix recovery [16, 38].

Figure 1: Reconstruction results from 30 RGBD images of an indoor environment using different transformation synchronization methods. (a) Our approach. (b) Rotation Averaging [6]. (c) Geometric Registration [10]. (d) Ground Truth.

However, the success of existing transformation synchronization [38, 5, 2, 19] and more general map synchronization [16, 32, 31, 7, 34, 19] techniques heavily depends on the alignment between the loss function and the noise pattern of the input data. For example, approaches based on robust norms (e.g., the L1 norm [16, 7]) can tolerate either a constant fraction of adversarial noise (c.f. [16, 19]) or a sub-linear outlier ratio when the noise is independent (c.f. [7, 34]). Such assumptions, unfortunately, deviate from many practical settings, where the majority of the input relative transformations may be incorrect (e.g., when the input scans are noisy), and/or the noise pattern in relative transformations is highly correlated (there are a quadratic number of measurements from a linear number of sources). This motivates us to consider the problem of learning transformation synchronization, which seeks to learn a suitable loss function that is compatible with the noise pattern of specific datasets.

In this paper, we introduce an approach that formulates transformation synchronization as an end-to-end neural network. Our approach is motivated by reweighted least squares and its application to transformation synchronization (c.f. [5, 2, 10, 19]), where the loss function dictates how we update the weight associated with each input relative transformation during the synchronization process. Specifically, we design a recurrent neural network that reflects this reweighting scheme. By learning the weights from data directly, our approach implicitly captures a suitable loss function for performing transformation synchronization.

Figure 2: Illustration of our network design.

We have evaluated the proposed technique on two real datasets: Redwood [11] and ScanNet [12]. Experimental results show that our approach leads to considerable improvements over state-of-the-art transformation synchronization techniques. On both Redwood and ScanNet, the mean angular rotation errors of our approach are substantially lower than those of the best combination of existing pairwise matching and transformation synchronization techniques. We also perform an ablation study to evaluate the effectiveness of our approach.

Code is publicly available at https://github.com/xiangruhuang/Learning2Sync.

2 Related Work

Existing techniques for transformation synchronization fall into two categories. The first category of methods [21, 15, 40, 29, 43] uses combinatorial optimization to select a subgraph that only contains consistent cycles. The second category of methods [38, 25, 18, 16, 17, 7, 44, 34, 26, 20] can be viewed from the perspective that there is an equivalence between cycle-consistent transformations and the fact that the map collection matrix that stores relative transformations in blocks is semidefinite and/or low-rank (c.f. [16]). These methods formulate transformation synchronization as low-rank matrix recovery, where the input relative transformations are considered noisy measurements of this low-rank matrix. In the literature, people have proposed convex optimization [38, 16, 17, 7], non-convex optimization [5, 44, 26, 20], and spectral techniques [25, 18, 32, 31, 34, 36] for solving various low-rank matrix recovery formulations. Compared with the first category of methods, the second category is computationally more efficient. Moreover, tight exact recovery conditions have been established for many of these methods.

A message from these exact recovery conditions is that existing methods only work if the fraction of noise in the input relative transformations is below a threshold, whose magnitude depends on the noise pattern. Existing results either assume adversarial noise [16, 20] or independent random noise [38, 7, 34, 3]. However, as relative transformations are computed between pairs of objects, these relative transformations are dependent (i.e., they relate the same source object to different target objects). This means there is a lot of structure in the noise pattern of relative transformations. Our approach addresses this issue by optimizing transformation synchronization techniques to fit the data distribution of a particular dataset. To the best of our knowledge, this work is the first to apply supervised learning to the problem of transformation synchronization.

Our approach is also related to work that utilizes recurrent neural networks for solving the pairwise matching problem. Recent examples include learning correspondences between pairs of images [28], predicting the fundamental matrix between two different images of the same underlying environment [33], and computing a dense image flow between an image pair [24]. We study a different problem, transformation synchronization, in this paper. In particular, our weighting module leverages problem-specific features (e.g., eigen-gap) for determining the weights associated with relative transformations. Learning transformation synchronization also poses great challenges in making the network trainable end-to-end.

3 Problem Statement and Approach Overview

In this section, we describe the problem statement of transformation synchronization (Section 3.1) and present an overview of our approach (Section 3.2).

3.1 Problem Statement

Consider n input scans S_1, …, S_n capturing the same underlying object/scene from different camera poses. Let Σ_i denote the local coordinate system associated with S_i. The input to transformation synchronization can be described as a model graph G = (V, E) [22]. Each edge (i, j) ∈ E of the model graph is associated with a relative transformation T_ij^in = (R_ij^in, t_ij^in), where R_ij^in ∈ SO(3) and t_ij^in ∈ R^3 are the rotational and translational components of T_ij^in, respectively. T_ij^in is usually pre-computed using an off-the-shelf algorithm (e.g., [27, 41]). For simplicity, we impose the assumption that (i, j) ∈ E if and only if (i) (j, i) ∈ E, and (ii) their associated transformations are compatible, i.e.,

T_ji^in = (T_ij^in)^{-1}.

It is expected that many of these relative transformations are incorrect, due to limited information presented between pairs of scans and limitations of the off-the-shelf method being used. The goal of transformation synchronization is to recover the absolute pose T_i = (R_i, t_i) of each scan S_i in a world coordinate system Σ. Without loss of generality, we assume the world coordinate system is given by Σ = Σ_1. Note that unlike traditional transformation synchronization approaches that merely use the T_ij^in (e.g., [5, 38, 2]), our approach also incorporates additional information extracted from the input scans S_1, …, S_n.
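To make the edge-compatibility condition concrete, here is a minimal sketch (our illustration, not code from the paper) that represents a rigid transformation as a rotation-translation pair and checks T_ji = T_ij^{-1}:

```python
import numpy as np

def inverse(T):
    """Inverse of a rigid transformation T = (R, t): (R^T, -R^T t)."""
    R, t = T
    return R.T, -R.T @ t

def compose(T2, T1):
    """Composition T2 o T1: first apply T1, then T2."""
    R2, t2 = T2
    R1, t1 = T1
    return R2 @ R1, R2 @ t1 + t2

def is_compatible(T_ij, T_ji, tol=1e-8):
    """Check the edge-compatibility condition T_ji = T_ij^{-1}."""
    R_inv, t_inv = inverse(T_ij)
    R_ji, t_ji = T_ji
    return np.allclose(R_ji, R_inv, atol=tol) and np.allclose(t_ji, t_inv, atol=tol)
```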

3.2 Approach Overview

Our approach is motivated by iteratively reweighted least squares (or IRLS) [13], which has been applied to transformation synchronization (e.g., [5, 2, 10, 19]). The key idea of IRLS is to maintain an edge weight w_ij for each input transformation T_ij^in so that the objective function becomes quadratic in the variables and transformation synchronization admits a closed-form solution. One can then use the closed-form solution to update the edge weights. Under a special weighting scheme (c.f. [19]), it has been shown that when the fraction of incorrect measurements is below a constant, the weights associated with these incorrect measurements eventually become zero. One way to understand reweighting schemes is that once the weights have converged, the reweighted square loss becomes the actual robust loss function that is used to solve the corresponding transformation synchronization problem. In contrast to using a generic weighting scheme, we propose to learn the weighting scheme from data by designing a recurrent network that replicates the reweighted transformation synchronization procedure. By doing so, we implicitly learn a suitable loss function for transformation synchronization.
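As a self-contained illustration of this IRLS idea (a toy example, not the paper's algorithm), the sketch below robustly estimates a single scalar from measurements containing an outlier; the weighted closed-form solve and the weight update play the same roles as the synchronization layer and the weighting module:

```python
import numpy as np

def irls_mean(measurements, num_iters=10, eps=1e-6):
    """Robust location estimate via IRLS with an L1-type weighting function."""
    x = np.mean(measurements)  # initial closed-form solve (all weights equal)
    for _ in range(num_iters):
        residuals = np.abs(measurements - x)
        w = 1.0 / np.maximum(residuals, eps)      # large residual -> small weight
        x = np.sum(w * measurements) / np.sum(w)  # weighted closed-form solve
    return x

data = np.array([1.0, 1.1, 0.9, 1.05, 10.0])  # the last measurement is an outlier
print(irls_mean(data))  # close to 1.0; the outlier's weight shrinks toward zero
```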

As illustrated in Figure 2, the proposed recurrent module combines a synchronization layer and a weighting module. At the k-th iteration, the synchronization layer takes as input the relative transformations T_ij^in and their associated weights w_ij^(k) and outputs synchronized poses T_i^(k) for the input objects. Initially, we set w_ij^(1) = 1 for every edge (i, j) ∈ E. The technical details of the synchronization layer are described in Section 4.1.

The weighting module operates on each object pair in isolation. For each edge (i, j) ∈ E, the input to the proposed weighting module consists of (1) the input relative transformation T_ij^in, (2) the relative transformation T_ij^(k) induced by the synchronized poses at the k-th iteration, (3) features extracted from the initial alignment of the two input scans, and (4) a status vector s^(k) that collects global signals from the synchronization layer at the k-th iteration (e.g., spectral gap). The output is the associated weight w_ij^(k+1) used at the next iteration.

The network is trained end-to-end by penalizing the differences between the ground-truth poses and the output of the last synchronization layer. The technical details of this end-to-end training procedure are described in Section 4.3.

4 Approach

In this section, we introduce the technical details of our learning transformation synchronization approach. In Section 4.1, we introduce details of the synchronization layer. In Section 4.2, we describe the weighting module. Finally, we show how to train the proposed network end-to-end in Section 4.3. Note that the proofs of the propositions introduced in this section are deferred to the supplemental material.

4.1 Synchronization Layer

For simplicity, we drop the iteration superscripts when introducing the synchronization layer. Let T_ij = (R_ij, t_ij) and w_ij be the input relative transformation and weight associated with the edge (i, j) ∈ E. We assume that this weighted graph is connected. The goal of the synchronization layer is to compute the synchronized pose T_i = (R_i, t_i) associated with each scan S_i. Note that a correct relative transformation T_ij induces two separate constraints on the rotations and translations, respectively:

R_j = R_ij R_i,    t_j = R_ij t_i + t_ij.

We thus perform rotation synchronization and translation synchronization separately.

Our rotation synchronization approach adapts the spectral rotation synchronization approach described in [1]. Specifically, we consider the following optimization problem for rotation synchronization:

min_{R_1, …, R_n ∈ SO(3)} Σ_{(i,j) ∈ E} w_ij ‖R_j − R_ij R_i‖_F².    (1)

Solving (1) exactly is difficult. We propose to first relax the constraint R_i ∈ SO(3) to R_i ∈ R^{3×3} when solving (1) and then project each block of the resulting solution onto SO(3). This leads to the following procedure for rotation synchronization. More precisely, we introduce a connection Laplacian L ∈ R^{3n×3n} [35], whose 3×3 blocks are given by

L_ii = (Σ_{j ∈ N(i)} w_ij) I_3,    L_ij = −w_ij R_ij^T for (i, j) ∈ E,    L_ij = 0 otherwise,    (2)

where N(i) collects all neighboring vertices of i in G.

function SYNC({(T_ij, w_ij)})
     Form the connection Laplacian L and vector b;
     Compute the first 3 eigenvectors U of L;
     Perform SVD on the 3×3 blocks of U to obtain R_i via (3);
     Solve (6) to obtain t_i;
     return {T_i = (R_i, t_i)};
end function
Algorithm 1 Synchronization Layer.

Let U ∈ R^{3n×3} collect the eigenvectors of L that correspond to the three smallest eigenvalues, and let U_i ∈ R^{3×3} denote its i-th block. We choose the sign of each eigenvector so that the blocks U_i are predominantly orientation-preserving (i.e., have positive determinant). To compute the absolute rotations, we first perform singular value decomposition (SVD) on each block, U_i = V_i Σ_i W_i^T. We then output the corresponding absolute rotation estimate as

R_i = V_i W_i^T.    (3)

The following proposition states that although the R_i do not exactly optimize (1), they still provide effective synchronized rotations due to the following robust recovery property:

Proposition 1.

(Informal) Consider the ground-truth rotations R_1^gt, …, R_n^gt. Suppose the weighted residuals between the input relative rotations and the ground-truth relative rotations satisfy the perturbation bound

(4)

Then the R_i in (3) approximately recover the ground-truth rotations R_i^gt, where the estimation error on each R_i is collected in a corresponding error matrix. When the constraints in (4) are exact, i.e., the perturbation vanishes, the recovery is also exact.

In other words, if the weighting module sets the weights of outlier relative transformations to zero, then (3) approximately recovers the underlying rotations.
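A minimal sketch of this spectral procedure, assuming the constraint convention R_j = R_ij R_i and the block structure in (2); the determinant correction in the projection step is our own safeguard against reflections:

```python
import numpy as np

def rotation_synchronization(n, edges):
    """edges: dict mapping (i, j) with i < j to (R_ij, w_ij).
    Returns synchronized rotations R_1, ..., R_n (up to a global rotation)."""
    L = np.zeros((3 * n, 3 * n))
    for (i, j), (R_ij, w) in edges.items():
        L[3*i:3*i+3, 3*i:3*i+3] += w * np.eye(3)   # diagonal blocks per (2)
        L[3*j:3*j+3, 3*j:3*j+3] += w * np.eye(3)
        L[3*i:3*i+3, 3*j:3*j+3] -= w * R_ij.T      # off-diagonal blocks per (2)
        L[3*j:3*j+3, 3*i:3*i+3] -= w * R_ij
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :3]  # eigenvectors of the three smallest eigenvalues
    rotations = []
    for i in range(n):
        V, _, Wt = np.linalg.svd(U[3*i:3*i+3, :])  # project each block onto SO(3), cf. (3)
        D = np.diag([1.0, 1.0, np.linalg.det(V @ Wt)])
        rotations.append(V @ D @ Wt)
    return rotations
```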

Translation synchronization solves the following least-squares problem to obtain the t_i:

min_{t_1, …, t_n} Σ_{(i,j) ∈ E} w_ij ‖R_ij t_i + t_ij − t_j‖².    (5)

Let t = (t_1^T, …, t_n^T)^T collect the translation components of the synchronized poses in a column vector. Introduce a column vector b = (b_1^T, …, b_n^T)^T where

b_i = −Σ_{j ∈ N(i)} w_ij R_ij^T t_ij.

Then an optimal solution¹ to (5) is given by

t = L^+ b.    (6)

¹When L is positive definite, the solution is unique; in general, (6) gives one optimal solution.

Similar to the case of rotation synchronization, we have the following robust recovery property:

Proposition 2.

(Informal) Consider the underlying synchronized poses T_i^gt = (R_i^gt, t_i^gt). Suppose the weighted residuals between the input relative translations and the ground-truth relative translations satisfy the perturbation bound

(7)

Then t in (6) approximately recovers the underlying translations t_i^gt. In particular, when the constraints in (7) are exact, the recovery is also exact.

Similar to the case of rotation synchronization, if the weighting module sets the weights of outlier relative transformations to zero, then (6) approximately recovers the underlying translations.
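Under the same convention (R_ij t_i + t_ij ≈ t_j), a sketch of the linear solve in (5)-(6); it reuses the connection Laplacian L built in the rotation sketch above, and the right-hand side b is our reconstruction of the dropped formula:

```python
import numpy as np

def translation_synchronization(n, edges, L):
    """Closed-form solution of (5): t = L^+ b.

    edges: dict mapping (i, j) with i < j to (R_ij, t_ij, w_ij).
    L: the connection Laplacian assembled from the same weights and rotations."""
    b = np.zeros(3 * n)
    for (i, j), (R_ij, t_ij, w) in edges.items():
        b[3*i:3*i+3] -= w * R_ij.T @ t_ij
        b[3*j:3*j+3] += w * t_ij
    t = np.linalg.pinv(L) @ b  # pseudo-inverse handles the global-shift degeneracy, cf. (6)
    return t.reshape(n, 3)
```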

Figure 3: Illustration of the network design of the weighting module. We first compute k-nearest-neighbor distances between a pair of depth images, which form the images (shown as heat maps) in the middle. We then apply a classic convolutional neural network to output a score between 0 and 1, which is then combined with the status vector to produce the weight of this relative pose according to (13).

4.2 Weighting Module

We define the weighting module as the following function:

w_ij^(k+1) = f_θ(S_i, S_j, T_ij^in, s^(k)),    (8)

where the input consists of (i) the pair of scans S_i and S_j, (ii) the input relative transformation T_ij^in between them, and (iii) the status vector s^(k). The output of this weighting module is the new weight w_ij^(k+1) at the next iteration. With θ we denote the trainable weights of the weighting module. In the following, we first introduce the definition of the status vector s^(k).

Status vector. The purpose of the status vector is to collect additional signals that are useful for determining the output of the weighting module. It consists of four quantities, defined in (9)-(12):

(9)
(10)
(11)
(12)

Essentially, (9) and (10) characterize the difference between the current synchronized transformations and the input relative transformations. The motivation for using them comes from the fact that in a standard reweighting scheme for transformation synchronization (c.f. [19]), one simply sets the weight of each edge to a weighting function of this residual (c.f. [13]). Such a scheme can already recover the underlying ground truth in the presence of a constant fraction of adversarial incorrect relative transformations (please refer to the supplemental material for a rigorous analysis). In contrast, our approach seeks to go beyond this limit by leveraging additional information. The definition of (11) comes from Prop. 1, and (12) equals the residual of (5). Intuitively, when both (11) and (12) are small, the weighted relative transformations will be consistent, from which we can recover accurate synchronized transformations. We now describe the network design.

Network design. As shown in Figure 3, the key component of our network design is a sub-network that takes two scans S_i and S_j and a relative transformation T_ij between them and outputs a score in [0, 1] that indicates whether this is a good scan alignment or not, i.e., 1 means a good alignment and 0 means an incorrect alignment.

We design this sub-network as a feed-forward network. Its input consists of two color maps that characterize the alignment patterns between the two input scans. The value of each pixel represents the distance of the corresponding 3D point to the closest points on the other scan under T_ij (see the second column of images in Figure 3). We then concatenate these two color images and feed them into a neural network (we used a modified AlexNet architecture), which outputs the final score.
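A sketch of how one such per-pixel distance image could be computed for a scan pair, using a k-d tree for the nearest-neighbor queries (our illustration; the paper's exact feature extraction may differ):

```python
import numpy as np
from scipy.spatial import cKDTree

def distance_image(points_src, points_tgt, R_ij, t_ij, hw):
    """Per-pixel distance of transformed source points to the closest target points.

    points_src: (H*W, 3) 3D points of the source depth image, in row-major pixel order.
    points_tgt: (M, 3) 3D points of the target scan.
    (R_ij, t_ij): relative transformation mapping the source into the target frame.
    hw: (H, W) shape used to reshape the distances back into an image."""
    transformed = points_src @ R_ij.T + t_ij           # apply T_ij to every source point
    dists, _ = cKDTree(points_tgt).query(transformed)  # nearest-neighbor distances
    return dists.reshape(hw)                           # heat-map image fed to the CNN
```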

With this setup, we define the output weight as

(13)

Here (13) adopts the form of a traditional reweighting function; its learned coefficients encode the importance of the individual elements of the status vector. With θ we collect all trainable parameters of (13).
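Since the exact form of (13) is not reproduced above, the following is only a plausible sketch of such a combination: an IRLS-style reweighting term gated by the learned score and modulated by learned importance coefficients for the status-vector entries (the functional form and all names are our assumptions):

```python
import numpy as np

def edge_weight(score, residual, status, coeffs, eps=1e-6):
    """Hypothetical reweighting combining the CNN score with the status vector."""
    irls_term = 1.0 / np.maximum(residual, eps)  # traditional reweighting form
    gate = score * np.exp(-coeffs @ status)      # learned importance of status entries
    return gate * irls_term
```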

4.3 End-to-End Training

Let D denote a dataset of scan collections with annotated ground-truth poses. Let K be the number of recurrent steps (we used four recurrent steps in our experiments). We define the following loss function for training the weighting module:

(14)

where the trade-off constant between the rotation and translation terms is fixed in all of our experiments. Note that we compare relative rotations to factor out the global orientation among the poses. The global shift in translation is already handled by (6).
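Since (14) itself is not reproduced above, the following sketch only implements the described behavior: it penalizes deviations of predicted relative rotations from the ground-truth relative rotations (factoring out the global orientation), plus a weighted translation term; the trade-off weight lam and the exact residual forms are our assumptions:

```python
import numpy as np

def sync_loss(R_pred, t_pred, R_gt, t_gt, lam=1.0):
    """Pose loss that compares relative rotations (invariant to a global rotation)."""
    n = len(R_pred)
    loss = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            rel_pred = R_pred[i] @ R_pred[j].T  # relative rotation between poses i and j
            rel_gt = R_gt[i] @ R_gt[j].T
            loss += np.sum((rel_pred - rel_gt) ** 2)
            # translation term; the global shift is assumed handled as in (6)
            loss += lam * np.sum(((t_pred[i] - t_pred[j]) - (t_gt[i] - t_gt[j])) ** 2)
    return loss
```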

We perform back-propagation to optimize (14). The technical challenges lie in computing the derivatives that pass through the synchronization layer, including 1) the derivatives of the synchronized rotations R_i with respect to the elements of the connection Laplacian L, 2) the derivatives of the synchronized translations t with respect to the elements of L and b, and 3) the derivatives of each status vector with respect to the elements of L and b. In the following, we provide explicit expressions for computing these derivatives.

We first present the derivative between the output of rotation synchronization and its input. To keep the notation uncluttered, we compute the derivative by treating the output as a matrix function of L. The derivative with respect to the weights w_ij can then be easily obtained via the chain rule.

Proposition 3.

Consider the setup of Prop. 1. Let u_k and λ_k be the k-th eigenvector and eigenvalue of L, and expand the SVD of each block U_i as in (3). The derivatives of the R_i with respect to the elements of L then admit closed-form expressions in terms of these eigenpairs and the canonical basis vectors of R^{3n}.

The following proposition specifies the derivative of t with respect to the elements of L and b:

Proposition 4.

The derivatives of t = L^+ b follow from the standard perturbation identity: when L is invertible, a perturbation (dL, db) induces dt = L^{-1}(db − (dL) t).

Regarding the status vectors, the derivatives of (11) with respect to the elements of L are given by Prop. 3; the derivatives of (9) and (10) with respect to the elements of L and b are given by Prop. 4. It remains to compute the derivatives of the spectral quantities in the status vector with respect to the elements of L, which can be easily obtained via the derivatives of the eigenvalues of L [23], i.e., dλ_k = u_k^T (dL) u_k.
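The eigenvalue-derivative identity dλ_k = u_k^T (dL) u_k can be verified numerically with finite differences; a quick self-contained check on a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6)); L = A + A.T      # random symmetric matrix
dA = rng.standard_normal((6, 6)); dL = dA + dA.T  # symmetric perturbation direction
eps = 1e-6

vals, vecs = np.linalg.eigh(L)
vals_eps, _ = np.linalg.eigh(L + eps * dL)

k = 0  # smallest eigenvalue (assumed simple for this check)
analytic = vecs[:, k] @ dL @ vecs[:, k]
numeric = (vals_eps[k] - vals[k]) / eps
print(analytic, numeric)  # the two values should agree to ~1e-5
```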

5 Experimental Results

This section presents an experimental evaluation of the proposed learning transformation synchronization approach. We begin by describing the experimental setup in Section 5.1. In Section 5.2, we analyze the results of our approach and compare it against baseline approaches. Finally, we present an ablation study in Section 5.3.

Methods | Redwood (Rotation Error; Translation Error) | ScanNet (Rotation Error; Translation Error)
For each dataset, the Rotation Error columns give the percentage of pairs below a sequence of angular thresholds followed by the mean, and the Translation Error columns give the percentage of pairs below 0.05, 0.1, 0.25, 0.5, and 0.75 followed by the mean.
FastGR (all) 29.4 40.2 52.0 63.8 70.4 22.0 39.6 53.0 60.3 67.0 0.68 9.9 16.8 23.5 31.9 38.4 5.5 13.3 22.0 29.0 36.3 1.67
FastGR (good) 33.9 45.2 57.2 67.4 73.2 26.7 45.7 58.8 65.9 71.4 0.59 12.4 21.4 29.5 38.6 45.1 7.7 17.6 28.2 36.2 43.4 1.43
Super4PCS (all) 6.9 10.1 16.7 39.6 52.3 4.2 8.9 18.2 31.0 43.5 1.14 0.5 1.3 4.0 17.4 25.2 0.3 1.2 5.3 13.3 21.6 2.11
Super4PCS (good) 10.3 14.9 23.9 48.0 60.0 6.4 13.3 26.2 41.2 53.2 0.93 0.8 2.3 6.4 23.0 31.7 0.6 2.2 8.9 19.5 29.5 1.80
RotAvg (FastGR) 30.4 42.6 59.4 74.4 82.1 23.3 43.2 61.8 72.4 80.7 0.42 6.0 10.4 17.3 36.1 46.1 3.7 9.2 19.5 34.0 45.6 1.26
multi-FastGR (FastGR) 17.8 28.7 47.5 74.2 83.2 4.9 18.4 50.2 72.6 81.4 0.93 0.2 0.6 2.8 16.4 27.1 0.1 0.7 4.8 16.4 28.4 1.80
RotAvg (Super4PCS) 5.4 8.7 17.4 45.1 59.2 3.2 7.4 17.0 32.3 46.3 0.95 0.3 0.8 3.0 15.4 23.3 0.2 1.0 5.8 16.5 27.6 1.70
multi-FastGR (Super4PCS) 2.1 4.1 10.2 33.1 48.3 1.1 3.1 10.3 21.5 31.8 1.25 1.9 5.1 13.9 36.6 47.1 0.4 2.1 9.8 23.2 34.5 1.82
Our Approach (FastGR) 67.5 77.5 85.6 91.7 94.4 20.7 40.0 70.9 88.6 94.0 0.26 34.4 41.1 49.0 58.9 62.3 42.9 2.0 7.3 22.3 36.9 48.1 1.16
Our Approach (Super4PCS) 2.3 5.1 13.2 42.5 60.9 1.1 4.0 13.8 29.0 42.3 1.02 0.4 1.7 6.8 29.6 43.5 0.1 0.8 5.6 16.6 27.0 1.90
Transf. Sync. (FastGR) 27.1 37.7 56.9 74.4 82.4 17.4 34.4 55.9 70.4 81.3 0.43 3.2 6.5 14.6 35.8 47.4 1.6 5.6 15.5 30.9 43.4 1.31
Input Only (FastGR) 36.7 51.4 68.1 87.7 91.7 25.1 49.3 73.2 86.4 91.6 0.26 11.7 19.4 30.5 50.7 57.7 5.9 15.4 30.5 43.7 52.2 1.03
No Recurrent (FastGR) 37.8 52.8 71.1 87.7 91.7 26.3 51.1 77.3 87.1 92.0 0.24 8.6 15.3 26.9 51.4 58.2 3.9 11.1 27.3 43.7 53.9 1.01
Figure 4: Benchmark evaluations on Redwood [11] and ScanNet [12]. The quality of the absolute poses is evaluated by computing errors to pairwise ground-truth poses. Angular distances between rotation matrices are measured by the rotation angle of the relative rotation; translation distances are measured by the Euclidean norm. We collect statistics on the percentages of rotation and translation errors that fall below a varying threshold. I) The first four rows contain evaluations of the upstream algorithms; (all) refers to statistics over all pairs, while (good) refers to statistics computed over relative poses whose overlap regions are of good quality with respect to our pretrained weighting module. II) The second part reports the results of all baselines computed from this good set of relative poses, which is consistently better than the results from all relative poses. Since there are two input methods, we report the results of each transformation synchronization approach on both inputs. III) The third part contains the ablation study, performed only on FastGR [41] inputs; its first row reports state-of-the-art rotation and translation synchronization results, followed by variants of our approach.

5.1 Experimental Setup

Datasets. We consider two datasets in this paper, Redwood [11] and ScanNet [12]:

  • Redwood contains RGBD sequences of individual objects. We uniformly sample 60 sequences. From each sequence, we sample 30 RGBD images, each 20 frames away from the next, covering 600 frames of the original sequence. For experimental evaluation, we use the poses associated with the reconstruction as the ground truth. We use 35 sequences for training and 25 sequences for testing. Note that the temporal order among the frames in each sequence is discarded in our experiments.

  • ScanNet contains RGBD sequences, as well as reconstructions and camera poses, for 706 indoor scenes. Each scene contains 2-3 sequences along different trajectories. We randomly sample 100 sequences from ScanNet, using 70 for training and 30 for testing. Again, the temporal order among the frames in each sequence is discarded in our experiments.

More details about the sampled sequences are given in the appendix.

Pairwise methods. We consider two state-of-the-art pairwise methods for generating the input to our approach:

  • Super4PCS [27] applies sampling to find consistent matches of four point pairs.

  • FastGR [41] utilizes feature correspondences and applies reweighted non-linear least squares to extract a set of consistent feature correspondences and fit a rigid pose. We used the Open3D implementation [42].

Baseline approaches. We consider the following baseline approaches that are introduced in the literature for transformation synchronization:

  • Robust Relative Rotation Averaging (RotAvg) [6] is a scalable algorithm that performs robust averaging of relative rotations. To recover translations, we additionally apply a state-of-the-art translation synchronization approach [19]. We use the default settings of the publicly available code of [6]; for [19], we use our own MATLAB implementation.

  • Geometric Registration [10] solves multi-way registration via pose graph optimization. We modify the Open3D implementation to take inputs from Super4PCS or FastGR.

Note that our approach utilizes a weighting module to score the input relative transformations. To make the comparisons fair, we apply our pretrained weighting module to filter out all input transformations whose associated scores fall below a threshold. We then feed the filtered input transformations to each baseline approach for experimental evaluation.

Evaluation protocol. We employ the evaluation protocols of [5] and [19] for evaluating rotation synchronization and translation synchronization, respectively. Specifically, for rotations, we first solve for the best matching global rotation between the ground truth and the prediction; we then report the cumulative distribution function (CDF) of the angular deviation between each predicted rotation and its corresponding ground truth. For translations, we report the CDF of the Euclidean distance between each predicted translation and its corresponding ground truth.
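The angular deviation used in this protocol is the rotation angle of R_1^T R_2; a sketch of the error computation and the CDF statistics reported in Figure 4 (the threshold grid below is illustrative):

```python
import numpy as np

def angular_deviation_deg(R1, R2):
    """Rotation angle (degrees) of R1^T R2, i.e., the geodesic distance on SO(3)."""
    cos_theta = (np.trace(R1.T @ R2) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def error_cdf(errors, thresholds):
    """Fraction of errors below each threshold (the CDF values reported in the tables)."""
    errors = np.asarray(errors)
    return [float(np.mean(errors <= th)) for th in thresholds]

print(error_cdf([2.0, 7.5, 40.0], thresholds=[3, 5, 10, 30, 45]))
# -> [0.33, 0.33, 0.67, 0.67, 1.0]
```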

Figure 5: For each block, we show the results of our approach (row I), Rotation Averaging [6] + Translation Sync. [19] (row II), Geometric Registration [10] (row III), and Ground Truth (row IV). The left four scenes are from Redwood [11] and the right two scenes are from ScanNet [12].

5.2 Analysis of Results

Figure 4 and Figure 5 present quantitative and qualitative results, respectively. Overall, our approach yields fairly accurate results, and on both Redwood and ScanNet it leads to salient improvements over the input relative transformations computed from FastGR. The final results of our approach on ScanNet are less accurate than those on Redwood. Besides the fact that the quality of the initial relative transformations is lower on ScanNet than on Redwood, another factor is that the depth scans from ScanNet are quite noisy, leading to noisy input (and thus fewer signals) for the weighting module. Still, the improvements of our approach on ScanNet are salient.

Our approach still requires reasonable initial transformations to begin with. This can be understood from the fact that our approach performs synchronization by selecting a subset of the input relative transformations. Although our approach utilizes learning, its performance will degrade when the quality of the initial relative transformations drops. One piece of evidence is that our approach only leads to modest performance gains when taking the output of Super4PCS as input.

Comparison with state-of-the-art approaches. Although both baseline approaches improve upon the input relative transformations, our approach exhibits significant further improvements over all baseline approaches. On Redwood, our approach significantly reduces the mean rotation and translation errors relative to the top-performing baseline, RotAvg, from FastGR input; the reductions in mean errors on ScanNet are also noticeable.

Our approach also achieves relative performance gains over the baseline approaches when taking the output of Super4PCS as input, particularly in mean rotation errors on both Redwood and ScanNet.

When comparing rotations and translations, the improvements in mean rotation errors are larger than those in mean translation errors. One explanation is that there are many planar structures in Redwood and ScanNet. When aligning such planar structures, rotation errors easily lead to large changes in nearest-neighbor distances and thus can be detected by our weighting module. In contrast, translation errors suffer from the gliding effect on planar structures (c.f. [14]), and our weighting module becomes less effective at detecting them.

5.3 Ablation Study

In this section, we present two variants of our learning transformation synchronization approach to justify the usefulness of each component of our system. Due to space constraints, we perform the ablation study only using FastGR input.

Input only.

In the first experiment, we simply learn to classify the input maps and then apply transformation synchronization techniques to the filtered input transformations. In this setting, state-of-the-art transformation synchronization techniques still leave a gap to our full approach on both Redwood and ScanNet. By applying our learning approach with fixed initial map weights (i.e., the weights produced by the weighting module are frozen during synchronization), our approach reduces the mean errors on both datasets. Although the improvements are noticeable, there are still gaps between this reduced approach and our full approach. This justifies the importance of learning the weighting module jointly with synchronization.

No recurrent module. Another reduced approach is to directly combine the weighting module with a single synchronization layer. Although this approach improves upon the input transformations, there is still a big gap between it and our full approach (see the last row in Figure 4). This shows the importance of using the weighting module to gradually reduce the error while simultaneously making the entire procedure trainable end-to-end.

6 Conclusions

In this paper, we have introduced a supervised transformation synchronization approach. It modifies a reweighted non-linear least-squares approach and applies a neural network to automatically determine the relevance of the input pairwise transformations and their associated weights. We have shown how to train the resulting recurrent neural network end-to-end. Experimental results show that our approach is superior to state-of-the-art transformation synchronization techniques on ScanNet and Redwood for two state-of-the-art pairwise scan matching methods.

There are ample opportunities for future research. So far we have only considered classifying pairwise transformations; it would be interesting to study how to classify high-order matches. Another interesting direction is to incorporate ICP alignment into our recurrent procedure, i.e., starting from the current synchronized poses and performing ICP between pairs of scans to obtain more signals for transformation synchronization. Moreover, instead of maintaining one synchronized pose per scan, we could maintain multiple synchronized poses, which would offer more pairwise matches between pairs of scans for evaluation. Finally, we would like to apply our approach to synchronizing dense correspondences across multiple images/shapes.

Acknowledgement: The authors wish to acknowledge the support of NSF grants DMS-1546206, DMS-1700234, CHS-1528025, a DoD Vannevar Bush Faculty Fellowship, a Google focused research award, a gift from Adobe Research, a gift from Snap Research, a hardware donation from NVIDIA, and an Amazon AWS AI Research gift.

References

  • [1] M. Arie-Nachimson, S. Z. Kovalsky, I. Kemelmacher-Shlizerman, A. Singer, and R. Basri. Global motion estimation from point matches. In Proceedings of the 2012 Second International Conference on 3D Imaging, Modeling, Processing, Visualization & Transmission, 3DIMPVT ’12, pages 81–88, Washington, DC, USA, 2012. IEEE Computer Society.
  • [2] F. Arrigoni, A. Fusiello, B. Rossi, and P. Fragneto. Robust rotation synchronization via low-rank and sparse matrix decomposition. CoRR, abs/1505.06079, 2015.
  • [3] C. Bajaj, T. Gao, Z. He, Q. Huang, and Z. Liang. SMAC: simultaneous mapping and clustering using spectral decompositions. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 334–343, 2018.
  • [4] L. Carlone, R. Tron, K. Daniilidis, and F. Dellaert. Initialization techniques for 3d SLAM: A survey on rotation estimation and its use in pose graph optimization. In ICRA, pages 4597–4604. IEEE, 2015.
  • [5] A. Chatterjee and V. M. Govindu. Efficient and robust large-scale rotation averaging. In ICCV, pages 521–528. IEEE Computer Society, 2013.
  • [6] A. Chatterjee and V. M. Govindu. Robust relative rotation averaging. IEEE transactions on pattern analysis and machine intelligence, 40(4):958–972, 2018.
  • [7] Y. Chen, L. J. Guibas, and Q. Huang. Near-optimal joint object matching via convex relaxation. In ICML, pages 100–108, 2014.
  • [8] T. S. Cho, S. Avidan, and W. T. Freeman. The patch transform. IEEE Trans. Pattern Anal. Mach. Intell., 32(8):1489–1501, 2010.
  • [9] S. Choi, Q.-Y. Zhou, and V. Koltun. Robust reconstruction of indoor scenes. In CVPR, pages 5556–5565. IEEE Computer Society, 2015.
  • [10] S. Choi, Q.-Y. Zhou, and V. Koltun. Robust reconstruction of indoor scenes. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [11] S. Choi, Q.-Y. Zhou, S. Miller, and V. Koltun. A large dataset of object scans. arXiv:1602.02481, 2016.
  • [12] A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. CoRR, abs/1702.04405, 2017.
  • [13] I. Daubechies, R. DeVore, M. Fornasier, and C. S. Güntürk. Iteratively re-weighted least squares minimization for sparse recovery. Report, Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ, USA, June 2008.
  • [14] N. Gelfand, S. Rusinkiewicz, L. Ikemoto, and M. Levoy. Geometrically stable sampling for the ICP algorithm. In 3DIM, pages 260–267. IEEE Computer Society, 2003.
  • [15] Q. Huang, S. Flöry, N. Gelfand, M. Hofer, and H. Pottmann. Reassembling fractured objects by geometric matching. ACM Trans. Graph., 25(3):569–578, July 2006.
  • [16] Q. Huang and L. Guibas. Consistent shape maps via semidefinite programming. In Proceedings of the Eleventh Eurographics/ACMSIGGRAPH Symposium on Geometry Processing, SGP ’13, pages 177–186, Aire-la-Ville, Switzerland, Switzerland, 2013. Eurographics Association.
  • [17] Q. Huang, F. Wang, and L. J. Guibas. Functional map networks for analyzing and exploring large shape collections. ACM Trans. Graph., 33(4):36:1–36:11, 2014.
  • [18] Q. Huang, G. Zhang, L. Gao, S. Hu, A. Butscher, and L. J. Guibas. An optimization approach for extracting and encoding consistent maps in a shape collection. ACM Trans. Graph., 31(6):167:1–167:11, 2012.
  • [19] X. Huang, Z. Liang, C. Bajaj, and Q. Huang. Translation synchronization via truncated least squares. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1459–1468. Curran Associates, Inc., 2017.
  • [20] X. Huang, Z. Liang, C. Bajaj, and Q. Huang. Translation synchronization via truncated least squares. In NIPS, 2017.
  • [21] D. Huber. Automatic Three-dimensional Modeling from Reality. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, December 2002.
  • [22] D. F. Huber and M. Hebert. Fully automatic registration of multiple 3d data sets. Image and Vision Computing, 21:637–650, 2001.
  • [23] M. K. Kadalbajoo and A. Gupta. An overview on the eigenvalue computation for matrices. Neural, Parallel Sci. Comput., 19(1-2):129–164, Mar. 2011.
  • [24] S. Kim, S. Lin, S. R. Jeon, D. Min, and K. Sohn. Recurrent transformer networks for semantic correspondence. In NIPS, 2018.
  • [25] V. Kim, W. Li, N. Mitra, S. DiVerdi, and T. Funkhouser. Exploring collections of 3d models using fuzzy correspondences. ACM Trans. Graph., 31(4):54:1–54:11, July 2012.
  • [26] S. Leonardos, X. Zhou, and K. Daniilidis. Distributed consistent data association via permutation synchronization. In ICRA, pages 2645–2652. IEEE, 2017.
  • [27] N. Mellado, D. Aiger, and N. J. Mitra. Super 4pcs fast global pointcloud registration via smart indexing. Comput. Graph. Forum, 33(5):205–215, Aug. 2014.
  • [28] K. Moo Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua. Learning to find good correspondences. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [29] A. Nguyen, M. Ben-Chen, K. Welnicka, Y. Ye, and L. J. Guibas. An optimization approach to improving collections of shape maps. Comput. Graph. Forum, 30(5):1481–1491, 2011.
  • [30] O. Özyesil and A. Singer. Robust camera location estimation by convex programming. CoRR, abs/1412.0165, 2014.
  • [31] D. Pachauri, R. Kondor, G. Sargur, and V. Singh. Permutation diffusion maps (PDM) with application to the image association problem in computer vision. In NIPS, pages 541–549, 2014.
  • [32] D. Pachauri, R. Kondor, and V. Singh. Solving the multi-way matching problem by permutation synchronization. In NIPS, pages 1860–1868, 2013.
  • [33] R. Ranftl and V. Koltun. Deep fundamental matrix estimation. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I, pages 292–309, 2018.
  • [34] Y. Shen, Q. Huang, N. Srebro, and S. Sanghavi. Normalized spectral map synchronization. In NIPS, pages 4925–4933, 2016.
  • [35] A. Singer and H. tieng Wu. Vector diffusion maps and the connection laplacian. Communications in Pure and Applied Mathematics, 65(8), Aug. 2012.
  • [36] Y. Sun, Z. Liang, X. Huang, and Q. Huang. Joint map and symmetry synchronization. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part V, pages 257–275, 2018.
  • [37] C. Sweeney, T. Sattler, T. Höllerer, M. Turk, and M. Pollefeys. Optimizing the viewing graph for structure-from-motion. In ICCV, pages 801–809. IEEE Computer Society, 2015.
  • [38] L. Wang and A. Singer. Exact and stable recovery of rotations for robust synchronization. Information and Inference: A Journal of the IMA, 2:145–193, December 2013.
  • [39] K. Wilson and N. Snavely. Robust global translations with 1dsfm. In D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, ECCV (3), volume 8691 of Lecture Notes in Computer Science, pages 61–75. Springer, 2014.
  • [40] C. Zach, M. Klopschitz, and M. Pollefeys. Disambiguating visual relations using loop constraints. In CVPR, pages 1426–1433. IEEE Computer Society, 2010.
  • [41] Q. Zhou, J. Park, and V. Koltun. Fast global registration. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II, pages 766–782, 2016.
  • [42] Q. Zhou, J. Park, and V. Koltun. Open3d: A modern library for 3d data processing. CoRR, abs/1801.09847, 2018.
  • [43] T. Zhou, Y. J. Lee, S. X. Yu, and A. A. Efros. Flowweb: Joint image set alignment by weaving consistent, pixel-wise correspondences. In CVPR, pages 1191–1200. IEEE Computer Society, 2015.
  • [44] X. Zhou, M. Zhu, and K. Daniilidis. Multi-image matching via fast alternating minimization. CoRR, abs/1505.04845, 2015.

Appendix A Overview

We organize this supplemental material as follows. In Section B, we provide more detailed experimental results. In Section C, we describe the technical proofs for all the propositions in the main paper. In Section D, we show the scenes we used in this paper.

Appendix B More Experimental Results

B.1 More Visual Comparison Results

Figure 6 shows more visual comparisons between our approach and baseline approaches. Again, our approach produces alignments that are close to the underlying ground-truth. The overall quality of our alignments is superior to that of the baseline approaches.

Figure 6: We show the ground-truth result (column I), Rotation Averaging [6] + Translation Sync. [19] (column II), Geometric Registration [10] (column III), and our approach (column IV). These scenes are from the Redwood Chair dataset.

B.2 Cumulative Distribution Function

Figure 7 plots the cumulative distribution functions of the errors in rotations and translations with respect to a varying threshold.

Figure 7: Corresponding cumulative distribution function (CDF) curves. In the top block, we plot CDFs for different input sources; "all" corresponds to errors between all pairs, and "good" corresponds to errors between selected pairs. The pairs were selected by 1) computing an ICP refinement, 2) computing the overlapping region by finding the points in the source point cloud that are close to the target point cloud (i.e., within a threshold), and 3) computing, for these points, the median distance to the target point cloud. In the middle block, we compare the baselines and our approach, with results from different input sources reported separately. In the bottom block, we compare variants of our approach using Fast Global Registration as the input pairwise alignments.

B.3 Illustration of Dataset

To understand the difficulty of the datasets used in our experiments, we pick a typical scene from each of the Redwood and ScanNet datasets and render 15 of the 30 ground-truth point clouds from the same camera viewpoint. From Figure 8 and Figure 9, we can see that ScanNet is generally harder than Redwood, as less information can be extracted by looking at pairs of scans.

Figure 8: A typical example of a Redwood Chair scene: the 1st, 3rd, 5th, 7th, …, 29th of the selected scans are rendered from the same camera viewpoint. Each scan is about 40 frames away from the next one.
Figure 9: A typical example of a ScanNet scene: the 1st, 3rd, 5th, 7th, …, 29th of the selected scans are rendered from the same camera viewpoint. Each scan is about 40 frames away from the next one.

Appendix C Proofs of Propositions

We organize this section as follows. In Section C.1, we provide key lemmas regarding the eigen-decomposition of a connection Laplacian, including the stability of eigenvalues/eigenvectors and the derivatives of eigenvectors with respect to elements of the connection Laplacian. In Section C.2, we provide key lemmas regarding the projection operator that maps the space of square matrices to the space of rotations. Sections C.3 through C.6 describe the proofs of all the propositions stated in the main paper. Section C.7 provides an exact recovery condition for a rotation synchronization scheme via reweighted least squares. Finally, Section C.8 provides proofs for the new key lemmas introduced in this section.

C.1 Eigen-Stability of Connection Laplacian

We begin by introducing the problem setting and notation in Section C.1.1. We then present the key lemmas in Section C.1.2.

C.1.1 Problem Setting and Notations

Consider a weighted graph G = (V, E) with n vertices. We assume that G is connected. With w_ij we denote the edge weight associated with edge (i, j) ∈ E. Let L denote the corresponding connection Laplacian (we drop the edge weights from the expression of L to keep the notation uncluttered). Its leading eigenvectors correspond to the eigenvalue zero. In the following, we shall denote the eigen-decomposition of L so that the remaining eigenvectors and their corresponding eigenvalues of L are collected separately. Our analysis will also use a notation that is closely related to the pseudo-inverse of L:

(15)

Our goal is to understand the behavior of the leading eigenvectors of the perturbed matrix L + E² for a symmetric perturbation matrix E, which is a block matrix whose blocks encode the perturbation imposed on each edge.

²Note that when applying the stability results to the problem studied in this paper, we always use a specific perturbation; however, we assume a general perturbation when describing the stability results.

We are interested in the matrix U that collects the leading eigenvectors of L + E in its columns, and we denote the corresponding eigenvalues accordingly. Note that, due to the properties of the connection Laplacian, these eigenvalues are non-negative. Our goal is to 1) bound the eigenvalues and 2) provide block-wise bounds between the blocks of U and those of the unperturbed eigenvector matrix, up to a rotation matrix.

Besides the notations introduced above that are related to Laplacian matrices, we shall also use a few matrix norms. With ‖·‖ and ‖·‖_F we denote the spectral norm and the Frobenius norm, respectively. Given a vector, we denote by ‖·‖_∞ the element-wise infinity norm. We will also introduce a norm for square matrices, as well as a similar norm defined for block matrices (i.e., where each block is a 3×3 matrix).

C.1.2 Key Lemmas

This section presents a few key lemmas that will be used to establish the main stability results regarding matrix eigenvectors and eigenvalues. We begin with the classical Weyl inequality:

Lemma C.1.

(Eigenvalue stability) For all i, we have

|λ_i(L + E) − λ_i(L)| ≤ ‖E‖.    (16)

We proceed to describe tools for controlling the eigenvector stability. To this end, we shall rewrite U in a block form. Our goal is to bound the deviation between the blocks of U and the corresponding blocks of the unperturbed eigenvector matrix, up to a rotation matrix.

We begin by controlling the leading component of U, for which we adopt a result described in [3]:

Lemma C.2.

(Controlling the leading component [3]) If the perturbation is sufficiently small, then there exists a rotation matrix³ under which the leading component of U is close to that of the unperturbed eigenvector matrix.

³If not, we can always negate the last column of U.

It remains to control the blocks of U. We state a formulation that expresses the columns of U using a series:

Lemma C.3.

Suppose the perturbation is sufficiently small. Then the columns of U admit the series expansion

(17)

We conclude this section by providing an explicit expression for computing the derivatives of the leading eigenvectors of a connection Laplacian with respect to its elements:

Lemma C.4.

Let L be a non-negative definite matrix whose eigen-decomposition is

(18)

where the eigenvalues are sorted in increasing order.

Suppose the third and fourth smallest eigenvalues are distinct. Collect the eigenvectors corresponding to the three smallest eigenvalues of L as the columns of a matrix U.

Notice that L can have different decompositions in (18) when there are repeated eigenvalues. But in our case, where the third and fourth smallest eigenvalues are distinct, we claim that the subspace spanned by U is unique under the different possible decompositions of L, so that the differential dU is well defined and has the explicit expression

(19)

Moreover, the differentials of the eigenvalues are

(20)

C.2 Key Lemmas Regarding the Projection Operator

This section studies the projection operator that maps the space of square matrices to the space of rotation matrices. We begin by formally defining the projection operator as follows:

Definition 1.

Suppose det(M) > 0 for a square matrix M. Let M = UΣV^T be the singular value decomposition of M, where U and V are both orthogonal matrices and all singular values are non-negative. Then we define the rotation approximation of M as

P(M) := UV^T.

It is clear that P(M) is a rotation matrix, since 1) U and V can be taken to be rotations, and 2) the product of two rotations is a rotation.

Lemma C.5.

Let A be a block matrix whose blocks are as above, and use a_kl to denote the element at position (k, l) in A. Then we have

We then present the following key lemmas regarding the stability of the projection operator:

Lemma C.6.

Let E be a square perturbation matrix and M a square matrix with det(M) > 0. Suppose the perturbation is sufficiently small; then P(M) and P(M + E) are correspondingly close.

Lemma C.7.

Regarding P(M) as a function of M, the differential of P satisfies the identity below, where all notations follow Definition 1.

C.3 Proof of Prop. 4.1

We first present a formal version of Prop. 4.1 of the main paper.

Proposition 5.

Suppose the underlying rotations are given by R_1^gt, …, R_n^gt. Modify the definition of the error terms accordingly, and define

(21)

Suppose, in addition, that