# Scale Adaptive Clustering of Multiple Structures

We propose the segmentation of noisy datasets into Multiple Inlier Structures with a new Robust Estimator (MISRE). The scale of each individual structure is estimated adaptively from the input data and refined by mean shift, without tuning any parameter in the process, or manually specifying thresholds for different estimation problems. Once all the data points were classified into separate structures, these structures are sorted by their densities with the strongest inlier structures coming out first. Several 2D and 3D synthetic and real examples are presented to illustrate the efficiency, robustness and the limitations of the MISRE algorithm.

## Authors

• 3 publications
• 4 publications
• ### Robust Estimation of Multiple Inlier Structures

The robust estimator presented in this paper processes each structure in...
09/20/2016 ∙ by Xiang Yang, et al. ∙ 0

• ### Automated Clustering of High-dimensional Data with a Feature Weighted Mean Shift Algorithm

Mean shift is a simple interactive procedure that gradually shifts data ...
12/20/2020 ∙ by Saptarshi Chakraborty, et al. ∙ 0

• ### Unsupervised Decision Forest for Data Clustering and Density Estimation

An algorithm to improve performance parameter for unsupervised decision ...
07/15/2015 ∙ by Hayder Albehadili, et al. ∙ 0

• ### Deep Amortized Clustering

We propose a deep amortized clustering (DAC), a neural architecture whic...
09/30/2019 ∙ by Juho Lee, et al. ∙ 30

• ### Multiple kernel learning for integrative consensus clustering of genomic datasets

Diverse applications - particularly in tumour subtyping - have demonstra...
04/15/2019 ∙ by Alessandra Cabassi, et al. ∙ 0

• ### To Refine or Not to Refine: Topology Optimization of Adaptively Refined Infill Structures for Additive Manufacturing

We present a novel method for optimizing structures by adaptively refini...
12/12/2017 ∙ by Jun Wu, et al. ∙ 0

• ### Ridges in the Dark Energy Survey for cosmic trough identification

Cosmic voids and their corresponding redshift-aggregated projections of ...
05/18/2020 ∙ by Ben Moews, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Today in computer vision, people use convolutional neural networks to identify complex patterns in natural settings. These methods give only qualitative categorizations instead of parameterizations of the models. The training process could take a significantly long time and the much smaller testing set has to be similar enough to those used for training.

When the input contains structures governed by precise mathematical relations, the objective functions can be estimated quantitatively without any pre-training. The estimator has to be robust to detect the inlier structures while removing the structureless outliers. The objective functions are either linear, like the estimation of 3D planes, or nonlinear, such as finding 3D spheres or the homography between two 2D images. The input dataset could also contain multiple inlier structures with every structure corrupted by different noise.

The elemental subsets are the building blocks of the robust regression. Each randomly chosen subset has the minimum number of input points required to estimate the parameters in the objective function. The most used algorithm for robust fitting in the past 35 years is the RANdom SAmple Consensus (RANSAC) [10]. Similar types of methods also exist, like PROSAC, MLESAC, Lo-RANSAC, etc. These algorithms use different ways to generate the random sampling and/or probabilistic relations for the elimination of the outliers. In paper [24], a review of these methods was given.

Before the estimation, the user has to assign a value for the inlier noise level, which is a major drawback of using RANSAC. Providing too small of a scale could filter out many inlier points, while too large of a value bring in outliers. This threshold is not even mentioned sometime since in real images the noise is often less than 3 pixels. However, an appropriate value of the scale is hardly predictable without prior knowledge.

The problem also appears in methods which begin with RANSAC, like J-linkage [29]. When the images are resized, the error changes proportionally and the given scale also has to be changed to get the correct result. Methods using adaptive scale rather than hard thresholding, like T-linkage [20], may not work properly when different noise exists for different inlier structures.

The inlier scale can also be predicted based on statistical distributions. Based on [31, Section 3.2.2], the paper [17] combined the projection based M-estimator [4] with RANSAC to detect 3D geometric primitives, but the input acquired by 3D laser scanner had low noise and did not contain randomly distributed outliers. The assumption based on a statistical distribution is only valid for specific problems.

In [23]

, a universal framework for RANSAC (USAC) was proposed, where the threshold was based on Gaussian distribution of the inliers. The method in

[18] performed better than USAC, estimating the inlier rate from probabilistic reasoning, instead of specifying the inlier scale. However, [18] required a very dense sampling and only one inlier structure could be recovered.

Other methods tried to avoid the inlier threshold and transformed the estimation into an energy-based minimization problem. These approaches first identified a possible inlier region, and then iteratively improved the solution to get the final estimate. In [32] the -th ordered absolute residual was sequentially improved. The number of inlier structures had to be specified before the estimation [8]. In [28] points were randomly selected to evaluate the error, where was the size of the elemental subset. The significance of taking two additional points in a sample was not justified and the parameter varied largely in the experiments.

The Propose Expand and Re-estimate Labels (PEARL) algorithm in [15] started with RANSAC, followed with alternative steps of expansion (inlier classification) and re-estimation to minimize the energy of the errors. A synthetic example showed that PEARL can handle the estimation of multiple 2D lines with different Gaussian noises, but was not tested on any other models. The amount of outliers was small in all the experiments. The Random Cluster Model SAmpler (RCMSA) in [22] was similar to [15], but simulated annealing was used to minimize the overall energy function.

The generalized projection-based M-estimator (gpbM) [21]

tried to locate a dense region where an inlier structure could exist, by locating the highest value of the cumulative distribution function weighted by its size. The assigned weights were critical to obtain the correct scale estimate, as Figure 3 from

[21] showed. However, this weighting strategy cannot work all the time due to the interaction between all existing inliers and outliers.

These methods also do not put emphasis on the internal parameter(s) which often have to be modified for a particular estimation task. For example, by varying the model complexity value in [22], the homography estimation showed similar behavior as changing the RANSAC inlier scale (Fig.1). The default value was 100 for the fundamental matrix and 10 for homography. Applying the values vice-versa, the estimator no longer worked. This adjustment needs prior knowledge since there is no systematic way to predict the values of these parameters. To avoid all the internal parameters and handle different structure scales, a general method for robust estimation should independently estimate the fitting error for each individual structure.

In this paper, we propose a new algorithm for the Multiple Inlier Structures Robust Estimator (MISRE). The estimation for each structure consists of three consecutive steps: scale estimation, mean shift based structure recovery, and strength based inlier classification. The only parameter to be specified by the user is the number of trials for random sampling, which is required in any robust estimator using elemental subsets. The major innovations of the paper are summarized below.

• Adaptive scale estimation for each individual structure.

• No tuning of the internal parameters or threshold is required for different objective functions.

• Structures are characterized by their strength (average density), and in general an inlier structure has a stronger strength.

• The efficiency, robustness and the limitations of the new robust estimator are discussed in detail.

Our experiments show successful results for different objective functions without tuning any parameters, and the entire process is self-adaptive.

The linearized objective functions are presented in Section 2. The algorithm of the new robust estimator is detailed in Section 3. Experiments of different estimation problems are given in Section 4. Finally, in Section 5 we compare the new estimator with other methods and discuss some open problems.

## 2 Structure Initialization

In Section 2.1

, the nonlinear objective function of the inputs is transformed into a linear function of the carrier vectors. The first order approximation of the covariance matrix of these carriers is also computed. Section

2.2 explains how the largest Mahalanobis distance of each input point is taken into account when multiple carrier vectors are derived.

### 2.1 Carrier Vectors

The nonlinear objective functions in computer vision can be transformed into linear relations in higher dimensions. These linear relations, containing terms formed by the input measurements and their pairwise products, are called carriers. Each relation gives a carrier vector. For linear objective functions, such as plane fitting, the input variables and the carrier vector are identical.

For example, in the estimation of fundamental matrix, the input data are the point correspondences from two images , with dimensions. The objective function with noisy image coordinates is

 [x′  y′  1] F [x  y  1]⊤≃0 (1)

which gives a carrier vector containing carriers
. The linearized function of the carriers is

 x⊤i\boldmathθ−α≃0   i=1,…,nin (2)

where denotes the number of inliers. Vector and scalar intercept derived from the matrix are to be estimated. The constraint eliminates the multiplicative ambiguity in (2).

In the general case, several linear equations can be derived from a single input

 x[c]⊤i\boldmathθ−α≃0   c=1,…,ζ   i=1,…,nin (3)

corresponding to different carrier vectors . For example, the estimation of homography has carrier vectors derived from and image coordinates.

The Jacobian matrix is required for the first order approximation of the covariance of carrier vector. From each carrier vector , an Jacobian matrix is derived. Each column of the Jacobian matrix contains the derivatives of the carriers in with respect to one measurement from . The Jacobian matrices derived from linear objective functions are not input dependent, while those derived from nonlinear objective functions rely on the specific input point. The carrier vectors are heteroscedastic for nonlinear objective functions. For example, the transpose of the Jacobian matrix of the fundamental matrix

 (4)

depends on .

The covariance matrix of the measurements with , has to be provided before estimation. This is a chicken-egg problem, since the input points have not yet been classified into inliers and outliers. A reasonable assumption is to set

as the identity matrix

, if no additional information is given. The input data are considered as independent and identically distributed, and contain homoscedastic measurements with the same covariance.

The covariance of the carrier vector is computed from

 σ2C[c]i=σ2J%$x$[c]i|yi Cy J⊤x[c]i|yi (5)

where the dimensions of is . The scale of the structure is unknown and to be estimated.

### 2.2 Computation of the Mahalanobis Distances

The elemental subset needs input points to uniquely define and in the linear space. For example, the homography requires four point pairs, and eight pairs are necessary for the fundamental matrix if the 8-point algorithm is used. The input data should be normalized [12, Section 4.4], and the obtained structures mapped back onto the original space.

For each , every carrier vector is projected to a scalar value . The average projection of the vectors from an elemental subset is

. The variance of

is .

The Mahalanobis distance, scaled by an unknown , indicating how far is a projection from , is computed from

 d[c]i = √(x[c]⊤i\boldmathθ−α)⊤(H[c]i)−1(x[c]⊤i\boldmathθ−α) (6) = |x[c]⊤i\boldmathθ−α|√\boldmathθ⊤C[c]i% \boldmathθc=1,…,ζ.

Each input point gives a -dimensional Mahalanobis distance vector

 di=[d[1]i … d[ζ]i]⊤i=1,…,n. (7)

The worst-case scenario is taken to retain the largest Mahalanobis distance from all the values

 ~ci=argmaxc=1,…,ζd[c]i. (8)

For different -s, the same input point may have its largest distance computed from different carriers.

The symbols related to the largest Mahalanobis distance are: , the largest Mahalanobis distance for input ; , the corresponding carrier vector; , the scalar projection of ; , the covariance matrix of ; , the variance of ; , the scale multiplying and , which has to be estimated.

## 3 Estimation of Multiple Structures

The Multiple Inlier Structures Robust Estimator (MISRE) is introduced in this section. The scale for a structure is estimated in Section 3.1 by an expansion criteria. The estimated scale is used in the mean shift to re-estimate the structure in Section 3.2. The iterative process continues until not enough input points remain for a further estimation. In Section 3.3, all the estimated structures are ordered by strengths with the strongest inlier structures returned first. The limitations of the method are explained in Section 3.4 and the conditions for robustness is discussed in Section 3.5.

### 3.1 Scale Estimation of A Structure

Assume that input points remain in the current iteration. The estimation process initializes random elemental subsets, each giving a and an .

For every point , compute the largest Mahalanobis distance

 ~di=|~x⊤i\boldmathθ−α|√\boldmathθ⊤˜Ci\boldmathθ≥0i=1,…,n. (9)

Sort the -distances in ascending order, denoted . In total sorted sequences are found.

Let represent a small amount of points

 nϵ=ϵn1000<ϵ≪100 (10)

where defines the size of in percentage of the input amount.

Among all trials, find the single sequence that gives the minimum sum of Mahalanobis distances from its first points

 minMnϵ∑i=1~d[i]j. (11)

The sequence containing points, denoted , is retained and the first points form the initial set.

If inlier structures still exist and

is sufficiently large, the points in the initial set have a high probability to be selected from a dense region in a single inlier structure. The corresponding structure defining

, though not accurate enough for the final estimate, is closely aligned with an inlier structure.

Two rules should be considered for the ratio . First, should be smaller than the size of any inlier structure to be estimated. Therefore, a small ratio is preferred to detect all the potential structures. All our experiments start with . The second rule is to have the size of at least five times the number of points in the elemental subset, as suggested in [12, page 182]. This rule of thumb reduces unstable results when relatively few input points are provided.

As many as possible points belonging to the same structure is needed to recover the scale . The following example justifies the use of expansion criteria for scale estimation. In Fig.2a two ellipses are shown; each of them has

inlier points. They are corrupted by different Gaussian noise with standard deviation

and respectively. Another outliers are randomly placed in the image. With we obtain the sequence , and its first points are shown in Fig.2b. The sorted Mahalanobis distances in for the first 400 points from 600 in total, are shown in Fig.2c.

Divide the sequence into multiple segments, and each segment covers an equal range of Mahalanobis distance . This corresponds to the Mahalanobis distance of the point at position in the sequence . Let denote the number of points within the -th segment, , with . The expansion process verifies the following condition for each

 nk+11k∑ki=1ni≤0.5 (12)

where the numerator is the number of points in the ()-th segment and the denominator is the average point numbers inside all the segments.

When the point density drops below half of the average in the previous segments, the boundary to separate this structure from the outliers is found, . The value of the scale estimate is . The condition (12

) is a heuristic criteria since the true distribution of the inliers is unknown.

Due to the randomness of the input data, the single scale estimate may not be stable and several independent expansions have to be generated with different values of . This is a similar process to the Gaussian smoothing used in the Canny edge detection, and the SIFT [19] over the scale space, where discretization effect occurs when kernel size changes. In Fig.2d the expansion starting with , stops at , giving (red bar). In Fig.2e the expansion with a larger stops at giving .

To stabilize the scale estimation, the expansion process is applied to an increasing sequence of sets. As illustrated in Fig.3, starts from , and increases by each time, . The blue points indicate the length of . Every expansion process is performed independently, and stops at the corresponding red point when condition (12) is met. The length of continues to increase until it reaches a bound , where the following expansion with can no longer expand beyond , as it is in Fig.3.

The scale estimate is found from the region of interest where the sets of points can expand. In Fig.3 it ranges from to . The region of interest may not always start from , but at the first place where the expansion process begins, as increases.

The largest estimate from the region of interest gives the scale

 ^σ=maxη=ϵ,…,ηtktηΔdη (13)

the farthest expansion inside the region of interest. In Fig.3 the scale estimate is , which is around .

### 3.2 Mean Shift Based Structure Recovery

From the sequence , collect all the points within the scale estimate . Another elemental subsets are generated, where the points are selected only from the collected subset. We set since most points in this set come from the same structure.

For each trial all the remaining input points are projected by to a one-dimensional space . The mean shift [7] moves the from to the closest mode

 [ˆ\boldmathθ,ˆα]=∗argmax\boldmathθ,α1n^σn∑i=1κ((z−˜zi)⊤˜B−1i(z−˜zi)). (14)

The variance is computed from

 ˜Bi=^σ2˜Hi=^σ2% \boldmathθ⊤~Ci\boldmathθ% =^σ2\boldmathθ⊤J~xi|yiJ⊤~xi|yi\boldmathθ (15)

with .

The function is the profile of a radial-symmetric kernel defined only for . For the Epanechnikov kernel

 κ(u)={1−u(z−˜zi)⊤˜B−1i(z−˜zi)≤10(z−˜zi)⊤˜B−1i(z−˜zi)>1. (16)

Let and for the Epanechnikov kernel, when and if . All the points inside the window contribute equally in the mean shift. The convergence to the closest mode is obtained by assigning zero to the gradient of (14) in each iteration. The is updated from the current value by

 znew=[n∑i=1g(u)]−1[n∑i=1g(u)˜zi]. (17)

Many of the input points have their projections more distant from than and their weights are zeros.

The highest mode among all trials gives the estimate . The vector is obtained from the same elemental subset which gives the highest mode . All the input points that can converge into the region around are classified as inliers, resulting in points. The total least squares (TLS) estimate for the structure is then computed to obtain , and . Fig.4a shows the structure recovered in the first iteration, where most inlier points are collected compared with the result in Fig.2b.

### 3.3 Strength Based Classification

After the mean shift step, the points are removed from the inputs before the next iteration. If the amount of the remaining data is smaller than , the algorithm terminates and all the recovered structures are sorted by their strengths in descending order. The strength of a structure is defined as

 s=nin^σtls. (18)

which can also be seen as the density in the linear space of that structure.

Structures with stronger strengths are detected first, and in general are inlier structures with more dense points and smaller scales. The new method does not rely on a threshold to separate inliers from the outliers, or assume any upper/lower limit to bound the range of error. In Fig.4b, three structures are returned

 redgreenbluescale:16.435.4708.7inliers:219210163strength:13.325.930.23

where the first two (red and green) are inlier structures.

If the scale estimator locates a structure consisting of outliers, in general is much larger and the strength weaker than inlier structures. In real images, the difference in scale and strength between the inliers and the outlier is obvious to notice, and the user can easily retain the inlier structures, as the examples in Section 4 will show. When an ambiguous inlier/outlier threshold appears, like in Fig.9c, the strongest inlier structures are still detected correctly.

### 3.4 Limitations

The major limitation of every robust estimator comes from the interactions between inliers and outliers. As the outlier amount increases, and/or the inlier structures become noisier, eventually the inliers and outliers become less separable in the input space. In our algorithm most of the processing is done in a linear space, but the limitation introduced by outliers still exists. We will illustrate it in the following example.

In a image, a circle consists of inliers is corrupted by Gaussian noise with , together with outliers. The circles have different radiuses, 50 in Fig.5a and 200 in Fig.5c. In both these figures, the estimator finds the correct scale estimates from the structure (blue circles) corresponding to the initial set, where and .

About 196 true inlier points should exist inside the scale , based on the Gaussian distribution. The number of outliers can be roughly estimated as

 (2π50)(2∗23.65)1500700×700=45 points

giving 241 points in total inside the true inlier region. However, after the mean shift step an incorrect final result (red circle) containing 261 points is obtained in Fig.5b, where 84 points are true inliers and 177 points from the outliers. Although the true structure appears more dense in the input space, the mean shift converges to an incorrect mode due to the heavy noise from outliers.

The circle in Fig.5c appears much weaker, however after 100 tests with randomly generated data (inlier/outlier), it returns more stable estimations than the smaller circle in Fig.5a. In a result shown in Fig.5d, 346 points are classified as inliers, where 190 points are from true inliers and 156 points from the outliers. The mean shift has a much lower probability to converge to another, incorrect mode, and this inlier structure resists more outliers.

Similar limitation exists in RANSAC when many outliers are present. Even a correct scale given by the user can still lead to an incorrect estimation. The methods proposed in [22] and [28] returned incorrect results if too many outliers existed. The failure of RANSAC also occurs due to not explicitly considering the underlying task [13].

In [30]

a robust estimator for structure from motion algorithm was proposed combining an extended Kalman filter. The example in Figure 5 returned correct estimate with the data containing more than 60% outliers. For the homography estimation of Figure 6 in [22], more than 90% of the points were outliers and correct result was still obtained. However, none of these results had repetitive tests, and the stability of the methods cannot be verified.

In MISRE, the strength of an inlier structure is another factor with a strong influence on the inlier/outlier interaction. Firstly, the level of the inlier noise affects the number of outliers that can be tolerated. With the same number of inliers, structures with lesser inlier noise will have its initial sets better aligned with the true structures, and result in more reliable scale estimate. Noisier inlier structures are most likely to interact with the outliers and lead to spurious results, see Fig.9d.

In Fig.6a, the expansion process is applied to the same example as in Fig.2a, but with . The expansion stops soon and the algorithm locates the region of interest between giving a small scale estimate. After applying the expansion criteria many times, the range of expansion from different are not stable. A too small value of the scale may attract only a minority of the inlier points. In Fig.6b the number of inliers is raised to . The scale estimate becomes a more stable value with the region of interest located between .

Secondly, when the inlier point amount is relatively small compared to the outliers, the region of interest also becomes unstable and the initial set may not closely align with a true structure. The expansion process then will not return a correct scale estimate.

When the inlier/outlier interaction is strong, preprocessing on the input data is required to obtain more inlier points, and/or reduce the outlier amount for a better performance. In Fig.14a of Section 4, an example is given where homography estimation in 2D is used to segment objects in 3D scene. Under the small translational motions, the two planes on the bus though orthogonal in 3D, are not separable in 2D due to the relatively small amount of inlier points. In Fig.14d we show that by using more inlier points, the estimator will recover more inlier structures.

If an inlier structure appears split in several structures with fewer points, post-processing is needed to merge them. The user can easily locate them by their strengths since most of these split structures are still stronger compared with the outliers. The similarity of two structures should be compared in the input space where measurements are obtained, as the derived carriers in the linear space do not represent the nonlinearities of the inputs explicitly.

For two inlier structures with linear objective function, the merge can be implemented based on the orientation of each structure and the distance between them. For two ellipses, the geometric tools to determine the overlap area can be used [14]. The 2D measurements of fundamental matrices and the homographies are in the projective space instead of euclidean, and the reconstructed 3D information should be applied to separate or merge the similar structures.

### 3.5 Conditions for robustness

The number of trials for random sampling is the only parameter given by the user. The required amount of depends strongly on the data to be processed. The complexity of the objective function, the size of the input data, the number of inlier structures, the inlier noise levels, and the amount of outliers, all are factors which can affect the required number of trials.

If no information on the size of is known, the user can run several tests with different -s until the results become stable. When the interaction between inliers and outliers is apparent, the quality of the estimation cannot be compensated by a larger since the initial set has become less reliable. Only through preprocessing of the input data will the number of inlier points increase and a better result be obtained.

Three main conditions to improve the robustness of the proposed algorithm are summarized:

• Preprocessing to reduce the amount of outliers, while bring in more inliers.

• The sampling size should be large enough to stably find the inlier estimates.

• Post-processing should be done in the input space when an inlier structure comes out split or has to be separated.

### 3.6 Review of the algorithm

The MISRE algorithm is summarized below.

Scale adaptive clustering of multiple structures

Input: , data points that contain an unknown number of inlier structures with their scales unspecified, along with outliers. The covariance matrices for are if not provided explicitly.

Output: The sorted structures with inliers come out first.

• Compute the carriers , , and the Jacobians , for each input , .

• Generate random trials based on elemental subsets.

• For each elemental subset find and .

• Compute the Mahalanobis distances from for all carrier vectors , . Keep the largest distance for each point.

• Sort the Mahalanobis distances in ascending order.

• Among all trials, find the sequence with the minimum sum of distances from points.

• Apply the expansion criteria to an increasing sequence of sets and determine the region of interest for a structure.

• In the region of interest find the largest estimate as and collect all points inside this scale.

• Generate random trials from these points.

• Apply the mean shift to all the existing points, to find the closest mode from .

• Find at the maximum mode among all trials, and from the same elemental subset.

• The recovered structure contains points which converged to from .

• Compute the TLS solution for the structure and remove the points from the inputs.

• Go back to and start another iteration.

• If not enough input points remain, sort all the structures by their strengths and return the result.

## 4 Experiments

Several synthetic and real examples are presented in this section. In most cases a single carrier vector exists, , except for the homography estimation which has two and .

The input data for synthetic problems are generated randomly and Gaussian noise is added to each inlier structure. The standard deviation is specified only to verify the results, while not used in the estimation process.

The real 3D datasets are constructed either from the 3D mesh models in [1] or the photogrammetric methods using 2D images. Higher noise is introduced by the outliers which come from the incorrect point correspondences in both 2D and/or 3D matches. In Fig. 8a a 2D image out of a few tens of views of a 2D image sequence is shown. The image feature points are extracted from pairs of images and the point correspondences are robustly filtered through the above MISRE algorithm. The 3D information is recovered from the images based on the projective geometry relations. As more images are registered, more 3D points are added and the point cloud of the 3D scene is generated through incremental structure from motion (SfM) [12, Chapter 18] and [1, 33], followed by hierarchical merging [11, 9] (Fig. 8b). See [34, Chapter 5] for a detailed description of the complete recovery of 3D structures and [12, Chapter 19] for more details of the auto-calibration processure.

The values of the scales and point amounts for each structure are returned as the output of the algorithm. The processing time on an i7-2617M 1.5GHz PC is also given.

### 4.1 2D Lines and 3D Planes

We first examine the use of MISRE in the estimation of linear geometric primitives. For multiple 2D lines, the noisy objective function is

 θ1xi+θ2yi−α≃0i=1,…,nin. (19)

The input variable is identical with the carrier vector .

Five lines are placed in a plane (Fig.7a) and corrupted with different two-dimensional Gaussian noise. They have inlier points, and

, respectively. Another 350 unstructured outliers are uniformly distributed in the image. The amount of points inside each inlier structure is small compared to the entire data.

With , a test result is shown in Fig.7b. The algorithm recovers six structures

 redgreenbluecyanyellowpurplescale:9.618.728.137.144.2370.8inliers:321282240161106240strength:33.415.18.54.32.40.6.

The first five structures are inliers with stronger strengths. The sixth structure, is formed by outliers distributed over the whole image.

When the randomly generated inputs are tested independently for 100 times, the first four lines are correctly segmented in all the tests. In the other six tests the weakest line () is not correctly located. Of the 94 correct estimations, the average result of the scale estimates and the classified inlier amounts as well as their respective standard deviations are

 scale:10.4819.9429.3636.8638.17(1.17)(2.44)(5.30)(10.40)(18.29)inliers:335.9285.8240.8155.693.4(8.2)(9.5)(21.3)(27.0)(28.4).

The average processing time is 0.58 seconds. The estimated scale covers about area of an inlier structure. In general, the number of classified inliers is larger than the true amount due to the presence of outliers in the same area.

In Fig.7c and Fig.7d, the Canny edge detection extracts similar sizes of input data (8310 and 8072 points) from two real images. Again with , the six strongest line structures are superimposed over the original image in Fig.7e. In Fig.7f the three line structures together with the first outlier structure are shown. The processing time depends on the number of structures that detected by the estimator, these two estimations take 7.44 and 4.35 seconds, respectively.

For multiple 3D planes, the objective function for the inlier points is also linear

 θ1xi+θ2yi+θ3zi−α≃0i=1,…,nin. (20)

The input data are the 3D coordinates of the point cloud dataset.

In Fig.8a we show a sample image used in the SfM algorithm [33, 11], and obtain the point cloud as in Fig.8b. The total of 23077 points also with outliers, are recovered from 70 2D images. After 7.04 seconds, with , the estimator locates six planes as shown in Fig.8c and Fig.8d with 21758 inlier points in total.

### 4.2 2D Ellipses

In the next experiment multiple 2D ellipses are estimated. The noisy objective function is

 (yi−yc)⊤Q(yi−yc)−1≃0i=1,…,nin (21)

where is a symmetric positive definite matrix and is the position of the ellipse center. Given the input variable , the carrier is derived as . The condition also has to be satisfied in order to represent an ellipse. We also enforce the constraint that the major axis cannot be more than 10 times longer than the minor axis, to avoid classifying line segment as a part of a very flat ellipse.

The transpose of the Jacobian matrix is

 J⊤x|y=[[]ccccc102xy0010x2y]. (22)

The ellipse fitting is a nonlinear estimation and biased, especially for the part with large curvature. When the inputs are perturbed with zero mean Gaussian noise with , the standard deviation of carrier vector relative to the true value is not zero mean

 E(x−xo)=[0   0   σ2g   0   σ2g]⊤ (23)

since the carrier contains terms. A bias in the estimate can be clearly seen when only a small segment of the noisy ellipse is given in the input. Taking into account also the second order statistics in estimation still does not eliminate the bias. See papers [16], [27] and their references for additional methods.

In Fig.9a three ellipses are placed with 350 outliers in the background. The inlier structures have and . The smallest ellipse with is corrupted with the largest noise . We use in the ellipse fitting experiments. When tested (Fig.9b), four ellipses are recovered

 redgreenbluecyanscale:12.128.948.01321.2inliers:337292222248strength:28.010.14.60.2.

Based on results sorted by strength, the first three structures are inliers and are returned first.

When the estimation is repeated 100 times, the three inlier structures are correctly located 97 times, while in the other three tests the smallest ellipse is not estimated correctly. From the 97 correct estimations, the average scales, the classified inlier amounts, along with their standard deviations are

 scale:11.6021.5932.87(1.54)(3.64)(14.71)inliers:336.2272.9196.4(8.2)(27.5)(53.2).

The average processing time is 3.28 seconds.

When the outlier amount reaches the limit, the inlier structure with weakest strength may no longer be sorted before the outliers. The scale estimate becomes inaccurate due to the heavy outlier noise, and the outliers can form more dense structure with comparable strength. When 800 outliers are placed in the image, a test gives the result in Fig.9c. The outlier structure (blue) has a strength of 4.8, while the value of the inliers (cyan) is 3.9. However, the first two inlier structures are still recovered due to their stronger strengths.

We also gives an example to show one of the limitation explained in Section 3.4, when the inlier strength is too weak to tolerate more outliers. In Fig.9d two inlier structures interact, and the mean shifts converge to incorrect modes.

From Canny edge detection, 4343 and 4579 points are obtained from two real images containing several objects with elliptic shapes, as shown in Fig.10a and Fig.10b. With , the three strongest ellipses are drawn in Fig.10c, superimposed over the original images. The processing time is 18.90 seconds in this case. In Fig.10d the estimation takes 23.14 seconds to detect four strongest ellipses, which are inlier structures. After 100 repetitive tests using the data shown in Fig.10b, only the first two ellipses (red and green) are detected reliably in 98 times. The other two ellipses (blue and cyan) have smaller amounts of inliers and therefore are less stable. The data acquired from Canny edge detection do not necessarily render the overall inlier structures in a more dense state. Preprocessing on the edge data is generally required for better performance.

### 4.3 3D Spheres

For spherical surface fitting in point cloud, the objective function of is

 (x−a)2+(y−b)2+(z−c)2−r2≃0 (24)

a carrier vector is derived with carriers . The transpose of the Jacobian matrix of a spherical surface is derived as

 J⊤x|y=⎡⎢⎣[]cccc2xi1002yi0102zi001⎤⎥⎦. (25)

A sample image is shown in Fig.11a where a toy with spherical surfaces is portrayed. In Fig.11b the 3D point cloud is generated with [1], containing 10854 points from 36 images in 2D. A large number of points in the background have to be rejected as outliers. With , we locate two inlier structures, as shown in red and green colors in Fig.11c and Fig.11d. A total of 3504 points are inliers and the estimation took 7.24 seconds.

### 4.4 3D Cylinders

A cylinder aligned with the Z-axis is defined by the equation

 (x−a)2+(y−b)2−r2=0 (26)

stand for the 2D coordinates where Z-axis passes through the XY-plane, and is the radius. With the input variable , this relation can be reformulated by a quadric matrix where is a symmetric matrix

 P′=λ[D′d′d′Ta2+b2−r2] D′=⎡⎢⎣100010000⎤⎥⎦  d′=⎡⎢⎣−a−b0⎤⎥⎦ (27)

when an euclidean transformation is applied , a general cylinder under rotation and translation is found with

 P=M−TP′M−1=[DddTd]. (28)

has nine unknown parameters up to a scale. A general cylinder have only five degrees of freedom, four for its axis of rotation and one for radius.

Several solutions of the cylinders were described in [3], computed at elemental subsets with various numbers of points from 5 to 9. The nine-point-solution is used in this experiment, where the carrier vector is derived as . An elemental subset consisting of nine points gives an over-determined solution, and the parameters in should be constrained for a cylinder. From equations (4.4) and (28

), it is easy to prove that two of the three singular values of matrix

are identical and the third one is zero, and

is an eigenvector of

. These constraints should be verified for each elemental subset.

The transpose of the Jacobian matrix is

 J⊤xi|yi=⎡⎢⎣2xiyizi0001000xi02yizi001000xi0yi2zi001⎤⎥⎦. (29)

A cylindrical pole is shown in Fig.12a. From 54 images we generate the point cloud (Fig.12b) through the structure from motion (SfM) algorithm. The dataset contains 7241 points with most on the ground being outliers for cylinder detection. With , after 18.55 seconds the single cylinder is located in the noisy dataset containing 568 points, as the red points shown in Fig.12c and Fig.12d.

In Fig.12e another sample image is shown. A total of 6500 vertices are obtained in Fig.12f from the mesh rebuilt in [1], by using 22 images. The two bottles with cylindrical shapes are detected in 12.65 seconds. The two inlier structures are shown in Fig.12g and Fig.12h containing 2262 points.

### 4.5 Fundamental Matrices

The next experiment shows the estimation of the fundamental matrices. The corresponding linear space was introduced in Section 2.1. The matrix is rank-2 and a recent paper [5]

solved the non-convex problem iteratively by a convex moments based polynomial optimization. It compared the results with a few RANSAC type algorithms. The method required several parameters and in each example only one fundamental matrix was recovered.

The fundamental matrix cannot be used directly to segment objects with only translational motions, as proved in [2]. Each example of Fig.13 shows the movement of multiple rigid objects, where a large enough rotation exists. The point correspondences are extracted by OpenCV with a distance ratio of 0.8 for SIFT [19], giving 608, 614 and 457 matches, respectively. With , the structures are retained as

 Fig.???aredgreenbluescale:0.560.7311.78inliers:40710151strength:727.3139.34.33.
 Fig.???bredgreenbluecyanscale:0.460.401.1210.42inliers:1929622147strength:413.4242.8196.84.5.
 Fig.???credgreenbluecyanyellowscale:0.220.720.650.7023.9inliers:135117844843strength:623.0161.5129.068.21.8.

The estimations take 1.75, 2.30 and 2.10 seconds for these three cases. In real images, the outlier structures can be easily filtered out since they have much larger scales than the inliers. It can be observed that the scales of the inlier structures are very close, therefore the methods with fixed thresholds may be used here. However, if the images are scaled before estimation, the error in the inliers will change proportionally. Correct scale estimate can only be found adaptively from the input data.

As discussed in Section 3.4, the first (red) and the fourth (cyan) structures obtained from Fig.13c can be fused as a single structure. This merge has to be done by post-processing in the input space but also requires a threshold from the user.

In SIFT matches false correspondences always exist. If the images contain repetitive features, such as the exterior of buildings, parametrization of the repetitions can reduce the uncertainty [25]. Preprocessing of the images is not described in this paper, and we will not explain it further.

### 4.6 Homographies

The final example is for 2D homography estimation. Each inlier structure is represented by a matrix , which connects two planes inside the image pair

 y′i≃Hyi,i=1,…,nin (30)

where and are the homogeneous coordinates in these two images.

As mentioned in Section 2, the homography estimation has . The input variables are . Two linearized relations can be derived from the constraint (30

) by the direct linear transformation (DLT)

 Aih=⎡⎢ ⎢⎣−y⊤i−0⊤3−x′iy⊤i0⊤3−y⊤i−y′iy⊤i⎤⎥ ⎥⎦⎡⎢ ⎢ ⎢⎣h1h2h3⎤⎥ ⎥ ⎥⎦≃02 . (31)

The matrix is and both rows satisfy the relations with the vector derived from the matrix vec.

The carriers are obtained from the two rows of

 x[1] = [−x  −y  −1  0  0  0  x′x  x′y  x′]⊤ x[2] = [0  0  0  −x  −y  −1  y′x  y′y  y′]⊤. (32)

The transpose of the two Jacobians matrices are

 J⊤x[1]i|y J⊤x[2]i|y (33)

Based on Section 2.2, for every only the larger Mahalanobis distance is used for each input .

The motion segmentation involves only small translation in 3D in Fig.14a and Fig.14b, both images are taken from the Hopkins 155 dataset. With , the processing time is 1.12 and 1.09 seconds for the inputs containing 990 and 482 SIFT point pairs, respectively. As mentioned in Section 3.4, in Fig.14a the estimator cannot separate these two 3D planes on the bus because the 2D homographies corresponding to them are very similar. The condition (12) does not stop where it should since the points on either 2D planes are not dense enough. The example shows that the desired results can only be obtained by increasing the amount of inliers from preprocessing. In Fig.14b, the three objects can be correctly separated in spite of the very small motions, due to the stronger strengths in all the inlier structures.

 Fig.???aredgreenbluecyanscale:1.621.124.49203.11inliers:71310167105strength:440.289.514.90.5
 Fig.???bredgreenbluecyanscale:0.200.170.565.16inliers:1078816589strength:529.6517.7293.417.3.

The importance of preprocessing can also be seen in [26], where the geometric and appearance priors were used to increase the amount of consistent matches before estimation, when the PROSAC [6] failed in the presence of many incorrect matches. In Fig.14c, the 713 point pairs from OpenCV SIFT are used, while the more dense data from datasets [22] with 1940 points are tested in Fig.14d for comparison. After the estimations, MISRE returns only two inlier structures in Fig.14c, and four inliers in Fig.14d. With denser inlier points, more inlier structures can be detected.

## 5 Discussion

A new robust algorithm was presented which does not require the inlier scales to be specified by the user prior to the estimation. It estimates the scale for each structure adaptively from the input data, and can handle inlier structures with different noise levels. Using a strength based classification, the inlier structures with larger strengths are retained while a large quantity of outliers are removed.

In Section 5.1 we will compare MISRE with other robust estimators. In Section 5.2 open problems are reviewed.

### 5.1 Comparison with Other Robust Estimators

The same inlier/outlier setting in Fig.7a is used in Fig.15a for comparison of 2D lines estimation. In Fig.15b MISRE successfully locates all the five inlier structures, and returns them first based on strengths. However, as Fig.15c shows, the J-linkage fails to recover the lines without given a correct inlier scale for each structure, and many split lines are returned. In Fig.15d, the T-linkage cannot fully handle the estimation with different inlier scales, and returns only two inlier structures.

The reconstructed point cloud in [9] (Fig.16a) from 48 2D images containing 11094 points in 3D were extracted (Fig.16b). Both J-linkage and MISRE segment the point cloud into correct planes in Fig.16c and Fig.16d. However, this estimation process takes 330 seconds in J-linkage’s MATLAB implementation, while only 10.2 seconds in MISRE.

Finally in Fig.17, we show the comparison of homography estimations with gpbM and RCMSA. By comparing with the ground truth, the correct/incorrect classification for each structure is listed in the following table (use colors in Fig.17a and Fig.17b as references):

 Fig.???aredgreenbluecyanyellowMISRE:491/1479/9499/29125/053/0gpbM:497/7471/10492/28124/1N/ARCMSA:499/1485/3498/8133/553/9
 Fig.???bredgreenbluecyanyellowMISRE:495/20155/0499/11499/559/0gpbM:495/16154/1498/5498/1N/ARCMSA:495/18155/2499/5499/445/0

The results obtained from these three methods are comparable. However, since RCMSA requires tuning on the internal parameters, the prior knowledge is always required. MISRE significantly reduces the processing time (in seconds)

 Proccessing TimeMerton College% UnionhouseMISRE:3.073.78gpbM:477495RCMSA:24.5825.40

Both gpbM and RCMSA take much longer processing time in their respective estimation process, which again verifies the efficiency of the new estimator MISRE.

### 5.2 Open Problems

We will list several open problems, where further research and experiments are still needed.

Assume that the measurements of the input points have different covariances which are not specified. In many computer vision problems this situation is neglected but can still exist. The homoscedastic inlier covariances have the form . with -s unknown. The computation for the covariance of the carrier (5) places the -s into the product of two Jacobian matrices. A possible solution is to start with a uniform , that is, . Each inlier structure may not attract the quasi-correct amount of points. Then for each inlier structure separately, consider a local region where the inliers are located. Apply the entire estimation again only on the data inside this region, in this way a more accurate could be found and to update the final estimate.

In face image classification or projective motion factorization, the objective functions have only one carrier vector , but the estimate is an matrix and a -dimensional vector . Since is one dimensional, here we use instead of . The covariance of is , with unknown. This gives a symmetric Mahalanobis distance matrix for

 Di=√(x⊤i\boldmathΘ−\boldmathα)⊤H−1i(x⊤i\boldmathΘ−\boldmathα% ) (34)

which could be expressed as the union of vectors . A possible solution is to order the Mahalanobis distances for each column separately, and collect the inputs corresponding to the minimum sum of distances for of the data. The matrices are reduced to initial sets, one for each dimension.

Apply independently times the expansion process described in Section 3.1 and define the diagonal scale matrix with . The covariance matrix is computed as The second step in the algorithm, the mean shift, is now multidimensional and further experiments will be needed to verify the feasibility of this solution.

If an image contains, say, both planes and spheres, there is no clear way to estimate both of them properly. See for example, Fig.11. Supposing we start with the planes, then some points from the spheres may be misclassified as planes. These points should be put back into the input data for another sphere estimation, otherwise some spheres may not be detected. The same problem arises when we estimate the spheres first. The correct separation of multiple types of structures could require supplemental processing.

The Python/C++ program for the robust estimation of multiple inlier structures is posted on our website at

rci.rutgers.edu/riul/research/code/

MULINL/index.html.

## References

• [1] “Autodesk ReMake.”
• [2] S. Basah, A. Bab-Hadiashar, and R. Hoseinnezhad, “Conditions for motion-background segmentation using fundamental matrix,” IET Comput. Vis., vol. 3, pp. 189–200, 2009.
• [3] C. Beder and W. Förstner, “Direct solutions for computing cylinders from minimal sets of 3D points,” in ECCV2010, volume 3952, Springer, pp. 135–146.
• [4] H. Chen and P. Meer, “Robust regression with projection based M-estimators,” in ICCV2003, pp. 878–885.
• [5] Y. Cheng, J. A. Lopez, O. Camps, and M. Sznaier, “A convex optimization approach to robust fundamental matrix estimation,” in CVPR2015, pp. 2170–2178.
• [6] O. Chum and J. Matas, “Matching with PROSAC - Progressive sample consensus,” in CVPR2005, volume I, pp. 220–226.
• [7] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Trans. Pattern Anal. Mach. Intel., vol. 24, pp. 603–619, 2002.
• [8] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,” IEEE Trans. Pattern Anal. Mach. Intel., vol. 35, pp. 2765–2781, 2013.
• [9]

M. Farenzena, A. Fusiello, and R. Gherardi, “Structure-and-motion pipeline on a hierarchical cluster tree,” in

ICCV Workshops, 2009, pp. 1489–1496.
• [10] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,” Comm. Assoc. Comp. Mach, vol. 24, pp. 381–395, 1981.