Subspace clustering aims to cluster data points into separate subspaces, with the dimension of the subspaces typically much smaller than that of the ambient space. Examples include vanishing point detection, rigid motion segmentation [50, 56, 64] and face clustering. To make the problem tractable, traditional subspace clustering approaches tend to make various assumptions, such as data lying on a linear manifold, independence between subspaces, data drawn from a single type of subspace, a known number of models, etc.
Despite the considerable amount of effort, there are still major lacunae in this research. Firstly, many real-world problems consist of data drawn from a union of multiple types of subspaces. We term this problem multi-type subspace clustering. Fig. 1 shows some examples: a toy example of a line, a circle and ellipses co-existing together, and two real-world motion segmentation scenarios. In the latter two scenarios, the appropriate model to fit the foreground object motions can waver between affine motion, homography, fundamental matrix, and even non-rigid motion, with no clear dividing boundary between them. With few exceptions [2, 50, 56], existing works have not considered this realistic scenario. Even if one attempts to fit multiple types of model sequentially, it is non-trivial to decide the type when the dichotomy of the models is unclear in the first place, e.g. when is the rotation dominant enough that a homography becomes a better model than a fundamental matrix? For non-rigid motions, an analytic subspace model can be hard to define, so neither the hypothesize-and-test nor the algebraic approach can be easily applied.
Secondly, for problems with a significant number of models, the traditional hypothesize-and-test approach is often overwhelmed by sampling imbalance, i.e., points from the same subspace represent only a minority, rendering the probability of hitting upon the correct hypothesis very small. This problem becomes severe when a large number of data samples are required for hypothesizing a model (e.g., eight points are needed for a linear estimation of the fundamental matrix and five points for fitting an ellipse). Moreover, optimal performance inevitably requires tuning many parameters, among which the most sensitive include those for deciding what constitutes an inlier for a model [31, 32], for sparsifying the affinity matrices [24, 64], and for selecting the model type. Often, dataset-specific tuning is required, with very little theory to guide it.
Another open challenge in subspace clustering is to automatically determine the number of models, also referred to as model selection in the literature [53, 4, 28, 24]. Traditional methods are based upon the statistical analysis of the residual of the clustering [53, 44]. Other methods approach the problem using various heuristics, including analyzing eigenvalues [69, 60], over-segmenting and merging [28, 24], soft thresholding, or adding penalty terms. Most of the above works require extensive parameter tuning and have never been tested on data drawn from mixed types of models. Lastly, hypothesize-and-test methods have to go through an expensive sampling step, whereas analytic approaches have to contend with solving complex optimization problems. Thus, both approaches suffer from slow inference (as evidenced by our experimental comparisons), which is a serious limitation for real-time applications.
With the above considerations, we propose SubspaceNet, a deep network that learns appropriate feature embeddings from input feature points without the need to manually design a similarity metric or to know the subspace model a priori. The learnt feature representation allows clusters to be readily identified using off-the-shelf methods, even when the underlying data are drawn from a union of mixed types of models, when the dividing boundary between these types of subspaces is unclear (e.g. the transition from a circle to an ellipse), or when the underlying subspace is not analytically expressible (e.g. non-rigid motion). Our network consists mainly of stacked multi-layer perceptrons (mlps). Each output neuron of an mlp computes an affine function of its input, which describes a linear subspace. Each mlp layer therefore defines up to as many different subspaces as it has output neurons, and these can be stacked together to define convex polytopes delimited by multiple linear cuts in the original space. More importantly, by coupling mlps with non-linear activation functions and stacking the resultant non-linear features into a hierarchy, we can approximate very complex non-linear subspaces in the ambient space. At each mlp layer, feature points are represented as responses (distances) to the subspaces. This is analogous to the concept of the Ordered Residual Kernel (ORK) in [4]: feature points of the same model display similar responses to the set of hypothesized subspaces, and these responses can be regarded as a new form of feature representation. Here, given labelled data (inlier points for each model and outliers), the network learns the appropriate subspace filters (mlps) that produce feature embeddings (responses to mlps) amenable to grouping into the respective, possibly mixed, models. The preference among the various mixed types of models is also decided by the network in a data-driven manner, without having to tune a large number of system parameters.
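The idea that a layer's responses act as a feature representation can be illustrated with a minimal sketch. It assumes a single linear layer whose output neurons each encode one hyperplane, and it reports the signed point-to-hyperplane distance as the response; the function name, weights and biases are hypothetical and not taken from the paper.

```python
import math

def layer_responses(point, weights, biases):
    """Signed distances from `point` to the hyperplanes w.x + b = 0."""
    responses = []
    for w, b in zip(weights, biases):
        norm = math.sqrt(sum(wi * wi for wi in w))
        activation = sum(wi * xi for wi, xi in zip(w, point)) + b
        responses.append(activation / norm)
    return responses

# Hypothetical layer with two output neurons, i.e. two lines in 2D:
# neuron 0 encodes x - y = 0, neuron 1 encodes x + y - 2 = 0.
WEIGHTS = [(1.0, -1.0), (1.0, 1.0)]
BIASES = [0.0, -2.0]

if __name__ == "__main__":
    # The first two points lie on line 0, the last two on line 1.
    for p in [(0.0, 0.0), (3.0, 3.0), (0.0, 2.0), (2.0, 0.0)]:
        print(p, [round(r, 3) for r in layer_responses(p, WEIGHTS, BIASES)])
```

Points lying on the same line share a zero response to the neuron that encodes it, which is the sense in which responses to a set of subspaces can separate models.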
We summarize our contributions as follows. (i) First, we address multi-type subspace clustering, i.e. data drawn from mixed types of (possibly non-analytic) models. (ii) Our solution naturally affords the ability to handle model selection and sampling imbalance. (iii) We propose a subspace clustering network (SubspaceNet) built by stacking multi-layer perceptrons, and achieve state-of-the-art performance on three datasets. The SubspaceNet is more effective than alternative networks designed for sparse sets of feature points. (iv) We propose a more effective metric learning loss that optimizes the distribution of the learnt feature embedding. (v) Our SubspaceNet is highly efficient at the inference stage and outperforms existing non-deep approaches by a large margin.
2 Related Work
Subspace Fitting: Early approaches address this in a sequential RANSAC fashion [56, 58, 21] by iteratively fitting and removing inliers. J-Linkage and T-Linkage simultaneously consider the interactions between all points and hypotheses; the final partition is achieved by clustering. These greedy algorithms often do not perform well under high noise levels. Global algorithms have also been proposed to minimize an energy with various regularization terms, including spatial regularization (PEaRL) [18] and a label count penalty. To eschew the problem of having to set thresholds, the ORK approach [4, 5] ranks the hypotheses according to data preference rather than absolute residuals. Analytic approaches are characterized by elegant mathematical formulations, including those based on the sparsity [9] and low-rank assumptions and their variants. Many of the preceding works adopt spectral clustering for the final grouping and assume a known number of models, and only a few consider the model selection problem, e.g., [1, 28, 29, 48]. Even fewer works [2, 13, 50, 56] consider the problem of fitting multiple models of various types, and in these few works the types are assumed to be known a priori and well-defined, which is often not realistic.
Deep Learning for Geometric Modeling Problems: Using deep learning to solve geometric model fitting has received growing attention. Dense approaches use raw images to model the transformation between image pairs as a homography [7] or a non-rigid transformation. Camera pose has also been estimated directly from image sequences. DSAC [3] learns to extract a geometric model from sparse feature correspondences in a manner akin to RANSAC. The ability to learn representations from sparse points was also developed recently [39, 40], and has been exploited to fit an essential matrix from noisy correspondences. Despite the promising results, none of the existing works have considered generic model fitting or, more importantly, fitting data drawn from multiple models and even multiple types. In our work, we formulate the generic multi-type fitting problem as one of learning good representations for clustering.
Deep Learning for Clustering: Unsupervised approaches tackle the problem by finding a latent embedding that minimizes the reconstruction loss of an autoencoder [17, 59, 25]. They are further combined with various losses for clustering objectives [52, 67, 20, 62]. Among these, the k-means loss optimizes the point-to-center distance, and the subspace self-expressiveness objective was considered for discovering linear subspaces in the latent space [20]. In our tasks, there is a multiplicity of geometric models that are equally valid and subtly differentiated. For instance, given images of a rigidly moving cube, are we supposed to group the trajectory features by a single fundamental matrix or by multiple homographies? It would be difficult for unsupervised networks to know the preference without any form of supervision.
In supervised approaches, labelled data are used to learn a feature embedding amenable to clustering. With the advent of deep neural networks, works in metric learning have focused on designing losses amenable to clustering labelled data [6, 45, 47, 15, 49]. Among these, one approach minimizes the L2 distance between the predicted and ground-truth affinities and provides a competitive baseline. To further take into account the global distribution of the data points, we propose the clustering-specific loss MaxInterMinIntra, which optimizes the inter-cluster separation and intra-cluster variance and which our experiments show to be more effective than existing alternatives.
3.1 Network Architecture
We denote the input sparse data with points as where each individual point is . The input sparse data could be geometric shapes, feature correspondences in two frames or feature trajectories in multiple frames. We further denote the one-hot key encoded labels accompanying the input data as where and is the number of clusters or partitions of the input data.
in that both exploit the power of mlps. We have noted that each mlp layer works as multiple linear subspaces and that the response to each mlp layer serves as the new feature representation of the input feature points. Since mlps are not scale invariant, a normalization layer is necessary before each mlp layer to center all feature points at the origin with unit variance. This is realized by a standard z-score normalization on each input dimension, denoted as the Z-score Norm layer in Fig. 2. We note that this step resembles the previously proposed context norm (CN). However, the role of CN was originally ascribed to capturing the relation between feature points, whereas here we believe the role of Z-score Norm is more specifically that of ensuring uniform scale. We adopt the same ResNet structure as CorresNet for training deeper networks, and the depth, i.e. the number of Subspace Blocks, is fixed at 50 for all experiments. For the output layer, we do not apply any activation but instead conduct L2 normalization on each sample. The output embedding is denoted as . To make the output clustering-friendly, we apply a differentiable, clustering-specific loss function that measures the match of the output feature representation with the ground-truth labels. The problem now becomes that of learning a CorresNet backbone that minimizes this loss.
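The two normalization steps described above can be sketched as follows. This is a minimal plain-Python illustration, assuming points are stored as rows and that z-scoring uses the population standard deviation with a small hypothetical `eps` for numerical safety; the function names are illustrative and not the authors' implementation.

```python
import math

def zscore_norm(points, eps=1e-8):
    """Center each input dimension at zero and scale it to unit variance."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((p[d] - means[d]) ** 2 for p in points) / n)
        for d in range(dims)
    ]
    return [[(p[d] - means[d]) / (stds[d] + eps) for d in range(dims)]
            for p in points]

def l2_normalize(points):
    """Scale every sample (row) to unit Euclidean norm."""
    normalized = []
    for p in points:
        norm = math.sqrt(sum(v * v for v in p)) or 1.0  # guard all-zero rows
        normalized.append([v / norm for v in p])
    return normalized

if __name__ == "__main__":
    # The second dimension is the first scaled by 10; z-scoring removes that.
    raw = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    print(zscore_norm(raw))
    print(l2_normalize([[3.0, 4.0], [0.0, 2.0]]))
```

After z-scoring, the two dimensions of `raw` become numerically identical, which is the uniform-scale property the normalization layer is meant to provide.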
3.2 Clustering Loss
We expect our clustering loss function to have the following characteristics. First, it should be invariant to permutation of the models, i.e. the order of the models is interchangeable. Second, the loss must be adaptable to a varying number of groups. Lastly, the loss should enable good separation of data points into clusters. We consider the following loss functions.
L2Regression Loss: Given the ground-truth labels and the output embeddings , the ideal and reconstructed affinity matrices are respectively,
The training objective is to minimize the difference between and measured by element-wise L2 distance .
The above L2Regression loss is differentiable w.r.t. the output embedding. Since the output embedding is L2-normalized, the inner product between any two point representations lies in [-1, 1].
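The L2Regression objective can be sketched in plain Python. This is a minimal illustration, assuming the ideal affinity is the Gram matrix of the one-hot labels (1 for same-cluster pairs, 0 otherwise) and the reconstructed affinity is the Gram matrix of the L2-normalized embeddings; the function names are hypothetical.

```python
def gram_matrix(rows):
    """Pairwise inner products between all rows."""
    return [[sum(a * b for a, b in zip(r, s)) for s in rows] for r in rows]

def l2_regression_loss(onehot_labels, embeddings):
    """Sum of squared element-wise differences between the two affinities."""
    ideal = gram_matrix(onehot_labels)      # assumed ideal affinity
    reconstructed = gram_matrix(embeddings)  # assumed reconstructed affinity
    return sum(
        (k - k_hat) ** 2
        for row, row_hat in zip(ideal, reconstructed)
        for k, k_hat in zip(row, row_hat)
    )

if __name__ == "__main__":
    labels = [[1, 0], [1, 0], [0, 1]]
    separated = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    print(l2_regression_loss(labels, separated))
    print(l2_regression_loss(labels, collapsed))
```

Because the loss depends only on pairwise affinities, relabelling the clusters leaves it unchanged, which matches the permutation-invariance requirement stated at the start of this section.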
Cross-Entropy Loss: As alternative to the L2 distance, one could measure the discrepancy between and as KL-Divergence. Since , where is the entropy function and
is the sigmoid function, with fixed, we simply need to minimize the cross-entropy which yields the following element-wise cross-entropy loss,
The cross-entropy loss is more likely to push points and of the same cluster together faster than L2Regression, i.e. inner product and those of different clusters apart, i.e. inner product .
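The element-wise cross-entropy can be sketched in the same style. This is an illustration under stated assumptions: the target for a pair is 1 if both points share a cluster and 0 otherwise, and the prediction is the sigmoid of the inner product of the two embeddings; the names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_entropy_loss(onehot_labels, embeddings):
    """Binary cross-entropy summed over all ordered pairs of points."""
    total = 0.0
    for y_i, z_i in zip(onehot_labels, embeddings):
        for y_j, z_j in zip(onehot_labels, embeddings):
            target = dot(y_i, y_j)            # 1 if same cluster, else 0
            pred = sigmoid(dot(z_i, z_j))     # inner product lies in [-1, 1]
            total -= target * math.log(pred) + (1 - target) * math.log(1 - pred)
    return total

if __name__ == "__main__":
    labels = [[1, 0], [1, 0], [0, 1]]
    opposed = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
    collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
    print(cross_entropy_loss(labels, opposed))
    print(cross_entropy_loss(labels, collapsed))
```

With binary targets the entropy of the target is zero, so minimizing this cross-entropy is the same as minimizing the KL divergence between the target and predicted affinities.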
MaxInterMinIntra Loss: Both of the above losses consider the pairwise relation between points; the overall point distribution in the output embedding is not explicitly considered. We now propose a new loss which takes a more global view of the point distribution rather than just the pairwise relations. Specifically, we are inspired by the classical Fisher LDA [10]. LDA discovers a linear mapping that maximizes the distance between class centers/means and minimizes the scatter/variance within each class. Formally, the objective for a two-class problem is written as,
which is to be maximized over the linear mapping. For linearly non-separable problems, one has to design a kernel function to map the input features before applying the LDA objective. Equipped now with more powerful non-linear mapping networks, we adapt the LDA objective, for the multi-class scenario, to perform these mappings automatically as below,
where , and indicating the set of points belonging to cluster . We use the extrema of the inter-cluster distances and intra-cluster scatters (see Fig. 3) so that the worst case is explicitly optimized. Hence, we term the loss MaxInterMinIntra (MIMI). By applying the log operation to the objective, we arrive at the following loss function to be minimized:
One can easily verify that the MaxInterMinIntra loss is differentiable w.r.t. . We provide the gradient in the supplementary material.
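A worst-case inter/intra criterion of this kind can be sketched as follows. This is only one plausible reading, assuming the scatter of a cluster is the mean squared distance to its center, the inter-cluster distance is the squared distance between centers, and the loss is the negative log of (smallest inter-cluster distance) / (largest intra-cluster scatter); the exact definitions and the `eps` guard are assumptions, not the paper's formula.

```python
import math

def _center(points):
    dims = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dims)]

def _sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mimi_loss(embeddings, labels, eps=1e-12):
    """log(max intra-cluster scatter) - log(min inter-cluster distance)."""
    clusters = {}
    for z, label in zip(embeddings, labels):
        clusters.setdefault(label, []).append(z)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    centers = {label: _center(pts) for label, pts in clusters.items()}
    scatters = [
        sum(_sqdist(z, centers[label]) for z in pts) / len(pts)
        for label, pts in clusters.items()
    ]
    keys = list(centers)
    inter = [
        _sqdist(centers[a], centers[b])
        for i, a in enumerate(keys) for b in keys[i + 1:]
    ]
    return math.log(max(scatters) + eps) - math.log(min(inter) + eps)

if __name__ == "__main__":
    tight = [[0.0, 0.0], [0.0, 0.1], [5.0, 0.0], [5.0, 0.1]]
    loose = [[0.0, 0.0], [0.0, 2.0], [1.0, 0.0], [1.0, 2.0]]
    print(mimi_loss(tight, [0, 0, 1, 1]), mimi_loss(loose, [0, 0, 1, 1]))
```

Only the worst pair of centers and the worst cluster contribute to the value, so gradients concentrate on the hardest case, at the cost of ignoring the remaining clusters in any single step.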
During testing, we apply standard K-means to the output embeddings . This step is applicable to both multi-model and multi-type clustering problems, as we do not need to specify explicitly the type of model to fit. If there is a need to estimate the number of models , we examine the K-means residuals defined by,
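Model selection from K-means residuals can be sketched as below. The residual is taken to be the sum of squared distances of the embeddings to their assigned cluster means, and the number of models is picked where the second-order difference of the residual curve is largest (an elbow criterion). That selection rule is an assumption made for illustration, and the function names are hypothetical.

```python
def kmeans_residual(embeddings, assignments):
    """Sum of squared distances from each embedding to its cluster mean."""
    clusters = {}
    for z, a in zip(embeddings, assignments):
        clusters.setdefault(a, []).append(z)
    total = 0.0
    for members in clusters.values():
        dims = len(members[0])
        center = [sum(m[d] for m in members) / len(members) for d in range(dims)]
        total += sum((m[d] - center[d]) ** 2 for m in members for d in range(dims))
    return total

def select_k_by_sod(residuals):
    """Pick K from residuals, where residuals[i] is the residual for K = i + 1."""
    best_k, best_sod = None, float("-inf")
    for i in range(1, len(residuals) - 1):
        # Second-order difference around K = i + 1.
        sod = residuals[i - 1] - 2.0 * residuals[i] + residuals[i + 1]
        if sod > best_sod:
            best_k, best_sod = i + 1, sod
    return best_k

if __name__ == "__main__":
    print(kmeans_residual([[0.0], [2.0], [10.0]], [0, 0, 1]))
    print(select_k_by_sod([100.0, 40.0, 5.0, 4.0, 3.5]))
```

In this form the smallest and largest candidate values of K can never be selected, because the second-order difference needs a neighbour on each side; the candidate range should therefore extend one step beyond the plausible number of models.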
We demonstrate the performance of our network on both synthetic and real world data, with extensive comparisons with traditional geometric model fitting algorithms.
Synthesized Lines, Circles and Ellipses (LCE): Fitting ellipses has been a fundamental problem in computer vision. For each sample, we synthesize four conic curves of three different types in a 2D space: one straight line, two ellipses and one circle. We randomly generate 8,000 training samples, 200 validation samples and 200 testing samples. Each point is perturbed by additive Gaussian noise.
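A sample of this kind could be generated along the following lines. The sketch assumes 50 points per curve, coordinates roughly in [-1, 1], and a noise level `sigma=0.01`; all of these values, the parameter ranges, and the function names are hypothetical placeholders rather than the settings used for the LCE data, and no outliers are added.

```python
import math
import random

def sample_ellipse(n, center, axes, angle, sigma, rng):
    """Noisy points on a rotated ellipse; equal axes give a circle."""
    (cx, cy), (a, b) = center, axes
    points = []
    for _ in range(n):
        t = rng.uniform(0.0, 2.0 * math.pi)
        x, y = a * math.cos(t), b * math.sin(t)
        points.append((
            cx + x * math.cos(angle) - y * math.sin(angle) + rng.gauss(0.0, sigma),
            cy + x * math.sin(angle) + y * math.cos(angle) + rng.gauss(0.0, sigma),
        ))
    return points

def sample_line(n, start, end, sigma, rng):
    """Noisy points on the segment from `start` to `end`."""
    points = []
    for _ in range(n):
        t = rng.random()
        points.append((
            start[0] + t * (end[0] - start[0]) + rng.gauss(0.0, sigma),
            start[1] + t * (end[1] - start[1]) + rng.gauss(0.0, sigma),
        ))
    return points

def make_lce_sample(rng, n_per_curve=50, sigma=0.01):
    """One line, two ellipses and one circle, with per-point labels 0..3."""
    def rand_pt():
        return (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
    def rand_axes():
        return (rng.uniform(0.3, 1.0), rng.uniform(0.1, 0.5))
    radius = rng.uniform(0.2, 0.8)
    curves = [
        sample_line(n_per_curve, rand_pt(), rand_pt(), sigma, rng),
        sample_ellipse(n_per_curve, rand_pt(), rand_axes(),
                       rng.uniform(0.0, math.pi), sigma, rng),
        sample_ellipse(n_per_curve, rand_pt(), rand_axes(),
                       rng.uniform(0.0, math.pi), sigma, rng),
        sample_ellipse(n_per_curve, rand_pt(), (radius, radius), 0.0, sigma, rng),
    ]
    points = [p for curve in curves for p in curve]
    labels = [k for k, curve in enumerate(curves) for _ in curve]
    return points, labels

if __name__ == "__main__":
    pts, lbl = make_lce_sample(random.Random(0))
    print(len(pts), sorted(set(lbl)))
```

Passing an explicit `random.Random` instance keeps each sample reproducible, which makes it easy to regenerate the same training, validation and test splits.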
KT3DMoSeg: This benchmark consists of 22 sequences from the KITTI dataset. Each sequence contains two to five rigid motions. As previously analyzed, the geometric model for each individual motion can range from an affine transformation to a homography to a fundamental matrix, with no clear dividing line between them. We evaluate on this benchmark to demonstrate our network's ability to tackle multi-type clustering. For fair comparison, we crop only the first 5 frames of each sequence for evaluation, so that broken trajectories do not give undue advantage to certain methods.
FBMS59: This dataset was proposed for analyzing video object segmentation based on point trajectories, with 59 sequences in total, of which 29 are for training and 30 for testing. It covers a wide variety of scenes, and the ground-truth is defined over semantic objects with dense masks. Most of the moving objects involve moderate non-rigidity, for which analytic geometric models are hard to define. We evaluate the first-10-frame setting used in prior work for fair comparison. The ground-truth for training is constructed by assigning the trajectories to the nearest label mask, and the evaluation metric is the standard F-measure.
Adelaide RMF Dataset: We are concerned with the two-view motion segmentation task of this dataset. This task consists of 19 frame pairs, each comprising 2 to 5 independent motions. Though it is nominally a single-type, multiple fundamental matrix fitting problem and has been treated as such by the community, we observe moderate degeneracies, i.e. near-planar rigid objects, in this dataset. Hence, we treat it as another multi-type (homography and fundamental matrix) clustering problem.
4.2 Multi-Type Curve Fitting
There is no clear dividing boundary between lines, circles, and ellipses, as they can all be explained by the general conic equation (with the special cases of lines and circles obtained by setting some coefficients to 0):
$a x^2 + b x y + c y^2 + d x + e y + f = 0$ (8)
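The way lines and circles arise as special cases of a general conic can be checked numerically. The sketch below uses the standard six-coefficient form a x^2 + b x y + c y^2 + d x + e y + f = 0; the coefficient names and the example curves are illustrative choices.

```python
def conic_residual(coeffs, x, y):
    """Algebraic residual of the point (x, y) under a general conic."""
    a, b, c, d, e, f = coeffs
    return a * x * x + b * x * y + c * y * y + d * x + e * y + f

# Line x - y = 0: all quadratic coefficients are zero.
LINE = (0.0, 0.0, 0.0, 1.0, -1.0, 0.0)
# Unit circle x^2 + y^2 - 1 = 0: a == c and b == 0.
CIRCLE = (1.0, 0.0, 1.0, 0.0, 0.0, -1.0)
# Axis-aligned ellipse x^2 / 4 + y^2 - 1 = 0: a != c.
ELLIPSE = (0.25, 0.0, 1.0, 0.0, 0.0, -1.0)

if __name__ == "__main__":
    print(conic_residual(LINE, 2.0, 2.0))
    print(conic_residual(CIRCLE, 1.0, 0.0))
    print(conic_residual(ELLIPSE, 2.0, 0.0))
    print(conic_residual(CIRCLE, 2.0, 0.0))  # off the circle, non-zero
```

A line leaves three of the six coefficients at zero, which hints at why fitting the full conic to simpler structures is poorly conditioned: many coefficient vectors explain the same points almost equally well.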
There are two ways to adapt the traditional multi-model fitting methods to this multi-type setting. One approach formulates the problem as fitting multiple models parameterized by the same conic equation in Eq (8), which is termed HighOrder (H.O.) fitting. Alternatively, one could sequentially fit three types of models, which is termed Sequential (Seq.) fitting. For ellipse-specific fitting, the direct least square approach [11] is adopted. Comparisons are made with state-of-the-art multi-model fitting algorithms including T-linkage, RPA and RansaCov. We notice that T-linkage returns extremely over-segmented results in the sequential setting, e.g. more than 10 lines, making classification error evaluation intractable. For our model, we evaluate the three metric learning losses introduced in Section 3.2, namely the L2 Regression loss (L2), the Cross Entropy loss (CE) and the MaxInterMinIntra loss (MIMI), and present the results in Tab. 1. The results are reported with the optimal setting determined on the validation set. We evaluate the performance by two clustering metrics: Classification Error Rate (Error Rate), i.e. the best classification result subject to permutation of the clustering labels, and Normalized Mutual Information (NMI).
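The error rate under the best label permutation can be computed by brute force when the number of clusters is small. The following is a minimal sketch with a hypothetical function name; it tries every one-to-one mapping from predicted to ground-truth cluster ids and keeps the one with the most matches.

```python
from itertools import permutations

def clustering_error_rate(true_labels, pred_labels):
    """Fraction of misclassified points under the best label permutation."""
    true_ids = sorted(set(true_labels))
    pred_ids = sorted(set(pred_labels))
    # Pad with None so that surplus predicted clusters map to "no class".
    size = max(len(true_ids), len(pred_ids))
    targets = true_ids + [None] * (size - len(true_ids))
    best_correct = 0
    for perm in permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        correct = sum(mapping[p] == t for t, p in zip(true_labels, pred_labels))
        best_correct = max(best_correct, correct)
    return 1.0 - best_correct / len(true_labels)

if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    print(clustering_error_rate(truth, [2, 2, 0, 0, 1, 1]))
    print(clustering_error_rate(truth, [2, 2, 0, 0, 1, 0]))
```

The search is factorial in the number of clusters, which is acceptable for the four or five structures considered here; for many clusters an assignment solver such as the Hungarian algorithm would be the usual replacement.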
We make the following observations about the results. First, all our metric learning variants outperform the HighOrder and Sequential multi-type fitting approaches. Second, the all-encompassing model used in the HighOrder approach suffers from ill-conditioning when fitting simpler models; thus, its performance is much inferior to that of Sequential fitting. However, it is worth noting that despite the Sequential approach being given strong a priori knowledge of both the model type and the number of models of each type, its performance is still significantly worse than ours. For qualitative comparison, we visualize the ground-truth and segmentation results of each method in Fig. 4. Our clustering results on the bottom row show success in discovering all individual shapes, with mistakes made only at the intersections of individual structures. RPA failed to discover the ellipses because sampling all 5 inliers amidst the large number of outliers, and fitting an ellipse from even 5 correct support points with noisy coordinates, are both very difficult (the latter has been demonstrated in prior work).
Table 2 (header only): Model | Non-Deep Approaches: GPCA, ALC, LSA, LRR, MSMC, SSC, MVC | Deep Approaches: Unsupervised, Supervised (Different Losses)
4.3 Multi-Type Motion Segmentation
Each sequence of the KT3DMoSeg benchmark often consists of a background whose motion can be explained by a fundamental matrix, while the models for the foreground motions can sometimes be ambiguous due to the limited spatial extent of the objects, thus giving rise to mixed types of models. For example, in Fig. 6, the vehicles in ‘Seq009_Clip01’ and ‘Seq028_Clip03’ can be roughly explained by an affine transformation or a homography, while the oil tanker in ‘Seq095_Clip01’ should be modeled by a fundamental matrix. When the background is dominated by a plane, for instance the quasi-planar row of trees on the right side of the road in ‘Seq028_Clip03’, it is likely to lead to degeneracies in the fundamental matrix estimation. For this dataset, we apply leave-one-out cross-validation; we dub this the ‘Vanilla’ setting. Each sequence has between 10 and 20 frames, so we can further increase the training data by augmenting with all the remaining five-frame clips from each sequence, termed the ‘Augment’ setting. The testing clips (the first five frames of each sequence) are kept the same for both settings. We compare with the conventional non-deep subspace clustering approaches GPCA, LSA, ALC, LRR, MSMC and SSC, and with the multi-view clustering (MVC) methods. For the unsupervised deep clustering approaches, we include the Deep Subspace Clustering Network (DSCN) [20] and Simultaneous Deep Learning and Clustering (DCN) for comparison. For the supervised setting, we compare with the semi-hard triplet loss (SHT), lifted structured feature embedding (LIFT) and the clustering quality metric (NMI), all with the same network architecture. Results are presented in Tab. 2.
Our vanilla approach achieves very competitive performance on all 22 sequences in KT3DMoSeg. In the ‘Augment’ setting, our approach even outperforms the state-of-the-art multi-view clustering approaches (MVC). Of all benchmarked methods, only MVC has considered the multi-type fitting issue. Furthermore, we notice that our proposed MIMI loss is the best among all the alternative losses considered. The unsupervised deep approaches lag behind by a large margin, corroborating our earlier argument about the necessity of exploiting labelled information for complex multi-type subspace clustering problems. The deep approaches are also very efficient at inference, taking only 1.85 seconds to process all sequences (from trajectory input to clustering output), while the best performing non-deep approach (MVC) takes 143.52 seconds. The only faster algorithm (GPCA) performs much worse.
Finally, we present qualitative comparisons in Fig. 6. The SubspaceNet surpasses our expectations in how it performs on ‘Seq009_Clip01’. Here the independently moving car (the yellow group in the ground-truth image) has a flow field that is consistent with the epipolar constraint associated with the background motion (because both translate in the same direction). Without resorting to reconstructing the depth of the car, it would be impossible to separate it from the background. However, criteria involving depth would be very unwieldy to specify analytically in the existing approaches. Here, without any preconceived notion of the geometric model, our network has learnt the requisite criteria to separate the independent motion.
4.4 Non-Rigid Motion Segmentation
We demonstrate the ability to learn non-rigid motion segmentation, which is hard to model with analytic geometric models. We train our model on the training set of 29 sequences and evaluate on the test set of 30 sequences, following the established protocol. For our method, the number of motions is estimated via SOD, with the candidate number of clusters ranging from 1 to 10. SSC, ALC, Spectral Clustering (SC) and MultiCut [22] are compared, and the results are presented in Tab. 3. We observe that our SubspaceNet, for both losses, is superior in performance to all the baseline methods. We further present qualitative comparisons with [35, 33] in Fig. 5. It is evident that the translational model with spatial and color information detects the whole background, but at the cost of over-segmenting the non-rigid foreground, e.g. the lion’s head and the tractor’s wheels. In contrast, our SubspaceNet detects the whole non-rigid foreground while keeping the background segmentation intact. Some objects are missed by all methods, e.g. the horse in the stable in “Farm01”, since it does not move significantly in the first 10 frames.
4.5 Two-View Motion Segmentation
We evaluate the motion segmentation task in the Adelaide RMF dataset. We carry out a leave-one-out cross-validation. For comparability, we report the classification error rate (ErrorRate). The state-of-the-art models being compared include J-Linkage (J-Lnk), T-Linkage (T-Lnk), RPA, RCMSA, ILP-RansaCov (ILP), DGSAC and NMU. The comparisons are presented in Tab. 4. We observe that our SubspaceNet gives competitive results; in particular, see the mean error of our model with the MIMI loss in Tab. 4. We note that this performance is achieved by training on only a very small amount of data (18 sequences) and without any dataset-specific parameter tuning. We also notice that our SubspaceNet is very efficient at the inference stage (1.4 seconds), which is more than 300 times faster than the closest performing method (NMU takes 499.6 seconds).
4.6 Sampling Imbalance
In this section, we further demonstrate the ability of our network to robustly handle sampling imbalance, i.e. the case where the inlier points represent a minority. We demonstrate this on synthetic single-type multi-model fitting problems. Specifically, we synthesize 8,000 training samples and 200 testing samples for each of the types (lines, circles and ellipses) and compare with RPA. The results are presented in Fig. 7. We conclude that our multi-model network performs comparably with RPA on the multi-line segmentation task while outperforming RPA by a large margin on the more challenging multi-circle and multi-ellipse tasks. For RPA, the performance drops sharply from multi-line (blue) to multi-ellipse (green) fitting, with the drop getting more acute as the number of models increases. This suggests that the increasing size of the minimal support set (2 points for a line, 3 points for a circle and 5 points for an ellipse) poses a great challenge for RANSAC-based approaches due to sampling imbalance. More precisely, in a noiseless experiment with multiple models, the chance of hitting the true model in a single sampling falls sharply from the straight line to the ellipse, since it decays exponentially with the size of the minimal support set. It is evident that our multi-model network is less sensitive to the complexity of the model, as the drop in performance (purple and cyan bars) is less significant.
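The effect of the minimal support size on sampling can be made concrete with a small calculation. The sketch assumes K equally sized models, no outliers, and independent uniform draws, so that a minimal sample of p points comes entirely from one given model with probability (1/K)^p; the trial count uses the standard RANSAC confidence formula. The function names and the choice K = 4 are illustrative.

```python
import math

def hit_probability(num_models, support_size):
    """Chance that one minimal sample lies entirely on a given model."""
    return (1.0 / num_models) ** support_size

def trials_needed(p_hit, confidence=0.99):
    """Number of samples so that at least one is all-inlier with `confidence`.

    Requires 0 < p_hit < 1.
    """
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_hit))

if __name__ == "__main__":
    for name, size in [("line", 2), ("circle", 3), ("ellipse", 5)]:
        p = hit_probability(4, size)
        print(name, p, trials_needed(p))
```

Under these assumptions, moving from a line to an ellipse with four models lowers the per-sample hit probability from 1/16 to 1/1024 and raises the required number of samples by well over an order of magnitude, before noise and outliers are even considered.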
4.7 Model Selection
As can be seen from Fig. 6, the point distribution in the learned feature embedding is amenable to model selection. We evaluate the ability of both Second Order Difference (SOD) and Silhouette Analysis (Silh.) to estimate the number of motions. We also compare with alternative subspace clustering approaches with built-in model selection, namely LRR, MSMC, SSC, GPCA and ALC, and additionally apply self-tuning spectral clustering (S.T.) to the affinity matrix obtained in MVC. Among the above competitors, the model selection for GPCA and SSC is implemented with SOD. Performance is evaluated in terms of mean classification error (Mean Err) and correct rate (Correct), i.e. the percentage of samples/sequences with a correctly estimated number of clusters (higher is better). Comparisons are presented in Tab. 5. Thanks to the deep feature learning, both SOD and Silh. applied to our method yield substantially better performance without the need to tune any parameter.
4.8 Further Study
Feature Embedding: We provide a direct visualization of the learnt representations. We use t-SNE to project both the KT3DMoSeg raw feature points (of dimension ten for 5 frames) and the network output embeddings to a 2-dimensional space. Three example sequences are presented in the last row of Fig. 6. We conclude from the figure that: (i) the original feature points are hard to group correctly with K-means; and (ii) after our network embedding, feature points are more likely to be grouped according to their respective motions, regardless of the underlying types of motions.
Dimension of Output Embedding: We investigate the impact of the dimension of the output embedding. We vary the embedding dimension from 3 to 7 for three tasks and present the resulting error rates against the dimension in Fig. 8 (middle). As can be seen, the errors are relatively stable w.r.t. the output embedding dimension from 4 to 7 for all three tasks, with the optimal dimension between 5 and 6, roughly coinciding with the maximal number of clusters for each task (5 motions for KT3DMoSeg and 4 structures for Synthetic). Thus the maximal number of clusters serves as a good heuristic for the dimension of the network output embedding.
Weak Supervision: The SubspaceNet is trained on labelled data points, which are often very costly to obtain compared with image category labels. In this section, we investigate the interaction between weaker supervision, i.e. fewer labelled data points, and performance. Specifically, we randomly subsample the labelled data points of each sequence in KT3DMoSeg and AdelaideRMF MoSeg and train the model with the reduced labelled data while keeping the evaluation protocol unchanged. The results, averaged over 10 trials, are presented in Fig. 8 (right). We observe a very stable error rate across the tested subsample rates, suggesting that the SubspaceNet is robust to fewer annotated data.
Network Comparison: We compare SubspaceNet with alternative networks that are able to learn from sparse sets of data. In particular, we compare with the correspondence network (CorresNet) and PointNet on KT3DMoSeg (KT3D.) and AdelaideRMF MoSeg (Adel.), each trained with the L2 loss and with our MIMI loss. The results are presented in Fig. 8 (left). We observe a significant performance gap between our SubspaceNet and the two alternatives. The proposed MIMI loss is also effective with the alternative networks.
In this work, we investigate training a deep neural network for general multi-type subspace clustering. We formulate the problem as learning non-linear feature embeddings that maximize the distance between points of different clusters and minimize the variance within clusters. For inference, the output features are fed into K-means to obtain the grouping. Model selection is achieved simply by analyzing the K-means residual in a parameter-free manner. Experiments are carried out on both synthetic and real motion segmentation tasks. Comparison with state-of-the-art approaches shows that our network can better deal with multiple types of models simultaneously. Our method is also less sensitive to the sampling imbalance brought about by an increasing number of models, and it is highly efficient at the inference stage. As future work, one could consider including additional texture and color information and adopting a sliding-window technique to handle arbitrarily long sequences.
-  C. Alzate and J. A. Suykens. Multiway spectral clustering with out-of-sample extensions through weighted kernel pca. IEEE transactions on pattern analysis and machine intelligence, 2010.
-  D. Barath and J. Matas. Multi-class model fitting by energy minimization and mode-seeking. In CVPR, 2018.
-  E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, and C. Rother. Dsac-differentiable ransac for camera localization. In CVPR, 2017.
-  T. Chin, H. Wang, and D. Suter. The ordered residual kernel for robust motion subspace clustering. In NIPS, 2009.
-  T.-J. Chin, J. Yu, and D. Suter. Accelerated hypothesis generation for multi-structure robust fitting. In ECCV, 2010.
-  S. Chopra, R. Hadsell, and L. Y. Learning a similiarty metric discriminatively, with application to face verification. In CVPR, 2005.
-  D. DeTone, T. Malisiewicz, and A. Rabinovich. Deep image homography estimation. arXiv preprint arXiv:1606.03798, 2016.
-  R. Dragon, B. Rosenhahn, and J. Ostermann. Multi-scale clustering of frame-to-frame correspondences for motion segmentation. In ECCV, 2012.
-  E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
-  R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179–188, 1936.
-  A. Fitzgibbon, M. Pilu, and R. B. Fisher. Direct least square fitting of ellipses. IEEE Transactions on pattern analysis and machine intelligence, 21(5):476–480, 1999.
-  A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research, 2013.
-  A. Goh and R. Vidal. Segmenting motions of different types by unsupervised manifold clustering. In CVPR, 2007.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  J. R. Hershey, Z. Chen, J. Le Roux, and S. Watanabe. Deep clustering: Discriminative embeddings for segmentation and separation. In ICASSP, 2016.
-  J. Ho, M.-H. Yang, J. Lim, K.-C. Lee, and D. Kriegman. Clustering appearances of objects under varying illumination conditions. In CVPR, 2003.
-  F. J. Huang, Y.-L. Boureau, Y. LeCun, et al. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
-  H. Isack and Y. Boykov. Energy-based geometric multi-model fitting. International Journal of Computer Vision, 2012.
-  P. Ji, M. Salzmann, and H. Li. Shape interaction matrix revisited and robustified: Efficient subspace clustering with corrupted and incomplete data. In ICCV, 2016.
-  P. Ji, T. Zhang, H. Li, M. Salzmann, and I. Reid. Deep subspace clustering networks. In NIPS, 2017.
-  Y. Kanazawa and H. Kawakami. Detection of planar regions with uncalibrated stereo using distributions of feature points. In BMVC, 2004.
-  M. Keuper, B. Andres, and T. Brox. Motion trajectory segmentation via minimum cost multicuts. In ICCV, 2015.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
-  T. Lai, H. Wang, Y. Yan, T. J. Chin, and W. L. Zhao. Motion Segmentation Via a Sparsity Constraint. IEEE Transactions on Intelligent Transportation Systems, 2017.
-  H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
-  H. Li. Two-view motion segmentation from linear programming relaxation. In CVPR, 2007.
-  Z. Li, L. F. Cheong, and S. Z. Zhou. SCAMS: Simultaneous clustering and model selection. In CVPR, 2014.
-  Z. Li, J. Guo, L.-F. Cheong, and S. Zhiying Zhou. Perspective motion segmentation via collaborative clustering. In ICCV, 2013.
-  G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, and Y. Ma. Robust recovery of subspace structures by low-rank representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
-  L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.
-  L. Magri and A. Fusiello. T-linkage: A continuous relaxation of J-linkage for multi-model fitting. In CVPR, 2014.
-  L. Magri and A. Fusiello. Robust Multiple Model Fitting with Preference Analysis and Low-rank Approximation. In BMVC, 2015.
-  L. Magri and A. Fusiello. Multiple model fitting as a set coverage problem. In CVPR, 2016.
-  I. Melekhov, J. Ylioinas, J. Kannala, and E. Rahtu. Relative camera pose estimation using convolutional neural networks. In International Conference on Advanced Concepts for Intelligent Vision Systems, 2017.
-  P. Ochs, J. Malik, and T. Brox. Segmentation of moving objects by long term video analysis. IEEE transactions on pattern analysis and machine intelligence, 2014.
-  H. Oh Song, S. Jegelka, V. Rathod, and K. Murphy. Deep metric learning via facility location. In CVPR, 2017.
-  H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
-  T. T. Pham, T.-J. Chin, J. Yu, and D. Suter. The random cluster model for robust geometric fitting. IEEE transactions on pattern analysis and machine intelligence, 2014.
-  C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
-  C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017.
-  S. Rao, R. Tron, R. Vidal, and Y. Ma. Motion segmentation in the presence of outlying, incomplete, or corrupted trajectories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
-  I. Rocco, R. Arandjelovic, and J. Sivic. Convolutional neural network architecture for geometric matching. In CVPR, 2017.
-  C. Rother. A new approach to vanishing point detection in architectural environments. Image and Vision Computing, 2002.
-  P. J. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 1987.
-  F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
-  F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
-  K. Sohn. Improved Deep Metric Learning with Multi-class N-pair Loss Objective. In NIPS, 2016.
-  M. Soltanolkotabi, E. J. Candes, et al. A geometric analysis of subspace clustering with outliers. The Annals of Statistics, 2012.
-  H. O. Song, S. Jegelka, V. Rathod, and K. Murphy. Deep metric learning via facility location. In CVPR, 2017.
-  Y. Sugaya and K. Kanatani. Geometric structure of degeneracy for multi-body motion segmentation. In Workshop on Statistical Methods in Video Processing, 2004.
-  M. Tepper and G. Sapiro. Nonnegative matrix underapproximation for robust multiple model fitting. In CVPR, 2017.
-  F. Tian, B. Gao, Q. Cui, E. Chen, and T.-Y. Liu. Learning deep representations for graph clustering. In AAAI, 2014.
-  R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2001.
-  L. Tiwari and S. Anand. Dgsac: Density guided sampling and consensus. In WACV, 2018.
-  R. Toldo and A. Fusiello. Robust multiple structures estimation with j-linkage. In ECCV, 2008.
-  P. H. Torr. Geometric motion segmentation and model selection. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 1998.
-  R. Vidal, Y. Ma, and S. Sastry. Generalized principal component analysis (GPCA). IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005.
-  E. Vincent and R. Laganière. Detecting planar homographies in an image pair. In the 2nd International Symposium on Image and Signal Processing and Analysis, 2001.
-  P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In ICML, 2008.
-  U. Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 2007.
-  H. S. Wong, T.-J. Chin, J. Yu, and D. Suter. Dynamic and hierarchical multi-structure geometric model fitting. In ICCV, 2011.
-  J. Xie, R. Girshick, and A. Farhadi. Unsupervised deep embedding for clustering analysis. In ICML, 2016.
-  E. Xing, M. Jordan, S. Russell, and A. Ng. Distance metric learning with application to clustering with side-information. In NIPS, 2003.
-  X. Xu, L. F. Cheong, and Z. Li. Motion segmentation by exploiting complementary geometric models. In CVPR, 2018.
-  J. Yan and M. Pollefeys. A General Framework for Motion Segmentation : Degenerate and Non-degenerate. In ECCV, 2006.
-  B. Yang, X. Fu, N. D. Sidiropoulos, and M. Hong. Towards k-means-friendly spaces: Simultaneous deep learning and clustering. In ICML, 2017.
-  J. Yang, D. Parikh, and D. Batra. Joint unsupervised learning of deep representations and image clusters. In CVPR, 2016.
-  K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua. Learning to find good correspondences. In CVPR, 2018.
-  L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In NIPS, 2005.
-  T. Zhang, A. Szlam, Y. Wang, and G. Lerman. Hybrid linear modeling via local best-fit flats. International Journal of Computer Vision, 2012.
6 Network Design
In this section, we present further ablation studies of the proposed SubspaceNet. In particular, we examine the depth of the network and the necessity of the L2 normalization layer.
6.1 Network Depth
We evaluate the impact of the depth of SubspaceNet. The depth is varied from 20 to 80 in steps of 10, and both the mean and median errors on KT3DMoSeg (KT3D.) and AdelaideRMF MoSeg (Adel.) are reported in Fig. 9. We observe relatively stable performance w.r.t. the network depth, thanks to the ResNet structure. In particular, the optimal range is between 40 and 60.
6.2 L2 Normalization Layer
We introduce an L2 normalization layer at the output of SubspaceNet. This layer normalizes the scale of all feature embeddings so that all feature points lie on a unit sphere, thereby benefitting the metric learning procedure. We evaluate the necessity of this layer by comparing the motion segmentation results with and without it. As can be seen from Tab. 6, the performance is consistently better with the L2norm layer on both the KT3DMoSeg and AdelaideRMF MoSeg datasets, suggesting that the L2norm layer is beneficial for learning better feature embeddings.
Table 6: Motion segmentation results with (w L2norm) and without (w/o L2norm) the L2 normalization layer.
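To make the effect of the L2 normalization layer concrete, the following NumPy sketch projects each embedding onto the unit sphere. It is an illustration only and not the TensorFlow code of Fig. 14; the function name `l2_normalize` and the `eps` guard against zero vectors are assumptions of this sketch.

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    """Scale every row (one embedding per point) to unit Euclidean norm.

    eps avoids division by zero; an all-zero row is returned unchanged.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    return features / np.maximum(norms, eps)

# Usage: embeddings with very different scales end up on the same sphere.
feats = np.array([[3.0, 4.0], [0.0, 2.0]])
unit = l2_normalize(feats)
```

After normalization, the squared Euclidean distance between two embeddings a and b equals 2 - 2 a·b, so the distances used in metric learning depend only on the angle between embeddings and not on their scale.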
6.3 TensorFlow Implementation
The TensorFlow implementation of the proposed SubspaceNet is given in Fig. 14.
Due to the limited size of existing subspace clustering datasets, which exhibit a low diversity of motion and scene types for motion segmentation, one might suspect a risk of overfitting. In this section, we investigate this issue by visualizing the training and validation loss and errors. The results on both KT3DMoSeg and AdelaideRMF MoSeg are shown in Fig. 10. We observe that both the training and validation losses converge after 100 epochs, as does the prediction accuracy.
8 Loss Design
In this section, we present further analysis and derivations of the proposed MaxInterMinIntra (MIMI) loss. Specifically, we first provide the gradient of the MIMI loss and then analyze its two components. Finally, we provide the TensorFlow implementation of the MIMI loss.
8.1 Gradient of MIMI Loss
8.2 MIMI Loss Components
Here we investigate the necessity of both maximizing the inter-cluster distance and minimizing the intra-cluster variance. Specifically, we compare the following variants. (i) MaxInter: only the maximization of the inter-cluster distance is considered, i.e., the first term in Eq. (10). (ii) MinIntra: only the minimization of the intra-cluster variance is considered, i.e., the second term in Eq. (10). (iii) K-means loss: we further note that the k-means loss  proposed for unsupervised deep clustering shares the same objective as MinIntra. We therefore adapt the k-means loss to supervised learning with a fixed point-to-cluster assignment during training. We compare the three variants with our final MIMI loss on KT3DMoSeg and present the results in Fig. 11. The MIMI loss is consistently better (lower error) than all three variants. In particular, the MinIntra and K-means losses produce large errors. This indicates that pushing points of different clusters apart is vital to feature embedding for clustering.
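The roles of the two terms can be illustrated with a small NumPy sketch. Since Eq. (10) is not reproduced in this section, the exact form below is an assumption: the inter-cluster term is taken as the negative smallest squared distance between cluster centroids, the intra-cluster term as the largest per-cluster variance, and the combined loss as their sum. The function names are hypothetical, and the authors' TensorFlow code in Fig. 13 may differ.

```python
import numpy as np

def cluster_stats(emb, labels):
    """Centroid and variance (mean squared distance to centroid) per cluster."""
    ids = np.unique(labels)
    means = np.stack([emb[labels == c].mean(axis=0) for c in ids])
    variances = np.array([
        ((emb[labels == c] - m) ** 2).sum(axis=1).mean()
        for c, m in zip(ids, means)
    ])
    return means, variances

def max_inter(emb, labels):
    """Assumed MaxInter term: negative smallest squared centroid distance.

    Requires at least two clusters.
    """
    means, _ = cluster_stats(emb, labels)
    d = ((means[:, None] - means[None]) ** 2).sum(-1)
    upper = np.triu_indices(len(means), k=1)  # each centroid pair once
    return -d[upper].min()

def min_intra(emb, labels):
    """Assumed MinIntra term: largest intra-cluster variance."""
    _, variances = cluster_stats(emb, labels)
    return variances.max()

def mimi_like(emb, labels):
    """Sum of both terms; lower is better."""
    return max_inter(emb, labels) + min_intra(emb, labels)
```

In this form, minimizing `min_intra` alone is satisfied by collapsing all points onto a single location, which gives some intuition for why a term that pushes different clusters apart is needed.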
8.3 TensorFlow Implementation
The TensorFlow implementation of the proposed MaxInterMinIntra loss is shown in Fig. 13.
9 Additional Results
In this section, we present additional qualitative results on the datasets evaluated in the manuscript. In particular, we show the motion segmentation results on the AdelaideRMF MoSeg dataset in Fig. 12. For each sequence, we visualize the ground truth as ‘XXX GT’ and the prediction as ‘Pred. err-Y.YY%’, where ‘XXX’ is the sequence name and ‘Y.YY’ is the error rate. We also report the per-sequence performance and compare with multiple state-of-the-art methods in Tab. 7. We observe very competitive performance on most sequences of the AdelaideRMF MoSeg dataset, with only one sequence (dinobooks) missing one book. It is also noteworthy that our approach generalizes well to new shapes. For example, the board games in the sequence ‘boardgame’ (middle & right), the dinosaur in the sequence ‘dinobooks’ (middle) and the stacked games in the sequence ‘game’ never appear in any other sequence. For the shapes that do appear in multiple sequences, we notice that the extracted feature points differ considerably from sequence to sequence. For example, the toys, cubes and bread look very different in terms of feature points across sequences, so the network cannot simply overfit to the shape.
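For reference, a per-sequence error rate of the kind reported as ‘Y.YY’ can be computed as in the sketch below. It assumes the error is the percentage of misclassified points under the best one-to-one matching between predicted and ground-truth cluster labels, which is the usual convention for motion segmentation but is not stated explicitly in this section. The function name `clustering_error` is illustrative, and the brute-force search over permutations is only practical for a small number of clusters.

```python
from itertools import permutations

import numpy as np

def clustering_error(gt, pred):
    """Misclassification rate in percent under the best label permutation."""
    gt, pred = np.asarray(gt), np.asarray(pred)
    gt_ids, pred_ids = np.unique(gt), np.unique(pred)
    k = max(len(gt_ids), len(pred_ids))
    # confusion[i, j]: points with i-th predicted id and j-th ground-truth id
    confusion = np.zeros((k, k), dtype=int)
    for i, p in enumerate(pred_ids):
        for j, g in enumerate(gt_ids):
            confusion[i, j] = np.sum((pred == p) & (gt == g))
    # try every assignment of predicted ids to ground-truth ids
    best = max(
        sum(confusion[i, perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )
    return 100.0 * (1.0 - best / len(gt))
```

Because cluster indices returned by K-means are arbitrary, the matching step is needed before predictions can be compared with the ground truth.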