With the advent of big data, high-dimensional multi-view features are widely employed to represent complex data in various research fields, such as multimedia computing, machine learning and data mining [Liu et al.2016, Liu et al.2017, Zhu et al.2017b, Zhu et al.2015, Cheng and Shen2016, Cheng et al.2016]. On the one hand, with multi-view features, the data can be characterized more precisely and comprehensively from different perspectives. On the other hand, high-dimensional multi-view features inevitably incur expensive computation and massive storage costs. Moreover, they may contain noise, outlying entries, and irrelevant or correlated features, which may be detrimental to the subsequent learning process [Zhu et al.2016b, Zhu et al.2016a, Zhu et al.2017a]. Unsupervised multi-view feature selection [Wang et al.2016, Li and Liu2017] is devised to alleviate the problem. It selects a compact subset of informative features from the original features by dropping irrelevant and redundant features with advanced unsupervised learning. Owing to its independence from semantic labels, high computing efficiency and good interpretability, unsupervised multi-view feature selection has received considerable attention in the literature and has become a prerequisite component in various machine learning models [Li et al.2017].
The key problem of multi-view feature selection is how to effectively exploit the diversity and consistency of multi-view features to collaboratively identify the feature dimensions that retain the key characteristics of the original features. Existing approaches can be categorized into two major families. Methods of the first family concatenate the multi-view features into a single vector and then directly import it into a conventional single-view feature selection model. The candidate features are generally ranked based on spectral graph theory. Typical methods of this kind include Laplacian Score (LapScor) [He et al.2005], spectral feature selection (SPEC) [Zhao and Liu2007] and minimum redundancy spectral feature selection (MRSF) [Zhao et al.2010]. Commonly, the pipeline of these methods follows two separate processes: 1) a similarity structure is constructed with fixed graph parameters to describe the geometric structure of the data; 2) sparsity and manifold regularization are employed together to identify the most salient features. Although these methods are reported to achieve certain success, they treat features from different views independently and neglect the important view correlations.
Another family of methods considers view correlation when performing feature selection. Representative works include adaptive multi-view feature selection (AMFS) [Wang et al.2016], multi-view feature selection (MVFS) [Tang et al.2013] and adaptive unsupervised multi-view feature selection (AUMFS) [Feng et al.2013]. These methods first construct multiple view-specific similarity structures (in this paper, each view-specific similarity structure is constructed with the corresponding view-specific feature) and then perform the subsequent feature selection based on the collaborative (combined) similarity structure. These two processes are separate and independent, and the collaborative similarity structure remains fixed during feature selection. The noise and outlying entries latent in the view-specific similarity structures therefore reduce the reliability of the ultimate collaborative similarity structure used for feature selection. Furthermore, conventional approaches generally employ $k$-nearest neighbors assignment to construct the view-specific similarity structures and a simple weighted combination to generate the ultimate similarity structure. This strategy can hardly achieve the ideal state for clustering, in which the number of connected components in the ultimate similarity structure is equal to the number of clusters [Nie et al.2014]. Thus, suboptimal performance may result under such circumstances.
In this paper, we introduce an adaptive collaborative similarity learning (ACSL) approach for unsupervised multi-view feature selection. The main contributions of this paper can be summarized as follows:
- Different from existing solutions, we integrate the collaborative similarity structure learning and multi-view feature selection into a unified framework. The collaborative similarity structure and the similarity combination weights can be learned adaptively by considering the ultimate feature selection performance. Simultaneously, the feature selection can preserve the dynamically adjusted similarity structure.
- We impose a reasonable rank constraint to adaptively learn an ideal collaborative similarity structure with proper neighbor assignment, which can positively facilitate the ultimate feature selection. An effective alternate optimization approach with guaranteed convergence is derived to iteratively solve the formulated optimization problem.
2 Related Work
One kind of unsupervised multi-view feature selection method directly imports the concatenated features of multiple views into a single-view feature selection model. In [He et al.2005], the Laplacian score (LapScor) is employed to measure the capability of each feature dimension to preserve sample similarity. [Zhao and Liu2007] proposes a general learning framework based on spectral theory to unify unsupervised and supervised feature selection. [Zhao et al.2010] adopts an embedding model to handle feature redundancy in spectral feature selection. These methods generally rank the candidate feature dimensions with various graphs that characterize the manifold structure. They treat features from different views independently and ignore the important correlation between different feature views. Another kind of method directly tackles multi-view feature selection and considers view correlations when performing feature selection. Adaptive multi-view feature selection (AMFS) [Wang et al.2016] is an unsupervised feature selection approach developed for human motion retrieval. It describes the local geometric structure of the data in each view with a local descriptor and performs feature selection within a general trace ratio optimization, so that the feature dimensions are determined with a trace ratio criterion. Adaptive unsupervised multi-view feature selection (AUMFS) [Feng et al.2013] addresses the feature selection problem for visual concept recognition. It employs an $\ell_{2,1}$-norm [Nie et al.2010] based sparse regression model to automatically identify discriminative features. In AUMFS, the data cluster structure, the data similarity and the correlations of different views are considered for feature selection. Multi-view feature selection (MVFS) [Tang et al.2013] investigates feature selection for multi-view data in social media. A learning framework is devised to exploit the relations between views and help each view select relevant features.
3 The Proposed Methodology
3.1 Notations and Definitions
Throughout the paper, all matrices are written in boldface uppercase. For a matrix $\mathbf{M}$, its $i$-th row is denoted by $\mathbf{M}_i$ and its $j$-th column by $\mathbf{m}_j$. The element in the $i$-th row and $j$-th column is represented as $M_{ij}$. The trace of the matrix M is denoted as $Tr(\mathbf{M})$. The transpose of matrix M is denoted as $\mathbf{M}^T$. The $\ell_{2,1}$ norm of the matrix M is denoted as $\|\mathbf{M}\|_{2,1}$, which is calculated by $\sum_i \sqrt{\sum_j M_{ij}^2}$. The Frobenius norm of M is denoted by $\|\mathbf{M}\|_F$. $\mathbf{1}$ denotes a column vector whose elements are all one. $\mathbf{I}$ denotes the identity matrix.
The feature matrix of the data in the $v$-th view is denoted as $\mathbf{X}^v$, where $d_v$ is the feature dimension in the $v$-th view and $N$ is the number of data samples. We pack the feature matrices of the $V$ views, and the overall feature matrix of the data can be represented as $\mathbf{X} = [\mathbf{X}^1, \ldots, \mathbf{X}^V]$ with $d = \sum_{v=1}^{V} d_v$ feature dimensions in total. The objective of unsupervised multi-view feature selection is to identify the $l$ most valuable features with only X.
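As a quick numerical illustration of the two matrix norms used in the notation, the following minimal sketch evaluates them with NumPy. The function names `l21_norm` and `frobenius_norm` and the toy matrix are illustrative assumptions, not part of the original notation.

```python
import numpy as np

def l21_norm(M):
    """l2,1 norm: sum over rows of the Euclidean norm of each row."""
    return float(np.sum(np.sqrt(np.sum(M ** 2, axis=1))))

def frobenius_norm(M):
    """Frobenius norm: square root of the sum of squared entries."""
    return float(np.sqrt(np.sum(M ** 2)))

# Toy matrix with row norms 5, 0 and 13
M = np.array([[3.0, 4.0],
              [0.0, 0.0],
              [5.0, 12.0]])

print(l21_norm(M))        # sum of the row norms: 5 + 0 + 13
print(frobenius_norm(M))  # sqrt(9 + 16 + 25 + 144)
```

Because the $\ell_{2,1}$ norm sums row norms without squaring them, penalizing it tends to drive entire rows to zero, which is what makes it useful for selecting features.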
The importance of the feature dimensions is primarily determined by measuring their capability to preserve the similarity structures in multiple views. In this paper, we develop a unified learning framework to learn an adaptive collaborative similarity structure with automatic neighbor assignment for multi-view feature selection. In our model, the neighbors in the collaborative similarity structure can be adaptively assigned by considering the feature selection performance, and simultaneously the feature selection can preserve the dynamically constructed collaborative similarity structure. Given the similarity structures $\{\mathbf{A}^v\}_{v=1}^{V}$ constructed in multiple views, where $V$ is the number of views, we can automatically learn a collaborative similarity structure S by combining them with weights.
where $\mathbf{s}_j$, the $j$-th column of S, characterizes the similarities between the data points and the $j$-th data point and is subject to the constraints $\mathbf{1}^T \mathbf{s}_j = 1$, $\mathbf{s}_j \geq 0$; $\mathbf{w}_j$ is comprised of the $V$ view weights for the $j$-th column of similarities and is constrained with $\mathbf{w}_j^T \mathbf{1} = 1$; and $\mathbf{W} = [\mathbf{w}_1, \ldots, \mathbf{w}_N]$ is the view weight matrix for all columns in the similarity structures. As indicated in recent work [Nie et al.2014], a theoretically ideal similarity structure for clustering should have the property that the number of connected components is equal to the number of clusters. A similarity structure with such neighbor assignment could benefit the subsequent feature selection. Unfortunately, the similarity structure learned from Eq.(1) does not have this desirable property.
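To make the column-wise weighted combination concrete, here is a minimal sketch. It assumes that samples are stored in the rows of each view's feature matrix, that each view-specific similarity is a $k$-nearest-neighbor graph with uniform weights (a simplification chosen only for illustration), and that the weight matrix has one row per view and one column per sample. The names `knn_similarity` and `combine_views` are hypothetical.

```python
import numpy as np

def knn_similarity(Xv, k):
    """Column-normalized k-NN similarity for one view (rows of Xv are samples)."""
    n = Xv.shape[0]
    sq = np.sum(Xv ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * Xv @ Xv.T  # squared distances
    np.fill_diagonal(dist, np.inf)                       # exclude self-similarity
    A = np.zeros((n, n))
    for j in range(n):
        nbrs = np.argsort(dist[:, j])[:k]                # k nearest neighbors of sample j
        A[nbrs, j] = 1.0 / k                             # uniform weights, column sums to 1
    return A

def combine_views(views, W):
    """Collaborative similarity: S[:, j] = sum_v W[v, j] * A^v[:, j]."""
    S = np.zeros_like(views[0])
    for v, A in enumerate(views):
        S += A * W[v][None, :]                           # scale column j of view v by W[v, j]
    return S

rng = np.random.default_rng(0)
X1 = rng.standard_normal((12, 4))                        # view 1: 12 samples, 4 features
X2 = rng.standard_normal((12, 6))                        # view 2: 12 samples, 6 features
views = [knn_similarity(X1, k=3), knn_similarity(X2, k=3)]
W = np.full((2, 12), 0.5)                                # equal view weights per column
S = combine_views(views, W)
```

With nonnegative weights that sum to one in every column, each column of the combined structure stays on the probability simplex, which matches the constraints described for the collaborative similarity structure.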
To tackle the problem, in this paper, we impose a reasonable rank constraint on the Laplacian matrix of the collaborative similarity structure to enable it to have such property. Our idea is motivated by the following spectral graph theory.
Theorem 1. If the similarity structure S is nonnegative, the multiplicity of the eigenvalue 0 of its Laplacian matrix is equal to the number of connected components of S [Alavi1991].
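The spectral property stated here can be checked numerically on a small graph. The sketch below assumes the common symmetrized Laplacian (degree matrix minus the symmetrized similarity); the helper names `laplacian` and `zero_eig_multiplicity` and the toy graph are illustrative assumptions.

```python
import numpy as np

def laplacian(S):
    """Graph Laplacian of the symmetrized similarity matrix."""
    S_sym = (S + S.T) / 2.0
    return np.diag(S_sym.sum(axis=1)) - S_sym

def zero_eig_multiplicity(S, tol=1e-10):
    """Number of (numerically) zero eigenvalues of the Laplacian."""
    eigvals = np.linalg.eigvalsh(laplacian(S))
    return int(np.sum(np.abs(eigvals) < tol))

# Nonnegative similarity with two disconnected groups: {0, 1, 2} and {3, 4}
S = np.zeros((5, 5))
S[0, 1] = S[1, 0] = 1.0
S[1, 2] = S[2, 1] = 0.5
S[3, 4] = S[4, 3] = 2.0

print(zero_eig_multiplicity(S))  # 2, one zero eigenvalue per connected component
```

Equivalently, the Laplacian of this 5-node graph has rank 5 - 2 = 3, which is the form in which the property is used as a constraint.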
As mentioned above, the data points can be directly partitioned into $c$ clusters if the number of connected components in the similarity structure S is exactly equal to $c$. Theorem 1 indicates that this condition can be achieved if the rank of the Laplacian matrix is equal to $N - c$, where $N$ is the number of data samples. Based on this analysis, we add a rank constraint to Eq.(1) to achieve the condition. The optimization problem becomes
where $\mathbf{L}_S$ is the Laplacian matrix of the similarity structure S and $\mathbf{D}_S$ is the corresponding diagonal degree matrix. As shown in Eq.(2), directly imposing the rank constraint makes the problem hard to solve. Fortunately, according to Ky Fan's Theorem [K.1949], we have $\sum_{i=1}^{c} \sigma_i(\mathbf{L}_S) = \min_{\mathbf{F}^T \mathbf{F} = \mathbf{I}} Tr(\mathbf{F}^T \mathbf{L}_S \mathbf{F})$, where $\sigma_i(\mathbf{L}_S)$ is the $i$-th smallest eigenvalue of $\mathbf{L}_S$ and $\mathbf{F} \in \mathbb{R}^{N \times c}$ is the relaxed cluster indicator matrix. Obviously, the rank constraint is satisfied when $\sum_{i=1}^{c} \sigma_i(\mathbf{L}_S) = 0$. To this end, we reformulate Eq.(2) in the following simple equivalent form
As shown in the above equation, when the weight on the trace term is large enough, $Tr(\mathbf{F}^T \mathbf{L}_S \mathbf{F})$ is forced arbitrarily close to 0 and the rank constraint is satisfied accordingly. By transforming the rank constraint into a trace term in the objective function, the problem in Eq.(2) can be tackled more easily.
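The Ky Fan identity behind this relaxation can be verified numerically: the minimum of the trace over orthonormal matrices equals the sum of the smallest eigenvalues and is attained by the corresponding eigenvectors. The sketch below uses a random symmetric similarity matrix; all variable names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 8, 3

# Random symmetric nonnegative similarity and its Laplacian
B = rng.random((n, n))
S = (B + B.T) / 2.0
np.fill_diagonal(S, 0.0)
L = np.diag(S.sum(axis=1)) - S

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors
eigvals, eigvecs = np.linalg.eigh(L)
F = eigvecs[:, :c]                       # eigenvectors of the c smallest eigenvalues
trace_at_F = np.trace(F.T @ L @ F)
sum_smallest = eigvals[:c].sum()

# Any other matrix with orthonormal columns gives a trace that is no smaller
Q, _ = np.linalg.qr(rng.standard_normal((n, c)))
trace_at_Q = np.trace(Q.T @ L @ Q)

print(trace_at_F, sum_smallest, trace_at_Q)
```

This is also why a trace-minimization step over an orthonormal indicator matrix reduces to an eigen-decomposition that keeps the eigenvectors of the smallest eigenvalues.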
The selected features should preserve the dynamically learned similarity structure. Conventional approaches separate similarity structure construction and feature selection into two independent processes, which potentially leads to sub-optimal performance. In this paper, we learn the collaborative similarity structure dynamically and further integrate it with feature selection into a unified framework. Specifically, based on the collaborative similarity structure learning in Eq.(3), we employ a sparse regression model to learn a projection matrix $\mathbf{P}$, so that the projected low-dimensional data XP can approximate the relaxed cluster indicator F. To select the features, we impose an $\ell_{2,1}$ norm penalty on P to enforce row sparsity. The importance of each feature can then be measured by the $\ell_2$ norm of the corresponding row of P. The overall optimization formulation can be derived as
With P, the importance of the $i$-th feature is measured by $\|\mathbf{P}_i\|_2$, the $\ell_2$ norm of the $i$-th row of P. The features with the largest values are finally selected.
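The ranking step can be sketched in a few lines: score each feature by the Euclidean norm of its row in the projection matrix and keep the highest-scoring ones. The function name `select_features`, its parameter `num_features` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def select_features(P, num_features):
    """Rank features by the l2 norm of their rows in P; return the top indices and all scores."""
    scores = np.linalg.norm(P, axis=1)            # one score per feature (row of P)
    order = np.argsort(-scores, kind="stable")    # descending score, ties keep original order
    return order[:num_features], scores

# Toy projection matrix: 4 features projected to 2 dimensions
P = np.array([[0.0, 0.1],
              [3.0, 4.0],
              [1.0, 0.0],
              [0.0, 0.0]])

idx, scores = select_features(P, 2)
print(idx)     # indices of the two rows with the largest norms: features 1 and 2
print(scores)  # row norms 0.1, 5.0, 1.0 and 0.0
```

A row that is exactly zero, such as the last one here, corresponds to a feature that contributes nothing to the projection and is therefore ranked last.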
3.3 Alternate Optimization
As shown in Eq.(4), the objective function is not jointly convex in all the variables. In this paper, we propose an effective alternate optimization scheme to solve the problem iteratively. Specifically, we optimize one variable while fixing the others.
Update P. By fixing the other variables, the optimization for P can be derived as
This objective is not differentiable. Hence, we transform it into the following equivalent form [Nie et al.2010]
where $\mathbf{\Lambda}$ is a diagonal matrix whose $i$-th diagonal element is $\frac{1}{2\sqrt{\|\mathbf{P}_i\|_2^2 + \epsilon}}$, with $\mathbf{P}_i$ the $i$-th row of P. $\epsilon$ is a sufficiently small constant used to avoid the case in which $\|\mathbf{P}_i\|_2$ is zero. By calculating the derivative of the objective function with respect to P and setting it to zero, we can obtain the updating rule for P as
Note that $\mathbf{\Lambda}$ is dependent on P. We develop an iterative approach to solve for P and $\mathbf{\Lambda}$ until convergence. Specifically, we fix $\mathbf{\Lambda}$ to solve for P, and vice versa.
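The alternation between the projection matrix and its diagonal reweighting matrix can be sketched as an iteratively reweighted least-squares loop in the style of [Nie et al.2010]. This is a minimal sketch under stated assumptions, not the paper's exact update: it assumes the subproblem has the form $\min_P \|XP - F\|_F^2 + \gamma \|P\|_{2,1}$ with samples in the rows of X, and the names `update_P`, `gamma`, `n_iter` and `eps` are hypothetical.

```python
import numpy as np

def update_P(X, F, gamma, n_iter=30, eps=1e-8):
    """Reweighted least squares for min ||XP - F||_F^2 + gamma * ||P||_{2,1} (assumed form)."""
    d = X.shape[1]
    Lam = np.eye(d)                          # initial reweighting matrix
    XtX, XtF = X.T @ X, X.T @ F
    history = []
    for _ in range(n_iter):
        # Fix the reweighting matrix, solve the resulting ridge-like linear system for P
        P = np.linalg.solve(XtX + gamma * Lam, XtF)
        # Fix P, refresh the diagonal reweighting matrix from the row norms of P
        row_sq = np.sum(P ** 2, axis=1)
        Lam = np.diag(1.0 / (2.0 * np.sqrt(row_sq + eps)))
        # Smoothed objective value, tracked to observe the decrease
        obj = np.sum((X @ P - F) ** 2) + gamma * np.sum(np.sqrt(row_sq + eps))
        history.append(float(obj))
    return P, history

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 10))            # 40 samples, 10 features
F = rng.standard_normal((40, 3))             # stand-in for the relaxed cluster indicator
P, history = update_P(X, F, gamma=5.0)
print(history[0], history[-1])
```

Each pass solves a linear system with the reweighting matrix held fixed and then recomputes that matrix, so the smoothed objective does not increase from one pass to the next.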
Update F. By fixing the other variables, the optimization for F can be derived as
With the transformation, the optimization for updating F can be solved by a simple eigen-decomposition of the matrix in the resulting trace problem. Specifically, the columns of F are comprised of the eigenvectors corresponding to the $c$ smallest eigenvalues.
Update S. By fixing the other variables, the optimization for S becomes
The above equation can be rewritten as
where $S_{ij}$ denotes the element in the $i$-th row and $j$-th column of S. The optimization processes for the columns of S are independent of each other. Hence, they can be optimized separately. Formally, S can be solved by
Let $\mathbf{d}_j$ be a row vector with $N$ dimensions whose $i$-th element is $\|\mathbf{f}_i - \mathbf{f}_j\|_2^2$, where $\mathbf{f}_i$ denotes the $i$-th row of F. The above optimization formula can be transformed as
This problem can be solved by an efficient iterative algorithm [Huang et al.2015].
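If the per-column subproblem takes the form of finding the closest point on the probability simplex to a given target vector (an assumption made here for illustration), it can also be solved with the classic sort-based Euclidean projection shown below. This is a swapped-in technique, not the iterative algorithm of [Huang et al.2015]; the name `project_to_simplex` and the target vector `b` are hypothetical.

```python
import numpy as np

def project_to_simplex(b):
    """Euclidean projection of b onto {s : s >= 0, sum(s) = 1} via sorting."""
    u = np.sort(b)[::-1]                          # entries in descending order
    css = np.cumsum(u) - 1.0                      # cumulative sums minus the simplex total
    ks = np.arange(1, b.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]     # last index with a positive gap
    theta = css[rho] / (rho + 1.0)                # shift that makes the result sum to one
    return np.maximum(b - theta, 0.0)

b = np.array([0.8, 0.6, -0.3])
s = project_to_simplex(b)
print(s)  # approximately [0.6, 0.4, 0.0]
```

The projection sets small entries exactly to zero, so each column keeps only a few nonzero similarities, which is how neighbors end up being assigned adaptively rather than by a fixed neighborhood size.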
Update W. Similar to S, the optimization processes for the columns of W are independent of each other. Hence, they can be optimized separately. Formally, the $j$-th column of W is solved by
The objective function in Eq.(14) can be rewritten as
where , .
We can obtain the Lagrangian function of problem (14)
where the additional variable is the Lagrangian multiplier. By calculating the derivative of (16) with respect to $\mathbf{w}_j$, the $j$-th column of W, and setting it to 0, we obtain the updating rule of $\mathbf{w}_j$ as
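A closed-form weight update of this kind can be sketched as follows. The sketch assumes that, for one column, the weights minimize the squared distance between the collaborative similarity column and the weighted sum of the view-specific columns, subject only to the weights summing to one (nonnegativity is not enforced here). Under that assumption the Lagrangian gives weights proportional to the inverse Gram matrix applied to the all-ones vector. The names `update_view_weights`, `A_cols` and `ridge` are hypothetical.

```python
import numpy as np

def update_view_weights(s_j, A_cols, ridge=1e-10):
    """Minimize ||s_j - A_cols @ w||^2 subject to sum(w) = 1 (assumed subproblem).

    s_j:    (N,) column of the collaborative similarity structure.
    A_cols: (N, V) array whose v-th column is the same column taken from view v.
    """
    V = A_cols.shape[1]
    B = s_j[:, None] - A_cols                 # residual of each view, shape (N, V)
    G = B.T @ B + ridge * np.eye(V)           # Gram matrix, regularized for stability
    g = np.linalg.solve(G, np.ones(V))        # inverse Gram matrix times the ones vector
    return g / g.sum()                        # normalize so the weights sum to one

rng = np.random.default_rng(0)
A_cols = rng.random((6, 3))                   # one column from each of 3 views
s_j = rng.random(6)
w = update_view_weights(s_j, A_cols)
print(w, w.sum())
```

Because the weights sum to one, the residual equals the weighted sum of the per-view residuals, so the objective is a quadratic form in the weights whose Hessian is a Gram matrix and hence positive semidefinite.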
Table 1: ACC of different methods with different numbers of selected features, using K-means for clustering.
3.4 Convergence Analysis
The convergence of solving problem (6) can be proven by the following theorem.
The iterative optimization process for solving Eq.(5) will monotonically decrease the objective function value until convergence.
Let $\tilde{\mathbf{P}}$ be the newly updated P. We can obtain the following inequality
By adding to the both sides of the inequality (18) and substituting , the inequality can be rewritten as
On the other hand, according to Lemma 1 in [Nie et al.2010], for any positive numbers $a$ and $b$, we have $\sqrt{a} - \frac{a}{2\sqrt{b}} \leq \sqrt{b} - \frac{b}{2\sqrt{b}}$.
The convergence of solving Algorithm 1 can be proven by the following theorem.
By fixing other variables and updating F, the objective function in Eq.(8) is convex (The Hessian matrix of the Lagrangian function of Eq.(8) is positive semidefinite [Alavi1991]). Therefore, we can obtain that
By fixing the other variables and updating S, optimizing Eq.(13) is a typical quadratic programming problem. The Hessian matrix of the Lagrangian function of problem (13) is also positive semidefinite. Therefore, we can obtain that
By fixing other variables and updating W, the Hessian matrix of Eq.(16) is . It is positive semidefinite as . Hence, the objective function for optimizing W is also convex. Then, we arrive at
4.1 Experimental Datasets
1) MSRC-V1. We select 7 classes (tree, building, airplane, cow, face, car, bicycle), and each class has 30 images. We extract 5 visual features from each image: color moment with dimension 48, GIST with dimension 512, SIFT with dimension 1230, CENTRIST with dimension 210, and local binary pattern (LBP) with dimension 256. 2) Handwritten Numeral [van Breukelen et al.1998]. This dataset is comprised of 2,000 data points from the 10 digit classes 0 to 9. Six features are used to represent each digit: 76-dimensional Fourier coefficients of the character shapes, 216-dimensional profile correlations, 64-dimensional Karhunen-Loève coefficients, 240-dimensional pixel averages in 2 × 3 windows, 47-dimensional Zernike moments and 6-dimensional morphological features. 3) Youtube [Liu et al.2009]. This real-world dataset is collected from YouTube. It contains intended camera motion, variations of the object scale, viewpoint and illumination, and cluttered backgrounds. The dataset is comprised of 1,596 video sequences in 11 actions. 4) Outdoor Scene [Monadjemi et al.2002]. The outdoor scene dataset contains 2,688 color images that belong to 8 outdoor scene categories. Four visual features are extracted from each image: color moment with dimension 432, GIST with dimension 512, HOG with dimension 256, and LBP with dimension 48.
4.2 Experimental Setting
Baselines. We compare ACSL with several representative unsupervised feature selection methods on clustering performance. The compared methods include three single-view feature selection approaches (Laplacian score (LapScor) [He et al.2005], spectral feature selection (SPEC) [Zhao and Liu2007] and minimum redundancy spectral feature selection (MRSF) [Zhao et al.2010]) and three multi-view feature selection approaches (adaptive multi-view feature selection (AMFS) [Wang et al.2016], multi-view feature selection (MVFS) [Tang et al.2013] and adaptive unsupervised multi-view feature selection (AUMFS) [Feng et al.2013]).
Evaluation Metrics. We employ two standard metrics, clustering accuracy (ACC) and normalized mutual information (NMI), for performance comparison. Each experiment is performed 50 times and the mean results are reported.
Parameter Setting. In the implementation of all methods, the $k$-nearest neighbor graph is adopted to construct the initial affinity matrices, and the number of neighbors is set to 10. In ACSL, the regularization parameters of Eq.(4) are chosen from a range of candidate values. The parameters of all compared approaches are carefully adjusted to report their best results.
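For readers who want to reproduce the two metrics, the following is a minimal NumPy sketch. Several choices are assumptions rather than the paper's protocol: ACC searches for the best one-to-one mapping between clusters and classes by brute force over permutations (practical only for a handful of clusters; the Hungarian algorithm is the usual choice), and NMI is normalized by the geometric mean of the two entropies, which is one of several conventions. All function names are hypothetical.

```python
import numpy as np
from itertools import permutations

def contingency(y_true, y_pred):
    """Count matrix: rows are ground-truth classes, columns are predicted clusters."""
    classes, t = np.unique(y_true, return_inverse=True)
    clusters, p = np.unique(y_pred, return_inverse=True)
    C = np.zeros((classes.size, clusters.size), dtype=np.int64)
    np.add.at(C, (t, p), 1)
    return C

def clustering_accuracy(y_true, y_pred):
    """ACC under the best one-to-one cluster-to-class mapping (brute force)."""
    C = contingency(y_true, y_pred)
    k = max(C.shape)
    Cp = np.zeros((k, k), dtype=np.int64)       # pad to a square matrix
    Cp[:C.shape[0], :C.shape[1]] = C
    best = max(sum(Cp[i, perm[i]] for i in range(k))
               for perm in permutations(range(k)))
    return float(best) / len(y_true)

def nmi(y_true, y_pred):
    """Mutual information divided by the geometric mean of the two entropies."""
    pij = contingency(y_true, y_pred).astype(float)
    pij /= pij.sum()
    pi = pij.sum(axis=1, keepdims=True)         # class marginals
    pj = pij.sum(axis=0, keepdims=True)         # cluster marginals
    nz = pij > 0
    mi = np.sum(pij[nz] * np.log(pij[nz] / (pi @ pj)[nz]))
    h_true = -np.sum(pi[pi > 0] * np.log(pi[pi > 0]))
    h_pred = -np.sum(pj[pj > 0] * np.log(pj[pj > 0]))
    denom = np.sqrt(h_true * h_pred)
    if denom == 0:
        return 1.0 if h_true == h_pred else 0.0
    return float(mi / denom)

y_true = [0, 0, 1, 1, 2, 2]
y_perm = [1, 1, 0, 0, 2, 2]                     # same partition, permuted cluster ids
y_coarse = [0, 0, 0, 1, 1, 1]                   # only two clusters for three classes
print(clustering_accuracy(y_true, y_perm), nmi(y_true, y_perm))
print(clustering_accuracy(y_true, y_coarse), nmi(y_true, y_coarse))
```

Both metrics are invariant to how the clusters are numbered, which is why a matching step (for ACC) or an information-theoretic comparison (for NMI) is needed instead of comparing labels directly.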
4.3 Comparison Results
The comparison results measured by ACC and NMI are reported in Table 1 and Table 2, respectively. For both metrics, a higher value indicates better feature selection performance. Each metric penalizes or favors different properties in feature selection. Hence, we report results on both measures to perform a comprehensive evaluation. The obtained results demonstrate that ACSL achieves performance superior or at least comparable to that of the compared approaches. We attribute the promising performance of ACSL to the proposed collaborative similarity structure learning with proper neighbor assignment, which positively facilitates the ultimate multi-view feature selection.
4.4 Parameter and Convergence Experiment
We investigate the impact of the three parameters in Eq.(4) on the performance of ACSL. Specifically, we vary one parameter while fixing the others. Figure 1 presents the main results on MSRC-V1. The obtained results show that ACSL is robust to the three parameters involved. Figure 2 records the variation of the objective function value in Eq.(4) with the number of iterations on MSRC-V1 and Handwritten Numeral. We can observe that the convergence curves become stable within about 5 iterations. The fast convergence ensures the optimization efficiency of ACSL.
In this paper, we propose an adaptive collaborative similarity structure learning for multi-view feature selection. Different from existing approaches, we integrate collaborative similarity learning and feature selection into a unified framework. The collaborative similarity structure with the ideal neighbor assignment and similarity combination weights are adaptively learned to positively facilitate the subsequent feature selection. Simultaneously, the feature selection can supervise the similarity learning process to dynamically construct the desirable similarity structure. Experiments show the superiority of the proposed approach.
- [Alavi1991] Y. Alavi. The laplacian spectrum of graphs. Graph theory, combinatorics, and applications, 2(12):871–898, 1991.
- [Cheng and Shen2016] Zhiyong Cheng and Jialie Shen. On very large scale test collection for landmark image search benchmarking. Signal Processing, 124:13 – 26, 2016.
- [Cheng et al.2016] Zhiyong Cheng, Jialie Shen, and Haiyan Miao. The effects of multiple query evidences on social image retrieval. Multimedia Systems, 22(4):509–523, 2016.
- [Feng et al.2013] Yinfu Feng, Jun Xiao, Yueting Zhuang, and Xiaoming Liu. Adaptive unsupervised multi-view feature selection for visual concept recognition. In ACCV, 2013.
- [Grauman and Darrell2006] K Grauman and T Darrell. Unsupervised learning of categories from sets of partially matching image features. In CVPR, pages 19–25, 2006.
- [He et al.2005] Xiaofei He, Deng Cai, and Partha Niyogi. Laplacian score for feature selection. In NIPS, pages 507–514, 2005.
- [Huang et al.2015] Jin Huang, Feiping Nie, and Heng Huang. A new simplex sparse learning model to measure data similarity for clustering. In IJCAI, pages 3569–3575, 2015.
- [K.1949] Ky Fan. On a theorem of Weyl concerning eigenvalues of linear transformations. Proc Natl Acad Sci U S A, 35(11):652–655, 1949.
- [Li and Liu2017] Jundong Li and Huan Liu. Challenges of feature selection for big data analytics. IEEE Intelligent Systems, 32(2):9–15, 2017.
- [Li et al.2017] Yun Li, Tao Li, and Huan Liu. Recent advances in feature selection and its applications. Knowledge and Information Systems, 53(3):551–577, 2017.
- [Liu et al.2009] J. Liu, Yang Yang, and M. Shah. Learning semantic visual vocabularies using diffusion distance. In CVPR, pages 461–468, 2009.
- [Liu et al.2016] An-An Liu, Wei-Zhi Nie, Yue Gao, and Yu-Ting Su. Multi-modal clique-graph matching for view-based 3d model retrieval. TIP, 25(5):2103–2116, 2016.
- [Liu et al.2017] A. A. Liu, Y. T. Su, W. Z. Nie, and M. Kankanhalli. Hierarchical clustering multi-task learning for joint human action grouping and recognition. TPAMI, 39(1):102–114, 2017.
- [Monadjemi et al.2002] A. Monadjemi, B. T. Thomas, and M. Mirmehdi. Experiments on high resolution images towards outdoor scene classification. Computer Vision Winter Workshop, 2002.
- [Nie et al.2010] Feiping Nie, Heng Huang, Xiao Cai, and Chris H. Q. Ding. Efficient and robust feature selection via joint l21-norms minimization. In NIPS, pages 1813–1821, 2010.
- [Nie et al.2014] Feiping Nie, Xiaoqian Wang, and Heng Huang. Clustering and projected clustering with adaptive neighbors. In KDD, pages 977–986, 2014.
- [Tang et al.2013] Jiliang Tang, Xia Hu, Huiji Gao, and Huan Liu. Unsupervised feature selection for multi-view data in social media. In SDM, pages 270–278, 2013.
- [van Breukelen et al.1998] M P. W van Breukelen, D M. J Tax, and J E den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34:381–386, 1998.
- [Wang et al.2016] Zhao Wang, Yinfu Feng, Tian Qi, Xiaosong Yang, and Jian J. Zhang. Adaptive multi-view feature selection for human motion retrieval. Signal Processing, 120(C):691 – 701, 2016.
- [Winn and Jojic2005] J. Winn and N. Jojic. Locus: learning object classes with unsupervised segmentation. In ICCV, pages 756–763, 2005.
- [Zhao and Liu2007] Zheng Zhao and Huan Liu. Spectral feature selection for supervised and unsupervised learning. In ICML, pages 1151–1157, 2007.
- [Zhao et al.2010] Zheng Zhao, Lei Wang, and Huan Liu. Efficient spectral feature selection with minimum redundancy. In AAAI, 2010.
- [Zhu et al.2015] Lei Zhu, Jialie Shen, Hai Jin, Ran Zheng, and Liang Xie. Content-based visual landmark search via multimodal hypergraph learning. TCYB, 45(12):2756–2769, 2015.
- [Zhu et al.2016a] L. Zhu, J. Shen, L. Xie, and Z. Cheng. Unsupervised topic hypergraph hashing for efficient mobile image retrieval. TCYB, 47(11):3941–3954, 2016.
- [Zhu et al.2016b] Lei Zhu, Jialie Shen, Xiaobai Liu, Liang Xie, and Liqiang Nie. Learning compact visual representation with canonical views for robust mobile landmark search. In IJCAI, pages 3959–3965, 2016.
- [Zhu et al.2017a] L. Zhu, Z. Huang, X. Liu, X. He, J. Sun, and X. Zhou. Discrete multimodal hashing with canonical views for robust mobile landmark search. TMM, 19(9):2066–2079, 2017.
- [Zhu et al.2017b] Lei Zhu, Jialie Shen, Liang Xie, and Zhiyong Cheng. Unsupervised visual hashing with semantic assistant for content-based image retrieval. TKDE, 29(2):472–486, 2017.