Introduction
Due to the explosive growth of data [Wang, Kumar, and Chang2012]
, subset selection methods are increasingly popular for a wide range of machine learning and computer vision applications
[Frey and Dueck2007, Jenatton, Audibert, and Bach2011]. This kind of methods offer the potential to select a few highly representative samples or exemplars to describe the entire dataset. By analyzing a few, we can roughly know all. Such case is very important to summarize and visualize huge datasets of texts, images and videos etc. [Bien and Tibshirani2011, Elhamifar et al.2012b]. Besides, by only using the selected exemplars for succeeding tasks, the cost of memories and computational time will be greatly reduced [Garcia et al.2012]. Additionally, as outliers are generally less representative, the side effect of outliers will be reduced, thus boosting the performance of subsequent applications [Elhamifar et al.2012a].There have been several subset selection methods. The most intuitional method is to randomly select a fixed number of samples. Although highly efficient, there is no guarantee for an effective selection. For the other methods, depending on the mechanism of representative exemplars, there are mainly three categories of selection methods. One category relies on the assumption that the data points lie in one or multiple lowdimensional subspaces. Specifically, the Rank Revealing QR (RRQR) [Chan1987, Boutsidis, Mahoney, and Drineas2009] selects the subsets that give the best conditional submatrix. Unfortunately, this method has suboptimal properties, as it is not assured to find the globally optimum in polynomial time.
Another category assumes that the samples are distributed around centers [Frey and Dueck2007, Liu et al.2010]. The center or its nearest neighbour are selected as exemplars. Perhaps, Kmeans and Kmedoids are the most typical methods (Kmedoids is a variant of Kmeans). Both methods employ an EMlike algorithm. Thus, the results depend tightly on the initialization, and they are highly unstable for large (i.e. the number of centers or selected samples).
Recently, there are a few methods that assume exemplars are the samples that can best represent the whole dataset. However, for [Yu, Bi, and Tresp2006], the optimization is a combinatorial problem (NPhard) [Nie et al.2013, Yu et al.2008], which is computationally intractable to solve. Besides, the representation loss is measured by the least square measure, which is sensitive to outliers in data [Wang et al.2014, Zhu et al.2014, Nie et al.2013].
Then [Nie et al.2013] improves [Yu, Bi, and Tresp2006] by employing a robust loss via the norm; the norm is applied to samples, and the norm is used for features. In this way, the side effect of outlier samples is relieved. The solver of [Nie et al.2013] is theoretically perfect due to its ability of convergence to global optima. Unfortunately, in terms of computational costs, the solver is highly complex. It takes for one iteration as shown in Table 1. This is infeasible for the case of large (e.g. it takes 2000+ hours for a case of ). Moreover, the representation loss is only robust against outlier samples. Such case is worth improvement, as there may exist outlier elements in real data.
Contributions.
In this paper, we propose an accelerated robust subset selection method to highly raise the speed on the one hand, and to boost the robustness on the other. To this end, we use the norm based robust measure for the representation loss, preventing large errors from dominating our objective. As a result, the robustness against outliers is greatly boosted. Then, based on the observation that data size is generally much larger than feature length, i.e. , we propose a speedup solver. The main acceleration is owing to the Augmented Lagrange Multiplier (ALM) and an equivalent derivation. Via them, we reduce the computational complexity from to . Extensive results on ten benchmark datasets demonstrate that in average, our method is 10,000+ times faster than Nie’s method. The selection quality is highly encouraging as shown in Fig. 1. Additionally, via another equivalent derivation, we give an accelerated solver for Nie’s method, theoretically reducing the computational complexity from to as listed in Table 1, empirically obtaining a 500+ times speedup compared with the authorial solver.
Notations.
We use boldface uppercase letters to denote matrices and boldface lowercase letters to represent vectors. For a matrix
, we denote its row and column as and respectively. The norm of a matrix is defined as . The norm of a matrix is defined as ; thus, we have .Subset Selection via SelfRepresentation
In the problem of subset selection, we are often given a set of unlabelled points , where is the feature length. The goal is to select the top most representative and informative samples (i.e. exemplars) to effectively describe the entire dataset . By solely using these exemplars for subsequent tasks, we could greatly reduce the computational costs and largely alleviate the side effects of outlier elements in data. Such a motivation could be formulated as the Transductive Experimental Design (TED) model [Yu, Bi, and Tresp2006]:
(1) 
where is the selected subset matrix, whose column vectors all come from , i.e. ; is the corresponding linear combination coefficients. By minimizing (1), TED could select the highly informative and representative samples, as they have to well represent all the samples in .
Although TED (1
) is well modeled—very accurate and intuitive, there are two bottlenecks. First, the objective is a combinatorial optimization problem. It is NPhard to exhaustively search the optimal subset
from . For this reason, the author approximate (1) via a sequential optimization problem, which is solved by an inefficient greedy optimization algorithm. Second, similar to the existing least square loss based models in machine learning and statistics, (1) is sensitive to the presence of outliers [Wang et al.2014].Methods  RRSS  RRSS  ARSS 

Complex. 
Accordingly, Nie et al. propose a new model (RRSS):
(2) 
where is a nonnegative parameter; is constrained to be rowsparse, and thus to select the most representative and informative samples [Nie et al.2013]. As the representation loss is accumulated via the norm among samples, compared with (1), the robustness against outlier samples is enhanced. Equivalently, (2) is rewritten in the matrix format:
(3) 
Since the objective (3) is convex in , the global minimum may be found by differentiating (3) and setting the derivative to zero [Levin et al.2008], resulting in a linear system^{1}^{1}1To avoid singular failures, we get , . Then the algorithm is to minimize the objective of . When , this objective is reduced to the objective (3).
(4) 
where is a diagonal matrix with the diagonal entry as and .
It seems perfect to use (4) to solve the objective (3), because (4) looks simple and the global optimum is theoretically guaranteed [Nie et al.2013]. Unfortunately, in terms of speed, (4) is usually infeasible due to the incredible computational demand in the case of large (the number of samples). At each iteration, the computational complexity of (4) is up to , as analyzed in Remark 1. According to our experiments, the time cost is up to 2088 hours (i.e. 87 days) for a subset selection problem of 13000 samples.
Remark 1.
Since , the major computational cost of (4) focuses on a linear system. If solved by the Cholesky factorization method, it costs for factorization as well as for forward and backward substitution. This amounts to in total. By now, we only solve . Once solving all the set of , the total complexity amounts to for one iteration step.
Accelerated Robust Subset Selection (ARSS)
Due to the huge computational costs, Nie’s method is infeasible for the case of large —the computational time is up to 2088 hours for a case of 13000 samples. Besides, Nie’s model (3) imposes the norm among features, which is prone to outliers in features. To tackle the above two issues, we propose a more robust model in the norm. Although the resulted objective is challenging to solve, a speedup algorithm is proposed to dramatically save the computational costs. For the same task of , it costs our method 1.8 minutes, achieving a 68429 times acceleration compared with the speed of Nie’s method.
Modeling.
To boost the robustness against outliers in both samples and features, we formulate the discrepancy between and via the norm. There are theoretical and empirical evidences to verify that compared with or norms, the norm is more able to prevent outlier elements from dominating the objective, enhancing the robustness [Nie et al.2012]. Thus, we have the following objective
(5) 
where is a balancing parameter; is a row sparse matrix, used to select the most informative and representative samples. By minimizing the energy of (5), we could capture the most essential properties of the dataset .
After obtaining the optimal , the row indexes are sorted by the rowsum value of the absolute in decreasing order. The samples specified by the top indexes are selected as exemplars. Note that the model (5
) could be applied to the unsupervised feature selection problem by only transposing the data matrix
. In this case, is a row sparse matrix, used to select the most representative features.Accelerated Solver for the ARSS Objective in (5)
Although objective (5) is challenging to solve, we propose an effective and highly efficient solver. The acceleration owes to the ALM and an equivalent derivation.
Alm
The most intractable challenge of (5) is that, the norm is nonconvex, nonsmooth and notdifferentiable at the zero point. Therefore, it is beneficial to use the Augmented Lagrangian Method (ALM) [Nocedal and Wright2006] to solve (5), resulting in several easily tackled unconstrained subproblems. By solving them iteratively, the solutions of subproblems could eventually converge to a minimum [Li2011, Meng et al.2013].
Specifically, we introduce an auxiliary variable . Thus, the objective (5) becomes:
(6) 
To deal with the equality constraint in (6), the most convenient method is to add a penalty, resulting in
(7) 
where is a penalty parameter. To guarantee the equality constraint, it requires approaching infinity, which may cause bad numerical conditions. Instead, once introducing a Lagrangian multiplier, it is no longer requiring [Li2011, Nocedal and Wright2006]. Thus, we rewrite (7) into the standard ALM formulation as:
(8) 
where consists of Lagrangian multipliers. In the following, a highly efficient solver will be given.
The updating rule for
Similar to the iterative thresholding (IT) in [Wright et al.2009, Nie et al.2014], the degree of violations of the equality constraints are used to update the Lagrangian multiplier:
(9) 
where is a monotonically increasing parameter over iteration steps. For example, , where is a predefined parameter [Nocedal and Wright2006].
Efficient solver for
Removing irrelevant terms with from (8), we have
(10) 
where . According to the definition of the norm and the Frobeniusnorm, (10) could be decoupled into independent and unconstrained subproblems. The standard form of these subproblems is
(11) 
where is a given positive parameter, is the scalar variable need to deal with, is a known scalar constant.
Zuo et al. [Zuo et al.2013] has recently proposed a generalized iterative shrinkage algorithm to solve (11). This algorithm is easy to implement and able to achieve more accurate solutions than current methods. Thus, we use it for our problem as:
(12) 
where
; is obtained by solving the following equation:
which could be solved efficiently via an iterative algorithm. In this manner, (10) could be sovled extremely fast.
Accelerated solver for
The main acceleration focuses on the solver of . Removing irrelevant terms with from (8), we have
(13) 
where is a nonnegative parameter, . Since (13) is convex in , the optimum could be found by differentiating (13) and setting the derivative to zero. This amounts to tackling the following linear system^{2}^{2}2 is a positive and diagonal matrix with the diagonal entry as , where is a small value to avoid singular failures [Nie et al.2013, Zhu et al.2014]. :
(14) 
As , (14) is mainly a linear system. Once solved by the Cholesky factorization, the computational complexity is highly up to . This is by no means a good choice for real applications with large . In the following, an equivalent derivation of (14) will be proposed to significantly save the computational complexity.
Theorem 2.
The linear system (14) is equivalent to the following linear system:
(15) 
where is a identity matrix.
Proof.
Note that is a diagonal and positivedefinite matrix, the exponent of is efficient to achieve, i.e. . We have the following equations
(16) 
where , is a identity matrix. The following equation holds for any conditions
(17) 
Multiplying (17) with on the left and on the right of both sides of the equalsign, we have the equation as:
(18) 
Therefore, substituting (18) and into (16), we have the simplified updating rule as:
(19) 
When , the most complex operation is the matrix multiplications, not the linear system.∎
Corollary 3.
The solver to update is given in Algorithm 1. The overall solver for our model (5) is summarized in Algorithm 2.
According to Theorem 2 and Corollary 3, the solver for our model (13) is highly simplified, as feature length is generally much smaller than data size, i.e . Similarly, Nie’s method could be highly accelerated by Theorem 4, obtaining 500+ times speedup, as shown in Fig. 2 and Table 3.
Theorem 4.
Nie’s solver (20) [Nie et al.2013] is equivalent to the following linear system (21)
(20)  
(21) 
, where is a identity matrix.
Proof.
Based on (20), we have the following equalities:
The derivations are equivalent; their results are equal. ∎
Corollary 5.
Since feature length is generally much smaller than data size, i.e. , our accelerated solver (20) for Nie’s model (3) is highly faster than the authorial solver (21). Theoretically, we reduce the computational complexity from to , while maintaining the same solution. That is, like Nie’s solver (20), our speedup solver (21) can reach the global optimum. Extensive empirical results will verify the huge acceleration
Experiments
Experimental Settings
In this part, the experimental settings are introduced. All experiments are conducted on a server with 64core Intel Xeon E74820 @ 2.00 GHz, 18 Mb Cache and 0.986 TB RAM, using Matlab 2012. Brief descriptions of ten benchmark datasets are summarized in Table 2, where ‘Total’ denotes the total set of samples in each data. Due to the high computational complexity, other methods can only handle small datasets (while our method can handle the total set). Thus, we randomly choose the candidate set from the total set to reduce the sample size, i.e. (cf. ‘Total’ and ‘candid.’ in Table 2). The remainder (except candidate set) are used for test. Specifically, to simulate the varying quality of samples, ten percentage of candidate samples from each class are randomly selected and arbitrarily added one of the following three kinds of noise: “Gaussian”, “Laplace” and “Salt & pepper” respectively. In a word, all experiment settings are same and fair for all the methods.
Speed Comparisons
There are two parts of speed comparisons. First, how speed varies with increasing is illustrated in Fig. 2. Then the comparison of specific speed is summarized in Table 3. Note that TED and RRSS denote the authorial solver (via authorial codes); RRSS is our accelerated solver for Nie’s model via Theorem 4; ARSS is the proposed method.
Speed vs. increasing
To verify the great superiority of our method over the stateoftheart methods in speed, three experiments are conducted. The results are illustrated in Fig. 2, where there are three subfigures showing the speed of four methods on the benchmark datasets of Letter, MNIST and Waveform respectively. As we shall see, both selection time of TED [Yu, Bi, and Tresp2006] and RRSS [Nie et al.2013] increases dramatically as increases. No surprisingly, RRSS is incredibly timeconsuming as grows—the order of curves looks higher than quadratic. Actually, the theoretical complexity of RRSS is highly up to as analyzed in Remark 1.
Compared with TED and RRSS, the curve of ARSS is surprisingly lower and highly stable against increasing ; there is almost no rise of selection time over growing . This is owing to the speedup techniques of ALM and equivalent derivations. Via them, we reduce the computational cost from to , as analyzed in Theorem 2 and Corollary 3. Moreover, with the help of Theorem 4, RRSS is the second faster algorithm that is significantly accelerated compared with the authorial algorithm RRSS.
Speed with fixed
The speed of four algorithms is summarized in Table 3a, where each row shows the results on one dataset and the last row displays the average results. Four conclusions can be drawn from Table 3a. First, ARSS is the fastest algorithm, and RRSS is the second fastest algorithm. Second, with the help of Theorem 4, RRSS is highly faster than RRSS, averagely obtaining a 559 times acceleration. Third, ARSS is dramatically faster than RRSS and TED; the results in Table 3a verify an average acceleration of 23275 times faster than RRSS and 281 times faster than TED. This means that for example if it takes RRSS 100 years to do a subset selection task, it only takes our method 1.6 days to address the same problem. Finally, we apply ARSS to the whole sample set of each data. The results are displayed in the column in Table 3, showing its capability to process very large datasets.
Prediction Accuracy
Accuracy comparison
We conduct experiments on ten benchmark datasets. For each dataset, the top 200 representative samples are selected for training. The prediction accuracies are reported in Table 3
b, including the results of two popular classifiers. Three observations can be drawn from this table. First, Linear SVM generally outperforms KNN. Second, in general, our method performs the best; for a few cases, our method achieves comparable results with the best performances. Third, compared with TED, both RRSS and ARSS achieve an appreciable advantage. The above analyses are better illustrated in the last row of Table
3b. These results demonstrate that the loss in our model is well suited to select exemplars from the sample sets of various quality.Prediction accuracies vs. increasing
To give a more detailed comparison, Fig. 3 shows the prediction accuracies versus growing (the number of selected samples). There are two rows and four columns of subfigures. The top row shows the results of KNN, and the bottom one shows results of SVM. Each column gives the result on one dataset. As we shall see, the prediction accuracies generally increase as increases. Such case is consistent with the common view that more training data will boost the prediction accuracy. For each subfigure, ARSS is generally among the best. This case implies that our robust objective (5) via the norm is feasible to select subsets from the data of varying qualities.
Conclusion
To deal with tremendous data of varying quality, we propose an accelerated robust subset selection (ARSS) method. The norm is exploited to enhance the robustness against both outlier samples and outlier features. Although the resulted objective is complex to solve, we propose a highly efficient solver via two techniques: ALM and equivalent derivations. Via them, we greatly reduce the computational complexity from to . Here feature length is much smaller than data size , i.e. . Extensive results on ten benchmark datasets verify that our method not only runs 10,000+ times faster than the most related method, but also outperforms state of the art methods. Moreover, we propose an accelerated solver to highly speed up Nie’s method, theoretically reducing the computational complexity from to . Empirically, our accelerated solver could achieve equal results and 500+ times acceleration compared with the authorial solver.
Limitation.
Our efficient algorithm build on the observation that the number of samples is generally larger than feature length, i.e. . For the case of , the acceleration will be inapparent.
Acknowledgements
The authors would like to thank the editor and the reviewers for their valuable suggestions. Besides, this work is supported by the projects (Grant No. 61272331, 91338202, 61305049 and 61203277) of the National Natural Science Foundation of China.
References
 [Bien and Tibshirani2011] Bien, J., and Tibshirani, R. 2011. Prototype selection for interpretable classification. Annals of Applied Statistics 5(4):2403–2424.
 [Boutsidis, Mahoney, and Drineas2009] Boutsidis, C.; Mahoney, M. W.; and Drineas, P. 2009. An improved approximation algorithm for the column subset selection problem. In SODA, 968–977.
 [Chan1987] Chan, T. F. 1987. Rank revealing {QR} factorizations. Linear Algebra and its Applications 88–89(0):67 – 82.
 [Elhamifar et al.2012a] Elhamifar, E.; Sapiro, G.; Vidal, R.; and Vidal, R. 2012a. See all by looking at a few: Sparse modeling for finding representative objects. In IEEE CVPR, 1600–1607.
 [Elhamifar et al.2012b] Elhamifar, E.; Sapiro, G.; Vidal, R.; and Vidal, R. 2012b. Finding exemplars from pairwise dissimilarities via simultaneous sparse recovery. In NIPS, 19–27.
 [Frey and Dueck2007] Frey, B. J., and Dueck, D. 2007. Clustering by passing messages between data points. Science 315(5814):972–976.
 [Garcia et al.2012] Garcia, S.; Derrac, J.; Cano, J. R.; and Herrera, F. 2012. Prototype selection for nearest neighbor classification: Taxonomy and empirical study. IEEE Trans. Pattern Anal. Mach. Intell. 34(3):417–435.
 [Jenatton, Audibert, and Bach2011] Jenatton, R.; Audibert, J.Y.; and Bach, F. 2011. Structured variable selection with sparsityinducing norms. Journal of Machine Learning Research (JMLR) 12:2777–2824.
 [Levin et al.2008] Levin, A.; Lischinski, D.; Weiss, Y.; and Weiss, Y. 2008. A closedform solution to natural image matting. IEEE Trans. Pattern Anal. Mach. Intell. 30(2):228–242.
 [Li2011] Li, C. 2011. Compressive Sensing for 3D Data Processing Tasks: Applications, Models and Algorithms. Ph.D. Dissertation, Houston, TX, USA. AAI3524544.
 [Liu et al.2010] Liu, G.; Lin, Z.; Yu, Y.; and Yu, Y. 2010. Robust subspace segmentation by lowrank representation. In ICML, 663–670.
 [Meng et al.2013] Meng, G.; Wang, Y.; Duan, J.; Xiang, S.; and Pan, C. 2013. Efficient image dehazing with boundary constraint and contextual regularization. In ICCV, 617–624.
 [Nie et al.2012] Nie, F.; Wang, H.; Cai, X.; Huang, H.; and Ding, C. 2012. Robust matrix completion via joint schatten pnorm and lpnorm minimization. In IEEE ICDM, 566–574.

[Nie et al.2013]
Nie, F.; Wang, H.; Huang, H.; and Ding, C. H. Q.
2013.
Early active learning via robust representation and structured sparsity.
In IJCAI, 1572–1578.  [Nie et al.2014] Nie, F.; Huang, Y.; Wang, X.; and Huang, H. 2014. New primal svm solver with linear computational cost for big data classifications. In ICML.
 [Nocedal and Wright2006] Nocedal, J., and Wright, S. J. 2006. Numerical Optimization. New York: Springer, 2nd edition.
 [Wang et al.2014] Wang, H.; Nie, F.; Huang, H.; and Huang, H. 2014. Robust distance metric learning via simultaneous l1norm minimization and maximization. In ICML, 1836–1844.
 [Wang, Kumar, and Chang2012] Wang, J.; Kumar, S.; and Chang, S.F. 2012. Semisupervised hashing for largescale search. IEEE Trans. Pattern Anal. Mach. Intell. 34.

[Wright et al.2009]
Wright, J.; Ganesh, A.; Rao, S.; Peng, Y.; and Ma, Y.
2009.
Robust principal component analysis: Exact recovery of corrupted lowrank matrices via convex optimization.
In NIPS, 2080–2088.  [Yu et al.2008] Yu, K.; Zhu, S.; Xu, W.; ; and Gong, Y. 2008. Nongreedy active learning for text categorization using convex transductive experimental design. In SIGIR, 1081–1088.
 [Yu, Bi, and Tresp2006] Yu, K.; Bi, J.; and Tresp, V. 2006. Active learning via transductive experimental design. In ICML, 1081–1088.
 [Zhu et al.2014] Zhu, F.; Wang, Y.; Fan, B.; Meng, G.; and Pan, C. 2014. Effective spectral unmixing via robust representation and learningbased sparsity. arXiv :1409.0685.
 [Zuo et al.2013] Zuo, W.; Meng, D.; Zhang, L.; Feng, X.; and Zhang, D. 2013. A generalized iterated shrinkage algorithm for nonconvex sparse coding. In IEEE ICCV, 217–224.
Comments
There are no comments yet.