1 Introduction
The availability of large annotated datasets in computer vision, such as ImageNet, has led to many recent breakthroughs in object detection and classification using supervised learning techniques such as deep learning. However, as data sizes continue to grow, it has become difficult to annotate the data for training fully supervised algorithms. As a consequence, the development of unsupervised learning techniques that can learn from
unlabeled datasets has become extremely important. In addition to the challenge introduced by the sheer volume of data, the number of data samples in unlabeled datasets usually varies widely for different classes. For example, a street sign database collected from street view images may contain drastically different numbers of instances for different types of signs, since not all of them are used on streets with the same frequency; a handwritten letter database may be highly imbalanced, as the frequency of different letters in English text varies significantly (see Figure 1). An imbalanced data distribution is known to compromise the performance of canonical supervised [1] and unsupervised [2] learning techniques.

We exploit the idea of exemplar selection to address the challenge of learning from an unlabeled dataset. Exemplar selection refers to the problem of selecting a set of data representatives, or exemplars, from the data. It has been a particularly useful approach for scaling up existing data clustering algorithms so that they can handle large datasets more efficiently [3]. Finding an exemplar set that is informative of the entire dataset is often the key challenge for the success of such approaches. In particular, when the data is drawn from several different groups, it is crucial that an algorithm selects enough samples from each of the groups without prior knowledge of which points belong to which groups. This can be especially difficult when the data is imbalanced, as an algorithm is more likely to select data from overrepresented groups than from underrepresented groups.
Exemplar selection is also useful when one has limited resources so that only a small subset of data can be labeled. In such cases, exemplar selection can determine the subset to be manually labeled, and then used to train a model to infer labels for the remaining data [4]. The ability to correctly classify as many of the unlabeled data points as possible depends critically on the quality of the selected exemplars.
Some of the most popular methods for exemplar selection include k-centers and k-medoids, which search for the set of centers or medoids that best fits the data under the assumption that data points concentrate around a few discrete points. However, much high-dimensional image and video data is distributed across low-dimensional subspaces [5, 6], in which case discrete-center-based methods become ineffective. In this paper, we consider exemplar selection under a model where the data points lie close to a collection of unknown low-dimensional subspaces. One line of work that can address such a problem is based on the assumption that each data point can be expressed by a few data representatives with small reconstruction residual. This includes simultaneous sparse representation [7] and dictionary selection [8, 9], which use greedy algorithms to solve their respective optimization problems, and group sparse representative selection [10, 11, 12, 13, 14, 15], which uses a convex optimization approach based on group sparsity. In particular, the analysis in [12] shows that when data come from a union of subspaces, their method is able to select a few representatives from each of the subspaces. However, methods in this category cannot effectively handle large-scale data, as they have quadratic complexity in the number of points. Moreover, convex optimization based methods such as that in [12] are not flexible in selecting a desired number of representatives, since the size of the subset cannot be directly controlled by adjusting an algorithm parameter.
1.1 Paper contributions
We present a data self-representation based exemplar selection algorithm for learning from large-scale and imbalanced data in an unsupervised manner. Our method is based on the self-expressiveness property of data in a union of subspaces [16], which states that each data point in a union of subspaces can be written as a linear combination of other points from its own subspace. That is, given data $\mathcal{X} = \{x_1, \ldots, x_N\} \subseteq \mathbb{R}^D$, for each $x_j$ there exists a coefficient vector $c_j \in \mathbb{R}^N$ such that $x_j = X c_j$ (where $X := [x_1, \ldots, x_N]$) and the $i$-th entry of $c_j$ is nonzero only if $x_i$ and $x_j$ are from the same subspace. Such representations are called subspace-preserving. In particular, if the subspace dimensions are small, then the representations can be taken to be sparse. Based on this observation, [16] proposes the Sparse Subspace Clustering (SSC) method, which computes for each $x_j \in \mathcal{X}$ the vector $c_j$ as a solution to the sparse optimization problem

$$\min_{c_j \in \mathbb{R}^N} \|c_j\|_1 + \frac{\lambda}{2}\|x_j - X c_j\|_2^2, \quad (1)$$

where $\lambda > 1$. In [16], the solution to (1) is used to define an affinity between any pair of points $x_i$ and $x_j$ as $|c_{ij}| + |c_{ji}|$, and then spectral clustering is applied to generate a segmentation of the data points into their respective subspaces. Existing theoretical results show that, under certain assumptions on the data, the solution to (1) is subspace-preserving [17, 18, 19, 20, 21, 22, 23, 24, 25], thus justifying the correctness of the affinity produced by SSC.

While the nonzero entries of each $c_j$ determine a subset of $\mathcal{X}$ that can represent $x_j$ with the minimum $\ell_1$ norm on the coefficients, the union of these representations over all $j$ often uses the entire dataset $\mathcal{X}$. In this paper, we propose to find a small subset $\mathcal{X}_0 \subseteq \mathcal{X}$, which we call exemplars, such that solutions to the problem
$$\min_{c} \|c\|_1 + \frac{\lambda}{2}\|x_j - X_0 c\|_2^2 \quad (2)$$
are also subspace-preserving, where $X_0$ denotes the matrix whose columns are the points in $\mathcal{X}_0$. Since $\mathcal{X}_0$ is a small subset of $\mathcal{X}$, solving the optimization problem (2) is computationally much cheaper than solving (1). Computing an appropriate $\mathcal{X}_0$ through an exhaustive search would be computationally impractical. To address this issue, we present an efficient exemplar selection algorithm that iteratively selects the worst-represented point from the data to form $\mathcal{X}_0$. Our exemplar selection procedure is then used to design an exemplar-based subspace clustering approach (assuming that the exemplars are unlabeled) [26] and an exemplar-based classification approach (assuming that the exemplars are labeled) by exploiting the representative power of the selected exemplars. In summary, our work makes the following contributions relative to the state of the art:


We present a geometric interpretation of our exemplar selection algorithm as one of finding a subset of the data that best covers the entire dataset as measured by the Minkowski functional of the subset. When the data lies in a union of independent subspaces, we prove that our method selects sufficiently many representative data points (exemplars) from each subspace, even when the dataset is imbalanced. Unlike prior methods such as [12], our method has linear execution time and memory complexity in the number of data points for each iteration, and can be terminated when the desired number of exemplars have been selected.

We show that the exemplars selected by our method can be used for subspace clustering by first computing the representation of each data point with respect to the exemplars as in (2), then constructing a nearest neighbor graph on the representation vectors, and finally applying spectral clustering. Compared to SSC, the exemplar-based subspace clustering method is empirically less sensitive to imbalanced data and more efficient on large-scale datasets (see Figure 2). Experimental results on the large-scale and label-imbalanced handwritten letter dataset EMNIST and the street sign dataset GTSRB show that our method outperforms state-of-the-art algorithms in terms of both clustering performance and running time.

We show that a classifier trained on the exemplars selected by our model (assuming that the labels of the exemplars are provided) is able to correctly classify the rest of the data points. We demonstrate through experiments on the Extended Yale B face database that exemplars selected by our method produce higher classification accuracy when compared to several popular exemplar selection methods.
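To make the classification use of labeled exemplars concrete, the following sketch represents each test point over a dictionary of labeled exemplars by solving the sparse problem in (2) with a basic proximal gradient (ISTA) solver, and assigns the label whose exemplars receive the most coefficient mass. This voting rule is in the spirit of sparse representation-based classification and is an illustrative assumption on our part, not necessarily the classifier used in the experiments; the solver and the value of λ are likewise assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sparse_code(X0, x, lam=10.0, n_iter=300):
    """Approximately solve min_c ||c||_1 + lam/2 ||x - X0 c||_2^2 via ISTA."""
    c = np.zeros(X0.shape[1])
    step = 1.0 / (lam * np.linalg.norm(X0, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(n_iter):
        c = soft_threshold(c + step * lam * (X0.T @ (x - X0 @ c)), step)
    return c

def classify(X0, labels0, x, lam=10.0):
    """Assign x the label whose exemplars carry the most coefficient mass."""
    weights = np.abs(sparse_code(X0, x, lam))
    labels0 = np.asarray(labels0)
    return max(sorted(set(labels0)), key=lambda lbl: weights[labels0 == lbl].sum())

# Toy example: exemplars spanning two orthogonal 2-dimensional subspaces.
exemplars = np.eye(4)                  # columns e1, e2 (class A) and e3, e4 (class B)
labels = ["A", "A", "B", "B"]
q = np.array([0.6, 0.8, 0.0, 0.0])     # a point from span(e1, e2)
print(classify(exemplars, labels, q))  # prints "A"
```

Because the sparse code concentrates its support on exemplars from the query's own subspace, the coefficient mass acts as a subspace membership score.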
We remark that a conference version of this paper appeared in the proceedings of the European Conference on Computer Vision (ECCV) in 2018 [26]. In comparison to the conference version, which focuses on the problem of subspace clustering on imbalanced data, the current paper addresses the problem of exemplar selection, which has a broader range of applications, including data summarization, clustering, and classification tasks. With additional technical results and experimental evaluation, the current paper provides a more comprehensive study of the subject.
1.2 Related work
Exemplar selection. Two of the most popular methods for exemplar selection are k-centers and k-medoids. The k-centers problem is a data clustering problem studied in theoretical computer science and operations research. Given a set $\mathcal{X}$ and an integer $k$, the goal is to find a set of centers $\mathcal{X}_0$ with $|\mathcal{X}_0| = k$ that minimizes the quantity $\max_{x \in \mathcal{X}} d^2(x, \mathcal{X}_0)$, where $d^2(x, \mathcal{X}_0)$ is the squared distance of $x$ to the closest point in $\mathcal{X}_0$. A partition of $\mathcal{X}$ is then given by assigning each point to its closest center. The k-medoids problem is a variant of k-centers that minimizes the sum of the squared distances, i.e., minimizes $\sum_{x \in \mathcal{X}} d^2(x, \mathcal{X}_0)$ instead of the maximum distance. However, both k-centers and k-medoids model data as concentrating around several cluster centers, and do not generally apply to data lying in a union of subspaces.
In general, selecting a representative subset of the entire data has been studied in a wide range of contexts, such as Determinantal Point Processes [27, 28, 29], Prototype Selection [30, 31], Rank Revealing QR [32], Column Subset Selection (CSS) [33, 34, 35, 36], separable Nonnegative Matrix Factorization (NMF) [37, 38, 39], and so on [40]. In particular, both CSS and separable NMF can be interpreted as finding exemplars such that each data point can be expressed as a linear combination of such exemplars. However, these methods do not impose sparsity on the representation coefficients, and therefore cannot be used to select good representatives from data that is drawn from a union of low-dimensional subspaces.
Subspace clustering on imbalanced and large-scale data.
Subspace clustering aims to cluster data points drawn from a union of subspaces into their respective subspaces. Recently, self-expressiveness based subspace clustering methods such as SSC and its variants [41, 42, 43, 44, 45, 46, 47] have achieved great success in many computer vision tasks such as face clustering, handwritten digit clustering, and so on. Nonetheless, previous experimental evaluations focused primarily on balanced datasets, i.e., datasets with approximately the same number of samples from each cluster. In practice, datasets are often imbalanced, and such skewed data distributions can significantly compromise the clustering performance of SSC. To the best of our knowledge, there has been no study of this issue in the literature.
Another issue with many self-expressiveness based subspace clustering methods is that they are limited to small or medium scale datasets [48]. Several works have addressed the scalability issue by computing a dictionary whose number of atoms is much smaller than the total number of data points in $\mathcal{X}$, and expressing each data point in $\mathcal{X}$ as a linear combination of the atoms in the dictionary (the dictionary is usually not a subset of $\mathcal{X}$). In particular, [49] shows that if the atoms in the dictionary happen to lie in the same union of subspaces as the input data $\mathcal{X}$, then this approach is guaranteed to be correct. However, there is little evidence that such a condition is satisfied for real data, as the atoms of the dictionary are not constrained to be a subset of $\mathcal{X}$. Another recent work [50], which uses data-independent random matrices as dictionaries, also suffers from this issue and lacks correctness guarantees. More recently, several works [51, 52, 53] have used exemplar selection to form the dictionary for subspace clustering, but they lack theoretical justification that their selected exemplars represent the subspaces.
2 Self-Representation Based Unsupervised Exemplar Selection
In this section, we present our self-representation based method for exemplar selection from an unlabeled dataset $\mathcal{X} = \{x_1, \ldots, x_N\} \subseteq \mathbb{R}^D$, whose points are assumed to have unit $\ell_2$ norm. (This is not a strong assumption, as one can always normalize the data points in a preprocessing step.) We first formulate the model for selecting a subset of exemplars from $\mathcal{X}$ in Section 2.1 as minimizing a self-representation cost. Since the model is a combinatorial optimization problem, we then present an efficient algorithm for solving it approximately in Section 2.2.

2.1 A self-representation cost for exemplar selection
In our exemplar selection model, the goal is to find a small subset $\mathcal{X}_0 \subseteq \mathcal{X}$ that can linearly represent all data points in $\mathcal{X}$. In particular, the set $\mathcal{X}_0$ should contain exemplars from each subspace such that the solution to (2) for each data point is subspace-preserving. Next, we define a cost function based on the optimization problem in (2) and then present our exemplar selection model.
Definition 1 (Self-representation cost function).
Given $\mathcal{X}_0 \subseteq \mathcal{X}$, we define the self-representation cost function as

$$F_\lambda(\mathcal{X}_0) := \max_{x_j \in \mathcal{X}} f_\lambda(\mathcal{X}_0; x_j), \quad (3)$$

where

$$f_\lambda(\mathcal{X}_0; x) := \min_{c} \|c\|_1 + \frac{\lambda}{2}\|x - X_0 c\|_2^2, \quad (4)$$

$X_0$ denotes the matrix whose columns are the points in $\mathcal{X}_0$, and $\lambda > 1$ is a parameter. By convention, we define $f_\lambda(\emptyset; x_j) = \lambda/2$ for all $x_j \in \mathcal{X}$, where $\emptyset$ is the empty set.
The quantity $f_\lambda(\mathcal{X}_0; x_j)$ is a measure of how well the data point $x_j$ is represented by the subset $\mathcal{X}_0$. The function $f_\lambda$ has the following properties.
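As a concrete numerical illustration of the cost in (4), the sketch below evaluates $f_\lambda(\mathcal{X}_0; x)$ with a simple proximal gradient (ISTA) solver. The solver choice, the value of λ, and the iteration count are our own illustrative assumptions, not part of the method's specification.

```python
import numpy as np

def f_lambda(X0, x, lam=10.0, n_iter=500):
    """Approximately evaluate f_lambda(X0; x) = min_c ||c||_1 + lam/2 ||x - X0 c||_2^2.
    X0 is a D-by-M matrix whose columns are the (unit-norm) points of the subset."""
    if X0.shape[1] == 0:                       # convention: f_lambda(empty set; x) = lam/2
        return lam / 2.0 * float(x @ x)
    c = np.zeros(X0.shape[1])
    step = 1.0 / (lam * np.linalg.norm(X0, 2) ** 2)       # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = -lam * (X0.T @ (x - X0 @ c))               # gradient of the smooth term
        z = c - step * grad
        c = np.sign(z) * np.maximum(np.abs(z) - step, 0.0)  # soft thresholding
    return float(np.abs(c).sum() + lam / 2.0 * np.linalg.norm(x - X0 @ c) ** 2)

lam = 10.0
x = np.array([1.0, 0.0, 0.0])
print(f_lambda(np.empty((3, 0)), x, lam))  # lam/2 = 5.0: the empty set represents nothing
print(f_lambda(x[:, None], x, lam))        # close to 1 - 1/(2*lam) = 0.95: x is its own exemplar
```

The two printed values illustrate the extremes: the cost is largest when the subset is empty and smallest when the point (or its negative) is itself an exemplar, and enlarging the subset can never increase the cost.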
Lemma 1.
For each $x_j \in \mathcal{X}$, the function $\mathcal{X}_0 \mapsto f_\lambda(\mathcal{X}_0; x_j)$ is monotone with respect to the partial order defined by set inclusion, i.e., $f_\lambda(\mathcal{X}_2; x_j) \le f_\lambda(\mathcal{X}_1; x_j)$ for any $\mathcal{X}_1 \subseteq \mathcal{X}_2 \subseteq \mathcal{X}$.
Proof.
Let $\mathcal{X}_1 \subseteq \mathcal{X}_2 \subseteq \mathcal{X}$, and let $c^*$ be an optimal solution to the problem in (4) defining $f_\lambda(\mathcal{X}_1; x_j)$. Define $\bar{c}$ to be the vector that agrees with $c^*$ on the entries corresponding to points in $\mathcal{X}_1$ and is zero on the entries corresponding to points in $\mathcal{X}_2 \setminus \mathcal{X}_1$. Then $\bar{c}$ is feasible for the problem defining $f_\lambda(\mathcal{X}_2; x_j)$ and attains the same objective value, so that

$$f_\lambda(\mathcal{X}_2; x_j) \le \|\bar{c}\|_1 + \frac{\lambda}{2}\|x_j - X_1 c^*\|_2^2 = f_\lambda(\mathcal{X}_1; x_j),$$

which is the desired result. ∎
Lemma 2.
For each $x_j \in \mathcal{X}$ the following hold: (i) for every $\mathcal{X}_0 \subseteq \mathcal{X}$ the inclusion $f_\lambda(\mathcal{X}_0; x_j) \in [1 - \frac{1}{2\lambda}, \frac{\lambda}{2}]$ holds; (ii) $f_\lambda(\{x_j\}; x_j) = 1 - \frac{1}{2\lambda}$; and (iii) $f_\lambda(\mathcal{X}_0; x_j) = 1 - \frac{1}{2\lambda}$ if and only if at least one of $x_j$ or $-x_j$ is in $\mathcal{X}_0$.
Proof.
First observe that if $\mathcal{X}_0 = \emptyset$, then it follows from Definition 1 that $f_\lambda(\mathcal{X}_0; x_j) = \lambda/2$. Second, consider the case $x_j \in \mathcal{X}_0$. In this case, define $c$ to be the one-hot vector with the entry corresponding to $x_j$ equal to $1 - 1/\lambda$ and all other entries zero. One can then verify that $f_\lambda(\mathcal{X}_0; x_j) \le 1 - \frac{1}{2\lambda}$ (by recalling the assumption that $\|x_j\|_2 = 1$). Combining these two cases with Lemma 1 establishes that parts (i) and (ii) hold.
For the “if” direction of part (iii), let either $x_j \in \mathcal{X}_0$ or $-x_j \in \mathcal{X}_0$. Define $c$ as a one-hot vector whose corresponding entry is $1 - 1/\lambda$ if $x_j \in \mathcal{X}_0$, and $-(1 - 1/\lambda)$ if $-x_j \in \mathcal{X}_0$; in either case all other entries are set to zero. One can then verify that $f_\lambda(\mathcal{X}_0; x_j) \le 1 - \frac{1}{2\lambda}$, which completes the proof for this direction.
To prove the “only if” direction, suppose that . Let us define
and . From the optimality conditions, it follows that for all such that . Using this fact, the assumption that the data is normalized, and basic properties of norms, we have
(5) 
From (5), and definition of , we have
(6)  
where the last inequality follows by computing the minimum value of . It follows that equality is achieved for all inequalities in (6). By requiring equality for the second and first inequalities in (6), we get respectively,
(7) 
Since (7) implies , we can conclude that all of the inequalities in (5) must actually be equalities. Using this fact and (5) we have that
(8) 
Define . From definition of , (7), the fact that the data is normalized, and (8), we have
(9) 
For the second term on the right hand side of (9), we may use the fact that the data is normalized, definition of , and (7) to conclude that
Plugging this into (9) yields
(10) 
which after simplification shows that
(11) 
Recall that $\lambda > 1$ (see Definition 1). Therefore, from (11) we see that $|\langle x_j, x_i \rangle| = 1$. Since both $x_j$ and $x_i$ have unit norm, we conclude that $x_i = \pm x_j$, i.e., that either $x_j$ or $-x_j$ must be in $\mathcal{X}_0$, as desired. ∎
Observe that if $\mathcal{X}_0$ contains enough exemplars from the subspace containing $x_j$ and a solution to the optimization problem in (4) is subspace-preserving, then it is expected that the solution will be sparse and that the residual will be close to zero. This suggests that we should select the subset $\mathcal{X}_0$ such that the value $f_\lambda(\mathcal{X}_0; x_j)$ is small for all $x_j \in \mathcal{X}$. As the value $F_\lambda(\mathcal{X}_0)$ is achieved by the data point that has the largest value of $f_\lambda(\mathcal{X}_0; x_j)$, we propose to perform exemplar selection by searching for a subset that minimizes the self-representation cost function, i.e.,
$$\mathcal{X}_0^* \in \operatorname*{argmin}_{\mathcal{X}_0 \subseteq \mathcal{X}:\, |\mathcal{X}_0| \le k} F_\lambda(\mathcal{X}_0), \quad (12)$$
where $k$ is the target number of exemplars. The objective function in (12) is monotone, as shown next.
Lemma 3.
If $\mathcal{X}_1 \subseteq \mathcal{X}_2 \subseteq \mathcal{X}$, then $F_\lambda(\mathcal{X}_1) \ge F_\lambda(\mathcal{X}_2)$.
Proof. The result follows directly from Lemma 1 by taking the maximum over $x_j \in \mathcal{X}$ on both sides. ∎
2.2 A Farthest First Search (FFS) algorithm
Solving the optimization problem (12) is NP-hard in general, as it requires evaluating $F_\lambda(\mathcal{X}_0)$ for each subset $\mathcal{X}_0 \subseteq \mathcal{X}$ of size at most $k$. In Algorithm 1 below, we present a greedy algorithm for efficiently computing an approximate solution to (12). The algorithm progressively grows a candidate subset $\mathcal{X}_0$ (initialized as the empty set) until it reaches the desired size $k$. During each iteration, step 3 of the algorithm selects the point that is worst represented by the current subset, as measured by $f_\lambda(\mathcal{X}_0; \cdot)$. It was shown in Lemma 2 that $f_\lambda(\mathcal{X}_0; x_j) = 1 - \frac{1}{2\lambda}$ if $x_j \in \mathcal{X}_0$ or $-x_j \in \mathcal{X}_0$, and $f_\lambda(\mathcal{X}_0; x_j) > 1 - \frac{1}{2\lambda}$ otherwise. Thus, during each iteration an element not already in $\mathcal{X}_0$ is added to $\mathcal{X}_0$, provided that $\max_j f_\lambda(\mathcal{X}_0; x_j)$ is sufficiently large. When the algorithm terminates, the output contains exactly $k$ distinct exemplars from $\mathcal{X}$.
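The greedy selection loop can be sketched as follows. This is our own illustrative implementation, with an ISTA solver standing in for the exact evaluation of $f_\lambda$; the value of λ and the solver settings are assumptions.

```python
import numpy as np

def f_lambda(X0, x, lam=10.0, n_iter=200):
    """Approximate f_lambda(X0; x) = min_c ||c||_1 + lam/2 ||x - X0 c||_2^2 (ISTA)."""
    if X0.shape[1] == 0:
        return lam / 2.0
    c = np.zeros(X0.shape[1])
    step = 1.0 / (lam * np.linalg.norm(X0, 2) ** 2)
    for _ in range(n_iter):
        z = c + step * lam * (X0.T @ (x - X0 @ c))
        c = np.sign(z) * np.maximum(np.abs(z) - step, 0.0)
    return np.abs(c).sum() + lam / 2.0 * np.linalg.norm(x - X0 @ c) ** 2

def ffs(X, k, lam=10.0):
    """Farthest First Search: greedily add the currently worst-represented point.
    X is D-by-N with unit-norm columns; returns the indices of k exemplars."""
    selected = []
    for _ in range(k):
        X0 = X[:, selected]
        costs = [f_lambda(X0, X[:, j], lam) for j in range(X.shape[1])]
        selected.append(int(np.argmax(costs)))
    return selected
```

On data drawn from two orthogonal subspaces with very different numbers of samples, the loop selects exemplars from both subspaces within the first few iterations, since every point of a not-yet-covered subspace keeps the maximal cost λ/2.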
We also note that the FFS algorithm can be viewed as an extension of the farthest first traversal algorithm (see, e.g., [54]), which is an approximation algorithm for the k-centers problem discussed in Section 1.2.
Efficient implementation. Observe that each iteration of Algorithm 1 requires evaluating $f_\lambda(\mathcal{X}_0; x_j)$ for every $x_j \in \mathcal{X}$. Therefore, the complexity of Algorithm 1 is linear in the number of data points, assuming $k$ is fixed and small. However, computing $f_\lambda(\mathcal{X}_0; x_j)$ itself is not easy, as it requires solving a sparse optimization problem. Next, we introduce an efficient implementation of Algorithm 1 that accelerates the procedure by eliminating the need to compute $f_\lambda(\mathcal{X}_0; x_j)$ for some of the points $x_j$ in each iteration.
The idea underpinning the computational savings in Algorithm 2 is the monotonicity of $f_\lambda$ (see Lemma 1). That is, for any $\mathcal{X}_1 \subseteq \mathcal{X}_2 \subseteq \mathcal{X}$ we have $f_\lambda(\mathcal{X}_2; x_j) \le f_\lambda(\mathcal{X}_1; x_j)$. Since in the FFS algorithm the set of selected exemplars is progressively increased, this implies that $f_\lambda(\mathcal{X}_0; x_j)$ is nonincreasing over the iterations. In step 2 we initialize an upper bound for each $x_j \in \mathcal{X}$, which remains an upper bound on $f_\lambda(\mathcal{X}_0; x_j)$ in all subsequent iterations. In each iteration, the goal is to find a data point that maximizes $f_\lambda(\mathcal{X}_0; x_j)$. To do this, we first order the points so that their upper bounds are nonincreasing (step 4). We then compute $f_\lambda(\mathcal{X}_0; x_j)$ sequentially for points in this order (step 7) while tracking the highest value computed so far in the variable max_cost (step 9). Once the condition that max_cost exceeds the upper bound of the next point is met (step 11), we can assert that none of the remaining points is a maximizer: for any such point, $f_\lambda(\mathcal{X}_0; x_j)$ is at most its upper bound, which is at most max_cost, where the first inequality follows from the monotonicity of $f_\lambda$. Thus, we can break the loop (step 12) and avoid computing $f_\lambda$ for the remaining points in this iteration. When Algorithm 2 terminates, it produces the same output as Algorithm 1, but with a reduced total number of evaluations of $f_\lambda$.
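The lazy-evaluation idea can be sketched as follows (again an illustrative implementation with an ISTA solver and assumed parameter values, not the paper's code). The array `bound` stores, for each point, the value of $f_\lambda$ from the iteration in which it was last computed; by Lemma 1 this remains a valid upper bound later on.

```python
import numpy as np

def f_lambda(X0, x, lam=10.0, n_iter=200):
    """Approximate f_lambda(X0; x) = min_c ||c||_1 + lam/2 ||x - X0 c||_2^2 (ISTA)."""
    if X0.shape[1] == 0:
        return lam / 2.0
    c = np.zeros(X0.shape[1])
    step = 1.0 / (lam * np.linalg.norm(X0, 2) ** 2)
    for _ in range(n_iter):
        z = c + step * lam * (X0.T @ (x - X0 @ c))
        c = np.sign(z) * np.maximum(np.abs(z) - step, 0.0)
    return np.abs(c).sum() + lam / 2.0 * np.linalg.norm(x - X0 @ c) ** 2

def ffs_lazy(X, k, lam=10.0):
    """FFS with lazy evaluation: stale costs are valid upper bounds (Lemma 1),
    so points whose bound is below the best cost seen need not be re-evaluated."""
    N = X.shape[1]
    bound = np.full(N, lam / 2.0)     # upper bounds on f_lambda(X0; x_j)
    selected = [0]                    # the first exemplar can be chosen arbitrarily
    n_evals = 0
    for _ in range(k - 1):
        X0 = X[:, selected]
        best_j, max_cost = -1, -np.inf
        for j in np.argsort(-bound):  # visit points with the largest bounds first
            if bound[j] <= max_cost:  # no remaining point can beat max_cost
                break
            bound[j] = f_lambda(X0, X[:, j], lam)   # tighten the bound
            n_evals += 1
            if bound[j] > max_cost:
                max_cost, best_j = bound[j], j
        selected.append(int(best_j))
    return selected, n_evals
```

The early break is safe because the visiting order is nonincreasing in the bounds: once one bound falls below max_cost, so do all later ones.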
Figure 3 reports the computational time of Algorithm 1 and Algorithm 2 with synthetically generated data where data points are sampled uniformly at random from the unit sphere of . It shows that the efficient implementation in Algorithm 2 is around to times faster than the naive implementation in Algorithm 1. Comparing the results across different values of , we find that the benefit of Algorithm 2 is more prominent for larger values of .
3 Theoretical Analysis
In this section, we study the theoretical properties of the self-representation based exemplar selection method. In Sections 3.1 and 3.2 we present a geometric interpretation of the exemplar selection model from Section 2.1 and of the FFS algorithm from Section 2.2, and study their properties when the data is drawn from a union of subspaces. To simplify the analysis, we assume that the self-representation is strictly enforced by extending (4) to $\lambda = \infty$, i.e., we let

$$f_\infty(\mathcal{X}_0; x) := \min_{c} \|c\|_1 \quad \text{s.t.} \quad x = X_0 c. \quad (13)$$

We define $f_\infty(\mathcal{X}_0; x) := \infty$ if problem (13) is infeasible. The effect of using a finite $\lambda$ is discussed in Section 3.3.
3.1 Geometric interpretation
We first provide a geometric interpretation of the exemplars selected by (12). Given any $\mathcal{X}_0 \subseteq \mathcal{X}$, we denote the convex hull of the symmetrized data points in $\mathcal{X}_0$ by $\mathcal{K}_0$, i.e.,

$$\mathcal{K}_0 := \text{conv}\{\pm x : x \in \mathcal{X}_0\} \quad (14)$$

(see an example in Figure 4). The Minkowski functional [55] associated with such a set is given by the following.
Definition 2 (Minkowski functional).
The Minkowski functional associated with a set $\mathcal{K}_0 \subseteq \mathbb{R}^D$ is a map denoted by $\|\cdot\|_{\mathcal{K}_0}$ and defined by

$$\|x\|_{\mathcal{K}_0} := \inf\{t > 0 : x \in t\,\mathcal{K}_0\}. \quad (15)$$

We define $\|x\|_{\mathcal{K}_0} := \infty$ if the set on the right-hand side of (15) is empty.
The Minkowski functional $\|\cdot\|_{\mathcal{K}_0}$ is a norm on the span of $\mathcal{X}_0$, and its unit ball is $\mathcal{K}_0$. Thus, for any nonzero $x$ in this span, the point $x / \|x\|_{\mathcal{K}_0}$ is the projection of $x$ onto the boundary of $\mathcal{K}_0$. The green and red dots in Figure 4 are examples of a point $x$ and its projection $x / \|x\|_{\mathcal{K}_0}$, respectively. It follows that if $x \neq 0$, then $\|x\|_2 / \|x\|_{\mathcal{K}_0}$ is the length of the portion of the ray $\{t x : t \ge 0\}$ inside $\mathcal{K}_0$.
Using Definition 2, it has been shown in [56, Section 2] and [18, Section 4.1] that

$$f_\infty(\mathcal{X}_0; x) = \|x\|_{\mathcal{K}_0}. \quad (16)$$

A combination of (16) and the interpretation of $\|\cdot\|_{\mathcal{K}_0}$ above provides a geometric interpretation of $f_\infty$. That is, $f_\infty(\mathcal{X}_0; x)$ is large if the length of the portion of the ray $\{t x : t \ge 0\}$ inside $\mathcal{K}_0$ is small. In particular, $f_\infty(\mathcal{X}_0; x)$ is infinite if $x$ is not in the span of $\mathcal{X}_0$.
In view of (16), the exemplar selection model (12) may be written equivalently as

$$\min_{\mathcal{X}_0 \subseteq \mathcal{X}:\, |\mathcal{X}_0| \le k} \ \max_{x_j \in \mathcal{X}} \|x_j\|_{\mathcal{K}_0}. \quad (17)$$

Therefore, the solution to (12) is the subset of $\mathcal{X}$ that maximizes the minimum, taken over all data points $x_j$, of the length of the segment where the ray $\{t x_j : t \ge 0\}$ intersects $\mathcal{K}_0$.
Also, from (16) we see that each iteration of Algorithm 1 selects the point $x_j$ that minimizes the length of the corresponding ray segment inside $\mathcal{K}_0$. Therefore, each iteration of FFS adds the point whose associated ray has the shortest intersection with $\mathcal{K}_0$.
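Identity (16) can be checked numerically: $f_\infty(\mathcal{X}_0; x)$ is a linear program, and for simple choices of $\mathcal{X}_0$ the Minkowski functional is known in closed form. The sketch below is our illustration (it assumes SciPy's `linprog`) and uses the standard split $c = p - q$ with $p, q \ge 0$.

```python
import numpy as np
from scipy.optimize import linprog

def f_inf(X0, x):
    """f_inf(X0; x) = min ||c||_1 subject to x = X0 c, solved as a linear program.
    By (16), this equals the Minkowski functional of conv(+-X0) evaluated at x."""
    D, M = X0.shape
    res = linprog(c=np.ones(2 * M),            # minimize sum(p) + sum(q) = ||c||_1
                  A_eq=np.hstack([X0, -X0]),   # X0 p - X0 q = x
                  b_eq=x,
                  bounds=[(0, None)] * (2 * M))
    return res.fun if res.status == 0 else np.inf  # infeasible => f_inf is infinite

# With X0 = {e1, e2}, conv(+-X0) is the l1 unit ball, so f_inf(X0; x) = ||x||_1.
print(f_inf(np.eye(2), np.array([0.3, -0.4])))   # approximately 0.7

# Adding the exemplar u = (e1 + e2)/sqrt(2) enlarges conv(+-X0), lengthening the
# ray segment along the direction (1, 1) and shrinking the cost from 2 to sqrt(2).
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
print(f_inf(np.column_stack([np.eye(2), u]), np.array([1.0, 1.0])))
```

The second call illustrates the geometric picture behind FFS: adding an exemplar in a poorly covered direction enlarges $\mathcal{K}_0$ exactly where the associated rays were shortest.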
Finally, we remark that our exemplar selection objective is related to the sphere covering problem. This is discussed in detail in the Appendix.
3.2 Exemplars from a union of subspaces
We now study the properties of our exemplar selection method when applied to data from a union of subspaces. Let $\mathcal{X}$ be drawn from a collection of $n$ subspaces $\{S_i\}_{i=1}^n$ of dimensions $\{d_i\}_{i=1}^n$, with each subspace $S_i$ containing at least $d_i$ samples that span $S_i$. We assume that the subspaces are independent, an assumption that is commonly used in the analysis of subspace clustering methods [57, 16, 42, 41, 58].
Assumption 1.
The subspaces $\{S_i\}_{i=1}^n$ are independent, i.e., $\sum_{i=1}^n d_i$ is equal to the dimension of $S_1 + \cdots + S_n$.
We now aim to show that the solution to (12) contains at least $d_i$ linearly independent vectors from each subspace $S_i$ and, moreover, that the solution to (2) with $\mathcal{X}_0$ being any solution to (12) is subspace-preserving for every $x_j \in \mathcal{X}$. Formally, the subspace-preserving property is defined as follows.
Definition 3 (Subspacepreserving property).
A vector $c_j$ associated with the point $x_j$ is called subspace-preserving if $c_{ij} \neq 0$ implies that $x_i$ and $x_j$ are from the same subspace.
We first need the following lemma.
Lemma 4. Suppose that Assumption 1 holds, and let $x \in S_i$ be such that problem (13) is feasible. Then every optimal solution to (13) is subspace-preserving.
Proof.
An optimal solution $c^*$ to (13) must be feasible, i.e., $x = X_0 c^*$, which after rearrangement gives

$$x - \sum_{l:\, x_l \in S_i} c^*_l x_l = \sum_{l:\, x_l \notin S_i} c^*_l x_l. \quad (18)$$

Since the left-hand side is a vector in $S_i$ and the right-hand side is a vector in $\sum_{i' \neq i} S_{i'}$, it follows from Assumption 1 and [59, Theorem 6] that both sides of (18) are zero, as claimed.

Next, let us define the vector $\bar{c}$ such that $\bar{c}_l = c^*_l$ for all $l$ with $x_l \in S_i$ and $\bar{c}_l = 0$ for all $l$ such that $x_l \notin S_i$. Using (18) from above and the definition of $\bar{c}$, we see that $\bar{c}$ is feasible for (13). Moreover, it satisfies

$$\|\bar{c}\|_1 \le \|c^*\|_1. \quad (19)$$

Since $c^*$ is optimal for (13), it follows from (19) that $\|\bar{c}\|_1 = \|c^*\|_1$. Combining this fact with (19) shows that $c^*_l = 0$ for all $l$ such that $x_l \notin S_i$, which completes the proof. ∎
We may use this lemma to prove the following result.
Theorem 1. Suppose that Assumption 1 holds and that $k \ge \sum_{i=1}^n d_i$. Then any optimal solution $\mathcal{X}_0^*$ to (12) contains at least $d_i$ linearly independent points from each subspace $S_i$, and for every $x_j \in \mathcal{X}$ all solutions to (13) with $\mathcal{X}_0 = \mathcal{X}_0^*$ are subspace-preserving.
Proof.
Let $\mathcal{X}_0^*$ be any optimal solution to (12) for any fixed $k \ge \sum_{i=1}^n d_i$, and let $\bar{\mathcal{X}}_0$ be any subset with $|\bar{\mathcal{X}}_0| \le k$ that contains $d_i$ linearly independent points from $S_i$ for each $i$, which we know exists. It follows that $f_\infty(\bar{\mathcal{X}}_0; x_j) < \infty$ for all $x_j \in \mathcal{X}$, so that $F_\infty(\bar{\mathcal{X}}_0) < \infty$. This and the optimality of $\mathcal{X}_0^*$ imply $F_\infty(\mathcal{X}_0^*) < \infty$. This fact and the definition of $F_\infty$ mean that $f_\infty(\mathcal{X}_0^*; x_j) < \infty$ for all $x_j \in \mathcal{X}$, i.e., the problem in (13) with $\mathcal{X}_0 = \mathcal{X}_0^*$ is feasible for all $x_j \in \mathcal{X}$. Combining this with Assumption 1 implies that $\mathcal{X}_0^*$ contains at least $d_i$ linearly independent points from each subspace $S_i$. Combining this with Lemma 4 shows that all solutions to the optimization problem in (13) are subspace-preserving. ∎
When $k \ge \sum_{i=1}^n d_i$, Theorem 1 shows that at least $d_i$ points are selected from each subspace $S_i$ regardless of the number of points from $S_i$ in $\mathcal{X}$. Therefore, when the data is class-imbalanced, (12) selects a subset that is more balanced, provided that the dimensions of the subspaces do not differ dramatically.
Theorem 1 also shows that only $\sum_{i=1}^n d_i$ points are needed to correctly represent all data points in $\mathcal{X}$. In other words, the required number of exemplars for representing the dataset does not scale with the size of $\mathcal{X}$.
Although the FFS algorithm in Section 2.2 is a computationally efficient greedy algorithm that does not necessarily solve (12), the following result shows that it does output a subset of exemplars from the data with desirable properties.