Self-Representation Based Unsupervised Exemplar Selection in a Union of Subspaces

06/07/2020 ∙ by Chong You, et al. ∙ Johns Hopkins University ∙ UC Berkeley

Finding a small set of representatives from an unlabeled dataset is a core problem in a broad range of applications such as dataset summarization and information extraction. Classical exemplar selection methods such as k-medoids work under the assumption that the data points are close to a few cluster centroids, and cannot handle the case where data lie close to a union of subspaces. This paper proposes a new exemplar selection model that searches for a subset that best reconstructs all data points as measured by the ℓ_1 norm of the representation coefficients. Geometrically, this subset best covers all the data points as measured by the Minkowski functional of the subset. To solve our model efficiently, we introduce a farthest first search algorithm that iteratively selects the worst represented point as an exemplar. When the dataset is drawn from a union of independent subspaces, our method is able to select sufficiently many representatives from each subspace. We further develop an exemplar based subspace clustering method that is robust to imbalanced data and efficient for large scale data. Moreover, we show that a classifier trained on the selected exemplars (when they are labeled) can correctly classify the rest of the data points.


1 Introduction

The availability of large annotated datasets in computer vision, such as ImageNet, has led to many recent breakthroughs in object detection and classification using supervised learning techniques such as deep learning. However, as data sizes continue to grow, it has become difficult to annotate the data needed for training fully supervised algorithms. As a consequence, the development of unsupervised learning techniques that can learn from unlabeled datasets has become extremely important. In addition to the challenge introduced by the sheer volume of data, the number of data samples in unlabeled datasets usually varies widely across classes. For example, a street sign database collected from street view images may contain drastically different numbers of instances for different types of signs, since not all of them appear on streets with the same frequency; a handwritten letter database may be highly imbalanced because the frequency of different letters in English text varies significantly (see Figure 1). An imbalanced data distribution is known to compromise the performance of canonical supervised [1] and unsupervised [2] learning techniques.

Fig. 1: Number of points in each class associated with the EMNIST handwritten letters (top) and the GTSRB (bottom) street sign databases.

We exploit the idea of exemplar selection to address the challenge of learning from an unlabeled dataset. Exemplar selection refers to the problem of selecting a set of data representatives, or exemplars, from the data. It has been a particularly useful approach for scaling up existing data clustering algorithms so that they can handle large datasets more efficiently [3]. Finding an exemplar set that is informative of the entire dataset is often the key challenge for the success of such approaches. In particular, when the data are drawn from several different groups, it is crucial that an algorithm select enough samples from each group without prior knowledge of which points belong to which groups. This can be especially difficult when the data are imbalanced, as samples are more likely to be selected from over-represented groups than from under-represented ones.

Exemplar selection is also useful when one has limited resources so that only a small subset of data can be labeled. In such cases, exemplar selection can determine the subset to be manually labeled, and then used to train a model to infer labels for the remaining data [4]. The ability to correctly classify as many of the unlabeled data points as possible depends critically on the quality of the selected exemplars.

Some of the most popular methods for exemplar selection include k-centers and k-medoids, which search for the set of centers or medoids that best fits the data under the assumption that data points concentrate around a few discrete points. However, many high-dimensional image and video datasets are distributed around low-dimensional subspaces [5, 6], for which such discrete center based methods become ineffective. In this paper, we consider exemplar selection under a model where the data points lie close to a collection of unknown low-dimensional subspaces. One line of work that can address such a problem is based on the assumption that each data point can be expressed by a few data representatives with small reconstruction residual. This includes simultaneous sparse representation [7] and dictionary selection [8, 9], which use greedy algorithms to solve their respective optimization problems, and group sparse representative selection [10, 11, 12, 13, 14, 15], which uses a convex optimization approach based on group sparsity. In particular, the analysis in [12] shows that when data come from a union of subspaces, their method is able to select a few representatives from each of the subspaces. However, methods in this category cannot effectively handle large-scale data, as they have quadratic complexity in the number of points. Moreover, convex optimization based methods such as that in [12] are not flexible in selecting a desired number of representatives, since the size of the subset cannot be directly controlled by adjusting an algorithm parameter.

1.1 Paper contributions

We present a data self-representation based exemplar selection algorithm for learning from large-scale and imbalanced data in an unsupervised manner. Our method is based on the self-expressiveness property of data in a union of subspaces [16], which states that each data point in a union of subspaces can be written as a linear combination of other points from its own subspace. That is, given data X = [x_1, …, x_N] ∈ ℝ^{D×N}, for each x_j there exists c_j ∈ ℝ^N such that x_j = X c_j and c_{ij} is nonzero only if x_i and x_j are from the same subspace. Such representations are called subspace-preserving. In particular, if the subspace dimensions are small, then the representations can be taken to be sparse. Based on this observation, [16] proposes the Sparse Subspace Clustering (SSC) method, which computes for each x_j ∈ X the vector c_j as a solution to the sparse optimization problem

   min_{c_j} ‖c_j‖_1 + (λ/2) ‖x_j − X c_j‖_2^2   s.t.   c_{jj} = 0,   (1)

where λ > 0. In [16], the solution to (1) is used to define an affinity between any pair of points x_i and x_j as |c_{ij}| + |c_{ji}|, and then spectral clustering is applied to generate a segmentation of the data points into their respective subspaces. Existing theoretical results show that, under certain assumptions on the data, the solution to (1) is subspace-preserving [17, 18, 19, 20, 21, 22, 23, 24, 25], thus justifying the correctness of the affinity produced by SSC.
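To make the SSC pipeline described above concrete, the following is a minimal Python sketch (not the authors' implementation): it solves a Lasso reparametrization of (1) for each data point using scikit-learn, builds the affinity |c_{ij}| + |c_{ji}|, and applies spectral clustering. The regularization value and iteration limit are illustrative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.linear_model import Lasso

def ssc(X, n_clusters, alpha=0.01):
    """Sketch of SSC. X is (N, D) with unit-norm rows; returns cluster labels."""
    N = X.shape[0]
    C = np.zeros((N, N))
    for j in range(N):
        idx = [i for i in range(N) if i != j]      # enforce c_jj = 0
        # Lasso minimizes a rescaling of (1): an l2 fitting term plus an l1 penalty.
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(X[idx].T, X[j])
        C[j, idx] = lasso.coef_
    A = np.abs(C) + np.abs(C).T                    # affinity |c_ij| + |c_ji|
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(A)
```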

While the nonzero entries of c_j determine a subset of X that can represent x_j with the minimum ℓ_1-norm on the coefficients, the union of these representations over all x_j often uses the entire dataset X. In this paper, we propose to find a small subset X_0 ⊆ X, which we call exemplars, such that solutions to the problem

   min_{c_j} ‖c_j‖_1 + (λ/2) ‖x_j − Σ_{x_i ∈ X_0} c_{ij} x_i‖_2^2   (2)

are also subspace-preserving. Since X_0 is a small subset of X, solving the optimization problem (2) is computationally much cheaper than solving (1). However, computing an appropriate X_0 through an exhaustive search would be computationally impractical. To address this issue, we present an efficient exemplar selection algorithm that iteratively selects the worst represented point from the data to form X_0. Our exemplar selection procedure is then used to design an exemplar-based subspace clustering approach (assuming that the exemplars are unlabeled) [26] and an exemplar-based classification approach (assuming that the exemplars are labeled) by exploiting the representative power of the selected exemplars. In summary, our work makes the following contributions compared to the state of the art:

Fig. 2: Subspace clustering on imbalanced data. Two subspaces of dimension three are generated uniformly at random in an ambient space of dimension five. Then, N_1 and N_2 points are sampled uniformly at random from the two subspaces, respectively, and the imbalance between N_1 and N_2 is varied along the x-axis. The clustering accuracy of SSC decreases dramatically as the dataset becomes imbalanced. The exemplar based subspace clustering (see Algorithm 3) is more robust to an imbalanced data distribution.
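For reference, here is a minimal sketch of how data of the kind used in Figure 2 can be generated (random orthonormal bases for the subspaces, unit-norm samples). The sample counts and seed are illustrative, not the exact experimental protocol.

```python
import numpy as np

def sample_union_of_subspaces(ambient_dim=5, subspace_dim=3, n_points=(50, 500), seed=0):
    """Sample unit-norm points from random low-dimensional subspaces of R^ambient_dim."""
    rng = np.random.default_rng(seed)
    X, labels = [], []
    for k, n_k in enumerate(n_points):
        # Random orthonormal basis of a subspace_dim-dimensional subspace.
        U, _ = np.linalg.qr(rng.standard_normal((ambient_dim, subspace_dim)))
        pts = U @ rng.standard_normal((subspace_dim, n_k))   # points in the subspace
        pts /= np.linalg.norm(pts, axis=0, keepdims=True)    # normalize to unit norm
        X.append(pts.T)
        labels.append(np.full(n_k, k))
    return np.vstack(X), np.concatenate(labels)
```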

  • We present a geometric interpretation of our exemplar selection algorithm as one of finding a subset of the data that best covers the entire dataset as measured by the Minkowski functional of the subset. When the data lies in a union of independent subspaces, we prove that our method selects sufficiently many representative data points (exemplars) from each subspace, even when the dataset is imbalanced. Unlike prior methods such as [12], our method has linear execution time and memory complexity in the number of data points for each iteration, and can be terminated when the desired number of exemplars have been selected.

  • We show that the exemplars in X_0 selected by our method can be used for subspace clustering by first computing the representation of each data point with respect to the exemplars as in (2), second constructing a k-nearest neighbor graph of the representation vectors, and third applying spectral clustering (a sketch of this pipeline is given after this list). Compared to SSC, the exemplar-based subspace clustering method is empirically less sensitive to imbalanced data and more efficient on large-scale datasets (see Figure 2). Experimental results on the large-scale and label-imbalanced handwritten letter dataset EMNIST and street sign dataset GTSRB show that our method outperforms state-of-the-art algorithms in terms of both clustering performance and running time.

  • We show that a classifier trained on the exemplars selected by our model (assuming that the labels of the exemplars are provided) is able to correctly classify the rest of the data points. We demonstrate through experiments on the Extended Yale B face database that exemplars selected by our method produce higher classification accuracy when compared to several popular exemplar selection methods.
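As referenced in the second bullet above, the following is a compact sketch of the exemplar-based clustering pipeline: ℓ_1 representations with respect to the exemplars as in (2), a k-nearest-neighbor graph on the coefficient vectors, and spectral clustering. The exemplar indices are assumed to come from the FFS procedure introduced later; parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.linear_model import Lasso
from sklearn.neighbors import kneighbors_graph

def exemplar_subspace_clustering(X, exemplar_idx, n_clusters, alpha=0.01, n_neighbors=5):
    """X: (N, D) unit-norm data; exemplar_idx: indices of the selected exemplars."""
    E = X[exemplar_idx]                           # (k, D) exemplar matrix
    C = np.zeros((X.shape[0], len(exemplar_idx)))
    for j in range(X.shape[0]):
        # l1-regularized representation of x_j using only the exemplars, cf. (2).
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(E.T, X[j])
        C[j] = lasso.coef_
    # k-NN graph on the normalized coefficient vectors, then spectral clustering.
    C = np.abs(C)
    C /= np.linalg.norm(C, axis=1, keepdims=True) + 1e-12
    A = kneighbors_graph(C, n_neighbors=n_neighbors, include_self=False)
    A = 0.5 * (A + A.T)                           # symmetrize the graph
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(A)
```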

We remark that a conference version of this paper appeared in the proceedings of the European Conference on Computer Vision (ECCV) in 2018 [26]. In comparison to the conference version, which focuses on the problem of subspace clustering on imbalanced data, the current paper addresses the problem of exemplar selection, which has a broader range of applications that include data summarization, clustering, and classification tasks. With additional technical results and experimental evaluation, the current paper provides a more comprehensive study of the subject.

1.2 Related work

Exemplar selection. Two of the most popular methods for exemplar selection are k-centers and k-medoids. The k-centers problem is a data clustering problem studied in theoretical computer science and operations research. Given a set X and an integer k, the goal is to find a set of centers X_0 ⊆ X with |X_0| ≤ k that minimizes the quantity max_{x ∈ X} d²(x, X_0), where d²(x, X_0) is the squared distance of x to the closest point in X_0. A partition of X is then given by assigning each point to its closest center. The k-medoids problem is a variant of k-centers that minimizes the sum of the squared distances, i.e., it minimizes Σ_{x ∈ X} d²(x, X_0) instead of the maximum distance. However, both k-centers and k-medoids model data as concentrating around several cluster centers, and do not generally apply to data lying in a union of subspaces.
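To make the two objectives concrete, here is a small sketch that evaluates them for a candidate set of center indices, following the squared-distance convention used above; it is for illustration only.

```python
import numpy as np

def kcenters_cost(X, centers_idx):
    """k-centers objective: max over points of the squared distance to the nearest center."""
    D2 = ((X[:, None, :] - X[centers_idx][None, :, :]) ** 2).sum(axis=2)
    return D2.min(axis=1).max()

def kmedoids_cost(X, medoids_idx):
    """k-medoids objective: sum over points of the squared distance to the nearest medoid."""
    D2 = ((X[:, None, :] - X[medoids_idx][None, :, :]) ** 2).sum(axis=2)
    return D2.min(axis=1).sum()
```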

In general, selecting a representative subset of the entire data has been studied in a wide range of contexts such as Determinantal Point Processes  [27, 28, 29], Prototype Selection [30, 31], Rank Revealing QR [32], Column Subset Selection (CSS) [33, 34, 35, 36], separable Nonnegative Matrix Factorization (NMF) [37, 38, 39], and so on [40]. In particular, both CSS and separable NMF can be interpreted as finding exemplars such that each data point can be expressed as a linear combination of such exemplars. However, these methods do not impose sparsity on the representation coefficients, and therefore cannot be used to select good representatives from data that is drawn from a union of low-dimensional subspaces.

Subspace clustering on imbalanced and large-scale data. Subspace clustering aims to cluster data points drawn from a union of subspaces into their respective subspaces. Recently, self-expressiveness based subspace clustering methods such as SSC and its variants [41, 42, 43, 44, 45, 46, 47] have achieved great success in many computer vision tasks such as face clustering and handwritten digit clustering. Nonetheless, previous experimental evaluations focused primarily on balanced datasets, i.e., datasets with approximately the same number of samples from each cluster. In practice, datasets are often imbalanced, and such skewed data distributions can significantly compromise the clustering performance of SSC. To the best of our knowledge, this issue has not been studied in the literature.

Another issue with many self-expressiveness based subspace clustering methods is that they are limited to small or medium scale datasets [48]. Several works address the scalability issue by computing a dictionary whose number of atoms is much smaller than the total number of data points in X, and expressing each data point in X as a linear combination of the atoms in the dictionary (the dictionary is usually not a subset of X). In particular, [49] shows that if the atoms in the dictionary happen to lie in the same union of subspaces as the input data X, then this approach is guaranteed to be correct. However, there is little evidence that such a condition is satisfied for real data, as the atoms of the dictionary are not constrained to be a subset of X. Another recent work [50], which uses data-independent random matrices as dictionaries, also suffers from this issue and lacks correctness guarantees. More recently, several works [51, 52, 53] use exemplar selection to form the dictionary for subspace clustering, but they lack theoretical justification that their selected exemplars represent the subspaces.

2 Self-Representation based Unsupervised Exemplar Selection

In this section, we present our self-representation based method for exemplar selection from an unlabeled dataset X = {x_1, …, x_N} ⊆ ℝ^D, whose points are assumed to have unit ℓ_2 norm.¹ We first formulate the model for selecting a subset of exemplars from X in Section 2.1 as minimizing a self-representation cost. Since the model is a combinatorial optimization problem, we present an efficient algorithm for solving it approximately in Section 2.2.

¹ This is not a strong assumption, as one can always normalize the data points as a preprocessing step for any given dataset.

2.1 A self-representation cost for exemplar selection

In our exemplar selection model, the goal is to find a small subset X_0 ⊆ X that can linearly represent all data points in X. In particular, the set X_0 should contain exemplars from each subspace such that the solution to (2) for each data point is subspace-preserving. Next, we define a cost function based on the optimization problem in (2) and then present our exemplar selection model.

Definition 1 (Self-representation cost function).

Given X_0 ⊆ X, we define the self-representation cost function as

   F_λ(X_0) := max_{x ∈ X} f_λ(x, X_0),   (3)

where

   f_λ(x, X_0) := min_c ‖c‖_1 + (λ/2) ‖x − Σ_{x_i ∈ X_0} c_i x_i‖_2^2,   (4)

and λ > 1 is a parameter. By convention, we define f_λ(x, ∅) = λ/2 for all x ∈ X, where ∅ is the empty set.
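For illustration, f_λ(x, X_0) can be evaluated with any ℓ_1 solver. The sketch below uses scikit-learn's Lasso, whose objective is a rescaling of (4) (so the minimizer coincides), and then reports the value of (4) at that minimizer; the function name and the default λ are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

def self_rep_cost(x, X0, lam=10.0):
    """Evaluate f_lambda(x, X0) = min_c ||c||_1 + (lam/2) * ||x - X0^T c||_2^2.

    x: (D,) unit-norm point; X0: (m, D) current exemplar set (rows).
    Returns lam/2 when X0 is empty, matching the convention in Definition 1.
    """
    if len(X0) == 0:
        return 0.5 * lam
    D = x.shape[0]
    # Lasso minimizes (1/(2*D))*||x - A c||^2 + alpha*||c||_1; choosing
    # alpha = 1/(lam*D) makes its minimizer coincide with that of (4).
    lasso = Lasso(alpha=1.0 / (lam * D), fit_intercept=False, max_iter=10000)
    lasso.fit(X0.T, x)
    c = lasso.coef_
    residual = x - X0.T @ c
    return np.abs(c).sum() + 0.5 * lam * np.dot(residual, residual)
```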

The quantity f_λ(x, X_0) is a measure of how well the data point x is represented by the subset X_0. The function f_λ has the following properties.

Lemma 1.

For each x ∈ X, the function X_0 ↦ f_λ(x, X_0) is monotone with respect to the partial order defined by set inclusion, i.e., f_λ(x, X_1) ≥ f_λ(x, X_2) for any X_1 ⊆ X_2 ⊆ X.

Proof.

Let . Then, let us define as

It follows from the optimality conditions that for all such that . Combining this with yields

which is the desired result. ∎

Lemma 2.

For each x ∈ X the following hold: (i) for every X_0 ⊆ X the inclusion f_λ(x, X_0) ∈ [1 − 1/(2λ), λ/2] holds; (ii) f_λ(x, {x}) = 1 − 1/(2λ); and (iii) f_λ(x, X_0) = 1 − 1/(2λ) if and only if at least one of x or −x is in X_0.

Proof.

First observe that if , then it follows from Definition 1 that . Second, consider the case . In this case, define to be the one-hot vector with -th entry and all other entries zero. One can then verify that (by recalling the assumption that ). Combining these two cases with Lemma 1 establishes that parts (i) and (ii) hold.

For the “if” direction of part (iii), let either or . Define as a one-hot vector with -th entry if , and if ; in either case all other entries are set to zero. One can then verify that , which completes the proof for this direction.

To prove the “only if” direction, suppose that . Let us define

and . From the optimality conditions, it follows that for all such that . Using this fact, the assumption that the data is normalized, and basic properties of norms, we have

(5)

From (5), and definition of , we have

(6)

where the last inequality follows by computing the minimum value of . It follows that equality is achieved for all inequalities in (6). By requiring equality for the second and first inequalities in (6), we get respectively,

(7)

Since (7) implies , we can conclude that all of the inequalities in (5) must actually be equalities. Using this fact and (5) we have that

(8)

Define . From definition of , (7), the fact that the data is normalized, and (8), we have

(9)

For the second term on the right hand side of (9), we may use the fact that the data is normalized, definition of , and (7) to conclude that

Plugging this into (9) yields

(10)

which after simplification shows that

(11)

Recall that (see Definition 1). Therefore, from (11) we see that . Since both and have unit norm, we conclude that , i.e., that either or must be in , as desired. ∎

Observe that if X_0 contains enough exemplars from the subspace containing x and a solution c* to the optimization problem in (4) is subspace-preserving, then it is expected that c* will be sparse and that the residual x − Σ_{x_i ∈ X_0} c*_i x_i will be close to zero. This suggests that we should select the subset X_0 such that the value f_λ(x, X_0) is small for all x ∈ X. As the value F_λ(X_0) is achieved by the data point that has the largest value f_λ(x, X_0), we propose to perform exemplar selection by searching for a subset X_0 that minimizes the self-representation cost function, i.e.,

   X*_k ∈ argmin_{X_0 ⊆ X : |X_0| ≤ k} F_λ(X_0),   (12)

where k is the target number of exemplars. The objective function in (12) is monotone, as shown next.

Lemma 3.

If X_1 ⊆ X_2 ⊆ X, then F_λ(X_1) ≥ F_λ(X_2).

Proof.

Let us define x_1 ∈ argmax_{x ∈ X} f_λ(x, X_1) and x_2 ∈ argmax_{x ∈ X} f_λ(x, X_2). It follows from these definitions and Lemma 1 that

   F_λ(X_1) = f_λ(x_1, X_1) ≥ f_λ(x_2, X_1) ≥ f_λ(x_2, X_2) = F_λ(X_2),

which completes the proof. ∎

2.2 A Farthest First Search (FFS) algorithm

Solving the optimization problem (12) is NP-hard in general, as it requires evaluating F_λ(X_0) for every subset X_0 of size at most k. In Algorithm 1 below, we present a greedy algorithm for efficiently computing an approximate solution to (12). The algorithm progressively grows a candidate subset (initialized as the empty set) until it reaches the desired size k. During each iteration, step 3 of the algorithm selects the point that is worst represented by the current subset X_0, as measured by f_λ(·, X_0). It was shown in Lemma 2 that f_λ(x_j, X_0) attains its minimum value 1 − 1/(2λ) if x_j or −x_j is in X_0, and is strictly larger otherwise. Thus, during each iteration an element not already in X_0 is added to X_0, provided such points remain. When the algorithm terminates, the output contains exactly k distinct exemplars from X.

We also note that the FFS algorithm can be viewed as an extension of the farthest first traversal algorithm (see, e.g., [54]), which is an approximation algorithm for the k-centers problem discussed in Section 1.2.

Input:  Data X = {x_1, …, x_N}, parameter λ, and number of desired exemplars k.
1:  Select i_0 ∈ {1, …, N} randomly and set X_0 ← {x_{i_0}}.
2:  for t = 1, …, k − 1 do
3:     X_0 ← X_0 ∪ {argmax_{x ∈ X} f_λ(x, X_0)}.
4:  end for
Output:  X_0
Algorithm 1 A farthest first search (FFS) algorithm for exemplar selection
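A direct (unoptimized) Python sketch of Algorithm 1, reusing the self_rep_cost sketch from Section 2.1: it recomputes f_λ for every point in every iteration, exactly as in the pseudocode above.

```python
import numpy as np

def ffs(X, k, lam=10.0, seed=0):
    """Farthest first search (Algorithm 1): greedily add the worst-represented point.

    X: (N, D) unit-norm data; returns the indices of the k selected exemplars.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    selected = [int(rng.integers(N))]               # step 1: random initial exemplar
    while len(selected) < k:
        costs = np.array([self_rep_cost(X[j], X[selected], lam) for j in range(N)])
        selected.append(int(costs.argmax()))        # step 3: worst-represented point
    return selected
```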
Input:  Data X = {x_1, …, x_N}, parameter λ, and number of desired exemplars k.
1:  Select i_0 ∈ {1, …, N} randomly and set X_0 ← {x_{i_0}}.
2:  Compute u_j ← f_λ(x_j, X_0) for j = 1, …, N.
3:  for t = 1, …, k − 1 do
4:     Let π be a permutation of {1, …, N} such that u_{π(1)} ≥ u_{π(2)} ≥ ⋯ ≥ u_{π(N)}.
5:     Initialize max_cost ← 0.
6:     for i = 1, …, N do
7:        Set u_{π(i)} ← f_λ(x_{π(i)}, X_0).
8:        if u_{π(i)} > max_cost then
9:           Set max_cost ← u_{π(i)}, new_index ← π(i).
10:        end if
11:        if i = N or max_cost ≥ u_{π(i+1)} then
12:            break
13:        end if
14:     end for
15:     X_0 ← X_0 ∪ {x_{new_index}}.
16:  end for
Output:  X_0
Algorithm 2 An efficient implementation of FFS

Efficient implementation. Observe that each iteration of Algorithm 1 requires evaluating f_λ(x_j, X_0) for every x_j ∈ X. Therefore, the complexity of Algorithm 1 is linear in the number of data points, assuming k is fixed and small. However, computing f_λ(x_j, X_0) itself is not easy, as it requires solving a sparse optimization problem. Next, we introduce an efficient implementation of Algorithm 1 that accelerates the procedure by eliminating the need to compute f_λ(x_j, X_0) for some of the x_j in each iteration.

The idea underpinning the computational savings in Algorithm 2 is the monotonicity of f_λ (see Lemma 1). That is, for any X_1 ⊆ X_2 we have f_λ(x_j, X_1) ≥ f_λ(x_j, X_2). Since in the FFS algorithm the set X_0 is progressively increased, this implies that f_λ(x_j, X_0) is non-increasing over the iterations. In step 2 we initialize u_j = f_λ(x_j, X_0) for each j, which remains an upper bound for f_λ(x_j, X_0) in all subsequent iterations. In each iteration, the goal is to find a data point that maximizes f_λ(x_j, X_0). To do this, we first find an ordering π of {1, …, N} such that u_{π(1)} ≥ u_{π(2)} ≥ ⋯ ≥ u_{π(N)} (step 4). We then compute f_λ(x_{π(i)}, X_0) sequentially for i = 1, 2, … (step 7) while tracking the highest value computed so far in the variable max_cost (step 9). Once the condition max_cost ≥ u_{π(i+1)} is met (step 11), we can assert that for any i' > i the point x_{π(i')} is not a maximizer. This can be seen from f_λ(x_{π(i')}, X_0) ≤ u_{π(i')} ≤ u_{π(i+1)} ≤ max_cost, where the first inequality follows from the monotonicity of f_λ as a function of X_0. Thus, we can break the loop (step 12) and avoid computing f_λ(x_{π(i')}, X_0) for the remaining indices in this iteration. When Algorithm 2 terminates, it produces the same output as Algorithm 1 but with a reduced total number of evaluations of f_λ.
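The same idea in Python, as a sketch built on the self_rep_cost sketch from Section 2.1: stale values of f_λ are kept as upper bounds (valid by Lemma 1), and the inner loop stops as soon as no unvisited point can beat the current maximum.

```python
import numpy as np

def ffs_lazy(X, k, lam=10.0, seed=0):
    """Sketch of Algorithm 2: FFS with lazy evaluation of f_lambda."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    selected = [int(rng.integers(N))]
    # Stale values of f_lambda remain upper bounds because the exemplar set only grows.
    u = np.array([self_rep_cost(X[j], X[selected], lam) for j in range(N)])
    while len(selected) < k:
        order = np.argsort(-u)                      # visit points by decreasing upper bound
        max_cost, new_index = -np.inf, int(order[0])
        for pos, j in enumerate(order):
            u[j] = self_rep_cost(X[j], X[selected], lam)   # exact value, refreshes the bound
            if u[j] > max_cost:
                max_cost, new_index = u[j], int(j)
            # Unvisited points all have stale upper bounds <= u[order[pos + 1]]; if
            # max_cost already matches or exceeds that, none of them can win.
            if pos + 1 == N or max_cost >= u[order[pos + 1]]:
                break
        selected.append(new_index)
    return selected
```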

Fig. 3: Running time for Algorithm 1 and Algorithm 2 on a synthetically generated dataset in which data points are sampled uniformly at random from the unit sphere, averaged over multiple trials. The number of data points is varied along the x-axis.

Figure 3 reports the computational time of Algorithm 1 and Algorithm 2 on synthetically generated data in which data points are sampled uniformly at random from the unit sphere. It shows that the efficient implementation in Algorithm 2 is substantially faster than the naive implementation in Algorithm 1, and that the benefit of Algorithm 2 becomes more prominent for larger problem instances.

3 Theoretical Analysis

In this section, we study the theoretical properties of the self-representation based exemplar selection method. In Sections 3.1 and 3.2 we present a geometric interpretation of the exemplar selection model from Section 2.1 and of the FFS algorithm from Section 2.2, and study their properties when the data is drawn from a union of subspaces. To simplify the analysis, we assume that the self-representation is strictly enforced by extending (4) to λ = ∞, i.e., we let

   f_∞(x, X_0) := min_c ‖c‖_1   s.t.   x = Σ_{x_i ∈ X_0} c_i x_i.   (13)

We define f_∞(x, X_0) = ∞ if problem (13) is infeasible. The effect of using a finite λ is discussed in Section 3.3.

3.1 Geometric interpretation

We first provide a geometric interpretation of the exemplars selected by (12). Given any X_0 ⊆ X, we denote the convex hull of the symmetrized data points in X_0 by conv(±X_0), i.e.,

   conv(±X_0) := conv{±x_i : x_i ∈ X_0}   (14)

(see an example in Figure 4). The Minkowski functional [55] associated with such a set is given by the following.

Definition 2 (Minkowski functional).

The Minkowski functional associated with the set conv(±X_0) is a map ‖·‖_{X_0} : ℝ^D → ℝ ∪ {+∞} defined by

   ‖x‖_{X_0} := inf { t > 0 : x/t ∈ conv(±X_0) }.   (15)

We define ‖x‖_{X_0} = ∞ if the set over which the infimum is taken is empty (in particular, if X_0 is empty).

The Minkowski functional ‖·‖_{X_0} is a norm on span(X_0), and its unit ball is conv(±X_0). Thus, for any nonzero x ∈ span(X_0), the point x/‖x‖_{X_0} is the projection of x onto the boundary of conv(±X_0). The green and red dots in Figure 4 are examples of x and x/‖x‖_{X_0}, respectively. It follows that if x has unit ℓ_2 norm, then 1/‖x‖_{X_0} is the length of the portion of the ray {t x : t ≥ 0} that lies inside conv(±X_0).

Using Definition 2, it has been shown in [56, Section 2] and [18, Section 4.1] that

   f_∞(x, X_0) = ‖x‖_{X_0}.   (16)

A combination of (16) and the interpretation of ‖x‖_{X_0} above provides a geometric interpretation of f_∞(x, X_0). That is, f_∞(x, X_0) is large if the length of the portion of the ray {t x : t ≥ 0} inside conv(±X_0) is small. In particular, f_∞(x, X_0) is infinite if x is not in the span of X_0.
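For illustration, f_∞(x, X_0) = ‖x‖_{X_0} in (16) can be computed by a small linear program, splitting c into its positive and negative parts. The sketch below uses scipy.optimize.linprog and returns infinity when x is outside the span of X_0.

```python
import numpy as np
from scipy.optimize import linprog

def minkowski_gauge(x, X0):
    """Compute ||x||_{X0} = f_inf(x, X0) = min ||c||_1 subject to X0^T c = x.

    x: (D,) point; X0: (m, D) subset of the data (rows).
    """
    m = X0.shape[0]
    A_eq = np.hstack([X0.T, -X0.T])                 # c = p - q with p, q >= 0
    res = linprog(c=np.ones(2 * m), A_eq=A_eq, b_eq=x,
                  bounds=[(0, None)] * (2 * m), method='highs')
    return res.fun if res.success else np.inf
```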

In view of (16), the exemplar selection model (12) may be written equivalently as

   min_{X_0 ⊆ X : |X_0| ≤ k}  max_{x ∈ X} ‖x‖_{X_0}.   (17)

Therefore, the solution to (12) is the subset X_0 of X that maximizes the minimum, taken over all data points x ∈ X, of the length of the portion of the ray {t x : t ≥ 0} that lies inside conv(±X_0).

Fig. 4: A geometric illustration of the solution to (12). The shaded area is the convex hull conv(±X_0) defined in (14).

Also, from (16) we see that each iteration of Algorithm 1 selects the point x ∈ X that minimizes 1/‖x‖_{X_0}. Therefore, each iteration of FFS adds the point whose associated ray has the shortest intersection with conv(±X_0).

Finally, we remark that our exemplar selection objective is related to the sphere covering problem. This is discussed in detail in the Appendix.

3.2 Exemplars from a union of subspaces

We now study the properties of our exemplar selection method when applied to data from a union of subspaces. Let X be drawn from a collection of n subspaces {S_i}_{i=1}^n of dimensions {d_i}_{i=1}^n, with each subspace S_i containing at least d_i samples that span S_i. We assume that the subspaces are independent, an assumption that is commonly used in the analysis of subspace clustering methods [57, 16, 42, 41, 58].

Assumption 1.

The subspaces {S_i}_{i=1}^n are independent, i.e., Σ_{i=1}^n dim(S_i) is equal to the dimension of S_1 + ⋯ + S_n.

We now aim to show that any solution to (12) contains at least d_i linearly independent vectors from each subspace S_i and, moreover, that the solution to (2), with X_0 being any solution to (12), is subspace-preserving for every x_j ∈ X. Formally, the subspace-preserving property is defined as follows.

Definition 3 (Subspace-preserving property).

A representation vector c associated with a data point x ∈ X is called subspace-preserving if c_i ≠ 0 implies that x_i and x are from the same subspace.
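When ground-truth subspace labels are available (for evaluation purposes only), Definition 3 can be checked directly; a tiny sketch with illustrative names:

```python
import numpy as np

def is_subspace_preserving(c, labels_X0, label_x, tol=1e-8):
    """Check Definition 3: every nonzero coefficient of c should correspond to a
    point in X0 carrying the same subspace label as x."""
    support = np.abs(c) > tol
    return bool(np.all(labels_X0[support] == label_x))
```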

We first need the following lemma.

Lemma 4.

Suppose that x ∈ S_i. Under Assumption 1, if the optimization problem in (13) is feasible, then any optimal solution c* to it satisfies x = Σ_{x_j ∈ X_0 ∩ S_i} c*_j x_j, and c*_j = 0 for all x_j satisfying x_j ∉ S_i, i.e., x is expressed as a linear combination of points in X_0 that are from its own subspace.

Proof.

An optimal solution c* to (13) must be feasible, i.e., x = Σ_{x_j ∈ X_0} c*_j x_j, which after rearrangement gives

   x − Σ_{x_j ∈ X_0 ∩ S_i} c*_j x_j = Σ_{x_j ∈ X_0 \ S_i} c*_j x_j.   (18)

Since the left-hand side is a vector in S_i and the right-hand side is a vector in Σ_{l ≠ i} S_l, it follows from Assumption 1 and [59, Theorem 6] that both sides are zero, i.e., x = Σ_{x_j ∈ X_0 ∩ S_i} c*_j x_j, as claimed.

Next, let us define the vector c' such that c'_j = c*_j for all x_j ∈ S_i and c'_j = 0 for all x_j such that x_j ∉ S_i. Using the claim above and the definition of c', we see that c' is feasible for (13). Moreover, it satisfies

   ‖c'‖_1 = ‖c*‖_1 − Σ_{x_j ∉ S_i} |c*_j| ≤ ‖c*‖_1.   (19)

Since c* is optimal for (13), it follows from (19) that ‖c'‖_1 = ‖c*‖_1. Combining this fact with (19) shows that c*_j = 0 for all x_j such that x_j ∉ S_i, which completes the proof. ∎

We may use this lemma to prove the following result.

Theorem 1.

Under Assumption 1, for all k ≥ Σ_{i=1}^n d_i, any solution X*_k to the optimization problem (12) contains at least d_i linearly independent points from each subspace S_i. Moreover, with X_0 = X*_k, the optimization problem in (13) is feasible for all x ∈ X, with all optimal solutions being subspace-preserving.

Proof.

Let X*_k be any optimal solution to (12) for any fixed k ≥ Σ_{i=1}^n d_i, and let X̄_0 ⊆ X be any subset with |X̄_0| ≤ k that contains d_i linearly independent points from S_i for each i, which we know exists. It follows that f_∞(x, X̄_0) < ∞ for all x ∈ X, so that F_∞(X̄_0) < ∞. This and the optimality of X*_k imply F_∞(X*_k) ≤ F_∞(X̄_0) < ∞. This fact and the definition of F_∞ mean that f_∞(x, X*_k) < ∞ for all x ∈ X, i.e., x ∈ span(X*_k) for all x ∈ X. Combining this with Assumption 1 implies that X*_k contains at least d_i linearly independent points from each subspace S_i, which also means that the problem in (13) with X_0 = X*_k is feasible for all x ∈ X. Combining this with Lemma 4 shows that all solutions to the optimization problem in (13) are subspace-preserving. ∎

When k = Σ_{i=1}^n d_i, Theorem 1 shows that exactly d_i points are selected from each subspace S_i regardless of the number of points in S_i. Therefore, when the data is class-imbalanced, (12) selects a subset that is more balanced, provided the dimensions of the subspaces do not differ dramatically.

Theorem 1 also shows that only Σ_{i=1}^n d_i points are needed to correctly represent all data points in X. In other words, the required number of exemplars for representing the dataset does not scale with the size of the dataset X.

Although the FFS algorithm in Section 2.2 is a computationally efficient greedy algorithm that does not necessarily solve (12), the following result shows that it does output a subset of exemplars from the data with desirable properties.

Theorem 2.

The conclusions of Theorem 1 hold when X*_k is replaced by the set of k exemplars returned by Algorithm 1 (equivalently, Algorithm 2), for any k ≥ Σ_{i=1}^n d_i.

Proof.

Note that, since we take λ = ∞ in this section, it follows from the definition in (13) that f_∞(x, X_0) < ∞ if and only if x ∈ span(X_0). It follows from this fact and the construction of Algorithm 1 that each iteration of Algorithm 1 adds a data point from X that is linearly independent from those in