Cluster Representatives Selection in Non-Metric Spaces for Nearest Prototype Classification

07/03/2021, by Jaroslav Hlaváč, et al.

The nearest prototype classification is a less computationally intensive replacement for the k-NN method, especially when large datasets are considered. In metric spaces, centroids are often used as prototypes to represent whole clusters. The selection of cluster prototypes in non-metric spaces is more challenging as the idea of computing centroids is not directly applicable. In this paper, we present CRS, a novel method for selecting a small yet representative subset of objects as a cluster prototype. Memory and computationally efficient selection of representatives is enabled by leveraging the similarity graph representation of each cluster created by the NN-Descent algorithm. CRS can be used in an arbitrary metric or non-metric space because of the graph-based approach, which requires only a pairwise similarity measure. As we demonstrate in the experimental evaluation, our method outperforms the state of the art techniques on multiple datasets from different domains.


1 Introduction

The $k$-NN classifiers are often used in many application domains due to their simplicity and the ability to trace the classification decision back to a specific set of samples. However, their adoption is limited by high computational complexity. Because contemporary datasets are often huge, containing hundreds of thousands or even millions of samples, computing the similarity between the classified sample and the entire dataset may be computationally intractable.

In order to decrease computational and memory requirements, the nearest prototype classification (NPC) method is commonly used, cf. [seo2003soft, schleif2005local, cervantes2007adaptive]. In NPC, each class is divided into one or more clusters, and each cluster is represented by its prototype. The classified sample is then compared only to the prototypes instead of calculating similarity to the entire dataset.

Therefore, the goal of prototype selection is to find a memory-efficient representation of clusters such that classification accuracy is preserved while the number of comparisons is significantly reduced.

However, in many application domains, objects might exist in a non-metric space where only a pairwise similarity is defined, e.g., in bioinformatics [martino2018granular], biometric identification [becker2010methods], computer networks [kopp2018community] or pattern recognition [scheirer2014good].

In such application domains, standard representations such as centroids may not be easily determined, or their interpretation does not make much sense. For these scenarios, only a few methods have been developed, and to the best of our knowledge the only general (not domain-specific) approach is based on the selection of a small subset of objects to represent the remaining cluster members. These objects, called representatives, are then used as a prototype.

While several methods capable of solving representative selection in non-metric spaces exist (e.g., DS3 [elhamifar2015dissimilarity] and δ-medoids [liebman2015representative]), there has not been much research activity in this direction.

Our focus on non-metric spaces comes from the problem of behavioural clustering of network hosts [kopp2018community]. Nevertheless, the problem of selecting a minimal number of representative samples is of more general interest. Therefore, we present a novel method to solve the problem of Cluster Representatives Selection (CRS). CRS is a general method capable of selecting a small representative subset of objects from a cluster as its prototype. Its core idea is the fast construction of an approximate reverse $k$-NN graph and then solving the minimal vertex cover problem on that graph. Only a pairwise similarity is required to build the reverse $k$-NN graph; therefore, the application of CRS is not limited to metric spaces.

To show that CRS is general and domain-independent, we present an experimental evaluation on datasets from image recognition, document classification and network host classification, with appealing results when compared to the current state of the art.

The paper is organized as follows. The related work is briefly reviewed in the next section. Section 3 formalises the representative selection as an optimization problem. The proposed CRS method is described in detail in Section 4. The experimental evaluation is summarized in Section 5 followed by the conclusion.

2 Related Work

Over the past years, significant effort has been made to represent clusters in the most condensed way possible. The approaches can be categorized into two main groups.

The first group gathers all prototype generation methods [triguero2011taxonomy], which create artificial samples to represent original clusters, e.g. [geva1991adaptive, xie1993vector]. The second group contains the prototype selection methods. As the name suggests, a subset of samples from the given cluster is selected to represent it. Prototype selection is a well-explored field with many approaches, see, e.g. [garcia2012prototype].

However, most of the current algorithms exploit properties of the metric space, e.g., structured sparsity [wang2017representative], $\ell_1$-norm induced selection [zhang2018seeing] or identification of borderline objects [olvera2018accurate].

When we leave the luxury of the metric space and focus on situations where only a pairwise similarity exists or where averaging of existing samples may create an object without meaning, there is not much previous work.

The δ-medoids [liebman2015representative] algorithm uses the idea of $k$-medoids to semi-greedily cover the space with δ-neighbourhoods, in which it then looks for an optimal medoid to represent the given neighbourhood. The main issue of this method is the selection of δ: this hyperparameter has to be fine-tuned for each domain.

The DS3 [elhamifar2015dissimilarity] algorithm calculates the full similarity matrix and then selects representatives by a row-sparsity regularized trace minimization program, which tries to minimize the number of rows needed to encode the whole matrix. The overall computational complexity is the most significant disadvantage of this algorithm, despite the proposed approximate estimation of the similarity matrix using only a subset of the data.

The proposed method for Cluster Representatives Selection (CRS) approximates the topological structure of the data by creating a reverse $k$-NN graph. CRS then iteratively selects the nodes with the largest reverse neighbourhoods as representatives of the data. This approach systematically minimizes the number of pairwise comparisons, reducing computational complexity while accurately representing the data.

3 Problem Formulation

In this section, we define the problem of prototype-based representation of clusters and the nearest prototype classification (NPC). As already stated in the Introduction, we study prototype selection in the general case, including non-metric spaces. Therefore, we further assume that a cluster prototype is always specified as a (possibly small) subset of its members.

Cluster prototypes

Let $\mathcal{D}$ be an arbitrary space of objects for which a pairwise similarity function $s: \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ is defined, and let $X \subseteq \mathcal{D}$ be a set of (training) samples. Let $\mathcal{C} = \{C_1, \ldots, C_m\}$ be a clustering of $X$ such that $\bigcup_{i=1}^{m} C_i = X$ and $C_i \cap C_j = \emptyset$ for $i \neq j$. Let $C = \{x_1, \ldots, x_n\} \in \mathcal{C}$ be a cluster of size $n$. For $x \in C$, let $N_k(x)$ denote the $k$ closest samples to $x$, i.e., the set of $k$ samples that have the highest similarity to $x$ in the rest of the cluster $C \setminus \{x\}$. Then the goal of prototype selection is to find a subset $R \subseteq C$ of samples for each cluster such that:

$$\forall x \in C: \left( \{x\} \cup N_k(x) \right) \cap R \neq \emptyset. \tag{1}$$

The set $R$ is then called the prototype of the cluster $C$. In case of ties in similarity, we pick the samples with the lowest indices $i$.

In order to minimize the computational requirements of NPC, we search for a minimal set of cluster representatives for each cluster which satisfies the coverage requirement (1):

$$R^{*} = \operatorname*{arg\,min}_{R \subseteq C,\; R \text{ satisfies } (1)} |R|. \tag{2}$$

Note that several sets might satisfy this coverage requirement.
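To make the coverage requirement concrete, the following minimal Python sketch (our illustration, not code from the paper; the helper names `k_neighbourhood` and `coverage` are ours) checks whether a candidate representative set satisfies (1), or its relaxed form (3) introduced below, given only a pairwise similarity function:

```python
def k_neighbourhood(C, s, i, k):
    """Indices of the k samples most similar to C[i] within C, excluding i.
    Ties are broken by the lower index, as in the text."""
    others = sorted((j for j in range(len(C)) if j != i),
                    key=lambda j: (-s(C[i], C[j]), j))
    return set(others[:k])

def coverage(C, s, R, k):
    """Fraction of cluster members covered by the representative set R
    (a set of indices into C): a member is covered if it is itself a
    representative or has a representative among its k nearest neighbours."""
    covered = sum(1 for i in range(len(C))
                  if i in R or R & k_neighbourhood(C, s, i, k))
    return covered / len(C)

# R satisfies requirement (1) iff coverage(C, s, R, k) == 1.0,
# and the relaxed requirement (3) iff coverage(C, s, R, k) >= phi.
```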

Relaxed prototypes

Finding cluster prototypes that fully meet the coverage requirement (1) might pose an unnecessary computational burden. In many cases, a smaller prototype, which is much easier to obtain, can capture enough of the important characteristics of a cluster despite possibly not covering all of its members (e.g., a few outliers). Motivated by this observation, we introduce a relaxed requirement on cluster prototypes. We say that a set $R \subseteq C$ is a representative prototype of cluster $C$ if the following condition is met:

$$\left| \{ x \in C : (\{x\} \cup N_k(x)) \cap R \neq \emptyset \} \right| \;\geq\; \phi \cdot |C| \tag{3}$$

for a preset parameter $\phi \in (0, 1]$.

In what follows, we replace the requirement (1) with its relaxed version (3). If needed, the full coverage requirement can be enforced by simply setting $\phi = 1$. Similarly, also in the relaxed version, we seek a prototype with minimal cardinality which satisfies (3).

Nearest Prototype Classification

Having the representative prototypes for all clusters, we now describe how the classification is performed. In the nearest prototype classification (NPC), an unseen sample $x$ is classified to the cluster (i.e., the respective target class is assigned to the sample) which is represented by the prototype with the highest similarity to $x$. As the cluster prototypes are disjoint sets, the nearest prototype is defined as the prototype containing the sample with the highest similarity to $x$. Formally, given the prototypes $R_1, \ldots, R_m$ of all clusters, the nearest prototype, denoted $R^{*}(x)$, is the prototype containing $y^{*}$, where

$$y^{*} = \operatorname*{arg\,max}_{y \in R_1 \cup \cdots \cup R_m} s(x, y).$$

Again, we resolve ties by picking the candidate with the lowest index in the dataset.

Finally, the sample $x$ is classified with the same label as that of the cluster represented by the prototype $R^{*}(x)$.
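A minimal sketch of this decision rule (our illustration; the data layout, a list of prototypes with one class label per cluster, is an assumption):

```python
def nearest_prototype_label(x, prototypes, labels, s):
    """Classify x by the label of the cluster whose prototype contains the
    sample most similar to x. `prototypes` is a list of lists of objects,
    `labels` holds the class label of each cluster."""
    best_sim, best_label = float("-inf"), None
    for proto, label in zip(prototypes, labels):
        for y in proto:
            sim = s(x, y)
            if sim > best_sim:  # strict '>' keeps the earliest match on ties
                best_sim, best_label = sim, label
    return best_label
```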

4 Cluster Representatives Selection

In this section, we describe our method CRS for building the cluster prototypes. The entire method is composed of two steps that are discussed in more detail in the individual subsections. First, given a cluster $C$ and a similarity measure $s$, a reverse $k$-NN graph is constructed from the objects of $C$ using the pairwise similarity $s$. Then, the graph is used to select representatives that satisfy the coverage requirement while minimizing the size of the cluster prototype. A simplified scheme of the whole process is depicted in Figure 1.

(a) dataset
(b) k-neighbourhood
(c) reverse neighbourhood
Figure 1: Illustration of the steps of the CRS algorithm. (a) Visualization of a toy 2D dataset. (b) 2-NN graph created from the dataset. (c) Reverse graph created from the graph depicted in (b). Point C is a representative of A, B, D, E and would be a good first choice of representative. Depending on the coverage parameter $\phi$, the node F could be considered an outlier or also added to the representation.

4.1 Building the Prototype

For the purpose of building the prototype for a cluster $C$, a weighted reverse $k$-NN graph is used. It is defined as $G_R = (V, E, w)$, where $V$ is the set of all objects in cluster $C$, $E$ is a set of directed edges and $w: E \to \mathbb{R}$ is a weight function. An edge from node $u$ to node $v$ exists if $u \in N_k(v)$, i.e., if $u$ is among the $k$ nearest neighbours of $v$, while the edge weight is given by the similarity between the connected nodes, $w(u, v) = s(u, v)$.

The effective construction of such a graph is enabled by employing the NN-Descent [dong2011efficient] algorithm, which produces a $k$-NN graph $G$. The reverse $k$-NN graph $G_R$ is then obtained from $G$ by simply reversing the directions of the edges in $G$.

NN-Descent is a fast converging approximate method for $k$-NN graph construction. It exploits the idea that “a neighbour of a neighbour is also likely to be a neighbour” to locally explore neighbouring nodes for better solutions.

Having the reverse $k$-NN graph $G_R$, we want to ensure that each object is at least $\tau$-similar to all its neighbours, i.e.,

$$\forall (u, v) \in E: \; w(u, v) \geq \tau.$$

Omitting all edges with weight lower than $\tau$ not only lowers the memory requirements, but also reveals objects with large neighbourhoods as good representative candidates.
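The following sketch builds the thresholded reverse $k$-NN graph as an adjacency map (a hypothetical helper of ours, reusing `k_neighbourhood` from the Section 3 sketch; note that it uses brute-force $k$-NN search, whereas the paper uses NN-Descent precisely to avoid the quadratic number of comparisons):

```python
from collections import defaultdict

def reverse_knn_graph(C, s, k, tau):
    """rev[j] = indices of objects that have C[j] among their k nearest
    neighbours with similarity at least tau, i.e. the reverse
    neighbourhood of j after dropping edges below the weight threshold."""
    rev = defaultdict(set)
    for i in range(len(C)):
        for j in k_neighbourhood(C, s, i, k):
            if s(C[i], C[j]) >= tau:
                rev[j].add(i)
    return rev
```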

The selection of representatives is treated as a minimum vertex cover problem on $G_R$. We use a greedy algorithm which iteratively selects the objects with maximal reverse neighbourhood $|N^{-1}(v)|$ as representatives and marks them and their neighbourhood as covered. The algorithm stops when the coverage requirement (3) is met (see Section 3).

The whole algorithm is summarized in Algorithm 1.

Data: cluster $C$, similarity $s$, coverage threshold $\phi$
Result: set of selected representatives $R$
1 $G \leftarrow$ NN-Descent($C$, $s$)
2 $G_R \leftarrow$ ReverseGraph($G$)
3 $U \leftarrow C$  // set of uncovered objects
4 $R \leftarrow \emptyset$  // set of representatives
5 while $|C \setminus U| < \phi \cdot |C|$ do
6       $r \leftarrow \operatorname{arg\,max}_{v \in U} |N^{-1}(v) \cap U|$
7       $R \leftarrow R \cup \{r\}$
8       $U \leftarrow U \setminus \{r\}$
9       $U \leftarrow U \setminus N^{-1}(r)$
10 end while
return $R$
Algorithm 1 Pseudocode for Cluster Representatives Selection
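A compact Python rendering of Algorithm 1, under the same assumptions as the earlier sketches (it reuses our `reverse_knn_graph` helper in place of the NN-Descent plus ReverseGraph steps, and counts only still-uncovered reverse neighbours when greedily choosing a representative):

```python
def select_representatives(C, s, k, tau, phi):
    """Greedy CRS selection: pick objects with the largest (uncovered)
    reverse neighbourhoods until a phi-fraction of C is covered."""
    rev = reverse_knn_graph(C, s, k, tau)   # lines 1-2 of Algorithm 1
    uncovered = set(range(len(C)))          # U, the uncovered objects
    R = set()                               # selected representatives
    while len(uncovered) > (1.0 - phi) * len(C):
        # the uncovered object whose reverse neighbourhood covers
        # the most still-uncovered objects
        r = max(uncovered, key=lambda v: len(rev[v] & uncovered))
        R.add(r)
        uncovered.discard(r)
        uncovered -= rev[r]
    return R
```

With the default setting discussed in Section 4.2, this would be called with $\phi = 0.95$ and $\tau$ set to the homogeneity of the cluster.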

An example of a cluster prototype selected by the CRS algorithm is presented in Figure 2.

Figure 2: Selection of representatives by CRS with different values of $k$ for a 2-dimensional dataset with 255 samples. For better comparison, δ-medoids and DS3 are also shown. CRS takes into account the density of different parts of the cluster and selects representatives accordingly. δ-medoids covers the dataset entirely with evenly distributed representatives. DS3 selects the fewest representatives but does not capture the overall structure of the cluster very well.

4.2 Discussion on parameters

This subsection summarizes the parameters of the CRS method.

  • $k$: the number of neighbours for the $k$-NN graph creation. When $k$ is high, each object covers more neighbours, but on the other hand it also increases the number of pairwise similarity calculations. This trade-off is illustrated for different values of $k$ in Figure 3. Due to the large impact of this parameter on the properties of the produced representations and on computational requirements, we study its behaviour in more detail in a dedicated experiment in Section 5.

    Figure 3: The number of representatives selected and the quality of the representation are both controlled by $k$. As each object explores a bigger neighbourhood for higher $k$, the number of other objects it represents grows, and therefore the number of representatives decreases. On the other hand, with fewer representatives, some information about the structure is lost, as in the case of the largest depicted $k$.
  • $\phi$: the coverage parameter for the relaxed coverage requirement introduced in Section 3. In this work, we set it to 0.95, so that the vast majority of each cluster is still covered but outliers do not influence the prototypes.

  • $\tau$: the threshold on edge weights, determining which edges will be kept in the graph (see Section 4.1). By default it is automatically set to the value of the homogeneity of the cluster $C$ (a sketch of this computation follows the list):

    $$\tau = h(C) = \frac{2}{|C|\,(|C| - 1)} \sum_{i=1}^{n} \sum_{j=i+1}^{n} s(x_i, x_j). \tag{4}$$
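A minimal sketch of the default $\tau$, assuming homogeneity means the average pairwise similarity of the cluster, as reconstructed in (4) above (the paper's experiments approximate it from a random 5% sample for large clusters, see Section 5):

```python
from itertools import combinations

def homogeneity(C, s):
    """Average pairwise similarity over all unordered pairs of the cluster."""
    pairs = list(combinations(range(len(C)), 2))
    return sum(s(C[i], C[j]) for i, j in pairs) / len(pairs)
```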

Additionally, the NN-Descent algorithm, used within the CRS method, has two more parameters that specify its behaviour during the $k$-NN graph creation. First, the parameter $\delta$, which is used for early termination of NN-Descent when the number of changes in the constructed graph becomes minimal. We set it to 0.001, as suggested by the authors of the original work [dong2011efficient]. Second, the sample rate $\rho$ controls the number of reverse neighbours to be explored in each iteration of NN-Descent. We set it to 0.7 to speed up the $k$-NN graph creation while not diverging too far from the optimal solution, in accordance with the suggestions published in [dong2011efficient].

5 Experiments

This section presents an experimental evaluation of the CRS algorithm on multiple datasets from very different domains, covering computer networks, text document processing and image classification. In the first experiment, we study the influence of the parameter $k$ (which determines the number of nearest neighbours used for building the underlying $k$-NN graph). Next, we compare the CRS method to the state-of-the-art techniques DS3 [elhamifar2015dissimilarity] and δ-medoids [liebman2015representative] in the nearest prototype classification task on different datasets.

We set $\tau$ to an approximate homogeneity calculated from a random 5% sample of the cluster. We use $\delta = \tau$ for the δ-medoids algorithm; this makes the comparison with CRS most direct, because CRS also restricts the similarity by $\tau$ during reverse graph creation. We obtained the best results for DS3 when creating the full similarity matrix for the entire cluster. Finally, the parameters for CRS are discussed in Section 4.2. The following experiment explores the impact of $k$ in greater detail.

5.1 Impact of $k$

When building cluster prototypes with the CRS method, the number of nearest neighbours considered for building the $k$-NN graph (specified by the parameter $k$) plays a very important role. With small values of $k$, each object represents only a few of its most similar neighbours. However, this also increases the number of representatives needed to sufficiently cover the cluster. On the other hand, higher values of $k$ produce smaller prototypes, as each representative is able to cover more objects. Nonetheless, this comes at the cost of an increased computational burden, because the cost of $k$-NN graph creation rises rapidly with higher values of $k$. These trends can be well observed in Figure 4, which shows the classification precision, the sizes of the created prototypes and the numbers of similarity function evaluations depending on $k$, for several clusters that differ in homogeneity and size. We can see the changing trade-off between computational requirements (blue line) and memory requirements (red line) as $k$ increases, mostly without significant impact on classification precision. The parameter $k$ can therefore be set according to the preferred balance of computational and memory requirements without significantly decreasing the classification performance.

(a) MNIST Fashion - Dress
(b) Medium Network Cluster
(c) MNIST Fashion - Sandal
(d) Big Network Cluster
Figure 4: Illustration of how the selection of $k$ influences the number of representatives and the number of similarity computations. The number of representatives is shown relative to the size of the cluster. For all depicted clusters, the relative number of comparisons increases with $k$; however, the size of the selected prototype decreases steeply while the precision decreases only slowly.

5.2 Datasets

In this section we briefly describe the three datasets used in the ongoing subsections for experimental comparison of individual methods.

5.2.1 MNIST Fashion

The MNIST Fashion [xiao2017/online] dataset is a well-established image recognition benchmark consisting of 60,000 greyscale images of fashion items belonging to 10 classes. It recently replaced the overused handwritten digits dataset in many benchmarks. For this dataset, the cosine similarity was used as the similarity function $s$.

5.2.2 20 Newsgroup

This dataset is a well-known benchmark for text document processing. It is composed of nearly 20 thousand newsgroup documents from 20 different classes (topics). The dataset was preprocessed such that each document is represented by a TF-IDF frequency vector. As the similarity function, we again use the cosine similarity, which is a common choice in the domain of text document processing.

5.2.3 Private Network Dataset

This dataset was collected on a corporate computer network, originally for the purpose of clustering network hosts based on their behaviour [kopp2018community]. That work defines a specific similarity measure on network hosts, which we adopt for this paper. Clusters of network hosts were also defined according to the results achieved in the original work. Additionally, for the purposes of the evaluation, clusters smaller than 10 members were not considered, since such small clusters can be easily represented by any method. In contrast to the previous datasets, the sizes and homogeneity values of the clusters in the Network dataset differ significantly.

5.3 Evaluation of Results

In this section we present the results for each dataset in detail. The main results are summarized in Table 1. For a more complete picture, we also include results for a random 5% of the cluster and for the full 100% of the cluster used as a prototype. The statistical comparison of the methods can be found in Figure 5. Better rankings for some of the CRS variants reflect that CRS covers only $\phi = 95\%$ of each cluster, which removes the outliers that decrease the precision and recall of the full cluster representation.

Method MNIST Fashion 20Newsgroup Network
δ-medoids 0.763/0.744 (4.73%) 0.542/0.515 (14.51%) 0.865/0.978 (7.38%)
DS3 0.657/0.563 (0.07%) 0.133/0.132 (0.64%) 0.87/0.977 (1.88%)
random-5% 0.793/0.784 (5.0%) 0.452/0.435 (5.06%) 0.958/0.97 (5.09%)
full-100% 0.823/0.817 (100.0%) 0.56/0.548 (100.0%) 0.987/0.963 (100.0%)
CRS-k5 0.855/0.852 (87.71%) 0.635/0.632 (56.58%) 0.988/0.958 (65.31%)
CRS-k10 0.836/0.826 (15.37%) 0.538/0.516 (7.14%) 0.985/0.965 (7.74%)
CRS-k15 0.828/0.823 (5.08%) 0.522/0.488 (5.17%) 0.983/0.982 (5.26%)
Table 1: Average precision/recall values for each method used on each dataset. The table also shows the percentage of the cluster that was selected as a prototype. Our algorithm is on par with existing methods while selecting noticeably fewer representatives.
(a) MNIST precision
(b) news precision
(c) network precision
(d) MNIST recall
(e) news recall
(f) network recall
Figure 5: Critical difference diagram comparison (cf. [demvsar2006statistical]) of algorithms constructed using Friedman’s test with correction for multiple post-hoc hypotheses by Shaffer [shaffer1995multiple]. The diagrams show the average rank of each algorithm over all clusters in each dataset, groups of algorithms that are not significantly different (p=0.05) are connected.

When evaluating the experiments, we take into account both precision/recall and the percentage of samples selected as prototypes. As shown in the experiment in Section 5.1, CRS can be tuned by the parameter $k$ to significantly reduce the number of representatives while maintaining high precision/recall values. When using the full cluster as its prototype, the average precision and recall values are slightly lower than with the CRS method. This shows that CRS with $\phi < 1$ makes the classification immune to outliers, which can otherwise decrease the classification quality. The DS3 method selects a significantly lower number of representatives than any other method; however, this comes at the cost of lower precision and recall values.

Runtimes of individual algorithms also differ significantly. We evaluate the runtime requirements of each algorithm by the relative number of similarity computations, defined as:

$$c_{\mathrm{rel}} = \frac{c}{c_{\mathrm{full}}}, \tag{5}$$

where $c$ stands for the actual number of comparisons made and $c_{\mathrm{full}} = \frac{|C|(|C|-1)}{2}$ is the hypothetical number needed for computing the full similarity matrix.

We use DS3 with the full similarity matrix to get the most accurate results, therefore $c_{\mathrm{rel}} = 1$. For δ-medoids, the number of computations performed cannot be easily preset; therefore, we averaged the number of comparisons over different values of $\delta$ (the number might differ significantly for different values of $\delta$). For CRS, the number of comparisons is influenced by $k$, the homogeneity of each cluster and its size. The impact of $k$ was discussed in detail in Section 5.1. The experiment shows that one can estimate $c_{\mathrm{rel}}$ from the size of the cluster and the chosen $k$.
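For clarity, the relative cost measure (5) in code form (a trivial helper of ours):

```python
def relative_comparisons(c, n):
    """Eq. (5): actual comparisons c relative to the n*(n-1)/2 evaluations
    a full similarity matrix of an n-object cluster would need."""
    return c / (n * (n - 1) / 2)
```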

5.3.1 MNIST Fashion

The average homogeneity of a cluster in the MNIST Fashion dataset is 0.76. This corresponds to a slower decline of the precision and recall values as the number of representatives decreases. In Figure 6, the steep decline in the number of selected representatives decreases the precision and recall only slightly for each cluster; in the case of the Dress cluster, precision even increases slightly. Figure 7 shows the confusion matrices for the compared methods.

Figure 6: Visualization of the precision and recall of all methods in relation to the percentage of the cluster selected, for chosen clusters from the MNIST Fashion dataset. Values of 0.0 for some clusters for DS3 mean that less than 0.1% of the objects were selected as representatives.
(a) CRS-k10
(b) -Medoids
(c) DS3
Figure 7: Confusion matrices for a cluster from each category in the MNIST Fashion dataset show the performance of all three compared methods. The Sandal class was the hardest to represent for all methods. This is also quantified in Figure 6.

5.3.2 20Newsgroup

The 20Newsgroup dataset has the lowest average homogeneity of all the datasets. The samples are less similar to each other on average, hence the lower precision and recall values. This is also reflected in the percentage of objects selected as representatives by the δ-medoids algorithm. Confusion matrices for one cluster from each subgroup are shown in Figure 8.

(a) CRS-k10
(b) -Medoids
(c) DS3
Figure 8: Confusion matrices for a cluster from each category in the 20Newsgroup dataset show the performance of all three compared methods. The confusion matrix for DS3 visualizes the results from Table 1.

5.3.3 Network Dataset

The results for data collected in a real network further support the assumptions made in Section 5.1.

Figure 9 shows a comparison of the individual methods in terms of precision and recall for selected large clusters, as well as the impact of different values of $k$. The depicted clusters were chosen for their sizes and differing homogeneity, see Table 2.

Cluster A B C D E F G H I J K L M N
Size 1079 2407 75 2219 346 59 248 49 52 108 218 44 42 32
Homogeneity 0.58 0.14 0.84 0.64 0.60 0.92 0.34 0.84 0.69 0.35 0.78 0.35 0.79 1.0
Table 2: Sizes and approximate homogeneity of each cluster from the Network dataset.

Increasing $k$ significantly reduces the percentage of the dataset selected for its representation while still retaining high precision and recall values. The results on the Network dataset are particularly important because it is a non-metric dataset, where only a pairwise similarity is defined.

Figure 9: Visualization of the precision and recall of all methods in relation to the percentage of the cluster selected, for chosen clusters from the Network dataset.

6 Conclusion

This paper proposed a new method called CRS for building representations of clusters: cluster prototypes, which are small subsets of the original clusters. CRS leverages nearest-neighbour graphs to map the structure of each cluster and to identify the most important representatives, which then form the cluster prototype. Thanks to this approach, CRS can be applied equally in metric and non-metric spaces. The proposed method was compared to the prior art in a nearest prototype classification setup on multiple datasets from different domains. The experimental results show that CRS achieves superior classification quality while producing comparably compact representations of clusters.

References