$k$-NN classifiers are used in many application domains due to their simplicity and their ability to trace a classification decision to a specific set of samples. However, their adoption is limited by high computational complexity. Because contemporary datasets are often huge, containing hundreds of thousands or even millions of samples, computing the similarity between the classified sample and the entire dataset may be computationally intractable.
In order to decrease computational and memory requirements, the nearest prototype classification (NPC) method is commonly used, cf. [seo2003soft, schleif2005local, cervantes2007adaptive]. In NPC, each class is divided into one or more clusters, and each cluster is represented by its prototype. The classified sample is then compared only to the prototypes instead of calculating its similarity to the entire dataset.
Therefore, the goal of prototype selection is to find a memory-efficient representation of clusters such that classification accuracy is preserved while the number of comparisons is significantly reduced.
However, in many application domains, objects might exist in a non-metric space where only a pairwise similarity is defined, e.g., bioinformatics [martino2018granular], biometric identification [becker2010methods], computer networks [kopp2018community] [scheirer2014good].
In such application domains, standard representations such as centroids may not be easily determined, or their interpretation does not make much sense. For these scenarios, only a few methods have been developed, and to the best of our knowledge the only general (not domain-specific) approach is based on the selection of small subsets of objects to represent the remaining cluster members. These objects, called representatives, are then used as a prototype.
While several methods capable of solving representatives selection on non-metric spaces exist (e.g., DS3 [elhamifar2015dissimilarity], $\delta$-medoids [liebman2015representative]), there has not been much research activity in this direction.
Our focus on non-metric spaces comes from the problem of behavioural clustering of network hosts [kopp2018community]. Nevertheless, the problem of selecting a minimal number of representative samples is of more general interest. Therefore, we present a novel method to solve the problem of Cluster Representatives Selection (CRS). CRS is a general method capable of selecting a small representative subset of objects from a cluster as its prototype. Its core idea is the fast construction of an approximate reverse $k$-NN graph, followed by solving a minimum vertex cover problem on that graph. Only a pairwise similarity is required to build the reverse $k$-NN graph, therefore the application of CRS is not limited to metric spaces.
To show that CRS is general and domain-independent, we present an experimental evaluation on datasets from image recognition, document classification and network host classification, with appealing results when compared to the current state of the art.
The paper is organized as follows. The related work is briefly reviewed in the next section. Section 3 formalises the representative selection as an optimization problem. The proposed CRS method is described in detail in Section 4. The experimental evaluation is summarized in Section 5 followed by the conclusion.
2 Related Work
During the past years, significant effort has been made to represent clusters in the most condensed way. The approaches could be categorized into two main groups.
The first group gathers all prototype generation methods [triguero2011taxonomy], which create artificial samples to represent original clusters, e.g. [geva1991adaptive, xie1993vector]. The second group contains the prototype selection methods. As the name suggests, a subset of samples from the given cluster is selected to represent it. Prototype selection is a well-explored field with many approaches, see, e.g. [garcia2012prototype].
However, most of the current algorithms exploit the properties of the metric space, e.g., structured sparsity [wang2017representative], $\ell_1$-norm induced selection [zhang2018seeing] or identification of borderline objects [olvera2018accurate].
When we leave the luxury of the metric space and focus on situations where only a pairwise similarity exists or where averaging of existing samples may create an object without meaning, there is not much previous work.
The $\delta$-medoids [liebman2015representative] algorithm uses the idea of $k$-medoids to semi-greedily cover the space with $\delta$-neighbourhoods, in which it then looks for an optimal medoid to represent a given neighbourhood. The main issue of this method is the selection of $\delta$: this hyperparameter has to be fine-tuned based on the domain.
The DS3 [elhamifar2015dissimilarity] algorithm calculates the full similarity matrix and then selects representatives by a row-sparsity regularized trace minimization program which tries to minimize the number of rows needed to encode the whole matrix. The overall computational complexity is the most significant disadvantage of this algorithm, despite a proposed approximation that estimates the similarity matrix using only a subset of the data.
The proposed method for Cluster Representatives Selection (CRS) approximates the topological structure of the data by creating a reverse $k$-NN graph. CRS then iteratively selects nodes with the biggest reverse neighbourhoods as representatives of the data. This approach systematically minimizes the number of pairwise comparisons to reduce computational complexity while accurately representing the data.
3 Problem Formulation
In this section, we define the problem of prototype-based representation of clusters and the nearest prototype classification (NPC). As already stated in the Introduction, we study prototype selection in general cases, including non-metric spaces. Therefore, we further assume that a cluster prototype is always specified as a (possibly small) subset of its members.
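Let $\mathcal{X}$ be an arbitrary space of objects for which a pairwise similarity function $s: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is defined, and let $X \subseteq \mathcal{X}$ be a set of (training) samples. Let $\mathcal{C} = \{C_1, \dots, C_m\}$ be a clustering of $X$ such that $\bigcup_{i} C_i = X$ and $C_i \cap C_j = \emptyset$ for all $i \neq j$. Let $C_i = \{x_1, \dots, x_n\}$ be a cluster of size $n$. For $x \in C_i$, let us denote by $U_x^k$ the $k$ closest samples to $x$, i.e., the set of $k$ samples that have the highest similarity to $x$ in the rest of the cluster $C_i \setminus \{x\}$. Then the goal of the prototype selection is to find a subset of samples $R_i \subseteq C_i$ for each cluster $C_i$ such that:

$$\forall x \in C_i \setminus R_i \;\; \exists r \in R_i : r \in U_x^k. \tag{1}$$

The set $R_i$ is then called the prototype of the cluster $C_i$. In case of ties, we pick the samples with the lowest indices.
In order to minimize the computational requirements of NPC, we search for a minimal set of cluster representatives $R_i^*$ for each cluster, which satisfies the coverage requirement (1):

$$R_i^* = \arg\min_{R_i \subseteq C_i} |R_i| \quad \text{subject to (1)}. \tag{2}$$

Note that several sets might satisfy this coverage requirement.
Finding cluster prototypes that fully meet the coverage requirement (1) might pose an unnecessary computational burden. In many cases, a smaller prototype which is much easier to obtain can capture enough of the important characteristics of a cluster despite possibly not covering all of its members (e.g., a few outliers). Motivated by this observation, we introduce a relaxed requirement on cluster prototypes. We say that a set $R_i \subseteq C_i$ is a representative prototype of cluster $C_i$ if the following condition is met:

$$\left| R_i \cup \{ x \in C_i : \exists r \in R_i,\; r \in U_x^k \} \right| \geq \epsilon \, |C_i| \tag{3}$$

for a preset parameter $\epsilon \in (0, 1]$.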
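The relaxed coverage requirement can be checked directly from a similarity function. The following is a minimal sketch, not the authors' implementation: the helper names `k_nearest` and `coverage` are hypothetical, and it assumes that a sample counts as covered when it is itself a representative or when some representative is among its $k$ most similar cluster members.

```python
def k_nearest(i, cluster, sim, k):
    """Indices of the k members most similar to cluster[i].

    Ties are broken by the lowest index, as in the text.
    """
    others = [j for j in range(len(cluster)) if j != i]
    # Sort by descending similarity, then ascending index.
    others.sort(key=lambda j: (-sim(cluster[i], cluster[j]), j))
    return others[:k]


def coverage(cluster, reps, sim, k):
    """Fraction of the cluster covered by the representative indices in reps.

    Assumption: a sample is covered if it is a representative or if a
    representative is among its k nearest neighbours.
    """
    reps = set(reps)
    covered = 0
    for i in range(len(cluster)):
        if i in reps or reps.intersection(k_nearest(i, cluster, sim, k)):
            covered += 1
    return covered / len(cluster)


# Toy 1-D cluster with two well-separated groups.
points = [0, 1, 2, 50, 51]
sim = lambda a, b: -abs(a - b)  # higher value means more similar

print(coverage(points, [1], sim, k=1))     # only the first group is covered
print(coverage(points, [1, 3], sim, k=1))  # both groups are covered
```

A prototype would then be accepted as representative once `coverage(...)` reaches the preset coverage parameter.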
Nearest Prototype Classification
Having the representative prototypes for all clusters, we now describe how the classification is performed. In the nearest prototype classification (NPC), an unseen sample $x$ is classified to the cluster (i.e., the respective target class is assigned to the sample) which is represented by the prototype with the highest similarity to the sample $x$. As the cluster prototypes are disjoint sets, the nearest prototype is defined as the prototype containing the sample with the highest similarity to $x$. Formally, given the prototypes $R_1, \dots, R_m$ of all clusters and the pairwise similarity function $s$, the nearest prototype, denoted $R^*$, is the prototype containing $r^*$, where

$$r^* = \arg\max_{r \in R_1 \cup \dots \cup R_m} s(x, r).$$

Again, we resolve ties by picking the candidate with the lowest index in the dataset.
Finally, the sample $x$ is classified with the same label as that of the cluster represented by the prototype $R^*$.
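The classification rule can be sketched in a few lines. This is an illustrative example only: the function name `classify` and the dictionary layout of the prototypes are assumptions, and ties are resolved by the first representative encountered, which is a simplification of the lowest-index rule described above.

```python
def classify(sample, prototypes, sim):
    """Nearest prototype classification.

    prototypes: dict mapping a class label to the list of its representatives.
    sim: pairwise similarity function (higher value means more similar).
    Returns the label of the prototype containing the most similar
    representative.
    """
    best_label, best_sim = None, float("-inf")
    for label, reps in prototypes.items():
        for r in reps:
            value = sim(sample, r)
            # Strict comparison keeps the first candidate on ties.
            if value > best_sim:
                best_label, best_sim = label, value
    return best_label


# Toy example with two prototypes on the real line.
prototypes = {"low": [0, 2], "high": [50]}
sim = lambda a, b: -abs(a - b)

print(classify(3, prototypes, sim))
print(classify(40, prototypes, sim))
```

Only `sim` is evaluated against the representatives, so the number of comparisons per classified sample equals the total size of all prototypes.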
4 Cluster Representatives Selection
In this section, we describe our method CRS for building the cluster prototypes. The entire method is composed of two steps that are discussed in more detail in individual subsections. First, given a cluster $C$ and a similarity measure $s$, a reverse $k$-NN graph is constructed from the objects in $C$ using the pairwise similarity $s$. Then, the graph is used to select the representatives that satisfy the coverage requirement while minimizing the size of the cluster prototype. The simplified scheme of the whole process is depicted in Figure 1.
4.1 Building the Prototype
For the purpose of building the prototype for a cluster $C$, a weighted reverse $k$-NN graph $G^{-1}$ is used. It is defined as $G^{-1} = (V, E, w)$, where $V$ is the set of all objects in cluster $C$, $E$ is a set of edges and $w$ is a weight vector. An edge $e_{ij}$ between two nodes $v_i, v_j \in V$ exists if $v_i$ is among the $k$ closest samples to $v_j$, while the edge weight is given by the similarity between the connected nodes, $w_{ij} = s(v_i, v_j)$.
The efficient construction of such a graph is enabled by employing the NN-Descent [dong2011efficient] algorithm, which produces a $k$-NN graph $G$. The reverse $k$-NN graph $G^{-1}$ is then obtained from $G$ by simply reversing the directions of the edges in $G$.
NN-Descent is a fast converging approximate method for the $k$-NN graph construction. It exploits the idea that “a neighbour of a neighbour is also likely to be a neighbour” to locally explore neighbouring nodes for better solutions.
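To make the "neighbour of a neighbour" idea concrete, the following is a heavily simplified sketch in the spirit of NN-Descent. It is not the algorithm of [dong2011efficient]: it omits the sampling and the new/old flags of the original, and the function name, the `max_iters` safeguard and the data layout are assumptions made for this example. It requires more than $k$ items.

```python
import random


def nn_descent(items, sim, k, delta=0.001, max_iters=20, seed=0):
    """Approximate k-NN graph via local neighbour-of-neighbour search.

    Returns a list where entry i is a dict {neighbour index: similarity}.
    """
    rng = random.Random(seed)
    n = len(items)
    # Start from k random neighbours for every node.
    knn = []
    for i in range(n):
        start = rng.sample([j for j in range(n) if j != i], k)
        knn.append({j: sim(items[i], items[j]) for j in start})

    for _ in range(max_iters):
        # Collect forward and reverse neighbours of every node.
        both = [set(nb) for nb in knn]
        for i, nb in enumerate(knn):
            for j in nb:
                both[j].add(i)

        updates = 0
        for i in range(n):
            # Candidates are the neighbours of the current neighbours.
            candidates = set()
            for j in both[i]:
                candidates |= both[j]
            candidates -= set(knn[i]) | {i}
            for c in candidates:
                value = sim(items[i], items[c])
                worst = min(knn[i], key=lambda j: (knn[i][j], -j))
                if value > knn[i][worst]:
                    # Replace the least similar current neighbour.
                    del knn[i][worst]
                    knn[i][c] = value
                    updates += 1
        # Early termination when few neighbour lists changed.
        if updates <= delta * k * n:
            break
    return knn


items = list(range(20))
sim = lambda a, b: -abs(a - b)
graph = nn_descent(items, sim, k=3)
print(graph[0])
```

The result is approximate by design, so individual neighbour lists may differ from the exact $k$-NN graph.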
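Having the reverse $k$-NN graph, we want to ensure that each object is at least $\tau$-similar to all its neighbours, where $\tau$ is a threshold on the edge weights. Denoting by $w_{ij}$ the weight of the edge between nodes $i$ and $j$, we require

$$w_{ij} \geq \tau \quad \text{for every edge in the graph.}$$

Omitting all edges with weight lower than $\tau$ not only lowers the memory requirements, but also reveals objects with large neighbourhoods as good representative candidates.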
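The construction and pruning of the reverse graph can be illustrated as follows. This is a sketch under stated assumptions: it uses an exact brute-force neighbour search instead of NN-Descent (quadratic in the cluster size, for illustration only), and the function name `reverse_knn_graph` and the dictionary representation are hypothetical.

```python
def reverse_knn_graph(cluster, sim, k, tau):
    """Pruned reverse k-NN graph of a cluster.

    Returns a dict mapping each node index j to the set of indices i
    such that j is among the k nearest neighbours of i and
    sim(cluster[i], cluster[j]) >= tau.
    """
    n = len(cluster)
    reverse = {i: set() for i in range(n)}
    for i in range(n):
        # Exact k-NN of node i; ties are broken by the lowest index.
        scored = sorted(
            (-sim(cluster[i], cluster[j]), j) for j in range(n) if j != i
        )
        for neg_value, j in scored[:k]:
            if -neg_value >= tau:
                # Forward edge i -> j becomes reverse edge j -> i.
                reverse[j].add(i)
    return reverse


points = [0, 1, 2, 50, 51]
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
graph = reverse_knn_graph(points, sim, k=2, tau=0.3)
print(graph)
```

In the toy example, the two distant groups are disconnected after pruning, because the only cross-group edges have a similarity far below the threshold.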
The selection of representatives is treated as a minimum vertex cover problem on the pruned reverse $k$-NN graph. We use a greedy algorithm which iteratively selects the objects with the largest reverse neighbourhood as representatives and marks them and their neighbourhoods as covered. The algorithm stops when the relaxed coverage requirement (3) is met (see Section 3).
The whole algorithm is summarized in Algorithm 1.
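The greedy selection step can be sketched as below. This is an illustration rather than a reproduction of Algorithm 1: the function name is hypothetical, the graph is assumed to be a dictionary from node index to its reverse neighbourhood, and the sketch ranks candidates by the number of not yet covered nodes they would add, which is one possible reading of "largest reverse neighbourhood".

```python
def select_representatives(reverse, coverage=0.95):
    """Greedy selection of representatives from a reverse k-NN graph.

    reverse: dict mapping a node index to the set of nodes it covers.
    coverage: required fraction of covered nodes (relaxed requirement).
    Returns the list of selected node indices in selection order.
    """
    n = len(reverse)
    covered, reps = set(), []
    while len(covered) < coverage * n:
        def gain(v):
            # Newly covered nodes if v is selected (v covers itself too).
            return len(({v} | reverse[v]) - covered)

        # max() keeps the first maximum, so ties go to the lowest index.
        best = max((v for v in sorted(reverse) if v not in reps), key=gain)
        reps.append(best)
        covered |= {best} | reverse[best]
    return reps


graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {4}, 4: {3}}
print(select_representatives(graph))                # full prototype
print(select_representatives(graph, coverage=0.5))  # relaxed coverage
```

Lowering the coverage parameter lets the loop stop before small, isolated groups are given their own representative.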
An example of a cluster prototype selected by the CRS algorithm is presented in Figure 2.
4.2 Discussion on parameters
This subsection summarizes the parameters of the CRS method.
$k$: number of neighbors for the $k$-NN graph creation. When $k$ is high, each object covers more neighbours, but the number of pairwise similarity calculations also increases. This trade-off is illustrated for different values of $k$ in Figure 3. Due to the large impact of this parameter on the properties of the produced representations and on computational requirements, we study its behaviour in more detail in a dedicated experiment in Section 5.
$\epsilon$: coverage parameter for the relaxed coverage requirement as introduced in Section 3. In this work, we set it to 0.95 such that the vast majority of each cluster is still covered but outliers do not influence the prototypes.
$\tau$: threshold on weights, determining which edges will be kept in the graph (see Section 4.1). By default, it is automatically set to the value of homogeneity of the cluster.
Additionally, the NN-Descent algorithm, used within the CRS method, has two more parameters that specify its behaviour during the $k$-NN graph creation. First, the parameter $\delta$ is used for early termination of the NN-Descent algorithm when the number of changes in the constructed graph is minimal. We set it to 0.001, as suggested by the authors of the original work [dong2011efficient]. Second, the sample rate $\rho$ controls the number of reverse neighbours to be explored in each iteration of NN-Descent. We set it to 0.7 to speed up the $k$-NN graph creation while not diverging too far from the optimal solution, in accordance with the suggestions published in [dong2011efficient].
5 Experiments
This section presents an experimental evaluation of the CRS algorithm on multiple datasets from very different domains that cover computer networks, text document processing and image classification. In the first experiment, we study the influence of the parameter $k$ (which determines the number of nearest neighbors used for building the underlying $k$-NN graph). Next, we compare the CRS method to the state-of-the-art techniques DS3 [elhamifar2015dissimilarity] and $\delta$-medoids [liebman2015representative] in the nearest prototype classification task on different datasets.
We set $\tau$ as an approximate homogeneity calculated from a random 5% of the cluster. We use $\tau$ as $\delta$ for the $\delta$-medoids algorithm. This makes the most sense for a comparison with CRS, because CRS also restricts the similarity by $\tau$ in the reverse graph creation. The best results for DS3 we obtained with and , while creating the full similarity matrix for the entire cluster. Finally, the parameters for CRS are discussed in Section 4.2. The following experiment explores the impact of $k$ in greater detail.
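The estimation of homogeneity from a random subsample can be sketched as follows. The definition of homogeneity is not restated in this section, so the sketch assumes it is the mean pairwise similarity within the cluster. The function name, the `fraction` and `seed` parameters and the minimum sample size of two are choices made for this example only.

```python
import random
from itertools import combinations


def approx_homogeneity(cluster, sim, fraction=0.05, seed=0):
    """Estimate cluster homogeneity from a random subsample.

    Assumption: homogeneity is the mean pairwise similarity of members.
    The cluster must contain at least two objects.
    """
    rng = random.Random(seed)
    size = min(len(cluster), max(2, round(fraction * len(cluster))))
    sample = rng.sample(cluster, size)
    pairs = list(combinations(sample, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


# Toy cluster of 200 values on a line, with a bounded similarity.
cluster = [i / 100 for i in range(200)]
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
tau = approx_homogeneity(cluster, sim)
print(tau)
```

With a 5% sample of a cluster of size $n$, the estimate needs roughly $(0.05n)^2/2$ similarity evaluations instead of $n^2/2$.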
5.1 Impact of $k$
When building cluster prototypes by the CRS method, the number of nearest neighbors considered for building the $k$-NN graph (specified by the parameter $k$) plays a very important role. With small values of $k$, each object represents only a few of its neighbors that are most similar to it. However, this also increases the number of representatives needed to sufficiently cover the cluster. On the other hand, higher values of $k$ produce smaller prototypes, as each representative is able to cover more objects. Nonetheless, this comes at the cost of an increased computational burden, because the cost of the $k$-NN graph creation increases rapidly with higher values of $k$. These trends can be observed in Figure 4, which shows the classification precision, the sizes of the created prototypes and the numbers of similarity function evaluations depending on $k$ for several clusters that differ in their homogeneity and sizes. We can see the changing trade-off between computational requirements (blue line) and memory requirements (red line) as $k$ increases. However, this is mostly without significant impact on classification precision. The parameter $k$ can therefore be set according to the preferred computational requirements without significantly decreasing the classification performance.
5.2 Datasets
In this section we briefly describe the three datasets used in the following subsections for the experimental comparison of individual methods.
5.2.1 MNIST Fashion
The MNIST Fashion [xiao2017/online] is a well-established dataset for image recognition consisting of 60000 grayscale images of fashion items belonging to 10 classes. It recently replaced the overused handwritten digits datasets in many benchmarks. For this dataset, the cosine similarity was used as the similarity function.
5.2.2 20 Newsgroup
This dataset is a well-known benchmark for text document processing. It is composed of nearly 20 thousand newsgroup documents from 20 different classes (topics). The dataset was preprocessed such that each document is represented by a TF-IDF vector. As a similarity function, we again use the cosine similarity, which is a common choice in the domain of text document processing.
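For reference, the cosine similarity of two sparse term-weight vectors can be computed as in the sketch below. The dictionary representation of a TF-IDF vector and the function name are assumptions made for this example; the preprocessing pipeline used in the experiments is not described here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for empty documents
    return dot / (norm_u * norm_v)


doc_a = {"network": 0.8, "host": 0.6}
doc_b = {"network": 0.4, "host": 0.3}
doc_c = {"fashion": 1.0}

print(cosine_similarity(doc_a, doc_b))  # same direction
print(cosine_similarity(doc_a, doc_c))  # no shared terms
```

Because only shared terms contribute to the dot product, the cost of one comparison depends on the number of non-zero entries rather than on the vocabulary size.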
5.2.3 Private Network Dataset
This dataset was collected on a corporate computer network, originally for the purpose of network host clustering based on their behavior [kopp2018community]. The work defines a specific similarity measure on top of network hosts which we adopt for this paper. Clusters of network hosts were defined according to results achieved in the original work as well. Additionally, for the purposes of the evaluation, clusters smaller than 10 members were not considered, since such small clusters can be easily represented by any method. In contrast to the previous datasets, the sizes and values of homogeneity of clusters in the Network dataset differ significantly.
5.3 Evaluation of Results
In this section we present the results for each dataset in detail. The main results are summarized in Table 1. For a more complete picture we also included results for a random 5% and all 100% of the cluster as a prototype. The statistical comparison of the methods can be found in Figure 5. The better rankings of some of the CRS variants reflect that CRS only covers 95% of each cluster (see Section 4.2), which removes the outliers that decrease the precision and recall of the full cluster representation.
Table 1: Precision/recall (percentage of samples selected as prototypes) for each method and dataset.
|method||MNIST Fashion||20 Newsgroup||Network|
|$\delta$-medoids||0.763/0.744 (4.73%)||0.542/0.515 (14.51%)||0.865/0.978 (7.38%)|
|DS3||0.657/0.563 (0.07%)||0.133/0.132 (0.64%)||0.87/0.977 (1.88%)|
|random-5%||0.793/0.784 (5.0%)||0.452/0.435 (5.06%)||0.958/0.97 (5.09%)|
|full-100%||0.823/0.817 (100.0%)||0.56/0.548 (100.0%)||0.987/0.963 (100.0%)|
|CRS-k5||0.855/0.852 (87.71%)||0.635/0.632 (56.58%)||0.988/0.958 (65.31%)|
|CRS-k10||0.836/0.826 (15.37%)||0.538/0.516 (7.14%)||0.985/0.965 (7.74%)|
|CRS-k15||0.828/0.823 (5.08%)||0.522/0.488 (5.17%)||0.983/0.982 (5.26%)|
When evaluating the experiments, we take into account both precision/recall and the percentage of samples selected as prototypes. As we have shown in the experiment in Section 5.1, CRS can be tuned by the parameter $k$ to significantly reduce the number of representatives while maintaining high precision/recall values. When using the full cluster as its prototype, the average values of precision and recall are slightly lower than when using the CRS method. This suggests that CRS with the relaxed coverage requirement makes the classification robust to outliers, which can otherwise decrease the classification quality. The DS3 method selects a significantly lower number of representatives than any other method. However, this comes at the cost of lower precision and recall values.
Runtimes of individual algorithms also differ significantly. We evaluate the runtime requirements of each algorithm by the relative number of similarity computations $S$, defined as:

$$S = \frac{S_{\mathrm{actual}}}{S_{\mathrm{full}}},$$

where $S_{\mathrm{actual}}$ stands for the actual number of comparisons made and $S_{\mathrm{full}}$ is the hypothetical number needed for computing the full similarity matrix.
We use DS3 with the full similarity matrix to get the most accurate results, therefore $S = 1$. For $\delta$-medoids, the number of computations performed cannot be easily preset. Therefore, we averaged the number of comparisons over different values of $\delta$ (the number might differ significantly for different values of $\delta$). For CRS, the number of comparisons is influenced by $k$, the homogeneity of each cluster and its size. The impact of $k$ was discussed in detail in Section 5.1. The experiment shows that one can make assumptions about $S$ based on the size of the cluster and $k$ (i.e. for MNIST Fashion dataset , ).
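The relative number of similarity computations can be measured by wrapping the similarity function in a counter. The sketch below is illustrative: the class and function names are hypothetical, and it assumes that the full similarity matrix of $n$ objects requires $n(n-1)/2$ evaluations (symmetric similarity, no self-comparisons), which may differ from the convention used in the experiments.

```python
from itertools import combinations


class CountingSimilarity:
    """Wraps a similarity function and counts how often it is called."""

    def __init__(self, sim):
        self.sim = sim
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return self.sim(a, b)


def relative_computations(actual_calls, n):
    """Ratio of performed comparisons to those of a full similarity matrix.

    Assumption: the full matrix needs n * (n - 1) / 2 evaluations.
    """
    full = n * (n - 1) // 2
    return actual_calls / full


points = [0, 1, 2, 50, 51]
counted = CountingSimilarity(lambda a, b: -abs(a - b))

# Example: compare only the first object with all the others.
for other in points[1:]:
    counted(points[0], other)
print(relative_computations(counted.calls, len(points)))  # 4 of 10 pairs

# Computing every pair corresponds to a ratio of one.
full = CountingSimilarity(lambda a, b: -abs(a - b))
for a, b in combinations(points, 2):
    full(a, b)
print(relative_computations(full.calls, len(points)))
```

Passing the wrapped function to any selection method gives a count that is independent of hardware and implementation language.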
5.3.1 MNIST Fashion
The average homogeneity of a cluster in the MNIST Fashion dataset is 0.76. This corresponds with a slower decline of the precision and recall values as the number of representatives decreases. As Figure 6 shows, the steep decline in the number of selected representatives only slightly decreases the precision and recall for each cluster. In the case of the Dress cluster it even slightly increases from to . Figure 7 shows the confusion matrices for the methods.
5.3.2 20 Newsgroup
The 20Newsgroup dataset has the lowest average homogeneity of all the datasets. The samples are less similar on average, hence the lower precision and recall values. This is also reflected in the percentage of objects selected as representatives by the $\delta$-medoids algorithm. Confusion matrices for one cluster from each subgroup are in Figure 8.
5.3.3 Network Dataset
The results for data collected in a real network further support the assumptions made in Section 5.1.
Figure 9 shows a comparison of individual methods by means of precision and recall for selected large clusters, as well as the impact of different values of $k$. The depicted clusters were chosen for their sizes and different homogeneity, see Table 2.
Increasing $k$ significantly reduces the percentage of the dataset selected for its representation while still retaining high precision and recall values. The results on the network dataset are very important because it is a non-metric dataset, where only a pairwise similarity is defined.
6 Conclusion
This paper proposed a new method called CRS for building representations of clusters, called cluster prototypes, which are small subsets of the original clusters. CRS leverages nearest neighbor graphs to map the structure of each cluster and to identify the most important representatives that will form the cluster prototype. Thanks to this approach, CRS can be equally applied in both metric and non-metric spaces. The proposed method was compared to the prior art in a nearest prototype classification setup on multiple datasets from different domains. The experimental results show that the CRS method achieves superior classification quality while producing comparably compact representations of clusters.