Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis

22 October 2019 ∙ by Andrew Lensen, et al.

Clustering is a difficult and widely-studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g. Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally pre-defined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this paper, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.

1 Introduction

Clustering is a fundamental data mining task (Fayyad et al., 1996), which aims to group related/similar instances into a number of clusters where the data is unlabelled. It is one of the key tasks in exploratory data analysis, as it enables data scientists to reveal the underlying structure of unfamiliar data, which can then be used for further analysis (Jain, 2010).

Nearly all clustering algorithms utilise a similarity measure, usually a distance function, to perform clustering as close instances are similar to each other, and expected to be in the same cluster. The most common distance functions, such as Manhattan or Euclidean distance, are quite inflexible: they consider all features equally despite features often varying significantly in their usefulness. Consider a weather dataset of daily records which contains the two following features: day of week and rainfall (mm). Clusters generated using the rainfall feature will give insight into what days are likely to be rainy, which may allow better prediction of whether we should take an umbrella in the future. Clusters generated with the day of week feature however are likely to give little insight or be misleading — intuitively, we know that the day of the week has no effect on long-term weather patterns and so any clusters produced could mislead us. Ideally, we would like to perform feature selection to select only the most useful features in a dataset. These distance functions also have uniform behaviour across a whole dataset, which makes them struggle with common problems such as clusters of varying density or separation, and noisy data. Indeed, trialling a range of similarity measures is commonly a tedious but necessary parameter tuning step when performing cluster analysis. Feature construction is the technique of automatically combining existing low-level features into more powerful high-level features. Feature construction could produce similarity measures which are better fitted to a given dataset, by combining features in a non-uniform and flexible manner. These feature reduction techniques have been shown to be widely effective in both supervised and unsupervised domains (Xue et al., 2016; Aggarwal and Reddy, 2014).

A variety of representations have been proposed for modelling clustering solutions. The graph representation models the data in an intuitive way, where instances (represented as nodes) are connected by an edge if they are similar enough (von Luxburg, 2007). This is a powerful representation that allows modelling a variety of cluster shapes, sizes, and densities, unlike the more common prototype-based representations such as k-means. However, algorithms using graph representations are very dependent on the criterion used to select edges (von Luxburg, 2007). One of the most common criteria is to simply use a fixed threshold (von Luxburg, 2007), which indicates the distance at which two instances are considered too dissimilar to share an edge. Such a threshold must be determined independently for every dataset, and this approach typically does not allow varying thresholds to be used in different clusters. Another popular criterion is to put an edge between each instance and its N nearest neighbours (von Luxburg, 2007), where N is a small integer value such as 2, 3, or 4. N must also be determined before running the algorithm, with results being very sensitive to the value chosen. Again, this method does not allow for varied cluster structure.
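As a concrete illustration of the second criterion, the following sketch connects each instance to its N nearest Euclidean neighbours and treats each connected component of the resulting graph as a cluster. The function name and the choice of N are illustrative only, not part of any of the cited methods.

import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import kneighbors_graph

def naive_graph_clustering(X, n_neighbours=3):
    # Link each instance to its n_neighbours nearest neighbours (Euclidean distance).
    adjacency = kneighbors_graph(X, n_neighbors=n_neighbours, mode="connectivity")
    # Treat the edges as undirected and take each connected component as one cluster.
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels

# Example usage on random data; in practice X would be a scaled dataset.
X = np.random.rand(200, 10)
n_clusters, labels = naive_graph_clustering(X, n_neighbours=3)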

Many of the above issues can be tackled by using a similarity function which can be automatically tailored to a specific dataset, and which can treat different clusters within a dataset differently. Genetic programming (GP) (Koza, 1992) is an evolutionary computation (EC) (Eiben and Smith, 2015) method that automatically evolves programs. The most common form of GP is tree-based GP, which models solutions in the form of a tree that takes input (such as a feature set) and produces an output, based on the functions performed within the tree. We hypothesise that this approach can be used to automatically evolve similarity functions that are represented as GP trees, where a tree takes two instances as input and produces an output that corresponds to how similar those two instances are. By automatically combining only the most relevant features (i.e. performing feature selection and construction), more powerful and specific similarity functions can be generated to improve clustering performance on a range of datasets. GP can also use multiple trees to represent each individual/solution. Multi-tree GP has the potential to automatically generate multiple complementary similarity functions, which are able to specialise on different clusters in the dataset. To the best of our knowledge, such an approach has not been investigated to date.

1.1 Goals

This work aims to propose the first approach to using GP for automatically evolving similarity functions with a graph representation for clustering (GPGC). This work is expected to improve clustering performance while also producing more interpretable clusters which use only a small subset of the full feature set. We will investigate:

  • how the output of an evolved similarity function can be used to create edges in clustering with a graph representation;

  • what fitness function should be used to give high-quality clustering results;

  • whether using multiple similarity functions to make a consensus decision can further improve clustering results; and

  • whether the evolved similarity functions are more interpretable and produce simpler clusters than standard distance functions.

A piece of preliminary work was presented in our previous research (Lensen et al., 2017a), which proposed evolving a single similarity function with a single-tree approach for clustering. This paper extends that preliminary work significantly by providing a more detailed and systematic description and justification, introducing a multi-tree approach, and presenting much more rigorous comparisons to existing techniques along with more detailed analysis of the proposed method.

2 Background

2.1 Clustering

A huge variety of approaches have been proposed for performing clustering (Xu and Wunsch, 2005; Aggarwal and Reddy, 2014), which can be broadly categorised into hard, soft (fuzzy), or hierarchical clustering methods. In hard and soft clustering, each instance belongs to exactly one or to at least one cluster respectively. In contrast, hierarchical clustering methods build a hierarchy of clusters, where a parent cluster contains the union of its child clusters. The majority of work has focused on hard clustering, as partitions where each instance is in exactly one cluster tend to be easier to interpret and analyse. A number of distinct models have been proposed for performing hard clustering: prototype-based models (including the most famous clustering method, k-means (J. A. Hartigan, 1979), and its successor k-means++ (Arthur and Vassilvitskii, 2007)), density-based models (e.g. DBSCAN (Ester et al., 1996) and OPTICS (Ankerst et al., 1999)), graph-based models (e.g. the Highly Connected Subgraph (HCS) algorithm (Hartuv and Shamir, 2000)), and statistical approaches such as distribution-based (e.g. EM clustering) and kernel-based models. Prototype-based models produce a number of prototypes, each of which corresponds to a unique cluster, and then assign each instance to its nearest prototype using a distance function, such as Euclidean distance. While these models are the most popular, they are inherently limited by their use of prototypes to define clusters: when clusters are naturally non-hyper-spherical in shape, prototype-based models will tend to perform poorly, as minimising the distance of instances to the prototype encourages spherical clusters. This problem is exacerbated when a cluster is non-convex.

Graph-based clustering algorithms (von Luxburg, 2007) represent clusters as distinct graphs, where there is a path between every pair of instances in a cluster graph. This representation means that graph-based measures are not restricted to clusters with hyper-spherical or convex shapes. The HCS algorithm (Hartuv and Shamir, 2000) uses a similarity graph which connects instances sharing a similarity value (e.g. distance) above a certain threshold, and then iteratively splits graphs which are not highly connected by finding the minimum cut, until all graphs are highly connected. Choosing a good threshold value in HCS can be difficult when there is no prior knowledge of the data.

EC techniques have also been applied to clustering successfully (Lorena and Furtado, 2001; Picarougne et al., 2007; Nanda and Panda, 2014; García and Gómez-Flores, 2016; Sheng et al., 2016), with many genetic algorithm (GA) and particle swarm optimisation (PSO) techniques used to automatically evolve clusters. Again, the majority of the literature tends to use prototype-based models, and little work uses feature reduction techniques to improve the performance of clustering methods and to produce more interpretable clusters. There is a notable deficit of methods using GP for clustering, and no current methods, aside from our preliminary work (Lensen et al., 2017a), use GP to automatically evolve similarity functions. Relevant EC clustering methods are discussed further in the related work section.

2.2 Feature Reduction

Feature reduction is a common strategy used to improve the performance of data mining algorithms and the interpretability of the models or solutions produced (Liu and Motoda, 2012). The most common feature reduction strategy is feature selection, where a subset of the original feature set is selected for use in the data mining algorithm. Using fewer features can decrease training time, produce more concise and understandable results, and even improve performance by removing irrelevant/misleading features or reducing over-fitting (in supervised learning). Feature selection has been extensively studied on a range of problems, such as classification (Tang et al., 2014) and clustering (Alelyani et al., 2013). Feature construction, another feature reduction strategy, focuses on automatically producing new high-level features, which combine multiple features from the original feature set in order to produce more powerful constructed features (CFs). As with feature selection, the use of feature construction can improve performance and interpretability by automatically combining useful features.

Research into the use of EC techniques for performing feature reduction has become much more popular during the last decade, due to the ability of EC techniques to efficiently search a large feature set space. Feature selection has been widely studied using PSO and GAs (García-Pedrajas et al., 2014; Xue et al., 2016), and GP has emerged as a powerful feature construction technique due to its tree representation allowing features to be combined in a hierarchical manner using a variety of functions (Espejo et al., 2010; Neshatian et al., 2012). Despite this, the use of EC for feature reduction in clustering tasks has thus far been relatively unexplored. Given that clustering is an unsupervised learning task with a huge search space, especially when there are many instances, features, or clusters, good feature reduction methods for clustering are needed.

2.3 Subspace Clustering

Another approach for performing feature reduction in clustering tasks is subspace clustering (Liu and Yu, 2005; Müller et al., 2009), where each cluster is located in a subspace of the data, i.e. it uses only a subset of the features. In this regard, each cluster is able to correspond to a specific set of features that are used to characterise that cluster, which has the potential to produce better-fitted and more interpretable clusters. Several EC methods have been proposed for performing subspace clustering (Vahdat and Heywood, 2014; Peignier et al., 2015). However, subspace clustering intrinsically has an even larger search space than normal clustering, as the quantity and choice of features must be made for every cluster, rather than only once for the dataset (Parsons et al., 2004). In this paper, we do not strictly perform subspace clustering, but rather we allow the proposed approach to use different features in different combinations across the cluster space.

2.4 Related Work

There has been a handful of work proposed that uses a graph-based representation in conjunction with EC for performing clustering. One notable example is the MOCK algorithm (Handl and Knowles, 2007), which uses a GA with a locus-based graph approach to perform multi-objective clustering. Another GA method has also been proposed, which takes inspiration from spectral clustering and uses either a label-based or medoid-based encoding to cluster the similarity graph (Menéndez et al., 2014).

The use of GP for performing clustering is very sparse in the literature, with only about half a dozen pieces of work proposed. One early work uses a grammar-based approach (Falco et al., 2005), where a grammar is evolved to assign instances to clusters based on how well they match the evolved formulae. Instances that do not match any formulae are assigned to the closest centroid. This assignment technique, and the fitness function used, mean that the proposed method is biased towards hyper-spherical clustering. Boric et al. (Boric and Estévez, 2007) proposed a multi-tree representation, where each tree in an individual corresponds to a single cluster. This method requires the number of trees to be set in advance, i.e. the number of clusters K must be known a priori, which may not be available in many cases. A single-tree approach has also been proposed (Ahn et al., 2011), which uses integer rounding to assign each instance to a cluster based on the output of the evolved tree. Such an approach is unlikely to work well on datasets with a relatively high K, and it produces a difficult search space due to clusters having an implicit order. Another proposed approach (Coelho et al., 2011) uses GP to automatically build consensus functions that combine the output of a range of clustering algorithms to produce a fusion partition. This is suggested to combine the benefits of each of the clustering algorithms while avoiding their limitations. Each of the clustering algorithms uses a fixed distance function to measure similarity between instances, and several of the algorithms require that K is known in advance. More recently, a GP approach has been proposed based on the idea of novelty search (Naredo and Trujillo, 2013), where, in lieu of an explicit fitness function, the uniqueness (novelty) of a solution in the behavioural landscape is used to determine whether it is used in subsequent generations. This approach was only tested on problems with two clusters, and it is unclear how it would scale as K increases, given that the behavioural landscape would become exponentially larger.

While there has been very little work utilising feature construction techniques for improving the performance of clustering, there has been a significant amount of study into using feature selection for clustering problems (Dy and Brodley, 2004; Alelyani et al., 2013), with dimensionality reduction approaches such as Principal Component Analysis (PCA) (Jolliffe, 2011) being used with both EC (Kuo et al., 2012) and non-EC clustering methods. In addition, there has been some work using EC to perform simultaneous clustering and feature selection, with the aim of concurrently tailoring the features selected to the clusters produced. PSO, in particular, has been shown to be effective on this task (Sheng et al., 2008; Lensen et al., 2017b).

The clustering literature has an overwhelming focus on producing novel clustering algorithms which employ a wide range of techniques for modelling and searching the clustering problem space. However, there has been very little focus on new techniques for automatically creating more appropriate and more powerful similarity measures to accurately model the relationships between instances in a specific dataset. GP, with its intrinsic function-like solution structure, is a natural candidate for automatically evolving similarity functions tailored to the data it is trained on. GP, and EC methods in general, have been shown to be effective on datasets with many instances and high dimensionality; GP therefore has the potential to evolve smaller, more refined, and more interpretable similarity functions on very big datasets. This paper investigates the capability of GP for automatically constructing powerful similarity functions.

3 Proposed Approaches

An overview of the proposed GPGC algorithm is shown in Fig. 1. We discuss the different parts of this overall algorithm in the following subsections.

Figure 1: The overall flow of the proposed GPGC algorithm. The clustering process is discussed in detail in Section 3.2, and is shown in Algorithm 1.

3.1 GP Program Design

To represent a similarity function, a GP tree must take two instances as input and produce a single floating-point output corresponding to the similarity of the two instances. Therefore, we define the terminal set as all feature values of both instances, so that for m features there are 2m feature terminals (each of the m features of the first instance and of the second instance), as well as a random floating-point value (for scaling purposes). The function set comprises the normal arithmetic functions (+, −, ×, and protected ÷), two absolute-value arithmetic functions (absolute sum and absolute difference), and the max, min, and if operators. All of these functions aside from if take two inputs and output a single value which is the result of applying that function. The if function takes three inputs and outputs the second input if the first is positive, or the third input if it is not. We include the max, min, and if functions to allow conditional behaviour within a program, in order to allow the creation of similarity functions which operate differently across the feature space. The ÷ operator is protected division: if the divisor (the second input) is zero, the operator returns a value of one. An example of a similarity function using this GP program design is shown in Fig. 2.
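A minimal sketch of how such a tree can be evaluated is given below. The node behaviours mirror the description above (arithmetic, absolute arithmetic, max, min, if, and protected division), but the nested-tuple tree encoding and the terminal naming are our own illustrative assumptions rather than the actual implementation.

import operator

def protected_div(a, b):
    # Protected division: return 1 when the divisor is zero.
    return a / b if b != 0 else 1.0

def if_positive(cond, then_val, else_val):
    # 'if' node: second input when the first is positive, third otherwise.
    return then_val if cond > 0 else else_val

FUNCTIONS = {
    "+": operator.add, "-": operator.sub, "*": operator.mul, "/": protected_div,
    "abs+": lambda a, b: abs(a + b), "abs-": lambda a, b: abs(a - b),
    "max": max, "min": min, "if": if_positive,
}

def evaluate(node, inst_a, inst_b):
    # Evaluate a similarity tree on two instances (sequences of feature values).
    # Terminals: ("a", i) / ("b", i) select feature i of each instance; floats are constants.
    if isinstance(node, tuple) and node[0] in ("a", "b"):
        return (inst_a if node[0] == "a" else inst_b)[node[1]]
    if isinstance(node, (int, float)):
        return float(node)
    func, *children = node
    return FUNCTIONS[func](*(evaluate(c, inst_a, inst_b) for c in children))

# e.g. similarity = max(|f0_a - f0_b|, 0.5 * f2_a), an arbitrary illustrative tree.
tree = ("max", ("abs-", ("a", 0), ("b", 0)), ("*", 0.5, ("a", 2)))
score = evaluate(tree, [0.1, 0.4, 0.9], [0.3, 0.2, 0.8])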

Figure 2: An example of an evolved similarity function.

3.2 Clustering Process

As we are using a graph representation, every pair of instances which are deemed “close enough” by an evolved GP tree should be connected by an edge. As discussed before, we would like to refrain from using a fixed similarity threshold as varying thresholds may be required across a dataset due to varying cluster density. We therefore use the approach where each instance is connected to a number of its most similar neighbours (according to the evolved similarity function).

Finding the most similar neighbour of a given instance under an evolved similarity function requires comparing the instance to every other instance in the dataset. Normally, when using a fixed distance metric, these pairwise similarities could be precomputed; however, in the proposed algorithm they must be computed separately for every evolved similarity function, giving on the order of n² comparisons for every GP individual in every generation of the training process, given n instances. In order to reduce the computational cost, we use a heuristic whereby each instance is only compared to its t nearest neighbours based on Euclidean distance. The set of nearest neighbours can be computed once at the start of the EC training process, meaning only n × t comparisons are required per GP individual. By using this approach, we balance the flexibility of allowing an instance to be connected to many different neighbours with the efficiency of comparing only to a subset of neighbours. As we use Euclidean distance only to give us the order of neighbours, the problems associated with Euclidean distance at high dimensions should not occur. We found in practice that setting t proportionally to n, while ensuring t is at least 2, gave a good neighbourhood size (Lensen et al., 2017b).
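The neighbour pre-computation can be sketched as follows; the exact rule for setting t is not reproduced here (only that it grows with n and is at least 2), so the setting shown is purely illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def precompute_neighbours(X, t):
    # For each instance, find the indices of its t nearest neighbours by Euclidean
    # distance, excluding the instance itself (which is always its own closest point).
    nn = NearestNeighbors(n_neighbors=t + 1, metric="euclidean").fit(X)
    _, indices = nn.kneighbors(X)
    return indices[:, 1:]

X = np.random.rand(500, 20)
t = max(2, len(X) // 100)  # illustrative only: grows with n and is never below 2
neighbour_indices = precompute_neighbours(X, t)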

Algorithm 1 shows the steps used to produce a cluster partition for a given GP individual P. For each instance x in the dataset, the t nearest neighbours of x are found using the pre-computed Euclidean distance mappings. Each of these neighbours is then fed into the tree along with x. The tree is then evaluated, and produces an output corresponding to the similarity between x and that neighbour. The neighbour with the highest similarity is chosen, and an edge is added between it and x. As in (Lensen et al., 2017a), we tested adding edges to more than one nearest neighbour, but found that performance tended to drop. Once this process has been completed for each instance, the set of edges formed will give a set of graphs, where each graph represents a single cluster. These graphs can then be converted to a set of clusters by assigning all instances in each graph to the same cluster.

1  edges ← ∅;
2  for each instance x in the dataset do   ▹ Choose edge
3      neighbours ← the t pre-computed nearest neighbours of x;
4      bestNeighbour ← null;
5      bestSimilarity ← −∞;
6      for each y in neighbours do   ▹ Test neighbour
7          sim ← the output of tree P applied to (x, y);
8          if sim > bestSimilarity then
9              bestSimilarity ← sim; bestNeighbour ← y;
10         end if
11     end for
12     add edge from x to bestNeighbour to edges;
13 end for
14 return the clusters given by the connected graphs formed by edges;
Algorithm 1: Process to produce a cluster partition using a given GP individual (P) and the number of neighbours (t).
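A runnable sketch of Algorithm 1 is given below. It assumes the evolved similarity function is available as a Python callable similarity(x, y) (a simple stand-in is used in the example) and that neighbour_indices holds the pre-computed t nearest neighbours from the sketch above; all names are illustrative.

import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def cluster_with_similarity(X, neighbour_indices, similarity):
    # Connect each instance to its most similar pre-computed neighbour (according
    # to the evolved similarity function), then take the connected components of
    # the resulting graph as the clusters.
    n = len(X)
    rows, cols = [], []
    for i in range(n):
        scores = [similarity(X[i], X[j]) for j in neighbour_indices[i]]
        best = neighbour_indices[i][int(np.argmax(scores))]
        rows.append(i)
        cols.append(int(best))
    adjacency = coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels

# Example usage, reusing X and neighbour_indices from the previous sketch, with a
# stand-in similarity (negative Euclidean distance) in place of an evolved tree.
stand_in_similarity = lambda a, b: -np.linalg.norm(a - b)
n_clusters, labels = cluster_with_similarity(X, neighbour_indices, stand_in_similarity)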

3.3 Fitness Function

The most common measures of cluster quality are cluster compactness and separability (Aggarwal and Reddy, 2014). A good cluster partition should have distinct clusters which are very dense in terms of the instances they contain, and which are far away from other clusters. A third, somewhat less common measure, is the instance connectedness, which measures how well a given instance lies in the same cluster as its nearby neighbours (Handl and Knowles, 2007). The majority of the clustering literature measures performance in a way that implicitly encourages hyper-spherical clusters to be produced, by minimising each instance’s distance to its cluster mean, and maximising the distance between different cluster means. Such an approach is problematic, as it introduces bias in the shape of clusters produced, meaning elliptical or other non-spherical clusters are unlikely to be found correctly.

As a graph representation is capable of modelling a variety of cluster shapes, we instead propose using a fitness function which balances these three measures of cluster quality in a way that gives minimal bias to the shape of clusters produced. We discuss each of these in turn below:

Compactness

To measure the compactness of a cluster, we choose the instance in the cluster which is the furthest away from its nearest neighbour in the same cluster; that is, the instance which is the most isolated within the cluster. The distance between that instance and its nearest neighbour, called the sparsity of the cluster, should be minimised. We define sparsity in Equation (1), where C_i represents the i-th of the K clusters, I_a represents an instance in that cluster, and d(I_a, I_b) is the Euclidean distance between two instances.

\text{sparsity}(C_i) = \max_{I_a \in C_i} \; \min_{I_b \in C_i,\, I_b \neq I_a} d(I_a, I_b) \qquad (1)
Separability

To measure the separation of a cluster, we find the minimum distance from that cluster to any other cluster. This is equivalent to finding the minimum distance between the instances in the cluster and all other instances in the dataset which are not in the same cluster, as shown in Equation (2). The separation of a cluster should be maximised to ensure that it is distinct from other clusters.

\text{separation}(C_i) = \min_{I_a \in C_i,\ I_b \notin C_i} d(I_a, I_b) \qquad (2)
Connectedness

An instance’s connectedness is measured by finding how many of its n nearest neighbours are assigned to the same cluster as it, with higher weighting given to neighbours which are closer to the given instance, as shown in Equation (3). To prevent connectedness from encouraging spherical clusters, n must be chosen to be adequately small; otherwise, large cluster “blobs” will form. We found that a small value of n provided a good balance between producing connected instances and allowing varying cluster shapes. The mean connectedness across the dataset should be maximised.

\text{connectedness}(I_a) = \sum_{I_b \in NN_n(I_a)} \text{same}(I_a, I_b) \times w(I_a, I_b) \qquad (3)

where NN_n(I_a) gives the n nearest neighbours of I_a, same(I_a, I_b) indicates whether I_a and I_b are (correctly) in the same cluster, and the weight w(I_a, I_b) is defined as

w(I_a, I_b) = \min\!\left(\frac{1}{d(I_a, I_b)},\ 10\right) \qquad (4)

The inverse distance between two instances is capped at 10, to prevent very close instances from overly affecting the fitness measure. Inverse distance is used to weight closer neighbours more highly.

Our proposed fitness function is a combination of these three measures (Equations (1)–(3)): we find each cluster’s ratio of sparsity to separation (as they are competing objectives), as shown in Equation (5), and then measure the partition’s fitness by also considering the connectedness, as shown in Equation (6). This fitness function should be maximised.

\text{ratio}(C_i) = \frac{\text{sparsity}(C_i)}{\text{separation}(C_i)} \qquad (5)
(6)
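The three measures can be sketched in Python as follows. This is only a minimal illustration of Equations (1)–(4) with our own function names; neighbour_indices is assumed to hold each instance's n nearest neighbours as used by the fitness function, and the exact combination into the final fitness of Equations (5)–(6) is not reproduced.

import numpy as np
from scipy.spatial.distance import cdist

def sparsity(cluster_points):
    # Eq. (1): the largest nearest-neighbour distance within the cluster,
    # i.e. how isolated the most isolated instance of the cluster is.
    d = cdist(cluster_points, cluster_points)
    np.fill_diagonal(d, np.inf)  # ignore self-distances
    return d.min(axis=1).max()

def separation(cluster_points, other_points):
    # Eq. (2): the smallest distance from the cluster to any instance outside it.
    return cdist(cluster_points, other_points).min()

def connectedness(i, X, labels, neighbour_indices, cap=10.0):
    # Eqs. (3)-(4): weighted count of instance i's nearest neighbours that share
    # its cluster, weighting closer neighbours more highly (capped inverse distance).
    total = 0.0
    for j in neighbour_indices[i]:
        if labels[j] == labels[i]:
            dist = np.linalg.norm(X[i] - X[j])
            total += cap if dist == 0 else min(1.0 / dist, cap)
    return total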

3.4 Using a Multi-Tree Approach

As previously discussed, using a single fixed similarity function means that every pair of instances across a dataset must be compared identically, i.e. with all features weighted equally regardless of the characteristics of the given instances. By using GP to automatically evolve similarity functions containing conditional nodes (max, min, and if), we are able to produce trees which measure similarity dynamically. However, a single tree is still limited in its flexibility, as there is an inherent trade-off between the number of conditional nodes used and the complexity of the constructed features in a tree: more conditionals will tend to mean simpler constructed features with fewer operators (and vice versa), due to the limitations on tree depth and training time.

To tackle these issues, while still maintaining reasonable tree depth and training time, we propose evolving a number of similarity functions concurrently. Using this approach, a pair of instances will be assigned a similarity score by each similarity function; these scores are then summed together to give a total measure of how similar the instances are.1

1 Of course, there are a variety of ways to combine similarity scores in the ensemble learning literature, such as taking the maximum output. We use the sum here to increase the stability of the joint similarity function, and to reduce the effect of outliers/edge cases. We hope to investigate this further in future work.

In this regard, each similarity function provides a measure of its confidence that two instances should lie in the same cluster, allowing different similarity functions to specialise on different parts of the dataset. This is implemented using GP with a multi-tree approach, where each GP individual contains not only one but multiple trees. An example of this structure with three trees is shown in Fig. 3. As all similarity functions are evolved concurrently in a single individual, a set of cohesive functions will be evolved that work well together, but that are not expected to be good similarity functions independently. In this way, a GP individual can be thought of as a meta-function. The core of the clustering process remains the same with this approach; the only change is that the most similar neighbour for a given instance is chosen based on the sum of similarities given by all trees in an individual. This change to the algorithm is shown in Algorithm 2.

Figure 3: An example of a multi-tree similarity function (three trees, (a)–(c)).
6  for each y in neighbours do   ▹ Test neighbour
7      sim ← 0;
8      for each tree T in individual P do   ▹ Each tree
9          sim ← sim + the output of T applied to (x, y);
10     end for
11     if sim > bestSimilarity then
12         bestSimilarity ← sim; bestNeighbour ← y;
13     end if
14 end for
Algorithm 2: Choosing the most similar neighbour to an instance (x) in the multi-tree approach for individual P; it replaces the inner loop (lines 6–11) of Algorithm 1.
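Under the multi-tree approach, the only change to the earlier clustering sketch is that a candidate neighbour's similarity is the sum over all trees. A hedged snippet is given below, where trees is assumed to be a list of callables, one per tree in the individual.

def multi_tree_similarity(trees, x, y):
    # Each tree contributes its own similarity score; the scores are summed to give
    # the individual's overall confidence that x and y belong in the same cluster.
    return sum(tree(x, y) for tree in trees)

# Plugging into the earlier sketch:
#   cluster_with_similarity(X, neighbour_indices,
#                           lambda a, b: multi_tree_similarity(trees, a, b))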

There are several factors that must be considered when extending the proposed algorithm to use a multi-tree approach: how to perform crossover when there are multiple trees to crossover between, and how many trees to use. These two factors will be discussed in the following paragraphs. A third consideration is the maximum tree depth — we use a smaller tree depth when multiple trees are used, as each tree is able to be more specialised and so does not require as many nodes to produce a good similarity function. Mutation is performed as normal, by randomly choosing a tree to mutate.

3.4.1 Crossover Strategy

In standard GP, crossover is performed by selecting two individuals, randomly selecting a subtree from each of these two individuals, and swapping the selected subtrees to produce new offspring. In multi-tree GP, a tree within each individual must also be selected. There are a number of possible methods for doing so (Haynes and Sen, 1997; Thomason and Soule, 2007), as discussed below:

Random-index crossover

The most obvious method is to randomly select a tree from each individual, which we term random-index crossover (RIC). This method may be problematic when applied to our proposed approach, as it reduces the ability of each tree to specialise, by exchanging information between trees which may have different “niches”.

Same-index crossover

An alternative method to avoid the limitations of RIC is to always pick two trees at the same index in each individual. For example, selecting the third tree in both individuals. This method, which we call same-index crossover (SIC), allows an individual to better develop a number of distinct trees while still encouraging co-operation between individuals through the crossover of related trees.

All-index crossover

The SIC method can be further extended by performing crossover between every pair of trees at the same index simultaneously, i.e. the i-th tree of one individual is crossed over with the i-th tree of the other, for all trees in the individuals. This approach, called all-index crossover (AIC), allows information exchange to occur more aggressively between individuals, which should increase training efficiency. However, it introduces the requirement that performing all of the crossovers together gives a net fitness increase, which may limit the exploitation of individual solutions during the EC process.

We will compare each of these crossover approaches to investigate which type of crossover is most appropriate.
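The three strategies differ only in which tree indices are paired for subtree crossover. The sketch below illustrates the index selection only (the subtree crossover itself is standard GP crossover and is not shown), using our own names.

import random

def crossover_indices(strategy, num_trees):
    # Return the (parent 1 tree index, parent 2 tree index) pairs that subtree
    # crossover will be applied to, for individuals containing num_trees trees.
    if strategy == "RIC":  # random-index: one randomly chosen tree from each parent
        return [(random.randrange(num_trees), random.randrange(num_trees))]
    if strategy == "SIC":  # same-index: one randomly chosen index, shared by both parents
        i = random.randrange(num_trees)
        return [(i, i)]
    if strategy == "AIC":  # all-index: every index is crossed over simultaneously
        return [(i, i) for i in range(num_trees)]
    raise ValueError(strategy)

# e.g. crossover_indices("AIC", 4) gives [(0, 0), (1, 1), (2, 2), (3, 3)]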

3.4.2 Number of Trees

The number of trees used in a multi-tree approach must strike a balance between the performance benefit gained by using a large number of specialised trees and the difficulty of training many trees successfully. When using either the SIC or RIC crossover methods, increasing the number of trees proportionally reduces the chance that a given tree is chosen for crossover/mutation, thereby decreasing the rate at which each tree is refined. When the AIC method is used, a larger number of trees increases the probability that a crossover will not improve fitness, as the majority of the trees are unlikely to gain a performance boost when crossed over in the later stages of the training process, when only small “tweaks” to trees are required to optimise performance. We will investigate the effect of the number of trees on the fitness obtained later in this paper.

4 Experiment Design

4.1 Benchmark Techniques

We compare our proposed single-tree approach (GPGC) to a variety of baseline clustering methods, which are listed below. We also compare the single- and multi-tree approaches, to investigate the effectiveness of using additional trees.

  • k-means++ (Arthur and Vassilvitskii, 2007), a commonly used partitional algorithm. Standard k-means++ cannot automatically discover the number of clusters, and so K is pre-fixed for this method. We use this as an example of a relatively simple and widely used method in the clustering literature.

  • OPTICS (Ankerst et al., 1999), a well-known density-based algorithm. OPTICS requires a contrast parameter, ξ, to be set in order to determine where in the dendrogram the cluster partition is extracted from; we test OPTICS with a range of ξ values and report the best result in terms of the Adjusted Rand Index (defined in Section 4.4).

  • Two naïve graph-based approaches which connect every instance with an edge to each of its N nearest neighbours (von Luxburg, 2007). We test both N = 2 (called NG-2NN) and N = 3 (called NG-3NN) in this work. Note that the N = 1 case (NG-1NN) is similar to the clustering process used in Algorithm 1 (Section 3.2); we exclude NG-1NN as it produces naive solutions with a fixed distance function.

  • The Markov Clustering (MCL) algorithm (Van Dongen, 2000), another clustering algorithm using a graph-based representation, which simulates random walks through the graph and keeps instances in the same cluster when they have a high number of paths between them.

  • The multi-objective clustering with automatic k-determination (MOCK) algorithm (Handl and Knowles, 2007) introduced earlier, as an example of a well-known high-quality EC clustering method.

4.2 Datasets

We use a range of synthetic clustering datasets to evaluate the performance of our proposed approach, with varying cluster shapes, numbers of features (m), instances (n), and clusters (K). We avoid using real-world datasets with class labels, as done in previous clustering studies, since there is no requirement that classes should correspond well to homogeneous clusters (von Luxburg et al., 2012); for example, clustering the well-known Iris dataset will often produce two clusters, as the versicolor and virginica classes overlap significantly in the feature space. The datasets were generated with the popular generators introduced by Handl et al. (Handl and Knowles, 2007). The first generator uses a Gaussian distribution, which produces a range of clusters of varying shapes at low dimensions, but produces increasingly hyper-spherical clusters as m increases. As such, we use this generator only at a small m, to produce the datasets shown in Table 1. The second generator produces clusters using an elliptical distribution, which produces non-hyper-spherical clusters even at large dimensionality. A wide variety of datasets were generated with this distribution, with m varying from 10 to 1000, and K varying from 10 to 100, as shown in Table 2. Datasets with a smaller K have more instances per cluster, whereas datasets with a higher K have fewer instances per cluster, to limit the memory required. These datasets allow our proposed approach to be tested on high-dimensional problems. All datasets are scaled so that each feature lies in [0, 1], to prevent feature ranges from overly affecting the distance calculations used in the clustering process. As a generator is used, the cluster that each instance is assigned to is known, i.e. the datasets provide a gold standard in the form of a “cluster label” for each instance. While this label is not used during training, it is useful for evaluating the clusters produced by the clustering methods.
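The rescaling described above corresponds to standard per-feature min-max scaling, for example (using a stand-in feature matrix):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_raw = np.random.rand(100, 10) * 50.0   # stand-in for a generated dataset
X = MinMaxScaler().fit_transform(X_raw)  # every feature rescaled to [0, 1]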

Name            m    n     K
10d10cGaussian  10   2730  10
10d20cGaussian  10   1014  20
10d40cGaussian  10   1938  40
Table 1: Datasets generated using a Gaussian distribution (Handl and Knowles, 2007).
Name      m    n     K      Name        m     n     K
10d10c    10   2903  10     100d10c     100   2893  10
10d20c    10   1030  20     100d20c     100   1339  20
10d40c    10   2023  40     100d40c     100   2212  40
10d100c   10   5541  100    1000d10c    1000  2753  10
50d10c    50   2699  10     1000d20c    1000  1088  20
50d20c    50   1255  20     1000d40c    1000  2349  40
50d40c    50   2335  40     1000d100c   1000  6165  100
Table 2: Datasets generated using an Elliptical distribution (Handl and Knowles, 2007).

4.3 Parameter Settings

The non-deterministic methods (k-means++, GPGC, MOCK, MCL) were run 30 times, and the mean results were computed. k-means++, GPGC and MOCK were run for 100 iterations, by which time k-means++ had achieved convergence. All benchmarks use Euclidean distance. The GP parameter settings for the single- and multi-tree GPGC methods are based on standard parameters (Poli et al., 2008) and are shown in Table 3; the multi-tree (MT) approach uses a smaller maximum tree depth than the single-tree (ST) approach due to having multiple, more specific trees. The MOCK experiments used the attainment score method to select the best solution from the multi-objective Pareto front.

Parameter Setting Parameter Setting
Generations 100 Population Size 1024
Mutation 20% Crossover 80%
Elitism top-10 Selection Type Tournament
Min. Tree Depth 2 Max. Tree Depth 5 (MT), 7 (ST)
Tournament Size 7 Pop. Initialisation Half-and-half
Table 3: Common GP Parameter Settings

4.4 Evaluation Metrics

To evaluate the performance of each of the clustering algorithms, we use the three measures defined previously (connectedness, sparsity, and separation), as well as the Adjusted Rand Index (ARI), which compares the cluster partition produced by an algorithm to the gold standard provided by the cluster generators in an adjusted-for-chance manner (Vinh et al., 2010).

Given a cluster partition A produced by an algorithm and a gold standard cluster partition B, the ARI is calculated by first generating a contingency table where each entry n_ij denotes the number of instances in common between A_i and B_j, where A_i is the i-th cluster in A and B_j is the j-th cluster in B. In addition, the sum of each row and column is computed, denoted a_i and b_j respectively. As before, N is the total number of instances. The ARI is then calculated according to Equation (7), which finds the frequency of agreements between the two clusterings while adjusting for the chance grouping of instances.

\text{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right]\Big/\binom{N}{2}}{\frac{1}{2}\left[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\right] - \left[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\right]\Big/\binom{N}{2}} \qquad (7)
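In practice, the ARI can be computed directly from the two label vectors; for example, scikit-learn provides an implementation equivalent to Equation (7):

from sklearn.metrics import adjusted_rand_score

gold  = [0, 0, 0, 1, 1, 2, 2, 2]  # cluster labels provided by the generator
found = [1, 1, 0, 0, 0, 2, 2, 2]  # cluster labels produced by a clustering method
ari = adjusted_rand_score(gold, found)  # 1.0 = perfect agreement, ~0.0 = chance level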

5 Results and Discussion

We provide and analyse the results of our experiments in this section. We begin by comparing each of the proposed multi-tree approaches to the single-tree GPGC approach in order to decide which version of GPGC is the more effective (Section 5.1). We then compare the best of these approaches, GPGC-AIC, to the benchmark methods to examine how well our proposed method performs relative to existing clustering methods (Section 5.2). The effect of the number of trees on the performance of the multi-tree approach is analysed in Section 5.3.

5.1 GPGC using Multiple Trees

Table 4: Crossover comparison on the datasets generated using a Gaussian distribution: (a) 10d10cGaussian, (b) 10d20cGaussian, (c) 10d40cGaussian. For each method (GPGC, AIC, RIC, SIC), the mean fitness, number of clusters (K), connectedness (Conn.), sparsity (Spar.), separation (Sep.), and ARI are reported.
Table 5: Crossover comparison on the datasets generated using an Elliptical distribution: (a) 10d10c, (b) 10d20c, (c) 10d40c, (d) 10d100c, (e) 50d10c, (f) 50d20c, (g) 50d40c, (h) 100d10c, (i) 100d20c, (j) 100d40c, (k) 1000d10c, (l) 1000d20c, (m) 1000d40c, (n) 1000d100c. For each method (GPGC, AIC, RIC, SIC), the mean fitness, number of clusters (K), connectedness (Conn.), sparsity (Spar.), separation (Sep.), and ARI are reported.

To further improve the performance of the proposed GPGC approach, we proposed an extension to use a multi-tree GP design in Section 3.4. To analyse the effectiveness of this extension, and to determine which type of multi-tree crossover is most effective, we evaluated the three crossover methods (RIC, SIC, AIC) against the single-tree GPGC approach, using a fixed number of trees chosen based on initial tests; the effect of varying the number of trees is discussed further in Section 5.3. Tables 4 and 5 show the results of these experiments on the datasets generated using a Gaussian and an elliptical distribution respectively. For each of the four methods, we provide the (mean) number of clusters (K), the fitness achieved, and four metrics of cluster quality: connectedness (Conn.), sparsity (Spar.), separation (Sep.), and the ARI. Connectedness, sparsity, and separation are defined in the same way as in the fitness function. We performed a two-tailed Mann-Whitney U-test at a 95% confidence level comparing each of the multi-tree approaches to the single-tree approach on each of the metrics; in the tables, a marker indicates that a method is significantly better or significantly worse than the single-tree GPGC method, and no marker indicates that no significant difference was found. For all metrics except sparsity, a larger value indicates a better result.

The most noticeable result of using a multi-tree approach is that the fitness achieved by the GP process is significantly improved across all datasets, with the exception of the 10d20cGaussian, 10d40c and 10d100c datasets, where the multi-tree approaches were significantly worse or had similar fitness to GPGC. On the datasets generated using a Gaussian distribution, the multi-tree approaches are able to find the number of clusters much more accurately on 10d10cGaussian, and achieve a significantly higher ARI result. On the 10d40cGaussian dataset, both AIC and RIC achieved significantly better fitness, connectedness, sparsity, and separation than GPGC. While AIC is significantly worse than GPGC on 10d20cGaussian and 10d40cGaussian, the decrease of ~1.5% ARI is not very meaningful given that AIC gained 13% ARI on 10d10cGaussian.

The multi-tree approaches also tend to produce clusters that are both better connected and better separated than GPGC on the datasets generated with an elliptical distribution. It seems that using multiple trees allows the GP evolutionary process to better separate clusters, while still ensuring that similar instances are placed in the same cluster. Sparsity is either increased (i.e. made worse) or is similar compared to GPGC when a multi-tree approach is used; this suggests that the single-tree approach was overly favouring reducing sparsity at the expense of the overall fitness. Another interesting pattern is that the number of clusters (K) found by the multi-tree approaches was always lower than that found by GPGC; given that GPGC tended to over-estimate K, this can be seen as further evidence that using multiple trees improves clustering performance. Furthermore, a smaller K is likely to directly improve connectedness, as more instances will have neighbours in the same cluster, and separation, since having fewer clusters increases the average distance between neighbouring clusters.

In terms of the ARI, the multi-tree approaches were significantly better than GPGC on a number of elliptically-generated datasets, with the RIC, SIC, and AIC methods being significantly better on 3, 3, and 6 datasets respectively. Both the AIC and RIC methods have significantly better fitness than GPGC on many of these datasets, while SIC is not significantly better on 100d10c or 1000d100c.

To better understand which of the three multi-tree methods have the highest performance, we analysed the ARI results further as these give the best overall evaluation of how the multi-tree methods compare to the gold standard. We performed a Kruskal-Wallis rank sum test (at a 5% significance level) followed by post-hoc pair-wise analysis using Dunn’s test. The summary of this testing is shown in Table 6.

Dataset Finding p-value Finding p-value Finding p-value Finding p-value
10d10cG AIC >GPGC 0.000 AIC >SIC 0.015 RIC >GPGC 0.002 SIC >GPGC 0.034
10d10c AIC >GPGC 0.002 SIC >GPGC 0.018 RIC >GPGC 0.028
10d40c GPGC >AIC 0.048 GPGC >RIC 0.001 SIC >RIC 0.047
50d10c AIC >GPGC 0.000 AIC >RIC 0.040 AIC >SIC 0.003 RIC >GPGC 0.016
50d40c AIC >GPGC 0.003 AIC >RIC 0.003
100d40c AIC >GPGC 0.013 RIC >GPGC 0.006 SIC >GPGC 0.004
1000d100c AIC >GPGC 0.000 AIC >RIC 0.037 AIC >SIC 0.004
Table 6: Summary of ARI post-hoc analysis findings. For each dataset, all results with a p-value below 0.05 (5% significance level) are shown. “AIC >GPGC” indicates that AIC had a significantly better ARI than GPGC on the given dataset, with a given p-value.

According to the post-hoc analysis, AIC outperformed GPGC 6 times, whereas RIC and SIC outperformed GPGC 4 and 3 times respectively. AIC also outperformed RIC and SIC in 3 cases each, and in one case SIC outperformed RIC. Furthermore, AIC generally had a smaller p-value when it outperformed GPGC than SIC and RIC did. Based on these findings, we conclude that AIC is the most effective of the three proposed multi-tree approaches, and so we use GPGC-AIC in the next section to compare to the other clustering methods.

5.2 GPGC-AIC compared to the Benchmarks

Table 7: Comparison with the baseline methods on the datasets generated using a Gaussian distribution: (a) 10d10c, (b) 10d20c, (c) 10d40c. For each method (AIC, k-means++, MCL, MOCK, NG-2NN, NG-3NN, and the best OPTICS result), the number of clusters (K), connectedness (Conn), sparsity (Spar), separation (Sep), and ARI are reported. The best OPTICS settings were ξ = 0.005 on (a) and ξ = 0.001 on (b) and (c).
Table 8: Comparison with the baseline methods on the datasets generated using an Elliptical distribution (Part 1): (a) 10d10cE, (b) 10d20cE, (c) 10d40cE, (d) 10d100cE, (e) 50d10c, (f) 50d20c, (g) 50d40c, (h) 100d10c, (i) 100d20c, (j) 100d40c. For each method (AIC, k-means++, MCL, MOCK, NG-2NN, NG-3NN, and the best OPTICS result), the number of clusters (K), connectedness (Conn), sparsity (Spar), separation (Sep), and ARI are reported. The best OPTICS settings (ξ) were 0.05 (a), 0.001 (b), 0.005 (c), 0.01 (d), 0.05 (e), 0.005 (f), 0.05 (g), 0.001 (h), 0.01 (i), and 0.001 (j).
Comparison with the baseline methods on the datasets generated using an Elliptical distribution (Part 2), covering the 1000-dimensional datasets, beginning with (a) 1000d10c (best OPTICS setting ξ = 0.005); the same methods and measures as in Table 8 are reported.