Genetic Programming for Evolving a Front of Interpretable Models for Data Visualisation

01/27/2020 ∙ by Andrew Lensen, et al. ∙ Victoria University of Wellington 5

Data visualisation is a key tool in data mining for understanding big datasets. Many visualisation methods have been proposed, including the well-regarded state-of-the-art method t-Distributed Stochastic Neighbour Embedding. However, the most powerful visualisation methods have a significant limitation: the manner in which they create their visualisation from the original features of the dataset is completely opaque. Many domains require an understanding of the data in terms of the original features; there is hence a need for powerful visualisation methods which use understandable models. In this work, we propose a genetic programming approach named GPtSNE for evolving interpretable mappings from a dataset to highquality visualisations. A multi-objective approach is designed that produces a variety of visualisations in a single run which give different trade-offs between visual quality and model complexity. Testing against baseline methods on a variety of datasets shows the clear potential of GP-tSNE to allow deeper insight into data than that provided by existing visualisation methods. We further highlight the benefits of a multi-objective approach through an in-depth analysis of a candidate front, which shows how multiple models can



There are no comments yet.


page 1

page 9

page 10

page 14

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Visualisation is a fundamental task in data mining which aims to represent data in a human-understandable graphical manner [fayyad1996data]. Visualisation is often touted as a useful tool for allowing practitioners to gain an understanding of the data being analysed, but state-of-the-art visualisation methods (e.g. t-Distributed Stochastic Neighbour Embedding (t-SNE) [maatenTSNE]

) tend to be black boxes that give no insight into how the visualisation represents the original features of the data. Other methods such as autoencoders

[hinton2006reducing] and parametric t-SNE [maaten2009learning]

produce a parametric mapping from the original dataset to the two-dimensional visualisation, but these use complex deep neural networks which are opaque to humans and provide little “intuitive” understanding. Even the most recent advances such as Uniform Manifold Approximation and Projection (UMAP)

[2018arXivUMAP] acknowledge significant weaknesses with regards to interpretability. As McInnes et al. note (emphasis ours):

For a number of uses [sic] cases the interpretability of the reduced dimension results is of critical importance. Similarly to most non-linear dimension reduction techniques (including t-SNE and Isomap), UMAP lacks the strong interpretability

of Principal Component Analysis (PCA) and related techniques such a [

sic] NonNegative Matrix Factorization (NMF). In particular the dimensions of the UMAP embedding space have no specific meaning

, unlike PCA where the dimensions are the directions of greatest variance in the source data. Furthermore, since UMAP is based on the distance between observations rather than the source features, it does not have an equivalent of factor loadings that linear techniques such as PCA, or Factor Analysis can provide. If strong interpretability is critical we therefore recommend linear techniques such as PCA and NMF.”

[2018arXivUMAP, p. 35]

While such linear techniques as PCA and NMF provide strong interpretability, they are inherently limited in the quality of their output dimensions by their use of linear weightings only. Tree-based genetic programming (GP) is often recognised for its ability to directly model functions by taking inputs as leaves and producing an output at the root [poli2008field] and is often used for dimensionality reduction in the form of feature construction (FC) [neshatian2012filter]. Significant progress has been made on producing interpretable GP trees, which are simple enough to be understood by a human expert by using techniques such as parsimony pressure [poli2008field] and multi-objective optimisation [bleuler2001multiobjective], with multi-objective approaches showing particularly good results due to their ability to produce a range of solutions with different levels of complexity [wagner2012parsimony, cano2017multi]. Despite being clearly suitable for dimensionality reduction and evolving interpretable trees, GP has never been used to produce interpretable models that create understandable visualisations.

We recently investigated the use of GP to do manifold learning (GP-MaL) for dimensionality reduction [lensen2019can]. We found that GP was able to significantly reduce the size of feature sets while still retaining high-dimensional structure by producing sophisticated constructed features which used a variety of non-linear and conditional transformations. Our findings also highlighted the considerable potential of GP-MaL for visualisation, but that the original fitness function needed substantial refinement and that the complexity of the evolved models would need to be significantly reduced.

In this paper we aim to significantly expand our previous work by using a multi-objective GP approach for visualisation which is expected to address the above limitations. To our knowledge, this is the first multi-objective GP method proposed which treats visualisation quality and model complexity as competing objectives. This paper will:

  • Propose a holistic set of functions and terminals for creating powerful and interpretable models for visualisation;

  • Design a multi-objective approach to allow for the evolution of a solution front containing visualisations representing different levels of quality and complexity;

  • Compare the quality and interpretability of the evolved visualisations with those produced by state-of-the-art visualisation methods on a range of datasets; and

  • Perform an in-depth analysis of the trade-off between visualisation quality and tree complexity to demonstrate the unique advantages of the proposed approach.

This work is expected to provide a new and innovative approach which addresses the little-studied issue of model interpretability in visualisation. Of particular novelty is the in-depth analysis that will be performed that allows significant insight which is unattainable with existing single-solution black box approaches.

Ii Background

Ii-a Dimensionality Reduction

Dimensionality reduction (DR), the process of reducing the number of features/attributes in a dataset to improve understanding and performance [liu2012feature], has continued to grow in importance as datasets become increasingly big and deep neural networks become more and more un-interpretable. Common techniques to address this problem include feature selection (FS) and feature construction (FC), which reduce the feature space through removing unwanted features or creating fewer, more complex meta-features, respectively. Evolutionary Computation (EC) methods have been applied to these NP-hard problems with significant success [xue2015survey, neshatian2012filter]. In particular, tree-based GP has proved to be a natural fit for FC problems due to its functional and interpretable nature; GP-based FC has been used in classification [tran2016genetic], image analysis [alsahaf2017automatically], clustering [lensen2017GPGC] and other domains [Ahmed2014Multiple, hart2017hybrid].

Manifold learning (also known as non-linear dimensionality reduction) [lee2007nonlinear] can be regarded as an unsupervised FC approach, where the task is to build a set of features which represent the non-linear manifold present in the high-dimensional feature space. One way of approaching this task is to attempt to build a function, which maps the high-dimensional space to the low-dimensional manifold; such an approach could provide an understandable mapping between the two. Until our recent work [lensen2019can] there had been no investigation into whether GP could be used to evolve such a mapping.

Ii-B Machine Learning for Performing Visualisation

The simplest commonly-used machine-learning visualisation methods are FS and Principal Component Analysis (PCA)


. By using FS to select only two (or perhaps three) features, one can visualise a dataset by plotting the feature values of each instance along the x- and y-axes. PCA uses a more sophisticated approach, where it creates a series of “principal components” which are linear combinations of the feature set, such that each successive component is orthogonal to all the preceding components. By plotting the first two created components, a visualisation is created which is optimal when performing only linear transformations.

However, linear transformations fail to give clear visualisations of any underlying manifold/structure in a dataset beyond simple, low-dimensional data. Unfortunately, creating optimal non-linear transformations is an NP-hard problem, with many machine learning methods proposed as a result. The earliest methods include techniques such as Isomap [tenenbaum2000isomap] and Local Linear Embedding (LLE) [roweis2000lle], but t-SNE [maatenTSNE] is generally regarded as the mainstream state-of-the-art method.

t-SNE is an improvement to the previously proposed SNE method [hinton2002sne], which introduced a more nuanced probabilistic mapping from the high to low-dimensional spaces based on the similarity of neighbours in the high dimensional space. t-SNE improved upon SNE by introducing a cost function that could be better optimised by gradient-based optimisers; and by using a heavy-tailed t-distribution in place of a Gaussian to address the tendency of points in the low-dimensional space to be pushed together at the cost of separation between points, a characteristic known as the crowding problem. A range of improvements to t-SNE have since been proposed, including a more efficient tree-based nearest-neighbour search [maaten2014accelerating], and a parametric version[maaten2009learning], but the cost function, which is the key aspect in our work, has remained relatively unchanged.

Ii-C Multi-objective Optimisation

Multi-objective optimisation (MO) is a technique used when a problem intrinsically has two (or more) conflicting objectives between which a trade-off must be made by any solution to the problem. The quality of a solution in this context is often relative to the objective function values of other solutions. For example, given two candidate solutions to a problem with objectives to be minimised, then the following test can be used to determine if is strictly better than (or dominates) :


where for objectives (in this paper, ). A solution for which this does not hold true for all other solutions (i.e. it is non-dominated) is called a Pareto-optimal solution. The set of Pareto-optimal solutions is often called the Pareto set. The Pareto front is then the image of the Pareto set in the objective space. In practice, it is infeasible or very difficult to find the true Pareto front, and so MO algorithms hence search for the best approximation front.

EC methods are some of the most successful and widely used approaches to MO as their population-based structure naturally allows for multiple non-dominated solutions to be found in one run. Of the Evolutionary Multi-objective Optimisation (EMO) algorithms, the Multi-Objective Evolutionary Algorithm based on Decomposition (MOEA/D)

[zhang2007moead] has seen widespread use in recent years. MOEA/D has been shown to produce approximation fronts with better spread and convergence than previous EMO methods[li2009mo]. Multi-objective genetic programming (MOGP) approaches have seen significant success in recent years [bhowan2013evolving, cano2017multi].

Ii-D Related Work

Ii-D1 GP for Manifold Learning and Visualisation

The only existing work we are aware of which used GP for MaL is our previous GP-MaL approach [lensen2019can]. GP-MaL was proposed to tackle the general MaL problem, i.e. the reduction of a -dimension high-dimensional space to a -dimension lower-dimensional space, where . A (single-objective) fitness function was proposed that measured how well manifold structure was preserved by measuring the preservation of -neighbour orderings in the lower-dimensional space. GP-MaL was fundamentally designed for dimensionality reduction for data mining tasks (e.g. classification), and so was primarily used to retain as much structure of the data as possible. Some initial application of GP-MaL for visualisation was investigated with encouraging results, but it was found that a more specific fitness function would be needed to improve visualisation quality, and that tree size would need to be addressed for visualisations to be understandable.

Multi-objective GP approaches have been proposed to improve the visualisation quality of feature construction in a supervised learning context. The MOG3P method


aimed to produce solutions with a trade-off between three objectives: classifiability, visual interpretability, and semantic interpretability. Unlike in this work, MOG3P focused on performing

supervised learning

, using class labels to guide the learning process. The three objectives also do not appear to be strictly conflicting: on a naive classification dataset, it would be possible to achieve perfect classification performance with a simple hyperplane split, which could be represented by a simple GP tree which produces a visualisation with two very distinct classes. Even on more complex datasets, there is an inherent relationship between the visual interpretability of a dataset (i.e. how well classes are separated), and the classification performance: given well-separated classes, one would expect correct classification to result. A later approach focused on classifiability and visual interpretability only, but used several different measures for each of these objectives

[cano2017multi]. This work also is a supervised approach, and so tackles a significantly different problem to that of this paper. The use of GP for visualisation has also been applied to combinatorial optimisation problems, such as job shop scheduling [nguyen2018visualising].

Ii-D2 Other Approaches

A variety of non-EC approaches have been proposed for MaL, some of which produce models that could theoretically be interpreted to understand the manifold in terms of the original features. Parametric t-SNE [maaten2009learning]

is a variation of t-SNE that allows for re-use of learnt t-SNE representations on future samples. Parametric t-SNE constructs a mapping from the high- to low-dimensional space using restricted Boltzmann machines to construct a pre-trained feed-forward neural network model. The neural network used over 10,000 neurons on the largest dataset, heavily restricting the potential for interpretation of the network. Autoencoders

[hinton2006reducing] are another neural-network based approach, which attempt to compress the representation of the data into as narrow of a middle hidden layer as possible such that the original data can be re-created from the concise representation. To do so, autoencoders use a number of layers of varying sizes to encode and decode the data. This provides a mapping to and from the learnt representation, but is unrealistic to interpret given the number of nodes and fully-connected topology. Attempts have been made to improve the interpretability of autoencoders [sun2018evolving], but success is fundamentally limited by the need for architectures which are differentiable. Self-organising maps (SOMs) [kohonen1982som], a variation of an unsupervised neural network, have been used for visualisation, but the number of weights scales with dimensionality and so their interpretability is limited on non-trivial datasets. The visualisation literature [nonato2019multidimensional] discusses a few examples of mapping “synthesised dimensions” to original dimensions (in particular, page 4 of the supplementary material). These examples generally use a post hoc approach, by varying the model’s parameters or the instances [faust2019dimreader] and analysing the effect, or using tools such as heatmaps. There is a distinct lack of literature that directly evolves simple models that use minimal unique features.

Iii The Proposed Method: GP-tSNE

Generally when applying GP to a problem, there are two main decisions to be made: what terminal and functional nodes are suitable, and how to formulate a fitness function to solve the problem. In addition, there are often other improvements made to the standard GP evolutionary process to further tailor it to the problem. Each of these three considerations are discussed in the following subsections. While each of these components builds on established work, the use of them together to produce a range of interpretable models producing high-quality visualisations is a new and substantial advance in this field. The full source code of GP-tSNE is available online111

Iii-a GP Architecture

In order to project a high-dimensional space to a two-dimensional space for visualisation, it is important that a range of powerful and varied functions are available to the evolutionary process in order to produce compact but representative models. As exactly two dimensions are required for visualisation, we use a multi-tree approach where each individual contains two trees and each tree produces one dimension (i.e. the x- or y- axis) of the visualisation. A multi-tree representation is chosen rather than a co-operative co-evolution approach as the two trees must be tightly coupled (i.e. highly dependant) in order to give high-quality visualisations; co-operative co-evolution approaches tend to non-deterministically pair solutions from different sub-populations which greatly decreases the ability for such coupling to occur.

Iii-A1 Function Set

Table I lists the nine functions and four terminals used in GP-tSNE. The four arithmetic functions are standard, with the exception of the and function which are the only functions that can take the zero node as input. This allows a variable number of inputs to these functions, which allows the evolutionary process to easily perform mutation that effectively removes whole trees (hopefully introns: “useless” sub-trees whose output do not affect the tree output) in the pursuit of model simplification. The function performs protected division: if the denominator is , it returns .

Function No. of Inputs Description
Arithmetic Functions
1–5 Flexible Addition
1–2 Std. Subtraction
2 Std. Multiplication
2 Protected Division
Non-Linear Functions
Sigmoid 1
ReLU 1
Conditional Functions
Max 2
Min 2
If 3 if : y; else
Terminal Nodes
F 0 feature value
NF 0
Mean of feature values
from 3-nearest neighbours
Constant 0 From
Zero 0 The number
TABLE I: The function and terminal sets of the proposed method.

The sigmoid and ReLU functions are unary operators that are included to allow easy transformation of (linear) inputs into a non-linear output space. These were chosen based on inspiration from auto-encoders and other neural network methods. The three conditional operators provide a different kind of non-linear transform which is quite unique to GP due to their non-differentiability, and they are expected to allow a single tree to exhibit varied behaviour depending on its inputs.

Iii-A2 Terminal Set

The F terminal is commonly used in feature-based GP to use a feature as an input to a GP tree. Each feature in a dataset is assigned a distinct terminal which returns the values of the feature for a given instance. While this provides a great deal of flexibility to the EC search process, it does make the search space large for high numbers of features (). To remedy this, we use PCA to produce the first PC on a dataset, and then select the features from that PC which have the highest-magnitude weights. Features with higher-magnitude weights contribute more to a PC, and therefore choosing the highest-magnitude features is a simple form of unsupervised feature selection. Each of these features have an increased likelihood of being chosen from the terminal set, which helps the EC search process focus on a set of promising features, while still allowing it to select other features with a smaller likelihood. In this work, we set , and each of the PCA-selected features have an additional likelihood of being selected from the terminal set. For example, if features, then : each of the 8 features chosen by PCA has ( the likelihood) of being chosen from the terminal set. is chosen so that the number of selected features increases slowly with the total number of features. Note that while we use PCA to weight promising features more strongly, we do not use the actual transformed PCs in our terminal set, as this would make the trees very difficult to interpret.

The NF terminal serves two purposes: it provides a lower-noise version of a given feature, making trees less sensitive to noisy instances; and it allows a tree to encode local structure into the low-dimensional space more effectively by considering an instance’s neighbourhood. While individual features (F) are suitable for embedding global structure, they are not as effective for preserving local structure. The three nearest-neighbours for each instance are pre-computed by considering the ordering of the other instances by Euclidean distance.

The constant terminal is often used in GP, and is used here with a range of to allow different parts of the tree to have different impacts on the final output. The zero node is included solely for the and function, and is not used by any other functions as it is either destructive or has no effect. The constant and zero nodes are chosen from the terminal set with a likelihood of as they are less likely to be useful to a function compared to a feature-based node. A summary of the terminal nodes’ weightings are shown in Table II. Note that the likelihood calculation is for a single instance in the case of a feature-based terminal, e.g. there are different terminals, each with a base likelihood of 1. All of the numerical values, including the “5” in the flexible addition, the use of 3-nearest neighbours, and the likelihood values, were empirically chosen by testing on a range of representative datasets.

Terminal L.h. L.h. for
“Normal” F 1 1
PCA-selected F 11
NF 1 1
Constant 10
Zero 10
TABLE II: Calculation of likelihood (l.h.) for each terminal to be selected from the terminal set.

Iii-B Multi-Objective Approach

There is an intrinsic relationship in machine learning between the potential performance of a model and the complexity required to attain that performance. For example, the simplest model to separate two classes would be a decision boundary that simply thresholds at a certain point in space, whereas for three linearly-separable classes, at least two thresholds would be required (for some real , :  

). The same is true in visualisation: the more granular (specific) a visualisation is, the more complex the function used to produce that visualisation must be. To recreate the high-dimensional structure of a complex dataset in two dimensions would require two very big and complex GP trees. Each node removed from a tree decreases the accuracy with which the tree can reproduce the high-dimensional probability distribution (in the case of t-SNE). As an analogy, consider evolving a very complex polynomial function with GP: the fewer components (nodes) in the evolved functions, the fewer inflection points available to approximate the function.

In this work, we employ a multi-objective approach to produce a set of solutions that allow for a trade-off between visualisation quality and model interpretability to be chosen. This is strictly a more difficult problem than optimising only visualisation quality (i.e. as in t-SNE) as a model must be found which maps the high-dimensional to the low-dimensional space. We choose to use MOEA/D [zhang2007moead] due to the reasons discussed in Section II-C. The first objective uses t-SNE to measure the visualisation quality of a GP tree; the second objective uses tree size as a proxy measure for the interpretability of the tree. These will be discussed in turn.

Iii-C Objective 1: Visualisation Quality

t-SNE uses conditional probabilities to represent the similarity between instances in a given dimensional space. Given two instances in the high-dimensional space, and , the conditional probability that would be chosen as a neighbour of is defined as [maatenTSNE]:


given a Gaussian with variance centred at . t-SNE employs a symmetric approach where joint probability distributions are used; is computed as the symmetrised conditional probabilities which for instances is defined as:


In order to remedy the crowding problem (see [maatenTSNE] for further details), t-SNE uses a slightly different approach in the low-dimensional space which is based on a Student t-distribution. The joint probabilities of two instances, and , in the low-dimensional space, called is computed as:


The cost function (C), which measures the extent to which the low-dimensional probability distribution does not match the high-dimensional probability distribution, is the difference between the two distributions as measured using the sum of the Kullback-Leibler (KL) divergences:


We use this cost function as the first objective, which should be minimized. The parameter is computed based on a perplexity of using the approach outlined in [maatenTSNE]. We choose this cost function as it is well-tested and well-regarded in the MaL and visualisation literature.

Iii-D Objective 2: Model Complexity

A common issue encountered in GP is the production of bloated trees, where a GP tree is significantly bigger than is necessary to achieve a given level of fitness. Traditionally, there is no evolutionary pressure to encourage compact trees, and so trees may contain unnecessarily complex sub-trees, or in the worst case introns While being computationally inefficient, bloated trees are also much harder for humans to interpret and understand.

Parsimony pressure, which treats minimisation of model size as a (minor) objective in the optimisation process, is the most frequently method for controlling bloat in GP [poli2014parsimony]. Weighted sum approaches, where a small component of overall fitness is based on tree size, has been often used for attempting to control bloat, but choosing the right weighting is difficult and generally must be set empirically. More recently, multi-objective approaches have been proposed for addressing bloat, whereby tree size is used as a secondary objective to be minimised [wagner2012parsimony]. This approach allows a trade-off between fitness and model complexity to be found according to the approximation front produced by the GP process, while also producing a range of solutions of varying complexity in a single GP run which can be compared by the user for better insight into the problem being tackled.

We use a simple formula for complexity in this work which is based on the number of nodes in each of the two trees, , in a GP individual :


A given node is counted towards the complexity unless it is a ZeroNode; these are not counted as they exist only to allow flexibility in the arity of the addition function, and can be removed from the tree structure when interpreting it.

Iii-E Optimisation of Tree Constants

To further increase the visualisation quality without introducing additional model complexity, we use Particle Swarm Optimisation (PSO) [kennedy1995particle] to fine-tune the Ephemeral random constants (ERCs) in each individual in the final front at the end of the evolutionary process. Standard GP has no ability to fine-tune its numerical parameters efficiently (as it randomly searches the parameter space); by employing PSO we can do so. Each ERC (from both trees) is allocated a dimension in the PSO representation, with the value of a given dimension representing how much the value of the given ERC varies from its original value. We use a very small range of initial position values () and low minimum and maximum velocities ( and ) to focus the PSO search on fine-tuning the ERCs in the tree. One of the 30 particles is initialised with all values of (i.e. the original ERC values) so that the PSO search is guaranteed not to produce inferior solutions. The PSO search is single-objective (as the tree structure is fixed), with the fitness function being the same cost function used as the first objective in the GP search (Eq. 5). PSO is only used at the end of the GP evolutionary process both due to its computational cost and to prevent GP from falling into local minima that would result from fine-tuning during evolution.

Other techniques such as the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [cmaes2003] have also been used for numerical optimisation, but PSO has been shown to give considerably better results on ill-conditioned functions222An ill-conditioned function is one with a high condition number: its output is overly sensitive to changes in the values of its variables. [hansen2011impacts]. For GP trees of sufficient complexity, it is expected they have the characteristics of an ill-conditioned function as a change in a given ERC value is likely to significantly change the output of the tree due to the interactions present across different sub-trees. Preliminary testing confirmed that PSO could optimise the random constants of the final GP individuals more effectively and consistently than CMA-ES.

Iii-F Other Considerations

We use a slight variation to the standard MOEA/D, whereby we prevent the breeding process from producing individuals which have already been created in earlier generations333The breeding process tries up to 10 times to produce a non-duplicate individual — otherwise it returns one of the parents, chosen randomly.

. While this added a small amount of overhead to the evolutionary process, it gave a much more efficient and effective learning process which was ultimately found to improve the diversity and quality of the final approximated front significantly. The weight vectors in MOEA/D were scaled to encourage a uniformly distributed front. The cost objective was scaled to

and complexity to .

The evolutionary process was sped up through the use of multi-threaded evaluation, caching of quality values for previously-seen trees, and the use of linear algebra libraries which allow significantly faster matrix operations through native code libraries. An overview of the GP-tSNE algorithm is shown in Algorithm 1.

0:  Dataset: , maximum generations:
0:  Approximated Pareto front
1:  Randomly initialise population
2:   Evolutionary Loop:
3:  for  to  do
4:     for  to  do
5:          MOEAD
7:         while  do
8:            if  then
10:            else
12:            end if
13:            if  then
15:            end if
16:            if  then
18:            end if
19:         end while
22:         Evaluate using Eq. 5and using Eq. 6
23:          MOEAD
24:     end for
26:  end for
27:  for  to  do
29:  end for
30:  return  
Algorithm 1 Overall GP-tSNE Algorithm

Iv Experiment Setup

The proposed GP-tSNE method was applied to a range of representative datasets which contain well-formed classes that lend well to visualisation. The nine datasets are summarised in Table III, and have a range of numbers of instances, features, and classes. These datasets are from a number of different domains including general classification, biology, and image analysis. Most of these datasets were sourced from the UCI repository [uci]. The standard t-SNE implementation [maatenTSNE] was chosen as a baseline method as it has the same optimisation measure as GP-tSNE, and is regarded as the state-of-the-art in visualisation techniques. We used a standard perplexity value of (the same as in GP-tSNE). We also compare to our previous GP-MaL method [lensen2019can] which was the first GP method to perform MaL, using a single-objective approach. GP-MaL parameter settings were unchanged from the paper.

Dataset Instances Features Classes
Iris 150 3 3
Wine 178 13 3
Dermatology 358 34 6
Breast Cancer Wisconsin 683 9 2
COIL20 1440 1024 20
Isolet 1560 617 26
MFAT 2000 649 10
MNIST 2-class 2000 784 2
Image Segmentation 2310 19 7
TABLE III: Classification datasets used for experiments.

We found that the use of a multi-objective approach necessitated a large number of generations (2,500) in order to allow the biggest trees (with the best performance) to be trained sufficiently. However, only a small population size of 100 individuals was needed to achieve a reasonable cover of the front, especially due to the technique used to breed unique individuals. The remaining parameters (Table IV) are standard settings — tuning them further gave no change in performance.

Standard GP mutation is used, with one of the two trees in an individual being randomly chosen to be mutated. Crossover is performed by randomly selecting either the first or second tree to use, and then performing crossover between this tree in each parent — this crossover occurs between the same axis of the visualisation.

Parameter Setting Parameter Setting
Generations 2500 Population Size 100
Mutation 20% Crossover 80%
Min. Tree Depth 2 Max. Tree Depth 14
No. Trees 2 Pop. Initialisation Half-and-half
TABLE IV: GP Parameter Settings.

All three methods were run 30 times on each dataset. As our evaluation is primarily based on qualitative analysis (as is standard in visualisation [maatenTSNE, 2018arXivUMAP]), we chose the median result out of the 30 runs according to the objective function used by the method: t-SNE’s cost function for t-SNE and GP-tSNE, and the fitness function proposed in [lensen2019can] for GP-MaL.

t-SNE runs (by default) for up to 1,000 iterations, corresponding to 1,000 evaluations of the t-SNE cost function. GP-tSNE runs for 2500 generations with a population size of 100, giving 250,000 evaluations. Clearly, GP-tSNE is computationally more expensive, although the use of fitness caching and multi-threading reduces this expense. Also, GP-tSNE produces many visualisations in a single run, whereas t-SNE produces only one. On the biggest dataset, Image Segmentation, GP-tSNE takes 30 hours, whereas t-SNE takes around two minutes. We intend to significantly reduce this difference in future work, through the use of surrogate techniques and tuning of the GP design, but note that GP-tSNE is highly parallelisable, and so can run much quicker in practice.

V Results and Discussion

The results across each of the nine datasets are shown in Figs. 19. Each figure contains eight plots, which we refer to as (a)–(h) reading left-to-right and top-to-bottom. Plot (a) shows the attained approximation front of the (median) GP result, with blue crosses showing each individual in the front, and the black line showing the shape of the front. The y and x-axes show each individual’s value for Objective 1 (cost/visualisation quality) and 2 (model complexity) respectively. The grey lines show the worst and best of the 30 GP-tSNE runs, which show the variance of the GP-tSNE algorithm. The green and yellow dotted lines represent the cost achieved by the baseline t-SNE and GP-MaL results respectively. Plot (b) shows the same front, but with the x-axis truncated at 200 complexity, so as to focus on the distribution of the simpler/more interpretable trees. Plots (c)–(f) show representative visualisations — coloured by class label — produced by individuals at different levels of model complexity (highlighted in red on the attained front): (c) is the simplest model found by GP-tSNE where each tree is a single feature (i.e. feature selection (FS)), (d) and (e) are two characteristic samples from along the front, and (f) is the most complex model, with the most accurate visualisation and lowest cost. A small amount of jitter is added to visualisations with low complexity (less than 20) so that overlapping points can be better distinguished. Plots (g) and (h) are the t-SNE and GP-MaL baseline results with median performance, where the cost written above the plots is calculated using t-SNE’s objective function (the first objective of GP-tSNE), to allow for sensible comparisons.

It is only possible to show a small sample of the evolved visualisations in this paper, but one of the significant advantages of GP-tSNE is its ability to produce a series of visualisations which progressively become more complex but more accurate representations of the high-dimensional feature space. To demonstrate this, we have included a video that shows each individual along the approximation front for each dataset, as the front is traversed from least to most complex. This is included with our submission, and is also available on YouTube444 Specific visualisations can be studied in more detail by lowering the playback speed (e.g. to 25% on YouTube).

V-a Specific Analysis

On the simplest dataset, Iris (Fig. 1), all three methods provide a clear 2D visualisation, although t-SNE and GP-tSNE more distantly separate the green class. The Sample 1 visualisation for GP-tSNE is quite similar to that produced by GP-MaL when rotated, but with less continuous spacing of points, likely due to the trees being too small to convert the low-granularity feature space to a more complex output space. The front shows that diminishing returns are quickly found on this dataset: after 34 tree size (sample 2), quality only gradually improves as tree size increases. Indeed, the difference between the visualisations at complexity 34 and 1923 are very minor — clearly GP-tSNE can capture the bulk of the structure in the simple Iris dataset using a simple model.

[height=0.434]gpTSNEJoin/iris-median-PLOTS-PSO-PAPER (a)(b)(c)(d)(e)(f)(g)(h)

Fig. 1: Iris.

[height=0.434]gpTSNEJoin/wine-median-PLOTS-PSO-PAPER (a)(b)(c)(d)(e)(f)(g)(h)

Fig. 2: Wine.

Figure 2 shows the results on another simple dataset: Wine. All three methods produce clear visualisations, with the main difference being that t-SNE projects all 3 classes across the visualisation diagonally, whereas the two GP methods produce a curved pattern. Even at a low level of complexity, GP-tSNE clearly separates the three classes, and higher levels of complexity tend to mostly refine the local structure within each class. As on the Iris dataset, the cost achieved by GP-tSNE at its maximum complexity is reasonably close to t-SNE, despite being produced by a mapping rather than just an embedding.

Both t-SNE and GP-tSNE clearly separate the two classes on the Wisconsin Breast Cancer dataset (Fig. 3), whereas GP-MaL does not give as clear a separation boundary. GP-tSNE begins to show the division between the two classes as early as at a complexity of 33, with the red class becoming more tightly packed as the complexity increases further. Indeed, it appears that the GP models retain the same general “approach” in sample 1, 2 and at the maximum complexity, but become increasingly refined and accurate. In this way, it could be possible to use the simpler models to explore the more accurate patterns found by the bigger un-interpretable models by exploring how points move as complexity increases.

[height=0.434]gpTSNEJoin/breast-cancer-wisconsin-median-PLOTS-PSO-PAPER (a)(b)(c)(d)(e)(f)(g)(h)

Fig. 3: Breast Cancer Wisconsin.

The Dermatology dataset (Fig. 4) further reinforces this pattern: as GP-tSNE produces increasingly complex models, the visualisation becomes of a “higher-resolution” and produces points that appear to be in a continuous space, rather than a discrete space. Sample 1 shows the blue class clearly being separated, and the red and purple classes beginning to separate from the other classes. Sample 2 shows a clear separation between all classes except the yellow and green ones, despite having a complexity of only 86. The maximum complexity separates the classes further, and compresses each class more tightly. The lowest cost for GP-tSNE gives a very similar visualisation to t-SNE, despite having a cost that is higher. GP-MaL is clearly worse than GP-tSNE, even when GP-tSNE has a tree size as low as 86.

[height=0.434]gpTSNEJoin/dermatology-median-PLOTS-PSO-PAPER (a)(b)(c)(d)(e)(f)(g)(h)

Fig. 4: Dermatology.

The COIL20 dataset (Fig. 5) highlights t-SNE’s ability to freely move points throughout the 2D space, as it clearly produces the best visualisation. It is interesting to note however that GP-tSNE is able to separate some classes well, even at low model complexity. For example, in Sample 1 (complexity of 22), a number of classes (dark green, pink, blue) are already clearly visible — this suggests that these classes may be particularly well-defined, as they are easily separated. As the complexity increases, additional classes separate out, and the curve topology of the blue/pink/red classes seen in the t-SNE visualisation starts to form for GP-tSNE too.

[height=0.434]gpTSNEJoin/COIL20-median-PLOTS-PSO-PAPER (a)(b)(c)(d)(e)(f)(g)(h)

Fig. 5: COIL20.

Figure 6 shows how the Isolet dataset has many overlapping classes, with only a few classes distinctly separated by t-SNE. The GP methods generally overlap the same classes as t-SNE, but do not manage to separate the groups quite as successfully. Isolet has the highest number of classes (26) of all the datasets, which makes it especially challenging for GP-based methods, since evolving two functions that can provide sufficient granularity to separate 26 classes is clearly very challenging and requires sufficiently complex trees. Despite this, it may be possible to gain some insight into why given classes overlap, as the general overlapping patterns start to be visible even at lower complexities of 39 to 131.

[height=0.434]gpTSNEJoin/Isolet-median-PLOTS-PSO-PAPER (a)(b)(c)(d)(e)(f)(g)(h)

Fig. 6: Isolet.

[height=0.434]gpTSNEJoin/mfatCombined-median-PLOTS-PSO-PAPER (a)(b)(c)(d)(e)(f)(g)(h)

Fig. 7: MFAT.

[height=0.434]gpTSNEJoin/mnist_train_1k_23-median-PLOTS-PSO-PAPER (a)(b)(c)(d)(e)(f)(g)(h)

Fig. 8: MNIST 2-class.

On the MFAT dataset (Fig. 7), GP-tSNE is able to group each class effectively at the maximum complexity, but does not show the clear separation between classes that t-SNE provides. GP-tSNE clearly performs significantly better than GP-MaL, which overlaps the light green and purple classes onto other classes. The grouping of classes starts to appear at a complexity of 94, although the orange and purple classes overlap still. A complexity of 94 corresponds to two trees with 94 nodes in total — which is clearly quite restrictive given that the MFAT dataset has 649 features and 10 classes.

The visualisations in Fig. 8 (MNIST 2-class) provide an example of a large dataset with a small number of classes. All three methods are able to separate the two classes reasonably well, with GP-tSNE and t-SNE producing quite similar results. This separation is evident even at low model complexity in GP-tSNE – at a complexity of 51, it is possible to draw a vertical line through the middle of the visualisation that would separate the two classes with reasonable accuracy. At higher levels of complexity, GP-tSNE is able to push the two classes apart to leave a clear gap between them. GP-MaL produces a less clear separation and struggles to distribute the points in the visualisation space well.

On the final dataset, Image Segmentation (Fig. 9), all three methods are able to separate the green and orange classes out, with t-SNE and GP-tSNE providing clearer separation margins. Both GP-tSNE and t-SNE also separate the red class to some extent, with t-SNE doing so slightly more effectively. An interesting observation is that the orange, and to a lesser extent the green classes, are separated at very low model complexity. The orange class is easily separated using only one feature, whereas the green class can be separated at a complexity of 31. This suggests that perhaps these are more natural/intrinsic classes in the dataset; perhaps they exhibit characteristics that make them particularly distinct from the other instances. Another interesting finding is that the red class is separated well by GP-tSNE at a complexity of 11 — but then overlaps with the purple class at higher complexity. This may be a limitation of the t-SNE objective function or may suggest that the red class is actually a subclass of the purple class and is incorrectly over-separated at a low complexity. Indeed, the red class corresponds to pixels that are part of a path in an image, and the purple class to cement pixels. Clearly, many different things can be constructed out of concrete, including paths, which explains the overlap that is evident at higher complexities. This phenomenon is another useful characteristic of GP-tSNE: it shows the different relationships between classes at different levels of model complexity.

[height=0.434]gpTSNEJoin/image-segmentation-median-PLOTS-PSO-PAPER (a)(b)(c)(d)(e)(f)(g)(h)

Fig. 9: Image Segmentation.

V-B General Findings

Fig. 10: Convergence of GP-tSNE over the evolutionary process: y-axis shows the difference in t-SNE cost between the best individual at a given generation and that attained by t-SNE.

The visualisations produced by GP-tSNE were consistently superior to those of our previous GP method, GP-MaL. GP-MaL was developed for more general MaL (i.e. not just reducing to two dimensions); by using a visualisation-specific quality measure, GP-tSNE was able to clearly improve. As the number of instances increased, t-SNE begun to produce clearer visualisations than GP-tSNE, although the same general patterns were shown by both methods. Given GP-tSNE is in essence tackling a strictly more difficult task, of evolving a functional model to produce a visualisation, we see this as a success for the first such approach to this task. This pattern can be further seen in the convergence analysis shown in Figure 10. On the easier datasets (the first few in the legend), GP-tSNE converges before 1,000 generations. On the hardest datasets such as MNIST and MFAT, GP-tSNE is likely to benefit from additional generations. We are confident that future research in this new research direction will be able to further improve GP-tSNE on high-dimensional datasets with many classes.

One particularly interesting observation was how GP-tSNE produced similar overall visualisations at different model complexities, with the more complex models being more granular/refined than the simpler ones. This demonstrates the advantages of an evolutionary approach: good sub-trees of a complex tree can be used to improve simpler trees (and vice-versa) via crossover, allowing for more efficient search that produces models with different levels of trade-off. In the following section, we will explore this phenomenon in more depth to highlight the novel advantages of GP-tSNE.

Fig. 11: Complexity of 13.
Fig. 12: Complexity of 19.
Fig. 13: Complexity of 26.

Vi Further Analysis

The results of GP-tSNE on the Dermatology dataset (Fig. 4) were of particular interest as they showed similar results between GP-tSNE and t-SNE; showed the same general visualisation, but at different qualities across tree sizes; and achieved good visualisations even at reasonably small model complexities. To further demonstrate the value of GP-tSNE, this section analyses a selection of increasingly complex trees produced by the median GP run on this dataset.

Vi-a Simple Models

The simplest model which gives a reasonable visualisation is shown in Fig. 13, which contains one tree with three nodes, and the other with 10, for a total complexity of 13. Even with such a low complexity, three of the classes start to appear, with the blue, purple, and red classes already showing a clear separation from the others. The top tree, which gives the x-axis of the visualisation, can be written as . The bottom tree, which gives the y-axis is simply . These two trees use a total of seven unique features out of the 34 in the original feature set, yet they are able to separate three classes of the dataset well. In particular, the blue, red and purple classes can be separated out by drawing two vertical lines through the plot. In other words, two thresholds can be applied to the output of the top tree (x-axis) in order to roughly separate the classes into three groups: red; orange and green; and purple and blue. Given that all features have the same range of and the only feature subtracted in the tree is , this suggests that the blue and purple classes have particularly small feature values for (as they have high x values), and the red class has particularly large values (as it has low x values). in the Dermatology dataset corresponds to the “thinning of the suprapapillary epidermis” feature, which is a feature commonly associated with the skin condition of psoriasis [lever2015histopathology]. In the visualisation in Fig. 13, the red class corresponds to a diagnosis (class label) of psoriasis — and indeed, the red instances appear on the left side of the plot, which is consistent with having a high value of being subtracted from the rest of the tree. This sort of analysis can be used by a clinician to understand that this feature alone could be used to diagnose psoriasis with reasonable accuracy, and also provides greater confidence in the visualisation’s accuracy.

Fig. 14: Complexity of 33.
Fig. 15: Complexity of 59.
Fig. 16: Complexity of 104.

When the model complexity is increased to 19 (Fig. 13), the separation of points within classes starts to become better-defined, and the orange class starts to become more distinct from the yellow and green classes. The top tree actually uses fewer unique features (three) than at a complexity of 13, with significant weight put on : , whereas the bottom tree uses four unique features: , again for a total of seven unique features across both trees. The blue class (“lichen planus”) is clearly distinct from the other classes along the x-axis — given that the top tree weights heavily, it seems likely that a high value of this feature is characteristic of the “lichen planus” diagnosis. Indeed, the dermatology literature commonly reports on this symptom being indicative of this diagnosis [lever2015histopathology].

The trees shown in Fig. 13 appear very similar to the previous model, with the top tree varying only by the omission of one nf node, and the introduction of the subtraction of the nf feature. This is, however, sufficient to cause some of the orange points to be separated from the yellow and green points along the x-axis — indicating that higher values of nf may indicate a point belongs to the orange class rather than the yellow one. The major change in the bottom tree is the introduction of the nf and nf features, which helps to compact the blue class and starts to separate the red class more strongly.

Figure 16 further pushes the blue class away from the other classes, at the expense of squashing the other classes together. This is achieved by making the top tree (x-axis) weight f and f even more strongly. It is interesting to note that all the trees analysed so far have been strictly linear functions, utilising only arithmetic and subtraction. This is perhaps not unexpected: utilising more sophisticated functions is likely to require complex trees to fully take advantage of them — for example, taking the maximum of two simple sub-trees is unlikely to be more representative of the original dataset than simply adding together features. This also has the benefit of making the simpler trees easier to interpret and understand. While there are many methods for doing linear transformations for visualisation, such as PCA, these methods generally function by weighting all features to different extents; GP-tSNE only selects a subset of features to be used, and so is inherently more interpretable. The sequential analysis of increasingly more complex models clearly provides additional insight that would be lacking in a non-multi-objective approach. This analysis technique is also an exclusive benefit of GP-tSNE among other visualisation techniques which are primarily black boxes with one visualisation produced per run.

Vi-B Complex Models

From here, the improvement in quality with the addition of more complexity begins to show clear diminishing returns, with the cost decreasing by only about 0.1 as the complexity is roughly doubled from 33 to 59, and from 59 to 104. Figure 16 gives a visualisation that is very similar to that of the maximum complexity (see Fig. 4), with the yellow and green classes overlapping but all other classes clearly distinctly separated. This is also the first stage at which constants and non-linear operators such as and are introduced. The trees used in previous models are still recognisable as sub-trees (or “building blocks”) in the more complex trees in Fig. 16 — it may be possible to analyse the more complex trees based on what has been added in addition to these sub-trees. Figure 16 does not change the class separations present in Fig. 16, but better separates within-class instances, as well as moving the purple class away from the other classes.

We do not attempt to analyse models with a complexity of over 100 in detail, as the trees become difficult to interpret easily, and eventually become as much of a black box as t-SNE itself is. The main aim of this paper is to produce good-quality (not state-of-the-art) visualisations that are interpretable in order to provide deeper insight into the relationships within a dataset in terms of the features it uses. If visualisation quality is the primary goal, then methods such as t-SNE are expected to be more appropriate, given they are not constrained to finding a functional mapping from the original feature space to the visualisation axes. While the very complex GP trees cannot be feasibly interpreted, they are still important to the GP-tSNE algorithm. During the evolutionary process, it is common to see complex trees with low cost being automatically simplified through the removal of superfluous sub-trees via crossover and mutation, allowing reductions in cost to “trickle down” to simpler models along the front. When we restricted the EMO search to only simple trees, we found that results were much poorer, with smaller trees having much higher cost. The complex trees are also useful outright as they can be directly applied for visualising future instances (e.g. streaming data), whereas t-SNE must re-optimise its embedding each time.

Vi-C Summary

In this section, we showed how progressive examination of the approximation front from least to most complex models allowed insight into the structure of the data that is not easily achievable through traditional visualisation algorithms. Using the Dermatology dataset as a case study, we showed that even simple models could separate out some classes clearly, with the use of only a few features. As we increased the complexity, we found that the granularity/local structure within classes improved, but the overall patterns changed only slowly. In this way, it is possible to use simple, understandable models to provide insight into how more complex models are able to produce good-quality visualisations.

Vii Conclusion

This paper highlighted the need for research into machine learning methods that can not only produce clear visualisations, but do so through a model that can be interpreted itself to give insight into how the visualisation arose from the features of the dataset. The use of GP was suggested due to its functional model structure, and a multi-objective approach was proposed that gave a staggered front of visualisations with different levels of trade-off between visual clarity and model complexity. Results on nine different datasets highlighted the promise in using a GP approach for this task, with visualisations produced that showed clear characteristics of datasets even at model complexities that could be low enough to understand. This was further showcased through an in-depth analysis of an evolved front on the Dermatology dataset, where concrete insight into how the dataset was structured was found by examining a range of models with different complexities.

As this is the first multi-objective GP approach to this problem, there is a clear need for continued future research. While the quality of visualisations were encouraging, on more complex datasets they fell short of those produced by t-SNE. This is not unexpected given t-SNE does not produce a functional mapping, but rather an embedding of the data. The measure of visualisation quality used in this work was based on the one used by t-SNE given it is the state-of-the-art measure. However, the design of t-SNE was constrained by the need for a differentiable cost function; as an EC-based method, GP-tSNE need not have this constraint on its objective function — there is likely to be better measures of visualisation quality that could be used in an EC context. It was also found that the more unique features used by a GP tree, the more difficult it was to understand; some sort of evolutionary pressure towards trees using a small set of (cohesive) features may improve this.