The processing and extraction of information from large and noisy data sets is a challenging problem in Computer Science. Techniques from algebraic topology have gained the attention of scientists in recent years, giving rise to an emerging research field called Topological Data Analysis (TDA) [Carlsson:Bulletin, EdelsbrunnerHarer2010]. TDA is an approach to infer the topology underlying a dataset by using combinatorial algebraic structures known as simplicial complexes. TDA also involves the computation of properties that remain invariant under continuous transformations of these simplicial complexes: a process known as persistent homology [EdelsbrunnerHarer2010].
For decades, the high dimensionality of datasets, coupled with the combinatorial and continuous character of topology, has made computing persistent homology a challenge that has been addressed by several authors. Edelsbrunner et al. [EdelsbrunnerHarer2010] present an efficient algorithm and its visualization as a persistence diagram [EdelsbrunnerHarer2010, zomorodian_2005]. Carlsson et al. [Carlsson:Bulletin] strengthened the mathematical foundations and also proposed another visualization tool called persistence barcodes [Ghrist2008, Carlsson:Bulletin]. Further developments in the TDA field are derived from those initial works.
As a consequence of their combinatorial nature, the construction and representation of simplicial complexes also pose a challenge. Many works have dealt with the efficient construction and representation of filtered simplicial complexes. Data structures and algorithms have been developed [DBLP:journals/talg/BoissonnatS18, Zom2010, DBLP:journals/algorithmica/BoissonnatST17, DBLP:journals/algorithmica/BoissonnatM14], mainly focused on the construction of Čech, Rips, and other kinds of simplicial complexes such as Witness, Alpha, Delaunay, Tangent, and Cover complexes. Theoretical and practical results have been organized as TDA libraries: GUDHI [DBLP:journals/algorithmica/BoissonnatM14, gudhi2014], Dionysus, Ripser, Dipha, Perseus, and JavaPlex. A complete benchmark of those libraries can be found in [Otter2017].
Regarding the use of TDA for classification, a TDA-based method was used in [tdaretina] for classifying high-resolution diabetic retinopathy images. The authors computed persistent homology in a preprocessing stage to detect topological features, which were encoded into persistence diagrams. A support vector machine (SVM) was then used to classify the images according to the persistence descriptors, which discriminated between diabetic and healthy patients.
Moreover, TDA has been applied to time-series analysis [TDAapp2017]. One common pipeline is to treat the time series as a dynamical system and reconstruct its attractor, embedding the signal into the phase domain, where the embedded points form a manifold around the attractor [DBLP:journals/corr/VenkataramanRT16, umeda2017]. Persistent homology or another TDA tool is applied to this phase-space manifold to create topological descriptors [timeserie_tda_2016], and, as a final step, a machine learning method such as k-NN, CNN, or SVM is applied. Recently, TDA has been applied in deep learning to address the interpretability problem [GunnarGab2018], to regularize loss functions [gabrielsson20a], and to build persistence layers that incorporate topological information during learning [persistencelayer2017, gabrielsson20a].
What all those examples of TDA applications have in common is that TDA has been used as a preprocessing stage for conventional Machine Learning (ML) algorithms. However, during the TDA pipeline execution, multi-scale relationships among data appear and disappear. The lifespan of such a relationship, from the moment it appears until it merges with another, is called its persistence. The persistence of many of those relationships is captured and represented by persistence diagrams or barcodes. Taking advantage of the entire TDA pipeline, and not just its result, could help address some of the current challenges of supervised and semi-supervised learning, such as imbalanced data classification; identification and correction of mislabeled data; missing data analysis; and dimensionality reduction.
In this scenario, this paper proposes a methodology to build a TDA pipeline that is able to classify balanced and imbalanced datasets with no further ML stage. The fundamental idea is to provide neighborhoods on a filtered simplicial complex related to a point set (a simplex), or to a single point as a special case. Those neighborhoods will be, in fact, sub-complexes of the filtered simplicial complex built on the dataset. Persistent homology is used to guide the detection of an appropriate sub-complex from the entire filtration. A labeling process then propagates labels from labeled points to unlabeled points, taking advantage of the simplicial relationships.
To assess it, the proposed method is compared with several baseline classifiers. One of the baseline algorithms is k-NN, one of the most popular supervised classification methods. Another baseline method is an enhanced version of k-NN, the weighted k-NN (wk-NN), especially suited for imbalanced datasets. This document is organized into several sections. Section 2 presents the mathematical foundations used in this work. Section 3 explains the concepts, algorithms, and methodology of the proposed classification method. Next, Section 4 describes algorithms, datasets, the experimental protocol, evaluation criteria, and the selected metrics to assess the proposed method's performance. In Section 5, the results and implementation details of the proposed method are explained. Conclusions are presented in Section 6.
2 Mathematical foundations
In this section, mathematical definitions are introduced (simplices, simplicial complexes, the Čech and Rips complexes, and the star and link concepts). Concepts such as persistent homology, filtration, sub-complex, and filtration levels are briefly presented. For more detailed definitions, please see [EdelsbrunnerHarer2010].
2.1 Simplicial Complexes
Simplicial complexes are combinatorial and algebraic objects which represent a discrete space homotopically equivalent to a data space. Concepts related to simplicial complexes are briefly defined as follows: a q-simplex $\sigma$ is the convex hull of $q+1$ affinely independent points $x_0, \ldots, x_q \in \mathbb{R}^d$. In this case, the set $P_\sigma = \{x_0, \ldots, x_q\}$ is called the set of vertices of $\sigma$, and the simplex $\sigma$ is generated by the set $P_\sigma$; this relation will be denoted by $\sigma = \langle P_\sigma \rangle$. A q-simplex has dimension $q$ and $q+1$ vertices. Given a q-simplex $\sigma$, a d-simplex $\tau$ with $d \le q$ and $P_\tau \subseteq P_\sigma$ is called a d-face of $\sigma$, denoted by $\tau \le \sigma$, and $\sigma$ is called a q-coface of $\tau$, denoted by $\sigma \ge \tau$. Note that the 0-faces of a q-simplex are the elements of $P_\sigma$, the 1-faces are line segments with endpoints in $P_\sigma$, and so forth. A q-simplex has $\binom{q+1}{d+1}$ d-faces and $2^{q+1}-1$ faces in total.
In order to define homology groups of topological spaces, the notion of simplicial complexes is central:
Definition 1 (Simplicial complex): A simplicial complex $K$ is a finite collection of simplices such that: i) every face of a simplex of $K$ is also in $K$; and ii) the intersection of any two simplices of $K$ is either empty or a face of both.
The dimension of $K$ is the maximum dimension of its simplices.
There are many known simplicial complexes, though two of the most popular are the Čech [Ghrist2008, EdelsbrunnerHarer2010] and Vietoris-Rips complexes. In the following definitions, $B_\varepsilon(x)$ denotes the open ball of radius $\varepsilon$ centered at $x$, namely $B_\varepsilon(x) = \{y \in \mathbb{R}^d : \|y - x\| < \varepsilon\}$.
Definition 2 (Čech complex): Let $X$ be a finite subspace of $\mathbb{R}^d$ and fix $\varepsilon > 0$. The Čech complex $\check{\mathrm{C}}_\varepsilon(X)$ is a simplicial complex where the vertices or 0-simplices are the elements of $X$, and a set of vertices $\{x_0, \ldots, x_q\}$ defines a q-simplex if $\bigcap_{i=0}^{q} B_\varepsilon(x_i) \neq \emptyset$.
Definition 3 (Vietoris-Rips or VR complex): Let $(X, d)$ be a finite metric space and fix $\varepsilon > 0$. The Vietoris-Rips complex $\mathrm{VR}_\varepsilon(X)$ is a simplicial complex where the 0-simplices are the elements of $X$, and a set of vertices $\{x_0, \ldots, x_q\}$ defines a q-simplex if $d(x_i, x_j) \le 2\varepsilon$ for all $0 \le i, j \le q$.
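As a concrete illustration, the VR condition above can be checked by brute force on a small point cloud. The following sketch (plain Python with illustrative names; real libraries such as GUDHI use far more efficient data structures) includes a q-simplex whenever all pairwise distances among its vertices are at most 2ε:

```python
from itertools import combinations
from math import dist

def rips_complex(points, epsilon, max_dim=2):
    """Brute-force Vietoris-Rips complex: a q-simplex is included when
    all pairwise distances among its vertices are <= 2*epsilon."""
    n = len(points)
    close = lambda i, j: dist(points[i], points[j]) <= 2 * epsilon
    simplices = [(i,) for i in range(n)]  # 0-simplices: one per point
    for q in range(1, max_dim + 1):
        for comb in combinations(range(n), q + 1):
            # every pair of vertices must satisfy the distance condition
            if all(close(i, j) for i, j in combinations(comb, 2)):
                simplices.append(comb)
    return simplices
```

For instance, three points forming a small triangle plus one distant point yield four vertices, three edges, and one 2-simplex; the distant point stays isolated. This enumeration is exponential in the number of points, which is precisely why the VR complex is stored as a graph and expanded combinatorially in practice.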
From the above definitions it follows that $\check{\mathrm{C}}_\varepsilon(X) \subseteq \mathrm{VR}_\varepsilon(X)$; a proof is given in [EdelsbrunnerHarer2010], and this relationship is shown in Figure 1. The Čech complex is intrinsically a high-dimensional simplicial complex. From a computational standpoint, the VR complex is more feasible (i.e., lower storage and time complexity) than the Čech complex, even though the VR complex has more simplices in general. Compared to the Čech complex, the VR complex does not need to be stored entirely: it can be stored as a graph and reconstituted combinatorially [Ghrist2008]. Although the results in this paper could be applied to several simplicial complexes with minor changes, this document focuses on the Čech and VR complexes.
Definition 4 (Star, Closure, Closed Star, and Link): Let $K$ be a simplicial complex, and let $\sigma \in K$ be a q-simplex. The star of $\sigma$ is the set of all cofaces of $\sigma$ in $K$ [EdelsbrunnerHarer2010]: $\mathrm{St}(\sigma) = \{\tau \in K : \sigma \le \tau\}$.
Let $S$ be a subset of simplices of $K$. The closure of $S$ is the smallest simplicial complex containing $S$: $\mathrm{Cl}(S) = \{\tau \in K : \tau \le \gamma \text{ for some } \gamma \in S\}$.
The star is in general not a simplicial complex, because of missing faces. The smallest simplicial complex that contains $\mathrm{St}(\sigma)$ is the closed star (closure of the star) of $\sigma$: $\overline{\mathrm{St}}(\sigma) = \mathrm{Cl}(\mathrm{St}(\sigma))$.
The link of $\sigma$ is the set of simplices in its closed star that do not share any face with $\sigma$ [EdelsbrunnerHarer2010]: $\mathrm{Lk}(\sigma) = \{\tau \in \overline{\mathrm{St}}(\sigma) : \tau \cap \sigma = \emptyset\}$.
The concept of the link of a simplex in a simplicial complex will be important throughout this paper. For this reason, we present two equivalent characterizations of this set:
Lemma 1: Let $K$ be a simplicial complex and $\sigma \in K$. Then $\mathrm{Lk}(\sigma)$ coincides with the set $\mathrm{Cl}(\mathrm{St}(\sigma)) \setminus \mathrm{St}(\mathrm{Cl}(\sigma))$ (Equation 2).
Let $\tau$ be a simplex in $\mathrm{Lk}(\sigma)$. In particular, $\tau$ belongs neither to $\mathrm{St}(\sigma)$ nor to $\mathrm{St}(\mathrm{Cl}(\sigma))$, since any simplex in one of these two sets necessarily intersects $\sigma$; then $\tau \in \mathrm{Cl}(\mathrm{St}(\sigma)) \setminus \mathrm{St}(\mathrm{Cl}(\sigma))$.
If $\tau$ is a simplex in $\mathrm{Cl}(\mathrm{St}(\sigma)) \setminus \mathrm{St}(\mathrm{Cl}(\sigma))$, then there exists $\gamma \in \mathrm{St}(\sigma)$ such that $\tau \le \gamma$ and $\tau \cap \sigma = \emptyset$. It follows that $\tau \in \overline{\mathrm{St}}(\sigma)$ and $\tau \in \mathrm{Lk}(\sigma)$.
Finally, if $\tau \in \mathrm{St}(\mathrm{Cl}(\sigma))$, then $\gamma \le \tau$ for some $\gamma \in \mathrm{Cl}(\sigma)$. It follows that $\tau \cap \sigma \neq \emptyset$, but then $\tau \notin \mathrm{Lk}(\sigma)$. Then $\mathrm{Lk}(\sigma) \cap \mathrm{St}(\mathrm{Cl}(\sigma)) = \emptyset$, and the equivalence of sets is established.
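These operators, and the identity in Lemma 1, can be sketched directly from the definitions. The snippet below is an illustrative plain-Python sketch (simplices are represented as sorted vertex tuples; all names are ours) that computes the star, closure, closed star, and link of a simplex, and a second link implementation following the lemma:

```python
from itertools import combinations

def faces(simplex):
    """All nonempty faces of a simplex, including the simplex itself."""
    return {f for q in range(1, len(simplex) + 1)
              for f in combinations(simplex, q)}

def star(K, sigma):
    """St(sigma): all cofaces of sigma in the complex K."""
    s = set(sigma)
    return {tau for tau in K if s <= set(tau)}

def closure(S):
    """Cl(S): smallest simplicial complex containing the simplices in S."""
    return set().union(*(faces(tau) for tau in S)) if S else set()

def closed_star(K, sigma):
    return closure(star(K, sigma))

def link(K, sigma):
    """Lk(sigma): simplices of the closed star disjoint from sigma."""
    s = set(sigma)
    return {tau for tau in closed_star(K, sigma) if not s & set(tau)}

def link_via_lemma(K, sigma):
    """Lemma 1: Lk(sigma) = Cl(St(sigma)) \\ St(Cl(sigma))."""
    st_cl = set().union(*(star(K, g) for g in closure({sigma})))
    return closed_star(K, sigma) - st_cl
```

On the complex generated by the triangle (0,1,2) and the edge (2,3), the link of vertex 2 is {(0,), (1,), (3,), (0,1)}, and both implementations agree, which is exactly the equivalence the lemma states.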
Figure 2 presents an example of $\mathrm{St}(v)$ and $\mathrm{Lk}(v)$ of a point $v$ from a given simplicial complex built on a point set.
2.2 Persistent Homology
Persistent homology is a tool to find topological features in a metric space [EdelsbrunnerHarer2010, Carlsson:Bulletin]. As a general rule, the objective of persistent homology is to track how topological features on a topological space appear and disappear when a scale value (usually a radius) varies incrementally, in a process known as filtration [EDelsbrunnerMorozov2014, zomorodian_2005].
Definition 5 (Sub-complex): Let $K$ be a simplicial complex. $L$ is a sub-complex of $K$ if $L \subseteq K$ and $L$ is also a simplicial complex.
Definition 6 (Filtration): Let $K$ be a simplicial complex. A filtration on $K$ is a succession of increasing sub-complexes of $K$: $\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_m = K$.
In this case, $K$ is called a filtered simplicial complex.
In most simplicial complexes where the simplices are determined by proximity under a distance function (as in the case of the Čech or VR complexes), a filtration on a simplicial complex is obtained by taking an increasing sequence of positive values $\varepsilon_1 < \varepsilon_2 < \cdots < \varepsilon_m$, where the complex $K_i$ corresponds to the value $\varepsilon_i$.
Definition 7 (Filtration level function).
Let $K$ be a finite simplicial complex and $F$ a filtration on $K$: $\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_m = K$. The filtration level function is defined on $\sigma \in K$ by $\mathrm{fl}(\sigma) = \min\{i : \sigma \in K_i\}$.
A filtration can be understood as a method to build the whole simplicial complex incrementally from a "family" of sub-complexes sorted according to some criterion, where each level corresponds to the "birth" or "death" of a topological feature as described in Definition 10.
This process is illustrated in Figure 3.
Definition 8 (Filtration value collection).
Let $K$ be a filtered simplicial complex and $F$ a filtration on $K$. Let $V = \{\varepsilon_0, \varepsilon_1, \ldots, \varepsilon_m\}$ be a set of non-negative numbers such that $\varepsilon_0 < \varepsilon_1 < \cdots < \varepsilon_m$, where $\varepsilon_i$ is the filtration value (radius) applied to build the sub-complex $K_i$ in the filtration $F$. The set $V$ is called the filtration value collection associated to $F$.
Definition 9 (Filtration value of a q-simplex).
Let $K$ be a filtered simplicial complex and $V$ its filtration value collection. Let $\sigma \in K$ be a q-simplex. If $\sigma \in K_i$ but $\sigma \notin K_{i-1}$, then $\varepsilon_i$ is the filtration value of $\sigma$, denoted $\mathrm{filt}(\sigma) = \varepsilon_i$.
Note that $\mathrm{filt}(\tau) \le \mathrm{filt}(\sigma)$ whenever $\tau \le \sigma$, which means that in a filtered simplicial complex $K$, every simplex appears no later than all its cofaces.
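In a VR filtration, for instance, the filtration value of a simplex is determined by its longest edge, which makes this monotonicity easy to verify. A minimal sketch (our own illustrative code, assuming Euclidean points and the diameter convention for VR filtration values):

```python
from itertools import combinations
from math import dist

def vr_filtration_value(points, simplex):
    """Filtration value of a simplex in a Vietoris-Rips filtration:
    vertices appear at value 0; a higher simplex appears when its
    longest edge does (i.e., at its diameter)."""
    if len(simplex) == 1:
        return 0.0
    return max(dist(points[i], points[j])
               for i, j in combinations(simplex, 2))

def is_monotone(points, simplex):
    """Check filt(tau) <= filt(sigma) for every face tau of sigma."""
    fs = vr_filtration_value(points, simplex)
    return all(vr_filtration_value(points, f) <= fs
               for q in range(1, len(simplex) + 1)
               for f in combinations(simplex, q))
```

For a 3-4-5 right triangle, the triangle appears at value 5 (its longest edge), after each of its edges, in line with the note above.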
Definition 10 (Birth and Death).
Birth is a concept to describe the filtration level when a new topological feature appears. Similarly, death refers to the filtration level when a topological feature disappears. Thus, a persistence interval (birth, death) is the "lifetime" of a given topological feature [EdelsbrunnerHarer2010].
3 Proposed Classification Method
Let $X$ be a finite point set in a feature space $\mathbb{F}$. Suppose $X$ is divided into two subsets $X = X_L \cup X_U$, where $X_L$ is the training set and $X_U$ is the test set. Let $L$ be the label set, and let $A \subseteq X \times L$ be the association space, which relates every point $x \in X$ with a unique label $l \in L$. Let $A_L$ and $A_U$ be the two disjoint association sets corresponding to $X_L$ and $X_U$, respectively, where $A = A_L \cup A_U$. In this setting, the real label list is the list of labels assigned to each element of $X_U$ in the association set $A_U$. Thus, the classification problem can be defined as how to predict a suitable label for every $x \in X_U$ by assuming the association set $A_U$ as unknown. Consequently, the predicted label list will be the resulting collection of labels after classifying each $x \in X_U$. Since $A_U$ is known in an experimental setting, it is common to use the real label list to evaluate the quality of the predicted one. Depending on the relative sizes of $X_L$ and $X_U$, the problem is known as supervised classification or semi-supervised classification.
A classification method based on TDA is presented in this section. Overall, a filtered simplicial complex is built over the whole point set to generate data relationships, only a few of which will be relevant. In this context, a relationship between points is real if it is part of the data's hidden structure. Thus, persistent homology is applied to capture the real structure of the dataset; this information helps detect a subset of relationships likely to be real. The proposed method is based on the supposition that the filtration contains a sub-complex whose simplices represent real data relationships. For every q-simplex in that sub-complex, its vertex set will be split into labeled and unlabeled points, where either subset could be empty. The fact that a set of points belongs to a common q-simplex implies a similarity or dissimilarity relationship between those points. This implicit relationship among data is applied to propagate labels from labeled points to unlabeled points. Thus, a link-based label propagation method is developed to make a suitable label prediction for each unlabeled point.
3.1 Link-based label propagation function
On a filtered simplicial complex, the neighborhood relationships of a q-simplex can be recovered by using the link, star, and closed star concepts (Definition 4). A key component of the proposed method is label propagation over a filtered simplicial complex. Given a simplicial complex, a separation between useful simplices and non-useful simplices (see Definition 11) needs to be considered. This classification of simplices is helpful because useful simplices contribute more labeling information during the propagation and label-assignment process. Accordingly, sub-complexes with an appropriate distribution of useful simplices will be preferred.
Definition 11 (useful-simplex and non-useful-simplex):
Let $K$ be a simplicial complex built on the dataset, and let $\sigma \in K$ be a q-simplex. We say $\sigma$ is a useful-simplex if it contains more labeled points (elements of the training set) than unlabeled points. Otherwise, $\sigma$ is a non-useful-simplex.
3.1.1 The labeling function
Let $K$ be a finite simplicial complex built on the dataset, and let $F$ be a filtration on $K$: $\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_m = K$. Suppose a preferred sub-complex $K_i$ in the filtration has been selected. Let $M$ be the free module with generators $e_1, \ldots, e_{|L|}$, one generator per label. The generator $e_j$ will be associated to the label $l_j \in L$ according to the following definition:
Definition 12 (Association function).
Let $s$ be the association function defined on a 0-simplex $v$ as $s(v) = e_j$ if $v$ is a labeled point with label $l_j$, and $s(v) = 0$ in any other case. The association function can be extended to a q-simplex $\sigma$ as $s(\sigma) = \sum_{v \in \sigma} s(v)$.
As an intermediate step to propagate labels from labeled points to unlabeled points by means of the link operation in simplicial complexes, define the extension function as follows:
Definition 13 (Extension function).
Let $\hat{s}$ be the function defined on a point $v$ by $\hat{s}(v) = \sum_{\sigma \in \mathrm{Lk}(v)} \frac{1}{\mathrm{filt}(\sigma)}\, s(\sigma)$ (Equation 3).
In Equation 3, for every q-simplex $\sigma$, the filtration value $\mathrm{filt}(\sigma)$ is applied to prioritize the influence of $\sigma$ on the labeling. Let $\sigma, \tau$ be two simplices such that $\mathrm{filt}(\sigma) < \mathrm{filt}(\tau)$. This condition implies that the vertices of $\sigma$ cluster together earlier than the vertices of $\tau$ do, since they were added to the filtration first. In consequence, the contributions of $\sigma$ should be more important than the contributions of $\tau$.
According to the previous definitions, given a point $v$, the evaluation of the extension function at $v$ would be $\hat{s}(v) = \sum_{j=1}^{|L|} c_j e_j$, where $c_j \ge 0$.
Definition 14 (Labeling function).
Let $v$ be a point such that $\hat{s}(v) = \sum_{j=1}^{|L|} c_j e_j$. If $c_k$ is the maximum value in $\{c_1, \ldots, c_{|L|}\}$, define the labeling function at $v$ as $f(v) = l_k$, where $k$ is uniformly selected from the set $\{j : c_j = c_k\}$.
If there exists a unique maximum in the set from the previous definition, then the labeling function is uniquely defined at $v$. In most of the datasets where the proposed TDA classification method was tested, the label assignment of each point was uniquely defined. Figure 4 shows the labeling process on a previously selected sub-complex.
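To make the propagation concrete, the following sketch mimics Definitions 12-14 on a precomputed link. It is our own illustrative code: the inverse-filtration weight is an assumed concrete form of Equation 3 (the paper's exact weighting may differ), filtration values are assumed strictly positive, and ties are broken by the first maximum rather than uniformly at random as Definition 14 prescribes.

```python
from collections import defaultdict

def label_point(link_simplices, labels):
    """Hedged sketch of the link-based labeling (Definitions 12-14).

    link_simplices: list of (vertex_tuple, filtration_value) pairs
        forming the link of the point to classify.
    labels: dict mapping labeled vertex ids to labels; unlabeled
        vertices are absent and contribute 0 (Definition 12).
    Each simplex votes for the labels of its labeled vertices,
    weighted by 1/filtration so earlier simplices weigh more."""
    scores = defaultdict(float)
    for simplex, filt in link_simplices:
        for v in simplex:
            if v in labels:
                scores[labels[v]] += 1.0 / filt
    # Definition 14: return the label with maximum accumulated score
    return max(scores, key=scores.get) if scores else None
```

For a link containing vertex 1 (label "a") at filtration 0.5, vertex 2 (label "b") at 1.0, and the edge (1, 2) at 1.0, label "a" accumulates 3.0 against 2.0 for "b", so the early, tight neighborhood wins.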
3.2 Classification by using simplicial complexes and persistent homology
The proposed method computes the predicted label list for the test set. The entire process is summarized in Algorithm 1, and the following subsections explain each step in detail.
3.2.1 Building the filtered simplicial complex.
Let the labeled (training) point set and the unlabeled (test) point set be given. The filtered simplicial complex is built on the union of both sets. A maximal dimension is given to control the combinatorial growth of the simplicial complex. Algorithm 2 illustrates this process.
3.2.2 Obtaining the most reliable label
Once a filtered simplicial complex is obtained, all the elements of the test set need to be labeled. Because of the full connectivity at the final filtration levels, it is highly likely that all the vertices are interconnected; in that case, the complex contains a simplex on every subset of vertices, and the number of simplices grows exponentially with the number of points.
Therefore, if the functions from Definitions 13 and 14 are computed using the full complex, it could happen that a given point receives contributions from all possible labels, due to the high number of cofaces of the analyzed simplices. In this context, it is worth understanding which simplices containing a given test point are reliable and which are noise for performing label propagation and labeling. The purpose, then, is to choose a sub-complex from the filtered simplicial complex such that: i) it constitutes a good approximation to the real structure of the dataset, and ii) it contains enough useful simplices to label each test point. In this vein, persistent homology is used to guide the selection of the sub-complex, reduce the classification space, and guarantee useful simplices inside the selected sub-complex.
With persistent homology, multi-dimensional topological features are detected. For a filtered simplicial complex of dimension $d$, persistent homology will compute up to $(d-1)$-dimensional homology groups. Each topological feature, represented by an element of a homology group of a given dimension, is represented by a persistence interval (see Definition 10).
The collection of simplices that shapes one topological feature belongs to a homology class. However, it is not trivial to recover information about every simplex belonging to that homology class, except for the simplex that generates the topological feature. For example, a 0-simplex connects to another 0-simplex, creating a 1-simplex. A 1-simplex that joins two other 1-simplices creates a hole. A 2-simplex attached to three other 2-simplices forms the boundary of a 3-simplex, enclosing a void. Precisely the moment when this happens is the birth of a topological feature and the death of another topological feature of lower dimension. As can be noticed, the relation between simplices and topological features is injective, and it only involves the generator. Thus, only the generator of a topological feature can be recovered from a persistence interval at its birth.
Based on the assumption that any simplex is related to only one persistence interval, this relation persists from its birth to its death. Therefore, all persistence intervals whose lifetimes intersect will have their corresponding simplices coexisting during the intersection time. Eventually, when a homology class dies, its connection with its simplices also dies, and those simplices are no longer considered, at least not directly. A conventional way to obtain a well-defined membership relation from simplices to a homology class is to look at the simplices associated with the birth of the class, which are the only simplices that can reliably be associated with it.
The challenge is to find appropriate homology classes to query for their associated simplices. It is known that long-lived invariants (high persistence) represent topological features, while short-lived invariants are commonly considered noise. However, short persistence intervals could also indicate local topological features or high-dimensional topological features, which could be richer in useful simplices. In this scenario, persistence intervals are considered from higher homology groups downwards (see Algorithm 3).
Persistent homology is then used to recover topological features which represent meaningful data relationships. Some of those topological features will hopefully determine (at their birth) a filtration level that maximizes the number of useful simplices associated with each test point. As a result, it is highly likely that a reliable label is obtained for every test point.
Persistent homology is computed according to [EdelsbrunnerHarer2010, EDelsbrunnerMorozov2014, DBLP:journals/dcg/SilvaMV11, Dey2014ComputingTP], and a collection is obtained with the persistence interval set of each homology group dimension. Algorithm 3 computes a persistence interval set corresponding to the non-empty homology group of highest dimension.
Let $d = (b, e)$ be a persistence interval. Then $\mathrm{life}(d) = e - b$. We notice that $\mathrm{life}$ becomes undefined for immortal topological features (i.e., infinite death time). To overcome this issue, it is enough to change the death time from infinite to the maximum of the filtration value collection $V$. We call this a $\Lambda$-transformation (see Equation 4). Thus, a new function is defined to apply the $\Lambda$-transformation before $\mathrm{life}(d)$ is called.
The maximum persistence interval: $\mathrm{sel}_M(D) = \arg\max_{d \in D} \mathrm{life}(d)$.
A persistence interval selected uniformly at random: $\mathrm{sel}_R(D) = d$, with $d$ drawn uniformly from $D$.
The closest interval to the average of the persistence intervals: $\mathrm{sel}_A(D) = \arg\min_{d \in D} \left| \mathrm{life}(d) - \tfrac{1}{|D|} \sum_{d' \in D} \mathrm{life}(d') \right|$.
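A sketch of the three selection criteria and the Λ-transformation might look as follows (function names are illustrative, not the paper's; lifetimes with infinite death are truncated to the maximum filtration value before any comparison):

```python
import random
import statistics

def life(interval, max_filtration):
    """Lifetime of a persistence interval (birth, death), applying the
    Lambda-transformation: an infinite death is replaced by the maximum
    filtration value before the lifetime is computed."""
    birth, death = interval
    if death == float("inf"):
        death = max_filtration
    return death - birth

def select_max(intervals, max_filt):
    """Maximum criterion: the interval with the longest lifetime."""
    return max(intervals, key=lambda d: life(d, max_filt))

def select_random(intervals, max_filt, rng=random):
    """Random criterion: an interval chosen uniformly at random."""
    return rng.choice(intervals)

def select_average(intervals, max_filt):
    """Average criterion: the interval whose lifetime is closest
    to the mean lifetime over all intervals."""
    mean = statistics.mean(life(d, max_filt) for d in intervals)
    return min(intervals, key=lambda d: abs(life(d, max_filt) - mean))
```

For the intervals (0.0, 1.0), (0.2, 0.5), and (0.1, inf) with a maximum filtration value of 2.0, the transformed lifetimes are 1.0, 0.3, and 1.9, so the maximum criterion picks the immortal interval while the average criterion picks (0.0, 1.0).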
Although homology groups of different dimensions are distinct, the birth and death values of persistence intervals live on a common filtration scale. Even when persistence intervals containing high-dimensional invariants are selected, low-dimensional topological features are not necessarily excluded. In Algorithm 4, a persistence interval is selected from a filtration to recover a sub-complex and classify all test points.
Because of the injectivity between simplices and the birth times of persistence intervals, the sub-complex at the birth time might be selected. Nevertheless, the sub-complexes at the middle time ((birth + death)/2) and at the death time could be selected as well. In these cases, the middle and death times of the persistence interval capture all those simplices which are generators of topological features still alive (or born) at those times. The choice between birth, middle, or death time for selecting the most appropriate sub-complex seems related to the homology group dimension. When the selected homology group has a high dimension, the birth time gives good classification precision. On the other hand, if a 0-dimensional homology group is selected, the death-time sub-complex should be the best choice; the middle-time sub-complex could also be selected for 1-dimensional homology groups. As a result, the birth time was chosen (see Algorithm 4) to select the sub-complex, because the generator is always guaranteed to be present at that time.
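Recovering the sub-complex at the chosen time then amounts to filtering simplices by their filtration value, as in this minimal sketch (our own code; simplices are paired with their filtration values, in the style of GUDHI's simplex-tree iterators):

```python
def subcomplex_at(filtered_simplices, t):
    """K_t: all simplices that have appeared by time t (e.g., the birth
    time of the selected persistence interval, as in Algorithm 4).
    Faces appear no later than their cofaces, so the result is a
    well-defined sub-complex."""
    return [(s, f) for (s, f) in filtered_simplices if f <= t]
```

For example, cutting a small filtration at the birth time of an edge keeps the vertices and that edge while discarding later triangles.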
Figure 5 shows the selection of a sub-complex from a filtered simplicial complex built on the Circles dataset (with noise = 10), one of the artificial datasets used to evaluate the proposed method in Section 4. The selection is guided by the persistent homology information and by applying the selection functions (see Equations 5 and 6) to pick an appropriate persistence interval according to each criterion. Note that two of the selection criteria coincided, so the results of only one persistence interval are shown in Figure 5.
An overview of the proposed method is presented in Figure 6. To classify two black points, a 4-step process is executed.
The proposed TDA-based classifier (TDABC) was implemented on top of the GUDHI library [gudhi2014, DBLP:journals/algorithmica/BoissonnatST17, DBLP:journals/talg/BoissonnatS18], which is one of the most complete libraries for building simplicial complexes [gudhi:FilteredComplexes] and computing homology groups [DBLP:journals/algorithmica/BoissonnatST17, Otter2017, DBLP:journals/talg/BoissonnatS18, DBLP:journals/algorithmica/BoissonnatM14].
3.3.1 Simplicial complex construction and persistence computation
The first step of the proposed TDABC method (see Figure 6) is building a filtered simplicial complex on the dataset. For datasets with high dimensionality or too many samples, the implementation of Algorithm 2 in GUDHI could be impractical due to combinatorial complexity. Consequently, the combinatorial complexity of the simplicial complex must be reduced. To address this problem, the approach followed in this paper uses the edge collapse method [gudhiContraction] from the GUDHI library. Edge collapse in GUDHI has to be performed on the 1-skeleton of the simplicial complex, which is then expanded to build all higher-dimensional simplices up to a maximal dimension. Algorithm 5 computes a simplex tree using the edge-collapse method. The collapsing coefficient depends on the maximal dimension, but it could be enhanced by invoking the collapse repeatedly until the simplex tree no longer changes.
3.3.2 Persistence computation and persistence interval selection
In GUDHI, instead of persistent homology, persistent cohomology is computed, using the algorithms of [DBLP:journals/dcg/SilvaMV11, Dey2014ComputingTP] and the Compressed Annotation Matrix data structure implementation presented in [annomatrix]. Due to the duality between homology and cohomology, both methods compute the same homological information, but cohomology provides richer topological information [DBLP:journals/dcg/SilvaMV11]. Algorithm 3's implementation in GUDHI is direct, using the persistence method of the simplex tree data structure.
3.3.3 Label propagation implementation
The extension function from Definition 13 depends on the link operation from Definition 4. Up to now, the Python interface of the GUDHI library (v3.3.0) [gudhi:cython] does not provide an implementation of the simplex link operation; however, it provides implementations of the star and coface operators. As a result, a link operation was implemented based on Equation 2 from Lemma 1.
According to Lemma 1, the extension function from Definition 13 could be implemented based on the closed star. In addition, two ways of removing the unwanted contributions are possible: a strict one and a belated one. The first method computes the link strictly using Lemma 1; its advantage is reducing the number of invocations of the association function from Definition 12. In the second method, the closed star is used as a whole, and the extra contributions are ignored during execution, because the association function returns 0 for unknown 0-simplices. Lemma 1 and Definition 13 show that both approaches are equivalent.
In GUDHI, each q-simplex represented in a simplex tree is stored together with its filtration value. Thus, the star operation is a function which returns a set of 2-tuples (simplex, filtration value). This facilitates the implementation of the extension function and the recovery of the filtration values to impose a priority over simplices and minimize ties.
The proposed TDA-based classifier (TDABC) is sensitive to the chosen selection function, since each selection function detects a specific sub-complex in the filtered simplicial complex built on the dataset. Due to this dependency, the proposed method's behavior needs to be explored using those functions. Consequently, three versions of the TDABC method are configured to assess the proposed solutions:
The TDABC-R classification method, which uses the random selection function.
The TDABC-M, which uses the maximum selection function.
The TDABC-A, which uses the average selection function.
4.1 Selected baseline classifiers
The following baseline methods were selected to compare against the proposed methods:
The k-Nearest Neighbors (k-NN) implementation from Scikit-Learn [scikit-learn] was chosen.
The distance-based weighted k-NN from Scikit-Learn was also selected to assess the proposed methods.
Several data sets were chosen to evaluate the proposed methods and compare them to the baseline classifiers. Table 1 shows the datasets with some of their characteristics.
(Table 1 excerpt) Wine: 13 features, 3 classes, 178 samples, class distribution [59, 71, 48].
There are datasets with more than 3 dimensions. When a dataset involves more than 3 dimensions, Principal Component Analysis (PCA) was applied to reduce the dimensionality for visualization purposes only. The resulting datasets were then plotted taking pairwise variables to provide several two-dimensional points of view with the axes XY, XZ, and YZ, respectively.
4.2.1 Artificial datasets
A group of datasets was artificially generated: The Circles, Swissroll, Moon, Normdist, and Sphere datasets (see Figure 8). In this section, details regarding each one of those datasets will be provided.
The Circles dataset is a simple artificial dataset that consists of a large circle with a smaller circle inside. Both circles are Gaussian data with a spherical decision boundary for binary classification. A Gaussian noise factor of 3 was added to the data, making the circular boundary more diffuse. This dataset was proposed to assess the ability to disentangle or deal with overlapping data regions. The label set denotes both circles, and the point set comprises all sample points from both circles. Figure 7 shows the Circles dataset without noise, and Figure 8 presents it with a noise factor of 3. The noisy Circles dataset was selected for the experiments.
The Moon dataset is a simple dataset generated by making two interleaving half circles. A noise factor of 3 was added to the data to make it difficult to separate the two half circles. The label set denotes both classes, and the point set is composed of all generated samples of the dataset. Figure 7 shows the Moon dataset sample distribution without noise, and Figure 8 shows it with the applied noise.
The Swissroll dataset is a two-dimensional point set mapped into three dimensions with a rolled shape. In this paper, a Swissroll dataset was generated using 300 samples from 5 different classes. In addition, a noise factor of 10 was added to the data, which dissolves the rolled shape almost entirely. The label set is composed by enumerating all classes, and the generated samples are directly used to build the point set. Figure 7 shows the Swissroll dataset without noise, and Figure 8 shows it with noise.
An Artificial Dataset Generator framework was implemented for generating dataset distributions. This framework is flexible enough to simulate several complex situations: it is possible to define the desired number of objects (classes), the number of samples per class, and the mean and standard deviation either globally or per class.
The Normal Distribution based Dataset
is generated by defining several per-class and overall parameters, such as dataset size, sample dimension, mean per class, standard deviation per class, number of samples per class, and number of objects. The number of objects determines the number of classes or labels in the dataset. The dimensionality of the dataset is handled by generating a normal distribution in each component.
An artificial dataset based on mixtures of normal distributions was generated using the dataset generation framework. This dataset is a high-dimensional point set with a total size of 300 samples. The point set is composed by generating a normal distribution across each component, and the class sizes are imbalanced. To guide the dispersion and density of the point cloud, a collection of mean values and a standard deviation per label were used. Figure 8 shows the samples' distribution after PCA was applied for visualization.
Generating a Sphere-based dataset is similar to generating a normal distribution-based one. Although these datasets are always three-dimensional, they are oriented to capture problems associated with data shape, entanglement between samples of different classes, and diverse class sample distributions and sizes. Figure 8 shows a sphere-based dataset with a total size of 653 samples. The label distribution is also imbalanced. The mean and the standard deviation are equal for each label's sample subset.
4.2.2 Real datasets
The Iris, Wine, and Breast Cancer datasets were selected as real datasets to compare the proposed classifiers and the baseline ones. In this section, each real dataset will be explained and several of their characteristics will be described.
The Iris dataset [Dua:2019] contains 3 classes of 50 instances each, where each class refers to a type of Iris plant. One class is linearly separable from the other two, which are not linearly separable from each other (see Figure 9). The labels correspond to the classes "Setosa", "Versicolor", and "Virginica". Each sample in the Iris dataset is a 5-tuple (sepal_length, sepal_width, petal_length, petal_width, label). The type of Iris plant is the predicted attribute, and the point set is built using the first four components of each sample.
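The split into a four-component point set and a predicted label can be sketched with scikit-learn's bundled copy of the UCI Iris data:

```python
from sklearn.datasets import load_iris

iris = load_iris()
# Point set: the four numeric components; labels: the predicted attribute.
X, y = iris.data, iris.target   # X has shape (150, 4); y takes values 0..2
```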
The Wine dataset [Dua:2019] is the result of a chemical analysis of wines grown in the same region of Italy by three different growers. Thirteen different measurements were taken of components found in the three types of wine. The label set enumerating the three wine types is taken from the first component of each sample, and the point set is completed using the remaining 13 components. Figure 9 shows the Wine sample distribution after applying PCA to reduce the data to three dimensions, plotted by combining pairs of dimensions.
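A minimal sketch of loading the Wine data and applying the PCA reduction to three dimensions described above, using scikit-learn's bundled copy of the UCI dataset (the exact plotting code is not shown):

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA

wine = load_wine()
X, y = wine.data, wine.target        # 13 measurements, 3 grower labels
X3 = PCA(n_components=3).fit_transform(X)  # reduce to 3 dims for plotting
```

Pairs of columns of `X3` can then be plotted against each other, as in Figure 9.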
The Breast Cancer dataset [Dua:2019] features were computed from digitized images of fine needle aspirates (FNA) of breast masses. They describe characteristics of the cell nuclei present in each image. The labels denote Malignant (0) and Benign (1) tumors, and each sample in the point set represents the cell-nuclei information of one image. Figure 9 shows this dataset after a PCA process, visualized from several 2-dimensional perspectives.
4.3 Classifier Evaluation
The classifier evaluation over all datasets was conducted using a repeated cross-validation process (see A for details). The aim is to avoid biased results caused by the training and test sets being drawn from the same dataset each time.
Let R be the number of folds in the repeated cross-validation approach, to avoid confusion with the k used in k-NN. The R-fold cross-validation is repeated 5 times (N = 5), with each fold containing 10% of the selected dataset. In each iteration, R − 1 folds form the training set and the remaining fold is the test set. When there are more unknown samples to classify than labeled samples, the problem is considered semi-supervised.
It is common in ML algorithms to use parameters whose values are set before the learning process begins; these are called hyper-parameters [scikit-learn, Japkowicz2011ELA, tom97ML]. For the k-NN and wk-NN algorithms, k = 15 was considered a good number of neighbors; it was obtained using the hyper-parameter estimators from scikit-learn [scikit-learn]. For the three TDABC algorithms, the maximal simplex dimension needs to be fixed to control the VR-complex construction process, and experiments were conducted over a range of maximal dimensions. In B, a detailed explanation of the selected metrics is given. The presented results were obtained on two computers: 8 GB RAM, Intel Core i7-6500U CPU 2.50 GHz x 4, and 16 GB RAM, AMD Ryzen 7 3700U 2.30 GHz x 4.
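The neighbor-count estimation mentioned above can be reproduced with scikit-learn's `GridSearchCV`; this sketch uses the Iris data and a search range that is an assumption, not the paper's exact setup:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# Assumed search space: k in 1..30, scored by 5-fold cross-validation.
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": range(1, 31)}, cv=5)
grid.fit(X, y)
best_k = grid.best_params_["n_neighbors"]
```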
Each classifier is executed multiple times by the repeated cross-validation process, producing one collection of predicted labels and one collection of real labels per execution. These collections are concatenated, each one appended after the previous, resulting in two large collections of predicted and real labels:
\hat{Y} & = & (\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_{|P|}, \hat{y}_{|P|+1}, \cdots, \hat{y}_{2 \cdot |P|}, \cdots, \hat{y}_{N \cdot |P|}),
Y & = & (y_1, y_2, \cdots, y_{|P|}, y_{|P|+1}, \cdots, y_{2 \cdot |P|}, \cdots, y_{N \cdot |P|}),
where \hat{Y} is the predicted label list and Y the real label list, both resulting from the executions of Repeated R-Fold. As the correspondence between the components of both lists is maintained, all metric computations can be generalized by treating them as a single pair of sequences of length N \cdot |P|.
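The concatenation of per-fold collections can be sketched as follows (the fold contents are toy values, not actual experiment outputs):

```python
import numpy as np

# Per-execution predicted and real label collections (toy example).
pred_folds = [np.array([0, 1]), np.array([1, 1]), np.array([0, 0])]
real_folds = [np.array([0, 1]), np.array([1, 0]), np.array([0, 1])]

# Append each collection after the previous one, preserving the
# component-wise correspondence between predictions and ground truth.
Y_hat = np.concatenate(pred_folds)
Y = np.concatenate(real_folds)
```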
4.4.1 General metrics computation result
In Table 2 and Table 3, all metric results are shown. Each metric was computed across all datasets. For each metric, columns 2 to 10 represent the datasets, and rows represent the results of each classifier. The last two columns show the arithmetic mean and standard deviation of the corresponding metric across all datasets. More details about the metric computations are presented in B.
The experiments were conducted per dataset for each fixed simplicial complex dimension in the tested interval. Nevertheless, in this section, results are shown for a single dimension per dataset: one fixed dimension was selected for the Iris, Circles, and Sphere datasets; another for the Moon dataset; another for the Swissroll and Wine datasets; another for the Breast Cancer dataset; and another for the Normdist dataset.
Tables 2 and 3 also report the True Negative Rate (TNR), False Positive Rate (FPR), Matthews Correlation Coefficient (MCC), Geometric Mean (GMean), and Classification Error (CError).
Table 4 summarizes the classifiers' average performance. It was built from the last two columns of Tables 2 and 3, which contain the mean and standard deviation of each metric across all datasets.
4.4.2 Selected confusion matrices
For a graphical visualization of the evaluated classifiers' performance, 40 confusion matrices were created, one per classifier-dataset pair (5 classifiers, 8 datasets). Nonetheless, only the matrices for the Iris, Circles, Moon, and Sphere datasets are shown in this section. All confusion matrices can be seen in C.
The discussion section is organized into three subsections: first, the analysis of the results; then, a highlight of the proposed method's most relevant characteristics; and finally, a discussion of related works.
Average performance was computed over all datasets with the arithmetic mean, geometric mean, and harmonic mean. The algorithm ranking remained the same across all three means; thus, only the arithmetic mean is shown in Table 4.
By analyzing the results per dataset independently, it can be noted that the TDABC approaches were superior to wk-NN and k-NN in 5 of the 8 evaluated datasets, specifically the Circles, Moon, Normdist, Sphere, and Wine datasets (see Table 2 and Table 3). On the other hand, the baseline methods were slightly better on the three remaining datasets.
The Circles, Moon, Normdist, Sphere, and Wine datasets have different challenging features, such as high dimensionality, imbalanced label distributions, and highly entangled classes. Despite these challenges, the TDABC approaches outperformed the baseline methods in every computed metric.
The Circles and Moon datasets are balanced and have very entangled classes due to the noise factor, making classification a challenge. On these datasets, wk-NN and k-NN behave poorly, as observed through the negative values obtained with the MCC measure (see Table 5). This behavior is related to the fixed value of k and to the assumption that each data point is equally relevant. Even though wk-NN imposes a local data-point weighting based on distances, it is not enough for highly entangled classes, as our results show. The TDABC methods can deal with the entanglement challenge through a disambiguation factor based on filtration values.
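To illustrate why a negative MCC signals poor performance, here is a minimal example with scikit-learn (not the paper's evaluation code):

```python
from sklearn.metrics import matthews_corrcoef

# MCC ranges over [-1, 1]; values below 0 indicate predictions that are
# anti-correlated with the ground truth, i.e. worse than random guessing.
y_true = [0, 0, 1, 1]
y_bad = [1, 1, 0, 0]   # systematically inverted predictions
print(matthews_corrcoef(y_true, y_bad))  # -1.0
```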
In the case of the Normdist and Sphere datasets, the class distributions are highly imbalanced. In this situation, it is important to have dynamic neighborhoods. The proposed method generates dynamically sized "neighborhoods" for each point, in contrast to the k-NN and wk-NN classifiers. In the highly imbalanced case, the disambiguation factor also provides a multi-scale local weighting to the TDABC methods.
The Wine and Normdist datasets are high-dimensional (13 and 350 dimensions, respectively). The TDABC methods behave better than the baseline approaches on those datasets in all metrics. k-NN and wk-NN use the Euclidean distance directly to detect their k neighbors. Even though the TDABC methods also use the Euclidean distance, they are still able to unravel multi-scale, multi-dimensional relationships among the data, which better handles high dimensionality.
Swissroll and Iris are balanced datasets, with an equal number of samples per class. In contrast, the Breast Cancer dataset is slightly imbalanced. On the Swissroll and Iris datasets, weighted k-NN and k-NN, respectively, were better than the proposed classifiers in all metrics. Interestingly, on the slightly imbalanced Breast Cancer dataset, TDABC was equally performant in 2 out of 9 metrics.
5.2 Key aspects of the proposed method
Regarding the proposed TDA-based classification methodology, two key aspects are discussed: persistent homology and voting system.
The first aspect is related to persistent homology's key role in selecting the desired sub-complex from a filtered simplicial complex built on the dataset. Algorithm 3 reduces the search space in the filtered simplicial complex by taking advantage of topological features encoded inside selected persistence intervals. Although selecting the right sub-complex is a very challenging problem [Caillerie2011], the simple criterion we propose (the death time of the persistence intervals resulting from MaxInt, RandInt, and AvgInt) is sufficient to achieve good classification results.
Despite the birth time's theoretical guarantees for selecting the sub-complex, the middle and death times can be useful depending on the dataset's structure and complexity. Experimentally, promising results were obtained using both the middle and the death time. However, the death time performed better because it captures more stable topological features and minimizes the presence of isolated points. This process is summarized in Figure 5.
The second aspect is the proposed voting system (see Definition 13), which yields richer information than is used in the classification alone. During the voting system's execution, a fundamental stage is the label propagation performed by the labeling function from Definition 14. The result of the labeling function can also be represented by a contribution vector, with each component holding the contribution of one label. By normalizing this vector, the probability of belonging to each class is obtained. Thus, the voting system provides a probability per class, allowing, for instance, the use of ensemble techniques.
5.3 Related methods
In other related approaches, the authors of [Zhang2017] propose the Rare-class Nearest Neighbour (KRNN), a k-NN variant to deal with the sparsity of positive samples in an imbalanced dataset. KRNN uses dynamic local query neighborhoods that contain at least k positive nearest neighbors (members of minority classes). In [Vuttipittayamongkol2020], a different approach is proposed to deal with imbalanced datasets, focusing on negative samples (from the majority class), in contrast to [Zhang2017]. The authors experimentally show that negative samples in the overlapping region cause most classification inaccuracies; thus, a neighbor-based algorithm is proposed in [Vuttipittayamongkol2020] that removes negative samples from the overlapped area.
Both [Zhang2017] and [Vuttipittayamongkol2020] successfully handle two-class imbalanced classification problems. However, when applied to multi-class imbalanced problems, several issues arise in both methods, mostly related to the ambiguity of determining whether an instance is positive or negative. In multi-class imbalanced problems, the same class can play both roles simultaneously: it can be a minority class with respect to one class but a majority class with respect to another. Closely related testing scenarios were the Normdist and Sphere datasets, on which the proposed TDABC method was experimentally evaluated. The proposed method obtains good classification rates on minority classes, and it was also able to deal with the overlapping area because of its disentanglement properties.
Recent TDA works still consider TDA as a complement to ML tasks. Works such as [ATIENZA2020107509, Riihimaki2020] focus on discovering better ways to transform persistent homology representations into topological features for deep learning pipelines or sophisticated ML methods. In [ATIENZA2020107509], the stability of persistent entropy is proved, justifying its application as a useful statistic in topological data analysis. In [Riihimaki2020], TDA is applied to bioinformatics by proposing a novel algorithm based on another major TDA tool, the Mapper algorithm, used to visualize and interpret low and high volumes of data (see [Carlsson:Bulletin]), and building an ML classifier on top of the Mapper-generated graphs. In [MAJUMDAR2020113868], Self-Organized Maps were combined with TDA tools to cluster and classify time series in the financial domain with competitive results. In this context, our work is an example of a fully TDA-based approach applied to supervised learning, with a preliminary version presented as two technical reports in [rolandokindelanmauriciocerdanancyhitschfeld2020, rknmcnh2020].
In this work, TDA was applied directly to a classification problem and evaluated on 8 datasets, including imbalanced and high-dimensional ones, with good results compared to the baseline methods. Overall, we show that Topological Data Analysis alone can classify without any ML method. To our knowledge, this is the first study proposing this approach for classification.
The proposed TDA-based classification method propagates labels from labeled points to unlabeled ones over the built filtered simplicial complex. The filtration values were interpreted as indirect distance indicators, providing a natural disambiguation method for label contributions.
The use of persistent homology was key to reduce the search space’s complexity by providing the topological features needed to select a sub-complex close enough to the data topology and use it for classification.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
This research work was supported by the National Agency for Research and Development of Chile (ANID), with grants ANID 2018/BECA DOCTORADO NACIONAL-21181978, FONDECYT 1181506, ICN09_015, and PIA ACT192015. Beca postdoctoral CONACYT (Mexico) also supports this work. The first author would like to thank professor José Carlos Gómez-Larrañaga from CIMAT, Mexico, due to his support and collaboration.
Appendix A Repeated cross-validation process
The performance evaluation of any classifier in a multi-class classification problem is a difficult task. One of the most significant issues is ensuring that the assessment makes no assumption about the data distribution or the classifier. Another problem is guaranteeing the robustness of the testing against bias, overfitting, and underfitting. A well-known approach is to use a cross-validation method [BROWNE2000108, scikit-learn, Japkowicz2011ELA, tom97ML]. Cross-validation divides the dataset into equal pieces or folds of size R; one of those pieces is selected as the test set, and the remaining folds form the training set. This process continues until the last fold has been selected as the test set. However, since all folds are taken from the same dataset, a given fold is sometimes the test set and at other times part of the training set, which can bias cross-validation. One way to mitigate this issue is to repeat the R-fold cross-validation process N times, a method called Repeated Cross-Validation (see Figure 14).
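The repeated R-fold scheme described above corresponds to scikit-learn's `RepeatedKFold`; a minimal sketch (the fold counts are illustrative, matching the N = 5 repetitions and 10% folds used in the experiments):

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.arange(100).reshape(50, 2)  # toy data: 50 samples
# 10 folds (each 10% of the data), repeated 5 times.
rkf = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)
n_iterations = sum(1 for _ in rkf.split(X))  # 10 folds x 5 repeats
```

Each iteration yields a train/test index pair; a classifier is fitted and evaluated once per iteration.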
Appendix B Metrics for Classifiers Evaluation
Several metrics need to be considered to evaluate the performance of the proposed and baseline classifiers. The classification metrics are computed as functions of the True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN) values. Once those primitive values are computed, it is possible to derive several classification metrics such as Accuracy (Acc), Precision (Pr), Recall (Re), True Negative Rate (TNR), False Positive Rate (FPR), False Negative Rate (FNR), F1-measure (the harmonic mean of Pr and Re), Matthews Correlation Coefficient (MCC), Geometric Mean (GMean), and Classification Error (CErr).
On the other hand, the definitions of the real and predicted label collections are needed to compute the metrics. Let Y = (y_1, \dots, y_n) be the real label list and \hat{Y} = (\hat{y}_1, \dots, \hat{y}_n) the predicted label list computed by Algorithm 1. The following sections explain the computation of the metrics.
B.1 True Positives, True Negatives, False Positives, and False Negatives
A true positive is a sample correctly classified as belonging to the positive label (the critical or most important one). A true negative is a sample correctly classified with a negative label. A false positive is a sample mislabeled with the positive label, and a false negative is a sample mislabeled with a negative label.
In a multi-class classification problem (more than two classes), it is more difficult to determine the positive and negative classes. In this paper, each class is considered in turn as the positive class, with the remaining classes taken as negative, and this process is repeated until every class has played the positive role. In this process, TP_l, FP_l, TN_l, and FN_l are computed for each label l \in L, with L the label set, as follows:
TP_l & = & \sum_{i=1}^{n} I(l = \hat{y}_i) \cdot I(\hat{y}_i = y_i),
FP_l & = & \sum_{i=1}^{n} I(l = \hat{y}_i) \cdot I(\hat{y}_i \neq y_i),
TN_l & = & \sum_{i=1}^{n} I(l \neq \hat{y}_i) \cdot I(\hat{y}_i = y_i),
FN_l & = & \sum_{i=1}^{n} I(l \neq \hat{y}_i) \cdot I(\hat{y}_i \neq y_i).
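The four equations above can be transcribed directly into NumPy; this is a sketch, and the helper name is illustrative:

```python
import numpy as np

def per_label_counts(y_true, y_pred, labels):
    """Compute (TP_l, FP_l, TN_l, FN_l) per label l, as in the equations."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    counts = {}
    for l in labels:
        # Indicator products I(.) * I(.) become boolean conjunctions.
        tp = np.sum((y_pred == l) & (y_pred == y_true))
        fp = np.sum((y_pred == l) & (y_pred != y_true))
        tn = np.sum((y_pred != l) & (y_pred == y_true))
        fn = np.sum((y_pred != l) & (y_pred != y_true))
        counts[l] = (tp, fp, tn, fn)
    return counts
```

Note that, following the text's definitions, TN_l counts samples correctly classified under some other label.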
B.2 Metrics computation for binary and multi-class classification
Binary classification is the setting where only two classes are taken into consideration. In this scenario, popular metrics are:
Accuracy (Acc): Percentage of correct predictions over the total number of samples.
Precision (Pr): Number of items correctly identified as positive over the total number of items identified as positive.
Recall, Sensitivity or True Positive Rate (Re): Number of items correctly identified as positive out of the total number of actual positives.
True Negative Rate or Specificity (TNR): Number of items correctly identified as negative out of the total number of actual negatives.
False Positive Rate or Type I Error (FPR): Number of items wrongly identified as positive out of the total number of actual negatives.
False Negative Rate or Type II Error (FNR): Number of items wrongly identified as negative out of the total number of actual positives.
F1-Measure: This measure summarizes Pr and Re in a single metric, known as the harmonic mean of both. It mitigates the impact of the higher rate while accentuating the impact of the lower one.
Matthews Correlation Coefficient (MCC): A measure unaffected by the imbalanced-dataset issue. MCC is a contingency-matrix method obtained by calculating the Pearson correlation coefficient between real and predicted values.
Geometric Mean (GMean): The geometric mean corresponds to the square root of the product of the Recall and True Negative Rate. It is commonly used to understand the classifier behavior with imbalanced datasets.
Classification Error (CErr): Percentage of misclassified samples over the total number of samples.
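The metrics listed above can be computed from the four primitive counts. The following sketch uses the standard formulas; the zero-denominator guards are a defensive choice of ours, not specified in the text:

```python
import math

def metrics(tp, tn, fp, fn):
    """Binary classification metrics from TP, TN, FP, FN counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    pr  = tp / (tp + fp) if tp + fp else 0.0
    re  = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    f1  = 2 * pr * re / (pr + re) if pr + re else 0.0
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    gmean = math.sqrt(re * tnr)
    cerr = 1.0 - acc
    return {"Acc": acc, "Pr": pr, "Re": re, "TNR": tnr, "FPR": fpr,
            "FNR": fnr, "F1": f1, "MCC": mcc, "GMean": gmean, "CErr": cerr}
```

For a multi-class problem, these are evaluated per label using the per-label counts of B.1 and then averaged.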