Computational and applied topology, tutorial

07/11/2018 ∙ by Paweł Dłotko, et al. ∙ Swansea University 0

This is a tutorial in applied and computational topology and topological data analysis. It is illustrated with numerous computational examples that utilize Gudhi library. It is under constant development, so please do not consider this version as final.



There are no comments yet.


page 6

page 9

page 22

page 23

page 38

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.


  1. When talking about Euler‘s problem I have promised you an explanation why a described walk in Königsberg does not exist. Have a look at the graph. Note that there are an odd numbed of edges incidental to every vertex. Suppose by contrary that such a walk exists. If it exist, we can assume that we start at any region we like, call it

    . Then in order to move to the next region, we need to cross a bridge. Pick any bridge and cross. Once the bridge was crossed, it cannot be crossed any more from the formulation of the problem. Now, there are an even number of bridges we can use when going through . When getting in and getting out, we will make two of the remaining even number not usable. Therefore we can never get back to having used all the attached bridges. In his work Euler gave a deep analysis and proved theorems about existence of a path, which we nowadays call an Euler path in graphs.

  2. Moore‘s law is an observation saying that the processing power of our computers double every months. That means that now for 1000$ we can buy twice as powerful computer as we could 18 months ago. There is a huge amount of money put by the processor manufactures to maintain such a speed. But far fewer people are aware that there is a Moore‘s law present in algorithms and in general in science. In particular, the development of algorithms to compute maximal flows and minimal cuts extends Moore‘s law even though the money invested in developing algorithms are small fractions of money invested to build chips. So if you have a DeLorean which allows you to travel back in time and you want to make a lot of money by doing max flow min cut calculations 20 years ago, but can take either software or hardware, you should definitely take software with you.

  3. The genesis on work on max flow and min cut is a cold war analysis of transport capability of Soviet Union and the dependent countries111See USAF unclassified report available here: The very urgent question for the US Air Forces was – how much supply can be brought daily to the front lines and where to cut the rail network of the enemy with the minimal cost so that a supply for the front lines will be broken. Below is an original map of the railway network graph with weights on the edges:

    Figure 3.5: USAF Project Rand, research memorandum.
  4. A classical algorithm to compute maximal flow and minimal cut is due to Ford and Fulkerson. The idea is very simple:

    1. Try to find a path from source to target such that the flow does not use all available capacity of that path.

    2. Increase the flow through so that the maximal capacity of is reached.

    3. Repeat the previous steps as long as such a path can be found.

    In order to find a minimal cut, one needs to start from the source and annotate all the vertices that can be reached with the edges such that . A minimal cut consists of all the edges which has one endpoint annotated and the other one not annotated.

5.1 Filtration on complexes

In this section we will define a filtration of a simplicial or cubical complex. Traditionally in mathematics, a filtration is defined as an increasing sequence of subcomplexes. To make it as simple as possible, I prefer to define it in an equivalent way, by using a so called filtering function defined on a finite complex. Let be a finite simplicial or cubical complex. Then , where is a set with a total order, is a filtering function if for every and for every in the boundary of , . In computational topology we often say that the boundary of a cell has to enter the filtration before the cell. In the Figure 5.6 please find a positive and negative example of a filtering function.

Figure 5.6: On the left, a filtering function on a simplicial complex. The function on the right is not a filtering function, as some simplices appears before their boundaries.

There are a few standard ways of defining a filtration in a complex . Let us have a look at them:

  1. Given a filtering function defined on vertices. This may happen for instance when a scalar function is provided along with geometric data. Then, given a simplex or a cube which has vertices we set the filtration value of to .

  2. Given a filtering function defined on top dimensional cells of a complex . This situation happens for instance when a or a image is given. The gray scale level of the R, G or B from RGB levels on pixels or voxels provide such a function. Then the filtering function on a cell is defined as the minimum of filtering function values of maximal cells that have in their boundary.

  3. Suppose that we are constructing a Rips complex. Then a natural filtration is induced by the length of the edges of a graph. A filtration value of vertices is set to , filtration values of edges is their length and the filtration values of higher dimensional simplices is the maximal value of the length of edges in the boundary of that simplex.

It is an easy exercise to show that those are filtering functions on a complex .


  1. There exists a theory dual to homology theory called cohomology theory. If there is time, it will be intuitively introduced at the end of this lecture. Computational cohomology is also useful in applied science. Generating cocyles, a concept dual to generating cycles in homology, is explicitly needed in some discrete formulations of Maxwell‘s equations. In a lot of cases homology and cohomology theory are dual. Do you remember the max-flow-min-cut duality we talked about in the first section? If there are no so called torsions present in the homology groups333Torsions are generators of a finite order in the homology group. then homology and cohomology groups are dual (see the Universal Coefficient Theorem for Cohomology [9]). There is a simple interpretation to this available in the picture of the annulus below. In red a representative of a cohomology generator is given. Formally it is so called cochain. Think about it as a fence that has to be crossed by any homology generator. This is the basic interpretation of homology-cohomology duality.

    This duality is a discrete version of Ampere‘s law. Let us suppose that the annulus above is a cross-section of an insulating domain. The hole therein is part of a conducting domain (that close up to a solid torus in 3d). Ampere‘s law states that magnetomotive forces on cycles surrounding the branch of a conductor s equal to the current passing through this conductor. It can be interpreted exactly as homology–cohomology duality.

  2. Homology over is easy to define. In a standard lecture in algebraic topology we always start from homology defined over integer coefficients. Due to the so called universal coefficient theorem for (co)homology, those are the ones which carry the most information (see the comment above). The reason why we have chosen to use homology (or in general, a field homology) is that they are needed for persistent homology to be well defined. We will talk about persistent homology in the next lecture. I want to point out that the fact that we are using homology here does not mean that the (co)homology over integers is not used in applied topology. On the contrary, in some applications they are explicitly needed.


  1. By using ideas from rigorous numerics one can compute persistence not only of a finite complex, but also for a continuous function defined on a compact subset of .

7.2 Standard metrics and stability results.

When analysing noisy data we always need to make sure that the result for noise-less data is not far from the results obtained for noisy data, i.e. that the method is stable. One of main reasons why persistent homology is useful in applied science is precisely this property: persistence is stable, it is robust. Roughly it means that small change in the filtering function induce small changes in the persistence diagram. In order to make this statement precise, we need to introduce a metric in a space of functions, and more importantly, in the space of persistence diagrams. In the space of functions we will use standard metric. Let us define now a metric in a space of persistent diagrams. Recall that a persistence diagram is a finite multiset of points in the plane (with points at infinity). For convince, for this finite multiset of points we are adding the points in the diagonal, each with infinite multiplicity. They are needed for the technical reasons. They represent points that are born and die in the same time. Think of them as virtual pairs of particle and anti-particle. You can get them by infinitesimally small perturbation of the filtration. Let and be two such persistence diagrams. Let us consider a bijections . The Bottleneck distance is defined in the following way:

In other words, we consider all possible bijections between (infinite) multisets of points and and we are searching for the one that minimize the maximal distance between a point and a point matched with via this bijection. This concept is closely related with the earth mover distance. It only cares about the maximal distance. There is another distance which is sensitive to all changes. It is a q-Wasserstein distance and it is defined in the following way:

Please consult Figure 7.7 for an example of a matching.

Figure 7.7: Example of a matching between persistence diagrams. For illustration we put the persistence diagram from the top left and the top right into one diagram in the bottom, and we make a matching therein.

In this section we will provide a stability theorem for the Bottleneck distance. There exists stability theorems for the Wasserstein distances, but they are conceptually more complicated. This stability theorem is due to David Cohen-Steiner, Herbert Edelsbrunner and John Harer [4]. There are more general stability theorems that work using very mild condition [5] - for a case of Rips complexes with similarity measure.

Theorem 1

Let be a cell complex and be two filtering functions. For each dimension the bottleneck distance between the diagram of with a function (denoted as ) and the diagram of with a function (denoted as ) satisfies:

In other words, when perturbing the filtering function by , the points in the diagram will not move further away than . This is a property which makes persistence useful in applied science. If time permits we will show how it allows us to compute persistence of a continuous function. Let us put this theorem to the test! We already know how to generate triangulations with Gudhi. Let us generate one, and add a filtration to it: For the purpose of this exercise we will pick one of the triangulations from the Section in which we have discussed simplicial complexes and assign a filtration (composed for instance from an increasing sequence of integers) to it:

import numpy as np
import gudhi as gd
import matplotlib
#Create a simplex tree
#Simplices can be inserted one by one
#Vertices are indexed by integers
#Notice that inserting an edge automatically insert its vertices
(if they were #not already in the complex)
st = gd.SimplexTree()
diagram = st.persistence(persistence_dim_max=True)
plt = gd.plot_persistence_diagram(diagram)

The numbers after comma (ranging from to ) indicate the filtration value of top dimensional simplices. This is propagated to lower dimensional simplices via so called lower star filtration (i.e. a lower dimensional simplex get a filtration value equal to minimum of filtration values of adjusted top dimensional simplices). Now, to illustrate stability theorems, let us perturb the filtration and visualize it, and compute the Bottleneck distances between diagrams. Those bottleneck distances will be bounded by the magnitude of noise we add to the filtration. Below we give some hint on how to use a random number generator to perturb the filtration.

import random as rd

Also remember that in order to compute the Bottleneck distance between two diagrams use:


Please do it on your own! If there is any problem, you can find the whole code below (with the white ink, sorry… :) You can also find it in the attached directory tree.

import random as rd
import numpy as np
import gudhi as gd
#Create a simplex tree
#Simplices can be inserted one by one
#Vertices are indexed by integers
#Notice that inserting an edge automatically insert its vertices (if they were #not already in the complex)
st1 = gd.SimplexTree()
diagram1 = st1.persistence(persistence_dim_max=True)
diagram = st.persistence_intervals_in_dimension(1)
diagram1 = st1.persistence_intervals_in_dimension(1)

8.1 Open and closed proteins

This example was entirely designed by Bertrand Michel and is available at his tutorial in here: It is based heavily on the work of Peter Bubenik. Big thanks to both of them. In ths example we wil work on the data also considered in the paper The data required for this exercise is available here:˙corr The purpose of this exercise is to understand if persistent homology information can determine the folding state of a protein. Please load the data, unpack them and put them all to the data folder. For further details, please consult the paper. Here are the steps we are gong to perform in the calculations:

  1. Read the appropriate correlation matrices from csv files (we will use Pandas for that).

  2. Transform a correlation matrix to a pseudo–distance matrix by transforming each entry into .

  3. Compute the Rips complexes of the obtained distance matrices (note that in this case, we are constructing a Rips complex of a pseudo–metric space which is not a Euclidean space. Such situations are very typical in this type of bio–oriented research).

  4. Compute persistent homology of the obtained complexes.

  5. Compute all-to-all distance matrix between the obtained persistence diagrams (do it dimension–by–dimension). For that purpose we will use the Bottleneck distance implemented in Gudhi.

  6. Use a Multidimensional Scaling method implemented in the scikit-learn library to find the two dimensional projection of the data.

  7. As you can see, in this case we get sort of a separation, but not a linear one. For further studies of what can be done with those methods, please consult the original paper

import numpy as np
import pandas as pd
import pickle as pickle
import gudhi as gd
from pylab import *
from mpl_toolkits.mplot3d import Axes3D
from IPython.display import Image
from sklearn import manifold
import seaborn as sns
#Here is the list of files to import
files_list = [
#Read the files:
corr_list = [pd.read_csv(u , header=None,delim_whitespace=True)
for u in files_list]
#And change correlation matrix do a distance matrix:
dist_list = [1- np.abs(c) for c in corr_list]
#Compute persistence in dimension 1 for all the files.
#Visualize the persistence diagrams for some of them
persistence = []
for i in range(0,len(dist_list)):
        rips_complex = gd.RipsComplex(distance_matrix=dist_list[i].values,max_edge_length=0.8)
        simplex_tree = rips_complex.create_simplex_tree(max_dimension=2)
        persistence.append( simplex_tree.persistence_intervals_in_dimension(0) )
#And compute all-to-all bottleneck distances. Note that this part
#will take a few seconds:
dist_mat = []
for i in range(0,len(persistence)):
        row = []
        for j in range(0,len(persistence)):
                row.append( gd.bottleneck_distance(persistence[i], persistence[j]) )
#We will now use a dimension reduction method to
#visualize a configuration in  R^2  which almost
#matches with the matrix of bottleneck distances.
#For that purpose we will apply a Multidimensional
#Scaling method implemented in the scikit-learn library.
mds = manifold.MDS(n_components=2, max_iter=3000, eps=1e-9,dissimilarity=”precomputed”, n_jobs=1)
pos =
plt.scatter(pos[0:7,0], pos[0:7, 1], color=’red’, label=”closed”)
plt.scatter(pos[7:len(dist_mat),0], pos[7:len(dist_mat), 1], color=’blue’, label=”open”)
plt.legend( loc=2, borderaxespad=1)
#repeat this for diagram in dimension 0.

In one of the next sections we will be discussing persistence representations, and one example we will consider there uses permutation test. Feel free to return to this example and use a permutation test to distinguish two types of proteins.

8.2 Classification of smart phone‘s sensor data

In this exercise we will gather accelerometer data of various activities using our mobile phones. For that we need an app to record data on our phones. The one I am using is an android app ‘‘Physical Toolbox”. You can find it on a play store˙US. Due to the hardware restrictions I was not able to test in practice, the i-phone analogue, but I think that one of those may work for you if you are an apple user:

With this, we need a linear accelerometer, and start to recording at the right moment. The aim is to record various activities, like walking on stairs, normal walking, push ups, jogging, and whatever else you want. Remember to stop recording as soon as you stop the activity. Given the data, the app will store it in a csv file, which we will use for a later analysis. Please grab your phones and record accelerometer data when different activities are performed. At the end, export your file, name it with the name of activity, and store it. You shall get file the header of which look like this:

Remove the first row and the first and last column, and read it.

import numpy as np
import pandas as pd
import pickle as pickle
import gudhi as gd
from pylab import *
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from IPython.display import Image
from sklearn.neighbors.kde import KernelDensity
import matplotlib
data_pd = pd.read_csv(’activities/walk.csv’,decimal=”.”,delimiter=”,”)
data = data_pd.values
data = np.delete(data, (0), axis=1)
data = np.delete(data, (3), axis=1)
#now we can visualize the data:
fig = plt.figure()
ax = fig.add_subplot(111, projection=’3d’)
#plt.plot(data [:,0],data [:,1],data [:,2] )
ax.scatter(data[:,0],data[:,1],data[:,2] )
#And now we can compute persistent homology of the data and compare
#(visually at the moment) different activities.
Rips_complex_sample = gd.RipsComplex(points = data,max_edge_length=0.6 )
Rips_simplex_tree_sample = Rips_complex_sample.create_simplex_tree(max_dimension=2)
diag_Rips = Rips_simplex_tree_sample.persistence()
#diag_Rips_0 = Rips_simplex_tree_sample.persistence_intervals_in_dimension(0)
plt = gd.plot_persistence_diagram(diag_Rips)

Note that depending on if you have python 2 or 3, you may get some problem with reading the csv file. This is something we need to work out when doing this exercise. Check if persistent homology give you a way to cluster the activities. Here are a few examples of my activities, their point clouds, and the persistence diagrams obtained for the parameters of the Rips construction as in the example above. walk squats jumping downstairs upstairs Please play with different sensors. You can get really nice patterns from theq magnetometer. Run it, rotate your phone, and see what you get! Check it up for your case! Think of the characteristics of persistence diagrams that discriminate those activities. Think of alternative (to persistent homology) characteristics of point clouds that could differentiate the type of activity. Think about it next time you give an app permission to access your phone sensor. And share your ideas and observations!

8.3 Back to the point clouds

So far we have generated a number of point clouds. Computing their persistent homology is the first task we can do in this section. In fact, computations of persistent homology is an optional part in all the codes available therein. The point that is missing is to visualize persistence diagrams. This can be achieved using the following code:

import matplotlib
#suppose that over here the simplex tree has been generated.
plt = gd.plot_persistence_barcode(pers)

for persistence barcodes or

import matplotlib
#suppose that over here the simplex tree has been generated.
plt = gd.plot_persistence_diagram(pers)

for persistence diagramss Please carry on the experimentation with all the point clouds and check if the results are as predicted. Try to do the sampling a few times. Plot the persistence diagrams. Are they similar? Verify this by computing the distances between them. For that purpose you can use either python-based Wasserstein distance, or C++ based persistence representations. If you take a reasonable number of points sampled from the set, you will see almost exactly the same persistence diagram. There is in fact a stability result saying that if the so called Hausdorff distnace between point clouds is bounded by , then the persistence diagrams of that point cloud are bounded by the same , subject to a few mild assumptions.

8.4 Random cubical complexes and generalized percolation

In this exercise we will create a number of cubical complexes according to the following model: A top dimensional cube is present in the complex with probability . Pick a few values of

, and generate a few instances of cubical complexes (choose the dimension and size as you wish, but be reasonable as you do not want to blow up your memory). Compute persistent homology of those complexes. What can you observe? Can you track for evidences of phase transitions happening in the system? How this experiment is related to percolation theory. Use the code below as a starting point in your experiments:

import matplotlib.pyplot as plt
import numpy as np
import random as rd
from mpl_toolkits.mplot3d import Axes3D
import csv
from sklearn import decomposition
import math
import gudhi as gd
from PIL import Image
#in this example we are considering two dimensional complexes
N = 100
#this is where you set the probability:
p = 0.6
bitmap = []
for i in range(0,N*N):
        x = rd.uniform(0, 1)
        if ( x < p ): bitmap.append(1)
        else: bitmap.append(0)
bcc = gd.CubicalComplex(top_dimensional_cells = bitmap, dimensions=[N,N])
diag = bcc.persistence()
#now we can check how many generators in dimension 0 and in dimension 1 we have:
dim0 = bcc.persistence_intervals_in_dimension(0)
dim1 = bcc.persistence_intervals_in_dimension(1)
print Here is the first Betti number: ”, len(dim0)
print Here is the second Betti number: ,” len(dim1)

You can also modify this exercise to create cubical complexes such that cubes get not only filtration or , but the whole spectrum . Another interesting experiment aims in picking a grid , computing the Betti numbers for each probability, averaging them for each probability, and plotting the results.

8.5 Ssid walks

If during your ssid–walk you have made a cycle, you can now go back to the data and check if you can see the dominant interval in one dimensional persistent homology. Observe the persistence of this interval compares to the persistence of other (shorter) intervals that come from the noise.

9.1 Persistence Landscapes

One option was proposed by Peter Bubenik[3] and implemented by me [16]. This is the idea of persistence landscapes. Let us fix first the multiset of persistence intervals . For every interval, let us define a function:


Then a persistence landscape is a set of functions such that k-th largest value of . This may be a little hard to digest at first. Let us take a look at Figure 9.4 to get the idea.

Figure 9.3: An idea of a persistence landscape. First, we ‘‘lie down‘‘ the persistence diagram (b). Formally this corresponds to moving from coordinates into ones. Then, for the image of every interval in the new coordinates, we draw a plot of the function as in (c). Then, the first landscape function, is depicted in Figure (d), the in (e) and in (f).

Persistence landscapes are an equivalent way of representing persistence diagrams. There are algorithms to get one representation from anther. Also, given the intuition from Figure 9.4, it is clear that . Lets talk a little bit more about the idea of this representation. What if, at a given , we have nonzero landscapes and ? Well, it means that at level , our filtered complex has the rank of the homology in the considered dimension equal . Moreover, the minimal amount of perturbation we need to make to kill at least one homology generator is . The obvious question is – how does introducing persistence landscapes help us to solve the problems we have with defining an average? Well, it does. An average of two functions and is simply a point-wise average of them. So, the average is well defined, it is unique. The price to pay is that average of two landscapes is a partially linear function which typically does not come from any persistent diagram. This is something that make some people worry. In my opinion it is a very typical situation that in order to define an average of two nice objects, we need to allow ourselves a little more freedom, and move to a larger space (exactly like in case of integer and rational numbers). Only then one can state a definition that make sense. I think it is the case for persistence landscapes. But lifting the persistence diagrams to the space, for , gives us more. There is a natural distance function in this space: . Moreover those distances are stable. The persistence landscape toolbox and the Gudhi implementation of persistence landscapes have the following functionalities:

  1. Computations of a distance matrix.

  2. Various plotting procedures.

  3. Computations of averages and standard deviation.

9.2 Persistence Heat Maps

Many groups at different time have come up with an idea of building a functional descriptor of a persistence diagram by placing a kernel (typically Gaussian kernel) on each point of a diagram. In order to make this construction stable some type of weighting is necessary: otherwise on persistence points of very low persistence one should place the same kernel as the points of high persistence. As the first ones can appear and disappear in any number, even when a small amount of noise is introduced, they can change the resulting function by an arbitrary factor. To make this descriptor stable two strategies have been implemented:

  1. Weight the kernel placed on a point by the total persistence of the point . This strategy has been used in so called Persistence Weighted Gausian Kernel proposed by Yasuaki Hiraoka and collaborators [6] as well as Persistence Images [8].

  2. Weight all the kernels by the same number, but for any kernel placed in the point place the symmetric kernel at the point with the weight . This strategy has been proposed in the Persistence Stable Space Kernel [10].

An example of a kernel with no weight is presented in Figure 9.4.

Figure 9.4:

Persistence images obtained from diagrams describing neuronal morphologies, see 

[13] for details.

As much as Persistence Landscapes, various persistence heat maps enable all the main operations required in statistics and machine learning.

9.3 Persistence representations in Gudhi

All those persistence representations and more have been implemented in the package Persistence Representations available in Gudhi. At the moment unfortunately they are not available in python, but this will change soon. In the next section we present an exercise we can do to familiarise ourselves with the concept. Unfortunately we do not yet have the persistence landscapes available in python, so to do that we need to get our hands dirty with C++. We are here to help you with that!

9.4 Exercise: detecting the dimension of a random point cloud examination of their persistence diagrams.

At the moment the exercise requires working on the C++ level, since some functionalities related to averaging various representations of persistence has not been cytonized yet. For that purpose, please download the Gudhi library from, and subsequently use the attached˙dimensions˙with˙ C++ file which need to be added to the Gudhi library and compiled with it. Please consult the linked file for further technical details. For the purpose of the exercise, please generate random point clouds for two different dimensions and . Then generate a random point cloud in . Using a permutation test, please check if there is a statistical difference between Persistence Landscapes of different dimensions. Any necessary details about this task will be given on demand, so please ask!

9.5 Permutation tests

Permutation test – Suppose we are given two sets , of persistence diagrams and we would like to show that they are essentially different. For simplicity let us assume that

. If we have some assumptions, we can use a t-test (implemented in the PLT). If not, we can always use the permutation test. It works as follow:

  1. Compute averages average of landscapes constructed based on the diagrams from set A, average of landscapes constructed based on the diagrams from set B. Let distance between and . Let .

  2. Put all persistence diagrams from and in a set . Shuffle the set , divide it into two new sets and . Compute averages , and a distance between them. Every time , increment the counter .

  3. Repeat step 2 times (typically 10000-1000000 times).

The only output from this procedure we care about is the counter . Given it, we compute the ration . If it is small, that means that the clusters are well separated. If it is large, then just the opposite. In statistics the magic value below which the clusters are considered to be separated well is . Also note that here, efficient computations of averages and distances is crucial.


The idea of persistence landscapes is in essence similar to the idea of size functions introduced by the group of Massimo Ferri, Patrizio Frosini and Claudia Land from Bologna in early ‘90. Long before the concept of persistence was introduced, they had persistence in dimension zero. In order to measure it, they have been using size functions.