1 Introduction
Deep learning over the past decade has had a tremendous impact on computer vision, natural language processing, machine learning, and healthcare. Among other approaches, convolutional neural networks (CNNs) in particular have received great attention and interest from the computer vision community. This is attributed to their ability to exploit the local temporal and spatial correlations that exist in 1-dimensional (1D) sequential time-series signals, 2-dimensional (2D) data like images, 3-dimensional (3D) data like videos, and 3D objects. In this paper, we refer to these types of data as
input data. CNNs also have far fewer learnable parameters than their fully-connected counterparts, making them less prone to overfitting, and they have shown state-of-the-art results in applications like image classification, object detection, scene recognition, fine-grained categorization and action recognition
[26, 20, 50, 51, 52]. Apart from being good at learning mappings between the input and corresponding class labels, deep learning frameworks are also efficient at discovering mappings between the input data and other output feature representations [45, 47, 28, 16, 13]. While methods for learning features from scratch and mapping data to desired outputs via neural networks have matured significantly, relatively little attention has been paid to invariance to nuisance low-level transforms like Gaussian noise, blur and affine transformations. Topological data analysis (TDA) methods are popularly used to characterize the shape of high-dimensional point cloud data using representations such as persistence diagrams (PDs) that are robust to certain types of variations in the data [14]. The shape of the data is quantified by properties such as connected components, cycles, high-dimensional holes, level-sets and monotonic regions of functions defined on the data [14]. Topological properties are those invariants that do not change under smooth deformations like stretching, bending and rotation, but that are not preserved under tearing or gluing of surfaces. These attractive traits of TDA have renewed interest in this area for answering various fundamental questions, including those dealing with interpretation, generalization, model selection, stability, and convergence [19, 6, 34, 32, 18, 17].
A lot of work has gone into utilizing topological representations efficiently in large-scale machine learning [3, 5, 35, 30, 33, 1, 40]. However, bottlenecks remain, such as the computational load involved in discovering topological invariants and the lack of a differentiable architecture. In this paper we propose a differentiable deep learning approach that efficiently learns approximate mappings between data and their topological feature representations. The gist of our idea is illustrated in Figure 1 and the main contributions are listed below.
Contributions: (1) We propose a novel differentiable neural network architecture called PI-Net to extract topological representations. In this paper we focus on persistence images (PIs) as the desired topological feature. (2) We provide two simple CNN-based architectures: Signal PI-Net, which takes in multivariate 1D sequential data, and Image PI-Net, which takes in multi-channel 2D image data. (3) We also employ transfer learning strategies to train the proposed PI-Net model on a source dataset and use it on a target dataset. (4) Through our experiments on human activity recognition using accelerometer sensor data and image classification on standard image datasets, we show the effectiveness of the generated approximations of PIs and compare their performance to PIs generated using TDA approaches. We also explore the benefits of concatenating PIs with features learnt using deep learning methods like AlexNet [26] and Network-in-Network [27] for image classification, and test their robustness to different noise variations. Our code is available at https://github.com/anirudhsom/PINet.

2 Related Work
Although the formal beginnings of topology date back a few centuries to Euler, algebraic topology has seen a revival in the past decade with the advent of computational tools and software [36, 2, 4]. Arguably the most popular topological summary is the persistence diagram (PD), a multiset of points in a 2D plane that quantifies the birth and death times of topological features, such as k-dimensional holes, of sub-level sets of a function defined on a point cloud [15]. This simple summary has resulted in the adoption of topological methods for various applications [31, 43, 8, 11, 10, 23, 39, 44]. However, TDA methods suffer from two major limitations. First, it is computationally very taxing to extract PDs, and the computational load increases with the dimensionality of the data being analyzed. The second obstacle is that a PD is a multiset of points, making it impossible to use machine learning or deep learning frameworks directly on the space of PDs. Efforts have been made to tackle the second issue by mapping PDs to spaces that are more favorable for machine learning tools [3, 5, 35, 30, 33, 1, 40]. To alleviate the first problem, in this paper we propose a simple one-step differentiable architecture called PI-Net to compute the desired topological feature representation, specifically persistence images (PIs). To the best of our knowledge, we are the first to propose the use of deep learning for computing PIs directly from data.
Our motivation to use deep learning stems from its successful use in learning mappings between input data and different feature representations [45, 47, 28, 16, 13]. Deep learning and TDA have crossed paths before, but not in the same context as what we propose in this paper. TDA methods have been used to study the topology [19, 6], algorithmic complexity [34], behavior [18] and selection [32] of deep learning models. Efforts have also been made to use topological feature representations either as inputs or fused with features learned using neural network models [12, 24, 7]. Later, in Section 5, we too show experimental results on fusing generated PIs with deep learning frameworks for action recognition and image classification tasks.
3 Background
Persistence Diagrams: Consider a graph G = (V, E) constructed from data projected onto a high-dimensional point-cloud space. Here, V is the set of nodes and E denotes the neighborhood relations defined between the samples. Topological properties of the graph's shape can be estimated by first constructing a simplicial complex S over G. S is defined as the pair (G, Σ), with Σ being a family of non-empty subsets of V, where each element σ ∈ Σ is a simplex [15]. This falls under the realm of persistent homology when we are interested in summarizing the k-dimensional holes present in the data. The simplices are constructed using the ε-neighborhood rule [15]. It is also possible to quantify the topology induced by a function g defined on the vertices of the graph by studying the topology of its sub-level or super-level sets. Since g : V → R, this is referred to as scalar field topology. In either case, PDs provide a simple way to summarize the birth vs. death time information of the topological feature of interest. Birth-time (b) refers to the scale at which the feature was formed and death-time (d) refers to the scale at which it ceases to exist. The difference between d and b gives us the lifetime or persistence, denoted by l = d − b. Each PD is a multiset of points in the 2D plane. Interested readers can refer to the following papers to learn more about the properties of the space of PDs [14, 15].

Persistence Images:
A PI is a finite-dimensional vector representation of a PD [1] and can be computed through the following series of steps. First, we map the PD to an integrable function called a persistence surface, defined as a weighted sum of Gaussian functions centered at each point in the PD. Next, the persistence surface is discretized over a sub-domain, resulting in a grid. Finally, the PI is obtained by integrating the persistence surface over each grid box, giving us a matrix of pixel values. An interesting aspect of computing PIs is the broad range of weighting functions to choose from for weighting the Gaussian functions. Typically, points of high persistence or lifetime are perceived to be more important than points of low persistence. In such cases one may select the weighting function to be non-decreasing with respect to the persistence value of each point in the PD. Adams et al. also discuss the stability of persistence images with respect to the 1-Wasserstein distance between PDs [1]. Figure 2 shows an example of a PD and its PI weighted by lifetime.

Convolutional Neural Networks: CNNs were inspired by the hierarchical organization of the human visual cortex [21]
and consist of many intricately interconnected layers of neuron structures serving as the basic units to learn and extract both low-level and high-level features from images. CNNs are particularly attractive and powerful compared to their fully-connected counterparts because they are able to exploit the spatial correlations present in natural images, and each convolutional layer has far fewer trainable parameters than a fully-connected layer. Several sophisticated CNN architectures have been proposed in the last decade, for example AlexNet [26], Network-in-Network [27], VGG [38], GoogLeNet [42], ResNet [22], etc. Some of these designs are known to surpass humans at object recognition tasks [37]. Apart from discovering features from scratch for classification tasks, CNNs are also popular for learning mappings between the input and other feature representations [45, 47, 28, 16, 13]. This motivates us to design simple CNN models for the task of learning mappings between data and their PI representations. We direct interested readers to the following survey paper to learn more about different CNN architectures [41].

Learning Strategies:
Here we briefly describe the two learning strategies, namely supervised learning and transfer learning, that we employ to train the proposed PI-Net model. Supervised learning is concerned with learning complex mappings from x to y when many pairs (x, y) are given as training data, with x being the input data and y being the corresponding label or feature representation. In a classification setting, y corresponds to a fixed set of labels. In a regression setting, the output y is either a real number or a set of real numbers. In this paper our problem falls under the regression category, as we try to learn a mapping between the input data and its PI. Transfer learning is a design methodology that uses the learned weights of a model pre-trained on a source dataset for a source task to initialize the weights of another model that is fine-tuned on a target dataset for a target task [48]. When training a model, abstract feature representations are usually learnt in the initial and middle layers, whereas task-specific features are learnt in the final layers. With transfer learning we only retrain or fine-tune the last layers. This allows us to leverage the source dataset that the model was initially trained on, which is useful in cases where the target dataset is much smaller than the source dataset. However, transfer learning only works if the features learned for the source task generalize to the target task. In Section 4 we show how transfer learning is employed in our proposed framework.

4 PI-Net Framework
In this section we first go through the steps to generate the ground-truth PIs and then discuss the proposed network architecture. The two PI-Net variants are illustrated in Figure 3. To generate PIs from multivariate time-series signals we use Signal PI-Net, and for multi-channel images we use Image PI-Net.
4.1 Generating Ground Truth Persistence Images
Data Pre-processing: For univariate or multivariate time-series signals, we consider only fixed-frame signals, i.e. signals with a fixed number of time-steps, and zero-center them. We standardize the train and test sets such that they have unit variance along each time-step. For images, we enforce the pixel value range to be [0, 1].

Computing Persistence Diagrams and Images: We use the Scikit-TDA Python library [36] and the Ripser package for computing PDs. We focus only on extracting PDs of scalar functions defined on the data. When working with 1D sequential data, these offer a way to describe extremal points. For example, local minima give birth to a topological feature (more accurately, a 0-order homology group) which then dies at a local maximum. From our initial investigations we were able to generate PIs with features of high persistence. We can also extract PDs from an image by considering the pixel values to be a function on a 2D plane. However, using this approach we were not able to observe PDs with high-persistence features for the image. Instead, we vectorize each image along its rows to form a 1D signal and then extract PDs for this 1D signal, just as we do for time-series data. For multi-channel color images, we vectorize each color channel separately and then compute PDs. For example, we reshape a 32×32×3 color image to 1024×3. This small change allowed us to observe richer PDs that have features with high persistence.
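As an illustration of the scalar-field view above, 0-dimensional sub-level-set persistence of a 1D signal can be computed with a simple union-find sweep: local minima give birth to connected components, which die when they merge at local maxima (the elder rule). The paper uses the Ripser package for this step; the sketch below is only an illustrative re-implementation and omits the never-dying component of the global minimum.

```python
import numpy as np

def sublevel_persistence_0d(signal):
    """0-dimensional persistence of the sub-level sets of a 1D signal.
    Returns a list of (birth, death) pairs with non-zero persistence."""
    f = np.asarray(signal, dtype=float)
    order = np.argsort(f)          # sweep samples in order of increasing value
    parent = {}                    # union-find over activated sample indices
    birth = {}                     # component root -> birth value (a local min)
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:
        parent[i], birth[i] = i, f[i]      # a new component is born at f[i]
        for j in (i - 1, i + 1):           # try to merge with active neighbours
            if j in parent:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # elder rule: the younger component (larger birth) dies here
                    young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                    pairs.append((birth[young], f[i]))
                    parent[young] = old
    return [(b, d) for b, d in pairs if d > b]
```

For example, the signal [0, 2, 1, 3] has a single finite feature: the local minimum of value 1 is born and then dies when its component merges at the local maximum of value 2, giving the pair (1, 2).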
For computing PIs we use the Persim package in the Scikit-TDA toolbox. In all our experiments we set the grid size of the generated PIs to 50×50 and fit a Gaussian kernel function on each point in the PD, weighting each Gaussian kernel by the lifetime of the point. For all time-series datasets we set the standard deviation of the Gaussian kernel to 0.25 and the birth-time range to [−10, 10]. For image datasets we fix the standard deviation to 0.05 and the birth-time range to [0, 1]. Once we compute PIs, we normalize each PI by dividing it by its maximum intensity value, which forces the intensity values in the PI to lie between [0, 1].
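A simplified NumPy sketch of this PI computation (50×50 grid, lifetime-weighted Gaussians, max-intensity normalization) is given below. It places the diagram in birth-lifetime coordinates and evaluates each Gaussian at pixel centres rather than integrating over grid boxes, so it is only an approximation of the Persim computation; the lifetime-axis range is an assumption.

```python
import numpy as np

def persistence_image(pd, grid=50, sigma=0.25, birth_range=(-10.0, 10.0)):
    """Map a persistence diagram (list of (birth, death) pairs) to a PI.
    Each point contributes a Gaussian weighted by its lifetime l = d - b."""
    pd = np.asarray(pd, dtype=float)
    births = pd[:, 0]
    lifetimes = pd[:, 1] - pd[:, 0]
    lo, hi = birth_range
    xs = np.linspace(lo, hi, grid)            # birth axis, at pixel centres
    ys = np.linspace(0.0, hi - lo, grid)      # lifetime axis (assumed range)
    X, Y = np.meshgrid(xs, ys)
    img = np.zeros((grid, grid))
    for b, l in zip(births, lifetimes):
        img += l * np.exp(-((X - b) ** 2 + (Y - l) ** 2) / (2.0 * sigma ** 2))
    return img / img.max()                    # normalize intensities to [0, 1]
```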
4.2 Network Architecture
Here, we describe the Signal PI-Net and Image PI-Net architectures. These models are shown in Figure 3 and are designed using Keras with a TensorFlow backend [9].

Signal PI-Net: The input to the network is a t × n dimensional time-series signal, where t refers to the number of time-steps or frame size and n to the number of channels. For a univariate signal n = 1, and for a multivariate signal n > 1. For our experiments in Section 5, t = 1000 and n = 3. After the input layer, the encoder block consists of four 1D convolution layers. Except for the final convolution layer, each convolution layer is followed by batch normalization, ReLU activation and max-pooling. The final convolution layer is followed by batch normalization, ReLU activation and global-average-pooling. The number of convolution filters is set to 128, 256, 512 and 1024 respectively. The convolution kernel size is the same for all layers and is set to 3 with stride 1. We use appropriate zero-padding to keep the output shape of each convolution layer unchanged. For all max-pool layers, we set the kernel size to 3 and the stride to 2. After the encoder block, we pass the global-average-pooled output into a final output dense layer of size 2500 × n. The output of the dense layer is subjected to ReLU activation and reshaped to size n × 50 × 50. As mentioned earlier, we set the height and width of all generated PIs to 50 × 50.

Image PI-Net: The input to this network is an h × w × c dimensional image, where h, w and c are the image's height, width and number of channels. The structure of the encoder block is the same as that of the Signal PI-Net model; the only difference is that we now use the 2D version of the same layers described earlier. We pass the output of the encoder block into a latent-variable layer, which consists of a dense layer of size 2500. The output of the latent-variable layer is reshaped to 50 × 50 × 1 and passed into the decoder block. The decoder block consists of one 2D deconvolution layer with kernel size 50, stride 1, and number of filters equal to c. The output of the deconvolution layer is also zero-padded so that the height and width of the output remain unchanged. The deconvolution layer is followed by a final batch normalization and ReLU activation. The shape of the output we get is 50 × 50 × c.
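A minimal Keras sketch of the Signal PI-Net encoder follows; the layer settings match this section, but for simplicity the sketch emits a single 50×50 PI rather than one PI per input channel, so the output head is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_signal_pi_net(timesteps=1000, channels=3, grid=50):
    """Encoder: four Conv1D blocks (128/256/512/1024 filters, kernel 3,
    stride 1, 'same' padding), max-pooling (size 3, stride 2) after the
    first three blocks, global-average-pooling after the last, then a
    dense output reshaped into a 50x50 persistence image."""
    inp = layers.Input(shape=(timesteps, channels))
    x = inp
    for i, filters in enumerate((128, 256, 512, 1024)):
        x = layers.Conv1D(filters, kernel_size=3, strides=1, padding='same')(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        if i < 3:
            x = layers.MaxPooling1D(pool_size=3, strides=2)(x)
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dense(grid * grid, activation='relu')(x)
    out = layers.Reshape((grid, grid))(x)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer='adam', loss='mse')   # MSE against ground-truth PIs
    return model
```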
To employ transfer learning, we first train the Image PI-Net model on a source dataset. Next, we fine-tune just the last layers of the model using the target dataset. Specifically, we retrain from the fourth convolution layer (with 1024 filters) in the encoder block to the final output layer, as shown in Figure 4.
Loss function:
The mean-squared-error loss function is used to quantify the deviation of the generated PIs from the ground-truth PIs. The train and test loss trends for both the Signal and Image PI-Net variants are shown in Figure 6.

5 Experiments
This section is broadly divided into four parts. First, we show human activity recognition on two accelerometer sensor datasets: GeneActiv [46] and USC-HAD [49]. Second, we show improvements on the image classification task after fusing PIs, obtained both traditionally and using the proposed Image PI-Net framework, with popular neural network architectures like AlexNet [26] and Network-in-Network [27]. For image classification we use the following datasets: CIFAR-10 [25] and SVHN [29]. Third, we show how the generated PIs can help improve the robustness of deep learning models to different noise sources like blur, translation and Gaussian noise. Finally, we show improvements in computation time for the task of extracting PIs using the proposed method.
5.1 Action Recognition using Accelerometer Data
We conduct this experiment on the following accelerometer datasets: GeneActiv [46] and USC-HAD [49]. The GeneActiv dataset consists of 29 different human-activity classes from 152 subjects. The data was collected at a sampling rate of 100 Hz using a GeneActiv sensor, a lightweight, waterproof, wrist-worn tri-axial accelerometer. Please refer to the following paper to learn more about the data-collection protocol [46]. We extract non-overlapping frames of 10 seconds each, giving us about 31,275 frames. Each frame has 1000 time-steps. We use roughly 75% of the frames for the training set and the rest as the test set. To avoid inducing any bias, we make sure to place all frames from the same subject into only one of the two sets. The USC-HAD dataset consists of 12 different human-activity classes from 14 subjects. It was collected using the tri-axial MotionNode accelerometer sensor at a sampling rate of 100 Hz, with the sensor placed at the front right hip [49]. Here also we extract 10-second non-overlapping frames, resulting in about 2,499 frames. We use frames from the first 8 subjects for the training set and the remaining frames as the test set. Figure 5 shows the list of all activity classes and their distribution for both datasets.
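The framing step above (non-overlapping 10-second windows at 100 Hz, giving 1000 time-steps per frame) can be sketched as:

```python
import numpy as np

def extract_frames(recording, rate_hz=100, seconds=10):
    """Split a (timesteps, channels) recording into non-overlapping
    fixed-length frames; any trailing partial frame is dropped."""
    frame_len = rate_hz * seconds                     # 1000 time-steps
    n = recording.shape[0] // frame_len
    return recording[: n * frame_len].reshape(n, frame_len, recording.shape[1])
```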
Training Signal PI-Net: We train one Signal PI-Net model, described in Section 4.2, using just the training set of the GeneActiv dataset. We set the batch size to 128 and train the model for 1000 epochs. The learning rate was lowered in three stages: after the first 300 epochs and again after the next 300 epochs, with the final value used for the last 400 epochs. The Adam optimizer was used for training the model. We use the mean-squared-error loss function to quantify the overall deviation of the generated PIs from the ground-truth PIs. The training and test loss trends are shown in Figure 6.

Table 1: Weighted F1 scores (mean ± standard deviation) for action recognition.

Method  GeneActiv  USC-HAD
MLP - PI  46.27±0.28  44.71±1.26
MLP - Signal PI-Net  49.76±0.90  48.21±1.42
MLP - SF [46]  35.48±0.50  31.86±2.47
MLP - SF + PI  47.63±0.43  45.79±0.33
MLP - SF + Signal PI-Net  49.68±0.22  48.68±0.63
1D CNN  56.34±0.89  53.33±1.35
1D CNN + PI  58.68±0.49  55.67±1.03
1D CNN + Signal PI-Net  59.42±0.35  58.56±0.81
For characterizing the time-series signals, we consider three different feature representations: (1) a 19-dimensional feature vector consisting of different statistics calculated over each 10-second frame [46]; (2) features learnt from scratch using 1D CNNs; (3) persistence images generated using the traditional filtration technique and using the proposed Signal PI-Net model. The 19-dimensional feature vector includes the mean, variance and root-mean-square value of the raw accelerations on each of the x, y and z axes, the Pearson correlation coefficients between the x-y, y-z and x-z time-series, and the difference between the maximum and minimum accelerations on each axis, along with a few additional statistics [46]. From here on we refer to this 19-dimensional statistics feature as SF. We use the trained Signal PI-Net model to extract PIs for the test set of the GeneActiv dataset. We also use the same model to compute PIs for both the training and test sets of the USC-HAD dataset. We wanted to see if we could exploit the knowledge learnt by the proposed Signal PI-Net model on a source dataset (GeneActiv) and use it on a target dataset (USC-HAD). As seen in Figure 5, there is a huge shift in both the data distribution and the end-target classes. This pushes the problem into the realm of cross-domain and cross-task learning: cross-domain, since for each dataset the accelerometer sensor was placed on a different part of the human body; and cross-task, since the class distribution and end classification task are very different for the two datasets.
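The per-frame statistics named above can be sketched as follows; only the features explicitly listed in the text are computed here (the full SF vector of [46] is 19-dimensional, so the remaining entries are not reproduced):

```python
import numpy as np

def statistical_features(frame):
    """Statistics over a (timesteps, 3) tri-axial accelerometer frame:
    mean, variance and RMS per axis, pairwise Pearson correlations,
    and the max-min acceleration range per axis."""
    feats = []
    feats.extend(frame.mean(axis=0))                   # mean of x, y, z
    feats.extend(frame.var(axis=0))                    # variance of x, y, z
    feats.extend(np.sqrt((frame ** 2).mean(axis=0)))   # RMS of x, y, z
    corr = np.corrcoef(frame.T)                        # Pearson correlations
    feats.extend((corr[0, 1], corr[1, 2], corr[0, 2]))
    feats.extend(frame.max(axis=0) - frame.min(axis=0))  # range per axis
    return np.asarray(feats)
```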
The weighted F1 score classification results are shown in Table 1. We use a multi-layer perceptron (MLP) classifier for the SF and PI features, and a 1D CNN classifier for the time-series signals. The MLP classifier contains 8 dense layers with ReLU activation, having 1024, 1024, 512, 512, 256, 256, 128 and 128 units respectively. To avoid overfitting, each dense layer is followed by a dropout layer with a dropout rate of 0.2 and a batch-normalization layer. The output layer is another dense layer with softmax activation and with the number of units equal to the number of classes. The 1D CNN classifier consists of 10 CNN layers with the number of filters set to 64, kernel size 3, stride 1, and zero-padded outputs. Each CNN layer is followed by batch-normalization, ReLU activation and max-pooling layers. For the max-pool layers we set the filter size to 3; the stride is set to 1 for every odd layer and 2 for every even layer. For the final CNN layer we use a global-average-pooling layer instead of a max-pool layer. Here too, the output layer consists of a dense layer with softmax activation and the number of units equal to the number of target classes.
Table 1 shows results for both individual features and different fusion cases. In the table, PI refers to PIs obtained using conventional TDA methods, and Signal PI-Net refers to PIs computed using the proposed Signal PI-Net model. We fuse the SF and PI features at the input layer before passing them into the MLP classifier. For the 1D CNNs we fuse the PI features after the global-average-pooling layer. We see improvements in classification results using the proposed Signal PI-Net model for both datasets. We would like to remind our readers that the results for USC-HAD were obtained using the Signal PI-Net model trained on just the GeneActiv dataset. This opens doors to further exploration of the proposed framework on cross-domain, cross-task learning problems. For the 1D CNN case, apart from improving the overall classification accuracy, we also notice the standard deviation being reduced after combining PIs. We provide the confusion matrices for a few of the methods listed in Table 1 in the Appendix at the end of the paper.
5.2 Image Classification
We use the following three image datasets to train the different Image PI-Net models: CIFAR-10, CIFAR-100 [25] and SVHN [29]. However, we show image classification results only for CIFAR-10 and SVHN. Both CIFAR-10 and CIFAR-100 contain 60,000 color images, split into 50,000 training images and 10,000 test images. The SVHN dataset contains 73,257 training images and 26,032 test images. Images have the same shape in all three datasets: the height, width and number of channels of each image are 32, 32 and 3 respectively. Sample test images from each class of the CIFAR-10 and SVHN datasets are shown in Figures 7 and 8 respectively.
Training Image PI-Net: We develop two kinds of Image PI-Net models based on the datasets chosen as source and target for training: (1) In the first kind, the source and target datasets are the same, i.e. we train the Image PI-Net model using the CIFAR-10 or SVHN dataset. (2) For the second kind, we use the CIFAR-100 dataset as our source dataset and either CIFAR-10 or SVHN as the target dataset. Simply put, we employ transfer learning by first training the Image PI-Net model using CIFAR-100 and later using the target dataset to fine-tune the last layers, as illustrated in Figure 4. For the second case, we further explore two variations: (2a) fine-tune the last layers using all samples from the training set of the target dataset; (2b) fine-tune using just a subset, i.e. 500 images per class from the training set of the target dataset. We refer to these variants as Image PI-Net FA (Fine-tune All) and Image PI-Net FS (Fine-tune Subset) respectively. For all cases we normalize the images by dividing all pixel values by 255, which scales them to lie in the range [0, 1]. For the model described in Section 4.2, we set the batch size to 128 and train the model for 1000 epochs. Just as for the Signal PI-Net model, we lower the learning rate in the same three stages: after the first 300 epochs and again after the next 300 epochs. Here too we use the Adam optimizer and the mean-squared-error loss function. The training and test loss trends for the different Image PI-Net models are shown in Figure 6.
Table 2: Network architectures for the two base models.

AlexNet [26]:
input  32×32 RGB image
conv1a  32 filters, 3×3, pad='same', ReLU
conv1b  32 filters, 3×3, ReLU
pool1  Max-pool 2×2
drop1  Dropout 0.2
conv2a  64 filters, 3×3, pad='same', ReLU
conv2b  64 filters, 3×3, ReLU
pool2  Max-pool 2×2
drop2  Dropout 0.2
flatten1  Flatten
dense1  Fully connected, 1024 units, ReLU
drop3  Dropout 0.2
dense2  Fully connected, 10 units
output  Softmax

Network-in-Network [27]:
input  32×32 RGB image
conv1a  192 filters, 5×5, pad='same', ReLU
conv1b  160 filters, 1×1, pad='same', ReLU
conv1c  96 filters, 1×1, pad='same', ReLU
pool1  Max-pool 3×3, stride=(2,2), pad='same'
conv2a  192 filters, 5×5, pad='same', ReLU
conv2b  192 filters, 1×1, pad='same', ReLU
conv2c  192 filters, 1×1, pad='same', ReLU
pool2  Max-pool 3×3, stride=(2,2), pad='same'
conv3a  192 filters, 3×3, pad='same', ReLU
conv3b  192 filters, 1×1, pad='same', ReLU
conv3c  10 filters, 1×1, pad='same', ReLU
gavgpool1  Global Average Pool
output  Softmax
Ground-truth PIs and PIs generated using the above Image PI-Net variants for both CIFAR-10 and SVHN are shown in Figures 7 and 8 respectively. For image classification we use AlexNet [26] and Network-in-Network (NIN) [27] as our base models. Topological features like PIs alone are not as powerful as features learnt by most deep learning frameworks. This is clearly evident from our earlier experiment in Section 5.1, where a simple 1D CNN applied directly to the time-series data outperforms an MLP that takes PIs as input. However, from the same section and from past work [40, 12], we know that topological features carry complementary information that can be exploited to improve the overall classification performance. We too show results using the AlexNet and NIN models in conjunction with PIs generated using traditional filtration techniques and using the proposed Image PI-Net model. Figure 9 illustrates how we concatenate PIs with the base network features for image classification.
Table 3: Image classification results. Each cell shows Mean±SD / p-value.

Method  CIFAR-10  SVHN
AlexNet  80.49±0.30 / -  93.08±0.17 / -
AlexNet + PI  80.52±0.38 / 0.8932  93.72±0.10 / 0.0001
AlexNet + Image PI-Net  81.25±0.49 / 0.0182  93.83±0.11 / <0.0001
AlexNet + Image PI-Net FA  81.23±0.42 / 0.0125  93.92±0.13 / <0.0001
AlexNet + Image PI-Net FS  81.80±0.24 / 0.0001  93.94±0.13 / <0.0001
NIN  84.93±0.13 / -  95.83±0.07 / -
NIN + PI  85.29±0.30 / 0.0392  95.75±0.08 / 0.1309
NIN + Image PI-Net  86.61±0.19 / <0.0001  96.04±0.04 / 0.0004
NIN + Image PI-Net FA  86.62±0.39 / <0.0001  95.97±0.05 / 0.0066
NIN + Image PI-Net FS  86.61±0.40 / <0.0001  96.06±0.04 / 0.0002
The network architectures for the AlexNet and NIN models are shown in Table 2. When fusing PIs with the AlexNet model, we first pass the PIs through two dense layers with ReLU activation, having 1024 and 512 units respectively. Each dense layer is followed by a dropout layer with a dropout rate of 0.2 and batch-normalization. The final PI output is concatenated with the output of the final dropout layer 'drop3' in AlexNet. We modify the NIN model slightly when fusing PIs: the only change is that we place the global-average-pool layer after 'conv3b' instead of 'conv3c', and we concatenate the PI features with the output of the global-average-pool layer. The classification results are tabulated in Table 3. We see that PIs generated using all variants of the proposed framework help improve the overall classification results for the base models on both datasets. The p-values are calculated for each case with respect to only the base model.
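The fusion scheme above can be sketched in Keras: a 1024/512-unit PI branch with dropout and batch-normalization, concatenated with the base model's features before the classification layer. The small dense network standing in for the base CNN's penultimate features (e.g. AlexNet's 'drop3' output) is a toy placeholder, not the actual base model.

```python
import tensorflow as tf
from tensorflow.keras import layers

def pi_fusion_head(base_features, num_classes=10):
    """Build a PI branch and concatenate it with `base_features`
    (a Keras tensor, e.g. the output of the base CNN's last dropout layer)."""
    pi_in = layers.Input(shape=(50, 50), name='pi_input')
    x = layers.Flatten()(pi_in)
    for units in (1024, 512):
        x = layers.Dense(units, activation='relu')(x)
        x = layers.Dropout(0.2)(x)
        x = layers.BatchNormalization()(x)
    fused = layers.Concatenate()([base_features, x])
    out = layers.Dense(num_classes, activation='softmax')(fused)
    return pi_in, out

# Toy stand-in for a base network's penultimate features.
img_in = layers.Input(shape=(32,))
base = layers.Dense(64, activation='relu')(img_in)
pi_in, out = pi_fusion_head(base)
model = tf.keras.Model(inputs=[img_in, pi_in], outputs=out)
```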
Table 4: Classification accuracy (Mean±SD) under three noise types, each evaluated at five increasing severity levels (left to right).

CIFAR-10:

Method  Blur (levels 1-5)  Translation (levels 1-5)  Gaussian noise (levels 1-5)
AlexNet  71.57±1.80  51.18±2.67  38.14±1.73  31.78±0.95  28.34±0.85  72.83±1.00  68.06±0.94  62.06±1.11  55.48±1.00  49.46±0.77  77.86±0.47  65.10±1.79  48.49±2.75  34.94±2.65  25.98±2.57
AlexNet + PI  68.79±0.90  45.83±1.69  32.50±1.02  26.58±0.93  23.29±0.70  71.30±0.38  66.23±0.38  60.59±0.49  53.97±0.27  48.18±0.64  76.11±0.20  58.52±1.24  39.56±2.32  27.23±2.10  19.59±0.94
AlexNet + Image PI-Net  71.79±0.45  49.65±0.58  35.35±0.67  28.76±0.66  25.64±0.47  73.41±0.28  68.80±0.57  63.05±0.25  56.34±0.39  50.84±0.24  77.46±0.38  60.68±1.94  40.80±3.03  27.04±2.97  19.22±2.38
AlexNet + Image PI-Net FA  71.29±0.60  47.99±1.39  35.24±1.49  29.21±1.27  26.35±0.97  73.24±0.68  68.38±0.47  62.89±0.60  55.95±0.55  50.42±1.05  77.38±0.50  60.92±2.09  40.84±2.65  26.57±2.29  18.54±1.59
AlexNet + Image PI-Net FS  71.38±1.00  47.98±2.63  34.32±1.74  28.07±1.00  25.31±0.46  73.39±0.50  68.52±0.22  62.99±0.35  56.02±0.44  50.28±0.25  77.89±1.12  59.73±3.94  39.10±4.21  25.05±3.05  17.29±1.89
NIN  77.79±0.91  54.98±1.45  38.39±0.84  30.21±0.55  26.23±0.62  80.28±0.20  77.93±0.44  74.64±0.38  70.85±0.38  65.98±0.46  81.21±0.55  66.65±2.15  48.37±3.35  33.64±3.49  24.03±2.85
NIN + PI  76.93±0.73  49.24±1.95  32.19±1.57  25.17±1.69  21.69±1.66  80.35±0.24  77.57±0.34  74.41±0.27  70.51±0.31  65.31±0.22  81.08±0.70  64.09±2.32  44.08±3.90  28.81±3.55  19.48±2.46
NIN + Image PI-Net  77.70±0.78  49.75±1.38  33.37±1.15  26.12±0.43  23.29±0.65  81.40±0.38  78.83±0.38  75.92±0.43  72.05±0.40  66.90±0.51  82.69±0.37  65.91±2.27  45.19±3.87  29.43±3.53  19.95±2.48
NIN + Image PI-Net FA  77.43±0.98  50.52±1.61  34.29±1.98  26.86±1.93  23.90±1.87  81.19±0.28  78.61±0.35  75.60±0.44  71.66±0.42  66.60±0.19  82.68±0.27  66.14±1.38  46.80±2.35  31.92±2.92  22.86±2.68
NIN + Image PI-Net FS  78.50±1.05  51.39±2.06  34.58±1.75  26.84±1.22  23.11±1.33  81.76±0.44  79.24±0.48  76.40±0.36  72.53±0.58  67.49±0.40  82.48±0.25  65.74±1.17  44.67±1.26  28.25±1.19  19.42±0.92

SVHN:

Method  Blur (levels 1-5)  Translation (levels 1-5)  Gaussian noise (levels 1-5)
AlexNet  92.83±0.14  92.70±0.16  91.42±0.14  89.63±0.17  85.73±0.25  88.42±0.23  80.98±0.43  68.39±0.55  55.49±0.70  45.14±0.49  91.14±0.49  78.33±2.23  59.26±3.02  44.89±2.77  35.74±2.33
AlexNet + PI  93.61±0.13  93.36±0.14  92.19±0.04  90.21±0.08  86.08±0.21  90.05±0.06  83.54±0.21  72.16±0.46  59.52±0.24  48.46±0.45  92.33±0.19  85.07±1.26  72.41±2.18  60.38±2.67  49.29±2.84
AlexNet + Image PI-Net  93.64±0.13  93.41±0.15  92.32±0.16  90.44±0.10  86.53±0.10  90.10±0.33  83.74±0.52  72.07±0.65  59.60±0.68  48.52±0.48  92.80±0.23  87.26±0.98  75.28±1.89  61.13±2.29  48.33±2.45
AlexNet + Image PI-Net FA  93.74±0.14  93.52±0.10  92.27±0.07  90.52±0.08  86.55±0.14  90.29±0.12  83.99±0.19  72.60±0.39  59.82±0.36  48.71±0.23  93.01±0.15  88.09±0.33  78.28±1.57  66.75±2.96  55.74±3.87
AlexNet + Image PI-Net FS  93.68±0.15  93.53±0.15  92.41±0.10  90.53±0.14  86.59±0.28  90.35±0.09  83.96±0.19  72.44±0.36  59.74±0.31  48.74±0.36  92.89±0.18  86.98±0.89  74.96±2.46  61.35±3.36  49.25±3.42
NIN  95.68±0.06  95.40±0.05  94.47±0.08  92.75±0.10  89.39±0.21  94.52±0.13  92.25±0.11  87.25±0.21  79.40±0.37  69.50±0.45  95.39±0.09  92.81±0.16  86.99±0.44  79.34±0.68  70.81±0.75
NIN + PI  95.65±0.04  95.27±0.09  94.30±0.12  92.36±0.12  88.61±0.20  94.46±0.09  92.10±0.10  86.87±0.20  78.56±0.22  68.24±0.22  95.08±0.09  91.97±0.29  85.48±0.57  76.95±0.90  68.14±0.63
NIN + Image PI-Net  95.88±0.04  95.62±0.06  94.67±0.11  92.87±0.09  89.41±0.19  94.79±0.07  92.35±0.04  87.41±0.15  79.58±0.19  69.77±0.35  95.52±0.08  92.88±0.25  86.98±0.52  78.93±0.69  70.16±1.07
NIN + Image PI-Net FA  95.83±0.05  95.54±0.08  94.63±0.11  92.79±0.19  89.23±0.21  94.77±0.11  92.43±0.15  87.54±0.26  79.89±0.36  70.02±0.53  95.43±0.09  92.40±0.49  85.85±0.69  77.63±0.97  68.21±1.34
NIN + Image PI-Net FS  95.92±0.05  95.62±0.09  94.69±0.06  92.94±0.08  89.44±0.17  94.81±0.11  92.40±0.16  87.48±0.16  79.64±0.30  69.75±0.11  95.53±0.11  92.53±0.16  86.13±0.34  77.83±0.52  68.24±0.38
5.3 Robustness to Noise
In this section we create noisy variants of the test sets of CIFAR-10 and SVHN. Specifically, we generate noisy images with different levels of blur, affine translation, and Gaussian noise, and evaluate the base models described in Section 5.2 on these noisy variants. Both Alexnet and NIN are evaluated alone and after fusion with PIs. The PIs for the noisy images are generated using both conventional TDA tools and the proposed Image PI-Net framework. Note that these models were trained only on clean images from each dataset and are not retrained on noisy images. For blur, we apply an averaging filter at five increasing kernel sizes. For affine transformation, we translate the image within five increasing percentage ranges. For Gaussian noise, we add noise to the original image at five increasing levels of standard deviation. The results are tabulated in Table 4. We observe that fusing PIs generated by the Image PI-Net model is beneficial in all cases for Alexnet and NIN on the SVHN dataset, and in the affine-translation case on the CIFAR-10 dataset.
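The three corruption types above can be reproduced with standard image-processing tools. The sketch below, using NumPy and SciPy, is a minimal version of such a pipeline; the kernel size, translation range, and noise level shown are illustrative placeholders, since the exact values used in the experiments are not listed here.

```python
import numpy as np
from scipy import ndimage

def blur(img, k):
    """Averaging (box) filter of size k x k, applied per channel."""
    return ndimage.uniform_filter(img, size=(k, k, 1))

def translate(img, frac, rng):
    """Random affine translation within +/- frac of the image height/width."""
    h, w = img.shape[:2]
    dy = rng.uniform(-frac, frac) * h
    dx = rng.uniform(-frac, frac) * w
    return ndimage.shift(img, shift=(dy, dx, 0), order=1, mode="nearest")

def gaussian_noise(img, sigma, rng):
    """Additive Gaussian noise with the given standard deviation."""
    return np.clip(img + rng.normal(0.0, sigma, size=img.shape), 0.0, 1.0)

rng = np.random.default_rng(0)
img = rng.random((32, 32, 3))  # a CIFAR-10-sized image with values in [0, 1]
noisy = [blur(img, 5), translate(img, 0.1, rng), gaussian_noise(img, 0.1, rng)]
```

Each corruption preserves the image shape, so the corrupted test images can be fed to the trained networks without any change to the evaluation code.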
Table 5: Average time (mean ± SD, in seconds) to generate the PI for one image.

Method (hardware)        CIFAR-10         SVHN
Conventional TDA (CPU)   146.50 ± 3.83    105.03 ± 3.57
Image PI-Net (GPU)       2.52 ± 0.02      2.19 ± 0.02
5.4 Computation Time to Generate PIs
We used four NVIDIA GeForce GTX Titan Xp graphics cards, each with 12 GB of memory, to train and evaluate all deep learning models. All other tasks were carried out on a standard Intel i7 CPU with 32 GB of working memory, using Python. We use the Scikit-TDA software to compute PDs and PIs [36]. Table 5 shows the average time taken to extract the PI for one image by conventional TDA methods on a single CPU and by the proposed Image PI-Net framework on a single GPU. The average is computed over all images in the training set of each dataset. With the Image PI-Net model, we see an effective speed-up of two orders of magnitude in computation time. We also measured the time taken to compute PIs when the entire training set is passed through Image PI-Net as a single batch: about 9.77 ± 0.08 seconds for CIFAR-10 and 12.93 ± 0.05 seconds for SVHN, a small fraction of the time required by conventional TDA tools. Computing PIs in real time has so far been impractical with conventional TDA approaches; the proposed framework makes it straightforward, thereby opening doors to new real-time applications of TDA.
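Per-image timing figures like those in Table 5 can be collected with a simple harness. The sketch below times an arbitrary PI-generation callable over repeated runs and reports mean and standard deviation of the per-item time; `compute_pi` here is a trivial stand-in workload, not either of the actual pipelines.

```python
import time
import statistics

def benchmark(fn, inputs, repeats=5):
    """Return (mean, sd) of per-item wall-clock time over several repeats."""
    per_item = []
    for _ in range(repeats):
        start = time.perf_counter()
        for x in inputs:
            fn(x)
        elapsed = time.perf_counter() - start
        per_item.append(elapsed / len(inputs))
    return statistics.mean(per_item), statistics.stdev(per_item)

# Stand-in for a PI generator (conventional TDA or a trained PI-Net forward pass).
def compute_pi(img):
    return sum(img)  # trivial placeholder workload

mean_t, sd_t = benchmark(compute_pi, [[1.0] * 1000 for _ in range(100)])
```

For a batched GPU model the same harness applies, with `fn` wrapping a forward pass over the whole batch and `len(inputs)` replaced by the batch size.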
6 Conclusion and Future Work
In this paper we took a first step toward using deep learning to extract topological feature representations. We developed PI-Net, a differentiable and effective architecture that extracts PIs directly from data, with significantly lower computational cost than conventional topological tools. We show good results on several time-series and image datasets, and also test the robustness of different base classification networks, fused with PIs generated by PI-Net, under different kinds of noise added to the data.
For future work, we would like to explore more sophisticated deep learning architectures that can learn mappings between higher-dimensional data and their corresponding topological feature representations, and to investigate how deep learning can be used to generate other kinds of topological representations. Conventional TDA tools are also stable under small perturbations of the input data. Now that we can generate approximations of topological representations, it would be interesting to use the proposed framework in settings that must resist adversarial attacks, a major issue faced by current deep neural networks.
7 Acknowledgements
This work was supported in part by NSF CAREER grant 1452163 and ARO grant W911NF-17-1-0293.
References
 [1] H. Adams, T. Emerson, M. Kirby, R. Neville, C. Peterson, P. Shipman, S. Chepushtanova, E. Hanson, F. Motta, and L. Ziegelmeier. Persistence images: A stable vector representation of persistent homology. Journal of Machine Learning Research, 18(8):1–35, 2017.
 [2] H. Adams, A. Tausz, and M. Vejdemo-Johansson. Javaplex: A research software package for persistent (co)homology. In International Congress on Mathematical Software, pages 129–136. Springer, 2014.

 [3] R. Anirudh, V. Venkataraman, K. Natesan Ramamurthy, and P. Turaga. A Riemannian framework for statistical analysis of topological persistence diagrams. In The IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 68–76, 2016.
 [4] U. Bauer, M. Kerber, and J. Reininghaus. Distributed computation of persistent homology. In Proceedings of the Workshop on Algorithm Engineering and Experiments, pages 31–38. SIAM, 2014.
 [5] P. Bubenik. Statistical topological data analysis using persistence landscapes. The Journal of Machine Learning Research, 16(1):77–102, 2015.
 [6] P. Bubenik and J. Holcomb. Statistical inferences from the topology of complex networks. Technical report, Cleveland State University, Cleveland, United States, 2016.
 [7] Z. Cang and G.-W. Wei. TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLoS Computational Biology, 13(7), 2017.
 [8] H. Chintakunta, T. Gentimis, R. Gonzalez-Diaz, M. J. Jimenez, and H. Krim. An entropy-based persistence barcode. Pattern Recognition, 48(2):391–401, 2015.
 [9] F. Chollet et al. Keras. https://keras.io, 2015.
 [10] M. K. Chung, P. Bubenik, and P. T. Kim. Persistence diagrams of cortical surface data. In International Conference on Information Processing in Medical Imaging, pages 386–397. Springer, 2009.
 [11] Y. Dabaghian, F. Mémoli, L. Frank, and G. Carlsson. A topological paradigm for hippocampal spatial map formation using persistent homology. PLoS Computational Biology, 8(8):1–14, 2012.
 [12] T. K. Dey, S. Mandal, and W. Varcho. Improved Image Classification using Topological Persistence. In Vision, Modeling & Visualization. The Eurographics Association, 2017.

 [13] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision, pages 184–199. Springer, 2014.
 [14] H. Edelsbrunner and J. Harer. Computational topology: An introduction. American Mathematical Society, 2010.
 [15] H. Edelsbrunner, D. Letscher, and A. Zomorodian. Topological persistence and simplification. Discrete & Computational Geometry, 28(4):511–533, 2002.
 [16] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multiscale deep network. In Advances in Neural Information Processing Systems, pages 2366–2374, 2014.
 [17] M. Ferri. Why topology for machine learning and knowledge extraction? Machine Learning and Knowledge Extraction, 1(1):115–120, 2018.
 [18] M. Gabella, N. Afambo, S. Ebli, and G. Spreemann. Topology of learning in artificial neural networks. arXiv preprint arXiv:1902.08160, 2019.
 [19] R. B. Gabrielsson and G. Carlsson. Exposition and interpretation of the topology of neural networks. arXiv preprint arXiv:1810.03234, 2018.
 [20] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587, 2014.
 [21] K. Grill-Spector and R. Malach. The human visual cortex. Annu. Rev. Neurosci., 27:649–677, 2004.
 [22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [23] K. Heath, N. Gelfand, M. Ovsjanikov, M. Aanjaneya, and L. J. Guibas. Image webs: Computing and exploiting connectivity in image collections. In IEEE Conference on Computer Vision and Pattern Recognition, 2010.
 [24] C. Hofer, R. Kwitt, M. Niethammer, and A. Uhl. Deep learning with topological signatures. In Advances in Neural Information Processing Systems, pages 1634–1644. 2017.
 [25] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [27] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
 [28] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
 [29] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 [30] D. Pachauri, C. Hinrichs, M. K. Chung, S. C. Johnson, and V. Singh. Topologybased kernels with application to inference problems in alzheimer’s disease. IEEE transactions on Medical Imaging, 30(10):1760–1770, 2011.
 [31] J. A. Perea and J. Harer. Sliding windows and persistence: An application of topological methods to signal analysis. Foundations of Computational Mathematics, 15(3):799–838, 2015.
 [32] K. N. Ramamurthy, K. Varshney, and K. Mody. Topological data analysis of decision boundaries with application to model selection. In Proceedings of the International Conference on Machine Learning, pages 5351–5360, 2019.
 [33] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt. A stable multiscale kernel for topological machine learning. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.
 [34] B. Rieck, M. Togninalli, C. Bock, M. Moor, M. Horn, T. Gumbsch, and K. Borgwardt. Neural persistence: A complexity measure for deep neural networks using algebraic topology. In International Conference on Learning Representations, 2019.
 [35] D. Rouse, A. Watkins, D. Porter, J. Harer, P. Bendich, N. Strawn, E. Munch, J. DeSena, J. Clarke, J. Gilbert, et al. Featureaided multiple hypothesis tracking using topological and statistical behavior classifiers. In SPIE Defense+Security, 2015.
 [36] N. Saul and C. Tralie. ScikitTDA: Topological data analysis for python. https://doi.org/10.5281/zenodo.2533369, 2019.
 [37] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.
 [38] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [39] G. Singh, F. Memoli, T. Ishkhanov, G. Sapiro, G. Carlsson, and D. L. Ringach. Topological analysis of population activity in visual cortex. Journal of Vision, 2008.
 [40] A. Som, K. Thopalli, K. Natesan Ramamurthy, V. Venkataraman, A. Shukla, and P. Turaga. Perturbation robust representations of topological persistence diagrams. In Proceedings of the European Conference on Computer Vision, pages 617–635, 2018.
 [41] S. Srinivas, R. K. Sarvadevabhatla, K. R. Mopuri, N. Prabhu, S. S. Kruthiventi, and R. V. Babu. A taxonomy of deep convolutional neural nets for computer vision. Frontiers in Robotics and AI, 2:36, 2016.
 [42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
 [43] C. J. Tralie and J. A. Perea. (quasi) periodicity quantification in video data, using topology. SIAM Journal on Imaging Sciences, 11(2):1049–1077, 2018.
 [44] V. Venkataraman, K. N. Ramamurthy, and P. Turaga. Persistent homology of attractors for action recognition. In IEEE International Conference on Image Processing, pages 4150–4154. IEEE, 2016.
 [45] J. Walker, A. Gupta, and M. Hebert. Dense optical flow prediction from a static image. In Proceedings of the IEEE International Conference on Computer Vision, pages 2443–2451, 2015.
 [46] Q. Wang, S. Lohit, M. J. Toledo, M. P. Buman, and P. Turaga. A statistical estimation framework for energy expenditure of physical activities from a wristworn accelerometer. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society, pages 2631–2635. IEEE, 2016.
 [47] X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 539–547, 2015.
 [48] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
 [49] M. Zhang and A. A. Sawchuk. USC-HAD: A daily activity dataset for ubiquitous activity recognition using wearable sensors. In Proceedings of the ACM Conference on Ubiquitous Computing, pages 1036–1043. ACM, 2012.
 [50] N. Zhang, J. Donahue, R. Girshick, and T. Darrell. Part-based R-CNNs for fine-grained category detection. In European Conference on Computer Vision, pages 834–849. Springer, 2014.
 [51] N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. Panda: Pose aligned networks for deep attribute modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1637–1644, 2014.

 [52] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems, pages 487–495, 2014.
Appendix
Confusion Matrices for Time-series Human Activity Recognition
Here we show the confusion matrices for a few of the methods listed in Table 1. Specifically, we show the confusion matrices for the multilayer perceptron (MLP) classifier on the 19-dimensional statistical feature (SF) representation, on persistence images (PIs) obtained using conventional topological data analysis (TDA) tools, and on PIs computed using the proposed PI-Net model. We show these for both the GeneActiv [46] and USC-HAD [49] datasets in Figures 10 and 12, respectively. We also show the confusion matrices for the 1-dimensional convolutional neural network (1D CNN), both alone and fused with the two PI variants, in Figures 11 and 13, respectively. For the MLP classifier, we observe that PI features are more informative than the SF representation. We also observe that fusing PIs with more powerful classifiers like 1D CNNs helps improve overall classification performance. We used the same Signal PI-Net model trained on the GeneActiv dataset to extract PIs for the USC-HAD dataset, i.e., we did not fine-tune the model on USC-HAD.
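For completeness, a confusion matrix like the ones plotted in these figures can be computed directly from predicted and true labels; the NumPy sketch below is a minimal version of what scikit-learn's `confusion_matrix` provides, with the toy labels serving only as an example.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)

# Row-normalizing the matrix gives per-class recall, which is what a
# percentage-annotated confusion-matrix plot typically displays.
recall = cm.diagonal() / cm.sum(axis=1)
```

The same routine applies unchanged to the activity-recognition predictions, with `num_classes` set to the number of activity labels in each dataset.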