People have adopted different strategies to extract features in neural networks. Recurrent neural networks (RNNs) consider information from the past and/or the future of a signal by using recurrent connections, which has yielded huge success in speech recognition. On the other hand, convolutional neural networks (CNNs) include context information by taking a weighted sum over a segment of a given length. In this paper, we intend to include the shape information of a segment in the neural network by using a topological tool called persistent homology.
The goal of persistent homology is to describe the shape (e.g., connected components or holes) of a set of data points. To achieve this, it forms a nested sequence of sub-objects such that each later one includes the earlier ones, and each sub-object is described topologically. The end product of persistent homology is usually referred to as a summary of the data shape. The most commonly used summaries include the persistence diagram, the barcode, and the recently proposed persistence landscape [6, 7]. A nice property of these summaries is that they are stable with respect to perturbations of the data. Such summaries can be employed to discern the topological difference between two sets of data. For example, a recent work by Reininghaus et al. establishes a multi-scale kernel for persistence diagrams that can be used with kernel-based classification algorithms, such as the support vector machine, for 3D shape classification.
The use of persistent homology has been found beneficial for a number of machine learning tasks in recent years. For example, Li et al. also used persistence diagrams for 3D shape recognition.
In the audio domain, Brown et al. used persistence barcodes on the raw signals of speech to learn structural features that distinguish among vowels, nasals, and noisy sounds. Bergomi applied persistence to a tone representation to analyze the structure of music pieces. Emrani et al. used barcodes for wheeze detection from breathing sound signals.
Although persistent homology has shown potential in various areas, it has not been incorporated in a neural network model, to the best of our knowledge.
Motivated by the potential gain of combining persistent homology and deep learning, we take a first step in this direction by designing a neural network model that characterizes the topological features of audio signals. Specifically, we propose a design that exploits the persistence landscape in a CNN.
We choose to work with CNNs for their well-demonstrated discriminative power in various audio-related tasks such as speech recognition and music structure analysis. In addition, as will be elaborated in Section III, it is easy to combine the outputs of a convolutional layer and a dedicated layer for persistent homology at the segment level, an intermediate level for audio signal processing.
We evaluate the proposed persistent convolutional neural network (PCNN) model on the task of music auto-tagging, a multi-label classification task that aims at assigning tags such as genres and instruments to music pieces [15, 16].
The state-of-the-art method for music auto-tagging proposed by Dieleman et al. uses a convolution-flavored feature processed by principal component analysis and clustering. Furthermore, they proposed different ways to exploit multi-scale information. As an extension of this method, Dieleman et al. later applied a CNN to raw audio signals and log mel-spectrograms and achieved similar accuracy for music auto-tagging. Our evaluation shows that PCNN can outperform these two methods.
II Persistence Landscape
The topological summary used in this paper is the persistence landscape proposed by Bubenik [6, 7]. In persistent homology, the target object can be represented topologically in different ways, depending on the application. In this paper, we consider cubical complexes, because audio signals are naturally defined on equally spaced grids [18, 19]. Specifically, we use a one-dimensional cubical complex that consists of the grid points of the feature sequence of an audio segment together with the edges connecting neighboring points. One benefit of a cubical complex is that it is naturally equipped with a notion of connectivity. We consider 2-connectivity in this paper, where each element connects to its two immediate neighbors. Due to the space limit, we will not formally introduce persistent homology but instead provide the intuitions behind the definitions. For an introduction to homology and persistent homology, please refer to [3, 19].
In homology theory, homology classes are used to characterize non-boundary cycles. Persistent homology studies the change of the homology classes constructed from the complex and a real-valued filtering function defined on it. In the case considered here, the values of the filtering function are provided by the output signals of a convolution layer. From the complex and the filtering function, we can construct a chain of nested sub-complexes that starts with the empty complex and ends with the complete complex, indexed by a sequence of real-valued thresholds.
Each sub-complex is defined as a superlevel set of the filtering function, i.e., the part of the complex on which the function is at least the corresponding threshold [3, 9]. In the processing of audio features, we care more about the higher values of the features, because regions of higher values often indicate interesting events. We only consider one-dimensional cubical complexes, so the only possible non-trivial homology classes are 0-dimensional homology classes, which describe connected components.
The next step is to derive the births and deaths of homology classes, which are, in our case, the births and deaths of connected components. A birth is a value where a new component appears, and a death is a value where a component is merged into an earlier born component, as illustrated in Figure 1.
The persistence of a component is the difference between its death and birth values. The birth-death pairs are therefore the basis of the barcode, the persistence diagram, and the persistence landscape alike.
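As an illustration of how births and deaths arise for a one-dimensional signal, the following is a minimal sketch that sweeps a threshold from high to low values and tracks connected components with a union-find structure. The function name `superlevel_persistence_pairs`, the union-find approach, the arbitrary tie-breaking, and the choice to drop zero-persistence pairs are assumptions made for this sketch, not details taken from our implementation.

```python
def superlevel_persistence_pairs(signal):
    """Return (birth, death) pairs of the connected components of a 1-D
    signal under a superlevel-set filtration with 2-connectivity.

    Hypothetical helper for illustration only.
    """
    n = len(signal)
    if n == 0:
        return []
    # Visit samples from the highest value to the lowest.
    order = sorted(range(n), key=lambda i: -signal[i])
    parent = [None] * n   # union-find forest; None means "not yet added"
    birth = {}            # root index -> birth value of its component
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in order:
        parent[i] = i
        birth[i] = signal[i]
        for j in (i - 1, i + 1):           # the two immediate neighbors
            if 0 <= j < n and parent[j] is not None:
                older, younger = find(i), find(j)
                if older == younger:
                    continue
                if birth[older] < birth[younger]:
                    older, younger = younger, older
                # The later-born component dies when the two merge.
                if birth[younger] > signal[i]:
                    pairs.append((birth[younger], signal[i]))
                parent[younger] = older
    # The component of the global maximum is paired with the global minimum.
    pairs.append((max(signal), min(signal)))
    return sorted(pairs, key=lambda p: p[0] - p[1], reverse=True)
```

For the signal `[0, 3, 1, 2, 0]`, the peak of height 2 merges into the peak of height 3 at the valley of height 1, so the sketch yields one pair for the global maximum and one for the smaller peak.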
Given a set of birth-death pairs, we can construct the persistence landscape as follows. First, a piecewise linear function is constructed for each birth-death pair (Equation (1)).
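For reference, a triangle-shaped function of this kind can be written in the standard form of Bubenik's tent function [6]. The version below is a sketch under the assumption that, with the superlevel-set filtration used here, the birth value b is at least the death value d. The symbols f, b, d, t, and λ_k are notation assumed for illustration.

```latex
f_{(b,d)}(t) =
\begin{cases}
  0,     & t \notin (d, b), \\
  t - d, & t \in \left(d, \tfrac{b+d}{2}\right], \\
  b - t, & t \in \left(\tfrac{b+d}{2}, b\right),
\end{cases}
\qquad
\lambda_k(t) = k\text{-th largest value of } \left\{ f_{(b_i,d_i)}(t) \right\}_{i}.
```

Equivalently, f_{(b,d)}(t) = max(0, min(t - d, b - t)), which peaks at the midpoint between the death and the birth with height (b - d)/2.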
Each of these functions has a triangle shape. Given the birth-death pairs of a space, the persistence landscape is defined as a sequence of functions whose k-th member takes, at each point, the k-th largest value among the triangle functions of all birth-death pairs. We will refer to this k-th function as the k-th piece of a persistence landscape. In theory, the component born at the global maximum of the filtering function never dies, so it has no finite death value. In practice, we assign the global minimum as the death of the component of the global maximum, as done in previous work. Some examples of signals and the corresponding persistence landscapes are shown in Figure 2. For the signal on the top left, there is one local maximum, so the persistence landscape is a single mountain. For the signal on the top right, there are two local maxima of the same value, so there are two mountains. One benefit of the persistence landscape is its robustness to small noise, as demonstrated by the two signals at the bottom of Figure 2.
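The definition above can be sketched in a few lines. The helper names `tent` and `landscape_piece` are hypothetical, and the triangle function follows the standard tent form max(0, min(t - low, high - t)), where low and high are the smaller and larger of the death and birth values; this form is an assumption of the sketch.

```python
def tent(birth, death, t):
    """Triangle function of one birth-death pair, evaluated at t."""
    low, high = min(birth, death), max(birth, death)
    return max(0.0, min(t - low, high - t))


def landscape_piece(pairs, k, t):
    """Value at t of the k-th piece (k = 1, 2, ...) of the landscape:
    the k-th largest triangle value over all birth-death pairs."""
    values = sorted((tent(b, d, t) for b, d in pairs), reverse=True)
    return values[k - 1] if k <= len(values) else 0.0
```

With the pairs `[(3, 0), (2, 1)]`, the first piece at t = 1.5 is the height of the large triangle, and the second piece picks up the smaller one.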
II-A Intuition about Persistence Landscape
We may consider the first piece as providing an overview of the segment. In our setting, the first piece of the persistence landscape always contains a mountain whose left foot is the global minimum and whose right foot is the global maximum. We may think of the peaks in a persistence landscape as emphasizing the more stable part of a component, sitting right between the birth and the death of a birth-death pair. In other words, the first piece encodes the most prominent part of all the triangle functions.
If the k-th piece of a persistence landscape is nonzero at some value, then there is a sub-complex in the filtration that has at least k connected components at that value. For an audio signal, this means that there are at least k local maxima. Therefore, a signal with nonzero values in a higher-indexed piece exhibits larger fluctuation.
II-B Why Persistence Landscape instead of Persistence Diagram?
Another popular persistence summary is the persistence diagram, which is defined as the set of birth-death pairs. The relationship between two persistence diagrams can be measured through the bottleneck metric or the p-Wasserstein metric [9, 20, 21]. We argue below why the persistence landscape is more suitable than the persistence diagram for use in a network-based model.
In a network, it is convenient to have a representation in matrix form for which the pointwise comparison of two matrices is meaningful. A persistence diagram does not have this property. In contrast, we can do pointwise addition on persistence landscapes because they are essentially functions. Moreover, for computational purposes we can convert these functions to matrices by subsampling the persistence landscape in a chosen range of the domain. We may think of this representation as a restriction of the persistence landscape functions to a subset of the domain, so addition can still be done pointwise. In this way, we can simply treat the persistence landscape as a finite-size, two-dimensional feature map that can be easily processed by a subsequent convolutional layer in a CNN architecture. While the subsampling can be done in various ways, for simplicity a uniform sampling approach is adopted in this work.
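The conversion to a matrix can be sketched as follows: each row is one piece of the landscape and each column is one uniformly spaced sample point in the chosen value range. The function names, the inclusion of both endpoints in the sampling grid, and the tent form of the triangle function are assumptions of this sketch.

```python
def sample_landscape(pairs, value_range, num_pieces, num_samples):
    """Sample the first `num_pieces` pieces of a persistence landscape at
    `num_samples` uniformly spaced points in `value_range` (inclusive).

    Returns a num_pieces x num_samples matrix as a list of rows.
    """
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    low, high = value_range
    grid = [low + (high - low) * j / (num_samples - 1)
            for j in range(num_samples)]
    matrix = [[0.0] * num_samples for _ in range(num_pieces)]
    for j, t in enumerate(grid):
        # Triangle values of all pairs at t, largest first.
        tents = sorted(
            (max(0.0, min(t - min(b, d), max(b, d) - t)) for b, d in pairs),
            reverse=True,
        )
        for k, value in enumerate(tents[:num_pieces]):
            matrix[k][j] = value
    return matrix


def pointwise_mean(matrix_a, matrix_b):
    """Pointwise average of two sampled landscapes of the same shape."""
    return [[(x + y) / 2.0 for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(matrix_a, matrix_b)]
```

Because every landscape is sampled on the same grid, two such matrices can be added or averaged entry by entry, which is not possible with two persistence diagrams that contain different numbers of points.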
II-C Persistence Landscape and CNN
A CNN provides a good environment for incorporating the persistence landscape. In common CNNs, a convolution layer uses a number of filters to process an input signal and then summarizes the information by max-pooling over segments of finite length. The assumption here is that these summaries of the segments provide information useful to the subsequent processing. On the other hand, a landscape is computed over a topological space, a cubical complex in our case. With a persistence layer, we can compute the persistence landscape of the same segments of the input signal, thereby offering a different way to characterize the content of the input signal. As the information captured by this persistence layer and a convolution layer might be complementary, we can also concatenate the features derived from the two processing pipelines. This can be easily implemented as their outputs can have the same number of temporal units.
We show the network structure in Figure 3. The structure uses convolution layers at several stages. A convolution layer consists of a convolution sub-layer with a given number of filters and a given convolution size, followed by a max-pooling sub-layer that pools over a given number of units along the time axis. We describe such a layer by the triple (number of filters, convolution size, max-pooling size). The input to the network is a feature map, a matrix with a temporal axis and a feature axis. The feature map is first processed by a stack of convolution layers, referred to as the early convolution layers.
The output of the final early convolution layer is fed into either a middle convolution layer, a persistence layer, or both. A middle convolution layer is an ordinary convolution layer as described above. A network that uses only the middle convolution layer, without the persistence layer, is simply a CNN. A persistence layer processes each filter output of the preceding layer separately, one fixed-length segment at a time. We can view each filter output as the filtering function. A persistence layer has the following fixed parameters: a value range that decides the range over which a persistence landscape is sampled, the number of pieces of a persistence landscape to use, and the number of sample points drawn uniformly from the value range. We can therefore identify a persistence layer by these parameters. The total feature dimension is the product of the number of filters output by the early convolution layers, the number of pieces, and the number of sample points. A network that uses only the persistence layer in this part is referred to as the persistent neural network (PNN). We can also combine the outputs of the middle convolution layer and the persistence layer by concatenation, leading to the persistent convolutional neural network (PCNN). For PCNN, we set the max-pooling size of the middle convolution layer equal to the segment length of the persistence layer to ensure that the temporal scales of the two layers are the same.
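The forward pass of a persistence layer can be sketched as below. This is an illustrative sketch rather than our Theano implementation: the function names are hypothetical, segments are assumed to be non-overlapping, and the birth-death pairs are computed with a simple union-find sweep. Each time step of the output concatenates the sampled landscapes of all filters, so its length is the number of filters times the number of pieces times the number of sample points.

```python
def persistence_pairs(x):
    """(birth, death) pairs of a 1-D segment under a superlevel-set
    filtration; the global maximum is paired with the global minimum."""
    parent, birth, pairs = {}, {}, []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in sorted(range(len(x)), key=lambda i: -x[i]):
        parent[i], birth[i] = i, x[i]
        for j in (i - 1, i + 1):
            if j in parent:                      # neighbor already added
                older, younger = find(i), find(j)
                if older == younger:
                    continue
                if birth[older] < birth[younger]:
                    older, younger = younger, older
                if birth[younger] > x[i]:        # skip zero persistence
                    pairs.append((birth[younger], x[i]))
                parent[younger] = older
    pairs.append((max(x), min(x)))
    return pairs


def landscape_features(pairs, low, high, num_pieces, num_samples):
    """Flattened (pieces x samples) landscape sampled uniformly in [low, high]."""
    grid = [low + (high - low) * j / (num_samples - 1)
            for j in range(num_samples)]
    per_t = [sorted((max(0.0, min(t - d, b - t)) for b, d in pairs),
                    reverse=True) for t in grid]
    feats = []
    for k in range(num_pieces):
        feats.extend(v[k] if k < len(v) else 0.0 for v in per_t)
    return feats


def persistence_layer(feature_map, segment_len, value_range,
                      num_pieces, num_samples):
    """feature_map: list of filters, each a list of frames.
    Returns one feature vector per non-overlapping segment."""
    low, high = value_range
    num_segments = len(feature_map[0]) // segment_len
    output = []
    for s in range(num_segments):
        step = []
        for channel in feature_map:
            segment = channel[s * segment_len:(s + 1) * segment_len]
            step.extend(landscape_features(persistence_pairs(segment),
                                           low, high, num_pieces, num_samples))
        output.append(step)
    return output
```

For example, two filters with eight frames each, a segment length of 4, two pieces, and three sample points give two time steps with 2 x 2 x 3 = 12 features each.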
The output of the previous layer is processed by another stack of convolution layers, referred to as the late convolution layers. The late convolution layers use a convolution size of 1 and a max-pooling size of 1 (i.e., no temporal context and no max-pooling), thereby providing the function of the fully connected layers in conventional CNNs. With this replacement, we can process sequences of arbitrary length [22, 23]. The last late convolution layer is the segment output sub-layer, where the number of units is equal to the number of tags. Its outputs are pooled temporally by the final mean-pooling layer to give a final prediction for the entire music clip.
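To make the role of the size-1 convolutions concrete, the sketch below applies the same dense map to every time step and then averages the per-segment outputs over time. The function names and the sigmoid output activation are assumptions chosen for a multi-label setting; they are not stated details of our model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pointwise_layer(frames, weights, biases, activation):
    """Size-1 convolution: the same dense map applied to each time step.

    frames:  list of time steps, each a list of input features
    weights: one row of input weights per output unit
    """
    return [[activation(sum(w * x for w, x in zip(row, frame)) + b)
             for row, b in zip(weights, biases)]
            for frame in frames]


def clip_prediction(frames, weights, biases):
    """Per-segment tag scores followed by mean pooling over time."""
    per_segment = pointwise_layer(frames, weights, biases, sigmoid)
    num_steps = len(per_segment)
    return [sum(column) / num_steps for column in zip(*per_segment)]
```

Because no operation depends on the number of time steps, the same weights can be applied to clips of any length.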
We put the persistence layer after a convolution layer for two reasons. First, the convolution layers can serve as a dimensionality reduction device, which matters because the persistence operation is more computationally expensive. Second, we want the error signal to be back-propagated through the persistence layer so that the persistence layer influences the learned features.
III-A How Back-propagation Works through the Persistence Layer
Persistence landscapes are constructed from piecewise linear functions. The values of each such function are linear in the birth and death values of its birth-death pair, as shown in Equation (1). A persistence landscape is simply a re-ordering of these function values in its sampled matrix form. Note that the deaths and births are all local extrema of the input to the persistence layer. For an element in a persistence landscape matrix, back-propagation is therefore done through the input elements that own the corresponding birth or death value.
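The routing of gradients can be sketched for a single sample point of the first landscape piece. The sketch assumes the tent form max(0, min(t - d, b - t)) with birth b at least death d, picks the rising branch at the exact midpoint as a subgradient choice, and uses hypothetical function names; `pair_indices` holds, for each pair, the positions in the signal that own the birth and death values.

```python
def tent_with_grad(birth, death, t):
    """Return (value, d value / d birth, d value / d death) of the
    triangle function at t, assuming birth >= death."""
    if t <= death or t >= birth:
        return 0.0, 0.0, 0.0
    if t - death <= birth - t:          # rising branch: value = t - death
        return t - death, 0.0, -1.0
    return birth - t, 1.0, 0.0          # falling branch: value = birth - t


def first_piece_grad(signal, pair_indices, t):
    """Gradient of the first landscape piece at t with respect to the signal.

    Only the pair attaining the largest triangle value receives gradient,
    and it is routed to the samples owning its birth and death values.
    """
    grad = [0.0] * len(signal)
    best = None
    for birth_idx, death_idx in pair_indices:
        value, g_birth, g_death = tent_with_grad(
            signal[birth_idx], signal[death_idx], t)
        if best is None or value > best[0]:
            best = (value, birth_idx, death_idx, g_birth, g_death)
    if best is not None:
        _, birth_idx, death_idx, g_birth, g_death = best
        grad[birth_idx] += g_birth
        grad[death_idx] += g_death
    return grad
```

On the rising side of a triangle the gradient flows to the sample that owns the death value, and on the falling side it flows to the sample that owns the birth value; all other samples receive zero gradient.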
We evaluate on the music auto-tagging dataset MagnaTagATune. It contains multi-label annotations collected from human evaluations of music tags through playing a game. There are 188 tags and 25,863 29-second clips in total. The tags cover instruments, tempo, genres, acoustic properties, etc. We note that there are at least two versions of MagnaTagATune and that some people used an earlier 160-tag version [25, 26]. However, this version is not publicly available. We instead use the current publicly available version and follow the setting employed in [17, 27]. While MagnaTagATune natively has 16 sub-folders with no overlapping artists, we use the 1st–12th sub-folders for model training, the 13th sub-folder for validation and parameter tuning, and the 14th–16th sub-folders for testing. Following [17, 27], we consider the top 50 tags (out of the 188 possible tags) in most experiments.
Moreover, following the convention in the literature, we use the average area under the ROC-curve (AUC) and mean average precision (mean AP; or MAP) as the performance metrics [15, 17]. The evaluation can either be done per-class or per-clip. The per-class result computes AUC or AP for each tag class across all the clips, whereas the per-clip result computes AUC or AP in a clip over all tags.
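The difference between the per-class and per-clip views can be sketched with a small AUC routine. The function names are hypothetical, the AUC is computed by direct pairwise comparison of positive and negative scores, and rows or columns that contain only one label value are skipped; the skipping rule is an assumption of the sketch.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs ranked correctly, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one label value is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_auc(label_rows, score_rows, per="class"):
    """label_rows / score_rows: one row per clip, one column per tag.

    per="class": AUC of each tag across clips, then averaged.
    per="clip":  AUC of each clip across tags, then averaged.
    """
    if per == "class":
        label_rows = list(zip(*label_rows))
        score_rows = list(zip(*score_rows))
    aucs = [roc_auc(y, s) for y, s in zip(label_rows, score_rows)]
    aucs = [a for a in aucs if a is not None]
    return sum(aucs) / len(aucs)
```

The same predictions can therefore score differently under the two views, since one ranks clips within a tag and the other ranks tags within a clip.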
This is to facilitate our comparison with existing work in terms of the learning models, rather than the input features. The features are z-score normalized using parameters
IV-A Model Implementation
We fix the settings for the early and late convolution layers as follows. For the early convolution layers, we use one (64, 8, 4) convolution layer. For the late convolution layers, we use a stack of two (512, 1, 1) convolution layers, followed by a (50, 1, 1) convolution layer for tag prediction outputs.
When persistence landscapes are required, the persistence layer uses a fixed default setting. As there are 64 filters in the early convolution layer, the input to the persistence layer has 64 channels, and the default feature size is 64 times the number of pieces times the number of sample points. In preliminary experiments, we found that the segment size has an effect on the performance of PNN. A slightly better result is achieved by setting the segment size to 64, but similar results can be obtained with 32 or 128. To balance performance and efficiency, we fix the segment size to 32.
In a PCNN, the max-pooling size of the middle convolution layer is equal to the segment size in the persistence layer. Therefore, the outputs of the middle convolution layer and the persistence layer have the same temporal size.
We implement our models with Theano and Lasagne (https://lasagne.readthedocs.org/en/latest/). The CNNs are trained with back-propagation and AdaGrad, with a dropout rate of 0.5 and an initial learning rate of 0.01. Each model is trained for 100 epochs. The parameters from the epoch that gives the best average per-class AUC on the validation set are adopted. For reproducibility, the Python source code will be made publicly available at http:
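For readers unfamiliar with AdaGrad, the update rule can be sketched on plain Python lists. The function name and the value of the small constant `epsilon` are assumptions of this sketch; only the initial learning rate of 0.01 comes from the setting above.

```python
import math


def adagrad_update(params, grads, accum, learning_rate=0.01, epsilon=1e-6):
    """One AdaGrad step, performed in place.

    accum holds the running sum of squared gradients for each parameter,
    so parameters with large past gradients take smaller steps.
    """
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= learning_rate * g / math.sqrt(accum[i] + epsilon)
    return params, accum
```

Repeating the same gradient shrinks the effective step size over time, which is why only an initial learning rate needs to be chosen.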
We compare the performance of CNN, PNN, and PCNN models with different settings. As the performance of a neural network model can be sensitive to initialization values, we execute the network training and evaluation 5 times for each setting and report the average results of the 5 trials. When assessing whether there is a significant performance difference between two models, we apply a one-tailed t-test at the 0.1 significance level to the results of the 5 trials.
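The test statistic for such a comparison can be sketched as follows. The variant of the t-test is not specified above, so the sketch assumes Welch's unequal-variance form, and the function name is hypothetical. The one-tailed p-value is then obtained from the Student t distribution with the returned degrees of freedom, for example with a statistics package.

```python
import math


def welch_t(sample_a, sample_b):
    """Welch's t statistic and degrees of freedom for two samples."""
    na, nb = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    # Unbiased sample variances.
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    se2 = var_a / na + var_b / nb
    t = (mean_a - mean_b) / math.sqrt(se2)
    df = se2 ** 2 / ((var_a / na) ** 2 / (na - 1)
                     + (var_b / nb) ** 2 / (nb - 1))
    return t, df
```

With 5 trials per model, `sample_a` and `sample_b` would each hold the 5 average AUC values of one setting.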
|Model||Mid. conv. size||(i.e.||Average AUC||MAP|
|Number of tags||Model||Average AUC (per-class)||Average AUC (per-clip)||MAP (per-class)||MAP (per-clip)|
|50||Dieleman et al. ||0.898||NA||NA||NA|
|50||Dieleman et al. ||0.8815||NA||NA||NA|
|160||Nam et al. ||0.888||0.956||NA||NA|
First we notice that PNNs do not perform better than CNNs, as shown in Table I. Interestingly, the performance is decreasing as the number of increases. We will get back to this issue later.
With the combination of convolution features and persistence landscape, PCNNs outperform the CNNs significantly. PCNN with achieves the best performance with 0.9013 average per-class AUC and 0.4267 average per-class MAP.
In Table II, we compare the performance of the proposed model with two results from Dieleman and Schrauwen [17, 27]. The proposed model outperforms both of these prior methods. We also present the results of PCNN trained with the top 160 tags and with all 188 tags, together with the performance reported by Nam et al. with 160 tags on the earlier version of MagnaTagATune. We cannot compare the two models fairly, but this gives a reference for the performance with a larger number of tags.
The persistence layer naturally produces features of large dimension; the PCNN has a 6400-dimensional representation in the middle. One may wonder whether the improvement comes purely from the large dimension. This should not be the case. As we can see in Table I, the CNN using 6400 filters is not as good as the PCNN models even though they have the same size in the middle. Furthermore, our analysis shows that the results of CNNs with 200, 400, 800, 1600, 3200, and 6400 filters are not significantly different.
An interesting question is what kind of information persistence landscapes capture in real data. From the definition of the persistence landscape and from simple signals such as those in Figure 2, we conjecture that the persistence landscape, as applied to music signals, may contain information about the number of strong beats or the number of onsets. To verify this, we compute 1) the average onset strength of each clip and 2) for each piece and each clip, the average persistence landscape value over all sample values, all segments, and all 64 filters of the PCNN. The Pearson correlation coefficients between the average onset strength and the average persistence landscape value are 0.7297, 0.9556, 0.9774, 0.9709, and 0.9560 for the five pieces, respectively. They are highly correlated, especially for the later pieces. In contrast, the correlation coefficient between the average onset strength and the average absolute value of the output of the middle convolution layer of the PCNN is 0.011. We may see this property from another perspective. For a given tag, we compute the average persistence values. By arranging the tags in descending order of these values, as shown in Figure 4, we can see that tags associated with stronger beats or faster tempos are on the left.
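The correlation analysis above relies on the Pearson coefficient, which can be sketched as follows. The function name is hypothetical; in the analysis, `xs` would hold the average onset strength of each clip and `ys` the average landscape value of one piece for the same clips.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)
```

The coefficient is undefined when either sequence is constant, so a practical implementation would guard against a zero denominator.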
Different pieces of the persistence landscape contribute to PNN and PCNN differently. To see the contribution of the different pieces to the late convolution layers, we compute the average parameter weights of the connections between the persistence layer and the first late convolution layer in PNN and PCNN, as shown in Table III. We see that the middle pieces are more important for PNN, while the first piece has a larger contribution than the other pieces for PCNN. On the other hand, although one PCNN setting achieves the best performance, the average AUCs of the PCNNs with other settings are also quite high. These two observations raise a question about the usefulness of the later pieces.
We look into the performance tag by tag. It turns out that the PCNNs incorporating later pieces perform consistently better on “classical,” “slow,” “soft,” and “choir,” the tags that characterize gentler music. In contrast, the PCNN that omits the later pieces consistently performs better on tags related to human voices and electronic music. Our conjecture is that the later pieces of the persistence landscape might signify the absence of the more strongly fluctuating part of the signal, which helps the classification of gentler music.
V Discussion and conclusions
In this paper, we have presented a CNN model that incorporates the topological tool persistence landscape. We show empirically that the use of a dedicated persistence layer in the middle of a CNN model can perform similarly to a pure CNN model, and that the combination of the two improves the discriminative power. Evaluated on the MagnaTagATune dataset for music auto-tagging, the combined model outperforms the state-of-the-art models, reaching an average per-class AUC of 0.9013, compared with the average AUCs of 0.898 and 0.8815 reported for the two methods of Dieleman et al.
As we observe in Section IV, different parts of the persistence landscape could help the classification in different ways. Using the same setting of persistence layer for all tags might lead to sub-optimal results. One way to utilize this property is to train multiple models and then combine the results.
In this paper, we only use the information of 0-dimensional homology classes, which characterize connected components, as we assume 2-connectivity. This allows an efficient implementation of the persistence landscape. However, it also loses the information of higher-order homology classes. An interesting direction for future work is to re-formulate the complex to account for higher-order shape information.
- [1] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, ser. Studies in Computational Intelligence. Springer, 2012, vol. 385.
- [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
- [3] H. Edelsbrunner and J. Harer, “Persistent homology – a survey,” Contemporary Mathematics, vol. 453, pp. 257–282, 2008.
- [4] H. Edelsbrunner, D. Letscher, and A. Zomorodian, “Topological persistence and simplification,” Discrete and Computational Geometry, vol. 28, no. 4, pp. 511–533, 2002.
- [5] G. Carlsson, A. Zomorodian, A. Collins, and L. Guibas, “Persistence barcodes for shapes,” in Proc. Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, 2004, p. 124.
- [6] P. Bubenik, “Statistical topological data analysis using persistence landscapes,” Journal of Machine Learning Research, vol. 16, no. 1, pp. 77–102, 2015.
- [7] P. Bubenik and P. Dlotko, “A persistence landscapes toolbox for topological statistics,” Journal of Symbolic Computation, 2016.
- [8] J. Reininghaus, S. Huber, U. Bauer, and R. Kwitt, “A stable multi-scale kernel for topological machine learning,” in
- [9] C. Li, M. Ovsjanikov, and F. Chazal, “Persistence-based structural recognition,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, 2014, pp. 2003–2010.
- [10] K. A. Brown and K. P. Knudson, “Nonlinear statistics of human speech data,” Int. Journal of Bifurcation and Chaos, vol. 19, no. 07, pp. 2307–2319, 2009.
- [11] M. G. Bergomi, “Dynamical and topological tools for (modern) music analysis,” Ph.D. dissertation, Université Pierre et Marie Curie - Paris VI, 2015.
- [12] S. Emrani, H. Chintakunta, and H. Krim, “Real time detection of harmonic structure: A case for topological signal analysis,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2014, pp. 3445–3449.
- [13] O. Abdel-Hamid, A. R. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,” IEEE/ACM Trans. Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, 2014.
- [14] K. Ullrich, J. Schlüter, and T. Grill, “Boundary detection in music structure analysis using convolutional neural networks,” in Proc. Int. Soc. Music Info. Retrieval Conf., 2014, pp. 417–422.
- [15] D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet, “Semantic annotation and retrieval of music and sound effects,” IEEE Trans. Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 467–476, 2008.
- [16] D. Tingle, Y. E. Kim, and D. Turnbull, “Exploring automatic music annotation with acoustically-objective tags,” in Proc. ACM Int. Conf. Multimedia Information Retrieval, 2010, pp. 55–61.
- [17] S. Dieleman and B. Schrauwen, “Multiscale approaches to music audio feature learning,” in Proc. Int. Soc. Music Info. Retrieval Conf., 2013, pp. 116–121.
- [18] H. Wagner, C. Chen, and E. Vuçini, Efficient Computation of Persistent Homology for Cubical Data. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 91–106.
- [19] T. Kaczynski, K. Mischaikow, and M. Mrozek, Computational Homology. Springer Science & Business Media, 2004, vol. 157.
- [20] B. T. Fasy, F. Lecci, A. Rinaldo, L. Wasserman, S. Balakrishnan, and A. Singh, “Confidence sets for persistence diagrams,” Annals of Statistics, vol. 42, no. 6, pp. 2301–2339, 2014.
- [21] R. Kwitt, S. Huber, M. Niethammer, W. Lin, and U. Bauer, “Statistical topological data analysis – a kernel perspective,” in Proc. Advances in Neural Information Processing Systems, 2015, pp. 3070–3078.
- [22] D. Pathak, E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional multi-class multiple instance learning,” arXiv preprint arXiv:1412.7144, 2014.
- [23] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free? Weakly-supervised learning with convolutional neural networks,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 685–694.
- [24] E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, “Evaluation of algorithms using games: The case of music tagging,” in Proc. Int. Soc. Music Info. Retrieval Conf., 2009.
- [25] P. Hamel, S. Lemieux, Y. Bengio, and D. Eck, “Temporal pooling and multiscale learning for automatic annotation and ranking of music audio,” in Proc. Int. Soc. Music Info. Retrieval Conf., 2011, pp. 729–734.
- [26] J. Nam, J. Herrera, and K. Lee, “A deep bag-of-features model for music auto-tagging,” arXiv preprint, pp. 1–12, 2015.
- [27] S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 2014, pp. 6964–6968.
- [28] B. McFee, M. McVicar, C. Raffel, D. Liang, O. Nieto, E. Battenberg, J. Moore, D. Ellis, R. Yamamoto, R. Bittner, D. Repetto, P. Viktorin, J. F. Santos, and A. Holovaty, “librosa: 0.4.1,” 2015. [Online]. Available: http://dx.doi.org/10.5281/zenodo.32193
- [29] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” The Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.