Topological Approaches to Deep Learning

by   Gunnar Carlsson, et al.

We perform topological data analysis on the internal states of convolutional deep neural networks to develop an understanding of the computations that they perform. We apply this understanding to modify the computations so as to (a) speed up computations and (b) improve generalization from one data set of digits to another. One byproduct of the analysis is the production of a geometry on new sets of features on data sets of images, and we use this observation to develop a methodology for constructing analogues of CNN's for many other geometries, including the graph structures constructed by topological data analysis.




1 Introduction

Deep neural networks [10] are a powerful and fascinating methodology for solving problems with large and complex data sets. They use directed graphs as a template for very large computations, and have demonstrated a great deal of success in the study of various kinds of data, including images, text, time series, and many others. One issue that restricts their applicability, however, is the fact that it is not understood in any kind of detail how they work. A related problem is that there is often a certain kind of overfitting to particular data sets, which results in the possibility of so-called adversarial behavior, where they can be made to fail by making very small changes to image data that are almost imperceptible to a human. For these reasons, it is very desirable to develop methods for gaining understanding of the internal states of the neural networks. Because of the very large number of nodes (or neurons), and because of the stochastic nature of the optimization algorithms used to train the networks, this is a problem in data analysis, specifically in unsupervised data analysis. The initial goal of the work in this paper was to perform topological data analysis (TDA) on the internal states of the neural nets being trained on image data to demonstrate that TDA can provide this kind of insight, as well as to understand to what extent the neural net recapitulates known properties of the mammalian visual pathway. We have carried out this analysis, and the results are reported in Section 4. We show that our findings are quite consistent with the data analytic results on image patches in natural images obtained in [2]. In addition, we are able to study the learning process in one example, and also to study a very deep pre-trained neural network, with interesting results which clarify the roles played by the different layers in the network.

Having performed these experiments, we became interested in the question of how to apply the knowledge obtained from our study to deep learning more generally. In particular, we asked how one might generalize the convolutional neural net (CNN) construction to other data sets, so as to obtain methods for constructing efficient nets that are well adapted to other large classes of data sets, or individual data sets. We found that the key idea from the image CNN construction is the fact that the set of features (pixels) is endowed with a geometry, which can be encoded in a metric, coming from the grid in which the pixels are usually arranged. However, in most data sets, one has one or more natural notions of distance between features, and generalizations based on such metrics appeared to be a potentially very powerful source of methods for constructing neural nets with restrictions on the connections based on such a metric. The idea of studying geometric properties of features has been foreseen by M. Robinson in [11] under the heading of topological signal processing. The second goal for us in this paper, then, is to introduce a mathematical formalism for constructing neural network structures from metric and graph based information on the feature space of a data set. We also find that this formalism simplifies and makes precise the specification of neural networks even while using standard methods. In Section 5.2 we evaluate the improvements possible from the very simplest application of this idea. The improvements come in two directions. The first is in speeding up the learning process. The training of neural nets can be quite a time consuming process, and it is clearly desirable to lower the cost (in time) of training. We found that the methods were more effective on more complex data sets, which is encouraging. A second kind of improvement is in the direction of generalization.
When training on image data sets, it is standard procedure to select two subsets of the data set, one the training set and the other the test set. The network is trained on the training set, and accuracy is evaluated on the test set. This procedure is designed to guard against overfitting, and the accuracy often achieves very impressive numbers. However, one can consider the problem of training on one data set of images and evaluating on an entirely different data set. For example, there are two familiar data sets consisting of images of digits, one MNIST [7] and the other SVHN [16]. The first is a relatively “clean” data set; the second is obtained from images of numbers for the addresses of houses. One could attempt to train on MNIST and evaluate accuracy not on a different subset of MNIST, but rather on SVHN. Surprisingly, this process yields abysmal results, with an accuracy very close to that achieved by random selection of classifications. We demonstrate that by the use of the methods we have discussed one can improve the accuracy significantly, although still not to an acceptable level. This suggests that further application of the methods could give much improved generalization.

We identify three separate scenarios giving rise to geometric information about the feature space. The first is where by its very construction, a set of features is equipped with a geometric structure. Typical examples of this situation are images or time series, where, for example, the pixels (features of images) are designed with a rectangular geometry in mind. The second is where a geometry is obtained from studies such as that performed in [2]. Finally, there is a situation where one is given a more or less general data matrix with numerical entries, and imposes a metric on it via standard choices of metric such as Euclidean, Hamming, etc. Once this has been done, it is important to be able to compress this geometric information into a smaller representation, something which can be achieved by the Mapper construction [12].

We believe that the study of the geometry of the feature space attached to various kinds of data sets will be a very powerful tool that can inform the construction and improve the performance of neural networks. Additionally, because we have incorporated geometric methods in the constructions, we also believe that our formalism opens the door to more sophisticated, detailed, and nuanced mathematical analysis of neural networks.

2 Neural Nets

This section will introduce feed-forward neural nets as well as the special case of convolutional neural nets (CNN’s).

Definition 2.1

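By a feed-forward system of depth $r$ we will mean a directed acyclic graph $\Gamma$ with vertex set $V(\Gamma)$ with the following properties.

  1. $V(\Gamma)$ is decomposed as a disjoint union $V(\Gamma) = V_0(\Gamma) \sqcup V_1(\Gamma) \sqcup \cdots \sqcup V_r(\Gamma)$.

  2. If $v \in V_i(\Gamma)$, then every edge of the form $(v, w)$ of $\Gamma$ has $w \in V_{i+1}(\Gamma)$.

  3. The nodes in $V_0(\Gamma)$ (respectively $V_r(\Gamma)$) are called initial nodes (respectively terminal nodes).

  4. We assume that for every non-initial node $w$, there is at least one $v$ so that $(v, w)$ is an edge in $\Gamma$.

  5. For each vertex $v$ of $\Gamma$, we denote by $\Gamma(v)$ (respectively $\Gamma^{-1}(v)$) the set of all vertices $w$ of $\Gamma$ so that $(v, w)$ (respectively $(w, v)$) is an edge of $\Gamma$.

The sets $V_i(\Gamma)$ are referred to as the layers of the feed-forward system. We say that a layer $V_i(\Gamma)$ is locally finite if the sets $\Gamma^{-1}(v)$ are finite for all $v \in V_i(\Gamma)$. By a sub-feed-forward system of a feed-forward system $\Gamma$ of depth $r$, we mean a directed subgraph $\Gamma_0 \subseteq \Gamma$ so that the graph $\Gamma_0$ and the families of vertices $V_i(\Gamma) \cap V(\Gamma_0)$ themselves form a feed-forward system. In particular, it must be the case that for each non-initial vertex $v$ of $\Gamma_0$, the set $\Gamma_0^{-1}(v)$ must be non-empty.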
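As an illustration of Definition 2.1, the following Python sketch checks the layer conditions on a small finite graph. The representation (a list of vertex sets together with a set of edge pairs) and the names `check_feed_forward` and `incoming` are assumptions made for this example; they are not notation from the paper.

```python
from typing import Dict, Hashable, List, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def check_feed_forward(layers: List[Set[Hashable]], edges: Set[Edge]) -> bool:
    """Return True if the layers and edges satisfy properties 1, 2 and 4."""
    # Property 1: the layers must be pairwise disjoint.
    seen: Set[Hashable] = set()
    for layer in layers:
        if seen & layer:
            return False
        seen |= layer
    index: Dict[Hashable, int] = {
        v: i for i, layer in enumerate(layers) for v in layer
    }
    # Property 2: every edge runs from some layer i to layer i + 1.
    for v, w in edges:
        if v not in index or w not in index or index[w] != index[v] + 1:
            return False
    # Property 4: every non-initial node has at least one incoming edge.
    targets = {w for _, w in edges}
    return all(v in targets for layer in layers[1:] for v in layer)


def incoming(edges: Set[Edge], w: Hashable) -> Set[Hashable]:
    """The set of vertices v with an edge (v, w), as in property 5."""
    return {v for v, t in edges if t == w}


layers = [{"a", "b"}, {"c"}, {"d"}]
edges = {("a", "c"), ("b", "c"), ("c", "d")}
is_valid = check_feed_forward(layers, edges)
```

Acyclicity does not need a separate check here, since edges that only run from one layer to the next cannot form a cycle.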

Remark 2.1

Note that we do not assume that $V(\Gamma)$ is finite. It is sometimes useful to use infinite feed-forward systems as idealized constructions that contain the useful finite systems.

Remark 2.2

We have described only the simplest kinds of structures used in neural nets. There are many others, which can also be described using the methodology we are introducing, but we leave them to future work.

It is also useful to have a slightly different point of view on feed-forward systems. Recall that a correspondence from a set $X$ to a set $Y$ is a subset $C \subseteq X \times Y$. It is clear that one can compose correspondences, and for any correspondence $C$ we will write $C(x) = \{y \in Y : (x, y) \in C\}$ and $C^{-1}(y) = \{x \in X : (x, y) \in C\}$. We also say that a correspondence is surjective if $C^{-1}(y) \neq \emptyset$ for all $y \in Y$. These notions are familiar, but we give some particular examples that will be relevant for the construction of convolutional neural networks.

Example 2.1

Given any two sets $X$ and $Y$, we have the complete correspondence $C_c(X, Y)$, defined by $C_c(X, Y) = X \times Y$.

Example 2.2

Given any map of sets $f : X \to Y$, we have the functional correspondence $C_f$ attached to $f$, defined to consist of the points in the graph of $f$, i.e. $C_f = \{(x, f(x)) : x \in X\}$.

Example 2.3

Let $C \subseteq X \times Y$ and $C' \subseteq X' \times Y'$. We define the product correspondence

$$C \times C' \subseteq (X \times X') \times (Y \times Y')$$

by the requirement that $((x, x'), (y, y')) \in C \times C'$ if and only if $(x, y) \in C$ and $(x', y') \in C'$.

Example 2.4

Let $X$ be a metric space, with distance function $d$. Suppose further that we are given a non-negative threshold $r$. Then we define $C_d(r)$, the metric correspondence with threshold $r$ from $X$ to itself, by $C_d(r) = \{(x, x') : d(x, x') \le r\}$. It will occasionally be useful to permit the definition of metric spaces to include the possibility of infinite values. The three axioms of metric spaces extend in a natural way to this generality.

Example 2.5

Let $G$ be a graph, with vertex set $V(G)$. Then the graphical correspondence $C_G \subseteq V(G) \times V(G)$ is defined by $(v, w) \in C_G$ if and only if $(v, w)$ is an edge in $G$.
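The examples above can be made concrete by representing a correspondence as a Python set of pairs. This is a minimal sketch under that assumption; the function names are hypothetical, and the use of a non-strict inequality in `metric_corr` is a choice made for the example.

```python
from itertools import product


def complete(X, Y):
    """Complete correspondence: every x is related to every y."""
    return set(product(X, Y))


def functional(f, X):
    """Functional correspondence: the graph of the map f on X."""
    return {(x, f(x)) for x in X}


def product_corr(C, D):
    """Product correspondence from X x X' to Y x Y'."""
    return {((x, xp), (y, yp)) for (x, y) in C for (xp, yp) in D}


def metric_corr(X, dist, r):
    """Metric correspondence: pairs of points at distance at most r."""
    return {(x, y) for x in X for y in X if dist(x, y) <= r}


def graphical(edge_list):
    """Graphical correspondence: the edge set of a graph."""
    return set(edge_list)


def compose(C, D):
    """Composite of C (from X to Y) followed by D (from Y to Z)."""
    return {(x, z) for (x, y) in C for (y2, z) in D if y == y2}


def image(C, x):
    """All y related to x under the correspondence C."""
    return {y for (a, y) in C if a == x}


line = range(4)
near = metric_corr(line, lambda a, b: abs(a - b), 1)
```

Here `near` relates each of the integers 0 to 3 to itself and to its immediate neighbours on the line.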

We now give the definition of a kind of object that is completely equivalent to a feedforward system.

Definition 2.2

Let $\underline{r}$ denote the totally ordered set $\{0, 1, \ldots, r\}$ regarded as a category. By a generator for an $r$-layer feed-forward system, we will mean a functor $F$ from the category $\underline{r}$ to the category of finite sets and correspondences. The associated feed-forward system has as its vertex set the disjoint union $\coprod_{i} F(i)$, and there is a connection from $v \in F(i)$ to $w \in F(j)$ if and only if (1) $j = i + 1$ and (2) $(v, w) \in F(i \to i + 1)$.
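A generator can be sketched in code as a list of finite sets together with the correspondences between consecutive sets, from which the associated feed-forward system is assembled. The function name `system_from_generator` and the tagging of each vertex with its layer index (used here to keep the union disjoint) are assumptions of this example.

```python
def system_from_generator(sets, corrs):
    """Build layers and edges from a generator.

    sets:  list of r + 1 finite sets, one per layer.
    corrs: list of r correspondences; corrs[i] is a set of pairs (x, y)
           with x in sets[i] and y in sets[i + 1].
    """
    if len(corrs) != len(sets) - 1:
        raise ValueError("need exactly one correspondence per consecutive pair")
    # Tag each element with its layer index so the union is disjoint.
    layers = [{(i, x) for x in s} for i, s in enumerate(sets)]
    # Edges only join layer i to layer i + 1.
    edges = {
        ((i, x), (i + 1, y))
        for i, corr in enumerate(corrs)
        for (x, y) in corr
    }
    return layers, edges


sets = [{"p", "q"}, {"h"}, {"out"}]
corrs = [{("p", "h"), ("q", "h")}, {("h", "out")}]
layers, edges = system_from_generator(sets, corrs)
```

Only the correspondences between consecutive layers are stored, since a functor on a totally ordered set is determined by its values on consecutive steps.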

Feed-forward systems are used to describe and specify certain computations. The nodes $v$ are considered variables, so they will be assigned numerical values which we call $x_v$. The nodes in the $0$-th or initial layer are regarded as input variables, so they are in one to one correspondence with variables that are attached to a data set.

Definition 2.3

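By an activator, we will mean a triple $(\mu, \Lambda, f)$, where $\mu$ is a commutative semigroup structure on $\mathbb{R}$, $\Lambda$ is a subsemigroup of the multiplicative semigroup of $\mathbb{R}$, and $f : \mathbb{R} \to \mathbb{R}$ is a function, which we call the cutoff function. Given a feed-forward structure $\Gamma$, an activation system for $\Gamma$ is a choice of an activator $A_v = (\mu_v, \Lambda_v, f_v)$ for each non-initial vertex $v$ of $\Gamma$. A coefficient system for a feed-forward system $\Gamma$ and activation system $A$ is a choice of element $\lambda_{(v,w)} \in \Lambda_w$ for each edge $(v, w)$ of $\Gamma$.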

Remark 2.3

Typically we use only a small number of distinct activators, and also assign all the nodes in a given layer the same activator. For purposes of this paper, the only semigroup structures on $\mathbb{R}$ we use are the additive structure and the commutative operation $(x, y) \mapsto \max(x, y)$. Also, for the purposes of this paper, the only choices for $\Lambda$ will be either all of $\mathbb{R}$ or $\{1\}$, but in other contexts there might be other choices. The cutoff function may be chosen to be the identity, but in general it is a continuous approximation of a function that is zero below a threshold and 1 above it. The ring $\mathbb{R}$ can be replaced by other rings, such as the field with two elements, which can be useful in Boolean calculations.

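We now wish to use this data to construct functions on the input data. We assume we are given a locally finite feed-forward structure $\Gamma$, equipped with an activation system $A$, whose activator at a vertex $w$ we write $(\mu_w, \Lambda_w, f_w)$, and a coefficient system $\lambda$. For each $i$, with $0 \le i \le r$, we set $W_i$ equal to the real vector space of functions from $V_i(\Gamma)$ to $\mathbb{R}$. We now define a function $\psi_i : W_{i-1} \to W_i$, for $1 \le i \le r$, on a function $g \in W_{i-1}$ by

$$\psi_i(g)(w) = f_w\Big(\sum_{(v, w)} \lambda_{(v,w)}\, g(v)\Big).$$

Note that the sum is computed using the monoid structure $\mu_w$, and is taken over all edges of $\Gamma$ with terminal vertex $w$. This set is finite by the local finiteness hypothesis. We have now constructed functions $\psi_i$ for all $1 \le i \le r$, and therefore can construct the composite

$$\psi_r \circ \psi_{r-1} \circ \cdots \circ \psi_1$$

from $W_0$ to $W_r$, i.e. a function from the input set to the output set.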
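The layer-by-layer evaluation just described can be sketched as follows. This is an illustrative implementation under assumptions: each non-initial node carries a pair consisting of a binary operation (addition or `max`) and a cutoff function, and the dictionary-based representation and the names `evaluate_layer` and `relu` are hypothetical.

```python
import operator
from functools import reduce


def relu(t):
    """Rectifier cutoff function."""
    return max(0.0, t)


def identity(t):
    return t


def evaluate_layer(values, in_edges, coeffs, activators):
    """Compute the values on one layer from the values on the previous one.

    values:     dict mapping each node v of the previous layer to a number.
    in_edges:   dict mapping each node w to the list of v with an edge (v, w).
    coeffs:     dict mapping each edge (v, w) to its coefficient.
    activators: dict mapping each node w to (binary_operation, cutoff).
    """
    out = {}
    for w, sources in in_edges.items():
        combine, cutoff = activators[w]
        terms = [coeffs[(v, w)] * values[v] for v in sources]
        # Every non-initial node has an incoming edge, so terms is non-empty.
        out[w] = cutoff(reduce(combine, terms))
    return out


values = {"a": 1.0, "b": -2.0}
in_edges = {"w": ["a", "b"]}

# Additive node with rectifier cutoff: relu(2 * 1 + 1 * (-2)).
summed = evaluate_layer(
    values, in_edges,
    {("a", "w"): 2.0, ("b", "w"): 1.0},
    {"w": (operator.add, relu)},
)

# Max node with all coefficients equal to 1, as in max pooling.
pooled = evaluate_layer(
    values, in_edges,
    {("a", "w"): 1.0, ("b", "w"): 1.0},
    {"w": (max, identity)},
)
```

Applying `evaluate_layer` once per layer, in order, gives the composite function from the input layer to the output layer.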

The final requirement is the choice of a loss function. Given a set of points $D$ in the space of inputs, and a function $F$ from $D$ to the space of outputs, the goal of deep learning methods is to construct a function as above that best approximates the function $F$ in a sense which has yet to be defined. If the function $F$ is viewed as a continuous function to the vector space of outputs, then finding the best approximation is quite reasonable, and the distance from the approximating function to $F$ will be defined to be the loss function. If, however, the output function is categorical, i.e. has a finite list of values, then it is often the case that the possible outputs are identified with the vertices in the standard $(n-1)$-simplex in $\mathbb{R}^n$, and other loss functions are more appropriate. The output function still takes continuous values, and the goal becomes to fit a continuous function to the discrete one. One could of course do this directly, but it has been found that fitting certain transformations of the continuous function performs better. One very common choice is the following. Suppose that from the construction of the neural net, it is known that the values of the neurons in the terminal layer are always positive real numbers. Define $\sigma : \mathbb{R}_{+}^n \to \mathbb{R}^n$ by

$$\sigma(x_1, \ldots, x_n) = \Big(\frac{x_1}{\sum_i x_i}, \ldots, \frac{x_n}{\sum_i x_i}\Big).$$

The function $\sigma$ takes its values in the standard simplex. The softmax function is the composite $\sigma \circ \exp$, where $\exp$ denotes the function $(x_1, \ldots, x_n) \mapsto (e^{x_1}, \ldots, e^{x_n})$ from $\mathbb{R}^n$ to $\mathbb{R}_{+}^n$. A standard procedure for fitting a continuous function to a function with discrete values is to minimize the error of the transformed function

where $n$ is the number of neurons in the output layer. This notion of loss or error is referred to as the softmax loss function.
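The description of softmax as a composite (coordinatewise exponential followed by normalization onto the standard simplex) can be written out directly. This is a minimal sketch with hypothetical function names; it covers only the softmax map itself, not the loss.

```python
import math


def normalize(xs):
    """Divide positive coordinates by their sum, landing in the simplex."""
    total = sum(xs)
    return [x / total for x in xs]


def exp_coordinatewise(xs):
    """Apply the exponential to each coordinate."""
    return [math.exp(x) for x in xs]


def softmax(xs):
    """Softmax as the composite of the two maps above."""
    return normalize(exp_coordinatewise(xs))


probabilities = softmax([1.0, 2.0, 3.0])
```

The output coordinates are positive and sum to 1, and the ordering of the inputs is preserved. Practical implementations usually subtract the maximum input before exponentiating to avoid overflow; that step is omitted here to keep the composite visible.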

Deep learning proceeds to minimize the chosen loss function of the difference between the function computed by the network and a given function $F$ over the possible choices of the coefficients $\lambda$, using a stochastic variant of the gradient descent method. Note that $F$ is typically empirically observed; it is not given as a formula. The optimization process is often time consuming, and occasionally becomes stuck in local optima. We refer to a feed-forward system equipped with an activation system as a neural net.

Definition 2.4

Consider a locally finite feed-forward system $\Gamma$, possibly infinite, equipped with an activation system $A$. Let $\Gamma_0 \subseteq \Gamma$ be a sub-feed-forward system. If $A$ is an activation system on $\Gamma$, then it is clear that its restriction $A|_{\Gamma_0}$ to $\Gamma_0$ is an activation system for $\Gamma_0$ and that similarly, a coefficient system on $\Gamma$ restricts to a coefficient system on $\Gamma_0$. We will call the neural net $(\Gamma_0, A|_{\Gamma_0})$ the restriction of the neural net $(\Gamma, A)$ to $\Gamma_0$.

There is an additional kind of structure on a feed-forward system that is particularly useful for data sets of images, as well as other types of data.

Definition 2.5

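By a convolutional structure on a layer $V_i(\Gamma)$ in a feed-forward system $\Gamma$ we mean a pair $(\sim, \varphi)$, where $\sim$ is an equivalence relation on the set of vertices of $V_i(\Gamma)$, and where $\varphi$ is an assignment of a bijection

$$\varphi_{(v,w)} : \Gamma^{-1}(v) \to \Gamma^{-1}(w)$$

for any pair $(v, w)$ in $\sim$, satisfying the requirement that $\varphi_{(v,v)} = \mathrm{id}$ and $\varphi_{(w,u)} \circ \varphi_{(v,w)} = \varphi_{(v,u)}$ when defined. An activation system $A$ for $\Gamma$ is said to be adapted to the convolutional structure on a layer $V_i(\Gamma)$ if whenever $v \sim w$, it is the case that $A_v = A_w$. A coefficient system $\lambda$ for the neural net $(\Gamma, A)$ is adapted to a convolutional structure if it satisfies the compatibility requirement that whenever $v \sim w$, then we have

$$\lambda_{(u, v)} = \lambda_{(\varphi_{(v,w)}(u),\, w)}$$

for all $u \in \Gamma^{-1}(v)$.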

Example 2.6

Suppose that a layer $V_i(\Gamma)$ and the layer $V_{i-1}(\Gamma)$ are acted on by a group $G$, and suppose further that for any $v \in V_{i-1}(\Gamma)$ and $w \in V_i(\Gamma)$, $(v, w)$ is an edge in $\Gamma$ if and only if $(gv, gw)$ is an edge for all $g \in G$. Suppose further that the actions on both $V_{i-1}(\Gamma)$ and $V_i(\Gamma)$ are free, so that the only element of $G$ that fixes a node is the identity element. We define an equivalence relation $\sim$ on $V_i(\Gamma)$ by declaring that $v \sim w$ if and only if there is an element $g \in G$ so that $gv = w$. Because of the freeness of the action, $v$ and $w$ determine $g$ uniquely. We define the bijection $\varphi_{(v,w)}$ to be multiplication by $g$. Because the group preserves the directed graph structure in $\Gamma$, $\varphi_{(v,w)}$ does carry $\Gamma^{-1}(v)$ to $\Gamma^{-1}(w)$. The application of this idea to data sets of images uses the group $\mathbb{Z}^2$, whose points correspond to an infinite pixel grid. We call structures defined this way Cayley structures.

The description of a convolutional layer in Example 2.6 is useful in many situations where the group, and therefore the feed-forward system, are infinite. Nevertheless, it is useful to adapt the networks to finite regions in the grid, such as grids within an infinite pixel grid. This fact motivates the following definition.

Definition 2.6

We suppose that we have a feed-forward structure $\Gamma$, a layer $V_i(\Gamma)$ equipped with a convolutional structure $(\sim, \varphi)$, and a sub-feed-forward structure $\Gamma_0 \subseteq \Gamma$. The restriction of the equivalence relation $\sim$ to $V_i(\Gamma_0)$ does give an equivalence relation on $V_i(\Gamma_0)$, but it does not necessarily have the property that the restriction of the bijections $\varphi_{(v,w)}$ to $\Gamma_0^{-1}(v)$ remains a bijection. We will define an equivalence relation $\sim_0$ on $V_i(\Gamma_0)$ by declaring that $v \sim_0 w$ if and only if (a) $v \sim w$ as vertices in $V_i(\Gamma)$ and (b) $\varphi_{(v,w)}$ restricts to a bijection from $\Gamma_0^{-1}(v)$ to $\Gamma_0^{-1}(w)$. This clearly produces a convolutional structure on the layer $V_i(\Gamma_0)$ in the feed-forward structure $\Gamma_0$, which we refer to as the restriction of the convolutional structure on $\Gamma$ to $\Gamma_0$.
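For the pixel grid, a coefficient system adapted to the translation structure is one in which the weight on an edge depends only on the offset between its two endpoints. The sketch below illustrates this weight sharing on a finite grid, computing outputs only at nodes whose whole neighbourhood lies inside the grid, so that the translation between neighbourhoods is a bijection. The names `valid_nodes` and `shared_layer` and the dictionary representation are assumptions of this example.

```python
def valid_nodes(grid, offsets):
    """Nodes q whose translated neighbourhood q + offset stays in the grid."""
    return [
        q for q in grid
        if all((q[0] + di, q[1] + dj) in grid for di, dj in offsets)
    ]


def shared_layer(image, kernel):
    """Evaluate an additive layer whose coefficients are shared by offset.

    image:  dict mapping grid positions (i, j) to numbers.
    kernel: dict mapping offsets (di, dj) to the shared coefficient.
    """
    offsets = list(kernel)
    out = {}
    for q in valid_nodes(image, offsets):
        out[q] = sum(
            kernel[(di, dj)] * image[(q[0] + di, q[1] + dj)]
            for di, dj in offsets
        )
    return out


image = {(i, j): 1.0 for i in range(3) for j in range(3)}
kernel = {(di, dj): 1.0 for di in (-1, 0, 1) for dj in (-1, 0, 1)}
result = shared_layer(image, kernel)
```

On the 3 by 3 grid only the centre node has a full neighbourhood, so `result` has a single entry. Boundary nodes are simply omitted here rather than being given separate weights.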

3 Natural Images and Convolutional Neural Nets

Data sets of images are of a great deal of interest for many problems. For example, the task of recognizing hand drawn digits or letters from images taken of them is a very interesting problem, and an important test case. Neural net technology has been successfully applied to this situation, but in many ways the success is not well understood, and it is believed that it is often due to overfitting. Our goal is to understand the operation of this methodology better, and to use that understanding to improve performance in terms of speed, and of the ability to generalize from one data set to another. In this section we will discuss image data sets, the feed-forward systems that have been designed specifically for them, the extent to which the neural networks act similarly to learning in humans and primates, and how such insights can be used to speed up and improve generalization from one image data set to another.

By an image, we will mean an assignment of numbers (gray scale values) to each pixel of a pixel array, typically arranged in a square. The image can be regarded as an $N$-vector, where $N$ denotes the number of pixels in an array. However, the grid structure of the pixels tells us that there is additional information, namely a geometric structure on the set of coordinates in the vector. It turns out to be useful to build neural nets with a specific structure, reflecting this fact. For simplicity of discussion, it is convenient to build infinite but locally finite models first, and then realize the actual computations on a finite subobject of these infinite constructions, by restricting the support of the activation systems we consider in the optimization. We will be specifying our neural networks by generators. First, we let $\mathbb{Z}$ denote the integers. By $\mathbb{Z}^2$ we will mean the metric space whose elements consist of ordered pairs of integers, and where the distance function is the restriction of the distance on $\mathbb{R}^2$. We of course have the metric correspondences from $\mathbb{Z}^2$ to itself. We will define another family of correspondences called pooling correspondences. For any pair of integers , let denote the intersection of the interval in the real line with the integers. Let denote a positive integer, and define a correspondence to be where is defined by . We have two parameters that are of interest for these correspondences, the stride, which is the integer , and the width, which is the integer . To give a sense of the nature of these correspondences, consider the situation with stride and width both equal to 2, and with . In this case, it is easy to check that the correspondence is given by . In general, if the stride is equal to the width, the correspondence is actually functional, and the corresponding function is to . We’ll write for the -fold product of as a correspondence from to itself.

It will be useful to have a language to describe the layers in a feed-forward system in terms of the generators.

Definition 3.1

Let denote a feed-forward system, with generator . For any , we consider the -th layer as well as the correspondence .

  1. We say the layer is fully connected if is the complete correspondence , as defined in Example 2.1.

  2. We say is grid convolutional if there are sets and , so that is of the form

    where is a metric correspondence as defined in Example 2.4.

  3. We say is pooling if is of the form

Remark 3.1

The reason for taking the product of convolutional or pooling correspondences with complete correspondences is in order to accommodate the idea of including numerous copies of a grid within a layer, but with the understanding that the graph connections between any copy of a grid in and any copy in are identical. This is exactly what the product correspondence achieves.
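The effect described in Remark 3.1 can be seen by forming the product of a metric correspondence on a small grid with a complete correspondence between two sets of channel labels. This is a sketch under assumptions: the maximum-coordinate distance with threshold 1, the 3 by 3 grid, and the channel counts 2 and 3 are illustrative choices, and the function names are hypothetical.

```python
from itertools import product


def linf(p, q):
    """Maximum-coordinate distance between two grid points."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def metric_corr(points, dist, r):
    return {(p, q) for p in points for q in points if dist(p, q) <= r}


def complete(X, Y):
    return set(product(X, Y))


def product_corr(C, D):
    return {((x, a), (y, b)) for (x, y) in C for (a, b) in D}


grid = [(i, j) for i in range(3) for j in range(3)]
spatial = metric_corr(grid, linf, 1)        # connections within one grid
channels = complete(range(2), range(3))     # 2 input copies, 3 output copies
layer_edges = product_corr(spatial, channels)
```

Each node of `layer_edges` is a pair (grid position, channel), and the spatial connection pattern is the same between every input copy and every output copy of the grid.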

We are now in a position to build some convolutional neural networks. We will do so by constructing a generator. The generator is a functor that can be specified by a diagram like the following, where writing denotes a set of cardinality .


To further simplify the description, we note that there is a product decomposition of the functor . For any two functors , we can form the product functor , which is defined to be the pointwise product on objects, and which also forms the product correspondences. It is clear from the description above that the functor we have described decomposes as the functor , where is given by

and by

This kind of decomposition is ubiquitous for neural networks, where there is one functor consisting entirely of complete correspondences. We will say a generator is complete if each of the correspondences is a complete correspondence, and describe generators as , where is a complete correspondence, and will be referred to as the structural generator. We note that a complete correspondence is completely determined by the cardinalities of the sets , and so we specify by its list of cardinalities. We say that the type of a complete generator is the list of integers

and note that the type determines the structure of .

4 Findings

Because of the stochastic nature of the optimization algorithms used in convolutional neural nets, the problem of understanding how they function is a problem in data analysis. What we mean by this is that it is a computational situation where there are outliers which are not meaningful, and a useful analysis must proceed by understanding what the most common (or dense) phenomena are, in a way that permits one to ignore the outliers, which will be sparse. Before diving into the methodology and results of our study, we will talk about earlier work [2] on the statistics of natural images which is quite relevant to our results on convolutional neural nets.

The work in [2] was a study of a data set constructed by Mumford et al in [8] based on a database of images collected by van Hateren and van der Schaaf [13]. The images were taken in Groningen in the Netherlands, and Mumford et al collected a data set consisting of $3 \times 3$ patches, thresholded from below by variance of the patch. Each patch consists of nine gray scale values, one for each pixel. The data was then mean centered, and the contrast (a weighted version of variance) was normalized to have value 1. This means that the data can be viewed as residing on the sphere , a subspace of . Finally, the data was filtered by codensity, a function on the data set defined at a point $x$ to be the distance from $x$ to its $k$-th nearest neighbor. The integer $k$ is a parameter, much as variance is a parameter in kernel density estimators, and the codensity varies inversely with density.

What was done in [2] was to select a threshold value (a percentage) for the codensity computed for a value , and consider only points whose codensity was less than . For example, one might study the set of data points which are among the lowest in codensity, computed for the parameter value . This was carried out in [2] for a threshold value, and for the parameter values and .
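The codensity filter can be sketched in a few lines: compute, for each point, the distance to its k-th nearest neighbour, then keep the fraction of points with the smallest values. This is an illustrative brute-force version, not the procedure of [2]; the function names, the Euclidean distance, and the toy data are assumptions of the example.

```python
import math


def codensity(points, k):
    """Distance from each point to its k-th nearest neighbour (k >= 1)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        result.append(dists[k - 1])
    return result


def densest_fraction(points, k, fraction):
    """Keep the given fraction of points with the lowest codensity."""
    rho = codensity(points, k)
    order = sorted(range(len(points)), key=lambda i: rho[i])
    keep = sorted(order[: int(fraction * len(points))])
    return [points[i] for i in keep]


# Four points on a unit square plus one far-away outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
dense_core = densest_fraction(points, k=1, fraction=0.8)
```

Small values of k measure density at a fine scale and large values at a coarse scale, which is why the choice of k changes the resulting topological model. The quadratic cost of this version would need to be replaced by a nearest-neighbour index for data sets of realistic size.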

Figure 1:
Figure 2:

These diagrams were obtained by examining the data following persistent homology computations which showed $\beta_1 = 1$ in the case of Figure 1 and $\beta_1 = 5$ in the case of Figure 2 (note that in the case of Figure 2 the model is not actually three disjoint circles; instead each of the secondary circles intersects the primary circle in two data points). The work in [2] went further and found more relaxed thresholds that yielded a Klein bottle instead of just a one skeleton, indicating that more is going on. It meant that the data set actually included arbitrary rotations of the two secondary circles in Figure 2. The original motivation for the work in [13] and [8] was to understand if analysis of the spaces of patches in natural images is reflected in the “tuning” of individual neurons in the primary visual cortex. We set out to determine if the statistical analysis of [2] has a counterpart in convolutional neural networks for the study of images. The following are insights we have obtained.

  • The role of thresholding by density or proxies for density is crucial in any analysis of this kind. Without it, a very small number of outliers can drastically alter the topological model from something giving insight to something essentially useless.

  • The development of neural networks was based on the idea that neural networks are analogous to networks of neurons in the brains of mammals. There is an understanding [5] that the primary visual cortex acts as a detector for edges and lines, and also that higher level components of the visual pathway detect more complex shapes. We perform an analysis analogous to the one in [2], and show that it gives results consistent with the density analysis performed there.

  • We demonstrate that our observations can be used to improve the ability of a convolutional neural network to generalize from one data set to another.

  • We demonstrate that the results can be used to speed up the learning process on a data set of images.

We next describe the way that the data analysis was performed. We suppose that we have fixed an architecture for a convolutional neural network analysis of a data set of images, using grid layers as described in Section 3. We used an architecture in which the correspondences described the connections into a convolutional layer, where is the metric on the grids. This means that any node in a grid layer is connected to the nodes which belong to a patch surrounding it. The weights therefore constitute a vector in , which corresponds exactly to raw data used in [8]. The data points will be referred to as weight vectors. In [4], we performed analyses on data sets constructed this way using a methodology identical to that carried out in [8] and [2]. The rest of this section will describe the results of this study.
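The preprocessing step that turns a raw weight vector into a point on a sphere (mean centering followed by normalization) can be sketched as below. This is only an approximation of the methodology referred to above: the plain Euclidean norm is used here as a stand-in for the contrast norm of [8], and the function name is hypothetical.

```python
import math


def normalize_weight_vector(w):
    """Mean-center a weight vector and scale it to unit Euclidean norm.

    Returns None for a constant vector, which has no direction once
    the mean is removed.
    """
    mean = sum(w) / len(w)
    centered = [x - mean for x in w]
    norm = math.sqrt(sum(x * x for x in centered))
    if norm == 0.0:
        return None
    return [x / norm for x in centered]


unit = normalize_weight_vector([1.0, 2.0, 3.0])
```

After this step every retained weight vector has mean zero and norm one, so the collection of weight vectors from many training runs can be treated as a point cloud on a sphere and filtered by codensity.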

We first discuss the two data sets that we studied. The first is MNIST [7], which is a data set of images of hand drawn digits. The images are given as gray scale images. For this data set, we used an architecture described as follows. The depth is , and the generator is a product of two generators, and . The complete factor is of type , and the structural factor has the form


where denotes an grid, denotes a set of cardinality , and the output layer is identified with the ten digits $0, 1, \ldots, 9$. This feed-forward structure embeds as a sub-feed-forward structure of the structure obtained by replacing all the finite grids with copies of $\mathbb{Z}^2$, into which they embed. Therefore, the layers and inherit a convolutional structure from the Cayley convolutional structure (defined in Example 2.6) on , which is the convolutional structure we use. The activation systems are defined using three different activation functions. The first is the rectifier, which denotes the function $x \mapsto \max(0, x)$, and which is often also denoted by ReLU. The second is the identity function and the third is the exponential function $x \mapsto e^x$. The activation system is given on the layers and by , on the layers and by , on the layer by , and on the layer by . The loss function (defined on the layer ) is the function defined in (2–1) above.

We now look at results for the neural net trained on MNIST. Figure 3 shows a Mapper analysis of the data set of weight vectors in the first convolutional layer in the neural net described above. The neural net was trained 100 separate times, and each training consisted of 40,000 iterations of the gradient descent procedure. In each node, one can form the average (in ) of the vectors in that node. The patches surrounding the Mapper model are such averages taken in representative nodes of the model near the given position. We see that the Mapper model is in rough agreement with the circular model in Figure 1 above.

Figure 3: MNIST layer 1

Figure 4: Barcode for layer 1 (panels: Dimension 0 and Dimension 1)

In Figure 4, we see persistence barcodes computed for the data set. The computation confirms the connectedness of the data set as well as the presence of a significant loop, which is a strong indication that the Mapper model is accurately reflecting the structure of the data set. Figure 5 shows a Mapper model of the second convolutional layer. One observes that there appear to be patches which are roughly like those in the primary circle, but the structure is generally more diffuse than what appeared in the first layer. Persistence barcodes did not confirm a loop in this case.

Figure 5: MNIST layer 2

The second data set is CIFAR-10 [6], which is a data set of color images of objects divided into 10 classes, namely airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. The color is encoded using the RGB system, so that each pixel is actually equipped with three coordinates, one for each of the three colors red, green, and blue. There are different options for how to analyze color image data, and we examined three of them.

  1. Reduce the colors to a single gray scale value by taking a linear combination of the three color values, and then analyze the data set as a collection of gray scale images. We used the combination . This choice is one standard choice made for this kind of problem. See for a discussion.

  2. Study the individual color channels separately, producing three separate gray scale data sets, one each for red, green and blue.

  3. Consider all three color channels together, and build a neural network to accommodate that. This means in particular that the input layer will need to include three copies of the grid.

For options (1) and (2), we constructed a neural net very similar to the one used for MNIST. Its complete factor is of type , identical to the one used for MNIST. The structural factor has the form


The generator is identical to the one for MNIST except for the substitution of , and for , and , respectively, and for the substitution of a pooling layer of width as the correspondence between and . The activation systems are identical to those in the MNIST case, as is the loss function. For option (3), it is necessary to form an additional complete factor of type , and form the product as the generator. Of course, the ’s correspond to the set . The activation systems and loss functions are identical in all three cases.

We first performed an analysis in the case of option (1). The results were not as clear as in the MNIST analysis, but did give some indications of interesting structure. In particular, the second layer had the Mapper model shown in Figure 6 below. Notice that the primary circle is included, together with a kind of “bullseye” patch which does not appear even in the Klein bottle model given in [2]. We also analyzed option (3) above. In this case, the result was quite striking. A Mapper model of the first layer appears in Figure 7, which we see recovers the three circle model of [2]; a persistence barcode for this space appears in Figure 8. We also analyzed option (2) above, and found strong primary circles in that case. The findings confirm that, in general, the convolutional neural network reflects well the density analysis in [2], as well as the results on the primary visual cortex given in [5]. Moreover, the detection of the bullseye shown in Figure 6 demonstrates that the higher levels of the neural network find more complex patches, not accounted for by the low level analysis encoded in the Klein bottle of [2]. This is also consistent with our understanding of the visual pathway, in which there are higher level units above the primary visual cortex that capture more “abstract” shapes.

Figure 6: CIFAR-10 layer 2, gray scale
Figure 7: First layer, CIFAR-10, separate colors

Figure 8: Persistence barcode for the Mapper model of Figure 7 (panels: Dimension 0 and Dimension 1)

We also examined the learning process for CIFAR-10. We did this by performing the analysis in the case of option (1) above at various stages of the optimization algorithm. Figure 9 shows the results for both first and second layers. The numbers below the models show the number of iterations corresponding to the models above them. Most of the models shown are “carpets”, which simply reflects the choice of two filter functions for the model. This means that they are not topologically interesting by themselves. However, each node in a Mapper model consists of a collection of data points, and the cardinality of that set becomes a function on the set of vertices of the model. Sub- or superlevel sets of that function can then give interesting information, loosely correlated with density. The models in Figure 9 illustrate this, particularly strongly in the first layer. We note that the first layer, beginning with something near random after 100 iterations, organizes itself into a recognizable primary circle after 200 iterations and remains at that structure until roughly 900 iterations, when the circle begins to “degrade” and instead forms a structure which captures patches more like those of the secondary circles. The second layer, on the other hand, does not demonstrate any strong structure until it has undergone 1000 or 2000 iterations, when one begins to see the primary circle appearing. One could interpret this as a kind of compensation for the changes occurring in the first layer.
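The use of node cardinality as a function on the vertices can be made concrete with a small sketch. The data structures here (a dictionary from node ids to member points, and a list of edges) are hypothetical choices made for illustration and do not describe any particular Mapper implementation.

```python
def superlevel_subgraph(node_members, edges, threshold):
    """Keep the Mapper nodes that hold at least `threshold` data points.

    node_members: dict mapping node id -> collection of data point ids
    edges: iterable of (node id, node id) pairs
    Returns the set of kept nodes and the list of edges between kept nodes.
    """
    kept = {v for v, members in node_members.items() if len(members) >= threshold}
    kept_edges = [(u, v) for (u, v) in edges if u in kept and v in kept]
    return kept, kept_edges


# Toy 3 x 3 "carpet": the center node is sparse, the border nodes are dense.
nodes = [(i, j) for i in range(3) for j in range(3)]
node_members = {v: range(2) if v == (1, 1) else range(10) for v in nodes}
edges = [(u, v) for u in nodes for v in nodes
         if u < v and abs(u[0] - v[0]) + abs(u[1] - v[1]) == 1]

kept, kept_edges = superlevel_subgraph(node_members, edges, threshold=5)
degree = {v: sum(v in e for e in kept_edges) for v in kept}
print(len(kept), len(kept_edges), set(degree.values()))
```

In this toy example the full carpet is contractible, but the superlevel set of the cardinality function is a cycle of eight nodes, which is the kind of circular structure that the density information can reveal.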

Figure 9: CIFAR-10 learning

Finally, we examined a well known pretrained neural network, VGG16, trained on Imagenet, a large image data base [15], [3]. This neural net has 13 convolutional layers, and so permits us to study seriously the “responsibilities” of the various layers.

Figure 10: VGG16

Mapper models of the sets of weight vectors for layers 2-13 are shown in Figure 10. In this case, the neural net has sufficiently many grids in each layer to construct a useful data set from this network alone. Observe that the first two layers give exactly a primary circle, and that after that more complex structures appear. Secondary circle patches occur in layer 4, and in higher layers we see different phenomena occurring, including the bullseye we saw in CIFAR-10, as well as crossings of lines. One interesting line of research would be to assemble all of these different phenomena into a single space, including the Klein bottle. The advantage of doing this is that it will permit feature generation in terms of functions on the space, such as was done in [14], or improved compression algorithms as in [9]. For now, the outcome demonstrates with precision how the higher layers encode higher levels of abstraction in images, as occurs in the mammalian visual pathway.

5 Feature Geometries and Architectures

5.1 Generalities

Since CNN’s have demonstrated a great deal of success on data sets of images, the idea of trying to generalize them suggests itself. To perform the generalization, one must identify what properties of image data sets are being used, and how. There are two key properties.

  • Locality: The features in an image data set (i.e. pixels) are equipped with a geometry, namely that of a rectangular grid. That grid is critical in restricting the connections in the corresponding feed-forward structure, and that restriction can be formulated very simply in terms of the distance function on the grid, as we have seen in our constructions of CNN’s in Section 4. This observation suggests that one can use other metric spaces to restrict the connections in architectures based on these metric spaces. We note that the grid geometry can be regarded as a discretization of the geometry of the plane, or of a square in the plane.

  • Homogeneity: The convolutional neural net is equipped not only with a connection structure, but also with a choice of convolutional structure (as in Definition 2.5), which creates its own restrictions on the features created in the neural net. Because it requires that weight vectors associated with one point in the grid be identical with those constructed at other points, the convolutional property should be interpreted as a kind of homogeneity. In addition to putting drastic limitations on the features being created in the neural net, this restriction encodes a property of image data sets that we regard as desirable, namely that the same object occurring in different parts of an image should be detected in an identical fashion.

What we would like to do is to describe how the two properties above can be used to construct neural nets in an analogous fashion, to improve performance on image data sets and to generalize the ideas to more general data sets. In order to have a notion of locality, we will need to understand data sets in terms of the geometry of their sets of features. We identify at least three methods in which feature sets can obtain a geometry.

  1. A priori geometries: The prime example here is the image situation, where the grid geometry is explicitly constructed in the construction of the data. The continuous version of this geometry is that of the plane. Other examples would include time series, where the a priori continuous geometry is the line, or periodic time series, where the geometry is that of the circle. The geometries for the building of the neural net would be discretizations of these geometries, obtained by selecting discrete subsets, often in a regular way.

  2. Geometries obtained from data analysis: The data analysis performed in [2] or [4] reveals that the frequently occurring local patches in images concentrate around a primary circle, and that these patches are well modeled by particular functions which can be algebraically defined. We will show below that this fact permits the construction of a set of features for images which admit a circular geometry. One could also construct a Klein bottle based set of features and a corresponding Klein bottle based geometry on that set.

  3. Purely data driven geometries: In many situations one does not want to perform a detailed modeling procedure for the set of features, but nevertheless wants to use feature geometries to restrict connections in neural nets which are designed to learn a function based on the features. In this case, one can use the Mapper methodology [12] to obtain discretized versions of geometries on the feature space, well suited to the construction of neural nets.

Section 3 can be regarded as a discussion of one case where an a priori geometry is available, so we will not discuss it further. Instead, we will give examples of data analytically obtained geometries and purely data driven constructions.

5.2 Data-analytically Defined Geometries

We first consider the data analytic work that was done in [2] and [4]. We find that the frequently occurring patches are approximable by discretizations of linear intensity functions onto a grid. To be specific, we regard the pixels in a patch to be embedded in the square , as the subset . The discretization operation can be considered as the restriction of a function on to . We consider the set of linear functions in two variables given by the formulae

The set of functions is parametrized by the circle valued parameter . For each , we can construct a function on an image as follows. Let denote a particular pixel in the grid defining an image data set consisting of images , with denoting the gray scale value of an image within the data set. Given an angle , we now define a function on by the formula

In this case the continuous geometry associated to the feature space for these images is . The discretization will be choosing a rectangular lattice for in the usual way, and by choosing the set of -th roots of unity for the circular factor. So the discretized form is . This set is a metric space in its own right, and we can use the metric correspondences defined in Example 2.4 to construct generators and neural nets based on this geometry.
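To make the construction concrete, the following sketch computes angle dependent features on a small image. It rests on several assumptions that should be read as placeholders and not as the definitions intended in the text: the linear functions are taken to be of the form f_theta(x, y) = x cos(theta) + y sin(theta), the patch is a 3 x 3 grid in the square [-1, 1] x [-1, 1], and the feature at a pixel is the inner product of the surrounding patch with the discretized function. All function names are hypothetical.

```python
import math


def linear_patch(theta, size=3):
    """Discretize f_theta(x, y) = x cos(theta) + y sin(theta) on a
    size x size grid in the square [-1, 1] x [-1, 1] (assumed form and grid)."""
    coords = [-1 + 2 * k / (size - 1) for k in range(size)]
    return [[x * math.cos(theta) + y * math.sin(theta) for x in coords]
            for y in coords]


def roots_of_unity_angles(n):
    """Angles of the n-th roots of unity, the discretized circular factor."""
    return [2 * math.pi * k / n for k in range(n)]


def angular_feature(image, row, col, theta, size=3):
    """Inner product of the patch centered at (row, col) with linear_patch(theta).

    Assumes (row, col) lies at least size // 2 pixels away from the border.
    """
    h = size // 2
    f = linear_patch(theta, size)
    return sum(image[row + dy][col + dx] * f[dy + h][dx + h]
               for dy in range(-h, h + 1) for dx in range(-h, h + 1))


# Toy image whose intensity increases from left to right.
image = [[c for c in range(3)] for _ in range(3)]
feats = [angular_feature(image, 1, 1, t) for t in roots_of_unity_angles(8)]
best = max(range(8), key=feats.__getitem__)
print(best, [round(v, 3) for v in feats])
```

For a horizontal gradient the feature is largest at the angle 0 and vanishes at the perpendicular angle, so the collection of values indexed by the roots of unity records the local orientation of the image at that pixel.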

Remark 5.1

There are similar synthetic models with a Klein bottle replacing . There are natural choices for discretizations of as well.

We have demonstrated that there are methods of imposing locality on new features that have been constructed based on the data analysis of image patches and of weight vectors in convolutional neural nets. For this construction, there are also convolutional structures as defined in Definition 2.5. In fact, they are Cayley structures in the sense of Example 2.6, as we can readily see from the observation that the metric space is equipped with a free and transitive action by the group , and this group action determines a Cayley convolutional structure. This gives a number of possibilities for the construction of new feed-forward systems with feature geometries taken into account. To see how these might look, let’s consider the feed-forward system described in (3–2) above. is broken into a product , where is a complete generator, and the structural factor is given by

The idea will be to construct new structural factors by taking products with generators involving only for various ’s. We’ll call these generators angular factors. The simplest one is of the form

Here denotes a one element set. The corresponding structural factor including the grids would then be

The effect of this modification is simply to use the newly constructed features directly in the computation. It permits the algorithm to use them directly rather than having to “learn” them. Another angular factor is

Forming the product of this angular factor with and ultimately as well produces a feed-forward structure which creates new angular factors in layer . The corresponding neural networks would be able to learn angle dependent functions from earlier angular functions. Yet another angular factor would be the following.

where is the distance from to the primitive root of unity . Adding this angular factor to creates new angular features in layer , allows these angular features to learn from angular and raw features, and further restricts that learning so that a given angular feature would only depend on raw values and angular features in the input that are near to the given feature in the metric on . This is the angular analogue to the idea that a convolutional neural net permits a feature in a convolutional layer to depend only on features in the preceding layer that are spatially close to the given feature, in this case in the a priori geometry on pixel space.
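The angular locality restriction can be sketched as a relation on the indices of the roots of unity. In this illustration the distance is measured as the number of steps between two roots going the shorter way around the circle, which is an assumed normalization; the names `cyclic_distance` and `angular_correspondence` are hypothetical.

```python
def cyclic_distance(j, k, n):
    """Number of steps between the j-th and k-th of the n-th roots of unity,
    going the shorter way around the circle."""
    d = abs(j - k) % n
    return min(d, n - d)


def angular_correspondence(n, radius):
    """Pairs (j, k) such that angular feature k in one layer is allowed to
    feed angular feature j in the next layer."""
    return {(j, k) for j in range(n) for k in range(n)
            if cyclic_distance(j, k, n) <= radius}


n = 8
corr = angular_correspondence(n, radius=1)
print(sorted(k for (j, k) in corr if j == 0))

# Rotating both indices by one step maps the relation to itself, which is the
# homogeneity that a convolutional structure on the circle relies on.
rotated = {((j + 1) % n, (k + 1) % n) for (j, k) in corr}
print(rotated == corr)
```

With radius 1, each angular feature depends on itself and its two neighbors on the circle, in the same way that a pixel in a convolutional layer depends only on a small window of the preceding layer.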

There is also an analogue for to the pooling correspondences defined in Section 3. They are correspondences from , and they are defined by

It is easy to verify that this is well-defined. We have only created analogues to the correspondences from Section 3, but analogues for other values of , and exist as well. We could now construct a new angular factor

which would incorporate pooling in the angular directions as well. Each of these constructions has analogues for the case of the Klein bottle geometries.

We have some preliminary experiments involving the simplest versions of these geometries. We have used them to study MNIST, as well as the SVHN data set [16]. SVHN is another data set of images of digits, collected by taking photographs of house numbers on mailboxes and on houses. For these studies, we have simply modified the feed-forward systems by constructing the product of the existing structural factors described in (4–3) and (4–4) with an additional structural factor of the form


where plus denotes with a disjoint point added. This additional point is there so that we include the original “raw” pixel features. This amounts to including the “angular” coordinates described above as part of the input data, and using it to inform the higher level computations. We have two results, one in the direction of speeding up the learning process and the other concerning the generalization from a network trained on MNIST to SVHN.

  • We found substantial improvement in the training time for both MNIST and SVHN when using the additional angular features. A factor of speed up was realized for MNIST, and a factor of for SVHN. MNIST is a much cleaner and therefore easier data set, and we suspect that the speed up will in general be larger for more complex data sets.

  • We also examined the degree to which a network trained on one data set (MNIST) can achieve good results on another data set (SVHN). Using the standard convolutional network for images, we found that a model trained on MNIST applied to SVHN achieved roughly accuracy. Since there are 10 distinct outcomes, this is essentially the same as selecting a classification at random. However, when we built the corresponding model using the additional factor (5–5) above, we found that the accuracy improved to . Of course, one wants much higher accuracy, but what this finding demonstrates is that this generalization problem can be substantially improved using these methods.

In these examples, we have only used the simplest versions of the constructions we have discussed in Section 2. The possibilities that we envision going forward include taking products with structural factors of the form


The correspondences and for in this feed-forward system are straightforward generalizations of and to the situation where the disjoint base point has been added. is obtained by constructing a metric on for which the distance from the point to each of the elements of , as well as all the distances between adjacent roots of unity, are all equal to . It is not hard to see that this can be done. is the functional correspondence which is equal to on and which carries the point to . The effect of this construction is that it would include angular features at the higher layers, and that it would restrict the angular features that are constructed to include only those which involve nearby angular features in the preceding layers.

5.3 Purely Data Driven Geometries

Suppose that we are given a data set defined by a data matrix , with the rows corresponding to the data points and the columns corresponding to the features, but that we have no theory for the features analogous to the one described in [2]. What we generally have, though, are metrics on the set of features. If the matrix entries are continuous, one can use Euclidean distance of the features viewed as column vectors. There are variants, such as mean centered and/or variance normalized versions, correlation distance, angle distance, etc. If the entries of the matrix are binary, then Hamming distance is an option. In general, it is most often possible to define, in natural ways, metrics on the set of columns. This means that the feature set is a metric space, and therefore that we already have the possibility of carrying out part of the process used on image data sets, namely the construction of the correspondences , where denotes the feature set. We refer to the column space equipped with a metric as the feature space. These can be used to create a counterpart for the initial convolutional layers in the feed-forward system, but they do not give a counterpart to the pooling correspondences. The pooling correspondences are important because they allow one to study features that are more broadly distributed in the geometry of the feature space. To construct deeper networks, one may also need an analogue for higher level convolutional layers. There is an approach using the Mapper methodology introduced in [12] that will directly construct a counterpart to the pooling methodology.
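The distances on the set of columns mentioned above are straightforward to compute. The sketch below is a plain illustration with hypothetical function names; it represents the data matrix as a list of rows and returns the matrix of pairwise distances between features.

```python
import math


def column(matrix, j):
    """The j-th feature, viewed as a column vector."""
    return [row[j] for row in matrix]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def correlation_distance(u, v):
    """1 minus the Pearson correlation; assumes neither column is constant."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cu, cv = [a - mu for a in u], [b - mv for b in v]
    num = sum(a * b for a, b in zip(cu, cv))
    den = math.sqrt(sum(a * a for a in cu) * sum(b * b for b in cv))
    return 1 - num / den


def hamming(u, v):
    """Number of positions in which two binary columns differ."""
    return sum(a != b for a, b in zip(u, v))


def feature_distance_matrix(matrix, dist):
    """Pairwise distances between the columns of a data matrix."""
    m = len(matrix[0])
    cols = [column(matrix, j) for j in range(m)]
    return [[dist(cols[i], cols[j]) for j in range(m)] for i in range(m)]


data = [[1.0, 2.0, 0.0],
        [2.0, 4.0, 1.0],
        [3.0, 6.0, 0.0]]
d_euc = feature_distance_matrix(data, euclidean)
d_cor = feature_distance_matrix(data, correlation_distance)

binary = [[1, 0], [1, 1], [0, 1]]
d_ham = feature_distance_matrix(binary, hamming)
print(d_euc[0][1], d_cor[0][1], d_ham[0][1])
```

Note that the first two columns of `data` are proportional, so they are far apart in Euclidean distance but at correlation distance zero. The choice of distance therefore determines which features count as neighbors in the feature space.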

We recall that the output of Mapper, applied to a finite metric space , is a graph , with an assignment to each vertex of a subset of , having the following two properties:

  1. Every point is contained in for some vertex of .

  2. Two vertices and of are connected by an edge if and only if .

We observe that this means that if we have two Mapper models and on the same metric space , then there is a well-defined correspondence

defined by the property that for , if and only if .

These properties allow us to construct two specific correspondences. Given a metric space and a Mapper model for the feature space of a data matrix, we have the augmentation correspondence , defined by if and only if . We also have the correspondence .
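Both kinds of correspondence can be computed directly from the subsets attached to the vertices. In the sketch below a Mapper model is represented, purely for illustration, as a dictionary from vertex names to sets of points of the underlying space; the function names are hypothetical.

```python
def mapper_correspondence(cover_a, cover_b):
    """Relate vertex v of one Mapper model to vertex w of another when the
    subsets of the underlying space attached to them share a point."""
    return {(v, w) for v, sv in cover_a.items() for w, sw in cover_b.items()
            if set(sv) & set(sw)}


def augmentation_correspondence(cover):
    """Relate a point x to a vertex v when x lies in the subset attached to v."""
    return {(x, v) for v, sv in cover.items() for x in sv}


# Two Mapper models on the same four point space {0, 1, 2, 3}.
fine = {"a": {0, 1}, "b": {1, 2}, "c": {2, 3}}
coarse = {"A": {0, 1, 2}, "B": {2, 3}}

between = mapper_correspondence(fine, coarse)
augment = augmentation_correspondence(fine)
print(sorted(between))
print(sorted(augment))

# Applying the same rule to a single model recovers its edges.
edges = {(v, w) for (v, w) in mapper_correspondence(fine, fine) if v < w}
print(sorted(edges))
```

The last computation reflects property 2 of the Mapper output: relating a model to itself and discarding the diagonal yields exactly the pairs of vertices joined by an edge.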

Remark 5.2

The correspondence is simply the graph correspondence defined in Example 2.5.

To define analogues to pooling correspondences, we need a bit more detail on the Mapper construction. It begins with one or more projections , which we call filters. Typically there are only a small number of ’s, perhaps 1, 2, or 3, and we denote the collection of filters by , where it is understood that is small. We now construct a family of open coverings of the real line.

Definition 5.1

Given a pair of real numbers with , we define the covering to consist of all intervals of the form , . The condition guarantees that the family is a covering. Given a pair , we define the double of to be the pair . covers with intervals of double the length of the intervals comprising . We refer to as the length and as the stride.
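A covering of this kind is easy to work with computationally. The sketch below assumes, as one natural reading of the definition, that the intervals are the open intervals (k * stride, k * stride + length) for integers k, with length greater than stride; this specific form and the function names are assumptions made for illustration.

```python
import math


def covering_indices(x, length, stride):
    """Indices k with x in the open interval (k*stride, k*stride + length)."""
    if not length > stride > 0:
        raise ValueError("need length > stride > 0 for a covering")
    lo = math.floor((x - length) / stride) + 1  # smallest k with k*stride + length > x
    hi = math.ceil(x / stride) - 1              # largest k with k*stride < x
    return list(range(lo, hi + 1))


def double(length, stride):
    """The doubled pair: intervals twice as long, placed twice as far apart."""
    return (2 * length, 2 * stride)


length, stride = 1.5, 1.0
print(covering_indices(2.25, length, stride))
print(covering_indices(2.0, length, stride))

coarse_length, coarse_stride = double(length, stride)
print(covering_indices(2.25, coarse_length, coarse_stride))
```

Because the length exceeds the stride, consecutive intervals overlap and every real number lies in at least one of them. Under the assumed form, the doubled covering is coarser: its k-th interval contains the intervals with indices 2k and 2k + 1 of the original covering, which is what makes it a candidate for a pooling step.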

Let denote the cardinality of , and equip with a total ordering, so . Let denote the product . For each filter , we choose a pair . For each , we let , and let , where is an indexing set for the intervals in . We now construct the product covering of