Deep learning-based algorithms have reached human or superhuman performance in many real-world tasks. Beyond the extreme effectiveness of deep learning, one of the main reasons for its success is that raw data are sufficient (if not even more suitable than hand-crafted features) for these algorithms to learn a specific task. However, only a few attempts have been made to create formal theories allowing for the creation of a controllable and interpretable framework in which deep neural networks can be formally defined and studied. Furthermore, although learning directly from raw data makes it possible to outclass human feature engineering, the architectures of deep networks are growing more and more complex, and are often as task-specific as hand-crafted features used to be.
We aim at providing a general mathematical framework in which any agent capable of acting on a certain dataset (e.g. a deep neural network) can be formally described as a collection of transformations acting on the data. To motivate our model, we assume that data cannot be studied directly, but only through the action of operators that make them measurable. Consequently, our model stems from a functional viewpoint. By interpreting data as points of a function space, it is possible to learn and optimise operators defined on the data. In other words, we are interested in the space of transformations of the data, rather than the data themselves. Albeit unformalised, this idea is not new in deep learning. For instance, the main characteristic of convolutional neural networks lecun1995convolutional is the selection of convolution as the operator of choice to act on the data. The convolutional kernels learned by optimising a loss function are operators that map an image to a new one that, for instance, is more easily classifiable. Moreover, convolutions are operators equivariant with respect to translations (at least in the ideal continuous case). We believe that the restriction to a specific family of operators and the equivariance with respect to interpretable transformations are key aspects of the success of this architecture. In our theory, operators are thought of as instruments allowing an agent to provide a measure of the world, just as the kernels learned by a convolutional neural network allow a classifier to spot the essential features needed to recognise objects belonging to the same category. Equivariance with respect to the action of a group, or of a set of transformations, corresponds to the introduction of symmetries. This allows us both to gain control over the nature of the learned operators and to drastically reduce the dimensionality of the space of operators to be explored during learning. Such a goal is in line with the recent interest in invariant representations in machine learning (cf., e.g., AnRoPo16).
We make use of topological data analysis to describe spaces of group-equivariant non-expansive operators (GENEOs). GENEOs are maps between function spaces associated with groups of transformations. We study the topological and metric properties of the space of GENEOs to evaluate their approximating power and to set the basis for general strategies to initialise and compose operators, and eventually to connect them hierarchically to form operator networks. Our first contribution is to define suitable pseudo-metrics for the function spaces, the equivariance groups, and the set of non-expansive operators. Based on these pseudo-metrics, we prove that the space of GENEOs is compact and convex, under the assumption that the function spaces are compact and convex. These results provide fundamental and provable guarantees for the soundness of this operator-based approach from a machine learning perspective. Compactness, for instance, shows that any operator belonging to a certain space can be approximated, within any prescribed tolerance, by one of finitely many operators sampled in the same space. Our study of the space of GENEOs takes advantage of recent results in topological data analysis, in particular in the theory of persistent homology. This allows us to provide a general framework for previous work on group equivariance in the deep learning context cohen2016group ; worrall2017harmonic. Moreover, this approach generalises standard group-equivariance to set-equivariance, which seems much more suitable for the representation of intelligent agents.
To conclude, we validate our model with examples on the MNIST and fashion-MNIST datasets. These applications are aimed at showing that the metrics defined and the theorems proved in the continuous case remain effective on discrete examples. By considering isometry-equivariant non-expansive operators (IENEOs), we describe two simple algorithms that select and sample IENEOs based on a few labelled samples taken from the dataset. We show how selected and sampled operators can be used to perform both classical metric learning and effective initialisation of the kernels of a convolutional neural network.
We believe that the formal foundation of our model is suitable to start a new theory of deep-learning engineering, and that novel research lines will stem from the synergy of machine learning and topology.
The paper is structured as follows. In Section 2 the epistemological foundations of our model are discussed. The mathematical background in topological persistence is discussed in Section 3. Section 4 is the core of this work and details the mathematical models defining and studying the spaces of GENEOs, and their topology. New results in persistent homology and the extension of the theory to set-equivariance are presented in Section 5. The necessary hypotheses and theorems to prove the compactness and convexity of the space of GENEOs are described in Section 6. Finally, in Section 7, we describe the algorithms used to select and sample operators in the discrete case and apply them on both the MNIST and fashion-MNIST datasets.
2 Epistemological setting
The mathematical model described and studied in this paper is justified by an epistemological background that is briefly presented in this section. We build our model on the following assumptions:
Data cannot be studied in a direct and absolute way. They are only knowable through acts of measurement made by an agent. From the point of view of data analysis only the pair (data, agent) matters.
Any act of measurement can be represented as a function defined on a topological space, since only stable measurements can be considered for applications and stability requires a topological structure.
Any agent is described by the way it transforms data while preserving some kind of invariance. In other words, any agent can be seen as a collection of group equivariant operators acting on function spaces.
Only the agent is entitled to decide about data similarity.
In other words, in our framework we assume that the analysis of data is replaced by the analysis of the pair (data, agent) we are considering. Since an agent can be seen as a set of group equivariant operators, from the mathematical viewpoint our purpose consists in presenting a good topological theory of suitable operators of this kind, representing agents. In our agent-centered setting, two objects can be distinguished from each other if and only if they can be distinguished by a suitable measurement. If no available measurement is able to distinguish them, they must be considered equal.
For more details about this epistemological setting we refer the interested reader to Fr16 .
3 Mathematical background
Our mathematical model builds on functional analysis and Topological Data Analysis (TDA) [Carlsson]. TDA is an emerging field of research which studies topological approaches to explore and make sense of complex, high-dimensional data, such as artificial and biological networks [Carlsson, Scientific Reports 2012, Bulletin of the AMS 2009]. The basic idea is that topology can help to recognize patterns within data, and therefore to turn data into useful knowledge. One of the main concepts in TDA is Persistent Homology (PH), a mathematical tool that captures topological information at multiple scales. Our mathematical model proposes an integration between the theory of group actions and Persistent Homology. In what follows, we briefly summarize the main concepts in PH BiDFFaal08 ; CaZo09 ; EdHa08.
3.1 Persistent Homology
In PH, data are modeled as objects in a metric space. The first step is to filter the data so as to obtain a family of nested topological spaces that captures the topological information at multiple scales. A common way to obtain a filtration is by sublevel sets of a continuous function, hence the name sublevel-set persistence. Let $\varphi$ be a real-valued continuous function on a topological space $X$. Persistent homology represents the changes of the homology groups of the sub-level set $X_t = \varphi^{-1}((-\infty, t])$ as $t$ varies in $\mathbb{R}$. We can see the parameter $t$ as an increasing time, whose changes produce the birth and the death of $k$-dimensional holes in the sub-level set $X_t$. For example, the number of $0$-dimensional holes equals the number of connected components of $X_t$, $1$-dimensional holes refer to tunnels and $2$-dimensional holes to voids.
If $u, v \in \mathbb{R}$ and $u < v$, we can consider the inclusion $i$ of $X_u$ into $X_v$. If $\check{H}$ denotes the Čech homology functor, such an inclusion induces a homomorphism $i_k : \check{H}_k(X_u) \to \check{H}_k(X_v)$ between the homology groups of $X_u$ and $X_v$ in degree $k$. The group $PH_k^{\varphi}(u,v) := i_k(\check{H}_k(X_u))$ is called the $k$th persistent homology group with respect to the function $\varphi : X \to \mathbb{R}$, computed at the point $(u,v)$. The rank $r_k(\varphi)(u,v)$ of $PH_k^{\varphi}(u,v)$ is called the $k$th persistent Betti number function (PBN) with respect to the function $\varphi : X \to \mathbb{R}$, computed at the point $(u,v)$.
Persistent Betti number functions can be completely described by multisets called persistence diagrams. The $k$th persistence diagram is the multiset of all the pairs $p_j = (b_j, d_j)$, where $b_j$ and $d_j$ are the times of birth and death of the $j$th $k$-dimensional hole, respectively. When a hole never dies, we set its time of death equal to $\infty$. The multiplicity $m(p_j)$ says how many holes share both the time of birth $b_j$ and the time of death $d_j$. For technical reasons, the points $(t,t)$ are added to each persistence diagram, each one with infinite multiplicity.
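As an illustration of births and deaths in degree 0, the following minimal sketch computes the persistence pairs of the connected components of the sublevel sets of a function sampled on a line. The discretisation (a path graph whose vertices carry the sampled values), the elder rule implemented with a union-find structure, and the name `sublevel_persistence_0d` are assumptions made for this example and are not part of the model described in the paper.

```python
from math import inf


def sublevel_persistence_0d(values):
    """Degree-0 sublevel-set persistence of a function sampled on a path graph.

    `values[i]` is the function value at vertex i; vertices i and i+1 are adjacent.
    Returns the sorted list of (birth, death) pairs with birth < death;
    components that never die get death = inf.
    """
    n = len(values)
    parent = list(range(n))
    birth = {}            # birth value of each vertex (read at component roots)
    added = [False] * n   # whether a vertex already belongs to the sublevel set
    pairs = []

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Sweep the threshold upwards: vertices enter in order of increasing value.
    for i in sorted(range(n), key=lambda v: values[v]):
        added[i] = True
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and added[j]:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component born later dies at the current value.
                old, young = (ri, rj) if birth[ri] <= birth[rj] else (rj, ri)
                if birth[young] < values[i]:
                    pairs.append((birth[young], values[i]))
                parent[young] = old

    # Surviving components are the holes that never die.
    roots = {find(i) for i in range(n)}
    pairs.extend((birth[r], inf) for r in roots)
    return sorted(pairs)
```

For the samples `[0, 2, 1, 3]`, the component born at value 1 merges into the older one at value 2, while the component born at 0 never dies.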
Each persistence diagram can contain an infinite number of points. For every , the equality means that does not belong to the persistence diagram . We define on a pseudo-metric as follows
by agreeing that for .
The pseudo-metric between two points and takes the smaller value between the cost of moving to and the cost of moving and onto . Obviously, for every . If and , then equals the distance, endowed by the max-norm, between and . Points at infinity have a finite distance only to the other points at infinity, and their distance equals the Euclidean distance between abscissas.
We can compare persistence diagrams by means of the bottleneck distance (also called matching distance) .
Let be two persistence diagrams. We define the bottleneck distance between and by setting
where varies in the set of all bijections from the multiset to the multiset .
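The definition can be made concrete with a brute-force sketch for small diagrams containing only finite off-diagonal points. The point-to-point cost below is inferred from the verbal description given above (the smaller of the max-norm cost of moving one point onto the other and the cost of moving both points onto the diagonal); the function names and the trick of padding each diagram with diagonal slots are assumptions of this example, and points at infinity are not handled. The enumeration of all bijections is exponential, so this is only meant to clarify the definition, not to replace efficient algorithms.

```python
from itertools import permutations
from math import inf


def point_cost(p, q):
    """Cost between two finite diagram points p = (x, y) and q = (x', y')."""
    (x1, y1), (x2, y2) = p, q
    move = max(abs(x1 - x2), abs(y1 - y2))            # move p onto q (max-norm)
    to_diagonal = max((y1 - x1) / 2, (y2 - x2) / 2)   # move both onto the diagonal
    return min(move, to_diagonal)


def bottleneck(d1, d2):
    """Brute-force bottleneck distance between two small finite diagrams."""
    # Pad each diagram with one diagonal slot (None) per point of the other one,
    # so that every point may also be matched to the diagonal.
    a = list(d1) + [None] * len(d2)
    b = list(d2) + [None] * len(d1)

    def cost(p, q):
        if p is None and q is None:
            return 0.0
        if p is None:
            return (q[1] - q[0]) / 2
        if q is None:
            return (p[1] - p[0]) / 2
        return point_cost(p, q)

    best = inf
    for perm in permutations(range(len(b))):
        worst = max((cost(a[i], b[j]) for i, j in enumerate(perm)), default=0.0)
        best = min(best, worst)
    return best
```

For example, a short-lived extra point such as `(1, 1.5)` only contributes half of its lifetime, since it can be matched to the diagonal.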
For further information about persistence diagrams and the bottleneck distance, we refer the reader to EdHa08 ; CSEdHa07. Each persistent Betti number function is associated with exactly one persistence diagram, and (if we use Čech homology) every persistence diagram is associated with exactly one persistent Betti number function. Hence the bottleneck distance induces a pseudo-metric on the set of persistent Betti number functions CeDFFeal13.
4 Mathematical model
In our mathematical model, data are represented as function spaces, that is, as sets of real-valued functions on some topological space (Subsection 4.1). Function spaces come with invariance groups representing the transformations on data which are admissible for some agent (Subsection 4.2). The groups of transformations are specific to different agents, and can be either learned or part of prior knowledge. The operators on data are then defined as group-equivariant non-expansive operators (GENEOs) (Subsection 4.3).
4.1 Data representation
Let us consider a set $X$ and a topological subspace $\Phi$ of the set of all bounded functions from $X$ to $\mathbb{R}$, denoted by $\mathbb{R}^X_b$ and endowed with the topology induced by the distance
$$D_\Phi(\varphi_1,\varphi_2) := \|\varphi_1-\varphi_2\|_\infty. \tag{3}$$
If $\Phi$ is compact, then it is also bounded, i.e., there exists a non-negative real value $L$ such that $\|\varphi\|_\infty \le L$ for every $\varphi\in\Phi$. We can think of $X$ as the space where one makes measurements, and of $\Phi$ as the set of admissible measurements (also called set of admissible functions). In other words, $\Phi$ is the set of functions from $X$ to $\mathbb{R}$ that can be produced by measuring instruments. For example, an image can be represented as a function from the real plane to the real numbers.
To quantify the distance between two points $x_1,x_2\in X$, we compare the values taken at $x_1$ and $x_2$ by the functions in the space of possible measurements $\Phi$. Therefore, we endow $X$ with the extended pseudo-metric $D_X$ defined by setting
$$D_X(x_1,x_2) := \sup_{\varphi\in\Phi}|\varphi(x_1)-\varphi(x_2)|$$
for every $x_1,x_2\in X$ (see Appendix A). (Footnote 2: We recall that a pseudo-metric is just a distance $d$ without the property that $d(a,b)=0$ implies $a=b$. An extended pseudo-metric is a pseudo-metric that may take the value $\infty$. If $\Phi$ is bounded, then $D_X$ is a pseudo-metric.)
The assumption behind the definition of $D_X$ is that two points can be distinguished only if they assume different values while being measured. As an example, if $\Phi$ contains only constant functions, no discrimination can be made between points in $X$ and hence $D_X(x_1,x_2)$ vanishes for every $x_1,x_2\in X$.
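A small numerical sketch may help to see how the set of admissible measurements decides which points are distinguishable. Here the set of admissible functions is assumed to be a finite list of Python callables, so the supremum becomes a maximum; the name `pseudo_metric_on_points` and the sample functions are hypothetical and chosen only for illustration.

```python
def pseudo_metric_on_points(admissible_functions, x1, x2):
    """Largest difference between the measurements of x1 and x2.

    `admissible_functions` is a finite, non-empty collection of callables,
    standing in for the set of admissible measurements.
    """
    return max(abs(f(x1) - f(x2)) for f in admissible_functions)


# Only constant measurements: no two points can be told apart.
constants = [lambda x: 0.5, lambda x: 1.0]

# Only even measurements: x and -x are indistinguishable although x != -x,
# which is why we obtain a pseudo-metric rather than a metric.
even_functions = [lambda x: x * x, lambda x: abs(x)]

# Adding the identity makes x and -x distinguishable.
richer_set = even_functions + [lambda x: x]
```

With these choices, the points 0.5 and -0.5 are at distance 0 for the first two sets of measurements and at distance 1 for the third one.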
The pseudo-metric space $(X, D_X)$ can be considered as a topological space by choosing as a base the collection of all the sets
$$B_X(x,\varepsilon) = \{x'\in X : D_X(x,x')<\varepsilon\}$$
where $\varepsilon>0$ and $x\in X$ (see Ga64).
The reason to endow the measurement space with a topology, rather than considering just a set, follows from the need of formalizing the assumption that measurements are stable. To formalize stability we have to use a topology (or a pseudo-metric inducing a topology).
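It is interesting to stress the link between the topology $\tau_{D_X}$ associated with $D_X$ and the initial topology $\tau_{in}$ on $X$ with respect to $\Phi$, when we take the Euclidean topology on $\mathbb{R}$. (Footnote 3: We recall that $\tau_{in}$ is the coarsest topology on $X$ such that each function $\varphi\in\Phi$ is continuous. Explicitly, the open sets in $\tau_{in}$ are the sets that can be obtained as unions of finite intersections of sets $\varphi^{-1}(U)$, where $\varphi\in\Phi$ and $U$ is open in $\mathbb{R}$. In other words, a base of $\tau_{in}$ is given by the collection of all sets that can be represented as $\bigcap_{i\in I}\varphi_i^{-1}(U_i)$, where $I$ is a finite set of indexes and $\varphi_i\in\Phi$, $U_i$ is open in $\mathbb{R}$, for every $i\in I$; see Ga64.)
Theorem 4.1. The topology $\tau_{D_X}$ on $X$ induced by the pseudo-metric $D_X$ is finer than the initial topology $\tau_{in}$ on $X$ with respect to $\Phi$. If $\Phi$ is totally bounded, then the topology $\tau_{D_X}$ coincides with $\tau_{in}$.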
In general is not compact with respect to the topology , even if is compact. For example, if is the open interval and contains only the identity from to , the topology induced by is simply the Euclidean topology and hence is not compact. However, the next result holds.
Theorem 4.2. If $\Phi$ is compact and $X$ is complete then $X$ is also compact.
Since is the coarsest topology on such that is continuous, Theorem 4.1 guarantees that the assumption that the functions are continuous is not restrictive in practice, for example while dealing with images, which often contain discontinuities. Indeed, our functions are not required to be continuous with respect to other topologies (e.g., the Euclidean topology on ).
4.1.1 A remark on the use of pseudo-metrics
The reader could think better to change the pseudo-metric into a metric by quotienting out by the equivalence relation and defining for any . The reason we do not do this is that several different sets of admissible measurements can be considered on the same set . For two different sets , of admissible functions, we obtain two different quotient spaces , . If we forget about the original space , we lose the possibility of linking the equivalence classes in with the ones in . On the contrary, we prefer to preserve the identity of points in , studying how they link to each other when we change the set . This observation leads us to work with pseudo-metrics instead of metrics.
Before proceeding, we observe that the map taking each point to the equivalence class is continuous with respect to and , and surjective. Moreover, takes each ball with respect to to a ball with respect to , while the inverse image under of each ball with respect to is a ball with respect to . It follows that if a subset is compact (sequentially compact) for then is compact (sequentially compact) for , and that if a subset is compact (sequentially compact) for then is compact (sequentially compact) for . Finally, given a sequence in , we observe that converges to in if and only if the sequence converges to in . These facts imply that the development of our theory in terms of pseudo-metrics is not far from the analysis in terms of metrics.
4.2 Transformations on data
In our model, we assume that data are transformed through maps from $X$ to $X$ which are $\Phi$-preserving homeomorphisms with respect to the pseudo-metric $D_X$. Let $\mathrm{Homeo}(X)$ denote the set of homeomorphisms from $X$ to $X$ with respect to $D_X$, and $\mathrm{Homeo}_\Phi(X)$ denote the set of $\Phi$-preserving homeomorphisms, namely the homeomorphisms $g\in\mathrm{Homeo}(X)$ such that $\varphi\circ g\in\Phi$ and $\varphi\circ g^{-1}\in\Phi$ for every $\varphi\in\Phi$.
The following Proposition 4.3 implies that $\mathrm{Homeo}_\Phi(X)$ is exactly the set of all bijections $g:X\to X$ such that $\varphi\circ g\in\Phi$ and $\varphi\circ g^{-1}\in\Phi$ for every $\varphi\in\Phi$.
Proposition 4.3. If $g$ is a bijection from $X$ to $X$ such that $\varphi\circ g\in\Phi$ and $\varphi\circ g^{-1}\in\Phi$ for every $\varphi\in\Phi$, then $g$ is an isometry (and hence a homeomorphism) with respect to $D_X$. (Footnote 4: The definition of isometry between pseudo-metric spaces can be considered as a special case of isometry between metric spaces. Let $(X_1,d_1)$ and $(X_2,d_2)$ be two pseudo-metric spaces. It is easy to check that if $f:X_1\to X_2$ is a function verifying the equality $d_1(x,y)=d_2(f(x),f(y))$ for every $x,y\in X_1$, then $f$ is continuous with respect to the topologies induced by $d_1$ and $d_2$. If $f$ verifies the previous equality and is bijective, we say that it is an isometry between the considered pseudo-metric spaces. If $f$ is an isometry, we can trivially observe that $f^{-1}$ is also an isometry, and that $f$ is a homeomorphism.)
In general, . As an example, take and . In this case and , while is the set of all homeomorphisms from the interval to itself with respect to the Euclidean distance.
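For each $g\in\mathrm{Homeo}_\Phi(X)$, we consider the bijective map $R_g:\Phi\to\Phi$ defined by setting $R_g(\varphi) := \varphi\circ g$ for every $\varphi\in\Phi$. We claim that $R_g$ preserves the pseudo-distance $D_\Phi$ in Equation 3. Indeed, if $\varphi_1,\varphi_2\in\Phi$ then
$$D_\Phi(R_g(\varphi_1),R_g(\varphi_2)) = \sup_{x\in X}|\varphi_1(g(x))-\varphi_2(g(x))| = \sup_{y\in X}|\varphi_1(y)-\varphi_2(y)| = D_\Phi(\varphi_1,\varphi_2)$$
because $g$ is a bijection. Since $R_g$ is a bijection preserving $D_\Phi$, $R_g$ is an isometry with respect to $D_\Phi$.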
In the rest of this paper we will assume that $\Phi$ is compact with respect to the topology induced by $D_\Phi$, and that $X$ is complete (and hence compact) with respect to the topology induced by $D_X$.
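Let us now consider a subgroup $G$ of the group $\mathrm{Homeo}_\Phi(X)$. $G$ represents the set of transformations on data for which we require equivariance to be respected.
We can define the pseudo-distance $D_G$ on $G$:
$$D_G(g_1,g_2) := \sup_{\varphi\in\Phi} D_\Phi(\varphi\circ g_1,\varphi\circ g_2)$$
from $G\times G$ to $\mathbb{R}$ (see Appendix A).
Exchanging the two suprema in the definition, $D_G$ can be expressed as:
$$D_G(g_1,g_2) = \sup_{x\in X} D_X(g_1(x),g_2(x)).$$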
We can now state the following theorems:
$G$ is a topological group with respect to the pseudo-metric topology and the action of $G$ on $\Phi$ through right composition is continuous.
If $G$ is complete then it is also compact with respect to $D_G$.
From now on we will suppose that $G$ is complete (and hence compact) with respect to the topology induced by $D_G$.
4.2.1 The natural pseudo-distance
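We can consider the natural pseudo-distance $d_G$ on the space $\Phi$ FrJa16:
The pseudo-distance $d_G:\Phi\times\Phi\to\mathbb{R}$ is defined by setting
$$d_G(\varphi_1,\varphi_2) := \inf_{g\in G} D_\Phi(\varphi_1,\varphi_2\circ g).$$
It is called the natural pseudo-distance associated with the group $G$ acting on $\Phi$.
The natural pseudo-distance represents the ground truth in our model. It is based on comparing measurements, and vanishes for pairs of measurements that are equivalent with respect to the action of our group of homeomorphisms $G$, which expresses the equivalences between data.
If $G$ is the trivial group, then $d_G$ equals the sup-norm distance $D_\Phi$ on $\Phi$. If $G_1$ and $G_2$ are subgroups of $\mathrm{Homeo}_\Phi(X)$ and $G_1\subseteq G_2$, then the definition of $d_G$ implies that
$$d_{G_2}(\varphi_1,\varphi_2)\le d_{G_1}(\varphi_1,\varphi_2)\le D_\Phi(\varphi_1,\varphi_2)$$
for every $\varphi_1,\varphi_2\in\Phi$.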
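To see the natural pseudo-distance at work on a toy case, the sketch below takes the measurement space to be the cyclic set of indices $\{0,\dots,n-1\}$, represents each admissible function as a list of $n$ values, and lets the group consist of cyclic shifts, so that the infimum becomes a minimum over finitely many shifts. This finite setting and the function names are assumptions of the example, not the general construction.

```python
def sup_distance(f1, f2):
    """Sup-norm distance between two functions given as equal-length lists."""
    return max(abs(a - b) for a, b in zip(f1, f2))


def shift(f, s):
    """Right composition of f with the cyclic shift i -> i + s (mod n)."""
    n = len(f)
    return [f[(i + s) % n] for i in range(n)]


def natural_pseudo_distance(f1, f2, shifts):
    """Minimum of the sup-norm distance between f1 and f2 composed with a shift.

    `shifts` lists the shift amounts forming the chosen group, e.g.
    [0] for the trivial group or range(n) for all cyclic shifts.
    """
    return min(sup_distance(f1, shift(f2, s)) for s in shifts)


f1 = [0, 1, 2, 1]
f2 = shift(f1, 1)          # the same signal, observed after a shift

trivial = [0]              # trivial group: plain sup-norm distance
half_turns = [0, 2]        # a proper subgroup of the cyclic shifts
all_shifts = range(4)      # the full group of cyclic shifts
```

Enlarging the group can only decrease the value: here the distance is 1 for the trivial group and for the subgroup `half_turns`, and 0 for the full group of shifts, which recognises `f2` as a shifted copy of `f1`.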
4.2.2 A remark on the use of homeomorphisms
The reader could criticize the choice of grounding our approach on the concept of homeomorphism. After all, most of the objects that are considered for purposes of shape comparison “are not homeomorphic”. Therefore, the definition of natural pseudo-distance could seem not to be sufficiently flexible, since it does not allow one to compare non-homeomorphic objects. However, it is important to note that the space $X$ we use in our model does not represent the objects, but the space where one takes measurements about the objects. As such, $X$ is unique. For example, two images are considered as functions from the real plane to the real numbers, independently of the topological properties of the 3D objects represented in the images. If we make two CAT scans, the topological space $X$ is always given by a helix turning many times around a body, and no requirement is made about the topology of such a body. In other words, the topological space $X$ is determined only by the measuring instrument and not by the single object instances.
4.3 Group-Equivariant Non-Expansive Operators
Under the assumptions made in the previous sections, the pair $(\Phi,G)$ is called a perception pair.
Let us now assume that two perception pairs $(\Phi,G)$, $(\Psi,H)$ are given together with a fixed homomorphism $T:G\to H$. Each function $F:\Phi\to\Psi$ such that $F(\varphi\circ g)=F(\varphi)\circ T(g)$ for every $\varphi\in\Phi$, $g\in G$ is said to be a perception map from $(\Phi,G)$ to $(\Psi,H)$ associated with the homomorphism $T$. More briefly, we will also say that $F$ is a group equivariant operator. If $T$ is equal to the identity homomorphism $\mathrm{id}_G:G\to G$, we can say that $F$ is a $G$-map. We observe that the functions in $\Phi$ and the functions in $\Psi$ are defined on spaces that are generally different from each other.
Each perception pair can be seen as a category, whose objects are the functions in and the morphisms between two functions are the elements such that . As usual, if and we wish to distinguish as a morphism between and from as a morphism between and , so we make different copies , of the homeomorphism by labelling it. As natural, . A precise formalization of this procedure can be done in terms of slice categories. For more details we refer the reader to Appendix B.
When two perception pairs , are considered as categories and a homomorphism is fixed, each perception map from to is naturally associated with a functor between the two categories, taking each function to and each morphism to the morphism .
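Assume that $(\Phi,G)$, $(\Psi,H)$ are two perception pairs and that a homomorphism $T:G\to H$ has been fixed. Each non-expansive perception map $F$ from $(\Phi,G)$ to $(\Psi,H)$ with respect to $T$ is called a Group Equivariant Non-Expansive Operator (GENEO) associated with $T$.
Obviously, the non-expansivity of $F$ means that $D_\Psi(F(\varphi_1),F(\varphi_2))\le D_\Phi(\varphi_1,\varphi_2)$ for every $\varphi_1,\varphi_2\in\Phi$.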
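The two defining properties can be checked numerically on a discrete toy example. In the sketch below, functions are lists of values on a cyclic set of $n$ points, both groups are the cyclic shifts, the homomorphism is the identity, and the operator is a circular convolution whose kernel has absolute values summing to at most 1. All of these choices, as well as the helper names, are assumptions made for illustration; the checks test the properties on given samples and do not constitute a proof.

```python
def shift(f, s):
    """Right composition of f with the cyclic shift i -> i + s (mod n)."""
    n = len(f)
    return [f[(i + s) % n] for i in range(n)]


def sup_distance(f1, f2):
    """Sup-norm distance between two functions given as equal-length lists."""
    return max(abs(a - b) for a, b in zip(f1, f2))


def circular_convolution(kernel):
    """Return the operator F with F(f)[i] = sum_j kernel[j] * f[(i - j) mod n]."""
    def operator(f):
        n = len(f)
        return [sum(k * f[(i - j) % n] for j, k in enumerate(kernel))
                for i in range(n)]
    return operator


def is_equivariant_on(operator, f, tol=1e-12):
    """Check F(f o g) == F(f) o g for every cyclic shift g, on the sample f."""
    return all(sup_distance(operator(shift(f, s)), shift(operator(f), s)) <= tol
               for s in range(len(f)))


def is_non_expansive_on(operator, f1, f2, tol=1e-12):
    """Check that the operator does not increase the sup-norm distance of f1, f2."""
    return sup_distance(operator(f1), operator(f2)) <= sup_distance(f1, f2) + tol


smoothing = circular_convolution([0.5, 0.25, 0.25])   # |weights| sum to 1
doubling = circular_convolution([2.0])                # equivariant but expansive

f1 = [0.0, 1.0, 3.0, 2.0, 5.0]
f2 = [1.0, 1.0, 0.0, 4.0, 2.0]
```

On these samples, `smoothing` passes both checks, while `doubling` is equivariant but fails the non-expansivity check, so it would not qualify as a GENEO in this toy setting.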
As a reference for the reader, we give the following basic example of GENEO. Let be the set containing all -Lipschitz functions from to , and be the group of all rotations of around the -axis. Let be the set containing all -Lipschitz functions from to , and be the group of all rotations of . We observe that and are two perception pairs. Now, let us consider the map taking each function to the function defined by setting (with polar coordinates), and the homomorphism taking the rotation of of radians around the -axis positively oriented to the counter-clock rotation of radians of . We can easily check that is a perception map and a GENEO from to , associated with the homomorphism . In this example and are surjective, but an example where and are not surjective can be easily found, e.g. by restricting to the singleton containing only the null function and to the trivial group containing only the identical homomorphism.
We can study how GENEOs act on the natural pseudo-distances:
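If $F$ is a GENEO from $(\Phi,G)$ to $(\Psi,H)$ associated with $T:G\to H$, then it is a contraction with respect to the natural pseudo-distances $d_G$, $d_H$, in the sense that $d_H(F(\varphi_1),F(\varphi_2))\le d_G(\varphi_1,\varphi_2)$ for every $\varphi_1,\varphi_2\in\Phi$.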
4.3.1 Pseudo-metrics on
Let us denote by the set of all GENEOs between two perception pairs , associated with . We can endow this set with the following pseudo-distances , .
If , we set
The next result can be easily proved by applying the inequality (see Theorem 5.1) and recalling that the supremum of a family of bounded pseudo-metrics is still a pseudo-metric.
and are pseudo-metrics on . Moreover, .
It would be easy to check that is a metric.
For the sake of conciseness, in the following we will set .
This simple statement holds:
For every and every : , where 0 denotes the function taking the value 0 everywhere.
4.4 GENEOs as agents in our model
In our model the agents are represented by GENEOs. Indeed, each agent can be seen as a black box that receives and transforms data. If a nonempty subset of is fixed, a simple pseudo-distance to compare two admissible functions can be defined by setting . This definition expresses our assumption that the comparison of data strongly depends on the choice of the agents. However, we note that the computation of for every pair of admissible functions is computationally expensive. We will see how persistent homology allows us to replace with a pseudo-metric that is quicker to compute.
5 A strongly group-invariant pseudo-metric induced by Persistent Homology
In this section, we show how Persistent Homology supports the definition of a strongly group-invariant pseudo-metric on , for which we prove some theoretical results.
We begin by recalling the stability of the classical pseudo-distance between Persistent Betti Number functions (PBNs) (cf. Definition 3.2) with respect to the pseudo-metrics and . We assume the finiteness of PBNs (see Footnote 5). Then, the stability of easily follows from the stability theorem of the interleaving distance and the isometry theorem (cf. Ou15).
Footnote 5: Though in our setting the space $X$ is assumed to be compact, PBN functions are not necessarily finite. For example, let us consider the set and . Even if $X$ is compact, every sublevel set has infinitely many connected components, and hence the $0$th persistent Betti number function takes infinite value everywhere. We add the assumption on the finiteness of PBNs (i.e., the assumption that the PBN function of every $\varphi\in\Phi$ takes a finite value at each point $(u,v)$) to get stability and discard pathological cases (for example the case that the set of admissible functions is the set of all maps from $X$ to $\mathbb{R}$). Since the PBN functions of the pseudo-metric space $X$ coincide with the persistent Betti number functions of its Kolmogorov quotient, the finiteness of the persistent Betti number functions can be obtained when this quotient is finitely triangulable (cf. CeDFFeal13).
If k is a natural number, and , then
5.1 Strongly group invariant comparison of filtering functions via persistent homology
The proofs reported in the rest of this section are just a straightforward generalization of the proofs given in FrJa16 for the case , .
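Let us consider a subset $\mathcal{F}$ of the set of all GENEOs from $(\Phi,G)$ to $(\Psi,H)$ associated with $T$. For every fixed $k$, we can consider the following pseudo-metric $D^{\mathcal{F},k}_{match}$ on $\Phi$:
$$D^{\mathcal{F},k}_{match}(\varphi_1,\varphi_2) := \sup_{F\in\mathcal{F}} d_{match}\big(r_k(F(\varphi_1)), r_k(F(\varphi_2))\big)$$
for every $\varphi_1,\varphi_2\in\Phi$, where $d_{match}$ is the bottleneck distance of Definition 3.2 and $r_k(\varphi)$ denotes the $k$th persistent Betti number function with respect to the function $\varphi$.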
In this work, we will say that a pseudo-metric on is strongly G-invariant if it is invariant under the action of with respect to each variable, that is, if for every and every .
It is easily seen that the natural pseudo-distance is strongly -invariant.
is a strongly -invariant pseudo-metric on .
5.2 Some theoretical results on the pseudo-metric
The proofs reported in this section are a generalization of the proofs given in FrJa16 for the case , . At first we want to show that the pseudo-metric is stable with respect to both the natural pseudo-distance associated with the group and the distance .
Let $X$ and $Y$ be two homeomorphic spaces and let $h:Y\to X$ be a homeomorphism. Then the persistent homology group with respect to the function $\varphi:X\to\mathbb{R}$ and the persistent homology group with respect to the function $\varphi\circ h:Y\to\mathbb{R}$ are isomorphic at each point $(u,v)$ in the domain. Therefore we can say that the persistent homology groups and the persistent Betti number functions are invariant under the action of $\mathrm{Homeo}(X)$.
If is a non-empty subset of , then
The definitions of the natural pseudo-distance and of the pseudo-distance introduced in Section 5.1 come from different theoretical concepts. The former is based on a variational approach involving the set of all homeomorphisms in $G$, while the latter refers only to a comparison of persistent homologies depending on a family of group equivariant non-expansive operators. Given those comments, the next result may appear unexpected.
Let us assume that , every function in is non-negative, the -th Betti number of does not vanish, and contains each constant function for which a function exists such that . Then .
We observe that if $\Phi$ is bounded, the assumption that every function in $\Phi$ is non-negative is not very restrictive. Indeed, we can obtain it by adding a suitable constant value to every admissible function.
5.3 Beyond group equivariance
We observe that while the definition of the natural pseudo-distance requires that has the structure of a group, the definition of does not need this assumption. In other words, our approach based on GENEOs can be used also when we wish to have equivariance with respect to a set instead of a group of homeomorphisms. This property is promising for extending the application of our theory to the cases in which the agent is equivariant with respect to each element of a finite set of homeomorphisms that is not closed with respect to composition and computation of the inverse.
5.4 Pseudo-metrics induced by persistent homology
Persistent homology can be seen as a topological method to build new and easily computable pseudo-metrics for the sets , and . These new pseudo-metrics , , can be used as proxies for (and hence ), , , respectively:
In particular, and a discretized version of the pseudo-metric will be used in the experiments described in Section 7. We underline that the use of persistent homology is a key tool in our approach: it allows for a fast comparison between functions and between GENEOs. Without persistent homology, this comparison would be much more computationally expensive.
The next result will be of use for the approximation of .
Let . If the Hausdorff distance
is not larger than , then
for every .
Therefore, if we can cover by a finite set of balls in of radius , centered at points of a finite set , the approximation of can be reduced to the computation of , i.e. the maximum of a finite set of bottleneck distances between persistence diagrams, which are well-known to be computable by means of efficient algorithms.
This fact leads us to study, in the following section, the properties of the topological space .
6 On the compactness and convexity of the space of GENEOs
In this section we show that if the function spaces we are considering are compact and convex, then the space of GENEOs is compact and convex too. This property has important consequences from the computational point of view, since it guarantees that the space of GENEOs can be approximated by a finite set and that new GENEOs can be obtained by convex combination of preexisting GENEOs.
6.1 The space of GENEOs is compact with respect to
We start by recalling that we are assuming and compact with respect to and , respectively.
is compact with respect to .
Let be a non-empty subset of . For every , a finite subset of exists, such that
for every .
The previous corollary shows that, under suitable hypotheses, the computation of can be reduced to the computation of the maximum of a finite set of bottleneck distances between persistence diagrams, for every .
6.2 The set of GENEOs is convex
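Let $F_1,\dots,F_n$ be GENEOs from $(\Phi,G)$ to $(\Psi,H)$ associated with the homomorphism $T:G\to H$. Let $(a_1,\dots,a_n)\in\mathbb{R}^n$ with $\sum_{i=1}^n |a_i|\le 1$. Consider the function
$$F_\Sigma(\varphi) := \sum_{i=1}^n a_i F_i(\varphi)$$
from $\Phi$ to the set of the continuous functions from $Y$ to $\mathbb{R}$, where $Y$ is the domain of the functions in $\Psi$.
If $F_\Sigma(\Phi)\subseteq\Psi$, then $F_\Sigma$ is a GENEO from $(\Phi,G)$ to $(\Psi,H)$ with respect to $T$.
If $\Psi$ is convex, then the set of GENEOs from $(\Phi,G)$ to $(\Psi,H)$ with respect to $T$ is convex.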
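The convexity statement can be illustrated on a discrete toy case: a convex combination of two shift-equivariant, non-expansive operators on cyclic signals keeps both properties. As before, the cyclic setting, the choice of circular convolutions as sample operators, and all names are assumptions made for this sketch; the target function space is taken to be the set of all lists of length $n$, so that the combination trivially lands in it.

```python
def shift(f, s):
    """Right composition of f with the cyclic shift i -> i + s (mod n)."""
    n = len(f)
    return [f[(i + s) % n] for i in range(n)]


def sup_distance(f1, f2):
    """Sup-norm distance between two functions given as equal-length lists."""
    return max(abs(a - b) for a, b in zip(f1, f2))


def circular_convolution(kernel):
    """Return the operator F with F(f)[i] = sum_j kernel[j] * f[(i - j) mod n]."""
    def operator(f):
        n = len(f)
        return [sum(k * f[(i - j) % n] for j, k in enumerate(kernel))
                for i in range(n)]
    return operator


def convex_combination(op1, op2, t):
    """Return the operator t * op1 + (1 - t) * op2, with 0 <= t <= 1."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")

    def operator(f):
        return [t * a + (1.0 - t) * b for a, b in zip(op1(f), op2(f))]
    return operator


average = circular_convolution([1 / 3, 1 / 3, 1 / 3])  # local mean
identity = circular_convolution([1.0])                 # leaves f unchanged
mixed = convex_combination(average, identity, 0.4)

f1 = [0.0, 1.0, 3.0, 2.0, 5.0]
f2 = [1.0, 1.0, 0.0, 4.0, 2.0]

# Equivariance of the combination, checked on the sample f1 for every shift.
equivariant = all(
    sup_distance(mixed(shift(f1, s)), shift(mixed(f1), s)) <= 1e-12
    for s in range(len(f1))
)
# Non-expansivity of the combination, checked on the pair (f1, f2).
non_expansive = sup_distance(mixed(f1), mixed(f2)) <= sup_distance(f1, f2) + 1e-12
```

In a learning setting, this is what makes it possible to build new operators by interpolating between operators that have already been selected.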
7 Experimental validation
We validate our model on the MNIST and fashion-MNIST datasets. We first define isometry-equivariant non-expansive operators (IENEOs), then describe two simple algorithms that select and sample IENEOs based on a few labelled samples taken from the dataset. We show how selected and sampled operators can be used to perform both classical metric learning and effective initialisation of the kernels of a convolutional neural network.
7.1 Isometry-equivariant non-expansive operators (IENEOs)
We define a parametric family of non-expansive operators which are equivariant with respect to Euclidean plane isometries.
Given and , we consider the 1-dimensional Gaussian function with width and center
. For a positive integer , we take the set of the -tuples for which . is a submanifold of .
For each , we then consider the function defined as
If we denote by the convolutional operator mapping each continuous function with compact support to the continuous function with compact support defined as
then the operator is a group-equivariant non-expansive operator with respect to the group of Euclidean plane isometries. We call an IENEO (Isometry-Equivariant Non-Expansive Operator).
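A discrete sketch of such a kernel is given below. It assumes that the kernel is a linear combination of 1-dimensional Gaussians evaluated at the distance from the origin, which makes it radially symmetric, and that it is divided by its L1 norm, so that by Young's inequality the associated convolution cannot increase sup-norm distances. The exact Gaussian normalisation, the treatment of the parameters as free coefficients `(a_i, tau_i)`, the grid sampling and the function names are assumptions of this example. On a pixel grid only the symmetries of the grid (quarter turns and reflections) are preserved exactly.

```python
import math


def gaussian(t, tau, sigma):
    """Unnormalised 1-dimensional Gaussian with center tau and width sigma."""
    return math.exp(-((t - tau) ** 2) / (2.0 * sigma ** 2))


def radial_kernel(params, sigma, half_size):
    """Sample a radially symmetric kernel on a (2*half_size + 1)^2 grid.

    `params` is a list of pairs (a_i, tau_i): coefficient and center of each
    Gaussian, evaluated at the distance of the grid point from the origin.
    The result is divided by its L1 norm.
    """
    coords = range(-half_size, half_size + 1)
    raw = [[sum(a * gaussian(math.hypot(x, y), tau, sigma) for a, tau in params)
            for x in coords]
           for y in coords]
    l1_norm = sum(abs(v) for row in raw for v in row)
    if l1_norm == 0.0:
        raise ValueError("the kernel vanishes on the whole grid")
    return [[v / l1_norm for v in row] for row in raw]


# Hypothetical parameters: a positive bump at radius 1, a negative one at radius 3.
kernel = radial_kernel([(0.8, 1.0), (-0.6, 3.0)], sigma=1.0, half_size=4)
size = len(kernel)
```

The resulting array can be passed to any 2D convolution routine, or used as the initial value of a convolutional kernel.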
The IENEO is parametric with respect to the -tuple . Therefore, we define a parametric family of IENEOs
The next section shows how to select a finite subfamily of of operators which are suited to image classification in the MNIST and fashion-MNIST datasets, using the pseudo-metrics defined in Section 5.
7.2 Selection and sampling of IENEOs
We begin by randomly sampling operators in . We then select those operators that regard objects belonging to the same class as similar. To this end, let be the set of functions representing the objects in a class of cardinality . For each of the randomly sampled operators , we define the class-dependent value
An operator is selected if is smaller than a threshold , for each class .
Once we have selected a smaller number of operators according to the criterion above, we then sample operators to avoid storing operators that would focus on the same or similar characteristic across classes. To this end, given a class , we define the distance between two operators and (cf. Section 5.4)