1 Introduction
Footnote 1: Let P_train and P_test be the performance of an algorithm computed using a training and a testing set, respectively; P̂_test is the estimated testing error computed without any testing data. The performance metric may be classification accuracy, F1 score, intersection-over-union (IoU), etc.

Deep Neural Networks (DNNs) are algorithms capable of identifying complex, nonlinear mappings f between an input variable x and an output variable y, i.e., y = f(x) [18]. Each DNN is defined by its unique topology and loss function. Some well-known models are VGG, ResNet, and AlexNet [24, 14, 17], to name but a few.

Given a well-curated dataset with n samples, D = {(x_i, y_i)}_{i=1}^n, we can use DNNs to find an estimate of the functional mapping f. Let us refer to the estimated mapping function as f̂. Distinct estimates f̂ will be obtained when using different DNNs and datasets. Example datasets we can use to this end are ImageNet, EmotioNet, and PASCAL VOC [9, 12, 10], among many others.
Using datasets such as these to train DNNs has been very fruitful. DNNs have achieved considerable improvements in a myriad of, until recently, very challenging tasks, e.g., [17, 26].
Unfortunately, we do not generally know how the estimated mapping functions will perform in the real world, when using independent, unseen images.
The classical way to address this problem is to use a testing dataset, Figure 1(a, bottom). The problem with this approach is that, in many instances, the testing set is visible to us, and, hence, we keep modifying the DNN topology until it works on this testing dataset. This means that we overfit to the testing data and, generally, our algorithm may not be the best for truly unseen samples.
To resolve this issue, we can use a sequestered dataset. This means that a thirdparty has a testing dataset we have never seen and we are only able to know how well we perform on that dataset once every several months. While this does tell us how well our algorithm performs on previously unseen samples, we can only get this estimate sporadically. And, importantly, we need to rely on someone else maintaining and updating this sequestered testing set. Many such sequestered datasets do not last long, because maintaining and updating them is a very costly endeavour.
In the present paper, we introduce an approach that resolves these problems. Specifically, we derive an algorithm that gives an accurate estimate of the performance gap between our training and testing error, without the need for any testing dataset, Figure 1(a, top). That means we do not need access to any labelled or unlabelled testing data. Rather, our algorithm gives an accurate estimate of the performance of a DNN on independent, unseen samples.
Our key idea is to derive a set of topological summaries measuring persistent topological properties of the behavior of DNNs across computer vision problems. Persistent topology has been shown to correlate with generalization error in classification [8], and has served as a method to theoretically study and explain DNNs' behavior [5, 8, 27]. The hypothesis we advance is that the generalization gap is a function of the inner workings of the network, here represented by its functional topology and described through topological summaries. We propose to regress this function and use it to estimate test performance based only on training data.
Figure 1(b) shows an example. In this plot, the horizontal axis is a linear combination of persistent topology measures of DNNs, and the vertical axis is the value of the performance gap observed when using these DNNs on multiple computer vision problems. As can be seen, there is a linear relationship between our proposed topological summaries and the DNN's performance gap. This means that knowing the value of our topological summaries is as good as knowing the performance of the DNN on a sequestered dataset, but without any of the drawbacks mentioned above: no need to depend on an independent group to collect, curate, and update a testing set.
We start with a set of derivations of the persistent topology measures we perform on DNNs (Section 2), before using these to derive our algorithm (Section 3). We provide a discussion of related work (Section 4) and extensive experimental evaluations on a variety of DNNs and computer vision problems, including object recognition, facial expression analysis, and semantic segmentation (Sections 5 and 6).
2 Topological Summaries
A DNN is characterized by its structure (i.e., the way its computational graph is defined and trained) and its function (i.e., the actual values its components take in response to specific inputs). We focus here on the latter.
To do this, we define DNNs on a topological space. A set of compact descriptors of this space, called topological summaries, are then calculated. They measure important properties of the network’s behaviour. For example, a summary of the functional topology of a network can be used to detect overfitting and perform earlystopping [8].
Let V be a set. An abstract simplicial complex S is a collection of vertices, denoted V, and a collection of subsets of V, called simplices, that is closed under the subset operation, i.e., if τ ⊆ σ and σ ∈ S, then τ ∈ S.
The dimension of a simplex σ is |σ| − 1, where |·| denotes cardinality. A simplex of dimension k is called a k-simplex. A 0-simplex is realized by a single vertex, a 1-simplex by a line segment (i.e., an edge) connecting two vertices, a 2-simplex by the filled triangle that connects three vertices, etc.
Let (V, d) be a metric space, i.e., the association of the set V with a metric d. Given a distance ε ≥ 0, the Vietoris-Rips complex [25] is an abstract simplicial complex that contains all the simplices formed by pairs of elements v_i, v_j ∈ V with

d(v_i, v_j) ≤ ε,    (1)

for i, j = 1, …, |V|.
By considering a range of possible distances, ε_1 < ε_2 < ⋯ < ε_m, a Vietoris-Rips filtration yields a collection of nested simplicial complexes, S_{ε_1} ⊆ S_{ε_2} ⊆ ⋯ ⊆ S_{ε_m}, at multiple scales, Figure 2 [13].
We are interested in the persistent topology properties of these complexes across different scales. For this, we compute the persistent homology groups and the Betti numbers β_k, which give us the ranks of those groups [13]. Informally, the Betti numbers count the number of cavities of a topological object.

Footnote 2: Two objects are topologically equivalent if they have the same number of cavities (holes) at each of their dimensions. For example, a donut and a coffee mug are topologically equivalent, because each has a single 2D cavity: the hole of the donut and the hole in the handle of the mug. On the other hand, a torus (defined as the product of two circles, S^1 × S^1) has two holes because it is hollow. Hence, a torus is topologically different from a donut and a coffee mug.
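To make these notions concrete, the 0-dimensional part of this computation can be sketched with a union-find pass over the filtration. This is a minimal, numpy-only illustration, not the paper's implementation (practical pipelines use dedicated libraries such as GUDHI or Ripser); the function names are ours.

```python
import numpy as np

def h0_persistence(dist):
    """0-dimensional persistent homology of a Vietoris-Rips filtration.

    Every point is born at epsilon = 0; a connected component dies when
    the edge that merges it into another component enters the filtration.
    Returns (birth, death) pairs; the last surviving component never dies.
    """
    n = dist.shape[0]
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Process edges in order of increasing distance (the filtration order).
    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))
    pairs = []
    for eps, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                    # merging two components kills one
            parent[ri] = rj
            pairs.append((0.0, eps))
    pairs.append((0.0, np.inf))         # one component persists forever
    return pairs

def betti0(pairs, eps):
    """Number of connected components (cavities of dimension 0) alive at scale eps."""
    return sum(1 for b, d in pairs if b <= eps < d)

# Two well-separated clusters: at small eps, beta_0 = 2; at large eps, 1.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 0.0], [5.1, 0.0]])
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
pairs = h0_persistence(pts if False else dist)
print(betti0(pairs, 0.5))   # -> 2
print(betti0(pairs, 10.0))  # -> 1
```

Higher-order cavities (β_1 and above) require tracking triangles and their boundaries, which is where library support becomes essential.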
In DNNs, we can, for example, study how the functional topology varies during training as follows (Fig. 3). First, we compute the correlation of every node in our DNN to every other node at each epoch. Nodes that are highly correlated (i.e., their correlation is above a threshold) are defined as connected, even if there is no actual edge or path connecting them in the network's computational graph. These connections define a simplicial complex with a number of cavities, given by the Betti numbers. We know that the dynamics of the low-dimensional Betti numbers (i.e., β_0 and β_1) is informative of the bias-variance problem (i.e., the generalization vs. memorization problem) [8]. Similarly, it has been shown that these persistent homology measures can be used to study and interpret the data as points in a functional space, making it possible to learn and optimize the estimates defined on the data [5].

3 Algorithm
Recall that D = {(x_i, y_i)}_{i=1}^n is the set of labeled training samples, with n the number of samples. Let a_p(x_i) be the activation value of node p in our DNN for a particular input x_i. Passing the sample vectors x_1, …, x_n through the network allows us to compute the correlation between the activations of each pair of nodes (p, q), which defines the metric of our Vietoris-Rips complex. Formally,

ρ_pq = (1/n) Σ_{i=1}^n (a_p(x_i) − μ_p)(a_q(x_i) − μ_q) / (σ_p σ_q),    (2)

where μ_p and σ_p indicate the mean and standard deviation of the activations of node p over D.

We represent the results of our persistent homology using a persistence diagram. In our persistence diagram, each point is given by a pair of real positive numbers (ε_b^i, ε_d^i), i = 1, …, c, where the subscripts b and d indicate the birth and death distances of a cavity in the Vietoris-Rips filtration, and c is the total number of cavities.
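As an illustration of Eq. (2), the functional distance matrix between nodes can be sketched as follows. This is not the paper's released code, and turning the correlation into a dissimilarity via d = 1 − |ρ| is our assumption, following the functional-graph construction of [8].

```python
import numpy as np

def functional_distance(acts):
    """Distance matrix between network nodes from their activations.

    acts: (n_samples, n_nodes) array; acts[i, p] is the activation of
    node p on input x_i. The correlation is Eq. (2); we convert it to a
    dissimilarity with d = 1 - |rho| (assumption: strongly correlated or
    anti-correlated nodes are 'close' in the functional metric space).
    """
    a = acts - acts.mean(axis=0)        # subtract per-node mean mu_p
    a /= acts.std(axis=0)               # divide by per-node sigma_p
    rho = (a.T @ a) / acts.shape[0]     # Pearson correlation matrix
    d = 1.0 - np.abs(rho)
    np.fill_diagonal(d, 0.0)            # a node is at distance 0 from itself
    return d

# Toy activations: node 1 is almost a copy of node 0, node 2 is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
acts = np.stack([x,
                 2 * x + 0.01 * rng.normal(size=100),
                 rng.normal(size=100)], axis=1)
d = functional_distance(acts)
# d[0, 1] is tiny (near-perfect correlation); d[0, 2] is large.
```

This distance matrix is what feeds the Vietoris-Rips filtration of the previous section.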
A filtration of a metric space is a nested subsequence of complexes that abides by the rule S_{ε_1} ⊆ S_{ε_2} ⊆ ⋯ ⊆ S_{ε_m} [29]. This filtration is in fact what defines the persistence diagram of a k-dimensional homology group: it is obtained by recording the creation and deletion of k-dimensional homology features, which, in turn, allows us to compute the lifespan of each homological feature [6].
Based on this persistence diagram, we define the life of the cavities as their average persistence in the diagram. Formally,
ℓ = (1/c) Σ_{i=1}^c (ε_d^i − ε_b^i).    (3)
Similarly, we define the midlife as the average midpoint between birth and death. Formally,
μ = (1/c) Σ_{i=1}^c (ε_b^i + ε_d^i) / 2.    (4)
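Given the (birth, death) pairs of a persistence diagram, Eqs. (3) and (4) reduce to a few lines. This is a sketch; dropping infinite bars before averaging is a common convention we assume here.

```python
import numpy as np

def life_midlife(pairs):
    """Topological summaries of Eqs. (3) and (4).

    pairs: (birth, death) distances of the c cavities found in the
    Vietoris-Rips filtration (infinite bars are dropped first).
    life    = (1/c) * sum(death - birth)        # mean persistence
    midlife = (1/c) * sum((death + birth) / 2)  # mean midpoint
    """
    p = np.asarray([bd for bd in pairs if np.isfinite(bd[1])], dtype=float)
    births, deaths = p[:, 0], p[:, 1]
    life = np.mean(deaths - births)
    midlife = np.mean((deaths + births) / 2.0)
    return life, midlife

pairs = [(0.1, 0.5), (0.2, 0.4), (0.0, float("inf"))]
life, midlife = life_midlife(pairs)
print(life)     # mean of 0.4 and 0.2 -> 0.3
print(midlife)  # mean of 0.3 and 0.3 -> 0.3
```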
Finally, we define a linear functional mapping from these topological summaries to the gap between the training and testing error as

ĝ = w_1 ℓ + w_2 μ + b,    (5)

where ĝ is our estimate of the gap between the training and testing errors, and w_1, w_2, and b are the regression coefficients of the linear fit, Figure 1(b).
With the above result, we can estimate the testing error without the need for any testing data as

P̂_test = P_train + ĝ,    (6)

where P_train is the training error computed during training with D.
Given an actual testing dataset D_test, we can compute the accuracy of our estimated testing error as

e = |P̂_test − P_test|,    (7)

where P_test is the testing error computed on D_test.
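Putting Eqs. (5)-(7) together, the regression-and-estimation step can be sketched on synthetic numbers. The coefficients and data below are made up for illustration; in the paper, the (life, midlife, gap) triples come from many trained networks via Alg. 1.

```python
import numpy as np

# Synthetic (life, midlife, observed gap) triples standing in for
# measurements from many trained networks.
rng = np.random.default_rng(1)
life = rng.uniform(0.1, 0.5, size=50)
midlife = rng.uniform(0.2, 0.8, size=50)
gap = -0.4 * life + 0.3 * midlife + 0.05 + 0.01 * rng.normal(size=50)

# Eq. (5): regress the gap on the topological summaries (linear model).
X = np.column_stack([life, midlife, np.ones_like(life)])
w, *_ = np.linalg.lstsq(X, gap, rcond=None)

def estimate_test_error(p_train, life, midlife, w):
    """Eq. (6): estimated test error = training error + estimated gap."""
    g_hat = w[0] * life + w[1] * midlife + w[2]
    return p_train + g_hat

p_hat = estimate_test_error(0.02, 0.3, 0.5, w)
# Eq. (7): with a real test set, the accuracy of the estimate would be
# |p_hat - p_test|; here no test set is needed to produce p_hat itself.
```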
The pseudo-code of our proposed approach is shown in Alg. 1. Footnote 3: Code available at https://github.com/cipriancorneanu/dnntopology.
3.1 Computational Complexity
Let the binomial coefficient C(|V|, k+1) be the number of possible k-simplices of a simplicial complex S over the vertex set V (as, for example, would be generated during the Vietoris-Rips filtration illustrated in Fig. 2). In order to compute persistent homology of order k on S, one has to reduce boundary matrices whose sizes are given by the numbers of k-simplices and (k+1)-simplices. This reduction has polynomial complexity in the number of simplices (cubic in the worst case).
Fortunately, in Alg. 1, we only need to compute persistent homology of the first order. Additionally, the simplicial complexes generated by the Vietoris-Rips filtration are generally extremely sparse. This means that, for typical DNNs, the number of simplices is far lower than the binomial coefficient defined above. In practice, we have found 10,000 to be a reasonable upper bound for the cardinality of V. This is because we define nodes by taking into account structural constraints on the topology of DNNs. Specifically, a node p is a random variable with value equal to the mean output of the filter in its corresponding convolutional layer. Having random variables allows us to define correlations and metric spaces in Alg. 1. Empirically, we have found that defining nodes in this way is robust, and similar characteristics, e.g., high correlation, can be found even if a subset of filters is randomly selected. For smaller, toy networks there is previous evidence [8] supporting that the functional topology defined in this way is informative for determining overfitting in DNNs.

Finally, the time it takes to compute persistent homology, and consequently the topological summaries ℓ and μ, is 5 minutes and 15 seconds for VGG16, one of the largest networks in our analysis. This corresponds to a single iteration of the main loop of Alg. 1, excluding training, on a single 2.2 GHz Intel Xeon CPU.
4 Related Work
Topology measures have previously been used to identify overfitting in DNNs. For example, the low-dimensional Betti curves (which count the cavities as a function of scale) of the functional (binary) graph of a network [8] can be used to perform early stopping during training and to detect adversarial attacks. Other topological measures, this time characterizing and monitoring structural properties, have been used for the same purpose [23].
Other works have tried to address the crucial question of how the generalization gap can be predicted from training data and network parameters [2, 1, 22, 15]. For example, a metric based on the ratio of the margin distribution at the output layer of the network and a spectral complexity measure related to the network's Lipschitz constant has been proposed [3]. In [22], the authors developed bounds on the generalization gap based on the product of norms of the weights across layers. In [1], the authors developed bounds based on noise stability properties of networks, showing that more stability implies better generalization. And in [15], the authors used the notion of margin in support vector machines to show that the normalized margin distribution across a DNN's layers is a predictor of the generalization gap.
5 Experimental Settings
We have derived an algorithm to compute the testing accuracy of a DNN that does not require access to any testing dataset. This section provides extensive validation of this algorithm. We apply it to three fundamental problems in computer vision: object recognition, facial action unit recognition, and semantic segmentation, Figure 4.
5.1 Object Recognition
Object recognition is one of the most fundamental and studied problems in computer vision. Many large scale databases exist, allowing us to provide multiple evaluations of the proposed approach.
5.2 Facial Action Unit Recognition
5.3 Semantic Segmentation
Semantic segmentation is another challenging problem in computer vision. We use PascalVOC [11] and Cityscapes [7]. The version of PascalVOC used consists of 2,913 images, with pixel-based annotations for 20 classes. The Cityscapes dataset focuses on semantic understanding of urban street scenes [7]. It provides 5,000 images with dense pixel annotations for 30 classes.
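Segmentation quality on these datasets is scored with the intersection-over-union measure defined next. A minimal sketch for a single binary mask follows (the multi-class case averages the per-class IoU; the function name is ours):

```python
import numpy as np

def iou(pred, gt):
    """Intersection-over-union between two binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()   # pixels both masks agree on
    union = np.logical_or(pred, gt).sum()    # pixels present in either mask
    return inter / union if union else 1.0   # two empty masks agree perfectly

pred = np.zeros((4, 4), dtype=int); pred[:2, :2] = 1   # 4 predicted pixels
gt = np.zeros((4, 4), dtype=int);   gt[:2, :] = 1      # 8 ground-truth pixels
print(iou(pred, gt))  # 4 / 8 = 0.5
```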
Semantic segmentation is evaluated using intersection-over-union (IoU; Footnote 4: also known as the Jaccard index), which counts the number of pixels common to the ground-truth and predicted segmentation masks divided by the total number of pixels present across both masks, Figure 7 and Table 5.

5.4 Models
We have chosen a wide range of architectures (i.e., topologies), including three standard and popular models [17, 24, 14] and a set of customdesigned ones. This provides diversity in depth, number of parameters, and topology. The custom DNNs are summarized in Table 6.
For semantic segmentation, we use a custom architecture called a Fully Convolutional Network (FCN), capable of producing dense pixel predictions [19]. It casts classical classifier networks [14, 24] into an encoder-decoder topology. The encoder can be any of the networks previously used.

5.5 Training
For all the datasets we have used, if a separate test set is provided, we attach it to the training data and perform cross-validation on all available data. The number of folds differs between object recognition and the other two problems. Each training run uses a fixed learning rate with random initialization. We also train each model on subsets of the available folds; this increases the generalization-gap variance for a specific dataset. In the results presented below, we show all the trainings that achieved a performance metric above a minimum threshold, and we skip extreme cases of generalization gaps close to the maximum.
For object recognition, the input images are resized to a fixed-size color format, unless explicitly stated, and are randomly cropped and randomly flipped. In the case of semantic segmentation, all inputs are color images of a fixed resolution. No batch normalization (except for ResNet, which follows the original design), dropout, or other regularization techniques are used during training. We train with a sufficiently large fixed number of epochs to guarantee saturation in both training and validation performance.

We use a standard stochastic gradient descent (SGD) optimizer for all training, with momentum, learning rate, and weight decay as indicated above. The learning rate is adaptive, following a plateau criterion on the validation performance: it is reduced to a quarter every time the validation performance metric does not vary outside a fixed range for a fixed number of epochs.

6 Results and Discussion
[Table 1: leave-one-sample-out estimation error for object recognition; columns conv_2, conv_4, alexnet, conv_6, resnet18, vgg16, and their mean. Numeric entries not recoverable from the extraction.]

Table 2: leave-one-dataset-out estimation error (mean ± std) for object recognition.

Dataset    conv2          conv4          vgg16          resnet18
svhn       10.94 ± 4.99   10.73 ± 2.60    8.01 ± 2.31   11.09 ± 4.18
cifar10     9.64 ± 3.60    5.24 ± 1.92    9.41 ± 1.72    5.67 ± 0.73
cifar100    4.79 ± 6.20   22.46 ± 6.87   10.93 ± 1.46    6.71 ± 1.51
imagenet    8.33 ± 5.48   21.84 ± 7.35   13.49 ± 7.14    9.49 ± 4.35

[Table 3: leave-one-sample-out estimation error for AU recognition; columns resnet18, vgg16, and their mean. Numeric entries not recoverable from the extraction.]

Table 4: leave-one-dataset-out estimation error (mean ± std) for AU recognition.

Dataset     resnet18      vgg16
bp4d        3.82 ± 2.80   6.04 ± 4.17
disfa       3.07 ± 2.17   4.07 ± 3.56
emotionet   7.48 ± 3.66   7.46 ± 5.01

[Table 5: leave-one-sample-out estimation error for semantic segmentation; columns fcn32_vgg16, fcn32_resnet18, and their mean. Numeric entries not recoverable from the extraction.]

Table 6: custom network architectures.

Network   Convolutions   FC layers
conv_2    256, 256, …    …
conv_4    256, 256, …    …
conv_6    256, 256, …    …

[Remaining layer sizes in Table 6 not recoverable from the extraction.]
Topological summaries are strongly correlated with the performance gap. This holds true over different vision problems, datasets and networks.
Life ℓ, the average persistence of cavities in the Vietoris-Rips filtration, is negatively correlated with the performance gap. This means that the more structured the functional metric space of the DNN (i.e., the larger the holes it contains), the less it overfits.

Midlife μ is positively correlated with the performance gap. Midlife is an indicator of the average distance at which cavities are formed. For DNNs that overfit less, cavities are formed at smaller distances ε, which indicates that fewer connections in the metric space are needed to form them.
We show plots of the topological summaries against the performance gap for object recognition, AU recognition, and semantic segmentation in Figures 5-7, respectively. The linear mapping between each topological summary, life and midlife (Eqs. 3 & 4), and the performance gap is shown in the first and second rows of these figures, respectively. Results are shown separately for each DNN.
The results for each dataset are indicated by a disc, where the centre specifies the mean and the radius the standard deviation. We also plot the linear regression line and the corresponding standard deviation of the observed samples about it.

Finally, Tables 1, 3 & 5 show the estimation error, namely the absolute value of the difference between the estimate given by Alg. 1 and that obtained with a testing set, computed with Eq. (7) by leaving one sample out. A different view of the same results can be found in Tables 2 and 4, where the mean and standard deviation of the same error are computed by leaving one dataset out.
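The leave-one-sample-out protocol behind these tables can be sketched as follows. The data are synthetic and the `loo_errors` helper is ours; in the paper each row of X would hold the measured (life, midlife) summaries of one trained network and y its observed gap.

```python
import numpy as np

def loo_errors(X, y):
    """Leave-one-sample-out evaluation of the linear map of Eq. (5).

    For each sample, refit the regression on all other samples and
    record the absolute error of the prediction on the held-out one.
    """
    errs = []
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        w, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errs.append(abs(X[i] @ w - y[i]))
    return np.array(errs)

# Synthetic stand-in for (life, midlife, 1) rows and observed gaps.
rng = np.random.default_rng(2)
X = np.column_stack([rng.uniform(0, 1, 40), rng.uniform(0, 1, 40), np.ones(40)])
y = X @ np.array([-0.4, 0.3, 0.05]) + 0.01 * rng.normal(size=40)
errs = loo_errors(X, y)
# errs.mean() is small: held-out gaps are predicted closely.
```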
It is worth mentioning that our algorithm is general and can be applied to any DNN architecture. In Table 6 we detail the structure of the networks used in this paper. These networks range from simple ones, with only a few hundred nodes, to large ones (e.g., ResNet) with many thousands of nodes.
The strong correlations between basic properties of the functional graph of a DNN and fundamental learning properties like the performance gap also make these networks more transparent. Not only do we propose an algorithm capable of computing the performance gap, but we show that it is linked to a simple law of the inner workings of the network. We consider this a contribution towards making deep learning more explainable.
Based on these observations, we have chosen to model the relationship between the performance gap and the topological summaries through a linear function. Figures 5-7 show a simplified representation of the observed pairs and the regressed lines.
We should mention that a linear hypothesis is by no means the only option. Using a nonlinear regressor in Alg. 1 leads to even more accurate predictions of the testing error. However, this improvement comes at the cost of being less flexible when studying less common networks/topologies, i.e., it overfits.
Crucially, a small average error is obtained across computer vision problems, making our estimate about as accurate as computing the testing error with a labelled dataset D_test.
7 Conclusions
We have derived, to our knowledge, the first algorithm to compute the testing classification accuracy of any DNNbased system in computer vision, without the need for the collection of any testing data.
The main advantages of the proposed evaluation method versus the classical use of a testing dataset are:

there is no need for a sequestered dataset to be maintained and updated by a thirdparty,

there is no need to run costly crossvalidation analyses,

we can modify our DNN without the concern of overfitting to the testing data (because it does not exist), and,

we can use all the available data for training the system.
We have provided extensive evaluations of the proposed approach on three classical computer vision problems and shown the efficacy of the derived algorithm.
As a final note, we would like to point out the obvious. When deriving computer vision systems, practitioners would generally want to use all the testing tools at their disposal. The one presented in this paper is one of them, but we should not be limited by it. Where we have access to a sequestered database, we should take advantage of it. In combination, multiple testing approaches should generally lead to better designs.
Acknowledgments. NIH grants R01DC014498 and R01EY020834, Human Frontier Science Program RGP0036/2016, TIN201674946P (MINECO/FEDER, UE), CERCA (Generalitat de Catalunya) and ICREA (ICREA Academia). CC and AMM defined main ideas and derived algorithms. CC, with SE, and MM ran experiments. CC and AMM wrote the paper.
References
 [1] S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
 [2] L. F. Barrett, R. Adolphs, S. Marsella, A. M. Martinez, and S. D. Pollak. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest, 20(1):1–68, 2019.
 [3] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
 [4] C. F. Benitez-Quiroz, R. Srinivasan, Q. Feng, Y. Wang, and A. M. Martinez. EmotioNet challenge: Recognition of facial expressions of emotion in the wild. arXiv preprint arXiv:1703.01210, 2017.

 [5] M. G. Bergomi, P. Frosini, D. Giorgi, and N. Quercioli. Towards a topological-geometrical theory of group equivariant non-expansive operators for data analysis and machine learning. Nature Machine Intelligence, 1(9):423–433, 2019.
 [6] F. Chazal, D. Cohen-Steiner, L. J. Guibas, F. Mémoli, and S. Y. Oudot. Gromov-Hausdorff stable signatures for shapes using persistence. In Computer Graphics Forum, volume 28, pages 1393–1403. Wiley Online Library, 2009.

 [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [8] C. A. Corneanu, M. Madadi, S. Escalera, and A. M. Martinez. What does it mean to learn in deep networks? And, how does one detect adversarial attacks? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4757–4766, 2019.
 [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
 [10] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111(1):98–136, 2015.
 [11] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
 [12] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez. EmotioNet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5562–5570, 2016.
 [13] A. Hatcher. Algebraic Topology. Cambridge University Press, 2002.
 [14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [15] Y. Jiang, D. Krishnan, H. Mobahi, and S. Bengio. Predicting the generalization gap in deep networks with margin distributions. arXiv preprint arXiv:1810.00113, 2018.
 [16] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

 [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
 [18] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
 [19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
 [20] S. M. Mavadati, M. H. Mahoor, K. Bartlett, P. Trinh, and J. F. Cohn. Disfa: A spontaneous facial action intensity database. IEEE Transactions on Affective Computing, 4(2):151–160, 2013.
 [21] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 [22] B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017.
 [23] B. Rieck, M. Togninalli, C. Bock, M. Moor, M. Horn, T. Gumbsch, and K. Borgwardt. Neural persistence: A complexity measure for deep neural networks using algebraic topology. arXiv preprint arXiv:1812.09764, 2018.
 [24] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. ICLR, 2015.
 [25] L. Vietoris. Über den höheren zusammenhang kompakter räume und eine klasse von zusammenhangstreuen abbildungen. Mathematische Annalen, 97(1):454–472, 1927.
 [26] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
 [27] A. M. Zador. A critique of pure learning and what artificial neural networks can learn from animal brains. Nature communications, 10(1):1–7, 2019.
 [28] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard. BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. Image and Vision Computing, 32(10):692–706, 2014.
 [29] A. Zomorodian and G. Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249–274, 2005.