RicciNets: Curvature-guided Pruning of High-performance Neural Networks Using Ricci Flow

07/08/2020 ∙ by Samuel Glass, et al. ∙ University of Cambridge

A novel method to identify salient computational paths within randomly wired neural networks before training is proposed. The computational graph is pruned based on a node mass probability function defined by local graph measures and weighted by hyperparameters produced by a reinforcement learning-based controller network. We use the definition of Ricci curvature to remove edges of low importance before mapping the computational graph to a neural network. We show a reduction of almost 35% in the number of floating-point operations (FLOPs) per pass, with no degradation in performance. Further, the method successfully regularises randomly wired neural networks on the basis of purely structural properties, and we find that the favourable characteristics identified in one network generalise to other networks. The method produces networks with better performance under similar compression than those pruned by lowest-magnitude weights. To the best of our knowledge, this is the first work on pruning randomly wired neural networks, and the first to use the topological measure of Ricci curvature in the pruning mechanism.

1 Introduction

At birth, the wiring of the brain's most important networks is largely random, and random graph modelling is heavily used in the study of the human brain (Bullmore and Sporns, 2009; Bassett and Sporns, 2017). Recent work on randomly wired neural networks has emulated this in the field of deep learning, moving away from the carefully designed wiring that has typically dominated neural architecture search (NAS) (Xie et al., 2019). Randomly wired networks display performance comparable to state-of-the-art architectures (e.g. ResNet, DenseNet), and provide a relatively unrestricted space on which to perform further optimisation. We propose a search method that takes place within a low-dimensional search space: a pruning methodology that operates on networks produced by a successful random network generator. It is based on the discrete Ricci curvature of a graph, with estimates of a node's community, its contribution to robustness, and its computational demand contributing to the identification of salient computational paths in the network. A curvature-guided diffusion process, Ricci flow, deforms the discrete space of the graph, and edges are removed based on their local deformation. A reinforcement learning controller parameterises the Ricci flow process. To the best of the authors' knowledge, this is the first work to take inspiration from the physics concept of space curvature deformation, in combination with reinforcement learning, to drive neural network pruning.

The proposed technique has three key advantages: (1) it is a successful form of regularisation that promotes sparse network connectivity and, as a result, low computational demand; (2) it operates with no degradation in baseline performance; (3) a successful hyperparameter state can be applied, without further optimisation, to other networks produced by the same random generator. The pruning takes place before training, saving compute during both training and inference. This is the first known work to investigate the pruning of randomly wired neural networks. Using RicciNets, we demonstrate novel, efficient generation of compact neural architectures.

2 Related Work

Ricci curvature is the definition of curvature used in Einstein's field equations. Loosely, it measures the deformation of a volume on the surface of a manifold relative to the corresponding volume in Euclidean space (see Appendix A). Unlike other ML works, in which continuous Ricci curvature is used to visualise high-dimensional loss landscapes (Li et al., 2018) or in novel characterisations of structural features (Chazal and Michel, 2017; Rieck et al., 2018), we use a measure of Ricci curvature within the discrete space of a graph. To do so, we relate Ricci curvature to optimal transport (Ollivier, 2009). Given a probability measure at each node, optimal transport can be formulated on a graph and Ricci curvature can be calculated. For a metric space $(X, d)$ equipped with a probability measure $m_x$ for each $x \in X$, the Ollivier-Ricci curvature $\kappa(x, y)$ along the shortest path $xy$ is given by Eq. (1), where $W(m_x, m_y)$ is the Wasserstein distance between the two measures and $d(x, y)$ is the path distance:

\[ \kappa(x, y) = 1 - \frac{W(m_x, m_y)}{d(x, y)} \tag{1} \]
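As a concrete illustration of Eq. (1), the sketch below computes the Ollivier-Ricci curvature of a single edge from two node mass distributions, using shortest-path distances as the transport cost and the POT (Python Optimal Transport) package for the Wasserstein distance. The uniform-neighbourhood mass used here is only a stand-in for the distribution defined later in Section 3.1.

```python
import networkx as nx
import numpy as np
import ot  # POT: Python Optimal Transport

def ollivier_ricci(G: nx.Graph, x, y, mass_fn):
    """kappa(x, y) = 1 - W(m_x, m_y) / d(x, y), as in Eq. (1)."""
    m_x, m_y = mass_fn(G, x), mass_fn(G, y)          # dicts: node -> probability mass
    support_x, support_y = list(m_x), list(m_y)
    d = dict(nx.all_pairs_shortest_path_length(G))   # hop-count path distances
    # ground-cost matrix between the two supports
    M = np.array([[d[u][v] for v in support_y] for u in support_x], dtype=float)
    w = ot.emd2(np.array([m_x[u] for u in support_x]),
                np.array([m_y[v] for v in support_y]), M)  # Wasserstein distance
    return 1.0 - w / d[x][y]

def uniform_mass(G, x, alpha=0.5):
    """Stand-in mass: keep alpha on x, spread the rest uniformly over neighbours."""
    nbrs = list(G.neighbors(x))
    m = {x: alpha}
    m.update({v: (1 - alpha) / len(nbrs) for v in nbrs})
    return m

G = nx.watts_strogatz_graph(16, 4, 0.75, seed=0)
u, v = next(iter(G.edges()))
print(ollivier_ricci(G, u, v, uniform_mass))
```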

Similar to Ni et al. (2019), we use a curvature-guided diffusion process, Ricci flow, to detect community structures within a network. The probability distribution used here includes further terms to estimate the computational demand of an individual node as well as its contribution to the overall network's robustness with respect to damage. In both works, the edge weights, and hence the curvature, evolve over discrete time intervals, Eq. (2):

\[ w^{(i+1)}_{xy} = \big(1 - \kappa^{(i)}(x, y)\big)\, d^{(i)}(x, y) \tag{2} \]

where $w^{(i)}_{xy}$ is the weight of edge $xy$ at iteration $i$, and $\kappa^{(i)}$ and $d^{(i)}$ are the curvature and shortest-path distance computed from the weights at iteration $i$.

Successful efforts in NAS may require months or even years of compute time. Zoph and Le (2017) demonstrated a computationally expensive process in which an RL controller parameterised a search within a high-dimensional search space; the resultant architectures match state-of-the-art performance. We optimise a combination of only three hyperparameters for our search. Randomly wired networks offer a relatively unrestricted initial search space with good baseline performance. Xie et al. (2019) found that a Watts-Strogatz graph generator (Watts and Strogatz, 1998) with $K = 4$ and $P = 0.75$ produced networks with the best performance (see Appendix B). Their simple graph-to-network mapping allowed a focus on wiring and structural features. The successful generator and straightforward mapping are both used here.

3 Method

Random computational graphs are generated using the Watts-Strogatz model. First, we use a controller neural network to predict a hyperparameter state. Second, we propose a node mass distribution based on local graph measures weighted by the predicted hyperparameters. Then, we use the process of Ricci flow to compute a weight associated with each edge and prune the computational graph based on an edge threshold value. The pruned computational graph is mapped to a neural network and trained on the dataset. Accuracy and FLOPs per pass are combined to yield a reward, which is passed to the controller network; the controller is updated via a policy gradient method (see Appendix C). The code is publicly available at https://github.com/seglass5/RicciNets.
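The loop below sketches this pipeline end to end. Every helper name (generate_ws_graph, predict_hyperparameters, ricci_flow_prune, graph_to_network, and so on) is a hypothetical placeholder for a step described above, not a function from the released code, and the reward expression is only the regularised form described in Section 3.3.

```python
# Hypothetical end-to-end sketch of one controller episode (helper names are placeholders).
def run_episode(controller, dataset, n_steps, flops_coefficient):
    rewards = []
    for _ in range(n_steps):
        graph = generate_ws_graph()                           # Watts-Strogatz computational graph
        hyperparams = predict_hyperparameters(controller)     # controller proposes a parameter state
        masses = node_mass_distribution(graph, hyperparams)   # Eq. (3): local-measure node masses
        pruned = ricci_flow_prune(graph, masses)              # Ricci flow + mean-weight threshold
        network = graph_to_network(pruned)                    # map the pruned graph to a neural network
        accuracy = train_and_evaluate(network, dataset)
        flops = flops_per_pass(network)
        rewards.append(accuracy - flops_coefficient * flops)  # accuracy regularised by FLOPs
    update_controller(controller, rewards)                    # policy gradient (REINFORCE) update
    return rewards
```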

3.1 Mass Distribution

We calculate the curvature within a network using a hypothesised probability (mass) distribution, given in Eq. (3):

(3)

Input($v$) denotes the input degree of node $v$, Output($v$) its output degree, and Deg($v$) its total degree; $N(v)$ denotes the set of immediate neighbours of $v$. One hyperparameter sets the proportion of mass that remains on the node itself, and three further hyperparameters control the contribution of each of the three neighbour terms. The first term promotes a well-defined community structure, yielding a lower mass for a node with more neighbours. The second term promotes low input degree; taking the transformation operation at a node to be of linear complexity in its input degree, this approximately promotes low computational burden at each node. The final term promotes a smaller ratio of output degree to input degree. A greater loss in accuracy is observed when removing a node with high output degree, or an edge whose target node has low input degree (Xie et al., 2019); a smaller ratio of output degree to input degree is therefore taken to indicate better robustness with respect to graph damage. By requiring that the masses of a node and its neighbours sum to unity, the distribution can be reduced to an expression in three hyperparameters.
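As an illustration only, the sketch below shows one way such a node mass function could combine the three local measures. The parameter names (alpha, w_community, w_compute, w_robust) and the specific functional form are assumptions for this sketch; the exact form of Eq. (3) is not reproduced here.

```python
import networkx as nx

def node_mass(G: nx.DiGraph, x, alpha, w_community, w_compute, w_robust):
    """Hypothetical node mass distribution in the spirit of Eq. (3).

    A fraction `alpha` of the mass stays on node x; the remainder is spread
    over its neighbours according to three local measures (community
    structure, computational demand, robustness).
    """
    neighbours = set(G.predecessors(x)) | set(G.successors(x))
    mass = {x: alpha}
    scores = {}
    for v in neighbours:
        in_deg = max(G.in_degree(v), 1)
        out_deg = G.out_degree(v)
        total_deg = in_deg + out_deg
        scores[v] = (w_community / total_deg                 # fewer neighbours -> more mass
                     + w_compute / in_deg                    # low input degree -> low compute
                     + w_robust * in_deg / max(out_deg, 1))  # small out/in ratio -> robustness
    z = sum(scores.values()) or 1.0
    for v, s in scores.items():
        mass[v] = (1.0 - alpha) * s / z                      # node plus neighbours sum to one
    return mass
```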

3.2 Pruning Threshold

The weight associated with each edge in the graph is updated via Ricci flow from an initial value of zero, Eq. (2). On each iteration the weights are normalised, to prevent expansion to infinity, and checked for convergence. The Ricci flow process was run for 50 iterations and typically converged well within this limit. The threshold for pruning was set to the mean of all edge weights in the network following Ricci flow. Hyperparameter selection can alter the skewness of the distributions of curvatures and weights, and adjusting the distribution of weights is interpreted as the controller network learning a definition of saliency: if very few paths can be considered salient, parameter prediction can lead to negative skewness in the weight distribution and more edges are removed. Similarly, if removing a group of paths causes too large a drop in accuracy, a set of hyperparameters that shifts the mean so as to save this group can be learnt.
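A minimal sketch of this step is given below, assuming an undirected weighted graph and a user-supplied `curvature_fn` implementing Eq. (1) under the paper's mass distribution. The normalisation scheme and the direction of the threshold (removing edges whose final weight exceeds the mean) are assumptions of the sketch, not confirmed details of the released implementation.

```python
import networkx as nx

def ricci_flow_prune(G: nx.Graph, curvature_fn, iterations=50, tol=1e-4):
    """Run discrete Ricci flow on the edge weights, then prune by the mean weight."""
    for _ in range(iterations):
        # shortest-path distances under the current weights (missing weights default to 1)
        dist = dict(nx.all_pairs_dijkstra_path_length(G, weight="weight"))
        new_w = {}
        for x, y in G.edges():
            kappa = curvature_fn(G, x, y)
            new_w[(x, y)] = (1.0 - kappa) * dist[x][y]       # discrete Ricci flow step, Eq. (2)
        # normalise so the total weight cannot expand to infinity
        total = sum(new_w.values())
        scale = G.number_of_edges() / total if total > 0 else 1.0
        max_change = 0.0
        for (x, y), w in new_w.items():
            w *= scale
            max_change = max(max_change, abs(w - G[x][y].get("weight", 1.0)))
            G[x][y]["weight"] = w
        if max_change < tol:                                  # convergence check
            break
    threshold = sum(w for _, _, w in G.edges(data="weight")) / G.number_of_edges()
    G.remove_edges_from([(x, y) for x, y, w in G.edges(data="weight") if w > threshold])
    return G
```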

3.3 Controller

An auxiliary controller network generates hyperparameter states. The controller is implemented as a simple feed-forward network. The parameter states are discretised over a fixed range and one-hot encoded, and the output parameters are passed to the pruning function. The controller seeks to maximise its expected reward, Eq. (4), where $J(\theta)$ denotes the expected reward under the controller parameters $\theta$ and $a_{1:T}$ represents the list of actions (the possible hyperparameter combinations) for the $T$ hyperparameters:

\[ J(\theta) = \mathbb{E}_{P(a_{1:T};\, \theta)}\,[R] \tag{4} \]

Since the reward signal is non-differentiable, we use a policy gradient method to iteratively update $\theta$. We use the REINFORCE rule (Williams, 1992), Eq. (5):

\[ \nabla_{\theta} J(\theta) = \sum_{t=1}^{T} \mathbb{E}_{P(a_{1:T};\, \theta)}\!\left[ \nabla_{\theta} \log P(a_t \mid a_{(t-1):1};\, \theta)\, R \right] \tag{5} \]

An empirical approximation of this quantity is given in Eq. (6), where $m$ is the number of parameter states sampled in one batch by the controller and $R_k$ is the reward achieved after training by the network in the $k$-th parameter state:

\[ \frac{1}{m} \sum_{k=1}^{m} \sum_{t=1}^{T} \nabla_{\theta} \log P(a_t \mid a_{(t-1):1};\, \theta)\, R_k \tag{6} \]
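A minimal sketch of how this update could be implemented with PyTorch is shown below. The controller architecture, the number of hyperparameters, and the number of discretisation bins are placeholders, not the values used in the paper.

```python
import torch
import torch.nn as nn

N_HYPERPARAMS = 3    # assumed: three pruning hyperparameters
N_BINS = 10          # assumed: number of discretised values per hyperparameter

# simple feed-forward controller emitting one categorical distribution per hyperparameter
controller = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, N_HYPERPARAMS * N_BINS))
optimizer = torch.optim.Adam(controller.parameters(), lr=1e-3)

def sample_parameter_state():
    """Sample one hyperparameter state and keep its log-probability for REINFORCE."""
    logits = controller(torch.ones(1, 1)).view(N_HYPERPARAMS, N_BINS)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                    # one bin index per hyperparameter
    return actions, dist.log_prob(actions).sum()

def reinforce_update(log_probs, rewards):
    """Empirical REINFORCE gradient of Eq. (6): mean of (log-prob * reward) over the batch."""
    loss = -torch.stack([lp * r for lp, r in zip(log_probs, rewards)]).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```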

The reward used to update the policy is given in Eq. (7): the top-one accuracy $A$ is regularised by the FLOPs per pass of the network, $F$, using a regularisation parameter $c$:

\[ R = A - c\, F \tag{7} \]

Episode rewards are discounted according to Eq. (8), where $r_t$ is the episode reward at step $t$, $\gamma$ is a discount parameter, and $N$ is the number of iterations within an episode; this encourages prolonged episodes:

\[ R_s = \sum_{t=s}^{N} \gamma^{\, t - s}\, r_t \tag{8} \]
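A short sketch of the discounting step, under the assumption that Eq. (8) is the standard discounted return over the remaining steps of an episode:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_s = sum_{t >= s} gamma^(t - s) * r_t for each step s of an episode."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return list(reversed(returns))

# example: per-step rewards from one pruning episode
print(discounted_returns([0.2, 0.3, 0.5], gamma=0.9))
```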

4 Experiments and Results

Experimentation is based on the classification of images from the CIFAR-10 dataset, with 50,000 training images and 10,000 test images. The images are batched in groups of 64. Each network is trained for 4 epochs, and the policy gradient controller is run over 20 episodes, with episodes batched in pairs to update the policy.

4.1 Evaluation Procedure

Network performance is evaluated using the top-one accuracy and the number of FLOPs per pass of the network produced. Performance is measured relative to a baseline set by an unpruned network produced using the same generator parameters. To assess the importance of considering the topology of randomly wired networks in the pruning procedure, we compare RicciNets against pruning weights by lowest magnitude (Zhu and Gupta, 2017). The combination of hyperparameters learnt for a given graph is also applied to other graphs from the same random graph generator. We note that while this implementation of randomly wired neural networks reaches its full accuracy on CIFAR-10 only after 100 training epochs, we report results after 4 epochs owing to resource constraints.
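For the magnitude-pruning baseline, a minimal sketch using PyTorch's built-in pruning utilities is shown below. Zhu and Gupta (2017) prune gradually during training, so the one-shot version here is a simplification, and the 35% pruning fraction is an arbitrary illustration rather than the paper's setting.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def magnitude_prune(model: nn.Module, amount: float = 0.35):
    """Remove the lowest-magnitude weights from every convolutional and linear layer."""
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")   # make the pruning mask permanent
    return model
```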

4.2 Results

Fig. 1 (a) shows the variation in top-one accuracy of the resultant networks for the regularisation parameter $c$ in the range [0, 1.5]. The methodology produces better-than-baseline performance under compression across the whole range of $c$, with a small variance in top-one accuracy. Fig. 1 (b) shows the top-one accuracy of the architectures against the FLOPs per pass, expressed as a percentage of baseline; all networks produced operate at a reduced fraction of the baseline FLOPs per pass. For small networks, more severe compression would result in a chain-like structure and a sharp drop-off in accuracy, which is discouraged by the controller (see Appendix D). An exploratory step carried out with some probability within the controller, or further fine-tuning of the policy gradient network, could yield a larger range of compression.

(a) Accuracy against $c$. Pruned networks display better-than-baseline performance, with an increase in accuracy over baseline when averaged across the range of $c$.
(b) Accuracy against FLOPs per pass of the pruned network, expressed as a percentage of baseline. Overlapping error bars appear darker.
Figure 1: The pruned networks show no degradation in performance under compression.

RicciNets demonstrates a restricted range of compression when compared with pruning via lowest-magnitude weights. Within this range, however, the networks produced by RicciNets demonstrate better performance than those pruned by weight (Table 1). Future work includes incorporating more control over the level of compression via alternative ways of regularising the reward objective.

The combination of hyperparameters that produced the greatest accuracy on the original generator generalised to other Watts-Strogatz graphs. Pruned networks generated with different $K$ and $P$ displayed an increase in performance and moderate compression (Fig. 2). RicciNets maintained the salient computational paths identified in the learnt case.


Pruning
Top One Accuracy (%) Weights Remaining (%)
RicciNets
Lowest Magnitude
Baseline
Table 1: Average top-one accuracy and average percentage of baseline weights remaining after pruning, for RicciNets, pruning via lowest-magnitude weights, and the baseline. Averages are taken within the shared range of weights remaining.
(a, b) Top-one accuracy against the Watts-Strogatz generator parameters $K$ and $P$, with the other parameter held fixed.
(c, d) FLOPs per pass through the network against the generator parameters $K$ and $P$, with the other parameter held fixed.
Figure 2: Pruning WS graphs without specific optimisation showed no drop in performance.

5 Conclusions

Our model combines the principle of curvature with machine learning to carry out neural architecture search. It successfully identifies salient computational paths, and demonstrates a reduction in computational cost with no degradation in baseline performance. It outperforms pruning via lowest-magnitude weights on randomly wired neural networks. A combination of hyperparameters learnt on a given network generalises to others from the same generator with no further optimisation, offering compression with no drop in performance. The results suggest a successful novel methodology for compact NAS, and are the first on the compression dynamics of randomly wired neural networks. Future work will develop more comparisons against other pruning procedures, and investigate off-policy controller algorithms.


References

  • D. S. Bassett and O. Sporns (2017) Network neuroscience. Nature Neuroscience 20 (3), pp. 353.
  • B. Chow and D. Knopf (2004) The Ricci Flow: An Introduction. Mathematical Surveys and Monographs, American Mathematical Society.
  • M. Boileau (2016) Ricci Flow and Geometric Applications: Cetraro, Italy 2010. Lecture Notes in Mathematics 2166, Springer International Publishing.
  • S. Boucksom (2013) An Introduction to the Kähler-Ricci Flow. Lecture Notes in Mathematics 2086, Springer International Publishing.
  • S. Brendle (2010) Ricci Flow and the Sphere Theorem. Graduate Studies in Mathematics 111, American Mathematical Society.
  • E. Bullmore and O. Sporns (2009) Complex brain networks: graph theoretical analysis of structural and functional systems. Nature Reviews Neuroscience 10 (3), pp. 186–198.
  • F. Chazal and B. Michel (2017) An introduction to topological data analysis: fundamental and practical aspects for data scientists. arXiv preprint arXiv:1710.04019.
  • A. Choudhary, J. F. Lindner, E. G. Holliday, S. T. Miller, S. Sinha, and W. L. Ditto (2020) Physics-enhanced neural networks learn order and chaos. Physical Review E 101, pp. 062207.
  • A. Gordon, E. Eban, O. Nachum, B. Chen, H. Wu, T. Yang, and E. Choi (2018) MorphNet: fast & simple resource-constrained structure learning of deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1586–1595.
  • H. Li, Z. Xu, G. Taylor, C. Studer, and T. Goldstein (2018) Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems, pp. 6389–6399.
  • P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz (2016) Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440.
  • C. Ni, Y. Lin, F. Luo, and J. Gao (2019) Community detection on networks with Ricci flow. Scientific Reports 9 (1), pp. 9984.
  • Y. Ollivier (2009) Ricci curvature of Markov chains on metric spaces. Journal of Functional Analysis 256 (3), pp. 810–864.
  • B. Rieck, M. Togninalli, C. Bock, M. Moor, M. Horn, T. Gumbsch, and K. Borgwardt (2018) Neural persistence: a complexity measure for deep neural networks using algebraic topology. arXiv preprint arXiv:1812.09764.
  • D. J. Watts and S. H. Strogatz (1998) Collective dynamics of 'small-world' networks. Nature 393 (6684), pp. 440–442.
  • W. Zeng and X. D. Gu (2013) Ricci Flow for Shape Analysis and Surface Registration: Theories, Algorithms and Applications. SpringerBriefs in Mathematics, Springer-Verlag New York.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3), pp. 229–256.
  • S. Xie, A. Kirillov, R. Girshick, and K. He (2019) Exploring randomly wired neural networks for image recognition. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1284–1293.
  • M. Zhu and S. Gupta (2017) To prune, or not to prune: exploring the efficacy of pruning for model compression. Technical report, Google.
  • B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In International Conference on Learning Representations.

Appendix A Ricci Curvature

The notion of curvature was introduced by Gauss and Riemann over 190 years ago; it is a measure of how space is curved at a point in that space. Given an n-dimensional manifold, which we define as a space that locally looks n-dimensional, we may form a Riemannian metric; this assigns to each tangent space of the manifold a Euclidean metric, which in turn gives the "standard" distance between any two vectors in that space. A manifold together with its Riemannian metric forms a Riemannian manifold (Ni et al., 2019).

For a surface S, the Gauss map from S to the unit sphere sends a point p on S to the unit normal vector of S at p, a point on the unit sphere. The Gaussian curvature of the surface at a point p is the Jacobian of the Gauss map at p, i.e. the signed area distortion of the Gauss map at p. Hence, the plane has zero curvature, the sphere has positive curvature, and the hyperboloid of one sheet has negative curvature. The curvature depends only on the induced Riemannian metric on the surface and does not depend on how the surface is embedded in space.

Riemann generalised Gaussian curvature to higher dimensions. For a Riemannian manifold (M, g), the sectional curvature assigns to each 2-dimensional linear subspace P of the tangent space of M at p a scalar, the Riemannian sectional curvature; this scalar is equal to the Gaussian curvature of the image of P under the exponential map. A positively curved space tends to have small diameter and is geometrically crowded; a sphere, for example. Conversely, a negatively curved space spreads out geometrically.

The Ricci curvature assigns to each unit tangent vector v at a point p a scalar, which is the average of the sectional curvatures of the planes containing v.

There have been various approaches to generalising the concept of curvature to non-manifold spaces. Here, we look to assign curvature to a graph G(V, E, w), with vertices V, edges E and edge weights w. Ollivier-Ricci curvature (Ollivier, 2009) relates Ricci curvature to optimal transport, allowing a mapping to discrete spaces. Given a probability measure at each point, optimal transport can be formulated on general metric spaces and may be used to define Ricci curvature on a network with edge weights and a probability measure at each vertex.

Appendix B Watts-Strogatz Random Graph Generator

The Watts-Strogatz method demonstrated the most success in Xie et al. (2019). It operates by first placing $N$ nodes regularly in a ring, with each node connected to its $K/2$ neighbours on both sides, where $K$ is an even number. Then, in a clockwise loop, for every node $v$, the edge that connects $v$ to its clockwise $i$-th next node is rewired with probability $P$. "Rewiring" is defined as uniformly choosing a random target node that is not $v$ and that does not create a duplicate edge. This loop is repeated $K/2$ times, for $i = 1, \dots, K/2$. $K$ and $P$ are the only two parameters of the Watts-Strogatz model. Any graph generated by a state WS($K$, $P$) has exactly $N \cdot K/2$ edges. WS($K$, $P$) covers only a small subset of all possible $N$-node graphs, and a different subset from that covered by other random graph generators. Random graph generators present a relatively unrestricted initial search space, but a prior is introduced by the choice of generator. Watts-Strogatz graphs display small-world properties; the typical distance between two randomly chosen nodes is proportional to $\log N$.
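For illustration, a graph with the same statistics can be generated with networkx. Note that networkx's generator returns an undirected graph, whereas Xie et al. (2019) subsequently orient the edges to obtain a DAG before mapping to a network; the orientation step below is a simplified assumption, and the node count is an arbitrary example value.

```python
import networkx as nx

N, K, P = 32, 4, 0.75   # example size; WS(4, 0.75) is the generator highlighted by Xie et al. (2019)

ws = nx.watts_strogatz_graph(n=N, k=K, p=P, seed=0)

# Simplified orientation: direct every edge from the lower to the higher node index,
# giving a DAG through which data flows in order of increasing node label.
dag = nx.DiGraph()
dag.add_nodes_from(ws.nodes())
dag.add_edges_from((min(u, v), max(u, v)) for u, v in ws.edges())
print(dag.number_of_nodes(), dag.number_of_edges())   # N nodes, N * K / 2 edges
```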

Appendix C Method Overview

Figure 3: An overview of the methodology. Pruning takes place before the graph-to-network mapping. The pale blue box indicates a single step within a policy gradient episode.

Appendix D Extensive Pruning

Figure 4: The unpruned state of a graph, left, and the same graph following pruning under a selected combination of hyperparameters, right. We observe increased sparsity whilst still retaining some clustering and skip connections. Since both nodes and paths are removed, the node labels do not carry over from the unpruned to the pruned state; they are shown here to indicate information flow, and in both cases data is carried in the direction of increasing node label.
Figure 5: Sparser, chain-like architectures typically give a lower top-one accuracy and so are discouraged by the controller network.