As smartphones become increasingly integrated into our daily lives and the numbers of both cyberphysical systems and smart devices continue to grow, there has been a noticeable evolution in the way many large data sets are generated. In fact, Cisco predicted that in 2021, whilst 20.6 ZB of data (e.g. large e-commerce site records) would be handled by cloud-based approaches in large data-centres, this amount would be dwarfed by the 850 ZB generated by local devices. In response to data sources becoming more device-centric, the focus has shifted towards implementing, and even training, many machine learning algorithms locally on the (potentially hardware-limited) devices. Running the algorithms on the devices represents a radical shift away from traditional centralised learning, where the data and algorithms are stored and processed in the cloud, but brings the benefits of (1) increased user privacy, as the data is not transmitted to a centralised server, (2) reduced latency, since the algorithms can react immediately to newly generated data from the device, and (3) improved energy efficiency, mostly because the data and algorithm outputs do not have to be constantly transferred to and from the cloud. However, running the algorithms locally on the devices brings its own issues, most notably in dealing with the devices’ limited computational power, memory and energy storage. Overcoming these hardware constraints has motivated substantial efforts to improve algorithm design, particularly towards developing leaner, more efficient neural networks.
Two popular approaches to making neural network algorithms leaner and more hardware-conscious are quantised neural networks [34, 7, 35], where fixed-point arithmetic is used to accelerate computation and reduce the memory footprint, and pruned neural networks [27, 4, 18, 19, 32, 31, 22, 17, 12, 23, 26, 15, 30, 24], where, typically, the weights contributing least to the function mapping are removed, promoting sparsity in the weights. Both of these approaches have achieved impressive results. For instance, by quantising,  was able to reduce model size by
without any noticeable loss in accuracy when evaluated on the CIFAR-10 benchmark and demonstrated that between 50% and 80% of its model weights could be pruned with little impact on performance. However, our understanding of neural network reduction methods such as these remains limited, and reliably predicting their performance remains a challenge. Illustrating this point,  stated that for pruned neural networks “our results suggest the need for more careful baseline evaluations in future research on structured pruning methods” with a similar sentiment raised in  “our clearest finding is that the community suffers from a lack of standardized benchmarks and metrics”. These quotes indicate a need for robust evaluation methods for lean neural network designs, a perspective explored in this work.
This paper introduces a method to automatically synthesize neural networks of reduced dimensions (meaning fewer neurons) from a trained larger one, as illustrated in Figure 1. These smaller networks are termed reduced-order neural networks since the approach was inspired by reduced-order modelling in control theory. The weights and biases of the reduced-order network are generated from the solution of a semi-definite program (SDP), a class of well-studied convex problems combining a linear cost function with a linear matrix inequality (LMI) constraint, which minimises the worst-case approximation error with respect to the larger network. Bounds are also obtained for this worst-case approximation error, so the performance of the network reduction is guaranteed.
What separates the proposed synthesis approach from existing methods for generating efficient neural networks, e.g. pruning, is the inclusion of the worst-case approximation error of the reduced-order neural network directly within the cost function for computing the weights and biases. Whilst the presented results are still preliminary, their focus on robust neural network synthesis introduces a new set of tools to generate lean neural networks which should have more reliable out-of-sample performance and which are equipped with approximation error bounds. The broader goal of this work is to translate recent results on the verification of NN robustness using an SDP [11, 33] into a synthesis problem, mimicking the progression from absolute stability theory to robust control synthesis witnessed in control theory during the 1980s. In this way, this work carries on the tradition of control theorists exploring the connections between robust control theory and neural networks, as witnessed since the 1990s with Glover, Barabanov, Angeli and Narendra.
Non-negative real vectors of dimension are denoted . A positive (negative) definite matrix is denoted . Non-negative diagonal matrices of dimension are . The matrix of zeros of dimension is and the vector of zeros of dimension is . The identity matrix of size is . The vector of 1s of dimension is and the matrix of 1s is . The element of a vector is denoted unless otherwise defined in the text. The notation is adopted to represent symmetric matrices in a compact form, e.g.
I-B Neural networks
The neural networks considered will be treated as functions mapping input vectors of size to output vectors of dimension . In a slight abuse of notation, will refer to mappings of both scalars and vectors, with the vector operation applied element-wise. The full-order neural network will be composed of hidden layers, with the layer being composed of neurons. The total number of neurons in the full-order neural network is . Similarly, the reduced-order neural network will be composed of hidden layers with the layer being composed of neurons. The total number of neurons in the reduced-order network is . The dimension of the domain of the activation functions is defined as (full-order network) and (reduced-order network).
II Problem statement
In this section, the general problem of synthesizing reduced-order NNs is posed. Consider a nonlinear function mapping input data to an output set . The goal of this work is to generate a “simpler” function that is as “close” as possible to for all . Here, “simpler” refers to the dimension of the approximating neural network (measured by the total number of neurons) and “closeness” relates to the approximation error between the two functions and measured by the induced 2-norm . The goal is to automatically synthesize the simpler functions from the solution of a convex problem and to obtain worst-case bounds for the approximation error with respect to the larger neural network for all .
To ensure that the function approximation problem remains feasible, structure is added to the set . It is assumed that the function being approximated is generated by a feed-forward neural network
Here, the input data is mapped through the nonlinear activation functions (which could be the standard choices of ReLU, sigmoid, tanh or any function that satisfies a quadratic constraint as given in Section III-B) element-wise with the weight matrices , and biases , . Whilst the results are described for feed-forward neural networks, the method can be generalised to other network architectures, such as recurrent and even implicit neural networks. As an aside, verifying the well-posedness of implicit neural networks has a strong connection to that of Lurie systems with feed-through terms.
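As a minimal sketch of the feed-forward structure described above, the snippet below evaluates a small fully connected network with the activation applied element-wise in the hidden layers. The function names (`relu`, `affine`, `feed_forward`), the list-of-rows weight layout and the toy weights are illustrative assumptions and are not the notation or code of this paper.

```python
def relu(s):
    """ReLU activation for a scalar."""
    return max(0.0, s)


def affine(W, x, b):
    """Return W x + b for a matrix W (list of rows), vector x and bias b."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


def feed_forward(u, weights, biases, phi=relu):
    """Map the input u through the hidden layers (activation applied
    element-wise) followed by a final affine output layer."""
    x = list(u)
    for W, b in zip(weights[:-1], biases[:-1]):
        x = [phi(s) for s in affine(W, x, b)]
    return affine(weights[-1], x, biases[-1])


# Toy network (assumed values): 1 input, 2 hidden ReLU neurons, 1 output.
weights = [[[1.0], [-1.0]], [[1.0, 1.0]]]
biases = [[0.0, 0.0], [0.0]]
print(feed_forward([-2.0], weights, biases))  # [2.0], since this network computes |u|
```

Any other scalar activation can be passed through the `phi` argument, for example `math.tanh`.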
The neural network (which will be referred to as the full-order neural network) is to be approximated by another neural network (referred to as the reduced-order neural network) of a smaller dimension
The weights and biases in this neural network are , , . The network structure in (3b) is general, and even allows for implicitly defined networks . This generality follows from the lack of structure imposed on the matrices used in the synthesis procedure. However, by adding structure, the search can be limited to, for example, feed-forward networks, which are simpler to implement.
In this work, the dimension of the reduced-order network is fixed and the problem is to find the reduced-order NN’s parameters, being the weights and biases , that minimise the worst-case approximation error between the full and reduced order neural networks for all . The main tool used for this reduced-order NN synthesis problem is the outer approximation of the NN’s input set , nonlinear activation function’s gains and the output error by quadratic constraints. These outer approximations enable the robust weight synthesis problem to be stated as a convex SDP, albeit at the expense of introducing conservatism into the analysis.
III Quadratic Constraints
In this section, the quadratic constraints for the convex outer approximations of the various sets of interest of the reduced NN synthesis problem are defined. These characterisations are posed in the framework of , which in turn was inspired by the integral quadratic constraint framework of  and the classical absolute stability results for Lurie systems .
III-A Quadratic constraint: Input set
The input data is restricted to the hyper-rectangle .
Define the hyper-rectangle . If then where
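One common way to express membership of a hyper-rectangle as a quadratic constraint, assumed here purely for illustration, uses the fact that each coordinate inside the box satisfies (u_i - lo_i)(hi_i - u_i) >= 0, so any non-negatively weighted sum of these products is also non-negative. The sketch below checks this numerically; the names `lo`, `hi` and `gamma` and the numerical values are hypothetical and not taken from the paper.

```python
import random


def box_quadratic_form(u, lo, hi, gamma):
    """Weighted sum of (u_i - lo_i) * (hi_i - u_i) with weights gamma_i >= 0.

    The sum is non-negative whenever u lies inside the box [lo, hi].
    """
    return sum(g * (ui - l) * (h - ui) for g, ui, l, h in zip(gamma, u, lo, hi))


random.seed(0)
lo, hi = [-1.0, 0.0, 2.0], [1.0, 3.0, 5.0]   # assumed box bounds
gamma = [random.random() for _ in lo]        # arbitrary non-negative weights

# Points sampled inside the box always satisfy the constraint.
inside = [[random.uniform(l, h) for l, h in zip(lo, hi)] for _ in range(1000)]
all_inside_ok = all(box_quadratic_form(u, lo, hi, gamma) >= 0.0 for u in inside)

# A point that violates every upper bound gives a negative value.
outside_value = box_quadratic_form([2.0, 4.0, 6.0], lo, hi, gamma)
print(all_inside_ok, outside_value < 0.0)
```

For a fixed choice of weights, some points outside the box can still satisfy the weighted inequality, which is why a constraint of this kind is an outer approximation of the input set.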
III-B Quadratic constraint: Activation functions
The main obstacle to any robustness-type result for neural networks is in accounting for the nonlinear activation functions . To address this issue, the following function properties are introduced.
The activation function satisfying is said to be sector bounded if
and slope restricted if
If then the nonlinearity is monotonic and if is slope restricted then it is also sector bounded. The activation function is bounded if
it is positive if
its complement is positive if
and it satisfies the complementarity condition if
Most popular activation functions, including the ReLU, (shifted-)sigmoid and tanh, satisfy some of these conditions, as illustrated in Table I. As the number of properties satisfied by increases, the characterisation of this function within the robustness analysis improves, often resulting in less conservative results. It is also noted that some activation functions may require a shift so that they pass through the origin, e.g. the sigmoid, or may require transformations to satisfy additional function properties, as demonstrated by the representation of the LeakyReLU as a ReLU plus a linear term.
Table I: Properties of commonly used activation functions, including the sigmoid, tanh, rectified linear unit (ReLU) and exponential linear unit (ELU). The properties of other functions, such as the LeakyReLU, can also be inferred.
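The properties listed above can be checked numerically on a grid of points. The sketch below does this for the ReLU and tanh, using the standard textbook forms of the sector, slope, positivity, complement-positivity and complementarity conditions with bounds [0, 1]; these forms, the grid and the helper names are assumptions made for illustration and may differ in detail from the definitions used in the paper.

```python
import math


def relu(s):
    return max(0.0, s)


def in_sector(phi, grid, lower, upper, tol=1e-12):
    """Check lower <= phi(s)/s <= upper on the non-zero grid points."""
    return all(lower - tol <= phi(s) / s <= upper + tol for s in grid if s != 0.0)


def slope_restricted(phi, grid, lower, upper, tol=1e-12):
    """Check that every secant slope between grid points lies in [lower, upper]."""
    return all(lower - tol <= (phi(a) - phi(b)) / (a - b) <= upper + tol
               for a in grid for b in grid if a != b)


def is_positive(phi, grid):
    """Check phi(s) >= 0."""
    return all(phi(s) >= 0.0 for s in grid)


def complement_positive(phi, grid):
    """Check phi(s) - s >= 0."""
    return all(phi(s) - s >= 0.0 for s in grid)


def complementarity(phi, grid, tol=1e-12):
    """Check (phi(s) - s) * phi(s) = 0."""
    return all(abs((phi(s) - s) * phi(s)) <= tol for s in grid)


grid = [k / 10.0 for k in range(-50, 51)]  # samples of the interval [-5, 5]

for name, phi in [("ReLU", relu), ("tanh", math.tanh)]:
    print(name,
          in_sector(phi, grid, 0.0, 1.0),
          slope_restricted(phi, grid, 0.0, 1.0),
          is_positive(phi, grid),
          complement_positive(phi, grid),
          complementarity(phi, grid))
```

On this grid the ReLU passes all five checks, whereas tanh passes only the sector and slope checks. A grid check of this kind can refute a property but cannot prove it.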
As is well-known from control theory , functions with these specific properties are important for robustness analysis problems because they can be characterised by quadratic constraints.
Consider the vectors , and that are mapped component-wise through the activation functions . If is a sector-bounded nonlinearity, then
and if it is slope-restricted then
If is bounded then
if is positive then
if the complement of is positive then
and if satisfies the complementarity condition then
Additionally, if both and its complement are positive then so are the cross terms
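To illustrate how such properties translate into quadratic constraints with non-negative scaling terms, the sketch below evaluates three commonly used quadratic forms for the ReLU on random vectors and confirms that none of them becomes negative. The specific forms (sector and slope bounds of [0, 1], and cross terms weighted by a non-negative matrix), together with the names `lam` and `lam_cross`, are assumptions made for illustration and are not necessarily those of Lemma 1.

```python
import random


def relu(s):
    return max(0.0, s)


def sector_term(s, lam):
    """sum_i lam_i * phi(s_i) * (s_i - phi(s_i)); non-negative for the sector [0, 1]."""
    return sum(l * relu(si) * (si - relu(si)) for l, si in zip(lam, s))


def slope_term(s, s_hat, lam):
    """sum_i lam_i * dphi_i * (ds_i - dphi_i); non-negative for slopes in [0, 1]."""
    total = 0.0
    for l, a, b in zip(lam, s, s_hat):
        dphi, ds = relu(a) - relu(b), a - b
        total += l * dphi * (ds - dphi)
    return total


def cross_term(s, lam):
    """sum_ij lam_ij * phi(s_i) * (phi(s_j) - s_j); non-negative for the ReLU."""
    return sum(lam[i][j] * relu(si) * (relu(sj) - sj)
               for i, si in enumerate(s) for j, sj in enumerate(s))


random.seed(1)
n, tol = 4, 1e-9
worst = 0.0
for _ in range(2000):
    s = [random.uniform(-3, 3) for _ in range(n)]
    s_hat = [random.uniform(-3, 3) for _ in range(n)]
    lam = [random.random() for _ in range(n)]          # non-negative scalings
    lam_cross = [[random.random() for _ in range(n)] for _ in range(n)]
    worst = min(worst, sector_term(s, lam), slope_term(s, s_hat, lam),
                cross_term(s, lam_cross))

print(worst >= -tol)
```

Because the inequalities hold for every non-negative choice of the scaling terms, those terms can be left free and handed to an optimiser as decision variables.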
The characterisation of the nonlinear activation functions via quadratic constraints allows the neural network robustness analysis to be posed as an SDP, with the various scaling variables in Lemma 1 being decision variables. Such an approach has been used in [11, 14, 3], and elsewhere, for neural network robustness problems, with the conservatism of the approach coming from the obtained worst-case bounds holding for all nonlinearities that satisfy the quadratic constraints. In this work, the aim is to extend this quadratic constraint framework from neural network robustness analysis to a synthesis problem.
A quadratic constraint characterisation of both the reduced and full-order neural networks can then be written, with the following lemma being the application of Lemma 1 for both the reduced and full-order neural networks.
Appendix 2 details the characterisation of for the specific case of ReLU activation functions.
III-C Quadratic constraint: Approximation error of the reduced-order neural network
An upper bound for the approximation error between the full and reduced-order networks can also be expressed as a quadratic constraint. This error bound will be used as a performance metric to gauge how well the reduced-order neural network approximates the full-order one, as in how well .
Definition 3 (Approximation error)
For some , , the reduced-order NN’s approximation error is defined as the quadratic bound
In practice, this bound is computed by minimising over some combination of and
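A certified worst-case bound of this kind can be sanity-checked against a sampled estimate. The sketch below computes the largest squared error observed between two toy networks over random inputs; since sampling can only under-estimate the worst case, the result is a lower bound that any valid certified bound should exceed. The two networks, the input interval and all names are hypothetical and chosen only for illustration.

```python
import random


def relu(s):
    return max(0.0, s)


def single_layer_net(u, W, b, w_out, b_out):
    """Scalar-input, scalar-output network with one hidden ReLU layer."""
    hidden = [relu(w * u + bi) for w, bi in zip(W, b)]
    return sum(wo * h for wo, h in zip(w_out, hidden)) + b_out


def sampled_worst_case_error(f, g, u_min, u_max, samples=5000, seed=0):
    """Largest squared error |f(u) - g(u)|^2 seen over random inputs in [u_min, u_max].

    Sampling can only under-estimate the true worst case, so the result is a
    lower bound that a certified upper bound should dominate.
    """
    rng = random.Random(seed)
    return max((f(u) - g(u)) ** 2
               for u in (rng.uniform(u_min, u_max) for _ in range(samples)))


# Assumed toy networks: a 3-neuron "full" model and a 1-neuron "reduced" model.
def full(u):
    return single_layer_net(u, [1.0, -1.0, 0.5], [0.0, 0.0, -0.25], [1.0, 1.0, 0.2], 0.0)


def reduced(u):
    return single_layer_net(u, [1.0], [0.0], [1.0], 0.0)


error_lower_bound = sampled_worst_case_error(full, reduced, -1.0, 1.0)
print(error_lower_bound)
```

For these toy networks the error is largest at the left end of the interval, where the reduced model outputs zero and the full model outputs one, so the sampled value approaches 1 from below.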
IV Reduced-order neural network synthesis problem
This section contains the main result of the paper: an SDP formulation of the reduced-order NN synthesis problem (Proposition 1). To arrive at this formulation, a general statement of the synthesis problem is first given in Theorem 1. This theorem characterises the search for the reduced-order neural network’s parameters as minimising the worst-case approximation error for all inputs .
Assume the activation functions satisfy some of the properties from Definition 2. Then, with the fixed weights , if there exists a solution to
then the worst-case approximation error is bounded by for all .
Proof. See Appendix 3.
The main issue with Theorem 1 is verifying inequality (11b), since it includes a non-convex bilinear matrix inequality (BMI) between the matrix variables of the reduced-order network’s weights, its biases and the scaling variables in . The following proposition details how this constraint can be written (after applying a convex relaxation of the underlying BMI) as an LMI. The search over the reduced NN variables can then be translated into an SDP, a class of well-understood convex optimisation problems with many standard solvers, such as MOSEK implemented through the YALMIP interface in MATLAB, or even the Robust Control Toolbox.
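An LMI of this kind asks for a symmetric matrix to be negative (or positive) definite. As a small illustration of what is being verified, the sketch below tests negative definiteness of a given numerical matrix by attempting a Cholesky factorisation of its negation. It is a standalone check written under assumed toy data and is not a replacement for an SDP solver, which searches over the decision variables rather than testing a fixed matrix.

```python
import math


def is_positive_definite(M):
    """Attempt a Cholesky factorisation of the symmetric matrix M (list of rows).

    The factorisation succeeds with strictly positive pivots exactly when M is
    positive definite.
    """
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if partial <= 0.0:
                    return False
                L[i][j] = math.sqrt(partial)
            else:
                L[i][j] = partial / L[j][j]
    return True


def is_negative_definite(M):
    """M is negative definite exactly when -M is positive definite."""
    return is_positive_definite([[-entry for entry in row] for row in M])


# Assumed toy matrices, not taken from the paper.
print(is_negative_definite([[-2.0, 1.0], [1.0, -2.0]]))  # True: eigenvalues -1 and -3
print(is_negative_definite([[-1.0, 2.0], [2.0, -1.0]]))  # False: eigenvalues 1 and -3
```

In floating-point arithmetic a matrix that is only marginally definite may fail this test, so a small margin is usually added in practice.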
with defined in (33) of Appendix 4, then the reduced-order network with weights and affine terms
ensures that the worst-case approximation error bound of the reduced-order neural network satisfies for all .
Proof. See Appendix 4.
Appendix 5 details the matrix (which characterises how the activation functions are included within the robustness condition ) for the special case ReLU(). Some remarks about the proposition are given in Appendix 6.
V Numerical example
The proposed reduced-order neural network synthesis method was evaluated in two numerical examples. In both cases, the performance of the synthesized neural networks was evaluated graphically (see Figures 2-3) to give a better representation of the robustness of the approximations (the focus of this work). Only academic examples were considered, both because of the well-known scalability issues of SDP solvers (although these are becoming less of an issue) and because performance was measured graphically. The code for the numerical examples can be obtained on request from the authors.
The first example explores the impact of reducing the dimension of the reduced-order neural network on its accuracy. In this case, the full-order neural network considered was a single hidden layer network of dimension 10 with the weights , , and, with the input constrained to . The ReLU was taken as the activation function of both the full and reduced-order neural networks. Reduced-order neural networks with single hidden layers of various dimensions were then synthesized using Proposition 1. Figure 2(a) shows the various approximations obtained and Figure 2(b) shows how the error bounds and approximation errors changed as the dimension of the reduced-order network increased. The error bound was satisfied in all cases (albeit conservatively) and the approximation error dropped as the dimension of the reduced network increased, as expected.
The second example considers a more complex function to approximate and illustrates some potential pitfalls of pruning too hard. In this case, the full-order network’s weights were defined by , , and with then the weights and biases were , , , , and Figure 3 shows the output generated from a reduced-order neural network as well as the network generated by setting the matrices in Lemma 2 to be diagonal (this reduced the compute time but, as shown, can alter the obtained function). Also shown is the case when the full-order neural network has been pruned to have a similar number of connections as the reduced-order one by removing the 32 smallest weights. In this case, the pruned network was cut so far that it simply generated a constant function, but further fine-tuning of the pruned network may recover performance. Likewise, fine-tuning of the reduced-order neural network (through different substitutions of and ) may improve the approximation of the synthesized reduced-order neural networks.
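The pruning baseline mentioned above, in which a fixed number of the smallest weights is removed, can be sketched as follows. The helper `prune_smallest` and the toy weights are hypothetical; the sketch zeroes the smallest-magnitude entries across all layers and performs no fine-tuning afterwards.

```python
def prune_smallest(weights, k):
    """Zero the k smallest-magnitude entries across a list of weight matrices.

    Returns new matrices; the inputs are left unchanged.
    """
    # Rank every weight by absolute value, remembering its position.
    ranked = sorted((abs(w), layer, i, j)
                    for layer, W in enumerate(weights)
                    for i, row in enumerate(W)
                    for j, w in enumerate(row))
    to_remove = {(layer, i, j) for _, layer, i, j in ranked[:k]}
    return [[[0.0 if (layer, i, j) in to_remove else w for j, w in enumerate(row)]
             for i, row in enumerate(W)]
            for layer, W in enumerate(weights)]


# Assumed toy weights: two layers with six weights in total.
weights = [[[0.9, -0.05], [0.3, -1.2]], [[0.01, 2.0]]]
pruned = prune_smallest(weights, k=3)
print(pruned)  # [[[0.9, 0.0], [0.0, -1.2]], [[0.0, 2.0]]]
```

Removing too many weights in this way can disconnect the input from the output entirely, which is consistent with the constant function observed for the heavily pruned network.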
A method to synthesize the weights and biases of reduced-order neural networks (having fewer neurons) approximating the input/output mapping of a larger one was introduced. A semi-definite program was defined for this synthesis problem that directly minimised the worst-case approximation error of the reduced-order network with respect to the larger one, with this error being bounded. By including the worst-case approximation error directly within the training cost function, it is hoped that the ideas explored in this paper will lead to more robust and reliable reduced-order neural network approximations. Several open problems remain to be explored, most notably reducing the conservatism of the bounds, scaling up the method to large neural networks and exploring the convexification of the bilinear matrix inequality of the synthesis problem.
The authors were funded for this work through the Nextrode Project of the Faraday Institution (EPSRC Grant EP/M009521/1) and a UK Intelligence community fellowship from the Royal Academy of Engineering.
[1] The MOSEK interior point optimizer for linear programming: an implementation of the homogeneous algorithm. In High Performance Optimization, pp. 197–232.
[2] (2009) Convergence in networks with counterclockwise neural dynamics. IEEE Transactions on Neural Networks 20 (5), pp. 794–804.
[3] Stability analysis of discrete-time recurrent neural networks. IEEE Transactions on Neural Networks 13 (2), pp. 292–303.
[4] (2020) What is the state of neural network pruning? arXiv preprint arXiv:2003.03033.
[5] (2004) Convex optimization. Cambridge University Press.
[6] (1999) Bounds of the induced norm and model reduction errors for systems with repeated scalar nonlinearities. IEEE Transactions on Automatic Control 44 (3), pp. 471–483.
[7] (2014) Training deep neural networks with low precision multiplications. arXiv preprint arXiv:1412.7024.
[8] (2020) Enabling certification of verification-agnostic networks via memory-efficient semidefinite programming. Advances in Neural Information Processing Systems 33.
[9] (1988) State-space solutions to standard H2 and H∞ control problems. In Procs. of the American Control Conference, pp. 1691–1696.
[10] Implicit deep learning. arXiv preprint arXiv:1908.06315.
[11] (2019) Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming. arXiv preprint arXiv:1903.01287.
[12] (2019) The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574.
[13] Cloud index: Forecast and methodology, 2016–2021, white paper. [online]. Available: https://www.cisco.com/c/en/us/solutions/collateral/service-provider/globalcloud-index-gci/white-paper-c11-738085.htm.
[14] (1984) All optimal Hankel-norm approximations of linear multivariable systems and their L∞-error bounds. International Journal of Control 39 (6), pp. 1115–1193.
[15] (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
[16] (2015) Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems 28, pp. 1135–1143.
[17] (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 164–171.
[18] (1989) Pruning versus clipping in neural networks. Physical Review A 39 (12), pp. 6600.
[19] (1990) A simple procedure for pruning back-propagation trained neural networks. IEEE Transactions on Neural Networks 1 (2), pp. 239–242.
[20] (2002) Nonlinear systems. 3rd edition, Prentice Hall, Upper Saddle River, NJ.
[21] (1990) Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks 1 (1), pp. 4–27.
[22] (1990) Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
[23] (2019) A signal propagation perspective for pruning neural networks at initialization. arXiv preprint arXiv:1906.06307.
[24] (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
[25] (2016) Fixed point quantization of deep convolutional networks. In International Conference on Machine Learning, pp. 2849–2858.
[26] (2017) Runtime neural pruning. In Advances in Neural Information Processing Systems, pp. 2181–2191.
[27] (2018) Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270.
[28] (2004) YALMIP: A toolbox for modeling and optimization in MATLAB. In International Conference on Robotics and Automation, pp. 284–289.
[29] (1997) System analysis via integral quadratic constraints. IEEE Transactions on Automatic Control 42 (6), pp. 819–830.
[30] Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440.
[31] (1989) Skeletonization: A technique for trimming the fat from a network via relevance assessment. In Advances in Neural Information Processing Systems, pp. 107–115.
[32] (1989) Using relevance to reduce network size automatically. Connection Science 1 (1), pp. 3–16.
[33] (2018) Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, pp. 10877–10887.
[34] (2016) Fixed-point performance analysis of recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 976–980.
[35] (2015) Resiliency of deep neural networks under quantization. arXiv preprint arXiv:1511.06488.
[36] (2017) Efficient processing of deep neural networks: A tutorial and survey. Proceedings of the IEEE 105 (12), pp. 2295–2329.
[37] (2018) Regional analysis of slope-restricted Lurie systems. IEEE Transactions on Automatic Control 64 (3), pp. 1201–1208.
[38] (2000) A tutorial on linear and bilinear matrix inequalities. Journal of Process Control 10 (4), pp. 363–385.
[39] (1968) Stability conditions for systems with monotone and slope-restricted nonlinearities. SIAM Journal on Control 6 (1), pp. 89–108.
[40] Edge intelligence: Paving the last mile of artificial intelligence with edge computing. Proceedings of the IEEE 107 (8), pp. 1738–1762.
Appendix 1: Definition of various vectors
For the full-order neural network, the vector mapped through the nonlinear activation function is
and for the reduced-order neural network with activation function , it is
The following two vectors are used to define the quadratic constraints of Proposition 1,
Appendix 2: Quadratic Constraint for the ReLU
Appendix 3: Proof of Theorem 1
Appendix 4: Proof of Proposition 1
In order to obtain the SDP formulation of the NN synthesis problem, the three quadratic constraints for the input set (Definition 1), the nonlinear activation functions (Lemma 2) and the output error set (Definition 3) need to be defined as matrix inequalities. The vectors and from Definition 4 in Appendix 1 are used throughout.
and defined in (18a).
The most challenging aspect of the proposition is obtaining an LMI for the synthesis of the neural network’s weights and biases. From Lemma 2, if the activation functions satisfy quadratic constraints and hence the inequality (9), then where is defined in (24).
The product of matrix variables (highlighted in blue in the above) leads to a non-convex BMI constraint in the problem, with both the scaling terms of the quadratic constraint inequality (9) and the reduced-order NN’s weights and biases being matrix variables. A convex relaxation is proposed for this BMI constraint (see (25)) but it is highlighted that this relaxation is perhaps the biggest source of conservatism in the proposition, as it restricts the space of solutions that can be searched over.
To achieve the desired linear representation of (24), the bilinear terms in (24) are expressed as scaled versions of the new matrix variables and , with the scaling done by the matrices and . In other words, it is proposed to express , with the bilinear terms
The scaling matrices , are free variables that can be picked by the user, under the stipulation that they preserve the properties of the multipliers of Lemma 2. In this work, the choice was
In this way, the non-convexity of the bilinear matrix inequality of the problem has been relaxed into a convex linear one. However, the substitution (25) limits the space of solutions that can be searched over by the synthesis SDP, resulting in only local optima being achieved and increased conservatism in the approximation error bounds.
The zeros in the top left corner of are problematic for numerically verifying positive-definiteness of the matrix used in the proposition. To alleviate this problem, the structural equality constraints of (29) are introduced to fill this zero block. These constraints are defined by drawing connections between the various elements of the vector , namely
with , , and . The following quadratic equalities can then be written
for any , , . There then exists a matrix (given in (31)) built from , and such that
With the matrices and defined, the robustness condition (11b) of Theorem 1 can be expressed as with . However, because is a special case of , if instead it can be shown that holds with , then a solution to has also been found.
Satisfying inequality is equivalent to verifying negative definiteness of , as in