Neural networks have been employed to accurately solve many different tasks (see, e.g., [arridge2019solving, bronstein2021geometric, raissi2019physics, jumper2021highly]). Indeed, because of their excellent approximation properties, ability to generalise to unseen data, and efficiency, neural networks are one of the preferred techniques for the approximation of functions in high-dimensional spaces.
In spite of this popularity, a substantial part of results and success stories in deep learning still relies on empirical evidence and more theoretical insight is needed. Recently, a number of scientific papers on the mathematical foundations of neural networks have appeared in the literature, [berner2021modern, weinan2017proposal, Saxe2014ExactST, ShwartzZiv2017OpeningTB, thorpe2018deep, huang2020deep]. In a similar spirit, many authors consider the design of deep learning architectures taking into account specific mathematical properties such as stability, symmetries, or constraints on the Lipschitz constant, [lecun1989backpropagation, hertrich2021convolutional, galimberti2021hamiltonian, smets2022pde, chen2019symplectic, gomez2017reversible, trockman2021orthogonalizing, jin2020sympnets, wang2020orthogonal, zakwan2022robust]. Even so, the imposition of structure on neural networks is often done in an ad hoc manner, making the resulting input to output mapping hard to analyse. In this paper, we describe a general and systematic way to impose desired mathematical structure on neural networks leading to an easier approach to their analysis.
There have been multiple attempts to formulate unifying principles for the design of neural networks. We hereby mention Geometric Deep Learning (see e.g. [bronstein2017geometric, bronstein2021geometric]), Neural ODEs (see e.g. [chen2018neural, massaroli2020dissecting, ruiz2021neural, xiao2018dynamical]
), the continuous-in-time interpretation of Recurrent Neural Networks (see e.g.[rusch2021unicornn, chang2019antisymmetricrnn]) and of Residual Neural Networks (see e.g. [weinan2017proposal, lu2018beyond, celledoni2021structure, ruthotto2020deep, agrachev2021control]). In this work, we focus on Residual Neural Networks (ResNets) and build upon their continuous interpretation.
Neural networks are compositions of parametric maps, i.e. we can characterise a neural network as a map , with being the network layers. For ResNets the most important parametric maps are of the form
The continuous-in-time interpretation of ResNets arises from the observation that if , coincides with one step of the explicit Euler method applied to the non-autonomous ODE . In this work, we consider piecewise-autonomous systems, i.e. we focus on time-switching systems of the form
with being piecewise constant (see e.g. [ruiz2021neural, liberzon2003switching]), and a family of parametric vector functions. This simplification is not restrictive and can help analyse and design neural networks, as we will clarify throughout the paper.
This interpretation of ResNets gives the skeleton of our reasoning. Indeed, we replace the explicit Euler method in Eq. 1 with suitable (structure-preserving) numerical flows of appropriate vector fields. We call “dynamical blocks” the groups of layers obtained with these numerical flows. The choice of the vector field is closely related to the structure to impose. For example, to derive symplectic neural networks we would apply symplectic time integrators to Hamiltonian vector fields. This approach enables us to derive new structured networks systematically and collocate other existing architectures into a more general setting, making their analysis easier. For instance, Section 2 presents a strategy to study the approximation capabilities of some structured networks. Finally, we highlight the flexibility and the benefits of this framework in Section 3, where we show that to obtain expressive and robust neural networks, one can also include layers that are not 1-Lipschitz.
There are multiple situations where one could be interested in networks with some prescribed property. We report three of them here, where we refer to as the function to approximate:
When has some known characterising property, e.g. is known to be symplectic, compare Section 4.
When the data we process has a particular structure, e.g. vectors whose entries sum to one, as we present in Section 4.
Thus, there are various applications where having neural networks structured in a particular way is desirable. We will delve deeper into some of them in the following sections. To be precise, we remark that all the properties we focus on are preserved under composition, such as being 1-Lipschitz or symplectic.
The paper is structured in five sections. First, in Section 2 we investigate the universal approximation capabilities of some neural networks, thanks to vector field decompositions, splitting methods and an embedding of the dynamics into larger dimensional spaces. We then move, in Section 3, to a neural network that has the property of being 1-Lipschitz. After the mathematical derivation of the architecture, we present some numerical experiments on adversarial robustness for the CIFAR-10 image classification problem. Next, in Section 4, we introduce other neural networks with specific designs. This last section aims to present in a systematic way how one can impose certain properties on the architecture. We finally conclude the paper in Section 5, mentioning some promising directions for further work.
Before moving on, we now report a numerical experiment that motivates our investigation of structured neural networks. The results highlight how imposing a structure does not have to degrade the approximation’s quality considerably. Furthermore, this experiment suggests that not all the constraining strategies perform similarly. Thus, a systematic process to impose structure is essential since it allows changing the architecture in a guided manner while preserving the property of interest.
1.1 Classification of points in the plane
We present a numerical experiment for the classification problem of the dataset in Fig. 0(b). We consider neural networks that are 1-Lipschitz, as in Section 3. We define the network layers alternatingly as contractive flow maps, whose vector fields belong to , and as flows of other Lipschitz continuous vector fields in , with and 111To impose the weight orthogonality, we set with being the matrix exponential and a trainable matrix.. The time steps for each vector field are network parameters, and we constrain them to get a 1-Lipschitz network, see Section 3. We report the results in Fig. 0(a) and Table 1.
The average classification test accuracy and final integration time, in combination, get better by combining with instead of considering alone. In particular, we see that the final integration time with is the smallest without significantly compromising the accuracy. The parameter measures how complex the network is; hence we get a more natural and efficient solution by alternating the vector fields. In Section 2 we reinforce this empirical result, proving results about theoretical approximation guarantees. This renders the possibility of obtaining neural networks with prescribed properties without compromising their approximation capabilities.
|Adopted family of vector fields||Median accuracy||Median of|
2 Universal approximation properties
As introduced before, neural network layers can be modelled by discretising ordinary differential equations. In particular, this ODE-based approach can also be beneficial for imposing some structure on neural networks and providing a better theoretical understanding of their properties. In this section, we follow this principle and prove the universal approximation capabilities of two neural network architectures. Starting with the continuous-in-time interpretation of neural networks, many approaches are possible to prove such approximation properties, often based on merging the broad range of results from dynamical systems theory and numerical analysis. One can proceed, for example, in a constructive way as done in[ruiz2021neural], where the authors investigate the dynamics of some neural networks and explicitly construct solutions to the problem of approximating a target function. Another possibility is to study the approximation capabilities of compositions of flow maps, as done in [li2022deep]. In this section, we focus on two solutions that, to the best of our knowledge, are new. The first result is based on a vector field decomposition, while the second is based on embedding vector fields into larger dimensional spaces.
The approximation results that we cover rely on approximating vector fields arbitrarily well. Consequently, this allows us to approximate their flow maps accurately. This is based on the fact that for a sufficiently regular vector field , if is such that for every
then also their flow maps are close to one another for finite time intervals. We formalise this reasoning in Proposition 1. In the proposition and throughout the paper, we denote with the time- flow of the vector field , applied to .
Let and be as in Eq. 3. Then where is the Lipschitz constant of .
We consider the integral equations associated to the ODEs and and study the difference of their solutions both with the same initial condition
Then we conclude that
applying Gronwall’s inequality.
A particular consequence of this proposition is that if for every there is an making Eq. 3 true, then we can approximate the flow map of arbitrarily well using elements of :
Because of this result, we now derive two approximation results for neural networks working at the level of modelling vector fields.
2.1 Approximation based on a vector field decomposition
We now aim to show that, for a particularly designed neural network, we can approximate arbitrarily well any continuous function in the norm and any differentiable invertible function in the supremum norm on compact sets. We also mention how to extend this last result to generic continuous functions.
Let be a continuous function, with a compact set. Suppose that it can be approximated, with respect to some norm , by a composition of flow maps of vector fields, i.e. for any , , such that
Then, can be approximated arbitrarily well by composing flow maps of gradient and sphere preserving vector fields, i.e. .
By sphere preserving vector field, we mean a vector field having as a first integral, i.e. such that for any .
The norm in Eq. 4 can be any norm that is well defined for functions in . Two typical choices in the literature are norms and the supremum norm
Various works, like [brenier2003p] and [li2022deep], have already proven the existence of vector fields making Eq. 4 true when is the norm and is a continuous function. Regarding the validity of hypothesis Eq. 4 with the norm defined in Eq. 5, we mention [teshima2020universal] where the authors have proven that if is a smooth invertible map with smooth inverse, then the existence of can be guaranteed.
Theorem 1 is a consequence of the Presnov decomposition of vector fields, introduced in [presnov2002non], applied to the vector fields in Eq. 4. The Presnov decomposition is indeed a global decomposition of vector fields into the sum of a gradient and a sphere preserving vector field. We now prove Theorem Theorem 1, and we specialise it to the subfamilies of vector fields we implement to define neural networks.
The vector fields are supposed to be continuously differentiable. Thus, they all admit a unique Presnov decomposition, i.e. they can be written as
for a scalar function , with , and a sphere preserving vector field . In general the two vector fields and do not commute, i.e. the Jacobi-Lie bracket is not identically zero. However, because of the Baker-Campbell-Hausdorff formula (see e.g. [geomBook, Section III.4.2]), as in splitting methods (see e.g. [mclachlan2002splitting]) we can proceed with an approximation of the form
We recall that for any tuple such that . Thus, up to choosing the s small enough, we can approximate as accurately as desired with composition of flow maps of sphere preserving and gradient vector fields. This concludes the proof.
Similar reasoning can be extended to other vector field decompositions, e.g. the Helmholtz decomposition, as long as admit such a decomposition. In Section 3, we adopt gradient vector fields whose flow maps expand and contract distances to obtain 1-Lipschitz neural networks. We now specialise Theorem 1 to the vector fields we use to model such neural networks.
We first remark that with an expansive vector field we mean a vector field such that for any , while by contractive we mean that
. To prove the corollary, we rely on a classical universal approximation theorem with non-polynomial activation functions (see e.g.[pinkus1999approximation]). For clarity, we report it here.
Theorem 2 (Universal approximation, [pinkus1999approximation])
Let be a compact set and . Assume and is not a polynomial. Then for every there is
such that , and
We now prove Corollary 1.
The proof follows the same reasoning of the one of Theorem 1. Indeed, we first decompose each of the of equation Eq. 4 via the Presnov decomposition as . Then, we approximate each of the functions thanks to Theorem 2. To ease the notation, we focus on one of the and denote it with from now on in the proof.
Let and be so that . Choose then , , and . Since is not a polynomial and it is continuously differentiable, Theorem 2 for any ensures the existence of a function
that satisfies and . We now split into a contractive and an expansive part, exploiting the two following properties of and :
is positively homogeneous, i.e. for , ,
is strongly convex.
We decompose as , where , with . Because of the positive homogeneity, can be rewritten as
Because of the strong convexity of , we have
with and . This means that the flow of is an expansive map. A similar reasoning shows that has a contractive flow map. We can now conclude as in Theorem 1 since we have shown that every in Eq. 4 can be approximated arbitrarily well as .
As for the expansive and the contractive vector fields, to define neural networks based on Corollary 1 one needs to parameterise the vector field that preserves spheres. Many possibilities are available, and we report a couple of them. The first is
where is the orthogonal projection on the space and
is the identity matrix. Another option is
where with being a strictly upper triangular matrix whose entries are modelled by , , , , . These two possibilities allow us to approximate any sphere preserving vector field arbitrarily because of classical universal approximation results, like the one mentioned in Theorem 2. We prefer, for practical reasons, the second one in the experiments reported in Appendix A. On the other hand, in the experiment reported in Section 3 we replace the part of the vector field decomposition with vector fields of the form
. This choice gives good experimental performances. The primary motivation behind this replacement is that, in those experiments, the linear transformations of the inputs are based on convolution operations, and the extension of sphere preserving vector fields to that setting brings to involved architectures.
We now summarise the results presented in the context of neural networks. Suppose that and that for . In Theorem 1 we have worked with the exact flows of the vector fields. However, most of the times these are not available and hence a numerical approximation is needed. This is exactly equivalent to applying a splitting numerical integrator (see e.g. [geomBook, Chapter 2] or [mclachlan2002splitting]) to approximate the -flow map of (and hence also of ) and get
Here we denote with a discrete approximation of the exact flow and is the neural network that approximates the target function . Because of Corollary 1 and basic theory of numerical methods for ODEs, hence can approximate arbitrarily well in the norm .
We remark that the neural network defined in Eq. 6 does not change the dimensionality of the input point, i.e. it is a map from to itself. However, usually Residual Neural Networks allow for dimensionality changes thanks to linear lifting and projection layers. One can extend all the results presented in this section to the dimensionality changes, where instead of defining the whole network as the composition of discrete flow maps, just the “dynamical blocks” are characterised in that way, as represented in Fig. 2.
Consequently, one can extend the results presented in [zhang2020approximation]. In particular, one can show that by composing flow maps of sphere-preserving and gradient vector fields, generic continuous functions can also be approximated in the sense of Eq. 5.
In Appendix A we show some numerical experiments where some unknown dynamical systems and some target functions are approximated starting from the above results. We now introduce another way to get expressivity results starting from the continuous-in-time interpretation of neural networks.
2.2 Approximation based on Hamiltonian vector fields
Augmenting the dimensionality of the space where the dynamics is defined is a typical technique for designing deep neural networks, see Fig. 2. Based on this idea, we now study the approximation properties of networks obtained by composing flow maps of Hamiltonian systems. For an introductory presentation of Hamiltonian systems, see [leimkuhler2004simulating].
We now show that for any function for which hypothesis Eq. 4 holds, one can approximate arbitrarily well, in the same function norm, by composing flow maps of Hamiltonian systems. Consequently, symplectomorphisms like those defined by SympNets ([jin2020sympnets]) also approximate arbitrarily well.
This result relies on the embedding of a vector field into a Hamiltonian vector field on . To do so, we first define the linear map , as We then introduce the function , where is the conjugate momentum of . The gradient of such a function is
This implies that the Hamiltonian ODEs associated to are
Hence, we have where is the projection on the first component . This construction, together with the fact that , implies that
One could be interested in such a lifting procedure and hence work with Hamiltonian systems because discretising their flow maps with symplectic methods might generate more stable networks or, as highlighted in [galimberti2021hamiltonian]
, can prevent vanishing and exploding gradient problems.
3 Adversarial robustness and Lipschitz neural networks
In this section, we consider the problem of classifying points of a set . More precisely, given a set defined by the disjoint union of subsets , we aim to approximate the function that assigns all the points of to their correct class, i.e. for all and all . Because of their approximation properties, one can often choose neural networks to solve classification problems, i.e. as models that approximate the labelling function . On the other hand, there is increasing evidence that trained neural networks are sensitive to well-chosen input perturbations called adversarial attacks. The first work that points this out is [Szegedy2014IntriguingPO] and, since then, numerous others (see e.g. [madry2018towards, carlini2017towards, Goodfellow2015ExplainingAH]) have introduced both new ways to perturb the inputs (attacks) and to reduce the sensitivity of the networks (defences). We first formalise the problem of adversarial robustness from the mathematical point of view and then derive a network architecture with inherent stability properties.
Let be a neural network trained so that the true labelling map is well approximated by for points . Furthermore, let us assume
for some norm defined on the ambient space. With this setup, we say the network is robust if
In order to quantify the robustness of we, first of all, consider its Lipschitz constant , i.e. the smallest scalar value such that
where is the norm. We also need a way to quantify how certain the network predictions are. A measure of this certainty level is called margin in the literature (see e.g. [anil2019sorting, bartlett2017spectrally, tsuzuku2018lipschitz]) and it is defined as
where is the th vector of the canonical basis of . Combining these two quantities, in [tsuzuku2018lipschitz] the authors show that if the norm considered for is the norm of the ambient space , then
Hence, for the points in where Eq. 7 holds, the network is robust to perturbations with a magnitude not greater than . This result can be extended to generic metrics, but, in this section, we focus on the case where is the metric of and, from now on, we keep denoting it as .
Motivated by inequality Eq. 7, we present a strategy to constrain the Lipschitz constant of Residual Neural Networks to the value of . Differently from [celledoni2021structure, zakwan2022robust, meunier2021scalable], we impose such a property on the network without relying only on layers that are 1-Lipschitz. This strategy relies on the ODE based approach that we are presenting and is motivated by the interest of getting networks that also have good expressivity capabilities. Indeed, we remark that in Section 2 we studied the approximation properties of networks similar to those we consider in this section. We conclude the section with numerical experiments for the adversarial robustness with the CIFAR-10 dataset to test the proposed network architecture.
3.1 Non–expansive dynamical blocks
Consider a scalar differentiable function that is also strongly convex, for some , i.e.
see e.g. [hiriart2013convex, Chapter 6]. This said, it follows that the dynamics defined by the ODE
is contractive, since
where and . A choice for is , where , is a strongly convex differentiable function, and . In this way, the network we generate by concatenating explicit Euler steps applied to such vector fields has layers of the type
where and .
If we discretise the ODE introduced above reproducing the non-expansive behaviour at a discrete level, as presented for example in [dahlquist1979generalized, celledoni2021structure, meunier2021scalable], we get that the numerical flow is non-expansive too. Consequently, we can obtain 1-Lipschitz neural networks composing these non-expansive discrete flow maps. A simple way to discretise Eq. 9 while preserving non-expansivity is to use explicit-Euler steps with a small enough step size. Indeed, assuming , a layer of the form
is guaranteed to be 1-Lipschitz. We remark that, as highlighted in [celledoni2021structure, meunier2021scalable], it is not necessary to require strong convexity for in order to make 1-Lipschitz. Indeed, it is enough to take convex. However, the strong convexity assumption allows us to include other layers that are not 1-Lipschitz thanks to inequality Eq. 10.
We now shortly present the reasoning behind this statement. Consider another ODE where is again a vector field on and suppose that is -Lipschitz. Then, we have that
This implies that, given as in Eq. 9, the map satisfies
so is Lipschitz continuous and will be 1-Lipschitz if . This amounts to impose for the considered vector fields and time intervals on which corresponding flow maps are active. The map can be seen as the exact flow map of the switching system having a piecewise constant (in time) autonomous dynamics. In particular, such a system coincides with for the first time interval and with for the time interval .
We could choose to be the gradient vector field of a -smooth scalar potential. In other words, we ask for its gradient to be -Lipschitz. Thus, one possible way of building a dynamical block of layers that is 1-Lipschitz is through a consistent discretisation of the switching system
where is a piecewise constant time-switching signal that, following Eq. 11, balances the expansive and contractive regimes. In Eq. 12, we are assuming that with Lipschitz and that is strongly convex. In the numerical experiment we report at the end of the section we design giving an alternation between contractive and possibly non-contractive behaviours. In the following subsection, we present two possible approaches to discretise numerically the system in Eq. 12, mentioning how this extends to more general families of vector fields.
In this subsection, we have worked to obtain dynamical blocks that are non-expansive for the Euclidean metric. In Appendix B we show a way to extend this reasoning to more general metrics defined in the input space.
3.2 Non–expansive numerical discretisation
As presented in the introductory section, it is not enough to have a continuous model that satisfies some property of interest to get it at the network level. Indeed, discretising the solutions to such an ODE must also be done while preserving such a property. One approach that always works is to restrict the step sizes to be sufficiently small so that the behaviour of the discrete solution resembles the one of the exact solution. This strategy can lead to expensive network training because of the high number of time steps. On the other hand, this strategy allows weaker weight restrictions and better performances. We remark how this translates for the dynamical system introduced in Eq. 12, with . For that ODE, the one-sided Lipschitz constant of contractive layers is ,
being the smallest eigenvalue. Thus, ifis orthogonal, we get . Under the same orthogonality assumption, the expansive layers in Eq. 12 have Lipschitz constant and this allows to specialise the non-expansivity condition Eq. 11 to . Thus, if we impose such a relationship and do sufficiently small time steps, also the numerical solutions will be non expansive.
However, frequently smarter choices of discrete dynamical systems can lead to leaner architectures and faster training procedures. We focus on the explicit Euler method for this construction; however, one can work with other numerical methods, like generic Runge–Kutta methods, as long as the conditions we derive are adjusted accordingly. We concentrate on two time steps applied to equation Eq. 12, but then the reasoning extends to every pair of composed discrete flows and to other families of vector fields. Let
The condition we want to have is that the map is 1-Lipschitz, or at least that this is true when , and satisfy some well-specified properties. We first study the Lipschitz constant of both the discrete flow maps and then upper bound the one of with their products. We take two points , define , and proceed as follows
where is because we consider . In the experiments we present at the end of the section, we assume the weight matrices to be orthogonal, hence . Multiple works support this weight constraint as a way to improve the generalisation capabilities, the robustness to adversarial attacks and the weight efficiency (see e.g. [wang2020orthogonal, trockman2021orthogonalizing]). We will detail how we get such a constraint on convolutional layers in the numerical experiments.
The orthogonality of also implies and hence we get that a Lipschitz constant of is . For the expansive flow map , we have
under the orthogonality assumption for . The same result holds also if we just have . This leads to a region in the plane where that can be characterised as follows
This is represented in Fig. 2(a) for the case , that is the one used also in the numerical experiment for adversarial robustness. Thus, we now have obtained a way to impose the 1-Lipschitz property on the network coming from the discretisation of the ODE Eq. 12. It is clear that the result presented here easily extends to different time-switching rules (i.e. a different choice of ), as long as there is the possibility of balancing expansive vector fields with contractive ones. Furthermore, to enlarge the area in the plane where non-expansivity can be obtained, one can decide to divide into sub-intervals the time intervals and . Doing smaller steps, the allowed area increases. Indeed, instead of doing two single time steps of length and , one can perform time-steps all of step-length or . Thus, replacing and into Eq. 15 it is immediate to see that and are allowed to be larger than with the case . For example, if we again fix and set , we get the area represented in Fig. 2(b). The choice of and is the one we adopt in the experiments reported in this section. We now conclude the section showing how the derived architecture allows to improve the robustness against adversarial attacks for the problem of image classification.
3.3 Numerical experiments with adversarial robustness
We now apply the reasoning presented above to the problem of classifying images of the CIFAR-10 dataset. The implementation is done with PyTorch and is available at the GitHub repository associated to the paper222https://github.com/davidemurari/StructuredNeuralNetworks
. We work with convolutional neural networks, and with the activation function. The residual layers are defined by dynamical blocks based on the discrete flow maps
We prefer this alternation regime rather than the one introduced in Eq. 13 because experimentally it gives better performances and it allows to generate networks that are easier to train. We remark that all the calculations and step restrictions presented before, apply with no changes to the alternation regime in Section 3.3333This highlights the flexibility of the ODE based approach to impose structure on the network. Indeed, this strategy just a guided procedure to impose structure, but usually it does not limit considerably the options one has available.
. Indeed, what should change is just the estimate in equationEq. 14, but the Lipschitz constant of can be bounded by as in that equation as long as . The step restriction is imposed after every training iteration, projecting back the pairs in the region represented in Fig. 2(b) if needed.
We implement an architecture that takes as inputs tensors of order three and shape. The first dimensionality of the tensor increases to feature maps throughout the network via convolutional layers. For each fixed number of filters, we have four layers of the form specified in Section 3.3. To be precise, convolutional layers replace the matrix-vector products. As a comparison, we also consider a baseline ResNet with the same layer sequence and lifting pattern. The residual layers are of the form . This ResNet has a number of parameters that is roughly the double of those on which we constrain the Lipschitz constant, i.e. based on Section 3.3.
The network architecture based on Eq. 12
gets close to 90% test accuracy when trained with cross-entropy loss and without weight constraints. However, as presented at the beginning of this section, one could consider its Lipschitz constant and its margin at any input point of interest to get robustness guarantees for a network. For this reason, we now focus on constraining the Lipschitz constant of the architecture, and we introduce the loss function we adopt to promote higher margins. As in[anil2019sorting], we train the network architecture with the multi-class hinge loss function defined as
where margin is a parameter to tune. In Fig. 4, we report how the robustness of the architecture changes when we change such a margin value. We also train the baseline ResNet with this loss function for a fair comparison. Having predictions with higher margins allows us to get more robust architectures if we fix the Lipschitz constant. Still, too high margins can lead to poor accuracy. In Fig. 4, we report the results obtained constraining all the dynamical blocks and the final projection layer. However, we do not constrain the lifting layers. In this way, we can still control the full network’s Lipschitz constant, just considering the norms of those lifting layers. On the other hand, we leave some flexibility to the network, which can train better also when we increase the hinge-loss margin. We notice that the dynamical blocks usually get a small Lipschitz constant. Thus, even when we do not constrain all the layers, the network will still be 1-Lipschitz.
To get orthogonal convolutional filters, we apply the regularisation strategy proposed in [wang2020orthogonal]. This strategy is not the only one possible. Still, for this experiment, we preferred it to more involved ones since the main focus has been on the architecture, and the obtained results are satisfactory. Various works, e.g. [leimkuhler21a, ozay2016optimization, francca2021optimization], highlight how one can directly constrain the optimisation steps without having to project the weights on the right space or to add regularisation terms. We have not experimented with these kinds of strategies, and we leave them for future study. We also work with an orthogonal initialisation for the convolutional layers. The lifting layers are modelled as for a convolutional filter with
. To constrain the norm to 1, we add a projection step after the stochastic gradient descent (SGD) method updates the weights, i.e. we normalise the weights as. Here, the norm of the convolutional filters is computed with the power method as described, for example, in [meunier2021scalable]. Furthermore, we work with the SGD having a learning rate scheduler that divides the learning rate by
after a fixed number of epochs. Finally, we generate the adversarial examples with the library “Foolbox” introduced in[rauber2017foolbox]. We focus on the PGD attack and perform ten steps of it. We test different magnitudes of the adversarial perturbations.
In Fig. 4 we see that the robustness of the constrained neural networks improves compared to the baseline ResNet, whatever the margin we choose. Furthermore, the flexibility in the Lipschitz constant induced by the unconstrained lifting layers makes the training procedure less sensitive to the changes in the parameter. This experimental evidence supports the trade-off between these two scalar values reported in equation Eq. 7.
4 Imposition of other structure
Depending on the problem and the application where a neural network is adopted, the properties that the architecture should satisfy may be very different. We have seen in the previous section a strategy to impose Lipschitz constraints on the network to get some guarantees in terms of adversarial robustness. In that context, the property is of interest because it is possible to see that even when imposing it, we can get sufficiently accurate predictions. Moreover, this strategy allows controlling the network’s sensitivity to input perturbations. As mentioned in the introduction, there are at least two other situations where structural constraints might be desirable. The first one is when one knows that the function to approximate has some particular property. The second is when we know that the data we process induces some structure on the function we want to approximate.
This section supports the claim that combining ODE models with suitable numerical integrators allows us to define structured networks. More precisely, we derive multiple architectures putting together these two elements. Some of these have already been presented in other works, and others are new. The properties that we investigate are symplecticity, volume preservation and mass preservation. For the first two, we describe how to constrain the dynamical blocks. For the third, we propose how to structure also the linear lifting and projection layers. Moreover, for this latter example, we also report some numerical experiments.
4.1 Symplectic dynamical blocks
A function is said to be symplectic if it satisfies the identity
with being the zero and the identity matrices. Symplectic maps are particularly important in classical mechanics, because the flow map of a Hamiltonian system is symplectic, see e.g. [leimkuhler2004simulating]. This fact implies that if one is interested in approximating such a flow map with a neural network, then structuring it to be symplectic might be desirable. In this direction there are a considerable number of works (see e.g. [jin2020sympnets, chen2019symplectic, zhu2020deep]). We mention in particular [jin2020sympnets] where the authors construct layers of a network to ensure the symplectic property is satisfied. On the other hand, in [chen2019symplectic] the authors consider a neural network as the Hamiltonian function of a system and approximate its flow map with a symplectic numerical integrator444We remark that a one-step numerical method is symplectic if and only if, when applied to any Hamiltonian system, it is a symplectic map.. The simplest symplectic integrator is symplectic-Euler, that applied to writes
We now focus on the gradient modules presented in [jin2020sympnets], defined alternating maps of the form
We notice that we can obtain the same map from a time-switching ODE model. We first introduce the time-dependent Hamiltonian of such a model that is