1 Introduction
Deep neural networks are now the state-of-the-art in a variety of challenging tasks, ranging from object recognition to natural language processing and graph analysis krizhevsky2012a; BattenbergCCCGL17; zilly17a; SutskeverVL14; MontiBMRSB17. With enough layers, they can, in principle, learn arbitrarily complex abstract representations through an iterative process greff2016 in which each layer transforms the output of the previous layer non-linearly, until the input pattern is embedded in a latent space where inference can be done efficiently. Until the advent of Highway srivastava2015 and Residual (ResNet) he2015b networks, training networks beyond a certain depth with gradient descent was limited by the vanishing gradient problem hochreiter1991a; bengio1994a. These very deep networks (VDNNs) have skip connections that provide shortcuts for the gradient to flow back through hundreds of layers. Unfortunately, training them still requires extensive hyperparameter tuning and, even if there were a principled way to determine the optimal number of layers or processing depth for a given task, that depth would still be fixed for all patterns.
Recently, several researchers have started to view VDNNs from a dynamical systems perspective. Haber and Ruthotto haber2017 analyzed the stability of ResNets by framing them as a forward Euler integration of an ordinary differential equation (ODE), and lu2018beyond showed how other numerical integration methods induce various existing network architectures, such as PolyNet zhang2017, FractalNet larsson2016, and RevNet gomez2017. A fundamental problem with the dynamical systems underlying these architectures is that they are autonomous: the input pattern sets the initial condition, directly affecting only the first processing stage. This means that, if the system converges, there is either exactly one fixed point or exactly one limit cycle strogatz2014. Neither case is desirable from a learning perspective, because a dynamical system should have input-dependent convergence properties so that its representations are useful for learning. One possible way to achieve this is a non-autonomous system in which, at each iteration, the system is forced by an external input.
This paper introduces a novel network architecture, called the "Non-Autonomous Input-Output Stable Network" (NAIS-Net), that is derived from a dynamical system that is both time-invariant (weights are shared) and non-autonomous (footnote 1: the DenseNet architecture lang1988; huang2017densely is non-autonomous, but time-varying). NAIS-Net is a general residual architecture in which a block (see Figure 1) is the unrolling of a time-invariant system, and non-autonomy is implemented by feeding the external input to each of the unrolled processing stages in the block through skip connections. ResNets are similar to NAIS-Net, except that ResNets are time-varying and receive the external input only at the first layer of the block.
With this design, we can derive sufficient conditions under which the network exhibits input-dependent equilibria that are globally asymptotically stable for every initial condition. More specifically, in Section 3, we prove that with tanh activations NAIS-Net has exactly one input-dependent equilibrium, while with ReLU activations it has multiple stable equilibria per input pattern. Moreover, the NAIS-Net architecture allows not only the internal stability of the system to be analyzed but, more importantly, its input-output stability: the difference between the representations generated by two different inputs belonging to a bounded set will also be bounded at each stage of the unrolling (footnote 2: in the supplementary material, we also show that these results hold for both shared and unshared weights).
In Section 4, we provide an efficient implementation that enforces the stability conditions for both fully-connected and convolutional layers in the stochastic optimization setting. These implementations are compared experimentally with ResNets on both the CIFAR-10 and CIFAR-100 datasets in Section 5, showing that NAIS-Nets achieve comparable classification accuracy with a much smaller generalization gap. NAIS-Nets can also be 10 to 20 times deeper than the original ResNet without increasing the total number of network parameters, and, by stacking several stable NAIS-Net blocks, models that implement pattern-dependent processing depth can be trained without requiring any normalization at each step (except when there is a change in layer dimensionality, to speed up training).
The next section presents a more formal treatment of the dynamical systems perspective of neural networks, and a brief overview of work to date in this area.
2 Background and Related Work
Representation learning is about finding a mapping from input patterns to encodings that disentangle the underlying variational factors of the input set. With such an encoding, a large portion of typical supervised learning tasks (e.g., classification and regression) should be solvable using just a simple model like logistic regression. A key characteristic of such a mapping is its invariance to input transformations that do not alter these factors for a given input (footnote 3: such invariance conditions can be very powerful inductive biases on their own: for example, requiring invariance to time transformations in the input leads to popular RNN architectures tallec2018a). In particular, random perturbations of the input should in general not be drastically amplified in the encoding. In the field of control theory, this property is central to stability analysis, which investigates the conditions under which dynamical systems converge to a single steady state without exhibiting chaos khalil2001; strogatz2014; sontag_book.
In machine learning, stability has long been central to the study of recurrent neural networks (RNNs) with respect to the vanishing hochreiter1991a; bengio1994a; pascanu2013a and exploding doya1992bifurcations; baldi1996universal; pascanu2013a gradient problems, leading to the development of Long Short-Term Memory hochreiter1997b to alleviate the former. More recently, general conditions for RNN stability have been presented zilly17a; kanai2017a; laurent_recurrent_2016; vorontsov2017 based on insights from matrix norm analysis. Input-output stability khalil2001 has also been analyzed for simple RNNs steil1999input; knight_stability_2008; haschke2005input; singh2016stability.
Recently, the stability of deep feed-forward networks was investigated more closely, mostly due to adversarial attacks szegedy2013intriguing on trained networks. It turns out that sensitivity to (adversarial) input perturbations in the inference process can be avoided by ensuring certain conditions on the spectral norms of the weight matrices cisse_parseval_2017; yoshida2017a. Additionally, special properties of the spectral norm of weight matrices mitigate instabilities during the training of Generative Adversarial Networks miyato2018a.
Almost all successfully trained VDNNs hochreiter1997b; he2015b; srivastava2015; cho2014learning share the following core building block:
x_{i+1} = x_i + f(x_i, θ_i)    (1)
That is, to compute a vector representation x_{i+1} at layer i+1 (or time step i+1 for recurrent networks), additively update x_i with some non-linear transformation f of x_i, which depends on parameters θ_i. The reason usually given for why Eq. (1) allows VDNNs to be trained is that the explicit identity connections avoid the vanishing gradient problem. The semantics of the forward path are, however, still considered unclear. A recent interpretation is that these feed-forward architectures implement iterative inference greff2016; jastrzebski2017. This view is reinforced by observing that Eq. (1) is a forward Euler discretization ascher1998a of the ordinary differential equation (ODE)

ẋ(t) = f(x(t), θ)

if θ_i = θ for all i in Eq. (1). This connection between dynamical systems and feed-forward architectures was also recently observed by several other authors weinan2017a. This point of view leads to a large family of new network architectures induced by various numerical integration methods lu2018beyond. Moreover, stability problems in both the forward and the backward path of VDNNs have been addressed by relying on well-known analytical approaches for continuous-time ODEs haber2017; chang2017multi. In the present paper, we instead address the problem directly in discrete time, meaning that our stability result is preserved by the network implementation. With the exception of liao_bridging_2016, none of this prior research considers time-invariant, non-autonomous systems.
Conceptually, our work shares similarities with approaches that build networks according to iterative algorithms gregor2010a; zheng2015a and with recent ideas investigating pattern-dependent processing time graves2016a; veit2017a; figurnov2017a.
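To make the Euler connection concrete, here is a minimal numerical sketch (our own illustration, not code from the cited works): unrolling the residual update x_{i+1} = x_i + h f(x_i) with shared parameters is exactly forward Euler integration of ẋ = f(x), shown here on a simple linear test system.

```python
import numpy as np

def euler_unroll(f, x0, h, steps):
    """Residual-style unroll: x_{k+1} = x_k + h * f(x_k) (forward Euler)."""
    x = x0
    for _ in range(steps):
        x = x + h * f(x)
    return x

# Test system dx/dt = -x, whose exact solution is x(t) = x(0) * exp(-t).
x0 = np.array([1.0])
approx = euler_unroll(lambda x: -x, x0, h=0.01, steps=100)  # integrate up to t = 1
exact = x0 * np.exp(-1.0)
error = abs(approx[0] - exact[0])   # discretization error, on the order of h
```

Shrinking the step h (and unrolling more layers) drives the discretization error to zero, which is the sense in which these architectures approximate a continuous-time ODE.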
3 Non-Autonomous Input-Output Stable Nets (NAIS-Nets)
This section provides stability conditions for both fully-connected and convolutional NAIS-Net layers. We formally prove that NAIS-Net provides a non-trivial input-dependent output for each iteration, as well as in the asymptotic case (k → ∞). The following dynamical system:

x(k+1) = x(k) + h f(x(k), u, θ),  x(0) = x_0    (2)

is used throughout the paper, where x(k) ∈ ℝ^n is the latent state, u ∈ ℝ^m is the network input, and h > 0. For ease of notation, in the remainder of the paper the explicit dependence on the parameters, θ, will be omitted.
Fully Connected NAISNet Layer.
Our fully connected layer is defined by
x(k+1) = x(k) + h σ(A x(k) + B u + b)    (3)
where A ∈ ℝ^{n×n} and B ∈ ℝ^{n×m} are the state and input transfer matrices, and b ∈ ℝ^n is a bias. The activation σ(·) is a vector of (element-wise) instances of an activation function σ_i : ℝ → ℝ, i ∈ {1, …, n}. In this paper, we only consider the hyperbolic tangent (tanh) and Rectified Linear Unit (ReLU) activation functions. Note that by setting B = 0 and the step h = 1, the original ResNet formulation is obtained.

Convolutional NAIS-Net Layer.
The architecture can be easily extended to convolutional networks by replacing the matrix multiplications in Eq. (3) with a convolution operator:

x(k+1) = x(k) + h σ(C ∗ x(k) + D ∗ u + e)    (4)
Consider the case of n_c channels. The convolutional layer in Eq. (4) can be rewritten, for each latent map X^{(i)}, in the equivalent form:

X^{(i)}(k+1) = X^{(i)}(k) + h σ( Σ_j C^{(i,j)} ∗ X^{(j)}(k) + Σ_j D^{(i,j)} ∗ U^{(j)} + E^{(i)} )    (5)
where X^{(j)}(k) is the layer state matrix for channel j, U^{(j)} is the layer input data matrix for channel j (to which an appropriate zero padding has been applied) at layer k, C^{(i,j)} is the state convolution filter from state channel j to state channel i, D^{(i,j)} is its equivalent for the input, and E^{(i)} is a bias. The activation, σ, is still applied element-wise. The convolution for the state has a fixed stride s = 1, a filter size n_f, and a zero padding of p, chosen such that the state dimension is preserved across the unroll (footnote 4: if the dimensions differ, the state can be extended with an appropriate number of constant zeros (not connected)). Convolutional layers can be rewritten in the same form as fully connected layers (see the proof of Lemma 1 in the supplementary material). Therefore, the stability results in the next section will be formulated for the fully connected case, but apply to both.
Stability Analysis.
Here we lay out the stability conditions for NAIS-Nets, which were instrumental to their design. We are interested in using a cascade of unrolled NAIS-Net blocks (see Figure 1), where each block is described by either Eq. (3) or Eq. (4). Since we are dealing with a cascade of dynamical systems, stability of the entire network can be enforced by having stable blocks khalil2001.
The state-transfer Jacobian for layer k is defined as:

J(x(k), u) = ∂x(k+1)/∂x(k) = I + h (∂σ(Δ(k))/∂Δ(k)) A    (6)

where the argument of the activation function, A x(k) + B u + b, is denoted by Δ(k). Take an arbitrarily small scalar σ̄ > 0 and define the set of pairs (x, u) for which the activations are not saturated as:

P = { (x, u) : ∂σ(Δ_i)/∂Δ_i ≥ σ̄, ∀ i ∈ {1, …, n} }    (7)
Theorem 1 below proves that the non-autonomous residual network produces a bounded output given a bounded, possibly noisy, input, and that the network state converges to a constant value as the number of layers tends to infinity, if the following stability condition holds:
Condition 1.
For any (x, u) ∈ P, the Jacobian satisfies:

ρ(J(x, u)) < 1    (8)

where ρ(·) is the spectral radius.
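Condition 1 is easy to check numerically. The sketch below (an illustration of ours, with tanh activations and arbitrary dimensions) builds the Jacobian of Eq. (6) and verifies that its spectral radius is below one for a symmetric negative-definite A:

```python
import numpy as np

def state_jacobian(x, u, A, B, b, h):
    """Eq. (6) for sigma = tanh: J = I + h * diag(1 - tanh(Delta)^2) @ A,
    where Delta = A x + B u + b is the activation argument."""
    delta = A @ x + B @ u + b
    return np.eye(len(x)) + h * np.diag(1.0 - np.tanh(delta) ** 2) @ A

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

# Illustrative check with a symmetric negative-definite A (cf. Eq. (13)).
rng = np.random.default_rng(1)
n, m, h = 4, 3, 1.0
R = 0.3 * rng.normal(size=(n, n))
A = -(R.T @ R) - 0.1 * np.eye(n)
A = A / max(1.0, np.max(np.abs(np.linalg.eigvalsh(A))))  # keep eigenvalues of A in [-1, 0)
B = 0.1 * rng.normal(size=(n, m))
x, u, b = np.zeros(n), rng.normal(size=m), np.zeros(n)

rho = spectral_radius(state_jacobian(x, u, A, B, b, h))
assert rho < 1.0   # Condition 1 holds at this (x, u)
```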
The steady states, x̄, are determined by a continuous function of u. This means that a small change in u cannot result in a very different x̄. For tanh activations, x̄ depends linearly on u; therefore the block needs to be unrolled for a finite number of iterations, K, for the mapping to be non-linear. That is not the case for ReLU, which can be unrolled indefinitely and still provide a piecewise affine mapping.
In Theorem 1, the Input-Output (IO) gain function, γ(·), describes the effect of norm-bounded input perturbations on the network trajectory. This gain provides insight into how robustly invariant the classification regions are to changes in the input data with respect to the training set. In particular, as the gain is decreased, the perturbed solution will be closer to the solution obtained from the training set. This can lead to increased robustness and generalization with respect to a network that does not satisfy Condition 1. Note that the IO gain, γ(·), is linear, and hence the block IO map is Lipschitz even for an infinite unroll length. The IO gain depends directly on the norm of the state-transfer Jacobian in Eq. (8), as indicated in Theorem 1 (footnote 5: see the supplementary material for additional details and all proofs, where the untied case is also covered).
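The practical meaning of a bounded IO gain can be illustrated with a small numerical experiment (our own sketch, with arbitrary dimensions): trajectories driven by an input u and by a slightly perturbed input u + w remain close at every unroll step.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, h = 4, 3, 0.5
R = 0.2 * rng.normal(size=(n, n))
A = -(R.T @ R) - 0.2 * np.eye(n)   # symmetric negative definite (cf. Eq. (13))
B = 0.2 * rng.normal(size=(n, m))

def trajectory(u, steps=50):
    """Unroll the tanh block and record the state at every step."""
    x, out = np.zeros(n), []
    for _ in range(steps):
        x = x + h * np.tanh(A @ x + B @ u)
        out.append(x.copy())
    return np.array(out)

u = rng.normal(size=m)
w = 1e-3 * rng.normal(size=m)            # small bounded input perturbation
gap = np.linalg.norm(trajectory(u) - trajectory(u + w), axis=1)
assert gap.max() < 1e-1                   # the perturbation is not amplified
```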
Theorem 1.
(Asymptotic stability for shared weights)
If Condition 1 holds, then NAIS-Net with ReLU or tanh activations is Asymptotically Stable with respect to input-dependent equilibrium points. More formally:

x(k) → x̄(u) as k → ∞, for every x(0) ∈ ℝ^n    (9)

The trajectory is described by ‖x(k) − x̄(u)‖ ≤ ‖J‖^k ‖x(0) − x̄(u)‖, where ‖·‖ is a suitable matrix norm.
In particular:
- With tanh activations, the steady state x̄ is independent of the initial state, and it is a linear function of the input, namely, x̄(u) = −A^{−1}(B u + b). The network is Globally Asymptotically Stable.
- With ReLU activations, x̄ is given by a continuous piecewise affine function of x(0) and u. The network is Locally Asymptotically Stable with respect to each x̄(u).
- If the activation is tanh, then the network is Globally Input-Output (robustly) Stable for any additive input perturbation w. The trajectory is described by:

‖x(k) − x̄(u)‖ ≤ ‖J‖^k ‖x(0) − x̄(u)‖ + γ(‖w‖)    (10)

where γ(·) is the input-output gain. For any w̄ > 0, if ‖w‖ ≤ w̄, then the following set is robustly positively invariant (RPI):

{ x : ‖x − x̄(u)‖ ≤ γ(w̄) }    (11)
- If the activation is ReLU, then the network is Globally Input-Output practically Stable. In other words, we have:

‖x(k) − x̄(u)‖ ≤ ‖J‖^k ‖x(0) − x̄(u)‖ + γ(‖w‖) + ζ    (12)

The constant ζ > 0 is the norm-ball radius for x(k) − x̄(u).
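The tanh case of Theorem 1 can be verified numerically. The sketch below (illustrative parameter values of ours) unrolls a stable block from two different initial states and checks convergence to the input-dependent equilibrium x̄(u) = −A⁻¹(Bu + b):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, h = 4, 3, 0.5
R = 0.2 * rng.normal(size=(n, n))
A = -(R.T @ R) - 0.2 * np.eye(n)   # symmetric negative definite, as in Eq. (13)
B = 0.2 * rng.normal(size=(n, m))
b = np.zeros(n)

def unroll(x, u, steps=2000):
    """Iterate the tanh block x(k+1) = x(k) + h * tanh(A x(k) + B u + b)."""
    for _ in range(steps):
        x = x + h * np.tanh(A @ x + B @ u + b)
    return x

u1, u2 = rng.normal(size=m), rng.normal(size=m)
x_bar1 = -np.linalg.solve(A, B @ u1 + b)   # closed-form equilibrium for tanh

# Convergence from two different initial states to the same equilibrium:
assert np.allclose(unroll(np.zeros(n), u1), x_bar1, atol=1e-4)
assert np.allclose(unroll(rng.normal(size=n), u1), x_bar1, atol=1e-4)
# Different inputs yield different equilibria (input-dependent steady state):
assert not np.allclose(x_bar1, -np.linalg.solve(A, B @ u2 + b))
```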
4 Implementation
In general, an optimization problem with a spectral radius constraint as in Eq. (8) is hard kanai2017a. One possible approach is to relax the constraint to a singular value constraint kanai2017a, which is applicable to both fully connected and convolutional layer types yoshida2017a. However, this approach is only applicable if the identity matrix in the Jacobian (Eq. (6)) is scaled by a factor kanai2017a. In this work we instead fulfill the spectral radius constraint directly.

The basic intuition behind the presented algorithms is the fact that, for a simple Jacobian of the form J = I + h A, Condition 1 is fulfilled if A has eigenvalues with real part in (−2/h, 0) and imaginary part in the unit circle. In the supplementary material we prove that the following algorithms fulfill Condition 1 following this intuition. Note that, in what follows, the presented procedures are to be performed for each block of the network.

Fully-connected blocks.
In the fully connected case, we restrict the matrix A to be symmetric and negative definite by choosing the following parameterization:

A = −RᵀR − εI    (13)

where R ∈ ℝ^{n×n} is trained and ε > 0 is a hyperparameter. Then, we propose a bound on the Frobenius norm, ‖RᵀR‖_F. Algorithm 1, performed during training, implements the following (footnote 6: a more relaxed condition, given in the supplementary material, is sufficient for Theorem 1 to hold locally):
Theorem 2.
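Since the full statement of Theorem 2 and the listing of Algorithm 1 are given in the paper and its supplementary material, the following is only a hypothetical sketch of what such a reprojection step could look like: after each gradient update, rescale R so that ‖RᵀR‖_F stays below a chosen bound, which keeps A = −RᵀR − εI negative definite. The bound value `rho` here is an assumption of ours, not the constant from Theorem 2.

```python
import numpy as np

def project_fc(R, rho):
    """Hypothetical reprojection sketch (the exact bound used by Algorithm 1
    is in the paper): rescale R so that ||R^T R||_F <= rho."""
    norm = np.linalg.norm(R.T @ R, "fro")
    if norm > rho:
        R = R * np.sqrt(rho / norm)   # scales R^T R by exactly rho / norm
    return R

rng = np.random.default_rng(3)
eps = 0.01
R = project_fc(rng.normal(size=(8, 8)), rho=1.0)
A = -(R.T @ R) - eps * np.eye(8)          # Eq. (13): A = -R^T R - eps * I
assert np.linalg.norm(R.T @ R, "fro") <= 1.0 + 1e-9
assert np.all(np.linalg.eigvalsh(A) < 0)  # A is negative definite by construction
```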
Convolutional blocks.
The symmetric parameterization assumed in the fully connected case cannot be used for a convolutional layer. We will instead make use of the following result:
Lemma 1.
The convolutional layer in Eq. (4), with zero padding p and filter size n_f, has a Jacobian of the form of Eq. (6). The diagonal elements of the corresponding matrix A are the central elements of the convolutional filters mapping each state channel X^{(i)}(k) into X^{(i)}(k+1). The other elements in row i are the remaining filter values mapping to X^{(i)}(k+1).
To fulfill the stability condition, the first step is to set the central filter elements through a trainable parameter whose admissible range is determined by a hyperparameter. Then we suitably bound the norm of the Jacobian by constraining the remaining filter elements. The steps are summarized in Algorithm 2, which is inspired by the Gershgorin circle theorem Horn:2012:MA:2422911. The following result is obtained:
Theorem 3.
Note that the algorithm's complexity scales with the number of filters. A simple design choice for the layer is to fix the central filter elements to a constant value (footnote 7: this removes the need for the hyperparameter, but does not necessarily reduce conservativeness, as it further constrains the remaining elements of the filter bank; this is discussed further in the supplementary material).
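To convey the Gershgorin intuition, here is a hypothetical sketch of ours (Algorithm 2 itself is in the paper): with the central filter taps fixed to a value delta, a sufficient check is that, for every output channel, the Gershgorin disc centred at the diagonal Jacobian entry stays strictly inside the unit circle.

```python
import numpy as np

def gershgorin_ok(filters, delta, h):
    """Hypothetical Gershgorin-style check (not the paper's Algorithm 2):
    with central taps fixed to delta, require for every output channel i
        |1 + h * delta| + h * (sum of remaining absolute filter values) < 1,
    so every Gershgorin disc of the Jacobian row lies inside the unit circle.
    filters: array of shape (n_out, n_in, k, k)."""
    n_out, _, k, _ = filters.shape
    c = k // 2
    for i in range(n_out):
        off_sum = np.abs(filters[i]).sum() - abs(filters[i, i, c, c])
        if abs(1.0 + h * delta) + h * off_sum >= 1.0:
            return False
    return True

rng = np.random.default_rng(4)
F = 0.001 * rng.normal(size=(4, 4, 3, 3))   # small off-centre taps
for i in range(4):
    F[i, i, 1, 1] = -0.5                    # central tap set to delta
assert gershgorin_ok(F, delta=-0.5, h=1.0)
```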
5 Experiments
Experiments were conducted comparing NAIS-Net with ResNet, and variants thereof, using both fully-connected (MNIST, Section 5.1) and convolutional (CIFAR-10/100, Section 5.2) architectures, to quantitatively assess the performance advantage of having a VDNN where stability is enforced.
[Figure: Single neuron trajectory and convergence. (Left) Average cross-entropy loss of NAIS-Net with different residual architectures over the unroll length (loss vs. processing depth). Note that both ResNet-SH-Stable and NAIS-Net satisfy the stability conditions for convergence, but only NAIS-Net is able to learn, showing the importance of non-autonomy. (Right) Activation of a single NAIS-Net neuron for input samples from each class on MNIST. Trajectories differ not only in their actual steady state but also in their convergence time.]

5.1 Preliminary Analysis on MNIST
For the MNIST dataset lecun1998mnist, a single-block NAIS-Net was compared with nine different 30-layer ResNet variants, each with a different combination of the following features: SH (shared weights, i.e. time-invariant), NA (non-autonomous, i.e. input skip connections), BN (with Batch Normalization), and Stable (stability enforced by Algorithm 1). For example, ResNet-SH-NA-BN refers to a 30-layer ResNet that is time-invariant because weights are shared across all layers (SH), non-autonomous because it has skip connections from the input to all layers (NA), and uses batch normalization (BN). Since NAIS-Net is time-invariant, non-autonomous, and input-output stable (i.e. SH-NA-Stable), the chosen ResNet variants represent ablations of these three features. For instance, ResNet-SH-NA is a NAIS-Net without I/O stability being enforced by the reprojection step described in Algorithm 1, and ResNet-NA is a non-stable NAIS-Net that is time-variant, i.e. with non-shared weights, and so on. The NAIS-Net was unrolled for 30 iterations for all input patterns. All networks were trained using stochastic gradient descent with momentum for 150 epochs.
Results.
Test accuracy for NAIS-Net was the highest, with ResNet-SH-BN second best; without BatchNorm (ResNet-SH), accuracy was substantially lower (averaged over 10 runs).
After training, the behavior of each network variant was analyzed by passing the activation at each iteration through the softmax classifier and measuring the cross-entropy loss. The loss at each iteration describes the trajectory of each sample in the latent space: the closer the sample is to the correct steady state, the closer the loss is to zero (see Figure 3). All variants initially refine their predictions at each iteration, since the loss tends to decrease at each layer, but at different rates. However, NAIS-Net is the only one that does so monotonically, without the loss increasing again in the later iterations. Figure 3 shows how neuron activations in NAIS-Net converge to different steady-state activations for different input patterns, instead of all converging to zero as is the case with ResNet-SH-Stable, confirming the results of haber2017. Importantly, NAIS-Net is able to learn even with the stability constraint, showing that non-autonomy is key to obtaining representations that are stable and good for learning the task. NAIS-Net also allows training of unbounded processing depth without any feature normalization steps. Note that BN actually speeds up loss convergence, especially for ResNet-SH-NA-BN (i.e. the unstable NAIS-Net). Adding BN makes the behavior very similar to NAIS-Net, because BN also implicitly normalizes the Jacobian, but it does not ensure that its eigenvalues are in the stability region.
5.2 Image Classification on CIFAR-10/100
Experiments on image classification were performed on the standard image recognition benchmarks CIFAR-10 and CIFAR-100 krizhevsky2009cifar. These benchmarks are simple enough to allow for multiple runs to test for statistical significance, yet sufficiently complex to require convolutional layers.
Setup.
The following standard architecture was used to compare NAIS-Net with ResNet (footnote 8: https://github.com/tensorflow/models/tree/master/official/resnet): three sets of stacked residual blocks with increasing numbers of filters. NAIS-Net was tested in two versions: NAIS-Net1, where each block is unrolled just once, for a total processing depth of 108, and NAIS-Net10, where each block is unrolled 10 times, for a total processing depth of 540. The initial learning rate was decreased at scheduled epochs, and the experiments were run for 450 epochs. Note that each block in the ResNet of he2015a has two convolutions (plus BatchNorm and ReLU), whereas NAIS-Net unrolls with a single convolution. Therefore, to make the comparison of the two architectures as fair as possible by using the same number of parameters, a single convolution was also used for ResNet.
Results.
Table 4 compares the performance on the two datasets, averaged over 5 runs. For CIFAR-10, NAIS-Net and ResNet performed similarly, and unrolling NAIS-Net for more than one iteration had little effect. This was not the case for CIFAR-100, where NAIS-Net10 improves over NAIS-Net1. Moreover, although the mean accuracy is slightly lower than that of ResNet, the variance is considerably lower. Figure 4 shows that NAIS-Net is less prone to overfitting than a classic ResNet, reducing the generalization gap by 33%. This is a consequence of the stability constraint, which imparts a degree of robust invariance to input perturbations (see Section 3). It is also important to note that NAIS-Net can unroll up to 540 layers and still train without any problems.

5.3 Pattern-Dependent Processing Depth
For simplicity, the number of unrolling steps per block in the previous experiments was fixed. A more general and potentially more powerful setup is to let the processing depth adapt automatically. Since NAIS-Net blocks are guaranteed to converge to a pattern-dependent steady state after an indeterminate number of iterations, processing depth can be controlled dynamically by terminating the unrolling process whenever the distance between the representation at a layer and that of the immediately previous layer drops below a specified threshold. With this mechanism, NAIS-Net can determine the processing depth for each input pattern. Intuitively, one could speculate that similar input patterns would require similar processing depth in order to be mapped to the same region in latent space. To explore this hypothesis, NAIS-Net was trained on CIFAR-10 with a fixed unrolling threshold. At test time, the network was unrolled using the same threshold.
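A minimal version of this early-termination loop (our own sketch; the threshold value and block parameters are illustrative, not those used in the experiments) is:

```python
import numpy as np

def unroll_until_converged(x0, u, step_fn, tol=1e-4, max_steps=500):
    """Adaptive processing depth: keep unrolling the block until the latent
    state stops changing, i.e. until ||x(k+1) - x(k)|| < tol."""
    x = x0
    for k in range(1, max_steps + 1):
        x_next = step_fn(x, u)
        if np.linalg.norm(x_next - x) < tol:
            return x_next, k          # depth k depends on the input pattern u
        x = x_next
    return x, max_steps

# Illustrative stable tanh block (cf. Eq. (3) with A from Eq. (13)).
rng = np.random.default_rng(5)
n, m, h = 4, 3, 0.5
R = 0.2 * rng.normal(size=(n, n))
A = -(R.T @ R) - 0.2 * np.eye(n)
B = 0.2 * rng.normal(size=(n, m))
step = lambda x, u: x + h * np.tanh(A @ x + B @ u)

_, d1 = unroll_until_converged(np.zeros(n), rng.normal(size=m), step)
_, d2 = unroll_until_converged(np.zeros(n), 5.0 * rng.normal(size=m), step)
```

Both runs terminate before the step cap, and the resulting depths d1 and d2 are in general different, since each input drives the state to a different equilibrium along a different trajectory.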
Figure 10 shows selected images from four different classes, organized according to the final network depth used to classify them after training. The qualitative differences seen from low to high depth suggest that NAIS-Net uses processing depth as an additional degree of freedom, so that, for a given training run, the network learns to use models of different complexity (depth) for different types of inputs within each class. To be clear, the hypothesis is not that depth correlates with some notion of input complexity, whereby the same images would always be classified at the same depth across runs.
6 Conclusions
We presented NAIS-Net, a non-autonomous residual architecture that can be unrolled until the latent space representation converges to a stable input-dependent state. This is achieved thanks to its stability and non-autonomy properties. We derived stability conditions for the model and proposed two efficient reprojection algorithms, one for fully-connected and one for convolutional layers, that enforce the network parameters to stay within the set of feasible solutions during training.
NAIS-Net achieves asymptotic stability and, as a consequence, input-output stability. Stability makes the model more robust, and we observe a considerable reduction of the generalization gap without negatively impacting performance. The question of scalability to benchmarks such as ImageNet deng2009a will be a main topic of future work. We believe that cross-breeding machine learning and control theory will open up many new and interesting avenues for research, and that more robust and stable variants of commonly used neural networks, both feed-forward and recurrent, will be possible.
Acknowledgements
We want to thank Wojciech Jaśkowski, Rupesh Srivastava and the anonymous reviewers for their comments on the idea and initial drafts of the paper.
References
 (1) U. M. Ascher and L. R. Petzold. Computer methods for ordinary differential equations and differential-algebraic equations, volume 61. SIAM, 1998.
 (2) P. Baldi and K. Hornik. Universal approximation and learning of trajectories using oscillators. In Advances in Neural Information Processing Systems, pages 451–457, 1996.
 (3) E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, S. Satheesh, D. Seetapun, A. Sriram, and Z. Zhu. Exploring neural transducers for end-to-end speech recognition. CoRR, abs/1707.07413, 2017.
 (4) Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
 (5) B. Chang, L. Meng, E. Haber, F. Tung, and D. Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.
 (6) K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
 (7) M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 854–863, Sydney, Australia, 06–11 Aug 2017. PMLR.
 (8) J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
 (9) K. Doya. Bifurcations in the learning of recurrent neural networks. In Proceedings of the 1992 IEEE International Symposium on Circuits and Systems (ISCAS), volume 6, pages 2777–2780. IEEE, 1992.
 (10) D. Duvenaud, O. Rippel, R. Adams, and Z. Ghahramani. Avoiding pathologies in very deep networks. In Artificial Intelligence and Statistics, pages 202–210, 2014.
 (11) M. Figurnov, A. Sobolev, and D. Vetrov. Probabilistic adaptive computation time. CoRR, abs/1712.00386, 2017.
 (12) M. Gallieri. LASSO-MPC – Predictive Control with Regularised Least Squares. Springer-Verlag, 2016.
 (13) A. Gomez, M. Ren, R. Urtasun, and R. B. Grosse. The reversible residual network: Backpropagation without storing activations. In NIPS, 2017.
 (14) A. Graves. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983, 2016.
 (15) K. Greff, R. K. Srivastava, and J. Schmidhuber. Highway and residual networks learn unrolled iterative estimation. arXiv preprint arXiv:1612.07771, 2016.
 (16) K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In International Conference on Machine Learning (ICML), 2010.
 (17) E. Haber and L. Ruthotto. Stable architectures for deep neural networks. arXiv preprint arXiv:1705.03341, 2017.
 (18) R. Haschke and J. J. Steil. Input space bifurcation manifolds of recurrent neural networks. Neurocomputing, 64:25–38, 2005.
 (19) K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852, 2015.
 (20) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 (21) S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, 1991. Advisor: J. Schmidhuber.
 (22) S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 (23) R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, New York, NY, USA, 2nd edition, 2012.
 (24) G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 (25) S. Jastrzebski, D. Arpit, N. Ballas, V. Verma, T. Che, and Y. Bengio. Residual connections encourage iterative inference. arXiv preprint arXiv:1710.04773, 2017.
 (26) S. Kanai, Y. Fujiwara, and S. Iwamura. Preventing gradient explosions in gated recurrent units. In Advances in Neural Information Processing Systems 30, pages 435–444. Curran Associates, Inc., 2017.
 (27) H. K. Khalil. Nonlinear Systems. Pearson Education, 3rd edition, 2014.
 (28) J. N. Knight. Stability analysis of recurrent neural networks with applications. Colorado State University, 2008.
 (29) A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
 (30) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.
 (31) K. J. Lang and M. J. Witbrock. Learning to tell two spirals apart. In Proceedings of the Connectionist Models Summer School, pages 52–59, Mountain View, CA, 1988.
 (32) G. Larsson, M. Maire, and G. Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.
 (33) T. Laurent and J. von Brecht. A recurrent neural network without chaos. arXiv preprint arXiv:1612.06212, 2016.
 (34) Y. LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 (35) Q. Liao and T. Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640, 2016.
 (36) Y. Lu, A. Zhong, D. Bin, and Q. Li. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations, 2018.
 (37) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. International Conference on Learning Representations, 2018.
 (38) F. Monti, D. Boscaini, J. Masci, E. Rodolà, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR, 2017.
 (39) R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318, 2013.
 (40) J. Singh and N. Barabanov. Stability of discrete time recurrent neural networks and nonlinear optimization problems. Neural Networks, 74:58–72, 2016.
 (41) E. Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems. SpringerVerlag, 2nd edition, 1998.
 (42) R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, May 2015.
 (43) J. J. Steil. Input Output Stability of Recurrent Neural Networks. Cuvillier, Göttingen, 1999.
 (44) S. H. Strogatz. Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering. Westview Press, 2nd edition, 2015.
 (45) I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014.
 (46) C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 (47) C. Tallec and Y. Ollivier. Can recurrent neural networks warp time? International Conference on Learning Representations, 2018.
 (48) A. Veit and S. Belongie. Convolutional networks with adaptive computation graphs. CoRR, 2017.
 (49) E. Vorontsov, C. Trabelsi, S. Kadoury, and C. Pal. On orthogonality and learning recurrent networks with long term dependencies. arXiv preprint arXiv:1702.00071, 2017.
 (50) E. Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1–11, 2017.
 (51) Pei Yuan Wu. Products of positive semidefinite matrices. Linear Algebra and Its Applications, 1988.
 (52) Y. Yoshida and T. Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.
 (53) X. Zhang, Z. Li, C. C. Loy, and D. Lin. PolyNet: A pursuit of structural diversity in very deep networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3900–3908. IEEE, 2017.
 (54) S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529–1537, 2015.
 (55) J. G. Zilly, R. K. Srivastava, J. Koutník, and J. Schmidhuber. Recurrent highway networks. In ICML, pages 4189–4198. PMLR, 2017.
Appendix A Basic Definitions for the Tied Weight Case
Recall, from the main paper, that the stability of a NAIS-Net block with fully connected or convolutional architecture can be analyzed by means of the following vectorised representation:
(14) $x(k+1) = x(k) + h\,\sigma\big(A x(k) + B u + b\big),$
where $k$ is the unroll index for the considered block. Since the blocks are cascaded, stability of each block implies stability of the full network. Hence, this supplementary material focuses on theoretical results for a single block.
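As a concrete illustration, the block unroll of Eq. (14) can be sketched in a few lines of NumPy. The dimensions, step size, $\tanh$ activation, and the negative-definite reparametrisation $A = -R^\top R - \epsilon I$ below are illustrative choices for this sketch, not a trained configuration:

```python
import numpy as np

def naisnet_block(x, u, A, B, b, h=0.1, n_unroll=30):
    """Unroll one NAIS-Net block: x(k+1) = x(k) + h * sigma(A x(k) + B u + b).

    The input u re-enters at every step (non-autonomous dynamics), unlike a
    plain ResNet, where the input only sets the initial condition.
    """
    for _ in range(n_unroll):
        x = x + h * np.tanh(A @ x + B @ u + b)
    return x

rng = np.random.default_rng(0)
n, m = 4, 3
# Stable state matrix in the spirit of the paper's reparametrisation:
# A = -R^T R - eps*I is negative definite, so the unroll contracts.
R = 0.3 * rng.standard_normal((n, n))
A = -R.T @ R - 0.1 * np.eye(n)
B = rng.standard_normal((n, m))
b = np.zeros(n)

# Two different inputs drive the same block toward different states.
x_final = naisnet_block(np.zeros(n), rng.standard_normal(m), A, B, b)
x_other = naisnet_block(np.zeros(n), rng.standard_normal(m), A, B, b)
```

Because the input is re-injected at every step, different inputs pull the unroll toward different fixed points, giving the input-dependent convergence that motivates the non-autonomous design.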
A.1 Relevant Sets and Operators
A.1.1 Notation
Denote the slope of the activation function vector, $\sigma$, as the diagonal matrix, $\sigma'(\cdot)$, with entries:
(15) $\sigma'(z)_{ii} = \dfrac{\partial \sigma_i(z)}{\partial z_i}.$
The following definitions will be used to obtain the stability results:
(16) 
In particular, the set is such that the activation function is not saturated, as its derivative has a nonzero lower bound.
A.1.2 Linear Algebra Elements
The notation $\|\cdot\|$ is used to denote a suitable matrix norm, characterized specifically on a case-by-case basis. The same norm is used consistently throughout definitions, assumptions and proofs.
We will often use the following:
Lemma 2.
(Eigenvalue shift)
Consider two matrices $A$ and $\bar{A} = A + s I$, with $s$ being a complex scalar. If $\lambda$ is an eigenvalue of $A$, then $\lambda + s$ is an eigenvalue of $\bar{A}$.
Proof.
If $v \neq 0$ satisfies $A v = \lambda v$, then $\bar{A} v = A v + s v = (\lambda + s) v$, so $\lambda + s$ is an eigenvalue of $\bar{A}$ with the same eigenvector. ∎
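The lemma is easy to sanity-check numerically; the random matrix and the (real) shift below are arbitrary choices for this check:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
s = 0.7  # real shift; the lemma also holds for complex s

# Sorting by (real, imaginary) part lets us compare spectra elementwise:
# a real shift moves all real parts equally, so the ordering is preserved.
eig_A = np.sort_complex(np.linalg.eigvals(A))
eig_shifted = np.sort_complex(np.linalg.eigvals(A + s * np.eye(5)))

# Every eigenvalue of A + s*I equals an eigenvalue of A shifted by s.
shift_error = np.max(np.abs(eig_shifted - (eig_A + s)))
```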
Throughout the material, the notation $A_{(i)}$ is used to denote the $i$-th row of a matrix $A$.
A.1.3 Non-autonomous Behaviour Set
The following set will be considered throughout the paper:
Definition A.1.
(Non-autonomous behaviour set) The set is referred to as the set of fully non-autonomous behaviour in the extended state-input space, and its set-projection onto the state space, namely,
(18) 
is the set of fully non-autonomous behaviour in the state space. This is the only set in which every output dimension of the ResNet with input skip connection can be directly influenced by the input, given a nonzero matrix $B$.⁹
⁹ The concept of controllability is not introduced here. In the case of deep networks we just need $B$ to be nonzero to provide input skip connections. For the general case of time-series identification and control, please refer to the definitions in [41].
Note that, for a $\tanh$ activation, the whole state space satisfies this property, since the derivative of $\tanh$ is strictly positive everywhere. For a ReLU activation, on the other hand, for each layer we have:
(19) 
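To make the slope matrix of Eq. (15) and the $\tanh$/ReLU contrast concrete, here is a minimal sketch (the pre-activation vector `z` is an arbitrary example):

```python
import numpy as np

def slope_matrix_tanh(z):
    """Diagonal slope matrix for tanh: entries 1 - tanh(z)^2 lie in (0, 1]."""
    return np.diag(1.0 - np.tanh(z) ** 2)

def slope_matrix_relu(z):
    """Diagonal slope matrix for ReLU: entries are 1 where the unit is active
    (z > 0) and 0 where it is inactive, i.e. fully saturated."""
    return np.diag((z > 0).astype(float))

z = np.array([-2.0, -0.1, 0.5, 3.0])  # example pre-activations
D_tanh = slope_matrix_tanh(z)
D_relu = slope_matrix_relu(z)
```

For $\tanh$ every diagonal entry is strictly positive, whereas for ReLU the sign pattern of the pre-activations determines which dimensions can respond to the input at all.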
A.2 Stability Definitions for Tied Weights
This section provides a summary of definitions borrowed from control theory that are used to describe and derive our main result. The following definitions have been adapted from [12] and refer to the general dynamical system:
(20) $x(k+1) = f\big(x(k), u\big),$ with state $x$ and input $u$.
Since we are dealing with a cascade of dynamical systems (see Figure 1 in the main paper), stability of the entire network can be enforced by having stable blocks [27]. In the remainder of this material, we will therefore address a single unroll. We will cover both the tied- and untied-weight cases, starting from the latter as it is the most general.
A.2.1 Describing Functions
The following functions are instrumental to describe the desired behaviour of the network output at each layer or time step.
Definition A.2.
($\mathcal{K}$-function) A continuous function $\alpha : \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ is said to be a $\mathcal{K}$-function ($\alpha \in \mathcal{K}$) if it is strictly increasing, with $\alpha(0) = 0$.
Definition A.3.
($\mathcal{K}_\infty$-function) A continuous function $\alpha : \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ is said to be a $\mathcal{K}_\infty$-function ($\alpha \in \mathcal{K}_\infty$) if it is a $\mathcal{K}$-function and if it is radially unbounded, that is, $\alpha(r) \to \infty$ as $r \to \infty$.
Definition A.4.
($\mathcal{KL}$-function) A continuous function $\beta : \mathbb{R}_{\geq 0} \times \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$ is said to be a $\mathcal{KL}$-function ($\beta \in \mathcal{KL}$) if it is a $\mathcal{K}$-function in its first argument, it is positive definite and non-increasing in its second argument, and if $\beta(r, s) \to 0$ as $s \to \infty$.
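Standard textbook examples of these comparison functions can be checked numerically on a grid; the specific choices of $\alpha$ and $\beta$ below are illustrative:

```python
import numpy as np

# alpha1: K-function that is bounded, hence *not* a K-infinity function.
# alpha2: K-infinity function (strictly increasing, zero at zero, unbounded).
# beta:   KL-function (K in its first argument, decays to 0 in its second).
alpha1 = lambda r: np.tanh(r)
alpha2 = lambda r: r ** 2
beta = lambda r, s: alpha2(r) * np.exp(-s)

r = np.linspace(0.0, 10.0, 1001)   # grid for the first argument
s = np.linspace(0.0, 50.0, 1001)   # grid for the second argument
```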
The following definitions are given for time-invariant RNNs, namely DNNs with tied weights. They can also be generalised to the case of untied-weight DNNs and time-varying RNNs by considering worst-case conditions over the layer (time) index $k$; in that case the properties are said to hold uniformly for all $k$. This is done in Section B. The tied-weight case follows.
A.2.2 Invariance, Stability and Robustness
Definition A.5.
(Positively Invariant Set) A set $\mathcal{X} \subseteq \mathbb{R}^n$ is said to be positively invariant (PI) for a dynamical system under an input $u$ if:
(21) $x(0) \in \mathcal{X} \;\Rightarrow\; x(k) \in \mathcal{X}, \quad \forall k \geq 0.$
Definition A.6.
(Robustly Positively Invariant Set) The set $\mathcal{X}$ is said to be robustly positively invariant (RPI) to additive input perturbations $w$ if $\mathcal{X}$ is PI for any input of the form $\bar{u} + w$.
Definition A.7.
(Asymptotic Stability) The system Eq. (20) is called Globally Asymptotically Stable around its equilibrium point $\bar{x}$ if it satisfies the following two conditions:

Stability. Given any $\epsilon > 0$, there exists $\delta_\epsilon > 0$ such that if $\|x(0) - \bar{x}\| < \delta_\epsilon$, then $\|x(k) - \bar{x}\| < \epsilon$ for all $k \geq 0$.

Attractivity. There exists $\delta > 0$ such that if $\|x(0) - \bar{x}\| < \delta$, then $x(k) \to \bar{x}$ as $k \to \infty$.
If only the first condition is satisfied, then the system is globally stable. If both conditions are satisfied only for some $x(0)$ in a neighbourhood of $\bar{x}$, then the stability properties hold only locally and the system is said to be locally asymptotically stable.
Local stability in a PI set $\mathcal{X}$ is equivalent to the existence of a $\mathcal{KL}$-function $\beta$ and a finite constant $\bar{c} \geq 0$ such that:
(22) $\|x(k) - \bar{x}\| \leq \beta\big(\|x(0) - \bar{x}\|,\, k\big) + \bar{c}, \quad \forall\, x(0) \in \mathcal{X},\ k \geq 0.$
If $\bar{c} = 0$, then the system is asymptotically stable. If the positively invariant set is $\mathcal{X} = \mathbb{R}^n$, then stability holds globally.
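For a simple linear contraction a bound of this type can be verified directly; the system $x(k+1) = \rho\, x(k)$ and the choice $\beta(r, k) = r\,\rho^k$ with $\bar{c} = 0$ below are illustrative:

```python
import numpy as np

rho = 0.8    # contraction factor of the toy system x(k+1) = rho * x(k)
x_bar = 0.0  # equilibrium
x0 = 5.0

traj = [x0]
for _ in range(40):
    traj.append(rho * traj[-1])   # iterate the dynamics
traj = np.array(traj)

k = np.arange(len(traj))
# beta(r, k) = r * rho^k is a KL-function: increasing in r, decaying in k.
beta_bound = abs(x0 - x_bar) * rho ** k
```

The trajectory stays under the $\mathcal{KL}$ envelope at every step and converges to the equilibrium, which is asymptotic stability with $\bar{c} = 0$.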
Define the system output as $y(k) = h(x(k))$, where $h$ is a continuous, Lipschitz function. Input-to-Output stability provides a natural extension of asymptotic stability to systems with inputs or additive uncertainty.¹⁰
¹⁰ Here we will consider only the simple case $h(x) = x$, therefore we can simply use notions of Input-to-State Stability (ISS).
Definition A.8.
(Input-Output (practical) Stability) Given an RPI set $\mathcal{X}$, a constant nominal input $\bar{u}$ and a nominal steady state $\bar{x} \in \mathcal{X}$ such that $\bar{y} = h(\bar{x})$, the system Eq. (20) is said to be input-output (practically) stable to bounded additive input perturbations (IOpS) in $\mathcal{X}$ if there exist a $\mathcal{KL}$-function $\beta$, a $\mathcal{K}$-function $\gamma$ and a constant $\bar{c} \geq 0$ such that:
$\|y(k) - \bar{y}\| \;\leq\; \beta\big(\|x(0) - \bar{x}\|,\, k\big) + \gamma\Big(\max_{0 \leq j \leq k} \|w(j)\|\Big) + \bar{c}, \quad \forall\, x(0) \in \mathcal{X},\; k \geq 0.$
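An input-to-state-style bound of this kind can be illustrated on a perturbed scalar contraction, taking the output to be the state itself; the system, the perturbation bound, and the comparison functions below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
rho = 0.8    # contraction factor
w_max = 0.5  # bound on the additive input perturbation
x0 = 5.0

x = x0
traj = [abs(x)]
for _ in range(60):
    w = rng.uniform(-w_max, w_max)  # bounded additive perturbation
    x = rho * x + w
    traj.append(abs(x))
traj = np.array(traj)

k = np.arange(len(traj))
# KL part beta(r, k) = r * rho^k plus K part gamma(w_max) = w_max / (1 - rho):
bound = x0 * rho ** k + w_max / (1.0 - rho)
```

The transient decays like the $\mathcal{KL}$ term, while the perturbation contributes only a bounded offset through the $\mathcal{K}$ term, which is exactly the qualitative behaviour the definition asks for.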