Deep learning powers many research areas and impacts various aspects of society [LeCun, Bengio, and Hinton2015], from computer vision [He et al.2016, Huang et al.2017] and natural language processing [Cho et al.2014] to biology [Esteva et al.2017] and e-commerce. Recent progress in designing architectures for deep networks has further accelerated this trend [Simonyan and Zisserman2015, He et al.2016, Huang et al.2017]. Among the most successful architectures are the deep residual network (ResNet) and its variants, which are widely used in many computer vision applications [He et al.2016, Pohlen et al.2017] and natural language processing tasks [Oord et al.2016, Xiong et al.2017, Wu et al.2016]. However, there are still few theoretical analyses and guidelines for designing and training ResNets.
In contrast to the recent interest in deep residual networks, systems of ordinary differential equations (ODEs), a special kind of dynamical system, have long been studied in mathematics and physics with rich theoretical and empirical success [Coddington and Levinson1955, Simmons2016, Arnolʹd2012]. The connection between nonlinear ODEs and deep ResNets has been established in recent work [E2017, Haber and Ruthotto2017, Haber, Ruthotto, and Holtham2017, Lu et al.2017, Long et al.2017, Chang et al.2017]. The continuous interpretation of ResNets as dynamical systems allows the adaptation of existing theory and numerical techniques for ODEs to deep learning. For example, [Haber and Ruthotto2017] introduces the concept of stable networks that can be arbitrarily long. However, that work proposes only deep networks with simple single-layer convolution building blocks, its architectures are not reversible (so the length of the network is limited by the amount of available memory), and it provides only simple numerical examples. Our work aims at overcoming these drawbacks and further investigates the efficacy and practicability of stable architectures derived from the dynamical systems perspective.
In this work, we connect deep ResNets and ODEs more closely and propose three stable and reversible architectures. We show that the three architectures are governed by stable and well-posed ODEs. In particular, our approach allows training arbitrarily long networks using only minimal memory storage. We illustrate the intrinsic reversibility of these architectures with both theoretical analysis and empirical results. The reversibility property leads directly to a memory-efficient implementation, which does not need to store the activations at most hidden layers. Together with the stability, this allows one to train almost arbitrarily deep architectures using modest computational resources.
The remainder of our paper is organized as follows. We discuss related work in Sec. 2. In Sec. 3 we review the notions of reversibility and stability in ResNets and present three new architectures and a regularization functional. In Sec. 4 we show the efficacy of our networks using three common classification benchmarks (CIFAR-10, CIFAR-100, STL-10). Our new architectures achieve comparable or even superior accuracy and, in particular, generalize better when only a limited amount of labeled training data is used. In Sec. 5 we conclude the paper.
2 Related Work
Residual Neural Networks and Extensions
ResNets are deep neural networks obtained by stacking simple residual blocks [He et al.2016]. A simple residual network block can be written as
$$\mathbf{Y}_{j+1} = \mathbf{Y}_j + \mathcal{F}(\mathbf{Y}_j, \boldsymbol{\theta}_j) \quad \text{for } j = 0, \ldots, N-1. \tag{1}$$
Here, $\mathbf{Y}_j$ are the values of the features at the $j$th layer and $\boldsymbol{\theta}_j$ are the $j$th layer’s network parameters. The goal of the training is to learn the network parameters $\boldsymbol{\theta}$. Eq. (1) represents a discrete dynamical system. An early review on neural networks as dynamical systems is presented in [Cessac2010].
ResNets have been broadly applied in many domains, including computer vision tasks such as image recognition [He et al.2016], object detection [He et al.2017], semantic segmentation [Pohlen et al.2017] and visual reasoning [Perez et al.2017], and natural language processing tasks such as speech synthesis [Oord et al.2016], speech recognition [Xiong et al.2017] and machine translation [Wu et al.2016].
Besides broadening the application domain, some ResNet successors focus on improving accuracy [Xie et al.2017, Zagoruyko and Komodakis2016] and stability [Haber and Ruthotto2017], saving GPU memory [Gomez et al.2017], and accelerating the training process [Huang et al.2016]. For instance, ResNeXt [Xie et al.2017] introduces a homogeneous, multi-branch architecture to increase the accuracy. Stochastic depth [Huang et al.2016] reduces the training time while increasing accuracy by randomly dropping a subset of layers and bypassing them with the identity function.
Systems of Ordinary Differential Equations
To see the connection between ResNets and ODE systems, we add a hyperparameter $h > 0$ to Eq. (1) and rewrite the equation as
$$\mathbf{Y}_{j+1} = \mathbf{Y}_j + h\,\mathcal{F}(\mathbf{Y}_j, \boldsymbol{\theta}_j). \tag{2}$$
For a sufficiently small $h$, Eq. (2) is a forward Euler discretization of the initial value problem
$$\dot{\mathbf{Y}}(t) = \mathcal{F}(\mathbf{Y}(t), \boldsymbol{\theta}(t)), \quad \mathbf{Y}(0) = \mathbf{Y}_0. \tag{3}$$
Thus, the problem of learning the network parameters, $\boldsymbol{\theta}$, is equivalent to solving a parameter estimation problem or optimal control problem involving the ODE system Eq. (3). In some cases (e.g., in image classification), Eq. (3) can be interpreted as a system of partial differential equations (PDEs). Such problems have a rich theoretical and computational framework, including techniques to guarantee stable networks by using appropriate functions $\mathcal{F}$, the discretization of the forward propagation process [Ascher and Petzold1998, Ascher2010, Bellman1953], theoretical frameworks for the optimization over the parameters $\boldsymbol{\theta}$ [Bock1983, Ulbrich2002, Gunzburger2003], and methods for computing the gradient of the solution with respect to $\boldsymbol{\theta}$ [Bliss1919].
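The forward Euler reading of a residual network can be illustrated with a minimal sketch. It assumes a hypothetical scalar right-hand side $\mathcal{F}(y, \theta) = \theta y$, chosen only because the ODE then has the closed-form solution $y(T) = y_0 e^{\theta T}$; the function names are illustrative and not taken from the paper.

```python
import math

def residual_forward(y0, thetas, h, f):
    """Propagate y0 through len(thetas) residual layers: y <- y + h * f(y, theta)."""
    y = y0
    for theta in thetas:
        y = y + h * f(y, theta)
    return y

# Hypothetical test problem (an assumption): F(y, theta) = theta * y,
# so dy/dt = theta * y has the exact solution y(T) = y0 * exp(theta * T).
def linear_rhs(y, theta):
    return theta * y

T = 1.0
theta = -0.5
exact = math.exp(theta * T)

def euler_error(num_layers):
    """Error of a num_layers-deep residual network with step size h = T / num_layers."""
    h = T / num_layers
    approx = residual_forward(1.0, [theta] * num_layers, h, linear_rhs)
    return abs(approx - exact)

err_coarse = euler_error(10)
err_fine = euler_error(1000)
```

With the final time fixed, making the network deeper corresponds to shrinking $h$, and the residual network output approaches the ODE solution.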
Reversible numerical methods for dynamical systems allow the simulation of the dynamics going from the final time to the initial time, and vice versa. Reversible numerical methods are commonly used in the context of hyperbolic PDEs, where various methods have been proposed and compared [Nguyen and McMechan2014]. The theoretical framework for reversible methods is strongly tied to issues of stability. In fact, as we show here, not every method that is algebraically reversible is numerically stable. This has a strong implication for the practical applicability of reversible methods to deep neural networks.
Recently, various reversible neural networks have been proposed for different purposes and based on different architectures. Recent work by [Dosovitskiy and Brox2016, Mahendran and Vedaldi2015] inverts the feed-forward net and reproduces the input features from their values at the final layers. This suggests that some deep neural networks are reversible: the generative model is just the reverse of the feed-forward net [Arora, Liang, and Ma2016]. [Gilbert et al.2017] provide a theoretical connection between model-based compressive sensing and CNNs. NICE [Dinh, Krueger, and Bengio2015, Dinh, Sohl-Dickstein, and Bengio2016] uses an invertible non-linear transformation to map the data distribution into a latent space where the resulting distribution factorizes, yielding good generative models. Besides the implications that reversibility has for deep generative models, the property can be used for developing memory-efficient algorithms. For instance, RevNet [Gomez et al.2017], which is inspired by NICE, develops a variant of ResNet where each layer’s activations can be reconstructed from the next layer’s. This allows one to avoid storing activations at all hidden layers, except at those layers with stride larger than one. We will show later that our physically-inspired network architectures are also reversible, and we derive memory-efficient implementations.
We introduce three new reversible architectures for deep neural networks and discuss their stability. We capitalize on the link between ResNets and ODEs to guarantee stability of the forward propagation process and the well-posedness of the learning problem. Finally, we present regularization functionals that favor smooth time dynamics.
ResNet as an ODE
Eq. (3) interprets ResNet as a discretization of a differential equation, whose parameters are learned in the training process. The process of forward propagation can be viewed as simulating the nonlinear dynamics that take the initial data, $\mathbf{Y}(0) = \mathbf{Y}_0$, which are hard to classify, and move them to a final state, which can be classified easily using, e.g., a linear classifier.
A fundamental question that needs to be addressed is, under what conditions is forward propagation well-posed? This question is important for two main reasons. First, instability of the forward propagation means that the solution is highly sensitive to data perturbation (e.g., image noise or adversarial attacks). Given that most computations are done in single precision, this may cause serious artifacts and instabilities in the final results. Second, training unstable networks may be very difficult in practice and, although impossible to prove, instability can add many local minima.
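Let us first review the issue of stability. A dynamical system is stable if a small change in the input data leads to a small change in the final result. To better characterize this, assume a small perturbation, $\delta\mathbf{Y}(0)$, to the initial data $\mathbf{Y}(0)$ in Eq. (3). Assume that this change is propagated throughout the network. The question is, what would be the change after some time $t$, that is, what is $\delta\mathbf{Y}(t)$?
This change can be characterized by the Lyapunov exponent [Lyapunov1992], which measures the difference in the trajectories of a nonlinear dynamical system given the initial conditions. The Lyapunov exponent, $\lambda$, is defined as the exponent that measures the difference:
$$\|\delta\mathbf{Y}(t)\| \approx \exp(\lambda t)\,\|\delta\mathbf{Y}(0)\|. \tag{4}$$
The forward propagation is well-posed when $\lambda \le 0$, and ill-posed if $\lambda > 0$. A bound on the value of $\lambda$ can be derived from the eigenvalues of the Jacobian matrix of $\mathcal{F}$ with respect to $\mathbf{Y}$, which is given by
$$\mathbf{J}(t) = \nabla_{\mathbf{Y}(t)}\,\mathcal{F}(\mathbf{Y}(t)).$$
A sufficient condition for stability is
$$\max_{i} \operatorname{Re}\big(\lambda_i(\mathbf{J}(t))\big) \le 0, \quad \forall t \in [0, T], \tag{5}$$
where $\lambda_i(\mathbf{J})$ is the $i$th eigenvalue of $\mathbf{J}$, and $\operatorname{Re}(\cdot)$ denotes the real part.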
This observation allows us to generate networks that are guaranteed to be stable. It should be emphasized that the stability of the forward propagation is necessary to obtain stable networks that generalize well, but not sufficient. In fact, if the real parts of the eigenvalues in Eq. (5) are negative and large in magnitude, $\operatorname{Re}(\lambda_i) \ll 0$, Eq. (4) shows that differences in the input features decay exponentially in time. This complicates the learning problem, and therefore we consider architectures that lead to Jacobians with (approximately) purely imaginary eigenvalues. We now discuss three such networks that are inspired by different physical interpretations.
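The eigenvalue condition in Eq. (5) can be checked numerically for small examples. The sketch below is illustrative only: it handles a constant $2 \times 2$ Jacobian via its characteristic polynomial, and the three example matrices are assumptions chosen to show the purely imaginary, strongly decaying, and unstable cases.

```python
import cmath

def eigenvalues_2x2(m):
    """Eigenvalues of a 2x2 matrix [[a, b], [c, d]] from its characteristic polynomial."""
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    return ((trace + disc) / 2.0, (trace - disc) / 2.0)

def satisfies_stability_condition(jacobian, tol=1e-12):
    """True if max_i Re(lambda_i(J)) <= 0, up to a small tolerance."""
    return max(ev.real for ev in eigenvalues_2x2(jacobian)) <= tol

rotation = [[0.0, 1.0], [-1.0, 0.0]]  # anti-symmetric: eigenvalues are +i and -i
damped = [[-5.0, 0.0], [0.0, -5.0]]   # stable, but perturbations decay quickly
saddle = [[1.0, 0.0], [0.0, -1.0]]    # one eigenvalue has a positive real part
```

Only the first matrix is both stable and non-dissipative, which is the regime the architectures in this section aim for.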
The two-layer Hamiltonian network
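[Haber and Ruthotto2017] propose a neural network architecture inspired by Hamiltonian systems
$$\dot{\mathbf{Y}}(t) = \sigma\big(\mathbf{K}(t)\mathbf{Z}(t) + \mathbf{b}(t)\big), \qquad \dot{\mathbf{Z}}(t) = -\sigma\big(\mathbf{K}(t)^\top\mathbf{Y}(t) + \mathbf{b}(t)\big), \tag{6}$$
where $\mathbf{Y}(t)$ and $\mathbf{Z}(t)$ are partitions of the features, $\sigma$ is an activation function, and the network parameters are $\boldsymbol{\theta} = (\mathbf{K}, \mathbf{b})$. For convolutional neural networks, $\mathbf{K}(t)$ and $\mathbf{K}(t)^\top$ are a convolution operator and a convolution transpose operator, respectively. It can be shown that the Jacobian matrix of this ODE satisfies the condition in Eq. (5), thus it is stable and well-posed. The authors also demonstrate the performance on a small dataset. However, in our numerical experiments we have found that the representability of this “one-layer” architecture is limited.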
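According to the universal approximation theorem [Hornik1991], a two-layer neural network can approximate any monotonically-increasing continuous function on a compact set. Recent work [Zhang et al.2017] shows that simple two-layer neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points. Therefore, we propose to extend Eq. (6) to the following two-layer structure:
$$\dot{\mathbf{Y}}(t) = \mathbf{K}_1^\top(t)\,\sigma\big(\mathbf{K}_1(t)\mathbf{Z}(t) + \mathbf{b}_1(t)\big), \qquad \dot{\mathbf{Z}}(t) = -\mathbf{K}_2^\top(t)\,\sigma\big(\mathbf{K}_2(t)\mathbf{Y}(t) + \mathbf{b}_2(t)\big). \tag{7}$$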
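In principle, any linear operator can be used within the Hamiltonian framework. However, since our numerical experiments consider images, we choose $\mathbf{K}_i$ to be a convolution operator and $\mathbf{K}_i^\top$ its transpose. Rewriting Eq. (7) in matrix form gives
$$\begin{pmatrix} \dot{\mathbf{Y}}(t) \\ \dot{\mathbf{Z}}(t) \end{pmatrix} = \begin{pmatrix} \mathbf{K}_1^\top & 0 \\ 0 & -\mathbf{K}_2^\top \end{pmatrix} \sigma\left( \begin{pmatrix} 0 & \mathbf{K}_1 \\ \mathbf{K}_2 & 0 \end{pmatrix} \begin{pmatrix} \mathbf{Y} \\ \mathbf{Z} \end{pmatrix} + \begin{pmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \end{pmatrix} \right). \tag{8}$$
There are different ways of partitioning the input features, including checkerboard partition and channel-wise partition [Dinh, Sohl-Dickstein, and Bengio2016]. In this work, we use equal channel-wise partition, that is, the first half of the channels of the input is $\mathbf{Y}$ and the second half is $\mathbf{Z}$.
The Jacobian matrix of the right-hand side of Eq. (8) with respect to $(\mathbf{Y}, \mathbf{Z})$ is
$$\mathbf{J} = \begin{pmatrix} \mathbf{K}_1^\top & 0 \\ 0 & -\mathbf{K}_2^\top \end{pmatrix} \operatorname{diag}(\sigma') \begin{pmatrix} 0 & \mathbf{K}_1 \\ \mathbf{K}_2 & 0 \end{pmatrix}, \tag{9}$$
where $\operatorname{diag}(\sigma')$ is the derivative of the activation function. The eigenvalues of $\mathbf{J}$ are all imaginary (see the Appendix for a proof). Therefore Eq. (5) is satisfied and the forward propagation of our neural network is stable and well-posed.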
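A commonly used discretization technique for Hamiltonian systems such as Eq. (7) is the Verlet method, which reads
$$\mathbf{Y}_{j+1} = \mathbf{Y}_j + h\,\mathbf{K}_{j1}^\top\sigma\big(\mathbf{K}_{j1}\mathbf{Z}_j + \mathbf{b}_{j1}\big), \qquad \mathbf{Z}_{j+1} = \mathbf{Z}_j - h\,\mathbf{K}_{j2}^\top\sigma\big(\mathbf{K}_{j2}\mathbf{Y}_{j+1} + \mathbf{b}_{j2}\big). \tag{10}$$
We choose Eq. (10) to be our Hamiltonian blocks and illustrate it in Fig. 1. Similar to ResNet [He et al.2016], our Hamiltonian reversible network is built by first concatenating blocks into units, and then concatenating units into a network. An illustration of our architecture is provided in Fig. 2.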
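A minimal sketch of one Hamiltonian block and its inverse follows. It assumes the two-layer dynamics of Eq. (7) are discretized by first updating $\mathbf{Y}$ from $\mathbf{Z}$ and then updating $\mathbf{Z}$ from the new $\mathbf{Y}$ with step size $h$. Small dense matrices stand in for the convolution operators, ReLU is an assumed activation, and all function names are hypothetical.

```python
import random

def matvec(matrix, vector):
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def residual(kernel, bias, features):
    """K^T sigma(K v + b) with sigma = ReLU (dense stand-in for a convolution)."""
    pre_activation = [p + b for p, b in zip(matvec(kernel, features), bias)]
    activated = [max(0.0, p) for p in pre_activation]
    return matvec(transpose(kernel), activated)

def hamiltonian_block(y, z, k1, b1, k2, b2, h):
    """Forward step: update Y from Z, then Z from the new Y."""
    y_next = [yi + h * r for yi, r in zip(y, residual(k1, b1, z))]
    z_next = [zi - h * r for zi, r in zip(z, residual(k2, b2, y_next))]
    return y_next, z_next

def hamiltonian_block_inverse(y_next, z_next, k1, b1, k2, b2, h):
    """Backward step: undo the two half-steps in reverse order."""
    z = [zi + h * r for zi, r in zip(z_next, residual(k2, b2, y_next))]
    y = [yi - h * r for yi, r in zip(y_next, residual(k1, b1, z))]
    return y, z

rng = random.Random(0)
dim, num_blocks, h = 3, 10, 0.1

def random_matrix():
    return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

def random_vector():
    return [rng.uniform(-0.5, 0.5) for _ in range(dim)]

params = [(random_matrix(), random_vector(), random_matrix(), random_vector())
          for _ in range(num_blocks)]
y0, z0 = random_vector(), random_vector()

# Forward pass through all blocks, keeping only the final pair (Y, Z).
y, z = y0, z0
for k1, b1, k2, b2 in params:
    y, z = hamiltonian_block(y, z, k1, b1, k2, b2, h)

# Recover the input from the output alone by inverting the blocks in reverse order.
for k1, b1, k2, b2 in reversed(params):
    y, z = hamiltonian_block_inverse(y, z, k1, b1, k2, b2, h)

reconstruction_error = max(abs(a - b) for a, b in zip(y + z, y0 + z0))
```

The reconstruction error is at the level of floating-point round-off here, which is what allows intermediate activations to be recomputed instead of stored.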
The midpoint network
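Another reversible numerical method for discretizing the first-order ODE in Eq. (3) is obtained by using central finite differences in time
$$\frac{\mathbf{Y}_{j+1} - \mathbf{Y}_{j-1}}{2h} = \mathcal{F}(\mathbf{Y}_j). \tag{11}$$
This gives the following forward propagation
$$\mathbf{Y}_{j+1} = \mathbf{Y}_{j-1} + 2h\,\mathcal{F}(\mathbf{Y}_j), \quad \text{for } j = 1, \ldots, N-1, \tag{12}$$
where $\mathbf{Y}_1$ is obtained by one forward Euler step. To guarantee stability for a single layer we can use the function $\mathcal{F}$ to contain an anti-symmetric linear operator, that is,
$$\mathcal{F}(\mathbf{Y}) = \sigma\big((\mathbf{K} - \mathbf{K}^\top)\mathbf{Y} + \mathbf{b}\big). \tag{13}$$
The Jacobian of this forward propagation is
$$\mathbf{J} = \operatorname{diag}(\sigma')\,(\mathbf{K} - \mathbf{K}^\top), \tag{14}$$
which has only imaginary eigenvalues. This yields the single layer midpoint network
$$\mathbf{Y}_{j+1} = \begin{cases} \mathbf{Y}_0 + h\,\sigma\big((\mathbf{K}_0 - \mathbf{K}_0^\top)\mathbf{Y}_0 + \mathbf{b}_0\big), & j = 0, \\ \mathbf{Y}_{j-1} + 2h\,\sigma\big((\mathbf{K}_j - \mathbf{K}_j^\top)\mathbf{Y}_j + \mathbf{b}_j\big), & j > 0. \end{cases} \tag{15}$$
As we see next, it is straightforward to show that the midpoint method is reversible (at least algebraically). However, while it is possible in principle to use a double layer midpoint network, it is difficult to ensure the stability of such a network. To this end, we explore the leapfrog network next.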
The leapfrog network
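A stable leapfrog network can be seen as a special case of the Hamiltonian network in Eq. (7) when one of the kernels is the identity matrix and one of the activations is the identity function. The leapfrog network involves two derivatives in time and reads
$$\ddot{\mathbf{Y}}(t) = -\mathbf{K}(t)^\top\sigma\big(\mathbf{K}(t)\mathbf{Y}(t) + \mathbf{b}(t)\big), \quad \mathbf{Y}(0) = \mathbf{Y}_0. \tag{16}$$
It can be discretized, for example, using the conservative leapfrog discretization, which uses the following symmetric approximation to the second derivative in time
$$\ddot{\mathbf{Y}}(t_j) \approx h^{-2}\big(\mathbf{Y}_{j+1} - 2\mathbf{Y}_j + \mathbf{Y}_{j-1}\big).$$
Substituting the approximation in Eq. (16), we obtain:
$$\mathbf{Y}_{j+1} = 2\mathbf{Y}_j - \mathbf{Y}_{j-1} - h^2\,\mathbf{K}_j^\top\sigma\big(\mathbf{K}_j\mathbf{Y}_j + \mathbf{b}_j\big). \tag{17}$$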
Reversible architectures and stability
An architecture is called reversible if it allows the reconstruction of the activations going from the end to the beginning. Reversible numerical methods for ODEs have been studied in the context of hyperbolic differential equations [Nguyen and McMechan2014], and reversibility was discovered recently in the machine learning community [Dinh, Krueger, and Bengio2015, Gomez et al.2017]. Reversible techniques enable memory-efficient implementations of the network that require the storage of the last activations only.
Let us first demonstrate the reversibility of the leapfrog network. Assume that we are given the last two states, $\mathbf{Y}_N$ and $\mathbf{Y}_{N-1}$. Then, using Eq. (17) it is straightforward to compute $\mathbf{Y}_{N-2}$:
$$\mathbf{Y}_{N-2} = 2\mathbf{Y}_{N-1} - \mathbf{Y}_N - h^2\,\mathbf{K}_{N-1}^\top\sigma\big(\mathbf{K}_{N-1}\mathbf{Y}_{N-1} + \mathbf{b}_{N-1}\big). \tag{18}$$
Given $\mathbf{Y}_{N-2}$, one can continue and re-compute the activations at each hidden layer during backpropagation. Similarly, it is straightforward to show that the midpoint network is reversible.
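The backward recomputation for the leapfrog network can be sketched in a few lines. This is an illustrative scalar version (the kernel $\mathbf{K}_j$ is a single number, so $\mathbf{K}_j^\top = \mathbf{K}_j$) with `tanh` as an assumed activation. The first two states are passed in explicitly because the way $\mathbf{Y}_1$ is initialized is not fixed here, so layer 0's parameters are unused.

```python
import math

def leapfrog_step(y_curr, y_prev, k, b, h):
    """Y_{j+1} = 2 Y_j - Y_{j-1} - h^2 K^T sigma(K Y_j + b), scalar case."""
    return 2.0 * y_curr - y_prev - h * h * k * math.tanh(k * y_curr + b)

def leapfrog_step_back(y_curr, y_next, k, b, h):
    """Y_{j-1} = 2 Y_j - Y_{j+1} - h^2 K^T sigma(K Y_j + b), scalar case."""
    return 2.0 * y_curr - y_next - h * h * k * math.tanh(k * y_curr + b)

def forward(y0, y1, ks, bs, h):
    """Return all states Y_0..Y_N (stored here only to check the reconstruction)."""
    states = [y0, y1]
    for j in range(1, len(ks)):
        states.append(leapfrog_step(states[j], states[j - 1], ks[j], bs[j], h))
    return states

def reconstruct(y_last, y_second_last, ks, bs, h):
    """Rebuild Y_0..Y_N from the final two states only."""
    n = len(ks)
    states = [None] * (n + 1)
    states[n], states[n - 1] = y_last, y_second_last
    for j in range(n - 1, 0, -1):
        states[j - 1] = leapfrog_step_back(states[j], states[j + 1], ks[j], bs[j], h)
    return states

num_layers, h = 50, 0.1
ks = [1.0 + 0.01 * j for j in range(num_layers)]  # hypothetical layer weights
bs = [0.0] * num_layers
stored = forward(1.0, 0.99, ks, bs, h)
rebuilt = reconstruct(stored[-1], stored[-2], ks, bs, h)
max_error = max(abs(a - b) for a, b in zip(stored, rebuilt))
```

In a memory-efficient implementation only the last two states would be kept after the forward pass; the full list is stored above solely to measure the reconstruction error.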
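The Hamiltonian network is similar to the RevNet and can be described as
$$\mathbf{Y}_{j+1} = \mathbf{Y}_j + \mathcal{F}(\mathbf{Z}_j), \qquad \mathbf{Z}_{j+1} = \mathbf{Z}_j + \mathcal{G}(\mathbf{Y}_{j+1}), \tag{19}$$
where $\mathbf{Y}_j$ and $\mathbf{Z}_j$ are a partition of the units in block $j$; $\mathcal{F}$ and $\mathcal{G}$ are the residual functions. Eq. (19) is reversible as each layer’s activations can be computed from the next layer’s as follows:
$$\mathbf{Z}_j = \mathbf{Z}_{j+1} - \mathcal{G}(\mathbf{Y}_{j+1}), \qquad \mathbf{Y}_j = \mathbf{Y}_{j+1} - \mathcal{F}(\mathbf{Z}_j). \tag{20}$$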
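While RevNet and MidPoint represent reversible networks algebraically, they may not be reversible in practice without restrictions on the residual functions. To illustrate, consider the simple linear case where $\mathcal{G}(\mathbf{Y}) = \alpha\mathbf{Y}$ and $\mathcal{F}(\mathbf{Z}) = \beta\mathbf{Z}$. The RevNet in this simple case reads
$$\mathbf{Y}_{j+1} = \mathbf{Y}_j + \beta\mathbf{Z}_j, \qquad \mathbf{Z}_{j+1} = \mathbf{Z}_j + \alpha\mathbf{Y}_{j+1}. \tag{21}$$
One way to simplify the equations is to look at two time steps and subtract them:
$$\mathbf{Y}_{j+1} - 2\mathbf{Y}_j + \mathbf{Y}_{j-1} = \beta(\mathbf{Z}_j - \mathbf{Z}_{j-1}) = \alpha\beta\,\mathbf{Y}_j,$$
which implies that
$$\mathbf{Y}_{j+1} - (2 + \alpha\beta)\mathbf{Y}_j + \mathbf{Y}_{j-1} = 0.$$
These types of equations have a solution of the form $\mathbf{Y}_j = \xi^j$. The characteristic equation is
$$\xi^2 - (2 + \alpha\beta)\,\xi + 1 = 0.$$
Define $a = \tfrac{1}{2}(2 + \alpha\beta)$; the roots of the equation are $\xi = a \pm \sqrt{a^2 - 1}$. If $a^2 \le 1$ then we have that $\xi = a \pm i\sqrt{1 - a^2}$ and
$$|\xi|^2 = a^2 + (1 - a^2) = 1,$$
which implies that the method is stable and no energy in the feature vectors is added or lost.
It is obvious that Eq. (21) is not stable for every choice of $\alpha$ and $\beta$. Indeed, if, for example, $\alpha$ and $\beta$ are positive then $a > 1$, one of the roots satisfies $|\xi| > 1$, and the solution can grow at every layer, exhibiting unstable behavior. It is possible to obtain stable solutions if $0 < \alpha$ and $\beta < 0$ and both are sufficiently small. This is the role of $h$ in our Hamiltonian network.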
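The linear analysis above can be reproduced numerically. The sketch below assumes the linear residual functions $\mathcal{G}(\mathbf{Y}) = \alpha\mathbf{Y}$ and $\mathcal{F}(\mathbf{Z}) = \beta\mathbf{Z}$ in scalar form, computes the roots of $\xi^2 - (2 + \alpha\beta)\xi + 1 = 0$, and compares them with a direct simulation of the recursion. The specific values of $\alpha$ and $\beta$ are illustrative.

```python
import cmath

def characteristic_roots(alpha, beta):
    """Roots of xi^2 - (2 + alpha * beta) * xi + 1 = 0."""
    a = 0.5 * (2.0 + alpha * beta)
    disc = cmath.sqrt(a * a - 1.0)
    return a + disc, a - disc

def growth_factor(alpha, beta):
    """Largest root magnitude: 1 means no growth, > 1 means exponential growth."""
    return max(abs(root) for root in characteristic_roots(alpha, beta))

def simulate(alpha, beta, num_layers, y0=1.0, z0=0.0):
    """Run the linear recursion Y <- Y + beta * Z, Z <- Z + alpha * Y."""
    y, z = y0, z0
    for _ in range(num_layers):
        y = y + beta * z
        z = z + alpha * y
    return y, z

stable_growth = growth_factor(0.1, -0.1)    # alpha > 0, beta < 0, both small
unstable_growth = growth_factor(0.1, 0.1)   # alpha and beta both positive
y_stable, _ = simulate(0.1, -0.1, 1000)
y_unstable, _ = simulate(0.1, 0.1, 1000)
```

In the stable setting the features oscillate with bounded amplitude over 1000 layers, while in the unstable setting they grow by many orders of magnitude, so reversing the network in finite precision becomes unreliable.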
This analysis plays a key role in reversibility. For unstable networks, either the forward or the backward propagation contains an exponentially growing mode. For computation in single precision (as in most practical CNNs), the gradient can be grossly inaccurate. Thus we see that not every choice of the functions $\mathcal{F}$ and $\mathcal{G}$ leads to a reasonable network in practice, and that some control is needed if we are to have a network that does not grow exponentially either forward or backward.
Arbitrarily deep residual neural networks
All three architectures we propose can be used with arbitrary depth, since they do not have any dissipation. This implies that the signal that is input into the system does not decay even for arbitrarily long networks. Thus signals can propagate through this system to infinite network depth. We have also experimented with slightly dissipative networks, that is, networks that attenuate the signal at each layer; these yielded results comparable to the ones obtained by the networks proposed here.
Regularization plays a vital role in deep neural network training, serving as parameter tuning that helps improve generalization performance [Zhang et al.2017]. Besides the commonly used weight decay, we also use weight smoothness decay. Since we interpret the forward propagation of our Hamiltonian network as a time-dependent nonlinear dynamic process, we prefer convolution weights that are smooth in time by using the regularization functional
$$R(\mathbf{K}) = \int_0^T \|\dot{\mathbf{K}}_1(t)\|_F^2 + \|\dot{\mathbf{K}}_2(t)\|_F^2 \, dt,$$
where $\|\cdot\|_F$ represents the Frobenius norm. Upon discretization, this gives the following weight smoothness decay as a regularization function
$$R_h(\mathbf{K}) = h \sum_{j=0}^{N-2} \sum_{k=1}^{2} \left\| \frac{\mathbf{K}_{j+1,k} - \mathbf{K}_{j,k}}{h} \right\|_F^2.$$
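A minimal sketch of the discretized weight smoothness decay follows. It assumes each block $j$ holds two kernels stored as plain 2-D lists and penalizes the squared Frobenius norm of the finite difference between consecutive blocks, scaled by the step size $h$. The data layout and function names are hypothetical.

```python
def frobenius_sq(a, b):
    """Squared Frobenius norm of (a - b) for two equally shaped nested lists."""
    return sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))

def weight_smoothness_decay(kernels, h):
    """R_h = h * sum_j sum_k || (K_{j+1,k} - K_{j,k}) / h ||_F^2.

    kernels[j][k] is the k-th kernel (k = 0, 1) of block j, stored as a 2-D list.
    """
    total = 0.0
    for block, next_block in zip(kernels, kernels[1:]):
        for k_now, k_next in zip(block, next_block):
            total += frobenius_sq(k_next, k_now) / (h * h)
    return h * total

# Kernels that do not change across blocks incur no penalty.
constant = [[[[1.0, 2.0]], [[3.0, 4.0]]] for _ in range(4)]
penalty_constant = weight_smoothness_decay(constant, h=0.5)

# Three blocks whose first 1x1 kernel increases by 1 per block.
ramp = [[[[0.0]], [[0.0]]], [[[1.0]], [[0.0]]], [[[2.0]], [[0.0]]]]
penalty_ramp = weight_smoothness_decay(ramp, h=0.5)
```

For the ramp, each of the two transitions contributes $(1/0.5)^2 = 4$, so the penalty is $0.5 \times 8 = 4$.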
We evaluate our methods on three standard classification benchmarks (CIFAR-10, CIFAR-100 and STL-10) and compare against state-of-the-art results from the literature. Furthermore, we investigate the robustness of our method as the amount of training data decreases and train a deep network with 1,202 layers.
|Name||Units||Channels||# Model Params (M)||Accuracy|
Datasets and baselines
CIFAR-10 and CIFAR-100
The CIFAR-10 dataset [Krizhevsky and Hinton2009] consists of 50,000 training images and 10,000 testing images in 10 classes with $32 \times 32$ image resolution. The CIFAR-100 dataset uses the same image format and train-test split sizes as CIFAR-10, but has 100 classes. We use common data augmentation techniques, including zero-padding four pixels on each side of the image, random cropping, random horizontal flipping and image standardization. Two state-of-the-art methods, ResNet [He et al.2016] and RevNet [Gomez et al.2017], are used as our baseline methods.
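The augmentation steps listed above can be sketched as follows. This is an illustrative single-channel version on nested lists and is not the pipeline used in the experiments. The crop size is assumed to equal the original image size, the flip probability is assumed to be 0.5, and standardization is assumed to be per image.

```python
import random

def pad_zeros(image, pad):
    """Surround a 2-D image with `pad` rows/columns of zeros on every side."""
    width = len(image[0]) + 2 * pad
    rows = [[0.0] * pad + list(row) + [0.0] * pad for row in image]
    border = [[0.0] * width for _ in range(pad)]
    return [list(r) for r in border] + rows + [list(r) for r in border]

def random_crop(image, size, rng):
    """Cut a size x size window at a uniformly random position."""
    top = rng.randint(0, len(image) - size)
    left = rng.randint(0, len(image[0]) - size)
    return [row[left:left + size] for row in image[top:top + size]]

def random_flip(image, rng):
    """Mirror the image horizontally with probability 0.5."""
    return [row[::-1] for row in image] if rng.random() < 0.5 else image

def standardize(image):
    """Shift and scale the pixels to zero mean and unit variance."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    std = max(var ** 0.5, 1e-8)  # guard against constant images
    return [[(p - mean) / std for p in row] for row in image]

def augment(image, rng, pad=4):
    size = len(image)
    cropped = random_crop(pad_zeros(image, pad), size, rng)
    return standardize(random_flip(cropped, rng))

rng = random.Random(0)
image = [[rng.random() for _ in range(32)] for _ in range(32)]
augmented = augment(image, rng)
```

Padding before cropping means the crop can shift the image content by up to four pixels in each direction while keeping the output at the original resolution.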
The STL-10 dataset [Coates, Ng, and Lee2011] is an image recognition dataset with 10 classes at an image resolution of $96 \times 96$. It contains 5,000 training images and 8,000 test images. Thus, compared with CIFAR-10, each class has fewer labeled training samples but higher image resolution. We use the same data augmentation as for CIFAR-10 except for padding zeros around the images.
Neural network architecture specifications
We provide the neural network architecture specifications here. The implementation details are in the Appendix. All the networks contain 3 units, and each unit has $n$ blocks. There is also a convolution layer at the beginning of the network and a fully connected layer at the end. For Hamiltonian networks, there are 4 convolution layers in each block, so the total number of layers is $12n + 2$. For MidPoint and Leapfrog, there are 2 convolution layers in each block, so the total number of layers is $6n + 2$. In the first block of each unit, the feature map size is halved and the number of filters is doubled. We perform downsampling by average pooling and increase the number of filters by padding zeros.
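The layer counts follow directly from the specification above: 3 units, a fixed number of blocks per unit, a fixed number of convolution layers per block, plus the initial convolution and the final fully connected layer. A small sketch, where the choice of 100 blocks per unit is an illustrative assumption that reproduces the 1202-layer Hamiltonian model discussed later:

```python
def total_layers(blocks_per_unit, convs_per_block, units=3):
    """Convolution layers in all blocks, plus the first convolution and the final fully connected layer."""
    return units * blocks_per_unit * convs_per_block + 2

hamiltonian_layers = total_layers(blocks_per_unit=100, convs_per_block=4)
midpoint_layers = total_layers(blocks_per_unit=100, convs_per_block=2)
```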
Main Results and Analysis
|Baselines||[Yang et al.2015]||73.15%|
|Baselines||[Dundar, Jin, and Culurciello2015]||74.1%|
|Baselines||[Zhao et al.2016]||74.3%|
CIFAR-10 and CIFAR-100
We show the main results of different architectures on CIFAR-10/100 in Table 1. Our three architectures achieve accuracy comparable to ResNet and RevNet using a similar number of model parameters. Compared with ResNet, our architectures are more memory efficient: because they are reversible, we do not need to store activations for most layers. Compared with RevNet, our models are not only reversible but also stable, as shown theoretically in Sec. 3. We show later that this stability makes our models more robust to small amounts of training data and allows them to be arbitrarily deep.
Main results on STL-10 are shown in Table 2. Compared with the state-of-the-art results, all our architectures achieve better accuracy.
Robustness to training data subsampling
Sometimes labeled data are very expensive to obtain. Thus, it is desirable to design architectures that generalize well when trained with few examples. To verify our intuition that stable architectures generalize well, we conducted extensive numerical experiments using the CIFAR-10 and STL-10 datasets with decreasing number of training data. Our focus is on the behavior of our neural network architecture in face of this data subsampling, instead of improving the state-of-the-art results. Therefore we intentionally use simple architectures: 4 blocks, each has 4 units, and the number of filters are . For comparison, we use ResNet [He et al.2016] as our baseline. CIFAR-10 has much more training data than STL-10 (50,000 vs 5,000), so we randomly subsample the training data from to for CIFAR-10, and from to for STL-10. The test data set remains unchanged.
Fig. 3 shows the result on CIFAR-10 when decreasing the number examples in the training data from to . Our Hamiltonian network performs consistently better in terms of accuracy than ResNet, achieving up to higher accuracy when trained using just and of the original training data.
From the result as shown in Fig. 4, we see that Hamiltonian consistently achieves better accuracy than ResNet with the average improvement being around . Especially when using just of the training data, Hamiltonian has a higher accuracy compared to ResNet.
Training a 1202-layer Hamiltonian
To demonstrate the stability and memory-efficiency of the Hamiltonian network with arbitrary depth, we explore a 1202-layer architecture on CIFAR-10. An aggressively deep ResNet with 1202 layers is also trained on CIFAR-10 in [He et al.2016], where it reaches an accuracy of 92.07%. Our result is shown in the last row of Table 1. Compared with the original ResNet, our architecture uses only half as many parameters and obtains better accuracy. Since the Hamiltonian network is intrinsically stable, it is guaranteed that there is no issue of exploding or vanishing gradients. We can train an arbitrarily deep Hamiltonian network without optimization difficulties. The implementation of our reversible architecture is memory efficient, which enables a 1202-layer Hamiltonian model to run on a single-GPU machine with 10GB of GPU memory.
We present three stable and reversible architectures that connect stable ODEs with deep residual neural networks and yield well-posed learning problems. We exploit the intrinsic reversibility property to obtain a memory-efficient implementation, which does not need to store the activations at most of the hidden layers. Together with the stability of the forward propagation, this allows training deeper architectures with limited computational resources. We evaluate our methods on three publicly available datasets against several state-of-the-art methods. Our experimental results demonstrate the efficacy of our method, with performance superior to or on par with the state of the art. Moreover, with a small amount of training data, our architectures achieve better accuracy than the widely used state-of-the-art ResNet. We attribute this robustness to small amounts of training data to the intrinsic stability of our Hamiltonian neural network architecture.
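Proof: All eigenvalues of $\mathbf{J}$ in Eq. (9) are imaginary.
Recall that for two square matrices $\mathbf{A}$ and $\mathbf{B}$ of the same size, $\mathbf{A}\mathbf{B}$ and $\mathbf{B}\mathbf{A}$ have the same eigenvalues [Horn and Johnson2012]. If we define
$$\mathbf{J}' = \operatorname{diag}(\sigma') \begin{pmatrix} 0 & \mathbf{K}_1 \\ \mathbf{K}_2 & 0 \end{pmatrix} \begin{pmatrix} \mathbf{K}_1^\top & 0 \\ 0 & -\mathbf{K}_2^\top \end{pmatrix} = \operatorname{diag}(\sigma') \begin{pmatrix} 0 & -\mathbf{K}_1\mathbf{K}_2^\top \\ \mathbf{K}_2\mathbf{K}_1^\top & 0 \end{pmatrix} =: \mathbf{D}\mathbf{A},$$
then $\mathbf{J}$ and $\mathbf{J}'$ have the same eigenvalues. $\mathbf{D}$ is a diagonal matrix with non-negative elements, and $\mathbf{A}$ is a real anti-symmetric matrix such that $\mathbf{A}^\top = -\mathbf{A}$. Let $\lambda$ and $\mathbf{v}$ be a pair of eigenvalue and eigenvector of $\mathbf{J}' = \mathbf{D}\mathbf{A}$, then
$$\mathbf{D}\mathbf{A}\mathbf{v} = \lambda\mathbf{v}, \qquad \mathbf{A}\mathbf{v} = \lambda\mathbf{D}^{-1}\mathbf{v}, \qquad \mathbf{v}^*\mathbf{A}\mathbf{v} = \lambda\,(\mathbf{v}^*\mathbf{D}^{-1}\mathbf{v}),$$
where $\mathbf{D}^{-1}$ is the generalized inverse of $\mathbf{D}$. On one hand, since $\mathbf{D}^{-1}$ is non-negative definite, $\mathbf{v}^*\mathbf{D}^{-1}\mathbf{v}$ is real. On the other hand,
$$(\mathbf{v}^*\mathbf{A}\mathbf{v})^* = \mathbf{v}^*\mathbf{A}^*\mathbf{v} = \mathbf{v}^*\mathbf{A}^\top\mathbf{v} = -\mathbf{v}^*\mathbf{A}\mathbf{v}, \tag{27}$$
where $*$ represents conjugate transpose. Eq. (27) implies that $\mathbf{v}^*\mathbf{A}\mathbf{v}$ is imaginary. Therefore, $\lambda$ has to be imaginary. As a result, all eigenvalues of $\mathbf{J}$ are imaginary.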
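As a small sanity check of the argument (an illustrative special case, not part of the proof), consider a $2 \times 2$ product of a non-negative diagonal matrix and a real anti-symmetric matrix, with $d_1, d_2 \ge 0$ and $a \in \mathbb{R}$ chosen freely:

```latex
\mathbf{D} = \begin{pmatrix} d_1 & 0 \\ 0 & d_2 \end{pmatrix}, \quad
\mathbf{A} = \begin{pmatrix} 0 & -a \\ a & 0 \end{pmatrix}, \quad
\mathbf{D}\mathbf{A} = \begin{pmatrix} 0 & -d_1 a \\ d_2 a & 0 \end{pmatrix},
\qquad
\det(\mathbf{D}\mathbf{A} - \lambda \mathbf{I}) = \lambda^2 + d_1 d_2 a^2 = 0
\;\Longrightarrow\;
\lambda = \pm\, i\, a \sqrt{d_1 d_2}.
```

Both eigenvalues have zero real part for every admissible $d_1$, $d_2$, and $a$, in agreement with the general statement.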
Our method is implemented using the TensorFlow library [Abadi et al.2016]. The CIFAR-10/100 and STL-10 experiments are evaluated on a desktop with an Intel Quad-Core i5 CPU and a single Nvidia 1080 Ti GPU.
For CIFAR-10 and CIFAR-100 experiments, we use a fixed mini-batch size of 100 both for training and test data except Hamiltonian-1202, which uses a batch-size of 32. The learning rate is initialized to be 0.1 and decayed by a factor of 10 at 80, 120 and 160 training epochs. The total training step is 80K. The weight decay constant is set to, weight smoothness decay is and the momentum is set to 0.9.
For STL-10 experiments, the mini-batch size is 128. The learning rate is initialized to be 0.1 and decayed by a factor of 10 at 60, 80 and 100 training epochs. The total training step is 20K. The weight decay constant is set to , weight smoothness decay is and the momentum is set to 0.9.
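The step-wise learning rate schedules described above can be written as a small helper. This is a sketch, not the training code used in the experiments. It assumes epochs are counted from zero and that each decay takes effect at the start of the boundary epoch.

```python
def learning_rate(epoch, base_lr=0.1, boundaries=(80, 120, 160), factor=10.0):
    """Piecewise-constant schedule: divide the rate by `factor` at each boundary epoch."""
    num_decays = sum(1 for boundary in boundaries if epoch >= boundary)
    return base_lr / (factor ** num_decays)

cifar_rates = [learning_rate(e) for e in (0, 79, 80, 120, 160)]
stl_rates = [learning_rate(e, boundaries=(60, 80, 100)) for e in (0, 60, 80, 100)]
```

The default boundaries correspond to the CIFAR-10/100 setting, and the second call uses the STL-10 boundaries.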
- [Abadi et al.2016] Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
- [Arnolʹd2012] Arnolʹd, V. I. 2012. Geometrical methods in the theory of ordinary differential equations. Springer Science & Business Media.
- [Arora, Liang, and Ma2016] Arora, S.; Liang, Y.; and Ma, T. 2016. Why are deep nets reversible: A simple theory, with implications for training. ICLR-Workshop.
- [Ascher and Petzold1998] Ascher, U., and Petzold, L. 1998. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations. Philadelphia: SIAM.
- [Ascher2010] Ascher, U. 2010. Numerical methods for Evolutionary Differential Equations. Philadelphia: SIAM.
- [Bellman1953] Bellman, R. 1953. An introduction to the theory of dynamic programming. Technical report.
- [Bliss1919] Bliss, G. A. 1919. The use of adjoint systems in the problem of differential corrections for trajectories. JUS Artillery 51:296–311.
- [Bock1983] Bock, G. 1983. Recent advances in parameter identification techniques for ode. In Deuflhard, P., and Hairer, E., eds., Numerical treatment of inverse problems. Boston: Birkhauser.
- [Cessac2010] Cessac, B. 2010. A view of neural networks as dynamical systems. International Journal of Bifurcation and Chaos.
- [Chang et al.2017] Chang, B.; Meng, L.; Haber, E.; Tung, F.; and Begert, D. 2017. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348.
- [Cho et al.2014] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning phrase representations using rnn encoder-decoder for statistical machine translation. ACL.
- [Coates, Ng, and Lee2011] Coates, A.; Ng, A.; and Lee, H. 2011. An analysis of single-layer networks in unsupervised feature learning. In AISTATS.
- [Coddington and Levinson1955] Coddington, E. A., and Levinson, N. 1955. Theory of ordinary differential equations. Tata McGraw-Hill Education.
- [Dinh, Krueger, and Bengio2015] Dinh, L.; Krueger, D.; and Bengio, Y. 2015. Nice: Non-linear independent components estimation. ICLR-Workshop.
- [Dinh, Sohl-Dickstein, and Bengio2016] Dinh, L.; Sohl-Dickstein, J.; and Bengio, S. 2016. Density estimation using real NVP. NIPS.
- [Dosovitskiy and Brox2016] Dosovitskiy, A., and Brox, T. 2016. Inverting visual representations with convolutional networks. In CVPR.
- [Dundar, Jin, and Culurciello2015] Dundar, A.; Jin, J.; and Culurciello, E. 2015. Convolutional clustering for unsupervised learning. arXiv preprint arXiv:1511.06241.
- [E2017] E, W. 2017. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics.
- [Esteva et al.2017] Esteva, A.; Kuprel, B.; Novoa, R. A.; Ko, J.; Swetter, S. M.; Blau, H. M.; and Thrun, S. 2017. Dermatologist-level classification of skin cancer with deep neural networks. Nature.
- [Gilbert et al.2017] Gilbert, A. C.; Zhang, Y.; Lee, K.; Zhang, Y.; and Lee, H. 2017. Towards understanding the invertibility of convolutional neural networks. arXiv preprint arXiv:1705.08664.
- [Gomez et al.2017] Gomez, A. N.; Ren, M.; Urtasun, R.; and Grosse, R. B. 2017. The reversible residual network: Backpropagation without storing activations. NIPS.
- [Gunzburger2003] Gunzburger, M. D. 2003. Perspectives in flow control and optimization. SIAM.
- [Haber and Ruthotto2017] Haber, E., and Ruthotto, L. 2017. Stable architectures for deep neural networks. arXiv preprint arXiv:1705.03341.
- [Haber, Ruthotto, and Holtham2017] Haber, E.; Ruthotto, L.; and Holtham, E. 2017. Learning across scales-a multiscale method for convolution neural networks. arXiv preprint arXiv:1703.02009.
- [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
- [He et al.2017] He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. 2017. Mask R-CNN. In ICCV.
- [Horn and Johnson2012] Horn, R. A., and Johnson, C. R. 2012. Matrix Analysis.
- [Hornik1991] Hornik, K. 1991. Approximation capabilities of multilayer feedforward networks. Neural networks.
- [Huang et al.2016] Huang, G.; Sun, Y.; Liu, Z.; Sedra, D.; and Weinberger, K. Q. 2016. Deep networks with stochastic depth. In ECCV.
- [Huang et al.2017] Huang, G.; Liu, Z.; Weinberger, K. Q.; and van der Maaten, L. 2017. Densely connected convolutional networks. CVPR.
- [Krizhevsky and Hinton2009] Krizhevsky, A., and Hinton, G. 2009. Learning multiple layers of features from tiny images.
- [LeCun, Bengio, and Hinton2015] LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature.
- [Long et al.2017] Long, Z.; Lu, Y.; Ma, X.; and Dong, B. 2017. Pde-net: Learning pdes from data. arXiv preprint arXiv:1710.09668.
- [Lu et al.2017] Lu, Y.; Zhong, A.; Li, Q.; and Dong, B. 2017. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations. arXiv preprint arXiv:1710.10121.
- [Lyapunov1992] Lyapunov, A. M. 1992. The general problem of the stability of motion. International Journal of Control.
- [Mahendran and Vedaldi2015] Mahendran, A., and Vedaldi, A. 2015. Understanding deep image representations by inverting them. In CVPR.
- [Nguyen and McMechan2014] Nguyen, B. D., and McMechan, G. A. 2014. Five ways to avoid storing source wavefield snapshots in 2d elastic prestack reverse time migration. Geophysics.
- [Oord et al.2016] Oord, A. v. d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; and Kavukcuoglu, K. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
- [Perez et al.2017] Perez, E.; de Vries, H.; Strub, F.; Dumoulin, V.; and Courville, A. 2017. Learning visual reasoning without strong priors. arXiv preprint arXiv:1707.03017.
- [Pohlen et al.2017] Pohlen, T.; Hermans, A.; Mathias, M.; and Leibe, B. 2017. Full-resolution residual networks for semantic segmentation in street scenes. In CVPR.
- [Simmons2016] Simmons, G. F. 2016. Differential equations with applications and historical notes. CRC Press.
- [Simonyan and Zisserman2015] Simonyan, K., and Zisserman, A. 2015. Very deep convolutional networks for large-scale image recognition. ICLR.
- [Ulbrich2002] Ulbrich, S. 2002. A sensitivity and adjoint calculus for discontinuous solutions of hyperbolic conservation laws with source terms. SIAM J. Control and Optimization 41:740–797.
- [Wu et al.2016] Wu, Y.; Schuster, M.; Chen, Z.; Le, Q. V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- [Xie et al.2017] Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; and He, K. 2017. Aggregated residual transformations for deep neural networks. CVPR.
- [Xiong et al.2017] Xiong, W.; Droppo, J.; Huang, X.; Seide, F.; Seltzer, M.; Stolcke, A.; Yu, D.; and Zweig, G. 2017. The microsoft 2016 conversational speech recognition system. In ICASSP.
- [Yang et al.2015] Yang, S.; Luo, P.; Loy, C. C.; Shum, K. W.; Tang, X.; et al. 2015. Deep representation learning with target coding. In AAAI.
- [Zagoruyko and Komodakis2016] Zagoruyko, S., and Komodakis, N. 2016. Wide residual networks. BMVC.
- [Zhang et al.2017] Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, O. 2017. Understanding deep learning requires rethinking generalization. ICLR.
- [Zhao et al.2016] Zhao, J. J.; Mathieu, M.; Goroshin, R.; and LeCun, Y. 2016. Stacked what-where auto-encoders. ICLR-Workshop.