Neural ordinary differential equations (neural ODEs) were proposed in (Chen et al. (2018)) and model the evolution of a hidden representation with an ordinary differential equation (ODE). The right-hand side of this ODE is represented by a neural network. If one integrates this ODE with the classical Euler scheme, a ResNet-like architecture (He et al. (2016)) is obtained. Thus, neural ODEs are a continuous analogue of ResNets. One motivation for introducing such models was the assumption that the hidden representation evolves smoothly, which a ResNet architecture can violate. Also, in contrast to ResNet models, neural ODEs share the parameters of the ODE right-hand side across the steps used to integrate this ODE. Thus, neural ODEs are more memory efficient than ResNets.
Various normalization techniques have been proposed to improve the quality of deep neural networks. Batch normalization (Ioffe and Szegedy (2015)) is a useful technique when training a deep neural network model. However, it requires computing and storing moving statistics for each time point. This becomes problematic when the number of time steps required for different inputs varies, as in recurrent neural networks (Hochreiter and Schmidhuber (1997); Cooijmans et al. (2016); Ba et al. (2016)), or when time is continuous, as in neural ODEs. We apply different normalization techniques (Salimans and Kingma (2016); Miyato et al. (2018); Ba et al. (2016)) to neural ODE models and report results for the CIFAR-10 classification task. The considered normalization approaches are compared in terms of test accuracy and the ability to generalize when a more powerful ODE solver is applied during inference.
The main ingredient of the neural ODE architecture is the ODE block. The forward pass through the ODE block is equivalent to solving the following initial value problem (IVP)

$$\frac{dz}{dt} = f(z(t), t, \theta), \quad t \in [t_0, t_1], \quad z(t_0) = z_0, \tag{1}$$

where $z_0$ denotes the input features, which are considered as the initial value, and $f$ is the right-hand side represented by a neural network with parameters $\theta$. To solve the IVP, we numerically integrate system (1) using an ODE solver. Depending on the solver type, a different number of right-hand side (RHS) evaluations of (1) is performed. IVP (1) replaces the Euler discretization of the same right-hand side that arises in ResNet-like architectures. One part of the standard ResNet-like architecture is the so-called ResNet block, which consists of convolutions, batch normalizations, and ReLU layers. In practice, batch normalization is often used to regularize the model, make it more robust to training hyperparameters, and reduce internal covariate shift (Shimodaira (2000)). It has also been shown that batch normalization yields a smoother loss surface and makes neural network training faster and more robust (Santurkar et al. (2018)). In the context of neural ODE training, previous studies applied layer normalization (Chen et al. (2018)) and batch normalization (Gholami et al. (2019)) but did not investigate the influence of these layers on model performance. In this study, we focus on the role of normalization techniques in neural ODEs. We hypothesize that proper normalization of the layers in ODE blocks increases test accuracy and leads to smoother learned dynamics.
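The relation between an ODE block and a ResNet block described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the experiments: `toy_rhs` is a hypothetical stand-in for the convolution, normalization, and activation stack that forms the right-hand side, and the integration interval $[0, 1]$ is an assumption.

```python
import math


def toy_rhs(z, t):
    """Hypothetical right-hand side f(z, t); stands in for a small neural network."""
    return [math.tanh(-zi) for zi in z]


def euler_ode_block(f, z0, n_steps, t0=0.0, t1=1.0):
    """Integrate dz/dt = f(z, t) from t0 to t1 with n_steps explicit Euler steps."""
    h = (t1 - t0) / n_steps
    z, t = list(z0), t0
    for _ in range(n_steps):
        dz = f(z, t)  # the same f (shared parameters) is reused at every step
        z = [zi + h * di for zi, di in zip(z, dz)]
        t += h
    return z


def resnet_block(f, z):
    """Residual update z + f(z): a single Euler step of size 1."""
    return [zi + di for zi, di in zip(z, f(z, 0.0))]


z0 = [0.5, -1.0, 2.0]
one_step = euler_ode_block(toy_rhs, z0, n_steps=1)
residual = resnet_block(toy_rhs, z0)
many_steps = euler_ode_block(toy_rhs, z0, n_steps=8)
```

With `n_steps=1` the ODE block coincides with the residual update, while larger `n_steps` reuses the same right-hand side on a finer time grid.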
According to Luo et al. (2018), different problems and neural network architectures require different types of normalization. In our empirical study, we investigate the following normalization techniques to solve the image classification problem with neural ODE models.
- Batch normalization (BN; Ioffe and Szegedy (2015)) is the most popular choice for the image classification problem; its benefits are discussed in the paragraph above.
- Layer normalization (LN; Ba et al. (2016)) and weight normalization (WN; Salimans and Kingma (2016)) were introduced for RNNs. We consider these normalizations appropriate candidates for neural ODEs since they have proved effective for RNNs, which also exploit the idea of sharing weights through time.
- Spectral normalization (SN; Miyato et al. (2018)) was proposed for generative adversarial networks. It is natural to consider SN for neural ODEs since, if the Jacobian norm is bounded, one may expect better gradient propagation in the backward pass.
- We also trained neural ODEs without any normalization (NF).
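As a rough illustration of three of the techniques listed above, the following sketch applies layer normalization to a feature vector, weight normalization to a weight vector, and spectral normalization to a small weight matrix. It is a simplified, assumption-laden version: there are no learned scale and shift parameters in `layer_norm`, and `spectral_norm` runs many power iterations at once, whereas practical implementations typically amortize the iterations over training steps. All function names are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def weight_norm(v, g):
    """Reparameterize a weight vector as w = g * v / ||v||."""
    norm = math.sqrt(sum(c * c for c in v))
    return [g * c / norm for c in v]


def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def spectral_norm(W, n_iter=50):
    """Estimate the largest singular value of W by power iteration on W^T W.

    The all-ones start vector is an assumption; it fails only if it is
    orthogonal to the leading right singular vector.
    """
    Wt = [list(col) for col in zip(*W)]
    u = [1.0] * len(W[0])
    for _ in range(n_iter):
        u = matvec(Wt, matvec(W, u))
        norm = math.sqrt(sum(c * c for c in u))
        u = [c / norm for c in u]
    Wu = matvec(W, u)
    return math.sqrt(sum(c * c for c in Wu))


def spectral_normalize(W, n_iter=50):
    """Rescale W so that its largest singular value is approximately 1."""
    sigma = spectral_norm(W, n_iter)
    return [[w / sigma for w in row] for row in W]


W = [[3.0, 0.0], [0.0, 1.0]]
W_sn = spectral_normalize(W)
```

After `spectral_normalize`, the linear map has spectral norm close to 1, which is the kind of bound on the layer that motivates considering SN inside ODE blocks.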
To perform back-propagation, we use the ANODE approach (Gholami et al. (2019)). This is a memory-efficient procedure for computing gradients in neural ODEs with several ODE blocks. The method relies on checkpointing, which reduces memory consumption at the cost of extra computation.
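The idea of trading computation for memory through checkpointing can be sketched on a scalar toy problem. The example below is not the ANODE implementation; it assumes a hypothetical right-hand side $f(z, \theta) = \theta z$, an Euler discretization on $[0, 1]$, and hand-written derivatives. Only the block input is kept after the forward pass, and the intermediate states are recomputed when gradients are needed.

```python
def f(z, theta):
    """Toy right-hand side (an assumption for illustration): f(z, theta) = theta * z."""
    return theta * z


def df_dz(z, theta):
    return theta


def df_dtheta(z, theta):
    return z


def forward(z0, theta, n_steps, t1=1.0):
    """Euler forward pass; intermediate states are discarded, only z0 is kept."""
    h = t1 / n_steps
    z = z0
    for _ in range(n_steps):
        z = z + h * f(z, theta)
    return z


def backward(z0, theta, n_steps, grad_out, t1=1.0):
    """Recompute the trajectory from the checkpoint z0, then back-propagate
    through the Euler steps in reverse order."""
    h = t1 / n_steps
    traj = [z0]
    for _ in range(n_steps):  # extra computation: second forward sweep
        traj.append(traj[-1] + h * f(traj[-1], theta))
    grad_z, grad_theta = grad_out, 0.0
    for z in reversed(traj[:-1]):
        grad_theta += grad_z * h * df_dtheta(z, theta)
        grad_z = grad_z * (1.0 + h * df_dz(z, theta))
    return grad_z, grad_theta


z0, theta, n_steps = 1.5, -0.7, 10
z1 = forward(z0, theta, n_steps)
grad_z0, grad_theta = backward(z0, theta, n_steps, grad_out=1.0)
```

For this toy problem the discrete solution is $z_0 (1 + h\theta)^N$ with $h = 1/N$, so the recomputed gradients can be checked against the closed-form derivatives.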
3 Numerical Experiments
This section presents numerical results of applying different normalization techniques to neural ODEs in the CIFAR-10 classification task. First, we compare test accuracy for neural ODE-based models with different types of normalization. Second, we present an $(S, n)$-criterion to estimate the quality of the trained neural ODE-like model.
In our experiments we consider neural ODE-based models, which are built by stacking standard layers and ODE blocks. After replacing the ResNet block with an ODE block in the ResNet4 model, we get the following model
conv → norm → activation → ODE block → avgpool → fc,
which we call the ODENet4 model. In this model we test different normalization techniques in place of the norm layer and inside the ODE block.
Similarly, by replacing the ResNet blocks that do not perform downsampling in the ResNet10 architecture with ODE blocks, we get the following architecture:
conv → norm → activation → ResNet block → ODE block → ResNet block → ODE block → avgpool → fc,
which we call the ODENet10 model. In contrast to ODENet4, this model admits different normalizations in place of the norm layer, inside the ResNet blocks, and inside the ODE blocks.
We apply ANODE to the considered models since it is more robust than the adjoint method (see Gholami et al. (2019) for details). In both the forward and backward passes through ODE blocks, we solve the corresponding ODEs using the Euler scheme. For the training schedule, we follow the settings from ANODE (Gholami et al. (2019)). In contrast to ANODEV2 (Zhang et al. (2019)), we include activations and normalization layers in the model. We train the considered models for 350 epochs with an initial learning rate of 0.1. The learning rate is decayed by a fixed multiplicative factor at epochs 150 and 300. Data augmentation is used. The batch size used for training is 512. We use the same settings for all experiments with different normalization techniques.
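The step-wise learning rate schedule described above can be written as a small function. This is a sketch under stated assumptions: the decay factor `gamma` is left as a required argument because its value is not given here, and applying the decay from the milestone epoch onward (`epoch >= milestone`) is also an assumption.

```python
def learning_rate(epoch, gamma, base_lr=0.1, milestones=(150, 300)):
    """Return the learning rate at a given epoch for a step decay schedule.

    gamma is the multiplicative decay factor; its value is an assumption
    supplied by the caller and is not fixed by this sketch.
    """
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** n_decays


# Example with a placeholder factor of 0.5 (chosen only for illustration).
schedule = [learning_rate(e, gamma=0.5) for e in range(350)]
```

The list `schedule` then holds one learning rate per epoch for a 350-epoch run with the placeholder factor.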
In our experiments, we assume that the normalization is the same for all ResNet blocks, and likewise for all ODE blocks. Along with these two normalizations, we vary the normalization technique applied after the first convolutional layer. Table 1 reports the test accuracy of the ODENet10 model for the different combinations of normalizations. The best model achieves 93% accuracy. It uses batch normalization after the first convolutional layer and in the ResNet blocks, and layer normalization in the ODE blocks. We also observe that removing batch normalization after the first convolutional layer and from the ResNet blocks decreases accuracy to 91.2%. This is even worse than the accuracy obtained by the model without any normalization (92%).
3.2 $(S, n)$-criterion of dynamic smoothness in the trained model
Since in neural ODE-like models we train not only the parameters of standard layers but also the parameters of the right-hand side of system (1), test accuracy is not the only important measure. Another significant criterion is the smoothness of the hidden representation dynamics, which is controlled by the trained parameters of the right-hand side of (1).
To implicitly estimate this smoothness, we propose an $(S, n)$-criterion that indicates whether a more powerful solver improves the performance of the trained neural ODE model during evaluation. Here, $S$ denotes the solver name (Euler, RK2, RK4, etc.) and $n$ denotes the number of right-hand side evaluations necessary to integrate system (1), which corresponds to the forward pass through the ODE block. By a more powerful solver we mean an ODE solver that requires more right-hand side evaluations to solve (1) than the ODE solver used in training. For example, assume one trains the model with the Euler scheme and some number of steps $n$. Then we say that the ODE block in the trained model corresponds to smooth dynamics if using the Euler scheme with more than $n$ steps during evaluation yields higher accuracy. Otherwise, we say that the $(S, n)$-criterion shows the absence of learned smooth dynamics. It is worth noting that the $(S, n)$-criterion has a limitation: it requires the solution of IVP (1) to be a Lipschitz function of the right-hand side parameters and inputs (Coddington and Levinson (1955)). Otherwise, we cannot rely on this criterion, since closeness in the right-hand side parameters does not imply closeness of the features that are inputs to the next layers of the model.
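To make the notion of a solver with a given budget of right-hand side evaluations concrete, the sketch below implements fixed-step Euler, RK2 (midpoint), and RK4 solvers for a scalar ODE and counts the evaluations. Interpreting the second element of a (solver, budget) pair as the total number of right-hand side evaluations follows the definition above; the test problem $dz/dt = -z$, the interval $[0, 1]$, and all names are assumptions made for illustration.

```python
import math


class CountingRHS:
    """Wrap a right-hand side f(z, t) and count how often it is evaluated."""

    def __init__(self, f):
        self.f = f
        self.calls = 0

    def __call__(self, z, t):
        self.calls += 1
        return self.f(z, t)


def euler_step(f, z, t, h):
    return z + h * f(z, t)


def rk2_step(f, z, t, h):
    # Midpoint method: two right-hand side evaluations per step.
    k1 = f(z, t)
    k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
    return z + h * k2


def rk4_step(f, z, t, h):
    # Classical Runge-Kutta method: four evaluations per step.
    k1 = f(z, t)
    k2 = f(z + 0.5 * h * k1, t + 0.5 * h)
    k3 = f(z + 0.5 * h * k2, t + 0.5 * h)
    k4 = f(z + h * k3, t + h)
    return z + h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


# Solver name -> (step function, right-hand side evaluations per step).
SOLVERS = {"Euler": (euler_step, 1), "RK2": (rk2_step, 2), "RK4": (rk4_step, 4)}


def integrate(f, z0, solver, n_evals, t0=0.0, t1=1.0):
    """Integrate dz/dt = f(z, t) using exactly n_evals right-hand side evaluations."""
    step, evals_per_step = SOLVERS[solver]
    if n_evals % evals_per_step != 0:
        raise ValueError("n_evals must be a multiple of the evaluations per step")
    n_steps = n_evals // evals_per_step
    h = (t1 - t0) / n_steps
    z, t = z0, t0
    for _ in range(n_steps):
        z = step(f, z, t, h)
        t += h
    return z


rhs = CountingRHS(lambda z, t: -z)
errors = {}
for name in SOLVERS:
    rhs.calls = 0
    errors[name] = abs(integrate(rhs, 1.0, name, n_evals=8) - math.exp(-1.0))
```

For this smooth test problem and a budget of 8 evaluations, the higher-order schemes give a smaller error, which is the behavior the criterion expects from a model with smooth learned dynamics.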
In our experiments we consider the ODENet4 architecture with four different settings of the number of steps in the Euler scheme. For each setting we trained 10 types of architectures that differ in the type of normalization applied after the first convolutional layer and after the convolutional layers in the ODE block. For example, the model named “ODENet4 BN-LN (Euler, 2)” means the following: we used the ODENet4 architecture, where the first convolutional layer is followed by a BN layer, each convolutional layer in the right-hand side of (1) is followed by an LN layer, and the Euler scheme with 2 steps is used to propagate through the ODE block.
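The naming convention in the example above can be captured by a small configuration object. This is only an illustrative sketch: the class `ODENet4Config`, its field names, and the validation against the five abbreviations BN, LN, WN, SN, and NF are assumptions, and the sketch does not enumerate which 10 combinations were actually trained.

```python
from dataclasses import dataclass

# Abbreviations used in the text for the normalization options.
NORMS = ("BN", "LN", "WN", "SN", "NF")


@dataclass(frozen=True)
class ODENet4Config:
    """Hypothetical description of one trained ODENet4 variant."""

    first_norm: str  # normalization after the first convolutional layer
    ode_norm: str    # normalization after each convolution in the ODE block
    solver: str = "Euler"
    n_steps: int = 2

    def __post_init__(self):
        for norm in (self.first_norm, self.ode_norm):
            if norm not in NORMS:
                raise ValueError(f"unknown normalization: {norm}")

    def name(self):
        return f"ODENet4 {self.first_norm}-{self.ode_norm} ({self.solver}, {self.n_steps})"


config = ODENet4Config(first_norm="BN", ode_norm="LN", solver="Euler", n_steps=2)
model_name = config.name()
```

Here `model_name` reproduces the label from the example in the text.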
For a fixed model trained with the (Euler, $n$) solver, we check whether the $(S, n)$-criterion is fulfilled by evaluating the model's accuracy with more powerful solvers. In this case, we consider the following solvers: (Euler, $m$), (RK2, $m$) and (RK4, $m$) for a range of values of $m$. Figure 1 shows the resulting plots, where each subplot corresponds to one trained model. Different line types correspond to different solver types (Euler, RK2, RK4), the $x$-axis shows the number of right-hand side evaluations, and the $y$-axis shows test accuracy.
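The check described above can be expressed as a short function over evaluation results. This is a sketch with several assumptions: the function name and argument layout are hypothetical, the criterion is taken to hold when no more powerful solver lowers the accuracy (whether the comparison should be strict is not settled here), and the accuracy values in the usage example are made-up placeholders, not measured results.

```python
def criterion_holds(n_train, train_acc, eval_acc):
    """Check the criterion for a model trained with (Euler, n_train).

    eval_acc maps (solver_name, n_evals) to the test accuracy obtained with
    that solver at evaluation time. Only entries with n_evals > n_train are
    treated as more powerful solvers.
    """
    stronger = {key: acc for key, acc in eval_acc.items() if key[1] > n_train}
    if not stronger:
        raise ValueError("no more powerful solver was evaluated")
    return all(acc >= train_acc for acc in stronger.values())


# Placeholder numbers for illustration only.
smooth_model = {("Euler", 4): 0.90, ("RK2", 4): 0.91, ("RK4", 8): 0.91}
non_smooth_model = {("Euler", 4): 0.62, ("RK2", 4): 0.55, ("RK4", 8): 0.40}

smooth_ok = criterion_holds(n_train=2, train_acc=0.90, eval_acc=smooth_model)
non_smooth_ok = criterion_holds(n_train=2, train_acc=0.90, eval_acc=non_smooth_model)
```

In this toy setting the first model passes the check and the second does not, mirroring the distinction between curves that stay flat and curves that drop as the solver changes.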
In Figure 1, we show how the test accuracy of the ODENet4 model with different normalizations changes when different ODE solvers are used to integrate IVP (1) in the ODE blocks. These models were trained with the Euler scheme, after which we use the Euler, RK2 and RK4 schemes to compute test accuracy. Each row, from top to bottom, corresponds to one value of $n$ used in the Euler scheme during training. One can see that when the ODE solver is changed for the models with layer normalization in the ODE blocks (the second column), the accuracy does not drop and remains the highest among all models.
4 Discussion and Further research
This study presents an analysis of different normalization techniques for ODE-based neural networks. Here we use only non-adaptive solvers to integrate IVP (1). Therefore, the complexity of the forward and backward passes is known in advance. However, non-adaptive solvers may not be accurate enough, which can decrease the accuracy of the model. Adaptive solvers (Nguyen et al. (2019)) help in such cases, but if IVP (1) is stiff, they evaluate the right-hand side of IVP (1) many times (Söderlind et al. (2015)), so the training process becomes slower. To decrease the stiffness of IVP (1) and speed up training, one can use proper normalization layers. Studying how different normalization techniques affect the stiffness of IVP (1) is an interesting direction for future work. Another promising direction is to investigate more sophisticated normalization techniques such as Squeeze-and-Excitation (Hu et al. (2018)), which performs dynamic channel-wise recalibration. Such investigations can give more insight into the role of normalization in neural ODE training.
Sections 2 and 3 were supported by the Ministry of Education and Science of the Russian Federation, grant 14.756.31.0001. High-performance computations presented in the paper were carried out on the Skoltech HPC cluster Zhores.
- Ba et al. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
- Chen et al. (2018). Neural ordinary differential equations. In Advances in neural information processing systems, pp. 6571–6583.
- Coddington and Levinson (1955). Theory of ordinary differential equations. Tata McGraw-Hill Education.
- Cooijmans et al. (2016). Recurrent batch normalization. arXiv preprint arXiv:1603.09025.
- Gholami et al. (2019). ANODE: Unconditionally accurate memory-efficient gradients for neural ODEs. arXiv preprint arXiv:1902.10298.
- He et al. (2016). Deep residual learning for image recognition. pp. 770–778.
- Hochreiter and Schmidhuber (1997). Long short-term memory. Neural computation 9 (8), pp. 1735–1780.
- Hu et al. (2018). Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141.
- Ioffe and Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- Luo et al. (2018). Differentiable learning-to-normalize via switchable normalization. arXiv preprint arXiv:1806.10779.
- Miyato et al. (2018). Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957.
- Nguyen et al. (2019). InfoCNF: An efficient conditional continuous normalizing flow with adaptive solvers. arXiv preprint arXiv:1912.03978.
- Salimans and Kingma (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in neural information processing systems, pp. 901–909.
- Santurkar et al. (2018). How does batch normalization help optimization? In Advances in Neural Information Processing Systems, pp. 2483–2493.
- Shimodaira (2000). Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference 90 (2), pp. 227–244.
- Söderlind et al. (2015). Stiffness 1952–2012: Sixty years in search of a definition. BIT Numerical Mathematics 55 (2), pp. 531–558.
- Zhang et al. (2019). ANODEV2: A coupled neural ODE evolution framework. arXiv preprint arXiv:1906.04596.