code for "Residual Flows for Invertible Generative Modeling".
Reversible deep networks provide useful theoretical guarantees and have proven to be a powerful class of functions in many applications. Usually, they rely on analytical inverses using dimension splitting, which fundamentally constrains their structure compared to common architectures. Based on recent links between ordinary differential equations and deep networks, we provide a sufficient condition under which standard ResNets are invertible. This condition allows unconstrained architectures for residual blocks and requires only an adaptation of their regularization scheme. We compute their inverse numerically, which has O(1) memory cost and a computational cost of 5-20 forward passes. Finally, we show that invertible ResNets perform on par with standard ResNets on classifying MNIST and CIFAR10 images.
A TensorFlow implementation of Invertible Residual Networks
Reversible deep networks are one of the first classes of deep networks that provide statistical guarantees and access to key quantities like densities and gradients for generative modelling. Most notably, they allow for the exact computation of the determinant of their Jacobian, which is crucial for maximum-likelihood based generative models (dinh2016density; kingma2018glow; chen2018neural). Further, they provide guarantees on mutual information between input and intermediate outputs, making them a suitable tool to analyze invariants of learned representations (jacobsen2018irevnet; anonymous2019excessive). The ability to invert deep networks alleviates the need to store intermediate activations during backpropagation (gomez2017reversible; Chang2018ReversibleAF) and offers a promising approach to inverse problems (ardizzone2018analyzing). However, architectures used to achieve reversibility are typically restricted to those with analytical inverses (dinh2014nice).
Reversible networks rely on fixed dimension-splitting heuristics, but common splittings interleaved with non-volume-conserving elements are constraining, and their choice has a large impact on performance (kingma2018glow; dinh2016density). This makes building reversible networks a difficult task, and it remains unclear how such exotic designs relate to widely used architectures like ResNets (he2016identity).
Recently, instead of specifying a discrete architecture, it was proposed to parametrize the derivative of hidden states by a neural network (chen2018neural; ffjord). This approach amounts to learning Lipschitz dynamics of an ordinary differential equation (ODE). While the ODE over a finite time interval is invertible (chen2018neural), the discretization relies on adaptive ODE solvers to guarantee accurate solutions. This induces computational costs that are hard to control and that even grow during training (ffjord). In contrast, ResNets have a fixed computational cost.
The remarkable similarity of ResNets and Euler methods for ODEs,

$$x_{t+1} \leftarrow x_t + g_{\theta_t}(x_t) \quad \text{(ResNet)}, \qquad x_{t+1} \leftarrow x_t + h f_{\theta_t}(x_t) \quad \text{(Euler)},$$

with activations (states) $x_t \in \mathbb{R}^d$, layer index (time step) $t$, scaling (step size) $h > 0$ and residual block $g_{\theta_t}$ (dynamic of the ODE $f_{\theta_t}$), has attracted growing research at the intersection of deep learning and dynamical systems (luODE; haberRuthotto; ruthottoHaber; chen2018neural). However, little attention has been paid to the dynamics backwards in time,

$$x_t \leftarrow x_{t+1} - g_{\theta_t}(x_t), \qquad x_t \leftarrow x_{t+1} - h f_{\theta_t}(x_t),$$

which amounts to the implicit backward Euler discretization. In particular, solving the dynamics backwards in time would implement an inverse of the corresponding ResNet. As stated in the following theorem, a simple condition suffices to make the dynamics solvable and thus renders the ResNet invertible.
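Theorem. Let $F_\theta : \mathbb{R}^d \to \mathbb{R}^d$ with $F_\theta = F_\theta^1 \circ \dots \circ F_\theta^T$ denote a ResNet with blocks $F_\theta^t = I + g_{\theta_t}$. Then the ResNet $F_\theta$ is invertible if

$$\mathrm{Lip}(g_{\theta_t}) < 1 \quad \text{for all } t = 1, \dots, T,$$

where $\mathrm{Lip}(g_{\theta_t})$ is the Lipschitz constant of $g_{\theta_t}$.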
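Proof. Since the ResNet $F_\theta$ is a composition of functions, it is invertible if each block $F_\theta^t$ is invertible. Let $x_{t+1} \in \mathbb{R}^d$ be arbitrary and consider the backward Euler discretization $x_t = x_{t+1} - h f_{\theta_t}(x_t) = x_{t+1} - g_{\theta_t}(x_t)$. Rewriting this as an iteration yields

$$x_t^0 := x_{t+1}, \qquad x_t^{k+1} := x_{t+1} - g_{\theta_t}(x_t^k),$$

where $\lim_{k \to \infty} x_t^k = x_t$ is the fixed point if the iteration converges. As $g_{\theta_t} : \mathbb{R}^d \to \mathbb{R}^d$ is an operator on a Banach space, the contraction condition $\mathrm{Lip}(g_{\theta_t}) < 1$ guarantees convergence to a unique fixed point by the Banach fixed-point theorem. ∎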
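The fixed-point iteration from the proof can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the toy residual function, its dimension, the Lipschitz bound of 0.7 and all function names are assumptions chosen only so that the contraction condition holds by construction.

```python
import numpy as np

def make_residual_fn(dim, lip, seed=0):
    """Toy residual function g(x) = W2 @ relu(W1 @ x) with Lip(g) <= lip (hypothetical)."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((dim, dim))
    W2 = rng.standard_normal((dim, dim))
    # Rescale both matrices to spectral norm sqrt(lip). ReLU is 1-Lipschitz,
    # so Lip(g) <= ||W2||_2 * ||W1||_2 = lip.
    W1 *= np.sqrt(lip) / np.linalg.norm(W1, 2)
    W2 *= np.sqrt(lip) / np.linalg.norm(W2, 2)
    return lambda x: W2 @ np.maximum(W1 @ x, 0.0)

def block_forward(x, g):
    """One residual block: x_{t+1} = x_t + g(x_t)."""
    return x + g(x)

def block_inverse(x_next, g, n_iter=100):
    """Fixed-point iteration x^{k+1} = x_{t+1} - g(x^k), started at x^0 = x_{t+1}."""
    x = x_next.copy()
    for _ in range(n_iter):
        x = x_next - g(x)
    return x

g = make_residual_fn(dim=8, lip=0.7)
x = np.random.default_rng(1).standard_normal(8)
x_rec = block_inverse(block_forward(x, g), g)
rel_err = np.linalg.norm(x - x_rec) / np.linalg.norm(x)
print(f"relative reconstruction error: {rel_err:.2e}")
```

Because the error shrinks by at least the factor Lip(g) per step, a smaller Lipschitz bound needs fewer iterations for the same accuracy.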
The condition above was also stated in (anonymous2019information) (Appendix D); however, their proof restricts the domain of the residual block to be bounded and applies only to linear operators, as the inverse was given by a convergent Neumann series. Note that the condition is not necessary: other approaches (dinh2014nice; dinh2016density; jacobsen2018irevnet; Chang2018ReversibleAF; kingma2018glow) rely on triangular structures to create analytical inverses. Having proved the above condition for invertible ResNets, we discuss connections to related approaches:
Invertibility and ODEs: While ResNets are only guaranteed to be invertible if each residual block is contractive ($\mathrm{Lip}(g_{\theta_t}) < 1$), ODEs given by

$$\frac{dx(t)}{dt} = f_\theta(x(t), t)$$

with Lipschitz continuous $f_\theta$ are reversible (chen2018neural). By choosing a scaling $h$ (time step) such that $\mathrm{Lip}(h f_\theta) < 1$, bijectivity of the ODE is preserved under the Euler discretization.
Maximal singular value of each layer's convolutional operator for various ResNets trained on CIFAR10. Left: vanilla and BatchNorm ResNet singular values. It is likely that the baseline ResNets are not invertible, as roughly two thirds of their layers have singular values well above one, making the blocks non-contractive. Right: singular values for our four spectrally normalized ResNets. The regularization is effective, and in every case each individual ResNet block remains a contraction.
Stability of ODEs: There are two main approaches to studying the stability of ODEs: 1) behavior for $t \to \infty$ and 2) Lipschitz stability over finite time intervals $[0, T]$. Based on time-invariant dynamics $f(x(t))$, (naisnet) constructed asymptotically stable ResNets using anti-symmetric layers such that $\mathrm{Re}(\lambda(J_x g)) < 0$ (with $\mathrm{Re}(\lambda(\cdot))$ denoting the real part of the eigenvalues, $\rho(\cdot)$ the spectral radius and $J_x g$ the Jacobian at point $x$). By projecting weights based on the Gershgorin circle theorem, they further fulfilled $\rho(J_x g) < 1$, yielding asymptotically stable ResNets with shared weights over layers.
Turning discrete into continuous dynamics: The view of deep networks as dynamics over time offers two fundamental learning approaches: 1) direct learning of continuous ODE dynamics parametrized by neural networks, as in (chen2018neural; ffjord), and 2) indirect learning of ODE dynamics using discretization architectures like ResNets (haberRuthotto; ruthottoHaber; luODE; naisnet).
By fixing the ResNet $F_\theta$, the dynamic is only fixed at time points $t_i$, corresponding to each block $g_{\theta_{t_i}}$. For example, a linear interpolation in time turns the discretization back into a continuous set of dynamics. However, in this indirect approach the dynamics depend on the discretization, while the direct approach (chen2018neural; ffjord) learns the ODE and adapts the discretization based on the fixed ODE. Thus, the effect of multi-step discretization schemes (luODE) in the indirect approach is unclear, as the nature of the discretization changes the underlying ODE dynamics.
We show that standard and normalized ResNets perform on par in image classification, while normalized ResNets are guaranteed to be invertible. We train a standard pre-activation ResNet (he2016identity) with 55 bottleneck blocks on CIFAR10 and MNIST. All experiments use identical settings for all hyperparameters. We replace subsampling of strided convolution layers with "invertible downsampling" operations (jacobsen2018irevnet) to allow invertibility; see appendix A for training and architectural details. To obtain the numerical inverse, we apply 100 fixed point iterations for each block. This number is chosen only to guarantee full convergence; in practice, far fewer iterations suffice. The trade-off between reconstruction error and number of iterations is analyzed in appendix B.
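As an illustration of what an invertible downsampling operation can look like, the sketch below rearranges each 2x2 spatial patch into channels, so that resolution is halved without discarding any value. The exact channel ordering used in (jacobsen2018irevnet) may differ; the function names and the (C, H, W) layout here are assumptions.

```python
import numpy as np

def squeeze(x, block=2):
    """Rearrange (C, H, W) -> (C * block^2, H / block, W / block) without losing values."""
    c, h, w = x.shape
    assert h % block == 0 and w % block == 0, "spatial size must be divisible by block"
    x = x.reshape(c, h // block, block, w // block, block)
    x = x.transpose(0, 2, 4, 1, 3)  # (c, block, block, h / block, w / block)
    return x.reshape(c * block * block, h // block, w // block)

def unsqueeze(y, block=2):
    """Exact inverse of squeeze."""
    cb, h, w = y.shape
    c = cb // (block * block)
    y = y.reshape(c, block, block, h, w)
    y = y.transpose(0, 3, 1, 4, 2)  # (c, h, block, w, block)
    return y.reshape(c, h * block, w * block)

x = np.arange(3 * 4 * 4, dtype=np.float32).reshape(3, 4, 4)
y = squeeze(x)
print(y.shape, np.array_equal(unsqueeze(y), x))
```

Unlike a strided convolution, this operation is a pure permutation of the input entries, which is why it has an exact inverse.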
| Normalized L2 error | MNIST | 2.5e-7 | 1.0 | 2.2e-7 | 1.9e-7 | 1.6e-7 | 7.3e-8 |
Satisfying $\mathrm{Lip}(g) < 1$: We implement residual blocks as $g = W_3 \phi(W_2 \phi(W_1))$, where $W_i$ are convolutional layers and $\phi$ denotes ReLU. Thus, because ReLU is 1-Lipschitz, $\mathrm{Lip}(g) < 1$ if $\|W_i\|_2 < 1$, where $\|\cdot\|_2$ denotes the spectral norm. Unlike (miyato2018spectral), we directly estimate the spectral norm of $W_i$ by performing power iteration using $W_i$ and $W_i^\top$, as proposed in (gouk). Note that a power iteration on the parameter matrix (miyato2018spectral) only gives a bound on $\|W_i\|_2$; see (tsuzuku2018lipschitz).
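A power iteration that only needs the action of an operator and of its transpose can be sketched as follows. For readability the sketch uses a dense matrix in place of a convolution; the function name, iteration count and test matrix are assumptions, not details taken from (gouk).

```python
import numpy as np

def power_iteration(apply_w, apply_wt, dim_in, n_iter=100, seed=0):
    """Estimate the largest singular value of a linear operator W.

    apply_w(v) computes W v and apply_wt(u) computes W^T u, so the operator
    itself (for example a convolution) never has to be written as a matrix.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(dim_in)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        u = apply_w(v)
        u /= np.linalg.norm(u)
        v = apply_wt(u)
        v /= np.linalg.norm(v)
    # u^T W v equals ||W^T u||, which never exceeds the true spectral norm.
    return float(u @ apply_w(v))

W = np.random.default_rng(42).standard_normal((20, 30))
sigma_est = power_iteration(lambda v: W @ v, lambda u: W.T @ u, dim_in=30)
sigma_true = np.linalg.norm(W, 2)
print(f"estimate {sigma_est:.4f}, exact {sigma_true:.4f}")
```

Since the estimate approaches the spectral norm from below, a layer rescaled with it can still end up slightly above the intended bound, which is one reason to verify the trained layers afterwards.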
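After training, we compute the spectral norm of each convolutional layer exactly, using the SVD on the Fourier-transformed parameter matrix following (singularValConv), to show that the contraction condition holds in all cases.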
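The Fourier-based computation can be sketched in NumPy as below. The sketch assumes stride 1 and circular (wrap-around) padding, a kernel stored as (k, k, c_in, c_out) and a hypothetical function name; it is not taken from the authors' code.

```python
import numpy as np

def conv_singular_values(kernel, input_shape):
    """Singular values of a circular 2D convolution.

    kernel: array of shape (k, k, c_in, c_out); input_shape: spatial size (n, m).
    The kernel is zero-padded to the input size and Fourier transformed along
    the spatial axes; each frequency then contributes one c_in x c_out matrix
    whose singular values are singular values of the full operator.
    """
    transforms = np.fft.fft2(kernel, input_shape, axes=[0, 1])
    return np.linalg.svd(transforms, compute_uv=False)

# Sanity check: a kernel that copies each pixel unchanged is the identity map.
identity_kernel = np.zeros((3, 3, 1, 1))
identity_kernel[0, 0, 0, 0] = 1.0
sv_identity = conv_singular_values(identity_kernel, (8, 8))

# A random two-channel kernel: the largest entry is the operator's spectral norm.
random_kernel = np.random.default_rng(0).standard_normal((3, 3, 2, 2))
sv_random = conv_singular_values(random_kernel, (8, 8))
print(sv_identity.max(), sv_random.max())
```

A layer passes the check when the largest returned value is below one.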
Classification and reconstruction results for two baseline ResNets (with and without BatchNorm) and four invertible ResNets with different spectral normalization coefficients $c$ are shown in Table 1. The results illustrate that our proposed invertible ResNets perform on par with the baselines in terms of classification performance for larger settings of $c$, while being provably invertible. When applying very conservative normalization (small $c$), the classification error becomes higher on both datasets.
The normalized L2 reconstruction errors show that our regularization is effective and the inverse is close to exact. Intriguingly, our analysis also reveals that ResNets without BatchNorm are invertible after training on MNIST, whereas the BatchNorm ResNet is not. Further, neither the ResNet with BatchNorm nor the one without is invertible after training on CIFAR10, as can also be seen from the singular value plots in figure 4.
See figure 3 for qualitative reconstruction results with 100 fixed point iteration steps. Note that the reconstruction error decays quickly and the errors are already imperceptible after 5-20 iterations, which costs 5-20 times the forward pass and empirically corresponds to 0.15-0.75 seconds for reconstructing 100 CIFAR10 images. Computing the inverse is fast even for the largest normalization coefficient, and becomes faster with stronger regularization. The number of iterations needed for full convergence is roughly halved when the spectral normalization coefficient is reduced by 0.2; see appendix B for a detailed plot.
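The dependence of the iteration count on the strength of the normalization can be explored with a toy experiment. The sketch below uses a random linear map as a stand-in for a residual block, so it only shows the qualitative trend and does not reproduce the numbers reported above; the dimension, tolerance and coefficient values are assumptions.

```python
import numpy as np

def iterations_to_tol(c, dim=16, tol=1e-6, max_iter=1000, seed=0):
    """Count fixed-point steps until a toy linear block with ||W||_2 = c is inverted."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((dim, dim))
    W *= c / np.linalg.norm(W, 2)  # rescale so that the spectral norm equals c
    x_true = rng.standard_normal(dim)
    y = x_true + W @ x_true        # forward pass of the toy block
    x = y.copy()
    for k in range(1, max_iter + 1):
        x = y - W @ x              # one fixed-point step
        if np.linalg.norm(x - x_true) <= tol * np.linalg.norm(x_true):
            return k
    return max_iter

counts = {c: iterations_to_tol(c) for c in (0.9, 0.7, 0.5)}
for c, k in counts.items():
    print(f"c = {c}: {k} iterations")
```

With the same random seed for every coefficient, the only quantity that changes between runs is the scale of the map, so any difference in the counts is due to the contraction strength alone.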
Figure 3: Qualitative reconstruction results. Panels: vanilla ResNet inverse; normalized ResNet inverse.
In summary, we observe that invertibility without additional constraints is unlikely but possible, and it is hard to predict whether a given network will have this property. Our proposed normalized ResNets, however, come with a theoretical guarantee that an inverse exists, without harming classification performance.
We gratefully acknowledge the financial support from the German Science Foundation for RTG 2224 "π³: Parameter Identification - Analysis, Algorithms, Applications".
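Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., and Holtham, E. Reversible architectures for arbitrarily deep residual neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.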
We use pre-activation ResNets with 55 convolutional bottleneck blocks, each with 3 convolution layers with kernel sizes of 3x3, 1x1 and 3x3, respectively. In the BatchNorm version, we apply batch normalization before every ReLU activation function. The multiplier for the bottleneck is 4. The network has 2 downsampling stages, after 18 and 36 blocks, and also concatenates zeros to its input to increase the number of input channels by a factor of 4, a strategy initially described as injective iRevNet (jacobsen2018irevnet) and also used in Glow (kingma2018glow) on MNIST.
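The zero-concatenation step can be illustrated with a short NumPy sketch. The (C, H, W) layout, the function names and the choice to append the zero channels after the original ones are assumptions.

```python
import numpy as np

def pad_channels(x, factor=4):
    """Append zero channels so that (C, H, W) becomes (factor * C, H, W)."""
    c, h, w = x.shape
    zeros = np.zeros(((factor - 1) * c, h, w), dtype=x.dtype)
    return np.concatenate([x, zeros], axis=0)

def unpad_channels(x_padded, factor=4):
    """Recover the original input by dropping the appended zero channels."""
    c = x_padded.shape[0] // factor
    return x_padded[:c]

image = np.random.default_rng(0).standard_normal((1, 28, 28))  # MNIST-sized input
padded = pad_channels(image)
print(padded.shape)
```

The map is injective rather than bijective: every input has a unique padded image, but only padded tensors whose extra channels are zero correspond to a valid input.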
We train for 200 epochs using SGD with momentum and a learning rate of 0.1, decayed by a factor of 0.2 after 60, 120 and 160 epochs. Weight decay is set to 5e-4, and dropout with probability 0.1 is applied in the residual block, which is also taken into account when choosing the normalization coefficients. We apply shifts for MNIST, and shifts and flips for CIFAR10, as random data augmentation during training. The inputs for MNIST are normalized to [-0.5, 0.5] and for CIFAR10 as well, after preprocessing each image via subtracting the mean and dividing by the standard deviation of the training set.
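The step-wise learning rate schedule described above can be written as a small function. The zero-based epoch index and the function name are assumptions; the base rate, decay factor and milestones are taken from the text.

```python
def learning_rate(epoch, base_lr=0.1, decay=0.2, milestones=(60, 120, 160)):
    """Learning rate for a zero-based epoch index with step-wise decay."""
    n_decays = sum(epoch >= m for m in milestones)
    return base_lr * decay ** n_decays

schedule = [learning_rate(e) for e in range(200)]
print(schedule[0], schedule[60], schedule[120], schedule[160])
```

Under this reading, the rate stays at 0.1 for the first 60 epochs and is multiplied by 0.2 at each milestone, so the final 40 epochs run at 0.1 * 0.2^3.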