1 Introduction
In this work we consider the problem of designing and training Convolutional Neural Networks (CNNs). The topic has been a major field of research over the last years, after it has shown remarkable success, e.g., in classifying images of hand writing, natural images, videos; see, e.g., [12, 11, 13] and references within. This success has generated thousands of research papers and a few celebrated software packages.
However, the success of CNNs is not fully understood and in fact, tuning network architecture and parameters is very hard in practice. Typically many trial and error experiments are required to find a CNN that is effective for a specific class of data. In addition to the computational costs associated with those experiments, in many cases, small changes to the network can yield large changes in the learning performance. To overcome the difficulty of learning, systematic approaches using Bayesian optimization have recently been proposed to infer the best architecture [8].
Currently used training methods depend, e.g., on the architecture of the network as well as the resolution of the image data. Changing any of those parameters in the training or prediction phase can severely affect the performance of the CNN. For example, CNNs are typically trained using images of a fixed resolution, and classifying images of a different resolution requires interpolation. Such a process can be computationally expensive, particularly if the data represents videos or highresolution 3D images as is common in applications, e.g., in medical imaging and geosciences
[9, 10]).In this paper we derive a framework that allows scaling CNNs across image resolution and depths and thus enables multiscale learning. As a backbone of our methods we present an interpretation of deep CNNs as an optimal control problem involving a nonlinear time dependent differential equations. This understanding leads to a very common structure that is used in fields such as path planning, data assimilation, and nonlinear Kalman filtering; see
[3] and reference within.We present new methods for scaling CNNs from low to highresolution image data and vice versa. We propose an algebraic multigrid approach to adapt the coefficients of the convolution kernel for different scales and demonstrate the importance of this step. Our method allows multiscale learning using image pyramids, where the network is trained at different resolutions. Such a process is known to be very efficient in other fields, both from a computational point of view, and from skipping local minima [14, 5, 1]. The method also allows for the classification of lowresolution images by networks that have been designed for and trained using highresolution images without interpolation of the coarse scale image to finer scales.
We also present a method for scaling the number of layers in CNNs. Our method is based on the interpretation of the forward propagation in CNN as a discretization of a timedependent nonlinear differential equation. In that framework the number of layers corresponds to the number time steps. Our observation motivates the use of multilevel learning algorithms that accelerate the training of deep networks by solving a series of learning problems from shallow to deep architectures.
The rest of this paper is structured as follows. In the next section, we show the connection between time dependent differential equations and CNNs. This connection allows us to introduce the optimization problem as a dynamic control problem. In Sec. 3 we present multiscale methods connecting CNNs across image resolutions and depths. In Sec. 4 we demonstrate the potential of our methods using image classification benchmarks. Finally, in Sec. 5 we summarize the paper.
2 An Optimal Control Perspective on CNNs
In this section we derive a fully continuous formulation of deep convolution neural networks in the framework of optimal control. In Sec. 2.1 we give a continuous interpretation of the spatial convolution and the forward propagation. In Sec. 2.2 we discuss the remaining components of CNNs and present the continuous optimal control problem.
We focus on image classification and assume we are given training data consisting of discrete dimensional images and corresponding labels . In this paper, we consider and thus corresponds to the number of pixels in the data. As common in image processing, we interpret the image data as discretization of a continuous function at the cellcenters of a rectangular grid with equally sized pixels. Here, denotes the image domain and for simplicity we assume square pixels of edge length . We denote the number of layers in the deep CNN by .
2.1 Forward Propagation as a Nonlinear Differential Equation
In this section we establish the interpretation of the forward propagation Residual Neural Networks (ResNN) [7] as a nonlinear differential equation. A simple way to write the forward propagation of a discrete image through a ResNet is
(1) 
Here, are the input features, are the hidden layers, and are the output layers. The matrix maps the input image into the feature space . This matrix can be "learned" or fixed. The parameters need to be determined by the "learning" process. We generalize the original ResNet model by adding the parameter , which helps to derive the continuous interpretation below (the original formulation is obtained for ).
A common choice in CNN is to have the function as a convolution with parameter that represent the convolution weights and bias, leading to the explicit expression
(2) 
Here is a convolution matrix, which is a circulant matrix that represents the convolution and depends on the stencil or convolution kernel, ,
is a bias vector and
is an activation function. The size of the stencil is typically much smaller than the number of pixels in the image and thus
is a sparse matrix. Next, we interpret the depth of the network in a continuous framework. We start by rewriting the forward propagation (1) as(3) 
The left hand side of the above equation is a finite difference approximation to the differential operator with step size . While the approximation used in the original ResNet (using ) is valid if features change sufficiently slowly, the obtained dynamical system can be chaotic if the features change quickly. Having a chaotic system as a forward problem implies that one can expect difficulties when considering the learning problem. Therefore, our first goal is to stabilize the forward propagation process.
To obtain a fully continuous formulation of the forward propagation we note that the convolution weights can be seen as a discretization of continuous functions (whose support is limited to a small region around the origin). This allows to interpret as a discretization of . Upon taking the limit in (3) we obtain the continuous forward propagation process
(4) 
for all , where
is the final time corresponding with the output layer. General stability of ordinary differential equations (ODEs) applies for this process and in particular, it is easy to verify that the system of ODEs is stable as long as the real part of the eigenvalues of the convolution are nonpositive. A second well known problem is vanishing gradients
[2]. This implies that the eigenvalues of the convolution have a strong negative real part. Note that if the eigenvalues of are imaginary, then no decay of the signal occurs, and no vanishing gradients are expected. This can aid in choosing and initializing the network parameters.Given the continuous forward propagation in (4) we interpret (3) as a forward Euler discretization with a fixed time step size of . Thus, the forward propagation is stable as long as the real parts of the eigenvalues of the convolution and the time steps are sufficiently small. We note that there are numerous methods for time integration, some of which provide superior stability of the forward propagation.
2.2 Optimal Control Formulation of Supervised Learning
Having discussed forward propagation, we now briefly review classification and give continuous and discrete formulations of the learning problem.
The hypothesis or classification function, which predicts the label for each data using the values at the output layer, , can be written as
(5) 
where the columns of are classification weights and
are biases for the respective classes. Commonly used choices are softmax, leastsquares, logistic regression, or support vector machines. We have generalized the common notation by adding the parameter
that allows to interpret as a midpoint rule applied to the standard inner product for a sufficiently regular function . The th column of is the discretization of at the cellcenters of the grid. This generalization allows to adjust the weights across image resolutions.For training data containing of continuous functions and labels , learning consists of solving the optimal control problem
(6a)  
(6b) 
Here
is a loss function measuring the mismatch between the predicted and known label and
is a regularization function that penalizes undesired features in the parameters and avoids overfitting. Typically, the problem is not solved to a high accuracy and low accuracy solutions are sufficient. A validation set is often used to determine the stopping criteria for the optimization algorithm.For completeness we note that a discrete version of the optimal control problem is
(7a)  
(7b) 
Note that for simplicity we have ignored the pooling layer, although it can be added in general (The necessity of pooling has been debated in [17]).
3 Multiscale Methods
In this section we present new methods for scaling deep CNNs along two dimensions. In Sec. 3.1 and Sec. 3.2 we discuss restriction and prolongation of convolution operators as a way to scale CNNs along image resolution. In Sec. 3.3 we scale the depth of the network to simplify initialization and accelerate training.
3.1 From HighResolution to LowResolution: Restricting Convolution Operators
Assume first that we are given some image data, , on a mesh with pixel size and a stencil, that operates on this image. Assume also that we would like to apply the fine mesh convolution to an image, , given on a coarser mesh with pixel size . In other words the goal is to find a stencil for which the coarse mesh convolution is equivalent to refining the image data and applying fine mesh convolution with . This problem is well studied in the multigrid literature [18].
Our method for restricting the stencil follows the algebraic multigrid approach; see, e.g., [18] for details and alternative approaches using rediscretization. We assume that the following connection holds between the fine mesh image, , and the coarse mesh image
(8) 
Here, is a prolongation matrix and is a restriction matrix. is an interpolated coarse scale image on the fine mesh and typically, for some
that depends on the dimensionality of the problem. The interpretation is that the coarse scale image is obtained using some linear transformation from the fine scale image (e.g. averaging). Conversely, an approximate fine scale image can be obtained from the coarse scale image by interpolation. This interpretation can easily be extended to 3D data to allow, e.g., classification of videos.
Let be the sparse matrix that represents the convolution on the fine scale. This matrix operates on a vectorized image and is equivalent to convolving the vector with the stencil, . The matrix is circulant and sparse with a few nonzero diagonals. Our goal is to build a coarse scale convolution, that operates on a vector and is consistent with the operation of on a fine scale vector . Using the prolongation and restriction we obtain that
(9) 
That is, given we first prolong it to the mesh , then operate on it with the matrix , which yields a vector on mesh . Finally, we restrict the result to the mesh . This implies that the coarse scale convolution matrix can be expressed as . This definition of the operator can be built directly using any interpolation/restriction operators. Furthermore, assuming that the stencil, is constant on the coarse mesh (that is, it is not changing on the mesh as commonly assumed in CNN), it is straightforward to evaluate it without generating the matrix , as commonly done in algebraic multigrid.
Example 1.
To demonstrate the concept of how to the convolution changes with resolution, we use the following simple example demonstrated in Figure 1. We select an image from the MNIST data set (bottom left) and convolve it with the fine mesh convolution parameterized by the stencil
obtaining the image in the bottom right panel of Figure 1. Now, by restricting the weights using the algebraic multigrid approach, we obtain that on a coarse mesh the weights are,
These weights are used on the coarse scale image (top left panel of Figure 1) to construct the filtered image on the top right panel of Figure 1.
Looking at the weights obtained on the coarse mesh, it is evident that they are significantly different from the fine scale weights. It is also evident from the fine and coarse images, that adjusting the weights on the coarse mesh is necessary if we are to keep a faithful transformation between the images.
The interpretation of images and convolution weights as continuous functions, allows us to work with different imageresolutions. This has two important consequences. First, assume that we have trained our network on some fine scale images and that we are given a coarse scale image. Rather than interpolating the image to a fine mesh (which can be memory intensive and computationally expensive), we transform the stencils to a coarse mesh and use the coarse mesh stencils to classify the coarse scale image. Such a process can be particularly efficient when considering the classification of videos on mobile devices where expanding the video to high resolution can be computationally prohibitive. A second consequence is that we are able to train the network on a coarse mesh and then interpolate the result to a fine mesh. As we see next, this allows us to use a process of image pyramid or multiresolution for the solution of the optimization problem that is at the heart of the training process.
3.2 From LowResolution to HighResolution: Prolongating Convolution Operators
Understanding how to move between different scales allows us to construct efficient algorithms that use inexpensive coarse mesh representations of the problem in order to initialize the problem on finer scales. This is similar to the classical image pyramid process [5] and multilevel methods that are used in applications that range from full waveform inversion to shape from shading [19]. The idea is to solve the optimization problem on a coarse mesh first in order to initialize fine grid parameters. The algorithm is summarized in Algorithm 1.
Solving each optimization problem on coarser meshes is cheaper than solving the problem on finer meshes. In fact, when an image is coarsened by a factor of 2, each convolution step is 4 times cheaper in 2D and 8 times cheaper in 3D. In some cases, such a process leads to linear complexity of the problem [18].
In order to apply such algorithms in our context, we need to address the transformation of the coarse scale operator to a fine scale one. This process is different than classical multigrid where the operator on a fine mesh is given, and a coarse scale representation is desired. As previously discussed, we use the classical multigrid result to transform a fine mesh operator to a coarse one
(10) 
In the classical multigrid implementation, one has a hold on the fine scale operator , and the goal is to compute the coarse scale operator . In our application, throughout the mesh continuation method, we are given the coarse mesh operator and we are to compute the fine mesh operator. In principle, there is no unique fine scale operator given a coarse scale one; however, assuming that the fine scale operator is a convolution with fixed stencil (as in the case of CNN), there is a unique solution. This is a classical result in Fourier analysis on multigrid methods [18]. Since (10) represents a linear connection between and , we extract equations (where is the size of each convolution stencil) that connect the fine scale convolutions to the coarse scale ones. For a convolution stencil of size and linear prolongation/restriction, this is a simple linear system that can be easily solved to obtain the fine scale convolution. A classical multigrid result is that this linear system is wellposed. Assuming that the coarse mesh is a stencil and that the interpolation is linear, the fine mesh stencil is also a stencil which is uniquely determined from the coarse mesh stencil.
3.3 From Shallow to Deep Networks
In this section we consider scaling the number of layers in the network as another way to use the continuous framework. In our case we gradually increase the number of layers keeping the final time constant in order to accelerate learning by reusing parameters from shallow networks to initialize the learning problem for the deeper architecture. Note that the number of layers in the network corresponds to the number of discretization points in the discrete forward propagation. Similar ideas have been used in multigrid [4] and image processing; see, e.g., [15].
To solve a learning task in practice, we first solve the learning problem using a network with only a few layers. Subsequently, we prolongate the estimated parameters of the forward propagation to initialize the optimization problem for the next network that features, e.g., twice as many layers. To this end, simple linear interpolation can be used. We repeat this process until the desired network depth is reached.
Besides realizing some obvious computational savings on shallower networks, the main motivation behind our approach is to obtain good starting guesses for the next level. This is key since, while deeper architectures offer more flexibility to model complicated datalabel relation, deeper networks are notoriously difficult to initialize. Another advantage when using secondorder learning algorithms is the faster convergence rate obtained by warm starting.
4 Experiments
In this section, we demonstrate the benefits of the proposed multiscale algorithms for CNN using two supervised image classification problems.
4.1 Classification of Images Across Resolutions
We demonstrate that using the continuous formulation we are able to classify lowresolution images using CNNs trained on highresolution images and vice versa. Here, no additional learning is performed and, for example, classifying the lowresolution images does require neither interpolation nor highresolution convolutions. This is important for efficient classification on mobile devices etc.
We consider the MNIST dataset and independently train two networks with two layers each using the coarse and fine data, respectively. The MNIST dataset that consists of 60,000 labeled images each with pixels. Since the images are rather coarse, we use only two levels. To obtain coarse scale images the fine scale images are convolved with a Gaussian, and restricted to a coarse mesh using the operator introduced above. This yields a coarse mesh data consisting of images. We randomly divide the datasets into a training set consisting of images, and a validation set consisting of images. In all experiments, we choose a CNN with identical layers, activation function, and a softmax classifier. For optimization we use a BlockCoordinateDescent (BCD) method. Each iteration consists of one GaussNewton step with subsampled Hessian to update the forward propagation parameters and 5 Newton steps to update the weights and biases of the classifier. To avoid overfitting and stabilize the process we enforce spatial smoothness of the classification weights and smoothness across layers for the propagation parameters through derivativebased regularization.
The validation accuracy for the networks on the resolutions they are trained on is around 98.28% and 98.18% for the coarse and fine scale network, respectively. Next, we prolongate the classification weights and apply the multigrid prolongation to the convolution kernels from the coarse network to the fine resolution. Using only the results from the coarse level and no training on the fine level, we get a validation accuracy of %. For comparison, using the original convolution kernels gives a validation accuracy of %.
Next, we restrict the classification weights and convolution kernel of the network trained on fine data. Using this, gives a validation accuracy % (compared to % without restricting the kernels). Again, we note that no training is performed on the coarse resolution.
4.2 Shallow to Deep Training
We show the benefit of the multilevel training strategy using the MNIST example. We solve a sequence of training problems for CNNs whose depths increase in powers of two from 2 layers up to 64. For each CNN, we estimate the parameters using 20 iterations of the BCD. Except the number of layers, all parameters are chosen identical to the previous experiment.
We compare the convergence properties of the learning algorithm using random initialization and the proposed multiscale initialization, which uses the prolongated network parameters from the previous level. The validation accuracy and the value of the loss function can be seen in Fig.(2). It can be seen that the initial guesses provided by the multiscale process have a lower value of the loss function and higher validation accuracy. For the deeper networks where training is most costly, the optimal accuracy is reached after only a few iterations using the multiscale method while for random initialization more iterations are needed to achieve a comparable accuracy.
4.3 Multiscale CNN Training on ImageNet
We demonstrate the computational benefit of coarse mesh training compared with only training the fine mesh CNN. To this end, we select ten categories from the ImageNet dataset
[16]. The ImageNet consists of images that are of varying dimensions and hence, we preprocess the images to be of dimension 224by224 as also proposed in [6]. On this resolution level, the image quality is sufficiently high to visually recognize the objects in the images and the discrete images appear smooth, i.e., are free of block artifacts. This choice leads to nontrivial training problem, where one can expect to reduce training time using our multigrid approach. For each category, there are 1,300 images for the total of 13,000 images. We randomly divide the data into 10,000 images used for training and 3,000 images used for validation. We use ResNet34 architecture shown in Fig. 3 of [6] with two differences, first, the first CNN kernel is of dimension rather than and second, we did not use the fully connected layer, but rather, the output of the average pooling is connected to the classification layer with softmax activation directly. Note that the average pooling ensures the dimension of the penultimate layer to be identical regardless of the dimension of the input layer.We demonstrate the computational benefit of coarse mesh training compared with only training the fine mesh CNN. The first step is coarse mesh training. To this end, we restrict the images to a mesh and train ResNet34 on the coarsened images. Then, we obtain the weights for the finer scale using the method described in Section 3.2. For comparison, we also train ResNet34 using the image data directly. We show in the left subplot of Fig. 3
that the multiscale approach results in considerable reduction in the number of epochs before reaching convergence. We adopt early stopping to stop training if the loss does not improve for 10 Epochs. In the right subplot of Fig.
3, we show the total time to training for multiscale, sum of the time to train for the coarse scale and time to train for the fine scale, compared to directly training on . Note that the multiscale approach not only requires fewer epochs and lower runtime, but also gives lower training and validation errors. To ensure that this is not a phenomenon that is specific to the given traintest data split, we report the average over 5 splits; we have provided additional plots in Figure 4. In all traintest data splits, the multiscale approach produce lower training and validation error in smaller number of epochs.5 Conclusions
In this work, we explore the connection between optimal control and training deep Convolution Neural Networks (CNNs) that enables learning across scales. The foundation of our approaches is a continuous image model and interpreting forward propagation in CNN as discretization of a timedependent nonlinear differential equation. We showed how this mathematical framework can be used to scale deep CNNs along two dimensions: Image resolution and depth. While the obtained multiscale approaches are new in deep learning they are commonly used for the numerical solution of related optimal control problems, e.g., in image processing and parameter estimation.
Our method for connecting low and highresolution images is unique in that it scales the parameter of the network rather than interpolating the image data to different resolutions. To this end, we present an algebraic multigrid approach to computing convolution operators that are consistent with coarse and fine scale images. We exemplified the benefit of our approach in two ways. In Sec. 4.1, we show that CNNs trained on fine resolution images can be adapted and used to classify coarse resolution images and vice versa. Our method is advantageous when memory is limited and interpolation of images or videos is not practical. In Sec. 4.3, we demonstrate that it is possible to use coarse representation of the images to learn the convolution kernels and – after prolongation – use the weights to classify highresolution images.
Our example in Sec. 4.2 shows that scaling the number of layers of the CNN can improve and accelerate the training of deep networks through initialization using results from shallow ones. In our experiment this drastically reduces the number of iterations for the deep networks where costperiteration is high.
Casting CNNs as a continuous optimal control problem of differential equations provides new insights into the field and motivates new ways to solve and regularize the learning problem.
6 Acknowledgements
This work is supported in part by the US National Science Foundation (NSF) award DMS 1522599.
References
 [1] M. W. ans A. Ratcliffe, T. Nangoo, J. Morgan, A. Umpleby, N. Shah, V. Vinje, I. Štekl, L. Guasch, C. Win, G. Conroy, and A. Bertrand. Anisotropic 3d fullwaveform inversion. GEOPHYSICS, 78(20):59–80, 2013.
 [2] Y. BENGIO, P. SIMARD, and P. FRASCONI. Learning LongTerm Dependencies with Gradient Descent Is Difficult. Ieee Transactions on Neural Networks, 5(2):157–166, 1994.
 [3] L. T. Biegler, O. Ghattas, M. Heinkenschloss, D. Keyes, and B. van Bloemen Waanders (Editors). RealTime PDEConstrained Optimization. SIAM, Philadelphia, 2009.
 [4] F. A. Bornemann and P. Deuflhard. The cascadic multigrid method for elliptic problems. Numerische Mathematik, 75(2):135–152, 1996.
 [5] E. Haber and J. Modersitzki. Multilevel methods for image registration. SIAM J. on Scientific Computing, 27:1594–1607, 2004.

[6]
K. He, X. Zhang, S. Ren, and J. Sun.
Deep residual learning for image recognition.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pages 770–778, 2016.  [7] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
 [8] J. M. HernándezLobato, M. A. Gelbart, R. P. Adams, M. W. Hoffman, and Z. Ghahramani. A general framework for constrained bayesian optimization using informationbased search. volume 17, pages 1–53, 2016.
 [9] J. Jiang, P. Trundle, and J. Ren. Medical image analysis with artificial neural networks. Computerized Medical Imaging and Graphics, 34:617–631, 2010.
 [10] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. FeiFei. Largescale video classification with convolutional neural networks. In CVPR, 2014.
 [11] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 61:1097–1105, 2012.
 [12] Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361:255–258, 1995.
 [13] Y. LeCun, K. Kavukcuoglu, and C. Farabet. Convolutional networks and applications in vision. IEEE International Symposium on Circuits and Systems: NanoBio Circuit Fabrics and Systems, page 253–256, 2010.
 [14] J. Modersitzki. Numerical Methods for Image Registration. Oxford, 2004.
 [15] J. Modersitzki. FAIR: Flexible Algorithms for Image Registration. SIAM, Philadelphia, 2009.
 [16] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
 [17] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014.
 [18] U. Trottenberg, C. W. Oosterlee, and A. Schuller. Multigrid. Academic press, 2000.
 [19] R. Zhang, P.S. Tsai, J. Cryer, and M. Shah. Shape from shading: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 218:690–706, 1999.