## 1 Introduction

Convolutional Neural Networks (CNNs) have revolutionized many machine learning and vision tasks such as image classification, segmentation, denoising and deblurring (see (bengio2009learning; lecun2015deep; Goodfellow-et-al-2016; Hammernik_2017; AVENDI2016108) and references therein).

Many different architectures have been proposed, often tailored to specific tasks. In recent years, residual networks (ResNets) have been shown to be successful in dealing with many different tasks (GomezEtAl2017; he2016identity; he2016deep; LiEtAl2017). ResNets have a number of practical advantages (e.g., ease of training and possible reversibility (YangHuiHe2018; Chang2017Reversible)) and are also supported by mathematical theory due to their link to ordinary differential equations (NeuralODE2018; HaberRuthotto2017a; CJWE2018) and, when dealing with imaging data, partial differential equations (RuthottoHaber2018).

The connection between ResNets and differential equations has highlighted the issue of forward stability of the network. Roughly speaking, a network is forward stable when it does not amplify perturbations of the input features due to, e.g., noise or adversarial attacks. It has been shown that stable networks train faster and generalize better (NeuralODE2018). Keeping a network stable requires attention and can be challenging (HaberRuthotto2017a; NAISnet; CJWE2018).

While it is possible to apply ResNets to many different vision problems, one should differentiate between problems that have a small-dimensional output and problems that have a large-dimensional one.

In a small-dimensional output problem, the image is reduced to a small vector. For example, in image classification a low-dimensional vector in $\mathbb{R}^n$ represents the likelihood that the image belongs to each of $n$ different classes. In this case, the network is being used for dimensionality reduction. For these problems the image is typically coarsened a number of times before a prediction is made. Coarsening the image and applying convolutions allows far-away pixels to communicate, utilizing long-range correlations in the image.

In a large-dimensional output problem, the network generates a number of different output images and, most commonly, each output image has at least as many pixels as the input image. One example is image segmentation, where each output image represents the probability of each pixel belonging to a certain class. In depth estimation, image denoising and deblurring, the output image has the same dimension as the input image and, in many cases, contains high spatial frequencies that are absent from the original image. For these problems one is required to work with the original resolution of the image. In this case, a straightforward extension of the ResNet architecture may not be sufficient. This is because the image cannot be coarsened, and therefore modeling interactions between far-away pixels requires very deep networks. This is known as the *field of view problem*; it has been studied in (NIPS2016_6203) and is also common in other image processing techniques such as Total Variation image denoising (RudinOsherFatemi92) and anisotropic diffusion (bsmh; weickert).

Image coarsening can be done as a part of the network. However, image interpolation is then needed in order to return to the original image size and provide dense output. This leads to a different architecture, the U-net (UNET2015), that is more complex than simple ResNets, is less well-understood theoretically, and usually requires many more parameters.

In this paper, we introduce the implicit-explicit network, IMEXnet, and apply it to high-dimensional output problems. Our network is based on simple but effective changes to the popular ResNet architecture, which are motivated by semi-implicit techniques for partial differential equations. Such techniques are used for time-dependent problems arising in computational fluid dynamics and imaging when global information is passed within a small number of iterations or time steps (JFNK2011; Schoenlieb2011). These techniques address both the stability and the field of view issues, while adding a negligible number of parameters and computational complexity.

The paper is structured as follows. In Section 2, we derive the IMEXnet and explore its theoretical properties. In Section 3, we show that our method can be implemented efficiently in existing machine learning packages and demonstrate that it adds only a marginal cost to simple ResNets. In Section 4, we conduct numerical experiments on a synthetic dataset that is constructed to demonstrate the advantages and limitations of the method, as well as on the NYU depth dataset. We summarize the paper in Section 5.

## 2 Semi-implicit Neural Networks

We first briefly review residual neural networks (ResNets) and outline their limitations in terms of stability and the field of view problem. We then derive the basic idea behind our new implicit-explicit IMEXnet as a modification of ResNets. Finally, we analyze the improved stability of our method and discuss its advantages and disadvantages.

### 2.1 Residual Networks

Our starting point is the $j$-th layer of a ResNet, which propagates the features as follows:

$$\mathbf{Y}_{j+1} = \mathbf{Y}_j + h f(\mathbf{Y}_j, \boldsymbol{\theta}_j). \tag{1}$$

Here, $\mathbf{Y}_{j+1}$ are the output features of the $j$-th layer, $\boldsymbol{\theta}_j$ are the parameters that this layer depends on, and $f$ is a nonlinear function. In imaging problems, the parameters typically contain convolution kernels as well as scaling and bias parameters for batch or instance normalization. Here $h$ is a step size that is typically set to $h = 1$. In particular, we explore the structure proposed in (he2016deep) that has the form

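$$f(\mathbf{Y}, \boldsymbol{\theta}) = \mathbf{K}_2\, \sigma\big(\mathcal{N}(\mathbf{K}_1 \mathbf{Y})\big). \tag{2}$$

Here $\mathbf{K}_1$ and $\mathbf{K}_2$ are built using (typically $3 \times 3$) convolution operators, $\sigma$ is a pointwise nonlinear activation function, and $\mathcal{N}$ is a normalization layer that depends on its scaling and bias parameters.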

It is interesting to evaluate the action of this network on some image. At every step of this network, each pixel communicates only with a small patch of pixels around it. Therefore, for high-resolution images, many layers are needed in order to propagate information from one side of the image to another. This is demonstrated in the top two rows of Figure 1, where two delta functions are propagated through a ResNet with 20 layers and a ResNet with 5 layers. Comparing the output features, it is apparent that the second (4 times less expensive) network is unable to propagate information over large distances.

The above discussion highlights the *field-of-view* problem, i.e., that many convolutional layers are needed to model nonlocal interactions between distant pixels. For problems such as image classification, the image is typically coarsened using pooling layers placed between ResNet blocks.
Pooling makes each pixel encompass a larger area and therefore allows information to travel larger distances in the same number of convolution steps.
Coarsening is not applicable in tasks that require a high-dimensional output, as it leads to the loss of important local information.
In these cases, many layers are needed in order to pass information between different parts of the image. This leads to very high computational cost and storage.
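
To make the layer count concrete, the following sketch counts how far information can travel through a stack of stride-1 convolutions. It assumes odd, square stencils without dilation or pooling; the function names are illustrative and not taken from the paper.

```python
import math


def receptive_field(num_convs: int, kernel_size: int = 3) -> int:
    """Side length of the input patch seen by one output pixel after
    `num_convs` stacked stride-1 convolutions with an odd kernel size."""
    return 1 + num_convs * (kernel_size - 1)


def convs_to_travel(distance: int, kernel_size: int = 3) -> int:
    """Minimum number of stacked convolutions needed for information to
    travel `distance` pixels (each convolution reaches kernel_size // 2 pixels)."""
    reach_per_conv = kernel_size // 2
    return math.ceil(distance / reach_per_conv)


# Twenty 3x3 convolutions let each pixel see a 41x41 patch.
patch_after_20 = receptive_field(20)
# Crossing a 256-pixel-wide image (255 pixels edge to edge) takes 255 of them.
convs_across_256 = convs_to_travel(255)
```

Under these assumptions the required depth grows linearly with the image width, which is the cost that coarsening is normally used to avoid.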

At this point, it is worthwhile recalling the differential equation interpretation given to ResNets proposed in (CJWE2018; HaberRuthotto2017a; NeuralODE2018). In this interpretation the ResNet step (1) is viewed as a forward Euler discretization of the ordinary differential equation (ODE)

$$\dot{\mathbf{Y}}(t) = f(\mathbf{Y}(t), \boldsymbol{\theta}(t)). \tag{3}$$

Here, the features $\mathbf{Y}(t)$ and weights $\boldsymbol{\theta}(t)$ are continuous functions of the (artificial) time $t$, which corresponds to the depth of the network.
While it is possible to discretize the system using the forward Euler method (resulting in (1)), many other methods can be used. In particular, in (HaberRuthotto2017a; NeuralODE2018) the midpoint method
was used and Runge-Kutta methods were proposed. These methods are all *explicit* methods, i.e., the state $\mathbf{Y}_{j+1}$ at time $t_{j+1}$ is expressed explicitly in terms of the states
at previous times.
While such methods enjoy simplicity, they suffer from a lack of stability and from their locality.
Indeed, many small steps are needed in order to integrate the ODE for a long time.
In particular, when explicit methods are applied to partial differential equations (PDEs), many time steps are required in order for information to travel on the entire computational mesh.
This problem is well-known and documented in the numerical solution of PDEs, for example, when solving Navier-Stokes equations (gs), the solution of flow in porous media (ChenHuanMaBook) and in cloth simulation for computer graphics (BaraffWitkin1998).
Hence, the relation of convolutional ResNets to the PDEs described in (RuthottoHaber2018) provides an alternative explanation of the field-of-view problem.

### 2.2 The Semi-Implicit Network

One way to accelerate the communication of information across all pixels is to use implicit methods (ap). Such methods express the state at time $t_{j+1}$ implicitly. For example, the simplest implicit method for ODEs is the backward Euler method, where in order to obtain $\mathbf{Y}_{j+1}$ we solve the nonlinear equation

$$\mathbf{Y}_{j+1} - \mathbf{Y}_j = h f(\mathbf{Y}_{j+1}, \boldsymbol{\theta}_{j+1}). \tag{4}$$

The backward Euler method is stable for any choice of $h$ when the eigenvalues of the Jacobian of $f$ have no positive real part. Therefore, it is possible to take arbitrarily large steps in such a network while being robust to small perturbations of the input images due to, e.g., noise or adversarial attacks. Unfortunately, implicit methods can be rather expensive. In particular, the solution of the nonlinear equation (4) is a non-trivial task that can be computationally intensive.

Rather than using a fully implicit method, we derive a new architecture using the computationally efficient Implicit-Explicit method (IMEX) (ars; arw). IMEX is commonly used in fluid dynamics and surface formation and has also been applied in the context of image denoising (Schoenlieb2011). The key idea of the IMEX method is to divide the right-hand side of the ODE into two parts. The first (nonlinear) part is treated explicitly and the second (linear) part is treated implicitly. We design the implicit part so that it can be solved efficiently. In our context, there is no natural division into an explicit and an implicit part and therefore we rewrite the ODE (3) by adding and subtracting a linear invertible matrix

$$\dot{\mathbf{Y}}(t) = \underbrace{f(\mathbf{Y}(t), \boldsymbol{\theta}(t)) + \mathbf{L}\mathbf{Y}(t)}_{\text{explicit term}} - \underbrace{\mathbf{L}\mathbf{Y}(t)}_{\text{implicit term}}. \tag{5}$$

Here, $\mathbf{L}$ is a matrix that we are free to choose. We assume that $\mathbf{L}$ is a symmetric positive definite matrix that is "easy" to invert. As we show next, we can use a particular convolution model for $\mathbf{L}$ that has these properties. Next, we use the forward Euler method for the explicit term and a backward Euler step for the implicit term. The forward propagation through the new network, which we call IMEXnet, then reads

$$\mathbf{Y}_{j+1} = (\mathbf{I} + h\mathbf{L})^{-1}\big(\mathbf{Y}_j + h\mathbf{L}\mathbf{Y}_j + h f(\mathbf{Y}_j, \boldsymbol{\theta}_j)\big), \tag{6}$$

where $\mathbf{I}$ denotes the identity matrix.
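
For readers who want the intermediate step, the update (6) can be recovered as follows. This is a sketch that assumes the splitting in (5) treats $f + \mathbf{L}\mathbf{Y}$ with forward Euler and $-\mathbf{L}\mathbf{Y}$ with backward Euler, both with step size $h$:

$$
\begin{aligned}
\frac{\mathbf{Y}_{j+1} - \mathbf{Y}_j}{h} &= f(\mathbf{Y}_j, \boldsymbol{\theta}_j) + \mathbf{L}\mathbf{Y}_j - \mathbf{L}\mathbf{Y}_{j+1}, \\
(\mathbf{I} + h\mathbf{L})\,\mathbf{Y}_{j+1} &= \mathbf{Y}_j + h\mathbf{L}\mathbf{Y}_j + h f(\mathbf{Y}_j, \boldsymbol{\theta}_j).
\end{aligned}
$$

Setting $\mathbf{L} = 0$ reduces this to the ResNet step (1). For a symmetric positive semi-definite $\mathbf{L}$, all eigenvalues of $\mathbf{I} + h\mathbf{L}$ are at least one, so the linear solve is well-defined for every $h > 0$.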

While the forward propagation may appear more complicated than a simple ResNet step, we show below that the computational complexity of the matrix inversion is similar to that of a convolution, and we emphasize that this construction has some of the favorable properties of an implicit method. Further, we show next that with an appropriate choice of the matrix $\mathbf{L}$ the network is unconditionally stable. This implies that no exploding modes will occur throughout the network training. Also, the matrix $(\mathbf{I} + h\mathbf{L})^{-1}$ is dense, i.e., it couples all the pixels in the image in a single step. For problems where the field of view is important, such methods can be very effective.

To demonstrate this fact we refer to the third row of Figure 1. It shows the forward propagation using the semi-implicit IMEX method, where we choose $\mathbf{L}$ as a group convolution with the weights

(7) |

which is a discrete Laplace operator. We discuss this choice next. Comparing the output images after only 5 time steps to a ResNet with 20 time steps it is apparent that the IMEX method increases the coupling between far away pixels.

### 2.3 Stability of the Method

We now discuss the selection of the matrix $\mathbf{L}$ and its impact on stability. To ensure low computational complexity, we choose $\mathbf{L}$ to be a group convolution. Assuming $c$ channels, this implies that the matrix has the block-diagonal form

$$\mathbf{L} = \begin{pmatrix} \mathbf{L}_1 & & \\ & \ddots & \\ & & \mathbf{L}_c \end{pmatrix}.$$

In this way, the implicit step leads to $c$ independent linear systems (which can be solved in parallel) and it allows us to use tools commonly available in most software packages. As we show in the next section, such a matrix is stored as a 3D tensor and can be quickly inverted.

Next, to study the magnitude of the entries of the matrix $\mathbf{L}$, we analyze its behavior on a simple model problem

$$\dot{y}(t) = \lambda y(t), \qquad y(0) = y_0. \tag{8}$$

Here, $\lambda$ represents an eigenvalue of the linearized network and we assume that its real part is not positive, $\mathrm{Re}(\lambda) \le 0$. In this case, the norm of the solution $y(t)$ is bounded by the norm of $y_0$ for all times $t$.

As discussed above, the usual ResNet is equivalent to the forward Euler method and reads

$$y_{j+1} = y_j + h\lambda y_j = (1 + h\lambda)\, y_j.$$

This equation is stable (i.e., $|y_{j+1}| \le |y_j|$) if and only if

$$|1 + h\lambda| \le 1.$$

Hence, the usual ResNet may be unstable when $|\lambda|$ is large unless $h$ is chosen small enough, which is computationally expensive. Now, consider the semi-implicit model with $\mathbf{L} = \alpha \mathbf{I}$, which can be written as

$$y_{j+1} = \frac{1 + h\alpha + h\lambda}{1 + h\alpha}\, y_j,$$

where a large $h$ can be used as long as $\alpha$ is chosen to ensure the stability of the scheme. Indeed, since we assume that the real part of $\lambda$ is non-positive, it is always possible to choose $\alpha$ such that $\left|\frac{1 + h\alpha + h\lambda}{1 + h\alpha}\right| \le 1$, which implies that stability is conserved independent of $h$. The above discussion can be summarized by the following theorem:

###### Theorem 1

Let the dynamical system be a linearization of a nonlinear dynamical system and let with be the negative real part of any of the eigenvalues of the Jacobian, . Then, if we choose such that

(9) |

the magnification factor between layers in the IMEX method (6) has absolute value at most one, and the method is stable.

The proof of this theorem is straightforward by computing the absolute value of the magnification factor. It is also important to note that as $h \to 0$ we can choose $\alpha \to 0$ and keep stability.
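
The computation behind this statement can be sketched as follows, assuming $\mathbf{L} = \alpha\mathbf{I}$ with $\alpha \ge 0$, a step size $h > 0$, and writing the eigenvalue as $\lambda = a + ib$ with $a \le 0$ (the symbols $a$ and $b$ are introduced here only for illustration). The magnification factor is $(1 + h\alpha + h\lambda)/(1 + h\alpha)$, and

$$
\begin{aligned}
|1 + h\alpha + h\lambda|^2 \le (1 + h\alpha)^2
&\iff (1 + h\alpha + ha)^2 + h^2 b^2 \le (1 + h\alpha)^2 \\
&\iff 2ha\,(1 + h\alpha) + h^2 (a^2 + b^2) \le 0 \\
&\iff h\,|\lambda|^2 \le -2a\,(1 + h\alpha).
\end{aligned}
$$

For $a < 0$ this holds whenever $\alpha \ge \frac{|\lambda|^2}{2|a|} - \frac{1}{h}$. For example, $\lambda = -10$ and $h = 1$ require $\alpha \ge 4$, whereas the forward Euler factor satisfies $|1 + h\lambda| = 9$. In the borderline case $a = 0$, $b \ne 0$ the last inequality cannot be satisfied by any finite $\alpha$, so purely oscillatory modes need a strictly negative real part for this argument to apply.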

In our numerical experiments we pick to be relatively large (in the range 1-10). We noticed that around this range of values the method is rather insensitive to its choice.
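
The effect of the implicit term on the model problem can be checked numerically. The snippet below is a minimal sketch that assumes $\mathbf{L} = \alpha\mathbf{I}$ and uses an illustrative eigenvalue and step size; none of the values or function names are taken from the paper's experiments.

```python
def forward_euler_factor(lam: complex, h: float) -> complex:
    """Per-layer magnification factor of the ResNet (forward Euler) step."""
    return 1 + h * lam


def imex_factor(lam: complex, h: float, alpha: float) -> complex:
    """Per-layer magnification factor of the IMEX step with L = alpha * I."""
    return (1 + h * alpha + h * lam) / (1 + h * alpha)


lam, h = -10.0, 1.0  # illustrative eigenvalue and step size
resnet_growth = abs(forward_euler_factor(lam, h))
# Magnification for a few candidate values of alpha.
imex_growth = {alpha: abs(imex_factor(lam, h, alpha)) for alpha in (1.0, 5.0, 10.0)}
```

For this particular eigenvalue the explicit step amplifies perturbations at every layer, while sufficiently large values of `alpha` bring the factor below one without reducing the step size.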

## 3 Numerical Implementation and Computational Costs

We show in detail that our network, despite being slightly more complex than the standard ResNet, can be implemented using building blocks that exist in common machine learning frameworks and can benefit, e.g., from GPU acceleration. In particular, we discuss the computation of the implicit step, that is, solving the linear system

$$(\mathbf{I} + h\mathbf{L})\,\mathbf{Y} = \mathbf{R},$$

where $\mathbf{L}$ is constructed above as a group-wise convolution and $\mathbf{R}$ collects the explicit terms.

To solve the system efficiently, we use the representation of convolution in Fourier space, which states that the convolution between a kernel $\mathbf{A}$ and the features $\mathbf{Y}$ can be computed as

$$\mathbf{A} * \mathbf{Y} = \mathcal{F}^{-1}\big((\mathcal{F}\mathbf{A}) \odot (\mathcal{F}\mathbf{Y})\big),$$

where $\mathcal{F}$ is the Fourier transform, $*$ is a convolution and $\odot$ is the Hadamard element-wise product. Here, and in the following, we assume periodic boundary conditions on the image data. This implies that if we need to compute the product of the inverse of the convolution operator defined by $\mathbf{A}$ (assuming it is invertible) with a vector, we can simply divide element-wise by the Fourier transform of $\mathbf{A}$, i.e.,

$$\mathbf{A}^{-1} * \mathbf{Y} = \mathcal{F}^{-1}\big((\mathcal{F}\mathbf{Y}) \oslash (\mathcal{F}\mathbf{A})\big),$$

where $\oslash$ denotes element-wise division.

In our case the kernel is associated with the matrix $\mathbf{I} + h\mathbf{L}$, which is invertible, e.g., when we choose $\mathbf{L}$ to be positive semi-definite. Thus, we define

$$\mathbf{L} = \mathbf{B}^\top \mathbf{B},$$

where $\mathbf{B}$ is a (trainable) group-wise convolution operator. A simple torch code to compute the step is presented in Algorithm 1.

Using Fourier methods, we need the convolution kernel to have the same size as the image we convolve it with. This is done by generating an array of zeros that has the same size as the image and inserting the entries of the convolution kernel into the appropriate places. The technique is explained in detail in (nagyHansenBook).
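
The idea can be illustrated without any framework. The following is a minimal one-dimensional sketch, not the torch code of Algorithm 1: it assumes a single channel, periodic boundary conditions, a symmetric stencil, and uses a naive discrete Fourier transform as a stand-in for an FFT routine. All function names are hypothetical.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT routine)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for m in range(n))
        out.append(s / n if inverse else s)
    return out


def embed_kernel(stencil, n):
    """Zero-pad a centered stencil to length n, wrapping it so its center sits at index 0."""
    kernel = [0.0] * n
    half = len(stencil) // 2
    for offset, weight in enumerate(stencil):
        kernel[(offset - half) % n] += weight
    return kernel


def implicit_step(rhs, stencil, h):
    """Solve (I + h L) y = rhs, where L is periodic convolution with a symmetric stencil."""
    symbol = dft(embed_kernel(stencil, len(rhs)))   # eigenvalues of L
    rhs_hat = dft(rhs)
    y_hat = [r / (1 + h * s) for r, s in zip(rhs_hat, symbol)]  # element-wise division
    return [v.real for v in dft(y_hat, inverse=True)]


def apply_operator(y, stencil, h):
    """Apply (I + h L) directly in the spatial domain; used only to check the solve."""
    n, half = len(y), len(stencil) // 2
    return [y[i] + h * sum(w * y[(i + off - half) % n] for off, w in enumerate(stencil))
            for i in range(n)]


n, h = 16, 10.0
laplacian = [-1.0, 2.0, -1.0]        # 1D discrete Laplacian, positive semi-definite
rhs = [1.0] + [0.0] * (n - 1)        # a single delta function
y = implicit_step(rhs, laplacian, h)
residual = max(abs(a - b) for a, b in zip(apply_operator(y, laplacian, h), rhs))
```

In this example every entry of `y` is nonzero after a single solve, which is the global coupling described above; a two-dimensional, multi-channel version would apply the same division per channel using a 2D transform.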

Let us now discuss the memory and computational effort involved with the method. To be more specific, we analyze the method for a single ResNet layer with channels, applied to an image of size . Assuming that the stencil size of the convolutions is , applying such a layer to an image requires a computational cost of operations and a memory that is to store the weights.

For the implicit networks we have the usual explicit step followed by the implicit step. The implicit step is a group convolution and requires additional operations, where the term results from the Fourier transform. Since is typically much smaller than the additional cost of the implicit step is insignificant.

The memory footprint of the implicit step is also very small. It requires only additional coefficients. For problems where the number of channels is larger than say, this cost represents less than additional storage. Thus, the improvement we obtain to ResNet comes with a very low cost of both computations and memory.
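
The storage argument can be made concrete with a small count. The sketch below assumes a layer with the same number of input and output channels, a square stencil, and one implicit kernel per channel; it ignores biases and normalization parameters, and the channel count is illustrative.

```python
def dense_conv_weights(channels: int, stencil: int = 3) -> int:
    """Weights of a convolution that couples all input channels to all output channels."""
    return stencil * stencil * channels * channels


def group_conv_weights(channels: int, stencil: int = 3) -> int:
    """Weights of a group convolution with one stencil per channel (no channel mixing)."""
    return stencil * stencil * channels


channels = 64  # illustrative width
overhead = group_conv_weights(channels) / dense_conv_weights(channels)
```

Under these assumptions the implicit kernel adds a fraction of one over the number of channels relative to a single dense convolution, and half of that relative to a layer that contains two dense convolutions.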

## 4 Numerical Experiments

In this section, we conduct two numerical experiments that demonstrate the points discussed above. In the first, we experiment with semantic segmentation of a synthetic dataset that we call the Q-tips dataset. We designed this dataset to expose the limitations of explicit methods and demonstrate the improvements of semi-implicit methods. In the second, we show the advantages of our approach on the NYU Depth Dataset V2, which contains images of different room types together with their depth. The goal of the training here is to predict the depth map given the image of the room. While the two problems differ in their output, they share the need for nonlocal coupling across large distances in order to deliver an accurate prediction.

### 4.1 The Q-tips Dataset

We introduce a synthetic semantic segmentation dataset intended to quantify the effect of a network’s receptive field. In this dataset every image contains a single object composed of a rectangular gray midsection with either a white or black square at each end. We define the object classes according to the combination of markers present, resulting in three classes (white-white, white-black, and black-black). Accurate segmentation of the object requires information to be shared across the entire object, emphasizing the effect of the network’s receptive field.

For the following experiments we generate a dataset of 1024 training examples and 64 validation examples. Each image contains a single object whose length, width, and orientation are randomly selected from discrete uniform distributions.

In order to evaluate the effect of our proposed semi-implicit architecture, we train two nearly identical 12-layer IMEXnets with weights that are randomly initialized from a uniform distribution. The opening layer expands the single-channel input to 64 channels, and the width is subsequently doubled every 4 layers to result in a 224 channel output before the classifier. Neither network contains any pooling layers, and the convolution layers are padded such that the input and output have the same resolution. In order to make one of the networks purely explicit we set $\mathbf{L} = 0$, which prevents any implicit coupling, effectively resulting in a ResNet. In both cases, we use stochastic gradient descent to minimize the weighted cross entropy loss for 200 epochs with a learning rate of 0.001 and a batch size of 8. The loss is weighted according to the normalized class frequencies calculated from the entire dataset in order to address the class imbalance.
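
To illustrate the structure of such an image, the following sketch draws a single axis-aligned object and its label mask. It is not the authors' generator: the intensities, label ids, background value, and the restriction to horizontal objects are all assumptions made for this example.

```python
BACKGROUND, BLACK, GRAY, WHITE = 0.25, 0.0, 0.5, 1.0   # assumed intensities
CLASS_IDS = {                                           # assumed label ids; 0 is background
    ("white", "white"): 1,
    ("white", "black"): 2,
    ("black", "black"): 3,
}
MARKER_VALUE = {"black": BLACK, "white": WHITE}


def make_qtip(size, length, width, markers, top, left):
    """Return (image, mask) for one horizontal object in a size x size image.

    The object is `length` pixels long and `width` pixels thick, with a
    width x width marker square at each end and a gray midsection.
    """
    assert length > 2 * width
    assert 0 <= left and left + length <= size and 0 <= top and top + width <= size
    label = CLASS_IDS[tuple(sorted(markers, reverse=True))]  # marker order is irrelevant
    image = [[BACKGROUND] * size for _ in range(size)]
    mask = [[0] * size for _ in range(size)]
    for r in range(top, top + width):
        for c in range(left, left + length):
            if c < left + width:
                value = MARKER_VALUE[markers[0]]
            elif c >= left + length - width:
                value = MARKER_VALUE[markers[1]]
            else:
                value = GRAY
            image[r][c] = value
            mask[r][c] = label  # every object pixel carries the object's class
    return image, mask


image, mask = make_qtip(size=32, length=12, width=2, markers=("black", "white"), top=5, left=3)
```

Note that the gray midsection looks identical in all three classes, so a pixel in the middle of the object can only be labeled correctly if information from both ends reaches it.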

For comparison we present various error measurements for both networks on the validation dataset in Table 1.

Table 1: Error measurements on the validation dataset.

| Network | Parameters | IOU | Loss | Accuracy (%) |
|---|---|---|---|---|
| IMEXnet | 2701440 | 0.926 | 0.0982 | 99.56 |
| ResNet | 2691648 | 0.741 | 0.3332 | 98.18 |

Note that the IMEX method did much better in terms of loss and intersection over union (IOU). The pixel accuracy counts the background and is therefore somewhat misleading. An example of the results on the testing set is plotted in Figure 2.
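
The gap between the two metrics is easy to reproduce on a toy example. The sketch below uses made-up labels (90 background pixels and 10 object pixels) and a per-class IOU; how the IOU in Table 1 was aggregated over classes is not stated, so this is only an illustration of the effect.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label matches the target label."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


def class_iou(pred, target, cls):
    """Intersection over union of the pixels assigned to class `cls`."""
    intersection = sum(p == cls and t == cls for p, t in zip(pred, target))
    union = sum(p == cls or t == cls for p, t in zip(pred, target))
    return intersection / union if union else float("nan")


# Toy labels: class 0 is background, class 1 is the true object class.
target = [0] * 90 + [1] * 10
# The background is perfect, but 6 of the 10 object pixels get the wrong object class.
pred = [0] * 90 + [1] * 4 + [2] * 6

accuracy = pixel_accuracy(pred, target)
object_iou = class_iou(pred, target, cls=1)
```

Here most object pixels are mislabeled, yet the accuracy stays high because the background dominates the pixel count, while the object IOU drops sharply.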

Figure 2: Panels from left to right: image, segmentation, IMEX prediction, ResNet prediction.

The table and images demonstrate that the ResNet cannot label the image correctly. In particular, the center piece of the rod is not labeled consistently with the ends of the rod; instead, it is classified seemingly at random as one of the three classes. Adding an implicit layer, with negligible memory footprint and computational complexity, resolves the problem and yields a near-perfect segmentation.

### 4.2 The NYU Depth Dataset

The NYU-Depth V2 dataset is a set of indoor images recorded by both RGB and Depth cameras from the Microsoft Kinect. Four different scenes from the dataset are plotted in Figure 3.

The goal of our network is to use the RGB images in order to predict the depth images. We use a subset of the dataset, consisting of kitchen scenes, in order to train a network to achieve this task. The network contains three ResNet blocks and three bottleneck blocks and is plotted in Figure 4.

The ResNet has only 506,928 parameters. For the IMEXnet we use an identical network but add implicit layers. This adds only roughly 27,000 more parameters for the implicit network. We use 500 epochs to fit the data. The initial misfit is . Using the ResNet architecture we are able to decrease the misfit to . The small addition of parameters for the implicit method allows us to fit the data to . The convergence of the two methods is presented in Figure 5.
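
The relative size of the implicit layers follows directly from the reported parameter counts. The short calculation below uses only the numbers stated in Table 1 and in this section; the helper name is illustrative, and the NYU figure is approximate because the text gives the increase only as roughly 27,000.

```python
def relative_overhead(base_params: int, extra_params: int) -> float:
    """Extra parameters as a fraction of the explicit network's parameters."""
    return extra_params / base_params


# Q-tips experiment: IMEXnet minus ResNet parameter counts from Table 1.
qtips_extra = 2_701_440 - 2_691_648
qtips_overhead = relative_overhead(2_691_648, qtips_extra)

# NYU depth experiment: "roughly 27,000" additional parameters.
nyu_overhead = relative_overhead(506_928, 27_000)
```

This gives well under one percent additional parameters for the Q-tips network and roughly five percent for the smaller depth-estimation network.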

We found that for learning one set of images (i.e., kitchens), it is possible to use rather few examples. For the kitchen dataset we used only training images, validation image and one test image. The results on the test image are plotted in Figure 6.

Figure 6: Panels: kitchen scene, depth map, ResNet recovery, implicit net recovery.

We observe that we are able to obtain good results even when the number of images used for the training is small. These results echo the results obtained for image filtering using variational networks presented in (PockDepth).

Comparing the ResNet and the IMEXnet, we see that the predictions of the IMEXnet are smoother than those obtained from the ResNet. This is not surprising, as the implicit step in the IMEXnet can be seen as a smoothing step. We also note that in our numerical experiments the implicit method was less sensitive to initialization. This is again expected, as the implicit step adds stability. Indeed, if we compare an explicit ResNet with an implicit one with the same kernels, the analysis suggests that while the explicit network may be unstable, the implicit one is stable, assuming $\mathbf{L}$ is chosen appropriately.

## 5 Summary

In this paper we introduce the IMEXnet architecture for computer vision tasks, which is inspired by the semi-implicit IMEX methods commonly used to solve partial differential equations. Our new network extends standard ResNet architectures by adding implicit layers that involve a group-wise inverse convolution operator after each explicit layer. We have discussed and exemplified that this approach can resolve the field of view problem as well as the problem of the forward stability of the network. This makes this type of network suitable for problems where the dimension of the output is similar to the dimension of the input, such as semantic segmentation and depth estimation, and where nonlocal interactions are needed.

We exemplify, using PyTorch, that our method can be implemented efficiently using the available built-in functions. The computational complexity and memory allocation added by the implicit step are small compared with the complexity and memory needed by the ResNet. We have shown that although the method has a marginally larger cost than ResNet, it can be much more efficient in training as well as in validation on two simple model problems of semantic segmentation and depth estimation from images.

While we have explored one semi-implicit method (5), it is important to realize that we are free to choose other models with similar properties. One attractive option is to remove the matrix $\mathbf{L}$ from the explicit part, and consider the diffusion-reaction problem

$$\dot{\mathbf{Y}}(t) = f(\mathbf{Y}(t), \boldsymbol{\theta}(t)) - \mathbf{L}\mathbf{Y}(t). \tag{10}$$

Equations of this type have been used extensively to model nonlinear phenomena such as pattern formation and can exhibit interesting behavior, e.g., they can lead to nonlinear waves. These systems were already studied by Turing (Turing52) and have been studied extensively in (murray2; Ruuth2; Witkin1991). A similar treatment using an IMEX integration scheme leads to the method

$$\mathbf{Y}_{j+1} = (\mathbf{I} + h\mathbf{L})^{-1}\big(\mathbf{Y}_j + h f(\mathbf{Y}_j, \boldsymbol{\theta}_j)\big). \tag{11}$$

Following the analysis in Section 2.3, we see that this network has better stability properties. In particular, we can remove the restriction that the real parts of the eigenvalues of the Jacobian be non-positive. Since equations of this type are used for pattern formation, this type of network may have advantages when the output contains patterns, e.g., in the segmentation of textures. Detailed experimentation and evaluation of this approach is an item of future work.
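
A sketch of the corresponding model-problem analysis, assuming as in Section 2.3 that $\mathbf{L} = \alpha\mathbf{I}$ with $\alpha \ge 0$ and that $f$ is replaced by multiplication with an eigenvalue $\lambda$: the IMEX step for the diffusion-reaction problem becomes

$$
y_{j+1} = \frac{1 + h\lambda}{1 + h\alpha}\, y_j,
\qquad
\left|\frac{1 + h\lambda}{1 + h\alpha}\right| \le \frac{1 + h|\lambda|}{1 + h\alpha}.
$$

The right-hand side is at most one whenever $\alpha \ge |\lambda|$, regardless of the sign of $\mathrm{Re}(\lambda)$ and of the step size $h$. Under these assumptions, a sufficiently large $\alpha$ keeps the discrete iteration bounded even for modes that would grow under the reaction term alone.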

Semi-implicit methods play a huge role in the solution of problems in many fields. We believe that this paper illustrates that they can also play a large role in the field of machine learning using deep neural networks.

## Acknowledgements

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). The second author's work is supported by the Mitacs Accelerate program and Xtract AI.