Multigrid-in-Channels Architectures for Wide Convolutional Neural Networks

06/11/2020 ∙ by Jonathan Ephrath, et al. ∙ Emory University ∙ Ben-Gurion University of the Negev

We present a multigrid approach that combats the quadratic growth of the number of parameters with respect to the number of channels in standard convolutional neural networks (CNNs). It has been shown that there is a redundancy in standard CNNs, as networks with much sparser convolution operators can yield similar performance to full networks. The sparsity patterns that lead to such behavior, however, are typically random, hampering hardware efficiency. In this work, we present a multigrid-in-channels approach for building CNN architectures that achieves full coupling of the channels, and whose number of parameters is linearly proportional to the width of the network. To this end, we replace each convolution layer in a generic CNN with a multilevel layer consisting of structured (i.e., grouped) convolutions. Our examples from supervised image classification show that applying this strategy to residual networks and MobileNetV2 considerably reduces the number of parameters without negatively affecting accuracy. Therefore, we can widen networks without dramatically increasing the number of parameters or operations.




1 Introduction

Convolutional neural networks (CNNs) LeCun et al. (1990) have achieved inspiring accuracy for image classification, semantic segmentation, and many other imaging tasks Krizhevsky et al. (2012); Girshick et al. (2014). The essential idea behind CNNs is to define the linear operators in the neural network as convolutions with small-dimensional kernels. This increases the computational efficiency of the network (compared to fully connected networks) due to the sparsity of the operators and the considerable reduction in the number of weights. CNNs are among the most effective computational tools for processing high-dimensional data. The general trend in the development of CNNs has been to make deeper and more complicated networks to achieve higher accuracy Szegedy et al. (2015).

In practical applications of CNNs, the network's feature maps are divided into channels, and the number of channels, c, can be defined as the width of the layer. A standard CNN layer connects every input channel with every output channel. Hence, the number of convolution kernels per layer equals the product of the numbers of input and output channels. Assuming the number of output channels is proportional to the number of input channels, the number of operations and parameters grows quadratically in the width, which causes immense computational challenges. When the number of channels is large, convolutions are the most computationally expensive part of training and prediction.

This trend is exacerbated by wide architectures with several hundred or thousand channels, which are particularly effective in classification tasks that involve a large number of classes. Widening a network is often more effective, both in accuracy and computational efficiency, than deepening it Zagoruyko and Komodakis (2016). However, the quadratic scaling causes the number of weights to reach hundreds of millions and beyond Huang et al. (2019), and the computational resources (power and memory) needed for training and running ever-growing CNNs surpass the resources of common systems Bianchini and Scarselli (2014). This motivates us to design more efficient network architectures with competitive performance.

Figure 1: An illustration of a three-level multigrid (additive) cycle for 16 input channels. The group size is 4 in each of the convolutions. Restrict and Prolong denote the grid transfer operators R and P.

Contribution We propose network architectures whose layers connect all channels using only O(c) convolutions and O(c) weights. For a given computational budget, this linear scaling allows us to use wider, deeper, and essentially more expressive networks. To this end, we develop multigrid-in-channels approaches that achieve "global" connectivity of the channels using multigrid cycles consisting of grouped convolutions. Each level in the cycle acts on a particular "scale" (width) in the channel space. Coarser levels in the cycle are defined by averaging representative channels from different groups and applying grouped convolutions to those averaged representatives. Therefore, coarser levels have fewer channels and can effectively connect different fine-level groups. The multigrid idea can be realized in many different ways, and we propose two variants: a simple, additive approach (see also Fig. 1) and a more advanced, and more effective, multiplicative approach. We stress that, due to the multigrid cycle, our network overcomes the limitation of grouped convolutions, which restrict the connectivity to small groups only and are therefore not expressive enough in general.

Motivation Multigrid methods Trottenberg et al. (2000) are primarily used to solve differential equations and graph problems related to diffusion processes (e.g., Markov chains, Sterck et al. (2011); Treister and Yavneh (2011)). Generally, multigrid methods utilize a hierarchy of grids. They are based on the principle that a local process on a fine grid can only effectively "smooth" the error in an iterative solution process. That error can be approximated by a suitable procedure on a coarser grid, leading to two advantages. First, coarse-grid procedures are less expensive (have fewer grid points or nodes) than fine-grid procedures. Second, traversing different scales leads to faster convergence of the solution process. Another way to look at this process is that the multigrid hierarchy efficiently transfers information across all the grid points using local processing only, at different levels. Classical multigrid methods rely on multiscale representations of functions in space, but can also be used to tackle long processes in time Falgout et al. (2014). In this work, we apply this idea to the width of the network (the number of channels) to counter the known redundancy in the number of parameters in CNNs. By imposing a fixed grouped connectivity in channels, we keep the number of convolutions linearly proportional to the network's width.

2 Related work

Multigrid methods in deep learning

Multigrid has been abundantly applied across the computational sciences, e.g., in partial differential equations Sterck et al. (2011); Treister and Yavneh (2011) and sparse optimization Treister and Yavneh (2012); Treister et al. (2016), to name a few. In training CNNs, multigrid has been used, e.g., to warm-start the training of networks on high-resolution images with training on low-resolution images Haber et al. (2018), adopting a multiscale approach in space. Similarly, Pelt and Sethian (2018); Ke et al. (2017) define multiscale architectures that extract and combine features from different image resolutions. The DeepLabV3 architecture for semantic segmentation Chen et al. (2017) also exploits multiscale representations. Multigrid has also been used in the layer (or time) dimension for residual networks, e.g., to warm-start the training of a deep network by interpolating the weights of a trained shallow network with a larger step size Chang et al. (2017), and to parallelize the training through the layers Gunther et al. (2020). The above works apply the multigrid idea either in space or in layers (depth), while in this work we apply the multigrid idea in the channel space.

Pruning and sparsity Reducing the number of parameters in CNNs by limiting the connectivity between channels has been a central theme recently. Among the first approaches are the methods of pruning Hassibi and Stork (1992); Han et al. (2015); Li et al. (2017) and sparsity Changpinyo et al. (2017); Han et al. (2017), which have typically been applied to already trained networks. It has been shown that once a network is trained, many of its weights can be removed without hampering its performance. However, the resulting connectivity is typically unstructured, which may lead to inefficient deployment of the networks on hardware. While pruning does not save computations during training, it still serves as a proof of concept that the full connectivity between channels is unnecessary, and that there is redundancy in CNNs Molchanov et al. (2017).

Depth-wise, group-wise, and shuffle convolutions Another recent effort to reduce the number of parameters in networks is to define architectures based on separable convolutions Howard et al. (2017); Sandler et al. (2018); Wang et al. (2016); Tan and Le (2019); Ephrath et al. (2020). These CNNs use spatial depth-wise convolutions, which filter each input channel separately, and point-wise (1×1) convolutions, which couple all the channels. A popular architecture is the MobileNet, which involves significantly fewer parameters since the fully coupled operators use only 1×1 stencils. The majority of the weights in MobileNetV2 Sandler et al. (2018) are in the point-wise operators, as their number scales with O(c^2). The strength of MobileNetV2 Sandler et al. (2018), and of EfficientNet Tan and Le (2019) that improved it, is the inverted bottleneck structure, which takes a rather narrow network (with relatively few channels) and expands it by a significant factor to perform the depth-wise convolution and non-linear activation. This way, the number of parameters that scales quadratically in the width is relatively small compared to the number of spatial convolutions and activations. The ShuffleNet Zhang et al. (2018); Ma et al. (2018) reduces the parameters of the point-wise operator by applying convolutions to half of the channels and then shuffling them.
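The parameter split described above can be sketched numerically (a pure-Python illustration; the channel count and helper name are ours, not from the papers cited):

```python
def separable_params(c, kernel_size=3):
    """Parameter count of a depth-wise separable layer with c channels:
    depth-wise spatial kernels (one per channel) plus a fully coupled
    1x1 point-wise convolution."""
    depthwise = c * kernel_size ** 2   # linear in the width c
    pointwise = c * c                  # quadratic: dominates for large c
    return depthwise, pointwise

dw, pw = separable_params(512)
print(dw, pw)           # 4608 vs 262144: the point-wise part dominates
print(512 * 512 * 9)    # a fully coupled 3x3 layer: 2359296
```

Even though the point-wise part is far cheaper than a fully coupled 3×3 layer, it is still the quadratic-in-width term, which is exactly what the multigrid cycle targets.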

3 Preliminaries and notation

A popular CNN architecture is the residual network (ResNet) He et al. (2016a, b), whose j-th layer is

x_{j+1} = x_j + K_2 σ(N(K_1 x_j)),   (1)

where x_j and x_{j+1} are the input and output features, respectively, K_1 and K_2 represent convolution operators, σ is a non-linear point-wise activation function, typically the ReLU σ(x) = max(x, 0), and N is a normalization operator, often chosen to be batch normalization. In classification problems, an input image y is filtered by n layers, given by (1) and occasional pooling, resulting in x_n, which is used as the input of a linear classifier to determine the class of y. n represents the depth of the network. In typical ResNets, the layer (1) is the main driving force of the network and is the dominating computational operation. While (1) is used in the original ResNet, deeper networks are typically based on the "bottleneck" ResNet version that includes three convolutions per layer:

x_{j+1} = x_j + K_3 σ(N(K_2 σ(N(K_1 x_j)))).   (2)

Here, K_1 and K_3 are fully coupled 1×1 convolutions, and K_2 is a 3×3 convolution, which can be a grouped (or depth-wise) convolution to reduce the number of parameters and increase the ratio between activations and parameters Sandler et al. (2018); Xie et al. (2017).

A standard convolution layer takes a tensor of size c_in × h × w, representing c_in channels of feature maps with h·w pixels each. The output consists of c_out feature maps, where each one is a linear sum of the input maps, convolved with a kernel, typically of size 3×3 or 1×1. Hence, the convolution layer requires O(c_in · c_out · h · w · m) operations, where m is the number of weights in the convolution kernel (e.g., m = 9 for a 3×3 kernel). In matrix form, the convolution operators in (1) have the form

K = [ C_{1,1} ⋯ C_{1,c_in} ; ⋮ ⋱ ⋮ ; C_{c_out,1} ⋯ C_{c_out,c_in} ],   (3)

where C_{i,j} is the sparse matrix associated with the (i,j)-th convolution kernel. Overall, each convolution layer consists of c_in · c_out convolutions. Since practical implementations of convolutional ResNets (and variants) often use hundreds or thousands of channels, the full coupling leads to large computational costs and to millions of parameters for each layer, which may not always be necessary.
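To make the quadratic scaling concrete, the kernel and parameter count of a fully coupled layer can be sketched in pure Python (the helper name and channel counts are illustrative):

```python
def dense_conv_params(c_in, c_out, kernel_size=3):
    """Parameters of a fully coupled convolution layer:
    one kernel_size x kernel_size kernel per (input, output) channel pair."""
    return c_in * c_out * kernel_size ** 2

# Doubling the width quadruples the parameter count.
print(dense_conv_params(256, 256))   # 256 * 256 * 9 = 589824
print(dense_conv_params(512, 512))   # 4x as many: 2359296
```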

4 Multigrid-in-channels CNN architectures

Our architectures achieve full coupling with only O(c) convolutions by replacing the dense (in channel space) operator in (3) with a sparse, structured, and load-balanced multigrid scheme. Thereby, we can model wider networks for a given budget of parameters and floating point operations (FLOPs). We refer to a nonzero block C_{i,j} in (3) as a connection between channels i and j, and we wish to limit this connectivity. To this end, we use a multigrid scheme that restricts the connectivity between channels by combining grouped convolutions, a standard operator that is available in common frameworks. In matrix form, the grouped operator with group size g is block diagonal:

K = blockdiag(K^{(1)}, K^{(2)}, …, K^{(c/g)}),   (4)

where each block K^{(i)} is a fully coupled operator acting on a disjoint group of g channels. More precisely, the ResNet step in (1) with fully coupled K_1 and K_2 provides interactions between all the channels. However, if the block structure in (4) is used, most of the interactions between channels are ignored, and only a few are computed. Since limiting the interactions between channels may reduce performance, we recover these connections using a multigrid hierarchy in the channels.
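The block-diagonal pattern of a grouped operator, and its linear (rather than quadratic) growth in channel connections, can be checked with a small pure-Python sketch (hypothetical helper names):

```python
def grouped_connectivity(c, g):
    """Boolean c x c mask of channel pairs connected by a grouped
    convolution with group size g (block-diagonal, as in (4))."""
    return [[i // g == j // g for j in range(c)] for i in range(c)]

def num_connections(mask):
    """Count the channel pairs that are directly connected."""
    return sum(sum(row) for row in mask)

c, g = 64, 8
mask = grouped_connectivity(c, g)
print(num_connections(mask))   # (c // g) * g**2 = 512, linear in c
print(c * c)                   # full coupling: 4096, quadratic in c
```

Doubling c doubles the number of grouped connections but quadruples the fully coupled ones, which is the gap the multigrid hierarchy is designed to close.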

Multigrid hierarchy The key idea of our multigrid architecture is to design a hierarchy of grids in the channel space (also referred to as "levels"), where the number of channels on the finest level corresponds to the width of the network, and the number of channels is halved between levels. Our multigrid architecture is first defined with a suitable CNN step, like one of the ResNet steps in (1)-(2), which is applied on each level of each cycle. On the finest and intermediate grids, we only connect disjoint groups of channels using convolution operators like (4). These convolutions have O(c) parameters, as we keep the group size fixed throughout the network; as the network widens, the number of groups grows. Interactions between different groups of channels are introduced on coarser grids. On the coarsest grid, we couple all channels using operators like (3), but note that the coarsest grid involves significantly fewer channels, which lowers the computational cost.

Grid transfer To reduce the channel dimension, we use a restriction operator R, which is a sparse convolution defined below. We compute the coarse feature maps x_c of the fine feature maps x by x_c = Rx. In a two-level setting, the fully coupled CNN step is applied on the coarse-grid feature maps x_c. This step couples all channels but involves only (c/2)^2 instead of c^2 convolutions. Afterwards, we use a prolongation operator P to obtain the output feature maps with c channels.
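A minimal sketch of restriction and prolongation on a channel vector (one pixel per channel): here R is a fixed pair-averaging operator and P duplicates coarse channels, which are simple stand-ins for the learned sparse operators described below.

```python
def restrict(x):
    """Halve the channel dimension by averaging adjacent channel pairs:
    a minimal stand-in for the learned sparse restriction R."""
    return [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(len(x) // 2)]

def prolong(xc):
    """Map coarse channels back to the fine grid by duplication:
    a minimal stand-in for the prolongation P."""
    return [xc[i // 2] for i in range(2 * len(xc))]

x = [1.0, 3.0, 5.0, 7.0]   # 4 fine channels (one pixel each)
xc = restrict(x)           # -> [2.0, 6.0], 2 coarse channels
print(xc)
print(prolong(xc))         # -> [2.0, 2.0, 6.0, 6.0]
```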

4.1 A vanilla additive multigrid cycle

The simplest version of the multigrid cycle resembles the so-called "additive" multigrid cycle Trottenberg et al. (2000); Reps and Weinzierl (2017), which is often used to parallelize multigrid computations. We define the two-level additive version of the ResNet layer in (1) as

x_c = R x_j,   (5)
x_c^+ = x_c + K_2^c σ(N(K_1^c x_c)),   (6)
τ = x_c^+ − x_c,   (7)
x_{j+1} = x_j + K_2 σ(N(K_1 x_j)) + P τ.   (8)

The operators K_1 and K_2 are grouped operators, and K_1^c and K_2^c are fully coupled operators on the coarse grid. As said above, R and P are sparse tensors that downsample and upsample the channel dimension, respectively. An illustration of this architecture using three levels is presented in Fig. 1. A multilevel cycle is obtained by repeating the dimensionality-reduction process before finally applying the fully coupled convolution on the coarsest level. The coarse-grid correction step in (7) is the so-called τ-correction, which is commonly used in non-linear multigrid schemes; we elaborate on this point below. This multigrid cycle is repeated in the network (as a block), where each cycle has a number of levels corresponding to the number of channels in the network. We define the coarsest grid as the level at which the number of channels reaches a certain threshold, e.g., the group size. Therefore, the multigrid approach is most attractive for wide networks, where many levels can be added.
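To make the additive cycle concrete, here is a pure-Python toy on channel vectors (one pixel per channel). Dense random matrices stand in for the learned sparse transfer operators, activations and normalization are omitted, and the simplified τ-correction (no skip connection on the coarse step) is used; all names are illustrative, not from the paper's code.

```python
import random

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def grouped_matrix(c, g, rng):
    """Block-diagonal channel-mixing matrix: a stand-in for a grouped convolution."""
    return [[rng.uniform(-1, 1) if i // g == j // g else 0.0
             for j in range(c)] for i in range(c)]

def dense_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def additive_two_level(x, K_fine, K_coarse, R, P):
    """Skip connection + fine grouped step + prolongated coarse correction
    (with the simplified tau-correction, the coarse skip cancels)."""
    coarse = matvec(P, matvec(K_coarse, matvec(R, x)))
    fine = matvec(K_fine, x)
    return [xi + fi + ci for xi, fi, ci in zip(x, fine, coarse)]

rng = random.Random(0)
c, g = 16, 4
K_fine = grouped_matrix(c, g, rng)
K_coarse = dense_matrix(c // 2, c // 2, rng)
R = dense_matrix(c // 2, c, rng)
P = dense_matrix(c, c // 2, rng)

# Perturbing channel 0 changes outputs outside its group: full coupling
# in a single cycle, even though the fine level is only grouped.
x0 = [0.0] * c
x1 = [0.0] * c; x1[0] = 1.0
y0 = additive_two_level(x0, K_fine, K_coarse, R, P)
y1 = additive_two_level(x1, K_fine, K_coarse, R, P)
print(any(abs(a - b) > 1e-12 for a, b in zip(y0[g:], y1[g:])))  # True
```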

The choice of transfer operators R and P The transfer operators play an important role in multigrid methods. The restriction R maps the fine-level problem and the state of the iterative solution onto the coarse grid, and the prolongation P acts in the opposite direction. In essence, in the coarsening process we lose information, since we reduce the dimension of the problem and of the state of the iterate. The key idea is to design R such that the coarse problem captures the subspace that is causing the fine-grid process to be inefficient. This results in two complementary processes: the fine-level steps (dubbed relaxations in the multigrid literature) and the coarse-grid correction.

To counteract the locality of the fine-level steps, we use R and P to shuffle the channels. This allows distant channels to interact on the coarse grid. To this end, we choose R to be a sparse convolution operator. We choose the locations of k nonzeros per row of R at random and let the network learn those weights, starting from positive random weights that sum to 1. We typically set k equal to the group size used in the grouped convolutions. The sparsity pattern of P equals that of R^T, so that channels that are transferred to the coarse grid and back end up at the same locations.
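The construction of such a sparse restriction can be sketched as follows (pure Python; in the paper the nonzero weights are then learned, while here they stay fixed, and the helper name is ours):

```python
import random

def sparse_restriction(c, k, rng):
    """Rows of a (c/2) x c restriction R: k random nonzero positions per
    row, with positive weights normalized to sum to 1."""
    rows = []
    for _ in range(c // 2):
        cols = rng.sample(range(c), k)       # random sparsity pattern
        w = [rng.random() for _ in cols]
        s = sum(w)
        rows.append({j: wi / s for j, wi in zip(cols, w)})
    return rows

rng = random.Random(1)
R = sparse_restriction(16, 4, rng)
print(len(R))                                            # 8 coarse channels
print(all(len(r) == 4 for r in R))                       # k nonzeros per row
print(all(abs(sum(r.values()) - 1) < 1e-9 for r in R))   # rows sum to 1
```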

4.2 Multiplicative multigrid cycles

Most existing multigrid methods use "multiplicative correction cycles". As a motivation, we show that the additive multigrid cycle in (5)-(8) and in Fig. 1 can be interpreted as a single ResNet step (if indeed the chosen CNN step is a ResNet step). Ignoring the normalization N and using block notation, the additive three-level cycle in Fig. 1 can be unfolded to

x_{j+1} = x_j + K_2 σ(K_1 x_j) + P_1 K_2^{(2)} σ(K_1^{(2)} R_1 x_j) + P_1 P_2 K_2^{(3)} σ(K_1^{(3)} R_2 R_1 x_j).   (9)

These actions are algebraically identical to the ResNet step

x_{j+1} = x_j + [K_2, P_1 K_2^{(2)}, P_1 P_2 K_2^{(3)}] σ([K_1; K_1^{(2)} R_1; K_1^{(3)} R_2 R_1] x_j),   (10)

where [·,·,·] and [·;·;·] denote horizontal and vertical block concatenation, respectively. Hence, the additive multigrid cycle is a CNN step with convolution operators of a special multigrid-in-channels structure that couples all channels with linear complexity in the width. While this is beneficial for supporting parallel computations, its simplicity may limit the expressiveness of the architecture.

To achieve longer (or deeper) paths and increase expressiveness by introducing more consecutive activation layers, we now derive the multiplicative correction cycles. Here, we let the information propagate through the network levels sequentially, which leads to the multiplicative two-level cycle

x_c = R x_j,   (11)
x_c^+ = x_c + K_2^c σ(N(K_1^c x_c)),   (12)
x̃_j = x_j + P (x_c^+ − x_c),   (13)
x_{j+1} = x̃_j + K_2 σ(N(K_1 x̃_j)).   (14)

Compared to (5)-(8), the CNN step in (14) is applied on x̃_j after the coarse-grid correction. Algorithm LABEL:alg:MGStep summarizes the multigrid cycle in more detail. We note that in channel-changing steps, i.e., when we enlarge the number of channels (and typically also pool or stride), the sequential structure of the multiplicative cycle in (11)-(14) makes it difficult to mix resolutions (both in image and channel space), hence for those steps we may revert to the additive multigrid version.
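The sequential structure of the multiplicative cycle can be sketched as a pure-Python toy on channel vectors, where simple ResNet-style maps stand in for the fully coupled coarse step and the grouped fine step (all names illustrative, activations reduced to a ReLU):

```python
import random

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multiplicative_two_level(x, step_fine, step_coarse, R, P):
    """x_c = R x; tau = step_coarse(x_c) - x_c; x~ = x + P tau;
    then the fine step acts on the already-corrected features."""
    xc = matvec(R, x)
    tau = [a - b for a, b in zip(step_coarse(xc), xc)]   # tau-correction
    x_tilde = [xi + pi for xi, pi in zip(x, matvec(P, tau))]
    return step_fine(x_tilde)

rng = random.Random(2)
c = 8
Kc = rand_matrix(c // 2, c // 2, rng)
R = rand_matrix(c // 2, c, rng)
P = rand_matrix(c, c // 2, rng)
# ResNet-style coarse step; toy per-channel fine step with a ReLU.
step_coarse = lambda v: [a + b for a, b in zip(v, matvec(Kc, v))]
step_fine = lambda v: [vi + max(vi, 0.0) for vi in v]

y = multiplicative_two_level([1.0] * c, step_fine, step_coarse, R, P)
print(len(y))   # 8: channel width is preserved through the cycle
```

Unlike the additive version, the fine step here sees the coarse correction, so consecutive activations are composed along a deeper path.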


The τ-correction and scaling of the coarse-grid correction The τ-corrections in (7) and (13) are the standard way to apply multigrid cycles to solve non-linear problems Trottenberg et al. (2000); Yavneh and Dardyk (2006). In (7), the correction can be simplified to τ = x_c^+ by removing the skip connection in (6). While this is a natural simplification for ResNets, it requires us to change the ResNet step and remove the skip connection on the coarser grids. This may introduce the problem of vanishing gradients that stems from consecutive multiplications by the K's and σ's. The correction in (13) introduces an identity mapping on all steps and levels, which is the property needed to prevent vanishing gradients; see He et al. (2016b). Also, it is common in multigrid to either dampen or amplify the coarse-grid correction Sterck et al. (2011); Yavneh and Dardyk (2006).

4.3 The total number of parameters and computational cost of a cycle

Consider a case where we have c channels in the network, and we apply a multigrid cycle using convolution kernels with m weights each and a group size of g. Such a grouped convolution has (c/g)·g^2·m = c·g·m parameters. Each coarsening step divides c by 2, until the coarsest level is reached and a fully coupled convolution with (c/2^{L−1})^2·m parameters is used. For L levels, the number of parameters per convolution in the cycle sums to

m·g·c·(1 + 1/2 + ⋯ + 1/2^{L−2}) + m·(c/2^{L−1})^2 < 2·m·g·c + m·(c/2^{L−1})^2.   (15)

If L is large (e.g., in wide networks), we can neglect the second term and get O(m·g·c) parameters.
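This count can be checked numerically with a short pure-Python sketch (one grouped convolution per level; the concrete values of c, g, m, and L are illustrative):

```python
def multigrid_cycle_params(c, g, m, levels):
    """Parameters of one convolution per level of a multigrid cycle:
    each grouped level contributes m * g * c_l (c_l halves per level),
    and the coarsest level uses a fully coupled convolution."""
    total = 0
    cl = c
    for _ in range(levels - 1):
        total += m * g * cl   # (cl / g) groups of g*g kernels, m weights each
        cl //= 2
    total += m * cl * cl      # fully coupled coarsest-level convolution
    return total

c, g, m = 512, 16, 9
print(multigrid_cycle_params(c, g, m, levels=6))  # 145152
print(m * c * c)                                  # dense layer: 2359296
```

With these values the cycle uses roughly 16× fewer parameters than a single fully coupled 3×3 convolution on the same width.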

5 Experiments

We compare our proposed multigrid approach to a standard fully coupled ResNet and to MobileNetV2 Sandler et al. (2018) for image classification tasks using the CIFAR-10/100 and ImageNet datasets. The CIFAR-10/100 datasets Krizhevsky and Hinton (2009) consist of 60k natural images of size 32×32 with labels assigning each image to one of ten categories (CIFAR-10) or one hundred categories (CIFAR-100). The data are split into 50k training and 10k test sets. The ImageNet challenge (ILSVRC) consists of over 1.28M training images with labels assigning each image to one of 1000 classes, where each class has 50 validation images. We use a different network architecture for each data set to reflect its difficulty. Our experiments are performed with the PyTorch software Paszke et al. (2017).

Our focus is to compare how different architectures perform using a relatively small number of parameters on the one hand, and, on the other, how expanding the network's width improves the accuracy even though the number of additional parameters scales only linearly. We use the established ResNet architectures as the baseline for comparison and keep the same structure of those ResNets, only with multigrid cycles. On ImageNet, we also test the multigrid cycle in the established MobileNetV2.

Our networks consist of several blocks preceded by an opening convolutional layer, which initially increases the number of channels. For the ImageNet dataset (Table 2), we used an opening layer of a strided convolution followed by max pooling. After the opening layers, there are several blocks, each consisting of several ResNet steps whose number varies between the different experiments. Each convolution is applied together with a ReLU activation and batch normalization, as described in (1). In the multigrid version, each block performs the multigrid cycle, whether the multiplicative or the additive one. To increase the number of channels and to downsample the image, we concatenate the feature maps with a depth-wise convolution applied to the same channels, thus doubling the number of channels. This is followed by an average pooling layer.

Our multigrid versions of the MobileNets replace the point-wise convolutions in the inverted residual structure with grouped convolutions. Here, the convolution on the coarsest-grid channels is fully coupled. We used the standard MobileNet steps to increase the channels or downsample the image. The last block consists of a pooling layer that averages each channel's map to a single pixel. Finally, we use a fully connected linear classifier with softmax and cross-entropy loss. We train all networks from scratch, i.e., no pre-trained weights are used.

As the optimization strategy for ImageNet, we use momentum SGD with weight decay, training with a fixed mini-batch size for a fixed number of epochs, and a learning rate that starts at a constant value and is divided by a constant factor every fixed number of epochs. The training of MG-MobileNetV2 is performed differently, following Sandler et al. (2018). The strategy for the other data sets is similar, with slight changes in the number of epochs, batch sizes, and the timing for reducing the learning rate. We use standard data augmentation, i.e., random resizing, cropping, and horizontal flipping.

Our classification results are given in Tables 1-2, where we chose several typical configurations of the group size for the sparse convolutions, the channel width, and the coarse-grid channel threshold, which determines the number of channels at which we perform the fully coupled convolution. The results show the linear cost of increasing the network's width, instead of the quadratic cost in ResNet and MobileNet. Thus, a multigrid network can be expanded at a linear cost, outperform the original ResNet, and be competitive with other well-known networks.

CIFAR-10: Architecture Channels Group Size Coarsest Params[M] Test acc.
ResNet 8-16-32-64 - - 0.27 90.4%
ResNet 16-32-64-128 - - 1.1 93.3%
ResNet 32-64-128-256 - - 4.7 94.5%
MG-ResNet* 16-32-64-128 8 16 0.22 89.4%
MG-ResNet 16-32-64-128 8 16 0.22 91.5%
MG-ResNet 16-32-64-128 16 16 0.41 92.6%
MG-ResNet 32-64-128-256 8 16 0.46 92.8%
MG-ResNet 32-64-128-256 16 16 0.87 93.8%
MG-ResNet 64-128-256-512 8 16 0.94 93.4%
MG-ResNet 64-128-256-512 16 16 1.8 94.4%
CIFAR-100: Architecture Channels Group Size Coarsest Params[M] Test acc.
ResNet 64-128-256-512 - - 28.9 78.5%
MG-ResNet 64-128-256-512 16 64 3.0 75.0%
MG-ResNet 64-128-256-512 32 32 5.7 75.6%
MG-ResNet 64-128-256-512 32 64 5.6 76.5%
MG-ResNet 128-256-512-1024 16 64 6.1 75.9%
MG-ResNet 128-256-512-1024 32 32 11.7 77.9%
MG-ResNet 128-256-512-1024 32 64 11.6 78.1%
MG-ResNet 256-512-1024-2048 16 64 12.3 76.1%
MG-ResNet 256-512-1024-2048 32 32 23.7 78.7%
MG-ResNet 256-512-1024-2048 32 64 23.6 79.5%
Table 1: Classification results for the CIFAR-10/100 datasets. Keeping the same basic architectures, we study the impact of the channel width, group size, and coarse-grid threshold on the test accuracy. The channel repetition structure is 2-3-3-3 for CIFAR-10 and 3-5-7-4 for CIFAR-100. '*' denotes the additive multigrid cycle. MG = Ours.

Influence of groups, width, and coarse-grid threshold We show the classification accuracy of multigrid with different configurations of group size, width, and number of levels on the CIFAR-10/100 data. To highlight differences in performance, we use small networks. Table 1 presents the classification results. As expected, increasing the width and group sizes adds weights to the network and improves accuracy. For both data sets, we reach the same accuracy as the base ResNet with fewer weights. We also tested the additive cycle, whose performance, as predicted by the theory, was inferior to that of the multiplicative cycle; one representative configuration is shown for CIFAR-10.

Architecture Channels Group Size Coarsest Params[M] Test acc.
ResNet34 64-128-256-512 - - 21.8 74.0%
MG-ResNet34 64-128-256-512 32 64 4.6 72.2%
MG-ResNet34 96-192-384-768 32 64 8.8 74.5%
MG-ResNet34 128-256-512-1024 32 64 11.7 75.4%
MobileNetV2 Sandler et al. (2018) 1.0 - - 3.4 72.0%
MobileNetV2 Sandler et al. (2018) 1.4 - - 6.9 74.7%
MG-MobileNetV2 1.0 32 64 3.0 71.7%
MG-MobileNetV2 2.0 32 64 6.0 73.9%
Table 2: Comparison of classification results for ImageNet using different networks. MG = Ours.

ImageNet Classification The results on the ImageNet dataset show that the multigrid version of ResNet is lighter but less accurate. Nevertheless, its width can be expanded so that it still consumes significantly fewer parameters yet outperforms the original ResNet. In the multigrid version of MobileNet, we kept several layers as in the original network, e.g., the channel-changing steps and the last two linear layers, which include many parameters. Hence, the reduction of parameters between MobileNetV2 and MG-MobileNetV2 is not as significant as in the ResNet architecture. However, when doubling the channel width, we obtain a network that consumes fewer parameters than the original 1.4× expansion of MobileNetV2 and achieves comparable accuracy.

6 Advantages and Limitations

Full Coupling By using local processes on coarser grids or graphs, multigrid methods make distant fine-grid nodes "closer", allowing information to travel between distant nodes with minimal effort. This leads to a full coupling between the unknowns of the problem, and in our context, between all feature channels.

Computational Efficiency Our scheme reduces the number of convolution operators in standard CNNs, which leads to fewer weights and FLOPs. Since the computation remains structured, the scheme can be implemented easily using grouped convolutions in the channel space and leads to a balanced computational load on parallel hardware. Some of the high-level multigrid cycle ingredients can also be parallelized. As wider networks require more memory for hidden features, we use checkpointing.

Generality Our multigrid layer can be employed in a variety of CNNs. We demonstrate this using variants of residual networks (e.g., ResNet He et al. (2016a), ResNeXt Xie et al. (2017), MobileNet Sandler et al. (2018)), but note that it can also be attractive for U-Nets Ronneberger et al. (2015) or other architectures Pohlen et al. (2017). While we focus on layers with an identical number of input and output channels, our approach can be generalized to other scenarios.

High Expressiveness-to-Parameters Ratio The expressiveness of a network can be defined as the complexity of the high-dimensional functions that the network can approximate. Roughly speaking, given a general architecture, the network becomes more expressive as the numbers of weights and non-linear activations grow. Since the vast majority of our convolutions are grouped, the multigrid architecture has more activations per parameter than a single-level architecture when the same building blocks are used in both. Hence, the multigrid version of any single-level architecture is, in principle, more expressive (and more complex) for the same number of parameters and FLOPs.

Training Times The reduction of the number of parameters currently does not translate to faster training times. This is due to the inefficient implementation of grouped convolutions on modern GPUs, which also affects other reduced architectures. We note that this may change with the increased use of reduced CNNs, which are needed for resource-limited hardware (e.g., mobile devices).

7 Conclusions

We present a novel multigrid-in-channels approach that improves the efficiency of convolutional networks with many channels, c. Such wide networks are popular for classification tasks; however, due to the use of fully coupled convolutions, the number of weights and FLOPs is O(c^2). Applying multigrid across the channels, we achieve full coupling through a multilevel hierarchy of channels at only O(c) cost. As demonstrated in our experiments using two common architectures, our approach achieves higher or comparable accuracies at a given budget. Our multigrid convolution model is not specific to ResNet or MobileNet and can be used in other wide architectures.


This research was partially supported by grant no. 2018209 from the United States - Israel Binational Science Foundation (BSF), Jerusalem, Israel.


  • [1] ImageNet large scale visual recognition challenge (ILSVRC) (2020). Note: [Online; accessed May 2019]. Cited by: §5.
  • [2] M. Bianchini and F. Scarselli (2014) On the complexity of neural network classifiers: a comparison between shallow and deep architectures. IEEE transactions on neural networks and learning systems 25 (8), pp. 1553–1565. Cited by: §1.
  • [3] B. Chang, L. Meng, E. Haber, F. Tung, and D. Begert (2017) Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348. Cited by: §2.
  • [4] S. Changpinyo, M. Sandler, and A. Zhmoginov (2017) The power of sparsity in convolutional neural networks. External Links: 1702.06257 Cited by: §2.
  • [5] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587. Cited by: §2.
  • [6] J. Ephrath, M. Eliasof, L. Ruthotto, E. Haber, and E. Treister (2020) LeanConvNets: low-cost yet effective convolutional neural networks. IEEE Journal of Selected Topics in Signal Processing. Cited by: §2.
  • [7] R. D. Falgout, S. Friedhoff, T. V. Kolev, S. P. MacLachlan, and J. B. Schroder (2014) Parallel time integration with multigrid. SIAM Journal on Scientific Computing 36 (6), pp. C635–C661. Cited by: §1.
  • [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587. Cited by: §1.
  • [9] S. Gunther, L. Ruthotto, J. B. Schroder, E. C. Cyr, and N. R. Gauger (2020) Layer-parallel training of deep residual neural networks. SIAM Journal on Mathematics of Data Science 2 (1), pp. 1–23. Cited by: §2.
  • [10] E. Haber, L. Ruthotto, E. Holtham, and S. Jun (2018) Learning across scales—multiscale methods for convolution neural networks. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • [11] S. Han, J. Pool, S. Narang, H. Mao, E. Gong, S. Tang, E. Elsen, P. Vajda, M. Paluri, J. Tran, B. Catanzaro, and W. J. Dally (2017) DSD: dense-sparse-dense training for deep neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • [12] S. Han, J. Pool, J. Tran, and W. J. Dally (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143. Cited by: §2.
  • [13] B. Hassibi and D. G. Stork (1992) Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 164–171. Cited by: §2.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3, §6.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Cited by: §3, §4.2.
  • [16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §2.
  • [17] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, et al. (2019) Gpipe: efficient training of giant neural networks using pipeline parallelism. In Advances in Neural Information Processing Systems, pp. 103–112. Cited by: §1.
  • [18] T. Ke, M. Maire, and S. X. Yu (2017) Multigrid neural architectures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [19] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report, University of Toronto. Cited by: §5.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105. Cited by: §1.
  • [21] Y. LeCun, B. E. Boser, and J. S. Denker (1990) Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, pp. 396–404. Cited by: §1.
  • [22] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf (2017) Pruning filters for efficient ConvNets. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • [23] N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: §2.
  • [24] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz (2017) Pruning convolutional neural networks for resource efficient transfer learning. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • [25] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems, Cited by: §5.
  • [26] D. M. Pelt and J. A. Sethian (2018) A mixed-scale dense convolutional neural network for image analysis. Proceedings of the National Academy of Sciences 115 (2), pp. 254–259. Cited by: §2.
  • [27] T. Pohlen, A. Hermans, M. Mathias, and B. Leibe (2017) Full-resolution residual networks for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4151–4160. Cited by: §6.
  • [28] B. Reps and T. Weinzierl (2017) Complex additive geometric multilevel solvers for Helmholtz equations on spacetrees. ACM Transactions on Mathematical Software (TOMS) 44 (1), pp. 1–36. Cited by: §4.1.
  • [29] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §6.
  • [30] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §2, §3, Table 2, §5, §5, §6.
  • [31] H. De Sterck, K. Miller, E. Treister, and I. Yavneh (2011) Fast multilevel methods for Markov chains. Numerical Linear Algebra with Applications 18 (6), pp. 961–980. Cited by: §1, §2, §4.2.
  • [32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9. Cited by: §1.
  • [33] M. Tan and Q. V. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning (ICML), Cited by: §2.
  • [34] E. Treister, J. S. Turek, and I. Yavneh (2016) A multilevel framework for sparse optimization with application to inverse covariance estimation and logistic regression. SIAM Journal on Scientific Computing 38 (5), pp. S566–S592. Cited by: §2.
  • [35] E. Treister and I. Yavneh (2011) On-the-fly adaptive smoothed aggregation multigrid for Markov chains. SIAM Journal on Scientific Computing 33 (5), pp. 2927–2949. Cited by: §1, §2.
  • [36] E. Treister and I. Yavneh (2012) A multilevel iterated-shrinkage approach to ℓ1 penalized least-squares minimization. IEEE Transactions on Signal Processing 60 (12), pp. 6319–6329. Cited by: §2.
  • [37] U. Trottenberg, C. W. Oosterlee, and A. Schuller (2000) Multigrid. Elsevier. Cited by: §1, §4.1, §4.2.
  • [38] M. Wang, B. Liu, and H. Foroosh (2016) Design of efficient convolutional layers using single intra-channel convolution, topological subdivisioning and spatial "bottleneck" structure. arXiv preprint arXiv:1608.04337. Cited by: §2.
  • [39] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500. Cited by: §3, §6.
  • [40] I. Yavneh and G. Dardyk (2006) A multilevel nonlinear method. SIAM Journal on Scientific Computing 28 (1), pp. 24–46. Cited by: §4.2.
  • [41] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith (Eds.), pp. 87.1–87.12. ISBN 1-901725-59-6. Cited by: §1.
  • [42] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856. Cited by: §2.