Convolutional networks have become a dominant approach for visual object recognition [12, 39, 25, 41]. However, the vanishing gradient problem [12] poses significant challenges in deep networks, as input information can vanish while passing through many layers before reaching the end.
When training a deep neural network, gradients can become very small during backpropagation, making it hard to optimise the parameters in the early stages of the network. As a result, the weights of the layers near the end of the network are updated quite rapidly during training while those of the early layers are not, leading to poor results. The ReLU activation function and regularization methods such as dropout were proposed to address this problem. However, while these methods are important, they do not solve the problem entirely. Huang et al. found that as layers are added to a network, at some point its performance starts to decrease. Recent work [12, 13, 43, 42] proposed different solutions such as skip connections, the use of different sized filters in parallel [43, 42] and exhaustive concatenation between layers. This goes some way towards addressing the problem.
In this paper we draw inspiration from the above networks [12, 13] and propose a novel network architecture that retains positive aspects of these approaches [12, 13] whilst overcoming some of their limitations. Figure 1 illustrates a single module layout of our proposed architecture where its unique connectivity is displayed.
We show that the ChoiceNet design allows good gradient and information flow through the network while using fewer parameters than other state-of-the-art schemes. We evaluate ChoiceNet on three benchmark datasets (CIFAR10, CIFAR100 and SVHN) for image classification and also compare the performance of our network with state-of-the-art methods on the CamVid dataset. Our model performs well against existing networks [12, 13] on all three datasets, showing promising results when compared to the current state of the art.
Consider a single image $x_0$ that is passed through a CNN. The network has $L$ layers, each of which implements a non-linear transformation $H_l(\cdot)$, where $l$ is the index of the layer. $H_l(\cdot)$ is a list of operations such as Batch-Normalization, Pooling [33] or a convolutional operation. The output of the $l$-th layer is denoted as $x_l$.
A typical convolutional feed-forward network connects the output of the $l$-th layer to the input of the $(l+1)$-th layer, which gives rise to the layer transition $x_l = H_l(x_{l-1})$. ResNet adds an identity-mapped connection, also referred to as a skip connection, that bypasses the transformation in between:

$$x_l = H_l(x_{l-1}) + x_{l-1} \qquad (1)$$
This mechanism allows gradients to flow directly through the identity functions, which results in faster training and better error propagation. However, it has been argued that, despite the benefits of skip connections, a layer connected by a skip connection may disrupt the information flow of the network and thereby degrade its performance.
A wider version of ResNet has also been proposed, in which the authors showed that an increased number of filters in each layer can improve overall performance given sufficient depth. FractalNet also shows comparable improvement on similar benchmark datasets.
DenseNet: As an alternative to ResNet, DenseNet proposed a different connectivity scheme that allows connections from each layer to all of its subsequent layers. Thus the $l$-th layer receives the feature maps of all previous layers. Considering $x_0, x_1, \dots, x_{l-1}$ as input:

$$x_l = H_l([x_0, x_1, \dots, x_{l-1}]) \qquad (2)$$

where $[x_0, x_1, \dots, x_{l-1}]$ denotes the concatenation of the feature maps produced by layers $0, 1, \dots, l-1$ respectively.
The network maximizes information flow by concatenating the outputs of the convolutional layers channel-wise instead of summing them through skip connections. In this model, the $l$-th layer has $l$ inputs consisting of the feature maps of all previous layers. Thus an $L$-layer network has $L(L+1)/2$ connections. DenseNet requires fewer parameters as there is no need to relearn redundant feature maps. This allows the network to compete against ResNet using fewer parameters.
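The three layer transitions discussed so far (plain feed-forward, residual and dense) can be contrasted in a minimal Python sketch. It is illustrative only: feature maps are represented by flat lists of floats, and the function names (`plain_forward`, `residual_forward`, `dense_forward`, `dense_connection_count`) and the toy layers are assumptions introduced here, not code from any of the cited implementations.

```python
from typing import Callable, List

Features = List[float]
Layer = Callable[[Features], Features]


def plain_forward(x0: Features, layers: List[Layer]) -> Features:
    """Plain feed-forward transition: x_l = H_l(x_{l-1})."""
    x = x0
    for h in layers:
        x = h(x)
    return x


def residual_forward(x0: Features, layers: List[Layer]) -> Features:
    """ResNet-style transition: x_l = H_l(x_{l-1}) + x_{l-1}."""
    x = x0
    for h in layers:
        x = [a + b for a, b in zip(h(x), x)]
    return x


def dense_forward(x0: Features, layers: List[Layer]) -> Features:
    """DenseNet-style transition: x_l = H_l([x_0, ..., x_{l-1}])."""
    features = [x0]
    for h in layers:
        # Concatenate every earlier output before applying the layer.
        concatenated = [v for f in features for v in f]
        features.append(h(concatenated))
    return features[-1]


def dense_connection_count(num_layers: int) -> int:
    """Number of direct connections in a densely connected L-layer network."""
    return num_layers * (num_layers + 1) // 2


def halve(x: Features) -> Features:
    """Toy layer that keeps the input size (needed for the residual sum)."""
    return [0.5 * v for v in x]


def total(x: Features) -> Features:
    """Toy layer that maps any number of inputs to a single feature."""
    return [sum(x)]
```

With these toy layers, the residual variant keeps the input signal alive through the sum, while the dense variant lets every layer see all earlier outputs at the cost of a growing input size.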
ChoiceNet: We propose an alternative connectivity that retains the advantages of the above architectures whilst reducing some of their limitations. Figure 1 illustrates the connectivity layout between each layer of a single module. Each block of ChoiceNet contains three modules and the total network is comprised of three blocks with pooling operations in the middle (see Figure 3).
Figure 2 shows a breakdown of each module. Letters A to G denote unique information generated by one forward pass through the model. B is generated by three consecutive convolutional operations, whereas A is the result of the same three convolutional operations but additionally connected by a skip connection. Following this pattern, we generate the information represented by letters C, D, F and G. Letter E denotes the special case where no convolutional operation is performed after the $1\times1$ convolutional operation, so it contains all the original information. This information is then concatenated with the others (i.e., A, B, etc.) at the final output.
Therefore, the final output contains information with and without skip connections from filters of size 3, 5 and 7, and also from the original input without any modification. Note that the $1\times1$ convolutional operation at the start acts as a bottleneck to limit computational costs, and all the convolutional operations are padded appropriately for the concatenation at the final stage.
Taking the output of the previous layer as input, our proposed connectivity first sums the feature maps and then concatenates them, which resembles the characteristics of ResNet and DenseNet respectively.
Composite function: Each composite function consists of a convolution operation followed by batch normalisation, and ends with a rectified linear unit (ReLU).
Pooling layers: Pooling is an essential part of convolutional networks, since Equations 1 and 2 are not viable when the feature maps are not of equal size. We divide the network into multiple blocks, where each block contains feature maps of the same size. Instead of using either max pooling or average pooling, we use both pooling mechanisms and concatenate their outputs before feeding them to the next layer (see Figure 4).
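The combined pooling step can be sketched in plain Python as follows. This is a toy illustration under stated assumptions: the non-overlapping 2x2 window, the nested-list representation of a feature map, and the helper names `pool2x2` and `dual_pool` are choices made here for clarity and are not taken from the paper.

```python
from typing import Callable, List

FeatureMap = List[List[float]]  # one channel: rows of pixel values


def pool2x2(fmap: FeatureMap, reduce_fn: Callable[[List[float]], float]) -> FeatureMap:
    """Apply a non-overlapping 2x2 pooling window (assumed window size)."""
    rows, cols = len(fmap), len(fmap[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        out_row = []
        for c in range(0, cols - 1, 2):
            window = [fmap[r][c], fmap[r][c + 1],
                      fmap[r + 1][c], fmap[r + 1][c + 1]]
            out_row.append(reduce_fn(window))
        pooled.append(out_row)
    return pooled


def dual_pool(channels: List[FeatureMap]) -> List[FeatureMap]:
    """Max-pool and average-pool every channel, then concatenate channel-wise."""
    max_maps = [pool2x2(ch, max) for ch in channels]
    avg_maps = [pool2x2(ch, lambda w: sum(w) / len(w)) for ch in channels]
    # The output has twice as many channels as the input.
    return max_maps + avg_maps
```

Note that concatenating both results doubles the channel count entering the next block, which is the price paid for letting later layers use either summary.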
Bottleneck layers: The use of $1\times1$ convolutional operations (known as bottleneck layers) can reduce computational complexity without hurting the overall performance of a network. We introduce a $1\times1$ convolutional operation at the start of each composite function (see Figures 1 and 3).
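A rough parameter count shows why a bottleneck helps. The sketch below ignores biases and batch normalisation parameters, and the channel widths used in the usage note (256 in, 64 out) are hypothetical values chosen only for illustration; they are not taken from the paper.

```python
def conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a kernel x kernel convolution (bias terms ignored)."""
    return kernel * kernel * c_in * c_out


def direct_params(kernel: int, c_in: int, c_out: int) -> int:
    """A single large convolution applied directly to the wide input."""
    return conv_params(kernel, c_in, c_out)


def bottleneck_params(kernel: int, c_in: int, c_mid: int, c_out: int) -> int:
    """A 1x1 bottleneck down to c_mid channels, followed by the large convolution."""
    return conv_params(1, c_in, c_mid) + conv_params(kernel, c_mid, c_out)
```

For the hypothetical widths above, `direct_params(7, 256, 64)` gives 802,816 weights, whereas `bottleneck_params(7, 256, 64, 64)` gives 217,088, roughly a quarter of the cost for the same output width.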
Implementation Details: ChoiceNet has three blocks with an equal number of modules inside. In each Choice operation (see Figure 1), there are three $3\times3$, three $5\times5$ and three $7\times7$ convolutional operations. Each of the consecutive convolutional operations is connected via a skip connection (red line in Figure 1). The feature maps are then concatenated so that the outputs both with and without the skip connections are included (green and black lines in Figure 1 before "C"). Finally, the original input feature map is also merged (blue line in Figure 1) to produce the final output.
The intuition behind having the skip (Letter A, Figure 2) and the non-skip connections (Letter B, Figure 2) output merged together is for enabling the network to choose between the two options for each filter size. We also merge the original input to this output (Letter E Figure 2) so that the network can choose a suitable depth for optimal performance. To allow the network further options, we use both Max and average pooling. Thus, each pooling layer contains both a Max-Pool and an Avg-Pool operation. The outputs of each pooling operation are merged before proceeding to the next layer.
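The module described above can be approximated by the following toy sketch. It is one possible reading of the text, not the authors' implementation: feature maps are flat lists of floats, every "convolution" is an arbitrary callable, the skip connection is assumed to wrap each individual operation, and the output ordering (skip, non-skip per filter size, then the bottleneck output) is an assumption. The names `choice_module`, `add` and `scale` are hypothetical.

```python
from typing import Callable, List, Sequence

Features = List[float]
Op = Callable[[Features], Features]


def add(a: Features, b: Features) -> Features:
    """Element-wise sum, used for the skip connections."""
    return [x + y for x, y in zip(a, b)]


def choice_module(x: Features, bottleneck: Op,
                  branches: Sequence[Sequence[Op]]) -> Features:
    """Concatenate skip and non-skip outputs of every branch with the bottleneck output.

    `branches` holds one list of operations per filter size (3, 5 and 7 in the text).
    """
    e = bottleneck(x)  # letter E: bottleneck output, passed on unchanged
    pieces = []
    for convs in branches:
        plain = e  # path without skip connections (e.g. letter B)
        skip = e   # path with a skip connection around each op (e.g. letter A)
        for conv in convs:
            plain = conv(plain)
            skip = add(conv(skip), skip)
        pieces.append(skip)
        pieces.append(plain)
    pieces.append(e)
    # Channel-wise concatenation of all seven pieces.
    return [v for piece in pieces for v in piece]


def scale(factor: float) -> Op:
    """Toy stand-in for a padded convolution that preserves the input size."""
    return lambda f: [factor * v for v in f]
```

With three branches the output always contains seven pieces, matching the letters A to G, so later layers can weight whichever path turns out to be useful.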
We evaluate our proposed ChoiceNet architecture on three benchmark datasets (CIFAR10, CIFAR100 and SVHN) and compare it with other state-of-the-art architectures. We also evaluate it on the CamVid semantic segmentation dataset.
The CIFAR dataset is a collection of two datasets, CIFAR10 and CIFAR100. Each dataset consists of 50,000 training images and 10,000 test images of $32\times32$ pixels. The CIFAR10 dataset contains 10 classes and the CIFAR100 dataset contains 100. In our experiments, we hold out 5,000 images from the training set for validation and use the rest of the images for training. We choose the model with the highest accuracy on the validation set to evaluate on the test set. We adopt standard data augmentation for training, including horizontal flipping, random cropping, shifting and normalization using the channel mean and standard deviation. These augmentations were widely used in previous work [12, 14, 24, 27, 29, 36, 40, 41]. We also tested our model on the datasets without augmentation. In our final results in Table 1, we denote the original datasets as C10 and C100, and the augmented datasets as C10+ and C100+.
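Two of the augmentation steps mentioned above, horizontal flipping and per-channel normalization, can be sketched in plain Python. This is a simplified illustration: an image is assumed to be a nested list indexed as `image[channel][row][column]`, the helper names are hypothetical, and random cropping and shifting are left out.

```python
import math
from typing import List, Tuple

Image = List[List[List[float]]]  # image[channel][row][column]


def horizontal_flip(image: Image) -> Image:
    """Mirror every row of every channel."""
    return [[row[::-1] for row in channel] for channel in image]


def channel_stats(images: List[Image]) -> List[Tuple[float, float]]:
    """Mean and standard deviation of each channel over a set of images."""
    stats = []
    for c in range(len(images[0])):
        values = [v for img in images for row in img[c] for v in row]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        stats.append((mean, math.sqrt(variance)))
    return stats


def normalise(image: Image, stats: List[Tuple[float, float]]) -> Image:
    """Subtract the channel mean and divide by the channel standard deviation."""
    return [
        [[(v - mean) / (std or 1.0) for v in row] for row in channel]
        for channel, (mean, std) in zip(image, stats)
    ]
```

In practice the statistics would be computed once on the training split and reused unchanged for the validation and test images.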
The SVHN dataset contains $32\times32$ pixel images of Street View House Numbers. There are 73,257 images in the training set and 26,032 in the test set, along with an additional 531,131 images for training purposes. As in previous work [12, 14, 24, 27, 36], we use all the training data with no augmentation and use 10% of the training images as a validation set. We select the model with the highest accuracy on the validation set and report the test error in Table 1.
The CamVid dataset consists of 12 classes and has mostly been used for semantic segmentation in previous work [32, 2, 7]. It contains a training set of 367 images, a validation set of 100 images and a test set of 233 images. The challenge is to classify the input image pixel-wise and correctly identify the objects in the scene. The intersection over union (IoU) metric is commonly used for this task [6, 17, 2].
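For reference, mean IoU over a set of classes can be computed as in the sketch below. It operates on flat lists of per-pixel class labels; the function name `mean_iou` and the choice to skip classes that appear in neither the prediction nor the ground truth are assumptions made here, and evaluation protocols may differ on that point.

```python
from typing import List


def mean_iou(pred: List[int], target: List[int], num_classes: int) -> float:
    """Mean intersection over union across classes, from per-pixel labels."""
    if len(pred) != len(target):
        raise ValueError("prediction and target must have the same length")
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(intersection / union)
    if not ious:
        raise ValueError("no class is present in prediction or target")
    return sum(ious) / len(ious)
```

For example, with predictions `[0, 0, 1, 1]` and ground truth `[0, 1, 1, 1]`, class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.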
Training settings were chosen to keep the comparisons as fair and simple as possible. On all three datasets, we used a training batch size of 128. The learning rate was held at its initial value for the first 100 epochs, lowered for the next 100 epochs, and lowered again for the final 300 epochs.
We used weight decay and Nesterov momentum without dampening. We use a dropout layer with a dropout rate of 0.2 after each ChoiceNet block.
For this task we use the training procedure of U-Net (Fig. 5) and replace the conv-blocks of U-Net with a Res-Block (a block that retains the distinctive properties of the corresponding network), a Dense-Block and a ChoiceNet-Module (Fig. 3). We use the Adam optimizer with an initial learning rate that was reduced by a factor of 10 after every 100 epochs until the network converged. Weight decay and Nesterov momentum without dampening were used. For a fair comparison, we kept the number of channels of the Res-Block and Dense-Block unchanged from the original articles [13, 14].
Each of the experiments was performed 5 times and during the training process we took the model with the best validation score and reported its performance on the test set.
We used PyTorch to implement our models. We trained the DenseNet and ResNet models using PyTorch implementations.
We used a machine with 16 GB of RAM, a 6-core Intel i7 8700 CPU and an NVIDIA RTX 2080 Ti GPU with 11 GB of VRAM.
| Method | Depth | Params | C10 | C10+ | C100 | C100+ | SVHN |
|---|---|---|---|---|---|---|---|
| Network in Network | - | - | 10.41 | 8.81 | 35.68 | - | 2.35 |
| Deeply Supervised Net | - | - | 9.69 | 7.97 | - | 34.57 | 1.92 |
| ResNet (reported by  ) | 110 | 1.7M | 13.63 | 6.41 | 44.74 | 27.22 | 2.01 |
| ResNet with Stochastic Depth | 110 | 1.7M | 11.66 | 5.23 | 37.8 | 24.58 | 1.75 |
| DenseNet (k = 12) | 40 | 1.0M | 7.3* | 5.43* | 29.03* | 27.12* | 1.81* |
| DenseNet (k = 12) | 100 | 7.0M | 5.81* | 4.5* | 24.97* | 20.84* | 1.76* |
| DenseNet (k = 24) | 100 | 27.2M | 5.98* | 4.1* | 24.01* | 20.5* | 1.71* |
| DenseNet-BC (k = 12) | 100 | 0.8M | 6.03* | 4.7* | 24.60 | 22.98* | 1.82* |
| DenseNet-BC (k = 24) | 250 | 15.3M | 5.16* | 4.9* | 21.55* | 18.42* | 1.7* |
| DenseNet-BC (k = 40) | 190 | 25.6M | - | 4.2* | - | 18.88* | - |
3.3 Result Analysis
3.3.1 CIFAR and SVHN
Accuracy: Table 1 shows that ChoiceNet with depth 40 achieves the highest accuracy on all three datasets. Its error rates on C10+ and C100+ are 4.0% and 17.5% respectively, which are lower than the error rates achieved by other state-of-the-art models. Our results on the original C10 and C100 datasets (without augmentation) are 2% lower than Wide ResNet and 5% lower than pre-activated ResNet. The shallower ChoiceNet performs comparably to DenseNet-BC, whereas the deeper ChoiceNet outperforms all other networks.
Parameter efficiency: Table 1 shows that ChoiceNet needs fewer parameters to give similar or better performance compared to other state-of-the-art architectures. For instance, ChoiceNet with a depth of 30 has only 13 million parameters, yet it performs comparably to DenseNet-BC (k = 24), which has 15.3 million parameters. Our best results were achieved by the ChoiceNet configuration with 23.4 million parameters, compared to DenseNet-BC (k = 40) with 25.6 million, DenseNet (k = 24) with 27.2 million and Wide ResNet with 36.5 million parameters.
Overfitting: Deep learning architectures can often be prone to overfitting. However, as ChoiceNet requires a smaller number of parameters, it is less likely to overfit the training datasets. Its performance on the non-augmented datasets appears to support this claim.
Exploding gradient: While training ChoiceNet we observed that it occasionally suffers from an exploding gradient problem. ResNet and DenseNet were both trained using stochastic gradient descent (SGD) with an initial learning rate that was later reduced twice, after every 100 epochs. However, we had to start training our network with a lower learning rate, because setting the rate any higher caused the gradients to explode. We also had to reduce the learning rate after every 50 epochs instead of 100 to prevent the problem from recurring.
The problem of exploding gradients is easier to handle than that of vanishing gradients. We used a smaller learning rate at the start, together with L2 regularisers and dropout layers, which addressed the problem.
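A step-wise learning rate schedule of the kind described above can be written as a small helper. This is only a sketch: the function name `step_lr`, the decay factor of 0.1 and the base rate of 0.01 used in the usage note are hypothetical placeholders, not the values used in the experiments.

```python
def step_lr(epoch: int, base_lr: float, step_epochs: int, factor: float = 0.1) -> float:
    """Learning rate after multiplying by `factor` every `step_epochs` epochs.

    All numeric defaults are illustrative assumptions.
    """
    if epoch < 0 or step_epochs <= 0:
        raise ValueError("epoch must be >= 0 and step_epochs must be > 0")
    return base_lr * factor ** (epoch // step_epochs)
```

With a hypothetical base rate of 0.01, `step_lr(60, 0.01, 100)` is still at the base rate, whereas `step_lr(60, 0.01, 50)` has already been reduced once, which mirrors the shorter 50-epoch interval adopted for ChoiceNet.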
| Method | m_IoU |
|---|---|
| Wu et al. | 80.6 |
| Wang et al. | 80.1 |
| Ke et al. | 79.1 |
| Kong et al. | 78.2 |
| Wang et al. | 77.6 |
| *ChoiceNet-block (13M) | 73.5 |
| Lin et al. | |
| Chen et al. | |
| Mehta et al. | 70.2 |
| Fourure et al. | |
| *Dense-blocks (25M) | 69.2 |
| Lo et al. | |
| Yu et al. | 67.1 |
| Krešo et al. | 66.3 |
| Chen et al. | |
| Berman et al. | 63.1 |
| Arnab et al. | 62.5 |
| Huang et al. | |
We tested ChoiceNet on the CamVid dataset and compared it with specialist segmentation networks and other state-of-the-art networks [47, 46, 16, 19, 45, 28, 6, 6, 31, 10]. Mean IoU (m_IoU) scores are shown in Table 2.
Although our network did not perform better than specialist state of the art segmentation networks, it was able to outperform DenseNet and ResNet both in terms of m_IoU score as well as in terms of parameter efficiency. Our ChoiceNet with 13 million parameters was able to perform better than networks almost twice its size.
In Figure 7 we display some of the predicted images from ResNet, DenseNet and our model against ground truth data from the CamVid dataset. The qualitative results show that our model is able to segment smaller classes with good precision.
Model compactness: As a result of the use of different filter sizes with feature concatenation and skip connections at every stage, feature maps learned by any layer in a block can be accessed by all subsequent layers. This extensive feature reuse throughout the network leads to a compact model.
Figure 6 shows a graph of test error against number of layers, which demonstrates the compactness of ChoiceNet with respect to other state-of-the-art architectures. Note that for training the different networks we kept the environment the same but changed the depth, and later smoothed the curves for better visualisation. ChoiceNet's curve always stays at the very bottom, which means a better error rate with fewer parameters and layers.
Feature Reuse: ChoiceNet uses different filter sizes with skip connections and channel concatenation in each module (see Figure 1). To gain a deeper, visual understanding of its operation, we took the weights of the first block (in ChoiceNet-37) and normalized them to a fixed range. After normalizing the weights we mapped them to two groups, with weights under 0.4 shown as white and weights over 0.4 shown as colored (see Table 3). We assumed that weights below 0.4 have an insignificant effect on the overall performance. The visualisation shows that, after the very first convolution operation on the raw input, the conv operations with filter size 7 have more effect than those with sizes 3 and 5. In the second module all the conv operations' weights were under 0.4, which suggests that the model used either the feature maps of the earlier output via concatenation (red line between filters 5 and 7 of the middle module) or the skip connection (red line above filter 3 with the highlighted "+" sign). On the one hand this indicates that the skip connection, the channel concatenation, or both are working as they were supposed to; on the other hand it means that we still have many redundant parameters in the network. In the third module, filters 3 and 5 had weights over 0.4, which indicates that they possibly contributed to the network. We suspect that the selection of filter size 7 in the first module and of sizes 3 and 5 in the third module echoes the hypothesis of AlexNet, where bigger filter sizes were found to work better at the beginning of the network and smaller filters in the later stages.
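The normalise-then-threshold step used for this visualisation can be sketched as follows. The details are assumptions: the text does not state the normalization range or method, so the sketch uses min-max scaling to [0, 1] over the raw weight values, and the helper names are hypothetical. Only the 0.4 threshold comes from the text.

```python
from typing import List


def minmax_normalise(weights: List[float]) -> List[float]:
    """Rescale weights linearly to [0, 1] (assumed normalization scheme)."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return [0.0 for _ in weights]
    return [(w - lo) / (hi - lo) for w in weights]


def significant_mask(weights: List[float], threshold: float = 0.4) -> List[bool]:
    """True where the normalized weight exceeds the threshold (shown as colored)."""
    return [w > threshold for w in minmax_normalise(weights)]
```

Entries that come out `False` correspond to the white cells of the visualisation, that is, weights assumed to contribute little.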
In Table 2, we show the mean intersection over union (m_IoU) on the CamVid dataset for some of the current state-of-the-art models. We used the U-Net training scheme and replaced the basic convolutional operations with ResBlocks, DenseBlocks and the ChoiceNet-module (see Figure 1). Although our network has fewer parameters than the ResBlock and DenseBlock variants, it achieved a higher score. Note that even though our model achieved a good m_IoU score, it is not as good as some of the network architectures designed specifically for segmentation tasks [47, 46, 16, 19, 45]. Nevertheless, it performed well compared to both ResBlock and DenseBlock, as well as some other general purpose convolutional neural networks.
Our intuition is that the extra connections and paths in our method enable the network to learn from a large variety of feature maps. This also enables the network to backpropagate errors more efficiently (see also [12, 13]). We found that, due to all the connections, the network can be prone to exploding gradients and therefore needs a small learning rate to begin with. We also found by grid search that the network shows peak performance when the depth is between 30 and 40 layers, and that further increasing the number of layers appears to have little effect. We suspect that ChoiceNet plateaus at a depth of 30 to 40, although it is possible that this is a local minimum, as we could not train models deeper than 60 layers due to resource limitations.
In this paper, we introduced a powerful yet lightweight and efficient network, ChoiceNet, which encodes better spatial information from images by learning from its numerous elements, such as skip connections, the use of different filter sizes, dense connectivity and the inclusion of both max and average pooling. ChoiceNet is a general purpose network with good generalisation abilities and can be used across a wide range of tasks, including classification and image segmentation. Our network shows promising performance when compared to state-of-the-art techniques across different tasks such as semantic segmentation and object classification, while being more efficient.
[1] A. Arnab, S. Jayasumana, S. Zheng, and P. H. S. Torr. Higher order conditional random fields in deep neural networks. In European Conference on Computer Vision (ECCV), 2016.
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
[3] M. Berman, A. Rannen Triki, and M. B. Blaschko. The Lovász-softmax loss: a tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4413–4421, 2018.
[4] L. Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pages 177–186. Springer, 2010.
[5] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
[7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.
[8] G. E. Dahl, T. N. Sainath, and G. E. Hinton. Improving deep neural networks for lvcsr using rectified linear units and dropout. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 8609–8613. IEEE, 2013.
[9] J. Fauqueur, G. Brostow, and R. Cipolla. Assisted video object labeling by joint tracking of regions and keypoints. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–7. IEEE, 2007.
[10] D. Fourure, R. Emonet, E. Fromont, D. Muselet, A. Tremeau, and C. Wolf. Residual conv-deconv grid network for semantic segmentation. arXiv preprint arXiv:1707.07958, 2017.
[11] A. Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
[12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[13] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
[14] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger. Deep networks with stochastic depth. In European Conference on Computer Vision, pages 646–661. Springer, 2016.
[15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[16] T.-W. Ke, J.-J. Hwang, Z. Liu, and S. X. Yu. Adaptive affinity fields for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 587–602, 2018.
[17] A. Kendall, V. Badrinarayanan, and R. Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680, 2015.
[18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[19] S. Kong and C. C. Fowlkes. Recurrent scene parsing with perspective understanding in the loop. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 956–965, 2018.
[20] I. Krešo, D. Čaušević, J. Krapac, and S. Šegvić. Convolutional scale invariance for semantic segmentation. In German Conference on Pattern Recognition. Springer, 2016.
[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[22] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[24] G. Larsson, M. Maire, and G. Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.
[25] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[27] C.-Y. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
[28] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1925–1934, 2017.
[29] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
[30] S.-Y. Lo, H.-M. Hang, S.-W. Chan, and J.-J. Lin. Efficient dense modules of asymmetric convolution for real-time semantic segmentation. arXiv preprint arXiv:1809.06323, 2018.
[31] S. Mehta, M. Rastegari, L. Shapiro, and H. Hajishirzi. Espnetv2: A light-weight, power efficient, and general purpose convolutional neural network. arXiv preprint arXiv:1811.11431, 2018.
[32] E. Mulalić, N. Grujić, V. Ilić, M. Marković, et al. Object-level grouping and identification for tracking objects in a video, Feb. 20 2018. US Patent 9,898,677.
[33] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010.
[34] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
[35] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
[36] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[37] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
[38] S. Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[39] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[40] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
[41] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Advances in neural information processing systems, pages 2377–2385, 2015.
[42] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
[43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
[44] S. Targ, D. Almeida, and K. Lyman. Resnet in resnet: generalizing residual architectures. arXiv preprint arXiv:1603.08029, 2016.
[45] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. arXiv preprint arXiv:1702.08502, 2017.
[46] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1451–1460. IEEE, 2018.
[47] Z. Wu, C. Shen, and A. Van Den Hengel. Wider or deeper: Revisiting the resnet model for visual recognition. Pattern Recognition, 90:119–133, 2019.
[48] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.