1 Introduction
Recent advances in neural networks have made them prevalent in many computer vision problems, such as image denoising, segmentation and captioning. They have recently matched and even surpassed human-level performance in a wide range of tasks, including image classification and object detection He et al. (2016a); Russakovsky et al. (2015). Residual learning He et al. (2016a) is a building block of most state-of-the-art models. Deep Residual Networks showed outstanding results in the ILSVRC 2015 challenge and quickly became one of the most promising architectures for many efficient models in computer vision.
Residual learning is based on the idea of shortcut connections between layers. If a standard layer learns some parametric mapping y = F(x) of an input x, residual learning suggests learning the mapping in the form y = x + F(x). The function F usually takes the form of a few convolutional layers with nonlinearities in between. This form of mapping is motivated by the fact that very deep neural networks without shortcut connections often perform worse than their shallower counterparts. Counterintuitively, as shown in He et al. (2016a), this result is due to underfitting, not overfitting. Intuitively, however, deeper networks should not perform worse, since the last layers of the network could learn an identity mapping y = x, effectively yielding a shallower model. Since a plain network cannot easily learn an identity mapping, adding a shortcut connection allows the later layers to represent it by setting F(x) = 0, which can usually be achieved by zeroing out all the weights. The idea of residual networks paved the way for deep models consisting of hundreds Huang et al. (2017b) and even thousands He et al. (2016b) of layers.
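As a minimal sketch of the shortcut connection described above, the snippet below uses a toy linear transform in place of the convolutional layers. The names `linear_layer` and `residual_block` are assumptions made for illustration and are not taken from He et al. (2016a). Zeroing the weights of the residual branch turns the block into an identity mapping:

```python
def linear_layer(weights, x):
    """Toy parametric transform F(x) = W x for a list-of-lists matrix W."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def residual_block(weights, x):
    """Shortcut connection: the block outputs x + F(x) instead of F(x)."""
    fx = linear_layer(weights, x)
    return [xi + fi for xi, fi in zip(x, fx)]


x = [1.0, -2.0, 0.5]
zero_weights = [[0.0] * 3 for _ in range(3)]

# With all weights set to zero, F(x) = 0 and the block reduces to the identity.
identity_out = residual_block(zero_weights, x)
```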
One of the most notable properties of ResNets is their robustness to the deletion and reordering of layers Veit et al. (2016); Srivastava et al. (2015). Randomly deleting or reordering multiple layers of a trained ResNet does not significantly decrease the performance of the model. In contrast, the quality of non-residual models, like the VGG family Simonyan and Zisserman (2014), drops to the level of random prediction after these operations. This leads to the hypothesis that models with ResNet-based architectures exhibit ensemble-like behavior in the sense that their performance smoothly correlates with the number of layers. Veit et al. (2016) considered this effect to be the result of an exponentially large number of paths of different lengths formed by skip connections. Although these paths are trained simultaneously, they are weakly dependent, and deleting some layers of the network affects only a small number of paths. From this point of view, the authors proposed to treat ResNets as an exponential ensemble of shallow networks.
While studying this notable property, we assume that for each input object, only a small set of paths makes a significant contribution to the network’s output. This set is determined by the object’s specificity, which can sometimes be interpreted. For example, the presence of certain objects in the picture may trigger the ResNet to transmit the information through skip connections rather than through parametric transforms. This idea leads to the hypothesis that the number of computations required for inference in ResNets can be significantly reduced while preserving or even increasing the efficiency of the network.
To detect this specificity, we propose a Recurrent Set (ReSet) module, which dynamically constructs a route for each input object through layers of a trained ResNet. This route can go through a single module multiple times or skip some modules. This routing is carried out by a ReSet’s part called the controller.
Presumably, this scheme allows the network to learn a set of useful transformations and apply them in a case-specific order, which can lead to better classification quality and a lower number of required operations at the cost of a slightly greater number of parameters.
While we examined only a part of these insights, we hope that this experience will be useful for the scientific community and will help researchers to deepen their understanding of the properties of ResNet-based models.
As our main contribution, we proposed the ReSet block, which is more flexible than SkipNet Wang et al. (2018). However, SkipNet and ReSet have different motivations and outcomes: SkipNet’s controller implicitly groups objects by complexity, while our model groups objects by their semantic similarity. For example, one can use routing scores as a short set of discrete image features (or embeddings). Also, SkipNet can only skip layers and cannot add new transformations, while ReSet can learn new features by combining learned layers and thereby increase the quality. In our work, we described different approaches to ReSet’s construction and provided many experiments with different settings. We believe that this experience will be useful for future research on ResNets and neural networks with dynamic routing.
In this paper, we denote our contributions as follows:

We propose a ReSet block that dynamically selects and applies transformations to the input data from a set of learnable layers (computational units).

We design the ReSet38 neural network by incorporating the ReSet module into ResNet38, and obtain better results on benchmarks while keeping the same number of parameters.

We observe that the proposed ReSet module groups objects into certain path patterns by their semantic similarity (not only by complexity, as in the models described above), which opens a wide range of possible applications of the obtained path vectors (e.g. using them as image embeddings).
2 Related Work
Various recent studies on ResNets revealed many intriguing properties of neural networks equipped with skip connections.
The originally proposed ResNet is organized as three groups of layers called stages. Each stage is a sequence of a few blocks of layers with a skip connection. Between the stages, the network downsamples the image to reduce its spatial dimensions.
In addition to Veit et al. (2016) and Srivastava et al. (2015), Greff et al. (2016) proposed another explanation of ResNets’ robustness to deletion and reordering. The authors provided experiments showing that a ResNet does not learn completely new representations within one stage; instead, its blocks refine the features extracted by the previous layer. A new level of representation is obtained only after the downsampling performed between successive stages. This process is called unrolled iterative estimation or iterative feature refinement Jastrzebski et al. (2017). Later, Huang et al. (2017a) summarized the similarities between boosting and iterative feature refinement.
These properties of residual networks were implicitly exploited by many researchers, resulting in a great number of ResNet modifications that significantly reduce the number of parameters and the number of operations during inference without a decrease in performance.
Initially, Graves (2016) introduced the Adaptive Computation Time (ACT) mechanism for recurrent neural networks. Since some objects are easier to classify than others, this mechanism dynamically selects the number of iterations for a recurrent neural network while favoring a smaller number of iterations. Next, Figurnov et al. (2017) developed this idea further and incorporated ACT into a ResNet-based architecture, where it decides whether to evaluate or to skip certain layers. They also introduced the Spatially Adaptive Computation Time (SACT) mechanism, which applies ACT to each spatial position of the residual block. The authors reported up to a 45% reduction in the number of computations while preserving almost the same efficiency as the original ResNets.
ShaResNet Boulch (2017) shares the weights of convolutional layers between residual blocks within one stage. This model is trained in exactly the same fashion as the original ResNet and reduces the number of parameters by up to 40% without a significant loss in quality. Also, Jastrzebski et al. (2017) successfully reduced the number of weights threefold by sharing all residual blocks after the fifth within each stage. They found that sharing batch normalization statistics leads to low model efficiency, and resolved this issue by keeping a unique set of batch normalization statistics and parameters for each iteration.
SkipNet Wang et al. (2018) dynamically routes objects through a ResNet, skipping some layers. The authors developed a Hybrid Reinforcement Learning technique to reward the network for skipping blocks that have a small impact on the output. In this way, they reformulated the routing problem in the context of sequential decision making, reducing the total number of computations by 40% on average. This work is similar to ours in the idea of dynamically selecting computational units. However, SkipNet can neither change the order of blocks nor select a single block more than once.
Finally, Leroux et al. (2018) combined the ideas of iterative estimation, adaptive computation time and weight sharing and proposed an Iterative and Adaptive Mobile Neural Network. The model is a compact network that consists of recursive blocks with ACT, and it reduces both the size and the computational cost compared with ResNets. The authors avoided redundancy as much as possible by using only one block, which recursively computes transformations until the ACT module decides to proceed to the next stage.
3 Learning the Routing Policy
Following the work of Wang et al. (2018), we formulate the routing as a policy learning problem. We propose a model that at each iteration selects a Computational Unit (CU) from a set of units and applies it to the data by computing
(1) 
The routing policy selects a Computational Unit by calculating a distribution vector that assigns a probability to the selection of each Computational Unit. Each dimension corresponds to one of the available CUs, where a higher value indicates a higher contribution of the corresponding unit. Then, we sample the index of a Computational Unit from this distribution:
(2) 
Since the selection of a computational unit is a non-differentiable function, we can consider a “soft” selection mechanism that applies the policy by computing the expected output value with respect to this distribution:
(3) 
Note that when the weight of a Computational Unit is zero, its evaluation can be skipped, saving computational time. With a slight abuse of notation, we further use the same notation for the hard and the soft selection, and the type of selection mechanism that is used will be clear from the context.
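The difference between the two selection mechanisms can be illustrated with a short sketch. It assumes that the output of the selected unit (or the weighted sum of unit outputs) is added to the input through a residual connection; the function names and the toy units are hypothetical and only serve as an illustration.

```python
import random


def hard_select(units, weights, x, rng):
    """Sample one unit index from the weights and apply only that unit."""
    index = rng.choices(range(len(units)), weights=weights, k=1)[0]
    return [xi + ui for xi, ui in zip(x, units[index](x))], index


def soft_select(units, weights, x):
    """Add the expected unit output to x, skipping units with zero weight."""
    out = list(x)
    evaluated = 0
    for unit, w in zip(units, weights):
        if w == 0.0:
            continue  # a zero weight means the unit need not be evaluated
        evaluated += 1
        out = [oi + w * ui for oi, ui in zip(out, unit(x))]
    return out, evaluated


# Three toy units acting element-wise on a feature vector.
units = [
    lambda v: [2.0 * a for a in v],
    lambda v: [-a for a in v],
    lambda v: [a * a for a in v],
]
x = [1.0, 2.0]
weights = [0.5, 0.0, 0.5]
soft_out, n_evaluated = soft_select(units, weights, x)
hard_out, chosen = hard_select(units, weights, x, random.Random(0))
```

In this toy case the soft selection evaluates two of the three units, while the hard selection always evaluates exactly one.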
We compute the weights using the Controller network. The Controller takes the current representation of an object and the current state of the controller:
(4) 
where the outputs are logits that can be transformed into a probability distribution by applying a Softmax function:
(5) 
In the general case, soft scoring with a dense weight vector requires evaluating all CUs, which demands a lot of computational resources.
The soft selection is differentiable but computationally more expensive. Hard selection is a non-differentiable operation and requires some tricks to pass the gradient through sampling, such as the REINFORCE algorithm, the Gumbel trick, or Hybrid Reinforcement Learning.
3.1 REINFORCE
REINFORCE is a general-purpose algorithm for estimating the gradient with respect to the policy Williams (1992), and is based on a Monte Carlo estimation of path-dependent gradients:
(6) 
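As a hedged illustration of Eq. 6, the sketch below applies the score-function estimator to a single categorical decision with a Softmax policy. The reward values, the sample count and all function names are assumptions chosen for the example; the closed-form gradient is included only to check the estimate.

```python
import math
import random


def softmax(logits):
    """Numerically stable Softmax."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, reward, num_samples, rng):
    """Monte Carlo estimate of d E[reward(i)] / d logits, i ~ Softmax(logits).

    Uses the identity grad E[R] = E[R * grad log p(i)], where
    d log p(i) / d logit_j = 1[i == j] - p_j for a Softmax policy.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        i = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        r = reward(i)
        for j in range(len(logits)):
            grad[j] += r * ((1.0 if i == j else 0.0) - probs[j])
    return [g / num_samples for g in grad]


def exact_gradient(logits, reward):
    """Closed-form gradient of the expected reward, for comparison."""
    probs = softmax(logits)
    expected = sum(p * reward(i) for i, p in enumerate(probs))
    return [p * (reward(j) - expected) for j, p in enumerate(probs)]


logits = [0.0, 1.0, -1.0]
toy_reward = lambda i: [1.0, 0.0, 0.5][i]  # hypothetical per-unit reward
estimate = reinforce_gradient(logits, toy_reward, 20000, random.Random(0))
exact = exact_gradient(logits, toy_reward)
```

Even for this single decision the estimate needs thousands of samples to settle; a route made of many consecutive decisions multiplies the number of outcomes to be sampled.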
Its known issue is the notoriously high variance of the gradients, which makes it difficult to apply. Several approaches to variance reduction were proposed in Greensmith et al. (2004). However, as our experiments revealed, they are not powerful enough in the context of our problem. This result can be explained by the exponentially large number of routes that can be used for an object, which requires many simulations in each pooling node for a sufficiently precise gradient estimate.
3.2 Gumbel-Softmax Estimator
Proposed in Jang et al. (2016), this technique performs a continuous relaxation of a one-hot discrete sampled vector:
(7) 
Here, the relaxation takes the logits for each option, a constant called the temperature, and samples from the Gumbel distribution, which can be computed as:
(8) 
This relaxation requires a soft selection. However, as the temperature tends to zero, the weight vector becomes almost one-hot, effectively selecting only a single unit.
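The relaxation and the effect of the temperature can be sketched as follows. This is a minimal illustration of the standard Gumbel-Softmax construction of Jang et al. (2016); the logits, the two temperature values and the function names are assumptions made for the example.

```python
import math
import random


def sample_gumbel(rng):
    """Draw g = -log(-log(u)) with u ~ Uniform(0, 1)."""
    u = rng.random()
    # Keep u strictly inside (0, 1) so that both logarithms are defined.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, noise, temperature):
    """Relaxed one-hot sample: Softmax((logits + noise) / temperature)."""
    scaled = [(l + g) / temperature for l, g in zip(logits, noise)]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [0.5, 1.5, -0.3]
noise = [sample_gumbel(rng) for _ in logits]

# The same noise is reused so that only the temperature differs.
warm = gumbel_softmax(logits, noise, temperature=1.0)
cold = gumbel_softmax(logits, noise, temperature=0.05)
```

For a fixed noise sample, lowering the temperature keeps the same winning unit but moves more of the probability mass onto it.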
3.3 Straight-Through Gumbel-Softmax Estimator (ST)
Continuous relaxations of one-hot vectors are still not sparse (although all components except one are close to 0). To sample a one-hot vector, Jang et al. (2016) proposed to discretize the relaxed sample by taking its arg max in the forward pass, and to use the continuous approximation in the backward pass by taking the gradient of the relaxed sample instead. Thus, gradients become biased. Nevertheless, the bias approaches zero as the temperature tends to zero. This technique was called the Straight-Through (ST) Gumbel Estimator.
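A minimal sketch of the straight-through idea, written without an autograd framework: the forward pass returns a one-hot vector, while the backward pass propagates an upstream gradient through the Jacobian of the relaxed sample. The function names, the fixed noise values and the upstream gradient are assumptions used only for illustration.

```python
import math


def softmax(values):
    """Numerically stable Softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def st_forward(logits, noise, temperature):
    """Forward pass: return the one-hot arg max and the relaxed sample."""
    soft = softmax([(l + g) / temperature for l, g in zip(logits, noise)])
    winner = soft.index(max(soft))
    hard = [1.0 if i == winner else 0.0 for i in range(len(soft))]
    return hard, soft


def st_backward(soft, upstream, temperature):
    """Backward pass: differentiate the relaxed sample, not the arg max.

    d soft_i / d logit_j = soft_i * (1[i == j] - soft_j) / temperature,
    so the gradient for logit_j is soft_j * (upstream_j - <upstream, soft>).
    """
    dot = sum(u * s for u, s in zip(upstream, soft))
    return [s * (u - dot) / temperature for s, u in zip(soft, upstream)]


# Hypothetical logits, a fixed noise sample, and an upstream gradient.
hard, soft = st_forward([0.5, 1.5, -0.3], [0.1, -0.2, 0.3], temperature=0.5)
grad_logits = st_backward(soft, upstream=[1.0, 0.0, -1.0], temperature=0.5)
```

The forward output is exactly sparse, so only the selected unit has to be evaluated, while the gradient that reaches the logits is the one of the dense relaxation.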
3.4 Hybrid Reinforcement Learning
After training the controller with a sparse policy (i.e. only one computational unit with a nonzero score at each iteration), we propose to apply Hybrid Reinforcement Learning, introduced in Wang et al. (2018), to reduce the number of CU evaluations. For this, we add a separate CU that always returns zero. Combined with the residual connection, the selection of this unit produces an identity mapping. To promote the usage of this unit, we add a regularizer:
(9) 
Eq. 9 involves the total number of the controller’s invocations, the index of the current invocation, a reward for skipping the computation of a CU (i.e. for using the zero unit), which can depend on the invocation index or be a constant, and an indicator of whether the computation was skipped. Maximizing this reward encourages the network to omit the evaluation of CUs, reducing the total number of operations.
4 ReSet
In this section, we describe the full architecture of dynamic routing that we call a Recurrent Set (ReSet) module.
The model consists of a Controller and a pool (set) of Computational Units (CUs). For each object, the controller assigns scores to the computational units and evaluates each unit with a nonzero weight. A weighted sum of the results is added to the original representation, as described in Eq. 3. We repeat this operation a fixed number of times, and then pass the result to the next layer.
The ReSet module performs case-specific transformations of input objects, clustering their internal representations and passing them through different sets of computational units. Motivated by ResNets’ robustness to the deletion and shuffling of layers, the proposed recurrent structure has similarities with iterative refinement Greff et al. (2016). The proposed model may evaluate the same computational unit multiple times, which gives more flexibility than iterative refinement. The model with a soft selection is evaluated as follows:
(10)  
(11)  
(12) 
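The evaluation loop of Eqs. 10 to 12 can be sketched as follows: at each iteration the controller produces logits and a new state, a Softmax turns the logits into weights, and the weighted sum of unit outputs is added to the current representation. The toy controller, the toy units and all names below are hypothetical stand-ins for the convolutional components.

```python
import math


def softmax(values):
    """Numerically stable Softmax."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def reset_module(x, units, controller, state, num_iterations):
    """Soft-selection loop: score the units, mix their outputs, add to x."""
    routes = []
    for _ in range(num_iterations):
        logits, state = controller(x, state)
        weights = softmax(logits)
        update = [0.0] * len(x)
        for unit, w in zip(units, weights):
            if w == 0.0:
                continue  # units with zero weight are not evaluated
            update = [a + w * b for a, b in zip(update, unit(x))]
        x = [a + b for a, b in zip(x, update)]  # residual connection
        routes.append(weights)
    return x, routes


def toy_controller(x, state):
    """Hypothetical controller: logits depend on the mean feature and step."""
    mean = sum(x) / len(x)
    return [mean, -mean, 0.1 * state], state + 1


units = [
    lambda v: [0.5 * a for a in v],
    lambda v: [-0.5 * a for a in v],
    lambda v: [0.0 for _ in v],  # zero unit: selecting it gives the identity
]
out, routes = reset_module([1.0, -0.5], units, toy_controller, 0, 5)
```

The list `routes` holds one score vector per iteration; it is the kind of per-object routing pattern that can later be compared across objects.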
We also add an entropy regularizer consisting of two terms: one promotes different routes for different objects (high entropy across cases), and the other promotes a deterministic route for each individual object (low entropy for each case):
(13) 
Eq. 13 is written in terms of the model parameters and the entropy of the score distributions.
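One possible reading of the two-term entropy regularizer is sketched below: the entropy of the batch-averaged score vector is rewarded, and the average entropy of the individual score vectors is penalized. The weights `alpha` and `beta`, the sign convention (a quantity to be minimized) and the function names are assumptions, not values taken from this work.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_regularizer(batch_scores, alpha=1.0, beta=1.0):
    """Penalty that is low when routes differ across objects but are
    near-deterministic for each object.

    alpha and beta are hypothetical weights of the two terms.
    """
    n = len(batch_scores)
    k = len(batch_scores[0])
    mean_scores = [sum(row[j] for row in batch_scores) / n for j in range(k)]
    across_cases = entropy(mean_scores)                       # to be maximized
    per_case = sum(entropy(row) for row in batch_scores) / n  # to be minimized
    return beta * per_case - alpha * across_cases


diverse = [[1.0, 0.0], [0.0, 1.0]]    # deterministic and different routes
collapsed = [[1.0, 0.0], [1.0, 0.0]]  # deterministic but identical routes
uniform = [[0.5, 0.5], [0.5, 0.5]]    # undecided routes

penalties = [entropy_regularizer(b) for b in (diverse, collapsed, uniform)]
```

Under these assumptions the batch with distinct deterministic routes receives the lowest penalty, while both the collapsed and the undecided batches receive a higher one.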
We tried two different architectures for a controller: a convolutional (CNN, Figure LABEL:fig:cnn_ctr) and a recurrent (RNN, Figure LABEL:fig:rnn_ctr) controller.
The CNN controller is a shallow convolutional network without biases, with global average pooling and a fully-connected layer with a Softmax at the end. As recommended in Jastrzebski et al. (2017), we store and recompute unique batch normalization statistics and parameters for each iteration. The RNN controller uses an LSTM cell Hochreiter and Schmidhuber (1997) to perform sequential decision making within one ReSet module. The diagram of a ReSet module is shown in Figure 1.
Further, we designed a ReSet architecture (Figure 2) that is obtained by replacing the stages of ResNet38 with ReSet modules, allowing dynamic routing through a set of Computational Units represented by ResNet blocks. The model consists of 3 sequential parts:

A preliminary block containing a convolutional layer, followed by a batch normalization, ReLU, and a max pooling.

Three ReSet modules with a similar structure. Each module contains a pool of five Computational Units and a zero-connection. Each CU is structured as an original ResNet block, i.e. a small learnable network formed by two convolutional layers with batch normalizations and ReLU activations. We iterate each ReSet module five times with these CUs to produce the output. We also add an extra convolutional layer with a stride between the groups to downsample spatial dimensions.

Final global average pooling followed by a fullyconnected classifier.
In general, ReSet’s pool of Computational Units consists of instances of a ResNet block. This decision allows replacing a whole stage of the original network with one ReSet module with a certain number of iterations.
5 Experiments
We used CIFAR-10 Krizhevsky and Hinton (2009) in the reported experiments. It contains 50,000 images in the training set and 10,000 images in the test set, each of which has size 32×32 and belongs to one of 10 classes. Also, to measure the effect of overfitting the hyperparameters, we used the additional CIFAR-10.1 test set of about 2,000 images introduced in Recht et al. (2018); Torralba et al. (2008). We compare our results with the ResNet38 architecture He et al. (2016a).
5.1 CNN controller vs RNN controller
To test the controller’s ability to route objects, we conducted the following experiment. We took a pretrained ResNet38 model, and put blocks from its stages into corresponding pools of computational units of three ReSet blocks. We fixed parameters of all computational units, and trained only a controller on training batches of size using Adam Kingma and Ba (2015) optimizer with a learning rate . We used the He initialization of He et al. (2016a) and an L2 regularizer on the weights along with gradient clipping.
In this scenario, a good controller should find a deterministic policy corresponding to a sequential evaluation of a ResNet. We observed close-to-uniform, ineffective policies with the CNN controller, while the RNN controller was able to find a sequential policy.
5.2 Learning ReSet38 from scratch
To validate the idea of dynamic recurrent routing, we conducted experiments with shortened ResNets; the results are shown in Table 1. They indicate that the ReSet model found an effective routing and better weights than the original ResNet.
Arc  Model   Acc, C10  Acc, C10.1  Gap  Δ Acc, C10  Δ Acc, C10.1
115  ResNet  88.7      79.5        9.2  -           -
115  ReSet   88.9      79.5        9.4  0.2         0
151  ResNet  87        78.1        8.9  -           -
151  ReSet   88        78.7        9.3  1           0.6
511  ResNet  86.5      78          8.5  -           -
511  ReSet   87.1      78.2        8.9  0.6         0.2
In the main part of the experiments, we tried 3 different selection mechanisms (scorers) for controllers: Softmax, Gumbel-Softmax and the Gumbel Straight-Through estimator. We tried the Adam and SGD with Momentum optimizers with a stepwise learning rate decrease (Adam: multiply by 0.95 every 1000 batches, SGD: as in the original paper He et al. (2016a)). Also, we added an entropy regularizer to promote the selection of different routes by different objects. We varied the regularizer’s weight, reducing it by the end of the training. Results are presented in the first section of Table 2.
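The stepwise decay used with Adam (multiply by 0.95 every 1000 batches) can be written as a one-line schedule. The function name and the base learning rate in the usage line are hypothetical.

```python
def stepwise_lr(base_lr, batch_index, factor=0.95, every=1000):
    """Multiply the learning rate by `factor` once every `every` batches."""
    return base_lr * factor ** (batch_index // every)


# Hypothetical base learning rate, used only to show the shape of the decay.
schedule = [stepwise_lr(0.1, b) for b in (0, 999, 1000, 2500)]
```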
N   Model                      Acc, C10  Acc, C10.1  Gap   Time  Δ Acc, C10  Δ Acc, C10.1
1   ResNet38 (baseline)        92.8      83.8        9     -     -           -
2   SM                         91.9      83.5        8.4   1.92  0.9         0.3
3   SM, Adam                   89.9      81.8        8.1   2     2.9         2
4   SM+ENTR                    92.1      85.1        7     1.92  0.7         1.3
5   SM+ENTR, Adam              89.3      79.5        9.8   2     3.5         4.3
6   GSM+ENTR                   89        80.8        8.2   1.92  3.8         3
7   GST                        60.5      54.1        6.4   1.08  32.3        29.7
8   SM pretr., Contr.          91.8      81.9        9.9   2     1           1.9
9   GSM pretr. SM, Contr.      91.8      83.5        8.3   2     1           0.3
10  GST pretr. SM, Contr.      66.5      54.1        12.4  1     26.3        29.7
11  GST pretr. GSM, Contr.     70.1      57.3        12.8  1     22.7        26.5
12  GST pretr. GSM             68.3      56          12.3  1     24.5        27.8
13  HybRL pretr GST            60.7      51.3        9.4   0.88  32.1        32.5
14  HybRL pretr GST            48.8      42.2        6.6   0.79  44          41.6
15  HybRL pretr GST            35.3      30.4        4.9   0.71  57.5        53.4
16  7 then fine-tune all       89.8      79.9        9.9   2.08  3           3.9
17  GST pretr., batch rep. x4  72.2      60.9        11.3  1.08  20.6        22.9
18  SM pretr., top-k 1         92.6      83.9        8.7   1.31  0.2         0.1
19  SM pretr., top-k 2         92.7      84.1        8.6   1.92  0.1         0.3
20  SM pretr., top-k 4         92.1      83.7        8.4   1.69  0.7         0.1
From these results we see that ReSet with a Softmax scorer and an entropy regularizer performs on par with the original ResNet on the standard CIFAR-10 test set (used for validation here) and outperforms it on the CIFAR-10.1 test set. We decided to use only the SGD with Momentum optimizer in further experiments because it outperformed Adam (with different hyperparameters) in almost all runs.
5.3 Pretraining and relaxation pipelines
Our attempts to train a ReSet that produces sparse scores from scratch have failed (GST in Table 2). We assume that this can be explained by the high variance of the gradients, which is due to the large number of possible controller choices (to be precise, choices). To optimize the whole architecture, we used the following relaxation pipeline:

take pretrained ResNet38/ReSet38 with Softmax

freeze all parameters except controller’s (including batch normalization running statistics)

change controller’s type to Gumbel Softmax (still not sparse, but closer to GumbelST)

train controller on batches

change controller’s type to StraightThrough Gumbel Softmax

train controller until policy converged

add Hybrid Reinforcement Learning term to encourage reduction of operations

train the network until a trade-off between efficiency and performance is reached
Results are presented in the second section of Table 2. We can see that the relaxation pipeline helps models to achieve better results compared to learning from scratch. The other important result is that while Hybrid Reinforcement Learning allows controlling computational cost, Gumbel ST estimator appeared to be too biased, leading to poor results.
5.4 Alternative techniques
We also tried several alternative techniques. The best results achieved are presented in the third section of Table 2.
5.4.1 Two-phase learning
We separated the training procedure into 2 phases: controller’s and CUs’ learning, and then alternated training between them. Switching from one phase to another was based on several different rules: constant number of iterations, convergence of the current phase, progressive time for one of the phases.
This strategy did not give noticeable improvements, and straightforward procedures outperformed it. We think that the main reason for this is the permanently changing loss surface for the network’s parameters, which led to the divergence of the learning process.
5.4.2 Incremental learning
Take a pretrained network, freeze all its parameters, then unfreeze and train them incrementally, starting from last layers. In other words, unfreeze last layers, train the network on some batches, unfreeze last stage, train on another set of batches, unfreeze last but one stage, and so on.
The proposed modification showed almost the same results as two-phase learning, and our conclusions are the same too.
5.4.3 Top-k pools choice (Softmax Straight-Through)
Compute scores with a Softmax-based controller and take the k largest of them, setting the others to zero. In this case, computational units receive "unequal" updates, which leads to unstable learning. A detailed analysis and a comparison with other techniques can be found in Jang et al. (2016).
This technique showed almost the same results as the baseline. Notably, the efficiency is not a monotonic function of k. This can be explained by the local properties of gradient descent.
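The top-k choice described in this subsection can be sketched in a few lines. The function name is hypothetical, and the sketch does not renormalize the surviving scores, since the text only states that the remaining scores are set to zero.

```python
def top_k_scores(scores, k):
    """Keep the k largest scores and set all the others to zero."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]


scores = [0.1, 0.5, 0.4]
top1 = top_k_scores(scores, 1)
top2 = top_k_scores(scores, 2)
```

Units whose score is zeroed are neither evaluated nor updated for that object, which is the source of the "unequal" updates mentioned above.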
5.5 Additional experiments
We also ran some additional experiments with the size of the computational pool and the number of iterations of a ReSet layer. First, we used a shortened ReSet38 as a base model: we took two blocks (2 convolutions with batch normalization and ReLU) instead of the second and the third ReSet layers and modified the first ReSet layer as:

CUs in pool and iterations (#pool #iters)

CUs in pool and iterations (#pool #iters)

CUs in pool and iterations (#pool #iters)

CUs in pool and iterations (#pool #iters)

CUs in pool and iterations (#pool #iters)
Results of the proposed modifications and of the sequential model taken as a baseline are presented in Table 3. In these experiments, the RNN controller successfully exploits additional components and computational resources to outperform the baseline model.
N  Model                            Acc, C10  Acc, C10.1  Gap   Time  Δ Acc, C10  Δ Acc, C10.1
1  ResNet baseline, 2 CU & 2 iters  86.1      75          11.1  -     -           -
2  ReSet, 6 CU & 2 iter             87.2      76.9        10.3  1     1.1         1.9
3  ReSet, 3 CU & 2 iters            86.5      78.4        8.1   1     0.4         3.4
4  ReSet, 2 CU & 3 iters            86.4      77          9.4   1.08  0.3         2
5  SM, 2 CU & 6 iters               87.3      77          10.3  1.33  1.2         2
5.6 Examining learned routes
In this section, we visualize routing scores for a few images and analyze the obtained patterns. For each considered image, we found the five images with the most similar routing pattern using the Manhattan distance. The results are shown in Figure 4. On Stage 1, the distribution of scores collapses: presumably, the network extracts basic patterns there, and on further stages the distribution of paths (routes) has a much higher entropy (if there is no regularizer promoting a deterministic routing).
To demonstrate this, and to show that routing is indeed dynamic, we measured the standard deviations of the components of the score vectors produced by model 8 in Table 2 for the CIFAR-10 test images. The result is shown in Figure 3.
These figures suggest that ReSet’s controller can distinguish pictures by some intrinsic semantic properties.
Analyzing the results, it becomes apparent that the network learned to pass images of different classes through different routes, where paths that are similar in the L1 metric are assigned to semantically similar images. Also, according to the obtained patterns, routing in the first stages is almost identical for all objects. However, the last layers use different routes for different classes, indicating that the network uses the last iterations to refine its predictions.
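The search for images with similar routing patterns can be sketched as a nearest-neighbour lookup under the Manhattan (L1) distance. The function names and the toy score vectors are hypothetical; in practice each vector would be the concatenation of the routing scores of one image.

```python
def manhattan(a, b):
    """L1 distance between two routing-score vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_routes(query, routes, k=5):
    """Indices of the k score vectors closest to `query` in L1 distance."""
    order = sorted(range(len(routes)), key=lambda i: manhattan(query, routes[i]))
    return order[:k]


# Toy routing patterns for four images (hypothetical values).
routes = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
closest = nearest_routes([1.0, 0.0, 0.0], routes, k=2)
```

If the query image is itself part of `routes`, it is returned first with distance zero and can be dropped by the caller.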
5.7 Policy Evolution
In this section we visualize and analyze the evolution of policy for certain models. In Figure LABEL:fig:policy_evolution we show average score of different Computational Units at different stages and iterations of the ReSet model.
From this figure, we can conclude that the learned policy can behave differently, depending on the set of the hyperparameters. For example, in Figure LABEL:fig:sm_entr_policy, the learned policy was recurrent, since one block was selected at each iteration with almost certain probability. On stage 1, CU 2 learned to iteratively refine the result. In particular, the possibility to use a recurrent strategy was exploited in Leroux et al. (2018). Also, some stages develop a mixed behaviour, selecting different CUs for different objects. As suggested in the previous experiment, this usually happens at the last stage.
In Figure LABEL:fig:rnn_pretrained_policy and LABEL:fig:rnn_entr_pretrained_policy we see that on stage 2 the policy is completely different from the sequential one (as in the original ResNet38), yet it leads to the same result. This can be treated as new evidence for the unrolled iterative estimation hypothesis Jastrzebski et al. (2017). Notably, the entropy regularizer reduces the variance of the scores (and hence the variance of the gradients) compared with the model without it.
6 Conclusion
In this work, we introduced a ReSet layer that performs dynamic routing through the set of independent Computational Units (transformations). The proposed model achieved better classification results compared with the ResNet38 model, having a comparable number of parameters. The model learned to separate paths of images from different classes and produced separate Computational Units for the final stage of the network to refine its predictions.
The obtained results open a wide range of possible applications of the proposed mechanism of dynamic recurrent routing with ReSet. For example, ReSet could be used in Natural Language Processing, where one would expect the ReSet to process different parts of the sentence with different Computational Units. An additional direction of research is the properties of image’s controller scores vectors, which could be considered as corresponding embeddings for pictures.
The work was performed according to the Russian Science Foundation Grant 177120072.
References
 Boulch (2017) Alexandre Boulch. Sharesnet: reducing residual network parameter number by sharing weights. arXiv preprint arXiv:1702.08782, 2017.
 Figurnov et al. (2017) Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry P Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. In CVPR, volume 2, page 7, 2017.
 Graves (2016) Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.
 Greensmith et al. (2004) Evan Greensmith, Peter L Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
 Greff et al. (2016) Klaus Greff, Rupesh K Srivastava, and Jürgen Schmidhuber. Highway and residual networks learn unrolled iterative estimation. arXiv preprint arXiv:1612.07771, 2016.

 He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016a.
 He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016b.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Huang et al. (2017a) Furong Huang, Jordan Ash, John Langford, and Robert Schapire. Learning deep resnet blocks sequentially using boosting theory. arXiv preprint arXiv:1706.04964, 2017a.
 Huang et al. (2017b) Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, page 3, 2017b.
 Jang et al. (2016) Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with gumbelsoftmax. arXiv preprint arXiv:1611.01144, 2016.
 Jastrzebski et al. (2017) Stanisław Jastrzebski, Devansh Arpit, Nicolas Ballas, Vikas Verma, Tong Che, and Yoshua Bengio. Residual connections encourage iterative inference. arXiv preprint arXiv:1710.04773, 2017.
 Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. International Conference for Learning Representations, 2015.
 Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
 Leroux et al. (2018) Sam Leroux, Pavlo Molchanov, Pieter Simoens, Bart Dhoedt, Thomas Breuel, and Jan Kautz. Iamnn: Iterative and adaptive mobile neural network for efficient image classification. arXiv preprint arXiv:1804.10123, 2018.
 Recht et al. (2018) Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do cifar10 classifiers generalize to cifar10? 2018. https://arxiv.org/abs/1806.00451.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Srivastava et al. (2015) Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

 Torralba et al. (2008) Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
 Veit et al. (2016) Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, pages 550–558, 2016.
 Wang et al. (2018) Xin Wang, Fisher Yu, ZiYi Dou, Trevor Darrell, and Joseph E. Gonzalez. Skipnet: Learning dynamic routing in convolutional networks. In The European Conference on Computer Vision (ECCV), September 2018.
 Williams (1992) Ronald J Williams. Simple statistical gradientfollowing algorithms for connectionist reinforcement learning. In Reinforcement Learning, pages 5–32. Springer, 1992.