1 Introduction
In recent years, deep learning models have become significantly deeper and more computationally expensive. As evident from the ImageNet competition results [24, 31, 36, 14], increasing the depth of computer vision models indeed leads to improved results. However, such expensive models are unsuitable in many settings. One approach to reducing this cost is to use only as much computation as needed for the particular input.
Adaptive Computation Time (ACT) [12] is a recently proposed mechanism that adjusts the computational depth of deep models: the harder the object is, the more iterations it is processed for. This mechanism is end-to-end trainable, problem-agnostic and does not require explicit supervision for the number of computational iterations. It has been applied to recurrent networks for the problems of text modelling [12] and reasoning [30]. Spatially Adaptive Computation Time (SACT) [9] applies the ACT mechanism to the spatial positions of Residual Networks [15], a popular convolutional neural network model. This results in computational savings and interpretable computation time maps that highlight the regions of the image that the network considers relevant to the task at hand.
In this paper, we introduce Probabilistic Adaptive Computation Time (PACT), a probabilistic model with discrete latent variables that specify the number of iterations to execute. We define a prior on the latent variables that encodes the desired trade-off between speed and accuracy. Then, we perform amortized maximum a posteriori (MAP) inference to find the proper amount of computation for a given object. The ACT mechanism can be seen as an ad-hoc relaxation of the PACT model with a specific prior distribution. A significant downside of the ACT relaxation is that it provides a discontinuous objective. Since the reparameterization trick is only valid for continuous objectives, ACT cannot be incorporated into stochastic models trained with reparameterization, such as the variational autoencoder [22]. We extend variational optimization [34, 35], a method for MAP inference, to handle intractable expectations using REINFORCE or the reparameterization trick. For discrete latent variables, we propose to apply the Concrete relaxation [26, 18] and then perform the reparameterization. We call the obtained method stochastic variational optimization and apply it to the PACT model. Evaluation on ResNets shows that training using the relaxation outperforms the REINFORCE-based method and matches the performance of the heuristic ACT. We show that the relaxation allows training models with a large number of discrete latent variables. Additionally, the models trained with the proposed relaxation can be evaluated with a simple deterministic approach that reduces the memory consumption compared to ACT. Evaluating the ACT models in the same manner decreases their performance.
2 Background
Notation. Let $\mathbb{E}_{p(x)} f(x)$ denote the expectation of a function $f(x)$ over a probability distribution $p(x)$, $\sigma(x) = \frac{1}{1 + \exp(-x)}$ the sigmoid function, $\operatorname{logit}(p) = \log \frac{p}{1 - p}$ the logit function, and $[\,\text{cond}\,]$ the step function that is equal to $1$ if cond is true and $0$ otherwise. Also, let $z_{1:k}$ be a shorthand notation for $(z_1, \dots, z_k)$.
2.1 Variational Optimization
Variational optimization [34, 35] is a method for maximizing a function $f(x)$ of an argument $x$. The argument can be either continuous or discrete. To apply variational optimization, we choose an auxiliary parametric probability distribution $q_\theta(x)$ over the argument values. The following lower bound on the optimal value holds for any distribution $q_\theta(x)$:

$\max_x f(x) \ge \mathbb{E}_{q_\theta(x)} f(x) = F(\theta)$  (1)

Suppose that the parametric family of distributions can model arbitrary delta-functions. Then the bound is tight and the optimum is achieved when $q_\theta(x) = \delta(x - x^*)$, where $x^* = \arg\max_x f(x)$.
Let us assume that the density $q_\theta(x)$ is a smooth function of $\theta$. Then $F(\theta)$ is a smooth function. Variational optimization further assumes that the expectation in $F(\theta)$ is tractable and maximizes $F(\theta)$ with a gradient-based method. However, it is not applicable when the expectation is intractable. We address this limitation in sec. 3.
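When the expectation in $F(\theta)$ is tractable, the bound (1) can be maximized by plain gradient ascent. The following numpy sketch (our own illustration, not code from the paper; the function values and learning rate are arbitrary) optimizes a softmax-parameterized $q_\theta$ over a ten-element discrete domain and recovers the argmax:

```python
import numpy as np

# Function to maximize over the discrete domain {0, ..., 9} (arbitrary values).
f = np.array([0.1, 0.5, 0.2, 3.0, 0.4, 0.3, 0.2, 0.1, 0.0, 0.6])

theta = np.zeros(10)  # logits of the auxiliary distribution q_theta (softmax)
lr = 0.5
for _ in range(500):
    q = np.exp(theta - theta.max())
    q /= q.sum()
    F = np.dot(q, f)          # F(theta) = E_{q_theta} f(x), tractable here
    grad = q * (f - F)        # exact gradient of F w.r.t. the softmax logits
    theta += lr * grad

q = np.exp(theta - theta.max())
q /= q.sum()
best = int(np.argmax(q))      # q collapses towards a delta at argmax f
```

As expected, $q_\theta$ collapses towards a delta-function at the maximizer, making the bound (1) tight.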
2.2 Variational Optimization for Probabilistic Models
Consider a discriminative probabilistic model with latent variables, $p(y, z \mid x) = p(y \mid x, z)\, p(z)$, where $x$ is the object, $y$ is the target label and $z$ is the latent variable. The prior $p(z)$ encodes our preference for the values of $z$. The maximum a posteriori (MAP) inference problem is to find $z$ that maximizes the density of the posterior distribution $p(z \mid x, y) \propto p(y \mid x, z)\, p(z)$. At training time we know both $x$ and $y$, while at test time we only have $x$ and would like to find $z$. Therefore, we search for $z$ in a parametric form that depends only on $x$, so that we can use it at test time. This can be achieved by performing variational optimization with an auxiliary distribution $q_\phi(z \mid x)$:

$\max_z \log p(y, z \mid x) \ge \mathbb{E}_{q_\phi(z \mid x)} \big[ \log p(y \mid x, z) + \log p(z) \big] = F(\phi)$  (2)

For training, we plug in the ground-truth label $y$ and optimize $F(\phi)$. During testing, we sample $z \sim q_\phi(z \mid x)$ and obtain the distribution over the labels $p(y \mid x, z)$.
Let us analyze a special case of this approach that has been extensively used in the attention models literature [32, 1, 2, 41, 25]. Consider a probabilistic model with a learnable prior $p_\phi(z \mid x)$. We can use the prior as the approximate posterior in variational inference. The corresponding evidence lower bound is

$\mathbb{E}_{p_\phi(z \mid x)} \log p(y \mid x, z) \le \log p(y \mid x)$  (3)

Renaming $p_\phi(z \mid x)$ into $q_\phi(z \mid x)$, we recognize the objective (2) with a uniform prior distribution, $p(z) = \mathrm{const}$ (for a continuous latent variable on an unbounded domain, this prior is improper). Applying the inequality (1), we have $F(\phi) \le \max_z \log p(y \mid x, z)$. Therefore, optimization of $F(\phi)$ corresponds to maximum likelihood inference of the latent variables. On the other hand, the bound (2) allows incorporating an explicit prior distribution over the latent variables and performing MAP inference. This is a crucial requirement for models, such as the one proposed in this paper, that provide an explicit prior distribution.
The objective (2) can also be seen as the evidence lower bound on the marginal likelihood minus the entropy term. Indeed, adding the entropy of $q_\phi(z \mid x)$ to eqn. (2) yields

$\mathbb{E}_{q_\phi(z \mid x)} \big[ \log p(y \mid x, z) + \log p(z) - \log q_\phi(z \mid x) \big] \le \log p(y \mid x)$  (4)
Unlike MAP inference, variational inference provides a distribution over the latent variable. In our case, this is undesirable since we are interested in the single “best” value of the latent variables at test time. To obtain a single value of the variables for evaluation, we could choose the maximum of the approximate posterior. However, this would introduce a gap between the train- and test-time behavior of the model.
2.3 Concrete Distribution and Reparameterization
Suppose that we would like to stochastically optimize the parameters $\theta$ of an intractable expectation $\mathbb{E}_{q_\theta(z)} f(z)$, where $f$ is smooth. The reparameterization trick [22, 37] allows for this, provided that the distribution $q_\theta(z)$ can be reparameterized: we can sample $z \sim q_\theta(z)$ as follows:

$\varepsilon \sim p(\varepsilon), \quad z = g(\theta, \varepsilon)$  (5)

where $g(\theta, \varepsilon)$ is smooth w.r.t. $\theta$ and $\varepsilon$. Then, applying the chain rule, we have:

$\nabla_\theta \mathbb{E}_{q_\theta(z)} f(z) = \mathbb{E}_{p(\varepsilon)} \nabla_\theta f(g(\theta, \varepsilon))$  (6)

This expectation can be approximated using Monte Carlo sampling. The reparameterization trick is most commonly used for the Normal distribution: if $z \sim \mathcal{N}(\mu, \sigma^2)$, then $\varepsilon \sim \mathcal{N}(0, 1)$ and $z = \mu + \sigma \varepsilon$.
Unfortunately, the reparameterization trick cannot be directly applied to discrete random variables, since the corresponding function $g(\theta, \varepsilon)$ is a non-smooth step function. However, it is possible to relax a discrete random variable so that the relaxation becomes reparameterizable. The Concrete distribution [26, 18] is a continuous reparameterizable relaxation of a discrete random variable. For the purposes of this paper, we only consider the relaxation of Bernoulli (binary) random variables. Consider a random variable $Z \sim \mathrm{Bernoulli}(p)$, where $p \in (0, 1)$. We introduce a temperature parameter $\lambda > 0$. The relaxed random variable $\hat{Z} \sim \mathrm{RelaxedBernoulli}(p, \lambda)$ is defined via the following sampling procedure:

$U \sim \mathrm{Uniform}(0, 1), \quad L = \operatorname{logit}(p) + \operatorname{logit}(U)$  (7)

$\hat{Z} = \sigma(L / \lambda)$  (8)
The $\mathrm{RelaxedBernoulli}(p, \lambda)$ distribution has several useful properties [26]. First, the probability of being greater than 0.5 is equal for the $\mathrm{Bernoulli}(p)$ and $\mathrm{RelaxedBernoulli}(p, \lambda)$ random variables. However, the mean value of $\hat{Z}$ is, in general, not equal to $p$. For $\lambda \to 0$, the distribution of $\hat{Z}$ approaches $\mathrm{Bernoulli}(p)$. Next, for $\lambda \le 1$ the density of $\hat{Z}$ does not have modes in the interior of the range. As a result, the samples are typically close to either zero or one, which makes the relaxation work well for our purposes. Importantly for us, when $p \to 0$ or $p \to 1$, the distribution of $\hat{Z}$ approaches a delta-function at 0 or 1, respectively. This means that for extreme values of the probability, the gap between the relaxed and non-relaxed distributions vanishes, regardless of the temperature $\lambda$.
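The sampling procedure (7)–(8) is a few lines of numpy (our own sketch, not code from the paper). It also checks the property that $P(\hat{Z} > 0.5) = p$ holds for any temperature, while the mean of $\hat{Z}$ generally differs from $p$:

```python
import numpy as np

def logit(p):
    return np.log(p) - np.log(1.0 - p)

def relaxed_bernoulli(p, lam, rng, size):
    """Sample RelaxedBernoulli(p, lam) via eqns (7)-(8)."""
    u = rng.uniform(size=size)
    l = logit(p) + logit(u)                 # (7): add logistic noise to logit(p)
    return 1.0 / (1.0 + np.exp(-l / lam))   # (8): tempered sigmoid

rng = np.random.default_rng(0)
p, lam = 0.3, 0.5
z_hat = relaxed_bernoulli(p, lam, rng, size=100_000)

frac_above_half = (z_hat > 0.5).mean()  # matches P(Z = 1) = p for any lam
mean_z_hat = z_hat.mean()               # generally differs from p
```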
3 Stochastic Variational Optimization
Consider the variational optimization objective $F(\theta) = \mathbb{E}_{q_\theta(z)} f(z)$, where $z$ is a latent variable. Stochastic variational optimization estimates the gradient $\nabla_\theta F(\theta)$ stochastically, even when the expectation is intractable. First, we consider the case of a reparameterizable distribution, and then cover the case of discrete distributions. If the distribution $q_\theta(z)$ is reparameterizable, e.g. a Normal distribution, we can perform the reparameterization trick and calculate the stochastic gradients directly. We then apply stochastic gradient optimization methods, resulting in stochastic variational optimization of the objective.
Now, we switch to the case where $q_\theta(z)$ is discrete. One popular method for this type of problem is the REINFORCE [39] training rule:

$\nabla_\theta F(\theta) = \mathbb{E}_{q_\theta(z)} \big[ (f(z) - b)\, \nabla_\theta \log q_\theta(z) \big]$  (9)

where $b$ is a scalar baseline. The expectation can be approximated by Monte Carlo sampling. Although this procedure provides unbiased gradients, the estimate often has an impractically high variance.
We propose to apply the Concrete relaxation to the proposal distribution and then use the reparameterization trick. This results in lower-variance gradients at the cost of a bias. Assume that $z = z_{1:n}$ is a vector of binary latent variables. We decompose the proposal distribution using the chain rule, $q_\theta(z) = \prod_{i=1}^{n} q_\theta(z_i \mid z_{1:i-1})$ (this sidesteps enumeration of all the configurations of $z$ during sampling). We make two assumptions: (1) $f(z)$ is defined and smooth for $z \in [0, 1]^n$; (2) each factor $q_\theta(z_i \mid z_{1:i-1})$ is defined and smooth for $z_{1:i-1} \in [0, 1]^{i-1}$. Then, we can apply the Concrete relaxation with temperature $\lambda$ to each factor (the hat denotes relaxation):

$\hat{q}_\theta(\hat{z}) = \prod_{i=1}^{n} \mathrm{RelaxedBernoulli}\big(\hat{z}_i \mid q_\theta(z_i = 1 \mid \hat{z}_{1:i-1}), \lambda\big)$  (10)

The relaxed objective has the form

$\hat{F}(\theta) = \mathbb{E}_{\hat{q}_\theta(\hat{z})} f(\hat{z})$  (11)

This objective can now be stochastically optimized using the reparameterization trick.
If all the probabilities in the relaxed distribution approach extreme values (0 or 1), the relaxed distribution approaches the non-relaxed one for any temperature $\lambda$. In this case, the value of the relaxed objective $\hat{F}(\theta)$ approaches the value of the original objective $F(\theta)$.
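The variance difference between the two estimators can be seen on a toy one-variable problem. The sketch below (our own construction, not from the paper; the linear $f$ and the temperature are arbitrary choices) estimates $\nabla_\theta \mathbb{E}_{\mathrm{Bernoulli}(\sigma(\theta))} f(z)$ both with REINFORCE (9) and with the relaxed, reparameterized objective (11):

```python
import numpy as np

rng = np.random.default_rng(0)
theta, lam, n = 0.0, 0.5, 200_000
p = 1.0 / (1.0 + np.exp(-theta))          # Bernoulli success probability

f = lambda z: 3.0 * z + 1.0               # defined and smooth on [0, 1]
true_grad = p * (1 - p) * (f(1.0) - f(0.0))

# REINFORCE estimator (9) with a zero baseline.
z = (rng.uniform(size=n) < p).astype(float)
g_reinforce = f(z) * (z - p)              # (z - p) = grad_theta log q_theta(z)

# Relaxed estimator: Concrete sample via (7)-(8), then the chain rule (6).
u = rng.uniform(size=n)
logit_u = np.log(u) - np.log(1.0 - u)
z_hat = 1.0 / (1.0 + np.exp(-(theta + logit_u) / lam))
g_relaxed = 3.0 * z_hat * (1.0 - z_hat) / lam  # f'(z_hat) * d z_hat / d theta
```

The REINFORCE estimate is unbiased but noisy; the relaxed estimate is biased but markedly lower-variance, matching the trade-off described above.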
4 Probabilistic Adaptive Computation Time
First, we introduce the adaptive computation block. It is a computation module that chooses the number of iterations depending on the input. Depending on the specific type of the latent variables, we obtain a discrete, thresholded or relaxed block. Importantly, the blocks are compatible in the sense that one can train a model with one type of block and then switch to another during evaluation. Then, we present a probabilistic model that incorporates the number of iterations as a latent variable into a discriminative model. The prior on the latent variable favors using fewer iterations. Finally, we perform MAP inference over the number of iterations via stochastic variational optimization.
Discrete adaptive computation block performs $z$ iterations of computation, where $z \in \{1, \dots, L\}$ is a discrete latent variable. Let us assume that the $l$-th iteration outputs a value $u^l$ (we use upper indices to index the iterations in a block), and that all $u^l$ have the same shape. The output of the block is $u^z$, the output of the $z$-th iteration. To perform optimization over the discrete latent variable $z$, we introduce a distribution $q_\theta(z)$ with parameters $\theta$. Denote by $z^l \in \{0, 1\}$ the halting unit of the $l$-th iteration: when it is equal to one, the computation is halted. The two desiderata for $q_\theta(z)$ are: (1) the probability of halting at the $l$-th step should depend on $u^l$; (2) it should be possible to sample $z = l$ after only executing the first $l$ iterations.
To satisfy the first property, we introduce a halting probability for every iteration:

$h^l = \sigma\big(f_\theta(u^l)\big), \ l = 1, \dots, L - 1, \qquad h^L = 1$  (12)

where $f_\theta$ maps the iteration output to a scalar halting score (its specific form is defined per application below). For the second property, we define the following sampling procedure for the distribution $q_\theta(z)$:

$z^l \sim \mathrm{Bernoulli}(h^l), \quad l = 1, \dots, L$  (13)

$\psi^l = z^l \prod_{j=1}^{l-1} (1 - z^j), \quad l = 1, \dots, L$  (14)

The vector $\psi = (\psi^1, \dots, \psi^L)$ is a one-hot representation of the discrete $L$-ary latent variable $z$. We reparameterize $z$ via the Bernoulli latent variables $z^1, \dots, z^L$. The distribution of $z$ can be obtained by taking an expectation over the independent random variables $z^l$:

$q_\theta(z = l) = h^l \prod_{j=1}^{l-1} (1 - h^j)$  (15)
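The construction (13)–(15) can be checked numerically. The sketch below (our own illustration, not code from the paper; the halting probabilities are arbitrary) computes the induced distribution $q_\theta(z)$ and verifies it against one-hot samples:

```python
import numpy as np

def halting_distribution(h):
    """q(z = l) from eqn. (15): halt at l with prob h[l], given that
    iterations 1..l-1 did not halt. Expects h[-1] == 1."""
    h = np.asarray(h, dtype=float)
    survive = np.cumprod(np.concatenate(([1.0], 1.0 - h[:-1])))
    return h * survive

def sample_one_hot(h, rng):
    """Sampling procedure (13)-(14): flip Bernoulli(h[l]) until halting."""
    psi = np.zeros(len(h))
    for l, p in enumerate(h):
        if rng.uniform() < p:
            psi[l] = 1.0
            break
    return psi

h = np.array([0.2, 0.5, 0.7, 1.0])  # h[L-1] = 1 guarantees halting
q = halting_distribution(h)

rng = np.random.default_rng(0)
samples = np.array([sample_one_hot(h, rng) for _ in range(50_000)])
empirical = samples.mean(axis=0)    # should match q
```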
Thresholded adaptive computation block is a deterministic version of the (stochastic) discrete adaptive computation block. Since we perform MAP inference over the latent variables, we expect the halting probabilities to be sufficiently close to either zero or one. Therefore, during evaluation we can replace sampling (13) with thresholding of the halting probabilities:

$z^l = [\, h^l > 0.5 \,]$  (16)

The advantage of this block is an extremely simple implementation: stop as soon as the halting probability exceeds 0.5.
Relaxed adaptive computation block is obtained from the discrete adaptive computation block by replacing the $\mathrm{Bernoulli}(h^l)$ random variables with $\mathrm{RelaxedBernoulli}(h^l, \lambda)$ variables. We denote the relaxed variables with a hat and the temperature of the relaxation $\lambda$. Sampling the vector $\hat\psi$ from the relaxed distribution proceeds as follows:

$\hat{z}^l \sim \mathrm{RelaxedBernoulli}(h^l, \lambda), \quad l = 1, \dots, L$  (17)

$\hat\psi^l = \hat{z}^l \prod_{j=1}^{l-1} (1 - \hat{z}^j), \quad l = 1, \dots, L$  (18)

The vector $\hat\psi$ is no longer one-hot. However, since it is produced by a stick-breaking procedure, it forms a discrete probability distribution over the iterations that we call the halting distribution. Finally, we define the output of the relaxed adaptive computation block as an expectation of the iteration outputs w.r.t. the halting distribution $\hat\psi$:

$\hat{u} = \sum_{l=1}^{L} \hat\psi^l u^l$  (19)
The whole procedure is illustrated on fig. 1.
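A minimal numpy sketch of the relaxed block (our own illustration, not code from the paper; the iteration outputs, halting probabilities and temperature are arbitrary placeholders):

```python
import numpy as np

def relaxed_block_output(u, h, lam, rng):
    """Relaxed adaptive computation block, eqns (17)-(19).

    u:   iteration outputs, shape (L, d)
    h:   halting probabilities, shape (L,), with h[-1] == 1
    Returns (psi_hat, output), where output = sum_l psi_hat[l] * u[l]."""
    logit = lambda p: np.log(p) - np.log(1.0 - p)
    # (17): Concrete relaxation of Bernoulli(h[l]); h[-1] == 1 stays exactly 1.
    uu = rng.uniform(size=len(h) - 1)
    z_hat = 1.0 / (1.0 + np.exp(-(logit(h[:-1]) + logit(uu)) / lam))
    z_hat = np.concatenate((z_hat, [1.0]))
    # (18): stick-breaking produces the halting distribution psi_hat.
    survive = np.cumprod(np.concatenate(([1.0], 1.0 - z_hat[:-1])))
    psi_hat = z_hat * survive
    # (19): output is the psi_hat-weighted average of the iteration outputs.
    return psi_hat, psi_hat @ u

rng = np.random.default_rng(0)
u = rng.normal(size=(4, 3))                 # placeholder iteration outputs
h = np.array([0.05, 0.9, 0.99, 1.0])        # placeholder halting probabilities
psi_hat, out = relaxed_block_output(u, h, lam=0.5, rng=rng)
```

Because the last halting probability is 1, the stick-breaking weights $\hat\psi$ always sum to one.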
Probabilistic model. Consider a discriminative model with a likelihood $p(y \mid x, z)$ of the target label $y$ given an object $x$ (for simplicity of notation, we consider just one object), parameterized by $w$. This model can be a deep network for a classification or regression problem. In many cases we prefer that the model make the prediction as quickly as possible. Assume that we have incorporated $K$ adaptive computation blocks into the likelihood, with the corresponding latent variables (numbers of computation iterations) $z = (z_1, \dots, z_K)$. Also, denote the maximum number of iterations in the $k$-th block by $L_k$.
We now discuss the prior distribution that encodes the preference for fewer iterations. For simplicity, we assume that it factorizes over the blocks, $p(z) = \prod_{k=1}^{K} p(z_k)$. The prior for each block is a discrete distribution over $L_k$ iterations. To make our model directly comparable to ACT, we choose a prior distribution that provides the same log-linear penalty as the ACT model (up to a normalization constant), a truncated Geometric distribution. We parameterize the Geometric distribution via a log-scale number-of-iterations penalty $\tau > 0$ (the canonical Geometric distribution probability of success can be recovered as $p = 1 - \exp(-\tau)$). The prior distribution for a single block is

$p(z_k = l) = \frac{\exp(-\tau l)}{\sum_{l'=1}^{L_k} \exp(-\tau l')} \propto \exp(-\tau l), \quad l = 1, \dots, L_k$  (20)
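A quick numerical check of the prior (20) (our own sketch, not code from the paper; the values of $\tau$ and $L_k$ are arbitrary):

```python
import numpy as np

def truncated_geometric(tau, L):
    """Prior (20): p(z = l) proportional to exp(-tau * l), l = 1..L."""
    log_p = -tau * np.arange(1, L + 1)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

tau, L = 0.5, 5
prior = truncated_geometric(tau, L)
# Log-linear penalty: consecutive log-probabilities differ by exactly -tau.
diffs = np.diff(np.log(prior))
# Canonical Geometric success probability recovered from the penalty.
success = 1.0 - np.exp(-tau)
```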
Using the described prior, we obtain the following probabilistic model:

$p(y, z \mid x) = p(y \mid x, z) \prod_{k=1}^{K} p(z_k)$  (21)

We perform MAP inference of the latent variables by variational optimization with an auxiliary distribution

$q_\phi(z \mid x) = \prod_{k=1}^{K} q_\phi(z_k \mid x, z_{1:k-1})$  (22)

where each factor $q_\phi(z_k \mid x, z_{1:k-1})$ is defined via eqn. (15). The dependence on the input and the previous latent variables is via the inputs of the block. We refer to this probabilistic model as discrete. The objective for maximization w.r.t. $\phi$ and $w$ is

$F(\phi, w) = \mathbb{E}_{q_\phi(z \mid x)} \Big[ \log p(y \mid x, z) + \sum_{k=1}^{K} \log p(z_k) \Big]$  (23)

To reduce the variance of the stochastic estimate of the objective, we analytically compute the expectation of the log-prior:

$\mathbb{E}_{q_\phi(z \mid x)} \log p(z_k) = -\tau\, \mathbb{E}_{q_\phi(z \mid x)} z_k + \mathrm{const} = -\tau \rho_k + \mathrm{const}$  (24)

Here $\rho_k = \mathbb{E}_{q_\phi(z \mid x)} z_k$ is the expected number of iterations in the $k$-th block. Ignoring the additive constant, we have

$F(\phi, w) = \mathbb{E}_{q_\phi(z \mid x)} \log p(y \mid x, z) - \tau \sum_{k=1}^{K} \rho_k$  (25)
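The penalty term of (25) is driven by the expected number of iterations, which follows directly from the halting probabilities via (15). A small sketch (our own illustration, not code from the paper; the probabilities are arbitrary):

```python
import numpy as np

def expected_iterations(h):
    """rho = E[z] under the halting distribution induced by the halting
    probabilities h (eqn. (15)); h[-1] must be 1 so the block always halts."""
    h = np.asarray(h, dtype=float)
    survive = np.cumprod(np.concatenate(([1.0], 1.0 - h[:-1])))
    q = h * survive                              # q(z = l), eqn. (15)
    return np.dot(np.arange(1, len(h) + 1), q)   # sum_l l * q(z = l)

rho_lazy = expected_iterations([0.99, 0.99, 0.99, 1.0])   # halts almost at once
rho_eager = expected_iterations([0.01, 0.01, 0.01, 1.0])  # runs nearly all iters
```

Pushing the halting probabilities up or down directly trades accuracy for the $-\tau \rho_k$ penalty.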
The objective in eqn. (25) is intractable for deep models consisting of several stacked adaptive computation blocks, as the complexity of direct evaluation of the expectation grows exponentially in the number of blocks. One heuristic is to replace the random variables with their expectations and optimize the probabilities directly. However, this simple approach fails for deep networks as they learn to trick the objective by increasing the halting probability for the first iterations and decreasing it for the latter iterations, while significantly boosting the magnitude of the outputs for the latter iterations [12]. The prior term value reflects that few iterations were used, while the outputs of the blocks are dominated by the last iterations.
Instead, we stochastically optimize the objective (25). In sec. 3 we proposed two approaches to do this, one using REINFORCE and another using relaxation.
In the first approach, we directly apply REINFORCE to the objective (25), obtaining the following gradients w.r.t. $\phi$:

$\nabla_\phi F(\phi, w) = \mathbb{E}_{q_\phi(z \mid x)} \big[ (\log p(y \mid x, z) - b)\, \nabla_\phi \log q_\phi(z \mid x) \big] - \tau \sum_{k=1}^{K} \nabla_\phi \rho_k$  (26)

where $b$ is a scalar baseline. The value of $\rho_k$ is defined by eqn. (15). Note that we have neglected the dependency of $\rho_k$ on $\phi$ through the sampled latent variables $z_{1:k-1}$ to reduce the variance of the gradients.
For the second approach, we replace every adaptive computation block with a relaxed counterpart, and the corresponding distribution $q_\phi(z_k \mid x, z_{1:k-1})$ with the relaxed distribution $\hat{q}_\phi(\hat{z}_k \mid x, \hat{z}_{1:k-1})$. This relaxed model has an objective that can be optimized via the reparameterization trick:

$\hat{F}(\phi, w) = \mathbb{E}_{\hat{q}_\phi(\hat{z} \mid x)} \log p(y \mid x, \hat{z}) - \tau \sum_{k=1}^{K} \hat\rho_k$  (27)
In the supplementary we present the algorithms for PACT in Discrete, Thresholded and Relaxed modes.
4.1 Application: Probabilistic Spatially Adaptive Computation Time for Residual Networks
Residual network (ResNet) [14, 15] is a deep convolutional neural network architecture that has been successfully applied to many computer vision problems [6, 5]. We describe the ResNet-32 and ResNet-110 models for the CIFAR image classification dataset [23]. They contain three stacked blocks, each consisting of several residual units (5 for ResNet-32 and 18 for ResNet-110). The computational iteration of a ResNet is a residual unit of the form $u^l = u^{l-1} + F^l(u^{l-1})$, where $F^l$ is a subnetwork consisting of two convolutional layers and $u^0$ is the output of the previous block of residual units. The outputs of the residual units in each block have the same size. The first units of the second and third blocks are applied with stride 2 to perform spatial downsampling, while also increasing the number of output channels by a factor of two. Thus, the spatial dimensions are $32 \times 32$ in the first block (the size of CIFAR-10 images), $16 \times 16$ in the second block and $8 \times 8$ in the third block. In this way, the amount of computation for every residual unit is roughly constant. The outputs of the last block are passed through global average pooling and linear layers to obtain the class probability logits.
SACT [9] applies the ACT mechanism to every spatial position of every residual network block. Likewise, we apply an adaptive computation block to every spatial position of every residual network block. We call the obtained model PSACT, probabilistic spatially adaptive computation time. The corresponding latent variables are $z_k^{ij}$, where $k$ is the index of the residual network block and $(i, j)$ is the spatial position. The halting probability map is computed as $h^l = \sigma\big(W * u^l + w^\top \operatorname{pool}(u^l) + b\big)$, where $*$ denotes convolution and $\operatorname{pool}$ is global average pooling. The computation time penalty for a block is chosen to be $\tau_k = \frac{\tau}{H_k W_k}$, where $\tau$ is a global computation time penalty and $H_k$ and $W_k$ are the height and width of the ResNet block.
In order to impute the non-computed intermediate values, we redefine the residual unit as

$u^l = u^{l-1} + m^l \odot F^l(u^{l-1})$  (28)

where $m^l$ is an active-positions mask and $\odot$ denotes elementwise multiplication. For the discrete model, we choose $m^l = \prod_{j=1}^{l-1} (1 - z^j)$, with the operation performed elementwise over the spatial positions. Thus, if a position is no longer evaluated (i.e., it has halted at one of the first $l - 1$ iterations), the mask value is zero and we simply carry the features from the previous iteration; otherwise, the value is one. For the relaxed model, we use $\hat{m}^l = \big[ \prod_{j=1}^{l-1} (1 - \hat{z}^j) > \epsilon \big]$, where $\epsilon$ is a scalar hyperparameter. By clipping the small values of the remaining halting mass, we obtain strict zeros and can skip computing the corresponding values during training. We have verified that setting $\epsilon$ to zero gives similar results, although without the possibility of computation savings during training.
4.2 Application: Probabilistic Adaptive Computation Time for Recurrent Neural Networks
We can also apply the proposed model to dynamically vary the amount of computation in Recurrent Neural Networks, such as Long Short-Term Memory networks (LSTMs) [16]. Let us denote the input sequence $x_1, \dots, x_T$, where $T$ is the number of timesteps. An adaptive computation block is associated with each timestep, so each timestep is processed for an adaptive number of iterations. We can use the same computation time penalty $\tau$ for all iterations. A computation iteration consists of applying the RNN's transition function to obtain the new state of the RNN: $u^l = \mathrm{RNN}\big(u^{l-1}, (x_t, [l = 1])\big)$. Here $u^0$ is the output state from the previous block/timestep. The binary input feature $[l = 1]$ allows the network to detect the beginning of a new timestep. The halting probability is computed as $h^l = \sigma(w^\top u^l + b)$. The output state of a block is used as the input state for the next block and as features for predicting the emission values for the timestep.
5 Related work
The Adaptive Computation Time (ACT) mechanism [12] can be seen as a heuristic deterministic relaxation of our PACT model. Specifically, ACT transforms the halting probabilities $h^l$ into the halting distribution $\psi$ as follows:

$N = \min \Big\{ l \in \{1, \dots, L\} : \sum_{j=1}^{l} h^j \ge 1 - \varepsilon \Big\}$  (29)

$\psi^l = \begin{cases} h^l & l < N \\ 1 - \sum_{j=1}^{N-1} h^j & l = N \\ 0 & l > N \end{cases}$  (30)

Since the halting distribution is not one-hot, additional memory is required to maintain the output during evaluation (an algorithm is presented in the supplementary). In the discrete and thresholded PACT models, the halting distribution is one-hot and this memory can be saved.
The stopping time $N$ has zero gradients almost everywhere. In order to optimize the stopping time, a differentiable upper bound on it, the ponder cost $N + R$ with remainder $R = 1 - \sum_{j=1}^{N-1} h^j$, is introduced. The ponder cost is linear almost everywhere, but is a discontinuous function of the halting probabilities, with discontinuities arising at the configurations where $N$ changes its value, see fig. 2. For instance, this means that ACT cannot be used with the reparameterization trick, which is only valid for continuous objectives. The objective of ACT, for several adaptive computation blocks, is $\log p(y \mid x, \psi) - \tau \sum_{k=1}^{K} (N_k + R_k)$.
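For comparison, ACT's deterministic halting distribution (29)–(30) can be sketched as follows (our own illustration, not code from the paper; `eps` plays the role of ACT's small constant $\varepsilon$, and the halting probabilities are arbitrary):

```python
import numpy as np

def act_halting_distribution(h, eps=0.01):
    """ACT's deterministic halting distribution, eqns (29)-(30):
    accumulate h until the running sum reaches 1 - eps, then assign the
    remainder. Assumes the cumulative sum of h reaches 1 - eps."""
    h = np.asarray(h, dtype=float)
    csum = np.cumsum(h)
    n = int(np.argmax(csum >= 1.0 - eps))  # 0-based stopping iteration N
    psi = np.zeros(len(h))
    psi[:n] = h[:n]
    psi[n] = 1.0 - h[:n].sum()             # remainder R
    return psi, n + 1                      # 1-based N

psi, n = act_halting_distribution([0.3, 0.5, 0.4, 1.0])
```

Unlike the one-hot $\psi$ of the discrete PACT block, the resulting vector spreads mass over several iterations, which is why ACT needs extra memory to accumulate the output.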
Let us summarize why the proposed PACT model is more principled than ACT. First, the discrete PACT model straightforwardly defines the halting time as the iteration at which the halting unit fires, whereas ACT uses the ad-hoc definition (29). Second, PACT allows directly minimizing the expected halting time, while ACT minimizes the discontinuous ponder cost.
Several papers have explored using REINFORCE to adjust the number of computation steps in neural networks with discrete latent variables: choosing the number of patches to process [25], determining the number of objects in a scene [8], dropping unnecessary subsets of neurons in a fully-connected network [3]. REINFORCE for discrete latent variables is also used in hard attention methods [28, 1]. Most of these use the same amount of computation for all inputs, although [25] explores dynamically adjusting the number of steps. As we experimentally show, using the Concrete relaxation dramatically simplifies training compared to using REINFORCE. Recently, [19] proposed to only update a dynamically chosen subset of the hidden state of a recurrent network. This can be seen as an alternative to ACT for recurrent neural networks. However, it is still a heuristic mechanism requiring several tricks to train.
Two concurrent works explore adaptive dropping of residual units in ResNet models using Actor-Critic [40] and Gumbel-Softmax [38]. This can be seen as an adaptive version of stochastic depth [17]. In this paper, we propose a probabilistic view of the ACT and SACT mechanisms. The resulting method is generally applicable to sequential models, including ResNets and RNNs.
Our work follows a trend in machine learning of interpreting methods as approximate Bayesian procedures. For example, in the field of topic modelling, Latent Dirichlet Allocation [4] is a probabilistic counterpart of Latent Semantic Indexing [7]. Recently, Dropout [33] has been interpreted as variational inference in a probabilistic model [21, 10]. This spurred the development of more innovative ways of using Dropout, in RNNs [11] and for sparsifying neural networks [29]. We hope that our paper will similarly open the way for various extensions of adaptive computation time.
6 Experiments
In the experimental evaluation we focus on the PSACT model for ResNets, since it allows adjusting the number of latent variables by grouping the spatial positions. First, we demonstrate that the relaxed model's parameters are compatible with the discrete and thresholded models. Then, we compare training of the relaxed model to training of the discrete model with REINFORCE, for a varying number of latent variables. Finally, we demonstrate that the relaxed PSACT model achieves results close to ACT. We also verify that the parameters obtained by the relaxed model can be used in a thresholded model with extremely simple test-time behavior, and that this is not the case for SACT.
We consider pre-activation ResNets [15] with 32 and 110 convolutional layers on the CIFAR-10 image classification dataset [23]. The training hyperparameters are provided in the supplementary. Unless otherwise noted, PSACT is trained using the relaxed model and evaluated using the discrete model. As a proxy for the potential time savings, we compute the number of floating point operations (FLOPs) required to evaluate the positions with non-zero values in the active positions mask, as done in [9].
In the first experiment, we train a relaxed PSACT model. The obtained parameters are continuously evaluated on the test set in three modes: relaxed (Concrete relaxation of the Bernoulli variables), discrete (discrete latent variables), and thresholded (deterministic latent variables). The results on fig. 3 show that the loss function and accuracy stay remarkably close for the three models. Since the computation in the relaxed model is stopped only when the remaining relaxed halting mass becomes negligible, and the relaxed variables might take non-extreme values, the relaxed model requires more computation.
Next, we compare training of the relaxed model to training of the discrete model using REINFORCE. We use an exponential moving average reward baseline. We do not employ an input-dependent baseline to simplify the model, since [27] finds only a small improvement from using it. Additionally, for REINFORCE, we use the Adam optimizer [20] (the learning rate decay schedule is kept the same), since the SGD with momentum used in the other experiments results in unstable training.
The PSACT model for ResNet-32 has 5-ary categorical latent variables, one variable per spatial position. To study the effect of the number of latent variables on training, we group the latent variables spatially: in every ResNet block, we group the spatial positions into non-overlapping patches. Within each patch, we average the logits of the halting probabilities and sample a single latent variable per patch. The results presented on fig. 4 show that REINFORCE has a much higher gradient variance; for the largest number of latent variables, the difference is about two orders of magnitude. REINFORCE achieves comparable results for small numbers of latent variables, but the accuracy quickly deteriorates as the number of latent units is increased.
(Figure caption: PSACT is trained using the relaxed model. The results are averaged over five runs, with error bars denoting one standard deviation. Left: ResNet-32, right: ResNet-110.)
Finally, we compare the SACT and PSACT models for ResNet-32 and ResNet-110 on fig. 5. The PSACT model is trained using the relaxation and then evaluated in the discrete and thresholded regimes. PSACT and SACT perform similarly. We find that PSACT requires a somewhat lower computation time penalty to achieve the same number of FLOPs, perhaps because the expected number of iterations penalty in PSACT is easier to optimize than the surrogate ponder cost of SACT. Relaxed PSACT successfully trains on ResNet-110, where we have 18-ary discrete latent variables. PSACT can be evaluated in the deterministic thresholded mode with very close results, indicating that the latent variable probabilities have saturated. This is not the case for SACT: evaluation in thresholded mode reduces the accuracy by at least 5% (a plot is available in the supplementary materials). We also present a comparison of the learned computation time maps on fig. 6.
7 Conclusion
We have presented Probabilistic Adaptive Computation Time, a principled latent variable model for varying the amount of computation in deep models. The proposed stochastic variational optimization allows performing approximate MAP inference in this model. Experimentally, we find that training using the Concrete relaxation of discrete latent variables outperforms REINFORCE-based training. The model achieves results similar to the heuristic method Adaptive Computation Time, while enjoying a principled formulation. It can also be used in the thresholded mode with very simple test-time behavior. In the future, we plan to explore different training techniques and modifications of the proposed latent variable model. Additionally, we expect that the proposed techniques could be useful for replacing REINFORCE in the training of hard attention models.
Acknowledgments. M. Figurnov and D. Vetrov are supported by Russian Science Foundation grant 17-71-20072 and Russian Academic Excellence Project ‘5-100’.
References
 [1] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. ICLR, 2015.
 [2] Jimmy Ba, Ruslan R Salakhutdinov, Roger B Grosse, and Brendan J Frey. Learning wake-sleep recurrent attention models. NIPS, 2015.
 [3] Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, and Doina Precup. Conditional computation in neural networks for faster models. ICLR Workshop, 2016.
 [4] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. JMLR, 2003.
 [5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv, 2016.
 [6] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-FCN: Object detection via region-based fully convolutional networks. NIPS, 2016.
 [7] Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6), 1990.

 [8] Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Koray Kavukcuoglu, and Geoffrey E Hinton. Attend, infer, repeat: Fast scene understanding with generative models. NIPS, 2016.
 [9] Michael Figurnov, Maxwell D Collins, Yukun Zhu, Li Zhang, Jonathan Huang, Dmitry Vetrov, and Ruslan Salakhutdinov. Spatially adaptive computation time for residual networks. CVPR, 2017.
 [10] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. ICML, 2016.
 [11] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. NIPS, 2016.
 [12] Alex Graves. Adaptive computation time for recurrent neural networks. arXiv, 2016.
 [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV, 2015.
 [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016.
 [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. ECCV, 2016.
 [16] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 [17] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. ECCV, 2016.
 [18] Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. ICLR, 2017.
 [19] Yacine Jernite, Edouard Grave, Armand Joulin, and Tomas Mikolov. Variable computation in recurrent neural networks. ICLR, 2017.
 [20] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
 [21] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. NIPS, 2015.
 [22] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. ICLR, 2014.
 [23] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Computer Science Department, University of Toronto, Tech. Rep, 2009.
 [24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. NIPS, 2012.
 [25] Zhichao Li, Yi Yang, Xiao Liu, Shilei Wen, and Wei Xu. Dynamic computational time for visual attention. arXiv, 2017.
 [26] Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. ICLR, 2017.
 [27] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. ICML, 2014.
 [28] Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. Recurrent models of visual attention. In NIPS, 2014.
 [29] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. ICML, 2017.
 [30] Mark Neumann, Pontus Stenetorp, and Sebastian Riedel. Learning to reason with adaptive computation. NIPS Workshop on Interpretable Machine Learning in Complex Systems, 2016.
 [31] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
 [32] Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In NIPS, pages 3483–3491, 2015.
 [33] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.
 [34] Joe Staines and David Barber. Variational optimization. arXiv, 2012.
 [35] Joe Staines and David Barber. Optimization by variational bounding. ESANN, 2013.
 [36] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CVPR, 2015.
 [37] Michalis Titsias and Miguel Lázaro-Gredilla. Doubly stochastic variational Bayes for non-conjugate inference. In ICML, 2014.
 [38] Andreas Veit and Serge Belongie. Convolutional networks with adaptive computation graphs. arXiv, 2017.
 [39] Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 1992.
 [40] Zuxuan Wu, Tushar Nagarajan, Abhishek Kumar, Steven Rennie, Larry S Davis, Kristen Grauman, and Rogerio Feris. BlockDrop: Dynamic inference paths in residual networks. arXiv, 2017.
 [41] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML, 2015.
Appendix A Algorithms for adaptive computation blocks
We present the algorithms for the discrete adaptive computation block in alg. 1, for the thresholded block in alg. 2, and for the relaxed block in alg. 3. Additionally, the adaptive computation time relaxation for the block is presented in alg. 4. The discrete and thresholded blocks admit a more straightforward implementation than the adaptive computation time mechanism.
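As a rough illustration of the idea behind the thresholded block, the sketch below runs residual units in sequence and halts at the first unit whose (sigmoid) halting probability exceeds a threshold. The function names, the interface (`units` as callables, a separate `halting_fn`), and the threshold value are our own assumptions for illustration, not the exact procedure of alg. 2.

```python
import numpy as np

def thresholded_block(x, units, halting_fn, threshold=0.5):
    """Sketch of a thresholded adaptive computation block (assumed interface).

    Each residual unit adds its output to the running state; after each unit
    we compute a halting probability and stop early once it crosses the
    threshold, skipping the remaining units.
    """
    n_executed = 0
    for unit in units:
        x = x + unit(x)  # residual update: x <- x + F(x)
        n_executed += 1
        p_halt = 1.0 / (1.0 + np.exp(-halting_fn(x)))  # sigmoid of halting logit
        if p_halt > threshold:
            break  # halt here; remaining units are not evaluated
    return x, n_executed

# Toy usage: units double the state; halting logit becomes positive once x > 3.
x, n = thresholded_block(
    np.array([1.0]),
    units=[lambda x: x] * 5,          # each unit returns x, so x doubles
    halting_fn=lambda x: x[0] - 3.0,  # hypothetical halting score
)
```

With these toy units the state doubles each step, so the block halts after the second unit rather than running all five, which is exactly the computational saving the adaptive blocks provide.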
Appendix B Training hyperparameters and additional experimental results
The training hyperparameters are as follows. The batch size is 128 and the weight decay is 0.0002. Training is performed for 100,000 iterations. The weights are initialized with the variance scaling initializer [13]. For all experiments except training with REINFORCE, we use the SGD optimizer with momentum 0.9. The initial learning rate is 0.1, decayed by a factor of 10 after 60,000, 75,000 and 90,000 training iterations. For training the SACT and PSACT models, we use the initialization heuristics from [9] to prevent the dead residual units problem. Namely, we initialize the weights of the model with a pretrained vanilla ResNet, and initialize the biases of the logits of the halting probabilities with a constant . We train the relaxed PSACT models with temperature and clipping threshold . We have explored temperatures in the range and obtained similar results.
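The learning-rate schedule described above can be written as a small piecewise-constant function. This is a minimal sketch of the stated schedule (base rate 0.1, divided by 10 after 60k, 75k and 90k iterations); the function name and signature are ours, not from any particular framework.

```python
def learning_rate(step, base_lr=0.1, boundaries=(60_000, 75_000, 90_000), decay=10.0):
    """Piecewise-constant learning-rate schedule: divide the base rate by
    `decay` once for each boundary that `step` has passed."""
    lr = base_lr
    for b in boundaries:
        if step >= b:
            lr /= decay
    return lr
```

For example, the rate is 0.1 for the first 60,000 iterations and 0.0001 for the final 10,000; the same effect is typically achieved with a framework's built-in step scheduler.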
We demonstrate additional examples of the computation time maps of SACT and PSACT in fig. 7.
An extended version of figure 5 from the main text is shown in fig. 8. We demonstrate that when a model trained with the SACT relaxation is evaluated as a PSACT Thresholded model, the accuracy drops significantly. This indicates that training with SACT does not result in a sharp halting distribution.
The values of in this experiment are as follows. ResNet-32 PSACT: . ResNet-32 SACT: . ResNet-110 PSACT: . ResNet-110 SACT: . Higher values of correspond to fewer FLOPs.