Faster Neural Network Training with Approximate Tensor Operations

05/21/2018 ∙ by Menachem Adelman, et al. ∙ Technion

We propose a novel technique for faster Neural Network (NN) training by systematically approximating all the constituent matrix multiplications and convolutions. This approach is complementary to other approximation techniques and requires no changes to the dimensions of the network layers, hence it is compatible with existing training frameworks. We first analyze the applicability of the existing methods for approximating matrix multiplication to NN training, and extend the most suitable column-row sampling algorithm to approximating multi-channel convolutions. We apply approximate tensor operations to training MLP, CNN and LSTM network architectures on the MNIST, CIFAR-100 and Penn Tree Bank datasets and demonstrate a 30%-80% reduction in the amount of computations while maintaining little or no impact on the test accuracy. Our promising results encourage further study of general methods for approximating tensor operations and their application to NN training.




1 Introduction

Approximation techniques for faster inference and training of deep neural networks (DNNs) have received considerable attention. Examples include quantized numerical representation (Hubara et al., 2016; Micikevicius et al., 2017; Seide et al., 2014; Wen et al., 2017), low-rank models (Mamalet and Garcia, 2012; Kuchaiev and Ginsburg, 2017), weight extrapolations (Kamarthi and Pittner, 1999) and partial or delayed gradient updates (Recht et al., 2011; Strom, 2015; Sun et al., 2017a).

We propose a novel approach to approximate DNN training that reduces the amount of computations by approximating compute-intensive tensor operations. At a high level, the original matrix products and convolutions are replaced with faster approximate versions that require fewer computations. The approximation is applied separately to each tensor operation, keeping the network architecture and dimensions intact, thereby facilitating the adoption of this technique in existing DNN training frameworks, potentially in combination with other approximation techniques.

We first focus on the existing methods for approximating matrix multiplication, noting that there is a rich literature on the topic. We analyze several known algorithms (Cohen and Lewis, 1999; Drineas and Kannan, 2001; Drineas et al., 2006; Magen and Zouzias, 2011; Sarlos, 2006; Clarkson and Woodruff, 2009; Pagh, 2013; Kutzkov, 2013), and find column-row sampling (CRS) (Drineas et al., 2006) to be the most suitable for approximating matrix multiplications in fully-connected layers. Given the product of two matrices $AB$, the algorithm samples the columns of $A$ and the corresponding rows of $B$, thus constructing smaller matrices, which are then multiplied as usual. This method incurs low sampling overheads linearly proportional to the size of the input matrices, and lends itself to an efficient implementation using existing dense matrix product routines. The algorithm minimizes the approximation error for the Frobenius norm of the resulting matrix while keeping the approximation unbiased.

We study the application of CRS and its variants to training a Multi-Layer Perceptron (MLP) on the MNIST dataset, and observe no impact on training accuracy while performing as little as 40% of the computations.

Encouraged by these results, we turn to approximating the training of Convolutional Neural Networks (CNNs). We generalize CRS to approximating multi-channel convolutions and analyze the approximation error to derive the optimal sampling policy.

Inspired by prior works on gradient resilience to partial updates (Recht et al., 2011; Strom, 2015; Sun et al., 2017a), we apply more aggressive approximation of gradient computations. This allows us to further reduce the amount of computations when training on MNIST (to roughly 80% compute reduction, see Table 2) without affecting the resulting accuracy.

We demonstrate the utility of our approach on different network architectures and datasets as summarized in Table 1. CRS approximation saves 30-80% of the computational cost with little or no degradation in model accuracy. The compute reduction column shows the relative amount of computations saved by our method in matrix multiplications and convolutions, as we further explain in detail in Section 3.3.

Network | Dataset | Compute reduction | Accuracy | Baseline accuracy
MLP | MNIST | 79% | 98.1% | 98.16%
CNN | MNIST | 77% | 99.25% | 99.29%
LSTM | PTB | 24% | 83.6 perplexity | 83.5 perplexity
WRN-28-10 (Zagoruyko and Komodakis, 2016) | CIFAR-100 | 50% | 77.1% | 78%
Table 1: Compute reduction of training with approximate tensor operations

This paper makes the following contributions:

  • We explore the application of general approximation algorithms for tensor operations to DNN training,

  • We develop a novel algorithm for fast approximation of multi-channel convolution,

  • We show that our approach can significantly reduce the computational cost of training several popular neural network architectures with little or no accuracy degradation.

2 Related work

To the best of our knowledge, we are the first to study the application of general approaches to approximating tensor operations to speed up DNN training. However, there have been several prior efforts to accelerate DNN computations via approximation which we survey below.

Several works employ model compression to accelerate inference (Denton et al., 2014; Jaderberg et al., 2014; Lebedev et al., 2014; Osawa et al., 2017; Gong et al., 2014; Han et al., 2015; Sun et al., 2017b). A large body of work is devoted to quantization and the use of low-precision datatypes (see for example Hubara et al., 2016; Micikevicius et al., 2017; Seide et al., 2014; Wen et al., 2017). Approximation has been used to extrapolate weight values instead of performing backpropagation iterations (Kamarthi and Pittner, 1999). Several works address communication and synchronization bottlenecks in distributed training by allowing delayed weight updates (Recht et al., 2011; Strom, 2015). Another approach enforces low-rank structure on the layers, resulting in lower computational cost both for training and inference (Mamalet and Garcia, 2012; Kuchaiev and Ginsburg, 2017). These methods are all different from ours and can potentially be applied in combination with approximate tensor operations.

Dropout (Srivastava et al., 2014) is related to CRS, but it has been primarily evaluated in the context of preventing overfitting rather than for approximate computations. We elaborate on the connection between CRS and dropout in Section 6.

Another closely related work is meProp (Sun et al., 2017a), which computes a small subset of the gradients during backpropagation. We note that the simple unified top-k variant of meProp is in fact a particular case of CRS, applied only to backpropagation and with a different sampling criterion. We show in Section 5 that CRS enables greater savings than meProp.

3 Approximating matrix multiplication for DNN training

There are several known algorithms for approximating matrix products; however, only the algorithms that meet the following requirements will be effective for general DNN training. First, the algorithm should apply to any input matrices regardless of the dimensions or their constituent values. Second, to be effective in reducing the training time, the total computational cost of the approximate multiplication including input transformation should be smaller than the cost of the original matrix product. Last, the algorithm should be amenable to efficient implementation on commodity hardware.

With these criteria in mind we now consider the following algorithms:

Random walk (Cohen and Lewis, 1999) This algorithm performs random walks on a graph representation of the input matrices. However, it is applicable to non-negative matrices only, which is not the case for matrices typically encountered in DNN training.

Random projections (Sarlos, 2006; Clarkson and Woodruff, 2009; Magen and Zouzias, 2011) The two matrices to be multiplied are first projected into a lower-dimensional subspace by a scaled random sign matrix. These algorithms require both input matrices to be roughly square, otherwise the cost of projection will be similar to the original product. In DNNs, however, it is common for one dimension to be smaller than the other.

FFT (Pagh, 2013; Kutzkov, 2013) These algorithms represent each column-row outer product as a polynomial multiplication and then calculate it using Fast Fourier Transform. The complexity depends on the sparsity of the input matrices, decreasing as the sparsity increases. Therefore, these algorithms might not be effective for computing fully-connected layers generally represented by dense matrices.

SVD (Drineas and Kannan, 2001; Denton et al., 2014; Osawa et al., 2017) Several algorithms replace one input matrix with its low-rank approximation using truncated SVD. These algorithms are suitable for inference where the weight matrix factorization can be pre-computed offline, but are not applicable to training since the high cost of factorization is incurred in every matrix product.

Column-row sampling (CRS) (Drineas and Kannan, 2001; Drineas et al., 2006) The sampling algorithm approximates the matrix product $AB$ by sampling columns of $A$ and the respective rows of $B$ to form smaller matrices, which are then multiplied as usual. We choose CRS as the basis for our current work, because it meets all the criteria above: it is applicable to fully-connected layers of any size, its effectiveness does not depend on the matrix contents, its sampling is computationally lightweight, and it may use regular matrix multiplication algorithms since the sampled sub-matrices remain dense.

3.1 CRS (Drineas and Kannan, 2001; Drineas et al., 2006)

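Let $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times p}$. Their product $AB$ is approximated as a weighted sum of outer products between sampled columns of $A$ and corresponding rows of $B$:

$$AB \approx \sum_{t=1}^{k} \frac{1}{k p_{i_t}} A^{(i_t)} B_{(i_t)} \tag{1}$$

where $A^{(i)}, B_{(i)}$ denote the matrix i'th column and row respectively, $k$ is the number of samples (satisfying $1 \le k \le n$), $\{p_i\}_{i=1}^{n}$ is a probability distribution over the column-row pairs of $A, B$ and $i_t \in \{1, \dots, n\}$. This algorithm allows linear reduction in complexity from $O(mnp)$ to $O(mkp)$.

Eq. 1 can also be expressed as $AB \approx A' D B'$, where $A' \in \mathbb{R}^{m \times k}$ is a matrix of the sampled columns of $A$, $B' \in \mathbb{R}^{k \times p}$ comprises the respective rows of $B$ and $D \in \mathbb{R}^{k \times k}$ is a diagonal matrix with $D_{t,t} = \frac{1}{k p_{i_t}}$ being the scaling factor that makes the approximation unbiased:

$$\mathbb{E}[A' D B'] = AB \tag{2}$$

Drineas et al. (2006) derive the upper bounds for the spectral and Frobenius norms of the error matrix $AB - A' D B'$. They show that the error is minimized when the sampling probabilities are proportional to the product of the column-row euclidean norms:

$$p_i = \frac{|A^{(i)}|_2 \, |B_{(i)}|_2}{\sum_{j=1}^{n} |A^{(j)}|_2 \, |B_{(j)}|_2} \tag{3}$$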

We refer to sampling using this probability distribution as Norm-Proportional Sampling (NPS).
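As an illustration of CRS with norm-proportional sampling, the following is a minimal NumPy sketch of the variant that samples with replacement and applies the scaling factor. It is not the authors' implementation; the function names `nps_probabilities` and `crs_matmul` and the matrix sizes are assumptions made for this example.

```python
import numpy as np

def nps_probabilities(A, B):
    """Probabilities proportional to |A[:, i]|_2 * |B[i, :]|_2 (NPS)."""
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    total = norms.sum()
    if total == 0.0:
        # Degenerate case (all products are zero): fall back to uniform.
        return np.full(A.shape[1], 1.0 / A.shape[1])
    return norms / total

def crs_matmul(A, B, k, rng):
    """Approximate A @ B from k column-row pairs sampled with replacement."""
    p = nps_probabilities(A, B)
    idx = rng.choice(A.shape[1], size=k, replace=True, p=p)
    scale = 1.0 / (k * p[idx])          # diagonal of the scaling matrix
    # Scale the sampled columns, then use a regular dense product.
    return (A[:, idx] * scale) @ B[idx, :]

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 200))
B = rng.standard_normal((200, 32))
approx = crs_matmul(A, B, k=80, rng=rng)
# Frobenius-norm error relative to |A|_F * |B|_F.
rel_err = np.linalg.norm(A @ B - approx) / (np.linalg.norm(A) * np.linalg.norm(B))
```

The sampled sub-matrices stay dense, so the only extra work in this sketch is computing the norms and drawing `k` indices.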

We consider different variants of the CRS algorithm:

  1. Sampling with or without replacement

  2. Sampling with uniform distribution versus the distribution given by Eq. 3 (NPS)

  3. Using or omitting the scaling factor

See Drineas et al. (2006) for the derivation of error bounds.

In addition, we introduce a deterministic top-k sampling, which chooses the k column-row pairs with the largest product of their euclidean norms without scaling. The intuition is that this policy chooses the samples that would be assigned the highest probabilities by NPS (Eq. 3) while avoiding the overhead of generating random samples.
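The deterministic top-k policy can be sketched in a few lines of NumPy. This is an illustrative sketch only; the function name `topk_matmul` is an assumption of this example, and `np.argpartition` is used here simply as one convenient way to find the k largest scores without a full sort.

```python
import numpy as np

def topk_matmul(A, B, k):
    """Approximate A @ B with the k column-row pairs whose norm product
    is largest. No random sampling and no scaling are applied."""
    scores = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    # Indices of the k largest scores (order within the k is irrelevant,
    # because the selected outer products are summed).
    idx = np.argpartition(scores, -k)[-k:]
    return A[:, idx] @ B[idx, :]
```

With `k` equal to the common dimension, every pair is selected and the function reduces to the exact product.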

3.2 Approximate matrix multiplication on synthetic data

We study the approximation quality and the computational cost of the CRS variants for standalone matrix multiplication. We generate random matrices and evaluate all the CRS variants while computing per-element, spectral norm and Frobenius norm error. We report only the latter because we find that the error metrics are equivalent w.r.t. the relative quality of the algorithms. For example, the algorithm with the lowest Frobenius norm error is also the one that yields the lowest spectral norm and per-element error. Using matrices of other sizes yields similar results.

Figures 1(a) and 1(b) show the approximation error for different variants of the CRS algorithm and different sampling ratios, averaged over 1000 runs. We show the Frobenius norm error metric, as well as its data-independent upper bound for NPS with replacement (Theorem 1 in Drineas et al., 2006).

We observe that (1) sampling without replacement outperforms sampling with replacement, which agrees with the theoretical bounds (for uniform sampling) in (Drineas and Kannan, 2001); (2) deterministic top-k produces similar results to NPS without replacement and without scaling; (3) when one matrix or both have i.i.d. entries with zero mean, random individual column-row products are unbiased estimators of the result. In this case, multiplying by the scaling factor only increases the error.

Therefore, the use of deterministic top-k sampling without scaling appears to be preferable as it results in lower error, yet it is simple and computationally lightweight.

(a) Matrix product: both matrix entries drawn from
(b) Matrix product: one matrix entries drawn from , the other from
(c) MLP model training on MNIST for different CRS variants
Figure 1: Approximation error for matrix product (left, center) and MLP training on MNIST (right) for different CRS variants depending on the percentage of performed computations. Lower is better.

3.3 Approximate training of Multi Layer Perceptron

Evaluation methodology.

Here and in the rest of the paper we estimate the speedup as the amount of computations saved due to approximation. Specifically, we denote by compute reduction the proportion of the multiply-accumulate operations in the fully-connected and convolutional layers in the forward and backward passes saved due to approximation out of the total computations in these layers performed in the exact training. We neglect the cost of activation functions and other element-wise operations, as well as the cost of the sampling itself, since it involves a single pass over the input and has lower asymptotic complexity of $O(mn + np)$ versus $O(mnp)$ of the exact computation for the product of an $m \times n$ and an $n \times p$ matrix. We believe that compute reduction is a reliable estimate of the potential training time savings because tensor operations by far dominate the DNN training time. We leave the efficient implementation of the CRS approximation for future work.
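To make the compute reduction metric concrete, here is a small Python sketch. It assumes, as in the linear complexity reduction of CRS, that the multiply-accumulate count of a sampled layer scales linearly with the fraction of column-row pairs (or channels) kept. The function name `compute_reduction` and the layer costs in the example are hypothetical.

```python
def compute_reduction(layer_macs, sampling_ratios):
    """Fraction of multiply-accumulate operations saved by sampling.

    layer_macs[i]      : MACs of layer i in exact training.
    sampling_ratios[i] : fraction of column-row pairs (or input channels)
                         kept in layer i; 1.0 means computed exactly.
    """
    exact = sum(layer_macs)
    approx = sum(m * r for m, r in zip(layer_macs, sampling_ratios))
    return 1.0 - approx / exact

# Hypothetical model: two layers of equal cost, only the second one
# sampled at 50%. A quarter of all MACs are saved.
print(compute_reduction([1.0, 1.0], [1.0, 0.5]))  # 0.25
```

The example shows why sampling only part of a model yields a compute reduction smaller than the sampling ratio itself: layers that are computed exactly still contribute their full cost.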

Execution environment.

We perform our experiments in TensorFlow (Abadi et al., 2016) and replace exact tensor operations with their CRS-based approximations. Only column-row pairs sampled in the forward pass are used during backpropagation because the rest do not affect the loss function. Hence, sampling in the forward pass reduces the amount of computations in the backward pass by the same ratio. We apply approximations only during training, and use exact computations to evaluate the model on the test/validation sets.
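The effect of forward-pass sampling on backpropagation can be sketched for a single fully-connected layer `Y = X @ W`. The code below is a NumPy illustration under stated assumptions, not the TensorFlow code used in the experiments: it uses deterministic top-k selection, treats the selected indices as constants when differentiating, and the function names are hypothetical.

```python
import numpy as np

def approx_linear_forward(X, W, k):
    """Forward pass using only the top-k column-row pairs of X @ W."""
    scores = np.linalg.norm(X, axis=0) * np.linalg.norm(W, axis=1)
    idx = np.argpartition(scores, -k)[-k:]
    return X[:, idx] @ W[idx, :], idx

def approx_linear_backward(X, W, idx, dY):
    """Gradients of the sampled forward pass w.r.t. X and W.

    Unsampled columns of X and rows of W did not contribute to the
    output, so their gradients are zero and need not be computed.
    """
    dX = np.zeros_like(X)
    dW = np.zeros_like(W)
    dX[:, idx] = dY @ W[idx, :].T      # k columns instead of n
    dW[idx, :] = X[:, idx].T @ dY      # k rows instead of n
    return dX, dW
```

Both gradient products involve only the `k` sampled pairs, which is why the backward pass shrinks by the same ratio as the forward pass.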

We evaluate the impact of different CRS variants on the training accuracy of a 3-layer Multi-Layer Perceptron (MLP) (98.16% accuracy on MNIST (LeCun et al., 1998) with exact computations). The model and training details are found in the supplementary material. Figure 1(c) shows the results. Deterministic top-k performs the best along with NPS without replacement and without scaling. There is no loss in model accuracy when using only 40% of the original column-row pairs while training with the original hyper-parameters. Overall, the results are similar to those of approximating matrix product (Figure 1(b)). This is not surprising given our empirical observation that the network weights are indeed close to being symmetrically distributed around zero during training.

3.4 Approximate training of Recurrent Neural Networks

We consider a model similar to the medium Long Short-Term Memory (LSTM) model proposed by Zaremba et al. (2014) for language modeling on the Penn Tree Bank dataset (PTB) (Marcus et al., 1993). The model involves two weight matrices learned during training. One is the gates matrix that controls the state transfer; its dimensions depend on the hidden state size n. The other is the matrix that transforms the LSTM output into log-probabilities of the next word to predict. See supplementary material for model and training details.

We use the CRS deterministic top-k algorithm that performs best for the MLP model. We apply 50% sampling to the matrix products of one of these weight matrices only. This computation accounts for about half of the multiply-accumulate operations in the model, therefore the total compute reduction is 24%. The resulting test perplexity is 83.6 vs. 83.5 of the original model. The perplexity on the validation set is 87.9 vs. 87.5 of the original model. We adjust the learning rate to allow faster learning at early training stages (0.74 decay after 29 epochs instead of 0.8 after 6).

Applying 50% sampling to the entire model, however, results in degradation of test perplexity to 89.8. We also observe degradation when approximating each gate individually. Following ideas similar to (Lu et al., 2016; Bayer et al., 2014) we also check approximations of specific gates known to be more resilient to perturbations such as compression or dropout, but observe perplexity degradation as well.

4 Approximating convolutions

So far we considered approximation of fully-connected layers. In this section we extend the basic CRS algorithm to the approximation of multi-channel convolution.

We consider two approaches: (1) transforming convolution into matrix multiplication (e.g., as in cuDNN (Chetlur et al., 2014)) and then applying CRS; (2) devising a specialized algorithm for approximate convolution. We prefer the second approach to avoid introducing dependence of the approximation quality on a particular implementation of the transformation.

Our algorithm is a generalization of the CRS algorithm for matrices. In matrix multiplication sampling is performed over the common dimension. The analogue for multi-channel convolution is to sample over the input features dimension. As in the matrix case, the output dimensions remain the same.

Formally, let $I \in \mathbb{R}^{B \times IH \times IW \times IC}$ be the input tensor, where B is the batch size, IH,IW are the input height and width respectively, and IC are the input channels. Let $K \in \mathbb{R}^{KH \times KW \times IC \times OC}$ be the kernels tensor, where KH,KW are the kernel height and width, and IC,OC are the input and output channels respectively. Let $O \in \mathbb{R}^{B \times OH \times OW \times OC}$ be the output tensor, where OH,OW are the output height and width.

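The multi-channel convolution operation is defined as:

$$O_{b,oh,ow,oc} = \sum_{i=1}^{IC} \sum_{kh=1}^{KH} \sum_{kw=1}^{KW} I_{b,\,oh+kh-1,\,ow+kw-1,\,i} \cdot K_{kh,kw,i,oc} \tag{4}$$

For notation simplicity, we assume zero padding. The inner sums in Eq. 4 can be written as convolutions of the input channels:

$$O = \sum_{i=1}^{IC} I^{[i]} * K_{[i]} \tag{5}$$

where $I^{[i]} \in \mathbb{R}^{B \times IH \times IW \times 1}$ denotes a tensor with one input channel that corresponds to the i'th input channel of I, i.e. $I^{[i]}_{b,ih,iw,1} = I_{b,ih,iw,i}$. Similarly, $K_{[i]} \in \mathbb{R}^{KH \times KW \times 1 \times OC}$ corresponds to the i'th input channel of K.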

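This formulation immediately hints at the possibility to sample over the input channel dimension, similarly to sampling column-row pairs in matrices. We propose to approximate multi-channel convolution as a convolution of lower-rank tensors:

$$O \approx \sum_{t=1}^{k} \frac{1}{k p_{i_t}} I^{[i_t]} * K_{[i_t]} = \tilde{I} * \tilde{K} \tag{6}$$

where $\{i_t\}_{t=1}^{k}$ are such that $i_t \in \{1, \dots, IC\}$ with $\Pr[i_t = i] = p_i$ and $\{p_i\}_{i=1}^{IC}$ is a probability distribution over the input channels, $\tilde{I} \in \mathbb{R}^{B \times IH \times IW \times k}$ is a tensor composed of sampled input channels of I scaled by $\sqrt{1/(k p_{i_t})}$, and $\tilde{K} \in \mathbb{R}^{KH \times KW \times k \times OC}$ is a tensor composed of corresponding sampled input channels of K scaled by the same factor.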

Computing the convolution of the smaller tensors can be done using standard efficient convolution implementations.
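The channel-sampling idea can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' code: tensors use the B x IH x IW x IC and KH x KW x IC x OC layouts from the text, the helper `conv2d` is a naive stride-1 convolution without padding written only for this example, and all function names are hypothetical. `sampled_conv2d` samples with replacement using probabilities proportional to the product of channel Frobenius norms and applies the scaling; `topk_conv2d` is the deterministic unscaled variant.

```python
import numpy as np

def conv2d(I, K):
    """Naive multi-channel convolution (stride 1, no padding)."""
    B, IH, IW, IC = I.shape
    KH, KW, _, OC = K.shape
    OH, OW = IH - KH + 1, IW - KW + 1
    O = np.zeros((B, OH, OW, OC))
    for kh in range(KH):
        for kw in range(KW):
            # (B, OH, OW, IC) @ (IC, OC) sums over the input channels.
            O += I[:, kh:kh + OH, kw:kw + OW, :] @ K[kh, kw]
    return O

def channel_scores(I, K):
    """Product of the Frobenius norms of each input channel of I and K."""
    i_norms = np.sqrt((I ** 2).sum(axis=(0, 1, 2)))
    k_norms = np.sqrt((K ** 2).sum(axis=(0, 1, 3)))
    return i_norms * k_norms

def sampled_conv2d(I, K, k, rng):
    """Convolution of k input channels sampled with replacement, scaled."""
    scores = channel_scores(I, K)
    p = scores / scores.sum()
    idx = rng.choice(I.shape[3], size=k, replace=True, p=p)
    s = np.sqrt(1.0 / (k * p[idx]))       # applied to both tensors
    I_s = I[:, :, :, idx] * s             # scales the channel axis
    K_s = K[:, :, idx, :] * s[:, None]    # scales the input-channel axis
    return conv2d(I_s, K_s)

def topk_conv2d(I, K, k):
    """Deterministic variant: keep the k highest-scoring channels, unscaled."""
    idx = np.argpartition(channel_scores(I, K), -k)[-k:]
    return conv2d(I[:, :, :, idx], K[:, :, idx, :])
```

Because the sampled tensors are ordinary dense tensors with fewer input channels, any standard convolution routine could replace the naive `conv2d` helper.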

The properties of the approximation in Eq. 6 can be derived similarly to the CRS derivations for matrix multiplication. In particular, we prove the following (see supplementary material):

Proposition 1

The approximation is unbiased, i.e., $\mathbb{E}[\tilde{I} * \tilde{K}] = I * K$.

Proposition 2

The expected Frobenius norm of the error tensor is minimized when the sampling probabilities are:



In addition to the product of the channel norms, correction terms emerge for convolutions when the kernel spatial dimensions are greater than one. However, computing them is too expensive, precluding efficient implementation of the approximate version. We therefore omit them and verify empirically whether the resulting norm-proportional probabilities:

$$p_i = \frac{\|I^{[i]}\|_F \, \|K_{[i]}\|_F}{\sum_{j=1}^{IC} \|I^{[j]}\|_F \, \|K_{[j]}\|_F} \tag{8}$$

yield better results than uniform sampling. Intuitively, in some (common) cases these terms are much smaller than $\|I^{[i]}\|_F^2 \, \|K_{[i]}\|_F^2$, so their omission does not significantly impact the final outcome. The first correction term amounts to the outer spatial dimensions of the input not being convolved with the entire kernel, so it is likely to be smaller than the Frobenius norm of the whole input. The second is the sum of products of different input and kernel entries. If different kernels are weakly correlated and weights are centered around zero, the sum will include terms of similar magnitudes but opposite signs.

4.1 Approximate training of Convolutional Neural Networks

We evaluate our extended CRS algorithm applied both to matrix multiplications and convolutions.

Small CNN.

We first consider a small CNN (99.35% accuracy on MNIST with exact computations). The training is performed using approximate tensor operations while keeping the original training hyper-parameters unchanged. See the supplementary material for model and training details.

Figure 2(a) shows the results of different CRS variants. For low sampling ratios, top-k and NPS without replacement and without scaling outperform the other policies, as in the MLP case.

The learning curves in Figure 2(b) show that using only 30% of the channels via top-k sampling is almost equivalent to training with full computations, achieving final accuracy of 99.21%. This level of sampling results in about 68% compute reduction since the first convolution layer has only one input channel and therefore is computed exactly. The figure also shows that removing an additional 10% of the computations achieves the same accuracy but with slower convergence.

(a) MNIST test error for CNN for different CRS variants
(b) MNIST test error during CNN training
(c) CIFAR-100 test error for full computations and top-k sampling
Figure 2: Evaluation of approximate CNN training

Large CNN.

We also consider the Wide ResNet-28-10 model (Zagoruyko and Komodakis, 2016) (WRN-28-10) trained on the CIFAR-100 dataset (Krizhevsky and Hinton, 2009). See supplementary material for model and training details. We evaluate only the deterministic top-k CRS that shows good results for smaller models. We use approximation for all the convolutional layers except for the input layer. Initial experiments show a significant drop in accuracy if the input layer is sampled, possibly because it has only three channels. Further, we do not approximate the single fully-connected layer since it amounts only to 0.01% of the total computations in WRN-28-10.

Our results show that by using only 50% of the channels the model achieves 77.1% accuracy on the CIFAR-100 dataset vs. 78% of the original model. Moreover, the accuracy reaches 77.5% if we sample only the larger convolutional layers with 320 and 640 filters. In total, the compute reduction is 50% and 33% respectively. Figure 2(c) shows that the learning curves of the approximate training closely follow the curve of the exact computations. Note that these results are obtained using the same hyper-parameters as in the original model.

5 Aggressive approximation in backpropagation

Prior work shows that partial gradient computations do not affect the accuracy (Recht et al., 2011; Strom, 2015; Sun et al., 2017a). Specifically, meProp (Sun et al., 2017a) computes only a small subset of the gradients in fully-connected layers without accuracy loss. In fact, the computation of the gradient in the simple unified top-k variant of meProp can be viewed as a particular case of CRS that uses a different deterministic sampling criterion based on the gradient matrix only. This prompts us to apply more aggressive approximation to the tensor operations in backpropagation, on top of the compute reduction due to sampling in the forward pass.

Table 2 shows the results of training the MLP and CNN models on MNIST using CRS with deterministic top-k sampling. We retain at least 10 samples per tensor operation, thereby sampling larger layers more aggressively. We compare to the meProp simple unified top-k algorithm (Sun et al., 2017a) that also uses compute-efficient sampling patterns. The aggressive sampling in backpropagation achieves the accuracy of the exact training with a compute reduction of about 80%. The accuracy for the MLP is the same as reported for meProp (Sun et al., 2017a), but CRS further reduces the amount of computations compared to meProp (80% vs. 55% compute reduction in Table 2).

Network | Method | Forward sampling | Backprop sampling | Compute reduction | Accuracy | Baseline accuracy (no sampling)
MLP | meProp (Sun et al., 2017a) | - | 6% | 55% | 98.08% | 98.01%
MLP | CRS | 40% | 5% | 80% | 98.1% |
CNN | meProp | - | - | - | - | 99.29%
CNN | CRS | 40% | 20% | 78% | 99.25% |
Table 2: Accuracy on MNIST with aggressive sampling in backpropagation vs. meProp

Larger networks, however, are more sensitive to additional sampling in backpropagation. Both the LSTM model and WRN-28 had a noticeable drop in accuracy, even when the additional backpropagation sampling ratio exceeds 50%.

6 Discussion

CRS interpretation in the context of DNN training.

The use of CRS algorithm in DNN training results in choosing a different subset of features in each iteration. Both top-k and NPS sampling prioritize the features and weights that jointly dominate in the current minibatch and in the network, and therefore are more likely to influence the prediction. Thus, different minibatch compositions trigger learning of different features.

CRS and Dropout.

The CRS approximation is closely related to Dropout (Srivastava et al., 2014). In Dropout, activations of each minibatch example are randomly zeroed. Applying CRS is also equivalent to zeroing activations, albeit the same activations across the entire mini-batch. Furthermore, CRS with replacement results in column-row sampling that follows a multinomial probability distribution, which has also been studied for Dropout (Li et al., 2016). This connection might serve as the basis for analyzing the properties of CRS in DNN training by leveraging similar analysis for Dropout (for example (Wager et al., 2013; Baldi and Sadowski, 2014)), which we leave for future work.
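The equivalence between unscaled CRS without replacement and zeroing the same activations for the whole mini-batch can be checked numerically. The NumPy sketch below is illustrative only; the shapes, the uniform choice of indices, and the 0.5 keep probability of the contrasting Dropout mask are assumptions of this example (the usual Dropout rescaling is omitted to keep the comparison simple).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 20))   # minibatch of activations
W = rng.standard_normal((20, 5))   # layer weights

# CRS without replacement and without scaling: keep 10 column-row pairs.
idx = rng.choice(20, size=10, replace=False)
crs_out = X[:, idx] @ W[idx, :]

# The same result via a mask shared by every example in the minibatch.
shared_mask = np.zeros(20)
shared_mask[idx] = 1.0
masked_out = (X * shared_mask) @ W

# Dropout, in contrast, draws a separate mask for every example (row).
dropout_mask = rng.random(X.shape) < 0.5
dropout_out = (X * dropout_mask) @ W

same = np.allclose(crs_out, masked_out)
```

The shared mask only reproduces the result; the computational saving of CRS comes from multiplying the smaller sub-matrices instead of the masked full matrices.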

7 Conclusions

This paper shows that general approximate tensor operations applied to DNN training are effective in significantly reducing the amount of computations without affecting the model accuracy. It generalizes the column-row sampling algorithms for approximating matrix products to multi-channel convolution and demonstrates promising results on several popular DNN architectures.

Our work opens new opportunities for systematic approximation of DNN training. Future research directions include efficient implementation of approximation algorithms for training, theoretical analysis of their properties, use of different approximation algorithms in suitable scenarios, combination with other approximation techniques and dynamic tuning of the approximation level for different layers or training stages.


Supplementary Material


The following proofs follow the same lines as those of [Drineas et al., 2006], generalizing them to multi-channel convolutions (zero-padding assumed).

Proposition 1.

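Suppose $\{p_i\}_{i=1}^{IC}$ is a probability distribution over $\{1, \dots, IC\}$ and $\{i_t\}_{t=1}^{k}$ are such that $\Pr[i_t = i] = p_i$ for $i \in \{1, \dots, IC\}$.

Let $O$ be the multi-channel convolution of $I, K$ as defined in Eq. 4 and let $\tilde{O}$ be its approximation by sampling k input channels as defined in Eq. 6. Then:

$$\mathbb{E}[\tilde{O}] = O \tag{9}$$

We show that every $b, oh, ow, oc$ satisfies $\mathbb{E}[\tilde{O}_{b,oh,ow,oc}] = O_{b,oh,ow,oc}$.

For $t \in \{1, \dots, k\}$, define $X_t = \left( \frac{I^{[i_t]} * K_{[i_t]}}{k p_{i_t}} \right)_{b,oh,ow,oc}$.

Using Eq. 6 we can write $\tilde{O}_{b,oh,ow,oc} = \sum_{t=1}^{k} X_t$.

Taking the expectation, we get:

$$\mathbb{E}[\tilde{O}_{b,oh,ow,oc}] = \sum_{t=1}^{k} \mathbb{E}[X_t] = \sum_{t=1}^{k} \sum_{i=1}^{IC} p_i \frac{(I^{[i]} * K_{[i]})_{b,oh,ow,oc}}{k p_i} = \sum_{i=1}^{IC} (I^{[i]} * K_{[i]})_{b,oh,ow,oc} = O_{b,oh,ow,oc} \tag{10}$$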

Lemma 1.

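Suppose the same as Proposition 1. Then:

$$\mathrm{Var}[\tilde{O}_{b,oh,ow,oc}] = \frac{1}{k} \sum_{i=1}^{IC} \frac{(I^{[i]} * K_{[i]})^2_{b,oh,ow,oc}}{p_i} - \frac{1}{k} O^2_{b,oh,ow,oc} \tag{11}$$

Define $X_t$ as in Proposition 1. From Eq. 6 and the independence of different $X_t$:

$$\mathrm{Var}[\tilde{O}_{b,oh,ow,oc}] = \mathrm{Var}\left[\sum_{t=1}^{k} X_t\right] = \sum_{t=1}^{k} \mathrm{Var}[X_t] = \sum_{t=1}^{k} \left( \mathbb{E}[X_t^2] - \mathbb{E}[X_t]^2 \right) \tag{12}$$

From Eq. 10 we get $\mathbb{E}[X_t] = \frac{1}{k} O_{b,oh,ow,oc}$. Similarly, $\mathbb{E}[X_t^2] = \sum_{i=1}^{IC} \frac{(I^{[i]} * K_{[i]})^2_{b,oh,ow,oc}}{k^2 p_i}$.

Substituting both expressions in Eq. 12 and expanding concludes the proof. ∎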

Proposition 2.

Suppose the same as Proposition 1. Then:



The expected error is minimized when the sampling probabilities are:


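We use here the Frobenius norm in its generalization for tensors. For a tensor T of rank r:

$$\|T\|_F = \sqrt{\sum_{j_1, \dots, j_r} T_{j_1, \dots, j_r}^2}$$

Note that:

$$\mathbb{E}\left[\|O - \tilde{O}\|_F^2\right] = \sum_{b,oh,ow,oc} \mathbb{E}\left[(O - \tilde{O})^2_{b,oh,ow,oc}\right] = \sum_{b,oh,ow,oc} \mathrm{Var}[\tilde{O}_{b,oh,ow,oc}]$$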

Substituting the result from Lemma 1:


This expression includes 3 terms. The first involves products between each element of and all the corresponding entries in , except for the upper and left edges of . We therefore add and subtract the correction term to get:


The second term is .

The third term can be written as

Substituting these terms in Eq. 17 yields the result of Eq. 14.

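To find $\{p_i\}_{i=1}^{IC}$ that minimize the expression in Eq. 14 it is enough to minimize the function $f = \sum_{i=1}^{IC} \frac{\alpha_i^2}{p_i}$ under the constraints $\sum_{i=1}^{IC} p_i = 1$ and $p_i > 0$. We can write the numerator as $\alpha_i^2$ because the expression in Eq. 13 is non-negative.

This minimization problem has a straightforward solution in Lemma 4 of [Drineas et al., 2006], which is $p_i = \frac{\alpha_i}{\sum_{j=1}^{IC} \alpha_j}$.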
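For completeness, the minimizer can also be checked directly with the Cauchy-Schwarz inequality. The short derivation below is a standard argument added for illustration and is not necessarily the one used in [Drineas et al., 2006]. It assumes the objective is written as $\sum_i \alpha_i^2 / p_i$ with $\alpha_i \ge 0$ not all zero, and that $\sum_i p_i = 1$ with $p_i > 0$.

```latex
\left( \sum_{i} \alpha_i \right)^2
  = \left( \sum_{i} \frac{\alpha_i}{\sqrt{p_i}} \cdot \sqrt{p_i} \right)^2
  \le \left( \sum_{i} \frac{\alpha_i^2}{p_i} \right) \left( \sum_{i} p_i \right)
  = \sum_{i} \frac{\alpha_i^2}{p_i}
```

Equality holds exactly when $\alpha_i / \sqrt{p_i}$ is proportional to $\sqrt{p_i}$, i.e. when $p_i \propto \alpha_i$. Together with $\sum_i p_i = 1$ this gives $p_i = \alpha_i / \sum_j \alpha_j$, and the minimum value of the objective is $(\sum_i \alpha_i)^2$.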

In our case, , and therefore the optimal probabilities are:


Implementation details


The MNIST dataset [LeCun et al., 1998] includes 60K training examples and 10K test examples. We use 5K examples as a validation set. Each example is a gray-scale image of a handwritten digit.

Our MLP model contains the following layers:

  • fully-connected layer with RELU activations

  • fully-connected layer

  • fully-connected layer

  • Softmax

We use the Adam optimizer [Kingma and Ba, 2014] with default parameters (learning rate=0.001, $\beta_1$=0.9, $\beta_2$=0.999, $\epsilon$=1e-08). As loss function we use cross-entropy. We use a minibatch size of 50 and train the model for 20 epochs, reporting the test accuracy of the model with the highest accuracy on the validation set.


The Penn Tree Bank dataset (PTB) [Marcus et al., 1993] has a vocabulary of 10K words and consists of 929K training words, 73K validation words, and 82K test words.

Our network is based on [Zaremba et al., 2014] with the implementation in

The network includes 2 layers of LSTM cells with a hidden size of 650, unrolled for 35 steps. The model is trained with a minibatch size of 20 and gradient clipping of 5. The perplexity is reported on the test set after 39 training epochs. The learning rate decay is 0.8 starting after 6 epochs for the full computations and 0.74 after 29 epochs for the approximations. The non-recurrent connections are regularized with Dropout.



The small CNN for MNIST is composed of the following layers:

  • convolution layer with RELU activation, followed by max pooling.

  • convolution layer with RELU activation, followed by max pooling.

  • fully connected layer with RELU activation.

  • fully connected layer.

  • Dropout layer with

  • Softmax

The model is trained using the Adam optimizer with default parameters (learning rate=0.001, $\beta_1$=0.9, $\beta_2$=0.999, $\epsilon$=1e-08) and cross-entropy loss. We use a minibatch size of 50 and train the model for 20K iterations.

Wide ResNet-28-10 for CIFAR-100

The CIFAR-100 dataset consists of color images from 100 classes, split into 50K training set and 10K test set.

For WRN-28-10 we use the implementation in

WRN-28-10 includes the following layers:

  • conv1 - input convolution layer

  • conv2 - eight convolution layers

  • conv3 - eight convolution layers

  • conv4 - eight convolution layers

  • Batch normalization, average pooling, fully connected+softmax layers.

Every two subsequent convolution layers are followed by a residual connection that adds the input to these layers to the result. The first convolution in conv3 and conv4 has a stride of 2, halving the spatial dimensions. For additional details see [Zagoruyko and Komodakis, 2016].

Image preprocessing includes padding to 36x36 and random crop, horizontal flipping and per-image whitening. The optimizer is Momentum with momentum=0.9. Learning rate is 0.1 for the first 40K iterations, 0.01 until 60K, 0.001 afterwards. We use 0.002 L2 weight decay. Batch size is 128.

We train the model for 75K iterations and compare the accuracy on the test set for different approximation algorithms.