I Introduction
Following its introduction in 2015 [1], Batch Normalization (BN) layers rapidly became a default layer type in state-of-the-art deep convolutional neural networks (CNNs). In particular, for CNNs designed as image classifiers, batch-normalization is essential for state-of-the-art accuracy on difficult datasets such as ImageNet [2]. However, BN layers have several disadvantages, such as:
BN layers typically increase the GPU memory requirements and computational load during training, due to by-default storage of feature maps for every layer, for use in gradient calculations (this extra storage can be avoided by recomputing BN layer outputs from the input feature map, but this slows down training slightly); the additional multiplications needed for BN layers also slow down inference;

BN layers and the computation of their parameters are implemented slightly differently in different popular deep learning libraries, resulting in difficulty replicating results and in model portability;

BN layers are not well suited to custom-hardware implementations of training, such as those that use a pipeline approach where all processing needs to be done on one sample independently of all other samples;

BN layers create challenges for efficient training when batches are split across multiple GPUs, and as a result non-exact approximations are sometimes used [3];

BN layers can be less effective for models that have to be trained using small batches [4], such as when input images or model sizes are large enough to fill up GPU RAM;

BN layers can be problematic for datasets with high class imbalance or when samples in a minibatch are not independent [5].
The last four disadvantages are due to the fact that BN layers require calculation of batch-wise statistics, namely the mean and variance of every channel, calculated over all locations in the channel’s feature map and all samples in a minibatch, at each point in a network where a BN layer is used. These statistics are less robust both when a minibatch contains fewer samples and when a batch includes rarely chosen or non-independent samples [5].
Some of the potential challenges of BN layers, such as minibatch size, are less frequently of importance for inference. However, it has been predicted that in the near future there will be many applications where it will be desirable to train deep neural networks on low-power custom hardware, such as when networks need to be updated frequently using new data, and communication to a data center is too slow or unavailable [6].
For all these reasons, it is desirable to know whether BN layers really are generally necessary for best performance, or whether for particular datasets state-of-the-art accuracy can be achieved without them. However, our main motivation is drawn from seeking to design deep CNN classifiers implemented in custom hardware that run as fast and efficiently as possible in an embedded system. It is already well established that convolutional layers can be implemented without use of multipliers, by reducing representation precision of weights or feature maps to 1 bit, hence potentially saving large amounts of chip space and power usage [7, 8, 6, 9, 10]. However, existing investigations in this area often still use BN layers, which if implemented exactly introduce significant complexity, including mandating the use of minibatches and the need for multipliers. It is therefore desirable to know whether they can be removed or replaced.
In this paper, we train deep CNNs with and without BN layers, and analyze the resulting accuracy changes.
The paper is structured as follows. In Section II we review the BN layer definition, and discuss relevant prior research related to the goals of this paper. Then, in Section III, we describe the CNN architectures we chose to use for this study, how we vary the use of BN layers in this architecture, and adaptations we need to introduce to enable them to work effectively. Next, Section IV contains our results. Finally, we discuss the implications and significance of our results in Section V.
II Relevant Previous Research
Depth is crucial for effective learning [11] in modern neural networks as it allows a network to learn a hierarchical representation of the data by successively composing simpler features into more complex features. Depth, however, poses a number of challenges when training neural networks. The large number of parameters in deep networks typically restricts the optimization methods that can feasibly be used to first-order methods like stochastic gradient descent. The loss surface of a deep neural network, however, is highly non-convex and there is no guarantee that starting from an initial parameter point and traversing this loss surface using gradient descent will land in a good local minimum that allows the network to generalize well on unseen data [12, 13]. A considerable number of techniques and tricks have been developed to allow gradient descent to practically succeed in training deep networks, ranging from random parameter initialization strategies [14, 15] and unsupervised pre-training [16, 17, 18] to augmenting gradient descent with gradient history information to more effectively traverse the loss surface [19, 20].
The performance of gradient descent is highly dependent on the way in which the optimization problem is parameterized. For example, by reparameterizing a deep neural network, i.e., by changing the scale and shift of the network’s parameters, we can drastically change the curvature of the loss surface [21] and the behavior of gradient descent. One of the famous deleterious effects of bad parameterization is the vanishing and exploding gradients problem that plagued early neural networks [22]. To combat such problems, and to yield a more favorable parameterization for gradient descent in general, a broad class of techniques attempt to reparameterize deep neural networks through various forms of normalization. Such normalization techniques operate on various quantities in the network, such as activations or weights.
Perhaps the most popular normalization technique is batch-normalization [1], which is described in more detail in the next subsection. Batch-normalization not only accelerates gradient descent learning, it also has a beneficial regularization effect [23]. Batch-normalization operates on pre-activations (the inputs to neurons before the activation function is applied); it rescales and shifts the pre-activations of each single neuron so that they have zero mean and unity variance across the minibatch samples, in each channel. During training, the network’s response to an example thus depends on the other examples that accompany it in a minibatch. This dependence, however, could prove undesirable in some situations [24]. Batch-normalization requires the use of a minibatch that is large enough to yield reliable pre-activation statistics. Increasing minibatch size, however, often leads to a decrease in the network’s generalization performance [25].
A related normalization technique that does not depend on minibatch statistics is layer normalization [26]. Layer normalization normalizes the pre-activations of neurons in a layer so that they have zero mean and unity variance. These statistics are calculated across all pre-activations in a layer, unlike batch-normalization, which calculates them for each neuron individually across the minibatch. One downside, however, is that different neurons, especially in convolutional layers, can have widely different input statistics, so normalizing all of them using the same coefficients is poorly motivated. Instead of normalizing the activations, weight normalization [27] normalizes the input weight vector of each neuron by its norm. An extra parameter is introduced for each weight vector to explicitly control its length. The performance of weight normalization closely matches that of batch-normalization while avoiding its major downside: the dependence on minibatch statistics. Another innovation introduced by [27] is the “mean-only BN” layer, in which layer inputs are centered according to the mean of a batch, in conjunction with weight normalization.
Recently, yet another method, group normalization [4], was introduced and shown to outperform BN while avoiding some of its downsides.
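The weight normalization and mean-only BN ideas described above can be illustrated with a minimal NumPy sketch. The function names, array shapes and random data are assumptions made for illustration only; this is not the implementation of [27].

```python
import numpy as np

def weight_norm(v, g):
    """Reparameterize weights as w = g * v / ||v||, with one norm per output neuron.

    v: (fan_in, n_out) direction parameters; g: (n_out,) learned lengths.
    """
    norms = np.linalg.norm(v, axis=0, keepdims=True)
    return g * v / norms

def mean_only_bn(pre_activations):
    """Subtract each neuron's minibatch mean; no division by a variance estimate."""
    return pre_activations - pre_activations.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
v = rng.normal(size=(8, 3))             # hypothetical layer: 8 inputs, 3 neurons
g = np.array([0.5, 1.0, 2.0])           # hypothetical learned lengths
w = weight_norm(v, g)

x = rng.normal(loc=3.0, size=(16, 8))   # minibatch of 16 samples
t = mean_only_bn(x @ w)                 # centered pre-activations
```

In this sketch the length of each weight vector is controlled only by `g`, and the only batch-dependent quantity is a mean, which needs no squares or square roots.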
The choice of activation function has a large impact on the performance of gradient descent or backpropagation. When errors are backpropagated through a layer, they are scaled by the derivative of the activation function of the neurons in that layer. If these derivatives are small, then errors are effectively blocked from propagating backwards. Saturating activation functions such as the logistic sigmoid or hyperbolic tangent are particularly vulnerable to this effect, as their derivatives are very small when their input is far from zero [28]. The use of non-saturating activation functions such as Rectified Linear Units (ReLUs) has partially alleviated this problem. A ReLU activation is zero for negative inputs and the identity for positive inputs. ReLUs, however, cannot have an output that is zero mean across the training examples as they do not produce negative outputs. Exponential Linear Units (ELUs) [29] address this issue by having a saturating negative output if the input is less than zero. ELUs have been shown to perform extremely well without any form of explicit normalization, suggesting that they intrinsically normalize the levels of activation in the network.
However, shifted Rectified Linear Units (sReLUs) offer nearly all the same benefits as ELUs, but without the need to calculate exponentials. Given our motivation of minimizing computational load, they are our focus here instead of ELUs. Indeed, our results in Figures 5 and 6 show that networks using sReLU and ELU activations do not have significant differences in accuracy.
For similar reasons, we aim to establish if we can avoid alternatives to BN like group normalization [4], where complexity and additional computation are introduced for computing statistics and carrying out normalizations by non-constant factors.
II-A Review of BN layers
Batchnormalization (BN) layers have been shown to help deep neural networks produce better accuracy following training. The usual explanation for this is that they help “reduce internal covariate shift” [1]. This view has been recently challenged [30]. However it is clear that the use of BN layers enables higher learning rates, and hence faster convergence during training, and diminishes the importance of good initial conditions for convolutional layers [1, 5].
BN layers are applied to a minibatch of feature maps, which typically can be represented as a 4-axis tensor, $X \in \mathbb{R}^{N \times H \times W \times C}$, where $N$ is the size of the minibatch, $H$ and $W$ are the height and width of the feature maps, and $C$ is the number of channels. BN layers operate on a per-channel basis, with channel-specific gain and shift parameters, and make use of channel-specific calculations of the mean and variance of the inputs to the channel in a minibatch. Mathematically, if $x_{n,i,j,c}$ is the value of the feature map at location $(i,j)$ in the $c$-th channel of the $n$-th sample, then BN layers transform that value according to
$$y_{n,i,j,c} = \gamma_c\,\frac{x_{n,i,j,c} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}} + \beta_c, \tag{1}$$
where $\mu_c$ and $\sigma_c^2$ are calculated as
$$\mu_c = \frac{1}{NHW}\sum_{n=1}^{N}\sum_{i=1}^{H}\sum_{j=1}^{W} x_{n,i,j,c} \tag{2}$$
and
$$\sigma_c^2 = \frac{1}{NHW}\sum_{n=1}^{N}\sum_{i=1}^{H}\sum_{j=1}^{W} \left(x_{n,i,j,c} - \mu_c\right)^2. \tag{3}$$
The remaining parameters are:

gain, $\gamma_c$, and shift, $\beta_c$, which are usually learned in the same manner as weights in convolutional layers;

$\epsilon$, which ensures division by zero cannot happen in the event of zero variance, and can also act like a regularizer; it is not widely appreciated that accuracy can depend on the exact choice of $\epsilon$ and that inference performance is sensitive to inadvertent changes in $\epsilon$ compared to training. It is also noteworthy that some popular deep learning libraries define $\epsilon$ differently, by taking it outside the square root, which changes subtly but sometimes significantly the performance of otherwise identical models.
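The per-channel transform described above (subtract the channel mean, divide by the square root of the channel variance plus a small constant, then apply gain and shift) can be sketched in NumPy as follows. The channels-last layout, the function name `batch_norm_train` and the value `eps=1e-5` are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Training-mode BN on a minibatch x of shape (N, H, W, C).

    Statistics are computed per channel, over all samples and all spatial
    locations. eps is placed inside the square root.
    """
    mu = x.mean(axis=(0, 1, 2))    # per-channel mean, shape (C,)
    var = x.var(axis=(0, 1, 2))    # per-channel (biased) variance, shape (C,)
    y = gamma * (x - mu) / np.sqrt(var + eps) + beta
    return y, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(4, 5, 5, 3))
gamma = np.ones(3)                 # gain initialized to 1
beta = np.zeros(3)                 # shift initialized to 0
y, mu, var = batch_norm_train(x, gamma, beta)

# Variant with eps taken outside the square root, as some libraries do.
y_alt = gamma * (x - mu) / (np.sqrt(var) + 1e-5) + beta
```

The two variants give slightly different outputs for the same nominal `eps`, which illustrates why a model can behave differently when moved between libraries that define this constant differently.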
There are two major differences between BN layers and other layers:

Calculation of layer outputs depends on all samples in a minibatch and cannot be done independently for each sample;

During training, minibatch means and variances are calculated. For inference, each mean and variance is an additional parameter that forms part of the trained model, but unlike most parameters derived from training data, these are not learned, but rather are computed. Most popular libraries compute these values during training by calculating exponential moving averages over batches as training progresses. We have found that this can sometimes give misleading and subpar performance during monitoring on a validation set as training progresses, because parameters change over the window in which averages are created. Instead, we favour calculation of batch means and variances using multiple training batches while training is frozen, as described in Algorithm 2 in the original BN paper [1].
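The difference between the two ways of obtaining inference statistics can be illustrated with a toy single-channel sketch. The drifting mean, the momentum of 0.9 and the function names are assumptions chosen for illustration; the $m/(m-1)$ factor is the unbiased variance correction used in Algorithm 2 of [1].

```python
import numpy as np

def ema_statistics(batches, momentum=0.9):
    """Exponential moving averages updated once per batch while training runs."""
    run_mu, run_var = 0.0, 1.0
    for b in batches:
        run_mu = momentum * run_mu + (1.0 - momentum) * b.mean()
        run_var = momentum * run_var + (1.0 - momentum) * b.var()
    return run_mu, run_var

def frozen_statistics(batches):
    """Average batch statistics over several batches with the weights frozen."""
    m = batches[0].size
    mean = np.mean([b.mean() for b in batches])
    var = m / (m - 1) * np.mean([b.var() for b in batches])
    return mean, var

rng = np.random.default_rng(0)
# While weights are still changing, the channel's true mean drifts from 0 to 5.
drift = np.linspace(0.0, 5.0, 50)
training_batches = [rng.normal(loc=mu, size=32) for mu in drift]
# With training frozen, every batch comes from the final distribution.
frozen_batches = [rng.normal(loc=5.0, size=32) for _ in range(50)]

ema_mu, ema_var = ema_statistics(training_batches)
frz_mu, frz_var = frozen_statistics(frozen_batches)
```

In this toy setting the moving average lags behind the drifting mean, while the frozen estimate reflects only the final parameters.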
II-B BN layer scale and shift can be detrimental
Recent work demonstrated the surprising result that not learning BN scale and shift parameters can actually be beneficial, leading to reduced error rates for CIFAR 10 and CIFAR 100 in a wide residual network [10]. This result was found to be the case for both full-precision networks and “1-bit-per-weight” versions of the same networks.
II-C Shifted Rectified Linear Unit and Exponential Linear Unit
It has already been shown that removal of BN layers can be somewhat compensated for by using exponential linear units (ELUs) [29] instead of the combination of BN and ReLU. The mathematical definitions of the sReLU and ELU activation functions are
$$\mathrm{sReLU}(x) = \max(x, -1) = (x + 1)\,H(x + 1) - 1, \tag{4}$$
where $H(\cdot)$ is the Heaviside step function, and
$$\mathrm{ELU}(x) = \begin{cases} x, & x > 0, \\ e^{x} - 1, & x \le 0. \end{cases} \tag{5}$$
Both can be generalized to a different minimum output value, but we consider only the case of $-1$. See Figure 1 for plots of the sReLU and ELU characteristics, in comparison with ReLU.
When ELUs were introduced, it was shown that they outperform shifted rectified linear units (sReLUs) [29]. However, in our experiments with more recently advanced networks and training methods not available at the time of [29], we did not observe any significant difference between ELUs and sReLUs. Moreover, ELUs are computationally expensive compared with sReLUs, because sReLUs require neither calculation of exponentials in the forward pass nor storage of their values for use in the backward pass during training. We show here that shifted ReLUs are an effective and efficient alternative to ELUs, and seek to quantify how much accuracy penalty is incurred by replacing BN layers.
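The three activation functions compared in this section can be written in a few lines of NumPy. This is a minimal sketch with a minimum output value of -1; the function names and the random test input are illustrative assumptions.

```python
import numpy as np

def relu(x):
    """Zero for negative inputs, identity for positive inputs."""
    return np.maximum(x, 0.0)

def srelu(x):
    """Shifted ReLU: identity above -1, clipped to -1 below. No exponentials."""
    return np.maximum(x, -1.0)

def elu(x):
    """ELU: identity for x > 0, exp(x) - 1 otherwise (saturates towards -1)."""
    return np.where(x > 0, x, np.expm1(np.minimum(x, 0.0)))

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)    # zero-mean, unit-variance inputs

mean_relu = relu(x).mean()
mean_srelu = srelu(x).mean()
mean_elu = elu(x).mean()
```

For zero-mean inputs, both `srelu` and `elu` give a mean output closer to zero than `relu`, and `srelu` achieves this using only a comparison.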
III Methods
III-A Baseline network architecture and training
Our experiments are based on wide residual networks [31] (see Figure 2) that use the post-activation architecture [32]. Using the nomenclature of [31], for CIFAR 10 and CIFAR 100 we used depth 20 and widths of 4 and 10, and for ImageNet depth 18 and widths 1 and 2.5, meaning in both cases that the first layer of convolutional weights had either 64 or 160 output channels. The design we used has a few differences to [31], such as that the final weights layer is a convolutional layer applied before the global average pooling layer; see Figure 2, which illustrates this aspect and the overall design. More details are described in [10]. Note also that in the design of [10] a BN layer was applied to the RGB input channels prior to the first convolution layer. Here we do the same for all variations, including the sReLU one. Provided that the shift and scale factors are not learned, this BN layer is equivalent to preprocessing raw data, and does not need to be considered to form a network layer.
Training was carried out similarly to [10], i.e. we used backpropagation and stochastic gradient descent, with minibatches of size 125 samples, momentum of 0.9 and weight decay with a value of 0.0005. Weights were initialized using the method of [15], while BN gains were initialized to 1 and shifts to 0. No convolutional layer biases were used. Unlike [10], we did not use a warm-restart learning rate schedule, as we found this could sometimes lead to divergence following a restart with the sReLU networks. However, we did use a cosine learning-rate decay schedule, starting at an initial value of and finishing at
after 300 epochs (CIFAR) or 60 epochs (ImageNet). For networks where BN layers were used, we computed the mean and variance statistics following the end of training, by calculating averages over all minibatches in 1 epoch, with learning turned off.
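A cosine learning-rate decay schedule of the kind mentioned above can be sketched as follows. The start and end learning rates used here (0.1 and 1e-4) are placeholder assumptions for illustration only and are not the values used in the experiments; only the epoch count of 300 is taken from the text.

```python
import math

def cosine_lr(epoch, total_epochs, lr_start, lr_end):
    """Cosine decay from lr_start at epoch 0 to lr_end at total_epochs."""
    progress = epoch / total_epochs
    return lr_end + 0.5 * (lr_start - lr_end) * (1.0 + math.cos(math.pi * progress))

# Placeholder learning rates (assumed); 300 epochs as for the CIFAR runs.
schedule = [cosine_lr(e, 300, lr_start=0.1, lr_end=1e-4) for e in range(301)]
```

The schedule decreases monotonically with no restarts, in contrast to a warm-restart schedule where the rate periodically jumps back up.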
III-B 1-bit-per-weight networks
As well as training networks with the usual 32-bit floating point precision for all variables, including learned weights, we also trained networks using a method for enabling storage of learned weights and inference to take place using 1-bit values. We followed the method of [10]; differences in training are summarised for the case of sReLUs in Figure 3.
We emphasize that the main motivation of this paper is to examine whether methods for enabling reduced-precision representations in deep neural networks, such as [10], still work effectively when batch-normalization layers are removed.
III-C Shifted ReLUs
III-D Training innovations for networks with BN layers removed
For the model where all BN layers but the final one were removed, we found we did not need to change any aspect of training relative to our baseline models.
However, without any BN layers, we found that training diverged, or converged very slowly. A partial remedy to this problem is to simply reduce the learning rate, a fact consistent with one of the advantages of using BN layers, i.e. that the initial learning rate can be set higher [1] than in networks without BNs. However, this leads to slow convergence and increased error rates.
We investigated why this was the case. We found, surprisingly, that using just a single BN layer between the final convolutional layer and the global average pooling layer was sufficient to enable the network to be trained exactly like the baseline models.
Based on this observation, we hypothesized that the main reason that networks with BN layers enable a larger learning rate is the standardized scaling that the final BN layer imparts on gradients from the output prior to backpropagation to weight layers. Indeed, we observed that without a final BN layer, the distribution of gradients at the output early in training has a longer tail than when BN is included.
We therefore introduced in our shifted ReLU networks a constant scaling layer to replace the BN layer between the final convolutional layer and global average pooling layer. For simplicity we have set the constant scaling identically for all channels (which in our design at this point in the network is equal to the number of classes).
Due to the linearity of the global average pooling layer, changing this scaling is equivalent to changing the temperature in the softmax layer from its default value of 1. That is, our approach corresponds to employing a final softmax layer of the form
$$p_k = \frac{\exp(z_k / T)}{\sum_{j=1}^{K} \exp(z_j / T)}, \tag{6}$$
where $z_k$ is the output of the global average pooling layer for class $k$, $K$ is the number of classes, and $T$ is the temperature.
We found a value for the temperature $T$ between approximately 30 and 100 to enable training to take place identically to the baseline models, including the same high initial learning rate. Lower values of $T$ tended to result in failure to converge shortly after training commenced.
Note that although the gradient propagated back is linearly scaled by $1/T$, changing $T$ from the default of $T = 1$ is not equivalent to simply changing the learning rate by a factor of $1/T$. This is due to the nonlinearity in the softmax layer. Increasing $T$ has the effect of moving softmax outputs away from 0 or 1, thereby increasing the entropy of the output vector. In turn, this means gradients propagating backwards early in training have a distribution with lower standard deviation.
That this temperature scaling is all that is needed to ensure a high learning rate suggests that the main problem with larger learning rate for models without BN layers is simply that early in training, the gradients calculated at the output of the network are too high for high learning rates, and that the normalization of the final BN layer compensates for this.
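The effect of the softmax temperature on output entropy and on the gradient at the network output can be checked numerically. In this sketch the logit scale of 50, the temperature of 50 and the choice of a misclassified sample are assumptions made for illustration; the gradient formula is the standard cross-entropy gradient with respect to the pre-softmax values.

```python
import numpy as np

def softmax_t(z, T=1.0):
    """Softmax with temperature T applied to pre-softmax values z."""
    s = z / T
    e = np.exp(s - s.max())    # subtract the max for numerical stability
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats, ignoring exact zeros."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def grad_wrt_logits(z, label, T):
    """Gradient of the cross-entropy loss with respect to z: (p - onehot) / T."""
    g = softmax_t(z, T)
    g[label] -= 1.0
    return g / T

rng = np.random.default_rng(0)
z = rng.normal(scale=50.0, size=10)    # large unnormalized outputs (assumed scale)
label = int(np.argmin(z))              # a badly misclassified sample

g_default = grad_wrt_logits(z, label, T=1.0)
g_scaled = grad_wrt_logits(z, label, T=50.0)
```

Here `g_scaled` is much smaller than `g_default`, yet it is not simply `g_default / 50`, because the softmax outputs themselves change with the temperature.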
IV Experiments and Results
We trained the following variations of our wide residual networks on both CIFAR 10 and CIFAR 100, all for both width 4 and width 10.

1) Baseline 1: networks as in Figure 2, with conventional BN layers where scales and shifts are learned.

2) Baseline 2: the same as Baseline 1, but without any scales and shifts learned, as in [10].

3) A single BN layer at the end of the networks only (no scale and shift learned), with all other BN-ReLU combinations replaced by sReLU.

4) No BN layers: all replaced by sReLU, as in Figure 4.

5) No BN layers: all replaced by ELU [29].

6) No BN layers: all replaced by sReLU, except for a “mean-only BN” layer at the end of the network [27], without a learned bias.
Our results for width-4 and width-10 ResNets are summarized in Tables I and II, while Figures 5 and 6 show the mean and spread of accuracies for width-4 ResNets for CIFAR 10 and CIFAR 100 over 10 runs.
For ImageNet, we compared width-1 and width-2.5 networks, and an ensemble of 3 width-1 networks, for the case of Baseline 1, and case 3 in the above list. The results are summarised in Table III.
Our conclusions drawn from the results are left for Section V.
Table I: Error rates for CIFAR 10.

| Model | Baseline 1 | Baseline 2 | Final BN only | sReLU only | sReLU gap |
| --- | --- | --- | --- | --- | --- |
| ResNet 20-4, 32-bit weights | 3.97% | 3.80% | 4.42% | 4.67% | 0.87% |
| ResNet 20-10, 32-bit weights | 3.29% | 3.57% | 3.79% | 4.36% | 1.07% |
| ResNet 20-4, 1-bit weights | 4.51% | 4.65% | 4.47% | 4.66% | 0.19% |
| ResNet 20-10, 1-bit weights | 3.83% | 3.65% | 4.00% | 3.74% | 0.09% |

Table II: Error rates for CIFAR 100.

| Model | Baseline 1 | Baseline 2 | Final BN only | sReLU only | sReLU gap |
| --- | --- | --- | --- | --- | --- |
| ResNet 20-4, 32-bit weights | 22.10% | 19.53% | 20.18% | 22.24% | 2.69% |
| ResNet 20-10, 32-bit weights | 20.99% | 17.05% | 18.25% | 20.97% | 3.92% |
| ResNet 20-4, 1-bit weights | 22.82% | 22.07% | 22.31% | 23.69% | 1.60% |
| ResNet 20-10, 1-bit weights | 20.61% | 18.22% | 19.42% | 20.74% | 2.52% |

Table III: Error rates for ImageNet.

| Bits per weight | Model | # Learned parameters | Test mode | Baseline 1 | Final BN only | Gap |
| --- | --- | --- | --- | --- | --- | --- |
| 32 | ResNet 18-1 | 11.5M | centre crop | 12.41% | 15.50% | 3.01% |
| 32 | ResNet 18-1 | 11.5M | multi crop | 9.03% | 14.70% | 5.67% |
| 32 | Ensemble of 3 ResNet 18-1 | 34.5M | centre crop | 10.70% | 14.00% | 3.30% |
| 32 | Ensemble of 3 ResNet 18-1 | 34.5M | multi crop | 7.95% | 12.30% | 4.35% |
| 32 | ResNet 18-2.5 | 70M | centre crop | 9.20% | 9.51% | 0.3% |
| 32 | ResNet 18-2.5 | 70M | multi crop | 6.91% | 8.83% | 1.92% |
| 1 | ResNet 18-1 | 11.5M | centre crop | 17.55% | 23.66% | 6.11% |
| 1 | ResNet 18-1 | 11.5M | multi crop | 12.80% | 19.94% | 7.14% |
| 1 | Ensemble of 3 ResNet 18-1 | 34.5M | centre crop | 15.48% | 22.18% | 6.70% |
| 1 | Ensemble of 3 ResNet 18-1 | 34.5M | multi crop | 11.38% | 18.85% | 7.47% |
| 1 | ResNet 18-2.5 | 70M | centre crop | 11.51% | 11.81% | 0.3% |
| 1 | ResNet 18-2.5 | 70M | multi crop | 8.48% | 10.03% | 1.58% |
V Discussion
V-A The impact of removing BN layers
Our results indicate the following:

The importance of BN layers is dataset and model-size dependent. Our findings show that the accuracy loss when replacing all BN layers with shifted ReLUs is larger on CIFAR 100 than on CIFAR 10. For all three datasets, and especially ImageNet, the accuracy loss is very large for width-1 (non-wide) networks, but not for wide variants, indicating BN is much more important for smaller models.

For CIFAR, there is no clear advantage in using ELU layers rather than sReLU.

The impact of removing BN is larger for full-precision weights compared with 1-bit-per-weight.

Baseline 1, where gains and shifts are learned, is about as good for CIFAR 10 as Baseline 2, where they are not, but Baseline 2 is clearly better for CIFAR 100.

Using a single BN layer at the end of networks was mostly markedly better than using all sReLUs, and in some cases nearly as good as the best baseline.

For width-4 networks, the variations within models across runs for CIFAR 10 and 1-bit-per-weight indicated that any model could produce best results. For CIFAR 100, the first three models exhibited this effect, but the sReLU network clearly had a small gap in performance compared to other models. For 32-bit-per-weight models, gaps were larger, but the top two models for CIFAR 10 were the two baselines, while for CIFAR 100 they were Baseline 2 and the final-BN-only model.

The mean-only-BN model outperforms the All ReLU and All ELU networks for CIFAR 100 and 1-bit-per-weight, halving the error rate gap to Baseline 2.
From these observations, we propose the following:

Conclusion 1: Removing all BN layers can be expected to cause an accuracy penalty, but this penalty is potentially small, depending on the dataset and network architecture.

Conclusion 2: Removing all but the final BN layer is a potentially viable option for getting the benefits of removing most BN layers, but without as much accuracy loss as removing all.

Conclusion 3: There is no consistently best way to design networks with BNs; in some cases learning scales and shifts is beneficial, but in other cases it causes a big drop in performance.

Conclusion 4: There is less impact on accuracy in the case of 1-bit-per-weight compared with full-precision weights. Indeed, for CIFAR 10, sReLU networks and 1-bit-per-weight had an accuracy gap no larger than 0.2%, and only 0.3% for width-2.5 ImageNet with centre cropping, suggesting sReLU as a viable option when 1-bit-per-weight is used.
V-B Significance for custom hardware implementations
Hardware acceleration of CNNs is necessary to provide real-time embedded vision systems, e.g., UAVs and Internet of Things (IoT) devices, because existing systems using CPUs are too slow. To increase speed, most software-based CNNs use GPUs. However, it is difficult to deploy GPUs in embedded systems, since they consume a significant amount of power. Thus, custom rather than general-purpose hardware-based CNNs are desired for low-power and real-time embedded vision systems.
In hardware-based acceleration systems, most power consumption comes from the computation modules, e.g., the multipliers and adders, and from accessing the data, particularly the weights. By using binary weights, the multiplier can be replaced by a multiplexer, which consumes orders of magnitude less power. In modern CMOS technologies, e.g., 28 nm, a binary convolutional operation with multiplexers achieves a power efficiency of up to 230 1b-TOPS/W [36]. More importantly, using binary weights enables use of only on-chip memories, e.g., SRAMs, to store these weights. Accessing data stored in external memories would consume an order of magnitude more power (10×) than the computation itself [37].
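The replacement of a multiplier by a multiplexer can be illustrated in software. In this minimal sketch each weight is stored as a single bit, with 1 encoding +1 and 0 encoding -1 (an assumed encoding for illustration), and each step of the dot product selects either the input or its negation instead of multiplying.

```python
def dot_with_multiplier(x, w):
    """Reference dot product using one multiplication per weight."""
    return sum(xi * wi for xi, wi in zip(x, w))

def dot_with_select(x, w_bits):
    """Dot product with 1-bit weights: each bit selects +x or -x, no multiplies."""
    acc = 0
    for xi, bit in zip(x, w_bits):
        acc += xi if bit else -xi    # software analogue of a 2-to-1 multiplexer
    return acc

x = [3, -1, 4, 2]
w_bits = [1, 0, 0, 1]                        # stored 1-bit weights
w = [1 if b else -1 for b in w_bits]         # the +1/-1 values they encode

result_mul = dot_with_multiplier(x, w)
result_sel = dot_with_select(x, w_bits)
```

Both functions return the same value; the second needs only a select and an add per weight, which is what makes the multiplexer-based datapath possible.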
The use of batch-normalization layers might maintain maximum classification accuracy, but at the cost of extra silicon area and computation, and thus more power consumption. In particular, it requires a significant amount of silicon area to implement the nonlinear square and square-root operations in conventional batch-normalization. What makes it worse is that these operations impede low-bit-width quantization techniques [38]. One can easily implement the shifted ReLU activation function with a tiny silicon area, while achieving comparable accuracies to networks with BN layers.
V-C Future work
In future, it will be valuable to try to devise new low-complexity methods that narrow the accuracy gap between networks that use BN layers and the ones outlined here.
Acknowledgment
This work was supported by a Discovery Project funded by the Australian Research Council (project number DP170104600).
References
[1] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” Microsoft Research, Tech. Rep., 2015, arXiv:1512.03385.
[3] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training ImageNet in 1 hour,” arXiv e-prints, June 2017.
[4] Y. Wu and K. He, “Group normalization,” arXiv e-prints, Mar. 2018.
[5] S. Ioffe, “Batch renormalization: Towards reducing minibatch dependence in batch-normalized models,” CoRR, vol. abs/1702.03275, 2017. [Online]. Available: http://arxiv.org/abs/1702.03275
[6] F. N. Iandola, M. W. Moskewicz, K. Ashraf, S. Han, W. J. Dally, and K. Keutzer, “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size,” CoRR, vol. abs/1602.07360, 2016. [Online]. Available: http://arxiv.org/abs/1602.07360
[7] M. Courbariaux, Y. Bengio, and J. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” CoRR, vol. abs/1511.00363, 2015. [Online]. Available: http://arxiv.org/abs/1511.00363
[8] P. Merolla, R. Appuswamy, J. V. Arthur, S. K. Esser, and D. S. Modha, “Deep neural networks are robust to weight binarization and other non-linear distortions,” CoRR, vol. abs/1606.01981, 2016. [Online]. Available: http://arxiv.org/abs/1606.01981
[9] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, “XNOR-Net: ImageNet classification using binary convolutional neural networks,” CoRR, vol. abs/1603.05279, 2016. [Online]. Available: http://arxiv.org/abs/1603.05279
[10] M. D. McDonnell, “Training wide residual networks for deployment using a single bit for each weight,” in Proc. ICLR 2018, 2018, arXiv:1802.08530.
[11] J. Ba and R. Caruana, “Do deep nets really need to be deep?” in Advances in Neural Information Processing Systems, 2014, pp. 2654–2662.
[12] A. Choromanska, M. Henaff, M. Mathieu, G. Arous, and Y. LeCun, “The loss surfaces of multilayer networks,” in Artificial Intelligence and Statistics, 2015, pp. 192–204.
[13] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” arXiv preprint arXiv:1611.03530, 2016.
[14] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in AISTATS, vol. 9, 2010, pp. 249–256.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE International Conference on Computer Vision (ICCV) (see arXiv:1502.01852), 2015.

[16] D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio, “Why does unsupervised pre-training help deep learning?” The Journal of Machine Learning Research, vol. 11, pp. 625–660, 2010.
[17] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. New York, NY, USA: ACM, 2009, pp. 609–616.
[18] M. Ranzato, F. Huang, Y.-L. Boureau, and Y. LeCun, “Unsupervised learning of invariant feature hierarchies with applications to object recognition,” in Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on. IEEE, 2007, pp. 1–8.
[19] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[20] M. Zeiler, “Adadelta: An adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
[21] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, “Sharp minima can generalize for deep nets,” arXiv preprint arXiv:1703.04933, 2017.
[22] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[23] A. Morcos, D. Barrett, N. Rabinowitz, and M. Botvinick, “On the importance of single directions for generalization,” arXiv preprint arXiv:1803.06959, 2018.
[24] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[25] N. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. Tang, “On large-batch training for deep learning: Generalization gap and sharp minima,” arXiv preprint arXiv:1609.04836, 2016.
[26] J. Ba, J. Kiros, and G. Hinton, “Layer normalization,” arXiv preprint arXiv:1607.06450, 2016.
[27] T. Salimans and D. Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 901–909.
[28] Y. LeCun, L. Bottou, G. Orr, and K.-R. Müller, “Efficient backprop,” in Neural Networks: Tricks of the Trade, this book is an outgrowth of a 1996 NIPS workshop. Springer-Verlag, 1998, pp. 9–50.
[29] D. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” CoRR, vol. abs/1511.07289, 2015. [Online]. Available: http://arxiv.org/abs/1511.07289
[30] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry, “How does batch normalization help optimization? (No, it is not about internal covariate shift),” arXiv e-prints, May 2018.
[31] S. Zagoruyko and N. Komodakis, “Wide residual networks,” 2016, arXiv:1605.07146.
[32] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” Microsoft Research, Tech. Rep., 2016, arXiv:1603.05027.
[33] T. Devries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” CoRR, vol. abs/1708.04552, 2017. [Online]. Available: http://arxiv.org/abs/1708.04552
[34] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv e-prints, Mar. 2015.
[35] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” CoRR, vol. abs/1706.04599, 2017. [Online]. Available: http://arxiv.org/abs/1706.04599
[36] B. Moons, D. Bankman, L. Yang, B. Murmann, and M. Verhelst, “BinarEye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS,” in IEEE Custom Integrated Circuits Conference, CICC, San Diego, April 8–11, 2018, pp. 1–4.
[37] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks,” in ISSCC. IEEE, 2016, pp. 262–263.
[38] K. Ando, K. Ueyoshi, K. Orimo, H. Yonekawa, S. Sato, H. Nakahara, S. Takamaeda-Yamazaki, M. Ikebe, T. Asai, T. Kuroda, and M. Motomura, “BRein memory: A single-chip binary/ternary reconfigurable in-memory deep neural network accelerator achieving 1.4 TOPS at 0.6 W,” IEEE Journal of Solid-State Circuits, 12 2017.