Unbounded Output Networks for Classification

07/25/2018 · by Stefan Elfwing, et al. · Okinawa Institute of Science and Technology Graduate University

We previously proposed the expected energy-based restricted Boltzmann machine (EE-RBM) as a discriminative RBM method for classification. Two characteristics of the EE-RBM are that the output is unbounded and that the target value of correct classification is set to a value much greater than one. In this study, by adopting features of the EE-RBM approach to feed-forward neural networks, we propose the UnBounded output network (UBnet), which is characterized by three features: (1) unbounded output units; (2) the target value of correct classification is set to a value much greater than one; and (3) the models are trained by a modified mean-squared error objective. We evaluate our approach using the MNIST, CIFAR-10, and CIFAR-100 benchmark datasets. We first demonstrate, for shallow UBnets on MNIST, that setting the target value equal to the number of hidden units significantly outperforms setting it equal to one, and that it also outperforms standard neural networks by about 25%. We then validate our approach by achieving high-level classification performance on the three datasets using unbounded output residual networks. We finally use MNIST to analyze the learned features and weights, and we demonstrate that UBnets are much more robust against adversarial examples than the standard approach of using a softmax output layer and training the networks by a cross-entropy objective.


I Introduction

We proposed the expected energy-based restricted Boltzmann machine (EE-RBM) as a discriminative RBM method for classification [1]. The main difference between the EE-RBM architecture and the standard feed-forward neural network architecture is that the output is not computed in specific output nodes. Instead, the output is defined as the negative expected energy of the RBM, which is computed by the weighted sum of all bi-directional connections in the network. Two characteristics of the EE-RBM are that the output values are unbounded and that the target value of correct classification, $T$, is related to the size of the network and is therefore set to a value much greater than one. We have successfully applied the EE-RBM in the reinforcement learning domain [2], achieving what was then the state-of-the-art score in stochastic SZ-Tetris and achieving effective learning in a robot navigation task with raw and noisy RGB images as state input.

In this study, by adopting features of the EE-RBM approach to feed-forward neural networks, we propose the UnBounded output network (UBnet), which is characterized by three features:

  1. Unbounded units in the output layer;

  2. The target value of correct classification, $T$, is set to a value much greater than one;

  3. The models are trained by a modified mean-squared error objective that gives more weight to errors that correspond to correct classification and less weight to errors that correspond to incorrect classification.

We use the sigmoid-weighted Linear Unit (SiLU), which we originally proposed as an activation function for neural networks in the reinforcement learning domain [3]. The SiLU unit is also based on the EE-RBM. The activation of the SiLU is computed by the sigmoid function multiplied by its input, which is equal to the contribution to the negative expected energy from one hidden unit in an EE-RBM, where the negative expected energy is equal to the negative free energy minus the entropy.

We have successfully used the SiLU and its derivative (dSiLU) as activation functions in neural network-based function approximation in reinforcement learning [3, 4], achieving the current state-of-the-art scores in SZ-Tetris and in $10\times10$ Tetris, and achieving competitive performance compared with the DQN algorithm [5] in the domain of classic Atari 2600 video games. After we first proposed the SiLU [6], Ramachandran et al. [7] performed a comprehensive comparison between the SiLU, the Rectified Linear Unit (ReLU) [8], and six other activation functions in the supervised learning domain. They found that the SiLU consistently outperformed the other activation functions when tested using three deep architectures on CIFAR-10/100 [9], using five deep architectures on ImageNet [10], and on four test sets for English-to-German machine translation.

We use the MNIST dataset to demonstrate that for shallow UBnets, a setting of $T$ equal to the number of hidden units significantly outperforms a setting of $T = 1$, and it also outperforms standard neural networks without additional optimization by about 25%. We train UnBounded output Residual networks (UBRnets) on the MNIST, CIFAR-10, and CIFAR-100 benchmark datasets and validate our approach by achieving high-level classification performance on the three datasets. We use the CIFAR-10 dataset to demonstrate a small but significant improvement in performance of UBRnets with SiLUs compared with UBRnets with ReLUs. We finally use MNIST to analyze the features and weights learned by UBnets, and we demonstrate that UBnets are much more robust against adversarial examples [11] than the standard approach of using a softmax output layer and training the networks by a cross-entropy objective.

II Method

We proposed the EE-RBM [1] as a discriminative learning approach to provide a self-contained RBM [12, 13, 14] method for classification. In an EE-RBM, the output $Q(\mathbf{x}, \mathbf{y}_k)$ of an input vector $\mathbf{x}$ and a class vector $\mathbf{y}_k$ (“one-of-$K$” coding) is computed by the negative expected energy of the RBM, which is given by the weighted sum of all bi-directional connections in the network:

$Q(\mathbf{x}, \mathbf{y}_k) = \sum_j z_{jk}\,\sigma(z_{jk}) + \sum_i b_i x_i + b_k$   (1)
$z_{jk} = \sum_i w_{ji} x_i + u_{jk} + b_j$   (2)

Here, $\sigma(\cdot)$ is the sigmoid function, $z_{jk}$ is the input to hidden unit $j$ for class $k$, $w_{ji}$ is the weight connecting input unit $i$ and hidden unit $j$, $u_{jk}$ is the weight connecting class unit $k$ and hidden unit $j$, $b_j$ is the bias weight for hidden unit $j$, $b_i$ is the bias weight for input unit $i$, and $b_k$ is the bias weight for class unit $k$.
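To make the computation concrete, the following is a minimal NumPy sketch of the output as reconstructed in (1)-(2); the array names and shapes are illustrative assumptions, not the authors' implementation.

```python
# A minimal NumPy sketch of the EE-RBM output (1)-(2): each hidden unit
# contributes z_jk * sigmoid(z_jk) to the output for class k, plus the bias
# terms. Array names and shapes are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ee_rbm_output(x, W, U, b_hid, b_in, b_class):
    """x: (I,) input; W: (J, I) input-hidden weights; U: (J, K) class-hidden
    weights; b_hid: (J,) hidden biases; b_in: (I,) input biases;
    b_class: (K,) class biases. Returns one unbounded output value per class."""
    z = W @ x + b_hid                      # (J,) class-independent part of z_jk
    Z = z[:, None] + U                     # (J, K) z_jk for every class k
    return (Z * sigmoid(Z)).sum(axis=0) + b_in @ x + b_class  # (K,)
```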

In this study, we use the SiLU for neural network-based classification. The activation of the SiLU is computed by the sigmoid function multiplied by its input, which is equal to the contribution to the output from one hidden unit in an EE-RBM. Given an input vector $\mathbf{x}$, the activation of a SiLU (in a hidden layer or the output layer), $a_k(\mathbf{x})$, is given by:

$a_k(\mathbf{x}) = z_k\,\sigma(z_k)$   (3)
$z_k = \sum_i w_{ik} x_i + b_k$   (4)

Here, $w_{ik}$ is the weight connecting input $x_i$ and unit $k$, and $b_k$ is the bias weight for unit $k$. For $z_k$-values of large magnitude, the activation of the SiLU is approximately equal to the activation of the ReLU (see Fig. 1), i.e., the activation is approximately equal to zero for large negative $z_k$-values and approximately equal to $z_k$ for large positive $z_k$-values. Unlike the ReLU (and other commonly used activation units such as sigmoid and tanh units), the activation of the SiLU is not monotonically increasing. Instead, it has a global minimum value of approximately $-0.28$ for $z_k \approx -1.28$.

Fig. 1: The activation functions of the SiLU ($a_k(z_k) = z_k\,\sigma(z_k)$) and the ReLU ($a_k(z_k) = \max(0, z_k)$).

The derivative of the activation function of the SiLU, used for stochastic gradient-descent updates of the weight parameters, is given by

$\frac{\partial a_k(\mathbf{x})}{\partial z_k} = \sigma(z_k)\bigl(1 + z_k\,(1 - \sigma(z_k))\bigr)$   (5)
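For reference, here is a small NumPy sketch of the SiLU (3)-(4) and its derivative (5); the numerical check of the global minimum is only an illustration.

```python
# A small NumPy sketch of the SiLU activation and its derivative.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def silu(z):
    return z * sigmoid(z)                       # a_k = z_k * sigmoid(z_k), eq. (3)

def silu_grad(z):
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))            # d a_k / d z_k, eq. (5)

z = np.linspace(-6.0, 6.0, 1201)
print(z[np.argmin(silu(z))], silu(z).min())     # approx. -1.28 and -0.28
```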

Two features of the EE-RBM are that the output of the network is unbounded and that the target value for correct classification, $T$, is set to a value $T \gg 1$. In this study, we emulate this approach by proposing UBnets with unbounded units, such as the SiLU, in the output layer of standard feed-forward neural network architectures. The learning is achieved by a modified mean-squared error training objective:

$E(\mathbf{x}) = \frac{1}{2T}\sum_k e_k(\mathbf{x})^2$   (6)
$e_k(\mathbf{x}) = T\,t_k - o_k(\mathbf{x})$   (7)

Here, $t_k$ is the standard target value ($t_k = 1$ if training example $\mathbf{x}$ belongs to class $k$, otherwise $t_k = 0$) and $o_k(\mathbf{x})$ is the output value for class $k$. The stochastic gradient-descent update of the parameters, $\theta$, for an input vector $\mathbf{x}$ is then computed by either

$\theta \leftarrow \theta + \frac{\alpha}{T}\sum_k e_k(\mathbf{x})\,\nabla_\theta\, o_k(\mathbf{x})$   (8)

or

$\theta \leftarrow \theta + \alpha\sum_k \Bigl(t_k - \frac{o_k(\mathbf{x})}{T}\Bigr)\nabla_\theta\, o_k(\mathbf{x})$   (9)

Here, $\alpha$ is the learning rate.

For $T \gg 1$, the modified objective is not proportional to the standard mean-squared error training objective. Errors corresponding to incorrect classification ($t_k = 0$) are weighted less, by a factor of $1/T$, because the squared errors are scaled by $1/T$ in (6). Errors corresponding to correct classification ($t_k = 1$) are weighted more. This is especially the case in the beginning of the learning (assuming that the output weights are initialized so that the outputs are zero or close to zero), when $o_k(\mathbf{x}) \approx 0$ and therefore $e_k(\mathbf{x}) \approx T$ (see (7)).
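The following sketch shows one stochastic gradient-descent step with this objective for a shallow UBnet with SiLU hidden and output units, assuming the reconstructed equations (6)-(9) above; the two-layer setup and all variable names are illustrative assumptions, not the authors' code.

```python
# One SGD step for a shallow SiLU UBnet with the modified MSE objective.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def silu(z):
    return z * sigmoid(z)

def silu_grad(z):
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))

def sgd_step(x, t, W1, b1, W2, b2, T, alpha):
    """x: input vector; t: one-hot target; T: target value for correct class."""
    z1 = W1 @ x + b1; h = silu(z1)              # hidden layer
    z2 = W2 @ h + b2; o = silu(z2)              # unbounded SiLU output layer
    e = T * t - o                               # error (7): target T for the correct class
    delta2 = (e / T) * silu_grad(z2)            # scaled error, as in update (8)
    delta1 = (W2.T @ delta2) * silu_grad(z1)    # backpropagate to the hidden layer
    W2 += alpha * np.outer(delta2, h); b2 += alpha * delta2
    W1 += alpha * np.outer(delta1, x); b1 += alpha * delta1
    return W1, b1, W2, b2
```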

For negative input values to the output layer, $z_k < 0$, the output $o_k(\mathbf{x})$ is either equal to zero (ReLU output) or not a monotonically increasing function of $z_k$ (SiLU output). We therefore classify input vectors with unknown class labels, $\mathbf{x}$, in the validation and test sets according to the largest $z$-value:

$\hat{k}(\mathbf{x}) = \operatorname*{arg\,max}_k\, z_k(\mathbf{x})$   (10)
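A minimal sketch of the classification rule (10), using the same hypothetical two-layer setup as the sketch above:

```python
# Predict the class whose output unit receives the largest input z, as in (10).
import numpy as np

def predict(x, W1, b1, W2, b2):
    z1 = W1 @ x + b1
    h = z1 / (1.0 + np.exp(-z1))   # SiLU hidden activations: z * sigmoid(z)
    z = W2 @ h + b2                # inputs to the unbounded output units
    return int(np.argmax(z))       # the largest z-value decides the class
```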

III Experiments

We evaluated our approach using the MNIST [15], CIFAR-10 [9], and CIFAR-100 [9] benchmark datasets. We first used shallow UBnets (i.e., with one hidden layer) on the MNIST dataset to demonstrate that, for networks with unbounded output units, classification performance is significantly improved by setting the target value for correct classification $T \gg 1$.

Layer name   Layer type
inputBN      BN
conv1        conv: 3×3, n filters, BN
conv2_x      N residual units, n filters
conv3_x      N residual units, 2n filters (stride 2 in the first conv)
conv4_x      N residual units, 4n filters (stride 2 in the first conv)
max pooling  stride 2
fc1          fc: 2048
fc2          fc: #classes
TABLE I: UBRnet architecture used for the MNIST, CIFAR-10, and CIFAR-100 experiments. The number of residual units per stack, N, and the base width, n, are set per experiment (see the text).

In the subsequent experiments, we used a residual network (ResNet) [16] architecture with unbounded output (UBRnets; for details, see Table I), similar to the architecture of the ResNets used in the CIFAR-10 experiments in [16]. Our UBRnet architecture consisted of applying batch normalization (BN) [17] to the input to the network, followed by a convolutional layer, three stacks of residual units, a max pooling layer, a fully-connected (fc) layer, and an unbounded output layer with either 10 (MNIST and CIFAR-10) or 100 (CIFAR-100) units. All convolutional filters were of size 3×3. There were $n$ convolutional filters in the first convolutional layer and in the first stack of residual units (conv1 and conv2_x in Table I). The number of convolutional filters was then increased by a factor of 2 in each stack of residual units after the first and, at the same time, downsampling by a factor of 2 was performed using a stride of 2 in the first convolutional layer of the first residual unit in the stack (conv3_1 and conv4_1 in Table I). All shortcut connections performed parameter-free (option A in [16]) identity mappings [18]. Based on preliminary CIFAR-10 experiments, we changed the order of the modules in the residual units from convolution-BN-activation in the original residual unit [16] to convolution-activation-BN (previously investigated in [19]).

If not otherwise noted, the UBRnets were trained using a mini-batch size of 100. The network weights were initialized as in [20], except for the fully-connected layers, which were initialized using a zero-mean Gaussian distribution with a standard deviation of 0.1 (first fc layer) and of 0.00001 (second fc layer). We did not use common optimization/regularization techniques such as momentum, weight decay, and dropout. Following [21] and [16], we report test set performance as best (mean ± standard deviation) based on five independent runs where the UBRnets were trained on the original training sets. The UBRnets were implemented using the MatConvNet toolbox for MATLAB [22].

III-A MNIST

Fig. 2: The average number of validation set errors (over five experiments) on the MNIST dataset as a function of the target value for correct classification, $T$, for shallow SiLU and ReLU UBnets with 500, 2000, and 4000 hidden units.

The MNIST dataset [15] consists of 60000 training images and 10000 test images of ten handwritten digits, zero to nine, with an image size of 28×28 pixels. The grayscale pixel values were normalized to the range [0, 1] by dividing the values by 255.

Method Test error rate (%)
Maxout [23] 0.45
CKN [24] 0.39
DSN [25] 0.39
FitNet-LSUV-SVM [19] 0.38
RCNN [26] 0.31
UBRnet 0.28 (0.34 ± 0.039)
Gated Pooling [27] 0.29 ± 0.016
MIN [28] 0.24
TABLE II: Test set error rate on the MNIST dataset without data augmentation.

We first test our hypothesis that for UBnets classification performance is significantly improved by setting the target value for correct classification $T \gg 1$. We trained shallow UBnets with either SiLUs or ReLUs in both the hidden and the output layers, with 500, 2000, and 4000 hidden units ($N_h$) and six different settings of $T$, including $T = 1$ and $T = N_h$. For each setting of $T$ and $N_h$, we trained the networks for 50 epochs and repeated each experiment five times. The results show a quite remarkable improvement for settings of $T \gg 1$ (see Fig. 2). For example, in the experiments with 4000 hidden units, a setting of $T = N_h$ reduced the average number of validation set errors by 31 (from 149 to 118) for the SiLU UBnet and by 48 (from 166 to 118) for the ReLU UBnet, compared with a setting of $T = 1$.

Method Depth CIFAR-10 CIFAR-100
DSN [25] - 7.97 34.57
All-CNN [29] - 7.25 33.71
Highway [21] - 7.54 32.24
ELU-CNN [30] - 6.55 24.28
FractalNet [31] 20 5.22 23.30
with dropout/drop-path 20 4.60 23.73
ResNet [16] 110 6.43 27.22 [32]
ResNet with stoch. depth [32] 110 5.25 24.98
1202 4.91 -
ResNet (pre-activation) [18] 164 5.46 24.33
1001 4.62 22.71
Wide ResNet [33] 16 4.56 21.59
28 4.17 20.43
DenseNet-BC [34] 100 4.51 22.27
190 3.46 17.18
UBRnet () 15 8.37 (8.67 ± 0.24) -
UBRnet () 15 6.80 (7.04 ± 0.18) -
UBRnet () 15 6.17 (6.27 ± 0.11) -
UBRnet () 15 5.67 (5.85 ± 0.14) 26.42 (26.70 ± 0.37)
UBRnet () 15 5.33 (5.54 ± 0.19) 24.54 (25.00 ± 0.31)
UBRnet () 15 5.25 (5.35 ± 0.08) 22.94 (23.26 ± 0.26)
TABLE III: Test set error rate (%) on CIFAR-10 and CIFAR-100 with standard data augmentation.

For both types of networks and all settings of $N_h$, a setting of $T = N_h$ achieved slightly better, or equally good, average performance. The only exception was the ReLU UBnet with 500 hidden units, where a different setting of $T$ performed slightly better. Based on these results, $T$ was set to the number of hidden units in the last hidden layer in the subsequent experiments.

The result of about 120 errors is a large improvement compared with the approximately 160 errors achieved by standard neural networks with either sigmoid [35] or ReLU [36] hidden units in the permutation invariant version of the MNIST task that do not use dropout training or other advanced regularization/optimization techniques.

On the MNIST dataset, we trained UBRnets with SiLUs for 25 epochs without data augmentation, using a fixed learning rate, $N = 1$, and $n = 64$ (one residual unit in each of the three stacks, with 64, 128, and 256 filters; see Table I). The target value for correct classification was set to 2048, as there were 2048 SiLUs in the first fully-connected layer (see Table I). As shown in Table II, our UBRnet achieved a test error rate of 0.28%, which is slightly worse than the current state-of-the-art of 0.24% achieved by the batch-normalized maxout network in network (MIN) [28].

III-B CIFAR-10 and CIFAR-100

The CIFAR-10 dataset [9] consists of 32×32 natural RGB images belonging to 10 classes. The training set contains 50000 images and the test set contains 10000 images. We preprocessed the images by subtracting the per-pixel mean value, computed over the training set. We followed the standard data augmentation: four zero-valued pixels were padded to each side, and a 32×32 crop was randomly sampled from the padded image or its horizontal flip. During testing, the original images were evaluated. The networks were trained for 100 epochs and the learning rate was annealed after every 40 epochs. The target value for correct classification was set to 2048, as there were 2048 SiLUs in the first fully-connected layer (see Table I).

We first compared UBRnets with SiLU and ReLU activation functions in both the hidden and the output layers. We trained the UBRnets with $N = 1$ and a single width setting, and performed 10 independent runs for each of the two activation functions. The SiLU UBRnet achieved a mean test error rate of 5.85 ± 0.14%, which is significantly better (Wilcoxon signed-rank test) than the 6.04 ± 0.15% achieved by the ReLU UBRnet.

Following the result in [33], which showed that the width of residual networks (i.e., the number of convolutional filters) is more important for performance than the depth (i.e., the number of layers), we trained SiLU UBRnets with $N = 1$ (15 layers) and six settings of the width $n$. For the largest width, the mini-batch size was halved to 50. The results of the experiments, as well as the reported results of other methods, are summarized in Table III. (The UBRnet result for the width used in the SiLU-ReLU comparison is based on 10, instead of 5, independent runs, as it is taken from the previous experiment.) The results show a large effect of increasing the width of the UBRnets, decreasing the test error rate from 8.37% for the smallest width to 5.25% for the largest width.

Fig. 3: Learning curves for the best runs on CIFAR-10 (left) and CIFAR-100 (right) for the widest UBRnet. The solid lines show the test error rates and the dotted lines show the training error rates.

The CIFAR-100 dataset [9] has the same format and size as the CIFAR-10 dataset. The number of classes is 100, i.e., the number of training images per class is a tenth of the number in CIFAR-10. We used the same experimental setup (preprocessing, UBRnet architecture, meta-parameter settings, and data augmentation) as in the CIFAR-10 experiments. We trained UBRnets with the three largest width settings. The test error rate (see Table III) decreased from 26.42% for the smallest of these widths to 22.94% for the largest. In contrast with CIFAR-10, there was a large improvement in performance when increasing the width from the second-largest setting (24.54%) to 450 (22.94%), which suggests that there is room for further improvement by further increasing the width of the UBRnet.

The classification results achieved by our UBRnets are encouraging, as they were achieved with minimal use of optimization/regularization techniques and without using pre-activation in the residual units (i.e., placing batch normalization and the activation function before each convolutional layer [18]). The widest UBRnet outperformed the 164-layer ResNet with pre-activation and performed only slightly worse than the extremely deep 1001-layer ResNet. Our UBRnet results were surpassed by a large margin only by the wide ResNet [33] and the DenseNet [34], both of which used pre-activation.

The learning was stable and fast, as shown in Fig. 3. For example, on CIFAR-10, the widest UBRnet reached a 10% test error rate after only 12 epochs. In comparison, the ResNets with pre-activation reached a 10% test error rate after 40-50 epochs (estimated from Figs. 1 and 6 in [18]) and the DenseNet reached it after more than 50 epochs (right panel of Fig. 4 in [34]).

IV Analysis of Unbounded Output Units

A distinct feature of our approach of using unbounded units in the output layer is that the value of an output unit (and thereby the input to the output unit, $z_k$) for correct classification is trained to match an exact value. The training error ($e_k$ in (7)) can therefore be both positive and negative. In contrast, for softmax output units, the training error is determined by the differences between the $z$-values, and the training error is always non-negative for correct classification. To investigate the effect of our approach, we used the MNIST task and trained three shallow networks with 2000 hidden units for 50 epochs on the original training set: a SiLU UBnet with $T = 2000$ (SiLU-T2k), a SiLU UBnet with $T = 1$ (SiLU-T1), and a network with SiLU hidden units and a softmax output layer that was trained with a cross-entropy objective (SiLU-SM). They achieved the following test error rates (training error rates): 1.27% (0.26%) for the SiLU-T2k network, 1.49% (0.14%) for the SiLU-T1 network, and 1.64% (0%) for the SiLU-SM network.

Fig. 4: Normalized histograms (150 bins) of the $z_c$- and $z_{\bar{c}}$-values for the SiLU-T2k and SiLU-T1 networks and of the $(z_c - z_{\bar{c}})$-values for the SiLU-SM network for the training set (top row), and normalized histograms of the $d$-values (bottom row) for the training and test sets.

To investigate the differences between the networks, we looked at the $z$-values after learning. If the correct class for an input vector is denoted $c$, let $z_c$ denote the $z$-value for correct classification and $z_{\bar{c}}$ the maximum $z$-value for incorrect classification. To get a measure of the networks' ability to separate the $z$-values for correct and incorrect classification, we computed the normalized distance between $z_c$ and $z_{\bar{c}}$ (similar to margin analysis for support vector machines): $d = \frac{z_c - z_{\bar{c}}}{\|\mathbf{w}_c - \mathbf{w}_{\bar{c}}\|}$, where $\mathbf{w}_k$ is the weight vector incident on output unit $k$. Negative $d$-values correspond to incorrectly classified instances.

The distributions of the $z_c$- and $z_{\bar{c}}$-values for the training set (see the top row of Fig. 4) show distinct differences between the networks. For the SiLU-T2k network, there was a clear separation between the narrow $z_c$-distribution, with a peak almost exactly at the target value $T = 2000$, and the very wide $z_{\bar{c}}$-distribution, with a mean value of about -1200. There was almost no overlap between the two distributions, except for a small number of images (157, or 0.26%) that were not only wrongly classified, but whose $z_c$-values were negative and often of large magnitude (mean of about -1400). For the SiLU-T1 network, there was considerable overlap between the $z_c$-distribution, with a peak at about 1.25, and the $z_{\bar{c}}$-distribution, with a peak at about 0.14. This strongly suggests that the worse performance of the SiLU-T1 network can be explained by the target value $T = 1$ being too small to learn a large enough separation between the $z$-values for correct and incorrect classification. For the SiLU-SM network, there were no wrongly classified training images, i.e., $z_c - z_{\bar{c}} > 0$ for all images. However, for a relatively large number of images (about 12% of the training set), the $(z_c - z_{\bar{c}})$-values were relatively small.

The bottom row of Fig. 4 shows the $d$-distributions for the training and test datasets. For the training set, the $d$-values for the SiLU-T2k network were much larger than for the other two networks. For example, the minimum $d$-value for correct classification was about 10.9, compared with about 0.45 for the SiLU-SM network.

IV-A Adversarial Examples

Recent works [37, 11, 38] have shown that neural networks are vulnerable to adversarial examples, i.e., they misclassify examples that are only slightly different from correctly classified examples. The changes can be so small that they are not visible to the human eye [11]. The larger “safety margin” of the SiLU-T2k network, in the form of larger $d$-values, suggests that it would be more resilient against adversarial examples. To test this hypothesis, we created adversarial examples of the MNIST test set using the fast gradient sign method [38], where an adversarial example $\tilde{\mathbf{x}}$ is created from the original example $\mathbf{x}$ according to

$\tilde{\mathbf{x}} = \mathbf{x} + \epsilon\,\mathrm{sign}\bigl(\nabla_{\mathbf{x}} E(\mathbf{x})\bigr)$   (11)

Fig. 5: Average test set accuracy of the SiLU-T2k, SiLU-T1, and SiLU-SM networks on the MNIST test set for adversarial examples with $\epsilon$ varied from 0 to 0.5.
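A NumPy sketch of the fast gradient sign method (11) applied to the shallow UBnet sketch from Section II; the gradient is written out for the modified mean-squared error objective and SiLU units, as an illustration rather than the authors' implementation.

```python
# Generate an adversarial example for a shallow SiLU UBnet by perturbing the
# input in the direction that increases the training objective.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def silu(z):
    return z * sigmoid(z)

def silu_grad(z):
    s = sigmoid(z)
    return s * (1.0 + z * (1.0 - s))

def fgsm(x, t, W1, b1, W2, b2, T, eps):
    """x: input; t: one-hot target; T: target value; eps: perturbation size."""
    z1 = W1 @ x + b1; h = silu(z1)                        # forward pass
    z2 = W2 @ h + b2; o = silu(z2)
    dE_dz2 = -((T * t - o) / T) * silu_grad(z2)           # gradient of the modified MSE (6)
    grad_x = W1.T @ ((W2.T @ dE_dz2) * silu_grad(z1))     # chain rule back to the input
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)   # keep pixels in [0, 1]
```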

The result of the experiment (Fig. 5) shows that the UBnets, and especially the SiLU-T2k network, were much more resilient against adversarial examples. For example, at an intermediate value of $\epsilon$, the networks achieved test set accuracy rates of 76.8% (SiLU-T2k), 42.7% (SiLU-T1), and 17.8% (SiLU-SM). For a larger value of $\epsilon$, the SiLU-SM network could only correctly classify 58 test set images (0.58%), which is similar to the accuracy rate of 0.1% achieved by a shallow softmax network in [38]. In the same study, a maxout network achieved an accuracy rate of 10.6%, which is much worse than the 52.94% accuracy rate achieved by the SiLU-T2k network in this study. The SiLU-T2k network maintained an almost 50% accuracy rate as $\epsilon$ increased to 0.5.

IV-B Learned Features and Weights

Fig. 6: The learned hidden layer filters for the SiLU-T2k network (top panel) and the SiLU-SM network (bottom panel). The columns show the 10 learned hidden layer filters with the highest median activation for each class (shown in order with the highest value at the top).

To investigate the difference in learned hidden layer feature representation between the SiLU-T2k network and the SiLU-SM network, we computed the median activation of each unit in the hidden layer for 100 randomly selected images from the training set from each class. The columns of Fig. 6 show the 10 learned hidden layer filters with the highest median activation for each class (shown in order with the highest value at the top). The visualization shows a clear difference between the two methods. In the SiLU-SM network, the 10 learned filters with the highest median activations were, to a large degree, different for each class. Only 14 filters were shared by two classes and no filter was shared by more than two classes. In contrast, in the SiLU-T2k network, there were 4 filters that were shared by more than 7 classes. Two of the filters were shared by all classes: the first filter generated the highest or second highest median activation for all classes and the second filter generated the second or third highest median activation for all classes.
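A short sketch of the median-activation analysis described above; the array names and shapes are illustrative assumptions.

```python
# For each class, compute the median activation of every hidden unit over a
# sample of images and keep the indices of the 10 filters with the highest
# median activation.
import numpy as np

def top_filters_per_class(H_act, labels, num_classes=10, top=10):
    """H_act: (num_images, num_hidden) hidden activations; labels: (num_images,)."""
    tops = {}
    for c in range(num_classes):
        med = np.median(H_act[labels == c], axis=0)   # median activation per hidden unit
        tops[c] = np.argsort(med)[::-1][:top]         # indices of the top-10 filters
    return tops
```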

Fig. 7: The learned output weights for the SiLU-T2k (top panel) and the SiLU-SM (bottom panel) networks. For each network, the hidden units were sorted by class according to the maximum value of the median activation and then grouped by class. The figure shows the output weights connected to the 50 hidden units with the highest median activation in each group.

To further investigate the difference between the two methods, we looked at the weights in the output layers. Fig. 7 shows the values of the trained weights in the output layer for the SiLU-T2k network and the SiLU-SM network. The hidden units were sorted by class according to the maximum value of the median activation and then grouped by class. The figure shows the output weights connected to the 50 hidden units with the highest median activation in each group. The visualized data shows two obvious differences between the two methods. First, the range of the trained SiLU-SM weights was about an order of magnitude larger than the range of the trained SiLU-T2k weights. Second, the trained SiLU-SM network had a less shared (or less global) output weight structure, as shown by higher positive values (yellow colors) for the rectangles along the diagonal and mostly values with smaller magnitudes (greenish colors) outside the diagonal. To a large degree, the SiLU-SM network learned separate classifiers, using non-overlapping subsets of the hidden units, for each class.

V Conclusion

In this study, inspired by the EE-RBM, we proposed unbounded output networks, UBnets, with unbounded units in the output layer and with the target value for correct classification set to a value $T \gg 1$. The UBnets are trained by a modified mean-squared error training objective, which weighs errors corresponding to correct (incorrect) classification more (less).

We demonstrated, using shallow UBnets on MNIST, that a setting of $T$ equal to the number of hidden units significantly outperformed a setting of $T = 1$, and it also outperformed the reported results of standard neural networks by about 25%. Using unbounded output residual networks, UBRnets, we validated our approach by achieving high-level classification performance on the MNIST, CIFAR-10, and CIFAR-100 datasets.

Finally, we used MNIST to demonstrate that UBnets are much more resilient against adversarial examples than the standard approach of using a softmax output layer and training the networks by a cross-entropy objective.

References

  • [1] S. Elfwing, E. Uchibe, and K. Doya, “Expected energy-based restricted Boltzmann machine for classification,” Neural Networks, vol. 64, pp. 29–38, 2015.
  • [2] ——, “From free energy to expected energy: Improving energy-based value function approximation in reinforcement learning,” Neural Networks, vol. 84, pp. 17–27, 2016.
  • [3] ——, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” Neural Networks, 2018. [Online]. Available: https://doi.org/10.1016/j.neunet.2017.12.012
  • [4] ——, “Online meta-learning by parallel algorithm competition,” in Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), 2018, pp. 426–433.
  • [5] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [6] S. Elfwing, E. Uchibe, and K. Doya, “Sigmoid-weighted linear units for neural network function approximation in reinforcement learning,” arXiv:1702.03118 [cs.LG], 2017.
  • [7] P. Ramachandran, B. Zoph, and Q. V. Le, “Searching for activation functions,” arXiv:1710.05941 [cs.NE], 2017.
  • [8] V. Nair and G. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proceedings of the International Conference on Machine Learning (ICML), 2010, pp. 807–814.
  • [9] A. Krizhevsky, “Learning multiple layers of features from tiny images,” University of Toronto, Tech. Rep., 2009.
  • [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
  • [11] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus, “Intriguing properties of neural networks.” arXiv:1312.6199 [cs.CV], 2013.
  • [12] P. Smolensky, “Information processing in dynamical systems: Foundations of harmony theory,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 1: Foundations, D. E. Rumelhart and J. L. McClelland, Eds.   MIT Press, 1986.
  • [13] Y. Freund and D. Haussler, “Unsupervised learning of distributions on binary vectors using two layer networks,” in Proceedings of Advances in Neural Information Processing Systems (NIPS), 1992.
  • [14] G. E. Hinton, “Training products of experts by minimizing contrastive divergence,” Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
  • [15] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” in Proceedings of the IEEE, vol. 86(11), 1998, pp. 2278–2324.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [17] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of the International Conference on Machine Learning (ICML), 2015, pp. 448–456.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in Proceeding of European Conference on Computer Vision (ECCV), 2016, pp. 630–645.
  • [19] D. Mishkin and J. Matas, “All you need is a good init,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the International Conference on Computer Vision (ICCV), 2015.
  • [21] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Training very deep networks,” in Proceedings of Advances in Neural Information Processing Systems (NIPS), 2015, pp. 2377–2385.
  • [22] A. Vedaldi and K. Lenc, “MatConvNet: Convolutional neural networks for MATLAB,” in Proceedings of the ACM International Conference on Multimedia (ACMMM), 2015.
  • [23] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, “Maxout networks,” in Proceedings of the International Conference on Machine Learning (ICML), 2013.
  • [24] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid, “Convolutional kernel networks,” in Proceedings of Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2627–2635.
  • [25] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu, “Deeply-supervised nets,” in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
  • [26] M. Liang and X. Hu, “Recurrent convolutional neural network for object recognition,” in Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3367–3375.
  • [27] C. Lee, P. W. Gallagher, and Z. Tu, “Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree,” in Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2016, pp. 464–472.
  • [28] J. Chang and Y. Chen, “Batch-normalized maxout network in network,” arXiv:1511.02583 [cs.CV], 2015.
  • [29] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv:1412.6806 [cs.LG], 2014.
  • [30] D. A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • [31] G. Larsson, M. Maire, and G. Shakhnarovich, “FractalNet: Ultra-deep neural networks without residuals,” in Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • [32] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger, “Deep networks with stochastic depth,” in Proceeding of European Conference on Computer Vision (ECCV), 2016, pp. 630–645.
  • [33] S. Zagoruyko and N. Komodakis, “Wide residual networks,” in Proceedings of the British Machine Vision Conference (BMVC), 2016.
  • [34] G. Huang, Z. Liu, and K. Q. Weinberger, “Densely connected convolutional networks,” arXiv:1608.06993 [cs.CV], 2017.
  • [35] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), 2003, pp. 958–963.
  • [36] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
  • [37] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli, “Evasion attacks against machine learning at test time,” in Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML-KDD), 2013, pp. 387–402.
  • [38] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in Proceedings of the International Conference on Learning Representations (ICLR), 2015.