Piecewise Linear Multilayer Perceptrons and Dropout

by Ian J. Goodfellow, et al.

We propose a new type of hidden layer for a multilayer perceptron, and demonstrate that it obtains the best reported performance for an MLP on the MNIST dataset.



1 The piecewise linear activation function

We propose to use a specific kind of piecewise linear function as the activation function for a multilayer perceptron.

Specifically, suppose that the layer receives as input a vector $x \in \mathbb{R}^n$. The layer then computes presynaptic output $z = x^\top W + b$, where $W \in \mathbb{R}^{n \times m}$ and $b \in \mathbb{R}^m$ are learnable parameters of the layer.

We propose to have each layer produce output via the activation function

$$h_i = \max_{j \in S_i} z_j,$$

where $S_i$ is a different non-empty set of indices into $z$ for each $i$.
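As a concrete illustration, such a layer can be sketched in a few lines of NumPy. The dimensions, the group size, and the choice of consecutive non-overlapping groups here are illustrative assumptions, not values fixed by the definition above:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, k = 8, 20, 5            # input dim, presynaptic dim, group size (illustrative)
W = rng.normal(size=(n, m))   # learnable weights
b = np.zeros(m)               # learnable biases

def piecewise_linear_layer(x, W, b, k):
    """Affine map followed by a max over each non-overlapping
    group S_i of k consecutive presynaptic units."""
    z = x @ W + b                        # presynaptic output z, shape (m,)
    return z.reshape(-1, k).max(axis=1)  # h_i = max over S_i, shape (m // k,)

x = rng.normal(size=n)
h = piecewise_linear_layer(x, W, b, k)
print(h.shape)  # (4,)
```

During learning, the max simply routes the gradient to whichever presynaptic unit attained it in each group.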

This function provides several benefits:

  • It is similar to rectified linear units (Glorot et al., 2011), which have already proven useful for many classification tasks.

  • Unlike rectifier units, every unit is guaranteed to have some of its parameters receive some training signal at each update step. This is because the inputs $z_j$ are only compared to each other, and not to 0, so one $z_j$ is always guaranteed to be the maximal element through which the gradient flows. In the case of rectified linear units, there is only a single element $z_i$ and it is compared against 0. In the case when $z_i < 0$, $z_i$ receives no update signal.

  • Max pooling over groups of units allows the features of the network to easily become invariant to some aspects of their input. For example, if a unit pools (takes the max) over $z_1$, $z_2$, and $z_3$, and $z_1$, $z_2$, and $z_3$ respond to the same object in three different positions, then $h_1$ is invariant to these changes in the object's position. A layer consisting only of rectifier units can't take the max over features like this; it can only take their average.

  • Max pooling can reduce the total number of parameters in the network. If we pool with non-overlapping receptive fields of size $k$, then $h$ has size $m / k$, and the next layer has its number of weight parameters reduced by a factor of $k$ relative to if we did not use max pooling. This makes the network cheaper to train and evaluate, and also more statistically efficient.

  • This kind of piecewise linear function can be seen as letting each unit learn its own activation function. Given large enough sets $S_i$, $h_i$ can implement increasingly complex convex functions of its input. This includes functions that are already used in other MLPs, such as the rectified linear function and absolute value rectification.
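The last point can be verified numerically: a max over a suitably chosen group reproduces both activations named above. In this NumPy check, the second unit in each group is assumed to have fixed weights (zero, or the negation of the first unit's):

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)   # one presynaptic unit's output over a range of inputs

# Rectifier: pool over {z, 0}, where the second unit's weights and bias are zero.
relu = np.maximum.reduce([z, np.zeros_like(z)])
assert np.allclose(relu, np.maximum(z, 0.0))

# Absolute value rectification: pool over {z, -z}, i.e. a second unit
# whose incoming weights are the negation of the first's.
abs_rect = np.maximum.reduce([z, -z])
assert np.allclose(abs_rect, np.abs(z))
```

With larger groups, the same construction yields the pointwise max of many affine functions, i.e. an arbitrary piecewise linear convex function.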

2 Experiments

We used $|S_i| = 5$ in our experiments. In other words, the activation function consists of max pooling over non-overlapping groups of five consecutive presynaptic inputs.

We apply this activation function to the multilayer perceptron trained on MNIST by Hinton et al. (2012). This MLP uses two hidden layers of 1200 units each. In our setup, the presynaptic activation $z$ has size 1200, so the pooled output of each layer has size 240. The rest of our training setup remains unchanged apart from adjustments to hyperparameters.
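For these sizes, the shape and parameter arithmetic works out as follows (the next-layer width of 1200 matches the architecture above; the counts exclude biases):

```python
m, k = 1200, 5
pooled = m // k
print(pooled)  # 240 pooled outputs per hidden layer

# Incoming weight count of the next 1200-unit layer, with and without pooling:
with_pool = pooled * 1200      # 288,000 weights
without_pool = m * 1200        # 1,440,000 weights
assert without_pool == k * with_pool   # a factor-of-k reduction
```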

Hinton et al. (2012) report 110 errors on the test set. To our knowledge, this is the best published result on the MNIST dataset for a method that uses neither pretraining nor knowledge of the input geometry.

It is not clear how Hinton et al. (2012) obtained a single test set number. We train on the first 50,000 training examples, using the last 10,000 as a validation set. We use the misclassification rate on the validation set to determine at what point to stop training. We then record the log likelihood on the first 50,000 examples, and continue training but using the full 60,000 example training set. When the log likelihood of the validation set first exceeds the recorded value of the training set log likelihood, we stop training the model, and evaluate its test set error. Using this approach, our trained model made 94 mistakes on the test set. We believe this is the best-ever result that does not use pretraining or knowledge of the input geometry.
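The two-stage stopping rule above can be sketched as plain control flow; the metric sequences below are hypothetical placeholders, not measurements from our runs:

```python
def stage_one_stop(valid_errors):
    """Train on the 50,000-example subset; return the step with the best
    validation misclassification rate (the point at which to stop)."""
    return min(range(len(valid_errors)), key=lambda t: valid_errors[t])

def stage_two_stop(valid_log_liks, recorded_train_ll):
    """Continue training on all 60,000 examples; stop at the first step
    where the validation-set log likelihood exceeds the log likelihood
    recorded on the original 50,000 training examples."""
    for t, ll in enumerate(valid_log_liks):
        if ll > recorded_train_ll:
            return t
    return None  # threshold never reached

# Hypothetical metric traces, for illustration only:
best = stage_one_stop([0.05, 0.03, 0.02, 0.025])     # stops at step 2
stop = stage_two_stop([-0.30, -0.22, -0.15], -0.20)  # stops at step 2
```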