In one of his recent seminars, Geoffrey Hinton mentioned that after all of the developments of neural networks [fausett1994fundamentals] and deep learning [goodfellow2016deep], perhaps it is time to move on from backpropagation [rumelhart1986learning] to newer algorithms for training neural networks. In particular, now that we know why shallow [soltanolkotabi2018theoretical] and deep [allen2019learning] networks work very well and why local optima are fairly good in networks [feizi2017porcupine], other training algorithms can help improve the insights into neural nets. Different training methods have been proposed for neural networks, some of which are backpropagation [rumelhart1986learning], genetic algorithms [montana1989training, leung2003tuning], and belief propagation as in restricted Boltzmann machines [hinton2006reducing].
A neural network can be viewed from a manifold learning perspective [hauser2017principles]. Most of the spectral manifold learning methods can be reduced to kernel principal component analysis [ham2004kernel], which is a projection-based method [ghojogh2019unsupervised]. Moreover, at its initialization, every layer of a network can be seen as a random projection [karimi2018exploring]. Hence, a promising direction could be a projection view of training neural networks. In this paper, we propose a new training algorithm for feedforward neural networks based on projection and backprojection (also called reconstruction). In the backprojection algorithm, we update the weights layer by layer. For updating a layer $\ell$, we project the data from the input up to the layer $\ell$. We also backproject the labels of the data from the last layer back to the layer $\ell$. The projected data and backprojected labels at layer $\ell$ should be equal because, in a perfectly trained network, projecting the data through all of the layers should result in the corresponding labels. Thus, minimizing a loss function over the projected data and backprojected labels would correctly tune the layer's weights. This algorithm is proposed for both the input and feature spaces, where in the latter, the kernel of the data is fed to the network.
2 Backprojection Algorithm
2.1 Projection and Backprojection in Network
In a neural network, every layer without its activation function acts as a linear projection. Without the nonlinear activation functions, a network/autoencoder is reduced to a linear projection/principal component analysis [ghojogh2019unsupervised]. If $U$ denotes the projection matrix (i.e., the weight matrix of a layer), $U^\top x$ projects $x$ onto the column space of $U$. The reverse operation of projection is called reconstruction or backprojection and is formulated as $U U^\top x$, which shows the projected data in the input space dimensionality (note that it is $U \sigma^{-1}(\sigma(U^\top x))$ if we have a nonlinear function $\sigma(\cdot)$ after the linear projection). At initialization, a layer acts as a random projection [karimi2018exploring], which is a promising feature extractor according to the Johnson-Lindenstrauss lemma [achlioptas2003database]. Fine-tuning the weights using labels makes the features more useful for discrimination of classes.
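As a minimal sketch of projection and backprojection for a single layer, the NumPy snippet below projects one point with a random weight matrix and then reconstructs it. The dimensions, the random seed, the use of tanh as the nonlinearity, and the orthonormalization of the columns (done here only so that projection followed by backprojection is an orthogonal projector) are assumptions made for illustration, not settings taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

d, d1 = 5, 3                       # input and layer dimensionalities (illustrative)
x = rng.standard_normal(d)         # one data point

# Random weight matrix; QR gives orthonormal columns (an assumption of this sketch).
U, _ = np.linalg.qr(rng.standard_normal((d, d1)))

z = U.T @ x                        # projection: coordinates in the d1-dimensional subspace
x_rec = U @ z                      # backprojection, back in the input dimensionality

# With a nonlinearity after the projection, its inverse is applied before backprojecting.
x_rec_nonlinear = U @ np.arctanh(np.tanh(z))
```

Because the columns of `U` are orthonormal here, projecting `x_rec` again leaves it unchanged; with a generic random matrix the reconstruction is still in the column space of `U`, but it is no longer an orthogonal projection of `x`.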
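Let us have a training set $\mathcal{X} := \{x_i \in \mathbb{R}^d\}_{i=1}^n$ and their one-hot encoded labels $\mathcal{Y} := \{y_i \in \mathbb{R}^p\}_{i=1}^n$, where $n$, $d$, and $p$ are the sample size, dimensionality of data, and dimensionality of labels, respectively. We denote the dimensionality or the number of neurons in layer $\ell$ by $d_\ell$. By convention, we have $d_0 := d$ and $d_{n_\ell} = p$, where $n_\ell$ is the number of layers and $p$ is the dimensionality of the output layer. Let the data after the activation function of the $\ell$-th layer be denoted by $x^{(\ell)} \in \mathbb{R}^{d_\ell}$. Let the projected data in the $\ell$-th layer be $z^{(\ell)} := U_\ell^\top x^{(\ell-1)} \in \mathbb{R}^{d_\ell}$, where $U_\ell \in \mathbb{R}^{d_{\ell-1} \times d_\ell}$ is the weight matrix of the $\ell$-th layer. Note that $x^{(\ell)} = \sigma_\ell(z^{(\ell)})$, where $\sigma_\ell(\cdot)$ is the activation function in the $\ell$-th layer. By convention, $x^{(0)} := x$. The data are projected and passed through the activation functions layer by layer; hence, $x^{(\ell)}$ is calculated as:

$$x^{(\ell)} := \sigma_\ell\Big(U_\ell^\top \sigma_{\ell-1}\big(U_{\ell-1}^\top \cdots \sigma_1(U_1^\top x)\big)\Big). \tag{1}$$

In a mini-batch gradient descent set-up, let $\{x_i\}_{i=1}^b$ be a batch of size $b$. For a batch, we denote the outputs of activation functions at the $\ell$-th layer by $\{x_i^{(\ell)}\}_{i=1}^b$.
Now, consider the one-hot encoded label of a batch sample, denoted by $y \in \mathbb{R}^p$. We take the inverse activation function of the label and then reconstruct or backproject it to the previous layer to obtain $y^{(n_\ell - 1)}$. We do similarly until the layer $\ell$. Let $y^{(\ell)} \in \mathbb{R}^{d_\ell}$ denote the backprojected data at the $\ell$-th layer, calculated as:

$$y^{(\ell)} := U_{\ell+1}\, \sigma_{\ell+1}^{-1}\Big(U_{\ell+2}\, \sigma_{\ell+2}^{-1}\big(\cdots U_{n_\ell}\, \sigma_{n_\ell}^{-1}(y)\big)\Big). \tag{2}$$

By convention, $y^{(n_\ell)} := y$. The backprojected batch at the $\ell$-th layer is $\{y_i^{(\ell)}\}_{i=1}^b$. We use $X := [x_1, \dots, x_b] \in \mathbb{R}^{d \times b}$ and $Y := [y_1, \dots, y_b] \in \mathbb{R}^{p \times b}$ to denote the column-wise batch matrix and its one-hot encoded labels.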
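The layer-by-layer projection of the data and backprojection of the labels described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names `project` and `backproject`, the layer sizes, and the use of identity activations (chosen so that the inverse activation is defined for one-hot labels) are all assumptions.

```python
import numpy as np

def project(x, weights, acts, upto):
    """Pass x through layers 1..upto, applying each layer's activation."""
    for U, act in zip(weights[:upto], acts[:upto]):
        x = act(U.T @ x)
    return x

def backproject(y, weights, inv_acts, downto):
    """Bring the label y from the last layer back to layer `downto`."""
    n_layers = len(weights)
    for l in range(n_layers, downto, -1):      # l = n_layers, ..., downto + 1
        y = weights[l - 1] @ inv_acts[l - 1](y)
    return y

rng = np.random.default_rng(0)
sizes = [4, 6, 5, 3]                            # d_0, d_1, d_2, d_3 (illustrative)
weights = [rng.standard_normal((sizes[k], sizes[k + 1])) for k in range(3)]

def identity(v):
    return v

acts = [identity] * 3
inv_acts = [identity] * 3

x = rng.standard_normal(4)
y = np.array([0.0, 1.0, 0.0])                   # one-hot label

x_2 = project(x, weights, acts, upto=2)         # data projected up to layer 2
y_2 = backproject(y, weights, inv_acts, downto=2)  # label backprojected to layer 2
```

Both `x_2` and `y_2` live in the 5-dimensional space of the second layer, which is what allows a loss to compare them there.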
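In the backprojection algorithm, we optimize the layers' weights one by one. Consider the $\ell$-th layer, with weight matrix $U_\ell \in \mathbb{R}^{d_{\ell-1} \times d_\ell}$ and activation function $\sigma_\ell(\cdot)$, whose loss we denote by $\mathcal{L}_\ell$:

$$\underset{U_\ell}{\text{minimize}} \;\; \mathcal{L}_\ell := \sum_{i=1}^{b} \mathcal{L}\big(x_i^{(\ell)}, y_i^{(\ell)}\big) = \sum_{i=1}^{b} \mathcal{L}\big(\sigma_\ell(U_\ell^\top x_i^{(\ell-1)}), y_i^{(\ell)}\big), \tag{3}$$

where $\mathcal{L}(\cdot)$ is a loss function such as the squared $\ell_2$ norm (or Mean Squared Error (MSE)), cross-entropy, etc. The loss $\mathcal{L}_\ell$ tries to make the projected data $x_i^{(\ell)}$ as similar as possible to the backprojected data $y_i^{(\ell)}$ by tuning the weights $U_\ell$. This is because the output of the network is supposed to be equal to the labels, i.e., $x^{(n_\ell)} = y$. In order to tune the weights for Eq. (3), we use a step of gradient descent. Let $z_i^{(\ell)} := U_\ell^\top x_i^{(\ell-1)}$. Using the chain rule, the gradient is:

$$\mathbb{R}^{d_{\ell-1} \times d_\ell} \ni \frac{\partial \mathcal{L}_\ell}{\partial U_\ell} = \sum_{i=1}^{b} \text{vec}^{-1}_{d_{\ell-1} \times d_\ell}\bigg[ \Big(\frac{\partial z_i^{(\ell)}}{\partial U_\ell}\Big)^{\top} \Big(\frac{\partial \sigma_\ell(z_i^{(\ell)})}{\partial z_i^{(\ell)}}\Big)^{\top} \frac{\partial \mathcal{L}}{\partial \sigma_\ell(z_i^{(\ell)})} \bigg],$$

where we use the Magnus-Neudecker convention in which matrices are vectorized and $\text{vec}^{-1}_{d_{\ell-1} \times d_\ell}(\cdot)$ is de-vectorization to a $d_{\ell-1} \times d_\ell$ matrix. If the loss function is MSE or cross-entropy, for example, the derivatives of the loss function w.r.t. the activation function, respectively, are:

$$\frac{\partial \mathcal{L}}{\partial \sigma_\ell(z_i^{(\ell)})} = 2\big(\sigma_\ell(z_i^{(\ell)}) - y_i^{(\ell)}\big), \qquad \frac{\partial \mathcal{L}}{\partial \sigma_\ell(z_{i,j}^{(\ell)})} = -\frac{y_{i,j}^{(\ell)}}{\sigma_\ell(z_{i,j}^{(\ell)})}, \;\; \forall j \in \{1, \dots, d_\ell\},$$

where $y_{i,j}^{(\ell)}$ and $z_{i,j}^{(\ell)}$ are the $j$-th dimension of $y_i^{(\ell)}$ and $z_i^{(\ell)}$, respectively.
For the activation functions in which the nodes are independent, such as linear, sigmoid, and hyperbolic tangent, the derivative of the activation function w.r.t. its input is a diagonal matrix:

$$\mathbb{R}^{d_\ell \times d_\ell} \ni \frac{\partial \sigma_\ell(z_i^{(\ell)})}{\partial z_i^{(\ell)}} = \text{diag}\Big(\frac{\partial \sigma_\ell(z_{i,j}^{(\ell)})}{\partial z_{i,j}^{(\ell)}}\Big), \;\; j \in \{1, \dots, d_\ell\},$$

where $\text{diag}(\cdot)$ makes a matrix with its input as diagonal.
The derivative of the projected data before the activation function (i.e., the input of the activation function) w.r.t. the weights of the layer is:

$$\mathbb{R}^{d_\ell \times (d_\ell d_{\ell-1})} \ni \frac{\partial z_i^{(\ell)}}{\partial U_\ell} = I_{d_\ell} \otimes x_i^{(\ell-1)\top},$$

where $\otimes$ denotes the Kronecker product and $I_{d_\ell}$ is the $d_\ell \times d_\ell$ identity matrix.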
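The chain-rule gradient above can be checked numerically. The sketch below assumes a single sample, an MSE loss, and a tanh activation (all illustrative choices); it builds the three factors explicitly, with the Kronecker product and a column-major de-vectorization, and compares the result with a compact outer-product form and with central finite differences. The helper `numerical_grad` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_prev, d_l = 4, 3                               # illustrative layer sizes
U = rng.standard_normal((d_prev, d_l))           # weights of the layer
x_prev = rng.standard_normal(d_prev)             # projected data entering the layer
y_l = rng.uniform(-1.0, 1.0, size=d_l)           # backprojected label at the layer

def loss(U):
    """Squared-norm loss between the layer output and the backprojected label."""
    return np.sum((np.tanh(U.T @ x_prev) - y_l) ** 2)

z = U.T @ x_prev
dL_dsigma = 2.0 * (np.tanh(z) - y_l)             # derivative of the MSE loss
dsigma_dz = np.diag(1.0 - np.tanh(z) ** 2)       # diagonal derivative of tanh
dz_dU = np.kron(np.eye(d_l), x_prev[None, :])    # shape (d_l, d_l * d_prev)

# Vectorized gradient, then de-vectorization in column-major order.
grad = (dz_dU.T @ dsigma_dz.T @ dL_dsigma).reshape((d_prev, d_l), order="F")

# Equivalent compact form: outer product of the input and the back-signal.
grad_compact = np.outer(x_prev, dsigma_dz @ dL_dsigma)

def numerical_grad(f, U, eps=1e-6):
    """Central finite differences, one entry of U at a time."""
    G = np.zeros_like(U)
    for idx in np.ndindex(*U.shape):
        E = np.zeros_like(U)
        E[idx] = eps
        G[idx] = (f(U + E) - f(U - E)) / (2.0 * eps)
    return G

grad_numeric = numerical_grad(loss, U)
```

In practice the compact outer-product form is preferable, since it avoids materializing the Kronecker product.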
The procedure for updating the weights of the $\ell$-th layer is shown in Algorithm 1. Up to the layer $\ell$, the data are projected and passed through the activation functions layer by layer. Also, the label is backprojected and passed through the inverse activation functions back to the layer $\ell$. A step of gradient descent is used to update the layer's weights, where $\eta$ is the learning rate. Note that the backprojected label at a layer may not be in the feasible domain of its inverse activation function. Hence, at every layer, we should project the backprojected label onto the feasible domain [parikh2014proximal]. We denote projection onto the feasible set by $\Pi(\cdot)$.
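A minimal sketch of one such layer update is given below. It is not the paper's algorithm listing: it assumes sigmoid activations in every layer, an MSE loss, a clipping margin `EPS` as the projection onto the feasible set, and illustrative layer sizes and learning rate. All function and variable names are hypothetical.

```python
import numpy as np

EPS = 1e-3  # hypothetical margin keeping labels strictly inside (0, 1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_inv(a):
    return np.log(a / (1.0 - a))

def to_feasible(a):
    """Project onto the (shrunk) range of the sigmoid, where its inverse is finite."""
    return np.clip(a, EPS, 1.0 - EPS)

def inputs_and_targets(weights, l, X, Y):
    """Projected batch entering layer l and backprojected labels at layer l (1-indexed)."""
    A = X
    for U in weights[: l - 1]:                   # project through layers 1..l-1
        A = sigmoid(U.T @ A)
    T = to_feasible(Y)
    for k in range(len(weights), l, -1):         # backproject through layers n..l+1
        T = to_feasible(weights[k - 1] @ sigmoid_inv(T))
    return A, T

def layer_loss(U, A, T):
    return np.sum((sigmoid(U.T @ A) - T) ** 2)

def update_layer(weights, l, X, Y, lr=0.01):
    """One gradient-descent step on the weights of layer l only."""
    A, T = inputs_and_targets(weights, l, X, Y)
    S = sigmoid(weights[l - 1].T @ A)            # layer output, shape (d_l, b)
    delta = 2.0 * (S - T) * S * (1.0 - S)        # MSE derivative times sigmoid derivative
    grad = A @ delta.T                           # summed over the batch
    weights[l - 1] = weights[l - 1] - lr * grad
    return weights

rng = np.random.default_rng(0)
sizes = [4, 6, 5, 3]                             # illustrative layer sizes
weights = [0.5 * rng.standard_normal((sizes[k], sizes[k + 1])) for k in range(3)]
X = rng.standard_normal((4, 8))                  # column-wise batch of 8 samples
Y = np.eye(3)[:, rng.integers(0, 3, size=8)]     # one-hot labels as columns

A, T = inputs_and_targets(weights, 2, X, Y)
loss_before = layer_loss(weights[1], A, T)
weights = update_layer(weights, 2, X, Y)
loss_after = layer_loss(weights[1], A, T)
```

Only the second layer changes in this call, so the projected inputs `A` and backprojected targets `T` stay fixed and the two loss values are directly comparable.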
2.4 Different Procedures
So far, we explained how to update the weights of a layer. Here, we detail how to update all the layers of the network. In terms of the order of updating layers, we can have three different procedures for the backprojection algorithm. One possible procedure is to update the first layer first and move to the next layers one by one until we reach the last layer. Repeating this procedure for the batches results in the forward procedure. In the opposite direction, we can have the backward procedure where, for each batch, we update the layers from the last layer to the first layer one by one. If we have both directions of updating, i.e., a forward update for a batch and a backward update for the next batch, we call it the forward-backward procedure. Algorithm 2 shows how to update the layers in the different procedures of the backprojection algorithm. Note that in this algorithm, an updated layer impacts the update of the next/previous layer. One alternative approach is to make the update of each layer depend only on the weights tuned by the previous mini-batch. In that approach, the training of the layers can be parallelized within a mini-batch.
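The three procedures differ only in the order in which layers are visited for each batch, which can be sketched as below. The function name `layer_order`, the procedure strings, and the choice to start the forward-backward procedure with a forward pass on the first batch are assumptions of this sketch.

```python
def layer_order(procedure, n_layers, batch_index):
    """Return the 1-indexed order in which layers are updated for one batch."""
    forward = list(range(1, n_layers + 1))
    backward = forward[::-1]
    if procedure == "forward":
        return forward
    if procedure == "backward":
        return backward
    if procedure == "forward-backward":
        # Alternate direction from one batch to the next.
        return forward if batch_index % 2 == 0 else backward
    raise ValueError(f"unknown procedure: {procedure!r}")

# Order of layer updates over the first three batches of a three-layer network.
schedule = [layer_order("forward-backward", 3, b) for b in range(3)]
```

A training loop would call a layer-update routine once per entry of the returned list, so that each update sees the weights already changed earlier in the same batch.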
3 Kernel Backprojection Algorithm
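Suppose $\phi(\cdot)$ is the pulling function to the feature space. Let $t$ denote the dimensionality of the feature space, i.e., $\phi(x) \in \mathbb{R}^t$. Let the matrix-form of the training set $\mathcal{X}$ and its labels $\mathcal{Y}$ be denoted by $X := [x_1, \dots, x_n] \in \mathbb{R}^{d \times n}$ and $Y := [y_1, \dots, y_n] \in \mathbb{R}^{p \times n}$. The kernel matrix [hofmann2008kernel] for the training data is defined as $K := \Phi(X)^\top \Phi(X) \in \mathbb{R}^{n \times n}$, where $\Phi(X) := [\phi(x_1), \dots, \phi(x_n)] \in \mathbb{R}^{t \times n}$. We normalize the kernel matrix [ah2010normalized] as $K(i,j) := K(i,j) / \big(K(i,i)\, K(j,j)\big)^{1/2}$, where $K(i,j)$ denotes the $(i,j)$-th element of the kernel matrix.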
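The kernel normalization described above can be sketched in a few lines. The polynomial kernel, its parameters, and the data sizes are illustrative assumptions; any positive semidefinite kernel with a strictly positive diagonal could be substituted.

```python
import numpy as np

def polynomial_kernel(X, degree=2, c=1.0):
    """Kernel matrix over the columns of X (d x n); the kernel choice is illustrative."""
    return (X.T @ X + c) ** degree

def normalize_kernel(K):
    """Divide each entry K(i, j) by sqrt(K(i, i) * K(j, j))."""
    s = np.sqrt(np.diag(K))
    return K / np.outer(s, s)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 6))                  # 6 training points in 3 dimensions
K = normalize_kernel(polynomial_kernel(X))
```

After normalization the diagonal is all ones, so each entry can be read as a cosine similarity between two points in the feature space.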
According to representation theory [alperin1993local], the projection matrix of the first layer, $U_1 \in \mathbb{R}^{t \times d_1}$, can be expressed as a linear combination of the pulled training data $\Phi(X) \in \mathbb{R}^{t \times n}$. Hence, we have $U_1 = \Phi(X)\, \Theta$, where every column of $\Theta \in \mathbb{R}^{n \times d_1}$ is the vector of coefficients for expressing a projection direction as a linear combination of the pulled training data. The projection of the pulled data is $U_1^\top \Phi(X) = \Theta^\top \Phi(X)^\top \Phi(X) = \Theta^\top K$, where $K$ is the kernel matrix.
In the kernel backprojection algorithm, in the first network layer, we project the pulled data from the feature space with dimensionality $t$ to another feature space with dimensionality $d_1$. The projections of the next layers are the same as in backprojection. In other words, kernel backprojection applies backprojection in the feature space rather than the input space. In a mini-batch set-up, we use the columns of the normalized kernel corresponding to the batch samples, denoted by $\{k_i \in \mathbb{R}^n\}_{i=1}^b$. Therefore, the projection of the $i$-th data point in the batch is $\Theta^\top k_i \in \mathbb{R}^{d_1}$. In kernel backprojection, the dimensionality of the input is $n$ and the kernel vector $k_i$ is fed to the network as input. If we replace $x_i$ by $k_i$, Algorithms 1 and 2 are applicable for kernel backprojection.
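In the test phase, we normalize the kernel over the matrix $[X, x_t] \in \mathbb{R}^{d \times (n+1)}$, where $X$ is the training data matrix and $x_t$ is the test data point. Then, we take the portion of the normalized kernel which corresponds to the kernel over the training versus test data, denoted by $k_t \in \mathbb{R}^n$. The projection at the first layer is then $\Theta^\top k_t$, where $\Theta$ is the coefficient matrix of the first layer.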
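The test-phase computation can be sketched as follows: the kernel is computed over the training data augmented with the test point, normalized, and the training-versus-test column is extracted. The linear kernel, the function name `kernel_vector_for_test_point`, the sizes, and the random coefficient matrix `Theta` (standing in for trained first-layer coefficients) are assumptions made for illustration.

```python
import numpy as np

def linear_kernel(X):
    """Kernel matrix over the columns of X; the linear kernel is an illustrative choice."""
    return X.T @ X

def kernel_vector_for_test_point(X_train, x_test, kernel=linear_kernel):
    """Normalized kernel between every training point and one test point."""
    X_all = np.column_stack([X_train, x_test])   # d x (n + 1)
    K_all = kernel(X_all)
    s = np.sqrt(np.diag(K_all))
    K_all = K_all / np.outer(s, s)               # normalize over train and test together
    return K_all[:-1, -1]                        # training-versus-test part, length n

rng = np.random.default_rng(0)
d, n, d1 = 3, 6, 4
X_train = rng.standard_normal((d, n))
x_test = rng.standard_normal(d)
Theta = rng.standard_normal((n, d1))             # hypothetical first-layer coefficients

k_t = kernel_vector_for_test_point(X_train, x_test)
z_1 = Theta.T @ k_t                              # projection at the first layer
```

With the linear kernel, `k_t` reduces to the cosine similarity between the test point and each training point, which gives a simple way to sanity-check the routine.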
To make the task more difficult, we set different variances for the classes. The data were standardized as a preprocessing step. In this short conference paper, we limit ourselves to introducing this new approach and to small synthetic experiments. Validation on larger real-world datasets is ongoing for future publication.
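Neural Network Settings: We implemented a neural network with three layers whose number of neurons are where and for the binary and ternary classification, respectively. In different experiments, we used the MSE loss for the middle layers and the MSE or cross-entropy loss for the last layer. Moreover, we used the Exponential Linear Unit (ELU) [clevert2015fast] or linear functions as the activation functions of the middle layers, while sigmoid or hyperbolic tangent (tanh) was used for the last layer. The derivatives and inverses of these activation functions are as follows:

$$\begin{aligned}
&\text{ELU:} && \sigma(z) = \begin{cases} z & z > 0 \\ e^{z} - 1 & z \leq 0 \end{cases}, && \sigma'(z) = \begin{cases} 1 & z > 0 \\ e^{z} & z \leq 0 \end{cases}, && \sigma^{-1}(y) = \begin{cases} y & y > 0 \\ \ln(y + 1) & y \leq 0 \end{cases}, \\
&\text{Linear:} && \sigma(z) = z, && \sigma'(z) = 1, && \sigma^{-1}(y) = y, \\
&\text{Sigmoid:} && \sigma(z) = \frac{1}{1 + e^{-z}}, && \sigma'(z) = \sigma(z)\big(1 - \sigma(z)\big), && \sigma^{-1}(y) = \ln\Big(\frac{y}{1 - y}\Big), \\
&\text{Tanh:} && \sigma(z) = \tanh(z), && \sigma'(z) = 1 - \tanh^2(z), && \sigma^{-1}(y) = \frac{1}{2} \ln\Big(\frac{1 + y}{1 - y}\Big),
\end{aligned}$$

where in the inverse functions, we bound the output values for numerical reasons. Mostly, a learning rate of was used for backprojection and backpropagation and was used for kernel backprojection.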
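The activation functions named above, with their derivatives and bounded inverses, can be sketched in NumPy as below. This sketch assumes the ELU with unit scale parameter, and the margin `BOUND` used to keep the inverses finite is a hypothetical value; the text only states that the outputs are bounded, not how.

```python
import numpy as np

BOUND = 1e-6  # hypothetical margin used to keep the inverse functions finite

def elu(z):
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))

def elu_grad(z):
    return np.where(z > 0, 1.0, np.exp(np.minimum(z, 0.0)))

def elu_inv(a):
    a = np.maximum(a, -1.0 + BOUND)              # ELU outputs lie in (-1, inf)
    return np.where(a > 0, a, np.log1p(np.minimum(a, 0.0)))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def sigmoid_inv(a):
    a = np.clip(a, BOUND, 1.0 - BOUND)           # sigmoid outputs lie in (0, 1)
    return np.log(a / (1.0 - a))

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2

def tanh_inv(a):
    a = np.clip(a, -1.0 + BOUND, 1.0 - BOUND)    # tanh outputs lie in (-1, 1)
    return np.arctanh(a)

z = np.linspace(-3.0, 3.0, 7)                    # sample inputs for a round-trip check
```

Clipping before inverting is what makes one-hot labels (exact zeros and ones) usable as inputs to the inverse sigmoid or inverse tanh.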
Comparison of Procedures: The performance of the forward, backward, and forward-backward procedures in backprojection and kernel backprojection is illustrated in Fig. 1. In these experiments, the Radial Basis Function (RBF) kernel was used in kernel backprojection. Although the performances of these procedures are not identical, all of them discriminate the classes promisingly. This shows that all three proposed procedures work well for backprojection in the input and feature spaces. In other words, the algorithm is fairly robust to the order of updating layers.
Comparison to Backpropagation: The performances of backprojection, kernel backprojection, and backpropagation are compared in the binary and ternary classification, shown in Figs. 1 and 2, respectively. In Fig. 2, the linear kernel was used. In Fig. 1, most often, kernel backprojection considers a spherical class around the blue (or even red) class which is because of the choice of RBF kernel. Comparison to backpropagation in the two figures shows that backprojection’s performance nearly matches that of backpropagation.
In the different experiments, the mean time of every epoch was often 0.08, 0.11, and 0.2 seconds for backprojection, kernel backprojection, and backpropagation, respectively, where the number of epochs was fairly similar across the experiments. This shows that backprojection is faster than backpropagation. This is because backpropagation updates the weights one by one while backprojection updates them layer by layer.
5 Conclusion and Future Direction
In this paper, we proposed a new training algorithm for feedforward neural networks, named backprojection. The proposed algorithm, which can be used for both the input and feature spaces, tries to force the projected data to be similar to the backprojected labels by tuning the weights layer by layer. This training algorithm, which is moderately faster than backpropagation in our initial experiments, can be used with the forward, backward, or forward-backward procedure. It is noteworthy that adding a penalty term for weight decay [krogh1992simple] to Eq. (3) can regularize the weights in backprojection [ghojogh2019theory]. Moreover, batch normalization can be used in backprojection by standardizing the batch at the layers [ioffe2015batch]. This paper concentrated on feedforward neural networks. As a future direction, we can develop backprojection for other network structures such as convolutional networks [lecun1998gradient] and carry out more extensive validation experiments on real-world data.