Trainable back-propagated functional transfer matrices

10/28/2017 · by Cheng-Hao Cai, et al. · The University of Auckland · Microsoft · Griffith University

Connections between nodes of fully connected neural networks are usually represented by weight matrices. In this article, functional transfer matrices are introduced as alternatives to the weight matrices: Instead of using real weights, a functional transfer matrix uses real functions with trainable parameters to represent connections between nodes. Multiple functional transfer matrices are then stacked together with bias vectors and activations to form deep functional transfer neural networks. These neural networks can be trained within the framework of back-propagation, based on a revision of the delta rules and the error transmission rule for functional connections. In experiments, it is demonstrated that the revised rules can be used to train a range of functional connections: 20 different functions are applied to neural networks with up to 10 hidden layers, and most of them gain high test accuracies on the MNIST database. It is also demonstrated that a functional transfer matrix with a memory function can roughly memorise a non-cyclical sequence of 400 digits.

Code repository: Functional-Transfer-Neural-Networks



1 Introduction

Many neural networks use weights to represent connections between nodes. For instance, a fully connected deep neural network usually has a weight matrix in each hidden layer, and the weight matrix, a bias vector and a nonlinear activation are used to map input signals to output signals DBLP:journals/taslp/MohamedDH12. Much work has been done on combining neural networks with different activations, such as the logistic function DBLP:journals/jbi/DreiseitlO02, the rectifier DBLP:journals/jmlr/GlorotBB11, the maxout function DBLP:conf/nips/MontufarPCB14 and the long short-term memory blocks DBLP:journals/neco/HochreiterS97. In this work, we study neural networks from another viewpoint: Instead of using different activations, we replace the weights in the weight matrix with functional connections. In other words, the connections between nodes are represented by real functions rather than by real numbers. Specifically, this work focuses on:

  • Extending the back-propagation algorithm to the training of functional connections. The extended back-propagation algorithm includes rules for computing deltas of parameters in functional connections and rules for transmitting error signals from a layer to its previous layer. This algorithm is adapted from the standard back-propagation algorithm for fully connected feedforward neural networks DBLP:journals/nn/Hecht-Nielsen88a .

  • Discussing the meanings of some functionally connected structures. Different functional transfer matrices can construct different mathematical structures. Although it is not necessary for these structures to have meanings, some of them do have meanings in practice.

  • Discussing some practical training methods for (deep) functional transfer neural networks. Although the theory of back-propagation is applicable for these functional models, it is difficult to train them in practice. Also, the training becomes more difficult as the depth increases. In order to train them successfully, some tricks may be required.

  • Demonstrating that these functional transfer neural networks can work in practice. We apply these models to the MNIST hand-written digit dataset DBLP:journals/spm/Deng12 in order to demonstrate that different functional connections can work for classification tasks. We also try to make a model with a memory function memorise the circumference ratio π, in order to demonstrate that functional connections can be used as a memory block.

The rest of this paper is organised as follows: Section 2 provides a brief review of the standard deep feedforward neural networks. Section 3 introduces the theory of functional transfer neural networks, including functional transfer matrices and a back-propagation algorithm for functional connections. Section 4 provides some examples for explaining the meanings of functionally connected structures. Section 5 discusses training methods for practical use. Section 6 provides experimental results. Section 7 introduces related work. Section 8 concludes this work and discusses future work.

2 Background on deep feedforward neural networks

We assume that readers are familiar with deep feedforward neural networks and only mention concepts and formulae relevant to our research DBLP:journals/nn/Hecht-Nielsen88a: A feedforward neural network usually consists of an input layer, some hidden layers and an output layer. Two neighbouring layers are connected by a linear weight matrix W, a bias vector b and an activation f. Given an input vector x, the network maps it to an output vector y via:

y = f(Wx + b)    (1)

3 A theory of functional transfer neural networks

3.1 Functional transfer matrices

Instead of using weights, a functional transfer matrix uses trainable functions to represent connections between nodes. Formally, an m × n functional transfer matrix F is defined as:

F = [ f_{i,j} ]_{m×n},  i = 1, …, m,  j = 1, …, n    (2)

where each f_{i,j} is a trainable function. We also define a transfer computation "⊛" for the functional transfer matrix F and a column vector x = [x_1, …, x_n]^T such that:

F ⊛ x = [ Σ_{j=1}^{n} f_{1,j}(x_j),  Σ_{j=1}^{n} f_{2,j}(x_j),  …,  Σ_{j=1}^{n} f_{m,j}(x_j) ]^T    (3)

Suppose that b = [b_1, …, b_m]^T is a bias vector, σ is an activation, and x is an input vector. An output vector y can then be computed via

y = σ( F ⊛ x + b )    (4)

In other words, the ith element of the output vector is

y_i = σ( Σ_{j=1}^{n} f_{i,j}(x_j) + b_i )    (5)
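To make the transfer computation concrete, the following is a minimal NumPy sketch of Eqs. (3)–(5), not the authors' released code: a functional transfer matrix is stored as a grid of one-argument callables, and each output element sums the per-connection function values before the bias and activation are applied. The sine connections and their parameter values here are purely illustrative.

```python
import numpy as np

def transfer(F, x):
    """Transfer computation (Eq. (3)): the ith output element is sum_j f_ij(x_j)."""
    return np.array([sum(f(xj) for f, xj in zip(row, x)) for row in F])

def functional_layer(F, b, x, activation=np.tanh):
    """Eq. (4): y = sigma(F ⊛ x + b)."""
    return activation(transfer(F, x) + b)

def make_sine(a, w, p):
    # One functional connection f(x) = a * sin(w * x + p); parameters fixed here for illustration.
    return lambda x: a * np.sin(w * x + p)

F = [[make_sine(1.0, 2.0, 0.0), make_sine(0.5, 1.0, 0.3), make_sine(0.2, 3.0, 1.0)],
     [make_sine(0.8, 0.5, 0.0), make_sine(1.2, 2.5, 0.7), make_sine(0.1, 1.5, 0.2)]]
b = np.zeros(2)
x = np.array([0.4, -0.2, 0.9])
print(functional_layer(F, b, x))   # a 2-dimensional output vector
```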

3.2 Back-propagation

Multiple functional transfer matrices can be stacked together with bias vectors and activations to form a multi-layer model, and the model can be trained via back-propagation DBLP:journals/nn/Hecht-Nielsen88a: Suppose that a model has L hidden layers. For the lth hidden layer, F^{(l)} is a functional transfer matrix, its function f^{(l)}_{i,j}(x) is a real function with an independent variable x and trainable parameters w^{(l)}_{i,j,1}, …, w^{(l)}_{i,j,K}, b^{(l)} is a bias vector, σ^{(l)} is an activation, and x^{(l)} is an input signal. An output signal y^{(l)} is computed via:

y^{(l)} = σ^{(l)}( F^{(l)} ⊛ x^{(l)} + b^{(l)} )    (6)

The output signal is the input signal of the next layer. In other words, if l < L, then x^{(l+1)} = y^{(l)}. After an error signal is evaluated by an output layer and an error function (see Section 5.2), the parameters can be updated: Suppose that δ^{(l)} is an error signal of the output nodes of the lth layer, and η is a learning rate. Let e^{(l)} = δ^{(l)} ⊙ σ^{(l)′}( F^{(l)} ⊛ x^{(l)} + b^{(l)} ), where ⊙ denotes element-wise multiplication. Deltas of the parameters are computed via:

Δw^{(l)}_{i,j,k} = −η · e^{(l)}_i · ∂f^{(l)}_{i,j}(x^{(l)}_j) / ∂w^{(l)}_{i,j,k}    (7)

Deltas of the biases follow the conventional rule:

Δb^{(l)}_i = −η · e^{(l)}_i    (8)

An error signal of the (l − 1)th layer is computed via:

δ^{(l−1)}_j = Σ_{i} e^{(l)}_i · ∂f^{(l)}_{i,j}(x^{(l)}_j) / ∂x^{(l)}_j    (9)

Please note that the computation of δ^{(l−1)} can be simplified via a derivative matrix: let F′_x be the matrix in which each element is ∂f^{(l)}_{i,j}(x^{(l)}_j)/∂x^{(l)}_j; then δ^{(l−1)} is obtained by multiplying the transpose of the evaluated F′_x with e^{(l)}. Similarly, the computation of the partial derivatives in Eq. (7) can also be simplified: let F′_{w_k} be the matrix in which each element is ∂f^{(l)}_{i,j}(x^{(l)}_j)/∂w^{(l)}_{i,j,k}; the deltas Δw^{(l)}_{i,j,k} are then obtained by scaling the ith row of the evaluated F′_{w_k} by −η e^{(l)}_i. Both F′_x and F′_{w_k} can be computed symbolically before being evaluated on x^{(l)}, in the same manner as a transfer computation.
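As an unofficial illustration of Eqs. (7)–(9), the sketch below performs one update of a single functional layer and transmits the error signal to the previous layer. The connection family f_{i,j}(x) = W1_{i,j}·x + W2_{i,j}·x² is an illustrative choice made so that the partial derivatives are easy to read; it is not necessarily one of the functions used in the paper.

```python
import numpy as np

def forward(W1, W2, b, x):
    # Illustrative functional connections f_ij(x) = W1_ij * x + W2_ij * x^2
    Z = W1 * x + W2 * x**2          # element-wise values f_ij(x_j); x_j broadcast per column
    z = Z.sum(axis=1) + b           # transfer computation plus bias (pre-activation of Eq. (4))
    return np.tanh(z), z

def backward(W1, W2, b, x, delta, lr):
    """Delta rules (Eqs. (7)-(8)) and error transmission (Eq. (9)) for one layer."""
    _, z = forward(W1, W2, b, x)
    e = delta * (1.0 - np.tanh(z) ** 2)     # e = delta ⊙ σ'(z) for a tanh activation
    dW1 = np.outer(e, x)                    # e_i * ∂f_ij/∂W1_ij  with  ∂f_ij/∂W1_ij = x_j
    dW2 = np.outer(e, x ** 2)               # ∂f_ij/∂W2_ij = x_j^2
    prev_delta = (W1 + 2.0 * W2 * x).T @ e  # Eq. (9): sum over i of e_i * ∂f_ij/∂x_j
    W1 -= lr * dW1                          # Eq. (7), applied in place
    W2 -= lr * dW2
    b -= lr * e                             # Eq. (8)
    return prev_delta                       # error signal for the previous layer

# Tiny usage example with made-up shapes.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(scale=0.1, size=(3, 5)), rng.normal(scale=0.1, size=(3, 5))
b, x = np.zeros(3), rng.random(5)
delta = rng.normal(size=3)
prev_delta = backward(W1, W2, b, x, delta, lr=0.01)
```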

4 Examples

4.1 Periodic functions

Periodic functions can be applied to functional transfer matrices. For instance, a · sin(b · x + c) is a periodic function, where a, b and c are its amplitude, angular velocity and initial phase respectively. Thus, we can define a matrix consisting of the following function:

f_{i,j}(x) = a_{i,j} · sin( b_{i,j} · x + c_{i,j} )    (10)

where a_{i,j}, b_{i,j} and c_{i,j} are trainable parameters. If F is an m × n matrix of such functions, then it can transfer an n-dimensional vector x to an m-dimensional vector F ⊛ x. It is noticeable that each output element is a superposition of many periodic functions, and their amplitudes, angular velocities and initial phases can be updated via training. The training process is based on the following partial derivatives:

∂f_{i,j}(x) / ∂a_{i,j} = sin( b_{i,j} · x + c_{i,j} )    (11)
∂f_{i,j}(x) / ∂b_{i,j} = a_{i,j} · x · cos( b_{i,j} · x + c_{i,j} )    (12)
∂f_{i,j}(x) / ∂c_{i,j} = a_{i,j} · cos( b_{i,j} · x + c_{i,j} )    (13)
∂f_{i,j}(x) / ∂x = a_{i,j} · b_{i,j} · cos( b_{i,j} · x + c_{i,j} )    (14)

Similarly, this method can be extended to the cosine function.
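As a quick sanity check of Eqs. (11)–(14), the sketch below compares the analytic partial derivatives of a single sine connection with central finite differences; the parameter values are arbitrary and the function names are illustrative.

```python
import numpy as np

def f(a, b, c, x):
    # One sine connection f(x) = a * sin(b * x + c) as in Eq. (10).
    return a * np.sin(b * x + c)

def partials(a, b, c, x):
    # Analytic partials with respect to a, b, c and x (Eqs. (11)-(14)).
    return np.array([np.sin(b * x + c),
                     a * x * np.cos(b * x + c),
                     a * np.cos(b * x + c),
                     a * b * np.cos(b * x + c)])

def numeric_partials(a, b, c, x, eps=1e-6):
    p = np.array([a, b, c, x], dtype=float)
    grads = []
    for k in range(4):
        hi, lo = p.copy(), p.copy()
        hi[k] += eps
        lo[k] -= eps
        grads.append((f(*hi) - f(*lo)) / (2 * eps))   # central finite difference
    return np.array(grads)

a, b, c, x = 0.7, 2.3, 0.5, -1.1
print(np.allclose(partials(a, b, c, x), numeric_partials(a, b, c, x), atol=1e-5))  # True
```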

4.2 Modelling of conic hypersurfaces

A functional transfer neural network can represent decision boundaries constructed by ellipses and hyperbolas. Recall the mathematical definition of an ellipse: an ellipse with a centre (p, q) and semi-axes r_1 and r_2 can be defined as (x_1 − p)²/r_1² + (x_2 − q)²/r_2² = 1. This equation can be rewritten as λ(x_1 − p)²/r_1² + λ(x_2 − q)²/r_2² − λ = 0, where λ is a positive number. Let f_1(x_1) = λ(x_1 − p)²/r_1², f_2(x_2) = λ(x_2 − q)²/r_2², and b_1 = −λ. The equation becomes f_1(x_1) + f_2(x_2) + b_1 = 0. It is noticeable that f_1(x_1) + f_2(x_2) can be rewritten as [f_1  f_2] ⊛ [x_1  x_2]^T, which is a transfer computation (see Eq. (3)). Therefore, a model with an activation σ can be defined as h = σ( [f_1  f_2] ⊛ [x_1  x_2]^T + b_1 ), where x_1 and x_2 are inputs and h is an output. Based on the above structure, multiple elliptical decision boundaries can be formed. For instance:

[h_1, h_2, h_3]^T = σ( F ⊛ [x_1, x_2]^T + [b_1, b_2, b_3]^T )    (15)
F = [ f_{k,j} ]_{3×2},  f_{k,j}(x) = a_{k,j}² · (x − c_{k,j})²    (16)

and

y = σ( w_1 h_1 + w_2 h_2 + w_3 h_3 + b_4 )    (17)

In the above model, x_1 and x_2 are input nodes, h_1, h_2 and h_3 are hidden nodes, y is an output node, and σ is an activation. The input nodes and the hidden nodes are connected via a functional transfer matrix. The hidden nodes and the output node are connected via a weight matrix. The hidden nodes and the output node are activated via biases and the activation. Fig. 1 shows decision boundaries formed by this model: The functional transfer matrix and the hidden nodes form three elliptical boundaries (in orange, green and purple respectively). The weight matrix and the output node form a boundary which is the union of the interiors of the three ellipses. In addition, this figure shows some example inputs and outputs: If the inputs are inside the decision boundaries, the output is 1. Otherwise, the output is 0.

Figure 1: Decision boundaries formed by Eq. (15), Eq. (16) and Eq. (17).

The reason why the model can construct ellipse boundaries is that its functional transfer matrix consists of the following function:

f_{i,j}(x) = a_{i,j}² · (x − c_{i,j})²    (18)

where a_{i,j} and c_{i,j} are trainable parameters. If the input dimension of this matrix is 2, it will construct ellipse boundaries on a plane. If its input dimension is 3, it will construct ellipsoid boundaries in a 3-dimensional space. Generally, if its input dimension is n (n ≥ 2), it will construct boundaries represented by (n − 1)-dimensional closed ellipsoidal hypersurfaces in an n-dimensional space. This phenomenon reflects a significant difference between a transfer matrix with Eq. (18) and a standard linear weight matrix, as the former models closed hypersurfaces, while the latter models hyperplanes. In addition, to use the back-propagation algorithm to train the parameters, derivatives are computed via:

∂f_{i,j}(x) / ∂a_{i,j} = 2 a_{i,j} · (x − c_{i,j})²    (19)
∂f_{i,j}(x) / ∂c_{i,j} = −2 a_{i,j}² · (x − c_{i,j})    (20)
∂f_{i,j}(x) / ∂x = 2 a_{i,j}² · (x − c_{i,j})    (21)

Further, an adapted version of Eq. (18) can represent not only ellipses, but also hyperbolas. The adapted function is defined as:

f_{i,j}(x) = s_{i,j} · a_{i,j}² · (x − c_{i,j})²    (22)

where s_{i,j} is initialised as 1 or −1. Please note that s_{i,j} is a constant, but NOT a trainable parameter. Given a 1 × 2 matrix with this function:

F = [ f_{1,1}  f_{1,2} ]    (23)

If it is initialised by:

s_{1,1} = 1,  s_{1,2} = 1    (24)

then it can represent an ellipse boundary. On the other hand, if it is initialised by:

s_{1,1} = 1,  s_{1,2} = −1    (25)

then it can be used to form a hyperbola boundary. More generally, in an n-dimensional space, a functional transfer matrix with Eq. (22) can represent different (n − 1)-dimensional conic hypersurfaces.
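The sketch below illustrates this behaviour under the quadratic form of Eq. (22) as written above: with both sign constants set to 1, the pre-activation of a single hidden node changes sign on an ellipse, while a mixed sign choice yields a hyperbola. The centres, semi-axes, bias and test points are made up for illustration.

```python
import numpy as np

def conic_node(x, a, c, s, bias):
    # Pre-activation of one hidden node with connections f_j(x) = s_j * a_j^2 * (x_j - c_j)^2
    return np.sum(s * a**2 * (x - c)**2) + bias

a = np.array([1.0 / 2.0, 1.0 / 1.5])      # reciprocal semi-axes (illustrative)
c = np.array([0.0, 0.0])                   # centre
bias = -1.0                                # boundary: sum_j s_j * a_j^2 * (x_j - c_j)^2 = 1

ellipse = np.array([1.0, 1.0])             # Eq. (24): both sign constants positive
hyperbola = np.array([1.0, -1.0])          # Eq. (25): mixed signs

for name, s in [("ellipse", ellipse), ("hyperbola", hyperbola)]:
    inside = conic_node(np.array([0.1, 0.1]), a, c, s, bias)    # near the centre
    outside = conic_node(np.array([4.0, 0.1]), a, c, s, bias)   # far along the first axis
    print(name, np.sign(inside), np.sign(outside))              # the sign flips across the boundary
```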

4.3 Sleeping weights and dead weights

In a standard neural network, connections between nodes usually do not “sleep”. The reason is that the connection from Node j to Node i is a real weight w_{i,j}. Let x_j denote a state of Node j. Node i will receive a signal w_{i,j} · x_j. When w_{i,j} ≠ 0 and x_j ≠ 0, the signal always influences Node i, regardless of whether or not Node i needs it. To enable the connections to “sleep” temporarily, the following function is used in a functional transfer matrix:

f_{i,j}(x) = s_{i,j} · relu( a_{i,j} · x + b_{i,j} )    (26)

In the above function, a_{i,j} and b_{i,j} are trainable parameters, and s_{i,j} is a constant which is initialised as 1 or −1 before training. It also makes use of the rectifier DBLP:journals/jmlr/GlorotBB11:

relu(x) = max(x, 0)    (27)

The rectifier enables the function to “sleep”: when a_{i,j} · x + b_{i,j} ≤ 0, f_{i,j}(x) must be zero. To use the back-propagation algorithm to train the parameters, derivatives are computed via:

∂f_{i,j}(x) / ∂a_{i,j} = s_{i,j} · x · relu′( a_{i,j} · x + b_{i,j} )    (28)
∂f_{i,j}(x) / ∂b_{i,j} = s_{i,j} · relu′( a_{i,j} · x + b_{i,j} )    (29)
∂f_{i,j}(x) / ∂x = s_{i,j} · a_{i,j} · relu′( a_{i,j} · x + b_{i,j} )    (30)
where relu′(z) is 1 when z > 0 and 0 otherwise.

Another function enables a connection between two nodes to “die”. The word “die” is different from the word “sleep”, as the former means that f_{i,j}(x) = 0 for all x, and the latter means that f_{i,j}(x) = 0 for some x. If the connection from Node j to Node i is dead, then any change of Node j does not influence Node i. This function is:

f_{i,j}(x) = s_{i,j} · relu( a_{i,j} ) · x    (31)

In the above function, a_{i,j} is a trainable parameter, and s_{i,j} is a constant which is initialised as 1 or −1 before training. To use the back-propagation algorithm to train the parameters, derivatives are computed via:

∂f_{i,j}(x) / ∂a_{i,j} = s_{i,j} · relu′( a_{i,j} ) · x    (32)
∂f_{i,j}(x) / ∂x = s_{i,j} · relu( a_{i,j} )    (33)

It is noticeable that f_{i,j}(x), ∂f_{i,j}(x)/∂a_{i,j} and ∂f_{i,j}(x)/∂x are all zero when a_{i,j} < 0, which means that the function does not transfer any signal and cannot be updated in this case. In other words, the function is dead once a_{i,j} is updated to a negative value. A matrix with this function can then represent a partially connected neural network, as dead functions can be considered as broken connections after training. A problem with the function is that it is not able to “revive”. In other words, once a_{i,j} is updated to a negative value, it has no chance to become non-negative again. To solve this problem, an adapted version of the rectifier is used:

relu*(x) = max(x, 0)    (34)

For the computation of ∂f_{i,j}(x)/∂x, the derivative of relu* is:

relu*′(x) = 1 when x > 0, and 0 otherwise    (35)

For the computation of ∂f_{i,j}(x)/∂a_{i,j}, however, the derivative of relu* is arbitrarily defined as:

relu*′(x) := 1    (36)

We use “:=” instead of “=” because this derivative is not a mathematically sound result, but a predefined value. Thus, derivatives of f_{i,j}(x) = s_{i,j} · relu*( a_{i,j} ) · x are computed via:

∂f_{i,j}(x) / ∂a_{i,j} := s_{i,j} · x    (37)
∂f_{i,j}(x) / ∂x = s_{i,j} · relu*( a_{i,j} )    (38)

It is noticeable that ∂f_{i,j}(x)/∂a_{i,j} is not zero when x is not zero, which means that a_{i,j} can be updated when it is negative. Thus, although the function is dead when a_{i,j} is negative, it can be revived by updating a_{i,j} to a non-negative value.
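The following sketch is an unofficial illustration of the two gating rules described above: with the exact derivative of the rectifier, the gate parameter receives no gradient once it becomes negative, whereas with the arbitrarily defined derivative of Eq. (36) it keeps receiving gradient and the connection can be revived. The numeric values are made up.

```python
import numpy as np

def relu(z):
    return max(z, 0.0)

def dead_connection_grads(a, s, x):
    # f(x) = s * relu(a) * x with the exact rectifier derivative (Eqs. (32)-(33)).
    d_a = s * (1.0 if a > 0 else 0.0) * x   # zero once a < 0: the connection cannot revive
    d_x = s * relu(a)
    return d_a, d_x

def revivable_connection_grads(a, s, x):
    # Same function, but d f/d a uses the predefined derivative of Eq. (36) (Eq. (37)).
    d_a = s * x                             # non-zero whenever x != 0, so a can recover
    d_x = s * relu(a)
    return d_a, d_x

a, s, x, lr, error_signal = -0.3, 1.0, 0.8, 0.5, -1.0   # illustrative values
for name, grads in [("dead", dead_connection_grads), ("revivable", revivable_connection_grads)]:
    a_t = a
    for _ in range(5):                      # a few gradient steps driven by the error signal
        d_a, _ = grads(a_t, s, x)
        a_t -= lr * error_signal * d_a
    print(name, round(a_t, 3))              # "dead" stays at -0.3, "revivable" becomes positive
```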

4.4 Sequential Modelling via Memory Functions

A functional transfer matrix with memory functions can be used to model sequences. For instance, given a sequential input x^{(1)}, x^{(2)}, …, x^{(T)}, the tth state (1 ≤ t ≤ T) of a memory function is computed via:

(39)
(40)

where the three coefficients in Eqs. (39) and (40) are trainable parameters. In particular, each connection has a memory cell whose initial state is set to 0. To use the back-propagation algorithm to train the parameters, derivatives are computed via:

(41)
(42)
(43)
(44)

An m × n functional transfer matrix with the memory function can record m · n signals from a previous state, as each memory function records one signal. On the other hand, a standard recurrent neural network usually uses hidden units to record signals from the previous state. If it has n input units and m hidden units, then it can record m signals. It is noticeable that the functional transfer matrix with the memory function records n times more signals than the recurrent neural network.
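The sketch below uses one simple hypothetical memory-style connection — mixing the current input with a stored copy of the previous input — purely to illustrate how a per-connection memory cell lets an m × n matrix hold m · n values of the previous state; the memory function actually used in Eqs. (39) and (40) may differ.

```python
import numpy as np

class MemoryConnection:
    """A hypothetical memory-style connection; the exact form in Eqs. (39)-(40) may differ.

    f(x_t) = a * x_t + b * m_{t-1}, with memory cell m_t = c * x_t and m_0 = 0."""
    def __init__(self, a, b, c):
        self.a, self.b, self.c = a, b, c
        self.m = 0.0                          # memory cell, initial state 0
    def __call__(self, x_t):
        out = self.a * x_t + self.b * self.m
        self.m = self.c * x_t                 # store a scaled copy of the current input
        return out

# An m-by-n grid of such connections keeps m*n scalars of the previous time step.
m, n = 3, 4
rng = np.random.default_rng(0)
F = [[MemoryConnection(*rng.normal(scale=0.1, size=3)) for _ in range(n)] for _ in range(m)]
for x in (np.array([1.0, 2.0, 3.0, 4.0]), np.array([0.5, 0.5, 0.5, 0.5])):
    y = np.array([sum(f(xj) for f, xj in zip(row, x)) for row in F])   # transfer computation
    print(y)
```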

5 Practical training methods for functional transfer neural networks

This section discusses some practical training methods for functional transfer neural networks. These methods are also used in the experiments (Section 6).

5.1 The model structure and initialisation

Figure 2: The general structure of functional transfer neural networks.

Fig. 2 shows the general structure of functional transfer neural networks: They usually have an input layer, one or more hidden layers and an output layer. Each hidden layer consists of a functional transfer matrix and some hidden units with a bias vector and an activation function. In particular, the activation function can be the logistic sigmoid function DBLP:journals/jmlr/GlorotBB11:

σ(x) = 1 / (1 + e^{−x})    (45)

the rectified linear unit (ReLU)

relu(x) = max(x, 0)    (46)

or the hyperbolic tangent (tanh) function

tanh(x) = (e^{x} − e^{−x}) / (e^{x} + e^{−x})    (47)

The output layer consists of a linear weight matrix and output units with a bias vector and a softmax function:

softmax(z)_i = e^{z_i} / Σ_{j} e^{z_j}    (48)

After setting up a model based on the above structure, parameters are initialised based on the following method:

  • Weights in the linear weight matrix are randomised as small real numbers.

  • Biases in the bias vectors are set to zero.

  • For different functional transfer matrices, different initialisation methods should be used. In the experimental section, some examples of initialisation are provided (see Table 1 and Table 2); a minimal initialisation sketch is given below.
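The following sketch follows the three points above; the scales and the sine-specific choices are illustrative rather than the exact settings of Table 1 and Table 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_output_layer(n_hidden, n_classes, scale=0.01):
    # Linear weights randomised as small real numbers; biases set to zero.
    return rng.normal(scale=scale, size=(n_classes, n_hidden)), np.zeros(n_classes)

def init_sine_layer(n_out, n_in):
    # One possible functional-matrix initialisation, here for sine connections
    # f_ij(x) = a_ij * sin(b_ij * x + c_ij): small random amplitudes and angular
    # velocities, zero initial phases, zero biases (an illustrative choice).
    amplitudes = rng.normal(scale=0.1, size=(n_out, n_in))
    angular_velocities = rng.normal(scale=0.1, size=(n_out, n_in))
    phases = np.zeros((n_out, n_in))
    biases = np.zeros(n_out)
    return amplitudes, angular_velocities, phases, biases

W_out, b_out = init_output_layer(n_hidden=128, n_classes=10)
hidden_params = init_sine_layer(n_out=128, n_in=784)
```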

5.2 Training a single hidden layer

If a model only has one hidden layer, the training can be done via back-propagation directly rumelhart1988learning: Firstly, an input signal is propagated to the output layer, and an output signal is generated. Then an error signal is computed by comparing the output signal with a target signal. The comparison is based on the cross-entropy criterion hinton2012deep. Next, the error signal is back-propagated through the output layer and the hidden layer, and deltas of parameters are computed. For the output layer, the standard delta rules are used. For the hidden layer, the rules described in Section 3.2 are used. Finally, the parameters are updated according to their deltas. The above process can be combined with stochastic gradient descent DBLP:conf/nips/ZinkevichWSL10.
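An unofficial end-to-end sketch of this procedure for a single hidden layer with sine connections, a linear softmax output layer, the cross-entropy criterion and plain stochastic gradient descent; all shapes, scales and hyper-parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes, lr = 784, 64, 10, 0.1

# Hypothetical sine connections f_ij(x) = A_ij * sin(B_ij * x + C_ij) in the hidden layer.
A = rng.normal(scale=0.1, size=(n_hidden, n_in))
B = rng.normal(scale=0.1, size=(n_hidden, n_in))
C = np.zeros((n_hidden, n_in))
b_h = np.zeros(n_hidden)
W = rng.normal(scale=0.01, size=(n_classes, n_hidden))   # output layer: linear weights
b_o = np.zeros(n_classes)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(x, label):
    global A, B, C, b_h, W, b_o
    # Forward: functional hidden layer (Eq. (6)), then linear softmax output layer.
    phase = B * x + C                          # x_j broadcast across each column j
    z_h = (A * np.sin(phase)).sum(axis=1) + b_h
    h = 1.0 / (1.0 + np.exp(-z_h))             # logistic activation
    p = softmax(W @ h + b_o)
    # Backward: cross-entropy error signal and standard delta rules for the output layer.
    d_o = p.copy()
    d_o[label] -= 1.0
    delta_h = W.T @ d_o                        # error transmitted to the hidden layer
    W -= lr * np.outer(d_o, h)
    b_o -= lr * d_o
    # Functional delta rules (Section 3.2) for the sine connections.
    e = delta_h * h * (1.0 - h)
    sin_p, cos_p = np.sin(phase), np.cos(phase)
    dA = e[:, None] * sin_p
    dB = e[:, None] * A * x * cos_p
    dC = e[:, None] * A * cos_p
    A -= lr * dA
    B -= lr * dB
    C -= lr * dC
    b_h -= lr * e
    return -np.log(p[label] + 1e-12)           # cross-entropy loss for monitoring

loss = train_step(rng.random(n_in), label=3)
```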

5.3 Layer-wise supervised training and fine-tuning

Figure 3: Layer-wise supervised training and fine-tuning. Functional connections are denoted by solid lines, and linear weight connections are denoted by dashed lines.

Training a functional transfer neural network becomes more difficult as the number of hidden layers increases. To resolve this problem, the combination of layer-wise supervised training (for details about layer-wise supervised training, please also refer to Algorithm 7 in the appendix of the referenced paper) and fine-tuning is used DBLP:conf/nips/BengioLPL06, as shown in Fig. 3. Layer-wise training includes the following steps: Firstly, the first hidden layer is trained, while the other hidden layers and the output layer are ignored. To train this layer, a new softmax layer (with a linear weight matrix and a bias vector) is added onto it, and the method for training a single hidden layer (described in Section 5.2) is used to train it. Then the softmax layer is removed, and the second hidden layer is added onto the first hidden layer. To train the second hidden layer, another new softmax layer is added onto it, and the same training method is used again. Please note that the first hidden layer is not updated in this step. Finally, all remaining hidden layers are trained by using the same method. In particular, the trained softmax layer on the last hidden layer is considered as the output layer of the whole neural network. After layer-wise training, the whole neural network can be further optimised via fine-tuning: For the output layer, the standard back-propagation rules are used; for the hidden layers, the rules described in Section 3.2 are used. In practice, learning rates for fine-tuning are smaller than those for layer-wise training.
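A schematic sketch of the layer-wise procedure and the subsequent fine-tuning. The FunctionalLayer below uses an illustrative quadratic connection (not necessarily one of the paper's functions), and the toy data stands in for MNIST; the point is the control flow: each new hidden layer is trained with a fresh softmax head while lower layers stay frozen, the last head becomes the output layer, and fine-tuning then updates everything with a smaller learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

class FunctionalLayer:
    """Hidden layer with illustrative connections f_ij(x) = W1_ij*x + W2_ij*x**2 and tanh."""
    def __init__(self, n_in, n_out):
        self.W1 = rng.normal(scale=0.1, size=(n_out, n_in))
        self.W2 = rng.normal(scale=0.1, size=(n_out, n_in))
        self.b = np.zeros(n_out)
    def forward(self, x):
        self.x = x
        self.z = (self.W1 * x + self.W2 * x**2).sum(axis=1) + self.b
        return np.tanh(self.z)
    def backward(self, delta, lr):
        e = delta * (1 - np.tanh(self.z) ** 2)
        prev_delta = (self.W1 + 2 * self.W2 * self.x).T @ e    # Eq. (9)
        self.W1 -= lr * np.outer(e, self.x)                    # Eq. (7)
        self.W2 -= lr * np.outer(e, self.x ** 2)
        self.b -= lr * e                                       # Eq. (8)
        return prev_delta

class SoftmaxHead:
    """Linear weight matrix plus softmax output layer (standard delta rules)."""
    def __init__(self, n_in, n_classes):
        self.W = rng.normal(scale=0.01, size=(n_classes, n_in))
        self.b = np.zeros(n_classes)
    def forward(self, h):
        self.h = h
        z = self.W @ h + self.b
        p = np.exp(z - z.max())
        return p / p.sum()
    def backward(self, p, label, lr):
        d = p.copy()
        d[label] -= 1.0                        # cross-entropy error signal
        delta_h = self.W.T @ d
        self.W -= lr * np.outer(d, self.h)
        self.b -= lr * d
        return delta_h

def train_pass(layers, head, data, lr, update_last_layer_only=True):
    """One supervised pass through `layers` with `head` on top."""
    for x, label in data:
        h = x
        for layer in layers:
            h = layer.forward(h)
        delta = head.backward(head.forward(h), label, lr)
        for layer in reversed(layers):
            if update_last_layer_only and layer is not layers[-1]:
                break                          # lower layers stay frozen during layer-wise training
            delta = layer.backward(delta, lr)

# Layer-wise training: add one hidden layer at a time, each with a fresh softmax head.
data = [(rng.random(20), int(rng.integers(10))) for _ in range(32)]   # toy stand-in for MNIST
stack, head = [], None
for n_in, n_out in [(20, 16), (16, 16), (16, 16)]:
    stack.append(FunctionalLayer(n_in, n_out))
    head = SoftmaxHead(n_out, 10)              # the last head becomes the output layer
    train_pass(stack, head, data, lr=0.05)

# Fine-tuning: update every layer with a smaller learning rate.
train_pass(stack, head, data, lr=0.005, update_last_layer_only=False)
```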

6 Experiments

6.1 MNIST handwritten digit recognition

6.1.1 Experimental settings

Table 1: Example functions (F01–F20) used to model functional networks, listing each function's ID, definition and trainable parameter(s).

Table 1 provides 20 example functions which are used to construct functional transfer neural networks. The function f_{i,j} in the functional transfer matrix (see Eq. (2)) is instantiated with these example functions. In particular, a, b and c are used to denote trainable parameters instead of the subscripted notation w^{(l)}_{i,j,k}. These functions include all functions discussed in Section 4, except the memory function in Section 4.4: F06, F07, and F08 have been discussed in Section