Connections between nodes of fully connected neural networks are usually represented by weight matrices. In this article, functional transfer matrices are introduced as alternatives to the weight matrices: Instead of using real weights, a functional transfer matrix uses real functions with trainable parameters to represent connections between nodes. Multiple functional transfer matrices are then stacked together with bias vectors and activations to form deep functional transfer neural networks. These neural networks can be trained within the framework of back-propagation, based on a revision of the delta rules and the error transmission rule for functional connections. In experiments, it is demonstrated that the revised rules can be used to train a range of functional connections: 20 different functions are applied to neural networks with up to 10 hidden layers, and most of them gain high test accuracies on the MNIST database. It is also demonstrated that a functional transfer matrix with a memory function can roughly memorise a non-cyclical sequence of 400 digits.
Many neural networks use weights to represent connections between nodes. For instance, a fully connected deep neural network usually has a weight matrix in each hidden layer, and the weight matrix, a bias vector and a nonlinear activation are used to map input signals to output signals DBLP:journals/taslp/MohamedDH12 . Much work has been done on combining neural networks with different activations, such as the logistic function DBLP:journals/jbi/DreiseitlO02 , the rectifier DBLP:journals/jmlr/GlorotBB11 , the maxout function DBLP:conf/nips/MontufarPCB14 and the long short-term memory blocks DBLP:journals/neco/HochreiterS97 . In this work, we study neural networks from another viewpoint: Instead of using different activations, we replace the weights in the weight matrix with functional connections. In other words, the connections between nodes are represented by real functions rather than real numbers. Specifically, this work focuses on:
Extending the back-propagation algorithm to the training of functional connections. The extended back-propagation algorithm includes rules for computing deltas of parameters in functional connections and rules for transmitting error signals from a layer to its previous layer. This algorithm is adapted from the standard back-propagation algorithm for fully connected feedforward neural networks DBLP:journals/nn/Hecht-Nielsen88a .
Discussing the meanings of some functionally connected structures. Different functional transfer matrices can construct different mathematical structures. Although it is not necessary for these structures to have meanings, some of them do have meanings in practice.
Discussing some practical training methods for (deep) functional transfer neural networks. Although the theory of back-propagation is applicable for these functional models, it is difficult to train them in practice. Also, the training becomes more difficult as the depth increases. In order to train them successfully, some tricks may be required.
Demonstrating that these functional transfer neural networks can work in practice. We apply these models to the MNIST hand-written digit dataset DBLP:journals/spm/Deng12 , in order to demonstrate that different functional connections can work for classification tasks. Also, we try to make a model with a memory function memorise the circumference ratio π, in order to demonstrate that functional connections can be used as a memory block.
The rest of this paper is organised as follows: Section 2 provides a brief review of the standard deep feedforward neural networks. Section 3 introduces the theory of functional transfer neural networks, including functional transfer matrices and a back-propagation algorithm for functional connections. Section 4 provides some examples for explaining the meanings of functionally connected structures. Section 5 discusses training methods for practical use. Section 6 provides experimental results. Section 7 introduces related work. Section 8 concludes this work and discusses future work.
We assume that readers are familiar with deep feedforward neural networks and only mention concepts and formulae relevant to our research DBLP:journals/nn/Hecht-Nielsen88a : A feedforward neural network usually consists of an input layer, some hidden layers and an output layer. Two neighbouring layers are connected by a linear weight matrix $W$, a bias vector $\mathbf{b}$ and an activation $g$. Given an input vector $\mathbf{x}$, the neural network can map it to another vector $\mathbf{y}$ via:
$$\mathbf{y} = g(W\mathbf{x} + \mathbf{b}).$$
Instead of using weights, a functional transfer matrix uses trainable functions to represent connections between nodes. Formally, it is defined as:
$$F = \begin{pmatrix} f_{11} & \cdots & f_{1n} \\ \vdots & \ddots & \vdots \\ f_{m1} & \cdots & f_{mn} \end{pmatrix},$$
where each $f_{ij}$ is a trainable function. We also define a transfer computation “$\circledast$” for the functional transfer matrix $F$ and a column vector $\mathbf{x} = (x_1, \ldots, x_n)^{\top}$ such that:
$$(F \circledast \mathbf{x})_i = \sum_{j=1}^{n} f_{ij}(x_j).$$
Suppose that $\mathbf{b}$ is a bias vector, $g$ is an activation, and $\mathbf{x}$ is an input vector. An output vector $\mathbf{y}$ can then be computed via
$$\mathbf{y} = g(F \circledast \mathbf{x} + \mathbf{b}).$$
In other words, the $i$th element of the output vector is
$$y_i = g\Big(\sum_{j=1}^{n} f_{ij}(x_j) + b_i\Big).$$
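To make the definitions above concrete, the transfer computation and the layer mapping can be sketched in plain Python. This is an illustrative sketch, not code from the paper; the quadratic connections `f(x) = w * x**2` are an arbitrary choice of trainable function:

```python
import math

def transfer(F, x):
    """Transfer computation: the i-th output is sum_j F[i][j](x[j])."""
    return [sum(f(xj) for f, xj in zip(row, x)) for row in F]

def layer(F, b, g, x):
    """One functional transfer layer: y_i = g((F (*) x)_i + b_i)."""
    return [g(z + bi) for z, bi in zip(transfer(F, x), b)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A 2x3 matrix of quadratic connections f(x) = w * x**2 (arbitrary example).
weights = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
F = [[(lambda t, w=w: w * t * t) for w in row] for row in weights]
y = layer(F, [0.0, -1.0], sigmoid, [1.0, 1.0, 1.0])
```

With the all-ones input, the first row sums to 6 and the second to 1.5, so the output is the sigmoid of 6 and of 0.5 respectively.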
Multiple functional transfer matrices can be stacked together with bias vectors and activations to form a multi-layer model, and the model can be trained via back-propagation DBLP:journals/nn/Hecht-Nielsen88a : Suppose that a model has $L$ hidden layers. For the $l$th hidden layer, $F^{(l)}$ is a functional transfer matrix, its function $f^{(l)}_{ij}$ is a real function with an independent variable $x$ and trainable parameters $w^{(l)}_{ij,1}, \ldots, w^{(l)}_{ij,K}$, $\mathbf{b}^{(l)}$ is a bias vector, $g^{(l)}$ is an activation, and $\mathbf{x}^{(l)}$ is an input signal. An output signal $\mathbf{y}^{(l)}$ is computed via:
$$\mathbf{y}^{(l)} = g^{(l)}\big(F^{(l)} \circledast \mathbf{x}^{(l)} + \mathbf{b}^{(l)}\big),$$
where “$\circledast$” denotes the transfer computation. The output signal is the input signal of the next layer. In other words, if $l < L$, then $\mathbf{x}^{(l+1)} = \mathbf{y}^{(l)}$. After an error signal is evaluated by an output layer and an error function (see Section 5.2), the parameters can be updated: Suppose that $\mathbf{e}^{(l)}$ is an error signal of the output nodes of the $l$th layer, and $\eta$ is a learning rate. Let $\delta^{(l)}_i = e^{(l)}_i\, g^{(l)\prime}(z^{(l)}_i)$, where $z^{(l)}_i = \sum_j f^{(l)}_{ij}(x^{(l)}_j) + b^{(l)}_i$. Deltas of the parameters are computed via:
$$\Delta w^{(l)}_{ij,k} = -\eta\, \delta^{(l)}_i\, \frac{\partial f^{(l)}_{ij}(x^{(l)}_j)}{\partial w^{(l)}_{ij,k}}.$$
Deltas of the biases are the same as the conventional rule:
$$\Delta b^{(l)}_i = -\eta\, \delta^{(l)}_i.$$
An error signal of the $(l-1)$th layer is computed via:
$$e^{(l-1)}_j = \sum_i \delta^{(l)}_i\, \frac{\partial f^{(l)}_{ij}(x^{(l)}_j)}{\partial x^{(l)}_j}.$$
Please note that the computation of $\mathbf{e}^{(l-1)}$ can be simplified: Let $D^{(l)}$ be a derivative matrix in which each element is $D^{(l)}_{ij} = \partial f^{(l)}_{ij}(x^{(l)}_j)/\partial x^{(l)}_j$. Then the computation can be done via $\mathbf{e}^{(l-1)} = (D^{(l)})^{\top} \boldsymbol{\delta}^{(l)}$. Similarly, the computation of $\Delta w^{(l)}_{ij,k}$ in Eq. (7) can also be simplified: Let $D^{(l)}_k$ be a derivative matrix in which each element is $\partial f^{(l)}_{ij}(x^{(l)}_j)/\partial w^{(l)}_{ij,k}$, so that $\Delta w^{(l)}_{ij,k} = -\eta\, \delta^{(l)}_i\, (D^{(l)}_k)_{ij}$. Both derivative matrices can be computed symbolically before being evaluated.
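As a worked illustration of these rules (a sketch under the assumption of quadratic connections `f_ij(x) = w_ij * x**2` and a sigmoid activation; not the authors' code), one training step for a single layer can be written as:

```python
import math

def backward(W, x, b, e, lr):
    """One layer of quadratic connections f_ij(x) = W[i][j] * x**2 with a
    sigmoid activation: apply the revised delta rules and return the
    updated parameters plus the error transmitted to the previous layer."""
    z = [sum(w * xj * xj for w, xj in zip(row, x)) + bi
         for row, bi in zip(W, b)]
    y = [1.0 / (1.0 + math.exp(-zi)) for zi in z]
    delta = [ei * yi * (1.0 - yi) for ei, yi in zip(e, y)]  # e_i * g'(z_i)
    # Delta rule: dE/dw_ij = delta_i * df_ij/dw_ij = delta_i * x_j**2.
    W2 = [[w - lr * di * xj * xj for w, xj in zip(row, x)]
          for row, di in zip(W, delta)]
    b2 = [bi - lr * di for bi, di in zip(b, delta)]
    # Error transmission: e'_j = sum_i delta_i * df_ij/dx_j
    #                          = sum_i delta_i * 2 * W[i][j] * x_j.
    e_prev = [sum(di * 2.0 * W[i][j] * x[j] for i, di in enumerate(delta))
              for j in range(len(x))]
    return W2, b2, e_prev
```

Note that a connection whose parameter is zero transmits no error backwards, exactly as the derivative matrix formulation predicts.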
Periodic functions can be applied to functional transfer matrices. For instance, $f(x) = a \sin(\omega x + \varphi)$ is a periodic function, where $a$, $\omega$ and $\varphi$ are its amplitude, angular velocity and initial phase respectively. Thus, we can define a matrix consisting of the following function:
$$f_{ij}(x) = a_{ij} \sin(\omega_{ij} x + \varphi_{ij}),$$
where $a_{ij}$, $\omega_{ij}$ and $\varphi_{ij}$ are trainable parameters. If it is an $m \times n$ matrix, then it can transfer an $n$-dimensional vector to an $m$-dimensional vector such that $y_i = g\big(\sum_j a_{ij}\sin(\omega_{ij} x_j + \varphi_{ij}) + b_i\big)$. It is noticeable that each $y_i$ is a superposition of many periodic functions, and their amplitudes, angular velocities and initial phases can be updated via training. The training process is based on the following partial derivatives:
$$\frac{\partial f_{ij}}{\partial a_{ij}} = \sin(\omega_{ij} x + \varphi_{ij}), \quad \frac{\partial f_{ij}}{\partial \omega_{ij}} = a_{ij}\, x \cos(\omega_{ij} x + \varphi_{ij}), \quad \frac{\partial f_{ij}}{\partial \varphi_{ij}} = a_{ij} \cos(\omega_{ij} x + \varphi_{ij}), \quad \frac{\partial f_{ij}}{\partial x} = a_{ij}\, \omega_{ij} \cos(\omega_{ij} x + \varphi_{ij}).$$
Similarly, this method can be extended to the cosine function.
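The sine connection and its partial derivatives can be checked numerically; this short Python sketch (illustrative, with arbitrarily chosen parameter values) compares the analytic derivative with a central finite difference:

```python
import math

def f_sin(x, a, w, p):
    """Periodic connection f(x) = a * sin(w * x + p)."""
    return a * math.sin(w * x + p)

def grads_sin(x, a, w, p):
    """Partial derivatives used by the revised delta rules."""
    return {
        "a": math.sin(w * x + p),          # amplitude
        "w": a * x * math.cos(w * x + p),  # angular velocity
        "p": a * math.cos(w * x + p),      # initial phase
        "x": a * w * math.cos(w * x + p),  # error transmission
    }

# Central-difference check of df/dw at an arbitrary point.
eps = 1e-6
num_dw = (f_sin(0.3, 1.2, 2.0 + eps, 0.5)
          - f_sin(0.3, 1.2, 2.0 - eps, 0.5)) / (2 * eps)
```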
A functional transfer neural network can represent decision boundaries constructed by ellipses and hyperbolas. Recall the mathematical definition of ellipses: An ellipse with a centre $(u_1, u_2)$ and semi-axes $a$ and $b$ can be defined as $\frac{(x_1-u_1)^2}{a^2} + \frac{(x_2-u_2)^2}{b^2} = 1$. This equation can be rewritten to $w_1(x_1-u_1)^2 + w_2(x_2-u_2)^2 = r$, where $r$ is a positive number. Let $f_1(x) = w_1(x-u_1)^2$ and $f_2(x) = w_2(x-u_2)^2$. The equation becomes $f_1(x_1) + f_2(x_2) = r$. It is noticeable that the left-hand side can be rewritten as a transfer computation (see Eq. 3). Therefore, a model with an activation $g$ can be defined as $y = g(f_1(x_1) + f_2(x_2) - r)$, where $x_1$ and $x_2$ are inputs and $y$ is an output. Based on the above structure, multiple ellipse decision boundaries can be formed. For instance:
In the above model, and are input nodes. , and are hidden nodes. is an output node. is an activation. The input nodes and the hidden nodes are connected via a functional transfer matrix. The hidden nodes and the output node are connected via a weight matrix. The hidden nodes and the output node are activated via biases and the activation. Fig. 1 shows decision boundaries formed by this model: The functional transfer matrix and the hidden nodes form three ellipse boundaries (in orange, green and purple respectively). The weight matrix and the output node form a boundary which is the union of inner parts of all ellipse boundaries. In addition, this figure shows some example inputs and outputs: If the inputs are inside the decision boundaries, the output is 1. Otherwise, the output is 0.
The reason why the model can construct ellipse boundaries is that its functional transfer matrix consists of the following function:
$$f_{ij}(x) = w_{ij}(x - u_{ij})^2,$$
where $w_{ij}$ and $u_{ij}$ are trainable parameters. If the input dimension of this matrix is 2, it will construct ellipse boundaries on a plane. If its input dimension is 3, it will construct ellipsoid boundaries in a 3-dimensional space. Generally, if its input dimension is $n$ ($n \geq 2$), it will construct boundaries represented by $(n-1)$-dimensional ellipsoidal closed hypersurfaces in an $n$-dimensional space. This phenomenon reflects a significant difference between a transfer matrix with Eq. (18) and a standard linear weight matrix, as the former models closed hypersurfaces, while the latter models hyperplanes. In addition, to use the back-propagation algorithm to train the parameters, derivatives are computed via:
$$\frac{\partial f_{ij}}{\partial w_{ij}} = (x - u_{ij})^2, \quad \frac{\partial f_{ij}}{\partial u_{ij}} = -2 w_{ij}(x - u_{ij}), \quad \frac{\partial f_{ij}}{\partial x} = 2 w_{ij}(x - u_{ij}).$$
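The ellipse construction can be illustrated in a few lines of Python (a sketch, not the paper's implementation; a hard threshold stands in for the activation):

```python
def f_quad(x, w, u):
    """Quadratic connection f(x) = w * (x - u)**2 (Eq. (18))."""
    return w * (x - u) ** 2

def inside_ellipse(x1, x2, w1, u1, w2, u2, r):
    """A hidden node fires iff w1*(x1-u1)**2 + w2*(x2-u2)**2 <= r,
    i.e. iff the point lies inside the ellipse boundary."""
    return f_quad(x1, w1, u1) + f_quad(x2, w2, u2) - r <= 0

# Unit circle: w1 = w2 = 1, centre (0, 0), r = 1.
```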
Further, an adapted version of Eq. (18) can represent not only ellipses, but also hyperbolas. The adapted function is defined as:
$$f_{ij}(x) = c_{ij}\, w_{ij}(x - u_{ij})^2,$$
where $c_{ij}$ is initialised as 1 or -1. Please note that $c_{ij}$ is a constant, but NOT a trainable parameter. Given a $1 \times 2$ matrix with this function:
If it is initialised by $c_{11} = c_{12} = 1$,
then it can represent an ellipse boundary. On the other hand, if it is initialised by $c_{11} = 1$ and $c_{12} = -1$,
then it can be used to form a hyperbola boundary. More generally, in an $n$-dimensional space, an $m \times n$ functional transfer matrix with Eq. (22) can represent $m$ different $(n-1)$-dimensional conic hypersurfaces.
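The effect of the sign constants can be demonstrated with a tiny Python sketch (illustrative values only):

```python
def conic_value(xs, ws, us, cs):
    """Left-hand side of the conic equation:
    sum_j c_j * w_j * (x_j - u_j)**2, with each c_j in {+1, -1}."""
    return sum(c * w * (x - u) ** 2
               for x, w, u, c in zip(xs, ws, us, cs))

# Signs (+1, +1) give an ellipse-type boundary; (+1, -1) a hyperbola-type one.
ellipse_lhs = conic_value([2.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1, 1])
hyperbola_lhs = conic_value([2.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1, -1])
```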
In a standard neural network, connections between nodes usually do not “sleep”. The reason is that the connection from Node $j$ to Node $i$ is a real weight $w_{ij}$. Let $x_j$ denote a state of Node $j$. Node $i$ will receive a signal $w_{ij} x_j$. When $w_{ij} \neq 0$ and $x_j \neq 0$, the signal always influences Node $i$, regardless of whether or not Node $i$ needs it. To enable the connections to “sleep” temporarily, the following function is used in a functional transfer matrix:
$$f_{ij}(x) = c_{ij}\,\mathrm{relu}(w_{ij} x + b_{ij}).$$
In the above function, $w_{ij}$ and $b_{ij}$ are trainable parameters, and $c_{ij}$ is a constant which is initialised as 1 or -1 before training. It also makes use of the rectifier DBLP:journals/jmlr/GlorotBB11 :
$$\mathrm{relu}(z) = \max(0, z).$$
The rectifier makes the function able to “sleep”. In other words, when $w_{ij} x + b_{ij} \leq 0$, $f_{ij}(x)$ must be zero. To use the back-propagation algorithm to train the parameters, derivatives are computed via:
$$\frac{\partial f_{ij}}{\partial w_{ij}} = c_{ij}\,\mathbb{1}[w_{ij} x + b_{ij} > 0]\, x, \quad \frac{\partial f_{ij}}{\partial b_{ij}} = c_{ij}\,\mathbb{1}[w_{ij} x + b_{ij} > 0], \quad \frac{\partial f_{ij}}{\partial x} = c_{ij}\,\mathbb{1}[w_{ij} x + b_{ij} > 0]\, w_{ij}.$$
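Since the exact formula is hard to recover from the extracted text, the following Python sketch assumes the sleeping connection has the form `f(x) = c * relu(w*x + b)`, which is consistent with the stated properties (two trainable parameters, a ±1 constant, and a rectifier that forces the output to zero):

```python
def relu(z):
    return max(0.0, z)

def f_sleep(x, w, b, c):
    """Assumed form of the 'sleeping' connection: c * relu(w*x + b).
    When w*x + b <= 0 the connection transmits nothing (it 'sleeps')."""
    return c * relu(w * x + b)

def grads_sleep(x, w, b, c):
    """The rectifier gate also zeroes all gradients while the
    connection sleeps."""
    gate = 1.0 if w * x + b > 0 else 0.0
    return {"w": c * gate * x, "b": c * gate, "x": c * gate * w}
```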
Another function enables a connection between two nodes to “die”. The word “die” is different from the word “sleep”, as the former means that $f_{ij}(x) = 0$ for all $x$, and the latter means that $f_{ij}(x) = 0$ for some $x$. If the connection from Node $j$ to Node $i$ is dead, then any change of Node $j$ does not influence Node $i$. This function is:
$$f_{ij}(x) = c_{ij}\,\mathrm{relu}(w_{ij})\, x.$$
In the above function, $w_{ij}$ is a trainable parameter, and $c_{ij}$ is a constant which is initialised as 1 or -1 before training. To use the back-propagation algorithm to train the parameters, derivatives are computed via:
$$\frac{\partial f_{ij}}{\partial w_{ij}} = c_{ij}\,\mathbb{1}[w_{ij} > 0]\, x, \quad \frac{\partial f_{ij}}{\partial x} = c_{ij}\,\mathrm{relu}(w_{ij}).$$
It is noticeable that $f_{ij}(x)$, $\partial f_{ij}/\partial w_{ij}$ and $\partial f_{ij}/\partial x$ are all zero when $w_{ij} < 0$, which means that the function does not transfer any signal and cannot be updated in this case. In other words, the function is dead once $w_{ij}$ is updated to a negative value. A matrix with this function can then represent a partially connected neural network, as dead functions can be considered as broken connections after training. A problem with the function is that it is not able to “revive”. In other words, once $w_{ij}$ is updated to a negative value, it has no chance to become non-negative again. To solve this problem, an adapted version of the rectifier “$\widetilde{\mathrm{relu}}$” is used, whose forward computation is identical: $\widetilde{\mathrm{relu}}(w) = \max(0, w)$.
For the computation of $\partial f_{ij}/\partial x$, its derivative follows the standard rectifier:
$$\frac{\partial f_{ij}}{\partial x} = c_{ij}\,\widetilde{\mathrm{relu}}(w_{ij}) = c_{ij}\max(0, w_{ij}).$$
For the computation of $\partial f_{ij}/\partial w_{ij}$, however, the derivative of the adapted rectifier is arbitrarily defined as:
$$\frac{d\,\widetilde{\mathrm{relu}}(w)}{dw} \doteq 1.$$
We use “$\doteq$” instead of “$=$” because this derivative is not a mathematically sound result, but a predefined value. Thus, derivatives of $f_{ij}$ are computed via:
$$\frac{\partial f_{ij}}{\partial w_{ij}} \doteq c_{ij}\, x, \quad \frac{\partial f_{ij}}{\partial x} = c_{ij}\max(0, w_{ij}).$$
It is noticeable that $\partial f_{ij}/\partial w_{ij}$ is not zero when $x$ is not zero, which means that $w_{ij}$ can be updated even when it is negative. Thus, although the function is dead when $w_{ij}$ is negative, it can be revived by updating $w_{ij}$ to a non-negative value.
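The following Python sketch contrasts the two gradient rules. The functional form `f(x) = c * relu(w) * x` is an assumption consistent with the description (one trainable parameter, a ±1 constant, and all signals and gradients vanishing for negative `w`):

```python
def f_die(x, w, c):
    """Assumed form of the 'dying' connection: c * relu(w) * x.
    Once w < 0, the connection transfers no signal at all."""
    return c * max(0.0, w) * x

def grads_die(x, w, c, adapted=True):
    """With the standard rectifier (adapted=False), df/dw is gated by
    w > 0, so a dead connection can never recover. The adapted rule
    pretends the rectifier has slope 1 everywhere for df/dw, so a
    negative w can still climb back above zero ('revive')."""
    dw = c * x if (adapted or w > 0) else 0.0
    dx = c * max(0.0, w)
    return {"w": dw, "x": dx}
```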
A functional transfer matrix with memory functions can be used to model sequences. For instance, given a sequential input $x^{(1)}, x^{(2)}, \ldots, x^{(T)}$, the $t$th state of a memory function ($1 \leq t \leq T$) is computed via:
$$m^{(t)} = w\, x^{(t)} + u\, m^{(t-1)} + b,$$
where $w$, $u$ and $b$ are trainable parameters. In particular, $m$ is a memory cell and its initial state $m^{(0)}$ is set to 0. To use the back-propagation algorithm to train the parameters, derivatives are computed via (treating the stored state $m^{(t-1)}$ as a constant):
$$\frac{\partial m^{(t)}}{\partial w} = x^{(t)}, \quad \frac{\partial m^{(t)}}{\partial u} = m^{(t-1)}, \quad \frac{\partial m^{(t)}}{\partial b} = 1, \quad \frac{\partial m^{(t)}}{\partial x^{(t)}} = w.$$
An $m \times n$ functional transfer matrix with the memory function can record $m \times n$ signals from a previous state, as each memory function records one signal. On the other hand, a standard recurrent neural network usually uses hidden units to record signals from the previous state. If it has $n$ input units and $m$ hidden units, then it can record $m$ signals. It is noticeable that the functional transfer matrix with the memory function records $n$ times more signals than the recurrent neural network.
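The memory connection can be sketched as a small Python class. The update rule `m_t = w*x_t + u*m_{t-1} + b` is an assumption matching the description (three trainable parameters and a cell initialised to zero), not a formula taken verbatim from the paper:

```python
class MemoryConnection:
    """Assumed memory connection: on the t-th input x_t it outputs and
    stores m_t = w * x_t + u * m_{t-1} + b, with the cell m starting
    at zero, so each connection carries one signal between steps."""

    def __init__(self, w, u, b):
        self.w, self.u, self.b = w, u, b
        self.m = 0.0

    def __call__(self, x):
        self.m = self.w * x + self.u * self.m + self.b
        return self.m

# With w = u = 1 and b = 0 the cell simply accumulates its inputs.
acc = MemoryConnection(1.0, 1.0, 0.0)
outs = [acc(x) for x in [1.0, 2.0, 3.0]]
```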
This section discusses some practical training methods for functional transfer neural networks. These methods are also used in the experiments (Section 6).
Fig. 2 shows the general structure of functional transfer neural networks: They usually have an input layer, one or more hidden layers and an output layer. Each hidden layer consists of a functional transfer matrix and some hidden units with a bias vector and an activation function. In particular, the activation function can be the logistic sigmoid function DBLP:journals/jmlr/GlorotBB11 :
$$\sigma(z) = \frac{1}{1 + e^{-z}},$$
or the hyperbolic tangent (tanh) function:
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}.$$
The output layer consists of a linear weight matrix and output units with a bias vector and a softmax function:
$$\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_j e^{z_j}}.$$
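These three functions are standard; for reference, a minimal Python version:

```python
import math

def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent."""
    return math.tanh(z)

def softmax(zs):
    """Numerically stable softmax over the output units."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    s = sum(exps)
    return [e / s for e in exps]
```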
After setting up a model based on the above structure, parameters are initialised based on the following method:
Weights in the linear weight matrix are randomised as small real numbers.
Biases in the bias vectors are set to zero.
If a model only has one hidden layer, the training can be done via back-propagation directly rumelhart1988learning : Firstly, an input signal is propagated to the output layer, and an output signal is generated. Then an error signal is computed by comparing the output signal with a target signal. The comparison is based on the cross-entropy criterion hinton2012deep . Next, the error signal is back-propagated through the output layer and the hidden layer, and deltas of parameters are computed. For the output layer, the standard delta rules are used. For the hidden layer, the rules described in Section 3.2 are used. Finally, the parameters are updated according to their deltas. The above process can be combined with stochastic gradient descent DBLP:conf/nips/ZinkevichWSL10 .
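The single-hidden-layer procedure can be miniaturised to one trainable connection. This toy Python sketch uses the quadratic connection of Section 4.2 and a squared error in place of the softmax/cross-entropy criterion, purely to keep the illustration short:

```python
def train_step(w, u, b, x, target, lr):
    """One SGD step on the one-connection model y = w*(x-u)**2 + b,
    minimising the squared error (y - target)**2 / 2."""
    y = w * (x - u) ** 2 + b
    e = y - target                       # dE/dy
    gw = e * (x - u) ** 2                # dE/dw
    gu = e * (-2.0 * w * (x - u))        # dE/du
    gb = e                               # dE/db
    return w - lr * gw, u - lr * gu, b - lr * gb

# Fit y = x**2 on three points; the exact solution is w=1, u=0, b=0.
data = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
w, u, b = 0.5, 0.3, 0.2
for _ in range(500):
    for x, t in data:
        w, u, b = train_step(w, u, b, x, t, 0.1)
loss = sum((w * (x - u) ** 2 + b - t) ** 2 for x, t in data)
```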
Training a functional transfer neural network can be more difficult when the number of hidden layers increases. To resolve this problem, the combination of layer-wise supervised training (for details, see Algorithm 7 in the appendix of the referenced paper) and fine-tuning is used DBLP:conf/nips/BengioLPL06 , which is shown by Fig. 3. Layer-wise training includes the following steps: Firstly, the first hidden layer is trained, while the other hidden layers and the output layer are ignored. To train this layer, a new softmax layer (with a linear weight matrix and a bias vector) is added onto it, and the method for training a single hidden layer (described in Section 5.2) is used to train it. Then the softmax layer is removed, and the second hidden layer is added onto the first hidden layer. To train the second hidden layer, another new softmax layer is added onto it, and the same training method is used again. Please note that the first hidden layer is not updated in this step. Finally, all remaining hidden layers are trained by using the same method. In particular, the trained softmax layer on the last hidden layer is considered as the output layer of the whole neural network. After layer-wise training, the whole neural network can be further optimised via fine-tuning: For the output layer, the standard back-propagation rules are used; for the hidden layers, the rules described in Section 3.2 are used. In practice, learning rates for fine-tuning are smaller than those for layer-wise training.
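The control flow of the layer-wise scheme can be summarised in a few lines of Python. The `train_with_softmax_head` callback stands in for the single-hidden-layer procedure of Section 5.2 (this scaffolding is an illustration, not the paper's code):

```python
def layerwise_train(hidden_layers, train_with_softmax_head):
    """Greedy layer-wise training: grow the stack one hidden layer at a
    time, train each new layer under a fresh softmax head while the
    earlier layers stay frozen, and keep the final head as the output
    layer of the whole network."""
    stack, head = [], None
    for layer in hidden_layers:
        stack.append(layer)
        head = train_with_softmax_head(stack)  # earlier layers frozen
    return stack, head
```

Fine-tuning would then run ordinary back-propagation through `stack` and `head` with a smaller learning rate.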
Table 1 provides 20 example functions which are used to construct functional transfer neural networks. These example functions are substituted for $f_{ij}$ in the functional transfer matrix (see Eq. (2)). In particular, $a$, $b$ and $c$ are used to denote trainable parameters instead of $w_{ij,k}$. These functions include all functions discussed in Section 4, except the memory function in Section 4.4: F06, F07, and F08 have been discussed in Section