Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones (1). Over the last several years, a variety of artificial neural networks have played an increasingly important role in the development of machine learning. Among all neural networks, back propagation neural networks (BPNN) (2) and convolutional neural networks (CNN) (3; 4; 5; 6; 7; 8) are among the most successful in both theory and application. However, whatever the type of neural network, the key of the training algorithm is the gradient method in backward propagation.
As fractional order calculus has been successfully applied in LMS filtering (9; 10; 11), system identification (12; 13), control theory (14; 15; 16; 17) and so on, a new trend has arisen that introduces fractional order calculus into the gradient method. Professor Pu was the first to pay attention to the fractional order gradient method, directly adopting fractional order derivatives to replace the integer order derivatives in the traditional gradient method (18). Although such a method can possibly escape a local optimal point, it cannot ensure convergence to the real extreme point. To remedy this congenital defect, Chen uses truncation and the short memory principle to modify the fractional order gradient method (19; 20); the modified method converges to the real extreme point and shows faster convergence speed as well.
In parallel with research on the fractional order gradient method, some scholars have explored its application to artificial neural networks. Considering that fractional order derivatives of composite functions are complicated, Professor Wang uses fractional order gradients only for updating parameters, so that the chain rule can be kept to calculate integer order gradients along backward propagation (21). A similar method is followed with a different network structure in (22). Both applications to BPNN run smoothly and achieve outstanding performance. However, their fractional order gradient method is based on the strict definition of fractional order derivatives, which leads to the same problem as (18).
Although great efforts have been made on neural networks with the fractional order gradient method, this research is still novel and far from perfect at present. Several aspects remain to be improved.
Convergence to the real extreme point is necessary for a gradient method.
The available range of the fractional order can be extended to $0<\alpha<2$.
Neural networks of more complicated structure are worth researching in depth.
How to use the chain rule in fractional order neural networks is still a problem.
The loss function may be chosen not only as the quadratic function but also as the cross-entropy function.
Therefore, this paper equips a conventional CNN with a novel fractional order gradient method. To the best of our knowledge, no scholar has previously investigated CNN with a fractional order gradient method. The proposed method is novel for both neural networks and gradient methods. First, based on the Caputo definition of fractional order derivatives, a fractional order gradient method is designed and proved to converge to the real extreme point. Second, the gradients in backward propagation of neural networks are divided into two categories, namely the gradients transferred between layers and the gradients for updating parameters within layers. Third, the updating gradients are replaced by fractional order ones, while the transferring gradients remain integer order so that the chain rule can still be used. Finally, by connecting all layers end-to-end and adding loss functions, the CNN with fractional order gradient method is achieved.
The remainder of this article is organized as follows. Section 2 introduces a fractional order gradient method and provides some basic knowledge for subsequent use. The fractional order gradient method is developed for the fully connected layers and convolution layers in Section 3, respectively. In Section 4, some experiments are provided to illustrate the validity of the proposed approach. Conclusions are given in Section 5.
There are several widely accepted definitions of the fractional order derivative, such as the Riemann–Liouville, Caputo and Grünwald–Letnikov ones, but the Caputo definition is chosen for subsequent use, since the Caputo derivative of a constant equals zero. The Caputo definition is

$${}_{t_0}^{C}D_t^{\alpha}f(t)=\frac{1}{\Gamma(n-\alpha)}\int_{t_0}^{t}\frac{f^{(n)}(\tau)}{(t-\tau)^{\alpha-n+1}}\,\mathrm{d}\tau, \qquad (1)$$

where $n-1<\alpha<n$, $n\in\mathbb{Z}^{+}$, $\Gamma(\cdot)$ is the Gamma function, and $t_0$ is the initial value. Alternatively, (1) can be rewritten in the following series form

$${}_{t_0}^{C}D_t^{\alpha}f(t)=\sum_{i=n}^{\infty}\frac{f^{(i)}(t_0)}{\Gamma(i+1-\alpha)}(t-t_0)^{i-\alpha}. \qquad (2)$$
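As a quick numerical sanity check of the series form (2), the following sketch evaluates the Caputo derivative of a polynomial by truncating the series; `caputo_poly` and the truncation length are illustrative assumptions, not part of the paper.

```python
import math

def caputo_poly(coeffs, alpha, t0, t, n_terms=10):
    """Caputo derivative (0 < alpha < 1) of the polynomial
    f(t) = sum(coeffs[j] * t**j) via the truncated series form (2):
    sum over i >= 1 of f^(i)(t0) / Gamma(i+1-alpha) * (t-t0)**(i-alpha)."""
    def deriv_at(i, x):
        # i-th derivative of the polynomial evaluated at x
        return sum(c * (math.factorial(j) // math.factorial(j - i)) * x ** (j - i)
                   for j, c in enumerate(coeffs) if j >= i)
    return sum(deriv_at(i, t0) / math.gamma(i + 1 - alpha) * (t - t0) ** (i - alpha)
               for i in range(1, n_terms))

# f(t) = t^2 with t0 = 0: the exact Caputo derivative is 2 t^(2-alpha) / Gamma(3-alpha)
approx = caputo_poly([0.0, 0.0, 1.0], alpha=0.5, t0=0.0, t=2.0)
exact = 2.0 * 2.0 ** 1.5 / math.gamma(2.5)
print(approx, exact)  # the two values agree
```

For a polynomial the series terminates, so the truncated sum is exact once `n_terms` exceeds the degree.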
Suppose $f(x)$ to be a smooth convex function with a unique extreme point $x^{*}$. It is well known that each iterative step of the conventional gradient method is formulated as

$$x_{k+1}=x_k-\mu f^{(1)}(x_k), \qquad (3)$$

where $\mu>0$ is the iterative step size or learning rate and $k$ counts the iterations. Similarly, the fractional order gradient method is written as

$$x_{k+1}=x_k-\mu\,{}_{x_0}^{C}D_{x_k}^{\alpha}f(x). \qquad (4)$$
If fractional order derivatives are directly applied in (4), the above fractional order gradient method cannot converge to the real extreme point $x^{*}$, but to an extreme point in the sense of the fractional order derivative; such an extreme point is associated with the initial value and the order, and is generally not equal to $x^{*}$ (18).
To guarantee convergence to the real extreme point, an alternative fractional order gradient method (19) is considered via the following iterative step

$$x_{k+1}=x_k-\mu\sum_{i=1}^{\infty}\frac{f^{(i)}(x_{k-1})}{\Gamma(i+1-\alpha)}(x_k-x_{k-1})^{i-\alpha}, \qquad (6)$$

in which the fixed lower terminal $x_0$ in (4) is replaced by the varying $x_{k-1}$.
When only the first item is reserved and its absolute value is introduced, the fractional order gradient method with $0<\alpha<2$ is simplified as

$$x_{k+1}=x_k-\mu\frac{f^{(1)}(x_k)}{\Gamma(2-\alpha)}\,|x_k-x_{k-1}|^{1-\alpha}. \qquad (7)$$
Theorem 1. If the fractional order gradient method (7) is convergent, then it converges to the real extreme point $x^{*}$.
Proof. It is a proof by contradiction. Assume that $x_k$ converges to a different point $x_e$, namely $\lim_{k\to\infty}x_k=x_e\neq x^{*}$, so that $f^{(1)}(x_e)\neq 0$. Therefore, it can be concluded that for any sufficiently small positive scalar $\varepsilon$, there exists a sufficiently large number $K$ such that $|x_k-x_e|<\varepsilon$ for any $k>K$. Then $|x_{k+1}-x_k|\le|x_{k+1}-x_e|+|x_e-x_k|<2\varepsilon$ must hold.
According to (7), the following relation is obtained

$$|x_{k+1}-x_k|=\frac{\mu\,|f^{(1)}(x_k)|}{\Gamma(2-\alpha)}\,|x_k-x_{k-1}|^{1-\alpha}. \qquad (8)$$
Considering that $f^{(1)}(x_e)\neq 0$ and $f^{(1)}(x)$ is continuous, one can always find a constant $c>0$ such that $|f^{(1)}(x_k)|\ge c$ for all $k>K$; then the following inequality will hold

$$|x_{k+1}-x_k|\ge\frac{\mu c}{\Gamma(2-\alpha)}\,|x_k-x_{k-1}|^{1-\alpha}.$$
Since $|x_k-x_{k-1}|<2\varepsilon$, the bound on the step size could be rewritten as $|x_k-x_{k-1}|^{1-\alpha}>(2\varepsilon)^{-\alpha}\,|x_k-x_{k-1}|$. When this inequality is introduced into (8), the result is

$$|x_{k+1}-x_k|>\frac{\mu c}{\Gamma(2-\alpha)}\,(2\varepsilon)^{-\alpha}\,|x_k-x_{k-1}|>|x_k-x_{k-1}|,$$

where the last step follows by choosing $\varepsilon$ small enough that $\mu c\,(2\varepsilon)^{-\alpha}\ge\Gamma(2-\alpha)$,
which implies that $\{x_k\}$ is not convergent. This contradicts the assumption that $\{x_k\}$ is convergent to $x_e$; thus the proof is completed. ∎
When a small positive value $\epsilon$ is introduced, the following fractional order gradient method will avoid the singularity caused by $x_k=x_{k-1}$:

$$x_{k+1}=x_k-\mu\frac{f^{(1)}(x_k)}{\Gamma(2-\alpha)}\,\big(|x_k-x_{k-1}|+\epsilon\big)^{1-\alpha}.$$
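A minimal sketch of this safeguarded iteration on a scalar quadratic may help fix ideas; the function, step size and order below are our own illustrative choices, not the paper's experiments.

```python
import math

def frac_gd(df, x0, x1, mu=0.1, alpha=0.7, eps=1e-8, iters=500):
    """Simplified fractional order gradient method with the eps safeguard:
    x_{k+1} = x_k - mu * f'(x_k)/Gamma(2-alpha) * (|x_k - x_{k-1}| + eps)**(1-alpha).
    With alpha = 1 it reduces exactly to the integer order gradient method."""
    prev, cur = x0, x1
    for _ in range(iters):
        step = (mu * df(cur) / math.gamma(2.0 - alpha)
                * (abs(cur - prev) + eps) ** (1.0 - alpha))
        prev, cur = cur, cur - step
    return cur

# f(x) = (x - 3)^2 has its real extreme point x* = 3
print(frac_gd(lambda x: 2.0 * (x - 3.0), 0.0, 0.1))             # close to 3
print(frac_gd(lambda x: 2.0 * (x - 3.0), 0.0, 0.1, alpha=1.0))  # plain gradient descent
```

Note that the iteration needs two starting points, $x_0$ and $x_1$, because the factor $|x_k-x_{k-1}|^{1-\alpha}$ depends on the previous iterate.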
Compared with the gradient method based on the strict definition of fractional order derivatives (18), the modified fractional order gradient method (7) and other similar methods (19) are proved to converge to the real extreme point. Moreover, such methods turn out to converge faster than integer order gradient methods.
3 Main Results
A general CNN is composed of convolution layers, pooling layers and fully connected layers. The fractional order gradient method is applicable to all layers except pooling layers, since no parameters are updated in pooling layers. Although the key mathematical calculations are quite similar in convolution layers and fully connected layers, the different structures lead to different treatments. First of all, fully connected layers with fractional order gradients are introduced.
3.1 Fully Connected Layers
The training procedure of neural networks contains two steps, one of which is forward propagation. Such propagation between two layers is illustrated in Fig. 1, where the superscript $l$ is the number of the layer, the subscript $j$ is the number of the node in a certain layer, and $a_j^l$ is the output of the $j$-th node in the $l$-th layer.
The output is obtained from

$$z_j^l=\sum_i w_{ji}^l a_i^{l-1}+b_j^l,\qquad a_j^l=\sigma\big(z_j^l\big),$$

where $w_{ji}^l$ is the weight, $b_j^l$ is the bias, $a_i^{l-1}$ is the output of the last layer, and $\sigma(\cdot)$ is the activation function.
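The forward pass of one fully connected layer can be sketched as follows; the sigmoid is assumed as $\sigma$ for illustration (this excerpt does not fix the activation), and the shapes are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_forward(a_prev, W, b):
    """One fully connected layer: z = W a_prev + b, a = sigma(z).
    The pre-activation z is returned as well, since backward
    propagation needs it."""
    z = W @ a_prev + b
    return sigmoid(z), z

rng = np.random.default_rng(0)
a0 = rng.standard_normal((4, 1))                     # 4 inputs, one sample
W1, b1 = rng.standard_normal((3, 4)), np.zeros((3, 1))
a1, z1 = fc_forward(a0, W1, b1)
print(a1.shape)  # (3, 1)
```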
Another step of the training procedure is backward propagation, in which the fractional order gradient method takes the place of the traditional method. Due to the imperfect use of the chain rule in fractional order derivatives, the gradients of backward propagation are a blend of fractional order and integer order. As shown in Fig. 2, two types of gradients pass through the layers. One is the transferring gradient (solid line), which links nodes between two layers; the other is the updating gradient (dotted line), which is used for the parameters within layers. $J$ is the loss function, $\alpha$ is the fractional order, and $\nabla^{\alpha}w_{ji}^{l}$ and $\nabla^{\alpha}b_{j}^{l}$ are defined as the fractional order gradients of $w_{ji}^{l}$ and $b_{j}^{l}$, respectively.
In order to use the chain rule continuously, the transferring gradient is provided with integer order

$$\delta_j^l=\frac{\partial J}{\partial z_j^l},\qquad \delta_i^{l-1}=\sigma^{(1)}\big(z_i^{l-1}\big)\sum_j w_{ji}^l\,\delta_j^l,$$

but the updating gradient is replaced by the fractional order one, $\nabla^{\alpha}w_{ji}^{l}$ and $\nabla^{\alpha}b_{j}^{l}$, with $\partial J/\partial w_{ji}^{l}=\delta_j^l a_i^{l-1}$ and $\partial J/\partial b_{j}^{l}=\delta_j^l$. When the fractional order gradient (7) is adopted, the gradient of the $k$-th iteration becomes

$$\nabla^{\alpha}w_{ji,k}^{l}=\frac{1}{\Gamma(2-\alpha)}\frac{\partial J}{\partial w_{ji,k}^{l}}\,\big|w_{ji,k}^{l}-w_{ji,k-1}^{l}\big|^{1-\alpha},\qquad
\nabla^{\alpha}b_{j,k}^{l}=\frac{1}{\Gamma(2-\alpha)}\frac{\partial J}{\partial b_{j,k}^{l}}\,\big|b_{j,k}^{l}-b_{j,k-1}^{l}\big|^{1-\alpha}, \qquad (16)$$

where $w_{ji,k}^{l}$ and $b_{j,k}^{l}$ are $w_{ji}^{l}$ and $b_{j}^{l}$ at the $k$-th iteration, respectively.
Actually, samples are not input one by one in most cases. When a batch of samples is input each time, (16) turns into

$$\nabla^{\alpha}w_{ji,k}^{l}=\frac{1}{N\Gamma(2-\alpha)}\sum_{n=1}^{N}\frac{\partial J_n}{\partial w_{ji,k}^{l}}\,\big|w_{ji,k}^{l}-w_{ji,k-1}^{l}\big|^{1-\alpha},\qquad
\nabla^{\alpha}b_{j,k}^{l}=\frac{1}{N\Gamma(2-\alpha)}\sum_{n=1}^{N}\frac{\partial J_n}{\partial b_{j,k}^{l}}\,\big|b_{j,k}^{l}-b_{j,k-1}^{l}\big|^{1-\alpha},$$

where $N$ is the batch size and the subscript $n$ means the $n$-th sample of a batch. After vectorization, the above equations are simplified as

$$\nabla^{\alpha}W_{k}^{l}=\frac{1}{N\Gamma(2-\alpha)}\,\delta^{l}\big(A^{l-1}\big)^{T}\odot\big|W_{k}^{l}-W_{k-1}^{l}\big|^{1-\alpha},\qquad
\nabla^{\alpha}b_{k}^{l}=\frac{1}{N\Gamma(2-\alpha)}\,\mathrm{sum}\big(\delta^{l}\big)\odot\big|b_{k}^{l}-b_{k-1}^{l}\big|^{1-\alpha},$$

where signs like $|\cdot|$ and $(\cdot)^{1-\alpha}$ denote element-wise calculation, $\odot$ is the Hadamard product, and $\mathrm{sum}(\cdot)$ is the sum of a matrix along the horizontal axis. Then the updating of the parameters of fully connected layers can be summarized as

$$W_{k+1}^{l}=W_{k}^{l}-\mu\,\nabla^{\alpha}W_{k}^{l},\qquad b_{k+1}^{l}=b_{k}^{l}-\mu\,\nabla^{\alpha}b_{k}^{l}.$$
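The element-wise fractional update of a parameter array can be sketched as below on a toy quadratic loss; `frac_update`, the target matrix and the hyperparameter values are illustrative assumptions, not the paper's network.

```python
import numpy as np
from math import gamma

def frac_update(P, P_prev, dP, mu=0.05, alpha=1.2, eps=1e-8):
    """Element-wise fractional order update of a parameter array P.
    dP is the ordinary batch-averaged gradient dJ/dP; the factor
    (|P_k - P_{k-1}| + eps)**(1-alpha) follows the simplified
    fractional order gradient method (7)."""
    frac = dP / gamma(2.0 - alpha) * (np.abs(P - P_prev) + eps) ** (1.0 - alpha)
    return P - mu * frac

# toy loss J = 0.5 * ||W - T||_F^2, whose integer order gradient is W - T
T = np.array([[1.0, -2.0], [0.5, 3.0]])
W, W_prev = np.full((2, 2), 0.1), np.zeros((2, 2))
for _ in range(300):
    W, W_prev = frac_update(W, W_prev, W - T), W
print(np.round(W, 3))  # close to T
```

With `alpha=1.0` the fractional factor equals one and the loop reduces to plain gradient descent, which is one way to check an implementation.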
Compared with integer order backward propagation (23), the same transferring gradient is kept, but the difference lies in the updating gradient, whose order is changed from $1$ to $\alpha$. Even so, based on Theorem 1, the weights and biases under the fractional order gradient will converge to the real extreme point to which those under the integer order gradient also converge. Therefore, the analysis reverts to integer order backward propagation and is omitted here.
Because of the integer order transferring gradient, the chain rule is still available for the proposed gradient method, which avoids the complicated calculation caused by fractional order derivatives, especially derivatives of the activation function. As the modified fractional order gradient (7) is applied smoothly, the speed of convergence is improved and the real extreme point can now be reached.
3.2 Convolution Layers
Although the key calculation of convolution layers is similar to that of fully connected layers, their more complicated structure makes the iterative algorithm different. It is hard to understand the algorithm without the help of figures or auxiliary descriptions.
For subsequent research, the forward propagation of convolution layers is drawn briefly in Fig. 3, where $A^{l}$ is the output of the $l$-th layer, $W_{c}^{l}$ and $b_{c}^{l}$ are the weight and bias for channel $c$, $f$ is the size of the convolution kernel, a slice of $A^{l-1}$ is taken by selecting $f$ rows and $f$ columns over all channels (red cube), and $n_H$, $n_W$ and $n_C$ are the height, width and channels of the output $A^{l}$, respectively.
Similarly, the gradients of backward propagation are divided into two types. The transferring gradient of convolution layers is also kept the same as the integer order gradient. Considering that the input is a batch of samples, the updating gradient is

$$\nabla W^{l}=\frac{1}{N}\sum_{n=1}^{N}\frac{\partial J_n}{\partial W^{l}},\qquad \nabla b^{l}=\frac{1}{N}\sum_{n=1}^{N}\frac{\partial J_n}{\partial b^{l}}, \qquad (20)$$

where $W^{l}$ is the weight that contains all $W_{c}^{l}$, and $\frac{1}{N}\sum_{n=1}^{N}(\cdot)$ is the average over all samples.
When the fractional order gradient method (7) is introduced, the updating gradient at the $k$-th iteration is changed to

$$\nabla^{\alpha}W_{k}^{l}=\frac{1}{N\Gamma(2-\alpha)}\sum_{n=1}^{N}\frac{\partial J_n}{\partial W_{k}^{l}}\odot\big|W_{k}^{l}-W_{k-1}^{l}\big|^{1-\alpha}, \qquad (21)$$

where $J_n$ is the loss of the $n$-th sample. The integer order part $\partial J_n/\partial W^{l}$ could be simply regarded as a correlation between slices of the input $A^{l-1}$ and the transferring gradient, where $s$ is the moving length of the convolution kernel each time. After vectorization, (21) is further simplified into the vectorized form (22).
In order to show the algorithm clearly, the calculation of (22) is transformed into the following process (Algorithm 1).
Then the updating of the parameters of convolution layers is as follows

$$W_{k+1}^{l}=W_{k}^{l}-\mu\,\nabla^{\alpha}W_{k}^{l},\qquad b_{k+1}^{l}=b_{k}^{l}-\mu\,\nabla^{\alpha}b_{k}^{l}.$$
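A single-channel, stride-1 sketch of this update may clarify the structure; the kernel-recovery problem, the helper names and the hyperparameter values are our own assumptions, and the full multi-channel batched algorithm of the paper is not reproduced.

```python
import numpy as np
from math import gamma

def corr2d(x, k):
    """Valid 2-D cross-correlation with stride 1 (single channel)."""
    H = x.shape[0] - k.shape[0] + 1
    W = x.shape[1] - k.shape[1] + 1
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(x[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def conv_frac_update(K, K_prev, x, dZ, mu=0.02, alpha=0.7, eps=1e-8):
    """Fractional order update of a convolution kernel K. The integer
    order gradient dJ/dK is the correlation of the input x with the
    transferring gradient dZ; the fractional factor follows (7)."""
    dK = corr2d(x, dZ)  # dJ/dK, same shape as K
    frac = dK / gamma(2.0 - alpha) * (np.abs(K - K_prev) + eps) ** (1.0 - alpha)
    return K - mu * frac

# toy problem: recover a target kernel from J = 0.5 * ||corr2d(x, K) - Y||^2
rng = np.random.default_rng(1)
x = rng.standard_normal((6, 6))
K_true = np.array([[0.5, -1.0], [2.0, 0.0]])
Y = corr2d(x, K_true)
K, K_prev = np.zeros((2, 2)), np.full((2, 2), 0.1)
for _ in range(800):
    dZ = corr2d(x, K) - Y           # transferring gradient at this layer's output
    K, K_prev = conv_frac_update(K, K_prev, x, dZ), K
print(np.round(K, 2))  # approaches K_true
```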
According to Theorem 1, replacing the updating gradient by the fractional order gradient method ensures convergence to the real extreme point; that is, the fractional order updating gradient converges to the same point as the integer order one. Since the integer order transferring gradient is kept, the overall convergence of convolution layers is similar to the integer order case and easily guaranteed.
Based on the backward propagation of convolution layers, when padding is introduced into the $l$-th layer, the transferring gradient will be influenced. The gradient calculated by Algorithm 1 is the gradient of the padded output, so the padded part of it needs deleting. However, no change happens to the fractional order updating gradient.
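The border-deletion step can be sketched as a simple slice; `unpad_gradient` and the symmetric scalar `pad` are illustrative assumptions.

```python
import numpy as np

def unpad_gradient(dA_pad, pad):
    """Delete the padded border of a transferring gradient so that its
    shape matches the original (unpadded) output of the previous layer;
    the fractional order updating gradient needs no such correction."""
    if pad == 0:
        return dA_pad
    return dA_pad[pad:-pad, pad:-pad]

dA = np.arange(36.0).reshape(6, 6)      # gradient w.r.t. a 6x6 padded output
inner = unpad_gradient(dA, 1)
print(inner.shape)  # (4, 4)
```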