Convolutional neural networks with fractional order gradient method

05/14/2019 · Dian Sheng, et al. · USTC

This paper proposes a fractional order gradient method for the backward propagation of convolutional neural networks. To overcome the problem that the fractional order gradient method cannot converge to the real extreme point, a simplified fractional order gradient method is designed based on Caputo's definition. The parameters within layers are updated by the designed gradient method, while the propagation between layers still uses integer order gradients; thus the complicated fractional derivatives of composite functions are avoided and the chain rule is preserved. By connecting all layers in series and adding loss functions, the proposed convolutional neural network can be trained smoothly for various tasks. Finally, several practical experiments are carried out to demonstrate the effectiveness of the proposed networks.


1 Introduction

Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones (1). Over the last several years, a variety of artificial neural networks have played an increasingly important role in the development of machine learning. Among all neural networks, back propagation neural networks (BPNN) (2) and convolutional neural networks (CNN) (3; 4; 5; 6; 7; 8) are among the most successful in both theory and application. However, no matter what kind of neural network is used, the core of the algorithm is the gradient method in backward propagation.

As fractional order calculus has been successfully applied to LMS filtering (9; 10; 11), system identification (12; 13), control theory (14; 15; 16; 17) and so on, a new trend has arisen of introducing fractional order calculus into gradient methods. Professor Pu was the first to pay attention to the fractional order gradient method, directly replacing the integer order derivatives in the traditional gradient method with fractional order derivatives (18). Although such a method may escape a local optimal point, it cannot ensure convergence to the real extreme point. To remedy this inherent defect, Chen modified the fractional order gradient method with truncation and the short memory principle (19; 20); the resulting method converges to the real extreme point and also shows a faster convergence speed.

In parallel with research on the fractional order gradient method itself, some scholars have explored its application to artificial neural networks. Considering that fractional order derivatives of composite functions are complicated, Professor Wang uses fractional order gradients only for updating parameters, so that the chain rule is kept to calculate integer order gradients along the backward propagation (21). A similar method is followed, but with a different network structure, in (22). Both applications to BPNN are shown to train smoothly and achieve outstanding performance. However, their fractional order gradient method is based on the strict definition of fractional order derivatives, which leads to the same problem as (18).

Although great efforts have been devoted to neural networks with fractional order gradient methods, this research is still new and far from perfect. Several aspects remain to be improved.

  • Convergence to the real extreme point is necessary for a gradient method.

  • The available range of the fractional order can be extended to $0<\alpha<2$.

  • Neural networks with more complicated structures are worth researching in depth.

  • How to apply the chain rule in fractional order neural networks is still an open problem.

  • The loss function may be chosen not only as a quadratic function but also as a cross-entropy function.

Therefore, this paper equips the conventional CNN with a novel fractional order gradient method. To the best of our knowledge, no one has previously investigated CNN with a fractional order gradient method. The proposed method is novel both for neural networks and for gradient methods. First, based on Caputo's definition of fractional order derivatives, a fractional order gradient method is designed and proved to converge to the real extreme point. Second, the gradients in the backward propagation of neural networks are divided into two categories, namely the gradients transferred between layers and the gradients for updating parameters within layers. Third, the updating gradients are replaced by fractional order ones, while the transferring gradients remain integer order so that the chain rule can still be used. Finally, by connecting all layers end to end and adding loss functions, the CNN with the fractional order gradient method is obtained.

The remainder of this article is organized as follows. Section 2 introduces a fractional order gradient method and provides some basic knowledge for subsequent use. The fractional order gradient method is then applied to fully connected layers and convolution layers in Section 3. In Section 4, some experiments are provided to illustrate the validity of the proposed approach. Conclusions are given in Section 5.

2 Preliminaries

There are several widely accepted definitions of the fractional order derivative, such as Riemann-Liouville, Caputo and Grünwald-Letnikov, but Caputo's definition is chosen for subsequent use, since the Caputo derivative of a constant equals zero. Caputo's definition is

${}_{c}^{C}D_{t}^{\alpha}f(t)=\frac{1}{\Gamma(n-\alpha)}\int_{c}^{t}\frac{f^{(n)}(\tau)}{(t-\tau)^{\alpha-n+1}}\,\mathrm{d}\tau$, (1)

where $n-1<\alpha<n$, $n\in\mathbb{Z}^{+}$, $\Gamma(\cdot)$ is the Gamma function, and $c$ is the initial value. Alternatively, (1) can be rewritten in the following series form

${}_{c}^{C}D_{t}^{\alpha}f(t)=\sum_{i=n}^{\infty}\frac{f^{(i)}(c)}{\Gamma(i+1-\alpha)}\,(t-c)^{i-\alpha}$. (2)
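As a quick sanity check of the reconstructed formulas above (this worked example is not part of the original text), take $f(t)=(t-c)^{2}$ with $0<\alpha<1$. Both (1) and (2) give

${}_{c}^{C}D_{t}^{\alpha}(t-c)^{2}=\frac{1}{\Gamma(1-\alpha)}\int_{c}^{t}\frac{2(\tau-c)}{(t-\tau)^{\alpha}}\,\mathrm{d}\tau=\frac{2}{\Gamma(3-\alpha)}\,(t-c)^{2-\alpha}$,

which reduces to the ordinary first derivative $2(t-c)$ as $\alpha\to 1$.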

Suppose $f(x)$ is a smooth convex function with a unique extreme point $x^{*}$. It is well known that each iterative step of the conventional gradient method is formulated as

$x_{k+1}=x_{k}-\mu f^{(1)}(x_{k})$, (3)

where $\mu>0$ is the iterative step size or learning rate and $k$ is the iteration number. Similarly, the fractional order gradient method is written as

$x_{k+1}=x_{k}-\mu\,{}_{c}^{C}D_{x_{k}}^{\alpha}f(x)$. (4)

If fractional order derivatives are directly applied in (4), the above fractional order gradient method cannot converge to the real extreme point $x^{*}$, but only to an extreme point in the sense of fractional order derivatives; such an extreme point is associated with the initial value $c$ and the order $\alpha$, and is generally not equal to $x^{*}$ (18).

To guarantee convergence to the real extreme point, an alternative fractional order gradient method (19) is considered via the following iterative step

$x_{k+1}=x_{k}-\mu\,{}_{x_{k-1}}^{C}D_{x_{k}}^{\alpha}f(x)$, (5)

with $n-1<\alpha<n$ and

${}_{x_{k-1}}^{C}D_{x_{k}}^{\alpha}f(x)=\sum_{i=n}^{\infty}\frac{f^{(i)}(x_{k-1})}{\Gamma(i+1-\alpha)}\,(x_{k}-x_{k-1})^{i-\alpha}$. (6)

When only the first item of (6) is retained and its absolute value is introduced, the fractional order gradient method with $0<\alpha<2$ is simplified as

$x_{k+1}=x_{k}-\mu\,\frac{f^{(1)}(x_{k})}{\Gamma(2-\alpha)}\,\big|x_{k}-x_{k-1}\big|^{1-\alpha}$. (7)
Theorem 1.

If the fractional order gradient method (7) is convergent, it will converge to the real extreme point $x^{*}$.

Proof.

The proof is by contradiction. Assume that $x_{k}$ converges to a different point $x^{e}$, namely $x^{e}\neq x^{*}$, so that $f^{(1)}(x^{e})\neq 0$. Therefore, it can be concluded that for any sufficiently small positive scalar $\varepsilon$, there exists a sufficiently large number $N$ such that $|x_{k}-x^{e}|<\varepsilon$ for any $k>N$. Then $|x_{k}-x_{k-1}|<2\varepsilon$ must hold.

According to (7), the following inequality is obtained

(8)

with .

Considering that one can always find a $\delta>0$ such that $|f^{(1)}(x_{k})|\geq\delta$ for all $k>N$, the following inequality will hold

(9)

When the above inequality is rewritten and introduced into (8), the result is

(10)

which implies that $x_{k}$ is not convergent. This contradicts the assumption that $x_{k}$ converges to $x^{e}$, and thus the proof is completed. ∎

Remark 1.

When a small positive value $\epsilon$ is introduced, the following fractional order gradient method avoids the singularity caused by $x_{k}=x_{k-1}$.

$x_{k+1}=x_{k}-\mu\,\frac{f^{(1)}(x_{k})}{\Gamma(2-\alpha)}\,\big(|x_{k}-x_{k-1}|+\epsilon\big)^{1-\alpha}$. (11)

Compared with the gradient method based on the strict definition of fractional order derivatives (18), the modified fractional order gradient method (7) and other similar methods (19) are proved to converge to the real extreme point. Moreover, such methods turn out to converge faster than integer order gradient methods.
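To make the iteration concrete, the following minimal Python sketch implements the reconstructed update (11) on a toy quadratic objective; the objective, step size, order and iteration count are arbitrary choices for illustration, not settings taken from the paper.

import math

def fractional_gradient_descent(df, x0, alpha=1.5, mu=0.1, eps=1e-8, iters=100):
    # Simplified fractional order gradient method, cf. (7) and (11):
    # x_{k+1} = x_k - mu * f'(x_k) / Gamma(2 - alpha) * (|x_k - x_{k-1}| + eps)^(1 - alpha)
    x_prev, x = x0, x0 + mu  # small initial displacement so that |x_k - x_{k-1}| > 0
    for _ in range(iters):
        step = mu * df(x) / math.gamma(2.0 - alpha) * (abs(x - x_prev) + eps) ** (1.0 - alpha)
        x_prev, x = x, x - step
    return x

# Toy example: f(x) = (x - 3)^2, so f'(x) = 2 (x - 3) and the real extreme point is x* = 3.
print(fractional_gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0, alpha=1.5))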

3 Main Results

A general CNN is composed of convolution layers, pooling layers and fully connected layers. The fractional order gradient method is applicable to all layers except pooling layers, since there are no parameters to update in pooling layers. Although the key mathematical calculations are quite similar in convolution layers and fully connected layers, the different structures lead to different treatments. First, fully connected layers with fractional order gradients are introduced.

3.1 Fully Connected Layers

The training procedure of neural networks contains two steps, one of which is forward propagation. Such propagation between two layers is illustrated in Fig. 1, where the superscript $l$ is the layer index, the subscript $i$ is the node index within a layer, and $y_{i}^{l}$ is the output of the $i$-th node in the $l$-th layer.

Fig. 1: Forward propagation of fully connected layers.

The output $y_{j}^{l}$ is obtained from

$y_{j}^{l}=\sigma(z_{j}^{l}), \qquad z_{j}^{l}=\sum_{i}w_{ij}^{l}\,y_{i}^{l-1}+b_{j}^{l}$, (12)

where $w_{ij}^{l}$ is the weight, $b_{j}^{l}$ is the bias, $y_{i}^{l-1}$ is the output of the last layer, and the function $\sigma(\cdot)$ is the activation function.
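For concreteness, a minimal NumPy sketch of the forward pass (12) under the reconstructed notation is given below; the sigmoid activation and the array shapes are assumptions made for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_forward(y_prev, W, b):
    # One fully connected layer, cf. (12): y^l = sigma(W y^{l-1} + b).
    # y_prev: (n_prev, batch), W: (n_curr, n_prev), b: (n_curr, 1).
    z = W @ y_prev + b
    return sigmoid(z), z  # z is cached for the backward pass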

The other step of the training procedure is backward propagation, in which the fractional order gradient method takes the place of the traditional method. Because the chain rule cannot be used directly with fractional order derivatives, the gradients of backward propagation are a blend of fractional order and integer order. As shown in Fig. 2, there are two types of gradients passing through the layers. One is the transferring gradient (solid line), which links nodes between two layers; the other is the updating gradient (dotted line), which is used for the parameters within a layer. Here $L$ is the loss function, $\alpha$ is the fractional order, and $\nabla_{w}^{\alpha}L$ and $\nabla_{b}^{\alpha}L$ are defined as the fractional order gradients of $w$ and $b$, respectively.

Fig. 2: Backward propagation of fully connected layers.

In order to keep using the chain rule, the transferring gradient is of integer order,

$\dfrac{\partial L}{\partial y_{i}^{l-1}}=\sum_{j}\dfrac{\partial L}{\partial y_{j}^{l}}\,\dfrac{\partial y_{j}^{l}}{\partial y_{i}^{l-1}}$, (13)

but the updating gradient is replaced by its fractional order counterpart,

$\nabla_{w}^{\alpha}L=\dfrac{\partial L}{\partial y_{j}^{l}}\,\dfrac{\partial^{\alpha}y_{j}^{l}}{\partial(w_{ij}^{l})^{\alpha}}, \qquad \nabla_{b}^{\alpha}L=\dfrac{\partial L}{\partial y_{j}^{l}}\,\dfrac{\partial^{\alpha}y_{j}^{l}}{\partial(b_{j}^{l})^{\alpha}}$, (14)

with $w=w_{ij}^{l}$ and $b=b_{j}^{l}$. When the fractional order gradient (7) is adopted, the gradient at the $k$-th iteration becomes

$\dfrac{\partial^{\alpha}y_{j}^{l}}{\partial(w_{ij}^{l})^{\alpha}}=\sigma^{(1)}(z_{j}^{l})\,y_{i}^{l-1}\,\dfrac{\big|w_{ij,k}^{l}-w_{ij,k-1}^{l}\big|^{1-\alpha}}{\Gamma(2-\alpha)}, \qquad \dfrac{\partial^{\alpha}y_{j}^{l}}{\partial(b_{j}^{l})^{\alpha}}=\sigma^{(1)}(z_{j}^{l})\,\dfrac{\big|b_{j,k}^{l}-b_{j,k-1}^{l}\big|^{1-\alpha}}{\Gamma(2-\alpha)}$, (15)

where $w_{ij,k}^{l}$ and $b_{j,k}^{l}$ are the parameters $w_{ij}^{l}$ and $b_{j}^{l}$ at the $k$-th iteration, and $y_{i}^{l-1}$ is the output at the $k$-th iteration. Consequently, the fractional order updating gradient is achieved by introducing (15) into (14)

$\nabla_{w}^{\alpha}L=\dfrac{\partial L}{\partial y_{j}^{l}}\,\sigma^{(1)}(z_{j}^{l})\,y_{i}^{l-1}\,\dfrac{\big|w_{ij,k}^{l}-w_{ij,k-1}^{l}\big|^{1-\alpha}}{\Gamma(2-\alpha)}, \qquad \nabla_{b}^{\alpha}L=\dfrac{\partial L}{\partial y_{j}^{l}}\,\sigma^{(1)}(z_{j}^{l})\,\dfrac{\big|b_{j,k}^{l}-b_{j,k-1}^{l}\big|^{1-\alpha}}{\Gamma(2-\alpha)}$, (16)

where $w_{ij,k-1}^{l}$ and $b_{j,k-1}^{l}$ are $w_{ij}^{l}$ and $b_{j}^{l}$ at the $(k-1)$-th iteration, respectively.

Actually, samples are not input one by one in most cases. When a batch of samples is input each time, (16) turns into

$\nabla_{w}^{\alpha}L=\dfrac{1}{m}\sum_{s=1}^{m}\dfrac{\partial L}{\partial y_{j,s}^{l}}\,\sigma^{(1)}(z_{j,s}^{l})\,y_{i,s}^{l-1}\,\dfrac{\big|w_{ij,k}^{l}-w_{ij,k-1}^{l}\big|^{1-\alpha}}{\Gamma(2-\alpha)}, \qquad \nabla_{b}^{\alpha}L=\dfrac{1}{m}\sum_{s=1}^{m}\dfrac{\partial L}{\partial y_{j,s}^{l}}\,\sigma^{(1)}(z_{j,s}^{l})\,\dfrac{\big|b_{j,k}^{l}-b_{j,k-1}^{l}\big|^{1-\alpha}}{\Gamma(2-\alpha)}$, (17)

where $m$ is the batch size and the subscript $s$ denotes the $s$-th sample of the batch. After vectorization, the above equations are simplified as

$\nabla_{W}^{\alpha}L=\dfrac{1}{m}\Big[\Big(\dfrac{\partial L}{\partial Y^{l}}\odot\sigma^{(1)}(Z^{l})\Big)(Y^{l-1})^{T}\Big]\odot\dfrac{\big|W_{k}^{l}-W_{k-1}^{l}\big|^{1-\alpha}}{\Gamma(2-\alpha)}, \qquad \nabla_{b}^{\alpha}L=\dfrac{1}{m}\,\mathrm{sum}\Big(\dfrac{\partial L}{\partial Y^{l}}\odot\sigma^{(1)}(Z^{l})\Big)\odot\dfrac{\big|b_{k}^{l}-b_{k-1}^{l}\big|^{1-\alpha}}{\Gamma(2-\alpha)}$, (18)

where the absolute value and the power $1-\alpha$ are calculated element-wise, $\odot$ is the Hadamard product, and $\mathrm{sum}(\cdot)$ is the sum of a matrix along the horizontal axis. Then the updating of the parameters of the fully connected layers can be summarized as

$W_{k+1}^{l}=W_{k}^{l}-\mu\,\nabla_{W}^{\alpha}L, \qquad b_{k+1}^{l}=b_{k}^{l}-\mu\,\nabla_{b}^{\alpha}L$. (19)
Theorem 2.

The fully connected layers updated by the fractional order gradient method (18), (19) converge to the real extreme point.

Compared with integer order backward propagation (23), the same transferring gradient is kept, and the difference lies only in the updating gradient, whose order is changed from $1$ to $\alpha$. Even so, based on Theorem 1, the parameters $W$ and $b$ updated by the fractional order gradient converge to the same real extreme point to which $W$ and $b$ updated by the integer order gradient also converge. Therefore, the proof reduces to that of integer order backward propagation and is omitted here.

Remark 2.

Because the transferring gradient is of integer order, the chain rule is still available for the proposed gradient method (18), (19), which avoids the complicated calculations caused by fractional order derivatives, especially derivatives of the activation function. Since the modified fractional order gradient (7) is applied, the speed of convergence is improved and the real extreme point can now be reached.
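To summarize Section 3.1 in code, the following NumPy sketch shows one possible vectorized implementation of (13)-(19) under the reconstruction above: the transferring gradient is the ordinary integer order chain-rule term, while the updating gradients are scaled element-wise by $|W_{k}-W_{k-1}|^{1-\alpha}/\Gamma(2-\alpha)$. The array shapes, the sigmoid activation and the batch averaging are assumptions made for illustration.

import math
import numpy as np

def fc_backward_fractional(dY, z, y_prev, W, W_old, b, b_old, alpha=0.9, mu=0.1, eps=1e-8):
    # Backward pass of one fully connected layer with a fractional order
    # updating gradient (a sketch of (13)-(19), reconstructed).
    # dY: dL/dY of this layer's output, shape (n_curr, m); z: cached pre-activation;
    # y_prev: output of the previous layer, shape (n_prev, m);
    # W, b: current parameters; W_old, b_old: parameters of the previous iteration.
    m = y_prev.shape[1]
    sig = 1.0 / (1.0 + np.exp(-z))
    dZ = dY * sig * (1.0 - sig)                    # integer order, chain rule kept

    dY_prev = W.T @ dZ                             # transferring gradient, cf. (13)

    dW = dZ @ y_prev.T / m                         # integer order parameter gradients
    db = np.sum(dZ, axis=1, keepdims=True) / m     # averaged over the batch

    # element-wise fractional scaling, cf. (7) and (18)
    c = 1.0 / math.gamma(2.0 - alpha)
    dW_alpha = dW * c * (np.abs(W - W_old) + eps) ** (1.0 - alpha)
    db_alpha = db * c * (np.abs(b - b_old) + eps) ** (1.0 - alpha)

    # parameter update, cf. (19)
    return dY_prev, W - mu * dW_alpha, b - mu * db_alpha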

3.2 Convolution Layers

Although the key calculation of convolution layers is similar to that of fully connected layers, their more complicated structure makes the iterative algorithm different. It is hard to understand the algorithm without the help of figures and auxiliary descriptions.

For subsequent use, the forward propagation of convolution layers is sketched in Fig. 3, where $A^{l}$ is the output of the $l$-th layer, $W_{c}$ and $b_{c}$ are the weight and bias for channel $c$, $f$ is the size of the convolution kernel, a slice of $A^{l-1}$ (red cube) is obtained by selecting $f$ rows and $f$ columns over all channels, and $H$, $W$ and $C$ are the height, width and number of channels of the output $A^{l}$, respectively.

Fig. 3: Forward propagation of convolution layers.

Similarly, the gradients of backward propagation are divided into two types. The transferring gradient of the convolution layers is likewise kept the same as the integer order gradient. Considering that the input is a batch of samples, the updating gradient is

(20)

where $W$ is the weight tensor that contains all $W_{c}$, and the gradient is averaged over all samples.

When the fractional order gradient method (7) is introduced, the updating gradient at the $k$-th iteration is changed to

(21)

where the subscript $s$ refers to the $s$-th sample. It can simply be regarded as a sliding accumulation over the slices of $A^{l-1}$, where the stride is the moving length of the convolution kernel at each step. After vectorization, (21) is further simplified as

(22)

In order to present the algorithm clearly, the calculation of (22) is transformed into the following process.

1:  initialize $\mathrm{d}A^{l-1}$, $\nabla_{W}^{\alpha}L$ and $\nabla_{b}^{\alpha}L$ with zeros
2:  for each sample $s$ of the batch do
3:     for each vertical output position $i$ do
4:        for each horizontal output position $j$ do
5:           for each output channel $c$ do
6:              select the slice of $A^{l-1}$ covered by the kernel at output position $(i,j)$
7:              accumulate the transferring gradient $\mathrm{d}A^{l-1}$ over this slice using $W_{c}$ and the incoming gradient of channel $c$
8:              accumulate the gradient of $W_{c}$ using this slice and the incoming gradient of channel $c$
9:              accumulate the gradient of $b_{c}$ using the incoming gradient of channel $c$
10:             apply the fractional factors $|W_{c,k}-W_{c,k-1}|^{1-\alpha}/\Gamma(2-\alpha)$ and $|b_{c,k}-b_{c,k-1}|^{1-\alpha}/\Gamma(2-\alpha)$ to these contributions, as in (21)
11:           end for
12:        end for
13:     end for
14:  end for
15:  return $\mathrm{d}A^{l-1}$, $\nabla_{W}^{\alpha}L$, $\nabla_{b}^{\alpha}L$
Algorithm 1 Backward propagation of convolution layers by fractional order gradient method.
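A hedged NumPy rendering of Algorithm 1 under the reconstruction above is given below; the data layout (batch, height, width, channels), the stride handling, the batch averaging and the placement of the fractional factor are assumptions made for illustration, not details taken from the paper.

import math
import numpy as np

def conv_backward_fractional(dA, A_prev, W, W_old, b, b_old,
                             stride=1, alpha=0.9, mu=0.1, eps=1e-8):
    # Backward propagation of a convolution layer with fractional order updating
    # gradients (a sketch of Algorithm 1 and (20)-(23), reconstructed).
    # dA: gradient w.r.t. this layer's output, shape (m, H, Wout, C)
    # A_prev: input of this layer, shape (m, H_in, W_in, C_prev)
    # W: kernels, shape (f, f, C_prev, C); b: biases, shape (1, 1, 1, C)
    m, H, Wout, C = dA.shape
    f = W.shape[0]
    dA_prev = np.zeros_like(A_prev)
    dW = np.zeros_like(W)
    db = np.zeros_like(b)

    for s in range(m):                              # samples of the batch
        for i in range(H):                          # vertical output position
            for j in range(Wout):                   # horizontal output position
                for c in range(C):                  # output channels
                    v, h = i * stride, j * stride
                    a_slice = A_prev[s, v:v+f, h:h+f, :]            # red cube in Fig. 3
                    dA_prev[s, v:v+f, h:h+f, :] += W[:, :, :, c] * dA[s, i, j, c]
                    dW[:, :, :, c] += a_slice * dA[s, i, j, c]
                    db[:, :, :, c] += dA[s, i, j, c]

    # fractional scaling of the updating gradients, cf. (7), (21), (22)
    cst = 1.0 / math.gamma(2.0 - alpha)
    dW_alpha = (dW / m) * cst * (np.abs(W - W_old) + eps) ** (1.0 - alpha)
    db_alpha = (db / m) * cst * (np.abs(b - b_old) + eps) ** (1.0 - alpha)

    # the integer order transferring gradient dA_prev is returned unchanged; cf. (23)
    return dA_prev, W - mu * dW_alpha, b - mu * db_alpha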

Then the parameters of the convolution layers are updated as follows

$W_{k+1}=W_{k}-\mu\,\nabla_{W}^{\alpha}L, \qquad b_{k+1}=b_{k}-\mu\,\nabla_{b}^{\alpha}L$. (23)
Theorem 3.

The convolution layers updated by the fractional order gradient method (22), (23) converge to the real extreme point.

According to Theorem 1, the updating gradient, once replaced by the fractional order gradient method, still ensures convergence, and convergence to the real extreme point. This implies that the convergence behavior of the fractional order updating gradient is the same as that of the integer order one. Since the integer order transferring gradient is kept, the overall convergence of the convolution layers is analogous to the integer order case and is easily guaranteed.

Remark 3.

In the backward propagation of convolution layers, when padding is introduced into the input $A^{l-1}$ of the $l$-th layer, the transferring gradient is influenced: the gradient $\mathrm{d}A^{l-1}$ calculated by Algorithm 1 is the gradient of the padded output, so its padded part needs to be deleted, as illustrated in the sketch below. However, nothing changes for the fractional order updating gradient.
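For instance, if symmetric zero padding of width $p$ was applied in the forward pass, the padded border of the transferring gradient returned by Algorithm 1 could be removed as follows; the layout (batch, height, width, channels) and the helper name are illustrative assumptions.

def unpad_gradient(dA_prev_padded, p):
    # Remove the padded border from the transferring gradient, cf. Remark 3.
    if p == 0:
        return dA_prev_padded
    return dA_prev_padded[:, p:-p, p:-p, :]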

Remark 4.

During the training procedure, a tiny value $\epsilon$ can be added to (18) and (22) so that the singularity caused by $W_{k}=W_{k-1}$ or $b_{k}=b_{k-1}$ is easily avoided. Hence the gradients modified in the manner of (11) are listed below.