# Conjugate-gradient-based Adam for stochastic optimization and its application to deep learning

This paper proposes a conjugate-gradient-based Adam algorithm that blends Adam with nonlinear conjugate gradient methods and presents its convergence analysis. Numerical experiments on text classification and image classification show that the proposed algorithm can train deep neural network models in fewer epochs than the existing adaptive stochastic optimization algorithms can.


## 1 Introduction

Adaptive stochastic optimization algorithms based on stochastic gradients and exponential moving averages have a strong presence in the machine learning field. The algorithms are used to solve stochastic optimization problems; in particular, they play a key role in finding more suitable parameters for deep neural network (DNN) models by using empirical risk minimization (ERM) [1]. DNN models perform very well in many tasks, such as natural language processing (NLP), computer vision, and speech recognition. For instance, recurrent neural networks (RNNs) and their variant, long short-term memory (LSTM), are useful models that have shown excellent performance in NLP tasks. Moreover, convolutional neural networks (CNNs) and their variants such as the residual network (ResNet) [2] are widely used in the image recognition field [3]. However, these DNN models are complex and have many parameters to tune, so finding appropriate parameters for prediction is very hard. Therefore, it would be very useful to find optimization algorithms that minimize the loss function and yield better parameters.

In this paper, we focus on adaptive stochastic optimization algorithms based on stochastic gradients and exponential moving averages. The stochastic gradient descent (SGD) algorithm [4, 5, 6, 7], which uses a stochastic gradient as an approximation of the full gradient, is the cornerstone that underlies other modern stochastic optimization algorithms. Numerous variants of SGD have been proposed, in part because SGD is sensitive to an ill-conditioned objective function and to the step size (called the learning rate in machine learning). To deal with this problem, momentum SGD [8] and the Nesterov accelerated gradient method [9] leverage exponential moving averages of gradients. In addition, the adaptive methods AdaGrad [10] and RMSProp [11] take advantage of an efficient learning rate derived from element-wise squared stochastic gradients. In the deep learning community, Adam [12] is a popular method that uses exponential moving averages of stochastic gradients and of element-wise squared stochastic gradients. However, despite being a powerful optimization method, Adam does not converge to the minimizers of stochastic optimization problems in some cases. As a result, a variant, AMSGrad [13], was proposed to guarantee convergence to the optimal solution.

The nonlinear conjugate gradient (CG) method [14] is an elegant, efficient technique for deterministic unconstrained nonlinear optimization. Unlike the basic gradient descent methods, the CG method does not use plain gradients of the objective function as the search directions. Instead, it uses conjugate gradient directions, which are computed from not only the current gradient but also past gradients. The method requires little memory and has strong local and global convergence properties. Ways of generating conjugate gradient directions have been researched for a long time, and efficient formulae have been proposed, such as Hestenes-Stiefel (HS) [15], Fletcher-Reeves (FR) [16], Polak-Ribière-Polyak (PRP) [17, 18], Dai-Yuan (DY) [19], and Hager-Zhang (HZ) [20].

This paper is organized as follows. Section 2 gives the mathematical preliminaries. Section 3 presents the CoBA algorithm for solving the stochastic optimization problem and analyzes its convergence. Section 4 numerically compares the behaviors of the proposed algorithms with those of the existing ones. Section 5 concludes the paper with a brief summary.

## 2 Mathematical Preliminaries

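We use $\mathbb{R}^N$ to denote the $N$-dimensional Euclidean space, with the standard Euclidean inner product $\langle \cdot, \cdot \rangle$ and associated norm $\|\cdot\|$. Moreover, for all $i \in \{1, \ldots, N\}$, let $x_i$ be the $i$-th coordinate of $x \in \mathbb{R}^N$. Then, for all vectors $x \in \mathbb{R}^N$, $\|x\|_\infty$ denotes the $\ell_\infty$-norm, defined as $\|x\|_\infty := \max_{i} |x_i|$, $\tilde{x}$ denotes the element-wise square, and $\operatorname{diag}(x)$ indicates a diagonal matrix whose diagonal entries starting in the upper left corner are $x_1, \ldots, x_N$. Further, for any vectors $x, y \in \mathbb{R}^N$, we use $\max\{x, y\}$ to denote the element-wise maximum. For a matrix $A$ and a constant $p$, let $A^p$ be the element-wise $p$-th power of $A$.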

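We use $\mathcal{F} \subset \mathbb{R}^N$ to denote a nonempty, closed convex feasible set and say $\mathcal{F}$ has a bounded diameter $D_\infty$ if $\|x - y\|_\infty \le D_\infty$ for all $x, y \in \mathcal{F}$. Let $f$ denote a noisy objective function which is differentiable on $\mathcal{F}$ and $f_1, \ldots, f_T$ be the realizations of the stochastic noisy objective function at subsequent timesteps $t \in \mathcal{T} := \{1, \ldots, T\}$. For a positive-definite matrix $A$, the Mahalanobis norm is defined as $\|x\|_A := \sqrt{\langle x, Ax \rangle}$ and the projection onto $\mathcal{F}$ under the norm $\|\cdot\|_A$ is defined for all $y \in \mathbb{R}^N$ by

$$\{\Pi_{\mathcal{F},A}(y)\} := \operatorname*{argmin}_{x \in \mathcal{F}} \|x - y\|_A = \operatorname*{argmin}_{x \in \mathcal{F}} \sqrt{\langle x - y, A(x - y) \rangle}.$$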
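As an illustration of this projection, here is a minimal NumPy sketch for the special case of a diagonal matrix and a box-shaped feasible set. Both restrictions, and the function names `project_box_diag` and `mahalanobis_norm`, are assumptions made for the example rather than part of the paper. In that case the squared objective separates by coordinate, so the projection reduces to clipping.

```python
import numpy as np


def mahalanobis_norm(x, a_diag):
    """Return sqrt(<x, A x>) for the diagonal matrix A = diag(a_diag)."""
    x = np.asarray(x, dtype=float)
    a_diag = np.asarray(a_diag, dtype=float)
    return float(np.sqrt(np.sum(a_diag * x * x)))


def project_box_diag(y, a_diag, lower, upper):
    """Project y onto the box [lower, upper]^N under the norm induced by diag(a_diag).

    Assumption for this sketch: A is diagonal with positive entries.
    Then sum_i a_i * (x_i - y_i)^2 is minimized coordinate by coordinate,
    so the minimizer is plain clipping and does not depend on a_diag.
    """
    a_diag = np.asarray(a_diag, dtype=float)
    if np.any(a_diag <= 0):
        raise ValueError("a_diag must be strictly positive")
    return np.clip(np.asarray(y, dtype=float), lower, upper)
```

For a non-diagonal matrix or a general convex set, the projection no longer reduces to clipping and would need a constrained solver.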

Let the underlying noise be a random variable whose probability distribution is supported on a given set. Suppose that it is possible to generate independent, identically distributed (iid) realizations of this random variable. We use $g_t$ to denote a stochastic gradient of $f_t$ at $x_t$.

### 2.1 Adaptive stochastic optimization methods for stochastic optimization

Let us consider the stochastic optimization problem:

###### Problem 2.1.

Suppose that $\mathcal{F} \subset \mathbb{R}^N$ is nonempty, closed, and convex and $f_t \colon \mathbb{R}^N \to \mathbb{R}$ is convex and differentiable for all $t \in \mathcal{T}$. Then,

$$\text{minimize } \sum_{t \in \mathcal{T}} f_t(x) \quad \text{subject to } x \in \mathcal{F}. \tag{1}$$

The stochastic gradient descent (SGD) method [4, 5, 6, 7] is a basic method that uses the stochastic gradient for solving Problem 2.1, and it outperforms algorithms based on a batch gradient. The method generates the sequence $(x_t)_{t \in \mathcal{T}}$ by using the following update rule:

$$x_{t+1} := \Pi_{\mathcal{F}}(x_t - \alpha_t g_t), \tag{2}$$

where $\alpha_t > 0$ is the step size and $\Pi_{\mathcal{F}}$ is the projection onto the set $\mathcal{F}$ defined as $\{\Pi_{\mathcal{F}}(y)\} := \operatorname*{argmin}_{x \in \mathcal{F}} \|x - y\|$ ($y \in \mathbb{R}^N$). A diminishing step size $\alpha_t := \alpha / \sqrt{t}$ for a positive constant $\alpha$ is typically used for $(\alpha_t)_{t \in \mathcal{T}}$. Also, adaptive algorithms using an exponential moving average, which are variants of SGD, are useful for solving Problem 2.1. For instance, AdaGrad [10], RMSProp [11], and Adam [12] perform very well at minimizing the loss functions used in deep learning applications.
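The projected SGD update (2) with a diminishing step size can be sketched as follows. This is a minimal illustration, assuming a box-shaped feasible set (so that the Euclidean projection is coordinate-wise clipping) and a step size of the form alpha / sqrt(t); the function name `projected_sgd` and the callback signature `grad_fn(x, t)` are hypothetical.

```python
import numpy as np


def projected_sgd(grad_fn, x1, alpha, num_steps, lower, upper):
    """Run x_{t+1} = Proj_F(x_t - alpha_t * g_t) with alpha_t = alpha / sqrt(t).

    Assumption for this sketch: F is the box [lower, upper]^N, so the
    Euclidean projection is np.clip. grad_fn(x, t) returns a (stochastic)
    gradient of f_t at x.
    """
    x = np.clip(np.asarray(x1, dtype=float), lower, upper)
    for t in range(1, num_steps + 1):
        g = grad_fn(x, t)
        x = np.clip(x - (alpha / np.sqrt(t)) * g, lower, upper)
    return x


# Usage: minimize 0.5 * ||x - c||^2 over the box [-1, 1]^2.
c = np.array([2.0, -0.5])
x_final = projected_sgd(lambda x, t: x - c, np.zeros(2), 0.5, 200, -1.0, 1.0)
```

In this usage the gradient is noise-free for simplicity; a stochastic variant would return a gradient computed on a randomly drawn sample or mini-batch instead.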

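In this paper, we focus on Adam, which is one of the fastest methods at minimizing loss functions in deep learning. For all $t \in \mathcal{T}$, the algorithm updates the parameter $x_t$ by using the following update rule, given initial moment estimates $m_0$ and $v_0$, decay rates $\beta_1, \beta_2 \in [0, 1)$, a step size $\alpha > 0$, and a small constant $\epsilon > 0$:

$$
\begin{aligned}
m_t &:= \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t := \beta_2 v_{t-1} + (1 - \beta_2) \tilde{g}_t, \\
V_t &:= \operatorname{diag}(v_t), \quad d_t := \left[ \frac{m_{t,1}}{\sqrt{v_{t,1}} + \epsilon}, \ldots, \frac{m_{t,N}}{\sqrt{v_{t,N}} + \epsilon} \right]^\top, \\
x_{t+1} &:= \Pi_{\mathcal{F}, V_t^{1/2}}(x_t - \alpha d_t).
\end{aligned} \tag{3}
$$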
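The update rule (3) can be sketched in NumPy as follows. This is a minimal illustration, assuming the unconstrained case (so the projection is the identity) and reading the denominator in (3) as the square root of the second-moment estimate plus epsilon; the function name `adam_step` and the default hyperparameter values are assumptions, not settings taken from the paper.

```python
import numpy as np


def adam_step(x, m, v, g, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of update rule (3), without projection and without bias correction."""
    m = beta1 * m + (1.0 - beta1) * g        # moving average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g    # moving average of squared gradients
    d = m / (np.sqrt(v) + eps)               # element-wise scaled direction
    return x - alpha * d, m, v


# Usage: a single step from zero-initialized moment estimates.
x0 = np.array([1.0, -2.0])
g0 = np.array([0.5, -0.5])
x1, m1, v1 = adam_step(x0, np.zeros(2), np.zeros(2), g0)
```

Library implementations of Adam usually add bias-correction terms for the moment estimates; they are left out here because rule (3) does not include them.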

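Although Adam is an excellent choice for solving the stochastic optimization problem, it does not always converge, as shown in [13, Theorem 3]. Reference [13] presented a variant of Adam, called AMSGrad, which converges to a solution to Problem 2.1. The AMSGrad algorithm is as follows, given initial values $m_0$, $v_0$, and $\hat{v}_0$, decay rates $\beta_{1t}, \beta_2 \in [0, 1)$, step sizes $\alpha_t > 0$, and a small constant $\epsilon > 0$:

$$
\begin{aligned}
m_t &:= \beta_{1t} m_{t-1} + (1 - \beta_{1t}) g_t, \quad v_t := \beta_2 v_{t-1} + (1 - \beta_2) \tilde{g}_t, \quad \hat{v}_t := \max\{\hat{v}_{t-1}, v_t\}, \\
\hat{V}_t &:= \operatorname{diag}(\hat{v}_t), \quad d_t := \left[ \frac{m_{t,1}}{\sqrt{\hat{v}_{t,1}} + \epsilon}, \ldots, \frac{m_{t,N}}{\sqrt{\hat{v}_{t,N}} + \epsilon} \right]^\top, \\
x_{t+1} &:= \Pi_{\mathcal{F}, \hat{V}_t^{1/2}}(x_t - \alpha_t d_t).
\end{aligned} \tag{4}
$$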
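The only structural difference from Adam is the running maximum of the second-moment estimate, which the following NumPy sketch of rule (4) makes explicit. It assumes the unconstrained case (no projection); the function name `amsgrad_step` and the default values are hypothetical.

```python
import numpy as np


def amsgrad_step(x, m, v, v_hat, g, alpha_t, beta1_t=0.9, beta2=0.999, eps=1e-8):
    """One step of update rule (4), without projection."""
    m = beta1_t * m + (1.0 - beta1_t) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    v_hat = np.maximum(v_hat, v)             # element-wise maximum; never decreases
    d = m / (np.sqrt(v_hat) + eps)
    return x - alpha_t * d, m, v, v_hat


# Usage: a large gradient followed by a small one.
x = np.array([0.0])
m = v = v_hat = np.zeros(1)
x, m, v, v_hat_1 = amsgrad_step(x, m, v, v_hat, np.array([10.0]), alpha_t=1e-3)
x, m, v_2, v_hat_2 = amsgrad_step(x, m, v, v_hat_1, np.array([0.1]), alpha_t=1e-3)
```

After the second step the plain moving average `v_2` has decayed, while `v_hat_2` keeps the earlier, larger value, so the effective step size cannot grow again.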

### 2.2 Nonlinear conjugate gradient methods

Nonlinear conjugate gradient (CG) methods [14] are used for solving deterministic unconstrained nonlinear optimization problems, as formulated below:

###### Problem 2.2.

Suppose that $f \colon \mathbb{R}^N \to \mathbb{R}$ is continuously differentiable. Then,

$$\text{minimize } f(x) \quad \text{subject to } x \in \mathbb{R}^N. \tag{5}$$

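The nonlinear CG method in [14] for solving Problem 2.2 generates a sequence $(x_t)$ from an initial point by the following update rule:

$$x_{t+1} := x_t + \alpha_t d_t, \tag{6}$$

where $\alpha_t > 0$ is the step size. The search direction $d_t$ used in the update rule (6) is called the conjugate gradient direction and is defined as follows:

$$d_t := -g_t + \gamma_t d_{t-1}, \tag{7}$$

where $g_t := \nabla f(x_t)$ and the first search direction is the steepest descent direction. Here, we use $\gamma_t$ to denote the conjugate gradient update parameter, which can be computed from the gradient values $g_t$ and $g_{t-1}$. The parameter $\gamma_t$ has been researched for many years because its value has a large effect on how well the method minimizes the nonlinear objective function $f$. For instance, the following parameters proposed by Hestenes-Stiefel (HS) [15], Fletcher-Reeves (FR) [16], Polak-Ribière-Polyak (PRP) [17, 18], and Dai-Yuan (DY) [19] are widely used to solve Problem 2.2:

$$\gamma_t^{\mathrm{HS}} := \frac{\langle g_t, y_t \rangle}{\langle d_{t-1}, y_t \rangle}, \tag{8}$$

$$\gamma_t^{\mathrm{FR}} := \frac{\|g_t\|^2}{\|g_{t-1}\|^2}, \tag{9}$$

$$\gamma_t^{\mathrm{PRP}} := \frac{\langle g_t, y_t \rangle}{\|g_{t-1}\|^2}, \tag{10}$$

$$\gamma_t^{\mathrm{DY}} := \frac{\|g_t\|^2}{\langle d_{t-1}, y_t \rangle}, \tag{11}$$

where $y_t := g_t - g_{t-1}$.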
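The four formulas (8) to (11) share the same few inner products, which the following NumPy sketch computes once. The function name `cg_parameters` is hypothetical, and the sketch assumes nonzero denominators; a practical implementation would need to guard against division by zero.

```python
import numpy as np


def cg_parameters(g, g_prev, d_prev):
    """Return the HS, FR, PRP, and DY update parameters of (8)-(11).

    g, g_prev: current and previous gradients; d_prev: previous direction.
    Assumes <d_prev, y> and ||g_prev||^2 are nonzero.
    """
    y = g - g_prev                      # gradient difference y_t
    dy = np.dot(d_prev, y)              # <d_{t-1}, y_t>
    gg_prev = np.dot(g_prev, g_prev)    # ||g_{t-1}||^2
    return {
        "HS": np.dot(g, y) / dy,
        "FR": np.dot(g, g) / gg_prev,
        "PRP": np.dot(g, y) / gg_prev,
        "DY": np.dot(g, g) / dy,
    }


# Usage with small hand-checkable vectors.
params = cg_parameters(np.array([1.0, 1.0]), np.array([0.0, 2.0]), np.array([0.0, -2.0]))
```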

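In addition, the Hager-Zhang (HZ) parameter [20] is an improvement on $\gamma_t^{\mathrm{HS}}$ defined by (8) that works well on Problem 2.2. The parameter is computed as follows:

$$\gamma_t^{\mathrm{HZ}} := \frac{\langle g_t, y_t \rangle}{\langle d_{t-1}, y_t \rangle} - \lambda \frac{\|y_t\|^2}{\langle d_{t-1}, y_t \rangle^2} \langle g_t, d_{t-1} \rangle, \tag{12}$$

where $y_t := g_t - g_{t-1}$ and $\lambda$ is a positive constant.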
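Formula (12) is the HS parameter minus a correction term that vanishes when the new gradient is orthogonal to the previous direction. A minimal NumPy sketch follows; the function name `gamma_hz` is hypothetical, the default `lam=2.0` is an assumed value for illustration, and nonzero denominators are assumed.

```python
import numpy as np


def gamma_hz(g, g_prev, d_prev, lam=2.0):
    """Return the HZ update parameter of (12); lam plays the role of lambda."""
    y = g - g_prev
    dy = np.dot(d_prev, y)
    hs_term = np.dot(g, y) / dy
    correction = lam * np.dot(y, y) / dy**2 * np.dot(g, d_prev)
    return hs_term - correction


d_prev = np.array([0.0, -2.0])
g_prev = np.array([0.0, 2.0])
# Gradient orthogonal to d_prev: the correction vanishes and HZ equals HS.
hz_orthogonal = gamma_hz(np.array([1.0, 0.0]), g_prev, d_prev)
# Gradient not orthogonal to d_prev: the correction term is active.
hz_general = gamma_hz(np.array([1.0, 1.0]), g_prev, d_prev)
```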

## 3 Proposed algorithm

This section presents the conjugate-gradient-based Adam (CoBA) algorithm (listed as Algorithm 1). The way in which the parameters in steps 7–11 are computed is based on the update rule of AMSGrad (4). The existing algorithm computes a momentum parameter $m_t$ and an adaptive learning rate parameter $v_t$ by using the stochastic gradient $g_t$ computed in step 4 for all $t$. We replace the stochastic gradients used in AMSGrad with conjugate gradients and compute $m_t$ and $v_t$ with the conjugate gradient directions $d_t$ computed in steps 5–6 for all $t$. Here, the conjugate gradient update parameters $\gamma_t$ are calculated using each of (8)–(12) for all $t$.
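Algorithm 1 is not reproduced in this text, so the following NumPy sketch is only one plausible reading of the description above together with the bound used in Lemma A.1. It assumes that the conjugate gradient direction is the stochastic gradient minus (M / t^a) times the update parameter times the previous direction, that the step size is alpha / sqrt(t), and that the feasible set is the whole space. The names `coba_step` and `gamma_fr` and the default values of `beta1`, `beta2`, and `eps` are hypothetical.

```python
import numpy as np


def gamma_fr(g, g_prev, d_prev):
    """Fletcher-Reeves parameter (9); d_prev is unused but kept for a uniform signature."""
    return float(np.dot(g, g) / np.dot(g_prev, g_prev))


def coba_step(x, m, v, v_hat, d_prev, g, g_prev, t, gamma_fn, alpha, M, a,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad-style step driven by a conjugate gradient direction (assumed form)."""
    if g_prev is None:
        d = g.copy()                                   # first step: plain gradient
    else:
        gamma = gamma_fn(g, g_prev, d_prev)
        d = g - (M / t**a) * gamma * d_prev            # assumed direction rule
    m = beta1 * m + (1.0 - beta1) * d                  # moments use d instead of g
    v = beta2 * v + (1.0 - beta2) * d * d
    v_hat = np.maximum(v_hat, v)
    x_new = x - (alpha / np.sqrt(t)) * m / (np.sqrt(v_hat) + eps)
    return x_new, m, v, v_hat, d


# Usage: one step at t = 2 with hand-checkable inputs.
z = np.zeros(2)
_, _, _, _, d_new = coba_step(z, z, z, z, np.array([0.0, 2.0]), np.array([1.0, 1.0]),
                              np.array([0.0, 2.0]), 2, gamma_fr, alpha=1e-3, M=1.0, a=1.0)
```

With `M = 0` the direction equals the stochastic gradient and the step coincides with the AMSGrad-style update, which is consistent with the statement that CoBA replaces the gradients in AMSGrad with conjugate gradient directions.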

Furthermore, we give a convergence analysis of the proposed algorithm. The proof is given in Appendix A.

###### Theorem 3.1.

Suppose that , , and are the sequences generated by Algorithm 1 with , , , , and for all . Assume that is bounded, has a bounded diameter , and there exist such that and for some . Then, for any solution of Problem 2.1, the regret satisfies the following inequality:

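$$
\begin{aligned}
R(T) \le{} & \frac{D_\infty^2 \sqrt{T}}{\alpha (1 - \beta_1)} \sum_{i=1}^{N} \sqrt{\hat{v}_{T,i}} + \frac{D_\infty^2}{2 (1 - \beta_1)} \sum_{t=1}^{T} \frac{\beta_{1t}}{\alpha_t} \sum_{i=1}^{N} \sqrt{\hat{v}_{t,i}} \\
& + \frac{\alpha \sqrt{1 + \log T}}{(1 - \beta_1)^2 (1 - \mu) \sqrt{1 - \beta_2}} \sum_{i=1}^{N} \sqrt{\sum_{t=1}^{T} d_{t,i}^2} + D_\infty \bar{G}_\infty \sum_{t=1}^{T} \frac{|\gamma_t|}{t^a}.
\end{aligned}
$$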

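Theorem 3.1 indicates that Algorithm 1 has the desirable property of convergence of the average regret $R(T)/T$, whereas Adam does not guarantee convergence in that sense, as shown in [13, Theorem 3]. In addition, we can see that the properties of Algorithm 1 shown in Theorem 3.1 are theoretically almost the same as those of AMSGrad (4) (see [13, Theorem 4]):

$$
\begin{aligned}
R(T) \le{} & \frac{D_\infty^2 \sqrt{T}}{\alpha (1 - \beta_1)} \sum_{i=1}^{N} \sqrt{\hat{v}_{T,i}} + \frac{D_\infty^2}{2 (1 - \beta_1)} \sum_{t=1}^{T} \frac{\beta_{1t}}{\alpha_t} \sum_{i=1}^{N} \sqrt{\hat{v}_{t,i}} \\
& + \frac{\alpha \sqrt{1 + \log T}}{(1 - \beta_1)^2 (1 - \mu) \sqrt{1 - \beta_2}} \sum_{i=1}^{N} \sqrt{\sum_{t=1}^{T} g_{t,i}^2}.
\end{aligned}
$$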

## 4 Experiments

This section presents the results of experiments evaluating our algorithms and comparing them with the existing algorithms.

Our experiments were conducted on a fast scalar computation server at Meiji University. The environment has two Intel(R) Xeon(R) Gold 6148 (2.4 GHz, 20 cores) CPUs, an NVIDIA Tesla V100 (16GB, 900Gbps) GPU and a Red Hat Enterprise Linux 7.6 operating system. The experimental code was written in Python 3.6.9, and we used the NumPy 1.17.3 package and PyTorch 1.3.0 package.

### 4.1 Text classification

We used the proposed algorithms to train a long short-term memory (LSTM) network for text classification. LSTM is an artificial recurrent neural network (RNN) architecture used in deep learning for natural language processing, time-series analysis, etc.

This experiment used the IMDb dataset for text classification tasks. The dataset contains 50,000 movie reviews along with their associated binary sentiment polarity labels. The dataset is split into 25,000 training reviews and 25,000 test reviews.

We trained a multilayer neural network for solving the text classification problem on the IMDb dataset. We used an LSTM with an affine layer and a sigmoid function as the activation function for the output. To train it, we used the binary cross entropy (BCE) as the loss function minimized by the existing and proposed algorithms. The BCE loss $L$ is defined as follows:

$$L(y, z) := -\frac{1}{T} \sum_{t=1}^{T} \left\{ y_t \log z_t + (1 - y_t) \log (1 - z_t) \right\}, \tag{13}$$

where $y := (y_1, \ldots, y_T)$ with a binary class label $y_t \in \{0, 1\}$, meaning a positive or negative review, and $z := (z_1, \ldots, z_T)$ with the output $z_t \in [0, 1]$ of the neural network at each time step $t$.
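For concreteness, the loss (13) can be evaluated with a few lines of NumPy. This is a sketch rather than the code used in the experiments; the function name `binary_cross_entropy` and the clipping constant `eps`, added only to avoid taking the logarithm of zero, are assumptions.

```python
import numpy as np


def binary_cross_entropy(y, z, eps=1e-12):
    """Mean binary cross entropy of labels y in {0, 1} and outputs z in [0, 1]."""
    y = np.asarray(y, dtype=float)
    z = np.clip(np.asarray(z, dtype=float), eps, 1.0 - eps)  # keep log finite
    return float(-np.mean(y * np.log(z) + (1.0 - y) * np.log(1.0 - z)))


# Usage: an uninformative predictor (z = 0.5) has loss log(2) regardless of the labels.
loss_uninformative = binary_cross_entropy([1, 0], [0.5, 0.5])
```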

Let us numerically compare the performances of the proposed algorithms with Adam, AMSGrad, RMSProp, and AdaGrad. In this experiment, we used a random vector as the initial parameter and , for all , as the step size parameter of all algorithms. The previously reported results (see [21, 22]) on convex optimization algorithms empirically used and . We used the default values provided in as the hyper parameter settings of the optimization algorithms and set and in Adam, AMSGrad, and CoBA. We set , , and in CoBA.

The results of the experiment are reported in Figures 1–4. Figure 1 shows the behaviors of the algorithms for the loss function values defined by (13) with respect to the number of epochs, while Figure 2 shows those with respect to elapsed time [s]. Figure 3 presents the accuracy scores of the classification on the training data with respect to the number of epochs, whereas Figure 4 plots the accuracy score versus elapsed time [s]. We can see that the CoBA algorithms perform better than Adam, AdaGrad, and RMSProp in terms of both the training loss and accuracy score. In particular, Figures 1 and 2 show that CoBA with a suitable conjugate gradient update parameter reduces the loss function values in fewer epochs and shorter elapsed time than AMSGrad. Figures 3 and 4 indicate that CoBA with a suitable update parameter attains a given accuracy faster than AMSGrad.

### 4.2 Image classification

We performed numerical comparisons using Residual Network (ResNet) [2], a relatively deep model based on a convolutional neural network (CNN), on an image classification task. Rather than having only convolutional layers, ResNet has additional shortcut connections, e.g., identity mappings, between pairs of 3×3 filters. The architecture can relieve the degradation problem wherein accuracy saturates when a deeper neural network starts converging. As a result, ResNet is considered to be a practical architecture for image recognition on some datasets. In this experiment, we used the CIFAR-10 dataset [23], a benchmark for image classification. The dataset consists of 60,000 color images (32×32) in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. The test batch contained exactly 1,000 randomly selected images from each class.

We trained a 34-layer ResNet (ResNet-34) organized into a 7×7 convolutional layer, 32 convolutional layers which have 3×3 filters, and a 1,000-way fully connected layer with a softmax function. We used the cross entropy as the loss function for fitting ResNet in accordance with the common strategy in image classification. In the case of classification into $K$ classes, the cross entropy $L$ (torch.nn.CrossEntropyLoss in PyTorch) is defined as follows:

$$L(Y, Z) := -\frac{1}{T} \sum_{t=1}^{T} \sum_{k=1}^{K} y_{t,k} \log z_{t,k}, \tag{14}$$

where $Y := (y_{t,k})$ with the one-hot multi-class label $y_{t,k} \in \{0, 1\}$ and $Z := (z_{t,k})$ with the output $z_{t,k} \in [0, 1]$ of the neural network for all $t \in \{1, \ldots, T\}$ and $k \in \{1, \ldots, K\}$.
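The definition (14) translates directly into NumPy, as sketched below. The function name `cross_entropy` and the clipping constant `eps` are assumptions made for the example. Note that `torch.nn.CrossEntropyLoss` expects unnormalized logits as its input and applies the softmax internally, whereas this sketch takes probabilities that have already been normalized.

```python
import numpy as np


def cross_entropy(Y, Z, eps=1e-12):
    """Mean cross entropy of one-hot labels Y and predicted probabilities Z.

    Y and Z have shape (T, K): T samples and K classes.
    """
    Y = np.asarray(Y, dtype=float)
    Z = np.clip(np.asarray(Z, dtype=float), eps, 1.0)  # keep log finite
    return float(-np.mean(np.sum(Y * np.log(Z), axis=1)))


# Usage: uniform predictions over K = 3 classes give a loss of log(3).
Y = np.array([[1, 0, 0], [0, 0, 1]])
Z = np.full((2, 3), 1.0 / 3.0)
loss_uniform = cross_entropy(Y, Z)
```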

In this experiment, we used a random vector as the initial parameter and , for all , as the step size parameter [21, 22] of all the algorithms. As described in Subsection 4.1, we set the default values of Adam, AMSGrad, and CoBA to and . For each type of conjugate gradient update parameter , we set the coefficients and to values optimized by a grid search over a parameter grid consisting of and . We set in CoBA(HZ).

The results of the experiments are reported in Figures 5–8. Figure 5 plots the loss function values defined by (14) versus the number of epochs, while Figure 6 plots the loss function values versus elapsed time [s]. Figure 7 presents the accuracy score on the training dataset at every epoch, whereas Figure 8 plots the accuracy score versus elapsed time [s].

We can see that the CoBA algorithms perform better than Adam, AdaGrad, and RMSProp in terms of both the training loss and accuracy score. In particular, Figures 5 and 7 show that CoBA using $\gamma^{\mathrm{HS}}$, $\gamma^{\mathrm{FR}}$, $\gamma^{\mathrm{PRP}}$, $\gamma^{\mathrm{DY}}$, or $\gamma^{\mathrm{HZ}}$ reduces the loss function values and attains a given accuracy score in fewer epochs than AMSGrad. Figure 6 shows that CoBA and AMSGrad converge faster than the other algorithms. Although CoBA takes more time than AMSGrad does to update the parameters of ResNet, they theoretically take about the same amount of time for computing the conjugate gradient direction [24]. Figures 7 and 8 indicate that CoBA with a suitable update parameter attains a given accuracy faster than AMSGrad.

## 5 Conclusion and Future Work

We presented the conjugate-gradient-based Adam (CoBA) algorithm for solving stochastic optimization problems that minimize the empirical risk in fitting deep neural networks and showed its convergence. We numerically compared CoBA with existing learning methods in a text classification task using the IMDb dataset and an image classification task using the CIFAR-10 dataset. The results demonstrated its efficiency. In particular, compared with the existing methods, CoBA reduced the loss function value in fewer epochs on both datasets. In addition, its classification score reached a given accuracy in fewer epochs than the existing methods did.

In the future, we would like to improve the implementation of the proposed algorithms to enable computation of conjugate gradients in a theoretically reasonable time. In addition, we would like to design a more appropriate stochastic conjugate gradient direction and conjugate gradient update parameter, e.g., one whose expected value is equivalent to a deterministic conjugate gradient. Furthermore, we would like to find a suitable step size that permits the proposed algorithm to converge faster to the solution of the stochastic optimization problem.

## Appendix A Proof of Theorem 3.1

To begin with, we prove the following lemma on the boundedness of the conjugate gradient direction $d_t$.

###### Lemma A.1.

Suppose that is the sequence generated by Algorithm 1 with the parameter settings and conditions assumed in Theorem 3.1. Further, assume that is bounded and there exist such that and for some . Then, holds for all .

###### Proof.

We will use mathematical induction. The fact that ensures that there exists such that, for all ,

$$\frac{M}{t^a} |\gamma_t| \le \frac{1}{2}. \tag{15}$$

The definition of implies that for all . Suppose that for some . Then, from the definition of and the triangle inequality,

$$\|d_j\| \le \|g_j\| + \frac{M}{j^a} |\gamma_j| \|d_{j-1}\| \le G_\infty + \frac{M}{j^a} |\gamma_j| \bar{G}_\infty,$$

which, together with (15), implies that

$$\|d_j\| \le \frac{\bar{G}_\infty}{2} + \frac{\bar{G}_\infty}{2} = \bar{G}_\infty.$$

Accordingly, $\|d_t\| \le \bar{G}_\infty$ holds for all $t$. ∎

Next, we show the following lemma:

###### Lemma A.2.

For the parameter settings and conditions assumed in Theorem 3.1 and for any , we have

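$$\sum_{t=l+1}^{T} \alpha_t \sum_{i=1}^{N} \frac{m_{t-l,i}^2}{\sqrt{\hat{v}_{t,i}}} \le \frac{\alpha \sqrt{1 + \log T}}{(1 - \beta_1)(1 - \mu) \sqrt{1 - \beta_2}} \sum_{i=1}^{N} \sqrt{\sum_{t=1}^{T} d_{t,i}^2}.$$

###### Proof.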

Let be fixed arbitrarily. For all , we define . Then, we have

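$$
\begin{aligned}
\sum_{t=l+1}^{T} \alpha_t \sum_{i=1}^{N} \frac{m_{t-l,i}^2}{\sqrt{\hat{v}_{t,i}}}
&= \sum_{t=l+1}^{T-1} \alpha_t \sum_{i=1}^{N} \frac{m_{t-l,i}^2}{\sqrt{\hat{v}_{t,i}}} + \alpha_T \sum_{i=1}^{N} \frac{m_{T-l,i}^2}{\sqrt{\hat{v}_{T,i}}} \\
&\le \sum_{t=l+1}^{T-1} \alpha_t \sum_{i=1}^{N} \frac{m_{t-l,i}^2}{\sqrt{\hat{v}_{t,i}}} + \alpha_T \sum_{i=1}^{N} \frac{m_{T-l,i}^2}{\sqrt{v_{T,i}}} \\
&\le \sum_{t=l+1}^{T-1} \alpha_t \sum_{i=1}^{N} \frac{m_{t-l,i}^2}{\sqrt{\hat{v}_{t,i}}} + \frac{\alpha}{\sqrt{T}} \sum_{i=1}^{N} \frac{\left\{ \sum_{j=1}^{T-l} (1 - \beta_{1j}) \bar{\beta}_{1j} d_{j,i} \right\}^2}{\sqrt{(1 - \beta_2) \sum_{j=1}^{T} \beta_2^{T-j} d_{j,i}^2}} \\
&\le \sum_{t=l+1}^{T-1} \alpha_t \sum_{i=1}^{N} \frac{m_{t-l,i}^2}{\sqrt{\hat{v}_{t,i}}} + \frac{\alpha}{\sqrt{T (1 - \beta_2)}} \sum_{i=1}^{N} \frac{\left( \sum_{j=1}^{T-l} \bar{\beta}_{1j} d_{j,i} \right)^2}{\sqrt{\sum_{j=1}^{T} \beta_2^{T-j} d_{j,i}^2}},
\end{aligned}
$$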

where the second inequality comes from the definition of , which is , the third one follows from and the update rules of and in Algorithm 1 for , and the fourth one comes from for all . Here, from the Cauchy-Schwarz inequality and the fact that for all , we have

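$$
\begin{aligned}
\sum_{i=1}^{N} \frac{\left( \sum_{j=1}^{T-l} \bar{\beta}_{1j} d_{j,i} \right)^2}{\sqrt{\sum_{j=1}^{T} \beta_2^{T-j} d_{j,i}^2}}
&\le \sum_{i=1}^{N} \frac{\left( \sum_{j=1}^{T-l} \bar{\beta}_{1j} \right) \left( \sum_{j=1}^{T-l} \bar{\beta}_{1j} d_{j,i}^2 \right)}{\sqrt{\sum_{j=1}^{T} \beta_2^{T-j} d_{j,i}^2}} \\
&\le \frac{1}{1 - \beta_1} \sum_{i=1}^{N} \sum_{j=1}^{T} \frac{\beta_1^{T-j} d_{j,i}^2}{\sqrt{\beta_2^{T-j} d_{j,i}^2}} \\
&\le \frac{1}{1 - \beta_1} \sum_{i=1}^{N} \sum_{j=1}^{T} \mu^{T-j} |d_{j,i}|,
\end{aligned}
$$

where $\mu$ is defined by $\mu := \beta_1 / \sqrt{\beta_2}$. Hence, we have

$$\sum_{t=l+1}^{T} \alpha_t \sum_{i=1}^{N} \frac{m_{t-l,i}^2}{\sqrt{\hat{v}_{t,i}}} \le \sum_{t=l+1}^{T-1} \alpha_t \sum_{i=1}^{N} \frac{m_{t-l,i}^2}{\sqrt{\hat{v}_{t,i}}} + \frac{\alpha}{(1 - \beta_1) \sqrt{1 - \beta_2}} \sum_{i=1}^{N} \sum_{j=1}^{T} \frac{\mu^{T-j} |d_{j,i}|}{\sqrt{T}}.$$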

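Applying a similar discussion to each of the remaining terms $t \in \{l+1, \ldots, T-1\}$ ensures that

$$
\begin{aligned}
\sum_{t=l+1}^{T} \alpha_t \sum_{i=1}^{N} \frac{m_{t-l,i}^2}{\sqrt{\hat{v}_{t,i}}}
&\le \sum_{t=l+1}^{T} \frac{\alpha}{(1 - \beta_1) \sqrt{1 - \beta_2}} \sum_{i=1}^{N} \sum_{j=1}^{t} \frac{\mu^{t-j} |d_{j,i}|}{\sqrt{t}} \\
&= \frac{\alpha}{(1 - \beta_1) \sqrt{1 - \beta_2}} \sum_{i=1}^{N} \sum_{t=l+1}^{T} \sum_{j=1}^{t} \frac{\mu^{t-j} |d_{j,i}|}{\sqrt{t}} \\
&= \frac{\alpha}{(1 - \beta_1) \sqrt{1 - \beta_2}} \sum_{i=1}^{N} \sum_{t=l+1}^{T} |d_{t,i}| \sum_{j=t}^{T} \frac{\mu^{j-t}}{\sqrt{j}} \\
&\le \frac{\alpha}{(1 - \beta_1) \sqrt{1 - \beta_2}} \sum_{i=1}^{N} \sum_{t=l+1}^{T} |d_{t,i}| \sum_{j=t}^{T} \frac{\mu^{j-t}}{\sqrt{t}} \\
&\le \frac{\alpha}{(1 - \beta_1)(1 - \mu) \sqrt{1 - \beta_2}} \sum_{i=1}^{N} \sum_{t=l+1}^{T} \frac{|d_{t,i}|}{\sqrt{t}}.
\end{aligned}
$$

From the Cauchy-Schwarz inequality,

$$
\begin{aligned}
\sum_{t=l+1}^{T} \alpha_t \sum_{i=1}^{N} \frac{m_{t-l,i}^2}{\sqrt{\hat{v}_{t,i}}}
&\le \frac{\alpha}{(1 - \beta_1)(1 - \mu) \sqrt{1 - \beta_2}} \sum_{i=1}^{N} \sqrt{\sum_{t=l+1}^{T} d_{t,i}^2} \sqrt{\sum_{t=l+1}^{T} \frac{1}{t}} \\
&\le \frac{\alpha \sqrt{1 + \log T}}{(1 - \beta_1)(1 - \mu) \sqrt{1 - \beta_2}} \sum_{i=1}^{N} \sqrt{\sum_{t=1}^{T} d_{t,i}^2}.
\end{aligned}
$$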
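The last inequality rests on a standard bound for the harmonic sum, spelled out here for completeness (a routine calculus step, not an additional assumption of the paper):

$$\sum_{t=l+1}^{T} \frac{1}{t} \le \sum_{t=1}^{T} \frac{1}{t} \le 1 + \int_{1}^{T} \frac{\mathrm{d}s}{s} = 1 + \log T.$$

Similarly, the factor $1/(1 - \mu)$ in the preceding chain comes from the geometric series $\sum_{j=t}^{T} \mu^{j-t} \le \sum_{k=0}^{\infty} \mu^{k} = 1/(1 - \mu)$, which requires $0 \le \mu < 1$.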

This completes the proof. ∎

Finally, we prove Theorem 3.1.

###### Proof.

Let and be fixed arbitrarily. From the update rule of Algorithm 1, we have

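$$\{x_{t+1}\} = \left\{ \Pi_{\mathcal{F}, \hat{V}_t^{1/2}}(x_t - \alpha_t \hat{d}_t) \right\} = \operatorname*{argmin}_{x \in \mathcal{F}} \left\| x_t - \alpha_t \hat{d}_t - x \right\|_{\hat{V}_t^{1/2}}.$$

Here, the projection onto $\mathcal{F}$ under the norm induced by a positive-definite matrix is nonexpansive with respect to that norm [13, Lemma 4]. Since $x^\star \in \mathcal{F}$ is its own projection, we have

$$\|x_{t+1} - x^\star\|_{\hat{V}_t^{1/2}}^2 \le \left\| x_t - \alpha_t \hat{d}_t - x^\star \right\|_{\hat{V}_t^{1/2}}^2,$$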

which, together with () and the definitions of and , implies that

 ∥xt+1−x⋆∥2^V12t ≤ ∥xt−x⋆∥2^V12t+α2t∥∥^dt∥∥2^V12t −2αt⟨mt,xt−x⋆⟩ ≤ ∥xt−x⋆∥2^V12t+α2t∥∥^dt∥∥2^V12t −2αt⟨β1tmt−1+(1−β1t)dt,xt−x⋆⟩ ≤ ∥xt−x⋆∥2^V12t+α2t∥∥^dt∥∥2^V12t−2αtβ1t⟨m