Adaptive stochastic optimization algorithms based on stochastic gradients and exponential moving averages have a strong presence in the machine learning field. These algorithms are used to solve stochastic optimization problems; in particular, they play a key role in finding more suitable parameters for deep neural network (DNN) models by using empirical risk minimization (ERM). DNN models perform very well in many tasks, such as natural language processing (NLP), computer vision, and speech recognition. For instance, recurrent neural networks (RNNs) and their variant, long short-term memory (LSTM), have shown excellent performance in NLP tasks. Moreover, convolutional neural networks (CNNs) and their variants, such as the residual network (ResNet), are widely used in the image recognition field. However, these DNN models are complex and have many parameters to tune, and finding appropriate parameters for prediction is very hard. Therefore, it is very useful to look for optimization algorithms that minimize the loss function and find better parameters.
In this paper, we focus on adaptive stochastic optimization algorithms based on stochastic gradients and exponential moving averages. The stochastic gradient descent (SGD) algorithm [4, 5, 6, 7], which uses a stochastic gradient with a smart approximation method, is a cornerstone that underlies other modern stochastic optimization algorithms. Numerous variants of SGD have been proposed for many interesting situations, in part because it is sensitive to an ill-conditioned objective function or step size (called the learning rate in machine learning). To deal with this problem, momentum SGD and the Nesterov accelerated gradient method leverage exponential moving averages of gradients. In addition, the adaptive methods AdaGrad and RMSProp take advantage of an efficient learning rate derived from element-wise squared stochastic gradients. In the deep learning community, Adam is a popular method that uses exponential moving averages of stochastic gradients and of element-wise squared stochastic gradients. However, despite being a powerful optimization method, Adam does not converge to the minimizers of stochastic optimization problems in some cases. As a result, a variant, AMSGrad, was proposed to guarantee convergence to the optimal solution.
The nonlinear conjugate gradient (CG) method is an elegant, efficient technique for deterministic unconstrained nonlinear optimization. Unlike the basic gradient descent methods, the CG method does not use vanilla gradients of the objective function as the search directions. Instead, it uses conjugate gradient directions, which are computed from not only the current gradient but also past gradients. Interestingly, the method requires little memory and has strong local and global convergence properties. The way of generating conjugate gradient directions has been researched for a long time, and efficient formulae have been proposed, such as Hestenes-Stiefel (HS), Fletcher-Reeves (FR), Polak-Ribière-Polyak (PRP) [17, 18], Dai-Yuan (DY), and Hager-Zhang (HZ).
For the present study, we developed a stochastic optimization algorithm, which we refer to as conjugate-gradient-based Adam (CoBA, Algorithm 1), for determining more suitable parameters for DNN models. The algorithm proposed herein combines the CG method with the existing stochastic optimization algorithm, AMSGrad, which is based on Adam. Our analysis indicates that the theoretical performance of CoBA is comparable to that of AMSGrad (Theorem 3.1). In addition, we give several examples in which the proposed algorithm can be used to train DNN models in certain significant tasks. In concrete terms, we conducted numerical experiments on training an LSTM for text classification and a ResNet for image classification. The results demonstrate that, thanks to the benefits of conjugate gradients, CoBA performs better than the existing adaptive methods, such as AdaGrad, RMSProp, Adam, and AMSGrad, in the sense of minimizing the sum of loss functions.
This paper is organized as follows. Section 2 gives the mathematical preliminaries. Section 3 presents the CoBA algorithm for solving the stochastic optimization problem and analyzes its convergence. Section 4 numerically compares the behaviors of the proposed algorithms with those of the existing ones. Section 5 concludes the paper with a brief summary.
2 Mathematical Preliminaries
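We use the standard notation $\mathbb{R}^N$ for $N$-dimensional Euclidean space, with the standard Euclidean inner product $\langle \cdot, \cdot \rangle$ and associated norm $\|\cdot\|$. Moreover, for all $i \in \{1, 2, \ldots, N\}$, let $x_i$ be the $i$-th coordinate of $x \in \mathbb{R}^N$. Then, for all vectors $x \in \mathbb{R}^N$, $\|x\|_\infty$ denotes the $\ell_\infty$-norm, defined as $\|x\|_\infty := \max_{i} |x_i|$, $x^2$ denotes the element-wise square, and $\mathrm{diag}(x)$ indicates a diagonal matrix whose diagonal entries starting in the upper left corner are $x_1, x_2, \ldots, x_N$. Further, for any vectors $x, y \in \mathbb{R}^N$, we use $\max(x, y)$ to denote the element-wise maximum. For a matrix $A$ and a constant $p$, let $A^p$ be the element-wise $p$-th power of $A$.
We use $\mathcal{F} \subset \mathbb{R}^N$ to denote a nonempty, closed convex feasible set and say $\mathcal{F}$ has a bounded diameter $D_\infty$ if $\|x - y\|_\infty \le D_\infty$ for all $x, y \in \mathcal{F}$. Let $f$ denote a noisy objective function which is differentiable on $\mathcal{F}$ and $f_1, f_2, \ldots, f_T$ be the realization of the stochastic noisy objective function at subsequent timesteps $t \in \{1, 2, \ldots, T\}$. For a positive-definite matrix $A$, the Mahalanobis norm is defined as $\|x\|_A := \sqrt{\langle x, Ax \rangle}$ and the projection onto $\mathcal{F}$ under the norm $\|\cdot\|_A$ is defined for all $y \in \mathbb{R}^N$ by
$$\Pi_{\mathcal{F}, A}(y) := \operatorname*{argmin}_{x \in \mathcal{F}} \|x - y\|_A .$$
Let $\xi$ be a random number whose probability distribution $P$ is supported on a set $\Xi$. Suppose that it is possible to generate independent, identically distributed (iid) realizations $\xi_1, \xi_2, \ldots$ of $\xi$. We use $g(x, \xi)$ to denote a stochastic gradient of $f$ at $x$.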
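As a small illustration of the projection under the Mahalanobis norm, the following sketch assumes a diagonal positive-definite matrix and a box-shaped feasible set. Both are assumptions made only for this example, and the names `mahalanobis_norm` and `project_box` are hypothetical. In this special case the squared weighted distance is a sum of per-coordinate terms, so the projection reduces to coordinate-wise clipping.

```python
import numpy as np

def mahalanobis_norm(x, a_diag):
    """Return sqrt(<x, A x>) for A = diag(a_diag) with positive entries."""
    return float(np.sqrt(np.sum(a_diag * x * x)))

def project_box(y, lower, upper):
    """Project y onto the box [lower, upper]^N.

    Assumption: A is diagonal. The weighted squared distance then
    separates across coordinates, so clipping each coordinate is the
    minimizer for every positive diagonal weighting.
    """
    return np.clip(y, lower, upper)

a_diag = np.array([4.0, 0.25])   # diagonal of a positive-definite matrix
y = np.array([3.0, -0.5])        # point outside the box [-1, 1]^2
p = project_box(y, -1.0, 1.0)    # only the first coordinate is clipped
dist = mahalanobis_norm(p - y, a_diag)
```

For a general (non-diagonal) matrix or a non-box feasible set, the projection no longer has this closed form and would have to be computed by a constrained solver.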
2.1 Adaptive stochastic optimization methods for stochastic optimization
Let us consider the stochastic optimization problem:
Suppose that is nonempty, closed, and convex and is convex and differentiable for all . Then,
The stochastic gradient descent (SGD) method [4, 5, 6, 7] is a basic method that uses the stochastic gradient for solving Problem 2.1, and it outperforms algorithms based on a batch gradient. The method generates the sequence by using the following update rule:
where and is the projection onto the set as defined above. A diminishing step size for a positive constant is typically used for . Also, adaptive algorithms using an exponential moving average, which are variants of SGD, are useful for solving Problem 2.1. For instance, AdaGrad, RMSProp, and Adam perform very well at minimizing the loss functions used in deep learning applications.
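The projected SGD iteration can be sketched on a toy problem. This is a minimal sketch under stated assumptions: the step size is taken to be `alpha / sqrt(t)` (a common diminishing choice, assumed here), the objective is the expectation of `0.5 * (x - xi)**2` with `xi` drawn from a normal distribution with mean 2, and the feasible set is the interval [-1, 1]. The helper names are hypothetical.

```python
import numpy as np

def projected_sgd(grad, project, x0, alpha, num_steps, rng):
    """Run x <- project(x - alpha_t * stochastic_gradient(x))."""
    x = float(x0)
    for t in range(1, num_steps + 1):
        step = alpha / np.sqrt(t)          # assumed diminishing step size
        x = project(x - step * grad(x, rng))
    return x

def noisy_grad(x, rng):
    """Stochastic gradient of 0.5 * (x - xi)^2 for one sample xi ~ N(2, 1)."""
    xi = rng.normal(loc=2.0, scale=1.0)
    return x - xi

def clip_to_interval(z):
    """Euclidean projection onto the interval [-1, 1]."""
    return float(np.clip(z, -1.0, 1.0))

rng = np.random.default_rng(0)
x_final = projected_sgd(noisy_grad, clip_to_interval, x0=-1.0,
                        alpha=0.5, num_steps=2000, rng=rng)
```

The unconstrained minimizer is the mean 2, so the constrained minimizer is the boundary point 1, and the iterate settles close to it as the step size shrinks.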
In this paper, we focus on Adam, which is fastest at minimizing the loss function in deep learning. For all , the algorithm updates the parameter by using the following update rule: for , , , and ,
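A single Adam step can be written in a few lines. This sketch assumes the simplified form without bias correction and without the projection (an unconstrained problem); the defaults `beta1 = 0.9` and `beta2 = 0.999` are those of `torch.optim.Adam`, while `eps` and the function name `adam_step` are assumptions for illustration.

```python
import numpy as np

def adam_step(x, m, v, g, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update from the stochastic gradient g (no bias correction)."""
    m = beta1 * m + (1.0 - beta1) * g          # moving average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g      # moving average of squared gradients
    x = x - alpha * m / (np.sqrt(v) + eps)     # element-wise scaled step
    return x, m, v

x0 = np.array([1.0, -2.0])
g0 = x0.copy()                                 # gradient of 0.5 * ||x||^2 at x0
x1, m1, v1 = adam_step(x0, np.zeros(2), np.zeros(2), g0, alpha=0.01)
```

Because the step divides by the square root of the second-moment estimate, both coordinates move by the same magnitude in this first step even though their gradients differ by a factor of two.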
Although Adam is an excellent choice for solving the stochastic optimization problem, it does not always converge, as shown in [13, Theorem 3]. Reference [13] presented a variant of Adam, called AMSGrad, which converges to a solution to Problem 2.1. The AMSGrad algorithm is as follows: for , , , and ,
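The difference between AMSGrad and Adam is the running element-wise maximum of the second-moment estimate. The sketch below makes the same simplifying assumptions as before (no projection, hypothetical function name `amsgrad_step`, assumed `eps`) and feeds one large gradient followed by zero gradients to show that the maximum is retained while the plain moving average decays.

```python
import numpy as np

def amsgrad_step(x, m, v, v_hat, g, alpha, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AMSGrad update; v_hat keeps the largest v seen so far."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    v_hat = np.maximum(v_hat, v)               # the only change relative to Adam
    x = x - alpha * m / (np.sqrt(v_hat) + eps)
    return x, m, v, v_hat

x = np.zeros(1)
m = np.zeros(1)
v = np.zeros(1)
v_hat = np.zeros(1)
gradients = [np.array([10.0])] + [np.zeros(1)] * 100
for g in gradients:
    x, m, v, v_hat = amsgrad_step(x, m, v, v_hat, g, alpha=0.01)
```

After the loop, `v` has decayed below its peak while `v_hat` still holds the peak value, so the effective per-coordinate step size cannot grow again after a large gradient has been observed.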
2.2 Nonlinear conjugate gradient methods
Nonlinear conjugate gradient (CG) methods are used for solving deterministic unconstrained nonlinear optimization problems, as formulated below:
Suppose that is continuously differentiable. Then,
where . The search direction used in the update rule (6) is called the conjugate gradient direction and is defined as follows:
where and . Here, we use to denote the conjugate gradient update parameter, which can be computed from the gradient values and . The parameter has been researched for many years because its value has a large effect on the nonlinear objective function . For instance, the following parameters proposed by Hestenes-Stiefel (HS), Fletcher-Reeves (FR), Polak-Ribière-Polyak (PRP) [17, 18], and Dai-Yuan (DY) are widely used to solve Problem 2.2:
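The four update parameters can be sketched in their standard textbook forms. The index conventions and the names `cg_beta` and `cg_quadratic` are assumptions for this example, and the exact line search is available only because the test objective is a strictly convex quadratic.

```python
import numpy as np

def cg_beta(g_new, g_old, d_old, rule):
    """Standard conjugate gradient update parameters."""
    y = g_new - g_old
    if rule == "HS":
        return float(g_new @ y) / float(d_old @ y)
    if rule == "FR":
        return float(g_new @ g_new) / float(g_old @ g_old)
    if rule == "PRP":
        return float(g_new @ y) / float(g_old @ g_old)
    if rule == "DY":
        return float(g_new @ g_new) / float(d_old @ y)
    raise ValueError(f"unknown rule: {rule}")

def cg_quadratic(A, b, x0, rule, num_steps):
    """Minimize 0.5 * x^T A x - b^T x with CG and exact line search."""
    x = x0.astype(float)
    g = A @ x - b
    d = -g                                     # first direction: steepest descent
    for _ in range(num_steps):
        if np.linalg.norm(g) < 1e-12:
            break
        step = -float(g @ d) / float(d @ A @ d)    # exact minimizer along d
        x = x + step * d
        g_new = A @ x - b
        if np.linalg.norm(g_new) < 1e-12:
            break
        d = -g_new + cg_beta(g_new, g, d, rule) * d
        g = g_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x0 = np.array([2.0, 1.0])
solutions = {rule: cg_quadratic(A, b, x0, rule, num_steps=2)
             for rule in ("HS", "FR", "PRP", "DY")}
```

On a quadratic with exact line search the four parameters coincide, so every rule reaches the minimizer of this two-dimensional problem in two steps; they differ only on general nonlinear objectives or with inexact step sizes.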
3 Proposed algorithm
This section presents the conjugate-gradient-based Adam (CoBA) algorithm, listed as Algorithm 1. The parameters in steps 7–11 are computed according to the update rule of AMSGrad (4). The existing algorithm computes a momentum parameter and an adaptive learning rate parameter by using the stochastic gradient computed in step 4 for all . We replace the stochastic gradients used in AMSGrad with conjugate gradients and compute with the conjugate gradients computed in steps 5–6 for all . Here, the conjugate gradient update parameters are calculated using each of (8)–(12) for all .
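The idea of feeding a conjugate-gradient-type direction into the AMSGrad moments can be sketched as follows. This is not Algorithm 1 itself but an illustration under several assumptions: the Fletcher-Reeves parameter is used, the conjugate term is clipped and damped by a hypothetical factor `scale / t**decay`, the step size is `alpha / sqrt(t)`, the projection is omitted, and the name `coba_like_step` is invented for this example.

```python
import numpy as np

def coba_like_step(state, g, t, alpha, beta1=0.9, beta2=0.999,
                   scale=0.1, decay=1.0, eps=1e-8):
    """One CoBA-style step: AMSGrad moments built from a CG-type direction."""
    g_old, d_old = state["g"], state["d"]
    denom = float(g_old @ g_old)
    # Fletcher-Reeves parameter, guarded against a zero denominator.
    gamma = float(g @ g) / denom if denom > 0.0 else 0.0
    # Assumption: clip and damp the conjugate term (values are illustrative).
    gamma = min(gamma, 1.0) * scale / t ** decay
    d = g + gamma * d_old                      # direction that replaces g
    m = beta1 * state["m"] + (1.0 - beta1) * d
    v = beta2 * state["v"] + (1.0 - beta2) * d * d
    v_hat = np.maximum(state["v_hat"], v)
    x = state["x"] - alpha / np.sqrt(t) * m / (np.sqrt(v_hat) + eps)
    return {"x": x, "g": g, "d": d, "m": m, "v": v, "v_hat": v_hat}

target = np.array([1.0, -2.0])

def loss(x):
    """Toy convex objective 0.5 * ||x - target||^2."""
    return 0.5 * float(np.sum((x - target) ** 2))

state = {key: np.zeros(2) for key in ("x", "g", "d", "m", "v", "v_hat")}
initial_loss = loss(state["x"])
for t in range(1, 501):
    grad = state["x"] - target                 # exact gradient of the toy loss
    state = coba_like_step(state, grad, t, alpha=0.1)
final_loss = loss(state["x"])
```

With the conjugate term set to zero (`scale=0.0`) the step reduces to the AMSGrad sketch, which makes the two methods easy to compare on the same problem.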
Furthermore, we give a convergence analysis of the proposed algorithm. The proof is given in Appendix A.
Theorem 3.1 indicates that Algorithm 1 has the nice property of convergence of the average regret , whereas Adam does not guarantee convergence in that sense, as shown in [13, Theorem 3]. In addition, we can see that the properties of Algorithm 1 shown in Theorem 3.1 are theoretically almost the same as those of AMSGrad (4) (see [13, Theorem 4]):
4 Experiments
This section presents the results of experiments evaluating our algorithms and comparing them with the existing algorithms.
Our experiments were conducted on a fast scalar computation server (https://www.meiji.ac.jp/isys/hpc/ia.html) at Meiji University. The environment has two Intel(R) Xeon(R) Gold 6148 (2.4 GHz, 20 cores) CPUs, an NVIDIA Tesla V100 (16 GB, 900 GB/s) GPU, and a Red Hat Enterprise Linux 7.6 operating system. The experimental code was written in Python 3.6.9, and we used the NumPy 1.17.3 and PyTorch 1.3.0 packages.
4.1 Text classification
We used the proposed algorithms to train a long short-term memory (LSTM) network for text classification. The LSTM is an artificial recurrent neural network (RNN) architecture used in the field of deep learning for natural language processing, time-series analysis, etc.
This experiment used the IMDb dataset (https://datasets.imdbws.com/) for text classification tasks. The dataset contains 50,000 movie reviews along with their associated binary sentiment polarity labels. The dataset is split into 25,000 training and 25,000 test examples.
We trained a multilayer neural network for solving the text classification problem on the IMDb dataset. We used an LSTM with an affine layer and a sigmoid function as an activation function for the output. For training it, we used the binary cross entropy (BCE) as a loss function minimized by the existing and proposed algorithms. The BCE loss is defined as follows:
where with a binary class label , meaning a positive or negative review, and with the output of the neural network at each time step .
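The BCE loss can be sketched in its standard form. This example assumes averaging over the samples (the default reduction of `torch.nn.BCELoss`) and clips the outputs away from 0 and 1 for numerical safety; the function name `bce_loss` and the clipping constant are assumptions.

```python
import numpy as np

def bce_loss(labels, outputs, eps=1e-12):
    """Mean binary cross entropy between 0/1 labels and sigmoid outputs."""
    y = np.asarray(labels, dtype=float)
    z = np.clip(np.asarray(outputs, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(y * np.log(z) + (1.0 - y) * np.log(1.0 - z)))

uninformed = bce_loss([1, 0], [0.5, 0.5])   # outputs carry no information
confident = bce_loss([1, 0], [0.9, 0.1])    # outputs agree with the labels
```

An output of 0.5 for every review gives a loss of log 2, which is a useful reference level when reading training curves.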
Let us numerically compare the performances of the proposed algorithms with those of Adam, AMSGrad, RMSProp, and AdaGrad. In this experiment, we used a random vector as the initial parameter and , for all , as the step size parameter of all algorithms. The previously reported results (see [21, 22]) on convex optimization algorithms empirically used and . We used the default values provided in torch.optim (https://pytorch.org/docs/stable/optim.html) as the hyperparameter settings of the optimization algorithms and set and in Adam, AMSGrad, and CoBA. We set , , and in CoBA.
The results of the experiment are reported in Figures 1–4. Figure 1 shows the behaviors of the algorithms for the loss function values defined by (13) with respect to the number of epochs, while Figure 2 shows those with respect to elapsed time [s]. Figure 3 presents the accuracy scores of the classification on the training data with respect to the number of epochs, whereas Figure 4 plots the accuracy score versus elapsed time [s]. We can see that the CoBA algorithms perform better than Adam, AdaGrad, and RMSProp in terms of both the training loss and accuracy score. In particular, Figures 1 and 2 show that CoBA using reduces the loss function values in fewer epochs and shorter elapsed time than AMSGrad. Figures 3 and 4 indicate that CoBA using reaches accuracy faster than AMSGrad.
4.2 Image classification
We performed numerical comparisons using the Residual Network (ResNet), a relatively deep model based on a convolutional neural network (CNN), on an image classification task. Rather than having only convolutional layers, ResNet has additional shortcut connections, e.g., identity mappings, between pairs of 3×3 filters. The architecture can relieve the degradation problem wherein accuracy saturates when a deeper neural network starts converging. As a result, ResNet is considered to be a practical architecture for image recognition on some datasets. In this experiment, we used the CIFAR-10 dataset, a benchmark for image classification. The dataset consists of 60,000 color images (32×32) in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. The test batch contains exactly 1,000 randomly selected images from each class.
We trained a 34-layer ResNet (ResNet-34) organized into a 7×7 convolutional layer, 32 convolutional layers with 3×3 filters, and a 1,000-way fully connected layer with a softmax function. We used the cross entropy as the loss function for fitting ResNet in accordance with the common strategy in image classification. In the case of classification to the -class, the cross entropy torch.nn.CrossEntropyLoss (https://pytorch.org/docs/stable/nn.html) is defined as follows:
where with the one-hot multi-class label and with the output of the neural network for all and .
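The multi-class cross entropy can be sketched in the same way. As with `torch.nn.CrossEntropyLoss`, this example takes raw scores (logits), applies a log-softmax, and averages over the samples; the one-hot label format follows the text, while the function name `cross_entropy` is an assumption.

```python
import numpy as np

def cross_entropy(logits, one_hot):
    """Mean cross entropy between one-hot labels and softmax(logits)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(one_hot, dtype=float)
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.sum(labels * log_probs, axis=1)))

num_classes = 10
labels = np.eye(num_classes)[[3, 7]]            # two samples, classes 3 and 7
uniform_loss = cross_entropy(np.zeros((2, num_classes)), labels)
sharp_loss = cross_entropy(5.0 * labels, labels)  # logits favor the true class
```

Equal scores for all ten classes give a loss of log 10, the value a CIFAR-10 classifier starts near before training.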
In this experiment, we used a random vector as the initial parameter and , for all , as the step size parameter [21, 22] of all the algorithms. As described in Subsection 4.1, we set the default values of Adam, AMSGrad, and CoBA to and . For each type of conjugate gradient update parameter , we set the coefficients and to values optimized by a grid search over a parameter grid consisting of and . We set in CoBA(HZ).
The results of the experiments are reported in Figures 5–8. Figure 5 plots the loss function values defined by (14) versus the number of epochs, while Figure 6 plots the loss function values versus elapsed time [s]. Figure 7 presents the accuracy score on the training dataset for every epoch, whereas Figure 8 plots the accuracy score versus elapsed time [s].
We can see that the CoBA algorithms perform better than Adam, AdaGrad, and RMSProp in terms of both the training loss and accuracy score. In particular, Figures 5 and 7 show that CoBA using , , , , or reduces the loss function values and reaches an accuracy score of in fewer epochs than AMSGrad. Figure 6 shows that CoBA and AMSGrad converge faster than the other algorithms. Although CoBA takes more time than AMSGrad does to update the parameters of ResNet, they theoretically take about the same amount of time for computing the conjugate gradient direction . Figures 7 and 8 indicate that CoBA using reaches accuracy faster than AMSGrad.
5 Conclusion and Future Work
We presented the conjugate-gradient-based Adam (CoBA) algorithm for solving stochastic optimization problems that minimize the empirical risk in fitting deep neural networks and showed its convergence. We numerically compared CoBA with existing learning methods in a text classification task using the IMDb dataset and an image classification task using the CIFAR-10 dataset. The results demonstrated its optimality and efficiency. In particular, compared with the existing methods, CoBA reduced the loss function value in fewer epochs on both datasets. In addition, its classification score reached a accuracy in fewer epochs compared with the existing methods.
In the future, we would like to improve the implementation of the proposed algorithms to enable computation of conjugate gradients in a theoretically reasonable time. In addition, we would like to design a more appropriate stochastic conjugate gradient direction and conjugate gradient update parameter, e.g., one in which the expected value is equivalent to a deterministic conjugate gradient. Furthermore, we would like to find a way to choose a suitable step size which permits the proposed algorithm to converge faster to the solution to the stochastic optimization problem.
Appendix A Proof of Theorem 3.1
To begin with, we prove the following lemma with the boundedness of the conjugate gradient direction .
We will use mathematical induction. The fact that ensures that there exists such that, for all ,
The definition of implies that for all . Suppose that for some . Then, from the definition of and the triangle inequality,
which, together with (15), implies that
Accordingly, holds for all . ∎
Next, we show the following lemma:
For the parameter settings and conditions assumed in Theorem 3.1 and for any , we have
Let be fixed arbitrarily. For all , we define . Then, we have
where the second inequality comes from the definition of , which is , the third one follows from and the update rules of and in Algorithm 1 for , and the fourth one comes from for all . Here, from the Cauchy-Schwarz inequality and the fact that for all , we have
where is defined by . Hence, we have
A discussion similar to the one for all ensures that
From the Cauchy-Schwarz inequality,
This completes the proof. ∎
Finally, we prove Theorem 3.1.