1 Introduction
The optimization of functions involving large datasets and high-dimensional models is today widely applicable across data-driven fields in science and industry. Given the growing role of deep learning, in this paper we look at optimization problems arising in the training of neural networks. The training of these models can be cast as the minimization or maximization of an objective function with respect to the model parameters. Because of the complexity and computational requirements of the objective function, the data and the models, the common practice is to resort to iterative training procedures, such as gradient descent. Among the iterative methods, stochastic gradient descent (SGD) emerged as one of the most effective and computationally efficient. SGD owes its performance gains to the adoption of an approximate version of the objective function at each iteration step, which, in turn, yields an approximate or noisy gradient.
While SGD seems to benefit greatly (e.g., in terms of rate of convergence) from such approximation, it has also been shown that too much noise hurts the performance [83, 6]. This suggests that, to further improve over SGD, one could attempt to model the noise of the objective function. We consider the iteration-time-varying loss function used in SGD as a stochastic process obtained by adding the expected risk to zero-mean Gaussian noise. A powerful approach designed to handle estimation with such processes is Kalman filtering. The idea of using Kalman filtering to train neural networks is not new. However, the way to apply it to address this task can vary vastly. Indeed, in our approach, which we call KaFiStO, we introduce a number of novel ideas that result in a practical and effective training algorithm. Firstly, we introduce drastic approximations of the estimated covariance of Kalman's dynamical state, so that the corresponding matrix is described by only a small number of parameters. Secondly, we approximate intermediate Kalman filtering calculations so that more accuracy can be achieved. Thirdly, because of the way we model the objective function, we can also define a schedule for the optimization that behaves similarly to learning rate schedules used in SGD and other iterative methods.
We highlight the following contributions: 1) KaFiStO is designed to handle high-dimensional data and models, and large datasets; 2) The tuning of the algorithm is automated, but it is also possible to introduce a learning rate schedule similar to those in existing methods, albeit with a very different interpretation; 3) KaFiStO adapts automatically to the noise in the loss, which might vary depending on the settings of the training (e.g., the minibatch size), and to the variation in the estimated weights over iteration time; 4) It can incorporate iteration-time dynamics of the model parameters, which are analogous to momentum; 5) It is a framework that can be easily extended (we show a few variations of KaFiStO); 6) As shown in our experiments, KaFiStO is on par with state-of-the-art optimizers and can yield better minima in a number of problems, ranging from image classification to generative adversarial networks (GANs) and natural language processing (NLP).
2 Prior Work
In this section, we review optimization methods that have found application in machine learning and, in particular, in large-scale problems. Most of the progress in the last decades has aimed at improving the efficiency and accuracy of optimization algorithms.
First-order methods exploit only the gradient of the objective function. The main advantage of these methods lies in their speed and simplicity. Robbins and Monro introduced the very first stochastic optimization method (SGD) in 1951. Since then, the SGD method has been thoroughly analyzed and extended [42, 67, 32, 73]. Some works considered restarting techniques for optimization purposes [46, 82]. However, a limitation of SGD is that the learning rate must be manually defined, and the approximations in the computation of the gradient hurt the performance.
Second-order methods. To address the manual tuning of the learning rates in first-order methods and to improve the convergence rate, second-order methods rely on the Hessian matrix. However, this matrix very quickly becomes unmanageable, as it grows quadratically with the number of model parameters. Thus, most works reduce the computational complexity by approximating the Hessian with a block-diagonal matrix [20, 5, 39]. A number of methods looked at combining second-order information in different ways. For example, Roux and Fitzgibbon combined Newton's method and natural gradient. Sohl-Dickstein combined SGD with the second-order curvature information leveraged by quasi-Newton methods. Yao dynamically incorporated the curvature of the loss via adaptive estimates of the Hessian. Henriques proposed a method that does not even require storing the Hessian at all. In contrast with these methods, KaFiStO does not compute second-order derivatives, but focuses instead on modeling the noise in the objective function.
Adaptive methods. An alternative to using second-order derivatives is to design methods that automatically adjust the step size during the optimization process. The adaptive selection of the update step size has been based on several principles, including: the local sharpness of the loss function, line search approaches [80, 53, 49], the speed of change of the gradient, the Barzilai-Borwein method, a "belief" in the current gradient direction, the linearization of the loss, the per-component unweighted mean of all historical gradients, noise handling via preconditioning based on a covariance matrix, adaptive and momental bounds, the decorrelation of the second moment and gradient terms, importance weights, layer-wise adaptation strategies, gradient scale invariance, multiple learning rates, the control of the increase in effective learning rate, learning the update step size, and looking ahead at the sequence of fast weights generated by another optimizer. Among the widely adopted methods is the work of Duchi, who presented a new family of sub-gradient methods called AdaGrad, which dynamically incorporates knowledge of the geometry of the data observed in earlier iterations. Tieleman introduced RMSProp, which was further extended by Mukkamala with logarithmic regret bounds for strongly convex functions. Zeiler proposed a per-dimension learning rate method for gradient descent called AdaDelta. Kingma and Ba introduced Adam, which is based on adaptive estimates of lower-order moments. A wide range of variations and extensions of the original Adam optimizer has also been proposed [44, 60, 28, 47, 78, 86, 14, 72, 8, 33, 43, 48, 84, 85, 41]. Recent work proposed to decouple the weight decay [22, 17]. Chen introduced a partially adaptive momentum estimation method. Some recent works also focused on the role of gradient clipping [97, 96]. Another line of research focused on reducing the memory overhead of adaptive algorithms [1, 69, 56]. In most prior work, adaptivity comes from the introduction of extra hyper-parameters that also require task-specific tuning. In our case, this property is a direct byproduct of the Kalman filtering framework.
Kalman filtering. The use of Kalman filtering theory and methods for the training of neural networks is not new. Haykin edited a book collecting a wide range of techniques on this topic. More recently, Shashua incorporated Kalman filtering for value approximation in reinforcement learning. Ollivier recovered the exact extended Kalman filter equations from first principles in statistical learning: the extended Kalman filter is equal to Amari's online natural gradient, applied in the space of trajectories of the system. Vilmarest applied the extended Kalman filter to linear and logistic regressions. Takenga compared gradient descent to methods based on either Kalman filtering or the decoupled Kalman filter. To summarize, all of these prior Kalman filtering approaches either focus on a specific, non-general formulation or face difficulties when scaling to the high-dimensional parameter spaces of large-scale neural models.
3 Modeling Noise in Stochastic Optimization
In machine learning, we are interested in minimizing the expected risk

$L(\omega) = \mathbb{E}_x[\ell(x, \omega)] \quad (1)$

with respect to some loss $\ell$ that is a function of both the data $x \in \mathbb{R}^d$, with $d$ the data dimensionality, and the model parameters $\omega \in \mathbb{R}^n$ (e.g., the weights of a neural network), where $n$ is the number of parameters in the model. We consider the case, which is of common interest today, where both $d$ and $n$ are very large (e.g., in the case of image classification we stack in $x$ both the input image and the output label). In practice, we have access to only a finite set of samples and thus resolve to optimize the empirical risk

$\hat{L}(\omega) = \frac{1}{N} \sum_{i=1}^{N} \ell(x_i, \omega) \quad (2)$

where $x_i$, for $i = 1, \dots, N$, are our training dataset samples. Because of the nonconvex nature of the loss function with respect to the model parameters, this risk is then optimized via an iterative method such as gradient descent.
Since $N$ can be very large in current datasets, the computation of the gradient of the empirical risk at each iteration is too demanding. To address this issue, the stochastic gradient descent (SGD) method minimizes instead the following risk approximation

$L_t(\omega) = \frac{1}{|m_t|} \sum_{i \in m_t} \ell(x_i, \omega) \quad (3)$

where $m_t$ is a sampled subset of the dataset indices that changes over iteration time $t$. SGD then iteratively builds a sequence of parameters $\omega_t$ by recursively updating the parameters with a step in the opposite direction of the gradient of $L_t$, i.e., with some random initialization for $\omega_0$ and for $t \ge 0$

$\omega_{t+1} = \omega_t - \eta \nabla L_t(\omega_t) \quad (4)$

where $\nabla L_t(\omega_t)$ denotes the gradient of $L_t$ with respect to $\omega$ computed at $\omega_t$, and $\eta > 0$ is commonly referred to as the learning rate and regulates the speed of convergence.
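As an illustration, the SGD recursion above can be sketched as a minimal loop on a toy quadratic objective; the objective, the learning rate, and the noise level are illustrative choices, not taken from the paper:

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """One SGD step: move against the (noisy) minibatch gradient."""
    return theta - lr * grad

# Toy example: minimize f(theta) = 0.5 * ||theta||^2, whose exact gradient
# is theta; the additive noise mimics the minibatch sampling.
rng = np.random.default_rng(0)
theta = np.ones(5)
for _ in range(200):
    noisy_grad = theta + 0.01 * rng.normal(size=theta.shape)
    theta = sgd_step(theta, noisy_grad, lr=0.1)
# theta is now close to the minimizer at the origin
```

Despite the noise, the iterates contract toward the minimizer because the step size is small relative to the curvature.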
While this approach is highly efficient, it is also affected by the sampling of the training set at each iteration. The gradients are computed on the time-varying objectives $L_t$ and can be seen as noisy versions of the gradient of the expected risk $L$. Due to the aleatoric nature of this optimization, it is necessary to apply a learning rate decay to achieve convergence. There are also several methods to reduce the noise in the gradients, which work on a dynamic sample size or a gradient aggregation strategy.
Rather than directly modeling the noise of the gradient, we adopt a different perspective on the minimization of the expected risk. Let us denote with $\omega^*$ the optimal set of model parameters and with $L^* = L(\omega^*)$ the expected risk at $\omega^*$. Then, we model

$L_t(\omega) = L^* + v_t, \qquad \omega \sim \mathcal{N}(\omega^*, P) \quad (5)$

where $\omega$ is a Gaussian random variable with the optimal parameters $\omega^*$ as the mean and covariance $P$, and we associate both the stochasticity of the sampling and of $\omega$ to the scalar noise variable $v_t$, which we assume to be zero-mean Gaussian with variance $R$. With eq. (5) we implicitly assume that $v_t$ and $\omega$ are statistically dependent, since $L^*$ is constant. Often, we know the value of $L^*$ up to some approximation. In the next sections we additionally show that it is possible to obtain an online estimate of the noise variance $R$.
The task now can be posed as that of identifying the parameters $\omega^*$ such that the observations (5) are satisfied. A natural way to tackle the identification of parameters given their noisy observations is to use Kalman filtering. As discussed in the prior work, there is an extensive literature on the application of Kalman filtering as a stochastic gradient descent algorithm. However, these methods differ from our approach in several ways. For instance, Vuckovic uses the gradients as measurements. Thus, this method requires large matrix inversions, which do not scale to the settings we consider in this paper and that are commonly used in deep learning. As we describe in the next section, we work instead directly with the scalar risks and introduce a number of computational approximations that make training with large datasets and high-dimensional data feasible with our method.
3.1 Kalman Filtering for Stochastic Optimization
We assumed that $\omega$ is a random variable capturing the optimum up to some zero-mean Gaussian error (which represents our uncertainty about the parameters). Then, the values of the time-varying loss at samples of $\omega$ will be scattered close to $L^*$ (see eq. (5)). Thus, a possible system of equations for a sequence of samples $\omega_t$ is

$\omega_t = \omega_{t-1} + u_t \quad (6)$
$L_t(\omega_t) = L^* + v_t \quad (7)$

Here, $u_t$ is modeled as a zero-mean Gaussian variable with covariance $Q$, and the dynamical model (6) implies that the state does not change on average. In the Kalman filtering terminology, $\omega_t$ is also called the hidden state, $L_t(\omega_t)$ are the observations, and eq. (6) and eq. (7) describe the state transition and the measurement dynamics respectively. The extended Kalman filter infers optimal estimates of the state variables from their previous estimates and the last observation. Moreover, it also estimates the a posteriori covariance matrix of the state. This is done in two steps, Predict and Update, which we recall in Table 1.
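The Predict and Update steps of Table 1 can be sketched as follows for a scalar measurement; the function and variable names are ours, and the Jacobians are passed in explicitly:

```python
import numpy as np

def ekf_step(x, P, z, f, F, h, H_row, Q, R):
    """One extended Kalman filter iteration with a scalar measurement z.

    x, P   : previous state estimate and posterior covariance
    f, F   : state-transition function and its Jacobian matrix
    h      : measurement function (returns a scalar)
    H_row  : measurement Jacobian at a given state, shape (1, n)
    Q, R   : state noise covariance and measurement noise variance
    """
    # Predict
    x_pred = f(x)
    P_pred = F @ P @ F.T + Q
    # Update: with a scalar measurement the innovation covariance S is a scalar
    H = H_row(x_pred).reshape(1, -1)
    S = float(H @ P_pred @ H.T) + R
    K = (P_pred @ H.T) / S                       # Kalman gain, shape (n, 1)
    x_new = x_pred + (K * (z - h(x_pred))).ravel()
    P_new = P_pred - K @ H @ P_pred
    return x_new, P_new

# Toy usage: estimate a constant 2-d state observed only through its sum.
n = 2
x, P = np.zeros(n), np.eye(n)
f = lambda x: x                    # identity dynamics, as in eq. (6)
F = np.eye(n)
h = lambda x: float(x.sum())       # scalar measurement
H_row = lambda x: np.ones((1, n))  # its Jacobian
for _ in range(50):
    x, P = ekf_step(x, P, 3.0, f, F, h, H_row, Q=1e-4 * np.eye(n), R=1e-2)
```

After the loop, the filtered state reproduces the observed sum, while the posterior covariance remains large along the unobserved direction.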
If we directly applied the equations in Table 1 to our equations (6) and (7), we would immediately find that the posterior covariance is an $n \times n$ matrix, which would be too large to store and update for the values of $n$ used in practice. Hence, we approximate the posterior covariance as a scaled identity matrix. Since the update equation for the posterior covariance involves an intermediate $n \times n$ matrix, we need to approximate this matrix with a scaled identity as well. We do this by using its largest eigenvalue $\lambda_{\max}$, i.e., we replace it with $\lambda_{\max} I$, where $I$ denotes the identity matrix. Because we work with a scalar loss $L_t$, the innovation covariance is a scalar and thus it can be easily inverted. We call this first parameter estimation method the Vanilla Kalman algorithm and summarize it in Algorithm 1.
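A minimal sketch of a Vanilla-Kalman-style step with a scalar posterior covariance; the scalar contraction of the covariance and all constants are our illustrative choices, and only loosely follow the paper's exact approximations:

```python
import numpy as np

def vanilla_kalman_step(w, sigma2, loss, grad, target, Q=1e-4, R=1e-2):
    """One optimization step treating the scalar loss as the measurement.

    w      : parameter vector (the Kalman state)
    sigma2 : scalar such that the posterior covariance is sigma2 * I
    loss   : minibatch loss at w (the observation)
    grad   : gradient of the minibatch loss at w (measurement Jacobian)
    target : assumed risk at the optimum (e.g. 0)
    """
    sigma2_pred = sigma2 + Q                      # predict with identity dynamics
    g2 = float(grad @ grad)
    S = sigma2_pred * g2 + R                      # scalar innovation covariance
    K = sigma2_pred * grad / S                    # Kalman gain
    w_new = w + K * (target - loss)               # move toward the target risk
    sigma2_new = sigma2_pred * R / S              # scalar covariance contraction
    return w_new, sigma2_new

# Toy usage on f(w) = 0.5 * ||w||^2 with target risk 0.
w, sigma2 = np.ones(10), 0.1
for _ in range(300):
    loss, grad = 0.5 * float(w @ w), w
    w, sigma2 = vanilla_kalman_step(w, sigma2, loss, grad, target=0.0)
```

Note that the effective step size $K (L^* - L_t)$ is set automatically by the ratio between the state uncertainty and the innovation covariance, with no hand-tuned learning rate.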
3.2 Incorporating Momentum Dynamics
The framework introduced so far, which we call KaFiStO (as a shorthand for Kalman Filtering for Stochastic Optimization), is very flexible and allows several extensions. A first important change we introduce is the incorporation of momentum. Within our notation, this method can be written as

$v_{t+1} = \mu v_t - \eta \nabla L_t(\omega_t)$
$\omega_{t+1} = \omega_t + v_{t+1}$

where $v_t$ are the so-called momenta or velocities, which accumulate the gradients from the past. The parameter $\mu$, commonly referred to as the momentum rate, controls the trade-off between current and past gradients. Such updates are claimed to stabilize the training and to prevent the parameters from getting stuck at local minima.
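For reference, the classical heavy-ball momentum update can be sketched as follows (this is the standard method, not KaFiStO; the toy objective and rates are illustrative):

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """Classical momentum: the velocity v accumulates past gradients."""
    v_new = mu * v - lr * grad
    return w + v_new, v_new

# Toy usage on f(w) = 0.5 * ||w||^2 (gradient = w).
w, v = np.ones(3), np.zeros(3)
for _ in range(300):
    w, v = momentum_step(w, v, grad=w)
```

On this quadratic the coupled (w, v) iteration contracts at every step, so the parameters spiral into the minimizer at the origin.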
To incorporate the idea of momentum within the KaFiStO framework, one can simply introduce the state velocities and define the following dynamics

$v_t = \gamma v_{t-1} + u_t$
$\omega_t = \omega_{t-1} + \gamma v_{t-1}$

where $\gamma \in (0, 1]$ is a decay rate and $u_t$ is a zero-centered Gaussian random variable.
One can rewrite these equations again as Kalman filter equations by combining the parameters and the velocities into one state vector $x_t = [\omega_t, v_t]^\top$, and similarly for the state noise. This results in a linear dynamical system $x_t = A x_{t-1} + u_t$ with the same measurement equation as before, where $A$ denotes the corresponding block transition matrix. Similarly to the Vanilla Kalman algorithm, we also aim to drastically reduce the dimensionality of the posterior covariance, which is now a $2n \times 2n$ matrix. We approximate it with the block form $\begin{bmatrix} s_1 I & s_2 I \\ s_2 I & s_3 I \end{bmatrix}$, where $s_1$, $s_2$, $s_3$ are scalars. In this formulation all the quantities involved in the update remain either scalars or scaled identity blocks, and our approximation for the Kalman update of the posterior covariance preserves this block structure.
The remaining equations follow directly from the application of Table 1. We call this method the Kalman Dynamics algorithm.
3.3 Estimation of the Measurement and State Noise
In the KaFiStO framework we model the noise in the observations and the state transitions with zero-mean Gaussian variables with covariances $R$ and $Q$ respectively. So far, we assumed that these covariances were constant. However, they can also be estimated online, leading to more accurate state and posterior covariance estimates. For $R$ we use an exponential running average with a fixed averaging coefficient. Similarly, for the covariance $Q$, the online update for each of its components follows the same running-average form. This adaptivity of the noise helps both to reduce the number of hyper-parameters and to stabilize the training and the convergence.
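The online noise estimates can be sketched as exponential running averages; the statistics being averaged (squared innovation for $R$, mean squared parameter change for $Q$) and the 0.9 coefficient are our assumptions, not values from the paper:

```python
import numpy as np

class NoiseEstimator:
    """Running-average estimates of measurement noise R and state noise Q."""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.R = 1.0   # measurement noise variance
        self.Q = 1.0   # (scalar) state noise variance

    def update_R(self, innovation):
        """Use the squared innovation as a one-sample estimate of R."""
        self.R = self.beta * self.R + (1.0 - self.beta) * innovation ** 2
        return self.R

    def update_Q(self, delta_w):
        """Use the mean squared parameter change as a one-sample estimate of Q."""
        self.Q = self.beta * self.Q + (1.0 - self.beta) * float(np.mean(delta_w ** 2))
        return self.Q

# Constant innovations of magnitude 2 drive the estimate of R toward 2**2 = 4.
est = NoiseEstimator()
for _ in range(200):
    est.update_R(2.0)
```

The exponential average forgets its initialization geometrically, which is why no careful initial value for the noise variances is needed.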
3.4 Learning Rate Scheduling
In both the Vanilla Kalman and the Kalman Dynamics algorithms, the update equation for the state estimate needs the target risk $L^*$ (see, e.g., eq. (13)). In many cases this term can be set to $0$, when we believe that this value can be achieved (e.g., in some image classification problems). We also have the option to change $L^*$ progressively with the iteration time $t$. For instance, we could set $L^*_t = (1 - \alpha_t) L_t(\omega_{t-1})$ for some $\alpha_t \in [0, 1]$. By substituting this term in eq. (13) we obtain a learning rate that is $\alpha_t$ times the learning rate with $L^* = 0$. By varying $\alpha_t$ over the iteration time, we thus can define a learning rate schedule as in current SGD implementations [19, 46]. Notice, however, the very different interpretation of the schedule in the case of KaFiStO, where we are gradually decreasing the target expected risk.
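The mechanism can be sketched as follows: setting the target a fraction of the way from the current loss toward zero scales the innovation, and hence the effective step, by that fraction (the notation alpha_t is ours):

```python
def scheduled_target(loss, alpha_t):
    """Target risk between the current loss (alpha_t = 0, no progress
    demanded) and the believed optimum 0 (alpha_t = 1)."""
    return (1.0 - alpha_t) * loss

# The innovation (target - loss) that drives the Kalman update scales
# linearly with alpha_t, which mimics a learning-rate schedule.
loss = 2.0
innovations = [scheduled_target(loss, a) - loss for a in (1.0, 0.5, 0.1)]
```

Decaying alpha_t over the iterations therefore shrinks the steps exactly like a decaying learning rate would, while the quantity being annealed is the target expected risk.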
3.5 Layer-wise Approximations
Let us consider the optimization problem specifically for large neural networks, and let us denote with $D$ the number of layers in a network. Next, we consider the observations to be $D$-dimensional vectors, where the $l$-th entry of the observation vector is obtained by considering that only the parameters of the $l$-th layer are varying. Under these assumptions, the update equation (13) for both the Vanilla Kalman and the Kalman Dynamics algorithms splits into $D$ layer-wise equations, where each separate equation incorporates only the gradients with respect to the parameters of a specific layer. Additionally, the covariance matrix now also yields $D$ separate blocks (one per observation), each of which gets approximated by its largest eigenvalue; the maximum of these approximations then gives the approximation of the whole matrix. Here, each block is restricted to $\omega^{(l)}$, the subset of parameters corresponding to the $l$-th layer, with an innovation covariance $S^{(l)}$ corresponding to only the $l$-th measurement. We observe that this procedure induces additional stability in training.
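A sketch of the layer-wise treatment of the innovation covariance; per-layer gradients stand in for the per-layer measurement Jacobians, and all symbols and values are our own illustrative choices:

```python
import numpy as np

def layerwise_innovation(layer_grads, sigma2, R):
    """Per-layer scalar innovation covariances and their maximum.

    layer_grads : list of gradient vectors, one per layer
    sigma2      : scalar posterior covariance approximation
    R           : measurement noise variance
    """
    S = [sigma2 * float(g @ g) + R for g in layer_grads]
    return S, max(S)   # the maximum approximates the whole matrix

# Toy usage: two layers with different gradient magnitudes.
S_layers, S_max = layerwise_innovation(
    [np.ones(2), 2.0 * np.ones(2)], sigma2=1.0, R=0.0)
```

Each per-layer scalar only involves the gradient of that layer, so the update decomposes into independent, cheap layer-wise steps.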
4 Ablations
In this section we ablate the following features and parameters of both the Vanilla Kalman and Kalman Dynamics algorithms: the dynamics of the weights and velocities, the initialization of the posterior covariance matrix, and the adaptivity of the measurement and state noise estimators. In some ablations we also separately test the Kalman Dynamics algorithm with adaptive noise estimates, since they usually give a large boost to performance. Furthermore, we show that our algorithm is relatively insensitive to different batch sizes and weight initialization techniques.
We evaluate our optimization methods by computing the test performance achieved by the model obtained with the estimated parameters. Although such performance may not uniquely correlate with the performance of our method, as it might also be affected by the data, the model and the regularization, it is a useful indicator. In all the ablations, we choose the classification task on CIFAR-100 with a ResNet18, train all the models for a fixed number of epochs, and decrease the learning rate by a constant factor at regular epoch intervals.
For the last two ablations and in the Experiments section, we use the Kalman Dynamics algorithm with adaptive noise estimates and fixed initial posterior covariance parameters. We refer to this configuration as KaFiStO and have no need to tune it further.
Impact of the state dynamics. We compare the Vanilla Kalman algorithm (i.e., constant dynamics) to the Kalman Dynamics algorithm (i.e., with velocities). Additionally, we ablate $\gamma$, the decay rate of the velocities. The results are shown in Table 2. We observe that the use of velocities with a calibrated momentum rate has a positive impact on the estimated parameters, and that the adaptive state noise estimation provides a further substantial gain.
| KaFiStO Variant | γ | Top-1 Error | Top-5 Error |
| --- | --- | --- | --- |
| Dynamics (adapt.) | 0.50 | 28.25 | 8.91 |
| Dynamics (adapt.) | 0.90 | 23.39 | 6.50 |
| Dynamics (adapt.) | 0.99 | 33.63 | 9.82 |
Posterior covariance initialization. The KaFiStO framework requires initializing the posterior covariance matrix. In the case of the Vanilla Kalman algorithm, we approximate the posterior covariance with a scaled identity matrix $\sigma_0^2 I$, where $\sigma_0^2$ is a scalar. In the case of Kalman Dynamics, we approximate it with a block-diagonal matrix and initialize each block with its own scalar. In this section we ablate these initialization scalars to show that the method quickly adapts to the observations, and that the initialization of the posterior covariance does not have a significant impact on the final accuracy achieved with the estimated parameters. The results are given in Table 3 and in Figure 1.
| Parameter | Value | KaFiStO Variant | Top-1 Error | Top-5 Error |
| --- | --- | --- | --- | --- |
|  | 0.01 | Dynamics (adapt.) | 23.67 | 6.81 |
|  | 0.10 | Dynamics (adapt.) | 23.39 | 6.50 |
|  | 1.00 | Dynamics (adapt.) | 23.82 | 6.53 |
|  | 0.01 | Dynamics (adapt.) | 23.37 | 7.13 |
|  | 0.10 | Dynamics (adapt.) | 23.39 | 6.50 |
|  | 1.00 | Dynamics (adapt.) | 24.24 | 7.40 |
Noise adaptivity. We compare the performance obtained with a fixed measurement variance to the one obtained with an online estimate based on the current minibatch. Similarly, we ablate the adaptivity of the process noise $Q$. The results are shown in Table 4.
| KaFiStO Variant | Top-1 Error | Top-5 Error |
We observe that the adaptivity of the measurement noise is essential for the model to converge, and that the adaptivity of the state noise helps to further improve the performance of the trained model. Moreover, with adaptive noise there is no need to set initial values for the noise variances, which reduces the number of hyper-parameters to tune.
Batch size. Usually one needs to adapt the learning rate to the chosen minibatch size. In this experiment, we vary the batch size over a wide range and show that KaFiStO adapts to it naturally. Table 5 shows that the accuracy of the model does not vary significantly with the batch size, which is a sign of stability.
| Batch Size | Top-1 Error | Top-5 Error |
Weight initialization. Similarly to the batch size experiment, here we use different initialization techniques to show that the algorithm is robust to them. We apply the same initializations to SGD for comparison. We test the Kaiming Uniform, Orthogonal, Xavier Normal and Xavier Uniform initializations. The results are shown in Table 6.
| Initialization | Optimizer | Top-1 Error | Top-5 Error |
5 Experiments
In order to assess the efficiency of KaFiStO, we evaluate it on different tasks, including image classification (on CIFAR-10, CIFAR-100 and ImageNet), generative learning and language modeling. For all these tasks, we report the quality metrics on the validation sets to compare KaFiStO to the optimizers commonly used in the training of existing models. We find that KaFiStO outperforms or is on par with the existing methods, while requiring fewer hyper-parameters to tune.
5.1 Image Classification
CIFAR-10/100. We first evaluate KaFiStO on CIFAR-10 and CIFAR-100, training the popular ResNets and WideResNets. We compare our results with those obtained with commonly used optimization algorithms, such as SGD with momentum and Adam. For SGD we set the momentum rate to the value that is the default for many popular networks, and for Adam we use the default parameters. In all experiments on the CIFAR datasets, we use the same batch size and basic data augmentation (random horizontal flipping and random cropping with padding). For each configuration we run a shorter and a longer training schedule, each optimizer starting from its own initial learning rate. For the shorter run we decrease the learning rate by a constant factor at regular epoch intervals; for the longer run on CIFAR-10 we decrease the learning rate only once by the same factor, and for the longer run on CIFAR-100 the learning rate is decreased three times. For all the algorithms, we additionally use weight decay.
To show the benefit of using KaFiStO for training on classification tasks, we report the Top-1 and Top-5 errors on the validation set. For both configurations, we report the mean error among runs with different random seeds. The results are reported in Table 7. Figure 2 shows the behavior of the training loss, the validation loss and the Top-1 error on the validation set, as well as the adaptive evolution of KaFiStO's "learning rate", i.e., the step size that scales the gradient in the update eq. (13).
ImageNet. Following prior work, we train a ResNet50 on downscaled images with the most common settings: a fixed number of epochs with a step learning rate decay and weight decay. We use random cropping and random horizontal flipping during training, and we report the validation accuracy on single center-crop images. As shown in Table 7, our model achieves an accuracy comparable to SGD, but without any task-specific hyper-parameter tuning.
5.2 Generative Adversarial Networks Training
Generative Adversarial Networks (GANs) are generative models trained to generate new samples from a given data distribution. A GAN consists of two networks, a generator and a discriminator, which are trained in an adversarial manner. The training alternates between the two networks in a mini-max game, which tends to be difficult. Algorithms like SGD struggle to find a good solution, and a common practice for training GANs is to use adaptive methods like Adam or RMSProp. Thus, a good performance on the training of GANs is a good indicator of stability and of the ability to handle complex loss functions.
Following prior work, we test our method with one of the most popular models, the Wasserstein GAN with gradient penalty (WGAN-GP). The objectives for both the generator and the discriminator in WGAN-GP are unbounded from below, which makes it difficult to apply our model directly. Indeed, our algorithm works under the assumption that the expected risk at the optimum is some given finite value. However, we can control the measurement equations in KaFiStO by adjusting the target $L^*$, as was done to obtain learning rate schedules. The simplest way to deal with unbounded losses is to set the target below the current estimate of the loss. That is, for a given minibatch the target should be equal to the current loss minus some constant margin, which we fix before the training. We also set the remaining settings similarly to a common choice for Adam in GAN training.
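The moving target for unbounded losses can be sketched as follows; the margin value is illustrative, not the one used in the paper:

```python
def moving_target(current_loss, delta=1.0):
    """For losses unbounded from below (e.g. WGAN-GP critics), chase a
    target a fixed margin delta below the current loss estimate."""
    return current_loss - delta

# The target tracks the loss downward as training progresses.
targets = [moving_target(l) for l in (5.0, 3.0, -2.0)]
```

Because the target is always a fixed margin below the current loss, the filter keeps demanding a finite amount of progress per step even though the loss has no lower bound.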
We measure the performance with the Fréchet Inception Distance (FID), which captures the quality and diversity of the generated samples. We report the mean and standard deviation among runs with different random seeds. Usually GANs are trained in an alternating fashion, that is, the generator is updated once every few discriminator iterations. This should make the generator compete with a stronger discriminator and achieve convergence. We test KaFiStO on two such settings. In both settings KaFiStO outperforms Adam in terms of FID score and is more stable. The results are shown in Table 8, and images sampled from the trained generator are reported in the Supplementary Material.
5.3 Language modeling
We train LSTM models for language modeling on the Penn TreeBank and Wikitext-2 datasets. We use the default data splits for training and validation and report the perplexity (lower is better) on the test set in Table 9. We used a two-layer LSTM for our tiny-LSTM experiment (the same settings as Bernstein) and increased the hidden and embedding sizes for the larger-LSTM experiment (the same as Zhang). The learning rates for Adam and SGD were picked based on a grid search, and we used the default learning rate for our optimizer. In order to prevent overfitting, we used an aggressive dropout rate and tied the input and output embeddings, which is a common practice in NLP. Since we are using small datasets, we use only a two-layer masked multi-head self-attention transformer with two heads, which performs worse than the LSTM. We find that even in these settings, KaFiStO is on par with the other optimizers.
| Model | Optimizer | PTB ppl | WikiText-2 ppl |
| --- | --- | --- | --- |
|  | Dynamics (adapt.) | 110.1 | 124.69 |
|  | Dynamics (adapt.) | 81.57 | 89.64 |
|  | Dynamics (adapt.) | 129.45 | 179.81 |
6 Conclusions
We have introduced KaFiStO, a novel Kalman filtering-based approach to stochastic optimization. KaFiStO is suitable for training modern neural network models on current large-scale datasets with high-dimensional data. The method can self-tune and is quite robust to a wide range of training settings. Moreover, we design KaFiStO so that it can incorporate optimization dynamics such as those in Momentum and Adam, as well as learning rate schedules. The efficacy of the method is demonstrated in several experiments on image classification, image generation and language processing.
- (2019) Memory efficient adaptive optimization. In Advances in Neural Information Processing Systems.
- (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning.
- (2020) On the distance between two neural networks and the stability of learning. In Advances in Neural Information Processing Systems.
- (2000) Gradient convergence in gradient methods with errors. SIAM Journal on Optimization.
- (2017) Practical Gauss-Newton optimisation for deep learning. In ICML.
- (2018) Optimization methods for large-scale machine learning. SIAM Review.
- (2020) Closing the generalization gap of adaptive gradient methods in training deep neural networks. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20.
- (2019) On the convergence of a class of Adam-type algorithms for non-convex optimization. In International Conference on Learning Representations.
- (2020) ELECTRA: pre-training text encoders as discriminators rather than generators. arXiv:2003.10555.
- (2021) Expectigrad: fast stochastic optimization with robust convergence properties.
- (2020) Stochastic online optimization using Kalman recursion.
- (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota.
- (2019) An adaptive and momental bound method for stochastic learning. arXiv:1910.12249.
- Incorporating Nesterov momentum into Adam.
- DiffGrad: an optimization method for convolutional neural networks. IEEE Transactions on Neural Networks and Learning Systems.
- (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research.
- (2020) Stochastic gradient methods with layer-wise adaptive moments for training of deep networks.
- (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
- (1977) On convergence rates of subgradient optimization methods. Mathematical Programming 13, pp. 329–347.
- (2020) Practical quasi-Newton methods for training deep neural networks. In Advances in Neural Information Processing Systems.
- (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems.
- (2021) Beyond SGD: iterate averaged adaptive gradient method.
- (2017) Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems.
- (2004) Kalman filtering and neural networks. John Wiley & Sons.
- (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- (2019) Small steps and giant leaps: minimal Newton solvers for deep learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
- (2021) AdamP: slowing down the slowdown for momentum optimizers on scale-invariant weights. In International Conference on Learning Representations (ICLR).
-  (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, External Links: Cited by: §5.2.
-  (1997) Long short-term memory. Neural Computation. Cited by: §5.3.
-  (2018) Universal language model fine-tuning for text classification. In ACL, Cited by: §5.3.
-  (2020) Biased stochastic first-order methods for conditional stochastic optimization and applications in meta learning. In Advances in Neural Information Processing Systems, External Links: Cited by: §2.
-  (2019-07) Nostalgic adam: weighting more of the past gradients when designing the adaptive learning rate. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, External Links: Cited by: §2.
-  (2017) Adaptive learning rate via covariance matrix based preconditioning for deep neural networks. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, External Links: Cited by: §2.
-  (2001) Kalman filtering and neural networks. Adaptive and learning systems for signal processing, communications, and control, Wiley, New York. External Links: Cited by: §2.
-  (1960) A new approach to linear filtering and prediction problems. Transactions of the ASME – Journal of Basic Engineering (Series D). Cited by: §1, §3.1.
-  (2015) Adam: a method for stochastic optimization. In ICLR (Poster), External Links: Cited by: §1, §2.
-  (2009) Learning multiple layers of features from tiny images. Cited by: §4.
-  (2012) Efficient backprop. In Neural Networks: Tricks of the Trade: Second Edition, External Links: Cited by: §2.
-  (2018) Online adaptive methods, universality and acceleration. In Advances in Neural Information Processing Systems, External Links: Cited by: §2.
-  (2020) AdaX: adaptive gradient descent with exponential long term memory. External Links: Cited by: §2.
-  (2020) PAGE: a simple and optimal probabilistic gradient estimator for nonconvex optimization. External Links: Cited by: §2.
-  (2020) On the variance of the adaptive learning rate and beyond. In International Conference on Learning Representations, External Links: Cited by: §2.
-  (2020) Adam with bandit sampling for deep learning. In Advances in Neural Information Processing Systems, External Links: Cited by: §2.
-  (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv abs/1907.11692. Cited by: §5.3.
-  (2017) SGDR: stochastic gradient descent with warm restarts. In International Conference on Learning Representations (ICLR) 2017 Conference Track. Cited by: §2, §3.4.
-  (2019) Decoupled weight decay regularization. In International Conference on Learning Representations. Cited by: §2, §5.1.
-  (2019) Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations. Cited by: §2.
-  (2015) Probabilistic line searches for stochastic optimization. In Advances in Neural Information Processing Systems. Cited by: §2.
-  (1993) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics. Cited by: §5.3.
-  (2017) Pointer sentinel mixture models. arXiv abs/1609.07843. Cited by: §5.3.
-  (2017) Variants of RMSProp and Adagrad with logarithmic regret bounds. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Sydney, Australia. Cited by: §2.
-  (2020) Parabolic approximation line search for DNNs. In Advances in Neural Information Processing Systems. Cited by: §2.
-  (2019) The extended Kalman filter is a natural gradient descent in trajectory space. Cited by: §2.
-  (2015) Scale-free algorithms for online linear optimization. In Algorithmic Learning Theory, Cham. Cited by: §2.
-  (2020) The role of memory in stochastic optimization. In Proceedings of the 35th Uncertainty in Artificial Intelligence Conference, Proceedings of Machine Learning Research, Tel Aviv, Israel. Cited by: §2.
-  (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana. Cited by: §5.3.
-  (2017) Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 157–163. Cited by: §5.3.
-  (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §5.2.
-  (2018) On the convergence of Adam and beyond. In International Conference on Learning Representations. Cited by: §2.
-  (1951) A stochastic approximation method. The Annals of Mathematical Statistics (3). Cited by: §1, §2, §3.
-  (2018) L4: practical loss-based stepsize adaptation for deep learning. In Advances in Neural Information Processing Systems. Cited by: §2.
-  (2010) A fast natural Newton method. In ICML. Cited by: §2.
-  (2015) ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115 (3). Cited by: §5.
-  (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In 2nd International Conference on Learning Representations (ICLR 2014). Cited by: §4.
-  (2013) No more pesky learning rates. In ICML (3). Cited by: §2.
-  (2018) VR-SGD: a simple stochastic variance reduction method for machine learning. Cited by: §2.
-  (2019) Trust region value optimization using Kalman filtering. Cited by: §2.
-  (2018) Adafactor: adaptive learning rates with sublinear memory cost. In ICML. Cited by: §2.
-  (2014) Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In Proceedings of the 31st International Conference on Machine Learning, Proceedings of Machine Learning Research, Beijing, China. Cited by: §2.
-  (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., pp. 1929–1958. Cited by: §5.3.
-  (2019) Adathm: adaptive gradient method based on estimates of third-order moments. In 2019 IEEE Fourth International Conference on Data Science in Cyberspace (DSC), Los Alamitos, CA, USA. Cited by: §2.
-  (2020) S-SGD: symmetrical stochastic gradient descent with weight noise injection for reaching flat minima. Cited by: §2.
-  (2013) On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, Proceedings of Machine Learning Research, Atlanta, Georgia, USA. Cited by: §1, §3.2.
-  (2004) Comparison of gradient descent method, Kalman filtering and decoupled Kalman in training neural networks used for fingerprint-based positioning. In IEEE 60th Vehicular Technology Conference (VTC2004-Fall). Cited by: §2.
-  (2016) Barzilai-Borwein step size for stochastic gradient descent. In Advances in Neural Information Processing Systems. Cited by: §2.
-  (2012) Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. Cited by: §2.
-  (2019) On the convergence proof of AMSGrad and a new version. IEEE Access. Cited by: §2.
-  (2017) Attention is all you need. arXiv abs/1706.03762. Cited by: §5.3.
-  Painless stochastic gradient: interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems. Cited by: §2.
-  (2018) Kalman gradient descent: adaptive variance reduction in stochastic optimization. Cited by: §3.
-  (2020) Scheduled restart momentum for accelerated stochastic gradient descent. arXiv preprint arXiv:2002.10583. Cited by: §2.
-  (2013) Variance reduction for stochastic gradient optimization. Advances in Neural Information Processing Systems. Cited by: §1.
-  (2019) SignADAM++: learning confidences for deep neural networks. In 2019 International Conference on Data Mining Workshops (ICDMW). Cited by: §2.
-  (2020) SAdam: a variant of Adam for strongly convex functions. In International Conference on Learning Representations. Cited by: §2.
-  (2019) HyperAdam: a learnable task-adaptive Adam for network training. Cited by: §2.
-  (2020) WNGrad: learn the learning rate in gradient descent. Cited by: §2.
-  (2019) XLNet: generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems. Cited by: §5.3.
-  (2021) ADAHESSIAN: an adaptive second order optimizer for machine learning. AAAI. Cited by: §2.
-  (2020) Large batch optimization for deep learning: training BERT in 76 minutes. In International Conference on Learning Representations. Cited by: §2.
-  (2020) SALR: sharpness-aware learning rates for improved generalization. Cited by: §2.
-  (2016) Wide residual networks. Cited by: §5.1.
-  (2018) Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems. Cited by: §2.
-  (2012) ADADELTA: an adaptive learning rate method. CoRR. Cited by: §2.
-  (2017) YellowFin and the art of momentum tuning. arXiv preprint arXiv:1706.03471. Cited by: §5.3.
-  (2020) Why are adaptive methods good for attention models? In Advances in Neural Information Processing Systems. Cited by: §2.
-  (2019) Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems. Cited by: §2.
-  (2019) AdaShift: decorrelation and convergence of adaptive learning rate methods. In International Conference on Learning Representations. Cited by: §2.
-  (2020) AdaBelief optimizer: adapting stepsizes by the belief in observed gradients. In Advances in Neural Information Processing Systems. Cited by: §2, §5.2.