Nonlinear Least Squares for Large-Scale Machine Learning using Stochastic Jacobian Estimates

by Johannes J. Brust, et al.

For large nonlinear least squares loss functions in machine learning we exploit the property that the number of model parameters typically exceeds the data in one batch. This implies a low-rank structure in the Hessian of the loss, which enables effective means to compute search directions. Using this property, we develop two algorithms that estimate Jacobian matrices and perform well when compared to state-of-the-art methods.








1 Introduction

A major challenge in the optimization of machine learning models is the large size of the problem: both the number of model parameters and the number of data points are large. The use of higher-order derivative information, whose complexity may scale quadratically with the problem size, therefore has to be handled carefully (Bottou et al., 2018; Xu et al., 2020). Moreover, because very large amounts of data may not be manageable all at once, even methods that use first-order gradients only usually split the whole dataset randomly into smaller blocks (minibatches). Once the data is appropriately prepared, the goal of an effective optimization algorithm is to fit the model accurately to the data by iteratively minimizing a loss objective function. A commonly used loss measure is the (mean) sum-of-squares (Higham and Higham, 2019; Sra et al., 2011). Since machine learning models are normally nonlinear, the resulting optimization problems are formulated as nonlinear least squares (NLLS) problems:

    minimize_{w ∈ R^n}  f(w) = (γ/2) Σ_{i=1}^N ( φ(w; x_i) − y_i )²,    (1)


where γ > 0 is a scaling value (e.g., γ = 1/N for the mean), φ is the machine learning model, (x_i, y_i), i = 1, …, N, are “(feature, label)” data pairs, and w ∈ R^n are the optimization weights. Note that in (1) both N and n can be large. By splitting the data pairs into N_b minibatch blocks each containing L elements (i.e., N = L N_b) we define the indices i = 1, …, L for the pairs within a batch and k = 1, …, N_b for the batches. Moreover, defining the residual at data pair (x_{ik}, y_{ik}) as

    r_{ik}(w) = φ(w; x_{ik}) − y_{ik},

and the vector of residuals in batch k as

    r_k(w) = [ r_{1k}(w), …, r_{Lk}(w) ]^T,

the least squares objective of one batch is

    f_k(w) = (1/2) r_k(w)^T r_k(w),    (2)

where the scaling is absorbed into the residuals.
In the remainder we assume that for each batch k the value of the objective f_k and its corresponding gradient g_k = ∇f_k(w) are known.
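The batch residuals and the least-squares loss of (2) can be sketched as follows. This is a minimal illustration: the linear model X @ w is an assumed stand-in for the generally nonlinear model φ, and all names are illustrative rather than the paper's code.

```python
import numpy as np

def batch_residuals(w, X, y):
    """Residual vector r_k(w) for one minibatch of (feature, label)
    pairs (X, y); a linear model X @ w stands in for the generally
    nonlinear model phi(w; x)."""
    return X @ w - y

def batch_loss(w, X, y):
    """Batch least-squares objective f_k(w) = 0.5 * r^T r (cf. (2))."""
    r = batch_residuals(w, X, y)
    return 0.5 * float(r @ r)
```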

2 Contribution

This work exploits the property that the number of data pairs within a minibatch is usually smaller than the number of model parameters, i.e., L ≤ n. Combining this with the nonlinear least squares objective (2), we note that the derivative matrices contain a low-rank structure that can be used for effective search direction computations. In particular, we describe two algorithms that approximate higher-order derivative information with a low memory footprint and with computational complexities comparable to state-of-the-art first-order methods. Other work on stochastic NLLS problems includes the computation of higher derivatives via a domain-decomposition process (Huang et al., 2021) and the analysis of noisy loss and residual values (Bergou et al., 2020). An early backpropagation algorithm for the NLLS objective was developed in (Hagan and Menhaj, 1994). However, none of these works exploit the low-rank structure of the loss objective.

3 Methods

Suppose that at a given batch the loss and its gradient are computed. We temporarily suppress the batch subscripts, so that the residual vector and the gradient are denoted r and g, respectively. The first and second derivatives of the loss are

    g = ∇f(w) = J(w)^T r(w)  and  ∇²f(w) = J(w)^T J(w) + Σ_{i=1}^L r_i(w) ∇²r_i(w).    (3)
Throughout we will denote the Jacobian matrix by J = ∂r/∂w ∈ R^{L×n} and the Hessian matrix by ∇²f(w) ∈ R^{n×n}. We define a new search direction p at a current iteration as the update to the weights: w_+ = w + p. Using Newton’s method the search direction is defined by the linear system

    ∇²f(w) p = ( J^T J + Σ_{i=1}^L r_i ∇²r_i ) p = −g,    (4)
where gradient, Jacobian, and Hessian are evaluated at the current iterate. For large n, however, forming the Hessian matrix and solving with it is not feasible. Thus we estimate the components in (4). First, note that the residuals near the solution are expected to be small, and thus estimating the matrix Σ_i r_i ∇²r_i by a diagonal matrix or a scalar multiple of the identity appears reasonable. For regular least squares problems, setting this matrix to zero yields a Gauss-Newton type method, whereas setting it to λI with λ > 0 corresponds to a Levenberg-Marquardt type method (cf. Nocedal and Wright, 2006). In this work we develop methods that estimate the components of the Hessian and are suited for large data in machine learning problems. In particular, we define search directions by the system

    ( D + Ĵ^T Ĵ ) p = −g,    (6)

where D is a diagonal matrix and Ĵ approximates the Jacobian J. Since the matrix in parentheses in (6) is the sum of a diagonal and a low-rank matrix, it can be inverted using the Sherman-Morrison-Woodbury (SMW) identity. Even so, computing the full Jacobian is typically very expensive for large L and n. Therefore, we describe two approaches for approximating J that enable effective computation of p in (6). Specifically, we derive representations of Ĵ from the gradient/Jacobian relation in (3); in particular, the Jacobian estimates in this work satisfy the relation Ĵ^T r = g.
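For a rank-1 term the SMW identity reduces to the Sherman-Morrison formula, which the following minimal sketch illustrates; the function and variable names are assumptions of this sketch, not the paper's code.

```python
import numpy as np

def smw_solve_rank1(d, v, g):
    """Solve (D + v v^T) p = -g with D = diag(d) via Sherman-Morrison:
    (D + v v^T)^{-1} = D^{-1} - (D^{-1} v)(v^T D^{-1}) / (1 + v^T D^{-1} v).
    Only element-wise divisions and inner products are needed,
    so the cost is O(n)."""
    Dinv_g = g / d
    Dinv_v = v / d
    return -(Dinv_g - (v @ Dinv_g) / (1.0 + v @ Dinv_v) * Dinv_v)
```

Because no n-by-n matrix is ever formed, this solve scales to the large n that arise in machine learning models.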

3.1 Rank-1 Jacobian Estimate

Let k denote an intermediate iteration. In our first approach we define the Jacobian estimate by a rank-1 matrix. More specifically, we represent an intermediate Jacobian in the form

    Ĵ_k = σ_k u_k z_k^T,

where the coefficients σ_k, u_k, and z_k are to be determined. In order to satisfy the gradient/Jacobian relation Ĵ_k^T r_k = g_k we deduce the form

    Ĵ_k = (1 / (r_k^T r_k)) r_k g_k^T.    (7)

Thus the coefficients at an intermediate iterate are specified as σ_k = 1/(r_k^T r_k), u_k = r_k, and z_k = g_k. Since the entire objective function in (1) consists of all data pairs beyond those in one batch, our rank-1 Jacobian approximation accumulates previously computed information:

    Ĵ = (1 / Σ_{j=1}^k r_j^T r_j) [ r_1^T, …, r_k^T ]^T ( Σ_{j=1}^k g_j )^T.    (8)

Note that in the rank-1 approximation of (8) the vectors of residuals need not be explicitly stacked, because the search direction in (6) uses the term Ĵ^T Ĵ, in which the residuals reduce to the accumulated inner product Σ_{j=1}^k r_j^T r_j.
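The low-memory representation behind (8) can be sketched as follows; the class and attribute names (v for the accumulated gradients, rho for the accumulated residual inner products) are illustrative assumptions.

```python
import numpy as np

class Rank1Jacobian:
    """Accumulates the rank-1 Jacobian estimate of (8) by storing only a
    vector and a scalar: the stacked residuals enter solely through the
    inner products r_j^T r_j."""
    def __init__(self, n):
        self.v = np.zeros(n)  # accumulated batch gradients
        self.rho = 0.0        # accumulated residual inner products

    def update(self, g, r):
        """Fold in the gradient g and residual vector r of one batch."""
        self.v += g
        self.rho += float(r @ r)

    def gram(self):
        """The term J_hat^T J_hat used in (6): the rank-1 matrix v v^T / rho
        (formed densely here only for illustration)."""
        return np.outer(self.v, self.v) / self.rho
```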

3.2 Rank-L Jacobian Estimate

For our second approach we use the rank-L representation

    Ĵ_k = P V Q,    (9)

where P and Q are two random permutation matrices that rearrange the row and column ordering. In order to satisfy the relation Ĵ_k^T r_k = g_k, the Jacobian estimate can be written as

    Ĵ_k = P V(r_k, g_k) Q,

where V(r_k, g_k) is the rectangular L × n matrix in (9) with elements

    v_{ij} = (g_k)_j / (r_k)_i  if column j is assigned to row i,  and  v_{ij} = 0 otherwise,    (10)

so that each column contains exactly one nonzero and the relation Ĵ_k^T r_k = g_k holds by construction. We accumulate the intermediate estimates as a sum to define the rank-L Jacobian

    Ĵ = Σ_{j=1}^k P_j V(r_j, g_j) Q_j.    (11)

Finally, note that we do not form the Jacobians in (8) or (11) explicitly. Rather, we accumulate the information in compact representations. For instance, we represent the rank-1 estimate in (8) by storing the accumulated gradient vector and the scalar sum of residual inner products. For the rank-L estimate we store the nonzero elements of (10) as a vector and subsequently accumulate these values in (11). Our methods are implemented in two algorithms for computing a new search direction.
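The column-wise construction behind (9)-(10) can be illustrated as follows. The random row assignment is a simplified stand-in for the permutation matrices, the dense array is for illustration only (in practice only the n nonzeros are stored), and all names are assumptions of this sketch.

```python
import numpy as np

def rank_l_estimate(g, r, rng):
    """Sparse Jacobian estimate of rank at most L: each column j holds a
    single nonzero g[j] / r[i] in a randomly chosen row i, so that
    J_hat.T @ r == g holds by construction (assuming nonzero residuals)."""
    L, n = r.size, g.size
    rows = rng.integers(0, L, size=n)   # random row index for each column
    J_hat = np.zeros((L, n))
    J_hat[rows, np.arange(n)] = g / r[rows]
    return J_hat
```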

4 Algorithms

This section describes algorithms for computing search directions based on our estimates of the Jacobian matrix. The model weights are updated according to w_+ = w + α p, where α is a step size. Here the square of the diagonal matrix D in (6) is stored in a vector d of squared accumulated gradients:

    d ← d + g .* g,    D = diag(sqrt(d)),

where .* represents element-wise multiplication. Similarly, ./ will stand for element-wise division and sqrt(·) for the element-wise square root. Algorithm 1 implements the rank-1 estimate from Section 3.1.

  Initialize: w, v = 0, d = 0, ρ = 0, step size α, damping parameter c
  for epoch = 1, 2, … do
     for k = 1, …, N_b do
        compute r_k and g_k at w
        ρ ← ρ + r_k^T r_k                              # Loss accum.
        v ← v + g_k                                    # Jac. accum.
        d ← d + g_k .* g_k                             # Diag. accum.
        D = diag(c + sqrt(d))                          # Representation: H and H^{-1} (cf. (6))
        solve H p = −g_k via SMW; w ← w + α p          # H = D + v v^T / ρ
     end for
  end for
Algorithm 1 NLLS1 (Nonlinear Least Squares Rank-1)

Note that in Algorithm 1 part of the problem specifications, i.e., the number of batches and the number of data pairs in a batch, determines the order of magnitude of the algorithm's damping parameter. Otherwise, the accumulated quantities are initialized to zero. Moreover, if Jacobian accumulation were disabled, so that v = 0 throughout, then H = D and the search directions reduce to a basic version of the Adagrad algorithm (Duchi et al., 2011). The method of Section 3.2 can be implemented similarly in a second algorithm. We refer to the implementation of the rank-L Jacobian approximation as NLLSL.

Note that our algorithms can be implemented using only vector operations (element-wise multiplications, divisions, square roots, and inner products), and are thus applicable to large problems.
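Put together, one NLLS1-style iteration can be sketched with vector operations only. The variable names, the damping constant eps, and the unit step size are assumptions of this sketch, not the paper's exact settings.

```python
import numpy as np

def nlls1_step(w, g, r, state, lr=1.0, eps=1e-8):
    """One sketched NLLS1 update: accumulate rho = sum r^T r (loss),
    v = sum g (Jacobian), and d = sum g.*g (diagonal), then solve
    (D + v v^T / rho) p = -g by Sherman-Morrison, with D = diag(sqrt(d))."""
    state["rho"] += float(r @ r)            # Loss accum.
    state["v"] += g                         # Jac. accum.
    state["d"] += g * g                     # Diag. accum.
    d = np.sqrt(state["d"]) + eps           # diagonal of D (damped)
    u = state["v"] / np.sqrt(state["rho"])  # u u^T equals v v^T / rho
    Dinv_g, Dinv_u = g / d, u / d
    p = -(Dinv_g - (u @ Dinv_g) / (1.0 + u @ Dinv_u) * Dinv_u)
    return w + lr * p
```

If the Jacobian accumulation is disabled (u = 0), the step reduces to p = -g ./ sqrt(d), i.e., a basic Adagrad update, mirroring the remark after Algorithm 1.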

5 Numerical Experiments

This section describes numerical experiments on three different optimization problems using the least squares loss. We compare our implementations of NLLS1 (Algorithm 1) and NLLSL with the widely used methods SGD, Adam (Kingma and Ba, 2015), and Adagrad (Duchi et al., 2011). Each experiment is run 5 times, and the average loss evolution is plotted. The x-axis shows the number of epochs, which corresponds to the number of passes over the full dataset (in each epoch the dataset is processed in minibatches of (feature, label) pairs). The algorithms are implemented in Python 3.7 with TensorFlow 2.4.0 (Developers, 2021) to compute the machine learning models and their derivatives. The experiments are carried out on a MacBook Pro @2.6 GHz Intel Core i7 with 32 GB of memory. The codes are freely available online.

5.1 Experiment 1

This experiment compares 6 algorithms on a small classification problem using the Iris flower dataset. A fully connected network with three dense layers, softmax thresholding, and relu activation is used. The dataset contains 3 different flower classes, which are characterized by 4 features. Since the data is relatively small, we additionally include the “Full Jacobian” method, which explicitly computes the entire Jacobian in (3). Note, however, that unless the batch size and the number of weights are small, computing full Jacobian matrices is not practical. Adam obtained good results using default TensorFlow parameters, while a tuned learning rate improved SGD. Adagrad used the same learning rate as SGD. The outcomes are shown in Figure 1.

Figure 1: Comparison of 6 algorithms for the classification of 3 types of Iris plants (dataset: (Dua and Graff, 2017)). Observe that the “Full Jacobian” method reaches the lowest average loss as expected, since this method explicitly computes all Jacobian derivatives. However, the proposed rank-1 Jacobian approximation (NLLS1) achieves similar low losses, without explicitly computing the derivatives. The experiment outcomes are averaged over 5 runs.

5.2 Experiment 2

The movie recommendation dataset contains 100,000 (movie title, rating) pairs. A fully connected model with two dense layers and relu activation is used for rating predictions. Movie titles and ratings are embedded in two preprocessing layers of dimension 32. Since this problem is relatively large, computing full Jacobian matrices is not practical, and the “Full Jacobian” method is therefore not applicable in this experiment. However, all other algorithms are able to scale to the problem size. We use the default TensorFlow parameters for all algorithms, except Adagrad, which obtained better results with a tuned learning rate.

Figure 2: Comparison of 5 algorithms for training a movie recommender model on the MovieLens 100K dataset (Harper and Konstan, 2015). The proposed algorithm NLLS1 obtains the overall lowest losses. The outcomes are calculated as the average of 5 runs.

5.3 Experiment 3

In Experiment 3, the algorithms are applied to a large autoencoding model. The Fashion MNIST dataset of 60,000 images is used. The model consists of dense layers with relu activation for the encoding step (with embedding dimension 64) and dense layers with sigmoid activation for the decoding step. We use the default TensorFlow parameters for Adam and a tuned learning rate for SGD and Adagrad.

Figure 3: Comparison of 5 algorithms for training an autoencoder on the Fashion MNIST dataset (Xiao et al., 2017). NLLS1 achieves the overall lowest loss values. The results are averaged over 5 runs.

6 Conclusions

This article develops algorithms that approximate Jacobian information in order to improve search directions for nonlinear least squares loss functions. By using the fact that the number of model parameters typically exceeds the number of data pairs in a batch of the dataset, we propose computationally effective methods for computing search directions. In numerical experiments, including large-scale applications, the proposed algorithms perform well when compared to state-of-the-art methods such as Adam and Adagrad.


  • E. Bergou, Y. Diouane, V. Kungurtsev, and C. W. Royer (2020) A stochastic Levenberg-Marquardt method using random models with complexity results and application to data assimilation. arXiv:1807.02176. Cited by: §2.
  • L. Bottou, F. E. Curtis, and J. Nocedal (2018) Optimization methods for large-scale machine learning. SIAM Review 60 (2), pp. 223–311. Cited by: §1.
  • T. Developers (2021) TensorFlow. Zenodo. Cited by: §5.
  • D. Dua and C. Graff (2017) UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: Figure 1.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, pp. 2121–2159. Cited by: §4, §5.
  • M. T. Hagan and M. B. Menhaj (1994) Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks 5 (6), pp. 989–993. Cited by: §2.
  • F. M. Harper and J. A. Konstan (2015) The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5 (4). Cited by: Figure 2.
  • C. F. Higham and D. J. Higham (2019) Deep learning: an introduction for applied mathematicians. SIAM Review 61 (4), pp. 860–891. Cited by: §1.
  • J. Huang, S. Huang, and M. Sun (2021) DeepLM: large-scale nonlinear least squares on deep learning frameworks using stochastic domain decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10308–10317. Cited by: §2.
  • D. P. Kingma and J. L. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR 2015). Cited by: §5.
  • J. Nocedal and S. J. Wright (2006) Numerical Optimization. 2nd edition, Springer-Verlag, New York. Cited by: §3.
  • S. Sra, S. Nowozin, and S. J. Wright (2011) Optimization for Machine Learning. The MIT Press. Cited by: §1.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. CoRR abs/1708.07747. Cited by: Figure 3.
  • P. Xu, F. Roosta, and M. W. Mahoney (2020) Second-order optimization for non-convex machine learning: an empirical study. In Proceedings of the 2020 SIAM International Conference on Data Mining (SDM), pp. 199–207. Cited by: §1.