1 Introduction
A major challenge in the optimization of machine learning models is the large size of the problem: both the number of model parameters and the number of data points can be very large. Therefore, the use of higher-order derivative information, whose complexity may scale quadratically with problem size, typically has to be done in a careful manner (Bottou et al., 2018; Xu et al., 2020). Moreover, because very large amounts of data may not be manageable all at once, even methods that use first-order gradients only usually randomly split the whole dataset into smaller blocks (minibatches). Once the data is appropriately prepared, the goal of an effective optimization algorithm is to fit the model accurately to the data by iteratively minimizing a loss objective function. A commonly used loss measure is the (mean) sum-of-squares (Higham and Higham, 2019; Sra et al., 2011). Since machine learning models are normally nonlinear, the resulting optimization problems are formulated as nonlinear least squares (NLLS) problems:
(1)  min_w f(w) = (1/(2N)) Σ_{i=1}^N ( m(x_i; w) − y_i )^2,

where 1/(2N) is a scaling value, m is the machine learning model, (x_i, y_i) are “(feature, label)” data pairs and w ∈ R^n are the optimization weights. Note that in (1) both n and N can be large. By splitting the N data pairs into K minibatch blocks each containing L elements (i.e., N = KL) we define the indices i = 1, …, L within each batch k = 1, …, K. Moreover, defining the residual at a data pair as

r_i(w) = m(x_i; w) − y_i,

and the vector of residuals in a batch as r_k(w) = [ r_1(w), …, r_L(w) ]^T, the least squares objective is

(2)  f_k(w) = (1/2) r_k(w)^T r_k(w).
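As a concrete illustration, the batch residual vector and the least squares objective of (2) can be computed as follows. The linear model below is only a hypothetical stand-in for a general machine learning model m:

```python
import numpy as np

# Hypothetical toy model m(x; w): a linear predictor standing in for a network.
def model(X, w):
    return X @ w

def batch_loss(w, X, y):
    """Least squares loss f_k(w) = 0.5 * r^T r for one minibatch."""
    r = model(X, w) - y           # residual vector, one entry per data pair
    return 0.5 * (r @ r), r

rng = np.random.default_rng(0)
L, n = 4, 10                      # batch size L much smaller than n
X = rng.normal(size=(L, n))
y = rng.normal(size=L)
w = rng.normal(size=n)

f, r = batch_loss(w, X, y)
assert np.isclose(f, 0.5 * np.sum(r**2))
```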
In the remainder we assume that for each batch the value of the objective and its corresponding gradient are known.
2 Contribution
This work exploits the property that usually the number of data pairs within a minibatch is smaller than the number of model parameters, i.e., L < n. Combining this with the nonlinear least squares objective (2), we note that derivative matrices contain a low-rank structure that can be used for effective search direction computations. In particular, we describe two algorithms that approximate higher derivative information using a low memory footprint, and whose computational complexities are comparable to state-of-the-art first-order methods. Other work on stochastic NLLS problems includes the computation of higher derivatives via a domain-decomposition process (Huang et al., 2021) and the analysis of noisy loss and residual values (Bergou et al., 2020). An early backpropagation algorithm for the NLLS objective was developed in (Hagan and Menhaj, 1994). However, none of these works exploit the low-rank structures in the loss objective.
3 Methods
Suppose that at a given batch the loss and its gradient are computed. We temporarily suppress the batch subscripts so that the residual vector and gradient are denoted r = r(w) and ∇f = ∇f(w), respectively. The first and second derivatives of the loss are

(3)  ∇f(w) = J(w)^T r(w),

(4)  ∇²f(w) = J(w)^T J(w) + Σ_{i=1}^L r_i(w) ∇²r_i(w).
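The relation (3) can be checked numerically. The sketch below assumes a linear toy model, for which the Jacobian of the residual vector is simply the data matrix and the residual Hessians in (4) vanish:

```python
import numpy as np

rng = np.random.default_rng(1)
L, n = 4, 10
X = rng.normal(size=(L, n))
y = rng.normal(size=L)
w = rng.normal(size=n)

r = X @ w - y            # residuals of a linear toy model; its Jacobian is J = X
g = X.T @ r              # gradient via (3): grad f = J^T r

# central finite-difference check of the gradient
eps = 1e-6
g_fd = np.array([(0.5 * np.sum((X @ (w + eps * e) - y) ** 2)
                  - 0.5 * np.sum((X @ (w - eps * e) - y) ** 2)) / (2 * eps)
                 for e in np.eye(n)])
assert np.allclose(g, g_fd, atol=1e-5)

# for a linear model the residual Hessians vanish, so (4) reduces to J^T J
H = X.T @ X
```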
Throughout we will denote the Jacobian matrix by J = J(w) ∈ R^{L×n} and the Hessian matrix as ∇²f(w). We define a new search direction s at a current iteration as the update to the weights: w_{k+1} = w_k + s. Using Newton’s method the search direction is defined by the linear system

(5)  ∇²f(w_k) s = −∇f(w_k),
where gradient, Jacobian and Hessian are evaluated at the current iteration, i.e., ∇f(w_k), J(w_k), and ∇²f(w_k). For large n, forming the Hessian matrix and solving with it is, however, not feasible. Thus we estimate the components in (4). First, note that the residuals near the solution are expected to be small, and thus estimating the matrix Σ_{i=1}^L r_i(w) ∇²r_i(w) with a diagonal matrix scaled by a scalar λ ≥ 0 appears reasonable. For regular least squares problems, setting λ = 0 yields a Gauss-Newton type method, whereas λ > 0 corresponds to a Levenberg-Marquardt type method (cf. (Nocedal and Wright, 2006)). In this work we develop methods that estimate the components of the Hessian and that are suited for large data in machine learning problems. In particular, we define search directions by the system

(6)  ( D_k^2 + Ĵ_k^T Ĵ_k ) s = −∇f(w_k),
where D_k is a diagonal matrix and Ĵ_k approximates the Jacobian. Since the matrix in parentheses in (6) is the sum of a diagonal and a low-rank term, it can be inverted using the Sherman-Morrison-Woodbury (SMW) identity. Even so, computing the full Jacobian is typically very expensive for large L and n. Therefore, we describe two approaches for approximating Ĵ_k that enable effective computation of s in (6). Specifically, we derive representations of Ĵ_k based on the gradient/Jacobian relation in (3). In particular, the Jacobian matrices in this work are constructed to satisfy the relation Ĵ^T r = ∇f.
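To illustrate why the diagonal-plus-low-rank structure matters, the following sketch solves a system of the form (diag(d) + J^T J) s = −g with the SMW identity. The matrices and right-hand side are random placeholders; only a small L × L system is factored, regardless of n:

```python
import numpy as np

def smw_solve(d2, J, g):
    """Solve (diag(d2) + J^T J) s = -g with J of shape (L, n), L << n.
    Sherman-Morrison-Woodbury: only an L x L 'capacitance' system is solved."""
    u = -g / d2                        # D^{-2} (-g), elementwise
    JD = J / d2                        # J D^{-2} (broadcast over columns)
    S = np.eye(J.shape[0]) + JD @ J.T  # small L x L capacitance matrix
    return u - JD.T @ np.linalg.solve(S, J @ u)

rng = np.random.default_rng(2)
L, n = 5, 200
J = rng.normal(size=(L, n))
g = rng.normal(size=n)
d2 = 1.0 + rng.random(n)               # positive diagonal

s = smw_solve(d2, J, g)
s_dense = np.linalg.solve(np.diag(d2) + J.T @ J, -g)  # O(n^3) reference
assert np.allclose(s, s_dense)
```

The dense reference solve costs O(n^3) and O(n^2) memory, while the SMW route needs only matrix-vector products with J and an L × L factorization.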
3.1 Rank-1 Jacobian Estimate
Let j denote an intermediate iteration. In our first approach we define the Jacobian estimate by a rank-1 matrix. More specifically, we represent an intermediate Jacobian in the form

Ĵ_j = α_j u_j v_j^T,

where the coefficients are to be determined. In order to satisfy the gradient/Jacobian relation Ĵ_j^T r_j = g_j, where g_j denotes the batch gradient, we deduce the form

(7)  Ĵ_j = (1 / (r_j^T r_j)) r_j g_j^T.

Thus the coefficients at an intermediate iterate are specified as u_j = r_j, v_j = g_j, and α_j = 1/(r_j^T r_j). Since the entire objective function in (1) consists of all data pairs beyond those in one batch, our rank-1 Jacobian approximation accumulates previously computed information

(8)  Ĵ_k = [ Ĵ_1^T, Ĵ_2^T, …, Ĵ_k^T ]^T,

where the corresponding stacked vector of residuals is r_{1:k} = [ r_1^T, …, r_k^T ]^T. Note that in the rank-1 approximation of (8) the vectors of residuals need not be explicitly stacked, because the search direction in (6) uses the term Ĵ_k^T r_{1:k} = Σ_{j=1}^k α_j g_j (r_j^T r_j), in which the residuals reduce to the inner products r_j^T r_j.
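The rank-1 construction can be verified in a few lines; r and g below are random placeholders for a batch residual and gradient:

```python
import numpy as np

rng = np.random.default_rng(3)
L, n = 6, 50
r = rng.normal(size=L)             # batch residual vector
g = rng.normal(size=n)             # batch gradient

alpha = 1.0 / (r @ r)
J_hat = alpha * np.outer(r, g)     # rank-1 Jacobian estimate, shape (L, n)

# the gradient/Jacobian relation J_hat^T r = g holds exactly:
assert np.allclose(J_hat.T @ r, g)
# only g and the scalar alpha need be stored -- J_hat is never formed
```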
3.2 Rank-L Jacobian Estimate
For our second approach we use the rank-L representation

(9)  Ĵ_j = P_j M_j Q_j,

where P_j ∈ R^{L×L} and Q_j ∈ R^{n×n} are two random permutation matrices that rearrange the row and column ordering. In order to satisfy the relation Ĵ_j^T r_j = g_j, where g_j denotes the batch gradient, the Jacobian estimate can be written with elements

(10)  (M_j)_{il} = (Q_j g_j)_l / (P_j^T r_j)_i for the column indices l assigned to row i, and (M_j)_{il} = 0 otherwise,

where M_j ∈ R^{L×n} is the rectangular matrix in (9); each row of M_j holds a block of entries of the permuted gradient divided by the corresponding permuted residual. We accumulate the intermediate estimates so as to define the rank-L Jacobian as

(11)  Ĵ_k = Σ_{j=1}^k Ĵ_j.
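One possible instance of such a rank-L estimate can be checked numerically. For illustration we assume n is divisible by L and assign one contiguous block of the permuted gradient to each row of the rectangular matrix; these are assumptions of this sketch, not a prescription of the method:

```python
import numpy as np

rng = np.random.default_rng(4)
L, n = 4, 12                       # assumption: n is a multiple of L
r = rng.normal(size=L)             # batch residual vector
g = rng.normal(size=n)             # batch gradient

# random row/column permutations (vector form of P and Q)
p, q = rng.permutation(L), rng.permutation(n)
gq = g[q]                          # permuted gradient entries

# rectangular matrix M: row i holds the i-th block of the permuted
# gradient divided by the i-th permuted residual; all else is zero
M = np.zeros((L, n))
blk = n // L
for i in range(L):
    M[i, i * blk:(i + 1) * blk] = gq[i * blk:(i + 1) * blk] / r[p][i]

# undo the permutations to obtain the rank-L estimate
J_hat = M[np.argsort(p)][:, np.argsort(q)]
assert np.allclose(J_hat.T @ r, g)   # gradient/Jacobian relation holds
```

Only the L·(n/L)·L = n nonzero values of M and the two permutations need to be stored.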
Finally, note that we do not form the Jacobians in (8) or (11) explicitly. Rather, we accumulate the information in a small number of vectors and scalars. For instance, we represent the rank-1 estimate by storing the gradient vector and the scalar α_j. For the rank-L estimate we store the nonzero elements of (10) as a vector and subsequently accumulate these values in (11). Our methods are implemented in two algorithms for computing a new search direction.
4 Algorithms
This section describes algorithms for computing search directions based on our estimates of the Jacobian matrix. The model weights are updated according to w_{k+1} = w_k + s_k. Here the square of the diagonal matrix D_k is stored in a vector of squared accumulated gradients

(12)  d_k = d_{k−1} + g_k .* g_k,

where .* represents elementwise multiplication. Similarly, ./ will stand for elementwise division and sqrt(·) for the elementwise square root. Algorithm 1 implements the rank-1 estimate from Section 3.1.
Note that in Algorithm 1 part of the problem specifications, i.e., K (the number of batches) and L (the number of data pairs in a batch), determines the order of magnitude for the parameter of the algorithm. Otherwise, we initialize the accumulated vectors to zero. Moreover, if Jacobian accumulation were disabled, meaning that Ĵ_k = 0, then (6) reduces to a diagonal system and the search directions reduce to a basic version of the Adagrad algorithm (Duchi et al., 2011). The method of Section 3.2 can be implemented similarly in another algorithm. We will refer to the implementation of the rank-L Jacobian approximation as NLLSL.
Note that our algorithms can be implemented using only inner products and elementwise multiplications, divisions, and square roots, and are thus applicable to large problems.
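A simplified, self-contained sketch of a rank-1 style iteration in this spirit is given below. This is not the authors' exact Algorithm 1: the accumulation rule, the fixed step size, and the toy linear model are assumptions made for illustration. It combines an Adagrad-type accumulator (12) with an accumulated rank-1 term, and inverts the resulting diagonal-plus-rank-1 matrix with the Sherman-Morrison identity:

```python
import numpy as np

def nlls1_style_step(w, X, y, d, v, lr=0.1, eps=1e-8):
    """One sketched step: Adagrad-type diagonal plus an accumulated
    rank-1 Jacobian term, inverted in closed form via Sherman-Morrison."""
    r = X @ w - y                      # batch residuals (toy linear model)
    g = X.T @ r                        # batch gradient g = J^T r
    d = d + g * g                      # (12): accumulate squared gradients
    v = v + g                          # accumulated rank-1 Jacobian information
    dg = np.sqrt(d) + eps              # diagonal, elementwise square root
    # solve (diag(dg) + v v^T) s = -g via Sherman-Morrison
    u = -g / dg
    s = u - (v / dg) * (v @ u) / (1.0 + v @ (v / dg))
    return w + lr * s, d, v

rng = np.random.default_rng(5)
n = 20
X, y = rng.normal(size=(5, n)), rng.normal(size=5)
w, d, v = np.zeros(n), np.zeros(n), np.zeros(n)
loss0 = 0.5 * np.sum((X @ w - y) ** 2)
for _ in range(200):
    w, d, v = nlls1_style_step(w, X, y, d, v)
assert 0.5 * np.sum((X @ w - y) ** 2) < loss0   # loss decreases on the toy problem
```

With v fixed at zero the update reduces to a plain Adagrad step, matching the remark after Algorithm 1.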
5 Numerical Experiments
This section describes numerical experiments on three different optimization problems using the least squares loss. We compare our implementations of NLLS1 (Algorithm 1) and NLLSL with the widely used methods SGD, Adam (Kingma and Ba, 2015) and Adagrad (Duchi et al., 2011). The experiments are rerun 5 times, for which the average loss evolution is plotted. On the x-axis is the number of epochs for a given experiment, which corresponds to the number of passes over the full dataset (in each epoch the dataset is composed of K batches with L (feature, label) pairs). The algorithms are implemented in Python 3.7 with TensorFlow 2.4.0 (Developers, 2021) to compute the machine learning models and their derivatives. The experiments are carried out on a MacBook Pro @2.6 GHz Intel Core i7 with 32 GB of memory. The codes are freely available online.
5.1 Experiment 1
This experiment compares 6 algorithms on a small classification problem using the Iris flower dataset. A fully connected network with three dense layers, softmax thresholding and relu activation is used. The dataset contains 3 different flower classes, which are characterized by 4 features. The number of optimization weights n and the data sizes K and L are small for this problem. Since the data is relatively small, we additionally include the “Full Jacobian” method, which explicitly computes the entire Jacobian in (3). Note however that unless n and L are small, computing full Jacobian matrices is not practical. Adam obtained good results using default TensorFlow parameters, while SGD improved with a tuned learning rate; Adagrad used the same learning rate as SGD. The parameter in Algorithm 1 is set accordingly, and the outcomes are in Figure 1.
5.2 Experiment 2
The movie recommendation dataset contains 100,000 (movie title, rating) pairs. A fully connected model with two dense layers and relu activation is used for rating predictions. Movie titles and ratings are embedded in two preprocessing layers of dimension 32. The total number of optimization weights n is large, as are the data sizes K and L. Since this problem is relatively large, computing full Jacobian matrices is not practical; therefore the “Full Jacobian” method is not applicable in this experiment. All other algorithms, however, scale to the problem size. We use the default TensorFlow parameters for all algorithms, except Adagrad, which obtained better results with a tuned learning rate. The parameter in Algorithm 1 is set accordingly.
5.3 Experiment 3
In Experiment 3, the algorithms are applied to a large autoencoding model. The Fashion MNIST dataset of 60,000 images of 28×28 pixels is used. The model consists of dense layers with relu activation for the encoding step (with embedding dimension 64) and dense layers with sigmoid activation for the decoding step. The total number of weights n is large, and so are the data sizes K and L. We use the default TensorFlow parameters for Adam and a tuned learning rate for SGD and Adagrad. The parameter in Algorithm 1 is set accordingly.
6 Conclusions
This article develops algorithms that approximate Jacobian information in order to improve search directions for nonlinear least squares loss functions. By using the fact that the number of model parameters typically exceeds the number of data pairs in a batch of the dataset, we propose computationally effective methods for computing search directions. In numerical experiments, including large-scale applications, our proposed algorithms perform well when compared to state-of-the-art methods such as Adam or Adagrad.
References
Bergou et al. (2020). A stochastic Levenberg-Marquardt method using random models with complexity results and application to data assimilation. arXiv:1807.02176.
Bottou et al. (2018). Optimization methods for large-scale machine learning. SIAM Review 60(2), pp. 223–311.
Developers (2021). TensorFlow. Zenodo.
UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
Duchi et al. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, pp. 2121–2159.
Hagan and Menhaj (1994). Training feedforward networks with the Marquardt algorithm. IEEE Transactions on Neural Networks 5(6), pp. 989–993.
The MovieLens datasets: history and context. ACM Transactions on Interactive Intelligent Systems 5(4).
Higham and Higham (2019). Deep learning: an introduction for applied mathematicians. SIAM Review 61(4), pp. 860–891.
Huang et al. (2021). DeepLM: large-scale nonlinear least squares on deep learning frameworks using stochastic domain decomposition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10308–10317.
Kingma and Ba (2015). Adam: a method for stochastic optimization. In International Conference on Learning Representations (ICLR 2015).
Nocedal and Wright (2006). Numerical Optimization. 2nd edition, Springer-Verlag, New York.
Sra et al. (2011). Optimization for Machine Learning. The MIT Press.
Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747.
Xu et al. (2020). Second-order optimization for non-convex machine learning: an empirical study. In Proceedings of the 2020 SIAM International Conference on Data Mining (SDM), pp. 199–207.