1 Introduction
Mathematical optimization is the workhorse of virtually all machine learning algorithms. For a given objective function (which may have a special structure or constraints), almost all machine learning problems can be boiled down to the following optimization form:
(1) 
Optimization is often computationally intensive and may correspond to most of the time taken to train a machine learning model. For instance, the training of deep neural networks is dominated by the optimization of the model parameters on the data
schmidhuber2015deep. Even popular machine learning models such as logistic regression have training times mostly dominated by an optimization procedure
kingma2015adam .The ubiquity of optimization in machine learning algorithms highlights the need for robust and flexible implementations of optimization algorithms. We present ensmallen, a C++ optimization toolkit that contains a wide variety of optimization techniques for many types of objective functions. Through the use of C++ template metaprogramming Abrahams_2004 ; Vandevoorde_2018 , ensmallen is able to generate efficient code that can help with the demanding computational needs of many machine learning algorithms.
Although there are many existing machine learning optimization toolkits, few are able to take explicit advantage of metaprogramming based code optimizations, and few offer robust support for various types of objective functions. For instance, deep learning libraries like Caffe
jia2014caffe, PyTorch
paszke2017automatic, and TensorFlow
abadi2016tensorfloweach contain a variety of optimization techniques. However, these techniques are limited to stochastic gradient descent (SGD) and SGDlike optimizers that operate on small batches of data points at a time. Other machine learning libraries, such as
scikitlearn pedregosa2011scikit contain optimization algorithms but not in a coherent or reusable framework. Many programming languages have higherlevel packages for mathematical optimization. For example, scipy.optimize jones2014scipy , is widely used in the Python community, and MATLAB’s function optimization support has been available and used for many decades. However, these implementations are often unsuitable for modern machine learning tasks—for instance, computing the full gradient of the objective function may not be feasible because there are too many data points.In this paper, we describe the functionality of ensmallen and the types of problems that it can be applied to. We discuss the mechanisms by which ensmallen is able to provide both computational efficiency and easeofuse. We show a few examples that use the library, as well as empirical performance comparisons with other optimization libraries.
ensmallen is opensource software licensed under the 3clause BSD license Rosen_2004_full , allowing unencumbered use in both opensource and proprietary projects. It is available for download from https://ensmallen.org. Armadillo sanderson2016armadillo is used for efficient linear algebra operations, with optional interface to GPUs via NVBLAS nvidia2015 .
2 Types of Objective Functions
ensmallen provides a set of optimizers for optimizing userdefined objective functions. It is also easy to implement a new optimizer in the ensmallen framework. Overall, our goal is to provide an easytouse library that can solve the problem for any function
that takes a vector or matrix input
. In most cases, will have special structure; one example might be that is differentiable. Therefore, the abstraction we have designed for ensmallen can optionally take advantage of this structure. For example, in addition to , a user can provide an implementation of , which in turn allows firstorder gradientbased optimizers to be used. This generally leads to significant speedups.There are a number of other properties that ensmallen can use to accelerate computation. These are listed below:

arbitrary: no assumptions can be made on

differentiable: has a computable gradient

separable: is a sum of individual components:

categorical: contains elements that can only take discrete values

sparse: the gradient or (for a separable function) is sparse

partially differentiable: the separable gradient is also separable for a different axis

bounded: is limited in the values that it can take
Due to its straightforward abstraction framework, ensmallen is able to provide a large set of diverse optimization algorithms for objective functions with these properties. Below is a list of currently available optimizers:

SGD variants Ruder_2016 : Stochastic Gradient Descent (SGD), SGD with Restarts, Parallel SGD (Hogwild!) hogwild_nips_2011 , Stochastic Coordinate Descent, AdaGrad Duchi_2011 , AdaDelta Zeiler_2012
, RMSProp, SMORMS3, Adam
kingma2015adam , AdaMax 
QuasiNewton variants: Limitedmemory BFGS (LBFGS) zhu1997algorithm , incremental QuasiNewton method Mokhtari_2018 , Augmented Lagrangian Method Hestenes_1969

Genetic variants: Conventional Neuroevolution Belew_1990 , Covariance Matrix Adaptation Evolution Strategy Hansen_2001

Other: Conditional Gradient Descent, FrankWolfe algorithm Frank_1956 , Simulated Annealing kirkpatrick1983optimization
In ensmallen’s framework, if a user wants to optimize a differentiable objective function, they only need to provide implementations of and , and then they can use any of the gradientbased optimizers that ensmallen provides. Table 1 contrasts the classes of objective functions that can be handled by ensmallen and other popular frameworks and libraries.
Not every optimization algorithm provided by ensmallen can be used by every class of objective function; for instance, a gradientbased optimizer such as LBFGS cannot operate on a nondifferentiable objective function. As such, the best the library can attain is to maximize the flexibility available, so that a user can easily implement a function and have it work with as many optimizers as possible.
To accomplish the flexibility, ensmallen makes heavy use of C++ template metaprogramming. When implementing an objective function to be optimized, a user may only need to implement a few methods; metaprogramming is then automatically used to check that the given functions match the requirements of the optimizer that is being used. When implementing an optimizer, we can assume that the given function to be optimized meets the required assumptions of the optimizers, and encode those requirements as compiletime checks (via static_assert).
For the most common case of a differentiable , the user only needs to implement two methods:

double Evaluate(): given coordinates , this function returns the value of .

void Gradient(, ): given coordinates and a reference to , set .
Alternatively, the user can simply implement a EvaluateWithGradient() function that computes both and simultaneously, which is useful in cases where both the objective and gradient depend on similar computations.
The required API for separable differentiable objective functions (i.e. those that would use an optimizer like SGD) is very similar, except that Evaluate(), Gradient() and EvaluateWithGradient() should operate only on minibatches, and utility methods Shuffle() and NumFunctions() must be added. The same pattern applies for other types of objective functions: only a few methods specific to class of objective function itself must be implemented and then any optimizer may be used.
3 Example: Learning Linear Regression Models
As an example of usage, consider the linear regression objective function
^{1}^{1}1For simplicity, we ignore the bias term. It can be rederived by taking .. Given a dataset and associated responses , the model of linear regression is to assume that for each point and response . To fit this model to the data, we must find(2) 
The objective function has the associated gradient
(3) 
We can implement these two functions in a class named LinearRegressionFunction, as shown in Fig. 1. This is the entire required implementation to optimize the linear regression model with any of the gradientbased optimizers in ensmallen.
Given the userdefined LinearRegressionFunction class, the code snippet below shows how the LBFGS optimizer can be used to find the best parameters :
[fontsize=]c++ LinearRegressionFunction lrf(X, y); // we assume X and y already hold data ens::L_BFGS lbfgs; // create LBFGS optimizer with default parameters
arma::vec theta(X.n_rows, arma::fill::randu); // random uniform initialization lbfgs.Optimize(lrf, theta); // after this call, theta holds the solution
To use the smallbatch SGDlike optimizers provided by ensmallen, only a slight variation on the signature of Evaluate() and Gradient() would be needed, plus the Shuffle() and NumFunctions() utility methods.
4 Automatic Metaprogramming for Ease of Use and Efficiency
When optimizing a given function , the computation of the objective function and its derivative often involve the computation of identical intermediate results. Consider the linear regression objective function described in Section 3, . For this objective function, the derivative has a related form of . Both the objective function and the derivative depend on the computation of the vector term , which can be computationally expensive to compute if is a large matrix. Existing optimization frameworks do not have an easy way to avoid this duplicate computation. In many cases, an optimization algorithm may need the values of both and for a given .
Using template metaprogramming, ensmallen provides an easy (and optional) way for users to avoid this extra computational overhead. Instead of specifying individual Evaluate() and Gradient() functions, a user may simply write an EvaluateWithGradient() function that returns both the objective value and the gradient value for an input . As an example, for the LinearRegressionFunction class in Fig. 1, we can replace Evaluate() and Gradient() with an implementation of EvaluateWithGradient() that computes only once:
scale=0.950.95 [fontsize=]c++ double EvaluateWithGradient(const arma::mat& theta, arma::mat& gradient) const arma::vec v = (y  X * theta); // Cache result gradient = 2 * X.t() * v; // Store gradient in the provided matrix return arma::accu(v
Template metaprogramming techniques are automatically used to detect which methods exist, and a wrapper class will use suitable mixins in order to provide ‘missing‘ functionality smaragdakis2000mixin . For instance, if EvaluateWithGradient() is not provided, a version will be automatically generated that calls both Gradient() and Evaluate() in turn. Similarly, if Evaluate() or Gradient() does not exist, then EvaluateWithGradient() is called, and the unnecessary part of the result will be discarded.
The use of template metaprogramming in this manner also allows for compiler optimizations that would not otherwise be possible (and that are often not possible in other frameworks). Firstly, because the objective function class itself is a template parameter, the compiler is able to avoid the overhead of a function pointer dereference, which would not be easily possible when using a language with virtual inheritance. The compiler is also able to use inlining and any optimizations that may imply, including removing temporary values and dead code elimination. Further, if ensmallen automatically generates an Evaluate() or Gradient() method from a usersupplied EvaluateWithGradient() method, the compiler can in some cases recognize and remove the computation of unnecessary results. For instance, in an automatically generated Evaluate() method, the computation of the gradient from EvaluateWithGradient() can be avoided entirely.
Overall, the automatic code generation functionality in ensmallen reduces the requirements for users when they are implementing their own objective functions to be optimized, and allows users a way to provide more efficient implementations of their objective functions. This leads to quicker development, quicker results, and reduces the likelihood of bugs.
At the time of writing, the automatic code generation is implemented for the most commonlyused cases: fullbatch and smallbatch Evaluate(), Gradient(), and EvaluateWithGradient(). We aim to expand this support to other sets of methods for other types of objective functions.
5 Experiments
ensmallen  scipy  Optim.jl  samin  

default  0.004s  1.069s  0.021s  3.173s 
tuned  0.574s  3.122s 
To demonstrate the benefits of the metaprogramming based code optimizations that ensmallen can exploit, we compare the performance of ensmallen with several other optimization frameworks, including some that use automatic differentiation.
For our first experiment, we consider the simple and popular Rosenbrock function Rosenbrock1960 : . For the optimizer, we use simulated annealing kirkpatrick1983optimization , a gradientfree optimizer. Simulated annealing will call the objective function numerous times; for each simulation we limit the optimizer to 100k objective evaluations. Since the objective function is straightforward and is called many times, this can help us understand the overheads of various frameworks.
We compare four frameworks for this task: (i) ensmallen, (ii) scipy.optimize.anneal from SciPy 0.14.1 jones2014scipy , (iii) simulated annealing implementation in Optim.jl with Julia 1.0.1 mogensen2018optim , (iv) samin in the optim package for Octave octave . While another option here might be simulannealbnd() in the Global Optimization Toolkit for MATLAB, no license was available. We ran our code on a MacBook Pro i7 2018 with 16GB RAM running macOS 10.14 with clang 1000.10.44.2, Julia version 1.0.1, Python 2.7.15, and Octave 4.4.1.
Initially, we implemented these functions as simply as possible and ran them without any tuning. This reflects how a typical user might interact with a given framework. The results are shown in the first row of Table 2. Only Julia and ensmallen are compiled, and thus are able to avoid the function pointer dereference and take advantage of inlining and related optimizations.
However, both Python and Octave have routes for acceleration, such as Numba lam2015numba , MEX bindings and JIT compilation. We handoptimized the Rosenbrock implementation using Numba, which required significant modification of the underlying anneal.anneal() function. These techniques did produce some speedup, as shown in the second row of Table 2. For Octave, a MEX binding did not produce a noticeable difference. We were unable to tune either ensmallen or Optim.jl to get any speedup, suggesting that novice users will easily be able to write efficient code in these cases.
Next, we consider the linear regression example described in Sec. 3. For this task we use the firstorder LBFGS optimizer zhu1997algorithm . Using the same four packages, we implement the linear regression objective and gradient. For ensmallen we implement a version with only EvaluateWithGradient(), denoted as ensmallen1. We also implement a version with both Evaluate() and Gradient(): ensmallen2. We also use automatic differentiation for Julia via the ForwardDiff.jl RevelsLubinPapamarkou2016 package and for Python via the Autograd maclaurin2015autograd package. For GNU Octave we use the bfgsmin() function.
Results for various data sizes are shown in Table 3. For each implementation, LBFGS was allowed to run for only iterations and never converged in fewer iterations. The datasets used for training are highly noisy random data with a slight linear pattern. Note that the exact data is not relevant for the experiments here, only its size. Runtimes are reported as the average across 10 runs.
The results indicate that ensmallen with EvaluateWithGradient() is the fastest approach. Furthermore, the use of EvaluateWithGradient() yields considerable speedup over the ensmallen2 implementation with both the objective and gradient implemented separately. In addition, although the automatic differentiation support makes it easier for users to write their code, we observe that the output of automatic differentiators is not as efficient, especially with ForwardDiff.jl. We expect this effect to be more pronounced with increasingly complex objective functions.
Lastly, we demonstrate the easy pluggability in ensmallen for using various optimizers on the same task. Using a version of LinearRegressionFunction from Sec. 3 adapted for separable differentiable optimizers, we run six optimizers with default parameters in just 8 lines of code, as shown in Fig. 2(a). Applying these optimizers to the YearPredictionMSD dataset from the UCI repository ucimlrepository yields the learning curves shown in Fig. 2(b). Any other optimizer for separable differentiable objective functions can be dropped into place in the same manner.
for 5 epochs of training. The optimizers can be tuned for better performance.
6 Conclusion
We have described ensmallen, a flexible C++ library for function optimization that provides an easy interface for the implementation and optimization of userdefined objective functions. Many types of functions can be optimized, including separable and constrained functions. The library provides many prebuilt optimizers (including numerous variants of SGD and QuasiNewton optimizers). The library internally exploits template metaprogramming to maximise opportunities for efficiency gains, as well as to make the implementation of objective functions easier by automatically generating missing methods.
We aim to expand the library with further optimization techniques as the need arises. Since ensmallen is open source, external contributions to the codebase are welcome. For more information on the library, see the website and documentation at https://ensmallen.org. The library is already in use for function optimization in the mlpack machine learning toolkit mlpack2018 .
Acknowledgements. We would like to thank the many contributors to ensmallen, who are listed on the associated website.
References
 [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, et al. Tensorflow: Largescale machine learning on heterogeneous distributed systems. arXiv:1603.04467, 2016.
 [2] D. Abrahams and A. Gurtovoy. C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond. AddisonWesley Professional, 2004.

[3]
R. K. Belew, J. McInerney, and N. N. Schraudolph.
Evolving networks: Using the genetic algorithm with connectionist learning, 1990.
CSE Technical Report #CS90174, University of California at San Diego.  [4] F. Chollet. Keras. https://github.com/fchollet/keras, 2015.
 [5] R. R. Curtin, M. Edel, M. Lozhnikov, Y. Mentekidis, S. Ghaisas, and S. Zhang. mlpack 3: a fast, flexible machine learning library. Journal of Open Source Software, 3:726, 2018.
 [6] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
 [7] J. W. Eaton, D. Bateman, S. Hauberg, and R. Wehbring. GNU Octave version 4.4.0 manual: a highlevel interactive language for numerical computations, 2018.
 [8] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(12):95–110, 1956.
 [9] N. Hansen and A. Ostermeier. Completely derandomized selfadaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
 [10] M. R. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4(5):303–320, 1969.
 [11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM International Conference on Multimedia, pages 675–678, 2014.
 [12] E. Jones, T. Oliphant, and P. Peterson. SciPy: open source scientific tools for Python, 2014. http://www.scipy.org/.
 [13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
 [14] S. Kirkpatrick, C. G. Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
 [15] S. K. Lam, A. Pitrou, and S. Seibert. Numba: A LLVMbased python JIT compiler. In Second Workshop on the LLVM Compiler Infrastructure in HPC, page 7, 2015.
 [16] J. Langford, L. Li, and A. Strehl. Vowpal wabbit open source project. Technical report, Yahoo!, 2007.
 [17] M. Lichman. UCI Machine Learning Repository, 2013. http://archive.ics.uci.edu/ml.
 [18] D. Maclaurin, D. Duvenaud, and R. P. Adams. Autograd: Effortless gradients in numpy. In AutoML Workshop at ICML, 2015.
 [19] Mathworks. Matlab optimization toolbox, 2017. The MathWorks, Natick, MA, USA.
 [20] P. K. Mogensen and A. N. Riseth. Optim: A mathematical optimization package for Julia. Journal of Open Source Software, 3(24):615, 2018.
 [21] A. Mokhtari, M. Eisen, and A. Ribeiro. IQN: An incremental quasinewton method with local superlinear convergence rate. SIAM Journal on Optimization, 28(2):1670–1698, 2018.
 [22] F. Niu, B. Recht, C. Ré, and S. J. Wright. HOGWILD!: A lockfree approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems, 2011.
 [23] NVIDIA. NVBLAS Library. http://docs.nvidia.com/cuda/nvblas, 2015.
 [24] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. Autodiff Workshop at NIPS, 2017.
 [25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, et al. Scikitlearn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
 [26] J. Revels, M. Lubin, and T. Papamarkou. Forwardmode automatic differentiation in Julia. arXiv:1607.07892, 2016.
 [27] L. Rosen. Open Source Licensing: Software Freedom and Intellectual Property Law. Prentice Hall, 2004.
 [28] H. H. Rosenbrock. An automatic method for finding the greatest or least value of a function. The Computer Journal, 3:175–184, 1960.
 [29] S. Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747, 2016.
 [30] C. Sanderson and R. Curtin. Armadillo: a templatebased C++ library for linear algebra. Journal of Open Source Software, 1:26, 2016.
 [31] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
 [32] Y. Smaragdakis and D. Batory. Mixinbased programming in C++. In International Symposium on Generative and ComponentBased Software Engineering, Lecture Notes in Computer Science, volume 2177, pages 164–178. Springer, 2000.
 [33] S. Sonnenburg, G. Rätsch, S. Henschel, C. Widmer, J. Behr, A. Zien, F. de Bona, A. Binder, C. Gehl, and V. Franc. The SHOGUN machine learning toolbox. Journal of Machine Learning Research, 11:1799–1802, 2010.
 [34] D. Vandevoorde and N. M. Josuttis. C++ Templates: The Complete Guide. AddisonWesley, 2nd edition, 2017.
 [35] M. D. Zeiler. ADADELTA: An adaptive learning rate method. arXiv:1212.5701, 2012.
 [36] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: LBFGSB: Fortran subroutines for largescale boundconstrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.