Jensen: An Easily-Extensible C++ Toolkit for Production-Level Machine Learning and Convex Optimization

07/17/2018, by Rishabh Iyer et al. (University of California, Davis and Microsoft)

This paper introduces Jensen, an easily extensible and scalable toolkit for production-level machine learning and convex optimization. Jensen implements a framework of convex (or loss) functions, convex optimization algorithms (including Gradient Descent, L-BFGS, Stochastic Gradient Descent, Conjugate Gradient, etc.), and a family of machine learning classifiers and regressors (Logistic Regression, SVMs, Least Squares Regression, etc.). This framework makes it possible to deploy and train models with a few lines of code, and also to extend and build upon the toolkit by integrating new loss functions and optimization algorithms.


1 Introduction

In the past few decades, convex optimization has emerged as a key ingredient of machine learning. With its growth and prominence, several convex solvers have been released. Such solvers broadly fall into one of two categories: general solvers and production-level solvers. The former, such as CVX (Grant and Boyd, 2014; Diamond and Boyd, 2016), are often written in a high-level programming language (such as Matlab or Python) and may be used to solve small-scale problems. Such solvers are vital in instructional environments as tools that help developing practitioners gain insight and intuition about convex problems. However, even modest-scale problems pose both memory and runtime difficulties, and large-scale problems are infeasible. On the other hand, production-level software solves convex problems for large-scale machine learning by narrowing the scope of functions that may be represented. For instance, LIBSVM (Chang and Lin, 2011) offers non-linear support vector machine (SVM) learning and, narrowing the scope of represented functions even further, LIBLINEAR (Fan et al., 2008) offers faster model training for the class of linear SVMs.

When the time comes for researchers to transition from a general solver to developing their own domain-specific software, it is highly desirable to call an ideally-suited production-level solver through an API. However, researchers are often met with a number of obstacles when pursuing this route. Firstly, owing to their narrow scope, a production-level solver may not exist for a general objective initially investigated at small scale using a general solver, which necessarily means developing a custom solver from scratch. Secondly, API support varies widely and, in general, requires significant development time and effort to merge into an existing project’s framework. In order to bridge the gap between general solvers and production-level solvers while supporting an extremely user-friendly API, we present Jensen.

Written in C++, Jensen is an open-source, production-level solver for general machine learning. Users may define convex functions, select from a large number of widely-used optimization algorithms, and specify machine learning tasks to build applications capable of handling problems at massive scale. By supporting general convex functions without sacrificing support for large-scale problems, Jensen combines the strengths of production-level software with those of general solvers.

2 Jensen

Jensen allows the intuitive implementation of new optimization algorithms using a blend of linear algebra (similar to Matlab syntax) and C++ object-oriented syntax, similar to (but not as specialized as) the domain-specific language (DSL) offered by CVX. We call this blend a pseudo-domain-specific language (PDSL). Coupled with the ability to construct and easily select among many different optimization algorithms at runtime, Jensen adds a broad degree of algorithmic design flexibility to suit different application needs (e.g., stochastic or batch learning, first-order or quasi-Newton methods) without extensive recoding (demonstrated in Section 2.1).

2.1 Software Design

The toolkit follows a highly modular design where different optimization algorithms, objective functions, and machine learning tasks may be combined to quickly produce tailor-made applications. Training and testing (including cross-validation) for classification (binary and multi-class) or regression are supported for a large number of popular loss functions (listed in Appendix Table 2) and state-of-the-art optimization algorithms (listed in Appendix Table 3). Algorithms may be intuitively and compactly expressed using Jensen’s PDSL, without any sacrifice in runtime performance. For example, gradient descent for arbitrary convex functions is defined in only 14 lines (illustrated below) and conjugate gradient descent for arbitrary convex functions is defined in only 41 lines.

Vector gd(const ContinuousFunctions& c, const Vector& x0, const double alpha,
          const int maxEval, const double TOL, const int verbosity){
        Vector x(x0), g;
        double f; // function value
        c.eval(x, f, g); // evaluate f(x), compute gradient g
        double gnorm = norm(g);
        int funcEval = 1;
        while ((gnorm >= TOL) && (funcEval < maxEval) ){
                multiplyAccumulate(x, alpha, g); // x = x - alpha*g
                c.eval(x, f, g);
                funcEval++;
                gnorm = norm(g);
        }
        return x;
}
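
To illustrate how readily the PDSL extends to new algorithms, the sketch below adds a simple backtracking (Armijo) line search on top of the same primitives used in gd above (the Vector type, norm, multiplyAccumulate, and c.eval). The routine and its name gdBacktracking are hypothetical and are not meant to reproduce the shipped gdLineSearch; the sketch also assumes that Vector supports copy assignment.

Vector gdBacktracking(const ContinuousFunctions& c, const Vector& x0, const double alpha0,
                      const double gamma, const int maxEval, const double TOL){
        Vector x(x0), g;
        double f; // function value
        c.eval(x, f, g); // evaluate f(x), compute gradient g
        double gnorm = norm(g);
        int funcEval = 1;
        while ((gnorm >= TOL) && (funcEval < maxEval)){
                double t = alpha0; // reset the trial step size each iteration
                Vector xnew(x), gnew;
                double fnew;
                multiplyAccumulate(xnew, t, g); // xnew = x - t*g
                c.eval(xnew, fnew, gnew);
                funcEval++;
                // backtrack until the Armijo sufficient-decrease condition holds;
                // ||g||^2 equals the inner product g'g, so only norm() is needed here
                while ((fnew > f - gamma*t*gnorm*gnorm) && (funcEval < maxEval)){
                        t = 0.5*t;
                        xnew = x;
                        multiplyAccumulate(xnew, t, g); // xnew = x - t*g
                        c.eval(xnew, fnew, gnew);
                        funcEval++;
                }
                x = xnew; f = fnew; g = gnew;
                gnorm = norm(g);
        }
        return x;
}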

When building deployable machine learning applications, at the highest design level, machine learning modules call user-specified loss functions. During training, the loss function calls the desired optimization algorithm. For instance, using the OWL-QN algorithm (Andrew and Gao, 2007), the following trains and then tests an ℓ1-regularized logistic regression classifier with training file trainFile and testing file testFile in LIBSVM format:

int algtype = 0; // choose the OWL-QN algorithm
double lambda = 1.0; // regularization strength
int nClasses = 2; // number of classes (binary classification assumed here)
int ntrain, mtrain, ntest, mtest; // number of training/test instances and features
vector<struct SparseFeature> trainFeatures, testFeatures; // feature vectors
Vector ytrain, ytest; // labels
readFeatureLabelsLibSVM(trainFile, trainFeatures, ytrain, ntrain, mtrain);
readFeatureLabelsLibSVM(testFile, testFeatures, ytest, ntest, mtest);
Classifiers<SparseFeature>* c = new
    L1LogisticRegression<SparseFeature>(trainFeatures, ytrain, mtrain, ntrain, nClasses,
                                        lambda, algtype);
c->train();
double accuracy = predictAccuracy(c, testFeatures, ytest);

The solver-selection process is streamlined at the machine-learning design level, where varying algtype in the earlier example selects among the various solvers. The solvers themselves operate over arbitrary continuous functions. For instance, the classifier trained earlier is equivalent to instantiating an ℓ1-regularized logistic-loss function and passing it to the desired solver, as follows:

Vector x0 = Vector(mtrain, 0); // initial starting point
double alpha = 1.0; // initial step size
double gamma = 1e-4; // back-tracking parameter used by line-search solvers
int maxIter = 250;
int lbfgsMemory = 100; // budget for limited-memory solvers
double eps = 1e-3; // convergence tolerance
L1LogisticLoss<SparseFeature> ll(mtrain, trainFeatures, ytrain, lambda);
lbfgsMinOwl(ll, x0, alpha, gamma, maxIter, lbfgsMemory, eps);

Each module is self-contained and may be used for general tasks; at its core, Jensen is a solver library optimizing arbitrary continuous functions using general optimization algorithms. Thus, Jensen may not only be used to quickly deploy new machine learning applications, but also to easily investigate convex optimization problems in an instructional setting. For instance, the earlier call to lbfgsMinOwl may be replaced with an out-of-the-box solver or replaced with a user-constructed solver written using Jensen’s PDSL. Note the ease with which the APIs of Jensen data types are called in the previous examples. The major abstract data types are further detailed in Appendix Section B.
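
As a concrete illustration of such a swap, the single line below reuses the gd routine from Section 2.1 on the loss object ll and starting point x0 defined above; the fixed step size is illustrative only, and the call relies on gd's trailing verbosity argument taking its default value.

// replace the OWL-QN call with plain (sub)gradient descent on the same loss object
Vector xopt = gd(ll, x0, 1e-4, maxIter, eps); // 1e-4 is an illustrative step size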

3 Experiments

We compare the speed of Jensen to the general solver CVXPY and production-level solvers LIBLINEAR and LIBSVM using three datasets ranging from small- to large-scale: the small-scale dataset ijcnn1 containing 22 features and 35,000 training instances, the mid-scale dataset rcv1_train containing 677,399 features and 20,242 training instances, and the large-scale dataset news20 containing 1,355,191 features and 19,996 training instances. Each dataset is freely available for download at the LibSVM Data page.

Dataset      Objective             Jensen   LIBLINEAR   CVXPY    LIBSVM
ijcnn1       Logistic Regression     0.50      0.27     221.72      *
             L2-SVM, Primal          0.64      0.23       4.03      *
             L2-SVM, Dual            0.51      0.39        --       *
             L1-SVM, Primal          0.45        *         4.13     *
             L1-SVM, Dual            0.50      0.31        6.04    14.74
rcv1_train   Logistic Regression     1.27      0.85     270.16      *
             L2-SVM, Primal          2.26      0.66      31.83      *
             L2-SVM, Dual            0.97      0.77        --       *
             L1-SVM, Primal          1.88        *        31.92     *
             L1-SVM, Dual            0.95      1.11        --      52.78
news20       Logistic Regression     7.64      4.80        --       *
             L2-SVM, Primal         28.69      5.81        --       *
             L2-SVM, Dual            4.72      4.09        --       *
             L1-SVM, Primal         23.40        *         --       *
             L1-SVM, Dual            4.71      5.55        --     318.29
Table 1: Solver runtimes (in seconds) for several popular training objective functions evaluated over datasets of increasing scale. As used by LIBLINEAR, L1-SVM denotes the SVM using the hinge loss and L2-SVM denotes the SVM using the squared hinge loss. Each formulation is ℓ2-regularized. * denotes formulations not supported by the solver; -- denotes formulations for which CVXPY failed to complete, either exhausting all system memory (rendering the problem infeasible) or exiting before completion without further diagnostic detail.

Runtimes for several popular machine learning objective functions are reported in Table 1, where each objective includes ℓ2-regularization. All tests were run on the same machine with an Intel Xeon E5-2620 (clocked at 2.1 GHz) and 64 GB of memory. Tests were run using LIBLINEAR version 2.11, LIBSVM version 3.22, and CVXPY version 0.4.10. For Jensen, LIBSVM, and LIBLINEAR, wall-clock runtimes are reported. Due to the excessive overhead incurred by formulating problems in CVXPY (e.g., the dual L1-SVM objective for the small-scale dataset ijcnn1 required 3.46 hours to formulate), CVXPY runtimes report only the elapsed time of calling a formulated problem’s solve() method. Jensen experiments were run using several different optimization algorithms to demonstrate the toolkit’s breadth: L-BFGS (Wright and Nocedal, 1999) for primal SVM objectives, TRON (Lin et al., 2008) for logistic regression, and dual coordinate descent (Hsieh et al., 2008) for dual SVM objectives. All methods were run with the same convergence tolerance; however, dual problems in CVXPY were subject to numerical issues, requiring a tighter tolerance to obtain accurate solutions. Logistic regression and primal SVM experiments using CVXPY were run using the SCS solver, which was found to be faster than the other CVXPY solvers (by three orders of magnitude in some cases), while dual SVM experiments were run using the ECOS solver (found to be slightly faster than SCS for this problem). The training accuracies of the learned parameters were similar across all completed methods, agreeing to two significant figures for every dataset, with the exception of CVXPY, which achieved one percent higher accuracy than its competitors for logistic regression on rcv1_train.

4 Conclusions

We have described Jensen, a flexible open-source library for scalable machine learning and convex optimization. The source code of Jensen is available at https://github.com/rishabhk108/jensen. Jensen does not rely on external packages and is supported on Unix, Windows, and OS X operating systems.

Acknowledgments

This work was partially supported by NIH NCATS grant UL1 TR001860.

References

  • Andrew and Gao (2007) Galen Andrew and Jianfeng Gao. Scalable training of L1-regularized log-linear models. In Proceedings of the 24th International Conference on Machine Learning, pages 33–40. ACM, 2007.
  • Chang and Lin (2011) Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, 2011. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
  • Diamond and Boyd (2016) Steven Diamond and Stephen Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016. URL http://jmlr.org/papers/v17/15-408.html.
  • Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • Fan et al. (2008) Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9(Aug):1871–1874, 2008.
  • Grant and Boyd (2014) Michael Grant and Stephen Boyd. CVX: Matlab software for disciplined convex programming, version 2.1. http://cvxr.com/cvx, March 2014.
  • Hsieh et al. (2008) Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pages 408–415. ACM, 2008.
  • Lin et al. (2008) Chih-Jen Lin, Ruby C. Weng, and S. Sathiya Keerthi. Trust region Newton method for logistic regression. Journal of Machine Learning Research, 9(Apr):627–650, 2008.
  • Roux et al. (2012) Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
  • Wright and Nocedal (1999) Stephen Wright and Jorge Nocedal. Numerical Optimization. Springer Science, 35:67–68, 1999.
  • Xiao (2010) Lin Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11(Oct):2543–2596, 2010.
  • Yuan et al. (2010) Guo-Xun Yuan, Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. A comparison of optimization methods and software for large-scale L1-regularized linear classification. Journal of Machine Learning Research, 11(Nov):3183–3234, 2010.

Appendix A Jensen Loss Functions and Optimization Algorithms

Loss function Description
L2LeastSquaresLoss ℓ2-regularized least squares loss.
L1LeastSquaresLoss ℓ1-regularized least squares loss.
L2LogisticLoss ℓ2-regularized logistic loss.
L1LogisticLoss ℓ1-regularized logistic loss.
L2ProbitLoss ℓ2-regularized probit loss.
L1ProbitLoss ℓ1-regularized probit loss.
L2HingeSVMLoss ℓ2-regularized linear L1-SVM loss (i.e., hinge loss).
L1HingeSVMLoss ℓ1-regularized linear L1-SVM loss.
L2SmoothSVMLoss ℓ2-regularized linear L2-support vector machine loss (i.e., squared hinge loss).
L1SmoothSVMLoss ℓ1-regularized linear L2-SVM loss.
L2HuberSVMLoss ℓ2-regularized linear SVM with Huber loss.
L1HuberSVMLoss ℓ1-regularized linear SVM with Huber loss.
L2HingeSVRLoss ℓ2-regularized linear L1-support vector regression (SVR) loss.
L1HingeSVRLoss ℓ1-regularized linear L1-SVR loss.
L2SmoothSVRLoss ℓ2-regularized linear L2-SVR loss.
L1SmoothSVRLoss ℓ1-regularized linear L2-SVR loss.
Table 2: Out-of-the-box Jensen loss functions. L1- and L2-SVMs are described at length in Fan et al. (2008), where L1-SVM utilizes the standard hinge loss and L2-SVM uses the hinge loss squared (rendering it differentiable).
Algorithm name Description
gd Gradient descent (GD) with static step size.
gdLineSearch GD with backtracking line search.
gdBarzilaiBorwein GD with backtracking line search and Barzilai-Borwein step length.
gdNesterov GD with Nesterov’s accelerated method.
lbfgsMin L-BFGS for Quasi-Newton optimization (Wright and Nocedal, 1999).
lbfgsMinOwl Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) (Andrew and Gao, 2007) for ℓ1-regularized Quasi-Newton optimization.
tron Trust Region Newton (Lin et al., 2008).
SVCDual Dual coordinate descent algorithm for SVMs (Hsieh et al., 2008).
sgd Stochastic gradient descent (SGD) with static step size.
sgdDecayingLearningRate SGD with a specified rate of decay.
sgdAdagrad SGD with AdaGrad (Duchi et al., 2011).
sgdStochasticAverageGradient Stochastic average gradient (SAG) descent (Roux et al., 2012).
sgdRegularizedDualAveraging SGD using the regularized dual-averaging algorithm (Xiao, 2010) for ℓ1 regularization.
sgdRegularizedDualAveragingAdagrad SGD using the regularized dual-averaging algorithm and AdaGrad.
Table 3: Out-of-the-box Jensen optimization algorithms.

Appendix B Jensen Data Types

B.1 Objective (Loss) Functions

The continuous function abstract base class defines operations such as function evaluation, computation of the (sub)gradient, and computation of the (sub)Hessian. Derived classes correspond to loss (or arbitrary objective) functions such as the ℓ1- and ℓ2-regularized logistic and hinge losses. Once a new loss function is derived, any Jensen optimization algorithm (discussed in Section B.2) may be used. Natively implemented loss functions are listed in Table 2.
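
As a minimal sketch of the value/gradient computation such a derived class has to provide, the hypothetical helper below evaluates a simple ridge-style quadratic objective, 0.5*||x - b||^2 + 0.5*lambda*||x||^2, using the same (x, f, g) contract as c.eval in Section 2.1. In the toolkit itself this logic would live inside a class derived from the continuous-function base class (see the existing losses under $JENSEN/src/optimization/contFunctions for the exact boilerplate); the free function shown here is illustrative only and assumes Vector behaves like std::vector<double> (size(), element access, assignment).

// Hypothetical helper: value and gradient of 0.5*||x - b||^2 + 0.5*lambda*||x||^2.
// In Jensen proper this body would sit inside a class derived from the
// continuous-function base class, exposed through its eval(x, f, g) interface.
void evalRidgeQuadratic(const Vector& x, const Vector& b, const double lambda,
                        double& f, Vector& g){
        f = 0;
        g = Vector(x.size(), 0);
        for (int i = 0; i < (int) x.size(); i++){
                double r = x[i] - b[i];
                f += 0.5*r*r + 0.5*lambda*x[i]*x[i];
                g[i] = r + lambda*x[i]; // derivative of both quadratic terms w.r.t. x[i]
        }
}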

B.2 Optimization Algorithms

Optimization algorithms are defined to act on continuous functions. This generality means that a specific algorithm needs to be defined only once in order to optimize any continuous loss function, which substantially cuts down debugging and coding time while greatly streamlining the codebase (the optimization algorithms themselves are also very compact). It also means that once a user instantiates a loss function, any Jensen optimization algorithm may be used as a solver. Natively implemented optimization algorithms are listed in Table 3.

B.3 Machine Learning

With the efficient optimization of arbitrary loss functions, machine learning is readily applicable to real-world problems. In Jensen, (multi-class) classification and regression are natively supported so that classifiers and regressors may be customized as desired. For instance, if classification using ℓ1-regularized logistic regression trained with the OWL-QN algorithm is desired, all that is necessary is to instantiate the continuous function L1LogisticLoss and specify the lbfgsMinOwl solver; this example may be found in
$JENSEN/src/machinelearning/logisticRegression/L1LogisticRegression.cc, where $JENSEN is the directory into which the toolkit has been cloned or uncompressed (further discussed in Appendix Section C).

Appendix C Build Process and Examples

For the discussion that follows, we assume that the toolkit source has been cloned (or uncompressed) to directory $JENSEN and that commands are run using Bash (denoted by a preceding $). Linking and building for Jensen are handled using CMake. After downloading the codebase, the following builds the example applications found in $JENSEN/test to directory $JENSEN/build:

$ cd $JENSEN
$ mkdir build
$ cd build
$ cmake .. && make

The examples in $JENSEN/test demonstrate API use for all out-of-the-box loss functions (listed in Table 2) and optimization algorithms (listed in Table 3) evaluated over real-world data (in $JENSEN/data). For instance, $JENSEN/test/TestL2LogisticLoss.cc shows how to instantiate the loss function for ℓ2-regularized logistic regression and optimize this function using all relevant out-of-the-box solvers (i.e., gradient descent, TRON, L-BFGS, stochastic gradient descent, etc.). From $JENSEN/test/TestL2LogisticLoss.cc, the following code excerpt loads the dataset, instantiates the loss function, and uses gradient descent to optimize it:

⋮
char* featureFile = "../data/20newsgroup.feat";
char* labelFile = "../data/20newsgroup.label";
int n; // number of data items
int m; // numFeatures
vector<struct SparseFeature> features = readFeatureVectorSparse(featureFile, n, m);
Vector y = readVector(labelFile, n);
L2LogisticLoss<SparseFeature> ll(m, features, y, 1); // regularization strength lambda = 1

int numEpochs = 50;
double stepSize = 1e-5;
double f;
Vector x0(m, 0), g, x;
⋮
x = gd(ll, x0, stepSize, numEpochs);
After linking and building, running the binary $JENSEN/build/TestL2LogisticLoss optimizes the ℓ2-regularized logistic loss function with a large number of optimization algorithms. Similar respective binaries are built for all loss functions (and the applicable out-of-the-box optimization algorithms) listed in Table 2.
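
For example, assuming the build steps of the previous section completed successfully, the binary can be run directly from the build directory (from which the relative data paths used in the test sources resolve to $JENSEN/data):

$ cd $JENSEN/build
$ ./TestL2LogisticLoss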

The use of CMake makes linking and building user-customized modules simple; in the previous example, linking the ℓ2-regularized logistic loss function and defining the test executable to be built are specified in $JENSEN/CMakeLists.txt:

add_library(jensen
⋮
src/optimization/contFunctions/L2LogisticLoss.cc
⋮
)

add_executable(TestL2LogisticLoss test/TestL2LogisticLoss.cc)
target_link_libraries(TestL2LogisticLoss jensen)
New user-defined loss functions, optimization algorithms, and machine learning applications may be easily added, linked to the toolkit library, and built in a similar manner.
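
For instance, a hypothetical user-defined loss MyCustomLoss.cc and an accompanying test driver could be wired in with entries analogous to those above; the file names below are placeholders rather than files shipped with the toolkit.

add_library(jensen
⋮
src/optimization/contFunctions/MyCustomLoss.cc   # hypothetical user-defined loss
⋮
)

add_executable(TestMyCustomLoss test/TestMyCustomLoss.cc)
target_link_libraries(TestMyCustomLoss jensen)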

C.1 Machine Learning Application Examples

Examples of building deployable applications using Jensen’s machine learning APIs (which combine arbitrary loss functions and optimization algorithms to perform machine learning tasks) and command-line/file-handling utilities are also available in $JENSEN/test. For example, $JENSEN/test/TestClassification.cc builds a classifier which accepts as command-line input the training files, the testing files, the loss function to be evaluated, the optimization algorithm to be utilized, and various optimization options. After linking and building, running the following trains and then evaluates the accuracy of an ℓ1-regularized L2-SVM classifier (with regularization strength 0.25) trained using the OWL-QN algorithm (Andrew and Gao, 2007):

$ cd $JENSEN/build
$ ./TestClassification \
-method 3 -algtype 0 -reg 0.25 -nClasses 20 \
-maxIter 1000 -startwith1 true \
-trainFeatureFile ../data/20newsgroup.feat \
-trainLabelFile ../data/20newsgroup.label \
-testFeatureFile ../data/20newsgroup.feat \
-testLabelFile ../data/20newsgroup.label

In the above example, varying -method alters the objective function used for training and testing, and varying -algtype alters the training algorithm utilized. Thus, the generality and flexibility of the toolkit facilitate the rapid evaluation of different objective functions and training algorithms given new data. Note that the source file for this example is easily customizable (including its command-line options) and serves as a template for users to create tailor-made applications for large-scale, real-world data. Other deployable application examples in $JENSEN/test include TestClassificationCrossVal.cc and TestClassificationLibSVM.cc, which demonstrate cross-validation and LIBSVM file-format support, respectively.