POLO: a POLicy-based Optimization library

10/08/2018, by Arda Aytekin et al., KTH Royal Institute of Technology

We present POLO --- a C++ library for large-scale parallel optimization research that emphasizes ease-of-use, flexibility and efficiency in algorithm design. It uses multiple inheritance and template programming to decompose algorithms into essential policies and facilitate code reuse. With its clear separation between algorithm and execution policies, it provides researchers with a simple and powerful platform for prototyping ideas, evaluating them on different parallel computing architectures and hardware platforms, and generating compact and efficient production code. A C-API is included for customization and data loading in high-level languages. POLO enables users to move seamlessly from serial to multi-threaded shared-memory and multi-node distributed-memory executors. We demonstrate how POLO allows users to implement state-of-the-art asynchronous parallel optimization algorithms in just a few lines of code and report experiment results from shared and distributed-memory computing architectures. We provide both POLO and POLO.jl, a wrapper around POLO written in the Julia language, at https://github.com/pologrp under the permissive MIT license.


1 Introduction

Wide adoption of Internet of Things (IoT) enabled devices, as well as developments in communication and data storage technologies, have made the collection, transmission, and storage of bulks of data more accessible than ever. Commercial cloud-computing providers, which traditionally offer their available computing resources at data centers to customers, have also started supporting low-power IoT-enabled devices at the customers' site in their cloud ecosystems [2018-Greengrass, 2018-IoTEdge, 2018-IoTHub]. As a result, we are experiencing an increase in not only the problem dimensions of data-driven machine-learning applications but also the variety of computing architectures on which these applications are deployed.

There is no silver bullet for solving machine-learning problems. Because the problems evolve continuously, one needs to design new algorithms that are tailored to the specific problem structure and exploit all available computing resources. However, algorithm design is a non-trivial process. It consists of prototyping new ideas, benchmarking their performance against that of the state of the art, and finally deploying the successful ones in production. Many ideas today are prototyped in high-level languages such as MATLAB, Python and Julia, and it is very rare that researchers implement their algorithms in lower-level languages such as C and C++. Even when they do, these prototypes are often incomplete, have different abstractions and coding styles, and are hand-crafted for specific problem-platform combinations. Ultimately, this results in either performance degradation when high-level implementations are used in production, or high engineering costs attached to rewriting the low-level implementations from scratch for each different problem-platform combination.

Currently, there exist numerous machine-learning libraries, each targeting a different need. Libraries such as PyTorch/Caffe2 [2018-PyTorch], Theano [2016-Team] and TensorFlow [2016-Abadi] primarily target deep-learning applications, and support different powerful computing architectures. High-level neural-network frameworks such as Keras [2015-Chollet] provide user-friendly interfaces to backends such as Theano and TensorFlow. Even though these libraries implement many mature algorithms for solving optimization problems resulting from deep-learning applications, they fall short when one needs to prototype and benchmark novel solvers [2017-Curtin, 2018-Vasilache]. Algorithm designers need to either use the provided communication primitives explicitly [2018-PyTorch] or interfere with the computation-graph facilities [2016-Team, 2016-Abadi] to write their algorithms from scratch. Recent libraries and domain-specific languages such as Ray [2017-Moritz] and Tensor Comprehensions [2018-Vasilache] aim at extending the capabilities of the computation-graph facilities of the aforementioned libraries, and at targeting users' custom needs regarding network architectures and data shapes, respectively. More lightweight options such as mlpack [2017-Curtin] and Jensen [2018-Iyer], on the other hand, aim at providing more generic optimization frameworks in C++. Unfortunately, these options are not centered around algorithm design, either. Although both provide a good selection of predefined algorithms, these algorithms are implemented for serial [2018-Iyer] or, to a limited extent, single-machine parallel [2017-Curtin] computations. Designers still need to rewrite these implementations for, say, multi-machine parallel computations. In short, we believe there is a need for a generic optimization platform that facilitates prototyping, benchmarking and deployment of algorithms on different platforms without much performance penalty.

In this paper, we present POLO — an open-source, header-only C++ library that uses the policy-based design technique [2005-Alexandrescu]. It consists of two primary layers (Figure 1). The utilities layer builds on standard libraries and lightweight third-party libraries, and offers a set of well-defined primitives for atomic floating-point operations, matrix algebra, communication and serialization, logging and dataset reading. Throughout the development, we have put special emphasis on making the utilities layer portable and efficient. The algorithms layer, on the other hand, builds on the utilities layer and implements different families of algorithms. The library abides by modern C++ design principles [2005-Alexandrescu], and follows ideas from recent parallel algorithms library proposals [2013-Hoberock, 2015-Hoberock] to

  • decouple optimization algorithm building blocks from system primitives,

  • facilitate code reuse, and,

  • generate tight and efficient code.

In the rest of the paper, we introduce POLO in more detail. In Section 2, we motivate the design of POLO based on our observations from a family of optimization algorithms. In Section 3, we provide detailed information about the design of the library and the supported computing architectures. In Section 4, we show how some of the state-of-the-art algorithms can be quickly prototyped in POLO together with their performance comparisons against each other. Finally, in Section 6, we conclude the paper with further discussions.

Figure 1: Structure of POLO. The utilities layer provides thin wrappers for the essential functionalities provided by the standard and third-party libraries, and implements custom ones, which are then used in the algorithms layer.

2 Motivation

To demonstrate the power of policy-based design for distributed optimization library development, we consider regularized optimization problems of the form

    minimize_{x ∈ ℝ^d}  F(x) := f(x) + h(x).    (1)

Here, f is a differentiable function of the decision vector x, and h is a possibly non-differentiable function. In machine-learning applications, the smooth part of the loss, f, typically represents the empirical data loss, and is a finite sum of loss functions,

    f(x) = (1/N) Σ_{n=1}^{N} f_n(x),

while the non-smooth part, h, is a regularizer that encourages certain properties (such as sparsity) of the optimal solution.
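As a concrete instance of this composite structure (a standard textbook example, not one specific to the paper), consider ℓ1-regularized logistic regression over N data points (a_n, b_n) with labels b_n ∈ {−1, +1}:

```latex
f(x) = \frac{1}{N} \sum_{n=1}^{N} \log\!\left(1 + \exp(-b_n \, a_n^{\top} x)\right),
\qquad
h(x) = \lambda \lVert x \rVert_1, \quad \lambda > 0,
```

where f is smooth while h is non-smooth but admits an inexpensive prox-mapping.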

One approach for solving problems of the form (1) is to use proximal gradient methods. The basic proximal gradient iteration is

    x_{k+1} = argmin_{x} { f(x_k) + ⟨∇f(x_k), x − x_k⟩ + h(x) + (1/(2γ)) ‖x − x_k‖² },    (2)

where γ > 0 is a step-size parameter. Thus, the next iterate, x_{k+1}, is selected to be the minimizer of the sum of the first-order approximation of the differentiable loss function around the current iterate, x_k, the non-differentiable loss function, and a quadratic penalty on the deviation from the current iterate [2017-Beck]. After some algebraic manipulations, one can rewrite (2) in terms of the proximal operator [2017-Beck]:

    x_{k+1} = prox_{γh}(x_k − γ∇f(x_k)),  where  prox_{γh}(v) = argmin_{x} { h(x) + (1/(2γ)) ‖x − v‖² }.

As a result, the method can be interpreted as a two-step procedure: first, a query point is computed by modifying the current iterate in the direction of the negative gradient, and then the prox operator is applied to this query point.

Even though the proximal gradient method described in (2) looks involved, in the sense that it requires solving an optimization problem at each iteration, the prox-mapping can actually be evaluated very efficiently for several important functions h. Together with its strong theoretical convergence guarantees, this makes the proximal gradient method a favorable option in many applications. However, the gradient calculation step in the vanilla proximal gradient method can be prohibitively expensive when the number of data points (N) or the dimension of the decision vector (d) is large. To improve the scalability of the proximal gradient method, researchers have long tried to come up with ways of parallelizing the proximal gradient computations and with more clever query points than the simple gradient step in (2) [2007-Blatt, 2016-Aytekin, 2014-Defazio, 1964-Polyak, 1983-Nesterov, 2011-Duchi, 2012-Zeiler, 2014-Kingma, 2016-Dozat, 2015-Sra, 2011-Recht, 2016-Leblond, 2017-Pedregosa]. As a result, the proximal gradient family encompasses a large variety of algorithms. We have listed some of the more influential variants in Table 1.
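To make the claim about efficiently evaluable prox-mappings concrete (a standard result, stated here for completeness): for the ℓ1 regularizer h(x) = λ‖x‖₁ with step size γ, the prox-mapping decouples across coordinates into closed-form soft-thresholding,

```latex
\left[\operatorname{prox}_{\gamma h}(v)\right]_i
  = \operatorname{sign}(v_i) \, \max\{ \lvert v_i \rvert - \gamma \lambda, \, 0 \},
\qquad i = 1, \dots, d,
```

so the prox step costs only O(d) operations per iteration.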

2.1 A Look at Proximal Gradient Methods

Algorithm boosting smoothing step prox execution
SGD , s, cr, ps
IAG [2007-Blatt] aggregated s, cr, ps
PIAG [2016-Aytekin] aggregated s, cr, ps
SAGA [2014-Defazio] saga s, cr, ps
Momentum [1964-Polyak] classical s
Nesterov [1983-Nesterov] nesterov s
AdaGrad [2011-Duchi] adagrad s
AdaDelta [2012-Zeiler] adadelta s
Adam [2014-Kingma] classical rmsprop s
Nadam [2016-Dozat] nesterov rmsprop s
AdaDelay [2015-Sra] s, cr, ps
HOGWILD! [2011-Recht] ir
ASAGA [2016-Leblond] saga ir
ProxASAGA [2017-Pedregosa] saga ir
Table 1: Some members of the proximal gradient family. Under the execution column, s, cr, ir and ps stand for serial, consistent read/write, inconsistent read/write and Parameter Server [2013-Li], respectively.

A careful review of the serial algorithms in the proximal gradient family reveals that they differ from each other in their choices of five distinctive algorithm primitives: (1) which gradient surrogate they use; (2) how they combine multiple gradient surrogates to form a search direction, a step we refer to as boosting; (3) how this search direction is filtered or scaled, which we call smoothing; (4) which step-size policy they use; and (5) the type of projection they do in the prox step. For instance, stochastic gradient descent (SGD) algorithms use partial gradient information coming from individual loss functions (f_n) or decision vector coordinates as the gradient surrogate at each iteration, whereas their aggregated versions [2007-Blatt, 2016-Aytekin, 2014-Defazio] accumulate previous partial gradient information to boost the descent direction. Similarly, different momentum-based methods such as the classical [1964-Polyak] and Nesterov's [1983-Nesterov] momentum accumulate the full gradient information over iterations. Algorithms such as AdaGrad [2011-Duchi] and AdaDelta [2012-Zeiler], on the other hand, use the second-moment information from the gradient and the decision vector updates to adaptively scale, i.e., smooth, the gradient surrogates. Popular algorithms such as Adam [2014-Kingma] and Nadam [2016-Dozat], available in most machine-learning libraries, incorporate both boosting and smoothing to get better update directions. Algorithms in the serial setting can also use different step-size policies and projections independently of the choices above, which results in the pseudocode representation of these algorithms given in Algorithm 1.

Data: Differentiable functions, f_1, …, f_N; regularizer, h.
Input: Initial decision vector, x_0; step size, γ.
Output: Final decision vector, x.
1  k ← 1;
2  x ← x_0;
3  while not_done(k, fval, x, g) do
4      g ← gradient_surrogate(x, k);    // partial or full gradient
5      g ← boosting(g, k);              // optional
6      g ← smoothing(x, g, k);          // optional
7      γ_k ← step(k, fval, x, g);
8      x ← prox_{γ_k h}(x − γ_k g);     // prox step
9      k ← k + 1;
10 end while
11 return x;
Algorithm 1 Serial implementation of proximal gradient methods.

Most of the serial algorithms in this family either have direct counterparts in the parallel setting or can be extended to support parallel execution. However, adding parallel execution support brings yet another layer of complexity to algorithm prototyping. First, there are a variety of parallel computing architectures to consider, from shared-memory and distributed-memory environments with multiple CPUs to distributed-memory heterogeneous computing environments which involve both CPUs and GPUs. Second, some of these environments, such as the shared-memory architectures, offer different choices in how to manage race conditions. For example, some algorithms [2007-Blatt, 2016-Aytekin, 2014-Defazio, 2015-Sra] choose to use mutexes to consistently read from and write to the shared decision vector from different processes, whereas others [2011-Recht, 2016-Leblond, 2017-Pedregosa] prefer atomic operations to allow for inconsistent reads and writes. Finally, the choice of a specific computing architecture might constrain choices in other algorithm primitives. For instance, if the algorithm is to run on a distributed-memory architecture such as the Parameter Server [2013-Li], where only parts of the decision vector and data are stored in individual nodes, then only updates and projections that embrace data locality can be used.

2.2 Algorithmic perspective

Out of the large number of proximal gradient methods proposed in the literature, only very few have open-source low-level implementations, and the existing codes are difficult to extend. This means that algorithm designers need to reimplement others' work to benchmark their ideas, even when the ideas differ from each other only in a few aspects. Still, our observations in the previous subsection reveal that algorithm design can be seen as a careful selection of a limited number of compatible policies; see Figure 2.

Figure 2: Tree representation of different choices in proximal gradient methods. Both Adam [2014-Kingma] and Nadam [2016-Dozat] use rmsprop smoothing, constant step and none prox policies on serial executors. They differ from each other in their respective choices of classical and nesterov boosting policies. Recent research [2018-Reddi] suggests that using a different smoothing policy, i.e., amsgrad, results in better performance compared to Adam-like algorithms.

POLO aims to be an optimization library centered around algorithm design, one that facilitates rapid experimentation with new ideas without sacrificing performance. It has a clear separation between algorithm and system primitives so as to (1) allow new ideas to be prototyped in as few lines of code as possible, and (2) make benchmarking the resulting algorithms against each other on different platforms as simple as possible. Ultimately, this will lead to faster development of new algorithms as well as easily reproducible results.

2.3 Implementation perspective

Algorithm development is an iterative process of designing and analyzing an algorithm, prototyping and benchmarking it on small-sized test problems, and making modifications until it performs well. Successful designs are then used to solve larger problems. This workflow typically involves implementing and testing the algorithm on different computing environments, e.g., multi-core personal computers, dedicated servers, on-demand cloud-computing nodes, and even clusters of low-power IoT-enabled devices.

From an implementer's perspective, we would like to have a flexible library that allows us to implement a growing number of algorithm primitives on a diversity of computing architectures. For this reason, we aim to build a low-level library with a small code footprint, high performance and as few dependencies as possible, which would also support high-level language integrations and cloud-based solutions.

3 Design

Our observations on different platforms and algorithms in the previous sections reveal a trade-off in algorithm design. Decomposing algorithms into their small, essential policies facilitates behavior isolation and code reuse at the expense of combinatorially increasing design choices (Figure 2). To facilitate code reuse while handling the combinatorial explosion of choices, POLO relies on policy-based design [2005-Alexandrescu], which combines multiple inheritance and template programming in a clever way.

To visualize how policy-based design can be used to implement optimization algorithms, let us first assume that we would like to compute functions of the average of a streaming sequence of vectors. In languages that support both multiple inheritance and template programming, such as C++, one can define this functionality in terms of its two primitives, the outer function (fbar) and the averaging operation (ave), as given in Listing 1.

/* Template class for f */
template <class value_t, template <class> class fbar,
          template <class> class ave>
struct f : fbar<value_t>, ave<value_t> {
  value_t operator()(const vector<value_t> &x) {
    vector<value_t> temp(x.size());
    ave<value_t>::operator()(begin(x), end(x), begin(temp));
    return fbar<value_t>::operator()(begin(temp), end(temp));
  }
};

/* Exponential Moving Average */
template <class value_t> struct ema {
  template <class InputIt, class OutputIt>
  OutputIt operator()(InputIt xb, InputIt xe, OutputIt tb) {
    size_t idx{0};
    while (xb != xe) {
      average[idx] = (1 - alpha) * average[idx] + alpha * *xb++;
      *tb++ = average[idx++];
    }
    return tb;
  }

  void set_parameters(const value_t alpha) { this->alpha = alpha; }

private:
  value_t alpha;
  vector<value_t> average;
};

/* Sum of squared elements of a vector */
template <class value_t> struct sumsquared {
  template <class InputIt>
  value_t operator()(InputIt xb, InputIt xe) {
    value_t result{0};
    while (xb != xe) {
      result += *xb * *xb;
      xb++;
    }
    return result;
  }
};
Listing 1: Simple example that visualizes how policy-based design can be used to support functions operating on the average of a streaming vector sequence. Common methods for construction, initialization and destruction are omitted for brevity.

Above, the library code first defines a template class for the function f, which publicly inherits from its generic template parameters fbar and ave, both of which are templated against the value type (value_t) they work on. The library implements one concrete policy for each function primitive. As can be observed, f can work on different precision value types such as float, double and fixed, and its policy classes can have internal states (average) and parameters (alpha) that need to be persistent and modifiable. Note how f is merely a shell that integrates its policy classes and determines how they should be applied to achieve the desired transformation, while not knowing their internal details.

A user of the library can then mix and match the provided policy classes to achieve the desired functionality. The compiler assembles only the functionality that has been requested and optimizes the code for the specified value types. If the user wants a different averaging function, such as the cumulative moving average, or wants to get the maximum absolute value of the averaged vector, neither of which is originally provided by the library, they can simply implement these policies and use them together with the rest (see Listing 2).

/* Use library-provided policies */
f<float, sumsquared, ema> f1;
f1.set_parameters(0.5);
f1.initialize({0,0});
f1({3,4});  /* returns 6.25 */
f1({6,8});  /* returns 39.0625 */

/* Implement cumulative moving average */
template <class value_t> struct cma {
  template <class InputIt, class OutputIt>
  OutputIt operator()(InputIt xb, InputIt xe, OutputIt tb) {
    size_t idx{0};
    while (xb != xe) {
      average[idx] = (N * average[idx] + *xb++) / (N + 1);
      *tb++ = average[idx++];
    }
    N++;
    return tb;
  }

private:
  size_t N{0};
  vector<value_t> average;
};

/* Implement maximum absolute value */
template <class value_t> struct maxabs {
  template <class InputIt>
  value_t operator()(InputIt xb, InputIt xe) {
    value_t result{0};
    while (xb != xe) {
      value_t current = abs(*xb++);
      if (current > result)
        result = current;
    }
    return result;
  }
};

/* Use user-defined policies */
f<double, maxabs, cma> f2;
f2.initialize({0,0});
f2({3,4});  /* cumulative average becomes (3,4), f2 returns 4 */
f2({6,8});  /* cumulative average becomes (4.5,6), f2 returns 6 */
Listing 2: Simple example that visualizes how policy-based design can help users pick and implement different policies in a library.

Based on these simple examples, we immediately realize the strengths of the policy-based design technique. First, the library designer and the user only need to write code for the distinct functionalities, and can then combine them into exponentially many design combinations. Second, the library can support user-defined functionalities, which are not known in advance, and the compiler can optimize the executable code for the selected data types thanks to template programming. Last, the library can also support some enriched functionality that the library designer has not foreseen. For example, the library designer provides a shell f that only enforces the call function (operator()) on its policies. However, thanks to the inheritance mechanism, the user can use some added functionality (set_parameters) when the policy classes provide them. In short, policy-based design helps handle combinatorially increasing design choices with linearly many policies while also allowing for extensions and enriched functionalities in ways library designers might not have predicted.

Optimization algorithms have a lot in common with the examples provided in Listings 1 and 2. Any family of algorithms defines, at each iteration, how a sequence of different transformations should be applied to some shared variable until a stopping condition is met. However, optimization algorithms are much more involved than the examples above. They have state loggers, different samplers and termination conditions, as well as different ways of parallelizing various parts of the code. Moreover, not all the policies of optimization algorithms are compatible with each other. As a result, a similar approach is needed to represent these algorithms in a flexible way while also preventing incompatible policies. To this end, we follow the standard C++ parallel algorithms library proposals [2013-Hoberock, 2015-Hoberock] in POLO to add execution support to optimization algorithms, and use type traits to disable incompatible policies at compile-time.

In the rest of the section, we revisit proximal gradient methods to illustrate our design choices, and list the executors provided in POLO.

3.1 Revisiting Proximal Gradient Methods

In POLO, we provide the template class proxgradient to represent the family of proximal gradient methods. It consists of boosting, step, smoothing, prox and execution policies, all of which are templated against the real value-type, such as float or double, and the integral index-type, such as int32_t or uint64_t (see Listing 3 for an excerpt).

/* namespace polo::algorithm */
template <class value_t = double, class index_t = int,
          template <class, class> class boosting = boosting::none,
          template <class, class> class step = step::constant,
          template <class, class> class smoothing = smoothing::none,
          template <class, class> class prox = prox::none,
          template <class, class> class execution = execution::serial>
struct proxgradient : public boosting<value_t, index_t>,
                      public step<value_t, index_t>,
                      public smoothing<value_t, index_t>,
                      public prox<value_t, index_t>,
                      public execution<value_t, index_t> {
  /* constructors and initializers */

  template <class Loss, class Terminator, class Logger>
  void solve(Loss &&loss, Terminator &&terminator, Logger &&logger)
  {
    execution<value_t, index_t>::solve(this,
      std::forward<Loss>(loss),
      std::forward<Terminator>(terminator),
      std::forward<Logger>(logger)
    );
  }

  /* other member functions */
};
Listing 3: proxgradient algorithm template implemented in POLO. Its policy templates default to those implementing the vanilla gradient-descent algorithm, running serially without any proximal step.

As can be observed, proxgradient is simply a shell that glues together its essential policies by using policy-based design, and delegates how to solve the optimization problem to its execution policy. Based on this excerpt only, we can see that any policy class that implements the corresponding solve member function can be an executor for the proxgradient family. For example, one can write a simple serial executor as in Listing 4.

/* namespace polo::execution */
template <class value_t, class index_t> struct serial {
  /* defaulted constructors and assignments */

protected:
  /* initializers and other functions */

  template <class Algorithm, class Loss, class Terminator,
            class Logger>
  void solve(Algorithm *alg, Loss &&loss, Terminator &&terminate,
             Logger &&logger) {
    while (!std::forward<Terminator>(terminate)(k, fval, xb_c,
                                                xe_c, gb_c)) {
      fval = std::forward<Loss>(loss)(xb_c, gb);
      iterate(alg, std::forward<Logger>(logger));
    }
  }

private:
  template <class Algorithm, class Logger>
  void iterate(Algorithm *alg, Logger &&logger) {
    alg->boost(index_t(0), k, k, gb_c, ge_c, gb);
    alg->smooth(k, k, xb_c, xe_c, gb_c, gb);
    const value_t step = alg->step(k, k, fval, xb_c, xe_c, gb_c);
    alg->prox(step, xb_c, xe_c, gb_c, xb);
    std::forward<Logger>(logger)(k, fval, xb_c, xe_c, gb_c);
    k++;
  }

  index_t k{1};
  value_t fval{0};
  std::vector<value_t> x, g;
  value_t *xb;          /* points to the first element of x */
  const value_t *xb_c;  /* points to the first element of x */
  const value_t *xe_c;  /* points to past the last element of x */
  value_t *gb;          /* points to the first element of g */
  const value_t *gb_c;  /* points to the first element of g */
  const value_t *ge_c;  /* points to past the last element of g */
};
Listing 4: serial execution policy class implemented for the proxgradient family in POLO. Due to space considerations, only the relevant lines are shown.

The serial execution policy implements, in its iterate member function in Listing 4, the main loop body of Algorithm 1. In fact, it only keeps track of the shared variables k, fval, x and g, and determines the order of executions, while delegating the actual tasks to the policy classes. In other words, it calls, sequentially in the given order, the boost, smooth, step and prox functions, which are inherited in proxgradient from the boosting, smoothing, step and prox policies, respectively, without knowing what they actually do to manipulate x and g. This orthogonal decomposition of algorithms into policies makes it easier for both library writers and users to implement these primitives independently of each other.

Next, let us see how a user of POLO can build their algorithms by selecting from predefined policies in Listing 5.

/* include libraries */
#include "polo/polo.hpp"
using namespace polo;
using namespace polo::algorithm;

/* Assemble Heavyball */
proxgradient<double, int, boosting::momentum,
             step::constant, smoothing::none> heavyball;

/* Assemble AdaGrad */
proxgradient<double, int, boosting::none,
             step::constant, smoothing::adagrad> adagrad;

/* Assemble Adam */
proxgradient<double, int, boosting::momentum,
             step::constant, smoothing::rmsprop> adam;

/* Assemble Nadam */
proxgradient<double, int, boosting::nesterov,
             step::constant, smoothing::rmsprop> nadam;
Listing 5: Example code that uses POLO’s policy classes to assemble different proximal gradient methods.

Here, the user can easily assemble their favorite proximal gradient methods, such as Heavyball [1964-Polyak], AdaGrad [2011-Duchi], Adam [2014-Kingma] and Nadam [2016-Dozat], by choosing among the provided policy classes. Similarly, if the user would like to try a different smoothing policy instead of adagrad and rmsprop, they can define their custom smoother (the custom class in Listing 6) and later use it together with the other policy classes (the proxgradient declaration at the end of the listing). In Listing 6, the custom smoother is in fact the amsgrad smoother, which improves the convergence properties of Adam-like algorithms [2018-Reddi].

/* Define a custom smoother */
template <class value_t, class index_t> struct custom {
  custom(const value_t beta = 0.99, const value_t epsilon = 1E-6)
      : beta{beta}, epsilon{epsilon} {}

  /* defaulted copy/move operations */

  template <class InputIt1, class InputIt2, class OutputIt>
  OutputIt smooth(const index_t, const index_t, InputIt1 xbegin,
                  InputIt1 xend, InputIt2 gprev, OutputIt gcurr) {
    value_t g_val{0};
    index_t idx{0};
    while (xbegin != xend) {
      xbegin++;
      g_val = *gprev++;
      nu[idx] = beta * nu[idx] + (1 - beta) * g_val * g_val;
      nu_hat[idx] = std::max(nu_hat[idx], nu[idx]);
      *gcurr++ = g_val / (std::sqrt(nu_hat[idx]) + epsilon);
      idx++;
    }
    return gcurr;
  }

protected:
  void parameters(const value_t beta, const value_t epsilon) {
    this->beta = beta;
    this->epsilon = epsilon;
  }

  template <class InputIt> void initialize(InputIt xbegin,
                                           InputIt xend) {
    nu = std::vector<value_t>(std::distance(xbegin, xend));
    nu_hat = std::vector<value_t>(nu);
  }

  ~custom() = default;

private:
  value_t beta{0.99}, epsilon{1E-6};
  std::vector<value_t> nu, nu_hat;
};

/* Assemble an algorithm using the custom smoother */
proxgradient<double, int, boosting::momentum,
             step::constant, custom> myalg;
Listing 6: Example code that defines a custom policy class to be used together with provided functionality.

Finally, it is also worth mentioning that, in Listing 5, the unused prox and execution policies default to none and serial, respectively, which means that the proximal operator is the identity and the algorithm runs sequentially on one CPU. By mixing in different choices of these policies, the user can extend the respective algorithms' capabilities to handle different proximal terms on the different supported executors.

3.2 Provided Executors

POLO provides 4 different execution policy classes for the proxgradient family to support 3 major computing platforms (see Figure 3):

  1. serial executor for sequential runs,

  2. consistent executor, which uses mutexes to lock the whole decision vector for consistent reads and writes, for shared-memory parallelism,

  3. inconsistent executor, which uses atomic operations to allow for inconsistent reads and writes to individual coordinates of the decision vector, for shared-memory parallelism, and,

  4. paramserver executor, which is a lightweight implementation of the Parameter Server [2013-Li] similar to that discussed in [2017-Xiao], for distributed-memory parallelism.

Figure 3: Supported platforms in POLO for the proxgradient family. When the problem data fits on a single computer, the user can select among the serial, consistent and inconsistent executions. The consistent and inconsistent executions differ in that the former uses mutexes to lock (L) the whole decision vector when accessing its coordinates, whereas the latter uses atomic operations on individual elements in shared-memory platforms. The library also supports distributed-memory executions, in which the problem data and decision vector are shared among different worker and master machines.

When the problem data fits in a single memory space, users of POLO can choose between the provided serial, consistent and inconsistent execution policy classes. As we have partly seen in Listing 4, the serial execution simply implements the pseudocode in Algorithm 1 and runs on a single CPU. When multiple CPUs are available to speed up the computations, users can choose between the consistent and inconsistent execution policies. In both executors, every update to the decision vector is well-defined. The name inconsistent refers to the fact that different workers may read and write individual coordinates of the decision vector simultaneously, without locking it as a whole; to keep these per-coordinate operations well-defined, the executor stores the decision vector in custom floating-point types that support atomic operations (such operations are only officially supported as of C++20). Avoiding locks usually yields faster convergence in wall-clock time at the expense of less accurate iterates, an effect that can be controlled in certain algorithm-problem pairs, especially when the updates are relatively sparse.

POLO also provides the paramserver execution policy to support computations in which the problem data is scattered among different computing nodes. Such solutions are needed when the data is either too large to fit on a single node or has to be kept in a specific location due to storage and access requirements. The paramserver executor is not a full-fledged Parameter Server [2013-Li] implementation but rather a lightweight distributed-memory executor similar to that mentioned in DSCOVR [2017-Xiao] (at the time of writing, the Parameter Server website was not functioning, and no code was available for either the Parameter Server or DSCOVR). It uses the cURL [2018-curl] and ZMQ [2017-ZMQ] libraries for message passing, and cereal [2018-USCiLab] for portable serialization of the transmitted data. The paramserver executor has three main agents (see Figure 4). The scheduler is the only static node in the network; it is responsible for the bookkeeping of connected master nodes, for publishing (PUB) global messages to the subscribed (SUB) master and worker nodes, and for directing worker nodes to the corresponding master nodes when the workers need to operate on specific parts of the decision vector. The master nodes share the full decision vector in a linearly load-balanced fashion; they are responsible for receiving gradient updates from worker nodes, taking the prox step, and sending the updated decision vector to the worker nodes upon request (REQ). Finally, worker nodes share the smooth loss-function data, and they can join the network dynamically. Based on the sampling they use, worker nodes request a list of the master nodes that keep the corresponding part of the decision vector, and establish connections with them to communicate decision vectors and (partial) gradient updates.

Figure 4: Implementation of the Parameter Server in POLO. Arrow directions represent the flow of information. Solid lines represent static connections that must be established at all times, whereas dashed lines represent dynamic connections that are established when needed.

4 Examples

In this section, we demonstrate how POLO helps us implement state-of-the-art optimization algorithms easily on different executors. To this end, we first solve a randomly generated [1984-Lenard] unconstrained quadratic problem using different serial algorithms from the proximal gradient family. Then, we solve a regularized logistic regression problem on the rcv1 [2004-Lewis] dataset using different algorithms and various forms of parallelism.

We provide a sample code listing (Listing 8) in the appendix for the examples covered in this section, and state which parts of the listing need to be changed for each example.

4.1 Unconstrained Quadratic Programming

In POLO, we provide a QP generator that follows the approach described in [1984-Lenard] to create randomly generated QP problems of the form

\[
\operatorname*{minimize}_{x} \quad \tfrac{1}{2} x^{\top} Q x + q^{\top} x \quad \text{subject to} \quad A x \leq b \,,
\]

where $Q$ is a $d \times d$ matrix, $q$ is a $d$-vector, $A$ is an $m \times d$ matrix, and $b$ is an $m$-vector. We use the generator to create an unconstrained QP problem with $d = 10000$, smoothness constant $L = 20$, and strong-convexity constant $\mu = 1/L$ (cf. the auxiliary variables in Listing 8). We use vanilla gradient descent, Nesterov momentum and Adam to solve the problem. For this, one needs to change lines 23--25 of Listing 8 to the following:

#ifdef GD
algorithm::proxgradient<value_t, index_t> alg;
alg.step_parameters(2 / (mu + L));
alg.initialize(x0);
#elif defined NESTEROV
algorithm::proxgradient<value_t, index_t, boosting::nesterov> alg;
alg.boosting_parameters(0.9, 1 / L);
alg.initialize(x0);
#else
algorithm::proxgradient<value_t, index_t, boosting::momentum,
                        step::constant, smoothing::rmsprop> alg;
alg.step_parameters(0.08);
alg.boosting_parameters(0.9, 0.1);
alg.smoothing_parameters(0.999, 1E-8);
alg.initialize(x0);
#endif

In the end, we compile the code with appropriate definitions, and run the code until the termination criterion is satisfied. We post-process the iterates generated by the algorithms, and present the optimality gap and iterate convergence in Figure 5.

Figure 5: Performance comparison of different algorithms on the randomly generated QP problem. Adding boosting to full gradient iterations increases the rate of convergence, whereas adding smoothing results in slightly slower convergence than the boosting-only methods.

4.2 Regularized Logistic Regression

In POLO, we also provide various convenience functions such as dataset readers, loss abstractions, samplers and basic matrix algebra functionality that uses the installed BLAS/LAPACK implementations. It is worth noting that POLO is not designed as a replacement for mature linear algebra packages such as Eigen [2010-Guennebaud] and Armadillo [2016-Sanderson], and it can be used in combination with these libraries when working with matrices.

In this subsection, we use the convenience functions to define a regularized logistic loss

\[
f(x) = \sum_{i=1}^{N} \log\left(1 + \exp\left(-b_{i} a_{i}^{\top} x\right)\right) + \lambda_{1} \left\lVert x \right\rVert_{1} \,,
\]

where the pair $(a_{i}, b_{i})$ is the feature vector and the corresponding label, respectively, of each sample in a given dataset, while $\lambda_{1}$ is the regularization parameter. For this example, we will read the rcv1 [2004-Lewis] dataset in LIBSVM format. The dataset is sparse; its sample and feature counts are obtained from dataset.nsamples() and dataset.nfeatures() once it is read.

We first assemble a different serial algorithm, AMSGrad [2018-Reddi], and use mini-batch gradients as the gradient surrogate, where mini-batches are created by sampling the component functions uniformly at random. Necessary changes to the sample code in Listing 8 are as follows:

 /* Remove the auxiliary variables on lines 7--10 of Listing 8 */
 /* Replace the loss definition on line 13 of Listing 8 */
const string dsfile = "rcv1";  /* name of the dataset file */
auto dataset = utility::reader<value_t, index_t>::svm({dsfile});
loss::logistic<value_t, index_t> logloss(dataset);

const index_t K = 20000;
const index_t M = 1000;  /* mini-batch size */
const index_t N = dataset.nsamples();
const index_t d = dataset.nfeatures();
const value_t L = 0.25 * M;  /* L_i = 0.25; rcv1 is normalized */
const index_t B = N / M;  /* number of mini-batches */
const value_t lambda1 = 1e-4;
 /* Replace the algorithm on lines 23--25 of Listing 8 */
algorithm::proxgradient<value_t, index_t, boosting::momentum,
                        step::constant, smoothing::amsgrad,
                        prox::l1norm> alg;

alg.step_parameters(1. / B);  /* tuned for reasonable performance */
alg.boosting_parameters(0.9, 0.1);  /* default suggested */
alg.smoothing_parameters(0.999, 1E-8);  /* default suggested */
alg.prox_parameters(lambda1);
auto loss = [&](const value_t *x, value_t *g,
                const index_t *ibegin,
                const index_t *iend) -> value_t {
  return logloss.incremental(x, g, ibegin, iend);
};
 /* Finally, change the solve call on line 34 of Listing 8 */
utility::sampler::uniform<index_t> sampler;
sampler.parameters(0, N - 1);
alg.solve(loss, utility::sampler::component, sampler, M,
          utility::terminator::maxiter<value_t, index_t>(K),
          logger);

As before, we compile and run the algorithm, this time with different mini-batch sizes, and then reconstruct the total function loss values from the traces saved in the logger. We report the results in Figure 6.

Figure 6: Total function losses for AMSGrad with respect to the iteration count (left) and wall-clock time (right) on the rcv1 dataset. As the number of mini-batches increases, the convergence rate (in terms of iteration count) increases and the error decreases, as expected. Moreover, the total computation time increases with the increasing number of mini-batches.

Next, we use the distributed-memory functionality of POLO to solve the logistic regression problem on a parameter-server architecture. To demonstrate that POLO can be used equally well on embedded systems as on high-performance computers, we first form a cluster of low-power devices. In the cluster, we assign 5 Raspberry PI2 devices as the worker nodes, each of which holds part of the samples in the rcv1 dataset, and an NVIDIA Jetson TK1 as the single master, which only keeps track of the decision vector and applies the ℓ1-norm proximal step. We assign a laptop as the scheduler, which orchestrates the communication between the static master nodes and the dynamically joining worker nodes in the network. For this example, we choose the proximal incremental aggregated gradient (PIAG) [2016-Aytekin] method to solve the problem. The necessary modifications are given in the next listing.

 /* Each worker node has its own local dataset */
 /* Change line 13 in Listing 8 */
#ifdef WORKER
const string dsfile = "rcv1";
auto dataset = utility::reader<value_t, index_t>::svm({dsfile});
loss::logistic<value_t, index_t> logloss(dataset);

const index_t N = dataset.nsamples();
const index_t d = dataset.nfeatures();
const value_t L = 0.25 * N;
#else
 /* scheduler and master(s) need not know the loss */
auto loss = nullptr;
#endif
 /* Change lines 23--25 in Listing 8 */
const value_t lambda1 = 1e-4;
const index_t K = 100000;

algorithm::proxgradient<value_t, index_t, boosting::aggregated,
                        step::constant, smoothing::none,
                        prox::l1norm,
                        execution::paramserver::executor> alg;

alg.step_parameters(1 / L);
alg.prox_parameters(lambda1);

execution::paramserver::options psopts;
 /* workers time out after 10 seconds of inactivity */
psopts.worker_timeout(10000);
 /* master's own IP address; used by the worker */
psopts.master(maddress, 50000);
 /* scheduler's address & ports; needed globally */
psopts.scheduler(saddress, 40000, 40001, 40002);

alg.execution_parameters(psopts);

We compile the code three times, once for each of the worker, master and scheduler agents in the distributed executor. We run the algorithm and post-process the iterates logged by the master node. We report the result in Figure 7.

Figure 7: Total function loss on the rcv1 dataset for PIAG executed on the Parameter Server executor provided in POLO.

Finally, we recompiled the same code and ran it on virtual machines in a cloud-computing environment at the Ericsson Research Data Center. Apart from reconfiguring IP addresses, no changes to the code were needed. Although the wall-clock time performance improved significantly, the per-iterate convergence was similar to that of the Raspberry PI cluster and is therefore not reported here.

5 High-Level Language Integration

Even though POLO helps us prototype ideas and implement state-of-the-art optimization algorithms easily in C++, it is often more convenient to use a high-level language such as MATLAB, Python or Julia to access data from different sources, define and automatically differentiate loss functions, and visualize results. To support integration with high-level languages, we also provide a C-API that implements the influential algorithms listed in Table 1 on the executors covered in Section 3.2. Using the provided C header file and the compiled library, users can access these algorithms from their favorite high-level language.

Central to the C-API is a loss abstraction that allows for arbitrary loss function implementations in any high-level language. The function signature of the loss abstraction is given below.

template <class value_t>
using loss_t = value_t (*)(const value_t *x, value_t *g, void *data);
Listing 7: Loss abstraction used in the C-API. data is an opaque pointer used to support callback functions from high-level languages in the C library.

Any given loss should read the decision vector x, write the gradient of the loss at x into g, and return the loss value. The signature also includes an opaque pointer to implementation-specific data. For example, it could point to a loss-function object implemented in a high-level language, which calculates the loss and gradient at x. Next, the C-API defines algorithm calls for a set of algorithm types and executors as follows.

using value_t = double;  /* double-precision floating points */
void run_serial_alg(const value_t *xbegin, const value_t *xend,
                    loss_t<value_t> loss_fcn, void *loss_data) {
  serial_alg_t alg;
  alg.initialize(xbegin, xend);
  alg.solve([=](const value_t *xbegin, value_t *gbegin) {
    return loss_fcn(xbegin, gbegin, loss_data);
  });
}

The serial_alg_t is varied to specific algorithm types through different policy combinations. For example, the following defines regular gradient descent with a serial executor.

using namespace polo;
using namespace polo::algorithm;
using serial_alg_t =
  proxgradient<double, int, boosting::none, step::constant,
               smoothing::none, prox::none, execution::serial>;

High-level languages can then make use of these algorithms by supplying loss functions that adhere to the abstraction in Listing 7.

The C-API also defines a set of custom policies within the proxgradient algorithm family. These policies have empty implementations that can be filled in from high-level languages. The feature is implemented with abstractions over opaque pointers, similar to the loss abstraction in Listing 7, and gives the user functionality beyond the range of precompiled algorithms in the C-API. To illustrate this versatility, we have implemented POLO.jl in the Julia language. POLO.jl includes Julia implementations of all proxgradient policies available in POLO. Through the custom policy API, any algorithm type can be instantiated and tested interactively. In addition, it allows us to implement other high-level abstractions in Julia, such as custom termination criteria and advanced logging. As an example, the following implements Adam in POLO.jl:

Adam(execution::ExecutionPolicy) =
  ProxGradient(execution, Boosting.Momentum(), Step.Constant(),
               Smoothing.RMSprop(), Prox.None())

In essence, POLO.jl provides a dynamic setting for algorithm design while also enabling the use of the powerful executors implemented in POLO. Due to current limitations in the Julia language, the multi-threaded executors could not yet be applied.

6 Conclusion and Future Work

In this paper, we have presented POLO, an open-source, header-only C++ template library. Compared to other low-level generic optimization libraries, POLO focuses on algorithm development. With its algorithm abstractions, policy-based design approach and convenient utilities, POLO not only offers optimization and machine-learning researchers an environment to prototype their ideas flexibly without sacrificing much performance, but also allows them to test their ideas and other state-of-the-art optimization algorithms on different executors easily. POLO is still a work in progress; we are working on adding other families of algorithms and different executors to the library, as well as wrapping the C-API into convenient packages for high-level languages such as Python and Julia.

Acknowledgment

This research is sponsored in part by the Swedish Research Council under project grant “Scalable optimization: dynamics, architectures and data dependence,” and by the Swedish Foundation for Strategic Research under project grant “Societal-scale cyber-physical transport systems.” We also thank Ericsson Research for their generosity in letting us use the computing resources at Ericsson Research Data Center for our experiments.

Appendix A Sample Code Listing

1 /* Include necessary libraries */
2using index_t = int32_t;
3using value_t = float;
4
5int main(int argc, char *argv[]) {
6   /* Define auxiliary variables */
7  const index_t d = 10000;
8  const index_t K = 2500;
9  const value_t L = 20;
10  const value_t mu = 1 / L;
11
12   /* Define a loss function */
13  problem::qp<value_t, index_t> qp(d, L);
14
15   /* Randomly initialize the starting point */
16  mt19937 gen(random_device{}());
17  normal_distribution<value_t> normal(5, 3);
18  vector<value_t> x0(d);
19  for (auto &val : x0)
20    val = normal(gen);
21
22   /* Select, configure and initialize an algorithm */
23  algorithm::proxgradient<value_t, index_t> alg;
24  alg.step_parameters(2 / (mu + L));
25  alg.initialize(x0);
26
27   /* Use a decision logger */
28  utility::logger::decision<value_t, index_t> logger;
29
30   /* Terminate after K iterations */
31  utility::terminator::maxiter<value_t, index_t> maxiter(K);
32
33   /* Solve the problem */
34  alg.solve(qp, maxiter, logger);
35
36   /* Post-process the logged states, i.e., function values,
37     decision vector, etc.
38  */
39
40  return 0;
41}
Listing 8: Sample code listing to reproduce experiments in Section 4. Only POLO related functionality is shown for brevity.