The wide adoption of Internet of Things (IoT) enabled devices, together with developments in communication and data storage technologies, has made the collection, transmission, and storage of large volumes of data more accessible than ever. Commercial cloud-computing providers, which traditionally offer their available computing resources at data centers to customers, have also started supporting low-power IoT-enabled devices available at the customers' site in their cloud ecosystems [2018-Greengrass, 2018-IoTEdge, 2018-IoTHub]. As a result, we are experiencing an increase in not only the problem dimensions of data-driven machine-learning applications but also the variety of computing architectures on which these applications are deployed.
There is no silver bullet for solving machine-learning problems. Because the problems evolve continuously, one needs to design new algorithms that are tailored for the specific problem structure and that exploit all available computing resources. However, algorithm design is a non-trivial process. It consists of prototyping new ideas, benchmarking their performance against that of the state-of-the-art, and finally deploying the successful ones in production. Many ideas today are prototyped in high-level languages such as MATLAB, Python and Julia, and it is very rare that researchers implement their algorithms in lower-level languages such as C and C++. Even so, these prototypes are often incomplete, have different abstractions and coding styles, and are hand-crafted for specific problem-platform combinations. Ultimately, this results in either performance degradation of high-level implementations in production, or high engineering costs attached to rewriting the low-level implementations from scratch for each different problem-platform combination.
In this paper, we present POLO — an open-source, header-only C++ library that uses the policy-based design technique [2005-Alexandrescu]. It consists of two primary layers (Figure 1). The utilities layer builds on standard libraries and lightweight third-party libraries, and offers a set of well-defined primitives for atomic floating-point operations, matrix algebra, communication and serialization, logging and dataset reading. Throughout the development, we have put special emphasis on making the utilities layer portable and efficient. The algorithms layer, on the other hand, builds on the utilities layer and implements different families of algorithms. The library abides by modern C++ design principles [2005-Alexandrescu], and follows ideas from recent parallel algorithms library proposals [2013-Hoberock, 2015-Hoberock] to
decouple optimization algorithm building blocks from system primitives,
facilitate code reuse, and
generate tight and efficient code.
In the rest of the paper, we introduce POLO in more detail. In Section 2, we motivate the design of POLO based on our observations from a family of optimization algorithms. In Section 3, we provide detailed information about the design of the library and the supported computing architectures. In Section 4, we show how some of the state-of-the-art algorithms can be quickly prototyped in POLO, together with performance comparisons against each other. In Section 5, we describe how POLO integrates with high-level languages. Finally, in Section 6, we conclude the paper with further discussions.
To demonstrate the power of policy-based design for distributed optimization library development, we consider regularized optimization problems of the form

$$ \operatorname*{minimize}_{x \in \mathbb{R}^d} \quad F(x) := f(x) + h(x). \qquad (1) $$

Here, $f : \mathbb{R}^d \to \mathbb{R}$ is a differentiable function of $x$, and $h : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is a possibly non-differentiable function. In machine-learning applications, the smooth part of the loss typically represents the empirical data loss and is a finite sum of loss functions,

$$ f(x) = \frac{1}{N} \sum_{n=1}^{N} f_n(x), $$

while the non-smooth part $h(x)$ is a regularizer that encourages certain properties (such as sparsity) of the optimal solution.
One approach for solving problems of the form (1) is to use proximal gradient methods. The basic form of the proximal gradient iteration is

$$ x_{k+1} = \operatorname*{argmin}_{x \in \mathbb{R}^d} \left\{ f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + h(x) + \frac{1}{2\gamma_k} \lVert x - x_k \rVert_2^2 \right\}, \qquad (2) $$

where $\gamma_k$ is a step-size parameter. Thus, the next iterate, $x_{k+1}$, is selected to be the minimizer of the sum of the first-order approximation of the differentiable loss function around the current iterate, $x_k$, the non-differentiable loss function, and a quadratic penalty on the deviation from the current iterate [2017-Beck]. After some algebraic manipulations, one can rewrite (2) in terms of the proximal operator [2017-Beck]:

$$ x_{k+1} = \operatorname{prox}_{\gamma_k h}\bigl( x_k - \gamma_k \nabla f(x_k) \bigr), \qquad \operatorname{prox}_{\gamma h}(v) := \operatorname*{argmin}_{x} \left\{ h(x) + \frac{1}{2\gamma} \lVert x - v \rVert_2^2 \right\}. $$
As a result, the method can be interpreted as a two-step procedure: first, a query point is computed by modifying the current iterate in the direction of the negative gradient, and then the prox operator is applied to this query point.
Even though the proximal gradient method described in (2) looks involved, in the sense that it requires solving an optimization problem at each iteration, the prox-mapping can actually be evaluated very efficiently for several important functions $h$. Together with its strong theoretical convergence guarantees, this makes the proximal gradient method a favorable option in many applications. However, the gradient calculation step in the vanilla proximal gradient method can be prohibitively expensive when the number of data points ($N$) or the dimension of the decision vector ($d$) is large. To improve the scalability of the proximal gradient method, researchers have long sought ways of parallelizing the gradient computations, as well as more clever query points than the simple gradient step in (2) [2007-Blatt, 2016-Aytekin, 2014-Defazio, 1964-Polyak, 1983-Nesterov, 2011-Duchi, 2012-Zeiler, 2014-Kingma, 2016-Dozat, 2015-Sra, 2011-Recht, 2016-Leblond, 2017-Pedregosa]. As a result, the proximal gradient family encompasses a large variety of algorithms. We have listed some of the more influential variants in Table 1.
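For instance, for the $\ell_1$-regularizer $h(x) = \lambda \lVert x \rVert_1$, the prox-mapping decouples over coordinates into the well-known soft-thresholding operation, which costs only $O(d)$ arithmetic operations. A minimal standalone sketch (illustrative code, not part of POLO):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// prox_{gamma * lambda * ||.||_1}(v): coordinate-wise soft-thresholding.
// No inner optimization problem has to be solved; the cost is linear in d.
std::vector<double> prox_l1(const std::vector<double> &v, double gamma,
                            double lambda) {
  std::vector<double> x(v.size());
  const double threshold = gamma * lambda;
  for (std::size_t i = 0; i < v.size(); ++i)
    x[i] = std::copysign(std::max(std::abs(v[i]) - threshold, 0.0), v[i]);
  return x;
}
```

Coordinates with magnitude below the threshold are set exactly to zero, which is how the $\ell_1$-regularizer promotes sparse solutions.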
2.1 A Look at Proximal Gradient Methods
Table 1: Some influential members of the proximal gradient family.

Algorithm               Boosting      Executors
SGD                     —             s, cr, ps
IAG [2007-Blatt]        aggregated    s, cr, ps
PIAG [2016-Aytekin]     aggregated    s, cr, ps
SAGA [2014-Defazio]     saga          s, cr, ps
AdaDelay [2015-Sra]     —             s, cr, ps
A careful review of the serial algorithms in the proximal gradient family reveals that they differ from each other in their choices of five distinctive algorithm primitives: (1) which gradient surrogate they use; (2) how they combine multiple gradient surrogates to form a search direction, a step we refer to as boosting; (3) how this search direction is filtered or scaled, which we call smoothing; (4) which step-size policy they use; and (5) the type of projection they perform in the prox step. For instance, stochastic gradient descent (SGD) algorithms use partial gradient information, coming from a subset of the component functions or of the decision-vector coordinates, as the gradient surrogate at each iteration, whereas their aggregated versions [2007-Blatt, 2016-Aytekin, 2014-Defazio] accumulate previous partial gradient information to boost the descent direction. Similarly, different momentum-based methods, such as the classical [1964-Polyak] and Nesterov's [1983-Nesterov] momentum, accumulate the full gradient information over iterations. Algorithms such as AdaGrad [2011-Duchi] and AdaDelta [2012-Zeiler], on the other hand, use second-moment information from the gradient and decision-vector updates to adaptively scale, i.e., smooth, the gradient surrogates. Popular algorithms such as Adam [2014-Kingma] and Nadam [2016-Dozat], available in most machine-learning libraries, incorporate both boosting and smoothing to obtain better update directions. Algorithms in the serial setting can also use different step-size policies and projections independently of the choices above, which results in the pseudocode representation of these algorithms given in Algorithm 1.
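Schematically, and in our own notation rather than the exact pseudocode of Algorithm 1, one iteration of this generic template chains the five primitives as

```latex
\begin{align*}
\hat{g}_k &\approx \nabla f(x_k)                                      && \text{gradient surrogate} \\
d_k       &= \mathrm{boost}\bigl(\hat{g}_1, \ldots, \hat{g}_k\bigr)   && \text{boosting} \\
s_k       &= \mathrm{smooth}\bigl(d_1, \ldots, d_k\bigr)              && \text{smoothing} \\
\gamma_k  &= \mathrm{step}(k)                                         && \text{step-size policy} \\
x_{k+1}   &= \operatorname{prox}_{\gamma_k h}\bigl(x_k - \gamma_k s_k\bigr) && \text{prox/projection step,}
\end{align*}
```

where each line can be instantiated independently of the others.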
Most of the serial algorithms in this family either have direct counterparts in the parallel setting or can be extended to support parallel execution. However, adding parallel execution support brings yet another layer of complexity to algorithm prototyping. First, there is a variety of parallel computing architectures to consider, from shared-memory and distributed-memory environments with multiple CPUs to distributed-memory heterogeneous computing environments that involve both CPUs and GPUs. Second, some of these environments, such as shared-memory architectures, offer different choices in how to manage race conditions. For example, some algorithms [2007-Blatt, 2016-Aytekin, 2014-Defazio, 2015-Sra] use mutexes to consistently read from and write to the shared decision vector from different processes, whereas others [2011-Recht, 2016-Leblond, 2017-Pedregosa] prefer atomic operations to allow for inconsistent reads and writes. Finally, the choice of a specific computing architecture might constrain the choices of other algorithm primitives. For instance, if the algorithm is to run on a distributed-memory architecture such as the Parameter Server [2013-Li], where only parts of the decision vector and data are stored on individual nodes, then only updates and projections that embrace data locality can be used.
2.2 Algorithmic perspective
Out of the large number of proximal gradient methods proposed in the literature, only very few have open-source low-level implementations, and the existing codes are difficult to extend. This means that algorithm designers need to reimplement others' work to benchmark their ideas, even when the ideas differ only in a few aspects. Still, our observations in the previous subsection reveal that algorithm design can be seen as a careful selection of a limited number of compatible policies (see Figure 2).
POLO aims to be an optimization library centered around algorithm design, one that facilitates rapid experimentation with new ideas without sacrificing performance. It has a clear separation between algorithm and system primitives, which allows for (1) prototyping new ideas in as few lines of code as possible, and (2) benchmarking the resulting algorithms against each other on different platforms as simply as possible. Ultimately, this will lead to faster development of new algorithms as well as easily reproducible results.
2.3 Implementation perspective
Algorithm development is an iterative process of designing and analyzing an algorithm, prototyping and benchmarking it on small test problems, and making modifications until it performs well. Successful designs are then used to solve larger problems. This workflow typically involves implementing and testing the algorithm on different computing environments, such as multi-core personal computers, dedicated servers, on-demand cloud-computing nodes, and even clusters of low-power IoT-enabled devices.
From an implementer’s perspective, we would like a flexible library that allows us to implement a growing number of algorithm primitives on a diversity of computing architectures. For this reason, we aim to build a low-level library with a small code footprint, high performance and as few dependencies as possible, which also supports high-level language integrations and cloud-based solutions.
Our observations on different platforms and algorithms in the previous sections reveal a trade-off in algorithm design. Decomposing algorithms into their small, essential policies facilitates behavior isolation and code reuse at the expense of combinatorially increasing design choices (Figure 2). To facilitate code reuse while handling the combinatorial explosion of choices, POLO relies on policy-based design [2005-Alexandrescu], which combines multiple inheritance and template programming in a clever way.
To visualize how policy-based design can be used to implement optimization algorithms, let us first assume that we would like to compute functions of the average of a streaming sequence of vectors. In languages that support both multiple inheritance and template programming, such as C++, one can define this functionality in terms of its primitives, fbar and ave, as given in Listing 1.
Above, the library code first defines a template class for the function f, which publicly inherits from its generic template parameters fbar and ave, both of which are templated against the value type (value_t) they work on. The library implements one concrete policy for each function primitive. As can be observed, f can work on different-precision value types such as float, double and fixed, and its policy classes can have internal states (average) and parameters (alpha) that need to be persistent and modifiable. Note how f is merely a shell that integrates its policy classes and determines how they should be applied to achieve the desired transformation, while not knowing their internal details.
A user of the library can then mix and match the provided policy classes to achieve the desired functionality. The compiler assembles only the functionality that has been requested, and optimizes the code for the specified value types. If the user wants a different averaging function, such as the cumulative moving average, or wants the maximum absolute value of the averaged vector, neither of which is originally provided by the library, they can simply implement these policies and use them together with the rest (see Listing 2).
Based on these simple examples, we immediately realize the strengths of the policy-based design technique. First, the library designer and the user only need to write code for the distinct functionalities, and can then combine them into exponentially many design combinations. Second, the library can support user-defined functionalities, which are not known in advance, and the compiler can optimize the executable code for the selected data types thanks to template programming. Last, the library can also support some enriched functionality that the library designer has not foreseen. For example, the library designer provides a shell f that only enforces the call function (operator()) on its policies. However, thanks to the inheritance mechanism, the user can use some added functionality (set_parameters) when the policy classes provide them. In short, policy-based design helps handle combinatorially increasing design choices with linearly many policies while also allowing for extensions and enriched functionalities in ways library designers might not have predicted.
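To make the discussion concrete, the following hedged reconstruction captures the flavor of Listings 1 and 2; the names (`f`, `norm2`, `ema`) are illustrative and not the actual listing code:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// An averaging policy with internal state (the running average, `avg`) and a
// modifiable parameter (`alpha`), both persistent across calls.
template <class value_t> struct ema {
  void set_parameters(value_t a) { alpha = a; }
  const std::vector<value_t> &average(const std::vector<value_t> &x) {
    if (avg.empty())
      avg = x;
    else
      for (std::size_t i = 0; i < x.size(); ++i)
        avg[i] = alpha * avg[i] + (1 - alpha) * x[i];
    return avg;
  }

private:
  value_t alpha{0.5};
  std::vector<value_t> avg;
};

// A function policy: the Euclidean norm of the averaged vector.
template <class value_t> struct norm2 {
  value_t compute(const std::vector<value_t> &x) const {
    value_t sum{0};
    for (const auto v : x) sum += v * v;
    return std::sqrt(sum);
  }
};

// The shell: it merely glues its policies together and fixes how they are
// composed, without knowing their internal details.
template <class value_t, template <class> class fbar,
          template <class> class ave>
struct f : public fbar<value_t>, public ave<value_t> {
  value_t operator()(const std::vector<value_t> &x) {
    return this->compute(this->average(x));
  }
};
```

Because `f` inherits publicly from its policies, extra members such as `set_parameters` become available on the assembled object exactly as described above, even though the shell itself never mentions them.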
Optimization algorithms have a lot in common with the examples provided in Listings 1 and 2. Any family of algorithms defines, at each iteration, how a sequence of different transformations should be applied to some shared variable until a stopping condition is met. However, optimization algorithms are much more involved than the examples above. They have state loggers, different samplers and termination conditions, as well as different ways of parallelizing various parts of the code. Moreover, not all the policies of optimization algorithms are compatible with each other. As a result, a similar approach is needed to represent these algorithms in a flexible way while also preventing incompatible policies. To this end, we follow the standard C++ parallel algorithms library proposals [2013-Hoberock, 2015-Hoberock] in POLO to add execution support to optimization algorithms, and use type traits to disable incompatible policies at compile-time.
In the rest of the section, we revisit proximal gradient methods to show our design choices, and list the executors provided in POLO.
3.1 Revisiting Proximal Gradient Methods
In POLO, we provide the template class proxgradient to represent the family of proximal gradient methods. It consists of boosting, step, smoothing, prox and execution policies, all of which are templated against the real value-type, such as float or double, and the integral index-type, such as int32_t or uint64_t (see Listing 3 for an excerpt).
As can be observed, proxgradient is simply a shell that glues together its essential policies by using policy-based design, and delegates how to solve the optimization problem to its execution policy. Based on this excerpt only, we can see that any policy class that implements the corresponding solve member function can be an executor for the proxgradient family. For example, one can write a simple serial executor as in Listing 4.
The serial execution policy in Listing 4 implements the pseudocode given in Algorithm 1. In fact, it only keeps track of the shared variables k, fval, x and g, and determines the order of executions, while delegating the actual tasks to the policy classes. In other words, it calls, sequentially in the given order, the boost, smooth, step and prox functions, which are inherited in proxgradient from the boosting, smoothing, step and prox policies, respectively, without knowing what they actually do to manipulate x and g. This orthogonal decomposition of algorithms into policies makes it easier for both library writers and users to implement these primitives independently of each other.
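A hedged sketch of such a solve loop, simplified relative to Listing 4 and with assumed signatures, looks as follows:

```cpp
#include <cstddef>
#include <vector>

// Sketch of a serial solve loop: the executor owns the shared state
// (k, fval, x, g) and fixes the order in which the algorithm's policy
// functions are applied, without knowing what they do internally.
template <class Algorithm, class Loss, class Terminator>
std::vector<double> solve(Algorithm &alg, Loss &&loss, Terminator &&terminate,
                          std::vector<double> x) {
  std::size_t k = 1;
  std::vector<double> g(x.size());
  double fval = loss(x.data(), g.data()); // f(x) and its gradient g
  while (!terminate(k, fval, x, g)) {
    alg.boost(k, g);                      // combine gradient surrogates
    alg.smooth(k, x, g);                  // scale / filter the direction
    const double gamma = alg.step(k, fval, x, g);
    alg.prox(gamma, x, g);                // x <- prox of (x - gamma * g)
    fval = loss(x.data(), g.data());
    ++k;
  }
  return x;
}
```

With identity boosting, smoothing and prox policies, this loop reduces to plain gradient descent; richer policies change the behavior without touching the loop.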
Next, let us see how a user of POLO can build their algorithms by selecting from predefined policies in Listing 5.
Here, the user can easily assemble their favorite proximal gradient methods, such as Heavyball [1964-Polyak], AdaGrad [2011-Duchi], Adam [2014-Kingma] and Nadam [2016-Dozat], by choosing among the provided policy classes. Similarly, if the user would like to try a different smoothing policy instead of adagrad and rmsprop, they can define their custom smoother and later use it together with the rest of the policy classes (see Listing 6). In Listing 6, the custom smoother is in fact the amsgrad smoother, which improves the convergence properties of Adam-like algorithms [2018-Reddi].
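In isolation, the smoothing step that AMSGrad adds can be sketched as follows (standalone illustrative code, not the listing itself):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// AMSGrad-style smoothing [2018-Reddi]: keep an exponential moving average
// of squared gradients, but normalize with its running *maximum* (vhat), so
// the effective per-coordinate step size never increases over time.
struct amsgrad_smoother {
  std::vector<double> smooth(const std::vector<double> &g) {
    if (v.empty()) {
      v.assign(g.size(), 0.0);
      vhat.assign(g.size(), 0.0);
    }
    std::vector<double> out(g.size());
    for (std::size_t i = 0; i < g.size(); ++i) {
      v[i] = beta2 * v[i] + (1 - beta2) * g[i] * g[i];
      vhat[i] = std::max(vhat[i], v[i]); // the AMSGrad modification
      out[i] = g[i] / (std::sqrt(vhat[i]) + epsilon);
    }
    return out;
  }
  double beta2{0.999}, epsilon{1e-8};
  std::vector<double> v, vhat;
};
```

Compared to the rmsprop-style update, the only change is the running maximum, which is precisely the kind of small variation that a custom policy class isolates.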
Finally, it is also worth mentioning that, in Listing 5, the unused prox and execution policies default to none and serial, respectively, which means that the proximal operator is identity and the algorithm runs sequentially on one CPU. By mixing in different choices of these policies, the user can extend the respective algorithms’ capabilities to handle different proximal terms on different, supported executors.
3.2 Provided Executors
POLO provides four different execution policy classes for the proxgradient family to support three major computing platforms (see Figure 3):
serial executor for sequential runs,
consistent executor, which uses mutexes to lock the whole decision vector for consistent reads and writes, for shared-memory parallelism,
inconsistent executor, which uses atomic operations to allow for inconsistent reads and writes to individual coordinates of the decision vector, for shared-memory parallelism, and,
paramserver executor, which is a lightweight implementation of the Parameter Server [2013-Li] similar to that discussed in [2017-Xiao], for distributed-memory parallelism.
When the problem data can fit in a single memory space, users of POLO can choose between the provided serial, consistent and inconsistent execution policy classes. As we have partly seen in Listing 4, the serial execution simply implements the pseudocode in Algorithm 1, and runs on a single CPU. When multiple CPUs are available to speed up the computations, users can choose between the consistent and inconsistent execution policies. In both executors, updates to the decision vector are well-defined. The name inconsistent comes from the fact that this executor lets different worker processes access and modify individual coordinates of the decision vector simultaneously, without locking; to keep the per-coordinate updates well-defined, it uses custom floating-point types inside the decision vector that support atomic operations (such operations are not officially supported until C++20). This lock-free access usually results in faster convergence at the expense of suboptimal results, which can be controlled for certain algorithm-problem pairs, especially when updates are relatively sparse.
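For illustration, a lock-free fetch-add for doubles, in the spirit of such custom atomic floating-point types, can be built from a compare-exchange loop (a generic C++11 sketch, not POLO's implementation):

```cpp
#include <atomic>

// Lock-free fetch-add for doubles. Prior to C++20, std::atomic<double>
// offers load/store/compare-exchange but no atomic arithmetic, so the
// addition is retried in a CAS loop.
double atomic_fetch_add(std::atomic<double> &target, double value) {
  double expected = target.load();
  // On failure, compare_exchange_weak reloads `expected` with the value
  // another thread just wrote, and the addition is retried against it.
  while (!target.compare_exchange_weak(expected, expected + value)) {
  }
  return expected; // value observed just before our update, like fetch_add
}
```

Per-coordinate updates of this kind are what allow simultaneous writers without a global lock on the decision vector.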
POLO also provides the paramserver execution policy to support computations that involve problem data scattered among different computing nodes. Such solutions are needed when the data is either too large to fit on a single node or has to be kept at a certain place due to storage and access requirements. The paramserver executor is not a full-fledged Parameter Server [2013-Li] implementation but rather a lightweight distributed-memory executor similar to that mentioned in DSCOVR [2017-Xiao]. (At the time of writing, the Parameter Server website was not functioning, and there was no available code for either the Parameter Server or DSCOVR.) It uses the cURL [2018-curl] and ZMQ [2017-ZMQ] libraries for message passing, and cereal [2018-USCiLab] for portable serialization of the transmitted data. The paramserver executor has three main agents (see Figure 4). The scheduler is the only static node in the network. It is responsible for bookkeeping of connected master nodes, publishing (PUB) global messages to the subscribed (SUB) master and worker nodes, and directing worker nodes to the corresponding master nodes when worker nodes need to operate on specific parts of the decision vector. The master nodes share the full decision vector in a linearly load-balanced fashion. They are responsible for receiving gradient updates from worker nodes, taking the prox step, and sending the updated decision vector to the worker nodes on request (REQ). Finally, worker nodes share the smooth loss-function data, and they can join the network dynamically. Based on the sampling they use, worker nodes request a list of master nodes that keep the corresponding part of the decision vector, and they establish connections with them to communicate decision vectors and (partial) gradient updates.
In this section, we demonstrate how POLO helps us implement state-of-the-art optimization algorithms easily on different executors. To this end, we first solve a randomly generated [1984-Lenard] unconstrained quadratic problem using different serial algorithms from the proximal gradient family. Then, we solve a regularized logistic regression problem on the rcv1 [2004-Lewis] dataset using different algorithms and various forms of parallelism.
We provide a sample code listing (Listing 8) in the appendix for the examples we cover in this section, and state which parts of the listing need to be changed for each example.
4.1 Unconstrained Quadratic Programming
In POLO, we provide a QP generator that follows the approach described in [1984-Lenard] to create randomly generated QP problems of the form

$$ \operatorname*{minimize}_{x} \quad \frac{1}{2} x^\top Q x + q^\top x \qquad \text{subject to} \quad A x \leq b, $$

where $Q$ is an $n \times n$ matrix, $q$ is an $n$-vector, $A$ is an $m \times n$ matrix, and $b$ is an $m$-vector. We use the generator to create an unconstrained QP problem instance, and solve it with vanilla gradient descent, Nesterov momentum and Adam. For this, one needs to change the algorithm-definition lines of Listing 8 to the following:
In the end, we compile the code with appropriate definitions, and run it until the termination criterion is satisfied. We post-process the iterates generated by the algorithms, and present the optimality gap and iterate convergence in Figure 5.
4.2 Regularized Logistic Regression
In POLO, we also provide various convenience functions such as dataset readers, loss abstractions, samplers and basic matrix algebra functionality that uses the installed BLAS/LAPACK implementations. It is worth noting that POLO is not designed as a replacement for mature linear algebra packages such as Eigen [2010-Guennebaud] and Armadillo [2016-Sanderson], and it can be used in combination with these libraries when working with matrices.
In this subsection, we use the convenience functions to define a regularized logistic loss, where each sample in a given dataset contributes a feature vector $a_n$ together with a corresponding label $b_n$, and $\lambda$ is the regularization parameter. For this example, we read the rcv1 [2004-Lewis] dataset in LIBSVM format. The dataset is sparse.
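For reference, the per-sample logistic loss and its gradient can be computed as in the following standalone sketch (labels are assumed to lie in $\{-1, +1\}$; this is not POLO's loss abstraction):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic loss of one sample (a, b), with b in {-1, +1}:
//   l(x) = log(1 + exp(-b * <a, x>)),
//   grad l(x) = -b * sigma(-b * <a, x>) * a,   sigma(z) = 1 / (1 + exp(-z)).
double logistic_loss(const std::vector<double> &a, double b,
                     const std::vector<double> &x, std::vector<double> &grad) {
  double margin = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) margin += a[i] * x[i];
  margin *= b;
  const double sigma = 1.0 / (1.0 + std::exp(margin)); // = sigma(-margin)
  grad.assign(a.size(), 0.0);
  for (std::size_t i = 0; i < a.size(); ++i) grad[i] = -b * sigma * a[i];
  return std::log1p(std::exp(-margin));
}
```

Since each sample touches only its non-zero features, evaluating this loss over a sparse dataset such as rcv1 costs time proportional to the number of non-zeros.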
We first assemble a different serial algorithm, AMSGrad [2018-Reddi], and use mini-batch gradients as the gradient surrogate, where mini-batches are created by sampling the component functions uniformly at random. Necessary changes to the sample code in Listing 8 are as follows:
As before, we compile and run the algorithm, this time with different mini-batch sizes, and then reconstruct the total function loss from the traces saved in the logger. We report the results in Figure 6.
Next, we use the distributed-memory functionality of POLO to solve the logistic regression problem on a parameter-server architecture. To demonstrate that POLO can be used equally well on embedded systems as on high-performance computers, we first form a cluster of low-power devices. In the cluster, we assign five Raspberry Pi 2 devices as the worker nodes, each of which holds part of the samples in the rcv1 dataset, and an NVIDIA Jetson TK1 as the single master, which only keeps track of the decision vector and performs the norm-ball projection. We assign a laptop as the scheduler, which orchestrates the communication between the static master nodes and the dynamically joining worker nodes in the network. For this example, we choose the proximal incremental aggregated gradient (PIAG) [2016-Aytekin] method to solve the problem. The necessary modifications are given in the next listing.
We compile the code three times, once for each of the worker, master and scheduler agents in the distributed executor. We run the algorithm and post-process the iterates logged by the master node. We report the result in Figure 7.
We finally recompiled the same code and ran it on virtual machines in a cloud-computing environment at Ericsson Research Data Center. No changes were needed to the code apart from reconfiguring IP addresses. Although the wall-clock time improved significantly, the per-iterate convergence is similar to that of the Raspberry Pi cluster and is therefore not reported here.
5 High-Level Language Integration
Even though POLO helps us prototype ideas and implement state-of-the-art optimization algorithms easily in C++, it is often even more convenient to use a high-level language, such as MATLAB, Python or Julia, to access data from different sources, define and automatically differentiate loss functions, and visualize results. To support integration with high-level languages, we also provide a C-API that implements the influential algorithms listed in Table 1 on the executors covered in Section 3.2. Using the provided C header file and the compiled library, users can access these algorithms from their favorite high-level language.
Central to the C-API is a loss abstraction that allows for arbitrary loss function implementations in any high-level language. The function signature of the loss abstraction is given below.
Any given loss function should read the current decision vector, write the gradient of the loss at that point into the provided buffer, and return the loss value. The signature also includes an opaque pointer to implementation-specific data. For example, it could point to a loss-function object implemented in a high-level language, which calculates the loss and the gradient at the current iterate. Next, the C-API defines algorithm calls for a set of algorithm types and executors as follows.
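In C, such an abstraction is naturally expressed as a function-pointer typedef with the opaque pointer threaded through; the following shape is hypothetical (the actual POLO header may differ):

```cpp
#include <cstddef>

// Hypothetical shape of the loss abstraction: read the decision vector x of
// dimension d, write the gradient into g, and return the loss value. `data`
// is an opaque pointer to caller-owned state, e.g., a loss object living in
// a high-level language runtime.
typedef double (*polo_loss_t)(void *data, const double *x, double *g,
                              std::size_t d);

// Example implementation matching the contract: the quadratic loss
// 0.5 * ||x||^2, whose gradient is x itself.
double quadratic_loss(void * /*data*/, const double *x, double *g,
                      std::size_t d) {
  double value = 0.0;
  for (std::size_t i = 0; i < d; ++i) {
    g[i] = x[i];
    value += 0.5 * x[i] * x[i];
  }
  return value;
}
```

Because only this one C-compatible entry point crosses the language boundary, any runtime that can export a C function pointer can plug its losses into the precompiled algorithms.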
The serial_alg_t type is mapped to specific algorithms through different policy combinations. For example, the following defines regular gradient descent with a serial executor.
High-level languages can then make use of these algorithms by supplying loss functions that adhere to the abstraction in Listing 7.
The C-API also defines a set of custom policies within the proxgradient algorithm family. These policies have empty implementations that can be filled in by high-level languages. This feature is implemented using abstractions with opaque pointers, similar to the loss abstraction in Listing 7, and gives the user extended functionality beyond the range of precompiled algorithms in the C-API. To illustrate this versatility, we have implemented POLO.jl in the Julia language. POLO.jl includes Julia implementations of all proxgradient policies available in POLO. Through the custom-policy API, any algorithm type can be instantiated and tested interactively. In addition, it allows us to implement other high-level abstractions in Julia, such as custom termination criteria and advanced logging. As an example, the following implements Adam in POLO.jl:
In essence, POLO.jl provides a dynamic setting for algorithm design while also enabling the use of the powerful executors implemented in POLO. Due to current limitations in the Julia language, the multi-threaded executors could not yet be applied.
6 Conclusion and Future Work
In this paper, we have presented POLO, an open-source, header-only C++ template library. Compared to other low-level generic optimization libraries, POLO focuses on algorithm development. With its algorithm abstractions, policy-based design approach and convenient utilities, POLO not only offers optimization and machine-learning researchers an environment in which to prototype their ideas flexibly without losing much performance, but also allows them to test their ideas and other state-of-the-art optimization algorithms on different executors easily. POLO is still a work in progress, and we are working on adding other families of algorithms and different executors to the library, as well as wrapping the C-API in convenient packages for high-level languages such as Python and Julia.
This research is sponsored in part by the Swedish Research Council under project grant “Scalable optimization: dynamics, architectures and data dependence,” and by the Swedish Foundation for Strategic Research under project grant “Societal-scale cyber-physical transport systems.” We also thank Ericsson Research for their generosity in letting us use the computing resources at Ericsson Research Data Center for our experiments.