Capri: A Control System for Approximate Programs

Approximate computing trades off accuracy of results for resources such as energy or computing time. There is a large and rapidly growing literature on approximate computing that has focused mostly on showing the benefits of approximation. However, we know relatively little about how to control approximation in a disciplined way. This document briefly describes our published work on controlling approximation for non-streaming programs that have a set of "knobs" that can be dialed up or down to control the level of approximation of different components in the program. The proposed system, Capri, solves this control problem as a constrained optimization problem. Capri uses machine learning to learn cost and error models for the program, and uses these models to determine, for a desired level of approximation, knob settings that optimize metrics such as running time or energy usage. Experimental results with complex benchmarks from different problem domains demonstrate the effectiveness of this approach. This report outlines improvements and extensions to the existing Capri system to address its limitations, including a complete rewrite of the software, and discusses directions for follow-up work. The document also includes instructions and guidelines for using the new Capri infrastructure.


1 Introduction

There is growing interest in approximate computing as a way of reducing the energy and time required to execute applications [4, 2, 55, 49, 63]. In conventional computing, programs are usually treated as implementations of mathematical functions, so there is a precise output that must be computed for a given input. In many problem domains, it is sufficient to produce some approximation of this output; for example, when rendering a scene in graphics, it is acceptable to take computational short-cuts if human beings cannot tell the difference in the rendered scene.

In this paper, we focus on a class of approximate programs that we call tunable approximate programs. Intuitively, these programs have one or more knobs or parameters that can be changed to vary the fidelity of the produced output. These knobs might control the number of iterations performed by a loop [44, 8], determine the precision with which floating-point computations are performed [46, 51], or switch between precise and approximate hardware [16]; for the purposes of this paper, the source of approximation does not matter so long as the fidelity of the output is changed by adjusting the knobs.

There is now a fairly large literature on this subject, some of which is surveyed in Section 2. Most of this work addresses what we call the forward problem in this paper: it shows that for some programs, particular techniques such as skipping loop iterations or tasks, within limits, degrade output quality in an acceptable way while reducing energy or running time. Other work has focused on type systems and static analyses to ensure that computational short-cuts do not affect portions of the program that may be critical to correctness, such as control-flow decisions or memory management [49, 11, 34].

However, exploiting approximation effectively requires the solution to what we call the inverse problem in this paper: given a program with knobs that control execution parameters like the number of iterations executed by a loop, and a lower bound on output quality, how do we set the knobs optimally to minimize energy or running time? This is a classical optimal control problem. What makes the problem particularly difficult is that for most programs, the optimal knob settings depend heavily on the values of inputs, as we show in Section 3, so auto-tuning, the standard parameter optimization technique used in computer systems, is not useful.

This paper describes our published work on solving the inverse problem for tunable approximate programs [56]. Roughly speaking, given a permissible error for the output, we want to set the knobs to minimize computational costs, such as running time or energy, while meeting the error constraint. The work describes a solution to the proactive control problem for non-streaming programs that consist of components controlled by one or more knobs and in which the error and cost behaviors are substantially different for different inputs. Our approach is to treat the control problem as a constrained optimization problem in which an objective function such as energy is minimized, subject to constraints such as a lower bound on the acceptable output quality. The major challenge is that this formulation requires us to know the objective and constraint functions, but in general these are complex functions that we do not know and cannot write down in closed form. We deal with this by modeling these functions using machine learning techniques. The resulting Capri control system [56], which is an example of open-loop control [3], is fairly successful in controlling approximation in a principled way in complex applications from several domains including machine learning, image processing and graph analytics.

This paper extends the scope of our published work [56] by first highlighting limitations of the existing control system, such as a potential lack of scalability and the neglect of the prediction error from the cost and error models. We discuss follow-up work to Capri that addresses these issues. A prerequisite for extending Capri is to reimplement the system in a scalable and modular fashion. This paper discusses our new implementation in detail to acquaint potential users with the internals of the Capri control system. We present approximation results with the new Capri implementation.

2 Related Work

Approximation opportunities in software and hardware.

Loop perforation [55] explores skipping iterations during loop execution. Rinard explores randomly discarding tasks in parallel applications [43]. Rinard [44] and Campanoni et al. [10] explore relaxing synchronization in parallel applications. Karthik et al. explore different algorithmic level approximation schemes on a video summarization algorithm [58]. Samadi et al. develop methods to recognize patterns in programs that provide approximation opportunities [47]. These techniques could be used to provide knobs automatically and thus complement our work.

A distortion model using linear regression was used by Rinard to demonstrate the feasibility of their approximation techniques [43]. The results in this paper (Section 4.4) show that linear regression is not useful for modeling quality and cost.

Researchers have proposed several hardware designs for exploiting approximate computing [41, 15, 16, 50, 33, 54]. Our techniques can be useful in choosing how to most efficiently map programs onto such hardware and thus increase the effectiveness of such approaches.

Reactive control of streaming applications.

In this problem, the system is presented with a stream of inputs in which successive inputs are assumed to be correlated with each other, and results from processing one input can be used to tune the computation for succeeding inputs. The Green system [4] periodically monitors QoS values and recalibrates using heuristics whenever the QoS is lower than a specified level. PowerDial [25] leverages feedback control theory for recalibration. Argo [20] is an autotuning system for adapting application performance to changes in multicore resources. SAGE [48] exploits this approach on GPU platforms. Fang et al. use simulated annealing to adjust the knob settings [17]. The problem considered in this paper is fundamentally different since it involves proactive control of an application with a single input rather than reactive control for a stream of inputs. However, the techniques described in this paper may be applicable to reactive control as well.

Auto-tuning.

Auto-tuning explores a space of exact implementations to optimize a cost metric like running time; in contrast, the control problem defined in this paper deals with both error and cost dimensions. Several papers [2, 14] have extended the PetaBricks [1] auto-tuning system to include an error bound. Ding et al. group training inputs into clusters based on user-provided features, and auto-tuning is used to find optimal knob settings for each cluster for given error bounds [14]. For a new input, optimal knob settings for the same error bounds are determined by classifying the input into one of the clusters and using the predetermined knob settings for that cluster. Auto-tuning is used by Precimonious [46] to lower the precision of floating-point types to improve performance for a particular accuracy constraint.

The main difference between our approach and auto-tuning approaches is that our approach builds error and cost models that can be used to control knobs for any error constraint presented during the online phase, without requiring re-training. Since auto-tuning approaches do not build models, they do not have the ability to generalize their results from the constraints they were trained for to other constraints. Note that the clustering-classification approach can be combined with our approach by clustering the training inputs and building a different model for each cluster.

Programming language support.

EnerJ [49] proposes a type system to separate exact and approximate data in the program. Rely [11] uses static analysis techniques to quantify the errors in programs on approximate hardware. Ringenburg et al. [45] developed tools for debugging approximate programs. None of these tools deal with controlling the tradeoff of error versus cost.

Error guarantees.

Zhu et al. formulated a randomized program transformation which trades off expected error versus performance as an optimization problem [63]. However, their formulation assumes very small variations of errors across inputs, an assumption violated in all of our complex real-world benchmark applications. They also assume the existence of an a priori error bound for each approximation in the program and that the error propagation is bounded by a linear function. These assumptions make it hard to apply this approach to real-world applications. For example, we know of no non-trivial error bounds for our benchmarks. Chisel [34] extends Rely [11] to use integer linear programming (ILP) to optimize the selection of instructions/data executed/stored in approximate hardware. The ILP constraints are generated by static analysis, which propagates errors through the program. While they consider input reliability, i.e. the probability that an input contains errors, they do not deal with input sensitivity of the error function. Moreover, their error propagation method requires that the error function be differentiable, and their static analysis technique cannot deal with input-dependent loops, which are common in our benchmarks and many other applications.

ApproxHadoop [21] applies statistical sampling theory to Hadoop tasks for controlling input sampling and task dropping. While statistical sampling theory gives nice error guarantees, the applicability of this technique is restricted. Mahajan et al. [29] use neural networks to predict whether to invoke approximate accelerators or execute precise code for a quality constraint.

Analytic properties of programs.

Several techniques exist to verify whether a program is Lipschitz-continuous [12]. Smooth interpretation [13] can smooth out irregular features of a program. Given the input variability exhibited in our applications, analytic properties usually provide very loose error bounds and are not helpful for setting knobs.

3 Problem Formulation

We describe the formulation of the proactive control problem we use in this paper, justifying it by describing other reasonable formulations and explaining why we do not use them. To keep notation simple, we consider a program that can be controlled with two knobs that take values from finite sets $K_1$ and $K_2$ respectively. We write $k_1 \in K_1$ and $k_2 \in K_2$ to denote this, and use $k_1$ and $k_2$ to denote particular settings of these knobs. The formulation generalizes to programs with an arbitrary number of knobs in an obvious way.

It is convenient to define the following functions.

  • Output: In general, the output value of the tunable program is a function of the input value $i$ and the knob settings $k_1$ and $k_2$. Let $O(i, k_1, k_2)$ be this function.

  • Error/quality degradation: Let $E(i, k_1, k_2)$ be the magnitude of the output error or quality degradation for input $i$ and knob settings $k_1$ and $k_2$.

  • Cost: Let $C(i, k_1, k_2)$ be the cost of computing the output for input $i$ with knob settings $k_1$ and $k_2$. This can be the running time, energy, or another execution metric to be optimized.

We formulate the control problem as an optimization problem in which the error is bounded for the particular input of interest. This optimization problem is difficult to solve, so we formulate a different problem in which the expected error over all inputs is less than the given error bound, with some probability. This gives the implementation flexibility in finding low-cost solutions.

One way to formulate the control problem informally is the following: given an input value and a bound on the output error, find knob settings that (i) meet the error bound and (ii) minimize the cost. This can be formulated as the following constrained optimization problem.

Problem Formulation 1.

Given:

  • a program with knobs $k_1 \in K_1$ and $k_2 \in K_2$, and

  • a set of possible inputs I.

For input $i \in I$ and error bound $e$, find a knob setting $(k_1, k_2)$ such that

  • $E(i, k_1, k_2) \le e$, and

  • $C(i, k_1, k_2)$ is minimized.

In the literature, the constraint $E(i, k_1, k_2) \le e$ is said to define the feasible region, and values of $(k_1, k_2)$ that satisfy this constraint for a given input are said to lie within the feasible region for that input. The function $C$ is the objective function, and a solution to the optimization problem is a point $(k_1, k_2)$ that lies within the feasible region and minimizes the objective function.

Figure 1: Cost vs. error for GEM. Each dot represents one knob setting for one input. Different colors represent different inputs.
Figure 2: Pareto-optimal curves for GEM benchmark. Different lines represent different inputs.

For most tunable programs, this is a very complex optimization problem since the Pareto-optimal knob settings vary greatly for different inputs [56]. To get a sense of this complexity, consider the GEM benchmark, a graph partitioner for social network graphs [60] studied in more detail in Sections 4.4 and 6.1. Figure 1 shows the results of running GEM with a variety of inputs and different knob settings, and measuring the cost (running time of the program) and the error of the resulting output. In this figure, each point represents the cost and error for a single combination of input graph and knob settings; points that correspond to the same input graph are colored identically. It can be seen that even for a single input graph, there are many knob combinations that produce the same output error, and that these combinations have widely different costs.

For a given input graph and output error, we are interested in minimizing cost, so only the leftmost point for each such combination is of interest. Figure 2 shows these Pareto-optimal points for each input graph. Since these Pareto-optimal curves are very different for different inputs, it is difficult to produce the Pareto-optimal knob settings for a given input and output error without exploring much of the space of knob settings for a given input, which is intractable for non-trivial systems.

One way to simplify the control problem is to require only that the expected output error over all inputs be less than some specified bound $e$. Since some inputs may be more likely to be presented to the system than others, each input $i$ can be associated with a probability $P(i)$ that is the likelihood that input $i$ is presented to the system. This lets us give more weight to more likely inputs, as is done in Valiant's probably approximately correct (PAC) theory of machine learning [59]. Since the cost function is still a function of the actual input, the knob settings chosen for a given value of $e$ will in general be different for different inputs, but the output error will be within the given error bound only in an average sense. In our approach, we consider a variation of this optimization problem, inspired by Valiant's work [59], in which we are also given a probability $p$ with which the error bound must be met. Intuitively, values of $p$ less than 1 give the control system a degree of slack in meeting the error constraint, permitting the system to find lower cost solutions. This control problem can be formulated as an optimization problem as follows.

Problem Formulation 2.

Given:

  • a program with knobs $k_1 \in K_1$ and $k_2 \in K_2$,

  • a set of possible inputs $I$, and

  • a probability function $P$ such that for any $i \in I$, $P(i)$ is the probability of getting input $i$.

For an input $i \in I$, error bound $e$, and a probability $p$ with which this error bound must be met, find a knob setting $(k_1, k_2)$ such that

  • $\Pr_{i \sim P}[\,E(i, k_1, k_2) \le e\,] \ge p$, and

  • $C(i, k_1, k_2)$ is minimized.

If the term $\Pr_{i \sim P}[\,E(i, k_1, k_2) \le e\,]$ (denoted by $F(k_1, k_2, e)$) is greater than or equal to $p$, then $(k_1, k_2)$ is in the feasible region for error bound $e$. For future reference, we call $F(k_1, k_2, e)$ the fitness of knob setting $(k_1, k_2)$ for error bound $e$; intuitively, the greater the fitness of a knob setting, the more likely it is that it satisfies the error bound for the given ensemble of inputs. In the rest of the paper, we refer to Problem Formulation 2 as the “control problem.”

4 Capri: Proactive Control for Approximate Programs

Figure 3: Overview of the Capri control system

For the complex applications we are interested in, the error function $E$ and the cost function $C$ are non-linear functions of the inputs, and it is difficult if not impossible to derive closed-form expressions for them. Therefore, we use machine learning techniques to build proxies for these functions offline, using a suitable collection of training inputs. Figure 3 is an overview of the control system, which we call Capri. For a given program, the system must be provided with a set of training inputs, and metrics for the error/quality of the output and the cost. The offline portion of the system runs the program on these inputs using a variety of knob settings, and learns models for the functions $E$ and $C$. These models are inputs to the controller in the online portion of the system; given an input and values of $e$ and $p$, the controller solves the control problem to estimate optimal knob settings. In the following, we describe the important modules of the Capri control system.

4.1 Error Model

The error model is a proxy for the fitness function $F$ and is used to determine whether a knob setting is in the feasible region. Intuitively, a knob setting is in the feasible region if the inputs for which the error is between 0 and $e$ have a combined probability mass greater than or equal to $p$. We use Bayesian networks [38] to determine this. A Bayesian network is a directed acyclic graph (DAG) in which each node represents a random variable in the model and each edge represents the dependence relationship between the variables corresponding to the nodes at its end points.

There are several ways to model the error probability distribution using a Bayesian network. We use a simple model in which each of the knobs and the error is modeled as a random variable and the output error depends on all of the knobs. The disadvantage of this simple model is that the size of the table for the output error is exponential in the number of knobs (see Section 5.1); however, it works well for the applications we have investigated. Our system allows new error models to be plugged into the overall framework easily (Section 5.1).
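To make the idea concrete, the sketch below (our illustration, not the production Capri code) estimates such a fitness table by simple counting over training runs: for every knob setting, it records the fraction of training inputs whose measured error stays within each error bound. The names `training_runs` and `error_bounds` are hypothetical.

from collections import defaultdict

def build_fitness_table(training_runs, error_bounds):
    """Estimate F(knobs, e) = Pr[error <= e] over the training inputs.

    training_runs is an iterable of (knob_setting, measured_error) pairs,
    where knob_setting is a tuple such as (k1, k2). This is a counting
    proxy for the degenerate Bayesian network in which the output error
    depends on all of the knobs.
    """
    totals = defaultdict(int)    # number of training runs per knob setting
    within = defaultdict(int)    # runs within bound, per (setting, bound)
    for knobs, err in training_runs:
        totals[knobs] += 1
        for e in error_bounds:
            if err <= e:
                within[(knobs, e)] += 1
    return {(knobs, e): within[(knobs, e)] / totals[knobs]
            for knobs in totals for e in error_bounds}

def is_feasible(fitness, knobs, e, p):
    """A knob setting is feasible for (e, p) if its fitness is at least p."""
    return fitness.get((knobs, e), 0.0) >= p

Because the table is indexed by complete knob settings, its size grows exponentially with the number of knobs, which is exactly the scalability concern revisited in Section 5.1.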

4.2 Cost Model

The cost model is the proxy for $C$. We model both the running time and the total energy. For most algorithms, the running time can vary substantially for different inputs; after all, even for simple algorithms like matrix multiplication, the running time is a function of the input size. For complex irregular algorithms like the ones considered in this paper, the running time depends not only on the input size but also on other features of the input. For example, the running time of a graph clustering algorithm is affected by the number of vertices and edges in the graph as well as the number of clusters. Therefore, the running time is usually a complex function of input features and knob settings. Our system currently requires the user to specify what these features are.

We use M5 [42], which builds tree-based models, to model the cost function $C$. Input features and knob settings define a multidimensional space; the tree model divides this space into a set of subspaces, and constructs a linear model in each subspace. The division into subspaces is done automatically by M5, which is a major advantage of using this system. Intuitively, this model can approximate cost well because the running time does not usually exhibit sharp discontinuities with respect to knob settings.
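Capri itself uses Cubist/M5 through R (Section 6.2); purely as an illustration of the same "partition the feature-and-knob space, fit a simple model per region" idea, the sketch below uses scikit-learn's DecisionTreeRegressor as a stand-in. The feature layout and the numbers are synthetic placeholders, not measurements.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Each row holds input features (here: #vertices, #edges, #clusters)
# followed by the knob settings; the target is the measured running time.
X_train = np.array([
    [1000, 5000, 8, 10, 10],
    [1000, 5000, 8, 40, 40],
    [50000, 400000, 8, 10, 10],
    [50000, 400000, 8, 40, 40],
])
y_train = np.array([0.8, 2.9, 31.0, 119.0])   # synthetic running times (seconds)

cost_model = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)

# Predict the cost of a new input at a candidate knob setting.
predicted_time = cost_model.predict([[20000, 150000, 8, 25, 25]])[0]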

4.3 Controller

The control algorithm must search the space of knob settings to find optimal knob settings, using the error and cost models as proxies for and respectively. Our system is implemented so that new search strategies can be incorporated seamlessly. This lets us evaluate model accuracy separately from search accuracy.

We evaluated two search algorithms: exhaustive search and Precimonious search [46]. If the error and cost models are not expensive to evaluate and each knob has a finite number of settings, we can use exhaustive search: we sweep over the entire space of knob settings, and for each knob setting, use the error model to determine whether that knob setting is in the feasible region. The cost model is then used to find a minimal-cost point in the feasible region. In a large search space, heuristics-based search is an effective way to trade off search cost for quality of the result. Precimonious search, which is based on the delta-debugging algorithm [22], is one such strategy. The algorithm starts with all knobs set to their highest values, and attempts to lower these settings iteratively. Precimonious can quickly prune the search space, but the solution it finds may be a local minimum. Other search strategies can be implemented easily within Capri.
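The exhaustive strategy can be sketched in a few lines; the version below is our illustration (not the Capri source) and assumes a fitness table and cost model of the kinds sketched in the two previous subsections. It keeps only knob settings whose fitness meets the probability bound and returns the cheapest of them.

import itertools

def exhaustive_control(knob_ranges, input_features, e, p, fitness, cost_model):
    """Return the cheapest knob setting predicted to satisfy the (e, p) constraint.

    knob_ranges: dict mapping a knob name to the list of values it may take.
    fitness and cost_model are the offline proxies for the error and cost functions.
    """
    best_setting, best_cost = None, float("inf")
    names = sorted(knob_ranges)
    for values in itertools.product(*(knob_ranges[n] for n in names)):
        if fitness.get((tuple(values), e), 0.0) < p:    # error-model feasibility check
            continue
        cost = cost_model.predict([list(input_features) + list(values)])[0]
        if cost < best_cost:                            # cost-model ranking
            best_setting, best_cost = dict(zip(names, values)), cost
    return best_setting, best_cost

A heuristic search such as Precimonious would replace the `itertools.product` sweep with an iterative lowering of knob values, trading optimality for a much smaller number of model evaluations.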

4.4 Results

We evaluate the control system on the following five complex applications: (i) GEM, a graph partitioner for social networks [60], (ii) Ferret [5], a content-similarity based image search engine, (iii) ApproxBullet [40], a 3D physics game engine, (iv) SGDSVM [8], a library for support vector machines, and (v) OpenOrd [31], a library for two-dimensional graph layouts. We modified the code for these applications to permit control of approximation. These applications provide between two and five knobs that allow tuning tradeoffs between a user-specified quality metric and execution time or energy consumption. In addition, we did a blind test of the system using an unmodified radar processing application [24], written by Hank Hoffmann at the University of Chicago.

Evaluation of the cost and error models.

For each benchmark, we collected a set of inputs. To evaluate the error and cost models, the inputs were randomly partitioned into training and testing subsets. We evaluated our control system for the error bound $e$ ranging from 0 to 0.5 and the probability bound $p$ ranging from 0.5 to 1.0. Training is done offline, so training time is not as important as the accuracy of the cost and error models. Training time obviously increases with the number of training inputs, but even for Ferret, which has the largest training set, it takes only a few seconds to train the error model and about two minutes to train the cost model (this does not take into account the time to run the application programs).

Figure 4: Accuracy of cost and error models

Evaluating the accuracy of the cost model for a given application is straightforward: we sweep the space of test inputs and knob settings, and for each point in this space, we compare the running time predicted by the cost model with the actual execution time. The top charts in Figure 4 show the results for the applications in our test suite. In each graph, the x-axis is the predicted running time and the y-axis is the measured running time. If the cost model is perfect, all points should lie on the $y = x$ line. Figure 4 shows that this is more or less true for Ferret and Radar. For GEM and SGD, the predicted time is usually less than the actual execution time, and for Bullet and OpenORD, the over-predictions and under-predictions are more or less evenly distributed. Radar implements a regular algorithm in which running time depends on the size of the input. In contrast, GEM and OpenORD implement complex graph algorithms, so they are more irregular in their behavior.

Estimating the accuracy of the error model has to be more indirect since the model does not make error predictions for individual inputs but only for an ensemble of inputs $I$. The error model is a proxy for the fitness function $F(k_1, k_2, e)$. This proxy is constructed during the training phase by letting $I$ be the set of training inputs. One way to evaluate the accuracy of this proxy is to construct another proxy by letting $I$ be the set of test inputs. If the model is accurate, these two proxy functions, which we call the predicted fitness and measured fitness, will be equal.

The bottom charts of Figure 4 show the results of this experiment. The x and y axes in each graph are the predicted and measured fitness respectively. We sweep over the space of (discretized) error values and knob settings, and for each point in this space, we evaluate the two proxy functions and plot the point in the graph. We see that the error model is very accurate for Bullet, Ferret and Radar, and less so for the other three benchmarks. For GEM and SGD, most of the points lie above the $y = x$ line, which means that the predicted fitness is usually less than the actual fitness. Therefore, the feasible region determined by using the model may be smaller than the actual feasible region.

It is important to note that since the error and cost models are used only to rank knob settings in the feasible region, more accurate models do not necessarily give better solutions to the control problem.

Optimizing run-time performance.
Within each benchmark group, the columns are error bounds e = 0.0, 0.1, 0.2, 0.3, 0.4, 0.5 (left to right); each row corresponds to a probability bound p.

p=1.0  Bullet: 1.0 1.0 1.0 1.0 1.0 1.0   Ferret: 1.0 1.0 1.0 1.0 1.1 1.1   GEM: NA NA 1.3 1.4 1.6 1.9   OpenOrd: NA 2.0 2.4 6.3 6.3 5.9   SGD: NA 2.5 5.6 14.1 31.3 77.4
p=0.9  Bullet: 1.0 1.0 1.0 1.0 1.0 1.0   Ferret: 1.1 1.4 1.4 1.6 1.7 1.9   GEM: NA 1.1 1.5 1.9 2.2 2.5   OpenOrd: NA 2.9 6.5 6.0 5.9 8.4   SGD: NA 11.8 30.3 43.0 56.3 94.9
p=0.8  Bullet: 1.0 1.0 1.0 1.0 1.0 24.3   Ferret: 1.1 1.4 1.6 1.6 1.9 1.9   GEM: NA 1.2 1.7 2.1 2.4 2.6   OpenOrd: NA 5.2 6.4 8.7 8.7 8.5   SGD: NA 39.7 52.5 103.7 165.0 184.0
p=0.7  Bullet: 1.0 1.0 1.0 1.0 9.0 96.6   Ferret: 1.1 1.4 1.6 1.7 1.9 2.0   GEM: NA 1.3 1.7 2.1 2.5 2.7   OpenOrd: NA 6.3 6.1 8.5 8.8 8.5   SGD: NA 73.3 101.4 139.9 176.1 271.7
p=0.6  Bullet: 1.0 1.0 1.0 1.7 96.6 141.3   Ferret: 1.2 1.4 1.6 1.7 1.9 2.0   GEM: NA 1.4 2.0 2.2 2.6 2.7   OpenOrd: NA 6.0 8.5 8.7 8.5 8.5   SGD: 1.0 97.5 136.7 207.0 302.0 395.3
p=0.5  Bullet: 1.0 1.0 1.0 39.9 115.4 204.6   Ferret: 1.2 1.4 1.6 1.7 2.0 2.0   GEM: NA 1.7 2.3 2.5 2.7 3.0   OpenOrd: NA 8.5 8.5 8.6 8.4 8.4   SGD: 1.0 104.6 161.1 259.1 302.0 418.4

Table 1: Speedups of the tuned programs for a subset of the constraint space.

Speedup is defined as the ratio of the running time with the knobs set for maximum quality to the running time at the knob setting chosen by the controller. Table 1 shows speedups for each application for error bounds $e$ between 0 and 0.5 and probability bounds $p$ between 0.5 and 1.0. Each entry gives the average speedup over all test inputs for the knob settings found by the control algorithm based on exhaustive search, given constraints in the intervals specified by the row and column indices.

Speedups depend on the application and the constraints. For each application, the top-left corner of the constraint space is the “hard” region since the error must be low with high probability. The knob settings must be at or close to maximum, and speedup will be limited. Table entries marked “NA” show where the control system was unable to find any feasible solution for these hard constraints. In contrast, the bottom-right corner of the constraint space is the “easier” region, so one would expect higher speedups. This is seen in all benchmarks. Overall, we see that controlling the knobs in these applications can yield significant speedups in running time.

Effectiveness in finding optimal knob settings.

While Table 1 shows speedups obtained from the knob settings found by the control algorithm in different regions of the constraint space, it does not show how well these constraints were actually met. To provide context, we have evaluated this both for our method and for a similar method using linear regression to model both error and running time (linear regression can be seen as the simplest non-trivial model one can build for these values). For each $(e, p)$ combination, we evaluated the quality of the achieved control.

Overall, the control system using the Bayes model for error and the M5 model for cost performs quite well for all inputs and regions of the constraint space: for most points, it finds solutions and the cost difference from the oracle’s solution is within 40%. The only noticeable problem is in SGD. A closer study showed that the feasible region found by the Bayes error model is smaller than it should be and did not contain some low-cost points found by the oracle control. This can be attributed to the fact that the predicted fitness function for SGD is somewhat conservative, as seen in Figure 4.

In contrast, the control system based on linear regression performs quite poorly. No solutions are found in most parts of the space, and even when solutions are found, the cost of the solutions is very sub-optimal.

Performance of the Radar processing application.

We also performed a blind test of the system using a radar processing application [24]. Unlike the five applications described above, this code was already instrumented with knobs, so we used it out of the box as a blind test for our system. Using our machine-learning-based control scheme, we were able to obtain speedups over a base fixed system configuration comparable to those obtained by hand tuning. In contrast, models using linear regression were unable to find solutions in most of the constraint space.

Optimizing energy consumption.

We note that a major advantage of our approach is that it can be used to optimize not just running time but any metric for which a reasonable cost model can be constructed. In this section, we show the results of applying the system to optimizing energy consumption for the same benchmarks. We measured energy on an Intel Xeon E5-2630 CPU with 16 GB of memory. We used the Intel RAPL (Running Average Power Limit) interface and PAPI to measure the energy consumption. This machine does not support DRAM counters, so what is being measured is the CPU package energy consumption.

Table 2 shows the energy savings obtained for our benchmarks for error bounds $e$ between 0 and 0.5 and probability bounds $p$ between 0.5 and 1.0. Each entry gives the average energy savings over all test inputs for the knob settings found by our control algorithm given constraints in the intervals specified by the row and column indices. As expected, savings are greater when the constraints are looser.

Within each benchmark group, the columns are error bounds e = 0.0, 0.1, 0.2, 0.3, 0.4, 0.5 (left to right); each row corresponds to a probability bound p.

p=1.0  Bullet: 1.0 1.0 1.0 1.0 1.0 1.0   Ferret: NA 1.0 1.0 1.0 1.1 1.1   GEM: NA NA 1.3 1.5 1.7 1.9   OpenOrd: NA 2.4 6.1 7.2 7.2 8.9   SGD: NA 21.6 59.5 83.3 108.3 107.3   Radar: 1.0 1.0 1.0 1.1 1.1 1.1
p=0.9  Bullet: 1.0 1.0 1.1 1.0 1.0 1.0   Ferret: 1.0 1.5 1.7 1.8 1.8 1.8   GEM: NA NA 1.7 2.0 2.1 2.3   OpenOrd: NA 6.0 7.1 7.2 8.9 8.9   SGD: NA 51.0 98.0 149.2 168.7 262.7   Radar: 1.0 1.0 1.0 1.1 1.1 1.1
p=0.8  Bullet: 1.0 1.1 1.0 1.0 1.0 1.0   Ferret: 1.1 1.6 1.8 1.8 1.8 1.8   GEM: NA 1.1 1.8 2.1 2.3 2.5   OpenOrd: NA 6.0 7.2 8.9 8.9 8.9   SGD: NA 91.0 192.5 266.0 265.0 319.0   Radar: 1.0 1.0 1.0 1.1 1.1 1.1
p=0.7  Bullet: 1.0 1.0 1.0 1.0 1.0 1.0   Ferret: 1.1 1.7 1.8 1.8 1.8 1.8   GEM: NA 1.2 1.8 2.3 2.5 2.5   OpenOrd: NA 7.1 7.2 8.9 8.9 8.9   SGD: NA 112.7 193.6 265.0 338.2 319.0   Radar: 1.0 1.0 1.0 1.1 1.1 1.1
p=0.6  Bullet: 1.0 1.0 1.0 1.0 1.0 1.2   Ferret: 1.4 1.8 1.8 1.8 1.8 1.8   GEM: NA 1.5 2.1 2.3 2.5 2.8   OpenOrd: NA 7.2 8.9 8.9 8.9 8.9   SGD: 1.0 110.2 193.6 345.1 341.8 410.2   Radar: 1.0 1.0 1.0 1.3 1.3 1.3
p=0.5  Bullet: 1.0 1.0 1.0 1.0 1.2 1.3   Ferret: 1.4 1.8 1.8 1.8 1.8 1.8   GEM: NA 1.7 2.3 2.5 2.8 2.8   OpenOrd: NA 8.9 8.9 8.9 8.9 8.9   SGD: 1.0 129.9 254.2 345.1 420.2 410.2   Radar: 1.0 1.0 1.0 1.3 1.3 1.3

Table 2: Energy savings of the tuned programs for a subset of the constraint space.

5 Extending Capri

The Capri system is an example of an open-loop control system, which uses a model of the system (in our case, the tunable program) to determine optimal knob settings before the application is executed. However, the Capri control system suffers from the following drawbacks.

5.1 Scaling to Large Tunable Programs

The open-loop control system described in Section 4 works well for programs that are a few hundred lines long and have five or six knobs. This holds true especially for the Bayesian network-based error model that was used as a proxy for the fitness function $F$, as discussed in Section 4.1. There are many ways to model the error probability distribution using a Bayesian network. The original Capri work used a simple model, where the output error E depends directly on the settings of all of the knobs. Although this is simple, the size of the table for the output error is exponential in the number of knobs. This was not a problem for the applications studied in the original Capri paper. However, the control system may suffer from poor performance with applications that expose many (hundreds of) knobs and therefore have a large space of knob settings.

There are several ways to improve the performance of a Bayesian network-based error model (or one based on any other machine learning technique) and to scale the open-loop controller. In the following, we discuss a few opportunities:

  • Reducing the size of the knob space: This can be achieved by (i) reducing the number of knobs that need to be controlled simultaneously, a process that we call knob orthogonalization, and (ii) reducing the number of settings for each knob.

    The first step is to exploit phase behavior in long-running programs [52, 53]. For example, a Barnes-Hut n-body code executes the following phases repeatedly: (i) build the spatial decomposition tree, (ii) compute the mass and center of gravity of each spatial partition, (iii) compute force on each particle, and (iv) update position and velocity of each particle. At any given point in the execution, the program is executing only one of these phases, so the overall control problem can be decomposed into a set of smaller control problems, one for each phase, thereby reducing the number of knobs that need to be controlled simultaneously.

    The next step is to reduce the number of knobs by exploiting the 90/10 rule, which says that in most programs, more than 90% of the execution time is spent in less than 10% of the code. By ignoring knobs outside such “hot” regions, it may be possible to obtain most of the benefits of optimal control without the effort of controlling every knob in a program. In Barnes-Hut for example, more than 90% of the time is spent in the force computation phase, so it may be possible to ignore knobs in all other phases, at least for controlling computation time and energy.

    Once the number of knobs that need to be considered simultaneously has been minimized, reducing the number of knob settings that need to be considered by the control system can be accomplished by using a mixture of coarse-grain and fine-grain knobs. If the program output changes relatively slowly with the value of a particular control variable, a coarse-grain knob with relatively few settings can be used to set the value of that variable, reducing the size of the search space for optimal knob settings. Profiling with test data can be used to determine the relative sensitivity of the output to particular control variables.

  • Reducing search time for optimal knob settings: Our current control algorithm sweeps the knob space to find optimal knob settings for a given input and desired quality guarantees. Although exhaustive search has worked well for the small-scale applications we have considered so far, it obviously does not scale to large numbers of knobs, so we will develop intelligent search algorithms to find optimal knob settings efficiently in a large knob space. As mentioned in Section 4, we have experimented with the heuristic search strategy used in the Precimonious system [46]. The results showed, as one might expect, that Precimonious was significantly faster but found sub-optimal knob settings compared to our current exhaustive search strategy. In particular, for the OpenOrd application, Precimonious search got stuck in a local minimum that was sub-optimal. We will investigate search techniques that trade off computing time for solution quality.

  • Scalable error models: The Bayesian error model in Capri has the virtue of being simple, but it does not scale since the size of the conditional probability distribution tables increases exponentially with the number of knobs. Abstractly, the error model is a function $F(k_1, \ldots, k_n, e)$ that maps a knob setting and an error bound to the corresponding probability defined in Problem Formulation 2. The simple Bayesian model explicitly stores the probability for all combinations of knob settings and error bounds. In our studies, we have found that the probability function changes quite slowly as the knob settings are changed. Therefore, we might be able to usefully approximate the probability table by partitioning the knob space into subspaces and using a simple model, like a linear model, within each subspace. This is what a tool like M5 will do automatically if it is given the same training data as the Bayesian error model. We are investigating these model compression techniques.

  • Clustering inputs: Instead of building a single error model and cost model to handle all inputs, we can use clustering techniques [6, 35] to cluster the inputs into a set of classes such that within each class, the error and cost behaviors are similar. For each class, we can build a separate quality and cost model using our approach. At runtime, a given input is first classified and then the corresponding models are used to set the knobs. This may improve both the accuracy and the scalability of learning and querying the quality and cost models, since the complexity of the models can be reduced by considering a subset of input scenarios. Clustering has been used successfully for auto-tuning in the Petabricks system [1]. Automatic feature extraction and selection techniques may be useful for this problem; for example, they have been quite successful in the audio domain [32, 36]. A minimal sketch of this clustering idea appears after this list.
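The sketch below illustrates the clustering idea using scikit-learn's KMeans on user-provided input features; the feature values and the per-cluster model training hook are placeholders, not part of the current Capri implementation.

import numpy as np
from sklearn.cluster import KMeans

def cluster_and_train(input_features, train_models, n_clusters=4):
    """Group training inputs by feature similarity and train per-cluster models.

    input_features: array of shape (n_inputs, n_features), one row per input.
    train_models: callable that, given the indices of the inputs in one
    cluster, builds and returns that cluster's (error_model, cost_model).
    """
    clustering = KMeans(n_clusters=n_clusters, n_init=10).fit(input_features)
    models = {c: train_models(np.where(clustering.labels_ == c)[0])
              for c in range(n_clusters)}
    return clustering, models

def models_for(clustering, models, new_input_features):
    """At runtime, classify the new input and return that cluster's models."""
    cluster = int(clustering.predict([new_input_features])[0])
    return models[cluster]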

5.2 Closed-loop Control

Open-loop control systems cannot adapt during execution to compensate for model error. Such errors can be significant; for the SGD benchmark for example, the Capri control system does not find some low-cost points found by the oracle control because the Bayesian error model is overly conservative, as seen in Figure 4.

The need to compensate for modeling errors, particularly in the context of complex systems, presents an opportunity for closed-loop control. In this approach, a function of the current system state and/or output is fed back as an input to the control system so that system behavior can be optimized for subsequent computations. Closed-loop control systems are generally applicable to a large class of iterative and streaming applications that have a notion of “progress.” For an iterative application, each iteration represents progress, and provides the control system with an opportunity for correcting the difference between the current value of a system variable and the desired “setpoint.” For streaming computations, the application processes a sequence of inputs and produces a sequence of outputs, so the processing of successive inputs represents progress. Closed-loop control systems are well-studied in control theory, and systematic techniques for designing controllers with provably desirable properties are well understood, especially for linear, time-invariant systems. These techniques have proved to be adequate for simple cruise control systems in cars, autopilots in aircraft, audio amplifiers, and basic process control systems in manufacturing.

Recently, there has been a surge in using closed-loop control to build adaptive software and hardware systems for complex applications. However, there are several challenges: 1) building reasonable initial approximate cost and quality models, 2) finding effective run-time metrics strongly correlated with cost and error/quality, 3) low-overhead profiling of these run-time metrics, and 4) updating knob settings and the cost and quality models efficiently. These ideas have been explored in recent papers [4, 18, 23, 27, 30, 19, 37, 26] on adapting traditional control theory for use in computer applications. Some of these systems consider a combination of system knobs, e.g. the number of cores used and their clock rate, in addition to application knobs of the sort we use in Capri for open-loop control. However, existing systems typically use a separate PID (proportional-integral-derivative) controller [3] for each type of knob and employ ad-hoc techniques to combine these into an overall system. PID controllers have the advantage of not needing system models, but because of this, they cannot ensure optimal control; in addition, composing these controllers in ad-hoc ways limits the degree to which overall system behavioral properties can be guaranteed. They share these properties with systems based on reinforcement learning [57]. Such ad-hoc techniques are ill-suited for several emerging classes of applications, such as exascale applications, which can execute for several days and which require tuning of additional system-level knobs related to resource allocation, such as load balancing and allocating cores [26].

We propose to extend our model-based open-loop control framework to provide a systematic approach for designing closed-loop controllers that integrate the use of system and application knobs to achieve predictable, desirable system behavior. To that end, we will extend the strategy used in the established area of Model-Predictive Control (MPC) for traditional control systems [61, 9]. Traditional MPC systems are used to design relatively complex process control systems for industrial plants, and can be more effective than simple PID controllers. Unlike PID controllers, they are based on specific, closed-form dynamics models of the processes being controlled. Based on these models, an explicit closed form objective function describing the desired system behavior as well as explicit closed-form constraints on the range of behaviors allowed can be expressed. This results in a formulation of the control problem as a constrained optimization problem, where the behavioral objective function is optimized subject to the specified constraints. Note that this is similar to our formulation for the open-loop problem. In traditional MPC systems, these functions are closed-form continuous functions to be optimized over an infinite time period. To make the problem computationally tractable, a finite time horizon is imposed and the optimal trajectory of the control settings over that time horizon is computed. Since in the real control system, knobs are set at discrete time intervals, the setting computed by the optimizer for the first time interval is used by the controller. The optimization step is then repeated for the given horizon, and the first step of the resulting trajectory actually used, and so on. Of course, this comes at the cost of more expensive computation per time step than for PID controllers. In principle, the traditional MPC method can be extended to non-linear systems, although the resulting nonlinear optimization problems may be too expensive to solve, for real-time use, using traditional methods.
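As an illustration of the receding-horizon idea only (not a description of an implemented Capri controller), the loop below re-optimizes knob settings over a short horizon at every step and applies just the first planned setting; all of the callables it takes are hypothetical hooks.

def mpc_loop(optimize_over_horizon, apply_knobs, observe, refine_models,
             models, constraints, horizon=5, steps=100):
    """Receding-horizon control sketch: plan over a short horizon, apply only
    the first planned knob setting, observe run-time feedback, and optionally
    refine the models before the next step."""
    state = None
    for _ in range(steps):
        # Plan a trajectory of knob settings that minimizes predicted cost
        # subject to the quality constraint over the finite horizon.
        plan = optimize_over_horizon(models, constraints, state, horizon)
        apply_knobs(plan[0])        # use only the first step of the plan
        state = observe()           # run-time metrics correlated with cost/quality
        models = refine_models(models, state)   # e.g. a Kalman-style model update
    return models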

In the complex systems we wish to consider, we know of no closed-form models for the cost and quality functions, but we can model these using machine learning techniques as described in Section 4. These will constitute the initial approximate cost and quality models for the proposed closed-loop control system. We are currently working on finding effective run-time metrics strongly correlated with cost and error/quality for the applications discussed in Section 4. We are analyzing Simultaneous Localization And Mapping (SLAM) applications to determine sensitivity to platform knobs as well as the best application knobs to use in trading off accuracy for computation time and energy savings [37]. We are also exploring the incorporation of our MPC-based controllers into systems like APEX [26] for controlling exascale computations (the current APEX system uses a simple proportional controller to control the number of cores assigned to a computation, for example). We believe these kinds of real-world applications are a rich source of interesting problems for our proposed extensions.

We are also working on methods for incrementally updating optimal knob settings using the models and feedback information about the state of the computation. If this is done at each iteration of a streaming computation, we can see that this fits the model of MPC control with a time horizon for optimization of a single time-step. The final step will be to incorporate model updates into the system, as is done in approaches like Kalman filters in traditional control theory [3]. In our current open-loop control system, we do not take into account the results of previous computations to refine the models constructed during the initial training. For online control, it may be desirable to incorporate some kind of model refinement so that subsequent optimization steps improve in quality. A related goal is to develop techniques for guaranteeing that our systems converge to desired behaviors using this approach. Finally, we will develop techniques for multi-time-step optimization to provide better results on appropriate problems such as recognizing and tracking the motion of objects with multi-modal sensors.

6 Making Capri Extensible

Section 5 discusses possible extensions to Capri [56]. We are actively working on tailoring the existing Capri implementation to integrate these extensions. This section describes our new Capri implementation in detail, and uses applications from three different domains to demonstrate the effectiveness of the new implementation and to regression-test it.

6.1 Applications

We evaluate the new Capri system on three complex applications: (i) GEM, the graph partitioner for social networks [60] which was introduced earlier, (ii) a radar processing application [24] written by researchers at the University of Chicago, and (iii) SLAMBench, an open source tool designed to assist in the development of simultaneous localisation and mapping (SLAM) algorithms [37]. The code for GEM was modified by us to permit control of approximation, while Radar and SLAMBench were already set up for control.

Error/Quality definition.

To compute the error/quality of the output, we require the user to provide a distance function that quantifies the difference between an approximate execution and a reference execution for a given input. The reference execution can be the exact execution, if such a thing exists, or the best execution in the knob space for that input. The error is defined as a normalized version of this distance:

$$error = \frac{d - d_{min}}{d_{max} - d_{min}}$$

where $d$, $d_{max}$, and $d_{min}$ represent the distance for an execution, and the maximum and minimum distances over the knob space for the same input. The distance function is application-specific.
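A minimal sketch of this normalization, assuming the per-knob-setting distances for one input have already been measured:

def normalized_error(distance, distances_over_knob_space):
    """Scale a raw distance into [0, 1] using the minimum and maximum
    distances observed over the knob space for the same input."""
    d_min = min(distances_over_knob_space)
    d_max = max(distances_over_knob_space)
    if d_max == d_min:          # all knob settings behave identically
        return 0.0
    return (distance - d_min) / (d_max - d_min)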

6.1.1 Gem

GEM [60] is a graph clustering algorithm for social networks.

Knobs:

There are two components; both use a weighted kernel k-means algorithm and have a knob controlling the number of iterations. Each knob can be set to one of 40 levels (see the GEM configuration file in Section 6.2). All input graphs are partitioned into the same fixed number of clusters in our experiments.

Error metric: The output of GEM is the cluster assignment of each node in the graph. There is a standard way to measure the quality of graph clustering, using the notion of a normalized cut, which is defined as follows:

$$NCut = \sum_{i=1}^{k} \frac{\sum_{j \ne i} E_{ij}}{2 E_{ii} + \sum_{j \ne i} E_{ij}}$$

where $k$ is the number of clusters, $E_{ij}$ denotes the number of edges between cluster $i$ and cluster $j$, and $E_{ii}$ denotes the number of edges inside cluster $i$. The distance function computes the difference of the normalized cut between two clustering assignments. The reference execution is the execution achieving the smallest normalized cut.
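A sketch of this metric under the standard definition assumed above (the edge counts between and inside clusters are computed from an undirected edge list; the function and argument names are ours):

from collections import Counter

def normalized_cut(edges, assignment, k):
    """edges: iterable of (u, v) pairs of an undirected graph;
    assignment: dict mapping each node to a cluster id in [0, k)."""
    between = Counter()   # edges leaving each cluster
    inside = Counter()    # edges with both endpoints in the same cluster
    for u, v in edges:
        cu, cv = assignment[u], assignment[v]
        if cu == cv:
            inside[cu] += 1
        else:
            between[cu] += 1
            between[cv] += 1
    total = 0.0
    for c in range(k):
        degree = 2 * inside[c] + between[c]   # total degree of cluster c
        if degree > 0:
            total += between[c] / degree
    return total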

Input features for modeling cost: the number of vertices in the graph, the number of edges and the number of clusters.

6.1.2 Radar

We used a radar processing application [24] developed by Hank Hoffmann at the University of Chicago. Unlike the other applications, this code was already instrumented with knobs, so we used it out of the box as a blind test for our system. This code is a pipeline with four stages. The first stage (LPF) is a low-pass filter to eliminate high-frequency noise. The second stage (BF) does beam-forming which allows a phased array radar to concentrate in a particular direction. The third stage (PC) performs pulse compression, which concentrates energy. The final stage is a constant false alarm rate detection (CFAR), which identifies targets.

Knobs: The application supports four knobs. The first two knobs change the decimation ratios in the finite impulse response filters that make up the LPF stage. The third knob changes the number of beams used in the beam former. The fourth knob changes the range resolution. The application can have 512 separate configurations using these four knobs.

Error metrics: The signal-to-noise ratio (SNR) is used to measure the quality of the detection. The reference execution is the one achieving the highest SNR.

Input features for modeling cost: No input features are used in this application.

6.1.3 SLAMBench

SLAMBench is an open source tool designed to assist in the development of simultaneous localisation and mapping (SLAM) algorithms, and evaluation of platforms for implementing those algorithms [37]. It runs on the Linux operating system, and has been used on X86 and ARM along with various GPUs, from high-end to mobile devices. SLAMBench combines a framework for quantifying quality-of-result with instrumentation of execution time and energy consumption. It contains a KinectFusion [39] implementation in C++, OpenMP, OpenCL and CUDA. It offers a platform for a broad spectrum of future research in jointly exploring the design space of algorithmic and implementation-level optimizations.

Knobs: The application supports several algorithm-level knobs [7], such as volume resolution, the iterative closest point (ICP) threshold, etc. To reduce the search space over the set of all possible knob combinations, we vary only those knobs that appear to have a high correlation with run-time performance and tracking accuracy. We used the following four algorithmic parameters as knobs:

  • Compute size ratio - The fractional depth image resolution used as input. As an example, a value of 8 means that the raw frame is resized to one-eighth resolution.

  • ICP threshold - The threshold for the iterative closest point (ICP) algorithm used during the tracking phase in KinectFusion.

  • μ distance - The output volume of KFusion is defined as a truncated signed distance function (TSDF) [39]. Every volume element (voxel) of the volume contains the best likelihood distance to the nearest visible surface, up to a truncation distance denoted by the parameter μ.

  • Volume resolution - The resolution of the scene being reconstructed. As an example, a 64x64x64 voxel grid captures less detail than a 256x256x256 voxel grid.

Error metrics: The KinectFusion algorithm reports the absolute trajectory error (ATE) in meters after processing an input. The ATE measures accuracy and represents the precision of the computation. Acceptable values are in the range of a few centimeters.

Input features for modeling cost:

An input in SLAMBench is a trajectory, which is a sequence of depth images. We have defined the following features that can be extracted from a given trajectory: the mean and standard deviation of the depth values in a frame, and the mean and standard deviation of the differences in depth values between successive frames. The first two features track the variation among pixels in a single frame, while the second pair aims to capture the variation in depth values across successive images, i.e., the "burstiness" between two successive frames.
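One plausible realization of these four features with NumPy is sketched below, assuming the trajectory is available as a list of 2-D depth arrays (this is our illustration, not the exact feature extractor in Capri):

import numpy as np

def trajectory_features(depth_frames):
    """depth_frames: list of 2-D NumPy arrays, one depth image per frame.

    Returns [mean depth, std of depth, mean frame-to-frame difference,
    std of frame-to-frame difference] for the whole trajectory."""
    depths = np.stack(depth_frames).astype(float)
    diffs = np.abs(np.diff(depths, axis=0))   # change between successive frames
    return [depths.mean(), depths.std(), diffs.mean(), diffs.std()]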

6.2 Implementation

Environment.

Capri has been implemented and tested with Python v3.5.

M5 model.

The Cubist application (http://rulequest.com/cubist-info.html) implements the M5 [42] machine learning model. Training a Cubist/M5 model requires a schema file which lists the independent and dependent variables, and a file containing the data points used for training. After training, Cubist/M5 generates a set of piecewise-linear rule-based models that balance the need for accurate prediction against the requirements of intelligibility. Cubist/M5 models generally give better results than those produced by simple techniques such as multivariate linear regression, while also being easier to understand than more complex neural networks. Cubist/M5 scales well to hundreds of attributes.

Capri uses the Cubist package for the R programming language (https://cran.r-project.org/web/packages/Cubist/index.html). We have written a Python wrapper for interfacing with the Cubist R package. Our new Capri implementation is modular, which makes it easy to replace the M5 model with other machine learning models.

Source structure.

The Capri source is divided into the following directories:

  • lib - Contains the source for the Cubist R module, and a Python wrapper for interfacing with Capri.

  • scripts - Contains scripts for helping with running applications and parsing the output results.

  • src - This directory contains Python modules that implement the control algorithm in Capri.

Running Capri.

Capri can be run with Python version 3.5, and requires the following Python packages: numpy, psutil, overrides, matplotlib, and ordered_set. These packages can be installed by invoking the following command if required:

  pip3 install --upgrade numpy psutil overrides matplotlib ordered_set

Since each application has a unique set of knobs and a different range of values, a user of the Capri system needs to list the details of the knobs and their range of values in a configuration file. The Capri source provides configuration files for several applications that we have used. The following snippet shows an example for the GEM application:

[FIXED]

PBS = (1.0;0.05;-0.05)
EBS = (0.05;1;+0.05)
TRAIN_RATIO = 0.75
ACCURATE_KNOBS = {'iter1': '40', 'iter2': '40'}

[KNOBS]

NUM_FIRST_ITER = (1;40;+1) # iter1
NUM_SECOND_ITER = (1;40;+1) # iter2

The configuration section FIXED lists experimental settings that are common to all applications. PBS and EBS stand for the acceptable probability and error bounds, as discussed in Problem Formulation 2 in Section 3. TRAIN_RATIO specifies the proportion of the experimental data to be used in training the M5 models; the rest of the input data is used for prediction. ACCURATE_KNOBS specifies the knob configuration that computes the most accurate output, which is used to compute the “golden value” (i.e., the most accurate output) and to scale the error.
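For illustration, the semicolon-separated tuples in the snippet above can be expanded into concrete knob values with a few lines of Python; the sketch below is our own (not part of the Capri source) and assumes the three fields are start, end, and step, and that the file is named gem.config.

import configparser
import numpy as np

def parse_range(spec):
    """Expand a '(start;end;step)' string, e.g. '(1;40;+1)', into a list of values."""
    start, end, step = (float(x) for x in spec.strip("()").split(";"))
    values = np.arange(start, end + step / 2, step)   # include the end point
    return [int(v) if float(v).is_integer() else round(float(v), 4) for v in values]

# inline_comment_prefixes strips trailing "# iter1"-style comments.
config = configparser.ConfigParser(inline_comment_prefixes="#")
config.read("gem.config")                              # hypothetical file name
knob_ranges = {name: parse_range(spec) for name, spec in config["KNOBS"].items()}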

Given a configuration file for an application, we have automated all the steps involved in running the Capri control system with the application. Executing the control system involves four steps:

  • run - Run the application with different knob settings to generate the experimental data that is later used for offline machine learning and prediction. This task does not depend on other tasks and can be run independently. Note that running this task over all possible knob settings can take a long time (several hours to several days, depending on the application).

  • stats - Process a set of experimental results to collect statistics. This task depends on output generated by a prior run task.

  • predict - Train the M5 models and compute the feasible region for a given error-bound and probability-bound constraint (Section 3). This task depends on the run task.

  • result - Find the optimal knob setting that minimizes the objective function and meets the constraints set in Problem formulation 2. This task also generates plots and speedup numbers that help evaluate the performance of the Capri control system. It depends on the stats and predict tasks.

In the following, we show a sample invocation of the Capri control system.

capri bench=gem input=all outputDir=gem-full tasks=run,stats,predict,result

To see the full list of options that Capri accepts, use

  capri -h

Extending Capri with new applications.

It is straightforward to add support for new applications to Capri. A user of the Capri system needs to do the following:

  • Implement an application-specific module under apps in the src directory. The module defines how to compute the cost and the error for the application (see the sketch after this list); the existing application modules can serve as a reference.

  • Provide a configuration file for the application.

  • Use Capri to run the application with different knob settings, and then use the controller to predict optimal knob settings for any given performance metric and quality constraint.
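Below is a hypothetical skeleton of such an application-specific module. The class and method names are illustrative assumptions; consult the existing modules under src/apps for the actual interface Capri expects.

  # src/apps/myapp.py -- hypothetical skeleton of a new application module.

  class MyApp:
      name = "myapp"

      def run(self, knob_settings, input_path):
          """Run the application with the given knob settings (for example
          through subprocess) and return a record of the run's output."""
          ...

      def compute_cost(self, run_record):
          """Return the cost metric (e.g., running time) of one run."""
          ...

      def compute_error(self, run_record, golden_record):
          """Return the error of one run relative to the output produced at
          the most accurate knob setting (the "golden value")."""
          ...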

7 Evaluation

In this section, we present results from our new implementation of the Capri control system on GEM, Radar, and SLAMBench. For each benchmark, we collected a set of inputs as shown in Table 3. We would have liked more training inputs for GEM, and we are currently investigating additional sources of inputs for SLAMBench. To evaluate the error and cost models, the inputs were randomly partitioned into a training and a test suite according to TRAIN_RATIO.

Benchmark    #Total   #Train (75%)   #Test (25%)   Source
GEM            43          33             10       [28, 62]
Radar         128          96             32       synthetic
SLAMBench      12           9              3       [37]
Table 3: Inputs for benchmarks. Inputs are randomly divided into training set and testing set.
Figure 5: Accuracy of the cost model with the new implementation of Capri. Panels: (a) GEM, (b) Radar, (c) SLAMBench.

7.1 Evaluation of the Cost and Error Models

We regression-test our new implementation of Capri by comparing the performance of the M5 cost and error models with prior results (Section 4.4). Figure 5 shows the accuracy of the M5 cost model. As in Figure 4, the black line is the y=x line, which represents perfect prediction; the green line is a linear-regression fit to the data points. Comparing Figures 4 and 5, the behavior of the new M5 cost model closely matches the earlier results.

Figure 6: Accuracy of the error model with the new implementation of Capri. Panels: (a) GEM, (b) Radar, (c) SLAMBench.

Prior published work on Capri used a Bayesian network for modeling error [56]. Unlike prior work, our reimplementation uses M5 for modeling error. Figure 6 shows the accuracy of the M5 error model. Evaluating the accuracy of the error model is more involved than evaluating the cost model, since the error bound needs to be met probabilistically over an ensemble of inputs. We simulate this by tracking the proportion of inputs for which the Capri control system’s predictions meet the given error bound. The black and green lines in the figure have the same meaning as in Figure 5. From Figures 4 and 6, we see that the predictions of the M5 model are a reasonable match for the predictions of the Bayesian network. The error predictions for SLAMBench are wayward; we believe this is due to a lack of sufficient training data and a poor choice of error function (based on absolute trajectory error). We are investigating ways to fix this problem with SLAMBench; in particular, we are looking into using the RGB-D SLAM dataset from TUM (http://vision.in.tum.de/data/datasets/rgbd-dataset) and generating new trajectories.
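As a small illustration of this bookkeeping, the helper below computes the proportion of test inputs whose observed error stays within a given bound; it is a hypothetical sketch, not a function from the Capri source.

  def proportion_within_bound(observed_errors, error_bound):
      """Fraction of test inputs whose error, at the knob setting chosen by
      the controller, does not exceed the error bound."""
      met = sum(1 for e in observed_errors if e <= error_bound)
      return met / len(observed_errors)

  # Example: errors of five test inputs against a bound of 0.1.
  # proportion_within_bound([0.02, 0.08, 0.15, 0.05, 0.11], 0.1) == 0.6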

         GEM (error bound)               Radar (error bound)             SLAMBench (error bound)
Prob.    0.0  0.1  0.2  0.3  0.4  0.5    0.0  0.1  0.2  0.3  0.4  0.5    0.0  0.1  0.2  0.3  0.4  0.5
1.0      NA   NA   1.3  1.7  2.2  2.3    1    1    1.1  1.3  1.4  1.4    NA   NA   NA   1.6  1.6  1.5
0.9      NA   1.1  2.1  2.8  3.2  3.7    1    1    1.1  1.3  1.4  1.4    NA   NA   1.2  1.6  1.6  1.5
0.8      NA   1.6  2.4  3.2  3.6  4.0    1    1    1.1  1.3  1.4  1.4    NA   1.2  1.5  1.5  1.5  1.5
0.7      NA   1.8  2.8  3.5  3.9  4.4    1    1    1.1  1.3  1.4  1.4    NA   1.2  1.5  1.5  1.5  1.5
0.6      NA   2.3  3.2  3.6  4.0  4.8    1    1    1.1  1.3  1.4  1.4    NA   1.3  1.5  1.5  1.5  1.5
0.5      NA   2.6  3.4  3.9  4.1  4.9    1    1    1.1  1.3  1.4  1.4    NA   1.3  1.5  1.5  1.5  1.5
Table 4: Speedups of the tuned programs for a subset of the constraint space (rows: probability bound; columns: error bound).

7.2 Speedups

The original Capri work shows that the time spent in the controller is small compared to the time taken by the applications themselves. In our re-evaluation, we have not measured the proportion of time taken by the control algorithm relative to the applications; instead, we evaluate speedup as a sanity check on the performance of the new Capri implementation. Speedup is defined as the ratio of the running time with the knobs set for maximum quality to the running time at the knob setting chosen by the controller.
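As a quick worked example (the numbers are hypothetical, not measured results): if an application takes 10 s with the knobs set for maximum quality and 4 s at the setting chosen by the controller, then

\[
\text{speedup} \;=\; \frac{T_{\text{max quality}}}{T_{\text{tuned}}} \;=\; \frac{10\ \text{s}}{4\ \text{s}} \;=\; 2.5 .
\]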

Table 4 shows speedups for each application, for error-bound values between 0 and 0.5 and probability-bound values between 0.5 and 1.0 (we show only a portion of the overall constraint space for simplicity). Each entry gives the average speedup over all test inputs for the knob settings found by the control algorithm (based on exhaustive search), given constraints in the intervals specified by the row and column indices.

Speedups depend on the application and the constraints. For each application, the top-left corner of the constraint space is the “hard” region, since the error must be low with high probability: the knob settings must be at or close to maximum, and speedup is limited. Table entries marked “NA” indicate constraints for which the control system was unable to find any feasible solution. In contrast, the bottom-right corner of the constraint space is the “easier” region, so one would expect higher speedups; this is what we see for all the applications. Overall, controlling the knobs in these applications can yield significant speedups in running time.

7.3 Inversions

         GEM (error bound)               Radar (error bound)             SLAMBench (error bound)
Prob.    0.0  0.1  0.2  0.3  0.4  0.5    0.0  0.1  0.2  0.3  0.4  0.5    0.0  0.1  0.2  0.3  0.4  0.5
1.0      NA   NA   F    T    T    T      F    F    F    F    F    F      NA   NA   NA   F    F    T
0.9      NA   F    F    F    T    F      F    F    F    F    F    F      NA   NA   F    F    F    T
0.8      NA   F    F    T    T    F      F    F    F    F    F    F      NA   T    F    T    T    T
0.7      NA   F    T    F    T    F      F    F    F    F    F    F      NA   T    F    T    T    T
0.6      NA   F    T    F    F    T      F    F    F    F    F    F      NA   T    F    T    T    T
0.5      NA   F    T    T    F    F      F    F    F    F    F    F      NA   F    T    T    T    T
Table 5: Inversions of the tuned programs for a subset of the constraint space (rows: probability bound; columns: error bound; T = inversion occurred, F = no inversion).

The cost and error models in Capri are used only to rank knob settings in the feasible region, so more accurate models do not necessarily give better solutions to the control problem, even if the predictions of the machine learning models are close to accurate. We say an inversion has occurred for a given error-bound and probability-bound constraint when the knob setting predicted by Capri does not match the knob setting identified by an oracle. We evaluated the inversions that occurred with the M5 model in Capri by comparing, for each constraint, whether the knob setting predicted by Capri matched the knob setting chosen by the oracle. Table 5 shows, for each application and constraint, whether an inversion occurred: T denotes an inversion, and F denotes no inversion. The table shows that the machine learning models in Capri perform reasonably well.

8 Conclusion

Although there is a large body of work on using approximate computing to reduce computation time as well as power and energy requirements, little is known about how to control approximate programs in a principled way. Previous work on approximate computing has focused either on showing the feasibility of approximation or on controlling streaming programs in which error estimates for one input can be used to reactively control error for subsequent inputs.

In this paper, we addressed the problem of controlling tunable approximate programs, which have one or more knobs that can be changed to vary the fidelity of the output of the approximate computation. We showed how the proactive control problem for tunable programs can be formulated as an optimization problem, and then gave an algorithm for solving this control problem using error and cost models generated with machine learning techniques. Our experimental results show that this approach is effective at controlling tunable approximate programs.

We are extending the prior published Capri system so that the new control system scales to hundreds of knobs and provides optimal control for streaming programs. For controlling streaming programs, we propose a closed-loop control system based on model-predictive control. We presented initial results with our new implementation of Capri to regression-test the system.

References

  • [1] J. Ansel, C. Chan, Y. L. Wong, M. Olszewski, Q. Zhao, A. Edelman, and S. Amarasinghe. PetaBricks: A Language and Compiler for Algorithmic Choice. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’09, pages 38–49, New York, NY, USA, 2009. ACM.
  • [2] J. Ansel, Y. L. Wong, C. Chan, M. Olszewski, A. Edelman, and S. Amarasinghe. Language and Compiler Support for Auto-Tuning Variable-Accuracy Algorithms. In Proceedings of the 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’11, pages 85–96, Washington, DC, USA, 2011. IEEE Computer Society.
  • [3] K. J. Åström and R. M. Murray. Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press, Princeton, NJ, USA, 2008.
  • [4] W. Baek and T. M. Chilimbi. Green: A Framework for Supporting Energy-Conscious Programming using Controlled Approximation. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10, pages 198–209, New York, NY, USA, 2010. ACM.
  • [5] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton, NJ, USA, Jan. 2011. AAI3445564.
  • [6] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
  • [7] B. Bodin, L. Nardi, M. Z. Zia, H. Wagstaff, G. Sreekar Shenoy, M. Emani, J. Mawer, C. Kotselidis, A. Nisbet, M. Lujan, B. Franke, P. H. Kelly, and M. O’Boyle. Integrating Algorithmic Parameters into Benchmarking and Design Space Exploration in 3D Scene Understanding. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, PACT ’16, pages 57–69, New York, NY, USA, 2016. ACM.
  • [8] L. Bottou. Large-Scale Machine Learning with Stochastic Gradient Descent. In Y. Lechevallier and G. Saporta, editors, Proceedings of the 19th International Conference on Computational Statistics (COMPSTAT’2010), pages 177–186, Heidelberg, 2010. Physica-Verlag HD.
  • [9] E. F. Camacho and C. A. Bordons. Model Predictive Control in the Process Industry. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1997.
  • [10] S. Campanoni, G. Holloway, G.-Y. Wei, and D. Brooks. HELIX-UP: Relaxing Program Semantics to Unleash Parallelization. In Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’15, pages 235–245, Washington, DC, USA, 2015. IEEE Computer Society.
  • [11] M. Carbin, S. Misailovic, and M. C. Rinard. Verifying Quantitative Reliability for Programs That Execute on Unreliable Hardware. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’13, pages 33–52, New York, NY, USA, 2013. ACM.
  • [12] S. Chaudhuri, S. Gulwani, and R. Lublinerman. Continuity and Robustness of Programs. Communications of the ACM, 55(8):107–115, Aug. 2012.
  • [13] S. Chaudhuri and A. Solar-Lezama. Smooth Interpretation. In Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’10, pages 279–291, New York, NY, USA, 2010. ACM.
  • [14] Y. Ding, J. Ansel, K. Veeramachaneni, X. Shen, U.-M. O’Reilly, and S. Amarasinghe. Autotuning Algorithmic Choice for Input Sensitivity. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’15, pages 379–390, New York, NY, USA, 2015. ACM.
  • [15] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Architecture Support for Disciplined Approximate Programming. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, pages 301–312, New York, NY, USA, 2012. ACM.
  • [16] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger. Neural Acceleration for General-Purpose Approximate Programs. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, pages 449–460, Washington, DC, USA, 2012. IEEE Computer Society.
  • [17] S. Fang, Z. Du, Y. Fang, Y. Huang, Y. Chen, L. Eeckhout, O. Temam, H. Li, Y. Chen, and C. Wu. Performance Portability Across Heterogeneous SoCs Using a Generalized Library-Based Approach. ACM Transactions on Architecture and Code Optimization, 11(2):21:1–21:25, June 2014.
  • [18] A. Farrell and H. Hoffmann. MEANTIME: Achieving Both Minimal Energy and Timeliness with Approximate Computing. In 2016 USENIX Annual Technical Conference (USENIX ATC 16), pages 421–435, Denver, CO, June 2016. USENIX Association.
  • [19] A. Filieri, H. Hoffmann, and M. Maggio. Automated Design of Self-adaptive Software with Control-Theoretical Formal Guarantees. In Proceedings of the 36th International Conference on Software Engineering, ICSE 2014, pages 299–310, New York, NY, USA, 2014. ACM.
  • [20] D. Gadioli, G. Palermo, and C. Silvano. Application Autotuning to Support Runtime Adaptivity in Multicore Architectures. In SAMOS XV, 2015.
  • [21] I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen. ApproxHadoop: Bringing Approximations to MapReduce Frameworks. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, pages 383–397, New York, NY, USA, 2015. ACM.
  • [22] R. Hildebrandt and A. Zeller. Simplifying Failure-Inducing Input. In Proceedings of the 2000 ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA ’00, pages 135–145, New York, NY, USA, 2000. ACM.
  • [23] H. Hoffmann. JouleGuard: Energy Guarantees for Approximate Applications. In Proceedings of the 25th Symposium on Operating Systems Principles, SOSP ’15, pages 198–214, New York, NY, USA, 2015. ACM.
  • [24] H. Hoffmann, A. Agarwal, and S. Devadas. Selecting Spatiotemporal Patterns for Development of Parallel Applications. IEEE Transactions on Parallel and Distributed Systems, 23(10):1970–1982, Oct. 2012.
  • [25] H. Hoffmann, S. Sidiroglou, M. Carbin, S. Misailovic, A. Agarwal, and M. Rinard. Dynamic Knobs for Responsive Power-Aware Computing. In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVI, pages 199–212, New York, NY, USA, 2011. ACM.
  • [26] K. Huck, A. Porterfield, N. Chaimov, H. Kaiser, A. Malony, T. Sterling, and R. Fowler. An Autonomic Performance Environment for Exascale. Supercomputing frontiers and innovations, 2(3), 2015.
  • [27] C. Imes, D. H. K. Kim, M. Maggio, and H. Hoffmann. POET: A Portable Approach to Minimizing Energy Under Soft Real-time Constraints. In 21st IEEE Real-Time and Embedded Technology and Applications Symposium, pages 75–86, Apr. 2015.
  • [28] J. Leskovec. Stanford Large Network Dataset Collection (SNAP). http://snap.stanford.edu/data/.
  • [29] D. Mahajan, A. Yazdanbakhsh, J. Park, B. Thwaites, and H. Esmaeilzadeh. Prediction-Based Quality Control for Approximate Accelerators. In Second Workshop on Approximate Computing Across the System Stack, WACAS, 2015.
  • [30] D. Mahajan, A. Yazdanbaksh, J. Park, B. Thwaites, and H. Esmaeilzadeh. Towards Statistical Guarantees in Controlling Quality Tradeoffs for Approximate Acceleration. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 66–77, June 2016.
  • [31] S. Martin, W. M. Brown, R. Klavans, and K. W. Boyack. OpenOrd: An Open-Source Toolbox for Large Graph Layout. volume 7868, 2011.
  • [32] I. Mierswa and K. Morik. Automatic Feature Extraction for Classifying Audio Data. Machine Learning, 58(2-3):127–149, Feb. 2005.
  • [33] J. S. Miguel, M. Badr, and N. E. Jerger. Load Value Approximation. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, pages 127–139, Washington, DC, USA, 2014. IEEE Computer Society.
  • [34] S. Misailovic, M. Carbin, S. Achour, Z. Qi, and M. C. Rinard. Chisel: Reliability- and Accuracy-Aware Optimization of Approximate Computational Kernels. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages & Applications, OOPSLA ’14, pages 309–328, New York, NY, USA, 2014. ACM.
  • [35] T. M. Mitchell. Machine Learning. McGraw-Hill, Inc., New York, NY, USA, first edition, 1997.
  • [36] S. T. Monteiro. Automatic Hyperspectral Data Analysis: A machine learning approach to high dimensional feature extraction. VDM Verlag Dr. Müller, 2010.
  • [37] L. Nardi, B. Bodin, M. Z. Zia, J. Mawer, A. Nisbet, P. H. J. Kelly, A. J. Davison, M. Luján, M. F. P. O’Boyle, G. Riley, N. Topham, and S. Furber. Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM. In IEEE International Conference on Robotics and Automation (ICRA), May 2015. arXiv:1410.2167.
  • [38] R. E. Neapolitan. Learning Bayesian Networks. Prentice Hall, 2003.
  • [39] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. KinectFusion: Real-Time Dense Surface Mapping and Tracking. In IEEE ISMAR. IEEE, Oct. 2011.
  • [40] M. A. Otaduy and M. C. Lin. CLODs: Dual Hierarchies for Multiresolution Collision Detection. In Proceedings of the 2003 Eurographics/ACM SIGGRAPH Symposium on Geometry Processing, SGP ’03, pages 94–101, Aire-la-Ville, Switzerland, Switzerland, 2003. Eurographics Association.
  • [41] K. V. Palem. Energy Aware Computing Through Probabilistic Switching: A Study of Limits. IEEE Transactions on Computers, 54(9):1123–1137, Sept. 2005.
  • [42] J. R. Quinlan. Learning With Continuous Classes. pages 343–348. World Scientific, 1992.
  • [43] M. Rinard. Probabilistic Accuracy Bounds for Fault-Tolerant Computations That Discard Tasks. In Proceedings of the 20th Annual International Conference on Supercomputing, ICS ’06, pages 324–334, New York, NY, USA, 2006. ACM.
  • [44] M. C. Rinard. Using Early Phase Termination To Eliminate Load Imbalances At Barrier Synchronization Points. In Proceedings of the 22Nd Annual ACM SIGPLAN Conference on Object-oriented Programming Systems and Applications, OOPSLA ’07, pages 369–386, New York, NY, USA, 2007. ACM.
  • [45] M. Ringenburg, A. Sampson, I. Ackerman, L. Ceze, and D. Grossman. Monitoring and Debugging the Quality of Results in Approximate Programs. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’15, pages 399–411, New York, NY, USA, 2015. ACM.
  • [46] C. Rubio-González, C. Nguyen, H. D. Nguyen, J. Demmel, W. Kahan, K. Sen, D. H. Bailey, C. Iancu, and D. Hough. Precimonious: Tuning Assistant for Floating-Point Precision. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pages 27:1–27:12, New York, NY, USA, 2013. ACM.
  • [47] M. Samadi, D. A. Jamshidi, J. Lee, and S. Mahlke. Paraprox: Pattern-Based Approximation for Data Parallel Applications. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, pages 35–50, New York, NY, USA, 2014. ACM.
  • [48] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke. SAGE: Self-Tuning Approximation for Graphics Engines. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, pages 13–24, New York, NY, USA, 2013. ACM.
  • [49] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and D. Grossman. EnerJ: Approximate Data Types for Safe and General Low-Power Computation. In Proceedings of the 32Nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’11, pages 164–174, New York, NY, USA, 2011. ACM.
  • [50] A. Sampson, J. Nelson, K. Strauss, and L. Ceze. Approximate Storage in Solid-State Memories. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-46, pages 25–36, New York, NY, USA, 2013. ACM.
  • [51] E. Schkufza, R. Sharma, and A. Aiken. Stochastic Optimization of Floating-Point Programs with Tunable Precision. In Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’14, pages 53–64, New York, NY, USA, 2014. ACM.
  • [52] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically Characterizing Large Scale Program Behavior. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS X, pages 45–57, New York, NY, USA, 2002. ACM.
  • [53] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder. Discovering and Exploiting Program Phases. IEEE Micro, 23(6):84–93, Nov. 2003.
  • [54] M. Shoushtari, A. BanaiyanMofrad, and N. Dutt. Exploiting Partially-Forgetful Memories for Approximate Computing. Embedded Systems Letters, IEEE, Mar. 2015.
  • [55] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard. Managing Performance vs. Accuracy Trade-offs With Loop Perforation. In Proceedings of the 19th ACM SIGSOFT Symposium and the 13th European Conference on Foundations of Software Engineering, ESEC/FSE ’11, pages 124–134, New York, NY, USA, 2011. ACM.
  • [56] X. Sui, A. Lenharth, D. S. Fussell, and K. Pingali. Proactive Control of Approximate Programs. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’16, pages 607–621, New York, NY, USA, 2016. ACM.
  • [57] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, first edition, 1998.
  • [58] K. Swaminathan, C.-C. Lin, A. Vega, A. Buyuktosunoglu, P. Bose, and S. Pankanti. A Case for Approximate Computing in Real-Time Mobile Cognition. In Second Workshop on Approximate Computing Across the System Stack, WACAS, 2015.
  • [59] L. G. Valiant. A Theory of the Learnable. Communications of the ACM, 27(11):1134–1142, Nov. 1984.
  • [60] J. J. Whang, X. Sui, and I. S. Dhillon. Scalable and Memory-Efficient Clustering of Large-Scale Social Networks. In Proceedings of the 2012 IEEE 12th International Conference on Data Mining, ICDM ’12, pages 705–714, Washington, DC, USA, 2012. IEEE Computer Society.
  • [61] D. I. Wilson and B. R. Young. The Seduction of Model Predictive Control. Electrical & Automation Technology, pages 27–28, Dec. 2006. ISSN: 1177-2123.
  • [62] R. Zafarani and H. Liu. Social Computing Data Repository at ASU, 2009. http://socialcomputing.asu.edu.
  • [63] Z. A. Zhu, S. Misailovic, J. A. Kelner, and M. Rinard. Randomized Accuracy-Aware Program Transformations For Efficient Approximate Computations. In Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL ’12, pages 441–454, New York, NY, USA, 2012. ACM.