On-line Application Autotuning Exploiting Ensemble Models

01/18/2019 ∙ by Tomas Martinovic, et al. ∙ 0

Application autotuning is a promising path investigated in literature to improve computation efficiency. In this context, the end-users define high-level requirements and an autonomic manager is able to identify and seize optimization opportunities by leveraging trade-offs between extra-functional properties of interest, such as execution time, power consumption or quality of results. The relationship between an application configuration and the extra-functional properties might depend on the underlying architecture, on the system workload and on features of the current input. For these reasons, autotuning frameworks rely on application knowledge to drive the adaptation strategies. The autotuning task is typically done offline because having it in production requires significant effort to reduce its overhead. In this paper, we enhance a dynamic autotuning framework with a module for learning the application knowledge during the production phase, in a distributed fashion. We leverage two strategies to limit the overhead introduced at the production phase. On one hand, we use a scalable infrastructure capable of leveraging the parallelism of the underlying platform. On the other hand, we use ensemble models to speed up the predictive capabilities, while iteratively gathering production data. Experimental results on synthetic applications and on a use case show how the proposed approach is able to learn the application knowledge, by exploring a small fraction of the design space.


1 Introduction

With the end of Dennard scaling [14], the optimization focus shifted towards efficiency in a wide range of scenarios, from embedded to High-Performance Computing (HPC). To this end, autotuning [3] has been identified as a promising research direction. Within this line of work, it is possible to use frameworks to optimize a specific task [51, 37] and frameworks to optimize knobs at either the system level (e.g. the core frequencies) or the application level (e.g. software parameters) [1, 39]. Moreover, if the target application can tolerate an error on the output, approximate computing [32] represents an appealing path to further increase computation efficiency. Finding application parameters that enable a quality-throughput tradeoff might be a complex task. Therefore, several techniques have been proposed in literature to expose them, such as loop perforation [22] or task skipping [40].

As a consequence of this trend, application requirements are increasing in complexity to address several extra-functional properties (EFPs), such as execution time, power consumption, and quality of the results. On the other hand, application developers have started to expose in the source code a huge set of tunable parameters that alter the extra-functional behaviour of the application, thus enabling autotuning. In this paper, we refer to these parameters as software-knobs.

The relation between the software-knobs and the EFPs of interest for end-users is complex and unknown, therefore selecting the best configuration is a difficult task. The EFP values might depend on the underlying architecture, on the system workload, and on features of the current input. A first subset of software-knobs relates to parameters that aim at tailoring the application for the underlying architecture, such as work-group size, MPI runtime parameters or compiler options. Typically, autotuning frameworks that address these parameters perform a Design Space Exploration (DSE) at design-time to find the most suitable configuration to be used in the production phase. The main challenge in these approaches is the exponential growth of the Design Space when considering several, usually unbounded, software-knobs. The second subset of software-knobs relates to application-specific parameters, such as the number of Monte Carlo trials, loop perforation factors or algorithm parameters. Typically, it is easier to change these software-knobs during the production phase and their effects on EFPs are strongly coupled with the features of the current input. For these reasons, autotuning frameworks that address these parameters typically model the relationship among software-knobs, EFPs, and input features as application-knowledge. The autotuners leverage this knowledge to identify and seize optimization opportunities, improving the computation efficiency during the production phase. The main challenge of these approaches consists of providing mechanisms to enhance the target application with an adaptation layer that gives self-optimization capabilities [25], considering application requirements and the system evolution.

Given the complexity of the tuning task, it is typically performed offline, prior to application execution, leaving only the configuration selection to run-time. The main problem is that every time the code is ported to a new architecture or updated, or new input data are processed, the offline tuning process must be redone. However, moving this phase online at production time requires a significant effort to minimize the tuning time and overhead as much as possible.

In this paper, we propose a framework to learn the application-knowledge online, at the beginning of the production phase. It has been designed to work in a distributed context where different entities can collaborate on the knowledge collection. The framework mainly targets the context of HPC, where an application is composed of more than one process and usually executes for a long period. However, it might also be applied in a wider range of scenarios. The benefits of learning the application-knowledge at runtime are the following: (1) we are able to leverage the parallelism of the platform to reduce the time-to-knowledge; (2) we are able to learn the behavior of the application using the actual input set; and (3) we are able to learn it in the actual execution environment. Given that we are stealing time from the execution of the application, the main challenge addressed in this paper is to reduce as much as possible the time required to learn the application knowledge. To reach this goal, in addition to fully exploiting the parallel production machine, the proposed methodology employs an iterative exploration strategy that reduces the required number of samples as much as possible. In particular, we start by exploring a fraction of the design space. If the derived models are not able to reach a target quality in the validation phase, the framework resumes the Design Space Exploration (DSE). Otherwise, the framework broadcasts the obtained application knowledge.

From the implementation point of view, we used the mARGOt [19] dynamic autotuning framework as a starting point. mARGOt is an adaptation layer that provides the target application with mechanisms to adapt in a reactive and proactive fashion, based on the application-knowledge. In this paper, we enhance the mARGOt learning module with a model-driven approach. The goal of this paper is not to compare different modelling techniques, but to leverage them for reducing the time required to compute the knowledge of the application. We used synthetic applications with a known relation between EFPs and software-knobs to experimentally evaluate the out-of-sample predictions of the learning module. We then focused on a molecular docking application to assess the benefits of the proposed framework in a real-world case study.

The main contributions of the paper can be summarized as follows:

  • We propose an autotuning framework to learn the relation between EFPs, software-knob configurations, and input features at the production phase;

  • We perform the learning online, to exploit the parallelism of the production machine and to consider the production input data;

  • We leverage ensemble models to reduce the cost of the learning phase, while we select automatically the most suitable model according to the target problem;

  • We enhanced a state-of-the-art autotuning framework with the proposed learning module. In particular, we extended it to leverage iterative exploration needed by the proposed approach;

  • We evaluated the framework with synthetic functions and a real-life case study from the HPC domain to demonstrate the introduced benefits.

The remainder of the paper is structured as follows. Section 2 compares the proposed framework with the related works, highlighting the main contributions. Section 3 describes the framework implementation, focusing on the relation among the involved components and showing the work-flow of the proposed methodology. Section 4 formalizes the problem and describes in detail how the learning module computes the application knowledge. Section 5 experimentally evaluates the proposed framework. Finally, Section 6 concludes the paper.

2 State-of-the-art

In the context of autonomic computing [25], an application is perceived as an autonomic element capable of self-management. Among the self-* properties required by the self-management property, autotuning frameworks aim to provide self-optimization [29]. In this context, the end-user specifies high-level requirements and the application should adapt accordingly, without the human-in-the-loop. This is a promising path investigated in literature [3], where several autotuning frameworks have been evaluated according to their vision on how to provide self-optimization properties. In the remainder of the section, we classify the most recent and relevant autotuning frameworks into six categories, according to the methodology used to learn the application-knowledge.

In the context of HPC, there are several autotuning frameworks tailored to optimize specific domains. ATLAS [51] was designed for matrix multiplication routines, FFTW [18] for FFT operations, OSKI [49] for sparse matrix kernels, SPIRAL [37] for digital signal processing, CLTune [34] and GLINDA [46] for OpenCL applications, Patus [10] and Sepya [24] for stencil computations. These works are interesting from the point of view of domain-specific autotuning. However, they were designed to take decisions that are orthogonal to the proposed approach, which is oriented towards supporting the online-learning phase.

The second category represents frameworks that aim to apply code or binary transformations to introduce the possibility of exploiting accuracy-throughput tradeoffs. QuickStep [31], Paraprox [43], and PowerGAUGE [12] are some examples in this category. The main focus of these works is on how to expose tradeoffs by introducing software-knobs. The parameter tuning phase is done at design-time by relying on the representative input set.

The third category includes frameworks that aim to explore a very large design space, to find the best configuration of the software-knobs according to the application requirements before the production phase. ATune-IL [45], OpenTuner [1], and the ATF framework [39] are some examples in this category. ATune-IL provides a mechanism to prune and reduce the configuration space according to the code structure and to the dependencies among software-knobs. OpenTuner uses a multi-armed bandit framework to choose the best search algorithm for the given application. ATF framework improves the OpenTuner strategies by considering also domain-constraints of the parameters. Given that the tuning phase is done at design-time, these frameworks usually target software-knobs loosely-coupled with the inputs. Moreover, the output of the tuning process is a single software-knob configuration, not the application-knowledge required to adapt at the production phase.

In the fourth category, we represent frameworks that target streaming applications. They typically learn the application-knowledge at design-time, to be leveraged during the production phase. The Green framework [2], the Sage [44] framework, and PowerDial [23] are some examples in this category. The main focus of these works is on how to provide reaction mechanisms to a streaming application. Therefore, they use representative inputs to derive the application knowledge and then react during the production phase. This approach assumes only a few abrupt changes in the input that must be elaborated. However, this assumption might not hold in all the types of workloads, e.g. see Section 5.2 or [11, 48, 19].

In the fifth category, we consider the autotuning frameworks that adapt an application also in a proactive way by using input features and learning the application-knowledge at design-time. Petabricks [11], the framework proposed in [20], and Capri [48] are some examples in this category. The methodology used to derive the application-knowledge assumes the possibility to select which input to consider in the representative set used during the learning process. The methodology proposed in this paper has been designed to use the production input directly, therefore it is not possible to apply the same approaches. Moreover, these frameworks are able to express a tradeoff between a quality metric and a single additional EFP only. On the other hand, the proposed approach is capable of addressing an arbitrary number of EFPs.

In the sixth and last category, we represent the autotuning frameworks (such as [28, 30]) that adapt an application also in a proactive way, without learning the application-knowledge at design-time. The framework proposed in this paper falls in this category. The IRA framework [28] defines the concept of canary input as the smallest sub-sampling of the actual input, which has the same property as the original input. It proposes the usage of a canary input for a runtime parameter exploration of the target application for each data to be processed. Then, it uses the fastest configuration of the software-knobs resulting within a given bound on the minimum accuracy. The main drawback of this methodology is that the presented sub-sampling technique applies to matrix-like input, therefore limiting the applicability of the framework. A rather different approach with respect to the previous ones is Anytime Automaton [30]. It suggests source code transformations to re-write the application by using a pipeline design pattern. The idea is that the longer the given input executes in the pipeline, the more accurate the output becomes. The work targets hard constraints on the execution time, therefore the idea is to interrupt the algorithm when it depletes the time budget. In this way, it is possible to have guarantees on the feasible maximum accuracy. However, this approach works only for exploiting accuracy-performance tradeoffs.

In this paper, we propose a framework that can be considered orthogonal to most of the autotuning frameworks presented so far. In particular, it has been designed to support dynamic tuning at runtime with the capability to learn online the application-knowledge. To reduce the learning time, it uses two different approaches. First, it is based on a scalable infrastructure capable of exploiting the parallelism of the underlying production platform. Second, it uses ensemble models for accelerating the learning phase, while selecting always the best model for the target problem.

3 Proposed Framework Architecture

Fig. 1: Overview of the proposed autotuning framework. Model analysis and out-of-sample predictions are computed by the learning module, executing in a dedicated server.

This Section introduces the architectural view of the proposed technique in the context of the whole adaptive framework. Later, Section 4 will focus on the learning approach from the methodology point of view.

The proposed approach uses as a starting point the mARGOt autotuning framework [19], which aims to enhance a target application with an adaptation layer that provides mechanisms to adapt in a proactive and reactive fashion. From the implementation point of view, mARGOt is a C++ library to be linked to a target application. Therefore, each instance of the application can take autonomous decisions. mARGOt takes as input an application-knowledge defined as a discrete set of Operating Points (OPs). Each OP relates a software-knobs configuration with the EFP values expected when using that configuration, according to a set of input features. Therefore, an OP is composed of three sets of values, namely the software-knobs configuration, the expected EFPs, and the related input feature values. By using this representation, the mARGOt autotuning framework is capable of selecting the most suitable OP, according to application requirements defined as a constrained multi-objective optimization problem. Moreover, it uses feedback information from monitors to adapt in a reactive way, while it uses features of the current input to adapt in a proactive way.
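As an illustration of this representation, the following minimal Python sketch models an OP as three groups of values and selects one under a constraint. The field names, the `select_op` helper, and the sample numbers are hypothetical and are not part of the mARGOt API, which is a C++ library; the sketch only shows the idea of a constrained selection over a discrete OP list, reduced here to a single objective.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class OperatingPoint:
    knobs: Dict[str, float]      # software-knobs configuration
    efps: Dict[str, float]       # expected EFP values for this configuration
    features: Dict[str, float]   # input feature values the expectation refers to


def select_op(ops: List[OperatingPoint],
              constraints: List[Callable[[OperatingPoint], bool]],
              objective: Callable[[OperatingPoint], float]) -> Optional[OperatingPoint]:
    """Return the feasible OP with the smallest objective value, or None."""
    feasible = [op for op in ops if all(check(op) for check in constraints)]
    return min(feasible, key=objective) if feasible else None


# Hypothetical knowledge: more trials give a lower error but a longer run.
knowledge = [
    OperatingPoint({"trials": 100}, {"time_s": 1.0, "error": 0.20}, {"size": 10}),
    OperatingPoint({"trials": 500}, {"time_s": 4.0, "error": 0.08}, {"size": 10}),
    OperatingPoint({"trials": 900}, {"time_s": 7.5, "error": 0.05}, {"size": 10}),
]

# Requirement: minimize the error, subject to an execution time below 5 seconds.
best = select_op(knowledge,
                 constraints=[lambda op: op.efps["time_s"] < 5.0],
                 objective=lambda op: op.efps["error"])
```

With these sample values, the configuration with 900 trials is discarded by the time constraint, so the selection falls on the most accurate of the remaining OPs.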

The main goal of the proposed framework is to obtain the OPs list during the production phase, without requiring a design-time profiling phase. The proposed methodology guides the learning process by leveraging the underlying mARGOt infrastructure. To achieve these goals, we exploit the possibility to dynamically update the OPs list of each application client. In particular, we assign to each application client a different software-knob configuration to distribute the design space exploration. Once the learning module generates the application knowledge, we broadcast the OPs list to all the application clients. In this way, 1) it is possible to leverage all the available nodes to reduce the time-to-model, 2) the application knowledge is tailored for the current input, and 3) the EFPs are measured in the production environment.

Figure 1 provides an overview of the proposed autotuning framework. It is composed of three components: a remote application handler (server), an application local handler (client), and the learning module. The remote application handler is the central coordinator, implemented as a thread pool that executes in a dedicated server. To store information about the managed applications and the status of the DSE, the server might use the Apache Cassandra database for scalability reasons, or CSV files for small instances. The learning module is the core of the proposed approach and it performs three main tasks: 1) it leverages Design of Experiments techniques to sample efficiently the design space to be explored; 2) it leverages state-of-the-art modelling techniques to interpolate out-of-sample predictions; 3) it uses a validation stage to test whether the quality of the obtained model is acceptable or whether additional effort is required to explore the design space. An extensive description of the learning module is presented in Section 4. The application local handler is a service thread in each application instance, which runs asynchronously with respect to the application execution flow. Its main goal is to manipulate the application-knowledge of the related application instance. In particular, during the design space exploration phase, the application local handler forces the autotuner to select the software-knobs configuration that must be evaluated. When the model is available, it sets the application-knowledge accordingly. Moreover, it is in charge of the synchronization with the server counterpart. The communications between server and clients leverage the MQTT or MQTTs protocols.

Fig. 2: Sequence diagram of the interaction between the remote application handler, the application local handler, and the learning module. In this diagram, we consider an unknown application composed of a single process.

Figure 2 shows the typical workflow of the framework when it interacts with an unknown application. In particular, it is composed of the following steps:

  1. The clients register themselves with the server.

  2. The server asks one client for information about the application, such as the number of software-knobs and their domain.

  3. Once the server has collected the information, it will call the learning module to generate a set of configurations to explore (DoE, design of experiments).

  4. The server dispatches to the available clients the configurations to evaluate in a round-robin way.

  5. Once the clients have explored all the configurations, the server requests a model from the learning module.

  6. The learning module trains, validates, selects and returns the best model. If the model is not valid, it returns an empty model.

  7. If there is a valid model, the server broadcasts the model to the available clients, otherwise it restarts from step 3.
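The steps above can be summarized by the following Python sketch of the server-side loop. It is a simplified, single-threaded illustration under stated assumptions: `learn_knowledge` and its callback parameters are hypothetical names, the communication layer is replaced by direct function calls, and an iteration cap is added so that the loop always terminates.

```python
from itertools import cycle
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Config = Tuple[int, ...]


def learn_knowledge(clients: Sequence[str],
                    generate_doe: Callable[[int], List[Config]],
                    evaluate: Callable[[str, Config], float],
                    fit_model: Callable[[Dict[Config, float]], Optional[object]],
                    max_iterations: int = 10):
    """Iterate DoE -> exploration -> modelling until a valid model is found."""
    observations: Dict[Config, float] = {}
    for iteration in range(max_iterations):
        configs = generate_doe(iteration)                     # step 3
        for client, config in zip(cycle(clients), configs):   # step 4, round-robin
            observations[config] = evaluate(client, config)
        model = fit_model(observations)                       # steps 5 and 6
        if model is not None:                                 # step 7: valid model
            return model, observations    # here the server would broadcast it
    return None, observations


# Toy usage: the model is considered valid once four configurations are known.
model, seen = learn_knowledge(
    clients=["client-0", "client-1"],
    generate_doe=lambda it: [(2 * it,), (2 * it + 1,)],
    evaluate=lambda client, config: float(config[0] ** 2),
    fit_model=lambda obs: dict(obs) if len(obs) >= 4 else None,
)
```

In the toy usage, the first iteration does not gather enough observations, so the loop requests a second batch of configurations before returning a model.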

The framework implementation is resilient to crashes on both the server side and the client side. Moreover, whenever a new client becomes available, it can join the design space exploration or receive the model directly.

4 Proposed Methodology

Fig. 3: Overview of the proposed methodology: Orange elements represent components of the learning module, white elements represent components of the mARGOt autotuning framework.

The goal of the proposed approach is to learn, during the production phase, the relation between software-knobs configuration, EFPs, and input features. The main challenges in achieving this goal are twofold. On one hand, we need to reduce as much as possible the time for learning the application-knowledge. On the other hand, we cannot control the features of the input: while we are able to force an application instance to use a given software-knobs configuration, the input set is the one of the production run.

Figure 3 shows an overview of the proposed methodology. Orange elements represent components of the learning module that drives the learning process, white elements represent components of the mARGOt framework. Given the description of an unknown application, we use techniques of Design of Experiments (DoE) to sample efficiently the design space. After the exploration of the selected software-knobs configuration, we do two operations. First, we build a model for each EFP of interest. Second, we cluster the observed input features. If the best model that we found is deemed valid, we use it to generate the list of OPs to broadcast to the application clients. Otherwise, we generate additional software-knob configurations to be evaluated and we restart the procedure. The remainder of this section formalizes the main components of the proposed approach: the DoE techniques, the modelling techniques, the model training and validation, the model selection and finally the input feature clustering.

4.1 Design of Experiments

The proposed approach aims at obtaining the application knowledge at the production phase, therefore we want to reduce the design space exploration as much as possible. To reach this goal, it is important to sample the design space so as to maximize the retrieved information. This is a well-known problem in literature, where several design of experiments (DoE) techniques have been investigated [33]. The proposed framework leverages the Dmax algorithm [13], which maximizes the determinant of the correlation matrix whose entries are defined as in Equation 1:

$$r_{ij} = \begin{cases} 1 - \gamma(h_{ij}) & \text{if } h_{ij} \le a \\ 0 & \text{if } h_{ij} > a \end{cases} \qquad (1)$$

where $h_{ij}$ is the distance between points $i$ and $j$, $a$ is the threshold distance of the correlation between two points, and $\gamma$ is a variogram. This measure maximizes the information entropy of the design and thus strives to maximize the information gained by exploring the given software-knobs. Therefore, this method works well for creating a DoE on a design space with no a priori knowledge.
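To make the criterion concrete, the following Python sketch computes the determinant of a correlation matrix for two small candidate designs. It assumes a spherical variogram, which is a common choice but is not stated in the text above, and all function names and sample points are hypothetical. The determinant is computed by plain Gaussian elimination to keep the example free of external dependencies.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def spherical_variogram(h: float, a: float) -> float:
    """Spherical variogram with range a (an assumed, common choice)."""
    if h >= a:
        return 1.0
    ratio = h / a
    return 1.5 * ratio - 0.5 * ratio ** 3


def correlation_matrix(design: List[Point], a: float) -> List[List[float]]:
    """Entry (i, j) is 1 - gamma(h_ij) within the threshold distance, else 0."""
    return [[1.0 - spherical_variogram(math.dist(p, q), a) for q in design]
            for p in design]


def determinant(matrix: List[List[float]]) -> float:
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, det = len(m), 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det *= m[col][col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n):
                m[row][k] -= factor * m[col][k]
    return det


clustered = [(0.1, 0.1), (0.15, 0.1), (0.1, 0.15), (0.15, 0.15)]
spread = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9)]
det_clustered = determinant(correlation_matrix(clustered, a=0.5))
det_spread = determinant(correlation_matrix(spread, a=0.5))
```

The spread design yields uncorrelated points and therefore the larger determinant, which is the behaviour that a determinant-maximizing design exploits: points that are close to each other carry redundant information.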

This DoE technique exposes two free parameters: the total number of software-knob configurations to be explored and the threshold distance. We let the end-user change these parameters from their default values. Moreover, the end-user might specify how many times each selected software-knob configuration is explored. The ability to execute the same software-knob configuration several times is important to increase the robustness of the estimation for non-deterministic applications. Given that the input features are not controllable, multiple runs of the same software-knobs configuration might also lead to learning the knowledge from different feature sets.

The Dmax algorithm is designed to sample a continuous design space. The application description defines a discrete domain for each software-knob, therefore each selected sample is mapped to the closest software-knobs configuration that has not been selected yet. Moreover, to accommodate more complicated design spaces, it is possible to set restrictions on the software-knobs domain. There are several ways to implement restrictions on a design space. On one hand, it is possible to compute the feasible design space based on the provided restrictions. However, in the case of nonlinear relationships between software-knobs, this would lead to hard-to-solve nonlinear inequalities. On the other hand, it is possible first to create the full factorial design space, and then to remove the software-knobs configurations that do not satisfy the given restrictions. The latter approach is feasible since the user-defined design space is discrete and finite. In the proposed approach, we select the software-knobs configurations to explore in four steps:

  1. Create a full-factorial design.

  2. Remove the OPs not valid for the given restrictions.

  3. Use Dmax to sample the continuous design space to explore.

  4. Map the selected samples to the software-knobs domain.

The output of this stage is a list of software-knob configurations to explore in the Design Space Exploration. The latter task is performed by the autotuning framework.
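The four steps can be sketched in Python as follows. This is only an illustration under stated assumptions: uniform random sampling stands in for the Dmax algorithm in step 3, distances are computed on raw knob values without any normalization, and the function name, knob domains, and restriction are hypothetical.

```python
import itertools
import math
import random
from typing import Callable, List, Sequence, Tuple

Config = Tuple[float, ...]


def select_configurations(domains: Sequence[Sequence[float]],
                          is_valid: Callable[[Config], bool],
                          n_samples: int,
                          seed: int = 0) -> List[Config]:
    # Step 1: full-factorial design over the discrete knob domains.
    full = list(itertools.product(*domains))
    # Step 2: drop the configurations that violate the restrictions.
    candidates = [c for c in full if is_valid(c)]
    # Step 3: sample the continuous space (uniform stand-in for Dmax).
    rng = random.Random(seed)
    bounds = [(min(d), max(d)) for d in domains]
    samples = [tuple(rng.uniform(lo, hi) for lo, hi in bounds)
               for _ in range(min(n_samples, len(candidates)))]
    # Step 4: map each sample to the closest configuration not yet selected.
    selected: List[Config] = []
    for sample in samples:
        remaining = [c for c in candidates if c not in selected]
        selected.append(min(remaining, key=lambda c: math.dist(c, sample)))
    return selected


# Hypothetical knobs: thread count and block size, with threads * block <= 256.
chosen = select_configurations(
    domains=[(1, 2, 4, 8), (16, 32, 64, 128)],
    is_valid=lambda c: c[0] * c[1] <= 256,
    n_samples=5,
)
```

Filtering before the mapping step guarantees that every returned configuration satisfies the restrictions, and excluding already selected configurations guarantees that no configuration is returned twice.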

4.2 Modelling Techniques

This section describes the model families used to learn the relation between EFPs, software-knobs, and input features. The learning module models each EFP independently. Therefore, in our notation $y$ represents the expected value of the target EFP, while $\mathbf{x}$ represents the vector of predictors, i.e. software-knobs and input features. In particular, we model an EFP as $y = f(\mathbf{x})$, where the function $f$ is represented by a given modelling technique. The remainder of the section describes the types of models used in the proposed approach.

4.2.1 Linear Models

Linear regression [33] with dependent variables $y$ and explanatory variables $X$ is defined in Equation 2:

$$y = \beta_0 + X\beta + \varepsilon \qquad (2)$$

where $\beta_0$ is a constant, $\beta$ is a vector of parameters, $X$ is a matrix of explanatory terms, and $\varepsilon$ is a vector of residuals or errors.

We use two types of linear models: first we consider only the model with a constant and the explanatory variables (first order); second we consider also two-way interactions of explanatory variables (first order with interactions).
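The difference between the two linear models lies only in the explanatory terms that they use. The following Python sketch builds both sets of terms for one observation and evaluates a linear predictor; the function names and the coefficient values are hypothetical, and the fitting of the coefficients is left out.

```python
from itertools import combinations
from typing import List, Sequence


def first_order_terms(x: Sequence[float]) -> List[float]:
    """Constant term followed by the explanatory variables."""
    return [1.0, *x]


def first_order_with_interactions(x: Sequence[float]) -> List[float]:
    """First-order terms plus the products of every pair of variables."""
    pairs = [a * b for a, b in combinations(x, 2)]
    return first_order_terms(x) + pairs


def predict(coefficients: Sequence[float], terms: Sequence[float]) -> float:
    """Linear predictor: the dot product of coefficients and terms."""
    return sum(c * t for c, t in zip(coefficients, terms))


row = first_order_with_interactions([2.0, 3.0, 5.0])
```

For three explanatory variables, the first-order model has four terms, while the model with interactions has seven, since it adds one product for each of the three variable pairs.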

4.2.2 MARS Models

The second family of models used in the learning module is multivariate adaptive regression splines (MARS) [17]. This model iteratively adds basis functions to create the best possible representation of the variables interactions (non-parametric model). The MARS representation is defined in Equation 3:

$$f(\mathbf{x}) = c_0 + \sum_{i=1}^{M} c_i B_i(\mathbf{x}) \qquad (3)$$

where $c_0$ is a constant, $M$ is the number of basis functions, $c_i$ is the constant coefficient of the basis function $B_i$, and $B_i(\mathbf{x})$ is the basis function $i$. The basis function is of the form $\max(0, x - k)$ or $\max(0, k - x)$, or the multiplication of multiple basis functions. The parameter $k$ is a constant estimated by the model. In the learning module, we also use a variation of this model, named POLYMARS, which enables the maximum of two-way interactions in the model [47].
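The following Python sketch shows how a fitted model of this kind could be evaluated. It assumes the usual hinge form of the MARS basis functions, max(0, x - k) and max(0, k - x), and it covers only the prediction side: the forward and backward passes that choose the basis functions are not shown. The tuple encoding of a hinge, the function names, and the sample coefficients are hypothetical.

```python
from typing import List, Sequence, Tuple

# A hinge is (variable index, knot, sign): sign=+1 gives max(0, x - knot),
# sign=-1 gives max(0, knot - x).
Hinge = Tuple[int, float, int]


def hinge(x: Sequence[float], term: Hinge) -> float:
    index, knot, sign = term
    return max(0.0, sign * (x[index] - knot))


def mars_predict(x: Sequence[float],
                 intercept: float,
                 basis: List[Tuple[float, List[Hinge]]]) -> float:
    """Sum of coefficient * product of hinges, plus the constant term."""
    total = intercept
    for coefficient, hinges in basis:
        product = 1.0
        for term in hinges:
            product *= hinge(x, term)
        total += coefficient * product
    return total


# Hypothetical fitted model with one single hinge and one two-way interaction.
model = [(2.0, [(0, 1.0, +1)]),
         (0.5, [(0, 1.0, +1), (1, 4.0, -1)])]
value = mars_predict([3.0, 2.0], intercept=1.0, basis=model)
```

A basis function built from two hinges is a two-way interaction, which is the highest interaction order that a model restricted in this way would contain.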

4.2.3 Kriging Model

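The third family of models used by the learning module is Kriging [42]. We use an extension of the original model, named Universal Kriging (UK) [50], which assumes that observed values come from a deterministic process given by Equation 4:

$$Y(\mathbf{x}) = \mu(\mathbf{x}) + Z(\mathbf{x}), \qquad \mu(\mathbf{x}) = \sum_{j=1}^{p} \beta_j g_j(\mathbf{x}) \qquad (4)$$

where $\mu(\mathbf{x})$ is a trend defined by the number $p$ of basis functions $g_j$ and $Z(\mathbf{x})$ is a zero-mean process with a known covariance kernel.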

In the context of the proposed framework, the generating process is seldom deterministic, therefore we need to relax this assumption. In particular, we forced the determinism by averaging the observed values for each observed software-knobs configuration.
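The averaging step amounts to grouping the observations by software-knobs configuration and replacing each group with its mean. A minimal Python sketch follows; the function name and the sample runs are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Config = Tuple[float, ...]


def average_by_configuration(
        observations: Iterable[Tuple[Config, float]]) -> Dict[Config, float]:
    """Collapse repeated runs of a configuration into their mean EFP value."""
    grouped: Dict[Config, List[float]] = defaultdict(list)
    for config, value in observations:
        grouped[config].append(value)
    return {config: mean(values) for config, values in grouped.items()}


runs = [((1, 16), 2.0), ((1, 16), 2.5), ((2, 16), 1.0), ((1, 16), 1.5)]
averaged = average_by_configuration(runs)
```

After this step each configuration appears exactly once in the training data, so the model sees a single value per configuration.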

4.2.4 Ensemble Models

Model ensembling is a well-known approach to increase the predictive capabilities of base models by combining them together, using different techniques. The learning module leverages two techniques based on cross-validation models: bagging [8] and stacking [9].

The bagging approach aims at decreasing the variance of the prediction. It focuses on a single base modelling technique and it combines instances of the model trained with different data sub-samples. The basic idea of bagging is increasing the robustness of the predictive capabilities of the model by combining models trained on the datasets with only small differences. For example, if we have 10 observations and train 10 different models always leaving out one observation, each model pair will share 8 out of 9 observations used for training the model. Combination of such models should lead to a more robust ensemble model. To obtain the prediction, it uses the mean of the predictions generated by the model instances. Figure 4 shows the process of bagging model computation.

Fig. 4: The procedure to compute ensemble models by using the bagging approach, starting from observations.
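The bagging procedure with the leave-one-out example described above can be sketched in Python as follows. A one-dimensional least-squares line is used as a stand-in base model, which is an assumption made to keep the example short; the function names and the synthetic data are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

Sample = Tuple[float, float]  # (predictor, observed EFP)


def fit_line(data: Sequence[Sample]) -> Callable[[float], float]:
    """Ordinary least-squares line, used here as a stand-in base model."""
    x_mean = mean(x for x, _ in data)
    y_mean = mean(y for _, y in data)
    sxx = sum((x - x_mean) ** 2 for x, _ in data)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in data)
    slope = sxy / sxx
    return lambda x: y_mean + slope * (x - x_mean)


def bagged_model(data: Sequence[Sample]) -> Callable[[float], float]:
    """Train one model per left-out observation and average the predictions."""
    models = [fit_line([s for j, s in enumerate(data) if j != i])
              for i in range(len(data))]
    return lambda x: mean(m(x) for m in models)


# Ten noiseless observations of y = 2x + 1: ten models, nine samples each.
observations = [(float(x), 2.0 * x + 1.0) for x in range(10)]
predict = bagged_model(observations)
```

With noiseless data every instance recovers the same line, so the ensemble matches it as well; with noisy data the averaging would smooth out the differences among the instances.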

The stacking approach aims to increase the robustness of the prediction by combining base models of different families. A stacked model should be able to mitigate the weaknesses of the individual models and leverage their strengths. The learning module uses a weighted mean of the model families described in the previous sections. The weights of the model families are chosen to minimize the error of the stacking model with respect to the training data observations. Moreover, they must be positive and sum up to one. This is a quadratic optimization problem and it has been solved using a dedicated R package [35].

Fig. 5: The procedure to compute ensemble models by using the stacking approach, starting from observations.

Figure 5 shows the process used by the learning module for computing the stacking model, based on the following steps:

  1. Train several instances of each model family using a cross-validation scheme.

  2. Use every instance of a model family to predict the holdout data for the training.

  3. Create a matrix of predicted data considering the predictions of step 2 for all the model types.

  4. Compute the weights for each model family by using quadratic optimization on the matrix of prediction computed at step 3, to best fit the training data.

  5. For each model family, train the model by using the complete training data.

  6. Average the prediction of models trained in step 5 using the weights computed in step 4.

As stated in the previous work [9], this definition of the stacking greatly reduces the exploration space and makes the weights estimation robust.
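The weight computation of step 4 can be illustrated for the special case of two model families, where the constrained quadratic problem has a closed-form solution. The Python sketch below covers only this special case and treats the weights as non-negative; the general case with more families requires a quadratic programming solver. The function names and the holdout predictions are hypothetical.

```python
from typing import Sequence, Tuple


def stacking_weights(y: Sequence[float],
                     p1: Sequence[float],
                     p2: Sequence[float]) -> Tuple[float, float]:
    """Weights (w1, w2), w1 + w2 = 1 and both >= 0, minimizing squared error.

    With w2 = 1 - w1 the problem is a one-dimensional convex quadratic, so the
    constrained optimum is the unconstrained one clamped to [0, 1].
    """
    diff = [a - b for a, b in zip(p1, p2)]
    denominator = sum(d * d for d in diff)
    if denominator == 0.0:          # identical predictions: any split works
        return 0.5, 0.5
    numerator = sum((t - b) * d for t, b, d in zip(y, p2, diff))
    w1 = min(1.0, max(0.0, numerator / denominator))
    return w1, 1.0 - w1


def stacked_prediction(weights: Sequence[float],
                       predictions: Sequence[float]) -> float:
    """Weighted mean of the base-model predictions for one input."""
    return sum(w * p for w, p in zip(weights, predictions))


observed = [1.0, 2.0, 3.0, 4.0]
holdout_a = [0.0, 1.0, 2.0, 3.0]   # hypothetical model A: biased low by 1
holdout_b = [3.0, 4.0, 5.0, 6.0]   # hypothetical model B: biased high by 2
weights = stacking_weights(observed, holdout_a, holdout_b)
```

In this example the two biases have opposite signs, so the weights that cancel them give model A twice the weight of model B.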

4.3 Model Training and Validation

This section describes how we partition the information from the Design Space Exploration to train and to validate the models. This step is crucial to broadcast to the application clients a reliable application knowledge.

The typical approach for testing how models fare in the prediction is to divide the input data in a training set and validation set. However, given that the proposed framework leverages the application-knowledge during the production phase, we might have a small set of observation for training and for validating the models. This is true especially for the first iterations of the learning process. Therefore, the learning module uses three different validation schemes according to the number of software-knob configuration explored and a parameter that represents the validation set ratio. The selected approach depends on the relationship of and , where is the number of explored software-knob configurations and is the number of cross-validation folds.

In the general case, when , we use a k-fold validation scheme: the full set of observations is divided into -parts of equal size. One part is always used as a holdout set and the rest is used to train the model. This implies that the learning module trains models and each of them will have out-of-sample predictions on a different part of the data. We call these models cross-validation models. In this case the validation set has a size of .

In the other cases, when there are only a few explored configurations (i.e. ), we use the first data to train the cross-validation models using a leave-one-out cross-validation scheme, and we use the remainder of the data as the holdout validation set. Moreover, if the number of explored software-knob configurations is less than , we apply leave-one-out cross-validation without using any holdout validation set. In this case, we do not use ensemble models, due to bias problems in the model selection. On the one hand, bagged models use the average of the cross-validation models. On the other hand, the stacking model requires training with all the data used to compute the cross-validation models. Therefore, without any holdout data, ensemble models are implicitly trained on the full set of available observations.

Using a holdout validation set gives better information about how the models fare on out-of-sample predictions. Moreover, it allows us to use ensemble models, which are based on the cross-validation models. In practice, the third method is used only in the special cases when the user decides to use a very high or has just a few observations. It is important to notice that in this case the results of the models on out-of-sample predictions might be highly volatile.

To increase the robustness of the estimate of the models' prediction capabilities, we apply an approach similar to k-fold cross-validation to the holdout validation set as well. We split the input data into folds, where is the number of observations used for the holdout validation set. Then validation is performed for each of the folds. In this way, it is possible to test the prediction capabilities on the whole input dataset and not just on one randomly selected part.
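The combination of a rotating holdout set with k-fold cross-validation on the remaining data can be sketched as follows. The sketch is illustrative only: the function names are hypothetical, the number of holdout rotations is assumed to be the number of observations divided by the holdout size, and observations are taken in index order (in practice they would typically be shuffled first).

```python
def chunks(items, size):
    """Split a list into consecutive pieces of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def k_fold(indices, k):
    """Return k (train, test) pairs; every index is tested exactly once."""
    folds = [indices[i::k] for i in range(k)]
    return [([j for f, fold in enumerate(folds) if f != t for j in fold], folds[t])
            for t in range(k)]


def validation_plan(n_observations, holdout_size, k):
    """Rotate the holdout set over the data and cross-validate on the rest."""
    indices = list(range(n_observations))
    plan = []
    for holdout in chunks(indices, holdout_size):
        held = set(holdout)
        rest = [i for i in indices if i not in held]
        plan.append((holdout, k_fold(rest, k)))
    return plan


# Hypothetical sizes: 20 observations, 5 of them held out, 3 folds.
plan = validation_plan(20, 5, 3)
```

Each entry of `plan` pairs one holdout set with the cross-validation splits built from the remaining observations, so every observation is used once for out-of-sample validation.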

4.4 Model Selection

To quantify the prediction quality of a model, we consider two metrics: a variant of the coefficient of determination () [16], and the mean absolute error, normalized by the range of the observed values (). In our case, is computed as the square of the correlation between observed and predicted data. This variant [16] has been selected because it is restricted to [-1; 1] and it can be used on the cross-validation and out-of-sample predictions to compare the results in a consistent manner. To evaluate these metrics for each base model type, we consider the mean of and across the holdout validation models. In the special case when the number of explored software-knob configurations is less than and the holdout set is not used, the mean of and of the cross-validation models is used. For model ensembles, we compute them considering the whole set of observations.
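The two metrics can be written down directly from their definitions. The following is a minimal sketch; the function names are hypothetical, and returning zero when one of the series is constant (where the correlation is undefined) is an assumption made here.

```python
def squared_correlation(observed, predicted):
    """Square of the Pearson correlation between observed and predicted data."""
    n = len(observed)
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        return 0.0  # assumption: treat an undefined correlation as no fit
    return cov * cov / (var_o * var_p)


def normalized_mae(observed, predicted):
    """Mean absolute error divided by the range of the observed values."""
    span = max(observed) - min(observed)
    mae = sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)
    return mae / span


observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.5, 3.5, 4.5, 5.5]      # constant offset of 0.5
r2 = squared_correlation(observed, predicted)
nmae = normalized_mae(observed, predicted)
```

In this example the prediction follows the trend of the observations perfectly, so the squared correlation is 1, while the constant offset yields a normalized error of 0.125. This illustrates why the two metrics are considered together: each one captures a defect that the other can miss.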

Once we have evaluated all the models, we deem as eligible the ones that have higher than and less than , to enforce a minimum quality. Among the eligible models, we select the one that minimizes the . If no model is eligible, the proposed approach restarts the design space exploration, up to a maximum number of iterations. When the maximum number of iterations () is reached, the learning module selects the model with the smallest for the out-of-sample predictions. The parameters , and are exposed to the end-user; by default, they are set to , and respectively.
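The selection loop described above can be sketched as follows. All names (`min_r2`, `max_mae`, `max_iterations`, `explore_and_score`) and all numbers in the example are hypothetical, and ranking the models by their normalized mean absolute error is an assumption of this sketch.

```python
def eligible_models(scores, min_r2, max_mae):
    """Keep the models whose R^2 is above and whose error is below the thresholds."""
    return {name: (r2, mae) for name, (r2, mae) in scores.items()
            if r2 > min_r2 and mae < max_mae}


def learn(explore_and_score, min_r2, max_mae, max_iterations):
    """Repeat the exploration until one model is eligible or the budget ends."""
    scores = {}
    for iteration in range(1, max_iterations + 1):
        scores = explore_and_score(iteration)   # {model name: (r2, mae)}
        eligible = eligible_models(scores, min_r2, max_mae)
        if eligible:
            best = min(eligible, key=lambda name: eligible[name][1])
            return best, iteration
    # Budget exhausted: fall back to the smallest error among all models.
    return min(scores, key=lambda name: scores[name][1]), max_iterations


def fake_round(iteration):
    """Stand-in for one exploration round; the scores are made up."""
    rounds = {1: {"linear": (0.40, 0.30), "kriging": (0.55, 0.25)},
              2: {"linear": (0.45, 0.28), "kriging": (0.93, 0.04)}}
    return rounds[iteration]


best, used_iterations = learn(fake_round, min_r2=0.8, max_mae=0.1, max_iterations=3)
```

In the made-up run no model is eligible after the first round, so the exploration is repeated; after the second round the Kriging stand-in passes both thresholds and is selected.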

4.5 Feature Clustering

The main goal of this component is to find representative clusters based on the input features to be exploited in the application-knowledge. This stage is based on the same considerations done in Petabricks [11], that inspired our implementation. Algorithmic choices and software-knob configurations are often sensitive to input features. However, the feature space can be too large to be completely considered. Clustering techniques are an intuitive solution to reduce the problem complexity.

To cluster the input features observed in DSE, we apply well-known clustering techniques such as k-means [21] and DBSCAN [15] according to a parameter exposed to the end-user. K-means is a clustering technique to minimize the intra-cluster variance, where the user sets a fixed number of clusters. Using a different approach, DBSCAN partitions the samples in clusters, according to their proximity, where the user sets a fixed distance threshold. Moreover, the user is able to manually define the clusters.
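As an illustration of the first technique, the following is a minimal sketch of k-means (Lloyd's iterations) on two-dimensional input features. It is not the clustering code of the framework: the deterministic initialisation is a simplification chosen for reproducibility, and the feature values are made up.

```python
def nearest(point, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))


def kmeans(points, k, max_iterations=100):
    # Deterministic initialisation: k points spread evenly over the input order
    # (an illustrative choice; real implementations usually use random restarts).
    centroids = [tuple(float(v) for v in points[i * len(points) // k]) for i in range(k)]
    for _ in range(max_iterations):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


# Hypothetical input features: two values observed for each input.
features = [(20, 2), (22, 3), (25, 2), (80, 10), (85, 12), (90, 11)]
centroids, labels = kmeans(features, k=2)
```

When the features have very different scales, they are usually normalized before clustering so that no single feature dominates the distance.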

This component is activated after the Design Space Exploration. As shown in Figure 3, it works in parallel with the model learning component. Once the best model has been selected, the generated clusters on the input features are combined with the learned models and used to generate the application-knowledge to be broadcast to the clients.

5 Experimental Results

This section describes the experimental assessment of the proposed framework. First, we evaluate the prediction capabilities of the proposed framework, by using synthetic applications with a known relation between the EFPs and the software-knobs. Then, we focus on a geometrical docking application to evaluate the benefits provided by the proposed framework for the end-user. For this experiment, we used a platform with eight CPUs Intel(R) Xeon(R) X5482 @3.20GHz and 8GB of memory. Concerning the implementation, we used the R packages for the Dmax design [13] and for estimating the linear models [38], the MARS [36], POLYMARS [26], and Kriging [41] models. The plugin uses the tidyverse package [52] for data manipulation and to unify the workflow of different models.

5.1 Framework Validation with Synthetic Applications

The main goal of this section is to evaluate the proposed framework in out-of-sample predictions, when trained with a fraction of the design space. This is the main challenge of the proposed framework, given that each sample used to train the model steals time from the target application. To validate the proposed framework, we created two synthetic applications with a known relation between the software-knobs and the EFPs that has been inherited from well-known functions [7, 27].

The first application is derived from the work of Binh [7], and it has been defined in Eq. 5:

(5)

where and represent the EFPs of interest for the end-user, while and represent the software-knobs of the application. The mathematical formulas describing the two functions represent the relationship between the EFPs and the software-knobs modeled by the framework by using a limited training set. The proposed approach models each EFP independently, therefore this first application aims to demonstrate the ability of the framework to model linear and nonlinear behaviors.

The second application is derived from the work of Kursawe [27], and it has been defined in Eq. 6:

(6)

where and represent the EFPs of interest for the end-user, while , , and are the software-knobs of the application. This application aims at validating the approach by increasing the complexity of the relation between the EFPs and the software-knobs. In particular, represents exponential functions, while represents periodic functions. These functions have been chosen to represent a large set of function types such as linear, nonlinear and exponential.

5.1.1 Model Training and Validation

(a) Validation of the function.
(b) Validation of the function.
(c) Validation of the function.
Fig. 6: Mean absolute error and correlation coefficient of the synthetic application EFPs models by varying the number of explored configurations during the DSE.

In this section, we focus on evaluating the model training and validation by using a limited training set. Figure 6 shows the evaluation of the trained models for the synthetic applications by varying the number of software-knob configurations evaluated in the DSE. In particular, Figure 6(a) validates the trained models for the function, while Figure 6(b) and Figure 6(c) validate the trained models for the and the functions. We omitted the validation results of the function because the underlying linear equation was learned well by almost all the models included in the proposed framework. In Figure 6, the y-axis represents the model quality in terms of mean absolute error and correlation coefficient , while the x-axis represents the number of software-knob configurations evaluated in the DSE. For each x-value, we repeated the experiment 20 times and we show the resulting distribution.

From the experimental results, we notice a trend in the mean absolute error across all the model types and modelled functions: it decreases when we increase the number of explored software-knob configurations. However, this is not true for the correlation coefficient. Let us consider the EFPs of the Kursawe application. POLYMARS models seem to improve as the number of training samples increases, while Kriging models seem to overfit the training data, performing worse on the out-of-sample predictions. Different model types behave differently in out-of-sample predictions according to the underlying equation of the EFP. Let us focus on the MARS models across different EFPs: their behaviour depends on the underlying equation of the target EFP. They are able to model the function, they struggle to model the function, and they fail to model the function.

If we consider all the EFPs, at least one model type fares well in out-of-sample predictions on the validation set, except for the function. For this EFP, the trained models struggle to explain the data, according to the correlation coefficient. This is due to the periodic nature of the underlying equation and to the limited size of the training set. Instead of learning the behaviour of the function, the models tend to settle on an average trend. Therefore, the MAE decreases but the decreases as well.

5.1.2 Model Selection

(a) Binh synthetic application
(b) Kursawe synthetic application
Fig. 7: Number of times that a model family is selected as the best one across the different EFPs of the two synthetic applications, by varying the number of software-knob configurations in the DSE.

In this section, we evaluate the capability of the proposed framework to leverage different modelling techniques to learn the application knowledge at runtime. Figure 7 shows the number of times that a model type is deemed as the best one, according to the methodology explained in Section 4.4. For each EFP, we report the model selection according to the number of software-knob configurations explored during the DSE. Models not listed in Figure 7 have never been selected.

From the experimental results, carried out on both synthetic applications, we notice that no single model always dominates the others. This confirms the importance of the proposed approach. The model actually selected strongly depends on the predicted function and on the values of the model selection parameters (i.e. and ). If we focus on the models generated with a small training set, the learning module selects from a wider range of model families. Conversely, with a larger number of samples (i.e. 50), the most frequently selected models are Kriging, Kriging bagged, POLYMARS bagged and stacking. In all cases, the selected models provide good out-of-sample predictions because the cross-validation and testing errors (as shown in Section 5.1.3) are comparable.

5.1.3 Framework Validation

(a) Binh synthetic application
(b) Kursawe synthetic application
Fig. 8: Mean absolute error and correlation coefficient of the application knowledge generated by the proposed framework, by varying the number of software-knobs configurations.

This section aims at evaluating the quality of the final output of the learning model: the application-knowledge. In this experiment, we consider as input the application-knowledge given by the best model for the given run and we compare it with the underlying equation of the target EFP. Figure 8 shows the experimental results in terms of prediction error () and , across the whole software-knob domain. Figure 8(a) refers to the Binh synthetic application, while Figure 8(b) refers to the Kursawe synthetic application. In both cases, the y-axis represents the quality of the application knowledge, while the x-axis represents the number of software-knob configurations used to compute the model.

From the experimental results, we can notice that the quality metrics are consistent with the validation of the model done in the training phase. In particular, we can predict all the software-knob configurations of the design space within a of 4% and a of 0.94, for all the EFPs of the synthetic applications, except for the EFP for the Kursawe application. These results are coherent with the quality of the model determined in the validation phase and used for the model selection. This behavior is important because we can correctly judge the quality of the application-knowledge before broadcasting it to the application clients. Therefore, the user is able to manage the tradeoff between the duration of the learning phase and the quality of the result, by using the parameters and . Moreover, the proposed framework enables us to distribute the DSE to the application clients and to use an iterative refinement procedure to minimize the learning time of the model.

5.2 The Molecular Docking Case Study

In this section, we validate the proposed approach on a real-life case study taken from the HPC world. First, we demonstrate the advantages obtained by adopting the proposed solution (Section 5.2.1 and 5.2.2). Then, we evaluate some characteristics of the framework in terms of input feature clustering and scalability (Section 5.2.3 and 5.2.4).

In a drug discovery process, molecular docking is one of the earliest tasks and it is performed in silico. Molecular docking is used to virtually screen a very large library of molecules, named ligands, to find the ones with the strongest interaction with the binding site of a second molecule, named pocket, to forward to later stages of the drug discovery process [5]. The complexity of this task is due not only to the huge number of ligands to evaluate, but also to the number of degrees of freedom in the evaluation of the ligand-pocket interaction. In particular, it is possible to alter the shape of the molecule, without altering its chemical properties, by rotating a subset of bonds between the atoms of a ligand, named rotamers.

In this experiment, we focus on a geometric docking kernel, part of the LiGen Dock application [4]. Due to the complexity of evaluating the chemical interaction of a pocket-ligand pair, this kernel considers only geometrical information and it is used to filter out the ligands unable to fit in the target pocket. The application exposes two software-knobs that generate quality-throughput tradeoffs by reducing the number of alternative poses evaluated for each rotamer of the ligand. The end-users typically belong to pharmaceutical companies that rent the resources of an HPC infrastructure to evaluate a chemical library by running a typical batch job. Therefore, the end-users are interested in the time-to-solution and in the quality of the elaboration, defined as the number of evaluated poses.

The goal of this experiment is to assess the benefits of the proposed framework where the throughput is heavily input-dependent. The time spent on evaluating a pocket-ligand pair depends on the number of atoms and rotamers of the ligand, and on the geometrical properties of the pocket that are difficult to represent numerically. This heavy input dependency is perceived as significant noise when measuring the execution time for a given software-knob configuration across several ligands. Given that the pocket remains the same for the entire screening process, the proposed approach aims at learning the effect of the target pocket on the relation between software-knobs and the execution time at the production phase. Every time the pocket changes, we need to learn the application-knowledge again. However, 1) a pocket is seldom re-evaluated, 2) the time spent on evaluating the library is several orders of magnitude larger than the learning time, 3) the application-knowledge is tailored to the actual input, and 4) we exploit the parallelism of the production system (composed of several HPC nodes).

Given that the in-vitro and in-vivo tests in the later stages of the drug discovery process are expensive to execute, the reproducibility of the experiment is a domain requirement. Therefore, once we have obtained the application-knowledge, we restart the evaluation of the chemical library with the configuration that maximizes the quality, while respecting the time-to-solution constraint adjusted by the time spent during the learning phase. In the following experiments, we used a library of ligands, where each ligand has a number of atoms between and and a number of rotamers between and . Moreover, we use six pockets (1b9v, 1c1b, 1cvu, 1cx2, 1dh3 and 1fm9) from the RCSB Protein Databank (PDB) [6].

In the next sections, we validate the proposed approach by using different experiments. First, we show the typical execution traces where the online learning approach has been employed. Second, we validate the approach by simulating a virtual screening campaign. Third, we demonstrate the advantages introduced by the input feature clustering module to reduce the prediction variability. Finally, we use this case study to evaluate the scalability in terms of the time-to-learn for the proposed approach.

(a) MPI process 1
(b) MPI process 2
(c) MPI process 3
(d) MPI process 4
Fig. 9: Execution trace of the docking application learning phase by using 16 MPI processes and targeting the pocket . For out of MPI processes, we show the throughput for evaluating the pocket-ligand pair and the quality over the first seconds of a longer virtual screening process.

5.2.1 Execution Trace

Figure 9 shows an execution trace of the docking application running on MPI processes on the pocket 1b9v. Figure 9 represents the application behavior in terms of the throughput for evaluating the pocket-ligand pair and the quality of the results over the first seconds of a large experiment. Each sub-figure shows the EFP behavior of out of MPI processes, as representative behaviour. The length of the learning phase is determined by the model convergence time and by the input characteristics. It is possible to notice that the length is almost the same for all the clients, thanks to the configuration distribution performed by the remote application handler. After the learning phase, the goal value has been set by computing the average throughput required to process the entire target ligand database given the target time-to-solution (experiment target budget). After the initial exploration of the design space, the application settles on a software-knob configuration that is the same for all the clients. This is because all clients are part of the same virtual screening experiment. By varying the current inputs (i.e. the target pocket and the target ligand database) or the time-to-solution constraint, the autotuning process would lead to a different configuration.
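The computation of the throughput goal and the choice of the configuration can be illustrated with a short sketch. It assumes that the application-knowledge is available as a list of operating points with a predicted throughput and quality; the field names, the fallback to the fastest configuration, and all numbers are hypothetical.

```python
def required_throughput(n_ligands, time_budget_s, learning_time_s):
    """Average throughput (ligands/s) needed to finish within the budget,
    once the time already spent in the learning phase is subtracted."""
    remaining = time_budget_s - learning_time_s
    if remaining <= 0:
        raise ValueError("the learning phase consumed the whole time budget")
    return n_ligands / remaining


def pick_configuration(knowledge, goal):
    """Highest-quality operating point whose predicted throughput meets the goal."""
    feasible = [op for op in knowledge if op["throughput"] >= goal]
    if feasible:
        return max(feasible, key=lambda op: op["quality"])
    # Assumption: if no point meets the goal, fall back to the fastest one.
    return max(knowledge, key=lambda op: op["throughput"])


# Made-up operating points: knob values, predicted ligands/s, evaluated poses.
knowledge = [
    {"knobs": (1, 1), "throughput": 60.0, "quality": 100},
    {"knobs": (2, 2), "throughput": 35.0, "quality": 400},
    {"knobs": (4, 4), "throughput": 20.0, "quality": 1600},
]
goal = required_throughput(n_ligands=100_000, time_budget_s=3600.0, learning_time_s=100.0)
chosen = pick_configuration(knowledge, goal)
```

Since the goal depends only on the size of the library and on the time budget, every client that takes part in the same experiment derives the same goal and therefore settles on the same configuration.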

5.2.2 Prediction Accuracy

To validate the proposed approach, we ran an experimental campaign with a library of ligands, randomly sampled from ligands, targeting six different pockets. In particular, for each pocket we repeated the experiment ten times, reporting the prediction error distribution (see Figure 10). For the learning phase, we observed each software-knob configuration in the DoE with different ligands. Figure 10 shows that a large fraction of the time-to-solution errors are within . In this case, the proposed approach accurately estimates the time-to-solution for the current inputs, maximizing the quality of the results given the time budget. In this experiment, we found that Kriging, MARS bagged, and stacked models are the top three models according to ; the model actually selected varied across the experiments.

Fig. 10: Distribution of the prediction error in percentage, grouped by different target pockets.

5.2.3 Input Feature Clustering

In this subsection, we demonstrate the benefits introduced by the input feature clustering module to reduce the prediction error variability and the training set dependency. In particular, Figure 11 shows the effects obtained by increasing the number of input feature clusters on the prediction accuracy. The clusters have been determined by using the K-means algorithm over the number of atoms of the molecule and the number of rotamers. The prediction accuracy has been measured by considering the variability of the docking time of each molecule represented by the same input feature cluster. This value has been normalized to the one obtained without considering the input features (i.e. one single cluster). As the number of clusters increases, the execution time swing reduces on average and the distribution of the data becomes tighter. Despite a rapid initial reduction, the swing does not go to zero because there are some input characteristics not completely captured by the data features used for the clustering (e.g. the geometry of the pocket and the ligand). Increasing the number of clusters, on one side, reduces the execution time swing, while, on the other side, it increases the amount of memory needed to store the mARGOt OP list.

The reduction of the EFP variability in each cluster is even more important when the data used during the training phase are not distributed like the whole data-set of the experiment. Indeed, considering a single cluster (no input features) means assuming that the average behaviour learned during the training phase will be the same also for other data. This happens either if the EFPs are not dependent on the input data, or if the input data are on average the same as those used for the training. Otherwise, having more clusters enables a better prediction of the EFP values.

Fig. 11: Distribution of the normalized execution time swing in each cluster by varying the number of clusters

5.2.4 Scalability Analysis

This experiment addresses the scalability of the proposed approach in terms of time-to-learn. In particular, we show how the time required to generate the application-knowledge decreases as the number of clients that contribute to the exploration phase increases. Figure 12 shows the distribution of the time required by an application-client to receive the application-knowledge. For each number of MPI processes, we repeated the experiment five times. From the results, the overhead is almost inversely proportional to the number of clients. Indeed, the time-to-knowledge almost halved when doubling the number of MPI processes. In particular, the proposed framework required about seconds for the training and validation of all models. This time is short enough when considering the number of models we are training and the target use case. The time needed by the training, validation and selection phase is constant regardless of the number of clients.
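This behaviour is consistent with a simple cost model in which the exploration is shared among the clients while the model training adds a constant term. The sketch below is only a back-of-the-envelope illustration: the function name and every number in it are hypothetical and are not measurements from the experiment.

```python
import math


def time_to_knowledge(n_observations, seconds_per_observation, n_clients, model_time_s):
    """Exploration is split evenly among the clients; the training, validation
    and selection of the models take a constant time afterwards."""
    rounds = math.ceil(n_observations / n_clients)
    return rounds * seconds_per_observation + model_time_s


# Made-up workload: 640 observations of 1 s each, 10 s to train the models.
estimates = {clients: time_to_knowledge(640, 1.0, clients, 10.0)
             for clients in (1, 2, 4, 8, 16)}
```

With these made-up numbers the estimate drops from 650 s with one client to 50 s with sixteen. The exploration term halves at every doubling of the clients, whereas the constant term does not, which is why the total time is only almost inversely proportional to the number of clients.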

Fig. 12: Distribution of the time spent for learning the application-knowledge by varying the number of MPI processes

6 Conclusions

This paper proposed an online autotuning framework to learn the relation between software-knobs, extra-functional properties and input features at the production phase, in a distributed way. To minimize the learning time, the framework is based on two strategies. On one side, it uses ensemble models to boost the prediction capabilities of the base models. On the other side, it uses an iterative approach for sampling the design space until the computed models reach the target quality.

Experimental results on synthetic applications and on a real-world case study demonstrate that there is no free lunch: no single model fits all the cases. This result confirms the main goal of the proposed approach: the goal was not to compare different modelling techniques, but to provide a framework exploiting them to learn the application-knowledge at runtime.

References

  • [1] J. Ansel et al. Opentuner: An extensible framework for program autotuning. In PACT. IEEE, 2014.
  • [2] W. Baek and T. M. Chilimbi. Green: a framework for supporting energy-conscious programming using controlled approximation. In ACM Sigplan Notices, volume 45, pages 198–209. ACM, 2010.
  • [3] P. Balaprakash, J. Dongarra, T. Gamblin, M. Hall, J. K. Hollingsworth, B. Norris, and R. Vuduc. Autotuning in high-performance computing applications. Proceedings of the IEEE, 106(11):2068–2083, Nov 2018.
  • [4] C. Beato et al. Use of experimental design to optimize docking performance: The case of ligendock, the docking module of ligen, a new de novo design program, 2013.
  • [5] A. R. Beccari, C. Cavazzoni, C. Beato, and G. Costantino. Ligen: a high performance workflow for chemistry driven de novo design, 2013.
  • [6] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The protein data bank. Nucleic Acids Res, 28:235–242, 2000.
  • [7] T. T. Binh. A Multiobjective Evolutionary Algorithm - The Study Cases. Institute for Automation and Communication, 1999.
  • [8] L. Breiman. Bagging predictors. Machine Learning, 1996.
  • [9] L. Breiman. Stacked regressions. Machine Learning, 24, 1996.
  • [10] M. Christen, O. Schenk, and H. Burkhart. Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 676–687. IEEE, 2011.
  • [11] Y. Ding et al. Autotuning algorithmic choice for input sensitivity. In ACM SIGPLAN Notices, volume 50. ACM, 2015.
  • [12] J. Dorn, J. Lacomis, W. Weimer, and S. Forrest. Automatically exploring tradeoffs between software output fidelity and energy costs. IEEE Transactions on Software Engineering, 2017.
  • [13] D. Dupuy et al. DiceDesign and DiceEval: Two R Packages for Design and Analysis of Computer Experiments. Journal of Statistical Software, 2015.
  • [14] H. Esmaeilzadeh et al. Dark silicon and the end of multicore scaling. In ISCA. IEEE, 2011.
  • [15] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231, 1996.
  • [16] B. Everitt et al. The Cambridge dictionary of statistics. 2010.
  • [17] J. H. Friedman. Multivariate Adaptive Regression Splines. The Annals of Statistics, 1991.
  • [18] M. Frigo and S. G. Johnson. The design and implementation of fftw3. Proceedings of the IEEE, 93(2):216–231, 2005.
  • [19] D. Gadioli, E. Vitali, G. Palermo, and C. Silvano. margot: a dynamic autotuning framework for self-aware approximate computing. IEEE Transactions on Computers, 2018.
  • [20] H. Guo. A bayesian approach for automatic algorithm selection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI03), Workshop on AI and Autonomic Computing, Acapulco, Mexico, pages 1–5, 2003.
  • [21] J. A. Hartigan and M. A. Wong. Algorithm as 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.
  • [22] H. Hoffmann et al. Using code perforation to improve performance, reduce energy consumption, and respond to failures. 2009.
  • [23] H. Hoffmann et al. Dynamic knobs for responsive power-aware computing. In ACM SIGPLAN Notices. ACM, 2011.
  • [24] S. A. Kamil. Productive high performance parallel programming with auto-tuned domain-specific embedded languages. University of California, Berkeley, 2012.
  • [25] J. Kephart et al. The vision of autonomic computing. Computer 2003, 36, 2003.
  • [26] C. Kooperberg. polspline: Polynomial Spline Routines, 2018. R package version 1.1.13.
  • [27] F. Kursawe. A variant of evolution strategies for vector optimization. In Lecture Notes in Computer Science, 1991.
  • [28] M. A. Laurenzano et al. Input responsiveness: using canary inputs to dynamically steer approximation. ACM SIGPLAN Notices, 2016.
  • [29] S. Mahdavi-Hezavehi, V. H. Durelli, D. Weyns, and P. Avgeriou. A systematic literature review on methods that handle multiple quality attributes in architecture-based self-adaptive systems. Information and Software Technology, 90:1–26, 2017.
  • [30] J. S. Miguel et al. The anytime automaton. In ACM SIGARCH Computer Architecture News. IEEE Press, 2016.
  • [31] S. Misailovic, D. Kim, and M. Rinard. Parallelizing sequential programs with statistical accuracy tests. ACM Transactions on Embedded Computing Systems (TECS), 12(2s):88, 2013.
  • [32] S. Mittal. A survey of techniques for approximate computing. ACM Computing Surveys (CSUR), 48(4):62, 2016.
  • [33] D. C. Montgomery. Design and analysis of experiments. John wiley & sons, 2017.
  • [34] C. Nugteren and V. Codreanu. Cltune: A generic auto-tuner for opencl kernels. In Embedded Multicore/Many-core Systems-on-Chip (MCSoC), 2015 IEEE 9th International Symposium on, pages 195–202. IEEE, 2015.
  • [35] S. original by Berwin A. Turlach R port by Andreas Weingessel <Andreas.Weingessel@ci.tuwien.ac.at>. quadprog: Functions to solve Quadratic Programming Problems., 2013. R package version 1.5-5.
  • [36] S. original by Trevor Hastie & Robert Tibshirani. Original R port by Friedrich Leisch, K. Hornik, and B. D. Ripley. mda: Mixture and Flexible Discriminant Analysis, 2017. R package version 0.4-10.
  • [37] M. Püschel, J. M. Moura, B. Singer, J. Xiong, J. Johnson, D. Padua, M. Veloso, and R. W. Johnson. Spiral: A generator for platform-adapted libraries of signal processing algorithms. The International Journal of High Performance Computing Applications, 18(1):21–45, 2004.
  • [38] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2018.
  • [39] A. Rasch et al. Atf: A generic auto-tuning framework. In High Performance Computing and Communications. IEEE, 2017.
  • [40] M. Rinard. Probabilistic accuracy bounds for fault-tolerant computations that discard tasks. In Proceedings of the 20th annual international conference on Supercomputing, pages 324–334. ACM, 2006.
  • [41] O. Roustant et al. DiceKriging, DiceOptim: Two R Packages for the Analysis of Computer Experiments by Kriging-Based Metamodeling and Optimization. Journal of Statistical Software, 2012.
  • [42] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–423, 1989.
  • [43] M. Samadi, D. A. Jamshidi, J. Lee, and S. Mahlke. Paraprox: Pattern-based approximation for data parallel applications. ACM SIGPLAN Notices, 49(4):35–50, 2014.
  • [44] M. Samadi et al. Sage: Self-tuning approximation for graphics engines. In MICRO. IEEE, 2013.
  • [45] C. A. Schaefer, V. Pankratius, and W. F. Tichy. Atune-il: An instrumentation language for auto-tuning parallel applications. In European Conference on Parallel Processing, pages 9–20. Springer, 2009.
  • [46] J. Shen, A. L. Varbanescu, H. Sips, M. Arntzen, and D. G. Simons. Glinda: A framework for accelerating imbalanced applications on heterogeneous platforms. In Proceedings of the ACM International Conference on Computing Frontiers, CF ’13, pages 14:1–14:10, New York, NY, USA, 2013. ACM.
  • [47] C. J. Stone et al. Polynomial splines and their tensor products in extended linear modeling. Annals of Statistics, 1997.
  • [48] X. Sui et al. Proactive control of approximate programs. ACM SIGOPS Operating Systems Review, 2016.
  • [49] R. Vuduc, J. W. Demmel, and K. A. Yelick. Oski: A library of automatically tuned sparse matrix kernels. In Journal of Physics: Conference Series, volume 16, page 521. IOP Publishing, 2005.
  • [50] R. Webster et al. Optimal interpolation and isarithmic mapping of soil properties III changing drift and universal kriging. Journal of Soil Science, 1980.
  • [51] R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Proceedings of the 1998 ACM/IEEE conference on Supercomputing, pages 1–27. IEEE Computer Society, 1998.
  • [52] H. Wickham. tidyverse: Easily Install and Load the ’Tidyverse’, 2017. R package version 1.2.1.