1 Introduction
With the end of Dennard scaling [14], the optimization focus has shifted towards efficiency in a wide range of scenarios, from embedded to High-Performance Computing (HPC). To this end, autotuning [3] has been identified as a promising research direction. In this context, it is possible to use frameworks that optimize a specific task [51, 37] and frameworks that optimize knobs at either the system level (e.g. the core frequencies) or the application level (e.g. software parameters) [1, 39]. Moreover, if the target application can tolerate an error on the output, approximate computing [32] represents an appealing path to further increase computation efficiency. Finding application parameters that enable a quality-throughput tradeoff can be a complex task. Therefore, several techniques have been proposed in the literature to expose them, such as loop perforation [22] or task skipping [40].
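As an illustration of how such a technique exposes a quality-throughput software-knob, the following minimal Python sketch applies loop perforation to a mean computation; the function name and data are invented for illustration and are not part of any cited framework.

```python
# Hypothetical sketch of loop perforation: the perforation factor is a
# software-knob that trades result quality for execution time by
# skipping loop iterations.

def mean_perforated(values, perforation_factor=1):
    """Approximate the mean by visiting every `perforation_factor`-th element."""
    sampled = values[::perforation_factor]
    return sum(sampled) / len(sampled)

data = list(range(1000))           # exact mean is 499.5
exact = mean_perforated(data, 1)   # factor 1: no perforation, full quality
approx = mean_perforated(data, 4)  # factor 4: ~75% fewer iterations
```

A larger perforation factor skips more iterations, lowering execution time at the cost of a (usually small) error on the output.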
As a consequence of this trend, application requirements are increasing in complexity to address several extra-functional properties (EFPs), such as execution time, power consumption, and quality of the results. On the other hand, application developers have started to expose in the source code a large set of tunable parameters that alter the extra-functional behaviour of the application, thus enabling autotuning. In this paper, we refer to these parameters as software-knobs.
The relation between the software-knobs and the EFPs of interest for end-users is complex and unknown, therefore selecting the best configuration is a difficult task. The EFP values might depend on the underlying architecture, on the system workload, and on features of the current input. A first subset of software-knobs relates to parameters that tailor the application to the underlying architecture, such as the work-group size, MPI runtime parameters, or compiler options. Typically, autotuning frameworks that address these parameters perform a Design Space Exploration (DSE) at design-time to find the most suitable configuration to be used in the production phase. The main challenge in these approaches is the exponential growth of the design space when considering several, usually unbounded, software-knobs. The second subset of software-knobs relates to application-specific parameters, such as the number of Monte Carlo trials, loop perforation factors, or algorithm parameters. Typically, it is easier to change these software-knobs during the production phase, and their effects on EFPs are strongly coupled with the features of the current input. For these reasons, autotuning frameworks that address these parameters typically model, as an application-knowledge, the relationship among software-knobs, EFPs, and input features. The autotuners leverage this knowledge to identify and seize optimization opportunities, improving the computation efficiency during the production phase. The main challenge of these approaches consists of providing mechanisms to enhance the target application with an adaptation layer that gives self-optimization capabilities [25], considering application requirements and the system evolution.
Given the complexity of the tuning task, it is typically performed offline, prior to application execution, leaving only the configuration selection to runtime. The main problem is that every time the code is ported to a new architecture, updated, or fed with new input data, the offline tuning process must be redone. However, moving this phase online, at production time, requires a significant effort to minimize the tuning time and overhead as much as possible.
In this paper, we propose a framework to learn the application-knowledge online, at the beginning of the production phase. It has been designed to work in a distributed context where different entities can collaborate in collecting the knowledge. The framework mainly targets the HPC context, where an application is composed of more than one process and usually executes for a long period. However, it might also be applied in a wider range of scenarios. In fact, the benefits of learning the application-knowledge at runtime are the following: (1) we can leverage the parallelism of the platform to reduce the time-to-knowledge; (2) we learn the behaviour of the application using the actual input set; and (3) using the actual execution environment. Given that we are stealing time from the execution of the application, the main challenge addressed in this paper is to reduce as much as possible the time required to learn the application-knowledge. To reach this goal, in addition to fully exploiting the parallel production machine, the proposed methodology employs an iterative exploration strategy to minimize the required number of samples. In particular, we start by exploring a fraction of the design space. If the derived models are not able to reach a target quality in the validation phase, the framework resumes the Design Space Exploration (DSE). Otherwise, the framework broadcasts the obtained application-knowledge.
From the implementation point of view, we used the mARGOt [19] dynamic autotuning framework as a starting point. mARGOt is an adaptation layer that provides the target application with mechanisms to adapt in a reactive and proactive fashion, based on the application-knowledge. In this paper, we enhance the mARGOt learning module with a model-driven approach. The goal of this paper is not to compare different modelling techniques, but to leverage them to reduce the time required to compute the application-knowledge. We used synthetic applications with a known relation between EFPs and software-knobs to experimentally evaluate the out-of-sample predictions of the learning module. Then, we focused on a molecular docking application to assess the benefits of the proposed framework in a real-world case study.
The main contributions of the paper can be summarized as follows:

We propose an autotuning framework to learn the relation between EFPs, software-knob configurations, and input features during the production phase;

We learn this knowledge online, exploiting the parallelism of the production machine and considering the production input data;

We leverage ensemble models to reduce the cost of the learning phase, while automatically selecting the most suitable model according to the target problem;

We enhanced a state-of-the-art autotuning framework with the proposed learning module. In particular, we extended it to support the iterative exploration needed by the proposed approach;

We evaluated the framework with synthetic functions and a real-life case study from the HPC domain to demonstrate the introduced benefits.
The remainder of the paper is structured as follows. Section 2 compares the proposed framework with related work, highlighting the main contributions. Section 3 describes the framework implementation, focusing on the relation among the involved components and showing the workflow of the proposed methodology. Section 4 formalizes the problem and describes in detail how the learning module computes the application-knowledge. Section 5 experimentally evaluates the proposed framework. Finally, Section 6 concludes the paper.
2 State of the Art
In the context of autonomic computing [25], an application is perceived as an autonomic element capable of self-management. Among the self-* properties required for self-management, autotuning frameworks aim to provide self-optimization [29]. In this context, the end-user specifies high-level requirements and the application should adapt accordingly, without a human-in-the-loop. This is a promising path investigated in the literature [3], where several autotuning frameworks have been evaluated according to their vision on how to provide self-optimization properties. In the remainder of the section, we classify the most recent and relevant autotuning frameworks into six categories, according to the methodology used to learn the application-knowledge.
In the context of HPC, there are several autotuning frameworks tailored to optimize specific domains. ATLAS [51] was designed for matrix multiplication routines, FFTW [18] for FFT operations, OSKI [49] for sparse matrix kernels, SPIRAL [37] for digital signal processing, CLTune [34] and GLINDA [46] for OpenCL applications, Patus [10] and Sepya [24] for stencil computations. These works are interesting examples of domain-specific autotuning. However, they were designed to take decisions orthogonal to the proposed approach, which is oriented towards supporting the online learning phase.
The second category represents frameworks that apply code or binary transformations to introduce the possibility of exploiting accuracy-throughput tradeoffs. QuickStep [31], Paraprox [43], and PowerGAUGE [12] are some examples in this category. The main focus of these works is on how to expose tradeoffs by introducing software-knobs. The parameter tuning phase is done at design-time, relying on a representative input set.
The third category includes frameworks that explore a very large design space to find the best configuration of the software-knobs according to the application requirements, before the production phase. Atune-IL [45], OpenTuner [1], and the ATF framework [39] are some examples in this category. Atune-IL provides a mechanism to prune and reduce the configuration space according to the code structure and to the dependencies among software-knobs. OpenTuner uses a multi-armed bandit framework to choose the best search algorithm for the given application. The ATF framework improves the OpenTuner strategies by also considering domain constraints on the parameters. Given that the tuning phase is done at design-time, these frameworks usually target software-knobs loosely coupled with the inputs. Moreover, the output of the tuning process is a single software-knob configuration, not the application-knowledge required to adapt during the production phase.
In the fourth category, we include frameworks that target streaming applications. They typically learn the application-knowledge at design-time, to be leveraged during the production phase. The Green framework [2], the Sage framework [44], and PowerDial [23] are some examples in this category. The main focus of these works is on how to provide reaction mechanisms to a streaming application. Therefore, they use representative inputs to derive the application-knowledge and then react during the production phase. This approach assumes only a few abrupt changes in the input to be processed. However, this assumption might not hold in all types of workloads, e.g. see Section 5.2 or [11, 48, 19].
In the fifth category, we consider the autotuning frameworks that also adapt an application in a proactive way, by using input features and learning the application-knowledge at design-time. Petabricks [11], the framework proposed in [20], and Capri [48] are some examples in this category. The methodology used to derive the application-knowledge assumes the possibility of selecting which inputs to consider in the representative set used during the learning process. The methodology proposed in this paper has been designed to use the production input directly, therefore it is not possible to apply the same approaches. Moreover, these frameworks can express a tradeoff between a quality metric and a single additional EFP only, whereas the proposed approach is capable of addressing an arbitrary number of EFPs.
In the sixth and last category, we include the autotuning frameworks (such as [28, 30]) that also adapt an application in a proactive way, without learning the application-knowledge at design-time. The framework proposed in this paper falls into this category. The IRA framework [28] defines the concept of canary input as the smallest subsampling of the actual input that has the same properties as the original input. It proposes using a canary input for a runtime parameter exploration of the target application, for each input to be processed. Then, it uses the fastest software-knob configuration that satisfies a given bound on the minimum accuracy. The main drawback of this methodology is that the presented subsampling technique applies only to matrix-like inputs, therefore limiting the applicability of the framework. A rather different approach with respect to the previous ones is Anytime Automaton [30]. It suggests source code transformations to rewrite the application using a pipeline design pattern. The idea is that the longer a given input executes in the pipeline, the more accurate the output becomes. This work targets hard constraints on the execution time, therefore the idea is to interrupt the algorithm when the time budget is depleted. In this way, it is possible to provide guarantees on the maximum feasible accuracy. However, this approach works only for exploiting accuracy-performance tradeoffs.
In this paper, we propose a framework that can be considered orthogonal to most of the autotuning frameworks presented so far. In particular, it has been designed to support dynamic tuning at runtime, with the capability to learn the application-knowledge online. To reduce the learning time, it uses two different approaches. First, it is based on a scalable infrastructure capable of exploiting the parallelism of the underlying production platform. Second, it uses ensemble models to accelerate the learning phase, while always selecting the best model for the target problem.
3 Proposed Framework Architecture
This section introduces the architectural view of the proposed technique in the context of the whole adaptive framework. Later, Section 4 focuses on the learning approach from the methodology point of view.
The proposed approach uses as a starting point the mARGOt autotuning framework [19], which enhances a target application with an adaptation layer that provides mechanisms to adapt in a proactive and reactive fashion. From the implementation point of view, mARGOt is a C++ library to be linked to the target application. Therefore, each instance of the application can take autonomous decisions. mARGOt takes as input an application-knowledge defined as a discrete set of Operating Points (OPs). Each OP relates a software-knob configuration with the expected EFP values reached using that configuration, according to a set of input features. Therefore, an OP is composed of three sets of values: the software-knob configuration, the expected EFPs, and the related input feature values. By using this representation, the mARGOt autotuner is capable of selecting the most suitable OP, according to application requirements defined as a constrained multi-objective optimization problem. Moreover, it uses feedback information from monitors to adapt in a reactive way, while it uses features of the current input to adapt in a proactive way.
The main goal of the proposed framework is to obtain the OP list during the production phase, without requiring a design-time profiling phase. The proposed methodology guides the learning process by leveraging the underlying mARGOt infrastructure. To achieve these goals, we exploit the possibility to dynamically update the OP list of each application client. In particular, we assign to each application client a different software-knob configuration, to distribute the design space exploration. Once the learning module generates the application-knowledge, we broadcast the OP list to all the application clients. In this way, 1) it is possible to leverage all the available nodes to reduce the time-to-model, 2) the application-knowledge is tailored to the current input, and 3) we measure the EFPs in the production environment.
Figure 1 provides an overview of the proposed autotuning framework. It is composed of three components: a remote application handler (server), an application local handler (client), and the learning module. The remote application handler is the central coordinator, implemented as a thread pool that executes on a dedicated server. To store information about the managed applications and the status of the DSE, the server might use the Apache Cassandra database for scalability reasons, or CSV files for small instances. The learning module is the core of the proposed approach and it performs three main tasks: 1) it leverages Design of Experiments techniques to efficiently sample the design space to be explored; 2) it leverages state-of-the-art modelling techniques to interpolate out-of-sample predictions; 3) it uses a validation stage to test whether the quality of the obtained model is acceptable or additional effort is required to explore the design space. An extensive description of the learning module is presented in Section 4. The application local handler is a service thread in each application instance, which runs asynchronously with respect to the application execution flow. Its main goal is to manipulate the application-knowledge of the related application instance. In particular, during the design space exploration phase, the application local handler forces the autotuner to select the software-knob configuration that must be evaluated. When the model is available, it sets the application-knowledge accordingly. Moreover, it is in charge of the synchronization with the server counterpart. The communications between server and clients leverage the MQTT or MQTTS protocols.

Figure 2 shows the typical workflow of the framework when it interacts with an unknown application. In particular, it is composed of the following steps:

The clients register themselves with the server.

The server asks one client for information about the application, such as the number of software-knobs and their domains.

Once the server has collected the information, it calls the learning module to generate a set of configurations to explore (DoE, Design of Experiments).

The server dispatches the configurations to evaluate to the available clients, in a round-robin fashion.

Once the clients have explored all the configurations, the server requests a model from the learning module.

The learning module trains, validates, selects and returns the best model. If the model is not valid, it returns an empty model.

If there is a valid model, the server broadcasts it to the available clients; otherwise, the process restarts from step 3.
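The steps above can be sketched as follows. This is a minimal, in-memory illustration: the MQTT transport, the learning module, and the client-side measurements are replaced by hypothetical stand-ins (`run_dse`, `train_model`, and the lambda clients are invented names, not part of the framework's API).

```python
# Sketch of the server-side workflow (steps 3-7): dispatch DoE
# configurations round-robin to the clients, request a model, and either
# broadcast it or resume the design space exploration.
from itertools import cycle

def run_dse(clients, doe, train_model, max_iterations=3):
    explored = []
    for _ in range(max_iterations):
        # step 4: dispatch configurations round-robin to the clients
        for config, client in zip(doe, cycle(clients)):
            explored.append((config, client(config)))  # client measures EFPs
        # steps 5-6: request a model; None stands for "empty model"
        model = train_model(explored)
        if model is not None:
            return model, explored       # step 7: broadcast the valid model
        doe = [c + 1 for c in doe]       # restart from step 3 with a new DoE
    return None, explored

clients = [lambda c: c * 2, lambda c: c * 2]   # fake EFP measurements
model, observations = run_dse(clients, [1, 2, 3],
                              lambda obs: "model" if len(obs) >= 6 else None)
```

In this toy run, the first DoE round yields too few observations, so the exploration resumes once before a valid model is produced.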
The framework implementation is resilient to crashes on both the server side and the client side. Moreover, whenever a new client becomes available, it can join the design space exploration or directly receive the model.
4 Proposed Methodology
The goal of the proposed approach is to learn, during the production phase, the relation between software-knob configurations, EFPs, and input features. The main challenges in achieving this goal are twofold. On one hand, we need to reduce as much as possible the time required to learn the application-knowledge. On the other hand, we cannot control the features of the input. While we can force an application instance to use a given software-knob configuration, the input set is the one of the production run.
Figure 3 shows an overview of the proposed methodology. Orange elements represent components of the learning module that drive the learning process, while white elements represent components of the mARGOt framework. Given the description of an unknown application, we use Design of Experiments (DoE) techniques to efficiently sample the design space. After the exploration of the selected software-knob configurations, we perform two operations. First, we build a model for each EFP of interest. Second, we cluster the observed input features. If the best model that we found is deemed valid, we use it to generate the list of OPs to broadcast to the application clients. Otherwise, we generate additional software-knob configurations to be evaluated and we restart the procedure. The remainder of this section formalizes the main components of the proposed approach: the DoE techniques, the modelling techniques, the model training and validation, the model selection and, finally, the input feature clustering.
4.1 Design of Experiments
The proposed approach aims at obtaining the application-knowledge during the production phase, therefore we want to reduce the design space exploration as much as possible. To reach this goal, it is important to sample the design space so as to maximize the retrieved information. This is a well-known problem in the literature, where several design of experiments (DoE) techniques have been investigated [33]. The proposed framework leverages the Dmax algorithm [13], which maximizes the determinant of the correlation matrix whose entries are defined as in Equation 1:

c_ij = 1 − γ(d_ij; a)    (1)

where d_ij is the distance between points i and j, a is the threshold distance of the correlation between two points, and γ is a variogram. This measure maximizes the information entropy of the design and thus strives to maximize the information gained by exploring the given software-knobs. Therefore, this method works well for creating a DoE on a design space with no a priori knowledge.
This DoE technique exposes two free parameters: the total number of software-knob configurations to be explored and the threshold distance a. We give the end-user the possibility to change these parameters from their default values. Moreover, the end-user might specify how many times each selected software-knob configuration is explored. The ability to execute the same software-knob configuration several times is important to increase the robustness of the estimation for non-deterministic applications. Given that the input features are not controllable, multiple runs of the same software-knob configuration might also lead to learning the knowledge from different feature sets.
The Dmax algorithm is designed to sample a continuous design space. The application description defines a discrete domain for each software-knob, therefore the selected samples are mapped to the closest not-yet-selected software-knob configuration. Moreover, to accommodate the needs of more complex design spaces, it is possible to set restrictions on the software-knob domains. There are several ways to implement restrictions on a design space. On one hand, it is possible to compute the feasible design space from the provided restrictions. However, in the case of non-linear relationships between software-knobs, this would lead to hard-to-solve non-linear inequalities. On the other hand, it is possible to first create the full factorial design space and then remove the software-knob configurations that do not satisfy the given restrictions. The latter approach is feasible since the user-defined design space is discrete and finite. In the proposed approach, we select the software-knob configurations to explore in four steps:

Create a full-factorial design.

Remove the configurations that do not satisfy the given restrictions.

Use Dmax to sample the continuous design space to explore.

Map the selected samples to the discrete software-knob domains.
The output of this stage is a list of software-knob configurations to explore in the Design Space Exploration. The latter task is performed by the autotuning framework.
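The four steps above can be sketched as follows. This is a hedged Python illustration: a uniform random sample of the continuous space stands in for the Dmax criterion (the framework itself uses the Dmax R package [13]), and `select_configurations`, the domains, and the restriction are invented for the example.

```python
# Sketch of the four DoE steps: full-factorial design, restriction
# filtering, continuous sampling, and mapping to the discrete domain.
import itertools
import random

def select_configurations(domains, restriction, n_samples, seed=0):
    # 1) full-factorial design over the discrete software-knob domains
    factorial = list(itertools.product(*domains))
    # 2) drop the configurations that violate the given restrictions
    feasible = [c for c in factorial if restriction(c)]
    # 3) sample the continuous design space (uniform stand-in for Dmax)
    rng = random.Random(seed)
    lows, highs = [min(d) for d in domains], [max(d) for d in domains]
    continuous = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
                  for _ in range(n_samples)]
    # 4) map each sample to the closest not-yet-selected feasible config
    chosen = []
    for point in continuous:
        best = min((c for c in feasible if c not in chosen),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(c, point)))
        chosen.append(best)
    return chosen

configs = select_configurations(
    domains=[(1, 2, 4, 8), (16, 32, 64)],        # two discrete knobs
    restriction=lambda c: c[0] * c[1] <= 256,    # example domain restriction
    n_samples=4)
```

Mapping to the closest *not-yet-selected* configuration guarantees that the returned configurations are distinct, mirroring step 4 of the procedure.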
4.2 Modelling Techniques
This section describes the model families used to learn the relation between EFPs, software-knobs, and input features. The learning module models each EFP independently. Therefore, in our notation y represents the expected value of the target EFP, while x represents the vector of predictors, i.e. software-knobs and input features. In particular, we model an EFP as y = f(x), where the function f is represented by a given modelling technique. The remainder of the section describes the types of models used in the proposed approach.

4.2.1 Linear Models
Linear regression [33] with dependent variable y and explanatory variables X is defined in Equation 2:

y = β_0 + Xβ + ε    (2)

where β_0 is a constant, β is a vector of parameters, X is a matrix of explanatory terms, and ε is a vector of residuals or errors.
We use two types of linear models: first, we consider only the model with a constant and the explanatory variables (first order); second, we also consider two-way interactions of the explanatory variables (first order with interactions).
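As a sketch of the two variants, the following Python fragment fits both designs by ordinary least squares with NumPy; the framework itself uses R's linear-model estimation [38], so `fit_linear` and the synthetic data are illustrative only.

```python
# Sketch of the two linear-model variants: first order, and first order
# with two-way interactions, fit by ordinary least squares.
import numpy as np

def fit_linear(X, y, interactions=False):
    # constant term plus the first-order explanatory variables
    cols = [np.ones(len(X))] + [X[:, j] for j in range(X.shape[1])]
    if interactions:  # add the two-way products x_i * x_j (i < j)
        cols += [X[:, i] * X[:, j] for i in range(X.shape[1])
                 for j in range(i + 1, X.shape[1])]
    A = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A, beta

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
# synthetic EFP with a genuine interaction term
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + 5.0 * X[:, 0] * X[:, 1]
A, beta = fit_linear(X, y, interactions=True)
```

Since the synthetic EFP contains an interaction term, only the second variant can recover the generating coefficients exactly.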
4.2.2 MARS Models
The second family of models used in the learning module is multivariate adaptive regression splines (MARS) [17]. This model iteratively adds basis functions to create the best possible representation of the variable interactions (non-parametric model). The MARS representation is defined in Equation 3:

f(x) = c_0 + Σ_{i=1..M} c_i B_i(x)    (3)

where c_0 is a constant, M is the number of basis functions, c_i is the constant coefficient of the basis function i, and B_i is the basis function itself. A basis function is of the form max(0, x − t) or max(0, t − x), or the multiplication of multiple basis functions. The parameter t is a constant estimated by the model. In the learning module, we also use a variation of this model, named POLYMARS, which enables a maximum of two-way interactions in the model [47].
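The hinge-shaped basis functions can be illustrated with a tiny hand-built model; the coefficients and knot below are invented for illustration, whereas MARS estimates them from data.

```python
# Sketch of the MARS hinge basis functions max(0, x - t) and max(0, t - x);
# MARS builds the model as a weighted sum of such terms (and their products).
def hinge(x, knot, direction=+1):
    return max(0.0, direction * (x - knot))

# a tiny hand-built MARS-style model with one knot at t = 1:
# f(x) = 2 + 3*max(0, x - 1) - 1*max(0, 1 - x)
def mars_predict(x):
    return 2.0 + 3.0 * hinge(x, 1.0, +1) - 1.0 * hinge(x, 1.0, -1)
```

The two mirrored hinges make the fitted function piecewise linear, with a different slope on each side of the knot.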
4.2.3 Kriging Model
The third family of models used by the learning module is Kriging [42]. We use an extension of the original model, named Universal Kriging (UK) [50], which assumes that the observed values come from a deterministic process given by Equation 4:

Y(x) = μ(x) + Z(x)    (4)

where μ(x) is a trend defined by a number of basis functions and Z(x) is a process with a known covariance kernel.

In the context of the proposed framework, the generating process is seldom deterministic, therefore we need to relax this assumption. In particular, we force determinism by averaging the observed values for each observed software-knob configuration.
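The averaging step can be sketched as follows; `average_replicates` and the observation format are illustrative names, not part of the framework's API.

```python
# Sketch of how replicated measurements of the same software-knob
# configuration are collapsed to their mean, so that Universal Kriging's
# deterministic-process assumption holds on the averaged data.
from collections import defaultdict

def average_replicates(observations):
    """observations: list of (config_tuple, efp_value) pairs."""
    grouped = defaultdict(list)
    for config, value in observations:
        grouped[config].append(value)
    return {config: sum(vals) / len(vals) for config, vals in grouped.items()}

obs = [((1, 16), 10.0), ((1, 16), 12.0), ((2, 16), 20.0)]
deterministic = average_replicates(obs)   # one value per configuration
```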
4.2.4 Ensemble Models
Model ensembling is a well-known approach to increase the predictive capabilities of base models by combining them together, using different techniques. The learning module leverages two techniques based on cross-validation models: bagging [8] and stacking [9].

The bagging approach aims at decreasing the variance of the prediction. It focuses on a single base modelling technique and combines instances of the model trained with different data subsamples. The basic idea of bagging is to increase the robustness of the predictive capabilities of the model by combining models trained on datasets with only small differences. For example, if we have 10 observations and train 10 different models, always leaving out one observation, each pair of models will share 8 of the 9 observations used for training. The combination of such models should lead to a more robust ensemble model. To obtain the prediction, bagging uses the mean of the predictions generated by the model instances. Figure 4 shows the process of bagging model computation.

The stacking approach aims to increase the robustness of the prediction by combining base models together. A stacked model should be able to reduce the weaknesses of the individual models and leverage their strengths. The learning module uses a weighted mean of the model families described in the previous sections. The weights for each model family aim at minimizing the error of the stacking model on the training data observations. Moreover, they must be positive and sum up to one. This is a quadratic optimization problem, which has been solved using a dedicated R package [35].
Figure 5 shows the process used by the learning module for computing the stacking model, based on the following steps:

Train several instances of each model family using a crossvalidation scheme.

Use every instance of a model family to predict its held-out part of the training data.

Create a matrix of predicted data considering the predictions of step 2 for all the model types.

Compute the weights for each model family by applying quadratic optimization to the matrix of predictions computed at step 3, to best fit the training data.

For each model family, train the model by using the complete training data.
As stated in previous work [9], this definition of stacking greatly reduces the exploration space and makes the weight estimation robust.
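The weight-computation step (steps 3–4 above) can be sketched as follows; a coarse grid search over the two-model simplex stands in for the quadratic-programming solver used by the framework [35], and `stacking_weights` and the toy predictions are invented for the example.

```python
# Sketch of the stacking-weight step: find non-negative weights that sum
# to one and minimize the squared error of the combined out-of-fold
# predictions against the observations.
def stacking_weights(pred_matrix, observed, steps=1000):
    # pred_matrix: per-model out-of-fold predictions, one row per model
    best_w, best_err = None, float("inf")
    for k in range(steps + 1):
        w = (k / steps, 1 - k / steps)          # two-model simplex
        err = sum((w[0] * p0 + w[1] * p1 - y) ** 2
                  for p0, p1, y in zip(*pred_matrix, observed))
        if err < best_err:
            best_w, best_err = w, err
    return best_w

observed = [1.0, 2.0, 3.0, 4.0]
model_a  = [1.1, 2.1, 3.1, 4.1]   # slightly biased but accurate model
model_b  = [0.5, 1.0, 1.5, 2.0]   # much weaker model
weights = stacking_weights((model_a, model_b), observed)
```

As expected, almost all the weight goes to the stronger model, while the simplex constraint keeps the combination a proper weighted mean.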
4.3 Model Training and Validation
This section describes how we partition the information from the Design Space Exploration to train and validate the models. This step is crucial to broadcast a reliable application-knowledge to the application clients.
The typical approach for testing how models fare in prediction is to divide the input data into a training set and a validation set. However, given that the proposed framework learns the application-knowledge during the production phase, we might have a small set of observations for training and validating the models. This is especially true during the first iterations of the learning process. Therefore, the learning module uses three different validation schemes, according to the number of explored software-knob configurations, n, a parameter v that represents the validation set ratio, and the number of cross-validation folds, k.

In the general case, when enough configurations have been explored, we use a k-fold validation scheme: the full set of observations is divided into k parts of equal size. One part is always used as a holdout set and the rest is used to train the model. This implies that the learning module trains k models, each of them with out-of-sample predictions on a different part of the data. We call these models cross-validation models. In this case, the validation set has size n/k.

When there are only a few explored configurations, we use the first part of the data to train the cross-validation models using a leave-one-out cross-validation scheme, and the remainder of the data as the holdout validation set. Moreover, if the number of explored software-knob configurations is even smaller, we apply leave-one-out cross-validation without any holdout validation set. In this case, we do not use ensemble models, due to bias problems in the model selection. On one hand, bagged models use the average of the cross-validation models. On the other hand, the stacking model requires training with all the data used to compute the cross-validation models. Therefore, without any holdout data, ensemble models would implicitly be trained using the full set of available observations.
Using a holdout validation set gives better information about how the models fare on out-of-sample predictions. Moreover, it allows us to use ensemble models, which are based on the cross-validation models. In practice, the third scheme is used only in the special cases when the user selects a very high validation ratio or only a few observations are available. It is important to notice that, in this case, the results of the models on out-of-sample predictions might be highly volatile.
To increase the robustness of the estimation of the model's prediction capabilities, we also use an approach similar to k-fold cross-validation for the holdout validation set. We split the input data into folds of the size of the holdout validation set, and the validation is repeated for each of the folds. In this way, it is possible to test the prediction capabilities on the whole input dataset and not just on one randomly selected part.
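The choice among the three validation regimes can be sketched as follows. The function name, the exact branch conditions, and the holdout rounding are illustrative assumptions, since the precise cut-offs depend on the user-selected parameters; only the overall structure (k-fold with holdout, leave-one-out with holdout, leave-one-out without holdout) follows the text.

```python
# Illustrative sketch of the three validation regimes: n is the number of
# explored configurations, k the number of folds, v the validation ratio.
def validation_scheme(n, k, v):
    holdout = int(round(v * n))        # observations reserved for holdout
    if n - holdout >= k:
        # general case: k-fold cross-validation plus a holdout set
        return ("k-fold", k, holdout)
    if n - holdout >= 2:
        # few configurations: leave-one-out on the rest, keep the holdout
        return ("leave-one-out", n - holdout, holdout)
    # too few observations: leave-one-out without any holdout set
    return ("leave-one-out", n, 0)
```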
4.4 Model Selection
To quantify the prediction quality of a model, we consider two metrics: a variant of the coefficient of determination (R²) [16], and the mean absolute error normalized by the range of the observed values (nMAE). In our case, R² is computed as the square of the correlation between observed and predicted data. This variant [16] has been selected because it is restricted to [0, 1] and it can be used on the cross-validation and out-of-sample predictions to compare the results in a consistent manner. To evaluate these metrics for each base modelling type, we consider the mean of R² and nMAE across the holdout validation models. In the special case when the number of explored software-knob configurations is too small and the holdout set is not used, the mean of R² and nMAE of the cross-validation models is used instead. For model ensembles, we compute the metrics considering the whole set of observations.
Once we have evaluated all the models, we deem as eligible the ones with an R² higher than a given threshold and a normalized mean absolute error below a given threshold, to enforce a minimum quality. Among the eligible models, we select the one that minimizes the normalized mean absolute error. If no model is eligible, the proposed approach restarts the design space exploration, up to a maximum number of iterations. When the maximum number of iterations is reached, the learning module selects the model with the smallest normalized mean absolute error for the out-of-sample predictions. These parameters are exposed to the end-user, with default values provided by the framework.
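The selection policy above can be sketched as follows. The parameter names (`r2_min`, `nmae_max`, `max_iterations`) stand in for the user-exposed thresholds; their default values are not reproduced here.

```python
# Hedged sketch of the model selection policy: enforce a minimum
# quality, prefer the smallest normalized MAE, and re-sample the design
# space when no model is eligible. Threshold names are illustrative.

def select_model(evaluated_models, r2_min, nmae_max):
    """`evaluated_models` maps a model name to its (r2, nmae) scores.
    Returns the eligible model with the smallest normalized MAE, or
    None if no model meets the minimum quality."""
    eligible = {name: scores for name, scores in evaluated_models.items()
                if scores[0] > r2_min and scores[1] < nmae_max}
    if not eligible:
        return None
    return min(eligible, key=lambda name: eligible[name][1])

def learning_loop(explore_and_fit, r2_min, nmae_max, max_iterations):
    """Re-sample the design space until an eligible model is found;
    after max_iterations, fall back to the best model seen so far."""
    models = {}
    for _ in range(max_iterations):
        models.update(explore_and_fit())  # one more DSE iteration
        best = select_model(models, r2_min, nmae_max)
        if best is not None:
            return best
    # No eligible model: pick the smallest normalized MAE anyway.
    return min(models, key=lambda name: models[name][1])
```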
4.5 Feature Clustering
The main goal of this component is to find representative clusters, based on the input features, to be exploited in the application-knowledge. This stage is based on the same considerations made in Petabricks [11], which inspired our implementation. Algorithmic choices and software-knob configurations are often sensitive to input features. However, the feature space can be too large to be considered completely. Clustering techniques are an intuitive solution to reduce the problem complexity.
To cluster the input features observed in the DSE, we apply well-known clustering techniques, such as k-means [21] and DBSCAN [15], according to a parameter exposed to the end-user. K-means is a clustering technique that minimizes the intra-cluster variance, where the user sets a fixed number of clusters. Using a different approach, DBSCAN partitions the samples into clusters according to their proximity, where the user sets a fixed distance threshold. Moreover, the user is able to manually define the clusters. This component is activated after the Design Space Exploration. As shown in Figure 3, it works in parallel with the model learning component. Once the best model has been selected, the clusters generated on the input features are combined with the learned models and used to generate the application-knowledge to be broadcast to the clients.
5 Experimental Results
This section describes the experimental assessment of the proposed framework. First, we evaluate its prediction capabilities, by using synthetic applications with a known relation between the EFPs and the software-knobs. Then, we focus on a geometrical docking application to evaluate the benefits the proposed framework provides to the end-user. For this experiment, we used a platform with eight Intel(R) Xeon(R) X5482 CPUs @ 3.20GHz and 8GB of memory. Concerning the implementation, we used R packages for the Dmax design [13] and for estimating the linear [38], MARS [36], POLYMARS [26], and Kriging [41] models. The plugin uses the tidyverse package [52] for data manipulation and to unify the workflow of the different models.
5.1 Framework Validation with Synthetic Applications
The main goal of this section is to evaluate the proposed framework on out-of-sample predictions, when it is trained with a fraction of the design space. This is the main challenge of the proposed framework, given that each sample used to train the model takes time away from the target application. To validate the proposed framework, we created two synthetic applications with a known relation between the software-knobs and the EFPs, inherited from well-known functions [7, 27].
The first application is derived from the work of Binh [7], and it has been defined in Eq. 5:
(5) 
where the two functions represent the EFPs of interest for the end-user, while the two variables represent the software-knobs of the application. The mathematical formulas describing the two functions represent the relationship between the EFPs and the software-knobs, which the framework models by using a limited training set. The proposed approach models each EFP independently; therefore, this first application aims to demonstrate the ability of the framework to model both linear and nonlinear behaviors.
The second application is derived from the work of Kursawe [27], and it has been defined in Eq. 6:
(6) 
where the two functions represent the EFPs of interest for the end-user, while the three variables are the software-knobs of the application. This application aims at validating the approach by increasing the complexity of the relation between the EFPs and the software-knobs. In particular, the first function contains exponential terms, while the second contains periodic terms. These functions have been chosen to represent a large set of function types, such as linear, nonlinear and exponential.
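For reference, the Kursawe function as commonly defined in the literature [27] can be written as below. The exact variant used by the paper is not reproduced in this text, so treat this as a hedged sketch of the standard form: the first objective combines exponential terms, the second periodic ones, over three decision variables.

```python
# Standard Kursawe test function from the multi-objective optimization
# literature [27]; this may differ in detail from the paper's variant.
from math import exp, sqrt, sin

def kursawe_f1(x):
    """Sum of exponential terms over consecutive pairs of variables."""
    return sum(-10.0 * exp(-0.2 * sqrt(x[i] ** 2 + x[i + 1] ** 2))
               for i in range(len(x) - 1))

def kursawe_f2(x):
    """Sum of a power term and a periodic (sine) term per variable."""
    return sum(abs(xi) ** 0.8 + 5.0 * sin(xi ** 3) for xi in x)
```

With three variables, `kursawe_f1` sums two exponential terms (e.g. it evaluates to -20 at the origin), while the sine of the cubed variable in `kursawe_f2` produces the rapidly oscillating behavior that, as discussed below, is the hardest to learn from a small training set.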
5.1.1 Model Training and Validation
In this section, we focus on evaluating the model training and validation by using a limited training set. Figure 6 shows the evaluation of the trained models for the synthetic applications, varying the number of software-knob configurations evaluated in the DSE. In particular, Figure 6(a) validates the trained models for the nonlinear Binh function, while Figure 6(b) and Figure 6(c) validate the trained models for the two Kursawe functions. We omitted the validation results of the fourth function because its underlying linear equation was learned well by almost all the models included in the proposed framework. In Figure 6, the y-axis represents the model quality, in terms of mean absolute error and correlation coefficient, while the x-axis represents the number of software-knob configurations evaluated in the DSE. For each x-value, we repeated the experiment 20 times and we show the resulting distribution.
From the experimental results, we noticed a trend in the mean absolute error across all the model types and modelled functions: it decreases when we increase the number of explored software-knob configurations. However, this is not true for the correlation coefficient. Let us consider the EFPs of the Kursawe application. We can notice how the POLYMARS models improve as the number of training samples increases, while the Kriging models seem to overfit the training data, performing worse on out-of-sample predictions. Different model types behave differently in out-of-sample predictions, according to the underlying equation of the EFP. Let us focus on the MARS models across the different EFPs: depending on the underlying equation of the target EFP, they are able to model one function well, struggle with another, and fail on a third.
If we consider all the EFPs, we can notice how at least one model type fares well in out-of-sample predictions on the validation set, except for the periodic Kursawe function. If we focus on this EFP, it is possible to notice how the trained models struggle to explain the data, according to the correlation coefficient. This is due to the periodic nature of the underlying equation and to the limited size of the training set. Instead of learning the behaviour of the function, the models tend to settle along an average trend. Therefore, the MAE decreases, but the R² decreases as well.
5.1.2 Model Selection
In this section, we evaluate the capability of the proposed framework to leverage different modelling techniques to learn the application-knowledge at runtime. Figure 7 shows the number of times that a model type is deemed the best one, according to the methodology explained in Section 4.4. For each EFP, we report the model selection according to the number of software-knob configurations explored during the DSE. The models not listed in Figure 7 have never been selected.
From the experimental results, carried out on both synthetic applications, we notice that there is no single model that always dominates the others. This confirms the importance of the proposed approach. The actual model selected strongly depends on the predicted function and on the values of the model selection parameters. If we focus on the models generated with a small training set, the learning module selects from a wider range of model families. On the contrary, with a larger number of samples (i.e. 50), the models more frequently selected are Kriging, Kriging bagged, POLYMARS bagged and stacking. In all cases, the selected models provide good out-of-sample predictions, because the cross-validation and testing errors (as shown in Section 5.1.3) are comparable.
5.1.3 Framework Validation
This section aims at evaluating the quality of the final output of the learning module: the application-knowledge. In this experiment, we consider as input the application-knowledge given by the best model for the given run, and we compare it with the underlying equation of the target EFP. Figure 8 shows the experimental results in terms of prediction error and R², across the whole software-knob domain. Figure 8(a) refers to the Binh synthetic application, while Figure 8(b) refers to the Kursawe synthetic application. In both cases, the y-axis represents the quality of the application-knowledge, while the x-axis represents the number of software-knob configurations used to compute the model.
From the experimental results, we can notice how the quality metrics are consistent with the validation of the model done in the training phase. In particular, we can predict all the software-knob configurations of the design space within a prediction error of 4% and with an R² of 0.94, for all the EFPs of the synthetic applications, except for the periodic EFP of the Kursawe application. These results are coherent with the quality of the model determined in the validation phase and used for the model selection. This behavior is important because we can correctly judge the quality of the application-knowledge before broadcasting it to the application clients. Therefore, the user is capable of managing the trade-off between the duration of the learning phase and the quality of the result, by using the exposed parameters. Moreover, the proposed framework enables us to distribute the DSE to the application clients and to use an iterative refinement procedure to minimize the learning time of the model.
5.2 The Molecular Docking Case Study
In this section, we validate the proposed approach on a real-life case study taken from the HPC world. First, we demonstrate the advantages obtained by adopting the proposed solution (Sections 5.2.1 and 5.2.2). Then, we evaluate some characteristics of the framework in terms of input feature clustering and scalability (Sections 5.2.3 and 5.2.4).
In a drug discovery process, molecular docking is one of the earliest tasks and it is performed in silico. Molecular docking is used to virtually screen a very large library of molecules, named ligands, to find the ones with the strongest interaction with the binding site of a second molecule, named pocket, to forward to the later stages of the drug discovery process [5]. The complexity of this task is due not only to the huge number of ligands to evaluate, but also to the number of degrees of freedom in the evaluation of the ligand-pocket interaction. In particular, it is possible to alter the shape of the molecule, without altering its chemical properties, by rotating a subset of the bonds between the atoms of a ligand, named rotamers.
In this experiment, we focus on a geometric docking kernel, part of the LiGen Dock application [4]. Due to the complexity of evaluating the chemical interaction of a pocket-ligand pair, this kernel considers only geometrical information, and it is used to filter out the ligands unable to fit in the target pocket. The application exposes two software-knobs that generate quality-throughput trade-offs by reducing the number of alternative poses evaluated for each rotamer of the ligand. The end-users typically belong to pharmaceutical companies that rent the resources of an HPC infrastructure to evaluate a chemical library, by running a typical batch job. Therefore, the end-users are interested in the time-to-solution and in the quality of the elaboration, defined as the number of evaluated poses.
The goal of this experiment is to assess the benefits of the proposed framework in a scenario where the throughput is heavily input-dependent. The time spent evaluating a pocket-ligand pair depends on the number of atoms and rotamers of the ligand, and on the geometrical properties of the pocket, which are difficult to represent numerically. This heavy input dependency is perceived as significant noise when measuring the execution time of a given software-knob configuration across several ligands. Given that the pocket remains the same for the entire screening process, the proposed approach aims at learning, during the production phase, the effect of the target pocket on the relation between the software-knobs and the execution time. Every time the pocket changes, we need to learn the application-knowledge again; however, 1) a pocket is seldom re-evaluated, 2) the time spent evaluating the library is several orders of magnitude larger than the learning time, 3) the application-knowledge is tailored to the actual input, and 4) we exploit the parallelism of the production system (composed of several HPC nodes).
Given that the later stages of the drug discovery process involve expensive in-vitro and in-vivo tests, the reproducibility of the experiment is a domain requirement. Therefore, once we have obtained the application-knowledge, we restart the evaluation of the chemical library with the configuration that maximizes the quality, while respecting the time-to-solution constraint, adjusted by the time spent during the learning phase. In the following experiments, we used a library of ligands with varying numbers of atoms and rotamers. Moreover, we use six pockets (1b9v, 1c1b, 1cvu, 1cx2, 1dh3 and 1fm9) from the RCSB Protein Data Bank (PDB) [6].
In the next sections, we validate the proposed approach by using different experiments. First, we show a typical execution trace where the online learning approach has been employed. Second, we validate the approach by simulating a virtual screening campaign. Third, we demonstrate the advantages introduced by the input feature clustering module in reducing the prediction variability. Finally, we use this case study to evaluate the scalability of the proposed approach in terms of time-to-learn.
5.2.1 Execution Trace
Figure 9 shows an execution trace of the docking application running on multiple MPI processes on the pocket 1b9v. Figure 9 represents the application behavior, in terms of the throughput for evaluating the pocket-ligand pairs and the quality of the results, during the initial phase of a large experiment. Each subfigure shows the EFP behavior of a representative subset of the MPI processes. The length of the learning phase is determined by the model convergence time and by the input characteristics. It is possible to notice that the length is almost the same for all the clients, thanks to the configuration distribution performed by the remote application handler. After the learning phase, the goal value is set by computing the average throughput required to process the entire target ligand database within the target time-to-solution (the experiment time budget). After the initial exploration of the design space, the application settles on a software-knob configuration that is the same for all the clients, because all the clients are part of the same virtual screening experiment. By varying the current inputs (i.e. the target pocket and the target ligand database) or the time-to-solution constraint, the autotuning process would lead to a different configuration.
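The throughput goal mentioned above amounts to a simple division, sketched below with hypothetical names and numbers: the ligands still to be processed must fit in the time budget that remains after the learning phase.

```python
# Hedged sketch of the throughput goal computation: average throughput
# (ligands/s) required to process the remaining library within the
# remaining time-to-solution budget. Names and units are illustrative.

def throughput_goal(total_ligands, processed_ligands,
                    time_to_solution, elapsed_time):
    """Throughput required to meet the deadline, adjusted by the time
    already spent (e.g. during the learning phase)."""
    remaining_ligands = total_ligands - processed_ligands
    remaining_time = time_to_solution - elapsed_time
    return remaining_ligands / remaining_time
```

For instance, with 100,000 ligands, 10,000 already processed, a one-hour budget and ten minutes spent learning, each client pool must sustain 30 ligands/s on average.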
5.2.2 Prediction Accuracy
To validate the proposed approach, we ran an experimental campaign with a library of ligands, randomly sampled from the full set, targeting six different pockets. In particular, for each pocket we repeated the experiment ten times, reporting the prediction error distribution (see Figure 10). For the learning phase, we observed each software-knob configuration in the DoE with several different ligands. Figure 10 shows that a large fraction of the time-to-solution errors falls within a narrow range. In this case, the proposed approach accurately estimates the time-to-solution for the current inputs, maximizing the quality of the results given the time budget. In this experiment, we found that Kriging, MARS bagged, and stacked models are the top three models, while the actual selected model varied across the experiments.
5.2.3 Input Feature Clustering
In this subsection, we demonstrate the benefits introduced by the input feature clustering module in reducing the prediction error variability and the training set dependency. In particular, Figure 11 shows the effects of increasing the number of input feature clusters on the prediction accuracy. The clusters have been determined by using the k-means algorithm over the number of atoms and the number of rotamers of the molecule. The prediction accuracy has been measured by considering the variability of the docking time of the molecules represented by the same input feature cluster. This value has been normalized to the one obtained without considering the input features (i.e. one single cluster). As the number of clusters increases, the execution-time swing reduces on average and the distribution of the data becomes tighter. Despite a rapid initial reduction, the swing does not converge to zero, because some input characteristics are not completely captured by the features used for the clustering (e.g. the geometry of the pocket and of the ligand). By increasing the number of clusters, on one side we reduce the execution-time swing, while on the other side we increase the amount of memory needed to store the mARGOt OP list.
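One way to compute the normalized variability metric described above can be sketched as follows (hypothetical helper; the paper does not give the exact formula, so this assumes "variability" means the docking-time range within a cluster).

```python
# Sketch of a normalized execution-time swing metric: the average
# per-cluster range of the docking time, divided by the range observed
# when all molecules form a single cluster. Definition is an assumption.

def normalized_swing(times_by_cluster, all_times):
    """1.0 means clustering brings no reduction; smaller values mean
    tighter per-cluster docking-time distributions."""
    single_cluster_swing = max(all_times) - min(all_times)
    per_cluster = [max(t) - min(t) for t in times_by_cluster if t]
    avg = sum(per_cluster) / len(per_cluster)
    return avg / single_cluster_swing
```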
The reduction of the EFP variability in each cluster is even more important when the data used during the training phase are not distributed as the whole dataset of the experiment. Indeed, considering a single cluster (i.e. no input features) means assuming that the average behaviour learned during the training phase holds for the other data as well. This assumption holds either if the EFPs do not depend on the input data, or if the input data are, on average, the same as those used for the training. Outside of these cases, having more clusters enables a better prediction of the EFP values.
5.2.4 Scalability Analysis
This experiment addresses the scalability of the proposed approach in terms of time-to-learn. In particular, we show how the time required to generate the application-knowledge decreases with the number of clients that contribute to the exploration phase. Figure 12 shows the distribution of the time required by an application client to receive the application-knowledge. For each number of MPI processes, we repeated the experiment five times. From the results, the overhead is almost inversely proportional to the number of clients: indeed, the time-to-knowledge almost halves when doubling the number of MPI processes. The time required by the framework for the training and validation of all the models is short enough, considering the number of models we are training and the target use case. Moreover, the time needed by the training, validation and selection phases is constant, regardless of the number of clients.
6 Conclusions
This paper proposed an online autotuning framework to learn, in a distributed way, the relation between software-knobs, extra-functional properties and input features during the production phase. To minimize the learning time, the framework is based on two strategies. On one side, it uses ensemble models to boost the prediction capabilities of the base models. On the other side, it uses an iterative approach for sampling the design space, until the computed models reach the target quality.
Experimental results on synthetic applications and on a real-world case study demonstrate that there is no free lunch: no single model fits all the cases. This result confirms the main goal of the proposed approach: not to compare different modelling techniques, but to provide a framework that exploits them to learn the application-knowledge at runtime.
References
 [1] J. Ansel et al. Opentuner: An extensible framework for program autotuning. In PACT. IEEE, 2014.
 [2] W. Baek and T. M. Chilimbi. Green: a framework for supporting energyconscious programming using controlled approximation. In ACM Sigplan Notices, volume 45, pages 198–209. ACM, 2010.
 [3] P. Balaprakash, J. Dongarra, T. Gamblin, M. Hall, J. K. Hollingsworth, B. Norris, and R. Vuduc. Autotuning in highperformance computing applications. Proceedings of the IEEE, 106(11):2068–2083, Nov 2018.
 [4] C. Beato et al. Use of experimental design to optimize docking performance: The case of ligendock, the docking module of ligen, a new de novo design program, 2013.
 [5] A. R. Beccari, C. Cavazzoni, C. Beato, and G. Costantino. Ligen: a high performance workflow for chemistry driven de novo design, 2013.
 [6] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne. The protein data bank. Nucleic Acids Res, 28:235–242, 2000.

 [7] T. T. Binh. A Multiobjective Evolutionary Algorithm - The Study Cases. Institute for Automation and Communication, 1999.
 [8] L. Breiman. Bagging predictors. Machine Learning, 1996.
 [9] L. Breiman. Stacked regressions. Machine Learning, 24, 1996.
 [10] M. Christen, O. Schenk, and H. Burkhart. Patus: A code generation and autotuning framework for parallel iterative stencil computations on modern microarchitectures. In Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 676–687. IEEE, 2011.
 [11] Y. Ding et al. Autotuning algorithmic choice for input sensitivity. In ACM SIGPLAN Notices, volume 50. ACM, 2015.
 [12] J. Dorn, J. Lacomis, W. Weimer, and S. Forrest. Automatically exploring tradeoffs between software output fidelity and energy costs. IEEE Transactions on Software Engineering, 2017.
 [13] D. Dupuy et al. DiceDesign and DiceEval: Two R Packages for Design and Analysis of Computer Experiments. Journal of Statistical Software, 2015.
 [14] H. Esmaeilzadeh et al. Dark silicon and the end of multicore scaling. In ISCA. IEEE, 2011.
 [15] M. Ester, H.P. Kriegel, J. Sander, X. Xu, et al. A densitybased algorithm for discovering clusters in large spatial databases with noise. In Kdd, volume 96, pages 226–231, 1996.
 [16] B. Everitt et al. The Cambridge dictionary of statistics. 2010.
 [17] J. H. Friedman. Multivariate Adaptive Regression Splines. The Annals of Statistics, 1991.
 [18] M. Frigo and S. G. Johnson. The design and implementation of fftw3. Proceedings of the IEEE, 93(2):216–231, 2005.
 [19] D. Gadioli, E. Vitali, G. Palermo, and C. Silvano. margot: a dynamic autotuning framework for selfaware approximate computing. IEEE Transactions on Computers, 2018.

 [20] H. Guo. A bayesian approach for automatic algorithm selection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-03), Workshop on AI and Autonomic Computing, Acapulco, Mexico, pages 1–5, 2003.
 [21] J. A. Hartigan and M. A. Wong. Algorithm AS 136: A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108, 1979.
 [22] H. Hoffmann et al. Using code perforation to improve performance, reduce energy consumption, and respond to failures. 2009.
 [23] H. Hoffmann et al. Dynamic knobs for responsive poweraware computing. In ACM SIGPLAN Notices. ACM, 2011.
 [24] S. A. Kamil. Productive high performance parallel programming with autotuned domainspecific embedded languages. University of California, Berkeley, 2012.
 [25] J. Kephart et al. The vision of autonomic computing. Computer 2003, 36, 2003.
 [26] C. Kooperberg. polspline: Polynomial Spline Routines, 2018. R package version 1.1.13.
 [27] F. Kursawe. A variant of evolution strategies for vector optimization. In Lecture Notes in Computer Science, 1991.
 [28] M. A. Laurenzano et al. Input responsiveness: using canary inputs to dynamically steer approximation. ACM SIGPLAN Notices, 2016.
 [29] S. MahdaviHezavehi, V. H. Durelli, D. Weyns, and P. Avgeriou. A systematic literature review on methods that handle multiple quality attributes in architecturebased selfadaptive systems. Information and Software Technology, 90:1–26, 2017.
 [30] J. S. Miguel et al. The anytime automaton. In ACM SIGARCH Computer Architecture News. IEEE Press, 2016.
 [31] S. Misailovic, D. Kim, and M. Rinard. Parallelizing sequential programs with statistical accuracy tests. ACM Transactions on Embedded Computing Systems (TECS), 12(2s):88, 2013.
 [32] S. Mittal. A survey of techniques for approximate computing. ACM Computing Surveys (CSUR), 48(4):62, 2016.
 [33] D. C. Montgomery. Design and analysis of experiments. John wiley & sons, 2017.
 [34] C. Nugteren and V. Codreanu. Cltune: A generic autotuner for opencl kernels. In Embedded Multicore/Manycore SystemsonChip (MCSoC), 2015 IEEE 9th International Symposium on, pages 195–202. IEEE, 2015.
 [35] B. A. Turlach (S original); R port by A. Weingessel. quadprog: Functions to Solve Quadratic Programming Problems, 2013. R package version 1.5-5.
 [36] T. Hastie and R. Tibshirani (S original); R port by F. Leisch, K. Hornik, and B. D. Ripley. mda: Mixture and Flexible Discriminant Analysis, 2017. R package version 0.4-10.
 [37] M. Püschel, J. M. Moura, B. Singer, J. Xiong, J. Johnson, D. Padua, M. Veloso, and R. W. Johnson. Spiral: A generator for platformadapted libraries of signal processing alogorithms. The International Journal of High Performance Computing Applications, 18(1):21–45, 2004.
 [38] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2018.
 [39] A. Rasch et al. Atf: A generic autotuning framework. In High Performance Computing and Communications. IEEE, 2017.
 [40] M. Rinard. Probabilistic accuracy bounds for faulttolerant computations that discard tasks. In Proceedings of the 20th annual international conference on Supercomputing, pages 324–334. ACM, 2006.
 [41] O. Roustant et al. DiceKriging, DiceOptim: Two R Packages for the Analysis of Computer Experiments by KrigingBased Metamodeling and Optimization. Journal of Statistical Software, 2012.
 [42] J. Sacks, W. J. Welch, T. J. Mitchell, and H. P. Wynn. Design and analysis of computer experiments. Statistical Science, 4(4):409–423, 1989.
 [43] M. Samadi, D. A. Jamshidi, J. Lee, and S. Mahlke. Paraprox: Patternbased approximation for data parallel applications. ACM SIGPLAN Notices, 49(4):35–50, 2014.
 [44] M. Samadi et al. Sage: Selftuning approximation for graphics engines. In MICRO. IEEE, 2013.
 [45] C. A. Schaefer, V. Pankratius, and W. F. Tichy. Atuneil: An instrumentation language for autotuning parallel applications. In European Conference on Parallel Processing, pages 9–20. Springer, 2009.
 [46] J. Shen, A. L. Varbanescu, H. Sips, M. Arntzen, and D. G. Simons. Glinda: A framework for accelerating imbalanced applications on heterogeneous platforms. In Proceedings of the ACM International Conference on Computing Frontiers, CF ’13, pages 14:1–14:10, New York, NY, USA, 2013. ACM.

 [47] C. J. Stone et al. Polynomial splines and their tensor products in extended linear modeling. Annals of Statistics, 1997.
 [48] X. Sui et al. Proactive control of approximate programs. ACM SIGOPS Operating Systems Review, 2016.
 [49] R. Vuduc, J. W. Demmel, and K. A. Yelick. Oski: A library of automatically tuned sparse matrix kernels. In Journal of Physics: Conference Series, volume 16, page 521. IOP Publishing, 2005.
 [50] R. Webster et al. Optimal interpolation and isarithmic mapping of soil properties III changing drift and universal kriging. Journal of Soil Science, 1980.
 [51] R. C. Whaley and J. J. Dongarra. Automatically tuned linear algebra software. In Proceedings of the 1998 ACM/IEEE conference on Supercomputing, pages 1–27. IEEE Computer Society, 1998.
 [52] H. Wickham. tidyverse: Easily Install and Load the ’Tidyverse’, 2017. R package version 1.2.1.