I Introduction
Big data analytics frameworks (BDAFs), such as Hadoop MapReduce [1], Spark [2], and Dryad [3], have been increasingly utilized for a wide range of data processing applications. These applications have vastly diverse characteristics. To support such diversity, BDAFs provide a large number of parameters for users to configure. For example, both Hadoop and Spark have 100+ parameters that an application can configure [1, 4]. The configuration of these parameters significantly affects the application performance on BDAFs.
However, configuring such a large number of parameters is overwhelming to users [5]. As a result, users often accept the default settings [6]. Alternatively, manual tuning is widely adopted, but it requires in-depth knowledge of both BDAFs and applications, and it is labor-intensive, time-consuming, and often suboptimal. Therefore, there is a strong need for automatically tuning application configurations on BDAFs.
Because configuration parameters have high dimensionality, naive exhaustive search is not feasible. Existing methods include model-based, simulation-based, search-based, and learning-based tuning, as discussed in Section II. Among them, learning-based tuning has received much recent attention. In general, it constructs a performance prediction model using training samples of different configurations, and then explores better configurations using a search algorithm. Although previous studies on learning-based tuning show promising results, one critical challenge is generating enough samples in a high-dimensional parameter space, because doing so is very time-consuming. In practice, we often have time constraints on how long we can tune configurations.
To address this challenge, we present AutoTune, an automatic configuration tuning system that aims to tune application-specific BDAF configurations within a given time constraint. AutoTune consists of two key components: the AutoTune testbed and the AutoTune algorithm. First, we construct a smaller-scale testbed on which we run most of the experiments under different configurations. The motivation is to obtain more samples in the high-dimensional parameter space so that we can train a better prediction model. The challenge is to construct a testbed that runs faster but still captures the performance variations of different configurations on the production system. Second, the AutoTune algorithm searches for better configurations using both the testbed and the production system under the time constraint. The key is to generate a set of samples that provides wide coverage of the high-dimensional parameter space, and to search for more promising configurations using the trained model. The algorithm has to balance the effort spent on initialization, on exploration and exploitation, and on selecting the best configuration on the production system.
In summary, our work makes the following contributions:

We propose a novel approach that derives a testbed to facilitate the exploration on the production system. It allows us to generate more training samples, and thus to produce a better prediction model, under a given time constraint.

We develop the AutoTune algorithm. It uses Latin hypercube sampling (LHS) to generate effective samples in the high-dimensional parameter space, and multiple bound-and-search to select promising configurations in the bounded space suggested by the existing best configurations.

We evaluate the performance of AutoTune through extensive experiments using a well-known big data benchmark in a public cloud. We show that AutoTune outperforms default configurations by 63.70% on average, and the five state-of-the-art tuning algorithms by 6%–23%.
II Related Work
Parameter tuning for BDAFs has received much attention from both industry and academia. Such work can be classified into four categories: model-based, simulation-based, search-based, and learning-based tuning.
In model-based tuning, analytical models are constructed a priori from domain knowledge to predict and optimize the performance of BDAFs. Such approaches include the Starfish project [7], MRCOF [8], and MRTuner [9]. Model-based tuning relies on analysis for performance optimization, and thus can be done without experiments. However, analytical models may fail to capture the highly complex runtime characteristics, especially as BDAFs evolve rapidly with new architectures and technologies.
Simulation-based tuning constructs combined simulation models that capture both the internal behavioral metrics of the BDAFs and the externally observed input-output relationships, such as in [10, 11, 12, 13]. Such approaches need to determine all the factors that could affect performance, and to probe system internals many times to collect the raw data needed in the performance model.
Search-based tuning treats the parameter tuning problem as a black-box optimization problem and leverages a variety of search algorithms to explore good solutions, such as in BestConfig [5], Gunther [14], MRONLINE [15], and SPSA [16]. Search-based tuning is easier to run and can be applied to general scenarios because it does not require any system-specific knowledge. However, it requires extensive experimentation on production systems, and is thus time-consuming and sometimes impractical.
Most relevant to our work is learning-based tuning. Typically, one first constructs performance prediction models using training samples under different configurations, and then applies a search algorithm to find better configurations based on these models. For instance, RFHOC uses random forests for performance prediction and a genetic algorithm to search the Hadoop configuration space [17]; ALOJA-ML [18] identifies key performance properties of the workloads through several machine learning techniques, and predicts performance properties for an unseen workload; [19] employed support vector machines (SVM) to predict the performance of Hadoop applications; [20] proposed a support vector regression (SVR) model; [21] presented polynomial multivariate linear regression for MapReduce; [22] used random forests and a genetic algorithm; [23] used a modified k-nearest neighbor algorithm to find desirable configurations based on similar past jobs that have performed well; [10] compared twenty machine learning algorithms and identified four with high accuracy; [24] built SVM-based performance models for Spark using randomly modified and combined configurations; [25] proposed a reinforcement learning approach; [26] combined k-means++ clustering and simulated annealing algorithms; [27] employed an ensemble method that combines random sampling with a hill climbing (RHC) method; and [28] considered multi-classification models. These approaches require a large number of samples to construct a useful model. Generating such samples requires a significant amount of time on production systems [17], which is expensive or impractical. We are thus motivated to address this key limitation.
Last, there are best practices based on industrial experience. For example, the official site on Spark tuning [4] indicates that the data serialization and memory tuning are two main factors in tuning a Spark application. Other attempts include instrumentation [29], tuning recommendations [30], and statistical analysis [31].
III Problem Statement
In this work, we study the parameter tuning problem for big data analytics frameworks (PTBDAF). As illustrated in Figure 1, a big data analytics framework (e.g. Hadoop, Spark, etc.) is deployed on a collection of interconnected virtual machines (VMs) provided by a cloud provider. It serves data analysis applications comprised of programs and input datasets. In this process, the user who submits the application also needs to specify the configurations for the BDAF. Such configurations have a significant impact on application performance [32, 8, 5, 11, 17].
The goal of PTBDAF is thus to find an optimal configuration that minimizes execution time, given a specific BDAF, an application, and the underlying runtime environment. Specifically, PTBDAF has the following components.
Application: An application represents a big data analytics task running on a specific BDAF. We model it as a 2-tuple $A = \langle P, D \rangle$, where $P$ is the program that expresses a set of computations and $D$ represents the input data to be processed by program $P$.
Runtime Environment: The runtime environment is the execution environment provided to a BDAF by the cloud infrastructure. We model it as a 5-tuple $E = \langle E_{core}, E_{freq}, E_{mem}, E_{disk}, E_{net} \rangle$, where $E_{core}$ denotes the number of CPU cores; $E_{freq}$ is the CPU frequency; $E_{mem}$ represents the physical memory size; $E_{disk}$ indicates the available disk space; and $E_{net}$ reflects the networking setup.
Configuration and Execution Time: Let $C = \{c_1, c_2, \dots, c_n\}$ be the configuration of a BDAF, where each $c_i$ is the value of one parameter. For example, in a Spark framework with 180+ parameters, an executor process can configure the number of cores and the memory size it uses, the maximum degree of parallelism, etc., as shown in Table I. Given a configuration $C$ for an application $A$ and its environment $E$, the execution time is denoted as $T(C, A, E)$. In this paper, we treat $T$ as a black-box function to be learned.
Time Constraint: In practice, the time for configuration tuning is often restricted. We define this restricted tuning time as the time constraint, denoted as $TC$. Any solution to the PTBDAF problem must terminate when $TC$ is met.
In short, the PTBDAF problem can be stated as follows:
$$\min_{C} \; T(C, A, E), \quad c_i \in CB \;\; \text{for all } c_i \in C \qquad (1)$$
$$\text{s.t.} \quad T_{tune} \le TC \qquad (2)$$
where (1) states that the goal of the PTBDAF problem is to find a configuration $C$ that minimizes execution time for a given application $A$ and its environment $E$. In $C$, the value of each component parameter $c_i$ must be within CB, the configuration bound predefined by the BDAF. The constraint (2) is that any solution to the problem must terminate once the elapsed tuning time $T_{tune}$ reaches $TC$.
This definition shows that the goal of PTBDAF is to search for an optimal configuration of a set of parameters to minimize the execution time. According to a previous study [33], PTBDAF is essentially an instance of the classic combinatorial optimization (CO) problems [34], which are known to be NP-complete. The NP-completeness proof by restriction is established in [35]. Based on the above complexity analysis, we conclude that PTBDAF is NP-complete and non-approximable, which rules out the existence of any polynomial-time optimal or approximate solution unless P = NP. Any complete method that guarantees finding an optimal solution might need exponential computation time in the worst case. This often leads to computation times too high for practical purposes [33]. Therefore, we focus on the design of a heuristic approach to this optimization problem.
IV AutoTune for PTBDAF
In this section, we present AutoTune, a learning-based automatic parameter tuning system for BDAFs. Figure 2 sketches the automatic parameter tuning process of AutoTune.
AutoTune consists of two important components: the AutoTune testbed and the AutoTune algorithm. The motivation for constructing a smaller-scale testbed instead of tuning directly on the production system is to obtain more samples to learn a better prediction model, and to conduct more iterations of exploration and exploitation to find a better configuration. The challenge is to construct a testbed that runs faster but still captures the performance variations of different configurations on the production system.
The AutoTune algorithm searches for better configurations by integrating the testbed and the production system under the time constraint. The key of the AutoTune algorithm is to generate a set of samples that provides wide coverage of the high-dimensional parameter space, and to search for more promising configurations using the trained model. It needs to balance the effort spent on initialization, on exploration and exploitation, and on selecting the best configuration on the production system.
IV-A AutoTune Testbed
As discussed earlier, the most critical challenge in learning-based tuning is that obtaining training samples is time-consuming. To address this challenge, we propose a testbed-based approach. The goal of constructing a testbed is to evaluate the performance of different configurations on a BDAF accurately enough but at a faster speed. The key is to reduce the size of the dataset processed by the application and adjust the resource allocation properly so that the relative performance of different configurations on the testbed is as close as possible to that on the production system.
At a high level, we consider a scenario where a user provides as input an analytics application (written using any existing data analytics framework) and a pointer to the input data for the application. Assuming that the machine types are fixed, we first need to build a model that predicts the execution time of the given application for any input size and number of machines. With the predictive model, the user can choose an appropriate testbed with a certain reduction factor relative to the production system. Note that, for generality, we do not assume the presence of any historical logs about the application from which to infer the model.
The main steps in building such a predictive model are (a) capturing the computation properties of an application; (b) expressing the internal communication patterns in an application; and (c) determining how many data points we need to collect. We discuss all three aspects below.
IV-A1 Computation properties
As described in [36], data analytics applications differ from other applications like SQL queries or stream processing in a number of ways. These applications are typically numerically intensive and thus are sensitive to the number of cores and memory bandwidth available. Further, these applications can also be long-running: for example, to obtain state-of-the-art accuracy on tasks like image recognition and speech recognition, jobs are run for many hours or days.
Take the well-known KMeans application, proposed in [37], as an example of a data analytics application. The application divides points into clusters so that the within-cluster sum of squares is minimized, and its execution DAG is shown in Figure 3. From the figure we can see that this KMeans application contains three main stages. The first stage of the DAG reads input data, transforms the data into a collection of vectors, and normalizes each vector. The second stage finds the nearest cluster centers by calculating the distance of each pair of vectors, and marks every vector with the nearest center. In the last stage, the vectors are grouped and represented by the centers, and the map-structured clustering results are returned to the user. The cluster centers are refined in every iteration, and these steps are repeated for many iterations to achieve acceptable accuracy.
As observed in Figure 3, assuming equally dense input data and equal-sized data partitions, each task in the application takes a similar amount of time to compute. Specifically, the computation required per data partition remains the same as we scale the input data, and if we add more machines to the cluster, the computation time decreases in a linear or quasi-linear manner [36].
IV-A2 Communication patterns
By investigating many state-of-the-art BDAFs, we observe that only a few communication patterns repeatedly occur in data analytics applications. These patterns (Figure 4) include (a) the collect (all-to-one) pattern, where data from all the partitions is sent to one machine; (b) the shuffle (many-to-many) pattern, where data goes from many source machines to many destinations; and (c) the tree-aggregation pattern, where data is aggregated using a tree-like structure. These patterns are not specific to analytics applications and have been widely studied in many different distributed computing frameworks [38, 39]. Having only a handful of such patterns means that we can try to automatically infer how the communication costs change as we increase the scale of computation. For example, assuming that data grows as we add more machines (i.e., the data per machine is constant), the time taken for the collect pattern increases as $O(nm)$ because a single machine needs to receive all the data; similarly, the time taken for a binary aggregation tree grows as $O(\log(nm))$, where $nm$ represents the number of machines.
IV-A3 Predictive model
To build our model, we add terms related to the computation and communication patterns discussed earlier. Specifically, based on [36], we propose the following model:
$$t(ds, nm) = \left(\theta_0 + \theta_1 \cdot \frac{ds}{nm}\right) + \left(\theta_2 \cdot \log(nm) + \theta_3 \cdot nm\right) \qquad (3)$$
where $ds$ represents the scale of the input data and $nm$ denotes the number of machines. As shown in Eq. (3), the terms we add to our linear model are:

The first term of Eq. (3) consists of a fixed cost $\theta_0$ that represents the amount of time spent in serial computation, and the interaction between the data size and the inverse of the number of machines, $\theta_1 \cdot ds / nm$. This term captures the parallel computation time for tasks, i.e., if we double the number of machines with the same size of input data, the computation time decreases proportionally.

The second term contains a $\log(nm)$ term to model communication patterns like aggregation trees, and a linear term in $nm$ that captures the all-to-one communication pattern and fixed overheads like scheduling and serializing tasks (i.e., overheads that scale as we add more machines to the system).
Note that as we use a linear combination of non-linear features in Eq. (3), we can model non-linear behavior as well.
The objective of the training is to learn the values of $\theta_0$, $\theta_1$, $\theta_2$, and $\theta_3$. We can use a non-negative least squares (NNLS) [40] solver to find the model that best fits the training data. NNLS fits our use case very well because it ensures that each term contributes a non-negative amount to the overall time taken. This avoids overfitting and also avoids corner cases where, say, the running time could become negative as we increase the number of machines.
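The fitting step can be illustrated with a minimal sketch. It assumes the four features described above (a constant, data scale divided by number of machines, the logarithm of the number of machines, and the number of machines), and it swaps in a simple projected coordinate-descent loop as a stand-in for a dedicated NNLS solver such as the one in [40]. The function names, the synthetic measurements, and the coefficient values are hypothetical and chosen only for illustration.

```python
import math

def features(ds, nm):
    """Feature vector for one run: constant, ds/nm, log(nm), nm."""
    return [1.0, ds / nm, math.log(nm), float(nm)]

def fit_nonnegative(rows, times, sweeps=2000):
    """Least squares with coefficients clamped at >= 0.

    Stand-in for an NNLS solver: each coordinate is minimized exactly
    and then projected onto the non-negative half-line.
    """
    n_feat = len(rows[0])
    theta = [0.0] * n_feat
    residual = list(times)  # residual = times - rows * theta
    for _ in range(sweeps):
        for j in range(n_feat):
            col = [r[j] for r in rows]
            norm = sum(c * c for c in col)
            if norm == 0.0:
                continue
            step = sum(c * e for c, e in zip(col, residual)) / norm
            new_value = max(0.0, theta[j] + step)
            delta = new_value - theta[j]
            residual = [e - delta * c for e, c in zip(residual, col)]
            theta[j] = new_value
    return theta

def predict(theta, ds, nm):
    """Predicted execution time for data scale ds on nm machines."""
    return sum(t * f for t, f in zip(theta, features(ds, nm)))

# Hypothetical measurements generated from made-up coefficients.
true_theta = [2.0, 10.0, 0.5, 0.1]
runs = [(ds, nm) for ds in (1, 2, 4, 8) for nm in (1, 2, 4, 8)]
rows = [features(ds, nm) for ds, nm in runs]
times = [sum(t * f for t, f in zip(true_theta, row)) for row in rows]
theta = fit_nonnegative(rows, times)
```

Because every coefficient is clamped at zero, `predict` cannot return a negative time for any positive data scale and machine count, which is the property that motivates a non-negative fit here.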
IV-A4 Projective sampling
The next step is to collect training samples for building our predictive model. Specifically, we use the input data provided by the user, run the complete job on small samples of the data, and collect the time taken for the application to execute. The ultimate goal of this step is to minimize the time spent on collecting training data while achieving good enough accuracy.
To reduce the time taken for training without sacrificing prediction accuracy, we outline a scheme based on projective sampling [41], a state-of-the-art technique that fits a function to a partial learning curve obtained from a small subset of potentially available data and then uses it to analytically estimate the optimal training set size. More specifically, based on the common assumption that the error rate is a non-increasing function of the sample size [42], we use projective sampling to predict the number of samples required to build the predictive model. The initial samples are selected by randomly adding a constant number of samples (combinations of data scale and number of machines) to the training set from the training pool. In each iteration, the model is built, its accuracy is evaluated using the testing data, and a sample point for the learning curve is thus generated. Using the information from the already generated sample points, we follow the approach proposed by Last [41] in selecting the known projective learning function (among the Logarithmic, Weiss and Tian, Power Law, and Exponential functions) that exhibits the highest correlation with these points. Once we have determined the best-fit function, we can calculate the optimal sample size that ensures the best trade-off between sampling cost and prediction accuracy.
IV-A5 Testbed construction
With the predictive model and the sampling strategy, we can now discuss the testbed construction algorithm in Algorithm 1.
The AutoTune testbed algorithm starts with an empty training set and adds a constant number of samples to the training set in each iteration (line 4). Once the samples are selected, the corresponding performance values are evaluated (line 7). The samples and the associated performance values are then used to build a predictive model for the testbed (line 8).
Each iteration adds a sample point for the calculation of the learning curve equation, where the cumulative training set size is the independent variable and the error rate of the model induced from those examples is the dependent variable. Given the newly learned model, we use Pearson’s correlation coefficient [43] to estimate the correlation for each candidate learning curve function (lines 10–12). Since in a “well-behaved” learning curve the error rate should be a non-increasing function of the training set size, the function with the minimal (closest to $-1$) negative correlation coefficient should be the best fit for the data. However, the actual data points may be noisy, occasionally resulting in positive correlation coefficients over a limited number of data points. In that case, the algorithm keeps purchasing additional samples until the lowest correlation coefficient becomes negative or the maximum number of available training examples is exceeded (line 13).
Once we have determined the best-fit function that can approximate the learning curve accurately, we can calculate the coefficients of the projected function using the least-squares method [44] and estimate the optimal training set size (line 14). If this size is greater than the size of the current training set and the TC permits more tests, the algorithm purchases the missing number of examples; otherwise the algorithm stops and updates the predictive model (lines 15–20).
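The selection of the best-fit learning curve can be sketched as follows. This is an illustrative sketch rather than Algorithm 1 itself: it assumes the four candidate functions take their commonly used forms (noted in the comments), each of which becomes linear after a simple transformation, and it picks the candidate whose transformed points have the most negative Pearson correlation. The error values are synthetic.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Assumed candidate forms; each is linear in the transformed (x, y) pair.
CANDIDATES = {
    "logarithmic": (lambda n: math.log(n), lambda e: e),          # err = a + b*log(n)
    "weiss_tian": (lambda n: n / (n + 1.0), lambda e: e),         # err = a + b*n/(n+1)
    "power_law": (lambda n: math.log(n), lambda e: math.log(e)),  # err = a * n**b
    "exponential": (lambda n: float(n), lambda e: math.log(e)),   # err = a * b**n
}

def best_fit(sizes, errors):
    """Return the candidate with the most negative correlation, plus all scores."""
    scores = {}
    for name, (fx, fy) in CANDIDATES.items():
        scores[name] = pearson([fx(n) for n in sizes], [fy(e) for e in errors])
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic learning curve that follows a power law exactly.
sizes = [5, 10, 15, 20, 25, 30]
errors = [0.5 * n ** -0.5 for n in sizes]
best, scores = best_fit(sizes, errors)
```

With noisy measurements, all scores may be positive for the first few points; in that case the procedure described above keeps adding samples before committing to a function.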
With the final predictive model, the algorithm returns all possible testbed settings, each of which meets the expected execution time for the desired scale factor and satisfies the resource constraint RC specified by the user (lines 21–22).
Note that the fixed sampling increment in Algorithm 1 is determined in advance based on domain-specific constraints such as the minimum number of batches in each experiment. We set the increment to 5 in our experiments, and have seen from experiments that even a value of 1 can give good results [45].
IV-B AutoTune Algorithm
We now discuss the AutoTune algorithm in Algorithm 2.
1. Latin Hypercube Sampling. A key component of the AutoTune algorithm is its sampling strategy. Because the configuration parameter space is high-dimensional, it is challenging to cover it well, especially with a small number of samples. In such a scenario, Latin hypercube sampling (LHS) performs better than random or grid sampling, because it allows each of the key parameters to be represented in a fully stratified manner, no matter which parameters are important [46]. Specifically, LHS divides the range of each parameter into $k$ intervals and takes only one sample from each interval with equal probability [46]. The general LHS algorithm for generating $k$ random vectors (or configurations) of dimension $d$ can be summarized as follows:
Generate $d$ random permutations of length $k$, denoted by $P_1$, $P_2$, …, $P_d$, where $P_i = (P_i^1, \dots, P_i^k)$.

For the $i$th dimension ($i = 1, \dots, d$), divide the parameter range into $k$ non-overlapping intervals of equal probability.

The $j$th sampled point is a $d$-dimensional vector, with the value for dimension $i$ uniformly drawn from the $P_i^j$th interval of that dimension.
Figure 5 illustrates an example of LHS with five intervals in a 2D space, showing the five LHS samples. Note that a set of LHS samples with $k$ vectors has exactly one point in every interval on each dimension. That is, LHS attempts to cover the experimental space as evenly as possible. Compared to pure random sampling, LHS provides better coverage of the parameter space and allows a significant reduction in the sample size needed to achieve a given level of confidence without compromising the overall quality of the analysis [47].
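The three LHS steps can be sketched in a few lines. This is a minimal sketch for continuous parameters with uniform ranges; the function name `latin_hypercube`, the example bounds, and the fixed seed are illustrative assumptions, and discrete or categorical Spark parameters would need an extra mapping step that is not shown.

```python
import random

def latin_hypercube(bounds, k, rng=None):
    """Draw k samples from the box bounds = [(low, high), ...] using LHS."""
    rng = rng or random.Random()
    d = len(bounds)
    # Step 1: one random permutation of the k interval indices per dimension.
    perms = [rng.sample(range(k), k) for _ in range(d)]
    samples = []
    for j in range(k):
        point = []
        for i, (low, high) in enumerate(bounds):
            # Step 2: split the range of dimension i into k equal intervals.
            width = (high - low) / k
            # Step 3: draw uniformly inside the interval chosen for sample j.
            interval = perms[i][j]
            point.append(low + (interval + rng.random()) * width)
        samples.append(point)
    return samples

# Five samples in a 2D space, mirroring the five-interval example above.
bounds = [(0.0, 1.0), (1.0, 9.0)]
samples = latin_hypercube(bounds, k=5, rng=random.Random(0))
```

Each of the five intervals on each axis receives exactly one sample, which is the stratification property that distinguishes LHS from independent uniform draws.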
AutoTune uses LHS for sampling (lines 1 and 5). Lines 1–3 generate the initial training set, whose size is a hyperparameter discussed later.
2. Training a Prediction Model. Based on these samples, we tried different machine learning algorithms to train a prediction model. Random forests achieve good performance and are thus adopted (line 8).
3. Exploration and Exploitation. To explore the parameter space, we apply LHS again and choose configurations from the resulting samples at random (line 5). After that, we run these configurations on the testbed and use the results to generate the exploration set EP (line 6).
To exploit the previously found best configurations, we design a two-step multiple bound-and-search (MBS) algorithm to find potentially better configurations near already-known good configurations (lines 9–15). This strategy works well in practice because there is a high possibility that one can find other configurations with similar or better performance around the configuration with the best performance in the sample set [5].
Bound-and-sample. For each configuration in the current set of best configurations, MBS generates another set of samples in the bounded space around it. The bounded space is generated as follows. For each parameter of the configuration, MBS finds the largest value (lower bound) that is represented in the sample set and is smaller than that of the configuration. It also finds the smallest value (upper bound) that is represented in the sample set and is larger than that of the configuration. The same bounding mechanism is carried out for every parameter of the configuration. Figure 5 illustrates this bounding mechanism of MBS in a 2D space. After determining the bounds, we use LHS again to divide each bounded range into intervals and generate samples close to the configuration (line 11).
Search. Given these configurations, we use the trained prediction model to choose the best one (line 12). We then run the testbed with the chosen configuration and collect the corresponding execution time (line 13). Last, the resulting sample (the configuration and its execution time) is added to the exploitation set EI (line 14).
We repeat these two steps until every configuration in the best set is tested, and update the best set with the best configurations from EP ∪ EI (line 16). We refine the prediction model by adding new samples from the exploration and exploitation phases to the training set (lines 7 and 17).
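The bounding and search steps can be sketched as below for numeric parameters. The function names are hypothetical, and falling back to the edge of the parameter range when no neighbouring sample exists on one side is an assumption of this sketch, not something stated above. The `predict` argument stands for any trained model that maps a configuration to a predicted execution time.

```python
def bound_around(best, samples, ranges):
    """Per-parameter (lower, upper) bounds around `best` taken from neighbouring samples."""
    bounds = []
    for i, value in enumerate(best):
        seen = [s[i] for s in samples]
        smaller = [v for v in seen if v < value]
        larger = [v for v in seen if v > value]
        low, high = ranges[i]
        # Assumption: use the range edge when no sample lies on that side.
        lower = max(smaller) if smaller else low
        upper = min(larger) if larger else high
        bounds.append((lower, upper))
    return bounds

def search(candidates, predict):
    """Pick the candidate with the smallest predicted execution time."""
    return min(candidates, key=predict)

# Toy example with two parameters; values are illustrative only.
samples = [[1, 0.2], [2, 0.6], [4, 0.4], [3, 0.8]]
best = [2, 0.6]
bounds = bound_around(best, samples, ranges=[(1, 4), (0.0, 1.0)])

# A made-up predictor standing in for the trained random forest.
chosen = search([[1.5, 0.5], [2.5, 0.7]], predict=lambda c: abs(c[0] - 2.4) + c[1])
```

In the full procedure, new candidates would be drawn by LHS inside `bounds` before `search` is applied, and the chosen configuration would then be measured on the testbed.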
4. Selecting the best configuration. Once the time budget for exploration and exploitation is used up, we stop searching and run the configurations in the best set on the production system (line 19). Finally, we return the best one among these configurations.
To satisfy the overall time constraint, we divide it into three phases: initialization, exploration and exploitation, and best configuration selection. Each phase is allotted a proportion of the time constraint TC. Given the average time of running an application with one configuration on the testbed and the average time on the production system, the time allotted to each phase determines how many configurations can be run in it, including the approximate number of iterations (iter) in the exploration and exploitation phase. This split is governed by three hyperparameters of the AutoTune algorithm. We discuss the performance variations over different time ratios in Section V-D.
V Experiments
V-A Experimental Settings
Runtime environment. We conduct our experiments on a public cloud infrastructure named Aliyun (https://www.aliyun.com). We use 6 Aliyun ECS instances of two types: 1 general type instance (g5) for the master node, and 5 memory type instances (r5) for slave nodes. The master node is equipped with a 4-core Intel Skylake Xeon Platinum 8163 2.5GHz processor, 16GB RAM, and a 100GB disk. Each of the slave nodes is equipped with a 4-core Intel Skylake Xeon Platinum 8163 2.5GHz processor, 32GB RAM, and a 250GB disk. Both instance types have CentOS 6.8 (64-bit) installed. All of the VM instances are connected via a high-speed 1.5Gbps LAN.
Framework and configuration setup. We choose Spark as our experimental framework. Spark is a general-purpose cluster computing engine for streaming, graph processing, and machine learning [2]. We choose Spark because it is a widely adopted open-source data processing engine.
Based on the Spark manual [4] and previous studies [5, 24, 28, 11], we identify 13 out of 180+ parameters that are considered critical to the performance of Spark applications, as listed in Table I. It’s worth noting that even with only 13 parameters, the search space is still enormous, and exhaustive search is infeasible.
Parameters  Default
spark.executor.cores  4
spark.executor.memory  1024MB
spark.memory.fraction  0.6
spark.memory.storageFraction  0.5
spark.default.parallelism  20
spark.shuffle.compress  True
spark.shuffle.spill.compress  True
spark.broadcast.compress  True
spark.rdd.compress  False
spark.io.compression.codec  lz4
spark.reducer.maxSizeInFlight  48MB
spark.shuffle.file.buffer  32KB
spark.serializer  JavaSerializer
Benchmark. The HiBench [48] benchmark is used in our experiments. We select seven representative applications in three different categories: micro benchmark, machine learning, and web search benchmark. Table II lists the applications and the corresponding dataset sizes.
Abbr.  Program  Dataset Size 

WC  WordCount  76.5GB 
BC  Bayesian classification  5.6GB 
KC  KMeans clustering  38.3GB 
LR  Logistic regression  7.5GB 
SVM  Support vector machine  80.8GB 
GBT  Gradient boosting trees  603.2MB 
PR  PageRank  506.9MB 
V-B Baseline Algorithms
To evaluate the performance of AutoTune, we compare it with five state-of-the-art algorithms, namely random search [49], BestConfig [5], RFHOC [17], Hyperopt [50], and SMAC [51]. We provide a brief description of each algorithm and report its hyperparameters (if necessary) as follows:
Random search (Random) is a search-based tuning approach that explores each dimension of parameters uniformly at random. It is more efficient than grid search in high-dimensional configuration spaces, and is a high-performance baseline, as suggested in [49].
BestConfig (code available from https://github.com/zhuyuqing/bestconf) is a search-based tuning approach that uses divide-and-diverge sampling and a recursive bound-and-search algorithm to find the best configuration. We follow the suggestions in [5] and tabulate its hyperparameter values in Table III.
RFHOC is a learning-based tuning approach that constructs a prediction model using random forests and uses a genetic algorithm to automatically explore the configuration space. We use the hyperparameters suggested in [17].
Hyperopt (code available from http://jaberg.github.io/hyperopt/) is a learning-based tuning approach based on Bayesian optimization. It is widely used for hyperparameter optimization. We use the suggested settings in [50].
SMAC (code available from http://www.cs.ubc.ca/labs/beta/Projects/SMAC) is a learning-based tuning method using random forests and an aggressive racing strategy.
In the AutoTune algorithm, we use a fixed number of iterations for the random forest model, and do not limit the depth of the decision trees or the number of features available to each tree; the values of the three hyperparameters for each application are listed in Table III. For each run in our experiments, every algorithm is executed under the same time constraint and stops once the constraint is met.
Apps  BestConfig  AutoTune  

PS  TB  PS  TB  
WC  31  211  31  5  420  10  29 
BC  100  446  100  5  446  10  61 
KC  25  324  25  5  320  20  51 
LR  20  163  20  5  163  10  24 
SVM  9  88  9  2  88  10  15 
GBT  23  64  23  5  64  10  15 
PR  79  317  79  5  317  10  50 
V-C Evaluation Metrics
We consider three performance metrics in our experiments for performance evaluation, namely cost of testbed construction, nDCG and ET. Cost of testbed construction measures the cost of building a testbed, nDCG is a metric that evaluates the quality of a testbed, and execution time is the ultimate performance metric for AutoTune.
Cost of testbed construction. The critical step of AutoTune testbed is to derive a performance prediction model that can guide the construction of testbed under the time constraint and the desired scale factor proposed by users. Typically, performance prediction models are evaluated on the basis of their prediction accuracy. It is also common knowledge that usually a larger training set results in higher prediction accuracy [45]. However, a large training set is often undesirable in terms of measurement effort in this problem. Thus, any performance prediction model built for this purpose should be evaluated not only in terms of prediction accuracy, but also in terms of measurement cost involved in building the training and testing sets. More specifically, we adopt the cost model proposed in [45] to include the cost incurred in measuring the testing set along with the training set:
$$\mathrm{Cost}(n) = n + \epsilon_n \cdot S \cdot R \qquad (4)$$
where $n$ is the number of samples in the training and testing sets, $\epsilon_n$ is the prediction error of the predictive model for the testbed built with the $n$ samples, $S$ represents the number of settings whose performance value will be predicted by the model, and $R = 1$ means that we equally weigh the cost incurred in measuring samples and the cost due to prediction error [45]. Note that in this definition we ignore the cost incurred in building a performance prediction model for the testbed, because for the linear regression model (Eq. (3)) used in our approach this cost is computationally insignificant compared with the other cost factors.
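As a rough illustration of this trade-off, the following Python sketch assumes the cost is the number of measured samples plus the prediction error multiplied by the number of predicted settings and by the weight. The function name, parameter names, and the numbers are hypothetical and are not taken from our experiments.

```python
def total_cost(n_train: int, n_test: int, error: float,
               n_predicted: int, weight: float = 1.0) -> float:
    """Measurement cost plus weighted prediction-error cost (assumed form)."""
    if n_train < 0 or n_test < 0 or n_predicted < 0:
        raise ValueError("sample counts must be non-negative")
    # Every measured sample (training or testing) costs one unit.
    measurement_cost = n_train + n_test
    # Cost attributed to inaccurate predictions over the remaining settings.
    error_cost = error * n_predicted * weight
    return measurement_cost + error_cost


# Hypothetical numbers, chosen only to illustrate the trade-off:
# few samples with a high error versus many samples with a low error.
small = total_cost(n_train=10, n_test=10, error=0.30, n_predicted=1000)
large = total_cost(n_train=100, n_test=100, error=0.05, n_predicted=1000)
```

Under these made-up numbers the larger sample set is cheaper overall, because the reduction in error cost outweighs the extra measurements; with a smaller number of predicted settings the balance could tip the other way.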
Normalized Discounted Cumulative Gain (nDCG) was originally proposed as a ranking-quality metric for search result sets [52]. We use it here to evaluate the quality of the ranking of a set of configurations generated from the testbed by comparing it with the corresponding real ranking from the production system. Note that the ranking produced by a prediction model is more important than its prediction accuracy in AutoTune, because we select the subset of best configurations and feed it into the production system (line 19 in Algorithm 2).
Specifically, the DCG of a ranking of configurations accumulates the graded relevance of the result at each position, discounted logarithmically by that position [52]. Given the predicted ranking and the true ranking at a position, we define a 5-level relevance rating criterion by calculating the absolute relative deviation between the two, as shown in Table IV.
The normalized DCG of a ranking $r$ is thus defined as the ratio of its DCG to the DCG of the real ranking sequence $r'$:
$$\mathrm{nDCG}(r) = \frac{\mathrm{DCG}(r)}{\mathrm{DCG}(r')}$$
For example, suppose that for three configurations the predicted ranking yields relevance values of 3 (good) at the first and second positions and 5 (perfect) at the last position. The nDCG of the predicted ranking is then the DCG computed from these relevance values divided by the DCG of the real ranking.
Relevance Rating  Value  Condition 

Perfect  5  
Excellent  4  
Good  3  
Fair  2  
Bad  0 
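The nDCG computation can be sketched in Python as follows. This is a minimal sketch that assumes one common DCG variant (linear gain with a log2(position + 1) discount), which may differ from the exact definition used in our implementation. The function names are hypothetical, and the relevance values are assumed to have already been assigned with the rating criterion of Table IV.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """DCG with linear gain and a log2(position + 1) discount (assumed variant)."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances, start=1))


def ndcg(predicted_rel: Sequence[float], real_rel: Sequence[float]) -> float:
    """Ratio of the predicted ranking's DCG to the real ranking's DCG."""
    real = dcg(real_rel)
    if real == 0:
        raise ValueError("the real ranking must have a positive DCG")
    return dcg(predicted_rel) / real


# Relevance values of the three-configuration example: good, good, perfect.
predicted = [3, 3, 5]
# Assumption: the real ranking compared with itself is rated perfect everywhere.
real = [5, 5, 5]
score = ndcg(predicted, real)
```

A predicted ranking that matches the real one at every position scores 1, and any lower relevance value, especially near the top of the ranking, pulls the score toward 0.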
Execution Time (ET) is the total completion time of the application, from the time when the request is fed into the BDAF to the time when the result is returned to the client. The ET improvement of an algorithm over the baseline algorithm is defined as:
$$\mathrm{Improvement} = \frac{ET_{b} - ET_{a}}{ET_{b}} \times 100\%$$
where $ET_{b}$ is the execution time of the baseline, and $ET_{a}$ is that of the algorithm being evaluated.
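The improvement can be computed with a few lines of Python. The sketch below assumes that the improvement is the relative reduction in execution time with respect to the baseline, expressed in percent; the function name is hypothetical, and the two input values are the production-system execution times of WC under the default configuration and under AutoTune, as listed in Table VIII.

```python
def et_improvement(et_baseline: float, et_algorithm: float) -> float:
    """Relative execution-time reduction over the baseline, in percent."""
    if et_baseline <= 0:
        raise ValueError("baseline execution time must be positive")
    return (et_baseline - et_algorithm) / et_baseline * 100.0


# WC on the production system (Table VIII, in ms): default vs. AutoTune.
wc_improvement = et_improvement(217524, 173198)
```

A positive value means the evaluated algorithm is faster than the baseline, and a negative value means it is slower.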
To ensure consistency, we run each application five times and calculate the average of these five runs. The standard deviation of the execution time is 0.02 or smaller, which indicates the stability of the performance.
V-D Experiment Results
V-D1 Testbed evaluation
The first experiment is to evaluate the effectiveness of our AutoTune testbed approach. By conducting experiments on seven Spark applications, we aim at answering the following two research questions:

RQ1: Is the projective sampling strategy in our AutoTune testbed algorithm more cost efficient than the classic progressive sampling method?

RQ2: How does the quality of testbeds vary with different scale factors?
Since cost efficiency is the primary criterion for judging the effectiveness of a sampling strategy, we compared the total cost of sampling and prediction according to Eq. (4), for all seven applications, using progressive and projective sampling. Progressive sampling is a popular sampling strategy that has been used for a variety of learning models. The central idea is to use a sampling schedule $n_0, n_1, \ldots, n_k$, where each $n_i$ is an integer that specifies the size of the sample set used to build a performance prediction model at iteration $i$. In our experiment, we adopt the geometric progressive sampling strategy, where $n_i = n_0 \cdot a^i$. The parameter $a$ is a constant that defines how fast we increase the size of the sample set.
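A geometric sampling schedule can be generated as in the following sketch. It assumes the usual geometric form, in which the sample set size is multiplied by a constant factor at each iteration; the function name, the cap on the sample set size, and the example values are hypothetical.

```python
from typing import List


def geometric_schedule(n0: int, a: float, max_size: int) -> List[int]:
    """Sample-set sizes n_i = n0 * a**i, stopping before max_size is exceeded."""
    if n0 <= 0 or a <= 1 or max_size < n0:
        raise ValueError("need n0 > 0, a > 1 and max_size >= n0")
    sizes = []
    i = 0
    while True:
        n_i = int(round(n0 * a ** i))  # sizes must be integers
        if n_i > max_size:
            break
        sizes.append(n_i)
        i += 1
    return sizes


# Start with 10 samples and double the sample set at every iteration.
schedule = geometric_schedule(n0=10, a=2, max_size=200)
```

With these values the schedule is 10, 20, 40, 80, 160: a larger growth factor reaches large sample sets in fewer iterations, at the risk of overshooting the optimal size.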
As shown in Table V, for both progressive and projective sampling, we calculated the cost and accuracy of building prediction models with the optimal sample set size, which is determined by the respective sampling technique. Table V shows that projective sampling has a lower cost than progressive sampling for all seven applications, and equal or higher accuracy for all applications except LR (85% vs. 87%). For SVM and GBT, progressive sampling gets stuck in a local optimum and produces low accuracies of 20% and 15%, respectively. Progressive and projective sampling are closest in accuracy and cost for WC. For the other four applications (BC, KC, LR, and PR), projective sampling is considerably more cost efficient than progressive sampling.
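| Apps | Cost (Progressive) | Cost (Projective) | Accuracy % (Progressive) | Accuracy % (Projective) |
|---|---|---|---|---|
| WC | 220 | 180 | 85 | 88 |
| BC | 877 | 380 | 90 | 92 |
| KC | 1235 | 423 | 89 | 89 |
| LR | 1492 | 516 | 87 | 85 |
| SVM | 276 | 89 | 20 | 91 |
| GBT | 78 | 42 | 15 | 83 |
| PR | 415 | 231 | 78 | 86 |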
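The cost figures in Table V can be compared directly with a short Python sketch. The relative cost reduction computed here is a derived quantity introduced only for illustration, and the variable names are hypothetical.

```python
# Total cost per application from Table V: (progressive, projective).
costs = {
    "WC": (220, 180), "BC": (877, 380), "KC": (1235, 423), "LR": (1492, 516),
    "SVM": (276, 89), "GBT": (78, 42), "PR": (415, 231),
}

# Relative cost reduction of projective over progressive sampling, in percent.
reduction = {app: (prog - proj) / prog * 100.0
             for app, (prog, proj) in costs.items()}

# True only if projective sampling is cheaper for every application.
cheaper_everywhere = all(proj < prog for prog, proj in costs.values())
```

The smallest reduction is obtained for WC, where the two strategies have the closest costs.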
To measure the quality of testbeds, we evaluate the nDCG values on testbeds with different scale factors. For each application, we construct testbeds with many combinations of data scale and number of machines. We then randomly generate 30 different configurations and collect their execution times on both the production system and the different testbeds. Finally, we calculate the nDCG value of each testbed. Table VI lists the nDCG values of the KMeans application for ten data scale values and five machine counts.
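nDCG values with different data scale (columns) and number of machines (rows):

| # of machines | 0.03125 | 0.05 | 0.0625 | 0.1 | 0.125 | 0.25 | 0.5 | 0.6 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.4407 | 0.5288 | 0.6091 | 0.5929 | 0.5991 | 0.6032 | 0.6432 | 0.6054 | 0.5928 | 0.7002 |
| 2 | 0.4252 | 0.5239 | 0.7070 | 0.7158 | 0.6229 | 0.6774 | 0.7675 | 0.6743 | 0.7863 | 0.8162 |
| 3 | 0.4213 | 0.5032 | 0.7872 | 0.7865 | 0.7923 | 0.6258 | 0.6632 | 0.7058 | 0.7721 | 0.8765 |
| 4 | 0.4726 | 0.5911 | 0.8543 | 0.8170 | 0.7662 | 0.8345 | 0.8732 | 0.8251 | 0.9021 | 0.9854 |
| 5 | 0.4942 | 0.6032 | 0.9462 | 0.9352 | 0.8621 | 0.9954 | 0.9007 | 0.9976 | 0.9171 | 0.9986 |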
We can see from Table VI that the testbeds having the same number of machines as the production system obtain better nDCG values than the others. This is because changing the number of machines in the runtime environment changes resource provisioning and scheduling accordingly, which can cause unpredictable application performance. Therefore, we should keep the underlying environment of the testbed as close as possible to that of the production system. Another interesting result in Table VI is that moderate data scale values yield significantly better nDCG results than smaller values, while the nDCG values change little as the data scale becomes much larger.
Based on this observation, we keep the underlying environment unchanged and choose 1/16 (0.0625) as the data scale factor for our testbed in the following experiments.
V-D2 Prediction model comparison
The second experiment evaluates the quality of different prediction models by comparing their rankings on the testbed. For each application, we first set its time constraint, and then construct the production system and the corresponding testbed to run the application independently under the constraint. After that, we collect samples on the production system (PS) and on the testbed (TB), and train three prediction models, i.e., random forest (RF), gradient boosting decision tree (GBDT), and support vector regression (SVR), using 10-fold cross-validation on both sample sets. Table VII lists the nDCG values on the production system and the testbed. The nDCG values of all three models on the testbed are higher than those on the production system. Specifically, given the same time constraints, the nDCG of RF on the testbed is on average 0.0711 (7.11 percentage points) higher than on the production system, compared with 5.28 points for GBDT and 5.68 points for SVR. On the testbed, the random forest model also achieves the highest nDCG of the three models for every application.
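| Apps | Time constraint | Platform (# of samples) | RF | GBDT | SVR |
|---|---|---|---|---|---|
| WC | 1.65h | TB (216) | 0.9482 | 0.9026 | 0.8118 |
|  |  | PS (31) | 0.9088 | 0.8909 | 0.8080 |
| BC | 14.95h | TB (545) | 0.8412 | 0.7933 | 0.7501 |
|  |  | PS (118) | 0.7836 | 0.6942 | 0.6498 |
| KC | 34.45h | TB (633) | 0.9038 | 0.8452 | 0.7851 |
|  |  | PS (94) | 0.7817 | 0.7116 | 0.7761 |
| LR | 12.42h | TB (743) | 0.8714 | 0.8689 | 0.8077 |
|  |  | PS (92) | 0.7852 | 0.7947 | 0.7534 |
| SVM | 64.4h | TB (267) | 0.9834 | 0.9355 | 0.8445 |
|  |  | PS (28) | 0.8404 | 0.9255 | 0.7264 |
| GBT | 1.3h | TB (67) | 0.9052 | 0.8761 | 0.8783 |
|  |  | PS (23) | 0.8638 | 0.8702 | 0.7732 |
| PR | 9.54h | TB (242) | 0.9032 | 0.8939 | 0.8491 |
|  |  | PS (59) | 0.8953 | 0.8585 | 0.8422 |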
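The average improvements reported for the three models can be reproduced from Table VII with the following Python sketch. It assumes that the three value columns correspond to RF, GBDT, and SVR in this order, and that each improvement is the mean absolute difference between the testbed (TB) and production system (PS) nDCG values over the seven applications; the variable names are hypothetical.

```python
# nDCG values from Table VII: application -> {platform: (RF, GBDT, SVR)}.
ndcg_values = {
    "WC":  {"TB": (0.9482, 0.9026, 0.8118), "PS": (0.9088, 0.8909, 0.8080)},
    "BC":  {"TB": (0.8412, 0.7933, 0.7501), "PS": (0.7836, 0.6942, 0.6498)},
    "KC":  {"TB": (0.9038, 0.8452, 0.7851), "PS": (0.7817, 0.7116, 0.7761)},
    "LR":  {"TB": (0.8714, 0.8689, 0.8077), "PS": (0.7852, 0.7947, 0.7534)},
    "SVM": {"TB": (0.9834, 0.9355, 0.8445), "PS": (0.8404, 0.9255, 0.7264)},
    "GBT": {"TB": (0.9052, 0.8761, 0.8783), "PS": (0.8638, 0.8702, 0.7732)},
    "PR":  {"TB": (0.9032, 0.8939, 0.8491), "PS": (0.8953, 0.8585, 0.8422)},
}
models = ("RF", "GBDT", "SVR")  # assumed column order

# Mean TB-minus-PS difference per model, averaged over the applications.
mean_gain = {
    model: sum(v["TB"][k] - v["PS"][k] for v in ndcg_values.values())
    / len(ndcg_values)
    for k, model in enumerate(models)
}
```

Multiplying the resulting means by 100 gives approximately 7.11, 5.28, and 5.68, which agrees with the figures quoted for RF, GBDT, and SVR under this column assignment.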
V-D3 Hyperparameters
The third experiment evaluates how performance varies with the time portions allocated to the initialization, exploration and exploitation (E&E), and best configuration selection phases of the AutoTune algorithm. For two applications, KMeans Clustering (KC) and PageRank (PR), we plot the execution time over different time portions of the three phases in Figure 6, where the x axis denotes the initialization time portion from 0 to 1, the y axis denotes the E&E time portion from 0 to 1, and each point in the 3D plots represents the execution time under the best configuration generated by the AutoTune algorithm. These results show that the time portions of the three phases do affect the execution time on both applications: the relative difference over the worst value is about 10.77% on one application and 34.62% on the other. We can also observe similar blue cross regions in the middle of the two 3D plots, which means that balancing the time allocation among the three phases finds the optimal configuration with a higher probability.
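The grid of time allocations explored in such an experiment can be enumerated as in the sketch below. It assumes that the three portions sum to 1, so that the time left after initialization and E&E goes to best configuration selection; the function name and the grid resolution are hypothetical and do not reflect the resolution used for Figure 6.

```python
from typing import Iterator, Tuple


def phase_portions(steps: int = 10) -> Iterator[Tuple[float, float, float]]:
    """Yield (initialization, E&E, selection) time portions that sum to 1."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            k = steps - i - j  # the remaining time goes to selection
            yield i / steps, j / steps, k / steps


# A coarse grid with a resolution of 0.25 per phase.
grid = list(phase_portions(steps=4))
```

Each tuple in the grid is one candidate time allocation; running the tuning algorithm once per tuple and recording the resulting execution time would produce one point of a surface like those in Figure 6.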
V-D4 Execution time
Given a fixed time constraint for each application, we construct the testbed and run six different tuning algorithms, plus the default configuration, independently. Table VIII lists the execution time results (in ms). As expected, the default configuration does not perform well. Our algorithm achieves an average improvement of 63.70% over the default configurations. Furthermore, using the testbed improves performance for all algorithms: compared with using only the production system, Random on the testbed achieves an average execution time improvement of 9.57%, BestConfig 15.18%, RFHOC 19.13%, Hyperopt 8.38%, SMAC 10.7%, and AutoTune 13.33%.
| Apps | Time constraint (running times¹) | Target platform² | Default³ | Random | BestConfig | RFHOC | Hyperopt | SMAC | AutoTune |
|---|---|---|---|---|---|---|---|---|---|
| WC | 5.33h (100) | TB |  |  |  |  |  |  |  |
|  |  | PS | 217524 | 174780 | 225963 | 180126 | 175857 | 174991 | 173198 |
| BC | 12.6h (100) | TB |  |  |  |  |  |  |  |
|  |  | PS | 425501 | 116293 | 121180 | 120722 | 113681 | 114439 | 114296 |
| KC | 18h (49) | TB |  |  |  |  |  |  |  |
|  |  | PS | 1301446 | 255550 | 282722 | 325247 | 266733 | 265648 | 232939 |
| LR | 6h (44) | TB |  |  |  |  |  |  |  |
|  |  | PS | 516677 | 249086 | 275951 | 449964 | 222016 | 275224 | 222694 |
| SVM | 24h (10) | TB |  |  |  |  |  |  |  |
|  |  | PS | 7880587 | 567873 | 862026 | 1595021 | 547657 | 544763 | 536207 |
| GBT | 6h (105) | TB |  |  |  |  |  |  |  |
|  |  | PS | 201714 | 171097 | 163664 | 167082 | 168916 | 165873 | 167025 |
| PR | 8.08h (50) | TB |  |  |  |  |  |  |  |
|  |  | PS | 328951 | 91032 | 99354 | 102416 | 87929 | 86647 | 98426 |

¹ Running Times: given the fixed time constraints, the approximate running times on the production system with the default configuration.
² Target Platform: the platform on which the algorithms work and evaluate different configurations. TB = Testbed, PS = Production System.
³ The performance of the default configurations on the testbed is not recorded, as it is unimportant.
Finally, we plot the overall execution time improvement of BestConfig, RFHOC, Hyperopt, SMAC, and AutoTune in Figure 7, using the random algorithm as the baseline. In Figure 7, the x axis lists the seven applications and the y axis represents the improvement percentage over the random algorithm. Compared with the random algorithm, our approach achieves a 4.8%–8.7% improvement across all applications. On average, AutoTune achieves a 7.35% improvement over Random, 14.35% over BestConfig, 22.79% over RFHOC, 6.28% over Hyperopt, and 6.73% over SMAC. We can conclude from Figure 7 that AutoTune achieves stable and significant improvements over the other five algorithms. Another interesting observation from Figure 7 is that random search achieves surprisingly good results in our experiments, which is consistent with the findings of Bergstra and Bengio [49].
VI Conclusion and Future Work
In this paper, we propose AutoTune, an automatic configuration tuning system that optimizes execution time for BDAFs. AutoTune constructs a smaller-scale testbed from the production system so that it can generate more samples, and thus train a better prediction model, under a given time constraint. Furthermore, the AutoTune algorithm selects a set of samples that provides wide coverage of the high-dimensional parameter space, and searches for more promising configurations using the trained prediction model.
In future work, we plan to refine our testbed approach by supporting the automatic selection of appropriate testbed settings given a scale factor and a resource constraint. We will also investigate the performance dynamics of BDAFs and design better approaches to account for such dynamics in our algorithm design. Last, we hope to integrate the proposed algorithm into the major Hadoop/Spark releases to support intelligent and automatic parameter tuning for big data analytics applications.
References
 [1] J. Dean and S. Ghemawat, “Mapreduce: simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
 [2] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation. USENIX Association, 2012, pp. 2–2.
 [3] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: distributed data-parallel programs from sequential building blocks,” in ACM SIGOPS operating systems review, vol. 41, no. 3. ACM, 2007, pp. 59–72.
 [4] Spark, “http://spark.apache.org/docs/latest/configuration.html,” Accessed on January 31, 2017.
 [5] Y. Zhu, J. Liu, M. Guo, Y. Bao, W. Ma, Z. Liu, K. Song, and Y. Yang, “Bestconfig: tapping the performance potential of systems via automatic configuration tuning,” in Proceedings of the 2017 Symposium on Cloud Computing. ACM, 2017, pp. 338–350.
 [6] K. Ren, Y. Kwon, M. Balazinska, and B. Howe, “Hadoop’s adolescence: an analysis of hadoop usage in scientific workloads,” Proceedings of the VLDB Endowment, vol. 6, no. 10, pp. 853–864, 2013.
 [7] H. Herodotou, H. Lim, G. Luo, N. Borisov, L. Dong, F. B. Cetin, and S. Babu, “Starfish: A self-tuning system for big data analytics.” in CIDR, vol. 11, 2011, pp. 261–272.
 [8] C. Liu, D. Zeng, H. Yao, C. Hu, X. Yan, and Y. Fan, “Mrcof: a genetic mapreduce configuration optimization framework,” in International Conference on Algorithms and Architectures for Parallel Processing. Springer, 2015, pp. 344–357.
 [9] J. Shi, J. Zou, J. Lu, Z. Cao, S. Li, and C. Wang, “Mrtuner: a toolkit to enable holistic optimization for mapreduce jobs,” Proceedings of the VLDB Endowment, vol. 7, no. 13, pp. 1319–1330, 2014.
 [10] S. Kadirvel and J. A. Fortes, “Grey-box approach for performance prediction in mapreduce based platforms,” in Computer Communications and Networks (ICCCN), 2012 21st International Conference on. IEEE, 2012, pp. 1–9.
 [11] K. Wang and M. M. H. Khan, “Performance prediction for apache spark platform,” in High Performance Computing and Communications (HPCC), 2015 IEEE 17th International Conference on. IEEE, 2015, pp. 166–173.
 [12] M. Cardosa, P. Narang, A. Chandra, H. Pucha, and A. Singh, “Steamengine: Driving mapreduce provisioning in the cloud,” in High Performance Computing (HiPC), 2011 18th International Conference on. IEEE, 2011, pp. 1–10.
 [13] G. Wang, A. R. Butt, P. Pandey, and K. Gupta, “A simulation approach to evaluating design decisions in mapreduce setups,” in Modeling, Analysis & Simulation of Computer and Telecommunication Systems, 2009. MASCOTS’09. IEEE International Symposium on. IEEE, 2009, pp. 1–11.
 [14] G. Liao, K. Datta, and T. L. Willke, “Gunther: Search-based auto-tuning of mapreduce,” in European Conference on Parallel Processing. Springer, 2013, pp. 406–419.
 [15] M. Li, L. Zeng, S. Meng, J. Tan, L. Zhang, A. R. Butt, and N. Fuller, “Mronline: Mapreduce online performance tuning,” in Proceedings of the 23rd international symposium on High-performance parallel and distributed computing. ACM, 2014, pp. 165–176.
 [16] S. Kumar, S. Padakandla, P. Parihar, K. Gopinath, S. Bhatnagar et al., “Performance tuning of hadoop mapreduce: A noisy gradient approach,” arXiv preprint arXiv:1611.10052, 2016.
 [17] Z. Bei, Z. Yu, H. Zhang, W. Xiong, C. Xu, L. Eeckhout, and S. Feng, “Rfhoc: A random-forest approach to auto-tuning hadoop’s configuration,” IEEE Transactions on Parallel and Distributed Systems, vol. 27, no. 5, pp. 1470–1483, 2016.
 [18] J. L. Berral, N. Poggi, D. Carrera, A. Call, R. Reinauer, and D. Green, “Alojaml: A framework for automating characterization and knowledge discovery in hadoop deployments,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 1701–1710.
 [19] P. Lama and X. Zhou, “Aroma: Automated resource allocation and configuration of mapreduce environment in the cloud,” in Proceedings of the 9th international conference on Autonomic computing. ACM, 2012, pp. 63–72.
 [20] N. Yigitbasi, T. L. Willke, G. Liao, and D. Epema, “Towards machine learning-based auto-tuning of mapreduce,” in Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS), 2013 IEEE 21st International Symposium on. IEEE, 2013, pp. 11–20.
 [21] N. B. Rizvandi, J. Taheri, R. Moraveji, and A. Y. Zomaya, “On modelling and prediction of total cpu usage for applications in mapreduce environments,” in International Conference on Algorithms and Architectures for Parallel Processing. Springer, 2012, pp. 414–427.
 [22] C. Tang, “System performance optimization via design and configuration space exploration,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 2017, pp. 1046–1049.
 [23] R. Zhang, M. Li, and D. Hildebrand, “Finding the big data sweet spot: Towards automatically recommending configurations for hadoop clusters on docker containers,” in Cloud Engineering (IC2E), 2015 IEEE International Conference on. IEEE, 2015, pp. 365–368.
 [24] N. Luo, Z. Yu, Z. Bei, C. Xu, C. Jiang, and L. Lin, “Performance modeling for spark using svm,” in Cloud Computing and Big Data (CCBD), 2016 7th International Conference on. IEEE, 2016, pp. 127–131.
 [25] C. Peng, C. Zhang, C. Peng, and J. Man, “A reinforcement learning approach to map reduce auto-configuration under networked environment,” International Journal of Security and Networks, vol. 12, no. 3, pp. 135–140, 2017.
 [26] D. Wu and A. Gokhale, “A self-tuning system based on application profiling and performance analysis for optimizing hadoop mapreduce cluster configuration,” in High Performance Computing (HiPC), 2013 20th International Conference on. IEEE, 2013, pp. 89–98.
 [27] C.-O. Chen, Y.-Q. Zhuo, C.-C. Yeh, C.-M. Lin, and S.-W. Liao, “Machine learning-based configuration parameter tuning on hadoop system,” in Big Data (BigData Congress), 2015 IEEE International Congress on. IEEE, 2015, pp. 386–392.
 [28] G. Wang, J. Xu, and B. He, “A novel method for tuning configuration parameters of spark based on machine learning,” in High Performance Computing and Communications (HPCC), 2016 IEEE 18th International Conference on. IEEE, 2016, pp. 586–593.
 [29] P. Wendell, “Understanding the performance of spark applications.” San Francisco, California, USA: Spark Summit, 2–3 December 2013.
 [30] A. Bida and R. Warren, “Spark tuning for enterprise system administrators.” New York, New York, USA: Spark Summit East, 16–18 February 2016.
 [31] M. Armbrust, “Catalyst: A query optimization framework for spark and shark.” San Francisco, California, USA: Spark Summit, 2–3 December 2013.
 [32] H. Herodotou and S. Babu, “Profiling, what-if analysis, and cost-based optimization of mapreduce programs,” Proceedings of the VLDB Endowment, vol. 4, no. 11, pp. 1111–1122, 2011.
 [33] C. Blum and A. Roli, “Metaheuristics in combinatorial optimization: Overview and conceptual comparison,” ACM computing surveys (CSUR), vol. 35, no. 3, pp. 268–308, 2003.
 [34] C. H. Papadimitriou and K. Steiglitz, “Combinatorial optimization: algorithms and complexity,” 1982.
 [35] M. R. Garey and D. S. Johnson, “Computers and intractability: a guide to NP-completeness,” 1979.
 [36] S. Venkataraman, Z. Yang, M. J. Franklin, B. Recht, and I. Stoica, “Ernest: Efficient performance prediction for large-scale advanced analytics.” in NSDI, 2016, pp. 363–378.
 [37] J. A. Hartigan and M. A. Wong, “Algorithm as 136: A k-means clustering algorithm,” Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 28, no. 1, pp. 100–108, 1979.
 [38] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: portable parallel programming with the message-passing interface. MIT press, 1999, vol. 1.
 [39] M. Chowdhury and I. Stoica, “Coflow: A networking abstraction for cluster applications,” in Proceedings of the 11th ACM Workshop on Hot Topics in Networks. ACM, 2012, pp. 31–36.
 [40] C. L. Lawson and R. J. Hanson, Solving least squares problems. Siam, 1995, vol. 15.
 [41] M. Last, “Improving data mining utility with projective sampling,” in Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2009, pp. 487–496.
 [42] F. Provost, D. Jensen, and T. Oates, “Efficient progressive sampling,” in Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1999, pp. 23–32.
 [43] K. Pearson, “Note on regression and inheritance in the case of two parents,” Proceedings of the Royal Society of London, vol. 58, pp. 240–242, 1895.
 [44] D. W. Marquardt, “An algorithm for least-squares estimation of nonlinear parameters,” Journal of the society for Industrial and Applied Mathematics, vol. 11, no. 2, pp. 431–441, 1963.
 [45] A. Sarkar, J. Guo, N. Siegmund, S. Apel, and K. Czarnecki, “Cost-efficient sampling for performance prediction of configurable systems (t),” in Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 2015, pp. 342–352.
 [46] M. D. McKay, R. J. Beckman, and W. J. Conover, “Comparison of three methods for selecting values of input variables in the analysis of output from a computer code,” Technometrics, vol. 21, no. 2, pp. 239–245, 1979.
 [47] B. Xi, Z. Liu, M. Raghavachari, C. H. Xia, and L. Zhang, “A smart hill-climbing algorithm for application server configuration,” in Proceedings of the 13th international conference on World Wide Web. ACM, 2004, pp. 287–296.
 [48] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang, “The hibench benchmark suite: Characterization of the mapreduce-based data analysis,” in Data Engineering Workshops (ICDEW), 2010 IEEE 26th International Conference on. IEEE, 2010, pp. 41–51.
 [49] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 2012.
 [50] J. Bergstra, D. Yamins, and D. D. Cox, “Hyperopt: A python library for optimizing the hyperparameters of machine learning algorithms,” in Proceedings of the 12th Python in Science Conference. Citeseer, 2013, pp. 13–20.
 [51] F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Sequential model-based optimization for general algorithm configuration.” LION, vol. 5, pp. 507–523, 2011.
 [52] K. Järvelin and J. Kekäläinen, “Cumulated gain-based evaluation of ir techniques,” ACM Transactions on Information Systems (TOIS), vol. 20, no. 4, pp. 422–446, 2002.