As software systems grow in complexity, the space of possible configurations grows exponentially. Within this increasing complexity, developers, maintainers, and users cannot keep track of the interactions between all the various configuration options. Finding the optimally performing configuration of a software system for a given setting is challenging. Recent approaches address this challenge by learning performance models based on a sample set of configurations. However, collecting enough data on enough sample configurations can be very expensive since each such sample requires configuring, compiling, and executing the entire system against a complex test suite. The central insight of this paper is that choosing a suitable source (a.k.a. "bellwether") to learn from, plus a simple transfer learning scheme, will often outperform much more complex transfer learning methods. Using this insight, this paper proposes BEETLE, a novel bellwether-based transfer learning scheme, which can identify a suitable source and use it to find near-optimal configurations of a software system. BEETLE significantly reduces the cost (in terms of the number of measurements of sample configurations) to build performance models. We evaluate our approach with 61 scenarios based on 5 software systems and demonstrate that BEETLE is beneficial in all cases. This approach offers a new high-water mark in configuring software systems. Specifically, BEETLE can find configurations that are as good as or better than those found by anything else while requiring only 1/7th of the evaluations needed by the state-of-the-art.
Finding the right set of configurations that can achieve the best performance becomes increasingly challenging as the number of configurations increases (Xu et al., 2015). As software systems grow in both size and complexity, optimizing a software system to meet the needs of a workload has surpassed the abilities of humans (Bernstein et al., 1998).
Much current research has explored this problem, usually by creating accurate performance models that predict performance characteristics. While this approach is cheaper and more effective than manual configuration, it still incurs the expense of extensive data collection about the software (Guo et al., 2013; Sarkar et al., 2015; Siegmund et al., 2012; Nair et al., 2017a, b, 2018; Oh et al., 2017; Guo et al., 2017; Jamshidi and Casale, 2016). This is undesirable, since this data collection has to be repeated whenever the software is updated or the workload of the system changes abruptly. In such a scenario, all prior research suffers from the same drawback:
These approaches do not learn from previous optimization experiments and must be rerun whenever the environment of the experiments changes.
Note our use of the term environment. This refers to the external factors influencing the performance of the system, such as the workload, the hardware, and the version of the software.
Recent research in performance prediction for configurable systems has shown that transfer learning can be effective for resolving ‘cold start’ problems, that is, problems for which collecting data is expensive. Transfer learning typically entails transferring information from a selected “source” (the software system operating in one environment) to learn a model that predicts the performance of the system in the “target” (the same software system in a different environment), hence enabling a user to reuse measurements. This approach has received much attention in the software analytics literature (Nam et al., 2013; Nam and Kim, 2015; Jing et al., 2015; Kocaguneli and Menzies, 2011; Kocaguneli et al., 2015; Turhan et al., 2009; Peters et al., 2015; Krishna and Menzies, 2017).
Transfer learning is only useful when the source environment is similar to the target environment. If the source and the target are not similar, knowledge should not be transferred. In such extreme situations, transfer learning can be unsuccessful and can lead to negative transfer. Prior work on transfer learning focused on “what to transfer” and “how to transfer”, implicitly assuming that the source and target are related to each other. Hence, that work failed to address “when to transfer” (Pan and Yang, 2010a). Jamshidi et al. (Jamshidi et al., 2017a) alluded to this and explained when transfer learning works, but did not provide a method that can help in selecting a suitable source. In this paper, we focus on solving the problem of performance optimization by choosing a suitable source from which to transfer knowledge. This has not been explored in prior work.
The issue of identifying a suitable source is a common problem in transfer learning. To address this, some researchers (Krishna et al., 2016; Mensah et al., 2017b; Mensah et al., 2017a; Krishna and Menzies, 2017) have recently proposed the use of the bellwether effect, which states that:
“ … When analyzing a community of software projects, then within that community there exists at least one exemplary project, called the bellwether(s), which can best define predictors for other projects …”
The bellwether effect has shown promise in identifying suitable sources for transfer learning in such varied domains as defect prediction, effort estimation, and code smell detection (Krishna and Menzies, 2017). However, the effect was shown to work only on domains where the data is relatively generic, easy to gather, and free of budget constraints (Krishna and Menzies, 2017). In this paper, we introduce BEETLE, a bellwether-based transfer learner, which uses the bellwether effect to identify suitable sources to train transfer learners for performance optimization. The main claim of this paper is:
A good source with a simple transfer learner is better than source agnostic complex transfer learners.
In summary, we make the following contributions:
Source selection using bellwethers: We show that the bellwether effect exists in performance optimization and that we can use this to discover suitable sources (called bellwether environments) to perform transfer learning. (§8)
A fast novel source selection algorithm: We develop a fast algorithm for discovering the bellwether environment with only of the measurements (§5).
Transfer learning using Bellwethers: We develop a novel transfer learning algorithm using Bellwether called BEETLE (short for Bellwether Transfer Learner) that uses the bellwether environment to construct a simple transfer learner (§5).
More effective than non-transfer learning: We show that BEETLE is just as good as non-transfer learning approaches. It is also significantly more economical (§8).
More effective than state-of-the-art methods: We show that the configurations discovered using the bellwether environment are much closer to the true optima than those found by other state-of-the-art methods (Valov et al., 2017; Jamshidi et al., 2017b). We also show that BEETLE is a lot more economical (§8).
The rest of the paper is structured as follows. §2 provides the motivation for this work. §3 introduces the research questions. §4 provides formal definitions of the terminology used in this paper. §5 presents BEETLE, the new transfer learner proposed in this paper. §6 discusses two state-of-the-art transfer learners. §7 describes the experimental setup. §8 presents the results as answers to the research questions. §9 discusses further implications of our findings. §10 highlights threats to validity. Finally, §11 concludes.
This section motivates our work by highlighting the many problems associated with performance modeling and transfer learning.
Modern software systems come with a large number of configuration options. For example, Apache (a popular web server) has around 600 different configuration options and Hadoop, as of version 2.0.0, has around 150 (Xu et al., 2015). Previous empirical studies have also shown that the number of options grows over releases (Van Aken et al., 2017; Xu et al., 2015). These configuration options control the internal properties of the system, such as memory and response times. Given the large number of configurations, it becomes increasingly difficult to assess the impact of the configuration options on the performance of the software system. To address this issue, a common practice is to employ performance prediction models constructed using machine learning algorithms to estimate the performance of the system under these configurations (Hoste et al., 2006; Guo et al., 2013; Hutter et al., 2014; Thereska et al., 2010; Valov et al., 2015; Westermann et al., 2012).
Further, research has shown that, when faced with a large volume of configuration options, developers tend to ignore a majority (over 80%) of the configuration options (Xu et al., 2015). This leaves a considerable amount of untapped potential and often induces poor performance of software systems (Xu et al., 2015). To leverage the full benefit of the software system by exploiting the flexibility of the features offered by the system, researchers augment performance prediction models to enable performance optimization (Nair et al., 2017b; Oh et al., 2017). Performance optimization extends performance prediction by identifying the best set of configuration options to pick to accomplish a given task with near-optimal performance.
Performance optimization requires access to measurements of the software system under various configuration settings. However, obtaining these performance measurements can be slow and costly. For example, in one of the software systems studied here (a video encoding application called x264), it takes over 1536 hours to obtain performance measurements for 11 out of the 16 possible configuration options (Valov et al., 2017). This is in addition to other time-consuming tasks involved in commissioning these systems, such as setup and teardown. Further, making performance measurements can cost an exorbitant amount of money. For the same system, Figure 1 shows the amount we spent on gathering performance measurements on 2048 different configurations.
For a software system under a new environment, instead of having to make exhaustive cost and time intensive measurements, it makes sense to reuse performance measurements made for previous environments. The concept of reusing information from other sources is the idea behind transfer learning (Nam et al., 2013; Nam and Kim, 2015; Jing et al., 2015; Kocaguneli and Menzies, 2011; Kocaguneli et al., 2015; Turhan et al., 2009; Peters et al., 2015). Specifically, to predict for the optimum configurations in a new environment (referred to as target environment), we may use the performance measures of another workload as a proxy (referred to as the source environment). For performance optimization, such transfer learning approaches have been shown to decrease the cost of learning by a significant amount (Chen et al., 2011; Jamshidi et al., 2017b, a; Valov et al., 2017).
It must be noted that transfer learning methods place an implicit faith in the nature of the source. Several researchers in transfer learning caution that the source must be chosen with care to ensure optimum performance of transfer learners (Yosinski et al., 2014; Long et al., 2015; Afridi et al., 2018). An incorrect choice of source may result in the all too common negative transfer phenomenon (Ben-David and Schuller, 2003; Rosenstein et al., 2005; Pan and Yang, 2010b; Afridi et al., 2018). A negative transfer can be particularly damaging in that it often leads to performance degradation instead of performance optimization (Jamshidi et al., 2017a; Afridi et al., 2018). A preferred way to avoid negative transfer is with source selection (Afridi et al., 2018; Krishna et al., 2016; Krishna and Menzies, 2017). In software engineering, researchers have shown that the so-called bellwether effect can be used to identify source datasets for effective transfer learning (Krishna et al., 2016). This bellwether effect has been shown to be very effective in defect prediction, effort estimation, code-smell detection, etc. (Mensah et al., 2017b; Mensah et al., 2017a; Krishna and Menzies, 2017).
In this work, we introduce the notion of source selection with the bellwether effect for transfer learning in performance optimization. With this, we develop a Bellwether Transfer Learner called BEETLE. We show that, for performance optimization, BEETLE can outperform both non-transfer learning methods and the current state-of-the-art transfer learning methods.
This inquiry is structured around the following research questions.
RQ1: Does there exist a Bellwether Environment?
Purpose: In the first research question, we ask if there exist bellwether environments to train transfer learners for performance optimization. We hypothesize that, if these bellwether environments exist, we can improve the efficacy of transfer learning algorithms.
Approach: To answer this research question, we explore five popular open source software systems (for details see §7.1 and Table 1). These software systems have performance measurements under different environments. In each of these software systems, we train a transfer learning model on one environment (the source) and predict for optima in all the other environments. We repeat this process in a round-robin manner for every environment. We then statistically rank the source environments based on their ability to find near-optimal solutions on the targets. If there exist bellwether environments, then those bellwether environments will have a better rank than all other environments.
RQ2: How many performance measurements are required to discover bellwether environments?
Purpose: Having established that bellwether environments are prevalent, the purpose of this research question is to establish how many performance measurements need to be made in the environments to discover bellwether environments.
Approach: To answer this question, we developed an iterative method based on an incremental sampling strategy to find the bellwether environment. We start with 1% of the configurations from each environment and incrementally increase the number of sampled configurations until we find the bellwether. Before each increment, we eliminate those environments which do not show much promise.
RQ3: How does BEETLE compare to non-transfer learning methods?
Purpose: The alternative to transfer learning is just to use the target data (similar to methods proposed in prior work) to find the near-optimal configurations. Literature is abundant with performance optimization algorithms that do not use transfer learning (Guo et al., 2013; Sarkar et al., 2015; Nair et al., 2017b, a). For our comparisons, we used the performance optimization model proposed by Nair et al. (Nair et al., 2017a) in FSE ’17.
Approach: To answer this research question, we compute the Win-Loss ratios of transfer learning with the bellwether environment (a.k.a. BEETLE) against a regular performance optimization method. In addition to this, we compare the cost of the methods, in terms of the number of measurements needed to learn a model.
RQ4: How does BEETLE compare to state-of-the-art methods?
Purpose: In this research question we compare BEETLE with two other state-of-the-art transfer learners used commonly in performance optimization (for details see §6). The purpose of this research question is to determine if a simple transfer learner like BEETLE with carefully selected source environments can perform as well as other complex transfer learning algorithms that do not perform source selection.
Approach: To answer this question, we compare the performance of the near-optimal configurations found using the bellwether environment to the near-optimal configuration found by other transfer learning methods. The configuration found by the bellwether environment is similar to (or better than) the other transfer learning methods.
Environments: A software system ($S$) has configuration options which can be tweaked to change the performance of the software system ($S$). The software system can be operated in different environments ($e \in E$). Each environment is described by 3 variables, $e = \{h, w, v\}$, drawn from the environment space $E = H \times W \times V$. Here, $H$ represents the hardware, $W$ represents the workload, and $V$ represents the software version. In a software system, the total number of environments is given by $|E|$. A software system ($S$) operating in an environment ($e$) is denoted by $S_e$.
Configuration: Let $c_i$ indicate the $i$-th configuration option of a software system $S$ operating in environment $e$ (denoted by $S_e$), which can either be (1) numeric or (2) boolean. A configuration $c$ is a member of the configuration space $C$. $C$ is a Cartesian product of all possible options, $C = \mathrm{Dom}(c_1) \times \mathrm{Dom}(c_2) \times \dots \times \mathrm{Dom}(c_n)$, where $\mathrm{Dom}(c_i)$ is the set of values that option $c_i$ can take (numeric or boolean in our setting) and $n$ is the number of configuration options.
Performance: Each configuration ($c$) of a system, $S_e$, has a corresponding performance measure associated with it. The configuration and the corresponding performance measure are referred to as the independent and dependent variables, respectively. We denote the performance measure associated with a given configuration ($c$) by $y = f(c)$. We consider the problem of finding the near-optimal configurations ($c^*$) such that $f(c^*)$ is less than that of the other configurations in $C$. That is:

$$f(c^*) \le f(c), \quad \forall c \in C \setminus \{c^*\}$$

Transfer Learning: In transfer learning, we find the near-optimal configuration for a target environment ($S_{e_t}$) by learning from the measurements ($\langle c, y \rangle$) for the same system operating in different source environments ($S_{e_s}$).
Bellwether Environment: We show that, when performing transfer learning, there are exemplar source environments called the bellwether environment(s) ($B \subseteq E$), which are the best source environment(s) to find the near-optimal configuration for the rest of the environments ($E \setminus B$).
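To make the definitions above concrete, here is a minimal sketch of a configuration space built as a Cartesian product of option domains, together with a search for its optimum. The option names and the `measure` function are hypothetical toy stand-ins, not taken from any of the subject systems.

```python
from itertools import product

# Hypothetical option domains: two Boolean options and one numeric option.
domains = {
    "no_cache": [0, 1],
    "fast_mode": [0, 1],
    "threads": [1, 2, 4],
}

# The configuration space is the Cartesian product of the option domains.
config_space = list(product(*domains.values()))


def measure(config):
    """Toy stand-in for running the system; lower is better (e.g. runtime)."""
    no_cache, fast_mode, threads = config
    return 10.0 + 5.0 * no_cache - 3.0 * fast_mode + 8.0 / threads


# The optimum is the configuration with the smallest measured value.
best = min(config_space, key=measure)
```

Exhaustive search is only feasible here because the toy space has 12 configurations; the cost of doing the same on a real system is what motivates the sampling-based methods discussed in this paper.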
In this paper, we propose an alternative transfer learning approach to the current state-of-the-art discussed in the previous section. Our approach has two key components:
Identifying the bellwether environment: To train a transfer model, we use the bellwether effect to discover the best source environments (known as the bellwether environments) among the available environments.
Constructing the transfer model: Next, to perform transfer learning, we use these bellwether environments to train a performance prediction model with a regression tree (Breiman, 1996).
Our key finding is that if a source environment is carefully selected using the bellwether effect, then it is possible to build a simple transfer model without any complex methods and still be able to generate near-optimal configurations in a target environment.
In Figures 2 & 3, we describe BEETLE and list a generic algorithm of BEETLE. In this example, there are 7 source environments, which have been optimized previously, and a set of target environments, which need to be optimized. BEETLE's objective is to find a bellwether among the source environments and use it to find the near-optimal configuration for the target environments. BEETLE, a bellwether-based approach, can be separated into two main steps: (i) finding the bellwether environment, and (ii) using the bellwether environment to find the near-optimal configuration for target environments.
The central idea for finding the bellwether is to use minimal sampling to recursively eliminate environments that do not show promise of being a bellwether, that is, environments that cannot be used to predict near-optimal solutions. Figure 2 is a generic algorithm that defines the process of finding bellwethers. The process starts by sampling a small subset of configurations from the source environments. The size of the subset is controlled by a predefined parameter step_size (Line 6). The cost of sampling the configurations is calculated (Line 8). In our setting, we use the number of measurements as a proxy for cost; this can be replaced by any user-defined cost function (get_cost). To compute the effectiveness of an environment, the sampled configurations along with their performance measures are used to build a performance model (a regression tree in our setting). We choose regression trees because they have been extensively used for performance prediction of configurable software systems and have demonstrated good results (Guo et al., 2013; Sarkar et al., 2015; Nair et al., 2017a, b, 2018; Guo et al., 2017; Valov et al., 2017). This model is then used to predict the optimal configuration among the configurations sampled in Line 6 (Line 10). We use only the sampled configurations because the effectiveness of a source can only be calculated if the actual performance values (associated with the configurations) are known. This process is repeated for all the environments (represented as sources) under consideration. Depending on how well an environment can find the near-optimal configuration for other environments and a user-defined threshold (thres), the non-bellwether environments are eliminated (Line 12). Non-bellwether environments are environments that are not able to find near-optimal configurations for the other environments. If no environment is eliminated (with more data) when compared to the previous iteration (with less data), then a life is lost (Line 14).
When all lives have expired or the budget runs out, the search process terminates. The environment with maximum performance is returned as the bellwether (Line 16). Note that FindBellwether can identify multiple bellwethers. However, in our setting, we return a single bellwether.
The motivation behind using the parameter lives is to detect convergence of the search process. If adding more configurations does not improve the chances of finding the bellwether, the search process should terminate to avoid wasting resources; see also §9.
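The elimination loop described above can be sketched as follows. This is an illustrative reading of the prose, not the algorithm of Figure 2: the learner is a nearest-neighbour stand-in rather than a regression tree, the elimination rule (drop every environment whose score is more than `thres` worse than the best score) is an assumption, and all function names and the toy data are hypothetical.

```python
import random
from itertools import product


def nearest_neighbour_model(sample):
    """Stand-in learner (NOT a regression tree): predict with the measured
    configuration that is closest in Hamming distance."""
    def predict(config):
        closest = min(sample, key=lambda c: sum(a != b for a, b in zip(c, config)))
        return sample[closest]
    return predict


def nar(picked, values):
    """Gap between the picked value and the best value, scaled to [0, 1]."""
    best, worst = min(values), max(values)
    return 0.0 if worst == best else (picked - best) / (worst - best)


def score(source, samples, remaining):
    """Mean NAR when `source` picks the optimum for every other environment."""
    predict = nearest_neighbour_model(samples[source])
    residuals = []
    for target in remaining:
        if target != source:
            measured = samples[target]
            choice = min(measured, key=predict)
            residuals.append(nar(measured[choice], list(measured.values())))
    return sum(residuals) / len(residuals)


def find_bellwether(envs, step_size=8, budget=400, thres=0.05, lives=3, seed=0):
    rng = random.Random(seed)
    remaining = sorted(envs)
    samples = {name: {} for name in remaining}
    scores, cost = {}, 0
    while lives > 0 and cost < budget and len(remaining) > 1:
        for name in remaining:  # measure step_size more configurations each
            unmeasured = [c for c in envs[name] if c not in samples[name]]
            for c in rng.sample(unmeasured, min(step_size, len(unmeasured))):
                samples[name][c] = envs[name][c]
                cost += 1  # cost proxy: number of measurements
        scores = {name: score(name, samples, remaining) for name in remaining}
        cutoff = min(scores.values()) + thres  # assumed elimination rule
        survivors = [name for name in remaining if scores[name] <= cutoff]
        if len(survivors) == len(remaining):
            lives -= 1  # nothing was eliminated despite more data
        remaining = survivors
    return min(remaining, key=lambda name: scores.get(name, 0.0)), cost


# Toy data: A, B and C rank configurations identically; D ranks them in reverse.
configs = list(product([0, 1], repeat=3))
base = {c: 1.0 + 4 * c[0] + 2 * c[1] + c[2] for c in configs}
envs = {
    "A": base,
    "B": {c: 2 * v for c, v in base.items()},
    "C": {c: v + 1 for c, v in base.items()},
    "D": {c: 10 - v for c, v in base.items()},
}
bellwether, cost = find_bellwether(envs, step_size=8)
```

In the toy data, environment D is eliminated in the first round because the configuration it considers best is the worst one everywhere else. The remaining three environments are indistinguishable, so the loop stops once the lives run out.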
Once the bellwether is identified, it can be used to find the near-optimal configurations of target environments. As shown in Figure 3, FindBellwether returns the predicted bellwether environment (Line 3). Performance modeling then samples the bellwether environment (Line 5) for some number of samples (a user-defined parameter called budget). Note that a user might choose to reuse the measurements used in FindBellwether and save on cost. The sampled configurations and their corresponding performance measures are used to build a prediction model (Line 7). Similar to FindBellwether, we use a regression tree as our modeling method of choice. The prediction model is then used to predict the optimal configuration among the unevaluated configurations for the target environment (Lines 8-9). The predicted optimal configuration is returned as the best configuration. This process is then repeated for each target environment.
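The transfer step described above can be illustrated with the following sketch. It assumes that lower performance values are better, uses a deliberately small hand-written regression tree in place of a library implementation, and relies on hypothetical names (`build_tree`, `beetle_transfer`) and toy data.

```python
import random
from itertools import product


def sum_sq(rows):
    """Sum of squared deviations of the performance values from their mean."""
    values = [y for _, y in rows]
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def build_tree(rows, min_size=1):
    """Grow a small regression tree from (configuration, performance) rows."""
    values = [y for _, y in rows]
    leaf = {"leaf": sum(values) / len(values)}
    if len(rows) < 2 * min_size or max(values) == min(values):
        return leaf
    best = None
    for feature in range(len(rows[0][0])):
        for threshold in sorted({x[feature] for x, _ in rows})[:-1]:
            left = [r for r in rows if r[0][feature] <= threshold]
            right = [r for r in rows if r[0][feature] > threshold]
            if min(len(left), len(right)) < min_size:
                continue
            error = sum_sq(left) + sum_sq(right)
            if best is None or error < best[0]:
                best = (error, feature, threshold, left, right)
    if best is None:
        return leaf
    _, feature, threshold, left, right = best
    return {"feature": feature, "threshold": threshold,
            "left": build_tree(left, min_size),
            "right": build_tree(right, min_size)}


def predict(tree, config):
    """Walk from the root to a leaf and return the leaf's mean value."""
    while "leaf" not in tree:
        side = "left" if config[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree["leaf"]


def beetle_transfer(bellwether, target_configs, budget, seed=0):
    """Train on `budget` bellwether measurements, then return the target
    configuration with the lowest predicted performance value."""
    rng = random.Random(seed)
    sampled = rng.sample(sorted(bellwether), min(budget, len(bellwether)))
    tree = build_tree([(c, bellwether[c]) for c in sampled])
    return min(target_configs, key=lambda c: predict(tree, c))


# Toy data: the target is a scaled and shifted copy of the bellwether.
configs = list(product([0, 1], repeat=3))
bellwether = {c: 1.0 + 4 * c[0] + 2 * c[1] + c[2] for c in configs}
target = {c: 3.0 * v + 2.0 for c, v in bellwether.items()}  # never shown to the learner
best = beetle_transfer(bellwether, configs, budget=8)
```

No target measurement is used to pick `best`; the `target` dictionary exists only so that the choice can be checked afterwards.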
In this paper, BEETLE is compared against (a) two state-of-the-art transfer learners from ICPC’17 (Valov et al., 2017) and SEAMS’17 (Jamshidi et al., 2017b); and (b) a non-transfer learner from FSE’17 (Nair et al., 2017b).
Valov et al. (Valov et al., 2017) proposed an approach for transferring performance models of software systems across platforms with different hardware settings. Figure 4 shows the pseudocode of the transfer learning method. The method consists of the following two components:
Performance prediction model: The configurations on the source hardware are sampled using Sobol sampling. The number of configurations is given by $T \times n$, where $T$ is the training coefficient and $n$ is the number of configuration options. These configurations are used to construct a regression tree model.
To transfer the predictions from the source to the target, the authors construct a linear regression model since it was found to provide good approximations of the transfer function. To construct this model, a small number of random configurations are obtained from the source and the target.
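The two components above can be sketched as follows. This is a simplified illustration rather than the pseudocode of Figure 4: `source_model` is a toy function standing in for the regression tree trained on the source hardware, `measure_on_target` is a hypothetical target platform, and the paired configurations are hand-picked instead of randomly sampled.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_transfer_model(source_model, paired_configs, target_measurements):
    """Map source predictions onto the target platform with a fitted line."""
    source_predictions = [source_model(c) for c in paired_configs]
    slope, intercept = fit_line(source_predictions, target_measurements)
    return lambda config: slope * source_model(config) + intercept


def source_model(config):
    """Toy stand-in for the regression tree trained on the source hardware."""
    return 1.0 + 4 * config[0] + 2 * config[1] + config[2]


def measure_on_target(config):
    """Hypothetical target hardware: three times slower plus a fixed overhead."""
    return 3.0 * source_model(config) + 2.0


# A small number of configurations measured on both platforms.
paired = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]
target_model = make_transfer_model(
    source_model, paired, [measure_on_target(c) for c in paired])
prediction = target_model((1, 1, 1))
```

Because the toy target is an exact linear function of the source, four paired measurements recover it perfectly; on real hardware the line is only an approximation of the transfer function.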
Jamshidi et al. (Jamshidi et al., 2017b) took a slightly different approach to transfer learning. They used Gaussian Processes (GP) to find the relatedness between the performance measures in the source and the target as well as between the configurations. The relationships between input configurations were captured in the GP model using a covariance matrix that defined the kernel function used to construct the Gaussian process model. To encode the relationships between the measured performance of the configurations in the source and the target, the authors proposed adding a scaling factor to the above kernel.
Table 1: Subject systems.

|Domain|System|Description|Configurations|Options|Environments|Hardware|Workloads|Versions|
|---|---|---|---|---|---|---|---|---|
|SAT solver|Spear|An industrial strength bit-vector arithmetic decision procedure and a Boolean satisfiability (SAT) solver. It is designed for proving software verification conditions and it is used for bug hunting.|16384|14|10| | | |
|Video encoder|x264|x264 is a video encoder that compresses video files, with options to adjust output quality, encoder types, and encoding heuristics.|4000|16|21| | | |
|Database|SQLite|SQLite is a lightweight relational database management system, embedded in several browsers and operating systems.|1000|14|15|2|13|2|
|Compiler|SaC|SaC is a compiler for high-performance computing.|846|50|7|2|10|2|
|Stream processing|Storm|Apache Storm is a distributed stream processing computation framework written predominantly in the Clojure programming language.|2048|12|4|2|3|2|
The new kernel function is defined as follows:

$$k(s, t, f(s), f(t)) = k_t(f(s), f(t)) \times k_{xx}(s, t)$$

where $k_t(f(s), f(t))$ represents the multiplicative scaling factor. $k_t(f(s), f(t))$ is given by the correlation between the source function $f(s)$ and the target function $f(t)$, while $k_{xx}(s, t)$ is the covariance function for input environments (s & t). The essence of this method is that the kernel captures the interdependence between the source and target configurations and their corresponding performance measurement values.
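The multiplicative structure of this kernel can be illustrated with a short sketch. It assumes Pearson correlation for the scaling factor and a squared-exponential covariance over configurations; both are illustrative choices and not necessarily those of Jamshidi et al., and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of measurements."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def rbf(config_a, config_b, length_scale=1.0):
    """Squared-exponential covariance between two configurations (assumed)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(config_a, config_b))
    return math.exp(-squared_distance / (2 * length_scale ** 2))


def transfer_kernel(config_a, config_b, source_perf, target_perf):
    """Scaling factor (source/target correlation) times the input covariance."""
    return pearson(source_perf, target_perf) * rbf(config_a, config_b)


# Performance of the same four configurations in the source and the target.
source_perf = [1.0, 2.0, 3.0, 4.0]
target_perf = [2.0, 4.0, 6.0, 8.0]
same = transfer_kernel((0, 1), (0, 1), source_perf, target_perf)
different = transfer_kernel((0, 0), (1, 1), source_perf, target_perf)
```

When the source and target are perfectly correlated, the scaling factor is 1 and the kernel reduces to the plain covariance between configurations; a weakly correlated source shrinks every entry, so its measurements influence the target predictions less.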
A performance optimization model with no transfer was proposed by Nair et al. (Nair et al., 2017b) in FSE ’17. It works as follows:
Sample a small set of measurements of configurations from the target environment
Construct performance prediction model with regression trees.
Predict for near-optimal configurations.
The key distinction here is that, unlike transfer learners, which use a different source environment to build a model that predicts near-optimal configurations in a target environment, a non-transfer method such as this uses configurations from within the target environment itself.
In this study, we selected five configurable software systems from different domains, with different functionalities, and written in different programming languages. We selected these real-world software systems since their characteristics cover a broad spectrum of scenarios. Table 1 lists the details of the software systems used here. The rest of this section provides a summary of the subject systems.
Spear is an industrial strength bit-vector arithmetic decision procedure and Boolean satisfiability (SAT) solver. It is designed for proving software verification conditions, and it is used for bug hunting. We considered a configuration space with 14 options, or $2^{14} = 16{,}384$ configurations. We measured how long it takes to solve a SAT problem in all 16,384 configurations in 10 environments.
x264 is a video encoder that compresses video files and has 16 configuration options to adjust output quality, encoder types, and encoding heuristics. Due to the size of the configuration space, we randomly sample 4000 configurations in 21 environments.
SQLite is a lightweight relational database management system, embedded in several browsers and operating systems, which has 14 configuration options to change indexing and features for size compression. Due to a limited budget, we use 1000 randomly selected configurations in 15 different environments.
SaC is a compiler for high-performance computing. The SaC compiler implements a large number of high-level and low-level optimizations to tune programs for efficient parallel executions. It has 50 configuration options to control optimization options. We measure the execution time of a program compiled in 71,267 randomly selected configurations.
Storm is a distributed stream processing framework which is used for data analytics. We run three benchmarks and measure the latency of the benchmark in 2,048 randomly selected configurations to assess the performance impact of Storm's options.
Typically, performance models are evaluated based on accuracy or error using measures such as the mean magnitude of relative error (MMRE), given by:

$$\mathrm{MMRE} = \frac{1}{m} \sum_{i=1}^{m} \frac{|\mathit{actual}_i - \mathit{predicted}_i|}{\mathit{actual}_i}$$

We note that MMRE, along with other accuracy statistics such as MBRE, has drawn a lot of criticism because it has been shown to cause conclusion instability (Myrtveit and Stensrud, 2012; Myrtveit et al., 2005; Foss et al., 2003).
While this is typical in performance prediction, our objective is performance optimization, that is, finding near-optimal configurations. For this, measures such as MMRE are not applicable (Nair et al., 2017b). Recent work by Nair et al. (Nair et al., 2017b) has shown that, when MMRE cannot be used, other measures such as rank difference may be used; these emphasize the sorted order of configurations and their performances rather than the accuracy of the predictions.
Once the performance model is trained, the accuracy of the model is measured by sorting the values of $y = f(c)$ from ‘small’ to ‘large’, that is:

$$f(c^{(1)}) \le f(c^{(2)}) \le \dots \le f(c^{(m)})$$
The predicted rank order is then compared to the actual rank order. We note that rank difference, though effective, is not particularly informative since it is sensitive to the workload. This means that in some workloads, a small difference in performance measure can lead to a large rank difference and vice versa. For example, in Spear_0 the performance of the optimal configuration (rank 1) and that of the 100th best configuration differ by only 0.09%, which means that a large rank difference of 100 does not imply poor performance.
Hence, we define a performance measure called Normalized Absolute Residual (NAR), which represents the difference between the actual performance measurements of the optimal configuration and the predicted optimal configuration. This difference is normalized by the difference between the actual best and worst configurations:

$$\mathrm{NAR} = \frac{\left| f(\hat{c}^{*}) - \min_{c \in C} f(c) \right|}{\max_{c \in C} f(c) - \min_{c \in C} f(c)}$$

where $\hat{c}^{*}$ is the predicted optimal configuration and $c \in C$. This measure is similar to Absolute Residual (lower is better). However, in our setting the ranges of the performance measures across different environments are not equal (hence the need for a normalization step).
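The contrast between rank difference and NAR can be seen in a small worked example. The data below are synthetic (200 configurations whose performance values are almost identical except for one very slow outlier), and the helper names are hypothetical; NAR is computed as a fraction here, which is an assumption about the scaling.

```python
def nar(predicted_best_value, actual_values):
    """Residual of the picked configuration, normalized by the actual range."""
    best, worst = min(actual_values), max(actual_values)
    return abs(predicted_best_value - best) / (worst - best)


def rank_of(value, actual_values):
    """Zero-based rank of `value` among the actual values (0 is the optimum)."""
    return sorted(actual_values).index(value)


# Synthetic environment: 199 nearly identical configurations and one outlier.
actual = [100.0 + 0.001 * i for i in range(199)] + [200.0]

# Suppose a model picks the configuration that is actually ranked 100.
picked = actual[100]
rank_difference = rank_of(picked, actual)
residual = nar(picked, actual)
```

A rank difference of 100 looks poor, yet the NAR is about 0.001, that is, the picked configuration is within a tenth of a percent of the optimum relative to the full performance range.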
Our experiments discussed in RQ1 and RQ4 are all subject to inherent randomness introduced by sampling configurations or by different source and target environments. To overcome this, we use 30 repeated runs, each time with a different random number seed. The repeated runs provide us with a sufficiently large sample size for statistical comparisons. Each repeated run collects the values of NAR to assess the transfer learners.
To rank these 30 numbers collected as above, we use the Scott-Knott test recommended by Mittas and Angelis (Mittas and Angelis, 2013). The Scott-Knott test has been endorsed by several SE researchers (Leech and Onwuegbuzie, 2002; Poulding and Clark, 2010; Arcuri and Briand, 2011; Shepperd and MacDonell, 2012; Kampenes et al., 2007; Kocaguneli et al.). Scott-Knott is a non-parametric statistical test that performs a bootstrap test with 95% confidence (Efron and Tibshirani, 1993) to determine the existence of statistically significant differences. This is followed by an A12 test to check that any observed differences are not trivially small effects (Vargha and Delaney, 2000). We say that a “small” effect has . The Scott-Knott test results in treatments being ranked from best to worst. Note that if a set of treatments are not significantly different, they will have the same rank.
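The effect-size check mentioned above can be illustrated with the Vargha-Delaney A12 statistic, which estimates the probability that a value drawn from one sample is larger than a value drawn from another. The sketch below shows only this statistic, not the full Scott-Knott ranking procedure, and the sample values are made up.

```python
def a12(xs, ys):
    """Vargha-Delaney A12: probability that a value from xs exceeds one from ys,
    counting ties as half."""
    more = sum(1 for x in xs for y in ys if x > y)
    same = sum(1 for x in xs for y in ys if x == y)
    return (more + 0.5 * same) / (len(xs) * len(ys))


# Identical samples give 0.5 (no effect); fully separated samples give 1.0.
no_effect = a12([1, 2, 3], [1, 2, 3])
large_effect = a12([4, 5, 6], [1, 2, 3])
```

Values near 0.5 indicate that the two treatments are practically indistinguishable, even if a significance test reports a difference.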
Purpose: The first research question seeks to establish the presence of bellwether environments within different environments of a software system. We hypothesize that, if these bellwether environments exist, we can improve the transfer learning algorithms.
Approach: For each subject software system, we use the environments to perform a pair-wise comparison method (similar to leave one out testing) as follows:
1. We pick one environment as a source and construct a transfer learner.
2. Next, we use the remaining environments as targets. For every target environment, we use the transfer learner constructed in the previous step to predict the optimum configuration.
3. Then, we measure the NAR of the predictions (see §7.2).
4. Afterward, we repeat steps 1, 2, and 3 for all the source environments and gather the outcomes.
5. Finally, we use the Scott-Knott test to rank each environment (and its usefulness as a source).
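The pairwise procedure above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the "transfer learner" is replaced by a trivial stand-in that reuses the source's best configuration in every target, all environments are assumed to measure the same index-aligned configurations, and sources are ordered by mean NAR instead of by a Scott-Knott test. The environment names and measurements are hypothetical.

```python
from statistics import mean


def nar(actual_perf, predicted_index):
    """Normalized Absolute Residual (lower is better)."""
    best, worst = min(actual_perf), max(actual_perf)
    if worst == best:
        return 0.0
    return abs(actual_perf[predicted_index] - best) / (worst - best)


def evaluate_sources(envs):
    """Return {source: [NAR in each other environment]}."""
    results = {}
    for source, source_perf in envs.items():
        # Stand-in for a transfer learner: reuse the source's best configuration.
        predicted = min(range(len(source_perf)), key=lambda i: source_perf[i])
        results[source] = [
            nar(target_perf, predicted)
            for target, target_perf in envs.items()
            if target != source
        ]
    return results


# Hypothetical measurements: one list per environment, one entry per configuration.
envs = {
    "env_a": [3.0, 1.0, 4.0, 2.0],
    "env_b": [6.0, 2.5, 9.0, 5.0],
    "env_c": [1.0, 2.0, 4.0, 3.0],
}
scores = evaluate_sources(envs)
# Simplification: order sources by mean NAR rather than by Scott-Knott rank.
ranking = sorted(scores, key=lambda s: mean(scores[s]))
print(ranking)
```

In this toy data, the source whose best configuration also performs well elsewhere ends up at the front of the ordering, which is the behavior expected of a bellwether.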
Summary: Our results are shown in Figure 6. Overall, we find that in every subject system there is always at least one environment (the bellwether environment) that is much superior to the others. Storm is an interesting case: all of its environments are ranked 1, which means that they are all equally useful as a bellwether environment. Further, we note that the variance in the bellwether environments is much lower than in the other environments.
Result: In each subject system there exist environments that act as bellwether environments and hence can be used to find near-optimal configurations for the rest of the environments.
Purpose: The bellwether environments found in RQ1 required us to use 100% of the measurements from all the environments. This may not be economical in a real-world scenario. It can be prohibitively expensive to run and test all configurations of subject systems (as done in RQ1) since their configuration spaces are large. Thus, in this research question, we ask if we can find the bellwether environments sooner using fewer configurations.
Approach: We developed an iterative method, based on an incremental sampling strategy, to find the bellwether environment. It works as follows:
1. We start from 1% of configurations from each environment and assume that every environment is a potential bellwether environment.
2. Then, we increment the number of configurations in steps of 1% and measure the NAR values.
3. We rank the environments and eliminate those that do not show much promise. A detailed description of how this is accomplished can be found in §5.
4. We repeat the above steps until we cannot eliminate any more environments.
The above strategy uses a threshold derived from the NAR values at each step to eliminate non-bellwether environments. To function correctly, this requires the NAR values to follow a normal distribution. If normality is violated, we use power transforms to make the data more normal distribution-like. We note that this is a prevalent strategy commonly used in other domains to reduce the size of options or alternatives (Borgelt, 2005) and is also known as backward elimination (Blum and Langley, 1997).
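The incremental search described above might be sketched in Python as follows. This is an illustration under several assumptions rather than the procedure of §5: the elimination rule (drop any source whose mean NAR exceeds the best mean NAR by more than one standard deviation of the current scores) is hypothetical, the "transfer learner" is a stand-in that reuses the source's best sampled configuration, the same sampled configuration indices are used in every environment, at least two environments are assumed, and no normality check or power transform is applied.

```python
import random
from statistics import mean, pstdev


def nar(actual_perf, predicted_index):
    """Normalized Absolute Residual (lower is better)."""
    best, worst = min(actual_perf), max(actual_perf)
    if worst == best:
        return 0.0
    return abs(actual_perf[predicted_index] - best) / (worst - best)


def source_score(source, envs, sample):
    """Mean NAR over all other environments, using only sampled configurations."""
    predicted = min(sample, key=lambda i: envs[source][i])
    return mean(nar(perf, predicted) for name, perf in envs.items() if name != source)


def find_bellwethers(envs, step=0.01, seed=0):
    """Grow the sample by `step` per round and prune unpromising sources."""
    n = len(next(iter(envs.values())))
    order = list(range(n))
    random.Random(seed).shuffle(order)
    increment = max(1, int(step * n))
    candidates = set(envs)  # every environment starts as a potential bellwether
    used = 0
    while used < n:
        used = min(n, used + increment)
        sample = order[:used]
        scores = {s: source_score(s, envs, sample) for s in candidates}
        # Hypothetical threshold: best score plus one standard deviation.
        cutoff = min(scores.values()) + pstdev(list(scores.values()))
        survivors = {s for s in candidates if scores[s] <= cutoff}
        if survivors == candidates:
            break  # nothing more can be eliminated
        candidates = survivors
    return candidates, used


# Hypothetical measurements; step=1.0 uses every configuration in one round.
envs = {
    "env_a": [3.0, 1.0, 4.0, 2.0],
    "env_b": [6.0, 2.5, 9.0, 5.0],
    "env_c": [1.0, 2.0, 4.0, 3.0],
}
found, used = find_bellwethers(envs, step=1.0)
print(sorted(found), used)
```

The returned `used` count indicates how many configurations per environment were measured before the candidate set stopped shrinking, which is the cost that this research question aims to reduce.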
To see if our proposed method is effective, we compare the performance of the actual bellwether environment (found with 100% of the configurations) with that of the predicted bellwether environment.
Summary: Table 2 summarizes our findings. We find that:
In all 5 cases, using at most 10% of the configurations, we find one of the bellwether environments that are found with 100% of the configurations (see the Rank column in Table 2).
In terms of quality of predictions, the NAR values of the bellwether environments predicted with 10% of the configurations differ by less than 1.0% from those of the bellwether found at 100%.
Our results are encouraging in that they demonstrate how the bellwether environments can be discovered very fast with just a fraction of the original configuration size. Since measuring fewer configurations takes less time and is cheaper, we can assert that discovering bellwether environments can be very economical.
Result: The bellwether environment can be recognized using only a fraction of the measurements (under 10%), and the identified bellwether environments have similar NAR values to the actual bellwether environment.
Motivation: Having established that there exist bellwether environments in the subject systems (RQ1) and that they can be found with very few measurements (RQ2), in this research question we explore how BEETLE compares to a non-transfer learning approach. For our experiment, we use the non-transfer performance optimizer proposed by Nair et al. (Nair et al., 2017b). More details on Nair et al.’s method can be found in §6.3.
Both BEETLE and Nair et al.’s method seek to achieve the same goal: find the optimal configuration in a target environment. BEETLE uses configurations from a different source to achieve this, whereas the non-transfer learner uses configurations from within the target.
Approach: Our setup involves evaluating the Win/Loss ratio of BEETLE to the non-transfer learning algorithm while predicting the optimal configuration. Comparing against the true optima, we define a “win” as a case where BEETLE finds a better (or the same) optimum than the non-transfer learner, and a “loss” otherwise.
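The win/loss tally can be illustrated with a few lines of Python. This sketch assumes that closeness to the true optimum is expressed as a NAR value per target environment (lower is better) and that both lists are aligned by target; the function name `win_loss` and the sample values are hypothetical.

```python
def win_loss(beetle_nar, baseline_nar):
    """Count targets where BEETLE is at least as close to the true optimum."""
    if len(beetle_nar) != len(baseline_nar):
        raise ValueError("both lists must cover the same targets")
    # A tie counts as a win, following the definition of "better (or same)".
    wins = sum(1 for b, o in zip(beetle_nar, baseline_nar) if b <= o)
    losses = len(beetle_nar) - wins
    return wins, losses


# Hypothetical NAR values for three target environments.
wins, losses = win_loss([0.0, 0.1, 0.3], [0.0, 0.2, 0.1])
print(wins, losses)
```

Counting ties as wins reflects the argument that a transfer learner matching the non-transfer learner is still preferable when it needs fewer measurements in the target.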
Summary: Our results are shown in Figures 7 and 8. In Figure 8, the x-axis represents the number of configurations (expressed in %) to train the non-transfer learner and BEETLE, and the y-axis represents the number of wins/losses. From this figure we observe:
Better performance: In 4 out of 5 systems, BEETLE “wins” significantly more often than it “loses”. This means that BEETLE is better than (or at least as good as) non-transfer learning methods.
Lower cost: In terms of cost, we note that BEETLE outperforms the non-transfer learner significantly, “winning” at configurations of 10% to 100% of the original sample size. Further, when we look at the trade-off between performance and number of measurements in Figure 7, we note that BEETLE achieves an NAR close to zero with around 100 samples. On the other hand, the non-transfer learning method of Nair et al. (Nair et al., 2017b) has a significantly larger NAR while also requiring larger sample sizes.
Result: BEETLE performs better than (or the same as) a non-transfer learning approach. BEETLE is also cost- and time-efficient, as it requires far fewer measurements.
Purpose: The main motivation of this work is to show that the source environment can have a significant impact on transfer learning. In this research question, we seek to compare BEETLE with two other state-of-the-art transfer learners by Jamshidi et al. (Jamshidi et al., 2017b) and Valov et al. (Valov et al., 2017). For further details of these methods, see §6.
Approach: To perform our comparisons, we use a Scott-Knott test to rank the NAR values. These values indicate the % performance difference between the estimated and the actual near-optimal configurations.
Summary: Our results are shown in Figure 9. In this figure, the best transfer learner is ranked 1. We note that in 4 out of 5 cases, the baseline transfer learner based on source selection performs just as well as (or better than) the state-of-the-art. This result is encouraging in that it points to the significant impact that choosing a good source environment can have on the performance of transfer learners. Further, in Figure 10 we compare the number of performance measurements required to construct the transfer learners (note the logarithmic scale on the vertical axis). BEETLE requires far fewer measurements than the other transfer-learning methods.
Result: In most software systems, BEETLE performs just as well as (or better than) other state-of-the-art transfer learners for performance optimization using far fewer measurements.
What is the trade-off between hyper-parameters and effectiveness of BEETLE? In Figure 11, we show the trade-off between the hyper-parameters (budget, lives) and NAR values (effectiveness). We note that the performance is correlated with the budget and the number of lives. That is, as the budget increases, the NAR value decreases. Since our objective is to minimize the number of measurements while reducing overall NAR, we assign the value of 5 to lives and 10% to budget for our experiments.
When are bellwethers ineffective? The existence or discovery of bellwethers depends on the following: (a) Metrics used: Finding bellwethers using metrics that are not justifiable may be unsuccessful, e.g., trying to discover bellwethers in performance optimization by measuring MMRE instead of NAR will fail (see http://tiny.cc/bw_metrics) (Nair et al., 2017b); (b) Different Software System: Bellwethers of a certain software system ‘A’ may not work for software system ‘B’; and (c) Different Performance Measures: A bellwether discovered for one performance measure (time) may not work for other performance measures (throughput).
Is BEETLE applicable in other domains? Yes, BEETLE can be applied to any transfer learning application where the choice of the source data impacts the performance of transfer learning. This includes problems such as configuring big data systems (Jamshidi and Casale, 2016), finding a suitable cloud configuration for a workload (Hsu et al., 2018, 2017), and configuring hyper-parameters of machine learning algorithms (Fu et al., 2016a, b; Afridi et al., 2018).
External validity: We selected a diverse set of subject systems and a large number of selected environment changes, but, as usual, one has to be careful when generalizing to other subject systems and environment changes. Even though we tried to run our experiment on a variety of software systems from different domains, we cannot generalize our results beyond these software systems.
Internal validity: Due to the size of configuration spaces, we could only measure configurations exhaustively in one subject system and had to rely on sampling (with substantial sampling size) for the others, which may miss effects in parts of the configuration space that we did not sample. We did not encounter any surprisingly different observation in our exhaustively measured SPEAR dataset. Measurement noise in benchmarks can be reduced but not avoided. We performed benchmarks on dedicated systems and repeated each measurement 3 times. We repeated experiments when we encountered unusually large deviations.
Parameter bias: With all the transfer learners and predictors discussed here, there are a number of internal parameters that have been set by default. The result of changing these parameters may (or may not) have a significant impact on the outcomes of this study.
Our approach exploits the bellwether effect: there are one or more bellwether environments that can be used to find good configurations for the rest of the environments. We also propose a new transfer learning method, called BEETLE, which exploits this phenomenon. We show that BEETLE can quickly identify the bellwether environments with only a few measurements and use them to find near-optimal solutions in the target environments. We have done extensive experiments with 5 highly configurable systems demonstrating that BEETLE can (i) identify the most suitable source to construct transfer learners, (ii) find near-optimal configurations with only a small number of measurements (a fraction of the configuration space), (iii) perform as well as non-transfer learning approaches, and (iv) perform as well as state-of-the-art transfer learners.
On automated source selection for transfer learning in convolutional neural networks. Journal of Pattern Recognition (2018).
International Conference on Software Engineering and Knowledge Engineering.