The idea of optimising the configuration of a compiler for a particular application or set of applications is not new. The Milepost GCC project Fursin et al. (2011) is perhaps the most prominent example; it uses machine learning to dynamically determine the best level of optimisation and, in an iterative process, can improve execution time, code size, compilation time and other metrics. The approach has been integrated into the widely-used GCC compiler. Other approaches that optimise code generation for C programs include Haneda, Knijnenburg, and Wijshoff (2005); Pan and Eigenmann (2006); Plotnikov et al. (2013). While most of these target the GCC compiler, some work exists on LLVM as well Fursin et al. (2014).
Another focus of research for automatic dynamic optimisation of compiled code has been the Jikes Java compiler Alpern et al. (2005). Hoste, Georges, and Eeckhout (2010) use multi-objective evolutionary search to identify configurations that are Pareto-optimal in terms of compilation time and code quality. Cavazos and O’Boyle (2006) learn logistic regression models that predict the best optimisation to apply to a method. Kulkarni and Cavazos (2012) use artificial neural networks to determine the order in which a set of optimisations should be applied during compilation.
A major concern with all compiler configuration optimisation approaches is the computational effort required to determine a good or optimal configuration. If this is too large, any benefits gained through the optimisation may be negated. One approach to reducing the initial overhead is to move the configuration process online and to learn to identify good configurations over successive compilations, but other approaches have been explored in the literature (see, e.g. Thomson et al. (2010); Ansel et al. (2012); Tartara and Crespi Reghizzi (2013)).
Automated Algorithm Configuration
Most software has switches, flags and options through which the user can control how it operates. As the software becomes more complex or is used to solve more challenging and diverse problems, the number of these options also tends to increase. While some of these parameters control the input/output behaviour of a given piece of software or algorithm, others merely affect efficiency in terms of resource use.
The algorithm configuration problem is concerned with finding the best parameter configuration for a given algorithm on a set of inputs, where the definition of “best” can vary, depending on the given application scenario. In many practical cases, the goal is to achieve better performance, and this is how we use algorithm configuration here – we want to achieve the same functionality, but with reduced resource requirements. Specifically, in this work we focus on minimizing the CPU time required, but in principle, any scalar measure of performance can be used.
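Stated more formally (in one common formulation; the notation here is ours): given a target algorithm with configuration space $\Theta$, a set of inputs $I$, and a performance metric $m(\theta, i)$, here the CPU time required when running the algorithm with configuration $\theta \in \Theta$ on input $i \in I$, algorithm configuration seeks

$$\theta^{*} \in \operatorname*{arg\,min}_{\theta \in \Theta} \; \frac{1}{|I|} \sum_{i \in I} m(\theta, i).$$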
Finding the best parameter configuration for a given algorithm is a long-standing problem. Humans tend to be bad at solving it – evaluating parameter configurations requires substantial effort, and interactions between parameters may be complex and unintuitive. Minton (1996) notes that,
“Unlike our human subjects, [the system] experimented with a wide variety of combinations of heuristics. Our human subjects rarely had the inclination or patience to try many alternatives, and on at least one occasion incorrectly evaluated alternatives that they did try.”
Fortunately, there exist many automated procedures for algorithm configuration. Perhaps the simplest approach is to try all combinations of parameter values. This approach is known as a full factorial design in the statistics literature on experimental design and as grid search in computer science (specifically, in machine learning); its main disadvantage lies in its high cost – the number of configurations to be evaluated grows exponentially with the number of parameters and their values. For most practical applications, including the ones we consider in the following, complete grid search is infeasible.
A commonly used alternative is simple random sampling: Instead of evaluating every combination of parameter values, we randomly sample a small subset. This is much cheaper in practice and achieves surprisingly good results Bergstra and Bengio (2012). Indeed, in machine learning, random sampling is a widely used method for hyper-parameter optimisation. Unfortunately, when searching high-dimensional configuration spaces, random sampling is known to achieve poor coverage and can waste substantial effort evaluating poorly performing candidate configurations.
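To make the contrast concrete, the following sketch implements both approaches over a small hypothetical parameter space; the parameter names and the evaluate stand-in are illustrative, not the actual engine parameters:

```python
import itertools
import random

# Hypothetical configuration space: each parameter maps to its domain of values.
space = {
    "opt_level": [0, 1, 2, 3],
    "inline_threshold": [25, 100, 225],
    "gc_heap_mb": [64, 256, 1024],
}

def evaluate(config):
    """Stand-in for running the target algorithm with `config` and
    measuring its CPU time; here just a random number."""
    return random.random()

# Grid search: the number of evaluations is the product of the domain
# sizes, i.e. exponential in the number of parameters (4*3*3 = 36 here).
grid = [dict(zip(space, values)) for values in itertools.product(*space.values())]
best_grid = min(grid, key=evaluate)

# Random sampling: evaluate only a fixed budget of randomly drawn configurations.
samples = [{p: random.choice(dom) for p, dom in space.items()} for _ in range(15)]
best_random = min(samples, key=evaluate)
```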
A more sophisticated approach to algorithm configuration is provided by so-called racing methods Birattari et al. (2002), which iteratively evaluate candidate configurations on a series of inputs and eliminate candidates as soon as they can be shown to significantly fall behind the current leader of this race. Local search based configurators, on the other hand, iteratively improve a given configuration by applying small changes and avoid stagnation in local optima by means of diversification techniques (see, e.g., Hutter et al. (2009)).
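The core of a racing procedure can be sketched in a few lines; for simplicity, this version eliminates candidates using a fixed performance margin rather than the statistical tests employed by actual racing methods such as F-Race:

```python
def race(candidates, instances, evaluate, margin=1.2):
    """Race `candidates` over a stream of `instances`, dropping clear losers.

    `evaluate(config, instance)` returns the measured cost (e.g. CPU time);
    real racing methods use statistical tests instead of the fixed `margin`.
    """
    survivors = list(candidates)
    costs = {id(c): [] for c in survivors}
    for instance in instances:
        for c in survivors:
            costs[id(c)].append(evaluate(c, instance))
        means = {id(c): sum(costs[id(c)]) / len(costs[id(c)]) for c in survivors}
        leader = min(means.values())
        # Keep only candidates within `margin` of the current leader.
        survivors = [c for c in survivors if means[id(c)] <= margin * leader]
        if len(survivors) == 1:
            break
    return survivors
```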
More recently, model-based algorithm configuration methods have gained prominence. These are based on the key idea of constructing a model of how the parameters affect performance; this empirical performance model is used to select candidate configurations to be evaluated, and is updated based on the results of those runs. Arguably the best-known model-based configurator (and the current state of the art) is SMAC Hutter, Hoos, and Leyton-Brown (2011), which we use in the following.
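The central loop of model-based configuration can be sketched as follows; this illustrates the general idea only and is not SMAC itself, which additionally handles categorical and conditional parameters, capped runs, and an explicit exploration–exploitation criterion. The sketch assumes numeric configurations and uses a random forest model, mirroring SMAC’s model class (requires scikit-learn):

```python
import random
from sklearn.ensemble import RandomForestRegressor

def smbo(sample_config, evaluate, n_init=10, budget=50):
    """Minimal sequential model-based optimisation loop (illustrative only).

    `sample_config()` draws a configuration as a fixed-length list of numbers;
    `evaluate(config)` returns its measured cost (e.g. CPU time).
    """
    configs = [sample_config() for _ in range(n_init)]
    costs = [evaluate(c) for c in configs]
    for _ in range(budget - n_init):
        # Fit an empirical performance model on all observations so far.
        model = RandomForestRegressor(n_estimators=50).fit(configs, costs)
        # Evaluate the candidate the model predicts to perform best; real
        # SMBO methods instead optimise an acquisition function such as
        # expected improvement to balance exploration and exploitation.
        candidates = [sample_config() for _ in range(100)]
        best = min(candidates, key=lambda c: model.predict([c])[0])
        configs.append(best)
        costs.append(evaluate(best))
    return configs[costs.index(min(costs))]
```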
SMAC and the other general-purpose algorithm configuration methods mentioned above have been applied with great success to a broad range of problems, including propositional satisfiability Hutter, Hoos, and Stützle (2007), mixed integer programming Hutter, Hoos, and Leyton-Brown (2010), machine learning classification and regression Thornton et al. (2013), and improving the performance of garbage collection in Java Lengauer and Mössenböck (2014).
The existence of effective algorithm configuration procedures has implications for the design and development of high-performance software. Namely, rather than limiting design choices and configurable options to make it easier (for human developers) to find good settings, there is now an incentive to introduce, expose and maintain many design choices, and to let automated configuration procedures find performance-optimized configurations for specific application contexts. This is the core idea behind the recent Programming by Optimization (PbO) paradigm Hoos (2012).
Furthermore, in typical applications of automated algorithm configuration, developers need to carefully construct a set of ‘training’ inputs that is representative of those encountered in the intended application context of the algorithm or software to be configured. If automated configuration is applied to produce a performance-optimised configuration using training inputs unlike those seen in typical use, the resulting configuration is unlikely to perform as well in the actual application as on the training set used as the basis for configuration. (This, of course, also holds for manual configuration, but the effect tends to become more pronounced if more effective optimisation methods are used.)
[Table: number of parameters of each type]
Our benchmark suite comprises the Octane 2.0 (https://developers.google.com/octane/), SunSpider 1.0.2 (https://www.webkit.org/perf/sunspider/sunspider.html), Kraken 1.1 (http://krakenbenchmark.mozilla.org) and Ostrich Khan et al. (2014) benchmark sets. We created harnesses that allowed us to execute and measure these benchmarks programmatically, outside of a browser environment. We note that the techniques we use here readily extend to browser-based settings, although the integration effort would be higher.
The SunSpider 1.0.2 benchmark set was developed by the WebKit team and contains 26 problem instances representing a variety of different tasks that are relevant to real-world applications, including string manipulation, bit operations, date formatting and cryptography.
Kraken 1.1 was developed by Mozilla and contains 14 problem instances that were extracted from real-world applications and libraries. These benchmarks primarily cover web-specific tasks (e.g., JSON parsing), signal processing (e.g., audio and image processing), cryptography (e.g., AES, PBKDF2, and SHA256 implementations) and general computational tasks, such as combinatorial search.
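As an illustration of such a harness, the following sketch runs a benchmark script under a JavaScript engine binary and measures its running time; the engine path and the flag syntax are assumptions for illustration and differ between JSC and V8 builds:

```python
import subprocess
import time

ENGINE = "/usr/local/bin/jsc"  # hypothetical path to the engine binary

def run_benchmark(script, options, timeout=60):
    """Run `script` under the engine with the given option flags; return
    the elapsed time in seconds, or None if the run timed out or crashed.

    Wall-clock time is shown for simplicity; our experiments measured CPU time.
    """
    cmd = [ENGINE] + [f"--{k}={v}" for k, v in options.items()] + [script]
    start = time.perf_counter()
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None  # treat crashes like timeouts
    return time.perf_counter() - start
```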
All of the experiments reported in the following were performed using a single Microsoft Azure Cloud instance of type “G5” running a standard installation of Ubuntu 15.04. This instance type has two 16-core processors and a total of 448 GB of RAM; it is the sole user of the underlying hardware, based on a one-to-one mapping to two Intel Xeon E5-2698A v3 processors.
For each of our configuration scenarios, we performed 25 independent runs of SMAC with a 1 CPU-day runtime cutoff, allocating a maximum of 60 CPU seconds to each run on a particular problem instance. The objective to be minimised by SMAC is the so-called Penalised Average Runtime (PAR) score, which assigns to timed-out and crashing runs a penalty of 10 times the runtime cutoff (PAR-10) and to all other runs the CPU time used. This strongly biases the configurator against selecting bad and invalid configurations.
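Computing the PAR-10 score from a set of measured runtimes is straightforward; a minimal sketch, with None marking timed-out or crashed runs:

```python
def par10(runtimes, cutoff=60.0):
    """Penalised Average Runtime with penalty factor 10: timed-out and
    crashed runs (None) count as 10 * cutoff, all others as their CPU time."""
    scores = [t if t is not None and t < cutoff else 10.0 * cutoff
              for t in runtimes]
    return sum(scores) / len(scores)

# Example: two clean runs and one timeout under a 60 s cutoff.
assert par10([10.0, 20.0, None], cutoff=60.0) == (10.0 + 20.0 + 600.0) / 3
```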
The incumbent configuration with the best PAR-10 score reported by SMAC after termination was selected as the final result of the configuration process, and a subsequent validation phase was performed to run both the JSC/V8 default configuration and the optimised configuration selected by our procedure on the entire problem instance set. For these validation runs, we perform 100 runs per configuration and benchmark instance, and compute the PAR-10 score across all runs for each configuration.
We require repeated runs to obtain statistically stable results. Individual runs are very short and subject to substantial noise from the environment, e.g., operating system jobs and contention for shared memory. Through repeated runs and averaging, we obtain more realistic results that are less affected by very short and very long outlier runs.
Results on Benchmark Sets
[Table: PAR-10 scores [CPU s] for the default and optimised JSC and V8 configurations on each instance set, with relative improvements in %]
Results on Individual Benchmark Instances
Three of these instances are taken from the Ostrich set: graph-traversal, sparse-linear-algebra, and structured-grid; two instances stem from the Octane set: PDFjs and Splay. Results from these experiments are shown in Table 3, and we show additional empirical cumulative distribution functions of running time and scatter plots for the default vs. optimised configuration in Figure 2 and Figure 1. On Ostrich graph-traversal and structured-grid, not shown in the table and figures, we did not obtain significant performance improvements for either of the two engines.
[Table 3: PAR-10 scores [CPU s] for the default and optimised JSC and V8 configurations on individual benchmark instances, with relative improvements in %]
Considering the Ostrich Sparse Linear Algebra individual-instance configuration scenario, we show empirical cumulative distribution functions (CDFs) of runtime for 100 runs on the respective problem instance, along with scatter plots vs. the default configuration. A CDF shows, as a function of time, the empirically observed probability that a run completes within that time; each completed run increases the empirical probability at its observed runtime.
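Such CDFs are computed directly from the measured runtimes; a small sketch using illustrative data rather than our actual measurements:

```python
import matplotlib.pyplot as plt

def ecdf(runtimes):
    """Return the x/y coordinates of the empirical CDF of `runtimes`:
    y[i] is the fraction of runs that finished within x[i] seconds."""
    xs = sorted(runtimes)
    ys = [(i + 1) / len(xs) for i in range(len(xs))]
    return xs, ys

# Illustrative runtime samples only (not our measurements).
default_runs = [3.1, 2.9, 3.0, 3.4, 3.2]
optimised_runs = [2.0, 2.2, 1.9, 2.1, 2.3]

for label, runs in [("default", default_runs), ("optimised", optimised_runs)]:
    xs, ys = ecdf(runs)
    plt.step(xs, ys, where="post", label=label)
plt.xlabel("CPU time [s]")
plt.ylabel("P(run finished)")
plt.legend()
plt.show()
```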
Overall, these results are remarkable, as even new code optimisation methods often only yield performance improvements in the single-digit percentage range. We hypothesize that some specific aspects of these problem instances differ sufficiently from the other instances in their respective benchmark sets that the resulting configurations cannot be used successfully across those entire sets, yet are very effective on the individual instance in question. We present some preliminary results towards identifying the source of these improvements in the following.
Time to Find Improving Configurations
Even considering the remarkable performance improvements seen in our individual-instance configuration experiments, there may be some concern about the time required to find these improving configurations, given that we used 25 independent SMAC runs of 1 CPU day each to achieve them.
Upon further investigation, in all of our individual-instance configuration scenarios, the final optimised configuration was found in less than 3 CPU hours of runtime, with initial improvements over the default configuration typically found in less than 5 CPU minutes. Longer runtimes are required for the complete instance set configuration scenarios, but even in those cases, the final configuration was found in less than 6 CPU hours, with initial improving configurations typically being found in less than 1 CPU hour.
In practice, a much smaller configuration budget would be sufficient to achieve qualitatively similar results. In fact, we observed the first improvements after only a few minutes of configuration.
Changed Parameter Values
In order to better understand the source of our individual-instance performance improvements, we empirically analysed the parameters changed from their default values using ablation analysis Fawcett and Hoos (2015). This approach has been previously used successfully to assess the importance of parameter changes observed in applications of automated algorithm configuration techniques to propositional satisfiability, mixed-integer programming and AI-planning problems. Ablation analysis greedily constructs a path through the parameter configuration space from the default to a given target configuration, selecting at each stage the single parameter modification resulting in the greatest performance improvement. The order of the resulting modifications reflects the relative contribution to the overall performance improvements obtained by the configuration process, where later changes may occasionally achieve bigger improvements that would not have been possible before earlier modifications to the default configuration. The three parameter modifications resulting in the greatest performance improvement for the Octane Splay and PDFjs instances are shown in Table 4 and Table 5, respectively.
While the portion of the relative improvement indicated in the tables is approximate due to the nature of the ablation analysis procedure, it appears that in both cases, over 90% of the observed relative improvement can be explained by the modification of the three parameters shown. This is consistent with previous results using ablation analysis, where in many scenarios, the vast majority of the improvement was observed to be achieved by modifying a small set of parameters. Of course, identifying these parameters in post hoc ablation analysis is much easier than determining them within the configuration process that gave rise to the optimised configurations thus analysed.
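The greedy core of ablation analysis can be sketched as follows; `evaluate(config)` is assumed to measure the performance of a full configuration (in practice, via multiple runs on the instances of interest):

```python
def ablation_path(default, target, evaluate):
    """Greedily construct a path from `default` to `target` (both dicts
    mapping parameter names to values): at each step, apply the single
    remaining parameter change from `target` that yields the lowest cost.

    Returns the ordered list of (parameter, from, to, cost after change).
    """
    current = dict(default)
    remaining = [p for p in target if target[p] != default[p]]
    path = []
    while remaining:
        # Measure the effect of flipping each remaining parameter on its own.
        trials = {p: evaluate({**current, p: target[p]}) for p in remaining}
        best = min(trials, key=trials.get)
        path.append((best, default[best], target[best], trials[best]))
        current[best] = target[best]
        remaining.remove(best)
    return path
```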
[Tables 4 and 5: for Octane Splay and PDFjs respectively, the most important parameter modifications: distance from default, parameter modified, original and new values, and approximate portion of the relative improvement]
Performance under Different Loads
Modern computers have multiple processors, with multiple CPU cores each, and it is desirable to run multiple processes simultaneously in order to take full advantage of the processing power thus provided. However, other factors, such as shared caches, memory bandwidth and the I/O subsystem can affect performance negatively, if too many processes are vying for resources.
To investigate the extent to which such factors impact our experimental setup, we ran our benchmarks under two different workload configurations. First, we utilized all 32 cores of the machine used for our experiments by running 32 benchmark experiments in parallel. Second, we ran only 8 experiments in parallel, leaving the remaining cores to operating system processes.
The results show that there are significant differences. The graph-traversal instance of the Ostrich benchmark set requires a large amount of memory and sufficient memory bandwidth. With the machine fully loaded, we observe that we easily find a parameter configuration that performs better than the default. On the lightly loaded machine we are unable to do so, and the benchmark runs significantly faster than on the fully loaded machine, even with the improved configuration. This clearly indicates a memory bottleneck that can be mitigated through configuration.
This result shows that the optimisation of compiler flags should be done not only for the machine that the code will be run on, but also for the expected load on that machine – configuring for a lightly loaded machine will yield different results than configuring for a heavily loaded one. Furthermore, there is much promise in switching between different configurations based on machine load.
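A load-based switching scheme could look roughly like the following sketch; the two configurations and the load threshold are hypothetical placeholders, e.g. the results of two separate configuration runs performed under heavy and light load:

```python
import os

# Hypothetical configurations obtained under different machine loads.
CONFIG_HEAVY_LOAD = {"gc_heap_mb": 256}
CONFIG_LIGHT_LOAD = {"gc_heap_mb": 1024}

def pick_configuration(threshold=0.75):
    """Select a configuration based on the current load per core (Unix only)."""
    load_per_core = os.getloadavg()[0] / os.cpu_count()
    return CONFIG_HEAVY_LOAD if load_per_core > threshold else CONFIG_LIGHT_LOAD
```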
PAR-10 scores [CPU s] and configuration IDs of the ten best configurations found at each rank, preceded by the score of the default configuration:

| Scenario | default | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| JSC (32) Octane PDFjs 1 | 2.163 | 2.009 (11) | 2.020 (14) | 2.037 (20) | 2.037 (9) | 2.042 (22) | 2.046 (3) | 2.047 (12) | 2.052 (15) | 2.060 (5) | 2.074 (18) |
| JSC (32) Octane PDFjs 2 | 2.976 | 2.890 (4) | 2.923 (1) | 2.929 (2) | 2.963 (5) | 2.971 (7) | 2.987 (3) | 2.996 (9) | 3.037 (11) | 3.041 (6) | 3.060 (15) |
| JSC (32) Octane PDFjs 3 | 2.875 | 1.996 (14) | 2.005 (11) | 2.035 (24) | 2.038 (20) | 2.043 (15) | 2.045 (22) | 2.053 (12) | 2.067 (5) | 2.068 (9) | 2.071 (16) |
| JSC (8) Octane PDFjs 1 | 1.691 | 1.422 (11) | 1.434 (14) | 1.460 (22) | 1.471 (24) | 1.473 (2) | 1.477 (5) | 1.489 (20) | 1.513 (3) | 1.555 (16) | 1.556 (18) |
| JSC (8) Octane PDFjs 2 | 1.687 | 1.426 (11) | 1.431 (14) | 1.463 (22) | 1.466 (24) | 1.470 (2) | 1.471 (5) | 1.479 (20) | 1.516 (3) | 1.555 (16) | 1.558 (18) |
| JSC (8) Octane PDFjs 3 | 1.691 | 1.416 (11) | 1.426 (14) | 1.459 (22) | 1.462 (24) | 1.471 (5) | 1.471 (2) | 1.483 (20) | 1.525 (3) | 1.548 (18) | 1.556 (16) |
We believe that our results are promising and that our approach enables many interesting applications and follow-up work. We are currently planning a broader set of experiments, additional analysis of the parameter space structure, a deeper investigation into the effect of machine load on runtime performance and configuration, and an investigation of the transferability of these configuration results to machines other than those used for training.
Part of this research was supported by a Microsoft Azure for Research grant. HH also acknowledges support through an NSERC Discovery Grant.
- Alpern et al. (2005) Alpern, B.; Augart, S.; Blackburn, S. M.; Butrico, M.; Cocchi, A.; Cheng, P.; Dolby, J.; Fink, S.; Grove, D.; Hind, M.; McKinley, K. S.; Mergen, M.; Moss, J. E. B.; Ngo, T.; Sarkar, V.; and Trapp, M. 2005. The Jikes Research Virtual Machine project: Building an open-source research community. IBM Systems Journal 44(2):399–417.
- Ansel et al. (2012) Ansel, J.; Pacula, M.; Wong, Y. L.; Chan, C.; Olszewski, M.; O’Reilly, U.-M.; and Amarasinghe, S. 2012. Siblingrivalry: Online Autotuning Through Local Competitions. In 2012 International Conference on Compilers, Architectures and Synthesis for Embedded Systems, CASES ’12, 91–100. New York, NY, USA: ACM.
- Bergstra and Bengio (2012) Bergstra, J., and Bengio, Y. 2012. Random Search for Hyper-parameter Optimization. J. Mach. Learn. Res. 13(1):281–305.
- Birattari et al. (2002) Birattari, M.; Stützle, T.; Paquete, L.; and Varrentrapp, K. 2002. A Racing Algorithm for Configuring Metaheuristics. In Genetic and Evolutionary Computation Conference, 11–18. Morgan Kaufmann.
- Cavazos and O’Boyle (2006) Cavazos, J., and O’Boyle, M. F. P. 2006. Method-specific Dynamic Compilation Using Logistic Regression. SIGPLAN Not. 41(10):229–240.
- Fawcett and Hoos (2015) Fawcett, C., and Hoos, H. H. 2015. Analysing differences between algorithm configurations through ablation. Journal of Heuristics 1–28.
- Feng et al. (2012) Feng, W.-c.; Lin, H.; Scogland, T.; and Zhang, J. 2012. OpenCL and the 13 Dwarfs: A Work in Progress. In 3rd ACM/SPEC International Conference on Performance Engineering, ICPE ’12, 291–294. New York, NY, USA: ACM.
- Fursin et al. (2011) Fursin, G.; Kashnikov, Y.; Memon, A.; Chamski, Z.; Temam, O.; Namolaru, M.; Yom-Tov, E.; Mendelson, B.; Zaks, A.; Courtois, E.; Bodin, F.; Barnard, P.; Ashton, E.; Bonilla, E.; Thomson, J.; Williams, C.; and O’Boyle, M. 2011. Milepost GCC: Machine Learning Enabled Self-tuning Compiler. International Journal of Parallel Programming 39(3):296–327.
- Fursin et al. (2014) Fursin, G.; Miceli, R.; Lokhmotov, A.; Gerndt, M.; Baboulin, M.; Malony, A. D.; Chamski, Z.; Novillo, D.; and Vento, D. D. 2014. Collective mind: Towards practical and collaborative auto-tuning. Scientific Programming 22(4):309–329.
- Haneda, Knijnenburg, and Wijshoff (2005) Haneda, M.; Knijnenburg, P. M.; and Wijshoff, H. A. 2005. Automatic selection of compiler options using non-parametric inferential statistics. In 14th International Conference on Parallel Architectures and Compilation Techniques, 123–132.
- Hoos (2012) Hoos, H. H. 2012. Programming by Optimization. Commun. ACM 55(2):70–80.
- Hoste, Georges, and Eeckhout (2010) Hoste, K.; Georges, A.; and Eeckhout, L. 2010. Automated Just-in-time Compiler Tuning. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 62–72. ACM.
- Hutter et al. (2009) Hutter, F.; Hoos, H. H.; Leyton-Brown, K.; and Stützle, T. 2009. ParamILS: An Automatic Algorithm Configuration Framework. J. Artif. Int. Res. 36(1):267–306.
- Hutter, Hoos, and Leyton-Brown (2010) Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2010. Automated Configuration of Mixed Integer Programming Solvers. In Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, volume 6140 of Lecture Notes in Computer Science, 186–202. Springer Berlin Heidelberg.
- Hutter, Hoos, and Leyton-Brown (2011) Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2011. Sequential Model-Based Optimization for General Algorithm Configuration. In LION 5, 507–523.
- Hutter, Hoos, and Stützle (2007) Hutter, F.; Hoos, H. H.; and Stützle, T. 2007. Automatic Algorithm Configuration based on Local Search. In 22nd National Conference on Artificial Intelligence, 1152–1157. AAAI Press.
- Kulkarni and Cavazos (2012) Kulkarni, S., and Cavazos, J. 2012. Mitigating the Compiler Optimization Phase-ordering Problem Using Machine Learning. SIGPLAN Not. 47(10):147–162.
- Lengauer and Mössenböck (2014) Lengauer, P., and Mössenböck, H. 2014. The Taming of the Shrew: Increasing Performance by Automatic Parameter Tuning for Java Garbage Collectors. In 5th ACM/SPEC International Conference on Performance Engineering, ICPE ’14, 111–122. ACM.
- Minton (1996) Minton, S. 1996. Automatically Configuring Constraint Satisfaction Programs: A Case Study. Constraints 1:7–43.
- Pan and Eigenmann (2006) Pan, Z., and Eigenmann, R. 2006. Fast and Effective Orchestration of Compiler Optimizations for Automatic Performance Tuning. In International Symposium on Code Generation and Optimization, CGO ’06, 319–332. Washington, DC, USA: IEEE Computer Society.
- Plotnikov et al. (2013) Plotnikov, D.; Melnik, D.; Vardanyan, M.; Buchatskiy, R.; Zhuykov, R.; and Lee, J.-H. 2013. Automatic Tuning of Compiler Optimizations and Analysis of their Impact. In 2013 International Conference on Computational Science, volume 18, 1312–1321.
- Tartara and Crespi Reghizzi (2013) Tartara, M., and Crespi Reghizzi, S. 2013. Continuous Learning of Compiler Heuristics. ACM Trans. Archit. Code Optim. 9(4):46:1–46:25.
- Thomson et al. (2010) Thomson, J.; O’Boyle, M.; Fursin, G.; and Franke, B. 2010. Reducing Training Time in a One-Shot Machine Learning-Based Compiler. In Languages and Compilers for Parallel Computing, volume 5898 of Lecture Notes in Computer Science, 399–407. Springer Berlin Heidelberg.
- Thornton et al. (2013) Thornton, C.; Hutter, F.; Hoos, H. H.; and Leyton-Brown, K. 2013. Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms. In 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13, 847–855. ACM.