1 Introduction
The efficient scheduling of tasks within parallel loops is a problem that must be solved in every instance of looplevel parallelism, no matter the application. A key aspect to efficient execution of these loops is to manage the amount of work contained in each task. This amount of work per task is known as its chunksize. The creation of tasks induces overheads, however, the more tasks created the more work can be done concurrently. The challenge for application developers is to schedule in a way such that the overheads introduced by new tasks are amortized by the performance gains made by the increase in parallelism.
The simplest loop scheduling method is static scheduling. This method divides the iterations evenly and statically among all the processors, either as a consecutive block or in a roundrobin manner [liu1994safe]. Since all the assignments happen at compile time or before execution of the application, this method imposes no runtime scheduling overhead. Several factors including interprocessor communication, cache misses, and page faults can lead to different execution times for different iterations, leading to load imbalance among the processors [philip1995increasing]. Meanwhile, dynamic scheduling methods postpone the assignment to runtime, which tends to improve load balancing, at the cost of higher scheduling overhead. Some of the dynamic scheduling methods include: Pure Selfscheduling, Chunk Selfscheduling, Guided Selfscheduling [polychronopoulos1987guided], Factoring [hummel1992factoring] and Trapezoid Selfscheduling [tzen1993trapezoid, liu1994safe].
In this paper we focus solely on dynamic scheduling methods. These approaches, require the fine tuning the chunksize as chunksizes which are too small will cause scheduling and execution overheads and chunksizes too large will not induce enough parallelism to achieve maximum performance. This “sweet spot” is applicationdependent because it varies with the structure of the task and the hardware used. In particular, we are interested in parallelizing the linear algebra algorithms in the Blaze C++ library using the HPX backend. Blaze is a linear algebra library that uses smart expression templates in order to mix readable mathematical syntax and high performance [iglberger2012expression]. HPX is a C++ standard library used for parallelism and concurrency [kaiser2014hpx, heller2017hpx].
The objective of this research is to generate a machine learning based scheduler that is able to take compiletime and runtime information of any linear algebra operation and predict the optimal chunksize at runtime with minimal overhead to make the predictions. This scheduler should be the leastintrusive within the Blaze source code and should not disturb the normal workflow of the user. We investigate if a supervised learning model can be used to make such a scheduler.
Our work is an extension of [khatami2017hpx] where machine learning was used in tandem with HPX for the first time. In this research, the authors developed a ClangTool to extract compiletime information by analyzing the Abstract Syntax Tree of any algorithm being parallelized. For a large set of parallel algorithms, they extracted information like the number of operations per iteration, number of floating point operations per iteration, the number of if
statements, the deepest loop level and number of function calls per iteration, number of iterations and number of threads. The training and test sets where generated by using the ClangTool on various algorithms. The features extracted were used as the inputs of a logistic regression algorithm to predict the best chunksize at runtime. The novelty of the ClangTool is that it can work for a large set of algorithms since it extracts lowlevel information from the abstract syntax tree directly. However, it would not apply to our linear algebra operations because we use dynamic matrices. Since the size of the matrices are unknown at compiletime, values like number of operations per iteration cannot be extracted by a compiler.
The major contributions of the paper are the following:

We extend the existing range of machine learning applications in HPC by using supervised learning to optimize the performance of basic parallel linear algebra operations.

We compare existing approaches to compute the optimal chunksize for a given operation and conclude that we can directly predict the chunksize via a classification method that provides efficient and accurate predictions.

We develop a new decisiontree classification model that is customized to more efficiently estimate the optimal chunksize. Such a model is, in theory, not restricted to the linear algebra operations described in this paper. It could be used to solve any new problem where the objective is to predict the value of a certain parameter that yields optimal performance for a given application.
The paper is organized as follows: In Section 2, we review the literature on contributions related to our research work. Machine learning terminology, definition of the parallelization framework, and mathematical formulation of the scheduling as an optimization problem are discussed in Section 3. Section 4 describes our methodology to solve the optimization problem via supervised learning and our experimental results are shown in Section 5. Finally, we conclude with Section 6.
2 Related Contributions
Most of the previous work done in this field focuses on predicting the performance of parallel algorithms, which is different from what we are trying to accomplish. However, modeling performance can be used to find the optimal way to schedule a loop and it is linked to our research. Previous works mainly focus on three types of models: analytic, tracebased, and empirical models [malakar2018benchmarking].
Analytic models [blagojevic2008modeling, kerbyson2001predictive, valiant1990bridging], while providing an arithmetic formula to represent the execution time of an application, require a deep understanding of the application. These models cannot be generalized to other types of domains and architectures [lee2007methods, sun2017automated, pllana2007performance].
Tracedbased models use the traces collected through instrumentation in order to predict performance. Instrumentation refers to the code that is added to the program source code as a way to capture runtime information. These models, in contrast to analytic models, do not rely on an expert’s knowledge of the application. However, they do add some overhead at runtime, require a large storage space to save the traces and are hard to analyze [sun2017automated]. Moreover, for any new application, one must first run a portion of the code before producing performance predictions.
In empirical modeling, the results are obtained by running an application on multiple machines with a set of userspecified features to build a model that will predict performance for new feature values [malakar2018benchmarking]. This type of modeling includes machine learningbased approaches. The line between empirical models and trace based models can be blur at times as traces can be seen as the features of an empirical model. The two methods differ, however, by the fact that tracebased models add extra instructions to the program, which is not necessarily the case with empirical models.
It is our observation that three types of features are commonly used in literature for empirical models, e.g. nondeterministic features, deterministic applicationfree features, and deterministic applicationspecific features. Nondeterministic features are values that keep changing when rerunning the same program. These include various performance counters, like cache hitsmisses, CPU idle rate etc. Deterministic applicationfree features are values that do not change when running the same program and can be measured on different kinds of applications. For example the authors in [khatami2017hpx] used deterministic applicationfree features extracted by compiler. Values like number of operations per iteration, deepest loop level, number of threads, etc. are not specific to a certain task. Deterministic applicationspecific features are parameters that can only be computed for a certain class of algorithms and therefore cannot be used for other algorithms. For example, matrix sizes could be used to characterize a linear algebra operations but it cannot be used to characterize a binary tree search algorithm. This terminology, despite being nonofficial, will be used throughout the paper when talking about features for machine learning models in order to be more specific.
In [ipek2005approach], the authors use neural networks to predict the performance for SMG2000 applications, a parallel multigrid solver for linear systems [falgout2002hypre]
, on two different architectures. They use a fully connected neural network to predict the performance, defining workloads per processor and processor topology as the deterministic applicationfree features of the model. Since they consider the absolute mean error as their loss function, they use stratification to replicate samples with lower values of performance by a factor that is proportional to their performance as a way to ensure that no sample is masked. They also apply bagging techniques to decrease the variance in the model. As they increase the size of the training set to 5K points, they reach an error rate of 4.9%.
The tracebased model [sun2017automated]
analyzes the abstract syntax tree of a code and collects data by inserting special code for instrumentation in four different situations: assignments, branches, loops, and MPI communications. These features have the advantage of being applicationfree so they can be used to characterize different operations: Graph500, GalaxSee, and SMG2000. The authors then use different machine learning methods, e.g. random forests, support vector machine, and ridge regression to build a prediction model from the collected data. By applying a two feature reduction processes, they were able to decrease the amount of additional overhead and storage requirements. Their results indicated that the random forests method was the most useful, because of the lower impact of categorical features on it, which is helpful in the general cases where one has no knowledge about the type of features
[sun2017automated].In [malakar2018benchmarking], the authors investigate a set of machine learning techniques, including deep neural networks, support vector machine, decision tree, random forest, and knearest neighbors to predict the execution time of four different applications. They use deterministic applicationspecific features for each of their applications. For example, in the case of the miniMD molecular dynamics application, the number of processes and the number of atoms were considered input features, while for miniAMR, an application for studying adaptive mesh refinement, the number of processes and block sizes in the , , and
directions, were used as the input features. While achieving promising results, especially for deep neural networks, bagging, and boosting methods, the authors suggest utilizing transfer learning through deep neural networks to predict performance on other platforms.
Despite concentrating on GPUs, Liu et al. [liu2018runtime] proposed a lightweight machine learning based performance model to choose the number of threads to use for the parallelization of the training of a neural network (NN). They chose to use nondeterministic features collected by hardware counters, namely, the number of CPU cycles, the numbers of cache misses, the accesses for the last cache level, and the number of level 1 cache hits. Since each step of the training of a NN has the same computational and memory patterns, the authors used the first steps to extract the performance counters and fed them to the performance model. The output of the model then guided the choice for the number of threads that would be used throughout the rest of the NN training steps. They took two different approaches to build their model. In the first one, they tried ten different regression models including random forest, and in the second one, they used a hill climbing algorithm to choose the number of threads. In addition of being hardware independent, the hill climbing algorithm achieves a much higher accuracy compared to the best performing regression model. The authors suspected that machine learning models were poorly performing because of inaccuracies in the hardware counters.
In this paper, we propose using machine learning to directly predict the optimal chunksize to achieve the best performance instead of predicting the execution time. Also, we do not attempt to find the optimal number of cores to run an application on like in [liu2018runtime]. In our research, it is assumed that the user is working on a given number of cores and simply want to find the optimal way to share the workload between these cores. These are examples of what differentiates our approach from the ones discussed above.
3 Mathematical formulation
3.1 Supervised learning basics
In supervised learning, the objective is to predict a variable whose value is unknown given some other known variable . This is equivalent to finding a function , which will be referred from now on to as the model. The input of the model is called the feature vector and the output is called the target. When the set is finite, the model is considered a classification function and when , the model is considered a regression. The output of the model is usually different from the actual value of , so we define the notion of error or “loss” between the actual and predicted values by using a loss function , which is applicationdependent. The loss can be computed on multiple instances of feature vectors and targets contained within a set (note that the superscript will always be used to represent an instance within a set). This leads to the notion of the cost function on the set :
(1) 
Typically, the machine learning model is selected by minimizing the cost function on a socalled training set. Once a model has been selected, it can be evaluated on new input values that were not included in the training set. To assess how the model performs on these new values, the cost function must be evaluated on a test set containing previously unused inputs. A model that performs well on the test set is said to generalize. The notion of generalization is critical to estimate how the model will perform in real applications. More details on supervised learning can be found in [hastie2005elements].
3.2 Parallelization Framework
In the Blaze library, every linear algebra expression, like , is accessible as an expression template that is a light weight object. More complex expressions are stored in expression template trees that can be traversed at compile time via template metaprogramming. The elements of an expression are only evaluated when the assignment operator is called, i.e. . During an assignment, the elements of the expression template are evaluated and the results are stored in . This step can be parallelized since each component of is independently computed. Figure 1 shows how work is distributed among threads and introduces the two critical parameters involved: blocksize and chunksize. In this work, we demonstrate the use of machine learning to simplify the task of selecting the chunk sizes, given a specified blocksize.
3.3 Mathematical Formulation of the Scheduling Problem
We discuss here the mathematical formulation that will be used throughout the paper as well as a more formal statement of the scheduling problem. Many variables are implicated in the problem to solve so we introduce the following notation and domains:

Matrix size (Vector size):

Number of threads:

Operation type:

Architecture:

Chunksize:

Blocksize:

Unaccounted processes:
OT represents the domain of all possible linear algebra operations and AT represents the set of all possible architectures. These sets are too complex to be written explicitly. The variable describes all processes in the computer that we are unaware of. Our ignorance of this variable introduces a nondeterministic behavior to the system. It is assumed that there exists a function Per that takes all these variables as inputs and outputs the performance in MFlop/s:
(2) 
Note that the dependence of Per on is not explicitly written so its influence will be perceived as noise. By defining as an abstract feature vector, we can write the following optimization problem:
(3) 
The values and will be referred to as the optimal chunksize and blocksize, respectively. We use machine learning terminology for because solving Eq. (3) is analogous to finding a model with input . We say that the feature vector is abstract because the components ar and ot do not take numerical values but represent concepts. For example, ot can stand for matrixmatrix multiplication, vectorvector addition, etc. This abstract feature vector is useful when doing equations with a general notation, but all of its abstract components will need to be transformed into numbers when it comes to generate the training and test sets. We will refer to this as concretizing the abstract feature vector. The deterministic applicationfree features, deterministic applicationspecific features, and nondeterministic features described earlier can be used to concretize the abstract components.
One way to solve Eq. (3) would be to use ZeroOrder Optimization. These methods do an intelligent search in the space to find maximum performance. Such methods are very expensive because the function Per must be called multiple times to find a maximum. Performance is costly to compute so we cannot consider such methods to find the optimal parameters at runtime.
The previous HPX backend scheduler solved Eq. (3) by fixing and choosing the blocksize so the matrix would be divided into submatrices of similar sizes. For example, the blocksize chosen for a matrix operation on 4 threads would be which is the specific blocking scenario illustrated in Figure 1. With this blocking, it is assumed that b depends only on ms and . However, since b is linked to memory, it should depend as well on at and ot since different architectures may have different cache sizes and different operations may access memory differently.
The next section describes the proposed approach to solve this optimization problem by using supervised learning models.
4 Methodology
4.1 Benchmarks and Resources
All experiments were performed on a node consisting of two Intel^{®} Xeon^{®} CPU E52450 clocked at 2.10GHZ, with 16 cores, hyperthreading disabled, and 48 GB DDR4 memory. The CPUs had L1 caches of 32 kBytes and nonshared L2 caches of 256 kBytes. Blaze v3.4 was compiled with the dependencies OpenBlas v0.3.2, and HPX v1.3.0 [hartmut_kaiser_2019_3189323]. HPX was compiled using Boost 1.68.0 and Hwloc 1.11.0 as its dependencies. The compiler used for Blaze and HPX was Clang 6.0.1. All benchmarks that were used in this paper are listed in Table 1. Mat, Vec, D, T refer to matrix, vector, dense and transpose, respectively. For example, DVecDVecAdd means an addition of two dense vectors. The default alignment of matrices in Blaze is rowmajor so transposed matrices are columnmajor.
Daxpy  TDMatTDMatAdd 

DVecDVecAdd  DMatScalarMult 
DMatDVecMult  DMatDMatMult 
DMatDMatAdd  DMatTDMatMult 
DMatTDMatAdd  TDMatTDMatMult 
TDMatDMatAdd 
4.2 Estimating the Best Blocksize
Solving Eq. (3) for both and is rather intensive because of the amount of configurations in . To simplify the problem, we decided to first solve for and then for . To find the optimal blocksize, we consider the following hypothesis:
Hypothesis 1
, i.e. the best blocksize depends only on architecture and operation types.
Our reasoning is that finding the best block size is a hardware related problem. The optimal blocksize must ensure that all the data sent to a CPU must fit within its cache. Since different benchmarks will access the memory differently, it is also safe to assume that the best blocksize will depend on the operation type. The approach used to compute was to fix cs and start with b containing elements and change its aspect ratio
to see how it impacted performance. For some benchmarks where these values did not give good performance, other values were found empirically. This approach is somewhat heuristic, but it is not the focal point of this paper. Indeed, our main goal is to predict the best chunksize when given a certain blocking. For that reason, the blocksizes to be used on the benchmarks can be chosen suboptimal.
4.3 Estimating the Best Chunksize
Once the approximation of the best blocksize is computed for any ot and ar, we can redefine the abstract feature vector . The best chunksize can be calculated by solving the problem:
(4) 
We have used supervised learning models to find the solutions of Eq. (4) for any arbitrary value of the feature vector . The advantage of machine learning methods is that they can output predictions pretty fast. Therefore, the selection of the optimal chunksize can be done seamlessly in Blaze. The most timeconsuming part of the machine learning methodology is generating data and training the model. However, these models do not need to be trained often. Our goal is to find a blackbox model fed with a brut training set , which outputs an approximation of the best chunksize for any new example (where the hat symbols stands for approximation). The reason we call this set the brut training set will become clear later on. Such a model is illustrated in Figure 2. We refer to this model as a black box model because there exist multiple ways to implement them as discussed below. However, all black box models share the same input, output, and bruttraining set. They only differ by how they are implemented in the “inside” which justifies the terminology. We describe two types of these black box models that were seen in literature and attempt to give them formal names that will be used throughout the paper as a way to distinguish both.
4.3.1 PreTraining Optimization Model (PreTO)
This model is the same as the one used in [khatami2017hpx] to compute the optimal chunksize, see Figure 4. The goal is to solve Eq. (4) for each example in the brut training set to generate pairs of values to be used for the training of the classification model. The target of the machine learning model is then , the value we want to predict. Classification is chosen because the output set is a finite set. The output of the blackbox model is directly the output of the classification model within the black box.
is the direct output of the classifier
Classical machine learning classification algorithms try to maximize the accuracy on the training set:
(5) 
where the loss function, , is an indicator function that returns one if the input statement is true and zero otherwise. Note that, in practice, classification algorithms rarely maximize the accuracy directly because it is nondifferentiable. Many classification algorithm solve equivalent optimization problems instead. For example, logistic regressions and neural networks use loglikelihood as their lossfunction. For decision trees, the accuracy is replaced by the information gain [hastie2005elements]. All these methods are indirect ways of maximizing the accuracy so we can only consider the accuracy without loss of generality. The advantage of the preTO is that it implies very little overhead as the only step required to compute a prediction is to feed the input to the classification model inside the box. The problem with the PreTO model is that, sometimes, different chunksizes yield similar performance and the classification model is oblivious to that fact. Since classification models maximize accuracy, they will try to classify each chunk size on the training set correctly even if, in some cases, a missclassification does not affect performance. This can be seen as wasted effort by the model.
4.3.2 PostTraining Optimisation Model (PosTO)
This model is similar to what was used in [liu2018runtime], see Figure 4. It consists of modeling Per via a regression called and then, instead of solving Eq. (4), we find
(6) 
This shows how our work and all publications about performance modeling discussed in Section 2 are related. Solving Eq. (6) can be done with a linesearch at runtime. Since we want our machine learning model to predict the performance, we need to do a feature augmentation procedure on our data. This procedure includes cs among the features of the performance model instead of treating it as a target. We can write this as defining a new feature vector . The procedure is illustrated in Figure 5.
After the procedure the number of examples in the training set becomes (where is the cardinality of set CS). The formal description of the new training set is . Classical regression models minimize the meansquarederror (MSE) cost function on the training set:
(7) 
The advantage of the PosTO model compared to the PreTO is that the machine learning model will take performance into account during training. The disadvantage is that there is more overhead at runtime because of the need to maximize with respect to cs.
It is important not to confuse the black box models (PosTO and PreTO) with the machine learning models used inside them, particularly in the case of PosTO models because the outputs are different. Moreover, one must not confuse the brut training set and the training sets and which are used to train the classifier and regression of performance. The two models described earlier have the same issue; the machine learning models inside the black box have an ambiguous loss function. For example, in the PreTO model, the classification model could have a bad accuracy but could still be high since multiple chunksizes may yield good performance. Moreover, if a PreTO model has a better accuracy than another one, it does not imply that the chunksizes predicted by the first model will yield a higher performance than the second one. This is because some missclassifications can induce a larger loss in performance. This ambiguity is only removed when accuracy is reached, but then, the model may be way more complex than necessary.
In the case of PreTO models, the MSE is also an imperfect measure for assessing the error of the black box model. In Figure 6, Per is the real performance for a fixed . The two dashed lines are hypothetical approximations of performance. One can see that has a lower MSE than . If we solve Eq. (6) with both performance models, we get the predictions and . However, from Figure 6, we see that , so the second model of performance yields a better prediction of chunksize even though it has a larger MSE. To conclude, we cannot compare PosTO models based only on the MSE of the performance model inside them. This ambiguity is only removed when the MSE is zero, but this is surely overfitting.
It is clear from that analysis that we need another way to assess the performance for both PreTO and PosTO models to avoid any ambiguity. Candidate measures should take the actual performances into account. We introduce a new cost function that we call the Mean SubOptimal Performance:
(8) 
The performance associated with the chunksize prediction is divided by the optimal observed performance as a way to normalize all instances in order to ensure that benchmarks with typically better performances will not dominate every other benchmark. Also, the MSOP is restricted to take values from zero to one. The MSOP must be interpreted as how relatively far away one is, in average, from the optimal performance that the model could have predicted. The denominators can be computed by simply using the values in the training or test set. However, the numerators , can only be easily computed if is restricted to predict from the set CS. In that case, performance acquired in the training or test set can be used to compute the numerators. The MSOP will be the metric used when comparing PreTO and PosTO models.
4.4 Custom Decision Tree
For PreTO models, the classifier that predicts the optimal chunksize is trained by maximizing the accuracy on the training set. The biggest limitation of accuracy is that each missclassification is given the same impact. Some missclassification of chunksize could induce bigger losses in performance than others, therefore, accuracy is not a good way to train the classification model. A reasonable loss function should be able to treat every missclassification differently. A naive approach would be to attribute weights to each example (stratification). Those weights could be chosen to be large for instances where one chunksize yields a better performance than all others and small for instances where all chunksizes yield similar performance. However, for a fixed example , any missclassification still has the same impact. For this reason, using weighted accuracy is not sufficient. We decided to use the MSOP as our loss function for the classification model training process. We believe the MSOP is the right choice for the loss function because its value is only affected by a missclassification when such missclassifications induces a relatively large loss in performance with respect to the optimal performance. Therefore, the MSOP treats every missclassification differently. The MSOP is nondifferentiable because it involves Per which one cannot differentiate. Therefore, we cannot rely on gradientbased methods for fitting. Decision trees [hastie2005elements] provide a solution to this challenge since they do not require differentiable lossfunctions. Any decision tree can be described as a set of domains that cover the whole input space and a set of scalars such that:
(9) 
This tree can be optimized by solving:
(10) 
This combinatorial problem is extremely hard to solve so we attempt to find a suboptimal solution by dividing the problem into two steps:

Find given .

Find .
The first step is relatively straightforward and consists in solving the optimization problem:
(11) 
The MSOP is maximized by only considering the points in that domain. Since the MSOP is nondifferentiable, it must be maximized via a linesearch.
The second step of the optimization problem is the hardest. To find the best domains, a greedy algorithm is used. First, we start with a domain that covers the whole domain, we then consider a feature and a split point . The domain is then cut into two disjoint regions and . The best split is selected via the minimization problem:
(12) 
where , are the numbers of data points in each region. The and are computed by Eq. (12) on regions and . Eq. (11) and (12) can be applied recursively to give a growing procedure of a custom classification tree which takes performance into account. A threshold for the MSOP in every region can be used which would stop the splits from happening in regions where the MSOP is high enough. This should result in less complex trees. We expect the model to have less overhead on average to compute predictions than classification trees.
4.5 Data Generation and Model Selection
The abstract feature vector is . This vector needs to be concretized before continuing. The compiletime variables ar and ot must be quantified if they are to be used in a model. In Section 2, several ways to express a feature vector in terms of numbers were suggested. In [khatami2017hpx], deterministic applicationfree features were used. ot was represented with characteristics of the structure of the algorithm such as number of operations, number of “if” statements and deepest loop level, which were measured at compile time. In [sun2017automated], the authors use deterministic applicationfree features. They measured assignments, branches, and loops at runtime using dynamic analysis of the program. In [li2009machine], nondeterministic features were measured. The variables ar, ot were represented using performance counters: number of CPU cycles, number of cache misses, cache accesses for the last cache level, and number of level one cache hits. Using performance counters does not just characterize ot and ar but it also reveals information about the unknown processes . This is because performance counters will vary based on the many unknown processes happening at the same time as the user’s task. Using performance counters as features is a double edgesword. One has reasons to think that using performance counters could improve the prediction of the model since the model will be aware of runtime imbalances between CPUs and overheads. However, if the features used to characterize are not highly correlated with performance, this will have the effect of adding unnecessary noise to the data. Moreover, in practice, every new task would need to be run twice; once to extract the performance counters and feed them to the machine learning model and another time with the predicted chunksize for optimal performance. We don’t want the user to run his linear algebra operation two times. This goes against our initial goal of not disturbing the normal workflow of the user. For that reason, we do not use performance counters when concretizing the feature vector.
We also decided to remove ar since we only worked on one architecture for simplicity. In theory, the proposed methodology can be extended to other architectures. To characterize the operation types, we extracted information about the complexity from Blaze’s expression templates. We created a function getTotalMflop() that could estimate the predicted number of floating point operations of an arbitrary expression template tree at runtime. Runtime features like ms and can be extracted at runtime with HPX’s getnumthreads() function and Blaze’s matrix.cols() member function for Matrix/Vector classes. We also changed the feature to which is the number of iterations in the for_loop. The motivation in doing so is that the optimal chunksize is highly correlated with . So we can define the concrete feature vector which was used when making the training and test set : . Note that the influence of is now hidden in the feature . Figure 2 shows all benchmarks used to generate the dataset.
DVecDVecAdd  DMatDVecMult 

DMatDMatAdd  DMatScalarMult 
DMatTDMatAdd  DMatDMatMult 
Each benchmark was run with different matrix sizes and number of threads . The sets MS were chosen differently for each benchmark to cover the regions of interest. The whole brut data collected consisted of 288 examples that were randomly shuffled and then split with a ratio 2:1 into a training set and a test set. To select the best black box model, we used kfold crossvalidation [hastie2005elements] on the training set and compared the average MSOP and prediction times. The value k was chosen to be 3 so the ratios would still be 2:1 when doing crossvalidation. The model that performed best in average on the 3 folds, was then evaluated on the test set. Many classical classification models and regression models can be used inside the PreTO and PosTO models. The Python library scikitlearn [scikitlearn] was used because of its high level syntax that helped to embed these models within the blackbox models. For PreTO Models, both DecisionTreeClassifier and LogisticRegression were considered. For PosTO models, both MultilayerPerceptronRegression and DecisionTreeRegression were used as regressions of performance. Finally, our CustomDecisionTreeClassifier tree was implemented in Python in a PreTO model [doi_data].
5 Results
5.1 Best Block Size
Table 3 shows the values of blocksize that were found empirically and the ranges of matrix sizes on which they performed well. These blocksizes were used when generating the training and test sets for the predictions of chunksize.
Benchmark  Matrix/Vector sizes  Selected blocksize 

DVecDVecAdd  [25000, ]  1x4096 
DMatDVecMult  [250, 2500]  1x16 
DMatDMatAdd  [100, 1000]  4x1024 
TDMatTDMatAdd  [100, 1000]  1024x4 
DMatTDMatAdd  [100, 1000]  64x64 
TDMatDMatAdd  [100, 1000]  64x64 
DMatDMatMult  [100, [  64x64 
DMatDMatMult  [, ]  256x256 
TDMatTDMatMult  [100, [  64x64 
TDMatTDMatMult  [, ]  256x256 
DMatTDMatMult  [100, [  64x64 
DMatTDMatMult  [, ]  256x256 
We see that for matrix multiplications, the best blocksize varies with the size of the matrices, which disproves hypothesis 1 (H1). We cannot explain yet why the optimal blocksize increases with ms for matrix multiplication. This result is hard to interpret because of the abstract nature of expression templates and the complexity of the software stack. In Blaze, specific assignment kernels are called for each benchmark. These Kernels can use their own blocking. Secondly, the blocksize is not always representative of the number of elements indexes during an assignment. For example, with given matrices of size , a blocksize of (100, 100) on will index elements from and . However, for , the same blocksize on will index elements of and . For all benchmarks using square blocksizes, we found specific cases where the blocksize would segment the matrix in fewer submatrices than there were threads, which resulted in reduced performance. It is clear that the choice of blocksize must take into account the size of the Matrix/Vector and the number of threads in these cases. More research needs to be done on the blocking of these operations, however we determined this to be outside the scope of our work.
5.2 Choice of set CS
Graphs of performance with respect to chunksize were generated by running multiple benchmarks on different number of threads. Visualizing the real performance with respect to chunksize is a useful way to design the set CS. It is critical to choose a good set because the black box models will be restricted to analyze examples with these chunksizes and will make predictions from this set. If it is too detailed, generating data will take a long time. If it is too poor, the training set will lack critical information.
In Figures 8–10, we see the expected bellshape for performance. When chunksizes too small were chosen we observed high scheduling overheads and when chunksizes were too large, we observed poor parallelism. We note that, in some cases, there is a very specific chunksize that yields a peak in performance. In others, there is a range of values of chunksize that yield an optimal performance. For these cases, predicting the exact value of the optimal chunksize is not necessary. In Figure 8, the performance line collected with 16 threads is a good example where all missclassifications of chunksizes do not have the same impact on performance. For that specific benchmark and number of threads, . However, missclassifying it with leads to a 14% loss in performance and missclassifying it with causes a 40% loss in performance. This is a concrete example that shows why we decided to build a custom classification model to predict chunksizes. In figure 10, we see that the optimal chunksize is always one. This could be due to the fact that, since matrix multiplications are very complex algorithms, each iteration of the for$\_$loop() involves significant work in comparison to the HPX’s thread scheduling overhead. However, we suspect that this behavior is mainly caused by the specific kernels that Blaze calls. Taken together, the graphs indicate that the optimal chunksize is always between one and ten so we used this insight to fix the set CS to . The time to generate our data set of 288 examples was around two days which we view as reasonable so we did not attempt to reduce the size of the set CS. Throughout the rest of the paper, machine learning models will be limited to analyze and predict chunksizes from this set.
5.3 Model Selection and True Error estimation
As explained earlier, kfold cross validation was used to select the best black box model. The black box models were also compared to two reference models; random choice and equal share model . Random choice is a reference point for the worst classifier one could achieve so it gives a lower bound on the error. The equal share model is a more naive model that simply tries to have one chunk per thread. We have previously stated that our models should only output values within the set CS which is why, in the equal share model, an output larger than 10 is reduced to 10. This constraint is important if we are to use the MSOP to compare all the models.
Model  MSOP (%)  Prediction Time (s) 

PreTO  customDTC  
PreTO  DTC  
PreTO  LogReg  
PosTO  MLP  
PosTO  DTR  
Random Guess    
Equal Share   
Table 4 provides the results from all models described above. We can see that all models except PreTO  LogReg, are significantly better than Random Guessing. Also, we see that the three models with the largest MSOP use an decision tree model, which suggest that trees are better for this specific task. As expected, PosTO models have a bigger average prediction time than PreTO models. Finally, we see that the custom tree has the best MSOP and the smallest average prediction time. Since the custom decision tree and scikitlearn decision trees do not have the same implementation, this difference in prediction time cannot be attributed solely to the complexity of the tree. We have therefore computed the average number of nodes for both our custom decision tree and scikitlearn decision tree classifier as a way to compare complexity. The average number of nodes for the custom and classical decision tree were and respectively. This shows that using the MSOP to train a classification tree allows to get good predictions of chunksize with half the complexity of a classical decision tree. Since PreTO  customDTC has the best MSOP and minimal prediction time, it was selected as the model to be implemented in Blaze. After selecting the model, its overall performance on the test set of 96 examples was estimated with an MSOP of 94.8 (%).
This shows that the ProTO model with an intrinsic custom Decision Tree Classifier can be generalized to examples outside of the training data. Since the training and test sets were obtained by shuffling all 288 examples for 6 benchmarks, we cannot conclude that the model would perform well in the cases of completely new benchmarks. The result simply states that if one restricts oneself to a specific architecture and a limited set of linear algebra benchmarks, one can generate training data on these benchmarks by using random values of ms and . After training, this model can be used to get predictions on the same set of benchmarks for all possible ms and .
5.4 C++ Implementation and Performance
The Custom Decision Tree generated with the training set was implemented in Python so that we could compare it directly to the scikitlearn library. However, to get the predictions within Blaze, the tree needed to be implemented in C++. In order to do so, a function called printTreeHeader() was written in Python to read the structure of the tree and output a header file that contained the tree structure.
The features are extracted via the functions described in Section 4.5 and the blocking described in Table 3 was also implemented in Blaze. The implementation in Blaze was designed to be as seamless as possible. To use the new machine learning scheduler, a user is simply required to add a flag when compiling. By implementing the machine learning model in Blaze, we are able to assess how well it works in a reallife scenario. However, the relative measure provided by MSOP fails to describe the absolute performance of the models. Because of that, MSOP does not tell how the machine learning based scheduling compares to the one used in the previous implementation of the HPX backend of Blaze. The previous implementation is described in the last paragraph of Section 3.3. To compare with the previous scheduling and blocking, we ran all benchmarks from Table 1 using the old backend, the new backend with equal sharing, that is , and the new backend with chunksize predicted via machine learning. The old backend and the new backend with equal sharing have approximately the same amount of work in a chunk. The only difference is that in the new backend, a chunk contains many blocks that are designed to work well with the cache. We, therefore, expect the difference between these two methods to only be caused by blocking. The difference in performance between the new backend with equal share and the new backend with machine learning scheduling is caused by the selection of chunksize, since the blocksizes used are the same.
Figure 11 shows a comparison of the three parallelization schemes for four different benchmarks from Table 1 (Daxpy, DMatDVecMult, TDMatTDMatAdd, DMatTDMatMult). Comparing the dotted and full lines shows the effect of the machine learning based chunking. Comparing the dashed and dotted lines shows the effect of blocking on performance. Since our main focus is to predict chunksize, we are mainly interested in analyzing variations of performance that are attributed to the chunksize prediction. Because of that, we will not attempt to analyze differences of performance attributed to blocking. The dotted line will simply be used as a reference point to show which variations in performance are due to blocking and which are due to chunksize prediction. By looking at the Daxpy benchmark (first row of the graph), we see that the chunksize selection improves performance for vector sizes ranging from to . However, for 16 threads, the chunksize predicted by machine learning yields very similar performance to the old backend implementation. We also observe more oscillations in performance for the machine learning scheduler. This can be attributed to the fact that there is more variance in the scheduling when more threads are used, thus, the machine learning struggles to predict the correct chunksize. For DMatDVecMult, (second row), we observe an improvement in performance based on the chunksize selection for vectors of size ranging from 500 to 1050. However, the performance for the machine learning scheduler drops for vector sizes beyond 1050. This drop seems to be related to blocking, not the machine learning model, since the performance also drops for the dotted lines. By analyzing TDMatTDMatAdd (third row), we observe an improvement for matrix sizes in the range 200 to 1000. The most surprising result is that for 16 threads, the machine learning scheduler performs worst than the equal share model. Once again, the model seems to struggle when working on 16 threads. An analysis of DMatTDMatMult (fourth row) shows that the machine learning scheduler yields a better performance for matrices of sizes ranging from 110 to . The machine learning learning scheduler is better than equal share because, as shown in Figure 10, the optimal chunksize is always one, which the machine learning model is able to predict. For matrices of size smaller than 100, the performance is worst than the old backend. This is because the blocksize used for these matrices is which is so big that it can generate less blocks than there are threads. A finetuning of blocksize in these cases should solve the issue.
6 Conclusions and Perspectives
6.1 Conclusions
In conclusion, our results show that, when using a certain blocking on our given architecture, we are able to calculate an optimal chunksize for parallel linear algebra algorithms using machine learning models. We were also able to compare the model that predict the chunksize directly (PreTO) and the model that predicts the performance to guide the choice of chunksize (PosTO). In the literature, these two approaches were always used in isolation. Reasonable predictions of chunksize were achieved by both models but the PosTO had larger prediction times due to the optimization step required at runtime. Also, we developed our own custom decision tree model using the MSOP as lossfunction. This resulted in a decision tree that yields similar performance to classical decision trees but with half the number of nodes. The custom decision tree was seamlessly implemented in Blaze, only requiring a compiletime flag to launch the machine learning based scheduler. Early results of this scheduler show an improvement in performance when compared to the previous implementation of the HPX’s backend. Nevertheless, we observed that the decision tree would struggle as the number of threads increased.
6.2 Future Works
One limitation of the proposed method is that it was only tested on one type of architecture. For the methodology to be seen as viable, it must be portable for multiple architectures. Two critical experiments must be done in the future to show such viability. We must demonstrate that the methodology can be reproduced on another architecture and we must show that a single model can be trained using training data collected from multiple architectures. This will require the use of architecture features in the feature vector. Moreover, a more efficient blocking method should be developed to avoid the cost of empirically computing the best blocksize, This burden on the user inhibits the benefits of our approach. Other less critical research directions include: finding new features to characterize an operation type, include linear algebra operations with sparse matrices, and adding performance counters to the feature vector. Using performance counters will significantly change the workflow of the Blaze user but there may be something useful to learn from such experiment. To conclude, we believe that this work lays a good foundation for more research that can be done in the domain of machine learning applied to parallel linear algebra. However, a lot more work is required to assess the viability of the proposed methodology in reallife applications.
Supplementary material
All the artifacts of the paper are available on github^{1}^{1}1https://github.com/gablabc/BlazeML and on Zenodo [doi_data]. The modified verison of Blaze is available here^{2}^{2}2https://bitbucket.org/gablabc/blaze/src/hpx_backend_ML_Blaze3.4/.
Acknowledgment
This work is funded by the Department of Energy Award DENA0003525. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Department of Energy. Special thanks to Google Summer of Code 2018 for supporting the first three months of the research. Special thanks to Marc Laforest for his feedback on the paper.
Comments
There are no comments yet.