OpEvo: An Evolutionary Method for Tensor Operator Optimization

06/10/2020
by   Xiaotian Gao, et al.
Microsoft

Training and inference efficiency of deep neural networks relies heavily on the performance of tensor operators on hardware platforms. Manually optimized tensor operators have limitations in supporting new operators or new hardware platforms, so automatically optimizing the device code configurations of tensor operators is becoming increasingly attractive. However, current methods for tensor operator optimization usually suffer from poor sample-efficiency due to the combinatorial search space. In this work, we propose a novel evolutionary method, OpEvo, which efficiently explores the search spaces of tensor operators by introducing a topology-aware mutation operation based on a q-random walk distribution to leverage the topological structures over the search spaces. Our comprehensive experimental results show that OpEvo can find the best configuration with the fewest trials and the lowest variance compared with state-of-the-art methods. All code of this work is available online.

1 Introduction

Abundant applications raise the demand for training and running inference on deep neural networks (DNNs) efficiently on diverse hardware platforms, ranging from cloud servers to embedded devices. Moreover, computational graph-level optimization of deep neural networks, such as tensor operator fusion, may introduce new tensor operators. Thus, manually optimized tensor operators provided by hardware-specific libraries have limitations in supporting new hardware platforms or new operators, so automatically optimizing tensor operators on diverse hardware platforms is essential for the large-scale deployment and application of deep learning technologies to real-world problems.

Tensor operator optimization is essentially a combinatorial optimization problem. The objective function is the performance of a tensor operator on a specific hardware platform, which should be maximized with respect to the hyper-parameters of the corresponding device code, such as how to tile a matrix or whether to unroll a loop. Hereafter, we refer to a tuple of hyper-parameters determining device code as a configuration, and to the set of all possible configurations as a configuration space or search space. Unlike many typical problems of this type, such as the travelling salesman problem, the objective function of tensor operator optimization is a black box and expensive to sample: one has to compile device code with a specific configuration and run it on real hardware to obtain the corresponding performance metric. Therefore, a desirable method for optimizing tensor operators should find the best configuration with as few samples as possible.

The expensive objective function makes it almost impossible to solve the tensor operator optimization problem with traditional combinatorial optimization methods such as simulated annealing (SA) [1] and evolutionary algorithms (EA) [2]. Although these algorithms inherently support combinatorial search spaces, they do not take sample-efficiency into account, so thousands of samples or more are usually needed, which is unacceptable when tuning tensor operators in production environments. On the other hand, sequential model-based optimization (SMBO) methods have proved sample-efficient for optimizing black-box functions with continuous search spaces [3, 4, 5]. However, when optimizing functions with combinatorial search spaces, SMBO methods are not as sample-efficient as their continuous counterparts [6], because there is a lack of prior assumptions about the objective functions analogous to continuity and differentiability in the continuous case. For example, if one can assume that an objective function with a continuous search space is infinitely differentiable, a Gaussian process with a radial basis function (RBF) kernel can be used to model it. In this way, a sample provides not only a single value at a point but also local properties of the objective function in its neighborhood, or even global properties, which results in high sample-efficiency. In contrast, SMBO methods for combinatorial optimization suffer from poor sample-efficiency due to the lack of proper prior assumptions and of surrogate models that can leverage them.

In this work, we propose OpEvo (Operator Evolution), which combines the advantages of both EA and SMBO by leveraging prior assumptions on combinatorial objective functions within an evolutionary framework. Although there is no nice property like continuity or differentiability, we construct topological structures over the search spaces of tensor operators by assuming that similar configurations of a tensor operator will result in similar performance, and then introduce a topology-aware mutation operation based on a proposed q-random walk distribution to leverage the constructed topological structures for a better trade-off between exploration and exploitation. In this way, OpEvo not only inherits the support for combinatorial search spaces from EA, but also benefits from prior assumptions about combinatorial objective functions, so that OpEvo can efficiently optimize tensor operators. The contributions of this paper are three-fold:

  • We define q-random walk distributions over combinatorial search spaces equipped with topological structures for a better trade-off between exploitation and exploration;

  • We propose OpEvo, which leverages the topological structures over search spaces by introducing a novel topology-aware mutation operation based on q-random walk distributions;

  • We construct topological structures for the search spaces of tensor operator optimization and evaluate the proposed algorithm through comprehensive experiments on three representative types of tensor operators on both Nvidia and AMD platforms. Our experiments demonstrate that OpEvo can find the best configuration with the fewest trials and the lowest variance compared with state-of-the-art (SOTA) methods.

The rest of this paper is organized as follows. We summarize the related work in Section 2, then introduce a formal description of the tensor operator optimization problem and construct topological structures in Section 3. In Section 4, we describe the OpEvo method in detail, and in Section 5 we demonstrate its strength with experiments on optimizing typical tensor operators. Finally, we conclude in Section 6.

2 Related Work

As a class of popular methods for expensive black-box optimization, SMBO methods are potential solutions for tensor operator optimization. Although classic SMBO methods, such as Bayesian optimization (BO) with a Gaussian process surrogate, are usually used to optimize black-box functions with continuous search spaces, much work has been done on using SMBO to optimize combinatorial black-box functions. Hutter et al. [6] proposed SMAC, which successfully uses a random forest as a surrogate model to optimize algorithm configurations. Bergstra et al. [7] proposed TPE, which uses a tree-structured Parzen estimator as a surrogate model to optimize hyper-parameters of neural networks and deep belief networks. As for tensor operator optimization, the TVM framework [8] implements an SMBO method called AutoTVM [9] to optimize the parameters of tensor operators. Specifically, AutoTVM fits a surrogate model with either XGBoost [10] or TreeGRU [11], and then uses SA to optimize the surrogate model for generating a batch of candidates in an ε-greedy style. Although these methods have been successfully used in many combinatorial optimization problems, they are not as sample-efficient as their continuous counterparts due to the lack of proper prior assumptions and corresponding surrogate models. OpEvo, on the other hand, introduces and leverages topological structures over combinatorial search spaces and thus obtains better sample-efficiency than prior art.

Besides AutoTVM, two domain-specific methods, Greedy Best First Search (G-BFS) and Neighborhood Actor Advantage Critic (N-A2C), have been proposed recently to tune the matrix tiling schemes of matrix multiplication (MatMul) operators by taking the relations between different configurations into account [12]. They effectively introduce a topology over the configuration space of a MatMul operator by defining a neighborhood system on it, and then employ a Markov Decision Process (MDP) for exploration over the configuration space. By leveraging a domain-specific topological structure, G-BFS and N-A2C outperform AutoTVM in optimizing MatMul operators. However, these two methods are designed only for tuning tiling schemes of multiplications of matrices whose numbers of rows and columns are powers of 2, so they are not compatible with other types of configuration spaces. Further, they tend to suffer from the curse of dimensionality as the number of parameters to tune grows, because they only change one parameter at a time in the MDP they define. Thus, generalizing them to more general tensor operators is not straightforward. OpEvo, on the other hand, constructs topological structures in a general way and uses an evolutionary framework rather than an MDP framework to explore search spaces, so the aforementioned problems encountered by G-BFS and N-A2C are overcome.

3 Problem Formulation

As mentioned earlier, tensor operator optimization is essentially a black-box optimization problem with a combinatorial search space. It can be formally written as

x^* = argmax_{x ∈ S} f(x).        (1)

Here, f: S → R is a black-box function that measures the performance of a specific tensor operator with configuration x. We use trillion floating-point operations per second (TFLOPS) as the measurement in this work. A configuration x = (x_1, …, x_n) is an ordered n-tuple, and each component x_i corresponds to one hyper-parameter of the device code, so the entire search space S = S_1 × ⋯ × S_n is the Cartesian product of all component feasible sets S_i. Our aim is to find the optimal configuration x^* that achieves the maximum TFLOPS.

A topological structure over each S_i can be introduced by defining an undirected graph G_i = (V_i, E_i), where the set of vertices is V_i = S_i and the set of edges is E_i = {(u, v) : a_i(u, v) = 1}. Here a_i is an adjacency function mapping from S_i × S_i to {0, 1}: a_i(u, v) = 1 means that vertices u and v are adjacent, and otherwise u and v are not adjacent. In this way, one can introduce a topological structure over S_i by defining an adjacency function according to prior assumptions on f. For example, it is intuitive to treat u and v as adjacent if similar performance can be expected when the i-th hyper-parameter changes from u to v. The search process benefits from topological structures introduced this way by obtaining information about the neighborhood of each sample; for instance, configurations in the neighborhood of a poorly performing configuration are probably poor as well, so better sample-efficiency can be achieved.

Different tensor operators may have different types of hyper-parameters and corresponding feasible sets. In the rest of this section, we discuss three typical kinds of hyper-parameters of tensor operators along with their associated feasible sets, and construct topological structures for them. It should be noted that, besides these, one can easily introduce other types of hyper-parameters and construct corresponding topological structures in a similar way, based on concrete demands.

The first is an ordered k-tuple with a factorization constraint, S_i = {(d_1, …, d_k) : ∏_{j=1}^{k} d_j = L}, where k and L are constants depending on the specific tensor operator. We will refer to this type of parameter as a factorization parameter hereafter. The factorization parameter is required by a popular technique called matrix tiling, which improves the cache hit rate of memory access. It iteratively splits computation into smaller tiles to adapt memory access patterns to a particular hardware. From the implementation perspective, it transforms a single loop into nested loops, where k is the number of nested loops, L is the total loop length and d_j is the loop length of the j-th nested loop. We define two factorizations of L as adjacent if one of them can be transformed into the other by moving p, a prime factor of L, from one factor to another. This adjacency function can be formally written as a(d, d′) = 1 if there exist distinct indices j and m and a prime factor p of L such that d′_j = d_j / p and d′_m = d_m · p, and a(d, d′) = 0 otherwise. Figure 1 illustrates the topology defined this way for a small example.

The second type of hyper-parameter is the discrete parameter, whose feasible set contains finitely many numbers. The maximum step of loop unrolling is an example of a discrete parameter. There is a natural adjacency among discrete values since they are comparable: this natural adjacency function can be formally written as a(u, v) = 1 if no other value in the feasible set lies strictly between u and v, and a(u, v) = 0 otherwise.

The last type of hyper-parameter is the categorical parameter, whose feasible set contains finitely many elements that can be arbitrary entities. Choices such as whether to unroll a loop and which thread axis to dispatch computation onto are examples of categorical parameters. Unlike discrete parameters, there is no natural adjacency among categorical values, so all elements in the feasible set of a categorical parameter are treated as mutually adjacent, which can be formally written as a(u, v) = 1 for all u ≠ v, and a(u, u) = 0.
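To make these three adjacency functions concrete, the following sketch enumerates the neighbors of a value under each of them. It is a minimal illustration in Python under our own representation choices (a factorization as a tuple of factors, a discrete parameter as a list of values); the helper names are hypothetical and not taken from the released OpEvo code.

```python
def prime_factors(n):
    """Return the prime factors of n (with multiplicity)."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors


def factorization_neighbors(x):
    """Neighbors of a factorization (d_1, ..., d_k): move one prime factor
    from position j to a different position m."""
    neighbors = set()
    for j, xj in enumerate(x):
        for p in set(prime_factors(xj)):
            for m in range(len(x)):
                if m != j:
                    y = list(x)
                    y[j] //= p
                    y[m] *= p
                    neighbors.add(tuple(y))
    return neighbors


def discrete_neighbors(v, values):
    """Neighbors of a discrete value: the immediately smaller and larger
    values in the sorted feasible set."""
    values = sorted(values)
    i = values.index(v)
    return {values[j] for j in (i - 1, i + 1) if 0 <= j < len(values)}


def categorical_neighbors(v, values):
    """All other categories are adjacent to v."""
    return {u for u in values if u != v}


if __name__ == "__main__":
    print(factorization_neighbors((4, 3, 1)))            # one prime move away
    print(discrete_neighbors(16, [0, 4, 16, 64, 512]))   # {4, 64}
    print(categorical_neighbors("unroll", ["unroll", "no_unroll"]))
```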

4 Methodology

4.1 Evolutionary Algorithm

EA is a class of stochastic derivative-free optimization methods that can be used to solve problems of the form of Equation 1. EA imitates natural selection in the evolution of biological species to find the optimal configuration of an objective function. Evolutionary concepts are translated into algorithmic operations, i.e., selection, recombination and mutation [13], which significantly influence the effectiveness and efficiency of EA.

To efficiently search for the best configuration of a tensor operator, OpEvo leverages the topological structures defined in Section 3 within an evolutionary framework. Specifically, OpEvo evolves a population of configurations, which are also called individuals in EA terminology. The TFLOPS of executing a tensor operator on the target hardware measures an individual's quality, or fitness. At each evolutionary step, we select the top-ranked individuals as parents based on their fitnesses, and then recombine and mutate them to generate new individuals, or children. After evaluation, the children are added to the population as candidates for new parents at the next evolutionary step. This iteration repeats until some termination criterion is met.

In the rest of this section, we describe the selection, recombination and mutation operations of OpEvo in detail, illustrate how OpEvo leverages the topological structures, and explain why OpEvo can outperform previous art in this way.

4.2 Selection and Recombination

Suppose we already have a list of individuals ranked by their fitnesses. The top-μ ranked individuals are chosen to be parents, where μ governs the diversity of the evolutionary process. Evolution with a large μ tends to escape suboptima but sacrifices data efficiency, while evolution with a small μ converges more easily but is prone to getting stuck in suboptima.

A child is generated by recombining these selected parents in a stochastic way. Specifically, we sample the categorical distribution below, which has μ categories, n times to decide which parent each parameter of a child should inherit from:

P(x′_j = x_j^(i)) = f(x^(i)) / Σ_{k=1}^{μ} f(x^(k)),   i = 1, …, μ,        (2)

where n is the number of parameters in a configuration, superscripts index the different parents, subscripts index the different parameters in a configuration, and x′_j is the j-th parameter of the generated child x′. Equation 2 is a fitness-related distribution in which a parent with larger fitness has a higher probability of passing its characteristics to the offspring. Rank-based fitness shaping is a popular technique often used to avoid suboptima and accelerate convergence [14, 15]. However, we do not use it, because meaningful fitnesses are quite sparse when optimizing some tensor operators, and rank-based fitness shaping is harmful rather than helpful in these cases.
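As a concrete illustration of this selection and recombination step, the sketch below draws each child parameter from a parent chosen with probability proportional to its fitness, matching the fitness-related distribution of Equation 2. The data layout and function name are our own assumptions, not the released implementation.

```python
import random


def recombine(parents, fitnesses, rng=random.Random(0)):
    """Generate one child: each parameter is inherited from a parent chosen
    with probability proportional to that parent's fitness (Equation 2)."""
    total = sum(fitnesses)
    # Guard against the all-zero case (e.g. every parent failed to compile).
    weights = [f / total for f in fitnesses] if total > 0 else None
    n_params = len(parents[0])
    child = []
    for j in range(n_params):
        parent = rng.choices(parents, weights=weights, k=1)[0]
        child.append(parent[j])
    return child


# Toy usage: three parents, each with three parameters.
parents = [[(4, 3, 1), 16, "unroll"],
           [(2, 6, 1), 64, "no_unroll"],
           [(12, 1, 1), 4, "unroll"]]
fitnesses = [5.2, 3.1, 0.4]            # e.g. measured TFLOPS
print(recombine(parents, fitnesses))
```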

4.3 Mutation

OpEvo mutates each parameter x_j of each child by sampling a topology-aware probability distribution over the corresponding feasible set S_j. Given a topology over S_j and the current vertex, such a topology-aware probability distribution can be constructed by a random-walk-like process. The transition probability from a vertex v to an adjacent vertex u is

P(v → u) = q / |N(v)|,        (3)

where q ∈ (0, 1) is the mutation rate, which trades off exploration and exploitation: OpEvo tends toward exploration as q approaches 1, and toward exploitation as q approaches 0. N(v) is the set of all vertices adjacent to v, and |N(v)| denotes the cardinality of the set N(v). The major difference between the "random walk" defined by Equation 3 and a regular random walk is that the sum of the transition probabilities over all adjacent vertices is q rather than 1, so the "random walk" we introduce is not a Markov chain: at every step there is a probability of 1 − q of stopping the walk. In this way, given a current vertex v, the topology-aware probability distribution assigns to every u ∈ S_j the probability that a walk started from v stops at u. We will refer to the distribution defined this way as the q-random walk distribution hereafter. Appendix A formally proves that the q-random walk distribution is a valid probability distribution over S_j.

Figure 1: Illustration of the topology of a factorization parameter and of two q-random walk distributions over it (panels (a) and (b)).

To reveal the intuition behind the q-random walk distribution, Figure 1 shows two q-random walk distributions over the feasible set of a factorization parameter. They start from the same vertex (the blue vertex) but mutate with different q. It is easy to observe that vertices closer to the start vertex have a higher probability of being sampled, which ensures a good trade-off between exploration and exploitation. Further, the distribution with a larger q has a wider spread than the one with a smaller q, because a larger q encourages more jumps in the q-random walk process. In the asymptotic case q → 1, the q-random walk degenerates into a regular random walk on an undirected graph, which keeps jumping forever and eventually traverses all vertices of the graph, while for q = 0 the q-random walk vanishes and no mutation acts on the parameter. Thus, q is a hyper-parameter with which OpEvo trades off exploitation and exploration.

For a regular random walk on an undirected graph, i.e. q = 1, the probability of visiting a vertex is determined by the graph topology once the Markov chain induced by the random walk has converged. This is why random walks are used for graph embedding in many works [16]. The q-random walk distribution inherits this topology-aware nature. Among the vertices in Figure 1 that have the same distance to the start vertex, those with a more complex neighborhood have larger probability; of two vertices at the same distance from the start vertex, the one with the larger degree receives the larger probability. This property of the q-random walk distribution helps explore search spaces efficiently, because sampling vertices with more complex neighborhoods provides more knowledge about the objective function.
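The q-random walk distribution never needs to be materialized: mutation can simply simulate the walk, continuing to a uniformly chosen neighbor with probability q and stopping with probability 1 − q, which is exactly the transition rule of Equation 3. Below is a minimal sketch of this sampling procedure under an assumed neighbors() helper; it is illustrative rather than the released implementation.

```python
import random


def q_random_walk_sample(start, neighbors, q=0.5, rng=random.Random(0)):
    """Sample a mutated value from the q-random walk distribution:
    from the current vertex, continue to a uniformly chosen adjacent vertex
    with probability q, or stop with probability 1 - q (Equation 3)."""
    v = start
    while rng.random() < q:
        adjacent = list(neighbors(v))
        if not adjacent:          # isolated vertex: nowhere to walk
            break
        v = rng.choice(adjacent)
    return v


# Toy usage on the discrete feasible set {0, 4, 16, 64, 512}.
values = [0, 4, 16, 64, 512]

def discrete_neighbors(v):
    i = values.index(v)
    return [values[j] for j in (i - 1, i + 1) if 0 <= j < len(values)]

samples = [q_random_walk_sample(16, discrete_neighbors, q=0.5) for _ in range(10000)]
print({v: samples.count(v) for v in values})   # mass concentrates near the start
```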

4.4 Summary

The OpEvo algorithm is summarized in Algorithm 1. First, an initial batch of configurations is randomly generated and evaluated to initialize a priority queue Q ordered by decreasing fitness. Next, the top-μ configurations are taken from Q as parents and recombined to generate λ children as described in Section 4.2. Then each child is mutated as described in Section 4.3. Note that the same configuration is never sampled twice in the whole OpEvo process: the noise in the TFLOPS of executing a tensor operator on a given hardware is relatively small, so data efficiency benefits from non-duplicated samples. Consequently, when a mutated child is already in Q, we mutate it again until a configuration that has not been sampled is obtained. Finally, the fitnesses of the children are evaluated on the target hardware and the children are enqueued into Q. This iteration repeats until some termination criterion is met.

Input: all component feasible sets S_1, …, S_n, parents size μ, offspring size λ, mutation rate q
Output: optimal configuration found

1:  randomly generate an initial batch of configurations
2:  evaluate them to get the associated fitnesses, and enqueue them into a priority queue Q
3:  repeat
4:     select the top-μ parents from Q and recombine them to generate λ children according to Section 4.2
5:     mutate the children according to Section 4.3
6:     evaluate the children on hardware, and enqueue them into Q
7:  until a termination criterion is met
8:  return the best configuration found so far
Algorithm 1 OpEvo
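For concreteness, the following self-contained sketch runs the loop of Algorithm 1 on a toy search space with a synthetic objective standing in for hardware measurement. The search space, neighbor function and objective are hypothetical; only the overall structure (random initialization, top-μ selection, fitness-proportional recombination, per-parameter q-random walk mutation and deduplication) follows the algorithm described above.

```python
import random

rng = random.Random(0)

# Toy search space: three discrete parameters, each with eight values.
SPACE = [[1, 2, 4, 8, 16, 32, 64, 128]] * 3

def neighbors(dim, v):
    vals = SPACE[dim]
    i = vals.index(v)
    return [vals[j] for j in (i - 1, i + 1) if 0 <= j < len(vals)]

def fitness(cfg):
    # Synthetic stand-in for measured TFLOPS; peaks at (8, 16, 32).
    return 1.0 / (1 + abs(cfg[0] - 8) + abs(cfg[1] - 16) + abs(cfg[2] - 32))

def mutate(cfg, q):
    out = []
    for dim, v in enumerate(cfg):
        while rng.random() < q:              # q-random walk on each parameter
            v = rng.choice(neighbors(dim, v))
        out.append(v)
    return tuple(out)

def recombine(parents, fits):
    # Each parameter is inherited from a fitness-proportionally chosen parent.
    return tuple(rng.choices(parents, weights=fits, k=1)[0][j]
                 for j in range(len(SPACE)))

def opevo(mu=4, lam=8, q=0.5, budget=120):
    seen = {}
    while len(seen) < lam:                   # random initial population
        cfg = tuple(rng.choice(vals) for vals in SPACE)
        seen[cfg] = fitness(cfg)
    while len(seen) < budget:
        ranked = sorted(seen, key=seen.get, reverse=True)[:mu]
        fits = [seen[c] for c in ranked]
        for _ in range(lam):
            child = mutate(recombine(ranked, fits), q)
            while child in seen:             # never evaluate a configuration twice
                child = mutate(child, q)
            seen[child] = fitness(child)
    return max(seen, key=seen.get)

print(opevo())                               # expected to approach (8, 16, 32)
```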

5 Experiments

We now evaluate the empirical performance of the proposed method on three typical kinds of tensor operators, MatMul, BatchMatMul and 2D Convolution, on both Nvidia (GTX 1080Ti) and AMD (MI50) platforms. All tensor operators in our experiments are described and generated with the TVM framework, and then compiled and run with CUDA 10.0 or ROCm 2.9. We compare OpEvo with the three aforementioned SOTA methods: G-BFS, N-A2C and AutoTVM. In our experiments, OpEvo, G-BFS and N-A2C are implemented by us on top of Neural Network Intelligence (NNI, https://github.com/microsoft/nni/), and AutoTVM is implemented by its authors in the TVM project (https://github.com/apache/incubator-tvm). All code for OpEvo, G-BFS, N-A2C and our benchmarks is publicly available with the NNI project. Please refer to Appendix B for more details about the experiments.

MatMul. Three different MatMul operators are chosen from BERT to evaluate the proposed method. The best performance obtained so far versus the number of trials used is illustrated in Figure 2. The lines denote the averages of 5 runs, while the shaded areas indicate standard deviations; different colors and line styles represent different algorithms. Table 1 shows the mean and standard deviation of the best TFLOPS obtained by each algorithm after 512 samples. From the results, it can easily be concluded that the methods leveraging a predefined topology, OpEvo, G-BFS and N-A2C, substantially outperform the general SMBO method, AutoTVM. G-BFS and N-A2C leverage the underlying topology by introducing an MDP, so only local topology is considered when exploring the configuration space, while OpEvo can consider the global topology thanks to the mutation operation based on the q-random walk distribution. Therefore, OpEvo performs better than G-BFS and N-A2C in most cases in terms of the mean and variance of the best TFLOPS and in terms of data-efficiency.

Figure 2: Algorithm comparison for three MatMul operators. The upper row shows results on the Nvidia platform, while the lower row shows results on the AMD platform. The three columns correspond to the operators MM1, MM2 and MM3 in Table 1, from left to right.


Table 1: Mean and standard deviation of the best TFLOPS obtained by optimizing three MatMul operators with G-BFS, N-A2C, AutoTVM and OpEvo after 512 samples. Please refer to Appendix B.1 for a detailed description of these operators.

BatchMatMul. Three BatchMatMul operators are also selected from BERT for evaluation. All of these operators carry a batch dimension, so G-BFS and N-A2C cannot optimize them, since those methods only handle matrices whose numbers of rows and columns are powers of 2. The comparison between OpEvo and AutoTVM is shown in Figure 3 and Table 2. Compared with the MatMul operators, BatchMatMul has an order of magnitude larger search space, since one more parameter needs to be optimized. Also, the generated BatchMatMul device code is more likely to overflow the device memory, as the tile size of BatchMatMul is bigger than that of MatMul, which leads to sparser performance measurements. Despite these challenges, OpEvo still performs well thanks to its global exploration mechanism, and the variance of the best performance is even better than for MatMul because of the sparsity of the objective function.

Figure 3: Algorithm comparison for three BatchMatMul operators. The upper row shows results on the Nvidia platform, while the lower row shows results on the AMD platform. The three columns correspond to the operators BMM1, BMM2 and BMM3 in Table 2, from left to right.


Table 2: Mean and standard deviation of the best TFLOPS obtained by optimizing three BatchMatMul operators with AutoTVM and OpEvo after 512 samples. Please refer to Appendix B.2 for a detailed description of these operators.

2D Convolution. Two two-dimensional convolution operators are chosen from AlexNet for evaluation. Because G-BFS and N-A2C cannot handle discrete and categorical parameters, they cannot tune convolution operators. Figure 4 and Table 3 show the comparison between OpEvo and AutoTVM. Compared with the tensor operators discussed before, the convolution operators have a more complex search space, which is harder to model. Although XGBoost, which AutoTVM uses to model the objective function, is a tree-boosting model that is friendly to discrete and categorical parameters, AutoTVM still performs worse than OpEvo, because EA inherently supports complex search spaces and OpEvo further improves sample-efficiency by leveraging domain knowledge.

Figure 4: Algorithm comparison for two 2D Convolution operators. The upper row shows results on the Nvidia platform, while the lower row shows results on the AMD platform. The two columns correspond to the operators C1 and C2 in Table 3, from left to right.


Table 3: Mean and standard deviation of the best TFLOPS obtained by optimizing two convolution operators with AutoTVM and OpEvo after 512 samples. Please refer to Appendix B.3 for a detailed description of these operators.

6 Conclusion

In this paper, we proposed OpEvo, a novel evolutionary method that can efficiently optimize tensor operators. We constructed topological structures for the search spaces of tensor operators and introduced a topology-aware mutation operation based on the q-random walk distribution, so that OpEvo can leverage the constructed topological structures to guide exploration. Empirical results show that OpEvo outperforms SOTA methods in terms of best performance, variance of the best performance and data-efficiency. This work also demonstrates that properly leveraging suitable prior assumptions on objective functions is key to sample-efficiency, regardless of whether a method is model-based or model-free: even EA can beat SMBO in terms of sample-efficiency as long as proper prior assumptions are effectively leveraged. We note that the proposed method can not only be used to optimize tensor operators, but is also generally applicable to other combinatorial search spaces with underlying topological structures. We will investigate this in future work.

We would like to thank Lidong Zhou and Jing Bai for their helpful comments, and all contributors of the NNI project for the powerful and user-friendly framework they created.

References

Appendix A Proof for q-Random Walk Distribution

In this section, we formally prove that the q-random walk distribution defined in Section 4.3 is a valid probability distribution over a finite undirected graph G = (V, E). First, we denote the probability that the random walk reaches each vertex at time step t as a vector p_t, so p_0 is a one-hot vector whose component corresponding to the start vertex is 1 while all other components are 0. Further, the transition matrix P determined by Equation 3 and the graph G satisfies p_{t+1} = P p_t by the definition of the random walk. Note that the sum of each column of P is q, because it is the sum of the transition probabilities over all vertices adjacent to the corresponding vertex.

Lemma 1. (I − P)^{-1} exists.

Proof.

Suppose that I − P is not invertible. Then there exists a nontrivial x such that (I − P)x = 0, so Px = x. This shows that 1 is an eigenvalue of P, and thus ρ(P), the spectral radius of P, satisfies ρ(P) ≥ 1. However, since every column of P sums to q < 1, we have ρ(P) ≤ q < 1, which is a contradiction, so (I − P)^{-1} exists. ∎

Lemma 2. The sum of each column of (I − P)^{-1} is 1 / (1 − q).

Proof.

Let 1 denote the vector with all entries equal to 1. The fact that each column of P sums to q means 1^T P = q 1^T, so 1^T (I − P) = (1 − q) 1^T and hence 1^T (I − P)^{-1} = 1^T / (1 − q), i.e., each column of (I − P)^{-1} sums to 1 / (1 − q). ∎

Theorem 1. The q-random walk distribution is a valid probability distribution over V.

Proof.

The probability vector describing where the walk stops at time step t is (1 − q) p_t = (1 − q) P^t p_0, so the probability vector π describing the q-random walk distribution is

π = (1 − q) Σ_{t=0}^{∞} P^t p_0.        (4)

Left-multiplying both sides of the above equation by P yields

P π = (1 − q) Σ_{t=1}^{∞} P^t p_0.        (5)

Subtracting Equation 5 from Equation 4 yields

(I − P) π = (1 − q) p_0,   i.e.,   π = (1 − q)(I − P)^{-1} p_0.

Further, since 1 − q is positive, all entries of (I − P)^{-1} are non-negative and p_0 is a one-hot vector, all entries of the vector π are non-negative, and by Lemma 2 they sum to 1. Therefore, the q-random walk distribution is a valid distribution over the finite undirected graph G. ∎
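The closed form π = (1 − q)(I − P)^{-1} p_0 obtained above can be checked numerically on a small graph. The sketch below builds the transition matrix of Equation 3 for a four-vertex path graph, computes π, and verifies that its entries are non-negative and sum to 1; the graph and starting vertex are arbitrary choices made only for this sanity check.

```python
import numpy as np

q = 0.6
# Path graph with 4 vertices: 0 - 1 - 2 - 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
n = len(adj)

# Transition matrix of Equation 3: P[u, v] = q / |N(v)| for u adjacent to v,
# so each column sums to q (the probability of continuing the walk).
P = np.zeros((n, n))
for v, nbrs in adj.items():
    for u in nbrs:
        P[u, v] = q / len(nbrs)

p0 = np.zeros(n)
p0[1] = 1.0                       # start the walk at vertex 1

pi = (1 - q) * np.linalg.solve(np.eye(n) - P, p0)
print(pi, pi.sum())               # non-negative entries summing to 1
assert np.all(pi >= 0) and np.isclose(pi.sum(), 1.0)
```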

Appendix B Experiment details

In this section, we describe the details of the experiments shown in Section 5. For all experiments involving G-BFS and N-A2C, we set the random selection parameter for the G-BFS method and the maximum number of search steps for the N-A2C method following Reference [12], which proposed these two methods, and set the initial state of both methods to the configuration without multi-level matrix tiling. For all experiments involving AutoTVM, we use the default settings in the TVM project. As for OpEvo, the parents size and offspring size are set to the same value and the mutation rate is kept fixed across all experiments. We run each algorithm 5 times for each operator.

B.1 MatMul

MatMul is one of the most basic yet crucial tensor operators in deep learning as well as in other applications. It multiplies two input matrices A, of shape M × K, and B, of shape K × N, to produce an output matrix C, of shape M × N, by computing C[i, j] = Σ_k A[i, k] · B[k, j] for all elements of C. The execution efficiency of MatMul depends highly on the cache hit rate of memory access, since the operand matrices are usually too large to fit in cache.

Matrix tiling is a popular solution to this problem. It iteratively splits the computation into smaller tiles to adapt memory access patterns to a particular hardware. In this experiment, following the built-in schedule policy provided by TVM (see https://github.com/apache/incubator-tvm/blob/v0.6/topi/python/topi/cuda/dense.py for more details), we factorize M and N into four pieces each. The innermost factors define the shape of a basic tile, while the remaining factors determine how many thread blocks are launched, how many threads each block contains, and how many basic tiles each thread computes; in other words, the factorization of M and N governs the computational granularity. Further, K is split into three factors corresponding to the three stages of data transfer across three levels of the GPU memory hierarchy: global memory, shared memory and vector general purpose registers (VGPRs). That is to say, the factorization of K controls the data granularity. In short, the search space for optimizing MatMul operators is composed of three factorization parameters.
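To give a feel for the size of these factorization spaces, the short helper below (an illustrative sketch, not code from TVM or OpEvo) enumerates all ordered ways of splitting a loop of length L into k factors, i.e., the feasible set of a single factorization parameter.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def factorizations(L, k):
    """All ordered k-tuples of positive integers whose product is L."""
    if k == 1:
        return [(L,)]
    result = []
    for d in range(1, L + 1):
        if L % d == 0:
            result.extend((d,) + rest for rest in factorizations(L // d, k - 1))
    return result


# Example: splitting a loop of length 64 into 4 factors, as in the MatMul
# schedule described above, already yields a sizeable feasible set.
tiles = factorizations(64, 4)
print(len(tiles), tiles[:5])
```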

Three MatMul operators are selected from BERT for evaluation. MM1, MM2 and MM3 in Table 1 denote three matrix multiplications with different input shapes drawn from BERT.

B.2 BatchMatMul

BatchMatMul is another important tensor operator commonly used in deep learning models. It takes two tensors A and B as inputs and outputs a tensor C by computing C[b, i, j] = Σ_k A[b, i, k] · B[b, k, j] for all elements of C. Very similarly to MatMul, matrix tiling is essential for the execution efficiency of BatchMatMul. Besides the factorization of M, N and K, we also factorize the batch dimension, so there are four factorization parameters to optimize for BatchMatMul operators.

Three BatchMatMul operators are also selected from BERT for evaluation. In Table 2, BMM1 multiplies a batch of matrices by a batch of matrices, BMM2 multiplies a batch of transposed matrices by a batch of matrices, and BMM3 multiplies a batch of matrices by a batch of transposed matrices.

B.3 2D Convolution

Without any exaggeration, convolution is at the heart of modern computer vision, so almost all vision-related applications can benefit from speeding up the execution of convolution operators. A two-dimensional convolution with stride s and padding p takes an image tensor X of shape (N, CI, H, W) and a kernel tensor K of shape (CO, CI, KH, KW) as input, and outputs a tensor Y of shape (N, CO, HO, WO), where

HO = ⌊(H + 2p − KH) / s⌋ + 1,   WO = ⌊(W + 2p − KW) / s⌋ + 1.

Each element of Y can be obtained by

Y[n, co, i, j] = Σ_{ci} Σ_{kh} Σ_{kw} X[n, ci, i·s + kh − p, j·s + kw − p] · K[co, ci, kh, kw],        (6)

where out-of-range indices into X refer to zero padding.

It should be noted that there are also other methods for computing convolutions, such as FFT convolution [17, 18] and Winograd convolution [19]. In this work, all convolution operators are based on the direct convolution described by Equation 6.
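To pin down the indexing convention of Equation 6, here is a naive NumPy sketch of direct convolution in NCHW layout with a single stride and symmetric zero padding; the tensor names and shapes follow the notation introduced above, and the code is meant for reference only, not as an optimized kernel.

```python
import numpy as np


def conv2d_direct(X, K, stride=1, pad=0):
    """Direct 2D convolution (Equation 6). X: (N, CI, H, W), K: (CO, CI, KH, KW)."""
    N, CI, H, W = X.shape
    CO, CI_k, KH, KW = K.shape
    assert CI == CI_k
    Xp = np.pad(X, ((0, 0), (0, 0), (pad, pad), (pad, pad)))
    HO = (H + 2 * pad - KH) // stride + 1
    WO = (W + 2 * pad - KW) // stride + 1
    Y = np.zeros((N, CO, HO, WO), dtype=X.dtype)
    for n in range(N):
        for co in range(CO):
            for i in range(HO):
                for j in range(WO):
                    patch = Xp[n, :, i * stride:i * stride + KH, j * stride:j * stride + KW]
                    Y[n, co, i, j] = np.sum(patch * K[co])
    return Y


# Tiny smoke test: output spatial size follows the HO/WO formulas above.
X = np.arange(2 * 3 * 5 * 5, dtype=np.float64).reshape(2, 3, 5, 5)
K = np.ones((4, 3, 3, 3))
print(conv2d_direct(X, K, stride=2, pad=1).shape)   # (2, 4, 3, 3)
```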

Following the TVM built-in schedule policy (see https://github.com/apache/incubator-tvm/blob/master/topi/python/topi/cuda/conv2d_direct.py for more information), we split the output-channel, output-height and output-width loops into four factors each for computational granularity, and split the reduction loops over the input channels, kernel height and kernel width into two factors each for data granularity. Besides these six factorization parameters, there is one categorical parameter, a binary choice controlling whether to explicitly emit the macro #pragma unroll in the generated source code as a hint to the downstream compiler, and one discrete parameter, which controls the maximum number of unrolling steps in the TVM code generation pass.

Two 2D convolution operators are selected from AlexNet for evaluation. C1 and C2 in Table 3 denote two convolution layers with different input shapes, kernel shapes, strides and paddings.