1 Introduction
Abundant applications raise the demand for efficiently training and running inference on deep neural networks (DNNs) across diverse hardware platforms, ranging from cloud servers to embedded devices. Moreover, computational graph-level optimization of deep neural networks, like tensor operator fusion, may introduce new tensor operators. Thus, manually optimized tensor operators provided by hardware-specific libraries have limitations in supporting new hardware platforms or new operators, so automatically optimizing tensor operators on diverse hardware platforms is essential for the large-scale deployment and application of deep learning technologies to real-world problems.
Tensor operator optimization is essentially a combinatorial optimization problem. The objective function is the performance of a tensor operator on a specific hardware platform, which should be maximized with respect to the hyperparameters of the corresponding device code, such as how to tile a matrix or whether to unroll a loop. Hereafter, we will refer to a tuple of hyperparameters determining device code as a configuration, and to the set of all possible configurations as a configuration space or search space. Unlike many typical problems of this type, such as the travelling salesman problem, the objective function of tensor operator optimization is a black box and expensive to sample: one has to compile device code with a specific configuration and run it on real hardware to obtain the corresponding performance metric. Therefore, a desirable method for optimizing tensor operators should find the best configuration with as few samples as possible.
The expensive objective function makes solving the tensor operator optimization problem with traditional combinatorial optimization methods, for example, simulated annealing (SA) [1] and evolutionary algorithms (EA) [2], almost impossible. Although these algorithms inherently support combinatorial search spaces, they do not take sample-efficiency into account, so thousands of samples or more are usually needed, which is unacceptable when tuning tensor operators in production environments. On the other hand, sequential model based optimization (SMBO) methods are proven sample-efficient for optimizing black-box functions with continuous search spaces [3, 4, 5]. However, when optimizing functions with combinatorial search spaces, SMBO methods are not as sample-efficient as their continuous counterparts [6], because there is a lack of prior assumptions about the objective functions, such as the continuity and differentiability available in the case of continuous search spaces. For example, if one could assume an objective function with a continuous search space is infinitely differentiable, a Gaussian process with a radial basis function (RBF) kernel could be used to model the objective function. In this way, a sample provides not only a single value at a point but also the local properties of the objective function in its neighborhood, or even global properties, which results in high sample-efficiency. In contrast, SMBO methods for combinatorial optimization suffer poor sample-efficiency due to the lack of proper prior assumptions and of surrogate models which can leverage them.
In this work, we propose OpEvo (Operator Evolution), which combines the advantages of both EA and SMBO by leveraging prior assumptions on combinatorial objective functions within an evolutionary framework. Although there is no nice property like continuity or differentiability, we construct topological structures over the search spaces of tensor operators by assuming that similar configurations of a tensor operator will result in similar performance, and then introduce a topology-aware mutation operation by proposing a random walk distribution that leverages the constructed topological structures for a better trade-off between exploration and exploitation. In this way, OpEvo not only inherits support for combinatorial search spaces from EA, but also benefits from prior assumptions about combinatorial objective functions, so that OpEvo can efficiently optimize tensor operators. The contributions of the paper are threefold:

We define q-random walk distributions over combinatorial search spaces equipped with topological structures for a better trade-off between exploitation and exploration;

We propose OpEvo, which can leverage the topological structures over search spaces by introducing a novel topology-aware mutation operation based on q-random walk distributions;

We construct topological structures for the search spaces of tensor operator optimization and evaluate the proposed algorithm with comprehensive experiments on three representative types of tensor operators on both Nvidia and AMD platforms. Our experiments demonstrate that OpEvo can find the best configuration with the fewest trials and the lowest variance compared with state-of-the-art (SOTA) methods.
The rest of this paper is organized as follows. We summarize the related work in Section 2, and then introduce a formal description of the tensor operator optimization problem and construct topological structures in Section 3. In Section 4, we describe the OpEvo method in detail, and we demonstrate its strength with experiments on optimizing typical tensor operators in Section 5. Finally, we conclude in Section 6.
2 Related Work
As a class of popular methods for expensive black-box optimization, SMBO methods are potential solutions for tensor operator optimization. Although classic SMBO methods, such as Bayesian optimization (BO) with a Gaussian process surrogate, are usually used to optimize black-box functions with continuous search spaces, much work has been done on using SMBO to optimize combinatorial black-box functions. Hutter et al. [6] proposed SMAC, which uses a random forest as a surrogate model, and successfully applied it to algorithm configuration.
Bergstra et al. [7] proposed TPE, which uses a tree-structured Parzen estimator as a surrogate model to optimize the hyperparameters of neural networks and deep belief networks. As for tensor operator optimization, the TVM [8] framework implements an SMBO method called AutoTVM [9] to optimize the parameters of tensor operators. Specifically, AutoTVM fits a surrogate model with either XGBoost [10] or TreeGRU [11], and then uses SA to optimize the surrogate model for generating a batch of candidates in a greedy style. Although these methods have been successfully used in many combinatorial optimization problems, they are not as sample-efficient as their continuous counterparts due to the lack of proper prior assumptions and corresponding surrogate models. OpEvo, on the other hand, introduces and leverages topological structures over combinatorial search spaces and thus obtains better sample-efficiency than previous arts.

Besides AutoTVM, two domain-specific methods, Greedy Best First Search (GBFS) and Neighborhood Actor Advantage Critic (NA2C), have been proposed recently to tune the matrix tiling schemes of matrix multiplication (MatMul) operators by taking the relation between different configurations into account [12]. They effectively introduce a topology over the configuration space of a MatMul operator by defining a neighborhood system on it, and further employ a Markov Decision Process (MDP) for exploration of the configuration space. By leveraging a domain-specific topological structure, GBFS and NA2C outperform AutoTVM in optimizing MatMul operators. However, these two methods are designed only for tuning tiling schemes for the multiplication of matrices whose rows and columns are powers of 2, so they are not compatible with other types of configuration spaces. Further, they tend to encounter the curse of dimensionality as the number of parameters to be tuned grows, because they change only one parameter at a time based on the MDP they define. Thus, generalizing them to more general tensor operators is not straightforward. OpEvo, on the other hand, constructs topological structures in a general way and uses an evolutionary framework rather than an MDP framework to explore search spaces, so the aforementioned problems encountered by GBFS and NA2C are overcome.
3 Problem Formulation
As mentioned earlier, tensor operator optimization is essentially a black-box optimization problem with a combinatorial search space. It can be formally written as

(1)  $\mathbf{x}^{*} = \arg\max_{\mathbf{x} \in \mathcal{S}} f(\mathbf{x})$
Here, $f$ is a black-box function that measures the performance of a specific tensor operator with configuration $\mathbf{x}$. We use trillion floating-point operations per second (TFLOPS) as the measurement in this work. A configuration $\mathbf{x} = (x_1, \dots, x_d)$ is an ordered tuple, and each component $x_i$ corresponds to a hyperparameter of the device code, so the entire search space $\mathcal{S} = \mathcal{S}_1 \times \dots \times \mathcal{S}_d$ is the Cartesian product of all component feasible sets $\mathcal{S}_i$. Our aim is to find the optimal configuration $\mathbf{x}^{*}$ that corresponds to the maximum TFLOPS.

A topological structure over each $\mathcal{S}_i$ can be introduced by defining an undirected graph $G_i = (V_i, E_i)$, where the set of vertices $V_i$ is $\mathcal{S}_i$, and the set of edges is $E_i = \{(u, v) \mid a(u, v) = 1\}$. Here $a$ is an adjacency function mapping from $\mathcal{S}_i \times \mathcal{S}_i$ to $\{0, 1\}$: $a(u, v) = 1$ means vertices $u$ and $v$ are adjacent, and $a(u, v) = 0$ means they are not. In this way, one can introduce a topological structure over $\mathcal{S}_i$ by defining an adjacency function according to prior assumptions on $f$. For example, it is intuitive to treat $u$ and $v$ as adjacent if similar performance can be expected when changing the configuration from $u$ to $v$. The search process can benefit from topological structures introduced this way by obtaining information about the neighborhood of samples; for instance, the configurations in the neighborhood of a poorly performing configuration are probably poor as well, so better sample-efficiency can be achieved.
Different tensor operators may have different types of hyperparameters and corresponding feasible sets. In the rest of this section, we discuss three typical kinds of hyperparameters of tensor operators along with their associated feasible sets, and construct topological structures for them. It should be noted that, besides these, one can easily introduce other types of hyperparameters and construct corresponding topological structures in a similar way based on concrete demands.
The first is the ordered tuple with a factorization constraint, $\mathcal{S}_i = \{(x^{(1)}, \dots, x^{(k)}) \mid \prod_{j=1}^{k} x^{(j)} = n\}$, where the constants $k$ and $n$ depend on the specific tensor operator. We will refer to this type of parameter as a factorization parameter hereafter. The factorization parameter is required by a popular technique called matrix tiling, which improves the cache hit rate of memory access. It iteratively splits computation into smaller tiles to adapt memory access patterns to a particular hardware. From the implementation perspective, it transforms a single loop into $k$ nested loops, where $n$ is the total loop length and $x^{(j)}$ is the loop length of the $j$-th nested loop. We define two factorizations of $n$ to be adjacent if one of them can be transformed into the other by moving a prime factor $p$ of $n$ from the $j$-th factor to the $l$-th factor. This adjacency function can be formally written as $a(u, v) = 1$ if there exist distinct indices $j, l$ and a prime $p$ such that $v^{(j)} = u^{(j)} / p$ and $v^{(l)} = p \, u^{(l)}$, and $a(u, v) = 0$ otherwise. Figure 1 illustrates the topology defined this way for a small choice of $n$ and $k$.
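To make the factorization topology concrete, the following sketch (with hypothetical helper names, not part of OpEvo's released code) enumerates the feasible set of a factorization parameter and the neighbors of one factorization under the adjacency function above:

```python
def prime_factors(n):
    """Prime factors of n with multiplicity, e.g. 12 -> [2, 2, 3]."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

def factorizations(n, k):
    """All ordered tuples (x1, ..., xk) of positive integers whose product is n."""
    if k == 1:
        return [(n,)]
    result = []
    for x1 in range(1, n + 1):
        if n % x1 == 0:
            result += [(x1,) + rest for rest in factorizations(n // x1, k - 1)]
    return result

def neighbors(x, n):
    """Factorizations adjacent to x: move one prime factor of n
    from the i-th factor to a different j-th factor."""
    adj = set()
    for i, xi in enumerate(x):
        for p in set(prime_factors(xi)):
            for j in range(len(x)):
                if j != i:
                    y = list(x)
                    y[i] //= p
                    y[j] *= p
                    adj.add(tuple(y))
    return adj
```

For example, `factorizations(4, 2)` yields `[(1, 4), (2, 2), (4, 1)]`, and `(2, 2)` is adjacent to both of the other two vertices.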
The second type of hyperparameter is the discrete parameter, whose feasible set $\mathcal{S}_i = \{c_1, \dots, c_m\}$ contains finitely many numbers. The maximum step of loop unrolling is an example of a discrete parameter. There is a natural adjacency among discrete parameters since they have a well-defined ordering. Assuming the elements are sorted so that $c_1 < c_2 < \dots < c_m$, this natural adjacency function can be formally written as $a(c_j, c_l) = 1$ if $|j - l| = 1$, and $a(c_j, c_l) = 0$ otherwise.
The last type of hyperparameter is the categorical parameter, whose feasible set contains finitely many elements that can be arbitrary entities. Choices like whether to unroll a loop and which thread axis to dispatch computation to are examples of categorical parameters. Unlike discrete parameters, there is no natural adjacency among categorical parameters, so all elements in the feasible set of a categorical parameter are treated as adjacent, which can be formally written as $a(u, v) = 1$ for all $u \ne v$, and $a(u, u) = 0$.
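Both of these adjacency functions can be materialized as small graphs. The sketch below (illustrative helper names and made-up parameter values, not taken from the paper) builds adjacency lists for a discrete and a categorical parameter:

```python
def build_graph(vertices, adjacent):
    """Adjacency-list representation of the undirected graph G = (V, E)
    induced by an adjacency function adjacent(u, v) -> bool."""
    return {u: [v for v in vertices if v != u and adjacent(u, v)]
            for u in vertices}

# Discrete parameter: values are adjacent iff consecutive in sorted order.
steps = [0, 16, 64, 512, 1500]
order = {v: i for i, v in enumerate(sorted(steps))}
disc = build_graph(steps, lambda u, v: abs(order[u] - order[v]) == 1)

# Categorical parameter: every pair of choices is adjacent (complete graph).
cat = build_graph(["none", "x", "y"], lambda u, v: True)
```

Here `disc[16]` is `[0, 64]`, while every categorical choice neighbors all the others.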
4 Methodology
4.1 Evolutionary Algorithm
EA is a class of stochastic derivative-free optimization methods that can be used to solve problems defined by Equation 1. EA imitates natural selection in the evolutionary process of biological species to find the optimal configuration of an objective function. Evolutionary concepts are translated into algorithmic operations, i.e., selection, recombination and mutation [13], which significantly influence the effectiveness and efficiency of an EA.
To efficiently search for the best configuration of a tensor operator, OpEvo leverages the topological structures defined in Section 3 within an evolutionary framework. Specifically, OpEvo evolves a population of configurations, also called individuals in EA terminology. The TFLOPS of executing a tensor operator on the target hardware is the measure of an individual's quality, or fitness. At each evolutionary step, we select the top-ranked individuals as parents based on their fitnesses, and then recombine and mutate them to generate new individuals, or children. After evaluation, the children are added to the population as candidates for new parents at the next evolutionary step. This iteration repeats until some termination criteria are met.
In the rest of this section, we will describe the selection, recombination and mutation operations of OpEvo in detail and illustrate how OpEvo leverages the topological structures and why OpEvo can outperform previous arts in this way.
4.2 Selection and Recombination
Suppose we already have a list of individuals ranked by their fitnesses. The top $\lambda$ ranked individuals are chosen as parents, where $\lambda$ governs the diversity of the evolutionary process. Evolution with a large $\lambda$ tends to escape suboptima but sacrifices data efficiency, while evolution with a small $\lambda$ converges more easily but is prone to getting stuck in suboptima.
A child is generated by recombining the selected parents in a stochastic way. Specifically, we sample the categorical distribution below, which has $\lambda$ categories, $d$ times to decide which parent each parameter of a child should be inherited from.

(2)  $p\left(x_i^{c} = x_i^{(j)}\right) = \frac{f(\mathbf{x}^{(j)})}{\sum_{l=1}^{\lambda} f(\mathbf{x}^{(l)})}, \quad j = 1, \dots, \lambda$

where $d$ is the number of parameters in a configuration, superscripts index the different parents, and subscripts index the different parameters in a configuration. $x_i^{c}$ is the $i$-th parameter of the generated child $\mathbf{x}^{c}$.
Equation 2 is a fitness-related distribution in which a parent with a larger fitness has a bigger probability of transferring its characters to the offspring. Rank-based fitness shaping is a popular technique often used to avoid suboptima and accelerate convergence [14, 15]. However, we do not use it, because meaningful fitnesses are quite sparse when optimizing some tensor operators, and rank-based fitness shaping is harmful rather than helpful in these cases.
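The fitness-weighted recombination of Equation 2 can be sketched as follows; this is a minimal illustration with a hypothetical helper name, not the implementation released with the paper:

```python
import random

def recombine(parents, fitnesses):
    """Generate one child: each of the d parameters is inherited from a
    parent drawn from the fitness-weighted categorical distribution of
    Equation 2, sampled independently per parameter."""
    d = len(parents[0])
    total = sum(fitnesses)
    weights = [f / total for f in fitnesses]
    return tuple(
        parents[random.choices(range(len(parents)), weights)[0]][i]
        for i in range(d)
    )
```

With fitnesses `[1.0, 0.0]` the child always inherits every parameter from the first parent, matching the fitness-proportional intent of Equation 2.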
4.3 Mutation
OpEvo mutates each parameter $x_i^{c}$ of each child by sampling a topology-aware probability distribution over the corresponding feasible set $\mathcal{S}_i$. Given a topology over $\mathcal{S}_i$ and the current vertex, such a topology-aware probability distribution can be constructed by a random-walk-like process. The transition probability from a vertex $v$ to an adjacent vertex $u$ is

(3)  $p(v \to u) = \frac{q}{|N(v)|}$

where $q \in (0, 1)$ is the mutation rate, which trades off exploration and exploitation: OpEvo tends toward exploration as $q$ approaches $1$, and toward exploitation as $q$ approaches $0$. $N(v)$ is the set of all vertices adjacent to $v$, and $|\cdot|$ denotes the cardinality of a set. The major difference between the "random walk" defined by Equation 3 and a regular random walk is that the summation of the transition probability over all adjacent vertices is $q$ rather than $1$, so the "random walk" we introduce is not a Markov chain: at each step there is a probability of $1 - q$ of stopping the walk. In this way, given a current vertex $v$, the topology-aware probability distribution is defined, for every vertex $u \in \mathcal{S}_i$, as the probability that a walk started from $v$ stops at $u$. We will refer to the distribution defined this way as the q-random walk distribution hereafter. Appendix A formally proves that the q-random walk distribution is a valid probability distribution over $\mathcal{S}_i$.

To reveal the intuition behind the q-random walk distribution, two q-random walk distributions over the feasible set of a factorization parameter are illustrated in Figure 1. They start from the same vertex (the blue vertex) but mutate with different $q$. It can easily be observed that vertices closer to the start vertex have a higher probability of being sampled, which ensures a good trade-off between exploration and exploitation. Further, the distribution with a larger $q$ has a wider spread than the one with a smaller $q$, because a larger $q$ encourages more jumps in the random walk process. Considering the asymptotic case of $q \to 1$, the q-random walk degenerates into a regular random walk on an undirected graph, which keeps jumping forever and eventually traverses all vertices of the graph, while in the case of $q \to 0$ the random walk vanishes and no mutation acts on the parameter. Thus, $q$ is a hyperparameter of OpEvo that trades off exploitation and exploration.
Considering a regular random walk on an undirected graph, i.e., the case $q \to 1$, the probability of visiting a vertex of the graph is determined by the graph topology once the Markov chain induced by the random walk has converged. This is why random walks can be used for embedding graphs in many works [16]. The q-random walk distribution also inherits this topology-aware nature. Observing the vertices with the same distance to the start vertex in Figure 1, those with more complex neighborhoods have larger probability; among vertices at the same distance from the start vertex, the one with the larger degree has the larger probability. This property of the q-random walk distribution helps explore search spaces efficiently, because sampling vertices with more complex neighborhoods provides more knowledge about the objective function.
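Sampling from the q-random walk distribution does not require computing it in closed form; one can simply simulate the walk. A minimal sketch, assuming the feasible set's graph is given as an adjacency-list dict (a hypothetical representation, not the paper's data structure):

```python
import random

def sample_mutation(graph, start, q):
    """Draw one sample from the q-random walk distribution of Equation 3:
    starting from `start`, at every step the walk continues to a uniformly
    random adjacent vertex with probability q and stops with probability
    1 - q (or when the current vertex has no neighbors)."""
    v = start
    while random.random() < q and graph[v]:
        v = random.choice(graph[v])
    return v
```

With `q = 0` the walk never moves, so the parameter is returned unmutated; larger `q` spreads the samples farther from the start vertex, mirroring Figure 1.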
4.4 Summary
The OpEvo algorithm is summarized in Algorithm 1. At first, configurations are randomly generated and evaluated to initialize a priority queue ordered by decreasing fitness. Next, the top $\lambda$ configurations are taken from the queue as parents and recombined to generate children as described in Section 4.2. Then, each child is mutated as described in Section 4.3. Note that the same configuration is never sampled twice in the whole process of OpEvo, since the noise in the TFLOPS of executing a tensor operator on given hardware is relatively small and data efficiency benefits from non-duplicated samples. As a result, when a mutated child has already been sampled, we mutate it again until an unsampled configuration is obtained. Finally, the fitnesses of the children are evaluated on the target hardware, and the children are enqueued into the priority queue. This iteration repeats until some termination criteria are met.
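A compact sketch of this loop is given below, under the simplifying assumptions that `evaluate`, `random_config`, `recombine` and `mutate` are supplied by the caller and that repeated mutation eventually escapes the set of already-seen configurations; it stands in for Algorithm 1 rather than reproducing the released implementation:

```python
def opevo(evaluate, random_config, recombine, mutate, lam, budget):
    """Sketch of Algorithm 1. The population is kept as a list of
    (fitness, config) pairs sorted on demand, standing in for the
    priority queue; no configuration is ever evaluated twice."""
    seen, population = set(), []
    while len(seen) < lam:                      # random initialization
        c = random_config()
        if c not in seen:
            seen.add(c)
            population.append((evaluate(c), c))
    trials = len(seen)
    while trials < budget:                      # termination criterion
        population.sort(key=lambda t: -t[0])    # rank by fitness
        parents = [c for _, c in population[:lam]]
        fits = [f for f, _ in population[:lam]]
        child = mutate(recombine(parents, fits))
        while child in seen:                    # re-mutate duplicates
            child = mutate(child)
        seen.add(child)
        population.append((evaluate(child), child))
        trials += 1
    return max(population)                      # (best fitness, best config)
```

In a real deployment `evaluate` compiles the device code for a configuration and measures its TFLOPS on hardware, which is why the trial budget, not wall-clock time, is the natural termination criterion.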
5 Experiments
We now evaluate the empirical performance of the proposed method on three typical kinds of tensor operators, MatMul, BatchMatMul and 2D Convolution, on both Nvidia (GTX 1080Ti) and AMD (MI50) GPUs. All tensor operators in our experiments are described and generated with the TVM framework, and then compiled and run with CUDA 10.0 or ROCm 2.9. Additionally, we compare OpEvo with the three aforementioned SOTA methods, GBFS, NA2C and AutoTVM. In our experiments, OpEvo, GBFS and NA2C are implemented by ourselves on the Neural Network Intelligence framework (NNI, https://github.com/microsoft/nni/), and AutoTVM is implemented by its authors in the TVM project (https://github.com/apache/incubator-tvm). All code for OpEvo, GBFS, NA2C and our benchmarks is publicly available in the NNI project. Please refer to Appendix B for more details about the experiments.
MatMul. Three different MatMul operators are chosen from BERT to evaluate the proposed method. The maximum performance obtained so far versus the number of trials used is illustrated in Figure 2. The lines denote the averages of 5 runs, while the shaded areas indicate standard deviations. Different colors and line styles represent different algorithms. Table 1 shows the mean and standard deviation of the best TFLOPS obtained by each algorithm at the end of tuning. From the results, it can easily be concluded that the methods leveraging a predefined topology, OpEvo, GBFS and NA2C, significantly outperform the general SMBO method, AutoTVM. GBFS and NA2C leverage the underlying topology by introducing an MDP, so only the local topology is considered when exploring the configuration space, while OpEvo can consider the global topology thanks to its mutation operation based on the q-random walk distribution. Therefore, OpEvo performs better than GBFS and NA2C in most cases in terms of the mean and variance of the best TFLOPS and data efficiency.

BatchMatMul. There are also three BatchMatMul operators selected from BERT for evaluation. GBFS and NA2C are not applicable to these operators because they can only deal with matrices whose rows and columns are powers of 2. The comparison between OpEvo and AutoTVM is shown in Figure 3 and Table 2. Compared with MatMul operators, BatchMatMul has an order of magnitude larger search space, since one more parameter needs to be optimized. Also, the generated BatchMatMul device code is more likely to overflow the device memory, as the tile size of BatchMatMul is larger than that of MatMul, which leads to sparser performance measurements. Despite these challenges, OpEvo still performs well thanks to its global exploration mechanism. The variance of the best performance is even better than that for MatMul because of the sparsity of the objective function.
2D Convolution. Two two-dimensional convolution operators are chosen from AlexNet for evaluation. Because GBFS and NA2C cannot handle discrete and categorical parameters, they cannot tune convolution operators. Figure 4 and Table 3 show the comparison between OpEvo and AutoTVM. Compared with the tensor operators discussed before, the convolution operator has a more complex search space, which is harder to model. Although XGBoost, which AutoTVM uses to model the objective function, is a tree boosting model friendly to discrete and categorical parameters, AutoTVM still performs worse than OpEvo, because EA inherently supports complex search spaces and OpEvo further improves sample-efficiency by leveraging domain knowledge.
6 Conclusion
In this paper, we proposed OpEvo, a novel evolutionary method which can efficiently optimize tensor operators. We constructed topological structures for tensor operators and introduced a topology-aware mutation operation based on the q-random walk distribution, so that OpEvo can leverage the constructed topological structures to guide exploration. Empirical results show that OpEvo outperforms SOTA methods in terms of best performance, best-performance variance and data efficiency. This work also demonstrates that good leverage of proper prior assumptions about objective functions is the key to sample-efficiency, regardless of whether the method is model-based or model-free: even EA can beat SMBO in terms of sample-efficiency as long as proper prior assumptions are effectively leveraged. We note that the proposed method can not only be used to optimize tensor operators, but is also generally applicable to any other combinatorial search space with an underlying topological structure. We will investigate this in future work.
We would like to thank Lidong Zhou and Jing Bai for their helpful comments, and all contributors of NNI project for the powerful and userfriendly framework they created.
References
 Kirkpatrick et al. [1983] Scott Kirkpatrick, C Daniel Gelatt, and Mario P Vecchi. Optimization by simulated annealing. science, 220(4598):671–680, 1983.
 Bäck and Schwefel [1993] Thomas Bäck and Hans-Paul Schwefel. An overview of evolutionary algorithms for parameter optimization. Evolutionary computation, 1(1):1–23, 1993.
 Srinivas et al. [2009] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
 Hernández-Lobato et al. [2014] José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in neural information processing systems, pages 918–926, 2014.

 Wang and Jegelka [2017] Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 3627–3635. JMLR.org, 2017.
 Hutter et al. [2011] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pages 507–523. Springer, 2011.
 Bergstra et al. [2011] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for hyperparameter optimization. In Advances in neural information processing systems, pages 2546–2554, 2011.
 Chen et al. [2018a] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 578–594, Carlsbad, CA, October 2018a. USENIX Association. URL https://www.usenix.org/conference/osdi18/presentation/chen.
 Chen et al. [2018b] Tianqi Chen, Lianmin Zheng, Eddie Yan, Ziheng Jiang, Thierry Moreau, Luis Ceze, Carlos Guestrin, and Arvind Krishnamurthy. Learning to optimize tensor programs. In Advances in Neural Information Processing Systems, pages 3389–3400, 2018b.
 Chen and Guestrin [2016] Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794. ACM, 2016.
 Tai et al. [2015] Kai Sheng Tai, Richard Socher, and Christopher D Manning. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075, 2015.
 Zhang et al. [2019] Huaqing Zhang, Xiaolin Cheng, Hui Zang, and Dae Hoon Park. Compilerlevel matrix multiplication optimization for deep learning. arXiv preprint arXiv:1909.10616, 2019.
 Kramer [2016] Oliver Kramer. Machine learning for evolution strategies, volume 20. Springer, 2016.
 Hansen and Ostermeier [1996] Nikolaus Hansen and Andreas Ostermeier. Adapting arbitrary normal mutation distributions in evolution strategies: The covariance matrix adaptation. In Proceedings of IEEE international conference on evolutionary computation, pages 312–317. IEEE, 1996.
 Wierstra et al. [2008] Daan Wierstra, Tom Schaul, Jan Peters, and Juergen Schmidhuber. Natural evolution strategies. In 2008 IEEE Congress on Evolutionary Computation (IEEE World Congress on Computational Intelligence), pages 3381–3387. IEEE, 2008.
 Perozzi et al. [2014] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. Deepwalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 701–710, 2014.
 Mathieu et al. [2013] Michael Mathieu, Mikael Henaff, and Yann LeCun. Fast training of convolutional networks through ffts. arXiv preprint arXiv:1312.5851, 2013.
 Vasilache et al. [2014] Nicolas Vasilache, Jeff Johnson, Michael Mathieu, Soumith Chintala, Serkan Piantino, and Yann LeCun. Fast convolutional nets with fbfft: A gpu performance evaluation. arXiv preprint arXiv:1412.7580, 2014.

 Lavin and Gray [2016] Andrew Lavin and Scott Gray. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4013–4021, 2016.
Appendix A Proof for the q-Random Walk Distribution
In this section, we formally prove that the q-random walk distribution defined in Section 4.3 is a valid probability distribution over a finite undirected graph $G = (V, E)$. First, we denote the probability that the q-random walk reaches each vertex at time step $t$ as a vector $\mathbf{p}_t$, so $\mathbf{p}_0$ is a one-hot vector whose component corresponding to the start vertex is $1$ while the others are $0$. Further, the transition matrix $\mathbf{Q}$ determined by Equation 3 and the graph $G$ satisfies $\mathbf{p}_{t+1} = \mathbf{Q}\mathbf{p}_t$ by the definition of the q-random walk. Note that the summation of each column of $\mathbf{Q}$ is $q$, because it is the summation of the transition probability over all adjacent vertices.

Lemma 1. $(\mathbf{I} - \mathbf{Q})^{-1}$ exists.
Proof. Suppose that $\mathbf{I} - \mathbf{Q}$ is not invertible. Then there exists a nontrivial $\mathbf{x}$ such that $(\mathbf{I} - \mathbf{Q})\mathbf{x} = \mathbf{0}$, so $\mathbf{Q}\mathbf{x} = \mathbf{x}$. This shows that $1$ is an eigenvalue of $\mathbf{Q}$, thus $\rho(\mathbf{Q})$, the spectral radius of $\mathbf{Q}$, satisfies $\rho(\mathbf{Q}) \geq 1$. However, since every column of $\mathbf{Q}$ sums to $q$, we have $\rho(\mathbf{Q}) \leq \|\mathbf{Q}\|_1 = q < 1$, which leads to a contradiction, so $(\mathbf{I} - \mathbf{Q})^{-1}$ exists. ∎

Lemma 2. The summation of each column of $(\mathbf{I} - \mathbf{Q})^{-1}$ is $\frac{1}{1-q}$.
Proof. Suppose $\mathbf{1}$ is a vector with all $1$s as its entries. The fact that the summation of each column of $\mathbf{Q}$ is $q$ implies $\mathbf{1}^{\top}\mathbf{Q} = q\,\mathbf{1}^{\top}$, so $\mathbf{1}^{\top}(\mathbf{I} - \mathbf{Q}) = (1 - q)\,\mathbf{1}^{\top}$ and therefore $\mathbf{1}^{\top}(\mathbf{I} - \mathbf{Q})^{-1} = \frac{1}{1-q}\,\mathbf{1}^{\top}$. ∎
Theorem 1. The q-random walk distribution is a valid probability distribution over $G$.
Proof. The probability vector that the q-random walk stops at each vertex at time step $t$ is $(1 - q)\,\mathbf{Q}^{t}\mathbf{p}_0$, thus the probability vector $\boldsymbol{\pi}$ describing the q-random walk distribution is

(4)  $\boldsymbol{\pi} = (1 - q)\sum_{t=0}^{\infty}\mathbf{Q}^{t}\mathbf{p}_0$

Left-multiplying $\mathbf{Q}$ on both sides of the above equation yields

(5)  $\mathbf{Q}\boldsymbol{\pi} = (1 - q)\sum_{t=1}^{\infty}\mathbf{Q}^{t}\mathbf{p}_0$

Subtracting Equation 5 from Equation 4 yields $(\mathbf{I} - \mathbf{Q})\boldsymbol{\pi} = (1 - q)\,\mathbf{p}_0$, i.e., $\boldsymbol{\pi} = (1 - q)(\mathbf{I} - \mathbf{Q})^{-1}\mathbf{p}_0$. By Lemma 2, the entries of $\boldsymbol{\pi}$ sum to $(1 - q) \cdot \frac{1}{1-q} \cdot \mathbf{1}^{\top}\mathbf{p}_0 = 1$. Further, since $1 - q$ is positive, all entries of $(\mathbf{I} - \mathbf{Q})^{-1} = \sum_{t=0}^{\infty}\mathbf{Q}^{t}$ are non-negative and $\mathbf{p}_0$ is a one-hot vector, all entries of the vector $\boldsymbol{\pi}$ are non-negative and add to $1$. Therefore, the q-random walk distribution is a valid distribution over a finite undirected graph $G$. ∎
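The closed form $\boldsymbol{\pi} = (1 - q)(\mathbf{I} - \mathbf{Q})^{-1}\mathbf{p}_0$ can also be checked numerically. The snippet below is an illustrative check on a 3-vertex path graph, not part of the paper's code:

```python
import numpy as np

# Verify that pi = (1 - q) * (I - Q)^{-1} p0 is a probability distribution
# on the path graph 0 - 1 - 2, with mutation rate q = 0.5.
q = 0.5
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # adjacency matrix of the path
deg = A.sum(axis=0)
Q = q * A / deg                          # column j sends q / |N(j)| to each neighbor
p0 = np.array([1.0, 0.0, 0.0])           # walk starts at vertex 0
pi = (1 - q) * np.linalg.solve(np.eye(3) - Q, p0)

assert np.isclose(pi.sum(), 1.0)         # Theorem 1: entries sum to 1
assert pi[0] > pi[1] > pi[2]             # nearer vertices get more mass
```

Every column of `Q` sums to `q`, matching the transition matrix in the proof, and the resulting `pi` is non-negative and sums to one, as Theorem 1 asserts.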
Appendix B Experiment details
In this section, we describe the details of the experiments shown in Section 5. For all experiments involving GBFS and NA2C, we follow Reference [12], which proposed these two methods, in setting the random selection parameter of GBFS and the maximum number of search steps of NA2C, and we set the initial state of both methods to the configuration without multi-level matrix tiling. For all experiments involving AutoTVM, we use the default settings of the TVM project. As for OpEvo, the parent size and the offspring size are set to the same value, and the mutation rate $q$ is kept fixed across all experiments. We run each algorithm 5 times for each operator.
b.1 MatMul
MatMul is one of the most basic yet crucial tensor operators in deep learning as well as in other applications. It multiplies two input matrices $A$ and $B$ to produce an output matrix $C$ by computing $C_{i,j} = \sum_{l} A_{i,l} B_{l,j}$ for all elements of $C$. The execution efficiency of MatMul highly depends on the cache hit rate of memory access, since the operand matrices are usually too large to cache.
Matrix tiling is a popular solution to this problem. It iteratively splits computation into smaller tiles to adapt memory access patterns to a particular hardware. In this experiment, following the built-in schedule policy provided by TVM (see https://github.com/apache/incubator-tvm/blob/v0.6/topi/python/topi/cuda/dense.py for more details), we factorize the two output dimensions $m$ and $n$ into four factors each. These factors determine the shape of a basic tile, the number of thread blocks, the number of threads per block, and the number of basic tiles computed per thread; in other words, the factorization of $m$ and $n$ governs the computational granularity. Further, the reduction dimension $k$ is split into three factors corresponding to the three stages of data transfer among the three levels of the GPU memory hierarchy: global memory, shared memory and vector general purpose registers (VGPRs). That is to say, the factorization of $k$ controls the data granularity. In short, the search space for optimizing MatMul operators is composed of three factorization parameters.
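The size of each factorization parameter's feasible set follows from applying stars and bars to the prime factorization of the loop length, since each prime's exponent is distributed among the factors independently. A quick illustrative calculation (with a hypothetical helper name and an example loop length chosen for illustration):

```python
from math import comb

def num_factorizations(prime_exponents, k):
    """Number of ordered factorizations of n = prod p_i^{e_i} into k factors:
    each prime's e_i copies are distributed over k slots (stars and bars)."""
    out = 1
    for e in prime_exponents:
        out *= comb(e + k - 1, k - 1)
    return out

# e.g. splitting a loop of length 1024 = 2^10 into 4 tiling factors:
num_factorizations([10], 4)   # 286 candidate tilings for this one parameter
```

Multiplying such counts across the three factorization parameters shows how quickly the MatMul search space grows.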
Three MatMul operators are selected from BERT for evaluation; Table 1 lists the shapes of the two multiplied matrices for each of them.
b.2 BatchMatMul
BatchMatMul is another important tensor operator commonly used in deep learning models. It takes two tensors $A$ and $B$ as inputs and outputs a tensor $C$ by computing $C_{b,i,j} = \sum_{l} A_{b,i,l} B_{b,l,j}$ for all elements of $C$. Very similar to MatMul, matrix tiling is essential for the execution efficiency of BatchMatMul. Besides the factorization of $m$, $n$ and $k$, we also factorize the batch dimension, so there are four factorization parameters to be optimized for BatchMatMul operators.
There are also three BatchMatMul operators selected from BERT for evaluation; Table 2 lists the batch size, the shapes of the multiplied matrices, and which of the inputs are transposed for each of them.
b.3 2D Convolution
Without any exaggeration, convolution is the heart of modern computer vision, so almost all vision-related applications can benefit from speeding up the execution of convolution operators. A two-dimensional convolution with stride $s$ and padding $p$ takes an image tensor $X$ of shape $(N, C, H, W)$ and a kernel tensor $K$ of shape $(F, C, k_H, k_W)$ as input, and outputs a tensor $Y$ of shape $(N, F, H', W')$, where $H' = \lfloor (H + 2p - k_H)/s \rfloor + 1$ and $W' = \lfloor (W + 2p - k_W)/s \rfloor + 1$. Each element of $Y$ can be obtained by

(6)  $Y_{n,f,y,x} = \sum_{c=0}^{C-1} \sum_{i=0}^{k_H-1} \sum_{j=0}^{k_W-1} X_{n,c,\,ys+i-p,\,xs+j-p} \, K_{f,c,i,j}$

where out-of-range elements of $X$ are treated as zero.
It should be noted that there are also other methods of calculating convolution, such as FFT convolution [17, 18] and Winograd convolution [19]. In this work, all convolution operators are based on the direct convolution described by Equation 6.
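As a concrete reference for Equation 6, the sketch below implements direct convolution naively, simplified to a single input channel and a single filter (an illustration only; the actual device code is generated by TVM):

```python
def conv2d_output_shape(H, W, kH, kW, stride, pad):
    """Output spatial shape of a direct 2D convolution.
    e.g. AlexNet's first layer: conv2d_output_shape(224, 224, 11, 11, 4, 2) -> (55, 55)"""
    return ((H + 2 * pad - kH) // stride + 1,
            (W + 2 * pad - kW) // stride + 1)

def conv2d(X, K, stride=1, pad=0):
    """Naive direct convolution per Equation 6 for one image and one
    output channel; X is an H x W nested list, K is kH x kW."""
    H, W, kH, kW = len(X), len(X[0]), len(K), len(K[0])
    Ho, Wo = conv2d_output_shape(H, W, kH, kW, stride, pad)
    Y = [[0.0] * Wo for _ in range(Ho)]
    for y in range(Ho):
        for x in range(Wo):
            for i in range(kH):
                for j in range(kW):
                    h, w = y * stride + i - pad, x * stride + j - pad
                    if 0 <= h < H and 0 <= w < W:   # zero padding
                        Y[y][x] += X[h][w] * K[i][j]
    return Y
```

Real direct-convolution kernels tile exactly these loops, which is why the schedule in the next paragraph exposes so many factorization parameters.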
Following the TVM built-in schedule policy (see https://github.com/apache/incubator-tvm/blob/master/topi/python/topi/cuda/conv2d_direct.py for more information), we split the output channel dimension and the two output spatial dimensions into four factors each for computational granularity, and split the input channel dimension and the two kernel dimensions into two factors each for data granularity. Besides these six factorization parameters, there is one categorical parameter, a binary choice controlling whether to emit the macro #pragma unroll explicitly in the source code as a hint to the downstream compiler, and one discrete parameter, which controls the maximum number of unrolling steps in the TVM code generation pass.
Two 2D convolution operators are selected from AlexNet for evaluation; Table 3 lists the image shape, kernel shape, stride and padding for each of them.