1 Introduction, Motivation, and Setup
It is well-known that the computationally demanding routines driving the state-of-the-art in domains such as dense linear algebra and deep learning are all based on carefully hand-optimized and tuned libraries developed with significant engineering effort and insights over time. The techniques and the level of attention necessary to optimize for near-peak performance on a target architecture makes the process often inaccessible to those without a deep knowledge of code optimization and low-level interactions with hardware. In many cases, the best performing code comes from the hardware vendors themselves. Yet, fortunately, there are published works such as those of Goto and Van de Geijn[goto2008toms] that have described in great detail how such close to peak performance could be obtained. Subsequent works [vanzee2015toms]
made the process more modular, reusable, and accessible to a wider audience, having translated to an open-source projectFLAME/BLIS.
This article alludes to the fact that this process could potentially be made even more modular, automatable and systematic — by hosting it on top of a real compiler IR that has the necessary abstractions and infrastructure. This completely avoids the need to write any code by hand — be it C or inline assembly. The IR infrastructure that will be used here is of course, MLIR [mlir2020arxiv, mlir-web] and we will try to recreate the OpenBLAS/BLIS’ optimization approach in a compiler-oriented manner using MLIR. MLIR is a new intermediate representation designed to provide a unified, modular, and extensible infrastructure to progressively lower dataflow compute graphs, through loop nests potentially, to high-performance target-specific code. While MLIR shares similarities with traditional CFG-based three-address SSA representations (including LLVM IR and Swift intermediate language), it also introduces notions from the polyhedral compiler framework as first class concepts to allow powerful analysis and transformation in the presence of loop nests and multi-dimensional arrays. It is thus a useful infrastructure suitable for developing new compilers, whether general-purpose or domain-specific, and whether targeting general-purpose multicores or specialized accelerator chips.
We are going to be using an Intel Skylake-based high-end desktop/workstation processor for all experimentation. The processor is an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz, which is actually based on Coffee Lake
, a process refinement of Skylake, and thus with the same core/performance characteristics. Note that although this is based on the Skylake microarchitecture, it is not a SkyLake-X: as such, its vectors are not AVX-512 but just AVX-2 (256-bit). It has a 32 KiB L1 data cache and a 256 KiB L2 unified cache per core, and a 12 MiB shared L3 cache (2 MiB / core).
Before we start, we are going to compute the machine peak on this processor. This CPU supports AVX-2 and has two FMA units that can operate on 256-bit wide vectors. As such, one can perform double-precision floating-point multiply and adds per cycle, which at the max turbo boost frequency of 4.7 GHz translates to 75.2 GFLOPS per core. (Frequency scaling can impact how this figure is calculated, but we will ignore that for now.)
1.2 Running example: DGEMM
Matrix-matrix multiplication is often an excellent routine for a tutorial or exercise in code optimization because a number of standard practices could be demonstrated using it. It is also one of the most important ones for many domains and thus often the first thing one builds an optimized implementation for when rolling out a new architecture.
Let us take a 2088x2048 double precision matrix-matrix multiplication, and benchmark it in various ways. We choose 2088 here since it is divisible by both 4 and 3: this makes it easier to read some of the snippets given that we will be exploring register tiling by sizes that are multiples either of 2, 3, or 4.
1.3.1 Compilers with -O3, Pluto
Let us see what current compilers do if this is written as a naive nest in C. These results can be reproduced with the setup in Pluto from under examples/matmul/. (We use -ffast-math along with -O3 to be fair given the other things we are going to compare this with.)
Disappointingly, clang and GCC are at 0.6% and 6% of the machine peak respectively! But general-purpose compilers are not expected or meant to get anywhere to the peak. Programmers instead typically use highly tuned libraries. We are going to get to that shortly, but a quick note on why Clang performs poorly here. Clang only performs innermost loop vectorization, which is really a bad choice here (due to poor locality) when that’s the only thing being done. Even a textbook loop interchange here from improves Clang performance by about 10x (because now the right loop would be at the innermost level for both locality and auto-vectorization). Clang does not perform any unroll-and-jam here either. While on this let us also see what a polyhedral tool Pluto, a source to source translator, does on this. The below run shows that it is at about 25% of the machine peak, which is better but also pretty unsatisfying.
Now, let us see what the fastest libraries in the world do on matmul. Again, the setup in Pluto allows experimenting with OpenBLAS, BLIS, or MKL quickly.
MKL and OpenBLAS are nearly at 92% of the peak (68/75.2), while BLIS is close at 85% of the peak. We will keep these in mind as we try out how to build high-performance versions with MLIR. Note: we haven’t been very rigorous with the above experimentation in terms of performing warmup runs, taking an average across repetitions etc., but our goal is to just get a reasonably accurate picture for now.
2 Using MLIR to Optimize GEMM
There is currently no C/C++ or another language/programming model frontend that emits MLIR. The way to get something in MLIR run on CPUs is through mlir-cpu-runner which can take MLIR as input and JIT compile and execute it. As part of this process, optimized MLIR is converted to LLVM IR, which then goes through its optimization pipeline to generate target code, all of this through mlir-cpu-runner.
2.1 Affine Ops
Here is how a simple canonical matrix-matrix multiplication looks like in MLIR – the structure is close to its form in C.
affine.for and affine.load/store are ops from the Affine dialect [affine-dialect] used above. The IR above also uses a helper op from the LinAlg dialect to initialize matrices. The rest like (addf, mulf, alloc) are all ops from MLIR’s standard dialect.
A memref [memref]
in MLIR is the in-memory representation of a tensor in MLIR. Depending on its layout map, it could refer to a potentially non-contiguous sub-region of the underlying buffer. All memrefs in the above snippet have the default identity layout map, which corresponds to a row major layout in memory. 2088 x 2048 is the shape of memref where each of its elements is an f64 (double). There is other information such as an affine layout map (a mapping for its logical coordinate space to physical memory) that is elided when these are identity maps (the default). We will look at an example of this a little later (Section 2.9). The only way to access the elements of a memref is through load and store operations.
2.3 A high-level domain-specific op that can expand
MLIR is extensible in several ways, and one of them is in the ease of adding/creating new ops at the level of abstraction we desire. One way of classifying ops or IR in general in MLIR is into high-level, mid-level, and low-level, although there is not a clear distinction and lowering can be progressive and you can have a mix. High-level ops operate on tensor and memref types themselves, i.e., their inputs (operands) and results are of that type. affine.for and affine.if are mid-level ops – these have nested blocks (which are themselves a list of ops) and a particular structure to them. Low-level ops are at the same level as instructions in LLVM, and these typically correspond to either one or a flat sequence of instructions matching a target (three address code style in other words).
If one is building a DSL and knows that they need or have a matmul, one could just define an op that does the matmul taking memrefs as operands. For the purpose of this article, we will create a hop.matmul op. An operation in MLIR has inputs, results, and attributes (ones with regions have region arguments as well).
hop.matmul above is an operation that operates on three memrefs, %A, %B, and %C, which are the LHS, the RHS, and the output corresponding to a matrix-matrix multiplication. This op has zero results: since these are memrefs, the underlying memory can be updated, but the memref itself is an SSA value and thus can be defined only once. In this case, %C will be both read and written by this heavy op. some_attr is an attribute of the op, which like with LLVM metadata is meant to be a constant of one of the numerous types available including some polyhedral types like affine maps and integer sets. Such polyhedral attributes could be used to encode powerful information and we will see an example of this later.
For the purpose of this article, we create an -hopt pass that is able to expand out hop.matmul into the naive 3-d loop nest we showed further above. Let us just execute this with MLIR without any optimization to see where we are starting our journey from.
This is pretty close to the execution time of ‘clang -O3’ that we saw above, which makes sense, but this is equally slow (less than 1% of the machine peak)! But we have not really made use of or run any of the optimization infrastructure in MLIR. The -hopt here just lowers this op into the canonical 3-d loop nest we showed further above, which is then lowered to LLVM and goes through the same LLVM -O3 pipeline that clang went through.
2.4 Tiling for caches: OpenBLAS/BLIS tiling as a polyhedral schedule
Tiling and blocking has been studied for several decades, and it is now known how matrix-matrix multiplication should be tiled for high performance on widely used architectures. We are going to use the same cache tiling strategy used in OpenBLAS and BLIS. [This scheme](https://dl.acm.org/citation.cfm?id=1356053) is a very clever one that carefully tiles for multiple levels of the cache in a way that exploits reuse for each of the matrices involved at one level or another: the objective is to make sure the vector FMA/add mul pipeline remains full and one does not wait for loads. It does not blindly tile the nest multiple times to exploit reuse for all matrices at L1, L2, and L3, as would most polyhedral tools do.
There are various ways of describing OpenBLAS’s/BLIS’ tiling scheme and they are explained in excellent detail with illustration in the papers by Van Zee and van de Geijn on BLIS [vanzee2015toms], and later with analysis and modeling by Low et al. 2016 [low2016toms]. But here, we will describe the tiling transformation compactly via polyhedral schedules. For an introduction to polyhedral schedules, see here [uday-poly-intro] or the material on polyhedral.info. In particular, the tiling and corresponding intra-tile loop orders used in the OpenBLAS/BLIS approach can be expressed as the following polyhedral schedule (a multi-dimensional affine map / transformation):
where ‘/’ is a floor division. The innermost two loops after applying the above schedule are meant to be fully unrolled leading to an x loop body. Ignoring the last level of tiling (for the L3 cache), this becomes:
Figure 1 from the BLIS works is a great illustration of its optimization strategy:
The parameters (, ) are chosen such that a tile of the LHS matrix (%A) of size x is reused in the L2 cache, an RHS matrix tile of size x is reused in the L1 cache, while a register tile of the output matrix x is reused in just the registers. There are several other considerations in choosing these parameters if one is interested in analytical modeling; those are described in the work of Low et al. [low2016toms]. The last three dimensions of the schedule are multiplying a panel of %A of size x with a panel of %B of size x . Note that %C is being both read and written, and intuitively, it makes sense to exploit reuse for it in the registers while running a large enough k intra-tile loop at the innermost level. In the schedule above, the innermost two dimensions (j, i) are fully unrolled, i.e., you get an unrolled body of size x times the original one. For readers seeking more detail in general on this and on programming for high performance, please see course notes [flame-laff] from the BLIS team members.
2.5 Tiling in MLIR
There are two ways of achieving this tiling in MLIR: one by calling [mlir::tile] (which is also what the loop tiling pass in MLIR uses), and then performing the desired loop interchange via mlir::interchangeLoops The other is by implementing a higher order polyhedral (HOP) approach based on domains and schedules. We use this latter approach here since MLIR’s affine analysis machinery does not yet have the necessary simplification to get rid of certain redundant bounds resulting from tiled code generation in advanced cases. The HOP approach depends on an external library, [ISL](http://isl.gforge.inria.fr), and we implement this as part of the -hopt pass. We express the tiling schedule as an MLIR affine map (in fact, any affine schedule could be used), perform the code generation via ISL, and convert the ISL AST back to MLIR. We will now use , , , as attributes. on the hop.matmul op.
We have used = 256, = 64, = 4, = 8 as a starting point: these can be analyzed to be good values given the cache size constraints described earlier. This is functionally equivalent to using the following schedule with HOP and fully unrolling the last two dimensions, d1, d0, which will be of trip counts 8 and 4 respectively):
In order to just evaluate cache tiling for now, we will just use the following schedule (drop the tiling for registers):
Now, with the above tiling, we get this loop nest:
Note that the invariant load on %B has been hoisted out. When we execute this:
This is nearly a 3x improvement in performance, but still nowhere close to the machine peak or the performance of MKL, OpenBLAS or BLIS!
2.6 Explicit Copying or Packing
Whenever a code with multi-dimensional arrays exhibits reuse, explicit copying or packing is a commonly used technique in HPC, which first copies or packs accessed data into contiguous buffers and indexes those buffers for computation. Such copying reduces or nearly eliminates conflict misses, TLB misses, and improves hardware prefetching performance (since the access will lead to fewer prefetch streams). More importantly, in conjunction with tiling, it provides the intended benefits of tiling. On the one hand, tiling is performed so that reuse is exploited in multiple directions when the data accessed fits in a higher level of the memory hierarchy, on the other hand, the data accessed for a tile is no longer contiguous in the original matrix/tensor leading to conflict misses, TLB misses, and more prefetch streams — taking away a lot of the gain even if there is high reuse. There is also a choice in terms of which depth / where one would like to introduce the copies, but this is quite clear for the tiling strategy we are using.
The packing optimization is something that one would really wish a compiler performs automatically. MLIR has a pass -affine-data-copy-generate and a utility mlir::affineDataCopyGenerate that can perform explicit copying or packing. The utility can also be called on a specific memref to perform the packing at a specific loop depth, and with a number of other options (including DMA support).
With the tiling approach we are using here, packing is performed for the LHS matrix %A right under the second loop ( dimension in the schedule). For %B (RHS matrix), the copying is performed right under j floordiv (the third dimension). Figure 2 illustrates the packing needed.
To perform the packing automatically with MLIR, instead of calling the -affine-data-copy-generate pass, we will just use the underlying utility [mlir::affineDataCopyGenerate] to place copies at the right depths. For the purpose of this article, we enabled this under the -hopt-copy option. With -hopt-copy, now the nest looks like:
This is nearly a 1.5x improvement in performance, but in absolute terms still pretty bad. On a general note, the benefit of any optimization often depends on how well optimized the code is on other aspects before the optimization was applied – even more so in this case. It’s important to be mindful of this in order to not draw incorrect conclusions on the relative benefits of each individual transformation at this time. For example, the benefits of cache tiling on a code where unroll-and-jam or register tiling has already been performed is significantly lower than on a version where no optimizations were performed. We will look at an example a little later.
2.7 Unroll and Jam
We will now use the complete schedule that has the and register tiling so that an unroll-and-jam of the innermost two loops by and respectively could be achieved, but we actually didn’t enable that yet. We use the option -hopt-unroll to turn this on and off. We also run scalar replacement in MLIR post unroll-and-jam. Besides turning memref locations being reduced into scalars (single element memrefs) and hoisting them out, it also eliminates redundant loads, and hoists invariant loads out of loops.
While at unroll-and-jam, we will also add one more parameter to the list, , which will be the unroll factor for the loop (intra-tile loop corresponding to k reduction dimension). We use = 4 to unroll the k loop (which will end up as the innermost loop post all register tiling). So, we have:
A code is only as fast as its weakest link. While the cache tiling really optimized reuse for the LHS and RHS, the strategy used by OpenBLAS and BLIS exploits reuse for the output matrix only in registers (not in L1 nor L2). As such, without unroll-and-jam aka register tiling, brakes had been applied. So, this last step opened the floodgates here yielding a 10x improvement and getting us to 23 GFLOPS.
Let us go back a step and check what would happen if we disabled packing while performing unroll-and-jam.
We lose over 2x of the performance if we do not perform packing here! And if we had not performed the innermost loop unrolling (i.e., set = 1), we get:
All of these improvements are significant from where we started but they are all finally not enough at all on an absolute scale — we are still at about 34% of the peak even after having reproduced a good part of the OpenBLAS/BLIS recipe! Clearly, there is something orthogonal that all of this is missing. It looks like we completely missed vectorization! The highly tuned library kernels that we have set as a target use a careful selection and schedule of vector instructions in inline assembly.
Let us see whether our code is even being vectorized reasonably under the hood by LLVM. It’s pretty straightforward to check this. Instead of looking at the pass output remarks, we will just look at the generated assembly. The ’-dump-object-file -object-filename’ options of mlir-cpu-runner are useful for this purpose.
This is really not the code we would like to see in the innermost loop! Although the code is vectorized, it is using XMM registers (128-bit) for a good part instead of YMM ones (256-bit). More importantly, the FMAs are regularly accompanied by loads around them and neither should this have been necessary nor is it any good for performance! A good inner loop here is expected to only have one load of the LHS and RHS every and FMA ops, roughly speaking, because that’s the amount of reuse we should be getting. In addition, there should be no loads or stores for the output matrix in the innermost loop, which happens to be the case here. And finally, it all should have been %ymm.
It is also important to note that vectorization is often nearly useless unless the code has been optimized for locality, (i.e., we should be getting data from caches at least), and it is in fact still not as useful unless we are not reading from registers most of the time. We will not dig and analyze any further here to figure out what went wrong with clang/LLVM and why (this is actually due to a sub-optimal vectorization choice)! Instead, we will see how we could get the vectorization done in MLIR itself where we have all the information and opportunity.
MLIR supports vector types and vector types as element types for memrefs. Unlike in LLVM, vector types in MLIR can be multi-dimensional. However, we only need 1-d vector types here. Loop vectorization in MLIR would just lead to IR that turns memrefs of f64 into memrefs of vector of f64 besides transforming loops/loop bodies. A casting op that changes the elemental type of a memref, and the splat op are also needed here. In more complex cases, the view/subview ops are also needed. But for this benchmark, the vectorization needed is quite straightforward, and is a simple outer loop vectorization along the j loop.
The existing “super vectorizer” in MLIR is really not functional or complete for the auto-vectorization we need. For this article, we build and use a new loop vectorizer (enabled by -hopt-vect or via -affine-vectorize when running separately). We introduce a new memref_shape_cast op which is needed to change the elemental type on a memref. Here’s how the vectorized MLIR looks like if we started with the naive matmul nest.
The above IR is generated by vectorizing along the loop, which is a data parallel loop as opposed to a reduction dimension). The memref_shape_cast op casts a memref to one of another shape as long as the product of the dimension sizes all the way up to any aggregate elemental types remain the same. memref_shape_cast actually does not exist in MLIR trunk. Also, we do not discuss the case of trip counts not being a multiple of vector size here since some of the infrastructure around it in MLIR is still being built. Let us look at the performance with vectorization and with nothing else enabled.
2.8.1 Rewind and measure with vectorization
Let us rewind and see how the vectorized code performs step by step, with tiling, with packing, and with register tiling / unroll-and-jam.
As mentioned earlier, simply vectorizing without worrying about memory bandwidth gets us nowhere. We have now broken that barrier, and can go further here.
As observed earlier, once again, the last step has opened the floodgates here yielding a 4.5x improvement and getting us to 49 GFLOPS – only 23% away from BLIS and about 27% away from MKL or OpenBLAS.
Let us see how the MLIR looks like right after vectorization, tiling, copying, and unroll-and-jam (we will disable the loop unroll, i.e., set = 1 to make it easier to read; the unroll-jamming of is still shown).
The memref<1 x f64> values (single element memrefs) that we see below are a result of the scalar replacement pass (-affine-scalrep) that runs after the unroll-and-jam. We see eight of these summing up to 32 elements of the output matrix, and these are held in registers when the innermost loop (k) is iterating. LLVM’s passes later turn these into virtual registers.
Both BLIS and OpenBLAS use inline assembly microkernels and other hand written components that are composed in a modular/reusable way. On the other hand, we have been trying to generate all our code all the way down, including using LLVM’s instruction selection, scheduling, and register allocation. Note that we have got here through the hop.matmul with the values of , , , , indicated below:
Before we look at adjusting these for maximum mileage, we will take a detour to memref layouts.
2.9 A Quick Detour into Affine Map Layouts
We are going to take a quick detour here to understand how a memref’s layout map [memref] works. A memref type has three pieces of information: its shape, a affine layout map, and a memory space.
This is a memref of shape 64x256, and with the identity map (d0, d1) -> (d0, d1) which corresponds to a row-major layout in memory. 64 rows and 256 columns, and each of its elements is a float. An element %A[%i, %j] will map to element () in the buffer – in this case all elements are contiguous. ’0’ is the memory space the memref lives in. On architectures that have different kinds of explicitly addressable memory, a different memory space id is assigned to each. Whenever the layout map is identity, its printing is elided; similarly, in the case of memory space 0.
Now, let us look at the memref corresponding to the buffer for the LHS matrix (%A), which is of type memref<64x256xf64>.
This block is being multiplied with the RHS panel of type
memref<256x2xvector<2xf64>>. The figure above shows how the elements of
A block get traversed: along columns of height all the way up to
columns, and then turn around for the next panel, until one has processed all
/ panels. Note that the elements are not being accessed contiguously.
But since the block of A is L2 cache resident and is just streamed through L1,
we still get both spatial reuse for it in L1 as well as L2, i.e., it is just
that the reuse in space for A does not happen immediately but once you’ve
traversed the height. And we know per this whole approach that it still
stays in cache. However, there is no harm (and it can only potentially get
better) if we ensure that A is accessed contiguously. For eg., it can only help
with prefetching into L1. Hence, we would like to pick a layout different from
the default row-major for memref<64x256xf64>, and MLIR’s affine layout
maps make it convenient to express the desired layout here in a compact and
powerful way. If we simply change the memref type to:
memref<64x256xf64, (d0, d1) (d0 floordiv , d1, d0 mod )>
it yields the desired layout. The rest of the IR does not need a single change! The mapping specifies that a in the logical access space should be mapped to (d0 flooridv , d1, d0 mod ) in a physical (contiguous) buffer of size 16x256x4. When mlir::normalizeMemRef runs, it will turn this memref into:
And an access Abuf[%i, %j] is remapped to Abuf[%i floordiv 4, %j, %i mod 4]. This is an example of a tiled layout, akin to how iteration space tiling is performed – stripmine and interchange – but on the data space.
As another example, here’s a memref that accesses a non-contiguous sub-space of its underlying buffer. memref<64x256xf64, (d0, d1) -> (d0, d1 floordiv 6, d1 mod 6)>
Note that we would have a “padding” worth 2 elements (256 % 6) at the end of each row that can never be accessed via this memref while staying in bounds. Another way to write this is:memref<64x256xf64, >
A memref with access strides say 128, 2 (for major, minor respectively) in a larger underlying buffer can be expressed for example as:memref<126x100xf32, >.
More examples can be found here.
The copy options supplied to mlir::affineDataCopyGenerate allows one to choose a custom data layout for the buffer (being copied into/from). One can choose any layout map as long as it is injective: any injective layout will lead to semantically correct code.
We will see that this layout is only needed for an experiment later (Section 2.12). We will now get back to adjusting parameters to go full throttle.
2.10 Tweaking , , , to maximize reuse
Using a different setting of these parameters is as simple as changing the attributes on the hop.matmul op. Note that although one can tune these parameters on BLIS, its ASM kernel has to be among those that shipped — these are long hand-written / hand-unrolled ones. However, with MLIR we are in a position to also play around with the register tile sizes just as easily. Let us pick a different set of values of , , , and for better utilization of registers and L2 cache capacity as far %A’s buffer goes. We’ll set to 3 and to 16 since this uses 12 vector registers for %C instead of the under utilization of 8 earlier (there are a total of 16 %ymm registers on x86-64).
When one chooses = 3 and = 16, it leads to 3 * 16 /4 = 12 256-bit vector registers for the output values, 3 values for the LHS, and 1 for the RHS, which amounts to 12 + 3 + 1 = 16 registers — exactly the number of VEX encodable registers for AVX-2. The BLIS provided kernels for Haswell otherwise include , , , – they too use ( =) 12 registers for the output, but these options differ in how much register reuse is relatively exploited along the two dimensions and how big the L1 resident buffer for the RHS is. Let us execute the = 3, = 16 configuration.
We immediately see a 24% boost in performance! A big strength of a pure compiler approach here is that one automatically generates everything – all the way down to the innermost loop body. So, we have the opportunity to explore more than what’s possible with a fixed set of manually pre-optimized kernels. This is all under the assumption that the compiler generated code is competitive or could be made competitive. Let us now explore the impact of just the , values around the best configuration we have found.
Let us now peek at the generated assembly of the innermost loop to see how good it is. A necessary thing is that the innermost block should have no load/store’s on the output (everything should be in registers) and we should have a continuous sequence of VFMA instructions on 256-bit AVX registers which are named %ymm[0-15].
This looks as good as we may expect! The output matrix values are always in registers (no spilling). LHS and RHS values have the intended amount of reuse in registers. The LHS is loaded in once via broadcast/splat (vbroadcastsd), and reused along the register tile dimension, while the RHS is moved in via aligned vector loads and reused along the register tile dimension.
As for , , note that we are using large enough values that the x tile (675 KB for = 180, = 480) of the LHS matrix (A) actually fits in the L3 cache, as opposed to in the L2 cache, which was the plan with the BLIS/OpenBLAS tiling approach. Since we haven’t really done the additional level of tiling ( per the OpenBLAS/BLIS strategy, we notice that we get more mileage here out of using a larger tile for the LHS in L3 instead of a smaller tile in L2 (the L3 was sort of unoccupied for us). Note that a larger in particular amortizes the cost of the RHS’s transfer into L1 and exploits more L1 reuse on that tile. The additional level of tiling is conceptually straightforward given where we are now, and this could be evaluated along with better values for , for improved L2 and L3 reuse. We skip that for this article.
Let us now look back at our journey up until this best performing result, and the role each optimization played.
2.11 Within 9% of OpenBLAS and MKL
Let us plot the overall picture so far.
We have now nearly matched BLIS performance (61 vs 63 GFLOPS), and are only 9away from MKL and OpenBLAS’s performance! Note that BLIS’ performance could also likely improved by tuning and playing around and as opposed to choosing the configuration it preset ( = 72, = 256), but as we observed here, it is the , values that have a greater impact here as far as the MLIR + LLVM approach goes. Secondly, further tuning of BLIS is likely to still be around the OpenBLAS performance (based on published results).
2.11.1 Vectorization with copying, unroll-and-jam, unrolling but no scalar replacement
On this note, let us look at what happens if we disabled MLIR’s scalar replacement but enabled unroll-and-jam, sort of leaving it to LLVM to perform the replacement.
Relying on LLVM to do the scalar replacement does not work! We will not again go into why. Although the passes in LLVM could be improved here, this is an optimization that MLIR is supposed to perform well since it has to do with multidimensional subscripts and loops.
2.12 Map to BLIS micro-kernel: “Outer MLIR, Inner BLIS”
We might wonder at this point as to what would happen if the MLIR approach just borrowed the BLIS micro-kernel for its inner three loops (the ones running with trip counts , , ). How much better performance would it lead to for the best MLIR version? Given that we were able to get close to BLIS using MLIR + LLVM all the way, a no difference in performance would be great news for those who have worked on the x86/x86-64 LLVM backends and most of the low level infrastructure surrounding it.
We will do a simple experiment here. One could imagine employing BLIS’ micro-kernel (multiplication of two panels: x and x ) within MLIR itself, i.e., the outer loops, tiling, explicit copying etc. are all automated with MLIR the way we described above, but the carefully crafted inline assembly kernel of BLIS is called for the inner part. This is also an approach of high interest to several research groups working on HPC compilers today because many believe that (1) the performance of the hand-written inner kernel cannot be attained through compiler generated code, (2) one shouldn’t be subject to the vagaries of a compiler when generating such code where even the last ounce of performance should be extracted. Both points are debatable, and the issues involved are addressable.
Over here, our experiment is also interesting because it tells us whether the remaining difference in performance is coming from the carefully tailored instruction selection and schedule used in BLIS’ micro-kernel, which the compiler (having been tasked to deal with the entire world of programs) is unable to nail down optimally. Seeing a big difference when we swap out our inner loops with the BLIS kernel could be disappointing because it would mean we will have to now get into LLVM to see how we could improve things as one option. On the other hand, the lack of any significant improvement would be great news for us because we could then focus on the "macro kernel", code and loops that we have all possible control over, to go closer to the peak. It would also be a tribute to all those who have worked on LLVM backends and surrounding infrastructure.
The BLIS micro-kernel being used here is bli_dgemm_haswell_asm_6x8; this corresponds to = 6 and = 8. We’ll thus run a pure MLIR code generated version with = 174, = 512, = 6, = 8, and then the same one with the innermost 3-d nest mapped to BLIS’ microkernel. This keeps the comparison meaningful. (We have adjusted 180 and 480 slightly to 174 and 512 respectively to ensure that all cache tiles are full tiles to keep things more readable).
We also need to change the layout of the LHS buffer to the packed layout shown in green - this was discussed in detail earlier when we described how memref layouts work (Section 2.9).
Let us look at the structure of MLIR here with tiling and packing, but without vectorization or any unroll-jamming enabled since the inner portion is going to be swapped out.
Note that the buffer of shape <174x512xf64> after being assigned the tiled layout and post memref normalization has been converted to the shape <29x512x6xf64>, and it has elements exactly in the order the BLIS micro-kernel wants. One can notice that the blocks and panels corresponding to the buffers here being multiplied (LHS buffer: <29x512x6xf64>, RHS buffer: <512x8xf64>, the output is just 2048x2048xf64 since we will reuse it in registers). For the purpose of this article, we developed a -hopt-blis pass option that actually maps the right part to the BLIS micro-kernel. For a 2088x2048 matrix, this is the generated IR:
The hopt_dgemm_blis_kernel function is added to mlir_runtime_utils, and just wraps around bli_dgemm_haswell_asm_6x8.
The allocs above have alignments since the BLIS kernel uses 256-bit aligned loads on these buffers. So, this experimentation was done with additional support in the MLIR’s std to llvm dialect conversion to use aligned_alloc. We have these alignments set even for the pure MLIR generated code since the loads and stores on vector types will have the vector size as the default ABI alignment, leading to aligned load/stores during LLVM code generation; so the alloc’s need to be aligned to vector size boundaries.
There is nearly no difference when comparing the hybrid MLIR and BLIS micro kernel with the best MLIR version we had. For a given x register tile with schedule we have been using, the register requirement (assuming the instruction scheduler is not doing any non-trivial reordering) would be + 1 (the division by four is for the vector width). With , , this requirement would exactly be 16, which is the number of %ymm registers we have! Using = 6, = 8 with our code (as opposed to with the BLIS micro-kernel), the requirement goes up to 62 + 6 + 1 = 19 registers, and leads to a spill. The assembly dump confirms this (notice the vmovupd stores in between the FMAs, which didn’t exist earlier). Also, the intuitively sub-optimal sizes of = 4, = 8 provide better performance than the tighter = 6, = 8, indicating the cliff associated with spilling.
However, = 6, = 8 could be made to fit in the register budget by using a different permutation of instructions for the innermost loop body, i.e., with a register tile whose intra-tile loops have been permuted before the unrolling was performed. In such a case, the register requirement would be: * /4 + /4 + 1 = 15! The schedule to be used to realize that would be:
Notice the flip instead of . This also means that the LLVM backend is not going to automatically do a complex permutation on the innermost body that changes live ranges, reducing register requirements to fit within the budget and thereby avoiding a spill. Using the above schedule leads to no spill and the code performs at about 5% slower though than = 3, = 16. It is thus also important to get the right permutation for the unrolled register tile (at the time of selecting the loop transformation) — so that the matrix values with longer reuse distances for that intra register tile permutation (LHS values in the case of (d1, d0) and RHS values in the case of (d0, d1)) do not cause a spill. Alternatively, this problem could be viewed as one that performs the unroll-and-jam for , in the right order. The register requirement in the context of this "known" computation and its interaction with the intra-tile loop permutation and register tile sizes (, ) are pretty easy to capture at a high level to make sure even a simple register allocator does the right thing.
Overall, we conclude here that the kind of code we could get automatically is clearly on par or close to what was achieved with expert written assembly. The best performing x of 3x16 is on par with the MLIR-BLIS configuration we explored. The 6x8 version, if made to work without spilling, will likely be on par with 3x16. This basically means the 4-9% difference between the pure MLIR one and the BLIS/OpenBLAS ones likely stems from things around and outside the micro-kernel. This part requires more experimentation before any firm conclusions can be made.
On a separate note, one can see that MLIR does make it easier to map to external hand-optimized kernels, and combine the best of both worlds if that’s necessary for peak performance. Also, it is straightforward to perform a systematic exploration around interesting values by generating ops with different parameter attributes. However, each choice requires a compile and execute cycle since we are not generating any target code parametric in the identified parameters.
2.13 SGEMM performance
We can similarly now benchmark SGEMM performance quickly. All that we need to change on the op’s operands is an s/f64/f32! In addition, we will just double the register tile size from 16 to 32 since two times the number of f32 elements can be held in a vector register (8 x f32 instead of 4 x f64). We would thus still be using 3x4 = 12 vector registers for C. In addition, we will also double to 348 for the same reason. We thus use this op to generate SGEMM.
Evaluating this, we find that the performance of purely MLIR generated code is within 2% of MKL performance here!, and in fact marginally better than OpenBLAS and BLIS. The outer MLIR + inner BLIS version here delivers the expected performance, nearly on par with pure MLIR here.
2.14 What about the remaining 9%?
At this point, we would like to see what stands in the way of bridging the remaining 9% gap in getting to MKL/OpenBLAS’s performance. In particular, the hand-written kernels in OpenBLAS, BLIS also use prefetching – we are not yet generating any of these instructions.
In summary, we have been able to generate all code starting from just this op:
2.15 Other questions
There may be a few questions here.
While the code was automatically generated, the transformation sequence was specified from within an MLIR pass. What does the sequence look like - just C++ IR building and calls to utilities/passes? How productive is it and the subsequent exploration?
What about problem sizes that are not a perfect multiple of register tile sizes (, )? While MLIR’s unroll-and-jam works with cleanup code generated, there are unresolved issues in interactions downstream (for performance). There is literature in the polyhedral body of works on separation of full and partial tiles, and this is largely an implementation issue. Note that cache tile sizes need not perfectly divide problem sizes - this already works in MLIR in conjunction with packing and other transformations presented here. For eg. the best size we used (180) does not divide 2048.
We have primarily used the polyhedral passes in MLIR here since the generated code at each stage always stays affine. The LinAlg dialect does not have the utilities to automatically analyze and generate packing code for example. Nearly all passes and utilities used here such as unroll-and-jam, scalar replacement, and memref normalization work on affine dialect ops.
What about all the other options around input matrices such as strides and transpose? These all are cleanly expressed via memrefs’ affine layout map discussed above (Section 2.9). We haven’t considered the and arguments for DGEMM though (it is really ) - we have actually assumed = = 1. In fact, MLIR’s pattern rewrites can fold and optimize away code/nests when or are 0 or 1. Overall, these aspects are something to keep in mind before making a complete/exact comparison.
How do we determine good or the optimal parameter values? The work of Low et al. is on analytical modeling to derive good parameter values (the derived formulae are pretty straightforward to just plug values from hardware resources into). Besides such analytical modeling, there is a lot of room to play here depending on how powerful the generative infrastructure is, and to what extent we would like to generalize this approach beyond the domain considered here.
3 Related Work
Most of the related work on OpenBLAS [goto2008toms] and BLIS [vanzee2015toms] was already described inline. The work by Gareev et al. [gareev18taco] is the only approach I am aware of that tried the BLIS approach from inside a compiler (Polly/LLVM there). From their results, they appeared to have had reached about 70% of MKL/OpenBLAS performance. With a low-level IR such as LLVM, and with Polly and LLVM infrastructure decoupled to some extent, one’s expressive power and mileage is expected to vary. There has been a large amount of work on empirical search driven library generation for GEMM [atlas] (ATLAS), and works like that of Yotov et al. [yotov2005ieee] have plugged in analytical models to drive such optimization with comparable performance. The work of Low et al. [low2016toms] makes a similar attempt with BLIS. However, all of these frameworks were centered around generation of C code along with inner inline assembly kernels.
4 Reproducing these Results
Fedora Linux 30 running kernel 5.3.6-200.fc30.x86_64, BLIS version 0.6.0-40-gf4f5170f, MKL version 2019.4.243, OpenBLAS 0.3.7-1.fc30.x86_64, and Pluto git 0.11.4-903-g7f21ab57. cpupower was used to set the frequency governor to ‘performance’.
A good part of the experiments presented in this article can be reproduced with MLIR trunk. There are major features though that are pending upstream integration (memref_shape_cast op, alloca op, scalar replacement, a new vectorization pass/utility, and support for a few packing options), but these are available in the branch here, which is the recommended way of reproducing results presented herein as is. Please see this README to run most of the experiments.
The hop branch however is not intended to be continuously kept in sync with changing upstream MLIR development — so that the examples, IR and results presented herein remain valid. On the other hand, the MLIRX [mlirx] repository provides all of the features presented here while being maintained to be in sync with upstream MLIR development. If the reader’s objective is to build upon the augmented IR infrastructure as opposed to just trying to reproduce things, MLIRX is recommended.