A Variable Vector Length SIMD Architecture for HW/SW Co-designed Processors

by   Rakesh Kumar, et al.

Hardware/Software (HW/SW) co-designed processors provide a promising solution to the power and complexity problems of the modern microprocessors by keeping their hardware simple. Moreover, they employ several runtime optimizations to improve the performance. One of the most potent optimizations, vectorization, has been utilized by modern microprocessors, to exploit the data level parallelism through SIMD accelerators. Due to their hardware simplicity, these accelerators have evolved in terms of width from 64-bit vectors in Intel MMX to 512-bit wide vector units in Intel Xeon Phi and AVX-512. Although SIMD accelerators are simple in terms of hardware design, code generation for them has always been a challenge. Moreover, increasing vector lengths with each new generation add to this complexity. This paper explores the scalability of SIMD accelerators from the code generation point of view. We discover that the SIMD accelerators remain underutilized at higher vector lengths mainly due to: a) reduced dynamic instruction stream coverage for vectorization and b) increase in permutations. Both of these factors can be attributed to the rigidness of the SIMD architecture. We propose a novel SIMD architecture that possesses the flexibility needed to support higher vector lengths. Furthermore, we propose Variable Length Vectorization and Selective Writing in a HW/SW co-designed environment to transparently target the flexibility of the proposed architecture. We evaluate our proposals using a set of SPECFP2006 and Physicsbench applications. Our experimental results show an average dynamic instruction reduction of 31 SPECFP2006 and Physicsbench respectively, for 512-bit vector length, over the scalar baseline code.




1. Introduction

Hardware/Software (HW/SW) co-designed processors offer a solution to the power and complexity problems of modern microprocessors (Sathaye et al., 1999)(Dehnert et al., 2003)(Krewell, 2003). In order to reduce the power consumption and complexity, these processors incorporate simple hardware. Moreover, several dynamic optimizations are applied to improve the performance.

Single Instruction Multiple Data (SIMD) accelerators form an integral part of modern microprocessors. Since these accelerators perform the same operation on multiple pieces of data, they require only duplicated functional units and a very simple control mechanism. Despite their simplicity, they are well suited to exploiting data level parallelism in modern multimedia, scientific and throughput computing applications. For this reason, SIMD accelerators are ubiquitous in processors from different computing domains, such as general purpose processors (13)(Diefendorff et al., 2000)(Lee, 1996), Digital Signal Processors (Arcy and Beach, 1999), gaming consoles (Kahle et al., 2005)(Sporny et al., 2002) and embedded architectures (Baron, 2005). Due to their hardware simplicity, SIMD accelerators grow in width with each new generation. For example, Intel MMX (13) had a vector length of 64 bits, which was increased to 128 bits in the SSE (13) extensions. Intel AVX (13) and AVX2 (13) support 256-bit vectors, whereas Intel's recent SIMD extension AVX-512 (10) and the Many Integrated Core architecture (11) support 512-bit vector operations.

In spite of their hardware simplicity, code generation for SIMD accelerators has always been challenging. In the early days, programmers targeted these extensions mainly through in-line assembly or specialized library calls, which is tedious and error prone. Then, automatic generation of SIMD instructions (auto-vectorization) was introduced in compilers (Bik et al., 2002)(Naishlos, 2004), which borrowed their methodology from vector compilers. These compilers target loops when generating code for SIMD accelerators. Later, S. Larsen et al. (Larsen and Amarasinghe, 2000) introduced Superword Level Parallelism (SLP), which targets basic blocks instead of whole loops for vectorization. Apart from these static approaches, dynamic vectorization in superscalar processors has also been explored by A. Pajuelo et al. (Pajuelo et al., 2002).

Although SIMD accelerators are amenable to scaling from the hardware point of view, generating efficient code for higher vector lengths is not straightforward. The problem lies in the fact that different applications have different natural vector lengths. For some applications, compilers just need to unroll loops with a higher unroll factor to fill the wider vector paths. Other applications, however, do not have enough parallelism for vectorization at higher vector lengths, and SIMD resources are left unutilized or underutilized. Generating code for these applications for wider vector units becomes a challenge.

In this paper, we explore the scalability of SIMD accelerators from the code generation point of view. We discover that two key factors thwart performance at higher vector lengths. First, the dynamic instruction stream coverage for vectorization reduces as vector length increases. This is because the instructions in current vector ISAs operate on all the vector lanes together and not on a subset of them. For example, ADDPS in Intel SSE, VADD in ARM Neon and VADDFP in PowerPC Altivec all operate on all the vector lanes together. Therefore, compilers generate a vector instruction only when there are enough independent operations to fill the vector path. When there are not enough instructions to fill the vector path, all the instructions are left in scalar form. We propose a flexible SIMD architecture whose instructions can operate on any number of vector lanes. In addition, we propose Variable Length Vectorization (VLV) to target the flexible vector datapath.

Second, the number of permutation instructions increases with vector length. The rigidness of the SIMD architecture is again responsible for this. For example, the scalar SIMD instructions in Intel SSE always write their result to the lowest element of the vector register. If a vector instruction needs to read these results, they first need to be packed together in a single vector register using shuffle instructions. The proposed SIMD architecture allows scalar instructions to write their result to any element of the vector register, depending on how the results are needed by the consumer vector instruction. Therefore, the shuffle instructions are no longer required. We call this ability to write to any selected part of a vector register Selective Writing (SWR).

VLV increases the dynamic instruction stream coverage by iteratively packing the maximum number of scalar instructions together, even if that number is less than the number of vector lanes available. SWR employs two techniques to keep permutations to a minimum. As a result, the proposed SIMD architecture alleviates the rigidness problem of the traditional SIMD architecture and allows optimized code to be generated at higher vector lengths. Moreover, the HW/SW co-designed nature of the processor provides some additional advantages. For example, since vectorization is done at runtime on the program binary, it does not require any changes to the compiler, operating system or application source code. Therefore, we can target the proposed ISA without modifying anything in the software stack. The main contributions of this paper can be summarized as:

  • Identifies the bottleneck in vector code generation for wider vector units.

  • Proposes a flexible SIMD architecture.

  • Proposes Variable Length Vectorization to increase the dynamic instruction stream coverage.

  • Proposes Selective Writing to reduce the number of permutation instructions.

This paper is an extension of our prior work (Kumar et al., 2013) and makes the following additional contributions:

  • Shows why both VLV and SWR are necessary, rather than either one alone.

  • Shows why a vector length register is not a good choice for SIMD accelerators.

The rest of the paper is organized as follows: Section 2 provides background on HW/SW co-designed processors. Section 3 briefly provides the motivation for the work presented in this paper and identifies key issues in efficient vector code generation for higher vector lengths. Section 4 describes the speculative dynamic vectorization algorithm. Sections 5 and 6 explain the proposed SIMD ISA and the Variable Length Vectorization and Selective Writing techniques. Evaluation of the proposals using a set of SPECFP2006 and Physicsbench applications is presented in Section 7. Section 8 presents related work and Section 9 concludes.

2. Background Of HW/SW Co-designed Processors

A HW/SW co-designed processor is a hybrid architecture that leverages hardware/software co-design to couple a software layer to the microarchitectural design of a processor. The software layer resides between the hardware and the operating system. This software layer allows host and guest ISAs to be completely different by translating the guest ISA instructions to the host ISA dynamically. We define the host ISA as the ISA that is implemented in the hardware, whereas the guest ISA is the one for which applications are compiled. The basic idea behind these processors is to have a simple host ISA to reduce power consumption and complexity. Processors of this kind (Dehnert et al., 2003)(Ebcioğlu and Altman, )(Sathaye et al., 1999) first emerged more than two decades ago. Moreover, there is renewed interest in them in both industry and academia (12)(Lupon et al., 2014)(Branković et al., 2014) (Wang et al., 2013)(Kumar et al., 2014)(Neelakantam et al., 2010) (Kumar et al., 2013)(Kumar et al., 2013).

(a) Conventional RISC processor
(b) Conventional CISC processor
(c) HW/SW co-designed processor
Figure 1. HW/SW interface in processors

These processors are specifically designed to achieve energy efficiency, design simplicity, and performance improvement. In order to achieve design simplicity, they keep the hardware simple and implement a relatively simple ISA. The simple hardware design also helps in achieving energy efficiency. Transmeta reports a significant reduction in power dissipation for their HW/SW co-designed processor Crusoe compared to the Intel Pentium III for a software DVD player (Klaiber, 2000). Their data shows that the Pentium III heats up to a temperature of 105 °C, whereas Crusoe's maximum temperature reaches only 48 °C running the same software DVD player. Furthermore, to achieve the performance goal, HW/SW co-designed processors employ dynamic binary optimizations.

In general, HW/SW co-designed processors implement a proprietary ISA in order to achieve design simplicity and power efficiency. Therefore, they need to apply binary translation to map the guest ISA onto the host ISA. Binary translation, in general, can be implemented in either hardware or software. Modern processors implementing a CISC ISA, like x86, implement binary translation in hardware (Smith and Nair, 2005). The hardware binary translator translates CISC instructions to RISC-like instructions dynamically to simplify the execution pipeline implementation. However, the hardware implementation leads to significant hardware complexity and power consumption. HW/SW co-designed processors, on the other hand, implement dynamic binary translation in software, which leads to energy efficiency.

Fig. 1(a) shows the hardware/software interface in a conventional RISC processor, where the software stack interacts directly with the hardware. Conventional CISC processors implement a RISC-like ISA in hardware. As shown in Fig. 1(b), they employ a hardware dynamic binary translator to translate CISC instructions to the internal ISA instructions. The binary translation in HW/SW co-designed processors is performed by a software layer, as shown in Fig. 1(c). We call this software layer the Translation Optimization Layer (TOL) in this paper.

Performing the dynamic binary translation/optimization in a software layer provides several benefits over a hardware implementation. For example, the software implementation significantly reduces hardware complexity and power consumption. Furthermore, it allows a processor to be upgraded in the field by introducing new optimizations in the software layer. In contrast, if TOL is implemented in hardware, adding new optimizations to an existing processor is not feasible. Additionally, a software implementation of TOL significantly reduces hardware validation and verification cost and time.

2.1. Dynamic Binary Translation/Optimization

Translating guest ISA code to host ISA code is the prime responsibility of TOL. The translation is done dynamically and, generally, in multiple phases. Usually, in the first phase, an interpreter decodes and executes guest ISA instructions sequentially. In the remaining phases, the guest code is translated into host ISA code and, after several dynamic optimizations have been applied, stored in the code cache for faster execution. The number of translation phases and the optimizations in each phase are implementation dependent.

Fig. 2 shows a typical two-stage translation/optimization flow in a TOL. It starts by interpreting the guest ISA instruction stream sequentially. While interpreting, TOL also profiles the guest code to collect information about the most frequently executed code and biased branch directions. The execution frequency guides TOL in deciding which guest code basic blocks to translate. When a basic block has been executed more than a predetermined number of times, TOL invokes the translator. The translator takes the guest ISA basic blocks as input, translates them to host ISA code and saves the translated code into the code cache for fast native execution. Instead of translating and optimizing each basic block in isolation, the translator uses the biased branch direction information collected during interpretation to create bigger optimization regions, called superblocks. A superblock generally consists of multiple basic blocks following the biased direction of branches. Therefore, superblocks extend the scope of optimizations to multiple basic blocks and allow more aggressive optimizations. A superblock has a single entry point, which is the first instruction of its first basic block. However, depending on the implementation, it might have a single exit point or multiple exit points.

Initially, control is transferred back to TOL after executing a superblock from the code cache. Then, TOL searches for the next instruction to be executed. If the next instruction is not already translated, it has to be interpreted. However, if it is already translated, TOL patches the last branch of the first superblock (the one that transferred control back to TOL) to point to the beginning of the second superblock. This process is called chaining or linking.
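The flow described above can be illustrated with a small toy model. The following is a minimal sketch, not the TOL implementation: the class `ToyTOL`, the value of `HOT_THRESHOLD`, and the use of basic-block identifiers in place of real guest code are all assumptions made for illustration, and superblock formation is omitted.

```python
from dataclasses import dataclass, field

HOT_THRESHOLD = 3  # hypothetical value; the text only says "predetermined"


@dataclass
class ToyTOL:
    exec_count: dict = field(default_factory=dict)  # block -> times interpreted
    code_cache: dict = field(default_factory=dict)  # block -> translated code
    chains: set = field(default_factory=set)        # (from_block, to_block) links

    def run(self, trace):
        """Execute a sequence of block ids; return the mode used for each one."""
        modes = []
        prev_native = None  # previous block, if it ran from the code cache
        for block in trace:
            if block in self.code_cache:
                # Both blocks are translated: patch the exit branch of the
                # previous block so it jumps here directly (chaining).
                if prev_native is not None:
                    self.chains.add((prev_native, block))
                modes.append("native")
                prev_native = block
            else:
                modes.append("interpret")
                count = self.exec_count.get(block, 0) + 1
                self.exec_count[block] = count
                if count > HOT_THRESHOLD:
                    # Hot block: translate and store in the code cache.
                    self.code_cache[block] = f"host_code({block})"
                prev_native = None
        return modes


tol = ToyTOL()
modes = tol.run(["A", "B"] * 6)
print(modes)
print(sorted(tol.chains))
```

In this toy run, each block is interpreted four times before it crosses the threshold; afterwards both blocks execute natively and the two translated regions become chained to each other.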

Figure 2. Typical two stage TOL control flow

2.2. Why HW/SW Co-designed Processors

HW/SW co-designed processors provide certain features that set them apart from traditional hardware-only processors. The following are some of the reasons that motivated us to choose them for our proposals:

Aggressive Vectorization: Compilers' inability to do accurate interprocedural pointer disambiguation and interprocedural array dependence analysis severely limits their vectorization ability (Maleki et al., 2011). On the other hand, the dynamic optimization environment in HW/SW co-designed processors avoids the need for these analyses by vectorizing speculatively (Kumar et al., 2013). Furthermore, these processors provide efficient support for recovering from speculation failures (Sathaye et al., 1999)(Dehnert et al., 2003). Therefore, they enable aggressive vectorization and catch vectorization opportunities missed by conservative compiler vectorization.

Dynamic Information: Since the vectorization is done at runtime, it benefits from the availability of runtime information. For example, the loop unroll factor can be determined at runtime through profiling for loops whose trip count is unknown at compile time. This is especially important for variable length vectorization, where the optimal loop unroll factor varies with the logical vector length, which is not always equal to the SIMD accelerator width, as explained in Section 5.1.

Decoupled vector ISA and SIMD accelerator: HW/SW co-designed processors decouple the hardware implementation of SIMD accelerator from application visible vector ISA by means of dynamic binary translation. This enables modifications/improvements in the SIMD accelerator without affecting the application visible SIMD ISA. We leverage this fact to introduce a flexible SIMD accelerator without any modification in the application visible (guest) ISA, compiler or any other component of the software stack.

Portable Vectorization: Since vectorization is done by TOL at runtime, the same application binary can be executed on different SIMD accelerators. This kind of portable vectorization provides forward and backward binary compatibility.

Legacy Code Vectorization: Runtime vectorization in HW/SW co-designed processors also enables legacy code vectorization. Therefore, code that was not compiled for any SIMD accelerator can also benefit from their presence.

Figure 3. Dynamic FP instruction stream coverage for vectorization at 128, 256 and 512-bit vector lengths

3. Motivation

Recent trends show that vector lengths are likely to keep increasing in future microprocessors, since wider vectors provide a simple and energy-efficient way of achieving higher FLOPS. Intel's 256-bit AVX (13) and the 512-bit vector length of AVX-512 (10) and Larrabee (Seiler et al., 2008) are a few examples of these trends. However, it is a challenge to generate efficient code that utilizes these wider vector units. To demonstrate this fact, we vectorized floating point instructions in SPECFP2006 for three different vector lengths of 128, 256, and 512 bits using the speculative dynamic vectorization algorithm described in (Kumar et al., 2013). At a given vector length, all the vector instructions operate only on the maximum vector length and not on a subset of it. For example, in the 512-bit case, all the vector instructions operate on the whole 512 bits and there is no vector instruction that operates only on 256 or 128 bits. This is in line with how vector instructions function in current SIMD architectures, operating on all the vector lanes and not on a subset.

Our results show that there are mainly two problems in vector code generation at higher vector lengths: reduced dynamic instruction stream coverage for vectorization and huge number of permutation instructions.

3.1. Reduced Dynamic Instruction Stream Coverage

We define dynamic instruction stream coverage as the number of dynamic scalar instructions vectorized. Fig. 3 shows the dynamic instruction stream coverage for vectorization at different vector lengths, normalized to the 128-bit case. The best, worst and average cases are shown. The applications fall into two categories. Applications in the first category, like 454.calculix, maintain maximum dynamic instruction stream coverage at all the vector lengths. In contrast, there are applications like 444.namd where dynamic instruction stream coverage falls by 70% at a vector length of 512 bits.

The dynamic instruction stream coverage at different vector lengths depends upon the degree of data level parallelism available in the application and how this parallelism is extracted through SIMD extensions. If an application spends most of its time in loops with high trip counts, it will benefit from higher vector lengths, since the wider vector paths can be filled by unrolling the loops more times, depending on the vector length. However, as shown by the average case of Fig. 3, this is not the case for most of the applications. We see an average reduction of 25% and 48% in dynamic instruction stream coverage at 256-bit and 512-bit vector lengths, respectively. If this trend continues, the coverage is going to be even lower at higher vector lengths.
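The effect of full-width-only vector instructions on coverage can be made concrete with a small calculation. This is a minimal sketch under stated assumptions: the group sizes below are invented for illustration and are not measurements from the paper, and the model assumes that each group of independent, identical scalar operations is vectorized in whole multiples of the lane count, with the remainder left scalar.

```python
def vectorized_ops(group_sizes, lanes):
    """Scalar ops covered when every vector instruction must fill all lanes."""
    return sum((n // lanes) * lanes for n in group_sizes)


def coverage(group_sizes, lanes):
    """Fraction of scalar ops that end up inside vector instructions."""
    return vectorized_ops(group_sizes, lanes) / sum(group_sizes)


# Hypothetical groups of independent single-precision operations.
groups = [16, 8, 8, 4, 4]

for bits in (128, 256, 512):
    lanes = bits // 32  # 32-bit elements per vector register
    print(bits, coverage(groups, lanes))
```

With these hypothetical groups, every operation is covered at 128 bits, while the smaller groups drop out at 256 and 512 bits because they can no longer fill the vector path.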

Figure 4. Normalized Number of Permutation Instructions generated per vector instruction

3.2. Number of Permutation Instructions

When the input operands of a vector instruction are not available in a single vector register or are not in the same order as required by the vector instruction, permutation instructions are needed to arrange them in the correct order. Our results show that the number of permutation instructions grows significantly with increasing vector lengths.

Fig. 4 shows the number of permutation instructions generated per vector instruction in SPECFP2006 normalized to the 128-bit case. As the figure shows, if we generate one permutation instruction for each vector instruction at 128-bit vector length, this number goes as high as 10 at 512-bit vectors in case of 444.namd. Also, there are applications for which this number does not grow that rapidly. However, the average behavior suggests that number of permutation instructions is going to be a problem at higher vector lengths.

Both of these factors become limitations as vector paths become wider: instead of improving, performance starts degrading compared to lower vector lengths. In essence, both problems arise because current SIMD architectures are not flexible enough to handle these situations. The vector instructions in current SIMD architectures operate on all the vector lanes and not on a subset of them. As a result, if there are not enough independent instructions performing the same operation, compilers do not generate a vector instruction. This behavior leads to reduced dynamic instruction stream coverage. Also, the scalar instructions in current SIMD architectures, such as ADDSS and MULSS in Intel SSE, write their result only to the lowest element of a vector register. If a vector instruction needs to read these results, they need to be packed into a single register using shuffle instructions before they can be consumed by the vector instruction, thereby increasing the number of permutations. This paper investigates both problems and proposes a flexible SIMD architecture, along with Variable Length Vectorization and Selective Writing, to solve the problems of reduced coverage and permutation instructions, respectively.
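The permutation overhead caused by scalar results landing in the lowest element can be sketched as follows. This is an illustrative model only: the function names are hypothetical, registers are modelled as Python lists, and the cost of one permutation per additional register merged is an assumption rather than a figure from the paper.

```python
def write_lowest(value, lanes):
    """Rigid model: a scalar result lands in lane 0 of a fresh register."""
    reg = [0.0] * lanes
    reg[0] = value
    return reg


def gather_with_permutations(regs):
    """Merge lane 0 of each register into one vector.

    Assumed cost model: one permutation per additional register merged.
    """
    out = list(regs[0])
    permutations = 0
    for lane, reg in enumerate(regs[1:], start=1):
        out[lane] = reg[0]
        permutations += 1
    return out, permutations


def write_selected(reg, lane, value):
    """Selective writing: the scalar result lands directly in the chosen lane."""
    reg = list(reg)
    reg[lane] = value
    return reg


results = [1.5, 2.5, 3.5, 4.5]  # scalar results a vector instruction will consume
lanes = len(results)

rigid, perms = gather_with_permutations([write_lowest(v, lanes) for v in results])

flexible = [0.0] * lanes
for lane, v in enumerate(results):
    flexible = write_selected(flexible, lane, v)

print(rigid == flexible, perms)
```

Both paths produce the same vector, but only the rigid one pays for permutations, and under this cost model that price grows with the number of lanes.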

4. Vectorization Algorithm

This section briefly discusses the baseline speculative dynamic vectorization scheme; the details of the algorithm and its evaluation can be found in (Kumar et al., 2013, 2016, 2012). The software layer of our co-designed processor is called Translation Optimization Layer (TOL). TOL operates in three translation modes for generating host code from guest x86 code: Interpretation Mode (IM), Basic Block Translation Mode (BBM) and Superblock Translation Mode (SBM). SBM is the most aggressive translation/optimization mode and the majority (more than 90%) of the dynamic application code is executed in this mode. Vectorization is done only in SBM, after applying several standard optimizations.

4.1. Pre-Vectorization Steps

Before starting vectorization, we create a superblock, optimize it by applying standard compiler optimizations, and generate a Data Dependence Graph (DDG), as explained below:

4.1.1. Superblock Creation

TOL starts by interpreting guest x86 instruction stream in IM. When a basic block is executed more than a predetermined number of times, TOL switches to BBM. In this mode, the whole basic block is translated and stored in the code cache and the rest of the executions of this basic block are done from the code cache. Moreover, profiling information is gathered for all the basic blocks in BBM using software counters. This information consists of execution and edge counters. The execution counter provides the execution frequency of a basic block while the edge counters monitor the biased branch direction. Once the execution of a basic block exceeds another predetermined threshold, TOL creates a bigger optimization region, called superblock, using the branch profiling information collected during BBM. A superblock generally includes multiple basic blocks following the biased direction of branches.

Moreover, the branches inside the superblocks are converted to “asserts” so that a superblock can be treated as a single-entry, single-exit sequence of instructions. This gives the freedom to reorder and optimize instructions across multiple basic blocks. “Asserts” are similar to branches in the sense that both check a condition. Branches determine the next instruction to be executed based on the condition; asserts have no such effect. If the condition is true, the assert does nothing. However, if the condition evaluates to false, the assert “fails” and execution is restarted in IM from a previously saved checkpoint. Furthermore, if the number of assert failures in a superblock exceeds a predetermined limit, the superblock is recreated without converting branches to “asserts”. As a result, this time the superblock has to be treated as a single-entry, multiple-exit sequence of instructions. Having multiple exits in a superblock also reduces the available optimization opportunities, because instructions across different exit paths cannot be reordered as freely as before.
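The assert semantics described above can be sketched with a toy executor. This is only an illustration under assumptions: architectural state is modelled as a dictionary, the checkpoint as a copy of it, and the names `run_superblock`, `add` and `assert_true` are hypothetical.

```python
class AssertFailure(Exception):
    """Raised when an assert's condition evaluates to false."""


def add(reg, k):
    def op(state):
        state[reg] += k
    return op


def assert_true(pred):
    def op(state):
        if not pred(state):
            raise AssertFailure()
    return op


def run_superblock(state, ops):
    """Run ops speculatively; on assert failure, discard all their effects."""
    checkpoint = dict(state)  # saved before entering the superblock
    try:
        for op in ops:
            op(state)
        return state, "committed"
    except AssertFailure:
        # Execution would restart from the checkpoint in interpretation mode.
        return checkpoint, "rollback_to_interpreter"


superblock = [add("x", 5), assert_true(lambda s: s["x"] < 10), add("x", 1)]

ok_state, ok_status = run_superblock({"x": 0}, superblock)
bad_state, bad_status = run_superblock({"x": 8}, superblock)
print(ok_state, ok_status)
print(bad_state, bad_status)
```

The second run fails its assert after `x` has already been modified, so the modified state is dropped and the checkpointed value is what the interpreter would resume from.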

Loop unrolling plays a major role in vectorization. Compilers unroll loops a particular number of times to obtain enough independent instructions to fill the vector path. It is relatively simple to determine the unroll factor for loops with a static trip count. However, for loops where the number of iterations is not known statically, it is difficult to decide the unroll factor. The availability of dynamic application behavior in HW/SW co-designed processors allows us to determine the loop unroll factor dynamically. We profile the applications, in BBM, to collect the loop iteration count for each loop. This information is used in superblock creation to decide the loop unroll factor. Currently, we unroll only loops consisting of a single basic block, as loops with little or no control flow are the ones that provide the maximum benefit (Muchnick, 1997).

4.1.2. Pre-optimizations

The optimizer applies several transformations on the superblock. First, x86 code is translated to an intermediate representation. Then the resulting code is transformed into a Static Single Assignment format. This transformation removes anti & output dependences and significantly reduces the complexity of subsequent optimizations. Second, a forward pass applies a set of conventional single pass optimizations: constant folding, constant propagation, copy propagation, and common subexpression elimination. Third, a backward pass applies dead code elimination.
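The forward and backward passes can be sketched on a tiny SSA-style intermediate representation. This is a minimal sketch, not the optimizer described in the paper: the tuple format `(dest, op, args)`, the opcode names, and the restriction to `add` and `mul` are assumptions, and common subexpression elimination is left out for brevity.

```python
import operator

FOLDABLE = {"add": operator.add, "mul": operator.mul}


def forward_pass(code):
    """Single forward pass: copy propagation, constant propagation and folding."""
    consts, copies, out = {}, {}, []
    for dest, op, args in code:
        # Copy propagation: use the original name instead of its copy.
        args = tuple(copies.get(a, a) for a in args)
        if op == "const":
            consts[dest] = args[0]
        elif op == "copy":
            copies[dest] = args[0]
            if args[0] in consts:
                consts[dest] = consts[args[0]]
        elif op in FOLDABLE and all(a in consts for a in args):
            # Constant folding: all operands are known constants.
            value = FOLDABLE[op](*(consts[a] for a in args))
            consts[dest] = value
            op, args = "const", (value,)
        out.append((dest, op, args))
    return out


def dead_code_elimination(code):
    """Single backward pass: keep stores and anything a kept instruction uses."""
    live, kept = set(), []
    for dest, op, args in reversed(code):
        if op == "store" or dest in live:
            kept.append((dest, op, args))
            live.update(a for a in args if isinstance(a, str))
    return kept[::-1]


code = [
    ("a", "const", (2,)),
    ("b", "const", (3,)),
    ("c", "add", ("a", "b")),
    ("d", "copy", ("c",)),
    ("x", "load", ("p",)),
    ("e", "mul", ("x", "d")),
    ("f", "add", ("x", "x")),      # result never used
    (None, "store", ("q", "e")),
]

optimized = dead_code_elimination(forward_pass(code))
for instr in optimized:
    print(instr)
```

Because the code is in SSA form, each name is defined once, which is what lets both passes get away with simple dictionaries and a single sweep.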

After the basic optimizations, the Data Dependence Graph (DDG) is prepared. During DDG creation, we perform memory disambiguation analysis. If the analysis cannot prove that a pair of memory operations will never/always alias, it is marked as “may alias”. In case of reordering, the original memory instructions are converted to speculative memory operations. Apart from this, Redundant Load Elimination and Store Forwarding are also applied during DDG phase so that redundant memory operations are removed before vectorization. The DDG is then passed as input to the vectorizer. After vectorization, an instruction scheduler that uses a conventional list scheduling algorithm schedules the vectorized code. Afterwards, the determined schedule is used by the register allocator that implements linear scan register allocation algorithm. Finally, the optimized code is translated to the host instructions and is stored in the code cache.

4.2. The Vectorizer

The vectorizer packs together a number of independent scalar instructions that perform the same operation, and replaces them with one vector instruction. The number of scalar instructions packed depends on two factors:

  • data-types of scalar instructions

  • host vector length

For example, for a host vector length of 128-bit, four 32-bit single-precision floating-point instructions can be packed together in a single vector instruction. Therefore, vectorization reduces dynamic instruction count and improves performance. Before describing the algorithm itself, we define a set of conditions that a pair of instructions must satisfy to be included in the same pack:

  • The instructions must perform the same operation.

  • The instructions must be independent.

  • The instructions must not be in another pack.

  • If the instructions are load/store, they must be accessing consecutive memory locations.
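The four packing conditions above can be expressed as a single predicate. The following is a sketch under assumptions: the `Instr` record, its fields, and the representation of dependences as a set of predecessor ids are hypothetical, and memory addresses are plain integers rather than the symbolic addresses a real translator would compare.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    id: int
    op: str
    size: int = 4                  # operand size in bytes
    addr: Optional[int] = None     # memory address for loads/stores
    deps: frozenset = frozenset()  # ids of (transitive) predecessors


def can_pack(a, b, packed_ids):
    """Return True if b may join a in the same pack (b placed right after a)."""
    if a.op != b.op or a.size != b.size:
        return False               # must perform the same operation
    if a.id in b.deps or b.id in a.deps:
        return False               # must be independent
    if a.id in packed_ids or b.id in packed_ids:
        return False               # must not already be in another pack
    if a.op in ("load", "store"):
        return b.addr == a.addr + a.size  # must be consecutive in memory
    return True


l0 = Instr(0, "load", 8, addr=0x100)
l1 = Instr(1, "load", 8, addr=0x108)
l2 = Instr(2, "load", 8, addr=0x120)
add3 = Instr(3, "add", 8, deps=frozenset({0}))
add4 = Instr(4, "add", 8, deps=frozenset({1}))
add5 = Instr(5, "add", 8, deps=frozenset({3}))

print(can_pack(l0, l1, set()), can_pack(l0, l2, set()), can_pack(add3, add5, set()))
```

The number of such instructions that fit in one pack is the host vector length divided by the operand width, for example 128 // 32 = 4 single-precision operations.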

Vectorization starts by marking all the instructions that are candidates for vectorization. Moreover, we mark First Load and First Store instructions. First Load/Store instructions are those for which there are no other loads/stores from/to the immediately preceding memory location. For example, if there is a 64-bit load instruction that loads from a memory location [M] and there is no 64-bit load instruction that loads from address [M-8], we call it a First Load.
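Marking First Loads and First Stores amounts to a simple lookup. This is a minimal sketch assuming that each memory access is described by a hypothetical `(kind, address, size_in_bytes)` tuple with a concrete integer address.

```python
def first_accesses(accesses):
    """Return accesses with no same-kind, same-size access just before them."""
    present = set(accesses)
    return [
        (kind, addr, size)
        for kind, addr, size in accesses
        if (kind, addr - size, size) not in present
    ]


accesses = [
    ("load", 0x100, 8),
    ("load", 0x108, 8),
    ("load", 0x110, 8),
    ("load", 0x200, 8),
    ("store", 0x300, 8),
    ("store", 0x308, 8),
]

firsts = first_accesses(accesses)
print(firsts)
```

Each access returned here is a potential starting point for a pack of consecutive loads or stores.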

Vectorization begins by packing consecutive stores, starting from a First Store. The decision of starting with stores instead of loads is based on the observation that a given kind of operation always has the same number of predecessors, e.g. all the additions always have two predecessors, whereas the number of successors may vary depending on how many instructions consume the result. Consequently, following a bottom-up approach results in a more structured tree traversal than a top-down approach.

Once a pack of stores is created, their predecessors are packed, before packing other stores, if they satisfy the packing conditions. Moreover, if the last store in the pack has a next adjacent store, it is marked as First Store so that a new pack can start from it.

Once all the stores are packed and their predecessor/successors chains have been followed, we check for remaining load instructions that satisfy the packing conditions and pack them in the same way as stores.

Vectorization starting from adjacent loads/stores has an obvious limitation: if a superblock does not have any consecutive loads/stores, nothing can be vectorized. To tackle this problem, after packing all loads/stores and their predecessors/successors, we check whether there are still arithmetic instructions that can be packed together. If so, we vectorize them and follow their predecessor/successor trees. This allows loops with interleaved memory accesses to be partially vectorized.

While traversing the predecessor/successor chains, if we find out that the predecessors of a pack cannot be vectorized, a Pack instruction is generated. This Pack instruction collects the results of all the predecessors into a single vector register and feeds the current pack. Similarly, if all the successors of a pack cannot be vectorized, an Unpack instruction is generated. This Unpack instruction distributes the result of the pack to the scalar successor instructions. For example, in the case of loops with interleaved memory access, when we reach several load instructions while traversing the tree, we find out that they cannot be packed since they are not consecutive. Therefore, we leave them in scalar form and assemble their results using a Pack instruction.
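The role of the Pack and Unpack instructions in the interleaved-access example can be illustrated functionally. This sketch models only the data movement, with vector registers as Python lists; the helper names `pack`, `unpack` and `vmul`, the stride of two, and the multiply operation are assumptions chosen for illustration.

```python
def pack(*scalars):
    """Pack: collect scalar results into one vector register."""
    return list(scalars)


def unpack(vector):
    """Unpack: distribute a vector result to scalar successors."""
    return tuple(vector)


def vmul(a, b):
    """A vector instruction operating lane by lane."""
    return [x * y for x, y in zip(a, b)]


memory = [10, 11, 12, 13, 14, 15, 16, 17]

# Interleaved (stride-2) loads are not consecutive, so they stay scalar.
loaded = [memory[i] for i in (0, 2, 4, 6)]

# A Pack feeds the vectorized multiply; an Unpack feeds scalar consumers.
scaled = vmul(pack(*loaded), [2, 2, 2, 2])
r0, r1, r2, r3 = unpack(scaled)
print(scaled)
```

The arithmetic is still executed as one vector instruction even though the loads feeding it could not be packed.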

5. Variable Length Vectorization

As shown in Fig. 3 in Section 3, the dynamic instruction stream coverage for vectorization reduces at higher vector lengths. We observe that the reason for this behavior lies in the way the vector instructions in SIMD architectures function. Vector instructions in the current SIMD architectures, such as ADDPS in Intel SSE, VADD in ARM Neon and VADDFP in PowerPC Altivec, operate on all the vector lanes and not on a subset of them. For this reason, compilers generate a vector instruction only when there are enough independent operations to fill the vector path. When there are not enough instructions to fill up the vector path, all the instructions are left in scalar form. This will become an important issue in future microprocessors with wider vector paths, where a lot of otherwise vectorizable code will be left unvectorized. We propose Variable Length Vectorization (VLV), a speculative dynamic iterative vectorization technique that targets a flexible SIMD architecture for optimal vectorization of data parallel applications.

VLV targets a SIMD architecture with vector instructions that can operate on all or any subset of vector lanes. Since the vector instructions can operate on any number of vector lanes, we need a way to tell the SIMD accelerator which vector lanes to enable and which to disable. We make use of mask registers for this purpose. A mask register has one bit per vector lane. A bit set to one signifies that the corresponding vector lane is to be enabled; a zero means it is disabled. The mask register is included in the instruction encoding in addition to the regular source and destination registers.

An important factor to consider here is the need for masking. Masking is used to disable unused vector lanes when a vector instruction does not use all the lanes. In general, not masking the unused lanes might work well for arithmetic instructions from the functionality point of view. However, performing unnecessary operations in the unused lanes might also generate false exceptions, like divide by zero. Therefore, we would need a way to distinguish real and false exceptions. Furthermore, for memory access instructions this might result in crossing array boundaries, leading to page/segmentation faults. Also, for store instructions it would result in writing incorrect data to the memory. Moreover, the register file will contain invalid data because the whole destination register will be written. As a result, we would need a way to distinguish between invalid and valid data in the register file. Mixing the architectural state and temporary values is typically not a good idea. On the other hand, masking the unused lanes helps us get rid of all these problems.

From the implementation perspective, we do not really need to have real mask registers in the hardware. Since we need to enable only consecutive lower order vector lanes, the number of lanes to be activated can be encoded directly in the instruction encoding. This also saves the extra instructions otherwise needed to write the mask into the registers. It is important to note that traditional vector processors support variable vector length through a vector length register, which needs to be set to the desired vector length before executing vector instructions. However, this is not an optimal solution for processors targeting general purpose applications, where the vector length needs to be changed frequently. In this scenario, the overhead of writing the vector length register would affect the performance severely, as will be shown in Section 7. Therefore, instead of having a vector length register, we propose Variable Length Vectorization using masked vector instructions.

For the execution of a vector instruction, the hardware now reads not only the source registers but also a mask to enable only the required vector lanes. The example in Fig. 5 shows the execution of a vector instruction that needs only two of the four vector lanes. As shown in the figure, only two of the four vector lanes are activated. Not activating all the vector lanes for every vector instruction is also important from the power consumption point of view.
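The effect of masking on false exceptions can be illustrated with a small software model. This is only a sketch under stated assumptions: the function name `masked_vop` is hypothetical, bit i of the mask is assumed to control lane i, and disabled lanes are assumed to leave the destination element unchanged.

```python
import operator


def masked_vop(op, dst, src1, src2, mask):
    """Apply `op` lane by lane; lane i is computed only if bit i of `mask`
    is set, otherwise the old destination value is kept (assumption)."""
    return [op(a, b) if (mask >> lane) & 1 else old
            for lane, (old, a, b) in enumerate(zip(dst, src1, src2))]


dst = [0.0, 0.0, 0.0, 0.0]
a = [8.0, 9.0, 0.0, 0.0]   # only the two lower lanes hold real operands
b = [2.0, 3.0, 0.0, 0.0]

# Only the two lower lanes are enabled, so the unused lanes never divide.
quot = masked_vop(operator.truediv, dst, a, b, 0b0011)
print(quot)

# Enabling all lanes would evaluate 0.0 / 0.0 in the unused lanes and
# raise a (false) divide-by-zero exception.
try:
    masked_vop(operator.truediv, dst, a, b, 0b1111)
    false_exception = False
except ZeroDivisionError:
    false_exception = True
print(false_exception)
```

The same model shows why the destination register stays clean: the disabled lanes are never written.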
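Because only consecutive low-order lanes are enabled, a lane count carries the same information as a mask. The sketch below illustrates this equivalence; the helper names and the choice to allow a count of zero are assumptions made for the example and are not part of the proposed encoding.

```python
def lanes_to_mask(active_lanes, total_lanes):
    """Expand a lane count into a mask with ones in the low-order positions."""
    if not 0 <= active_lanes <= total_lanes:
        raise ValueError("active_lanes out of range")
    return (1 << active_lanes) - 1


def count_field_bits(total_lanes):
    """Bits needed to encode any count from 0 to total_lanes (assumption:
    a count of zero is representable)."""
    return total_lanes.bit_length()


# 128-bit vector, 32-bit elements: 4 lanes, 2 of them active.
print(bin(lanes_to_mask(2, 4)))

# 512-bit vector, 32-bit elements: 16 lanes.
print(count_field_bits(16), "bits for a count versus 16 bits for a full mask")
```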

Figure 5. Masked Vector Instruction Execution

5.1. Code Generation

We modify our baseline speculative dynamic vectorization algorithm (Kumar et al., 2013), briefly explained in Section 4, to generate vector code for a variable vector length SIMD ISA. The modified algorithm starts by vectorizing for the given maximum vector length, which we call the physical vector length. Once all the possible packs for the physical vector length have been created, the vectorizer reduces the logical vector length iteratively. At lower logical vector lengths, packs are created with fewer scalar instructions than required to fill the vector path. The unfilled positions in a pack are treated as no-operations.

Fig. 6 shows a simple vectorization example using the proposed VLV algorithm. Fig. 6(a) shows unvectorized code having six independent single-precision floating-point (32-bit) addition instructions. For a vector length of 128-bits, we can pack a maximum of four single-precision floating-point additions in a single vector addition instruction. The algorithm first packs four of the six instructions in a vector instruction and assigns a mask with all ones to this instruction, as shown in Fig. 6(b). A mask with all ones signifies that all the vector lanes are to be enabled.

(a) Unvectorized code
(b) Vectorized code for fixed vector length of 128-bits
(c) Vectorized code with variable length vectorization
Figure 6. Variable Length Vectorization Example

A fixed vector length vectorization algorithm will stop at this point, since there are just two ADDSS instructions left and at least four are required to generate a vector instruction. However, the VLV algorithm continues and packs the remaining two addition instructions as shown in Fig. 6(c). Moreover, a mask with ones only at the lowest two positions is assigned to this instruction. It makes sure that only the two lower vector lanes are enabled during the execution of this vector instruction, as shown in Fig. 5.

Variable Length Vectorization helps in vectorizing applications that have loops with a lower iteration count than required by the vector length, as well as straight-line code with few independent scalar operations.

The VLV algorithm is fairly simple to extend to compilers for loops with a static trip count; however, for loops whose trip count is unknown at compile time it becomes tricky. For a fixed vector length, the compiler can vectorize such loops by unrolling them enough times to fill the vector path and placing a runtime check before the vectorized version to decide whether to execute it or not. However, for variable length vectorization, choosing a single unroll factor becomes difficult at compile time. The runtime information about program behavior available in HW/SW co-designed processors makes it straightforward to choose the correct unroll factor for VLV.
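The six-addition example can be reproduced with a short sketch of the iterative packing loop. The function `vlv_pack` is hypothetical; it assumes that all operations are independent and of the same kind, that the logical vector length is reduced one lane at a time (the text only says "iteratively"), and that a pack needs at least two instructions.

```python
def vlv_pack(ops, physical_lanes, min_lanes=2):
    """Greedily pack independent, same-kind operations.

    Packing starts at the physical vector length and then retries with a
    smaller logical vector length. Each pack is returned with a mask that
    enables only its low-order lanes.
    """
    packs = []
    remaining = list(ops)
    logical = physical_lanes
    while logical >= min_lanes:
        while len(remaining) >= logical:
            group, remaining = remaining[:logical], remaining[logical:]
            packs.append((group, (1 << logical) - 1))
        logical -= 1  # assumption: shrink the logical length by one lane
    return packs, remaining


adds = ["add0", "add1", "add2", "add3", "add4", "add5"]

# Fixed-length vectorization: only full packs of four are allowed.
fixed_packs, fixed_scalar = vlv_pack(adds, 4, min_lanes=4)
print(fixed_packs, fixed_scalar)

# Variable length vectorization: the two leftover additions are packed too.
vlv_packs, vlv_scalar = vlv_pack(adds, 4)
for group, mask in vlv_packs:
    print(group, format(mask, "04b"))
```

With a fixed length, two additions stay scalar; with VLV they form a second pack whose mask is 0011.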

6. Selective Writing

This section presents the proposed Selective Writing (SWR) technique to reduce the number of permutation instructions at higher vector lengths. First, we present a technique to eliminate permutation instructions completely if the result of an instruction is read only by one instruction. Then, we present another technique to reduce the number of instructions required to pack N values from N-1 to N/2, if the values to be packed are in N different registers.

6.1. Eliminating Permutation using Selective Writing

If the producer instructions of a vector instruction cannot be vectorized, the results of these instructions have to be packed together before feeding the vector instruction. This is because the scalar instructions in the current SIMD architectures, such as ADDSS and MULSS in Intel SIMD extensions, write their results only to the lowest element of vector registers, whereas the vector instructions need them to be in a single vector register and in a particular order.

Fig. 7(a) shows a situation where producers of I7 (I0-I3) are not vectorized and their results are packed using a permutation instruction sequence (I4-I6). As shown in the figure, I0 to I3 write their results to the lowest elements of different vector registers. Then a sequence of three instructions, I4 to I6, is used to pack these results in a single vector register xmm3, before feeding it to the vector instruction I7.

(a) Traditional code sequence
(b) Proposed instruction sequence
Figure 7. Packing scalar instruction results for feeding a vector instruction

The scalar instructions in the proposed SIMD architecture can write their results to any element of a vector register, instead of always writing to the lowest element, thus getting rid of the permutation instructions. This is done by making the scalar instructions selectively write to the different elements of a vector register in the order they are needed by the vector instruction, as shown in Fig. 8. This way, we can avoid permutation instructions altogether. This kind of selective writing capability is already available in the memory access instruction set of current architectures. For example, INSERTPS in Intel SSE4.1 can be used to write a 32-bit value loaded from memory to any part of the destination register. We extend this capability to the arithmetic instruction set as well.

Figure 8. Functionality of the proposed arithmetic scalar instructions

In addition to carrying source and destination register numbers, all scalar arithmetic instructions also carry an immediate that specifies the element of the destination vector register to which the scalar result is to be written. If scalar instructions have written their results to a single vector register in the order in which they are needed by the vector instruction, the instruction sequence for packing these results is no longer needed, as shown in Fig. 7(b).

The limitation of the SWR scheme is that it works only as long as a scalar instruction has a single consumer. In the case of more than one consumer, we would not get the maximum benefit out of SWR. However, our analysis of SPECFP2006 shows that more than 70% of dynamic instructions have only one consumer.
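A simple software model makes the role of the immediate concrete. The function `addss_sw` below is a hypothetical stand-in for a selectively writing scalar addition; it assumes that the operands are read from the lowest element of each source register, as with existing scalar SIMD instructions, and that all other destination elements are preserved.

```python
def addss_sw(dst, src1, src2, elem):
    """Scalar add whose result is written to element `elem` of `dst`.

    Assumption: operands come from the lowest element of each source
    register; the remaining destination elements are left untouched.
    """
    out = list(dst)
    out[elem] = src1[0] + src2[0]
    return out


# Four unvectorized producers, each with its own pair of source registers.
operands = [
    ([1.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0]),
    ([3.0, 0.0, 0.0, 0.0], [4.0, 0.0, 0.0, 0.0]),
    ([5.0, 0.0, 0.0, 0.0], [6.0, 0.0, 0.0, 0.0]),
    ([7.0, 0.0, 0.0, 0.0], [8.0, 0.0, 0.0, 0.0]),
]

xmm3 = [0.0, 0.0, 0.0, 0.0]
for elem, (a, b) in enumerate(operands):
    xmm3 = addss_sw(xmm3, a, b, elem)  # the immediate selects the target element

print(xmm3)  # already in the layout the consuming vector instruction expects
```

In this model the four results land in one register in consumer order, so no separate packing sequence is needed before the vector instruction.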

The proposed scalar instructions can be viewed as an arithmetic operation followed by a shuffle. However, this does not affect the latency of these instructions, since the results can be forwarded as soon as the arithmetic operation is finished. As Fig. 9 shows, it requires only an additional input to the multiplexers, selecting input operands of the ALUs from the output of the first vector lane (which performs scalar operations). Consequently, forwarding the results of the first vector lane to any other vector lane provides the functionality of a shuffle operation.

Figure 9. Operand forwarding before shuffle

6.2. Reducing Permutations to Pack N Values

Current architectures provide vector instruction sets in which N-1 instructions are required to bring N values into a register. A typical instruction sequence to bring 4 values from different vector registers into a single vector register in the x86 architecture is shown in Fig. 10(a). The first two shuffle instructions bring values selected by the immediate into registers xmm1 and xmm3, respectively. Then a BLENDPS instruction is used to combine the results from xmm1 and xmm3 into xmm3.

(a) x86 instruction sequence
(b) Proposed instruction sequence
Figure 10. Instruction sequence for packing 4 values from different registers into a single register
Figure 11. Functionality of the proposed Pack instruction

One of the main factors that force this instruction count to be N-1 is that these instructions write to all the elements of the destination register. If it is possible to write only selected elements of the destination register, then this number can be brought down. In this case, the number of instructions required will depend upon the total number of different registers to be read and the number of registers that can be read by a single permutation instruction. In a case where we need to read N registers and the permutation instruction can read only two registers, we would need N/2 instructions to collect N values in a single register. If we support more input registers, the number of instructions required can be brought down further. Moreover, we need a mechanism to specify which elements of the source registers are to be read and which elements of the destination register are to be written.

Figure 12. Dynamic Instruction stream coverage at three vector lengths, baseline and with VLV

We propose to have a permutation instruction with the functionality shown in Fig. 11. The proposed instruction (PACKPS) has two input registers and a 16-bit immediate that tells which elements of the source and destination registers are to be accessed. The first four bits of the immediate [0:3] tell which element of the first source register is to be read and the next four bits [4:7] tell where it is to be written in the destination. Similarly, bits [8:11] tell which element of the second source register is to be written to the destination element selected by the bits [12:15]. Note that PACKPS is very similar to SHUFPS but with a bit more freedom in choosing the source element for each destination element. Therefore, we expect their latencies to be similar.

The instruction sequence replacing the x86 instruction sequence of Fig. 10(a) is shown in Fig. 10(b). In this case, we are able to reduce the number of instructions required to two. For higher vector lengths, where we need to get 8 and 16 values into a register, we need just 4 and 8 instructions, respectively, instead of the 7 and 15 instructions required by the original sequence. The downside of this scheme is that it requires N/2 instructions even if the values to be collected are in fewer than N registers. However, our experiments show that in SPECFP2006, on average, about 86% and 48% of permutations, for 256-bit and 512-bit vectors respectively, need to read N or N-1 registers to pack N values.
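The immediate layout described above can be modeled as follows. This is a sketch under assumptions: bit 0 is taken to be the least significant bit of the immediate, the destination is modeled as a separate register whose other elements are preserved, and `encode_imm` is a hypothetical helper that does not appear in the paper.

```python
def encode_imm(src1_elem, dst1_elem, src2_elem, dst2_elem):
    """Build the 16-bit immediate: bits [0:3], [4:7], [8:11], [12:15]."""
    return (src1_elem & 0xF) | ((dst1_elem & 0xF) << 4) \
        | ((src2_elem & 0xF) << 8) | ((dst2_elem & 0xF) << 12)


def packps(dst, src1, src2, imm16):
    """Copy one element from each source into selected destination elements."""
    s1 = imm16 & 0xF
    d1 = (imm16 >> 4) & 0xF
    s2 = (imm16 >> 8) & 0xF
    d2 = (imm16 >> 12) & 0xF
    out = list(dst)          # elements that are not selected are preserved
    out[d1] = src1[s1]
    out[d2] = src2[s2]
    return out


# Four values scattered over four registers, each in a different element.
r0 = [10.0, 0.0, 0.0, 0.0]
r1 = [0.0, 11.0, 0.0, 0.0]
r2 = [0.0, 0.0, 12.0, 0.0]
r3 = [0.0, 0.0, 0.0, 13.0]

packed = [0.0, 0.0, 0.0, 0.0]
packed = packps(packed, r0, r1, encode_imm(0, 0, 1, 1))
packed = packps(packed, r2, r3, encode_imm(2, 2, 3, 3))
print(packed)
```

Two PACKPS operations are enough to gather the four values in this model, because each one writes only the two selected destination elements.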
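The instruction counts quoted above follow directly from the two formulas. The sketch below tabulates them; rounding N/2 up for odd N is an assumption, since the text only discusses even N.

```python
def baseline_pack_instructions(n):
    """Instructions needed by full-width permutations: N - 1."""
    return n - 1


def packps_instructions(n):
    """Instructions needed when each PACKPS reads two registers: N / 2,
    rounded up for odd N (assumption)."""
    return (n + 1) // 2


for n in (2, 4, 8, 16):
    print(n, baseline_pack_instructions(n), packps_instructions(n))
```

For N = 2 both formulas give a single instruction, so the benefit only appears from N = 4 upwards.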

7. Performance Evaluation

7.1. Benchmarks

To measure the success of our proposals, we use a set of applications from SPECFP2006 (42) and Physicsbench (Yeh et al., 2007) benchmark suites. All the SPECFP2006 benchmarks used in our experiments employ 64-bit double precision floating point data types, except 435.gromacs, whereas benchmarks in Physicsbench operate on 32-bit single precision floating point values. All the benchmarks are compiled with gcc-4.5.3 with “-O3 -fomit-frame-pointer -ffast-math -mfpmath=sse -msse3” flags.

For SPECFP2006 we instrument the benchmarks, using PIN (Luk et al., 2005), to find the most frequently executing routines. Then we simulate one billion instructions starting from these routines. The benchmarks in Physicsbench are executed till completion.

Figure 13. Dynamic Instruction stream distribution for SPECFP2006: 128, 256 and 512-bit vector lengths without and with VLV

7.2. Experimental Framework

To evaluate our proposals, we use DARCO (Pavlou et al., 2011; Kumar et al., 2017), which is an infrastructure for evaluating HW/SW co-designed virtual machines. DARCO executes a guest x86 binary on a PowerPC-like RISC host architecture. Since DARCO emulates floating point code in software, we extended the infrastructure to add floating point scalar and vector operations. We implemented the dynamic vectorization algorithm in the TOL to provide vectorization support.

For our experiments, we extended the host architecture to support vector sizes of 128, 256 and 512 bits. Moreover, we consider only floating point operations for vectorization (because most SIMD optimizations tend to focus on them) and no integer operation is vectorized. Therefore, we show only the floating point instructions in the results presented.

7.3. Dynamic Instruction Stream Coverage

Fig. 12 shows the dynamic instruction stream coverage for three vector lengths, first without and then with Variable Length Vectorization (VLV). Coverage is at its maximum when the number of instructions required to create a pack is at its minimum, i.e. two instructions. At 128-bit vector length the maximum number of 64-bit double precision operations that can be packed together is two. Therefore, 128-bit vector length provides maximum coverage, even without VLV, for double precision operations. Since all the SPECFP2006 benchmarks primarily operate on double precision floating point variables, they have maximum coverage at 128-bits as shown in Fig. 12. For single precision floating point variables, Variable Length Vectorization helps increase coverage even at 128-bit vector length, as is evident from the figure for the Physicsbench benchmark suite and 435.gromacs.

For the vector lengths of 256 and 512 bits, the benchmarks can be divided into two categories. First, benchmarks like 454.calculix have maximum, or close to maximum, dynamic instruction stream coverage at higher vector lengths as well. The hottest loops of these benchmarks have enough iterations to fill the wider vector paths. Second, benchmarks like 436.cactusADM, 444.namd, and Physicsbench show a drastic reduction in coverage as the vector length increases, due to the lack of independent instructions to fill the wider paths. These benchmarks have loops with either few iterations or complex control flow. For example, the hottest loops in 410.bwaves iterate four times; therefore, it has maximum coverage at 256-bit vector length, but at 512-bit its coverage drops to zero. Benchmarks in Physicsbench have loops with complex control flow that cannot be unrolled. Moreover, the number of independent instructions in individual superblocks is not enough to fill the vector path. Thus, the dynamic instruction stream coverage reduces severely. Using VLV, we bring the coverage for these benchmarks to the maximum as well, as shown in Fig. 12.
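The coverage behavior of a four-iteration double precision loop can be checked with a small model. The functions below are hypothetical; they assume one independent operation per loop iteration, a fixed-length vectorizer that requires a full pack, and a VLV vectorizer that accepts any pack of at least two operations.

```python
def lanes(vector_bits, element_bits):
    """Number of elements that fit in one vector register."""
    return vector_bits // element_bits


def is_vectorized(independent_ops, vector_bits, element_bits, vlv):
    """Fixed-length vectorization needs a full pack; VLV needs only two ops."""
    required = 2 if vlv else lanes(vector_bits, element_bits)
    return independent_ops >= required


# A hot loop with four iterations on 64-bit doubles.
coverage = {
    bits: (is_vectorized(4, bits, 64, vlv=False), is_vectorized(4, bits, 64, vlv=True))
    for bits in (128, 256, 512)
}
for bits, (baseline, with_vlv) in coverage.items():
    print(bits, "baseline:", baseline, "VLV:", with_vlv)
```

Under these assumptions the loop is vectorized at 128 and 256 bits in both modes, but at 512 bits only VLV can vectorize it, since eight doubles would be needed to fill the vector path.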

Figure 14. Number of Permutation Instructions per vector instruction, baseline and with SWR
Figure 15. Dynamic Instruction stream distribution for SPECFP2006: 128, 256 and 512-bit vector lengths without and with SWR

7.4. Dynamic Instruction Stream Distribution with VLV

This section shows that even though VLV increases the dynamic instruction stream coverage, by itself it does not provide much benefit in terms of overall dynamic instruction reduction because of a corresponding increase in permutations. Fig. 13 presents the dynamic instruction stream distribution for SPECFP2006 for 128, 256 and 512-bit vector lengths, first without VLV (called baseline in the figure) and then with VLV. The results shown are normalized to the no-vectorization case. The dynamic instruction stream is divided into: Scalar and Vector instructions, Pack/Unpack instructions (as described in Section 4.2), and unvectorizable instructions (e.g. we do not vectorize conversions).

On average, the number of scalar instructions increases with vector length without VLV, as shown by the 128, 256 and 512-bit baseline cases. Scalar instructions constitute 31% of the overall dynamic instruction stream for SPECFP2006 at 128-bit vector length without VLV. However, this number increases to 41% and 52% at 256 and 512-bit without VLV. It is because of this increase in scalar instructions (or the corresponding decrease in dynamic instruction stream coverage) that we do not get any reduction in the overall dynamic instruction stream at higher vector lengths. VLV, on the other hand, reduces the scalar instructions in the dynamic instruction stream by extracting additional vectorization opportunities. As shown in Fig. 13, VLV brings the scalar instructions down to 28%, from 41% and 52%, at 256 and 512-bit vector lengths.

Even though VLV increases the number of dynamic instructions vectorized, the overall reduction in the dynamic instruction stream is only marginal, as is evident from Fig. 13. This is because the increased number of vectorized instructions comes at the cost of an increase in permutations. Therefore, we need a way to keep the permutation instructions to a minimum. We use Selective Writing (SWR) as a means to that end and evaluate it next.

For Physicsbench, VLV by itself is able to provide significant dynamic instruction stream reduction with minimal increase in permutations. Therefore, we do not show results for it.

7.5. Permutation Reduction

Fig. 14 shows the number of permutation instructions per vector instruction required at three vector lengths without and with Selective Writing (SWR). Again, we have the same two categories of benchmarks as for the dynamic instruction stream coverage. Benchmarks like 434.zeusmp, 459.GemsFDTD, and Physicsbench have essentially the same number of permutation instructions across all the vector lengths. Packing the instructions from different iterations of unrolled loops avoids the generation of permutation instructions in the case of 434.zeusmp and 459.GemsFDTD. Physicsbench, however, has very few permutations since we fail to vectorize anything. In contrast, 433.milc, 436.cactusADM and 444.namd show an increase in the permutation instructions at higher vector lengths. Complex control flow and fewer loop iterations force us to vectorize straight-line code, which requires a higher number of permutation instructions. SWR helps in eliminating a significant number of permutation instructions for these benchmarks.

Another point to notice in Fig. 14 is that for 128-bit vector length there is negligible reduction in permutation instructions. This is because we need to pack two double precision values in a 128-bit register and, for N=2, N/2 and N-1 are the same. Therefore, we do not get much benefit. However, on average we reduce the number of permutation instructions required by half.

7.6. Dynamic Instruction Stream Distribution with SWR

This section shows that even though SWR is effective in keeping the permutation instructions to a minimum, it is also unable, by itself, to provide significant overall dynamic instruction reduction. Fig. 15 presents the dynamic instruction stream distribution for SPECFP2006 for 128, 256 and 512-bit vector lengths, first without SWR (called baseline in the figure) and then with SWR. The results shown are again normalized to the no-vectorization case. The dynamic instruction stream is again divided into: Scalar and Vector instructions, Pack/Unpack instructions and unvectorizable instructions.

SWR achieves significant permutation reduction, as shown in Fig. 15, especially for the 433.milc, 436.cactusADM and 470.lbm benchmarks. For other benchmarks like 410.bwaves, 434.zeusmp and 437.leslie3d, permutation instructions are not significant, either because of the small number of vectorized instructions due to low coverage or because the benchmarks have enough parallelism at higher vector lengths as well. Even though SWR is effective in keeping the permutations to a minimum, it cannot provide significant dynamic instruction reduction if the vectorizer is not able to vectorize most of the code, as shown in Fig. 15.

Therefore, neither VLV nor SWR by itself is able to achieve significant dynamic instruction stream reductions at higher vector lengths. However, when combined, they do reduce the dynamic instruction stream substantially, as shown in the next section.

7.7. Putting Everything Together

Fig. 16 shows the percentage of dynamic instructions after vectorization without and with VLV-SWR. As shown in this figure, after applying both optimizations, all the applications perform better as the vector length is increased. Applications like 433.milc, 436.cactusADM, 470.lbm, and Physicsbench, which earlier got worse at higher vector lengths compared to the 128-bit vector length, now perform better. On average, VLV-SWR helps eliminate 9% and 16% more dynamic instructions compared to the baseline vectorization, at 256-bit and 512-bit vector lengths respectively, for SPECFP2006. Overall, vectorization with VLV-SWR reduces the unvectorized dynamic instruction stream by 15%, 27%, and 31% for 128-bit, 256-bit, and 512-bit vector lengths respectively. For Physicsbench, we eliminate 40% more instructions compared to both baseline vectorization and unvectorized code, at 256-bit and 512-bit vector lengths, with VLV-SWR. Baseline vectorization does not find any vectorization opportunity at higher vector lengths for Physicsbench.

Figure 16. Dynamic Instruction Percentage after baseline and VLV-SWR vectorizations

As Fig. 16 shows, the percentage of reduced instructions is the same for 256-bit and 512-bit vector lengths in the case of Physicsbench and 410.bwaves. The lack of independent instructions at 512-bit vector length forces VLV to vectorize the code the same way as for 256-bit vector length. However, the important point to notice is that we still have more instruction reduction than in the 128-bit case, which was not possible without VLV.

7.8. Vector Length Register vs VLV

Traditional vector processors used a special register, called the vector length register, to choose the number of vector lanes to be enabled. This register needs to be written every time a vector instruction needs a different number of lanes than the vector instruction immediately preceding it. This section shows why a vector length register is not an optimal solution in SIMD accelerators for dynamically varying the logical vector length. Fig. 17 shows the average number of dynamic vector instructions executed before a vector instruction requiring a different number of vector lanes is encountered. In other words, the figure shows how frequently the vector length register would need to be written had we used it instead of the proposed VLV.

Figure 17. Average number of consecutive dynamic vector instructions with same vector length in a 512-bit wide vector unit with VLV-SWR

As the figure shows, a hypothetical vector length register would need to be written very frequently for most of the benchmarks. For example, for 433.milc, 436.cactusADM and 470.lbm it would be written after executing only two vector instructions. Although there are a few benchmarks like 410.bwaves, 454.calculix and 482.sphinx3 where writes to the vector length register are quite rare, for the majority of the benchmarks it would need to be written very frequently. Traditional vector processors could use a vector length register because they specifically targeted heavily data parallel applications.

The extra instructions to write the vector length register would severely affect the performance benefits of vector execution. Therefore, VLV chooses to encode the number of vector lanes to be enabled in the instruction encoding rather than using a vector length register.
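The metric plotted in Fig. 17 and the resulting register-write overhead can be computed from a trace of per-instruction lane counts. The sketch below uses a made-up trace purely for illustration (it is not measured data), and the function names are hypothetical.

```python
from itertools import groupby


def average_run_length(lane_counts):
    """Average number of consecutive vector instructions with the same
    number of active lanes."""
    runs = [len(list(group)) for _, group in groupby(lane_counts)]
    return sum(runs) / len(runs) if runs else 0.0


def vector_length_register_writes(lane_counts):
    """Writes needed with a vector length register: one per run, counting
    the initial setting."""
    return sum(1 for _ in groupby(lane_counts))


# Hypothetical trace: active lanes of eight consecutive vector instructions.
trace = [8, 8, 4, 4, 8, 8, 2, 2]

print(average_run_length(trace))
print(vector_length_register_writes(trace))
```

With an average run length of two, this trace would need one register write for every two vector instructions, whereas encoding the lane count in each instruction needs none.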

7.9. Performance

We model a simple in-order processor, in congruence with the simple hardware design philosophy of the co-designed processors, with issue width of two. Microarchitectural parameters are shown in Table 1.

Parameter: Value
L1 I-cache: 64KB, 4-way set associative, 64-byte line, 1 cycle hit, LRU
L1 D-cache: 64KB, 4-way set associative, 64-byte line, 1 cycle hit, LRU
Unified L2 cache: 512KB, 8-way set associative, 64-byte line, 6 cycle hit, LRU
Scalar Functional Units (latency): 2 simple int (1), 2 int mul/div (3/10), 2 simple FP (2), 2 FP mul/div (4/20)
Vector Functional Units (latency): 1 simple int (1), 1 int mul/div (3/10), 1 simple FP (2), 1 FP mul/div (4/20)
Registers: 128 integer, 128 vector, 32 FP
Memory latency: 128 cycles
Table 1. Processor Microarchitectural Parameters
Figure 18. Execution time for baseline and VLV-SWR vectorizations normalized to unvectorized code execution time

Fig. 18 shows the percentage of execution time, at three vector lengths, after vectorization without and with VLV-SWR. On average, VLV-SWR provides 5% and 7% speed up over the baseline vectorization and 10% and 13% over the unvectorized code, for vector lengths of 256-bit and 512-bit respectively, for SPECFP2006. Similarly, for Physicsbench, we get a speed up of 10% with VLV-SWR over both unvectorized code and baseline vectorization.

There are several interesting points to note in Fig. 18. First, even though we have higher dynamic instruction elimination, e.g. 31% for SPECFP2006, the speed up we get is smaller, 13% for SPECFP2006 at 512-bit vector length. This is because only 39% of dynamic instructions are floating point in SPECFP2006, which limits the overall performance gain. Second, although the dynamic instruction reduction is larger for Physicsbench, 40% compared to 31% for SPECFP2006 at 512-bit vector length, SPECFP2006 shows more speed up, 13% compared to 10% for Physicsbench at 512-bit vector length. This is due to the fact that Physicsbench has a higher percentage of integer instructions than SPECFP2006.
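A back-of-the-envelope calculation shows how the floating point fraction bounds the overall gain. It rests on two assumptions that the text does not state explicitly: that the 31% reduction applies to the floating point instructions only, and that every instruction contributes equally to execution time.

```python
fp_fraction = 0.39    # share of floating point instructions in SPECFP2006
fp_reduction = 0.31   # reduction at 512-bit, assumed to apply to FP instructions only

# Fraction of the whole dynamic instruction stream that is eliminated.
overall_reduction = fp_fraction * fp_reduction
print(round(overall_reduction, 4))
```

Under these assumptions roughly 12% of all dynamic instructions are removed, which is of the same order as the reported 13% speed up.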

8. Related Work

Masked operations have been used in the past for vectorization of code with control flow. However, we use them in the absence of control flow to increase dynamic instruction stream coverage. J. Smith et al. (Smith et al., 2000) proposed masked operations as a means of adding support for conditional operations in vector instruction sets. J. Shin et al. (Shin et al., 2005) incorporated masked operations to vectorize loops with conditional flow in the Superword Level Parallelism approach. Larrabee also uses masked instructions to map scalar if-then-else control structures to the vector processing unit. All of these proposals execute both the if and else clauses and select the correct results based on the values in the mask registers. Our proposal, on the other hand, uses masked operations to increase the dynamic instruction stream coverage when there are not enough instructions to fill the wider vector paths.

A significant amount of work has been done on the optimal generation of permutation instructions. However, previous work does not show the effect of permutations at increasing vector lengths. A. Kudriavtsev et al. (Kudriavtsev and Kogge, 2005) show the relationship between operation grouping and permutation generation. They show that the ordering of individual operations in SIMD instructions affects the number of permutation instructions required. G. Ren et al. (Ren et al., 2006) presented an algorithm that converts all the permutations to a generic form. Then, permutations are propagated across statements and redundant permutations are eliminated. These solutions focus on reducing the number of permutations required, whereas our solution reduces the number of instructions for each permutation. L. Huang et al. (Huang et al., 2010) proposed a method to reduce the number of instructions for one permutation. Their system has a Permutation Vector Register File which provides implicit permutation capabilities. However, the permutation pattern has to be saved beforehand in a permutation register. Moreover, only the values from two consecutive registers can be permuted.

The proposal by M. Woh et al. (Woh et al., 2009) for supporting multiple SIMD widths is the closest to our proposal of Variable Length Vectorization. They proposed a configurable SIMD datapath that can be configured to process wide vectors or multiple narrow vectors. Unfortunately, details of their algorithm for vectorizing at multiple vector lengths are not provided.

Speculative dynamic vectorization, in itself, has not been widely studied in the literature. There have only been a few proposals, like Speculative Dynamic Vectorization (Pajuelo et al., 2002), Dynamic Vectorization in Trace Processors (Vajapeyam et al., 1999) and Liquid SIMD (Clark et al., 2007). None of them is in the context of HW/SW co-designed processors. A. Pajuelo et al. (Pajuelo et al., 2002) proposed to speculatively vectorize the instruction stream in the hardware for superscalar architectures. Their scheme prefetches data into the vector registers and speculatively manipulates it through arithmetic instructions. S. Vajapeyam et al. (Vajapeyam et al., 1999) build a large logical instruction window and convert repetitive dynamic instructions from different iterations of a loop into vector form. The whole loop is vectorized if all iterations of the loop have the same control flow. Liquid SIMD (Clark et al., 2007) decouples the SIMD accelerator implementation from the instruction set of the processor through compiler support and a hardware-based dynamic translator. The compiler passes hints to the dynamic translator, which can then retarget the vector code for different SIMD accelerators. Selective devectorization (Kumar et al., 2013, 2014) has also been explored to reduce the energy consumption of SIMD accelerators by keeping them power gated for longer intervals.

9. Conclusion

In this paper, we showed that widening SIMD accelerators does not improve performance for all applications. We identified two main problems that hurt the performance of naturally low vector length applications on wider SIMD units: reduced dynamic instruction stream coverage and a large number of permutation instructions.

We proposed a flexible SIMD architecture that allows vector instructions to operate on a variable number of lanes. Additionally, scalar instructions can selectively write to any element of a vector register, thus avoiding permutations. We also proposed Variable Length Vectorization and Selective Writing techniques to target the flexibility of the proposed SIMD architecture. Variable Length Vectorization vectorizes the code even when it cannot fill the wider vector path. Selective Writing allows writing to any particular element of a vector register, thus reducing permutations. Our experimental results show an average dynamic instruction elimination of 31% and 40% and an average speedup of 13% and 10% for SPECFP2006 and Physicsbench respectively, for 512-bit vector length, over the scalar baseline code.
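The two mechanisms summarized above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions, not the proposed hardware or its instruction set: the names `VectorRegister`, `selective_write` and `vadd` are hypothetical, the register is assumed to hold sixteen 32-bit elements (512 bits), and the active lanes of a variable-length operation are assumed to be the lowest-numbered ones.

```python
from dataclasses import dataclass, field
from typing import List

NUM_LANES = 16  # assumption: 512-bit register with 32-bit elements


@dataclass
class VectorRegister:
    lanes: List[float] = field(default_factory=lambda: [0.0] * NUM_LANES)


def selective_write(dst: VectorRegister, lane: int, value: float) -> None:
    """A scalar result lands directly in one lane; other lanes are untouched."""
    if not 0 <= lane < NUM_LANES:
        raise IndexError(f"lane {lane} out of range")
    dst.lanes[lane] = value


def vadd(dst: VectorRegister, a: VectorRegister, b: VectorRegister,
         active_lanes: int) -> None:
    """Vector add over the first `active_lanes` lanes only (assumed layout)."""
    if not 0 < active_lanes <= NUM_LANES:
        raise ValueError(f"active_lanes must be in 1..{NUM_LANES}")
    for i in range(active_lanes):
        dst.lanes[i] = a.lanes[i] + b.lanes[i]


# Only six independent scalar results exist, too few to fill 16 lanes.
a, b, out = VectorRegister(), VectorRegister(), VectorRegister()
for lane in range(6):
    # Each scalar producer writes its own lane; no pack or permute step.
    selective_write(a, lane, float(lane))
    selective_write(b, lane, 10.0)

vadd(out, a, b, active_lanes=6)  # vectorized despite the partial fill
print(out.lanes[:8])
```

In this model, code with only six vectorizable operations still uses the 512-bit path, and the operands reach their lanes without explicit permutation instructions, which corresponds to the two sources of underutilization discussed in the conclusion.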


References

  • P. D. Arcy and S. Beach (1999) StarCore SC140: a new DSP architecture for portable devices. In Wireless Symposium. Motorola. Cited by: §1.
  • M. Baron (2005) Cortex-A8: high speed, low power. In Microprocessor Report, 11(14), pp. 1–6. Cited by: §1.
  • A. J. C. Bik, M. Girkar, P. M. Grey, and X. Tian (2002) Automatic intra-register vectorization for the Intel architecture. Int. J. Parallel Program. 30 (2), pp. 65–98. ISSN 0885-7458. Cited by: §1.
  • A. Branković, K. Stavrou, E. Gibert, and A. González (2014) Warm-up simulation methodology for HW/SW co-designed processors. In Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO ’14, pp. 284–294. ISBN 978-1-4503-2670-4. Cited by: §2.
  • N. Clark, A. Hormati, S. Yehia, S. Mahlke, and K. Flautner (2007) Liquid SIMD: abstracting SIMD hardware using lightweight dynamic mapping. In High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on, pp. 216–227. Cited by: §8.
  • J. C. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler, A. Klaiber, and J. Mattson (2003) The Transmeta Code Morphing™ software: using speculation, recovery, and adaptive retranslation to address real-life challenges. In Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization, CGO ’03, pp. 15–24. ISBN 0-7695-1913-X. Cited by: §1, §2.2, §2.
  • K. Diefendorff, P.K. Dubey, R. Hochsprung, and H. Scale (2000) AltiVec extension to PowerPC accelerates media processing. Micro, IEEE 20 (2), pp. 85–95. ISSN 0272-1732. Cited by: §1.
  • [8] K. Ebcioğlu and E. R. Altman DAISY: dynamic compilation for 100. Cited by: §2.
  • L. Huang, L. Shen, Z. Wang, W. Shi, N. Xiao, and S. Ma (2010) SIF: overcoming the limitations of SIMD devices via implicit permutation. In High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on, pp. 1–12. ISSN 1530-0897. Cited by: §8.
  • [10] Intel AVX-512. Cited by: §1, §3.
  • [11] Intel MIC. Cited by: §1.
  • [12] Intel’s HW/SW co-designed processor project. Cited by: §2.
  • [13] Intel® 64 and IA-32 architectures software developer’s manual. Cited by: §1, §3.
  • J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy (2005) Introduction to the Cell multiprocessor. IBM J. Res. Dev. 49 (4/5), pp. 589–604. ISSN 0018-8646. Cited by: §1.
  • A. Klaiber (2000) The technology behind the Crusoe processors. White paper. Cited by: §2.
  • K. Krewell (2003) Transmeta gets more Efficeon. In Microprocessor Report, 17(10). Cited by: §1.
  • A. Kudriavtsev and P. Kogge (2005) Generation of permutations for SIMD processors. In Proceedings of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, LCTES ’05, pp. 147–156. ISBN 1-59593-018-3. Cited by: §8.
  • R. Kumar, J. Cano, A. Branković, D. Pavlou, K. Stavrou, E. Gibert, A. Martínez, and A. González (2017) HW/SW co-designed processors: challenges, design choices and a simulation infrastructure for evaluation. In 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 185–194. Cited by: §7.2.
  • R. Kumar, A. Martínez, and A. González (2012) Speculative dynamic vectorization for HW/SW codesigned processors. In 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 459–460. Cited by: §4.
  • R. Kumar, A. Martínez, and A. González (2013) Vectorizing for wider vector units in a HW/SW co-designed environment. In High Performance Computing and Communications (HPCC) 2013 IEEE International Conference on, pp. 518–525. Cited by: §1, §2.
  • R. Kumar, A. Martínez, and A. González (2013) Speculative dynamic vectorization to assist static vectorization in a HW/SW co-designed environment. In High Performance Computing (HiPC), 2013 20th International Conference on. Cited by: §2.2, §2, §3, §4, §5.1.
  • R. Kumar, A. Martínez, and A. González (2013) Dynamic selective devectorization for efficient power gating of SIMD units in a HW/SW co-designed environment. In 2013 25th International Symposium on Computer Architecture and High Performance Computing, pp. 81–88. Cited by: §8.
  • R. Kumar, A. Martínez, and A. González (2014) Efficient power gating of SIMD accelerators through dynamic selective devectorization in an HW/SW codesigned environment. ACM Trans. Archit. Code Optim. 11 (3), pp. 25:1–25:23. ISSN 1544-3566. Cited by: §2, §8.
  • R. Kumar, A. Martínez, and A. González (2016) Assisting static compiler vectorization with a speculative dynamic vectorizer in an HW/SW codesigned environment. ACM Trans. Comput. Syst. 33 (4). ISSN 0734-2071. Cited by: §4.
  • S. Larsen and S. Amarasinghe (2000) Exploiting superword level parallelism with multimedia instruction sets. In Proceedings of the ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation, PLDI ’00, pp. 145–156. ISBN 1-58113-199-2. Cited by: §1.
  • R. B. Lee (1996) Subword parallelism with MAX-2. IEEE Micro 16 (4), pp. 51–59. ISSN 0272-1732. Cited by: §1.
  • C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood (2005) Pin: building customized program analysis tools with dynamic instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’05. ISBN 1-59593-056-6. Cited by: §7.1.
  • M. Lupon, E. Gibert, G. Magklis, S. Samudrala, R. Martínez, K. Stavrou, and D. R. Ditzel (2014) Speculative hardware/software co-designed floating-point multiply-add fusion. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14. ISBN 978-1-4503-2305-5. Cited by: §2.
  • S. Maleki, Y. Gao, M. J. Garzarán, T. Wong, and D. A. Padua (2011) An evaluation of vectorizing compilers. In Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, PACT ’11, pp. 372–382. ISBN 978-0-7695-4566-0. Cited by: §2.2.
  • S. S. Muchnick (1997) Advanced compiler design & implementation. Morgan Kaufmann. Cited by: §4.1.1.
  • D. Naishlos (2004) Autovectorization in GCC. In The 2004 GCC Developers’ Summit, pp. 105–118. Cited by: §1.
  • N. Neelakantam, D. R. Ditzel, and C. Zilles (2010) A real system evaluation of hardware atomicity for software speculation. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural Support for Programming Languages and Operating Systems, ASPLOS XV, pp. 29–38. ISBN 978-1-60558-839-1. Cited by: §2.
  • A. Pajuelo, A. González, and M. Valero (2002) Speculative dynamic vectorization. In Computer Architecture, 2002. Proceedings. 29th Annual International Symposium on, pp. 271–280. ISSN 1063-6897. Cited by: §1, §8.
  • D. Pavlou, A. Branković, R. Kumar, M. Gregori, K. Stavrou, E. Gibert, and A. González (2011) DARCO: infrastructure for research on HW/SW co-designed virtual machines. In Proceedings of the 4th Workshop on Architectural and Microarchitectural Support for Binary Translation (AMAS-BT’11) at ISCA-38. Cited by: §7.2.
  • G. Ren, P. Wu, and D. Padua (2006) Optimizing data permutations for SIMD devices. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI ’06, pp. 118–131. ISBN 1-59593-320-4. Cited by: §8.
  • S. Sathaye, P. Ledak, J. Leblanc, S. Kosonocky, M. Gschwind, J. Fritts, A. Bright, E. Altman, and C. Agricola (1999) BOA: targeting multi-gigahertz with binary translation. In Proc. of the 1999 Workshop on Binary Translation, IEEE Computer Society Technical Committee on Computer Architecture Newsletter, pp. 2–11. Cited by: §1, §2.2, §2.
  • L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan (2008) Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph. 27 (3), pp. 18:1–18:15. ISSN 0730-0301. Cited by: §3.
  • J. Shin, M. Hall, and J. Chame (2005) Superword-level parallelism in the presence of control flow. In Proceedings of the International Symposium on Code Generation and Optimization, CGO ’05. ISBN 0-7695-2298-X. Cited by: §8.
  • J. E. Smith, G. Faanes, and R. Sugumar (2000) Vector instruction set support for conditional operations. In Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA ’00, pp. 260–269. ISBN 1-58113-232-8. Cited by: §8.
  • J. Smith and R. Nair (2005) Virtual machines: versatile platforms for systems and processes. Morgan Kaufmann Publishers Inc. ISBN 1558609105. Cited by: §2.
  • M. Sporny, G. Carper, and J. Turner (2002) The PlayStation 2 Linux kit handbook. Cited by: §1.
  • [42] Standard Performance Evaluation Corporation. SPEC CPU2006 benchmarks. Cited by: §7.1.
  • S. Vajapeyam, P. J. Joseph, and T. Mitra (1999) Dynamic vectorization: a mechanism for exploiting far-flung ILP in ordinary programs. In Proceedings of the 26th Annual International Symposium on Computer Architecture, pp. 16–27. Cited by: §8.
  • C. Wang, Y. Wu, and M. Cintra (2013) Acceldroid: co-designed acceleration of Android bytecode. In Code Generation and Optimization (CGO), 2013 IEEE/ACM International Symposium on. Cited by: §2.
  • M. Woh, S. Seo, S. Mahlke, T. Mudge, C. Chakrabarti, and K. Flautner (2009) AnySP: anytime anywhere anyway signal processing. In Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, pp. 128–139. ISBN 978-1-60558-526-0. Cited by: §8.
  • T. Y. Yeh, P. Faloutsos, S. J. Patel, and G. Reinman (2007) ParallAX: an architecture for real-time physics. In Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, pp. 232–243. ISBN 978-1-59593-706-3. Cited by: §7.1.