References
1 Introduction
The merge operation is widely used for a variety of applications, including in many popular sorting algorithms, such as mergesort and timsort [auger2015merge], as well as in hardware for database operations, including sortmerge joins [mergejoin]. It is thus desirable to accelerate, and recent research has proposed a variety of merge accelerators on fieldprogrammable gate arrays (FPGAs).
As frequency scaling has stopped being the primary method for achieving performance, the main way of achieving highthroughput/ highbandwidth in modern systems is now to increase the datapath width. This has influenced computer architecture in many aspects, such as with wider singleinstruction multipledata (SIMD) instructions on general purpose processors (CPUs). One application that can benefit from highthroughput is sorting, as more data are being able to be processed per cycle. This paper presents a highthroughput merger algorithm, that merges two arbitrarily long input lists with highthroughput, exporting elements per cycle, assuming the inputs are appropriately provided, such as through banked memories, reading up to elements from each of the two input lists per cycle.
The challenges for the merge operations on FPGAs have been the low clock frequency due to the feedback datapath being the critical path, and the high resource utilisation in some attempts to remove the feedback datapath.
At the time of writing, FLiMS uses the least number of comparators and pipeline stages among the latest mergers. It uses a modified version of the bitonic merge block, as found in bitonic sorters (see figure 3
), repurposed for performing 2way parallel merge for streaming data. All alternative designs require the two input sequences of the bitonic (partial) merger (or the oddeven mergebased equivalent) to be sorted. The
main idea is to relax this condition. This idea eliminates the need for expensive rotations before the inputs of the merger [pmt], or alternative workarounds involving redundant logic [mms, vms, ehms].Other novel aspects in this work include the distributed nature of the selector stage, which has better timing characteristics on FPGAs, without occupying additional pipeline stages from the merger logic. The optimisation for skewed datasets is more lightweight and scalable than a previous attempt [pmt], as it does not rely on barrel shifters. FLiMS also does not suffer from the tierecord challenge found in all other feedbackless designs [mms, vms, ehms, ehmsp], and a costly workaround is not deemed necessary. The regularity in the topology of the comparators is also found to help with SIMD implementations of highthroughput merging in CPUs.
The main contribution is the highlyefficient design of a highthroughput 2way merger. Three variations of FLiMS are presented to achieve stable merging, highthroughput on skewed datasets and efficient memory use for special cases. Other contributions include a systematic comparison with a variety of alternative approaches, and an evaluation on both FPGAs and modern CPUs with SIMD. Automated generator scripts^{1}^{1}1Source available: http://philippos.info/flimsj provide (a) the Verilog code for the FPGA implementation, as well as (b) C++ code with SIMD intrinsics for CPUs, for a userspecified degree of parallelism ().
This paper is an extension of “FLiMS: Fast Lightweight Merge Sorter” [flims] and is also partly based on the paper “An Adaptable HighThroughput FPGA Merge Sorter for Accelerating Database Analytics” [fsorter]. It extends the material from these two papers with more detailed proofs (section 5) and a thorough comparison with related work, such as providing exact values for the number of comparators each approach requires (section 6). Additional variations are presented to also support stable sorting and dequeuing whole rows (section 4). The experimental comparison is updated with implementations of the stateoftheart alternatives [ehms, ehmsp] on a highend FPGA (section 7), and the SIMDbased version of FLiMS is extended to implement full sorting on modern CPUs to compare with sorting library functions (section 8).
2 Background
On FPGAs, the available sorting accelerators are inspired by a variety of different serial algorithms, including insertion sort [fpgasort, linear, fsorter]. However, this background section is restricted to highthroughput merge sorting on FPGAs. Merge sorting has been one of the most studied algorithms for sorting on FPGAs. This is due to its versatility, such as when reusing the same circuits to sort arbitrarily long input recursively.
2.1 Highthroughput merge sorters
Highthroughput merge sorters can merge a number of sorted lists simultaneously, while providing an output rate of more than 1 element per cycle. This can be achieved by building a merge tree (PMT [pmt]), mainly consisting of highthroughput mergers of 2 lists and FIFO queues, as with predecessors [casper, kobayashi2015face].
Figure 1 shows how these building blocks can be arranged to merge 8 sorted inputs of throughput 1, with an output rate 8. The ‘merge rate’ of the mergers in each level of the merge tree denotes the number of elements in their two inputs collectively and the number of elements in their output. This directly contributes to the throughput of the merger, as well as the bandwidth requirements of the proceeding and succeeding logic in the pipeline. For instance, a merger of rate 4:2 merges two inputs of width 2 (times the element width) and outputs two elements per cycle. The difference in widths from level to level is managed by rate converters and the appropriate stall signals.
One challenge in parallel merge trees is that, while they can easily saturate the available bandwidth by scaling the number of inputs, they do not scale well for high number of inputs. For this reason, manyleaf mergers have emerged, to support a higher number of inputs more efficiently (currently up to a few thousands [leaf, christopher] rather than in tens for PMTs). Merging many inputs simultaneously helps reduce the number of data passes required for complete sorting. A single data pass is equivalent to reading the entire input data once. However, manyleaf mergers are singlerate, meaning that they can only produce one output per cycle. If the data are not wide enough, this can lead to underutilisation of the available bandwidth.
To solve these tradeoffs, the hybrid parallel merge tree (HPMT) was introduced [fsorter], to enable both highthroughput and manyleaf merging at the same time. The size of the HPMT can be easily adjusted to saturate the bandwidth of the target architecture, while eliminating the number of passes of the data by still using manyleaf merging. Figure 2 shows how HPMT can combine 4 manyleaf mergers of K inputs (totalling 4 input lists) into a merge tree with an output rate of 4 elements per cycle.
2.2 Highthroughput 2way mergers
The highthroughput 2way mergers are the main building blocks of the aforementioned merge trees.
A merger for 2 alreadysorted sublists of fixed length can be modified to merge 2 lists of arbitrary length in streaming fashion. Then it can be used as a building block for a parallel merge tree to merge many lists concurrently.
Most of them are based on the two popular sorting networks: Batcher’s oddeven mergesort and the bitonic sorter [batcher]. These two sorting networks have the same number of stages and can be built hierarchically using 2 sorters of half the input and an appropriately sized merger to merge two equallysized sorted sublists. The merger part consumes the last stages in both sorting networks, where is the number of inputs. This merge block is optimised and/or combined with additional logic to work as a parallel merger for longer lists as streams.
Figure 3 shows the bitonic merger, as found in a bitonic sorter of 8 inputs. The pairs of circles are compareandswap (CAS) units, working as sorters of two inputs, i.e. . The list is initially unsorted, and right before the merger, it is partially sorted into two sorted sublists ( and ). The merger merges these two sublists, consuming the last 3 stages of this bitonic sorter.
The first known highthroughput merger for arbitrarily long input was based on a well known algorithm for merging using SIMD instructions on Intel processors [simd2008]. It was adopted for database use on FPGAs [casper]. However, the long feedback problem was more prominent on FPGAs [mms], since it can negatively impact the critical path, and is not scalable for many inputs. The algorithm goes as follows: starting with the first sized batches of each of the sorted sublists, the merger produces the top result as output to progressively merge the entire input, while the lower result is fed back into the lower of the input to continue the merging. A single comparison between the first element of each batch is enough to distinguish the next batch to dequeue and place at the merger. Figure 4 shows the highlevel representation of this approach for FPGAs.
Some 2way hardware mergers (including FLiMS) utilise an optimization of the bitonic merger, the 2to bitonic partial merger [farmahini2012modular], which outputs just the top half of the result (or lower if the CAS units are flipped). The bitonic partial merger is a subset of the bitonic merger. These are summarised in figure 3. This “pruned” merge block is combined with additional logic to work as a parallel merger for longer lists as streams.
In 2016, Song et al. built the parallel merge tree (PMT [pmt]) with 2to bitonic partial mergers. In figure 5, we can see a highlevel view of the merge block. This merger works as follows: two input queues, A and B, output 0 to elements each per cycle, according to how many made it in the last result of elements. This is known from just the first stage of the 2to bitonic partial merger and is used as a feedback to select the amount of elements to be dequeued from A and B. Since some elements remain from the previous cycle, each input of the bitonic partial merger block needs to be rotated by an offset equal to the number of dequeued elements (so far). This is done to ensure that the bitonic partial merger gets sorted inputs. However, the crossbars implementing the barrel shifters create a critical path that increases with , leading to low frequency designs and it does not scale well [mashimo2017high].
At some point, most of the attention was drawn on removing the expensive feedback length that existed in traditional merger designs [casper, pmt], that prevented scalability in terms of operating frequency for an increased degree of parallelism () (SHMS [mashimo2017high], MMS [mms], VMS [vms]).
In 2017, Mashimo et al. [mashimo2017high] proposed a lowerlatencyfeedback architecture, SHMS, to solve the long critical path problem with the previous approaches. While achieving much higher frequencies, as high as 3.14 times more than PMT [pmt] for 32 different input queues (also focusing on multiple inputs), the register utilisation was much higher (7.26 times more registers than PMT for 32 inputs). This does not scale well, not from the long feedback datapath length, but due to the high register utilisation.
Then, Saitoh et al. [mms] proposed a feedbackless architecture, MMS, to increase the performance and scalability of the merge operator. MMS uses two 2to bitonic partial merge blocks along with shift registers and an extra comparator and multiplexer. VMS [vms] is a variation of MMS that is based on oddeven mergers instead, but also focuses on improving the tierecord workaround (see section 6). In figure 6, we can see the highlevel view of these designs.
Finally, FLiMS [flims] and WMS/EHMS [ehms] offered further improvements by focusing on efficiency, for minimising the required hardware resources, usually with a subsequent improvement in operating frequency. Essentially, WMS is an optimisation of MMS [mms] (and VMS [vms]), because it fuses the two 2to partial mergers into one bigger merger block, and also eliminates the need for additional shift registers. Figure 7 shows how a single merger is used in WMS, closely resembling the other approaches in MMS, VMS and EHMS.
EHMSP [ehmsp], was then introduced as a potential successor to EHMS [ehms]. EHMS and EHMSP try to move some complexity to the selector stage for lower resource utilisation at the expense of a lower operating frequency. EHMSP specifically is not considered here for comparison, as its resource utilisation is close to EHMS, but with a further performance overhead due to the increased complexity of its selector stage, worsening its critical path [ehmsp]. Also, in contrast to the other highthroughput mergers, EHMSP is for values not in powers of 2, making it less versatile.
3 A novel 2way highthroughput merger
FLiMS is a novel 2way highthroughput merger that only uses a single 2to bitonic partial merger. It merges 2 sorted inputs with high throughput without the use of barrel shifters or shift registers.
In figure 8, we can see a highlevel visualisation of the proposed parallel merger. In contrast to previous works, FLiMS is shown to dequeue from the banks on an individual basis, rather than in batches of elements, due to the decentralised selector logic.
In figure 9, we see a lowerlevel representation, where the green circle pairs are the compareandswap units, with the exception of the first pipeline stage, which has pairs of one green and one grey circle, representing the MAX units (selector stage). If we ignore the modification of the first stage to MAX units, this topology is the bitonic partial merger [farmahini2012modular], and it would produce the top out of two sorted lists of elements, and is a subset of the bitonic merger (see figure 3).
The proposed parallel algorithm can be broken down into 3 segments: a selector stage for handling the input from multiple banks/queues, the pipelined bitonic partial merger (minus the first stage) and finally the output logic.
3.1 Selector stage
After the data are written into the BRAM banks (stored with a roundrobin priority), a set of independent entities (nodes) are responsible for controlling the input to the merger. These entities respond to the same clock and their behaviour can be described by a distributed algorithm.
Let and be the input FIFO queues containing the two sorted lists to be merged. Let , …, and , …, be the queues corresponding to the respective banks for A and B. Each of the entities , , …, have as input the pair of queue heads (, ), (, ), …, (, ) respectively, with being the head of and being the head of . Each of these entities outputs one number per cycle when both inputs are valid.
Each entity has the data registers and to store the last heads that were dequeued from banks and respectively. It also has a register for the sorting network input (). On each cycle with valid input, if , it means that will make it into the top in the result of the pipeline and therefore is copied into . In this case, is replaced by the head , which is dequeued from , but will remain unchanged, since it will need to be compared again in the next cycle, being in the lower . The equivalent logic goes for the case when .
In algorithm 1, we can see the pseudocode for the distributed algorithm. Collectively, this algorithm replaces the first stage of the partial bitonic merge (halfcleaner), with MAX units instead of CAS units. It selects the current top on each cycle and inserts them into pipeline registers for the rest of the CAS network to sort and produce the correct sized chunk of output.
There may be a need for some extra logic required to correctly handle the ending of the input queues, but it is omitted here for the sake of simplicity and portability to different architectures, as it is relatively trivial to construct. For example, when sorting natural numbers in descending order, the value 0 can be passed afterwards to handle the ending without additional dedicated logic.
Cycle  Input A  Input B  cA  cB  Output (after the pipeline delay) 
0  3 3 4 5 11 16 17 26 26 29  0 7 8 9 12 15 18 19 21 22  
1  3 3 4 5 11 16  0 7 8 9 12 15  17 26 26 29  22 21 19 18  22 26 26 29 
2  3 3 4  0 7 8 9 12  17 5 11 16  15 21 19 18  17 18 19 21 22 26 26 29 
3  3 3  0 7  4 5 11 16  15 12 9 8  11 12 15 16 17 18 19 21 22 26 26 29 
4  4 5 3 3  7 0 9 8  5 7 8 9 11 12 15 16 17 18 19 21 22 26 26 29  
5  4 3 3  0  0 3 3 4 5 7 8 9 11 12 15 16 17 18 19 21 22 26 26 29  
see http://philippos.info/sort_visual for an online visualisation. 
3.2 Cas network
The compareandswap (CAS) network of FLiMS is responsible for sorting the top result of the selector stage. It is a partial bitonic merger minus the first stage, or in other words, a butterfly network. It is not considered a sorting network on its own and does not sort arbitrary number sequences. For its input though, it behaves correctly and its output is always sorted (see proof 5.1).
3.3 Output logic
On each cycle, if the output of the partial bitonic merger is marked as valid, it is written down as a sized chunk of the result, such as in output banks containing implementing an output queue .
One observation is that when there is valid output per cycle, it produces exactly elements, as with other 2way mergers. This is useful for easing synchronisation when embedding into a merge tree [fsorter].
The sets of registers cA and cB are not visible in the first (selector) pipeline stage of figure 9. The notion of those registers is optional because they can also be considered the current sets of heads of the banked queues A and B. Though, it is sometimes convenient to use cA and cB, such as when the input queues are block RAM sections, where a read register is already present for reading each memory.
Table I presents an example execution for =4.
4 Additional functionality
This section presents variations of FLiMS, which can be used to increase its applicability or performance, according to the distribution of the data and the requirements of the sorting problem and platform.
4.1 Skewness optimisation
FLiMS can be used to build parallel merge trees that merge many input lists hierarchically in a single pass. Parallel merge trees can suffer from rate mismatch that occurs when the input data distribution leads to underutilisation of certain mergers, resulting in reduced throughput [pmt, leaf, christopher].
While the memory access throughput is a matter of the memory system, it might be allowed for the accelerator to receive the data from each of the input lists A and B with a fixed bandwidth, less than , such as in a PMT. A contributor to rate mismatch is when there are a lot of duplicates in the input (skewed datasets). When the data are skewed, the merger only dequeues from one of the input queues for long periods of time. This results in stalled cycles from underutilising the aggregate bandwidth of the queues.
PMT [pmt] proposes a simple solution which causes the merger blocks to fetch from both inputs at a similar rate when there are duplicates. This ensures that the input queues are consumed with a similar throughput, that collectively balances the utilisation of the merge tree. However, PMT’s mergers inherit the long feedback problem, which was addressed in subsequent works [mms, vms, flims, ehms].
We propose the equivalent optimisation for FLiMS [flims], while keeping the decentralised nature of the selector stage. On duplicates, there is an “oscillating” effect at the MAX units, which balances the dequeuing rate from the two groups of inputs. The code for the new selector units is illustrated in algorithm 2. An 1bit register called represents the input out of which the result was taken during the previous cycle, and is appended to the least significant bit in the comparison, to enforce a sort priority on equal values.
4.2 Stable merge
In contrast to the skewness optimisation, stable sort may be desired instead. Stable sort is when the sorted output has the same order for duplicate values as they appear in the input. For implementing a stable merge sort, FLiMS would also need to be stable, i.e. to prioritise the duplicates of input A over the ones from input B, and keep their original order inside A and B accordingly. Such a modification cannot coexist with the skewness optimisation, since the priority between the duplicates will be based on the input source.
Originally, FLiMS is not stable, as it is partlybased on the bitonic sorter, which is not stable. Temporarily appending the input source (1 bit) and the port number ( bits) to the MSB would be required to disambiguate between the original order of duplicates inside the CAS network. As the order of the inputs inside MAX units are naturally rotated by an offset, the port order is not enough to distinguish the order inside each batch containing duplicates. For this reason, a 2bit value needs to be carried between the input source and the port number, that keeps count of the batch order. A singlebit counter would not be enough for distinguishing which of the two compared entries came first. Algorithm 3 shows the modifications required on the MAX units to support stable merge.
Additionally, the CAS units also need to be modified to correctly prioritise the case where the 2bit order is “00” against “11”, as all other combinations (same values or other pairs having a difference of one) would correctly represent the original order priorities. The general idea of this approach is to emulate appending the original input order to the MSB of the data, but with a steady and low number of bits for merging arbitrarily long input. The order field can be seen as the last few bits of the input order.
4.3 Dequeuing whole rows (FLiMSj)
One potential advantage of the majority of the related work [casper, mms, vms, ehms, ehmsp] over FLiMS is that they dequeue whole rows of elements (or for EHMS [ehms]) from the inputs by default (see figure 7). This reduces the number of dequeue signals, and can also be more efficient in special cases, such as when reading narrow data from wider memories.
A relatively efficient buffering arrangement can unify the dequeue signals for FLiMS as well. This is possible because the FIFOs in FLiMS are collectively consumed in roundrobin fashion and at no point two FIFO indexes of the same input differ by more than one. This means that a set of registers for buffering the next queue heads is enough for dequeuing an element batch at the right time, while providing full bandwidth for the respective input. This can be achieved with a single set of registers for both inputs.
Figure 10 introduces the related modification to the MAX units using a merging example for . The general idea is that the top to elements can be stored in a set of registers () after every selection iteration (cycle), while maintaining their natural rotation order. This results in at least available elements per input, when combined with the current heads in and , while still eliminating the need for a rotation and a growing feedback. Algorithm 4 describes this approach in more detail.
5 Correctness
This section provides proofs on some nontrivial parts of our proposed techniques. The correct operation of FLiMS and its skewness optimisation can be proven by induction. FLiMS can be used for merging lists in descending order, as well as for merging ascending lists with minor modifications (reversing all comparators, and reversing the order in the stable version). Here, we study FLiMS designs with data in descending order, but the proofs can be easily adapted for the other case, without loss of generality.
5.1 Without additional functionality
In order to prove that the main design behaves as expected we will show that it is functionally equivalent to a more trivial merger implementation, where on each cycle the input comes sorted to a 2to bitonic partial merger, as with the merger used in PMT [pmt]. We will show that (1) the selector stage algorithm always selects the top out of the 2sized input and that (2) the butterfly network always sorts this top list before writing the result in output.
(1) We denote as the naturallyoccurring rotation offsets for input queues A and B respectively. Supposedly, the selector stage dequeues consecutive elements from each banked input. Since A and B are written in a round robin fashion inside the banked memory, different rotation offsets would be required to read a sorted set of elements from each set of FIFOs, similar to a design that rotates its inputs like PMT [pmt]. Note that the MAX units in FLiMS receive elements from the corresponding banks without performing any additional rotation.
Induction hypothesis: On each cycle, if the selector stage has worked correctly on the previous cycle, it will load elements from A and from B, collectively corresponding to the combined top out of the available elements.
Base step: At cycle 0, the parallel merger behaves in the same way as the 2to bitonic partial merger, because the inputs from each list are already sorted ( and ), corresponding to the first elements from A and first from B (stored in and registers respectively). The first stage of the 2to bitonic partial merger is known as a halfcleaner, and produces the top out of the inputs, in a bitonic sequence.
Induction step: Given that the selector stage worked correctly in the previous cycle, and are the amounts of sorted elements dequeued last from A and B respectively.
On each cycle, and will be updated according to the number of dequeued heads, as the starting positions shall succeed the dequeued elements. That is, for A, and for B, where , the offsets of the previous cycle. Therefore,
As cycle 0 assumes and (due to the input already being aligned correctly in the banks), and that the induction hypothesis is assumed correct for all previous cycles, it always holds that . Since , for every cycle.
In order for the selector stage to produce the top out of the current elements, the comparisons that need to be made are between all pairs in , where , …, are the current top elements in A and , …, are the current top elements in B, sorted in descending order. These comparisons are required in order to emulate the first stage (halfcleaner) of the bitonic partial merger.
The MAX units have the same topology as the bitonic (partial) merger (as seen in figure 3). We notice that whatever the rotation combination from and is, the correct comparisons will be made, as:
(2) The sorting network receives the correct top output from (1) and the task is to sort it. Originally, this structure is supposed to sort bitonic sequences of size [batcher]. The output of the distributed algorithm block is a rotated bitonic sequence, as we saw that the comparisons will be rotated by . A rotated bitonic sequence is also a bitonic sequence [zachmann2013adaptive], therefore the input for the sorting network has the correct property. This completes the proof.
5.2 Including the skewness optimisation
2way merger  Feedback  Latency  Number of comparators  H/W modules  Merger  Tierecord  
length  topology  challenge  
basic [casper, simd2008]  2wto2w merger  bitonic  no  
from PMT [pmt] 

bitonic  no  

MMS [mms]  1  +1 

bitonic  yes  
VMS [vms]  1  +1 

oddeven  yes  
WMS [ehms, ehmsp]  1 

oddeven  yes  
EHMS [ehms, ehmsp]  1 

oddeven  yes  
FLiMS [flims]  1  2wtow merger  bitonic  no  
FLiMSj  1  2wtow merger  bitonic  no 
To prove that FLiMS continues to sort correctly, the selector stage must be shown to still produce a bitonic sequence [flims] (up to one local maximum and up to one local minimum).
The bitonic sequence property needs to apply also on the order of each input when there are duplicates. Therefore, we need to show that there will still be up to one local maximum and up to one local minimum in the bitonic sequence, even though there might be multiple additional entries with a value equal to the minimum or maximum.
The original order between consecutive duplicates in the same input is used to correctly prioritise duplicates, as the input lists are considered already sorted. Being consistent about the original order ensures dequeuing consecutive entries from A and B, keeping the integrity of the input data, as the data are stored in roundrobin fashion in banks.
On each cycle, each units compares to , where is a rotation of by a common offset (), and and represent the remaining elements of the input queues in descending order. This emulates a halfcleaner that selects (and dequeues) a total of the greatest elements from a total of elements, generating a bitonic sequence.
Induction hypothesis: On each cycle, if the skewnessoptimised selector stage worked correctly on the previous cycle, it will produce a bitonic sequence from consecutive elements from A and consecutive elements from B, where .
Base step: At cycle 0, the common offset is zero (), as the inputs are properly aligned inside the input FIFOs, and no element has been dequeued yet. Each unit has an initial value of 0 stored in its register. This means that on the event of comparing duplicates, it will behave as in the nonoptimised case, where a single source (B) is preferred for duplicates, as the comparison of line 6 of algorithm 2 is now equivalent to line 5 of algorithm 1 ( “if then” ), which is already proven to produce a bitonic sequence from the proof of section 5.1.
Induction step: The skewness optimisation modification only takes effect where there are duplicates, i.e. . In such a case, we notice that this only happens consecutively (including wraparounds from the natural rotation) and for the minimum value of the output, as is monotonically decreasing and is monotonically increasing. As a consequence, the position of the minimum (split) in the bitonic sequence can be at the start, end or between this region of duplicates.
Given that FLiMS worked correctly on the last cycle, the registers correspond to the last halfcleaner decisions, which will be of the form , after considering the offset . Ones and zeros appear consecutively, since only consecutive elements are dequeued from each input list on each cycle. Also, the sequence of registers starts from 1, when we consider the current rotation offset. This is because the naturallyoccurring rotation offset is updated according to the last position dequeued from , the next of which corresponds to the first from list , that yields the “first” 1 in the sequence.
Therefore, the region of duplicates will be a sublist of the expression , with its 1s and 0s replaced by consecutive duplicates from and respectively. As a result, there will be up to one local minimum (split) in this region, and therefore up to one local minimum in the entire halfcleaner result, which consists a bitonic sequence.
6 Comparison with the related work
Table II compares FLiMS to the related work, according to different terms contributing to the resource utilisation and efficiency of the design. FLiMS uses the least amount of resources by only requiring a single 2to bitonic partial merger. It has a singlestage feedback latency, making it “feedbackless”, and has the least amount of latency, which is . FLiMSj of section 4.3 is also added to the table, even though its only modification is in its MAX units, which results in one more cycle of pipeline latency.
The first two entries [casper, pmt] have a feedback consuming multiple stages, such as from the additional stages required to implement the barrel shifters before the inputs in the mergers of PMT [pmt]. Hence, an increased number of inputs has a scalability problem, as a pipelined implementation of the feedback would reduce the throughput. Alternatively, squeezing the increasing logic into a single pipeline stage (consuming a single cycle) can heavily impact the operating frequency [mms].
MMS [mms] and VMS [vms] were the first solutions that provided a practical solution for the feedback problem with a relatively low resource utilisation. Their approach was to use either two 2to bitonic partial mergers (MMS) or two 2to oddeven mergers (VMS). Both of those topologies have pipeline stages and are relatively similar. They are from the last steps from the bitonic sorter and oddeven merge sort respectively [batcher].
WMS and EHMS [ehms] on the other hand, achieve to use a single feedbackless merger, as with FLiMS, but this merger is for double the inputs (plus optimisations), totalling to one more pipeline stage.
Figure 11 shows how a 4wto4w merger from oddeven mergesort is adopted to implement 3wto3w, by pruning CAS units. There is, though, additional pruning as the output is only elements. EHMS uses the same merger, but the first values of the input are not used, resulting in fewer comparisons. The resulting amount of comparators for each approach is shown in the table II, and the formulas mainly derive from Cullen numbers [cullen], and are validated by using yosys through synthesising the Verilog implementations of the evaluation section 7.
The pipeline length (latency) also impacts the resource utilisation, as certain values need to be propagated for longer through the pipeline registers. The merger of WMS and EHMS uses one more cycle than FLiMS, as it is an optimised merge block for double the inputs (as seen in figures 9 and 11). Note that WMS and EHMS propose an optimisation to reduce the pipeline registers by fusing some CAS units, but it is ignored in this comparison, as it can be explored separately for all mergers. All mergers other than FLiMS and in PMT [pmt], have a separate singlecycle selector stage, contributing to one more pipeline stage, while in FLiMS it is integrated in the modified bitonic partial merger.
One challenge with the mergers MMS [mms], VMS [vms], WMS and EHMS [ehms] is that, if there are duplicate values being compared, the output can be corrupted, also known as the tierecord issue. Specifically, this is a problem in keyvalue pairs, where only the key is compared, and the integrity of the values can be lost where there are duplicate keys. The available workarounds of the related works vary in complexity (not presented). In FLiMS, this is not the case, as the selector stage decides for the top result to propagate through the output immediately. In the other approaches, finding the top is done more indirectly, relying on two orders, and the problem arises due to the oddeven merge and bitonic sort topologies not implementing stable sort.
7 FPGA implementation
In order to evaluate FLiMS on FPGAs, we compare its resource utilisation and maximal operating frequency with the latest alternatives WMS and EHMS [ehms]. Our comparison includes FLiMSj for including a more direct competitor to the stateoftheart WMS and EHMS mergers, when dequeuing whole rows is required (see section 4.3). All generator scripts are implemented from scratch and produce Verilog code for each of the compared mergers for a given degree of parallelism and data width. This experiment uses 64bit mergers and targets the Xilinx Alveo U280 board.
The generated designs work as simple AXI peripherals, that read already sorted data stored in distributed memory (onchip) and write back also to distributed memory. The host places the sorted sublists and reads the merged result for validation purposes. For every different value of , the FIFOs are only 2 elements deep (totalling elements for input and output) to eliminate the differences between different merger designs to only their core logic. (Though, the bitstreams generated for validation had longer queues).
In order to simplify the comparison, the fusion of some pairs of compareandswap units (CAS) in WMS and EHMS [ehms] is not followed in this evaluation. This omitted optimisation could be explored separately, as with removing pipeline registers [pswitch], and does not directly relate to the main structure of the mergers. Additionally, the tierecord workarounds of WMS and EHMS have not been taken into consideration, even while FLiMS does not suffer from the tierecord issue. Thus, unique input values are assumed, such as with timestamp information inside the 64bit input, or no satellite/payload data (i.e. values in keyvalue pairs).
FLiMS  FLiMSj  WMS  EHMS  
kLUT  kFF  kLUT  kFF  kLUT  kFF  kLUT  kFF  
4  1.7  2.9  2.5  3.2  2.7  5.3  3.1  4.8 
8  3.6  6.3  5.1  6.8  5.6  11.0  6.2  10.3 
16  7.0  1.4  10.6  14.6  11.7  23.1  13.0  21.6 
32  15.4  29.0  20.9  31.2  23.5  48.3  26.7  45.3 
64  33.7  62.0  45.0  66.4  53.3  100.8  57.9  94.6 
128  73.4  132.2  96.1  140.8  106.6  209.8  120.4  197.5 
256  158.6  280.7  208.6  297.9  224.0  436.0  252.2  411.4 
512  345.3  594.0  436.2  628.4  473.0  904.7  525.3  855.6 
Table III includes the dataset on the obtained resource utilisation of lookuptables (LUTs) and flipflop registers (FF), as reported by Vivado 2020.1. Figure 12 is based on the same data and uses FLiMS as a baseline to emphasise on the overheads of the alternative approaches on resource utilisation. As a conclusion, FLiMS uses the least amount of LUTs and FFs, while WMS and EHMS only differ marginally, as they are based on a similar merger. As expected [ehmsp], between WMS and EHMS, WMS wins in LUT utilisation, while EHMS wins in FF utilisation. Using the current implementation as an AXI peripheral, FLiMS is roughly about 1.5 to 2 times more hardware resource efficient. FLiMSj has almost the same FF utilisation as FLiMS, though in terms of LUTs it is about 1.3x more expensive than FLiMS, but always more resource efficient than WMS and EHMS.
Finally, figure 13
presents the comparison of the obtained maximal operating frequencies through the reported worst negative slack (WNS). Most datapoints used the default Vivado 2020.1 settings, though additional directives such as aggressive explore were used on some outliers or nonroutable designs, especially for
. Having such irregularity or small variations in the results are expected, as placeandroute is heuristicbased and becomes more challenging for larger designs. For WMS with
, the additional tested directives did not help with routability.FLiMS has a considerable advantage over both WMS and EHMS, sometimes yielding more than double the operating frequency. FLiMSj has a small overhead over FLiMS, though WMS seems to marginally win for . WMS is known to be better performing than EHMS, at the expense of additional hardware resources, though the reduced resources help with routability and performance for high values of . FLiMS wins in both performance and resource utilisation by a great margin, while FLiMSj lays between FLiMS and the alternatives WMS/EHMS in most aspects to offer the additional functionality of section 4.3.
This evaluation focused more on the merging techniques, ignoring the building of merge trees or the handling of the list endings. The relevant paper [fsorter] elaborates on a complete sorter implementation based on FLiMS, with a highly competitive logic and time complexity combination over related work on complete sorting.
8 Software implementation using SIMD intrinsics
The goal of this section is to experimentally show that a singleinstruction multiple data (SIMD)accelerated merge sort function based on FLiMS can compete with alternative popular sorting functions based on different sorting algorithms. Today’s general purpose processors (CPUs) feature vector or SIMD instructions as a way to increase performance in numerous compute and memoryintensive applications. Parallel merging algorithms implemented using SIMDintrinsics, have already been shown to improve the sorting performance on CPUs.
Chhugani et al. [simd2008] used the rather simple merge algorithm based on the bitonic merger (see figure 4) to enable highthroughput merging on an older Intel processor. Since it uses a full (nonpartial) bitonic merger, both the lower and upper are used. As the upper is calculated after stages, this could have the feedback problem that is addressed by the latest research on FPGA merging (as summarised in table II). However, on CPUs this is not much of a concern, because the pipelining functionality is not that advanced to achieve efficient taskpipelining. For example, for a single layer of compareandswap (CAS) units, there need to be at least three SIMD intrinsics, one for , one for and at least one shuffle. The latter is to appropriately permute the inputs to emulate the CAS network topology.
In practice, how FLiMS could help is with a reduction in instruction count, as the lower of the result is not needed. In other words, with roughly a similar number of instructions, FLiMS can merge with double the amount of parallelism . Other desirable characteristics of FLiMS on SIMD are the bitonic merge topology, which is more regular/ symmetric when compared with the oddeven merge sort topology, and the elimination of the rotation of the inputs. A similar discussion can be made for comparing with the other merger alternatives of section 6, which require more comparisons and lengthier pipelines than FLiMS.
In order to assess the efficiency of FLiMS as an SIMD algorithm, a manually vectorised code in C++ is developed using Advanced Vector Extensions for 256bit registers (AVX2). It is then extended to perform full sorting, and is finally compared to other existing sorting functions.
8.1 Merge function implementation
An SIMDbased implementation of FLiMS can be split into two main parts; one for the MAX (selector) stage and one for the butterfly network. The MAX stage is responsible for fetching the next elements from the input lists, selecting the top and feeding it to the butterfly network. The butterfly network is more straightforward to implement using intrinsics, as it is a part of the bitonic sorter, which was already explored in a similar context [bramas].
The MAX stage can be implemented in SIMD by keeping the and in vector registers and generating the top by comparing them. There are two ways to implement its fetching functionality. The first is to keep and update independent pointers per input representing the input queues. Then mask/gather intrinsics are called to update the and values which are kept in vector registers. This is more faithful to the original FLiMS algorithm, but it is less efficient to use a gather AVX2 intrinsic [intel] for fetching otherwise continuous memory locations. The preferred faster method involves “pre”fetching sized batches of elements, which is reminiscent of FLiMSj of section 4.3.
One complication with fetching whole vectors for the MAX stage is that the vector needs to be in reverse order. This is done by the intrinsic _mm256_permutevar8x32_epi32() to reverse the contents of each batch fetched from list B. The comparison of the MAX units is done by the intrinsic _mm256_cmpgt_epi32() and the result boolean vector is also used to fetch the next heads selectively after blending _mm256_blendv_epi8() with the current set of heads. The result of the compare instruction _mm256_cmpgt_epi32() has the form of zeros and ones, each now denoting the source list of the next heads. The negation _mm256_andnot_si256() of this vector is also used to get the equivalent vector to be used when blending the next vector from the second input.
With respect to the butterfly network part, each layer of CAS units can be calculated by calling both a min (_mm256_min_epi32()) and a max (_mm256_max_epi32()) instruction consecutively, since each CAS has two outputs. Alternatively, a compare instruction can be used once _mm256_cmpgt_epi32(), but there is a performance overhead from the additional calls of the _mm256_blendv_epi8() instruction, required to translate the result vector of _mm256_cmpgt_epi32() into minimums and maximums. Before each CAS layer, the inputs must be properly aligned, and this is done through different permute intrinsics such as _mm256_shuffle_epi32() and _mm256_permute2x128_si256().
In principle, emulating a wider FLiMS, can result in increased data locality, as more computation is done in registers. On the other hand, more logic needs to be represented, and the instruction count of the loop increases, with the additional disadvantage of lessobvious data dependencies. Since the number of vector units are limited in processors, emulating more logic on the merge loop can trigger more cache accesses and worsen the performance.
Figure 14 shows how the degree of parallelism of the emulated FLiMS influences the achievable throughput on an AVX2native processor (Intel i78809G, with a steady 4.2 GHz clock). The C++ code for the 2way merge function is generated by a python script, and is compiled with and . Two sorted random inputs of elements are fed into the FLiMS merge function. The conclusion is that at and the throughput is the highest, and there is little variation between different compilers.
8.2 Complete sorting
The FLiMSbased CPU merge function can be used recursively to merge input of arbitrary length, by accelerating merge sort with SIMD instructions.
As a complement to the FLiMSbased merge function, a sortinchunks function is developed to facilitate the need for initial sorted chunks, as well as to provide longenough chunks for FLiMS to benefit from streaming access patterns. This function is based on the bitonic sorter, with a similar technique to section 8.1 for building the butterfly network. A similar approach was followed in an SIMDbased quicksort implementation [bramas], although here it is implemented from scratch with AVX2 intrinsics. In our use case, the optimal sorted chunk size is found to be 512 integers.
The performance of the FLiMSbased sort function is compared against the C++ Standard Library implementation of sorting std::sort(), as well as a highlyoptimised [bramas] SIMDbased radix sort implementation from Intel’s integrated performance primitives (IPP).
Additionally, a multithreaded version of the FLiMSbased SIMD sort function is implemented, with the help of OpenMP pragmas. Both the merging function and the sortinchunks function are unaltered. The parallelisation is done on the calls of each, operating on equallysized consecutive portions of the entire input, when possible.
The performance of the multithreaded FLiMSbased sort is compared against the singlethreaded baselines, as well as a parallel sort implementation in the Boost C++ libraries. The block_indirect_sort() function implements the samplesort sorting algorithm, and is regarded as one of the best performing C++ sort implementations [boost].
Figure 15 presents the results of both the singlethreaded and multithreaded experiments. The target processor is the 16thread Ryzen Pro 4750U. The main observation is that the 16thread FLiMS sort function surpasses the performance of the 16thread block_indirect_sort() for the input range from to . A hybrid approach can be used to enable the singlethreaded version of FLiMS for below to achieve the highestperformance overall, except radix sort.
Intel’s radix sort takes the lead between and on this AMD processor, but IPP radix sort has some notable limitations. These include the less predictable performance across different systems when run across different processors (not demonstrated, for brevity), a restrictive license and radixsort related implications, such as a limit of the input length to about for the tested implementation.
As a conclusion, FLiMSbased sorting is an attractive approach to accelerate merge sort using SIMD intrinsics. However, it is still relevant to research solutions based on a variety of sorting methods, as different algorithms are more appropriate in different distributions and use cases [auger2015merge]. For instance, radix sort can perform fewer data passes on data of a restricted range, such as for 10bit integers.
9 Future work
Current and future work includes developing efficient FLiMS implementations targeting other technologies such as GPUs, and exploring the adoption of FLiMS in various applications such as database analytics [fsorter, mergejoin] and beyond.
One concern in today’s FPGA research on sorting is that it is mostly limited to fixedwidth values [thiemj]. Therefore, it would be helpful to also study FLiMS adaptations or alternatives for data of arbitrary width, such as strings [asiatici2021many].
The SIMD evaluation could also be extended with additional optimisation for specific processors and data. This could also include the skewness and stable variations and their applicability in this context. An AVX512 version has already been developed, but for the target processor (Xeon 8124M) the performance benefits over AVX2 on the same processor were underwhelming. This possibly related to the efficiency of AVX512 in the specific microarchitecture, though it would be appropriate to further investigate the applicability of FLiMS in future processor technologies.
It would also be interesting to formally prove any optimalities FLiMS may exhibit, as well as to try to find equivalent circuits for merging more than two sorted input lists, for further reducing the size of parallel merge trees.
10 Conclusions
FLiMS is currently the most hardwareefficient 2way merge block on FPGAs. It can be used to build efficient highthroughput merge trees for facilitating sorting of unsorted input on hardware. It features fewer and/or simpler pipeline stages than the alternatives, while achieving a higher amount of parallelism with less hardware resource utilisation. The skewness optimisation is the equivalent workaround found in an older merger, while maintaining the decentralised nature of the MAX entities logic, which is collectively used as a scalable selector logic. A variation for implementing stable sort is also presented to facilitate the needs of some database applications, as well as FLiMSj for unifying the dequeue signals that can be costly in some memory configurations. An SIMD implementation of FLiMS and its multithreaded variant on a modern processor are also found to outperform popular highlyoptimised C++ sort libraries. Sorting using such a highthroughput merge block can be more appropriate for big data applications than alternative approaches, since it yields streaming memory access patterns and can also be applied recursively for arbitrarily long data without keeping growing states.
Acknowledgments
This research was sponsored by dunnhumby. The support of Microsoft and the United Kingdom EPSRC (grant number EP/L016796/1, EP/I012036/1, EP/L00058X/1, EP/N031768/1 and EP/K034448/1), European Union Horizon 2020 Research and Innovation Programme (grant number 671653) is gratefully acknowledged.
Philippos Papaphilippou got his PhD from Imperial College London. His PhD was funded by dunnhumby for researching novel accelerators to improve the performance of big data analytics. His research topics include FPGAs, sorting algorithms, network switches, multiprocessor architecture and data science. 
Wayne Luk is a professor of computer engineering at Imperial College London. He leads the Programming Languages and Systems Section, and the Custom Computing Research Group at the Department of Computing. He is a Fellow of the Royal Academy of Engineering, the IEEE, and the BCS. 
Chris Brooks is the Head of Science Innovation at dunnhumby, UK. He is accountable for researching and developing new science techniques and technical implementations. He leads science research into a variety of areas, focussed primarily on the retail domain. 